# COM4509/6509 - Lab 1: Probability

## Introduction to the Dataset

For this exercise we'll use the [Diabetes Health Indicators Dataset](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset): 253,680 Americans were called and asked about some of their health-related behaviours (eating fruit, etc.) and some health outcomes (whether they have diabetes, etc.).

We first need to import the modules we'll use, including [pandas](https://pandas.pydata.org), a data analysis and manipulation library for looking at this data. We then download the dataset and open it.

In [2]:
from itertools import count

import pandas as pd   #useful for data access and manipulation
import urllib.request #used to download the dataset
import matplotlib.pyplot as plt #useful for plotting
#ensure our plots appear in the notebook:
%matplotlib inline
import numpy as np    #numpy is useful for matrix/array/tensor manipulation


Now download and open the dataset:

In [3]:
urllib.request.urlretrieve('https://drive.google.com/u/0/uc?id=1dprY31miDsQSZZwMkOfHoqkH4TQ8gV2W&export=download', './diabetes.csv')
df = pd.read_csv('diabetes.csv')

We can look at the content. It is a large table containing 253,680 rows (each row is a person) and 22 columns of survey data, plus an `Unnamed: 0` column that just repeats the row number:

In [4]:
print("The columns in the dataset:")
print(df.columns)

The columns in the dataset:
Index(['Unnamed: 0', 'Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI',
       'Smoker', 'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits',
       'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost',
       'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age',
       'Education', 'Income'],
      dtype='object')


The dataframe itself:

In [5]:
df

Unnamed: 0.1,Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,62.0,4.0,3.0
1,1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,52.0,6.0,1.0
2,2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,62.0,4.0,8.0
3,3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,72.0,3.0,6.0
4,4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,72.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,42.0,6.0,7.0
253676,253676,2.0,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,72.0,2.0,4.0
253677,253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,27.0,5.0,2.0
253678,253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,52.0,5.0,1.0


To get summary statistics for all of these columns, use the `.describe()` method of the dataframe:

In [6]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,126839.5,0.296921,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,57.138127,5.050434,6.053875
std,73231.252481,0.69816,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,15.323466,0.985774,2.071148
min,0.0,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,21.0,1.0,1.0
25%,63419.75,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,47.0,4.0,5.0
50%,126839.5,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,57.0,5.0,7.0
75%,190259.25,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,67.0,6.0,8.0
max,253679.0,2.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,82.0,6.0,8.0


## Probabilities

We can estimate probabilities from this sample of the population.

A probability can be defined as the limit, as the number of trials $N$ grows, of the ratio between the number of times the outcome of interest occurs, $n_y$ (e.g. heads, if flipping a coin), and the number of trials:

$$P(Y = y) = \lim_{N \rightarrow \infty} \frac{n_y}{N}$$

If we only have a finite number of samples, we can assume for now that the ratio is still approximately correct:

$$P(Y = y) \approx \frac{n_y}{N}$$
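To get a feel for how good this approximation is, here is a minimal sketch that simulates coin flips in plain Python (no dataset needed). The function name `estimate_probability`, the seed and the trial counts are just choices made for this illustration.

```python
import random

def estimate_probability(n_trials, p_heads=0.5, seed=0):
    """Estimate P(heads) as n_heads / n_trials from simulated coin flips."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    n_heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return n_heads / n_trials

# The estimate tends to settle towards the true value (0.5) as N grows.
for n in (10, 1_000, 100_000):
    print(n, estimate_probability(n))
```

With only a handful of trials the ratio can be a long way from 0.5, which is worth keeping in mind later in the lab when we look at very small subsets of the data.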

### Using the dataset to find some probabilities

What is the probability a person in this dataset is under 50?

As with numpy arrays, we can apply a comparison to a whole column at once, which gives a boolean array saying which rows satisfy the condition:

In [7]:
df['Age']<50

0         False
1         False
2         False
3         False
4         False
          ...  
253675     True
253676    False
253677     True
253678    False
253679    False
Name: Age, Length: 253680, dtype: bool

If we ask for the mean, it will give us the proportion that are true:

In [9]:
P_AgeLessThan50 = np.mean(df['Age']<50) #we can use the mean as the comparison returns a boolean with False being treated as a zero, and a True as a one.
print("%0.1f%% of people in the dataset are <50 years old" % (100*P_AgeLessThan50))

29.3% of people in the dataset are <50 years old


What is the probability a person in the dataset is under 50 AND regularly eats vegetables? (JOINT PROBABILITY)

In [11]:
P_AgeLessThan50andVeg = np.mean((df['Age']<50) & (df['Veggies']==1)) #Veggies is stored as 0.0/1.0, so compare with 1 to get a boolean
print(P_AgeLessThan50andVeg)
print("%0.1f%% of people in the dataset are <50 years old AND eat vegetables." % (100*P_AgeLessThan50andVeg))

0.23927388836329233
23.9% of people in the dataset are <50 years old AND eat vegetables.


### Product Rule

What is the probability that they eat vegetables GIVEN they are under 50?

The product rule for probability we learnt was:

$$P(A,B) = P(A|B) P(B)$$

If we rearrange it we can find the conditional probability (please make sure you understand this step):

$$\frac{P(Veg=true, Age<50)}{P(Age<50)} = P(Veg=true\;|\;Age<50)$$

This is a conditional probability. Let's work it out:

In [12]:
#P(Veg|lt50) = P(Veg, lt50) / P(lt50)
print("Probability of eating vegetables GIVEN they are under 50: %0.1f%%" % (100*P_AgeLessThan50andVeg/P_AgeLessThan50))

Probability of eating vegetables GIVEN they are under 50: 81.8%


We can check this a different way, by picking out those who are under 50, and then looking at the proportion of those who eat vegetables:

In [13]:
dfAgeLessThan50 = df[df['Age']<50] #makes a new dataframe with just those under 50.

#of this dataframe, the proportion who eat veg:
print("Probability of eating vegetables GIVEN they are under 50: %0.1f%%" % (100*np.mean(dfAgeLessThan50['Veggies'])))

Probability of eating vegetables GIVEN they are under 50: 81.8%
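
The agreement between the two routes isn't a coincidence. Here is a small sketch on ten made-up people (the `people` list is hypothetical, not taken from the dataset) where the counts are small enough to check by hand.

```python
# Hypothetical toy records: (is_under_50, eats_veg) for ten made-up people.
people = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, True), (False, True),
    (False, False), (False, True),
]
N = len(people)

# Route 1: product rule, P(veg | under50) = P(veg, under50) / P(under50)
p_under50 = sum(under50 for under50, veg in people) / N
p_veg_and_under50 = sum(under50 and veg for under50, veg in people) / N
via_product_rule = p_veg_and_under50 / p_under50

# Route 2: keep only the under-50s, then take the proportion who eat veg
under50_only = [veg for under50, veg in people if under50]
via_subset = sum(under50_only) / len(under50_only)

print(via_product_rule, via_subset)
```

In both routes the total number of people cancels out, leaving (number under 50 who eat veg) / (number under 50), so the two values agree up to floating-point rounding.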


### Exercise 1: Fruit, Vegetables and Independence

We can check whether someone eats fruit regularly using `df['Fruits']==1`, and whether they eat vegetables regularly using `df['Veggies']==1`. Compute:

- a. the probability of eating fruit regularly.
- b. the probability of eating vegetables regularly.
- c. the probability of eating both fruit AND vegetables regularly.
- d. does it seem like they are independent or not?
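
As a reminder of what part d is asking: two events are independent if their joint probability equals the product of their marginal probabilities. The sketch below checks this on eight made-up people (the `habits` list is hypothetical, and deliberately built so the two habits tend to go together), so it shows the comparison rather than the answer for the real data.

```python
# Hypothetical toy records: (eats_fruit, eats_veg) for eight made-up people.
habits = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
N = len(habits)

p_fruit = sum(fruit for fruit, veg in habits) / N          # P(F)
p_veg = sum(veg for fruit, veg in habits) / N              # P(V)
p_both = sum(fruit and veg for fruit, veg in habits) / N   # P(F, V)

# Under independence we would expect P(F, V) to be close to P(F) * P(V).
print("joint:", p_both, "product of marginals:", p_fruit * p_veg)
```

With a finite sample the two numbers will rarely match exactly even for truly independent events, so the question is whether the gap is large compared with what sampling noise alone could produce.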

In [31]:
eating_fruits = df['Fruits']==1
eating_vegetables = df['Veggies']==1

#a, b: the marginal probabilities (kept as fractions, not percentages)
P_fruits = np.mean(eating_fruits)
P_Veggie = np.mean(eating_vegetables)

#c: the joint probability of eating both
P_EatFruitsANDEatVeggie = np.mean(eating_fruits & eating_vegetables)
print(P_EatFruitsANDEatVeggie)

#d: if the two were independent, the joint would equal this product;
#print it and compare it with the joint probability above
P_ifIndependent = P_fruits*P_Veggie

0.5625670135603911


---
Answer here.

---

### Exercise 2: Diabetes

We can find the proportion of the participants that have diabetes.

The diabetes column is:

- 0 = no diabetes
- 1 = prediabetes
- 2 = diabetes

So we can write:

In [None]:
P_diabetes = np.mean(df['Diabetes_012']==2)
print("Probability of diabetes (in this cohort): %0.1f%%" % (100*P_diabetes))

- a. What proportion of those who eat fruit AND vegetables have diabetes?

---

In [None]:
#Answer here

---

- b. A smaller proportion of those who eat fruit and vegetables have diabetes. Can we say from this that eating fruit and vegetables reduces the risk of diabetes?

---

Answer here.

---

### Exercise 3: Plotting

One of the columns is the BMI (body mass index) of the participants. Let's plot it in a histogram.

In [None]:
#it can be useful, for data that has both high counts and low, to switch to a log axis.
#we can do this with the hist method by adding the log=True parameter (Try turning it on / off)
plt.hist(df['BMI'],20)#,log=True);
plt.grid()
plt.xlabel('BMI')
plt.ylabel('Frequency')

Exercise:
    
- a. The second parameter in the call to the `plt.hist` function selects the number of bins. Change it to 30 to get more detail.
- b. Is this a normal distribution? Why/why not?

---

Answer to b here.

---

Let's look at how BMI and income interact. The 'Income' column is *categorical* (with category 1 meaning less than \$10k/year, and category 8 meaning \$75k/year or more). Full details are in the ['codebook' for the dataset](https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf).

Below we create two numpy arrays, `BMIofLowIncome` and `BMIofHighIncome`, containing the BMI values for those two groups. We can then plot them as histograms too. I've plotted the density (as the two groups are different sizes, this makes it easier to compare). Plotting a density means the area under each histogram sums to one.

In [None]:
BMIofLowIncome = df[df['Income']==1]['BMI'].to_numpy()
BMIofHighIncome = df[df['Income']==8]['BMI'].to_numpy()
plt.hist(BMIofLowIncome,20,density=True,histtype='stepfilled',color='skyblue',ec="black",label="Lowest Income")
plt.hist(BMIofHighIncome,20,alpha=0.4,density=True,histtype='stepfilled',color='green',ec="black",label="Highest Income");
plt.legend()
plt.ylabel('Density')
plt.xlabel('BMI')
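
To see what the density scaling does, here is a rough plain-Python sketch of the same normalisation: each bin's count is divided by the total count and by the bin width. The helper `density_histogram` and the two small groups are made up for this illustration, and the helper assumes the values are not all identical.

```python
def density_histogram(values, n_bins):
    """Return (bin_width, densities) so that sum(densities) * bin_width == 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value is placed in the last bin rather than a new one
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    densities = [c / (len(values) * width) for c in counts]
    return width, densities

# Hypothetical BMI values for two made-up groups of different sizes.
small_group = [22.0, 25.0, 31.0, 40.0]
large_group = [19.0, 21.0, 23.0, 24.0, 26.0, 27.0, 29.0, 33.0, 38.0, 45.0]

for group in (small_group, large_group):
    width, densities = density_histogram(group, n_bins=5)
    print(sum(d * width for d in densities))  # area under the histogram
```

Both areas come out as one (up to rounding) even though the groups differ in size, which is why the two histograms can be compared on the same axes.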

In [None]:
print("Mean BMI of high income and low income groups: %0.1f, %0.1f kg/m^2" % (np.mean(BMIofHighIncome),np.mean(BMIofLowIncome)))
print("%0.1f%% of people in the high income group are severely obese, while %0.1f%% of people in the low income group are." % (100*np.mean(BMIofHighIncome>40),100*np.mean(BMIofLowIncome>40)))

Although the mean BMI is only slightly higher in the low income group, the increased variance means far more of those in the low income group have very high BMI values.

### Exercise 4: Expectations and Moments

We can compute the variance of the two groups' BMIs using `np.var`:

In [None]:
np.var(BMIofLowIncome)

In [None]:
np.var(BMIofHighIncome)

- a. Can you compute the variance without using the `np.var` or `np.mean` functions? Instead think about how we computed the appropriate expectations. You might find `np.sum(x)` useful (this sums over a list or array) and `len(x)` which will give you the length of an array or list, x.

---

In [None]:
#Answer here.

---

- b. The next moment, after the mean and variance, is the skewness of a distribution. It is computed by $$\tilde{\mu}_3 = E\Big[\Big(\frac{X-\mu}{\sigma}\Big)^3 \Big]$$
Can you compute this?

---

In [None]:
#Answer here.

---
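
If you get stuck, the sketch below writes each moment as an average on a tiny made-up list (the values in `x` are hypothetical). It divides by $N$ throughout, which as far as I know matches the default behaviour of `np.var`; treat it as one possible way of writing the expectations rather than the only one.

```python
# Hypothetical toy sample; any list of numbers would do.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(x)

mean = sum(x) / N                                   # E[X]
variance = sum((xi - mean) ** 2 for xi in x) / N    # E[(X - mu)^2]
sigma = variance ** 0.5
skewness = sum(((xi - mean) / sigma) ** 3 for xi in x) / N

# Equivalent form of the variance: E[X^2] - E[X]^2
variance_alt = sum(xi ** 2 for xi in x) / N - mean ** 2

print(mean, variance, skewness)
```

A positive skewness, as here, indicates a longer tail on the high side of the mean.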

# Naive Bayes

The health centre wants to do a blood test on those most at risk of developing diabetes, e.g. those whose probability of having diabetes is more than 20\%.

A new patient arrives, who has a BMI of 40. What's the chance that they have diabetes?

$$P(Diabetes = true \; |\; BMI = 40)$$

Remember that the dataframe has a 'Diabetes_012' column (0=no diabetes, 1=pre-diabetic, 2=diabetic). So the proportion of the dataset who are diabetic is:

In [None]:
np.mean(df['Diabetes_012']==2)

Note that the dataset is **not a representative sample**, so really we wouldn't necessarily want to use it for doing this sort of inference, but we'll continue, as an illustration!

For our patient we can just look at the proportion of those with BMI=40 who have diabetes:

In [None]:
#Here I create a temporary dataframe with those of a BMI of 40 using `df[df['BMI']==40]`.
#I then test each value of the Diabetes_012 column, and find the average number that have
#this equal to 2. This gives me the proportion.

dfBMI40 = df[df['BMI']==40] #make a new dataframe with just those with a BMI of 40.
np.mean(dfBMI40['Diabetes_012']==2) #find the proportion of this subset with diabetes

So they fall into our 'high risk' category, as 32% of those with a BMI=40 in the dataset have diabetes.

In [None]:
#note that, because age is stored in categories, the ages only take these
#discrete values:
np.unique(df['Age'])

## The curse of dimensionality

Suppose we also know that they are in the age 21 category and smoke...

We can again make a dataframe containing just those who:
- have a BMI = 40
- are in the age = 21 category
- and smoke.

In [None]:
#Here I create a temporary dataframe containing those who have a BMI of 40, are in the
#Age=21 category AND smoke, using `df[(df['BMI']==40) & (df['Age']==21) & (df['Smoker']==1)]`
dfBMI40Age21Smoke = df[(df['BMI']==40) & (df['Age']==21) & (df['Smoker']==1)]
np.mean(dfBMI40Age21Smoke['Diabetes_012']==2)

Great, 0% chance!!
But there's something a bit wrong with this analysis...

In [None]:
len(dfBMI40Age21Smoke)

There are only 6 people in the dataset with a BMI of 40, who are in the 21 years old category, who smoke.

We can display this whole set:

In [None]:
dfBMI40Age21Smoke

### What are we assuming here?

When we approximated the probability by the ratio of cases to the total number in that condition, we were assuming that the total number was very large, so that it stood in for an infinite number of samples:

$$P(Y = y | Z = z) \approx \frac{n_{y|z}}{N_z}$$

Here $N_z$ is the number of rows that meet the condition and $n_{y|z}$ is the number of those rows with the outcome $y$.

Six is not a large enough sample.
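
To see how little six people can tell us, here is a back-of-envelope sketch. It assumes, purely for illustration, that each of the six independently has the 32% chance of diabetes we found for everyone with BMI=40, and asks how often we would still see no cases at all.

```python
# Assumption for illustration: each of the 6 people independently has the same
# 32% chance of diabetes seen among everyone with BMI=40 earlier in the lab.
p = 0.32
n = 6

p_zero_cases = (1 - p) ** n   # probability that none of the n have diabetes
print("P(0 cases out of %d) = %0.3f" % (n, p_zero_cases))
```

Under that assumption, roughly one such group in ten would contain no diabetics, so observing zero cases in six people is weak evidence that the risk is really zero.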

### Naive Bayes

<mark>The Naive Bayes classifier is an approach for making this type of inference by assuming **conditional independence between the features** (given the class).</mark> We've already seen (with the fruit and veg above) that this assumption probably is invalid, but it can still give reasonable results.

Let's think about this more carefully:

We are interested in computing the probability of having diabetes (given some features about the person), i.e.:

$P(D=true | x_1, x_2,...,x_n)$.

Quoting from the [Wikipedia article](https://en.wikipedia.org/wiki/Naive_Bayes_classifier):

> The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible.

We've seen this problem with the patient above. Only six rows (out of 253,680) in the database have the same features.

### Rewriting with Bayes Rule

We can write the above conditional distribution as:

$$P(D | x_1, x_2,...,x_n) = \frac{P(x_1, x_2,...,x_n | D) \times P(D)}{P(x_1, x_2,...,x_n)}$$

We can use the chain rule of probability to write the likelihood term as:

$$P(x_1, x_2,...,x_n | D) = P(x_1 | x_2,...,x_n, D) \times P(x_2 | x_3,...,x_n, D) \times \;... \times \; P(x_{n-1} \;|\; x_n, D) \times P(x_n \;|\; D)$$

It's not obvious why, so let's go through this step-by-step. We start with our likelihood term telling us the probability of having a BMI of 40, an age of 21 and smoking, given Diabetes = true (or false):

$$P(B=40, A=21, S = true | D)$$

We can write this using the product rule, as:

$$P(B=40 | A=21, S = true, D) \times P(A = 21, S = true | D)$$

We can then apply the same reasoning to the second term (try this yourself).

> It's probably worth stopping, writing this down on a bit of paper and thinking about it! This is quite a difficult step, and once you've understood it the rest will be fairly easy.
>
>  We are using:
> $$P(A,B) = P(A|B)\; P(B)$$
> but there's an additional conditional, 'given C' on everything:
> $$P(A,B|C) = P(A|B,C)\; P(B|C)$$

The result is that we can write our likelihood as a product of conditional probabilities:

$$\begin{aligned}
P(B=40, A=21, S = true | D) = {} & P(B=40 | A=21, S = true, D) \\
& \times\; P(A=21 | S = true, D) \\
& \times\; P(S=true | D)
\end{aligned}$$
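
If you'd like to convince yourself numerically, the sketch below builds a made-up joint distribution over three binary features (the `weights` are hypothetical and stand in for $P(B, A, S | D)$ at one fixed value of $D$) and checks that the three-term product recovers the joint.

```python
from itertools import product

# Hypothetical joint distribution P(B, A, S | D) over three binary features,
# given as weights that sum to one (the numbers are made up).
weights = [0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.05, 0.30]
joint = dict(zip(product([0, 1], repeat=3), weights))

def prob(**fixed):
    """Sum the joint over all entries matching the fixed values of b, a, s."""
    names = ("b", "a", "s")
    return sum(p for values, p in joint.items()
               if all(values[names.index(k)] == v for k, v in fixed.items()))

b, a, s = 1, 0, 1
lhs = prob(b=b, a=a, s=s)                              # P(B, A, S | D)
rhs = (prob(b=b, a=a, s=s) / prob(a=a, s=s)            # P(B | A, S, D)
       * prob(a=a, s=s) / prob(s=s)                    # P(A | S, D)
       * prob(s=s))                                    # P(S | D)
print(lhs, rhs)
```

The intermediate terms cancel in pairs, which is why this factorisation holds for any joint distribution and needs no independence assumption.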


### The Naive bit of Naive Bayes

> **Reminder: Conditional independence.** If two random variables ($X,Y$) are conditionally independent, given a third ($Z$), it means that $P(X|Y,Z) = P(X|Z)$, i.e. (for a given value of $Z$) the probability of $X$ isn't influenced by the value of $Y$. It is written as $X \perp \!\!\! \perp Y | Z$.

Naive Bayes: We now make the (conditional) independence assumption in our expression above, that **all the features are conditionally independent (given the diabetes status, D)**.

So if our naive-Bayes assumption holds, we can write that:

$$P(B=40 \; | \; A=21, \; S = true, \; D) \; =\; P(B=40\; |\; D)$$

and,

$$P(A=21 | S = true, D) = P(A=21 | D)$$

Substituting in:

$$P(B=40, A=21, S = true | D) \; =\; P(B=40 | D) \; P(A=21 | D) \; P(S=true | D)$$

Please note that this, in general, isn't true. We are saying it's true because we have assumed conditional independence, e.g. that $B \perp \!\!\! \perp A | D$, etc.
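
Here is a small sketch of how far off the naive product can be. The table is a made-up $P(B, A | D)$ for two binary features, chosen so that the features tend to occur together; none of these numbers come from the diabetes dataset.

```python
# Hypothetical P(B, A | D) for two binary features (made-up numbers in
# which B and A tend to occur together).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_b1 = joint[(1, 0)] + joint[(1, 1)]   # P(B=1 | D)
p_a1 = joint[(0, 1)] + joint[(1, 1)]   # P(A=1 | D)

true_joint = joint[(1, 1)]             # P(B=1, A=1 | D)
naive_joint = p_b1 * p_a1              # what Naive Bayes would use instead

print("true:", true_joint, "naive:", naive_joint)
```

In this toy table the naive product underestimates the true joint probability, because it ignores the tendency of the two features to appear together.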

### Why is this useful?

Only 6 people in the dataset had a BMI=40, were age=21 and smoked. But lots of individuals had *each* of these characteristics, so we can estimate each conditional probability separately and then combine them.

Note that Naive Bayes is often used to compute a ratio between two conditions, but here we'll compute the posterior:

$$P(D = true | x_1, x_2,...,x_n) = \frac{P(x_1, x_2,...,x_n | D = true) P(D = true)}{P(x_1, x_2,...,x_n)}$$

To compute the denominator we marginalise, i.e. we compute $P(x_1, x_2,...,x_n,D=true)$ and $P(x_1, x_2,...,x_n,D=false)$ and add them up to get $P(x_1, x_2,...,x_n)$.

We've already computed the former when we computed the numerator, as it is the joint probability:

$$P(x_1, x_2,...,x_n, D = true) = P(x_1, x_2,...,x_n | D = true) P(D = true)$$

so we just need to compute,

$$P(x_1, x_2,...,x_n, D = false) = P(x_1, x_2,...,x_n | D = false) P(D = false).$$

Once we find these, we can add them up:

$$P(x_1, x_2,...,x_n) = P(x_1, x_2,...,x_n, D = true) + P(x_1, x_2,...,x_n, D = false)$$

### Putting it together...

We now apply the Naive Bayes (conditional independence) assumption and substitute in our features (to keep it readable I've hidden the values the RVs are equal to). So we assume:

$$\begin{aligned}
P(B, A, S | D) &= P(B | D)\; P(A | D)\; P(S | D)\\
P(B, A, S | \neg D) &= P(B | \neg D)\; P(A | \neg D)\; P(S | \neg D)
\end{aligned}$$

We can then use these to compute the denominator. Here we're using the above approximations to the likelihoods:

$$P(B,A,S) = P(B, A, S | D) P(D) + P(B, A, S| \neg D)P(\neg D)$$

and finally the posterior we're interested in: (remember that we are now going to be using our approximations, based on our assumptions about independence for these terms).

$$P(D | B, A, S) = \frac{P(B, A, S|D)\;P(D)}{P(B, A, S)}$$


In [None]:
#we create two dataframes, one of those with diabetes, and one with those without,
dfDiabetes = df[df['Diabetes_012']==2]
dfnotDiabetes = df[df['Diabetes_012']<2]

PBMI40_givenDiabetesTrue = np.mean(dfDiabetes['BMI']==40)
PBMI40_givenDiabetesFalse = np.mean(dfnotDiabetes['BMI']==40)

PAge21_givenDiabetesTrue = np.mean(dfDiabetes['Age']==21)
PAge21_givenDiabetesFalse = np.mean(dfnotDiabetes['Age']==21)

PSmoking_givenDiabetesTrue = np.mean(dfDiabetes['Smoker']==1)
PSmoking_givenDiabetesFalse = np.mean(dfnotDiabetes['Smoker']==1)

PDiabetesTrue = np.mean(df['Diabetes_012']==2)
PDiabetesFalse = np.mean(df['Diabetes_012']<2)

In [None]:
#P(BMI=40,Age=21,Smoking=true, Diabetes=true): the Naive Bayes likelihood times the prior
PallAndDiabetesTrue = (PBMI40_givenDiabetesTrue * PAge21_givenDiabetesTrue * PSmoking_givenDiabetesTrue * PDiabetesTrue)
#P(BMI=40,Age=21,Smoking=true, Diabetes=false): the same for those without diabetes
PallAndDiabetesFalse = (PBMI40_givenDiabetesFalse * PAge21_givenDiabetesFalse * PSmoking_givenDiabetesFalse * PDiabetesFalse)

#the posterior is the Diabetes=true joint divided by the sum of the two joints (the marginal)
print("Probability of diabetes, given BMI=40, Age=21 and Smoker = %0.1f %%" % (100*(PallAndDiabetesTrue) / (PallAndDiabetesTrue + PallAndDiabetesFalse)))
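
The same calculation can be wrapped in a small helper so it is easier to reuse for other patients. This is only a sketch: the function name `naive_bayes_posterior` and the probabilities passed to it are made up for illustration, and it assumes a two-class problem so that $P(D=false) = 1 - P(D=true)$.

```python
from math import prod

def naive_bayes_posterior(likelihoods_true, likelihoods_false, prior_true):
    """Posterior P(D=true | features) under the Naive Bayes assumption.

    likelihoods_true:  [P(x_i | D=true)  for each feature]
    likelihoods_false: [P(x_i | D=false) for each feature]
    """
    joint_true = prod(likelihoods_true) * prior_true          # P(x, D=true)
    joint_false = prod(likelihoods_false) * (1 - prior_true)  # P(x, D=false)
    return joint_true / (joint_true + joint_false)            # marginalise over D

# Made-up per-feature probabilities, purely for illustration.
posterior = naive_bayes_posterior([0.2, 0.1, 0.5], [0.1, 0.2, 0.4], prior_true=0.25)
print("%0.1f%%" % (100 * posterior))
```

Passing lists of per-feature likelihoods means the helper works for any number of features, which is the main attraction of the Naive Bayes factorisation.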

### Exercise:

- a. What's the probability they have diabetes if they are 62, not a smoker and have a BMI of 20?
- b. What's the probability they are over 50 if they have a BMI of 20 and have diabetes and don't smoke? Use both Naive Bayes and compare to the answer computed with the full conditional distribution (without the independence assumption).

---

In [None]:
#Answers here.

---

### Related topics

- If you play with Naive Bayes you might find situations where no rows in the training set have a given feature value. A simple way to handle that is to 'add one' to all the frequencies. This is [Laplace Smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) (and is equivalent to adding a prior).
- Naive Bayes can be considered as a (simple) Bayesian belief network:

<img width=300 src="https://www.researchgate.net/publication/283161090/figure/fig1/AS:648613512364033@1531652920537/A-typical-Naive-Bayes-network-diagram.png" />

<small><small>A Naive Bayes Network, from <i>Ibrahim et al. (2015). doi: 10.1016/j.procs.2015.09.194.</i></small></small>

A more complex network can be constructed, with some conditional dependencies added between features using edges. [Wikipedia article on Bayesian networks](https://en.wikipedia.org/wiki/Bayesian_network).
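
Going back to the 'add one' idea in the first bullet, here is a minimal sketch of additive smoothing on made-up counts. The helper name `smoothed_probabilities`, the `alpha` parameter and the list of ages are all hypothetical choices for this example.

```python
from collections import Counter

def smoothed_probabilities(observations, categories, alpha=1):
    """Estimate P(category) with additive (Laplace) smoothing.

    alpha=1 is the 'add one' rule; alpha=0 gives the raw frequencies.
    """
    counts = Counter(observations)
    total = len(observations) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical age categories seen in a small subgroup; 82 never appears.
ages_seen = [21, 21, 27, 32, 32, 32]
probs = smoothed_probabilities(ages_seen, categories=[21, 27, 32, 82])
print(probs)
```

The unseen category now gets a small non-zero probability, so a single zero can no longer wipe out the whole Naive Bayes product.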

### Summary

With a handful of basic tools (the product rule, marginalisation, etc) we are able to perform really useful inference about important questions.

# Notes for other lecturers on data prep

The original ages were stored using a [14 category age](https://www.icpsr.umich.edu/web/NAHDAP/studies/34085/datasets/0001/variables/AGEG5YR?archive=NAHDAP). So I converted these to years (using the centre of each category) and saved as a new file:

```
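newdf = df[df['Age']!=99].copy() #copy, so the assignments below don't act on a view of df
newdf['Age'] = newdf['Age']*5+17 #map each 5-year category to a representative age in years
newdf.loc[newdf['Age']==22,'Age']=21 #the youngest category is labelled 21 rather than 22
newdf.to_csv('diabetes2.csv')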
```