# Machine learning notebook

# Author: Marwah Saleh

<h2>Diabetes dataset</h2>
<p>This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.</p>
<h5>Content</h5>
<p>Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
<ul>
    <li>Pregnancies: Number of times pregnant</li>
    <li>Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
    <li>BloodPressure: Diastolic blood pressure (mm Hg)</li>
    <li>SkinThickness: Triceps skin fold thickness (mm)</li>
    <li>Insulin: 2-Hour serum insulin (mu U/ml)</li>
    <li>BMI: Body mass index (weight in kg/(height in m)^2)</li>
    <li>DiabetesPedigreeFunction: Diabetes pedigree function</li>
    <li>Age: Age (years)</li>
    <li>Outcome: Class variable (0 or 1)</li>
</ul>
</p>
<ul>
<li>Number of Instances: 768</li>
<li>Number of Attributes: 8 plus class</li>
<li>For Each Attribute: (all numeric-valued)</li>
</ul>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_csv("diabetesmain.csv")
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Several columns contain missing values that are recorded as zeros. I want to replace them with NaN so that they can be handled as missing values.

# Author : Yara Mohamed 

Pregnancies and Outcome contain legitimate zeros, so I first replace each zero in those two columns with the placeholder "temp" to distinguish it from the zeros that mark missing values.

In [3]:
dataset.loc[dataset["Outcome"] == 0 , "Outcome"] = "temp"
dataset.loc[dataset["Pregnancies"] == 0 , "Pregnancies"] = "temp"
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,temp
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,temp
4,temp,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,temp
764,2,122,70,27,0,36.8,0.340,27,temp
765,5,121,72,23,112,26.2,0.245,30,temp
766,1,126,60,0,0,30.1,0.349,47,1


In [4]:
dataset.to_csv("diabetes.csv")

In [5]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,temp
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,temp
4,temp,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,temp
764,2,122,70,27,0,36.8,0.340,27,temp
765,5,121,72,23,112,26.2,0.245,30,temp
766,1,126,60,0,0,30.1,0.349,47,1


In [6]:
dataset = pd.read_csv('diabetes.csv' , na_values="0")
print(dataset.shape)
dataset

(768, 10)


Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1.0,1,85.0,66.0,29.0,,26.6,0.351,31,temp
2,2.0,8,183.0,64.0,,,23.3,0.672,32,1
3,3.0,1,89.0,66.0,23.0,94.0,28.1,0.167,21,temp
4,4.0,temp,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...,...
763,763.0,10,101.0,76.0,48.0,180.0,32.9,0.171,63,temp
764,764.0,2,122.0,70.0,27.0,,36.8,0.340,27,temp
765,765.0,5,121.0,72.0,23.0,112.0,26.2,0.245,30,temp
766,766.0,1,126.0,60.0,,,30.1,0.349,47,1


In [7]:
# drop the old index column that to_csv wrote into the file
dataset = dataset.drop('Unnamed: 0', axis=1)

In [8]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,temp
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,temp
4,temp,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,temp
764,2,122.0,70.0,27.0,,36.8,0.340,27,temp
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,temp
766,1,126.0,60.0,,,30.1,0.349,47,1


Now I want to restore the original zero values that were replaced by "temp".

In [9]:
dataset.loc[dataset["Outcome"] == "temp" , "Outcome"] = 0
dataset.loc[dataset["Pregnancies"] == "temp" , "Pregnancies"] = 0
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,0
766,1,126.0,60.0,,,30.1,0.349,47,1
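The "temp" round trip above protects the legitimate zeros in Pregnancies and Outcome while every other zero becomes NaN. A shorter route to the same result is to replace zeros only in the columns where zero cannot be a real measurement. The sketch below uses a small made-up frame (`toy`), and the list `zero_means_missing` is an assumption chosen for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics the layout of the diabetes data.
toy = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 0, 183],
    "Insulin": [0, 94, 0],
    "Outcome": [1, 0, 1],
})

# Assumption: only these columns use zero as a missing-value marker.
zero_means_missing = ["Glucose", "Insulin"]

# Replace zeros with NaN in the selected columns only;
# Pregnancies and Outcome keep their genuine zeros.
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

print(toy)
print(toy.isna().sum())
```

Because the numeric columns never receive a string placeholder, they keep a numeric dtype, which avoids the mixed `'1'` and `0` values that appear in `y` further down.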


In [10]:
X = dataset.iloc[:, :-1].values # features matrix
y = dataset.iloc[:, -1].values # output

In [11]:
X

array([['6', 148.0, 72.0, ..., 33.6, 0.627, 50],
       ['1', 85.0, 66.0, ..., 26.6, 0.351, 31],
       ['8', 183.0, 64.0, ..., 23.3, 0.672, 32],
       ...,
       ['5', 121.0, 72.0, ..., 26.2, 0.245, 30],
       ['1', 126.0, 60.0, ..., 30.1, 0.349, 47],
       ['1', 93.0, 70.0, ..., 30.4, 0.315, 23]], dtype=object)

In [12]:
y

array(['1', 0, '1', 0, '1', 0, '1', 0, '1', '1', 0, '1', 0, '1', '1', '1',
       '1', '1', 0, '1', 0, 0, '1', '1', '1', '1', '1', 0, 0, 0, 0, '1',
       0, 0, 0, 0, 0, '1', '1', '1', 0, 0, 0, '1', 0, '1', 0, 0, '1', 0,
       0, 0, 0, '1', 0, 0, '1', 0, 0, 0, 0, '1', 0, 0, '1', 0, '1', 0, 0,
       0, '1', 0, '1', 0, 0, 0, 0, 0, '1', 0, 0, 0, 0, 0, '1', 0, 0, 0,
       '1', 0, 0, 0, 0, '1', 0, 0, 0, 0, 0, '1', '1', 0, 0, 0, 0, 0, 0, 0,
       0, '1', '1', '1', 0, 0, '1', '1', '1', 0, 0, 0, '1', 0, 0, 0, '1',
       '1', 0, 0, '1', '1', '1', '1', '1', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       '1', 0, 0, 0, 0, 0, 0, 0, 0, '1', 0, '1', '1', 0, 0, 0, '1', 0, 0,
       0, 0, '1', '1', 0, 0, 0, 0, '1', '1', 0, 0, 0, '1', 0, '1', 0, '1',
       0, 0, 0, 0, 0, '1', '1', '1', '1', '1', 0, 0, '1', '1', 0, '1', 0,
       '1', '1', '1', 0, 0, 0, 0, 0, 0, '1', '1', 0, '1', 0, 0, 0, '1',
       '1', '1', '1', 0, '1', '1', '1', '1', 0, 0, 0, 0, 0, '1', 0, 0,
       '1', '1', 0, 0, 0, '1', '1', '1', '1',

In [13]:
from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # define the imputer 
imputer.fit(X) 
X = imputer.transform(X)

In [14]:
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [15]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101.0,76.0,48.0,180.0,32.9,0.171,63,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
765,5,121.0,72.0,23.0,112.0,26.2,0.245,30,0
766,1,126.0,60.0,,,30.1,0.349,47,1


In [16]:
dataset = pd.DataFrame(X, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])

In [17]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,72.0,35.00000,155.548223,33.6,0.627,50.0
1,1.0,85.0,66.0,29.00000,155.548223,26.6,0.351,31.0
2,8.0,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0
3,1.0,89.0,66.0,23.00000,94.000000,28.1,0.167,21.0
4,0.0,137.0,40.0,35.00000,168.000000,43.1,2.288,33.0
...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.00000,180.000000,32.9,0.171,63.0
764,2.0,122.0,70.0,27.00000,155.548223,36.8,0.340,27.0
765,5.0,121.0,72.0,23.00000,112.000000,26.2,0.245,30.0
766,1.0,126.0,60.0,29.15342,155.548223,30.1,0.349,47.0


In [18]:
dataset.insert(8,"Outcome",y)

In [19]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.00000,155.548223,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.00000,155.548223,26.6,0.351,31.0,0
2,8.0,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.00000,94.000000,28.1,0.167,21.0,0
4,0.0,137.0,40.0,35.00000,168.000000,43.1,2.288,33.0,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.00000,180.000000,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.00000,155.548223,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.00000,112.000000,26.2,0.245,30.0,0
766,1.0,126.0,60.0,29.15342,155.548223,30.1,0.349,47.0,1


# Author: Nada Ahmed

In [20]:
dataset.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885
std,3.369578,30.435949,12.096346,8.790942,85.021108,6.875151,0.331329,11.760232
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0
25%,1.0,99.75,64.0,25.0,121.5,27.5,0.24375,24.0
50%,3.0,117.0,72.202592,29.15342,155.548223,32.4,0.3725,29.0
75%,6.0,140.25,80.0,32.0,155.548223,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


The count is 768 for every column, so no missing values remain: all of them have been replaced by the column mean.
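A more direct check than reading the `count` row is to count the NaN entries per column before and after imputation. The sketch below is a minimal illustration on a made-up frame (`toy`). It uses pandas' `fillna` with the column means as a stand-in for `SimpleImputer(strategy='mean')`, which is an assumption made to keep the example small.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing entry, standing in for the real data.
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, 23.3],
})

missing_before = toy.isna().sum()   # NaN count per column before imputation
filled = toy.fillna(toy.mean())     # mean imputation, column by column
missing_after = filled.isna().sum() # NaN count per column after imputation

print(missing_before)
print(missing_after)
```

On the real data, `dataset.isna().sum()` should give zero for every column once the imputer has run.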

In [21]:
dataset.duplicated()  # Returns True for every row that is a duplicate, otherwise False

0      False
1      False
2      False
3      False
4      False
       ...  
763    False
764    False
765    False
766    False
767    False
Length: 768, dtype: bool

None of the rows shown is flagged as a duplicate, so there appear to be no duplicated rows in the dataset.
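The printed Series is truncated, so it only shows the first and last few rows. Summing the boolean mask gives the number of duplicates over all rows. The sketch below shows this on a made-up frame (`toy`) that deliberately contains one repeated row.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 148.0],
    "BMI": [33.6, 26.6, 33.6],
})

dup_mask = toy.duplicated()          # True for each row that repeats an earlier row
n_duplicates = int(dup_mask.sum())   # True counts as 1, so the sum is the total
deduplicated = toy.drop_duplicates() # keeps the first occurrence of each row

print(n_duplicates)
print(deduplicated)
```

For the diabetes data the equivalent check would be `dataset.duplicated().sum()`, which should be 0 if the claim above holds.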

# Normalizing the dataset:

In [22]:
# perform a min-max scaler transform of the dataset
from sklearn.preprocessing import MinMaxScaler
trans1 = MinMaxScaler()
scaled_dataset1 = trans1.fit_transform(dataset)
# convert the array back to a dataframe
scaled_dataset1 = pd.DataFrame(scaled_dataset1)

print(scaled_dataset1.describe())

                0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000   
mean     0.226180    0.501205    0.493930    0.240798    0.170130    0.291564   
std      0.198210    0.196361    0.123432    0.095554    0.102189    0.140596   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.058824    0.359677    0.408163    0.195652    0.129207    0.190184   
50%      0.176471    0.470968    0.491863    0.240798    0.170130    0.290389   
75%      0.352941    0.620968    0.571429    0.271739    0.170130    0.376278   
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000   

                6           7           8  
count  768.000000  768.000000  768.000000  
mean     0.168179    0.204015    0.348958  
std      0.141473    0.196004    0.476951  
min      0.000000    0.000000    0.000000  
25%      0.070773    0.050000    0.000000  
50%   

As the output above shows, after min-max scaling every column has a minimum of 0 and a maximum of 1.
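With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand with pandas. The frame `toy` is made up: its values are taken from the min, median and max rows of the `describe()` output for Glucose and Age, for illustration only.

```python
import pandas as pd

# Illustrative values only: min, median and max of two columns.
toy = pd.DataFrame({
    "Glucose": [44.0, 117.0, 199.0],
    "Age": [21.0, 29.0, 81.0],
})

col_min = toy.min()   # per-column minimum
col_max = toy.max()   # per-column maximum

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (toy - col_min) / (col_max - col_min)

print(scaled)
```

The middle Glucose value becomes (117 - 44) / (199 - 44), about 0.471, which agrees with the 50% entry of column 1 in the scaled summary.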

# Standardizing the dataset:

In [23]:
# perform a standard scaler transform of the dataset
from sklearn.preprocessing import StandardScaler
trans2 = StandardScaler()
scaled_dataset2 = trans2.fit_transform(dataset)
# convert the array back to a dataframe
scaled_dataset2 = pd.DataFrame(scaled_dataset2)
print(scaled_dataset2.describe())

                  0             1             2             3             4  \
count  7.680000e+02  7.680000e+02  7.680000e+02  7.680000e+02  7.680000e+02   
mean  -6.476301e-17 -3.561966e-16  6.915764e-16  7.956598e-16 -3.330669e-16   
std    1.000652e+00  1.000652e+00  1.000652e+00  1.000652e+00  1.000652e+00   
min   -1.141852e+00 -2.554131e+00 -4.004245e+00 -2.521670e+00 -1.665945e+00   
25%   -8.448851e-01 -7.212214e-01 -6.953060e-01 -4.727737e-01 -4.007289e-01   
50%   -2.509521e-01 -1.540881e-01 -1.675912e-02  8.087936e-16 -3.345079e-16   
75%    6.399473e-01  6.103090e-01  6.282695e-01  3.240194e-01 -3.345079e-16   
max    3.906578e+00  2.541850e+00  4.102655e+00  7.950467e+00  8.126238e+00   

                  5             6             7             8  
count  7.680000e+02  7.680000e+02  7.680000e+02  7.680000e+02  
mean   3.515706e-16  2.451743e-16  1.931325e-16  7.401487e-17  
std    1.000652e+00  1.000652e+00  1.000652e+00  1.000652e+00  
min   -2.075119e+00 -1.189553e+0

We can see that the distributions have been adjusted and that the mean is a very small number close to zero and the standard deviation is very close to 1.0 for each variable.
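The reported standard deviation is 1.000652 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (denominator n), while `describe()` reports the sample standard deviation (denominator n - 1). With n = 768 the ratio between the two is sqrt(768/767), about 1.000652. The sketch below reproduces this with NumPy alone. The column `x` is synthetic random data, assumed here only as a stand-in for a real feature.

```python
import numpy as np

n = 768  # number of rows in the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=30.0, size=n)  # synthetic stand-in column

# Standardize with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)      # 1.0 up to floating-point rounding
sample_std = z.std(ddof=1)   # the convention pandas describe() uses

print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```

The ratio depends only on n, so every column in the standardized summary shows the same value.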

In [24]:
dataset.to_csv("diabetes.csv")