# Machine learning notebook

<h2>Diabetes dataset</h2>
<h5>Author: Marwah Saleh</h5>
<p>This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.</p>
<h5>Content</h5>
<p>Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
<ul>
    <li>Pregnancies: Number of times pregnant</li>
    <li>Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test</li>
    <li>BloodPressure: Diastolic blood pressure (mm Hg)</li>
    <li>SkinThickness: Triceps skin fold thickness (mm)</li>
    <li>Insulin: 2-Hour serum insulin (mu U/ml)</li>
    <li>BMI: Body mass index (weight in kg/(height in m)^2)</li>
    <li>DiabetesPedigreeFunction: Diabetes pedigree function</li>
    <li>Age: Age (years)</li>
    <li>Outcome: Class variable (0 or 1)</li>
</ul>
</p>
<ul>
<li>Number of Instances: 768</li>
<li>Number of Attributes: 8 plus class</li>
<li>For Each Attribute: (all numeric-valued)</li>
</ul>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_csv("diabetesmain.csv")
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


Several columns record missing values as zeros. I want to replace those zeros with "NaN" so that they can be handled as missing values.
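One way to do this replacement is sketched below on a tiny made-up frame. It assumes that a zero means "missing" only in measurement columns such as Glucose and Insulin, while a zero in Pregnancies or Outcome is a real value. The frame `df` and the list `zero_means_missing` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame using the dataset's column names (values are illustrative only)
df = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 85, 0],
    "Insulin": [0, 94, 168],
    "Outcome": [1, 0, 1],
})

# Assumption: a zero in these columns is a missing measurement, not a real reading
zero_means_missing = ["Glucose", "Insulin"]
df[zero_means_missing] = df[zero_means_missing].replace(0, np.nan)

# Count the missing values per column after the replacement
print(df.isna().sum())
```

Restricting the replacement to named columns leaves legitimate zeros, such as zero pregnancies or a negative outcome, untouched.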

<h5>Author: Yara Mohamed</h5>

In [3]:
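# Temporarily recode Outcome 0 as 2 so that reloading with na_values="0"
# does not turn the negative class into NaN
dataset.loc[dataset["Outcome"] == 0, "Outcome"] = 2
dataset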

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,2
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,2
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,2
764,2,122,70,27,0,36.8,0.340,27,2
765,5,121,72,23,112,26.2,0.245,30,2
766,1,126,60,0,0,30.1,0.349,47,1


In [4]:
dataset.to_csv("diabetes.csv")

In [5]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,2
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,2
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,2
764,2,122,70,27,0,36.8,0.340,27,2
765,5,121,72,23,112,26.2,0.245,30,2
766,1,126,60,0,0,30.1,0.349,47,1


In [6]:
# Read every 0 as NaN. This also affects Pregnancies, where 0 is a valid count.
dataset = pd.read_csv('diabetes.csv', na_values="0")
print(dataset.shape)
dataset

(768, 10)


Unnamed: 0.1,Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,6.0,148.0,72.0,35.0,,33.6,0.627,50,1
1,1.0,1.0,85.0,66.0,29.0,,26.6,0.351,31,2
2,2.0,8.0,183.0,64.0,,,23.3,0.672,32,1
3,3.0,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,2
4,4.0,,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...,...
763,763.0,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63,2
764,764.0,2.0,122.0,70.0,27.0,,36.8,0.340,27,2
765,765.0,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30,2
766,766.0,1.0,126.0,60.0,,,30.1,0.349,47,1


In [7]:
dataset = dataset.drop('Unnamed: 0', axis=1)  # drop the index column that to_csv saved

In [8]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,,33.6,0.627,50,1
1,1.0,85.0,66.0,29.0,,26.6,0.351,31,2
2,8.0,183.0,64.0,,,23.3,0.672,32,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,2
4,,137.0,40.0,35.0,168.0,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63,2
764,2.0,122.0,70.0,27.0,,36.8,0.340,27,2
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30,2
766,1.0,126.0,60.0,,,30.1,0.349,47,1


In [9]:
X = dataset.iloc[:, :-1].values  # feature matrix (all columns except the last)
y = dataset.iloc[:, -1].values  # target vector (the Outcome column)

In [10]:
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [11]:
y

array([1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2,
       1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1,
       2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 2,
       1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,
       1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1,
       2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1,
       1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2,
       1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2,
       1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2,
       2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1,

In [16]:
from sklearn.impute import SimpleImputer
# replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X)
X = imputer.transform(X)

In [17]:
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [18]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.0,148.0,72.0,35.0,,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,,26.6,0.351,31.0
2,8.0,183.0,64.0,,,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,,137.0,40.0,35.0,168.0,43.1,2.288,33.0
...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63.0
764,2.0,122.0,70.0,27.0,,36.8,0.340,27.0
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30.0
766,1.0,126.0,60.0,,,30.1,0.349,47.0


In [19]:
dataset = pd.DataFrame(X, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])

In [20]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6.000000,148.0,72.0,35.00000,155.548223,33.6,0.627,50.0
1,1.000000,85.0,66.0,29.00000,155.548223,26.6,0.351,31.0
2,8.000000,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0
3,1.000000,89.0,66.0,23.00000,94.000000,28.1,0.167,21.0
4,4.494673,137.0,40.0,35.00000,168.000000,43.1,2.288,33.0
...,...,...,...,...,...,...,...,...
763,10.000000,101.0,76.0,48.00000,180.000000,32.9,0.171,63.0
764,2.000000,122.0,70.0,27.00000,155.548223,36.8,0.340,27.0
765,5.000000,121.0,72.0,23.00000,112.000000,26.2,0.245,30.0
766,1.000000,126.0,60.0,29.15342,155.548223,30.1,0.349,47.0


In [22]:
dataset.insert(8, "Outcome", y)  # append the labels as the last column

In [23]:
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.000000,148.0,72.0,35.00000,155.548223,33.6,0.627,50.0,1
1,1.000000,85.0,66.0,29.00000,155.548223,26.6,0.351,31.0,2
2,8.000000,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0,1
3,1.000000,89.0,66.0,23.00000,94.000000,28.1,0.167,21.0,2
4,4.494673,137.0,40.0,35.00000,168.000000,43.1,2.288,33.0,1
...,...,...,...,...,...,...,...,...,...
763,10.000000,101.0,76.0,48.00000,180.000000,32.9,0.171,63.0,2
764,2.000000,122.0,70.0,27.00000,155.548223,36.8,0.340,27.0,2
765,5.000000,121.0,72.0,23.00000,112.000000,26.2,0.245,30.0,2
766,1.000000,126.0,60.0,29.15342,155.548223,30.1,0.349,47.0,1


In [25]:
# Restore the original label: 2 was a temporary stand-in for Outcome 0
dataset.loc[dataset["Outcome"] == 2, "Outcome"] = 0
dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.000000,148.0,72.0,35.00000,155.548223,33.6,0.627,50.0,1
1,1.000000,85.0,66.0,29.00000,155.548223,26.6,0.351,31.0,0
2,8.000000,183.0,64.0,29.15342,155.548223,23.3,0.672,32.0,1
3,1.000000,89.0,66.0,23.00000,94.000000,28.1,0.167,21.0,0
4,4.494673,137.0,40.0,35.00000,168.000000,43.1,2.288,33.0,1
...,...,...,...,...,...,...,...,...,...
763,10.000000,101.0,76.0,48.00000,180.000000,32.9,0.171,63.0,0
764,2.000000,122.0,70.0,27.00000,155.548223,36.8,0.340,27.0,0
765,5.000000,121.0,72.0,23.00000,112.000000,26.2,0.245,30.0,0
766,1.000000,126.0,60.0,29.15342,155.548223,30.1,0.349,47.0,1


<h5>Author: Nada Ahmed</h5>

In [4]:
dataset.describe()

Unnamed: 0,Pregnancies,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0
mean,3.845052,0.471876,33.240885,0.348958
std,3.369578,0.331329,11.760232,0.476951
min,0.0,0.078,21.0,0.0
25%,1.0,0.24375,24.0,0.0
50%,3.0,0.3725,29.0,0.0
75%,6.0,0.62625,41.0,1.0
max,17.0,2.42,81.0,1.0


In [5]:
dataset.duplicated()  # Returns True for every row that is a duplicate of an earlier row, otherwise False

0      False
1      False
2      False
3      False
4      False
       ...  
763    False
764    False
765    False
766    False
767    False
Length: 768, dtype: bool
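Because the truncated output only shows the first and last rows, summing the boolean series is a quicker way to check for duplicates. A minimal sketch on a made-up frame (`df` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up frame in which the third row repeats the first
df = pd.DataFrame({
    "Glucose": [148, 85, 148],
    "Age": [50, 31, 50],
})

# True counts as 1, so the sum is the number of duplicated rows
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)

# Keep only the first occurrence of each row
deduplicated = df.drop_duplicates()
print(len(deduplicated))
```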

Normalizing the dataset:

In [5]:
# perform a min-max scaler transform of the dataset
from sklearn.preprocessing import MinMaxScaler
trans1 = MinMaxScaler()
scaled_dataset1 = trans1.fit_transform(dataset)
# convert the array back to a dataframe
scaled_dataset1 = pd.DataFrame(scaled_dataset1)

print(scaled_dataset1.describe())

                0           1           2           3           4           5  \
count  768.000000  763.000000  733.000000  541.000000  394.000000  757.000000   
mean     0.226180    0.501205    0.493930    0.240798    0.170130    0.291564   
std      0.198210    0.197004    0.126349    0.113880    0.142759    0.141615   
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000   
25%      0.058824    0.354839    0.408163    0.163043    0.074820    0.190184   
50%      0.176471    0.470968    0.489796    0.239130    0.133413    0.288344   
75%      0.352941    0.625806    0.571429    0.315217    0.211538    0.376278   
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000   

                6           7           8  
count  768.000000  768.000000  768.000000  
mean     0.168179    0.204015    0.348958  
std      0.141473    0.196004    0.476951  
min      0.000000    0.000000    0.000000  
25%      0.070773    0.050000    0.000000  
50%   

As the output above shows, min-max scaling maps each column onto the range [0, 1]: the column minimum becomes 0 and the column maximum becomes 1.
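Min-max scaling applies x' = (x - min) / (max - min) to each column. The sketch below computes this by hand with NumPy for the Age summary values reported by `describe()` earlier (21, 24, 29, 41, 81). The array `x` is only an illustration, not the full column.

```python
import numpy as np

# Age summary values from describe(): min, 25%, 50%, 75%, max
x = np.array([21.0, 24.0, 29.0, 41.0, 81.0])

# Min-max scaling: shift by the minimum, then divide by the range
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```

The 25th percentile of Age (24) maps to (24 - 21) / (81 - 21) = 0.05, which agrees with the 25% entry of column 7 in the scaled summary.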

Standardizing the dataset:

In [7]:
# perform a standard scaler transform of the dataset
from sklearn.preprocessing import StandardScaler
trans2 = StandardScaler()
scaled_dataset2 = trans2.fit_transform(dataset)
# convert the array back to a dataframe
scaled_dataset2 = pd.DataFrame(scaled_dataset2)
print(scaled_dataset2.describe())

                  0             1             2             3             4  \
count  7.680000e+02  7.630000e+02  7.330000e+02  5.410000e+02  3.940000e+02   
mean  -6.476301e-17  1.070936e-16 -4.871047e-16 -3.447643e-17  1.127130e-17   
std    1.000652e+00  1.000656e+00  1.000683e+00  1.000925e+00  1.001271e+00   
min   -1.141852e+00 -2.545803e+00 -3.911938e+00 -2.116442e+00 -1.193241e+00   
25%   -8.448851e-01 -7.434474e-01 -6.792777e-01 -6.834067e-01 -6.684780e-01   
50%   -2.509521e-01 -1.535857e-01 -3.274557e-02 -1.465704e-02 -2.575192e-01   
75%    6.399473e-01  6.328967e-01  6.137865e-01  6.540926e-01  2.904259e-01   
max    3.906578e+00  2.533562e+00  4.008080e+00  6.672840e+00  5.820456e+00   

                  5             6             7             8  
count  7.570000e+02  7.680000e+02  7.680000e+02  7.680000e+02  
mean   3.660656e-16  2.451743e-16  1.931325e-16  7.401487e-17  
std    1.000661e+00  1.000652e+00  1.000652e+00  1.000652e+00  
min   -2.060204e+00 -1.189553e+0

We can see that the distributions have been adjusted and that the mean is a very small number close to zero and the standard deviation is very close to 1.0 for each variable.
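Standardization applies z = (x - mean) / std to each column. The reported standard deviation is 1.000652 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `describe()` reports the sample standard deviation (`ddof=1`). For n = 768 the ratio between the two is sqrt(768 / 767), about 1.000652. The sketch below reproduces this with NumPy on synthetic data; the array `x` is randomly generated and only matches the dataset in length.

```python
import numpy as np

# Synthetic column with 768 values, the same length as the dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=33.0, scale=12.0, size=768)

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

print(z.mean())       # numerically zero, up to floating-point error
print(z.std(ddof=0))  # population std of the standardized values
print(z.std(ddof=1))  # sample std, which is what describe() reports
```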