<a href="https://colab.research.google.com/github/Jency07/machine-learning/blob/main/Data_Imputation_20MAI0026.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **DATA IMPUTATION**

Imputation is the process of replacing missing values with substitute values, typically estimated from the available data.

There are several options for choosing the replacement value, for example:

*   A constant value that has meaning within the domain, such as 0, distinct from all other values.

*   A value from another randomly selected record.

*   A mean, median or mode value for the column.

*   A value estimated by another predictive model.
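
As a rough illustration of the constant and mean/median/mode options listed above, the following sketch applies scikit-learn's `SimpleImputer` to a small toy array. The array values and the variable names (`X_toy`, `filled`) are made up for this example; the random-record and predictive-model options are not covered here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values); np.nan marks a missing entry.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [2.0, 30.0],
    [np.nan, 50.0],
])

filled = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # fill_value is only meaningful for the "constant" strategy
    kwargs = {"fill_value": 0} if strategy == "constant" else {}
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    # fit_transform learns one replacement value per column, then fills the gaps
    filled[strategy] = imputer.fit_transform(X_toy)
    print(strategy)
    print(filled[strategy])
```

Each strategy computes its replacement value per column, so the missing entry in the second column is filled independently of the first.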

[Github Link](https://github.com/Jency07/machine-learning/blob/main/Data_Imputation_20MAI0026.ipynb)

### ***1. Import libraries***

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

### ***2. Upload the dataset file***

In [2]:
from google.colab import files
uploaded=files.upload()

Saving pima-indians-diabetes.csv to pima-indians-diabetes (1).csv


### ***3. Load the dataset***

In [3]:
# load the dataset
dataset = pd.read_csv('pima-indians-diabetes (1).csv')
dataset.head(10)

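| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure | Triceps skin fold thickness | Serum insulin | Body mass index | Diabetes pedigree function | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |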


In [4]:
# summarize the dataset
dataset.describe()

| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure | Triceps skin fold thickness | Serum insulin | Body mass index | Diabetes pedigree function | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


We can see that several input columns have a minimum value of zero (0):

1. Number of times pregnant

2. Plasma glucose concentration

3. Diastolic blood pressure

4. Triceps skin fold thickness

5. Serum insulin

6. Body mass index

In some of these columns, a value of zero does not make sense and indicates an invalid or missing value:

1. Plasma glucose concentration

2. Diastolic blood pressure

3. Triceps skin fold thickness

4. Serum insulin

5. Body mass index

In [5]:
# count the number of missing values for each column
cols = ["Plasma glucose concentration","Diastolic blood pressure","Triceps skin fold thickness","Serum insulin","Body mass index"]
missing_count = (dataset[cols] == 0).sum()
# report the results
print(missing_count)

Plasma glucose concentration      5
Diastolic blood pressure         35
Triceps skin fold thickness     227
Serum insulin                   374
Body mass index                  11
dtype: int64
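
The zero-as-missing convention described above can be made explicit by converting those zeros to `NaN`, after which pandas' own missing-value tools apply. The sketch below uses a three-row toy frame with invented values, not the real dataset; `toy`, `zero_means_missing`, and `missing_per_column` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset (values invented for illustration).
toy = pd.DataFrame({
    "Number of times pregnant": [0, 2, 5],   # zero is a valid count here
    "Serum insulin": [0, 94, 0],             # zero is implausible, treat as missing
    "Body mass index": [33.6, 0.0, 28.1],    # zero is implausible, treat as missing
})

# Only the columns where zero cannot be a real measurement are converted.
zero_means_missing = ["Serum insulin", "Body mass index"]
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

# After the conversion, isna() counts the missing entries directly.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Leaving "Number of times pregnant" out of the conversion keeps its legitimate zeros intact.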


### **4. Data Imputation - Replacing missing values with mean**

In [6]:
print("\nRoll No: 20MAI0026")
print("***************************\n")

# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv')

# mark zero values as missing or NaN
cols = ["Plasma glucose concentration","Diastolic blood pressure","Triceps skin fold thickness","Serum insulin","Body mass index"]
dataset[cols] = dataset[cols].replace(0, np.nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the imputer - mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# define the model
lda = LinearDiscriminantAnalysis()

# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer),('model', lda)])

# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')

# report the mean performance
print('Accuracy: %.3f' % result.mean())


Roll No: 20MAI0026
***************************

Accuracy: 0.762
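
One reason to wrap the imputer and the model in a `Pipeline` is that, during cross-validation, the column means are then computed from the training fold only and reused on the held-out fold. The sketch below checks this against a hand-written loop. It runs on synthetic stand-in data from `make_classification` with about 10% of entries blanked out (an assumption for illustration, not the diabetes dataset), so its scores are unrelated to the accuracy reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with roughly 10% of the entries set to NaN.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.1] = np.nan

kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# Pipeline version, mirroring the notebook cell.
pipeline = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                           ("model", LinearDiscriminantAnalysis())])
pipeline_scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")

# Hand-written equivalent: the imputer is fit on the training fold only.
manual_scores = []
for train_idx, test_idx in kfold.split(X):
    imputer = SimpleImputer(strategy="mean")
    X_train = imputer.fit_transform(X[train_idx])  # learn means from training rows
    X_test = imputer.transform(X[test_idx])        # reuse those means on test rows
    model = LinearDiscriminantAnalysis().fit(X_train, y[train_idx])
    manual_scores.append(model.score(X_test, y[test_idx]))

print(pipeline_scores)
print(np.array(manual_scores))
```

Imputing the whole dataset before splitting would instead let test-fold values influence the means used for training.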


### **5. Data Imputation - Replacing missing values with median**

In [7]:
print("\nRoll No: 20MAI0026")
print("***************************\n")

# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv')

# mark zero values as missing or NaN
cols = ["Plasma glucose concentration","Diastolic blood pressure","Triceps skin fold thickness","Serum insulin","Body mass index"]
dataset[cols] = dataset[cols].replace(0, np.nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the imputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# define the model
lda = LinearDiscriminantAnalysis()

# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer),('model', lda)])

# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')

# report the mean accuracy across folds (median imputation)
print('Accuracy: %.3f' % result.mean())


Roll No: 20MAI0026
***************************

Accuracy: 0.760


### **6. Data Imputation - Replacing missing values with most frequent**

In [8]:
print("\nRoll No: 20MAI0026")
print("***************************\n")

# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv')

# mark zero values as missing or NaN
cols = ["Plasma glucose concentration","Diastolic blood pressure","Triceps skin fold thickness","Serum insulin","Body mass index"]
dataset[cols] = dataset[cols].replace(0, np.nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

# define the imputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# define the model
lda = LinearDiscriminantAnalysis()

# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer),('model', lda)])

# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')

# report the mean accuracy across folds (most-frequent imputation)
print('Accuracy: %.3f' % result.mean())


Roll No: 20MAI0026
***************************

Accuracy: 0.760


### **7. Data Imputation - Replacing missing values with constant values**

In [9]:
print("\nRoll No: 20MAI0026")
print("***************************\n")

# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv')

# mark zero values as missing or NaN
cols = ["Plasma glucose concentration","Diastolic blood pressure","Triceps skin fold thickness","Serum insulin","Body mass index"]
dataset[cols] = dataset[cols].replace(0, np.nan)

# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]

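# define the imputer - constant
# fill_value defaults to 0 for numeric data (made explicit here), so the NaN markers become zeros again
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)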

# define the model
lda = LinearDiscriminantAnalysis()

# define the modeling pipeline
pipeline = Pipeline(steps=[('imputer', imputer),('model', lda)])

# define the cross validation procedure
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')

# report the mean accuracy across folds (constant imputation)
print('Accuracy: %.3f' % result.mean())


Roll No: 20MAI0026
***************************

Accuracy: 0.763
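
The four cells above differ only in the `strategy` argument, so the comparison could also be written as a single loop. The following is a minimal sketch of that idea, not the notebook's own code: `compare_strategies` is a hypothetical helper, and the data is a synthetic stand-in with artificially removed entries, so the printed accuracies will not match the figures reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline


def compare_strategies(X, y, strategies=("mean", "median", "most_frequent", "constant")):
    """Return the mean 3-fold accuracy of an imputer + LDA pipeline per strategy."""
    kfold = KFold(n_splits=3, shuffle=True, random_state=1)
    results = {}
    for strategy in strategies:
        pipeline = Pipeline(steps=[
            ("imputer", SimpleImputer(missing_values=np.nan, strategy=strategy)),
            ("model", LinearDiscriminantAnalysis()),
        ])
        scores = cross_val_score(pipeline, X, y, cv=kfold, scoring="accuracy")
        results[strategy] = scores.mean()
    return results


# Synthetic stand-in data (assumption: not the diabetes dataset), ~10% missing.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_redundant=0, random_state=1)
rng = np.random.default_rng(1)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

results = compare_strategies(X_demo, y_demo)
for strategy, accuracy in results.items():
    print("%s: %.3f" % (strategy, accuracy))
```

To run it on the notebook's data, the `X` and `y` arrays built in the cells above could be passed in place of `X_demo` and `y_demo`.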
