# Coding Exercises

### Coding Exercise 2: Handling Missing Data in a Dataset for Machine Learning
1. Import the necessary Python libraries for data preprocessing, including the `SimpleImputer` class from the scikit-learn library.
2. Download the "Pima Indians Diabetes Database" dataset from the UCI Machine Learning Repository or use a preloaded version of this dataset in your local environment.
3. Load the dataset into a pandas DataFrame using the `read_csv` function from the pandas library.
4. Identify missing data in your dataset. Print the number of missing entries in each column and consider how they could affect model training: missing values can bias the results, and many scikit-learn estimators raise an error when given NaN input.
5. Choose a strategy for handling missing data based on the nature of your dataset. This exercise replaces missing values with the column mean; alternatives include dropping the rows or columns that contain missing data, or replacing missing values with the median or a constant.
6. Configure an instance of the `SimpleImputer` class to replace missing values with the mean value of the column.
7. Apply the `fit` method of the `SimpleImputer` class on the numerical columns of your matrix of features.
8. Use the `transform` method of the `SimpleImputer` class to replace missing data in the specified numerical columns.
9. Update the matrix of features by assigning the result of the `transform` method to the correct columns.
10. Print your updated matrix of features to verify that the missing data was replaced.
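
Step 5 lists several strategies besides the mean. As a minimal sketch, they could be compared side by side on a made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not part of the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 8.0],
    "b": [10.0, 20.0, np.nan, 60.0],
})

# Strategy 1: drop every row that contains a missing value
dropped = toy.dropna()

# Strategies 2-4: impute with the column mean, the column median, or a constant
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(toy)

print(dropped.shape)      # rows that survive dropna
print(mean_filled)        # NaN replaced by each column's mean
print(median_filled)      # NaN replaced by each column's median
print(constant_filled)    # NaN replaced by 0
```

Dropping rows shrinks the dataset, while imputation keeps every row at the cost of inserting estimated values; which trade-off is acceptable depends on how much data is missing.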

In [2]:
# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer

In [3]:
# Load the dataset
dataset = pd.read_csv("diabetes.csv")

In [4]:
dataset.head()

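| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |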


In [5]:
# Identify missing data (assumes that missing data is represented as NaN)
dataset[dataset.isna().any(axis=1)]

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|


In [6]:
# Print the rows that contain at least one NaN value
# The result is empty, so the dataset has no NaN values
print(dataset[dataset.isna().any(axis=1)])

Empty DataFrame
Columns: [Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []
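
The check above lists rows rather than per-column counts, and it only detects NaN. A per-column count could be sketched as follows. The `sample` frame reuses the five rows shown by `dataset.head()` so the block runs on its own. Treating zeros in `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` as placeholders for missing measurements is an assumption commonly made for this dataset, not something the exercise states.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the dataset.head() output
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Number of NaN entries in each column
nan_counts = sample.isna().sum()
print(nan_counts)

# Assumption: a zero in these columns means "not measured"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
marked = sample.copy()
marked[zero_as_missing] = marked[zero_as_missing].replace(0, np.nan)

# Count missing entries again after marking the zeros as NaN
missing_counts = marked.isna().sum()
print(missing_counts)
```

With the zeros marked as NaN, `SimpleImputer(missing_values=np.nan, ...)` would have values to replace; without that step the imputer leaves the data unchanged.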


In [7]:
# Build the matrix of features X (assumption: every column except the Outcome label)
X = dataset.iloc[:, :-1].values

In [8]:
# Configure an instance of the SimpleImputer class to replace NaN with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [9]:
# Fit the imputer on the selected numerical columns of the matrix of features
imputer.fit(X[:, 1:3])

In [10]:
# Replace missing values in the selected columns, then print the updated matrix of features
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
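
Because no NaN values were found in the dataset, the imputation has nothing to replace there. To see what `fit` and `transform` do when values are actually missing, here is a minimal sketch on a small made-up array; `X_demo` and its contents are hypothetical and serve only to illustrate the two steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix; columns 1 and 2 each contain one NaN
X_demo = np.array([
    [6.0, 148.0, 72.0],
    [1.0, np.nan, 66.0],
    [8.0, 183.0, np.nan],
    [1.0, 89.0, 66.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit learns the mean of the observed values in each selected column
imputer.fit(X_demo[:, 1:3])
print(imputer.statistics_)

# transform fills each NaN with the mean learned for its column
X_demo[:, 1:3] = imputer.transform(X_demo[:, 1:3])
print(X_demo)
```

The learned means are stored in `imputer.statistics_`, so the same fitted imputer can later be applied to new data without recomputing them.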