# Data Science Assessment

In [52]:
# Import the necessary libraries
import pandas as pd
import numpy as np

## Data Exploration
In this section, we load the dataset into a pandas DataFrame and perform some initial exploration to understand it better. We check the shape of the dataset and the data types of its columns, display the first 5 rows, the last 5 rows, and 20 random rows, and check for null/NaN values. We also look at rows where Glucose is 0 and print summary statistics. This gives a first idea of what the data contains and what will need to be cleaned.


In [53]:
# Define Path to Dataset
path = "Datasets/Diabetes Dataset.csv"

# Load dataset into Pandas DataFrame
df = pd.read_csv(path)

# Display the shape of the dataset
print(f"Shape of the dataset: {df.shape}")

# Display the data types of the columns in the dataset
print(f"Data types of columns in the dataset: {df.dtypes}")

# Display the first 5 rows of the dataset
print(df.head())

# Display the last 5 rows of the dataset
print(df.tail())

# Display 20 random rows of the dataset
print(df.sample(20))

# Display the number of null/NaN values in each column
print(f"\n Null/NAN values:", df.isnull().sum())

# Display rows with 0 values in the Glucose column
print(f"\n Rows with 0 values in the Glucose column: \n", df.loc[df['Glucose'] == 0].head())

# Display the summary statistics of the dataset
print(f"\n Summary Statistics of the dataset: \n", df.describe())




Shape of the dataset: (2768, 10)
Data types of columns in the dataset: Id                            int64
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
   Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0   1            6      148             72             35        0  33.6   
1   2            1       85             66             29        0  26.6   
2   3            8      183             64              0        0  23.3   
3   4            1       89             66             23       94  28.1   
4   5            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1         

## Continued Data Exploration
From the initial exploration, we can see that there are no NaN/null values in the dataset. However, some columns contain 0 values, which should not be possible for these measurements. The cell below shows examples of such rows for each affected column.
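
Before looking at example rows, it can help to count how many zeros each suspect column contains. The sketch below does this on a small made-up DataFrame; the `toy` data and the `suspect_columns` name are assumptions for illustration, not values from the real dataset.

```python
import pandas as pd

# Toy stand-in for the diabetes data (values are made up for illustration)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 8, 0],  # a 0 here is a valid value, so it is not checked
})

# Hypothetical list of columns where a 0 is not a plausible measurement
suspect_columns = ["Glucose", "BloodPressure", "BMI"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column
zero_counts = (toy[suspect_columns] == 0).sum()
print(zero_counts)
```

The same expression could be applied to `df` with the real column names to see how widespread the zeros are before deciding how to treat them.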

In [54]:
# Display rows with 0 values in the Glucose column
print(f"\n Examples of Rows with 0 values in the Glucose column: \n", df.loc[df['Glucose'] == 0].head())
# Display rows with 0 values in the BloodPressure column
print(f"\n Examples of Rows with 0 values in the BloodPressure column: \n", df.loc[df['BloodPressure'] == 0].head())
# Display rows with 0 values in the SkinThickness column
print(f"\n Examples of Rows with 0 values in the SkinThickness column: \n", df.loc[df['SkinThickness'] == 0].head())
# Display rows with 0 values in the Insulin column
print(f"\n Examples of Rows with 0 values in the Insulin column: \n", df.loc[df['Insulin'] == 0].head())
# Display rows with 0 values in the BMI column
print(f"\n Examples of Rows with 0 values in the BMI column: \n", df.loc[df['BMI'] == 0].head())




 Examples of Rows with 0 values in the Glucose column: 
       Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
75    76            1        0             48             20        0  24.7   
182  183            1        0             74             20       23  27.7   
342  343            1        0             68             35        0  32.0   
349  350            5        0             80             32        0  41.0   
502  503            6        0             68             41        0  39.0   

     DiabetesPedigreeFunction  Age  Outcome  
75                      0.140   22        0  
182                     0.299   21        0  
342                     0.389   22        0  
349                     0.346   37        1  
502                     0.727   41        1  

 Examples of Rows with 0 values in the BloodPressure column: 
     Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
7    8           10      115              0       

## Data Preprocessing
From the first look at the dataset, we can see that the columns Glucose, BloodPressure, SkinThickness, Insulin, and BMI contain zero values, which are not possible for these measurements. We will replace these zeros with NaN and then check the proportion of missing values in each column, which will help us decide how best to handle them. This makes the data more useful for analysis and for any predictive models.

In [55]:

# Define the names of the columns to clean
columns_to_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Copy the DataFrame so the original stays available for validating this step
diabetes_data_cleaned = df.copy()
# Replace 0 with NaN (np.nan) in the selected columns
diabetes_data_cleaned[columns_to_clean] = diabetes_data_cleaned[columns_to_clean].replace(0, np.nan)

## Data Preprocessing Validation
This section compares the original DataFrame with the modified DataFrame from above. (Note: this is only to validate that the preprocessing worked as intended.)


In [56]:
# Display the first 5 rows of the original dataset
print("\n Original Dataset: \n", df.head())
# Display the first 5 rows of the cleaned dataset
print("\n Cleaned Dataset: \n", diabetes_data_cleaned.head())

# Display the last 5 rows of the original dataset
print("\n Original Dataset: \n", df.tail())
# Display the last 5 rows of the cleaned dataset
print("\n Cleaned Dataset: \n", diabetes_data_cleaned.tail())





 Original Dataset: 
    Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0   1            6      148             72             35        0  33.6   
1   2            1       85             66             29        0  26.6   
2   3            8      183             64              0        0  23.3   
3   4            1       89             66             23       94  28.1   
4   5            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

 Cleaned Dataset: 
    Id  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0   1            6    148.0           72.0           35.0      NaN  33.6   
1   2            1     85.0           66.0           29.0      NaN  26.6   
2   3   

As the output of the cell above shows, the cleaned dataset now has NaN values where the original dataset had 0 values. This is the desired outcome of the data preprocessing step.
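
Comparing the first and last rows by eye only checks a handful of values. A more systematic check, sketched here on made-up data, is to confirm that the number of zeros before cleaning equals the number of NaN values after cleaning, column by column. The `original`, `cleaned`, and `cols` names and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the original data (values are made up for illustration)
original = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 0, 1],  # not cleaned: 0 is a valid value here
})
cols = ["Glucose", "Insulin"]

# Same cleaning step as in the notebook: zeros become NaN in the chosen columns
cleaned = original.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

zeros_before = (original[cols] == 0).sum()  # zeros per column before cleaning
nans_after = cleaned[cols].isna().sum()     # NaN values per column after cleaning
zeros_after = (cleaned[cols] == 0).sum()    # should be 0 everywhere

print(pd.DataFrame({
    "zeros_before": zeros_before,
    "nans_after": nans_after,
    "zeros_after": zeros_after,
}))
```

If the cleaning worked, `zeros_before` and `nans_after` match in every row and `zeros_after` is 0 throughout, while columns outside `cols` are unchanged.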


In [57]:
# Validate the cleaning: look for rows with 0 in the Glucose column of the cleaned dataset (expected output is an empty DataFrame)
print(f"\n Examples of Rows with 0 values in the Glucose column in the cleaned dataset: \n", diabetes_data_cleaned.loc[diabetes_data_cleaned['Glucose'] == 0].head())


 Examples of Rows with 0 values in the Glucose column in the cleaned dataset: 
 Empty DataFrame
Columns: [Id, Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Index: []


## Data Preprocessing Continued 

In [58]:
# Display the proportion of missing values in the cleaned dataset as a percentage
print(f"\n Proportion of missing values in the cleaned dataset: \n", diabetes_data_cleaned.isnull().mean() * 100)
# Save the cleaned dataset to CSV for later use
diabetes_data_cleaned.to_csv("Datasets/Diabetes-Dataset-Cleaned.csv", index=False)




 Proportion of missing values in the cleaned dataset: 
 Id                           0.000000
Pregnancies                  0.000000
Glucose                      0.650289
BloodPressure                4.515896
SkinThickness               28.901734
Insulin                     48.049133
BMI                          1.408960
DiabetesPedigreeFunction     0.000000
Age                          0.000000
Outcome                      0.000000
dtype: float64


As the output above shows, the proportion of missing values is high in two columns: Insulin (about 48%) and SkinThickness (about 29%). The remaining cleaned columns are each missing less than 5% of their values.

### Handling Missing Values


In [59]:
# Handle missing Insulin values
# Display the number of missing values in the Insulin column
print(f"\n Number of missing values in the Insulin column before imputation: \n", diabetes_data_cleaned['Insulin'].isnull().sum())
# Impute the missing values in the Insulin column with the column mean
# (note: this chained inplace call triggers the pandas warning shown in the output)
diabetes_data_cleaned['Insulin'].fillna(diabetes_data_cleaned['Insulin'].mean(), inplace=True)
# Display the number of missing values in the Insulin column after imputation
print(f"\n Number of missing values in the Insulin column after imputation: \n", diabetes_data_cleaned['Insulin'].isnull().sum())

# Print some values to validate the imputation
print(f"\n Some values in the Insulin column after imputation: \n", diabetes_data_cleaned['Insulin'].head())




 Number of missing values in the Insulin column before imputation: 
 1330

 Number of missing values in the Insulin column after imputation: 
 0

 Some values in the Insulin column after imputation: 
 0    154.23783
1    154.23783
2    154.23783
3     94.00000
4    168.00000
Name: Insulin, dtype: float64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  diabetes_data_cleaned['Insulin'].fillna(diabetes_data_cleaned['Insulin'].mean(), inplace=True)
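
The warning above recommends assigning the result back instead of calling `fillna(..., inplace=True)` on a selected column. A minimal sketch of that form, using a made-up toy column (the `toy` DataFrame and its values are assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# Toy Insulin column with two missing values (values are made up for illustration)
toy = pd.DataFrame({"Insulin": [np.nan, 94.0, 168.0, np.nan]})

# mean() skips NaN by default, so this is the mean of the observed values only
insulin_mean = toy["Insulin"].mean()

# Assign the filled column back instead of using a chained inplace call
toy["Insulin"] = toy["Insulin"].fillna(insulin_mean)

# Equivalent alternative named in the warning, applied to the whole DataFrame:
# toy = toy.fillna({"Insulin": insulin_mean})

print(toy["Insulin"].isna().sum())
print(toy["Insulin"].tolist())
```

In the notebook, the same pattern would be `diabetes_data_cleaned['Insulin'] = diabetes_data_cleaned['Insulin'].fillna(diabetes_data_cleaned['Insulin'].mean())`, which should give the same imputed values without relying on behavior that the warning says will change in pandas 3.0.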
