
## **Import Libraries**
___

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import KNNImputer
from scipy.stats import kurtosis, skew, shapiro
from sklearn.preprocessing import MinMaxScaler, RobustScaler


from typing import Optional

## **Read Files**
___


In [None]:
# Set Pandas to display all columns
pd.set_option('display.max_columns', None)

# Store the Github repository and file name as a variable
url = "https://raw.githubusercontent.com/Just-Aymz/Diabetes/refs/heads/main/diabetes.csv?token=GHSAT0AAAAAAC45AXRMZK3NFFQV43CSV7NYZ36PPYQ"

# Store the dataframe as a variable
df = pd.read_csv(url)

## **Dataset Identification**
___

In [None]:
# Return the shape and the size of the dataset.
print(f"The shape of the dataset is: {df.shape}\nThe size of the dataset is: {df.size}")

The shape of the dataset is: (768, 9)
The size of the dataset is: 6912


In [None]:
# Return the axes of the dataset
df.axes

[RangeIndex(start=0, stop=768, step=1),
 Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
       dtype='object')]

In [None]:
# Return the summary of each feature in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
# Return a sample of the top 5 rows of the dataset
df.head()

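| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |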


In [None]:
# Return a sample of the bottom 5 rows of the dataset
df.tail()

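| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |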


# **Data Preprocessing**
___


### 1. **Data Cleaning**
Dataset preprocessing is the process of cleaning and transforming raw data into a format suitable for analysis or machine learning. It involves handling missing values, removing duplicates, correcting inconsistencies, scaling or normalizing features, encoding categorical variables, and selecting relevant features. The goal is to ensure that the data is accurate, consistent, and structured to improve the performance of analytical models and algorithms.



#### 1.1 **Duplicate Values**
In a dataframe, a record is a duplicate of another record if the values of all features are identical. A record can also be treated as a duplicate if only the values of a specified subset of features are identical. In this section, we identify and remove any duplicate records in the dataset.


In [None]:
dupes = df.duplicated().sum()
if dupes != 0:
  print(f"Total number of duplicate values: {dupes}")
else:
  print("There are no duplicate values in the dataset")

There are no duplicate values in the dataset
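The check above only covers full-row duplicates. As a minimal sketch of the subset-based definition mentioned earlier, the toy frame below uses a hypothetical `patient_id` column (the diabetes dataset has no such column) to show how `duplicated` and `drop_duplicates` behave when only some features are compared.

```python
import pandas as pd

# Hypothetical toy frame; `patient_id` is an illustrative column that the
# diabetes dataset does not contain.
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "Glucose": [148, 85, 85, 89],
    "Age": [50, 31, 32, 21],
})

# Full-row duplicates: every column must match an earlier row.
full_dupes = int(toy.duplicated().sum())

# Subset duplicates: only the listed columns must match an earlier row.
subset_dupes = int(toy.duplicated(subset=["patient_id", "Glucose"]).sum())

# Keep the first occurrence of each subset duplicate and drop the rest.
deduped = toy.drop_duplicates(subset=["patient_id", "Glucose"], keep="first")

print(f"Full-row duplicates: {full_dupes}")
print(f"Subset duplicates: {subset_dupes}")
print(f"Rows after dropping subset duplicates: {len(deduped)}")
```

Which columns define a duplicate is a modelling decision; for a dataset without an identifier column, the full-row check is usually the safer default.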


#### 1.2. **Null Values**
___

In [None]:
nulls = np.count_nonzero(df.isnull().values)
if nulls != 0:
  print(f"Total number of null values: {nulls}")
else:
  print("There are no null values in the dataset")

There are no null values in the dataset


#### 1.3 **Outlier Values**
___

In [None]:
def outlier_Count(column: str,
                  features_with_outliers: Optional[list]=None,
                  dataframe: pd.DataFrame=df,
                  store: bool=False) -> Optional[int]:
  """
  Count the outliers of a column using the 1.5 * IQR rule. If store is True,
  append the column name to features_with_outliers instead of returning the
  count.
  """
  # Store the q1 and q3 values
  q1 = dataframe[column].quantile(0.25)
  q3 = dataframe[column].quantile(0.75)

  # Store the IQR value
  IQR = q3 - q1

  # Flag the values beyond the upper and the lower limits
  upper = dataframe[column] > q3 + (IQR * 1.5)
  lower = dataframe[column] < q1 - (IQR * 1.5)

  # Store the outlier values
  outliers = dataframe[column][upper | lower]

  if not store:
    return len(outliers)
  if len(outliers) > 0 and features_with_outliers is not None:
    features_with_outliers.append(column)

for feature in df.drop(columns=['Outcome']):
  if ((outliers := outlier_Count(feature)) > 0):
    print(f"{feature}:\nTotal Outliers: {outliers}\n")


Pregnancies:
Total Outliers: 4

Glucose:
Total Outliers: 5

BloodPressure:
Total Outliers: 45

SkinThickness:
Total Outliers: 1

Insulin:
Total Outliers: 34

BMI:
Total Outliers: 19

DiabetesPedigreeFunction:
Total Outliers: 29

Age:
Total Outliers: 9
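
The counts above come from the 1.5 × IQR rule. As a small worked sketch on made-up numbers (not taken from the dataset), the fences can be computed step by step like this:

```python
import pandas as pd

# Illustrative values only; 60 is deliberately far from the rest.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers.tolist()}")
```

The rule is a heuristic: for strongly skewed features such as Insulin, many flagged values may be genuine measurements rather than errors.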



#### 1.4 **Miscellaneous Errors**
___
These are errors in the data values of individual records, such as impossible measurements or features stored with the wrong datatype. In this section we correct these errors so that a clean dataset is fed to the machine learning models.


##### 1.4.1. **Numeric Features**

In [None]:
# Return a summary of the descriptive statistics of the dataset.
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


The minimum values of Glucose, BloodPressure, SkinThickness, Insulin and BMI are all 0, which is not physiologically possible. These zeros are therefore most likely placeholders for measurements that were never recorded, so we replace them with NaN values and treat them as missing data.


In [None]:
# Store the features that need their 0 values replaced with NaN values
error_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace the 0 values with NaN in those features only
df[error_features] = df[error_features].replace(0, np.nan)

In [None]:
# Return the total number of null values in the dataframe
new_nulls = np.count_nonzero(df.isnull().values)
print(f"Total number of null values: {new_nulls}")

Total number of null values: 652


In [None]:
# Return the total number of null values per feature of the dataset
df.isnull().sum()

| Feature | Null count |
|---|---|
| Pregnancies | 0 |
| Glucose | 5 |
| BloodPressure | 35 |
| SkinThickness | 227 |
| Insulin | 374 |
| BMI | 11 |
| DiabetesPedigreeFunction | 0 |
| Age | 0 |
| Outcome | 0 |
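
Raw counts are easier to judge as a share of the 768 rows: from the figures above, Insulin is missing in 374 / 768 ≈ 48.7% of records and SkinThickness in 227 / 768 ≈ 29.6%. A minimal sketch of that calculation, shown on a made-up toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame with illustrative values only
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, 168.0],
    "Age": [50, 31, 32, 21],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Applied to `df`, the same expression would rank the features by how much of their data has to be imputed.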


In [None]:
def statistics(column: str,
               dataframe: pd.DataFrame=df,
               dct: bool=False) -> Optional[dict]:
    """
    Print the skew, the excess kurtosis and the result of the Shapiro-Wilk
    test for one feature. These values help evaluate the distribution of
    the feature. Null values should be dropped from the dataframe before
    calling this function.

    Args:
        column:
            - Name of the feature to evaluate.
        dataframe:
            - A pandas dataframe object.
        dct:
            - If True, return the results as a dictionary.

    Returns:
        A dictionary of the results if dct is True, otherwise None.
    """
    skew_and_distribution = {}
    _skew = skew(dataframe[column])
    _kurtosis = kurtosis(dataframe[column], fisher=True)
    print(
        f"\n{column}\n"
        f"skew: {_skew:.4f}\n"
        f"kurtosis: {_kurtosis:.4f}"
    )

    # Perform Shapiro-Wilk test
    stat, p_value = shapiro(dataframe[column])
    skew_and_distribution.update({'test_score': f'{stat:.4f}',
                                  'p_value': f'{p_value:.4f}',
                                  'feature': column})

    # Print the results
    print(f"Shapiro-Wilk Normality test: {stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    # Interpret the p-value
    alpha = 0.05
    if p_value > alpha:
        print("The data is likely normally distributed (fail to reject H0).")
        skew_and_distribution.update({'hypothesis_result': 'fail to reject H0'})
    else:
        print("The data is not normally distributed (reject H0).")
        skew_and_distribution.update({'hypothesis_result': 'reject H0'})

    # Check the absolute values of each skew value of a feature
    if np.abs(_skew) < 0.5:
        print('distribution is almost symmetrical')
        skew_and_distribution.update({'skew': 'almost symmetrical'})
    elif 0.5 <= np.abs(_skew) <= 1:
        print('distribution is moderately skewed')
        skew_and_distribution.update({'skew': 'moderately skewed'})
    else:
        print('distribution is highly skewed')
        skew_and_distribution.update({'skew': 'highly skewed'})

    # Extremity of tail distribution
    if _kurtosis > 0:
        print(
            'Leptokurtic distribution - heavier tails and a sharper '
            'peak than the normal distribution.\n'
            'This type of distribution is often associated with higher '
            'peakedness and a greater probability of extreme values.\n'
        )
    elif _kurtosis < 0:
        print(
            'Platykurtic distribution - lighter tails and a flatter '
            'peak than the normal distribution.\n'
            'This type of distribution is often associated with less '
            'peakedness and a lower probability of extreme values.\n'
        )
    else:
        print(
            'Mesokurtic distribution - similar peak and tail shape as the '
            'normal distribution.\n'
        )

    # Return dictionary
    if dct:
        return skew_and_distribution

# Store a dataset with no null values so that nulls do not affect the
# Shapiro-Wilk Normality Test.
non_null_df = df.dropna()
skew_and_distribution = []
for column in non_null_df.drop(columns=['Outcome']):
  feature_stats = statistics(column, non_null_df, dct=True)
  skew_and_distribution.append(feature_stats)


Pregnancies
skew: 1.3305
kurtosis: 1.4522
Shapiro-Wilk Normality test: 0.8530
P-value: 0.0000
The data is not normally distributed (reject H0).
distribution is highly skewed
Leptokurtic distribution - heavier tails and a sharper peak than the normal distribution.
This type of distribution is often associated with higher peakedness and a greater probability of extreme values.


Glucose
skew: 0.5159
kurtosis: -0.4924
Shapiro-Wilk Normality test: 0.9642
P-value: 0.0000
The data is not normally distributed (reject H0).
distribution is moderately skewed
Platykurtic distribution - lighter tails and a flatter peak than the normal distribution.
This type of distribution is often associated with less peakedness and a lower probability of extreme values.


BloodPressure
skew: -0.0872
kurtosis: 0.7700
Shapiro-Wilk Normality test: 0.9899
P-value: 0.0087
The data is not normally distributed (reject H0).
distribution is almost symmetrical
Leptokurtic distribution - heavier tails and a sharper peak 

In [None]:
pd.DataFrame(skew_and_distribution)

| | test_score | p_value | feature | hypothesis_result | skew |
|---|---|---|---|---|---|
| 0 | 0.853 | 0.0 | Pregnancies | reject H0 | highly skewed |
| 1 | 0.9642 | 0.0 | Glucose | reject H0 | moderately skewed |
| 2 | 0.9899 | 0.0087 | BloodPressure | reject H0 | almost symmetrical |
| 3 | 0.9876 | 0.002 | SkinThickness | reject H0 | almost symmetrical |
| 4 | 0.804 | 0.0 | Insulin | reject H0 | highly skewed |
| 5 | 0.9738 | 0.0 | BMI | reject H0 | moderately skewed |
| 6 | 0.8486 | 0.0 | DiabetesPedigreeFunction | reject H0 | highly skewed |
| 7 | 0.8383 | 0.0 | Age | reject H0 | highly skewed |


**Feature skew and distribution summary**:

| Feature | Skew | Shapiro-Wilk result (H0: data is normally distributed) |
|---------|------|--------------|
| Pregnancies | highly skewed | Reject H0 |
| Glucose | moderately skewed | Reject H0 |
| BloodPressure | almost symmetrical | Reject H0 |
| SkinThickness | almost symmetrical | Reject H0 |
| Insulin | highly skewed | Reject H0 |
| BMI | moderately skewed | Reject H0 |
| DiabetesPedigreeFunction | highly skewed | Reject H0 |
| Age | highly skewed | Reject H0 |
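
The statistics above are computed on `df.dropna()`, which removes every row that has a null in any feature. One possible alternative, sketched here on synthetic data, is to drop nulls per feature so that each test uses all available values for that feature. `describe_shape` is a hypothetical helper written for this illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, shapiro, skew


def describe_shape(series: pd.Series, alpha: float = 0.05) -> dict:
    """Hypothetical helper: shape statistics for one feature, nulls dropped."""
    clean = series.dropna()
    stat, p_value = shapiro(clean)
    return {
        "feature": series.name,
        "n": int(clean.size),
        "skew": float(skew(clean)),
        "excess_kurtosis": float(kurtosis(clean, fisher=True)),
        "shapiro_stat": float(stat),
        "p_value": float(p_value),
        # Same decision rule as the notebook: reject H0 when p <= alpha
        "reject_normality": bool(p_value <= alpha),
    }


# Synthetic data: one clearly right-skewed feature, one drawn from a normal
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.exponential(scale=1.0, size=200),
    "roughly_normal": rng.normal(loc=0.0, scale=1.0, size=200),
})
# Simulate missing entries in one feature only (rows 0 to 19)
toy.loc[:19, "right_skewed"] = np.nan

summary = pd.DataFrame([describe_shape(toy[col]) for col in toy.columns])
print(summary)
```

With per-feature dropping, the sample size differs between features, so the `n` column is worth reporting alongside each p-value.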

The KNNImputer (K-Nearest Neighbors Imputer) estimates each missing value from the k nearest neighbors of the record in the feature space (k = 5 by default), using the mean of those neighbors' values for the feature.

Given that the features with missing values contain outliers and do not follow a normal distribution (the Shapiro-Wilk Normality Test rejects the null hypothesis for every feature), KNNImputer is a reasonable choice for imputing the missing data. It does not assume a specific data distribution, so it is less affected by skew and outliers than, for example, mean imputation. Because it is distance-based, features on larger scales weigh more heavily in the neighbor search unless the data is scaled first. Since the dataset is small (768 rows), the computational and memory cost of KNNImputer is manageable.
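
To make the mechanism concrete, here is a minimal sketch on a four-row toy frame with made-up values and hypothetical column names `a`, `b` and `c`. With `n_neighbors=2`, each missing entry becomes the mean of that feature over the two closest rows that have it observed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with two missing entries (illustrative values only)
toy = pd.DataFrame(
    [[1, 2, np.nan],
     [3, 4, 3],
     [np.nan, 6, 5],
     [8, 8, 7]],
    columns=["a", "b", "c"],
)

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so the column names and index
# are passed back in explicitly.
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns,
                       index=toy.index)
print(imputed)
```

In row 0, the two nearest rows with `c` observed are rows 1 and 2, so `c` becomes (3 + 5) / 2 = 4. In row 2, the nearest rows with `a` observed are rows 1 and 3, so `a` becomes (3 + 8) / 2 = 5.5.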

In [None]:
# Instantiate a KNNImputer object (default n_neighbors=5)
imputer = KNNImputer()

# Impute the missing values, keeping the original column names and index
imputed_df = pd.DataFrame(imputer.fit_transform(df),
                          columns=df.columns,
                          index=df.index)
imputed_df

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.0 | 33.6 | 0.627 | 50.0 | 1.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 58.6 | 26.6 | 0.351 | 31.0 | 0.0 |
| 2 | 8.0 | 183.0 | 64.0 | 25.8 | 164.6 | 23.3 | 0.672 | 32.0 | 1.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 | 0.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10.0 | 101.0 | 76.0 | 48.0 | 180.0 | 32.9 | 0.171 | 63.0 | 0.0 |
| 764 | 2.0 | 122.0 | 70.0 | 27.0 | 165.0 | 36.8 | 0.340 | 27.0 | 0.0 |
| 765 | 5.0 | 121.0 | 72.0 | 23.0 | 112.0 | 26.2 | 0.245 | 30.0 | 0.0 |
| 766 | 1.0 | 126.0 | 60.0 | 35.2 | 134.2 | 30.1 | 0.349 | 47.0 | 1.0 |


### 2. **Data Transformation**
Data transformation is the process of converting data from one format or structure into another. The goal is to make the dataset more suitable for analysis or machine learning models by improving its structure, scaling, and features.


In [None]:
# Store the features with outlier values
outlier_features = []
for feature in df:
  outlier_Count(column=feature, features_with_outliers=outlier_features, store=True)

outlier_features

['Pregnancies',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [None]:
def feature_transformation(feature: str,
                           dataframe: pd.DataFrame=df,
                           columns: Optional[list]=outlier_features) -> pd.DataFrame:
  """
  Scale one feature of the dataframe in place: RobustScaler for features
  with outliers, MinMaxScaler for BMI.
  """
  if feature in columns:
    # Instantiate RobustScaler object
    scaler = RobustScaler()
    # Fit and transform the feature
    dataframe[feature] = scaler.fit_transform(dataframe[[feature]]).ravel()
  elif feature == 'BMI':
    # Only reached when 'BMI' is not listed in `columns`
    # Instantiate MinMaxScaler object
    scaler = MinMaxScaler()
    # Fit and transform the feature
    dataframe[feature] = scaler.fit_transform(dataframe[[feature]]).ravel()
  # The unfinished `elif feature in []` branch was removed so the cell runs
  return dataframe
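
The function above chooses between RobustScaler and MinMaxScaler. As a small sketch on made-up values (not from the dataset), the difference shows up as soon as a single extreme value is present: RobustScaler centers on the median and divides by the IQR, while MinMaxScaler maps the minimum to 0 and the maximum to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One feature with a single extreme value (illustrative numbers only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: (x - median) / IQR, here median = 3 and IQR = 4 - 2 = 2
robust = RobustScaler().fit_transform(x).ravel()

# MinMaxScaler: (x - min) / (max - min), here min = 1 and max = 100
minmax = MinMaxScaler().fit_transform(x).ravel()

print("RobustScaler:", robust)
print("MinMaxScaler:", np.round(minmax, 4))
```

With MinMaxScaler the four ordinary values are squeezed into the bottom few percent of the [0, 1] range, whereas RobustScaler keeps them evenly spaced around 0 and leaves the extreme value far out. That is the usual argument for preferring RobustScaler on features with outliers.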