<a href="https://colab.research.google.com/github/Jeesoo-Jhun/DS-NTL-091624/blob/main/%5BFIS_DS%5D_CHALLENGE_Guided_MLR_into_Pima_Indian_Diabetic_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
---
---

# 🧰 **CHALLENGE ASSIGNMENT** 🧰

## **A machine-learning-driven sample investigation into diabetic predictions on individuals of Pima Indian descent.**

In this challenge, you'll perform an end-to-end data analysis and machine learning summary assessment with a new dataset that's well-known in the predictive analytics world: the **[Pima Indians Diabetes](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)** dataset.

For the sake of this investigation, the dataset will be referred to programmatically via the filename **`diabetes.csv`**.

---
---

As this is an end-to-end machine learning analysis, we'll start with enough data cleaning, investigation, and summarization to get a good sense of how to construct effective models.

In [None]:
# System and Operational Importations
import sys, os

# Standard Data Science/Analysis Toolkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt; plt.style.use("ggplot")
import seaborn as sns

# Machine Learning Tools, Utilities, and Scoring Metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

# Suite of Machine Learning Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression

# Suppress Warnings (e.g. Version and Deprecation Notices)
import warnings
warnings.filterwarnings("ignore")

For starters, let's get access to our dataset.

At this point, the dataset should be externally uploaded to your current local session using the relevant toolbar to the left-hand-side of this window.

Assuming that step is complete, you can run the following cell to import our data.

In [None]:
PATH = "diabetes.csv"

dataset = pd.read_csv(PATH)

Once you've gotten access to your dataset, you can investigate general information on its features and overall shape.

In [None]:
dataset.info()

Additionally, you can use a tried-and-tested method for understanding a sample of the data: **`.head()`**.

In [None]:
dataset.head(3)

You can also get some basic descriptive statistics across the dataset using the **`.describe()`** method.

In [None]:
dataset.describe()

To progress further into the investigation, null value occurrences across our data need to be cleaned and imputed.

For this data, however, as you probably recall from our **`.info()`** call up above, there don't appear to be any occurrences of **`np.nan`** values.

Does this mean that there truly are no null values at all?

Well, not exactly.

In this dataset, several features contain zero values that don't make logical sense.

For instance, assuming **`BloodPressure`** measures a patient's blood pressure (which I'd say is a more-than-fair assumption), a blood pressure of zero (0) makes no sense for this type of dataset.

The most likely explanation is that the measurement was not logged during data entry and that the missing value was pre-imputed with zero.

We can see this pattern emerge across five features in particular:
- _Patient Glucose Level_ (**`Glucose`**)
- _Patient Blood Pressure_ (**`BloodPressure`**)
- _Patient Skin Thickness_ (**`SkinThickness`**)
- _Patient Insulin Level_ (**`Insulin`**)
- _Patient Body Mass Index_ (**`BMI`**)

These features are alike in that a zero value makes no sense in the data's domain: a living patient would not plausibly record zero for any of these measurements. Zeros in these features can therefore reasonably be interpreted as missing values that were recorded as zero.

As such, you'll need to analyze zero-value occurrences in these five features and convert them _back_ into true null values (**`np.nan`**) before moving ahead with appropriate null value imputation.

For starters, convert the zero-value occurrences in the five zero-invalid features to **`np.nan`**, then count the null values in each feature across the dataset.

In [None]:
INVALID_NULL_FEATURES = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Convert the zero placeholders in the zero-invalid features into true nulls.
dataset[INVALID_NULL_FEATURES] = dataset[INVALID_NULL_FEATURES].replace(0, np.nan)

# Count the nulls in every feature.
print(dataset.isnull().sum())

There appear to be far more null values across this dataset than originally detected!

At least they're in the correct format now.
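
The conversion above can be sketched on a small example. This is a minimal illustration, assuming a made-up frame named `toy` with hypothetical values (not rows from `diabetes.csv`); it shows that only the listed columns are touched, so a legitimate zero such as a zero pregnancy count survives.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for diabetes.csv (values are made up).
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],      # zero is a legitimate value here
    "Glucose":     [148, 0, 183, 89],
    "Insulin":     [0, 0, 94, 168],
})
zero_invalid = ["Glucose", "Insulin"]

# Count zero placeholders before converting them.
zero_counts = (toy[zero_invalid] == 0).sum()

# Swap the placeholders for true nulls, only in the zero-invalid columns.
toy[zero_invalid] = toy[zero_invalid].replace(0, np.nan)
null_counts = toy.isnull().sum()

print(zero_counts)
print(null_counts)
```

The per-column null counts after the conversion match the zero counts taken before it, which is a quick way to confirm that nothing else was altered.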

Before beginning to truly impute them, go ahead and run the following cell to produce an effective data visualization showcasing the distributions of each major feature in the dataset.

In [None]:
dataset.hist(rwidth=0.95, figsize=(12, 12))

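Because the zero-value placeholders in the zero-invalid features have already been converted to **`np.nan`**, and **`.hist()`** leaves null values out, these histograms show only the values that were actually recorded (a legitimate zero, such as in **`Pregnancies`**, still appears towards the left-hand side).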

Keep this in mind as you perform some advanced data imputation for the aforementioned features in an attempt to sanitize the data.

Specifically, you'll impute each feature based on whether its recorded values are approximately normally distributed.

---

> ### 📌 **OBJECTIVE: Distribution-Based Null Feature Imputation** 📌
>
> Generally in previous courses and work, you've handled null values by simply removing them as they occur or, at most, replacing them with manually assigned "dummy values" (e.g. -1, 0, "NULL").
>
> However, null value imputation is a fairly comprehensive subdomain of data analysis and can necessitate more advanced methods as needed to retain as much signal as possible, especially when performing more advanced tasks overall such as those in the context of machine learning.
>
> 🔍 _For this objective, your task is to perform mean-based and median-based null value imputation on the aforementioned five zero-invalid features._ 🔎
>
> #### **The major idea to understand is that data that is _normally distributed_ can be imputed with that feature's _mean_, while data that is _non-normally distributed_ or substantially skewed can be imputed with that feature's _median_.**
>
> In other words, while much of the actual null value imputation code is pre-written for you - you'll have to consider carefully which of the five zero-invalid features are normally distributed vs. non-normally distributed/skewed in order to discern which features should be imputed by mean or by median.
>
> Consider your choices carefully and feel free to experiment with different combinations and selections: your choices of how to impute specific features may very well have an impact on the final performances of your models!

---
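
To see why the shape of a distribution matters for this choice, here is a minimal sketch on made-up numbers. The series `symmetric` and `skewed` are hypothetical and unrelated to the dataset's columns; the point is only that a single extreme value drags the mean far more than the median.

```python
import numpy as np
import pandas as pd

# Made-up values: one roughly symmetric feature, one right-skewed feature.
symmetric = pd.Series([68.0, 70.0, 72.0, 74.0, 76.0, np.nan])
skewed = pd.Series([10.0, 12.0, 14.0, 16.0, 400.0, np.nan])

for label, feature in [("symmetric", symmetric), ("skewed", skewed)]:
    print(label, "skew:", round(feature.skew(), 2),
          "mean:", feature.mean(), "median:", feature.median())

# Mean imputation suits the symmetric feature; the median resists the outlier.
symmetric_filled = symmetric.fillna(symmetric.mean())
skewed_filled = skewed.fillna(skewed.median())
```

Calling `.skew()` on each zero-invalid feature, alongside reading the histograms, is one reasonable way to inform your own split between the two lists.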

In [None]:
INVALID_NULL_FEATURES_NORMAL =      [""" EDIT ME! """]
INVALID_NULL_FEATURES_NONNORMAL =   [""" EDIT ME! """]

# Normally distributed features: fill nulls with the feature's mean.
for feature_normal in INVALID_NULL_FEATURES_NORMAL:
    dataset[feature_normal] = dataset[feature_normal].fillna(dataset[feature_normal].mean())

# Skewed or non-normal features: fill nulls with the feature's median.
for feature_nonnormal in INVALID_NULL_FEATURES_NONNORMAL:
    dataset[feature_nonnormal] = dataset[feature_nonnormal].fillna(dataset[feature_nonnormal].median())

# Every count should now be zero.
print(dataset.isnull().sum())

Once your imputation task is completed, you can visualize the resultant distributions by running the code below.

You'll notice it's a replica of the original feature-wise data distribution visualization code -- the difference is that now you should be able to see slight alterations in the zero-invalid features now that they've been appropriately imputed with mean-based or median-based replacement methods.

In [None]:
dataset.hist(rwidth=0.95, figsize=(12, 12))

As a final sanity check for this analysis, you can also plot the features pairwise, colored by the target, to see how well each feature separates the two outcomes.

In this case, the dataset's target feature is measured as **`Outcome`**, so the relevant feature-wise plot can be constructed as such.

In [None]:
sns.pairplot(dataset, hue="Outcome", palette="husl")

And lastly, you can produce a heatmap to measure correlational values between each pair of features.

This is extremely useful to build an intuition for basic feature selection or reduction, as well as to gain a broad understanding of where signal is distributed across the data.

In [None]:
plt.figure(figsize=(12, 10))

sns.heatmap(dataset.corr(), annot=True, cmap="RdYlGn")
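
A heatmap can be hard to read at a glance, so it sometimes helps to rank features by the strength of their correlation with the target. The sketch below assumes synthetic data: the columns `signal` and `noise` are invented for illustration, and only the `Outcome` name mirrors the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: "signal" drives the target, "noise" does not.
signal = rng.normal(size=n)
noise = rng.normal(size=n)
target = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
toy = pd.DataFrame({"signal": signal, "noise": noise, "Outcome": target})

# Absolute correlation of every feature with the target, strongest first.
ranking = (toy.corr()["Outcome"]
           .drop("Outcome")
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

Keep in mind that this kind of ranking only captures linear relationships, so a low value does not prove that a feature is useless.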

Now that the preliminary data analysis and visualization are complete, you're ready to move on to machine learning and finally developing your algorithmic sample platter!

---
---

For starters, as with every machine learning analysis, you start by identifying the relevant target variable across the dataset and isolating it in order to produce **`X`** and **`y`** data segments.

In [None]:
TARGET = "Outcome"

# X keeps every feature column; y is a one-dimensional Series of labels.
X, y = dataset.drop(columns=TARGET), dataset[TARGET]

Once the data has been segmented, generally it's acceptable to move directly on to further segmentation into training and testing splits.

However, it's good practice when conducting machine learning to scale and normalize the data prior to predictive modeling -- especially when working with some types of classifiers such as ones that use distance-based metrics (e.g. KNN, SVM).

As such, you'll make use of the **`StandardScaler()`** class to scale the data prior to further segmentation and model fitting.

To start, you'll instantiate the scaler.

In [None]:
scaler = StandardScaler()

Now that the scaler is instantiated, you can fit it to the independent data (**`X`**) and use it to transform that data.

This can be done in two explicit lines of code (one calling **`.fit()`** and one calling **`.transform()`**), but for brevity it's recommended to do both in a single call to **`.fit_transform()`**.
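
The equivalence of the two approaches can be checked on a tiny example. This is a minimal sketch, assuming a made-up matrix `X_toy` rather than the notebook's `X`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up 2-feature matrix used only for illustration.
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Two explicit steps: learn the statistics, then apply them.
two_step = StandardScaler()
two_step.fit(X_toy)
X_two_step = two_step.transform(X_toy)

# One combined call.
X_one_step = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_two_step, X_one_step))
print(X_one_step.mean(axis=0), X_one_step.std(axis=0))
```

The separate `.fit()` and `.transform()` calls still matter whenever the statistics learned on one set of rows need to be reused on another.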

In [None]:
X_scaled = scaler.fit_transform(X)

Notice that the last cell produces a new copy of the data that retains scaled changes, referred to as **`X_scaled`**.

This is good practice when performing post-analysis dataset alterations: to persist major changes across copies of the originally processed data such that it's easier to track how said changes impact modeling performance.

It's important to update references to the independent data in later steps of the analysis so that they point to the newly scaled data rather than the original copy.

As such, the subsequent training/testing segmentation call will reference the newer copy **`X_scaled`** as opposed to the original.

In [None]:
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(X_scaled, y,
                                                                  train_size=0.7,
                                                                  test_size=0.3,
                                                                  random_state=42)

For this specific assessment, a 70/30 split for training and testing data is used for simplicity's sake -- however, this specific split is a parameter that can be manipulated for performance optimization.
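
One variation worth experimenting with, offered as a sketch rather than a replacement for the workflow above, is to split first and let a pipeline fit the scaler on the training rows only, so that statistics from the test rows never influence the scaled training features. The example assumes synthetic data from `make_classification` as a stand-in for the diabetes features, and the choice of `KNeighborsClassifier` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes features (8 columns, binary target).
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, random_state=42)

# Split first, stratifying so both parts keep the same class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when scoring the held-out rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 3))
```

The `stratify` argument is a second knob alongside `train_size` and `test_size`; it is optional, and whether it helps depends on how imbalanced the target is.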

---

> ### 📌 **OBJECTIVE: Alternate Scaling Methods** 📌
>
> The object that's been in use so far for scaling data (**`StandardScaler()`**) is a powerful one, but it's only one tool in a competent data scientist's arsenal.
>
> Its name describes what it does: it applies _standardization_ (also called z-score scaling) to each feature, shifting and rescaling the values so that the feature's mean becomes zero (0) and its standard deviation becomes one (1), while leaving the shape of its distribution unchanged.
>
> In this way, all features in a dataset are brought onto a comparable scale, so that no feature dominates purely because of its units, without sacrificing the nuance and relative variation within each feature.
>
> Another popular scaling method exists, however, that can sometimes perform better than the standard scaler, particularly when it helps to have every feature bounded within the same fixed range.
>
> #### **That object is the `MinMaxScaler()`, and its name describes it too: it rescales each feature based on its minimum and maximum values (to the range 0 to 1 by default) without changing the general shape of its distribution.**
>
> 🔍 _Your task is to perform a second scaling step across the original data by utilizing **`MinMaxScaler()`** instead of `StandardScaler()` to fit and transform the data._ 🔎
>
> Remember to name your transformed `X` data relevantly upon performing this alternate scaling method!

---
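
For reference, the two transformations described in the objective above can be written per feature as follows. The symbols are notation introduced here for illustration: $x$ is a raw value, $\mu$ and $\sigma$ are the feature's mean and standard deviation, and $x_{\min}$ and $x_{\max}$ are its smallest and largest values. The second formula assumes the scaler's default output range of 0 to 1.

$$
z = \frac{x - \mu}{\sigma}
\qquad\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

Both are linear rescalings, which is why neither changes the shape of a feature's distribution. The first centers the values on zero with unit spread, while the second pins the smallest value to 0 and the largest to 1.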

In [None]:
""" Complete OBJECTIVE: Alternate Scaling Methods in this cell! """

Now that our data is relevantly scaled and formatted for machine learning, it's time to set up the relevant predictive algorithms!

For this challenge, it's important to get used to working with multiple classifiers at once -- oftentimes in the context of real-world machine learning, one never relies on simply one or two algorithms all the time to complete their investigations.

Indeed, it's often more effective to leverage a suite of algorithms in order to identify the most performant ones as well as identify more interesting patterns and heuristics of the data based on how well specific algorithms fit to them.

In this specific challenge, five major algorithms are leveraged:

- **K-Nearest Neighbors** (`KNeighborsClassifier()`)
- **Support Vector Machine** (`SVC()`)
- **CART-Based Decision Tree** (`DecisionTreeClassifier()`)
- **Gaussian Naive Bayes** (`GaussianNB()`)
- **Logistic Regression** (`LogisticRegression()`)

In [None]:
models = {
    "KNN": {
        "Estimator": KNeighborsClassifier(),
        },
    "SVM": {
        "Estimator": SVC(),
        },
    "CART": {
        "Estimator": DecisionTreeClassifier(),
        },
    "NB": {
        "Estimator": GaussianNB(),
        },
    "LOGREG": {
        "Estimator": LogisticRegression(),
        }
}

for name in models:
    models[name]["All_Scores"] = list()
    models[name]["Top_Score"] = float()
    models[name]["Mean_Score"] = float()
    models[name]["Std_Score"] = float()

For this challenge, the **`models`** object will be used to reference not just the actual algorithmic data structures themselves, but also their performance metrics per each assessment walkthrough.

In other words, this data structure will represent your sample platter.

In order to capture the most reliable accuracy estimates per assessment, you can leverage stratified _K-Fold Cross-Validation_ (each fold keeps roughly the same class balance as the full training set) with enough folds to obtain a well-generalized performance score for each model.

For the sake of simplicity, ten (10) folds will be used.

Because of this, four values will be ascertained and saved from each simulation:

- **All Ten (10) Scores per Each Fold** (`models[name]["All_Scores"]`)
- **The Top Score from All Folds** (`models[name]["Top_Score"]`)
- **The Average Score from All Folds** (`models[name]["Mean_Score"]`)
- **The Standard Deviation of Scores Across All Folds** (`models[name]["Std_Score"]`)

By running the next cell, you should be able to populate the `models` data structure with relevant values for all folds of cross-validation and obtain the aforementioned four values.

In [None]:
NUM_FOLDS, SCORING = 10, "accuracy"

for name in models:
    folds = StratifiedKFold(n_splits=NUM_FOLDS)
    results = cross_val_score(estimator=models[name]["Estimator"],
                              X=X_train_scaled,
                              y=y_train,
                              cv=folds,
                              scoring=SCORING)
    models[name]["Top_Score"] = results.max()
    models[name]["Mean_Score"] = results.mean()
    models[name]["Std_Score"] = results.std()
    # Overwrite (rather than append) so re-running the cell doesn't duplicate scores.
    models[name]["All_Scores"] = results.tolist()
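
The loop above relies on `StratifiedKFold` to keep the class balance steady from fold to fold. A minimal sketch of that behavior follows, assuming made-up labels `y_toy` (70 negatives, 30 positives) and a placeholder feature matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 70 negatives and 30 positives.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant to the split

toy_folds = StratifiedKFold(n_splits=10)

# Share of positive labels inside each held-out fold.
positive_rates = [y_toy[test_idx].mean()
                  for _, test_idx in toy_folds.split(X_toy, y_toy)]
print(positive_rates)
```

Every held-out fold ends up with the same share of positives as the full label set, which keeps per-fold accuracy scores comparable with one another.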

Now that your model-tracking metrics data structure is effectively populated, you can go ahead and take a look at your best-performing algorithms using the cell below.

In [None]:
for name in models:
    print("\n[MODEL TYPE: {}]\n".format(name))
    print(">>>> Top Performance: \t\t{:.4f}".format(models[name]["Top_Score"]))
    print(">>>> Average Performance: \t{:.4f}".format(models[name]["Mean_Score"]))
    print(">>>> Spread of Performance: \t{:.4f}".format(models[name]["Std_Score"]))
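
If you'd rather compare the models side by side, a nested dictionary of this shape converts neatly into a table. The sketch below assumes a hypothetical dictionary `demo_models` filled with placeholder numbers; they are not results from this dataset.

```python
import pandas as pd

# Placeholder scores, NOT real results from this dataset.
demo_models = {
    "A": {"Top_Score": 0.80, "Mean_Score": 0.74, "Std_Score": 0.04},
    "B": {"Top_Score": 0.85, "Mean_Score": 0.77, "Std_Score": 0.05},
    "C": {"Top_Score": 0.78, "Mean_Score": 0.70, "Std_Score": 0.06},
}

# One row per model, ordered by mean cross-validated score.
summary = (pd.DataFrame.from_dict(demo_models, orient="index")
           .sort_values("Mean_Score", ascending=False))
print(summary)
```

Sorting by the mean score rather than the top score is a deliberate choice here, since a single lucky fold can inflate the maximum.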

---

> ### 📌 **OBJECTIVES: Picking and Fine-Tuning "The Best Dish"** 📌
>
> From the last cell, you should have obtained metrics for the five tested models.
>
> But your work isn't done!
>
> 🔍 _Now, you have the task of selecting the top-performing model from the prior five and performing some model optimization and evaluation techniques in order to try and get the accuracy scores as high as you can!_ 🔎
>
>
> Remember, there are several ways of doing this:
>
> - 1️⃣ **Hyperparameter Tuning for the Algorithm** 🔬
> - 2️⃣ **Implementing Dimensionality Reduction with PCA/LDA** 🔳
> - 3️⃣ **Additional Methods for Data Cleaning and Preprocessing** 🧽
> - 4️⃣ **Tweaking Values for Cross-Validation and Train/Test Splitting** 🍕
> - 5️⃣ **New Variants of the Predictive Classifiers** 📚
>
> The true character of a data scientist is not how they start: it's how they finish.
>
> 🔍 _As such, you're now required to perform **at least four (4)** of these five model improvement steps using the skills you've covered in previous seminars and tutorials._ 🔎
>
> For example, you can choose to perform **hyperparameter tuning**, **dimensionality reduction**, **tweaking values**, and **testing new classifier variants** in order to satisfy the remaining objectives in this notebook.
>
> #### **You're recommended to produce these four changes in a new machine learning analysis using cells below this text as opposed to altering the cells above -- this way, you can visually see the performance improvements and difference in accuracy in your top selected model.**
>
> This is a much more extensive and more exhaustive set-of-objectives than any other objective in the notebook and will represent the conclusion of this assignment -- spend as much time/effort on this component as you can to ensure successful improvement to your modeling accuracies.

---
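
As a starting point for the hyperparameter tuning option, here is a minimal sketch of a grid search. It assumes synthetic data from `make_classification`, and both the choice of `KNeighborsClassifier` and the grid values are arbitrary illustrations, not a recommendation for which model or settings will work best on the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; swap in your own training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Illustrative grid only; the values are arbitrary starting points.
param_grid = {"knn__n_neighbors": [3, 5, 7, 9],
              "knn__weights": ["uniform", "distance"]}

# Try every combination with 5-fold stratified cross-validation.
search = GridSearchCV(pipe, param_grid,
                      cv=StratifiedKFold(n_splits=5),
                      scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

The `step__parameter` naming in `param_grid` is how a grid search reaches into a pipeline step, and the same pattern applies to whichever estimator you select.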

In [None]:
"""
Complete OBJECTIVES: Picking and Fine-Tuning 'The Best Dish'
Using This Cell & Any Additional Cells Needed!
"""

Once you've completed the final objective and are satisfied with your progression in this notebook, consider this challenge completed!

Great work!

---
---
---