# D209 Data Mining I Performance Assessment - Task 2
### NVM2 — NVM2 Task 2: Predictive Analysis
#### Data Mining I — D209
#### PRFA — NVM2
> André Davis
> StudentID: 010630641
> MSDA
>
> Competencies
> 4030.06.1 : Classification Data Mining Models
>   The graduate applies observations to appropriate classes and categories using classification models.
>
> 4030.06.3 : Data Mining Model Performance
>   The graduate evaluates data mining model performance for precision, accuracy, and model comparison.

#### Table of Contents
<ul>
    <li><a href="#research-question">A1: Research Question</a></li>
    <li><a href="#objectives-and-goals">A2: Objectives and Goals of Analysis</a></li>
    <li><a href="#justification">B1: Justification of Classification Method</a></li>
    <li><a href="#classification-model">B2: Assumptions of a Classification Model</a></li>
    <li><a href="#packages-and-analysis-support">B3: List Packages &amp; Support of Analysis</a></li>
    <li><a href="#data-preparation-goals">C1: Data Preparation Goals and Necessary Manipulation</a></li>
    <li><a href="#variable-selection-and-identification">C2: Variable Selection &amp; Identification</a></li>
    <li><a href="#data-preparation">C3: Preparation of Data</a></li>
    <li><a href="#copy-of-prepared-data">C4: Copy of Prepared Data Set</a></li>
    <li><a href="#data-splitting-and-copying">D1: Data Splitting, Copy of Split Data</a></li>
    <li><a href="#analysis-description">D2: Analysis Description</a></li>
    <li><a href="#classification-analysis-code">D3: Classification Analysis Code</a></li>
    <li><a href="#accuracy-of-classification-model">E1: Accuracy of Classification Model</a></li>
    <li><a href="#model-results">E2: Model Results</a></li>
    <li><a href="#classification-limitations">E3: Classification Limitations</a></li>
    <li><a href="#recommended-action">E4: Recommended Action</a></li>
    <li><a href="#panopto-recording">F: Panopto Recording</a></li>
    <li><a href="#code-references">G: Code References</a></li>
    <li><a href="#source-references">H: Source References</a></li>
</ul>

<a id="research-question"></a>
# A1: Research Question

<a id="objectives-and-goals"></a>
# A2: Objectives and Goals of Analysis

<a id="justification"></a>
# B1: Justification of Classification Method

<a id="classification-model"></a>
# B2: Assumptions of a Classification Model

<a id="packages-and-analysis-support"></a>
# B3: List Packages & Support of Analysis

#### Analytic packages for Python:
 * **Data manipulation:**
    * [Pandas](https://pandas.pydata.org/docs/)
      - The pandas library is used for data manipulation, analysis, and cleaning. Its main data structures are the Series (1-dimensional) and the DataFrame (2-dimensional).
          - [Version 2](https://towardsdatascience.com/whats-new-in-pandas-2-0-5df366eb0197) adds an optional Apache Arrow (PyArrow) backend alongside NumPy; it does not replace NumPy.
    * [NumPy](https://numpy.org/doc/)
      - Library used for numerical computing. It provides multidimensional arrays, matrices, and high-level mathematical functions for fast, efficient computation.
 * **Statistics:**
    * [VIF - Variance Inflation Factor](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html)
        * *variance_inflation_factor* calculates the variance inflation factor (VIF) for one predictor column of a design matrix. The VIF measures how strongly that predictor is explained by the remaining predictors, so a high value signals multicollinearity.
 * **Displaying Data:**
    * [Seaborn](https://seaborn.pydata.org/)
        * Statistical data visualization library built on top of Matplotlib.
    * [Matplotlib](https://matplotlib.org/)
        * General-purpose plotting library; used here through its `pyplot` interface.
 * **Analysis:**
    * scikit-learn (imported as `sklearn`), a machine learning library:
        * [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
        * [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
        * [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
        * [f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html)
        * [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
        * [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
        * [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
        * [roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)
        * [roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
        * [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)
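
To make the VIF entry above concrete, here is a minimal sketch of the quantity it describes, written with NumPy only. The helper name `vif` and the synthetic columns `a`, `b`, and `c` are assumptions made for illustration and are not part of the assessment data. Each column is regressed on the others (with an intercept), and the VIF is `1 / (1 - R^2)`. As far as I can tell, statsmodels' `variance_inflation_factor` reports the same quantity when the design matrix includes a constant column, but treat that equivalence as an assumption rather than a guarantee.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X.

    Hypothetical helper for illustration: each column is regressed on the
    remaining columns plus an intercept, and VIF = 1 / (1 - R^2).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors


# Synthetic predictors: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this sketch the collinear pair `a` and `c` receives large factors while `b` stays close to 1, which is the pattern one would look for before deciding whether to drop a predictor.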

In [None]:
import textwrap
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report

# Load the data; the first CSV column is only a row index, so use it as the DataFrame index
medical_data = pd.read_csv('./Data/Medical/medical_clean.csv', index_col=0)
medical_data.info()
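
The scikit-learn tools imported above are typically combined in a split, select, tune, and evaluate sequence. The following is only a sketch of how they can fit together, not the analysis performed in this report: it uses synthetic data from `make_classification` instead of `medical_clean.csv`, and the `Pipeline`, `StandardScaler`, parameter grid, and `roc_auc` scoring choice are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data standing in for the real data set.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=4, random_state=42
)

# Hold out 20% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# KNN is distance based, so features are scaled before selection and fitting.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grid: number of kept features and number of neighbors.
param_grid = {"select__k": [4, 8], "knn__n_neighbors": [5, 11, 21]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out rows.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

print(search.best_params_)
print(cm)
print(round(auc, 3))
```

Wrapping the scaler and selector in the pipeline means they are refitted inside each cross-validation fold, so the held-out fold never influences feature scaling or selection.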

<a id="data-preparation-goals"></a>
# C1: Data Preparation Goals and Necessary Manipulation

<a id="variable-selection-and-identification"></a>
# C2: Variable Selection & Identification

<a id="data-preparation"></a>
# C3: Preparation of Data

<a id="copy-of-prepared-data"></a>
# C4: Copy of Prepared Data Set


<a id="data-splitting-and-copying"></a>
# D1: Data Splitting, Copy of Split Data

<a id="analysis-description"></a>
# D2: Analysis Description

<a id="classification-analysis-code"></a>
# D3: Classification Analysis Code

<a id="accuracy-of-classification-model"></a>
# E1: Accuracy of Classification Model

<a id="model-results"></a>
# E2: Model Results

<a id="classification-limitations"></a>
# E3: Classification Limitations

<a id="recommended-action"></a>
# E4: Recommended Action

<a id="panopto-recording"></a>
# F: Panopto Recording

Summary of Environments:
  * OS: Windows 11 + macOS Ventura (I work in both environments)
  * Language: Python
  * Environment: Jupyter Notebook through JetBrains DataSpell IDE (Cross-Platform)

[D209 - Panopto Recording]()

<a id="code-references"></a>
# G: Code References

<a id="source-references"></a>
# H: Source References

#### Citations
 * Bruce, P., Bruce, A., & Gedeck, P. (2019). Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media, Inc.