# Homework Assignment 5: Model Evaluation
As in the previous assignments, in this homework assignment you will continue your exploration of the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM), described in the paper found [here](https://doi.org/10.1038/s41597-020-0548-x).


This assignment will utilize a copy of the extracted feature dataset we have been working with. The dataset has been processed by performing outlier clipping, z-score and range scaling, and forward feature selection to select 20 features. We are now going to utilize more than one partition's worth of data, so for the z-score and range scaling, the mean, standard deviation, minimum, and maximum were calculated using data from both partitions so that a global scaling can be performed on each partition.

---

## Step 1: Downloading the Data

This assignment will continue to use [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) for training and will add [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) as a testing set.

---

For this assignment, the cleaning, transformation, and normalization of the data have been completed using both partitions to find the minimum, maximum, standard deviation, and mean values needed to perform these operations. Recall from lecture that we should not perform these operations on each partition individually but on the data as a whole, as these values may (and likely will) differ between partitions.

For example, suppose we perform simple range scaling on each partition individually and we see a range of 0 to 100 in one partition and 0 to 10 in another. After individual scaling, the values of 100 in the first would be mapped to 1, just like the values of 10 in the second. This can cause serious performance problems in your model, so I have made sure that the normalization was treated properly for you.
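
The scaling problem described above can be reproduced with a few made-up numbers. The sketch below is only an illustration: the arrays `part1` and `part2` and the helper `range_scale` are hypothetical and are not part of the provided dataset or code.

```python
import numpy as np

# Made-up values for one feature in two partitions (illustrative only).
part1 = np.array([0.0, 50.0, 100.0])
part2 = np.array([0.0, 5.0, 10.0])


def range_scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1] (hypothetical helper)."""
    return (values - lo) / (hi - lo)


# Per-partition scaling: each partition uses its own minimum and maximum.
local1 = range_scale(part1, part1.min(), part1.max())
local2 = range_scale(part2, part2.min(), part2.max())

# Global scaling: both partitions share one minimum and one maximum.
both = np.concatenate([part1, part2])
global1 = range_scale(part1, both.min(), both.max())
global2 = range_scale(part2, both.min(), both.max())

print(local1[-1], local2[-1])    # 100 and 10 both land on 1.0
print(global1[-1], global2[-1])  # 100 stays at 1.0, while 10 lands on 0.1
```

With per-partition scaling the two partitions are no longer comparable, which is why the provided files were scaled with statistics computed over both partitions.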

Below you will find the full partitions and `toy` sampled data from each partition, where only 20 samples from each of our 5 classes have been included in the data.  

#### Full
- [Full Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1ExtractedFeatures.csv)
- [Full Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2ExtractedFeatures.csv)

#### Toy
- [Toy Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition1ExtractedFeatures.csv)
- [Toy Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition2ExtractedFeatures.csv)

Now that you have the two files, you should load each into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. 

---

### Evaluation Metric

For each of the models we evaluate in this assignment, you will calculate the True Skill Statistic score using the test data from Partition 2 to determine which model performs best at classifying the positive flaring class.

    True skill statistic (TSS) = TPR + TNR - 1 = TPR - (1-TNR) = TPR - FPR

Where:

    True positive rate (TPR) = TP/(TP+FN) Also known as recall or sensitivity
    True negative rate (TNR) = TN/(TN+FP) Also known as specificity or selectivity
    False positive rate (FPR) = FP/(FP+TN) = (1-TNR) Also known as fall-out or false alarm rate


**Recall**

    True positive (TP)
    True negative (TN)
    False positive (FP)
    False negative (FN)
    
See [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for more information.
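
To make the formula concrete, here is a small worked example. The counts are hypothetical and were picked only to keep the arithmetic easy; they do not come from the SWAN-SF data.

```python
# Hypothetical confusion-matrix counts, chosen only to make the arithmetic easy.
TP, FN = 40, 10    # 50 actual positives
FP, TN = 30, 120   # 150 actual negatives

tpr = TP / (TP + FN)   # 40 / 50   = 0.8
tnr = TN / (TN + FP)   # 120 / 150 = 0.8
fpr = FP / (FP + TN)   # 30 / 150  = 0.2

tss = tpr - fpr        # same value as tpr + tnr - 1
print(round(tss, 3))   # 0.6
```

TSS ranges from -1 to 1. A classifier that labels every sample as positive has TPR = 1 and FPR = 1, so its TSS is 0, no matter how unbalanced the classes are.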

Below is a function implemented to provide your score for each model.

In [1]:
import os
import itertools
import pandas as pd
from pandas import DataFrame 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

In [2]:
def calc_tss(y_true=None, y_predict=None):
    """
    Calculates the true skill statistic (TSS) for binary classification from the
    entries of the confusion matrix (TN, FP, FN, TP).
    """
    scores = confusion_matrix(y_true, y_predict).ravel()
    TN, FP, FN, TP = scores
    print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    tp_rate = TP / float(TP + FN) if TP > 0 else 0  
    fp_rate = FP / float(FP + TN) if FP > 0 else 0
    
    return tp_rate - fp_rate

As in the previous assignment, we will be utilizing a binary classification of our 5 class dataset. So, below is the helper function to change our class labels from the 5 class target feature to the binary target feature. The function takes a dataframe (e.g. our `abt`) and prepares it for binary classification by merging the `X`- and `M`-class samples into one group and the rest (`NF`, `B`, and `C`) into another group, labeled with `1`s and `0`s, respectively.
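
As a quick illustration of the relabeling described above, the sketch below applies the same mapping to a handful of made-up rows. The frame `toy` and its `feat` column are placeholders, not real SWAN-SF columns.

```python
import pandas as pd

# Made-up rows; `feat` is a placeholder feature name, not a real SWAN-SF column.
toy = pd.DataFrame({
    "feat": [0.1, 0.2, 0.3, 0.4, 0.5],
    "lab": ["NF", "B", "C", "M", "X"],
})

# Same mapping the helper applies: M and X become 1, everything else becomes 0.
binary = toy["lab"].map({"NF": 0, "B": 0, "C": 0, "M": 1, "X": 1})
print(binary.tolist())  # [0, 0, 0, 1, 1]
```

Note that `Series.map` with a dictionary turns any label missing from the dictionary into `NaN`, so it is worth checking that your `lab` column only contains these five values.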

In [3]:
def dichotomize_X_y(data: pd.DataFrame):
    """
    Dichotomizes the dataset and splits it into the features (X) and the labels (y).
    
    :return: two np.ndarray objects X and y.
    """
    data_dich = data.copy()
    data_dich['lab'] = data_dich['lab'].map({'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1})
    y = data_dich['lab']
    X = data_dich.drop(['lab'], axis=1)
    return X.values, y.values

In [4]:
data_dir = 'C:/Users/Hyunki/anaconda3/envs/csc4780/HW5/'
data_file = "normalized_partition1ExtractedFeatures.csv"
data_file2 = "normalized_partition2ExtractedFeatures.csv"
# data_file = "toy_normalized_partition1ExtractedFeatures.csv"
# data_file2 = "toy_normalized_partition2ExtractedFeatures.csv"

In [5]:
abt = pd.read_csv(os.path.join(data_dir, data_file))
abt2 = pd.read_csv(os.path.join(data_dir, data_file2))

---
### Q1 (10 points)

Just like you did with the previous assignment, you will be utilizing a few different types of feature selection to find subsets of descriptive features to use in the models we will be evaluating. For this question you will again be utilizing the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). You will then be using 3 different feature evaluation functions.

-  [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)

- [scikit-learn mutual_info_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)

- [chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)

For each of these evaluation functions, you need to construct a 20 feature training and testing dataset. This will be done by:
<ol>
    <li>Use the `SelectKBest` class with each of the evaluation functions to perform feature selection using Partition 1 as your input data</li>
    <li>Construct a new train `DataFrame` for each instance of `SelectKBest` from 1 with the `lab` class labels using Partition 1</li>
    <li>Construct a new test `DataFrame` for each instance of `SelectKBest` from 1 with the `lab` class labels using Partition 2</li>
</ol>

After this question, you should have a total of 6 `DataFrame`s to use in later questions, a train and test pair for each feature selection method.

---

In [6]:
numFeat = 20
abt_cpy = abt.copy()


In [7]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
y1 = abt_cpy['lab']
X1 = abt_cpy.drop(['lab'], axis=1)

abt_cpy2 = abt2.copy()
y2 = abt_cpy2['lab']
X2 = abt_cpy2.drop(['lab'], axis=1)

selector1 = SelectKBest(f_classif, k=numFeat)
selector2 = SelectKBest(mutual_info_classif, k=numFeat)
selector3 = SelectKBest(chi2, k=numFeat)

list_df1, list_df2 = [], []
for idx, sel in enumerate([selector1, selector2, selector3]):
    sel.fit(X1, y1)
    list_df1.append(DataFrame(sel.transform(X1), columns=sel.get_feature_names_out(abt.iloc[:, 1:].columns)))
    list_df2.append(DataFrame(sel.transform(X2), columns=sel.get_feature_names_out(abt2.iloc[:, 1:].columns)))

    list_df1[idx]["lab"] = y1
    list_df2[idx]["lab"] = y2


df_f_class_train, df_mutual_train, df_chi2_train = tuple(list_df1)
df_f_class_test, df_mutual_test, df_chi2_test = tuple(list_df2)

df_f_class_train.head(5)

| |TOTUSJH_min|TOTUSJH_max|TOTUSJH_median|TOTUSJH_mean|TOTUSJH_linear_weighted_average|TOTUSJH_quadratic_weighted_average|TOTUSJH_last_value|ABSNJZH_max|ABSNJZH_stddev|ABSNJZH_dderivative_stddev|...|ABSNJZH_average_absolute_change|ABSNJZH_average_absolute_derivative_change|ABSNJZH_avg_mono_increase_slope|SAVNCPP_avg_mono_decrease_slope|SAVNCPP_avg_mono_increase_slope|SAVNCPP_dderivative_stddev|SAVNCPP_average_absolute_change|SAVNCPP_gderivative_stddev|SAVNCPP_average_absolute_derivative_change|lab|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.238758|0.27599|0.244674|0.249912|0.249364|0.250146|0.271143|0.129462|0.166475|0.143979|...|0.146038|0.138859|0.14719|0.91776|0.071006|0.080273|0.07632|0.11159|0.065199|NF|
|1|0.106759|0.123894|0.114227|0.114189|0.111154|0.109375|0.108247|0.08167|0.122735|0.094326|...|0.096518|0.091213|0.101681|0.973846|0.023172|0.025106|0.025291|0.031024|0.022525|NF|
|2|0.116361|0.141522|0.126501|0.128604|0.12515|0.123883|0.120471|0.092963|0.105429|0.107461|...|0.111193|0.106288|0.108718|0.966885|0.029407|0.029272|0.028676|0.035628|0.023937|NF|
|3|0.315587|0.328616|0.322563|0.321839|0.322904|0.322843|0.320618|0.170967|0.183616|0.142697|...|0.146319|0.132143|0.147689|0.940147|0.056448|0.060894|0.06122|0.073133|0.054571|NF|
|4|0.125745|0.140699|0.134145|0.134307|0.13353|0.132812|0.124584|0.117|0.103106|0.105344|...|0.109309|0.104402|0.104563|0.966184|0.029922|0.032664|0.034008|0.034723|0.031312|NF|


---
### Q2 (10 points)

Now that we have our training and testing datasets for each of our feature subsets, we need to attempt to perform hyperparameter tuning on our model for each of the datasets. We want to see which combination of dataset and parameter settings seem to provide the best results. 

In order to do this, we must first dichotomize the training and testing data. Lucky for you, a method has already been provided to do this. All you need to do is apply it to each of the `DataFrame`s you constructed in Q1.

With your binary classification dataset constructed, now it's time to start training and testing some models. We will start with the simple [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, for each of your three copies of the Partition 1 training datasets that have had their `lab` columns converted to a binary label, train 4 different instances with the following settings. **(see documentation to know what these are)** In total you will train and evaluate 12 pairings of model settings and feature-selected data.

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result. **NOTE: The model does take a little while to evaluate.**
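
The train, test, and score loop described above might look like the following sketch on synthetic data. Everything here is an assumption made for illustration (the helper `make_data`, the four random features, and the rule that generates the labels); only the `n_neighbors` and `p` values come from the table. Under the default Minkowski metric, `p=1` is the Manhattan distance and `p=2` is the Euclidean distance.

```python
import itertools

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)


def make_data(n):
    """Made-up binary data: the label depends on the first of four features."""
    X = rng.random((n, 4))
    y = (X[:, 0] > 0.5).astype(int)
    return X, y


X_train, y_train = make_data(200)
X_test, y_test = make_data(100)

results = []
for k, p in itertools.product([3, 5], [1, 2]):
    model = KNeighborsClassifier(n_neighbors=k, p=p).fit(X_train, y_train)
    # labels=[0, 1] keeps the matrix 2x2 even if a class is never predicted.
    tn, fp, fn, tp = confusion_matrix(
        y_test, model.predict(X_test), labels=[0, 1]
    ).ravel()
    tss = tp / (tp + fn) - fp / (fp + tn)
    results.append((k, p, tss))

for row in results:
    print(row)
```

`itertools.product` yields the settings in the same order as the table, so the first tuple corresponds to Model Number 1.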

---

In [8]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))
params

[(3, 1), (3, 2), (5, 1), (5, 2)]

In [9]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------

list_df1 = df_f_class_train, df_mutual_train, df_chi2_train
list_df2 = df_f_class_test, df_mutual_test, df_chi2_test
list_models = []
for idx, (n_neighbors, p) in enumerate(params):
    for train_set, test_set in zip(list_df1, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = KNeighborsClassifier(n_neighbors=n_neighbors, p=p)
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, n_neighbors, p, calc_tss(y_test, model.predict(X=X_test))])

df_result_knc = DataFrame(list_models, columns=["Model Number", "n_neighbors", "p", "tss"])
df_result_knc

TN=86045	FP=1111	FN=1038	TP=363
TN=86101	FP=1055	FN=1153	TP=248
TN=86107	FP=1049	FN=942	TP=459
TN=85973	FP=1183	FN=1043	TP=358
TN=85991	FP=1165	FN=1121	TP=280
TN=86086	FP=1070	FN=1032	TP=369
TN=86182	FP=974	FN=1044	TP=357
TN=86228	FP=928	FN=1149	TP=252
TN=86160	FP=996	FN=946	TP=455
TN=86094	FP=1062	FN=1057	TP=344
TN=86215	FP=941	FN=1135	TP=266
TN=86138	FP=1018	FN=1038	TP=363


| |Model Number|n_neighbors|p|tss|
|---|---|---|---|---|
|0|1|3|1|0.246353|
|1|1|3|1|0.164912|
|2|1|3|1|0.315587|
|3|2|3|2|0.241958|
|4|2|3|2|0.18649|
|5|2|3|2|0.251106|
|6|3|5|1|0.243643|
|7|3|5|1|0.169224|
|8|3|5|1|0.31334|
|9|4|5|2|0.233354|


---
### Q3 (10 points)

After evaluating the various results from Q2, you will notice that the results are not all that great, with more than 1000 false negatives for nearly all of the settings tried. But what can be done to improve our results? If you read the documentation for the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), which you certainly should have, you will see that we were only using the `MinkowskiDistance` metric with different values of `p`. If you look into the [DistanceMetric](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric) documentation for the neighbors classifiers, you will see there are several others available to use.

So, for this question, train and evaluate two more instances of [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for each of our different feature selection train/test datasets, but this time using the `ChebyshevDistance` metric instead of the `MinkowskiDistance` metric. For these models you will only be changing the number of neighbors to 3 and 5, as the values of `p` are not used for the `ChebyshevDistance` metric.
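
To see how the Chebyshev distance differs from the Minkowski distances used in Q2, the sketch below computes all three by hand for two made-up points. The vectors `a` and `b` are arbitrary and serve only as an illustration.

```python
import numpy as np

# Two made-up points in a 3-dimensional feature space.
a = np.array([0.0, 0.0, 0.0])
b = np.array([3.0, 4.0, 1.0])
diff = np.abs(a - b)

manhattan = diff.sum()                   # Minkowski with p=1: 3 + 4 + 1 = 8
euclidean = np.sqrt((diff ** 2).sum())   # Minkowski with p=2: sqrt(9 + 16 + 1)
chebyshev = diff.max()                   # largest single-coordinate gap: 4

print(manhattan, euclidean, chebyshev)
```

The Chebyshev distance only looks at the single feature on which the two points differ the most, which is why there is no `p` to tune.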

---

In [10]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

In [11]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------

list_df1 = df_f_class_train, df_mutual_train, df_chi2_train
list_df2 = df_f_class_test, df_mutual_test, df_chi2_test
list_models = []
for idx, (n_neighbors,) in enumerate(params):
    for train_set, test_set in zip(list_df1, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = KNeighborsClassifier(n_neighbors=n_neighbors, metric="chebyshev")
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, n_neighbors, calc_tss(y_test, model.predict(X=X_test))])

df_result_knc_chebyshev = DataFrame(list_models, columns=["Model Number", "n_neighbors", "tss"])
df_result_knc_chebyshev

TN=85934	FP=1222	FN=1056	TP=345
TN=86001	FP=1155	FN=1138	TP=263
TN=86113	FP=1043	FN=1059	TP=342
TN=86062	FP=1094	FN=1068	TP=333
TN=86155	FP=1001	FN=1167	TP=234
TN=86179	FP=977	FN=1080	TP=321


| |Model Number|n_neighbors|tss|
|---|---|---|---|
|0|1|3|0.232232|
|1|1|3|0.174471|
|2|1|3|0.232144|
|3|2|5|0.225135|
|4|2|5|0.155538|
|5|2|5|0.217912|


---
### Q4 (10 points)

After evaluating the results from Q3, you will see that the results are no better than those we found for Q2. This leads to the thought that maybe the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) is just not a good fit for the problem we are applying it to. So, let's move on to another classifier for this problem.

In this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, continuing to use our training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label, train 8 different instances with the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.

---

In [12]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

In [13]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------

list_df1 = df_f_class_train, df_mutual_train, df_chi2_train
list_df2 = df_f_class_test, df_mutual_test, df_chi2_test
list_models = []
for idx, (crit, dep, split) in enumerate(params):
    for train_set, test_set in zip(list_df1, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = DecisionTreeClassifier(criterion=crit, max_depth=dep, splitter=split)
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, crit, dep, split, calc_tss(y_test, model.predict(X=X_test))])

df_result_dtc = DataFrame(list_models, columns=["Model Number", "criterion", "max_depth", "splitter", "tss"])
df_result_dtc

TN=86973	FP=183	FN=1331	TP=70
TN=86793	FP=363	FN=1275	TP=126
TN=86334	FP=822	FN=1084	TP=317
TN=87033	FP=123	FN=1350	TP=51
TN=87086	FP=70	FN=1365	TP=36
TN=86740	FP=416	FN=1236	TP=165
TN=85753	FP=1403	FN=1045	TP=356
TN=84852	FP=2304	FN=1057	TP=344
TN=85874	FP=1282	FN=903	TP=498
TN=85829	FP=1327	FN=1043	TP=358
TN=85753	FP=1403	FN=1105	TP=296
TN=85976	FP=1180	FN=985	TP=416
TN=87003	FP=153	FN=1335	TP=66
TN=86988	FP=168	FN=1356	TP=45
TN=86301	FP=855	FN=1116	TP=285
TN=86853	FP=303	FN=1200	TP=201
TN=87103	FP=53	FN=1368	TP=33
TN=86699	FP=457	FN=1090	TP=311
TN=86133	FP=1023	FN=1164	TP=237
TN=85738	FP=1418	FN=1098	TP=303
TN=86004	FP=1152	FN=1019	TP=382
TN=85727	FP=1429	FN=945	TP=456
TN=85559	FP=1597	FN=1009	TP=392
TN=86114	FP=1042	FN=994	TP=407


| |Model Number|criterion|max_depth|splitter|tss|
|---|---|---|---|---|---|
|0|1|gini|5.0|best|0.047865|
|1|1|gini|5.0|best|0.085771|
|2|1|gini|5.0|best|0.216836|
|3|2|gini|5.0|random|0.034991|
|4|2|gini|5.0|random|0.024893|
|5|2|gini|5.0|random|0.113|
|6|3|gini| |best|0.238007|
|7|3|gini| |best|0.219104|
|8|3|gini| |best|0.340751|
|9|4|gini| |random|0.240306|


---
### Q5 (10 points)

After evaluating the results from Q4, you will see that the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) was able to accomplish a bit of an improvement over the best results we found for the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). This is indeed great, but can we do better than this if we use yet another classifier? Let's move on to yet another and find out.

For this question you will be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. We won't be changing any of the default settings; just train 1 model for each of our feature selected data subsets. You will again be using your training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label. You will then test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.

---

In [14]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_models = []
for idx, (train_set, test_set) in enumerate(zip(list_df1, list_df2)):
    X_train, y_train = dichotomize_X_y(train_set)
    X_test, y_test = dichotomize_X_y(test_set)
    model = GaussianNB()
    model.fit(X=X_train, y=y_train)
    list_models.append([idx+1, calc_tss(y_test, model.predict(X=X_test))])

df_result_gnb = DataFrame(list_models, columns=["Model Number", "tss"])
df_result_gnb

TN=78609	FP=8547	FN=113	TP=1288
TN=75968	FP=11188	FN=170	TP=1231
TN=76896	FP=10260	FN=98	TP=1303


| |Model Number|tss|
|---|---|---|
|0|1|0.821278|
|1|2|0.750291|
|2|3|0.81233|


---
### Q6 (10 points)

If you recall from a lecture some time back, it was shown that another way of improving the results of classification is to perform some form of sampling to balance the number of samples there are for the various classes. The reasons why this works for specific classifiers, and the methods for doing the sampling, are numerous, and we don't have enough time to cover all of them in this course. However, it is still beneficial to know this works and that it is something that you should be considering when you are training models.

So, for this question, we will implement a very naive method for sampling so we can use the results for training our models again. Below you will find a function stub. Complete the function and have it return a copy of the input dataframe where each class (except for the smallest one) has been undersampled to match the size of the smallest class in the dataset. In this function you should assume the `lab` column is the original class label and not the dichotomized binary label.

To do this you may want to use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to get groups of rows from your DataFrame. You may also wish to use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) function to select a number of rows from a group. You can also use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method to process each group from your grouped rows. These are just hints; you can solve the problem however you see fit.

Once this function is complete, apply it to each of your training datasets that have been constructed with different feature selection methods from partition 1 (the ones with all the NF, C, .., X labels). You will not be applying this to your testing sets. After you have your sampled feature selected datasets, you will then apply your function that converts the multi-class problem to a binary problem to each of the resultant selected subsets so we can use these new undersampled data for the next several questions.
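
The undersampling idea can be tried out on a toy frame before running it on the real data. The sketch below is one possible approach rather than the required one; the frame `toy`, its `feat` column, and the fixed `random_state=0` (used only to make the draw repeatable) are assumptions made for the example.

```python
import pandas as pd

# Made-up frame with unbalanced classes; `feat` is a placeholder column.
toy = pd.DataFrame({
    "feat": range(12),
    "lab": ["NF"] * 6 + ["C"] * 4 + ["M"] * 2,
})

# Size of the smallest class (here the 2 rows labeled M).
smallest = toy["lab"].value_counts().min()

# Draw that many rows at random from every class.
balanced = toy.groupby("lab").sample(n=smallest, random_state=0)

print(balanced["lab"].value_counts().to_dict())
```

After sampling, every class has the same number of rows as the smallest class, so the balanced frame has 6 rows in total.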

---

In [15]:
def perform_under_sample(data:DataFrame)->DataFrame:
    #----------------------------------------------
    # TODO: Complete here.
    #----------------------------------------------
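    # Size of the smallest class, read directly from the label counts.
    min_count = data["lab"].value_counts().min()
    # Randomly draw that many rows from every class (requires pandas >= 1.1).
    return data.groupby("lab").sample(n=min_count)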
    

In [16]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_under_sampled = []
for train_set in list_df1:
    list_under_sampled.append(perform_under_sample(train_set))

df_f_class_train_us, df_mutual_train_us, df_chi2_train_us = tuple(list_under_sampled)


---
### Q7

For this question, repeat what you did for Q2, but with your balanced binary classification datasets constructed in Q6: use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score.

So, train 4 different instances with the following settings for each of your feature selected subsets, for a total of 12 different evaluations. **(see documentation to know what these are)**

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (these should not have been balanced). You shall then calculate and print the TSS score for each result. **NOTE: The model now takes less time to evaluate!**

---

In [17]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))

In [18]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_models = []
for idx, (n_neighbors, p) in enumerate(params):
    for train_set, test_set in zip(list_under_sampled, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = KNeighborsClassifier(n_neighbors=n_neighbors, p=p)
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, n_neighbors, p, calc_tss(y_test, model.predict(X=X_test))])

df_result_knc = DataFrame(list_models, columns=["Model Number", "n_neighbors", "p", "tss"])
df_result_knc

TN=82535	FP=4621	FN=548	TP=853
TN=80595	FP=6561	FN=477	TP=924
TN=83103	FP=4053	FN=569	TP=832
TN=82338	FP=4818	FN=597	TP=804
TN=80537	FP=6619	FN=520	TP=881
TN=83229	FP=3927	FN=549	TP=852
TN=82461	FP=4695	FN=537	TP=864
TN=80548	FP=6608	FN=440	TP=961
TN=83089	FP=4067	FN=523	TP=878
TN=82112	FP=5044	FN=532	TP=869
TN=80591	FP=6565	FN=461	TP=940
TN=83110	FP=4046	FN=496	TP=905


| |Model Number|n_neighbors|p|tss|
|---|---|---|---|---|
|0|1|3|1|0.555831|
|1|1|3|1|0.58425|
|2|1|3|1|0.547359|
|3|2|3|2|0.518596|
|4|2|3|2|0.552892|
|5|2|3|2|0.56308|
|6|3|5|1|0.562833|
|7|3|5|1|0.610121|
|8|3|5|1|0.580032|
|9|4|5|2|0.562398|


---
### Q8

After evaluating the various results from Q7, you will notice that some of the results are improved over the same experiments we conducted in Q2. Additionally, you should also notice an improvement in the speed at which the results were obtained. The question now is: will we continue to see these improvements for all of our experiments? So, let's move on and see.

For this question, you will repeat the experiments from Q3, but using the balanced binary classification datasets constructed in Q6. You will still be using the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) like you did in Q7, but you will again be changing from using the `MinkowskiDistance` metric with different values of `p` to using the `ChebyshevDistance` metric. You will construct two models for each of your feature selected datasets by changing the number of neighbors to 3 and 5.

Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (these should not have been balanced), then calculate and print the TSS score for each result.

---

In [19]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

In [20]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_models = []
for idx, (n_neighbors,) in enumerate(params):
    for train_set, test_set in zip(list_under_sampled, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = KNeighborsClassifier(n_neighbors=n_neighbors, metric="chebyshev")
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, n_neighbors, calc_tss(y_test, model.predict(X=X_test))])

df_result_knc_chebyshev = DataFrame(list_models, columns=["Model Number", "n_neighbors", "tss"])
df_result_knc_chebyshev

TN=82234	FP=4922	FN=552	TP=849
TN=80448	FP=6708	FN=551	TP=850
TN=83233	FP=3923	FN=584	TP=817
TN=82006	FP=5150	FN=475	TP=926
TN=80555	FP=6601	FN=497	TP=904
TN=83037	FP=4119	FN=551	TP=850


| |Model Number|n_neighbors|tss|
|---|---|---|---|
|0|1|3|0.549522|
|1|1|3|0.529744|
|2|1|3|0.538144|
|3|2|5|0.601867|
|4|2|5|0.569516|
|5|2|5|0.559449|


---
### Q9

After evaluating the results of Q8, things are looking a little less encouraging, since none of those results look to be better than the results of Q7. However, the results from Q3 weren't really any better than Q2 in the first place, so not all is lost. Let's continue on and see how things turn out with models like we used in Q4, since those were actually an improvement over Q2 originally.

So, in this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), like you did in Q4, and try several different settings to see how/if using different settings will improve our score. The difference will again be that you are now using the balanced binary classification datasets constructed in Q6 to train 8 different instances for each of your feature selected datasets using the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score for each result.

---

In [21]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

In [22]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_df1 = df_f_class_train, df_mutual_train, df_chi2_train
list_df2 = df_f_class_test, df_mutual_test, df_chi2_test
list_models = []
for idx, (crit, dep, split) in enumerate(params):
    for train_set, test_set in zip(list_under_sampled, list_df2):
        X_train, y_train = dichotomize_X_y(train_set)
        X_test, y_test = dichotomize_X_y(test_set)
        model = DecisionTreeClassifier(criterion=crit, max_depth=dep, splitter=split)
        model.fit(X=X_train, y=y_train)
        list_models.append([idx+1, crit, dep, split, calc_tss(y_test, model.predict(X=X_test))])

df_result_dtc = DataFrame(list_models, columns=["Model Number", "criterion", "max_depth", "splitter", "tss"])
df_result_dtc

TN=81254	FP=5902	FN=577	TP=824
TN=81164	FP=5992	FN=498	TP=903
TN=83639	FP=3517	FN=565	TP=836
TN=82789	FP=4367	FN=438	TP=963
TN=79367	FP=7789	FN=515	TP=886
TN=83798	FP=3358	FN=493	TP=908
TN=83124	FP=4032	FN=730	TP=671
TN=81524	FP=5632	FN=714	TP=687
TN=83544	FP=3612	FN=737	TP=664
TN=82393	FP=4763	FN=797	TP=604
TN=81020	FP=6136	FN=643	TP=758
TN=83618	FP=3538	FN=621	TP=780
TN=83429	FP=3727	FN=536	TP=865
TN=82719	FP=4437	FN=560	TP=841
TN=84468	FP=2688	FN=800	TP=601
TN=79501	FP=7655	FN=250	TP=1151
TN=80749	FP=6407	FN=373	TP=1028
TN=82737	FP=4419	FN=310	TP=1091
TN=82249	FP=4907	FN=660	TP=741
TN=81809	FP=5347	FN=662	TP=739
TN=83671	FP=3485	FN=738	TP=663
TN=82965	FP=4191	FN=640	TP=761
TN=81548	FP=5608	FN=596	TP=805
TN=83072	FP=4084	FN=601	TP=800


| |Model Number|criterion|max_depth|splitter|tss|
|---|---|---|---|---|---|
|0|1|gini|5.0|best|0.520434|
|1|1|gini|5.0|best|0.575789|
|2|1|gini|5.0|best|0.556364|
|3|2|gini|5.0|random|0.637261|
|4|2|gini|5.0|random|0.543037|
|5|2|gini|5.0|random|0.60958|
|6|3|gini| |best|0.432682|
|7|3|gini| |best|0.425744|
|8|3|gini| |best|0.432504|
|9|4|gini| |random|0.376471|


---
### Q10

Unlike with [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), it seems that the sampling didn't really help much for the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Where before we saw a 3X improvement with the Decision Tree over the KNN classifier, we now see similar results for both classifiers. Let's see how switching to the sampled data affects the classifier that performed best when we were using the full dataset.

For this question you will again be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier as you did in Q5, but using your balanced binary classification dataset constructed in Q6 to train just 1 model for each feature selected dataset. Once you have done that, test the model using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score.

---

In [23]:
#----------------------------------------------
# TODO: Complete here.
#----------------------------------------------
list_models = []
for idx, (train_set, test_set) in enumerate(zip(list_under_sampled, list_df2)):
    X_train, y_train = dichotomize_X_y(train_set)
    X_test, y_test = dichotomize_X_y(test_set)
    model = GaussianNB()
    model.fit(X=X_train, y=y_train)
    list_models.append([idx+1, calc_tss(y_test, model.predict(X=X_test))])

df_result_gnb = DataFrame(list_models, columns=["Model Number", "tss"])
df_result_gnb

TN=83275	FP=3881	FN=351	TP=1050
TN=80802	FP=6354	FN=340	TP=1061
TN=80755	FP=6401	FN=159	TP=1242


| |Model Number|tss|
|---|---|---|
|0|1|0.704935|
|1|2|0.684412|
|2|3|0.813067|


Unfortunately, we don't see much improvement for our [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. 

**Note: The TA would like you to turn in assignments that have been run and have results, so make sure to do a restart and run all from the kernel menu. Then make sure to save before you turn it in. You might find it necessary to use the toy dataset if you have time constraints.**