# Homework Assignment 5: Model Evaluation
As in the previous assignments, in this homework assignment you will continue your exploration of the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM), described in the paper found [here](https://doi.org/10.1038/s41597-020-0548-x).


This assignment will utilize a copy of the extracted feature dataset we have been working with. The dataset has been processed by performing outlier clipping, z-score and range scaling, and forward feature selection to select 20 features. Because we are now going to use more than one partition's worth of data, the mean, standard deviation, minimum, and maximum used for the z-score and range scaling were calculated from both partitions, so that the same global scaling is applied to each partition.

---

## Step 1: Downloading the Data

This assignment will continue to use [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) and will add [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) as a testing set.

---

For this assignment, the cleaning, transformation, and normalization of the data have been completed using both partitions to find the minimum, maximum, standard deviation, and mean values needed to perform these operations. Recall from lecture that we should not perform these operations on each partition individually but on the data as a whole, because these values may (will) differ between partitions.

For example, suppose we perform simple range scaling on each partition individually and see a range of 0 to 100 in one partition and 0 to 10 in another. After individual scaling, a value of 100 in the first partition would be mapped to 1, just like a value of 10 in the second. This can cause serious performance problems in your model, so I have made sure that the normalization was treated properly for you.
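The effect is easy to reproduce with a small made-up example. The sketch below is only an illustration: the arrays `part_a` and `part_b` and the helper `range_scale` are hypothetical and are not part of the assignment data or of the preprocessing that was actually applied.

```python
import numpy as np

def range_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1 (hypothetical helper)."""
    return (values - lo) / (hi - lo)

# Hypothetical raw values of the same feature in two partitions.
part_a = np.array([0.0, 50.0, 100.0])
part_b = np.array([0.0, 5.0, 10.0])

# Per-partition scaling: each partition uses its own minimum and maximum.
a_local = range_scale(part_a, part_a.min(), part_a.max())
b_local = range_scale(part_b, part_b.min(), part_b.max())

# Global scaling: both partitions share one minimum and one maximum.
both = np.concatenate([part_a, part_b])
a_global = range_scale(part_a, both.min(), both.max())
b_global = range_scale(part_b, both.min(), both.max())

print(a_local, b_local)    # raw 100 and raw 10 both end up at 1.0
print(a_global, b_global)  # raw 10 stays one tenth of raw 100
```

With per-partition scaling the largest value in each partition becomes 1, so a model can no longer tell them apart. With global scaling the relative magnitudes are preserved.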

Below you will find the full partitions and `toy` sampled data from each partition, where the toy data include only 20 samples from each of our 5 classes.

#### Full
- [Full Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1ExtractedFeatures.csv)
- [Full Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2ExtractedFeatures.csv)

#### Toy
- [Toy Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition1ExtractedFeatures.csv)
- [Toy Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition2ExtractedFeatures.csv)

Now that you have the two files, you should load each into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. 

---

### Evaluation Metric

For each of the models we evaluate in this assignment, you will calculate the True Skill Statistic score using the test data from Partition 2 to determine which model performs the best for classifying the positive flaring class.

    True skill statistic (TSS) = TPR + TNR - 1 = TPR - (1-TNR) = TPR - FPR

Where:

    True positive rate (TPR) = TP/(TP+FN) Also known as recall or sensitivity
    True negative rate (TNR) = TN/(TN+FP) Also known as specificity or selectivity
    False positive rate (FPR) = FP/(FP+TN) = (1-TNR) Also known as fall-out or probability of false alarm


**Recall**

    True positive (TP)
    True negative (TN)
    False positive (FP)
    False negative (FN)
    
See [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for more information.
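To make the definitions concrete, here is a minimal worked example. The helper `tss_from_counts` and the counts passed to it are hypothetical and chosen only for illustration; they do not come from the SWAN-SF data.

```python
def tss_from_counts(tp, fn, fp, tn):
    """True Skill Statistic from the four confusion-matrix counts (hypothetical helper)."""
    # TPR = TP / (TP + FN); treated as 0 when there are no positive samples.
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # FPR = FP / (FP + TN); treated as 0 when there are no negative samples.
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return tpr - fpr

# Made-up counts: TPR = 80/100 = 0.8 and FPR = 100/1000 = 0.1, so TSS is about 0.7.
typical = tss_from_counts(tp=80, fn=20, fp=100, tn=900)

# A perfect classifier has TPR = 1 and FPR = 0.
perfect = tss_from_counts(tp=100, fn=0, fp=0, tn=1000)

# Predicting the positive class for every sample gives TPR = 1 and FPR = 1.
always_positive = tss_from_counts(tp=100, fn=0, fp=1000, tn=0)

print(typical, perfect, always_positive)
```

Note that a model which labels everything as positive gets a TSS of 0 despite having no false negatives, which is why TSS is a useful score on a heavily imbalanced dataset.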

Below is a function implemented to provide your score for each model.

In [1]:
import os
import itertools
import pandas as pd
from pandas import DataFrame 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

In [2]:
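def calc_tss(y_true=None, y_predict=None):
    """
    Calculates the True Skill Statistic (TSS) for binary classification from the
    counts returned by scikit-learn's confusion_matrix function.
    """
    # With labels=[0, 1], ravel() always yields the counts in the order TN, FP, FN, TP,
    # even if one of the two classes is missing from the inputs.
    scores = confusion_matrix(y_true, y_predict, labels=[0, 1]).ravel()
    TN, FP, FN, TP = scores
    print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    # The guards avoid a division by zero when a rate's numerator (and possibly
    # its denominator) is zero.
    tp_rate = TP / float(TP + FN) if TP > 0 else 0
    fp_rate = FP / float(FP + TN) if FP > 0 else 0

    # TSS = TPR - FPR
    return tp_rate - fp_rate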

As in the previous assignment, we will be utilizing a binary classification of our 5 class dataset. So, below is the helper function to change our class labels from the 5 class target feature to the binary target feature. The function takes a DataFrame (e.g. our `abt`) and prepares it for a binary classification by merging the `X`- and `M`-class samples into one group, and the rest (`NF`, `B`, and `C`) into another group, labeled with `1`s and `0`s, respectively.

In [3]:
def dichotomize_X_y(data: pd.DataFrame):
    """
    Dichotomizes the dataset and splits it into the features (X) and the labels (y).
    
    :return: two np.ndarray objects X and y.
    """
    data_dich = data.copy()
    data_dich['lab'] = data_dich['lab'].map({'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1})
    y = data_dich['lab']
    X = data_dich.drop(['lab'], axis=1)
    return X.values, y.values
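As a quick sanity check of the label mapping, here is a minimal sketch on a made-up DataFrame. The names `toy`, `mapping`, and the label `'Q'` are hypothetical and used only for illustration.

```python
import pandas as pd

# The same mapping used by dichotomize_X_y: M and X are positive, the rest negative.
mapping = {'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1}

# Hypothetical toy data with one sample per class.
toy = pd.DataFrame({
    'feat_a': [0.1, 0.4, 0.7, 0.9, 0.2],
    'lab': ['NF', 'B', 'C', 'M', 'X'],
})
binary = toy['lab'].map(mapping)
print(binary.tolist())  # [0, 0, 0, 1, 1]

# A label that is missing from the mapping becomes NaN rather than raising an error.
unknown = pd.Series(['NF', 'Q']).map(mapping)
print(unknown.isna().tolist())  # [False, True]
```

Because unmapped labels silently turn into `NaN`, it can be worth checking `y` for missing values before training a model.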

In [4]:
data_dir = '/Users/carltonbrown/Desktop/FDS/'
#data_file = "partition1/toy_normalized_partition1ExtractedFeatures.csv"
#data_file2 = "Partition 2/toy_normalized_partition2ExtractedFeatures.csv"
data_file = "partition1/normalized_partition1ExtractedFeatures.csv"
data_file2 = "Partition 2/normalized_partition2ExtractedFeatures.csv"


In [5]:
abt = pd.read_csv(os.path.join(data_dir, data_file))
abt2 = pd.read_csv(os.path.join(data_dir, data_file2))

---
### Q1 (10 points)

Just like you did with the previous assignment, you will be utilizing a few different types of feature selection to find subsets of descriptive features to use in the models we will be evaluating. For this question you will again be utilizing the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). You will then be using 3 different feature evaluation functions.

-  [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)

- [scikit-learn mutual_info_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)

- [chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)

For each of these evaluation functions, you need to construct a 20-feature training and testing dataset. This will be done by:
<ol>
    <li>Use the `SelectKBest` class with each of the evaluation functions to perform feature selection using Partition 1 as your input data</li>
    <li>Construct a new train `DataFrame` for each instance of `SelectKBest` from step 1 with the `lab` class labels using Partition 1</li>
    <li>Construct a new test `DataFrame` for each instance of `SelectKBest` from step 1 with the `lab` class labels using Partition 2</li>
</ol>

After this question, you should have a total of 6 `DataFrame`s to use in later questions, a train and test pair for each feature selection method.
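The three steps above can be sketched as follows. This is only one possible approach, shown on made-up data: the frames `train` and `test`, the column names `f0` to `f5`, and `k=3` are hypothetical stand-ins for Partition 1, Partition 2, and the 20 features asked for here. The key point is that the selector is fit once on the training partition and the same columns are then taken from both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(6)]

# Hypothetical stand-ins for Partition 1 (train) and Partition 2 (test).
train = pd.DataFrame(rng.random((40, 6)), columns=cols)
train["lab"] = ["NF", "M"] * 20
test = pd.DataFrame(rng.random((30, 6)), columns=cols)
test["lab"] = ["NF", "M"] * 15

# Step 1: fit the selector on the training partition only.
selector = SelectKBest(f_classif, k=3).fit(train[cols], train["lab"])
keep = train[cols].columns[selector.get_support()]

# Steps 2 and 3: take the same selected columns from each partition
# and re-attach the class labels.
train_sel = train[keep].join(train["lab"])
test_sel = test[keep].join(test["lab"])

print(list(train_sel.columns))
print(list(train_sel.columns) == list(test_sel.columns))
```

Repeating this with `chi2` and `mutual_info_classif` in place of `f_classif` would give the other two train/test pairs.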

---

In [6]:
numFeat = 20
abt_cpy = abt.copy()


In [7]:
from sklearn.feature_selection import SelectKBest, f_classif,chi2,mutual_info_classif


data_d = abt.copy()
data_e = abt2.copy()

#TRAIN
y = data_d['lab'].copy()
X = data_d.copy().drop(['lab'], axis=1)
new = SelectKBest(f_classif,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
train_f_classif = X.iloc[:,cols]

train_f_classif=train_f_classif.join(y)


new = SelectKBest(chi2,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
train_chi2 = X.iloc[:,cols]

train_chi2=train_chi2.join(y)


new = SelectKBest(mutual_info_classif,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
train_mutual_info_classif = X.iloc[:,cols]

train_mutual_info_classif=train_mutual_info_classif.join(y)


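#TEST
# Note: each selector below is re-fit on Partition 2, so the selected columns
# are not guaranteed to match the ones selected from Partition 1 above.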
y = data_e['lab'].copy()
X = data_e.copy().drop(['lab'], axis=1)

new = SelectKBest(f_classif,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
test_f_classif = X.iloc[:,cols]

test_f_classif=test_f_classif.join(y)


new = SelectKBest(chi2,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
test_chi2 = X.iloc[:,cols]

test_chi2=test_chi2.join(y)


new = SelectKBest(mutual_info_classif,k=numFeat).fit(X, y)
cols = new.get_support(indices=True)
test_mutual_info_classif = X.iloc[:,cols]

test_mutual_info_classif=test_mutual_info_classif.join(y)


---
### Q2 (10 points)

Now that we have our training and testing datasets for each of our feature subsets, we need to attempt to perform hyperparameter tuning on our model for each of the datasets. We want to see which combination of dataset and parameter settings seems to provide the best results.

In order to do this, we must first dichotomize the training and testing data. Lucky for you, a method has already been provided to do this. All you need to do is apply it to each of the `DataFrame`s you constructed in Q1.

With your binary classification dataset constructed, now it's time to start training and testing some models. We will start with the simple [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, for each of your three copies of the Partition 1 training datasets that have had their `lab` columns converted to a binary label, train 4 different instances with the following settings. **(see documentation to know what these are)** In total you will train and evaluate 12 pairings of model settings and feature-selected datasets.

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result. **NOTE: The model does take a little while to evaluate.**
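As a reminder of what the `p` setting in the table controls, the sketch below computes the Minkowski distance by hand for two made-up points. The helper `minkowski` and the points `u` and `v` are hypothetical and only illustrate the formula; scikit-learn computes these distances internally.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

# p=1 is the Manhattan distance: |3| + |4|
print(minkowski(u, v, p=1))  # 7.0

# p=2 is the Euclidean distance: sqrt(3^2 + 4^2)
print(minkowski(u, v, p=2))  # 5.0
```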

---

In [8]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))

In [9]:
print(params)
train_f_classif
X_train, Y_train = dichotomize_X_y(train_f_classif)
X_test, Y_test = dichotomize_X_y(test_f_classif)
print('KNN_f_classif')
for x,y in params:
    KNN_f_classif = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_f_classif.fit(X_train,Y_train)
    predictions = KNN_f_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    
    
train_chi2

print('KNN_chi2')
X_train, Y_train = dichotomize_X_y(train_chi2)
X_test, Y_test = dichotomize_X_y(test_chi2)
for x,y in params:
    KNN_chi2 = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_chi2.fit(X_train,Y_train)
    predictions = KNN_chi2.predict(X_test)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)


train_mutual_info_classif
print('KNN_mutual_info_classif')
X_train, Y_train = dichotomize_X_y(train_mutual_info_classif)
X_test, Y_test = dichotomize_X_y(test_mutual_info_classif)
for x,y in params:
    KNN_mutual_info_classif = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_mutual_info_classif.fit(X_train,Y_train)
    predictions = KNN_mutual_info_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    

test_f_classif
test_chi2
test_mutual_info_classif


[(3, 1), (3, 2), (5, 1), (5, 2)]
KNN_f_classif
TN=85645	FP=1511	FN=882	TP=519
TN=85645	FP=1511	FN=882	TP=519
0.3531129492584269
TN=85107	FP=2049	FN=907	TP=494
TN=85107	FP=2049	FN=907	TP=494
0.3290957128928679
TN=85769	FP=1387	FN=890	TP=511
TN=85769	FP=1387	FN=890	TP=511
0.3488254785064817
TN=85361	FP=1795	FN=900	TP=501
TN=85361	FP=1795	FN=900	TP=501
0.33700645857588984
KNN_chi2
TN=86095	FP=1061	FN=939	TP=462
0.31759088013980297
TN=86080	FP=1076	FN=1025	TP=376
0.25603404975282207
TN=86166	FP=990	FN=949	TP=452
0.31126775263199324
TN=86121	FP=1035	FN=1036	TP=365
0.24865293598925178
KNN_mutual_info_classif
TN=68907	FP=18249	FN=1358	TP=43
TN=68907	FP=18249	FN=1358	TP=43
-0.17869081239841372
TN=68436	FP=18720	FN=1379	TP=22
TN=68436	FP=18720	FN=1379	TP=22
-0.1990842087480442
TN=73832	FP=13324	FN=1343	TP=58
TN=73832	FP=13324	FN=1343	TP=58
-0.11147630333872767
TN=84985	FP=2171	FN=1389	TP=12
TN=84985	FP=2171	FN=1389	TP=12
-0.016344047440396567


Unnamed: 0,TOTBSQ_slope_of_longest_mono_decrease,R_VALUE_slope_of_longest_mono_increase,TOTUSJH_slope_of_longest_mono_increase,TOTUSJH_slope_of_longest_mono_decrease,ABSNJZH_slope_of_longest_mono_increase,ABSNJZH_slope_of_longest_mono_decrease,MEANGBT_slope_of_longest_mono_increase,MEANGBT_slope_of_longest_mono_decrease,MEANJZH_slope_of_longest_mono_increase,MEANJZD_slope_of_longest_mono_increase,...,EPSY_slope_of_longest_mono_increase,R_VALUE_slope_of_longest_mono_decrease,USFLUX_slope_of_longest_mono_decrease,TOTFY_slope_of_longest_mono_increase,SAVNCPP_slope_of_longest_mono_increase,TOTFX_slope_of_longest_mono_increase,TOTUSJZ_slope_of_longest_mono_increase,TOTFY_slope_of_longest_mono_decrease,TOTUSJZ_slope_of_longest_mono_decrease,lab
0,0.999998,0.592783,0.135893,0.999785,0.165156,0.983001,0.135091,0.997227,0.071504,0.080211,...,0.069680,0.996801,0.995512,0.010549,0.061615,0.011542,0.003620,0.961848,0.979902,NF
1,0.999972,0.000000,0.099753,0.984008,0.092857,0.993876,0.139501,0.998092,0.072129,0.067014,...,0.118649,1.000000,0.995483,0.012447,0.018034,0.005273,0.002644,0.981403,0.966216,NF
2,0.999994,0.542522,0.117917,0.991531,0.140554,0.999030,0.210169,0.997967,0.116901,0.091698,...,0.139939,0.999984,0.993084,0.011441,0.026691,0.009723,0.002274,0.992178,0.995778,NF
3,0.999880,0.032062,0.148679,0.988606,0.132879,0.993281,0.111992,0.999899,0.048804,0.050429,...,0.099111,0.998548,0.991620,0.013906,0.037199,0.013309,0.004147,0.994151,0.895906,NF
4,0.999966,0.853203,0.079944,0.998111,0.088042,0.992888,0.259817,0.996885,0.121557,0.128235,...,0.149614,0.997864,0.998579,0.004338,0.017189,0.005733,0.001269,0.987130,0.989757,NF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88552,0.999939,0.658997,0.111167,0.995514,0.112496,0.992461,0.171748,0.999282,0.081818,0.078232,...,0.108534,0.999760,0.999503,0.017840,0.027152,0.005075,0.002517,0.989290,0.984129,NF
88553,0.999935,0.000000,0.050387,0.997721,0.055907,0.996825,0.311138,0.981158,0.146653,0.126859,...,0.168564,1.000000,0.996039,0.001713,0.006410,0.001795,0.000493,0.999946,0.997583,NF
88554,0.999985,0.282474,0.083450,0.997472,0.063175,0.994946,0.287632,0.998275,0.247218,0.160138,...,0.326943,0.831377,0.998076,0.003396,0.006100,0.002519,0.001121,0.996091,0.990473,NF
88555,0.999917,0.000000,0.082119,0.999161,0.094470,0.992008,0.196919,0.992609,0.101378,0.109766,...,0.143973,1.000000,0.998674,0.005847,0.016647,0.003269,0.001495,0.990364,0.996667,NF


---
### Q3 (10 points)

After evaluating the various results from Q2, you will notice that the results are not all that great, with more than 1000 false negatives for nearly all of the settings we tried. But what can be done to improve our results? If you read the documentation for the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), which you certainly should have, you will see that we were only using the `MinkowskiDistance` metric with different values of `p`. If you look into the [DistanceMetric](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric) documentation for the neighbors classifiers, you will see there are several others available to use.

So, for this question, train and evaluate two more instances of [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for each of our different feature selection train/test datasets, but this time using the `ChebyshevDistance` metric instead of the `MinkowskiDistance` metric. For these models you will only be changing the number of neighbors to 3 and 5, as the values of `p` are not used for the `ChebyshevDistance` metric.
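For intuition, the Chebyshev distance is the largest absolute difference along any single feature. The sketch below computes it by hand and then passes `metric='chebyshev'` to the classifier on a tiny made-up dataset. The helper `chebyshev` and the arrays `X`, `y`, and `queries` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def chebyshev(u, v):
    """Chebyshev distance: the largest per-feature absolute difference (hypothetical helper)."""
    return float(np.max(np.abs(u - v)))

print(chebyshev(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 4.0

# Made-up data: three points near the origin (class 0), three near (1, 1) (class 1).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric='chebyshev').fit(X, y)
queries = np.array([[0.05, 0.05], [0.95, 0.95]])
pred = knn.predict(queries)
print(pred)  # [0 1]
```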

---

In [10]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

In [11]:
print(params)
print('KNN_f_classif')
X_train, Y_train = dichotomize_X_y(train_f_classif)
X_test, Y_test = dichotomize_X_y(test_f_classif)

for x in n_neighbors:
    print(x)
    KNN_f_classif = KNeighborsClassifier(n_neighbors=x,metric='chebyshev')
    KNN_f_classif.fit(X_train,Y_train)
    predictions = KNN_f_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    
    
train_chi2

print('KNN_chi2')
X_train, Y_train = dichotomize_X_y(train_chi2)
X_test, Y_test = dichotomize_X_y(test_chi2)
for x in n_neighbors:
    KNN_chi2 = KNeighborsClassifier(n_neighbors=x,metric='chebyshev')
    KNN_chi2.fit(X_train,Y_train)
    predictions = KNN_chi2.predict(X_test)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)


train_mutual_info_classif
print('KNN_mutual_info_classif')
X_train, Y_train = dichotomize_X_y(train_mutual_info_classif)
X_test, Y_test = dichotomize_X_y(test_mutual_info_classif)
for x in n_neighbors:
    KNN_mutual_info_classif = KNeighborsClassifier(n_neighbors=x,metric='chebyshev')
    KNN_mutual_info_classif.fit(X_train,Y_train)
    predictions = KNN_mutual_info_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    

[(3,), (5,)]
KNN_f_classif
3
TN=86995	FP=161	FN=1322	TP=79
TN=86995	FP=161	FN=1322	TP=79
0.05454103169556019
5
TN=86976	FP=180	FN=1310	TP=91
TN=86976	FP=180	FN=1310	TP=91
0.06288834227985499
KNN_chi2
TN=86144	FP=1012	FN=1061	TP=340
0.23107243375559422
TN=86197	FP=959	FN=1079	TP=322
0.21883257302394987
KNN_mutual_info_classif
TN=37182	FP=49974	FN=655	TP=746
TN=37182	FP=49974	FN=655	TP=746
-0.04090885102722108
TN=62295	FP=24861	FN=676	TP=725
TN=62295	FP=24861	FN=676	TP=725
0.23224036586836388


---
### Q4 (10 points)

After evaluating the results from Q3, you will see that the results are no better than those we found for Q2. This leads to the thought that maybe the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) is just not a good fit for the problem we are applying it to. So, let's move on to another classifier for this problem.

In this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, continuing to use our training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label, train 8 different instances with the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.
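One way to organize this many runs is to loop over both the feature-selected pairs and the settings grid, storing each TSS in a dictionary keyed by the configuration. The sketch below does this on synthetic data from `make_classification`. The `pairs` dictionary, the column slices standing in for the three feature subsets, and the fixed `random_state` are all assumptions made for illustration and are not part of the assignment.

```python
import itertools
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the dichotomized partitions.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

# Hypothetical stand-ins for the three feature-selected train/test pairs.
pairs = {
    'f_classif': (X_train[:, :5], X_test[:, :5]),
    'chi2': (X_train[:, 5:], X_test[:, 5:]),
    'mutual_info_classif': (X_train[:, ::2], X_test[:, ::2]),
}

# The 8 settings from the table: criterion x max_depth x splitter.
grid = list(itertools.product(['gini', 'entropy'], [5, None], ['best', 'random']))

results = {}
for name, (tr, te) in pairs.items():
    for criterion, max_depth, splitter in grid:
        # random_state is fixed here only so the sketch is repeatable.
        tree = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                      splitter=splitter, random_state=0)
        pred = tree.fit(tr, y_train).predict(te)
        tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
        results[(name, criterion, max_depth, splitter)] = tp / (tp + fn) - fp / (fp + tn)

# Configuration with the highest TSS on the synthetic test split.
best = max(results, key=results.get)
print(len(results), best)
```

Keeping the scores in one dictionary makes it easy to compare settings across feature subsets afterwards instead of reading them off the printed output.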

---

In [12]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

In [13]:
X_train, Y_train = dichotomize_X_y(abt)
X_test, Y_test = dichotomize_X_y(abt2)
for x,y,z in params:
    DTClassifier = DecisionTreeClassifier(criterion = x,  max_depth = y, splitter = z)
    DTClassifier.fit(X_train,Y_train)
    predictions = DTClassifier .predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    

TN=86149	FP=1007	FN=1035	TP=366
TN=86149	FP=1007	FN=1035	TP=366
0.2496879748862533
TN=86732	FP=424	FN=1251	TP=150
TN=86732	FP=424	FN=1251	TP=150
0.10220154109940748
TN=84584	FP=2572	FN=946	TP=455
TN=84584	FP=2572	FN=946	TP=455
0.2952577194767452
TN=85718	FP=1438	FN=925	TP=476
TN=85718	FP=1438	FN=925	TP=476
0.3232581652549864
TN=86910	FP=246	FN=1209	TP=192
TN=86910	FP=246	FN=1209	TP=192
0.13422244275272782
TN=86726	FP=430	FN=1121	TP=280
TN=86726	FP=430	FN=1121	TP=280
0.19492356269193845
TN=86184	FP=972	FN=987	TP=414
TN=86184	FP=972	FN=987	TP=414
0.2843507956345574
TN=85703	FP=1453	FN=967	TP=434
TN=85703	FP=1453	FN=967	TP=434
0.29310747334052517


---
### Q5 (10 points)

After evaluating results from Q4, you will see that the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) was able to accomplish a bit of an improvement over the best results we found for the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). This is indeed great, but can we do better than this if we use yet another classifier? Let's move on to yet another and find out.

For this question you will be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. We won't be changing any of the default settings, just train 1 model for each of our feature selected data subsets. You will again be using your training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label. You will then test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.

---

In [14]:
X_train, Y_train = dichotomize_X_y(abt)
X_test, Y_test = dichotomize_X_y(abt2)
GNBClassifier = GaussianNB()
GNBClassifier.fit(X_train,Y_train)
predictions = GNBClassifier.predict(X_test)
calc_tss(Y_test,predictions)
TSS_score =calc_tss(Y_test,predictions)
print(TSS_score)
    

TN=68408	FP=18748	FN=74	TP=1327
TN=68408	FP=18748	FN=74	TP=1327
0.7320720442892868


---
### Q6 (10 points)

If you recall from a lecture some time back, it was shown that another way of improving the results of classification is to perform some form of sampling to balance the number of samples there are for the various classes. The reasons why this works for specific classifiers, and the methods for doing the sampling, are numerous, and we don't have enough time to cover all of them in this course. However, it is still beneficial to know this works and that it is something that you should be considering when you are training models.

So, for this question, we will implement a very naive method for sampling so we can use the results for training our models again. Below you will find a function stub. Complete the function and have it return a copy of the input dataframe where each class (except for the smallest one) has been undersampled to match the size of the smallest class in the dataset. In this function you should assume the `lab` column is the class label and not the dichotomized binary classification converted label.

To do this you may want to use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to get groups of rows from your DataFrame.  You may also wish to use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) function to select a number of rows from a group. You can also use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method to process each group from your grouped rows. These are just hints, you can solve the problem how you see fit.

Once this function is complete, apply it to each of your training datasets that have been constructed with different feature selection methods from partition 1 (the ones with all the NF, C, .., X labels). You will not be applying this to your testing sets. After you have your sampled feature selected datasets, you will then apply your function that converts the multi-class problem to a binary problem to each of the resultant selected subsets so we can use these new undersampled data for the next several questions.
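It is worth checking what the class balance looks like after both steps. The sketch below uses a made-up DataFrame `toy` with unequal class sizes, and a fixed `random_state` that is an assumption added only to make the example repeatable. Since three of the five classes map to `0` and two map to `1`, undersampling every class to the same size gives a 3:2 ratio of negative to positive samples after dichotomizing, not an exact 1:1 split.

```python
import pandas as pd

# Hypothetical toy data with unequal class sizes.
toy = pd.DataFrame({
    'feat': range(20),
    'lab': ['NF'] * 8 + ['B'] * 5 + ['C'] * 3 + ['M'] * 2 + ['X'] * 2,
})

# Undersample every class to the size of the smallest class.
smallest = toy['lab'].value_counts().min()
balanced = toy.groupby('lab').sample(n=smallest, random_state=0)

# Convert the balanced 5-class labels to binary labels.
binary = balanced['lab'].map({'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1})

print(balanced['lab'].value_counts().to_dict())  # every class has 2 samples
print(binary.value_counts().to_dict())           # 6 negatives and 4 positives
```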

---

In [15]:

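def perform_under_sample(data:DataFrame)->DataFrame:
    """
    Returns a copy of the input where every class in the `lab` column has been
    randomly undersampled to the size of the smallest class.
    """
    NewDf = data.copy()
    # Size of the smallest class
    minv = min(NewDf['lab'].value_counts())

    # Draw minv rows at random (without replacement) from each class
    NewDf = NewDf.groupby(['lab']).sample(n=minv)

    return NewDf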
    
perform_under_sample(abt)

Unnamed: 0.1,lab,TOTUSJH_var,TOTUSJH_difference_of_vars,TOTBSQ_min,TOTBSQ_max,TOTBSQ_median,TOTBSQ_mean,TOTBSQ_var,TOTBSQ_difference_of_mins,TOTBSQ_difference_of_maxs,...,TOTUSJZ_slope_of_longest_mono_decrease,TOTUSJZ_gderivative_stddev,MEANPOT_max,MEANPOT_gderivative_mean,TOTFX_stddev,SAVNCPP_slope_of_longest_mono_decrease,TOTPOT_avg_mono_decrease_slope,USFLUX_stddev,TOTBSQ_dderivative_stddev,Unnamed: 0
49096,B,0.709618,0.689552,0.926691,0.929230,0.927264,0.927593,0.872558,0.801373,0.856077,...,0.962007,0.011867,0.000032,0.322539,0.055663,0.915419,0.999915,0.110819,0.025365,0.554406
60934,B,0.646126,0.616407,0.723703,0.826830,0.776568,0.793105,0.837860,0.800235,0.859606,...,0.990218,0.003999,0.000069,0.322570,0.015591,0.996932,0.999983,0.019856,0.004434,0.688084
8998,B,0.694324,0.629761,0.924546,0.927401,0.926492,0.926235,0.863283,0.883118,0.842644,...,0.991265,0.011846,0.000050,0.322546,0.111352,0.995501,0.999865,0.082890,0.032000,0.101608
22658,B,0.803690,0.661773,0.953099,0.953998,0.953571,0.953413,0.875402,0.886332,0.868963,...,0.965470,0.018105,0.000186,0.322539,0.183933,0.962814,0.999747,0.029646,0.026298,0.255861
70799,B,0.716481,0.716957,0.878875,0.886128,0.882212,0.882024,0.839113,0.810303,0.844765,...,0.891158,0.022926,0.000030,0.322544,0.084576,0.981849,0.999912,0.064778,0.020309,0.799483
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10092,X,0.816747,0.829742,0.979773,0.981914,0.980131,0.980240,0.924737,0.888862,0.936065,...,0.802352,0.018069,0.000304,0.322528,0.168200,0.895796,0.999316,0.095715,0.059541,0.113962
62519,X,0.842520,0.874926,0.983365,0.986923,0.986510,0.986042,0.945309,0.958983,0.897114,...,0.969298,0.053753,0.000335,0.322588,0.617088,0.986798,0.996853,0.107066,0.168775,0.705983
41857,X,0.736321,0.783956,0.933724,0.936398,0.935379,0.935123,0.881464,0.880156,0.854426,...,0.974129,0.027330,0.000183,0.322556,0.164420,0.972967,0.999815,0.057483,0.030000,0.472661
67891,X,0.763460,0.775258,0.951789,0.956352,0.955411,0.954806,0.917118,0.932767,0.895310,...,0.996157,0.014696,0.000281,0.322557,0.141894,0.613687,0.999847,0.091038,0.033657,0.766645


In [16]:
train_f_classif =perform_under_sample(train_f_classif) 
train_chi2 =perform_under_sample(train_chi2)
train_mutual_info_classif =perform_under_sample(train_mutual_info_classif)


---
### Q7

For this question, repeat what you did for Q2, but with your balanced binary classification datasets constructed in Q6. Use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score.

So, train 4 different instances with the following settings for each of your feature selected subsets, for a total of 12 different evaluations. **(see documentation to know what these are)**

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (these should not have been balanced). You shall then calculate and print the TSS score for each result. **NOTE: The model now takes less time to evaluate!**

---

In [17]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))

In [20]:
print(params)
print('KNN_f_classif')

X_train, Y_train = dichotomize_X_y(train_f_classif)
X_test, Y_test = dichotomize_X_y(test_f_classif)

for x,y in params:
    
    KNN_f_classif = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_f_classif.fit(X_train,Y_train)
    predictions = KNN_f_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score) 
train_chi2

print('KNN_chi2')
X_train, Y_train = dichotomize_X_y(train_chi2)
X_test, Y_test = dichotomize_X_y(test_chi2)
for x,y in params:
    KNN_chi2 = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_chi2.fit(X_train,Y_train)
    predictions = KNN_chi2.predict(X_test)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
train_mutual_info_classif

print('KNN_mutual_info_classif')
X_train, Y_train = dichotomize_X_y(train_mutual_info_classif)
X_test, Y_test = dichotomize_X_y(test_mutual_info_classif)
for x,y in params:
    KNN_mutual_info_classif = KNeighborsClassifier(n_neighbors=x,p=y)
    KNN_mutual_info_classif.fit(X_train,Y_train)
    predictions = KNN_mutual_info_classif.predict(X_test)
    calc_tss(Y_test,predictions)
    TSS_score =calc_tss(Y_test,predictions)
    print(TSS_score)
    
    
  

[(3, 1), (3, 2), (5, 1), (5, 2)]
KNN_f_classif
TN=83028	FP=4128	FN=429	TP=972
TN=83028	FP=4128	FN=429	TP=972
0.6464268014143435
TN=80975	FP=6181	FN=306	TP=1095
TN=80975	FP=6181	FN=306	TP=1095
0.7106657701964029
TN=82919	FP=4237	FN=383	TP=1018
TN=82919	FP=4237	FN=383	TP=1018
0.6780098605832481
TN=79219	FP=7937	FN=214	TP=1187
TN=79219	FP=7937	FN=214	TP=1187
0.7561853696485359
KNN_chi2
TN=83136	FP=4020	FN=475	TP=926
0.6148322685660593
TN=83136	FP=4020	FN=489	TP=912
0.6048394063248032
TN=83213	FP=3943	FN=511	TP=890
0.590019810400765
TN=83084	FP=4072	FN=468	TP=933
0.6192320683589533
KNN_mutual_info_classif
TN=16365	FP=70791	FN=1	TP=1400
TN=16365	FP=70791	FN=1	TP=1400
0.187052987171198
TN=0	FP=87156	FN=0	TP=1401
TN=0	FP=87156	FN=0	TP=1401
0.0
TN=19634	FP=67522	FN=0	TP=1401
TN=19634	FP=67522	FN=0	TP=1401
0.2252742209371701
TN=0	FP=87156	FN=0	TP=1401
TN=0	FP=87156	FN=0	TP=1401
0.0


---
### Q8

After evaluating the various results from Q7, you will notice that some of the results are improved over the same experiments we conducted in Q2. Additionally, you should also notice an improvement in the speed at which the results were obtained. The question now is will we continue to see these improvements for all of our experiments? So, let's move on and see.

For this question, you will repeat the experiments from Q3, but using the balanced binary classification datasets constructed in Q6. You will still be using the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) like you did in Q7, but you will again be changing from using the `MinkowskiDistance` metric with different values of `p` to using the `ChebyshevDistance` metric. You will construct two models for each of your feature selected datasets by changing the number of neighbors to 3 and 5.

Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (these should not have been balanced), then calculate and print the TSS score for each result. 
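As a reminder of what the TSS score measures, the sketch below computes it as recall minus the false alarm rate. The function names `tss_from_counts` and `tss_from_labels` are hypothetical and are not the notebook's `calc_tss`; the labels are assumed to be encoded as 1 for the positive class and 0 for the negative class.

```python
import numpy as np

def tss_from_counts(tn, fp, fn, tp):
    """True Skill Statistic: recall minus false alarm rate."""
    recall = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    return recall - false_alarm_rate

def tss_from_labels(y_true, y_pred):
    """TSS for binary labels, assuming 1 = positive and 0 = negative."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tss_from_counts(tn, fp, fn, tp)

# Small made-up example: 2 positives, 3 negatives
example_score = tss_from_labels([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(example_score)

# A classifier that predicts "positive" for everything has no skill
all_positive_score = tss_from_counts(tn=0, fp=87156, fn=0, tp=1401)
print(all_positive_score)
```

Because both terms are rates within their own class, the score is not inflated by the large number of negative samples in the unbalanced test partition.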

---

In [21]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

In [24]:
print(params)

# Models trained on the f_classif-selected features
print('KNN_f_classif')
X_train, Y_train = dichotomize_X_y(train_f_classif)
X_test, Y_test = dichotomize_X_y(test_f_classif)
for x in n_neighbors:
    KNN_f_classif = KNeighborsClassifier(n_neighbors=x, metric='chebyshev')
    KNN_f_classif.fit(X_train, Y_train)
    predictions = KNN_f_classif.predict(X_test)
    # calc_tss prints the confusion-matrix counts and returns the score
    TSS_score = calc_tss(Y_test, predictions)
    print(TSS_score)

# Models trained on the chi2-selected features
print('KNN_chi2')
X_train, Y_train = dichotomize_X_y(train_chi2)
X_test, Y_test = dichotomize_X_y(test_chi2)
for x in n_neighbors:
    KNN_chi2 = KNeighborsClassifier(n_neighbors=x, metric='chebyshev')
    KNN_chi2.fit(X_train, Y_train)
    predictions = KNN_chi2.predict(X_test)
    TSS_score = calc_tss(Y_test, predictions)
    print(TSS_score)

# Models trained on the mutual_info_classif-selected features
print('KNN_mutual_info_classif')
X_train, Y_train = dichotomize_X_y(train_mutual_info_classif)
X_test, Y_test = dichotomize_X_y(test_mutual_info_classif)
for x in n_neighbors:
    KNN_mutual_info_classif = KNeighborsClassifier(n_neighbors=x, metric='chebyshev')
    KNN_mutual_info_classif.fit(X_train, Y_train)
    predictions = KNN_mutual_info_classif.predict(X_test)
    TSS_score = calc_tss(Y_test, predictions)
    print(TSS_score)

[(3,), (5,)]
KNN_f_classif
TN=0	FP=87156	FN=0	TP=1401
0.0
TN=3	FP=87153	FN=0	TP=1401
3.442103813855457e-05
KNN_chi2
TN=82922	FP=4234	FN=564	TP=837
0.5488508483594309
TN=82903	FP=4253	FN=521	TP=880
0.5793252110493645
KNN_mutual_info_classif
TN=8248	FP=78908	FN=95	TP=1306
0.026826199456476796
TN=1396	FP=85760	FN=64	TP=1337
-0.029664399546241782


---
### Q9

After evaluating the results of Q8, things are looking a little less encouraging, since none of those results look better than the results of Q7. However, the results from Q3 weren't really any better than Q2 in the first place, so not all is lost. Let's continue on and see how things turn out with models like the ones we used in Q4, since those were actually an improvement over Q2 originally.

So, in this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), like you did in Q4, and try several different settings to see whether and how they improve our score. The difference will again be that you are now using the balanced binary classification datasets constructed in Q6 to train 8 different instances for each of your feature selected datasets using the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score for each result. 
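One way to keep the eight results traceable to the table above is to number each setting as it is generated. The following is only a sketch on synthetic stand-in data from `make_classification`, not on the SWAN-SF features; the `tss` helper, the train/test split, and `random_state=0` are assumptions added so the example is self-contained and repeatable.

```python
import itertools

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in data (NOT the assignment's dataset)
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def tss(y_true, y_pred):
    """Hypothetical helper: recall minus false alarm rate."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn) - fp / (fp + tn)

# itertools.product yields the settings in the same order as the table rows
settings = list(itertools.product(['gini', 'entropy'], [5, None], ['best', 'random']))

results = []
for model_number, (criterion, max_depth, splitter) in enumerate(settings, start=1):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth,
                                  splitter=splitter, random_state=0)
    tree.fit(X_train, y_train)
    score = tss(y_test, tree.predict(X_test))
    results.append((model_number, criterion, max_depth, splitter, score))
    print(model_number, criterion, max_depth, splitter, score)
```

Printing the settings next to each score makes it easier to compare a given model number across the feature selected datasets.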

---

In [25]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

In [26]:
X_train, Y_train = dichotomize_X_y(perform_under_sample(abt))
X_test, Y_test = dichotomize_X_y(abt2)
for x, y, z in params:
    # x = criterion, y = max_depth, z = splitter
    DTClassifier = DecisionTreeClassifier(criterion=x, max_depth=y, splitter=z)
    DTClassifier.fit(X_train, Y_train)
    predictions = DTClassifier.predict(X_test)
    # calc_tss prints the confusion-matrix counts and returns the score
    TSS_score = calc_tss(Y_test, predictions)
    print(TSS_score)

TN=82738	FP=4418	FN=504	TP=897
0.589566243816129
TN=82861	FP=4295	FN=545	TP=856
0.5617126955304147
TN=82382	FP=4774	FN=739	TP=662
0.417744283478796
TN=83257	FP=3899	FN=625	TP=776
0.5091542026146624
TN=82647	FP=4509	FN=499	TP=902
0.5920910183644715
TN=82866	FP=4290	FN=372	TP=1029
0.6852532901942644
TN=82599	FP=4557	FN=559	TP=842
0.5487137292917285
TN=82562	FP=4594	FN=479	TP=922
0.6053912731047226


---
### Q10

Unlike with [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), it seems that the sampling didn't really help much for the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). Where before we saw a 3X improvement with the Decision Tree over the KNN classifier, we now see similar results for both classifiers. Let's see how switching to the sampled data affects the classifier that performed best when we were using the full dataset.

For this question, you will again be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier as you did in Q5, but using your balanced binary classification dataset constructed in Q6 to train just 1 model for each feature selected dataset. Once you have done that, test the model using your binary classification copy of the Partition 2 testing dataset that was constructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score. 
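The reason only the training data is balanced can be illustrated with a small random under-sampling sketch. The function `random_under_sample` below is hypothetical and is not the notebook's `perform_under_sample`; it assumes that balancing means randomly keeping as many rows of each class as the rarest class has, and it uses made-up data.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for label in classes:
        idx = np.flatnonzero(y == label)
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Made-up unbalanced training data: 95 negatives and 5 positives
y_train = np.array([0] * 95 + [1] * 5)
X_train = np.arange(200, dtype=float).reshape(100, 2)

X_bal, y_bal = random_under_sample(X_train, y_train)
print(np.bincount(y_bal))  # number of rows kept per class
```

The test partition is left untouched so that the TSS score reflects the class proportions the model would actually face.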

---

In [28]:
X_train, Y_train = dichotomize_X_y(perform_under_sample(abt))
X_test, Y_test = dichotomize_X_y(abt2)
GNBClassifier = GaussianNB()
GNBClassifier.fit(X_train, Y_train)
predictions = GNBClassifier.predict(X_test)
# calc_tss prints the confusion-matrix counts and returns the score
TSS_score = calc_tss(Y_test, predictions)
print(TSS_score)

TN=76294	FP=10862	FN=154	TP=1247
0.7654514099260152


Unfortunately, we don't see much improvement for our [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. 

**Note: The TA would like you to turn in assignments that have been run and have results, so make sure to do a restart and run all from the kernel menu. Then make sure to save before you turn it in. You might find it necessary to use the toy dataset if you have time constraints.**