# Project Template: Phase 2

Below are some concrete steps that you can take while doing your analysis for phase 3. This guide isn't "one size fits all", so you will probably not do everything listed, but it still serves as a good "pipeline" for how to do data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


## 2.1) Decide on what models you will use and compare

Select at least 3 models to compare on your prediction task. At least 2 of your models should be ones we've covered in class. 

Some resources try to help you select a well-performing model for your data:
* [sklearn's Flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
* [geeks4geeks Flowchart](https://www.geeksforgeeks.org/flowchart-for-basic-machine-learning-models/)
* [SAS Cheatsheet](https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet.png)

**Note**: These are general guides, and not guarantees of success. Some of the models are also outside of what we have covered, but you can explore them if you want to.

In addition to selecting a model you think will perform well, there are other reasons to select a model:
* To serve as a baseline (naive) approach you expect to outperform with more complex/appropriate models.
* You need a model that is human interpretable (e.g. Decision Tree).
* The model has historically performed well on similar tasks.
* Some properties of the model are effective for the type of data you have. Remember, at the end of most Seminars, you learned the strengths and weaknesses of each model.
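
To make the baseline idea concrete, here is a minimal sketch using scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The data and the names `X_demo`, `y_demo`, and `baseline` are synthetic placeholders, not part of the project dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic, imbalanced labels (illustrative only): 90 of class 2, 10 of class 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([2] * 90 + [3] * 10)

# A naive baseline that ignores the features and predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
baseline_macro_f1 = f1_score(y_demo, baseline_pred, average="macro", zero_division=0)
print(f"accuracy={baseline_accuracy:.2f}, macro F1={baseline_macro_f1:.2f}")
```

On imbalanced data a baseline like this can reach high accuracy while its macro F1 stays low, so it is worth comparing real models against it on both metrics.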

1. Model Decision tree classifier: I am selecting Decision tree classifier because...
2. Model AdaBoost classifier: I am selecting AdaBoost classifier because...
3. Model ANN: I am selecting ANN because...

I'm not sure exactly why I'll be using each of these models yet. I'm still learning, so my plan is to experiment with these classifiers to help develop an intuition for what works best.

## 2.2) Split into train and test
Make sure to split your data *before* you apply any transformations.

**Note**: If you have multiple records from the same object (e.g., multiple attempts from the same student), these should all go in either training or test, but not split between them. See the examples for how to accomplish this.
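
One way to keep all records from the same object on the same side of the split is scikit-learn's `GroupShuffleSplit`. The sketch below is only an illustration: the `demo` DataFrame and its `student_id` column are hypothetical stand-ins for whatever identifies an object in your data.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 8 attempts from 4 students (student_id is the group key).
demo = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "score": [0.2, 0.4, 0.9, 0.8, 0.5, 0.6, 0.1, 0.3],
    "passed": [0, 0, 1, 1, 1, 1, 0, 0],
})

# test_size is the fraction of *groups* (students) held out, not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(demo, groups=demo["student_id"]))

train_groups = set(demo.iloc[train_idx]["student_id"])
test_groups = set(demo.iloc[test_idx]["student_id"])
print("train students:", train_groups, "test students:", test_groups)
```

No student appears in both sets, so the test score reflects performance on students the model has not seen.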

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [8]:
df = pd.read_csv('data/clean.csv')
df

Unnamed: 0,Severity,Distance(mi),Temperature(F),Wind_Chill(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in),Sunrise_Sunset,...,start_month,start_day,start_hour,start_minute,start_second,end_month,end_day,end_hour,end_minute,end_second
0,3,3.230,42.1,36.1,58.0,29.76,10.0,10.4,0.00,1.0,...,2,8,0,37,8,2,8,6,37,8
1,2,0.747,36.9,36.1,91.0,29.68,10.0,10.4,0.02,1.0,...,2,8,5,56,20,2,8,11,56,20
2,2,0.055,36.0,36.1,97.0,29.70,10.0,10.4,0.02,1.0,...,2,8,6,15,39,2,8,12,15,39
3,2,0.123,39.0,36.1,55.0,29.65,10.0,10.4,0.02,1.0,...,2,8,6,51,45,2,8,12,51,45
4,3,0.500,37.0,29.8,93.0,29.69,10.0,10.4,0.01,0.0,...,2,8,7,53,43,2,8,13,53,43
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2845337,2,0.543,86.0,86.0,40.0,28.92,10.0,13.0,0.00,0.0,...,8,23,18,3,25,8,23,18,32,1
2845338,2,0.338,70.0,70.0,73.0,29.39,10.0,6.0,0.00,0.0,...,8,23,19,11,30,8,23,19,38,23
2845339,2,0.561,73.0,73.0,64.0,29.74,10.0,10.0,0.00,0.0,...,8,23,19,0,21,8,23,19,28,49
2845340,2,0.772,71.0,71.0,81.0,29.62,10.0,8.0,0.00,0.0,...,8,23,19,0,21,8,23,19,29,42


In [9]:
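# Target: accident severity; features: every other column
Y = df['Severity']
X = df.drop(columns=['Severity'])

# Hold out 20% of the rows for testing before any transformation
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)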

### 2.2.1) Sampling (If needed)

If one of your classes is very underrepresented (e.g. 1000 of Class 0; 200 of Class 1), you might consider oversampling the minority class (e.g. sample 1000 times with replacement from 200 instances), or undersampling the majority class (e.g. sample 200 times from 1000 instances).

Check out [np.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) for how to sample a vector.

**Note 1**: You should only ever sample the *training dataset*, never the test. After all, you can't choose the class distribution of your test data!

**Note 2**: Sampling can help a classifier perform better on the minority class, often at the cost of *overall* performance. But this is no guarantee. If you choose to sample, you should compare your classifiers' performance with and without sampling to see if it actually helped.

**Note 3**: Make sure you sample the *same* indices from your training features (`X_train`) and labels (`Y_train`); otherwise the rows and labels won't match anymore!
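
As a rough sketch of undersampling with `np.random.choice`, the example below reuses the class counts mentioned above (1000 of Class 0, 200 of Class 1). The arrays `X_train_demo` and `y_train_demo` are synthetic and only stand in for a real training set.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic training data (illustrative only): 1000 of class 0, 200 of class 1.
X_train_demo = np.random.normal(size=(1200, 3))
y_train_demo = np.array([0] * 1000 + [1] * 200)

majority_idx = np.where(y_train_demo == 0)[0]
minority_idx = np.where(y_train_demo == 1)[0]

# Undersample: keep as many majority rows as there are minority rows.
kept_majority_idx = np.random.choice(majority_idx, size=len(minority_idx), replace=False)

# Index features and labels with the SAME indices so they stay aligned.
sampled_idx = np.concatenate([kept_majority_idx, minority_idx])
X_sampled = X_train_demo[sampled_idx]
y_sampled = y_train_demo[sampled_idx]
print(X_sampled.shape, np.bincount(y_sampled))
```

To oversample instead, draw from `minority_idx` with `replace=True` and `size=len(majority_idx)`, then combine those indices with `majority_idx`.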


Play around with sampling below (or skip this step if you don't need sampling).

In [None]:
# df.groupby('Severity').apply(lambda x: x.sample(frac=0.01))

When you're done, write the `sample_data` function to perform sampling on any training dataset.

In [None]:
def sample_data(X_train, Y_train):
    """
    Input: The original X_train and Y_train training dataset
    Output: A new training dataset with sampling applied (same columns, different rows)
    """
    # For example, undersample the majority class, or oversample the minority class.
    
    return (X_train, Y_train)

## 2.3) Feature Transformation

Use your training data to fit any transformers or encoders you need, then apply the fitted transformer to your test data. This applies to:
* Normalizing/standardizing your features
* Using Bag of Words or TF-IDF to encode strings
* PCA or dimensionality reduction

**Rationale**: In practice, we won't be able to see the test data we'll be making predictions for, so we shouldn't use that data as the basis for any transformation or feature extraction.
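
The same fit-on-train, apply-to-test pattern holds for text encoders. The sketch below uses `TfidfVectorizer` on a few made-up sentences; `train_texts` and `test_texts` are hypothetical and not taken from the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up text snippets (illustrative only).
train_texts = ["heavy rain on the highway", "clear sky light traffic"]
test_texts = ["heavy fog on the bridge"]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only.
X_train_text = vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary; words unseen in training are ignored.
X_test_text = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(X_train_text.shape, X_test_text.shape)
```

Because the vocabulary comes from the training text alone, both matrices have the same columns, and test-only words such as "fog" do not influence the encoding.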

Try your feature transformation below:

When you're done, write the `apply_feature_transformation` function to perform transformation on any training/test split.

In [12]:
def apply_feature_transformation(X_train, X_test):
    """
    Input: The original X_train and X_test feature sets.
    Output: The transformed X_train and X_test feature sets.

    Scaling features
    """
    scaler = StandardScaler()
    scaler.fit(X_train)

    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    return X_train_scaled, X_test_scaled

In [13]:
X_train, X_test = apply_feature_transformation(X_train, X_test)

## 2.4) Train and Explore your Models
Now train the models you chose at the beginning. Run a preliminary evaluation of each to check that it is feasible before spending time tuning a model that performs poorly.

### Decision tree classifier

In [17]:
dt_clf = DecisionTreeClassifier(criterion='gini')
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)

In [18]:
print(confusion_matrix(y_test, dt_pred))
print('---------------------------------')
print(classification_report(y_test, dt_pred))

[[  2626   1773    600    176]
 [  1955 467541  19358  17581]
 [   668  16951  10287   3150]
 [   188  13445   2764  10006]]
---------------------------------
              precision    recall  f1-score   support

           1       0.48      0.51      0.49      5175
           2       0.94      0.92      0.93    506435
           3       0.31      0.33      0.32     31056
           4       0.32      0.38      0.35     26403

    accuracy                           0.86    569069
   macro avg       0.51      0.54      0.52    569069
weighted avg       0.87      0.86      0.87    569069



### AdaBoost classifier

In [19]:
ada_clf = AdaBoostClassifier()
ada_clf.fit(X_train, y_train)
ada_pred = ada_clf.predict(X_test)

In [20]:
print(confusion_matrix(y_test, ada_pred))
print('---------------------------------')
print(classification_report(y_test, ada_pred))

[[  1032   4128     15      0]
 [   667 505429    317     22]
 [   194  30655    193     14]
 [    36  26283     47     37]]
---------------------------------
              precision    recall  f1-score   support

           1       0.53      0.20      0.29      5175
           2       0.89      1.00      0.94    506435
           3       0.34      0.01      0.01     31056
           4       0.51      0.00      0.00     26403

    accuracy                           0.89    569069
   macro avg       0.57      0.30      0.31    569069
weighted avg       0.84      0.89      0.84    569069



### ANN (MLP classifier)

In [21]:
# Multi-layer perceptron (the ANN model chosen in 2.1)
mlp_clf = MLPClassifier(max_iter=200)
mlp_clf.fit(X_train, y_train)
mlp_pred = mlp_clf.predict(X_test)



In [None]:
print(confusion_matrix(y_test, mlp_pred))
print('---------------------------------')
print(classification_report(y_test, mlp_pred))

## 2.5) Hyperparameter Tuning
For promising models, tune them even further to squeeze out the best possible performance. Some questions to consider:

1. What hyperparameters should I tune? Why?
2. What value ranges should I choose for each parameter? Why?
3. Should I try the values manually, or use the [built-in tuning functions](https://scikit-learn.org/stable/modules/grid_search.html)?

**Make sure to only tune on the training dataset!**

In [None]:
from sklearn.model_selection import GridSearchCV

def find_best_hyperparameters_m1(X_train, Y_train):
    """
    Input: The training X features and Y labels/values
    Output: The fitted GridSearchCV object (best model in `best_estimator_`)
    """
    clf = None # Create your base classifier
    param_grid = {"param_1": [0, 1, 2],
                  "param_2": ['value1', 'value2']}
    
    # Tune with cross-validation on the training data only
    search = GridSearchCV(clf, param_grid)
    search.fit(X_train, Y_train)
    return search
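
For reference, here is one possible filled-in version of a grid search, assuming a decision tree as the base classifier. The parameter ranges, the `f1_macro` scoring choice, and the synthetic data from `make_classification` are illustrative assumptions, not recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical search space for a decision tree.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation inside the training data
    scoring="f1_macro",  # macro F1 weights every class equally
)
search.fit(X_tr, y_tr)   # the test split is never seen during tuning

best_params = search.best_params_
test_pred = search.predict(X_te)  # uses the refit best estimator
print(best_params)
```

Because `GridSearchCV` refits the best configuration on the full training split by default, the fitted `search` object can be used directly for prediction.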

## Put it All Together

Now, combine the "scratch work" that you did above into a tidy function that someone could use to replicate your work and process in a single step.

In [None]:
def evaluate_model1(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    hyperparameters = find_best_hyperparameters_m1(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions

In [None]:
def evaluate_model2(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    # You need to create a new hyperparameter selector for your second model, or remove this step
    hyperparameters = find_best_hyperparameters_m2(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions

In [None]:
def evaluate_model3(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    # You need to create a new hyperparameter selector for your third model, or remove this step
    hyperparameters = find_best_hyperparameters_m3(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions
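
As a rough sketch of what a completed function could look like, the example below scales the features, fits a decision tree, and returns predictions in one call. The function name `evaluate_decision_tree_demo` and the synthetic data are hypothetical; sampling and hyperparameter tuning are left out to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_decision_tree_demo(X_train, X_test, Y_train, Y_test):
    """Hypothetical end-to-end helper: scale, fit, predict, report."""
    # Fit the scaler on the training data only, then apply it to both splits.
    scaler = StandardScaler().fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train_scaled, Y_train)
    predictions = model.predict(X_test_scaled)

    print(classification_report(Y_test, predictions))
    return predictions


# Synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
demo_predictions = evaluate_decision_tree_demo(X_tr, X_te, y_tr, y_te)
```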