__NAME:__ __FULLNAME__  
__SECTION:__ __NUMBER__  
__CS 5703: Machine Learning Practice__

# Homework 10: FORESTS AND BOOSTING

## Assignment Overview
Follow the TODOs and read through and understand any provided code.  
If you have any questions, please post them to Slack.

### Task
For this assignment you will explore Random Forests and Boosting for the purpose of distinguishing Tropical Storms from Tropical Depressions given raw data.

### [Data set](https://www.kaggle.com/noaa/hurricane-database)
The dataset is based on cyclone weather data from NOAA.  
You can obtain the data from the server or from the git repository under datasets/cyclones.

We will be predicting whether a cyclone status is a tropical depression (TD) or not.  
Status can be the following types:  
* TD – tropical depression  
* TS – tropical storm   
* HU – hurricane intensity  
* EX – Extratropical cyclone  
* SD – subtropical depression intensity  
* SS – subtropical storm intensity  
* LO – low, neither a tropical, subtropical, nor extratropical cyclone  
* WV – Tropical Wave  
* DB – Disturbance  
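
As a rough illustration of how these status codes turn into a binary target, the sketch below labels a tiny hand-made frame (the rows are invented, not taken from the NOAA data). It assumes the class ordering used later in this notebook, where `['TS', 'TD']` gives TS the label 0 and TD the label 1, and every other status stays NaN so that it can be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; real Status strings may be padded
toy = pd.DataFrame({"Status": [" TS", " TD", " HU", " TD", " EX"]})

classes = ["TS", "TD"]          # position in this list becomes the label
toy["label"] = np.nan
for i, c in enumerate(classes):
    toy.loc[toy["Status"].str.contains(c), "label"] = i

# Rows whose status is not in `classes` keep a NaN label and are removed
labeled = toy.dropna(subset=["label"])
print(labeled)
```

Using `str.contains` rather than an exact match tolerates padding around the code, at the cost of also matching any longer string that happens to contain it.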


### Objectives
Gain experience with:
* DecisionTreeClassifiers
* RandomForests
* Boosting

### General References
* [Guide to Jupyter](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
* [Python Built-in Functions](https://docs.python.org/3/library/functions.html)
* [Python Data Structures](https://docs.python.org/3/tutorial/datastructures.html)
* [Numpy Reference](https://docs.scipy.org/doc/numpy/reference/index.html)
* [Numpy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
* [Summary of matplotlib](https://matplotlib.org/3.1.1/api/pyplot_summary.html)
* [DataCamp: Matplotlib](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
* [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
* [Sci-kit Learn Trees](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
* [Sci-kit Learn Ensemble Models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)
* [Sci-kit Learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
* [Sci-kit Learn Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)
* [Sci-kit Learn Pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [Sci-kit Learn Preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
* [Decision Trees](https://medium.com/machine-learning-101/chapter-3-decision-trees-theory-e7398adac567)

### Hand-In Procedure
* Execute all cells so that they show correct results
* Notebook (from Jupyter or Colab):
  + Submit this file (.ipynb) to the Gradescope Notebook HW10 dropbox
* Note: there is no need to submit a PDF file or to submit directly to Canvas

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np

import scipy.stats as stats
import os, re, fnmatch
import pathlib, itertools, time
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patheffects as peffects
import time as timelib

from math import ceil, floor
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, LabelEncoder
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, confusion_matrix, roc_curve, auc
from sklearn.metrics import f1_score, mean_squared_error, classification_report
from sklearn.metrics import precision_recall_fscore_support, precision_recall_curve
import joblib
from IPython import display


from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Default figure parameters
plt.rcParams['figure.figsize'] = (6,5)
plt.rcParams['font.size'] = 10
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['figure.constrained_layout.use'] = False
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12


plt.style.use('ggplot')

In [None]:
# COLAB ONLY
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# COLAB ONLY
## We've discovered a better way to do imports into colab
## Now, instead of executing the files, we will copy them
## into your colab VM, then import them as normal
import pathlib
import shutil
import re

#folder = "."
# TODO: set to your Google Drive folder that contains the py files
folder = '/content/drive/MyDrive/hw10'

# Copy library python files to local directory
for n in pathlib.Path(folder).iterdir():
  if re.search(r'\.py$', n.name):
    shutil.copy(n, n.name)

In [None]:
# LOCAL MACHINES AND COLAB
import visualize
import metrics_plots
from pipeline_components import DataSampleDropper, DataFrameSelector
from pipeline_components import DataScaler, DataLabelEncoder

In [None]:
""" PROVIDED
Functions for exporting trees to .dot and .pngs
"""
from PIL import Image
def image_combine(ntrees, big_name='big_tree.png', fname_fmt='tree_%02d.png'):
    '''
    Function for combining some of the trees in the forest into one image
    Amalgamate the pngs of the trees into one big image
    PARAMS:
        ntrees: number of trees from the ensemble to export
        big_name: file name for the png containing all ntrees
        fname_fmt: file name format string used to read the exported files
    '''
    # Read the pngs
    imgs = [Image.open(fname_fmt % x) for x in range(ntrees)]

    # Determine the individual and total sizes
    widths, heights = zip(*(i.size for i in imgs))
    total_width = sum(widths)
    max_height = max(heights)

    # Create the combined image
    big_img = Image.new('RGB', (total_width, max_height))
    x_offset = 0
    for im in imgs:
        big_img.paste(im, (x_offset, 0))
        x_offset += im.size[0]
    big_img.save(big_name) 
    print("Created %s" % big_name)
    return big_img

def export_trees(forest, ntrees=3, fname_fmt='tree_%02d'):
    '''
    Write trees into individual files from the forest
    PARAMS:
        forest: ensemble of trees classifier
        ntrees: number of trees from the ensemble to export
        fname_fmt: file name format string used to name the exported files
    '''
    for t in range(ntrees):
        estimator = forest.estimators_[t]
        basename = fname_fmt % t
        fname = basename + '.dot'
        pngname = basename + '.png'
        export_graphviz(estimator, out_file=fname, rounded=True, filled=True)
        # Command line instruction to execute dot and create the image
        !dot -Tpng {fname} > {pngname}
        print("Created %s and %s" % (fname, pngname))


In [None]:
''' PROVIDED

Data set organization functions

'''
def to_numerical(coord):
    '''
    Convert Latitude and Longitude into numerical values

    '''
    
    direction = re.findall(r'[NSWE]' , coord)[0]
    num = re.match(r'\d{1,3}\.\d?', coord)[0]
    
    # North and East are positive directions
    if direction in ['N', 'E']:
        return float(num)
    return -1. * float(num)


def clean_data_set(df, classes=['TD', 'HU', 'TS'], fix_columns=[]):
    """ PROVIDED
    Make adjustments to the data.

    For wind speed, NaNs are currently represented by -999.
    We will replace these with NaN.

    For Latitude and Longitude, these are strings such as 
    28.0W. We will replace these with numerical values where
    positive directions are N and E, and negative directions 
    are S and W.
    """
    
    # Convert -999 values to NaNs. These are missing values
    NaNvalue = -999
    df = df.replace(NaNvalue, np.nan).copy()
    
    # Impute NaNs with the column median for columns in fix_columns
    for c in fix_columns:
        med = df[c].median()
        df[c] = df[c].fillna(med)

    # Set the datatype of the categorical attributes
    cate_attribs = ['Event', 'Status']
    df[cate_attribs] = df[cate_attribs].astype('category')

    # Set the datatype of the Date attribute to datetime64[ns]
    df['Date'] = df['Date'].astype('datetime64[ns]')
    
    # Clean up lat/long
    df['Latitude'] = df['Latitude'].apply(to_numerical)
    df['Longitude'] = df['Longitude'].apply(to_numerical)

    # class label is defined by the order in the classes parameter
    # All other labels will be NaN
    
    for i,c in enumerate(classes):
        isclass = df['Status'].str.contains(c)
        df.loc[isclass, 'label'] = i
        
    return df
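
The coordinate conversion above can be checked in isolation. The following is a minimal, standalone sketch (the helper name `signed_coordinate` is made up for this example and is not part of the provided code) that turns strings such as `28.0N` or `94.8W` into signed floats, with N and E positive. Unlike the provided function, it tolerates surrounding whitespace and raises an explicit error on unrecognized input; that stricter behavior is a choice made for this sketch.

```python
import re

def signed_coordinate(coord):
    """Convert a string like '28.0N' or '94.8W' to a signed float."""
    # A number with an optional decimal part, followed by a hemisphere letter
    m = re.fullmatch(r"\s*(\d{1,3}(?:\.\d+)?)\s*([NSEW])\s*", coord)
    if m is None:
        raise ValueError(f"Unrecognized coordinate: {coord!r}")
    value = float(m.group(1))
    # North and East are positive; South and West are negative
    return value if m.group(2) in ("N", "E") else -value

print(signed_coordinate("28.0N"), signed_coordinate("94.8W"))
```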

In [None]:
''' PROVIDED

Performance report generator
'''
def generate_performance_report(model, Xtrain, ytrain, 
                                Xval, yval, targetnames):
    '''
    Produce a performance report for a model as a function of the training
    and validation data sets.  Includes:
    - Confusion matrices
    - KS, ROC, and PR curves
    '''
    
    # Compute the model's predictions.
    preds = model.predict(Xtrain)
    preds_val = model.predict(Xval)

    # Compute the prediction probabilities. 
    proba = model.predict_proba(Xtrain)
    proba_val = model.predict_proba(Xval)

    # Compute the model's mean accuracy. 
    score = model.score(Xtrain, ytrain) 
    score_val = model.score(Xval, yval)
    
    print("Training Score: %.4f" % score)
    print("Validation Score: %.4f" % score_val)
    
    # Confusion Matrix
    cmtx = confusion_matrix(ytrain, preds)
    cmtx_val = confusion_matrix(yval, preds_val)
    metrics_plots.confusion_mtx_colormap(cmtx, targetnames, targetnames)
    metrics_plots.confusion_mtx_colormap(cmtx_val, targetnames, targetnames)

    # KS, ROC, and PRC Curves
    roc_prc_results = metrics_plots.ks_roc_prc_plot(ytrain, proba[:,1])
    roc_prc_results_val = metrics_plots.ks_roc_prc_plot(yval, proba_val[:,1])

    # Compute the PSS and F1 Score
    pss_val = metrics_plots.skillScore(yval, preds_val)
    f1_val = f1_score(yval, preds_val)
    print("Val PSS: %.4f" % pss_val[0])
    print("Val F1 Score: %.4f" % f1_val)
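
The report prints a PSS value, but how `metrics_plots.skillScore` computes it is not shown in this notebook. Assuming PSS refers to the Peirce skill score in its common form (true positive rate minus false positive rate), a minimal sketch looks like the following; the function name `peirce_skill_score` is hypothetical.

```python
import numpy as np

def peirce_skill_score(y_true, y_pred):
    """PSS = TPR - FPR for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    # Assumes both classes are present, so neither denominator is zero
    return tp / (tp + fn) - fp / (fp + tn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(peirce_skill_score(y_true, y_pred))
```

Under this definition a perfect classifier scores 1, while a classifier that always predicts the same class scores 0, which makes the score less sensitive to class imbalance than plain accuracy.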

# LOAD DATA

In [None]:
# TODO: set appropriately

filename_val = '/content/drive/MyDrive/MLP_2022/datasets/cyclones/pacific.csv'
#filename_val = 'cyclones/pacific.csv'
filename_tr = '/content/drive/MyDrive/MLP_2022/datasets/cyclones/atlantic.csv'
#filename_tr = 'cyclones/atlantic.csv'


# Read both files
cyclones_val = pd.read_csv(filename_val)
nRows, nCols = cyclones_val.shape
print(f'Validation: {nRows} rows and {nCols} columns')

cyclones_tr = pd.read_csv(filename_tr)
nRows, nCols = cyclones_tr.shape
print(f'Training: {nRows} rows and {nCols} columns')

In [None]:
cyclones_tr.columns

In [None]:
""" PROVIDED
Clean up the data frames
"""
targetnames = ['TS', 'TD']
inter_columns = ['Maximum Wind', 'Minimum Pressure', 'Low Wind NE',
       'Low Wind SE', 'Low Wind SW', 'Low Wind NW', 'Moderate Wind NE',
       'Moderate Wind SE', 'Moderate Wind SW', 'Moderate Wind NW',
       'High Wind NE', 'High Wind SE', 'High Wind SW', 'High Wind NW']

df_val = clean_data_set(cyclones_val, classes=targetnames, fix_columns=inter_columns)
df_tr = clean_data_set(cyclones_tr, classes=targetnames, fix_columns=inter_columns)

In [None]:
""" PROVIDED
Display the quantity of NaNs for each feature
"""
df_tr.isna().sum()
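
To see what the sentinel replacement and median fill in `clean_data_set` do, here is a small sketch on invented values (the numbers are not from the cyclone data). It replaces -999 with NaN, counts the NaNs per column, and then fills each column with its own median.

```python
import numpy as np
import pandas as pd

# Invented values; -999 marks a missing reading, as in the cleaning step
toy = pd.DataFrame({"Maximum Wind": [30, -999, 50, 40],
                    "Minimum Pressure": [-999, 1005, -999, 1010]})

toy = toy.replace(-999, np.nan)
nan_counts = toy.isna().sum()          # NaNs per column before filling

# Fill each column's NaNs with that column's median (NaNs are ignored
# when the median is computed)
filled = toy.fillna(toy.median())
print(nan_counts)
print(filled)
```

Columns that are not median-filled keep their NaNs, which is why a later step that drops incomplete rows can still remove samples.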

In [None]:
""" PROVIDED
Display summary statistics for each feature of the dataframe
"""
df_tr.describe()

# PRE-PROCESS DATA

In [None]:
df_tr.columns

In [None]:
""" PROVIDED
Construct preprocessing pipeline
"""
# Features to use for prediction + the target label (last item)
selected_features = ['Latitude', 'Longitude', 
                     'Low Wind NE',
                     'Low Wind SE', 
                     'Low Wind SW',
                     'Low Wind NW',
                     'Moderate Wind NE', 
                     'Minimum Pressure',
                     'Moderate Wind SE', 
                     'Moderate Wind NW',
                     'Moderate Wind SW',
                     'High Wind NE', 
                     'High Wind NW',
                     'High Wind SE',
                     'label']

# Pipeline for filtering the data
pipe = Pipeline([
    ('FeatureSelector', DataFrameSelector(selected_features)),
    ('RowDropper', DataSampleDropper())
])
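
The implementations of `DataFrameSelector` and `DataSampleDropper` live in `pipeline_components` and are not shown here. The sketch below is only a guess at their behavior, based on their names and on how they are used in this notebook: one transformer keeps a list of columns, the other drops rows that contain any NaN. The class names `ColumnPicker` and `NaNRowDropper` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ColumnPicker(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DataFrameSelector: keep the named columns."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns].copy()

class NaNRowDropper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for DataSampleDropper: drop rows with any NaN."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.dropna(how="any")

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0], "c": [7, 8, 9]})
toy_pipe = Pipeline([("select", ColumnPicker(["a", "b"])),
                     ("drop", NaNRowDropper())])
out = toy_pipe.fit_transform(toy)
print(out)
```

Because neither step learns anything from the data in this sketch, calling `fit_transform` on a second data set gives the same result as calling `transform`.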

In [None]:
""" PROVIDED
Pre-process the data using the defined pipeline
"""
tr_data = pipe.fit_transform(df_tr)
nsamples, ncols = tr_data.shape
nsamples, ncols

In [None]:
""" PROVIDED
Pre-process the data using the defined pipeline
"""
val_data = pipe.fit_transform(df_val)
nsamples, ncols = val_data.shape
nsamples, ncols

In [None]:
""" PROVIDED
Verify all NaNs removed
"""
tr_data.isna().any()

In [None]:
""" PROVIDED
Verify all NaNs removed
"""
val_data.isna().any()

# VISUALIZE DATA

In [None]:
""" PROVIDED
Display the Pearson correlation between all pairs of the features
use visualize.scatter_corrplots
"""
cdata = tr_data.astype('float64').copy()
visualize.scatter_corrplots(cdata.values, cdata.columns, corrfmt="%.1f", FIGW=15)
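
The correlation values shown in the plot grid can also be inspected numerically. As a small sketch on synthetic columns (not the cyclone features), `DataFrame.corr` returns the Pearson correlation for every pair of columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"x": x,
                    "twice_x": 2 * x,                 # perfectly correlated with x
                    "noise": rng.normal(size=200)})   # generated independently of x

corr = toy.corr(method="pearson")
print(corr.round(2))
```

Pearson correlation only captures linear relationships, so a feature with a low correlation to the label can still be useful to a tree-based model.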

## Reflection #1

a.  Which features do you expect to be most relevant for predicting the label?

__TODO__




# Create Training and Validation Data Sets

In [None]:
""" PROVIDED
Create the training and validation data sets

"""

X = tr_data.drop(['label'], axis=1).values
y = tr_data['label'].astype('int64').values

# We originally were planning to use the other data set as validation, but 
#  the atlantic and pacific are very different conditions

#Xval = val_data.drop(['label'], axis=1).copy()
#yval = val_data['label'].astype('int64').copy()

# Subsample the training set
#Xtrain, Xval, ytrain, yval = train_test_split(X, y, stratify=y, test_size=0.5)#, random_state=42)

# Because there is temporal autocorrelation, we split the data in order 
#  (no shuffling): the first rows form the training set and the remaining 
#  rows form the validation set
split = 14000
Xtrain = X[:split,:]
ytrain = y[:split]
Xval = X[split:,:]
yval = y[split:]
Xtrain.shape, ytrain.shape, Xval.shape, yval.shape
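
The order-preserving split above can be written as a small helper. This is a sketch on toy arrays; the function name `chronological_split` is made up for the example, and it assumes the rows are already sorted in time.

```python
import numpy as np

def chronological_split(X, y, split):
    """Rows before `split` go to training, the rest to validation."""
    return X[:split], X[split:], y[:split], y[split:]

# Toy arrays standing in for time-ordered samples
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
Xa, Xb, ya, yb = chronological_split(X_toy, y_toy, split=7)
print(Xa.shape, Xb.shape)
```

Keeping the order means that neighboring, highly similar samples do not end up on both sides of the split, which a shuffled `train_test_split` would allow.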

# DECISION TREE CLASSIFIER

In [None]:
""" TODO
Create and train a DecisionTree for comparison with the ensemble methods 

Select appropriate parameters for the Decision Tree Classifier

"""
tree_clf = DecisionTreeClassifier( TODO )
tree_clf.fit(Xtrain, ytrain)
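
One way to get a feel for the effect of tree size is to sweep `max_depth` and compare training and validation accuracy. The sketch below does this on synthetic data from `make_classification`, so the numbers say nothing about the cyclone data, and the depths tried are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the cyclone features are not used here
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   n_informative=4, random_state=0)
X_tr, X_va = X_toy[:400], X_toy[400:]
y_tr, y_va = y_toy[:400], y_toy[400:]

scores = {}
for depth in (2, 5, None):     # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print(depth, scores[depth])
```

A growing gap between the two scores as depth increases is the usual sign of overfitting; other parameters such as `min_samples_leaf` or `max_leaf_nodes` limit tree size in a similar way.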

In [None]:
""" PROVIDED
Compute the predictions, prediction probabilities, and the accuracy scores
for the training and validation sets for the learned instance of the model

Display the confusion matrix, KS plot, ROC curve, and PR curve for the training 
and validation sets using metrics_plots.ks_roc_prc_plot

The red dashed line in the ROC and PR plots indicates the expected 
performance of a random classifier, which would predict positives at the 
rate of occurrence within the data set
"""

generate_performance_report(tree_clf, Xtrain, ytrain,
                            Xval, yval,
                            targetnames)

In [None]:
""" PROVIDED
Export the tree as a .dot file and create the png
"""
fname = 'tree.dot'
pngname = 'tree.png'
export_graphviz(tree_clf, feature_names=selected_features[:-1],
                class_names=targetnames, out_file=fname, 
                rounded=True, filled=True)

# If the following command does not work, you can manually convert
# the dot file into a png here: 
#  https://onlineconvertfree.com/convert-format/dot-to-png/
!dot -Tpng {fname} -o {pngname}

In [None]:
display.Image("tree.png")

## Reflection #2

a. Compare the performance between small, medium-sized and large trees with respect to the validation set.

__TODO__


# RANDOM FOREST CLASSIFIER

In [None]:
""" TODO
Create and train a RandomForest
Explore various configurations of the hyper-parameters. 

Train the models on the training set and evaluate them for the training and
validation sets.

Examine the API and the book for the meaning and impact of different 
hyper-parameters
"""
forest_clf = RandomForestClassifier( TODO )
forest_clf.fit(Xtrain, ytrain)
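
As a starting point for exploring the hyper-parameters, the sketch below fits a forest on synthetic data and names a few commonly tuned arguments. The specific values are arbitrary and are not a recommendation for the cyclone data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the cyclone features are not used here
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   n_informative=4, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_depth=6,           # cap on the depth of each tree
    max_features="sqrt",   # number of features considered at each split
    oob_score=True,        # estimate accuracy from out-of-bag samples
    random_state=0,
)
forest.fit(X_toy, y_toy)
print(len(forest.estimators_), forest.oob_score_)
```

The out-of-bag score uses, for each sample, only the trees that did not see it during training, so it gives a quick generalization estimate without a separate hold-out set. It assumes samples are exchangeable, which may be optimistic for time-ordered data like this.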

In [None]:
""" PROVIDED
Export some trees from your favorite model as a .dot file
We can use the estimators_ attribute of the forest to get a list of the trees

Amalgamate the pngs of the trees into one big image
"""
ntrees = 2

'''
This will work on colab

If running on a local machine, and if the dot command does not work on your computer,
please modify the export_trees function by commenting out the line where the dot 
command is being invoked. Then you can manually convert each dot file into a png file 
at the following website:
https://onlineconvertfree.com/convert-format/dot-to-png/
After converting all of the dot files into pngs, you should be able to use the 
image_combine() function
'''
export_trees(forest_clf, ntrees=ntrees,
             fname_fmt='e_rf_model_%02d')
big_img = image_combine(ntrees, big_name='e_rf_model.png', 
                        fname_fmt='e_rf_model_%02d.png')

In [None]:
''' PROVIDED
Display the tree file
'''
display.Image("e_rf_model.png")


### TRAINING AND VALIDATION RESULTS

In [None]:
""" PROVIDED
Compute the predictions, prediction probabilities, and the accuracy scores
for the training and validation sets for the learned instance of the model

Display the confusion matrix, KS plot, ROC curve, and PR curve for the training 
and validation sets using metrics_plots.ks_roc_prc_plot

The red dashed line in the ROC and PR plots indicates the expected 
performance of a random classifier, which would predict positives at the 
rate of occurrence within the data set
"""

generate_performance_report(forest_clf, Xtrain, ytrain,
                            Xval, yval,
                            targetnames)
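
The area under the PR curve can also be computed directly from predicted probabilities. The internals of `metrics_plots.ks_roc_prc_plot` are not shown in this notebook, so the sketch below is only one plausible way to obtain a PR AUC, using `precision_recall_curve` and `auc` from scikit-learn on a four-sample toy example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probability of class 1

precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)            # trapezoidal area under the PR curve
baseline = y_true.mean()                   # precision of a random classifier
print(pr_auc, baseline)
```

The baseline is the fraction of positives in the data, which corresponds to the dashed reference line described for the PR plot.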

# ADABOOSTING

In [None]:
""" TODO
Create and train a Boosting model 

Explore various boosting models to improve your validation performance.
Train the models on the training set and evaluate them for the training and
validation sets. Try boosting the benchmark tree_clf 
"""
tree_clf2 = DecisionTreeClassifier( TODO )

ada_clf = AdaBoostClassifier( TODO )

ada_clf.fit(Xtrain, ytrain)
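
To illustrate the mechanics of boosting a weak tree, the sketch below compares a single depth-1 tree (a stump) with an AdaBoost ensemble of stumps on synthetic data. The parameter values are arbitrary, and the results do not carry over to the cyclone data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the cyclone features are not used here
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   n_informative=4, random_state=0)
X_tr, X_va = X_toy[:400], X_toy[400:]
y_tr, y_va = y_toy[:400], y_toy[400:]

stump = DecisionTreeClassifier(max_depth=1, random_state=0)
# The weak learner is passed positionally because its keyword name
# differs between scikit-learn versions
booster = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5,
                             random_state=0)
booster.fit(X_tr, y_tr)     # fits clones of the stump, one per boosting round
stump.fit(X_tr, y_tr)       # fit the single stump for comparison
print(stump.score(X_va, y_va), booster.score(X_va, y_va))
```

Each boosting round reweights the training samples so that the next weak learner concentrates on the examples the current ensemble gets wrong; `learning_rate` scales how much each learner contributes.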

### TRAINING AND VALIDATION RESULTS

In [None]:
""" PROVIDED
Compute the predictions, prediction probabilities, and the accuracy scores
for the training and validation sets for the learned instance of the model

Display the confusion matrix, KS plot, ROC curve, and PR curve for the training 
and validation sets using metrics_plots.ks_roc_prc_plot

The red dashed line in the ROC and PR plots indicates the expected 
performance of a random classifier, which would predict positives at the 
rate of occurrence within the data set
"""

generate_performance_report(ada_clf, Xtrain, ytrain,
                            Xval, yval,
                            targetnames)

In [None]:
""" PROVIDED
Export some trees from your favorite model as a .dot file
We can use the estimators_ attribute of the forest to get a list of the trees

Amalgamate the pngs of the trees into one big image
"""
ntrees = 2

'''
This will work on colab

If running on a local machine, and if the dot command does not work on your computer,
please modify the export_trees function by commenting out the line where the dot 
command is being invoked. Then you can manually convert each dot file into a png file 
at the following website:
https://onlineconvertfree.com/convert-format/dot-to-png/
After converting all of the dot files into pngs, you should be able to use the 
image_combine() function
'''
export_trees(ada_clf, ntrees=ntrees,
             fname_fmt='e_ada_model_%02d')
big_img = image_combine(ntrees, big_name='e_ada_model.png', 
                        fname_fmt='e_ada_model_%02d.png')

In [None]:
''' PROVIDED
Display the tree file
'''
display.Image("e_ada_model.png")


## Reflection #3

a. Compare the PR_AUC validation performance for your best single decision tree to both a reasonable forest and a reasonable ada-boosted forest.

__TODO__



b.  Explain why Boosting shows an improvement in performance relative to the Random Forest (or, at least should, in many problems).

__TODO__



# FEATURE IMPORTANCE

In [None]:
""" PROVIDED
Display the feature importances
See the API for RandomForests and boosted trees
A DataFrame helps with the display
"""
feature_imp = pd.DataFrame([tree_clf.feature_importances_, 
                            forest_clf.feature_importances_,
                            ada_clf.feature_importances_], 
                           columns=selected_features[:-1], 
                           index=['DecisionTree', 'RandomForest', 'AdaBoosting']).T
feature_imp.plot.bar()
plt.xlabel('Feature Name')
plt.ylabel('Fraction of Importance')
plt.title('Feature Importance')
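
The `feature_importances_` attribute used above can be explored on synthetic data where the informative features are known in advance. In this sketch the feature names `f0` to `f5` are made up, and `shuffle=False` keeps the two informative features in the first two columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in the first columns
X_toy, y_toy = make_classification(n_samples=500, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
names = [f"f{i}" for i in range(6)]     # made-up feature names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)
importances = pd.Series(forest.feature_importances_, index=names)
print(importances.sort_values(ascending=False))
```

The importances sum to 1 for each model, which is why they can be compared across models as fractions. They are impurity-based, so correlated features can split importance between them.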

## Reflection #4

a. Which features were most important for all three models?

__TODO__



b. Which features show the biggest differences in importance across the three models?

__TODO__
