In [None]:
'''
.....IMPORTANT USAGE INSTRUCTIONS........

##### IF USING CHPC - UTAH #####

1. Download this Jupyter Notebook to a local location on your Computer
2. Go to https://ondemand.chpc.utah.edu and sign in using your uNID and Password.
3. At the Top of the Page, notice the Menu "Interactive Apps". Click and Choose "Jupyter Notebook on Notchpeak"
4. A form will open, enter all details, and then Launch a Jupyter Notebook. It will take a minute.
5. Click on "Connect to Jupyter"
6. Once Jupyter launches, notice the "Upload" button at the top right. Use it to upload this Notebook.
7. The Notebook will be uploaded. Finish writing the Code wherever specified.
8. Run each Block of Code and then finally download the Jupyter Notebook by going to File >> Download as >>


##### IF USING GOOGLE COLAB #####

1. Download this Jupyter Notebook to a local location on your Computer
2. Go to https://colab.research.google.com/ and sign in using your Google Account - So that your work is saved in
   your Google Drive permanently.
3. Go to File >> Upload Notebook.
4. Finish writing the Code wherever specified.
5. Run each Block of Code and then finally download the Jupyter Notebook by going to File >> Download .ipynb

'''

In [None]:
'''
.....IMPORTANT SUBMISSION INSTRUCTIONS........

Once everything runs successfully, download the jupyter notebook and attach that to your submission in Canvas. 
During evaluation, I will run your Jupyter Notebook to verify that everything is running as expected.

Do not forget to include your main results and plots in your LaTeX file (with other homework questions) 
before submission.

'''

In [None]:
'''

Problem 1 - Refer to your own code from HW - 1


Notes:

(i) You are implementing a ridge regression. Recall what purpose the penalty term serves.
(ii) By design, we expect your estimates for the 9th degree polynomial to shrink when compared to the estimates
generated in HW - 1. If you are using the same seed as HW - 1, you can observe the shrinkage yourself.
(iii) Also notice what happens to the estimates as you vary the shrinkage parameter.


'''
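
A minimal sketch of notes (ii) and (iii), assuming synthetic data (a noisy sine curve) in place of the HW - 1 data and the closed-form ridge estimate w = (X^T X + lam I)^(-1) X^T y. The names ridge_fit and lam, the data, and the choice to penalize the intercept along with the other coefficients are illustrative assumptions, not the HW - 1 code.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic stand-in for the HW - 1 data (assumption): noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix for a 9th degree polynomial: columns 1, x, x^2, ..., x^9
X = np.vander(x, N=10, increasing=True)

# Larger values of the shrinkage parameter lam pull the coefficient vector towards zero
norms = {}
for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    w = ridge_fit(X, y, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lam = {lam:g}: ||w|| = {norms[lam]:.3f}")
```

The printed norms should fall as lam grows, which is the shrinkage the notes refer to.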

In [None]:
'''

Problem 2.1 - Code has been given to you. Make sure you understand every line of the import step and data structuring

'''

In [2]:
#Libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import math
from collections import Counter
from sklearn.utils import shuffle

#Matplotlib Settings with ggplot Theme
from matplotlib import pyplot as plt
plt.style.use('ggplot')

# Fixing random state for reproducibility
np.random.seed(19680801)


In [3]:

#Data Import and Preprocessing

'''
Note : The dataset data_seed.dat can be downloaded from Canvas or github. In this notebook, I have placed this file
       directly under my home directory indicated by ~/.
       
       In CHPC - Your home directory is /uufs/chpc.utah.edu/common/home/<uNID>
       
       In Google Colab - On the Left Hand Menu, Click on Files and then Upload this data file.

'''

#Import Data File into Pandas - 210 Rows and 8 Columns. The file has no header row
df = pd.read_csv('~/data_seed.dat', sep=r'\s+', header=None, skiprows=0)

#Add Column Names
df.columns = ['A', 'P', 'C', 'L_Kern','W_Kern', 'Asy_Coeff','L_Kern_Grv','Y']

#Scale X Columns -  The idea is to Scale each column in the dataset using built in Standard Scaler
cols_to_norm = ['A', 'P', 'C', 'L_Kern','W_Kern', 'Asy_Coeff','L_Kern_Grv']
df[cols_to_norm] = StandardScaler().fit_transform(df[cols_to_norm])

#Shuffle the Dataset and then split into 5 separate dataframes to be used later for CV
seeds = shuffle(df)

#Split shuffled data into 5 data frames of size 42 each.
split_size = int(seeds.shape[0]/5)

fold_1 = seeds.iloc[0:split_size]
fold_2 = seeds.iloc[split_size: 2*split_size]
fold_3 = seeds.iloc[2*split_size: 3*split_size]
fold_4 = seeds.iloc[3*split_size: 4*split_size]
fold_5 = seeds.iloc[4*split_size: 5*split_size]
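
To make the scaling step above concrete, here is a small sketch on a made-up array (not the seeds data). It checks that StandardScaler subtracts each column's mean and divides by the column's population standard deviation (ddof=0). The names toy, scaled, and manual are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up 3 x 2 array standing in for the feature columns
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual equivalent: centre each column, then divide by its population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
```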


In [4]:
#Let's check what the dataset looks like
df.head()

          A         P         C    L_Kern    W_Kern  Asy_Coeff  L_Kern_Grv  Y
0  0.142098  0.215462  0.000061  0.304218  0.141702  -0.986152   -0.383577  1
1  0.011188  0.008224  0.428515 -0.168625  0.197432  -1.788166   -0.922013  1
2 -0.192067 -0.360201  1.442383 -0.763637  0.208048  -0.667479   -1.189192  1
3 -0.347091 -0.475333  1.039381 -0.688978  0.319508  -0.960818   -1.229983  1
4  0.445257  0.330595  1.374509  0.066666  0.805159  -1.563495   -0.475356  1


In [None]:
'''

Problem 2.2 - Fill in the missing pieces for the k-NN implementation when you see a "Step -" prompt. Use the rest of the code to generate 
necessary metrics. 

Note: You can generate a completely different code if you find it easier, or use chunks of this code as you please.

'''

In [None]:
'''

The Logic behind k-NN

i. You want to find the k closest neighbors of any given point and assign to that point the most common class 
(either by vote or by computing an average) found in the neighborhood defined by those k closest points.

ii. Computationally, k-NN is called a lazy method because there is no training involved. Given a set of points 
with known classes (call it H), you can directly predict the class of a new point (t). As discussed in (i), we 
take t, compute its distance from all points in H, and choose the "k" points in H that are closest to t. We then 
take the most common class among those k by voting or by computing an average. This common class is the 
predicted class for t.

iii. However, the question is how do we know what the value of "k" should be? The answer is that we do not know this 
a priori. Hence we have to do a grid search over a set of candidate values of "k" and use a resampling method such as 
5 fold CV or LOOCV to get a more robust answer.

iv. In the context of this problem, we consider both cases i.e. (a) 5 fold CV, (b) LOOCV

(a) 5 fold CV - We create the 5 folds (Code provided). We make 5 passes through the data. In each pass, we reserve 
one fold and use the other 4 folds to predict the classes of the held out fold. We then use the prediction and the 
true label to compute an accuracy_score for that pass. At the end of 5 passes, we average the accuracy_score values 
and report the result as the 5 fold cross validated accuracy_score.

(b) LOOCV - Same as above, except that instead of 5 passes we make "n" (size of the dataset) passes. In each pass 
we hold out just one data point, use all the others to predict its class, and record a score of 1 (true label 
matches predicted) or 0 (otherwise). In the end we count the number of 1's and divide by n to get the 
accuracy_score.


Implementation Note:

Remember that there is no training phase. For predicting the class of any point (from one of the five folds) we just
look at its neighbourhood in the set of other 4 folds taken together.

'''
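
The logic in (i) and (ii) can be sketched for a single query point as follows. This is one possible approach on made-up data, not the required solution. The function name knn_predict_one and the tie-breaking rule (among tied classes, take the class of the nearest neighbor) are assumptions chosen for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict_one(train_X, train_y, point, k):
    """Predict the class of `point` by majority vote among its k nearest rows of train_X."""
    # Euclidean distance from `point` to every training row
    dists = np.linalg.norm(train_X - point, axis=1)
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(dists, kind="stable")[:k]
    labels = train_y[nearest].tolist()
    counts = Counter(labels)
    top = max(counts.values())
    tied = {label for label, c in counts.items() if c == top}
    # Tie-break (one possible choice): among tied classes, pick the class of the closest neighbor
    for label in labels:
        if label in tied:
            return label

# Tiny made-up example: two classes in 2-D
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
train_y = np.array([1, 1, 2, 2])

print(knn_predict_one(train_X, train_y, np.array([0.2, 0.1]), k=3))
```

With k=4 the vote in this toy example is tied two against two, and the rule above falls back to the class of the nearest neighbor. Other tie-breaking rules (for example, reducing k by one) are equally defensible.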

In [None]:

#Function that finds the Euclidean distance between each row of a and the point b.
#Here a is a 2-D numpy array (one point per row) and b is a single point that broadcasts across the rows of a.
def pair_euclid_dist(a,b):
    return np.linalg.norm(a-b, axis = 1)


#We use the trg dataset to predict the classes of each point of the tst dataset.
#Here both trg and tst are pandas dataframes
def knn_predict(trg, tst, k):
    
    #Initialize a counter
    correct_predictions = 0
    
    #Iterate over all datapoints in tst. One point at a time taken into consideration.
    for index, row in tst.iterrows(): 
            '''
                Step - Use the pair_euclid_dist function from above to calculate pair wise distances from
                every point in trg for the point under consideration
            '''
            
            '''
                Step - sort the distances above and take the k points in trg that have the least distances
                from the point under consideration.
            '''
    
            '''
                Step - For the k selected points, take each of their classes and take a majority vote.
                This majority vote is your prediction for the current datapoint under consideration from tst.
                Make sure you have a mechanism that solves for ties. 
            '''
        
            '''
                Step - if this prediction == true label of the tst data point under consideration, 
                increment the counter above i.e. correct_predictions += 1
            '''
    
    #Report the Average Accuracy over tst.
    
    return correct_predictions/tst.shape[0]

In [None]:
#Define all k's that we want to try for k-NN. We don't know k a priori, so we need to grid search.
k_try = [1,5,10,15]


#Function that estimates Accuracy of the k-nn classifier using 5 fold Cross Validation
def knn_cv(fold_1,fold_2,fold_3,fold_4,fold_5):
    '''
    This function iterates through the various values of k and prints out the average 
    accuracy using CV 
    '''
    #Iterate over all values of the tuning Parameter
    for trial in k_try:
        #Try k-nn for each fold and Average the Classification Accuracy
        
        '''
        Step - Find a way to call knn_predict() function above such that each time your parameter
        tst is one of the five folds and trg is the other four folds stacked on top of each other.
        '''
        
        tst_acc_f1 = knn_predict(...)
        tst_acc_f2 = knn_predict(...)
        tst_acc_f3 = knn_predict(...)
        tst_acc_f4 = knn_predict(...)
        tst_acc_f5 = knn_predict(...)
        print('CV - 5 Accuracy for k_try = ',trial, ' is: ', np.mean([tst_acc_f1,
                                                                      tst_acc_f2,
                                                                      tst_acc_f3,
                                                                      tst_acc_f4,
                                                                      tst_acc_f5]))

        
#Function that estimates Accuracy of the k-nn classifier using Leave One out Validation
def knn_loocv(full_set):
    
    #Iterate over all values of the tuning Parameter
    for trial in k_try:
        
        '''
           Step - Fill in the code here to do LOOCV. Idea is the same. Just that tst has size 1. trg is all
           other points in dataset seeds.
        '''
        
        '''
          Step - Remember, you have to report the average accuracy over all datapoints.
        '''
        
        '''
          Step - Print appropriately.
        '''
        print('LOOCV Accuracy for k_try = ',trial, ' is: ', .......)
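
One possible way to build the (trg, tst) pairs described in the Step prompts above, sketched on a made-up frame. The helper names cv_splits and loocv_splits are hypothetical and not part of the template. Each pair could then be passed to knn_predict(trg, tst, k). The LOOCV helper assumes the dataframe index has no duplicate labels.

```python
import numpy as np
import pandas as pd

def cv_splits(folds):
    """Yield (trg, tst) pairs: each fold is held out once, the rest are stacked with pd.concat."""
    for i, tst in enumerate(folds):
        trg = pd.concat([f for j, f in enumerate(folds) if j != i])
        yield trg, tst

def loocv_splits(full_set):
    """Yield (trg, tst) pairs where tst is a single-row dataframe and trg is every other row."""
    for i in range(full_set.shape[0]):
        tst = full_set.iloc[[i]]                 # double brackets keep a dataframe, not a Series
        trg = full_set.drop(full_set.index[i])   # all rows except row i (unique index assumed)
        yield trg, tst

# Made-up 10-row frame standing in for `seeds`, cut into 5 folds of 2 rows each
toy = pd.DataFrame({"x": np.arange(10.0), "Y": [1, 2] * 5})
toy_folds = [toy.iloc[i:i + 2] for i in range(0, 10, 2)]

cv_sizes = [(len(trg), len(tst)) for trg, tst in cv_splits(toy_folds)]
loo_sizes = [(len(trg), len(tst)) for trg, tst in loocv_splits(toy)]
print(cv_sizes)
print(loo_sizes[:3])
```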
    


In [None]:
#Fit KNN with CV = 5   
knn_cv(fold_1,fold_2,fold_3,fold_4,fold_5)

'''        
Step - Finally you have to report the best "k" i.e. one that produces the highest mean accuracy
'''

#Fit KNN with LOOCV      
knn_loocv(seeds) #Remember seeds is the name of our full dataset.

#Plotting Test Errors from both 5 fold CV and LOOCV

'''
Step
Line plot of errors for each k for 5 fold CV
Line plot of errors for each k for LOOCV

Remark:

Make sure the plot is properly labeled.

'''
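
A minimal plotting sketch for the Step above, assuming the accuracies for each k have been collected into lists (the template functions only print them, so you would need to store or return them). The function name plot_cv_errors is hypothetical, and the accuracy values below are made-up placeholders, not results from this dataset.

```python
import matplotlib.pyplot as plt

def plot_cv_errors(k_values, cv5_acc, loocv_acc):
    """Line plot of test error (1 - accuracy) against k for both resampling schemes."""
    fig, ax = plt.subplots()
    ax.plot(k_values, [1 - a for a in cv5_acc], marker="o", label="5 fold CV")
    ax.plot(k_values, [1 - a for a in loocv_acc], marker="s", label="LOOCV")
    ax.set_xlabel("k (number of neighbors)")
    ax.set_ylabel("Error (1 - accuracy)")
    ax.set_title("k-NN test error by k")
    ax.set_xticks(k_values)
    ax.legend()
    return fig, ax

# Placeholder accuracies, made up for illustration only; substitute your own results
fig, ax = plot_cv_errors([1, 5, 10, 15],
                         [0.90, 0.92, 0.91, 0.89],
                         [0.91, 0.93, 0.90, 0.90])
```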

In [None]:
'''

Problem 2.3 - The following example shows steps to call an inbuilt classifier (Random Forest) from Scikit Learn. 
Choose two other classifiers from scikit-learn to solve this problem. The sequence of steps will be roughly the same.

Note : Do not use Random Forests (because we want you to learn to fit your own classifiers). 
To see a list of available classifiers, run the code block below

'''

In [1]:

'''Prints a list of available inbuilt classifiers'''

from sklearn.base import ClassifierMixin
from sklearn.utils import all_estimators  #sklearn.utils.testing is no longer available in newer scikit-learn releases
classifiers=[est for est in all_estimators() if issubclass(est[1], ClassifierMixin)]
for classifier in classifiers:
    print(classifier)



('AdaBoostClassifier', <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>)
('BaggingClassifier', <class 'sklearn.ensemble._bagging.BaggingClassifier'>)
('BernoulliNB', <class 'sklearn.naive_bayes.BernoulliNB'>)
('CalibratedClassifierCV', <class 'sklearn.calibration.CalibratedClassifierCV'>)
('CategoricalNB', <class 'sklearn.naive_bayes.CategoricalNB'>)
('CheckingClassifier', <class 'sklearn.utils._mocking.CheckingClassifier'>)
('ClassifierChain', <class 'sklearn.multioutput.ClassifierChain'>)
('ComplementNB', <class 'sklearn.naive_bayes.ComplementNB'>)
('DecisionTreeClassifier', <class 'sklearn.tree._classes.DecisionTreeClassifier'>)
('DummyClassifier', <class 'sklearn.dummy.DummyClassifier'>)
('ExtraTreeClassifier', <class 'sklearn.tree._classes.ExtraTreeClassifier'>)
('ExtraTreesClassifier', <class 'sklearn.ensemble._forest.ExtraTreesClassifier'>)
('GaussianNB', <class 'sklearn.naive_bayes.GaussianNB'>)
('GaussianProcessClassifier', <class 'sklearn.gaussian_process._gpc.G



In [6]:
'''Random Forest'''

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Define X, y
X = seeds[['A', 'P', 'C', 'L_Kern','W_Kern', 'Asy_Coeff','L_Kern_Grv']]
y = seeds['Y']

#Test Train Split - Using the Built in Method.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#Initialize the RF Classifier
rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True) 

#Define Parameter Grid to perform Grid Search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2']  #'auto' (same as 'sqrt' for classifiers) is not accepted by newer scikit-learn releases
}

#Initiate Grid Search
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)

#Fit Model with Grid Search
CV_rfc.fit(X_train,y_train)

#Best Parameters and Cross-Validated Accuracy (computed on the training split)
print('RF - Best Params = ',CV_rfc.best_params_)
print('CV - 5 : Accuracy Score = ',CV_rfc.best_score_ )

#Test Accuracy
print('Test Set Accuracy = ', CV_rfc.score(X_test, y_test))




RF - Best Params =  {'n_estimators': 50, 'max_features': 'log2'}
CV - 5 : Accuracy Score =  0.9455782312925171
Test Set Accuracy =  0.8888888888888888
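
The same sequence of steps carries over to other classifiers. Below is a hedged sketch using a support vector classifier (SVC) as one possible choice. It runs on synthetic data from make_classification rather than the seeds data so that it is self-contained. The variable names and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the seeds features (assumption)
X_demo, y_demo = make_classification(n_samples=210, n_features=7, n_informative=4,
                                     n_redundant=0, n_classes=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# Grid over the regularization strength C and the kernel type
svc_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
cv_svc = GridSearchCV(estimator=SVC(), param_grid=svc_grid, cv=5)
cv_svc.fit(Xd_train, yd_train)

print("SVC - Best Params = ", cv_svc.best_params_)
print("CV - 5 : Accuracy Score = ", cv_svc.best_score_)
print("Test Set Accuracy = ", cv_svc.score(Xd_test, yd_test))
```

To apply this to the seeds data, replace X_demo and y_demo with the X and y defined in the Random Forest cell.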
