
# Imports and Functions

In [123]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [124]:
def class_performance(xdata, ydata, model):
  """Print accuracy and per-class precision, recall, F-score, and support for a fitted model."""
  print("\n\n", model)
  predy = model.predict(xdata)
  # scikit-learn metrics take (y_true, y_pred) in that order
  acc = accuracy_score(ydata, predy)
  prec, rec, fscore, supp = precision_recall_fscore_support(ydata, predy, average=None, zero_division=0)
  print("\nmodel accuracy on supplied data:\t", round(acc, 3))

  print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

# Midterm Exam 2 - Take home problem, general description

The following describes your take-home exam exercise, please read the instructions carefully.   You should submit your solution to this exercise as both a PDF and Jupyter notebook with your name in the filename (e.g. RJohnson_exam1.pdf, RJohnson_exam1.ipynb) **no later than 11:59pm on Sunday, April 14.**

Take care in presenting your results.  **All results and discussion should be written in markdown format at the end of your code.**  Be clear in your discussion if you are referencing a figure or table above.  Any data or quantitative comparisons should have the data presented along with the discussion.


In this exercise, you will be using an astronomical dataset to train a model to make predictions about an object's classification.  Your task will be to train the best classifier using specific combinations of the available features.  **Once trained and optimized, you should present your final model and a discussion of its target-level performance on the training and test data.**  Be sure you have prepared your data as necessary before modeling.


For every choice you make (sampling, scaling, model type and parameters), you should provide justification for that choice, evidence that you have explored alternatives, and an explanation of why those alternatives were not preferable.
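
One way to document that alternatives were explored is a small cross-validated grid search whose full score table is reported next to the chosen setting. The sketch below is illustrative only: the data are synthetic (`make_classification` stands in for the SDSS features), and the `max_depth` grid and the `f1_macro` scoring choice are assumptions, not values prescribed by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the SDSS features (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Hypothetical search grid; the depths are arbitrary illustrative values.
param_grid = {"max_depth": [2, 4, 8, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

# One line per candidate: the evidence that other settings were tried.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print("best:", search.best_params_)
```

Reporting the whole table, rather than only `best_params_`, shows how much worse the rejected settings were.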


## The available data features include:
* u = Ultraviolet filter in the photometric system
* g = Green filter in the photometric system
* r = Red filter in the photometric system
* i = Near Infrared filter in the photometric system
* z = Infrared filter in the photometric system

* alpha = object's right ascension celestial coordinate
* delta = object's declination celestial coordinate
* redshift = redshift value based on the increase in wavelength

* class = object class (galaxy, star or quasar object)


# Specific Exercise and Data Details

The dataset contains multiwavelength observations of 100,000 objects from the **S**loan **D**igital **S**ky **S**urvey (SDSS), the largest spectroscopic survey of our universe ever performed.  In this exercise, you will be asked to train a model that is capable of predicting the class of an object based on either that object's spectroscopic measurements ('u', 'g', 'r', 'i', 'z') or its 3D position information (alpha, delta, redshift), or a combination of both.

For the purposes of this problem, the **spectroscopic measurements will be 'u', 'g', 'r', 'i', 'z'**.  Each object's **position information is given as 3 coordinates, 'alpha', 'delta', 'redshift'**.  


Your task is to:
* use the **spectroscopic features alone** to train a model which performs the best at classifying each of the 3 target classes
* use the **3D position information alone** to train a model which performs the best at classifying each of the 3 target classes
* use a **combination of all spectroscopic and positional measurements** to train a model which performs the best at classifying each of the 3 target classes

For your models:
* you should **explore using a Decision Tree, a Support Vector Machine, and a Stochastic Gradient Descent** classifier on your datasets
* present the final performance by **showing/discussing the target-level performance of your best classifier using the confusion matrix, precision, recall, f1, and support scores**
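
A minimal sketch of target-level reporting follows, using a handful of hand-written labels rather than real predictions. The label strings `GALAXY`, `QSO`, and `STAR` are an assumption about how the `class` column is encoded. Passing `labels=` explicitly fixes the row and column order of the confusion matrix, so each per-class score can be tied to a named class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Assumed class labels; check df_s['class'].unique() for the actual strings.
labels = ["GALAXY", "QSO", "STAR"]

# Hand-written toy labels, not model output.
y_true = np.array(["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "STAR"])
y_pred = np.array(["GALAXY", "GALAXY", "QSO", "QSO", "GALAXY", "STAR", "STAR", "STAR"])

# Rows are true classes and columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
prec, rec, f1, supp = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

print(cm)
for name, p, r, f, s in zip(labels, prec, rec, f1, supp):
    print(f"{name:7s} precision={p:.3f} recall={r:.3f} f1={f:.3f} support={s}")
```

With `zero_division=0`, a class that is never predicted reports a precision of 0 rather than 1, which avoids a misleadingly perfect score next to a recall of 0.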

## Bonus Question
You may only receive credit for one of the bonus options, so choose which you want to attempt

**Bonus option 1:** Using PCA reduction on the SDSS dataset (combining positional and spectroscopic features), reduce/project your dataset onto its 2 principal components.  Train each of your model classifiers (DT, SVC, SGD) using this reduced dataset and evaluate their performances.  Using only the best performing model, produce a 2D plot of the projected data, highlighting all of the data points that your model misclassified.

**Bonus option 2:** Using PCA reduction on the SDSS dataset (combining positional and spectroscopic features), reduce/project your dataset onto its 2 principal components.  Train a Gaussian mixture model classifier on the data to predict the object class and evaluate its performance against the given identifiers.  Using the best performing model, produce a 2D plot of the projected data, highlighting all of the data points that your model misclassified.
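
The projection step shared by both bonus options might look like the sketch below. It is not a solution to either option: the data are synthetic, the decision tree and its `max_depth=4` are arbitrary placeholders, and plotting is left out. The part that carries over is fitting the scaler and PCA on the training data only, then building a boolean mask of misclassified test points.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the combined SDSS features (an assumption).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize, then project onto the first 2 principal components.
# Both steps are fit on the training data only.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
Z_train = reducer.fit_transform(X_train)
Z_test = reducer.transform(X_test)

# Placeholder classifier trained on the 2D projection.
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(Z_train, y_train)

# Boolean mask selecting the misclassified test points.
missed = clf.predict(Z_test) != y_test
print(Z_test.shape, int(missed.sum()))
```

The rows `Z_test[missed]` can then be drawn with a distinct marker on top of a scatter plot of all of `Z_test`.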

## Some notes about this dataset

* Though only a portion of the SDSS, this dataset is still very large, containing 100,000 objects, so even relatively simple computations can take time
* No model training should take >10 minutes; if it does, adjust your parameters
* You're being asked to look at the dataset features in 3 ways (position only, spectroscopic only, and combined), and to train 3 models for each, meaning that you will be evaluating a total of 9 model/dataset combinations; every evaluation should aim to maximize both training and test performance.  Your final result and presentation should cover only the best model, with a discussion of why/how it was best.
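
The 9 model/dataset combinations can be organized as a nested loop that collects one row of scores per combination. The sketch below is an assumption-laden illustration: the data are random numbers with the notebook's column names (so scores sit near chance), the label strings are assumed, and the hyperparameters and the macro-averaged F1 summary are placeholders rather than tuned choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data using the notebook's column names (illustration only).
rng = np.random.default_rng(42)
cols = ["alpha", "delta", "u", "g", "r", "i", "z", "redshift"]
demo = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
target = pd.Series(rng.choice(["GALAXY", "QSO", "STAR"], size=300))

feature_sets = {
    "spectroscopic": ["u", "g", "r", "i", "z"],
    "positional": ["alpha", "delta", "redshift"],
    "combined": cols,
}
# Factories so every combination gets a fresh, unfitted model.
model_builders = {
    "DT": lambda: DecisionTreeClassifier(max_depth=4, random_state=42),
    "SVC": lambda: SVC(kernel="linear"),
    "SGD": lambda: SGDClassifier(max_iter=1000, random_state=42),
}

# Split row positions once so all feature sets share the same train/test rows.
train_rows, test_rows = train_test_split(
    np.arange(len(demo)), test_size=0.2, random_state=42)
y_train, y_test = target.iloc[train_rows], target.iloc[test_rows]

rows = []
for set_name, columns in feature_sets.items():
    X_train = demo.iloc[train_rows][columns]
    X_test = demo.iloc[test_rows][columns]
    for model_name, build in model_builders.items():
        # Scaling lives inside the pipeline, so it is fit on training rows only.
        pipe = make_pipeline(StandardScaler(), build()).fit(X_train, y_train)
        rows.append({
            "features": set_name,
            "model": model_name,
            "train_f1": f1_score(y_train, pipe.predict(X_train), average="macro"),
            "test_f1": f1_score(y_test, pipe.predict(X_test), average="macro"),
        })

results = pd.DataFrame(rows)
print(results.round(3))
```

Keeping train and test scores side by side in one table makes over- or under-fitting visible for each combination before the best model is chosen.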

## Preparing the data (you should use the resulting train/test datasets for your analysis)

In [125]:
web_path = 'http://public.gettysburg.edu/~rjohnson/ds325/' #if using data over web
df_s = pd.read_csv(web_path+'star_classification.csv')

In [126]:
df_s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 18 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   obj_ID       100000 non-null  float64
 1   alpha        100000 non-null  float64
 2   delta        100000 non-null  float64
 3   u            100000 non-null  float64
 4   g            100000 non-null  float64
 5   r            100000 non-null  float64
 6   i            100000 non-null  float64
 7   z            100000 non-null  float64
 8   run_ID       100000 non-null  int64  
 9   rerun_ID     100000 non-null  int64  
 10  cam_col      100000 non-null  int64  
 11  field_ID     100000 non-null  int64  
 12  spec_obj_ID  100000 non-null  float64
 13  class        100000 non-null  object 
 14  redshift     100000 non-null  float64
 15  plate        100000 non-null  int64  
 16  MJD          100000 non-null  int64  
 17  fiber_ID     100000 non-null  int64  
dtypes: float64(10), int64(7), object(1)

In [127]:
feat = df_s.drop(['obj_ID','run_ID', 'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'class', 'plate', 'MJD', 'fiber_ID'], axis=1, inplace=False)
targ = df_s['class']

In [128]:
feat

Unnamed: 0,alpha,delta,u,g,r,i,z,redshift
0,135.689107,32.494632,23.87882,22.27530,20.39501,19.16573,18.79371,0.634794
1,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,0.779136
2,142.188790,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,0.644195
3,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.25010,0.932346
4,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,0.116123
...,...,...,...,...,...,...,...,...
99995,39.620709,-2.594074,22.16759,22.97586,21.90404,21.30548,20.73569,0.000000
99996,29.493819,19.798874,22.69118,22.38628,20.45003,19.75759,19.41526,0.404895
99997,224.587407,15.700707,21.16916,19.26997,18.20428,17.69034,17.35221,0.143366
99998,212.268621,46.660365,25.35039,21.63757,19.91386,19.07254,18.62482,0.455040


In [129]:
# create 3 datasets (spectroscopic, position, combined)

spec_feat = feat.drop(['alpha','delta','redshift'], axis=1, inplace=False)

pos_feat = feat.drop(['u','g','r','i','z'], axis=1, inplace=False)

comb_feat = feat

In [130]:
# setup train/test split for 3 datasets
# for spectroscopic
X_s_tr, X_s_tst, y_tr, y_tst = train_test_split(spec_feat, targ, test_size=0.2, random_state=42)
# for positional
X_p_tr, X_p_tst, y_tr, y_tst = train_test_split(pos_feat, targ, test_size=0.2, random_state=42)
# for combined
X_tr, X_tst, y_tr, y_tst = train_test_split(comb_feat, targ, test_size=0.2, random_state=42)
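
The three calls above reuse the names `y_tr` and `y_tst`, which is safe only because each call sees the same number of rows and the same `random_state`, and therefore selects the same rows. The sketch below checks that property on a small random frame; the data and column subset are placeholders, not the SDSS table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small random stand-in frame (illustration only).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=["u", "g", "alpha", "delta"])
target = pd.Series(rng.choice(["GALAXY", "QSO", "STAR"], size=50))

# Two splits of different column subsets with the same random_state.
Xa_tr, Xa_tst, ya_tr, ya_tst = train_test_split(
    demo[["u", "g"]], target, test_size=0.2, random_state=42)
Xb_tr, Xb_tst, yb_tr, yb_tst = train_test_split(
    demo[["alpha", "delta"]], target, test_size=0.2, random_state=42)

# Identical row indices and targets confirm that the splits are aligned.
same_rows = Xa_tr.index.equals(Xb_tr.index) and ya_tr.equals(yb_tr)
print(same_rows)
```

`train_test_split` also accepts `stratify=targ`, which keeps the class proportions the same in the train and test sets.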


# Spectroscopic features

In [131]:
scaler_s = StandardScaler()#initializing standard scaler for spectroscopic dataset

X_s_tr_scaled = scaler_s.fit_transform(X_s_tr)#fit / transform training data

X_s_tst_scaled = scaler_s.transform(X_s_tst)#transform testing data

In [132]:
spec_decision_tree = DecisionTreeClassifier(max_depth=2, random_state=42) #Define decision tree model and fit model
spec_decision_tree.fit(X_s_tr_scaled, y_tr)

In [133]:
acc_train = spec_decision_tree.score(X_s_tr_scaled, y_tr) #using scoring method for decision tree
acct = spec_decision_tree.score(X_s_tst_scaled, y_tst)

y = y_tr; yt=y_tst #defining y train and y test

predy = spec_decision_tree.predict(X_s_tr_scaled) #making predictions
predyt = spec_decision_tree.predict(X_s_tst_scaled)

print("model accuracy on training data:\t", acc_train.round(3)) #printing metrics
prec, rec, fscore, supp = precision_recall_fscore_support(y, predy, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

print("model accuracy on testing data:\t", acct.round(3))
prec, rec, fscore, supp = precision_recall_fscore_support(yt, predyt, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

conf_matrix = confusion_matrix(y_tr, predy) #printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

conf_matrix = confusion_matrix(y_tst, predyt)
print("\nConfusion Matrix:")
print(conf_matrix)

model accuracy on training data:	 0.684
Precision:	 [0.691 0.648 1.   ] 
Recall:		 [0.964 0.582 0.   ] 
Fscore:		 [0.805 0.613 0.   ] 
Support:	 [47585 15164 17251]
model accuracy on testing data:	 0.682
Precision:	 [0.69  0.642 1.   ] 
Recall:		 [0.967 0.573 0.   ] 
Fscore:		 [0.805 0.605 0.   ] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[45868  1717     0]
 [ 6333  8831     0]
 [14173  3078     0]]

Confusion Matrix:
[[11463   397     0]
 [ 1622  2175     0]
 [ 3525   818     0]]


In [134]:
linear_svm = SVC(kernel='linear') #trying out linear SVC model before implementing polynomials
linear_svm.fit(X_s_tr_scaled, y_tr)

In [135]:
linear_svm_train_predictions = linear_svm.predict(X_s_tr_scaled) #creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", linear_svm.score(X_s_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, linear_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, linear_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.7475875
Precision:	 [0.776 0.667 0.711] 
Recall:		 [0.933 0.81  0.181] 
Fscore:		 [0.848 0.731 0.288] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[44414  2857   314]
 [ 1933 12276   955]
 [10854  3280  3117]]


In [136]:
linear_svm_test_predictions = linear_svm.predict(X_s_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", linear_svm.score(X_s_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, linear_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, linear_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.74645
Precision:	 [0.778 0.659 0.712] 
Recall:		 [0.934 0.8   0.189] 
Fscore:		 [0.849 0.722 0.298] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11073   702    85]
 [  514  3036   247]
 [ 2652   871   820]]


In [137]:
# The input below is already standardized, so the StandardScaler step in this pipeline is redundant.
poly_kernel_svm_clf = make_pipeline(StandardScaler(),SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X_s_tr_scaled, y_tr)#setting up / fitting polynomial kernel svm to compare to linear

In [138]:
poly_svm_train_predictions = poly_kernel_svm_clf.predict(X_s_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", poly_kernel_svm_clf.score(X_s_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, poly_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, poly_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.8269
Precision:	 [0.868 0.716 0.814] 
Recall:		 [0.937 0.836 0.516] 
Fscore:		 [0.901 0.771 0.631] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[44584  1839  1162]
 [ 1625 12675   864]
 [ 5177  3181  8893]]


In [139]:
poly_svm_test_predictions = poly_kernel_svm_clf.predict(X_s_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", poly_kernel_svm_clf.score(X_s_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, poly_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, poly_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.8241
Precision:	 [0.868 0.708 0.809] 
Recall:		 [0.937 0.826 0.516] 
Fscore:		 [0.901 0.762 0.63 ] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11108   455   297]
 [  429  3135   233]
 [ 1264   840  2239]]


In [140]:
from sklearn.linear_model import SGDClassifier#importing SGDClassifier
sgd_clf = SGDClassifier(max_iter=10000, random_state=42)#initializing classifier and fitting SGD model
sgd_clf.fit(X_s_tr_scaled,y_tr)

In [141]:
sgd_clf_train_predictions = sgd_clf.predict(X_s_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", sgd_clf.score(X_s_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, sgd_clf_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, sgd_clf_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.712
Precision:	 [0.761 0.588 1.   ] 
Recall:		 [0.919 0.873 0.   ] 
Fscore:		 [0.832 0.702 0.001] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[43713  3872     0]
 [ 1924 13240     0]
 [11825  5419     7]]


In [142]:
sgd_clf_test_predictions = sgd_clf.predict(X_s_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", sgd_clf.score(X_s_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, sgd_clf_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, sgd_clf_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.7095
Precision:	 [0.762 0.579 1.   ] 
Recall:		 [0.918 0.869 0.   ] 
Fscore:		 [0.833 0.695 0.   ] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[10889   971     0]
 [  497  3300     0]
 [ 2909  1433     1]]


# 3D Position Information

In [143]:
scaler_s = StandardScaler()#initializing standard scaler for positional dataset

X_p_tr_scaled = scaler_s.fit_transform(X_p_tr)#fit / transform training data

X_p_tst_scaled = scaler_s.transform(X_p_tst)#transform testing data

In [144]:
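pos_decision_tree = DecisionTreeClassifier(max_depth=2, random_state=42)#Define decision tree model and fit model
# Caution: the tree is fit on the unscaled X_p_tr, but the next cell scores it on the scaled arrays,
# so its split thresholds do not match the evaluation data; fitting on X_p_tr_scaled would make the two consistent.
pos_decision_tree.fit(X_p_tr, y_tr)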

In [145]:
acc_train = pos_decision_tree.score(X_p_tr_scaled, y_tr)#using scoring method for decision tree
acct = pos_decision_tree.score(X_p_tst_scaled, y_tst)

y = y_tr; yt=y_tst#defining y train and y test

predy = pos_decision_tree.predict(X_p_tr_scaled)# making predictions
predyt = pos_decision_tree.predict(X_p_tst_scaled)

print("model accuracy on training data:\t", acc_train.round(3))# printing metrics
prec, rec, fscore, supp = precision_recall_fscore_support(y, predy, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

print("model accuracy on testing data:\t", acct.round(3))
prec, rec, fscore, supp = precision_recall_fscore_support(yt, predyt, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

conf_matrix = confusion_matrix(y_tr, predy)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

conf_matrix = confusion_matrix(y_tst, predyt)
print("\nConfusion Matrix:")
print(conf_matrix)



model accuracy on training data:	 0.499
Precision:	 [0.751 0.985 0.326] 
Recall:		 [0.27 0.65 1.  ] 
Fscore:		 [0.397 0.783 0.492] 
Support:	 [47585 15164 17251]
model accuracy on testing data:	 0.498
Precision:	 [0.747 0.984 0.327] 
Recall:		 [0.267 0.644 1.   ] 
Fscore:		 [0.394 0.779 0.493] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[12832   152 34601]
 [ 4258  9863  1043]
 [    0     0 17251]]

Confusion Matrix:
[[3171   40 8649]
 [1072 2447  278]
 [   0    0 4343]]


In [146]:
linear_svm = SVC(kernel='linear') #trying out linear SVC model before implementing polynomials
linear_svm.fit(X_p_tr_scaled, y_tr)

In [147]:
linear_svm_train_predictions = linear_svm.predict(X_p_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", linear_svm.score(X_p_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, linear_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, linear_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.944125
Precision:	 [0.934 0.956 0.963] 
Recall:		 [0.975 0.785 1.   ] 
Fscore:		 [0.954 0.862 0.981] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[46379   547   659]
 [ 3258 11900     6]
 [    0     0 17251]]


In [148]:
linear_svm_test_predictions = linear_svm.predict(X_p_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", linear_svm.score(X_p_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, linear_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, linear_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.9439
Precision:	 [0.933 0.959 0.963] 
Recall:		 [0.975 0.782 1.   ] 
Fscore:		 [0.954 0.862 0.981] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11564   128   168]
 [  825  2971     1]
 [    0     0  4343]]


In [149]:
# The input below is already standardized, so the StandardScaler step in this pipeline is redundant.
poly_kernel_svm_clf = make_pipeline(StandardScaler(),SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X_p_tr_scaled, y_tr)#setting up / fitting polynomial kernel svm to compare to linear

In [150]:
poly_svm_train_predictions = poly_kernel_svm_clf.predict(X_p_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", poly_kernel_svm_clf.score(X_p_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, poly_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, poly_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.9443875
Precision:	 [0.929 0.968 0.972] 
Recall:		 [0.982 0.764 1.   ] 
Fscore:		 [0.955 0.854 0.986] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[46711   387   487]
 [ 3571 11589     4]
 [    0     0 17251]]


In [151]:
poly_svm_test_predictions = poly_kernel_svm_clf.predict(X_p_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", poly_kernel_svm_clf.score(X_p_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, poly_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, poly_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.9447
Precision:	 [0.929 0.969 0.972] 
Recall:		 [0.982 0.766 1.   ] 
Fscore:		 [0.955 0.856 0.986] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11641    94   125]
 [  886  2910     1]
 [    0     0  4343]]


In [152]:
sgd_clf = SGDClassifier(max_iter=10000, random_state=42)#initializing classifier
sgd_clf.fit(X_p_tr_scaled,y_tr)#fitting SGD model

In [153]:
sgd_clf_train_predictions = sgd_clf.predict(X_p_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", sgd_clf.score(X_p_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, sgd_clf_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, sgd_clf_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.7333625
Precision:	 [0.692 0.972 0.2  ] 
Recall:		 [0.993 0.753 0.   ] 
Fscore:		 [0.816 0.849 0.   ] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[47246   335     4]
 [ 3742 11422     0]
 [17250     0     1]]


In [154]:
sgd_clf_test_predictions = sgd_clf.predict(X_p_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", sgd_clf.score(X_p_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, sgd_clf_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, sgd_clf_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.73295
Precision:	 [0.691 0.972 0.   ] 
Recall:		 [0.993 0.759 0.   ] 
Fscore:		 [0.815 0.853 0.   ] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11777    82     1]
 [  915  2882     0]
 [ 4343     0     0]]


# Combination of both features

In [155]:
scaler_s = StandardScaler()#initializing standard scaler for combined dataset

X_tr_scaled = scaler_s.fit_transform(X_tr)#fit / transform training data

X_tst_scaled = scaler_s.transform(X_tst)#transform testing data

In [156]:
combo_decision_tree = DecisionTreeClassifier(max_depth=2, random_state=42)#Define decision tree model and fit model
combo_decision_tree.fit(X_tr_scaled, y_tr)

In [157]:
acc_train = combo_decision_tree.score(X_tr_scaled, y_tr)#using scoring method for decision tree
acct = combo_decision_tree.score(X_tst_scaled, y_tst)

y = y_tr; yt=y_tst #defining y train and y test

predy = combo_decision_tree.predict(X_tr_scaled) #making predictions
predyt = combo_decision_tree.predict(X_tst_scaled)

print("model accuracy on training data:\t", acc_train.round(3))#printing metrics
prec, rec, fscore, supp = precision_recall_fscore_support(y, predy, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

print("model accuracy on testing data:\t", acct.round(3))
prec, rec, fscore, supp = precision_recall_fscore_support(yt, predyt, average=None, zero_division=1)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3), "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp)

conf_matrix = confusion_matrix(y_tr, predy) #printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

conf_matrix = confusion_matrix(y_tst, predyt)
print("\nConfusion Matrix:")
print(conf_matrix)

model accuracy on training data:	 0.947
Precision:	 [0.938 0.945 0.977] 
Recall:		 [0.977 0.796 1.   ] 
Fscore:		 [0.957 0.864 0.988] 
Support:	 [47585 15164 17251]
model accuracy on testing data:	 0.947
Precision:	 [0.937 0.946 0.976] 
Recall:		 [0.977 0.794 1.   ] 
Fscore:		 [0.956 0.863 0.988] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[46468   708   409]
 [ 3087 12075     2]
 [    0     0 17251]]

Confusion Matrix:
[[11583   171   106]
 [  783  3013     1]
 [    0     0  4343]]


In [158]:
linear_svm = SVC(kernel='linear') #trying out linear SVC model before implementing polynomials
linear_svm.fit(X_tr_scaled, y_tr)

In [159]:
linear_svm_train_predictions = linear_svm.predict(X_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", linear_svm.score(X_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, linear_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, linear_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.9598625
Precision:	 [0.961 0.951 0.963] 
Recall:		 [0.972 0.876 1.   ] 
Fscore:		 [0.967 0.912 0.981] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[46247   684   654]
 [ 1867 13291     6]
 [    0     0 17251]]


In [160]:
linear_svm_test_predictions = linear_svm.predict(X_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on testing data:\t", linear_svm.score(X_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, linear_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, linear_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on testing data:	 0.9594
Precision:	 [0.961 0.95  0.963] 
Recall:		 [0.971 0.876 1.   ] 
Fscore:		 [0.966 0.912 0.981] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11519   174   167]
 [  470  3326     1]
 [    0     0  4343]]


In [161]:
# The input below is already standardized, so the StandardScaler step in this pipeline is redundant.
poly_kernel_svm_clf = make_pipeline(StandardScaler(),SVC(kernel="poly", degree=3, coef0=1, C=5))
poly_kernel_svm_clf.fit(X_tr_scaled, y_tr)#setting up / fitting polynomial kernel svm to compare to linear

In [162]:
poly_svm_train_predictions = poly_kernel_svm_clf.predict(X_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", poly_kernel_svm_clf.score(X_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, poly_svm_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, poly_svm_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.9687
Precision:	 [0.969 0.968 0.968] 
Recall:		 [0.979 0.902 1.   ] 
Fscore:		 [0.974 0.934 0.984] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[46568   456   561]
 [ 1479 13677     8]
 [    0     0 17251]]


In [163]:
poly_svm_test_predictions = poly_kernel_svm_clf.predict(X_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on test data:\t", poly_kernel_svm_clf.score(X_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, poly_svm_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, poly_svm_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on test data:	 0.9679
Precision:	 [0.968 0.97  0.967] 
Recall:		 [0.979 0.898 1.   ] 
Fscore:		 [0.973 0.932 0.983] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11606   107   147]
 [  387  3409     1]
 [    0     0  4343]]


In [164]:
from sklearn.linear_model import SGDClassifier  # not among the imports at the top of the notebook

sgd_clf = SGDClassifier(max_iter=10000, random_state=42)
sgd_clf.fit(X_tr_scaled,y_tr)#initializing classifier and fitting SGD model

In [165]:
sgd_clf_train_predictions = sgd_clf.predict(X_tr_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on training data:\t", sgd_clf.score(X_tr_scaled, y_tr))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tr, sgd_clf_train_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tr, sgd_clf_train_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on training data:	 0.904075
Precision:	 [0.9   0.883 0.941] 
Recall:		 [0.944 0.909 0.789] 
Fscore:		 [0.922 0.896 0.858] 
Support:	 [47585 15164 17251]

Confusion Matrix:
[[44934  1808   843]
 [ 1367 13786    11]
 [ 3632    13 13606]]


In [166]:
sgd_clf_test_predictions = sgd_clf.predict(X_tst_scaled)#creating predictions
print("Validation vs training: Model accuracy, precision, recall, f1, and support")#using score method for metrics
print("model accuracy on test data:\t", sgd_clf.score(X_tst_scaled, y_tst))
prec, rec, fscore, supp = precision_recall_fscore_support(y_tst, sgd_clf_test_predictions, average=None)
print("Precision:\t", prec.round(3), "\nRecall:\t\t", rec.round(3),
      "\nFscore:\t\t", fscore.round(3), "\nSupport:\t", supp.round(3))

conf_matrix = confusion_matrix(y_tst, sgd_clf_test_predictions)#printing confusion matrix
print("\nConfusion Matrix:")
print(conf_matrix)

Validation vs training: Model accuracy, precision, recall, f1, and support
model accuracy on test data:	 0.90485
Precision:	 [0.902 0.883 0.937] 
Recall:		 [0.943 0.91  0.797] 
Fscore:		 [0.922 0.896 0.861] 
Support:	 [11860  3797  4343]

Confusion Matrix:
[[11180   450   230]
 [  339  3456     2]
 [  874     8  3461]]


# Discussion
For each of the spectroscopic, 3D position, and combination data sets, I fit and tested four classification models: a decision tree, a linear SVM, a polynomial kernel SVM, and a stochastic gradient descent (SGD) classifier. All of the independent variables were scaled with StandardScaler to put them on a comparable scale. In the original data set, some variables took substantially larger or smaller values than others, so I considered scaling a necessary step in the data preparation for this exam.
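
As an illustration of the scaling step, the sketch below standardizes two made-up features with very different magnitudes. The data and the variable names (`X_train`, `X_test`) are hypothetical and are not the exam data; the sketch also assumes the scaler is fit on the training split only and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: one of order 1, one of order several thousand
X_train = np.column_stack([rng.normal(0.5, 0.1, 200), rng.normal(5000, 1500, 200)])
X_test = np.column_stack([rng.normal(0.5, 0.1, 50), rng.normal(5000, 1500, 50)])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training split
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test split

# After scaling, each training column has mean ~0 and standard deviation ~1
print("train means:", X_train_scaled.mean(axis=0).round(3))
print("train stds: ", X_train_scaled.std(axis=0).round(3))
```

Reusing the training statistics on the test split keeps information about the test data out of the preprocessing step.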

First, the spectroscopic decision tree did not perform well compared to the other models. Its accuracy on the training and testing sets was .684 and .682, respectively, with moderate precision, recall, and F1 scores for all three classes. A new decision tree fit on the 3D position data set performed substantially worse than the spectroscopic tree, with training and testing accuracies of .499 and .498, respectively. Its other metrics were also generally poor compared to the other models. However, the decision tree fit on the combination data set performed very well: accuracy was .947 on both the training and testing data, and the precision, recall, and F1 scores were higher than for the previous two decision trees.

The first linear SVM model, fit on the spectroscopic data set, produced accuracy scores of .747 and .746 on the training and testing sets, respectively. The precision, recall, and F1 scores were all similar to one another and relatively high. The second linear SVM model, fit on the 3D position data set, produced accuracy scores of .9441 and .9439 on the training and testing sets. Many of the precision, recall, and F1 scores for this model are above .8, and in many cases above .9. The final linear SVM model, fit on the combination data set, yields accuracy scores of .9598 and .9594 on the training and testing sets. This appears to be the best of the linear SVM models, as its other metrics are also higher than those of the two previous models.

Next, I implemented a polynomial kernel SVM model to expand the feature space and better capture non-linear relationships within the data. I also wanted to compare the polynomial SVM models against the linear SVM models. The first polynomial SVM model, fit on the spectroscopic data set, performed considerably better than the corresponding linear SVM model (.747 and .746), producing accuracy scores of .8269 and .8241 on the training and testing sets, respectively. The polynomial SVM model improved substantially on the 3D position data set, where the accuracy scores were .9443 and .9447 on the training and testing sets. Finally, on the combination of both data sets, the polynomial SVM model yielded accuracy scores of .9687 and .9679 on the training and testing sets. The precision, recall, and F1 scores were all higher as well.
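
To illustrate why a polynomial kernel can help when classes are not linearly separable, the sketch below compares a linear SVM with a polynomial kernel SVM on a synthetic two-ring data set. The data come from scikit-learn's `make_circles`, not from the exam data, and the split sizes are arbitrary; only the kernel settings (`degree=3`, `coef0=1`, `C=5`) mirror the cell above.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic data: an inner ring (class 1) surrounded by an outer ring (class 0)
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)
X_tr, X_tst, y_tr, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

# A linear decision boundary cannot separate concentric rings
linear_clf = make_pipeline(StandardScaler(),
                           LinearSVC(C=1, dual=False, max_iter=10000))
# A degree-3 polynomial kernel with coef0=1 includes the quadratic terms needed here
poly_clf = make_pipeline(StandardScaler(),
                         SVC(kernel="poly", degree=3, coef0=1, C=5))

linear_clf.fit(X_tr, y_tr)
poly_clf.fit(X_tr, y_tr)

lin_acc = linear_clf.score(X_tst, y_tst)
poly_acc = poly_clf.score(X_tst, y_tst)
print("linear SVM test accuracy:    ", round(lin_acc, 3))
print("polynomial SVM test accuracy:", round(poly_acc, 3))
```

On data like this the gap is far larger than on the exam data sets, where the linear and polynomial models were much closer.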

Finally, the SGD classifier fit on the spectroscopic features resulted in accuracy scores of .712 and .7095 on the training and testing sets. The SGD classifier performed slightly better on the 3D position data set, with accuracy scores of .7333 and .7329. The precision, recall, and F1 scores for these two models were inconsistent between the two data sets and relatively low compared to the other models. The final SGD classifier, fit on the combination of both data sets, produced much higher accuracy scores on both the training and testing sets: .9040 and .9048. Its precision, recall, and F1 scores are also much higher than those of the previous two SGD models.
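
The per-class metrics quoted throughout this discussion can be recomputed directly from a confusion matrix. As a worked check, the sketch below uses the SGD test-set confusion matrix for the combination data set, copied from the output above, and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# SGD test-set confusion matrix (rows = true class, columns = predicted class)
cm = np.array([[11180,   450,   230],
               [  339,  3456,     2],
               [  874,     8,  3461]])

true_pos = np.diag(cm)                   # correctly classified counts per class
precision = true_pos / cm.sum(axis=0)    # divide by the number predicted as each class
recall = true_pos / cm.sum(axis=1)       # divide by the number truly in each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print("Accuracy: ", round(accuracy, 5))  # 0.90485
print("Precision:", precision.round(3))  # [0.902 0.883 0.937]
print("Recall:   ", recall.round(3))     # [0.943 0.91  0.797]
print("Fscore:   ", f1.round(3))         # [0.922 0.896 0.861]
```

The low recall for the third class (.797) comes from the 874 samples of that class that the SGD classifier assigned to the first class.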

Overall, the best performing model on the spectroscopic data was the polynomial kernel SVM, whose accuracy and other metrics are substantially higher than those of the other models. On the 3D position data set, the linear SVM and the polynomial kernel SVM were jointly the best: their accuracy scores and other metrics differed only slightly, and both performed very well. Finally, the best model for the combination data set was also the polynomial kernel SVM. It correctly classified over 96% of the data in both the training and testing sets. This was also the most effective of the 12 models that were trained and tested.
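
For a side-by-side view, the sketch below collects the testing accuracies quoted in this discussion into one table and picks the top model per data set. The numbers are copied from the text above; the row and column labels are my own shorthand.

```python
import pandas as pd

# Testing-set accuracies quoted in the discussion above
test_accuracy = pd.DataFrame(
    {
        "spectroscopic": [0.682, 0.746, 0.8241, 0.7095],
        "3D position":   [0.498, 0.9439, 0.9447, 0.7329],
        "combination":   [0.947, 0.9594, 0.9679, 0.9048],
    },
    index=["Decision tree", "Linear SVM", "Polynomial kernel SVM", "SGD classifier"],
)

best_model = test_accuracy.idxmax()  # model with the highest test accuracy per data set

print(test_accuracy)
print("\nBest model per data set:")
print(best_model)
```

On the 3D position data the margin between the polynomial kernel SVM (.9447) and the linear SVM (.9439) is under a tenth of a percentage point, which is why the two are treated as equally good there.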