### Codio Activity 16.9: Investigating your own data

For this activity, choose a dataset and build a classification model with it. Specifically, compare the `LogisticRegression`, `KNeighborsClassifier`, and `SVC` estimators in terms of predictive performance and model-fitting speed. Optimize each model for whichever metric you believe is appropriate for the task: `precision`, `recall`, or `accuracy`.
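
Which metric is appropriate depends on which kind of error is more costly for the task. As a minimal sketch with made-up labels (not drawn from any dataset used in this notebook), the following shows how the three metrics can disagree on the same set of predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for illustration only: 1 is the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
precision = precision_score(y_true, y_pred)  # share of predicted positives that are truly positive
recall = recall_score(y_true, y_pred)        # share of actual positives that were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# accuracy=0.80, precision=1.00, recall=0.33
```

To optimize for one of these in a grid search, `GridSearchCV` accepts a `scoring` argument such as `scoring='recall'`. With string class labels, a scorer built with `make_scorer(recall_score, pos_label=...)` is one option.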

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import time 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

### Gathering the data

For your dataset, consider an example dataset from either [Kaggle](https://www.kaggle.com/) or the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Select a dataset that poses a classification problem. Download the data file and work in a notebook locally to perform your analysis. Be sure to grid search different model parameters and compare the different estimators. Construct a DataFrame of the model results with the following information:

| model | train score | test score | average fit time |
| ----- | -----   | -------   | ------- |
| KNN | ? | ? | ? |
| Logistic Regression | ? | ? | ? |
| SVC | ? | ? | ? |

The assignment expects a DataFrame with exactly this structure, including the index and column names, and it is graded on an exact match of that structure. One suggestion is to build the DataFrame in your local notebook, write it out to `.json`, and paste the result below to recreate it. Alternatively, write it out to a `.csv` file and copy the text, or simply hardcode the DataFrame from your results.
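
The JSON route described above could look like the following sketch. The scores and times are placeholders rather than real results, and the `records` orientation is just one choice that keeps the model names as an ordinary column during the round trip:

```python
from io import StringIO

import pandas as pd

# Placeholder scores and times for illustration only; substitute your own results.
results = pd.DataFrame(
    {
        "model": ["KNN", "Logistic Regression", "SVC"],
        "train score": [0.91, 0.92, 0.93],
        "test score": [0.90, 0.91, 0.92],
        "average fit time": [0.01, 0.02, 0.03],
    }
).set_index("model")

# Move the index back into a column so the model names survive serialization.
json_text = results.reset_index().to_json(orient="records")

# Paste json_text into the graded cell, then rebuild the DataFrame from it.
restored = pd.read_json(StringIO(json_text), orient="records", convert_dates=False)
restored = restored.set_index("model")

print(restored.shape)          # (3, 3)
print(list(restored.columns))  # ['train score', 'test score', 'average fit time']
```

Note that `to_json` rounds floats to 10 decimal places by default, so hardcoding the values is the safer route if full precision matters.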

In [57]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
rice_cammeo_and_osmancik = fetch_ucirepo(id=545)
  
# data (as pandas dataframes) 
rice_data = rice_cammeo_and_osmancik.data.original
rice_data.info()

#X = rice_cammeo_and_osmancik.data.features 
#y = rice_cammeo_and_osmancik.data.targets 
  
# metadata 
# print(rice_cammeo_and_osmancik.metadata) 
  
# variable information 
# print(rice_cammeo_and_osmancik.variables) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3810 entries, 0 to 3809
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Area               3810 non-null   int64  
 1   Perimeter          3810 non-null   float64
 2   Major_Axis_Length  3810 non-null   float64
 3   Minor_Axis_Length  3810 non-null   float64
 4   Eccentricity       3810 non-null   float64
 5   Convex_Area        3810 non-null   int64  
 6   Extent             3810 non-null   float64
 7   Class              3810 non-null   object 
dtypes: float64(5), int64(2), object(1)
memory usage: 238.3+ KB


In [None]:
#   Column Key Code

#   0                Area    
#   1           Perimeter
#   2   Major_Axis_Length
#   3   Minor_Axis_Length
#   4        Eccentricity  
#   5         Convex_Area
#   6              Extent  
#   7               Class

sns.pairplot(rice_data, hue='Class')

fig1 = px.scatter(rice_data, x='Major_Axis_Length', y='Perimeter', color='Class')
fig1.show()

fig2 = px.scatter(rice_data, x='Major_Axis_Length', y='Convex_Area', color='Class')
fig2.show()

In [58]:
X = rice_data[['Major_Axis_Length', 'Convex_Area']]
y = rice_data.Class

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=518)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_train)
X_ts_scaled = scaler.transform(X_test)

### Logistic Regression

In [84]:
start = time.time()
lgr = LogisticRegression(multi_class='multinomial').fit(X_tr_scaled, y_train)
end = time.time()

lgr_train_score = lgr.score(X_tr_scaled,y_train)
lgr_test_score = lgr.score(X_ts_scaled,y_test)
elapsed_time = end - start

print(lgr_train_score)
print(lgr_test_score)
print(elapsed_time)

0.928596429821491
0.9265477439664218
0.00797891616821289


### KNN

In [76]:
params = {'n_neighbors': list(range(1, 22, 2))}

knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid=params)
knn_grid.fit(X_tr_scaled, y_train)
best_k = knn_grid.best_params_['n_neighbors']
best_acc = knn_grid.score(X_ts_scaled, y_test)

print(best_acc)
print(best_k)

df1 = pd.DataFrame(knn_grid.cv_results_)
time1 = knn_grid.cv_results_['mean_fit_time'].sum()
index1 = np.argmax(df1['mean_test_score'])

print(f'Compute time: {time1}')
# Double quotes outside so the inner single quotes parse on Python versions before 3.12
print(f"Parameters of Max Test Score: {df1['params'][index1]}")
print(f"Max Test Score: {df1['mean_test_score'][index1]}")


0.9244491080797481
15
Compute time: 0.03130545616149902
Parameters of Max Test Score: {'n_neighbors': 15}
Max Test Score: 0.924045044272715
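

The "Compute time" printed above is the sum of `mean_fit_time` over every candidate value of `n_neighbors`, so it is not the fit time of the selected model. One alternative, sketched here on synthetic data from `make_classification` (a stand-in, not the rice dataset), is to read the entry at `best_index_`, which is the fit time of the best candidate averaged over the cross-validation folds:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch, not the rice dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
).fit(X_demo, y_demo)

results = demo_grid.cv_results_
best = demo_grid.best_index_

# mean_fit_time[i] is the fit time of candidate i averaged over the CV folds.
best_mean_fit_time = results["mean_fit_time"][best]
total_of_means = results["mean_fit_time"].sum()

print(f"Best params: {demo_grid.best_params_}")
print(f"Mean fit time of best candidate: {best_mean_fit_time:.5f} s")
print(f"Sum over all candidates: {total_of_means:.5f} s")
```

Timings vary from run to run and machine to machine, so no expected output is shown.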


In [83]:
start1 = time.time()
knn_final = KNeighborsClassifier(n_neighbors=15).fit(X_tr_scaled, y_train)
end1 = time.time()

knn_train_score = knn_final.score(X_tr_scaled, y_train)
knn_test_score = knn_final.score(X_ts_scaled,y_test)
elapsed_time1 = end1 - start1

print(knn_train_score)
print(knn_test_score)
print(elapsed_time1)

0.9334966748337417
0.9244491080797481
0.004988908767700195


### SVC

In [70]:
params = {'kernel': ['rbf', 'poly', 'linear', 'sigmoid'],
         'gamma': [0.1, 1.0, 10.0, 100.0],}

grid = GridSearchCV(SVC(), param_grid=params).fit(X_tr_scaled, y_train)
grid_score = grid.score(X_ts_scaled, y_test)

In [72]:
best_kernel = grid.best_params_['kernel']
best_gamma = grid.best_params_['gamma']

print(best_kernel)
print(best_gamma)

linear
0.1
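

In scikit-learn's `SVC`, `gamma` is a coefficient for the `rbf`, `poly`, and `sigmoid` kernels and is ignored by the `linear` kernel, so the grid above fits the linear kernel once per `gamma` value with the same outcome each time. As a sketch of one possible alternative (a suggestion, not part of the original analysis), a list of dictionaries restricts `gamma` to the kernels that use it; `ParameterGrid` is used here only to count the candidates:

```python
from sklearn.model_selection import ParameterGrid

# The grid used above: every kernel is crossed with every gamma value.
full_grid = {
    "kernel": ["rbf", "poly", "linear", "sigmoid"],
    "gamma": [0.1, 1.0, 10.0, 100.0],
}

# Alternative: search gamma only for the kernels that use it.
split_grid = [
    {"kernel": ["linear"]},
    {"kernel": ["rbf", "poly", "sigmoid"], "gamma": [0.1, 1.0, 10.0, 100.0]},
]

print(len(ParameterGrid(full_grid)))   # 16 candidates
print(len(ParameterGrid(split_grid)))  # 13 candidates
```

Either form can be passed as `param_grid` to `GridSearchCV`.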


In [None]:
#df2 = pd.DataFrame(grid.cv_results_)
#svc_time = grid.cv_results_['mean_fit_time'].sum()  # renamed so it does not shadow the time module
#svc_index = np.argmax(df2['mean_test_score'])

#print(f'Compute time: {svc_time}')
#print(f"Parameters of Max Test Score: {df2['params'][svc_index]}")
#print(f"Max Test Score: {df2['mean_test_score'][svc_index]}")

In [82]:
start2 = time.time()
# gamma is ignored by the linear kernel; it is passed only to mirror the grid search result
svc_final = SVC(kernel='linear', gamma=0.1).fit(X_tr_scaled, y_train)
end2 = time.time()

svc_train_score = svc_final.score(X_tr_scaled, y_train)
svc_test_score = svc_final.score(X_ts_scaled,y_test)
elapsed_time2 = end2 - start2

print(svc_train_score)
print(svc_test_score)
print(elapsed_time2)


0.9292964648232411
0.9265477439664218
0.04687809944152832


### Problem 1

#### DataFrame of modeling results

Assign your constructed results DataFrame to `results_df` below.  Be sure that the `model` column above is the index of the DataFrame, and the three column names match the order and formatting of the example above.

In [85]:
### GRADED
results_df = ''

    
### BEGIN SOLUTION
res_dict = {'model': ['KNN', 'Logistic Regression', 'SVC'],
           'train score': [0.9334966748337417, 0.928596429821491, 0.9292964648232411],
           'test score': [0.9244491080797481, 0.9265477439664218, 0.9265477439664218],
           'average fit time': [0.004988908767700195, 0.00797891616821289, 0.04687809944152832]}
results_df = pd.DataFrame(res_dict).set_index('model')
### END SOLUTION

### ANSWER CHECK
print(type(results_df))
print(results_df.shape)

<class 'pandas.core.frame.DataFrame'>
(3, 3)


In [3]:
### BEGIN HIDDEN TESTS
res_dict_ = {'model': ['KNN', 'Logistic Regression', 'SVC'],
           'train score': [0, 0, 0],
           'test score': [0, 0, 0],
           'average fit time': [0, 0, 0]}
results_df_ = pd.DataFrame(res_dict_).set_index('model')
#
#
#
assert results_df.shape == results_df_.shape
assert type(results_df) == type(results_df_)
### END HIDDEN TESTS