# Assignment 2: Classification

### <font color='red'>Submit before the deadline; late submissions are not accepted.</font>

Deliverables:

- Submit your answers to the conceptual questions (described in another file) as a PDF file
- Write your code in the designated cells of this file, marked "YOUR CODE HERE"
- Write your discussion in the designated cells, marked "YOUR DISCUSSION HERE"
- Submit two files to eLearning: the .pdf file and the .ipynb file


This assignment covers Supervised Learning models. In this assignment, you are required to use one clean dataset to train FOUR classification models for discrete targets.


The total score of the implementation part is: 70 pts

In [3]:
NAME = "Tarun Chegondi Naga Sri Narahari Sai"
NAME

'Tarun Chegondi Naga Sri Narahari Sai'

In [4]:
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

### Data
For this section of the assignment we will be working with the [UCI Mushroom Data Set](http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io) stored in `mushrooms.csv`. The data will be used to train a model to predict whether or not a mushroom is poisonous. The following attributes are provided:

*Attribute Information:*

1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s 
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s 
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y 
4. bruises?: bruises=t, no=f 
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s 
6. gill-attachment: attached=a, descending=d, free=f, notched=n 
7. gill-spacing: close=c, crowded=w, distant=d 
8. gill-size: broad=b, narrow=n 
9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y 
10. stalk-shape: enlarging=e, tapering=t 
11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? 
12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s 
13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s 
14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 
16. veil-type: partial=p, universal=u 
17. veil-color: brown=n, orange=o, white=w, yellow=y 
18. ring-number: none=n, one=o, two=t 
19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z 
20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y 
21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y 
22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

<br>

The mushrooms dataset is encoded entirely with single-letter strings, so the categorical variables must be converted into numeric indicators before training. One way to do this is the `pd.get_dummies` function, which creates one 0/1 indicator column per category value (named `<attribute>_<value>`), giving a numeric representation that the algorithms in sklearn can use.
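To make the encoding concrete, here is a minimal dependency-free sketch of the same idea. The `one_hot` helper and the three sample rows are hypothetical and only mimic the naming and column ordering that `pd.get_dummies` produces; it also assumes the target column is called `class` and coded `e`/`p`, which this notebook does not show.

```python
def one_hot(rows, columns):
    """Expand string-coded rows into 0/1 indicator columns named '<column>_<value>'."""
    # Collect the sorted distinct values seen in each column.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    names = [f"{col}_{val}" for col in columns for val in categories[col]]
    encoded = [
        [int(row[col] == val) for col in columns for val in categories[col]]
        for row in rows
    ]
    return names, encoded


# Hypothetical three-row sample using the letter codes from the attribute list.
sample = [
    {"class": "p", "odor": "p"},
    {"class": "e", "odor": "a"},
    {"class": "e", "odor": "n"},
]
names, encoded = one_hot(sample, ["class", "odor"])
print(names)
for row in encoded:
    print(row)
```

Under that assumption, the first two indicator columns would be `class_e` and `class_p`, which is why the next cell takes column 1 as the target (poisonous) and columns 2 onward as features.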

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

mush = pd.read_csv('mushrooms.csv')
cat_mush = pd.get_dummies(mush)

X_mush = cat_mush.iloc[:,2:]
y_mush = cat_mush.iloc[:,1]

X_train, X_test, y_train, y_test = train_test_split(X_mush, y_mush, random_state=0)

In [6]:
X_train.shape[0]

6093
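The 6093 training rows are consistent with the default 25% test share of `train_test_split`, assuming the file holds the full 8124-row UCI data set (an assumption; the total row count is not printed in this notebook). A quick arithmetic check:

```python
import math

n_rows = 8124          # assumption: size of the full UCI Mushroom data set
test_fraction = 0.25   # train_test_split default when test_size and train_size are omitted

n_test = math.ceil(n_rows * test_fraction)  # the test share is rounded up
n_train = n_rows - n_test                   # the remainder goes to training
print(n_train, n_test)
```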

As we can see from the figure, the two classes do not seem to be linearly separable, which can create some challenges for classification. Let us try different models on the classification task and compare their performance.

### Question 1 (10 points)
- Train a DecisionTreeClassifier with default parameters and random_state=0. 
- What are the 5 most important features found by the decision tree?

In [7]:
# YOUR CODE HERE (10 points)
from sklearn.tree import DecisionTreeClassifier

# Train a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=0)
dt_classifier.fit(X_train, y_train)

# Retrieve and sort feature importances
importances = dt_classifier.feature_importances_
feature_names = X_mush.columns
important_features = sorted(zip(feature_names, importances), key=lambda x: x[1], reverse=True)[:5]

# Print the most important features
for feature, importance in important_features:
    print(f"{feature}: {importance}")


odor_n: 0.6251435175471661
stalk-root_c: 0.1691757144252228
stalk-root_r: 0.08658915843078754
spore-print-color_r: 0.03437506344670402
odor_l: 0.023503682936672883
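Because `get_dummies` splits each attribute into one column per value, the ranking above is per indicator column rather than per original attribute. One possible way to read it at the attribute level is to sum the importances that share a prefix. The sketch below does this for the five printed values only, so the totals are lower bounds rather than full attribute importances; the `by_attribute` name is illustrative.

```python
from collections import defaultdict

# Top-5 importances copied from the output above (all other columns are omitted).
top_importances = {
    "odor_n": 0.6251435175471661,
    "stalk-root_c": 0.1691757144252228,
    "stalk-root_r": 0.08658915843078754,
    "spore-print-color_r": 0.03437506344670402,
    "odor_l": 0.023503682936672883,
}

by_attribute = defaultdict(float)
for column, importance in top_importances.items():
    attribute = column.rsplit("_", 1)[0]  # strip the '_<value>' suffix added by get_dummies
    by_attribute[attribute] += importance

ranked = sorted(by_attribute.items(), key=lambda kv: kv[1], reverse=True)
for attribute, total in ranked:
    print(f"{attribute}: {total:.4f}")
```

Grouped this way, `odor` accounts for roughly 0.65 and `stalk-root` for roughly 0.26 of the total importance among the printed columns.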


### Question 2 (10 points)
- Train a Decision Tree model. Set `max_depth` to 6, `min_samples_split` to 2, `max_leaf_nodes` to 10, and `random_state` to 0.
- Report the test accuracy of the decision tree model.

In [8]:
# YOUR CODE HERE (10 points)
# Train the Decision Tree model with specified parameters
dt_classifier = DecisionTreeClassifier(max_depth=6, min_samples_split=2, max_leaf_nodes=10, random_state=0)
dt_classifier.fit(X_train, y_train)

# Predict and calculate the accuracy
from sklearn.metrics import accuracy_score
y_pred = dt_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy}")

Test Accuracy: 1.0
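For reference, `accuracy_score` with default arguments is the fraction of test rows whose predicted label matches the true label. A dependency-free sketch of that calculation, using made-up labels rather than the mushroom data (the `accuracy` helper is hypothetical):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = poisonous, 0 = edible.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 labels match
```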


### Question 2 (15 points)
- Train a linear SVM classifier with grid search and cross-validation. Set random_state to 0. Let the choices of C be: [0.001, 0.01, 0.1, 1, 10, 100, 10000, 1000000]. Use 5-fold cross-validation.
- Report (1) the best C chosen, (2) the test accuracy under the best model, and (3) the mean validation accuracy through the cross-validation process (under the best model).
- Given the choice, discuss briefly: do you think a hard-margin SVM can outperform a soft-margin SVM in this case? Why?
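As background for the role of C, the sketch below evaluates the textbook soft-margin objective, 0.5·||w||² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), for a fixed hypothetical separator. The data, `w`, and `b` are made up for illustration, and the function is not taken from scikit-learn; it only shows how C scales the penalty on margin violations. As C grows, violations become prohibitively expensive and the problem approaches the hard-margin case.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        hinge_total += max(0.0, 1.0 - yi * score)  # zero when the point is outside the margin
    return margin_term + C * hinge_total


# Hypothetical 1-D data; the last point lies on the wrong side of the boundary.
X = [[2.0], [1.5], [-2.0], [0.5]]
y = [1, 1, -1, -1]
w, b = [1.0], 0.0

for C in (0.001, 1, 1000000):
    print(C, soft_margin_objective(w, b, X, y, C))
```

Only the misclassified point contributes a hinge loss (1.5 here), so the objective is 0.5 + 1.5·C: nearly indifferent to the violation at C = 0.001 and dominated by it at C = 1000000.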

In [9]:
# YOUR CODE HERE (10 points)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 10000, 1000000]}

# Set up the GridSearch with 5-fold cross-validation.
# SVC defaults to kernel='rbf', so the linear kernel is requested explicitly.
grid_search = GridSearchCV(SVC(kernel='linear', random_state=0), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
best_C = grid_search.best_params_['C']
best_score = grid_search.best_score_
test_accuracy = grid_search.score(X_test, y_test)

print(f"Best C: {best_C}")
print(f"Mean Validation Accuracy: {best_score}")
print(f"Test Accuracy: {test_accuracy}")

Best C: 1
Mean Validation Accuracy: 1.0
Test Accuracy: 1.0


### Question 3 (10 points)
- Train a kernel SVM classifier with grid search and cross-validation. Set random_state to 0 and use the RBF kernel. Let the choices of C be: [0.1, 1, 10]. Let the choices of gamma be: [0.0001, 0.001, 0.01, 0.1, 1, 10]. Use 5-fold cross-validation.
- Report (1) the best C and gamma chosen, (2) the test accuracy under the best model, and (3) the mean validation accuracy through the cross-validation process (under the best model).
- Note: The code may take up to several minutes to run. 
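The run time follows from the size of the search: every (C, gamma) pair is trained once per fold. A small sketch of the count, using only the values listed above:

```python
from itertools import product

C_values = [0.1, 1, 10]
gamma_values = [0.0001, 0.001, 0.01, 0.1, 1, 10]
n_folds = 5

candidates = list(product(C_values, gamma_values))  # every (C, gamma) combination
n_cv_fits = len(candidates) * n_folds               # one fit per candidate per fold
print(len(candidates), n_cv_fits)
```

That is 18 candidates and 90 cross-validation fits. With `refit=True`, the `GridSearchCV` default, the best candidate is then fitted once more on the whole training set.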

In [11]:
# YOUR CODE HERE
# Adjust the parameter grid for C and gamma
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10]}

# Set up the GridSearch with the rbf kernel
grid_search = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and accuracy
best_params = grid_search.best_params_
best_score = grid_search.best_score_
test_accuracy = grid_search.score(X_test, y_test)

print(f"Best Parameters: {best_params}")
print(f"Mean Validation Accuracy: {best_score}")
print(f"Test Accuracy: {test_accuracy}")


Best Parameters: {'C': 1, 'gamma': 0.1}
Mean Validation Accuracy: 1.0
Test Accuracy: 1.0
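To give some intuition for gamma, the RBF kernel scores the similarity of two rows as exp(−gamma·||a − b||²). For 0/1 indicator rows the squared distance is simply the number of columns in which they differ. The sketch below is a dependency-free illustration with two hypothetical rows; `rbf_kernel` here is a local helper, not the scikit-learn function of the same name.

```python
import math


def rbf_kernel(a, b, gamma):
    """exp(-gamma * ||a - b||^2) for two equal-length numeric vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Hypothetical one-hot rows that differ in two indicator columns.
u = [1, 0, 0, 1]
v = [0, 1, 0, 1]

for gamma in (0.0001, 0.1, 10):
    print(gamma, rbf_kernel(u, v, gamma))
```

A very small gamma makes almost all rows look alike (similarity near 1), while a very large gamma makes any two distinct rows look unrelated (similarity near 0); intermediate values sit between these extremes.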


### Question 4. Ensemble Methods - Random Forest (15 points)
- Train a random forest model. Specifically, train 100 decision trees (i.e., n_estimators=100). For each tree, set max_depth = 6, min_samples_split = 2, etc.
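As a rough picture of how the 100 trees are combined: scikit-learn's random forest documents its prediction as an average of the trees' class-probability estimates, with the highest averaged probability winning. The sketch below imitates that step with three hypothetical trees; `forest_predict` and the probabilities are made up for illustration.

```python
def forest_predict(tree_probabilities):
    """Average per-tree class probabilities; return (winning class index, averages)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [
        sum(probs[c] for probs in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged


# Hypothetical [P(edible), P(poisonous)] outputs from three trees for one mushroom.
votes = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
label, averaged = forest_predict(votes)
print(label, averaged)
```

In this toy case two of the three trees lean towards class 1, yet the averaged probabilities favour class 0, which shows how probability averaging can differ from a simple majority vote.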

In [12]:
# YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=2, random_state=0)
rf_classifier.fit(X_train, y_train)

# Predict and calculate the accuracy
y_pred = rf_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Test Accuracy: {accuracy}")

Random Forest Test Accuracy: 1.0


### Question 5 (10 Points)
- Compare the mean validation score across all models. If we would like to choose one model for prediction based on model performance (i.e., in this case, accuracy), which one would you choose? Explain briefly.
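A mean validation score is the average of the per-fold accuracies, so the same quantity can be computed for any of the models. The sketch below shows the mechanics under simplifying assumptions: plain contiguous folds (scikit-learn stratifies folds for classifiers by default), made-up labels, and a majority-class baseline standing in for a real model. All names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, validation) index lists."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def mean_validation_accuracy(y, k):
    """Score a majority-class baseline on each fold and average the fold accuracies."""
    scores = []
    for train, validation in k_fold_indices(len(y), k):
        train_labels = [y[i] for i in train]
        majority = max(set(train_labels), key=train_labels.count)  # "fit" on the training folds
        hits = sum(y[i] == majority for i in validation)           # "score" on the held-out fold
        scores.append(hits / len(validation))
    return sum(scores) / k, scores


# Hypothetical labels: 1 = poisonous, 0 = edible.
y = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
mean_score, fold_scores = mean_validation_accuracy(y, k=5)
print(fold_scores, mean_score)
```

Replacing the baseline with a real fit/predict step gives the per-fold scores that `GridSearchCV` averages into `best_score_`.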