## 3950 Assignment 1: Part 2

For this assignment we want to use some sort of tree based model to classify the data below. We have a very small training set, so overfitting is a very real concern. 


Importing Libraries

- Setting Up: Loading Necessary Tools

In [20]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (8, 8)

In [19]:
name = "Francky Katana"

#Settings used to control the EDA. 
show_eda = False

In [8]:
# Loading the data
df = pd.read_csv('training.csv')
df = df.drop(columns={"id"})
if show_eda:
    sns.pairplot(df)
    plt.show()

In [9]:
df.shape

(250, 201)

#### Modelling 

- Getting Data Ready
- Constructing the Model
- Fine-Tuning: Adjusting the Model for Best Performance
- Teaching Our Model

In [10]:
# Splitting the data into features and target
X = df.drop('target', axis=1)
y = df['target']

In [11]:
# Building a pipeline with a decision tree classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])

In [12]:
# Configuring hyperparameters for the Decision Tree to prevent overfitting
parameters = {
    'classifier__max_depth': [None, 10, 20, 30, 40, 50],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

In [13]:
# Using GridSearchCV to automatically test and pick the best settings for our model and to make sure it works well on different parts of the data.
grid_search = GridSearchCV(pipe, parameters, cv=5, scoring='accuracy')
grid_search.fit(X, y)

In [21]:
# Extracting the best pipeline (including both the scaler and the classifier)
best_pipeline = grid_search.best_estimator_

# Extracting the best model
best = grid_search.best_estimator_.named_steps['classifier']

### Testing the Model

In [25]:
#Load Test Data
test_df = pd.read_csv("testing.csv")
test_df = test_df.drop(columns={"id"})
#Create tests and score
test_y = np.array(test_df["target"]).reshape(-1,1)
test_X = np.array(test_df.drop(columns={"target"}))

preds = best.predict(test_X)

roc_score = roc_auc_score(test_y, preds)
acc_score = accuracy_score(test_y, preds)

print(roc_score)
print(acc_score)
print(name, np.mean([roc_score, acc_score]))


0.539171212633071
0.5397974683544304
Francky Katana 0.5394843404937507


### What Accuracy Changes Were Used

Please list here what you did to try to increase accuracy and/or limit overfitting:
<ul>
<li>I used GridSearchCV for fine-tuning by testing different settings of the decision tree to see which one works the best. It helped me find the best tree depth and the right number of samples needed before making a decision in the tree.
<li> Implemented cross-validation to checks how well our model performs on different parts of the data. 
<li> I tweaked settings like how deep the tree should grow (max_depth), how many samples we need at least to split a decision (min_samples_split), and how many samples we need at the end (min_samples_leaf) to help prevent the tree from being too complex and memorizing the data (overfitting).
</ul>