In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (8,8)

## 3950 Assignment 1: Part 2

For this assignment we want to use some sort of tree based model to classify the data below. We have a very small training set, so overfitting is a very real concern. 

Some specifics for this assignment:
<ul>
<li>Please use the show_eda variable to control whether the EDA output is shown. I don't really need to see all the EDA stuff (nor do you after you've done it), so we can make it configurable with a variable to save time. Please set this to False when you submit, so I can run all and see the outcome without histograms etc.
<li>Please ensure that whatever model you end up with is in a variable named best at the end.
<li>Please use some pipeline in prepping the data. The test data is in an identical format to the training data, so whatever pipeline you've created for your training will work for the testing. 
<li>The accuracy scoring will be an average of accuracy and roc_auc. 
</ul>
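
The scoring rule in the last bullet can be sketched as a small helper. This is only an illustration: `combined_score` is a hypothetical name, not part of the assignment, and it falls back to hard class predictions for the ROC AUC when no probabilities are supplied, which mirrors how the testing cell at the end of this notebook calls `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def combined_score(y_true, y_pred, y_score=None):
    """Average of accuracy and ROC AUC (hypothetical helper for illustration)."""
    # With no probability scores given, use the hard predictions for the AUC.
    if y_score is None:
        y_score = y_pred
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_score)
    return (acc + auc) / 2


# Tiny made-up example: three of four labels are predicted correctly.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
demo_score = combined_score(y_true, y_pred)
print(demo_score)
```

Passing predicted probabilities as `y_score` usually gives a more informative AUC than hard labels, since it ranks observations rather than only splitting them into two groups.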

### Grading Metrics
<ul>
<li><b>Pipeline Used - 10pts</b> The data loading needs to be in a pipeline. See the test part for illustration. When testing I'll call your pipe with the new data (format is identical to training), so any prep stuff should be in the pipeline. 
<li><b>Tree Based Model Used - 5pts</b> The model used for classification needs to be some variety of tree, beyond that it is up to you. 
<li><b>Accuracy - 5pts</b> The final accuracy achieved. This will be a rough ranking; I'm assuming most people will get a similar level of accuracy, so marks will only be deducted if yours is far worse, as that's an indication that you probably didn't take any/many steps to improve things. 
<li><b>Clarity and Formatting - 5pts</b> Is it organized and can I read it?
    <ul>
    <li> <b>Note:</b> for this assignment, and in general, please get rid of my comments and replace them with your own. I'm going to read this, so all of these instructions aren't really required. Think of this as a template, get rid of the stuff that isn't needed, and leave only the things you need to explain your code. 
    </ul>
</ul>

For submission, please drop the URL for your repository in the dropbox.

In [2]:
#Please change to your name.
name = "Jasman Jawandha"

#Please use this to control EDA. 
show_eda = False

In [3]:
#Load data
df = pd.read_csv("training.csv")
df = df.drop(columns=["id"])
df.sample(5)

Unnamed: 0,target,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199,var_200
140,1,0.581,0.059,0.424,0.867,0.111,0.097,0.669,0.304,0.537,...,0.759,0.503,0.03,0.614,0.051,0.109,0.022,0.454,0.329,0.132
87,1,0.242,0.078,0.858,0.596,0.332,0.712,0.493,0.323,0.504,...,0.231,0.658,0.354,0.771,0.695,0.333,0.384,0.772,0.67,0.726
0,0,0.66,0.106,0.434,0.387,0.903,0.661,0.158,0.291,0.21,...,0.015,0.377,0.479,0.05,0.395,0.123,0.833,0.461,0.99,0.105
132,0,0.416,0.034,0.056,0.294,0.856,0.781,0.863,0.985,0.71,...,0.814,0.724,0.206,0.567,0.75,0.334,0.373,0.319,0.981,0.438
129,0,0.111,0.325,0.558,0.17,0.591,0.991,0.489,0.836,0.462,...,0.466,0.798,0.401,0.231,0.509,0.312,0.178,0.196,0.98,0.046


### Starting

For this assignment, you have a small training set, so combatting overfitting is key in being accurate!
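
With only a few hundred rows, a single train/validation split gives a noisy estimate of how well a model generalizes. One option is repeated stratified cross-validation, sketched below. The data here is a synthetic stand-in generated with `make_classification` (an assumption, not the contents of training.csv), and the model settings are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the assignment data (assumption).
X_demo, y_demo = make_classification(
    n_samples=250, n_features=200, n_informative=10, random_state=0
)

# 5 folds repeated twice gives 10 scores; their spread shows how noisy
# a single split would be on a dataset this small.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```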

In [4]:
df.shape

(250, 201)

In [5]:
df.columns

Index(['target', 'var_1', 'var_2', 'var_3', 'var_4', 'var_5', 'var_6', 'var_7',
       'var_8', 'var_9',
       ...
       'var_191', 'var_192', 'var_193', 'var_194', 'var_195', 'var_196',
       'var_197', 'var_198', 'var_199', 'var_200'],
      dtype='object', length=201)

#### Do Modelling Stuff

Make a tree model (of some variety) and make it fit well. Keep in mind the possibility of your tree overfitting, and think of steps you may need to combat that should it occur. 
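
As a rough illustration of what overfitting looks like, the sketch below compares training accuracy against cross-validated accuracy for an unconstrained decision tree and a constrained one. The data is synthetic (an assumption, not training.csv), and the `max_depth` and `min_samples_leaf` values are arbitrary choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the assignment data (assumption).
X_demo, y_demo = make_classification(
    n_samples=250, n_features=200, n_informative=10, random_state=0
)

settings = {
    "unconstrained": {},
    "constrained": {"max_depth": 3, "min_samples_leaf": 10},
}

results = {}
for label, params in settings.items():
    tree = DecisionTreeClassifier(random_state=0, **params)
    # Accuracy on the same data the tree was fitted on.
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Accuracy on held-out folds; cross_val_score refits clones of the tree.
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    results[label] = (train_acc, cv_acc)
    print(f"{label}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A large gap between the training and cross-validated numbers is the usual sign that the tree has memorized the training rows.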

In [6]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# pandas, accuracy_score, and roc_auc_score are already imported in the first cell,
# and df was loaded (with the "id" column dropped) in the data-loading cell above.


class EDA:
    def __init__(self, dataframe, target):
        self.data = dataframe
        self.target = target
        self.numerical_features = self.data.drop(columns=[self.target]).columns.tolist()

    def describe(self):
        # Descriptive statistics
        return self.data.describe()

    def plot_distributions(self, n_cols=2):
        # Plot distributions of numerical features
        n = len(self.numerical_features)
        n_rows = np.ceil(n / n_cols).astype(int)
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 6, n_rows * 4))
        axes = axes.flatten()

        for i, col in enumerate(self.numerical_features):
            sns.histplot(data=self.data, x=col, kde=True, ax=axes[i])
            axes[i].set_title(f'Distribution of {col}')
            axes[i].set_ylabel('Frequency')
            axes[i].set_xlabel(col)

        # Remove empty subplots
        for i in range(n, n_rows * n_cols):
            fig.delaxes(axes[i])
        
        plt.tight_layout()
        plt.show()

    def plot_correlation_matrix(self):
        # Plot correlation matrix
        corr_matrix = self.data[self.numerical_features].corr()
        plt.figure(figsize=(12, 10))
        sns.heatmap(corr_matrix, cmap='coolwarm', annot=False, fmt=".1f")
        plt.title('Correlation Matrix of Numerical Features')
        plt.show()

    def plot_target_distribution(self):
        # Plot distribution of the target variable
        sns.countplot(x=self.target, data=self.data)
        plt.title('Distribution of Target Variable')
        plt.show()

    def plot_feature_correlation_with_target(self):
        # Plot correlations of numerical features with the target variable
        target_corr = self.data.corr()[self.target].sort_values(ascending=False)
        target_corr.drop(self.target, inplace=True) # Remove target self-correlation
        plt.figure(figsize=(8, 10))
        sns.barplot(x=target_corr, y=target_corr.index)
        plt.title('Feature Correlation with Target')
        plt.xlabel('Correlation Coefficient')
        plt.ylabel('Features')
        plt.show()


# Instantiate the EDA class
eda = EDA(df, 'target')

# Run the EDA only when show_eda (set in the configuration cell above) is True
if show_eda:
    # Descriptive statistics
    print(eda.describe())

    # Plots
    eda.plot_distributions()  # Distributions of features
    eda.plot_correlation_matrix()  # Correlation matrix
    eda.plot_target_distribution()  # Distribution of the target variable
    eda.plot_feature_correlation_with_target()  # Feature correlation with the target


# Split the data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Splitting data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing and modeling pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Feature scaling (optional here: tree splits are insensitive to feature scale)
    ('classifier', RandomForestClassifier(random_state=42))  # Tree-based model
])

# Parameters for GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Grid search with cross-validation
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Best model
best = grid_search.best_estimator_

# Predictions for evaluation
y_pred = best.predict(X_test)
y_prob = best.predict_proba(X_test)[:, 1]  # Probability estimates for ROC AUC

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

# Final score
final_score = (accuracy + roc_auc) / 2

print(f'Accuracy: {accuracy}')
print(f'ROC AUC: {roc_auc}')
print(f'Final Score: {final_score}')

# The above best model can now be used with new data in the same format



Fitting 5 folds for each of 81 candidates, totalling 405 fits
Accuracy: 0.58
ROC AUC: 0.6639610389610389
Final Score: 0.6219805194805195


### Finishing

At the conclusion, please name your best model "best". If you look down below in the testing stuff, it should be usable to score as "best". 

You should be able to call it like this and it should work (with whatever data names you have)

In [7]:
#This can be moved from here

print(best.score(X_test, y_test))
print(best)

0.58
Pipeline(steps=[('scaler', StandardScaler()),
                ('classifier',
                 RandomForestClassifier(min_samples_leaf=2, n_estimators=300,
                                        random_state=42))])


### Testing

Please leave the stuff below as-is in your file. 

This will take your best model and score it with the test data. If you want to test to make sure that yours works, make a copy of the data file and rename it testing.csv, then make sure this runs ok. I will do the same, but the contents of my test file will be different. 
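
The copy-and-rename step can be done from Python as well. The sketch below works inside a temporary folder with a tiny made-up CSV so that it does not touch real files; in the actual project folder, the equivalent would be a single `shutil.copy("training.csv", "testing.csv")`.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

# Temporary folder and a tiny made-up training file (assumptions for the demo).
workdir = Path(tempfile.mkdtemp())
train_path = workdir / "training.csv"
pd.DataFrame(
    {"id": [0, 1], "target": [0, 1], "var_1": [0.1, 0.9]}
).to_csv(train_path, index=False)

# Copy the training file under the name the testing cell expects.
test_path = workdir / "testing.csv"
shutil.copy(train_path, test_path)

# Load it the same way the testing cell does.
test_df = pd.read_csv(test_path).drop(columns=["id"])
print(test_df.shape)
```

Scores obtained this way are optimistic, since the model has already seen these rows; the check only confirms that the testing cell runs.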

In [8]:
#Load Test Data
test_df = pd.read_csv("testing.csv")
test_df = test_df.drop(columns={"id"})
#Create tests and score
test_y = np.array(test_df["target"]).reshape(-1,1)
test_X = np.array(test_df.drop(columns={"target"}))

preds = best.predict(test_X)

roc_score = roc_auc_score(test_y, preds)
acc_score = accuracy_score(test_y, preds)

print(roc_score)
print(acc_score)
print(name, np.mean([roc_score, acc_score]))


0.6508886235128861
0.6505822784810127
Jasman Jawandha 0.6507354509969494


### What Accuracy Changes Were Used

Please list here what you did to try to increase accuracy and/or limit overfitting:
The changes made to try to increase accuracy center on broader model tuning and the inclusion of an Exploratory Data Analysis (EDA) process:
<ul>
<li><b>Expanded hyperparameter space in the grid search.</b> The grid for the RandomForestClassifier covers several values of n_estimators, max_depth, min_samples_split, and min_samples_leaf (81 combinations, each evaluated with 5-fold cross-validation). This broader search space lets GridSearchCV explore more model configurations and potentially find parameters that improve performance. Smaller values of max_depth and larger values of min_samples_split and min_samples_leaf also constrain the trees, which helps limit overfitting.
<li><b>Inclusion of an EDA process.</b> An EDA class was added, and its methods run only when the show_eda flag is set. EDA does not directly change metrics such as accuracy and ROC AUC, but it is useful for:
    <ul>
    <li>Identifying anomalies, outliers, or errors in the data that could mislead the training process.
    <li>Discovering patterns, correlations, or trends that could inform feature engineering, feature selection, or the choice of model.
    <li>Ensuring data quality and consistency, which are foundational for training effective models.
    </ul>
</ul>
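
One further tuning idea, not used in the notebook above: the grid search there selects parameters by accuracy alone, while the assignment's metric averages accuracy and ROC AUC. A possible refinement is to hand GridSearchCV a custom scoring callable that matches the grading metric. The sketch below uses synthetic data and a deliberately small grid; `accuracy_auc_scorer`, `demo_pipe`, and `demo_grid` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def accuracy_auc_scorer(estimator, X, y):
    """Average of accuracy and ROC AUC, using the (estimator, X, y) scorer signature."""
    acc = accuracy_score(y, estimator.predict(X))
    auc = roc_auc_score(y, estimator.predict_proba(X)[:, 1])
    return (acc + auc) / 2


# Synthetic stand-in data (assumption, not training.csv).
X_demo, y_demo = make_classification(n_samples=120, n_features=20, random_state=0)

demo_pipe = Pipeline([
    ("classifier", RandomForestClassifier(n_estimators=25, random_state=0)),
])
demo_grid = {
    "classifier__max_depth": [2, 4],
    "classifier__min_samples_leaf": [1, 5],
}

# GridSearchCV accepts a callable as the scoring argument.
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3, scoring=accuracy_auc_scorer)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

Whether this helps on the real data would need to be checked with cross-validation; it only aligns the selection criterion with the grading metric.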


