In [11]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Seaborn for fancy plots
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (8,8)

## 3950 Assignment 1: Part 2

For this assignment we want to use some sort of tree based model to classify the data below. We have a very small training set, so overfitting is a very real concern. 

Some specifics for this assignment:
<ul>
<li>Please use the show_eda variable to control whether EDA output is shown. I don't really need to see all the EDA stuff (nor do you after you've done it), so we can make it configurable with a variable to save time. Please set this to False when you submit, so I can run all and see the outcome without histograms etc.
<li>Please ensure that whatever model you end up with is in a variable named best at the end.
<li>Please use some pipeline in prepping the data. The test data is in an identical format to the training data, so whatever pipeline you've created for your training will work for the testing. 
<li>The accuracy scoring will be an average of accuracy and roc_auc. 
</ul>
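
The combined score described in the last bullet can be sketched as follows. The helper name `combined_score` and the toy labels are hypothetical and used only for illustration; the computation mirrors the testing cell, which passes hard class predictions to both metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def combined_score(y_true, y_pred):
    """Average of accuracy and ROC AUC (hypothetical helper name)."""
    acc = accuracy_score(y_true, y_pred)
    roc = roc_auc_score(y_true, y_pred)
    return np.mean([acc, roc])


# Toy labels for illustration only, not assignment data
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

score = combined_score(y_true, y_pred)
print(score)
```

Because the ROC AUC here is computed from class labels rather than predicted probabilities, it reduces to the mean of the true positive rate and the true negative rate.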

### Grading Metrics
<ul>
<li><b>Pipeline Used - 10pts</b> The data loading needs to be in a pipeline. See the test part for illustration. When testing I'll call your pipe with the new data (format is identical to training), so any prep stuff should be in the pipeline. 
<li><b>Tree Based Model Used - 5pts</b> The model used for classification needs to be some variety of tree, beyond that it is up to you. 
<li><b>Accuracy - 5pts</b> The final accuracy achieved. This will be a rough ranking. I'm assuming most people will get a similar level of accuracy; marks will only be deducted if yours is far worse, as that's an indication that you probably didn't take any/many steps to improve things. 
<li><b>Clarity and Formatting - 5pts</b> Is it organized and can I read it?
    <ul>
    <li> <b>Note:</b> for this assignment, and in general, please get rid of my comments and replace them with your own. I'm going to read this, so all of these instructions aren't really required. Think of this as a template, get rid of the stuff that isn't needed, and leave only the things you need to explain your code. 
    </ul>
</ul>

For submission, please drop the URL for your repository in the dropbox.

In [12]:
name = "Tolulope Falaki"

show_eda = False

In [13]:

# Loading the training data
df = pd.read_csv('training.csv')

if 'id' in df.columns:
    df = df.drop(columns=["id"])

# Display a random sample of five rows
print(df.sample(5))


     target  var_1  var_2  var_3  var_4  var_5  var_6  var_7  var_8  var_9  \
144       1  0.926  0.657  0.726  0.675  0.547  0.206  0.094  0.773  0.897   
177       1  0.072  0.308  0.007  0.626  0.809  0.578  0.003  0.472  0.014   
11        0  0.716  0.534  0.857  0.493  0.963  0.852  0.975  0.176  0.048   
82        0  0.654  0.980  0.241  0.674  0.920  0.983  0.076  0.886  0.380   
51        1  0.944  0.821  0.079  0.581  0.439  0.210  0.544  0.536  0.771   

     ...  var_191  var_192  var_193  var_194  var_195  var_196  var_197  \
144  ...    0.945    0.627    0.611    0.005    0.475    0.050    0.561   
177  ...    0.944    0.792    0.130    0.911    0.867    0.043    0.570   
11   ...    0.348    0.794    0.487    0.749    0.630    0.919    0.448   
82   ...    0.747    0.250    0.659    0.115    0.379    0.630    0.939   
51   ...    0.801    0.876    0.917    0.429    0.672    0.548    0.298   

     var_198  var_199  var_200  
144    0.754    0.681    0.319  
177    0.261  

### Starting

For this assignment, you have a small training set, so combatting overfitting is key in being accurate!

In [14]:
df.shape

(250, 201)

#### Do Modelling Stuff

Make a tree model (of some variety) and make it fit well. Keep in mind the possibility of your tree overfitting, and think of steps you may need to combat that should it occur. 
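
One common way to combat overfitting in a forest is to limit how far each tree can grow and compare the result with cross-validation. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` with the same shape as the features here (not the assignment data), and the `max_depth`, `min_samples_leaf`, and `n_estimators` values are hypothetical, untuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 250 rows, 200 columns. This is NOT the assignment data.
X_demo, y_demo = make_classification(
    n_samples=250, n_features=200, n_informative=10, random_state=0
)

# Hypothetical settings chosen for illustration, not tuned values
candidates = {
    "default": RandomForestClassifier(random_state=42),
    "constrained": RandomForestClassifier(
        n_estimators=300, max_depth=4, min_samples_leaf=5, random_state=42
    ),
}

# Score each candidate with 5-fold cross-validated ROC AUC
results = {}
for label, clf in candidates.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="roc_auc")
    results[label] = scores.mean()
    print(label, round(results[label], 3))
```

Whether the constrained forest actually scores better depends on the data, so the comparison should be rerun on the real training set before choosing either configuration.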

In [16]:

# numeric_features was originally defined with DataFrame column names.
# The test cell, however, passes the features to the pipeline as a NumPy array,
# which has no column names, so the ColumnTransformer selects columns by
# position instead.

# Positional indices of the feature columns. df holds the target plus the
# features, so the feature matrix has df.shape[1] - 1 columns (indices 0..199).
numeric_features = list(range(df.shape[1] - 1))

# Define a pipeline for numerical feature transformation,
# which includes a SimpleImputer to fill in any missing values with the median of the column
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# Define the preprocessor as a ColumnTransformer,
# which applies the numeric_transformer pipeline to the numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ]
)
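
As a quick check of the reasoning above, the sketch below fits an index-based pipeline on a DataFrame and then predicts from both the DataFrame and its bare NumPy array. The frame, its column names, and the `demo_` variables are hypothetical stand-ins rather than the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Small random frame standing in for the training data (hypothetical values)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(
    rng.random((40, 5)), columns=[f"var_{i}" for i in range(1, 6)]
)
demo_y = rng.integers(0, 2, size=40)

# Select columns by position so the pipeline also accepts unnamed arrays
demo_pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", SimpleImputer(strategy="median"), list(range(demo_X.shape[1])))
        ]
    )),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=42)),
])
demo_pipe.fit(demo_X, demo_y)

# Predict from the DataFrame and from the bare array. scikit-learn may warn
# about missing feature names for the array; the predictions are unaffected.
preds_frame = demo_pipe.predict(demo_X)
preds_array = demo_pipe.predict(demo_X.to_numpy())
print(np.array_equal(preds_frame, preds_array))
```

Had the columns been selected by name, the call with the NumPy array would fail, because string column specifiers are only supported for DataFrames.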



In [17]:

# Model pipeline: preprocessing followed by a random forest classifier
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [18]:
from sklearn.model_selection import train_test_split

# Split features and target, holding out 20% of the rows for a final check.
# The pipeline in `model` defined above is reused as-is.
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



In [19]:
# Evaluate the pipeline with 5-fold cross-validation on the training split, scored by ROC AUC
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print("ROC AUC scores:", scores)
print("Average ROC AUC score:", scores.mean())


ROC AUC scores: [0.59875    0.635      0.5802005  0.60401003 0.71679198]
Average ROC AUC score: 0.6269505012531329


### Finishing

At the conclusion, please name your best model "best". If you look down below in the testing stuff, it should be usable to score as "best". 

You should be able to call it like this and it should work (with whatever data names you have)

In [20]:

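# Fit the pipeline on the training split and keep it as `best` for the test cell
best = model.fit(X_train, y_train)
# Accuracy on the held-out 20% split
print(best.score(X_test, y_test))
print(best)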

0.56
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                   10, 11, 12, 13, 14, 15, 16,
                                                   17, 18, 19, 20, 21, 22, 23,
                                                   24, 25, 26, 27, 28, 29, ...])])),
                ('classifier', RandomForestClassifier(random_state=42))])


### Testing

Please leave the stuff below as-is in your file. 

This will take your best model and score it with the test data. If you want to test to make sure that yours works, make a copy of the data file and rename it testing.csv, then make sure this runs ok. I will do the same, but the contents of my test file will be different. 

In [None]:
#Load Test Data
test_df = pd.read_csv("testing.csv")
test_df = test_df.drop(columns={"id"})
#Create tests and score
test_y = np.array(test_df["target"]).reshape(-1,1)
test_X = np.array(test_df.drop(columns={"target"}))

preds = best.predict(test_X)

roc_score = roc_auc_score(test_y, preds)
acc_score = accuracy_score(test_y, preds)

print(roc_score)
print(acc_score)
print(name, np.mean([roc_score, acc_score]))




### What Accuracy Changes Were Used

Please list here what you did to try to increase accuracy and/or limit overfitting:
<ul>
<li><b>Splitting the Data:</b> The data was split 80/20 into training and hold-out sets so the model could be evaluated on rows it never saw during fitting. The split does not prevent overfitting by itself, but it exposes it: a large gap between training and hold-out scores means the model has learned noise specific to the training set.
<li><b>Cross-Validation:</b> The model was evaluated with 5-fold cross-validation on the training split, scored by ROC AUC. Each fold is held out once as a validation set while the model is trained on the other four, so the averaged score is a steadier estimate of generalization than a single split, which matters with only 250 rows.
<li><b>Random Forest Classifier:</b> A RandomForestClassifier is an ensemble of decision trees, each trained on a bootstrap sample of the rows with a random subset of features considered at each split. Averaging the trees' predictions reduces variance, so the forest typically overfits less than a single decision tree.
<li><b>Imputing Missing Values:</b> SimpleImputer with the median strategy fills any missing values with the column median, so rows with gaps can still be used rather than discarded.
</ul>
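
The claim that a forest generalizes better than a single tree can be checked with repeated cross-validation. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption for illustration, not the assignment data); `RepeatedStratifiedKFold` is swapped in here in place of the plain 5-fold split used above to smooth out fold-to-fold variation on a small data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 250 rows, 200 columns. This is NOT the assignment data.
X_demo, y_demo = make_classification(
    n_samples=250, n_features=200, n_informative=10, random_state=0
)

# 5 folds repeated 3 times gives 15 scores per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
}

# Record mean and standard deviation of ROC AUC across all folds
summary = {}
for label, clf in models.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
    summary[label] = (scores.mean(), scores.std())
    print(f"{label}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation is worth reading alongside the mean: with this few rows, differences between models that are smaller than the fold-to-fold spread are not strong evidence either way.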