## Here are some explanations relating to the numbered questions:

1. Our dataset is "StudentPerformanceFactors.csv". We chose 'Learning_Disabilities' as our binary categorical response variable; its values are 'Yes' and 'No'.

2. See the code; we got the following results:
    Confusion Matrix:
    [[3525    1]
     [ 433    4]]

    Accuracy: 0.8904870047943477
    Prediction Error: 0.10951299520565227
    True Positive Rate (TPR): 0.009153318077803204
    True Negative Rate (TNR): 0.9997163925127623
    F1 Score: 0.01809954751131222

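As you can see, the outcome is similar to last week's Check-In. On the training set, the model predicts almost everything to be negative: it flags only 5 of the 3,963 observations as positive and correctly identifies just 4 of the 437 actual positives (TPR of about 0.9%). Because roughly 89% of the observations are negative, this still produces 89% accuracy and a 99.97% TNR, but those numbers reflect the class imbalance rather than any real ability to detect learning disabilities.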
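As a cross-check, the metrics listed under question 2 can be recomputed by hand from the confusion matrix. This is a minimal sketch that assumes scikit-learn's layout (rows are true classes, columns are predicted classes, class 0 first); the variable names are illustrative and not taken from the notebook.

```python
import numpy as np

# Confusion matrix reported above; rows = true class, columns = predicted class
c_m = np.array([[3525, 1],
                [433, 4]])
tn, fp, fn, tp = c_m.ravel()

accuracy = (tp + tn) / c_m.sum()
tpr = tp / (tp + fn)                # recall for the positive class
tnr = tn / (tn + fp)                # recall for the negative class
f1 = 2 * tp / (2 * tp + fp + fn)

# Accuracy of a trivial rule that always predicts the negative class
majority_baseline = (tn + fp) / c_m.sum()

print(f"Accuracy: {accuracy:.4f}")
print(f"TPR: {tpr:.4f}  TNR: {tnr:.4f}  F1: {f1:.4f}")
print(f"Always-negative baseline accuracy: {majority_baseline:.4f}")
```

The baseline line is the useful comparison: always predicting 'No' would already be right for about 89% of these observations, so the model's accuracy is barely above that.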

3. See the code; we got the following results:
    AUC Scores for each fold: [0.48451198 0.6037405  0.66759988 0.55671313 0.60366838]
    Average AUC: 0.5832467748468251
    Accuracy Scores for each fold: [0.89056604 0.88301887 0.89393939 0.88257576 0.88636364]
    Average Accuracy: 0.8872927387078331

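Again, the metrics themselves are computed correctly, but they are misleading on data this imbalanced, which makes this dataset a poor choice for illustrating them. Accuracy hovers around the 89% mark in every fold, which matches the share of negatives in the data. The average AUC of 0.58 is only slightly better than random guessing (0.5), one fold falls below 0.5, and the F1 score is 0 in every fold.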
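To see how much the folds vary, the per-fold scores printed above can be summarized with a mean and a standard deviation. This is only a sketch: the arrays are copied from the output above, and `folds_below_chance` is an illustrative name.

```python
import numpy as np

# Per-fold scores copied from the cross-validation output above
auc_scores = np.array([0.48451198, 0.6037405, 0.66759988, 0.55671313, 0.60366838])
accuracy_scores = np.array([0.89056604, 0.88301887, 0.89393939, 0.88257576, 0.88636364])

for name, scores in [("AUC", auc_scores), ("Accuracy", accuracy_scores)]:
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std(ddof=1):.3f}, "
          f"min={scores.min():.3f}, max={scores.max():.3f}")

# Number of folds where the model ranks observations worse than chance
folds_below_chance = int((auc_scores < 0.5).sum())
print("Folds with AUC below 0.5:", folds_below_chance)
```

The accuracy scores are tightly clustered while the AUC scores spread much more widely, which is what one would expect when accuracy is driven by the class ratio rather than by the model.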


In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, roc_curve, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier

from scipy.stats import pointbiserialr

complete_data = pd.read_csv("StudentPerformanceFactors.csv")



In [2]:

# Column names are kept in variables so the predictor can be swapped easily
predictor_variable = "Exam_Score"
response_variable = 'Learning_Disabilities'
main_features = [response_variable, predictor_variable]
# .copy() gives an independent DataFrame, so the assignment below does not trigger SettingWithCopyWarning
data = complete_data[main_features].copy()

# Encode the response variable as binary: 'No' -> 0, 'Yes' -> 1
data[response_variable] = data[response_variable].map({'No': 0, 'Yes': 1})

# Check to see if data cleaning is needed
nans_response = data[response_variable].isnull().sum()
print(f"NaNs in {response_variable}: {nans_response}")

nans_predictor = data[predictor_variable].isnull().sum()
print(f"NaNs in {predictor_variable}: {nans_predictor}")

infs_response = np.isinf(data[response_variable]).sum()
print(f"Infs in {response_variable}: {infs_response}")

infs_predictor = np.isinf(data[predictor_variable]).sum()
print(f"Infs in {predictor_variable}: {infs_predictor}")


NaNs in Learning_Disabilities: 0
NaNs in Exam_Score: 0
Infs in Learning_Disabilities: 0
Infs in Exam_Score: 0


In [3]:
# Using the same code as the previous check-in, split the data into training, validation, and test sets
# with a 60:20:20 split
# We chose 60:20:20 over 80:10:10 (or something in between) to reduce the risk of overfitting, since the metrics used are potentially susceptible to it


train_and_validation_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

train_df, validation_df = train_test_split(train_and_validation_df, test_size=0.25, random_state=42)

train_df.to_csv('Student_Performance_train.csv', index=False)
validation_df.to_csv('Student_Performance_validation.csv', index=False)
test_df.to_csv('Student_Performance_test.csv', index=False)
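Because the positive class is rare, a purely random split can leave the three sets with slightly different class ratios. One option is the `stratify` argument of `train_test_split`, which keeps the ratio the same in every set. The sketch below uses a synthetic stand-in for the data (1,000 rows, 11% positives, both assumptions) and is not the split used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1,000 rows, 11% positives
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Exam_Score": rng.integers(55, 101, size=1000),
    "Learning_Disabilities": np.r_[np.ones(110, dtype=int), np.zeros(890, dtype=int)],
})

# Same 60:20:20 scheme as above, but stratified on the response
rest, test = train_test_split(
    demo, test_size=0.2, random_state=42, stratify=demo["Learning_Disabilities"]
)
train, validation = train_test_split(
    rest, test_size=0.25, random_state=42, stratify=rest["Learning_Disabilities"]
)

for name, part in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {len(part)} rows, positive rate {part['Learning_Disabilities'].mean():.3f}")
```

Switching the notebook to a stratified split would change which rows land in each set, so the numbers reported here would need to be regenerated.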

In [4]:
## calculate correlation between our two variables using the training set

corr_coef, _ = pointbiserialr(train_df[response_variable], train_df[predictor_variable])

print("Train Correlation:", corr_coef)

## calculate correlation between our two variables using the validation set

corr_coef, _ = pointbiserialr(validation_df[response_variable], validation_df[predictor_variable])

print("Validation Correlation:", corr_coef)

## calculate correlation between our two variables using the test set

corr_coef, _ = pointbiserialr(test_df[response_variable], test_df[predictor_variable])

print("Test Correlation:", corr_coef)

Train Correlation: -0.08261872057924033
Validation Correlation: -0.1181143330823718
Test Correlation: -0.05319532502889844
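All three correlations are weak (between about -0.05 and -0.12), which already suggests that exam score alone carries little information about the response. For reference, the point-biserial correlation is the Pearson correlation applied to one binary and one continuous variable. The sketch below checks this on synthetic data; the variables `group` and `score` are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Synthetic data (assumption): a binary group and a continuous score
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=200)
score = 67 - 1.5 * group + rng.normal(0, 4, size=200)

r_pb, p_pb = pointbiserialr(group, score)
r_pearson, p_pearson = pearsonr(group, score)

print(f"Point-biserial r: {r_pb:.6f}")
print(f"Pearson r:        {r_pearson:.6f}")
```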


In [5]:

x_train = train_df[[predictor_variable]]
y_train = train_df[response_variable]

# Training the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

# Making predictions on the training set, so the metrics below are in-sample
y_pred = rf_model.predict(x_train)

# Calculating the confusion matrix and other metrics
c_m = confusion_matrix(y_train, y_pred)
print("Confusion Matrix:")
print(c_m)

accuracy = accuracy_score(y_train, y_pred)
prediction_error = 1 - accuracy
true_positive_rate = recall_score(y_train, y_pred)  # TPR
true_negative_rate = recall_score(y_train, y_pred, pos_label=0)  # TNR
f1 = f1_score(y_train, y_pred)  # F1 Score

print(f"\nAccuracy: {accuracy}")
print(f"Prediction Error: {prediction_error}")
print(f"True Positive Rate (TPR): {true_positive_rate}")
print(f"True Negative Rate (TNR): {true_negative_rate}")
print(f"F1 Score: {f1}")

# 5-fold stratified cross-validation on the validation set (cross_val_score refits a clone of the model on each fold)
X_val = validation_df[[predictor_variable]]
y_val = validation_df[response_variable]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Calculate AUC, accuracy and F1 across the folds
auc_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='roc_auc')
accuracy_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='accuracy')
f1_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='f1')

print("\nAUC Scores for each fold:", auc_scores)
print("Average AUC:", np.mean(auc_scores))
print("Accuracy Scores for each fold:", accuracy_scores)
print("Average Accuracy:", np.mean(accuracy_scores))
print("F1 Scores for each fold:", f1_scores)
print("Average F1:", np.mean(f1_scores))

# Plot ROC curve
# Refit on the validation data for ROC plotting; the curve is in-sample and therefore optimistic
rf_model.fit(X_val, y_val)
y_val_proba = rf_model.predict_proba(X_val)[:, 1]  # Probability of the positive class

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
auc = roc_auc_score(y_val, y_val_proba)

fig = go.Figure()

fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'ROC Curve (AUC = {auc:.2f})'))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(dash='dash'), name='Random'))

fig.update_layout(
    title="Random Forest ROC Curve on Validation Set",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    showlegend=True
)

fig.show()


Confusion Matrix:
[[3525    1]
 [ 433    4]]

Accuracy: 0.8904870047943477
Prediction Error: 0.10951299520565227
True Positive Rate (TPR): 0.009153318077803204
True Negative Rate (TNR): 0.9997163925127623
F1 Score: 0.01809954751131222

AUC Scores for each fold: [0.48451198 0.6037405  0.66759988 0.55671313 0.60366838]
Average AUC: 0.5832467748468251
Accuracy Scores for each fold: [0.89056604 0.88301887 0.89393939 0.88257576 0.88636364]
Average Accuracy: 0.8872927387078331
F1 Scores for each fold: [0. 0. 0. 0. 0.]
Average F1: 0.0
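The plotted ROC curve comes from a model that was refit on the validation set and then scored on those same rows, so its AUC can look better than the cross-validated average. One way to get a curve that is consistent with the fold scores is to use out-of-fold probabilities from `cross_val_predict`. The sketch below assumes synthetic data with a single uninformative feature, so the two AUC values it prints are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data (assumption): one noise feature, about 11% positives
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 1))
y_demo = (rng.random(500) < 0.11).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# In-sample: fit and score on the same rows
in_sample_proba = model.fit(X_demo, y_demo).predict_proba(X_demo)[:, 1]

# Out-of-fold: each row is scored by a model that never saw it during fitting
oof_proba = cross_val_predict(model, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]

in_sample_auc = roc_auc_score(y_demo, in_sample_proba)
oof_auc = roc_auc_score(y_demo, oof_proba)

print(f"In-sample AUC:   {in_sample_auc:.3f}")
print(f"Out-of-fold AUC: {oof_auc:.3f}")
```

Passing `oof_proba` to `roc_curve` instead of the in-sample probabilities would give a curve that reflects performance on unseen rows.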
