## Explanations for the numbered questions

1. We chose 'Learning_Disabilities' as our binary categorical response variable. Its values are 'Yes' and 'No'.

2. We chose 'Exam_Score' as our predictor variable. We expect a negative correlation: as Exam_Score increases, the likelihood of Learning_Disabilities being 'Yes' should decrease.

3. See code
4. See code

5. We initially ran the model with the default threshold of 0.5, then tested lower thresholds to see whether they would improve its ability to identify positive cases. Accuracy fell significantly at the lower thresholds, so we kept the default, at which the model reached an accuracy of roughly 89%. That figure looks strong, but the model had a TPR of 0.0 and a TNR of 1.0: it simply predicted 'No' for every case. This reflects how skewed the dataset is, since the model scores well on accuracy while making the same majority-class decision for all data points. Because the correlation between our predictor and response variable is weak (around -0.08), it makes sense that the model fell back on a blanket answer.
The dataset does not offer a strong binary categorical variable, and more data points with a learning disability would let us test the model more rigorously.
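
The threshold trade-off described in item 5 can be sketched on synthetic data. This is a minimal illustration, not our actual run: the sample size, the 10% positive rate, the score distribution, and the size of the shift between classes are all assumed values chosen to roughly mimic a weak negative relationship.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for the real data (all values below are assumptions):
# about 10% positives, and positives score slightly lower on average.
rng = np.random.default_rng(42)
n = 2000
y = (rng.random(n) < 0.10).astype(int)
score = rng.normal(loc=67.0, scale=4.0, size=n) - 1.0 * y
X = score.reshape(-1, 1)

model = LogisticRegression(max_iter=1000).fit(X, y)
proba = model.predict_proba(X)[:, 1]  # estimated P(label == 1)

# Re-threshold the same probabilities and compare the metrics.
results = {}
for threshold in (0.5, 0.2, 0.1):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = {
        "accuracy": accuracy_score(y, y_pred),
        "tpr": recall_score(y, y_pred),               # recall on class 1
        "tnr": recall_score(y, y_pred, pos_label=0),  # recall on class 0
    }
    print(threshold, results[threshold])
```

With a predictor this weak, the estimated probabilities stay close to the base rate, so a 0.5 cutoff is never reached and every case is labeled negative. Lowering the cutoff can only raise the TPR, but it does so by flagging many negatives as well.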

In [9]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, roc_curve, roc_auc_score, f1_score
from sklearn.ensemble import RandomForestClassifier

from scipy.stats import pointbiserialr
from sklearn.linear_model import LogisticRegression

complete_data = pd.read_csv("StudentPerformanceFactors.csv")

In [10]:
# First, reuse the code from the previous check-in to clean the data and get our updated data frame
# Although this modeling step will only use a subset of these rows, they are all included to create our training, validation, and test data sets
# These are the features we selected at the beginning of the project to use in our modeling

# 1) "Learning_Disabilities" is a binary feature, so we'll use this as our response variable
# 2) We'll be using "Exam_Score" as our predictor for this classification model

# Keep the column names in variables so they are easy to swap out later
predictor_variable = "Exam_Score"
response_variable = 'Learning_Disabilities'
main_features = [response_variable, predictor_variable]
# Take an explicit copy so the assignment below modifies an independent frame
data = complete_data[main_features].copy()

# Encode the response variable as binary: 'No' -> 0, 'Yes' -> 1
data[response_variable] = data[response_variable].map({'No': 0, 'Yes': 1})

# Check to see if data cleaning needed
nans_response = data[response_variable].isnull().sum()
print(f"NaNs in {response_variable}: {nans_response}")

nans_predictor = data[predictor_variable].isnull().sum()
print(f"NaNs in {predictor_variable}: {nans_predictor}")

infs_response = np.isinf(data[response_variable]).sum()
print(f"Infs in {response_variable}: {infs_response}")

infs_predictor = np.isinf(data[predictor_variable]).sum()
print(f"Infs in {predictor_variable}: {infs_predictor}")


NaNs in Learning_Disabilities: 0
NaNs in Exam_Score: 0
Infs in Learning_Disabilities: 0
Infs in Exam_Score: 0




In [3]:
# Using the same code from the previous check-in, we'll split the data into training, validation, and test sets
# Divide the new dataframe into 3 different data sets using a 60:20:20 split
# We chose 60:20:20 as opposed to 80:10:10 or somewhere in between to decrease the likelihood of overfitting, since the metrics used are potentially susceptible to overfitting


train_and_validation_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

train_df, validation_df = train_test_split(train_and_validation_df, test_size=0.25, random_state=42)

train_df.to_csv('Student_Performance_train.csv', index=False)
validation_df.to_csv('Student_Performance_validation.csv', index=False)
test_df.to_csv('Student_Performance_test.csv', index=False)
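
The two-step split works because 25% of the remaining 80% equals 20% of the full data. The sketch below checks that arithmetic on a made-up frame (the `demo` frame and its values are hypothetical). It also passes `stratify=`, which the cell above does not use; that is an optional variation that keeps the share of positive cases similar across the three sets, which may matter with a response this skewed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame with roughly 10% positive cases.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Exam_Score": rng.integers(55, 101, size=1000),
    "Learning_Disabilities": (rng.random(1000) < 0.10).astype(int),
})

# Step 1: hold out 20% of all rows as the test set.
rest, test = train_test_split(
    demo, test_size=0.2, random_state=42,
    stratify=demo["Learning_Disabilities"],
)
# Step 2: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(
    rest, test_size=0.25, random_state=42,
    stratify=rest["Learning_Disabilities"],
)

# Report the size and positive rate of each split.
for name, part in (("train", train), ("validation", val), ("test", test)):
    print(name, len(part), round(part["Learning_Disabilities"].mean(), 3))
```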

In [4]:
## calculate correlation between our two variables using the training set

corr_coef, _ = pointbiserialr(train_df[response_variable], train_df[predictor_variable])

print("Train Correlation:", corr_coef)

## calculate correlation between our two variables using the validation set

corr_coef, _ = pointbiserialr(validation_df[response_variable], validation_df[predictor_variable])

print("Validation Correlation:", corr_coef)

## calculate correlation between our two variables using the test set

corr_coef, _ = pointbiserialr(test_df[response_variable], test_df[predictor_variable])

print("Test Correlation:", corr_coef)

Train Correlation: -0.08261872057924036
Validation Correlation: -0.11811433308237178
Test Correlation: -0.053195325028898434
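
For context on what `pointbiserialr` reports: the point-biserial coefficient is the Pearson correlation between a 0/1 variable and a continuous one, and it can also be written in terms of the two group means. The sketch below checks both identities on synthetic data; the sample size, positive rate, and score distribution are assumptions, not values from our dataset.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Synthetic data (assumed values): ~10% positives with slightly lower scores.
rng = np.random.default_rng(1)
label = (rng.random(500) < 0.10).astype(int)
score = rng.normal(67.0, 4.0, size=500) - 1.0 * label

r_pb, p_value = pointbiserialr(label, score)
r_pearson, _ = pearsonr(label, score)

# Group-mean form: r = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2),
# where s_n is the population standard deviation (ddof=0) of all scores.
n1 = int((label == 1).sum())
n0 = int((label == 0).sum())
n = n1 + n0
m1 = score[label == 1].mean()
m0 = score[label == 0].mean()
r_manual = (m1 - m0) / score.std() * np.sqrt(n1 * n0 / n**2)

print(r_pb, r_pearson, r_manual)
```

The group-mean form shows why the coefficient stays small here: it scales with the gap between the two group means relative to the overall spread, and is further shrunk when one group is much smaller than the other.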


In [8]:

x_train = train_df[[predictor_variable]]
y_train = train_df[response_variable]

# Training the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

# Making predictions on the training data (these metrics measure fit, not generalization)
y_pred = rf_model.predict(x_train)

# Calculating the confusion matrix and other metrics
c_m = confusion_matrix(y_train, y_pred)
print("Confusion Matrix:")
print(c_m)

accuracy = accuracy_score(y_train, y_pred)
prediction_error = 1 - accuracy
true_positive_rate = recall_score(y_train, y_pred)  # TPR
true_negative_rate = recall_score(y_train, y_pred, pos_label=0)  # TNR
f1 = f1_score(y_train, y_pred)  # F1 Score

print(f"\nAccuracy: {accuracy}")
print(f"Prediction Error: {prediction_error}")
print(f"True Positive Rate (TPR): {true_positive_rate}")
print(f"True Negative Rate (TNR): {true_negative_rate}")
print(f"F1 Score: {f1}")

# 5-fold cross-validation on validation set
X_val = validation_df[[predictor_variable]]
y_val = validation_df[response_variable]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Calculate AUC, accuracy and F1 across the folds
auc_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='roc_auc')
accuracy_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='accuracy')
f1_scores = cross_val_score(rf_model, X_val, y_val, cv=cv, scoring='f1')

print("\nAUC Scores for each fold:", auc_scores)
print("Average AUC:", np.mean(auc_scores))
print("Accuracy Scores for each fold:", accuracy_scores)
print("Average Accuracy:", np.mean(accuracy_scores))
print("F1 Scores for each fold:", f1_scores)
print("Average F1:", np.mean(f1_scores))

# Plot the ROC curve
# Score the validation set with the model already fitted on the training data, so the curve reflects unseen data
y_val_proba = rf_model.predict_proba(X_val)[:, 1]  # Probability of the positive class

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
auc = roc_auc_score(y_val, y_val_proba)

fig = go.Figure()

fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'ROC Curve (AUC = {auc:.2f})'))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines', line=dict(dash='dash'), name='Random'))

fig.update_layout(
    title="Random Forest ROC Curve on Validation Set",
    xaxis_title="False Positive Rate",
    yaxis_title="True Positive Rate",
    showlegend=True
)

fig.show()


Confusion Matrix:
[[3525    1]
 [ 433    4]]


NameError: name 'f1_score' is not defined
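
The training-set metrics can still be read off the confusion matrix printed above, since scikit-learn lays it out with actual classes as rows and predicted classes as columns. The following is a small sketch that recomputes them by hand from those four counts; it does not rerun the model.

```python
import numpy as np

# Confusion matrix copied from the printed output above
# (rows = actual class, columns = predicted class).
c_m = np.array([[3525, 1],
                [433, 4]])
tn, fp, fn, tp = c_m.ravel()

accuracy = (tp + tn) / c_m.sum()   # share of all predictions that are correct
tpr = tp / (tp + fn)               # true positive rate (recall on class 1)
tnr = tn / (tn + fp)               # true negative rate (recall on class 0)
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall

print(f"Accuracy: {accuracy:.4f}")
print(f"TPR: {tpr:.4f}")
print(f"TNR: {tnr:.4f}")
print(f"F1: {f1:.4f}")
```

These counts give an accuracy of about 0.890, a TPR of about 0.009, a TNR of about 0.9997, and an F1 of about 0.018. Even on the data it was trained on, the random forest identifies only 4 of the 437 positive cases.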