# Data Mining Project: Predicting Student Behavioral Disruptions


## Library Requirements
Install the required libraries specified in the `requirements.txt` file. You can do this using pip:
```bash
pip install -r requirements.txt
```


## Table of Contents
1. [Introduction](#introduction)
2. [Data Loading and Initial Exploration](#data-loading)
3. [Exploratory Data Analysis (EDA)](#eda)
4. [Hypothesis Testing](#hypothesis-testing)
5. [Feature Engineering](#feature-engineering)
6. [Model Development](#model-development)
7. [Model Evaluation and Interpretation](#model-evaluation)
8. [Summary](#summary)


## 1. Introduction
This notebook documents our team's effort to predict and analyze student behavioral disruptions to minimize in-class interruptions.

**Team Members:**  Thomas Robertson, Claudia Nething, Revel Etheridge, Logan Bolton

**Customer:** Adam West

**Objectives:**
- Predict behavioral disruptions
- Identify anomalous patterns
- Provide clear interpretations


## 2. Data Loading and Initial Exploration

In [None]:
!pip install -r requirements.txt

# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
import scipy.stats as stats
from scipy.stats import f_oneway, ttest_ind, spearmanr, pearsonr
import scikit_posthocs as sp
import statsmodels.api as sm
from statsmodels.discrete.discrete_model import NegativeBinomial

# scikit-learn: preprocessing, pipelines, and model selection
from sklearn.base import is_classifier, is_regressor
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# scikit-learn: models
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
)
from sklearn.linear_model import (
    LogisticRegression, LinearRegression, PoissonRegressor,
    Ridge, Lasso, ElasticNet, HuberRegressor, RANSACRegressor
)
from sklearn.neural_network import MLPClassifier

# scikit-learn: evaluation
from sklearn.inspection import permutation_importance
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score,
    accuracy_score, mean_squared_error, r2_score, root_mean_squared_error,
    silhouette_score
)

# imbalanced-learn: resampling inside pipelines
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

# Load datasets
bus_conduct = pd.read_csv('TTU Data - Bus Conduct.csv')
bus_conduct_updated = pd.read_csv('TTU Data Update - Bus Conduct.csv')
family_engagement = pd.read_csv('TTU Data - Family Engagement.csv')
disciplinary_referral = pd.read_csv('TTU Data - Disciplinary Referral.csv')
disciplinary_referral_updated = pd.read_csv('TTU Data Update - Disciplinary Referral.csv')
weather_df = pd.read_csv('weather.csv')

# Display the first few rows of each dataset for inspection
print("Bus Conduct Dataset:")
display(bus_conduct.head())
print("Bus Conduct Updated Dataset:")
display(bus_conduct_updated.head())

print("Family Engagement Dataset:")
display(family_engagement.head())

print("Disciplinary Referral Dataset:")
display(disciplinary_referral.head())

print("Disciplinary Referral Updated Dataset:")
display(disciplinary_referral_updated.head())


## 3. Exploratory Data Analysis (EDA)

In [None]:

# Combine and clean datasets
disciplinary_referral_all = pd.concat([disciplinary_referral, disciplinary_referral_updated], ignore_index=True).drop_duplicates()
bus_conduct_all = pd.concat([bus_conduct, bus_conduct_updated], ignore_index=True).drop_duplicates()
# Convert date columns to datetime format
disciplinary_referral_all['Date of Incident'] = pd.to_datetime(disciplinary_referral_all['Date of Incident'], errors='coerce')
bus_conduct_all['Date of Incident'] = pd.to_datetime(bus_conduct_all['Date of Incident'], errors='coerce')
# Save combined datasets to CSV files
disciplinary_referral_all.to_csv("combined_disciplinary_referrals.csv", index=False)
bus_conduct_all.to_csv("combined_bus_conduct.csv", index=False)
# Check for missing values in the combined datasets
print("Missing values in Combined Bus Conduct Data:")
print(bus_conduct_all.isnull().sum())
print("\nMissing values in Combined Disciplinary Referral Data:")
print(disciplinary_referral_all.isnull().sum())


In [None]:
# Add a 'Month' column for monthly analysis
disciplinary_referral_all['Month'] = disciplinary_referral_all['Date of Incident'].dt.month

# Group by month and count referrals
monthly_referrals = disciplinary_referral_all.groupby('Month').size()

# Plot monthly referral counts
plt.figure(figsize=(10, 5))
sns.barplot(x=monthly_referrals.index, y=monthly_referrals.values)
plt.title('Monthly Disciplinary Referrals (Updated Dataset)')
plt.xlabel('Month')
plt.ylabel('Number of Referrals')
plt.show()

In [None]:

# Identify the top 10 students with the most referrals
frequent_students = disciplinary_referral_all['Student Identifier'].value_counts().head(10)

# Plot the top 10 students by referral count
plt.figure(figsize=(10, 5))
sns.barplot(y=frequent_students.index, x=frequent_students.values, orient='h')
plt.title('Top 10 Students by Number of Referrals (Updated Dataset)')
plt.xlabel('Number of Referrals')
plt.ylabel('Student Identifier')
plt.show()


In [None]:
# Function to categorize time of day
def categorize_time(time_str):
    if pd.isna(time_str): return "Unknown"
    time_str = time_str.lower().strip()
    if "before school" in time_str: return "Before School"
    elif any(t in time_str for t in ["8:00am", "9:00am", "10:00am", "11:00am"]): return "Morning"
    elif any(t in time_str for t in ["12:00pm", "1:00pm"]): return "Early Afternoon"
    elif any(t in time_str for t in ["2:00pm", "3:00pm"]): return "Late Afternoon"
    elif "after school" in time_str: return "After School"
    else: return "Other"

# Define the order of time categories
time_order = ["Before School", "Morning", "Early Afternoon", "Late Afternoon", "After School", "Other"]

# Apply time categorization to the dataset
disciplinary_referral_all["Time_Category"] = disciplinary_referral_all["Time of the Day the behavior occurred?"].apply(categorize_time)

# Set ordered categories for plotting
disciplinary_referral_all["Time_Category"] = pd.Categorical(
    disciplinary_referral_all["Time_Category"],
    categories=time_order,
    ordered=True
)

# Group data by grade level and time category (integer counts so the heatmap's fmt='d' works)
grouped = disciplinary_referral_all.groupby(["Grade_Level", "Time_Category"], observed=False).size().unstack(fill_value=0).astype(int)

# Sort grades for display
grouped = grouped.sort_index()

# Plot a heatmap of referrals by grade level and time of day
plt.figure(figsize=(12, 6))
sns.heatmap(grouped, annot=True, fmt='d', cmap="YlGnBu")
plt.title("Disciplinary Referrals by Grade Level and Time of Day (Cleaned)")
plt.xlabel("Time of Day")
plt.ylabel("Grade Level")
plt.tight_layout()
plt.show()

In [None]:
# Convert weather data datetime column to datetime format
weather_df['datetime'] = pd.to_datetime(weather_df['datetime'], errors='coerce')

# Group referrals by date and count
referrals_per_day = disciplinary_referral_all.groupby('Date of Incident').size().reset_index(name='referral_count')

# Merge referral counts with weather data
merged_df = pd.merge(referrals_per_day, weather_df, how='inner', left_on='Date of Incident', right_on='datetime')
merged_df.drop(columns=['datetime'], inplace=True)

# Categorize temperature into bins (open-ended so readings below 0°F or above 100°F are not dropped)
bins = [-np.inf, 50, 70, np.inf]
labels = ['Cold (<50°F)', 'Mild (50-70°F)', 'Hot (≥70°F)']
merged_df['Temp_Category'] = pd.cut(merged_df['temp'], bins=bins, labels=labels, right=False)

# Calculate average referrals by temperature category
binned_referrals = merged_df.groupby('Temp_Category')['referral_count'].mean().reset_index()

# Plot average referrals by temperature category
plt.figure(figsize=(8, 5))
sns.barplot(data=binned_referrals, x='Temp_Category', y='referral_count', palette='coolwarm')
plt.title('Average Referrals by Temperature (Updated)')
plt.ylabel('Average Number of Referrals')
plt.xlabel('Temperature Range')
plt.grid(True, axis='y')
plt.tight_layout()
plt.show()

## 3.5 Expanded EDA: Additional Insights and Correlations

### Referrals by School

In [None]:
# Analyze referrals by school or staff
staff_referral_counts = disciplinary_referral_all['Please select your school'].value_counts().head(10)

# Plot the top 10 schools by referral count
plt.figure(figsize=(10, 5))
sns.barplot(y=staff_referral_counts.index, x=staff_referral_counts.values, orient='h')
plt.title("Top 10 Schools by Referral Count")
plt.xlabel("Referrals Issued")
plt.ylabel("School")
plt.tight_layout()
plt.show()

### Weather and Referral Type (e.g., Fighting)

In [None]:
# Analyze weather and referral type (e.g., fighting-related referrals)
fighting_referrals = disciplinary_referral_all[
    disciplinary_referral_all['Select the Major Referral'].str.contains('fight', na=False, case=False)
]

# Group fighting-related referrals by date
fight_days = fighting_referrals.groupby('Date of Incident').size().reset_index(name='fight_referrals')

# Merge fighting-related referrals with weather data
weather_fights = pd.merge(fight_days, weather_df, left_on='Date of Incident', right_on='datetime', how='inner')

# Plot daily fight referrals vs. temperature
plt.figure(figsize=(10, 5))
sns.scatterplot(data=weather_fights, x='temp', y='fight_referrals')
plt.title("Daily Fight Referrals vs. Temperature")
plt.xlabel("Temperature")
plt.ylabel("Number of Fight-Related Referrals")
plt.grid(True)
plt.tight_layout()
plt.show()

### Bus Conduct vs. Classroom Referrals

In [None]:
# Analyze the relationship between bus conduct incidents and classroom referrals
bus_counts = bus_conduct_all['Student Identifier'].value_counts().reset_index()
bus_counts.columns = ['Student Identifier', 'Bus_Incidents']

referral_counts = disciplinary_referral_all['Student Identifier'].value_counts().reset_index()
referral_counts.columns = ['Student Identifier', 'Referrals']

# Merge bus conduct and referral data
merged_behavior = pd.merge(bus_counts, referral_counts, on='Student Identifier', how='outer').fillna(0)

# Plot referrals vs. bus conduct incidents
plt.figure(figsize=(8, 6))
sns.scatterplot(data=merged_behavior, x='Bus_Incidents', y='Referrals')
plt.title('Referrals vs. Bus Conduct Incidents')
plt.xlabel('Bus Conduct Incidents')
plt.ylabel('Referrals')
plt.grid(True)
plt.tight_layout()
plt.show()

### Weekday Trends in Referrals

In [None]:
# Analyze weekday trends in referrals
disciplinary_referral_all['Weekday'] = disciplinary_referral_all['Date of Incident'].dt.day_name()

# Define the order of weekdays for plotting
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Count referrals by weekday
weekday_counts = disciplinary_referral_all['Weekday'].value_counts().reindex(weekday_order)

# Plot referrals by day of the week
plt.figure(figsize=(8, 4))
sns.barplot(x=weekday_counts.index, y=weekday_counts.values)
plt.title('Referrals by Day of the Week')
plt.ylabel('Number of Referrals')
plt.xlabel('Day')
plt.tight_layout()
plt.show()

## 4. Hypothesis Testing
This section explores several data-driven hypotheses relevant to predicting and minimizing in-class behavioral disruptions. The following tests were conducted using statistical methods such as t-tests, ANOVA, and correlation analyses.

---

### H1: Referral Frequency Increases Near Testing Season

**Hypothesis:** Certain months have significantly higher referral counts due to testing-related stress.

**Test Type:** One-way ANOVA  

**Rationale:** Compare monthly referral averages across months.

In [None]:
# Drop NaNs and extract month
monthly_ref = disciplinary_referral_all.dropna(subset=['Date of Incident']).copy()
monthly_ref['Month'] = monthly_ref['Date of Incident'].dt.month

# Calculate number of referrals per student per month
monthly_student_referrals = (
    monthly_ref.groupby(['Month', 'Student Identifier'])
    .size()
    .reset_index(name='Referral Count')
)

# Create a list of referral counts per month for ANOVA, filter groups with more than 1 value and variance > 0
monthly_groups = [
    group['Referral Count'].values
    for _, group in monthly_student_referrals.groupby('Month')
    if len(group) > 1 and group['Referral Count'].var() > 0
]

# Perform ANOVA if valid groups exist
if len(monthly_groups) >= 2:
    f_stat, p_value = f_oneway(*monthly_groups)
    print(f"F-statistic: {f_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    if not pd.isna(f_stat) and p_value < 0.05:
        print("Significant differences exist between monthly referral counts.")
    elif not pd.isna(f_stat):
        print("No significant difference between months.")
    else:
        print("ANOVA returned NaN. Check your data again for consistency.")
else:
    print("Not enough valid monthly groups for ANOVA.")



---

### H2: Bus Misconduct is Associated with More In-Class Referrals

**Hypothesis:** Students with bus conduct incidents have significantly more in-class referrals than those without.

**Test Type:** Welch’s t-test (independent two-sample t-test)  

**Rationale:** Compare mean referral counts between two groups (bus incident vs. no bus incident).

In [None]:
# Total referrals per student
referral_counts = disciplinary_referral_all['Student Identifier'].value_counts().reset_index()
referral_counts.columns = ['Student Identifier', 'Total_Referrals']

# Total bus incidents per student
bus_counts = bus_conduct_all['Student Identifier'].value_counts().reset_index()
bus_counts.columns = ['Student Identifier', 'Bus_Incidents']

# Merge datasets
behavior_merge = pd.merge(referral_counts, bus_counts, on='Student Identifier', how='outer').fillna(0)

# Create two groups
bus_yes = behavior_merge[behavior_merge['Bus_Incidents'] > 0]['Total_Referrals']
bus_no = behavior_merge[behavior_merge['Bus_Incidents'] == 0]['Total_Referrals']

# Run t-test
t_stat, p_value = ttest_ind(bus_yes, bus_no, equal_var=False)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference found.")
else:
    print("No significant difference.")
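
Because H2 is directional (students with bus incidents are hypothesized to have *more* referrals), a one-sided Welch test is a natural complement to the two-sided test above. The sketch below is illustrative only: `bus_yes_demo` and `bus_no_demo` are made-up arrays, not project data, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical referral counts, for illustration only (not project data)
bus_yes_demo = np.array([0, 1, 0, 2, 1, 0, 1])
bus_no_demo = np.array([2, 3, 1, 4, 2, 3, 2, 1])

# Two-sided Welch test: is there any difference in means?
t_two, p_two = ttest_ind(bus_yes_demo, bus_no_demo, equal_var=False)

# One-sided Welch test: is the bus-incident mean GREATER, as H2 states?
t_one, p_one = ttest_ind(
    bus_yes_demo, bus_no_demo, equal_var=False, alternative='greater'
)

print(f"Two-sided:           t = {t_two:.3f}, p = {p_two:.4f}")
print(f"One-sided (greater): t = {t_one:.3f}, p = {p_one:.4f}")
```

The sign of the t-statistic follows the argument order (first group minus second). When it is negative, the one-sided p-value for `'greater'` is large, so a significant two-sided result with a negative statistic does not support the hypothesis as stated.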

---

### H3: Family Engagement Negatively Correlates with Referrals

**Hypothesis:** Higher family engagement is associated with fewer referrals.

**Test Type:** Spearman Correlation 
 
**Rationale:** Non-parametric test of ordinal survey response counts vs. referral totals.

In [None]:
# Clean columns and estimate total referrals per school
referrals_by_school = disciplinary_referral_all['Please select your school'].value_counts().reset_index()
referrals_by_school.columns = ['School', 'Total_Referrals']

# Survey counts per school
engagement_by_school = family_engagement['Please check which school your child/children attends.'].value_counts().reset_index()
engagement_by_school.columns = ['School', 'Engagement_Survey_Responses']

school_merge = pd.merge(referrals_by_school, engagement_by_school, on='School', how='inner')

# Spearman correlation
corr, p_value = spearmanr(school_merge['Engagement_Survey_Responses'], school_merge['Total_Referrals'])
print(f"Spearman correlation: {corr:.4f}\nP-value: {p_value:.4f}")
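if p_value < 0.05:
    # Report the direction of the relationship, not just its significance
    direction = "negative" if corr < 0 else "positive"
    print(f"Significant {direction} correlation.")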
else:
    print("No significant correlation found.")


---

### H4: Total Referral Count Differs by Grade Level

**Hypothesis:** Certain grades have significantly more referrals.

**Test Type:** One-way ANOVA  

**Rationale:** Compare referral frequency across grades.


In [None]:
# Ensure Grade_Level is numeric and drop rows with missing Grade_Level
disciplinary_referral_all['Grade_Level'] = pd.to_numeric(disciplinary_referral_all['Grade_Level'], errors='coerce')
ref_by_grade = disciplinary_referral_all.dropna(subset=['Grade_Level'])

# Calculate number of referrals per student per grade
grade_student_referrals = (
    ref_by_grade.groupby(['Grade_Level', 'Student Identifier'])
    .size()
    .reset_index(name='Referral Count')
)

# Create a list of referral counts per grade for ANOVA, filtering for groups with variance > 0
grade_groups = [
    group['Referral Count'].values
    for _, group in grade_student_referrals.groupby('Grade_Level')
    if len(group) > 1 and group['Referral Count'].var() > 0
]

# Perform ANOVA if there are at least two valid grade groups
if len(grade_groups) >= 2:
    f_stat, p_value = f_oneway(*grade_groups)
    print(f"F-statistic: {f_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    if not pd.isna(f_stat) and p_value < 0.05:
        print("Referral rates differ significantly across grades.")
    elif not pd.isna(f_stat):
        print("No significant difference in referrals between grades.")
    else:
        print("ANOVA returned NaN. Check your data for consistency.")
else:
    print("Not enough valid grade groups for ANOVA.")



---

### H5: Referral Volume Correlates with Weather Factors

**Hypothesis:** Temperature and humidity levels are associated with referral counts.

**Test Type:** Pearson Correlation  

**Rationale:** Compare numeric weather features against referral count per day.


In [None]:
# Prepare merged dataset
weather_df['datetime'] = pd.to_datetime(weather_df['datetime'], errors='coerce')
daily_ref = disciplinary_referral_all.groupby('Date of Incident').size().reset_index(name='referral_count')
daily_weather = pd.merge(daily_ref, weather_df, left_on='Date of Incident', right_on='datetime', how='inner')

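# Pearson correlations (drop rows where either value is missing, since pearsonr cannot handle NaN)
for var in ['temp', 'humidity']:
    valid = daily_weather[['referral_count', var]].dropna()
    corr, p = pearsonr(valid['referral_count'], valid[var])
    print(f"Pearson correlation between referrals and {var}: {corr:.4f} (p = {p:.4f})")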


---

### H6: Referral Volume Correlates with Weather Factors (Part 2)

**Hypothesis:** Referral counts are associated with temperature and rise immediately before and after school breaks, when students are restless for time off.

**Test Type:** Dunn's Post-Hoc Test

**Rationale:** Compare temperature ranges and school-break periods against the number of referrals.

In [None]:
merged_df['Date of Incident'] = pd.to_datetime(merged_df['Date of Incident'])

# aggregate referral count and temperature per day
daily_data = merged_df.groupby('Date of Incident').agg({
    'referral_count': 'sum',  
    'temp': 'mean'           
}).reset_index()

# define school breaks with colors
breaks = {
    'Thanksgiving Break': ('2024-11-26', '2024-11-29', 'orange'),
    'Fall Break': ('2024-10-12', '2024-10-17', 'red'),
    'Winter Break': ('2024-12-22', '2025-01-07', 'blue'),
    'Spring Break': ('2024-03-30', '2024-04-04', 'green'),
    'Summer Break': ('2024-05-18', '2024-08-08', 'purple'),
}

# categorize each day
def label_day(date):
    for break_name, (start_str, end_str, _) in breaks.items():
        start = pd.to_datetime(start_str)
        end = pd.to_datetime(end_str)
        if start - pd.Timedelta(days=5) <= date < start:
            return "Before Break"
        elif start <= date <= end:
            return "During Break"
        elif end < date <= end + pd.Timedelta(days=5):
            return "After Break"
    return "Regular Day"

# Apply labels
daily_data['Break Period'] = daily_data['Date of Incident'].apply(label_day)

# Perform Dunn's post-hoc test with Bonferroni correction
dunn_results = sp.posthoc_dunn(
    daily_data, 
    val_col='referral_count', 
    group_col='Break Period', 
    p_adjust='bonferroni'
)

# Show results
print("\nDunn's Post-Hoc Test Results (p-values):")
print(dunn_results)

fig, ax1 = plt.subplots(figsize=(12, 6))

# plot referrals
ax1.plot(daily_data['Date of Incident'], daily_data['referral_count'], label='Referrals', color='black')
ax1.set_xlabel('Date')
ax1.set_ylabel('Referral Count', color='black')
ax1.tick_params(axis='y', labelcolor='black')

# highlight break periods
for name, (start, end, color) in breaks.items():
    ax1.axvspan(pd.to_datetime(start), pd.to_datetime(end), color=color, alpha=0.3, label=name)

# temperature on secondary y-axis
ax2 = ax1.twinx()
ax2.plot(daily_data['Date of Incident'], daily_data['temp'], label='Temperature', color='red')
ax2.set_ylabel('Temperature (°F)', color='red')
ax2.tick_params(axis='y', labelcolor='red')

# combine legends
lines1, labels1 = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')

plt.title('Daily Referrals and Temperature with School Breaks Highlighted')
plt.tight_layout()
plt.show()
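
Dunn's test is a post-hoc procedure that is usually run after an omnibus Kruskal-Wallis test has rejected the hypothesis that all groups share the same distribution. The cell above goes straight to the pairwise comparisons, so the sketch below shows what that omnibus step could look like with `scipy.stats.kruskal`. The `demo` frame is a made-up stand-in shaped like `daily_data`; its values are hypothetical.

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical stand-in for daily_data (values are made up for illustration)
demo = pd.DataFrame({
    'Break Period': ['Regular Day'] * 6 + ['Before Break'] * 4 + ['After Break'] * 4,
    'referral_count': [5, 7, 6, 8, 5, 9, 10, 12, 9, 11, 6, 7, 8, 6],
})

# One array of daily counts per break-period label
samples = [grp['referral_count'].to_numpy() for _, grp in demo.groupby('Break Period')]

# Omnibus test: do the groups come from the same distribution?
h_stat, p_omnibus = kruskal(*samples)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_omnibus:.4f}")

# Only interpret pairwise Dunn p-values if the omnibus test rejects
run_posthoc = p_omnibus < 0.05
print("Proceed to Dunn's post-hoc test:", run_posthoc)
```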


## Hypothesis Testing Summary and Interpretation

This section summarizes the results of six statistical tests aimed at identifying significant patterns and relationships related to student behavioral referrals. These insights support the ultimate goal of targeting interventions and informing stakeholder decision-making.

---

### **H1: Referral Frequency Increases Near Testing Season**
- **F-statistic:** `3.4566`  
- **P-value:** `0.0004`  
- **Result:** *Statistically significant: referral counts differ across months.*

**Interpretation:**  
Mean per-student referral counts differ significantly across months. The ANOVA shows only that at least one month differs from the others; it does not identify which months differ or why. The earlier monthly bar chart indicates where the peaks fall, and academic stress around testing periods is a plausible but untested explanation.

**Implication:**  
A follow-up study with more balanced month-by-month data could explore whether academic testing stress correlates with behavioral disruptions.
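
A significant F-statistic says that the monthly means differ, not how much of the variation the month accounts for. One common effect-size measure for one-way ANOVA is eta squared, $\eta^2 = SS_\text{between} / SS_\text{total}$. The sketch below computes it on made-up groups; `eta_squared` and `demo_groups` are hypothetical names, and in the notebook the same function could be applied to `monthly_groups`.

```python
import numpy as np
from scipy.stats import f_oneway

def eta_squared(groups):
    """Return SS_between / SS_total for a list of 1-D samples."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical per-student referral counts for three months (illustration only)
demo_groups = [
    np.array([1, 2, 1, 3, 1]),
    np.array([2, 3, 2, 4, 2]),
    np.array([1, 1, 2, 1, 1]),
]

f_stat, p_value = f_oneway(*demo_groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, eta^2 = {eta_squared(demo_groups):.3f}")
```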

---

### **H2: Bus Misconduct is Associated with More In-Class Referrals**
- **T-statistic:** `-4.9906`  
- **P-value:** `< 0.0001`  
- **Result:** *Statistically significant difference found.*

**Interpretation:**  
The two groups differ significantly, but in the opposite direction from the hypothesis. The t-statistic is computed as the bus-incident group minus the no-incident group, so a negative value means that students with bus incidents averaged *fewer* in-class referrals. This is a difference in group means, not a correlation. Part of the gap likely reflects how the groups were built: after the outer merge, the no-incident group contains only students who appear in the referral data and therefore have at least one referral, while the bus-incident group includes students with zero referrals. A confounding factor may also be at work, where students with more bus write-ups are removed from the classroom more often, which would reduce their in-class referral counts.

**Implication:**  
Bus misconduct remains a candidate predictor of general behavioral issues, but this test does not show that it accompanies more in-class referrals. Students flagged for bus incidents may still benefit from preemptive behavioral interventions or monitoring in the classroom.
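
How the two groups are built matters when reading this result. The sketch below replays the H2 merge pattern on made-up student IDs (all values are hypothetical). It shows that with an outer merge of referral counts and bus counts, every student in the no-bus-incident group has at least one referral, while the bus-incident group can include students with zero referrals.

```python
import pandas as pd

# Hypothetical per-student counts (IDs and values are made up)
referral_counts = pd.DataFrame({'Student Identifier': ['A', 'B', 'C'],
                                'Total_Referrals': [3, 1, 2]})
bus_counts = pd.DataFrame({'Student Identifier': ['C', 'D', 'E'],
                           'Bus_Incidents': [1, 2, 1]})

# Same merge pattern as the H2 cell
merged = pd.merge(referral_counts, bus_counts,
                  on='Student Identifier', how='outer').fillna(0)

bus_yes = merged.loc[merged['Bus_Incidents'] > 0, 'Total_Referrals']
bus_no = merged.loc[merged['Bus_Incidents'] == 0, 'Total_Referrals']

# Students D and E enter bus_yes with 0 referrals; nobody enters bus_no with 0
print("Minimum referrals, bus group:   ", bus_yes.min())
print("Minimum referrals, no-bus group:", bus_no.min())
```

A fairer comparison would need a roster of all enrolled students, so that students with neither a referral nor a bus incident can be counted in the no-incident group.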

---

### **H3: Family Engagement Negatively Correlates with Referrals**
- **Spearman correlation:** `-0.8000`  
- **P-value:** `0.2000`  
- **Result:** *Not statistically significant.*

**Interpretation:**  
A strong negative correlation was observed, suggesting that higher engagement corresponds with fewer referrals, but the p-value was above the 0.05 threshold, so the result is not statistically significant. This is most likely due to the small sample size: the data are aggregated to the school level, which leaves only a few schools to compare. Engagement is also measured here by the number of survey responses per school, which is only a proxy. Family engagement has historically been linked to improved student behavior, but this dataset does not provide strong enough evidence to confirm that relationship at the school level. This hypothesis would benefit from further investigation with larger datasets or student-level data.

**Implication:**  
There is potential evidence that increased family involvement could reduce behavior issues. Larger or student-level datasets may yield stronger conclusions.

---

### **H4: Referral Count Differs by Grade Level**
- **F-statistic:** `1.7691`  
- **P-value:** `0.0523`  
- **Result:** *No significant difference in referrals between grades.*

**Interpretation:**  
The p-value is slightly above the 0.05 threshold, so although referral counts vary somewhat across grades, the variation is not statistically significant. Grade level alone does not explain the differences in referral counts; other factors are likely contributing.

**Implication:**  
Grade level alone is not a strong predictor of referral frequency. Future analyses may focus on other demographic factors or on specific behaviors instead. Time of day, for example, showed distinct differences in write-ups in the heatmap shown earlier. Behavioral issues more likely arise from a combination of factors.

A different approach that handles high variance (for example, many students with 0 referrals) may be needed to identify grade-level trends or behaviors that total referral counts alone do not reveal.

---

### **H5: Weather Correlation with Referral Volume**
- **Temperature Correlation (Pearson):** `0.1636` (p = `0.0455`)  
- **Humidity Correlation (Pearson):** `0.1566` (p = `0.0557`)

**Interpretation:**  
There is a **weak but statistically significant** positive correlation between **temperature** and **referral volume**, indicating referrals tend to rise on warmer days. The correlation with humidity is borderline and not statistically significant at α = 0.05.

**Implication:**  
Environmental stressors such as heat may contribute to behavioral issues. Higher temperatures are often considered agitating and could contribute to aggressive or inappropriate behaviors in class. Schools could explore better temperature control or adjust class schedules during high-heat periods.
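
H5 runs two correlation tests (temperature and humidity) on the same referral counts, so the temperature result is worth re-reading under a multiple-comparison correction. The short sketch below applies a Bonferroni adjustment to the two p-values reported above. Treating these two tests as one family is an assumption made here for illustration, not part of the original analysis.

```python
# p-values as reported in the H5 results above
reported_p = {'temp': 0.0455, 'humidity': 0.0557}

# Bonferroni: multiply each p-value by the number of tests, capped at 1
m = len(reported_p)
adjusted = {name: min(p * m, 1.0) for name, p in reported_p.items()}

for name, p_adj in adjusted.items():
    print(f"{name}: Bonferroni-adjusted p = {p_adj:.4f}, "
          f"significant at 0.05: {p_adj < 0.05}")
```

Under this adjustment neither correlation clears α = 0.05, which is one more reason to treat temperature as a tentative risk factor.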

---

### **H6: Weather Correlation with Referral Volume (Part 2)**
- **During Break vs Regular Day (Dunn's Post-Hoc):** p = `0.015`
- **All Others:** p > `0.05`

**Interpretation:**  
Daily referral counts drop to near zero during major school breaks, as expected. Spikes in referral counts often appear in the days leading up to breaks, particularly before Spring, Summer, and Winter breaks, and again shortly after students return. This pattern suggests increased behavioral incidents during transition periods. However, the only statistically significant comparison was during-break versus regular days; the before-break and after-break differences were not significant. There is also a visible tendency toward higher referral counts on hotter days, indicating a possible positive relationship between temperature and disciplinary issues.

**Implication:**  
Behavioral challenges may intensify during periods of anticipation or adjustment around breaks, highlighting the importance of targeted support during these windows. Schools could proactively implement classroom management strategies or behavioral reinforcement before breaks and immediately upon return. Moreover, the temperature-referral trend suggests that environmental stressors like heat may exacerbate behavioral issues, warranting climate-aware interventions during warmer periods.

---

### Overall Recommendations:

- **Focus future models on bus conduct as a predictive feature.**
- **Consider temperature as a situational risk factor.**
- **Explore family engagement strategies to reduce referrals.**


## Initial Recommendations for School Stakeholders

Based on data-driven hypothesis testing and exploratory analysis, the following recommendations are proposed to help school administrators reduce behavioral disruptions and improve classroom environments:

---

### 1. Monitor Students with Bus Conduct Incidents
> **Why:** Students with bus conduct violations had *statistically significantly lower* classroom referral counts than students without them. This is counterintuitive, but bus incidents may still be a useful indicator of overall behavioral issues: students with bus write-ups are likely removed from the classroom more often, which would reduce their in-class referral counts.

**Recommendations:**
- Flag students with bus write-ups for early behavioral intervention or counseling.
- Integrate bus conduct records into early warning systems.
- Train bus drivers to identify and report behavioral issues that may carry over into the classroom.
- Perform early intervention strategies for students with bus conduct issues to prevent escalation.

---

### 2. Plan for Heat-Related Behavior Increases
> **Why:** Referral counts showed a *weak but statistically significant positive correlation* with temperature.

**Recommendations:**
- Improve classroom cooling access and hydration breaks during hot weather.
- Train teachers in managing heat-induced student irritability.
- Monitor referrals during heatwaves and adjust scheduling if needed.

---

### 3. Use Family Engagement as a Soft Predictor
> **Why:** A strong negative (though not statistically significant) correlation was observed between family engagement and referrals.

**Recommendations:**
- Encourage increased parental participation in school events and surveys.
- Use engagement metrics to target school-specific outreach strategies.
- Offer incentives for family involvement in education and discipline policies.

---

### 4. Prepare for Behavioral Peaks Around School Breaks or Testing Seasons
> **Why:** Referral counts appeared to spike before and after school breaks, and they differed significantly across months, which may reflect testing periods.

**Recommendations:**
- Implement proactive classroom management strategies before breaks.
- Schedule additional counseling or support sessions during these times.
- Analyze referral patterns around testing seasons to identify stress-related behaviors.

---

### 5. Prioritize Multivariate Data Collection for Risk Prediction
> **Why:** Bus behavior, weather, and family engagement all show predictive potential.

**Recommendations:**
- Initial results show correlation between multiple factors regarding referral counts.
- Collect more data on student behavior, including bus conduct, weather conditions, and family engagement metrics.
- Use multivariate models to predict referral risk based on these factors.

---

These recommendations are intended to guide practical changes and inform predictive modeling efforts that follow in subsequent sections.


## 5. Feature Engineering
In this section we will create a student-week model dataset that aggregates student behavior data on a weekly basis. This will help us analyze trends and patterns in student behavior over time.
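
As a small illustration of the weekly bucketing this section relies on, the sketch below maps incident dates to the start of their week and counts incidents per student per week. The `incidents` frame is made up for the example. With pandas' default `'W'` frequency, weeks end on Sunday, so each date maps to the preceding Monday.

```python
import pandas as pd

# Hypothetical incident log (IDs and dates are made up)
incidents = pd.DataFrame({
    'Student Identifier': ['A', 'A', 'A', 'B'],
    'Date of Incident': pd.to_datetime(
        ['2024-09-04', '2024-09-06', '2024-09-09', '2024-09-08']
    ),
})

# Map each date to the Monday that starts its week (vectorized)
incidents['Week'] = incidents['Date of Incident'].dt.to_period('W').dt.start_time

# One row per student per week, with the number of incidents in that week
weekly = (
    incidents.groupby(['Student Identifier', 'Week'])
    .size()
    .reset_index(name='weekly_referrals')
)
print(weekly)
```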


In [None]:
# Merge all the datasets into one model-ready dataset
# Reload datasets
original_ref = pd.read_csv("TTU Data - Disciplinary Referral.csv")
update_ref = pd.read_csv("TTU Data Update - Disciplinary Referral.csv")
original_bus = pd.read_csv("TTU Data - Bus Conduct.csv")
update_bus = pd.read_csv("TTU Data Update - Bus Conduct.csv")
family_engagement = pd.read_csv("TTU Data - Family Engagement.csv")
weather = pd.read_csv("weather.csv")

# Merge referrals
all_ref_columns = list(set(original_ref.columns).union(set(update_ref.columns)))
original_ref = original_ref.reindex(columns=all_ref_columns)
update_ref = update_ref.reindex(columns=all_ref_columns)
full_ref = pd.concat([original_ref, update_ref], ignore_index=True).drop_duplicates()

# Merge bus conduct
all_bus_columns = list(set(original_bus.columns).union(set(update_bus.columns)))
original_bus = original_bus.reindex(columns=all_bus_columns)
update_bus = update_bus.reindex(columns=all_bus_columns)
full_bus = pd.concat([original_bus, update_bus], ignore_index=True).drop_duplicates()

# Preprocess dates
full_ref['Date of Incident'] = pd.to_datetime(full_ref['Date of Incident'], errors='coerce')
full_bus['Date of Incident'] = pd.to_datetime(full_bus['Date of Incident'], errors='coerce')
weather['datetime'] = pd.to_datetime(weather['datetime'], errors='coerce')

# STEP 1: Create 'Week' columns (week start date; rows with unparseable dates stay NaT)
full_ref['Week'] = full_ref['Date of Incident'].dt.to_period('W').dt.start_time
full_bus['Week'] = full_bus['Date of Incident'].dt.to_period('W').dt.start_time

# STEP 2: Aggregate referral and bus incidents per student per week
ref_agg = full_ref.groupby(['Student Identifier', 'Week']).size().reset_index(name='weekly_referrals')
bus_agg = full_bus.groupby(['Student Identifier', 'Week']).size().reset_index(name='weekly_bus_incidents')

# STEP 3: Extract basic student metadata
student_meta = full_ref.drop_duplicates('Student Identifier')[['Student Identifier', 'Grade_Level', 'Gender', 'Ethnicity', 'LunchStatus']]

# STEP 4: Normalize and prepare engagement survey data
engagement_data = family_engagement.rename(columns=lambda x: x.strip())
if 'Student Identifier' not in engagement_data.columns:
    for col in engagement_data.columns:
        if 'student' in col.lower() and 'id' in col.lower():
            engagement_data.rename(columns={col: 'Student Identifier'}, inplace=True)
            break

# STEP 5: Merge referral and bus data
student_weeks = pd.merge(ref_agg, bus_agg, on=['Student Identifier', 'Week'], how='outer').fillna(0)

# STEP 6: Add student demographic data
student_weeks = pd.merge(student_weeks, student_meta, on='Student Identifier', how='left')

# STEP 7: Add engagement survey data
student_weeks = pd.merge(student_weeks, engagement_data, on='Student Identifier', how='left')

# STEP 8: Add weekly weather aggregates
weather['Week'] = weather['datetime'].dt.to_period('W').dt.start_time
weather_weekly = weather.groupby('Week').agg({
    'temp': 'mean',
    'humidity': 'mean',
    'precip': 'mean',
    'sealevelpressure': 'mean',
    'windgust': 'mean'
}).reset_index()
student_weeks = pd.merge(student_weeks, weather_weekly, on='Week', how='left')

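# STEP 9: Create target variable
# shift(-1) reads the student's next *recorded* week, which may be more than one
# calendar week later. The comparison below also turns the NaN produced for each
# student's last week into 0, so no rows are missing a target afterwards.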
student_weeks = student_weeks.sort_values(by=['Student Identifier', 'Week'])
student_weeks['referral_next_week'] = student_weeks.groupby('Student Identifier')['weekly_referrals'].shift(-1)
student_weeks['referral_next_week'] = (student_weeks['referral_next_week'] > 0).astype(int)

# Display the result
print("Model-Ready Dataset Preview:")
print(student_weeks.head())
print("\nShape:", student_weeks.shape)
print("Columns:", student_weeks.columns.tolist())

student_weeks.to_csv("model_ready_student_weeks.csv", index=False)


## 6. Model Development

In this section, we implement logistic regression, linear regression, a neural network (MLP), and a Random Forest classifier, plus a set of additional regression models for weekly referral counts.

### Loading the data and training/testing split

In [None]:
student_weeks = pd.read_csv("model_ready_student_weeks.csv")
student_weeks.dropna(subset=['referral_next_week'], inplace=True)

features = ['weekly_referrals', 'weekly_bus_incidents', 'Grade_Level', 'Gender',
            'Ethnicity', 'LunchStatus', 'temp', 'humidity', 'precip', 'sealevelpressure', 'windgust']

X = student_weeks[features]
y_classification = student_weeks['referral_next_week']
y_regression = student_weeks['weekly_referrals']

X_train, X_test, y_clf_train, y_clf_test, y_reg_train, y_reg_test = train_test_split(
    X, y_classification, y_regression, test_size=0.2, random_state=42)

print(f"Total samples after cleaning: {X.shape[0]}")
print(f"  • Training set: {X_train.shape[0]} samples")
print(f"  • Test set:     {X_test.shape[0]} samples")
print(f"Referral-next-week positive rate (train): {y_clf_train.mean():.2%}")
print(f"Referral-next-week positive rate (test):  {y_clf_test.mean():.2%}")
print(f"Weekly referrals (train) — mean: {y_reg_train.mean():.2f}, std: {y_reg_train.std():.2f}")

After dropping missing outcomes, an 80/20 split yielded N_train training samples and N_test test samples, as specified. The proportion of students flagged for referral next week is very similar in both sets, indicating that the random split preserved the class balance. The regression target (weekly referrals) has a mean of M and a standard deviation of S in the training set, suggesting moderate variability in weekly referral counts.
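
A plain random split usually preserves the class balance only approximately. If an exact match is wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses made-up data (50 rows with a 20% positive rate, not the notebook's dataset) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target: 50 rows, 10 of them positive (20% base rate).
y = np.array([1] * 10 + [0] * 40)
X = np.arange(50).reshape(-1, 1)

# stratify=y asks scikit-learn to keep the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"train positive rate: {y_tr.mean():.2%}")
print(f"test positive rate:  {y_te.mean():.2%}")
```

With stratification both rates equal the overall base rate, so any later difference between train and test metrics cannot be blamed on a shifted class balance.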

### Preprocessing Pipeline

In [None]:
# Preprocessing pipelines
numeric_features = ['weekly_referrals', 'weekly_bus_incidents', 'Grade_Level','temp']
categorical_features = ['Gender', 'Ethnicity', 'LunchStatus']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Regression preprocessor: EXCLUDE weekly_referrals to avoid leakage
preprocessor_reg = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['weekly_bus_incidents','Grade_Level','temp']),
    ('cat', categorical_transformer, ['Gender','Ethnicity','LunchStatus'])
])

preprocessor.fit(X_train)
X_train_clf = preprocessor.transform(X_train)
preprocessor_reg.fit(X_train)
X_train_reg = preprocessor_reg.transform(X_train)

print(f"Classification pipeline output features: {X_train_clf.shape[1]}")
print(f"Regression pipeline output features:   {X_train_reg.shape[1]}")

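Purpose:
The classification pipeline keeps `weekly_referrals` as a predictor; the regression pipeline excludes it because it is the regression target. Both pipelines one-hot encode the categorical features (`Gender`, `Ethnicity`, `LunchStatus`) and impute and standardize the numeric features (`weekly_bus_incidents`, `Grade_Level`, `temp`, plus `weekly_referrals` for classification). Columns not listed in either transformer, such as `humidity`, `precip`, `sealevelpressure`, and `windgust`, are dropped, because `ColumnTransformer` defaults to `remainder='drop'`.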

Reason:
The preprocessing pipeline is essential for preparing the data for machine learning models. It ensures that categorical variables are converted into a format suitable for model training, and that numerical features are scaled to have similar ranges, which can improve model performance.

### Model Pipeline

In [None]:
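# Model pipelines
# Imports needed by this cell (imbalanced-learn provides SMOTE and a sampler-aware Pipeline)
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import (
    PoissonRegressor, Ridge, Lasso, ElasticNet, HuberRegressor, RANSACRegressor
)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
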
pipelines = {
    'LogisticRegression': ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('clf', LogisticRegression(max_iter=1000, random_state=42))
    ]),
    'LinearRegression': Pipeline([
        ('preprocessor', preprocessor),
        ('clf', LinearRegression())
    ]),
    'NeuralNetwork': ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('clf', MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=42))
    ]),
    'RandomForest': ImbPipeline([
        ('preprocessor', preprocessor),
        ('smote', SMOTE(random_state=42)),
        ('clf', RandomForestClassifier(random_state=42))
    ])
}

# Extended regression pipelines
regression_pipelines = {
    'Poisson': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', PoissonRegressor(max_iter=300, alpha=1e-12))
    ]),
    'Ridge': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', Ridge(alpha=1.0))
    ]),
    'Lasso': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', Lasso(alpha=0.1, max_iter=5000))
    ]),
    'ElasticNet': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000))
    ]),
    'Huber': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', HuberRegressor(max_iter=300))
    ]),
    'RANSAC': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', RANSACRegressor(random_state=42))
    ]),
    'RF_Regressor': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', RandomForestRegressor(n_estimators=100, random_state=42))
    ]),
    'GB_Regressor': Pipeline([
        ('pre', preprocessor_reg),
        ('reg', GradientBoostingRegressor(n_estimators=100, random_state=42))
    ]),
}

# === Print pipeline summaries ===
print(f"Classification pipelines ({len(pipelines)}): {list(pipelines.keys())}")
print(f"Regression pipelines ({len(regression_pipelines)}): {list(regression_pipelines.keys())}")

Purpose:
The `pipelines` dictionary holds four entries (three classifiers plus a linear regression baseline), and `regression_pipelines` holds eight regressors. Each classifier pairs the shared preprocessing steps with SMOTE, to handle class imbalance, and a different estimator. The regression pipelines apply the leakage-free preprocessing to a diverse set of linear, robust, and ensemble models.

Reason:
With this standardized setup, we can systematically train and compare how different algorithms perform on both the referral-classification task and the weekly-referrals regression task, with consistent preprocessing and easy parallel experimentation.

In [None]:
# Classification features include the lagged referral count:
features_clf = [
    'weekly_referrals','weekly_bus_incidents','Grade_Level',
    'Gender','Ethnicity','LunchStatus','temp',
    'humidity','precip','sealevelpressure','windgust'
]
X_clf = student_weeks[features_clf]
y_clf = student_weeks['referral_next_week']

# Regression must exclude the target itself to avoid leakage:
features_reg = [f for f in features_clf if f != 'weekly_referrals']
X_reg = student_weeks[features_reg]
y_reg = student_weeks['weekly_referrals']

# Split each dataset independently:
# from sklearn.model_selection import train_test_split
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Print results
print(f"Classification total samples: {X_clf.shape[0]}")
print(f"  • Training samples: {X_clf_train.shape[0]}")
print(f"  • Test samples:     {X_clf_test.shape[0]}")
print(f"  • Positive class rate (train): {y_clf_train.mean():.2%}")
print(f"  • Positive class rate (test):  {y_clf_test.mean():.2%}")

print(f"Regression total samples: {X_reg.shape[0]}")
print(f"  • Training samples: {X_reg_train.shape[0]}")
print(f"  • Test samples:     {X_reg_test.shape[0]}")
print(f"  • Weekly referrals (train) mean: {y_reg_train.mean():.2f}, std: {y_reg_train.std():.2f}")

Purpose:
Separate splits were performed for the classification and regression tasks, and the regression features exclude weekly_referrals so that the regression target never appears among its own predictors. The classification split preserves the base rate of referral-next-week in both training and test sets, indicating a balanced random draw. The regression split yields comparable sample sizes and reports the mean and standard deviation of weekly referrals in the training set.

Reason:
Maintaining independent, representative splits for each modeling objective secures valid performance estimates. The consistent class rates and regression target distribution suggest no obvious sampling bias, supporting reliable downstream model evaluation.
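
Although the two splits are made by separate calls, they are not independent random draws: with the same `random_state`, `test_size`, and number of rows, `train_test_split` selects the same rows each time. This is what keeps objects from the different splits in this notebook row-aligned. A small check on made-up data (the frame and column names below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the model-ready data.
df = pd.DataFrame({"a": np.arange(20), "b": np.arange(20) * 2})

# Two separate calls on different feature subsets, same seed and same row count.
Xa_tr, Xa_te = train_test_split(df[["a", "b"]], test_size=0.2, random_state=42)
Xb_tr, Xb_te = train_test_split(df[["b"]], test_size=0.2, random_state=42)

# Compare the row labels chosen for each part.
same_rows = Xa_tr.index.equals(Xb_tr.index) and Xa_te.index.equals(Xb_te.index)
print("identical partitions:", same_rows)
```

If the seed, the test fraction, or the number of rows differed between the two calls, the partitions would no longer be guaranteed to match.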

### Parameter Grids

In [None]:
# Parameter grids
param_grids = {
    'LogisticRegression': {
        'clf__C': [0.01, 0.1, 1.0, 10.0],
        'clf__penalty': ['l2'],
        'clf__solver': ['lbfgs']
    },
    'LinearRegression': {
        'clf__fit_intercept': [True, False],
        'clf__positive': [False, True]
    },
    'NeuralNetwork': {
        'clf__hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50), (64, 64, 32)],
        'clf__activation': ['relu', 'tanh'],
        'clf__solver': ['adam'],
        'clf__alpha': [0.0001, 0.001],
        'clf__learning_rate': ['constant', 'adaptive'],
        'clf__early_stopping': [True],
        'clf__n_iter_no_change': [5]
    },
    'RandomForest': {
        'clf__n_estimators': [100, 200],
        'clf__max_depth': [None, 10, 20],
        'clf__min_samples_split': [2, 5],
        'clf__min_samples_leaf': [1, 2],
        'clf__bootstrap': [True, False]
    }
}

# GridSearchCV runner
def run_grid_searches(X_train, y_train, pipelines):
    best_models = {}
    for model_name, pipeline in pipelines.items():
        if model_name not in param_grids:
            print(f"Skipping model: {model_name}")
            continue

        print(f"\nTuning hyperparameters for: {model_name}...")

        grid = GridSearchCV(
            pipeline,
            param_grid=param_grids[model_name],
            cv=5,
            scoring='f1_macro' if model_name != 'LinearRegression' else 'r2',
            n_jobs=-1,
            verbose=1
        )

        grid.fit(X_train, y_train)
        best_models[model_name] = grid

        print(f"\nBest Params for {model_name}: {grid.best_params_}")
        print(f"Best Score: {grid.best_score_:.4f}")

    return best_models



### DBSCAN Clustering Implementation

In [None]:
def evaluate_dbscan(X_raw):
    # Preprocess the data
    X_numeric = X_raw.select_dtypes(include=[np.number])
    X_numeric = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X_numeric), columns=X_numeric.columns)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_numeric)

    # Parameter ranges
    eps_vals = np.arange(0.1, 1.1, 0.1)
    min_samples_vals = [3, 5, 7]

    best_score = -1
    best_model = None
    best_params = None

    # Iterate over parameter combinations
    for eps in eps_vals:
        for min_samples in min_samples_vals:
            model = DBSCAN(eps=eps, min_samples=min_samples)
            labels = model.fit_predict(X_scaled)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

            if n_clusters < 2:
                print(f"Skipped eps={eps:.1f}, min_samples={min_samples} — only {n_clusters} cluster(s) detected")
                continue

            score = silhouette_score(X_scaled, labels)
            print(f"Checked eps={eps:.1f}, min_samples={min_samples} → Silhouette Score: {score:.4f}")

            if score > best_score:
                best_score = score
                best_model = model
                best_params = {'eps': eps, 'min_samples': min_samples}

    if best_model:
        print("\nBest DBSCAN Params:", best_params)
        print("Best Silhouette Score:", best_score)
    else:
        print("No valid DBSCAN clustering found. Adjust parameter ranges or examine dataset scale.")

    return best_model, best_score



## 7. Model Evaluation and Interpretation


In [None]:
def evaluate_model_results(best_models, X_test, y_clf_test, y_reg_test):
    for name, model in best_models.items():
        print(f"\nEvaluation Results for {name}")

        y_pred = model.predict(X_test)

        if is_regressor(model) and not is_classifier(model):
            # Regressor evaluation
            try:
                mse = mean_squared_error(y_reg_test, y_pred)
                rmse = np.sqrt(mse)
                r2 = r2_score(y_reg_test, y_pred)

                print(f"RMSE: {rmse:.4f}")
                print(f"R² Score: {r2:.4f}")

                # Regression scatter plot
                plt.figure(figsize=(6, 4))
                plt.scatter(y_reg_test, y_pred, alpha=0.3)
                plt.title(f'{name} — Actual vs. Predicted Referrals')
                plt.xlabel("Actual Referrals")
                plt.ylabel("Predicted Referrals")
                plt.grid(True)
                plt.tight_layout()
                plt.show()
            except Exception as e:
                print("[Regression Metrics Error]:", e)

        elif is_classifier(model):
            # Classifier evaluation
            try:
                y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

                acc = accuracy_score(y_clf_test, y_pred)
                prec = precision_score(y_clf_test, y_pred)
                rec = recall_score(y_clf_test, y_pred)
                f1 = f1_score(y_clf_test, y_pred)
                roc_auc = roc_auc_score(y_clf_test, y_proba) if y_proba is not None else 'N/A'

                print(f"Accuracy: {acc:.4f}")
                print(f"Precision: {prec:.4f}")
                print(f"Recall: {rec:.4f}")
                print(f"F1 Score: {f1:.4f}")
                print(f"ROC AUC: {roc_auc:.4f}" if roc_auc != 'N/A' else "ROC AUC: N/A")

                cm = confusion_matrix(y_clf_test, y_pred)
                plt.figure(figsize=(5, 4))
                sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
                plt.title(f"{name} — Confusion Matrix")
                plt.xlabel("Predicted")
                plt.ylabel("Actual")
                plt.tight_layout()
                plt.show()

            except Exception as e:
                print("[Classification Metrics Error]:", e)

        else:
            print(f"{name} is neither a recognized classifier nor regressor.")

# Usage:
# best_models = run_grid_searches(X_train, y_clf_train, pipelines)
# evaluate_model_results(best_models, X_test, y_clf_test, y_reg_test)

# select best models and evaluate them for normal pipeline
best_models = run_grid_searches(X_clf_train, y_clf_train, pipelines)
evaluate_model_results(best_models, X_clf_test, y_clf_test, y_reg_test)
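# select best models and evaluate them for regression pipeline
# Note: run_grid_searches() skips any name missing from param_grids, so none of the
# regression pipelines is tuned here and best_reg_models is empty; the next cell
# tunes them with param_grids_reg.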
best_reg_models = run_grid_searches(X_reg_train, y_reg_train, regression_pipelines)
evaluate_model_results(best_reg_models, X_reg_test, y_reg_test, y_reg_test)

In [None]:
# Define parameter grids for regression
param_grids_reg = {
    'Poisson':    {'reg__alpha': [1e-12, 1e-6, 1e-2]},
    'Ridge':      {'reg__alpha': [0.1, 1.0, 10.0]},
    'Lasso':      {'reg__alpha': [0.01, 0.1, 1.0]},
    'ElasticNet': {'reg__alpha': [0.01,0.1], 'reg__l1_ratio':[0.2,0.5,0.8]},
    'RF_Regressor': {'reg__n_estimators': [50,100], 'reg__max_depth':[None,10]},
    'GB_Regressor': {'reg__n_estimators': [50,100], 'reg__learning_rate':[0.1,0.01]},
}

best_regressors = {}
for name, pipe in regression_pipelines.items():
    grid = GridSearchCV(
        estimator=pipe,
        param_grid=param_grids_reg.get(name, {}),
        cv=5,
        scoring='neg_root_mean_squared_error',
        n_jobs=-1
    )
    grid.fit(X_reg_train, y_reg_train)  # use the split that y_reg_train came from
    best_regressors[name] = grid.best_estimator_
    print(f"{name}: Best params = {grid.best_params_}, RMSE = {-grid.best_score_:.3f}")

# Evaluate on test set (X_reg_test is the split that matches y_reg_test)
for name, model in best_regressors.items():
    y_pred = model.predict(X_reg_test)
    mse = mean_squared_error(y_reg_test, y_pred)
    rmse = np.sqrt(mse)
    r2   = r2_score(y_reg_test, y_pred)
    print(f"{name}:  RMSE = {rmse:.3f},  R² = {r2:.3f}")

### DBSCAN Clustering Results

In [None]:
from sklearn.decomposition import PCA  # needed for the 2-D projection below

def visualize_dbscan_clusters(dbscan_model, X_raw, preprocessor):
    if dbscan_model is None:
        print("DBSCAN model is None — check if a valid model was returned from evaluate_dbscan().")
        return

    X_preprocessed = preprocessor.fit_transform(X_raw)
    labels = dbscan_model.fit_predict(X_preprocessed)

    pca = PCA(n_components=2)
    reduced = pca.fit_transform(X_preprocessed)

    plt.figure(figsize=(8, 6))
    scatter = plt.scatter(reduced[:, 0], reduced[:, 1], c=labels, cmap='tab10', s=20, alpha=0.6)
    plt.title("DBSCAN Cluster Visualization (PCA-Reduced)")
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.grid(True)
    plt.legend(*scatter.legend_elements(), title="Clusters")
    plt.tight_layout()
    plt.show()

best_dbscan, best_score = evaluate_dbscan(X)
if best_dbscan:
    visualize_dbscan_clusters(best_dbscan, X, preprocessor)
else:
    print("No valid DBSCAN model was found. Please adjust the parameters or check the dataset.")

In [None]:
# Cluster Profiling: mean feature values per DBSCAN cluster
numeric_cols = X.select_dtypes(include=['int64','float64']).columns
imp = SimpleImputer(strategy='mean')
X_num = pd.DataFrame(imp.fit_transform(X[numeric_cols]), columns=numeric_cols)
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_num), columns=numeric_cols)

db = DBSCAN(eps=0.8, min_samples=3).fit(X_scaled)
labels = db.labels_

df_profiles = X.copy()
df_profiles[numeric_cols] = X_num
df_profiles['cluster'] = labels

cluster_summary = df_profiles.groupby('cluster')[numeric_cols].mean().round(2)
print(cluster_summary)

### Extracting important features

In [None]:
# Get the trained model
log_reg_model = best_models['LogisticRegression'].best_estimator_.named_steps['clf']
lin_reg_model = best_models['LinearRegression'].best_estimator_.named_steps['clf']

# Get feature names from the preprocessor
feature_names = best_models['LogisticRegression'].best_estimator_.named_steps['preprocessor'] \
    .get_feature_names_out()

# Logistic Regression coefficients
log_reg_coefs = pd.Series(log_reg_model.coef_[0], index=feature_names).sort_values(key=abs, ascending=False)

# Linear Regression coefficients
lin_reg_coefs = pd.Series(lin_reg_model.coef_, index=feature_names).sort_values(key=abs, ascending=False)

print("Logistic Regression Top Features:\n", log_reg_coefs.head())
print("\nLinear Regression Top Features:\n", lin_reg_coefs.head())

# Random Forest feature importances
rf_model = best_models['RandomForest'].best_estimator_.named_steps['clf']

feature_names = best_models['RandomForest'].best_estimator_.named_steps['preprocessor'] \
    .get_feature_names_out()

rf_importances = pd.Series(rf_model.feature_importances_, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:\n", rf_importances.head())

# Neural Network feature importances
from sklearn.inspection import permutation_importance

# 1. Get the trained full pipeline
nn_pipeline = best_models['NeuralNetwork'].best_estimator_

# 2. Permute each raw input column of X_clf_test and measure the drop in F1
results = permutation_importance(
    nn_pipeline,
    X_clf_test,
    y_clf_test,
    n_repeats=10,
    random_state=42,
    scoring='f1'
)

# permutation_importance shuffles the columns of the data passed in, so the scores
# line up with the raw columns of X_clf_test, not with the one-hot encoded names
# returned by the preprocessor.
feature_names = X_clf_test.columns

perm_importances = pd.Series(results.importances_mean, index=feature_names).sort_values(ascending=False)
print("Permutation Importances (Neural Network):\n", perm_importances.head(10))


## Model Analysis and Interpretation

### Overview
This section interprets the output of the evaluated models (Logistic Regression, Linear Regression, Neural Network, Random Forest, and DBSCAN), providing context for what the results mean and their implications for the school district.

---

### Logistic Regression
- **Precision**: `0.6196`
- **Recall**: `0.7403`
- **F1 Score**: `0.6746`

**Interpretation**: The model predicts referral risk from a linear combination of inputs, each of which raises or lowers the log-odds of a referral next week. Students with more weekly referrals, higher temperatures, and certain ethnicity codes were more likely to be classified as high risk, while students with more bus incidents or on reduced lunch (LunchStatus_R) were less likely. Because numeric features are standardized in the pipeline, each numeric coefficient describes the change in log-odds per one-standard-deviation increase: positive for weekly_referrals, negative for weekly_bus_incidents. The negative bus-incident coefficient may reflect noise or collinearity.

**Top Features (by coefficient magnitude)**:
- weekly_bus_incidents (−0.68): Negative coefficient; more bus incidents reduced predicted risk. This is likely due to students being removed from the classroom more frequently, as discussed earlier.
- weekly_referrals (+0.64): Positive coefficient; past referrals strongly increased predicted risk.
- temp (+0.37): Higher temperatures correlated with more predicted referrals.
- LunchStatus_R (−0.33): Students on reduced lunch were less likely to be flagged.
- Ethnicity_B (+0.19): Belonging to ethnic group "B" slightly raised predicted risk.
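
Log-odds coefficients are easier to read as odds ratios. The sketch below exponentiates the coefficients listed above. It assumes, based on the preprocessing pipeline, that a one-unit change means one standard deviation for the numeric features and a switch from 0 to 1 for the one-hot encoded ones. Because the classifier was trained on SMOTE-resampled data, these ratios describe the fitted model rather than calibrated real-world odds.

```python
import math

# Coefficients as reported in the list above (log-odds scale).
coefs = {
    "weekly_bus_incidents": -0.68,
    "weekly_referrals": 0.64,
    "temp": 0.37,
    "LunchStatus_R": -0.33,
    "Ethnicity_B": 0.19,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit
# increase in the (preprocessed) feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:>22}: odds x {ratio:.2f}")
```

Read this way, a one-standard-deviation increase in weekly_referrals multiplies the predicted odds by roughly 1.9, while the same increase in weekly_bus_incidents roughly halves them.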

**Analysis**: Logistic regression performed moderately well in predicting students who would receive a referral the following week. A recall of ~74% indicates that the model successfully identified most of the actual positive cases (students who were referred). However, with a precision of ~62%, about 38% of the students predicted to receive a referral did not actually receive one, indicating a moderate rate of false positives. The F1 score of 0.6746 reflects a balanced trade-off between precision and recall, making logistic regression a solid baseline model, but with room for improvement, especially in precision.
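
The F1 scores quoted in this section follow directly from the reported precision and recall, since F1 is their harmonic mean. A short check, using only the numbers reported here for the three classifiers:

```python
# Precision and recall as reported in this section for each classifier.
reported = {
    "LogisticRegression": (0.6196, 0.7403),
    "NeuralNetwork": (0.6374, 0.7532),
    "RandomForest": (0.6163, 0.6883),
}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.4f}")
```

The harmonic mean sits closer to the smaller of the two inputs, which is why each F1 score lies nearer to the model's precision than to its recall.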

---

### Linear Regression
- **RMSE**: `0.7571`
- **R²**: `-0.2028`

**Interpretation**: Linear regression predicts the expected referral count as a weighted sum of the inputs. Most of the signal came from ethnicity and lunch status, with almost no contribution from numeric features such as bus incidents or temperature. For a one-hot encoded category, the coefficient is the change in the prediction when the indicator switches from 0 to 1: for example, Ethnicity_P lowers the expected number of referrals by 0.41, holding all else equal. Given the negative R², however, the model is not capturing meaningful structure, so these coefficients should not be over-interpreted.

**Top Features (by coefficient size)**:
- Ethnicity_P (−0.41): Being in this group was associated with fewer referrals.
- Ethnicity_B (+0.30): Small positive correlation.
- LunchStatus_P and LunchStatus_F: Positively associated with referral counts.

**Analysis**: The model's R² is negative, indicating that it performs worse than simply predicting the mean value for all observations. The linear regression model failed to capture patterns in the data to predict the number of weekly referrals, suggesting the relationship is not linear or lacks strong predictive variables.
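
A negative R² can look surprising, so here is a small worked example with made-up numbers (not the project's predictions). R² compares the model's squared error with that of a baseline that always predicts the mean, and it drops below zero whenever the model's error is the larger of the two.

```python
import math

# Hypothetical weekly referral counts and deliberately poor predictions.
y_true = [0, 1, 0, 2, 1]
y_pred = [1, 0, 1, 0, 1]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # mean-only baseline's squared error

r2 = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(y_true))

print(f"R^2 = {r2:.2f}, RMSE = {rmse:.3f}")
```

Here the model's squared error (7) is 2.5 times the baseline's (2.8), giving an R² of -1.5.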

---

### Neural Network
- **Precision**: `0.6374`
- **Recall**: `0.7532`
- **F1 Score**: `0.6905`

**Interpretation**: The neural network predicts based on complex, nonlinear patterns. It considers combinations of variables (e.g., high referrals + low lunch status + high temperature) that interact in ways linear models can't capture. Increases in weekly_referrals or weekly_bus_incidents push predictions higher, but thresholds and interactions matter. For example, a high bus incident count may only affect the prediction if combined with other factors like high temperature or early grade level.

**Top Features (by permutation importance)**:
- weekly_referrals — most critical factor, confirming history is predictive
- weekly_bus_incidents — meaningful contributor, though less than referrals.
- Grade_Level — age may correlate with maturity or policy differences.
- temp — environmental conditions influence behavior.
- LunchStatus_R, Gender_M, and Ethnicity_B — social or demographic influences.

**Analysis**: The neural network outperformed logistic regression in predicting students who would receive a referral the following week. With a higher recall of ~75%, it correctly identified more actual referral cases, and its precision of ~64% shows a slight improvement in reducing false positives compared to logistic regression. The F1 score of 0.6905, being higher than the logistic regression's 0.6746, indicates a better overall balance between precision and recall, making the neural network the strongest performer among the tested models.
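
When permutation importance is computed on a full pipeline, scikit-learn shuffles the columns of the data passed in, so it returns one score per raw input column rather than one per encoded feature. The sketch below illustrates this on hypothetical toy data with a small logistic regression pipeline standing in for the tuned neural network; all names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical toy data: the label depends on "referrals" only.
X = pd.DataFrame({
    "referrals": rng.integers(0, 4, size=n),
    "temp": rng.normal(60, 10, size=n),
    "lunch": rng.choice(["F", "P", "R"], size=n),
})
y = (X["referrals"] >= 2).astype(int)

pipe = Pipeline([
    ("pre", ColumnTransformer([
        ("num", StandardScaler(), ["referrals", "temp"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

result = permutation_importance(pipe, X, y, n_repeats=5, random_state=0, scoring="f1")

# One score per raw input column, even though the classifier sees encoded features.
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

Labeling the scores with the raw column names keeps each importance attached to the column that was actually shuffled.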

---

### Random Forest

- **Precision**: `0.6163`
- **Recall**: `0.6883`
- **F1 Score**: `0.6503`

**Interpretation**: A random forest predicts by majority vote across many decision trees (100 or 200 in the grid searched above), each splitting on feature thresholds. For instance, one tree might say: "If temp > 70 and weekly_referrals > 1, predict 1 (referral)." Because trees split at thresholds, a change from weekly_referrals = 1 to 2 may cross a split and change the final class. These effects are nonlinear and may vary with the combination of other features, such as weather.

**Top Features (by importance)**:
- temp — most impactful, suggesting a strong seasonal/environmental effect.
- weekly_referrals — history still matters.
- Grade_Level — possibly affecting behavioral trends.
- weekly_bus_incidents — contributes, but less than referrals or weather.
- LunchStatus_R — demographic indicator with weaker influence.

**Analysis**: The Random Forest model performed slightly below both logistic regression and the neural network in predicting student referrals. With a precision of ~62%, it had a similar rate of false positives as the other models, but its recall of ~69% indicates it missed more actual referral cases compared to the neural network, and slightly more than logistic regression. The F1 score of 0.6503 reflects a somewhat weaker balance between precision and recall, suggesting that while Random Forest is a viable model, it may not be as effective as the others for this particular prediction task.

---

### DBSCAN Clustering
- **Best Params**: eps=0.8, min_samples=3
- **Best Silhouette Score**: 0.4634

**Interpretation**: DBSCAN groups student-weeks based on density: records with similar behavior (e.g., referral and bus incident patterns, grade level, weather exposure) form clusters if they are closely packed in feature space. Only numeric columns are used in `evaluate_dbscan`, so categorical demographics such as gender, ethnicity, and lunch status do not enter the distance calculation. Points not close to any dense region are marked as outliers (cluster -1). DBSCAN is sensitive to distance, so small changes in key features like weekly_referrals or Grade_Level can push a record from a dense region (a cluster) into noise (outlier) or another group. For example, adding 1 weekly referral might tip a student into a higher risk cluster, depending on surrounding data.

**Top Features**:
- Weekly referrals
- Bus incidents
- Grade level
- Temperature and humidity

**Analysis**: DBSCAN discovered moderately well-separated clusters (silhouette ~0.46). These clusters could correspond to different behavior profiles or referral risk tiers. However, the silhouette score indicates that many points lie near cluster boundaries, suggesting some overlap or noise in student behavior patterns.
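
One caveat when reading this score: `silhouette_score` treats DBSCAN's noise label `-1` as an ordinary cluster if those points are passed in. A common alternative (an assumption here, not what the notebook's `evaluate_dbscan` does) is to score only the clustered points and report the noise fraction separately. A sketch on hypothetical toy points:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Hypothetical toy points: two tight groups and one far-away outlier.
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [10.0, -10.0],
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
noise = labels == -1

# Score only the points assigned to a cluster; otherwise the noise label -1
# is treated by silhouette_score as if it were one more cluster.
score_clustered = silhouette_score(X[~noise], labels[~noise])

print("labels:", labels.tolist())
print(f"noise fraction: {noise.mean():.2f}")
print(f"silhouette (clustered points only): {score_clustered:.3f}")
```

Reporting the noise fraction next to the silhouette score makes it clear how many records the clusters actually describe.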

---

## 8. Summary
This section gives a high-level overview of the full modeling process: the data cleaning decisions and their rationale, how the best-performing model (the neural network) works and what it offers, general takeaways from the results, and actionable recommendations for the school district. The model predicts the likelihood that a student receives a referral in the following week from features such as past referrals, bus incidents, temperature, and demographic information. The neural network was chosen as the best-performing model because it can capture complex relationships in the data and achieved the highest recall (~75%) and F1 score (0.6905) of the classifiers tested.

---
### Data Cleaning

Multiple datasets were merged and standardized, including disciplinary referrals, bus conduct, family engagement, and weather records. Date fields were parsed and used to create weekly time bins. Missing or inconsistent student identifiers and metadata were resolved, and null values in key columns were handled using imputation or defaults (e.g., zero for missing incident counts). Duplicate entries were removed, and weekly aggregates were calculated per student. The final dataset was cleaned to ensure consistency, proper data types, and alignment across all sources for model readiness.

---

### Best Model

The **neural network model** outperformed all others in predicting which students were likely to receive a disciplinary referral the following week.

**How it Worked:** 
- Captured nonlinear interactions between variables (e.g., how temperature, referral history, and lunch status interact).

- Used complex combinations of inputs to make predictions that simpler models (like logistic regression) couldn't detect.

- Permutation importance, computed on the trained model, indicated that weekly referral history, bus incidents, grade level, and environmental factors (e.g., temperature) were strong drivers of its predictions.

**Benefits:**
- High recall means it successfully identified most students at risk, making it ideal for early intervention and prevention efforts.

- Better precision than logistic regression, meaning fewer false alarms when flagging students.

- Able to generalize from complex patterns, increasing the accuracy of predictions across diverse student profiles and conditions.
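
To make the recall and precision trade-off concrete, here is a small sketch with made-up confusion-matrix counts. The numbers are hypothetical and chosen only for easy arithmetic; they are not the project's results.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical counts for 200 student-weeks (illustrative only).
tp, fn, fp, tn = 30, 10, 20, 140

# Rebuild label arrays that produce exactly those counts.
y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

recall = recall_score(y_true, y_pred)        # TP / (TP + FN): at-risk students caught
precision = precision_score(y_true, y_pred)  # TP / (TP + FP): flags that were correct
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(confusion_matrix(y_true, y_pred))
print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

In this toy case, 10 of 40 at-risk students are missed (recall 0.75), while 20 of 50 flags are false alarms (precision 0.60).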


---

In [None]:
# Collect evaluation metrics for each fitted model.
def collect_model_metrics(best_models, X_test, y_clf_test, y_reg_test):
    """Return (classification_df, regression_df), one row per model in `best_models`."""
    clf_metrics = []
    reg_metrics = []

    for name, model in best_models.items():
        y_pred = model.predict(X_test)

        if is_regressor(model) and not is_classifier(model):
            # Regressors are scored against the regression target.
            try:
                mse = mean_squared_error(y_reg_test, y_pred)
                rmse = np.sqrt(mse)  # same units as the target
                r2 = r2_score(y_reg_test, y_pred)
                reg_metrics.append({'Model': name, 'RMSE': rmse, 'R2': r2})
            except Exception as e:
                print(f"[Regression Metrics Error] {name}:", e)

        elif is_classifier(model):
            # Classifiers are scored against the binary classification target.
            try:
                # Positive-class probabilities are needed for ROC AUC only.
                y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

                acc = accuracy_score(y_clf_test, y_pred)
                # zero_division=0 avoids warnings if a model predicts no positives.
                prec = precision_score(y_clf_test, y_pred, zero_division=0)
                rec = recall_score(y_clf_test, y_pred, zero_division=0)
                f1 = f1_score(y_clf_test, y_pred, zero_division=0)
                roc_auc = roc_auc_score(y_clf_test, y_proba) if y_proba is not None else np.nan

                clf_metrics.append({
                    'Model': name,
                    'Accuracy': acc,
                    'Precision': prec,
                    'Recall': rec,
                    'F1 Score': f1,
                    'ROC AUC': roc_auc
                })
            except Exception as e:
                print(f"[Classification Metrics Error] {name}:", e)

    return pd.DataFrame(clf_metrics), pd.DataFrame(reg_metrics)

# Plot comparison charts for the collected metrics.
def plot_metrics(df_clf, df_reg):
    if not df_clf.empty:
        # Index by model name on a copy so the caller's DataFrame keeps
        # its 'Model' column and the plot call can be re-run safely.
        clf_plot = df_clf.set_index('Model')
        clf_plot.plot(kind='bar', figsize=(12, 6), title="Classification Model Comparison", grid=True)
        plt.xticks(rotation=45)
        plt.ylabel("Score")
        plt.legend(loc='lower right')
        plt.tight_layout()
        plt.show()

    if not df_reg.empty:
        reg_plot = df_reg.set_index('Model')
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # RMSE and R² are on different scales, so each gets its own axis.
        reg_plot['RMSE'].plot(kind='bar', ax=axes[0], title="Regression RMSE Comparison", grid=True)
        axes[0].set_ylabel("RMSE")
        axes[0].set_xticklabels(reg_plot.index, rotation=45)

        reg_plot['R2'].plot(kind='bar', ax=axes[1], title="Regression R2 Comparison", grid=True)
        axes[1].set_ylabel("R² Score")
        axes[1].set_xticklabels(reg_plot.index, rotation=45)

        plt.tight_layout()
        plt.show()

# classification models
clf_metrics_df, _ = collect_model_metrics(best_models, X_clf_test, y_clf_test, y_reg_test)
plot_metrics(clf_metrics_df, pd.DataFrame())

# regression models
_, reg_metrics_df = collect_model_metrics(best_reg_models, X_reg_test, y_clf_test, y_reg_test)
plot_metrics(pd.DataFrame(), reg_metrics_df)



---
### General Takeaways

- **Logistic Regression** offered solid performance and remains a strong baseline. Its simplicity and interpretability make it a reliable model, though its precision is slightly lower than desirable.

- **Linear Regression** was not suitable for this task. Its negative R² means it predicted weekly referral counts worse than a constant prediction of the mean would, reinforcing that this is not a linear prediction problem.

- **Neural Network** emerged as the most effective model for predicting next-week referrals, achieving the highest F1 score and recall. It is best suited for identifying at-risk students, though some false positives remain.

- **Random Forest** underperformed relative to the neural network and logistic regression, with both lower recall and lower F1 score. It may require more tuning or deeper feature engineering to be competitive.

- **DBSCAN** shows potential for uncovering behavioral profiles but may require further feature selection or tuning.
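
The meaning of a negative R² for linear regression can be checked with a toy example. The counts and predictions below are made up for illustration; the point is only that R² is 0 for a constant prediction of the mean and drops below 0 when a model's errors exceed that baseline.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical weekly referral counts: mostly zero with occasional spikes.
y_true = np.array([0, 0, 1, 0, 3, 0, 0, 2])

# Baseline: always predict the mean of the observed values.
y_mean = np.full(y_true.shape, y_true.mean())

# A poorly fitting model whose predictions move in the wrong direction.
y_bad = np.array([1.5, 1.0, 0.0, 2.0, 0.0, 1.0, 1.5, 0.0])

r2_mean = r2_score(y_true, y_mean)  # R² = 1 - SS_res / SS_tot
r2_bad = r2_score(y_true, y_bad)

print(f"mean baseline R2: {r2_mean:.3f}")
print(f"poor model R2:    {r2_bad:.3f}")
```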
---

### Recommendations
- Prioritize the Neural Network for referral prediction, as it provided the best overall performance. Consider fine-tuning its architecture and training parameters, and possibly augmenting the dataset to further improve accuracy and generalization.

- Use Logistic Regression as a strong, interpretable baseline for comparison and quick deployment. It’s a practical choice when simplicity, transparency, or real-time inference is important.

- Reevaluate the use of Random Forest unless further optimized. Explore hyperparameter tuning or feature selection to enhance its effectiveness if ensemble methods are desired.

- Avoid Linear Regression for this task. It does not capture the complexity of referral behavior patterns, and its negative R² means it performed worse than simply predicting the mean referral count. That is not unusual for this type of data, but it is still a factor to consider.

- Leverage DBSCAN as a tool for unsupervised behavioral segmentation. Clusters may help identify different risk profiles or behavioral subtypes and inform targeted interventions. Consider improving feature selection or testing other clustering methods to refine groupings.

- Explore additional models such as Gradient Boosting or XGBoost to balance precision and recall more effectively.

- Incorporate temporal features (e.g., days before/after breaks, weekday effects) to enhance model performance and interpretability in the future.

- Use model outputs to trigger proactive supports, especially for students who are repeatedly flagged, to reduce the likelihood of future referrals.
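
The temporal features suggested above could be derived along these lines. This is a sketch under assumptions: the column names (`incident_date`, `break_start`, `weekday`, `days_until_break`) and the break dates are hypothetical, and a real version would use the district's calendar.

```python
import pandas as pd

# Hypothetical incident dates and school-break start dates.
incidents = pd.DataFrame({
    "incident_date": pd.to_datetime(
        ["2024-03-04", "2024-03-13", "2024-03-15", "2024-04-02"]
    )
})
breaks = pd.DataFrame({
    "break_start": pd.to_datetime(["2024-03-18", "2024-05-27"])
})

# Weekday effect: 0 = Monday, ..., 6 = Sunday.
incidents["weekday"] = incidents["incident_date"].dt.dayofweek

# For each incident, find the next break on or after that date.
# merge_asof requires both frames to be sorted on their keys.
feat = pd.merge_asof(
    incidents.sort_values("incident_date"),
    breaks.sort_values("break_start"),
    left_on="incident_date",
    right_on="break_start",
    direction="forward",
)
feat["days_until_break"] = (
    feat["break_start"] - feat["incident_date"]
).dt.days

print(feat)
```

Using `direction="backward"` instead would give the most recent break before each incident, from which a "days since break" feature follows the same way.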

---


