# CS M148 First Project Check-in
## Student Performance Factors Analysis

**Team:** xxxx  
**Course:** CS M148 Fall 2025, UCLA  
**Date:** October 10, 2025

---

## 1. Dataset Selection

### Dataset Overview

Our team has chosen the **Student Performance Factors** dataset from Kaggle for our course project.

**Dataset Details:**
- **Source:** [Kaggle - Student Performance Factors](https://www.kaggle.com/datasets/lainguyn123/student-performance-factors)
- **Creator:** Lai Ng.
- **License:** CC0: Public Domain
- **Size:** 6,607 student records
- **Variables:** 20 total (19 features + 1 target variable)

### Research Question

**Primary Goal:** Predict student exam scores based on various academic, socioeconomic, and lifestyle factors.

This dataset is particularly valuable because it provides a comprehensive view of factors that may influence academic performance, allowing us to:
1. Identify which factors have the strongest predictive power for student success
2. Understand relationships between different variables (e.g., sleep vs. study hours)
3. Provide data-driven insights for educational interventions
4. Practice machine learning techniques on a real-world educational dataset

### Why This Dataset?

We selected this dataset because:
- **Practical Relevance:** Educational outcomes directly impact students, educators, and policymakers
- **Rich Feature Set:** 19 diverse features covering multiple domains (academic, family, lifestyle, resources)
- **Clean Structure:** Well-organized CSV format suitable for machine learning workflows
- **Appropriate Size:** 6,607 records provide sufficient data for training/testing without excessive computational demands
- **Interpretability:** Results can provide actionable insights for educational improvement

## 2. Main Features and Study Rationale

Our dataset contains **19 predictor variables** and **1 target variable** (`Exam_Score`). We've organized these features into five meaningful categories:

### Feature Categories

#### A. Academic Factors
These directly relate to student learning behaviors and past performance:
- **Hours_Studied** (numeric): Weekly study hours - *Expected to have strong positive correlation with exam scores*
- **Attendance** (numeric %): Class attendance rate - *Higher attendance should correlate with better performance*
- **Previous_Scores** (numeric): Past academic performance - *Likely the strongest predictor; past performance indicates ability*
- **Motivation_Level** (categorical: Low/Medium/High): Student's intrinsic motivation
- **Tutoring_Sessions** (numeric): Number of tutoring sessions attended - *Additional academic support*

#### B. Family and Socioeconomic Factors
These capture the student's home environment and support system:
- **Parental_Involvement** (categorical: Low/Medium/High): Parent engagement in education
- **Parental_Education_Level** (categorical: High School/College/Postgraduate): Parents' highest education
- **Family_Income** (categorical: Low/Medium/High): Household income level

*Why study these?* Socioeconomic factors often correlate with educational resources and support available at home.

#### C. Educational Resources
These measure access to learning materials and quality instruction:
- **Access_to_Resources** (categorical: Low/Medium/High): Availability of learning materials
- **Internet_Access** (binary: Yes/No): Home internet availability for research and learning
- **School_Type** (categorical: Public/Private): Type of school attended
- **Teacher_Quality** (categorical: Low/Medium/High): Quality of instruction received

*Why study these?* Resource availability can create opportunity gaps between students.

#### D. Lifestyle and Health Factors
These capture student wellbeing and time allocation:
- **Sleep_Hours** (numeric): Average nightly sleep - *Sleep affects cognitive function and learning*
- **Extracurricular_Activities** (binary: Yes/No): Participation in non-academic activities
- **Physical_Activity** (numeric): Hours per week of physical exercise
- **Learning_Disabilities** (binary: Yes/No): Presence of diagnosed learning disabilities

*Why study these?* Physical and mental health significantly impact academic performance.

#### E. Environmental Factors
These describe the student's broader context:
- **Distance_from_Home** (categorical: Near/Moderate/Far): Commute distance to school
- **Peer_Influence** (categorical: Positive/Neutral/Negative): Influence of friend group
- **Gender** (categorical: Male/Female): Student gender

### Target Variable
- **Exam_Score** (numeric): Final examination score (our prediction target)

### Why Study These Features?

1. **Holistic Understanding:** These features provide a 360-degree view of student life, from academic habits to home environment
2. **Actionable Insights:** Unlike fixed factors (e.g., gender), many features are modifiable (study hours, sleep, tutoring)
3. **Policy Implications:** Understanding resource access and socioeconomic impacts can inform educational equity policies
4. **Student Wellbeing:** Lifestyle factors help us understand the balance between academics and health
5. **Predictive Modeling:** The variety of feature types (numeric, categorical, binary) allows us to practice different encoding and modeling techniques

## 3. Data Cleaning and Preprocessing

In this section, we demonstrate our data cleaning process, including:
- Loading and initial inspection
- Missing value detection and analysis
- Imputation strategies
- Outlier detection

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Libraries loaded successfully!")

In [None]:
# Load the dataset
df = pd.read_csv('../data/StudentPerformanceFactors.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
print("\n" + "="*50)
print("First few rows:")
df.head()

In [None]:
# Basic information about the dataset
print("Dataset Information:")
print("="*50)
df.info()

In [None]:
# Statistical summary of numeric features
print("Statistical Summary of Numeric Features:")
print("="*50)
df.describe()

### 3.1 Missing Value Detection

In [None]:
# Check for missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})

# Filter to show only columns with missing values
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

print("Missing Value Analysis:")
print("="*50)
if len(missing_data) > 0:
    print(missing_data.to_string(index=False))
    print(f"\nTotal columns with missing values: {len(missing_data)}")
else:
    print("No missing values detected!")

In [None]:
# Visualize missing values pattern
if df.isnull().sum().sum() > 0:
    plt.figure(figsize=(12, 6))
    sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
    plt.title('Missing Values Heatmap (Yellow = Missing)', fontsize=14, fontweight='bold')
    plt.xlabel('Features')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Bar plot of missing values
    missing_counts = df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
    if len(missing_counts) > 0:
        plt.figure(figsize=(10, 5))
        missing_counts.plot(kind='bar', color='coral')
        plt.title('Missing Values Count by Feature', fontsize=14, fontweight='bold')
        plt.xlabel('Features')
        plt.ylabel('Number of Missing Values')
        plt.xticks(rotation=45, ha='right')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()
else:
    print("No missing values to visualize.")

### 3.2 Missing Value Imputation Strategy

Based on our analysis, we will apply appropriate imputation methods:

**Imputation Principles:**
1. **Categorical Variables:** Use mode (most frequent value) imputation
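2. **Numeric Variables:** Prefer the median for skewed distributions and the mean for approximately normal ones; the code below uses the median throughout because it is robust in both cases
3. **Low Missingness:** If less than 5% of a column is missing and the values are plausibly Missing Completely At Random (MCAR), simple imputation is acceptable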
4. **Domain Knowledge:** Consider creating an "Unknown" category if missingness might be informative
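
The mode, mean-versus-median, and "Unknown" category principles above can be sketched on a toy frame. This is only an illustration: the helper name `choose_numeric_fill`, the 0.5 skewness cutoff, and the toy values are assumptions and are not part of the project code.

```python
import numpy as np
import pandas as pd

def choose_numeric_fill(series: pd.Series, skew_cutoff: float = 0.5) -> float:
    """Hypothetical helper: median for skewed data, mean otherwise."""
    if abs(series.skew()) > skew_cutoff:
        return float(series.median())
    return float(series.mean())

# Toy data with one missing value per column (values are made up)
toy = pd.DataFrame({
    "Sleep_Hours": [6.0, 7.0, 8.0, 7.0, np.nan],        # symmetric
    "Tutoring_Sessions": [0.0, 0.0, 1.0, 8.0, np.nan],  # right-skewed
    "Teacher_Quality": ["High", "Medium", "Medium", None, "Low"],
})

filled = toy.copy()
for col in ["Sleep_Hours", "Tutoring_Sessions"]:
    filled[col] = filled[col].fillna(choose_numeric_fill(toy[col]))

# Mode imputation for the categorical column
mode_value = toy["Teacher_Quality"].mode()[0]
filled["Teacher_Quality"] = toy["Teacher_Quality"].fillna(mode_value)

# Alternative: keep missingness visible as its own category
flagged = toy["Teacher_Quality"].fillna("Unknown")

print(filled)
print(flagged.value_counts())
```

Whether the "Unknown" category is preferable to mode imputation depends on whether missingness itself carries information, which the toy data cannot show.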

In [None]:
# Create a copy for cleaning
df_cleaned = df.copy()

# Identify columns with missing values
columns_with_missing = df_cleaned.columns[df_cleaned.isnull().any()].tolist()

print("Applying Imputation Strategies:")
print("="*50)

if len(columns_with_missing) > 0:
    for col in columns_with_missing:
        # Treat non-numeric columns as categorical (works for object and string dtypes)
        if not pd.api.types.is_numeric_dtype(df_cleaned[col]):  # Categorical
            mode_value = df_cleaned[col].mode()[0]
            # Assign the result back; chained inplace fillna is deprecated in pandas 2.x
            df_cleaned[col] = df_cleaned[col].fillna(mode_value)
            print(f"✓ {col}: Filled {df[col].isnull().sum()} missing values with mode '{mode_value}'")
        else:  # Numeric
            median_value = df_cleaned[col].median()
            df_cleaned[col] = df_cleaned[col].fillna(median_value)
            print(f"✓ {col}: Filled {df[col].isnull().sum()} missing values with median {median_value}")
    
    print(f"\nImputation complete! Remaining missing values: {df_cleaned.isnull().sum().sum()}")
else:
    print("No missing values to impute.")

# Verify no missing values remain
assert df_cleaned.isnull().sum().sum() == 0, "Error: Missing values still present!"
print("\n✓ Verification passed: All missing values have been handled.")

### 3.3 Outlier Detection

We check for outliers in numeric features with box plots and the IQR (interquartile range) rule, which flags values more than 1.5 × IQR below the first quartile or above the third quartile.

In [None]:
# Select numeric columns only
numeric_cols = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()

print(f"Numeric columns for outlier analysis: {numeric_cols}")
print("="*50)

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(numeric_cols):
    if idx < len(axes):
        axes[idx].boxplot(df_cleaned[col].dropna(), vert=True)
        axes[idx].set_title(f'{col}', fontweight='bold')
        axes[idx].set_ylabel('Value')
        axes[idx].grid(True, alpha=0.3)

# Hide extra subplots if any
for idx in range(len(numeric_cols), len(axes)):
    axes[idx].axis('off')

plt.suptitle('Box Plots for Outlier Detection', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

In [None]:
# IQR-based outlier detection
def detect_outliers_iqr(data, column):
    """Detect outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return len(outliers), lower_bound, upper_bound

print("Outlier Detection (IQR Method):")
print("="*70)
print(f"{'Feature':<25} {'Outliers':<12} {'Lower Bound':<15} {'Upper Bound':<15}")
print("-"*70)

for col in numeric_cols:
    n_outliers, lower, upper = detect_outliers_iqr(df_cleaned, col)
    print(f"{col:<25} {n_outliers:<12} {lower:<15.2f} {upper:<15.2f}")

print("\nNote: We retain outliers as they may represent genuine data variation in educational contexts.")

### 3.4 Data Cleaning Summary

**Actions Taken:**
1. ✓ Loaded dataset (6,607 records, 20 variables: 19 features + 1 target)
2. ✓ Identified and visualized missing values
3. ✓ Imputed missing categorical variables using mode
4. ✓ Imputed missing numeric variables using median
5. ✓ Detected outliers using IQR method and box plots
6. ✓ Verified data quality

**Dataset Status:** Ready for exploratory data analysis and modeling!

## 4. Exploratory Data Analysis (EDA)

In this section, we perform exploratory analysis to:
1. Understand the distribution of our target variable (Exam_Score)
2. Identify relationships between features and exam scores
3. Determine the best predictor variables for simple linear regression
4. Visualize patterns in the data

### 4.1 Target Variable Analysis

In [None]:
# Analyze target variable: Exam_Score
print("Exam Score Distribution Analysis:")
print("="*50)
print(f"Mean: {df_cleaned['Exam_Score'].mean():.2f}")
print(f"Median: {df_cleaned['Exam_Score'].median():.2f}")
print(f"Std Dev: {df_cleaned['Exam_Score'].std():.2f}")
print(f"Min: {df_cleaned['Exam_Score'].min():.2f}")
print(f"Max: {df_cleaned['Exam_Score'].max():.2f}")
print(f"Range: {df_cleaned['Exam_Score'].max() - df_cleaned['Exam_Score'].min():.2f}")

# Describe distribution shape (skewness and excess kurtosis; not a formal normality test)
skewness = df_cleaned['Exam_Score'].skew()
kurtosis = df_cleaned['Exam_Score'].kurtosis()
print(f"\nSkewness: {skewness:.3f} {'(approximately symmetric)' if abs(skewness) < 0.5 else '(skewed)'}")
print(f"Kurtosis: {kurtosis:.3f}")
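
Skewness and kurtosis describe the shape of a distribution but do not test normality formally. As a hedged complement, the sketch below runs D'Agostino's K² test (`scipy.stats.normaltest`) on two synthetic samples. The synthetic data and the 0.05 threshold are assumptions for illustration only. With thousands of records, even small departures from normality tend to be flagged, so the result should be read alongside the histogram.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_scores = rng.normal(loc=70, scale=5, size=500)   # synthetic, bell-shaped
skewed_scores = rng.exponential(scale=5, size=500)      # synthetic, right-skewed

for name, sample in [("normal", normal_scores), ("skewed", skewed_scores)]:
    # Null hypothesis: the sample comes from a normal distribution
    statistic, p_value = stats.normaltest(sample)
    verdict = "reject normality" if p_value < 0.05 else "no evidence against normality"
    print(f"{name:<8} K2 = {statistic:8.2f}, p = {p_value:.3g} -> {verdict}")
```

To apply this to the project data, the sample would be `df_cleaned['Exam_Score']` instead of the synthetic arrays.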

In [None]:
# Visualize Exam_Score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram with mean and median reference lines
axes[0].hist(df_cleaned['Exam_Score'], bins=30, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].axvline(df_cleaned['Exam_Score'].mean(), color='red', linestyle='--', linewidth=2, label=f"Mean: {df_cleaned['Exam_Score'].mean():.1f}")
axes[0].axvline(df_cleaned['Exam_Score'].median(), color='green', linestyle='--', linewidth=2, label=f"Median: {df_cleaned['Exam_Score'].median():.1f}")
axes[0].set_xlabel('Exam Score', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Exam Scores', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(df_cleaned['Exam_Score'], vert=True)
axes[1].set_ylabel('Exam Score', fontsize=12)
axes[1].set_title('Exam Score Box Plot', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 4.2 Correlation Analysis

**Why correlation analysis?**
- Identifies which numeric features have the strongest linear relationships with exam scores
- Helps select the best predictor for simple linear regression
- Flags pairs of highly correlated predictors, an early warning sign of multicollinearity
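
The multicollinearity check can be made explicit by listing feature pairs whose absolute correlation exceeds a cutoff. The sketch below is one way to do that. The helper name `high_correlation_pairs`, the 0.8 threshold, and the toy columns `x1`, `x2`, `x3` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(corr: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Hypothetical helper: list feature pairs with |correlation| >= threshold."""
    # Indices of the strict upper triangle, so each pair appears exactly once
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        "feature_a": corr.index[rows],
        "feature_b": corr.columns[cols],
        "correlation": corr.to_numpy()[rows, cols],
    })
    strong = pairs[pairs["correlation"].abs() >= threshold]
    return strong.sort_values(
        "correlation", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)

# Toy data: x2 is built from x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

print(high_correlation_pairs(toy.corr()))
```

On the project data, the same helper could be called on the correlation matrix of the predictors only, with `Exam_Score` dropped, since a high correlation with the target is desirable rather than a problem.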

In [None]:
# Calculate correlation matrix for numeric features
correlation_matrix = df_cleaned[numeric_cols].corr()

# Create correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numeric Features', fontsize=16, fontweight='bold', pad=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Correlation with Exam_Score (sorted)
exam_score_corr = correlation_matrix['Exam_Score'].drop('Exam_Score').sort_values(ascending=False)

print("Correlation with Exam Score (Ranked):")
print("="*50)
for feature, corr in exam_score_corr.items():
    print(f"{feature:<25} {corr:>6.3f}")

# Visualize correlations with Exam_Score
plt.figure(figsize=(10, 6))
exam_score_corr.plot(kind='barh', color=['green' if x > 0 else 'red' for x in exam_score_corr])
plt.title('Feature Correlations with Exam Score', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

### 4.3 Bivariate Analysis: Top Predictors vs Exam Score

Based on correlation analysis, we'll create scatter plots for the top predictors to visualize their relationships with exam scores.

In [None]:
# Select top 6 correlated features
top_features = exam_score_corr.abs().nlargest(6).index.tolist()

print("Top 6 Features Most Correlated with Exam Score:")
print("="*50)
for i, feature in enumerate(top_features, 1):
    print(f"{i}. {feature}: {exam_score_corr[feature]:.3f}")

In [None]:
# Create scatter plots for top features
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    axes[idx].scatter(df_cleaned[feature], df_cleaned['Exam_Score'], alpha=0.5, s=20)
    
    # Add regression line
    z = np.polyfit(df_cleaned[feature], df_cleaned['Exam_Score'], 1)
    p = np.poly1d(z)
    axes[idx].plot(df_cleaned[feature], p(df_cleaned[feature]), "r--", linewidth=2, label='Trend line')
    
    axes[idx].set_xlabel(feature, fontsize=11)
    axes[idx].set_ylabel('Exam Score', fontsize=11)
    axes[idx].set_title(f'{feature} vs Exam Score\n(r = {exam_score_corr[feature]:.3f})', 
                        fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].legend()

plt.suptitle('Top Predictors vs Exam Score', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

### 4.4 Categorical Features Analysis

We'll analyze how categorical variables relate to exam scores using box plots.
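
String categories sort alphabetically by default (High, Low, Medium), which makes ordinal levels harder to read. A minimal sketch, assuming the Low/Medium/High levels listed in Section 2, converts a column to an ordered categorical so that grouped summaries follow the natural order. The toy scores are made up.

```python
import pandas as pd

levels = ["Low", "Medium", "High"]  # assumed ordinal order

toy = pd.DataFrame({
    "Motivation_Level": ["High", "Low", "Medium", "Low", "High", "Medium"],
    "Exam_Score": [72, 61, 66, 63, 70, 68],  # made-up values
})

# Ordered categorical: groupby now sorts Low < Medium < High
toy["Motivation_Level"] = pd.Categorical(
    toy["Motivation_Level"], categories=levels, ordered=True
)

summary = (
    toy.groupby("Motivation_Level", observed=True)["Exam_Score"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

A table of group means like this complements the box plots by putting a number on the differences between levels.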

In [None]:
# Select key categorical features
categorical_features = ['Parental_Involvement', 'Access_to_Resources', 'Motivation_Level', 
                        'Teacher_Quality', 'School_Type', 'Peer_Influence']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for idx, feature in enumerate(categorical_features):
    if feature in df_cleaned.columns:
        df_cleaned.boxplot(column='Exam_Score', by=feature, ax=axes[idx])
        axes[idx].set_xlabel(feature, fontsize=11)
        axes[idx].set_ylabel('Exam Score', fontsize=11)
        axes[idx].set_title(f'Exam Score by {feature}', fontsize=12, fontweight='bold')
        plt.sca(axes[idx])
        plt.xticks(rotation=45, ha='right')

plt.suptitle('Exam Score Distribution by Categorical Features', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

### 4.5 Preparing for Simple Linear Regression

**Goal:** Identify the single best predictor for simple linear regression.

**Selection Criteria:**
1. Highest absolute correlation with Exam_Score
2. Clear linear relationship visible in scatter plot
3. No extreme outliers that would distort the regression line

In [None]:
# Identify the best single predictor
best_predictor = exam_score_corr.abs().idxmax()
best_correlation = exam_score_corr[best_predictor]

print("Best Single Predictor for Simple Linear Regression:")
print("="*50)
print(f"Feature: {best_predictor}")
print(f"Correlation: {best_correlation:.4f}")
print(f"R-squared (r²): {best_correlation**2:.4f}")
print(f"\nInterpretation: {best_predictor} explains approximately {(best_correlation**2)*100:.1f}% ")
print(f"of the variance in Exam Scores.")
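
The cell above reports the squared correlation as R². That is valid for a least-squares line with a single predictor and an intercept, where R² computed from residuals equals r². The sketch below checks the identity numerically on synthetic data; the data-generating values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 40, size=300)                    # synthetic predictor
y = 60 + 0.3 * x + rng.normal(scale=3, size=300)    # synthetic response

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# Least-squares line and R² from its residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"r^2 = {r**2:.6f}, R^2 from residuals = {r_squared:.6f}")
```

The identity does not carry over to multiple regression, where R² has to be computed from the fitted model.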

In [None]:
# Detailed visualization of best predictor
from scipy.stats import pearsonr

# Calculate statistics
corr_coef, p_value = pearsonr(df_cleaned[best_predictor], df_cleaned['Exam_Score'])

# Create detailed plot
plt.figure(figsize=(12, 7))
plt.scatter(df_cleaned[best_predictor], df_cleaned['Exam_Score'], alpha=0.5, s=30, label='Data points')

# Fit regression line
z = np.polyfit(df_cleaned[best_predictor], df_cleaned['Exam_Score'], 1)
p = np.poly1d(z)
plt.plot(df_cleaned[best_predictor], p(df_cleaned[best_predictor]), "r-", linewidth=3, 
         label=f'Regression line: y = {z[0]:.3f}x + {z[1]:.3f}')

plt.xlabel(best_predictor, fontsize=13)
plt.ylabel('Exam Score', fontsize=13)
plt.title(f'Simple Linear Regression: {best_predictor} vs Exam Score\n' +
          f'Correlation: r = {corr_coef:.4f}, R² = {corr_coef**2:.4f}, p-value = {p_value:.2e}',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nStatistical Significance:")
print(f"Pearson correlation coefficient: {corr_coef:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Result: {'Statistically significant' if p_value < 0.05 else 'Not significant'} at α = 0.05")

### 4.6 Feature Importance Summary for Linear Regression

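Based on our EDA, we can rank the numeric features by the strength of their linear relationship with `Exam_Score`. Categorical features are not part of this ranking because they have not been encoded yet: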

In [None]:
# Create comprehensive feature ranking
feature_ranking = pd.DataFrame({
    'Feature': exam_score_corr.index,
    'Correlation': exam_score_corr.values,
    'Abs_Correlation': exam_score_corr.abs().values,
    'R_Squared': (exam_score_corr.values ** 2)
}).sort_values('Abs_Correlation', ascending=False)

print("Feature Ranking for Predictive Modeling:")
print("="*70)
print(f"{'Rank':<6} {'Feature':<25} {'Correlation':<15} {'R²':<10}")
print("-"*70)
for idx, row in enumerate(feature_ranking.itertuples(), 1):
    print(f"{idx:<6} {row.Feature:<25} {row.Correlation:<15.4f} {row.R_Squared:<10.4f}")

print("\n" + "="*70)
print("Recommendations for Simple Linear Regression:")
print(f"1. Primary predictor: {feature_ranking.iloc[0]['Feature']} (r = {feature_ranking.iloc[0]['Correlation']:.3f})")
print(f"2. Alternative predictor: {feature_ranking.iloc[1]['Feature']} (r = {feature_ranking.iloc[1]['Correlation']:.3f})")
print(f"3. Third option: {feature_ranking.iloc[2]['Feature']} (r = {feature_ranking.iloc[2]['Correlation']:.3f})")

## 5. Summary and Next Steps

### What We Accomplished:

1. **Dataset Selection** ✓
   - Selected Student Performance Factors dataset (6,607 students, 20 variables)
   - Identified research goal: predict exam scores from diverse factors

2. **Feature Understanding** ✓
   - Categorized 19 features into 5 domains (academic, family, resources, lifestyle, environmental)
   - Explained rationale for studying each feature category

3. **Data Cleaning** ✓
   - Detected and visualized missing values
   - Applied appropriate imputation (mode for categorical, median for numeric)
   - Identified outliers using IQR method and box plots
   - Verified data quality (zero missing values after cleaning)

4. **Exploratory Data Analysis** ✓
   - Analyzed target variable distribution (Exam_Score)
   - Computed correlation matrix for all numeric features
   - Identified top predictors through correlation analysis
   - Visualized relationships using scatter plots and box plots
   - Selected best predictor for simple linear regression

### Key Findings:

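- **Best predictor identified** for simple linear regression: the numeric feature with the highest absolute correlation with `Exam_Score` (Section 4.5)
- **Strong correlations found** between academic factors and exam scores (ranked in Section 4.2)
- **Data quality confirmed:** no missing values remain after imputation; outliers were flagged but retained
- **Relationships visualized** between features and the target variable using scatter plots and box plots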

### Next Steps:

1. Build simple linear regression model using top predictor
2. Evaluate model performance (MSE, RMSE, R²)
3. Develop multiple linear regression with top features
4. Encode categorical variables for advanced modeling
5. Compare different regression algorithms (Ridge, Lasso, Random Forest, XGBoost)
6. Perform feature engineering to create interaction terms
7. Conduct cross-validation for robust model evaluation
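
As a preview of the first two steps, the sketch below fits a one-predictor line on a training split and reports MSE, RMSE, and R² on held-out data. It uses synthetic data and plain NumPy; the 80/20 split, the seed, and the data-generating values are assumptions, and the actual project may use a library such as scikit-learn instead.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = rng.uniform(60, 100, size=n)                  # synthetic predictor
y = 50 + 0.2 * x + rng.normal(scale=2, size=n)    # synthetic target

# Shuffle once, then hold out 20% of the rows for testing
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]

# Fit on the training rows only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)

# Evaluate on the held-out rows
y_pred = slope * x[test_idx] + intercept
errors = y[test_idx] - y_pred
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum(errors ** 2) / np.sum((y[test_idx] - y[test_idx].mean()) ** 2)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"Test MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

Evaluating on held-out rows rather than the training rows gives a less optimistic estimate of how the model would perform on new students.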

---

**Dataset is ready for machine learning modeling!**