##  Data Mining Project (2025-2026)
*Fatih Arslan B211202023, Yusuf İnan B211202070*


##  Problem Definition
In this project, we analyze students' performance based on their exam scores and background information.

##  Dataset Information

The dataset, [Students Performance in Exams](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams), named **"StudentsPerformance.csv"**, includes data from **1,000 students**.  
It contains various features such as:

- **Gender**  
- **Race/Ethnicity**  
- **Parental Level of Education**  
- **Lunch Type**  
- **Test Preparation Course**  
- **Math Score**  
- **Reading Score**  
- **Writing Score**

These features are used to analyze and predict the overall student performance levels.


##  Selected Method
We use a Random Forest classifier to predict each student's performance level (low, medium, or high).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


### 1. Importing Required Libraries

In this step, I import the necessary Python libraries for data analysis and machine learning.

- **pandas** and **numpy** are used for data handling and calculations.  
- **matplotlib** and **seaborn** are used for data visualization.  
- **sklearn.model_selection** is used for splitting the dataset into training and testing sets.  
- **sklearn.preprocessing** includes tools for encoding and scaling the data.  
- **RandomForestClassifier** is used to build the classification model.  
- **classification_report** and **confusion_matrix** are used to evaluate the model performance.


In [None]:
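df = pd.read_csv("StudentsPerformance.csv")

# Only the last expression in a cell is displayed automatically,
# so print each summary explicitly.
print(df.head())
df.info()  # prints its own report
print(df.describe())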



### 2. Loading and Exploring the Dataset

In this step, I load the **StudentsPerformance.csv** dataset using `pandas`.  
- `df.head()` shows the first few rows of the dataset.  
- `df.info()` gives general information such as column names, data types, and number of non-null values.  
- `df.describe()` provides basic statistical details like mean, min, max, and standard deviation for numeric columns.  

From the output, we can see that the dataset has **1000 rows** and **8 columns**, with no missing values.  
There are 5 categorical columns and 3 numerical columns (math, reading, and writing scores).
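The split into categorical and numerical columns can also be checked programmatically. The sketch below uses a two-row toy frame with the same column names as the dataset; the rows are made-up placeholders, not real records, and `toy`, `numeric_cols`, and `categorical_cols` are names introduced only for this illustration.

```python
import pandas as pd

# Toy frame with the same column names as StudentsPerformance.csv.
# The two rows are made-up placeholders, not real records.
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group A", "group B"],
    "parental level of education": ["high school", "some college"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "completed"],
    "math score": [70, 65],
    "reading score": [75, 60],
    "writing score": [72, 58],
})

# select_dtypes separates numeric columns from everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{len(categorical_cols)} categorical columns: {categorical_cols}")
print(f"{len(numeric_cols)} numerical columns: {numeric_cols}")
```

In the notebook, the same two `select_dtypes` calls could be applied to `df` directly.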


In [None]:
df.isnull().sum()


### 3. Checking for Missing Values

In this step, I use `df.isnull().sum()` to check if there are any missing values in the dataset.  
The result shows that all columns have **0 missing values**, which means the dataset is complete and does not require data cleaning for null values.

In [None]:
# Calculate average score from math, reading, and writing
df["average_score"] = df[["math score", "reading score", "writing score"]].mean(axis=1)

# Categorize students by performance level
def categorize(score):
    if score >= 85:
        return "high"
    elif score >= 70:
        return "medium"
    else:
        return "low"

df["performance_level"] = df["average_score"].apply(categorize)

# Group by performance level and calculate summary stats
category_table = df.groupby("performance_level").agg(
    student_count=("performance_level", "count"),
    average_score=("average_score", "mean")
).sort_values(by="average_score", ascending=False)

# Format values and add percentage
category_table["average_score"] = category_table["average_score"].round(2)
category_table = category_table.reset_index()
category_table["percentage"] = (category_table["student_count"] / len(df) * 100).round(1).astype(str) + "%"

# Display summary table
print("\n=== Student Performance Analysis ===\n")
print(category_table.to_string(index=False))
print(f"\nTotal Students: {len(df)}")

# Plot bar and pie charts
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
colors = ['#2ecc71', '#f39c12', '#e74c3c']

# Bar chart: Number of students by performance level
axes[0].bar(category_table["performance_level"], category_table["student_count"], color=colors)
axes[0].set_title('Number of Students by Performance Level', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Performance Level', fontsize=11)
axes[0].set_ylabel('Number of Students', fontsize=11)
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(category_table["student_count"]):
    axes[0].text(i, v + 5, str(v), ha='center', fontweight='bold')

# Pie chart: Student distribution by percentage
axes[1].pie(
    category_table["student_count"],
    labels=category_table["performance_level"],
    autopct='%1.1f%%',
    colors=colors,
    startangle=90,
    textprops={'fontsize': 11, 'fontweight': 'bold'}
)
axes[1].set_title('Student Distribution by Percentage', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()


### 4. Data Analysis and Visualization

This code calculates each student's average score across the three exams, assigns a performance level (high: average of 85 or above; medium: 70 up to 85; low: below 70), and visualizes the distribution with bar and pie charts.
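As a possible alternative to applying `categorize` row by row, the same thresholds can be expressed with `pd.cut`. This is only a sketch: the scores in `avg` are hypothetical values chosen to sit on and around the boundaries, and `levels` is a name introduced here.

```python
import pandas as pd

# Hypothetical average scores chosen to sit on and around the thresholds.
avg = pd.Series([69.9, 70.0, 84.9, 85.0, 100.0, 0.0])

# Same rule as in the notebook: >= 85 high, >= 70 medium, otherwise low.
def categorize(score):
    if score >= 85:
        return "high"
    elif score >= 70:
        return "medium"
    else:
        return "low"

# right=False makes each bin closed on the left:
# [-inf, 70) -> low, [70, 85) -> medium, [85, inf) -> high.
levels = pd.cut(
    avg,
    bins=[-float("inf"), 70, 85, float("inf")],
    labels=["low", "medium", "high"],
    right=False,
)

# Both approaches should agree on every value, including the boundaries.
print(levels.astype(str).tolist())
print(avg.apply(categorize).tolist())
```

The result of `pd.cut` is an ordered categorical, so sorting and grouping follow the low, medium, high order rather than alphabetical order.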


In [None]:
# Define the education level order
education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree"
]

# Convert column to ordered categorical type
df["parental level of education"] = pd.Categorical(
    df["parental level of education"],
    categories=education_order,
    ordered=True
)

# Set figure and style
plt.figure(figsize=(14, 6))
sns.set_style("whitegrid")

# Create boxplot
ax = sns.boxplot(
    x="parental level of education",
    y="average_score",
    hue="parental level of education",
    data=df.dropna(subset=["parental level of education"]),
    palette="viridis",
    linewidth=2.5,
    fliersize=5,
    legend=False
)

# Calculate medians (observed=True prevents future warning)
medians = df.groupby("parental level of education", observed=True)["average_score"].median()

# Display median values on the plot
for i, edu_level in enumerate(education_order):
    if edu_level in medians.index:
        median_val = medians[edu_level]
        ax.text(i, median_val, f'{median_val:.1f}',
                ha='center', va='bottom', fontweight='bold',
                color='red', fontsize=10)

# Customize labels and title
plt.xticks(rotation=45, ha='right', fontsize=11)
plt.xlabel("Parental Education Level", fontsize=13, fontweight='bold')
plt.ylabel("Average Score", fontsize=13, fontweight='bold')
plt.title("Student Performance by Parental Education Level",
          fontsize=15, fontweight='bold', pad=20)

# Set limits and grid
plt.ylim(0, 105)
plt.grid(axis='y', alpha=0.3, linestyle='--')

# Adjust layout and show
plt.tight_layout()
plt.show()

# Display summary statistics
print("\n=== Statistics by Parental Education Level ===\n")
summary = df.groupby("parental level of education", observed=True)["average_score"].agg([
    ('Student Count', 'count'),
    ('Mean', 'mean'),
    ('Median', 'median'),
    ('Std. Dev', 'std'),
    ('Min', 'min'),
    ('Max', 'max')
]).round(2)

print(summary)


### 5. Relationship Between Parental Education and Student Performance

This section analyzes how parents' education levels relate to students' average exam scores, using boxplots and summary statistics.

As the level of parental education increases, average student scores tend to rise, which indicates a positive association between parental education and student performance. Students whose parents have higher education levels generally score higher in this dataset. Because the data is observational, this trend shows an association and does not by itself establish that parental education causes the difference.
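One way to put a number on this trend is a Spearman rank correlation between the ordered education level and the average score. The helper below is a sketch, not part of the original analysis: `education_score_rank_corr` is a hypothetical function name, and the `toy` frame is made-up data used only to show the call.

```python
import pandas as pd

education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

def education_score_rank_corr(frame,
                              edu_col="parental level of education",
                              score_col="average_score"):
    """Spearman rank correlation between education level and score."""
    # Map each level to its position in education_order (0 = lowest).
    codes = pd.Categorical(
        frame[edu_col], categories=education_order, ordered=True
    ).codes
    # Unknown levels get code -1; turn them into NaN so they are ignored.
    edu = pd.Series(codes, index=frame.index).where(lambda s: s >= 0)
    # Pearson correlation of the ranks equals the Spearman correlation.
    return edu.rank().corr(frame[score_col].rank())

# Made-up, perfectly monotone example: the correlation is 1.
toy = pd.DataFrame({
    "parental level of education": education_order,
    "average_score": [50, 55, 60, 65, 70, 75],
})
rho = education_score_rank_corr(toy)
print(f"Spearman rho on toy data: {rho:.2f}")
```

In the notebook, the call would be `education_score_rank_corr(df)`, made before the label-encoding step replaces the education strings with integers.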


In [None]:
label_encoders = {}
for col in ["gender", "race/ethnicity", "parental level of education", "lunch", "test preparation course"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le


### 6. Label Encoding

This step converts categorical text data (e.g., gender, lunch type) into numerical values so the model can process them.
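Note that `LabelEncoder` assigns integers in sorted (alphabetical) order of the class names, so the codes for parental education do not follow the education order defined in section 5. The sketch below contrasts the two mappings; `label_codes` and `ordinal_codes` are names introduced only for this illustration. Tree-based models can often cope with arbitrary integer codes, so this may matter little for the Random Forest used here.

```python
from sklearn.preprocessing import LabelEncoder

education_order = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

# LabelEncoder sorts the class names, so codes follow alphabetical order.
le = LabelEncoder()
le.fit(education_order)
label_codes = {name: int(code)
               for name, code in zip(le.classes_, le.transform(le.classes_))}

# Ordinal alternative: the code is the position in education_order.
ordinal_codes = {level: pos for pos, level in enumerate(education_order)}

for level in education_order:
    print(f"{level:20s} LabelEncoder={label_codes[level]}  "
          f"ordinal={ordinal_codes[level]}")
```

If the ordered codes are preferred, an ordered categorical column exposes them through `df[col].cat.codes`.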


In [None]:
X = df[["gender", "race/ethnicity", "parental level of education", "lunch", "test preparation course", "math score", "reading score", "writing score"]]
y = df["performance_level"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


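### 7. Selecting Features and Splitting the Data

In this code block, we select the five encoded background columns and the three exam scores as features (`X`) and use `performance_level` as the target (`y`). We then split the data into a training set (80%) and a test set (20%). `stratify=y` keeps the proportion of each performance level the same in both sets, and `random_state=42` makes the split reproducible.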
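The split in the code above passes `stratify=y`. The sketch below illustrates what that argument does, using made-up labels with a 60/30/10 class mix; `X_demo`, `y_demo`, and `class_counts` are names introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60 low, 30 medium, 10 high.
y_demo = np.array(["low"] * 60 + ["medium"] * 30 + ["high"] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_counts(labels):
    """Return {class name: number of samples}."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): int(c) for v, c in zip(values, counts)}

# With stratification, both parts keep the 60/30/10 class mix.
print("train:", class_counts(y_tr))
print("test: ", class_counts(y_te))
```

Without `stratify`, a small class such as "high" could end up over- or under-represented in the test set purely by chance.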

In [None]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


### 8. Training a Random Forest Classifier and Making Predictions

In this code block, we train a **Random Forest Classifier** on the training data and make predictions on the test set.

In [None]:
# Professional Confusion Matrix
plt.figure(figsize=(10, 8))

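cm = confusion_matrix(y_test, y_pred)
class_names = np.unique(y_test)

# Print per-class precision, recall, and F1 (the metrics discussed below)
print(classification_report(y_test, y_pred))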

# Combine count and percentage
labels = [[f'{count}\n({count/cm[i].sum()*100:.1f}%)'
           for j, count in enumerate(row)] for i, row in enumerate(cm)]

# Heatmap
sns.heatmap(cm, annot=labels, fmt='', cmap='coolwarm',
            xticklabels=class_names, yticklabels=class_names,
            cbar_kws={'label': 'Count'}, linewidths=3, linecolor='white',
            square=True, annot_kws={"size": 13, "weight": "bold"})

plt.title('Confusion Matrix - Model Performance', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Predicted Class', fontsize=13, fontweight='bold', labelpad=10)
plt.ylabel('Actual Class', fontsize=13, fontweight='bold', labelpad=10)

# Highlight diagonal (correct predictions)
for i in range(len(class_names)):
    plt.gca().add_patch(plt.Rectangle((i, i), 1, 1, fill=False,
                                       edgecolor='lime', lw=4))

plt.tight_layout()
plt.show()

### 9. Evaluating the Model Performance

In this code block, we evaluate the Random Forest model using **classification metrics** and visualize the **confusion matrix**.

- **Accuracy = 0.96**: the model correctly predicted 96% of the test samples.
- **Precision**: how often the model is correct when it predicts a class. For “High”, precision = 1.00, so there were no false “High” predictions.
- **Recall**: how many actual instances of a class were correctly identified. For “High”, recall = 0.87, so 87% of the true “High” students were detected.
- **F1-score**: the balance between precision and recall; closer to 1 is better.

**Interpretation:**

- The model identifies every “Low” performer correctly (108/108).
- It correctly predicts 20 of 23 “High” students, misclassifying 3 as “Medium”.
- It correctly predicts 65 of 69 “Medium” students, misclassifying 4 as “Low”.

Overall, the model is strongest at detecting “Low” performers and shows minor confusion between “High” and “Medium”, and between “Medium” and “Low”.
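The metrics above can be recomputed by hand from the confusion matrix counts. The sketch below rebuilds the matrix from the counts quoted in the interpretation, assuming the misclassifications listed there are the only ones; the variable names are introduced for this example.

```python
import numpy as np

# Counts quoted in the interpretation above. Rows are actual classes and
# columns are predicted classes, both ordered high, low, medium.
# Assumption: the errors listed in the text are the only ones.
class_names = ["high", "low", "medium"]
cm = np.array([
    [20,   0,  3],   # actual high: 20 correct, 3 predicted medium
    [ 0, 108,  0],   # actual low: all 108 correct
    [ 0,   4, 65],   # actual medium: 65 correct, 4 predicted low
])

accuracy = np.trace(cm) / cm.sum()          # correct / all samples
precision = np.diag(cm) / cm.sum(axis=0)    # correct / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)       # correct / actually in class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.3f}")
for name, p, r, f in zip(class_names, precision, recall, f1):
    print(f"{name:7s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under that assumption, 193 of the 200 test samples are correct, the recall for "high" is 20/23, and the precision for "high" is 20/20.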


In [None]:
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Plotting feature importance
plt.figure(figsize=(12, 8))
colors = plt.cm.RdYlGn(importances / importances.max())
colors = colors.tolist()

ax = sns.barplot(
    x=importances,
    y=importances.index,
    hue=importances.index,
    palette=colors,
    legend=False,
    edgecolor='black',
    linewidth=1.5
)

# Adding value labels
for i, (value, name) in enumerate(zip(importances, importances.index)):
    percentage = (value / importances.sum()) * 100
    ax.text(value + 0.005, i, f'{value:.4f} ({percentage:.1f}%)',
            va='center', fontweight='bold', fontsize=10)

plt.xlabel('Importance Score', fontsize=13, fontweight='bold')
plt.ylabel('Features', fontsize=13, fontweight='bold')
plt.title('Feature Importance of the Model', fontsize=15, fontweight='bold', pad=20)
plt.grid(axis='x', alpha=0.3, linestyle='--')
plt.xlim(0, importances.max() * 1.15)
plt.tight_layout()
plt.show()

# Detailed Feature Importance Table
print("\n" + "="*80)
print("                    DETAILED FEATURE IMPORTANCE REPORT")
print("="*80)

importance_df = pd.DataFrame({
    'Feature': importances.index,
    'Importance Score': importances.values,
    'Percentage (%)': (importances.values / importances.sum() * 100).round(2),
    'Cumulative %': (importances.values / importances.sum() * 100).cumsum().round(2),
    'Rank': range(1, len(importances) + 1)
})

print(importance_df.to_string(index=False))
print("="*80)

# Summary Statistics
print(f"\n SUMMARY STATISTICS:")
print(f"   • Total Number of Features: {len(importances)}")
print(f"   • Most Important Feature: {importances.index[0]} ({importances.values[0]:.4f})")
print(f"   • Least Important Feature: {importances.index[-1]} ({importances.values[-1]:.4f})")
print(f"   • Average Importance Score: {importances.mean():.4f}")
print(f"   • Top 3 Features Total Contribution: {(importances.values[:3].sum() / importances.sum() * 100):.1f}%")
print("="*80 + "\n")

### 10. Visualizing and Reporting Feature Importance

Based on the results, the most influential features are reading score, writing score, and math score, which together account for 92% of the model's total importance. This is expected: `performance_level` is computed directly from the average of these three scores, so the model is largely recovering the thresholding rule from its own inputs. The remaining features (race/ethnicity, parental level of education, gender, test preparation course, and lunch) have relatively low importance, indicating they contribute much less to the predictions. In short, students' exam scores are the primary factors driving the model's decisions.
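To see how much the background features contribute on their own, one option is to compare cross-validated accuracy with and without the score columns. The helper below is a sketch, not part of the original notebook: `compare_feature_sets` is a hypothetical function, and `n_estimators=100` and `cv=5` are illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_feature_sets(X, y, score_cols, cv=5, random_state=42):
    """Mean cross-validated accuracy with all features and without the
    score columns. X is expected to be a pandas DataFrame."""
    feature_sets = {
        "all features": list(X.columns),
        "background only": [c for c in X.columns if c not in score_cols],
    }
    results = {}
    for name, cols in feature_sets.items():
        model = RandomForestClassifier(n_estimators=100,
                                       random_state=random_state)
        scores = cross_val_score(model, X[cols], y, cv=cv,
                                 scoring="accuracy")
        results[name] = float(scores.mean())
    return results
```

In the notebook, this could be called as `compare_feature_sets(X, y, ["math score", "reading score", "writing score"])`. A large gap between the two accuracies would confirm that the predictions rest almost entirely on the exam scores.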