# Lab 4 Assignment Solutions
**Dataset: Student Performance (student-mat.csv)**

This notebook completes all five assignment tasks using the Student Performance dataset.
The dataset contains student grades, demographics, and social/school attributes.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

sns.set_theme(style='whitegrid')

---
## Task 1: Identify Data Quality Issues
We examine the dataset for data type mismatches, missing values, duplicates, and other issues.

In [None]:
pd.set_option('display.max_columns', None)

df = pd.read_csv('../student-mat.csv', sep=';')
df.head(10)

In [None]:
print('Dataset Shape:', df.shape)

In [None]:
# Check data types of all columns
df.dtypes

In [None]:
# Check for missing values
df.isna().sum()

In [None]:
# Check for duplicate rows
print('Number of duplicate rows:', df.duplicated().sum())

In [None]:
# Statistical summary of numerical columns
df.describe()

### Task 1 Findings

After inspecting the dataset, the following data quality issues were identified:

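1. **Data Type Mismatch**: The columns `G1` and `G2` (first and second period grades) are quoted in the raw CSV file, unlike `G3`. If the `df.dtypes` output above lists them as **object (string)**, they must be converted to numeric before any numerical analysis. With default settings, `pd.read_csv` normally strips the quotes and infers integers, in which case the conversion in Task 2 is only a safeguard.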

2. **No Missing Values**: All columns report zero missing values, so no imputation is needed on the original data. However, we will introduce artificial missing values in Task 2 for demonstration purposes.

3. **No Duplicate Rows**: The dataset contains no exact duplicate records.

4. **Potential Outliers**: The `absences` column has a minimum of 0 and a maximum that may be significantly higher than the 75th percentile, suggesting the presence of outliers. We will investigate this in Task 3.

5. **Binary Categorical Columns**: Columns such as `schoolsup`, `famsup`, `paid`, `activities`, `internet`, `romantic` are stored as `yes/no` strings. These would need to be encoded (e.g., 0/1) before use in a machine learning model, though this is beyond the scope of this assignment.
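Although encoding is outside the scope of this assignment, the 0/1 mapping mentioned in item 5 could be sketched roughly as follows. The three-row `demo` frame is made up for illustration and stands in for the real `df`; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real dataset
demo = pd.DataFrame({
    'schoolsup': ['yes', 'no', 'no'],
    'internet': ['no', 'yes', 'yes'],
})

binary_cols = ['schoolsup', 'internet']

# Map each yes/no string to 1/0; any unexpected label would become NaN
encoded = demo[binary_cols].apply(lambda col: col.map({'yes': 1, 'no': 0}))

print(encoded)
```

Mapping through an explicit dictionary keeps the meaning of 1 and 0 visible, and makes unexpected labels surface as missing values rather than being silently assigned a code.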

---
## Task 2: Apply One Missing Value Strategy

Since the dataset has no missing values, we proceed in three steps:
1. Fix the data type issue in `G1` and `G2`.
2. Introduce artificial missing values in `G3` (final grade) for demonstration.
3. Apply **median imputation** and explain the choice.
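One caveat for step 1: `pd.to_numeric(..., errors='coerce')` silently turns any unparseable entry into `NaN`, so it can introduce missing values of its own. A minimal sketch of how to count them, using a made-up `raw` series rather than the real `G1`/`G2` columns:

```python
import pandas as pd

# Hypothetical grades stored as strings, with one unparseable entry
raw = pd.Series(['10', '7', 'n/a', '15'])

converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present before conversion but are NaN afterwards
n_coerced = int(converted.isna().sum() - raw.isna().sum())

print(converted.dtype)
print('Values coerced to NaN:', n_coerced)
```

If the count is zero on the real columns, the conversion changed only the data type and not the content.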

In [None]:
# Fix data type issue: convert G1 and G2 to numeric (errors='coerce' turns unparseable values into NaN)
df['G1'] = pd.to_numeric(df['G1'], errors='coerce')
df['G2'] = pd.to_numeric(df['G2'], errors='coerce')

print('Updated dtypes for G1 and G2:')
print(df[['G1', 'G2', 'G3']].dtypes)

In [None]:
# Introduce artificial missing values in G3 for demonstration
df_missing = df.copy()
np.random.seed(42)
missing_idx = np.random.choice(df_missing.index, size=20, replace=False)
df_missing.loc[missing_idx, 'G3'] = np.nan

print('Missing values after introduction:')
df_missing.isna().sum()

In [None]:
# Check the distribution of the observed G3 values to choose the imputation strategy
plt.figure(figsize=(6, 4))
sns.histplot(df_missing['G3'], bins=20, kde=True)
plt.title('Distribution of G3 (Final Grade)')
plt.xlabel('G3')
plt.show()

print('G3 Skewness:', df_missing['G3'].skew())

In [None]:
# Apply Median Imputation
df_imputed = df_missing.copy()
median_g3 = df_imputed['G3'].median()
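# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df_imputed['G3'] = df_imputed['G3'].fillna(median_g3)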

print(f'Median value used for imputation: {median_g3}')
print('\nMissing values after median imputation:')
print(df_imputed.isna().sum())

### Why Median Imputation?

We chose **median imputation** for the `G3` (final grade) column for the following reasons:

1. **Robustness to outliers**: Some students score 0 on `G3` (often due to withdrawal or special cases), which pulls the mean downward. The median is far less affected by these extreme low values.

2. **Skewed distribution**: The zeros are expected to give `G3` a left (negative) skew, which a negative skewness value in the output above would confirm. For a skewed distribution, the median is a more representative measure of the typical student's grade than the mean.

3. **Preserves dataset size**: Unlike row deletion, imputation keeps all 395 records, which is important given the relatively small dataset size.
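The robustness argument in point 1 can be illustrated with a small worked example. The `grades` array below is invented for the illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical grades: most students cluster around 10-14, two scored 0
grades = np.array([0, 0, 10, 11, 12, 12, 13, 14])

mean_all = grades.mean()
median_all = np.median(grades)

# Same statistics with the zero scores excluded
nonzero = grades[grades > 0]
mean_nonzero = nonzero.mean()
median_nonzero = np.median(nonzero)

print(f'Mean:   {mean_all:.2f} with zeros, {mean_nonzero:.2f} without')
print(f'Median: {median_all:.2f} with zeros, {median_nonzero:.2f} without')
```

In this toy sample the two zeros move the mean by 3 grade points (9.0 versus 12.0) but the median by only 0.5 (11.5 versus 12.0).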

---
## Task 3: Detect and Handle Outliers Using IQR

We focus on the `absences` column, which is most likely to have extreme values.

In [None]:
# Visualize distributions of key numeric columns using boxplots
numeric_cols = ['age', 'absences', 'G1', 'G2', 'G3']

fig, axes = plt.subplots(1, len(numeric_cols), figsize=(16, 4))
for ax, col in zip(axes, numeric_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.suptitle('Boxplots of Key Numerical Features', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Detect outliers in 'absences' using the IQR method
Q1 = df['absences'].quantile(0.25)
Q3 = df['absences'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print(f'Q1: {Q1}')
print(f'Q3: {Q3}')
print(f'IQR: {IQR}')
print(f'Lower Bound: {lower}')
print(f'Upper Bound: {upper}')

outliers = df[(df['absences'] < lower) | (df['absences'] > upper)]
print(f'\nNumber of outliers detected: {len(outliers)}')

In [None]:
# View outlier records
outliers[['age', 'absences', 'G1', 'G2', 'G3']].head(15)

In [None]:
# Strategy 1: Remove outliers
df_no_outliers = df[(df['absences'] >= lower) & (df['absences'] <= upper)]
print('Original shape:', df.shape)
print('After removing outliers:', df_no_outliers.shape)

In [None]:
# Strategy 2: Cap outliers using percentile method
lower_cap = df['absences'].quantile(0.05)
upper_cap = df['absences'].quantile(0.95)

df_capped = df.copy()
df_capped['absences'] = df_capped['absences'].clip(lower_cap, upper_cap)

print('Before capping (absences):')
print(df['absences'].describe())
print('\nAfter capping (absences):')
print(df_capped['absences'].describe())

### Outlier Handling Summary

- The `absences` column has a right-skewed distribution with several students having very high absence counts.
- **IQR** flags values outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`; for `absences`, which cannot be negative, the flagged values lie above the upper fence.
- **Removal**: reduces the dataset size but eliminates distortion from extreme values.
- **Capping**: keeps all records but limits the influence of extreme values by replacing them with the 5th/95th percentile boundaries.

> Capping is preferred here because high absences may be real and meaningful for predicting `G3`, so we do not want to lose those rows.
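As an alternative to percentile capping, values can also be capped at the IQR fences themselves, which ties the handling step directly to the detection rule. The sketch below assumes a hypothetical helper `iqr_fences` and a made-up `absences` series; neither is part of the notebook.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical absence counts with one extreme value
absences = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 75])

low, high = iqr_fences(absences)

# Values beyond a fence are replaced by that fence; all rows are kept
capped = absences.clip(lower=low, upper=high)

print(f'Fences: [{low}, {high}]')
print('Max before:', absences.max(), '| max after:', capped.max())
```

For this sample the fences are -7 and 17, so only the value 75 is changed. Unlike the 5th/95th percentile approach, this leaves every value inside the fences untouched.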

---
## Task 4: Normalize Numerical Features

We apply both **Min-Max normalization** and **Z-score standardization** to the key numerical features.
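Both transforms are simple linear rescalings: Min-Max computes `(x - min) / (max - min)`, and Z-score computes `(x - mean) / std` with the population standard deviation. The sketch below checks these formulas against scikit-learn on a small made-up array `x`, not on the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature, shaped (n_samples, 1) as scikit-learn expects
x = np.array([[15.0], [16.0], [18.0], [22.0]])

# Min-Max: (x - min) / (max - min)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std, using the population std (ddof=0)
manual_z = (x - x.mean()) / x.std()

sk_minmax = MinMaxScaler().fit_transform(x)
sk_z = StandardScaler().fit_transform(x)

print('Min-Max matches:', np.allclose(manual_minmax, sk_minmax))
print('Z-score matches:', np.allclose(manual_z, sk_z))
```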

In [None]:
# Select numerical features for normalization
numeric_features = ['age', 'absences', 'G1', 'G2', 'G3',
                    'studytime', 'failures', 'famrel',
                    'freetime', 'goout', 'Dalc', 'Walc', 'health']

# View raw values before normalization
df[numeric_features].head()

In [None]:
# Min-Max Normalization (scales to [0, 1])
scaler_minmax = MinMaxScaler()
df_minmax = df[numeric_features].copy()
df_minmax[numeric_features] = scaler_minmax.fit_transform(df_minmax)

print('Min-Max Normalized (first 5 rows):')
df_minmax.head()

In [None]:
# Verify: all values should be between 0 and 1
print('Min-Max range after normalization:')
print(df_minmax.describe().loc[['min', 'max']])

In [None]:
# Z-Score Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df_standardized = df[numeric_features].copy()
df_standardized[numeric_features] = scaler_std.fit_transform(df_standardized)

print('Z-Score Standardized (first 5 rows):')
df_standardized.head()

In [None]:
# Verify: mean approximately 0, std approximately 1
# (describe() reports the sample std with ddof=1, so it comes out slightly above 1)
print('Z-Score statistics after standardization:')
print(df_standardized.describe().loc[['mean', 'std']].round(4))

In [None]:
# Visual comparison: original vs Min-Max vs Z-Score for G3
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

sns.histplot(df['G3'], bins=20, kde=True, ax=axes[0])
axes[0].set_title('Original G3')

sns.histplot(df_minmax['G3'], bins=20, kde=True, ax=axes[1], color='orange')
axes[1].set_title('Min-Max Normalized G3')

sns.histplot(df_standardized['G3'], bins=20, kde=True, ax=axes[2], color='green')
axes[2].set_title('Z-Score Standardized G3')

plt.tight_layout()
plt.show()

### Normalization Summary

| Method | Output Range | When to Use |
|--------|-------------|-------------|
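| **Min-Max** | [0, 1] | Neural networks and other models that expect bounded inputs; also usable for distance-based methods such as KNN and K-Means, but sensitive to outliers |
| **Z-Score** | Unbounded; mean = 0, std = 1 | Linear Regression, SVM, PCA (puts features on a comparable variance scale; does not assume normally distributed data) |

- **Min-Max** is useful when the algorithm requires inputs in a fixed range, but a single extreme value compresses the remaining values into a narrow band.
- **Z-Score** is preferred when the model is sensitive to differences in feature variance; it does not require the data to be normally distributed.
- The **shape** of each distribution is preserved by both methods because they are linear transformations; only location and scale change.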
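One way to check the claim about shape is to compare skewness before and after scaling, since skewness is unchanged by a positive linear rescaling. A minimal sketch on a made-up right-skewed series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature
s = pd.Series([0, 1, 1, 2, 3, 5, 8, 20], dtype=float)

minmax = (s - s.min()) / (s.max() - s.min())
zscore = (s - s.mean()) / s.std(ddof=0)

# All three skewness values should agree up to floating-point error
print(round(s.skew(), 6), round(minmax.skew(), 6), round(zscore.skew(), 6))
```

This also means that scaling does not fix skew or remove outliers; those have to be handled separately, as in Task 3.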

---
## Task 5: Apply PCA and Interpret Explained Variance

We apply Principal Component Analysis (PCA) to the standardized features to reduce dimensionality while preserving as much variance as possible.
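To make "explained variance" concrete: for standardized data, the explained variance ratios are the eigenvalues of the covariance matrix divided by their sum. The sketch below checks this on synthetic data generated for the illustration (the array `X` is an assumption, not the student dataset).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples, 4 features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 2)),
])

X_std = StandardScaler().fit_transform(X)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
manual_ratio = eigvals / eigvals.sum()

sk_ratio = PCA().fit(X_std).explained_variance_ratio_

print(manual_ratio.round(4))
print(sk_ratio.round(4))
```

The two correlated columns end up sharing one dominant component, which is the same effect expected for `G1`, `G2`, and `G3`.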

In [None]:
# Check correlations between features before PCA
pca_features = ['age', 'absences', 'G1', 'G2', 'G3',
                'studytime', 'failures', 'famrel',
                'freetime', 'goout', 'Dalc', 'Walc', 'health']

plt.figure(figsize=(10, 8))
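# Fix the color scale to [-1, 1] so that zero correlation maps to the neutral midpoint
sns.heatmap(df[pca_features].corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5, vmin=-1, vmax=1)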
plt.title('Correlation Heatmap Before PCA')
plt.show()

In [None]:
# Standardize data for PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[pca_features])

# Apply PCA with all components first to see the full variance breakdown
pca_full = PCA()
pca_full.fit(X_scaled)

explained = pca_full.explained_variance_ratio_
cumulative = np.cumsum(explained)

print('Explained Variance Ratio per Component:')
for i, (ev, cv) in enumerate(zip(explained, cumulative)):
    print(f'  PC{i+1}: {ev:.4f}  (Cumulative: {cv:.4f})')

In [None]:
# Scree plot: visualize explained variance
n_components = len(pca_features)

plt.figure(figsize=(10, 4))
plt.bar(range(1, n_components + 1), explained * 100, alpha=0.7, label='Individual')
plt.step(range(1, n_components + 1), cumulative * 100, where='mid',
         color='red', linewidth=2, label='Cumulative')
plt.axhline(y=80, color='gray', linestyle='--', label='80% threshold')
plt.axhline(y=95, color='navy', linestyle='--', label='95% threshold')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance (%)')
plt.title('Scree Plot: PCA Explained Variance')
plt.legend()
plt.xticks(range(1, n_components + 1))
plt.tight_layout()
plt.show()

In [None]:
# Apply PCA keeping 5 components
pca = PCA(n_components=5)
principal_components = pca.fit_transform(X_scaled)

print('Explained Variance Ratio (5 components):', pca.explained_variance_ratio_.round(4))
print('Cumulative Explained Variance:', np.cumsum(pca.explained_variance_ratio_).round(4))

In [None]:
# Scatter plot of first two principal components, colored by G3
plt.figure(figsize=(7, 5))
sc = plt.scatter(principal_components[:, 0], principal_components[:, 1],
                 c=df['G3'], cmap='viridis', alpha=0.7, edgecolors='k', linewidths=0.3)
plt.colorbar(sc, label='G3 (Final Grade)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection (colored by G3)')
plt.show()

In [None]:
# PCA Loadings: which features contribute most to each component?
loadings = pd.DataFrame(
    pca.components_.T,
    index=pca_features,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)]
)

plt.figure(figsize=(10, 5))
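# center=0 keeps positive and negative loadings on opposite sides of the diverging colormap
sns.heatmap(loadings[['PC1', 'PC2', 'PC3']], annot=True, cmap='coolwarm', center=0, fmt='.2f', linewidths=0.5)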
plt.title('PCA Loadings (Feature Contributions to Components)')
plt.show()

### Task 5: PCA Interpretation

**Explained Variance:**
- The scree plot shows the share of total variance captured by each principal component.
- Components are ordered by decreasing explained variance, so the first few capture the most information.
- The cumulative line shows how many components are needed to reach a chosen variance threshold.

**What the components represent:**
- **PC1** captures the most variation in the dataset. If the loadings heatmap shows large absolute loadings for `G1`, `G2`, `G3`, and `failures` on PC1 (with `failures` expected to carry the opposite sign to the grades), this component can be read as **academic performance**.
- **PC2** may capture social/lifestyle variation: features like `goout`, `Dalc`, `Walc`, and `freetime` tend to load here.
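Reading the heatmap by eye can be supplemented by ranking features on their absolute loading for each component. The sketch below uses a small invented table, `loadings_demo`, whose numbers are illustrative only and are not results from this dataset.

```python
import pandas as pd

# Hypothetical loadings table (rows: features, columns: components)
loadings_demo = pd.DataFrame(
    {'PC1': [0.55, 0.57, -0.40, 0.05],
     'PC2': [0.02, -0.03, 0.10, 0.80]},
    index=['G1', 'G2', 'failures', 'Walc'],
)

# For each component, keep the two features with the largest absolute loading
top_features = {
    pc: loadings_demo[pc].abs().sort_values(ascending=False).index[:2].tolist()
    for pc in loadings_demo.columns
}

print(top_features)
```

Ranking by absolute value matters because a large negative loading contributes to a component just as strongly as a large positive one.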

**Why PCA is useful here:**
- Several grade-related features (`G1`, `G2`, `G3`) are highly correlated, making PCA effective at combining overlapping information.
- Reducing 13 features to 5 principal components simplifies the dataset. Whether those 5 components reach the 80% target should be read from the cumulative explained variance printed above; if they fall short, increase `n_components`.

**Scatter plot interpretation:**
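- Points are colored by `G3` (final grade). A visible color gradient along PC1 would indicate that PC1 captures academic performance variation. The direction of the gradient (left to right or right to left) is not meaningful, because the sign of a principal component is arbitrary.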

**Decision rule:** Keep enough components to reach **80% to 95% cumulative explained variance**, depending on the model's tolerance for information loss.
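The decision rule can be applied programmatically instead of reading the scree plot by eye. The sketch below assumes a hypothetical helper `components_needed` and made-up variance ratios in `explained_demo`; with the real data, the `cumulative` array computed earlier would take their place.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order
explained_demo = np.array([0.45, 0.25, 0.15, 0.08, 0.04, 0.03])
cumulative_demo = np.cumsum(explained_demo)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    # searchsorted returns the first index where cumulative >= threshold
    return int(np.searchsorted(cumulative, threshold) + 1)

k80 = components_needed(cumulative_demo, 0.80)
k95 = components_needed(cumulative_demo, 0.95)

print('Components for 80%:', k80)
print('Components for 95%:', k95)
```

For these example ratios the rule gives 3 components for 80% and 5 for 95%. scikit-learn can make the same choice directly: passing a float such as `PCA(n_components=0.80, svd_solver='full')` keeps the smallest number of components whose cumulative explained variance exceeds that fraction.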