# Final Portfolio Project - Classification Task

# Teen Phone Addiction Level Classification

## Task 1: Exploratory Data Analysis and Data Understanding

### 1.1 Choosing a Dataset

#### (a) When and by whom the dataset was created
Provide the dataset provenance here (creator/owner and year, if known).

#### (b) How and from where the dataset was accessed
Provide the dataset source here (website / repository / provider) and the access date.

#### (c) Alignment with United Nations Sustainable Development Goal (UNSDG)
This dataset relates to **SDG 3: Good Health and Well-Being** because excessive smartphone use can impact sleep quality, mental health, and overall well-being.

#### (d) List and brief description of all attributes (features)
Use the column list table (next code cell) to write a brief description of each feature in your report.

In [None]:
# Core
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing / Model selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Metrics (classification only)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay, make_scorer
)

# Models (classification only)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [None]:
# Load the dataset
# (works both on local folder and the provided environment)
try:
    df = pd.read_csv("teen_phone_addiction_dataset.csv")
except FileNotFoundError:
    df = pd.read_csv("/mnt/data/teen_phone_addiction_dataset.csv")

print("Shape:", df.shape)
df.head()

In [None]:
# Column list (use this to write brief descriptions in the markdown section above)
pd.DataFrame({"column": df.columns, "dtype": [str(df[c].dtype) for c in df.columns]})

#### Potential Questions the Dataset Can Answer

1. Which behavioral factors (daily usage, checks/day, screen time before bed) are most associated with higher addiction?
2. Does higher phone usage correlate with lower sleep hours or lower academic performance?
3. Are there demographic differences (age, gender, grade) in predicted addiction category?

#### Dataset Quality Assessment

The next cells check missing values, duplicates, and basic distribution of the target variable.

In [None]:
# Missing values and duplicates
missing = df.isna().sum().sort_values(ascending=False)
print("Missing values (top):")
display(missing[missing>0].head(20))

print("\nDuplicate rows:", df.duplicated().sum())

In [None]:
# Basic summary statistics for numeric columns
df.describe(include=[np.number]).T

In [None]:
# Inspect categorical columns
cat_cols = [c for c in df.columns if df[c].dtype == "object"]
pd.DataFrame({"categorical_column": cat_cols, "n_unique": [df[c].nunique() for c in cat_cols]})

### 1.2 Exploratory Data Analysis (EDA)

#### (a) Data Cleaning and Preprocessing

The dataset provides a continuous **Addiction_Level** score rather than a class label. We convert it into a categorical target by quantile-based binning into three classes (Low / Medium / High), which yields roughly equal class sizes and a meaningful classification task.

In [None]:
# Create classification target (3 balanced classes) from Addiction_Level
df = df.copy()

# Ensure Addiction_Level is numeric
df["Addiction_Level"] = pd.to_numeric(df["Addiction_Level"], errors="coerce")

# Drop rows where target is missing
df = df.dropna(subset=["Addiction_Level"]).reset_index(drop=True)

# Quantile binning into 3 classes
df["Addiction_Class"] = pd.qcut(df["Addiction_Level"], q=3, labels=["Low", "Medium", "High"])

df[["Addiction_Level", "Addiction_Class"]].head()
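
If many rows share the same Addiction_Level (for example, when scores are capped at a maximum value), two quantile edges can coincide and `pd.qcut` raises a "Bin edges must be unique" error. The sketch below shows one common workaround: rank the scores first so that every value is distinct. The toy series and the helper name `quantile_classes` are illustrative assumptions, not part of the dataset or of the original notebook.

```python
import pandas as pd

def quantile_classes(scores, labels=("Low", "Medium", "High")):
    """Bin scores into equal-sized classes even when many values are tied.

    Hypothetical helper: ranking with method="first" gives every row a
    distinct rank, so the quantile edges can never coincide.
    """
    ranks = scores.rank(method="first")
    return pd.qcut(ranks, q=len(labels), labels=list(labels))

# Toy scores with a heavy tie at the top value (illustrative only)
toy_scores = pd.Series([1.0, 2.0, 3.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0])

try:
    pd.qcut(toy_scores, q=3, labels=["Low", "Medium", "High"])
    plain_qcut_failed = False
except ValueError:
    plain_qcut_failed = True  # two quantile edges were identical

toy_classes = quantile_classes(toy_scores)
print("Plain qcut failed:", plain_qcut_failed)
print(toy_classes.value_counts().sort_index())
```

The trade-off is that tied scores are split across neighbouring classes by row order, so identical scores can receive different labels. If that is not acceptable, fixed thresholds passed to `pd.cut` are an alternative.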

In [None]:
# Class balance
class_counts = df["Addiction_Class"].value_counts().sort_index()
class_counts.plot(kind="bar")
plt.title("Class Distribution: Addiction_Class")
plt.xlabel("Class")
plt.ylabel("Count")
plt.show()

class_counts

#### (b) Visualizations to Summarize, Explore, and Understand the Data

In [None]:
# Correlation heatmap for numeric features (excluding ID)
# select_dtypes keeps only numeric columns, so the categorical Addiction_Class is left out
num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c != "ID"]
corr = df[num_cols].corr()

plt.figure(figsize=(10, 7))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

In [None]:
# Relationship between phone usage and sleep
plt.figure(figsize=(7,5))
sns.scatterplot(data=df, x="Daily_Usage_Hours", y="Sleep_Hours", hue="Addiction_Class", alpha=0.7)
plt.title("Daily Phone Usage vs Sleep Hours")
plt.show()

In [None]:
# Academic performance across addiction classes
plt.figure(figsize=(7,5))
sns.boxplot(data=df, x="Addiction_Class", y="Academic_Performance")
plt.title("Academic Performance by Addiction Class")
plt.show()

In [None]:
# Phone checks per day across addiction classes
plt.figure(figsize=(7,5))
sns.boxplot(data=df, x="Addiction_Class", y="Phone_Checks_Per_Day")
plt.title("Phone Checks per Day by Addiction Class")
plt.show()

In [None]:
# Categorical distribution example: Gender vs Addiction class
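# Pass the axes explicitly; otherwise pandas opens a second figure and the first stays empty
fig, ax = plt.subplots(figsize=(6, 4))
pd.crosstab(df["Gender"], df["Addiction_Class"]).plot(kind="bar", ax=ax)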
plt.title("Gender vs Addiction Class")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()

# Task 2: Build a Neural Network Model

We use an MLPClassifier with a preprocessing pipeline (imputation + one-hot encoding for categorical features + scaling for numeric features).
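
The cells that follow rely on `X_train`, `X_test`, `y_train`, `y_test`, `preprocess` and `mlp_pipe`, but no cell in this notebook defines them. The sketch below is one way they could be built from the description above. It is not the original setup: the helper name `build_split_and_mlp`, the 80/20 stratified split, the imputation strategies and the hidden layer sizes are all assumptions, and the toy frame only stands in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 42

def build_split_and_mlp(df, target="Addiction_Class",
                        drop_cols=("ID", "Addiction_Level")):
    """Return X_train, X_test, y_train, y_test, preprocess, mlp_pipe.

    Assumptions: Addiction_Level is dropped because the target is derived
    from it (keeping it would leak the label), and ID carries no signal.
    """
    y = df[target].astype(str)
    X = df.drop(columns=[target, *[c for c in drop_cols if c in df.columns]])

    num_features = X.select_dtypes(include=[np.number]).columns.tolist()
    cat_features = [c for c in X.columns if c not in num_features]

    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    # sparse_threshold=0 forces dense output, so later steps such as mutual
    # information scoring treat the scaled columns as continuous
    preprocess = ColumnTransformer([("num", numeric, num_features),
                                    ("cat", categorical, cat_features)],
                                   sparse_threshold=0)

    mlp_pipe = Pipeline([
        ("preprocess", preprocess),
        ("model", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                random_state=RANDOM_STATE)),
    ])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE)
    return X_train, X_test, y_train, y_test, preprocess, mlp_pipe

# Toy frame standing in for the real dataset (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ID": range(90),
    "Daily_Usage_Hours": rng.uniform(1, 10, 90),
    "Gender": rng.choice(["Female", "Male"], 90),
})
toy["Addiction_Level"] = toy["Daily_Usage_Hours"] + rng.normal(0, 0.5, 90)
toy["Addiction_Class"] = pd.qcut(toy["Addiction_Level"], q=3,
                                 labels=["Low", "Medium", "High"])
toy_parts = build_split_and_mlp(toy)
print([part.shape for part in toy_parts[:4]])
```

In this notebook the call would be `X_train, X_test, y_train, y_test, preprocess, mlp_pipe = build_split_and_mlp(df)`. Any identifier-like or free-text columns in the real data would need to be added to `drop_cols`, since one-hot encoding them adds one column per distinct value.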

In [None]:
# Train the neural network classifier
mlp_pipe.fit(X_train, y_train)

# Predictions
y_pred_train = mlp_pipe.predict(X_train)
y_pred_test = mlp_pipe.predict(X_test)

def cls_metrics(y_true, y_pred, label=""):
    return pd.Series({
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (weighted)": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "Recall (weighted)": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "F1 (weighted)": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }, name=label)

metrics_train = cls_metrics(y_train, y_pred_train, "Train")
metrics_test = cls_metrics(y_test, y_pred_test, "Test")

pd.concat([metrics_train, metrics_test], axis=1)

In [None]:
# Confusion matrix (test)
cm = confusion_matrix(y_test, y_pred_test, labels=["Low","Medium","High"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Low","Medium","High"])
disp.plot()
plt.title("MLPClassifier - Confusion Matrix (Test)")
plt.show()

# Task 3: Build Primary Model
## 3.1 Split Dataset into Training and Testing Sets

The same train/test split from Task 2 is used.

## 3.2 Model A: Logistic Regression
## 3.3 Model B: Random Forest Classifier
## 3.4 Initial Comparison and Discussion

In [None]:
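# Model A: Logistic Regression (multinomial by default with the lbfgs solver)
# multi_class is omitted: "auto" was already the default and the parameter is deprecated in recent scikit-learn
log_reg = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)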

logreg_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", log_reg)
])

logreg_pipe.fit(X_train, y_train)
y_pred_lr = logreg_pipe.predict(X_test)

lr_metrics = cls_metrics(y_test, y_pred_lr, "Logistic Regression")
lr_metrics

In [None]:
# Model B: Random Forest
rf_clf = RandomForestClassifier(
    n_estimators=300,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

rf_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", rf_clf)
])

rf_pipe.fit(X_train, y_train)
y_pred_rf = rf_pipe.predict(X_test)

rf_metrics = cls_metrics(y_test, y_pred_rf, "Random Forest")
rf_metrics

In [None]:
# Initial comparison table (test set)
initial_comparison = pd.DataFrame([lr_metrics, rf_metrics]).reset_index().rename(columns={"index":"Model"})
initial_comparison

# Task 4: Hyper-parameter Optimization with Cross-Validation

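We tune the two classical ML models with GridSearchCV, wrapping each in its pipeline so that preprocessing is refit inside every fold. The scoring metric is weighted F1, which averages the per-class F1 scores weighted by class support, so it remains meaningful if the classes are not perfectly balanced.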
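
To make the scoring metric concrete, the sketch below recomputes weighted F1 by hand on made-up labels and checks it against scikit-learn. The label arrays are illustrative only and are not results from this dataset.

```python
import numpy as np
from sklearn.metrics import f1_score

labels = ["Low", "Medium", "High"]

# Made-up true and predicted labels (illustrative only)
y_true_toy = np.array(["Low", "Low", "Low", "Low", "Medium", "Medium",
                       "High", "High", "High", "High"])
y_pred_toy = np.array(["Low", "Low", "Medium", "Low", "Medium", "High",
                       "High", "High", "Low", "High"])

# Per-class F1 in the order given by `labels`
per_class_f1 = f1_score(y_true_toy, y_pred_toy, labels=labels,
                        average=None, zero_division=0)

# Support = number of true samples in each class
support = np.array([(y_true_toy == lab).sum() for lab in labels])

# Weighted F1 = support-weighted mean of the per-class F1 scores
manual_weighted = float(np.sum(per_class_f1 * support) / support.sum())
sklearn_weighted = f1_score(y_true_toy, y_pred_toy, average="weighted",
                            zero_division=0)

# Macro F1 treats every class equally, regardless of its size
macro = float(per_class_f1.mean())

print("Per-class F1:", dict(zip(labels, per_class_f1.round(3))))
print("Weighted F1 (manual):", round(manual_weighted, 3))
print("Weighted F1 (sklearn):", round(sklearn_weighted, 3))
print("Macro F1:", round(macro, 3))
```

With quantile-binned classes of nearly equal size, weighted and macro F1 should be close. They diverge only when class sizes differ noticeably.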

In [None]:
from sklearn.metrics import make_scorer
f1_weighted = make_scorer(f1_score, average="weighted", zero_division=0)

# 4.1 Logistic Regression tuning
logreg_param_grid = {
    "model__C": [0.1, 1, 5, 10],
    "model__penalty": ["l2"],
    "model__solver": ["lbfgs"]
}

logreg_gs = GridSearchCV(
    logreg_pipe,
    param_grid=logreg_param_grid,
    scoring=f1_weighted,
    cv=5,
    n_jobs=-1
)
logreg_gs.fit(X_train, y_train)

logreg_gs.best_params_, logreg_gs.best_score_

In [None]:
# 4.2 Random Forest tuning
rf_param_grid = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
    "model__min_samples_leaf": [1, 2]
}

rf_gs = GridSearchCV(
    rf_pipe,
    param_grid=rf_param_grid,
    scoring=f1_weighted,
    cv=5,
    n_jobs=-1
)
rf_gs.fit(X_train, y_train)

rf_gs.best_params_, rf_gs.best_score_

In [None]:
# 4.3 Summary of best hyperparameters and CV scores
pd.DataFrame([
    {"Model": "Logistic Regression", "Best CV Score (F1w)": logreg_gs.best_score_, "Best Params": logreg_gs.best_params_},
    {"Model": "Random Forest", "Best CV Score (F1w)": rf_gs.best_score_, "Best Params": rf_gs.best_params_},
])

# Task 5: Feature Selection

A **filter method** is applied using mutual information. After preprocessing, SelectKBest keeps the top-k most informative features. This is applied for both classical models.

In [None]:
# Choose k based on feature space size (after one-hot).
# We start with a moderate k and can adjust if desired.
k_best = 20

fs_logreg_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_classif, k=k_best)),
    ("model", LogisticRegression(max_iter=1000, random_state=RANDOM_STATE))
])

fs_rf_pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_classif, k=k_best)),
    ("model", RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=300, n_jobs=-1))
])

# Cross-validated score (using weighted F1)
fs_logreg_cv = cross_val_score(fs_logreg_pipe, X_train, y_train, cv=5, scoring=f1_weighted, n_jobs=-1).mean()
fs_rf_cv = cross_val_score(fs_rf_pipe, X_train, y_train, cv=5, scoring=f1_weighted, n_jobs=-1).mean()

fs_logreg_cv, fs_rf_cv

# Task 6: Final Models and Comparative Analysis

Rebuild both classical models using:
- Best hyperparameters from Task 4
- Feature selection from Task 5

Then evaluate on the test set and compare in a table.

In [None]:
# Final Logistic Regression with best hyperparameters + feature selection
final_logreg = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_classif, k=k_best)),
    ("model", LogisticRegression(
        max_iter=1000,
        random_state=RANDOM_STATE,
        C=logreg_gs.best_params_["model__C"],
        penalty=logreg_gs.best_params_["model__penalty"],
        solver=logreg_gs.best_params_["model__solver"]
    ))
])

# Final Random Forest with best hyperparameters + feature selection
bp = rf_gs.best_params_
final_rf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=mutual_info_classif, k=k_best)),
    ("model", RandomForestClassifier(
        random_state=RANDOM_STATE,
        n_jobs=-1,
        n_estimators=bp["model__n_estimators"],
        max_depth=bp["model__max_depth"],
        min_samples_split=bp["model__min_samples_split"],
        min_samples_leaf=bp["model__min_samples_leaf"],
    ))
])

# Fit + evaluate
final_logreg.fit(X_train, y_train)
final_rf.fit(X_train, y_train)

pred_logreg = final_logreg.predict(X_test)
pred_rf = final_rf.predict(X_test)

final_lr_metrics = cls_metrics(y_test, pred_logreg, "Final Logistic Regression")
final_rf_metrics = cls_metrics(y_test, pred_rf, "Final Random Forest")

final_lr_metrics, final_rf_metrics

In [None]:
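# Comparison table (similar to Table 1 / Table 4 in the assignment)
# Note: the CV scores below are the Task 4 GridSearchCV scores, which were
# measured on pipelines without the SelectKBest step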
comparison_table = pd.DataFrame([
    {
        "Model": "Logistic Regression (Final)",
        "Features": f"SelectKBest(k={k_best})",
        "CV Score (F1w)": logreg_gs.best_score_,
        "Accuracy": final_lr_metrics["Accuracy"],
        "Precision": final_lr_metrics["Precision (weighted)"],
        "Recall": final_lr_metrics["Recall (weighted)"],
        "F1-Score": final_lr_metrics["F1 (weighted)"],
    },
    {
        "Model": "Random Forest (Final)",
        "Features": f"SelectKBest(k={k_best})",
        "CV Score (F1w)": rf_gs.best_score_,
        "Accuracy": final_rf_metrics["Accuracy"],
        "Precision": final_rf_metrics["Precision (weighted)"],
        "Recall": final_rf_metrics["Recall (weighted)"],
        "F1-Score": final_rf_metrics["F1 (weighted)"],
    }
])

comparison_table
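
The CV score column in the table comes from the Task 4 grid search, whose pipelines did not include feature selection, while the test metrics come from pipelines that do. One way to make the two columns describe the same model is to cross-validate the final pipeline itself. The sketch below shows the idea on a synthetic three-class problem. The data, the `StandardScaler` stand-in for `preprocess`, and the values of `k` and `C` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for X_train / y_train (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=4,
                                   n_classes=3, random_state=42)

final_like_pipe = Pipeline([
    ("scale", StandardScaler()),  # stand-in for the notebook's `preprocess`
    ("select", SelectKBest(score_func=mutual_info_classif, k=6)),
    ("model", LogisticRegression(max_iter=1000, C=1.0)),
])

# Selection is refit inside each fold, so the held-out fold never
# influences which features are kept
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(final_like_pipe, X_toy, y_toy,
                            cv=cv, scoring="f1_weighted")

print("CV F1 (weighted): %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
```

In this notebook the equivalent call would be `cross_val_score(final_logreg, X_train, y_train, cv=5, scoring=f1_weighted)`, and likewise for `final_rf`. The resulting means could then replace `best_score_` in the table.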

## Task 7: Report Quality and Presentation

- Code is organized into tasks and uses pipelines for reproducibility.
- Visualizations are labeled with titles and axes.
- Tables summarize key results.

## Task 8: Conclusion and Reflection

1. **Model Performance:** State which final model performed best and cite the key metrics.
2. **Impact of Methods:** Explain how hyperparameter tuning and feature selection affected results.
3. **Insights and Future Directions:** Summarize insights from EDA and suggest future improvements (e.g., different class thresholds, more feature engineering, other models).