Further explanations are in ApplyingML_Report.

### Linear Regression:
We started with Linear Regression as a baseline model to predict depression scores. It assumes a linear relationship between features and the target variable.
- **MSE:** 0.1595  
- **R²:** 0.3422

This model explained about 34% of the variance in depression scores on the held-out test set (20% of the data).
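
For reference, both metrics can be reproduced by hand. The sketch below uses invented toy values (not taken from the dataset) and compares the manual formulas with scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values invented purely for illustration (not from the dataset)
y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.2, 0.7, 0.9, 0.4, 0.6])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE manual:", mse_manual, "| sklearn:", mean_squared_error(y_true, y_hat))
print("R2  manual:", r2_manual, "| sklearn:", r2_score(y_true, y_hat))
```

An R² of 0 corresponds to always predicting the mean of the target, so a value of 0.34 means the squared error is about 34% lower than that of this trivial predictor.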


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv("second_Student_Depression_Dataset.csv")

# Transform 'Sleep Duration' into an ordinal numeric feature
sleep_map = {
    "Less than 5 hours": 1,
    "5-6 hours": 2,
    "7-8 hours": 3,
    "More than 8 hours": 4
}
df['Sleep_Ordinal'] = df['Sleep Duration'].map(sleep_map)

# Select numerical features to be used as predictors
features = ['Academic Pressure', 'Work Pressure', 'CGPA',
            'Study Satisfaction', 'Job Satisfaction',
            'Work/Study Hours', 'Financial Stress',
            'Sleep_Ordinal']  # newly transformed feature
target = 'Depression'

# Remove rows with missing values in the selected columns
df_model = df[features + [target]].dropna()

# Define independent variables (X) and dependent variable (y)
X = df_model[features]
y = df_model[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))
print("R-squared (R²):", r2_score(y_test, y_pred))

# Visualize model coefficients
coefficients = pd.Series(model.coef_, index=features)
plt.figure(figsize=(10,6))
sns.barplot(x=coefficients.values, y=coefficients.index)
plt.title("Linear Regression Coefficients")
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

### KNN Regressor:
K-Nearest Neighbors (with k = 5) was used to test whether local similarity among students would yield better predictions.

- **MSE:** 0.1851  
- **R²:** 0.2369

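The model performed worse than Linear Regression (R² 0.2369 vs. 0.3422). This suggests that global trends matter more than local similarity in this dataset, although the comparison used a single setting (`n_neighbors=5`, unscaled features).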
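
Because KNN is distance-based, features with wider numeric ranges can dominate the neighbor search. The sketch below illustrates that effect on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the student dataset) by comparing a plain `KNeighborsRegressor` with one wrapped in a `StandardScaler` pipeline. It is an illustration of a possible follow-up experiment, not the method used in this report.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on very different scales
rng = np.random.default_rng(42)
n = 500
small_scale = rng.uniform(0, 1, n)      # informative feature, narrow range
large_scale = rng.uniform(0, 1000, n)   # uninformative feature, wide range
X_demo = np.column_stack([small_scale, large_scale])
y_demo = small_scale + rng.normal(0, 0.05, n)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# KNN on raw features: distances are dominated by the wide-range feature
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(Xd_train, yd_train)

# KNN on standardized features: the scaler is fit on the training split only
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(Xd_train, yd_train)

r2_raw = r2_score(yd_test, knn_raw.predict(Xd_test))
r2_scaled = r2_score(yd_test, knn_scaled.predict(Xd_test))
print("R2 without scaling:", r2_raw)
print("R2 with scaling:   ", r2_scaled)
```

Whether scaling helps on the real features would need to be checked by rerunning the KNN cell with such a pipeline; the numbers reported above were obtained without it.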


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("second_Student_Depression_Dataset.csv")

# Transform 'Sleep Duration' into an ordinal numeric feature
sleep_map = {
    "Less than 5 hours": 1,
    "5-6 hours": 2,
    "7-8 hours": 3,
    "More than 8 hours": 4
}
df['Sleep_Ordinal'] = df['Sleep Duration'].map(sleep_map)

# Select numerical features to be used as predictors
features = ['Academic Pressure', 'Work Pressure', 'CGPA',
            'Study Satisfaction', 'Job Satisfaction',
            'Work/Study Hours', 'Financial Stress',
            'Sleep_Ordinal']
target = 'Depression'

# Remove rows with missing values in the selected columns
df_model = df[features + [target]].dropna()

# Define independent variables (X) and dependent variable (y)
X = df_model[features]
y = df_model[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the KNN Regressor model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
print("KNN Regression - Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))
print("KNN Regression - R-squared (R²):", r2_score(y_test, y_pred))


### Decision Tree Regressor:
An unconstrained Decision Tree initially overfit the training data. Limiting `max_depth` to 4 improved its performance on the test set:

- **MSE:** 0.1657  
- **R²:** 0.317

Its performance was close to that of Linear Regression (R² 0.317 vs. 0.3422), and its splits can be read as explicit rules.
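
The overfitting pattern and the rule-based view can both be sketched on synthetic data. In the example below, `X_demo`, `y_demo`, and the feature names are hypothetical placeholders; the code compares train and test R² for an unconstrained tree and a depth-4 tree, then prints the top of the shallow tree with `export_text`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (hypothetical): weak linear signal plus noise
rng = np.random.default_rng(0)
n = 600
X_demo = rng.uniform(1, 5, size=(n, 3))
y_demo = 0.2 * X_demo[:, 0] - 0.1 * X_demo[:, 1] + rng.normal(0, 0.4, n)
feature_names = ["feat_a", "feat_b", "feat_c"]  # placeholder names

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Record (train R², test R²) for each depth setting
scores = {}
trees = {}
for label, depth in [("unconstrained", None), ("max_depth=4", 4)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    scores[label] = (tree.score(Xd_train, yd_train), tree.score(Xd_test, yd_test))
    trees[label] = tree
    print(label, "train R2:", scores[label][0], "test R2:", scores[label][1])

# Print the first two levels of the shallow tree as if/else rules
rules = export_text(trees["max_depth=4"], feature_names=feature_names, max_depth=2)
print(rules)
```

A large gap between train and test R² is the usual sign of overfitting; the depth limit trades some training fit for a smaller gap.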

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv("second_Student_Depression_Dataset.csv")

# Transform 'Sleep Duration' into an ordinal numeric feature
sleep_map = {
    "Less than 5 hours": 1,
    "5-6 hours": 2,
    "7-8 hours": 3,
    "More than 8 hours": 4
}
df['Sleep_Ordinal'] = df['Sleep Duration'].map(sleep_map)

# Select numerical features to be used as predictors
features = ['Academic Pressure', 'Work Pressure', 'CGPA',
            'Study Satisfaction', 'Job Satisfaction',
            'Work/Study Hours', 'Financial Stress',
            'Sleep_Ordinal']
target = 'Depression'

# Remove rows with missing values in the selected columns
df_model = df[features + [target]].dropna()

# Define independent variables (X) and dependent variable (y)
X = df_model[features]
y = df_model[target]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Decision Tree Regressor model
model = DecisionTreeRegressor(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
print("Decision Tree Regression - Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred))
print("Decision Tree Regression - R-squared (R²):", r2_score(y_test, y_pred))
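
The three cells above repeat the same loading and splitting steps. As a possible consolidation, the sketch below fits all three models on one shared split and collects the metrics in a table. The helper `evaluate_models` and the synthetic `X_demo` / `y_demo` are hypothetical and only illustrate the pattern; with the real data, `X` and `y` from the cells above would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def evaluate_models(X, y, test_size=0.2, random_state=42):
    """Fit each model on the same split and return a table of test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # Same model settings as in the individual cells
    models = {
        "Linear Regression": LinearRegression(),
        "KNN (k=5)": KNeighborsRegressor(n_neighbors=5),
        "Decision Tree (max_depth=4)": DecisionTreeRegressor(max_depth=4, random_state=42),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        rows.append({
            "Model": name,
            "MSE": mean_squared_error(y_test, y_pred),
            "R2": r2_score(y_test, y_pred),
        })
    return pd.DataFrame(rows).set_index("Model")


# Synthetic placeholder data (hypothetical), used only to make the sketch runnable
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.uniform(1, 5, size=(300, 3)), columns=["f1", "f2", "f3"])
y_demo = 0.3 * X_demo["f1"] - 0.2 * X_demo["f2"] + rng.normal(0, 0.3, 300)

results = evaluate_models(X_demo, y_demo)
print(results)
```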