# ML IN CLASS GROUP ASSIGNMENT (rough notes)

Not trying to be fancy here. Plain scratchpad style.
Goal: answer in-class questions quickly using the CSVs we got. Might refine later.


In [1]:
import pandas as pd
from pathlib import Path

# Discover project root by searching upward for requirements.txt
cur = Path.cwd()
root = None
for p in [cur] + list(cur.parents):
    if (p / "requirements.txt").exists():
        root = p
        break
if root is None:
    root = cur  # fallback

# Direct references to in-class supervised and unsupervised samples
supervised_file = root / "data" / "in_class" / "sample_supervised.csv"
unsupervised_file = root / "data" / "in_class" / "sample_unsupervised.csv"

missing = []
if not supervised_file.exists():
    missing.append(f"Missing supervised file: {supervised_file}")
if not unsupervised_file.exists():
    missing.append(f"Missing unsupervised file: {unsupervised_file}")
if missing:
    raise FileNotFoundError(" | ".join(missing))

sup = pd.read_csv(supervised_file)
unsup = pd.read_csv(unsupervised_file)

sup.head(), unsup.head()

(  Student  daily_screen_time_hours  sleep_duration_hours  mental_health_score
 0       A                      4.8                   6.6                   32
 1       B                      3.9                   4.5                   75
 2       C                     10.5                   7.1                   22
 3       D                      8.8                   5.1                   22
 4       E                      5.9                   7.4                   64,
   Student  daily_screen_time_hours  sleep_duration_hours
 0       A                      4.8                   6.6
 1       B                      3.9                   4.5
 2       C                     10.5                   7.1
 3       D                      8.8                   5.1
 4       E                      5.9                   7.4)

## 1. BACKGROUND & OBJECTIVE 

Background (short): students spend lots of hours on screens (phone / laptop / socials). That touches sleep + mood + stress + overall mental health.
Why we care: want to see patterns → who looks balanced vs heavy screen use.

Research objective:
- Predict mental_health_score from lifestyle + usage stuff.
- Also cluster students to see natural groups (maybe "high screen, low sleep" vs "balanced").



## 2. Describe Features of the Data (dot points)

**Supervised dataset columns**
- Demographics: `age`, `gender`, `location_type`
- Digital use: `daily_screen_time_hours`, `phone_usage_hours`, `laptop_usage_hours`, `tablet_usage_hours`, `tv_usage_hours`, `gaming_hours`
- Activity split: `social_media_hours`, `work_related_hours`, `entertainment_hours`
- Lifestyle/health: `sleep_duration_hours`, `physical_activity_hours_per_week`, `caffeine_intake_mg_per_day`, `mindfulness_minutes_per_day`, `eats_healthy`, `uses_wellness_apps`
- Psychological: `mood_rating`, `mental_health_score`

**Unsupervised dataset columns**
- Lifestyle/health subset for clustering: `location_type`, `uses_wellness_apps`, `eats_healthy`, `caffeine_intake_mg_per_day`, `mindfulness_minutes_per_day`
- Can optionally add screen/sleep features from supervised set for richer clusters.

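**Basic shapes (in-class sample files)**
- Supervised rows: 5; columns: 4 (`Student`, two features, target `mental_health_score`)
- Unsupervised rows: 5; columns: 3 (`Student`, two features)
- The sample CSVs only contain `daily_screen_time_hours` and `sleep_duration_hours` as features; the other columns listed above are not in these files.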


In [2]:
print("Supervised shape:", sup.shape)
print("Unsupervised shape:", unsup.shape)

print("\nSupervised columns:")
print(list(sup.columns))

print("\nUnsupervised columns:")
print(list(unsup.columns))

Supervised shape: (5, 4)
Unsupervised shape: (5, 3)

Supervised columns:
['Student', 'daily_screen_time_hours', 'sleep_duration_hours', 'mental_health_score']

Unsupervised columns:
['Student', 'daily_screen_time_hours', 'sleep_duration_hours']


## 3. ALGORITHMS (plain picks)

Supervised:
Target: mental_health_score (numeric) → treat as regression.
Try quick baselines: KNN (simple, distance-based, so feature scale matters), Decision Tree (split-based, so feature scale does not matter). Good enough for demo.
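
Quick sketch of the KNN idea on the five sample rows, in plain Python so nothing extra needs installing. Assumptions (ours, not from the brief): k = 2, raw unscaled hours, and leave-one-out prediction because there are only five rows. `knn_predict` is a made-up helper name.

```python
import math

# Rows copied from the supervised sample preview:
# (Student, daily_screen_time_hours, sleep_duration_hours, mental_health_score)
rows = [
    ("A", 4.8, 6.6, 32),
    ("B", 3.9, 4.5, 75),
    ("C", 10.5, 7.1, 22),
    ("D", 8.8, 5.1, 22),
    ("E", 5.9, 7.4, 64),
]


def knn_predict(train, query, k=2):
    """Average the target of the k training rows closest to `query`."""
    ranked = sorted(train, key=lambda r: math.dist(query, (r[1], r[2])))
    nearest = ranked[:k]
    return sum(r[3] for r in nearest) / k


# Leave-one-out: predict each student from the other four.
loo = {}
for i, (name, screen, sleep, target) in enumerate(rows):
    train = rows[:i] + rows[i + 1:]
    loo[name] = knn_predict(train, (screen, sleep), k=2)
    print(name, "actual:", target, "predicted:", loo[name])
```

With five rows this only shows the mechanics; the predictions are not a meaningful accuracy estimate.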

Unsupervised:
K-Means (fast, needs scaling but okay for demo).
Hierarchical (ward) just to show dendrogram / confirm groups.
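
On the "needs scaling" point: a minimal z-score sketch on the two sample features, standard library only. Assumption: we standardise with the population standard deviation (`statistics.pstdev`); `zscore` is just an illustrative helper name. The manual K-Means walkthrough in these notes uses raw hours, not these scaled values.

```python
import statistics

# Feature values copied from the unsupervised sample (students A to E).
screen = [4.8, 3.9, 10.5, 8.8, 5.9]
sleep = [6.6, 4.5, 7.1, 5.1, 7.4]


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


screen_z = zscore(screen)
sleep_z = zscore(sleep)

for name, sx, sy in zip("ABCDE", screen_z, sleep_z):
    print(name, round(sx, 2), round(sy, 2))
```

After scaling, a one-unit difference means "one standard deviation" in either feature, so screen time (wider raw range) no longer dominates the distance just because of its units.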

Feature sets:
- For prediction: most numeric usage + lifestyle (exclude target). Might drop anxiety/depression if we treat them as separate outcomes.
- For clustering: screen time + sleep + physical_activity + caffeine + mindfulness + maybe diet/app usage.

Goal: show at least one manual step + one code run.


## 4. SAMPLE ROWS

In [3]:
# Show raw sample rows (no renaming)
unsup[['Student','daily_screen_time_hours','sleep_duration_hours']].head()

  Student  daily_screen_time_hours  sleep_duration_hours
0       A                      4.8                   6.6
1       B                      3.9                   4.5
2       C                     10.5                   7.1
3       D                      8.8                   5.1
4       E                      5.9                   7.4


## Manual K-Means (using our CSV, 5 samples, 2 features)

We use 2 features to keep the hand calculation readable:
- x = daily_screen_time_hours
- y = sleep_duration_hours


---

### Step 1. Choose K and initialise centroids

Let K = 2 clusters.

Pick two actual points as centroids (far apart so clusters are obvious):
- C1 = A = (4.8, 6.6)
- C2 = C = (10.5, 7.1)


---

### Step 2. Compute Euclidean distances to each centroid

Distance formula (2D):

$d((x,y),(a,b)) = \sqrt{(x-a)^2 + (y-b)^2}$

Work one example fully:

For D to C1:

$d(D,C1) = \sqrt{(8.8-4.8)^2 + (5.1-6.6)^2}$  
$= \sqrt{4.0^2 + (-1.5)^2}$  
$= \sqrt{16 + 2.25}$  
$= \sqrt{18.25} \approx 4.27$

(hand note: square, add, sqrt)
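
Quick check of the worked example with the standard library. `math.dist` (Python 3.8+) computes the same Euclidean distance as the formula; the point values are copied from the sample.

```python
import math

D = (8.8, 5.1)    # student D: (screen, sleep)
C1 = (4.8, 6.6)   # initial centroid 1 (student A)
C2 = (10.5, 7.1)  # initial centroid 2 (student C)

d_c1 = math.dist(D, C1)
d_c2 = math.dist(D, C2)
print(round(d_c1, 2), round(d_c2, 2))  # 4.27 2.62
```

D is closer to C2, so it goes to Cluster 2.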

Compute the rest the same way:

| Student | Point (x,y) | d to C1 | d to C2 | Nearest cluster |
|---|---|---:|---:|---|
| A | (4.8, 6.6) | 0.00 | 5.72 | C1 |
| B | (3.9, 4.5) | 2.28 | 7.09 | C1 |
| C | (10.5, 7.1) | 5.72 | 0.00 | C2 |
| D | (8.8, 5.1) | 4.27 | 2.62 | C2 |
| E | (5.9, 7.4) | 1.36 | 4.61 | C1 |

So after assignment:
- Cluster 1 = {A, B, E}
- Cluster 2 = {C, D}

---

### Step 3. Recompute centroids (mean of each cluster)

New centroid for Cluster 1:

Screen time mean:  
$(4.8 + 3.9 + 5.9)/3 = 14.6/3 \approx 4.87$

Sleep mean:  
$(6.6 + 4.5 + 7.4)/3 = 18.5/3 \approx 6.17$

So:

C1' = (4.87, 6.17)

New centroid for Cluster 2:

Screen time mean:  
$(10.5 + 8.8)/2 = 19.3/2 = 9.65$

Sleep mean:  
$(7.1 + 5.1)/2 = 12.2/2 = 6.10$

So:

C2' = (9.65, 6.10)

(note: centroid ~= "average location")

---

### Step 4. Second assignment check (convergence)

Recalculate distances using C1' and C2'.

Example for A:

$d(A,C1') = \sqrt{(4.8-4.87)^2 + (6.6-6.17)^2}$  
$= \sqrt{(-0.07)^2 + (0.43)^2}$  
$= \sqrt{0.005 + 0.185}$  
$= \sqrt{0.190} \approx 0.44$

$d(A,C2') = \sqrt{(4.8-9.65)^2 + (6.6-6.10)^2}$  
$= \sqrt{(-4.85)^2 + (0.50)^2}$  
$= \sqrt{23.52 + 0.25}$  
$= \sqrt{23.77} \approx 4.88$

So A stays in Cluster 1.

Doing the same for all points gives the same assignments:
- Cluster 1 = {A, B, E}
- Cluster 2 = {C, D}

Therefore the algorithm converges after 1 iteration.
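
The assign / update / re-check steps above can also be written as a loop that stops when assignments stop changing. This is a small sketch, not a full K-Means implementation: it assumes K = 2, starts from A and C as in the notes, and assumes no cluster ever ends up empty. `assign` and `update` are illustrative helper names.

```python
import math

# (screen, sleep) per student, copied from the sample.
points = {
    "A": (4.8, 6.6), "B": (3.9, 4.5), "C": (10.5, 7.1),
    "D": (8.8, 5.1), "E": (5.9, 7.4),
}


def assign(points, centroids):
    """Map each student to the index of the nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for name, p in points.items()
    }


def update(points, labels, k):
    """Return the mean point of each cluster (assumes no cluster is empty)."""
    new_centroids = []
    for i in range(k):
        members = [points[n] for n, lab in labels.items() if lab == i]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


centroids = [points["A"], points["C"]]  # same starting centroids as the notes
labels = assign(points, centroids)
updates = 0
while True:
    centroids = update(points, labels, k=2)
    updates += 1
    new_labels = assign(points, centroids)
    if new_labels == labels:  # nothing moved, so we are done
        break
    labels = new_labels

print("centroid updates:", updates)
print("labels:", labels)
print("centroids:", [tuple(round(c, 2) for c in cen) for cen in centroids])
```

Label 0 corresponds to Cluster 1 and label 1 to Cluster 2.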

---

### Step 5. Interpret clusters (in plain words)

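Cluster 1 (A, B, E):
- lower screen time (3.9 to 5.9 hours, mean 4.87)
- sleep from 4.5 to 7.4 hours (mean 6.17)
= lower-screen-use group

Cluster 2 (C, D):
- high screen time (8.8 to 10.5 hours, mean 9.65)
- sleep from 5.1 to 7.1 hours (mean 6.10)
= heavy-screen-use group

Note: average sleep is almost the same in both clusters (6.17 vs 6.10), and B sleeps the least (4.5 h) despite sitting in Cluster 1. So the split is driven by screen time, not sleep, and this sample does not show Cluster 1 as the "healthier sleep" group.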
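
To back the interpretation with numbers, a short sketch that tabulates range and mean per cluster. Cluster membership is hard-coded from the manual result above; the `summary` layout is our own choice. Worth looking at the sleep figures in particular, since the two cluster means come out close.

```python
import statistics

# Membership taken from the manual K-Means result; values are (screen, sleep).
clusters = {
    "Cluster 1": {"A": (4.8, 6.6), "B": (3.9, 4.5), "E": (5.9, 7.4)},
    "Cluster 2": {"C": (10.5, 7.1), "D": (8.8, 5.1)},
}

summary = {}
for label, members in clusters.items():
    screen = [p[0] for p in members.values()]
    sleep = [p[1] for p in members.values()]
    summary[label] = {
        "screen_mean": round(statistics.fmean(screen), 2),
        "screen_range": (min(screen), max(screen)),
        "sleep_mean": round(statistics.fmean(sleep), 2),
        "sleep_range": (min(sleep), max(sleep)),
    }
    print(label, summary[label])
```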


In [4]:
# Code validation: manual K-Means distances (2 features from unsup sample)
import math
import pandas as pd

# Derive points directly from previously loaded unsupervised sample
# Keep column names screen/sleep for existing logic
points = unsup[['Student','daily_screen_time_hours','sleep_duration_hours']].rename(columns={
    'daily_screen_time_hours':'screen',
    'sleep_duration_hours':'sleep'
})

# Initial centroids chosen in notes (A and C)
C1 = points.loc[points.Student=='A', ['screen','sleep']].iloc[0].to_list()  # (4.8,6.6)
C2 = points.loc[points.Student=='C', ['screen','sleep']].iloc[0].to_list()  # (10.5,7.1)

rows = []
for _, r in points.iterrows():
    d1 = math.sqrt((r.screen - C1[0])**2 + (r.sleep - C1[1])**2)
    d2 = math.sqrt((r.screen - C2[0])**2 + (r.sleep - C2[1])**2)
    cluster = 'C1' if d1 < d2 else 'C2'
    rows.append({'Student': r.Student,'d_to_C1': round(d1,2),'d_to_C2': round(d2,2),'Cluster': cluster})

dist_df = pd.DataFrame(rows)
dist_df

  Student  d_to_C1  d_to_C2 Cluster
0       A     0.00     5.72      C1
1       B     2.28     7.09      C1
2       C     5.72     0.00      C2
3       D     4.27     2.62      C2
4       E     1.36     4.61      C1


In [5]:
# Recompute centroids and second-pass assignment
import numpy as np

cluster1_pts = points[points.Student.isin(dist_df[dist_df.Cluster=='C1'].Student)][['screen','sleep']].values
cluster2_pts = points[points.Student.isin(dist_df[dist_df.Cluster=='C2'].Student)][['screen','sleep']].values

new_C1 = cluster1_pts.mean(axis=0)
new_C2 = cluster2_pts.mean(axis=0)
print('New_C1 (mean of A,B,E):', new_C1)
print('New_C2 (mean of C,D):', new_C2)

rows2 = []
for _, r in points.iterrows():
    d1 = math.sqrt((r.screen - new_C1[0])**2 + (r.sleep - new_C1[1])**2)
    d2 = math.sqrt((r.screen - new_C2[0])**2 + (r.sleep - new_C2[1])**2)
    cluster = 'C1' if d1 < d2 else 'C2'
    rows2.append({'Student': r.Student,'d_to_newC1': round(d1,2),'d_to_newC2': round(d2,2),'Cluster': cluster})

second_df = pd.DataFrame(rows2)
second_df, dist_df[['Student','Cluster']].equals(second_df[['Student','Cluster']])

New_C1 (mean of A,B,E): [4.86666667 6.16666667]
New_C2 (mean of C,D): [9.65 6.1 ]



(  Student  d_to_newC1  d_to_newC2 Cluster
 0       A        0.44        4.88      C1
 1       B        1.93        5.97      C1
 2       C        5.71        1.31      C2
 3       D        4.08        1.31      C2
 4       E        1.61        3.97      C1,
 True)

In [6]:
# Summary validation (manual only, no sklearn)
print('Manual first-pass assignments:')
print(dist_df[['Student','Cluster']])
print('\nSecond pass (should match):')
print(second_df[['Student','Cluster']])

if dist_df[['Student','Cluster']].equals(second_df[['Student','Cluster']]):
    print('\nCentroid update did not change assignments -> converged.')
else:
    print('\nAssignments changed -> would need more iterations (unexpected).')

# Simple cluster descriptions
print('\nCluster 1 members:', list(dist_df[dist_df.Cluster=='C1'].Student))
print('Cluster 2 members:', list(dist_df[dist_df.Cluster=='C2'].Student))


Manual first-pass assignments:
  Student Cluster
0       A      C1
1       B      C1
2       C      C2
3       D      C2
4       E      C1

Second pass (should match):
  Student Cluster
0       A      C1
1       B      C1
2       C      C2
3       D      C2
4       E      C1

Centroid update did not change assignments -> converged.

Cluster 1 members: ['A', 'B', 'E']
Cluster 2 members: ['C', 'D']
