# Notebook 01: Data Exploration & Understanding

Purpose:
- Inspect raw datasets
- Understand available signals
- Identify activity, time, and outcome columns
- Prepare for schema design

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)

## Datasets Loaded

This notebook explores the following raw datasets:

1. Student Study Habits
2. Enhanced Student Habits & Performance
3. Time Management and Productivity Insights

Each dataset is explored independently.
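
Since the same first-look checks apply to all three datasets, a small helper can keep them consistent. The sketch below is one possible shape for it; `quick_overview` is a hypothetical name, not part of the notebook, and the demo frame is a stand-in for `df_study`, `df_habits`, or `df_time`.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Tiny stand-in frame (made-up values) used only to show the output shape.
demo = pd.DataFrame({
    "hours": [1.5, 2.0, None],
    "label": ["a", "b", "b"],
})
overview = quick_overview(demo)
print(overview)
```

Calling `quick_overview(df_study)` right after loading would give the same table for the real data.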

In [5]:
df_study = pd.read_csv("../data/raw/student_study_habits.csv")
df_study.shape

(500, 13)

In [7]:
df_study.head(5)

Unnamed: 0,study_hours_per_week,sleep_hours_per_day,attendance_percentage,assignments_completed,final_grade,participation_level_Low,participation_level_Medium,internet_access_Yes,parental_education_High School,parental_education_Master's,parental_education_PhD,extracurricular_Yes,part_time_job_Yes
0,0.52723,0.685236,0.993245,0.222222,71.104897,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.4214,0.881883,0.883478,0.555556,62.240021,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.552393,0.220286,0.683469,1.0,65.268855,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.698283,0.612594,0.520094,0.222222,66.609921,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.405419,0.369871,0.831127,0.333333,58.967484,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0


In [9]:
df_study.columns

Index(['study_hours_per_week', 'sleep_hours_per_day', 'attendance_percentage', 'assignments_completed', 'final_grade', 'participation_level_Low', 'participation_level_Medium', 'internet_access_Yes',
       'parental_education_High School', 'parental_education_Master's', 'parental_education_PhD', 'extracurricular_Yes', 'part_time_job_Yes'],
      dtype='object')
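
In the preview above, every column except final_grade shows values between 0 and 1, and the `_Yes` / `_Low` style suffixes look like one-hot encoding. That suggests, but does not prove, that the file was scaled and encoded before it was published. The sketch below is one way to check this over a whole frame; `describe_scaling` is a hypothetical helper, and the demo uses three preview rows with four of the columns.

```python
import pandas as pd


def describe_scaling(df: pd.DataFrame) -> pd.DataFrame:
    """Report min/max per numeric column and flag 0-1 ranges and binary columns."""
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({"min": numeric.min(), "max": numeric.max()})
    # True when every value lies in [0, 1], which is consistent with min-max scaling.
    report["in_unit_interval"] = (report["min"] >= 0) & (report["max"] <= 1)
    # True when the column only contains 0 and 1, which is consistent with a dummy column.
    report["is_binary"] = numeric.isin([0, 1]).all()
    return report


# First three preview rows, restricted to four columns for brevity.
sample = pd.DataFrame({
    "study_hours_per_week": [0.52723, 0.4214, 0.552393],
    "attendance_percentage": [0.993245, 0.883478, 0.683469],
    "final_grade": [71.104897, 62.240021, 65.268855],
    "extracurricular_Yes": [1.0, 0.0, 1.0],
})
scaling = describe_scaling(sample)
print(scaling)
```

Running `describe_scaling(df_study)` on the full data would show whether the pattern holds beyond the first rows.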

### Student Study Habits — Initial Observations
#### Columns related to study behavior

- study_hours_per_week: weekly academic effort.
- assignments_completed: proxy for task completion and consistency.
- attendance_percentage: reflects discipline and study routine.
- participation_level_Low, participation_level_Medium: engagement level in academic activities.

#### Columns related to health and extracurriculars

- sleep_hours_per_day: health and recovery indicator.
- extracurricular_Yes: participation in non-academic activities (sports, clubs, etc.).
- part_time_job_Yes: time commitment outside academics that may affect productivity.

#### Columns related to outcomes

- final_grade: academic performance outcome influenced by study and habits.

#### Columns possibly irrelevant to NEEL (or low priority)

- internet_access_Yes: contextual factor, not a behavioral signal.
- parental_education_High School, parental_education_Master's, parental_education_PhD: background attributes, not actionable for day-to-day behavior analysis.
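
The participation columns cover only Low and Medium, so a third level was presumably dropped as the baseline when the data was one-hot encoded. For schema design it may be easier to work with a single categorical column. The sketch below shows one way to rebuild it; `collapse_dummies` is a hypothetical helper, and the baseline name "High" is an assumption, since the file does not state it.

```python
import pandas as pd


def collapse_dummies(df: pd.DataFrame, prefix: str, baseline: str) -> pd.Series:
    """Rebuild one categorical column from dummy columns named '<prefix>_<level>'."""
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    # Column with the largest value per row, with the prefix stripped off.
    labels = df[cols].idxmax(axis=1).str[len(prefix) + 1:]
    # Rows where every dummy is 0 belong to the level dropped during encoding.
    all_zero = df[cols].sum(axis=1) == 0
    return labels.mask(all_zero, baseline).rename(prefix)


# Five preview rows plus one made-up all-zero row to exercise the baseline case.
sample = pd.DataFrame({
    "participation_level_Low": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "participation_level_Medium": [1.0, 1.0, 1.0, 0.0, 1.0, 0.0],
})
# "High" is an assumed label for the dropped level.
participation = collapse_dummies(sample, "participation_level", baseline="High")
print(participation.tolist())
```

The same call with `prefix="parental_education"` would work for the parental education dummies, again with an assumed baseline label.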

In [13]:
df_habits = pd.read_csv("../data/raw/enhanced_student_habits_performance_dataset.csv")
df_habits.shape

(80000, 31)

In [14]:
df_habits.head(5)

Unnamed: 0,student_id,age,gender,major,study_hours_per_day,social_media_hours,netflix_hours,part_time_job,attendance_percentage,sleep_hours,diet_quality,exercise_frequency,parental_education_level,internet_quality,mental_health_rating,extracurricular_participation,previous_gpa,semester,stress_level,dropout_risk,social_activity,screen_time,study_environment,access_to_tutoring,family_income_range,parental_support_level,motivation_level,exam_anxiety_score,learning_style,time_management_score,exam_score
0,100000,26,Male,Computer Science,7.645367,3.0,0.1,Yes,70.3,6.2,Poor,3,Some College,High,6.0,Yes,4.0,5,5.8,No,0,10.9,Co-Learning Group,Yes,High,9,7,8,Reading,3.0,100
1,100001,28,Male,Arts,5.7,0.5,0.4,No,88.4,7.2,Good,4,PhD,Low,6.8,No,4.0,7,5.8,No,5,8.3,Co-Learning Group,Yes,Low,7,2,10,Reading,6.0,99
2,100002,17,Male,Arts,2.4,4.2,0.7,No,82.1,9.2,Good,4,High School,Low,5.7,Yes,3.79,4,8.0,No,5,8.0,Library,Yes,High,3,9,6,Kinesthetic,7.6,98
3,100003,27,Other,Psychology,3.4,4.6,2.3,Yes,79.3,4.2,Fair,3,Master,Medium,8.5,Yes,4.0,6,4.6,No,3,11.7,Co-Learning Group,Yes,Low,5,3,10,Reading,3.2,100
4,100004,25,Female,Business,4.7,0.8,2.7,Yes,62.9,6.5,Good,6,PhD,Low,9.2,No,4.0,4,5.7,No,2,9.4,Quiet Room,Yes,Medium,9,1,10,Reading,7.1,98


In [15]:
df_habits.columns

Index(['student_id', 'age', 'gender', 'major', 'study_hours_per_day', 'social_media_hours', 'netflix_hours', 'part_time_job', 'attendance_percentage', 'sleep_hours', 'diet_quality',
       'exercise_frequency', 'parental_education_level', 'internet_quality', 'mental_health_rating', 'extracurricular_participation', 'previous_gpa', 'semester', 'stress_level', 'dropout_risk',
       'social_activity', 'screen_time', 'study_environment', 'access_to_tutoring', 'family_income_range', 'parental_support_level', 'motivation_level', 'exam_anxiety_score', 'learning_style',
       'time_management_score', 'exam_score'],
      dtype='object')

### Enhanced Student Habits & Performance — Initial Observations

#### Activity-related columns

- study_hours_per_day: primary academic activity indicator.
- social_media_hours: leisure / distraction-related activity.
- netflix_hours: passive entertainment activity.
- extracurricular_participation: engagement in non-academic activities.
- part_time_job: external work commitment affecting available time.
- social_activity: level of social engagement outside academics.
- screen_time: overall digital exposure combining work and leisure.

#### Time-related columns

- sleep_hours: recovery and health-related time allocation.
- study_hours_per_day: daily academic time investment.
- screen_time: time spent on digital devices.
- exercise_frequency: regularity of physical activity (time proxy).
- semester: temporal academic progression.

#### Performance indicators

- exam_score: immediate academic outcome.
- previous_gpa: historical academic performance.
- attendance_percentage: consistency and discipline indicator.
- time_management_score: self-reported efficiency in handling tasks.
- dropout_risk: high-level risk outcome related to disengagement.

#### Contextual / background columns (not direct behavior signals)

- age, gender, major: demographic context.
- parental_education_level, family_income_range, parental_support_level: background factors.
- internet_quality, access_to_tutoring, study_environment: environmental enablers.
- learning_style: preference indicator, not a behavior.
- diet_quality, mental_health_rating, stress_level, exam_anxiety_score, motivation_level: psychological and health context.

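#### Relevance to NEEL

This dataset covers daily activities, health, leisure, and academic performance in a single table.

It is especially useful for behavioral analysis, habit detection, and goal-alignment reasoning. The column list contains no date or timestamp (semester is the only temporal field), so weekly or monthly views would have to be derived from the per-day figures rather than read from an observed time series.

Background and demographic variables should be treated as context, not as direct drivers of suggestions.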
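
One way to keep context columns out of the behavioral features is to record the role of each column once and select by role. The sketch below copies the groupings from the observations above; `ROLES`, `select_roles`, and `unassigned_columns` are illustrative names, and the demo frame holds two preview rows with a handful of columns.

```python
import pandas as pd

# Role assignments taken from the observations in this notebook.
ROLES = {
    "activity": ["study_hours_per_day", "social_media_hours", "netflix_hours",
                 "extracurricular_participation", "part_time_job",
                 "social_activity", "screen_time"],
    "time": ["sleep_hours", "study_hours_per_day", "screen_time",
             "exercise_frequency", "semester"],
    "performance": ["exam_score", "previous_gpa", "attendance_percentage",
                    "time_management_score", "dropout_risk"],
    "context": ["age", "gender", "major", "parental_education_level",
                "family_income_range", "parental_support_level",
                "internet_quality", "access_to_tutoring", "study_environment",
                "learning_style", "diet_quality", "mental_health_rating",
                "stress_level", "exam_anxiety_score", "motivation_level"],
}


def select_roles(df: pd.DataFrame, roles: dict, wanted: list) -> pd.DataFrame:
    """Return the columns of df that belong to any of the wanted roles."""
    cols = [c for role in wanted for c in roles[role] if c in df.columns]
    # Some columns sit in two roles; dict.fromkeys drops repeats and keeps order.
    return df[list(dict.fromkeys(cols))]


def unassigned_columns(df: pd.DataFrame, roles: dict) -> list:
    """List columns that no role claims, such as identifiers."""
    assigned = {c for cols in roles.values() for c in cols}
    return [c for c in df.columns if c not in assigned]


# Two preview rows, restricted to six columns for brevity.
sample = pd.DataFrame({
    "student_id": [100000, 100001],
    "age": [26, 28],
    "major": ["Computer Science", "Arts"],
    "study_hours_per_day": [7.645367, 5.7],
    "sleep_hours": [6.2, 7.2],
    "exam_score": [100, 99],
})
features = select_roles(sample, ROLES, ["activity", "time"])
print(list(features.columns))
print(unassigned_columns(sample, ROLES))
```

Keeping the mapping in one place also makes it easy to spot columns, such as student_id, that have not been classified yet.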

In [16]:
df_time = pd.read_csv("../data/raw/Time Management and Productivity Insights.csv")
df_time.shape

(85, 9)

In [17]:
df_time.head(5)

Unnamed: 0,User ID,Age,Daily Work Hours,Daily Leisure Hours,Daily Exercise Minutes,Daily Sleep Hours,Productivity Score,Screen Time (hours),Commute Time (hours)
0,1,62,5.5,4.0,92,5.2,55,3.7,0.6
1,2,32,4.8,3.5,6,8.8,69,7.2,1.9
2,3,52,3.4,2.1,75,7.2,68,3.3,2.0
3,4,50,9.4,4.0,53,6.9,91,7.5,1.6
4,5,63,8.7,5.6,46,7.4,72,2.8,2.3


In [18]:
df_time.columns

Index(['User ID', 'Age', 'Daily Work Hours', 'Daily Leisure Hours', 'Daily Exercise Minutes', 'Daily Sleep Hours', 'Productivity Score', 'Screen Time (hours)', 'Commute Time (hours)'], dtype='object')
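
Unlike the two student datasets, these labels contain spaces and parentheses, which is awkward for schema design and attribute-style access. The sketch below shows one possible normalization to snake_case; `to_snake_case` is a hypothetical helper, and the empty frame stands in for `df_time`.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column label and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


# Empty frame carrying the column labels listed above.
raw = pd.DataFrame(columns=[
    "User ID", "Age", "Daily Work Hours", "Daily Leisure Hours",
    "Daily Exercise Minutes", "Daily Sleep Hours", "Productivity Score",
    "Screen Time (hours)", "Commute Time (hours)",
])
# DataFrame.rename accepts a function and applies it to every column label.
renamed = raw.rename(columns=to_snake_case)
print(list(renamed.columns))
```

The unit in parentheses survives as a suffix, so "Screen Time (hours)" becomes screen_time_hours.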

### Time Management & Productivity — Initial Observations

#### Activity-related columns

- Daily Work Hours: time spent on work or primary responsibilities.
- Daily Leisure Hours: time allocated to non-work, leisure activities.
- Daily Exercise Minutes: physical activity contributing to health and energy.
- Commute Time (hours): unavoidable daily time cost affecting available productivity.

#### Time & recovery-related columns

- Daily Sleep Hours: recovery and rest indicator.
- Screen Time (hours): digital exposure influencing focus and fatigue.

#### Productivity / outcome indicators

- Productivity Score: overall daily productivity outcome influenced by time allocation and habits.

#### Contextual / background columns

- Age: demographic context, not a behavioral signal.
- User ID: identifier, not a feature.

#### Relevance to NEEL

This dataset provides a clear daily time-allocation snapshot across work, leisure, exercise, sleep, and commute.

It is well-suited for time balance visualization and habit pattern detection. With 85 rows and no date column, weekly and monthly productivity analysis would have to be built by scaling or aggregating the daily figures rather than from observed weekly or monthly records.

The dataset supports NEEL’s goal of explaining how time distribution affects perceived productivity, rather than predicting raw scores.
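
To reason about time distribution, it can help to express each activity as a share of a 24-hour day. The sketch below is one way to do that under stated assumptions: exercise minutes are converted to hours, Screen Time (hours) is left out because it may overlap with work and leisure, and whatever remains is reported as unaccounted time. `time_shares` and the two derived column names are hypothetical; the demo uses the first two preview rows.

```python
import pandas as pd

# Columns already expressed in hours; screen time is excluded (assumed to overlap).
HOUR_COLUMNS = ["Daily Work Hours", "Daily Leisure Hours",
                "Daily Sleep Hours", "Commute Time (hours)"]


def time_shares(df: pd.DataFrame) -> pd.DataFrame:
    """Return each activity as a fraction of a 24-hour day, one row per user."""
    hours = df[HOUR_COLUMNS].copy()
    hours["Daily Exercise Hours"] = df["Daily Exercise Minutes"] / 60
    # Remainder of the day; negative if the reported hours exceed 24.
    hours["Unaccounted Hours"] = 24 - hours.sum(axis=1)
    return hours.div(24)


# First two preview rows.
sample = pd.DataFrame({
    "User ID": [1, 2],
    "Age": [62, 32],
    "Daily Work Hours": [5.5, 4.8],
    "Daily Leisure Hours": [4.0, 3.5],
    "Daily Exercise Minutes": [92, 6],
    "Daily Sleep Hours": [5.2, 8.8],
    "Productivity Score": [55, 69],
    "Screen Time (hours)": [3.7, 7.2],
    "Commute Time (hours)": [0.6, 1.9],
})
shares = time_shares(sample)
print(shares.round(3))
```

Each row of the result sums to 1, so the shares can be plotted directly as a stacked bar per user and compared against Productivity Score.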