
# 📊 Social Media Usage Analysis: Student Dataset

This notebook explores a dataset of 705 students and their social media habits. It covers age, average daily usage hours, sleep, mental health scores, and self-reported effects on academic performance.

## Covered Sections:
1. Basic Data Exploration
2. Missing Data Investigation
3. Duplicate Check
4. Data Validation
5. Descriptive Statistics
6. Platform Stats
7. Usage Patterns
8. Academic Impact
9. Sleep & Mental Health
10. Country-wise Usage Analysis
11. Relationship Status
12. Creating New Categories (Binning)


In [3]:

import numpy as np
import pandas as pd

# Load CSV dataset
df = pd.read_csv(r"C:\Users\wwwsh\OneDrive\Desktop\proj_1\data_sci\kaggle\one\one.csv")


## 1. 📌 Basic Data Exploration

In [4]:

# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
print(df.head())
print(df.tail())

print("Mean Age:", df["Age"].mean())
print("Max Daily Usage (Hours):", df["Avg_Daily_Usage_Hours"].max())
print("Min Daily Usage (Hours):", df["Avg_Daily_Usage_Hours"].min())
print("Mean Mental Health Score:", df["Mental_Health_Score"].mean())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Student_ID                    705 non-null    int64  
 1   Age                           705 non-null    int64  
 2   Gender                        705 non-null    object 
 3   Academic_Level                705 non-null    object 
 4   Country                       705 non-null    object 
 5   Avg_Daily_Usage_Hours         705 non-null    float64
 6   Most_Used_Platform            705 non-null    object 
 7   Affects_Academic_Performance  705 non-null    object 
 8   Sleep_Hours_Per_Night         705 non-null    float64
 9   Mental_Health_Score           705 non-null    int64  
 10  Relationship_Status           705 non-null    object 
 11  Conflicts_Over_Social_Media   705 non-null    int64  
 12  Addicted_Score                705 non-null    int64  
dtypes: fl

## 2. 🔍 Missing Data Investigation

In [5]:

print("Missing Values per Column:")
print(df.isnull().sum())


Missing Values per Column:
Student_ID                      0
Age                             0
Gender                          0
Academic_Level                  0
Country                         0
Avg_Daily_Usage_Hours           0
Most_Used_Platform              0
Affects_Academic_Performance    0
Sleep_Hours_Per_Night           0
Mental_Health_Score             0
Relationship_Status             0
Conflicts_Over_Social_Media     0
Addicted_Score                  0
dtype: int64


## 3. 📑 Duplicate Check

In [6]:

print("Duplicate Student_IDs:", df['Student_ID'].duplicated().sum())
print("Duplicate Rows:", df.duplicated().sum())


Duplicate Student_IDs: 0
Duplicate Rows: 0
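
The check above counts only the second and later occurrences of a repeated value. When duplicates do exist, it often helps to see every row involved. A minimal sketch on made-up rows (the `toy` frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up rows for illustration only: Student_ID 2 appears twice.
toy = pd.DataFrame({"Student_ID": [1, 2, 2, 3], "Age": [18, 19, 19, 20]})

# Default keep="first": only the later occurrence is flagged.
later_only = toy["Student_ID"].duplicated().sum()

# keep=False flags every row that shares its Student_ID with another row.
dup_mask = toy.duplicated(subset="Student_ID", keep=False)
all_involved = toy[dup_mask]

print("Flagged with keep='first':", later_only)
print(all_involved)
```

With `keep=False` both copies are returned, which makes it easier to decide which one to keep.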


## 4. ✅ Data Validation

In [7]:

print("Invalid Age Entries:", (df['Age'] < 0).sum())
print("Unrealistic Daily Usage (>24hrs):", (df['Avg_Daily_Usage_Hours'] > 24).sum())
print("Invalid Sleep Hours (<0):", (df['Sleep_Hours_Per_Night'] < 0).sum())
print("Invalid Sleep Hours (>24):", (df['Sleep_Hours_Per_Night'] > 24).sum())


Invalid Age Entries: 0
Unrealistic Daily Usage (>24hrs): 0
Invalid Sleep Hours (<0): 0
Invalid Sleep Hours (>24): 0
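
The range checks above could also be driven from a single table of limits. The sketch below is one possible way to do that; `count_out_of_range`, the `limits` dictionary, the upper age bound of 120, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def count_out_of_range(series: pd.Series, low: float, high: float) -> int:
    """Hypothetical helper: count values outside the inclusive range [low, high].

    Missing values are also counted, because between() returns False for NaN.
    """
    return int((~series.between(low, high)).sum())


# Made-up sample rows (not from the dataset), each with one bad value per column.
sample = pd.DataFrame({
    "Age": [18, 21, -3, 24],
    "Avg_Daily_Usage_Hours": [4.5, 25.0, 3.2, 6.1],
    "Sleep_Hours_Per_Night": [7.0, 6.5, 8.0, 30.0],
})

# Assumed plausible limits; adjust them for the population being studied.
limits = {
    "Age": (0, 120),
    "Avg_Daily_Usage_Hours": (0, 24),
    "Sleep_Hours_Per_Night": (0, 24),
}

report = {col: count_out_of_range(sample[col], low, high)
          for col, (low, high) in limits.items()}
print(report)
```

Keeping the limits in one place makes it harder to forget one side of a range, such as the lower bound on usage hours.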


## 5. 📊 Descriptive Statistics

In [8]:

print("Age - Mean:", df["Age"].mean(), ", Mode:", df["Age"].mode()[0])
print("Gender Count:", df["Gender"].value_counts())
print("Avg Daily Usage - Mean:", df["Avg_Daily_Usage_Hours"].mean(), ", Mode:", df["Avg_Daily_Usage_Hours"].mode()[0])
print("Mental Health Score - Mean:", df["Mental_Health_Score"].mean())


Age - Mean: 20.659574468085108 , Mode: 20
Gender Count: Gender
Female    353
Male      352
Name: count, dtype: int64
Avg Daily Usage - Mean: 4.918723404255319 , Mode: 4.7
Mental Health Score - Mean: 6.226950354609929


## 6. Platform Stats

In [9]:

pf = df['Most_Used_Platform'].value_counts()
print(pf)
print("Most Used Platform:", pf.idxmax(), "| Count:", pf.max())

print("Gender vs Platform Crosstab:")
print(pd.crosstab(df['Most_Used_Platform'], df['Gender']))


Most_Used_Platform
Instagram    249
TikTok       154
Facebook     123
WhatsApp      54
Twitter       30
LinkedIn      21
WeChat        15
Snapchat      13
VKontakte     12
LINE          12
KakaoTalk     12
YouTube       10
Name: count, dtype: int64
Most Used Platform: Instagram | Count: 249
Gender vs Platform Crosstab:
Gender              Female  Male
Most_Used_Platform              
Facebook                24    99
Instagram              172    77
KakaoTalk               12     0
LINE                    12     0
LinkedIn                 8    13
Snapchat                 8     5
TikTok                  86    68
Twitter                 16    14
VKontakte                0    12
WeChat                   4    11
WhatsApp                11    43
YouTube                  0    10
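
Raw counts in a crosstab are hard to compare when platforms differ a lot in size. One option is to show row shares with `normalize="index"`. A small sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Most_Used_Platform": ["Instagram", "Instagram", "Instagram",
                           "Facebook", "Facebook", "TikTok"],
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
shares = pd.crosstab(toy["Most_Used_Platform"], toy["Gender"], normalize="index")
print(shares.round(2))
```

Each row then reads as the gender split within one platform.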


## 7. 📈 Usage Patterns

In [10]:

print(df.groupby('Academic_Level')["Avg_Daily_Usage_Hours"].mean())
print(df.groupby('Gender')["Avg_Daily_Usage_Hours"].mean())

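# Bin ages into right-closed intervals: (17, 19], (19, 21], (21, 24], (24, 100]
df["Age_Group"] = pd.cut(df["Age"], bins=[17, 19, 21, 24, 100], labels=["18-19", "20-21", "22-24", "25+"])
# observed=False keeps the empty "25+" group and states the grouping behaviour explicitly
print(df.groupby("Age_Group", observed=False)["Avg_Daily_Usage_Hours"].mean())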


Academic_Level
Graduate         4.776923
High School      5.544444
Undergraduate    5.001416
Name: Avg_Daily_Usage_Hours, dtype: float64
Gender
Female    5.011048
Male      4.826136
Name: Avg_Daily_Usage_Hours, dtype: float64
Age_Group
18-19    5.141243
20-21    4.940187
22-24    4.695169
25+           NaN
Name: Avg_Daily_Usage_Hours, dtype: float64
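
The `NaN` for "25+" appears because grouping on a categorical column can include categories that have no rows. The sketch below shows how the `observed` argument of `groupby` controls this, using made-up ages and hours (the `ages` frame is hypothetical):

```python
import pandas as pd

# Made-up rows for illustration only; no one is older than 24.
ages = pd.DataFrame({"Age": [18, 19, 20, 21, 24],
                     "Hours": [5.0, 6.0, 4.0, 5.0, 3.0]})
ages["Age_Group"] = pd.cut(ages["Age"], bins=[17, 19, 21, 24, 100],
                           labels=["18-19", "20-21", "22-24", "25+"])

# observed=False: every category is listed, empty ones get NaN.
with_empty = ages.groupby("Age_Group", observed=False)["Hours"].mean()

# observed=True: only categories that actually occur are listed.
only_seen = ages.groupby("Age_Group", observed=True)["Hours"].mean()

print(with_empty)
print(only_seen)
```

Passing `observed` explicitly also avoids relying on the pandas default for categorical groupers.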



## 8. 🎓 Academic Impact

In [11]:

print(df['Affects_Academic_Performance'].value_counts())
print(df.groupby('Affects_Academic_Performance')['Avg_Daily_Usage_Hours'].mean())


Affects_Academic_Performance
Yes    453
No     252
Name: count, dtype: int64
Affects_Academic_Performance
No     3.804365
Yes    5.538631
Name: Avg_Daily_Usage_Hours, dtype: float64
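
To put the output above in proportion, the counts can be turned into shares and the two group means into a gap. The numbers below are copied from that output. Note that the column is a self-reported answer, so the gap describes an association rather than a cause.

```python
import pandas as pd

# Counts and group means copied from the cell output above.
counts = pd.Series({"Yes": 453, "No": 252})
mean_hours = pd.Series({"No": 3.804365, "Yes": 5.538631})

share = counts / counts.sum()                # fraction of students per answer
gap = mean_hours["Yes"] - mean_hours["No"]   # difference in mean daily hours

print(share.round(3))
print("Mean usage gap (hours):", round(gap, 2))
```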


## 9. 💤 Sleep & Mental Health

In [12]:

print("Correlation between Usage & Sleep:", df["Avg_Daily_Usage_Hours"].corr(df["Sleep_Hours_Per_Night"]))
print("Correlation between Usage & Mental Health:", df["Avg_Daily_Usage_Hours"].corr(df["Mental_Health_Score"]))

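# pd.cut intervals are right-closed: (0, 6], (6, 8], (8, 24], so exactly 6 hours falls in "<6"
df["Sleep_Group"] = pd.cut(df["Sleep_Hours_Per_Night"], bins=[0, 6, 8, 24], labels=["<6", "6-8", "8+"])
# observed=False keeps every category and states the grouping behaviour explicitly
print(df.groupby("Sleep_Group", observed=False)["Mental_Health_Score"].mean())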


Correlation between Usage & Sleep: -0.790582455479992
Correlation between Usage & Mental Health: -0.801057623162343
Sleep_Group
<6     5.259459
6-8    6.358722
8+     7.336283
Name: Mental_Health_Score, dtype: float64
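
`Series.corr` computes the Pearson coefficient by default. As a check on what that number means, the sketch below computes it by hand on made-up values and compares it with pandas. `pearson_r` is a hypothetical helper and the two series are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only: more usage, less sleep.
usage = pd.Series([2.0, 4.0, 6.0, 8.0])
sleep = pd.Series([8.5, 7.0, 6.5, 5.0])


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson r: sum of co-deviations over the product of deviation norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


r_manual = pearson_r(usage, sleep)
r_pandas = usage.corr(sleep)  # method="pearson" is the default

print(r_manual, r_pandas)
```

A value near -1 means the two columns move in opposite directions almost linearly. It does not by itself show that one causes the other.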



## 10. Country-wise Usage Analysis

In [13]:

ct = df["Country"].value_counts()
print(ct.head())

top = ct.head().index
print(df[df["Country"].isin(top)].groupby("Country")["Avg_Daily_Usage_Hours"].mean())


Country
India     53
USA       40
Canada    34
France    27
Mexico    27
Name: count, dtype: int64
Country
Canada    4.714706
France    4.055556
India     6.116981
Mexico    6.422222
USA       6.890000
Name: Avg_Daily_Usage_Hours, dtype: float64
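
Group means are easier to judge next to the number of rows behind them. One way to get both in a single table is named aggregation, sketched here on made-up rows (the `toy` frame and the column names `n` and `mean_hours` are assumptions):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Avg_Daily_Usage_Hours": [4.0, 5.0, 6.0, 7.0, 9.0, 2.0],
})

# One row per country with its sample size and mean usage, largest groups first.
summary = (
    toy.groupby("Country")["Avg_Daily_Usage_Hours"]
    .agg(n="count", mean_hours="mean")
    .sort_values("n", ascending=False)
)
print(summary)
```

Means based on a few dozen students per country, as in the output above, should be read with that sample size in mind.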


## 11. Relationship Status

In [14]:

rs = df["Relationship_Status"].value_counts()
print(rs)
print(df.groupby("Relationship_Status")["Avg_Daily_Usage_Hours"].mean())
print(df.groupby("Relationship_Status")["Addicted_Score"].mean())


Relationship_Status
Single             384
In Relationship    289
Complicated         32
Name: count, dtype: int64
Relationship_Status
Complicated        4.721875
In Relationship    4.930796
Single             4.926042
Name: Avg_Daily_Usage_Hours, dtype: float64
Relationship_Status
Complicated        7.031250
In Relationship    6.342561
Single             6.458333
Name: Addicted_Score, dtype: float64


## 12. Creating New Categories

In [15]:

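# Light: up to 3 h, Moderate: over 3 and up to 6 h, Heavy: over 6 h (intervals are right-closed)
df["Usage_Category"] = pd.cut(df["Avg_Daily_Usage_Hours"], bins=[0, 3, 6, float("inf")], labels=["Light", "Moderate", "Heavy"])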
print(df["Usage_Category"].value_counts())
print(df["Age_Group"].value_counts())


Usage_Category
Moderate    513
Heavy       143
Light        49
Name: count, dtype: int64
Age_Group
20-21    321
22-24    207
18-19    177
25+        0
Name: count, dtype: int64
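
With `pd.cut`, the bin edges decide where boundary values land. By default each interval excludes its left edge, so a value equal to the lowest edge is left unbinned. The sketch below shows this on made-up hours, along with the `include_lowest` option (the `hours` series is hypothetical):

```python
import pandas as pd

# Made-up values sitting on or near the bin edges.
hours = pd.Series([0.0, 3.0, 3.1, 6.0, 6.1])
bins = [0, 3, 6, float("inf")]
labels = ["Light", "Moderate", "Heavy"]

# Default: intervals are (0, 3], (3, 6], (6, inf), so 0.0 gets no category.
right_closed = pd.cut(hours, bins=bins, labels=labels)

# include_lowest=True also puts the lowest edge (0.0) into the first bin.
with_lowest = pd.cut(hours, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"hours": hours,
                    "default": right_closed,
                    "include_lowest": with_lowest}))
```

If the data could contain students with zero usage, `include_lowest=True` would keep them in the "Light" group instead of dropping them to `NaN`.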
