# Project Title: Mental Health

## About Dataset
This dataset was researched and prepared by AI Inventor Emirhan BULUT. It captures information on social media usage and the dominant emotional state of users based on their activities, which makes it well suited to exploring the relationship between social media usage patterns and emotional well-being.

Files:

train.csv: Data for training models.<br>
test.csv: Data for testing models.<br>
val.csv: Data for validation purposes.<br>

**Dataset Overview:**

🌐 Source: __Dataset is taken from__ <font color=blue>**Kaggle**</font>

**Columns:**<br>
* User_ID: Unique identifier for the user.<br>
* Age: Age of the user.<br>
* Gender: Gender of the user (Female, Male, Non-binary).<br>
* Platform: Social media platform used (e.g., Instagram, Twitter, Facebook, LinkedIn, Snapchat, Whatsapp, Telegram).<br>
* Daily_Usage_Time (minutes): Daily time spent on the platform in minutes.<br>
* Posts_Per_Day: Number of posts made per day.<br>
* Likes_Received_Per_Day: Number of likes received per day.<br>
* Comments_Received_Per_Day: Number of comments received per day.<br>
* Messages_Sent_Per_Day: Number of messages sent per day.<br>
* Dominant_Emotion: User's dominant emotional state during the day (e.g., Happiness, Sadness, Anger, Anxiety, Boredom, Neutral).<br>

**Attribution:**<br>

You can use this dataset ethically and responsibly in accordance with the MIT license for educational and research purposes. If you use this dataset in your work, a citation or reference to this dataset would be appreciated.

## Introduction:
Social media has become central to daily life, changing how we live, work, and interact with others. This dataset aims to provide insight into social media usage and the dominant emotional state of users.<br>

The goal of this project is to analyze the relationship between social media usage patterns and emotional well-being, and to train models that predict the dominant emotion.

**About Author:**<br>
**Author: Saadat Khalid Awan**

|Contact Information||
|---|---|
|Email:|	me.saadi96@gmail.com|
|Website:|	https://thesaadat.blogspot.com|

🌐 Let's Connect:

|Platform|	Badge|	URL|
|---|---|---|
|Facebook|	Facebook|	Saadat.Khalid.Awan|
|Instagram|	Instagram|	saadii_awan66|
|LinkedIn|	LinkedIn|	saadatawan|
|Medium|	Medium|	@me.saadat|
|Pinterest|	Pinterest|	its_saadatkhalid|
|Quora|	Quora|	Saadat-Khalid-Awan|
|TikTok|	TikTok|	saadat.awan|
|Twitter|	Twitter|	saadat_96|
|YouTube|	YouTube|	saadatkhalidawan|
|Github|	Github|	Saadat-Khalid|

# Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # requires the plotly package to be installed

ModuleNotFoundError: No module named 'plotly'

In [3]:
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

## Data loading

* We have 3 files:<br>
    - test.csv<br>
    - train.csv<br>
    - val.csv<br>

In [None]:
train_df = pd.read_csv("/kaggle/input/social-media-usage-and-emotional-well-being/train.csv", on_bad_lines='skip')
test_df = pd.read_csv("/kaggle/input/social-media-usage-and-emotional-well-being/test.csv", on_bad_lines='skip')
val_df = pd.read_csv("/kaggle/input/social-media-usage-and-emotional-well-being/val.csv", on_bad_lines='skip')

* Before diving into the analysis, let's take a quick look at all three files. Afterwards we'll explore the training dataset.

In [None]:
# viewing the training, testing, and validation data
print("Training Data:")
display(train_df.head())
print("----------------------------------------------------------------------")
print("Testing Data:")
display(test_df.head())
print("----------------------------------------------------------------------")
print("Validation Data:")
display(val_df.head())


In [None]:
# training data info
print("Training Data Info:")
train_df.info()

In [None]:
# testing data info
print("Testing Data Info:")
test_df.info()

In [None]:
# validation data info
print("Validation Data Info:")
val_df.info()

* Shape of the training, testing, and validation data.

In [None]:
print(f"There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in the training data.")
print(f"There are {test_df.shape[0]} rows and {test_df.shape[1]} columns in the testing data.")
print(f"There are {val_df.shape[0]} rows and {val_df.shape[1]} columns in the validation data.")

In [None]:
# checking for null values
print("Training Data:")
display(train_df.isnull().sum())
print("----------------------------------------------------------------------")
print("Testing Data:")
display(test_df.isnull().sum())
print("----------------------------------------------------------------------")
print("Validation Data:")
display(val_df.isnull().sum())

* I can do 2 things to deal with null values:<br>
    * Simply drop them, since there is only 1 null value and it is unlikely to affect our analysis.<br>
    * Fill them with the mean, median, or mode, depending on the type of the column.
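
The two options above can be sketched on a toy frame. The column names mirror the dataset, but the values are made up for illustration, and `toy_df`, `dropped`, and `filled` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one missing value per column
toy_df = pd.DataFrame({
    "Platform": ["Instagram", "Twitter", None, "Instagram"],
    "Posts_Per_Day": [2.0, np.nan, 4.0, 2.0],
})

# Option 1: drop every row that contains a null
dropped = toy_df.dropna()

# Option 2: fill nulls, using the median for the numeric column
# and the mode (most frequent value) for the categorical one
filled = toy_df.copy()
filled["Posts_Per_Day"] = filled["Posts_Per_Day"].fillna(filled["Posts_Per_Day"].median())
filled["Platform"] = filled["Platform"].fillna(filled["Platform"].mode()[0])

print(len(dropped), "rows left after dropping")
print(filled)
```

Dropping is the simpler choice when only a handful of rows are affected; filling keeps every row but introduces values that were never observed.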

## Exploratory Data Analysis (EDA)

* Let's start EDA with the training data.

In [None]:
train_df.head()

* Columns in our data:

In [None]:
# list of columns in the training data
train_df.columns

### Describe the training data

In [None]:
train_df.describe()

### Age Distribution

In [None]:
train_df['Age'].isnull().sum()

In [None]:
train_df['Age'].unique()

* In the Age column we have 4 irregular values:
    * Male, Female, Non-binary, and one other non-numeric entry.
* **_Let's drop/fill these values first_**

In [None]:
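# Remove the non-numeric entries: Male, Female, Non-binary, and a stray Turkish sentence
# ("işte mevcut veri kümesini 1000 satıra tamamlıyorum:", roughly
# "here I am completing the existing dataset to 1000 rows:")

# Replace non-numeric values with NaN
train_df['Age'] = pd.to_numeric(train_df['Age'], errors='coerce')

# Handle NaN values (assign back rather than using inplace on a column selection)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())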

In [None]:
train_df['Age'].unique()

In [None]:
# use `fig` so the matplotlib `plt` import is not shadowed
fig = px.histogram(train_df, x='Age', title='Age Distribution')
fig.show()

### Gender Distribution

In [None]:
train_df['Gender'].unique()

* We have the same issue here: the Gender column is mixed with some numbers, just as the Age column was mixed with text. Let's handle it.

In [None]:
# Function to replace numeric values with NaN
def clean_gender_column(gender_value):
    try:
        # Try converting the value to float, if it succeeds it's a numeric value
        float(gender_value)
        return np.nan
    except ValueError:
        # If conversion fails, it's a valid gender entry
        return gender_value

# Apply the function to the Gender column
train_df['Gender'] = train_df['Gender'].apply(clean_gender_column)

# Fill NaN values with a placeholder or drop them
# Here we fill NaN values with 'Unknown'
train_df['Gender'] = train_df['Gender'].fillna('Unknown')

# Verify the unique values after cleaning
print(train_df['Gender'].unique())

['Female' 'Male' 'Non-binary' 'Unknown']

In [None]:
train_df['Gender'].unique()

In [None]:
train_df['Gender'].value_counts()

In [None]:
fig = px.histogram(train_df, x='Gender', title='Gender Distribution')
fig.show()

### Platform Distribution

In [None]:
train_df['Platform'].unique()

In [None]:
train_df['Platform'].value_counts()

* Let's impute the NaN values with the mode.

In [None]:
# filling with mode
train_df['Platform'] = train_df['Platform'].fillna(train_df['Platform'].mode()[0])

In [None]:
train_df['Platform'].unique()

In [None]:
train_df['Platform'].value_counts()

In [None]:
fig = px.histogram(train_df, x='Platform', title='Platform Distribution')
fig.show()

### Daily Usage Time (minutes) Distribution

In [None]:
fig = px.histogram(train_df, x='Daily_Usage_Time (minutes)', title='Daily Usage Time Distribution')
fig.show()

### Posts Per Day Distribution

In [None]:
train_df['Posts_Per_Day'].unique()

In [None]:
# fill with mode
train_df['Posts_Per_Day'] = train_df['Posts_Per_Day'].fillna(train_df['Posts_Per_Day'].mode()[0])

In [None]:
fig = px.histogram(train_df, x='Posts_Per_Day', title='Posts Per Day Distribution')
fig.show()

### Likes Per Day Distribution

In [None]:
train_df['Likes_Received_Per_Day'].unique()

In [None]:
# filling with mode
train_df['Likes_Received_Per_Day'] = train_df['Likes_Received_Per_Day'].fillna(train_df['Likes_Received_Per_Day'].mode()[0])

In [None]:
fig = px.histogram(train_df, x='Likes_Received_Per_Day', title='Likes Received Per Day Distribution')
fig.show()

### Comments Per Day Distribution

In [None]:
train_df['Comments_Received_Per_Day'].unique()

In [None]:
# filling with mode
train_df['Comments_Received_Per_Day'] = train_df['Comments_Received_Per_Day'].fillna(train_df['Comments_Received_Per_Day'].mode()[0])

In [None]:
train_df['Comments_Received_Per_Day'].unique()

In [None]:
fig = px.histogram(train_df, x='Comments_Received_Per_Day', title='Comments Received Per Day Distribution')
fig.show()

### Messages Per Day Distribution

In [None]:
fig = px.histogram(train_df, x='Messages_Sent_Per_Day', title='Messages Sent Per Day Distribution')
fig.show()

### Emotion Distribution

In [None]:
train_df['Dominant_Emotion'].unique()

In [None]:
# fill with mode
train_df['Dominant_Emotion'] = train_df['Dominant_Emotion'].fillna(train_df['Dominant_Emotion'].mode()[0])

In [None]:
fig = px.pie(train_df, names='Dominant_Emotion', title='Dominant Emotion Distribution')
# adding the values to the pie section
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

## Relationship Between Variables

### Gender and Platform

In [None]:
# Group the data by gender and platform
grouped = train_df.groupby(['Gender', 'Platform'])

# Count the number of rows in each group
counts = grouped.size()

# Print the counts
print(counts)

In [None]:
fig = px.histogram(train_df, x='Gender', color='Platform', title='Platform by Gender Usage')
fig.show()

### Age and Gender

In [None]:
# grouping age with gender
grouped = train_df.groupby(['Age', 'Gender'])

# count the number of rows in each group
counts = grouped.size()

# print the counts  
print(counts)

In [None]:
fig = px.histogram(train_df, x='Age', color='Gender', title='Age by Gender')
fig.show()

### Gender and Platform VS Posts Per Day

In [None]:
fig = px.histogram(train_df, x='Posts_Per_Day', y='Platform', color='Gender', title='Posts Per Day by Gender')
fig.show()

### Gender VS Emotions

In [None]:
# checking the Gender against the Dominant_Emotion
grouped = train_df.groupby(['Gender', 'Dominant_Emotion'])

# count the number of rows in each group
counts = grouped.size()

# print the counts  
print(counts)

In [None]:
# plotting
fig = px.histogram(train_df, x='Gender', color='Dominant_Emotion', title='Dominant Emotion by Gender')
fig.show()

### Platform VS Emotions

In [None]:
# checking the Platform against the Dominant_Emotion
grouped = train_df.groupby(['Platform', 'Dominant_Emotion'])

# count the number of rows in each group
counts = grouped.size()

# print the counts
print(counts)

In [None]:
fig = px.histogram(train_df, x='Platform', color='Dominant_Emotion', title='Dominant Emotion by Platform')
fig.show()

In [None]:
# Create a contingency table
contingency_table = pd.crosstab(train_df['Platform'], train_df['Dominant_Emotion'])

# Plot the heatmap
fig = px.imshow(contingency_table, title='Platform vs Dominant Emotion Heatmap')
fig.show()

### Time Spent VS Emotions

In [None]:
# Daily Usage time by Dominant_Emotion
fig = px.histogram(train_df, x='Daily_Usage_Time (minutes)', color='Dominant_Emotion', title='Time Usage by Dominant Emotion')
fig.show()

### Likes Received VS Emotions

* This is the last relationship I'll look at in this data.

In [None]:
fig = px.histogram(train_df, x='Likes_Received_Per_Day', color='Dominant_Emotion', title='Likes Received vs Dominant Emotion')
fig.show()

## Model Training

In [None]:
# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [None]:
# Checking for missing values
train_df.isnull().sum()

In [None]:
# Let's drop the missing values
train_df.dropna(inplace=True)

In [None]:
# Define features and target
X = train_df.drop(columns=['Dominant_Emotion', 'User_ID']) # Features
y = train_df['Dominant_Emotion'] # Target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Define preprocessing steps, separating the numerical and categorical features
numeric_features = ['Daily_Usage_Time (minutes)', 'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day']
categorical_features = ['Age', 'Gender', 'Platform']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' avoids an error if the test split holds a category unseen in training
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

### Random Forest Classifier

Random Forest is an ensemble learning method that builds many decision trees during training and, for classification, predicts the class that most trees agree on (the mode of their predictions). Combining many trees improves predictive performance and reduces overfitting compared to a single decision tree.
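
To make "the mode of the classes" concrete, here is a minimal sketch of majority voting over the predictions of several trees. The hard-coded `tree_votes` lists are hypothetical stand-ins for fitted decision trees, and `majority_vote` is a helper written for this illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this shows the idea, not the library's implementation.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among the per-tree votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions of three trees for four users
tree_votes = [
    ["Happiness", "Anger",   "Neutral", "Sadness"],  # tree 1
    ["Happiness", "Neutral", "Neutral", "Anxiety"],  # tree 2
    ["Boredom",   "Anger",   "Neutral", "Sadness"],  # tree 3
]

# zip(*...) regroups the votes per user instead of per tree
forest_prediction = [majority_vote(user_votes) for user_votes in zip(*tree_votes)]
print(forest_prediction)
```

A single tree can be wrong about a user (tree 3 on the first user, tree 2 on the second and fourth), yet the combined vote still recovers the label the other two trees agree on.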

In [None]:
# Random Forest Classifier
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', RandomForestClassifier(random_state=42))])

rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_test)

print("Random Forest Classifier Report:")
print(classification_report(y_test, y_pred_rf))

### XGBoost Classifier

XGBoost (Extreme Gradient Boosting) is a scalable and efficient implementation of gradient boosting algorithms. It builds an ensemble of trees in a sequential manner, where each tree tries to correct the errors of the previous trees. XGBoost is known for its high performance and speed, making it popular for structured or tabular data tasks in machine learning competitions and real-world applications.
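
The sequential idea, where each tree tries to correct the errors of the previous trees, can be sketched with plain gradient boosting on a toy regression problem. This is an illustration under simplifying assumptions (squared error, one-split "stumps", made-up data, hypothetical helper names `fit_stump` and `predict_stump`); XGBoost itself adds regularization and uses second-order gradient information, which this sketch omits.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residual."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold].mean()
        right = residual[x > threshold].mean()
        pred = np.where(x <= threshold, left, right)
        error = ((residual - pred) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left, right)
    return best[1:]

def predict_stump(stump, x):
    threshold, left, right = stump
    return np.where(x <= threshold, left, right)

# Toy 1-D regression data (made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 3.1, 2.9, 5.0, 5.2])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant model
errors = []
for _ in range(10):
    residual = y - prediction        # what the ensemble still gets wrong
    stump = fit_stump(x, residual)   # the next tree targets those errors
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(float(((y - prediction) ** 2).mean()))

print([round(e, 3) for e in errors])
```

The printed mean squared error shrinks from round to round, because every new stump is fitted to what the current ensemble still gets wrong.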

In [None]:
from sklearn.preprocessing import LabelEncoder
# Encode target labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

In [None]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# XGBoost Classifier
xgb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', XGBClassifier(random_state=42))])

xgb_pipeline.fit(X_train, y_train)
y_pred_xgb = xgb_pipeline.predict(X_test)

In [None]:
# Decode the predicted labels back to original string labels for the report
y_test_decoded = label_encoder.inverse_transform(y_test)
y_pred_xgb_decoded = label_encoder.inverse_transform(y_pred_xgb)

print("XGBoost Classifier Report:")
print(classification_report(y_test_decoded, y_pred_xgb_decoded))

### Neural Network

In [None]:
from sklearn.compose import ColumnTransformer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

In [None]:
# Define features and target
X = train_df.drop(columns=['Dominant_Emotion', 'User_ID'])
y = train_df['Dominant_Emotion']

# Encode target labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# One-hot encode the target labels for neural network
y_onehot = to_categorical(y_encoded)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)

# Define preprocessing steps for features
numeric_features = ['Daily_Usage_Time (minutes)', 'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day', 'Messages_Sent_Per_Day']
categorical_features = ['Age', 'Gender', 'Platform']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        # handle_unknown='ignore' avoids an error if the test split holds a category unseen in training
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

# Preprocess the features
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

### Build the neural network

In [None]:
# Build the neural network
model = Sequential()
model.add(Dense(128, input_dim=X_train.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(y_onehot.shape[1], activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy:.2f}')

# Make predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

# Decode the predicted and true labels back to original string labels for the report
y_pred_labels = label_encoder.inverse_transform(y_pred_classes)
y_test_labels = label_encoder.inverse_transform(y_test_classes)

In [None]:
# Print classification report
print("Neural Network Classifier Report:")
print(classification_report(y_test_labels, y_pred_labels))

## Insights:

1. Age Distribution: Most users fall between ages 21 and 35, with every age in that range represented. The highest count is at age 27.<br>
2. Gender Distribution: We have 4 values in the Gender column: _Male, Female, Non-Binary, Unknown._<br>
A. Male: 332<br>
B. Female: 344<br>
C. Non-Binary: 248<br>
D. Unknown: 77<br>
3. Platform Distribution: We can see that the majority of the users use Instagram followed by Twitter and Facebook.<br>
A. Instagram 251<br>
B. Twitter 200<br>
C. Facebook 190<br>
D. LinkedIn 120<br>
E. Whatsapp 80<br>
F. Telegram 80<br>
G. Snapchat 80<br>
4. Daily_Usage_Time (minutes): The daily usage time ranges from 40 to 200 minutes. Most users fall in the range of 60 to 90 minutes, while the average is 95 minutes.<br>
5. Posts per Day: The minimum number of posts is 1 and the maximum is 8, while the average is 3.32 posts (the most common value is 2).<br>
6. Likes Received per Day: The minimum number of likes received is 5 and the maximum is 110, while the average is 39.89.<br>
7. Comments Received per Day: The minimum number of comments received is 2 and the maximum is 40, while the average is 15.61.<br>
8. Messages Sent per Day: The minimum number of messages sent is 8 and the maximum is 50, while the average is 22.56.<br>
9. Emotions: We have 6 emotions in total:<br>
A. Happiness: 20.1%<br>
B. Anger: 13%<br>
C. Neutral: 20%<br>
D. Anxiety: 17%<br>
E. Boredom: 14%<br>
F. Sadness: 16%<br>
10. Gender VS Platform:<br>
A. _Instagram_ is the most used app among female users, while Facebook is the least used.<br>
B. _Twitter_ is the most used app among male users, followed by Instagram, while Whatsapp is the least used.<br>
C. _Facebook_ is the most used app among non-binary users.<br>
11. Posts Per Day by Gender:<br>
A. Instagram:<br>
    a. Total Posts Per Day(Female): 885<br>
    b. Total Posts Per Day(Female): 443<br>
B. Twitter/X:<br>
    a. Total Posts Per Day(Female): 226<br>
    b. Total Posts Per Day(Female): 351<br>
C. LinkedIn:<br>
    a. Total Posts Per Day(Female): 60<br>
    b. Total Posts Per Day(Female): 58<br>
12. Dominant Emotion by Gender:<br>
A. **Female**: The dominant emotion is _Happiness_, with a count of 102. _Anger, Neutral, Anxiety_ and _Sadness_ follow with similar counts, in the range 48 to 56. _Boredom_ has the lowest count, 30.<br>
B. **Male**: The dominant emotion is _Happiness_, with a count of 66. _Anger, Neutral, Anxiety, Boredom_ and _Sadness_ follow with similar counts, in the range 46 to 58.<br>
C. **Non-Binary**: The dominant emotion is _Neutral_, with a count of 82. _Anxiety, Sadness_ and _Boredom_ follow with almost the same count, about 46. _Anger_ has the lowest count, 10.<br>
13. Dominant_Emotion appears strongly associated with Platform.<br>
A. _Happiness_ is the dominant emotion on _Instagram_, while _Anger_ has the lowest count there.<br>
B. _Anger_ is the dominant emotion on _Twitter_ and _Whatsapp_, while _Happiness_ has the lowest count on both.<br>
C. _Sadness_ is also a dominant emotion on _Twitter_ and _Snapchat_.<br>
D. _Neutral_ is the dominant emotion on _Facebook_ and _Telegram_.<br>
E. _Boredom_ is the dominant emotion on _LinkedIn_ and _Facebook_.<br>
F. _Anxiety_ is also a dominant emotion on _Facebook_ and other platforms.<br>
14. Daily_Usage_Time (minutes) and Dominant_Emotion appear strongly related.<br>
A. 200 minutes of usage is associated with _Anxiety_.<br>
B. 140 to 190 minutes is associated with _Happiness_.<br>
C. _Anger_ is commonly seen in the range of 60 to 120 minutes.<br>
D. The other emotions also fall in the range of 40 to 130 minutes.<br>
15. Likes_Received_Per_Day appears strongly related to Dominant_Emotion.<br>
A. In the plot, _Happiness_ goes together with 60 to 109 likes received per day.<br>
B. _Anxiety_ appears at 110 to 114 likes received per day, which could mean that receiving more likes is associated with anxiety.<br>
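
The platform and emotion insights above are read off the plots. One possible way to put a number on the association between two categorical columns such as Platform and Dominant_Emotion is Cramér's V, computed from the same kind of contingency table used for the heatmap. The sketch below is an illustration only: `cramers_v` is a hypothetical helper, the toy data is made up, and no value for the real dataset is claimed here.

```python
import numpy as np
import pandas as pd

def cramers_v(table):
    """Cramér's V for a contingency table (0 = no association, 1 = perfect)."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if rows and columns were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data (made up); in the notebook this would be train_df
toy = pd.DataFrame({
    "Platform": ["Instagram"] * 4 + ["Twitter"] * 4,
    "Dominant_Emotion": ["Happiness"] * 4 + ["Anger"] * 4,
})
table = pd.crosstab(toy["Platform"], toy["Dominant_Emotion"])
print(round(cramers_v(table), 3))
```

In this toy frame each platform maps to exactly one emotion, so the statistic reaches its maximum; on real data a value closer to 0 would suggest that the pattern seen in the plots is weak.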

**About Model:**<br>
Random Forest Classifier: **accuracy: 99%** The Random Forest Classifier performs exceptionally well on the 20% hold-out split of the training data, with nearly perfect precision, recall, and f1-scores for most classes and an overall accuracy of 99%. This indicates that the model is highly effective at predicting the dominant emotion from the given features.

XGBoost Classifier: **accuracy: 98%** The XGBoost Classifier performs very well on the same hold-out split, with high precision, recall, and f1-scores for most classes and an overall accuracy of 98%. This indicates that the model is highly effective at predicting the dominant emotion from the given features, though its precision for 'Anxiety' is slightly lower than for the other emotions.

Neural Network: **accuracy: 94%** The neural network classifier is effective at identifying the various emotions, with high overall accuracy and strong class-wise performance. Improving recall for Boredom and Sadness and balancing precision and recall for Neutral could make the classifier more robust.

