# **About Dataset**

## **Social Media Usage and Emotional Well-Being**
This dataset was researched and prepared by AI Inventor Emirhan BULUT. It records each user's social media usage alongside their dominant emotional state, which makes it suitable for exploring the relationship between usage patterns and emotional well-being.

## **Features:**
- `User_ID`: Unique identifier for the user.
- `Age`: Age of the user.
- `Gender`: Gender of the user (Female, Male, Non-binary).
- `Platform`: Social media platform used (e.g., Instagram, Twitter, Facebook, LinkedIn, Snapchat, Whatsapp, Telegram).
- `Daily_Usage_Time (minutes)`: Daily time spent on the platform in minutes.
- `Posts_Per_Day`: Number of posts made per day.
- `Likes_Received_Per_Day`: Number of likes received per day.
- `Comments_Received_Per_Day`: Number of comments received per day.
- `Messages_Sent_Per_Day`: Number of messages sent per day.
- `Dominant_Emotion`: User's dominant emotional state during the day (e.g., Happiness, Sadness, Anger, Anxiety, Boredom, Neutral).

## **Files**:
- `train.csv`: Data for training models.
- `test.csv`: Data for testing models.
- `val.csv`: Data for validation purposes.

## **Introduction**
Social media is a fundamental part of modern life, influencing how we live, work, and connect. Its impact extends to our mental and emotional well-being. This analysis explores the relationship between social media usage and emotional well-being, aiming to uncover how online interactions affect our mental health. Understanding these dynamics can help promote healthier social media habits and improve overall emotional health.

# **Code**

## **Import Required Libraries**

In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## **Loading the Data**

In [19]:
train_df = pd.read_csv('train.csv', on_bad_lines = 'skip')
val_df = pd.read_csv('val.csv', on_bad_lines = 'skip')
test_df = pd.read_csv('test.csv', on_bad_lines = 'skip')
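With `on_bad_lines='skip'`, malformed rows are dropped silently, so it is not obvious how many records were lost. The sketch below shows one way to count them by passing a callable instead, which pandas supports only with `engine='python'`. The toy CSV and the helper name `record_bad_line` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Toy CSV (hypothetical): the third data row has one field too many.
raw = io.StringIO(
    "User_ID,Age,Gender\n"
    "1,25,Female\n"
    "2,30,Male\n"
    "3,22,Non-binary,extra\n"
)

skipped = []  # collects every malformed row handed over by the parser

def record_bad_line(fields):
    """Remember the malformed row, then return None so pandas drops it."""
    skipped.append(fields)
    return None

# A callable for on_bad_lines is only accepted by the Python parser engine.
df = pd.read_csv(raw, on_bad_lines=record_bad_line, engine="python")

print(f"kept {len(df)} rows, skipped {len(skipped)}")
print(skipped)
```

The same pattern could be applied to `train.csv`, `val.csv`, and `test.csv` to report how many lines each load discards.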

In [20]:
# Take a look at the data
print("Training data:")
display(train_df.head())
print("---------------------------------------------------------------------------------------------------")
print("Validation data:")
display(val_df.head())
print("---------------------------------------------------------------------------------------------------")
print("Test data:")
display(test_df.head())

Training data:


| | User_ID | Age | Gender | Platform | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day | Dominant_Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | Female | Instagram | 120.0 | 3.0 | 45.0 | 10.0 | 12.0 | Happiness |
| 1 | 2 | 30 | Male | Twitter | 90.0 | 5.0 | 20.0 | 25.0 | 30.0 | Anger |
| 2 | 3 | 22 | Non-binary | Facebook | 60.0 | 2.0 | 15.0 | 5.0 | 20.0 | Neutral |
| 3 | 4 | 28 | Female | Instagram | 200.0 | 8.0 | 100.0 | 30.0 | 50.0 | Anxiety |
| 4 | 5 | 33 | Male | LinkedIn | 45.0 | 1.0 | 5.0 | 2.0 | 10.0 | Boredom |


---------------------------------------------------------------------------------------------------
Validation data:


| | User_ID | Age | Gender | Platform | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day | Dominant_Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 31 | Male | Instagram | 170 | 5 | 80 | 20 | 35 | Happiness |
| 1 | 877 | 32 | Female | Instagram | 155 | 6 | 75 | 25 | 38 | Happiness |
| 2 | 230 | 26 | Non-binary | Facebook | 45 | 1 | 8 | 4 | 12 | Sadness |
| 3 | 876 | 28 | Non-binary | Snapchat | 115 | 3 | 38 | 18 | 27 | Anxiety |
| 4 | 376 | 28 | Non-binary | Snapchat | 115 | 3 | 38 | 18 | 27 | Anxiety |


---------------------------------------------------------------------------------------------------
Test data:


| | User_ID | Age | Gender | Platform | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day | Dominant_Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 500 | 27 | Female | Snapchat | 120 | 4 | 40 | 18 | 22 | Neutral |
| 1 | 488 | 21 | Non-binary | Snapchat | 60 | 1 | 18 | 7 | 12 | Neutral |
| 2 | 776 | 28 | Non-binary | Snapchat | 115 | 3 | 38 | 18 | 27 | Anxiety |
| 3 | 869 | 27 | Male | Telegram | 105 | 3 | 48 | 20 | 28 | Anxiety |
| 4 | 573 | 21 | Non-binary | Facebook | 55 | 3 | 17 | 7 | 12 | Neutral |


In [21]:
# Training Data info
print("Training data Info:")
train_df.info()

Training data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001 entries, 0 to 1000
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     1001 non-null   object 
 1   Age                         1001 non-null   object 
 2   Gender                      1000 non-null   object 
 3   Platform                    1000 non-null   object 
 4   Daily_Usage_Time (minutes)  1000 non-null   float64
 5   Posts_Per_Day               1000 non-null   float64
 6   Likes_Received_Per_Day      1000 non-null   float64
 7   Comments_Received_Per_Day   1000 non-null   float64
 8   Messages_Sent_Per_Day       1000 non-null   float64
 9   Dominant_Emotion            1000 non-null   object 
dtypes: float64(5), object(5)
memory usage: 78.3+ KB


In [22]:
# Shape of training data, validation data, and test data
print(f'There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in the training data.')
print(f'There are {val_df.shape[0]} rows and {val_df.shape[1]} columns in the validation data.')
print(f'There are {test_df.shape[0]} rows and {test_df.shape[1]} columns in the test data.')

There are 1001 rows and 10 columns in the training data.
There are 145 rows and 10 columns in the validation data.
There are 103 rows and 10 columns in the test data.
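In the validation preview, two rows (User_ID 876 and 376) are identical apart from `User_ID`. If such repeats are common, they can end up on both sides of a split and inflate evaluation scores. The sketch below shows one way to count rows whose feature values repeat; the toy frame is a hypothetical subset of the columns, and whether duplicates are frequent in the real files is not established here.

```python
import pandas as pd

# Toy frame (hypothetical) mirroring the two near-identical validation rows.
toy = pd.DataFrame({
    "User_ID": [876, 376, 230],
    "Age": [28, 28, 26],
    "Gender": ["Non-binary", "Non-binary", "Non-binary"],
    "Platform": ["Snapchat", "Snapchat", "Facebook"],
    "Daily_Usage_Time (minutes)": [115, 115, 45],
    "Dominant_Emotion": ["Anxiety", "Anxiety", "Sadness"],
})

# Compare every column except the identifier.
feature_cols = [c for c in toy.columns if c != "User_ID"]

# keep=False flags all members of a duplicate group, not just the repeats.
dup_mask = toy.duplicated(subset=feature_cols, keep=False)
n_dup = int(dup_mask.sum())

print(f"{n_dup} of {len(toy)} rows share their feature values with another row")
```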


In [23]:
# Check for null values:
print("Training data:")
display(train_df.isnull().sum())
print(f'There are {train_df.isnull().sum().sum()} null values in the training data.')
print("------------------------------------------------------------------------------")
print("Validation data:")
display(val_df.isnull().sum())
print(f'There are {val_df.isnull().sum().sum()} null values in the validation data.')
print("------------------------------------------------------------------------------")
print("Test data:")
display(test_df.isnull().sum())
print(f'There are {test_df.isnull().sum().sum()} null values in the test data.')

Training data:


User_ID                       0
Age                           0
Gender                        1
Platform                      1
Daily_Usage_Time (minutes)    1
Posts_Per_Day                 1
Likes_Received_Per_Day        1
Comments_Received_Per_Day     1
Messages_Sent_Per_Day         1
Dominant_Emotion              1
dtype: int64

There are 8 null values in the training data.
------------------------------------------------------------------------------
Validation data:


User_ID                       0
Age                           0
Gender                        0
Platform                      0
Daily_Usage_Time (minutes)    0
Posts_Per_Day                 0
Likes_Received_Per_Day        0
Comments_Received_Per_Day     0
Messages_Sent_Per_Day         0
Dominant_Emotion              1
dtype: int64

There are 1 null values in the validation data.
------------------------------------------------------------------------------
Test data:


User_ID                       0
Age                           0
Gender                        0
Platform                      0
Daily_Usage_Time (minutes)    0
Posts_Per_Day                 0
Likes_Received_Per_Day        0
Comments_Received_Per_Day     0
Messages_Sent_Per_Day         0
Dominant_Emotion              0
dtype: int64

There are 0 null values in the test data.
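The three per-split null reports can also be viewed side by side. A minimal sketch, assuming the splits are held in a dictionary; the toy frames and the name `null_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical) for the real splits.
splits = {
    "train": pd.DataFrame({"Gender": ["Female", np.nan], "Age": [25, 30]}),
    "val": pd.DataFrame({"Gender": ["Male", "Female"], "Age": [31, 32]}),
}

# One column per split, one row per feature, values are null counts.
null_summary = pd.DataFrame(
    {name: frame.isnull().sum() for name, frame in splits.items()}
)

print(null_summary)
```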


## **Exploratory Data Analysis (EDA)**

In [24]:
train_df.head()

| | User_ID | Age | Gender | Platform | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day | Dominant_Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | Female | Instagram | 120.0 | 3.0 | 45.0 | 10.0 | 12.0 | Happiness |
| 1 | 2 | 30 | Male | Twitter | 90.0 | 5.0 | 20.0 | 25.0 | 30.0 | Anger |
| 2 | 3 | 22 | Non-binary | Facebook | 60.0 | 2.0 | 15.0 | 5.0 | 20.0 | Neutral |
| 3 | 4 | 28 | Female | Instagram | 200.0 | 8.0 | 100.0 | 30.0 | 50.0 | Anxiety |
| 4 | 5 | 33 | Male | LinkedIn | 45.0 | 1.0 | 5.0 | 2.0 | 10.0 | Boredom |


In [25]:
# List of columns in the training data
train_df.columns

Index(['User_ID', 'Age', 'Gender', 'Platform', 'Daily_Usage_Time (minutes)',
       'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day',
       'Messages_Sent_Per_Day', 'Dominant_Emotion'],
      dtype='object')

In [26]:
train_df.describe()

| | Daily_Usage_Time (minutes) | Posts_Per_Day | Likes_Received_Per_Day | Comments_Received_Per_Day | Messages_Sent_Per_Day |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 95.95 | 3.321 | 39.898 | 15.611 | 22.56 |
| std | 38.850442 | 1.914582 | 26.393867 | 8.819493 | 8.516274 |
| min | 40.0 | 1.0 | 5.0 | 2.0 | 8.0 |
| 25% | 65.0 | 2.0 | 20.0 | 8.0 | 17.75 |
| 50% | 85.0 | 3.0 | 33.0 | 14.0 | 22.0 |
| 75% | 120.0 | 4.0 | 55.0 | 22.0 | 28.0 |
| max | 200.0 | 8.0 | 110.0 | 40.0 | 50.0 |


### **Age Distribution**

In [27]:
train_df['Age'].isnull().sum()

0

In [28]:
train_df['Age'].value_counts(normalize = True)

Age
28                                                      0.091908
27                                                      0.091908
29                                                      0.089910
22                                                      0.073926
26                                                      0.065934
25                                                      0.063936
24                                                      0.063936
31                                                      0.061938
21                                                      0.055944
33                                                      0.055944
30                                                      0.047952
23                                                      0.047952
35                                                      0.037962
32                                                      0.037962
34                                                      0.035964
Male                 

The `Age` column contains four irregular (non-numeric) values: Male, Female, Non-binary, and one other text entry.
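Rather than reading the irregular entries off a long `value_counts()` listing, they can be isolated directly. A minimal sketch on a toy Series (the sample values are assumptions chosen to resemble the column): anything that `pd.to_numeric` cannot parse, but that was not missing to begin with, is reported.

```python
import pandas as pd

# Toy Age column (hypothetical) mixing numeric strings and stray labels.
ages = pd.Series(["25", "30", "Male", "22", "Non-binary", "Female"], name="Age")

# Values that cannot be parsed become NaN under errors="coerce".
as_number = pd.to_numeric(ages, errors="coerce")

# Keep entries that failed to parse but were present in the raw data.
irregular = ages[as_number.isna() & ages.notna()]

print(irregular.value_counts())
```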

In [29]:
# Remove the irregular Age entries: Male, Female, Non-binary, and a stray Turkish sentence ("here I am completing the existing dataset to 1000 rows:")

# Replace non-numeric values with NaN
train_df['Age'] = pd.to_numeric(train_df['Age'], errors = 'coerce')

# Fill NaN values with the median; assigning the result back avoids a chained in-place fill
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())


In [30]:
train_df['Age'].unique()

array([25., 30., 22., 28., 33., 21., 27., 24., 29., 31., 23., 26., 34.,
       35., 32.])
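Filling a single column is safest when the result is assigned back to the column, instead of calling an in-place method on `df[col]`, which pandas may treat as a temporary copy. A minimal sketch on a toy frame (values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing age.
df = pd.DataFrame({"Age": [25.0, np.nan, 31.0, 28.0]})

# median() ignores NaN by default, so this is the median of 25, 28, 31.
median_age = df["Age"].median()

# Assign the filled Series back to the column.
df["Age"] = df["Age"].fillna(median_age)

# Equivalent frame-level form: df = df.fillna({"Age": median_age})
print(df["Age"].tolist())
```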

In [37]:
# Use `fig` for the Plotly figure so the matplotlib alias `plt` is not overwritten
fig = px.histogram(train_df, x = 'Age', title = 'Age Distribution')
fig.show()

### **Gender Distribution**

In [39]:
train_df['Gender'].unique()

array(['Female', 'Male', 'Non-binary', '27', '24', '29', '33', '31', '22',
       '25', '28', '30', '23', '34', '26', '35', '21', '32', nan],
      dtype=object)

In [41]:
# Replace numeric values with NaN
def replace_numeric_with_nan(value):
    try:
        float(value)
        return np.nan
    except ValueError:
        return value

train_df['Gender'] = train_df['Gender'].apply(replace_numeric_with_nan)

# Label the remaining NaN values as 'Unknown'; assign back instead of filling in place
train_df['Gender'] = train_df['Gender'].fillna('Unknown')

In [42]:
train_df['Gender'].unique()

array(['Female', 'Male', 'Non-binary', 'Unknown'], dtype=object)
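The same cleaning can be written without a row-wise `apply`. The sketch below is an alternative formulation, not the notebook's method: it flags entries that parse as numbers and masks them in one step. The toy Series is an assumption.

```python
import numpy as np
import pandas as pd

# Toy Gender column (hypothetical) with age-like strings and a missing value.
gender = pd.Series(["Female", "27", "Male", np.nan, "Non-binary", "33"])

# True where the entry parses as a number, i.e. it is not a gender label.
looks_numeric = pd.to_numeric(gender, errors="coerce").notna()

# mask() blanks the flagged entries; fillna() then labels every gap.
cleaned = gender.mask(looks_numeric).fillna("Unknown")

print(cleaned.tolist())
```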

In [43]:
train_df['Gender'].value_counts(normalize = True)

Gender
Female        0.343656
Male          0.331668
Non-binary    0.247752
Unknown       0.076923
Name: proportion, dtype: float64

### **Platform Distribution**

In [44]:
train_df['Platform'].unique()

array(['Instagram', 'Twitter', 'Facebook', 'LinkedIn', 'Whatsapp',
       'Telegram', 'Snapchat', nan], dtype=object)

In [45]:
# Fill NaN values with the mode (the most frequent platform); assign back instead of filling in place
train_df['Platform'] = train_df['Platform'].fillna(train_df['Platform'].mode()[0])

In [46]:
train_df['Platform'].unique()

array(['Instagram', 'Twitter', 'Facebook', 'LinkedIn', 'Whatsapp',
       'Telegram', 'Snapchat'], dtype=object)

In [48]:
train_df['Platform'].value_counts(normalize = True)

Platform
Instagram    0.250749
Twitter      0.199800
Facebook     0.189810
LinkedIn     0.119880
Whatsapp     0.079920
Telegram     0.079920
Snapchat     0.079920
Name: proportion, dtype: float64

### **Daily Usage Time (minutes) Distribution**

In [50]:
train_df['Daily_Usage_Time (minutes)'].unique()

array([120.,  90.,  60., 200.,  45., 150.,  85., 110.,  55., 170.,  75.,
        95.,  65., 180., 100.,  40., 125.,  50., 140., 105., 190.,  70.,
        80., 160., 145., 130., 115., 175., 165., 155.,  nan])

In [51]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Daily_Usage_Time (minutes)'] = train_df['Daily_Usage_Time (minutes)'].fillna(train_df['Daily_Usage_Time (minutes)'].mode()[0])

In [54]:
fig = px.histogram(train_df, x='Daily_Usage_Time (minutes)', title='Daily Usage Time Distribution')
fig.show()

### **Posts Per Day Distribution**

In [55]:
train_df['Posts_Per_Day'].unique()

array([ 3.,  5.,  2.,  8.,  1.,  4.,  6.,  7., nan])

In [56]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Posts_Per_Day'] = train_df['Posts_Per_Day'].fillna(train_df['Posts_Per_Day'].mode()[0])

In [57]:
fig = px.histogram(train_df, x='Posts_Per_Day', title='Posts Per Day Distribution')
fig.show()

### **Likes Per Day Distribution**

In [58]:
train_df['Likes_Received_Per_Day'].unique()

array([ 45.,  20.,  15., 100.,   5.,  60.,  30.,  25.,  10.,  80.,  35.,
        12.,  90.,  40.,  55.,  33.,   8.,  70.,  28.,  11.,  95.,  18.,
         9.,  85.,  38.,   6.,  13.,  75.,  27.,  88.,  22.,  78.,  29.,
        50.,  36.,  72.,  65., 110.,  14.,  17., 105.,  43.,  37.,  42.,
        48.,  21.,  24.,  23.,  83.,  nan])

In [59]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Likes_Received_Per_Day'] = train_df['Likes_Received_Per_Day'].fillna(train_df['Likes_Received_Per_Day'].mode()[0])

In [60]:
fig = px.histogram(train_df, x='Likes_Received_Per_Day', title='Likes Received Per Day Distribution')
fig.show()

### **Comments Per Day Distribution**

In [61]:
train_df['Comments_Received_Per_Day'].unique()

array([10., 25.,  5., 30.,  2., 15., 12.,  3., 20.,  7.,  4., 23., 18.,
       22., 14., 26.,  8., 19., 17., 11.,  6.,  9., 13., 40., 16., 35.,
       38., 28., 36., 33., nan])

In [62]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Comments_Received_Per_Day'] = train_df['Comments_Received_Per_Day'].fillna(train_df['Comments_Received_Per_Day'].mode()[0])

In [63]:
fig = px.histogram(train_df, x='Comments_Received_Per_Day', title='Comments Received Per Day Distribution')
fig.show()

### **Messages Per Day Distribution**

In [64]:
train_df['Messages_Sent_Per_Day'].unique()

array([12., 30., 20., 50., 10., 25., 18., 22.,  8., 35., 15., 40., 28.,
       33., 17., 45., 21., 11., 32., 24., 14.,  9., 38., 31., 27., 19.,
       26., 29., 23., nan])

In [65]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Messages_Sent_Per_Day'] = train_df['Messages_Sent_Per_Day'].fillna(train_df['Messages_Sent_Per_Day'].mode()[0])

In [66]:
fig = px.histogram(train_df, x='Messages_Sent_Per_Day', title='Messages Sent Per Day Distribution')
fig.show()

### **Emotion Distribution**

In [67]:
train_df['Dominant_Emotion'].unique()

array(['Happiness', 'Anger', 'Neutral', 'Anxiety', 'Boredom', 'Sadness',
       nan], dtype=object)

In [68]:
# Fill NaN values with the mode; assign back instead of filling in place
train_df['Dominant_Emotion'] = train_df['Dominant_Emotion'].fillna(train_df['Dominant_Emotion'].mode()[0])

In [70]:
fig = px.pie(train_df, names='Dominant_Emotion', title='Dominant Emotion Distribution')
# Show the percentage and label inside each pie slice
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

## **Relationship Between Variables**

### **Gender and Platform**

In [86]:
# Group the data by gender and platform
gender_platform_grouped = train_df.groupby(['Gender', 'Platform']).size().reset_index(name='Count')

gender_platform_grouped

| | Gender | Platform | Count |
|---|---|---|---|
| 0 | Female | Facebook | 10 |
| 1 | Female | Instagram | 150 |
| 2 | Female | LinkedIn | 40 |
| 3 | Female | Snapchat | 28 |
| 4 | Female | Twitter | 66 |
| 5 | Female | Whatsapp | 50 |
| 6 | Male | Facebook | 38 |
| 7 | Male | Instagram | 76 |
| 8 | Male | LinkedIn | 48 |
| 9 | Male | Telegram | 50 |

In [88]:
fig = px.bar(gender_platform_grouped, x='Gender', y='Count', color='Platform', title='Platform by Gender Usage', barmode='group')
fig.show()

### **Age and Gender**

In [93]:
# Group the data by gender and age
gender_age_grouped = train_df.groupby(['Gender', 'Age']).size()

gender_age_grouped


Gender      Age 
Female      21.0    20
            22.0    48
            23.0    10
            24.0    26
            25.0    40
            26.0     8
            27.0    28
            28.0    36
            29.0    28
            30.0     8
            31.0     8
            32.0    28
            33.0    18
            34.0    18
            35.0    20
Male        21.0    10
            22.0     8
            23.0     8
            24.0    10
            25.0    24
            26.0    28
            27.0    44
            28.0    28
            29.0    30
            30.0    40
            31.0    54
            32.0    10
            33.0    10
            34.0    10
            35.0    18
Non-binary  21.0    26
            22.0    18
            23.0    30
            24.0    28
            26.0    30
            27.0    20
            28.0    28
            29.0    32
            33.0    28
            34.0     8
Unknown     27.0    77
dtype: int64
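All 77 `Unknown` gender rows sit at a single age, 27.0. Combined with gender labels appearing in `Age` and age-like strings appearing in `Gender`, this suggests the two fields were swapped in those rows, although the source does not confirm it. Under that assumption, a repair could recover both values instead of imputing them. The sketch below works on a toy frame with illustrative values.

```python
import pandas as pd

# Toy frame (hypothetical): rows 1 and 3 have Age and Gender swapped.
df = pd.DataFrame({
    "Age": ["25", "Male", "30", "Female"],
    "Gender": ["Female", "27", "Male", "24"],
})

age_num = pd.to_numeric(df["Age"], errors="coerce")
gender_num = pd.to_numeric(df["Gender"], errors="coerce")

# A row counts as swapped when Age is not numeric but Gender is.
swapped = age_num.isna() & gender_num.notna()

# Exchange the two fields; to_numpy() stops pandas from realigning columns.
df.loc[swapped, ["Age", "Gender"]] = df.loc[swapped, ["Gender", "Age"]].to_numpy()
df["Age"] = pd.to_numeric(df["Age"])

print(df)
```

If the assumption holds for the real data, this would need to run before the median fill of `Age` and the `Unknown` relabeling of `Gender`.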

In [91]:
fig = px.histogram(train_df, x='Age', color='Gender', title='Age by Gender')
fig.show()

### **Platform and Emotions**

In [98]:
# Group the data by platform and dominant emotion
gender_emotion_grouped = train_df.groupby(['Platform', 'Dominant_Emotion']).size().reset_index(name='Count')

gender_emotion_grouped

| | Platform | Dominant_Emotion | Count |
|---|---|---|---|
| 0 | Facebook | Anxiety | 50 |
| 1 | Facebook | Boredom | 40 |
| 2 | Facebook | Neutral | 70 |
| 3 | Facebook | Sadness | 30 |
| 4 | Instagram | Anger | 10 |
| 5 | Instagram | Anxiety | 30 |
| 6 | Instagram | Happiness | 171 |
| 7 | Instagram | Neutral | 20 |
| 8 | Instagram | Sadness | 20 |
| 9 | LinkedIn | Anxiety | 20 |

In [99]:
fig = px.bar(gender_emotion_grouped, x='Platform', y='Count', color='Dominant_Emotion', title='Dominant Emotion by Platform', barmode='group')

fig.show()

In [97]:
# Create a contingency table
contingency_table = pd.crosstab(train_df['Platform'], train_df['Dominant_Emotion'])

# Plot the heatmap
fig = px.imshow(contingency_table, title='Platform vs Dominant Emotion Heatmap')
fig.show()
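Raw counts in a contingency table are dominated by the most popular platforms. Normalizing each row makes platforms of different sizes comparable. A minimal sketch with `pd.crosstab(..., normalize='index')` on toy data (the values are assumptions):

```python
import pandas as pd

# Toy data (hypothetical): four users per platform.
df = pd.DataFrame({
    "Platform": ["Instagram"] * 4 + ["Facebook"] * 4,
    "Dominant_Emotion": [
        "Happiness", "Happiness", "Happiness", "Anxiety",
        "Neutral", "Neutral", "Boredom", "Sadness",
    ],
})

# normalize="index" divides each row by its total, so every row sums to 1.
row_share = pd.crosstab(df["Platform"], df["Dominant_Emotion"], normalize="index")

print(row_share.round(2))
```

The resulting frame can be passed to `px.imshow` in the same way as the count table.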

### **Time Spent VS Emotions**

In [100]:
# Daily Usage time by Dominant_Emotion
fig = px.histogram(train_df, x='Daily_Usage_Time (minutes)', color='Dominant_Emotion', title='Time Usage by Dominant Emotion')
fig.show()

### **Likes Received VS Emotions**

In [102]:
fig = px.histogram(train_df, x='Likes_Received_Per_Day', color='Dominant_Emotion', title='Likes Received vs Dominant Emotion')
fig.show()

## **Model Training**

In [124]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from xgboost import XGBClassifier

In [107]:
train_df.isnull().sum()

User_ID                       0
Age                           0
Gender                        0
Platform                      0
Daily_Usage_Time (minutes)    0
Posts_Per_Day                 0
Likes_Received_Per_Day        0
Comments_Received_Per_Day     0
Messages_Sent_Per_Day         0
Dominant_Emotion              0
dtype: int64

In [108]:
X = train_df.drop(columns = ['User_ID', 'Dominant_Emotion'])
y = train_df['Dominant_Emotion']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
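Because the emotion classes are not equally frequent, a purely random split can shift class proportions between the two parts. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`. A sketch on toy labels with an assumed 80/20 class ratio:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 80 "Happiness" rows and 20 "Anger" rows.
X = pd.DataFrame({"x": range(100)})
y = pd.Series(["Happiness"] * 80 + ["Anger"] * 20)

# stratify=y keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_te.value_counts())
```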

In [123]:
numeric_features = ['Daily_Usage_Time (minutes)',
       'Posts_Per_Day', 'Likes_Received_Per_Day', 'Comments_Received_Per_Day',
       'Messages_Sent_Per_Day']
categorical_features = ['Age', 'Gender', 'Platform']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(), categorical_features)
])

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('rf', RandomForestClassifier(random_state = 42))
])
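With its default settings, `OneHotEncoder` raises an error when it meets a category at prediction time that it did not see during fitting, which can happen within cross-validation folds or on `test.csv`. The sketch below shows a variation, not the notebook's configuration: `handle_unknown='ignore'` encodes unseen categories as all zeros, and `Age` is scaled as a numeric feature instead of being one-hot encoded. The toy frames are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical) and one row with an unseen platform.
train = pd.DataFrame({
    "Age": [25.0, 30.0, 22.0],
    "Platform": ["Instagram", "Twitter", "Facebook"],
})
unseen = pd.DataFrame({"Age": [28.0], "Platform": ["Telegram"]})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    # Unknown categories become an all-zero row instead of raising.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Platform"]),
])
preprocessor.fit(train)

encoded = preprocessor.transform(unseen)
if hasattr(encoded, "toarray"):  # convert if the output is sparse
    encoded = encoded.toarray()

print(encoded.shape)
```

Changing the preprocessing this way would likely change the reported scores, so the results would need to be recomputed.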

In [126]:
param_grid = {
    'rf__n_estimators': [100, 200, 300],
    'rf__max_depth': [None, 10, 20, 30],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(estimator=rf_pipeline, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)

grid_search.fit(X_train, y_train)
y_pred = grid_search.predict(X_test)

# Print the best hyperparameters found by the grid search
print(f"Best Parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters: {'rf__max_depth': None, 'rf__min_samples_leaf': 1, 'rf__min_samples_split': 2, 'rf__n_estimators': 300}
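The "108 candidates, totalling 540 fits" in the log follows from the grid size: 3 × 4 × 3 × 3 = 108 parameter combinations, each fitted once per fold with `cv=5`. This can be checked before launching the search with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    "rf__n_estimators": [100, 200, 300],
    "rf__max_depth": [None, 10, 20, 30],
    "rf__min_samples_split": [2, 5, 10],
    "rf__min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(param_grid))  # every combination of values
n_fits = n_candidates * 5                      # one fit per candidate per fold

print(n_candidates, n_fits)
```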


In [127]:
rf_pipeline_best = grid_search.best_estimator_
y_pred = rf_pipeline_best.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='weighted')}")
print(f"Precision: {precision_score(y_test, y_pred, average='weighted')}")
print(f"Recall: {recall_score(y_test, y_pred, average='weighted')}")

Accuracy: 0.9950248756218906
F1 Score: 0.9950201354734427
Precision: 0.995130729332063
Recall: 0.9950248756218906
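These scores are computed on a 20% hold-out taken from `train.csv`, and weighted averages can hide which emotions get confused with each other. The notebook imports `confusion_matrix` without using it. A minimal sketch of how it could be applied is shown below on toy labels, which are assumptions and not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): one "Anger" sample is predicted as "Neutral".
y_true = ["Happiness", "Anger", "Anger", "Neutral", "Happiness", "Neutral"]
y_pred = ["Happiness", "Anger", "Neutral", "Neutral", "Happiness", "Neutral"]

# Fix the label order so rows and columns can be read unambiguously.
labels = ["Anger", "Happiness", "Neutral"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(cm)
```

On the real data, the same call with `y_test` and `y_pred` would show the per-class errors behind the aggregate scores.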
