Eléonor KIOULOU

DIA2 - A5

# Projet Dataviz

In [466]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook_connected"

Kaggle: https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health?resource=download

### Context and Motivation

Social media plays a large role in daily life, especially for university students. This project examines how social media use relates to the mental health of young adults. We use survey data from Kaggle to explore the link between social media habits and problems such as depression and anxiety, and we present the findings with interactive charts built with Plotly. The study is aimed at students, teachers, and health professionals. It is timely because mental health is receiving growing attention in schools and in the tech industry. Our goal is to help people understand how social media use and mental health are connected.

In [467]:
df = pd.read_csv("smmh.csv")
df.columns

Index(['Timestamp', '1. What is your age?', '2. Gender',
       '3. Relationship Status', '4. Occupation Status',
       '5. What type of organizations are you affiliated with?',
       '6. Do you use social media?',
       '7. What social media platforms do you commonly use?',
       '8. What is the average time you spend on social media every day?',
       '9. How often do you find yourself using Social media without a specific purpose?',
       '10. How often do you get distracted by Social media when you are busy doing something?',
       '11. Do you feel restless if you haven't used Social media in a while?',
       '12. On a scale of 1 to 5, how easily distracted are you?',
       '13. On a scale of 1 to 5, how much are you bothered by worries?',
       '14. Do you find it difficult to concentrate on things?',
       '15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?',
       '16. Following the previous question, 

## Preprocessing

In [468]:
# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)

Timestamp                                                                                                                0
1. What is your age?                                                                                                     0
2. Gender                                                                                                                0
3. Relationship Status                                                                                                   0
4. Occupation Status                                                                                                     0
5. What type of organizations are you affiliated with?                                                                  30
6. Do you use social media?                                                                                              0
7. What social media platforms do you commonly use?                                                                      0
8. What is the a

In [469]:
# Drop rows where any of the important columns have missing values
df = df.dropna(subset=[
    '1. What is your age?',
    '2. Gender',
    '3. Relationship Status',
    '4. Occupation Status',
    '6. Do you use social media?',
    '8. What is the average time you spend on social media every day?',
    '9. How often do you find yourself using Social media without a specific purpose?',
    '18. How often do you feel depressed or down?',
    '19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?',
    '20. On a scale of 1 to 5, how often do you face issues regarding sleep?'
])
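
To make the effect of `subset` concrete, here is a minimal sketch on a toy frame (the short column names and the values are illustrative, not taken from the survey). Rows are dropped only when one of the listed columns is missing, so a gap in an unlisted column such as the organization type does not remove the respondent.

```python
import numpy as np
import pandas as pd

# Toy data: one row misses `age`, another misses `organization_type`.
toy = pd.DataFrame({
    "age": [21.0, np.nan, 30.0],
    "organization_type": ["University", "Company", None],
})

# Only `age` is listed in `subset`, so only the second row is dropped.
cleaned = toy.dropna(subset=["age"])
remaining_missing = cleaned.isnull().sum()
print(cleaned)
print(remaining_missing)
```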

In [470]:
# Check data types
df.dtypes

Timestamp                                                                                                                object
1. What is your age?                                                                                                    float64
2. Gender                                                                                                                object
3. Relationship Status                                                                                                   object
4. Occupation Status                                                                                                     object
5. What type of organizations are you affiliated with?                                                                   object
6. Do you use social media?                                                                                              object
7. What social media platforms do you commonly use?                                                     

In [471]:
# Convert columns to appropriate data types (integer age, categorical answers)
df['1. What is your age?'] = df['1. What is your age?'].astype(int)
df['2. Gender'] = df['2. Gender'].astype('category')
df['3. Relationship Status'] = df['3. Relationship Status'].astype('category')
df['4. Occupation Status'] = df['4. Occupation Status'].astype('category')
df['6. Do you use social media?'] = df['6. Do you use social media?'].astype('category')

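# Map each time range to a representative number of hours:
# 'Less than an Hour' -> 0.5, 'Between 1 and 2 hours' -> 1.5, and so on.
# 'More than 5 hours' has no upper bound, so 6 is used as a nominal value.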
time_mapping = {
    'Less than an Hour': 0.5,
    'Between 1 and 2 hours': 1.5,
    'Between 2 and 3 hours': 2.5,
    'Between 3 and 4 hours': 3.5,
    'Between 4 and 5 hours': 4.5,
    'More than 5 hours': 6
}
df['8. What is the average time you spend on social media every day?'] = df['8. What is the average time you spend on social media every day?'].map(time_mapping)
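
Because `Series.map` with a dictionary returns `NaN` for any label that is not a key, a misspelled category would silently become a missing value. A minimal check, sketched on a toy Series (the label `'Less than 1 hour'` is a deliberately wrong, hypothetical spelling, and the dictionary is shortened for illustration):

```python
import pandas as pd

time_mapping = {
    "Less than an Hour": 0.5,
    "Between 1 and 2 hours": 1.5,
    "More than 5 hours": 6,
}

# Toy answers; the last label does not match any key in the mapping.
answers = pd.Series(["Less than an Hour", "More than 5 hours", "Less than 1 hour"])
hours = answers.map(time_mapping)

# Labels that failed to map show up as NaN after the conversion.
unmapped = answers[hours.isna()].unique().tolist()
print(unmapped)
```

An empty list here would mean that every answer was converted.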

In [472]:
# Define the new column names in a dictionary
simplified_column_names = {
    'Timestamp': 'timestamp',
    '1. What is your age?': 'age',
    '2. Gender': 'gender',
    '3. Relationship Status': 'relationship_status',
    '4. Occupation Status': 'occupation',
    '5. What type of organizations are you affiliated with?': 'organization_type',
    '6. Do you use social media?': 'uses_social_media',
    '7. What social media platforms do you commonly use?': 'common_platforms',
    '8. What is the average time you spend on social media every day?': 'average_daily_use',
    '9. How often do you find yourself using Social media without a specific purpose?': 'without_specific_purposeless_use(1-5)',
    '10. How often do you get distracted by Social media when you are busy doing something?': 'distraction_when_busy(1-5)',
    '11. Do you feel restless if you haven\'t used Social media in a while?': 'restlessness(1-5)',
    '12. On a scale of 1 to 5, how easily distracted are you?': 'easy_distraction_scale(1-5)',
    '13. On a scale of 1 to 5, how much are you bothered by worries?': 'bothered_by_worries_scale(1-5)',
    '14. Do you find it difficult to concentrate on things?': 'concentration_difficulty(1-5)',
    '15. On a scale of 1-5, how often do you compare yourself to other successful people through the use of social media?': 'comparison_with_others(1-5)',
    '16. Following the previous question, how do you feel about these comparisons, generally speaking?': 'feelings_about_comparison(1-5)',
    '17. How often do you look to seek validation from features of social media?': 'validation_seeking(1-5)',
    '18. How often do you feel depressed or down?': 'feel_depressed(1-5)',
    '19. On a scale of 1 to 5, how frequently does your interest in daily activities fluctuate?': 'interest_daily_activities_fluctuation(1-5)',
    '20. On a scale of 1 to 5, how often do you face issues regarding sleep?': 'sleep_issues(1-5)'
}

# Apply the new simplified column names to the dataset
df = df.rename(columns=simplified_column_names)

# Confirm the new column names by printing them out
df.columns.tolist()

['timestamp',
 'age',
 'gender',
 'relationship_status',
 'occupation',
 'organization_type',
 'uses_social_media',
 'common_platforms',
 'average_daily_use',
 'without_specific_purposeless_use(1-5)',
 'distraction_when_busy(1-5)',
 'restlessness(1-5)',
 'easy_distraction_scale(1-5)',
 'bothered_by_worries_scale(1-5)',
 'concentration_difficulty(1-5)',
 'comparison_with_others(1-5)',
 'feelings_about_comparison(1-5)',
 'validation_seeking(1-5)',
 'feel_depressed(1-5)',
 'interest_daily_activities_fluctuation(1-5)',
 'sleep_issues(1-5)']

In [473]:
# Define the mapping from numeric values to the original time categories
reverse_time_mapping = {
    0.5: 'Less than an Hour',
    1.5: 'Between 1 and 2 hours',
    2.5: 'Between 2 and 3 hours',
    3.5: 'Between 3 and 4 hours',
    4.5: 'Between 4 and 5 hours',
    6.0: 'More than 5 hours'
}

# Map the numeric 'average_daily_use' values back to their category labels
df['average_daily_use_category'] = df['average_daily_use'].map(reverse_time_mapping)

In [474]:
df.dtypes

timestamp                                       object
age                                              int64
gender                                        category
relationship_status                           category
occupation                                    category
organization_type                               object
uses_social_media                             category
common_platforms                                object
average_daily_use                              float64
without_specific_purposeless_use(1-5)            int64
distraction_when_busy(1-5)                       int64
restlessness(1-5)                                int64
easy_distraction_scale(1-5)                      int64
bothered_by_worries_scale(1-5)                   int64
concentration_difficulty(1-5)                    int64
comparison_with_others(1-5)                      int64
feelings_about_comparison(1-5)                   int64
validation_seeking(1-5)                          int64
feel_depre

In [475]:
df.to_csv('smmh_new.csv', index=False)
df.head()

Unnamed: 0,timestamp,age,gender,relationship_status,occupation,organization_type,uses_social_media,common_platforms,average_daily_use,without_specific_purposeless_use(1-5),...,easy_distraction_scale(1-5),bothered_by_worries_scale(1-5),concentration_difficulty(1-5),comparison_with_others(1-5),feelings_about_comparison(1-5),validation_seeking(1-5),feel_depressed(1-5),interest_daily_activities_fluctuation(1-5),sleep_issues(1-5),average_daily_use_category
0,4/18/2022 19:18:47,21,Male,In a relationship,University Student,University,Yes,"Facebook, Twitter, Instagram, YouTube, Discord...",2.5,5,...,5,2,5,2,3,2,5,4,5,Between 2 and 3 hours
1,4/18/2022 19:19:28,21,Female,Single,University Student,University,Yes,"Facebook, Twitter, Instagram, YouTube, Discord...",6.0,4,...,4,5,4,5,1,1,5,4,5,More than 5 hours
2,4/18/2022 19:25:59,21,Female,Single,University Student,University,Yes,"Facebook, Instagram, YouTube, Pinterest",3.5,3,...,2,5,4,3,3,1,4,2,5,Between 3 and 4 hours
3,4/18/2022 19:29:43,21,Female,Single,University Student,University,Yes,"Facebook, Instagram",6.0,4,...,3,5,3,5,1,2,4,3,2,More than 5 hours
4,4/18/2022 19:33:31,21,Female,Single,University Student,University,Yes,"Facebook, Instagram, YouTube",2.5,3,...,4,5,5,3,3,3,4,4,1,Between 2 and 3 hours


## Data Visualization Distribution

In [476]:
# Define a pastel color sequence
pastel_colors = ['#ffd1dc', '#a8e6cf', '#dcedc1', '#ffd3b6', '#ffaaa5',"#a9d5fc", "#fcefa7", "#b6fcb6", "#FFB6C1" ]

In [477]:
# Graph 1: Age Distribution
fig = px.histogram(df, x='age', nbins=40, color_discrete_sequence=["#96ebbc"])

fig.update_layout(
    title_text='Age Distribution',
    title_x=0.5,
    xaxis=dict(
        title='Age',
        showgrid=True,  
        gridcolor='LightGrey', 
    ),
    yaxis=dict(
        title='Count',
        showgrid=True,
        gridcolor='LightGrey',
    ),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
)
fig.update_layout(width=1200, height=600)
fig.show()

The histogram shows the age distribution of the respondents, with a notable peak in the 20-30 age range: the sample is predominantly young. The distribution tails off as age increases, so the older age brackets contain fewer respondents.

In [478]:
# Graph 2: Gender Distribution
# Filter the DataFrame to only include 'Female' and 'Male' categories.
filtered_df = df[df['gender'].isin(['Female', 'Male'])]

# Create the histogram with custom colors.
fig = px.histogram(filtered_df, x='gender',
                   category_orders={'gender': ['Female', 'Male']},
                   color_discrete_sequence=pastel_colors)

# Customize the layout to include a grid and match the desired style.
fig.update_layout(
    title_text='Gender Distribution',
    title_x=0.5,
    xaxis=dict(
        title='Gender',
        showgrid=True,
        gridcolor='LightGrey'
    ),
    yaxis=dict(
        title='Count',
        showgrid=True,
        gridcolor='LightGrey'
    ),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    showlegend=False
)
fig.update_layout(width=1200, height=600)
fig.show()
# Save the chart as an HTML file
#fig.write_html('html_graph/gender_distribution.html')

The bar chart shows the gender distribution of the sample, restricted to the 'Female' and 'Male' categories. Women slightly outnumber men, so the two groups are almost evenly represented.

In [479]:
# Graph 3: Relationship Status Distribution
# Create a pie chart with Plotly Express using the custom pastel palette.
fig = px.pie(df, names='relationship_status', 
             title='Relationship Status Distribution',
             color='relationship_status',
             color_discrete_sequence=pastel_colors,
             hole=0.3)

# Customize the chart to set the legend to the right and vertically, increase legend font size, and set the background to white
fig.update_traces(textinfo='percent+label',textfont_size=15)
fig.update_layout(
    legend=dict(
        orientation="v",
        yanchor="middle",
        y=1,
        xanchor="left",
        x=1.05,
        font=dict(size=15)  # Increase legend font size
    ),
    plot_bgcolor='rgba(255,255,255,1)', 
    paper_bgcolor='rgba(255,255,255,1)',
    margin=dict(t=150, l=0, r=0, b=0),  # Increase top margin to give more space for the title
    title=dict(
        y=0.90,  # Raise the title position
        x=0.4,
        xanchor='center',
        yanchor='top',
        font=dict(size=20)  # Increase title font size if desired
    )
)
# Increase the size of the figure
fig.update_layout(width=800, height=600)

fig.show()

The donut chart shows the distribution of relationship status. Most respondents are single (59.3%), followed by married (21%), in a relationship (18.3%), and divorced (1.46%).

In [480]:
# Graph 4: Social Media Platform Usage Distribution
# This line explodes the 'common_platforms' column into multiple rows.
platform_counts = df['common_platforms'].str.split(', ').explode().value_counts().reset_index()
platform_counts.columns = ['Platforms', 'Count']  # Renaming columns for clarity

# Ensure there are enough colors for each bar by repeating the color sequence if necessary
num_platforms = platform_counts.shape[0]
repeated_colors = pastel_colors * (num_platforms // len(pastel_colors) + 1)
platform_counts['Color'] = repeated_colors[:num_platforms]

# Create the bar plot with Plotly Express
fig = px.bar(platform_counts, x='Platforms', y='Count',
             title='Social Media Platform Usage Distribution',
             color='Color',  # Use the Color column to map colors to bars
             color_discrete_sequence=pastel_colors)  # Use the pastel color sequence

# Update the layout to have a white background and gridlines
fig.update_layout(
    xaxis=dict(title='Platforms', showgrid=True, gridcolor='LightGrey'),
    yaxis=dict(title='Count', showgrid=True, gridcolor='LightGrey'),
    plot_bgcolor='rgba(255,255,255,1)',
    paper_bgcolor='rgba(255,255,255,1)',
    showlegend=False
)

# Customize x-axis tick angle
fig.update_xaxes(tickangle=45)
fig.update_layout(width=1200, height=600)

fig.show()

The bar chart displays the usage distribution across various social media platforms. YouTube appears to be the most used platform, followed closely by Facebook and Instagram. Platforms like Discord, Snapchat, Pinterest, Twitter, and Reddit show moderate counts, while TikTok has the least, suggesting a diverse range of platform preferences among the users surveyed.
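
The counting step can be checked on a toy frame. This sketch assumes, as in the survey export, that platforms are separated by a comma followed by a space; the three answers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "common_platforms": [
        "Facebook, Instagram, YouTube",
        "YouTube",
        "Instagram, YouTube",
    ]
})

# Split each answer into a list, give every platform its own row, then count.
platform_counts = (
    toy["common_platforms"]
    .str.split(", ")
    .explode()
    .value_counts()
)
print(platform_counts)
```

The counts are mentions, not respondents: one person can contribute to several bars, so the total exceeds the number of rows in the frame.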

In [481]:
# Graph 5: Average Time on Social Media Distribution
# Count the number of respondents in each 'average_daily_use_category'.
average_daily_use_counts = df['average_daily_use_category'].value_counts().reset_index()
average_daily_use_counts.columns = ['Time on Social Media', 'Count']  # Renaming columns for clarity

# Ensure there are enough colors for each bar by repeating the color sequence if necessary
num_categories = average_daily_use_counts.shape[0]
repeated_colors = pastel_colors * (num_categories // len(pastel_colors) + 1)
average_daily_use_counts['Color'] = repeated_colors[:num_categories]

# Create the bar plot with Plotly Express
fig = px.bar(average_daily_use_counts, x='Time on Social Media', y='Count',
             title='Average Time on Social Media Distribution',
             color='Color',  # Use the Color column to map colors to bars
             color_discrete_sequence=pastel_colors,  # Use the pastel color sequence
             category_orders={'Time on Social Media': list(reverse_time_mapping.values())})  # Set the x-axis order

# Update the layout to have a white background and gridlines
fig.update_layout(
    xaxis=dict(title='Time on Social Media', showgrid=True, gridcolor='LightGrey'),
    yaxis=dict(title='Count', showgrid=True, gridcolor='LightGrey'),
    plot_bgcolor='rgba(255,255,255,1)',
    paper_bgcolor='rgba(255,255,255,1)',
    showlegend=False
)

# Customize x-axis tick angle
fig.update_xaxes(tickangle=45)
fig.update_layout(width=1200, height=600)
fig.show()

The bar chart shows how much time respondents spend on social media per day on average. The largest group spends more than 5 hours per day. The other categories are more evenly represented, with respondents who spend less than an hour forming a slightly smaller group. This suggests that most respondents use social media for an extended period each day, which can be considered excessive.
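
Raw counts can be turned into shares to check statements such as "the largest group". A minimal sketch with made-up answers (the numbers below are illustrative, not the survey's):

```python
import pandas as pd

usage = pd.Series(
    ["More than 5 hours"] * 4
    + ["Between 2 and 3 hours"] * 3
    + ["Less than an Hour"] * 1
)

# normalize=True returns fractions of respondents instead of counts.
shares = usage.value_counts(normalize=True)
largest_category = shares.idxmax()
print(shares.round(2))
print(largest_category)
```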

## Data Visualization Habits And Mental Health

In [482]:
# Graph 1: Correlation Heatmap of Psychological Features
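# Compute the correlation matrix on the numeric 1-5 scale columns only
# (the text column 'average_daily_use_category' is excluded explicitly)
correlation_matrix = df.iloc[:, 9:].select_dtypes(include='number').corr()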

# Create a heatmap with Plotly Express
fig = px.imshow(correlation_matrix, text_auto=True, color_continuous_scale='RdBu',labels=dict(color='Correlation'), x=correlation_matrix.columns, y=correlation_matrix.columns)

# Update the layout
fig.update_layout(
    title='Correlation Heatmap of Psychological Features',
    xaxis_nticks=36,
    yaxis_nticks=36,
    plot_bgcolor='rgba(255,255,255,1)',
    paper_bgcolor='rgba(255,255,255,1)',
    margin=dict(l=170)
)
# Adjusting the color bar (legend) position
fig.update_layout(coloraxis_colorbar=dict(
    x=1.50,  # Positioning the colorbar to the right side
    y=0.5,
    lenmode='fraction',
    len=0.65
))

# Increase the size of the figure
fig.update_layout(width=1100, height=1100)
# Lower the title
fig.update_layout(title=dict(y=0.95))
# Update the layout to reduce the top margin
fig.update_layout(margin=dict(t=0))
fig.show()

The correlation heatmap shows the pairwise correlations between the 1-5 scale answers: purposeless social media use, distraction when busy, restlessness, ease of distraction, worries, concentration difficulty, comparison with others, feelings about comparison, validation seeking, depression, fluctuating interest in daily activities, and sleep issues.
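
`DataFrame.corr` computes the Pearson coefficient by default. The sketch below, on made-up 1-5 answers with shortened, hypothetical column names, restricts the frame to numeric columns and recomputes one coefficient from its definition to show what each cell of the heatmap contains.

```python
import numpy as np
import pandas as pd

# Toy 1-5 answers; column names are shortened stand-ins for the survey columns.
toy = pd.DataFrame({
    "comparison": [1, 2, 3, 4, 5, 3],
    "depressed": [2, 2, 3, 5, 4, 3],
    "label": ["a", "b", "c", "d", "e", "f"],  # non-numeric, excluded below
})

numeric = toy.select_dtypes(include="number")
corr = numeric.corr()  # Pearson by default

# Same coefficient from the definition: the sum of products of the centered
# values divided by the square root of the product of their sums of squares.
x = numeric["comparison"] - numeric["comparison"].mean()
y = numeric["depressed"] - numeric["depressed"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(corr)
print(r_manual)
```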

In [483]:
# Graph 2 : Comparative Analysis of Social Media Usage and Its Impact on Habits Characteristics
# Define the value variables
value_vars = [
    'interest_daily_activities_fluctuation(1-5)', 
    'concentration_difficulty(1-5)', 
    'distraction_when_busy(1-5)', 
    'without_specific_purposeless_use(1-5)', 
    'sleep_issues(1-5)', 
    'restlessness(1-5)'
]

# Calculate mean scores
mean_scores = df.groupby('uses_social_media')[value_vars].mean().reset_index()

# Melt the mean_scores dataframe
melted_data = mean_scores.melt(id_vars='uses_social_media', var_name='Characteristic', value_name='Average Score')

# Define a color map for the 'Yes' and 'No' categories
color_map = {'Yes': "#ffaaa5", 'No': "#a9d5fc"}

# Create the bar plot with a white background
fig = px.bar(
    melted_data, 
    x='Characteristic', 
    y='Average Score', 
    color='uses_social_media', 
    barmode='group',
    color_discrete_map=color_map,
    title='Comparative Analysis of Social Media Usage and Its Impact on Habits Characteristics'
)

# Update the layout with a white background and gridlines
fig.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',
    paper_bgcolor='rgba(255,255,255,1)',
    showlegend=True,
    xaxis=dict(
        showline=True,
        showgrid=True,
        gridcolor='lightgrey',
        linecolor='black',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='black'
        ),
    ),
    yaxis=dict(
        showline=True,
        showgrid=True,
        gridcolor='lightgrey',
        linecolor='black',
        linewidth=2,
        ticks='outside',
        tickfont=dict(
            family='Arial',
            size=12,
            color='black'
        ),
    ),
    xaxis_title='Characteristic',
    yaxis_title='Average Score',
    xaxis_tickangle=-45
)
fig.update_layout(width=1200, height=800)
fig.show()

The graph presents a side-by-side comparison of average scores on various mental health-related characteristics, contrasting social media users with non-users. It indicates that social media users generally report higher levels of difficulty with concentration, distraction while busy, purposeless use, sleep issues, and restlessness compared to non-users, highlighting a potential adverse impact of social media usage on these aspects of mental health.
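
Group means are easier to interpret next to group sizes, since a small group yields a noisy average. Here is a sketch of the aggregation and reshaping behind the grouped bar chart, on a made-up frame (column names shortened for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "uses_social_media": ["Yes", "Yes", "Yes", "No"],
    "sleep_issues": [4, 5, 3, 2],
    "restlessness": [3, 4, 2, 1],
})

grouped = toy.groupby("uses_social_media")
group_sizes = grouped.size()  # respondents per group
mean_scores = grouped[["sleep_issues", "restlessness"]].mean().reset_index()

# Wide -> long: one row per (group, characteristic) pair, ready for a grouped bar chart.
long_scores = mean_scores.melt(
    id_vars="uses_social_media",
    var_name="Characteristic",
    value_name="Average Score",
)
print(group_sizes)
print(long_scores)
```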

## Data Visualization Self Reflection And Mental Health

In [484]:
# Graph 1 : Comparative Analysis of Social Media's Impact on Self-Reflection Based on Usage Patterns
# Creating the data summary
score_columns = [
    'comparison_with_others(1-5)',
    'feelings_about_comparison(1-5)',
    'validation_seeking(1-5)',
    'feel_depressed(1-5)'
]
# Select the columns with a list: indexing a groupby with several bare keys is deprecated
grouped_data = df.groupby('uses_social_media')[score_columns].mean().reset_index()

# Define a color map for the 'Yes' and 'No' categories
colors = {'Yes': "#ffaaa5", 'No': "#a9d5fc"}

# Create a 2x2 grid of subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=[
    'Comparison with Others by Social Media Use',
    'Feelings About Comparison by Social Media Use',
    'Validation Seeking by Social Media Use',
    'Feeling Depressed by Social Media Use'
])

# Add one bar per group to each subplot, filling the grid row by row
for i, col in enumerate(score_columns):
    for use in grouped_data['uses_social_media'].unique():
        fig.add_trace(go.Bar(
            x=[use], 
            y=grouped_data[grouped_data['uses_social_media'] == use][col], 
            name=use,
            marker_color=colors[use]
        ), row=(i // 2) + 1, col=(i % 2) + 1)

# Layout update with white background and gridlines
fig.update_layout(
    title_text="Comparative Analysis of Social Media's Impact on Self-Reflection Based on Usage Patterns",
    height=700, width=900,
    showlegend=False,
    plot_bgcolor='rgba(255, 255, 255, 1)',
    paper_bgcolor='rgba(255, 255, 255, 1)',
    xaxis=dict(showline=True, showgrid=True, gridcolor='lightgrey'),
    yaxis=dict(showline=True, showgrid=True, gridcolor='lightgrey'),
    title=dict(
        font=dict(size=22)
    )
)

# Apply the grid settings to every subplot (no row/col means all axes)
fig.update_xaxes(showline=True, showgrid=True, gridcolor='lightgrey')
fig.update_yaxes(showline=True, showgrid=True, gridcolor='lightgrey')

# Increase the size of the figure
fig.update_layout(width=1100, height=1100)
# Show the figure
fig.show()

The bar charts above compare the average scores for comparison with others, feelings about comparison, validation seeking, and feelings of depression between respondents who use social media and those who do not.

Differences between the two groups indicate where social media use may be associated with negative feelings or behaviors. These are comparisons of group averages, so they show association rather than cause.

### Focusing On The "Comparison" Feature

I chose to focus on the 'comparison_with_others(1-5)' feature because comparing ourselves with others is a common habit that we often engage in without realizing it.

In [485]:
# Graph 2 : Comparison with Others by Social Media Use and Average Daily Use
# Give respondents who do not use social media their own category label
df.loc[df['uses_social_media'] == 'No', 'average_daily_use_category'] = 'No Social Media'

# Calculate the average comparison score for each category
plot_data = df.groupby('average_daily_use_category').agg({'comparison_with_others(1-5)': 'mean'}).reset_index()
plot_data = plot_data.sort_values(by='comparison_with_others(1-5)')

# Create the bar chart with Plotly Express
fig = px.bar(
    plot_data, 
    x='average_daily_use_category', 
    y='comparison_with_others(1-5)', 
    title='Comparison with Others by Social Media Use and Average Daily Use',
    labels={'average_daily_use_category': 'User Category / Average Daily Use', 'comparison_with_others(1-5)': 'Average Comparison with Others (1-5)'},
    color='average_daily_use_category',
    color_discrete_sequence=pastel_colors
)

# Update the layout for better readability
fig.update_layout(
    xaxis_tickangle=-45,
    plot_bgcolor='white',  # White background
    paper_bgcolor='white', # White background
    xaxis=dict(showline=True, showgrid=True, gridcolor='lightgrey'),  # Gridlines for the x-axis
    yaxis=dict(showline=True, showgrid=True, gridcolor='lightgrey')   # Gridlines for the y-axis
)
fig.update_layout(width=1200, height=600)
fig.show()

The chart suggests a link between time spent on social media and how often respondents compare themselves with others. Non-users report the lowest average comparison score, and for most categories the score rises with the daily time spent on social networks.
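
Sorting the bars by score makes the ranking easy to read but hides the natural order of the time categories. One alternative, sketched here with made-up averages, is an ordered categorical so that rows sort by usage time instead:

```python
import pandas as pd

usage_order = [
    "No Social Media",
    "Less than an Hour",
    "Between 1 and 2 hours",
    "Between 2 and 3 hours",
    "Between 3 and 4 hours",
    "Between 4 and 5 hours",
    "More than 5 hours",
]

# Made-up averages, listed in arbitrary order.
plot_data = pd.DataFrame({
    "average_daily_use_category": ["More than 5 hours", "No Social Media", "Between 1 and 2 hours"],
    "comparison_with_others(1-5)": [3.4, 2.1, 2.6],
})

# An ordered categorical sorts by position in `usage_order`, not alphabetically.
plot_data["average_daily_use_category"] = pd.Categorical(
    plot_data["average_daily_use_category"], categories=usage_order, ordered=True
)
ordered = plot_data.sort_values("average_daily_use_category").reset_index(drop=True)
print(ordered)
```

With Plotly Express, the same order can be passed through `category_orders`, as the daily-use bar chart earlier in the notebook does.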

In [486]:
# Graph 3 : Distribution of Depression Feelings at Different Levels of Comparison with Others
# Recreating the box plot with the desired aesthetics
fig = px.box(df, x='comparison_with_others(1-5)', y='feel_depressed(1-5)',
             labels={'comparison_with_others(1-5)': 'Comparison with Others (1-5)',
                     'feel_depressed(1-5)': 'Feeling Depressed (1-5)'},
             title='Distribution of Depression Feelings at Different Levels of Comparison with Others')

# Updating the axes to include gridlines in light grey
fig.update_layout( plot_bgcolor='white',paper_bgcolor='white')
fig.update_xaxes(showline=True, linewidth=1, linecolor='lightgrey', gridcolor='lightgrey')
fig.update_yaxes(showline=True, linewidth=1, linecolor='lightgrey', gridcolor='lightgrey')

fig.update_layout(width=1200, height=600)
fig.show()

The box plot reveals a trend where higher self-comparison correlates with increased feelings of depression. Each step up in comparison level shows a rise in the median depression score. The spread of data suggests variability in how comparison affects individuals. Outliers indicate that some experience notably different levels of depression.
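
The visual trend can be summarised numerically with per-level medians and a rank correlation, which suits ordinal 1-5 answers. This is a sketch on made-up scores with shortened, hypothetical column names; Spearman's coefficient is computed as the Pearson correlation of the ranks, which avoids an extra dependency.

```python
import pandas as pd

toy = pd.DataFrame({
    "comparison": [1, 1, 2, 3, 3, 4, 5, 5],
    "depressed": [1, 2, 2, 3, 4, 4, 5, 4],
})

# Median depression score at each comparison level (the line inside each box).
medians = toy.groupby("comparison")["depressed"].median()

# Spearman rank correlation = Pearson correlation of the ranked values.
spearman = toy["comparison"].rank().corr(toy["depressed"].rank())
print(medians)
print(round(spearman, 3))
```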

### Focusing On The Correlation Between Feeling Depressed And Other Characteristics

I then shifted my focus to the "Depression" feature to better study mental health and identify the characteristics that impact mental well-being.

In [487]:
# Graph 4 : Subplots of Box Plots Comparing 'Feel Depressed' With Various Factors
# List of numerical variables to compare with 'feel_depressed(1-5)'
compare_vars = [
    "feelings_about_comparison(1-5)",
    "sleep_issues(1-5)",
    "bothered_by_worries_scale(1-5)",
    "validation_seeking(1-5)"
]

# Define the number of rows needed for the subplots, considering 2 plots per row
num_rows = -(-len(compare_vars) // 2)  # Ceiling division to get the number of rows needed

# Create the subplots with 2 columns
fig = make_subplots(rows=num_rows, cols=2, subplot_titles=compare_vars)

# Add a box plot to the figure for each comparison variable
for i, var in enumerate(compare_vars):
    fig.add_trace(
        go.Box(y=df[var], x=df['feel_depressed(1-5)'], name=var),
        row=(i // 2) + 1,
        col=(i % 2) + 1
    )

# Update layout for a clear view
fig.update_layout(plot_bgcolor='white', paper_bgcolor='white')
fig.update_layout(height=800, width=1200, title_text="Subplots of Box Plots Comparing 'Feel Depressed' With Various Factors")
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')

# Show plot
fig.show()

1. Feelings About Comparison: As people's feelings about comparison increase, there doesn't seem to be a clear trend in how it relates to feeling depressed. The spread of responses is fairly consistent across all levels.

2. Sleep Issues: There's a noticeable trend that as sleep issues worsen (moving from 1 to 5), the median level of feeling depressed increases, suggesting a possible correlation where poor sleep might be associated with higher levels of depression.

3. Bothered by Worries: Similarly to sleep issues, as the level of being bothered by worries increases, so does the median level of feeling depressed. This could indicate that being more worried is associated with higher depression.

4. Validation Seeking: The spread and median levels appear to increase with the level of validation seeking, hinting that those who seek more validation might also experience higher levels of depression.

In [488]:
# Graph 5 : Box Plots Comparing 'Feel Depressed' With Categorical Factors
# Define the categorical variables to compare with 'feel_depressed(1-5)'
compare_vars_categorical = [ "relationship_status", "occupation", "uses_social_media","average_daily_use_category"]

# Define the number of rows needed for the subplots, considering 2 plots per row
num_rows = -(-len(compare_vars_categorical) // 2)  # Ceiling division to get the number of rows needed

# Create the subplots with 2 columns
fig = make_subplots(rows=num_rows, cols=2, subplot_titles=compare_vars_categorical)

# Add a box plot to the figure for each comparison variable
for i, var in enumerate(compare_vars_categorical):
    fig.add_trace(
        go.Box(x=df[var], y=df['feel_depressed(1-5)'], name=var),
        row=(i // 2) + 1,
        col=(i % 2) + 1
    )

# Update layout for a clear view
fig.update_layout(plot_bgcolor='white', paper_bgcolor='white')
fig.update_layout(height=800, width=1200, title_text="Box Plots Comparing 'Feel Depressed' With Categorical Factors")
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')

# Show plot
fig.show()

1. Relationship Status: Individuals who are 'Divorced' show a slightly higher median level of feeling depressed compared to other statuses. However, the spread of responses is wide across all categories, indicating variation within each group.

2. Occupation: 'University Student' and 'Salaried Worker' categories have similar median levels of depression, but 'University Student' shows a wider range, suggesting more variability in feelings of depression among students. The 'Retired' group shows a narrower spread, indicating more consistency in their responses.

3. Uses Social Media: Those who use social media report a higher median level of feeling depressed than those who do not, with the users' responses showing a significant spread.

4. Average Daily Use Category: There's an interesting pattern; those who do not use social media report the lowest levels of feeling depressed, while users with "More than 5 hours" show the highest median depression level. The groups "Between 3 and 4 hours" and "Between 4 and 5 hours" have a lower median compared to "More than 5 hours", but higher than those with less usage.

In [489]:
# Graph 6 : Comparison of 'Feel Depressed' Across Different Social Media Platforms
# Split the 'common_platforms' into individual platforms and stack them
all_platforms = df['common_platforms'].str.split(', ', expand=True).stack()
all_platforms.index = all_platforms.index.droplevel(1)  # to align with df's index
all_platforms.name = 'platform'

# Join the new 'platform' column with the 'feel_depressed(1-5)' column from the original dataframe
platform_df = df[['feel_depressed(1-5)']].join(all_platforms)
platform_df.reset_index(drop=True, inplace=True)

# Create a box plot comparing 'feel_depressed(1-5)' with individual platforms
fig = px.box(platform_df, x='platform', y='feel_depressed(1-5)',
             labels={'platform': 'Social Media Platform',
                     'feel_depressed(1-5)': 'Feeling Depressed (1-5)'},
             title="Comparison of 'Feel Depressed' Across Different Social Media Platforms")

# Update layout for a clear view
fig.update_layout(plot_bgcolor='white', paper_bgcolor='white')
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgrey')

# Show plot
fig.show()

The box plot shows that median depression levels reported by users differ slightly between social media platforms. Platforms like TikTok and Snapchat exhibit slightly higher median depression levels, whereas Facebook and Twitter are on the lower end. The variability in depression levels is also platform-dependent, with some showing a wider range of responses. Despite these nuances, the overall trend indicates a general consistency in reported depression levels across different platforms.
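
A respondent who lists several platforms appears once per platform in this comparison, so the boxes are not independent samples. Here is a sketch of the same reshaping written with `explode`, on made-up data with a shortened column name, which also reports how many rows each respondent contributes:

```python
import pandas as pd

toy = pd.DataFrame({
    "common_platforms": ["Facebook, TikTok", "TikTok", "Facebook, YouTube, TikTok"],
    "feel_depressed": [4, 5, 2],
})

# One row per (respondent, platform); explode keeps the original index.
long_df = toy.assign(platform=toy["common_platforms"].str.split(", ")).explode("platform")

rows_per_respondent = long_df.index.value_counts().sort_index()
median_by_platform = long_df.groupby("platform")["feel_depressed"].median()
print(rows_per_respondent.tolist())
print(median_by_platform)
```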

### Conclusion

The visualizations reveal a nuanced connection between social media use and depression. Heavy use of social media and a tendency to compare oneself with others are associated with higher self-reported depression, while respondents who do not use social media tend to report lower depression levels. The data suggest that social media use is related to mental health, and that the relationship may depend on how individuals engage with these platforms.

A link between social media use and depression does not mean that one causes the other. More research is needed to understand whether and how they affect each other.