# Time-Stealed, or how social media is stealing our lives

_Year 2025: a great portion of our lives is experienced within a virtual environment where we can access an infinite stream of information that excites our brains and floods them with rivers of dopamine. The speed, the bright colors, the catchy audio and the enraging news: everything has been carefully engineered to keep us trapped in front of a glowing screen._

Told like this, it might sound like the opening of a cyberpunk novel, but it's actually not that far from our reality, is it? The virtual environments I'm talking about are obviously social media platforms, which, despite all they've done for the free circulation of information, have a scary dark side.

The aim of this project is to shed light on that side and investigate where our time goes and why it seems so difficult to put a stop to a now far too common habit: doomscrolling.

But why would social media platforms be designed to keep us on them? To understand this, it's important to clarify their business model. Their profit originates primarily from the "attention economy": in other words, they make money by showing third-party companies' ads to their user base (and by selling their data, but that is a whole other story). So, when the content you're consuming is suddenly interrupted by an ad, the attention economy has come full circle: a company has made its attempt to convince you to buy its product and the hosting platform has taken its share. The follow-up question is then: how does a platform boost its revenue? Simple: it tries to maximise the user's exposure time and engagement. And how does it do that? With a plethora of sophisticated [psychological tricks](https://www.youtube.com/watch?v=uaaC57tcci0) that hook the user's attention for [hours](https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/) on end.

## 0. Imports and utility functions

In [None]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.inference import VariableElimination
from pgmpy.estimators import PC, HillClimbSearch, TreeSearch
from sklearn.preprocessing import LabelEncoder

In [None]:
# Visualizes the given model with the given title
def visualize_model(model, title):
    graph = nx.DiGraph(model.edges)
    plt.figure(figsize=(12, 6))
    nx.draw_circular(G=graph, with_labels=True, node_size=4000, node_color="#8621ff", arrowsize=20, font_size=8)
    plt.title(title)
    plt.show()

In [None]:
# Analyzes the structure of the Bayesian Network
def analyze_model(model, model_name):
    print(model_name.upper())
    print(f"\nD-separations: {model.get_independencies().get_assertions()}")
    print(f"\nImmoralities: {model.get_immoralities()}") # TODO should I include ALL v-structures?
    for n in model.nodes:
        print(f"\n{n}")
        print(f"Local semantics: {model.local_independencies(n)}")
        print(f"Markov blanket: {model.get_markov_blanket(n)}")
    print()

## 1. Dataset
This project leverages Bayesian networks to extract interesting patterns from the dataset [Dark Side Of Social Media](https://www.kaggle.com/datasets/muhammadroshaanriaz/time-wasters-on-social-media?resource=download). This dataset was produced with synthetic data generation techniques to simulate real-world social media usage patterns (read the [EDA](https://www.kaggle.com/code/waqi786/in-depth-analysis-of-time-wasters-on-social-media) for further details).

In [None]:
# Loads and previews the dataset and its shape
data = pd.read_csv('Time-Wasters on Social Media.csv', delimiter=',')
print(f'Shape: {data.shape}')
data.head()

## 2. Bayesian network

### 2.1 Variables
The dataset contains many variables, ranging from the characteristics of the users to the features of their devices. However, for the purpose of this project - which focuses on the time spent on social media due to its addictive traits - only 10 out of 31 columns have been selected.

In [None]:
# Selects only relevant columns (as a copy, so that later changes don't touch the original data)
selected_data = data[['Watch Reason', 'Platform', 'Importance Score', 'Self Control', 'Addiction Level', 'Engagement', 'Scroll Rate', 'Total Time Spent', 'Satisfaction', 'ProductivityLoss']].copy()

# Converts the productivity loss into a productivity level (10 - loss); the result must be assigned back to take effect
selected_data['ProductivityLoss'] = 10 - selected_data['ProductivityLoss']

# Renames the selected columns for clarity
selected_data.columns = ['Reason', 'Platform', 'Importance', 'Intentionality', 'Addiction', 'Engagement', 'Context-Switch', 'Time', 'Satisfaction', 'Productivity']
selected_data.index = data.UserID.values
selected_data.head()

Here is a brief description of them:

- Reason: the reason why the users entered the platform. It's interesting because it represents the initial emotional state of the users and what they wish to get out of their social media consumption

- Platform: the social media platform chosen by the users. It's interesting because each platform has its unique features and implementation

- Importance: the importance assigned by the users to the content. It's interesting because it returns relevant information about the cost/benefit tradeoff accepted by the users

- Intentionality: the intentionality level that led the users to enter the platform. It's interesting because...

- Addiction: the addiction level of the users to social media. It's interesting because...

- Engagement: the engagement level of the users with the content. It's interesting because it quantifies the degree of activity/passivity of the content consumption experience

- Context-Switch: how quickly the users switch from one piece of content to the next. It's interesting because it tells about the width of the users' attention span

- Time: the total amount of time that the users have spent on the platform. It's interesting because...

- Satisfaction: the satisfaction level of the users at the end of the content consumption. It's interesting because it returns relevant information about the cost/benefit tradeoff accepted by the users

- Productivity: the productivity level of the users during the content consumption. It's interesting because...

it's one of the behavioural traits that the study wishes to capture
It's the most interesting aspect of the study given that it's the "price" of social media consumption

### 2.2 Correlation matrix
Before building the Bayesian Network it can be useful to look at the correlation matrix to gather some information about the relationships between the selected variables. Given that the correlation matrix only works with numerical variables, the categorical ones have been label-encoded, meaning that each string value that a variable can take has been mapped to a unique integer. This encoding has been preferred to one-hot encoding for readability reasons.
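As a minimal sketch of what label encoding does, the snippet below maps a small made-up column to integers. The values are hypothetical and not taken from the dataset; sorting the distinct values mirrors the ordering that sklearn's `LabelEncoder` uses.

```python
# Hypothetical values, for illustration only (not taken from the dataset)
reasons = ["Habit", "Boredom", "Entertainment", "Habit", "Procrastination"]

# Label encoding: each distinct string is mapped to one integer.
# The distinct values are sorted first, as sklearn's LabelEncoder does.
classes = sorted(set(reasons))
mapping = {value: code for code, value in enumerate(classes)}
encoded = [mapping[r] for r in reasons]

print(mapping)
print(encoded)
```

Note that the integers follow alphabetical order, which carries no meaning for a nominal variable, so correlations computed on such columns should be read with caution.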

In [None]:
# Encodes the categorical columns with a label-encoder
label_encoder = LabelEncoder()
categorical_columns = selected_data.select_dtypes(include='object').columns
encoded_data = selected_data.copy()
for c in categorical_columns:
    encoded_data[c] = label_encoder.fit_transform(selected_data[c])

# Computes the correlation matrix
correlation_matrix = encoded_data.corr(method='pearson')

# Visualizes the correlation matrix
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
plt.figure(figsize=(12, 6))
sns.heatmap(correlation_matrix, annot=True, cmap=sns.diverging_palette(316, 267, as_cmap=True), fmt=".2f", linewidths=.5, mask=mask)
plt.title('Correlation matrix')
plt.show()

From the above graph we can already make three considerations:
1. most variables are weakly correlated with each other, which suggests (but does not prove, since Pearson correlation only captures linear relationships) that they are close to independent
2. some variables (Addiction-Intentionality, Productivity-Satisfaction) are inversely correlated with each other, so they may capture opposite ends of the same concept
3. some variables (Satisfaction, Productivity, Intentionality, Addiction) are interestingly correlated with each other but also unexpectedly isolated from the other ones
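A word of caution on reading low correlations as independence: Pearson correlation only measures linear association. The sketch below uses synthetic numbers (not the dataset) in which one variable is fully determined by the other, yet their correlation is essentially zero.

```python
import numpy as np

# Synthetic data, for illustration only
x = np.arange(-5, 6)   # values symmetric around zero
y = x ** 2             # y is completely determined by x

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.3f}")
```

So a near-zero cell in the matrix rules out a linear relationship, not every kind of dependence.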

**Personal note**: Remember that the dataset used was synthetically produced and **I suspect that for that reason it might not precisely reflect reality**

### 2.3 Structure learning and analysis
Structure learning is a data-driven process to estimate the structure of a Bayesian Network from the available data. For this project both the constraint-based and the score-based approaches have been put to the test.

#### 2.3.1 PC algorithm (constraint-based)
This Bayesian Network was obtained by applying the PC algorithm. This technique starts from a fully connected undirected graph, removes the link between two variables whenever a statistical test finds them (conditionally) independent, and finally orients the remaining links.
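To give an intuition of the constraint-based idea, here is a minimal sketch of the skeleton phase of PC in its textbook formulation: start fully connected and prune every pair that turns out to be independent given some subset of neighbours. It is not pgmpy's implementation; the function names are made up for this example, the orientation phase is omitted, and a hypothetical oracle for the toy chain A -> B -> C stands in for the statistical tests run on real data.

```python
from itertools import combinations

def pc_skeleton(nodes, is_independent):
    """Skeleton phase of PC: start fully connected, prune independent pairs."""
    adjacency = {n: set(nodes) - {n} for n in nodes}
    size = 0
    # Grow the conditioning-set size until no node has enough neighbours left
    while any(len(adjacency[x]) - 1 >= size for x in nodes):
        for x in nodes:
            for y in list(adjacency[x]):
                others = adjacency[x] - {y}
                for cond in combinations(sorted(others), size):
                    if is_independent(x, y, set(cond)):
                        # x and y are separated by cond: drop the link
                        adjacency[x].discard(y)
                        adjacency[y].discard(x)
                        break
        size += 1
    return {frozenset((x, y)) for x in nodes for y in adjacency[x]}

def toy_oracle(x, y, cond):
    """Hypothetical oracle for the chain A -> B -> C: A and C are independent given B."""
    return {x, y} == {"A", "C"} and "B" in cond

nodes = ["A", "B", "C"]
skeleton = pc_skeleton(nodes, toy_oracle)
print(sorted(sorted(edge) for edge in skeleton))
```

With real data the oracle would be replaced by a conditional independence test (for example a chi-square test), whose outcome depends on the chosen significance level.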

In [None]:
# Estimates the structure of the Bayesian Network from the selected data
pc_estimator = PC(selected_data)
pc_model = pc_estimator.estimate(show_progress=False)
visualize_model(model=pc_model, title="PC model")

In [None]:
analyze_model(model=pc_model, model_name="PC model") # TODO fix this

TODO interpret the results

#### 2.3.2 Tree search (score-based)
This Bayesian Network was obtained by applying the Tree Search algorithm. pgmpy's `TreeSearch` can use either the Chow-Liu algorithm or the TAN algorithm to estimate a tree-like structure from the selected data; since no estimator type is specified below, the Chow-Liu default applies.
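The Chow-Liu idea can be sketched in a few lines: weigh every pair of variables by their mutual information, keep a maximum spanning tree, and orient its links away from a chosen root. The code below is an illustrative sketch on made-up columns, not pgmpy's implementation; `mutual_information` and `chow_liu_edges` are names invented for this example.

```python
from collections import Counter, deque
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def chow_liu_edges(columns, root):
    """Returns the directed edges of a maximum-mutual-information spanning tree."""
    # Rank candidate links by mutual information, strongest first
    ranked = sorted(combinations(columns, 2),
                    key=lambda e: mutual_information(columns[e[0]], columns[e[1]]),
                    reverse=True)
    # Kruskal: keep a link only if it joins two different components
    component = {name: name for name in columns}
    def find(v):
        while component[v] != v:
            v = component[v]
        return v
    neighbours = {name: set() for name in columns}
    for a, b in ranked:
        ra, rb = find(a), find(b)
        if ra != rb:
            component[ra] = rb
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Orient the links away from the root with a breadth-first visit
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        parent = queue.popleft()
        for child in sorted(neighbours[parent] - seen):
            directed.append((parent, child))
            seen.add(child)
            queue.append(child)
    return directed

# Hypothetical columns: B copies A exactly, C is a noisy copy of A
toy = {"A": [0, 0, 0, 0, 1, 1, 1, 1],
       "B": [0, 0, 0, 0, 1, 1, 1, 1],
       "C": [0, 0, 0, 1, 1, 1, 1, 0]}
tree = chow_liu_edges(toy, root="A")
print(tree)
```

A tree over n variables always has n - 1 links, which is why this family of models stays sparse whatever the data looks like.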

In [None]:
ts_estimator = TreeSearch(selected_data, root_node='Intentionality')
ts_model = ts_estimator.estimate(show_progress=False)
visualize_model(model=ts_model, title="Tree search model")

In [None]:
analyze_model(model=ts_model, model_name="Tree search model")

TODO interpret the result

### 2.4 Structure definition and analysis
This Bayesian Network has been built by adding variables one by one following this order:

TODO update this section

Reason -> Platform -> Importance -> Intentionality -> Engagement -> Context-Switch -> Time -> Satisfaction

This is because users choose the platform according to the reason that motivates them to enter the social media space. Depending on the importance of their session, they are more or less intentional with their consumption, which affects their engagement and their context-switch. All the previous variables impact the time spent on the platform and the satisfaction associated with the experience.

At each step, a link between the new variable and the old ones is added according to these considerations:

- Addition of Reason
- Addition of Platform
    - Reason -> Platform: the platform chosen for the session depends on the reason that initiated it as different platforms might be chosen for different reasons
- Addition of Importance
    - Reason -> Importance: the importance of the session depends on the reason that initiated it, for ex: the educational purpose might be considered more important than the procrastination one
- Addition of Intentionality
    - Reason -> Intentionality: the intentionality of the session depends on the reason that initiated it, for ex: if the session started as a habit, then the intentionality might be low, whereas if it started to communicate with someone, then it might be high
    - Platform -> Intentionality: the intentionality of the session depends on the platform as certain platforms are more addictive than others
- Addition of Engagement
    - Reason -> Engagement: the engagement of the session depends on the reason that initiated it, for ex: if the session started for boredom, then the engagement might be low, whereas if it started for entertainment, then it might be high
- Addition of Context-Switch
    - Platform -> Context-Switch: the context-switch level of the session depends on the chosen platform as certain platforms host short pieces of content while others longer ones
    - Engagement -> Context-Switch: the context-switch level of the session depends on the engagement because the more the users are engaged with a certain piece of content the less they context-switch, and vice-versa
- Addition of Time
    - Intentionality -> Time: the time spent on the platform depends on the intentionality as an intentional session might be more efficient than an unintentional one
- Addition of Satisfaction
    - Importance -> Satisfaction, Time -> Satisfaction: the satisfaction associated with the session depends on the ratio between its importance and its duration
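A quick way to sanity-check the links listed above is to verify that they form a directed acyclic graph and that every parent comes before its children in some ordering. The sketch below does this with the standard library's `graphlib` (Python 3.9+); it only looks at the structure and involves no data.

```python
from graphlib import TopologicalSorter

# Links described above, as (parent, child) pairs
edges = [
    ("Reason", "Platform"), ("Reason", "Importance"),
    ("Reason", "Intentionality"), ("Reason", "Engagement"),
    ("Platform", "Intentionality"), ("Platform", "Context-Switch"),
    ("Importance", "Satisfaction"), ("Intentionality", "Time"),
    ("Engagement", "Context-Switch"), ("Time", "Satisfaction"),
]

# Collects the parents of every node
parents = {}
for parent, child in edges:
    parents.setdefault(parent, set())
    parents.setdefault(child, set()).add(parent)

# static_order() raises graphlib.CycleError if the links contain a cycle
order = list(TopologicalSorter(parents).static_order())
for node in order:
    print(f"{node}: parents = {sorted(parents[node])}")
```

Reason is the only variable without parents, so it comes first in any valid ordering.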

In [None]:
# Defines the BN
custom_model = DiscreteBayesianNetwork([
    ('Reason', 'Platform'),
    ('Reason', 'Importance'),
    ('Reason', 'Intentionality'),
    ('Reason', 'Engagement'),
    ('Platform', 'Intentionality'),
    ('Platform', 'Context-Switch'),
    ('Importance', 'Satisfaction'),
    ('Intentionality', 'Time'),
    ('Engagement', 'Context-Switch'),
    ('Time', 'Satisfaction')
])

# Fits the data into the BN to learn the CPTs (a CPT is a table that specifies the probability of each value of a variable for every combination of values of its parents)
custom_model.cpds = []
custom_model.fit(selected_data)

# The printing of the CPTs has been commented out as it took too much space
# # Prints the CPTs
# for cpd in custom_model.get_cpds():
#     print('CPT of {}'.format(cpd.variable))
#     print(cpd, '\n')
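To make the notion of CPT concrete, here is a minimal sketch of how a table can be estimated by counting, assuming a maximum-likelihood estimate (which I believe is what `fit` uses when no estimator is specified). The observations are hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical observations of (parent value, child value), for illustration only
samples = [("Habit", "low"), ("Habit", "low"), ("Habit", "high"),
           ("Education", "high"), ("Education", "high"), ("Education", "low")]

pair_counts = Counter(samples)
parent_counts = Counter(parent for parent, _ in samples)

# Maximum-likelihood estimate: P(child | parent) = count(parent, child) / count(parent)
cpt = {(parent, child): count / parent_counts[parent]
       for (parent, child), count in pair_counts.items()}

for (parent, child), probability in sorted(cpt.items()):
    print(f"P(child={child} | parent={parent}) = {probability:.2f}")
```

For each value of the parent, the probabilities of the child's values sum to 1; with several parents the same counting is done for every combination of parent values.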

In [None]:
visualize_model(custom_model, "Custom model")

In [None]:
analyze_model(custom_model, "Custom model")

## 3. Inferences

The inferences will investigate low/high levels of *place_holder*, so it is necessary to define some thresholds:

- Important session: > 5 (assuming the scale is 1-10)
- High intentionality: > 5 (assuming the scale is 1-10)
- High engagement: > 500 (assuming the scale is 1-1000)
- High context-switch: > 50 (assuming the scale is 1-100)
- Long session: > 30 (assuming minutes as the unit of measurement)
- High satisfaction: > 5 (assuming the scale is 1-10)
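The thresholds above can be kept in one place and applied uniformly, as in the following sketch. The helper `split_states` and the example session lengths are hypothetical; the thresholds themselves are the ones listed above, with the same assumptions on scales and units.

```python
# Thresholds listed above: a value counts as "high" when strictly greater
thresholds = {"Importance": 5, "Intentionality": 5, "Engagement": 500,
              "Context-Switch": 50, "Time": 30, "Satisfaction": 5}

def split_states(states, threshold):
    """Splits the observed states of a variable into (low, high) lists."""
    low = [s for s in states if s <= threshold]
    high = [s for s in states if s > threshold]
    return low, high

# Hypothetical session lengths in minutes, for illustration only
low_time, high_time = split_states([5, 12, 30, 31, 90], thresholds["Time"])
print(low_time, high_time)
```

A value exactly equal to the threshold falls on the "low" side, which matters for variables with few distinct states.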

Assumptions on scales and units of measurement are necessary as the dataset description lacks the needed details. However, it was possible to extract the following information about the domains of the variables:

In [None]:
# Prints the domain of each variable:
for c in selected_data.columns:
    domain = selected_data[c].unique()
    if domain.dtype == "int64":
        print(f"Domain of {c}: [min: {domain.min()}, max: {domain.max()}]")
    else:
        print(f"Domain of {c}: {domain}")

In [None]:
variable_eliminator = VariableElimination(custom_model)

long_time_states = [s for s in custom_model.states['Time'] if s > 30]
low_intentionality_states = [s for s in custom_model.states['Intentionality'] if s <= 5]
low_satisfaction_states = [s for s in custom_model.states['Satisfaction'] if s <= 5]

### 3.1 What is the probability that a shallow session is long?

In [None]:
# TODO do some benchmarks
# TODO answer the same question with an approximate inference

# P(T>30,In<=5)
p_d_of_t_and_in = variable_eliminator.query(variables=['Time', 'Intentionality'], joint=True)
p_of_long_t_and_low_in = 0
for t in long_time_states:
    for i in low_intentionality_states:
        p_of_long_t_and_low_in += p_d_of_t_and_in.get_value(**{'Time':t, 'Intentionality':i})
print(f"P(T>30,In<=5): {p_of_long_t_and_low_in}")

# P(In<=5)
p_d_of_in = variable_eliminator.query(variables=['Intentionality'])
p_of_low_in = 0
for i in low_intentionality_states:
    p_of_low_in += p_d_of_in.get_value(**{'Intentionality':i})
print(f"P(In<=5): {p_of_low_in}")

# P(T>30|In<=5) = P(T>30,In<=5) / P(In<=5)
p_of_long_t_given_low_in = p_of_long_t_and_low_in / p_of_low_in
print(f"P(T>30|In<=5): {p_of_long_t_given_low_in}")
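The computation above follows the definition of conditional probability: sum the joint over the states of interest, then divide by the probability of the conditioning event. The same pattern can be sketched as a small reusable helper; the joint table below is hypothetical (plain Python, no pgmpy) and `conditional_probability` is a name made up for this example.

```python
def conditional_probability(joint, event, given):
    """P(event | given) from a joint table {state tuple: probability}.

    event and given are predicates evaluated on each state tuple.
    """
    p_given = sum(p for states, p in joint.items() if given(states))
    p_both = sum(p for states, p in joint.items() if given(states) and event(states))
    return p_both / p_given

# Hypothetical joint distribution over (Time, Intentionality), for illustration only
joint = {(20, 3): 0.10, (20, 8): 0.30, (45, 3): 0.40, (45, 8): 0.20}

p_long_given_shallow = conditional_probability(
    joint,
    event=lambda s: s[0] > 30,    # long session
    given=lambda s: s[1] <= 5,    # low intentionality
)
print(f"P(T>30|In<=5): {p_long_given_shallow:.2f}")
```

The helper would need an adapter to read values from a pgmpy factor; it is only meant to show the arithmetic shared by all the queries in this section.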

### 3.2 What is the probability that a long session is not satisfying?

In [None]:
# P(S<=5,T>30)
p_d_of_s_and_t = variable_eliminator.query(variables=['Satisfaction', 'Time'], joint=True)
p_of_low_s_and_long_t = 0
for d in low_satisfaction_states:
    for t in long_time_states:
        p_of_low_s_and_long_t += p_d_of_s_and_t.get_value(**{'Satisfaction':d, 'Time':t})
print(f"P(S<=5,T>30): {p_of_low_s_and_long_t}")

# P(T>30)
p_d_of_t = variable_eliminator.query(variables=['Time'])
p_of_long_t = 0
for t in long_time_states:
    p_of_long_t += p_d_of_t.get_value(**{'Time':t})
print(f"P(T>30): {p_of_long_t}")

# P(S<=5|T>30) = P(S<=5,T>30) / P(T>30)
p_of_low_s_given_long_t = p_of_low_s_and_long_t / p_of_long_t
print(f"P(S<=5|T>30): {p_of_low_s_given_long_t}")

### 3.3 What is the probability that a long and shallow session is not satisfying?

In [None]:
# P(S<=5,T>30,In<=5)
p_d_of_s_and_t_and_in = variable_eliminator.query(variables=['Satisfaction', 'Time', 'Intentionality'], joint=True)
p_of_low_s_and_long_t_and_low_in = 0
for d in low_satisfaction_states:
    for t in long_time_states:
        for i in low_intentionality_states:
            p_of_low_s_and_long_t_and_low_in += p_d_of_s_and_t_and_in.get_value(**{'Satisfaction':d, 'Time':t, 'Intentionality':i})
print(f"P(S<=5,T>30,In<=5): {p_of_low_s_and_long_t_and_low_in}")

# P(S<=5|T>30,In<=5) = P(S<=5,T>30,In<=5) / P(T>30,In<=5)
p_of_low_s_given_long_t_and_low_in = p_of_low_s_and_long_t_and_low_in / p_of_long_t_and_low_in
print(f"P(S<=5|T>30,In<=5): {p_of_low_s_given_long_t_and_low_in}")

### 3.4 What is the probability that a shallow session is not satisfying?

In [None]:
# P(S<=5,In<=5)
p_d_of_s_and_in = variable_eliminator.query(variables=['Satisfaction', 'Intentionality'], joint=True)
p_of_low_s_and_low_in = 0
for d in low_satisfaction_states:
    for i in low_intentionality_states:
        p_of_low_s_and_low_in += p_d_of_s_and_in.get_value(**{'Satisfaction':d, 'Intentionality':i})
print(f"P(S<=5,In<=5): {p_of_low_s_and_low_in}")

# P(S<=5|In<=5) = P(S<=5,In<=5) / P(In<=5)
p_of_low_s_given_low_in = p_of_low_s_and_low_in / p_of_low_in
print(f"P(S<=5|In<=5): {p_of_low_s_given_low_in}")

## 4. The conclusions