# Time-Stealed, or how social media is stealing our lives

"Year 2025: a great portion of our lives is experienced within a virtual environment where we can access an infinite stream of information that excites our brains and rewards them with rivers of dopamine. The speed, the bright colors, the catchy audio and the enraging news: everything has been carefully engineered to keep us trapped in front of a glowing screen."

Told like this, it might sound like the opening of a cyberpunk novel, but it's actually not that far from our reality, is it? The virtual environments I'm talking about are obviously social media platforms, which, despite all they've done for the free circulation of information, have a scary dark side.

The aim of this project is to shed light on that side and investigate where our time goes and why it seems so difficult to put a stop to a now far too common habit: doomscrolling.

But why would social media platforms be designed to keep us on them? To understand this, it's important to clarify what their business model is. Their profit originates primarily from the "attention economy": in other words, they make money by showing third-party companies' ads to their user base (and by selling user data, but that is a whole other story). So, when the content you're consuming is suddenly interrupted by an ad, the attention economy has come full circle: a company has made its attempt to convince you to buy its product and the hosting platform has taken its share. The follow-up question is then: how does a platform boost its revenue? Simple: it tries to maximise the exposure time and the engagement of the user. And how does it do that? With a plethora of sophisticated [psychological tricks](https://www.youtube.com/watch?v=uaaC57tcci0) that hook the user's attention for [hours](https://www.statista.com/statistics/433871/daily-social-media-usage-worldwide/) on end.

## 1. The dataset
This project leverages Bayesian networks to extract interesting patterns from the dataset [Dark Side Of Social Media](https://www.kaggle.com/datasets/muhammadroshaanriaz/time-wasters-on-social-media?resource=download). This dataset was produced with synthetic data generation techniques to simulate real-world social media usage patterns (read the [EDA](https://www.kaggle.com/code/waqi786/in-depth-analysis-of-time-wasters-on-social-media) for further details).

In [None]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.inference import VariableElimination
from pgmpy.factors.discrete import TabularCPD

In [None]:
# Loads and previews the dataset and its shape
data = pd.read_csv('Time-Wasters on Social Media.csv', delimiter=',')
print(f'Shape: {data.shape}')
data.head()

## 2. The Bayesian network

### 2.1 The variables
The dataset contains many variables, ranging from the characteristics of the users to the features of their devices. However, for the purpose of this project, which focuses on the time spent on social media due to its addictive traits, only 8 out of 31 columns have been selected. Here is a brief description of them:

- Reason: the reason why the users entered the platform. It's interesting because it represents the initial emotional state of the users and what they wish to get out of their social media consumption

- Platform: the social media platform used by the users. It's interesting because each platform has its unique features and implementation

- Importance: the importance assigned by the users to the content. It's interesting because it returns relevant information about the cost/benefit tradeoff accepted by the users

- Intentionality: the level of intentionality of the users when they decided to enter the platform. Originally called "Self Control", it's interesting because it's one of the behavioural traits that the study wishes to capture

- Engagement: the engagement level of the users with the content. It's interesting because it quantifies the degree of activity/passivity of the content consumption experience

- Context-Switch: how quickly the users switch from one piece of content to the next. Originally called "Scroll Rate", it's interesting because it hints at the length of the users' attention span

- Time: the total amount of time that the users have spent on the platform. It's the most interesting aspect of the study given that it's the "price" of social media consumption

- Satisfaction: the satisfaction level of the users at the end of the content consumption. It's interesting because it returns relevant information about the cost/benefit tradeoff accepted by the users

In [None]:
# Selects only the relevant columns (as an explicit copy) and renames them for clarity
selected_data = data[['Watch Reason', 'Platform', 'Importance Score', 'Self Control', 'Engagement', 'Scroll Rate', 'Total Time Spent', 'Satisfaction']].copy()
selected_data.columns = ['Reason', 'Platform', 'Importance', 'Intentionality', 'Engagement', 'Context-Switch', 'Time', 'Satisfaction']
# Uses the user IDs as the index
selected_data.index = data.UserID.values
selected_data.head()

### 2.2 The ordering
The ordering of the variables has been obtained by organizing them in the following causal structure:

|       | A             | B              | C             | D              |
| ----- | ------------- | -------------- | ------------- | -------------- |
| **1** | Reason        | Platform       |               |                |
| **2** | Importance    | Intentionality | Engagement    | Context-Switch |
| **3** | Time          | Satisfaction   |               |                |

Variables in row 1 initiate the session, those in row 2 define it and those in row 3 represent the user experience at the end of it. Specifically, we can say that users choose the platform according to the reason that motivates them to enter the social media space. Depending on the importance of their session, they are more or less intentional with their consumption, which affects their engagement and their context-switch. All the previous variables impact the time spent on the platform and the satisfaction associated with the experience.

Ordering: Reason -> Platform -> Importance -> Intentionality -> Engagement -> Context-Switch -> Time -> Satisfaction

### 2.3 The structure
To build a Bayesian Network, variables are inserted one by one following an ordering. At each step, links from the old variables to the new one can be added according to a specific criterion. For this Bayesian Network, the following logic-driven approach has been applied:

- Addition of Reason
- Addition of Platform
    - Reason -> Platform: the platform chosen for the session depends on the reason that initiated it as different platforms might be chosen for different reasons
- Addition of Importance
    - Reason -> Importance: the importance of the session depends on the reason that initiated it, e.g. an educational purpose might be considered more important than procrastination
- Addition of Intentionality
    - Reason -> Intentionality: the intentionality of the session depends on the reason that initiated it, e.g. if the session started as a habit, then the intentionality might be low, whereas if it started to communicate with someone, then it might be high
    - Platform -> Intentionality: the intentionality of the session depends on the platform as certain platforms are more addictive than others
- Addition of Engagement
    - Reason -> Engagement: the engagement of the session depends on the reason that initiated it, e.g. if the session started out of boredom, then the engagement might be low, whereas if it started for entertainment, then it might be high
- Addition of Context-Switch
    - Platform -> Context-Switch: the context-switch level of the session depends on the chosen platform as certain platforms host short pieces of content while others longer ones
    - Engagement -> Context-Switch: the context-switch level of the session depends on the engagement because the more the users are engaged with a certain piece of content, the less they context-switch, and vice versa
- Addition of Time
    - Intentionality -> Time: the time spent on the platform depends on the intentionality as an intentional session might be more efficient than an unintentional one
- Addition of Satisfaction
    - Importance -> Satisfaction, Time -> Satisfaction: the satisfaction associated with the session depends on the ratio between its importance and its duration
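
As a quick sanity check, the links listed above can be verified against the ordering: every parent should come before its child, otherwise the structure would not respect the insertion order. The sketch below is plain Python and independent of pgmpy; `EDGES`, `ORDERING`, `violations` and `parents_of` are illustrative names introduced here, not part of the project code.

```python
# Links of the logic-driven network as (parent, child) pairs, copied from the list above.
EDGES = [
    ("Reason", "Platform"),
    ("Reason", "Importance"),
    ("Reason", "Intentionality"),
    ("Platform", "Intentionality"),
    ("Reason", "Engagement"),
    ("Platform", "Context-Switch"),
    ("Engagement", "Context-Switch"),
    ("Intentionality", "Time"),
    ("Importance", "Satisfaction"),
    ("Time", "Satisfaction"),
]

# Insertion ordering of the variables, as defined in the previous section.
ORDERING = [
    "Reason", "Platform", "Importance", "Intentionality",
    "Engagement", "Context-Switch", "Time", "Satisfaction",
]


def violations(edges, ordering):
    """Return the links whose parent does not precede its child in the ordering."""
    position = {name: i for i, name in enumerate(ordering)}
    return [(p, c) for p, c in edges if position[p] >= position[c]]


def parents_of(edges, ordering):
    """Map every variable to the list of its parents."""
    result = {name: [] for name in ordering}
    for parent, child in edges:
        result[child].append(parent)
    return result


print("Violations:", violations(EDGES, ORDERING))
for variable, parents in parents_of(EDGES, ORDERING).items():
    print(f"{variable}: {parents}")
```

An empty list of violations means that the ordering is a valid topological order of the graph, which is also a cheap way to confirm that the structure contains no cycles.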

In [None]:
# TODO The prior probability can be computed directly on the data, positive cases / possible cases

# Defines the BN
logic_diven_model = DiscreteBayesianNetwork([
    ('Reason', 'Platform'),
    ('Reason', 'Importance'),
    ('Reason', 'Intentionality'),
    ('Reason', 'Engagement'),
    ('Platform', 'Intentionality'),
    ('Platform', 'Context-Switch'),
    ('Importance', 'Satisfaction'),
    ('Intentionality', 'Time'),
    ('Engagement', 'Context-Switch'),
    ('Time', 'Satisfaction')
])

# Fits the data into the BN to learn the CPTs (a CPT is a table that specifies the probability of each value of a variable for every combination of values of its parents)
logic_diven_model.cpds = []
logic_diven_model.fit(selected_data)

# Prints the CPTs
for cpd in logic_diven_model.get_cpds():
    print('CPT of {}'.format(cpd.variable))
    print(cpd, '\n')
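
To make the idea of "learning a CPT" more concrete: as far as I know, `fit` defaults to maximum-likelihood estimation, which boils down to conditional relative frequencies (matching cases / possible cases). The sketch below reproduces that counting logic in plain Python on a handful of made-up records; `estimate_cpt`, `toy_records` and the platform labels are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict


def estimate_cpt(records, child, parents):
    """Estimate P(child | parents) as conditional relative frequencies."""
    joint_counts = Counter()   # counts of (parent values, child value)
    parent_counts = Counter()  # counts of parent values alone
    for row in records:
        parent_values = tuple(row[p] for p in parents)
        joint_counts[(parent_values, row[child])] += 1
        parent_counts[parent_values] += 1

    cpt = defaultdict(dict)
    for (parent_values, child_value), n in joint_counts.items():
        cpt[parent_values][child_value] = n / parent_counts[parent_values]
    return dict(cpt)


# Hypothetical toy records, NOT taken from the dataset.
toy_records = [
    {"Reason": "Habit", "Platform": "PlatformA"},
    {"Reason": "Habit", "Platform": "PlatformA"},
    {"Reason": "Habit", "Platform": "PlatformB"},
    {"Reason": "Boredom", "Platform": "PlatformB"},
]

# P(Platform | Reason): one distribution per value of the parent.
print(estimate_cpt(toy_records, child="Platform", parents=["Reason"]))

# With no parents the same function returns the prior P(Reason).
print(estimate_cpt(toy_records, child="Reason", parents=[]))
```

Parent combinations that never occur in the data simply have no entry here; a library implementation has to decide how to fill those, so the real CPTs may differ from this sketch on unseen combinations.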

In [None]:
# Visualises the BN
plt.figure(figsize=(12, 7))
nx.draw_circular(logic_diven_model, node_color='b', node_size=6000, with_labels=True, font_color='w', font_size=9)
plt.show()

$$P(R,P,Im,In,E,C,T,S) = P(R)\,P(P \mid R)\,P(Im \mid R)\,P(In \mid R,P)\,P(E \mid R)\,P(C \mid P,E)\,P(T \mid In)\,P(S \mid Im,T)$$
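
One way to appreciate this factorization is to count how many independent probabilities it needs compared with the full joint table. The sketch below does that count in plain Python. The cardinalities are an assumption made only for illustration (4 reasons, 4 platforms, and every other variable reduced to low/high); the real domains are larger, which makes the gap even wider.

```python
from math import prod

# Parents of each variable, read off the factorization above.
PARENTS = {
    "Reason": [],
    "Platform": ["Reason"],
    "Importance": ["Reason"],
    "Intentionality": ["Reason", "Platform"],
    "Engagement": ["Reason"],
    "Context-Switch": ["Platform", "Engagement"],
    "Time": ["Intentionality"],
    "Satisfaction": ["Importance", "Time"],
}

# Hypothetical number of states per variable (assumption, not from the dataset).
CARDINALITY = {
    "Reason": 4,
    "Platform": 4,
    "Importance": 2,
    "Intentionality": 2,
    "Engagement": 2,
    "Context-Switch": 2,
    "Time": 2,
    "Satisfaction": 2,
}


def factorized_parameters(parents, cardinality):
    """Free parameters of the network: (states - 1) per parent configuration."""
    return sum(
        (cardinality[variable] - 1) * prod(cardinality[p] for p in variable_parents)
        for variable, variable_parents in parents.items()
    )


def full_joint_parameters(cardinality):
    """Free parameters of the unfactorized joint table."""
    return prod(cardinality.values()) - 1


print("Factorized:", factorized_parameters(PARENTS, CARDINALITY))
print("Full joint:", full_joint_parameters(CARDINALITY))
```

Fewer parameters means that each CPT entry is estimated from more rows of the dataset, which is the practical reason for keeping the number of parents per variable small.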

## 3. The inferences

The inferences will investigate low/high levels of the numeric variables, so it is necessary to define some thresholds:

- Important session: > 5 (assuming the scale is 1-10)
- High intentionality: > 5 (assuming the scale is 1-10)
- High engagement: > 500 (assuming the scale is 1-1000)
- High context-switch: > 50 (assuming the scale is 1-100)
- Long session: > 30 (assuming minutes as the unit of measurement)
- High satisfaction: > 5 (assuming the scale is 1-10)

Assumptions on scales and units of measurement are necessary as the dataset description lacks the needed details. However, it was possible to extract the following information about the domains of the variables:

In [None]:
# Prints the domain of each variable:
for c in selected_data.columns:
    domain = selected_data[c].unique()
    if domain.dtype == "int64":
        print(f"Domain of {c}: [min: {domain.min()}, max: {domain.max()}]")
    else:
        print(f"Domain of {c}: {domain}")

In [None]:
variable_eliminator = VariableElimination(logic_diven_model)

### 3.1 Investigations on Time

- What is the probability that a low level of intentionality leads to a long session?

In [None]:
# TODO understand how variable elimination works and implement the needed functions
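
Until the cell above is implemented, here is a rough sketch of what variable elimination does for this kind of question, worked by hand on a reduced chain Reason -> Intentionality -> Time. Everything in it is an assumption made for illustration: the CPT values are invented, Platform is left out, and the states are a tiny subset of the real domains. The point is only the mechanics: sum out Reason to get P(In), then restrict to the low-intentionality states and normalise.

```python
# Hypothetical CPTs for a reduced chain Reason -> Intentionality -> Time.
# All numbers are made up for illustration; they are NOT learned from the dataset.
P_REASON = {"Habit": 0.5, "Other": 0.5}

P_INT_GIVEN_REASON = {
    "Habit": {2: 0.5, 4: 0.3, 8: 0.2},
    "Other": {2: 0.1, 4: 0.3, 8: 0.6},
}

P_TIME_GIVEN_INT = {
    2: {10: 0.2, 60: 0.8},
    4: {10: 0.5, 60: 0.5},
    8: {10: 0.8, 60: 0.2},
}


def marginal_intentionality():
    """Eliminate Reason: P(In) = sum over r of P(r) * P(In | r)."""
    marginal = {}
    for reason, p_reason in P_REASON.items():
        for level, p_level in P_INT_GIVEN_REASON[reason].items():
            marginal[level] = marginal.get(level, 0.0) + p_reason * p_level
    return marginal


def p_long_given_low_intentionality(int_threshold=5, time_threshold=30):
    """P(T > time_threshold | In <= int_threshold) on the toy chain."""
    p_int = marginal_intentionality()
    low_levels = [level for level in p_int if level <= int_threshold]
    # Numerator: P(T > time_threshold, In <= int_threshold)
    joint = sum(
        p_int[level] * p_time
        for level in low_levels
        for minutes, p_time in P_TIME_GIVEN_INT[level].items()
        if minutes > time_threshold
    )
    # Denominator: P(In <= int_threshold)
    evidence = sum(p_int[level] for level in low_levels)
    return joint / evidence


print(f"P(T>30 | In<=5) on the toy chain: {p_long_given_low_intentionality():.2f}")
```

Because the evidence here is a range of states rather than a single value, it cannot be passed as one `evidence` entry; on the real model, a possible route would be to query the joint distribution of Time and Intentionality and apply the same numerator/denominator sums to it.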

- What is the probability that a session motivated by habit is a long session?

In [None]:
# States of Time that count as a long session (> 30, assumed to be minutes)
time_states = [t for t in logic_diven_model.states['Time'] if t > 30]
# P(T|R='Habit')
probability_distribution_of_time_given_reason = variable_eliminator.query(variables=['Time'], evidence={'Reason': 'Habit'})
# P(T>30|R='Habit') is the sum of P(T=t|R='Habit') over the long-session states
result = 0
for t in time_states:
    result += probability_distribution_of_time_given_reason.get_value(**{'Time': t})
print(f"P(T>30|R='Habit'): {result}")

- What is the platform that retains the user for the longest? / What is the platform that is most likely to retain the user for a long time?

### 3.2 Investigations on Satisfaction

- What is the probability that a long and shallow session leads to a low level of satisfaction?

- What is the reason that leads to the lowest level of satisfaction?

- What is the platform that leads to the lowest level of satisfaction?

### 3.3 Investigations on Intentionality

- What is the probability that an engaging session was highly intentional?

- What is the reason that leads to the lowest level of intentionality?

- What is the most addictive platform?

### 3.4 Investigations on Context-Switch

- What is the probability that a low level of engagement leads to a high level of context-switch?

- What is the reason that leads to the highest level of context-switch?

- What is the platform related to the highest level of context-switch?

## 4. The conclusions

- find other topics to test

Two structures have been defined for the Bayesian Network: one has been built with a logic-driven approach while the other has been derived with a data-driven one. Both approaches insert variables one by one following the previously defined ordering; what changes is the criterion with which a link is added (or not) between the new variable and the old ones.
The reasoning behind this approach is the following:

To build a Bayesian Network, variables are inserted one by one following an ordering. At each step, a link between the new variable and the old ones can be added according to a specific criterion. For this Bayesian Network, a data-driven approach has been discarded as the selected variables are totally intertwined with each other. Instead, the following logic-driven approach has been applied: