<center><h1>Lab: Exploring the DSA Transparency database</h1></center>

The goal of this lab is to study, with very simple methods, a sample from the [Digital Services Act Transparency Database](https://transparency.dsa.ec.europa.eu/dashboard). The database collects moderation decisions by online platforms (VLOPs, Very Large Online Platforms), as identified by the [DSA](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX%3A32022R2065#enc_1) and the European Commission.

The main goal of the lab is to gain insight into the differentiated moderation practices of social media platforms subject to the DSA's obligations. To do this, we use a stratified sample of the transparency database, filtered to contain 70000 decisions by social media platforms.


## Imports

In [2]:
import numpy
import pandas as pd

## Load data

In [5]:
# read_csv opens and closes the file itself, so no explicit open() is needed
data = pd.read_csv("./sample-strat-may2024-socmed-70k.csv", encoding="utf-8", sep=",")

## Basic data exploration

The first goal is to get the hang of the data that you have and its variables, and to answer a few basic questions. You can familiarise yourself with the variables by reading the document "DSA transparency database - Description of variables.docx" provided.

- Which are the platforms in the sample?
- How many decisions are present for each platform?
- Which variables would you modify, and how (this might include removing variables, changing their type, creating new variables from a combination of existing ones, etc.)?

Write your thoughts and summarised results here.

In [12]:
data["platform_name"].value_counts()

platform_name
Facebook     10000
TikTok       10000
Instagram    10000
Snapchat     10000
LinkedIn     10000
X            10000
YouTube      10000
Name: count, dtype: int64

- The platforms are Facebook, TikTok, Instagram, Snapchat, LinkedIn, X and YouTube.
- There are 10000 decisions for each platform.
- Most of the variables are useful, but some are not (for example 'platform_uid'). Others, such as 'incompatible_content_illegal', could be changed: it is an open field but should be multiple choice. The individual dates ('application_date', 'content_date', 'created_at') are not very useful on their own and could be replaced by the differences between them. The same goes for the restriction end dates ('end_date_visibility_restriction', 'end_date_monetary_restriction', 'end_date_service_restriction', 'end_date_account_restriction'): most of the time the restrictions end at the same time, and knowing which restriction was applied is already sufficient.

In general, you can see a variable `x` by typing `data["x"]`, see the different values the variable takes with `set(data["x"].values)`, and see the counts of values for a variable using `data["x"].value_counts()`.
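
Two further checks can complement the commands above: the share of missing values and the number of distinct values per column. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its four rows are made up for illustration); on the lab data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the lab's `data`
toy = pd.DataFrame({
    "platform_name": ["X", "X", "TikTok", "TikTok"],
    "decision_visibility": [None, None, '["DECISION_VISIBILITY_OTHER"]', None],
})

# Share of missing values per column (0.0 = always filled, 1.0 = always empty)
missing_share = toy.isna().mean()

# Number of distinct non-missing values per column
distinct_values = toy.nunique()

print(missing_share)
print(distinct_values)
```

Columns that are almost always empty, or that take a single value, are natural candidates for removal.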

In [16]:
data["decision_visibility"]

0                                            NaN
1                                            NaN
2                                            NaN
3                                            NaN
4        ["DECISION_VISIBILITY_CONTENT_REMOVED"]
                          ...                   
69995    ["DECISION_VISIBILITY_CONTENT_REMOVED"]
69996    ["DECISION_VISIBILITY_CONTENT_REMOVED"]
69997    ["DECISION_VISIBILITY_CONTENT_REMOVED"]
69998    ["DECISION_VISIBILITY_CONTENT_REMOVED"]
69999    ["DECISION_VISIBILITY_CONTENT_REMOVED"]
Name: decision_visibility, Length: 70000, dtype: object

Already you should be able to remove or modify a few variables. Do so before going on, to facilitate your exploration. You can remove a column with `data = data.drop("platform_name", axis=1)`.

In [19]:
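# Note: drop() returns a new DataFrame; `data` itself is only modified if the
# result is assigned back (data = data.drop(...)). The calls below are previews
# only, and the notebook displays the result of the last one.
data.drop("decision_visibility", axis=1)
data.drop("source_type", axis=1)
data.drop("application_date", axis=1)
data.drop("source_identity", axis=1)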

Unnamed: 0,decision_visibility,decision_ground,illegal_content_legal_ground,incompatible_content_ground,category,content_type,content_date,territorial_scope,application_date,source_type,automated_detection,automated_decision,platform_name,created_at
0,,DECISION_GROUND_INCOMPATIBLE_CONTENT,,"This was a violation of ""sections 3.2"" of our ...",STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-05-05 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-05 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_PARTIALLY,Facebook,2024-05-06 14:40:11
1,,DECISION_GROUND_INCOMPATIBLE_CONTENT,,"This was a violation of ""sections 3.2"" of our ...",STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-05-11 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-11 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_PARTIALLY,Facebook,2024-05-12 13:42:48
2,,DECISION_GROUND_INCOMPATIBLE_CONTENT,,"This was a violation of ""sections 3.2"" of our ...",STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-05-05 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-05 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_PARTIALLY,Facebook,2024-05-06 11:51:21
3,,DECISION_GROUND_INCOMPATIBLE_CONTENT,,"This was a violation of ""sections 3.2"" of our ...",STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-05-08 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-08 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_PARTIALLY,Facebook,2024-05-09 14:40:45
4,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,"This was a violation of ""sections 3.2"" of our ...",STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_SYNTHETIC_MEDIA""]",2024-05-11 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-11 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_PARTIALLY,Facebook,2024-05-12 16:34:10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Destination requirements,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-03-05 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-07 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_FULLY,YouTube,2024-05-08 22:51:13
69996,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Destination requirements,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2024-04-22 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-05 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_FULLY,YouTube,2024-05-06 23:23:28
69997,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Technical requirements,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_OTHER""]",2023-10-17 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-06 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_FULLY,YouTube,2024-05-07 22:30:52
69998,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Abusing the ad network,STATEMENT_CATEGORY_SCAMS_AND_FRAUD,"[""CONTENT_TYPE_OTHER""]",2023-08-18 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-07 00:00:00,SOURCE_VOLUNTARY,Yes,AUTOMATED_DECISION_NOT_AUTOMATED,YouTube,2024-05-08 22:31:53
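
Besides removing columns, some variables may be easier to work with after a type change. Several columns shown above (for example `content_type`) store lists as JSON-like strings, and `automated_detection` stores Yes/No as text. The following is a minimal sketch of one possible conversion, assuming the strings are valid JSON; `toy` and the helper `parse_list` are hypothetical names introduced for illustration.

```python
import json
import pandas as pd

# Hypothetical toy frame mimicking two columns of the lab data
toy = pd.DataFrame({
    "content_type": [
        '["CONTENT_TYPE_TEXT"]',
        '["CONTENT_TYPE_IMAGE","CONTENT_TYPE_VIDEO"]',
        None,
    ],
    "automated_detection": ["Yes", "No", "Yes"],
})

def parse_list(value):
    """Turn a JSON-encoded list string into a Python list; missing values become []."""
    if pd.isna(value):
        return []
    return json.loads(value)

# Replace the string column by a column of real Python lists
toy["content_type"] = toy["content_type"].apply(parse_list)

# Map the Yes/No strings to booleans
toy["automated_detection"] = toy["automated_detection"].map({"Yes": True, "No": False})

print(toy)
```

Once a column holds real lists, `toy.explode("content_type")` gives one row per list element, which makes counting individual values straightforward.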


## Filtering by platform

Now that we have a general idea of the database, we would like to get a first idea of differentiated behaviours by platform. You can filter a dataframe like so: `data[data["x"]==y]`. Below, for example, is the code to get all decisions for TikTok.

In [22]:
data[data["platform_name"] == "TikTok"]

Unnamed: 0,decision_visibility,decision_ground,illegal_content_legal_ground,incompatible_content_ground,category,content_type,content_date,territorial_scope,application_date,source_type,source_identity,automated_detection,automated_decision,platform_name,created_at
10000,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Hate Speech and Hateful Behaviors,STATEMENT_CATEGORY_ILLEGAL_OR_HARMFUL_SPEECH,"[""CONTENT_TYPE_TEXT""]",2024-05-07 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-07 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-07 23:10:14
10001,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-09 02:58:51
10002,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,The ad features restricted or prohibited claim...,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-08 00:00:00,"[""PL""]",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_NOT_AUTOMATED,TikTok,2024-05-10 05:49:37
10003,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-09 19:31:33
10004,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-10 11:18:20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-10 10:15:31
19996,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_IMAGE""]",2024-04-18 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-06 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-06 04:47:09
19997,"[""DECISION_VISIBILITY_OTHER""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Community Guidelines,STATEMENT_CATEGORY_SCOPE_OF_PLATFORM_SERVICE,"[""CONTENT_TYPE_VIDEO""]",2024-05-09 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-09 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-11 14:11:54
19998,"[""DECISION_VISIBILITY_CONTENT_REMOVED""]",DECISION_GROUND_INCOMPATIBLE_CONTENT,,Animal Abuse,STATEMENT_CATEGORY_ANIMAL_WELFARE,"[""CONTENT_TYPE_TEXT""]",2024-05-08 00:00:00,"[""AT"",""BE"",""BG"",""CY"",""CZ"",""DE"",""DK"",""EE"",""ES"",...",2024-05-08 00:00:00,SOURCE_VOLUNTARY,,Yes,AUTOMATED_DECISION_FULLY,TikTok,2024-05-08 21:09:20
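
The same filtering pattern extends to several conditions at once. The sketch below uses a hypothetical four-row frame (`toy`) rather than the lab data; the column names and values are taken from the tables above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the lab's `data`
toy = pd.DataFrame({
    "platform_name": ["TikTok", "TikTok", "X", "YouTube"],
    "automated_decision": [
        "AUTOMATED_DECISION_FULLY",
        "AUTOMATED_DECISION_NOT_AUTOMATED",
        "AUTOMATED_DECISION_NOT_AUTOMATED",
        "AUTOMATED_DECISION_FULLY",
    ],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
tiktok_fully = toy[
    (toy["platform_name"] == "TikTok")
    & (toy["automated_decision"] == "AUTOMATED_DECISION_FULLY")
]

# isin() keeps the rows whose value belongs to a given collection
two_platforms = toy[toy["platform_name"].isin(["TikTok", "YouTube"])]

print(len(tiktok_fully), len(two_platforms))
```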


In [32]:
automatic_detection_counts = data.groupby(["platform_name", "automated_detection"]).size().unstack()
print(automatic_detection_counts)
automated_decision_counts = data.groupby(["platform_name", "automated_decision"]).size()
print(automated_decision_counts)

automated_detection       No     Yes
platform_name                       
Facebook               129.0  9871.0
Instagram              521.0  9479.0
LinkedIn               307.0  9693.0
Snapchat              7672.0  2328.0
TikTok                 316.0  9684.0
X                    10000.0     NaN
YouTube               1159.0  8841.0
platform_name  automated_decision              
Facebook       AUTOMATED_DECISION_NOT_AUTOMATED      129
               AUTOMATED_DECISION_PARTIALLY         9871
Instagram      AUTOMATED_DECISION_NOT_AUTOMATED      521
               AUTOMATED_DECISION_PARTIALLY         9479
LinkedIn       AUTOMATED_DECISION_FULLY              962
               AUTOMATED_DECISION_NOT_AUTOMATED     9038
Snapchat       AUTOMATED_DECISION_FULLY             2811
               AUTOMATED_DECISION_NOT_AUTOMATED     7189
TikTok         AUTOMATED_DECISION_FULLY             9378
               AUTOMATED_DECISION_NOT_AUTOMATED      622
X              AUTOMATED_DECISION_NOT_AUTOMATED  

Perform a variable exploration per platform. What can you infer in terms of differentiated moderation practices?

We observe that moderation practices differ across platforms. For instance, X reports no automated detection at all. Facebook relies on automated detection for the vast majority of its decisions (9871 of 10000), and those decisions are partially automated; Instagram is similar (9479). LinkedIn detects most content automatically (9693) yet still takes most decisions manually (9038). Overall, these data show that moderation practices vary widely between platforms.
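
Comparisons of this kind can be easier to read as shares than as raw counts. The snippet below is a minimal sketch using `pd.crosstab` with `normalize="index"`, which divides each row by its total; the frame `toy` and the platform names "A" and "B" are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: platform "A" has 3 of 4 decisions detected automatically
toy = pd.DataFrame({
    "platform_name": ["A", "A", "A", "A", "B", "B"],
    "automated_detection": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

# Row-normalised cross table: each row sums to 1, missing combinations become 0
shares = pd.crosstab(
    toy["platform_name"], toy["automated_detection"], normalize="index"
)

print(shares)
```

Unlike `groupby(...).size().unstack()`, which leaves `NaN` for combinations that never occur (as for X above), `crosstab` fills them with zero.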

## Cross-frequency counts

In the next steps, we are interested in going a bit deeper and seeing how variables relate to each other. This can be done with a cross-frequency analysis, _i.e._ looking at how often values of different variables co-occur. Below is an example:

In [50]:
data.groupby("platform_name")["decision_ground"].value_counts()

platform_name  decision_ground                     
Facebook       DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
Instagram      DECISION_GROUND_INCOMPATIBLE_CONTENT     9999
               DECISION_GROUND_ILLEGAL_CONTENT             1
LinkedIn       DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
Snapchat       DECISION_GROUND_INCOMPATIBLE_CONTENT     9993
               DECISION_GROUND_ILLEGAL_CONTENT             7
TikTok         DECISION_GROUND_INCOMPATIBLE_CONTENT    10000
X              DECISION_GROUND_ILLEGAL_CONTENT         10000
YouTube        DECISION_GROUND_INCOMPATIBLE_CONTENT     9957
               DECISION_GROUND_ILLEGAL_CONTENT            43
Name: count, dtype: int64

What can you tell, by platform, reading this kind of analysis?

Facebook, Instagram, LinkedIn, Snapchat, TikTok and YouTube overwhelmingly justify their decisions by internal platform policies (DECISION_GROUND_INCOMPATIBLE_CONTENT), with few or no decisions based on illegal content. In this sample, X reports only illegal content (DECISION_GROUND_ILLEGAL_CONTENT), suggesting an emphasis on addressing legal obligations over internal policy enforcement.
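
A cross-frequency count like the one above can also be laid out as a two-way table with row and column totals. This is a minimal sketch with `pd.crosstab` and `margins=True`; the frame `toy` and the platform names "A" and "B" are hypothetical, while the two ground labels come from the output above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the lab's `data`
toy = pd.DataFrame({
    "platform_name": ["A", "A", "A", "B", "B"],
    "decision_ground": [
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_INCOMPATIBLE_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
        "DECISION_GROUND_ILLEGAL_CONTENT",
    ],
})

# Two-way frequency table; margins=True adds an "All" row and column with totals
table = pd.crosstab(toy["platform_name"], toy["decision_ground"], margins=True)

print(table)
```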

## Moderation time

Create a new variable for moderation time, _i.e._ the time between the decision and its application. Is this new variable relevant? What can you conclude from it?

In [49]:

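# Parse the two date columns (has no effect if they are already datetimes)
data["content_date"] = pd.to_datetime(data["content_date"])
data["application_date"] = pd.to_datetime(data["application_date"])

# Moderation time in hours: delay between content_date and application_date
data["moderation_time"] = (data["application_date"] - data["content_date"]).dt.total_seconds() / 3600

# Mean moderation time (in hours) per platform
average_moderation_time = data.groupby("platform_name")["moderation_time"].mean()
print(average_moderation_time)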


platform_name
Facebook     1798.1856
Instagram    2476.9512
LinkedIn      304.4256
Snapchat     6370.4712
TikTok        270.9648
X               0.0000
YouTube      9609.8304
Name: moderation_time, dtype: float64


Mean moderation times vary significantly across platforms. Note that the values are in hours, since the code divides the number of seconds by 3600. TikTok (about 271 hours, roughly 11 days) and LinkedIn (about 304 hours, roughly 13 days) are the fastest; for TikTok this is likely due to full automation. YouTube is much slower (about 9610 hours, roughly 400 days), possibly because of high content volume and manual reviews. X shows exactly 0 hours, meaning that content_date and application_date are identical in every record; since X reports no automated detection at all, this more likely reflects how X fills in the dates than truly instant moderation. LinkedIn is surprisingly quick despite mostly manual decisions, suggesting an efficient process. Overall, automation often seems to speed up moderation, but the effect varies by platform (LinkedIn's context may be different: less moderation may be needed, and decisions may be more evenly distributed).
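
Because a mean can be dominated by a few very old pieces of content, it may help to look at the median and the maximum as well, and to express delays in days. The following is a minimal sketch on a hypothetical toy frame (`toy`, platforms "A" and "B", and all dates are made up); it is not a result from the lab data.

```python
import pandas as pd

# Hypothetical toy frame: platform "A" has one year-old item among recent ones
toy = pd.DataFrame({
    "platform_name": ["A", "A", "A", "B", "B"],
    "content_date": ["2024-05-01", "2024-05-02", "2023-05-03", "2024-05-05", "2024-05-06"],
    "application_date": ["2024-05-01", "2024-05-03", "2024-05-02", "2024-05-05", "2024-05-06"],
})

# Convert the date strings to datetimes so they can be subtracted
for col in ["content_date", "application_date"]:
    toy[col] = pd.to_datetime(toy[col])

# Delay in whole days between content creation and application of the decision
toy["moderation_days"] = (toy["application_date"] - toy["content_date"]).dt.days

# Mean, median and maximum delay per platform
summary = toy.groupby("platform_name")["moderation_days"].agg(["mean", "median", "max"])

print(summary)
```

In this toy example a single old item pulls the mean of platform "A" far above its median, which is the kind of effect worth checking before comparing platforms on the mean alone.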