# Data cleaning

In this file, we clean the downloaded data.
The main steps are:
1. Adding party affiliation to tweet rows
2. Deleting unnecessary downloaded retweets
3. Deleting links and mentions from the tweet text and saving them to separate columns
4. Expanding the column of public metrics
5. Encoding emojis in a unified format
6. Translating tweets using Google Translate in Google Sheets
7. Saving all downloaded tweets to one file
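
As a rough illustration of step 3, the sketch below pulls links and mentions out of a tweet and returns the remaining text. The regular expressions, the helper name `split_links_and_mentions`, and the sample tweet with its `example.com` link are assumptions made for this example; the patterns actually used later in the notebook may differ.

```python
import re

# Assumed patterns, not taken from the notebook.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def split_links_and_mentions(text):
    """Return (clean_text, links, mentions) for a single tweet."""
    links = URL_PATTERN.findall(text)
    mentions = MENTION_PATTERN.findall(text)
    clean = URL_PATTERN.sub("", text)
    clean = MENTION_PATTERN.sub("", clean)
    # Collapse the whitespace left behind by the removals.
    clean = re.sub(r"\s+", " ", clean).strip()
    return clean, links, mentions


# Hypothetical sample tweet.
clean, links, mentions = split_links_and_mentions(
    "Premier @donaldtusk : namierzono https://example.com/abc123"
)
print(clean)
print(links)
print(mentions)
```

In a DataFrame, the three return values would typically go to three separate columns so that the original `text` stays untouched.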

### 1. Libraries used

In [2]:
import os
import pandas as pd
import re
import emoji

### 2. Reading JSON files and transforming them into party-specific pickle files

In [5]:
base_input_paths = ['data/PoWyborach', 'data/tweets_data_2022']
subfolders = ['Konfederacja', 'NL', 'PIS', 'PO', 'PL2050', 'PSL']
output_folder = 'data/tweets_data_combined'
os.makedirs(output_folder, exist_ok=True)  # to_pickle fails if the folder is missing

for subfolder in subfolders:
    dataframes = []
    for base_input_path in base_input_paths:
        folder_path = os.path.join(base_input_path, subfolder)
        for filename in os.listdir(folder_path):
            if filename.endswith('.json'):
                file_path = os.path.join(folder_path, filename)
                politician = filename.split("_tweets.json")[0]
                try:
                    df = pd.read_json(file_path)  
                    df["username"] = politician  
                    df["party"] = subfolder
                    print(f"Read {len(df)} rows from {file_path}")  
                    dataframes.append(df)
                except ValueError as e:
                    print(f"Error reading {file_path}: {e}")
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        
        output_file_path = os.path.join(output_folder, f'{subfolder}_combined.pkl')
        combined_df.to_pickle(output_file_path) 
        
        print(f"Saved {subfolder} combined data to {output_file_path}")

print("Processing complete!")

Read 320 rows from data/PoWyborach/Konfederacja/placzekgrzegorz_2024-04-16_2024-10-15.json
Read 597 rows from data/PoWyborach/Konfederacja/MichalWawer_2023-10-16_2024-10-15.json
Read 1318 rows from data/PoWyborach/Konfederacja/KonradBerkowicz_2024-04-16_2024-10-15_vol1 (1).json
Read 950 rows from data/PoWyborach/Konfederacja/Wlodek_Skalik_2023-10-16_2024-10-15.json
Read 721 rows from data/PoWyborach/Konfederacja/SlawomirMentzen_2023-10-16_2024-10-15.json
Read 889 rows from data/PoWyborach/Konfederacja/GrzegorzBraun__2023-10-16_2024-10-15.json
Read 175 rows from data/PoWyborach/Konfederacja/TudujKrzysztof_2023-10-16_2024-10-15.json
Read 964 rows from data/PoWyborach/Konfederacja/bartlomiejpejo_2023-10-16_2024-10-15.json
Read 421 rows from data/PoWyborach/Konfederacja/placzekgrzegorz_2023-10-16_2024-04-15.json
Read 772 rows from data/PoWyborach/Konfederacja/MarSypniewski_2023-10-16_2024-10-15.json
Read 289 rows from data/PoWyborach/Konfederacja/KonradBerkowicz_2023-10-15_2024-04-16_vol2 
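
Note that the files listed above do not end in `_tweets.json`, so `filename.split("_tweets.json")[0]` returns the whole filename, which is why the `username` column further down holds values such as `placzekgrzegorz_2024-04-16_2024-10-15.json`. If the bare account handle is wanted instead, one possible approach is sketched below. It assumes every filename follows the scheme `<handle>_<start date>_<end date>` plus an optional suffix, as the names in the log suggest; `handle_from_filename` is a hypothetical helper, not part of the notebook.

```python
import re

# Assumed naming scheme: <handle>_<YYYY-MM-DD>_<YYYY-MM-DD>[optional suffix].json
HANDLE_PATTERN = re.compile(r"^(?P<handle>.+?)_\d{4}-\d{2}-\d{2}_\d{4}-\d{2}-\d{2}")


def handle_from_filename(filename):
    """Return the account handle, or the filename without '.json' if the scheme does not match."""
    match = HANDLE_PATTERN.match(filename)
    if match:
        return match.group("handle")
    return filename.rsplit(".json", 1)[0]


for name in [
    "MichalWawer_2023-10-16_2024-10-15.json",
    "Wlodek_Skalik_2023-10-16_2024-10-15.json",
    "GrzegorzBraun__2023-10-16_2024-10-15.json",
    "KonradBerkowicz_2024-04-16_2024-10-15_vol1 (1).json",
]:
    print(name, "->", handle_from_filename(name))
```

The lazy `.+?` keeps underscores that belong to the handle itself (for example `Wlodek_Skalik`) and stops only at the first pair of dates.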

### 3. Data cleaning

In [6]:
df_konfederacja = pd.read_pickle(os.path.join(output_folder, 'Konfederacja_combined.pkl'))
df_NL = pd.read_pickle(os.path.join(output_folder, 'NL_combined.pkl'))
df_PIS = pd.read_pickle(os.path.join(output_folder, 'PIS_combined.pkl'))
df_PO = pd.read_pickle(os.path.join(output_folder, 'PO_combined.pkl'))
df_PL2050 = pd.read_pickle(os.path.join(output_folder, 'PL2050_combined.pkl'))
df_PSL = pd.read_pickle(os.path.join(output_folder, 'PSL_combined.pkl'))

In [7]:
df_konfederacja.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12651 entries, 0 to 12650
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          12651 non-null  object             
 1   reply_settings          12651 non-null  object             
 2   entities                11407 non-null  object             
 3   created_at              12651 non-null  datetime64[ns, UTC]
 4   attachments             6203 non-null   object             
 5   edit_controls           12651 non-null  object             
 6   author_id               12651 non-null  float64            
 7   edit_history_tweet_ids  12651 non-null  object             
 8   lang                    12651 non-null  object             
 9   possibly_sensitive      12651 non-null  object             
 10  id                      12651 non-null  float64            
 11  conversation_id         12651 non-null  f

In [8]:
df_konfederacja.head()

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
0,"{'retweet_count': 230, 'reply_count': 61, 'lik...",everyone,"{'urls': [{'start': 277, 'end': 300, 'url': 'h...",2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1.284852e+18,"[1846086770229694583, 1846086999964283214]",pl,False,1.846087e+18,1.846087e+18,❌ Rząd polski zamierza budować w Polsce 49 Cen...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
1,"{'retweet_count': 1301, 'reply_count': 169, 'l...",everyone,"{'urls': [{'start': 276, 'end': 299, 'url': 'h...",2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1.284852e+18,"[1845747336862961872, 1845748090461966651]",pl,False,1.845748e+18,1.845748e+18,❌ Szambo wybija i robi się coraz ciekawiej. Na...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
2,"{'retweet_count': 682, 'reply_count': 145, 'li...",everyone,"{'urls': [{'start': 279, 'end': 302, 'url': 'h...",2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1.284852e+18,[1845366606823657982],pl,False,1.845367e+18,1.845367e+18,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SPOS...",Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
3,"{'retweet_count': 271, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1.284852e+18,[1845006197847359885],pl,False,1.845006e+18,1.845006e+18,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ml...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
4,"{'retweet_count': 214, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1.284852e+18,[1844633149784891665],pl,False,1.844633e+18,1.844633e+18,❌ O CO TUTAJ CHODZI? W październiku 2024 r. sz...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,


In [9]:
# Merge all dataframes into one
df = pd.concat([df_konfederacja, df_NL, df_PIS, df_PO, df_PL2050, df_PSL], ignore_index=True)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52787 entries, 0 to 52786
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          52787 non-null  object             
 1   reply_settings          52787 non-null  object             
 2   entities                46839 non-null  object             
 3   created_at              52787 non-null  datetime64[ns, UTC]
 4   attachments             22966 non-null  object             
 5   edit_controls           52771 non-null  object             
 6   author_id               52787 non-null  float64            
 7   edit_history_tweet_ids  52771 non-null  object             
 8   lang                    52787 non-null  object             
 9   possibly_sensitive      52771 non-null  object             
 10  id                      52771 non-null  float64            
 11  conversation_id         52771 non-null  f

In [11]:
len(df)

52787

In [12]:
pd.options.display.float_format = '{:.0f}'.format
df['id'] = df['id'].fillna(0).astype('int64')
df['id']

0        1846086999964283136
1        1845748090461966592
2        1845366606823657984
3        1845006197847360000
4        1844633149784891648
                ...         
52782    1583494747536396288
52783    1583398153021104128
52784    1583360725958828032
52785    1581668593192034304
52786    1581668593192034304
Name: id, Length: 52787, dtype: int64
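
The integers printed above are not the exact tweet IDs. The column was stored as `float64`, which represents integers exactly only up to 2**53, so 19-digit IDs get rounded before the cast to `int64`. The sketch below reproduces this for the first row, using the exact ID visible in the `edit_history_tweet_ids` column of the `head()` output. It only demonstrates the rounding and does not fix it.

```python
# Exact ID as it appears in `edit_history_tweet_ids` for the first row.
exact_id = 1846086999964283214

# float64 has a 53-bit significand, so integers this large are rounded
# to the nearest representable value.
as_float = float(exact_id)
round_tripped = int(as_float)

print(round_tripped)              # the value stored in df['id'] for that row
print(round_tripped == exact_id)  # False: the last digits were lost
print(exact_id > 2**53)           # True: outside the exactly representable range
```

The rounded values still work for spotting duplicates within this dataset, but they should probably not be used to look tweets up again. Keeping IDs as strings or integers from the moment they are loaded would avoid the loss.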

In [13]:
# Get the value counts of 'id'
id_counts = df['id'].value_counts()

# Filter the counts to show only those greater than 1
id_counts_above_1 = id_counts[id_counts > 1]

# Display the counts
print(f"IDs with counts greater than 1:\n{id_counts_above_1}")

IDs with counts greater than 1:
id
0                      16
1780108572161945856     3
1581668593192034304     2
1590377683367522304     2
1690321544608718848     2
                       ..
1792117616804352256     2
1582569169882853376     2
1582320173482414080     2
1792169946169975296     2
1734231599116374272     2
Name: count, Length: 229, dtype: int64


In [14]:
id_counts_above_1.sum()

473

In [15]:
# Count unique IDs
non_duplicate_counts = df['id'].nunique()
print(f"Number of unique IDs: {non_duplicate_counts}")

# Count duplicate IDs
duplicate_counts = df['id'].duplicated().sum()
print(f"Number of duplicate IDs: {duplicate_counts}")

# Get the value counts of 'id'
id_counts = df['id'].value_counts()

# Filter the counts to show only those greater than 1
id_counts_above_1 = id_counts[id_counts > 1]

# Sum of counts of IDs that appear more than once
total_duplicate_rows = id_counts_above_1.sum()
print(f"Total number of duplicate rows based on 'id': {total_duplicate_rows}")

# Convert all columns to strings to avoid unhashable types
df_str = df.astype(str)

# Now check for exact duplicate rows across all columns
duplicates_all = df_str[df_str.duplicated(keep=False)]
print(f"Total duplicate rows (exact match across all columns): {duplicates_all.shape[0]}")
duplicates_all

Number of unique IDs: 52543
Number of duplicate IDs: 244
Total number of duplicate rows based on 'id': 473
Total duplicate rows (exact match across all columns): 304


Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
318,"{'retweet_count': 299, 'reply_count': 80, 'lik...",everyone,"{'urls': [{'start': 274, 'end': 297, 'url': 'h...",2024-04-16 05:24:08+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1.2848522205934141e+18,['1780104870302732443'],pl,False,1780104870302732544,1.7801048703027325e+18,❌ Od 1 września niemal 60 tys. ukraińskich 🇺🇦 ...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
319,"{'retweet_count': 299, 'reply_count': 80, 'lik...",everyone,"{'urls': [{'start': 274, 'end': 297, 'url': 'h...",2024-04-16 05:24:08+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1.2848522205934141e+18,['1780104870302732443'],pl,False,1780104870302732544,1.7801048703027325e+18,❌ Od 1 września niemal 60 tys. ukraińskich 🇺🇦 ...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
915,"{'retweet_count': 35, 'reply_count': 111, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2023-10-17 10:47:20+00:00,{'media_keys': ['3_1714231608536891392']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",9.417106438534472e+17,['1714231616065708419'],pl,False,1714231616065708544,1.7142316160657085e+18,Ponad 43 tysiące wyborców oddało swoje głosy n...,Original,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",,,MichalWawer_2023-10-16_2024-10-15.json,Konfederacja,
916,"{'retweet_count': 35, 'reply_count': 111, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2023-10-17 10:47:20+00:00,{'media_keys': ['3_1714231608536891392']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",9.417106438534472e+17,['1714231616065708419'],pl,False,1714231616065708544,1.7142316160657085e+18,Ponad 43 tysiące wyborców oddało swoje głosy n...,Original,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",,,MichalWawer_2023-10-16_2024-10-15.json,Konfederacja,
2233,"{'retweet_count': 5, 'reply_count': 1, 'like_c...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-04-16 05:38:51+00:00,{'media_keys': ['3_1780108567447560193']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1420353350.0,['1780108572161945969'],pl,False,1780108572161945856,1.7801085721619459e+18,❗SPOTKANIE CZŁONKÓW I SYMPATYKÓW KLUBU KONFEDE...,Original,,,,KonradBerkowicz_2024-04-16_2024-10-15_vol1 (1)...,Konfederacja,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52615,"{'retweet_count': 9, 'reply_count': 1, 'like_c...",everyone,"{'urls': [{'start': 231, 'end': 254, 'url': 'h...",2023-07-03 21:13:30+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276.0,['1675976082812461056'],pl,0.0,1675976082812461056,1.675976082812461e+18,10 lat ma Europejska Fundacja na rzecz demokra...,Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52766,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2022-11-15 22:10:59+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",1119834276.0,['1592641337475870720'],qam,0.0,1592641337475870720,1.5926413374758707e+18,RT @KleszczDaniel: @KosiniakKamysz @nowePSL,Retweet,,,"[{'type': 'retweeted', 'id': '1592605909196632...",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52767,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2022-11-15 22:10:59+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",1119834276.0,['1592641337475870720'],qam,0.0,1592641337475870720,1.5926413374758707e+18,RT @KleszczDaniel: @KosiniakKamysz @nowePSL,Retweet,,,"[{'type': 'retweeted', 'id': '1592605909196632...",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52785,"{'retweet_count': 2, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 49, 'end': 72, 'url': 'htt...",2022-10-16 15:29:13+00:00,{'media_keys': ['3_1581668584501149697']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276.0,['1581668593192034304'],pl,0.0,1581668593192034304,1.5816685931920343e+18,16 październik 1978 rok pamiętamy #DzieńPapies...,Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
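
The two ID-based figures above measure different things. `duplicated().sum()` counts only the surplus copies (244), whereas summing the value counts above 1 counts every row in a repeated group (473). The difference is the number of distinct repeated IDs: 473 - 229 = 244. The toy example below, with made-up IDs, shows the same relationship.

```python
import pandas as pd

# Toy IDs (made up): 'a' appears three times, 'b' twice, 'c' once.
ids = pd.Series(["a", "a", "a", "b", "b", "c"], name="id")

counts = ids.value_counts()
repeated = counts[counts > 1]

# Every row whose id occurs more than once.
rows_in_repeated_groups = int(repeated.sum())
# Default keep='first': every row except the first one of each id.
surplus_rows = int(ids.duplicated().sum())
# Number of distinct ids that occur more than once.
distinct_repeated_ids = len(repeated)

print(rows_in_repeated_groups, surplus_rows, distinct_repeated_ids)
print(rows_in_repeated_groups - distinct_repeated_ids == surplus_rows)
```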


A brief look at what these duplicates look like:

In [16]:
df[df['id'].duplicated(keep=False)].sort_values(by='id')


Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
35149,"{'retweet_count': 18, 'reply_count': 31, 'like...",everyone,,2023-10-16 22:27:00+00:00,,,61552404,,pl,,0,,@BMikolajewska odpowie💪,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35154,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 23:18:00+00:00,,,61552404,,pl,,0,,@BernadettaK2 @otoWojciech @BMikolajewska 😂😂😂,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35153,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 23:18:00+00:00,,,61552404,,pl,,0,,@3Fritz1 @BMikolajewska 🫡,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
48543,"{'retweet_count': 2, 'reply_count': 1, 'like_c...",everyone,,2024-10-17 18:49:00+00:00,,,1119834276,,pl,,0,,Dziękuję 🤝😀🍀,Quote,,,"[{'type': 'quoted', 'id': '1714320722481615223'}]",GrzybAndrzej_2023-10-16_2024-10-15.json,PSL,
48544,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-18 12:46:00+00:00,,,61552404,,pl,,0,,@Maciej_ENZ0 Dziękuję i pozdrawiam,Reply,,,,GrzybAndrzej_2023-10-16_2024-10-15.json,PSL,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49818,"{'retweet_count': 15, 'reply_count': 2, 'like_...",everyone,"{'mentions': [{'start': 30, 'end': 42, 'userna...",2024-09-13 08:23:40+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",964017524,[1834508231009325542],pl,0,1834508231009325568,1834508231009325568,W ramach roboczego kontaktu z @WodyPolskie ora...,Original,"[{'domain': {'id': '11', 'name': 'Sport', 'des...",,,DariuszKlimczak_2023-10-16_2024-10-15.json,PSL,
34812,"{'retweet_count': 567, 'reply_count': 221, 'li...",everyone,"{'mentions': [{'start': 8, 'end': 19, 'usernam...",2024-09-19 06:31:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",52367150,[1836654233296244914],pl,False,1836654233296244992,1836654233296244992,Premier @donaldtusk : namierzono człowieka prz...,Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,CTomczyk_2023-10-16_2024-10-15_GUWNOMAMYZNIMPR...,PO,
34811,"{'retweet_count': 568, 'reply_count': 221, 'li...",everyone,"{'mentions': [{'start': 8, 'end': 19, 'usernam...",2024-09-19 06:31:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",52367150,[1836654233296244914],pl,False,1836654233296244992,1836654233296244992,Premier @donaldtusk : namierzono człowieka prz...,Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,CTomczyk_2023-10-16_2024-10-15_GUWNOMAMYZNIMPR...,PO,
26115,"{'retweet_count': 141, 'reply_count': 125, 'li...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-09-24 14:22:59+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",138048156,[1838584923688444342],pl,False,1838584923688444416,1838584923688444416,Potwierdza się to o czym mówiliśmy już od dawn...,Quote,,,"[{'type': 'quoted', 'id': '1838307356918071625'}]",mblaszczak_2023-10-16_2024-10-15 (1).json,PIS,


In [17]:
for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, dict)).any():
        print(f"Column '{col}' contains dictionaries.")
    elif df[col].apply(lambda x: isinstance(x, list)).any():
        print(f"Column '{col}' contains lists.")

Column 'public_metrics' contains dictionaries.
Column 'entities' contains dictionaries.
Column 'attachments' contains dictionaries.
Column 'edit_controls' contains dictionaries.
Column 'edit_history_tweet_ids' contains lists.
Column 'context_annotations' contains lists.
Column 'referenced_tweets' contains lists.
Column 'geo' contains dictionaries.


In [18]:
# Get all duplicate IDs
duplicate_ids = df[df['id'].duplicated(keep=False)]

# Exclude columns with unhashable (dict-like) values
columns_to_exclude = ['edit_controls', 'public_metrics', 'attachments', 'entities', 'geo', 'edit_history_tweet_ids', 'context_annotations','referenced_tweets']
valid_columns = [col for col in df.columns if col not in columns_to_exclude]

# Find differences across valid columns
diff_summary = duplicate_ids[valid_columns].groupby('id').nunique()

# Show columns where duplicates have different values
diff_summary = diff_summary[(diff_summary > 1).any(axis=1)]

In [19]:
diff_summary

id,reply_settings,created_at,author_id,lang,possibly_sensitive,conversation_id,text,category,in_reply_to_user_id,username,party
0,1,9,2,1,0,0,16,2,0,2,2
1780108572161945856,1,1,1,1,1,1,1,1,0,2,1
1780146130551996672,1,1,1,1,1,1,1,1,1,2,1
1780171152914034944,1,1,1,1,1,1,1,1,1,2,1
1780244011212485120,1,1,1,1,1,1,1,1,0,2,1
1780309557610258688,1,1,1,1,1,1,1,1,0,2,1
1780345695163047936,1,1,1,1,1,1,1,1,1,2,1
1780345829615636480,1,1,1,1,1,1,1,1,1,2,1
1780346025854603264,1,1,1,1,1,1,1,1,1,2,1
1780346371888833024,1,1,1,1,1,1,1,1,1,2,1
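
When reading this table, keep in mind that `nunique()` ignores missing values by default. A 0 therefore means that every copy has `NaN` in that column, not that the copies disagree, and a 2 under `username` means the same tweet was loaded from two different files. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up rows: id 1 was loaded from two files, id 2 twice from the same file.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "username": ["file_a.json", "file_b.json", "file_c.json", "file_c.json"],
    "in_reply_to_user_id": [np.nan, np.nan, 7.0, 7.0],
})

# Number of distinct non-missing values per column within each id.
summary = toy.groupby("id").nunique()
print(summary)
```

Passing `dropna=False` to `nunique()` would count `NaN` as a value of its own, which may be easier to interpret for sparsely filled columns.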


In [20]:
duplicates = df[df.duplicated(subset=['id'], keep=False)]
duplicates

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
318,"{'retweet_count': 299, 'reply_count': 80, 'lik...",everyone,"{'urls': [{'start': 274, 'end': 297, 'url': 'h...",2024-04-16 05:24:08+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1780104870302732443],pl,False,1780104870302732544,1780104870302732544,❌ Od 1 września niemal 60 tys. ukraińskich 🇺🇦 ...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
319,"{'retweet_count': 299, 'reply_count': 80, 'lik...",everyone,"{'urls': [{'start': 274, 'end': 297, 'url': 'h...",2024-04-16 05:24:08+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1780104870302732443],pl,False,1780104870302732544,1780104870302732544,❌ Od 1 września niemal 60 tys. ukraińskich 🇺🇦 ...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
608,"{'retweet_count': 55, 'reply_count': 46, 'like...",everyone,"{'mentions': [{'start': 133, 'end': 142, 'user...",2024-04-11 12:31:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",941710643853447168,[1778400392470057027],pl,False,1778400392470056960,1778400392470056960,PiS właśnie złożył w Sejmie wniosek o odrzucen...,Original,,,,MichalWawer_2023-10-16_2024-10-15.json,Konfederacja,
609,"{'retweet_count': 55, 'reply_count': 46, 'like...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-04-11 12:31:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",941710643853447168,[1778400392470057027],pl,False,1778400392470056960,1778400392470056960,PiS właśnie złożył w Sejmie wniosek o odrzucen...,Original,,,,MichalWawer_2023-10-16_2024-10-15.json,Konfederacja,
915,"{'retweet_count': 35, 'reply_count': 111, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2023-10-17 10:47:20+00:00,{'media_keys': ['3_1714231608536891392']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",941710643853447168,[1714231616065708419],pl,False,1714231616065708544,1714231616065708544,Ponad 43 tysiące wyborców oddało swoje głosy n...,Original,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",,,MichalWawer_2023-10-16_2024-10-15.json,Konfederacja,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52718,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 90, 'end': 113, 'url': 'ht...",2023-02-25 00:45:56+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",1119834276,[1629281505804468226],en,0,1629281505804468224,1629281505804468224,RT @bpaskal: The view from the highest tower o...,Retweet,,,"[{'type': 'retweeted', 'id': '1629218162175623...",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52766,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2022-11-15 22:10:59+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",1119834276,[1592641337475870720],qam,0,1592641337475870720,1592641337475870720,RT @KleszczDaniel: @KosiniakKamysz @nowePSL,Retweet,,,"[{'type': 'retweeted', 'id': '1592605909196632...",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52767,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2022-11-15 22:10:59+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",1119834276,[1592641337475870720],qam,0,1592641337475870720,1592641337475870720,RT @KleszczDaniel: @KosiniakKamysz @nowePSL,Retweet,,,"[{'type': 'retweeted', 'id': '1592605909196632...",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52785,"{'retweet_count': 2, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 49, 'end': 72, 'url': 'htt...",2022-10-16 15:29:13+00:00,{'media_keys': ['3_1581668584501149697']},"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1581668593192034304],pl,0,1581668593192034304,1581668593192034304,16 październik 1978 rok pamiętamy #DzieńPapies...,Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,


In [21]:
duplicate_text_count = df['text'].duplicated().sum()
print(f"Number of duplicate Text Entries: {duplicate_text_count}")

Number of duplicate Text Entries: 320


In [22]:
duplicate_id_text_rows = df[df.duplicated(subset=['id', 'text'], keep=False)]
print(f"Rows where BOTH `id` and `text` are duplicated: {len(duplicate_id_text_rows)}")

Rows where BOTH `id` and `text` are duplicated: 457


In [23]:
# Count occurrences of each ID
id_counts = df['id'].value_counts()
print("Distribution of duplicate IDs:")
print(id_counts.value_counts().sort_index())

# Count occurrences of each text
text_counts = df['text'].value_counts()
print("\nDistribution of duplicate Text Entries:")
print(text_counts.value_counts().sort_index())

Distribution of duplicate IDs:
count
1     52314
2       227
3         1
16        1
Name: count, dtype: int64

Distribution of duplicate Text Entries:
count
1    52164
2      291
3        9
4        2
6        1
Name: count, dtype: int64


In [24]:
# Get all duplicate ID rows
duplicate_id_rows = df[df.duplicated(subset=['id'], keep=False)]

# Get all duplicate Text rows
duplicate_text_rows = df[df.duplicated(subset=['text'], keep=False)]

# Get rows where both ID and Text are duplicated
duplicate_id_text_rows = df[df.duplicated(subset=['id', 'text'], keep=False)]

# Compare overlaps
print(f"Rows where ID is duplicated: {len(duplicate_id_rows)}")
print(f"Rows where Text is duplicated: {len(duplicate_text_rows)}")
print(f"Rows where BOTH ID and Text are duplicated: {len(duplicate_id_text_rows)}")

# Find duplicate IDs that are NOT in the text duplicate set
id_not_in_text = duplicate_id_rows[~duplicate_id_rows['id'].isin(duplicate_text_rows['id'])]
print(f"\nDuplicate IDs NOT duplicated in Text: {len(id_not_in_text)}")

# Find duplicate Texts that are NOT in the ID duplicate set
text_not_in_id = duplicate_text_rows[~duplicate_text_rows['text'].isin(duplicate_id_rows['text'])]
print(f"Duplicate Texts NOT duplicated in ID: {len(text_not_in_id)}")


Rows where ID is duplicated: 473
Rows where Text is duplicated: 623
Rows where BOTH ID and Text are duplicated: 457

Duplicate IDs NOT duplicated in Text: 16
Duplicate Texts NOT duplicated in ID: 165


In [25]:
empty_id_rows = df[df['id'].isna()]
print(f"Rows where `id` is empty (NaN): {len(empty_id_rows)}")
#empty_id_rows

zero_id_rows = df[df['id'] == 0]
print(f"Rows where `id` is 0: {len(zero_id_rows)}")
zero_id_rows

Rows where `id` is empty (NaN): 0
Rows where `id` is 0: 16


Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
35141,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,@tomekbit ✌️,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35142,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,"@MaciejGdynia Maćku, czekam na oficjalne wynik...",Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35143,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,"@MCichonAlicja Alu, czekamy jeszcze na wynik?",Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35144,"{'retweet_count': 1, 'reply_count': 1, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,@REL_76 🥰🥰🥰,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35145,"{'retweet_count': 1, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,@Gidziela 🥰✌️,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35146,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,@WHaptar Gratulacje👏🥂,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35147,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 00:00:00+00:00,,,61552404,,pl,,0,,@KapenGenezyp Dziękuję❤️❤️❤️,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35148,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 21:57:00+00:00,,,61552404,,pl,,0,,@jasinska_e ❤️,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35149,"{'retweet_count': 18, 'reply_count': 31, 'like...",everyone,,2023-10-16 22:27:00+00:00,,,61552404,,pl,,0,,@BMikolajewska odpowie💪,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,
35150,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",everyone,,2023-10-16 22:41:00+00:00,,,61552404,,pl,,0,,@DorotaNiedziela ja Tobie też❣️,Reply,,,,Leszczyna_2023-10-16_2023-12-31.json,PO,


In [26]:
tweets_by_author = df[df['author_id'] == 61552404.0].sort_values(by='created_at')
display(tweets_by_author)

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
42688,"{'retweet_count': 482, 'reply_count': 0, 'like...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2023-09-09 15:22:42+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1700530175337738371],pl,False,1700530175337738496,1700530175337738496,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,Retweet,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,"[{'type': 'retweeted', 'id': '1700489605026340...",Leszczyna_2022-10-16_2023-10-15.json,PO,
42689,"{'retweet_count': 482, 'reply_count': 0, 'like...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2023-09-09 15:22:42+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1700530175337738371],pl,False,1700530175337738496,1700530175337738496,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,Retweet,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,"[{'type': 'retweeted', 'id': '1700489605026340...",Leszczyna_2022-10-16_2023-10-15.json,PO,
42687,"{'retweet_count': 376, 'reply_count': 0, 'like...",everyone,"{'hashtags': [{'start': 114, 'end': 127, 'tag'...",2023-09-09 15:23:08+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1700530281784971545],pl,False,1700530281784971520,1700530281784971520,"RT @MariaLe85219860: B U M‼️\n\nDuda, Ziobro, ...",Retweet,,,"[{'type': 'retweeted', 'id': '1700475689600696...",Leszczyna_2022-10-16_2023-10-15.json,PO,
42686,"{'retweet_count': 144, 'reply_count': 0, 'like...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2023-09-09 15:23:20+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1700530331500085744],pl,False,1700530331500085760,1700530331500085760,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,Retweet,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,"[{'type': 'retweeted', 'id': '1700482994564186...",Leszczyna_2022-10-16_2023-10-15.json,PO,
42685,"{'retweet_count': 167, 'reply_count': 0, 'like...",everyone,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",2023-09-09 15:23:36+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1700530402610364779],pl,False,1700530402610364672,1700530402610364672,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,Retweet,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,"[{'type': 'retweeted', 'id': '1700463357575221...",Leszczyna_2022-10-16_2023-10-15.json,PO,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33674,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",everyone,"{'urls': [{'start': 85, 'end': 108, 'url': 'ht...",2024-10-05 10:35:26+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1842513925612372262],pl,False,1842513925612372224,1842146757360169472,"@arekpisarski @MZ_GOV_PL @NFZ_GOV_PL tak, @Rze...",Reply,,713836895495696384,"[{'type': 'replied_to', 'id': '184214897941436...",Leszczyna_2024-04-01_2024-10-15.json,PO,
33673,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",everyone,"{'urls': [{'start': 303, 'end': 326, 'url': 'h...",2024-10-05 10:42:14+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1842515633826627766],pl,False,1842515633826627840,1842146757360169472,"@ewa_esse @MZ_GOV_PL od 15. października, ale ...",Reply,,1439610920175558656,"[{'type': 'replied_to', 'id': '184216083786941...",Leszczyna_2024-04-01_2024-10-15.json,PO,
33672,"{'retweet_count': 47, 'reply_count': 116, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-10 05:48:35+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",61552404,"[1844253222103380157, 1844253674584916327]",pl,False,1844253674584916224,1844253674584916224,Ustawa o wychowaniu w trzeźwości trafi dopiero...,Quote,,,"[{'type': 'quoted', 'id': '1844091771900526777'}]",Leszczyna_2024-04-01_2024-10-15.json,PO,
33671,"{'retweet_count': 0, 'reply_count': 5, 'like_c...",everyone,"{'urls': [{'start': 296, 'end': 319, 'url': 'h...",2024-10-10 16:17:02+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",61552404,[1844411831844036750],pl,False,1844411831844036864,1844253674584916224,@Mariusz32382943 nad zmianami systemowymi prac...,Reply,,1217929501063155712,"[{'type': 'replied_to', 'id': '184427203001280...",Leszczyna_2024-04-01_2024-10-15.json,PO,


A tweet ID of 0 may mean that the tweet was changed or deleted, or that the item is not a tweet at all and was wrongly categorized as one.

We need to remove duplicate tweets and delete the tweets whose ID is 0. Duplicates occur because our custom download loop occasionally fetches the same tweet twice to ensure completeness.

In [27]:
import pandas as pd
import numpy as np

# 1) Copy the original DataFrame before cleaning
df_before = df.copy()

# 2) Get the initial size
initial_size = len(df_before)
print(f"Initial number of tweets: {initial_size}")

# 3) Check and report issues with the 'id' column
print("\n--- ID QUALITY CHECK ---")

# Flag missing IDs before the string conversion: astype(str) would turn NaN into the text 'nan'
missing_ids = df_before['id'].isna()

# Convert id to string for consistent checking
df_before['id'] = df_before['id'].astype(str)

# Check for the remaining problems on the string form
empty_ids = df_before['id'] == ''
zero_ids = df_before['id'] == '0'
very_short_ids = df_before['id'].str.len() < 5  # Twitter IDs are typically longer

# Report on ID issues
print(f"Missing IDs (NaN): {missing_ids.sum()} ({missing_ids.mean():.2%})")
print(f"Empty IDs: {empty_ids.sum()} ({empty_ids.mean():.2%})")
print(f"Zero IDs ('0'): {zero_ids.sum()} ({zero_ids.mean():.2%})")
print(f"Very short IDs (< 5 chars): {very_short_ids.sum()} ({very_short_ids.mean():.2%})")

# Create a mask for all problematic IDs
problematic_ids_mask = missing_ids | empty_ids | zero_ids | very_short_ids

# Report total problematic IDs
print(f"Total problematic IDs: {problematic_ids_mask.sum()} ({problematic_ids_mask.mean():.2%})")

# 4) First filter out problematic IDs from the original dataset
df_no_problems = df_before[~problematic_ids_mask].copy()
problematic_removed = initial_size - len(df_no_problems)

# 5) Then remove duplicates from the dataset without problematic IDs
df_after = df_no_problems.drop_duplicates(subset=['id'])
duplicates_removed = len(df_no_problems) - len(df_after)

# 6) Calculate removed counts and percentages
remaining_final = len(df_after)
total_removed = initial_size - remaining_final

duplicate_percentage = (duplicates_removed / initial_size) * 100
problematic_percentage = (problematic_removed / initial_size) * 100
total_removed_percentage = (total_removed / initial_size) * 100
remaining_percentage = (remaining_final / initial_size) * 100

# 7) Print comprehensive results
print("\n--- CLEANING SUMMARY ---")
print(f"Initial tweets: {initial_size}")
print(f"Problematic ID tweets removed: {problematic_removed} ({problematic_percentage:.2f}%)")
print(f"Duplicate tweets removed: {duplicates_removed} ({duplicate_percentage:.2f}%)")
print(f"Total tweets removed: {total_removed} ({total_removed_percentage:.2f}%)")
print(f"Tweets remaining: {remaining_final} ({remaining_percentage:.2f}%)")

# 8) Show sample of problematic IDs
if problematic_ids_mask.sum() > 0:
    print("\nSample of problematic IDs:")
    sample_problematic = df_before[problematic_ids_mask].head(5)
    for i, (idx, row) in enumerate(sample_problematic.iterrows()):
        print(f"  {i+1}. ID: '{row['id']}', Text: '{row['text'][:50]}...'")

# 9) Identify the actual duplicate IDs from the data without problematic IDs
duplicate_ids = df_no_problems[df_no_problems.duplicated(subset=['id'], keep='first')]['id'].unique().tolist()
print(f"\nNumber of unique duplicate IDs: {len(duplicate_ids)}")
if duplicate_ids:
    print("Sample of duplicate IDs (first 5):")
    for i, dup_id in enumerate(duplicate_ids[:5]):
        print(f"  {i+1}. {dup_id}")
else:
    print("No duplicates found")

# 10) Keep df_after as the new df
df = df_after
print(f"\nFinal clean dataframe shape: {df.shape}")

# 11) Verify no problematic IDs remain
if (df['id'] == '0').sum() > 0 or df['id'].isna().sum() > 0 or (df['id'] == '').sum() > 0 or (df['id'].str.len() < 5).sum() > 0:
    print("WARNING: Some problematic IDs still remain in the cleaned dataframe")
else:
    print("SUCCESS: All problematic IDs have been removed")

Initial number of tweets: 52787

--- ID QUALITY CHECK ---
Missing IDs (NaN): 0 (0.00%)
Empty IDs: 0 (0.00%)
Zero IDs ('0'): 16 (0.03%)
Very short IDs (< 5 chars): 16 (0.03%)
Total problematic IDs: 16 (0.03%)

--- CLEANING SUMMARY ---
Initial tweets: 52787
Problematic ID tweets removed: 16 (0.03%)
Duplicate tweets removed: 229 (0.43%)
Total tweets removed: 245 (0.46%)
Tweets remaining: 52542 (99.54%)

Sample of problematic IDs:
  1. ID: '0', Text: '@tomekbit ✌️...'
  2. ID: '0', Text: '@MaciejGdynia Maćku, czekam na oficjalne wyniki, ż...'
  3. ID: '0', Text: '@MCichonAlicja Alu, czekamy jeszcze na wynik?...'
  4. ID: '0', Text: '@REL_76 🥰🥰🥰...'
  5. ID: '0', Text: '@Gidziela 🥰✌️...'

Number of unique duplicate IDs: 228
Sample of duplicate IDs (first 5):
  1. 1780104870302732544
  2. 1778400392470056960
  3. 1714231616065708544
  4. 1794299341743829248
  5. 1780108572161945856

Final clean dataframe shape: (52542, 20)
SUCCESS: All problematic IDs have been removed
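
The checks in the cell above can be condensed into a small helper. This is a minimal sketch on a toy frame, not the code that produced the numbers above; the function name `drop_bad_and_duplicate_ids`, the `min_len` parameter, and the sample rows are hypothetical.

```python
import pandas as pd

def drop_bad_and_duplicate_ids(frame: pd.DataFrame, min_len: int = 5) -> pd.DataFrame:
    """Remove rows with missing, empty, zero, or very short IDs, then de-duplicate on 'id'."""
    # Detect missing values before casting, because astype(str) hides them
    missing = frame['id'].isna()
    ids = frame['id'].astype(str)
    bad = missing | (ids == '') | (ids == '0') | (ids.str.len() < min_len)
    cleaned = frame.loc[~bad].copy()
    cleaned['id'] = ids[~bad]
    # Keep the first occurrence of every remaining ID
    return cleaned.drop_duplicates(subset=['id'], keep='first')

# Hypothetical sample: one duplicate, one zero ID, one missing ID
toy = pd.DataFrame({
    'id': ['1780104870302732544', '1780104870302732544', '0', None, '1778400392470056960'],
    'text': ['a', 'a (downloaded twice)', 'placeholder', 'missing id', 'b'],
})
result = drop_bad_and_duplicate_ids(toy)
print(result)
```

Filtering the problematic IDs first and de-duplicating second mirrors the order used in the notebook, so the sixteen zero-ID rows are not counted as duplicates of each other.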


In [28]:
# 1) How many total rows have a duplicate 'id' (including the first occurrence)?
total_dup_rows = df.duplicated(subset=['id'], keep=False).sum()
print(f"Total rows that share a duplicate ID (including the first occurrence): {total_dup_rows}")

# 2) How many rows are "extra" duplicates beyond the first?
extra_dup_rows = df.duplicated(subset=['id'], keep='first').sum()
print(f"Number of extra duplicates beyond the first occurrence: {extra_dup_rows}")

# 3) How many unique IDs appear more than once?
duplicate_ids = df[df.duplicated(subset=['id'], keep=False)]['id'].unique()
num_duplicate_ids = len(duplicate_ids)
print(f"Number of unique IDs that are duplicated: {num_duplicate_ids}")

Total rows that share a duplicate ID (including the first occurrence): 0
Number of extra duplicates beyond the first occurrence: 0
Number of unique IDs that are duplicated: 0


In [29]:
df.head()

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
0,"{'retweet_count': 230, 'reply_count': 61, 'lik...",everyone,"{'urls': [{'start': 277, 'end': 300, 'url': 'h...",2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,❌ Rząd polski zamierza budować w Polsce 49 Cen...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
1,"{'retweet_count': 1301, 'reply_count': 169, 'l...",everyone,"{'urls': [{'start': 276, 'end': 299, 'url': 'h...",2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,❌ Szambo wybija i robi się coraz ciekawiej. Na...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
2,"{'retweet_count': 682, 'reply_count': 145, 'li...",everyone,"{'urls': [{'start': 279, 'end': 302, 'url': 'h...",2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SPOS...",Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
3,"{'retweet_count': 271, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ml...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
4,"{'retweet_count': 214, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,❌ O CO TUTAJ CHODZI? W październiku 2024 r. sz...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,


In [30]:
# Get the value counts of the 'category' column
category_counts = df['category'].value_counts()

# Display the counts
print(category_counts)

# Get the number of unique categories
unique_category_count = category_counts.count()
print(f"Number of unique categories: {unique_category_count}")

category
Original    32794
Reply       10790
Quote        5478
Retweet      3480
Name: count, dtype: int64
Number of unique categories: 4


Retweets are removed because the X API does not return them correctly. Only original tweets, replies, and quotes are analyzed.

In [31]:
df = df[df['category'] != 'Retweet']
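
A filtered frame like this one is a slice of the original, and later column assignments on it can trigger pandas' `SettingWithCopyWarning`. The sketch below shows one way to avoid that by taking an explicit copy. It uses hypothetical toy data, and the `keep` list is an assumption based on the categories printed earlier.

```python
import pandas as pd

# Hypothetical toy data with the four categories seen in the notebook
toy = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'category': ['Original', 'Reply', 'Quote', 'Retweet'],
})

keep = ['Original', 'Reply', 'Quote']
# isin() names the categories to keep explicitly; .copy() makes the result an
# independent frame, so later column assignments do not act on a slice
filtered = toy[toy['category'].isin(keep)].copy()
filtered['is_reply'] = filtered['category'] == 'Reply'
print(filtered['category'].value_counts())
```

Note that `isin(keep)` also drops rows whose category is missing, whereas `!= 'Retweet'` keeps them. In the data shown here every row has one of the four categories, so both forms select the same rows.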

In [32]:
df

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
0,"{'retweet_count': 230, 'reply_count': 61, 'lik...",everyone,"{'urls': [{'start': 277, 'end': 300, 'url': 'h...",2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,❌ Rząd polski zamierza budować w Polsce 49 Cen...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
1,"{'retweet_count': 1301, 'reply_count': 169, 'l...",everyone,"{'urls': [{'start': 276, 'end': 299, 'url': 'h...",2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,❌ Szambo wybija i robi się coraz ciekawiej. Na...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
2,"{'retweet_count': 682, 'reply_count': 145, 'li...",everyone,"{'urls': [{'start': 279, 'end': 302, 'url': 'h...",2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SPOS...",Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
3,"{'retweet_count': 271, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ml...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
4,"{'retweet_count': 214, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,❌ O CO TUTAJ CHODZI? W październiku 2024 r. sz...,Original,,,,placzekgrzegorz_2024-04-16_2024-10-15.json,Konfederacja,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52770,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 120, 'end': 143, 'url': 'h...",2022-11-04 17:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588577342594482176],pl,0,1588577342594482176,1588577342594482176,Rozporządzenie o zakazie rejestracji nowych sa...,Quote,"[{'domain': {'id': '46', 'name': 'Business Tax...",,"[{'type': 'quoted', 'id': '1586802685587738629'}]",GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52774,"{'retweet_count': 1, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 163, 'end': 186, 'url': 'h...",2022-11-04 07:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588426347646533634],pl,0,1588426347646533632,1588426347646533632,Wojna gazowa Putina. Europa przełamała szantaż...,Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52779,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 243, 'end': 266, 'url': 'h...",2022-10-29 12:23:28+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1586332889444237312],pl,0,1586332889444237312,1586332889444237312,"Naukowcy wiedzą jak sprawić, by OZE nie zależa...",Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,
52784,"{'retweet_count': 4, 'reply_count': 0, 'like_c...",everyone,"{'urls': [{'start': 117, 'end': 140, 'url': 'h...",2022-10-21 07:33:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1583360725958828034],pl,0,1583360725958828032,1583360725958828032,Chaos na rynku energii. Zrywane umowy i zamroż...,Original,,,,GrzybAndrzej_2022-10-16_2023-10-15.json,PSL,


In [33]:
# Keep only the part of 'username' before the first '_2', i.e. drop the date-range suffix of the file name

df.loc[:, 'username'] = df['username'].str.split('_2').str[0]
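
Splitting on `'_2'` relies on no account name containing that substring. A stricter alternative is to remove only a trailing date range. The following is a sketch under the assumption that every file name ends in `_YYYY-MM-DD_YYYY-MM-DD.json`, as the names shown in this notebook do; `DATE_SUFFIX` and `strip_date_suffix` are hypothetical names.

```python
import re

# Matches a trailing '_YYYY-MM-DD_YYYY-MM-DD.json' date range
# (pattern inferred from the file names shown in the notebook)
DATE_SUFFIX = re.compile(r'_\d{4}-\d{2}-\d{2}_\d{4}-\d{2}-\d{2}\.json$')

def strip_date_suffix(filename: str) -> str:
    """Return the account name without the date-range suffix."""
    return DATE_SUFFIX.sub('', filename)

print(strip_date_suffix('placzekgrzegorz_2024-04-16_2024-10-15.json'))
# A hypothetical handle containing '_2' survives intact, unlike with split('_2')
print(strip_date_suffix('some_2nd_account_2022-10-16_2023-10-15.json'))
```

On a whole column the same pattern could be applied with `df['username'].str.replace(DATE_SUFFIX.pattern, '', regex=True)`.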

In [34]:
category_summary = df['category'].value_counts()
print(category_summary)
total_tweets = category_summary.sum()
print(f"Total tweets: {total_tweets}")

category
Original    32794
Reply       10790
Quote        5478
Name: count, dtype: int64
Total tweets: 49062


In [35]:
# Ensure the created_at column is in datetime format

#df['created_at'] = pd.to_datetime(df['created_at'])
df.loc[:, 'created_at'] = pd.to_datetime(df['created_at'])

In [36]:
df.head()

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,text,category,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo
0,"{'retweet_count': 230, 'reply_count': 61, 'lik...",everyone,"{'urls': [{'start': 277, 'end': 300, 'url': 'h...",2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,❌ Rząd polski zamierza budować w Polsce 49 Cen...,Original,,,,placzekgrzegorz,Konfederacja,
1,"{'retweet_count': 1301, 'reply_count': 169, 'l...",everyone,"{'urls': [{'start': 276, 'end': 299, 'url': 'h...",2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,❌ Szambo wybija i robi się coraz ciekawiej. Na...,Original,,,,placzekgrzegorz,Konfederacja,
2,"{'retweet_count': 682, 'reply_count': 145, 'li...",everyone,"{'urls': [{'start': 279, 'end': 302, 'url': 'h...",2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SPOS...",Original,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,placzekgrzegorz,Konfederacja,
3,"{'retweet_count': 271, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ml...,Original,,,,placzekgrzegorz,Konfederacja,
4,"{'retweet_count': 214, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,❌ O CO TUTAJ CHODZI? W październiku 2024 r. sz...,Original,,,,placzekgrzegorz,Konfederacja,


In [37]:
df.loc[1, 'text']

'❌ Szambo wybija i robi się coraz ciekawiej. Na światło dzienne wychodzą bowiem coraz to nowe fakty. Otóż wytyczne funkcjonowania 49 Centrów Integracji Cudzoziemców (CIC) przewidują dla cudzoziemców w całej Polsce między innymi… zatrudnianie OSOBISTYCH ASYSTENTÓW w urzędach,… https://t.co/OZyTSwcMAb'

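Emoji handling: each emoji is padded with spaces so that it stands as a separate token, then mentions, links, and hashtags are extracted into their own columns.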
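
Before the step is applied to the whole column, the extraction can be illustrated on a single string. This is a simplified sketch using only the standard library: the function name `split_tweet` and the sample text are hypothetical, and the whitespace collapsing at the end is an addition that the notebook cell does not perform.

```python
import re

def split_tweet(text: str) -> dict:
    """Separate mentions, links, and hashtags from the tweet body."""
    mentions = re.findall(r'@\w+', text)
    links = re.findall(r'http\S+', text)
    hashtags = re.findall(r'#\w+', text)
    # Remove mentions and links; hashtags are recorded but stay in the text
    body = re.sub(r'@\w+|http\S+', '', text)
    # Collapse the whitespace left behind by the removals (not done in the notebook)
    body = re.sub(r'\s+', ' ', body).strip()
    return {'text_clean': body, 'mentions': mentions, 'links': links, 'hashtags': hashtags}

# Hypothetical sample tweet
sample = '@PremierRP Dziękujemy! #Sejm https://t.co/abc123'
parts = split_tweet(sample)
print(parts)
```

A tweet consisting only of mentions and links yields an empty `text_clean`, which is why empty rows have to be filtered out later.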

In [38]:
def add_space_around_emojis(text):
    # Pad every emoji character (and every regional-indicator flag letter) with spaces
    return ''.join(f' {char} ' if char in emoji.EMOJI_DATA or re.match(r'[\U0001F1E6-\U0001F1FF]', char) else char for char in text)

df['text'] = df['text'].apply(add_space_around_emojis)

def clean_text(text):
    # Record mentions and links, then remove them from the text
    mentions = re.findall(r'@\w+', text)
    text = re.sub(r'@\w+', '', text)
    links = re.findall(r'http\S+', text)
    text = re.sub(r'http\S+', '', text)
    # Hashtags are recorded but left in the text
    hashtags = re.findall(r'#\w+', text)
    # Make sure emojis from the Emoticons block (U+1F600 to U+1F64F) have whitespace on both sides
    text = re.sub(r'(?<!\s)([\U0001F600-\U0001F64F])', r' \1', text)
    text = re.sub(r'([\U0001F600-\U0001F64F])(?!\s)', r'\1 ', text)
    return [text, mentions, links, hashtags]

# One row per tweet: cleaned text plus the three extracted lists
df[['text_clean', 'mentions', 'links', 'hashtags']] = pd.DataFrame(df['text'].apply(clean_text).tolist(), index=df.index)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'] = df['text'].apply(add_space_around_emojis)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['text_clean', 'mentions', 'links', 'hashtags']] = pd.DataFrame(df['text'].apply(clean_text).tolist(), index=df.index)

In [39]:
import pandas as pd
pd.options.mode.chained_assignment = None  # Silence the SettingWithCopyWarning raised by assignments made without .loc

In [40]:
df.head()

Unnamed: 0,public_metrics,reply_settings,entities,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,...,context_annotations,in_reply_to_user_id,referenced_tweets,username,party,geo,text_clean,mentions,links,hashtags
0,"{'retweet_count': 230, 'reply_count': 61, 'lik...",everyone,"{'urls': [{'start': 277, 'end': 300, 'url': 'h...",2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,...,,,,placzekgrzegorz,Konfederacja,,❌ Rząd polski zamierza budować w Polsce 49 C...,[],"[https://t.co/gL3O8F0ITB, https://t.co/cay37TX...",[]
1,"{'retweet_count': 1301, 'reply_count': 169, 'l...",everyone,"{'urls': [{'start': 276, 'end': 299, 'url': 'h...",2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,...,,,,placzekgrzegorz,Konfederacja,,❌ Szambo wybija i robi się coraz ciekawiej. ...,[],[https://t.co/OZyTSwcMAb],[]
2,"{'retweet_count': 682, 'reply_count': 145, 'li...",everyone,"{'urls': [{'start': 279, 'end': 302, 'url': 'h...",2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,...,"[{'domain': {'id': '10', 'name': 'Person', 'de...",,,placzekgrzegorz,Konfederacja,,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",[@donaldtusk],[https://t.co/rIGkIpR8sw],[#RadaKrajowaKO]
3,"{'retweet_count': 271, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,...,,,,placzekgrzegorz,Konfederacja,,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,[],[https://t.co/cwusG1221F],[]
4,"{'retweet_count': 214, 'reply_count': 56, 'lik...",everyone,"{'urls': [{'start': 281, 'end': 304, 'url': 'h...",2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,...,,,,placzekgrzegorz,Konfederacja,,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,"[@MZ_GOV_PL, @Leszczyna, @NFZ_GOV_PL]",[https://t.co/RkylDcUHbo],[]


In [41]:
df.drop(columns=['entities'], inplace=True)

In [42]:
# Extract the numeric engagement metrics from 'public_metrics' into separate columns, then drop the original column
df['retweet_count'] = df['public_metrics'].apply(lambda x: x['retweet_count'])
df['reply_count'] = df['public_metrics'].apply(lambda x: x['reply_count'])
df['like_count'] = df['public_metrics'].apply(lambda x: x['like_count'])
df['quote_count'] = df['public_metrics'].apply(lambda x: x['quote_count'])
df['impression_count'] = df['public_metrics'].apply(lambda x: x['impression_count'])

df.drop(columns=['public_metrics'], inplace=True)
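
The same expansion can be written without one `apply` per key. The following is a sketch on hypothetical toy rows, assuming `public_metrics` holds plain Python dicts as in the cell above. Unlike direct indexing with `x['impression_count']`, it does not raise `KeyError` when a key is absent from a row.

```python
import pandas as pd

# Hypothetical rows; the second dict lacks 'impression_count'
toy = pd.DataFrame({
    'id': ['1', '2'],
    'public_metrics': [
        {'retweet_count': 230, 'reply_count': 61, 'like_count': 644,
         'quote_count': 7, 'impression_count': 11648},
        {'retweet_count': 3, 'reply_count': 0, 'like_count': 3, 'quote_count': 1},
    ],
})

# Build one column per dict key; keys missing from a row become NaN
metrics = pd.DataFrame(toy['public_metrics'].tolist(), index=toy.index)
expanded = toy.drop(columns=['public_metrics']).join(metrics)
print(expanded)
```

Passing `index=toy.index` keeps the new columns aligned with the original rows even when the frame's index has gaps after filtering.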

In [43]:
df

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,geo,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count
0,everyone,2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,...,,❌ Rząd polski zamierza budować w Polsce 49 C...,[],"[https://t.co/gL3O8F0ITB, https://t.co/cay37TX...",[],230,61,644,7,11648
1,everyone,2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,...,,❌ Szambo wybija i robi się coraz ciekawiej. ...,[],[https://t.co/OZyTSwcMAb],[],1301,169,3845,57,146584
2,everyone,2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,...,,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",[@donaldtusk],[https://t.co/rIGkIpR8sw],[#RadaKrajowaKO],682,145,2061,28,100757
3,everyone,2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,...,,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,[],[https://t.co/cwusG1221F],[],271,56,989,8,30769
4,everyone,2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,...,,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,"[@MZ_GOV_PL, @Leszczyna, @NFZ_GOV_PL]",[https://t.co/RkylDcUHbo],[],214,56,678,9,17432
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52770,everyone,2022-11-04 17:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588577342594482176],pl,0,1588577342594482176,1588577342594482176,...,,Rozporządzenie o zakazie rejestracji nowych sa...,[@EPPGroup],[https://t.co/zjdg6y5yH8],[],3,0,3,1,0
52774,everyone,2022-11-04 07:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588426347646533634],pl,0,1588426347646533632,1588426347646533632,...,,Wojna gazowa Putina. Europa przełamała szantaż...,[@Money_pl],[https://t.co/UP6coX7Ukj],[],1,0,2,1,0
52779,everyone,2022-10-29 12:23:28+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1586332889444237312],pl,0,1586332889444237312,1586332889444237312,...,,"Naukowcy wiedzą jak sprawić, by OZE nie zależa...",[],[https://t.co/rS6OtqX4wc],[],3,0,2,1,0
52784,everyone,2022-10-21 07:33:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1583360725958828034],pl,0,1583360725958828032,1583360725958828032,...,,Chaos na rynku energii. Zrywane umowy i zamroż...,[@BOksinska],[https://t.co/XMl5UjJLwb],[],4,0,2,1,0


In [44]:
df.dtypes

reply_settings                         object
created_at                datetime64[ns, UTC]
attachments                            object
edit_controls                          object
author_id                             float64
edit_history_tweet_ids                 object
lang                                   object
possibly_sensitive                     object
id                                     object
conversation_id                       float64
text                                   object
category                               object
context_annotations                    object
in_reply_to_user_id                   float64
referenced_tweets                      object
username                               object
party                                  object
geo                                    object
text_clean                             object
mentions                               object
links                                  object
hashtags                          

In [45]:
df.head()

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,geo,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count
0,everyone,2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,...,,❌ Rząd polski zamierza budować w Polsce 49 C...,[],"[https://t.co/gL3O8F0ITB, https://t.co/cay37TX...",[],230,61,644,7,11648
1,everyone,2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,...,,❌ Szambo wybija i robi się coraz ciekawiej. ...,[],[https://t.co/OZyTSwcMAb],[],1301,169,3845,57,146584
2,everyone,2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,...,,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",[@donaldtusk],[https://t.co/rIGkIpR8sw],[#RadaKrajowaKO],682,145,2061,28,100757
3,everyone,2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,...,,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,[],[https://t.co/cwusG1221F],[],271,56,989,8,30769
4,everyone,2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,...,,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,"[@MZ_GOV_PL, @Leszczyna, @NFZ_GOV_PL]",[https://t.co/RkylDcUHbo],[],214,56,678,9,17432


In [46]:
df.dtypes

reply_settings                         object
created_at                datetime64[ns, UTC]
attachments                            object
edit_controls                          object
author_id                             float64
edit_history_tweet_ids                 object
lang                                   object
possibly_sensitive                     object
id                                     object
conversation_id                       float64
text                                   object
category                               object
context_annotations                    object
in_reply_to_user_id                   float64
referenced_tweets                      object
username                               object
party                                  object
geo                                    object
text_clean                             object
mentions                               object
links                                  object
hashtags                          

In [47]:
import pandas as pd

# Step 1: Check for duplicate columns and remove them
if df.columns.duplicated().any():
    print("Duplicate columns found! Removing them...")
    df_no_duplicates = df.loc[:, ~df.columns.duplicated()]  # Keep the first occurrence of each column
else:
    df_no_duplicates = df.copy()

# Step 2: Convert 'id' column to string (if needed)
df_no_duplicates['id'] = df_no_duplicates['id'].astype(str)

# Step 3: Check for missing or empty values in 'text' and 'text_clean'
empty_text = df_no_duplicates[df_no_duplicates['text'].isna() | (df_no_duplicates['text'].astype(str).str.strip() == '')]
empty_text_clean = df_no_duplicates[df_no_duplicates['text_clean'].isna() | (df_no_duplicates['text_clean'].astype(str).str.strip() == '')]

print(f"Rows where 'text' is empty or null: {empty_text.shape[0]}")
print(empty_text[['id', 'text', 'text_clean']].head())

print(f"\nRows where 'text_clean' is empty or null: {empty_text_clean.shape[0]}")
empty_text_clean[['id', 'text', 'text_clean']].head()


Rows where 'text' is empty or null: 0
Empty DataFrame
Columns: [id, text, text_clean]
Index: []

Rows where 'text_clean' is empty or null: 731


Unnamed: 0,id,text,text_clean
41,1830335798408884480,@DOganaw @KONFEDERACJA_ https://t.co/XyFM1k7j7d,
46,1830203038419591168,@PremierRP @donaldtusk https://t.co/2cp9NKe0U4,
49,1829244931577098240,@Leszczyna https://t.co/RrGcwaedOx,
57,1827578129142943744,@PremierRP https://t.co/Vf4FzG1iX0,
58,1827374206306099200,@PlaOb_StalWola @donaldtusk https://t.co/tyTl9...,


In [48]:
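# Boolean mask: True where 'text_clean' is non-empty after stripping whitespace
# (despite its name, false_count holds a mask, not a count)
false_count = (df['text_clean'].str.strip().astype(bool))
print(f"Non-empty mask:\n{false_count}")

Non-empty mask:
0        True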
1        True
2        True
3        True
4        True
         ... 
52770    True
52774    True
52779    True
52784    True
52785    True
Name: text_clean, Length: 49062, dtype: bool


In [49]:
# Delete empty tweets: keep only the rows where the mask is True
df = df[false_count]

In [50]:
len(df)

48331
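
The drop from 49062 to 48331 rows matches the 731 tweets whose `text_clean` was empty. The mask-and-filter step can be sketched on toy data as follows; the variable names and sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned texts; the second and third contain nothing after stripping
toy = pd.DataFrame({'text_clean': ['Dziękujemy!', '   ', '', ' ❌  tekst']})

# True where some non-whitespace content remains
has_content = toy['text_clean'].str.strip().ne('')
empty_count = int((~has_content).sum())
print(f"Empty tweets: {empty_count}")

# .copy() gives an independent frame for the later steps
non_empty = toy[has_content].copy()
print(len(non_empty))
```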

Saving the data used for translation

In [52]:
df_clean_text = df[['id', 'text', 'text_clean']]

df_clean_text.to_csv('data/02.processed/data_for_translation.csv', index=False)
df.to_csv('data/02.processed/whole_dataset_for_translation.csv', index=False)

In [53]:
len(df)

48331

In [54]:
df_clean_text.dtypes

id            object
text          object
text_clean    object
dtype: object

Reading the data used for translation

In [55]:
# Read CSV with ID column as string (text)
df_clean_text = pd.read_csv('data/02.processed/data_for_translation.csv', dtype={'id': str})

# Verify the column type
print("ID column type:", df_clean_text['id'].dtype)
print("Sample ID:", df_clean_text['id'].iloc[0], "of type", type(df_clean_text['id'].iloc[0]))

ID column type: object
Sample ID: 1846086999964283136 of type <class 'str'>
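
Reading the ID as a string matters because tweet IDs have 19 digits, more than a 64-bit float can represent exactly. The sketch below uses a hypothetical in-memory CSV to show what can happen without `dtype`: a single missing ID makes pandas parse the column as `float64`, and the remaining IDs are silently rounded.

```python
import io
import pandas as pd

# Hypothetical CSV: one 19-digit ID and one row with the ID missing
csv_text = "id,text\n1846086999964283214,first\n,second\n"

# Without dtype, the missing value forces the column to float64
as_float = pd.read_csv(io.StringIO(csv_text))
# With dtype=str, the digits are kept exactly as written
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'id': str})

print(as_float['id'].dtype, int(as_float['id'].iloc[0]))
print(as_str['id'].iloc[0])
```

The first printed ID no longer equals the one in the file, while the string version is unchanged.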


In [56]:
df_clean_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48331 entries, 0 to 48330
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          48331 non-null  object
 1   text        48331 non-null  object
 2   text_clean  48331 non-null  object
dtypes: object(3)
memory usage: 1.1+ MB


In [57]:
df_clean_text

Unnamed: 0,id,text,text_clean
0,1846086999964283136,❌ Rząd polski zamierza budować w Polsce 49 C...,❌ Rząd polski zamierza budować w Polsce 49 C...
1,1845748090461966592,❌ Szambo wybija i robi się coraz ciekawiej. ...,❌ Szambo wybija i robi się coraz ciekawiej. ...
2,1845366606823657984,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...","❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP..."
3,1845006197847360000,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...
4,1844633149784891648,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...
...,...,...,...
48326,1588577342594482176,Rozporządzenie o zakazie rejestracji nowych sa...,Rozporządzenie o zakazie rejestracji nowych sa...
48327,1588426347646533632,Wojna gazowa Putina. Europa przełamała szantaż...,Wojna gazowa Putina. Europa przełamała szantaż...
48328,1586332889444237312,"Naukowcy wiedzą jak sprawić, by OZE nie zależa...","Naukowcy wiedzą jak sprawić, by OZE nie zależa..."
48329,1583360725958828032,Chaos na rynku energii. Zrywane umowy i zamroż...,Chaos na rynku energii. Zrywane umowy i zamroż...


In [58]:
# Filter rows where 'text_clean' is null OR empty (after stripping whitespace)
null_or_empty_text_clean = df_clean_text[
    df_clean_text['text_clean'].isna() | 
    (df_clean_text['text_clean'].astype(str).str.strip() == '')
]

# Display the number of problematic rows
print(f"Rows where 'text_clean' is null or empty: {null_or_empty_text_clean.shape[0]}")

# Show the affected rows
null_or_empty_text_clean[['id', 'text', 'text_clean']]

Rows where 'text_clean' is null or empty: 0


Unnamed: 0,id,text,text_clean


In [59]:
# 1. Print the total number of rows in df_clean_text
print("Total rows in df_clean_text:", len(df_clean_text))

# 2. Filter out rows where 'text_clean' is null or an empty string (after stripping whitespace)
valid_rows = df_clean_text[
    ~(
        df_clean_text['text_clean'].isna() 
        | (df_clean_text['text_clean'].astype(str).str.strip() == '')
    )
]

# 3. Print the number of those valid (non-empty) rows
print("Rows with non-empty 'text_clean':", len(valid_rows))

Total rows in df_clean_text: 48331
Rows with non-empty 'text_clean': 48331


Reading the translated dataset

In [61]:
df_en_text = pd.read_parquet('data/przetłumaczone przed i po/df_combined.parquet')
df_en_text

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,reply_count,like_count,quote_count,impression_count,text_clean_en,text_clean_en_demojized,text_clean_demojized,emoji_count_en,emoji_count,name
0,375146901,everyone,1182211615,[{'domain': {'description': 'Named people in t...,1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...",{'editable_until': '2024-10-15 20:49:34+00:00'...,"[{'id': '1846091776269963695', 'type': 'replie...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,1,33,0,1555,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi...","Niezrealizowanie większości ze ""100 konkretów...",0,0,Bartłomiej Pejo
1,,everyone,1182211615,[{'domain': {'description': 'Named people in t...,1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,{'editable_until': '2024-10-15 17:12:19+00:00'...,,2024-10-15 16:12:19+00:00,[1846222583898784025],...,2,72,0,3031,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ...",Rok po wyborach trzeba powiedzieć jedno - nie ...,0,0,Bartłomiej Pejo
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",{'editable_until': '2024-10-15 13:09:12+00:00'...,,2024-10-15 12:09:12+00:00,[1846161400328028272],...,3,33,2,8636,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...,":cross_mark: Mamy rok po wyborach, a Polska p...",1,1,Bartłomiej Pejo
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,{'editable_until': '2024-10-15 08:32:44+00:00'...,,2024-10-15 07:32:44+00:00,[1846091824101769490],...,2,38,0,2441,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...,Mija rok od wyborów parlamentarnych. W kampani...,0,0,Bartłomiej Pejo
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,{'editable_until': '2024-10-15 07:27:14+00:00'...,,2024-10-15 06:27:14+00:00,[1846075343188144153],...,18,616,2,8634,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱,#Idę11 🇵 🇱,2,2,Bartłomiej Pejo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48326,,everyone,961181894,,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...",{'editable_until': '2023-09-29 12:30:44+00:00'...,,2023-09-29 11:30:44+00:00,[1707719554355380484],...,0,6,0,2154,"Are you studying medicine, nursing or emergenc...","Are you studying medicine, nursing or emergenc...","Studiujesz na kierunku lekarskim, pielęgniarst...",0,0,Adam Struzik
48327,,everyone,961181894,,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,{'editable_until': '2023-09-19 14:08:40+00:00'...,,2023-09-19 13:08:40+00:00,[1704120323023454339],...,0,15,0,649,The meeting is over. And further support for t...,The meeting is over. And further support for t...,Za nami posiedzenie . I kolejne wsparcie dla m...,0,0,Adam Struzik
48328,,everyone,961181894,,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,{'editable_until': '2023-09-15 13:59:29+00:00'...,,2023-09-15 12:59:29+00:00,[1702668459576787064],...,0,16,0,581,The Płock Oncology Center is ready! It will ac...,The Płock Oncology Center is ready! It will ac...,Płockie Centrum Onkologii gotowe! Już na począ...,0,0,Adam Struzik
48329,,everyone,961181894,,1701960909369868544,To jedna z największych inwestycji drogowych @...,{'editable_until': '2023-09-13 15:07:56+00:00'...,,2023-09-13 14:07:56+00:00,[1701960909369868437],...,0,13,0,621,This is one of the largest road investments ...,This is one of the largest road investments \...,To jedna z największych inwestycji drogowych ...,0,0,Adam Struzik


In [62]:
missing_translation_mask = df_en_text['text_clean_en'].isna() | (df_en_text['text_clean_en'].str.strip() == '')
df_en_text[missing_translation_mask]

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,reply_count,like_count,quote_count,impression_count,text_clean_en,text_clean_en_demojized,text_clean_demojized,emoji_count_en,emoji_count,name


Merging the second version of the translated dataset with the original one

In [63]:
# Step 1: Make sure IDs are strings in both frames
df['id'] = df['id'].astype(str)
df_en_text['id'] = df_en_text['id'].astype(str)

# Step 2: Drop duplicate translations by 'id', keeping the last occurrence
df_en_combined = df_en_text.drop_duplicates(subset='id', keep='last')

# Step 3: Merge the deduplicated translations back into the full dataset (inner join on 'id')
df_merged = df.merge(df_en_combined[['id', 'text_clean_en']], on='id', how='inner')

print(f"Total rows after merge: {len(df_merged)}")


Total rows after merge: 48331 
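
A row count that matches the input is a good sign, but duplicate keys on either side can also inflate a merge silently. The sketch below shows one way to guard against that, using the `validate` argument of `DataFrame.merge` on toy frames (the frames and their values are hypothetical).

```python
import pandas as pd

# Toy frames (hypothetical data); 'id' is kept as text in both
tweets = pd.DataFrame({"id": ["10", "11", "12"], "text_clean": ["a", "b", "c"]})
translations = pd.DataFrame({
    "id": ["10", "11", "11", "12"],
    "text_clean_en": ["A", "B-old", "B-new", "C"],
})

# Keep the last translation per id, then merge; validate raises
# pandas.errors.MergeError if either side still has duplicate keys
translations = translations.drop_duplicates(subset="id", keep="last")
merged = tweets.merge(translations, on="id", how="left", validate="one_to_one")

print(len(merged))
print(merged)
```

With `validate="one_to_one"`, skipping the `drop_duplicates` step would fail loudly instead of producing extra rows.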


In [64]:
# Find the IDs present in df but not in df_merged
missing_ids = df[~df['id'].isin(df_merged['id'])]['id']

# Display the missing rows
missing_rows = df[df['id'].isin(missing_ids)]
print(missing_rows)

Empty DataFrame
Columns: [reply_settings, created_at, attachments, edit_controls, author_id, edit_history_tweet_ids, lang, possibly_sensitive, id, conversation_id, text, category, context_annotations, in_reply_to_user_id, referenced_tweets, username, party, geo, text_clean, mentions, links, hashtags, retweet_count, reply_count, like_count, quote_count, impression_count]
Index: []

[0 rows x 27 columns]
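
The check above works because an inner merge drops rows without a match, so any lost IDs show up in the comparison. An alternative sketch, assuming a left merge is acceptable, uses `indicator=True` to label unmatched rows directly (toy frames with hypothetical values):

```python
import pandas as pd

tweets = pd.DataFrame({"id": ["10", "11", "12"], "text_clean": ["a", "b", "c"]})
translations = pd.DataFrame({"id": ["10", "12"], "text_clean_en": ["A", "C"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
checked = tweets.merge(translations, on="id", how="left", indicator=True)
untranslated = checked[checked["_merge"] == "left_only"]

print(untranslated["id"].tolist())  # ['11']
```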


In [65]:
df_merged

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
0,everyone,2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,...,❌ Rząd polski zamierza budować w Polsce 49 C...,[],"[https://t.co/gL3O8F0ITB, https://t.co/cay37TX...",[],230,61,644,7,11648,❌ The Polish government intends to build 49 F...
1,everyone,2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,...,❌ Szambo wybija i robi się coraz ciekawiej. ...,[],[https://t.co/OZyTSwcMAb],[],1301,169,3845,57,146584,❌ The cesspool is breaking out and it's getti...
2,everyone,2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,...,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",[@donaldtusk],[https://t.co/rIGkIpR8sw],[#RadaKrajowaKO],682,145,2061,28,100757,❌ I DON'T UNDERSTAND HOW YOU CAN HURT YOUR OW...
3,everyone,2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,...,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,[],[https://t.co/cwusG1221F],[],271,56,989,8,30769,🆘 The pharmaceutical company GSK will pay ove...
4,everyone,2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,...,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,"[@MZ_GOV_PL, @Leszczyna, @NFZ_GOV_PL]",[https://t.co/RkylDcUHbo],[],214,56,678,9,17432,"❌ WHAT IS GOING ON HERE? In October 2024, her..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48326,everyone,2022-11-04 17:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588577342594482176],pl,0,1588577342594482176,1588577342594482176,...,Rozporządzenie o zakazie rejestracji nowych sa...,[@EPPGroup],[https://t.co/zjdg6y5yH8],[],3,0,3,1,0,Regulation banning the registration of new car...
48327,everyone,2022-11-04 07:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588426347646533634],pl,0,1588426347646533632,1588426347646533632,...,Wojna gazowa Putina. Europa przełamała szantaż...,[@Money_pl],[https://t.co/UP6coX7Ukj],[],1,0,2,1,0,Putin's gas war. Europe has broken the blackma...
48328,everyone,2022-10-29 12:23:28+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1586332889444237312],pl,0,1586332889444237312,1586332889444237312,...,"Naukowcy wiedzą jak sprawić, by OZE nie zależa...",[],[https://t.co/rS6OtqX4wc],[],3,0,2,1,0,Scientists know how to ensure that renewable e...
48329,everyone,2022-10-21 07:33:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1583360725958828034],pl,0,1583360725958828032,1583360725958828032,...,Chaos na rynku energii. Zrywane umowy i zamroż...,[@BOksinska],[https://t.co/XMl5UjJLwb],[],4,0,2,1,0,Chaos on the energy market. Terminated contrac...


Checking whether the merge worked correctly for a sample tweet

In [66]:
df_merged[df_merged["id"]=="1807795860480160000"]

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
5163,everyone,2024-07-01 15:18:15+00:00,"{'media_keys': ['3_1807795856101212161', '3_18...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",1182211615,[1807795860480159980],pl,False,1807795860480160000,1807795860480160000,...,🇳 🇱 Holenderski klub NAC Breda przygotował ...,[],"[https://t.co/WI736ocWQ7, https://t.co/ANX3x4e...",[],5,1,24,0,603,🇳 🇱 The Dutch club NAC Breda has prepared spe...


In [67]:
# Check how many rows still have missing or empty translations
missing_translation_mask = df_merged['text_clean_en'].isna() | (df_merged['text_clean_en'].str.strip() == '')

# Show some of them
df_missing_translation = df_merged[missing_translation_mask]
print(f"Rows without translation: {df_missing_translation.shape[0]}")
display(df_missing_translation[['id', 'text', 'text_clean', 'text_clean_en']].head())


Rows without translation: 0


Unnamed: 0,id,text,text_clean,text_clean_en


Removing rows without a translation, because they contain no text that is analyzed in our research

In [68]:
# Keep only the rows that have a non-empty translation
df_clean_translated = df_merged[~missing_translation_mask].copy()

print(f"Remaining rows with proper translation: {df_clean_translated.shape[0]}")

Remaining rows with proper translation: 48331


In [69]:
df_clean_translated.head()

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
0,everyone,2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,...,❌ Rząd polski zamierza budować w Polsce 49 C...,[],"[https://t.co/gL3O8F0ITB, https://t.co/cay37TX...",[],230,61,644,7,11648,❌ The Polish government intends to build 49 F...
1,everyone,2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,...,❌ Szambo wybija i robi się coraz ciekawiej. ...,[],[https://t.co/OZyTSwcMAb],[],1301,169,3845,57,146584,❌ The cesspool is breaking out and it's getti...
2,everyone,2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,...,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",[@donaldtusk],[https://t.co/rIGkIpR8sw],[#RadaKrajowaKO],682,145,2061,28,100757,❌ I DON'T UNDERSTAND HOW YOU CAN HURT YOUR OW...
3,everyone,2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,...,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,[],[https://t.co/cwusG1221F],[],271,56,989,8,30769,🆘 The pharmaceutical company GSK will pay ove...
4,everyone,2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,...,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,"[@MZ_GOV_PL, @Leszczyna, @NFZ_GOV_PL]",[https://t.co/RkylDcUHbo],[],214,56,678,9,17432,"❌ WHAT IS GOING ON HERE? In October 2024, her..."


In [70]:
df_clean_translated.to_csv('data/02.processed/df_clean_translated_further_analalysis.csv', index=False)

In [71]:
# Compiled once at module level instead of on every call
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F300-\U0001F5FF"  # Symbols & pictographs
    "\U0001F680-\U0001F6FF"  # Transport & map symbols
    "\U0001F700-\U0001F77F"  # Alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric shapes
    "\U0001F800-\U0001F8FF"  # Supplemental arrows
    "\U0001F900-\U0001F9FF"  # Supplemental symbols and pictographs
    "\U0001FA00-\U0001FA6F"  # Chess symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and pictographs extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"  # Enclosed characters (broad range: also covers Dingbats and non-emoji blocks such as CJK)
    "]+",
    flags=re.UNICODE,
)


def count_emojis(text):
    """Count runs of consecutive emoji characters in `text`.

    Because of the trailing '+', adjacent emojis count as a single match.
    """
    return len(EMOJI_PATTERN.findall(text))
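
Note that a character class followed by `+` matches whole runs, so `findall` returns one item per group of adjacent emojis rather than one per emoji. The sketch below illustrates the difference with a deliberately narrow range (emoticons only, not the full pattern above); the names `runs`, `chars` and `sample` are illustrative.

```python
import re

# Narrow illustrative range (emoticons only), not the full pattern above
runs = re.compile("[\U0001F600-\U0001F64F]+")   # one match per run of adjacent emojis
chars = re.compile("[\U0001F600-\U0001F64F]")   # one match per emoji code point

sample = "Super 🙂🙂 dziękuję 🙂"

print(len(runs.findall(sample)))   # 2
print(len(chars.findall(sample)))  # 3
```

Either definition is usable for the "has at least one emoji" statistics, since both are zero exactly when no character in the class occurs.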


In [72]:
# Demojize text columns
df_clean_translated['text_clean_en_demojized'] = df_clean_translated['text_clean_en'].apply(
    lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x
)
df_clean_translated['text_clean_demojized'] = df_clean_translated['text_clean'].apply(
    lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x
)

# Count emojis in original text columns
df_clean_translated['emoji_count_en'] = df_clean_translated['text_clean_en'].apply(
    lambda x: count_emojis(str(x)) if pd.notnull(x) else 0
)
df_clean_translated['emoji_count'] = df_clean_translated['text_clean'].apply(
    lambda x: count_emojis(str(x)) if pd.notnull(x) else 0
)


In [73]:
# Total number of rows
total_rows = len(df_clean_translated)

# Rows with emojis in 'text_clean_en'
rows_with_emojis_en = df_clean_translated[df_clean_translated['emoji_count_en'] > 0].shape[0]

# Rows with emojis in 'text_clean'
rows_with_emojis = df_clean_translated[df_clean_translated['emoji_count'] > 0].shape[0]

# Display statistics
print(f"Total number of rows: {total_rows}")
print(f"Rows with emojis in 'text_clean_en': {rows_with_emojis_en} ({(rows_with_emojis_en/total_rows)*100:.2f}%)")
print(f"Rows with emojis in 'text_clean': {rows_with_emojis} ({(rows_with_emojis/total_rows)*100:.2f}%)")


Total number of rows: 48331
Rows with emojis in 'text_clean_en': 17994 (37.23%)
Rows with emojis in 'text_clean': 18289 (37.84%)


In [74]:
df_clean_translated[['text_clean_en', 'text_clean_en_demojized', 'emoji_count_en', 'text_clean', 'text_clean_demojized', 'emoji_count']].head()


Unnamed: 0,text_clean_en,text_clean_en_demojized,emoji_count_en,text_clean,text_clean_demojized,emoji_count
0,❌ The Polish government intends to build 49 F...,:cross_mark: The Polish government intends to...,2,❌ Rząd polski zamierza budować w Polsce 49 C...,:cross_mark: Rząd polski zamierza budować w ...,3
1,❌ The cesspool is breaking out and it's getti...,:cross_mark: The cesspool is breaking out and...,1,❌ Szambo wybija i robi się coraz ciekawiej. ...,:cross_mark: Szambo wybija i robi się coraz ...,1
2,❌ I DON'T UNDERSTAND HOW YOU CAN HURT YOUR OW...,:cross_mark: I DON'T UNDERSTAND HOW YOU CAN H...,1,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",":cross_mark: NIE ROZUMIEM, JAK MOŻNA KRZYWDZ...",1
3,🆘 The pharmaceutical company GSK will pay ove...,:SOS_button: The pharmaceutical company GSK w...,2,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,:SOS_button: Firma farmaceutyczna GSK zapła...,3
4,"❌ WHAT IS GOING ON HERE? In October 2024, her...",:cross_mark: WHAT IS GOING ON HERE? In Octobe...,2,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,:cross_mark: O CO TUTAJ CHODZI? W październi...,3


In [75]:
# Filter rows
rows_with_emojis_in_text_clean_only = df_clean_translated[
    (df_clean_translated['emoji_count'] > 0) & (df_clean_translated['emoji_count_en'] == 0)
]

# Display the number of such rows
print(f"Number of rows with emojis in 'text_clean' but not in 'text_clean_en': {len(rows_with_emojis_in_text_clean_only)}")

# Display the affected rows
rows_with_emojis_in_text_clean_only[['text_clean', 'text_clean_en']]


Number of rows with emojis in 'text_clean' but not in 'text_clean_en': 295


Unnamed: 0,text_clean,text_clean_en
472,Recepta na problemy na polskiej granicy jest b...,The solution to problems at the Polish border ...
536,Unia Europejska mówi Polakom: wasze domy i mie...,The European Union tells Poles: your houses an...
633,Unia Europejska planuje wprowadzać kolejne ogr...,The European Union plans to introduce further ...
651,Poseł Adam Gomoła z Polski 2050 Szymona Hołown...,MP Adam Gomoła from Szymon Hołownia's Poland 2...
686,"Premier Tusk oświadczył, że zamierza wprowadzi...",Prime Minister Tusk stated that he intends to ...
...,...,...
45637,W niedzielę będę gościem . Serdecznie zachęcam...,I will be a guest on Sunday. I encourage you t...
45654,"I 10 razy tyle, co na mieszkalnictwo 🙂",And 10 times as much as for housing :)
46684,Intensywny tydzień misji Komisji Budżetu PE w ...,An intense week of mission of the EP Budget Co...
46926,Może być 🙂 Dziękuję i ściskam dłoń,Maybe :) Thank you and I shake your hand


In [76]:
# Recompute the demojized columns (same operation as in In [72]) before previewing them
df_clean_translated['text_clean_en_demojized'] = df_clean_translated['text_clean_en'].apply(lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x)
df_clean_translated['text_clean_demojized'] = df_clean_translated['text_clean'].apply(lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x)

df_clean_translated[['text_clean_en', 'text_clean_en_demojized', 'text_clean', 'text_clean_demojized']].head()

Unnamed: 0,text_clean_en,text_clean_en_demojized,text_clean,text_clean_demojized
0,❌ The Polish government intends to build 49 F...,:cross_mark: The Polish government intends to...,❌ Rząd polski zamierza budować w Polsce 49 C...,:cross_mark: Rząd polski zamierza budować w ...
1,❌ The cesspool is breaking out and it's getti...,:cross_mark: The cesspool is breaking out and...,❌ Szambo wybija i robi się coraz ciekawiej. ...,:cross_mark: Szambo wybija i robi się coraz ...
2,❌ I DON'T UNDERSTAND HOW YOU CAN HURT YOUR OW...,:cross_mark: I DON'T UNDERSTAND HOW YOU CAN H...,"❌ NIE ROZUMIEM, JAK MOŻNA KRZYWDZIĆ W TEN SP...",":cross_mark: NIE ROZUMIEM, JAK MOŻNA KRZYWDZ..."
3,🆘 The pharmaceutical company GSK will pay ove...,:SOS_button: The pharmaceutical company GSK w...,🆘 Firma farmaceutyczna GSK zapłaci ponad 2 ...,:SOS_button: Firma farmaceutyczna GSK zapła...
4,"❌ WHAT IS GOING ON HERE? In October 2024, her...",:cross_mark: WHAT IS GOING ON HERE? In Octobe...,❌ O CO TUTAJ CHODZI? W październiku 2024 r. ...,:cross_mark: O CO TUTAJ CHODZI? W październi...


In [77]:
df_clean_translated['possibly_sensitive'] = df_clean_translated['possibly_sensitive'].astype(bool)
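
In the tables above, `possibly_sensitive` appears both as `False` and as `0`, and `astype(bool)` unifies the two. A small sketch of how this cast behaves on object columns follows (toy series, hypothetical values). It also shows a caveat: the cast is only safe while the column holds booleans or numbers, not text.

```python
import pandas as pd

# Mixed booleans and integers (as an object column) convert as expected
mixed = pd.Series([False, 0, True, 1], dtype=object)
print(mixed.astype(bool).tolist())    # [False, False, True, True]

# Caveat: any non-empty string is truthy, including the text "False"
as_text = pd.Series(["False", "0", ""], dtype=object)
print(as_text.astype(bool).tolist())  # [True, True, False]
```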

In [78]:
username_to_realname = {
    'bartlomiejpejo': 'Bartłomiej Pejo',
    'GrzegorzBraun_': 'Grzegorz Braun',
    'Iwaszkiewicz_RJ': 'Robert Iwaszkiewicz',
    'KonradBerkowicz': 'Konrad Berkowicz',
    'MarSypniewski': 'Marcin Sypniewski',
    'MichalWawer': 'Michał Wawer',
    'placzekgrzegorz': 'Grzegorz Płaczek',
    'SlawomirMentzen': 'Sławomir Mentzen',
    'TudujKrzysztof': 'Krzysztof Tuduj',
    'Wlodek_Skalik': 'Włodzimierz Skalik',
    'WTumanowicz': 'Witold Tumanowicz',
    'AndrzejSzejna': 'Andrzej Szejna',
    'AnitaKDZG': 'Anita Kucharska-Dziedzic',
    'JoankaSW': 'Joanna Scheuring-Wielgus',
    'KGawkowski': 'Krzysztof Gawkowski',
    'K_Smiszek': 'Krzysztof Śmiszek',
    'MarcinKulasek': 'Marcin Kulasek',
    'MoskwaWodnicka': 'Małgorzata Moskwa-Wodnicka',
    'PaulinaPW2024': 'Paulina Piechna-Więckiewicz',
    'poselTTrela': 'Tomasz Trela',
    'RobertBiedron': 'Robert Biedroń',
    'WandaNowicka': 'Wanda Nowicka',
    'wieczorekdarek': 'Dariusz Wieczorek',
    'wlodekczarzasty': 'Włodzimierz Czarzasty',
    'Arek_Iwaniak': 'Arkadiusz Iwaniak',
    'B_Maciejewska': 'Beata Maciejewska',
    'BeataSzydlo': 'Beata Szydło',
    'elzbietawitek': 'Elżbieta Witek',
    'Kaminski_M_': 'Mariusz Kamiński',
    'Kowalczyk_H': 'Henryk Kowalczyk',
    'Macierewicz_A': 'Antoni Macierewicz',
    'mblaszczak': 'Mariusz Błaszczak',
    'MorawieckiM': 'Mateusz Morawiecki',
    'mwojcik_': 'Michał Wójcik',
    'PatrykJaki': 'Patryk Jaki',
    'bbudka': 'Borys Budka',
    'CTomczyk': 'Cezary Tomczyk',
    'donaldtusk': 'Donald Tusk',
    'DorotaNiedziela': 'Dorota Niedziela',
    'EwaKopacz': 'Ewa Kopacz',
    'JanGrabiec': 'Jan Grabiec',
    'Konwinski_PO': 'Zbigniew Konwiński',
    'Leszczyna': 'Izabela Leszczyna',
    'MKierwinski': 'Marcin Kierwiński',
    'M_K_Blonska': 'Małgorzata Kidawa-Błońska',
    'OklaDrewnowicz': 'Agnieszka Okła-Drewnowicz',
    'trzaskowski_': 'Rafał Trzaskowski',
    'TomaszSiemoniak': 'Tomasz Siemoniak',
    'AgaBaranowskaPL': 'Agnieszka Baranowska',
    'aga_buczynska': 'Agnieszka Buczyńska',
    'hennigkloska': 'Paulina Hennig-Kloska',
    'joannamucha': 'Joanna Mucha',
    'Kpelczynska': 'Katarzyna Pełczyńska-Nałęcz',
    'LukaszOsmalak': 'Łukasz Osmalak',
    'SlizPawel': 'Paweł Śliz',
    'szymon_holownia': 'Szymon Hołownia',
    'ZalewskiPawel': 'Paweł Zalewski',
    'ZywnoMaciej': 'Maciej Żywno',
    'JKozlowskiEu': 'Jacek Kozłowski',
    'michalkobosko': 'Michał Kobosko',
    'DariuszKlimczak': 'Dariusz Klimczak',
    'GrzybAndrzej': 'Andrzej Grzyb',
    'Hetman_K': 'Krzysztof Hetman',
    'JarubasAdam': 'Adam Jarubas',
    'KosiniakKamysz': 'Władysław Kosiniak-Kamysz',
    'Paslawska': 'Urszula Pasławska',
    'PZgorzelskiP': 'Piotr Zgorzelski',
    'StefanKrajewski': 'Stefan Krajewski',
    'StruzikAdam': 'Adam Struzik'
}

# Add the 'name' column to the dataframe
df_clean_translated['name'] = df_clean_translated['username'].map(username_to_realname)
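
`Series.map` with a dict leaves `NaN` for any key that is missing from the dict, so a misspelled handle would silently produce an empty name. A quick check could look like the sketch below (the two-entry `mapping` and the handle `unknown_handle` are hypothetical):

```python
import pandas as pd

# Hypothetical two-entry mapping and a frame with one unmapped handle
mapping = {"donaldtusk": "Donald Tusk", "bbudka": "Borys Budka"}
toy = pd.DataFrame({"username": ["donaldtusk", "bbudka", "unknown_handle"]})

toy["name"] = toy["username"].map(mapping)

# Usernames that did not receive a real name
unmapped = toy.loc[toy["name"].isna(), "username"].unique().tolist()
print(unmapped)  # ['unknown_handle']
```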

In [79]:
# Replace newline characters in the 'text_clean_en' column with spaces
df_clean_translated['text_clean_en'] = df_clean_translated['text_clean_en'].str.replace('\n', ' ', regex=False)

In [81]:
# Save the DataFrame to a Parquet file
df_clean_translated.to_parquet('data/03.cleaned/df_combined.parquet', index=False)

In [82]:
df_clean_translated

Unnamed: 0,reply_settings,created_at,attachments,edit_controls,author_id,edit_history_tweet_ids,lang,possibly_sensitive,id,conversation_id,...,reply_count,like_count,quote_count,impression_count,text_clean_en,text_clean_en_demojized,text_clean_demojized,emoji_count_en,emoji_count,name
0,everyone,2024-10-15 07:13:34+00:00,"{'media_keys': ['3_1846083966849159168', '3_18...","{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1846086770229694583, 1846086999964283214]",pl,False,1846086999964283136,1846086999964283136,...,61,644,7,11648,❌ The Polish government intends to build 49 F...,:cross_mark: The Polish government intends to...,:cross_mark: Rząd polski zamierza budować w ...,2,3,Grzegorz Płaczek
1,everyone,2024-10-14 08:46:51+00:00,,"{'edits_remaining': 4, 'is_edit_eligible': Tru...",1284852220593414144,"[1845747336862961872, 1845748090461966651]",pl,False,1845748090461966592,1845748090461966592,...,169,3845,57,146584,❌ The cesspool is breaking out and it's getti...,:cross_mark: The cesspool is breaking out and...,:cross_mark: Szambo wybija i robi się coraz ...,1,1,Grzegorz Płaczek
2,everyone,2024-10-13 07:30:58+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845366606823657982],pl,False,1845366606823657984,1845366606823657984,...,145,2061,28,100757,❌ I DON'T UNDERSTAND HOW YOU CAN HURT YOUR OW...,:cross_mark: I DON'T UNDERSTAND HOW YOU CAN H...,":cross_mark: NIE ROZUMIEM, JAK MOŻNA KRZYWDZ...",1,1,Grzegorz Płaczek
3,everyone,2024-10-12 07:38:50+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1845006197847359885],pl,False,1845006197847360000,1845006197847360000,...,56,989,8,30769,🆘 The pharmaceutical company GSK will pay ove...,:SOS_button: The pharmaceutical company GSK w...,:SOS_button: Firma farmaceutyczna GSK zapła...,2,3,Grzegorz Płaczek
4,everyone,2024-10-11 06:56:29+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1284852220593414144,[1844633149784891665],pl,False,1844633149784891648,1844633149784891648,...,56,678,9,17432,"❌ WHAT IS GOING ON HERE? In October 2024, her...",:cross_mark: WHAT IS GOING ON HERE? In Octobe...,:cross_mark: O CO TUTAJ CHODZI? W październi...,2,3,Grzegorz Płaczek
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48326,everyone,2022-11-04 17:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588577342594482176],pl,False,1588577342594482176,1588577342594482176,...,0,3,1,0,Regulation banning the registration of new car...,Regulation banning the registration of new car...,Rozporządzenie o zakazie rejestracji nowych sa...,0,0,Andrzej Grzyb
48327,everyone,2022-11-04 07:02:07+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1588426347646533634],pl,False,1588426347646533632,1588426347646533632,...,0,2,1,0,Putin's gas war. Europe has broken the blackma...,Putin's gas war. Europe has broken the blackma...,Wojna gazowa Putina. Europa przełamała szantaż...,0,0,Andrzej Grzyb
48328,everyone,2022-10-29 12:23:28+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1586332889444237312],pl,False,1586332889444237312,1586332889444237312,...,0,2,1,0,Scientists know how to ensure that renewable e...,Scientists know how to ensure that renewable e...,"Naukowcy wiedzą jak sprawić, by OZE nie zależa...",0,0,Andrzej Grzyb
48329,everyone,2022-10-21 07:33:09+00:00,,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",1119834276,[1583360725958828034],pl,False,1583360725958828032,1583360725958828032,...,0,2,1,0,Chaos on the energy market. Terminated contrac...,Chaos on the energy market. Terminated contrac...,Chaos na rynku energii. Zrywane umowy i zamroż...,0,0,Andrzej Grzyb
