# Data cleaning

This notebook cleans the downloaded tweet data. The main steps are:
1. Adding party affiliation to tweet rows
2. Deleting unnecessary downloaded retweets
3. Removing links and mentions from the tweet text and saving them to separate columns
4. Expanding the column of public metrics
5. Encoding emojis in a unified format
6. Translating tweets using Google Translate in Google Sheets
7. Saving all downloaded tweets to one file
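
As an illustration of step 3, here is a minimal sketch of how links and mentions could be pulled out of a tweet's text. The function name `split_links_and_mentions` and both regular expressions are assumptions made for this example, not the patterns used later in the notebook.

```python
import re

# Illustrative patterns (assumptions): a link is http(s):// up to the next
# whitespace, a mention is "@" followed by word characters.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def split_links_and_mentions(text):
    """Return (cleaned_text, links, mentions) for a single tweet text."""
    links = URL_RE.findall(text)
    without_links = URL_RE.sub("", text)
    # Search for mentions only after links are gone, so an "@" inside a URL
    # is not mistaken for a mention.
    mentions = MENTION_RE.findall(without_links)
    cleaned = MENTION_RE.sub("", without_links)
    # Collapse the whitespace left behind by the removals.
    cleaned = " ".join(cleaned.split())
    return cleaned, links, mentions


cleaned, links, mentions = split_links_and_mentions(
    "@donaldtusk Rok po wyborach https://t.co/KiCe5ATOpX"
)
print(cleaned, links, mentions)
```

On a DataFrame, the three return values could be written to separate columns, for example by applying the function to the `text` column.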

### 1. Libraries used

In [1]:
import os
import pandas as pd
import re
import emoji

### 2. Reading JSON files and transforming them into party-specific pickle files

In [2]:
base_input_paths = ['data/PoWyborach', 'data/tweets_data_2022']
subfolders = ['Konfederacja', 'NL', 'PIS', 'PO', 'PL2050', 'PSL']
output_folder = 'data/tweets_data_combined'
os.makedirs(output_folder, exist_ok=True)  # make sure the output folder exists before pickling

for subfolder in subfolders:
    dataframes = []
    for base_input_path in base_input_paths:
        folder_path = os.path.join(base_input_path, subfolder)
        for filename in os.listdir(folder_path):
            if filename.endswith('.json'):
                file_path = os.path.join(folder_path, filename)
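                # NOTE: the files read here are not named '*_tweets.json', so this split
                # leaves the full filename (including '.json') in the username column.
                politician = filename.split("_tweets.json")[0]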
                try:
                    df = pd.read_json(file_path)  
                    df["username"] = politician  
                    df["party"] = subfolder
                    print(f"Read {len(df)} rows from {file_path}")  
                    dataframes.append(df)
                except ValueError as e:
                    print(f"Error reading {file_path}: {e}")
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        
        output_file_path = os.path.join(output_folder, f'{subfolder}_combined.pkl')
        combined_df.to_pickle(output_file_path) 
        
        print(f"Saved {subfolder} combined data to {output_file_path}")

print("Processing complete!")

Read 964 rows from data/PoWyborach\Konfederacja\bartlomiejpejo_2023-10-16_2024-10-15.json
Read 889 rows from data/PoWyborach\Konfederacja\GrzegorzBraun__2023-10-16_2024-10-15.json
Read 11 rows from data/PoWyborach\Konfederacja\Iwaszkiewicz_RJ_2023-10-16_2024-10-15.json
Read 289 rows from data/PoWyborach\Konfederacja\KonradBerkowicz_2023-10-15_2024-04-16_vol2 (1).json
Read 1318 rows from data/PoWyborach\Konfederacja\KonradBerkowicz_2024-04-16_2024-10-15_vol1 (1).json
Read 772 rows from data/PoWyborach\Konfederacja\MarSypniewski_2023-10-16_2024-10-15.json
Read 597 rows from data/PoWyborach\Konfederacja\MichalWawer_2023-10-16_2024-10-15.json
Read 421 rows from data/PoWyborach\Konfederacja\placzekgrzegorz_2023-10-16_2024-04-15.json
Read 320 rows from data/PoWyborach\Konfederacja\placzekgrzegorz_2024-04-16_2024-10-15.json
Read 721 rows from data/PoWyborach\Konfederacja\SlawomirMentzen_2023-10-16_2024-10-15.json
Read 175 rows from data/PoWyborach\Konfederacja\TudujKrzysztof_2023-10-16_2024-1
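
The filenames in the log above follow a `<username>_<start date>_<end date>...json` pattern, so splitting on `"_tweets.json"` returns the whole filename unchanged. The sketch below shows one way the account name could be recovered instead. The helper `username_from_filename` and its regular expression are hypothetical, and the pattern is only inferred from the filenames visible in this output.

```python
import os
import re

# Assumption: the username is everything before the first "_YYYY-MM-DD_" block.
DATE_BLOCK_RE = re.compile(r"^(?P<user>.+?)_\d{4}-\d{2}-\d{2}_")


def username_from_filename(filename):
    """Extract the account name; fall back to the name without its extension."""
    match = DATE_BLOCK_RE.match(filename)
    if match:
        return match.group("user")
    return os.path.splitext(filename)[0]


examples = [
    "bartlomiejpejo_2023-10-16_2024-10-15.json",
    "GrzegorzBraun__2023-10-16_2024-10-15.json",
    "KonradBerkowicz_2023-10-15_2024-04-16_vol2 (1).json",
]
for name in examples:
    # The original split leaves the filename untouched; the regex strips the dates.
    print(name.split("_tweets.json")[0], "->", username_from_filename(name))
```

Usernames that themselves end in an underscore (such as `GrzegorzBraun_`) are preserved, because the match is anchored on the date block rather than on the first underscore.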

### 3. Data cleaning

In [3]:
df_konfederacja = pd.read_pickle(os.path.join(output_folder, 'Konfederacja_combined.pkl'))
df_NL = pd.read_pickle(os.path.join(output_folder, 'NL_combined.pkl'))
df_PIS = pd.read_pickle(os.path.join(output_folder, 'PIS_combined.pkl'))
df_PO = pd.read_pickle(os.path.join(output_folder, 'PO_combined.pkl'))
df_PL2050 = pd.read_pickle(os.path.join(output_folder, 'PL2050_combined.pkl'))
df_PSL = pd.read_pickle(os.path.join(output_folder, 'PSL_combined.pkl'))

In [4]:
df_konfederacja.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12651 entries, 0 to 12650
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          12651 non-null  object             
 1   in_reply_to_user_id     3078 non-null   float64            
 2   reply_settings          12651 non-null  object             
 3   author_id               12651 non-null  float64            
 4   context_annotations     1334 non-null   object             
 5   id                      12651 non-null  float64            
 6   text                    12651 non-null  object             
 7   edit_controls           12651 non-null  object             
 8   referenced_tweets       4543 non-null   object             
 9   created_at              12651 non-null  datetime64[ns, UTC]
 10  edit_history_tweet_ids  12651 non-null  object             
 11  lang                    12651 non-null  o

In [5]:
df_konfederacja.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182212000.0,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1.846277e+18,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1.846092e+18,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182212000.0,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1.846223e+18,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1.846223e+18,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182212000.0,,1.846161e+18,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1.846161e+18,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182212000.0,,1.846092e+18,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1.846092e+18,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182212000.0,,1.846075e+18,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1.846075e+18,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja


In [6]:
# Merge all dataframes into one
df = pd.concat([df_konfederacja, df_NL, df_PIS, df_PO, df_PL2050, df_PSL], ignore_index=True)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52787 entries, 0 to 52786
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          52787 non-null  object             
 1   in_reply_to_user_id     10972 non-null  float64            
 2   reply_settings          52787 non-null  object             
 3   author_id               52787 non-null  float64            
 4   context_annotations     5948 non-null   object             
 5   id                      52771 non-null  float64            
 6   text                    52787 non-null  object             
 7   edit_controls           52771 non-null  object             
 8   referenced_tweets       19691 non-null  object             
 9   created_at              52787 non-null  datetime64[ns, UTC]
 10  edit_history_tweet_ids  52771 non-null  object             
 11  lang                    52787 non-null  o

In [8]:
len(df)

52787

In [9]:
pd.options.display.float_format = '{:.0f}'.format
df['id'] = df['id'].fillna(0).astype('int64')
df['id']

0        1846277256509116672
1        1846222583898784000
2        1846161400328028160
3        1846091824101769472
4        1846075343188144128
                ...         
52782    1701274354145780224
52783    1701273238263742720
52784    1701273238263742720
52785    1697128952131661824
52786    1697128952131661824
Name: id, Length: 52787, dtype: int64
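
Note that the ids printed above no longer match the values in `edit_history_tweet_ids` (for row 0, `1846277256509116672` versus `1846277256509116623`). A `float64` cannot represent every 19-digit integer, so an id that passes through a float column is rounded. The sketch below demonstrates the effect with plain Python and shows one possible way to recover the exact id. The helper `exact_id_from_history` is hypothetical and assumes that the last entry of `edit_history_tweet_ids` is the tweet's current id, which is consistent with the rows shown in this notebook but is not verified here.

```python
# A tweet id taken from edit_history_tweet_ids in the head() output above.
true_id = 1846277256509116623

# Passing through float64 (as the 'id' column did) rounds to the nearest
# representable value, so the round trip does not return the original id.
round_tripped = int(float(true_id))
print(true_id, round_tripped, round_tripped - true_id)


def exact_id_from_history(edit_history):
    """Return the last id in the edit history as a string, or None if missing.

    Assumption: the last element is the id of the current tweet version.
    """
    if isinstance(edit_history, list) and edit_history:
        return str(edit_history[-1])
    return None


print(exact_id_from_history([1844253222103380157, 1844253674584916327]))
print(exact_id_from_history(None))
```

Storing ids as strings avoids both the rounding and the need for a numeric placeholder for missing values.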

In [10]:
# Get the value counts of 'id'
id_counts = df['id'].value_counts()

# Filter the counts to show only those greater than 1
id_counts_above_1 = id_counts[id_counts > 1]

# Display the counts
print(f"IDs with counts greater than 1:\n{id_counts_above_1}")

IDs with counts greater than 1:
id
0                      16
1780108572161945856     3
1714215043720442368     2
1581668593192034304     2
1581702691851763712     2
                       ..
1713961177028415488     2
1780346371888833024     2
1791793599819993088     2
1780346025854603264     2
1780345829615636480     2
Name: count, Length: 229, dtype: int64


In [11]:
id_counts_above_1.sum()

473

In [12]:
# Count unique IDs
non_duplicate_counts = df['id'].nunique()
print(f"Number of unique IDs: {non_duplicate_counts}")

# Count duplicate IDs
duplicate_counts = df['id'].duplicated().sum()
print(f"Number of duplicate IDs: {duplicate_counts}")

# Get the value counts of 'id'
id_counts = df['id'].value_counts()

# Filter the counts to show only those greater than 1
id_counts_above_1 = id_counts[id_counts > 1]

# Sum of counts of IDs that appear more than once
total_duplicate_rows = id_counts_above_1.sum()
print(f"Total number of duplicate rows based on 'id': {total_duplicate_rows}")

# Convert all columns to strings to avoid unhashable types
df_str = df.astype(str)

# Now check for exact duplicate rows across all columns
duplicates_all = df_str[df_str.duplicated(keep=False)]
print(f"Total duplicate rows (exact match across all columns): {duplicates_all.shape[0]}")
duplicates_all

Number of unique IDs: 52543
Number of duplicate IDs: 244
Total number of duplicate rows based on 'id': 473
Total duplicate rows (exact match across all columns): 304


Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
1660,"{'retweet_count': 38, 'reply_count': 23, 'like...",,everyone,1.1863674835461652e+18,,1768754497738678272,Skandal👇 https://t.co/XBJRvv3OVR,"{'edits_remaining': 5, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1768727139937828957'}]",2024-03-15 21:41:48+00:00,['1768754497738678324'],in,1.7687544977386783e+18,"{'urls': [{'start': 9, 'end': 32, 'url': 'http...",False,Quote,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1661,"{'retweet_count': 38, 'reply_count': 23, 'like...",,everyone,1.1863674835461652e+18,,1768754497738678272,Skandal👇 https://t.co/XBJRvv3OVR,"{'edits_remaining': 5, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1768727139937828957'}]",2024-03-15 21:41:48+00:00,['1768754497738678324'],in,1.7687544977386783e+18,"{'urls': [{'start': 9, 'end': 32, 'url': 'http...",False,Quote,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1743,"{'retweet_count': 16, 'reply_count': 0, 'like_...",,everyone,1.1863674835461652e+18,,1753779932742799616,RT @Roman_Korona: ⏰Zapraszam na doroczny Zlot ...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1753770538424504...",2024-02-03 13:58:14+00:00,['1753779932742799591'],pl,1.7537799327427996e+18,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",False,Retweet,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1744,"{'retweet_count': 16, 'reply_count': 0, 'like_...",,everyone,1.1863674835461652e+18,,1753779932742799616,RT @Roman_Korona: ⏰Zapraszam na doroczny Zlot ...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1753770538424504...",2024-02-03 13:58:14+00:00,['1753779932742799591'],pl,1.7537799327427996e+18,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",False,Retweet,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1862,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,1.5544839915117036e+18,,1762109123800293376,"Nie ma takiej obietnicy, której polityk nie ob...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-02-26 13:35:28+00:00,['1762109123800293457'],pl,1.7621091238002934e+18,"{'urls': [{'start': 80, 'end': 103, 'url': 'ht...",False,Original,{'media_keys': ['3_1762109117865304064']},,Iwaszkiewicz_RJ_2023-10-16_2024-10-15.json,Konfederacja
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52744,"{'retweet_count': 6, 'reply_count': 0, 'like_c...",,everyone,961181894.0,,1707867087899705600,RT @JKaminska02: Niezależnie od poglądów warto...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1707772421258813...",2023-09-29 21:16:59+00:00,['1707867087899705655'],pl,1.7078670878997056e+18,"{'mentions': [{'start': 3, 'end': 15, 'usernam...",0.0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52783,"{'retweet_count': 41, 'reply_count': 0, 'like_...",,everyone,961181894.0,,1701273238263742720,RT @KosiniakKamysz: Polska wymiera! Miesięczni...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1701264896711709...",2023-09-11 16:35:22+00:00,['1701273238263742680'],pl,1.7012732382637427e+18,"{'mentions': [{'start': 3, 'end': 18, 'usernam...",0.0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52784,"{'retweet_count': 41, 'reply_count': 0, 'like_...",,everyone,961181894.0,,1701273238263742720,RT @KosiniakKamysz: Polska wymiera! Miesięczni...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1701264896711709...",2023-09-11 16:35:22+00:00,['1701273238263742680'],pl,1.7012732382637427e+18,"{'mentions': [{'start': 3, 'end': 18, 'usernam...",0.0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52785,"{'retweet_count': 12, 'reply_count': 0, 'like_...",,everyone,961181894.0,,1697128952131661824,RT @nowePSL: 💬 Referendum to gra polityczna @p...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1697126436279128...",2023-08-31 06:07:27+00:00,['1697128952131661845'],pl,1.6971289521316618e+18,"{'mentions': [{'start': 3, 'end': 11, 'usernam...",0.0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
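
The three id-based counts printed above measure different things: `nunique()` counts distinct ids, `duplicated()` counts the repeats after each first occurrence, and `duplicated(keep=False)` counts every row that belongs to a repeated id. That is why 52543 + 244 equals the 52787 rows, while 473 is larger than 244. A small sketch on toy ids (the values are made up for illustration):

```python
import pandas as pd

# Toy ids: 101 appears twice, 202 once, 303 three times.
ids = pd.Series([101, 101, 202, 303, 303, 303], name="id")

n_unique = ids.nunique()                         # distinct ids
n_extra = ids.duplicated().sum()                 # repeats beyond the first occurrence
n_in_groups = ids.duplicated(keep=False).sum()   # all rows of any repeated id

print(n_unique, n_extra, n_in_groups)  # prints: 3 3 5

# Distinct ids plus extra copies always add up to the total row count.
assert n_unique + n_extra == len(ids)
```

The number of rows to drop when deduplicating on `id` is therefore the `duplicated()` count, not the `keep=False` count.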


A brief look at what these duplicates look like:

In [13]:
df[df['id'].duplicated(keep=False)].sort_values(by='id')


Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
34539,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@magosia_10_19 🤛🥰,,,2023-10-16 23:17:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
48817,"{'retweet_count': 2, 'reply_count': 1, 'like_c...",,everyone,1119834276,,0,Dziękuję 🤝😀🍀,,"[{'type': 'quoted', 'id': '1714320722481615223'}]",2024-10-17 18:49:00+00:00,,pl,,,,Quote,,,GrzybAndrzej_2023-10-16_2024-10-15.json,PSL
48818,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@Maciej_ENZ0 Dziękuję i pozdrawiam,,,2023-10-18 12:46:00+00:00,,pl,,,,Reply,,,GrzybAndrzej_2023-10-16_2024-10-15.json,PSL
34538,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,"@RGrupinski @StGawlowski i ode mnie Rafale, dl...",,,2023-10-16 23:15:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34528,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",,everyone,61552404,,0,@tomekbit ✌️,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48525,"{'retweet_count': 15, 'reply_count': 2, 'like_...",,everyone,964017524,"[{'domain': {'id': '11', 'name': 'Sport', 'des...",1834508231009325568,W ramach roboczego kontaktu z @WodyPolskie ora...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...",,2024-09-13 08:23:40+00:00,[1834508231009325542],pl,1834508231009325568,"{'mentions': [{'start': 30, 'end': 42, 'userna...",0,Original,,,DariuszKlimczak_2023-10-16_2024-10-15.json,PSL
32828,"{'retweet_count': 568, 'reply_count': 221, 'li...",,everyone,52367150,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1836654233296244992,Premier @donaldtusk : namierzono człowieka prz...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-09-19 06:31:07+00:00,[1836654233296244914],pl,1836654233296244992,"{'mentions': [{'start': 8, 'end': 19, 'usernam...",False,Original,,,CTomczyk_2023-10-16_2024-10-15_GUWNOMAMYZNIMPR...,PO
32829,"{'retweet_count': 567, 'reply_count': 221, 'li...",,everyone,52367150,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1836654233296244992,Premier @donaldtusk : namierzono człowieka prz...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-09-19 06:31:07+00:00,[1836654233296244914],pl,1836654233296244992,"{'mentions': [{'start': 8, 'end': 19, 'usernam...",False,Original,,,CTomczyk_2023-10-16_2024-10-15_GUWNOMAMYZNIMPR...,PO
27208,"{'retweet_count': 141, 'reply_count': 125, 'li...",,everyone,138048156,,1838584923688444416,Potwierdza się to o czym mówiliśmy już od dawn...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1838307356918071625'}]",2024-09-24 14:22:59+00:00,[1838584923688444342],pl,1838584923688444416,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Quote,,,mblaszczak_2023-10-16_2024-10-15 (1).json,PIS


In [14]:
for col in df.columns:
    if df[col].apply(lambda x: isinstance(x, dict)).any():
        print(f"Column '{col}' contains dictionaries.")
    elif df[col].apply(lambda x: isinstance(x, list)).any():
        print(f"Column '{col}' contains lists.")

Column 'public_metrics' contains dictionaries.
Column 'context_annotations' contains lists.
Column 'edit_controls' contains dictionaries.
Column 'referenced_tweets' contains lists.
Column 'edit_history_tweet_ids' contains lists.
Column 'entities' contains dictionaries.
Column 'attachments' contains dictionaries.
Column 'geo' contains dictionaries.
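
Columns that hold dicts or lists cannot be hashed, which is why the earlier duplicate check converted everything with `astype(str)`. One caveat is that `str()` depends on key order, so two dicts with the same content can produce different strings. The sketch below shows a possible alternative using `json.dumps` with `sort_keys=True`. The helper name `canonical` is an assumption for this example and is not used elsewhere in the notebook.

```python
import json


def canonical(value):
    """Serialize nested dicts/lists to a key-order-independent string.

    Values that JSON cannot encode (e.g. timestamps) fall back to str().
    """
    return json.dumps(value, sort_keys=True, default=str)


# Same content, different key order.
a = {"retweet_count": 3, "reply_count": 1}
b = {"reply_count": 1, "retweet_count": 3}

print(str(a) == str(b))              # False: str() preserves insertion order
print(canonical(a) == canonical(b))  # True: keys are sorted before serializing
```

Applied to a DataFrame, each nested column could be mapped through `canonical` before calling `duplicated()` across all columns.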


In [15]:
# Get all duplicate IDs
duplicate_ids = df[df['id'].duplicated(keep=False)]

# Exclude columns with unhashable (dict-like) values
columns_to_exclude = ['edit_controls', 'public_metrics', 'attachments', 'entities', 'geo', 'edit_history_tweet_ids', 'context_annotations','referenced_tweets']
valid_columns = [col for col in df.columns if col not in columns_to_exclude]

# Find differences across valid columns
diff_summary = duplicate_ids[valid_columns].groupby('id').nunique()

# Show columns where duplicates have different values
diff_summary = diff_summary[(diff_summary > 1).any(axis=1)]

In [16]:
diff_summary

Unnamed: 0_level_0,in_reply_to_user_id,reply_settings,author_id,text,created_at,lang,conversation_id,possibly_sensitive,category,username,party
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,0,1,2,16,9,1,0,0,2,2,2
1780108572161945856,0,1,1,1,1,1,1,1,1,2,1
1780146130551996672,1,1,1,1,1,1,1,1,1,2,1
1780171152914034944,1,1,1,1,1,1,1,1,1,2,1
1780244011212485120,0,1,1,1,1,1,1,1,1,2,1
1780309557610258688,0,1,1,1,1,1,1,1,1,2,1
1780345695163047936,1,1,1,1,1,1,1,1,1,2,1
1780345829615636480,1,1,1,1,1,1,1,1,1,2,1
1780346025854603264,1,1,1,1,1,1,1,1,1,2,1
1780346371888833024,1,1,1,1,1,1,1,1,1,2,1


In [17]:
duplicates = df[df.duplicated(subset=['id'], keep=False)]
duplicates

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
962,"{'retweet_count': 15, 'reply_count': 28, 'like...",,everyone,1182211615,,1714195119706890496,Serdeczne dzięki za każdy głos. 🤝\nDla mnie to...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-17 08:22:19+00:00,[1714195119706890463],pl,1714195119706890496,"{'hashtags': [{'start': 251, 'end': 264, 'tag'...",False,Original,{'media_keys': ['3_1714195114472431617']},{'place_id': '535f0c2de0121451'},bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
963,"{'retweet_count': 15, 'reply_count': 28, 'like...",,everyone,1182211615,,1714195119706890496,Serdeczne dzięki za każdy głos. 🤝\nDla mnie to...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-17 08:22:19+00:00,[1714195119706890463],pl,1714195119706890496,"{'urls': [{'start': 275, 'end': 298, 'url': 'h...",False,Original,{'media_keys': ['3_1714195114472431617']},{'place_id': '535f0c2de0121451'},bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
1660,"{'retweet_count': 38, 'reply_count': 23, 'like...",,everyone,1186367483546165248,,1768754497738678272,Skandal👇 https://t.co/XBJRvv3OVR,"{'edits_remaining': 5, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1768727139937828957'}]",2024-03-15 21:41:48+00:00,[1768754497738678324],in,1768754497738678272,"{'urls': [{'start': 9, 'end': 32, 'url': 'http...",False,Quote,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1661,"{'retweet_count': 38, 'reply_count': 23, 'like...",,everyone,1186367483546165248,,1768754497738678272,Skandal👇 https://t.co/XBJRvv3OVR,"{'edits_remaining': 5, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1768727139937828957'}]",2024-03-15 21:41:48+00:00,[1768754497738678324],in,1768754497738678272,"{'urls': [{'start': 9, 'end': 32, 'url': 'http...",False,Quote,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
1743,"{'retweet_count': 16, 'reply_count': 0, 'like_...",,everyone,1186367483546165248,,1753779932742799616,RT @Roman_Korona: ⏰Zapraszam na doroczny Zlot ...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1753770538424504...",2024-02-03 13:58:14+00:00,[1753779932742799591],pl,1753779932742799616,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",False,Retweet,,,GrzegorzBraun__2023-10-16_2024-10-15.json,Konfederacja
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52744,"{'retweet_count': 6, 'reply_count': 0, 'like_c...",,everyone,961181894,,1707867087899705600,RT @JKaminska02: Niezależnie od poglądów warto...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1707772421258813...",2023-09-29 21:16:59+00:00,[1707867087899705655],pl,1707867087899705600,"{'mentions': [{'start': 3, 'end': 15, 'usernam...",0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52783,"{'retweet_count': 41, 'reply_count': 0, 'like_...",,everyone,961181894,,1701273238263742720,RT @KosiniakKamysz: Polska wymiera! Miesięczni...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1701264896711709...",2023-09-11 16:35:22+00:00,[1701273238263742680],pl,1701273238263742720,"{'mentions': [{'start': 3, 'end': 18, 'usernam...",0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52784,"{'retweet_count': 41, 'reply_count': 0, 'like_...",,everyone,961181894,,1701273238263742720,RT @KosiniakKamysz: Polska wymiera! Miesięczni...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1701264896711709...",2023-09-11 16:35:22+00:00,[1701273238263742680],pl,1701273238263742720,"{'mentions': [{'start': 3, 'end': 18, 'usernam...",0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52785,"{'retweet_count': 12, 'reply_count': 0, 'like_...",,everyone,961181894,,1697128952131661824,RT @nowePSL: 💬 Referendum to gra polityczna @p...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1697126436279128...",2023-08-31 06:07:27+00:00,[1697128952131661845],pl,1697128952131661824,"{'mentions': [{'start': 3, 'end': 11, 'usernam...",0,Retweet,,,StruzikAdam_2022-10-16_2023-10-15.json,PSL


In [18]:
duplicate_text_count = df['text'].duplicated().sum()
print(f"Number of duplicate Text Entries: {duplicate_text_count}")

Number of duplicate Text Entries: 320


In [19]:
duplicate_id_text_rows = df[df.duplicated(subset=['id', 'text'], keep=False)]
print(f"Rows where BOTH `id` and `text` are duplicated: {len(duplicate_id_text_rows)}")

Rows where BOTH `id` and `text` are duplicated: 457


In [20]:
# Count occurrences of each ID
id_counts = df['id'].value_counts()
print("Distribution of duplicate IDs:")
print(id_counts.value_counts().sort_index())

# Count occurrences of each text
text_counts = df['text'].value_counts()
print("\nDistribution of duplicate Text Entries:")
print(text_counts.value_counts().sort_index())

Distribution of duplicate IDs:
count
1     52314
2       227
3         1
16        1
Name: count, dtype: int64

Distribution of duplicate Text Entries:
count
1    52164
2      291
3        9
4        2
6        1
Name: count, dtype: int64


In [21]:
# Get all duplicate ID rows
duplicate_id_rows = df[df.duplicated(subset=['id'], keep=False)]

# Get all duplicate Text rows
duplicate_text_rows = df[df.duplicated(subset=['text'], keep=False)]

# Get rows where both ID and Text are duplicated
duplicate_id_text_rows = df[df.duplicated(subset=['id', 'text'], keep=False)]

# Compare overlaps
print(f"Rows where ID is duplicated: {len(duplicate_id_rows)}")
print(f"Rows where Text is duplicated: {len(duplicate_text_rows)}")
print(f"Rows where BOTH ID and Text are duplicated: {len(duplicate_id_text_rows)}")

# Find duplicate IDs that are NOT in the text duplicate set
id_not_in_text = duplicate_id_rows[~duplicate_id_rows['id'].isin(duplicate_text_rows['id'])]
print(f"\nDuplicate IDs NOT duplicated in Text: {len(id_not_in_text)}")

# Find duplicate Texts that are NOT in the ID duplicate set
text_not_in_id = duplicate_text_rows[~duplicate_text_rows['text'].isin(duplicate_id_rows['text'])]
print(f"Duplicate Texts NOT duplicated in ID: {len(text_not_in_id)}")


Rows where ID is duplicated: 473
Rows where Text is duplicated: 623
Rows where BOTH ID and Text are duplicated: 457

Duplicate IDs NOT duplicated in Text: 16
Duplicate Texts NOT duplicated in ID: 165


In [22]:
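# After the fillna(0) applied to 'id' earlier, no NaN ids remain, so this count is expected to be 0
empty_id_rows = df[df['id'].isna()]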
print(f"Rows where `id` is empty (NaN): {len(empty_id_rows)}")
#empty_id_rows

zero_id_rows = df[df['id'] == 0]
print(f"Rows where `id` is 0: {len(zero_id_rows)}")
zero_id_rows

Rows where `id` is empty (NaN): 0
Rows where `id` is 0: 16


Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
34528,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",,everyone,61552404,,0,@tomekbit ✌️,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34529,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",,everyone,61552404,,0,"@MaciejGdynia Maćku, czekam na oficjalne wynik...",,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34530,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",,everyone,61552404,,0,"@MCichonAlicja Alu, czekamy jeszcze na wynik?",,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34531,"{'retweet_count': 1, 'reply_count': 1, 'like_c...",,everyone,61552404,,0,@REL_76 🥰🥰🥰,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34532,"{'retweet_count': 1, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@Gidziela 🥰✌️,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34533,"{'retweet_count': 0, 'reply_count': 1, 'like_c...",,everyone,61552404,,0,@WHaptar Gratulacje👏🥂,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34534,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@KapenGenezyp Dziękuję❤️❤️❤️,,,2023-10-16 00:00:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34535,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@jasinska_e ❤️,,,2023-10-16 21:57:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34536,"{'retweet_count': 18, 'reply_count': 31, 'like...",,everyone,61552404,,0,@BMikolajewska odpowie💪,,,2023-10-16 22:27:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO
34537,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,everyone,61552404,,0,@DorotaNiedziela ja Tobie też❣️,,,2023-10-16 22:41:00+00:00,,pl,,,,Reply,,,Leszczyna_2023-10-16_2023-12-31.json,PO


In [23]:
tweets_by_author = df[df['author_id'] == 61552404.0].sort_values(by='created_at')
display(tweets_by_author)

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
39211,"{'retweet_count': 482, 'reply_count': 0, 'like...",,everyone,61552404,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1700530175337738496,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1700489605026340...",2023-09-09 15:22:42+00:00,[1700530175337738371],pl,1700530175337738496,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",False,Retweet,,,Leszczyna_2022-10-16_2023-10-15.json,PO
39212,"{'retweet_count': 482, 'reply_count': 0, 'like...",,everyone,61552404,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1700530175337738496,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1700489605026340...",2023-09-09 15:22:42+00:00,[1700530175337738371],pl,1700530175337738496,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",False,Retweet,,,Leszczyna_2022-10-16_2023-10-15.json,PO
39210,"{'retweet_count': 376, 'reply_count': 0, 'like...",,everyone,61552404,,1700530281784971520,"RT @MariaLe85219860: B U M‼️\n\nDuda, Ziobro, ...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1700475689600696...",2023-09-09 15:23:08+00:00,[1700530281784971545],pl,1700530281784971520,"{'hashtags': [{'start': 114, 'end': 127, 'tag'...",False,Retweet,,,Leszczyna_2022-10-16_2023-10-15.json,PO
39209,"{'retweet_count': 144, 'reply_count': 0, 'like...",,everyone,61552404,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1700530331500085760,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1700482994564186...",2023-09-09 15:23:20+00:00,[1700530331500085744],pl,1700530331500085760,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",False,Retweet,,,Leszczyna_2022-10-16_2023-10-15.json,PO
39208,"{'retweet_count': 167, 'reply_count': 0, 'like...",,everyone,61552404,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1700530402610364672,RT @Platforma_org: 💬 Przewodniczący @donaldtus...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'retweeted', 'id': '1700463357575221...",2023-09-09 15:23:36+00:00,[1700530402610364779],pl,1700530402610364672,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",False,Retweet,,,Leszczyna_2022-10-16_2023-10-15.json,PO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34658,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",713836895495696384,everyone,61552404,,1842513925612372224,"@arekpisarski @MZ_GOV_PL @NFZ_GOV_PL tak, @Rze...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184214897941436...",2024-10-05 10:35:26+00:00,[1842513925612372262],pl,1842146757360169472,"{'urls': [{'start': 85, 'end': 108, 'url': 'ht...",False,Reply,,,Leszczyna_2024-04-01_2024-10-15.json,PO
34657,"{'retweet_count': 0, 'reply_count': 2, 'like_c...",1439610920175558656,everyone,61552404,,1842515633826627840,"@ewa_esse @MZ_GOV_PL od 15. października, ale ...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184216083786941...",2024-10-05 10:42:14+00:00,[1842515633826627766],pl,1842146757360169472,"{'urls': [{'start': 303, 'end': 326, 'url': 'h...",False,Reply,,,Leszczyna_2024-04-01_2024-10-15.json,PO
34656,"{'retweet_count': 47, 'reply_count': 116, 'lik...",,everyone,61552404,,1844253674584916224,Ustawa o wychowaniu w trzeźwości trafi dopiero...,"{'edits_remaining': 4, 'is_edit_eligible': Tru...","[{'type': 'quoted', 'id': '1844091771900526777'}]",2024-10-10 05:48:35+00:00,"[1844253222103380157, 1844253674584916327]",pl,1844253674584916224,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Quote,,,Leszczyna_2024-04-01_2024-10-15.json,PO
34655,"{'retweet_count': 0, 'reply_count': 5, 'like_c...",1217929501063155712,everyone,61552404,,1844411831844036864,@Mariusz32382943 nad zmianami systemowymi prac...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184427203001280...",2024-10-10 16:17:02+00:00,[1844411831844036750],pl,1844253674584916224,"{'urls': [{'start': 296, 'end': 319, 'url': 'h...",False,Reply,,,Leszczyna_2024-04-01_2024-10-15.json,PO


A tweet ID of 0 may mean that the tweet was edited or deleted, or that the record is not a tweet at all and was wrongly categorized as one.

We need to remove duplicate tweets and the rows whose ID is 0. Duplicates arise because our custom download loop occasionally fetches the same tweet twice to ensure completeness.

In [24]:
import pandas as pd
import numpy as np

# 1) Copy the original DataFrame before cleaning
df_before = df.copy()

# 2) Get the initial size
initial_size = len(df_before)
print(f"Initial number of tweets: {initial_size}")

# 3) Check and report issues with the 'id' column
print("\n--- ID QUALITY CHECK ---")

# Flag missing IDs before the string cast (afterwards NaN would become the text 'nan')
missing_ids = df_before['id'].isna()

# Convert id to string for consistent checking
df_before['id'] = df_before['id'].astype(str)

# Check for various problems
empty_ids = df_before['id'] == ''
zero_ids = df_before['id'] == '0'
very_short_ids = df_before['id'].str.len() < 5  # Twitter IDs are typically longer

# Report on ID issues
print(f"Missing IDs (NaN): {missing_ids.sum()} ({missing_ids.mean():.2%})")
print(f"Empty IDs: {empty_ids.sum()} ({empty_ids.mean():.2%})")
print(f"Zero IDs ('0'): {zero_ids.sum()} ({zero_ids.mean():.2%})")
print(f"Very short IDs (< 5 chars): {very_short_ids.sum()} ({very_short_ids.mean():.2%})")

# Create a mask for all problematic IDs
problematic_ids_mask = missing_ids | empty_ids | zero_ids | very_short_ids

# Report total problematic IDs
print(f"Total problematic IDs: {problematic_ids_mask.sum()} ({problematic_ids_mask.mean():.2%})")

# 4) First filter out problematic IDs from the original dataset
df_no_problems = df_before[~problematic_ids_mask].copy()
problematic_removed = initial_size - len(df_no_problems)

# 5) Then remove duplicates from the dataset without problematic IDs
df_after = df_no_problems.drop_duplicates(subset=['id'])
duplicates_removed = len(df_no_problems) - len(df_after)

# 6) Calculate removed counts and percentages
remaining_final = len(df_after)
total_removed = initial_size - remaining_final

duplicate_percentage = (duplicates_removed / initial_size) * 100
problematic_percentage = (problematic_removed / initial_size) * 100
total_removed_percentage = (total_removed / initial_size) * 100
remaining_percentage = (remaining_final / initial_size) * 100

# 7) Print comprehensive results
print("\n--- CLEANING SUMMARY ---")
print(f"Initial tweets: {initial_size}")
print(f"Problematic ID tweets removed: {problematic_removed} ({problematic_percentage:.2f}%)")
print(f"Duplicate tweets removed: {duplicates_removed} ({duplicate_percentage:.2f}%)")
print(f"Total tweets removed: {total_removed} ({total_removed_percentage:.2f}%)")
print(f"Tweets remaining: {remaining_final} ({remaining_percentage:.2f}%)")

# 8) Show sample of problematic IDs
if problematic_ids_mask.sum() > 0:
    print("\nSample of problematic IDs:")
    sample_problematic = df_before[problematic_ids_mask].head(5)
    for i, (idx, row) in enumerate(sample_problematic.iterrows()):
        print(f"  {i+1}. ID: '{row['id']}', Text: '{row['text'][:50]}...'")

# 9) Identify the actual duplicate IDs from the data without problematic IDs
duplicate_ids = df_no_problems[df_no_problems.duplicated(subset=['id'], keep='first')]['id'].unique().tolist()
print(f"\nNumber of unique duplicate IDs: {len(duplicate_ids)}")
if duplicate_ids:
    print("Sample of duplicate IDs (first 5):")
    for i, dup_id in enumerate(duplicate_ids[:5]):
        print(f"  {i+1}. {dup_id}")
else:
    print("No duplicates found")

# 10) Keep df_after as the new df
df = df_after
print(f"\nFinal clean dataframe shape: {df.shape}")

# 11) Verify no problematic IDs remain
if (df['id'] == '0').sum() > 0 or df['id'].isna().sum() > 0 or (df['id'] == '').sum() > 0 or (df['id'].str.len() < 5).sum() > 0:
    print("WARNING: Some problematic IDs still remain in the cleaned dataframe")
else:
    print("SUCCESS: All problematic IDs have been removed")

Initial number of tweets: 52787

--- ID QUALITY CHECK ---
Missing IDs (NaN): 0 (0.00%)
Empty IDs: 0 (0.00%)
Zero IDs ('0'): 16 (0.03%)
Very short IDs (< 5 chars): 16 (0.03%)
Total problematic IDs: 16 (0.03%)

--- CLEANING SUMMARY ---
Initial tweets: 52787
Problematic ID tweets removed: 16 (0.03%)
Duplicate tweets removed: 229 (0.43%)
Total tweets removed: 245 (0.46%)
Tweets remaining: 52542 (99.54%)

Sample of problematic IDs:
  1. ID: '0', Text: '@tomekbit ✌️...'
  2. ID: '0', Text: '@MaciejGdynia Maćku, czekam na oficjalne wyniki, ż...'
  3. ID: '0', Text: '@MCichonAlicja Alu, czekamy jeszcze na wynik?...'
  4. ID: '0', Text: '@REL_76 🥰🥰🥰...'
  5. ID: '0', Text: '@Gidziela 🥰✌️...'

Number of unique duplicate IDs: 228
Sample of duplicate IDs (first 5):
  1. 1714195119706890496
  2. 1768754497738678272
  3. 1753779932742799616
  4. 1736748164722385408
  5. 1762109123800293376

Final clean dataframe shape: (52542, 20)
SUCCESS: All problematic IDs have been removed
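
The two-step filter above (drop placeholder IDs, then drop repeated IDs) could be condensed into a small helper. This is only a minimal sketch on a toy frame: the function name `drop_bad_and_duplicate_ids` and the `min_len` parameter are hypothetical and do not appear in the notebook, and the toy rows are made up for illustration.

```python
import pandas as pd

def drop_bad_and_duplicate_ids(frame, id_col="id", min_len=5):
    """Return a copy without missing/placeholder IDs and without repeated IDs."""
    missing = frame[id_col].isna()             # check NaN before casting to str
    ids = frame[id_col].astype(str)
    bad = missing | (ids.str.len() < min_len)  # catches '', '0' and other short IDs
    out = frame.loc[~bad].copy()
    out[id_col] = out[id_col].astype(str)
    return out.drop_duplicates(subset=[id_col], keep="first")

# Toy data (hypothetical): one placeholder ID, one missing ID, one repeated ID
toy = pd.DataFrame({
    "id": ["1714195119706890496", "0", "1714195119706890496", None, "1768754497738678272"],
    "text": ["a", "b", "a", "c", "d"],
})
clean = drop_bad_and_duplicate_ids(toy)
print(len(toy), "->", len(clean))
```

Keeping the missing-value check ahead of the string cast matters because a NaN cast to a string is no longer detected by `isna()`.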


In [25]:
# 1) How many total rows have a duplicate 'id' (including the first occurrence)?
total_dup_rows = df.duplicated(subset=['id'], keep=False).sum()
print(f"Total rows that share a duplicate ID (including the first occurrence): {total_dup_rows}")

# 2) How many rows are "extra" duplicates beyond the first?
extra_dup_rows = df.duplicated(subset=['id'], keep='first').sum()
print(f"Number of extra duplicates beyond the first occurrence: {extra_dup_rows}")

# 3) How many unique IDs appear more than once?
duplicate_ids = df[df.duplicated(subset=['id'], keep=False)]['id'].unique()
num_duplicate_ids = len(duplicate_ids)
print(f"Number of unique IDs that are duplicated: {num_duplicate_ids}")

Total rows that share a duplicate ID (including the first occurrence): 0
Number of extra duplicates beyond the first occurrence: 0
Number of unique IDs that are duplicated: 0


In [26]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1846091776269963776,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1846222583898784000,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1846161400328028160,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1846091824101769472,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1846075343188144128,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja


In [27]:
# Get the value counts of the 'category' column
category_counts = df['category'].value_counts()

# Display the counts
print(category_counts)

# Get the number of unique categories
unique_category_count = category_counts.count()
print(f"Number of unique categories: {unique_category_count}")

category
Original    32794
Reply       10790
Quote        5478
Retweet      3480
Name: count, dtype: int64
Number of unique categories: 4


We remove retweets because the X API does not return them correctly. The analysis covers only original tweets, replies, and quotes.

In [28]:
df = df[df['category'] != 'Retweet']

In [29]:
df

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1846091776269963776,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1846222583898784000,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1846161400328028160,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1846091824101769472,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1846075343188144128,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52753,"{'retweet_count': 9, 'reply_count': 0, 'like_c...",,everyone,961181894,,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-29 11:30:44+00:00,[1707719554355380484],pl,1707719554355380480,"{'mentions': [{'start': 132, 'end': 142, 'user...",0,Original,{'media_keys': ['3_1707719240550166528']},,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52774,"{'retweet_count': 9, 'reply_count': 0, 'like_c...",,everyone,961181894,,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-19 13:08:40+00:00,[1704120323023454339],pl,1704120323023454464,"{'mentions': [{'start': 20, 'end': 30, 'userna...",0,Original,{'media_keys': ['3_1704118556785254400']},,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52777,"{'retweet_count': 8, 'reply_count': 0, 'like_c...",,everyone,961181894,,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-15 12:59:29+00:00,[1702668459576787064],pl,1702668459576786944,"{'mentions': [{'start': 210, 'end': 220, 'user...",0,Original,{'media_keys': ['3_1702668263241498624']},,StruzikAdam_2022-10-16_2023-10-15.json,PSL
52778,"{'retweet_count': 8, 'reply_count': 0, 'like_c...",,everyone,961181894,,1701960909369868544,To jedna z największych inwestycji drogowych @...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-13 14:07:56+00:00,[1701960909369868437],pl,1701960909369868544,"{'mentions': [{'start': 45, 'end': 55, 'userna...",0,Original,"{'media_keys': ['3_1701960742109429762', '3_17...",,StruzikAdam_2022-10-16_2023-10-15.json,PSL


In [30]:
# Keep only the account handle in 'username': drop everything from '_2' onward (the date-range suffix of the file name)

#df['username'] = df['username'].str.split('_2').str[0].copy()
df.loc[:, 'username'] = df['username'].str.split('_2').str[0]
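
Splitting on `'_2'` works as long as the date range starts with a year beginning in "2" and no handle itself contains `_2`. A more explicit alternative, sketched here under the assumption that every file name ends in `_YYYY-MM-DD_YYYY-MM-DD.json` (as the sample file names suggest), anchors on the date-range suffix instead. The helper name `handle_from_filename` and the third file name are hypothetical.

```python
import re

# Assumed suffix pattern: _YYYY-MM-DD_YYYY-MM-DD.json at the end of the file name
DATE_RANGE_SUFFIX = re.compile(r"_\d{4}-\d{2}-\d{2}_\d{4}-\d{2}-\d{2}\.json$")

def handle_from_filename(filename):
    """Strip the trailing date range and extension, keeping the account handle."""
    return DATE_RANGE_SUFFIX.sub("", filename)

examples = [
    "bartlomiejpejo_2023-10-16_2024-10-15.json",
    "GrzegorzBraun__2023-10-16_2024-10-15.json",
    "user_2x_2023-10-16_2024-10-15.json",   # hypothetical handle containing '_2'
]
for name in examples:
    print(name, "->", handle_from_filename(name), "| split('_2'):", name.split("_2")[0])
```

On a DataFrame column the same pattern could be applied with `df['username'].str.replace(DATE_RANGE_SUFFIX, '', regex=True)`.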

In [31]:
category_summary = df['category'].value_counts()
print(category_summary)
total_tweets = category_summary.sum()
print(f"Total tweets: {total_tweets}")

category
Original    32794
Reply       10790
Quote        5478
Name: count, dtype: int64
Total tweets: 49062


In [32]:
# Ensure the created_at column is in datetime format

#df['created_at'] = pd.to_datetime(df['created_at'])
df.loc[:, 'created_at'] = pd.to_datetime(df['created_at'])

In [33]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1846091776269963776,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1846222583898784000,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1846161400328028160,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1846091824101769472,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1846075343188144128,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo,Konfederacja


In [34]:
df.loc[1, 'text']

'Rok po wyborach trzeba powiedzieć jedno - nie na takie państwo Donald Tusk umawiał się z wyborcami! https://t.co/4Jh5Ni6sgr'

Emoji handling

In [35]:
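def add_space_around_emojis(text):
    # Pad every emoji (and every regional-indicator symbol used in flags) with a space on each side
    return ''.join(f' {char} ' if char in emoji.EMOJI_DATA or re.match(r'[\U0001F1E6-\U0001F1FF]', char) else char for char in text)

df['text'] = df['text'].apply(add_space_around_emojis)

def clean_text(text):
    # Collect @mentions, then remove them from the text
    mentions = re.findall(r'@\w+', text)
    text = re.sub(r'@\w+', '', text)
    # Collect links, then remove them from the text
    links = re.findall(r'http\S+', text)
    text = re.sub(r'http\S+', '', text)
    # Collect hashtags; they stay in the cleaned text
    hashtags = re.findall(r'#\w+', text)
    # Make sure emoticons (U+1F600 to U+1F64F) have a space before and after them
    text = re.sub(r'(?<!\s)([\U0001F600-\U0001F64F])', r' \1', text)
    text = re.sub(r'([\U0001F600-\U0001F64F])(?!\s)', r'\1 ', text)
    return [text, mentions, links, hashtags]

# Spread the four returned values over four new columns
df[['text_clean', 'mentions', 'links', 'hashtags']] = pd.DataFrame(df['text'].apply(clean_text).tolist(), index=df.index)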

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['text'] = df['text'].apply(add_space_around_emojis)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['text_clean', 'mentions', 'links', 'hashtags']] = pd.DataFrame(df['text'].apply(clean_text).tolist(), index=df.index)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['text_clean', 'mentions'
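
Because `add_space_around_emojis` pads character by character, the two regional-indicator code points that together form a flag end up separated by a space. If flags should stay intact, one option is to pad each pair as a unit. The following is a minimal sketch using only `re`; the helper name `pad_flags` is hypothetical, and it handles flags only, not emojis in general.

```python
import re

# A flag is encoded as two consecutive regional-indicator symbols (U+1F1E6 to U+1F1FF)
FLAG_PAIR = re.compile(r"[\U0001F1E6-\U0001F1FF]{2}")

def pad_flags(text):
    """Surround each flag with single spaces while keeping its two code points together."""
    padded = FLAG_PAIR.sub(lambda m: f" {m.group(0)} ", text)
    return re.sub(r" {2,}", " ", padded).strip()  # collapse the extra spaces

POLAND = "\U0001F1F5\U0001F1F1"  # regional indicators P + L
print(pad_flags(f"#Idę11{POLAND} https://t.co/KiCe5ATOpX"))
```

Such a step would have to run in place of the per-character flag check, not after it, since the per-character pass would split the pair again.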

In [36]:
import pandas as pd
pd.options.mode.chained_assignment = None  # Silence the SettingWithCopyWarning raised above when assigning to a filtered DataFrame
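
Switching the option off hides the warning everywhere in the session. An alternative, sketched below on a toy frame with made-up rows, is to take an explicit `.copy()` when filtering, so that later column assignments act on an independent DataFrame and the warning is not triggered in the first place.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the tweet DataFrame
toy = pd.DataFrame({
    "category": ["Original", "Retweet", "Reply"],
    "text": ["a", "RT b", "c"],
})

# .copy() makes the filtered frame independent of `toy`
kept = toy[toy["category"] != "Retweet"].copy()

# Adding a column now modifies `kept` only
kept["text_upper"] = kept["text"].str.upper()
print(kept)
```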

In [37]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,...,possibly_sensitive,category,attachments,geo,username,party,text_clean,mentions,links,hashtags
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,...,False,Reply,,,bartlomiejpejo,Konfederacja,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[]
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,...,False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo,Konfederacja,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[]
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,...,False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo,Konfederacja,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[]
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,...,False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo,Konfederacja,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[]
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,...,False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo,Konfederacja,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11]


In [38]:
df.drop(columns=['entities'], inplace=True)

In [39]:
# Extract the numerical metrics from 'public_metrics' into separate columns, then drop the original column
df['retweet_count'] = df['public_metrics'].apply(lambda x: x['retweet_count'])
df['reply_count'] = df['public_metrics'].apply(lambda x: x['reply_count'])
df['like_count'] = df['public_metrics'].apply(lambda x: x['like_count'])
df['quote_count'] = df['public_metrics'].apply(lambda x: x['quote_count'])
df['impression_count'] = df['public_metrics'].apply(lambda x: x['impression_count'])

df.drop(columns=['public_metrics'], inplace=True)
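
The five `apply` calls above could also be written as a single expansion of the dictionary column. This is a minimal sketch on a toy frame whose metric values are taken from the first two rows shown in this notebook; the `id` values are placeholders. Note that, unlike the cell above, this variant turns every key present in the dictionaries into a column, not only the five listed ones.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["1", "2"],  # placeholder IDs
    "public_metrics": [
        {"retweet_count": 3, "reply_count": 1, "like_count": 33,
         "quote_count": 0, "impression_count": 1555},
        {"retweet_count": 9, "reply_count": 2, "like_count": 72,
         "quote_count": 0, "impression_count": 3031},
    ],
})

# Build one column per dictionary key, aligned on the original index
metrics = pd.DataFrame(toy["public_metrics"].tolist(), index=toy.index)

# Replace the dictionary column with the expanded metric columns
expanded = toy.drop(columns=["public_metrics"]).join(metrics)
print(expanded.columns.tolist())
```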

In [40]:
df

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,party,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count
0,375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,Konfederacja,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Konfederacja,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,Konfederacja,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Konfederacja,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,Konfederacja,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52753,,everyone,961181894,,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-29 11:30:44+00:00,[1707719554355380484],...,PSL,"Studiujesz na kierunku lekarskim, pielęgniarst...",[@SejmikMaz],[https://t.co/6zats7JXbY],[],9,0,6,0,2154
52774,,everyone,961181894,,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-19 13:08:40+00:00,[1704120323023454339],...,PSL,Za nami posiedzenie . I kolejne wsparcie dla m...,[@SejmikMaz],[https://t.co/A7EG9Jzuv1],[#OSP],9,0,15,0,649
52777,,everyone,961181894,,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-15 12:59:29+00:00,[1702668459576787064],...,PSL,Płockie Centrum Onkologii gotowe! Już na począ...,[@SejmikMaz],[https://t.co/OALgj7gqxE],[],8,0,16,0,581
52778,,everyone,961181894,,1701960909369868544,To jedna z największych inwestycji drogowych @...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-13 14:07:56+00:00,[1701960909369868437],...,PSL,To jedna z największych inwestycji drogowych ...,[@SejmikMaz],[https://t.co/9jWRcVZHXk],[#634],8,0,13,0,621


In [41]:
df.dtypes

in_reply_to_user_id                   float64
reply_settings                         object
author_id                             float64
context_annotations                    object
id                                     object
text                                   object
edit_controls                          object
referenced_tweets                      object
created_at                datetime64[ns, UTC]
edit_history_tweet_ids                 object
lang                                   object
conversation_id                       float64
possibly_sensitive                     object
category                               object
attachments                            object
geo                                    object
username                               object
party                                  object
text_clean                             object
mentions                               object
links                                  object
hashtags                          

In [42]:
df.head()

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,party,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count
0,375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,Konfederacja,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Konfederacja,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,Konfederacja,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Konfederacja,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,Konfederacja,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634


In [43]:
df.dtypes

in_reply_to_user_id                   float64
reply_settings                         object
author_id                             float64
context_annotations                    object
id                                     object
text                                   object
edit_controls                          object
referenced_tweets                      object
created_at                datetime64[ns, UTC]
edit_history_tweet_ids                 object
lang                                   object
conversation_id                       float64
possibly_sensitive                     object
category                               object
attachments                            object
geo                                    object
username                               object
party                                  object
text_clean                             object
mentions                               object
links                                  object
hashtags                          

In [44]:
import pandas as pd

# Step 1: Check for duplicate columns and remove them
if df.columns.duplicated().any():
    print("Duplicate columns found! Removing them...")
    df_no_duplicates = df.loc[:, ~df.columns.duplicated()]  # Keep the first occurrence of each column
else:
    df_no_duplicates = df.copy()

# Step 2: Convert 'id' column to string (if needed)
df_no_duplicates['id'] = df_no_duplicates['id'].astype(str)

# Step 3: Check for missing or empty values in 'text' and 'text_clean'
empty_text = df_no_duplicates[df_no_duplicates['text'].isna() | (df_no_duplicates['text'].astype(str).str.strip() == '')]
empty_text_clean = df_no_duplicates[df_no_duplicates['text_clean'].isna() | (df_no_duplicates['text_clean'].astype(str).str.strip() == '')]

print(f"Rows where 'text' is empty or null: {empty_text.shape[0]}")
print(empty_text[['id', 'text', 'text_clean']].head())

print(f"\nRows where 'text_clean' is empty or null: {empty_text_clean.shape[0]}")
empty_text_clean[['id', 'text', 'text_clean']].head()


Rows where 'text' is empty or null: 0
Empty DataFrame
Columns: [id, text, text_clean]
Index: []

Rows where 'text_clean' is empty or null: 731


Unnamed: 0,id,text,text_clean
16,1844711276577964544,@Nowa_Nadzieja_ @KONFEDERACJA_,
92,1838621761090285568,@KONFEDERACJA_ @Nowa_Nadzieja_,
254,1821520992629305600,https://t.co/H9BQbYjylo,
595,1768213892272959744,@MPerspektywa @AdamAbramowicz1 https://t.co/bz...,
786,1733398115917402624,https://t.co/uXKjbwD1DQ,


In [45]:
# Boolean mask: True where 'text_clean' is non-empty after stripping whitespace
false_count = (df['text_clean'].str.strip().astype(bool))
print(f"Number of False values: {(~false_count).sum()}")

Number of False values: 731


In [46]:
# Delete empty tweets (keep only rows where the mask is True)
df = df[false_count]

In [47]:
len(df)

48331
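
The "null or empty after stripping" check is written out by hand in several cells of this notebook. A minimal sketch of a shared helper, assuming a hypothetical name `blank_mask` and a toy frame (neither is part of the original pipeline). Unlike `.str.strip().astype(bool)`, which treats a missing value as True, this also flags NaN/None as blank:

```python
import pandas as pd


def blank_mask(series: pd.Series) -> pd.Series:
    """Return True for values that are missing or empty after stripping whitespace."""
    return series.isna() | (series.astype(str).str.strip() == '')


# Toy data (hypothetical): one real text, one whitespace-only, one missing, one empty
demo = pd.DataFrame({'text_clean': ['Hello', '   ', None, '']})
mask = blank_mask(demo['text_clean'])
kept = demo[~mask]

print(f"Blank rows: {int(mask.sum())}, kept rows: {len(kept)}")
```

Filtering with `demo[~mask]` mirrors the `df = df[false_count]` step, with the polarity of the mask reversed.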

Saving the data used for translation

In [48]:
df_clean_text = df[['id', 'text', 'text_clean']]

df_clean_text.to_csv('data/02.processed/data_for_translation.csv', index=False)
df.to_csv('data/02.processed/whole_dataset_for_translation.csv', index=False)

In [49]:
len(df)

48331

In [50]:
df_clean_text.dtypes

id            object
text          object
text_clean    object
dtype: object

Reading the data used for translation

In [51]:
# Read CSV with ID column as string (text)
df_clean_text = pd.read_csv('data/02.processed/data_for_translation.csv', dtype={'id': str})

# Verify the column type
print("ID column type:", df_clean_text['id'].dtype)
print("Sample ID:", df_clean_text['id'].iloc[0], "of type", type(df_clean_text['id'].iloc[0]))

ID column type: object
Sample ID: 1846277256509116672 of type <class 'str'>
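
Reading `id` as text matters because tweet IDs are larger than 2^53, so they cannot survive a round trip through a 64-bit float (which is what spreadsheets and some parsers use for numbers). A small standard-library sketch of the effect, using an illustrative ID of the same magnitude as those in this dataset; the variable names are hypothetical:

```python
import csv
import io

# Illustrative one-row CSV (hypothetical content, same ID magnitude as the dataset)
csv_text = "id,text\n1846277256509116623,example tweet\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
id_as_text = rows[0]["id"]              # kept as a string, digit for digit
id_via_float = int(float(id_as_text))   # what a float64 round trip would produce

print("as text:  ", id_as_text)
print("via float:", id_via_float)
print("identical:", id_via_float == int(id_as_text))
```

Keeping the column as `str` from the first read onward avoids this kind of silent rounding when the file is later opened in Google Sheets or Excel.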


In [52]:
df_clean_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48331 entries, 0 to 48330
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          48331 non-null  object
 1   text        48331 non-null  object
 2   text_clean  48331 non-null  object
dtypes: object(3)
memory usage: 1.1+ MB


In [53]:
df_clean_text

Unnamed: 0,id,text,text_clean
0,1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","Niezrealizowanie większości ze ""100 konkretów..."
1,1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,Rok po wyborach trzeba powiedzieć jedno - nie ...
2,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","❌ Mamy rok po wyborach, a Polska pogrąża się ..."
3,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,Mija rok od wyborów parlamentarnych. W kampani...
4,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,#Idę11 🇵 🇱
...,...,...,...
48326,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","Studiujesz na kierunku lekarskim, pielęgniarst..."
48327,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,Za nami posiedzenie . I kolejne wsparcie dla m...
48328,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,Płockie Centrum Onkologii gotowe! Już na począ...
48329,1701960909369868544,To jedna z największych inwestycji drogowych @...,To jedna z największych inwestycji drogowych ...


In [54]:
# Filter rows where 'text_clean' is null OR empty (after stripping whitespace)
null_or_empty_text_clean = df_clean_text[
    df_clean_text['text_clean'].isna() | 
    (df_clean_text['text_clean'].astype(str).str.strip() == '')
]

# Display the number of problematic rows
print(f"Rows where 'text_clean' is null or empty: {null_or_empty_text_clean.shape[0]}")

# Show the affected rows
null_or_empty_text_clean[['id', 'text', 'text_clean']]

Rows where 'text_clean' is null or empty: 0


Unnamed: 0,id,text,text_clean


In [55]:
# 1. Print the total number of rows in df_clean_text
print("Total rows in df_clean_text:", len(df_clean_text))

# 2. Filter out rows where 'text_clean' is null or an empty string (after stripping whitespace)
valid_rows = df_clean_text[
    ~(
        df_clean_text['text_clean'].isna() 
        | (df_clean_text['text_clean'].astype(str).str.strip() == '')
    )
]

# 3. Print the number of those valid (non-empty) rows
print("Rows with non-empty 'text_clean':", len(valid_rows))

Total rows in df_clean_text: 48331
Rows with non-empty 'text_clean': 48331


Reading the translated dataset

In [79]:
df_en_text = pd.read_excel('data/02.processed/tweets_translation/translated_tweets.xlsx', dtype={'id': str})
df_en_text

Unnamed: 0,id,text,text_clean,text_clean_en
0,1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","Niezrealizowanie większości ze ""100 konkretów...","Failure to implement most of the ""100 specifi..."
1,1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,Rok po wyborach trzeba powiedzieć jedno - nie ...,"A year after the elections, one thing must be ..."
2,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","❌ Mamy rok po wyborach, a Polska pogrąża się ...","❌ We are a year after the elections, and Pola..."
3,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,Mija rok od wyborów parlamentarnych. W kampani...,A year has passed since the parliamentary elec...
4,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,#Idę11 🇵 🇱,#I'm going11 🇵 🇱
...,...,...,...,...
48326,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","Studiujesz na kierunku lekarskim, pielęgniarst...","Are you studying medicine, nursing or emergenc..."
48327,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,Za nami posiedzenie . I kolejne wsparcie dla m...,The meeting is over. And further support for t...
48328,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,Płockie Centrum Onkologii gotowe! Już na począ...,The Płock Oncology Center is ready! It will ac...
48329,1701960909369868544,To jedna z największych inwestycji drogowych @...,To jedna z największych inwestycji drogowych ...,This is one of the largest road investments \...


In [80]:
missing_translation_mask = df_en_text['text_clean_en'].isna() | (df_en_text['text_clean_en'].str.strip() == '')
df_en_text[missing_translation_mask]

Unnamed: 0,id,text,text_clean,text_clean_en


Merging the second version of the translated dataset with the original one

In [None]:
# Step 1: Make sure IDs are strings
df['id'] = df['id'].astype(str)
df_en_text['id'] = df_en_text['id'].astype(str)

# Step 2: Collect the translated rows in one frame (only one translation file is used here)
df_en_combined = pd.concat([df_en_text], ignore_index=True)

# Step 3: Drop duplicates by 'id' to keep only the latest translation
df_en_combined = df_en_combined.drop_duplicates(subset='id', keep='last')

# Step 4: Merge back into the full dataset to get a unified view
# (use the deduplicated frame; the default inner join drops tweets without a translation row)
df_merged = df.merge(df_en_combined[['id', 'text_clean_en']], on='id')

print(f"Total rows after merge: {len(df_merged)} ")


Total rows after merge: 48331 
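
The row count alone does not show whether the merge duplicated or dropped rows. A hedged sketch of a stricter variant on toy frames, assuming a left join plus pandas' `validate` and `indicator` options of `DataFrame.merge`; the frames `tweets` and `translations` are hypothetical stand-ins for `df` and `df_en_text`:

```python
import pandas as pd

# Toy stand-ins (hypothetical): tweet '2' was translated twice, tweet '3' not at all
tweets = pd.DataFrame({'id': ['1', '2', '3'], 'text_clean': ['a', 'b', 'c']})
translations = pd.DataFrame({
    'id': ['1', '2', '2'],
    'text_clean_en': ['A', 'B old', 'B new'],
})

# Keep only the latest translation per tweet
translations = translations.drop_duplicates(subset='id', keep='last')

# validate raises if 'id' is not unique on both sides;
# indicator adds a '_merge' column showing where each row came from
merged = tweets.merge(
    translations,
    on='id',
    how='left',
    validate='one_to_one',
    indicator=True,
)

unmatched = merged[merged['_merge'] == 'left_only']
print(f"Rows after merge: {len(merged)}, without translation: {len(unmatched)}")
```

With a left join the untranslated tweets stay visible in `unmatched` instead of disappearing, which makes the follow-up check for missing IDs unnecessary.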


In [85]:
# Find the IDs present in df but not in df_merged
missing_ids = df[~df['id'].isin(df_merged['id'])]['id']

# Display the missing rows
missing_rows = df[df['id'].isin(missing_ids)]
print(missing_rows)

Empty DataFrame
Columns: [in_reply_to_user_id, reply_settings, author_id, context_annotations, id, text, edit_controls, referenced_tweets, created_at, edit_history_tweet_ids, lang, conversation_id, possibly_sensitive, category, attachments, geo, username, party, text_clean, mentions, links, hashtags, retweet_count, reply_count, like_count, quote_count, impression_count]
Index: []

[0 rows x 27 columns]


In [87]:
df_merged

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
0,375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555,"Failure to implement most of the ""100 specifi..."
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031,"A year after the elections, one thing must be ..."
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636,"❌ We are a year after the elections, and Pola..."
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441,A year has passed since the parliamentary elec...
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634,#I'm going11 🇵 🇱
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48326,,everyone,961181894,,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-29 11:30:44+00:00,[1707719554355380484],...,"Studiujesz na kierunku lekarskim, pielęgniarst...",[@SejmikMaz],[https://t.co/6zats7JXbY],[],9,0,6,0,2154,"Are you studying medicine, nursing or emergenc..."
48327,,everyone,961181894,,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-19 13:08:40+00:00,[1704120323023454339],...,Za nami posiedzenie . I kolejne wsparcie dla m...,[@SejmikMaz],[https://t.co/A7EG9Jzuv1],[#OSP],9,0,15,0,649,The meeting is over. And further support for t...
48328,,everyone,961181894,,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-15 12:59:29+00:00,[1702668459576787064],...,Płockie Centrum Onkologii gotowe! Już na począ...,[@SejmikMaz],[https://t.co/OALgj7gqxE],[],8,0,16,0,581,The Płock Oncology Center is ready! It will ac...
48329,,everyone,961181894,,1701960909369868544,To jedna z największych inwestycji drogowych @...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-13 14:07:56+00:00,[1701960909369868437],...,To jedna z największych inwestycji drogowych ...,[@SejmikMaz],[https://t.co/9jWRcVZHXk],[#634],8,0,13,0,621,This is one of the largest road investments \...


Checking whether the merge worked correctly

In [88]:
df_merged[df_merged["id"]=="1807795860480160000"]

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
407,,everyone,1182211615,,1807795860480160000,🇳 🇱 Holenderski klub NAC Breda przygotował ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-07-01 15:18:15+00:00,[1807795860480159980],...,🇳 🇱 Holenderski klub NAC Breda przygotował ...,[],"[https://t.co/WI736ocWQ7, https://t.co/ANX3x4e...",[],5,1,24,0,603,🇳 🇱 The Dutch club NAC Breda has prepared spe...


In [89]:
# Check how many rows still have missing or empty translations
missing_translation_mask = df_merged['text_clean_en'].isna() | (df_merged['text_clean_en'].str.strip() == '')

# Show some of them
df_missing_translation = df_merged[missing_translation_mask]
print(f"Rows without translation: {df_missing_translation.shape[0]}")
display(df_missing_translation[['id', 'text', 'text_clean', 'text_clean_en']].head())


Rows without translation: 0


Unnamed: 0,id,text,text_clean,text_clean_en


Removing rows without a translation, because their text is not analyzed in our research

In [90]:
# Remove them
df_clean_translated = df_merged[~missing_translation_mask].copy()

print(f"Remaining rows with proper translation: {df_clean_translated.shape[0]}")

Remaining rows with proper translation: 48331


In [91]:
df_clean_translated.head()

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
0,375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555,"Failure to implement most of the ""100 specifi..."
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031,"A year after the elections, one thing must be ..."
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636,"❌ We are a year after the elections, and Pola..."
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441,A year has passed since the parliamentary elec...
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634,#I'm going11 🇵 🇱


In [92]:
df_clean_translated.to_csv('data/02.processed/df_clean_translated_further_analalysis.csv', index=False)

In [93]:
def count_emojis(text):
    """Count runs of consecutive emoji characters (adjacent emojis count as one match)."""
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # Emoticons
        "\U0001F300-\U0001F5FF"  # Symbols & pictographs
        "\U0001F680-\U0001F6FF"  # Transport & map symbols
        "\U0001F700-\U0001F77F"  # Alchemical symbols
        "\U0001F780-\U0001F7FF"  # Geometric shapes
        "\U0001F800-\U0001F8FF"  # Supplemental arrows
        "\U0001F900-\U0001F9FF"  # Supplemental symbols and pictographs
        "\U0001FA00-\U0001FA6F"  # Chess symbols
        "\U0001FA70-\U0001FAFF"  # Symbols and pictographs extended-A
        "\U00002702-\U000027B0"  # Dingbats
        "\U000024C2-\U0001F251"  # Enclosed characters (very broad range: also spans non-emoji code points such as CJK)
        "]+",
        flags=re.UNICODE,
    )
    return len(emoji_pattern.findall(text))
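
Because the pattern in `count_emojis` ends with `+`, `findall` returns runs of adjacent emoji, so two emoji typed back to back count once. A minimal sketch contrasting run counts with per-character counts, assuming a narrower, illustrative set of ranges; `EMOJI_CHAR`, `EMOJI_RUN`, `count_emoji_chars`, and `count_emoji_runs` are hypothetical names, not part of the notebook:

```python
import re

# Illustrative ranges only (an assumption, not a complete emoji definition)
EMOJI_CHAR = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (flag halves)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]"
)
# Same character class, but one match per run of adjacent emoji
EMOJI_RUN = re.compile(EMOJI_CHAR.pattern + "+")


def count_emoji_chars(text: str) -> int:
    """Count every emoji code point separately."""
    return len(EMOJI_CHAR.findall(text))


def count_emoji_runs(text: str) -> int:
    """Count runs of adjacent emoji, as a pattern ending in '+' does."""
    return len(EMOJI_RUN.findall(text))


sample = "Great \U0001F44D\U0001F44D news \u274C"
print(count_emoji_chars(sample), count_emoji_runs(sample))
```

Either count works for the "rows with at least one emoji" statistics, since both are zero exactly when the text has no match; they differ only when the counts themselves are analyzed.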


In [94]:
# Demojize text columns
df_clean_translated['text_clean_en_demojized'] = df_clean_translated['text_clean_en'].apply(
    lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x
)
df_clean_translated['text_clean_demojized'] = df_clean_translated['text_clean'].apply(
    lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x
)

# Count emojis in original text columns
df_clean_translated['emoji_count_en'] = df_clean_translated['text_clean_en'].apply(
    lambda x: count_emojis(str(x)) if pd.notnull(x) else 0
)
df_clean_translated['emoji_count'] = df_clean_translated['text_clean'].apply(
    lambda x: count_emojis(str(x)) if pd.notnull(x) else 0
)


In [95]:
# Total number of rows
total_rows = len(df_clean_translated)

# Rows with emojis in 'text_clean_en'
rows_with_emojis_en = df_clean_translated[df_clean_translated['emoji_count_en'] > 0].shape[0]

# Rows with emojis in 'text_clean'
rows_with_emojis = df_clean_translated[df_clean_translated['emoji_count'] > 0].shape[0]

# Display statistics
print(f"Total number of rows: {total_rows}")
print(f"Rows with emojis in 'text_clean_en': {rows_with_emojis_en} ({(rows_with_emojis_en/total_rows)*100:.2f}%)")
print(f"Rows with emojis in 'text_clean': {rows_with_emojis} ({(rows_with_emojis/total_rows)*100:.2f}%)")


Total number of rows: 48331
Rows with emojis in 'text_clean_en': 17994 (37.23%)
Rows with emojis in 'text_clean': 18289 (37.84%)


In [96]:
df_clean_translated[['text_clean_en', 'text_clean_en_demojized', 'emoji_count_en', 'text_clean', 'text_clean_demojized', 'emoji_count']].head()


Unnamed: 0,text_clean_en,text_clean_en_demojized,emoji_count_en,text_clean,text_clean_demojized,emoji_count
0,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi...",0,"Niezrealizowanie większości ze ""100 konkretów...","Niezrealizowanie większości ze ""100 konkretów...",0
1,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ...",0,Rok po wyborach trzeba powiedzieć jedno - nie ...,Rok po wyborach trzeba powiedzieć jedno - nie ...,0
2,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...,1,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",":cross_mark: Mamy rok po wyborach, a Polska p...",1
3,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...,0,Mija rok od wyborów parlamentarnych. W kampani...,Mija rok od wyborów parlamentarnych. W kampani...,0
4,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱,2,#Idę11 🇵 🇱,#Idę11 🇵 🇱,2


In [97]:
# Filter rows
rows_with_emojis_in_text_clean_only = df_clean_translated[
    (df_clean_translated['emoji_count'] > 0) & (df_clean_translated['emoji_count_en'] == 0)
]

# Display the number of such rows
print(f"Number of rows with emojis in 'text_clean' but not in 'text_clean_en': {len(rows_with_emojis_in_text_clean_only)}")

# Display the affected rows
rows_with_emojis_in_text_clean_only[['text_clean', 'text_clean_en']]


Number of rows with emojis in 'text_clean' but not in 'text_clean_en': 295


Unnamed: 0,text_clean,text_clean_en
453,"Mamy Państwo z dykty, a Kosiniak-Kamysz natych...","We are out of business, and Kosiniak-Kamysz sh..."
589,Dyrektywa budynkowa przyjęta! Przymusowe remon...,The Building Directive has been adopted! Compu...
686,"Gdzie tu sens, gdzie logika ⁉ ️","Where is the sense, where is the logic?"
1251,Ostrzegaliśmy 👇 mądra PRZED a nie dopiero P...,We warned you wisely BEFORE and not only AFTER...
1421,W niedzielę zapraszam do Opatowa. Nawet 2 razy...,I invite you to Opatów on Sunday. Even twice :...
...,...,...
45723,"I 10 razy tyle, co na mieszkalnictwo 🙂",And 10 times as much as for housing :)
45917,"Dziś, w Pałacu Prezydenckim odbyła się uroczys...","Today, the ceremony of awarding nominations to..."
46705,Intensywny tydzień misji Komisji Budżetu PE w ...,An intense week of mission of the EP Budget Co...
46895,Dbanie o bezpieczeństwo żywnościowe i ochrona ...,Taking care of food security and protecting th...


In [98]:
df_clean_translated['text_clean_en_demojized'] = df_clean_translated['text_clean_en'].apply(lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x)
df_clean_translated['text_clean_demojized'] = df_clean_translated['text_clean'].apply(lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x)

df_clean_translated[['text_clean_en', 'text_clean_en_demojized', 'text_clean', 'text_clean_demojized']].head()

Unnamed: 0,text_clean_en,text_clean_en_demojized,text_clean,text_clean_demojized
0,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi...","Niezrealizowanie większości ze ""100 konkretów...","Niezrealizowanie większości ze ""100 konkretów..."
1,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ...",Rok po wyborach trzeba powiedzieć jedno - nie ...,Rok po wyborach trzeba powiedzieć jedno - nie ...
2,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",":cross_mark: Mamy rok po wyborach, a Polska p..."
3,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...,Mija rok od wyborów parlamentarnych. W kampani...,Mija rok od wyborów parlamentarnych. W kampani...
4,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱,#Idę11 🇵 🇱,#Idę11 🇵 🇱


In [99]:
df_clean_translated['possibly_sensitive'] = df_clean_translated['possibly_sensitive'].astype(bool)

In [100]:
username_to_realname = {
    'bartlomiejpejo': 'Bartłomiej Pejo',
    'GrzegorzBraun_': 'Grzegorz Braun',
    'Iwaszkiewicz_RJ': 'Robert Iwaszkiewicz',
    'KonradBerkowicz': 'Konrad Berkowicz',
    'MarSypniewski': 'Marek Sypniewski',
    'MichalWawer': 'Michał Wawer',
    'placzekgrzegorz': 'Grzegorz Płaczek',
    'SlawomirMentzen': 'Sławomir Mentzen',
    'TudujKrzysztof': 'Krzysztof Tuduj',
    'Wlodek_Skalik': 'Włodzimierz Skalik',
    'WTumanowicz': 'Witold Tumanowicz',
    'AndrzejSzejna': 'Andrzej Szejna',
    'AnitaKDZG': 'Anita Kucharska-Dziedzic',
    'JoankaSW': 'Joanna Scheuring-Wielgus',
    'KGawkowski': 'Krzysztof Gawkowski',
    'K_Smiszek': 'Krzysztof Śmiszek',
    'MarcinKulasek': 'Marcin Kulasek',
    'MoskwaWodnicka': 'Małgorzata Moskwa-Wodnicka',
    'PaulinaPW2024': 'Paulina Piechna-Więckiewicz',
    'poselTTrela': 'Tomasz Trela',
    'RobertBiedron': 'Robert Biedroń',
    'WandaNowicka': 'Wanda Nowicka',
    'wieczorekdarek': 'Dariusz Wieczorek',
    'wlodekczarzasty': 'Włodzimierz Czarzasty',
    'Arek_Iwaniak': 'Arkadiusz Iwaniak',
    'B_Maciejewska': 'Beata Maciejewska',
    'BeataSzydlo': 'Beata Szydło',
    'elzbietawitek': 'Elżbieta Witek',
    'Kaminski_M_': 'Mariusz Kamiński',
    'Kowalczyk_H': 'Henryk Kowalczyk',
    'Macierewicz_A': 'Antoni Macierewicz',
    'mblaszczak': 'Mariusz Błaszczak',
    'MorawieckiM': 'Mateusz Morawiecki',
    'mwojcik_': 'Michał Wójcik',
    'PatrykJaki': 'Patryk Jaki',
    'bbudka': 'Borys Budka',
    'CTomczyk': 'Cezary Tomczyk',
    'donaldtusk': 'Donald Tusk',
    'DorotaNiedziela': 'Dorota Niedziela',
    'EwaKopacz': 'Ewa Kopacz',
    'JanGrabiec': 'Jan Grabiec',
    'Konwinski_PO': 'Zbigniew Konwiński',
    'Leszczyna': 'Izabela Leszczyna',
    'MKierwinski': 'Marcin Kierwiński',
    'M_K_Blonska': 'Małgorzata Kidawa-Błońska',
    'OklaDrewnowicz': 'Agnieszka Okła-Drewnowicz',
    'trzaskowski_': 'Rafał Trzaskowski',
    'TomaszSiemoniak': 'Tomasz Siemoniak',
    'AgaBaranowskaPL': 'Agnieszka Baranowska',
    'aga_buczynska': 'Agnieszka Buczyńska',
    'hennigkloska': 'Paulina Hennig-Kloska',
    'joannamucha': 'Joanna Mucha',
    'Kpelczynska': 'Katarzyna Pelczyńska',
    'LukaszOsmalak': 'Łukasz Osmałek',
    'SlizPawel': 'Paweł Śliz',
    'szymon_holownia': 'Szymon Hołownia',
    'ZalewskiPawel': 'Paweł Zalewski',
    'ZywnoMaciej': 'Maciej Żywno',
    'JKozlowskiEu': 'Janusz Kozłowski',
    'michalkobosko': 'Michał Kobosko',
    'DariuszKlimczak': 'Dariusz Klimczak',
    'GrzybAndrzej': 'Andrzej Grzyb',
    'Hetman_K': 'Krzysztof Hetman',
    'JarubasAdam': 'Adam Jarubas',
    'KosiniakKamysz': 'Władysław Kosiniak-Kamysz',
    'Paslawska': 'Urszula Pasławska',
    'PZgorzelskiP': 'Piotr Zgorzelski',
    'StefanKrajewski': 'Stefan Krajewski',
    'StruzikAdam': 'Adam Struzik'
}

# Add the 'name' column to the dataframe
df_clean_translated['name'] = df_clean_translated['username'].map(username_to_realname)
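
`Series.map` with a dictionary leaves NaN for any username that has no key, so a misspelled or missing handle would silently produce an empty `name`. A small standard-library sketch of a coverage check, using a hypothetical two-entry mapping and a hypothetical list of usernames in place of the real column:

```python
# Hypothetical, shortened mapping and usernames for illustration only
username_to_realname_demo = {
    'donaldtusk': 'Donald Tusk',
    'joannamucha': 'Joanna Mucha',
}
usernames = ['donaldtusk', 'joannamucha', 'donaldtusk', 'unknown_handle']

# Handles that appear in the data but have no entry in the mapping
unmapped = sorted(set(usernames) - set(username_to_realname_demo))

# Map with an explicit fallback instead of a silent missing value
names = [username_to_realname_demo.get(u, 'UNMAPPED') for u in usernames]

print("Unmapped usernames:", unmapped)
print(names)
```

On the real frame, the same idea would amount to checking the `name` column for missing values right after the `map` call.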

In [101]:
# Replace newline characters in the 'text_clean_en' column with spaces
df_clean_translated['text_clean_en'] = df_clean_translated['text_clean_en'].str.replace('\n', ' ')
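
Replacing `'\n'` alone leaves carriage returns, tabs, and doubled spaces in place. A hedged alternative sketch that collapses every run of whitespace into a single space; `flatten_whitespace` is a hypothetical helper and not what the cell above does:

```python
import re


def flatten_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and line breaks into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample_text = "line one\nline two\r\n  line three"
flattened = flatten_whitespace(sample_text)
print(flattened)
```

Applied to a column, this could be used through `.apply(flatten_whitespace)` on the non-null values; the demojized columns were created before the newline replacement, so they would need the same treatment if newlines matter there.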

In [103]:
# Save the DataFrame to a Parquet file
df_clean_translated.to_parquet('data/03.cleaned/df_combined.parquet', index=False)

In [104]:
df_clean_translated

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,reply_count,like_count,quote_count,impression_count,text_clean_en,text_clean_en_demojized,text_clean_demojized,emoji_count_en,emoji_count,name
0,375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,1,33,0,1555,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi...","Niezrealizowanie większości ze ""100 konkretów...",0,0,Bartłomiej Pejo
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,2,72,0,3031,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ...",Rok po wyborach trzeba powiedzieć jedno - nie ...,0,0,Bartłomiej Pejo
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,3,33,2,8636,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...,":cross_mark: Mamy rok po wyborach, a Polska p...",1,1,Bartłomiej Pejo
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,2,38,0,2441,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...,Mija rok od wyborów parlamentarnych. W kampani...,0,0,Bartłomiej Pejo
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,18,616,2,8634,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱,#Idę11 🇵 🇱,2,2,Bartłomiej Pejo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48326,,everyone,961181894,,1707719554355380480,"Studiujesz na kierunku lekarskim, pielęgniarst...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-29 11:30:44+00:00,[1707719554355380484],...,0,6,0,2154,"Are you studying medicine, nursing or emergenc...","Are you studying medicine, nursing or emergenc...","Studiujesz na kierunku lekarskim, pielęgniarst...",0,0,Adam Struzik
48327,,everyone,961181894,,1704120323023454464,Za nami posiedzenie @SejmikMaz. I kolejne wspa...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-19 13:08:40+00:00,[1704120323023454339],...,0,15,0,649,The meeting is over. And further support for t...,The meeting is over. And further support for t...,Za nami posiedzenie . I kolejne wsparcie dla m...,0,0,Adam Struzik
48328,,everyone,961181894,,1702668459576786944,Płockie Centrum Onkologii gotowe! Już na począ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-15 12:59:29+00:00,[1702668459576787064],...,0,16,0,581,The Płock Oncology Center is ready! It will ac...,The Płock Oncology Center is ready! It will ac...,Płockie Centrum Onkologii gotowe! Już na począ...,0,0,Adam Struzik
48329,,everyone,961181894,,1701960909369868544,To jedna z największych inwestycji drogowych @...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-09-13 14:07:56+00:00,[1701960909369868437],...,0,13,0,621,This is one of the largest road investments ...,This is one of the largest road investments \...,To jedna z największych inwestycji drogowych ...,0,0,Adam Struzik
