# Data cleaning

This notebook cleans the downloaded tweet data. The main steps are:
1. Adding party affiliation to tweet rows
2. Removing links and mentions from the tweet text and saving them, along with hashtags, to separate columns
3. Expanding the column of public metrics
4. Encoding emojis in a unified format
5. Translating tweets using Google Translate in Google Sheets
6. Saving all downloaded tweets to one file

### 1. Libraries used

In [188]:
import os
import pandas as pd
import re
import emoji

### 2. Reading JSON files and transforming them into party-specific pickle files

In [189]:
base_input_path = 'data/tweets_data_final'
subfolders = ['Konfederacja', 'NL', 'PIS', 'PO', 'PL2050', 'PSL']
output_folder = 'data/tweets_data_combined'

os.makedirs(output_folder, exist_ok=True)

for subfolder in subfolders:
    folder_path = os.path.join(base_input_path, subfolder)
    dataframes = []
    
    for filename in os.listdir(folder_path):
        if filename.endswith('.json'):
            file_path = os.path.join(folder_path, filename)
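            # The file names end in a date range rather than "_tweets.json",
            # so this keeps the whole file name; it is trimmed to the handle later.
            politician = filename.split("_tweets.json")[0]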
            try:
                df = pd.read_json(file_path)  
                df["username"] = politician  
                df["party"] = subfolder
                print(f"Read {len(df)} rows from {file_path}")  
                dataframes.append(df)
            except ValueError as e:
                print(f"Error reading {file_path}: {e}")
    
    if dataframes:
        combined_df = pd.concat(dataframes, ignore_index=True)
        
        output_file_path = os.path.join(output_folder, f'{subfolder}_combined.pkl')
        combined_df.to_pickle(output_file_path) 
        
        print(f"Saved {subfolder} combined data to {output_file_path}")

print("Processing complete!")

Read 964 rows from data/tweets_data_final\Konfederacja\bartlomiejpejo_2023-10-16_2024-10-15.json
Read 721 rows from data/tweets_data_final\Konfederacja\SlawomirMentzen_2023-10-16_2024-10-15.json
Read 175 rows from data/tweets_data_final\Konfederacja\TudujKrzysztof_2023-10-16_2024-10-15.json
Read 950 rows from data/tweets_data_final\Konfederacja\Wlodek_Skalik_2023-10-16_2024-10-15.json
Read 750 rows from data/tweets_data_final\Konfederacja\WTumanowicz_2023-10-16_2024-10-15.json
Saved Konfederacja combined data to data/tweets_data_combined\Konfederacja_combined.pkl
Read 0 rows from data/tweets_data_final\NL\DyduchMarek_2023-10-16_2024-10-15.json
Read 178 rows from data/tweets_data_final\NL\KGawkowski_2023-10-16_2024-10-15.json
Read 457 rows from data/tweets_data_final\NL\RobertBiedron_2023-10-16_2024-10-15.json
Read 73 rows from data/tweets_data_final\NL\wlodekczarzasty_2023-10-16_2024-10-15.json
Saved NL combined data to data/tweets_data_combined\NL_combined.pkl
Read 556 rows from data/

In [190]:
df_konf = pd.read_pickle(os.path.join(output_folder, 'Konfederacja_combined.pkl'))
df_NL = pd.read_pickle(os.path.join(output_folder, 'NL_combined.pkl'))
df_PIS = pd.read_pickle(os.path.join(output_folder, 'PIS_combined.pkl'))
df_PO = pd.read_pickle(os.path.join(output_folder, 'PO_combined.pkl'))
df_PL2050 = pd.read_pickle(os.path.join(output_folder, 'PL2050_combined.pkl'))
df_PSL = pd.read_pickle(os.path.join(output_folder, 'PSL_combined.pkl'))

In [191]:
df_konf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          3560 non-null   object             
 1   in_reply_to_user_id     726 non-null    float64            
 2   reply_settings          3560 non-null   object             
 3   author_id               3560 non-null   int64              
 4   context_annotations     365 non-null    object             
 5   id                      3560 non-null   int64              
 6   text                    3560 non-null   object             
 7   edit_controls           3560 non-null   object             
 8   referenced_tweets       1055 non-null   object             
 9   created_at              3560 non-null   datetime64[ns, UTC]
 10  edit_history_tweet_ids  3560 non-null   object             
 11  lang                    3560 non-null   obj

In [192]:
# Merge all dataframes into one
df = pd.concat([df_konf, df_NL, df_PIS, df_PO, df_PL2050, df_PSL], ignore_index=True)

In [193]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11523 entries, 0 to 11522
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          11523 non-null  object             
 1   in_reply_to_user_id     1888 non-null   float64            
 2   reply_settings          11523 non-null  object             
 3   author_id               11523 non-null  float64            
 4   context_annotations     1636 non-null   object             
 5   id                      11507 non-null  float64            
 6   text                    11523 non-null  object             
 7   edit_controls           11507 non-null  object             
 8   referenced_tweets       3210 non-null   object             
 9   created_at              11523 non-null  datetime64[ns, UTC]
 10  edit_history_tweet_ids  11507 non-null  object             
 11  lang                    11523 non-null  o

In [None]:
pd.options.display.float_format = '{:.0f}'.format
df['id'] = df['id'].fillna(0).astype('int64')
df['id']

0        1846277256509116672
1        1846222583898784000
2        1846161400328028160
3        1846091824101769472
4        1846075343188144128
                ...         
11518    1714531739610075392
11519    1714523500176637952
11520    1714186262041510400
11521    1713833158838276096
11522    1713833158838276096
Name: id, Length: 11523, dtype: int64
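
A note on the printed ids: `df.info()` above reports `id` as `float64` after the concatenation, and a 64-bit float cannot represent every 19-digit integer exactly. That is probably why the `id` values differ in their last digits from the entries in `edit_history_tweet_ids`. The sketch below reproduces the effect in plain Python, using the first row's values as they appear in the outputs of this notebook.

```python
# Value taken from edit_history_tweet_ids of the first row
original_id = 1846277256509116623

# Round-trip through a 64-bit float, as happens when an integer column is stored as float64
rounded_id = int(float(original_id))

print(rounded_id)                # 1846277256509116672
print(rounded_id - original_id)  # 49
```

If exact ids matter, one option would be to keep them as strings or in pandas' nullable `Int64` dtype. Neither is done in this notebook.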

In [195]:
df['id'].nunique()

11461

We need to remove duplicate tweets because our custom download loop occasionally fetches the same tweet twice to ensure completeness. Rows whose `id` was missing were filled with 0 above, so they also collapse into a single row here.

In [196]:
# Remove duplicate tweets based on the 'id' column
df.drop_duplicates(subset=['id'], inplace=True)
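
What `drop_duplicates(subset=['id'])` does with its default `keep='first'` can be sketched in plain Python. The rows below are made up for illustration. The second pair shows that rows sharing the placeholder id 0 are treated as duplicates of each other even when their text differs.

```python
def drop_duplicate_ids(rows):
    """Keep the first row for each id, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row['id'] not in seen:
            seen.add(row['id'])
            unique.append(row)
    return unique

# Toy rows (made up): one tweet downloaded twice,
# and two rows whose missing id was filled with 0
rows = [
    {'id': 101, 'text': 'a'},
    {'id': 101, 'text': 'a'},
    {'id': 0, 'text': 'b'},
    {'id': 0, 'text': 'c'},
]

print([row['text'] for row in drop_duplicate_ids(rows)])  # ['a', 'b']
```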

In [197]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11461 entries, 0 to 11521
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype              
---  ------                  --------------  -----              
 0   public_metrics          11461 non-null  object             
 1   in_reply_to_user_id     1878 non-null   float64            
 2   reply_settings          11461 non-null  object             
 3   author_id               11461 non-null  float64            
 4   context_annotations     1630 non-null   object             
 5   id                      11461 non-null  int64              
 6   text                    11461 non-null  object             
 7   edit_controls           11460 non-null  object             
 8   referenced_tweets       3195 non-null   object             
 9   created_at              11461 non-null  datetime64[ns, UTC]
 10  edit_history_tweet_ids  11460 non-null  object             
 11  lang                    11461 non-null  object

In [198]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1846091776269963776,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1846222583898784000,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1846161400328028160,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1846091824101769472,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1846075343188144128,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo_2023-10-16_2024-10-15.json,Konfederacja


In [199]:
# Keep only the part of 'username' before the first '_2' (drops the date-range suffix and '.json')
df['username'] = df['username'].str.split('_2').str[0]
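
Splitting on `'_2'` works for the file names shown above, but it would cut a handle short if the handle itself contained `_2`. A minimal sketch of a stricter alternative, assuming every file name ends in `_YYYY-MM-DD_YYYY-MM-DD.json`; the helper name `handle_from_filename` and the second example handle are hypothetical.

```python
import re

# Hypothetical helper: strip a trailing "_YYYY-MM-DD_YYYY-MM-DD.json" suffix
DATE_SUFFIX = re.compile(r'_\d{4}-\d{2}-\d{2}_\d{4}-\d{2}-\d{2}\.json$')

def handle_from_filename(filename):
    """Return the account handle encoded at the start of a download file name."""
    return DATE_SUFFIX.sub('', filename)

print(handle_from_filename('Wlodek_Skalik_2023-10-16_2024-10-15.json'))  # Wlodek_Skalik

# Made-up handle containing '_2': the regex keeps it whole, split('_2') would not
print(handle_from_filename('user_2go_2023-10-16_2024-10-15.json'))       # user_2go
print('user_2go_2023-10-16_2024-10-15.json'.split('_2')[0])              # user
```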

In [200]:
category_summary = df['category'].value_counts()
print(category_summary)
total_tweets = category_summary.sum()
print(f"Total tweets: {total_tweets}")

category
Original    8235
Reply       1852
Quote       1370
Retweet        4
Name: count, dtype: int64
Total tweets: 11461


In [201]:
# Ensure the created_at column is in datetime format
df['created_at'] = pd.to_datetime(df['created_at'])

In [202]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,lang,conversation_id,entities,possibly_sensitive,category,attachments,geo,username,party
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],pl,1846091776269963776,"{'mentions': [{'start': 0, 'end': 11, 'usernam...",False,Reply,,,bartlomiejpejo,Konfederacja
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],pl,1846222583898784000,"{'urls': [{'start': 100, 'end': 123, 'url': 'h...",False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo,Konfederacja
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌Mamy rok po wyborach, a Polska pogrąża się w ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],pl,1846161400328028160,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo,Konfederacja
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],pl,1846091824101769472,"{'urls': [{'start': 278, 'end': 301, 'url': 'h...",False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo,Konfederacja
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],qme,1846075343188144128,"{'hashtags': [{'start': 0, 'end': 6, 'tag': 'I...",False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo,Konfederacja


In [203]:
print(df.loc[1, 'text'])

Rok po wyborach trzeba powiedzieć jedno - nie na takie państwo Donald Tusk umawiał się z wyborcami! https://t.co/4Jh5Ni6sgr


In [204]:
def add_space_around_emojis(text):
    return ''.join(f' {char} ' if char in emoji.EMOJI_DATA or re.match(r'[\U0001F1E6-\U0001F1FF]', char) else char for char in text)

df['text'] = df['text'].apply(add_space_around_emojis)

def clean_text(text):
    # Extract mentions, links and hashtags; remove mentions and links from the text
    mentions = re.findall(r'@\w+', text)
    text = re.sub(r'@\w+', '', text)
    links = re.findall(r'http\S+', text)
    text = re.sub(r'http\S+', '', text)
    hashtags = re.findall(r'#\w+', text)
    # Ensure there is a space before and after each emoticon (U+1F600 to U+1F64F only)
    text = re.sub(r'(?<!\s)([\U0001F600-\U0001F64F])', r' \1', text)
    text = re.sub(r'([\U0001F600-\U0001F64F])(?!\s)', r'\1 ', text)
    return [text, mentions, links, hashtags]

df[['text_clean', 'mentions', 'links', 'hashtags']] = pd.DataFrame(df['text'].apply(clean_text).tolist(), index=df.index)
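
Because `add_space_around_emojis` pads every character separately, a flag such as 🇵🇱, which consists of two regional-indicator symbols, ends up split in two in the outputs that follow. The sketch below shows one way to pad a flag as a single unit using only `re`. The name `pad_emojis` is hypothetical, the character ranges are a rough and incomplete approximation of "emoji", and unlike the function above it also collapses repeated spaces.

```python
import re

EMOJI_PATTERN = re.compile(
    '[\U0001F1E6-\U0001F1FF]{2}'             # a flag: two regional-indicator symbols
    '|[\u2600-\u27BF\U0001F300-\U0001FAFF]'  # rough single-codepoint emoji ranges (incomplete)
)

def pad_emojis(text):
    """Surround each matched emoji (or flag pair) with single spaces."""
    padded = EMOJI_PATTERN.sub(lambda m: f' {m.group(0)} ', text)
    # Collapse the double spaces created by adjacent matches and trim the ends
    return re.sub(r' {2,}', ' ', padded).strip()

flag_pl = '\U0001F1F5\U0001F1F1'  # the Polish flag, written with escapes
print(pad_emojis(f'#Idę11{flag_pl}https://t.co/KiCe5ATOpX'))
print(pad_emojis('\u274cMamy rok po wyborach'))
```

A lone regional-indicator symbol is left untouched by this pattern, which is a deliberate simplification.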

In [205]:
df.head()

Unnamed: 0,public_metrics,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,...,possibly_sensitive,category,attachments,geo,username,party,text_clean,mentions,links,hashtags
0,"{'retweet_count': 3, 'reply_count': 1, 'like_c...",375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,...,False,Reply,,,bartlomiejpejo,Konfederacja,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[]
1,"{'retweet_count': 9, 'reply_count': 2, 'like_c...",,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,...,False,Original,{'media_keys': ['13_1846222491456282626']},,bartlomiejpejo,Konfederacja,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[]
2,"{'retweet_count': 4, 'reply_count': 3, 'like_c...",,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,...,False,Original,{'media_keys': ['3_1846148786910810112']},,bartlomiejpejo,Konfederacja,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[]
3,"{'retweet_count': 6, 'reply_count': 2, 'like_c...",,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,...,False,Original,{'media_keys': ['3_1846091818959597568']},,bartlomiejpejo,Konfederacja,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[]
4,"{'retweet_count': 45, 'reply_count': 18, 'like...",,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,...,False,Original,{'media_keys': ['13_1846075276687478784']},,bartlomiejpejo,Konfederacja,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11]


In [206]:
df.drop(columns=['entities'], inplace=True)

In [207]:
# Extract public metrics into separate columns
df['retweet_count'] = df['public_metrics'].apply(lambda x: x['retweet_count'])
df['reply_count'] = df['public_metrics'].apply(lambda x: x['reply_count'])
df['like_count'] = df['public_metrics'].apply(lambda x: x['like_count'])
df['quote_count'] = df['public_metrics'].apply(lambda x: x['quote_count'])
df['impression_count'] = df['public_metrics'].apply(lambda x: x['impression_count'])

# Delete the 'public_metrics' column
df.drop(columns=['public_metrics'], inplace=True)
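
The lambdas above raise an error if a `public_metrics` entry is missing or lacks one of the keys. That did not happen here, since the column has no nulls, but a more defensive variant could look like the sketch below. The helper name `expand_metrics` is hypothetical, and returning `None` for missing values is an assumption.

```python
METRIC_KEYS = [
    'retweet_count', 'reply_count', 'like_count',
    'quote_count', 'impression_count',
]

def expand_metrics(metrics):
    """Return one value per metric key; None when the dict or a key is missing."""
    if not isinstance(metrics, dict):
        return {key: None for key in METRIC_KEYS}
    return {key: metrics.get(key) for key in METRIC_KEYS}

# Values from the first row of the output below, with one key deliberately left out
sample = {'retweet_count': 3, 'reply_count': 1, 'like_count': 33, 'quote_count': 0}
print(expand_metrics(sample))
print(expand_metrics(None))
```

With pandas, the per-row dictionaries could then be turned into columns with `pd.DataFrame(df['public_metrics'].map(expand_metrics).tolist(), index=df.index)`.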

In [208]:
df

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,party,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count
0,375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,Konfederacja,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Konfederacja,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,Konfederacja,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Konfederacja,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,Konfederacja,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11517,1557599378285666304,everyone,163197547,,1714536050096296192,@Marek83406343 @tvn24rozmowa @KonradPiasecki @...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '171452430045820...",2023-10-18 06:57:03+00:00,[1714536050096296256],...,PSL,W są ludzie o różnych o światopoglądach i...,"[@Marek83406343, @tvn24rozmowa, @KonradPiaseck...",[],[],0,0,1,0,22
11518,,everyone,163197547,,1714531739610075392,Będę namawiała kolegów z opozycji do skorzysta...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-18 06:39:55+00:00,[1714531739610075435],...,PSL,Będę namawiała kolegów z opozycji do skorzysta...,"[@Paslawska, @tvn24rozmowa, @nowePSL]",[],[#PiS],6,1,33,0,806
11519,,everyone,163197547,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",1714523500176637952,Budżet państwa musi być uchwalony do końca sty...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-18 06:07:11+00:00,[1714523500176638070],...,PSL,Budżet państwa musi być uchwalony do końca sty...,"[@paslawska, @tvn24rozmowa, @now]",[https://t.co/BzsRxbPYv4],[],6,3,16,0,660
11520,,everyone,163197547,,1714186262041510400,Dziękuję z całego serca &lt;3 #DobraGospodyni ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-17 07:47:07+00:00,[1714186262041510351],...,PSL,Dziękuję z całego serca &lt;3 #DobraGospodyni,[@nowePSL],[https://t.co/cpxQaIsGty],[#DobraGospodyni],15,13,122,0,4535


In [209]:
# Make sure the 'id' column is stored as int64
df['id'] = df['id'].astype('int64')

In [210]:
# Create a new dataframe with 'id' and 'text_clean' columns
df_clean_text = df[['id', 'text_clean']]

# Export the dataframe to a CSV file
df_clean_text.to_csv('data_for_translation.csv', index=False)

In [211]:
# Read the translated tweets into a DataFrame
df_en_text = pd.read_csv('tweets_translation/translated_tweets.csv')

# Display the first few rows of the DataFrame
df_en_text.head()


Unnamed: 0,id,text_clean,text_clean_en
0,1846277256509116672,"Niezrealizowanie większości ze ""100 konkretów...","Failure to implement most of the ""100 specifi..."
1,1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"A year after the elections, one thing must be ..."
2,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","❌ We are a year after the elections, and Pola..."
3,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,A year has passed since the parliamentary elec...
4,1846075343188144128,#Idę11 🇵 🇱,#I'm going11 🇵 🇱


In [212]:
# The ids are handled as strings that may contain commas; strip the commas and convert to int
df_en_text["id"] = df_en_text["id"].apply(lambda x: int(float(x.replace(',', ''))))
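
Going through `float` rounds 19-digit ids to the nearest representable value. The sketch below shows an exact alternative based on `decimal.Decimal`; the helper name `parse_id` is hypothetical. It is not a drop-in replacement: the ids in `df` had already passed through `float64` before the export, so which route produces matching merge keys depends on what the translated CSV actually contains.

```python
from decimal import Decimal

def parse_id(raw):
    """Parse an id string such as '1,846,277,256,509,116,623' without rounding."""
    return int(Decimal(raw.replace(',', '')))

print(parse_id('1,846,277,256,509,116,623'))  # 1846277256509116623
print(int(float('1846277256509116623')))      # 1846277256509116672
```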

In [213]:
# Merge the translations only if the 'text_clean_en' column exists in df_en_text
if 'text_clean_en' in df_en_text.columns:
    df = df.merge(df_en_text[['id', 'text_clean_en']], on='id', how='left')

    # Display the first few rows of the updated dataframe to verify the merge
    display(df.head())
else:
    print("Column 'text_clean_en' does not exist in df_en_text")


Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,text_clean,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en
0,375146901.0,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,"Niezrealizowanie większości ze ""100 konkretów...",[@donaldtusk],[],[],3,1,33,0,1555,"Failure to implement most of the ""100 specifi..."
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,Rok po wyborach trzeba powiedzieć jedno - nie ...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031,"A year after the elections, one thing must be ..."
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,"❌ Mamy rok po wyborach, a Polska pogrąża się ...",[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636,"❌ We are a year after the elections, and Pola..."
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,Mija rok od wyborów parlamentarnych. W kampani...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441,A year has passed since the parliamentary elec...
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,#Idę11 🇵 🇱,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634,#I'm going11 🇵 🇱


In [214]:
# Replace '#VALUE!' with NaN in 'text_clean_en' column
df['text_clean_en'] = df['text_clean_en'].replace('#VALUE!', pd.NA)

In [215]:
df['text_clean_en_demojized'] = df['text_clean_en'].apply(lambda x: emoji.demojize(str(x)) if pd.notnull(x) else x)

df[['text_clean_en', 'text_clean_en_demojized']].head()

Unnamed: 0,text_clean_en,text_clean_en_demojized
0,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi..."
1,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ..."
2,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...
3,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...
4,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱


In [216]:
df

Unnamed: 0,in_reply_to_user_id,reply_settings,author_id,context_annotations,id,text,edit_controls,referenced_tweets,created_at,edit_history_tweet_ids,...,mentions,links,hashtags,retweet_count,reply_count,like_count,quote_count,impression_count,text_clean_en,text_clean_en_demojized
0,375146901,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846277256509116672,"@donaldtusk Niezrealizowanie większości ze ""10...","{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '184609177626996...",2024-10-15 19:49:34+00:00,[1846277256509116623],...,[@donaldtusk],[],[],3,1,33,0,1555,"Failure to implement most of the ""100 specifi...","Failure to implement most of the ""100 specifi..."
1,,everyone,1182211615,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1846222583898784000,Rok po wyborach trzeba powiedzieć jedno - nie ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 16:12:19+00:00,[1846222583898784025],...,[],[https://t.co/4Jh5Ni6sgr],[],9,2,72,0,3031,"A year after the elections, one thing must be ...","A year after the elections, one thing must be ..."
2,,everyone,1182211615,,1846161400328028160,"❌ Mamy rok po wyborach, a Polska pogrąża się ...","{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 12:09:12+00:00,[1846161400328028272],...,[],"[https://t.co/zFk5QLd1em, https://t.co/bRV4y07...",[],4,3,33,2,8636,"❌ We are a year after the elections, and Pola...",:cross_mark: We are a year after the election...
3,,everyone,1182211615,,1846091824101769472,Mija rok od wyborów parlamentarnych. W kampani...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 07:32:44+00:00,[1846091824101769490],...,[],"[https://t.co/rtVu3Bh43G, https://t.co/8Q3LME6...",[],6,2,38,0,2441,A year has passed since the parliamentary elec...,A year has passed since the parliamentary elec...
4,,everyone,1182211615,,1846075343188144128,#Idę11 🇵 🇱 https://t.co/KiCe5ATOpX,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2024-10-15 06:27:14+00:00,[1846075343188144153],...,[],[https://t.co/KiCe5ATOpX],[#Idę11],45,18,616,2,8634,#I'm going11 🇵 🇱,#I'm going11 🇵 🇱
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11456,1557599378285666304,everyone,163197547,,1714536050096296192,@Marek83406343 @tvn24rozmowa @KonradPiasecki @...,"{'edits_remaining': 5, 'is_edit_eligible': Fal...","[{'type': 'replied_to', 'id': '171452430045820...",2023-10-18 06:57:03+00:00,[1714536050096296256],...,"[@Marek83406343, @tvn24rozmowa, @KonradPiaseck...",[],[],0,0,1,0,22,There are people with different worldviews and...,There are people with different worldviews and...
11457,,everyone,163197547,,1714531739610075392,Będę namawiała kolegów z opozycji do skorzysta...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-18 06:39:55+00:00,[1714531739610075435],...,"[@Paslawska, @tvn24rozmowa, @nowePSL]",[],[#PiS],6,1,33,0,806,I will encourage my colleagues from the opposi...,I will encourage my colleagues from the opposi...
11458,,everyone,163197547,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",1714523500176637952,Budżet państwa musi być uchwalony do końca sty...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-18 06:07:11+00:00,[1714523500176638070],...,"[@paslawska, @tvn24rozmowa, @now]",[https://t.co/BzsRxbPYv4],[],6,3,16,0,660,The state budget must be adopted by the end of...,The state budget must be adopted by the end of...
11459,,everyone,163197547,,1714186262041510400,Dziękuję z całego serca &lt;3 #DobraGospodyni ...,"{'edits_remaining': 5, 'is_edit_eligible': Tru...",,2023-10-17 07:47:07+00:00,[1714186262041510351],...,[@nowePSL],[https://t.co/cpxQaIsGty],[#DobraGospodyni],15,13,122,0,4535,Thank you with all my heart &lt;3 #GoodHousewife,Thank you with all my heart &lt;3 #GoodHousewife


In [221]:
# Cast 'possibly_sensitive' to boolean (note: astype(bool) turns missing values into True)
df['possibly_sensitive'] = df['possibly_sensitive'].astype(bool)
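
One caveat with a plain boolean cast is that NaN is truthy, so a tweet with no `possibly_sensitive` value would be marked as sensitive. A minimal sketch of an explicit mapping follows; the helper name `to_bool` is hypothetical, and treating missing values as `False` is an assumption about the intended meaning.

```python
import math

def to_bool(value):
    """Map missing values (None or NaN) to False instead of True."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    return bool(value)

print(bool(float('nan')))     # True
print(to_bool(float('nan')))  # False
print(to_bool(True))          # True
```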

In [None]:
# Save the DataFrame to a Parquet file
df.to_parquet('cleaned_data/df_combined.parquet', index=False)