### Data cleaning

In [23]:
import matplotlib.pyplot as plt
import pandas as pd

# Import dataset
df = pd.read_csv('comentarios_charlie_kirk_balanceado.csv')

display(df)
df[df['id']==2773].mensaje.iloc[0]

Unnamed: 0,id,mensaje,año,mes,dia,score,subreddit
0,3586,POST: Remember: Your vote is 100% confidential...,2024,11,1,3,democrats
1,3587,POST: Charlie Kirk falls for fake Garbage Driv...,2024,11,1,1291,PoliticalHumor
2,3588,POST: Charlie Kirk falls for fake Garbage Driv...,2024,11,1,341,PoliticalHumor
3,3589,POST: Charlie Kirk falls for fake Garbage Driv...,2024,11,1,244,PoliticalHumor
4,3590,POST: Charlie Kirk falls for fake Garbage Driv...,2024,11,1,148,PoliticalHumor
...,...,...,...,...,...,...,...
8051,2826,POST: JB Pritzker Compares Trump’s ICE Crackdo...,2025,10,15,1,politics
8052,2827,POST: JB Pritzker Compares Trump’s ICE Crackdo...,2025,10,15,1,politics
8053,2828,POST: JB Pritzker Compares Trump’s ICE Crackdo...,2025,10,15,1,politics
8054,2834,POST: Charlie Kirk posthumously awarded Medal ...,2025,10,15,1,Conservative


"POST: US revokes visas of individuals who ‘celebrated’ Charlie Kirk’s death || COMENTARIO: **As a reminder, this subreddit [is for civil discussion](https://www.reddit.com/r/politics/wiki/index#wiki_the_rules_of_.2Fr.2Fpolitics.3A).** In general, please be courteous to others. Argue the merits of ideas, don't attack other posters or commenters. Hate speech, any suggestion or support of physical harm, or other rule violations can result in a temporary or a permanent ban. If you see comments in violation of our rules, please report them. **Sub-thread Information** If the post flair on this post indicates the wrong paywall status, please report this Automoderator comment with a custom report of “incorrect flair”. **Announcement** r/Politics is actively looking for new moderators. If you have an interest in helping to make this subreddit a place for quality discussion, please fill out [this form](https://sh.reddit.com/r/politics/application). *** *I am a bot, and this action was performed

In [24]:
# Rename columns to English
df = df.rename(columns={"año": "year", "mes": "month", "dia": "day", "mensaje": "message"})
df['message'] = df['message'].str.replace('COMENTARIO', 'COMMENT', regex=False)

# Create DATE column
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Delete unnecessary columns
df.drop(columns=["year", "month", "day"], inplace=True)
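The date assembly above can be tried in isolation. The following is a minimal sketch on a toy frame with made-up values (not rows from the dataset); passing `errors="coerce"` is an optional addition, assumed here only to show how impossible dates could be surfaced as `NaT` rather than raising.

```python
import pandas as pd

# Toy frame with hypothetical values; the third row is deliberately invalid (30 February).
toy = pd.DataFrame({
    "year": [2024, 2025, 2025],
    "month": [11, 10, 2],
    "day": [1, 15, 30],
})

# pd.to_datetime assembles a datetime column from columns named year/month/day.
# errors="coerce" turns impossible dates into NaT instead of raising an error.
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]], errors="coerce")

n_invalid = int(toy["date"].isna().sum())
print(toy)
print("Invalid dates:", n_invalid)
```

Counting the `NaT` values before dropping the component columns gives a quick check that no rows were silently lost.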

In [25]:
# Remove Markdown links (keeping the link text) and bare URLs
df['message'] = df['message'].str.replace(
    r'\[([^\]]+)\]\(([^)]+)\)',  # [text](url)
    r'\1',                       # keep only “text”
    regex=True
)

df['message'] = df['message'].str.replace(
    r'(https?://\S+|www\.\S+|(?<!\S)/\S+|\S+\.\S{2,})',
    '',
    regex=True
)

# Split into post and comment
df['post'] = df['message'].str.extract(r'POST:?(.*?)\s*\|\|\s*COMMENT:', expand=False).str.strip()
df['comment'] = df['message'].str.extract(r'COMMENT:?(.*)', expand=False).str.strip()
df.drop(columns=['message'], inplace=True)


# Check for empty comments
void_comments = df['comment'].isna() | (df['comment'].str.strip() == '')
print("Number of void comments:", void_comments.sum())

display(df)
print(df[df['id']==2773].post.iloc[0], df[df['id']==2773].comment.iloc[0])


Number of void comments: 0


Unnamed: 0,id,score,subreddit,date,post,comment
0,3586,3,democrats,2024-11-01,Remember: Your vote is 100% confidential. No o...,Should be illegal to straight up lie like that...
1,3587,1291,PoliticalHumor,2024-11-01,Charlie Kirk falls for fake Garbage Driver Bre...,Brent Terhune trolls the hell out of the Garba...
2,3588,341,PoliticalHumor,2024-11-01,Charlie Kirk falls for fake Garbage Driver Bre...,This video on his TikTok has nearly a million ...
3,3589,244,PoliticalHumor,2024-11-01,Charlie Kirk falls for fake Garbage Driver Bre...,Nobody ever accused Kirk of being the sharpest...
4,3590,148,PoliticalHumor,2024-11-01,Charlie Kirk falls for fake Garbage Driver Bre...,They are upset that they got called garbage by...
...,...,...,...,...,...,...
8051,2826,1,politics,2025-10-15,JB Pritzker Compares Trump’s ICE Crackdown to ...,"Hahahahahha, yeah, they also did a great job i..."
8052,2827,1,politics,2025-10-15,JB Pritzker Compares Trump’s ICE Crackdown to ...,But you now agree he was given a hearing? That...
8053,2828,1,politics,2025-10-15,JB Pritzker Compares Trump’s ICE Crackdown to ...,"No, my point was that denying comparisons to f..."
8054,2834,1,Conservative,2025-10-15,Charlie Kirk posthumously awarded Medal of Fre...,He's our generation's MLK Jr. He should get a ...


US revokes visas of individuals who ‘celebrated’ Charlie Kirk’s death **As a reminder, this subreddit is for civil  In general, please be courteous to others. Argue the merits of ideas, don't attack other posters or commenters. Hate speech, any suggestion or support of physical harm, or other rule violations can result in a temporary or a permanent ban. If you see comments in violation of our rules, please report them. **Sub-thread Information** If the post flair on this post indicates the wrong paywall status, please report this Automoderator comment with a custom report of “incorrect flair”. **Announcement** r/Politics is actively looking for new moderators. If you have an interest in helping to make this subreddit a place for quality discussion, please fill out this form. *** *I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.*
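In the printed comment above, the word "discussion" is missing ("this subreddit is for civil  In general"). Once the Markdown link is unwrapped, the token `discussion.**` matches the `\S+\.\S{2,}` branch of the URL pattern and is removed along with the real URLs. A possible tighter variant is sketched below using only the standard `re` module. The helper name `strip_urls` and the sample string are hypothetical, and the sketch assumes that scheme-less domains (for example `example.com`) may stay in the text.

```python
import re

# Markdown links: keep the visible text, drop the target.
MD_LINK = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')

# Bare URLs: require an explicit scheme or a leading "www."
# (assumption: scheme-less domains are left untouched).
BARE_URL = re.compile(r'(?:https?://|www\.)\S+')


def strip_urls(text: str) -> str:
    """Unwrap Markdown links, delete bare URLs, and collapse leftover whitespace."""
    text = MD_LINK.sub(r'\1', text)
    text = BARE_URL.sub('', text)
    return re.sub(r'\s{2,}', ' ', text).strip()


sample = (
    "this subreddit [is for civil discussion](https://example.com/rules).** "
    "In general, see https://example.com/a or www.example.org today."
)
cleaned = strip_urls(sample)
print(cleaned)
```

The same two patterns could be passed to `Series.str.replace(..., regex=True)`. Whether dropping the `/path` and `name.tld` branches is acceptable depends on how many scheme-less links the comments contain.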


In [26]:
# Group comments by post. Within each post, comments are sorted by date; the posts themselves are ordered by the date of their oldest comment.

# Compute first date per post
first_dates = df.groupby("post")["date"].min().reset_index().rename(columns={"date": "first_date"})

# Sort posts by first date
first_dates = first_dates.sort_values("first_date").reset_index(drop=True)

# Assign a numeric post ID
first_dates["id_post"] = range(1, len(first_dates) + 1)

# Merge the post ID and first date back into the original df
df = df.merge(first_dates, on="post")

# Sort original df by post first date, then comment date
df = df.sort_values(by=["first_date", "date"]).drop(columns=["first_date"])

# Reorder columns
cols = df.columns.tolist()
cols.insert(1, cols.pop(cols.index("id_post")))
cols.insert(2, cols.pop(cols.index("date")))
df = df[cols]

# Sort by post ID and date
df = df.sort_values(by=["id_post", "date"], ascending=[True, True])


# Create a global consecutive index
df["id"] = range(1, len(df) + 1)
df.set_index("id", inplace=True)

df

id,id_post,date,score,subreddit,post,comment
1,1,2024-11-01,3,democrats,Remember: Your vote is 100% confidential. No o...,Should be illegal to straight up lie like that...
2,2,2024-11-01,1291,PoliticalHumor,Charlie Kirk falls for fake Garbage Driver Bre...,Brent Terhune trolls the hell out of the Garba...
3,2,2024-11-01,341,PoliticalHumor,Charlie Kirk falls for fake Garbage Driver Bre...,This video on his TikTok has nearly a million ...
4,2,2024-11-01,244,PoliticalHumor,Charlie Kirk falls for fake Garbage Driver Bre...,Nobody ever accused Kirk of being the sharpest...
5,2,2024-11-01,148,PoliticalHumor,Charlie Kirk falls for fake Garbage Driver Bre...,They are upset that they got called garbage by...
...,...,...,...,...,...,...
8052,205,2025-10-15,1,politics,JB Pritzker Compares Trump’s ICE Crackdown to ...,So you believe they will eventually stop bothe...
8053,205,2025-10-15,1,politics,JB Pritzker Compares Trump’s ICE Crackdown to ...,Your go-to example actually did get a hearing ...
8054,205,2025-10-15,1,politics,JB Pritzker Compares Trump’s ICE Crackdown to ...,"Hahahahahha, yeah, they also did a great job i..."
8055,205,2025-10-15,1,politics,JB Pritzker Compares Trump’s ICE Crackdown to ...,But you now agree he was given a hearing? That...
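The ordering logic of the cell above (number the posts by their earliest comment, then sort comments by date within each post) can be sketched more compactly with `groupby(...).transform`. This is an illustrative alternative on toy data, not the notebook's method. Breaking ties between posts that share a first date by post title is an assumption added here so that the numbering is reproducible.

```python
import pandas as pd

# Toy data with hypothetical posts "A", "B", "C".
toy = pd.DataFrame({
    "post": ["B", "A", "B", "A", "C"],
    "date": pd.to_datetime(
        ["2025-01-03", "2025-01-02", "2025-01-01", "2025-01-05", "2025-01-04"]
    ),
})

# Earliest comment date of each post, broadcast back to every row.
toy["first_date"] = toy.groupby("post")["date"].transform("min")

# One row per post, ordered by first date (ties broken by title: an assumption).
order = toy.drop_duplicates("post").sort_values(["first_date", "post"])["post"]
post_ids = {post: i for i, post in enumerate(order, start=1)}
toy["id_post"] = toy["post"].map(post_ids)

# Final order: by post ID, then by comment date within each post.
toy = (
    toy.sort_values(["id_post", "date"])
       .drop(columns="first_date")
       .reset_index(drop=True)
)
print(toy)
```

With `transform`, no merge is needed, so rows cannot be dropped or duplicated by the join.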


In [27]:
# Save df
df.to_csv('charlie_kirk_comments_cleaned.csv')
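Because `to_csv` writes the `id` index as the first column and stores dates as text, reading the file back needs a little care. The sketch below shows one way to round-trip such a frame, using an in-memory buffer and toy values instead of the real file. The `index_col` and `parse_dates` arguments are the assumed way to restore the index and the datetime dtype.

```python
import io

import pandas as pd

# Toy frame shaped like the cleaned data: "id" index plus a datetime column.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2024-11-01", "2025-10-15"]),
        "comment": ["first", "second"],
    },
    index=pd.Index([1, 2], name="id"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index and parse the date column on load.
restored = pd.read_csv(buffer, index_col="id", parse_dates=["date"])
print(restored.dtypes)
```

Without `index_col="id"`, the saved index would come back as an ordinary column next to a fresh `0..n-1` index.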