# Prepare the data for analysis

In this notebook, we clean the dataset and apply the pre-processing steps that every analysis needs, so they don't have to be repeated in each analysis notebook.

This notebook contains the following steps:
1. Converting the `deltalake` **table** to a `pandas` **DataFrame**
2. Basic inspection of the data
3. Remove the duplicated articles
    - Remove fully duplicated articles
    - Remove duplicated articles based on their URL
    - Remove duplicated articles based on their title
4. Resolve issues in data
    - Replace the `source` field with the source name instead of the source class data from the extractor engine
    - Migrate the `author` field to a list of authors
    - Migrate the `images` field to a list of image URLs
    - Migrate the `tags` field to a list of tags
5. Handle missing data
    - Handle the 'NULL' placeholders in the cells
    - Drop the columns with only null values
6. Cast the data to proper data types
    - Cast the `publication_date` field to datetime type
7. Label a few aggregation cases for testing aggregation approaches
8. Save the cleaned data
## Converting the `deltalake` **table** to a `pandas` **DataFrame**

In [1]:
# Import the necessary libraries for this task
import numpy as np
import pandas as pd
from deltalake import DeltaTable
from tqdm import tqdm
import pickle

In [2]:
# Load the articles table from the data lake (deltalake)
tbl = DeltaTable('./data/articles')

In [3]:
# Convert the deltalake table to a pandas DataFrame
df = tbl.to_pandas()

## Basic inspection of the data

In [11]:
# Checking the columns of the dataset
df.columns

Index(['id', 'title', 'author', 'publication_date', 'source', 'url', 'summary',
       'content', 'tags', 'categories', 'images'],
      dtype='object')

In [12]:
# Inspecting the dataset (with the first 5 rows)
df.head()

,id,title,author,publication_date,source,url,summary,content,tags,categories,images
0,672960439902a3465058b1f0,Datadog challenger Dash0 aims to dash observab...,"Anna Heim ,Devin Coldewey ,Marina Temkin ,Maxw...",2024-11-04T00:00:00.000000,ArticleSource(id=ObjectId('6696759a347e0ad7140...,https://techcrunch.com/2024/11/04/datadog-chal...,,The end of zero-interest rates has driven comp...,"Fundraising ,Startups ,observability ,cloud co...",,https://techcrunch.com/wp-content/uploads/2024...
1,67295ff69902a3465058b1ef,Karen tries to claim ownership of the place sh...,,,ArticleSource(id=ObjectId('6696759a347e0ad7140...,https://cheezburger.com/37613061/karen-tries-t...,,Scroll down for the next article,"entitled parents ,housing ,Random Memes ,Geek ...",,https://i.chzbgr.com/full/10424000000/hAC1368F...
2,67295f349902a3465058b1ee,Elon Musk's ex Grimes takes swipe at Tesla bil...,"Eve Buckland ,Eve Buckland For Dailymail.Com ,...",2024-11-04T21:53:57.000000+0000,ArticleSource(id=ObjectId('6696759a347e0ad7140...,https://www.dailymail.co.uk/tvshowbiz/article-...,,Elon Musk's ex Grimes took a savage swipe at t...,,,https://i.dailymail.co.uk/1s/2024/11/03/05/916...
3,67295f059902a3465058b1ed,Kansas City Chiefs find success in bringing ba...,"Adam Teicher ,Jenna Laine ,Todd Archer ,Kather...",,ArticleSource(id=ObjectId('6696759a347e0ad7140...,https://www.espn.com/nfl/story/_/id/42171653/k...,,Get ready for an electric Week 9 Monday Night ...,,,https://a2.espncdn.com/combiner/i?img=%2Fphoto...
4,67295dd59902a3465058b1ec,CMS’s Medical Debt Relief Will Worsen Medical ...,"Ge Bai ,Ge Baicontributoropinions Expressed Fo...",2024-11-04T00:00:00.000000,ArticleSource(id=ObjectId('6696759a347e0ad7140...,https://www.forbes.com/sites/gebai/2024/11/04/...,,Man collects money with magnet from human crow...,,,https://specials-images.forbesimg.com/imageser...


## Remove the duplicated articles

First, we'll remove the fully duplicated articles.

In [13]:
# Let's set the index of the dataset to the 'id' column
df.set_index('id', inplace=True)

In [14]:
# Checking the number of fully duplicated rows in the dataset
df.duplicated().sum()

np.int64(120)

In [15]:
# Checking the number of non-null values in each column of the dataset
df.count()

title               5182
author              4970
publication_date    2526
source              5182
url                 5182
summary             2200
content             5182
tags                2796
categories          2200
images              5182
dtype: int64

In [16]:
# Dropping the fully duplicated rows in the dataset
df.drop_duplicates(keep='last', inplace=True)

In [17]:
# Verifying the non-null counts per column after dropping the duplicates
df.count()

title               5062
author              4850
publication_date    2406
source              5062
url                 5062
summary             2080
content             5062
tags                2676
categories          2080
images              5062
dtype: int64

Now, let's remove the duplicated articles based on their url.

In [18]:
# Checking the number of rows with a duplicated url in the dataset
df.duplicated(subset=['url']).sum()

np.int64(2079)

In [19]:
# Dropping the rows with a duplicated url, keeping the last occurrence
df.drop_duplicates(subset=['url'], keep='last', inplace=True)

In [20]:
# Verifying that no rows with a duplicated url remain
df.duplicated(subset=['url']).sum()

np.int64(0)

Now, let's remove the duplicated articles based on their title.

In [21]:
# Checking the number of rows with a duplicated title in the dataset
df.duplicated(subset=['title']).sum()

np.int64(6)

In [22]:
# Dropping the rows with a duplicated title, keeping the last occurrence
df.drop_duplicates(subset=['title'], keep='last', inplace=True)

In [23]:
# Verifying that no rows with a duplicated title remain
df.duplicated(subset=['title']).sum()

np.int64(0)

In [24]:
# Verifying the non-null counts per column after dropping all the duplicates
df.count()

title               2977
author              2766
publication_date     327
source              2977
url                 2977
summary                2
content             2977
tags                 598
categories             2
images              2977
dtype: int64
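The three deduplication passes above can be wrapped in one helper that also reports how many rows each pass removes. This is only a sketch on a toy frame: the function name `drop_duplicates_in_passes` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_duplicates_in_passes(df, subsets=(None, ['url'], ['title']), keep='last'):
    """Apply drop_duplicates once per subset and record the rows removed by each pass."""
    removed = {}
    for subset in subsets:
        before = len(df)
        # subset=None compares whole rows, like the first pass in the notebook
        df = df.drop_duplicates(subset=subset, keep=keep)
        label = 'all columns' if subset is None else ', '.join(subset)
        removed[label] = before - len(df)
    return df, removed

# Toy data (hypothetical): one full duplicate, one shared url, one shared title
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C'],
    'url': ['u1', 'u1', 'u2', 'u3', 'u3'],
    'content': ['x', 'x', 'x', 'y', 'z'],
})
deduped, removed = drop_duplicates_in_passes(toy)
print(removed)
print(deduped)
```

Because every pass uses `keep='last'`, the order of the passes matters: a row that survives the url pass can still be dropped by the title pass.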

## Resolve issues in data

Now, let's replace the `source` field with the source name instead of the source class data from the extractor engine.

In [25]:
# Checking a sample of the 'source' column
df['source'].iloc[0]

"ArticleSource(id=ObjectId('6696759a347e0ad714039d51'), name='Eater', domain='www.eater.com;nymag.com;austin.eater.com', rss_url='https://feeds.feedburner.com/EaterNational', categories=['Top Sources', 'Food'])"

In [26]:
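# Building the logic to extract the source name: the second comma-separated field is
# " name='Eater'", so [7:-1] drops the leading " name='" and the trailing quote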
df['source'].iloc[0].split(',')[1][7:-1]

'Eater'

In [27]:
# Extracting the source of the article and reassigning it to the 'source' column
df['source'] = df['source'].str.split(',').str[1].str[7:-1]

In [28]:
# Verifying the changes in the 'source' column
df['source']

id
6724ecf59aa896701328e146                        Eater
6724d89e9aa896701328e0d6                       Forbes
1fbdb59e-eb89-4ee7-8c93-4d604068110a       TechCrunch
49f2bf37-4a71-467c-85a5-4096eb555245        FAIL Blog
235e7b38-5484-4363-9ebd-81c58b96186f       Daily Mail
                                            ...      
e1419026-eb7a-491e-859b-a346a257aebc    Atlas Obscura
18882704-7c38-4ac4-9067-2c3bd10aa9d8    Atlas Obscura
e852098e-a512-48a2-881b-7abb6632bd9b    Atlas Obscura
2991b1a0-f57b-4a03-b8aa-62ba7b675caa    Atlas Obscura
d827b560-4bb8-474c-9509-34443c1ccb5c    Atlas Obscura
Name: source, Length: 2977, dtype: object
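The slicing approach depends on the name being the second comma-separated field and on the prefix being exactly seven characters long. A regular expression is one possible alternative that does not depend on field position. The sketch below assumes the name never contains a single quote; the helper `extract_source_name` is hypothetical and not part of the extractor engine.

```python
import re

# Hypothetical helper: capture whatever sits between name=' and the next single quote
NAME_PATTERN = re.compile(r"name='([^']*)'")

def extract_source_name(raw):
    """Return the source name, or None when the value is not a string or has no name field."""
    if not isinstance(raw, str):
        return None
    match = NAME_PATTERN.search(raw)
    return match.group(1) if match else None

sample = ("ArticleSource(id=ObjectId('6696759a347e0ad714039d51'), name='Eater', "
          "domain='www.eater.com;nymag.com;austin.eater.com', "
          "rss_url='https://feeds.feedburner.com/EaterNational', "
          "categories=['Top Sources', 'Food'])")
print(extract_source_name(sample))  # Eater
```

On the full column this could be applied with `df['source'].map(extract_source_name)`.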

Let's migrate the `author` column from string to list of strings.

In [29]:
# Check a sample of the 'author' column
df['author'].iloc[0]

'Mary Anne Porto'

In [30]:
# Building the logic: multiple authors are separated by ' ,'
df['author'].iloc[0].split(' ,')

['Mary Anne Porto']

In [31]:
# Splitting the authors of the article into a list and reassigning it to the 'author' column
df['author'] = df['author'].str.split(' ,')

In [32]:
# Verifying the changes in the 'author' column
df['author']

id
6724ecf59aa896701328e146                                                [Mary Anne Porto]
6724d89e9aa896701328e0d6                [Gary Shilling, Gary Shillingnewslettergary Sh...
1fbdb59e-eb89-4ee7-8c93-4d604068110a    [Anna Heim, Devin Coldewey, Marina Temkin, Max...
49f2bf37-4a71-467c-85a5-4096eb555245                                                  NaN
235e7b38-5484-4363-9ebd-81c58b96186f    [Eve Buckland, Eve Buckland For Dailymail.Com,...
                                                              ...                        
e1419026-eb7a-491e-859b-a346a257aebc                                                  NaN
18882704-7c38-4ac4-9067-2c3bd10aa9d8                                                  NaN
e852098e-a512-48a2-881b-7abb6632bd9b                                      [Diana Hubbell]
2991b1a0-f57b-4a03-b8aa-62ba7b675caa                                  [Tristan McConnell]
d827b560-4bb8-474c-9509-34443c1ccb5c                                                  NaN
Name: author, Length: 2977, dtype: object

Let's migrate the `images` column from string to list of strings.

In [33]:
# Check a sample of the 'images' column
df['images'].iloc[0]

'https://punchdrink.com/wp-content/uploads/2018/07/Social-Low-Proof-Negroni-Sherry-Cocktail-Recipe-Session-Cocktails-Book.jpg'

In [34]:
# Building the logic: multiple image urls are separated by ' ,'
df['images'].iloc[0].split(' ,')

['https://punchdrink.com/wp-content/uploads/2018/07/Social-Low-Proof-Negroni-Sherry-Cocktail-Recipe-Session-Cocktails-Book.jpg']

In [35]:
# Splitting the images of the article into a list and reassigning it to the 'images' column
df['images'] = df['images'].str.split(' ,')

In [36]:
# Verifying the changes in the 'images' column
df['images']

id
6724ecf59aa896701328e146                [https://punchdrink.com/wp-content/uploads/201...
6724d89e9aa896701328e0d6                [https://imageio.forbes.com/specials-images/im...
1fbdb59e-eb89-4ee7-8c93-4d604068110a    [https://techcrunch.com/wp-content/uploads/202...
49f2bf37-4a71-467c-85a5-4096eb555245    [https://i.chzbgr.com/full/10424000000/hAC1368...
235e7b38-5484-4363-9ebd-81c58b96186f    [https://i.dailymail.co.uk/1s/2024/11/03/05/91...
                                                              ...                        
e1419026-eb7a-491e-859b-a346a257aebc    [https://assets.atlasobscura.com/assets/Gastro...
18882704-7c38-4ac4-9067-2c3bd10aa9d8    [https://assets.atlasobscura.com/assets/Gastro...
e852098e-a512-48a2-881b-7abb6632bd9b    [https://assets.atlasobscura.com/assets/Gastro...
2991b1a0-f57b-4a03-b8aa-62ba7b675caa    [https://img.atlasobscura.com/ei4-MsZwKPbsqqZv...
d827b560-4bb8-474c-9509-34443c1ccb5c    [https://assets.atlasobscura.com/assets/Gastro...
Name: images, Length: 2977, dtype: object

Let's migrate the `tags` column from string to list of strings.

In [37]:
# Check a sample of the 'tags' column
df['tags'].iloc[0]

'Hack Your Drink ,The Ultimates ,Collections ,D List ,Master the Classics ,Recommendations'

In [38]:
# Building the logic: multiple tags are separated by ' ,'
df['tags'].iloc[0].split(' ,')

['Hack Your Drink',
 'The Ultimates',
 'Collections',
 'D List',
 'Master the Classics',
 'Recommendations']

In [39]:
# Splitting the tags of the article into a list and reassigning it to the 'tags' column
df['tags'] = df['tags'].str.split(' ,')

In [40]:
# Verifying the changes in the 'tags' column
df['tags']

id
6724ecf59aa896701328e146                [Hack Your Drink, The Ultimates, Collections, ...
6724d89e9aa896701328e0d6                                                           [NULL]
1fbdb59e-eb89-4ee7-8c93-4d604068110a    [Fundraising, Startups, observability, cloud c...
49f2bf37-4a71-467c-85a5-4096eb555245    [entitled parents, housing, Random Memes, Geek...
235e7b38-5484-4363-9ebd-81c58b96186f                                                  NaN
                                                              ...                        
e1419026-eb7a-491e-859b-a346a257aebc                                                  NaN
18882704-7c38-4ac4-9067-2c3bd10aa9d8                                                  NaN
e852098e-a512-48a2-881b-7abb6632bd9b                                                  NaN
2991b1a0-f57b-4a03-b8aa-62ba7b675caa                                                  NaN
d827b560-4bb8-474c-9509-34443c1ccb5c                                                  NaN
Name: tags, Length: 2977, dtype: object
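Splitting on the exact separator `' ,'` leaves a value untouched when the space is missing; an author value such as `'Heidi Alexander,Transport Secretary'`, which shows up in a later output of this notebook, stays as a single item. A more tolerant split is sketched below. It assumes that a bare comma never appears inside a single author, tag or image url, which may not hold for every source, and the helper name `split_to_list` is hypothetical.

```python
import pandas as pd

def split_to_list(value, sep=','):
    """Split a delimited string into stripped, non-empty parts; leave non-strings untouched."""
    if not isinstance(value, str):
        return value
    return [part.strip() for part in value.split(sep) if part.strip()]

# Toy values (hypothetical) covering both separator styles and a missing value
authors = pd.Series(['Anna Heim ,Devin Coldewey', 'Heidi Alexander,Transport Secretary', None])
result = authors.map(split_to_list).tolist()
print(result)
```

The same function could be mapped over `author`, `tags` and `images`, although image urls deserve a check first because commas are legal inside a url.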

## Handle missing data

Let's handle the 'NULL' placeholders in the cells.

In [41]:
# Replacing the 'NULL', ['NULL'] and 'NaT' placeholders in the dataset with np.nan
df[df=='NULL'] = np.nan
df[df.isin([['NULL']])] = np.nan
df[df.isin(["NaT"])] = np.nan

In [42]:
# Verifying the changes in the dataset
df.head()

id,title,author,publication_date,source,url,summary,content,tags,categories,images
6724ecf59aa896701328e146,The 15 Best Cocktails to Make This Holiday Season,[Mary Anne Porto],2024-11-01T09:00:48.000000,Eater,https://punchdrink.com/articles/best-holiday-f...,,,"[Hack Your Drink, The Ultimates, Collections, ...",,[https://punchdrink.com/wp-content/uploads/201...
6724d89e9aa896701328e0d6,Softer U.S. Economic Growth Ahead,"[Gary Shilling, Gary Shillingnewslettergary Sh...",2024-11-01T00:00:00.000000,Forbes,https://www.forbes.com/newsletters/gary-shilli...,,,,,[https://imageio.forbes.com/specials-images/im...
1fbdb59e-eb89-4ee7-8c93-4d604068110a,Datadog challenger Dash0 aims to dash observab...,"[Anna Heim, Devin Coldewey, Marina Temkin, Max...",2024-11-04 00:00:00,TechCrunch,https://techcrunch.com/2024/11/04/datadog-chal...,,The end of zero-interest rates has driven comp...,"[Fundraising, Startups, observability, cloud c...",,[https://techcrunch.com/wp-content/uploads/202...
49f2bf37-4a71-467c-85a5-4096eb555245,Karen tries to claim ownership of the place sh...,,,FAIL Blog,https://cheezburger.com/37613061/karen-tries-t...,,Scroll down for the next article,"[entitled parents, housing, Random Memes, Geek...",,[https://i.chzbgr.com/full/10424000000/hAC1368...
235e7b38-5484-4363-9ebd-81c58b96186f,Elon Musk's ex Grimes takes swipe at Tesla bil...,"[Eve Buckland, Eve Buckland For Dailymail.Com,...",,Daily Mail,https://www.dailymail.co.uk/tvshowbiz/article-...,,Elon Musk's ex Grimes took a savage swipe at t...,,,[https://i.dailymail.co.uk/1s/2024/11/03/05/91...
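The same replacement can also be written as an explicit per-cell function, which makes the handling of list-valued cells easier to read and test. This is a sketch on toy data; `normalize_missing` and `PLACEHOLDERS` are hypothetical names, and the set of placeholder strings is assumed from the values seen in this notebook.

```python
import numpy as np
import pandas as pd

PLACEHOLDERS = {'NULL', 'NaT'}  # placeholder strings assumed from this dataset

def normalize_missing(value):
    """Map placeholder strings, and non-empty lists made only of placeholders, to np.nan."""
    if isinstance(value, str) and value in PLACEHOLDERS:
        return np.nan
    if isinstance(value, list) and value and all(item in PLACEHOLDERS for item in value):
        return np.nan
    return value

# Toy frame (hypothetical): the first row holds only placeholders
toy = pd.DataFrame({
    'content': ['NULL', 'Some text'],
    'tags': [['NULL'], ['Apple', 'AI']],
    'publication_date': ['NaT', '2024-11-04T00:00:00.000000'],
})
cleaned = toy.apply(lambda column: column.map(normalize_missing))
print(cleaned.isna())
```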


Let's drop the columns that contain only null values.

In [43]:
df.isnull().sum()

title                  0
author               211
publication_date    2650
source                 0
url                    0
summary             2975
content                0
tags                2380
categories          2977
images                 0
dtype: int64

In [44]:
# The `categories` column needs to be dropped as it contains only null values
df.drop(columns=['categories'], inplace=True)
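If more columns could end up empty in a future extraction, the drop does not need to name them one by one. A minimal sketch with `dropna` on a toy frame (the sample data is made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'B'],
    'summary': [np.nan, 'short'],
    'categories': [np.nan, np.nan],
})
# how='all' drops a column only when every value in it is missing
trimmed = toy.dropna(axis=1, how='all')
print(list(trimmed.columns))  # ['title', 'summary']
```

A partially filled column such as `summary` is kept, which matches what the notebook does by hand.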

## Cast the data to proper data types

In [45]:
# Checking the data types of all the columns
df.dtypes

title               object
author              object
publication_date    object
source              object
url                 object
summary             object
content             object
tags                object
images              object
dtype: object

Now, let's cast the `publication_date` column to a datetime type.

In [46]:
# Changing the 'publication_date' column to datetime format
df['publication_date'] = pd.to_datetime(df['publication_date'], errors='coerce')
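The raw column mixes several timestamp layouts (with and without a `+0000` offset, with `T` or a space as separator). In recent pandas versions a single `pd.to_datetime` call infers one format for the whole column, so with `errors='coerce'` values in another layout may silently become `NaT`. One way to guard against that is to parse each value on its own, as sketched below. Treating timestamps without an offset as UTC is an assumption about the data, not something the source confirms.

```python
import pandas as pd

# Sample values in the layouts that appear in this notebook's outputs
raw = pd.Series([
    '2024-11-04T00:00:00.000000',
    '2024-11-04T21:53:57.000000+0000',
    '2024-11-04 00:00:00',
    None,
])

# Parse each value separately so one layout cannot dictate how the others are read;
# utc=True converts offsets to UTC and treats naive timestamps as UTC (assumption)
parsed = raw.map(lambda value: pd.to_datetime(value, utc=True, errors='coerce'))
# A second pass turns the remaining missing values into NaT and fixes the dtype
parsed = pd.to_datetime(parsed, utc=True)
print(parsed)
```

Comparing `df['publication_date'].notna().sum()` before and after the cast is a quick way to see whether any dates were lost.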

## Label a few aggregation cases for testing aggregation approaches

In [47]:
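# Create an empty label column with object dtype so string labels can be assigned later
df['event_id'] = pd.Series(np.nan, index=df.index, dtype=object)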

### Case 1

In [48]:
# Select a reference article from the dataset by its id
df.loc['953d844e-858d-4f06-a664-5a74e77e2766']

title               Judge declines to block Musk’s $1 million vote...
author              [Lauren Feiner, A Senior Policy Reporter At Th...
publication_date                                                  NaT
source                                                      The Verge
url                 https://www.theverge.com/2024/11/4/24288183/mu...
summary                                                           NaN
content             Elon Musk’s America PAC can move forward with ...
tags                                                              NaN
images              [https://cdn.vox-cdn.com/thumbor/rodkS6uNARTTL...
event_id                                                          NaN
Name: 953d844e-858d-4f06-a664-5a74e77e2766, dtype: object

In [49]:
# Searching for similar articles from keywords
keywords = ['block', 'philladelphia', 'judge', 'denies', 'declines', 'musk', 'pennsylvania', 'million']

temp_df = df.copy()
temp_df['keyword_count'] = temp_df['title'].apply(lambda x: sum(keyword.lower() in x.lower() for keyword in keywords))
temp_df = temp_df[temp_df['keyword_count'] > 0].sort_values(by='keyword_count', ascending=False)
temp_df.loc[temp_df['keyword_count'] > 1]

id,title,author,publication_date,source,url,summary,content,tags,images,event_id,keyword_count
b87065f4-5db8-4fde-8371-93fbec783f10,Judge denies Philadelphia DA's request to bloc...,[Abc News],NaT,ABC News,https://abcnews.go.com/Politics/elon-musk-pac-...,,A Philadelphia judge is allowing Elon Musk’s A...,,[https://i.abcnewsfe.com/a/a150a26a-eef0-4ebb-...,,5
953d844e-858d-4f06-a664-5a74e77e2766,Judge declines to block Musk’s $1 million vote...,"[Lauren Feiner, A Senior Policy Reporter At Th...",NaT,The Verge,https://www.theverge.com/2024/11/4/24288183/mu...,,Elon Musk’s America PAC can move forward with ...,,[https://cdn.vox-cdn.com/thumbor/rodkS6uNARTTL...,,5
75e5337b-5962-45eb-9acd-45788497bf4d,Pennsylvania judge allows Elon Musk's PAC to c...,"[Andrea Margolis, Fox News, Andrea Margolis Is...",NaT,FOX News,https://www.foxnews.com/politics/pennsylvania-...,,A Pennsylvania judge is allowing Elon Musk's A...,,[https://a57.foxnews.com/static.foxnews.com/fo...,,3
8d95247d-3f55-46e8-8fc5-42604d50df70,Elon Musk says he’s giving away $1 million a d...,[Ellen Ioanes],NaT,Vox,https://www.vox.com/politics/378912/musk-trump...,,covers breaking and general assignment news as...,,[https://platform.vox.com/wp-content/uploads/s...,,2
f419c1d8-3062-4c0f-a42d-729f6020966d,Elon Musk’s PAC admits $1 million voter giveaw...,"[Gadel Valle, A Policy Reporter. Her Past Work...",NaT,The Verge,https://www.theverge.com/2024/11/4/24287952/el...,,A representative of Elon Musk’s America PAC sa...,,[https://duet-cdn.vox-cdn.com/thumbor/0x0:2040...,,2
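The keyword search is repeated for every case, so it could be factored into a small function. The sketch below is one way to do that; `find_titles_by_keywords` and the toy titles are hypothetical. Like the cell above, it uses plain substring matching, so a keyword such as 'plus' would also match 'surplus'.

```python
import pandas as pd

def find_titles_by_keywords(df, keywords, min_hits=2):
    """Return rows whose title contains at least `min_hits` keywords, best matches first."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = df['title'].fillna('').str.lower().apply(
        lambda title: sum(keyword in title for keyword in lowered)
    )
    result = df.assign(keyword_count=hits)
    result = result[result['keyword_count'] >= min_hits]
    return result.sort_values(by='keyword_count', ascending=False)

# Toy titles (hypothetical) indexed by made-up ids
toy = pd.DataFrame(
    {'title': ['Judge declines to block Musk giveaway',
               'Pennsylvania judge allows giveaway',
               'Apple ships a new beta']},
    index=['a', 'b', 'c'],
)
matches = find_titles_by_keywords(toy, ['judge', 'block', 'musk', 'pennsylvania'])
print(matches[['title', 'keyword_count']])
```

`df.assign` returns a new frame, so the original DataFrame is left without the temporary `keyword_count` column.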


In [50]:
case_1_ids = [
  '953d844e-858d-4f06-a664-5a74e77e2766',
  '75e5337b-5962-45eb-9acd-45788497bf4d',
  'b87065f4-5db8-4fde-8371-93fbec783f10',
]

In [51]:
# Assigning the event_id to the selected articles
df.loc[case_1_ids, 'event_id'] = 'case_1'


### Case 2

In [52]:
# Select a reference article from the dataset by its id
df.loc['726a7000-dde4-4299-b964-fffbbb41490f']

title               Apple users can soon upgrade to ChatGPT Plus w...
author              [Maxwell Zeff, Devin Coldewey, Manish Singh, K...
publication_date                                                  NaT
source                                                     TechCrunch
url                 https://techcrunch.com/2024/11/04/apple-users-...
summary                                                           NaN
content             Apple products are getting an integration with...
tags                       [Apple, ChatGPT Plus, OpenAI, ChatGPT, AI]
images              [https://techcrunch.com/wp-content/uploads/202...
event_id                                                          NaN
Name: 726a7000-dde4-4299-b964-fffbbb41490f, dtype: object

In [53]:
# Searching for similar articles from keywords
keywords = ['apple', 'chatgpt', 'plus']

temp_df = df.copy()
temp_df['keyword_count'] = temp_df['title'].apply(lambda x: sum(keyword.lower() in x.lower() for keyword in keywords))
temp_df = temp_df[temp_df['keyword_count'] > 0].sort_values(by='keyword_count', ascending=False)
temp_df.loc[temp_df['keyword_count'] > 1]

id,title,author,publication_date,source,url,summary,content,tags,images,event_id,keyword_count
726a7000-dde4-4299-b964-fffbbb41490f,Apple users can soon upgrade to ChatGPT Plus w...,"[Maxwell Zeff, Devin Coldewey, Manish Singh, K...",NaT,TechCrunch,https://techcrunch.com/2024/11/04/apple-users-...,,Apple products are getting an integration with...,"[Apple, ChatGPT Plus, OpenAI, ChatGPT, AI]",[https://techcrunch.com/wp-content/uploads/202...,,3
141ce42e-29ea-41e1-a670-a9da3fa26e39,Apple will let you upgrade to ChatGPT Plus rig...,"[Jay Peters, A News Editor Who Writes About Te...",NaT,The Verge,https://www.theverge.com/2024/11/4/24288015/ap...,,Apple’s second iOS 18.2 developer beta include...,,[https://www.theverge.com/icons/native-ad-plac...,,3


In [54]:
case_2_ids = [
  '726a7000-dde4-4299-b964-fffbbb41490f',
  '141ce42e-29ea-41e1-a670-a9da3fa26e39',
]

In [55]:
# Assigning the event_id to the selected articles
df.loc[case_2_ids, 'event_id'] = 'case_2'

### Case 3

In [56]:
# Select a reference article from the dataset by its id
df.loc['eafa01c6-3078-49bf-913f-71697594fe1b']

title               Heidi Klum unveils Halloween costume as ET wit...
author              [Stephanie Giang-Paunon Larry Fink, Stephanie ...
publication_date                                                  NaT
source                                                       FOX News
url                 https://www.foxnews.com/entertainment/heidi-kl...
summary                                                           NaN
content             The queen of Halloween, Heidi Klum, was "out o...
tags                                                [#HeidiHalloween]
images              [https://a57.foxnews.com/static.foxnews.com/fo...
event_id                                                          NaN
Name: eafa01c6-3078-49bf-913f-71697594fe1b, dtype: object

In [57]:
# Searching for similar articles from keywords
keywords = ['heidi', 'klum']

temp_df = df.copy()
temp_df['keyword_count'] = temp_df['title'].apply(lambda x: sum(keyword.lower() in x.lower() for keyword in keywords))
temp_df = temp_df[temp_df['keyword_count'] > 0].sort_values(by='keyword_count', ascending=False)
temp_df.loc[temp_df['keyword_count'] > 0]

id,title,author,publication_date,source,url,summary,content,tags,images,event_id,keyword_count
d2e0859a-5cbf-4533-8207-6beb173b532d,Heidi Klum claims she was a man in her past li...,"[Heidi Parker, Heidi Parker For Dailymail.Com,...",NaT,Daily Mail,https://www.dailymail.co.uk/tvshowbiz/article-...,,Heidi Klum believes she has 'lived many lives ...,,[https://i.dailymail.co.uk/1s/2024/10/31/18/91...,,2
eafa01c6-3078-49bf-913f-71697594fe1b,Heidi Klum unveils Halloween costume as ET wit...,"[Stephanie Giang-Paunon Larry Fink, Stephanie ...",NaT,FOX News,https://www.foxnews.com/entertainment/heidi-kl...,,"The queen of Halloween, Heidi Klum, was ""out o...",[#HeidiHalloween],[https://a57.foxnews.com/static.foxnews.com/fo...,,2
6e353ebb-749d-4520-be04-650305979a44,Heidi Klum Dresses as E.T. for Her Annual Hall...,,NaT,TMZ,https://www.tmz.com/2024/11/01/heidi-klum-et-c...,,Heidi Klum's Halloween bash is always a wild e...,,[https://imagez.tmz.com/image/92/16by9/2024/10...,,2
b04e63ae-de09-4c62-811f-649d80545ada,HEIDI ALEXANDER: Potholes have plagued drivers...,"[Heidi Alexander,Transport Secretary]",NaT,Daily Mail,https://www.dailymail.co.uk/debate/article-142...,,Potholes have plagued drivers across the count...,,[https://i.dailymail.co.uk/1s/2024/12/19/16/93...,,1


In [58]:
case_3_ids = [
  'eafa01c6-3078-49bf-913f-71697594fe1b',
  '6e353ebb-749d-4520-be04-650305979a44'
]

In [59]:
# Assigning the event_id to the selected articles
df.loc[case_3_ids, 'event_id'] = 'case_3'

### Uncategorized articles

Let's mark the remaining articles as 'none'.

In [60]:
df['event_id'].value_counts()

event_id
case_1    3
case_2    2
case_3    2
Name: count, dtype: int64

In [61]:
df.loc[df['event_id'].isna(), ['event_id']] = 'none'

In [62]:
df['event_id'].value_counts()

event_id
none      2970
case_1       3
case_2       2
case_3       2
Name: count, dtype: int64
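The labelling steps for all cases can also be expressed as one mapping from event label to article ids. The following is a sketch under that assumption; `label_events` and the toy ids are hypothetical.

```python
import pandas as pd

def label_events(df, cases, default='none'):
    """Return a copy of df with an object-typed event_id column filled from `cases`."""
    labelled = df.copy()
    # Start with the default label everywhere, stored as Python objects (strings)
    labelled['event_id'] = pd.Series(default, index=labelled.index, dtype=object)
    for event_id, article_ids in cases.items():
        # .loc raises a KeyError if an id is missing, which surfaces typos early
        labelled.loc[article_ids, 'event_id'] = event_id
    return labelled

# Toy frame (hypothetical) indexed by made-up article ids
toy = pd.DataFrame({'title': ['t1', 't2', 't3', 't4']}, index=['a', 'b', 'c', 'd'])
labelled = label_events(toy, {'case_1': ['a', 'c'], 'case_2': ['d']})
print(labelled['event_id'].value_counts())
```

Keeping the cases in a single dictionary makes it easy to add a new test case without another pair of cells.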

## Save the cleaned data

Finally, we'll save the cleaned data as a pickle file for further analysis.

In [63]:
with open('./df.pkl', 'wb') as f:
    pickle.dump(df, f)
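pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which do the same job without opening the file by hand. The round trip below is a sketch on a toy frame written to a temporary directory; the path is made up for the example. Pickle keeps list-valued cells and the index as they are, which a CSV export would not, but pickle files should only be loaded from sources you trust.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical) with a list-valued cell and a named index
toy = pd.DataFrame(
    {'title': ['A'], 'tags': [['x', 'y']]},
    index=pd.Index(['id-1'], name='id'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'df.pkl')  # hypothetical location for this sketch
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.loc['id-1', 'tags'])
```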