# Data Preprocessing

In this notebook I walk through some simple preprocessing of the News Category Dataset so that it does not need to be repeated in each later step. There is no single correct way to prepare data like this for search; the cleanup below is deliberately minimal and exists only to make processing in later steps simpler.

Link to data: https://www.kaggle.com/datasets/rmisra/news-category-dataset

In [3]:
import re
import uuid

import pandas as pd

In [2]:
# Load the data and add a uuid to every row.
# The uuids are not strictly necessary, but they help when building lookups in later stages.
# Note that uuid4 is random, so the ids change each time this cell is re-run.

data = pd.read_json("data/News_Category_Dataset_v3.json", orient='records', lines=True)
uuid_list = [str(uuid.uuid4()) for _ in range(len(data))]
data.insert(0, 'uuid', uuid_list)
print(data.shape)
data.head()


(209527, 7)


index,uuid,link,headline,category,short_description,authors,date
0,4bccf640-00c8-42fc-9b7d-45cef8199e17,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,3413a047-5ee5-4aa8-a05b-7bab1eae2077,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,be6d3df8-7bda-452e-9574-8530477abb64,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,33d6b4bc-593c-4786-931b-eebd3ab374f6,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,762ccc4c-a710-4c3a-b7b9-d786aaef8e81,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
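
To illustrate why the uuids are useful for lookups, here is a minimal sketch on a toy DataFrame. The `sample` frame, its headlines, and the `lookup` name are illustrative assumptions, not part of the dataset or of the later notebooks.

```python
import uuid

import pandas as pd

# Toy stand-in for the news DataFrame; the headlines are made up for illustration.
sample = pd.DataFrame({"headline": ["First story", "Second story", "Third story"]})
sample.insert(0, "uuid", [str(uuid.uuid4()) for _ in range(len(sample))])

# Map each uuid to its row (as a dict) so a later stage can resolve an id directly.
lookup = sample.set_index("uuid").to_dict(orient="index")

some_id = sample.loc[1, "uuid"]
print(lookup[some_id]["headline"])  # Second story
```

A dictionary keyed by uuid stays valid even if the DataFrame is later filtered or re-sorted, which is not true of positional row indices.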


In [4]:
%%time 

# Compile once rather than on every call. The raw string avoids the
# invalid-escape warning that '\s' triggers in a normal string literal.
# Keeps letters, digits, whitespace, and apostrophes; drops everything else.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s']")

def clean_text(txt: str) -> str:
    """Lowercase the text and remove special characters."""
    return SPECIAL_CHARS.sub('', txt.lower())

data['clean_headline'] = data['headline'].apply(clean_text)
# Note: this overwrites the original short_description column in place.
data['short_description'] = data['short_description'].apply(clean_text)

# Create a combined headline and description column for later search purposes.
data['combined_text'] = data['clean_headline'] + " " + data['short_description']


CPU times: user 3.57 s, sys: 22.8 ms, total: 3.59 s
Wall time: 3.59 s
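
As a quick sanity check of what the cleaning step does, the sketch below applies the same logic to a single made-up headline (the example string is an assumption, not a row from the dataset).

```python
import re

# Same character class as the cleaning cell: keep letters, digits, whitespace, apostrophes.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s']")

def clean_text(txt: str) -> str:
    """Lowercase the text and remove special characters."""
    return SPECIAL_CHARS.sub('', txt.lower())

example = "Biden's $1.9T Plan: What's In It?"
cleaned = clean_text(example)
print(cleaned)  # biden's 19t plan what's in it
```

Because removed characters are replaced with nothing rather than with a space, tokens joined by punctuation are merged: `1.9T` becomes `19t`, and a hyphenated term such as `COVID-19` would become `covid19`. Whether that is desirable depends on the search method used in later steps.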


In [7]:
data.to_json("data/preprocessed_data.jsonl", orient='records', lines=True)
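
Later steps can load the saved file with the mirror-image `pd.read_json` call. The sketch below shows the round trip on a tiny stand-in frame written to a temporary directory; the `sample` data and the file name `preprocessed_sample.jsonl` are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the preprocessed data; the values are illustrative.
sample = pd.DataFrame({
    "uuid": ["id-a", "id-b"],
    "combined_text": [
        "first headline first description",
        "second headline second description",
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_sample.jsonl")
    # Write one JSON object per line, as in the cell above.
    sample.to_json(path, orient="records", lines=True)
    # Read it back with the same orient/lines settings.
    restored = pd.read_json(path, orient="records", lines=True)

print(restored.shape == sample.shape)
```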