# Headline processing

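This notebook cleans article headlines (lowercasing, tokenising, lemmatising, and removing stop words and very short tokens) and writes a CSV containing the simplified tokens for each headline.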

## Imports

In [13]:
# Standard library
from collections import defaultdict

# Third-party (the NLTK tokenizer, stop word, WordNet, and POS tagger
# resources must already be downloaded, e.g. with nltk.download)
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

## Setup

In [14]:
# Show full headline text instead of truncating long cells
pd.set_option("display.max_colwidth", None)

## Data sourcing

In [15]:
articles = pd.read_csv("./data/articles.csv")

articles.head()

Unnamed: 0,title,description,link,source
0,From Gemma Collins in the diary room to jokes about hyperventilating into a paper bag: A Level students flood social media with hilarious memes as they collect their exam results,"The long-awaited decision day at the end of the summer holidays has arrived - and with it many emotions, from jubilation to disappointment to grudging acceptance.",https://www.dailymail.co.uk/news/article-12416495/From-Gemma-Collins-diary-room-jokes-hyperventilating-paper-bag-Level-students-flood-social-media-hilarious-memes-collect-exam-results.html?ns_mchannel=rss&ito=1490&ns_campaign=1490,Daily Mail
1,"How Michael Parkinson sparred with Muhammad Ali, fell out with Meg Ryan and enraged Helen Mirren: Talk show giant's greatest interviews as he dies at 88","Affectionately known simply as Parky, Sir Michael Parkinson was one of the world's greatest interviewers.",https://www.dailymail.co.uk/tvshowbiz/article-12416303/How-Michael-Parkinson-sparred-Muhammad-Ali-enraged-Helen-Mirren-fell-Meg-Ryan-Talk-giants-greatest-interviews-dies-88.html?ns_mchannel=rss&ito=1490&ns_campaign=1490,Daily Mail
2,Pupils fume they've been 'completely screwed over' by return to pre-Covid marking after top grades drop by 9% in a year - while universities start to run out of Clearing places as thousands miss out on their preferred choices,This year's pupils who received their A-level results today did not sit GCSE exams two years ago and were awarded teacher-assessed grades amid the pandemic.,https://www.dailymail.co.uk/news/article-12416943/Pupils-fume-theyve-completely-screwed-level-results.html?ns_mchannel=rss&ito=1490&ns_campaign=1490,Daily Mail
3,Moment Greek tennis star Stefanos Tsitsipas confronts woman in the crowd imitating a bee to put him off his stroke every time he serves…but he still wins the match,A woman has apologised after she was confronted by Stefanos Tsitsipas for imitating a bee in a bizarre incident during his match at the Cincinnati Masters.,https://www.dailymail.co.uk/news/article-12417041/Stefanos-Tsitsipas-confronts-woman-imitating-bee.html?ns_mchannel=rss&ito=1490&ns_campaign=1490,Daily Mail
4,The Reckoning FIRST LOOK: Steve Coogan transforms into depraved presenter Jimmy Savile ahead of new series after BBC delayed broadcast over 'fierce response from victims',"The first official still for the new Jimmy Saville drama, The Reckoning, has been released ahead of its broadcast later this year.",https://www.dailymail.co.uk/tvshowbiz/article-12416995/The-Reckoning-LOOK-Steve-Coogan-transforms-depraved-presenter-Jimmy-Savile-ahead-new-series-BBC-delayed-broadcast-fierce-response-victims.html?ns_mchannel=rss&ito=1490&ns_campaign=1490,Daily Mail


## Most frequent words

### Processing

In [16]:
# Subset the data

title_df = articles[["title", "source"]].copy()

In [17]:
title_df["keywords"] = title_df["title"].str.lower()

In [18]:
# Split into tokens

title_df["keywords"] = title_df["keywords"].apply(word_tokenize)

In [19]:

# Create an object that can be used to lemmatise

lemma = WordNetLemmatizer()

# Map the first letter of each Penn Treebank tag (as returned by pos_tag)
# to the part-of-speech code that the WordNet lemmatiser expects.

tag_map = defaultdict(lambda: "n")  # by default, assume nouns
tag_map["J"] = "a"  # adjectives
tag_map["V"] = "v"  # verbs
tag_map["R"] = "r"  # adverbs

# Tag a list of tokens with parts of speech and convert the tags into the
# codes the lemmatiser can interpret
def get_wordnet_tags(tokens):
    """Returns (token, WordNet pos) pairs for a list of tokens"""

    # Tag tokens with pos_tagger
    tagged_tokens = pos_tag(tokens)

    # Convert each tag to a version wordnet can understand, keyed on its first letter
    return [(word, tag_map[tag[0]]) for word, tag in tagged_tokens]
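
To make the tag mapping concrete, the sketch below applies the same `defaultdict` lookup to a few hand-written Penn Treebank tags instead of real `pos_tag` output, so it runs without any NLTK data. The example tokens and tags are illustrative assumptions, not taken from the dataset.

```python
from collections import defaultdict

# Same mapping as above: first letter of the Penn Treebank tag -> WordNet POS
tag_map = defaultdict(lambda: "n")  # anything unrecognised is treated as a noun
tag_map["J"] = "a"  # adjectives, e.g. JJ
tag_map["V"] = "v"  # verbs, e.g. VBP
tag_map["R"] = "r"  # adverbs, e.g. RB

# Hand-written (token, tag) pairs standing in for pos_tag output
example_tagged = [("pupils", "NNS"), ("fume", "VBP"), ("hilarious", "JJ"),
                  ("finally", "RB"), ("in", "IN")]

# Look up each tag by its first letter
wordnet_tagged = [(word, tag_map[tag[0]]) for word, tag in example_tagged]
print(wordnet_tagged)
```

Any tag whose first letter is not J, V, or R (here the preposition tag `IN`) falls through to the noun default.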

In [20]:
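# Tag each token with its part of speech

title_df["keywords"] = title_df["keywords"].apply(get_wordnet_tags)

# Lemmatise each (token, pos) pair

def lemmatise_tokens(tagged_tokens):
    """Returns the lemma of each (token, pos) pair"""
    return [lemma.lemmatize(word=word, pos=pos) for word, pos in tagged_tokens]

title_df["keywords"] = title_df["keywords"].apply(lemmatise_tokens)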

In [21]:
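# Filter out stop words and tokens of two characters or fewer
# (this also removes most punctuation tokens, which are short)

# Use a set for fast membership tests
stops = set(stopwords.words("english"))

# Add specific stopwords: the "n't" fragment left by tokenising contractions

stops.add("n't")

def filter_tokens(tokens):
    """Drops stop words and tokens shorter than three characters"""
    return [t for t in tokens
            if t not in stops
            and len(t) > 2]

title_df["keywords"] = title_df["keywords"].apply(filter_tokens)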
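
As a quick check of what the filter keeps, the sketch below runs the same list comprehension on a hand-written token list. The small `example_stops` set is a stand-in assumption for the NLTK stop word list, so the snippet runs without NLTK data.

```python
# Stand-in for the NLTK English stop word list (assumption for illustration)
example_stops = {"the", "and", "their", "n't"}

def filter_tokens(tokens, stops=example_stops):
    """Keeps tokens that are not stop words and are longer than two characters"""
    return [t for t in tokens if t not in stops and len(t) > 2]

# Hand-written tokens, including punctuation of different lengths
example_tokens = ["pupil", ",", "n't", "the", "go", "...", "exam", "their"]
kept = filter_tokens(example_tokens)
print(kept)  # ['pupil', '...', 'exam']
```

A three-character punctuation token such as `...` passes the filter, since nothing here tests for punctuation explicitly.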

In [22]:
# Remove apostrophes from the remaining tokens

title_df["keywords"] = title_df["keywords"].apply(lambda tokens: [x.replace("'", "") for x in tokens])
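
Because this step runs after the length filter, a token that was three characters long with its apostrophe can end up shorter than the threshold. The sketch below illustrates the effect with hand-written tokens. Whether such fragments actually occur depends on how the tokenizer splits contractions, which is an assumption here, and the helper names are hypothetical.

```python
def strip_apostrophes(tokens):
    """Removes apostrophe characters from every token"""
    return [t.replace("'", "") for t in tokens]

def drop_short(tokens):
    """Keeps tokens longer than two characters"""
    return [t for t in tokens if len(t) > 2]

# Hand-written tokens, including a contraction fragment
example_tokens = ["pupils", "'ve", "pass"]

# Order used in the cells above: filter by length, then strip apostrophes
filter_then_strip = strip_apostrophes(drop_short(example_tokens))

# Alternative order: strip apostrophes, then filter by length
strip_then_filter = drop_short(strip_apostrophes(example_tokens))

print(filter_then_strip)  # ['pupils', 've', 'pass']
print(strip_then_filter)  # ['pupils', 'pass']
```

If leftover fragments like `ve` show up in the output, swapping the two steps is one possible remedy.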

In [23]:
# Join token lists back into strings

title_df["keywords"] = title_df["keywords"].apply(lambda tokens: " ".join(tokens))

In [24]:
title_df.sample(3)

Unnamed: 0,title,source,keywords
125,Saeed Roustaee: Martin Scorsese backs director jailed in Iran for Cannes screening,BBC,saeed roustaee martin scorsese back director jail iran cannes screen
75,"Ireland star Ruesha Littlejohn delights in England knocking out Australia from the Women's World Cup after Aussie player went on holiday with her ex-girlfriend, leading to a handshake spat",Daily Mail,ireland star ruesha littlejohn delight england knock australia woman world cup aussie player holiday ex-girlfriend lead handshake spat
98,Jamie Foxx says he is 'finally starting to feel like' himself again after terrifying health scare that left him hospitalized for MONTHS: 'I can see the light',Daily Mail,jamie foxx say finally start feel like terrify health scare leave hospitalize month see light
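
Since this section is about the most frequent words, one way the `keywords` strings could be counted is sketched below using `collections.Counter`. The three keyword strings are made-up examples, not rows from the dataset, and this is only one possible approach.

```python
from collections import Counter

# Made-up keyword strings in the same space-separated format as the keywords column
example_keywords = ["exam result day", "exam grade drop", "result day meme"]

# Split each string back into tokens and count every occurrence
counts = Counter(token for row in example_keywords for token in row.split())

# Show the three most frequent tokens with their counts
print(counts.most_common(3))
```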


## Data export

In [25]:
title_df.to_csv("./data/processed_headlines.csv", index=False)