**News Classification via Headlines: Identifying and Classifying Sports-related and Crime-related News**

Each headline is assigned to one of three categories:

1. Sports-related News: News discussing sporting events and other sports-related activities.
2. Crime-related News: News discussing criminal activities or crime-related accidents.
3. Neutral: News that does not fall into either of the above categories.

Dataset: News headlines published over a period of nineteen years, sourced from the Australian Broadcasting Corporation (ABC), a reputable Australian news source.

Expected Outcomes: By the end of this notebook, we aim to have a trained and validated model capable of accurately classifying news into the specified categories on the basis of headlines. This classification can help automated systems filter news according to a user's preferences.

Note: The code and methodology presented in this notebook can be adapted to analyze other kinds of bias in the given data.

In [5]:
import pandas as pd
import numpy as np

data = pd.read_csv("data/news_headlines.csv")
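data.rename(columns={"headline_text": "text"}, inplace=True)
# Work on a random sample of 100,000 headlines and renumber the rows from 0
data = data.sample(n=100_000)
data.index = np.arange(len(data))
# Strip @mentions and the characters #, &, ! (raw string avoids invalid-escape warnings)
data['text'] = data['text'].replace(regex=r'(@\w+)|#|&|!', value='')
# Strip URLs
data['text'] = data['text'].replace(regex=r'http\S+', value='')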
# # Optional: keep only headlines with at least three words
# data = data[data.index.isin(
#     [i for i, s in enumerate(data.text) if len(s.split()) >= 3]
# )]
data

Unnamed: 0,publish_date,text
0,20051004,community surprise at development commission
1,20121012,pair hurt in hume freeway truck crash
2,20071218,removing sex abuse victims from community leaves
3,20101108,australian afghan troops kill taliban leader
4,20080411,lyons mhr not worried by threat of legal action
...,...,...
99995,20070517,google revamps search engine
99996,20160807,sunday august 7 full program
99997,20100105,moderate quake jolts taiwan
99998,20120607,mining under threat says bca
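
To make the cleaning step concrete, here is a minimal sketch of what the two patterns above remove from a single string. It uses the standard `re` module rather than pandas, on the assumption that the regex `replace` calls act on each string the way `re.sub` does. The helper name `clean_headline`, the sample strings, and the final whitespace-collapsing step are illustrative additions, not part of the original cell.

```python
import re

# The two patterns from the cell above, written as raw strings
NOISE = re.compile(r"(@\w+)|#|&|!")
URL = re.compile(r"http\S+")


def clean_headline(text: str) -> str:
    """Hypothetical helper mirroring the two replace() calls on one string."""
    text = NOISE.sub("", text)
    text = URL.sub("", text)
    # Extra step, not in the original cell: collapse leftover whitespace
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample strings (not taken from the ABC dataset)
samples = [
    "great win for @team_au! #cricket",
    "read more at http://example.com/story now",
    "q&a with the coach",
]

for s in samples:
    print(repr(s), "->", repr(clean_headline(s)))
```

Note that removing `&` without inserting a space joins the neighbouring characters (`q&a` becomes `qa`), which may or may not be desirable for headline text.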


In [6]:
from src.wordbiases import CalculateWordBias
from src.model import LikelihoodModelForNormalDist

target_set_1 = ["sports", "football", "athletics", "game"]
target_set_2 = ["crime", "murder", "theft", "violence"]
F = ["ADJ", "NOUN", "PRON"]
wbcalc = CalculateWordBias(target_set_1, target_set_2, F, computing_device="cuda")
wbcalc.process_documents(data, "text")
c1, c2 = wbcalc.calculate_target_embeddings()

wbiases, _, biased_words = wbcalc.calculate_biases()
total_pop = [b for _, b in biased_words]

mu = np.mean(total_pop)
sigma = np.std(total_pop)

# TODO: Find a way to compute or estimate the thresholds t1 and t2
# Split the population with cutoffs at 0.07 and 0.93 (a 7/86/7 split if these are quantiles)
likelihood_clf = LikelihoodModelForNormalDist(0.07, 0.93, threshold_limit=0)
likelihood_clf.fit_total_pop(total_pop)

preds = likelihood_clf.predict(wbiases)[0]
data["prclass"] = preds


100%|██████████| 100000/100000 [13:27<00:00, 123.86it/s]
100%|██████████| 100000/100000 [00:50<00:00, 1995.08it/s]
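
The internals of `LikelihoodModelForNormalDist` are not shown in this notebook, so the following is only a sketch of one way a three-way split could be derived from a normal fit. It assumes that 0.07 and 0.93 are quantiles, that the cutoffs `t1` and `t2` come from the inverse CDF of a normal distribution fitted to the bias scores, and that the lower tail maps to class 0, the upper tail to class 1, and the middle to class 2. The function name `classify_by_normal_quantiles` and the toy scores are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def classify_by_normal_quantiles(scores, population, lower_q=0.07, upper_q=0.93):
    """Hypothetical helper: label scores by tail membership under a fitted normal.

    Assumed mapping (not confirmed by the source):
    0 = lower tail, 1 = upper tail, 2 = middle.
    """
    # Fit a normal distribution to the population (pstdev matches np.std's default)
    dist = NormalDist(mean(population), pstdev(population))
    t1 = dist.inv_cdf(lower_q)  # lower cutoff
    t2 = dist.inv_cdf(upper_q)  # upper cutoff

    labels = []
    for s in scores:
        if s <= t1:
            labels.append(0)
        elif s >= t2:
            labels.append(1)
        else:
            labels.append(2)
    return labels, (t1, t2)


# Toy bias scores, symmetric around zero
population = [-3, -2, -1, 0, 0, 0, 1, 2, 3]
labels, (t1, t2) = classify_by_normal_quantiles([-3.0, 0.1, 2.7], population)
print(labels, round(t1, 3), round(t2, 3))
```

Under these assumptions the cutoffs follow directly from the fitted mean and standard deviation, which is one possible answer to the TODO above about estimating `t1` and `t2`.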


Sports-Related News Headlines

In [12]:
for text in data[data["prclass"] == 0].text:
    print(text)

tas country hour 14 may 2014
australians honoured on national day
why is senator scott ludlam standing down
lauren wells double act five minute break between games finals
murdoch appeal date set
sailing again
dokic out of fed cup tie
russell crowe to be baptised
ben simmons leads philadelphia 76ers to nba victory over dallas
deportivo flying high ahead of historic porto clash
wa harvest
budgies swarm in central australia stunning display
aussies finish fifth in euro nations cup
sydney water supply testing ok sca
densham wins bronze in dramatic triathlon finish
fighting bullying
chinese doping officials say no cover up of sun ban
aussies are under cooked says sa a coach
cow corner
hartcher visit
truckie protest
mack horton shayna jack israel folau
ko sharing lead at australian open golf
mcrae faces salford doom
collapsed rower stunned by team mates reaction
mcewen signs for australian team
interview ross lyon
salim buried amid battles in iraq
olympics on the line coronavirus looms games

Crime-Related News Headlines

In [13]:
for text in data[data["prclass"] == 1].text:
    print(text)

removing sex abuse victims from community leaves
lyons mhr not worried by threat of legal action
aloisi looking for redemption
police investigate brisbane airport death
court jails fraudster
cane suicide levy
men plead guilty to money laundering
13 y o on car stealing charges
drivers take note of police safety message
murder accused had concerns about investigation
creditors angry over new licence for westpoint
fraud couple
it is estimated the nsw fires has caused the death
man pleads guilty to drink drive charge after car
how dr brad mckay became the victim of online scammers
man charged with drug trafficking after police
police legitimately followed hickey
fuel thefts continue in bendigo region
goorjan denies bullets are bullies
gunman kills at least 8 in village rampage
ian baz bosch murder trial continues ben daly claims
mayfield man wanted over wickham park murder
police call off search for pilot
policeman of the year arrested for extortion
taliban gunmen launch coordinated attack

Neutral or Non-related News Headlines

In [14]:
for text in data[data["prclass"] == 2].text:
    print(text)

community surprise at development commission
pair hurt in hume freeway truck crash
australian afghan troops kill taliban leader
residents back hostel expansion snub
army base terrorism case exaggerated
summary first test day five
meatscapes
kanye adele lead grammy nominations
pakistan gets clearance for cup replacements
hawaii spared worst of two hurricanes
pms dept tried to suppress haneef documents lawyer
residents accuse developer of breaching foreshore
live tropical cyclone nathan category four system landfall
new pardoo station vision
iran hangs 'mossad agent' for scientist killing
sweet and sour citrus season
a beating issue
girl found safe in yeppoon
nationals senator resigns to join liberals
lockhart the game of chicken that devastated the dairy industry
organic tasmanian hazelnuts win top gong
rowling voted britains most influential woman
renowned anu school of art faces cuts amid funding shortfall
nt govt changes variable taxi boundaries
family films fire encroaching on house
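
Printing every headline per class makes it hard to judge how the classes are balanced. As a minimal sketch, the snippet below counts the predictions per class and keeps a few examples of each. It uses plain Python on a handful of (headline, label) pairs copied from the outputs above; the helper `summarize` and the `LABEL_NAMES` mapping are illustrative names. On the full DataFrame, `data["prclass"].value_counts()` would give the same counts.

```python
from collections import Counter, defaultdict

# Label mapping as used in the cells above
LABEL_NAMES = {0: "sports", 1: "crime", 2: "neutral"}

# A few (headline, predicted label) pairs copied from the outputs above
rows = [
    ("dokic out of fed cup tie", 0),
    ("ko sharing lead at australian open golf", 0),
    ("court jails fraudster", 1),
    ("fuel thefts continue in bendigo region", 1),
    ("aloisi looking for redemption", 1),
    ("sweet and sour citrus season", 2),
    ("girl found safe in yeppoon", 2),
]


def summarize(rows, max_examples=2):
    """Hypothetical helper: count labels and keep the first few examples of each."""
    counts = Counter(label for _, label in rows)
    examples = defaultdict(list)
    for text, label in rows:
        if len(examples[label]) < max_examples:
            examples[label].append(text)
    return counts, dict(examples)


counts, examples = summarize(rows)
for label in sorted(counts):
    print(f"{LABEL_NAMES[label]}: {counts[label]} headlines, e.g. {examples[label]}")
```

A summary like this also makes misclassifications easier to spot, such as the sports headline "aloisi looking for redemption" appearing in the crime class.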