<h1> Classifying news articles as business or political, based on their title </h1>

<h3> The goal of this notebook is to explore how one could use Tf-Idf and/or cosine similarity to classify news articles as business or political articles.</h3>

<p> The Notebook is structured in the following manner: </p>
<ol>
<li> Data Cleaning </li> 
<li> Data Analysis </li> 
<li> Model Building </li> 
<li> Model Experiments </li> 
</ol>

<h1 style="text-align:center"> Data cleaning </h1>

<p>Before we begin, we have to import all of our needed libraries. </p>

In [2]:
import numpy as np
import pandas as pd
import scipy as scp
import sklearn as sk
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
from collections import Counter
import re
import string
import contractions


<p>Now that we're ready, let's load the data.</p>

In [3]:
newsDataset = pd.read_csv('./ArticleTitlesCategoryDataset.csv')
newsDataset.head()

Unnamed: 0,Title,Category,Source
0,"As Democrats try to hold on in November, it’s ...",Politics,CNN
1,A Dizzying week for Trump’s legal issues,Politics,CNN
2,Nancy Pelosi did what Donald Trump failed to d...,Politics,CNN
3,Donald Trump told us all *exactly* what he was...,Politics,CNN
4,See why Republicans are trying to get you to f...,Politics,CNN


<p>Alright, the dataset is loaded successfully, now let's clean the data a bit. We'll do the following: </p>
<ol>
    <li> Turn all the contraction words into proper words (don't -> do not, I'd -> I would). </li>
    <li> Remove all the punctuation from the title. </li>
    <li> Tokenize the title. </li>
</ol>

In [17]:
EnglishStopwords = set(stopwords.words('english'))

Titles = newsDataset['Title']
CleanedTitles = []
Lemmatizer = WordNetLemmatizer()
for title in Titles:
    NoContractionsTitle = contractions.fix(title)
    NoPunctuationTitle = NoContractionsTitle.translate(str.maketrans('','', string.punctuation)) 
    TokenizedTitle = word_tokenize(NoPunctuationTitle) 
    
    cleanTitle = []
    for word in TokenizedTitle:
        if word not in EnglishStopwords:
            LemmatizedVerb = Lemmatizer.lemmatize(word.lower(), pos='v')
            LemmatizedNoun = Lemmatizer.lemmatize(LemmatizedVerb, pos='n')
            if re.match('\W+|\d+', LemmatizedNoun) == None: 
                cleanTitle.append(LemmatizedNoun)
                
    cleanTitle = " ".join(cleanTitle)
    
    CleanedTitles.append(cleanTitle)


['a democrat try hold november pete buttigieg demand campaign trail',
 'a dizzy week trump legal issue',
 'nancy pelosi donald trump fail january',
 'donald trump tell u exactly go november',
 'see republican try get focus wokeness',
 'why may know control senate november',
 'liz cheney dire warn future election',
 'pakistan summon u ambassador biden call country dangerous nuclear weapon',
 'kansa democrat deliver surprise win abortion right november show whether',
 'first cnn biden zero abortion right dnc event week election day',
 'five takeaway georgia senate debate',
 'giuliani lawyer submit witness list upcoming dc attorney discipline hear',
 'obama campaign michigan georgia final week midterm election',
 'january panel ask secret service information contact agent oath keeper member',
 'the latino voter shift come focus south texas',
 'why may know control senate november',
 'the central tension drive election',
 'independent candidate upend oregon race governor give gop open',
 '

In [5]:
CleanedDataset = pd.DataFrame(columns=['CleanTitle', 'Category', 'Source'])
CleanedDataset['CleanTitle'] = CleanedTitles
CleanedDataset['Category'] = newsDataset['Category']
CleanedDataset['Source'] = newsDataset['Source']

CleanedDataset.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,a democrat try hold november pete buttigieg de...,Politics,CNN
1,a dizzy week trump ’ legal issue,Politics,CNN
2,nancy pelosi donald trump fail january 6,Politics,CNN
3,donald trump tell u exactly go november 2020,Politics,CNN
4,see republican try get focus wokeness,Politics,CNN


<p>Great ! We now have a clean titles dataset ! We can now proceed to the data analysis part!</p>

<h1 style="text-align:center"> Data Analysis </h1>

<p>Let's find the most common words in the titles of political articles. </p>
<p> First, we'll filter the data by the politics category. </p>

In [6]:
PoliticalArticles = CleanedDataset[CleanedDataset['Category'] == 'Politics']
PoliticalArticles.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,a democrat try hold november pete buttigieg de...,Politics,CNN
1,a dizzy week trump ’ legal issue,Politics,CNN
2,nancy pelosi donald trump fail january 6,Politics,CNN
3,donald trump tell u exactly go november 2020,Politics,CNN
4,see republican try get focus wokeness,Politics,CNN


<p>Now that we have the political articles only, we can proceed with the extraction of the most common words in the titles. </p>

In [9]:
Titles = "".join(PoliticalArticles['CleanTitle'])
TitlesTokenized = word_tokenize(Titles)
fdist = FreqDist(TitlesTokenized)
fdist.most_common(20)

[('trump', 10),
 ('’', 8),
 ('november', 5),
 ('senate', 5),
 ('‘', 5),
 ('say', 5),
 ('democrat', 4),
 ('campaign', 4),
 ('u', 4),
 ('republican', 4),
 ('control', 4),
 ('warn', 4),
 ('biden', 4),
 ('election', 4),
 ('6', 4),
 ('gop', 4),
 ('tory', 4),
 ('try', 3),
 ('week', 3),
 ('pelosi', 3)]

<p>We see that "trump", "november", and "senate" are the most common words in the political articles' titles. In fact we could see that the most distinct political words longer than 3 letters are mentioned at least 4 times, so everything that fits those criteria we could interpret as a political article indicator.  </p>

<p> Let's do the same for the business articles! </p>

In [10]:
BusinessArticles = CleanedDataset[CleanedDataset['Category'] == 'Business']
BusinessArticles.head(5)

Unnamed: 0,CleanTitle,Category,Source
59,trevor milton founder nikola find guilty fraud,Business,CNN
60,more 1 4 french gas station least one fuel,Business,CNN
61,these retail chain may survive recession,Business,CNN
62,it scary time hollywood but horror studio behi...,Business,CNN
63,last chance lock nearly 10 return save,Business,CNN


In [11]:
BusinessTitles = "".join(BusinessArticles['CleanTitle'])
BusinessTitlesTokenized = word_tokenize(BusinessTitles)
fdist = FreqDist(BusinessTitlesTokenized)
fdist.most_common(20)

[('’', 13),
 ('‘', 4),
 ('end', 3),
 ('get', 3),
 ('billion', 3),
 ('1', 2),
 ('time', 2),
 ('hit', 2),
 ('make', 2),
 ('chance', 2),
 ('lock', 2),
 ('nearly', 2),
 ('10', 2),
 ('return', 2),
 ('lose', 2),
 ('largest', 2),
 ('worth', 2),
 ('battery', 2),
 ('could', 2),
 ('rat', 2)]