<h1> Classifying news articles as business or political, based on their title </h1>

<h3> The goal of this notebook is to explore how one could use Tf-Idf and/or cosine similarity to classify news articles as business or political articles.</h3>

<p> The Notebook is structured in the following manner: </p>
<ol>
<li> Data Cleaning </li> 
<li> Data Analysis </li> 
<li> Model Building </li> 
<li> Model Experiments </li> 
</ol>

<h1 style="text-align:center"> Data cleaning </h1>

<p>Before we begin, we have to import all of our needed libraries. </p>

In [2]:
import numpy as np
import pandas as pd
import scipy as scp
import sklearn as sk
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.probability import FreqDist
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import re
import string
import contractions


<p>Now that we're ready, let's load the data.</p>

In [3]:
newsDataset = pd.read_csv('./ArticleTitlesCategoryDataset.csv')
newsDataset.head()

Unnamed: 0,Title,Category,Source
0,"As Democrats try to hold on in November, it’s ...",Politics,CNN
1,A Dizzying week for Trump’s legal issues,Politics,CNN
2,Nancy Pelosi did what Donald Trump failed to d...,Politics,CNN
3,Donald Trump told us all *exactly* what he was...,Politics,CNN
4,See why Republicans are trying to get you to f...,Politics,CNN


<p>Alright, the dataset is loaded successfully, now let's clean the data a bit. We'll do the following: </p>
<ol>
    <li> Turn all the contraction words into proper words (don't -> do not, I'd -> I would). </li>
    <li> Remove all the punctuation from the title. </li>
    <li> Tokenize the title. </li>
</ol>

In [4]:
EnglishStopwords = set(stopwords.words('english'))

Titles = newsDataset['Title']
CleanedTitles = []
Lemmatizer = WordNetLemmatizer()
for title in Titles:
    NoContractionsTitle = contractions.fix(title)
    NoPunctuationTitle = NoContractionsTitle.translate(str.maketrans('','', string.punctuation)) 
    TokenizedTitle = word_tokenize(NoPunctuationTitle) 
    
    cleanTitle = []
    for word in TokenizedTitle:
        if word not in EnglishStopwords:
            LemmatizedVerb = Lemmatizer.lemmatize(word.lower(), pos='v')
            LemmatizedNoun = Lemmatizer.lemmatize(LemmatizedVerb, pos='n')
            if re.match('\W+|\d+', LemmatizedNoun) == None: 
                cleanTitle.append(LemmatizedNoun)
                
    cleanTitle = " ".join(cleanTitle)
    
    CleanedTitles.append(cleanTitle)


In [5]:
CleanedDataset = pd.DataFrame(columns=['CleanTitle', 'Category', 'Source'])
CleanedDataset['CleanTitle'] = CleanedTitles
CleanedDataset['Category'] = newsDataset['Category']
CleanedDataset['Source'] = newsDataset['Source']

CleanedDataset.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,a democrat try hold november pete buttigieg de...,Politics,CNN
1,a dizzy week trump legal issue,Politics,CNN
2,nancy pelosi donald trump fail january,Politics,CNN
3,donald trump tell u exactly go november,Politics,CNN
4,see republican try get focus wokeness,Politics,CNN


<p>Great ! We now have a clean titles dataset ! We can now proceed to the data analysis part!</p>

<h1 style="text-align:center"> Data Analysis </h1>

<p>Let's find the most common words in the titles of political articles. </p>
<p> First, we'll filter the data by the politics category. </p>

In [6]:
PoliticalArticles = CleanedDataset[CleanedDataset['Category'] == 'Politics']
PoliticalArticles.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,a democrat try hold november pete buttigieg de...,Politics,CNN
1,a dizzy week trump legal issue,Politics,CNN
2,nancy pelosi donald trump fail january,Politics,CNN
3,donald trump tell u exactly go november,Politics,CNN
4,see republican try get focus wokeness,Politics,CNN


<p>Now that we have the political articles only, we can proceed with the extraction of the most common words in the titles. </p>

In [7]:
Titles = "".join(PoliticalArticles['CleanTitle'])
TitlesTokenized = word_tokenize(Titles)
fdist = FreqDist(TitlesTokenized)
PoliticalMostCommonWords = fdist.most_common(10)
PoliticalMostCommonWords

[('trump', 10),
 ('senate', 5),
 ('say', 5),
 ('democrat', 4),
 ('campaign', 4),
 ('u', 4),
 ('republican', 4),
 ('control', 4),
 ('warn', 4),
 ('biden', 4)]

<p>We see that "trump", "senate", "say", "democrat", and "campaign" are some of the most common words in the political articles' titles. We could interpret those as a political article indicator.  </p>

<p> Let's do the same for the business articles! </p>

In [8]:
BusinessArticles = CleanedDataset[CleanedDataset['Category'] == 'Business']
BusinessArticles.head(5)

Unnamed: 0,CleanTitle,Category,Source
59,trevor milton founder nikola find guilty fraud,Business,CNN
60,more french gas station least one fuel,Business,CNN
61,these retail chain may survive recession,Business,CNN
62,it scary time hollywood but horror studio behi...,Business,CNN
63,last chance lock nearly return save,Business,CNN


In [9]:
BusinessTitles = "".join(BusinessArticles['CleanTitle'])
BusinessTitlesTokenized = word_tokenize(BusinessTitles)
fdist = FreqDist(BusinessTitlesTokenized)
BusinessMostCommonWords = fdist.most_common(10)
BusinessMostCommonWords

[('cut', 6),
 ('feed', 5),
 ('time', 4),
 ('warn', 4),
 ('go', 4),
 ('truss', 4),
 ('end', 3),
 ('lose', 3),
 ('largest', 3),
 ('first', 3)]

<p>Again, we see that "cut", "feed", "time", "warn", and "go" are some of the most common words in the business articles' titles. We could interpret those as a business article indicator.  </p>

<h1 style="text-align:center"> Model Building </h1>

<p>What we could begin with is to find the most meaningful words in an article's title, using TF-IDF(term freuqnecy - document inverse frequency). </p>

<p>The formula for the term frequency will tell us how frequent a certain term is in the article title that we're looking at. The formua is the following: </p> 

$ TF(t,d) = \frac{t}{d} $, where $\bold{t}$ is the occurence of the term we're looking for in a said article title, and $\bold{d}$ is the amount of words in the target article title.

<p> The next thing we've got to do is lay out the formula for the Inverse-document frequency (IDF). It tells us how common(doesn't carry much meaning) or rare(it carries a lot of meaning/it's specific) a certain word is in a given article title. </p>

$ IDF(n,d) = \log(\frac{n}{d})$, where $\bold{n}$ is the amount of articles and $\bold{d}$ is the amount of times the term is seen across all article titles.  

<p>The way we'll get a quantifier as to how relevant a certain word is to a certain article title is to use both TF and IDF together. We'll multiply them, that way we'll have a formula that looks like that: </p>

$ TF-IDF = TF(t,d) * IDF(n,d) = \frac{t}{d} * \log(\frac{n}{d})$

<p> The way we'll quantify if a word is relevant to the certain article title or not is the following:</p>
<ol>
<li>if TF-IDF returns a high number, then that will mean that the word is relevant and vice-versa.</li> 
<li>A low TF score would mean that the target term/word is not that frequent. On the other hand, a high TF score would mean that the word is very frequent in a certain text/article. </li>
<li> In the case of interpreting an IDF score, a low score would mean higher relevance of the word with respect to the given text/article titles, and a high score would indicate the opposite. </li>
</ol> 

<p>Next we'll create two TfIdfVectorizers, using nltk - one for the political articles and one for the business articles. </p>

In [32]:
politicalVectorizer = TfidfVectorizer()
political_TFIDF_matrix = politicalVectorizer.fit_transform(PoliticalArticles['CleanTitle']).toarray()
political_TFIDF_df = pd.DataFrame(political_TFIDF_matrix, columns=politicalVectorizer.get_feature_names())
political_TFIDF_df



Unnamed: 0,abortion,activist,admins,afraid,ag,agent,ally,ambassador,aoc,apply,...,why,win,wisconsin,wise,without,witness,wokeness,would,ye,zero
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.459044,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.424146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.349662,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.308459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.308459,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.30061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.331114


<p> Now those are our TF-IDF values for the political articles. It's so big, let's remove all the zero values, so that we can get to the essence of what the dataset is tellign us. </p>

In [43]:
political_TFIDF_df_zeroes_removed = political_TFIDF_df.where(political_TFIDF_df!=0)
political_TFIDF_df_zeroes_removed.head(10)

Unnamed: 0,academic,adhesive,adsupported,african,ai,ambani,america,amid,announce,another,...,world,worry,worth,xi,year,yen,yet,young,zeroemissions,zte
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,0.345836,,,,,,,,,,...,0.345836,,,,,,,,,
8,,,,,,,0.460466,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


<p>That certainly is a lot more readable. Let's do the same for the business articles.</p>

In [40]:
businessVectorizer = TfidfVectorizer()
businessArticlesMatrix = businessVectorizer.fit_transform(BusinessArticles['CleanTitle']).toarray()
businessArticlesMatrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [46]:
business_TFIDF_df = pd.DataFrame(businessArticlesMatrix, columns = businessVectorizer.get_feature_names())
business_TFIDF_df.head(10)



Unnamed: 0,academic,adhesive,adsupported,african,ai,ambani,america,amid,announce,another,...,world,worry,worth,xi,year,yen,yet,young,zeroemissions,zte
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.345836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.345836,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.460466,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [49]:
business_TFIDF_df_zeroes_removed = business_TFIDF_df.where(business_TFIDF_df != 0)
business_TFIDF_df_zeroes_removed.head(10)

Unnamed: 0,academic,adhesive,adsupported,african,ai,ambani,america,amid,announce,another,...,world,worry,worth,xi,year,yen,yet,young,zeroemissions,zte
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,0.345836,,,,,,,,,,...,0.345836,,,,,,,,,
8,,,,,,,0.460466,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,
