<h1> Classifying news articles as business or political, based on their titles </h1>

<h3> The goal of this notebook is to explore how TF-IDF and/or cosine similarity can be used to classify news articles as business or political articles.</h3>

<p> The notebook is structured as follows: </p>
<ol>
<li> Data Cleaning </li> 
<li> Data Analysis </li> 
<li> Model Building </li> 
<li> Model Experiments </li> 
</ol>

<h1 style="text-align:center"> Data Cleaning </h1>

<p>Before we begin, we have to import the libraries we need. </p>

In [8]:
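import numpy as np
import pandas as pd
import scipy as scp
import sklearn as sk
import matplotlib.pyplot as plt
import nltk
# The NLTK imports below rely on data that may need a one-time nltk.download(...):
# 'stopwords', 'wordnet', and 'punkt' (or 'punkt_tab' on newer NLTK releases).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
import string
import contractions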


<p>Now that we're ready, let's load the data.</p>

In [3]:
newsDataset = pd.read_csv('./ArticleTitlesCategoryDataset.csv')
newsDataset.head()

Unnamed: 0,Title,Category,Source
0,"As Democrats try to hold on in November, it’s ...",Politics,CNN
1,A Dizzying week for Trump’s legal issues,Politics,CNN
2,Nancy Pelosi did what Donald Trump failed to d...,Politics,CNN
3,Donald Trump told us all *exactly* what he was...,Politics,CNN
4,See why Republicans are trying to get you to f...,Politics,CNN


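<p>The dataset loaded successfully, so let's clean the titles. For each title, we'll do the following: </p>
<ol>
    <li> Expand contractions into full words (don't -> do not, I'd -> I would). </li>
    <li> Remove all the punctuation from the title. </li>
    <li> Tokenize the title. </li>
    <li> Drop English stopwords. </li>
    <li> Lowercase and lemmatize each remaining word, first as a verb and then as a noun. </li>
    <li> Join the resulting words back into a single string. </li>
</ol>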

In [20]:
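# NLTK's English stopword list (all lowercase).
EnglishStopwords = set(stopwords.words('english'))

# Table that deletes ASCII punctuation; typographic marks such as ’ are not in string.punctuation.
PunctuationTable = str.maketrans('', '', string.punctuation)

Titles = newsDataset['Title']
CleanedTitles = []
Lemmatizer = WordNetLemmatizer()
for title in Titles:
    NoContractionsTitle = contractions.fix(title)  # e.g. don't -> do not
    NoPunctuationTitle = NoContractionsTitle.translate(PunctuationTable)
    TokenizedTitle = word_tokenize(NoPunctuationTitle)

    cleanTitle = []
    for word in TokenizedTitle:
        # This check runs before lowercasing, so capitalized stopwords such as "As" are kept.
        if word not in EnglishStopwords:
            # Lemmatize as a verb first, then as a noun (e.g. dizzying -> dizzy, issues -> issue).
            LemmatizedVerb = Lemmatizer.lemmatize(word.lower(), pos='v')
            LemmatizedNoun = Lemmatizer.lemmatize(LemmatizedVerb, pos='n')
            cleanTitle.append(LemmatizedNoun)

    # Store each cleaned title as one space-separated string.
    CleanedTitles.append(" ".join(cleanTitle))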


In [21]:
CleanedDataset = pd.DataFrame(columns=['CleanTitle', 'Category', 'Source'])
CleanedDataset['CleanTitle'] = CleanedTitles
CleanedDataset['Category'] = newsDataset['Category']
CleanedDataset['Source'] = newsDataset['Source']

CleanedDataset.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,as democrats try hold november pete buttigieg ...,Politics,CNN
1,a dizzy week trump ’ legal issue,Politics,CNN
2,nancy pelosi donald trump fail january 6,Politics,CNN
3,donald trump tell us exactly go november 2020,Politics,CNN
4,see republicans try get focus wokeness,Politics,CNN
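
<p>The cleaned titles above still contain a leading "as" and "a" and a stray ’ character. The sketch below shows one possible way to remove that kind of residue using only the standard library. It is an illustration, not the notebook's pipeline: <code>SAMPLE_STOPWORDS</code>, <code>CURLY_TO_ASCII</code> and <code>strip_residue</code> are hypothetical names, the stopword set is a tiny stand-in for NLTK's list, and lemmatization is left out.</p>

```python
import re
import string

# Assumption: a tiny illustrative stopword set; the notebook uses NLTK's full English list.
SAMPLE_STOPWORDS = {"a", "as", "for", "the", "to"}

# Map typographic quotes to their ASCII equivalents so later steps can treat them uniformly.
CURLY_TO_ASCII = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

# Table that deletes every ASCII punctuation character.
ASCII_PUNCTUATION = str.maketrans("", "", string.punctuation)


def strip_residue(title):
    """Normalize quotes, lowercase, drop possessives and punctuation, then filter stopwords."""
    normalized = title.translate(CURLY_TO_ASCII).lower()
    # Simplification: drop a trailing possessive 's. In the notebook this would need to
    # run after contractions are expanded, otherwise "it's" would lose its verb.
    normalized = re.sub(r"'s\b", "", normalized)
    no_punctuation = normalized.translate(ASCII_PUNCTUATION)
    # Lowercasing happened first, so "A" and "As" now match the lowercase stopword set.
    return " ".join(word for word in no_punctuation.split() if word not in SAMPLE_STOPWORDS)


print(strip_residue("A Dizzying week for Trump\u2019s legal issues"))
# dizzying week trump legal issues
```

<p>The key design choice is ordering: lowercasing and quote normalization happen before the stopword and punctuation filters, so those filters see the forms they were built for.</p>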


<p>Great! We now have a dataset of cleaned titles and can proceed to the data analysis.</p>

<h1 style="text-align:center"> Data Analysis </h1>

<p>Let's find the most common words in the titles of political articles. </p>
<p> First, we'll filter the data by the politics category. </p>

In [None]:
PoliticalArticles = CleanedDataset[CleanedDataset['Category'] == 'Politics']
PoliticalArticles.head(5)

Unnamed: 0,CleanTitle,Category,Source
0,as democrats try hold november pete buttigieg ...,Politics,CNN
1,a dizzy week trump ’ legal issue,Politics,CNN
2,nancy pelosi donald trump fail january 6,Politics,CNN
3,donald trump tell us exactly go november 2020,Politics,CNN
4,see republicans try get focus wokeness,Politics,CNN


<p>With only the political articles left, we can extract the most common words in their titles. As a quick check first, the lemmatizer should reduce an inflected verb to its base form: </p>

In [None]:
Lemmatizer.lemmatize('trying', pos='v')

'try'
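
<p>Counting the most common words could then be sketched as follows. This is a minimal illustration using <code>collections.Counter</code>; <code>sample_titles</code> and <code>most_common_words</code> are hypothetical names, and the sample holds only rows 1 to 4 of the political titles shown above (row 0 is truncated in the display, so it is left out).</p>

```python
from collections import Counter

# Cleaned political titles copied from rows 1-4 of the output above.
sample_titles = [
    "a dizzy week trump \u2019 legal issue",
    "nancy pelosi donald trump fail january 6",
    "donald trump tell us exactly go november 2020",
    "see republicans try get focus wokeness",
]


def most_common_words(titles, top_n):
    """Return the top_n (word, count) pairs across all space-separated titles."""
    word_counts = Counter()
    for title in titles:
        # Cleaned titles are already space-joined tokens, so split() recovers them.
        word_counts.update(title.split())
    return word_counts.most_common(top_n)


print(most_common_words(sample_titles, 2))
# [('trump', 3), ('donald', 2)]
```

<p>On the full data, <code>PoliticalArticles['CleanTitle']</code> could be passed in place of <code>sample_titles</code>, since iterating over a pandas Series yields its values.</p>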