### Overview
Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.

Given the vast number of news articles in circulation, identifying and organizing articles by topic is useful. It can help you sift through the available information to find the news relevant to your interests, or even serve as the basis for a news recommendation engine.

The News International is the largest English language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a [Kaggle Dataset of The News International articles](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles) is provided in the workspace.

In this project, I will use **term frequency-inverse document frequency (tf-idf)** to analyze each article’s content and uncover the terms that best describe each article, providing quick insight into each article’s topic.

### Project Goal
Analyze news articles using tf-idf, an unsupervised term-weighting technique from natural language processing (NLP), to surface the terms that best describe each article.

### Text Preprocessing
Before text can be processed by an NLP model, it needs to be pre-processed, that is, cleaned and prepared so the model sees consistent input.

Cleaning and prepping tasks:
- **Noise removal** is a text pre-processing step concerned with removing unnecessary formatting from our text.
- **Tokenization** is a text pre-processing step devoted to breaking up text into smaller units (usually words or discrete terms).
- **Normalization** is the name we give most other text pre-processing tasks, including stemming, lemmatization, upper- and lowercasing, and stop-word removal.
   - **Stemming** is the normalization task focused on removing word affixes.
   - **Lemmatization** is the normalization task that more carefully brings words down to their root forms.
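
As a rough illustration of the first steps above, the following sketch runs noise removal, tokenization, and stop-word removal using only the standard library. The stop-word set and the `toy_preprocess` name are assumptions made for this example; the notebook itself relies on NLTK's tokenizer, stop-word list, and lemmatizer.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "with", "has", "just", "a", "of"}

def toy_preprocess(text):
    # Noise removal: replace runs of non-word characters with a space, then lowercase
    cleaned = re.sub(r"\W+", " ", text).lower()
    # Tokenization: split on whitespace
    tokens = cleaned.split()
    # Normalization (partial): drop stop words
    return [token for token in tokens if token not in TOY_STOP_WORDS]

print(toy_preprocess("SYDNEY: Cricket fever has gripped Australia!"))
# ['sydney', 'cricket', 'fever', 'gripped', 'australia']
```

Lemmatization is left out here on purpose, since it needs a dictionary such as WordNet to map `gripped` to `grip`.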

In [1]:
import re
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

In [2]:
stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

In [3]:
## Part-of-speech tagging function
def get_part_of_speech(word):
    # Look up every WordNet synset (word sense) recorded for the word
    probable_part_of_speech = wordnet.synsets(word)
    # Counter that will hold the number of synsets per part of speech
    pos_counts = Counter()
    # Count the synsets that fall under each tag
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])  # noun
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])  # verb
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])  # adjective
    pos_counts["r"] = len([item for item in probable_part_of_speech if item.pos() == "r"])  # adverb

    # Pick the tag with the highest count, e.g. "n" for noun
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]

    return most_likely_part_of_speech
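
One detail worth noting is what happens when WordNet has no entry for a word: every count is zero, and `Counter.most_common` keeps tied elements in the order they were first inserted, so the function falls back to `"n"`. The sketch below mimics that logic on plain lists of tags; `most_likely_tag` and its input lists are hypothetical stand-ins for the synset lookup.

```python
from collections import Counter

def most_likely_tag(tags):
    # tags: hypothetical list of WordNet-style POS tags, one per synset of a word
    pos_counts = Counter()
    for tag in ("n", "v", "a", "r"):  # same insertion order as the notebook function
        pos_counts[tag] = tags.count(tag)
    return pos_counts.most_common(1)[0][0]

print(most_likely_tag(["v", "v", "n"]))  # v
print(most_likely_tag([]))               # n (all counts tie at zero)
```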

In [4]:
def preprocess_text(text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    cleaned = re.sub(r'\d+', ' ', cleaned)
    tokenized = word_tokenize(cleaned)
    tokenized_no_stopwords = [word for word in tokenized if word not in stop_words]
    normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized_no_stopwords])
    return normalized

In [5]:
## Articles Analysis
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from articles import articles

In [6]:
# articles sample
articles[5]

'SYDNEY: Cricket fever has gripped Australia with the World Cup just days away. Fans from around the world have thronged to the country and hotels are capitalising. Prices of rooms have almost doubled to 300 dollars and hotels are experiencing full bookings. Experts estimate that during the mega event Australia will generate 1.5 million US dollars just from hotel bookings. If the cost of internal air travel, taxis and tickets is taken into consideration, Australia stands to generate two million US dollars during the World Cup.'

In [7]:
## Preprocess the articles
processed_articles = [preprocess_text(article) for article in articles]
processed_articles[5]

'sydney cricket fever grip australia world cup day away fan around world throng country hotel capitalise price room almost double dollar hotel experience full book expert estimate mega event australia generate million u dollar hotel book cost internal air travel taxi ticket take consideration australia stand generate two million u dollar world cup'

### Calculate tf-idf scores

### Method 1: TfidfVectorizer

In [8]:
# norm=None turns off row normalization, so each score is the raw term frequency times idf
corpus_vectorizer = TfidfVectorizer(norm=None)
# Learn the vocabulary and idf weights, then return a sparse matrix of tf-idf scores
corpus_tfidf_scores = corpus_vectorizer.fit_transform(processed_articles)

In [9]:
print(corpus_tfidf_scores)

  (0, 167)	2.2992829841302607
  (0, 67)	2.7047480922384253
  (0, 61)	2.7047480922384253
  (0, 291)	2.7047480922384253
  (0, 48)	5.4094961844768505
  (0, 113)	2.7047480922384253
  (0, 185)	2.7047480922384253
  (0, 52)	2.7047480922384253
  (0, 242)	2.7047480922384253
  (0, 298)	2.01160091167848
  (0, 197)	2.7047480922384253
  (0, 3)	2.2992829841302607
  (0, 56)	2.01160091167848
  (0, 195)	2.2992829841302607
  (0, 51)	2.7047480922384253
  (0, 165)	2.7047480922384253
  (0, 43)	2.7047480922384253
  (0, 50)	2.7047480922384253
  (0, 35)	5.4094961844768505
  (0, 143)	2.7047480922384253
  (0, 207)	2.7047480922384253
  (0, 65)	2.7047480922384253
  (0, 1)	2.7047480922384253
  (0, 231)	2.7047480922384253
  (0, 157)	5.4094961844768505
  :	:
  (9, 162)	2.7047480922384253
  (9, 13)	2.7047480922384253
  (9, 235)	2.7047480922384253
  (9, 91)	2.7047480922384253
  (9, 105)	2.7047480922384253
  (9, 274)	2.7047480922384253
  (9, 112)	2.7047480922384253
  (9, 312)	5.4094961844768505
  (9, 119)	2.70474809223
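
The repeated values in this output can be traced back to the smoothed idf that scikit-learn documents as its default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1. With `norm=None`, each score is simply the term count times that idf. The sketch below assumes the corpus holds 10 articles (the row indices run from 0 to 9); `smoothed_idf` is a helper name introduced here, not part of scikit-learn's API.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

N_DOCS = 10  # assumption: ten articles in the corpus

# idf for a term found in one, two, or three articles
for doc_freq in (1, 2, 3):
    print(doc_freq, round(smoothed_idf(N_DOCS, doc_freq), 6))

# Score of a term that occurs twice in the only article containing it
print(round(2 * smoothed_idf(N_DOCS, 1), 6))
```

The results (2.704748, 2.299283, 2.011601, and 5.409496) line up with the values printed in the matrix.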

### Method 2: Bag-of-Words to tf-idf

First, create a bag-of-words model, then convert it into tf-idf scores using scikit-learn's **TfidfTransformer**.


In [10]:
vectorizer = CountVectorizer()
# Learn the vocabulary and return a sparse bag-of-words (BoW) matrix of word counts
bow_matrix = vectorizer.fit_transform(processed_articles)

In [11]:
# Store the BoW vectors in a DataFrame
# Get the feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()
# Create a label for each article
articles_index = [f"Article {i+1}" for i in range(len(articles))]
# Build the DataFrame with terms as rows and articles as columns:
# .T transposes the matrix and todense() converts it from sparse to dense
df_bag_of_words = pd.DataFrame(bow_matrix.T.todense(), index=feature_names, columns=articles_index)
df_bag_of_words

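| Term | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| abide | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| accord | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| add | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| agency | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| would | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| year | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| yi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |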
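
Conceptually, a bag-of-words matrix is just a per-document word count over a shared vocabulary. A minimal standard-library sketch, using two made-up documents rather than the articles, might look like this (the notebook uses `CountVectorizer`, which also handles tokenization and sparse storage):

```python
from collections import Counter

# Hypothetical preprocessed documents, not taken from the dataset
toy_docs = ["world cup world", "cup fever"]

# One Counter of word counts per document
counts = [Counter(doc.split()) for doc in toy_docs]
# Shared, alphabetically sorted vocabulary
vocabulary = sorted(set().union(*counts))

# Rows are terms, columns are documents (same layout as the table above)
term_document = {term: [c[term] for c in counts] for term in vocabulary}
for term, row in term_document.items():
    print(term, row)
# cup [1, 1]
# fever [0, 1]
# world [2, 0]
```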


In [12]:
# save the bag of words to csv file
df_bag_of_words.to_csv('data/articles_bow.csv')

In [13]:
# convert the bag-of-words model to tf-idf
transformer = TfidfTransformer(norm=None)
# Learn the idf weights from the counts and return the tf-idf scores
tfidf_scores = transformer.fit_transform(bow_matrix)

In [14]:
# Store the scores from TfidfTransformer in a DataFrame
df_tfidf_scores = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=articles_index)
df_tfidf_scores

| Term | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0.000000 | 0.000000 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| abide | 2.704748 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| accord | 0.000000 | 0.000000 | 2.704748 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| add | 2.299283 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| agency | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 8.114244 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| would | 0.000000 | 0.000000 | 0.000000 | 2.299283 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| year | 0.000000 | 2.704748 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| yi | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 5.409496 |
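
To make the conversion step concrete, the sketch below applies the weighting by hand to two rows copied from the bag-of-words table. It assumes scikit-learn's default smoothed idf (written out in the code comment) and no normalization, matching `norm=None`; `counts_to_tfidf` is a helper written for this example, not a library function.

```python
import math

# Two rows copied from the bag-of-words table above (counts per article)
bow_rows = {
    "add":   [1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "world": [0, 0, 0, 0, 0, 3, 0, 0, 0, 0],
}

def counts_to_tfidf(row):
    n_docs = len(row)
    # Document frequency: number of articles that contain the term
    doc_freq = sum(1 for count in row if count > 0)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    # Unnormalized tf-idf: raw count times idf
    return [round(count * idf, 6) for count in row]

for term, row in bow_rows.items():
    print(term, counts_to_tfidf(row))
```

The non-zero entries come out as 2.299283 for `add` and 8.114244 for `world`, the same values shown in the tf-idf table.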


In [15]:
# Store the scores from TfidfVectorizer in a DataFrame, using that vectorizer's own vocabulary
df_corpus_tfidf_scores = pd.DataFrame(corpus_tfidf_scores.T.todense(), index=corpus_vectorizer.get_feature_names_out(), columns=articles_index)
df_corpus_tfidf_scores

| Term | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0.000000 | 0.000000 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| abide | 2.704748 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| accord | 0.000000 | 0.000000 | 2.704748 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| add | 2.299283 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| agency | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 8.114244 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| would | 0.000000 | 0.000000 | 0.000000 | 2.299283 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| year | 0.000000 | 2.704748 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| yi | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 5.409496 |


In [16]:
# Save the TfidfVectorizer scores to a CSV file
df_corpus_tfidf_scores.to_csv('data/articles_corpus_tfidf_scores.csv')
# Save the TfidfTransformer scores to a CSV file
df_tfidf_scores.to_csv('data/articles_tfidf_scores.csv')

### Score Comparison
The check below confirms that the tf-idf scores given by **TfidfTransformer** and **TfidfVectorizer** are the same.

In [17]:
if np.allclose(corpus_tfidf_scores.todense(), tfidf_scores.todense()):
    print("Are the tf-idf scores the same?: yes")
else:
    print("Are the tf-idf scores the same?: NO, something is wrong:(")

Are the tf-idf scores the same?: yes
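
`np.allclose` is used rather than exact equality because floating-point results from two code paths may differ in the last few bits. Per NumPy's documentation it checks |a - b| <= atol + rtol * |b| element-wise, with defaults `rtol=1e-05` and `atol=1e-08`. A standard-library sketch of that test on flat lists (the `all_close` helper is written for this example only and does not broadcast like NumPy does):

```python
def all_close(a, b, rtol=1e-05, atol=1e-08):
    # Same tolerance rule as np.allclose, applied to two equal-length lists
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )

# Values that differ only in the last digits count as equal
print(all_close([2.7047480922384253, 5.4094961844768505],
                [2.704748092238425, 5.409496184476851]))  # True
# Genuinely different scores do not
print(all_close([2.704748], [2.299283]))  # False
```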


### Results Analysis
A simple way of identifying the “topic” of a document is to label the document with its highest-scoring tf-idf term. While this is a more naive approach than others, it is a quick and easy way of getting insight into the topic of a document.

In [18]:
# Label each article with its highest-scoring tf-idf term (its "topic")
# pandas.Series.idxmax() returns the index label (here, the term) of the maximum value
topics = [df_tfidf_scores[f'Article {i+1}'].idxmax() for i in range(len(articles_index))]
# Create a DataFrame of topics indexed by article
df_articles_topics = pd.DataFrame({'Topic':topics}, index=articles_index)
df_articles_topics

| Article | Topic |
|---|---|
| Article 1 | fare |
| Article 2 | hong |
| Article 3 | sugar |
| Article 4 | petrol |
| Article 5 | engine |
| Article 6 | australia |
| Article 7 | car |
| Article 8 | railway |
| Article 9 | cabinet |
| Article 10 | china |
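
Note that `idxmax` returns the first label holding the maximum, so when several terms tie, the label depends on row order (alphabetical here, since the vocabulary is sorted). Python's built-in `max` behaves the same way, which the sketch below uses to illustrate the point with made-up scores:

```python
# Hypothetical tf-idf scores for one article, keyed by term in alphabetical order
scores = {"australia": 8.114244, "cup": 5.409496, "hotel": 8.114244, "world": 8.114244}

# max() returns the first key that reaches the maximum value
topic = max(scores, key=scores.get)
print(topic)  # australia

# Listing every tied term shows what a single label hides
tied = [term for term, score in scores.items() if score == scores[topic]]
print(tied)  # ['australia', 'hotel', 'world']
```

Reporting the top few terms per article, rather than only the first, is one way to make the labels less sensitive to such ties.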


Using tf-idf, an unsupervised term-weighting technique, we can gain quick insight into the topics of news articles without reading them in full.
For example, the top-scoring term for Article 6 is `australia`, which fits the article's subject: the Cricket World Cup in Australia.

In [19]:
# Save the topic of the respective articles into csv file
df_articles_topics.to_csv('data/articles_topics.csv')