# Quick way to find topics of newspaper articles

In this notebook, I examine newspaper articles and try to find out what topic each one deals with. I use term frequency-inverse document frequency (tf-idf) to score the terms in each article; the highest-scoring terms describe the article best and give a quick insight into its topic.



The dataset comes from [Kaggle](https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles) and contains business and sports news articles from 2015 to September 2016. Each record holds the heading of the article, its content, its date, and a news type label (business or sports). The content also includes the place from which the article was published.

Import necessary libraries:

In [1]:
import nltk, re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
import pandas as pd 
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

Read the data in a pandas dataframe:

In [2]:
df = pd.read_csv('Articles.csv', encoding = 'unicode_escape')

Inspect the first five rows of the DataFrame:

In [3]:
df.head()

| | Article | Date | Heading | NewsType |
|---|---|---|---|---|
| 0 | KARACHI: The Sindh government has decided to b... | 1/1/2015 | sindh govt decides to cut public transport far... | business |
| 1 | HONG KONG: Asian markets started 2015 on an up... | 1/2/2015 | asia stocks up in new year trad | business |
| 2 | HONG KONG: Hong Kong shares opened 0.66 perce... | 1/5/2015 | hong kong stocks open 0.66 percent lower | business |
| 3 | HONG KONG: Asian markets tumbled Tuesday follo... | 1/6/2015 | asian stocks sink euro near nine year | business |
| 4 | NEW YORK: US oil prices Monday slipped below $... | 1/6/2015 | us oil prices slip below 50 a barr | business |


In [4]:
df.shape

(2692, 4)

In [5]:
df.describe()

| | Article | Date | Heading | NewsType |
|---|---|---|---|---|
| count | 2692 | 2692 | 2692 | 2692 |
| unique | 2584 | 666 | 2581 | 2 |
| top | strong>ISLAMABAD: The International Monetary F... | 8/1/2016 | Tokyo stocks rise in early trade on weaker yen... | sports |
| freq | 5 | 27 | 5 | 1408 |


### Preprocess article data

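To tokenize and lemmatize the article text, we write a preprocessing function. We also load NLTK's English stop-word list, although the function as written does not filter stop words out, so they remain in the processed text: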

In [6]:
stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

In [7]:
def get_part_of_speech(word):
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
    pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
    pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
    pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech
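
One detail of `get_part_of_speech` is worth spelling out: when WordNet returns no synsets for a token, all four counts are zero. The sketch below uses only the standard library and a hypothetical helper, `most_likely_pos`, that takes a plain list of tags instead of querying WordNet. It illustrates that `Counter.most_common` breaks ties by insertion order, so unknown tokens fall back to `"n"` (noun).

```python
from collections import Counter

def most_likely_pos(pos_tags):
    """Hypothetical helper: pick the most frequent tag from a list of POS tags."""
    pos_counts = Counter()
    # Same insertion order as in get_part_of_speech: n, v, a, r
    for tag in ("n", "v", "a", "r"):
        pos_counts[tag] = sum(1 for t in pos_tags if t == tag)
    return pos_counts.most_common(1)[0][0]

print(most_likely_pos(["v", "v", "n"]))  # verb has the highest count
print(most_likely_pos([]))               # all counts zero: first inserted tag wins
print(most_likely_pos(["a", "v"]))       # tie between "v" and "a": "v" was inserted first
```

In practice this means that any token unknown to WordNet is lemmatized as a noun.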

In [8]:
def preprocess_text(text):
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokenized = word_tokenize(cleaned)
    normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized if not re.match(r'\d+',token)])
    return normalized
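
The stop-word list loaded earlier is not used inside `preprocess_text`. If stop words should be dropped, one option is to filter the tokens before joining them. The following is a minimal sketch, not the notebook's code: it uses a tiny hand-written set (`demo_stop_words`, an assumption standing in for `stopwords.words('english')`) and plain whitespace tokenization so that it runs without NLTK data.

```python
import re

# Assumption: a tiny stand-in for stopwords.words('english')
demo_stop_words = {"the", "to", "by", "in", "has", "of"}

def remove_stop_words(text, stop_words):
    """Lowercase the text, replace non-word characters with spaces, drop stop words."""
    cleaned = re.sub(r'\W+', ' ', text).lower()
    tokens = cleaned.split()
    return " ".join(token for token in tokens if token not in stop_words)

sample = "The Sindh government has decided to bring down public transport fares"
print(remove_stop_words(sample, demo_stop_words))
```

In the notebook, the same condition (`token not in stop_words`) could be added to the list comprehension in `preprocess_text`. scikit-learn's vectorizers also accept a `stop_words='english'` argument. Either change would alter the tf-idf results shown in the rest of the notebook.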

Convert the Article column to a list:

In [9]:
articles = df.Article.tolist()

Apply the pre-processing function:

In [10]:
processed_articles = [preprocess_text(article) for article in articles]

Print the original and the processed first article to see the difference:

In [11]:
articles[0]

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n'

In [12]:
processed_articles[0]

'karachi the sindh government have decide to bring down public transport fare by per cent due to massive reduction in petroleum product price by the federal government geo news report source say reduction in fare will be applicable on public transport rickshaw taxi and other mean of travel meanwhile karachi transport ittehad kti have refuse to abide by the government decision kti president irshad bukhari say the commuter be charge the low fare in karachi a compare to other part of the country add that vehicle run on compress natural gas cng bukhari say karachi transporter will cut fare when decrease in cng price will be make'
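
Comparing the two strings, both `7` and `80pc` are missing from the processed text. That follows from `re.match(r'\d+', token)`, which only requires the token to start with a digit. A short standard-library check, with `re.fullmatch` shown as one possible alternative if only purely numeric tokens should be dropped:

```python
import re

tokens = ["7", "80pc", "cng", "2015"]

# re.match anchors at the start only, so "80pc" is treated like a number
starts_with_digit = [t for t in tokens if re.match(r'\d+', t)]

# re.fullmatch requires the whole token to consist of digits
only_digits = [t for t in tokens if re.fullmatch(r'\d+', t)]

print(starts_with_digit)  # ['7', '80pc', '2015']
print(only_digits)        # ['7', '2015']
```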

### Calculate Tf-idf Scores


Initialize and fit CountVectorizer:

In [13]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

Transform word counts for each article into tf-idf scores:

In [14]:
transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)
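
To make these numbers less opaque, here is a small pure-Python sketch of the computation. It assumes scikit-learn's documented defaults for `TfidfTransformer` (`smooth_idf=True`, `sublinear_tf=False`), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. With `norm=None`, the score is simply the raw term count multiplied by that idf. The toy corpus and the function name `tfidf_unnormalized` are made up for illustration, and whitespace splitting stands in for the vectorizer's tokenization.

```python
import math
from collections import Counter

def tfidf_unnormalized(docs):
    """Return one {term: score} dict per document, using tf * smoothed idf."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenized]

toy_docs = ["oil price fall", "oil price rise", "cricket match win"]
toy_scores = tfidf_unnormalized(toy_docs)
print(toy_scores[0])
```

In the first toy document, `fall` scores higher than `oil` or `price` because it appears in only one document, which is the effect the notebook relies on to surface topic words.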

Now we want to confirm whether TfidfTransformer gives the same results as TfidfVectorizer:

In [15]:
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)

In [16]:
# check whether the scores are equal:
if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
    print(pd.DataFrame({'Are the tf-idf scores the same?':['YES']}))
else:
    print(pd.DataFrame({'Are the tf-idf scores the same?':['No, something is wrong :(']}))

  Are the tf-idf scores the same?
0                             YES


### Find the highest-scoring tf-idf item

An easy way to find the general topic of an article is to look at its highest-scoring tf-idf term.


To get a general idea of the data, we first convert the SciPy sparse matrices to pandas DataFrames:

In [17]:
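# get vocabulary of terms
try:
    feature_names = vectorizer.get_feature_names_out()  # scikit-learn >= 1.0
except AttributeError:
    feature_names = vectorizer.get_feature_names()  # older scikit-learn versions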

In [18]:
# get article index
article_index = [f"Article {i+1}" for i in range(len(articles))]

In [19]:
# create a pandas DataFrame with the tf-idf scores (terms as rows, articles as columns);
# tfidf_scores_transformed holds the same values, as the check above confirmed
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
print(df_tf_idf.head())

              Article 1  Article 2  Article 3  Article 4  Article 5  \
__cf_email__        0.0        0.0        0.0        0.0        0.0   
a300                0.0        0.0        0.0        0.0        0.0   
a320                0.0        0.0        0.0        0.0        0.0   
a321                0.0        0.0        0.0        0.0        0.0   
a330                0.0        0.0        0.0        0.0        0.0   

              Article 6  Article 7  Article 8  Article 9  Article 10  ...  \
__cf_email__        0.0        0.0        0.0        0.0         0.0  ...   
a300                0.0        0.0        0.0        0.0         0.0  ...   
a320                0.0        0.0        0.0        0.0         0.0  ...   
a321                0.0        0.0        0.0        0.0         0.0  ...   
a330                0.0        0.0        0.0        0.0         0.0  ...   

              Article 2683  Article 2684  Article 2685  Article 2686  \
__cf_email__           0.0           0

In [20]:
# get highest scoring tf-idf term for each article
for num in range(1, 10):
    print(df_tf_idf[[f'Article {num}']].idxmax())


Article 1    fare
dtype: object
Article 2    percent
dtype: object
Article 3    hong
dtype: object
Article 4    the
dtype: object
Article 5    oil
dtype: object
Article 6    arabia
dtype: object
Article 7    kse
dtype: object
Article 8    ang
dtype: object
Article 9    sugar
dtype: object
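
A single top term can be uninformative, as `the` for Article 4 shows. Looking at the few highest-scoring terms per article usually gives a clearer picture. Here is a minimal sketch with `heapq.nlargest`; the scores are invented toy values, not taken from the dataset, and `top_terms` is a hypothetical helper:

```python
import heapq

def top_terms(term_scores, k=3):
    """Return the k terms with the highest score, best first."""
    return heapq.nlargest(k, term_scores, key=term_scores.get)

# Invented toy scores for one article (term -> tf-idf score)
toy_article_scores = {"the": 9.1, "stock": 8.4, "euro": 7.9, "market": 6.2, "say": 2.0}
print(top_terms(toy_article_scores))  # ['the', 'stock', 'euro']
```

With the DataFrame built above, `df_tf_idf['Article 4'].nlargest(3)` should give the equivalent result for a real article.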
