# News Analysis

The News International is the largest English-language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a Kaggle dataset of The News International articles (https://www.kaggle.com/asad1m9a9h6mood/news-articles) is provided in the workspace.

In this project, we will use "term frequency-inverse document frequency" (tf-idf) to analyze each article’s content and uncover the terms that best describe it, providing quick insight into its topic.


## Data Investigation

In [67]:
import re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
import pandas as pd
import numpy as np
from articles import articles
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer 


In [68]:
# View article
articles[0]

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling. Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.'

In [69]:
stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

def preprocess_text(text):
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized if not re.match(r'\d+',token)])
  return normalized

In [70]:
# Preprocess articles
processed_articles = [preprocess_text(article) for article in articles]
processed_articles[0]

'karachi the sindh government have decide to bring down public transport fare by per cent due to massive reduction in petroleum product price by the federal government geo news report source say reduction in fare will be applicable on public transport rickshaw taxi and other mean of travel meanwhile karachi transport ittehad kti have refuse to abide by the government decision kti president irshad bukhari say the commuter be charge the low fare in karachi a compare to other part of the country add that vehicle run on compress natural gas cng bukhari say karachi transporter will cut fare when decrease in cng price will be make'
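Note that "80pc" from the original article no longer appears in the processed text. The cleaning and filtering steps of `preprocess_text` can be traced without NLTK to see why. The following is a minimal sketch, not the notebook's actual pipeline: it assumes plain `str.split()` in place of `word_tokenize` and skips lemmatization. The `sample` string is a fragment of the first article.

```python
import re

# Fragment taken verbatim from the first article
sample = "adding that 80pc vehicles run on Compressed Natural Gas (CNG)."

# Same cleaning step as preprocess_text: collapse non-word characters, lowercase
cleaned = re.sub(r'\W+', ' ', sample).lower()

# Assumption: whitespace splitting stands in for nltk's word_tokenize
tokens = cleaned.split()

# Same filter as preprocess_text: re.match anchors at the start of the token,
# so any token that begins with a digit is dropped, not only pure numbers
kept = [token for token in tokens if not re.match(r'\d+', token)]

print(kept)
```

Because the filter removes every token that starts with a digit, mixed tokens such as "80pc" are discarded along with plain numbers.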

## Tf-idf Scores Calculation

We begin our analysis with simple word counts for each article.

In [71]:
# Initialize CountVectorizer
vectorizer = CountVectorizer() 

Next, we fit and transform our vectorizer on `processed_articles` to get the word counts for each article.

In [88]:
# Fit the vectorizer and get the word counts
counts = vectorizer.fit_transform(processed_articles)

<10x353 sparse matrix of type '<class 'numpy.float64'>'
	with 516 stored elements in Compressed Sparse Row format>

In [73]:
# Get vocabulary of terms
feature_names = vectorizer.get_feature_names_out()

In [74]:
# Get article index
article_index = [f"Article {i+1}" for i in range(len(articles))]

In [75]:
# Create pandas DataFrame with word counts
df_word_counts = pd.DataFrame(counts.T.todense(), index=feature_names, columns=article_index)
df_word_counts

Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0,0,0,1,0,0,0,0,0,0
abide,1,0,0,0,0,0,0,0,0,0
about,0,0,0,0,0,0,1,0,0,0
accord,0,0,1,0,0,0,0,0,0,0
add,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
world,0,0,0,0,0,3,0,0,0,0
would,0,0,0,1,0,0,0,0,1,0
year,0,1,0,0,0,0,0,0,0,0
yi,0,0,0,0,0,0,0,0,0,2


Now that we have the word counts for each article, let’s convert them into tf-idf scores. We fit and transform a `TfidfTransformer` on `counts` and save the resulting tf-idf scores to a variable named `tfidf_scores_transformed`.

In [76]:
# Initialize and fit TfidfTransformer
transformer = TfidfTransformer(norm=None) 
tfidf_scores_transformed = transformer.fit_transform(counts)

In [77]:
# Create pandas DataFrame(s) with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores_transformed.T.todense(), index=feature_names, columns=article_index)
df_tf_idf

Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496
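
The values in the table above are consistent with scikit-learn's default smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count, with no row normalization because `norm=None`. The sketch below reproduces a few entries by hand. It assumes the defaults `smooth_idf=True` and `sublinear_tf=False`; the helper name `smoothed_idf` is made up for illustration.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 10  # the corpus contains 10 articles

# "abide" occurs in one article, "add" in two
idf_one_doc = smoothed_idf(n_documents, 1)
idf_two_docs = smoothed_idf(n_documents, 2)

# "world" occurs three times in Article 6 and in no other article
world_score = 3 * idf_one_doc

print(round(idf_one_doc, 6))   # 2.704748
print(round(idf_two_docs, 6))  # 2.299283
print(round(world_score, 6))   # 8.114244
```

A term that is confined to one article and repeated there, such as "world", therefore scores highest, which is what the topic labelling later relies on.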


Now we have our tf-idf scores for each article. To confirm that `TfidfTransformer` gives the same results as using `TfidfVectorizer` directly, we initialize a `TfidfVectorizer` object and fit and transform it on `processed_articles`, which calculates the tf-idf scores for each article in one step. We save the resulting scores to a variable named `tfidf_scores`.

In [78]:
# Initialize and fit TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)

In [79]:
# Create pandas DataFrame(s) with tf-idf scores
df_tf_idf = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=article_index)
df_tf_idf


Unnamed: 0,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496


Let’s confirm that the tf-idf scores given by `TfidfTransformer` and `TfidfVectorizer` are the same.

In [80]:
# Check if tf-idf scores are equal
if np.allclose(tfidf_scores_transformed.todense(), tfidf_scores.todense()):
  print(pd.DataFrame({'Are the tf-idf scores the same?':['YES']}))
else:
  print(pd.DataFrame({'Are the tf-idf scores the same?':['No, something is wrong :(']}))

  Are the tf-idf scores the same?
0                             YES


## Results Analysis

A simple way of identifying the “topic” of a document is to label the document with its highest-scoring tf-idf term. While this is a more naive approach than others, it is a quick and easy way of getting insight into the topic of a document, so we are going to use it for our current purposes.

In [81]:
# Get topic
for i in range(1, 11):
  print(df_tf_idf[[f'Article {i}']].idxmax())

Article 1    fare
dtype: object
Article 2    hong
dtype: object
Article 3    sugar
dtype: object
Article 4    petrol
dtype: object
Article 5    engine
dtype: object
Article 6    australia
dtype: object
Article 7    car
dtype: object
Article 8    railway
dtype: object
Article 9    cabinet
dtype: object
Article 10    china
dtype: object
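
The labelling step can also be sketched in pure Python on a toy corpus, which makes it easier to see what picking the highest-scoring term does. This is an illustrative sketch only: the three `toy_docs` strings are hypothetical and not taken from the dataset, and the scoring assumes the same smoothed idf that scikit-learn uses by default.

```python
import math
from collections import Counter

# Toy, already-preprocessed documents (hypothetical, not from the dataset)
toy_docs = [
    "sugar price drop sugar mill",
    "fare cut transport fare price",
    "railway pay rise railway transport",
]

# Raw term counts per document
term_counts = [Counter(doc.split()) for doc in toy_docs]
n_docs = len(toy_docs)

# Number of documents in which each term appears
doc_freq = Counter(term for counts in term_counts for term in counts)

def tf_idf(count, df):
    """Raw count times smoothed idf, without normalization."""
    return count * (math.log((1 + n_docs) / (1 + df)) + 1)

# Label each document with its highest-scoring term
top_terms = []
for counts in term_counts:
    scores = {term: tf_idf(c, doc_freq[term]) for term, c in counts.items()}
    top_terms.append(max(scores, key=scores.get))

print(top_terms)  # ['sugar', 'fare', 'railway']
```

Because every term is scored on its own, multi-word names are split, which is why Article 2 is labelled "hong" rather than "hong kong".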


Let's compare these topics with the original texts.

For Article 1, the topic was "fare".

In [82]:
articles[0]

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling. Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.'

For Article 2, the topic was "hong".

In [83]:
articles[1]

'HONG KONG:  Hong Kong shares opened 0.66 percent lower Monday following a tepid lead from Wall Street, as the first full week of the new year kicked off. The benchmark Hang Seng Index dipped 158.63 points to 23,699.19.'

For Article 3, the topic was "sugar".

In [84]:
articles[2]

'KARACHI: Wholesale market rates for sugar dropped to less than Rs 50 per kg following the resumption of sugar cane crushing by sugar mills in Sindh. Within two days, the rate dropped by Rs 1.70 to Rs 49.80 per kg in Karachi Whole Sale Market. According to dealers, the resumption of sugar cane crushing by the mills stabilised the supply to the market with an immediate effect on price as well. Industry experts said that the quality of sugar cane is excellent in Sindh and approximately 100 kg of sugar cane can produce 11 kg of sugar.'

For Article 8, the topic was "railway".

In [85]:
articles[7]

'LAHORE: Federal Minister for Railways, Khawaja Saad Rafique Tuesday announced good news of pay-raise for the employees of Pakistan Railways. In a media statement, the Minister disclosed that a summary for increase in salaries for the employees of Pakistan Railways has been forwarded to the Prime Minister. He also said that the government had also chalked out a plan to build houses for the Railways workers. Khawaja Saad Rafique said it was expected that the salaries of Railway Police may witness a jump of 20 percent. He also announced the government\x92s plan to launch a new train service between Karachi and Islamabad.'

## Conclusion

We used "term frequency-inverse document frequency" (tf-idf) to analyze each article’s content and uncover the terms that best describe it, labelling each document with its highest-scoring tf-idf term. While this is a crude approach to topic definition, it gives quick insight into the articles’ meaning. For at least some articles (for example, "sugar" for Article 3 and "railway" for Article 8), the top-scoring term corresponds to the actual topic.