# Read the News Analysis

A Codecademy practice project from the <a href='https://www.codecademy.com/learn/paths/data-science'>Data Science Path</a> Natural Language Processing (NLP) Course, Term Frequency-Inverse Document Frequency Section.

## Overview

Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.<br>
<br>
Given the vast amount of news articles in circulation, identifying and organizing articles by topic is a useful activity. This can help you sift through the enormous amount of information out there so you can find the news relevant to your interests, or even allow you to build a news recommendation engine!<br>
<br>
The <a href="https://www.thenews.com.pk/">News International</a> is the largest English language newspaper in Pakistan, covering local and international news across a variety of sectors. A selection of articles from a <a href="https://www.kaggle.com/asad1m9a9h6mood/news-articles">Kaggle Dataset of The News International articles</a> is provided in the workspace.<br>
<br>
In this project I used term frequency-inverse document frequency (tf-idf) to analyze each article’s content and uncover the terms that best describe each article, providing quick insight into each article’s topic.

### Project Goal:

Analyze news documents using TF-IDF, an NLP technique, to uncover each article's topic.

### Project Requirements

Be familiar with:
- Python3
- NLP (Natural Language Processing)
<br><br>
- The Python Libraries:
    - Pandas
    - NLTK
    - scikit-learn (sklearn)
    

### Link:
    

<a href='https://www.alex-ricciardi.com/post/read-the-news-analysis'>My Project Blog Presentation</a>

<h2 style='color : MediumBlue'>Text Pre-processing</h2>

Before text can be processed by an NLP model, it needs to be pre-processed.
Text pre-processing is the process of cleaning and preparing text data for use by NLP models.

Cleaning and preparation tasks:
- Noise removal is a text pre-processing step concerned with removing unnecessary formatting from our text.
- Tokenization is a text pre-processing step devoted to breaking up text into smaller units (usually words or discrete terms).
- Normalization is the name we give most other text pre-processing tasks, including stemming, lemmatization, upper and lowercasing, and stopwords removal.
    - Stemming is the normalization pre-processing task focused on removing word affixes.
    - Lemmatization is the normalization pre-processing task that more carefully brings words down to their root forms.
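
As a rough illustration of these three steps, the sketch below uses only Python's standard library. The stop-word set and the `naive_stem()` and `toy_preprocess()` helpers are hypothetical stand-ins written for this example; the project itself relies on NLTK's stop-word list and WordNet lemmatizer.

```python
import re

# Hypothetical, tiny stop-word set (NLTK ships a much larger list)
TOY_STOP_WORDS = {"the", "is", "are", "a", "an", "and", "of", "to", "in"}

def naive_stem(token):
    """Toy stemmer: strips a few common suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def toy_preprocess(text):
    # Noise removal: drop HTML-like tags, then replace non-word characters with spaces
    no_tags = re.sub(r"<[^>]+>", " ", text)
    cleaned = re.sub(r"\W+", " ", no_tags)
    # Tokenization: split on whitespace
    tokens = cleaned.split()
    # Normalization: lowercase, remove stop words, stem
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in TOY_STOP_WORDS]
    return [naive_stem(token) for token in kept]

print(toy_preprocess("<p>The hotels are raising prices!</p>"))
# ['hotel', 'rais', 'price']
```

The truncated `rais` shows why plain suffix stripping is crude, and why lemmatization, which maps a word to a real root form, is usually the more careful choice.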

#### Libraries:

In [1]:
# Regex
import re
# Natural Language Toolkit - https://www.nltk.org/ -
import nltk
# Lexical database of English
from nltk.corpus import wordnet
# Stop words 
from nltk.corpus import stopwords
# Tokenization into words
from nltk.tokenize import word_tokenize
# lemmatization class
from nltk.stem import WordNetLemmatizer
# Counter Dictionary class - https://docs.python.org/3/library/collections.html#collections.Counter -
from collections import Counter

Note:<br>
NLTK comes with data packages (corpora, toy grammars, trained models, etc.).<br>
To install the data, first <a href='http://www.nltk.org/install.html'>install NLTK</a>, then <a href='http://www.nltk.org/data.html'>install the NLTK data</a>.

<b>Initialization:</b>
- of stop words from the English language
- of the text normalizer

In [2]:
stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

<h3 style='color : DarkMagenta'>Part-of-Speech Tagging</h3>

<a href="https://nlp.stanford.edu/software/tagger.shtml#:~:text=A%20Part%2DOf%2DSpeech%20Tagger,like%20'noun%2Dplural'.">Part-of-Speech Tagging</a> is the process of reading text in some language and assigning a part of speech, such as noun, verb, or adjective, to each word (and other token).

#### Lemmatization

To improve the performance of <a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">lemmatization</a> (reducing a word to its root form), each word in the processed text is assigned a part-of-speech tag, such as noun, verb, or adjective.

#### Part-of-Speech tagging function:

The ```get_part_of_speech()``` function:
- Takes the arguments:
    - ```word```, string data type.<br>
<br>
- Looks up the WordNet synsets (sets of synonyms) that contain ```word```
- Counts the part-of-speech tags of those synsets.<br> 
<br>
- Returns the most common tag (the tag with the highest count, e.g. n for noun), string data type.

In [3]:
def get_part_of_speech(word):
    # Looks up the WordNet synsets (sets of synonyms) for the word
    probable_part_of_speech = wordnet.synsets(word)
    # Initializing Counter class object
    pos_counts = Counter()
    # Tagging and counting tags
    pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  ) # Noun
    pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  ) # Verb
    pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  ) # Adjective
    pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  ) # Adverb
    # The most common tag, the tag with the highest count, ex: n for Noun 
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    
    return most_likely_part_of_speech
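
One detail of this function is how ties are resolved. `Counter.most_common()` keeps elements with equal counts in the order they were first inserted, so a word with no synsets at all (every count is 0) falls back to `"n"`. The sketch below reproduces only the counting logic on a plain list of tags; `most_likely_tag()` is a hypothetical helper for illustration and does not query WordNet.

```python
from collections import Counter

def most_likely_tag(tags):
    """Hypothetical helper: `tags` stands in for the pos() value of each synset."""
    pos_counts = Counter()
    # Same insertion order as get_part_of_speech(): n, v, a, r
    for tag in ("n", "v", "a", "r"):
        pos_counts[tag] = tags.count(tag)
    return pos_counts.most_common(1)[0][0]

print(most_likely_tag(["v", "v", "n"]))  # v (verb synsets dominate)
print(most_likely_tag([]))               # n (all counts are 0, first key wins)
```

In practice this means unknown words, such as many proper nouns, are lemmatized as nouns.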

<h3 style='color : DarkMagenta'> Pre-processing text main function:</h3>

The ```preprocess_text()``` function:
- Takes the arguments:
    - ```text```, string data type.<br>
<br>
- Cleans ```text``` 
- Tokenizes ```text```
- Normalizes ```text```<br> 
<br>
- Returns the ```normalized``` text, pre-processed text, string data type.

In [4]:
def preprocess_text(text):
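    # Noise removal: replaces runs of non-word characters with a space, then lowercases
    cleaned = re.sub(r'\W+', ' ', text).lower()
    # Tokenization into words
    tokenized = word_tokenize(cleaned)
    # Removes stopwords
    tokenized_no_stopwords = [word for word in tokenized if word not in stop_words]
    # Lemmatizes each token with its most likely part of speech, skipping tokens that start with a digit
    normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token))
                           for token in tokenized_no_stopwords if not re.match(r'\d+', token)])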
    return normalized
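
To see what the two regular expressions in ```preprocess_text()``` do in isolation, here is a small sketch using only the `re` module. The sample string is made up, and `str.split()` is used as a simple stand-in for `word_tokenize`.

```python
import re

sample = "Prices: 300 dollars, 1st-class & U.S. fans!"  # made-up sample text

# \W+ matches runs of non-word characters (anything other than letters, digits, underscore)
cleaned = re.sub(r"\W+", " ", sample).lower()
print(cleaned)

# re.match anchors at the start of the token only, so any token that
# begins with a digit is dropped, not just purely numeric tokens
tokens = cleaned.split()
kept = [token for token in tokens if not re.match(r"\d+", token)]
print(kept)
# ['prices', 'dollars', 'class', 'u', 's', 'fans']
```

Note that abbreviations written with periods are split into single letters by the cleaning step, and that `1st` is removed along with `300`.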

<h2 style='color : MediumBlue'>Articles Analysis</h2>

The analysis' goal is to uncover the terms that best describe each article and predict each article's topic.

#### Libraries

In [5]:
import pandas as pd
import numpy as np
# Convert a collection of text documents to a matrix of token counts, Bag-of-Words
from sklearn.feature_extraction.text import CountVectorizer
# Convert a collection of raw documents to a matrix of tf-idf scores, and BoW matrix to tf-idf scores
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
# Data 
from articles import articles

In [6]:
# Articles sample
articles[5]

'SYDNEY: Cricket fever has gripped Australia with the World Cup just days away. Fans from around the world have thronged to the country and hotels are capitalising. Prices of rooms have almost doubled to 300 dollars and hotels are experiencing full bookings. Experts estimate that during the mega event Australia will generate 1.5 million US dollars just from hotel bookings. If the cost of internal air travel, taxis and tickets is taken into consideration, Australia stands to generate two million US dollars during the World Cup.'

#### Pre-process Articles:

In [7]:
processed_articles = [preprocess_text(article) for article in articles]
# Preprocess Articles sample
print(processed_articles[5])

sydney cricket fever grip australia world cup day away fan around world throng country hotel capitalise price room almost double dollar hotel experience full book expert estimate mega event australia generate million u dollar hotel book cost internal air travel taxi ticket take consideration australia stand generate two million u dollar world cup


<h3 style='color : DarkMagenta'>TF-IDF:</h3>

- Term frequency-inverse document frequency, known as tf-idf, is a numerical statistic used to indicate how important a word is to each document in a collection of documents.
- tf-idf consists of two components, term frequency and inverse document frequency.
- Term frequency is how often a word appears in a document. This is the same as bag-of-words’ word count.
- Inverse document frequency is a measure of how rare a word is across all documents of a corpus: the more documents a word appears in, the lower its inverse document frequency.
- tf-idf is calculated as the term frequency multiplied by the inverse document frequency; the result is also referred to as a tf-idf score.

<h2 style='color : mediumblue'>Tf-idf Scores:</h2>

Tf-idf scores are calculated on a term-document basis. That means there is a tf-idf score for each word, for each document. The tf-idf score for some term t in a document d in some corpus is calculated as follows:<br>
<br>
<i><font size=3>tfidf(t,d) = tf(t,d) * idf(t,corpus)</font></i>

- tf(t,d) is the term frequency of term t in document d
- idf(t,corpus) is the inverse document frequency of a term t across corpus
<br><br>
Note: the term frequency component is the same word count that the bag-of-words model produces.
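
As a worked example of the formula, the sketch below computes a few scores by hand. It assumes the inverse document frequency variant that scikit-learn documents for its default `smooth_idf=True` setting, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with raw counts as term frequency and no normalization. The helper names `smoothed_idf()` and `tfidf()` are made up for this illustration.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """idf as documented for scikit-learn's default smooth_idf=True."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

def tfidf(term_count, n_documents, document_frequency):
    # Raw count times idf, with no normalization (norm=None)
    return term_count * smoothed_idf(n_documents, document_frequency)

# "abbasi": appears once, in 1 of the 10 articles
print(round(tfidf(1, 10, 1), 6))  # 2.704748
# "add": appears once per article, in 2 of the 10 articles
print(round(tfidf(1, 10, 2), 6))  # 2.299283
# "world": appears 3 times in Article 6 and in no other article
print(round(tfidf(3, 10, 1), 6))  # 8.114244
```

These three values agree with the `abbasi`, `add`, and `world` rows of the tf-idf score table later in this notebook.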

In Python, the tf-idf values for each term-document pair in a corpus can easily be calculated using scikit-learn’s ```TfidfVectorizer```.

In [8]:
# Creates a TfidfVectorizer object (no scores are computed yet)
# The norm=None keyword argument prevents scikit-learn from normalizing the product of term frequency and inverse document frequency
corpus_vectorizer = TfidfVectorizer(norm=None)
# Fits/transforms the training data and returns a score matrix
corpus_tfidf_scores = corpus_vectorizer.fit_transform(processed_articles)
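
To illustrate what `norm=None` switches off: by default `TfidfVectorizer` uses `norm='l2'`, which rescales each document's score vector to unit Euclidean length. The pure-Python sketch below shows that rescaling on hypothetical scores; `l2_normalize()` is a helper written for this example, not part of scikit-learn.

```python
import math

def l2_normalize(scores):
    """Scale a document's score vector to unit Euclidean length."""
    length = math.sqrt(sum(score * score for score in scores))
    if length == 0:
        return list(scores)
    return [score / length for score in scores]

# Hypothetical raw tf-idf scores for one document (what norm=None would keep)
raw_scores = [3.0, 4.0, 0.0]
normalized = l2_normalize(raw_scores)
print(normalized)  # [0.6, 0.8, 0.0]
```

Keeping the raw scores makes them directly comparable with the hand-computed product of term frequency and inverse document frequency.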

<h4 style='color : DarkGreen'>Bag-of-Words to tf-idf:</h4>

As shown in the cell above, ```TfidfVectorizer``` can directly calculate the tf-idf scores for a set of terms across a corpus, but for the sake of this exercise, I created a bag-of-words model and converted it into tf-idf scores using scikit-learn’s ```TfidfTransformer```.

Bag-of-Words:
- Bag-of-Words (BoW), also referred to as the unigram model, is a statistical language model based on word count.
- BoW can be implemented as a Python dictionary with each key set to a word and each value set to the number of times that word appears in a text.
- For BoW, training data is the text that is used to build a BoW model.
- BoW test data is the new text that is converted to a BoW vector using a trained features dictionary.
- A feature vector is a numeric depiction of an item’s salient features.
- Feature extraction (or vectorization) is the process of turning text into a BoW vector.
- A features dictionary is a mapping of each unique word in the training data to a unique index. This is used to build out BoW vectors.
- BoW has less data sparsity than other statistical models. It also suffers less from overfitting.
- BoW has higher perplexity than other models, making it less ideal for language prediction.
- One solution to overfitting is language smoothing, in which a bit of probability is taken from known words and allotted to unknown words.
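
The features dictionary and BoW vector described above can be sketched in a few lines of plain Python. The two training documents are made up, and `build_features_dictionary()` and `text_to_bow_vector()` are hypothetical helpers; the project itself uses scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

def build_features_dictionary(training_documents):
    """Map each unique word in the training data to a unique index."""
    features = {}
    for document in training_documents:
        for word in document.split():
            if word not in features:
                features[word] = len(features)
    return features

def text_to_bow_vector(text, features):
    """Count known words; words missing from the dictionary are ignored."""
    vector = [0] * len(features)
    for word, count in Counter(text.split()).items():
        if word in features:
            vector[features[word]] = count
    return vector

training = ["world cup fever", "hotel price double"]  # made-up, already pre-processed
features = build_features_dictionary(training)
print(features)
# {'world': 0, 'cup': 1, 'fever': 2, 'hotel': 3, 'price': 4, 'double': 5}
print(text_to_bow_vector("world cup hotel world ticket", features))
# [2, 1, 0, 1, 0, 0]
```

The word `ticket` is absent from the training data, so it has no slot in the vector and is dropped.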

In [9]:
# Initializes variable to class, empty BoW vector
vectorizer = CountVectorizer()
# Fits/transforms training data and returns a BoW matrix, words count 
bow_matrix = vectorizer.fit_transform(processed_articles)

#  ----------------- Stores BoW vectors results in a DataFrame

# Gets vocabulary of terms
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0)
feature_names = vectorizer.get_feature_names_out()
# Creates an articles index
articles_index = [f"Article {i+1}" for i in range(len(articles))]
# Creates a pandas DataFrame indexed by feature name
# .T transposes the matrix and todense() returns a dense matrix representation
df_bag_of_words = pd.DataFrame(bow_matrix.T.todense(), index=feature_names, columns=articles_index)
df_bag_of_words.to_csv('data/articles_bow.csv')

In [10]:
df_bag_of_words

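| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| abide | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| accord | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| add | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| agency | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 |
| would | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| year | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| yi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |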


Converts the bag-of-words model to tf-idf. 

In [11]:
# Initializes variable to class
transformer = TfidfTransformer(norm=None)
# Transforms and returns scores
tfidf_scores = transformer.fit_transform(bow_matrix)

Saves score results

In [12]:
#  ----------------- Stores scores results from TfidfVectorizer in a DataFrame
df_corpus_tfidf_scores = pd.DataFrame(corpus_tfidf_scores.T.todense(), index=feature_names, columns=articles_index)
df_corpus_tfidf_scores.to_csv('data/articles_corpus_tfidf_scores.csv')
#  ----------------- Stores scores results from TfidfTransformer in a DataFrame
df_tfidf_scores = pd.DataFrame(tfidf_scores.T.todense(), index=feature_names, columns=articles_index)
df_tfidf_scores.to_csv('data/articles_tfidf_scores.csv')

In [13]:
df_tfidf_scores

| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| abbasi | 0.000000 | 0.000000 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| abide | 2.704748 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| accord | 0.000000 | 0.000000 | 2.704748 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| add | 2.299283 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| agency | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2.704748 | 0.0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| world | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 8.114244 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| would | 0.000000 | 0.000000 | 0.000000 | 2.299283 | 0.0 | 0.000000 | 0.000000 | 0.0 | 2.299283 | 0.000000 |
| year | 0.000000 | 2.704748 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 |
| yi | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 5.409496 |


<h4 style='color : DarkGreen'>Scores results comparison:</h4>

Confirms that the tf-idf scores given by ```TfidfTransformer``` and ```TfidfVectorizer``` are the same.

In [14]:
if np.allclose(corpus_tfidf_scores.todense(), tfidf_scores.todense()):
    print('Are the tf-idf scores the same? : yes')
else:
    print('Are the tf-idf scores the same?: No, something is wrong :(')

Are the tf-idf scores the same? : yes


<h2 style='color : MediumBlue'>Results Analysis</h2>

To analyze the results, I label each ```article``` with its highest-scoring tf-idf term to determine the ```article```'s ```topic```.<br>
While labeling by the highest-scoring tf-idf term is a more naive approach than others, it is a quick and easy way of getting insight into the topic.

In [15]:
# Articles' highest-scoring tf-idf terms, aka topics
# Creates a list of each article's highest-scoring term (pandas.Series.idxmax() returns the index label of the maximum)
topics = [df_tfidf_scores[f'Article {i+1}'].idxmax() for i in range(len(articles_index))]
# Creates a DataFrame
df_articles_topics = pd.DataFrame({'Topic' : topics}, index = articles_index)
df_articles_topics.to_csv('data/articles_topics.csv')
df_articles_topics

| | Topic |
|---|---|
| Article 1 | fare |
| Article 2 | hong |
| Article 3 | sugar |
| Article 4 | petrol |
| Article 5 | engine |
| Article 6 | australia |
| Article 7 | car |
| Article 8 | railway |
| Article 9 | cabinet |
| Article 10 | china |
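
One caveat of labeling by a single maximum is that several terms can share the top score, in which case `idxmax()` returns the first matching label, which here means the alphabetically first term. The sketch below uses made-up scores and a hypothetical `top_terms()` helper to show the tie, and how listing a few top terms gives a slightly richer description.

```python
def top_terms(scores, n=1):
    """Return the n highest-scoring terms of one document, best first.

    sorted() is stable, so tied terms keep their original (here alphabetical) order.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

# Made-up scores for one document, keyed in alphabetical order like the DataFrame index
example_scores = {"away": 2.7, "cup": 5.4, "hotel": 8.1, "ticket": 2.7, "world": 8.1}

print(top_terms(example_scores))       # ['hotel']
print(top_terms(example_scores, n=3))  # ['hotel', 'world', 'cup']
```

With real data, the same idea could be applied to each column of `df_tfidf_scores`.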


By using NLP techniques such as tf-idf, we can gain insight into news articles' topics without needing to read them.<br>
For example, we can reasonably infer from its highest-scoring term that Article 6's topic is linked to Australia.