# Exploring News Articles with TF-IDF

In this project, I'll use term frequency-inverse document frequency (TF-IDF), a standard term-weighting scheme in Natural Language Processing (NLP), to analyze news articles. By identifying the highest-weighted terms in each article, I can get a quick indication of its topic without reading it in full. I'll work with a selection of articles from The News International, a prominent English-language newspaper in Pakistan. My goal is to use TF-IDF to examine each article's content and determine the terms that best describe its topic.

Here's how I'll approach this:

**1. Importing Libraries**: I'll start by importing NumPy, pandas, and the scikit-learn classes `CountVectorizer`, `TfidfTransformer`, and `TfidfVectorizer`, along with the article data and a preprocessing helper from the project's `functions` folder.

**2. Text Inspection**: I'll take a closer look at one of the provided articles to understand its content and structure.

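**3. Data Preprocessing**: To prepare the articles for analysis, I'll preprocess them with `preprocess_text`, which tokenizes and lemmatizes the text (the output below also shows it lowercased and stripped of punctuation and digits). The preprocessed articles will be stored in a list called `processed_articles`.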

**4. Calculate TF-IDF Scores**: I'll calculate TF-IDF scores using two different approaches:
   - First, I'll start with word counts (Bag-of-Words) by initializing a CountVectorizer and fitting it on the processed articles. I'll convert the word counts into a DataFrame.
   - Then, I'll convert these word counts into TF-IDF scores using a TfidfTransformer.
   - Finally, I'll calculate TF-IDF scores directly using a TfidfVectorizer.

**5. Verify Results**: As a consistency check, I'll compare the TF-IDF scores from the two approaches. Because `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`, the two score matrices should match.

**6. Infer Word Importances**: To gain insight into each article's topic, I'll identify the term with the highest TF-IDF score for each article. While this approach is simple, it provides a quick glimpse into the main themes of the articles.

By the end of this project, I'll have a keyword-level summary of the topics these news articles cover, which makes it quicker to decide what to read.

-----

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
# import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
from functions.articles import articles  # Make sure to have articles.py and preprocessing.py in your project folder
from functions.preprocessing import preprocess_text

## Text Inspection

In [2]:
# Print one of the articles (index 4, labelled "Article 5" in the tables below)
print(articles[4])

KARACHI: The final shipment of Chinese manufactured Rail Engines arrived in Pakistan on Friday. Federal Railways Minister, Khwaja Saad Rafique says, the inclusion of the new engines will help ease the shortfall faced by Pakistan Railways. The shipment includes 2000 and 3000-horse-power engines which will be used to pull freight bogeys. Rafique told journalists, the inclusion of 15 new engines has brought Pakistan Railways total strength to 268 engines however more engines are still required.


## Data Preprocessing

In [3]:
# Preprocess articles
processed_articles = [preprocess_text(article) for article in articles]

# Print one of the preprocessed articles
print(processed_articles[4])

karachi the final shipment of chinese manufacture rail engine arrive in pakistan on friday federal railway minister khwaja saad rafique say the inclusion of the new engine will help ease the shortfall face by pakistan railway the shipment include and horse power engine which will be use to pull freight bogey rafique tell journalist the inclusion of new engine have bring pakistan railway total strength to engine however more engine be still require
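
The `preprocess_text` helper comes from `functions/preprocessing.py` and isn't reproduced in this notebook. As a rough, stand-alone illustration of the kind of cleaning visible in the output above (lowercasing, dropping punctuation and digits), here is a minimal sketch. The name `basic_clean` is hypothetical and not part of the project, and lemmatization is omitted, so words keep their original inflection.

```python
import re

def basic_clean(text):
    """Hypothetical helper: lowercase text and keep only alphabetic words."""
    # Replace every run of non-letter characters (digits, punctuation) with a space
    letters_only = re.sub(r"[^a-z]+", " ", text.lower())
    # Split on whitespace and re-join to collapse repeated spaces
    return " ".join(letters_only.split())

sample = "The shipment includes 2000 and 3000-horse-power engines."
print(basic_clean(sample))
# the shipment includes and horse power engines
```

Compared with the real output, "includes" and "engines" are not reduced to "include" and "engine" here, which is the part a lemmatizer would add.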


## Approach 1: TF-IDF Scores from Bag-of-Words

### Word Counts (Bag-of-Words)

In [4]:
# Initialize and fit CountVectorizer
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# Create a DataFrame with word counts
df_word_counts = pd.DataFrame(
    counts.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Article {i+1}" for i in range(len(articles))],
).T  # transpose so that terms are rows and articles are columns
df_word_counts

term,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0,0,0,1,0,0,0,0,0,0
abide,1,0,0,0,0,0,0,0,0,0
about,0,0,0,0,0,0,1,0,0,0
accord,0,0,1,0,0,0,0,0,0,0
add,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
world,0,0,0,0,0,3,0,0,0,0
would,0,0,0,1,0,0,0,0,1,0
year,0,1,0,0,0,0,0,0,0,0
yi,0,0,0,0,0,0,0,0,0,2
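
To make the table above concrete, the sketch below counts words for two toy documents with `collections.Counter`. It assumes `CountVectorizer`'s default settings, which lowercase the text and keep only tokens of two or more word characters. The two example sentences are made up for illustration and are not taken from the articles, and `bag_of_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern: 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(document):
    """Count how often each token occurs in one document."""
    return Counter(TOKEN_PATTERN.findall(document.lower()))

# Toy documents, invented for this example
docs = ["new engine pull freight", "engine a new engine"]
counts_per_doc = [bag_of_words(doc) for doc in docs]

# The vocabulary is the sorted set of all tokens, like the row labels of the DataFrame
vocabulary = sorted(set().union(*counts_per_doc))
for term in vocabulary:
    print(term, [doc_counts[term] for doc_counts in counts_per_doc])
```

Note that the single-character token "a" never enters the vocabulary, which is why very short words are absent from the word-count table.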


### Convert Word Counts to TF-IDF Scores

In [5]:
# Initialize and fit TfidfTransformer
transformer = TfidfTransformer(norm=None)
tfidf_scores_transformed = transformer.fit_transform(counts)

# Create a DataFrame with tf-idf scores (from TfidfTransformer)
df_tf_idf_transformed = pd.DataFrame(
    tfidf_scores_transformed.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Article {i+1}" for i in range(len(articles))],
).T  # transpose so that terms are rows and articles are columns
df_tf_idf_transformed

term,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496
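
The numbers in this table can be reproduced by hand. With `norm=None` and scikit-learn's default `smooth_idf=True`, each score is the raw word count multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The sketch below checks a few entries, assuming n = 10 articles as in the table; the helper name `smoothed_idf` is my own and not a scikit-learn function.

```python
import math

def smoothed_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

n_documents = 10  # the table has ten article columns

# "abbasi": appears once, in one article
print(round(1 * smoothed_idf(n_documents, 1), 6))  # 2.704748
# "add": appears once each in two articles
print(round(1 * smoothed_idf(n_documents, 2), 6))  # 2.299283
# "world": appears three times, in one article
print(round(3 * smoothed_idf(n_documents, 1), 6))  # 8.114244
```

A term found in fewer articles gets a larger IDF, so "abbasi" (one article) outweighs "add" (two articles) even though both occur once per article.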


## Approach 2: Calculate TF-IDF Scores Directly

In [6]:
# Initialize and fit TfidfVectorizer (separate name so the CountVectorizer above is not overwritten)
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = tfidf_vectorizer.fit_transform(processed_articles)

# Create a DataFrame with TF-IDF scores (from TfidfVectorizer)
df_tf_idf = pd.DataFrame(
    tfidf_scores.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=[f"Article {i+1}" for i in range(len(articles))],
).T  # transpose so that terms are rows and articles are columns
df_tf_idf


term,Article 1,Article 2,Article 3,Article 4,Article 5,Article 6,Article 7,Article 8,Article 9,Article 10
abbasi,0.000000,0.000000,0.000000,2.704748,0.0,0.000000,0.000000,0.0,0.000000,0.000000
abide,2.704748,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
about,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,2.704748,0.0,0.000000,0.000000
accord,0.000000,0.000000,2.704748,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
add,2.299283,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,2.299283,0.000000
...,...,...,...,...,...,...,...,...,...,...
world,0.000000,0.000000,0.000000,0.000000,0.0,8.114244,0.000000,0.0,0.000000,0.000000
would,0.000000,0.000000,0.000000,2.299283,0.0,0.000000,0.000000,0.0,2.299283,0.000000
year,0.000000,2.704748,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000
yi,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,5.409496


In [7]:
# Check whether the TF-IDF scores from TfidfTransformer and TfidfVectorizer are equal
if np.allclose(tfidf_scores_transformed.toarray(), tfidf_scores.toarray()):
    print(pd.DataFrame({'Are the tf-idf scores the same?': ['YES']}))
else:
    print(pd.DataFrame({'Are the tf-idf scores the same?': ['No, something is wrong :(']}))

  Are the tf-idf scores the same?
0                             YES


## Infer Word Importances

In [8]:
# Report the highest-scoring term for each article
for article in df_tf_idf.columns:
    top_term = df_tf_idf[article].idxmax()
    print(f"Highest scoring term for {article}: {top_term}")

Highest scoring term for Article 1: fare
Highest scoring term for Article 2: hong
Highest scoring term for Article 3: sugar
Highest scoring term for Article 4: petrol
Highest scoring term for Article 5: engine
Highest scoring term for Article 6: australia
Highest scoring term for Article 7: car
Highest scoring term for Article 8: railway
Highest scoring term for Article 9: cabinet
Highest scoring term for Article 10: china
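
A single top term is a thin summary, so a natural extension is to list the few highest-scoring terms per article. The sketch below does this with `heapq.nlargest` on a plain dictionary. The scores are invented placeholders rather than values from the tables above, and `top_terms` is a hypothetical helper. On the actual DataFrame, something like `df_tf_idf["Article 5"].nlargest(3)` should give the equivalent result.

```python
import heapq

def top_terms(scores, k=3):
    """Return the k (term, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Placeholder scores for one article (illustrative only, not real results)
example_scores = {
    "engine": 13.5,
    "railway": 8.1,
    "shipment": 5.4,
    "the": 4.0,
    "karachi": 2.7,
}

for term, score in top_terms(example_scores):
    print(f"{term}: {score}")
```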
