## Data Info

Link: https://www.kaggle.com/datasets/rmisra/news-category-dataset

The file contains 210,294 records dated between 2012 and 2022. Each JSON record contains the following attributes:

- category: Category the article belongs to

- headline: Headline of the article

- authors: Person(s) who authored the article

- link: Link to the post

- short_description: Short description of the article

- date: Date the article was published

In [2]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import json
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
file_path = "News_Category_Dataset_v3.json"
df = pd.read_json(file_path,lines=True)

In [6]:
df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [7]:
data = []
with open(file_path, 'r') as file:
    for line in file:
        try:
            obj = json.loads(line)
            data.append(obj)
        except json.JSONDecodeError:
            pass

df = pd.DataFrame(data)
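
The loop above silently drops any line that fails to parse, so a damaged file would load without warning. Below is a minimal sketch of the same line-by-line parsing that also counts the skipped lines. It runs on a small in-memory sample; the helper name `read_json_lines` and the sample records are made up for illustration.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per line; return (records, skipped_count)."""
    records, skipped = [], 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines entirely
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # count malformed lines instead of hiding them
    return records, skipped


# Hypothetical sample: two valid records and one malformed line.
sample = io.StringIO(
    '{"headline": "A", "category": "COMEDY"}\n'
    'not valid json\n'
    '{"headline": "B", "category": "PARENTING"}\n'
)
records, skipped = read_json_lines(sample)
print(len(records), "records parsed,", skipped, "skipped")
```

The returned list of dictionaries can be passed to `pd.DataFrame` exactly as `data` is above.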

In [8]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| link | https://www.huffpost.com/entry/covid-boosters-... | https://www.huffpost.com/entry/american-airlin... | https://www.huffpost.com/entry/funniest-tweets... | https://www.huffpost.com/entry/funniest-parent... | https://www.huffpost.com/entry/amy-cooper-lose... |
| headline | Over 4 Million Americans Roll Up Sleeves For O... | American Airlines Flyer Charged, Banned For Li... | 23 Of The Funniest Tweets About Cats And Dogs ... | The Funniest Tweets From Parents This Week (Se... | Woman Who Called Cops On Black Bird-Watcher Lo... |
| category | U.S. NEWS | U.S. NEWS | COMEDY | PARENTING | U.S. NEWS |
| short_description | Health experts said it is too early to predict... | He was subdued by passengers and crew when he ... | "Until you have a dog you don't understand wha... | "Accidentally put grown-up toothpaste on my to... | Amy Cooper accused investment firm Franklin Te... |
| authors | Carla K. Johnson, AP | Mary Papenfuss | Elyse Wanshel | Caroline Bologna | Nina Golgowski |
| date | 2022-09-23 | 2022-09-23 | 2022-09-23 | 2022-09-23 | 2022-09-22 |


## Find Similarity
### Cosine similarity

In [15]:
given_data = df.iloc[12]

given_data

link                 https://www.huffpost.com/entry/fiona-threatens...
headline             Fiona Threatens To Become Category 4 Storm Hea...
category                                                    WORLD NEWS
short_description    Hurricane Fiona lashed the Turks and Caicos Is...
authors                                                Dánica Coto, AP
date                                                        2022-09-21
Name: 12, dtype: object

In [17]:
# Calculate cosine similarity between the given data and all other data points
similarity_scores = cosine_similarity(df,given_data)



ValueError: could not convert string to float: 'https://www.huffpost.com/entry/covid-boosters-uptake-us_n_632d719ee4b087fae6feaac9'
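
The call fails because `cosine_similarity` expects numeric arrays, while `df` holds strings such as URLs and headlines. The text has to be turned into vectors first. As a minimal sketch of what is being computed (the dot product of two vectors divided by the product of their lengths), the example below uses plain word counts rather than TF-IDF. The helper name `cosine_from_counts` and the sample sentences are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up sentences: identical, partly overlapping, and unrelated.
same = cosine_from_counts("storm hits coast", "storm hits coast")
partial = cosine_from_counts("storm hits coast", "storm nears coast")
disjoint = cosine_from_counts("storm hits coast", "funny cat tweets")
print(same, partial, disjoint)
```

Identical texts score close to 1, texts with no shared words score 0, and partial overlap falls in between.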

In [18]:
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = vectorizer.fit_transform(df['short_description'])

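# Select a target description (the variable names say "headline",
# but the text comes from the short_description column)
target_headline = df['short_description'].iloc[123]

# Transform the target description
target_tfidf = vectorizer.transform([target_headline])

# Calculate cosine similarity
similarity_scores = cosine_similarity(tfidf_matrix, target_tfidf).ravel()

# Find the most similar description (the target row itself is part of the search)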
most_similar_index = similarity_scores.argmax()
most_similar_headline = df.loc[most_similar_index, 'short_description']

print("Most similar headline:", most_similar_headline)

Most similar headline: If your freezer is overflowing, or you're tired of carrying around ice packs and bottles of pumped milk, this option could be for you.


In [19]:
most_similar_index

123
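
The index returned is 123, the query row itself: a non-empty text always has cosine similarity 1 with itself, so `argmax` picks it. Below is a hedged sketch of one way to skip the query row and return the best remaining matches. `top_matches` is a hypothetical helper, and the scores are made-up numbers rather than values from the dataset.

```python
def top_matches(scores, query_index, k=3):
    """Return indices of the k highest scores, excluding the query row."""
    candidates = [i for i in range(len(scores)) if i != query_index]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:k]


# Made-up scores; position 2 plays the role of the query row (score 1.0).
scores = [0.10, 0.45, 1.00, 0.30, 0.80]
best = top_matches(scores, query_index=2, k=2)
print(best)
```

With the notebook's variables, the call would look like `top_matches(similarity_scores, 123)`, and the resulting indices can be passed to `df.loc[...]` to view the matching rows.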

### Jaccard similarity

In [21]:
# helper function for preprocessing the text
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    
    # Remove special characters and symbols
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize the text into words
    tokens = word_tokenize(text)
    
    # Remove stopwords
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Join the tokens back into a single string
    processed_text = " ".join(tokens)
    
    return processed_text


In [25]:
# Preprocess the headlines
df['processed_headline'] = df['headline'].apply(preprocess_text)

In [24]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [26]:
# Modified Dataset
df.head()

| | link | headline | category | short_description | authors | date | processed_headline |
|---|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 | 4 million americans roll sleeves omicron targe... |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 | american airlines flyer charged banned life pu... |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 | 23 funniest tweets cats dogs week sept 17 23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 | funniest tweets parents week sept 17 23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 | woman called cops black bird watcher loses law... |


In [27]:
df['headline'].iloc[102]

'Michigan Secretary of State Worried About ‘Violence And Disruption’ Going Into Midterms'

In [28]:
# Select any headline
target_headline = preprocess_text(df['headline'].iloc[102])

# Calculate Jaccard similarity
similarity_scores = pairwise_distances(df['processed_headline'].values.reshape(-1, 1),
                                       [target_headline],
                                       metric='jaccard').ravel()

# Find the most similar headline
most_similar_index = similarity_scores.argmin()
most_similar_headline = df.loc[most_similar_index, 'headline']

print("Most similar headline:", most_similar_headline)



ValueError: invalid literal for int() with base 10: 'michigan secretary state worried violence disruption going midterms'
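
The error occurs because `pairwise_distances` with `metric='jaccard'` operates on numeric (boolean) feature arrays, not on raw strings. Below is a minimal sketch that computes Jaccard similarity directly on token sets: the number of shared tokens divided by the number of distinct tokens across both texts. The function names and sample headlines are assumptions for illustration, and whitespace splitting stands in for the preprocessing step.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the token sets of two texts."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    if not union:
        return 0.0  # both texts empty: define the similarity as 0
    return len(set_a & set_b) / len(union)


def most_similar(target, candidates):
    """Return (index, score) of the candidate most similar to target."""
    scores = [jaccard_similarity(target, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical preprocessed headlines (lowercased, stopwords removed).
headlines = [
    "funniest tweets parents week",
    "michigan official worried violence midterms",
    "hurricane threatens island",
]
target = "michigan secretary state worried violence disruption going midterms"
index, score = most_similar(target, headlines)
print(index, round(score, 3))
```

Because this returns a similarity rather than a distance, the best match is the maximum, not the minimum. On the notebook's data the candidates would be `df['processed_headline']`, with row 102 (the target itself) excluded from the search.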