<a href="https://colab.research.google.com/github/Kalyanchittaluri/MachineLearningProjects/blob/main/NewsRecommendationSystem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Let's understand the Math a bit

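The relevance score of an article is computed from the query words that also appear in the article. We divide the number of matching words by the number of unique words in the article (the corpus), take the base-10 logarithm of that ratio, and multiply the result by 100. Because the ratio never exceeds 1, the score is never positive, and a score closer to 0 indicates a closer match.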
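
As a rough sketch of that calculation, the helper below computes 100 * log10(matches / corpus size). The name `relevance_score` and the choice to return negative infinity when nothing matches are assumptions made for illustration, not part of the original notebook.

```python
from math import log10

def relevance_score(match_count: int, corpus_size: int) -> float:
    """Return 100 * log10(match_count / corpus_size)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    if match_count == 0:
        # log10(0) is undefined; assumed convention: worst possible score
        return float("-inf")
    return 100 * log10(match_count / corpus_size)

# Example: 3 matching words in a corpus of 30 unique words
print(relevance_score(3, 30))
```

A ratio of 1 (every corpus word matched) gives the maximum score of 0, and smaller ratios give increasingly negative scores.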


Step 2: Preprocessing the Text (Removing Stop Words)


In [None]:
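from sklearn.feature_extraction.text import CountVectorizer

# The article to index (the same text is reused as news2 in Step 7)
news2 = """
The New York Times rebuked the Biden campaign on Wednesday, telling Fox News that the reported talking points that have been circulated to prominent Democrats "inaccurately" describe the paper's reporting on the candidate's accuser Tara Reade.
"""
doc = [str(news2)]
# stop_words='english' drops common words such as "the" and "on"
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(doc)
# Unique words (features) that survive stop-word removal
corpus = vectorizer.get_feature_names_out()
# Display the feature words
corpus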

array(['accuser', 'biden', 'campaign', 'candidate', 'circulated',
       'democrats', 'fox', 'inaccurately', 'new', 'news', 'paper',
       'points', 'prominent', 'reade', 'rebuked', 'reported', 'reporting',
       'talking', 'tara', 'telling', 'times', 'wednesday', 'york'],
      dtype=object)

Step 3: Define Your Query

Next, we define a query that the user may input. The goal is to find the most relevant news article for this query.


In [None]:
query = "some news Newyork or NY on wednesday or some talk Biden campaign News that the reported talking points that have"


Step 4: Frequency Count for the Words in the Query


In [None]:
from collections import Counter
# Convert query to list of words and get frequency count
query_words = query.lower().split()
query_freq = Counter(query_words)

query_freq

Counter({'some': 2,
         'news': 2,
         'newyork': 1,
         'or': 2,
         'ny': 1,
         'on': 1,
         'wednesday': 1,
         'talk': 1,
         'biden': 1,
         'campaign': 1,
         'that': 2,
         'the': 1,
         'reported': 1,
         'talking': 1,
         'points': 1,
         'have': 1})
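
Splitting on whitespace keeps punctuation attached to words, whereas `CountVectorizer` by default tokenizes with the pattern `\b\w\w+\b` (runs of two or more word characters). As a hedged sketch, the helper below applies that pattern to a query so that its tokens line up with the corpus features. The name `tokenize_query` and the sample query are hypothetical and not part of the notebook.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize_query(text: str) -> list[str]:
    """Lowercase the text and return tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

sample_query = "Biden's campaign, Wednesday!"  # hypothetical query

print(sample_query.lower().split())   # ["biden's", 'campaign,', 'wednesday!']
print(tokenize_query(sample_query))   # ['biden', 'campaign', 'wednesday']

# The token list can be counted exactly like the whitespace-split words
print(Counter(tokenize_query(sample_query)))
```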

Step 5: Comparing the Query to the News

For each word in the query, we get the corresponding frequency in the news article's corpus.


In [None]:
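freq = []
# get_feature_names_out() returns a NumPy array, which has no .count() method,
# so convert it to a list before counting
corpus_words = list(corpus)
for key in query_freq:
    # 1 if the query word is in the article's vocabulary, otherwise 0
    freq.append(corpus_words.count(key))

freq

[0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0]

Step 6: Calculating the Score

We divide the number of matched query words by the number of words in the corpus, take the base-10 logarithm of that ratio, and multiply the result by 100.


In [None]:
from math import log10
# Calculate the score
if sum(freq) == 0:  # No matches: log10(0) is undefined
    score = float('-inf')
else:
    score = sum(freq) / len(corpus)
    score = log10(score) * 100  # Apply log transformation and scaling
print("Score:", round(score, 2))

Score: -51.66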


Step 7: Evaluating Multiple News Articles

Let's now compare multiple news articles and calculate their respective scores.


In [None]:
from math import log10
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

class TextCleaner:
    def __init__(self, doc):
        self.doc = [str(doc)]

    def process(self):
        """
        Tokenize the document and extract its unique lowercase words.
        No stop_words argument is passed, so stop words such as "on" are kept.
        :return: Array of features (words)
        """
        vectorizer = CountVectorizer()
        vectorizer.fit(self.doc)
        corpus = vectorizer.get_feature_names_out()
        return corpus

class WordFrequency:
    def __init__(self, data):
        self.data = data

    def process(self):
        """
        Get the frequency count of each word.
        :return: Frequency as a dictionary
        """
        return Counter(self.data)

news1 = """
Louisiana reported 27 new deaths statewide on Monday, but none in Orleans Parish, the first time the Big Easy reported no new deaths from the virus since March 22.
"""
news2 = """
The New York Times rebuked the Biden campaign on Wednesday, telling Fox News that the reported talking points that have been circulated to prominent Democrats "inaccurately" describe the paper's reporting on the candidate's accuser Tara Reade.
"""

query = "Biden campaign reported on Wednesday"

# Create a list of news articles
news_list = [news1, news2]

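# Calculate score for each news article
scores = []
for news in news_list:
    corpus = TextCleaner(news).process()
    corpus_freq = WordFrequency(corpus).process()
    # The query is split on whitespace without lowercasing, so capitalized
    # words such as "Biden" never match the lowercase corpus features
    query_freq = Counter(query.split())
    freq = [corpus_freq.get(key, 0) for key in query_freq]
    if sum(freq) == 0:
        score = float('inf')  # Sentinel: no query word matched this article
    else:
        score = sum(freq) / len(corpus)
        score = log10(score + 1e-9) * 100  # 1e-9 guards against log10(0)
    scores.append(score)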

# Display scores
print("Scores for each news article:", scores)


Scores for each news article: [-107.9181240836091, -99.99999956570551]
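
The loop above compares raw query tokens with lowercase corpus features, so capitalized query words such as "Biden" cannot match. The following is one possible sketch of a scoring function that lowercases both sides and ignores a few stop words. It uses only the standard library; the names `tokenize` and `score_article`, the tiny `STOP_WORDS` set, and the abridged article texts are illustrative assumptions, and its scores will differ from the ones printed above.

```python
import re
from math import log10

# Tiny illustrative stop-word list (an assumption, far shorter than a real one)
STOP_WORDS = {"the", "on", "that", "have", "been", "to", "in", "but", "no", "from", "since"}

def tokenize(text):
    """Return the set of lowercase tokens (2+ word characters) minus stop words."""
    return {t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS}

def score_article(query, article):
    """Return 100 * log10(matched query words / article vocabulary size)."""
    vocab = tokenize(article)
    matches = len(tokenize(query) & vocab)
    if matches == 0:
        return float("-inf")  # assumed convention: no overlap ranks last
    return 100 * log10(matches / len(vocab))

# Abridged versions of the notebook's two articles
demo_articles = [
    "Louisiana reported 27 new deaths statewide on Monday.",
    "The New York Times rebuked the Biden campaign on Wednesday.",
]
demo_query = "Biden campaign reported on Wednesday"

demo_scores = [score_article(demo_query, a) for a in demo_articles]
# A higher score (closer to 0) means a larger share of the vocabulary matched
best = max(range(len(demo_articles)), key=lambda i: demo_scores[i])
print(demo_articles[best])
```

With these abridged texts, the query shares three words with the New York Times article and one with the Louisiana article, so the former receives the higher score.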


In [None]:
from math import isfinite

# Combine scores with news articles, skipping articles with no match (non-finite score)
news_with_scores = [(news, score) for news, score in zip(news_list, scores) if isfinite(score)]

# Sort by score (descending order): the score closest to 0 is the best match
news_with_scores.sort(key=lambda x: x[1], reverse=True)

# Display sorted articles and scores
for idx, (news, score) in enumerate(news_with_scores):
    print(f"Rank {idx + 1}:")
    print(f"Score: {score}")
    print(f"News: {news[:100]}...")
    print()


Rank 1:
Score: -99.99999956570551
News: 
The New York Times rebuked the Biden campaign on Wednesday, telling Fox News that the reported talk...

Rank 2:
Score: -107.9181240836091
News: 
Louisiana reported 27 new deaths statewide on Monday, but none in Orleans Parish, the first time th...



In [None]:
# The first entry has the highest score and is the best match
top_article, top_score = news_with_scores[0]
print("Top Recommended Article:")
print(f"Score: {top_score}")
print(f"News: {top_article}")


Top Recommended Article:
Score: -99.99999956570551
News: 
The New York Times rebuked the Biden campaign on Wednesday, telling Fox News that the reported talking points that have been circulated to prominent Democrats "inaccurately" describe the paper's reporting on the candidate's accuser Tara Reade.



In [None]:
print("Top Recommendations:")
for idx, (news, score) in enumerate(news_with_scores[:2]):  # Top 2
    print(f"Rank {idx + 1}:")
    print(f"Score: {score}")
    print(f"News: {news[:100]}...")
    print()


Top Recommendations:
Rank 1:
Score: -99.99999956570551
News: 
The New York Times rebuked the Biden campaign on Wednesday, telling Fox News that the reported talk...

Rank 2:
Score: -107.9181240836091
News: 
Louisiana reported 27 new deaths statewide on Monday, but none in Orleans Parish, the first time th...
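
To return an arbitrary number of recommendations without sorting the whole list, one option is `heapq.nlargest` from the standard library. This is a sketch under the assumption that a higher score (closer to 0) means a better match. The name `top_k` and the placeholder article labels are hypothetical, while the two scores are the ones printed earlier.

```python
import heapq

def top_k(scored_articles, k):
    """Return the k (article, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_articles, key=lambda pair: pair[1])

# Placeholder labels stand in for the full article texts
scored = [
    ("Louisiana article", -107.9181240836091),
    ("New York Times article", -99.99999956570551),
]

for rank, (label, score) in enumerate(top_k(scored, 1), start=1):
    print(f"Rank {rank}: {label} ({score:.2f})")
# Rank 1: New York Times article (-100.00)
```

If `k` exceeds the number of articles, `heapq.nlargest` simply returns all of them in descending score order.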

