# News Recommendation System

This Jupyter notebook demonstrates how to build a news recommendation system. It is divided into eight parts:
1. Fetching news data from NewsAPI.
2. Cleaning titles and preparing them for TF-IDF vectorization.
3. Setting up a MySQL database and tables.
4. Inserting the articles into MySQL.
5. Simulating user interactions.
6. Building and evaluating a collaborative filtering model using SVD.
7. Creating a content-based recommendation system using TF-IDF.
8. Improving the content-based system with bigrams.

## Part 1: Fetching and Preprocessing News Data

This part fetches the latest US top headlines from NewsAPI and prints each article's title and description. A description of `None` means the API returned no summary for that article.

In [40]:
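import os
import requests

# Read the key from the environment instead of hard-coding it in the notebook.
# NEWSAPI_KEY is an assumed variable name; set it before starting Jupyter.
api_key = os.environ['NEWSAPI_KEY']
url = 'https://newsapi.org/v2/top-headlines'

response = requests.get(url, params={'country': 'us', 'apiKey': api_key}, timeout=10)
response.raise_for_status()  # Fail early on HTTP errors
data = response.json()

articles = data['articles']
for article in articles:
    print(article['title'], article['description'])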



Stock market news today: Nasdaq sinks, Nvidia drops 5% ahead of next round of Big Tech earnings - Yahoo Finance None
Olympic gymnastics live updates: Simone Biles, USA women win gold medal in team final - USA TODAY None
Israel targets Hezbollah commander in Beirut strike after deadly Golan Heights attack - CNN None
Kamala Harris to campaign with VP pick in battleground states next week - CBS News None
2024 Booker longlist includes ‘James’ and ‘Wandering Stars’ - The Washington Post The 13 finalists include Percival Everett’s retelling of “Adventures of Huckleberry Finn” and Tommy Orange’s follow-up to “There There.”
'Stop interrupting me': Cruz gets in heated exchange with Secret Service official - CNN None
Gunmen on jet skis kill 12-year-old boy on Cancun beach while firing at rival drug dealer: Mexican officials - Fox News None
DUIs and integrity concerns: What we know about the deputy who killed Sonya Massey - USA TODAY None
Probe of soldiers over alleged sexual abuse fuels tension 
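
Several descriptions in the output above are `None`, so downstream code should not assume they are strings. A minimal sketch of how the payload might be validated and normalized before use, assuming the NewsAPI envelope with a `status` field and an `articles` list. The helper name `extract_headlines` and the sample payload are hypothetical and not part of the original notebook.

```python
def extract_headlines(payload):
    """Return (title, description) pairs from a NewsAPI-style payload.

    Assumes the payload carries a 'status' field and an 'articles' list.
    Missing titles or descriptions (None) become empty strings.
    """
    if payload.get('status') != 'ok':
        raise ValueError(f"NewsAPI error: {payload.get('message', 'unknown error')}")
    return [
        (article.get('title') or '', article.get('description') or '')
        for article in payload.get('articles', [])
    ]

# Hypothetical payload shaped like a top-headlines response
sample = {
    'status': 'ok',
    'articles': [
        {'title': 'Headline A - Source', 'description': None},
        {'title': 'Headline B - Source', 'description': 'Some summary.'},
    ],
}
headlines = extract_headlines(sample)
print(headlines)
```

Replacing `None` with an empty string keeps later text processing, such as TF-IDF vectorization, from failing on missing values.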

## Part 2: Cleaning and Preparing Data for TF-IDF Vectorization

Clean the titles of the articles by removing any non-alphanumeric characters to prepare them for TF-IDF vectorization.

In [41]:
import pandas as pd

data = pd.DataFrame(articles)
# Use a raw string so that \s is passed to the regex engine unchanged
data['cleaned_title'] = data['title'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)
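
To see what the cleaning pattern does to a single headline, here is a small sketch using the standard `re` module with the same character class. The helper name `clean_title` is illustrative only.

```python
import re

def clean_title(title):
    """Drop every character that is not an ASCII letter, digit, or whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', title)

raw = "'Stop interrupting me': Cruz gets in heated exchange - CNN"
cleaned = clean_title(raw)
print(cleaned)
```

Removing the hyphen leaves two consecutive spaces before the source name, which is harmless for a word-level tokenizer. Note that the pattern also strips accented letters, because only ASCII letters are kept.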


In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(data['cleaned_title'])

## Part 3: Setting Up MySQL Database and Tables

Set up a MySQL database and create tables to store the news articles and user interactions.

### Install the necessary packages

In [43]:
pip install mysql-connector-python pandas sqlalchemy


In [44]:
import mysql.connector

# Connect to MySQL
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='november281996',
    database='news_recommendation'
)
cursor = conn.cursor()

# Create news_articles table
cursor.execute("""
CREATE TABLE IF NOT EXISTS news_articles (
    id INT AUTO_INCREMENT PRIMARY KEY,
    author VARCHAR(255),
    title VARCHAR(255) NOT NULL,
    description TEXT,
    url VARCHAR(255) NOT NULL,
    urlToImage VARCHAR(255),
    publishedAt DATETIME,
    content TEXT,
    source_id VARCHAR(50),
    source_name VARCHAR(100)
);
""")

# Create user_interactions table
cursor.execute("""
CREATE TABLE IF NOT EXISTS user_interactions (
    user_id INT,
    article_id INT,
    rating INT,
    PRIMARY KEY (user_id, article_id)
)
""")

conn.commit()
cursor.close()
conn.close()
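
The composite primary key on `user_interactions` means each user can hold at most one rating per article. A small sketch of that constraint follows. It uses Python's built-in `sqlite3` as a stand-in for MySQL, an assumption made only so the example runs without a database server.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table
conn = sqlite3.connect(':memory:')
conn.execute("""
CREATE TABLE user_interactions (
    user_id INTEGER,
    article_id INTEGER,
    rating INTEGER,
    PRIMARY KEY (user_id, article_id)
)
""")
conn.execute("INSERT INTO user_interactions VALUES (1, 10, 4)")

# A second rating for the same (user, article) pair violates the key
try:
    conn.execute("INSERT INTO user_interactions VALUES (1, 10, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = conn.execute("SELECT * FROM user_interactions").fetchall()
conn.close()
print(duplicate_rejected, rows)
```

This is why the later insert into `user_interactions` uses `ON DUPLICATE KEY UPDATE`: a repeated pair overwrites the earlier rating instead of raising an error.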

## Part 4: Inserting Data into MySQL

Insert the fetched news articles into the MySQL database.

In [45]:
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

# Create DataFrame
data = pd.DataFrame(articles)

# Flatten the 'source' column
data['source_id'] = data['source'].apply(lambda x: x.get('id') if x else None)
data['source_name'] = data['source'].apply(lambda x: x.get('name') if x else None)
data = data.drop(columns=['source'])

# Convert 'publishedAt' to MySQL-compatible datetime format
data['publishedAt'] = pd.to_datetime(data['publishedAt']).dt.strftime('%Y-%m-%d %H:%M:%S')

# Truncate the URL if it exceeds the length limit (e.g., 255 characters)
max_length = 255
data['url'] = data['url'].apply(lambda x: x[:max_length] if x and len(x) > max_length else x)

# Create SQLAlchemy engine
engine = create_engine('mysql+mysqlconnector://root:november281996@localhost/news_recommendation')

# Insert data into news_articles table
try:
    data.to_sql('news_articles', con=engine, if_exists='append', index=False)
    print("Data inserted successfully.")
except Exception as e:
    print(f"Error: {e}")

Data inserted successfully.
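
The same preparation steps can be sketched for a single article without pandas. The helper `flatten_article` and the sample article are hypothetical, and the timestamp format `YYYY-MM-DDTHH:MM:SSZ` is an assumption about the `publishedAt` field.

```python
from datetime import datetime

def flatten_article(article, max_url_length=255):
    """Flatten one NewsAPI-style article dict into table-ready columns."""
    flat = {key: value for key, value in article.items() if key != 'source'}
    source = article.get('source') or {}
    flat['source_id'] = source.get('id')
    flat['source_name'] = source.get('name')
    # Truncate the URL so it fits the VARCHAR(255) column
    if flat.get('url'):
        flat['url'] = flat['url'][:max_url_length]
    # Convert the assumed ISO 8601 timestamp to MySQL DATETIME text
    if flat.get('publishedAt'):
        parsed = datetime.strptime(flat['publishedAt'], '%Y-%m-%dT%H:%M:%SZ')
        flat['publishedAt'] = parsed.strftime('%Y-%m-%d %H:%M:%S')
    return flat

# Hypothetical article used only for illustration
sample = {
    'source': {'id': None, 'name': 'Example News'},
    'title': 'Example headline - Example News',
    'url': 'https://example.com/' + 'a' * 300,
    'publishedAt': '2024-07-30T18:31:00Z',
}
row = flatten_article(sample)
print(row['source_name'], len(row['url']), row['publishedAt'])
```

Truncating a URL keeps the insert from failing but can break the link, so a wider column may be preferable when the URL has to remain usable.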


## Part 5: Simulating User Interactions and Inserting into MySQL

Simulate user interactions with the news articles and store the interactions in the MySQL database.

In [46]:
import mysql.connector
import pandas as pd
from sqlalchemy import create_engine, exc, text
import numpy as np

# Connect to MySQL and fetch news articles
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='november281996',
    database='news_recommendation'
)

# Fetch articles into a DataFrame
articles = pd.read_sql("SELECT * FROM news_articles", conn)
conn.close()

# Assuming you have a DataFrame 'articles' from your news articles collection
article_ids = articles['id'].tolist()
user_ids = range(1, 101)  # Simulate 100 users

# Simulate interactions
np.random.seed(42)
interactions = []
for user_id in user_ids:
    # Each user interacts with 10-19 articles (np.random.randint excludes the upper bound)
    interacted_articles = np.random.choice(article_ids, size=np.random.randint(10, 20), replace=False)
    for article_id in interacted_articles:
        rating = np.random.randint(1, 6)  # Simulate a rating between 1 and 5
        interactions.append((user_id, article_id, rating))  # Append as a tuple

# Create a DataFrame
user_interactions = pd.DataFrame(interactions, columns=['user_id', 'article_id', 'rating'])

# Create the SQLAlchemy engine
engine = create_engine('mysql+mysqlconnector://root:november281996@localhost/news_recommendation')

# Store interactions in MySQL with duplicate handling
# Prepare the data for insertion
data_to_insert = user_interactions.to_dict(orient='records')

# Create the query with placeholders
query = text("""
    INSERT INTO user_interactions (user_id, article_id, rating)
    VALUES (:user_id, :article_id, :rating)
    ON DUPLICATE KEY UPDATE rating = VALUES(rating)
""")

# Passing a list of dicts runs the statement once per row (executemany).
# engine.begin() commits on success and rolls back on error, so the
# success message is printed only when the rows were actually stored.
try:
    with engine.begin() as conn:
        conn.execute(query, data_to_insert)
    print("User interactions successfully simulated and stored in MySQL.")
except exc.SQLAlchemyError as e:
    print(f"Error occurred: {e}")

User interactions successfully simulated and stored in MySQL.
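
Sampling without replacement fails when a user is asked to rate more articles than exist, which can happen if the table holds only a handful of rows. A minimal sketch of the simulation with that guard, using the standard `random` module instead of NumPy. The function name and its parameters are illustrative.

```python
import random

def simulate_interactions(article_ids, n_users=100, min_items=10, max_items=19, seed=42):
    """Return (user_id, article_id, rating) tuples with random 1-5 ratings."""
    rng = random.Random(seed)
    interactions = []
    for user_id in range(1, n_users + 1):
        # Never request more articles than are available
        k = min(rng.randint(min_items, max_items), len(article_ids))
        for article_id in rng.sample(article_ids, k):
            interactions.append((user_id, article_id, rng.randint(1, 5)))
    return interactions

# Only 12 articles available, so each user rates between 10 and 12 of them
demo = simulate_interactions(list(range(1, 13)), n_users=3)
print(len(demo))
```

Because `rng.sample` draws without replacement, no user rates the same article twice, which matches the composite primary key of `user_interactions`.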




## Part 6: Building and Evaluating a Recommendation Model with SVD

Build a collaborative filtering recommendation model using Singular Value Decomposition (SVD) and evaluate its performance.

### Install the necessary packages

In [47]:
pip install scikit-surprise



In [48]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy
import pandas as pd
import numpy as np
import mysql.connector

# Connect to MySQL and fetch user interaction data
conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='november281996',
    database='news_recommendation'
)
query = "SELECT * FROM user_interactions"
user_interactions = pd.read_sql(query, conn)
conn.close()

# Load data into Surprise dataset
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(user_interactions[['user_id', 'article_id', 'rating']], reader)

# Train and test split
trainset, testset = train_test_split(data, test_size=0.25)

# Train SVD algorithm
algo = SVD()
algo.fit(trainset)

# Make predictions
predictions = algo.test(testset)
rmse = accuracy.rmse(predictions)  # Prints "RMSE: ..." itself and returns the value

RMSE: 1.5326
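
The simulated ratings are drawn uniformly at random, so there is no user or article signal for SVD to learn. A useful reference point is the error of a predictor that always returns the mean rating. A quick check of that baseline, assuming ratings 1 to 5 are equally likely:

```python
import math

ratings = [1, 2, 3, 4, 5]  # Equally likely under the simulation
mean_rating = sum(ratings) / len(ratings)

# RMSE of always predicting the mean equals the standard deviation of the ratings
baseline_rmse = math.sqrt(sum((r - mean_rating) ** 2 for r in ratings) / len(ratings))
print(round(baseline_rmse, 4))
```

The baseline is the square root of 2, about 1.4142. An RMSE of about 1.53 is therefore no better than a constant guess, which is what one would expect when the model is fit to random ratings.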




## Part 7: Content-Based Recommendation System Using TF-IDF

Create a content-based recommendation system using TF-IDF vectorization and cosine similarity.

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pandas as pd
from sqlalchemy import create_engine

# Create SQLAlchemy engine
engine = create_engine('mysql+mysqlconnector://root:november281996@localhost/news_recommendation')

# Connect to MySQL and fetch news articles
articles = pd.read_sql("SELECT * FROM news_articles", engine)

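# Ensure titles are unique by removing duplicates, then reset the index so
# row labels match row positions in the similarity matrix
articles = articles.drop_duplicates(subset='title').reset_index(drop=True)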

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(articles['title'].fillna(''))

# Compute cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
indices = pd.Series(articles.index, index=articles['title']).drop_duplicates()

def get_recommendations(title, cosine_sim=cosine_sim):
    if title not in indices:
        return f"Title '{title}' not found in the articles."
    
    idx = indices[title]
    print(f"Index of the title: {idx}")

    sim_scores = list(enumerate(cosine_sim[idx]))
    print(f"Similarity scores before sorting: {sim_scores[:10]}")  # Print first 10 scores for debugging

    # Ensure that similarity scores are tuples of index and score
    sim_scores = [(i, score) for i, score in sim_scores if isinstance(score, (int, float))]
    print(f"Filtered similarity scores: {sim_scores[:10]}")  # Print first 10 scores for debugging

    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    print(f"Similarity scores after sorting: {sim_scores[:10]}")  # Print first 10 scores for debugging

    sim_scores = sim_scores[1:11]
    print(f"Top 10 similarity scores: {sim_scores}")  # Print top 10 scores for debugging

    article_indices = [i[0] for i in sim_scores]
    print(f"Article indices: {article_indices}")  # Print article indices for debugging

    return articles['title'].iloc[article_indices]

# Test the function with a specific title
test_title = 'Russian and Chinese bombers intercepted off of Alaska - ABC News'
recommendations = get_recommendations(test_title)
print(recommendations)

Index of the title: 21
Similarity scores before sorting: [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.05976902006767167), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.11412742304479827), (8, 0.0), (9, 0.0)]
Filtered similarity scores: [(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.05976902006767167), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.11412742304479827), (8, 0.0), (9, 0.0)]
Similarity scores after sorting: [(21, 0.9999999999999999), (70, 0.20842463708780345), (63, 0.179309214511483), (45, 0.15847010266977618), (7, 0.11412742304479827), (50, 0.1105487529575391), (48, 0.1014362279006179), (81, 0.06279512441457209), (11, 0.06110793175291705), (54, 0.06109826516994615)]
Top 10 similarity scores: [(70, 0.20842463708780345), (63, 0.179309214511483), (45, 0.15847010266977618), (7, 0.11412742304479827), (50, 0.1105487529575391), (48, 0.1014362279006179), (81, 0.06279512441457209), (11, 0.06110793175291705), (54, 0.06109826516994615), (29, 0.06068136708669128)]
Article indices: [70, 63, 45, 7, 50, 48, 81, 11, 54, 29]
17
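
`linear_kernel` computes plain dot products. It yields cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default, and for unit-length vectors the two quantities coincide. A minimal sketch of that identity in pure Python, with made-up vectors `a` and `b` standing in for TF-IDF rows:

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed directly from the definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up vectors standing in for two TF-IDF rows
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

direct = cosine(a, b)
via_dot = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(direct, via_dot))
```

The same reasoning explains why the top score after sorting is approximately 1.0: it is the article's similarity with itself, which the function skips by slicing from position 1.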

## Part 8: Improved Content-Based Recommendation System with TF-IDF and Cosine Similarity

Improve the content-based recommendation system by adding bigrams to the TF-IDF vectorization and ensuring robustness in the similarity calculations.

In [50]:
import mysql.connector
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


# Database connection function
def get_db_connection():
    conn = mysql.connector.connect(
        host='localhost',
        user='root',
        password='november281996',
        database='news_recommendation'
    )
    return conn

# Fetch articles from database
def fetch_articles():
    conn = get_db_connection()
    query = "SELECT * FROM news_articles"
    articles = pd.read_sql(query, conn)
    conn.close()
    # Drop duplicate titles and reset the index so labels match row positions
    articles = articles.drop_duplicates(subset='title').reset_index(drop=True)
    return articles


def get_recommendations(title):
    articles = fetch_articles()
    
    if title not in articles['title'].values:
        print(f"Title '{title}' not found in the dataset.")
        return []  # Title not found
    
    # Create the TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(articles['title'].fillna(''))
    
    # Calculate cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    
    # Re-create the indices to ensure consistency
    indices = pd.Series(articles.index, index=articles['title']).drop_duplicates()
    idx = indices.get(title, None)
    
    if idx is None:
        print(f"Title '{title}' not found in the indices.")
        return []  # Title not found

    # Ensure idx is within the valid range of cosine_sim
    if idx >= cosine_sim.shape[0]:
        print(f"Index {idx} is out of bounds for the cosine similarity matrix.")
        return []
    
    # Get similarity scores for the given title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Exclude the original article itself
    sim_scores = [(i, score) for i, score in sim_scores if i != idx]
    # Sort the articles based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get top 10 recommendations
    sim_scores = sim_scores[:10]
    article_indices = [i[0] for i in sim_scores]
    
    return articles['title'].iloc[article_indices].tolist()



# Example usage
title = 'Russian and Chinese bombers intercepted off of Alaska - ABC News'
recommendations = get_recommendations(title)
for rec in recommendations:
    print(rec)

White Dudes for Kamala Harris: Giant Zoom calls power fundraising - BBC.com
Microsoft apologises after thousands report new outage - BBC.com
Southport knife attack: Two children dead and nine injured at dance workshop - BBC.com
D.C. cleans up vandalism and graffiti in wake of anti-Netanyahu protests - The Washington Post
Israeli officials say they want to avoid all-out war in Lebanon retaliation - Reuters
McDonald’s sales fall worldwide for first time in four years as cost of living bites - The Guardian US
Kamala Harris to campaign with VP pick in battleground states next week - CBS News
NASA's Perseverance Mars rover finds possible signs of ancient Red Planet life - Space.com
SpaceX finds cause of Falcon 9 failure, eyes return to flight as soon as July 27 - Space.com
Roy Cooper Is Said to Withdraw From Harris’s Vice-Presidential Field - The New York Times


