<div class="markdown" style="background-color: white; color:black;" >
    <div class="markdown" style="background-color: white; color:black; text-align: center;">
        <h1 style="font-size: 48px; font-weight: bold;">Recommender System For Recommending Articles To User</h1>
    </div>
</div>


<style>
    .markdown-cell {
        background-color: black;
        color: white;
        padding: 20px;
        font-family: Arial, sans-serif;
        font-size: 24px; /* Increase the font size here */
        line-height: 1.5;
    }
    
    .markdown-cell h1 {
        font-size: 24px;
        font-weight: bold;
        margin-bottom: 20px;
    }
    
    .markdown-cell h2 {
        font-size: 20px;
        font-weight: bold;
        margin-bottom: 15px;
    }
    
    .markdown-cell p {
        margin-bottom: 10px;
    }
</style>

<div class="markdown-cell">
    <h2 style="font-size: 28px;">Abstract</h2>
</div>


Our recommender system uses the <span style="color:blue"><b>text</b></span> content of the current article to recommend further articles to the user. The approach is simple: we compute TF-IDF features from the preprocessed article text and run a nearest-neighbour search over those features to find the most similar articles, as the rest of this notebook shows.

We used the <a href="https://www.kaggle.com/datasets/everydaycodings/global-news-dataset" style="color:red">global-news-dataset</a>, which was provided in the Problem Statement document.


<style>
    .markdown-cell {
        background-color: black;
        color: white;
        padding: 20px;
        font-family: Arial, sans-serif;
        font-size: 16px;
        line-height: 1.5;
    }
    
    .markdown-cell h1 {
        font-size: 24px;
        font-weight: bold;
        margin-bottom: 20px;
    }
    
    .markdown-cell h2 {
        font-size: 20px;
        font-weight: bold;
        margin-bottom: 15px;
    }
    
    .markdown-cell p {
        margin-bottom: 10px;
    }
</style>

<div class="markdown-cell">
    <h1 style="font-size: 24px;">Import Data, Libraries</h1>
</div>


We import the common libraries used throughout this notebook: pandas, scikit-learn, NLTK, and Python's built-in `re` module for regular expressions.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
import string
from nltk.tokenize import TweetTokenizer


# Load the articles into a DataFrame
articles_df = pd.read_csv("data/global-news-dataset/data.csv")




<div class="markdown-cell">
    <h1 style="font-size: 24px;">Data Preprocessing</h1>
</div>


The following function performs these steps:
1. <span style= "color:skyblue">Regular Expression-Based Cleaning</span>:
    
    - Removes <span style= "color:red">stock market tickers ($GE)</span>.
    - Removes <span style= "color:red">old-style retweet text ("RT")</span>.
    - Removes <span style= "color:red">hyperlinks</span>.
    - Strips the <span style= "color:red">"#"</span> symbol from <span style= "color:red">hashtags</span>, keeping the word itself.
2. <span style= "color:skyblue">Tokenization</span>:
    - Lowercases the text, strips @-handles, shortens long runs of repeated characters, and splits the text into a list of individual tokens using <span style= "color:red">TweetTokenizer</span>.
3. <span style= "color:skyblue">Stopword Removal and Punctuation Removal</span>:
    - Filters out <span style= "color:red">common stop words (e.g., "the", "a")</span> in English using `stopwords_english`.
    - Removes any remaining <span style= "color:red">punctuation characters (e.g., ".", ",", ";")</span>.
4. <span style= "color:skyblue">Stemming (Optional)</span>:
    - The commented-out line (`processed_text.append(word)`) would simply <span style= "color:red">add words without stemming</span>.
    - The active line performs stemming (`stemmer.stem(word)`) to <span style= "color:red">reduce words to their base form (e.g., "jumping" becomes "jump")</span>.
5. <span style= "color:skyblue">Return Processed Text</span>:
    - Returns the resulting list of processed words.

In [2]:

def process_text(text):
    """Process text function.
    Input:
        text: a string containing the text
    Output:
        processed_text: a list of words containing the processed text

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')

    # remove stock market tickers like $GE
    text = re.sub(r'\$\w*', '', text)
    # remove old style retweet text "RT"
    text = re.sub(r'^RT[\s]+', '', text)
    # remove hyperlinks    
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    # remove hashtags
    # only removing the hash # sign from the word
    text = re.sub(r'#', '', text)
    # tokenize text
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    text_tokens = tokenizer.tokenize(text)

    processed_text = []
    for word in text_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # processed_text.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            processed_text.append(stem_word)

    return processed_text

Here's the result of running the function on a text input.

In [19]:
process_text("I have worked very hard on this project.")

['work', 'hard', 'project']
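
The regular-expression stage of `process_text` can also be tried on its own, without NLTK. The sketch below applies the same four substitutions, in the same order, to a made-up tweet-like string. The helper name `clean_text` and the sample text are illustrative assumptions and are not part of this notebook.

```python
import re


def clean_text(text):
    """Apply only the regex cleaning steps from process_text (hypothetical helper)."""
    text = re.sub(r'\$\w*', '', text)                 # stock tickers such as $GE
    text = re.sub(r'^RT[\s]+', '', text)              # leading old-style "RT" marker
    text = re.sub(r'https?://[^\s\n\r]+', '', text)   # hyperlinks
    text = re.sub(r'#', '', text)                     # drop '#', keep the hashtag word
    return text


# Made-up input, chosen so that every rule fires once.
sample = "RT $GE shares rise #markets https://example.com/story"
cleaned = clean_text(sample)
print(cleaned.split())
```

Tokenization, stop word removal and stemming then operate on the cleaned string, which is why the tickers and links never reach the TF-IDF vocabulary.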

We preprocess each article's content and then use the <span style= "color:red">TF-IDF Vectorizer</span> from scikit-learn to <span style= "color:red">vectorize the text</span> of the dataset.

In [11]:
# Preprocess text content (cleaning, tokenizing, stop word removal, stemming).
# fillna("") guards against any rows with missing content, which would
# otherwise make re.sub fail inside process_text.
articles_df["preprocessed_text"] = articles_df["content"].fillna("").apply(process_text)
# Join each list of processed words back into a single space-separated string.
articles_df["preprocessed_text"] = articles_df["preprocessed_text"].apply(' '.join)

# Create TF-IDF features
vectorizer = TfidfVectorizer()  
article_features = vectorizer.fit_transform(articles_df["preprocessed_text"])
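
To make the vectorization step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings, as far as we understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function name `tfidf` and the toy documents are our own illustrative assumptions. Unlike scikit-learn's default tokenizer, this sketch simply splits on whitespace.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights.

    Illustrative re-implementation, assuming the smoothed formula
    idf(t) = ln((1 + n) / (1 + df(t))) + 1 and whitespace tokenization.
    """
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows


# Toy, already-preprocessed documents (made up for illustration).
docs = ["market share rise", "market fall", "team win match"]
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

A term that occurs in many documents, such as "market" here, receives a lower weight than a term unique to one document, which is what lets the later similarity search focus on distinctive words.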


Then we save <span style= "color:red">the trained vectorizer</span> so that we can use it later to <span style= "color:red">vectorize our test data</span>.

In [16]:
import pickle 
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
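
The save-and-reload pattern can be checked in isolation. The following sketch round-trips a stand-in object through `pickle` in a temporary directory; the dictionary `fitted_state` and its values are made up for illustration and merely stand in for the fitted vectorizer.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted vectorizer: any picklable object is handled the same way.
fitted_state = {"vocabulary": {"market": 0, "rise": 1}, "idf": [1.29, 1.69]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectorizer.pkl")
    with open(path, "wb") as f:   # pickle needs binary write mode
        pickle.dump(fitted_state, f)
    with open(path, "rb") as f:   # and binary read mode when loading
        restored = pickle.load(f)

# The restored object is an equal copy, not the same object in memory.
print(restored == fitted_state, restored is fitted_state)
```

Reloading the saved vectorizer, rather than fitting a new one, keeps the vocabulary and IDF weights identical between training and querying. Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.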



<div class="markdown-cell">
    <h1 style="font-size: 24px;">Training the Model</h1>
</div>

Our data, which was of type <span style= "color:red">string</span>, is now in the form of a <span style= "color:red">numerical CSR matrix</span>. From here we fit scikit-learn's <span style= "color:red">NearestNeighbors</span> model, using <span style= "color:red">cosine distance as the metric</span>. Strictly speaking, this is an unsupervised nearest-neighbour search rather than clustering: nothing is grouped when the model is fitted.
This essentially means that, <span style= "color:red">for any query vector, the model returns the 10 articles whose TF-IDF vectors are closest to it</span>.

In [5]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=10,metric="cosine")
model.fit(article_features)
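
For intuition, the search that the fitted model performs can be sketched as a brute-force scan in plain Python: compute the cosine distance (1 minus cosine similarity) from the query to every row and keep the `k` smallest. The function names `cosine_distance` and `k_nearest` and the toy vectors are illustrative assumptions. This is a sketch of the idea, not scikit-learn's actual implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of u and v.

    Assumption of this sketch: if either vector is all zeros, return 1.0.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def k_nearest(query, rows, k):
    """Return (distances, indices) of the k rows closest to query, nearest first."""
    scored = sorted((cosine_distance(query, row), i) for i, row in enumerate(rows))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy feature rows standing in for TF-IDF vectors of three articles.
rows = [
    [1.0, 0.0, 0.0],   # article 0
    [0.8, 0.6, 0.0],   # article 1
    [0.0, 0.0, 1.0],   # article 2
]
distances, indices = k_nearest([1.0, 0.1, 0.0], rows, k=2)
print(indices, [round(d, 3) for d in distances])
```

Because cosine distance depends only on the angle between vectors, a short keyphrase and a long article can still be close as long as they share the same distinctive terms.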


<div class="markdown-cell">
    <h1 style="font-size: 24px;">Test</h1>
</div>


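Now we load our <span style= "color:red">fitted model</span> and our <span style= "color:red">pretrained vectorizer</span>.

In [None]:
import joblib
# Assumes the fitted model was saved earlier with
# joblib.dump(model, 'article_model.joblib'); that step is not shown in this notebook.
model = joblib.load('article_model.joblib')

In [17]:
import pickle
# The vectorizer was saved with pickle above, so we load it the same way.
with open('vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)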

We first use KeyBERT to extract up to three keyphrases from an article. The `find_related_articles` function then treats each keyphrase exactly as we treated our training data: we <span style= "color:red">preprocess it</span>, transform it with the fitted vectorizer, then use our <span style= "color:red">KNN model</span> to find the <span style= "color:red">most similar articles</span> to our input. We then <span style= "color:red">fetch the articles</span> from the `articles_df` dataframe based on the <span style= "color:red">indices of the nearest neighbors</span>, returning a <span style= "color:red">subset of articles related to the input keywords</span>.

In [None]:
from keybert import KeyBERT

# Create a KeyBERT instance
keybert = KeyBERT()

# Extract keyphrases from the first article
first_article = articles_df.iloc[0]['content']

keyphrases = keybert.extract_keywords(first_article, keyphrase_ngram_range=(3,3),
                                       use_mmr=True, diversity=0.2, stop_words='english')
keyphrases = keyphrases[:3]

In [2]:
def find_related_articles(keywords):
    # Preprocess the user input keywords
    processed_keywords = process_text(keywords)
    processed_keywords = ' '.join(processed_keywords)
    
    # Transform the user input keywords into TF-IDF features
    keyword_features = vectorizer.transform([processed_keywords])
    
    # Find the nearest neighbors to the user input keywords
    distances, indices = model.kneighbors(keyword_features)
    
    # Get the related articles based on the nearest neighbors
    related_articles = articles_df.iloc[indices[0]]
    
    return related_articles


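# `sample` is not defined anywhere in this notebook; we assume it holds a query
# string and use the first extracted keyphrase as a stand-in.
sample = keyphrases[0][0]
find_related_articles(sample)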

In [None]:
result = {}
for keyword, _score in keyphrases:  # each entry is a (phrase, score) tuple
    related_articles = find_related_articles(keyword)
    result[keyword] = []
    for index, article in related_articles.iterrows():
        title = article['title']
        link = article['url']
        result[keyword].append({'Title': title, 'Link': link})
result
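
The shape of `result` is easier to see on toy data. The sketch below mimics the lookup with plain lists: `indices` has one row per query, as `kneighbors` returns, so `indices[0]` holds the neighbours of our single query. The article titles, URLs and the keyphrase are made up for illustration.

```python
# Toy stand-in for articles_df: one dict per article (made-up values).
articles = [
    {"title": "Markets rally", "url": "https://example.com/a"},
    {"title": "Team wins final", "url": "https://example.com/b"},
    {"title": "Shares fall", "url": "https://example.com/c"},
]

# kneighbors-style output for a single query: one row of indices, nearest first.
indices = [[2, 0]]

# Hypothetical keyphrase used as the dictionary key.
keyword = "stock market shares"
related = [articles[i] for i in indices[0]]
result = {keyword: [{"Title": a["title"], "Link": a["url"]} for a in related]}
print(result)
```

Each keyphrase therefore maps to a list of title and link pairs, ordered from the most to the least similar article.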