<div class="markdown" style="background-color: white; color:black;" >
    <div class="markdown" style="background-color: white; color:black; text-align: center;">
        <h1 style="font-size: 48px; font-weight: bold;">Recommender System for Recommending Articles to Users</h1>
    </div>
</div>


<style>
    .markdown-cell {
        background-color: black;
        color: white;
        padding: 20px;
        font-family: Arial, sans-serif;
        font-size: 24px; /* Increase the font size here */
        line-height: 1.5;
    }
    
    .markdown-cell h1 {
        font-size: 24px;
        font-weight: bold;
        margin-bottom: 20px;
    }
    
    .markdown-cell h2 {
        font-size: 20px;
        font-weight: bold;
        margin-bottom: 15px;
    }
    
    .markdown-cell p {
        margin-bottom: 10px;
    }
</style>

<div class="markdown-cell">
    <h2 style="font-size: 28px;">Abstract</h2>
</div>


Our recommender system uses the <span style="color:blue"><b>text</b></span> content of the current article to recommend more articles to the user. The approach is intuitive and fairly simple, as the rest of this notebook shows.

We used the <a href="https://www.kaggle.com/datasets/everydaycodings/global-news-dataset" style="color:red">global-news-dataset</a>, which was provided in the Problem Statement document.


<style>
    .markdown-cell {
        background-color: black;
        color: white;
        padding: 20px;
        font-family: Arial, sans-serif;
        font-size: 16px;
        line-height: 1.5;
    }
    
    .markdown-cell h1 {
        font-size: 24px;
        font-weight: bold;
        margin-bottom: 20px;
    }
    
    .markdown-cell h2 {
        font-size: 20px;
        font-weight: bold;
        margin-bottom: 15px;
    }
    
    .markdown-cell p {
        margin-bottom: 10px;
    }
</style>

<div class="markdown-cell">
    <h1 style="font-size: 24px;">Import Data, Libraries</h1>
</div>


We import the common libraries we need: pandas, scikit-learn, NLTK, and Python's built-in `re` (regex) and `string` modules.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")
import re
import string
from nltk.tokenize import TweetTokenizer


# Load the articles into a DataFrame
articles_df = pd.read_csv("data/global-news-dataset/data.csv")




<div class="markdown-cell">
    <h1 style="font-size: 24px;">Data Preprocessing</h1>
</div>


The following function performs these steps:
1. <span style= "color:skyblue">Regular Expression-Based Cleaning</span>:
    
    - Removes <span style= "color:red">stock market tickers ($GE)</span>.
    - Removes <span style= "color:red">old-style retweet text ("RT")</span>.
    - Removes <span style= "color:red">hyperlinks</span>.
    - Strips the <span style= "color:red">"#"</span> symbol from <span style= "color:red">hashtags</span>, keeping the word itself.
2. <span style= "color:skyblue">Tokenization</span>:
    - Converts the text into a list of individual lowercase words using <span style= "color:red">TweetTokenizer</span>, which also strips @-handles and shortens repeated characters.
3. <span style= "color:skyblue">Stopword Removal and Punctuation Removal</span>:
    - Filters out <span style= "color:red">common stop words (e.g., "the", "a")</span> in English using the NLTK list stored in `stopwords_english`.
    - Removes any remaining <span style= "color:red">punctuation characters (e.g., ".", ",", ";")</span>.
4. <span style= "color:skyblue">Stemming (Optional)</span>:
    - The active line performs stemming (`stemmer.stem(word)`) to <span style= "color:red">reduce words to their stem (e.g., "jumping" becomes "jump")</span>.
    - To skip stemming, use the commented-out line (`processed_text.append(word)`) instead, which would simply <span style= "color:red">add words without stemming</span>.
5. <span style= "color:skyblue">Return Processed Text</span>:
    - Returns the resulting list of processed words.

In [2]:

def process_text(text):
    """Process text function.
    Input:
        text: a string containing the text
    Output:
        processed_text: a list of words containing the processed text

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')

    # remove stock market tickers like $GE
    text = re.sub(r'\$\w*', '', text)
    # remove old style retweet text "RT"
    text = re.sub(r'^RT[\s]+', '', text)
    # remove hyperlinks    
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    # remove hashtags
    # only removing the hash # sign from the word
    text = re.sub(r'#', '', text)
    # tokenize text
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    text_tokens = tokenizer.tokenize(text)

    processed_text = []
    for word in text_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # processed_text.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            processed_text.append(stem_word)

    return processed_text

Here's the result of running the function on a text input.

In [19]:
process_text("I have worked very hard on this project.")

['work', 'hard', 'project']
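
The sentence above contains nothing for the regex steps to remove, so here is a minimal sketch that isolates only the four regex substitutions from `process_text` and runs them on a tweet-like string. The helper name `clean_raw_text` and the sample text (including the `example.com` URL) are made up for illustration; the patterns are copied from the function above.

```python
import re

def clean_raw_text(text):
    """Apply only the regex cleaning steps of process_text (no tokenization)."""
    text = re.sub(r'\$\w*', '', text)                # stock tickers like $GE
    text = re.sub(r'^RT[\s]+', '', text)             # leading old-style "RT "
    text = re.sub(r'https?://[^\s\n\r]+', '', text)  # hyperlinks
    text = re.sub(r'#', '', text)                    # drop "#", keep the word
    return text

# Hypothetical sample input, not taken from the dataset
sample = "RT $GE shares jump after #earnings report https://example.com/news"
cleaned = clean_raw_text(sample)
print(repr(cleaned))  # 'shares jump after earnings report '
```

Note the trailing space left behind by the removed link. It is harmless here because the tokenizer that runs next ignores surrounding whitespace.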

We preprocess each article's `content` and then use the <span style= "color:red">TF-IDF Vectorizer</span> offered by scikit-learn to <span style= "color:red">vectorize the data</span> from the dataset.

In [11]:
# Preprocess text content (cleaning, tokenizing, stop words removal)
articles_df["preprocessed_text"] = articles_df["content"].apply(process_text)
articles_df["preprocessed_text"] = articles_df["preprocessed_text"].apply(lambda x: ' '.join(x)) #joins the list of processed words into a single string with space.

# Create TF-IDF features
vectorizer = TfidfVectorizer()  
article_features = vectorizer.fit_transform(articles_df["preprocessed_text"])
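
To make the TF-IDF step less of a black box, the following is a rough pure-Python sketch of the weighting on three toy documents. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows) and splits on whitespace, so it is not a drop-in replacement for `TfidfVectorizer`, whose default tokenizer also drops one-character tokens. The function name `tfidf_rows` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy documents that look like already-preprocessed text
docs = ["work hard project", "hard work key success", "new movi photo set"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first document, "project" gets a higher weight than "work" or "hard" because it appears in only one document, which is the behaviour we rely on to make rare, topical words dominate the similarity search.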


Then we save <span style= "color:red">the fitted vectorizer</span> so that we can use it later to <span style= "color:red">vectorize our test data</span> with the same vocabulary and weights.

In [16]:
import pickle 
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)



<div class="markdown-cell">
    <h1 style="font-size: 24px;">Training the Model</h1>
</div>

Our data, which was of type <span style= "color:red">string</span>, is now in the form of a <span style= "color:red">numerical CSR-Matrix</span>. From here it's simple: we fit scikit-learn's <span style= "color:red">Nearest Neighbours</span> model, an unsupervised neighbour search rather than a clustering technique, using <span style= "color:red">cosine distance as the distance metric</span>.
This essentially means that, for any query vector, the model returns <span style= "color:red">the 10 articles whose TF-IDF vectors are most similar to it</span>.

In [5]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=10,metric="cosine")
model.fit(article_features)
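
As a rough illustration of what the cosine metric measures, here is a small brute-force sketch in plain Python. It assumes cosine distance is defined as `1 - cosine similarity`; the function names and toy vectors are hypothetical, and the zero-vector handling is our own choice rather than anything taken from scikit-learn.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k):
    """Brute-force search: return the k smallest (distance, index) pairs."""
    scored = sorted((cosine_distance(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

# Toy "article" vectors and a query pointing the same way as vector 0
vectors = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
query = [2.0, 2.0, 0.0]
neighbors = nearest_neighbors(query, vectors, k=2)
print([index for _, index in neighbors])  # [0, 1]
```

The query is twice as long as vector 0 but points in the same direction, so their distance is effectively zero. Cosine distance ignores vector length, which suits text because a long article and a short query can still be about the same topic.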


<div class="markdown-cell">
    <h1 style="font-size: 24px;">Test</h1>
</div>


Now we load our <span style= "color:red">saved vectorizer</span>.

In [17]:
with open('vectorizer.pkl', 'rb') as f:
    vectorizer1 = pickle.load(f)

The following function applies to the user's input the same steps we applied to our training data. We <span style= "color:red">preprocess it</span>, transform it with the loaded vectorizer, then use our <span style= "color:red">KNN model</span> to find the <span style= "color:red">most similar articles</span> to our input. We then <span style= "color:red">fetch the articles</span> from the `articles_df` dataframe based on the <span style= "color:red">indices of the nearest neighbors</span>, returning a <span style= "color:red">subset of articles related to the input keywords</span>.

In [20]:
def find_related_articles(keywords):
    # Preprocess the user input keywords
    processed_keywords = process_text(keywords)
    processed_keywords = ' '.join(processed_keywords)
    
    # Transform the user input keywords into TF-IDF features
    keyword_features = vectorizer1.transform([processed_keywords])
    
    # Find the nearest neighbors to the user input keywords
    distances, indices = model.kneighbors(keyword_features)
    
    # Get the related articles based on the nearest neighbors
    related_articles = articles_df.iloc[indices[0]]
    
    return related_articles


find_related_articles("I worked very hard on this project.")

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,preprocessed_text
33882,39662,,Small Business Trends,Gabrielle Pickard-Whitehead,60 Famous Hard Work Quotes to Motivate Your Team,"In a slump and in need of a boost? These famous hard work quotes, spoken by true leaders and ach...",https://smallbiztrends.com/2023/10/famous-hard-work-quotes.html,https://media.smallbiztrends.com/2019/07/yoda-star-wars-hard-work-quote.png,2023-10-17 14:30:35.000000,Hard work is the most important key to success. Without being willing to work hard and put every...,Entrepreneurship,,hard work import key success without will work hard put everyth ventur busi success nearli impos...
48530,77012,,Wealthofgeeks.com,Richard Pretorius,College Graduate Shocked by Reality of 9-To-5 Work Life,"The old work hard, play hard ethic got an overwhelming knockdown when a TikTok user complained a...",https://wealthofgeeks.com/college-grad-shocked-work-life-reality/,https://wealthofgeeks.com/wp-content/uploads/2023/10/Stressed-Woman-1.jpg,2023-10-26 15:39:54.000000,"The old work hard, play hard ethic got an overwhelming knockdown when a TikTok user complained a...","Korea, Republic of",,old work hard play hard ethic got overwhelm knockdown tiktok user complain 9 5 job new york brie...
35962,43445,,Above the Law,Chris Williams,There Might Be Hope For Workplace Discrimination Suits In Texas After All,Change deep in the heart of Texas.\nThe post There Might Be Hope For Workplace Discrimination Su...,https://abovethelaw.com/2023/10/there-might-be-hope-for-workplace-discrimination-suits-in-texas-...,https://abovethelaw.com/uploads/2019/05/teacherappreciation2019-300x205.jpg,2023-10-12 21:45:12.000000,Getting up and going to work is hard enough without your colleagues giving you a hard time. Whil...,Philosophy,,get go work hard enough without colleagu give hard time justifi reason like think “ griffith not...
30360,33673,,Lifehacker.com,Becca Lewis,10 Podcasts That Will Make You a Better DIYer,"If you have a big project in front of you that you’re working on mostly on your own, you can tun...",https://lifehacker.com/the-best-home-improvement-podcasts-1850952184,"https://i.kinja-img.com/image/upload/c_fill,h_675,pg_1,q_80,w_1200/be658d95f93a9e8340a403a121980...",2023-10-24 17:00:00.000000,"If you have a big project in front of you that youre working on mostly on your own, you can tune...",Gardening,,big project front your work mostli tune podcast keep compani work hear other attempt r … 4399 char
35745,43156,,Crookedtimber.org,Ingrid Robeyns,Some thoughts on ‘team philosophy’,"In my academic job, I’ve just started a new 5-year project called ‘Visions for the future‘. In t...",https://crookedtimber.org/2023/10/21/some-thoughts-on-team-philosophy/,https://s0.wp.com/i/blank.jpg,2023-10-21 15:38:05.000000,"In my academic job, I’ve just started a new 5-year project called ‘Visions for the future‘. In t...",Philosophy,,academ job ’ start new 5 year project call ‘ vision futur ‘ first year project ’ tackl methodolo...
23631,21784,,Harvard Business Review,"Deborah Lovich, Rosie Sargeant",Does Your Hybrid Strategy Need to Change?,Companies continue to struggle to design and implement a post-Covid return-to-office strategy th...,https://hbr.org/2023/10/does-your-hybrid-strategy-need-to-change,https://hbr.org/resources/images/article_assets/2023/10/Oct23_02_BeatriceCaciotti.jpg,2023-10-02 13:00:00.000000,"If it doesn’t work for everyone, it’s not working.\n""&gt;\nThis fall, companies are once again p...",COVID,,’ work everyon ’ work fall compani push worker return offic continu debat mani day employe work ...
51576,118276,,Just Jared,Just Jared,Gerard Butler Spotted With Blond Hair While Filming New Movie with Director Julian Schnabel (Pho...,Gerard Butler is hard at work on his next movie and we have some new photos from set! The 53-yea...,https://www.justjared.com/2023/11/03/gerard-butler-spotted-with-blond-hair-while-filming-new-mov...,https://cdn.justjared.com/wp-content/uploads/headlines/2023/11/butler-blond.jpg,2023-11-03 06:53:47.000000,Gerard Butler is hard at work on his next movie and we have some new photos from set!\nThe 53-ye...,Amazon,,gerard butler hard work next movi new photo set 53 year-old actor spot blond hair get work film ...
22605,20206,,Lifehacker.com,Khamosh Pathak,Why You Should Turn Websites Into Apps on Your Mac,"Now that so much of our work happens on websites, it’s hard to find native apps designed for Mac...",https://lifehacker.com/why-you-should-turn-websites-into-apps-on-your-mac-1850956567,"https://i.kinja-img.com/image/upload/c_fill,h_675,pg_1,q_80,w_1200/6ced7f9785d1b0b85e4563b8b2be2...",2023-10-25 16:00:00.000000,"Now that so much of our work happens on websites, its hard to find native apps designed for Macs...",Google,,much work happen websit hard find nativ app design mac spend lot time obscur websit custom-mad w...
45819,70679,,Forbes,"Expert Panel®, Forbes Councils Member, \n Expert Panel®, Forbes Councils Member\n https://www.fo...",How Nonprofit Leaders Can Build And Maintain An Engaged Team,"While nonprofit professionals recognize the importance of what they do, staying fully engaged at...",https://www.forbes.com/sites/forbesnonprofitcouncil/2023/10/18/how-nonprofit-leaders-can-build-a...,https://imageio.forbes.com/specials-images/imageserve/6446949b5ec303b2261742cc/0x0.jpg?format=jp...,2023-10-18 17:15:58.000000,"getty\nBetween long, hard hours and an endless task list, working in the nonprofit industry can ...",Haiti,,getti long hard hour endless task list work nonprofit industri sometim feel like thankless work ...
23735,21987,business-insider,Business Insider,Kwan Wei Kevin Tan,Blackstone's Stephen Schwarzman says remote workers 'don't work as hard' and profit from working...,"""You know, they can make their lunch at home. They don't have to buy expensive clothes. And so t...",https://www.businessinsider.com/stephen-schwarzman-accused-remote-workers-of-not-working-hard-20...,https://i.insider.com/6538764196f7540cd064c1fd?width=1200&format=jpeg,2023-10-25 03:25:18.000000,"Blackstone CEO Stephen Schwarzman said it was ""more profitable"" for people to work remotely beca...",COVID,BillionaireStephen Schwarzmansaid on Tuesday that people benefited from remote work because they...,blackston ceo stephen schwarzman said profit peopl work remot work hard could save money commuti...
