## Simple bag-of-words recommenders

So far we have a recommender built on a term-frequency matrix. Each "document" (the title or the abstract of a news article) is represented as a vector
holding the count of each unique word found in the document collection.
A similarity matrix is then calculated (cosine similarity for now). To recommend, we select an article and look up its row in the similarity matrix to find the top n most similar articles.
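
To make the idea concrete, here is a minimal sketch on three made-up documents (they are not taken from the dataset). It builds the count matrix, computes the full cosine similarity matrix, and ranks the other documents against the first one. The names `docs`, `sim` and `most_similar` are chosen only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, made up for illustration (not from the dataset)
docs = [
    "prince charles fashion",
    "prince charles fashion business",
    "belly fat habits",
]

# Each row is a document, each column a vocabulary word, each cell a count
X = CountVectorizer().fit_transform(docs)

# sim[i, j] is the cosine similarity between document i and document j
sim = cosine_similarity(X)

# Rank all documents by similarity to document 0, most similar first,
# then leave out document 0 itself
ranked = sim[0].argsort()[::-1]
most_similar = [int(i) for i in ranked if i != 0]
print(most_similar)
```

Document 1 shares three of its four words with document 0, while document 2 shares none, so document 1 is ranked first.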


#### Things to do
- look at other matrix representations
    - Simple binary representation (only 1 if term in document, 0 otherwise)
    - TF-IDF matrix (terms that are less frequent in the collection are weighted more)

- look at other similarity measures
    - cosine similarity is often most popular, but others can be explored

- combine title and abstract columns and do the techniques on those
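
As a starting point for the first to-do item, the two alternative representations could be sketched with scikit-learn's `CountVectorizer(binary=True)` and `TfidfVectorizer`. The documents below are made up for illustration, and the variable names are assumptions, not part of the notebook so far.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, made up for illustration (not from the dataset)
docs = ["royal royal wedding", "royal fashion", "belly fat"]

# Binary representation: 1 if the term occurs in the document, 0 otherwise,
# no matter how many times it occurs
binary_vec = CountVectorizer(binary=True)
binary_X = binary_vec.fit_transform(docs)

# TF-IDF representation: counts are reweighted so that terms appearing in
# fewer documents of the collection get a higher weight
tfidf_vec = TfidfVectorizer()
tfidf_X = tfidf_vec.fit_transform(docs)

# vocabulary_ maps each word to its column in the matrix
b_vocab = binary_vec.vocabulary_
t_vocab = tfidf_vec.vocabulary_

# "royal" occurs twice in document 0 but is stored as 1
print(binary_X[0, b_vocab["royal"]])

# In document 1, "fashion" (found in one document) outweighs
# "royal" (found in two documents)
print(tfidf_X[1, t_vocab["fashion"]] > tfidf_X[1, t_vocab["royal"]])
```

Both vectorizers return sparse matrices with one row per document, so either could be passed to `cosine_similarity` in place of the count matrix.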

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# NOTE: Unsure whether the imports below are needed
# For text processing
# import nltk
# nltk.download("punkt")
# from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sindr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [32]:
# read news data
df = pd.read_csv("MINDsmall_train/news.tsv",
    sep="\t",
    names=["newsId", "category", "subcategory", "title","abstract", "url", "title_entities","abstract_entities"]
)

df = df[["newsId", "category", "subcategory", "title", "abstract"]]


# check for missing values
print(df.isna().sum())
print("we see that we are missing 2666 abstracts. We need to drop these when we use the abstracts")

df.head()

newsId            0
category          0
subcategory       0
title             0
abstract       2666
dtype: int64
we see that we are missing 2666 abstracts. We need to drop these when we use the abstracts


|   | newsId | category | subcategory | title | abstract |
|---|--------|----------|-------------|-------|----------|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... |
| 1 | N19639 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... |


### Creating matrix representations for the titles and abstracts

In [29]:
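# Initialize one CountVectorizer per text column, so that each column
# gets its own vocabulary (English stop words are removed)
vectorizer = CountVectorizer(stop_words='english')
vectorizer2 = CountVectorizer(stop_words='english')

# Fit and transform the titles into a sparse term-count matrix
title_X = vectorizer.fit_transform(df['title'])

# Drop records without an abstract. The index is reset so that row i of
# abstract_X corresponds to row i of abstract_df (not of df).
abstract_df = df.dropna(subset=['abstract']).reset_index(drop=True)
abstract_X = vectorizer2.fit_transform(abstract_df['abstract'])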


print("Title matrix shape: ", title_X.shape)
print("Abstract matrix shape: ", abstract_X.shape)

Title matrix shape:  (51282, 30710)
Abstract matrix shape:  (48616, 50591)
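
For the to-do item about combining the title and abstract columns, one possible approach is to concatenate the two columns and treat a missing abstract as an empty string, so that no rows are dropped and row positions stay aligned with the original frame. The sketch below uses a tiny made-up frame (`toy_df`) with the same column names as `df`; `combined_text`, `combined_vectorizer` and `combined_X` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in frame with the same column names as df (rows are made up)
toy_df = pd.DataFrame({
    "title": ["Royal fashion news", "Belly fat habits"],
    "abstract": ["Shop the royal look", None],
})

# Treat a missing abstract as an empty string so no rows are dropped
combined_text = toy_df["title"] + " " + toy_df["abstract"].fillna("")

# Same vectorizer settings as for the separate title and abstract matrices
combined_vectorizer = CountVectorizer(stop_words="english")
combined_X = combined_vectorizer.fit_transform(combined_text)

print("Combined matrix shape: ", combined_X.shape)
```

Because every article keeps its row, a similarity matrix built from `combined_X` would use the same row positions as the title matrix.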


### Computing similarities

In [30]:
# Compute the similarity matrix for the titles.
# cosine_similarity returns a dense n x n array, which is memory-hungry.
title_similarity_matrix = cosine_similarity(title_X)
# NOTE: did not have enough memory to run this as well
#abstract_similarity_matrix = cosine_similarity(abstract_X)



MemoryError: Unable to allocate 17.6 GiB for an array with shape (48616, 48616) and data type float64
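
One possible way around the memory limit, sketched here as an assumption rather than something the notebook has tried, is to skip the full n x n matrix and compute only the similarity row for the article being queried. `cosine_similarity` accepts two inputs, so a single row can be compared against all rows. The helper `top_n_similar` and the toy documents are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_n_similar(X, row, top_n=5):
    """Return the row positions in X of the top_n rows most similar to X[row]."""
    # Compare one row against all rows: the result has shape (1, n_docs),
    # so only n_docs floats are allocated instead of n_docs * n_docs
    scores = cosine_similarity(X[row], X).ravel()
    # Make sure the query row itself is never returned
    scores[row] = -np.inf
    top_n = min(top_n, len(scores) - 1)
    # Stable sort on the negated scores gives descending order,
    # with ties kept in their original row order
    order = np.argsort(-scores, kind="stable")
    return order[:top_n]

# Toy documents, made up for illustration (not from the dataset)
docs = [
    "prince charles fashion",
    "prince charles fashion business",
    "belly fat habits",
]
X = CountVectorizer().fit_transform(docs)
result = top_n_similar(X, row=0, top_n=2).tolist()
print(result)
```

The returned positions refer to rows of the matrix that was passed in. For the abstracts, they would therefore have to be looked up in `abstract_df` rather than `df`, since the rows without an abstract were dropped.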

In [35]:
def recommend_articles(article_id, similarity_matrix, top_n=5):
    # Get the row position of the article. This relies on df having its
    # default 0..n-1 index, so that index labels match matrix rows.
    idx = df[df['newsId'] == article_id].index[0]

    # Pair each article position with its similarity to the selected article
    similarity_scores = list(enumerate(similarity_matrix[idx]))

    # Leave out the selected article itself so it is never recommended
    similarity_scores = [s for s in similarity_scores if s[0] != idx]

    # Sort the articles by similarity score, most similar first
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Get the positions of the top_n most similar articles
    top_article_indices = [i[0] for i in similarity_scores[:top_n]]

    # Return the top_n most similar articles
    return df[['newsId', 'title']].iloc[top_article_indices]

# Example: Recommend articles similar to the first article in the dataset
recommended_articles = recommend_articles("N55528", title_similarity_matrix, top_n=5)
print(recommended_articles)



       newsId                                              title
28360   N9056  This Is What Queen Elizabeth Is Doing About th...
29974  N60671  Prince Charles Teared Up When Prince William T...
38035  N43522             Prince Charles is Getting Into Fashion
44779  N57591  Prince Charles Is Getting Into the Fashion Bus...
22805  N43917  Prince William Is "Worried" About Prince Harry...
