## Simple bag-of-words recommenders

So far we have a recommender that uses a term-frequency matrix. Each "document" (the title or abstract of a news article) is represented by a vector
with the count of each unique word found in the document collection.
Then a similarity matrix is calculated (cosine similarity for now). We can select an article and look up its row in the similarity matrix to find the top n most similar articles.
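
As a minimal sketch of the idea described above, the steps can be tried on a few toy titles. The titles and the helper `top_n_similar` are hypothetical and only serve as an illustration; they are not part of the MIND data or of the functions defined later in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy "titles", not taken from the MIND dataset
titles = [
    "bear stuck in dumpster",
    "bear rescued from dumpster by deputies",
    "stock markets fall sharply",
    "markets rally after sharp fall",
]

# Term-frequency matrix: one row per title, one column per unique word
tf_matrix = CountVectorizer().fit_transform(titles)

# Pairwise cosine similarity between all titles (dense array, shape 4 x 4)
similarity = cosine_similarity(tf_matrix)


def top_n_similar(idx, n=2):
    """Return the positions of the n titles most similar to title idx."""
    ranked = np.argsort(similarity[idx])[::-1]      # most similar first
    return [int(i) for i in ranked if i != idx][:n]  # drop the query itself


print(top_n_similar(0, n=1))
print(top_n_similar(2, n=1))
```

Each title is matched with the other title that shares the most words with it, which is all a bag-of-words recommender can see.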


#### Things to do
- look at other matrix representations
    - Simple binary representation (only 1 if term in document, 0 otherwise)
    - TF-IDF matrix (terms that are less frequent in the collection are weighted more)

- look at other similarity measures
    - cosine similarity is often most popular, but others can be explored

- combine the title and abstract columns and apply the same techniques to the combined text
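
The two alternative matrix representations listed above could be sketched as follows, assuming scikit-learn's `CountVectorizer(binary=True)` and `TfidfVectorizer` are acceptable stand-ins for what we end up using. The toy documents are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy documents, not taken from the MIND dataset
docs = [
    "bear bear bear in the dumpster",
    "bear in the woods",
    "markets fall",
]

# Binary representation: 1 if the term occurs in the document, 0 otherwise
binary_matrix = CountVectorizer(binary=True).fit_transform(docs)

# TF-IDF representation: terms that are rare in the collection get a higher weight
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Both matrices can be passed to cosine_similarity just like the count matrix
binary_similarity = cosine_similarity(binary_matrix)
tfidf_similarity = cosine_similarity(tfidf_matrix)

print(binary_matrix.toarray())      # the repeated "bear" is capped at 1
print(tfidf_similarity.round(2))
```

Since both vectorizers return a sparse matrix with one row per document, they could be swapped into the matrix-building step without changing the similarity step.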

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


## Data reading and preprocessing

In [54]:
news_data = pd.read_csv("../MINDsmall_train/news.tsv",
    sep='\t',
    names=["newsId", "category", "subcategory", "title","abstract", "url", "title_entities","abstract_entities"]
)

behaviors_data = pd.read_csv(
    "../MINDsmall_train/behaviors.tsv",
    sep='\t',
    names=["impressionId", "userId", "timestamp", "click_history", "impressions"],
    parse_dates=['timestamp'] 
)

# Pre-processing

# Check for missing values in the news data
print(news_data.isna().sum())
print("we see that we are missing 2666 abstracts. We need to drop these when we use the abstracts \n")


# Check for missing values in the behaviors data
print(behaviors_data.isna().sum())
print("we see that we are missing 3238 click histories. We need to drop these rows")

# Remove rows with null values. Note that for news_data this drops every row
# with any missing field (abstract, title_entities or abstract_entities)
news_data = news_data.dropna().reset_index(drop=True)
behaviors_data = behaviors_data.dropna().reset_index(drop=True)
print(behaviors_data.isna().sum())


behaviors_data['timestamp'] = pd.to_datetime(behaviors_data['timestamp'], format='%Y-%m-%d %H:%M:%S')

behaviors_data['impressions_list'] = behaviors_data['impressions'].str.split()


# behaviors_data['hour_of_day'] = behaviors_data['timestamp'].dt.hour
# behaviors_data['clicks'] = behaviors_data['click_history'].str.split().str.len()
# behaviors_data['impressions_count'] = behaviors_data['impressions_list'].str.len()

# sort behaviors_data by timestamp
behaviors_data = behaviors_data.sort_values(by='timestamp')



newsId                  0
category                0
subcategory             0
title                   0
abstract             2666
url                     0
title_entities          3
abstract_entities       4
dtype: int64
we see that we are missing 2666 abstracts. We need to drop these when we use the abstracts 

impressionId        0
userId              0
timestamp           0
click_history    3238
impressions         0
dtype: int64
we see that we are missing 3238 click histories. We need to drop these rows
impressionId     0
userId           0
timestamp        0
click_history    0
impressions      0
dtype: int64


In [15]:
behaviors_data.head()

Unnamed: 0,impressionId,userId,timestamp,click_history,impressions,impressions_list
19705,20112,U65916,2019-11-09 00:00:19,N51706 N40767 N12096 N9798 N38802 N54827 N5780...,N54300-0 N46057-1 N57005-0 N52154-0 N57099-0 N...,"[N54300-0, N46057-1, N57005-0, N52154-0, N5709..."
13531,13807,U49985,2019-11-09 00:01:13,N5056 N29975 N53234 N39603 N50032 N8422 N53580...,N20602-0 N50059-0 N57768-1 N50135-1 N15134-0 N...,"[N20602-0, N50059-0, N57768-1, N50135-1, N1513..."
27115,27660,U25550,2019-11-09 00:02:44,N17260 N38298 N33976 N47719 N14888 N18870 N4607,N50135-0 N15134-0 N52433-1 N20602-0 N64536-0,"[N50135-0, N15134-0, N52433-1, N20602-0, N6453..."
149080,152217,U19710,2019-11-09 00:02:50,N3530 N48284 N43019 N62546 N138 N13138 N10676 ...,N57099-0 N30295-0 N21086-0 N5379-0 N57005-0 N4...,"[N57099-0, N30295-0, N21086-0, N5379-0, N57005..."
41348,42166,U38106,2019-11-09 00:03:09,N16874 N264 N48697 N51366,N3491-0 N20602-0 N25785-0 N23575-0 N38783-0 N1...,"[N3491-0, N20602-0, N25785-0, N23575-0, N38783..."


### Add news data to the behaviors DataFrame (training data)


In [55]:

# The last article in the click history is the one we base recommendations on
behaviors_data['last_clicked_article'] = behaviors_data['click_history'].str.split().str[-1]

# Join the training data with the news data to access the title and abstract of "last_clicked_article"
behaviors_data = behaviors_data.join(news_data.set_index('newsId')[["title", "abstract"]], on='last_clicked_article')

# Combination column: title and abstract as one text
behaviors_data['combined'] = behaviors_data['title'] + " " + behaviors_data['abstract']

# After the join there were 40 missing values in the title and abstract columns,
# most likely because those articles were removed from news_data by dropna() above.
# We need to drop these rows.
behaviors_data = behaviors_data.dropna().reset_index(drop=True)

behaviors_data.head()

Unnamed: 0,impressionId,userId,timestamp,click_history,impressions,impressions_list,last_clicked_article,title,abstract,combined
0,20112,U65916,2019-11-09 00:00:19,N51706 N40767 N12096 N9798 N38802 N54827 N5780...,N54300-0 N46057-1 N57005-0 N52154-0 N57099-0 N...,"[N54300-0, N46057-1, N57005-0, N52154-0, N5709...",N2678,'The bridge has definitely been burned': Willi...,Seven-time Pro Bowl left tackle Trent Williams...,'The bridge has definitely been burned': Willi...
1,13807,U49985,2019-11-09 00:01:13,N5056 N29975 N53234 N39603 N50032 N8422 N53580...,N20602-0 N50059-0 N57768-1 N50135-1 N15134-0 N...,"[N20602-0, N50059-0, N57768-1, N50135-1, N1513...",N61880,Dean Martin's Daughter Speaks Out About John L...,"Dean Martin's Daughter on New 'Baby, It's Cold...",Dean Martin's Daughter Speaks Out About John L...
2,27660,U25550,2019-11-09 00:02:44,N17260 N38298 N33976 N47719 N14888 N18870 N4607,N50135-0 N15134-0 N52433-1 N20602-0 N64536-0,"[N50135-0, N15134-0, N52433-1, N20602-0, N6453...",N4607,Cause determined in Jessi Combs' fatal speed r...,Occurred at speeds near 550 mph,Cause determined in Jessi Combs' fatal speed r...
3,152217,U19710,2019-11-09 00:02:50,N3530 N48284 N43019 N62546 N138 N13138 N10676 ...,N57099-0 N30295-0 N21086-0 N5379-0 N57005-0 N4...,"[N57099-0, N30295-0, N21086-0, N5379-0, N57005...",N41244,Keanu Reeves holds hands with Alexandra Grant ...,The Internet is even more in love with Keanu R...,Keanu Reeves holds hands with Alexandra Grant ...
4,42166,U38106,2019-11-09 00:03:09,N16874 N264 N48697 N51366,N3491-0 N20602-0 N25785-0 N23575-0 N38783-0 N1...,"[N3491-0, N20602-0, N25785-0, N23575-0, N38783...",N51366,See massive bear stuck in Lake Tahoe dumpster ...,Placer County Sheriff's deputies encounter so ...,See massive bear stuck in Lake Tahoe dumpster ...


### Creating matrix representations for the titles and abstracts

In [38]:

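# Build a term-frequency (bag-of-words) matrix for one text column.
# feature must be "title", "abstract" or "combined"
def create_matrix(data, feature="title"):
    # English stop words are removed; each row holds the word counts of one document
    vectorizer = CountVectorizer(stop_words='english')
    tf_matrix = vectorizer.fit_transform(data[feature])
    return tf_matrix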



### Computing similarities

In [35]:
# Compute the similarity matrix
def create_similarity_matrix(tf_matrix):
    similarity_matrix = cosine_similarity(tf_matrix)
    return similarity_matrix
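
To make the similarity measure concrete, here is a small sketch that computes the cosine similarity of two term-count vectors by hand and compares it with `cosine_similarity`. The vectors `a` and `b` are hypothetical examples.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-count vectors for two documents over a four-word vocabulary
a = np.array([1, 2, 0, 1])
b = np.array([1, 0, 1, 1])

# cos(a, b) = (a . b) / (||a|| * ||b||)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity expects 2D input, one row per document
from_sklearn = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(manual, from_sklearn)
```

Because the measure only depends on the angle between the vectors, a long abstract and a short title can still be similar if they use the same words in similar proportions.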



### Recommendation function
Takes a row from the evaluation data, uses the last article in its click history as the query article, and finds the top_n most similar articles by using the similarity matrix

In [57]:
# Look up the row index of an example article in news_data
idx = news_data[news_data['newsId'] == "N2678"].index[0]
idx

47675

In [52]:
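def recommend_articles(training_df, evaluation_row, similarity_matrix, top_n=5):
    # The query article is the last article in the user's click history
    article_id = evaluation_row['click_history'].split()[-1]

    # Rows of similarity_matrix follow the row order of training_df, so we need the
    # position (not the DataFrame label) of a training row that holds this article
    article_ids = training_df['last_clicked_article'].to_numpy()
    positions = (article_ids == article_id).nonzero()[0]
    if len(positions) == 0:
        # The article does not occur in the training data: nothing to recommend
        return []
    idx = positions[0]

    # Rank all training rows by similarity to the query article, most similar first
    ranked = similarity_matrix[idx].argsort()[::-1]

    # Collect the top_n distinct articles, skipping the query article itself
    recommendations = []
    for i in ranked:
        candidate = article_ids[i]
        if candidate != article_id and candidate not in recommendations:
            recommendations.append(candidate)
        if len(recommendations) == top_n:
            break

    # Return a plain list of news ids so that `article in recommendations` works
    return recommendations

# Example: recommend articles for the first row, using a title similarity matrix built on a small sample
sample = behaviors_data.head(1000)
title_similarity_matrix = create_similarity_matrix(create_matrix(sample, "title"))
recommended_articles = recommend_articles(sample, sample.iloc[0], title_similarity_matrix, top_n=5)
print(recommended_articles)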




In [50]:
INITIAL_WINDOW_SIZE = 6
MAX_WINDOW_SIZE = 72
SLIDE_SIZE = 3
FEATURE = "title"
# Work on a smaller subset of behaviors_data
behaviors_data = behaviors_data.head(10000)
hits_per_window = {}  # key = (start_time, end_time), value = (hits, total_rows)


start_time = behaviors_data['timestamp'].min()
end_time = start_time + pd.Timedelta(hours=INITIAL_WINDOW_SIZE)


def get_data_split(start_time, end_time, test_window_size=SLIDE_SIZE):
    # returns a training and evaluation split
    training_data = behaviors_data[(behaviors_data['timestamp'] >= start_time) & (behaviors_data['timestamp'] < end_time)]
    evaluation_data = behaviors_data[(behaviors_data['timestamp'] >= end_time) & (behaviors_data['timestamp'] < end_time + pd.Timedelta(hours=test_window_size))]
    return training_data, evaluation_data



print("TEST")
print("start time " , start_time)
print("end time " , end_time)
print(pd.Timedelta(hours=SLIDE_SIZE))
print("max time " , behaviors_data['timestamp'].max())
      

#The sliding window
while end_time + pd.Timedelta(hours=SLIDE_SIZE) <= behaviors_data['timestamp'].max():
    print(f"Start time: {start_time}, End time: {end_time}")
    training_data, evaluation_data = get_data_split(start_time, end_time)

    tf_matrix = create_matrix(training_data, FEATURE)
    similarity_matrix = create_similarity_matrix(tf_matrix)
    
    # Recommend for each row in the evaluation data, using the similarity matrix built from the training data,
    # and count a hit when the clicked impression of that row is among the recommendations
    hits = 0
    total = 0
    print(f'Shape of evaluation data {evaluation_data.shape}')
    for index, row in evaluation_data.iterrows():
        total += 1

        recommendations = recommend_articles(training_data, row, similarity_matrix, top_n=5)
        clicked_article = next((n.split("-")[0] for n in row['impressions_list'] if n.split("-")[1] == "1"), None)

        if clicked_article in recommendations:
            hits += 1
            print(f'Hit for user {index} with article {clicked_article} and recommendations {recommendations}')

        print(f'Recommendations for user {index}: {recommendations}')
        print(f'Actual click for user {index}: {clicked_article}')
    hits_per_window[(start_time, end_time)] = (hits, total)

    print(f"Start time: {start_time}, End time: {end_time}")
    end_time += pd.Timedelta(hours=SLIDE_SIZE)
    if (end_time - start_time) > pd.Timedelta(hours = MAX_WINDOW_SIZE):
        print("\n")
        print("Increasing start time")
        print("\n")
        #remove data from the beginning of the training data
        start_time += pd.Timedelta(hours=SLIDE_SIZE)



TEST
start time  2019-11-09 00:00:19
end time  2019-11-09 06:00:19
0 days 03:00:00
max time  2019-11-09 16:32:50
Start time: 2019-11-09 00:00:19, End time: 2019-11-09 06:00:19
Shape of evaluation data (2902, 10)


TypeError: recommend_articles() missing 1 required positional argument: 'similarity_matrix'
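
The window logic of the loop above can be checked in isolation, without building any matrices. The sketch below assumes the same constants as the cell above; `generate_windows` and `hit_rate` are hypothetical helper names that do not exist elsewhere in this notebook.

```python
import pandas as pd


def generate_windows(first, last, initial_hours=6, max_hours=72, slide_hours=3):
    """Yield (train_start, train_end, eval_end) for an expanding, then sliding, window."""
    slide = pd.Timedelta(hours=slide_hours)
    start = first
    end = first + pd.Timedelta(hours=initial_hours)
    while end + slide <= last:
        yield start, end, end + slide
        end += slide
        # Once the training window exceeds max_hours, move its start forward as well
        if end - start > pd.Timedelta(hours=max_hours):
            start += slide


def hit_rate(hits_per_window):
    """Map each window to hits / total, skipping windows without evaluation rows."""
    return {w: h / t for w, (h, t) in hits_per_window.items() if t > 0}


# Time range taken from the printed output above
first = pd.Timestamp("2019-11-09 00:00:19")
last = pd.Timestamp("2019-11-09 16:32:50")
windows = list(generate_windows(first, last))
for train_start, train_end, eval_end in windows:
    print(train_start, train_end, eval_end)
```

With only the first 10000 rows, the data spans less than 17 hours, so the training window only ever expands and the 72 hour cap is never reached.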