In [None]:
##############################     7     ##############################
# (Intended) Upon login, 10 random articles are shown to the user to
# learn their preferences.
# Choose 10 random articles, show them to the user, and obtain a
# satisfaction score for each on a scale of 1 to 10 (10 = most satisfied,
# 1 = least satisfied). Then compute the user's preference vector by
# averaging the article vectors, using the ratings as weights.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Set numpy random seed to ensure replicability
np.random.seed(125)
ten_random_articles = np.random.choice(df.index, size=10, replace=False)

# Holder variable to record vector and score
user_scores = []

# Looping over the 10 articles
for index in ten_random_articles:
    # Show the article to the user and ask for a satisfaction score
    prompt = ("Please give a satisfaction score in the range of 1 to 10 "
              "on the recommended article: \n\n" + df['content'][index])
    score = int(input(prompt))

    # Turn it into a dictionary and add to holder list
    temp = {'vector': df['Vector'][index],
            'score' : score}
    user_scores.append(temp)

# Sum the article vectors, each scaled by its score (rescaled to (0, 1]), to
# generate the initial learned preference of the user
sum_vector = np.zeros((model.vector_size, ))
for rating in user_scores:
    weight = rating['score'] / 10
    sum_vector += weight * rating['vector']

# Divide by the number of rated articles to get the initial preference.
# (This is not normalized by the sum of the weights, but cosine similarity
# ignores the overall scale of a vector, so the ranking is unaffected.)
user_preference = sum_vector / len(user_scores)
user_preference = user_preference.reshape(1, -1)

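# Find the article that is most similar to the learned preference of the user,
# excluding the articles that were used for training
df_user = df.drop(index=ten_random_articles)
similarities = []
for index, row in df_user.iterrows():
    vector_article = row['Vector'].reshape(1, -1)
    # cosine_similarity returns a 1x1 array; keep only the scalar
    sim = cosine_similarity(user_preference, vector_article)[0, 0]
    similarities.append(sim)

# Map the position of the best score back to the article's label in df;
# a position in df_user is not a valid label in df once rows are dropped
best_index = df_user.index[int(np.argmax(similarities))]

df['content'][best_index]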
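
The cell above divides the weighted sum by the number of rated articles rather than by the sum of the weights. A minimal sketch of why this should not change which article ranks highest under cosine similarity: the two normalizations differ only by a positive scale factor. Plain Python lists stand in for the NumPy vectors so the example runs on its own; the toy vectors, the scores, and the helper names `cosine` and `weighted_sum` are assumptions for illustration, not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_sum(vectors, scores):
    """Sum of the vectors, each scaled by its score."""
    dim = len(vectors[0])
    return [sum(s * vec[k] for vec, s in zip(vectors, scores)) for k in range(dim)]

# Toy data (hypothetical): three rated article vectors and their 1-10 scores
rated_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [9, 2, 7]

total = weighted_sum(rated_vectors, scores)

# Normalization used in the cell above: score / 10, then divide by the count
pref_by_count = [x / 10 / len(scores) for x in total]
# Conventional weighted mean: divide by the sum of the weights
pref_by_weight = [x / sum(scores) for x in total]

# Both preference vectors point in the same direction, so any candidate
# article gets the same cosine similarity from either one
candidate = [0.3, 0.8]
sim_a = cosine(pref_by_count, candidate)
sim_b = cosine(pref_by_weight, candidate)
print(sim_a, sim_b)
```

The choice would matter for any measure that depends on vector length, such as Euclidean distance.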

In [None]:
import random as rd
# Prototype of what user preference learning will look like (next step)
# Here's how I imagine the user preference learning process:
# 1. The user logs in to their account, and their preference is learned
#       by supplying 10 articles for them to rate on a scale of 1 to 10
#       according to how satisfied they are with each article, with 10
#       being the most satisfied and 1 being not satisfied at all.
# 2. After each rating, we learn a little more about the user. We aim
#       to give the user the most semantically different article to
#       rate next (by maximizing the difference in similarity score
#       relative to the articles they have already rated), to get more
#       information about the user.
# 3. After the user rates all 10 articles, we take an average of these
#       vectors, weighted by the ratings they gave to each of them. This
#       will be the initial learned preference of the user.
# 4. Whenever the user clicks on an article recommended to them (which pops
#       up in a separate window), the website will show a window that asks
#       them to rate the article they just read on a scale of 1 to 10.
#       We use this to learn a little more about the user.


# Taking myself as the test case, I am going to simulate being a user here
# Create a holder list that will hold all of the vectors that the user
# has rated, the index of the article, and their rating of the articles
zichen = []
# Set a seed to ensure replicability
rd.seed(12454)
# Determine a random first article to show
first_article_index = rd.choice(df.index)

# Hidden here to avoid excessive amount of output, but basically it was
# about shooting crimes in Virginia which I honestly don't care for,
# so I am going to give it a score of 1
#df['content'][first_article_index]
rating1 = {'vector': df['Vector'][first_article_index],
           'index' : first_article_index,
           'score' : 1}
zichen.append(rating1)

# Next we are going to find the article most different from the first one
# to give to the user for rating, so let's find the article least similar
# to the first article shown, according to our similarity matrix
first_sim = similarity_matrix[first_article_index]
second_article_indexes = np.where(first_sim == first_sim.min())

# We found two, 63 and 147, so let's just randomly choose one
if rd.random() < 0.5:
    second_article_index = 63
else:
    second_article_index = 147

# The random run gave 63, now show the second article
# Hidden to avoid excessive output
df['content'][second_article_index]
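
The cell above picks the second article by hand. One way step 2 of the prototype might be generalized to every round is sketched below: at each round, choose the unrated article whose similarity to its closest already-rated article is lowest. This max-then-min rule is an assumption about how "most different from previous ratings" could be defined, and `pick_most_different` and the toy matrix are hypothetical names and data, not part of the notebook. With a single rated article the rule reduces to taking the minimum of that article's row, as done above.

```python
def pick_most_different(similarity_matrix, rated_indexes):
    """Return the position of the unrated article whose highest similarity
    to any already-rated article is lowest (ties go to the smallest position).
    Requires at least one rated article."""
    rated = set(rated_indexes)
    best_index, best_score = None, float("inf")
    for candidate in range(len(similarity_matrix)):
        if candidate in rated:
            continue
        # Similarity to the closest already-rated article
        closest = max(similarity_matrix[candidate][r] for r in rated)
        if closest < best_score:
            best_index, best_score = candidate, closest
    return best_index

# Toy symmetric similarity matrix for four articles (hypothetical values)
toy_matrix = [
    [1.0, 0.9, 0.1, 0.4],
    [0.9, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.3],
    [0.4, 0.5, 0.3, 1.0],
]

second = pick_most_different(toy_matrix, [0])          # least similar to article 0
third = pick_most_different(toy_matrix, [0, second])   # far from both rated articles
print(second, third)
```

The function indexes the matrix by position, so it assumes the DataFrame uses a default integer index that lines up with the rows of the similarity matrix.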

# Results analysis (#cs156 - Overfitting)

The above results are reasonable and expected. I gave two ratings of 9 to two news articles about the Ukrainian war, and two 8s and one 7 to three other political articles (if you want to check the articles yourself, you can run all of the above code cells; they should output the same data), and the text that the model recommends as most similar is also about the Ukrainian war. It makes sense for the recommender to suggest news about the Ukrainian war after it sees that I am interested in that war and in politics. This is definitely not the ideal version of the recommender, however, since we want it to extrapolate from our preferences and recommend things that we do not yet know we are interested in.

I have thought about a few interesting extensions that might strengthen the current Word2Vec model, and I will be testing one of them in the next iteration of the product. They are:

1. Instead of recommending the articles that are most similar to the learned user preference, we always attempt to recommend articles that are a given distance away from it. In other words, if we think of the endpoint of the user's preference vector as a point in the high-dimensional word-embedding space, we always attempt to recommend articles that lie on the surface of a hypersphere centered on that point. With every news article that we recommend, instead of asking the user to score their satisfaction, we ask whether the article is too convergent with their ideas (in which case we move the center of the hypersphere away from this point) or too divergent from their ideas (in which case we move the center closer to this point), and whether they would like more convergent news (in which case we decrease the radius of the hypersphere) or more divergent news (in which case we increase the radius).

2. The limitation of the above method is that it allows users to create their own "media bubble": stubborn people only get more stubborn. A potential improvement is to not let the user control these settings. Instead, we introduce some randomness into the model's prediction by adding a random vector of a set magnitude to the learned user preference every time we generate a news recommendation. That is, instead of recommending the news articles most similar to the user's preference, we recommend articles that are always a set semantic distance away from it, in a random direction. The obvious limitation of this method is that the choice of the random vector's magnitude becomes very influential in the recommendation process.
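
The two ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions: Euclidean distance is used as the "distance" (the notebook itself works with cosine similarity), plain Python lists stand in for the embedding vectors, and every function name here (`recommend_on_shell`, `move_center`, `perturb`) is hypothetical.

```python
import math
import random

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def recommend_on_shell(preference, article_vectors, radius):
    """Idea 1: position of the article whose distance from `preference`
    is closest to `radius` (i.e., nearest to the hypersphere's surface)."""
    return min(
        range(len(article_vectors)),
        key=lambda i: abs(euclidean(preference, article_vectors[i]) - radius),
    )

def move_center(preference, article, step):
    """Idea 1 feedback: step > 0 moves the center toward the article
    ('too divergent'); step < 0 moves it away ('too convergent')."""
    return [p + step * (a - p) for p, a in zip(preference, article)]

def perturb(preference, magnitude, rng):
    """Idea 2: add a random vector of fixed length `magnitude`."""
    direction = [rng.gauss(0.0, 1.0) for _ in preference]
    norm = math.sqrt(sum(d * d for d in direction))
    return [p + magnitude * d / norm for p, d in zip(preference, direction)]

# Toy 2-D "embedding space" (hypothetical data)
articles = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.0, 5.0]]
preference = [0.0, 0.0]

pick = recommend_on_shell(preference, articles, radius=2.5)
closer = move_center(preference, articles[pick], step=0.5)
noisy = perturb(preference, magnitude=0.7, rng=random.Random(0))
print(pick, closer, euclidean(preference, noisy))
```

For idea 2, the article nearest to the perturbed point could then be recommended, for example by calling `recommend_on_shell(noisy, articles, radius=0.0)`. Changing the radius in response to "more convergent" or "more divergent" feedback is just a matter of passing a smaller or larger `radius`.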

# \#cs110 - Complexity Analysis: 

The function `make_similarity_matrix(iterable)` scales as $\Theta(n^2)$, where $n$ is the length of the input `iterable`. It uses two nested for loops: the outer loop goes over every item in the iterable once, and for each item the inner loop goes over every item again, so that a cosine similarity score is computed for every ordered pair of items. Computing a single cosine similarity score takes time proportional to the vector dimension, which can be costly for large vectors, but since the dimension is fixed by the model and does not grow with $n$, it only multiplies the $n^2$ term by a constant and does not change the asymptotic behavior. Big-$\Theta$ is the appropriate notation here because there is no conditional statement in the loops that would create a distinct best or worst case: the function always performs exactly $n^2$ similarity computations, so the upper and lower bounds coincide.

The running time could be reduced by skipping redundant calculations: the cosine similarity between two vectors does not change when their order is reversed, and the cosine similarity of a non-zero vector with itself is simply 1. This is not done in this iteration of the product since it is just the first draft, but it will likely be implemented in future iterations to minimize the time taken by this operation. Computing only one score per unordered pair and setting the diagonal to 1 reduces the number of similarity computations from $n^2$ to $n(n-1)/2 = n^2/2 - n/2$. That roughly halves the work, although the asymptotic complexity remains $\Theta(n^2)$, because constant factors and lower-order terms do not affect it.
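
A minimal sketch of the improvement discussed above, counting how many cosine similarity computations each version makes. The notebook's actual `make_similarity_matrix` is not reproduced here; both functions below are hypothetical stand-ins written with plain Python lists so the example runs on its own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix_full(vectors):
    """Compute every ordered pair: n * n cosine computations."""
    n = len(vectors)
    calls = 0
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = cosine(vectors[i], vectors[j])
            calls += 1
    return matrix, calls

def similarity_matrix_half(vectors):
    """Compute each unordered pair once and mirror it; the diagonal is
    set to 1 directly. This takes n * (n - 1) / 2 cosine computations."""
    n = len(vectors)
    calls = 0
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = cosine(vectors[i], vectors[j])
            matrix[i][j] = score
            matrix[j][i] = score
            calls += 1
    return matrix, calls

# Toy vectors (hypothetical data)
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
full, full_calls = similarity_matrix_full(vectors)
half, half_calls = similarity_matrix_half(vectors)
print(full_calls, half_calls)
```

For four vectors the full version makes 16 computations and the half version makes 6. Both counts grow quadratically in $n$, so the saving is a constant factor of roughly two.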

# \#cs110-PythonProgramming:

Explicitly shows that the PythonProgram passes verification tests.

The fact that the placeholder list `failure_info` is completely empty shows that the function `get_news_content()` above ran without raising any errors. In some runs the `news_content` value returned by `get_news_content()` is empty, but that is because the attribute under which the data was expected to be stored is incorrect (i.e., `BeautifulSoup.find_all()` returned nothing because it could not find any data stored under that attribute), not because the function is failing to run properly.