# <font color='darkblue'> Recommendations with IBM Watson. Part II</font>

## <font color='darkblue'> Rank and Content Based Recommendations</font>


In this notebook, we investigate rank-based recommendations, which address the cold start problem, and content-based recommendation methods. The dataset is real data from the IBM Watson Studio platform. 

## <font color='darkblue'>Environment Setup</font>

In [83]:
# General libraries and packages

import pandas as pd
import numpy as np
import re
import string
from ast import literal_eval

# Recommendation systems library
import surprise

# NLP packages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Data processing packages
from sklearn.feature_extraction.text import TfidfVectorizer

# Import linear kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Import visualization packages and libraries

import matplotlib.pyplot as plt
%matplotlib inline

# Choose style and color palette
import seaborn as sns
sns.set_style("darkgrid")

colors = sns.color_palette('PuBuGn')

In [3]:
# Use 2 decimal places in output display
pd.set_option("display.precision", 2)

# Don't wrap dataframe across additional lines
pd.set_option("display.expand_frame_repr", False)

# Set the maximum widths of columns
pd.set_option("display.max_colwidth", 90)

# Set max rows displayed in output to 20
pd.set_option("display.max_rows", 20)

## <font color='darkblue'>Upload Preprocessed Data</font>

<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
        <li>There are 4 files to upload.</li>
        <li>Two files are centered on the user, the remaining two are centered on the article.</li>
    </ul>

</div>

In [84]:
# Read in the user-item interaction files
df = pd.read_csv('data/df.csv', index_col=[0])
articles_per_user = pd.read_csv('data/articles_per_user.csv', 
                                index_col=[0], 
                                converters={'viewed_articles': literal_eval})

# Read the articles information files
df_content = pd.read_csv('data/articles_community.csv', index_col=[0])
users_per_article = pd.read_csv('data/users_per_article.csv', 
                                index_col=[0],
                               converters={'users_accessed': literal_eval})

In [85]:
# Check the user_item dataframes
articles_per_user.head()

Unnamed: 0,user_id,viewed_articles,articles_count,unique_articles_count
0,1,"[1430, 1430, 732, 1429, 43, 109, 1232, 310, 1293, 1406, 1406, 329, 585, 310, 1305, 105...",47,36
1,2,"[1314, 1305, 1024, 1176, 1422, 1427]",6,6
2,3,"[1429, 1429, 1330, 213, 1172, 1431, 1429, 1059, 1057, 29, 788, 1172, 868, 12, 1429, 10...",82,40
3,4,"[1338, 1314, 1330, 1330, 1427, 1160, 1162, 1391, 1162, 887, 1420, 1394, 1305, 1314, 13...",45,26
4,5,"[1276, 1351, 1166, 1351, 1351]",5,3


In [86]:
# Check the articles information dataframe
users_per_article.head()

Unnamed: 0,doc_body,doc_description,article_id,views,users_accessed,doc_name
0,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VID...",Detect bad readings in real time using Python and Streaming Analytics.,0,14,"[495, 495, 495, 503, 233, 552, 1347, 1051, 785, 2992, 3216, 3570, 4571, 4836]",detect malfunctioning iot sensors with streaming analytics
1,No Free Hunch Navigation * kaggle.com\r\n\r\n * kaggle.com\r\n\r\nCommunicating data s...,"See the forest, see the trees. Here lies the challenge in both performing and presenti...",1,0,[],Communicating data science: A guide to presenting your work
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Paths\r\n * Courses * Our Courses\r\n * ...,Here’s this week’s news in Data Science and Big Data.,2,58,"[676, 668, 668, 1145, 23, 23, 60, 60, 665, 98, 668, 794, 217, 60, 1401, 46, 1577, 789,...","this week in data science (april 18, 2017)"
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCALE - BOOST THE PERFORMANCE OF YOUR\r\nDI...","Learn how distributed DBs solve the problem of scaling persistent storage, but introdu...",3,0,[],DataLayer Conference: Boost the performance of your distributed database
4,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VID...",This video demonstrates the power of IBM DataScience Experience using a simple New Yor...,4,13,"[2345, 176, 457, 3011, 3207, 3302, 3827, 3986, 4179, 4231, 4239, 4308, 5040]",analyze ny restaurant data using spark in dsx


## <a class="anchor" id="Rank">Rank-Based Recommendations</a>

<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
<li>The dataset does not contain ratings for whether a user liked an article or not.  We only know that a user has interacted with an article.  In these cases, the popularity of an article can be based on how often an article was interacted with.</li>
    </ul>

</div>

### <font color='darkblue'>Get the top n most popular articles</font>

In [11]:
# Function to retrieve the ids and title of the most viewed n articles
def get_top_articles(n):
    
    '''
    Finds the n most popular articles.
    
    INPUT:
        n (int) - specifies how many items should be returned
    OUTPUT:
        article_ids (list) - the ids of the n most popular articles
        titles (list) - the titles of the n most popular articles
    '''
    
    df_top_n = users_per_article.nlargest(n, 'views')
    titles = list(df_top_n.doc_name)
    article_ids = list(df_top_n.article_id)
    return article_ids, titles

In [12]:
# The ids of the 5 most popular articles
get_top_articles(5)[0]

[1429, 1330, 1431, 1427, 1364]

In [13]:
# The titles of the 5 most popular articles
get_top_articles(5)[1]

['use deep learning for image classification',
 'insights from new york car accident reports',
 'visualize car data with brunel',
 'use xgboost, scikit-learn & ibm watson machine learning apis',
 'predicting churn with the spss random tree algorithm']

## <a class="anchor" id="Content-Recs">Content Based Recommendations</a>

## <font color='darkblue'>Baseline Model</font>

<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
<li>In this section we work with the <code>users_per_article</code> dataframe, which contains the content information for each article.</li>
    </ul>

</div>

In [14]:
# Make a copy of the dataframe
content = users_per_article.copy()

# Take a look at the data
content.head(2)

Unnamed: 0,doc_body,doc_description,article_id,views,users_accessed,doc_name
0,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VID...",Detect bad readings in real time using Python and Streaming Analytics.,0,14,"[495, 495, 495, 503, 233, 552, 1347, 1051, 785, 2992, 3216, 3570, 4571, 4836]",detect malfunctioning iot sensors with streaming analytics
1,No Free Hunch Navigation * kaggle.com\r\n\r\n * kaggle.com\r\n\r\nCommunicating data s...,"See the forest, see the trees. Here lies the challenge in both performing and presenti...",1,0,[],Communicating data science: A guide to presenting your work


In [15]:
# Check for missing values
content.isnull().sum()

doc_body           291
doc_description    280
article_id           0
views                0
users_accessed       0
doc_name             0
dtype: int64

In [16]:
# Dataframe that retains article_id and doc_name 
df_titles = content[['article_id', 'doc_name']]

# check the output
df_titles.head(2)

Unnamed: 0,article_id,doc_name
0,0,detect malfunctioning iot sensors with streaming analytics
1,1,Communicating data science: A guide to presenting your work


### <font color='darkblue'>Preprocess text and create TF-IDF matrix</font>

In [21]:
# Build the stopword set and the lemmatizer once, not once per token
STOP_WORDS = set(stopwords.words('english')) - {'not', 'can'}
LEMMATIZER = WordNetLemmatizer()

def tokenize(text):
    
    """
    Pre-processing steps for a document:
        - remove punctuation
        - tokenize
        - lemmatize
        - lowercase
        - remove English stopwords (keeping 'not' and 'can')
        
    INPUT (string) - raw message
    OUTPUT (list)  - clean tokens
    """
    
    # remove punctuation and unusual characters 
    text = re.sub(r"[^a-zA-Z0-9]", " ", text).strip()
    
    # split into words
    words = word_tokenize(text)
    
    # lemmatize - reduce words to their dictionary form
    words = [LEMMATIZER.lemmatize(w) for w in words]
    
    # case normalize and remove leading & trailing empty spaces
    words = [w.lower().strip() for w in words]
    
    # remove stopwords, keep not and can
    clean_words = [w for w in words if w not in STOP_WORDS]
    
    return clean_words

In [22]:
# Create an instance of the TF-IDF vectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize)

# Construct the TF-IDF matrix 
tfidf_matrix = tfidf.fit_transform(df_titles['doc_name'])

In [23]:
# Output the shape of the TF-IDF matrix
tfidf_matrix.shape

(1328, 1946)

### <font color='darkblue'>Compute the cosine similarity scores</font>

In [24]:
# Compute the cosine similarity matrix 
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Check the output
cosine_sim.shape

(1328, 1328)
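
`linear_kernel` returns plain dot products. It coincides with cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The check below is a minimal sketch on a toy corpus; the titles are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy titles, invented for illustration only
toy_titles = [
    "working with notebooks in dsx",
    "using rstudio in dsx",
    "what is machine learning",
]

# Rows of the TF-IDF matrix are L2-normalized by default
toy_tfidf = TfidfVectorizer().fit_transform(toy_titles)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# With unit-length rows the two matrices coincide
same = np.allclose(dot_products, cosines)
print(same)
```

If the vectorizer were created with `norm=None`, the dot products would no longer be bounded by 1 and `cosine_similarity` would be the safer call.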

### <font color='darkblue'>Build the recommender function</font>

In [25]:
def get_article_info(article_id, df_titles):
    '''
    INPUT:
    article_id (int) - unique article identifier
    df_titles (pd.DataFrame) - contains article_id and doc_name
    
    OUTPUT:
    article_id (int) - the article id that was passed in
    article_title (str) - the article name associated with the provided article id
    '''
    
    article_title = df_titles[df_titles['article_id']==article_id]['doc_name'].unique()[0]
    return article_id, article_title

In [26]:
# Print a sample output
print(f'The title of the article with id = 542 is: {get_article_info(542, df_titles)[1]}.')

The title of the article with id = 542 is: getting started with python.


In [48]:
# Function that takes in the article id as input and gives recommendations
def content_recommender(article_id, n, cosine_sim, df_titles):
    '''
    INPUT:
    article_id (int) - unique article identifier
    n (int) - controls how many recommendations are returned; 
              the slice below keeps n + 1 rows
    cosine_sim (np.ndarray) - matrix of cosine similarities
    df_titles (pd.DataFrame) - contains article_id and doc_name
    
    OUTPUT:
    given_article (tuple) - id and title of the input article
    recommended_articles (pd.DataFrame) - ids and titles of the recommended articles, 
                                          sorted by cosine similarity
    '''
    # Get the information for the given article id
    given_article = get_article_info(article_id, df_titles)
    
    # Obtain the matrix index that matches the article id
    # (assumes df_titles has a default 0..N-1 index, so labels match matrix rows)
    matrix_index = df_titles[df_titles['article_id']==article_id].index.values[0]
    
    # Sort the cosine similarity scores for the given article and drop the first entry,
    # which is assumed to be the article itself
    sim_scores = pd.Series(cosine_sim[matrix_index]).sort_values(ascending=False).iloc[1:]
    
    # Get the indices corresponding to the scores of the most similar articles (n + 1 rows)
    sim_scores_n = list(sim_scores[:n+1].index.values)
    
    # Return the most similar articles as a pandas dataframe
    recommended_articles = df_titles.iloc[sim_scores_n]
    
    return given_article, recommended_articles
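
Dropping the first sorted entry relies on the query article ranking first. When two articles have the same processed title (ids 800 and 1035 in the output further down differ only by a trailing period), another article can tie at similarity 1. The sketch below shows one way to exclude the query by position instead. The helper name `top_n_similar` and the toy matrix are hypothetical and not part of the notebook.

```python
import numpy as np

def top_n_similar(sim_matrix, row, n):
    """Return positions of the n rows most similar to `row`, excluding `row` itself."""
    scores = np.asarray(sim_matrix[row], dtype=float).copy()
    scores[row] = -np.inf                        # mask the query article explicitly
    order = np.argsort(-scores, kind="stable")   # descending; ties keep original order
    return order[:n].tolist()

# Toy similarity matrix (invented): rows 0 and 1 are exact duplicates
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.0],
    [1.0, 1.0, 0.2, 0.0],
    [0.2, 0.2, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

print(top_n_similar(toy_sim, row=1, n=2))  # [0, 2]
```

This version also returns exactly `n` positions, which makes the meaning of `n` easier to reason about.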

In [63]:
# Choose the id for the test article
article_id = 20

# Create the list of recommendations
recommendations_list_20 = content_recommender(article_id, 9, cosine_sim, df_titles)

# Print the information for the test article
print(f'The article with id = {article_id} for which we give recommendations is:\n{recommendations_list_20[0][1]}')

# Print the recommended articles information
print(f'\nThe recommended articles are:')
recommendations_list_20[1]

The article with id = 20 for which we give recommendations is:
working interactively with rstudio and notebooks in dsx

The recommended articles are:


Unnamed: 0,article_id,doc_name
373,373,working with notebooks in dsx
763,763,load data into rstudio for analysis in dsx
182,182,Overview of RStudio IDE in DSX
355,355,run shiny applications in rstudio in dsx
626,626,analyze db2 warehouse on cloud data in rstudio in dsx
665,665,get social with your notebooks in dsx
958,958,using dsx notebooks to analyze github data
474,474,publish notebooks to github in dsx
930,930,how to use version control (github) in rstudio within dsx?
821,821,using rstudio in ibm data science experience


In [70]:
# Choose the id for the test article
article_id = 500

# Create the list of recommendations
recommendations_list_500 = content_recommender(article_id, 9, cosine_sim, df_titles)

# Print the information for the test article
print(f'The article with id = {article_id} for which we give recommendations is:\n{recommendations_list_500[0][1]}')

# Print the recommended articles information
print(f'\nThe recommended articles are:')
recommendations_list_500[1]

The article with id = 500 for which we give recommendations is:
the difference between ai, machine learning, and deep learning?

The recommended articles are:


Unnamed: 0,article_id,doc_name
313,313,what is machine learning?
260,260,the machine learning database
762,762,From Machine Learning to Learning Machine (Dinesh Nirmal)
237,237,deep learning with data science experience
1035,1035,machine learning for the enterprise.
800,800,machine learning for the enterprise
336,336,challenges in deep learning
337,337,generalization in deep learning
1004,1004,how to get a job in deep learning
809,809,use the machine learning library


<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
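<li>Both lists look sensible: the recommendations for article 20 all concern RStudio or notebooks in DSX, and those for article 500 all concern machine learning or deep learning.</li>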
    </ul>

</div>

### <font color='darkblue'>Make content recommendations</font>

In [121]:
def make_content_recs(user_id, cosine_sim, m=20, df=articles_per_user, df_content=users_per_article):
    '''
    INPUT:
        user_id (int) - unique numeric user identifier
        cosine_sim (np.ndarray) - matrix of cosine similarities
        m (int) - the number of recommendations we want for the user
        df (pd.DataFrame) - contains users and articles interactions
        df_content (pd.DataFrame) - contains titles and article_ids (currently unused;
                                    titles are read from the global df_titles)
    
    OUTPUT:
        complete_recs (list) - (article_id, title) tuples for the recommended articles
    
    Description:
        Loops through the articles seen by the user. For each one, finds the most
        similar articles with content_recommender, drops those already seen or 
        already selected, and stops once m recommendations are found.
    
    Notes:
        np.setdiff1d returns sorted unique values, so within each batch the
        candidates are added in ascending article id order.
   
    '''
    # List of recommended articles by id
    recommendations = []
    
    # Ids of articles seen by user (.loc looks the user up by index label)
    articles_ids_seen = df.loc[user_id].viewed_articles
    
    # value of n passed to content_recommender for each seen article
    n = 10
    
    for art_id in articles_ids_seen:
        
        # get the ids of the most similar articles
        similar_articles_ids = content_recommender(art_id, n, cosine_sim, df_titles)[1]['article_id'].tolist()

        # remove the ids of the articles the user has already seen
        articles_ids_to_recommend = np.setdiff1d(similar_articles_ids, articles_ids_seen)
        
        # remove the articles already added to the recs list
        articles_ids_to_add = np.setdiff1d(articles_ids_to_recommend, recommendations)
        
        # add the sorted article ids
        recommendations.extend(articles_ids_to_add)
        
        # break when we have enough articles to recommend
        if len(recommendations) >= m:
            break
    
    # retain the first m recommendations
    recs = recommendations[:m]
    
    # get the articles titles
    complete_recs = [get_article_info(article_id, df_titles) for article_id in recs]
    
    return complete_recs


In [120]:
# Take a look at the unique articles seen by user_id=40
articles_ids_seen_40 = articles_per_user.loc[40].viewed_articles
articles_seen_40 = [get_article_info(article_id, df_titles) for article_id in articles_ids_seen_40]
print(f'The articles accessed by user_id=40 are:')
set(articles_seen_40)

The articles accessed by user_id=40 are:


{(151, 'jupyter notebook tutorial'),
 (162, 'an introduction to stock market data analysis with r (part 1)'),
 (486, 'use spark r to load and analyze data'),
 (542, 'getting started with python'),
 (645, 'how to perform a logistic regression in r'),
 (692, '15 page tutorial for r'),
 (1198, 'country statistics: commercial bank prime lending rate'),
 (1304, 'gosales transactions for logistic regression model'),
 (1368, 'putting a human face on machine learning'),
 (1430,
  'using pixiedust for fast, flexible, and easier data analysis and experimentation'),
 (1436, 'welcome to pixiedust')}

In [122]:
# Take a look at the list of recommended articles
print(f'The articles recommended to user_id=40 are:')
make_content_recs(40, cosine_sim, 20, articles_per_user, users_per_article)

The articles recommended to user_id=40 are:


[(16, 'higher-order logistic regression for large datasets'),
 (82, 'build a logistic regression model with wml & dsx'),
 (609, 'simple linear regression? do it the bayesian way'),
 (751, 'build a predictive analytic model'),
 (1047, 'a comparison of logistic regression and naive bayes '),
 (1051, 'a tensorflow regression model to predict house values'),
 (1274, 'data model with streaming analytics and python'),
 (1276, 'deploy your python model as a restful api'),
 (1305, 'gosales transactions for naive bayes model'),
 (1350, 'model a golomb ruler'),
 (21,
  'Mapping for Data Science with PixieDust and Mapbox – IBM Watson Data Lab – Medium'),
 (108,
  '520    using notebooks with pixiedust for fast, flexi...\nName: title, dtype: object'),
 (110, 'pixiedust: magic for your python notebook'),
 (522, 'share the (pixiedust) magic – ibm watson data lab – medium'),
 (617, 'pixiedust gets its first community-driven feature in 1.0.4'),
 (681,
  'real-time sentiment analysis of twitter hashtag

In [123]:
# Take a look at the unique articles seen by user_id=178
articles_ids_seen_178 = articles_per_user.loc[178].viewed_articles
articles_seen_178 = [get_article_info(article_id, df_titles) for article_id in articles_ids_seen_178]
print(f'The articles accessed by user_id=178 are:')
set(articles_seen_178)

The articles accessed by user_id=178 are:


{(173, '10 must attend data science, ml and ai conferences in 2018')}

In [124]:
# Take a look at the list of recommended articles
print(f'The articles recommended to user_id=178 are:')
make_content_recs(178, cosine_sim, 20, articles_per_user, users_per_article)

The articles recommended to user_id=178 are:


[(508, 'data science in the cloud'),
 (528, '10 tips on using jupyter notebook'),
 (661, '21 Must-Know Data Science Interview Questions and Answers'),
 (679, 'this week in data science'),
 (715,
  "for ai to get creative, it must learn the rules--then how to break 'em"),
 (784, '10 data science, machine learning and ai podcasts you must listen to'),
 (878, '10 data science podcasts you need to be listening to right now'),
 (967, 'ml algorithm != learning machine'),
 (986, 'r for data science'),
 (990, 'this week in data science (january 10, 2017)'),
 (1177, 'cifar-10 - python version')]

`2.` _Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works.  Do you see any possible improvements that could be made to your function?  Is there anything novel about your content based recommender?_

**The content-based recommendations are built from the article titles (`doc_name`); the same pipeline could be applied to the article descriptions. Each title is one document in the corpus. The text is stripped of punctuation, split into tokens, lemmatized, lowercased, and filtered for stop words. The processed corpus is fed into a TF-IDF vectorizer, which produces a 1328x1946 matrix of scores with one row per article. The cosine similarity between every pair of rows of the TF-IDF matrix is then computed and saved in a 1328x1328 matrix.**

**Given a user id, and assuming that the user has seen at least one article, the engine retrieves the most similar articles for each article seen by the user. Articles the user has already seen, or that are already in the list, are dropped, and the loop stops once m recommendations are collected. The current implementation does not rank the candidates by popularity; sorting the pool by number of views before keeping the top m would be a natural refinement.**

**One way to improve these results is to use more expressive NLP techniques, such as word embeddings. Another option would be to create metadata for the articles based on their descriptions and full text, both available in the `df_content` dataframe.**
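
Ordering a pool of candidate articles by popularity could look like the sketch below. It assumes a dataframe with `article_id` and `views` columns, as in `users_per_article`; the helper name `rank_by_popularity` and the toy values are hypothetical.

```python
import pandas as pd

def rank_by_popularity(candidate_ids, views_df, m):
    """Order candidate article ids by view count (descending) and keep the first m."""
    candidates = views_df[views_df["article_id"].isin(candidate_ids)]
    ranked = candidates.sort_values("views", ascending=False, kind="stable")
    return ranked["article_id"].head(m).tolist()

# Toy stand-in for users_per_article (values invented for illustration)
toy_views = pd.DataFrame({
    "article_id": [10, 11, 12, 13],
    "views": [5, 40, 0, 17],
})

print(rank_by_popularity([10, 12, 13], toy_views, m=2))  # [13, 10]
```

Article 11 has the most views but is not a candidate, so it is never returned; popularity only reorders articles that content similarity has already selected.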

`3.` _Use your content-recommendation system to make recommendations for the below scenarios based on the comments.  Again no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations._

In [None]:
# make recommendations for a brand new user - recommend the most popular articles
new_user_recs = get_top_articles(10)[0]
new_user_recommendations = [get_article_info(art_id, df_titles) for art_id in new_user_recs]
new_user_recommendations

In [None]:
# make recommendations for a user who only has interacted with article id '1427.0'
content_recommender(1427, 10, cosine_sim, df_titles)

### <a class="anchor" id="Matrix-Fact">Part V: Matrix Factorization</a>

In this part of the notebook, we use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

`1.` _Upload the user_item matrix from part 1._

In [None]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')

In [None]:
# quick look at the matrix
user_item_matrix.head()

`2.` _In this situation, you can use Singular Value Decomposition from [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html) on the user-item matrix.  Use the cell to perform SVD, and explain why this is different than in the lesson._

In [None]:
# find the shape of the user_item matrix
user_item_matrix.shape

In [None]:
# Perform SVD on the User-Item Matrix Here

u, s, vt = np.linalg.svd(user_item_matrix)  # use the built-in SVD to get the three matrices

In [None]:
# get info on the output matrices
u.shape, s.shape, vt.shape

**The user_movie_subset matrix from the lesson contains movie ratings as entries: entry (i, j) is the rating given by user i to movie j. This leaves many missing entries for users who did not watch or rate a given movie. Since SVD does not work with missing values, it has to be replaced by an alternative approach, such as FunkSVD.**

**In the case of the IBM recommendations, the user_item_matrix is a sparse array of binary entries that records whether or not a user interacted with an article. Since this matrix has no missing entries, SVD can be applied directly.**
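
To make the decomposition concrete, here is a minimal sketch on a toy binary matrix (invented for illustration). It uses `full_matrices=False`, which returns the compact factors, whereas the cell above uses the default `full_matrices=True`.

```python
import numpy as np

# Toy binary user-item matrix (4 users x 3 articles), invented for illustration
toy_user_item = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

# Compact SVD: U is 4x3, s holds 3 singular values, Vt is 3x3
u_toy, s_toy, vt_toy = np.linalg.svd(toy_user_item, full_matrices=False)

# Keeping every singular value reproduces the matrix up to floating-point error
full_rebuild = u_toy @ np.diag(s_toy) @ vt_toy

# Keeping only k latent features gives a rank-k approximation
k = 2
approx = u_toy[:, :k] @ np.diag(s_toy[:k]) @ vt_toy[:k, :]

print(u_toy.shape, s_toy.shape, vt_toy.shape)
```

The singular values in `s_toy` come back sorted in descending order, which is why truncating to the first `k` entries keeps the strongest latent features.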

`3.` _Now for the tricky part, how do we choose the number of latent features to use?  Running the below cell, you can see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix.  Run the cell below to get an idea of how the accuracy improves as we increase the number of latent features._

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

`4.` From the above, we can't really be sure how many features to use, because predicting the 1's and 0's of the matrix well doesn't tell us whether we are able to make good recommendations.  Instead, we might split our dataset into a training and test set of data, as shown in the cell below.  

Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below: 

* How many users can we make predictions for in the test set?  
* How many users are we not able to make predictions for because of the cold start problem?
* How many articles can we make predictions for in the test set?  
* How many articles are we not able to make predictions for because of the cold start problem?

In [None]:
# recall the df size
df.shape

In [None]:
df_train = df.head(40000)
df_test = df.tail(5993)


def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''
    # create the two user_item matrices
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)
    
    # extract the test user ids
    test_idx = list(user_item_test.index)
    # extract the test article_ids
    test_arts = user_item_test.columns
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)
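
`create_user_item_matrix` comes from Part I and is not defined in this notebook. The sketch below shows what such a helper might look like, assuming the interactions dataframe has `user_id` and `article_id` columns. The function name with the `_sketch` suffix and the toy data are hypothetical, and the actual Part I helper may differ.

```python
import pandas as pd

def create_user_item_matrix_sketch(interactions):
    """Build a binary user-item matrix: 1 if the user interacted with the article, else 0.

    Assumes `interactions` has `user_id` and `article_id` columns.
    """
    # Count interactions per (user, article) pair, then reduce the counts to 0/1
    counts = pd.crosstab(interactions["user_id"], interactions["article_id"])
    return (counts > 0).astype(int)

# Toy interactions (invented): user 1 viewed article 10 twice
toy_interactions = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 3, 3],
    "article_id": [10, 10, 12, 11, 10, 12],
})

toy_matrix = create_user_item_matrix_sketch(toy_interactions)
print(toy_matrix)
```

Repeated views collapse to a single 1, so the matrix has no missing entries and only records whether an interaction happened.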

In [None]:
# determine the shapes of the two matrices
user_item_train.shape, user_item_test.shape

In [None]:
# the user ids in test set, the article ids in test set
len(test_idx), len(test_arts)

In [None]:
# common user ids in the train and the test sets, 
common_users = len(set(user_item_test.index) & set(user_item_train.index))

print('There are {} common user ids in the train and test sets.'.format(common_users))

In [None]:
# the number of user ids in the test set
n_test_users = len(user_item_test.index)
# the number of user ids in the test set that are not in the train set
n_cold_users = n_test_users - common_users

print('There are {} users in the test set that are not in the train set.'.format(n_cold_users))

In [None]:
# the number of all article ids in the train set
len(user_item_train.columns)

In [None]:
# the number of all article ids in the test set
len(user_item_test.columns)

In [None]:
# common article ids in the train and the test sets
common_articles = len(set(user_item_test.columns) & set(user_item_train.columns))

print('There are {} common article ids in the train and test sets.'.format(common_articles))

In [None]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 


sol_4_dict = {
    'How many users can we make predictions for in the test set?': c, 
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': a, 
    'How many movies can we make predictions for in the test set?': b,
    'How many movies in the test set are we not able to make predictions for because of the cold start problem?': d
}

t.sol_4_test(sol_4_dict)

`5.` Now use the **user_item_train** dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the **user_item_test** dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions `2` - `4`.

Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.  

In [None]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train) 

In [None]:
# the shapes of the three matrices 
u_train.shape, s_train.shape, vt_train.shape

In [None]:
# the users we can make predictions for
common_user_ids = list(set(user_item_test.index) & set(user_item_train.index))
# the articles we can use to make predictions
test_arts = user_item_test.columns

print(common_user_ids)

In [None]:
# reduce the user_item_test to include only the common user_ids
user_item_test_red = user_item_test.loc[common_user_ids, test_arts]
# check the shape of the reduced matrix
user_item_test_red.shape

In [None]:
# reset and relabel indices to identify the common user ids
user_item_train_adj = user_item_train.reset_index()
# relabel indices to avoid out of range error
users_common_idx = user_item_train_adj[user_item_train_adj['user_id'].isin(common_user_ids)].index.to_list()
print(users_common_idx)

In [None]:
# reduce vt to the articles in test set
vt_test = vt_train[:, user_item_train.columns.isin(test_arts)]

# reduce u to the common user ids - note: need to work with the updated indices
u_test = u_train[users_common_idx, :]

# check the shapes of the reduced matrices
u_test.shape, vt_test.shape

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs_train = []
sum_errs_test = []

for k in num_latent_feats:
    # restructure with k latent features for the train set matrices
    s_new_train, u_new_train, vt_new_train = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    # restructure with k latent features for the test set matrices        
    s_new_test, u_new_test, vt_new_test = np.diag(s_train[:k]), u_test[:, :k], vt_test[:k, :]
    
    # take dot products for each group
    user_item_est_train = np.around(np.dot(np.dot(u_new_train, s_new_train), vt_new_train))
    user_item_est_test = np.around(np.dot(np.dot(u_new_test, s_new_test), vt_new_test))
    
    # compute error for each prediction to actual value
    diffs_train = np.subtract(user_item_train, user_item_est_train)
    diffs_test = np.subtract(user_item_test_red, user_item_est_test)
    
    # total train errors and keep track of them
    err_train = np.sum(np.sum(np.abs(diffs_train)))
    sum_errs_train.append(err_train)
    
    # total test errors and keep track of them
    err_test = np.sum(np.sum(np.abs(diffs_test)))
    sum_errs_test.append(err_test)

In [None]:
# check the matrix shapes for k=10
s_new_test, u_new_test, vt_new_test = np.diag(s_train[:10]), u_test[:, :10], vt_test[:10, :]
s_new_test.shape, u_new_test.shape, vt_new_test.shape

In [None]:
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax1.plot(num_latent_feats, 1 - np.array(sum_errs_test)/df.shape[0], color='b', label="Test accuracy");
ax2.plot(num_latent_feats, 1 - np.array(sum_errs_train)/df.shape[0], color = 'g', label="Train accuracy");

# get handlers and labels for the legend
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()

# create a single legend for both accuracy curves
ax1.legend(h1+h2, l1+l2, loc='center right')

ax1.set_xlabel('Number of Latent Features');
ax2.set_ylabel('Train Accuracy');
ax1.set_ylabel('Test Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

`6.` Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine if the recommendations you make with any of the above recommendation systems are an improvement to how users currently find articles? 

**The test accuracy decreases while the train accuracy increases with the number of latent features. The two accuracy curves intersect at 100 latent features. The shapes of the two curves indicate overfitting: the more latent features we use, the more closely the model fits the training matrix and the worse it generalizes. This is not surprising, as predictions can be made for only 20 users in the test set, so the test estimate rests on very little data.**


**The following steps can be taken to improve the recommendation engine:**

* **Design an A/B test to compare the warm users (those who receive content and collaborative filtering recommendations) with the cold users (the new users who are recommended the most popular articles only).** 

* **Given how imbalanced the data is, performance measures such as the F1 score and ROC AUC would give somewhat better estimates of the model's performance than accuracy alone. We could also take into consideration the number of unique articles a user interacts with.** 

* **Use different algorithms (such as the multi-armed bandit) for the cold user problem.**
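
The suggestion above to report F1 rather than raw accuracy can be sketched as follows. This is a minimal illustration, not the notebook's evaluation code: `binary_prf` is a hypothetical helper, the toy matrices are made up, and it assumes both matrices are binary (any nonzero prediction counts as a predicted interaction).

```python
import numpy as np

def binary_prf(actual, predicted):
    """Precision, recall and F1 for two binary interaction matrices of equal shape.

    Hypothetical helper: nonzero entries are treated as interactions.
    """
    actual = np.asarray(actual).astype(bool)
    predicted = np.asarray(predicted).astype(bool)

    tp = np.sum(actual & predicted)    # predicted and real interactions
    fp = np.sum(~actual & predicted)   # predicted but not real
    fn = np.sum(actual & ~predicted)   # real but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy, made-up matrices: mostly zeros, as in a sparse user-item matrix
toy_actual = np.array([[1, 0, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 0]])
toy_pred = np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 0, 0]])

precision, recall, f1 = binary_prf(toy_actual, toy_pred)
accuracy = np.mean(toy_actual == toy_pred)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

On this toy example the accuracy is high because most cells are zero in both matrices, while precision, recall and F1 reflect that only one of the two real interactions was recovered.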

**However, the effectiveness of any method is still limited by the data imbalance. The best way to improve the results is to increase the size of the test set by collecting data from more users, after which we could implement the suggestions mentioned above.**
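
As one illustration of the multi-armed bandit suggestion for cold users, here is a minimal epsilon-greedy sketch. Everything in it is an assumption made for illustration: the function name `epsilon_greedy`, the click probabilities, and the simulated clicks are not taken from the IBM Watson data. In practice the reward would come from real user interactions.

```python
import numpy as np

def epsilon_greedy(click_probs, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over a set of candidate articles.

    click_probs: hypothetical true click probability of each article (arm).
    Returns how often each article was shown and its estimated click rate.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(click_probs)
    pulls = np.zeros(n_arms, dtype=int)   # times each article was shown
    estimates = np.zeros(n_arms)          # running mean reward per article

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: random article
        else:
            arm = int(np.argmax(estimates))   # exploit: best estimate so far

        # Simulated click (1.0) or no click (0.0)
        reward = float(rng.random() < click_probs[arm])

        pulls[arm] += 1
        # Incremental update of the running mean
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]

    return pulls, estimates

# Three made-up articles with different click probabilities
pulls, estimates = epsilon_greedy([0.02, 0.05, 0.10])
print("times shown:", pulls)
print("estimated click rates:", np.round(estimates, 3))
```

Unlike always recommending the most popular articles, this keeps a small share of traffic for exploration, so a new article can still be discovered by cold users.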

<a id='conclusions'></a>
### Extras
Using your workbook, you could now save your recommendations for each user, develop a class to make new predictions and update your results, and make a Flask app to deploy your results. These tasks are beyond what is required for this project. However, from what you learned in the lessons, you are certainly capable of taking these tasks on to improve upon your work here!


## Conclusion

> Congratulations!  You have reached the end of the Recommendations with IBM project! 

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the [rubric](https://review.udacity.com/#!/rubrics/2322/view). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.


## Directions to Submit

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations! 

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])

In [96]:
# Create a copy of article information data
metadata = users_per_article.copy()

# Replace NaN with the string 'none'
metadata.fillna('none', inplace=True)

# Create a new column that combines doc_name and doc_description 
metadata['doc_text'] = metadata['doc_description'] +  ' , ' + metadata['doc_name']

# Import the package used to generate the sentence embeddings
from sentence_transformers import SentenceTransformer

# Transform the documents into 768-dim real vectors
model = SentenceTransformer('all-mpnet-base-v2')
metadata['vectors'] = metadata['doc_name'].apply(lambda x: model.encode(x))


# Calculate Within Clusters SS for a range of k values
# Adapted from: https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means

from sklearn.cluster import KMeans

# Function returns the WSS score for each k from 4 to kmax

def calculate_WSS(points, kmax):
    
    sse = []
    
    for k in range(4, kmax+1):
        kmeans = KMeans(n_clusters = k).fit(points)
        centroids = kmeans.cluster_centers_
        pred_clusters = kmeans.predict(points)
        
        # sum the squared Euclidean distance of each point from its cluster center,
        # over all embedding dimensions (not only the first two)
        curr_sse = np.sum((points - centroids[pred_clusters]) ** 2)
        sse.append(curr_sse)
    return sse

# Create a list of vectors for Kmeans model
X = np.array(metadata['vectors'].tolist())

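# Disable tokenizers parallelism through the environment variable
# (assigning a plain Python variable has no effect)
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"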
wsse = calculate_WSS(X, 10)

wss_results = pd.DataFrame({'k':list(range(4,11)), 'wss' : wsse})

ax1 = wss_results.plot.scatter(x='k',
                      y='wss',
                      c='DarkBlue')

plt.plot(wss_results.k, wss_results.wss)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Method')
plt.show()


In [34]:
## Import libraries
from nltk.cluster import KMeansClusterer
import nltk

def clustering_question(metadata,NUM_CLUSTERS):

    sentences = metadata['doc_name']

    X = np.array(metadata['vectors'].tolist())

    kclusterer = KMeansClusterer(
        NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance,
        repeats=25,avoid_empty_clusters=True)

    assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

    metadata['cluster'] = pd.Series(assigned_clusters, index=metadata.index)
    metadata['centroid'] = metadata['cluster'].apply(lambda x: kclusterer.means()[x])

    return metadata, assigned_clusters

from sklearn.metrics import silhouette_score

sil_avg = []
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]

for k in range_n_clusters:
    kmeans = KMeans(n_clusters=k).fit(X)
    labels = kmeans.labels_
    sil_avg.append(silhouette_score(X, labels, metric='euclidean'))


plt.plot(range_n_clusters,sil_avg,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
plt.show()

# Based on the silhouette analysis above, candidate values for k are 3, 6 and 8:
# the average silhouette score is higher there, which indicates that points sit
# closer to their own cluster than to neighboring clusters.

# Reduce the dimensionality of the embeddings to 5 while keeping the size of the local neighborhood at 15
import umap

umap_embeddings = umap.UMAP(n_neighbors=15,
                           n_components=5,
                           metric='cosine').fit_transform(X)

# Project the embeddings to 2D for plotting
standard_embedding = umap.UMAP(random_state=42).fit_transform(X)

# Points are left uncolored; pass cluster labels via c= (with cmap='Spectral') to color by cluster
plt.scatter(standard_embedding[:, 0], standard_embedding[:, 1], s=5);

