# <font color='darkblue'> Recommendations with IBM Watson</font>

Several recommendation methods are investigated on real data from the IBM Watson Studio platform. 

## <font color='darkblue'>Table of Contents</font>

1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
2. [Rank Based Recommendations](#Rank)<br>
3. [User-User Based Collaborative Filtering](#User-User)<br>
4. [Content Based Recommendations](#Content-Recs)<br>
5. [Matrix Factorization](#Matrix-Fact)<br>
6. [Extras & Concluding](#conclusions)


## <font color='darkblue'>Environment Setup</font>

In [1]:
# General libraries and packages

import pandas as pd
import numpy as np
import pickle

# Recommendation systems library
import surprise

# Packages and libraries for content based recs
import re

# NLP packages
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

# Data processing packages
from sklearn.feature_extraction.text import TfidfVectorizer

# Import linear kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# Import visualization packages and libraries

import matplotlib.pyplot as plt
%matplotlib inline

# Choose style and color palette
import seaborn as sns
sns.set_style("darkgrid")

colors = sns.color_palette('PuBuGn')

In [3]:
# Use 2 decimal places in output display
pd.set_option("display.precision", 2)

# Don't wrap dataframe across additional lines
pd.set_option("display.expand_frame_repr", False)

# Set the maximum widths of columns
pd.set_option("display.max_colwidth", 90)

# Set max rows displayed in output to 20
pd.set_option("display.max_rows", 20)

## <font color='darkblue'>Upload Preprocessed Data</font>

<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
        <li>There are 4 files to upload.</li>
        <li>Two files are centered on the user, the remaining two are centered on the article.</li>
    </ul>

</div>

In [4]:
# Read in the user-item interaction files
df = pd.read_csv('data/df.csv', index_col=[0])
articles_per_user = pd.read_csv('data/articles_per_user.csv', index_col=[0])

# Read the articles information files
df_content = pd.read_csv('data/articles_community.csv', index_col=[0])
users_per_article = pd.read_csv('data/users_per_article.csv', index_col=[0])

In [5]:
# Check the user_item dataframes
articles_per_user.head()

Unnamed: 0,user_id,article_id,articles_count,unique_articles_count
0,1,"[1430, 1430, 732, 1429, 43, 109, 1232, 310, 1293, 1406, 1406, 329, 585, 310, 1305, 105...",47,36
1,2,"[1314, 1305, 1024, 1176, 1422, 1427]",6,6
2,3,"[1429, 1429, 1330, 213, 1172, 1431, 1429, 1059, 1057, 29, 788, 1172, 868, 12, 1429, 10...",82,40
3,4,"[1338, 1314, 1330, 1330, 1427, 1160, 1162, 1391, 1162, 887, 1420, 1394, 1305, 1314, 13...",45,26
4,5,"[1276, 1351, 1166, 1351, 1351]",5,3


In [6]:
# Check the articles information dataframe
users_per_article.head()

Unnamed: 0,doc_body,doc_description,article_id,views,users_accessed,doc_name
0,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VID...",Detect bad readings in real time using Python and Streaming Analytics.,0,14,"[495, 495, 495, 503, 233, 552, 1347, 1051, 785, 2992, 3216, 3570, 4571, 4836]",detect malfunctioning iot sensors with streaming analytics
1,No Free Hunch Navigation * kaggle.com\r\n\r\n * kaggle.com\r\n\r\nCommunicating data s...,"See the forest, see the trees. Here lies the challenge in both performing and presenti...",1,0,[],Communicating data science: A guide to presenting your work
2,☰ * Login\r\n * Sign Up\r\n\r\n * Learning Paths\r\n * Courses * Our Courses\r\n * ...,Here’s this week’s news in Data Science and Big Data.,2,58,"[676, 668, 668, 1145, 23, 23, 60, 60, 665, 98, 668, 794, 217, 60, 1401, 46, 1577, 789,...","this week in data science (april 18, 2017)"
3,"DATALAYER: HIGH THROUGHPUT, LOW LATENCY AT SCALE - BOOST THE PERFORMANCE OF YOUR\r\nDI...","Learn how distributed DBs solve the problem of scaling persistent storage, but introdu...",3,0,[],DataLayer Conference: Boost the performance of your distributed database
4,"Skip navigation Sign in SearchLoading...\r\n\r\nClose Yeah, keep it Undo CloseTHIS VID...",This video demonstrates the power of IBM DataScience Experience using a simple New Yor...,4,13,"[2345, 176, 457, 3011, 3207, 3302, 3827, 3986, 4179, 4231, 4239, 4308, 5040]",analyze ny restaurant data using spark in dsx


## <a class="anchor" id="Rank">Utility Functions</a>

## <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

<div class="alert alert-block alert-info">

<b>NOTES</b>:
    <ul>
<li>The dataset does not contain ratings for whether a user liked an article or not.  We only know that a user has interacted with an article.  In these cases, the popularity of an article can be based on how often an article was interacted with.</li>
    </ul>

</div>
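As a minimal sketch of this idea, popularity can be computed by counting how often each article appears in the interaction log. The `interactions` list is made-up toy data rather than the Watson dataset, and `top_n_articles` is a hypothetical helper used only for illustration.

```python
from collections import Counter

# Toy (user_id, article_id) interaction log; values are illustrative only
interactions = [(1, 1429), (2, 1429), (3, 1330), (1, 1330), (4, 1429), (2, 1431)]

def top_n_articles(interactions, n):
    """Return the n article ids with the most interactions (hypothetical helper)."""
    counts = Counter(article_id for _, article_id in interactions)
    return [article_id for article_id, _ in counts.most_common(n)]

top_two = top_n_articles(interactions, 2)
print(top_two)
```

Repeated views by the same user count towards popularity here, which matches ranking by the number of interactions rather than by the number of distinct readers.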

### <font color='darkblue'>Get the top n most popular articles</font>

In [19]:
# Function to retrieve the ids and title of the most viewed n articles
def get_top_articles(n):
    
    '''
    Finds the n most popular articles.
    
    INPUT:
        n (int) - specifies how many items should be returned
    OUTPUT:
        article_ids (list) - the ids of the n most popular articles
        titles (list) - the titles of the n most popular articles
    '''
    
    df_top_n = users_per_article.nlargest(n, 'views')
    titles = list(df_top_n.doc_name)
    article_ids = list(df_top_n.article_id)
    return article_ids, titles

In [20]:
# The ids of the 5 most popular articles
get_top_articles(5)[0]

[1429, 1330, 1431, 1427, 1364]

In [21]:
# The titles of the 5 most popular articles
get_top_articles(5)[1]

['use deep learning for image classification',
 'insights from new york car accident reports',
 'visualize car data with brunel',
 'use xgboost, scikit-learn & ibm watson machine learning apis',
 'predicting churn with the spss random tree algorithm']

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>


`1.` Reformat the `df` dataframe to be shaped with users as the rows and articles as the columns.  

* Each `user` appears in each row once.

* Each `article` shows up in only one `column`.  

* **If a user has interacted with an article, a 1 is placed where that user's row meets that article's column**.  It does not matter how many times the user has interacted with the article; every such entry is **1**.  

* **If a user has not interacted with an article, a 0 is placed where that user's row meets that article's column**. 

The tests are used to make sure the basic structure of the matrix matches what is expected by the solution.
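The rules above can be sketched in plain Python, independent of pandas. The interaction pairs below are toy values chosen for illustration, and the dictionary-of-rows layout is an assumption made to keep the example self-contained.

```python
# Toy (user_id, article_id) interactions; the repeated view is intentional
interactions = [(1, 10), (1, 10), (1, 20), (2, 20), (3, 30)]

users = sorted({u for u, _ in interactions})
articles = sorted({a for _, a in interactions})
seen = set(interactions)  # the set collapses repeated views into one entry

# One row per user, one column per article: 1 if the pair was observed, else 0
matrix = {u: [1 if (u, a) in seen else 0 for a in articles] for u in users}

for user, row in matrix.items():
    print(user, row)
```

User 1 viewed article 10 twice, yet the corresponding entry is still 1, which is the behavior the matrix is expected to have.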

In [None]:
# Create the user-article matrix with 1's and 0's

def create_user_item_matrix(df):
    '''
    INPUT:
    df (pandas dataframe) - article_id, title, user_id are the columns
    
    OUTPUT:
    user_item (pandas dataframe) - user item matrix 
    
    Description:
    Return a matrix with user ids as rows and article ids on the columns,
    with 1 values where a user interacted with an article and a 0 otherwise.
    '''
    
    # Create new column that keeps track of user_article interaction
    df['interact'] = 1
    # Transform df so that every user is on a row and every article corresponds to a column
    user_item =  df.groupby(['user_id', 'article_id'])['interact'].first().unstack()
    # Fill in NaN with 0 in the user_item matrix
    user_item.fillna(0, inplace = True)
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)

In [None]:
# Tests: You should just need to run this cell.  Don't change the code.
assert user_item.shape[0] == 5149, "Oops!  The number of users in the user-article matrix doesn't look right."
assert user_item.shape[1] == 714, "Oops!  The number of articles in the user-article matrix doesn't look right."
assert user_item.sum(axis=1)[1] == 36, "Oops!  The number of articles seen by user 1 doesn't look right."
print("You have passed our quick tests!  Please proceed!")

`2.` _The function below takes a `user_id` and provides an ordered list of the most similar users to that user (from most similar to least similar). The returned result should not contain the provided `user_id`, since every user is trivially similar to themselves. Because the entries for each user are binary, it makes sense to compute similarity as the dot product of two users' rows: the result is the number of articles both users have interacted with._

_We use the tests to check the function._
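A small sketch of the dot product similarity on binary rows, assuming toy data and a hypothetical `rank_neighbors` helper. It only illustrates the idea and is not the notebook's implementation.

```python
# Toy binary user-item rows (columns are articles); values are illustrative only
user_rows = {
    1: [1, 1, 0, 1],
    2: [1, 1, 0, 0],
    3: [0, 0, 1, 0],
    4: [1, 0, 0, 0],
}

def dot(u, v):
    """Dot product; for 0/1 vectors this counts the articles both users have seen."""
    return sum(a * b for a, b in zip(u, v))

def rank_neighbors(user_id, rows):
    """Other users ordered from most to least similar (hypothetical helper)."""
    scores = {other: dot(rows[user_id], row)
              for other, row in rows.items() if other != user_id}
    return sorted(scores, key=scores.get, reverse=True)

neighbors = rank_neighbors(1, user_rows)
print(neighbors)
```

The user is excluded by id rather than by position, so the result stays correct even when another user ties with the self-similarity score.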

In [None]:
def find_similar_users(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest dot product users)
                    are listed first
    
    Description:
    Computes the similarity of the provided user to every user based on the dot product.
    Returns the other user ids ordered from most to least similar.
    
    '''
    # compute similarity of each user to the provided user
    user_similarities = user_item.dot(user_item.loc[user_id])

    # remove the user's own id explicitly; another user can tie with the
    # self-similarity, so the user is not guaranteed to be sorted first
    user_similarities = user_similarities.drop(user_id)

    # sort by similarity
    sorted_similarities = user_similarities.sort_values(ascending=False)

    # create list of just the ids
    most_similar_users = list(sorted_similarities.index)
       
    return most_similar_users # return a list of the users in order from most to least similar
        

In [None]:
# Do a spot check of your function
print("The 10 most similar users to user 1 are: {}".format(find_similar_users(1)[:10]))
print("The 5 most similar users to user 3933 are: {}".format(find_similar_users(3933)[:5]))
print("The 3 most similar users to user 46 are: {}".format(find_similar_users(46)[:3]))

`3.` _Now that we have a function that provides the most similar users to each user, we will want to use these users to find articles we can recommend.  The functions below return the articles we would recommend to each user._

In [None]:
def get_article_names(article_ids, df=df):
    '''
    INPUT:
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column)
    '''
    article_names = [df[df['article_id'] == float(x)]['title'].unique()[0] for x in article_ids]
    # Return the article names associated with list of article ids
    return article_names 


def get_user_articles(user_id, user_item=user_item):
    '''
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column in df)
    
    Description:
    Provides a list of the article_ids and article titles that have been seen by a user
    '''

    article_ids = user_item.loc[user_id][user_item.loc[user_id] == 1].index.astype('str').to_list()
    article_names = get_article_names(article_ids, df)
    return article_ids, article_names # return the ids and names


def user_user_recs(user_id, m):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    Users who are the same closeness are chosen arbitrarily as the 'next' user
    
    For the user where the number of recommended articles starts below m 
    and ends exceeding m, the last items are chosen arbitrarily
    
    '''
    # articles_seen by user (we don't want to recommend these)
    articles_seen = get_user_articles(user_id, user_item)[0]
    # find the similar users
    similar_users = find_similar_users(user_id)
    
    # list of recommended articles
    recs = []
    
    for user in similar_users:
        user_list = get_user_articles(user, user_item)[0]
        recs_update = np.setdiff1d(user_list, articles_seen)
        recs.extend(np.setdiff1d(recs_update, recs))
     
        if len(recs) >= m:
            break
    
    return recs[:m] # return recommendations for this user_id    

In [None]:
# Check Results
get_article_names(user_user_recs(1, 10)) # Return 10 recommendations for user 1

In [None]:
# Test the functions here - No need to change this code - just run this cell
assert set(get_article_names(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis']), "Oops! The get_article_names function doesn't work quite how we expect."
assert set(get_article_names(['1320.0', '232.0', '844.0'])) == set(['housing (2015): united states demographic measures','self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook']), "Oops! The get_article_names function doesn't work quite how we expect."
assert set(get_user_articles(20)[0]) == set(['1320.0', '232.0', '844.0'])
assert set(get_user_articles(20)[1]) == set(['housing (2015): united states demographic measures', 'self-service data preparation with ibm data refinery','use the cloudant-spark connector in python notebook'])
assert set(get_user_articles(2)[0]) == set(['1024.0', '1176.0', '1305.0', '1314.0', '1422.0', '1427.0'])
assert set(get_user_articles(2)[1]) == set(['using deep learning to reconstruct high-resolution audio', 'build a python app on the streaming analytics service', 'gosales transactions for naive bayes model', 'healthcare python streaming application demo', 'use r dataframes & ibm watson natural language understanding', 'use xgboost, scikit-learn & ibm watson machine learning apis'])
print("If this is all you see, you passed all of our tests!  Nice job!")

`4.` _Now we are going to improve the consistency of the **user_user_recs** function from above._

* _When several users are equally close to a given user, choose the users with the most total article interactions before those with fewer article interactions, instead of choosing arbitrarily._

* _For the user at which the number of recommended articles first reaches or exceeds m, choose the articles with the most total interactions before those with fewer total interactions, instead of choosing arbitrarily. This ranking should match the one produced by the `get_top_articles` function written earlier._
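The tie-breaking rule for users amounts to a two-key sort. A minimal sketch, assuming toy `(neighbor_id, similarity, num_interactions)` tuples that are not taken from the dataset:

```python
# (neighbor_id, similarity, num_interactions) for some toy neighbors
neighbors = [(7, 3, 12), (9, 5, 4), (4, 3, 40), (2, 5, 30)]

# Sort by similarity first, then by total interactions, both descending
ranked = sorted(neighbors, key=lambda row: (row[1], row[2]), reverse=True)
ranked_ids = [neighbor_id for neighbor_id, _, _ in ranked]
print(ranked_ids)
```

Users 2 and 9 share the highest similarity, and user 2 comes first because of the larger interaction count; the same logic orders users 4 and 7.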

In [None]:
# get a sample user id
user_id = 17

In [None]:
# order the other users based on similarity
neighbor_id = find_similar_users(user_id)
len(neighbor_id)

In [None]:
# record the similarity measure, i.e. the dot product with user_id 
similarity = [user_item.loc[neighbor].dot(user_item.loc[user_id]) for neighbor in neighbor_id]

In [None]:
# find the number of views for each user
num_interactions = [df[df['user_id'] == x].shape[0] for x in neighbor_id]

In [None]:
# create a dictionary 
neighbors_df = {'neighbor_id': neighbor_id,
               'similarity': similarity,
               'num_interactions': num_interactions}

In [None]:
# create a dataframe from dictionary
neighbors_df = pd.DataFrame(data=neighbors_df)

In [None]:
# check the output
neighbors_df.head()

In [None]:
# sort the values
neighbors_df.sort_values(by=['similarity', 'num_interactions'], ascending=False).head(2)

In [None]:
# confirm that user_id itself is not among its neighbors (expect an empty dataframe)
neighbors_df[neighbors_df['neighbor_id'] == user_id]

In [None]:
# create a list of the most similar users
neighbor_list = neighbor_id[:10]
neighbor_list

In [None]:
# get one member from the neighbors list
user = neighbor_list[1]
user

In [None]:
# get the articles seen by the similar user
similar_seen = get_user_articles(user, user_item)[0]
len(similar_seen)

In [None]:
# the articles seen by user_id member
articles_seen =  get_user_articles(user_id, user_item)[0]
articles_seen[:4]

In [None]:
# remove the articles seen by user_id from neighbor's list
# these are the articles to recommend
articles_to_rec = np.setdiff1d(similar_seen, articles_seen)
len(articles_to_rec)

In [None]:
# the final list of articles to recommend
recs_ids = []
# the articles seen by similar user to add to the recommended list
# remove those articles already in the list
articles_to_add = np.setdiff1d(articles_to_rec, recs_ids)
len(articles_to_add)

In [None]:
# for the next step we need article ids as float or integer
articles_ids = [float(x) for x in articles_to_add]
articles_ids[:2]

In [None]:
# sort the article ids
df_reduced=df[df['article_id'].isin(articles_ids)]
df_reduced.groupby('article_id').count()['title'].sort_values(ascending=False).index.to_list()[:2]

In [None]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    '''
    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
            1's when a user has interacted with an article, 0 otherwise
    
            
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - is a neighbor user_id
                    similarity - measure of the similarity of each user to the provided user_id
                    num_interactions - the number of articles viewed by the user - if a u
                    
    Other Details - sort the neighbors_df by the similarity and then by number of interactions where 
                    highest of each is higher in the dataframe
     
    '''
    # order the other users based on similarity with member user_id
    neighbor_id = find_similar_users(user_id)
    # record the similarity measure, i.e. the dot product with user_id 
    similarity = [user_item.loc[neighbor].dot(user_item.loc[user_id]) for neighbor in neighbor_id]
    # find the number of views/interactions for each user
    num_interactions = [df[df['user_id'] == x].shape[0] for x in neighbor_id]
    
    # create a dataframe 
    neighbors_df = pd.DataFrame(data={'neighbor_id': neighbor_id,
                                      'similarity': similarity,
                                      'num_interactions': num_interactions})
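    # make sure the member user_id is not among its own neighbors
    # (filter on the neighbor_id column; the row index is positional, not a user id)
    neighbors_df = neighbors_df[neighbors_df['neighbor_id'] != user_id].copy()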
    
    # sort by similarity and num_interactions                 
    neighbors_df.sort_values(by = ['similarity', 'num_interactions'], 
                             inplace=True, 
                             ascending=(False, False))   
    
    # Return the dataframe specified in the doc_string
    return neighbors_df 


def user_user_recs_part2(user_id, m=10):
    '''
    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of recommendations for the user by article id
    rec_names - (list) a list of recommendations for the user by article title
    
    Description:
    Loops through the users based on closeness to the input user_id
    For each user - finds articles the user hasn't seen before and provides them as recs
    Does this until m recommendations are found
    
    Notes:
    * Choose the users that have the most total article interactions 
    before choosing those with fewer article interactions.

    * Choose the articles with the most total interactions 
    before choosing those with fewer total interactions. 
   
    '''
    # list of recommended articles by id, and by title
    recs_ids = []
    
    # articles_seen by user (we don't want to recommend these)
    articles_seen = get_user_articles(user_id, user_item)[0]
    
    # similar users with most article views
    similar_users = get_top_sorted_users(user_id, df, user_item)
    
    for user in similar_users['neighbor_id'].values:
        
        # get the articles seen by the similar user
        similar_seen = get_user_articles(user, user_item)[0]
        
        # remove the articles in articles_seen
        articles_to_rec = np.setdiff1d(similar_seen, articles_seen)
        
        # remove the articles already added to the recs list
        articles_to_add = np.setdiff1d(articles_to_rec, recs_ids)
        
        # rewrite the recommended article ids as float 
        articles_ids = [float(x) for x in articles_to_add]
        
        # sort the articles by popularity, i.e. number of views
        df_red = df[df['article_id'].isin(articles_ids)]
        sorted_articles=df_red.groupby('article_id').count()['title'].sort_values(ascending=False).index.to_list()
       
        # add the sorted article ids
        recs_ids.extend(sorted_articles)
        
        # break when we have enough articles to recommend
        if len(recs_ids) >= m:
            break
    
    # retain the first m recommendations
    recs = recs_ids[:m]
    
    # get the articles names
    rec_names = get_article_names(recs, df)
    
    return recs, rec_names


In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs_part2(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

`5.` _Using the functions above, we fill in the solutions to the dictionary below and then test the dictionary against the solution. The code needed for each of the comments below is provided._

In [None]:
# Tests with a dictionary of results

# The dataframe is sorted, so rank is positional: use .iloc, not the index label
# Find the user that is most similar to user 1 
user1_most_sim = get_top_sorted_users(1, df, user_item).iloc[0]['neighbor_id']
# Find the 10th most similar user to user 131
user131_10th_sim = get_top_sorted_users(131, df, user_item).iloc[9]['neighbor_id']

In [None]:
print(user1_most_sim, user131_10th_sim)

In [None]:
# Dictionary Test Here
sol_5_dict = {
    'The user that is most similar to user 1.': user1_most_sim, 
    'The user that is the 10th most similar to user 131': user131_10th_sim,
}

t.sol_5_test(sol_5_dict)

`6.` _If we were given a new user, which of the above functions would you be able to use to make recommendations?  Explain.  Can you think of a better way we might make recommendations?  Use the cell below to explain a better method for new users._

**For a new user we recommend the most popular articles on the website, through the `get_top_articles` and `get_top_article_ids` functions, as we don't have any information about the user's preferences.**
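A rough sketch of such a fallback is given below. The `recommend` function and the `history` dictionary are hypothetical names introduced only for this illustration; the popular ids are the five returned by `get_top_articles(5)` earlier in the notebook.

```python
def recommend(user_id, history, popular_ids, n=3):
    """Hypothetical dispatcher: fall back to popularity for users with no history."""
    seen = history.get(user_id, set())
    if not seen:
        # Cold start: nothing is known about the user, so use the global ranking
        return popular_ids[:n]
    # Known user: as a placeholder, filter seen articles out of the popular list;
    # a real system would call a personalised recommender here instead
    return [a for a in popular_ids if a not in seen][:n]

popular_ids = [1429, 1330, 1431, 1427, 1364]  # most viewed articles, in order
history = {1: {1429, 1431}}                   # toy history for one existing user

cold_start_recs = recommend(999, history, popular_ids)
known_user_recs = recommend(1, history, popular_ids)
print(cold_start_recs, known_user_recs)
```

Once a new user has accumulated a few interactions, the collaborative or content based functions become applicable and the fallback is no longer needed.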

`7.` _Using the existing functions, we provide the top 10 recommended articles we would provide for a new user below. Test the function against the standard solution._

In [None]:
new_user = '0.0'

# What would your recommendations be for this new user '0.0'?  As a new user, they have no observed articles.
# Provide a list of the top 10 article ids you would give to this user
new_user_recs = get_top_articles(10)[0]

In [None]:
# rewrite recommendations as a list of strings in the '1429.0' format used by the test
new_user_recs = [str(float(x)) for x in new_user_recs]

In [None]:
assert set(new_user_recs) == set(['1314.0','1429.0','1293.0','1427.0',
                                  '1162.0','1364.0','1304.0','1170.0','1431.0','1330.0']), "Oops! It makes sense that in this case we would want to recommend the most popular articles, because we don't know anything about these users."

print("That's right!  Nice job!")

### <a class="anchor" id="Content-Recs">Part IV: Content Based Recommendations</a>

Another method we might use to make recommendations is to rank the articles associated with some term by popularity. We could consider content to be the `doc_body`, `doc_description`, or `doc_full_name`.  There isn't one way to create a content based recommendation, especially considering that each of these columns holds content related information.  

`1.` _We will create a content based recommender based on the article titles (the `doc_full_name` column of `df_content` and the `title` column of `df`)._  

`2.` _We choose the most popular recommendations that meet the content criteria._

#### Investigate and Prepare the Data

In [None]:
# make copies of the data
df_copy = df.copy()
df_content_copy = df_content.copy()

In [None]:
# take a look at the data
df_copy.columns

In [None]:
# get the data types in df
df_copy.dtypes

In [None]:
# change datatype of article id in df
df_copy['article_id'] = df_copy['article_id'].astype('int')

# check the output
df_copy.dtypes

In [None]:
# look at the titles in df
df_copy.title[1:4]

In [None]:
# lowercase all titles in df dataframe
df_copy['title'] = df_copy['title'].apply(lambda x: x.lower())

In [None]:
# take a look at the content data
df_content_copy.columns

In [None]:
# get the data types in df_content
df_content_copy.dtypes

In [None]:
# take a closer look at the data
df_content_copy.head(2)

In [None]:
# rename the column doc_full_name to title
df_content_copy.rename(columns={'doc_full_name': 'title'}, inplace = True)

In [None]:
# lower case all the article titles
df_content_copy['title'] = df_content_copy['title'].apply(lambda x: x.lower())

# check the outcome
df_content_copy.head()

In [None]:
# form a dataframe that has the columns: article_id and title
df_cont = df_content_copy[['article_id', 'title']]

# check the output
df_cont.head()

In [None]:
# create a dataframe for article titles in df
df_titles = df_copy[['article_id', 'title']]

df_titles.head()

In [None]:
# check for missing values
df_titles.isnull().sum()

In [None]:
# remove the duplicates from dataframe df_titles
df_titles = df_titles[~df_titles.duplicated()]

In [None]:
# combine the two dataframes so all available articles are included
df_full = pd.merge(df_cont, df_titles,
                  on = ['article_id', 'title'],
                  how = 'outer')

# check the outcome
df_full.shape

In [None]:
# preview the output
df_full.head()

In [None]:
# check for missing values
df_full.isnull().sum()

#### Pre-process the Text and Create TF-IDF Matrix

In [None]:
def tokenize(text):
    
    """
    Contains the pre-processing steps for a document:
        - tokenize
        - lemmatize
        - lowercasing
        - removes stopwords in English language
        
    INPUT (string) - raw message
    OUTPUT (list)  - clean tokens
    """
    
    # remove punctuation and unusual characters 
    text = re.sub(r"[^a-zA-Z0-9]", " ", text).strip()
    
    # split into words
    words = word_tokenize(text)
    
    # lemmatize - reduce words to their root form (reuse one lemmatizer instance)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w) for w in words]
    
    # case normalize and remove leading & trailing empty spaces
    words = [w.lower().strip() for w in words]
    
    # remove stopwords, keep 'not' and 'can'; build the stopword set once for fast lookup
    stop_words = set(stopwords.words('english')) - {'not', 'can'}
    clean_words = [w for w in words if w not in stop_words]
    
    return clean_words

In [None]:
# create an instance of the TF-IDF vectorizer
tfidf = TfidfVectorizer(tokenizer=tokenize)

# construct the TF-IDF matrix 
tfidf_matrix = tfidf.fit_transform(df_full['title'])

In [None]:
# output the shape of the tfidf matrix
tfidf_matrix.shape

#### Compute the cosine similarity scores

In [None]:
# compute the cosine similarity matrix 
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# check the output
cosine_sim.shape
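`linear_kernel` computes plain dot products. These coincide with cosine similarities when every row has unit L2 norm, which is the default normalization of `TfidfVectorizer` (`norm='l2'`). The sketch below checks that identity on two toy vectors in plain Python; the helper names are illustrative only.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a, b = [1.0, 2.0, 0.0], [2.0, 1.0, 2.0]
from_definition = cosine(a, b)
from_unit_vectors = dot(l2_normalize(a), l2_normalize(b))
print(from_definition, from_unit_vectors)
```

If the vectorizer were created with `norm=None`, the dot products would no longer be cosine similarities and `cosine_similarity` would be the safer choice.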

#### Build the recommender function

In [None]:
def get_article_info(article_id, df_full):
    '''
    INPUT:
    article_id - (integer)
    df_full - (pandas dataframe) contains article_id, title
    
    OUTPUT:
    article_id - (integer) the provided article id
    article_title - (string) the article name associated with the provided article id
    '''
    
    article_title = df_full[df_full['article_id']==article_id]['title'].unique()[0]
    return article_id, article_title

In [None]:
# print a sample output

print('The title of the article with id = 542 is: {}.'.format(get_article_info(542, df_full)[1]))

In [None]:
# Function that takes in the article id as input and gives n recommendations
def content_recommender(article_id, n, cosine_sim, df_full):
    '''
    INPUT:
    article_id - (integer)
    n - (integer) how many recommendations should be returned
    cosine_sim  - (np.ndarray) matrix of cosine similarities
    df_full - (pandas dataframe) contains title and article id
    
    OUTPUT:
    given_art - (tuple) id and title of the given article
    recommended_articles  - (pandas dataframe) ids and titles of the recommended articles, 
                             sorted by cosine similarity
    '''
    # get the information for the given article id
    given_art = get_article_info(article_id, df_full)
    
    # obtain the matrix index that matches the article id
    mat_index = df_full[df_full['article_id']==article_id].index.values[0]
    
    # drop the given article itself, then sort the remaining scores by cosine similarity
    sim_scores = pd.Series(cosine_sim[mat_index]).drop(mat_index).sort_values(ascending=False)
    
    # get the indices corresponding to the scores of the n most similar articles
    sim_scores_n = list(sim_scores.iloc[:n].index.values)
    
    # return the top n most similar article_ids as a pandas dataframe
    recommended_articles = df_full.iloc[sim_scores_n]
    
    return given_art, recommended_articles

In [None]:
# give recommendations for the article with id 20
article_id = 20
rec_id20 = content_recommender(article_id, 10, cosine_sim, df_full)
rec_id20

In [None]:
# the article ids for the recommended articles for article_id=20
rec_id20[1]['article_id'].tolist()

In [None]:
# make recommendations for article id 224 based on title
content_recommender(224, 5, cosine_sim, df_full)

#### Make Content Recommendations

In [None]:
def make_content_recs(user_id, cosine_sim, m=10, df=df, df_full=df_full):
    '''
    INPUT:
    user_id - (int) a user id
    cosine_sim - (np.ndarray) matrix of cosine similarities between article titles
    m - (int) the number of recommendations you want for the user
    df - (pandas dataframe) contains user and articles interactions
    df_full - (pandas dataframe) contains title and article id
    
    OUTPUT:
    complete_recs - (list) (article_id, title) tuples for the recommended articles
    
    Description:
    Loops through the articles seen by the user.
    For each article seen by the user - finds the n most similar articles based on title content
    and keeps those the user has not seen and that are not already recommended.
    Does this until m recommendations are found.
   
    '''
    # list of recommended articles by id, and by title
    recommendations = []
    
    # articles_seen by user 
    articles_seen = get_user_articles(user_id, user_item)[0]
    
    # rewrite the seen article ids (strings such as '1320.0') as int
    articles_ids_seen = [int(float(x)) for x in articles_seen]
    
    for art_id in articles_ids_seen:
        
        # get the n most similar articles 
        n = 10
        similar_articles = content_recommender(art_id, n, cosine_sim, df_full)[1]['article_id'].tolist()

        # remove the articles the user has already seen
        articles_to_rec = np.setdiff1d(similar_articles, articles_ids_seen)
        
        # remove the articles already added to the recs list
        articles_to_add = np.setdiff1d(articles_to_rec, recommendations)
        
        # add the sorted article ids
        recommendations.extend(articles_to_add)
        
        # break when we have enough articles to recommend
        if len(recommendations) >= m:
            break
    
    # retain the first m recommendations
    recs = recommendations[:m]
    
    # get the articles titles
    complete_recs = [get_article_info(int(idx), df_full) for idx in recs]
    
    return complete_recs


In [None]:
# articles seen by user 40
articles_seen = get_user_articles(40, user_item)[0]
articles_ids_seen = [float(x) for x in articles_seen]
articles_info = [get_article_info(index, df_full) for index in articles_ids_seen]
articles_info

In [None]:
# recommendations for user 40 based on title content
make_content_recs(40, cosine_sim, 10, df, df_full)

In [None]:
# articles seen by user 178
articles_seen = get_user_articles(178, user_item)[0]
articles_ids_seen = [int(x[:-2]) for x in articles_seen]
articles_info = [get_article_info(index, df_full)[0] for index in articles_ids_seen]
articles_info

In [None]:
# recommendations for user 178 based on title content
make_content_recs(178, cosine_sim, 10, df, df_full)

`2.` _Now that you have put together your content-based recommendation system, use the cell below to write a summary explaining how your content based recommender works.  Do you see any possible improvements that could be made to your function?  Is there anything novel about your content based recommender?_

**The content-based recommendations rely on either the article title or the article description; the corpus consists of one of these two groups of documents. The text is processed by removing punctuation and stop words, then it is lemmatized and split into tokens. The processed corpus is fed into a TF-IDF vectorizer that creates a matrix of scores. The cosine similarities between any two documents (rows of the TF-IDF matrix) are computed and the results are saved in a 1051x1051 matrix.**
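
The vectorize-then-compare step described above can be sketched on a toy corpus. The three titles below are hypothetical stand-ins for the processed article text, and the variable names are illustrative only. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` equals the cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# hypothetical toy titles standing in for the processed article corpus
toy_titles = [
    "deep learning with python",
    "machine learning with python",
    "visualize data with pandas",
]

# each row of the TF-IDF matrix is one document, normalized to unit length
tfidf_matrix = TfidfVectorizer().fit_transform(toy_titles)

# dot products of unit-length rows are cosine similarities
toy_cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# indices of the other titles, most similar to title 0 first
ranked = [int(i) for i in toy_cosine_sim[0].argsort()[::-1] if i != 0]
print(toy_cosine_sim.shape, ranked)
```

With 1051 documents the same two calls would produce the 1051x1051 similarity matrix mentioned in the summary.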

**Given a user id, and assuming that the user has seen at least one article, the engine finds the n most content-similar articles for each article seen by the user. Once this collection is created, it is sorted by article popularity and the m most popular articles are recommended.**
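
A minimal sketch of the collect-then-sort-by-popularity step described above, under stated assumptions: `rank_candidates`, `similar_by_article` and `popularity` are hypothetical names that do not appear in the notebook, and the interaction counts are made up for illustration.

```python
def rank_candidates(seen_ids, similar_by_article, popularity, m=10):
    """Collect similar articles for every seen article, drop the seen ones,
    and return the m most popular candidates."""
    seen = set(seen_ids)
    candidates = set()
    for art_id in seen_ids:
        candidates.update(
            a for a in similar_by_article.get(art_id, []) if a not in seen
        )
    # most interactions first; ties are broken by article id
    ordered = sorted(candidates, key=lambda a: (-popularity.get(a, 0), a))
    return ordered[:m]


# hypothetical inputs: similar articles per article id, and interaction counts
similar_by_article = {1: [2, 3, 4], 2: [1, 5]}
popularity = {3: 50, 4: 10, 5: 70}

top_two = rank_candidates([1, 2], similar_by_article, popularity, m=2)
print(top_two)  # [5, 3]
```

Using a set for the candidates removes duplicates before sorting, so an article that is similar to several seen articles is recommended only once.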

**One way to improve these results is to use more advanced NLP techniques, such as word embeddings. Another option would be to create metadata for the articles based on their descriptions and full text, both available in the `df_content` dataframe.**

`3.` _Use your content-recommendation system to make recommendations for the below scenarios based on the comments.  Again no tests are provided here, because there isn't one right answer that could be used to find these content based recommendations._

In [None]:
# make recommendations for a brand new user - recommend the most popular articles
new_user_recs = get_top_article_ids(10, df) 
new_user_recommendations = [get_article_info(art_id, df_full) for art_id in new_user_recs]
new_user_recommendations

In [None]:
# make recommendations for a user who only has interacted with article id '1427.0'
content_recommender(1427, 10, cosine_sim, df_full)

### <a class="anchor" id="Matrix-Fact">Part V: Matrix Factorization</a>

In this part of the notebook, we use matrix factorization to make article recommendations to the users on the IBM Watson Studio platform.

`1.` _Upload the user_item matrix from part 1._

In [None]:
# Load the matrix here
user_item_matrix = pd.read_pickle('user_item_matrix.p')

In [None]:
# quick look at the matrix
user_item_matrix.head()

`2.` _In this situation, you can use Singular Value Decomposition from [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.svd.html) on the user-item matrix.  Use the cell to perform SVD, and explain why this is different than in the lesson._

In [None]:
# find the shape of the user_item matrix
user_item_matrix.shape

In [None]:
# Perform SVD on the User-Item Matrix Here

u, s, vt = np.linalg.svd(user_item_matrix)  # use the built-in to get the three matrices

In [None]:
# get info on the output matrices
u.shape, s.shape, vt.shape

**In the lesson, the user_movie_subset matrix contains movie ratings as entries, so that entry (i, j) is the rating given by user i to movie j. This approach produces numerous missing entries for users who did not watch or rate a certain movie. Since SVD does not work with missing values, it has to be replaced by an alternative approach, such as FunkSVD.**

**In the case of the IBM recommendations, the user_item_matrix is a sparse array of binary entries that records whether or not a user interacted with an article. Since the matrix has no missing entries, SVD can be applied to it directly.**
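
The contrast between the two situations can be checked on small matrices. This is only a sketch: both toy matrices and the helper `svd_is_usable` are hypothetical, and the helper treats either a raised `LinAlgError` or non-finite singular values as a failure, since the exact behaviour on NaN input can depend on the linear algebra backend.

```python
import numpy as np


def svd_is_usable(matrix):
    """Return True if np.linalg.svd yields finite singular values."""
    try:
        singular_values = np.linalg.svd(matrix, compute_uv=False)
    except np.linalg.LinAlgError:
        return False
    return bool(np.all(np.isfinite(singular_values)))


# binary interaction matrix: every entry is known (1 or 0)
binary_interactions = np.array([[1.0, 0.0, 1.0],
                                [0.0, 1.0, 0.0],
                                [1.0, 1.0, 0.0]])

# ratings matrix: unknown ratings are missing (NaN)
ratings_with_gaps = np.array([[5.0, np.nan, 3.0],
                              [np.nan, 4.0, np.nan],
                              [2.0, 1.0, np.nan]])

print(svd_is_usable(binary_interactions), svd_is_usable(ratings_with_gaps))

# with no missing entries, the full decomposition reproduces the matrix
u_toy, s_toy, vt_toy = np.linalg.svd(binary_interactions, full_matrices=False)
reconstruction = u_toy @ np.diag(s_toy) @ vt_toy
```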

`3.` _Now for the tricky part, how do we choose the number of latent features to use?  Running the below cell, you can see that as the number of latent features increases, we obtain a lower error rate on making predictions for the 1 and 0 values in the user-item matrix.  Run the cell below to get an idea of how the accuracy improves as we increase the number of latent features._
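
The effect of the number of latent features can be previewed on a small random matrix. This sketch is an illustration under assumptions, not the notebook's computation: the toy matrix is random, and the error is the unrounded Frobenius norm, which cannot increase as k grows, whereas the cell below counts mismatches after rounding. The cumulative share of squared singular values is added as one more possible guide for choosing k.

```python
import numpy as np

# hypothetical sparse binary user-item matrix (30 users, 12 articles)
rng = np.random.default_rng(0)
toy_user_item = (rng.random((30, 12)) < 0.2).astype(float)

u_t, s_t, vt_t = np.linalg.svd(toy_user_item, full_matrices=False)

frobenius_errors = []
for k in range(1, len(s_t) + 1):
    # keep only the first k latent features
    approx = u_t[:, :k] @ np.diag(s_t[:k]) @ vt_t[:k, :]
    frobenius_errors.append(np.linalg.norm(toy_user_item - approx))

# share of the total squared singular values captured by the first k features
explained = np.cumsum(s_t ** 2) / np.sum(s_t ** 2)

for k in (1, 4, 8, 12):
    print(k, round(float(frobenius_errors[k - 1]), 3), round(float(explained[k - 1]), 3))
```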

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs = []

for k in num_latent_feats:
    # restructure with k latent features
    s_new, u_new, vt_new = np.diag(s[:k]), u[:, :k], vt[:k, :]
    
    # take dot product
    user_item_est = np.around(np.dot(np.dot(u_new, s_new), vt_new))
    
    # compute error for each prediction to actual value
    diffs = np.subtract(user_item_matrix, user_item_est)
    
    # total errors and keep track of them
    err = np.sum(np.sum(np.abs(diffs)))
    sum_errs.append(err)
    
    
plt.plot(num_latent_feats, 1 - np.array(sum_errs)/df.shape[0]);
plt.xlabel('Number of Latent Features');
plt.ylabel('Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

`4.` From the above, we can't really be sure how many features to use, because simply having a better way to predict the 1's and 0's of the matrix doesn't exactly tell us whether we are able to make good recommendations.  Instead, we might split our dataset into a training and test set of data, as shown in the cell below.  

Use the code from question 3 to understand the impact on accuracy of the training and test sets of data with different numbers of latent features. Using the split below: 

* How many users can we make predictions for in the test set?  
* How many users are we not able to make predictions for because of the cold start problem?
* How many articles can we make predictions for in the test set?  
* How many articles are we not able to make predictions for because of the cold start problem?
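
The four questions above reduce to set operations on the ids in each split. A minimal sketch on made-up interactions follows; the toy frames and the helper `split_warm_cold` are hypothetical, and the column names `user_id` and `article_id` are assumed to match the notebook's `df`.

```python
import pandas as pd

# hypothetical interactions for the two splits
toy_train = pd.DataFrame({"user_id": [1, 1, 2, 3], "article_id": [10, 11, 10, 12]})
toy_test = pd.DataFrame({"user_id": [2, 4, 5], "article_id": [10, 11, 13]})


def split_warm_cold(train_values, test_values):
    """Split the test ids into those seen in training (warm) and unseen (cold)."""
    train_set, test_set = set(train_values), set(test_values)
    warm = sorted(test_set & train_set)
    cold = sorted(test_set - train_set)
    return warm, cold


warm_users, cold_users = split_warm_cold(toy_train["user_id"], toy_test["user_id"])
warm_arts, cold_arts = split_warm_cold(toy_train["article_id"], toy_test["article_id"])

print(len(warm_users), len(cold_users), len(warm_arts), len(cold_arts))
```

Only the warm ids have a row (users) or column (articles) in the decomposition of the training matrix, which is why the cold ids cannot receive SVD-based predictions.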

In [None]:
# recall the df size
df.shape

In [None]:
df_train = df.head(40000)
df_test = df.tail(5993)


def create_test_and_train_user_item(df_train, df_test):
    '''
    INPUT:
    df_train - training dataframe
    df_test - test dataframe
    
    OUTPUT:
    user_item_train - a user-item matrix of the training dataframe 
                      (unique users for each row and unique articles for each column)
    user_item_test - a user-item matrix of the testing dataframe 
                    (unique users for each row and unique articles for each column)
    test_idx - all of the test user ids
    test_arts - all of the test article ids
    
    '''
    # create the two user_item matrices
    user_item_train = create_user_item_matrix(df_train)
    user_item_test = create_user_item_matrix(df_test)
    
    # extract the test user ids
    test_idx = list(user_item_test.index)
    # extract the test article_ids
    test_arts = user_item_test.columns
    
    return user_item_train, user_item_test, test_idx, test_arts

user_item_train, user_item_test, test_idx, test_arts = create_test_and_train_user_item(df_train, df_test)

In [None]:
# determine the shapes of the two matrices
user_item_train.shape, user_item_test.shape

In [None]:
# the user ids in test set, the article ids in test set
len(test_idx), len(test_arts)

In [None]:
# common user ids in the train and the test sets, 
common_users = len(set(user_item_test.index) & set(user_item_train.index))

print('There are {} common user ids in the train and test sets.'.format(common_users))

In [None]:
# the number of user ids in the test set
n_test_users = len(user_item_test.index)
# the number of user ids in the test set that are not in the train set
n_cold_users = n_test_users - common_users

print('There are {} users in the test set that are not in the train set.'.format(n_cold_users))

In [None]:
# the number of all article ids in the train set
len(user_item_train.columns)

In [None]:
# the number of all article ids in the test set
len(user_item_test.columns)

In [None]:
# common article ids in the train and the test sets
common_articles = len(set(user_item_test.columns) & set(user_item_train.columns))

print('There are {} common article ids in the train and test sets.'.format(common_articles))

In [None]:
# Replace the values in the dictionary below
a = 662 
b = 574 
c = 20 
d = 0 


sol_4_dict = {
    'How many users can we make predictions for in the test set?': c, 
    'How many users in the test set are we not able to make predictions for because of the cold start problem?': a, 
    'How many movies can we make predictions for in the test set?': b,
    'How many movies in the test set are we not able to make predictions for because of the cold start problem?': d
}

t.sol_4_test(sol_4_dict)

`5.` Now use the **user_item_train** dataset from above to find U, S, and V transpose using SVD. Then find the subset of rows in the **user_item_test** dataset that you can predict using this matrix decomposition with different numbers of latent features to see how many features makes sense to keep based on the accuracy on the test data. This will require combining what was done in questions `2` - `4`.

Use the cells below to explore how well SVD works towards making predictions for recommendations on the test data.  
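
When restricting the decomposition to the test users and articles, the rows of U and the columns of V transpose have to stay in the same order as the rows and columns of the reduced test matrix. One way to keep them aligned is sketched below on toy data; all names are hypothetical, and `Index.get_indexer` is used to translate labels into positions.

```python
import numpy as np
import pandas as pd

# hypothetical train and test user-item matrices
toy_train_matrix = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
    index=pd.Index([10, 20, 30, 40], name="user_id"),
    columns=[100, 200, 300],
)
toy_test_matrix = pd.DataFrame(
    [[1, 0], [0, 1]],
    index=pd.Index([40, 20], name="user_id"),  # deliberately not sorted
    columns=[300, 100],
)

u_full, s_full, vt_full = np.linalg.svd(
    toy_train_matrix.to_numpy(dtype=float), full_matrices=False
)

# labels present in both matrices
shared_users = toy_test_matrix.index.intersection(toy_train_matrix.index)
shared_arts = toy_test_matrix.columns.intersection(toy_train_matrix.columns)

# positions of those labels inside the train matrix, in the same order
row_pos = toy_train_matrix.index.get_indexer(shared_users)
col_pos = toy_train_matrix.columns.get_indexer(shared_arts)

u_sub = u_full[row_pos, :]
vt_sub = vt_full[:, col_pos]
test_sub = toy_test_matrix.loc[shared_users, shared_arts]

# estimate for the shared block using all latent features
est = u_sub @ np.diag(s_full) @ vt_sub
print(est.shape, test_sub.shape)
```

Because the same label lists drive both the positional lookup and the `.loc` selection, row i of `est` and row i of `test_sub` always refer to the same user.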

In [None]:
# fit SVD on the user_item_train matrix
u_train, s_train, vt_train = np.linalg.svd(user_item_train) 

In [None]:
# the shapes of the three matrices 
u_train.shape, s_train.shape, vt_train.shape

In [None]:
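# the users we can make predictions for (sorted so that the rows of
# user_item_test_red match the rows selected from u_train below,
# assuming the index of user_item_train is sorted)
common_user_ids = sorted(set(user_item_test.index) & set(user_item_train.index))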
# the articles we can use to make predictions
test_arts = user_item_test.columns

print(common_user_ids)

In [None]:
# reduce the user_item_test to include only the common user_ids
user_item_test_red = user_item_test.loc[common_user_ids, test_arts]
# check the shape of the reduced matrix
user_item_test_red.shape

In [None]:
# reset and relabel indices to identify the common user ids
user_item_train_adj = user_item_train.reset_index()
# relabel indices to avoid out of range error
users_common_idx = user_item_train_adj[user_item_train_adj['user_id'].isin(common_user_ids)].index.to_list()
print(users_common_idx)

In [None]:
# reduce vt to the articles in test set
vt_test = vt_train[:, user_item_train.columns.isin(test_arts)]

# reduce u to the common user ids - note: need to work with the updated indices
u_test = u_train[users_common_idx, :]

# check the shapes of the reduced matrices
u_test.shape, vt_test.shape

In [None]:
num_latent_feats = np.arange(10,700+10,20)
sum_errs_train = []
sum_errs_test = []

for k in num_latent_feats:
    # restructure with k latent features for the train set matrices
    s_new_train, u_new_train, vt_new_train = np.diag(s_train[:k]), u_train[:, :k], vt_train[:k, :]
    # restructure with k latent features for the test set matrices        
    s_new_test, u_new_test, vt_new_test = np.diag(s_train[:k]), u_test[:, :k], vt_test[:k, :]
    
    # take dot products for each group
    user_item_est_train = np.around(np.dot(np.dot(u_new_train, s_new_train), vt_new_train))
    user_item_est_test = np.around(np.dot(np.dot(u_new_test, s_new_test), vt_new_test))
    
    # compute error for each prediction to actual value
    diffs_train = np.subtract(user_item_train, user_item_est_train)
    diffs_test = np.subtract(user_item_test_red, user_item_est_test)
    
    # total train errors and keep track of them
    err_train = np.sum(np.sum(np.abs(diffs_train)))
    sum_errs_train.append(err_train)
    
    # total test errors and keep track of them
    err_test = np.sum(np.sum(np.abs(diffs_test)))
    sum_errs_test.append(err_test)

In [None]:
# check the matrix shapes for k=10
s_new_test, u_new_test, vt_new_test = np.diag(s_train[:10]), u_test[:, :10], vt_test[:10, :]
s_new_test.shape, u_new_test.shape, vt_new_test.shape

In [None]:
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax1.plot(num_latent_feats, 1 - np.array(sum_errs_test)/df.shape[0], color='b', label="Test accuracy");
ax2.plot(num_latent_feats, 1 - np.array(sum_errs_train)/df.shape[0], color = 'g', label="Train accuracy");

# get handlers and labels for the legend
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()

# create a combined legend for both accuracy curves
ax1.legend(h1+h2, l1+l2, loc='center right')

ax1.set_xlabel('Number of Latent Features');
ax2.set_ylabel('Train Accuracy');
ax1.set_ylabel('Test Accuracy');
plt.title('Accuracy vs. Number of Latent Features');

`6.` Use the cell below to comment on the results you found in the previous question. Given the circumstances of your results, discuss what you might do to determine if the recommendations you make with any of the above recommendation systems are an improvement to how users currently find articles? 

**The test accuracy decreases while the train accuracy increases with the number of latent features. The two accuracy curves cross at about 100 latent features. The shapes of the two curves indicate overfitting: the more latent features we use, the more overfitted the model is. This is not surprising, as only 20 test users also appear in the train set, so the test accuracy is estimated from very few rows.**


**The following steps can be taken to improve the recommendation engine:**

* **Design an A/B test to compare the warm users (those who receive content and collaborative filtering recommendations) with the cold users (the new users who are recommended the most popular articles only).** 

* **Given how imbalanced the data is, performance measures such as the F1 score and ROC AUC would slightly improve our estimates of the model's performance. We could also take into consideration the number of unique articles a user interacts with.** 

* **Use different algorithms (such as the multi-armed bandit) for the cold user problem.**
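
To illustrate why accuracy alone can mislead on imbalanced interaction data, as noted in the list above, the sketch below scores a trivial "predict no interaction everywhere" baseline. The label vector is made up (95 zeros and 5 ones) and does not come from the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# hypothetical flattened user-item entries: 1 = interaction, 0 = none
actual = np.array([0] * 95 + [1] * 5)

# trivial baseline that never predicts an interaction
all_zero_prediction = np.zeros_like(actual)

baseline_accuracy = accuracy_score(actual, all_zero_prediction)
# zero_division=0 avoids a warning when no positives are predicted
baseline_f1 = f1_score(actual, all_zero_prediction, zero_division=0)

print(baseline_accuracy, baseline_f1)
```

The baseline looks accurate simply because most entries are zero, while its F1 score shows that it recovers none of the actual interactions.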

**However, the effectiveness of any method is still affected by the data imbalance, and the best way to improve the results is to increase the size of the test set by collecting data from more users, after which we could implement the suggestions mentioned above.**

<a id='conclusions'></a>
### Extras
Using your workbook, you could now save your recommendations for each user, develop a class to make new predictions and update your results, and make a flask app to deploy your results.  These tasks are beyond what is required for this project.  However, from what you learned in the lessons, you are certainly capable of taking these tasks on to improve upon your work here!


## Conclusion

> Congratulations!  You have reached the end of the Recommendations with IBM project! 

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the [rubric](https://review.udacity.com/#!/rubrics/2322/view). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.


## Directions to Submit

> Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations! 

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Recommendations_with_IBM.ipynb'])