# Recommendations with IBM in progress

In this project we will analyze the interactions that users have with different articles on the **IBM Watson Studio platform**. We will study the data available on that platform to build a recommendation engine that shows each user the articles most pertinent to them.


## Table of Contents

I. [Exploratory Data Analysis](#Exploratory-Data-Analysis)<br>
II. [Rank Based Recommendations](#Rank)<br>
III. [User-User Based Collaborative Filtering](#User-User)<br>
IV. [Content Based Recommendations](#Content-Recs)<br>
V. [Matrix Factorization](#Matrix-Fact)<br>
VI. [Extras & Concluding](#conclusions)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import project_tests as t
import pickle

%matplotlib inline

df = pd.read_csv('data/user-item-interactions.csv')
df_content = pd.read_csv('data/articles_community.csv')
del df['Unnamed: 0']
del df_content['Unnamed: 0']

# Show df to get an idea of the data
df.head()

Unnamed: 0,article_id,title,email
0,1430.0,"using pixiedust for fast, flexible, and easier...",ef5f11f77ba020cd36e1105a00ab868bbdbf7fe7
1,1314.0,healthcare python streaming application demo,083cbdfa93c8444beaa4c5f5e0f5f9198e4f9e0b
2,1429.0,use deep learning for image classification,b96a4f2e92d8572034b1e9b28f9ac673765cd074
3,1338.0,ml optimization using cognitive assistant,06485706b34a5c9bf2a0ecdac41daf7e7654ceb7
4,1276.0,deploy your python model as a restful api,f01220c46fc92c6e6b161b1849de11faacd7ccb2


In [None]:
df.shape

(45993, 3)

### <a class="anchor" id="Exploratory-Data-Analysis">Part I : Exploratory Data Analysis</a>


**Removing duplicate articles** from the df_content dataframe: 

In [None]:
df_content.article_id.value_counts();

In [None]:
df_content[df_content.article_id.isin([221, 232, 577, 398])]

Unnamed: 0,doc_body,doc_description,doc_full_name,doc_status,article_id
221,* United States\r\n\r\nIBM® * Site map\r\n\r\n...,When used to make sense of huge amounts of con...,How smart catalogs can turn the big data flood...,Live,221
232,Homepage Follow Sign in Get started Homepage *...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232
399,Homepage Follow Sign in Get started * Home\r\n...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
578,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577
692,Homepage Follow Sign in / Sign up Homepage * H...,One of the earliest documented catalogs was co...,How smart catalogs can turn the big data flood...,Live,221
761,Homepage Follow Sign in Get started Homepage *...,Today’s world of data science leverages data f...,Using Apache Spark as a parallel processing fr...,Live,398
970,This video shows you how to construct queries ...,This video shows you how to construct queries ...,Use the Primary Index,Live,577
971,Homepage Follow Sign in Get started * Home\r\n...,"If you are like most data scientists, you are ...",Self-service data preparation with IBM Data Re...,Live,232


In [None]:
df_content = df_content.drop_duplicates(subset= 'article_id')
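
As a quick check of the deduplication step, the sketch below uses a small made-up dataframe (the name `toy_content` and its rows are illustrative, not taken from the dataset). It shows how `duplicated` flags repeated `article_id` values and how `drop_duplicates` keeps the first row for each id.

```python
import pandas as pd

# Toy stand-in for df_content; the values are invented for illustration.
toy_content = pd.DataFrame({
    "doc_full_name": ["A", "B", "A (copy)", "C"],
    "article_id": [221, 232, 221, 398],
})

# keep=False marks every row whose article_id appears more than once
dup_rows = toy_content[toy_content.article_id.duplicated(keep=False)]

# drop_duplicates keeps the first row for each article_id by default
deduped = toy_content.drop_duplicates(subset="article_id")

print(dup_rows.article_id.tolist())
print(deduped.doc_full_name.tolist())
```

Because the first occurrence is kept, the surviving row for a duplicated id depends on the row order of the dataframe.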

**Understanding the data:**

- df: interaction between users and articles
- df_content: information about each article

In [None]:
# distribution of how many articles a user interacts with in the dataset:

user_articles_count = df.email.value_counts()
user_articles_count.hist(range= [0,150]);

In [None]:
# statistics
user_articles_count.describe()

In [None]:
print("The mean number of interactions per user is", user_articles_count.mean())
print("50% of individuals have", user_articles_count.median(), "or fewer interactions.")
print("The maximum number of user-article interactions by any 1 user is", user_articles_count.max())

In [None]:
# most viewed article: '1429.0', how often: 937 times 
df.article_id.value_counts();

In [None]:
unique_articles = df.article_id.nunique() # Number of unique articles that have at least one interaction
total_articles = df_content.article_id.nunique() # Number of unique articles on the IBM platform
unique_users = df.email.nunique() # Number of unique users
user_article_interactions = df.shape[0] # Number of user-article interactions


print("Unique articles with at least one interaction:", unique_articles,
      "Total number of unique articles:", total_articles,
      "Number of unique users:", unique_users,
      "Number of user-article interactions:", user_article_interactions)

**Email mapper**:

The `email_mapper` function is used to map users' emails to ids.  There were a small number of null values, which likely belonged to a single user.

In [None]:
def email_mapper():
    """ Function to map users' emails to user_id column """
    coded_dict = dict()
    cter = 1
    email_encoded = []
    
    for val in df['email']:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        email_encoded.append(coded_dict[val])
    return email_encoded

email_encoded = email_mapper()

# removing email column
del df['email']
df['user_id'] = email_encoded

# show header
df.head()

Unnamed: 0,article_id,title,user_id
0,1430.0,"using pixiedust for fast, flexible, and easier...",1
1,1314.0,healthcare python streaming application demo,2
2,1429.0,use deep learning for image classification,3
3,1338.0,ml optimization using cognitive assistant,4
4,1276.0,deploy your python model as a restful api,5
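
The mapping performed by `email_mapper` can also be sketched with pandas' `factorize`. The example below runs on a toy series of made-up emails. The `"unknown"` placeholder is an assumption used to send all missing emails to one shared id, matching the idea that the null values belong to a single user.

```python
import numpy as np
import pandas as pd

# Toy emails, invented for illustration; NaN stands for a missing email.
toy_emails = pd.Series(["a@x", "b@x", "a@x", np.nan, "c@x", np.nan])

# Give all missing emails one shared placeholder so they map to a single id
filled = toy_emails.fillna("unknown")

# factorize returns 0-based codes in order of first appearance;
# adding 1 yields 1-based ids like those produced by email_mapper
codes, uniques = pd.factorize(filled)
user_ids = (codes + 1).tolist()

print(user_ids)  # [1, 2, 1, 3, 4, 3]
```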


### <a class="anchor" id="Rank">Part II: Rank-Based Recommendations</a>

We only know whether a user has interacted with an article, but we don't have ratings to tell us whether they liked it. In these cases, the popularity of an article can only be based on how often it was interacted with.

Therefore, to make recommendations based on ranking, we will create a function that gives us the **top n article titles** from the df based on the number of interactions users had with them.

In [None]:
def get_top_articles(n, df=df):
    """
    Returns top n article titles from df, based on number of interactions.

    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article titles.
    
    """

    ordered_titles = df.groupby(['title'])['user_id'].count().reset_index(
        name='Count').sort_values(['Count'], ascending=False).title.values.tolist()
    
    top_articles = ordered_titles[:n]
    
    return top_articles # Return the top article titles from df (not df_content)


def get_top_article_ids(n, df=df):
    """
    Returns top n article ids from df, based on number of interactions.

    INPUT:
    n - (int) the number of top articles to return
    df - (pandas dataframe) df as defined at the top of the notebook 
    
    OUTPUT:
    top_articles - (list) A list of the top 'n' article ids
    
    """
    ordered_ids = df.groupby(['article_id'])['user_id'].count().reset_index(
        name='Count').sort_values(['Count'], ascending=False).article_id.values.tolist()
    
    top_articles = ordered_ids[:n]
 
    return top_articles # Return the top article ids

In [None]:
print(get_top_articles(10))
print(get_top_article_ids(10))

['use deep learning for image classification', 'insights from new york car accident reports', 'visualize car data with brunel', 'use xgboost, scikit-learn & ibm watson machine learning apis', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'finding optimal locations of new store using decision optimization', 'apache spark lab, part 1: basic concepts', 'analyze energy consumption in buildings', 'gosales transactions for logistic regression model']
[1429.0, 1330.0, 1431.0, 1427.0, 1364.0, 1314.0, 1293.0, 1170.0, 1162.0, 1304.0]
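
For comparison, the same popularity ranking can be sketched more compactly with `value_counts`, which already sorts by frequency in descending order. The dataframe `toy_df` and the helper `top_n` below are hypothetical names used only for this illustration.

```python
import pandas as pd

# Toy interactions, one row per (user, article) interaction; values are invented.
toy_df = pd.DataFrame({
    "article_id": [10.0, 10.0, 10.0, 20.0, 20.0, 30.0],
    "title": ["a", "a", "a", "b", "b", "c"],
    "user_id": [1, 2, 3, 1, 2, 1],
})

def top_n(column, n, df):
    """Return the n most frequent values of `column`, most frequent first."""
    return df[column].value_counts().index[:n].tolist()

print(top_n("title", 2, toy_df))       # ['a', 'b']
print(top_n("article_id", 2, toy_df))  # [10.0, 20.0]
```

One design difference: `value_counts` counts rows of the column itself, whereas the functions above count non-null `user_id` values per group. The two agree when `user_id` has no missing values.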


In [None]:
def get_article_names(article_ids, df=df):
    """ Return the article names associated with list of article ids.
    
    INPUT: 
    article_ids - (list) a list of article ids
    df - (pandas dataframe) df as defined at the top of the notebook
    
    OUTPUT:
    article_names - (list) a list of article names associated with the list of article ids 
                    (this is identified by the title column) """

    
    article_names = list(set(df[df.article_id.isin(article_ids)].title.values))
    
    return article_names # Return the article names associated with list of article ids

### <a class="anchor" id="User-User">Part III: User-User Based Collaborative Filtering</a>

To build better recommendations for the users of the IBM platform, we will make recommendations to a user **based on the articles read by other users who are similar to them**, in terms of the items they have interacted with. This is called User-User collaborative filtering. 


For this purpose, we will:
- build a **user by article matrix**, where there's a 1 if there was an interaction and a 0 otherwise. This will be useful to compute similarity between users, and also to retrieve which articles were seen by a given user. Therefore, using this matrix we will:
    - build a function that retrieves the **articles a user has interacted with**.
    - build a function that finds the **most similar users** to any given user_id. Because the row for each user in our matrix is binary, we will compute similarity as the dot product between two users, which counts the articles they have in common. When given a user_id, the function will return a pandas dataframe with all the neighbors sorted first by similarity and then by number of interactions.


- build a function to **recommend articles** for any user_id, based on its similarity with other users and the articles they have interacted with.
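
The matrix and similarity steps listed above can be sketched on toy data. The dataframe `toy_df` and its values are invented for illustration. With binary rows, the dot product of two users equals the number of articles both have interacted with.

```python
import numpy as np
import pandas as pd

# Toy interactions; user 1 interacted with article 20.0 twice.
toy_df = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3, 3],
    "article_id": [10.0, 20.0, 20.0, 10.0, 20.0, 30.0, 10.0],
})

# 1 where the user interacted with the article at least once, 0 otherwise
toy_user_item = (
    toy_df.assign(ones=1)
          .groupby(["user_id", "article_id"])["ones"].max()
          .unstack(fill_value=0)
)

# Entry (i, j) counts the articles shared by the users in rows i and j
similarity = toy_user_item.values @ toy_user_item.values.T

print(toy_user_item)
print(similarity)
```

The diagonal of `similarity` holds each user's own article count, which is why a user has to be removed from their own list of neighbors.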

**User by article matrix**

In [None]:
def create_user_item_matrix(df):
    """ Returns a user by article matrix, where there's a 1 if there was
        an interaction and a 0 otherwise.
    
        INPUT: df - pandas dataframe with article_id, title, user_id columns
        OUTPUT: user_item - (pandas dataframe) user by article matrix. """
    
    df['ones'] = 1
    user_item = df.groupby(['user_id', 'article_id'])['ones'].max().unstack()
    user_item = user_item.replace(np.nan, 0)
    
    return user_item # return the user_item matrix 

user_item = create_user_item_matrix(df)

In [None]:
user_item.shape # 5149, 714

**Articles that a user_id has interacted with**.

In [None]:
def get_user_articles(user_id, user_item= user_item):
    """ 
    Returns a list of the article_ids and article titles 
    that have been seen by a user.
    
    INPUT:
    user_id - (int) a user id
    user_item - (pandas dataframe) matrix of users by articles.
    
    OUTPUT:
    article_ids - (list) a list of the article ids seen by the user
    article_names - (list) a list of article names seen by the user
    """
    
    article_ids = user_item.columns[user_item.loc[user_id, :] == 1.0].tolist()
    article_names = get_article_names(article_ids, df=df)
    
    # convert ids to strings
    article_ids = [str(a) for a in article_ids]
    
    return article_ids, article_names # return the ids and names
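
The column lookup used in `get_user_articles` can be illustrated on a tiny hand-built matrix. The name `toy_user_item` and its values are made up for this sketch.

```python
import pandas as pd

# Toy user by article matrix: 2 users, 3 articles.
toy_user_item = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10.0, 20.0, 30.0], name="article_id"),
)

# Boolean mask over the columns: True where user 1 has an interaction
mask = toy_user_item.loc[1, :] == 1

# Selecting the column labels with the mask yields the article ids seen
seen_ids = toy_user_item.columns[mask.to_numpy()].tolist()

print(seen_ids)  # [10.0, 30.0]
```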

**Most similar users**. 

**A few considerations:**
* When several users are equally similar to a given user, we will not choose among them arbitrarily; we will prefer those with the highest number of interactions. For that we will build the **get_top_sorted_users** function, which returns a dataframe with neighbors ordered first by similarity and then by number of interactions.

* When a similar user provides more candidate articles than we still need, we will not pick among them arbitrarily; we will choose those with the highest number of total interactions first. This ranking can be obtained with the **get_top_article_ids** function we wrote before.
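
The first tie-breaking rule amounts to a two-key descending sort. A minimal sketch on invented neighbor data (the dataframe `toy_neighbors` is hypothetical):

```python
import pandas as pd

# Toy neighbors: two pairs of users tie on similarity.
toy_neighbors = pd.DataFrame({
    "neighbor_id": [2, 3, 4, 5],
    "similarity": [3, 5, 3, 5],
    "num_interactions": [40, 7, 12, 9],
}).set_index("neighbor_id")

# Sort by similarity first; ties are broken by num_interactions
ranked = toy_neighbors.sort_values(
    by=["similarity", "num_interactions"], ascending=False
)

print(ranked.index.tolist())  # [5, 3, 2, 4]
```

Neighbor 2 has the most interactions overall but still ranks below neighbors 5 and 3, because similarity is the primary key.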

In [None]:
def get_top_sorted_users(user_id, df=df, user_item=user_item):
    """ Returns a dataframe with all the neighbors of user_id, ordered first by
    similarity and second by num_interactions, both in descending order.

    INPUT:
    user_id - (int)
    df - (pandas dataframe) df as defined at the top of the notebook 
    user_item - (pandas dataframe) matrix of users by articles: 
                1's when a user has interacted with an article, 0 otherwise
    
    OUTPUT:
    neighbors_df - (pandas dataframe) a dataframe with:
                    neighbor_id - user_id of each neighbor.
                    similarity - measure of the similarity of each neighbor to 
                                 input user_id
                    num_interactions - number of interactions (rows in df) of each neighbor
    """

    user_idx = user_id - 1
    
    # neighbors
    neighbor_ids = np.delete(np.array(user_item.index), user_idx) # without user itself
    
    # number of interactions per neighbor
    n_interactions = [df[df.user_id == u].shape[0] for u in neighbor_ids] 
    
    # similarities
    user_item_np = np.array(user_item)
    
    similarity_vector = np.dot(user_item_np[user_idx, :], np.transpose(user_item_np))
    similarity_vector = np.delete(similarity_vector, user_idx) # without user itself
    
    # dataframe
    neighbors_df = pd.DataFrame({'neighbor_id': neighbor_ids,
                                 'similarity' : similarity_vector,
                                 'num_interactions' : n_interactions
                                }).set_index('neighbor_id')
    
    # ordered by similarity and num_interactions
    neighbors_df = neighbors_df.sort_values(by=['similarity', 'num_interactions'], 
                                            ascending=False)
    
    return neighbors_df # Return the dataframe specified in the doc_string

**Articles to recommend** based on User-User collaborative filtering.

In [None]:
def user_user_recs(user_id, m=10):
    """
    Returns m recommendations for user_id based on collaborative filtering.

    INPUT:
    user_id - (int) a user id
    m - (int) the number of recommendations you want for the user
    
    OUTPUT:
    recs - (list) a list of m recommendations for the user by article id
    rec_names - (list) a list of m recommendations for the user by article title
    
    Description:
    Loops through the most similar users to the input user_id. 
    For each similar user, finds articles the input user hasn't seen before and
    provides them as recs. This is repeated until m recommendations are found.
    
    Notes:
    - When several neighbors are equally similar to the input user, the one
    with the most total interactions is visited first.

    - When adding all of a neighbor's unseen articles would exceed m, only the
    number still needed is taken, preferring the articles with the most total
    interactions. Ties between articles with the same number of interactions
    are broken arbitrarily.
    """
    
    recs = np.array([])
    
    seen_articles, _ = get_user_articles(user_id)
    
    neighbors_df = get_top_sorted_users(user_id)
    
    for nb in neighbors_df.index:
        potential_articles, _ = get_user_articles(nb)

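        # skip articles the user has already seen or that are already recommended
        already_recommended = {str(float(r)) for r in recs}
        potential_recs = [a for a in potential_articles
                          if a not in seen_articles and a not in already_recommended]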
        
        total_so_far = len(recs) + len(potential_recs)
        if total_so_far <= m:
            
            # add all the articles from this neighbor
            recs = np.append(recs, potential_recs)
            
            if total_so_far == m: break
        
        else:
            
            # adding all of them would exceed m: keep only the number still needed,
            # preferring the articles with the most interactions, then stop
            n_final = m - len(recs)
            
            # first we need integers
            potential_recs = [int(float(pr)) for pr in potential_recs]
            
            # get all articles ordered by number of interactions
            total_n_articles = df.article_id.nunique()
            top_articles = get_top_article_ids(total_n_articles, df=df)
            
            # dict mapping ids in potential_recs with index in top_articles
            # lower number is higher in ranking
            ids_order = {}
            for pr in potential_recs:    
                ids_order[pr] = np.where(np.array(top_articles) == pr)[0][0]
            
            # ordered list
            recs_ordered = list(dict(sorted(ids_order.items(), 
                                            key=lambda item: item[1])).keys())
            
            # get the number needed
            new_recs = recs_ordered[:n_final]
            recs = np.append(recs, new_recs)
            
            break
        

        
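    # normalize ids to floats so they match df.article_id, then build the string ids
    rec_ids = [float(r) for r in recs]
    rec_names = get_article_names(rec_ids)
    recs_str = [str(r) for r in rec_ids]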
    
    return recs_str, rec_names
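
The trimming step inside `user_user_recs` (keeping only the most popular candidates when a neighbor offers more than needed) can be isolated as a small helper. The function `trim_by_popularity` is a hypothetical name for this sketch. The `ranked_ids` list reuses the first five ids printed by `get_top_article_ids(10)` earlier, and the candidates are chosen for illustration.

```python
def trim_by_popularity(candidates, ranked_ids, k):
    """Return the k candidates that appear earliest in ranked_ids."""
    # position in the popularity ranking; lower means more interactions
    rank = {article_id: pos for pos, article_id in enumerate(ranked_ids)}
    return sorted(candidates, key=lambda a: rank[a])[:k]

# Article ids ordered from most to least interactions
ranked_ids = [1429.0, 1330.0, 1431.0, 1427.0, 1364.0]

# Unseen articles offered by one neighbor, in arbitrary order
candidates = [1364.0, 1330.0, 1427.0]

print(trim_by_popularity(candidates, ranked_ids, 2))  # [1330.0, 1427.0]
```

Building the rank dictionary once makes each lookup constant time, instead of scanning the ranked list for every candidate.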

In [None]:
# Quick spot check - don't change this code - just use it to test your functions
rec_ids, rec_names = user_user_recs(20, 10)
print("The top 10 recommendations for user 20 are the following article ids:")
print(rec_ids)
print()
print("The top 10 recommendations for user 20 are the following article names:")
print(rec_names)

The top 10 recommendations for user 20 are the following article ids:
['1330.0', '1427.0', '1364.0', '1170.0', '1162.0', '1304.0', '1351.0', '1160.0', '1354.0', '1368.0']

The top 10 recommendations for user 20 are the following article names:
['use xgboost, scikit-learn & ibm watson machine learning apis', 'putting a human face on machine learning', 'gosales transactions for logistic regression model', 'model bike sharing data with spss', 'analyze accident reports on amazon emr spark', 'movie recommender system with spark machine learning', 'apache spark lab, part 1: basic concepts', 'insights from new york car accident reports', 'predicting churn with the spss random tree algorithm', 'analyze energy consumption in buildings']
