## Project 2 Content-Based and Collaborative Filtering

## Part 1 Collaborative Filtering

We'll use the Python "surprise" library.  This library offers several packages the support recommender systems, including nearest neighbor-based methods and matrix factoriaztion.  In this section, we'll take advantage of the nearest neighbor algorithms.

In [290]:
import pandas as pd
from pandas.io.json import json_normalize
import gensim
from surprise import Reader
from surprise import Dataset
from surprise.prediction_algorithms.knns import KNNWithZScore, KNNBaseline, KNNBasic
from surprise.prediction_algorithms.baseline_only import BaselineOnly
from surprise.model_selection import GridSearchCV
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
import warnings
warnings.filterwarnings('ignore')
import random

random.seed(42)

### Load data

For our data, we'll use a "The Movies Dataset" from kaggle.  This dataset includes a table of 26 million ratings from hundreds of thousands of users.  We subset this data offline because finding a free method of storing this amount of data would be difficult.  Our subset includes all ratings from the top 10,000 users and top 2,000 movies.  The rating scale ranges between .5 and 5 in intervals of .5.

In [291]:
path = 'https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Project2/movies_medium.csv'
movies = (pd.read_csv(path)
          .drop(['Unnamed: 0', 'timestamp'], axis = 1)
          .dropna()
#          .groupby(['userId', 'movieId'])
#          .mean()
#          .reset_index()
)

reader = Reader(rating_scale=(.5, 5))

rec_data = Dataset.load_from_df(movies, reader)
trainset = rec_data.build_full_trainset()

movies.head(5)

Unnamed: 0,userId,movieId,rating
0,18276.0,4545.0,2.0
1,18276.0,4545.0,2.0
2,18276.0,4205.0,5.0
3,9279.0,8644.0,2.5
4,9279.0,8464.0,4.0


In [292]:
### Data Exploration


In [293]:
avg_user = movies.groupby('userId')['rating'].mean()
layout = {'xaxis' : {'title': 'avg rating'}, 'title' : 'Distribution of Average User Ratings',
         'yaxis': {'title' : 'Users'}
}
data = [go.Histogram(x = avg_user.values)]
py.iplot(go.Figure(data = data, layout = layout))


Most users are somewhat generous with their ratings, with a mode of 4.  Some users are tougher than others, indicating that we might want to standardize ratings by user.

In [294]:
avg_movie = movies.groupby('movieId')['rating'].mean()
layout = {'xaxis' : {'title': 'avg rating'}, 'title' : 'Distribution of Average Movie Ratings',
         'yaxis': {'title' : 'Movies'}
}
data = [go.Histogram(x = avg_movie.values)]
py.iplot(go.Figure(data = data, layout = layout))

The movie ratings are a little less right-skewed.  There is a fair amount of weight between 1 and 3, indicating there are some unpopular movies in our data.  

### Build Models


The Surprise library frames recomendations a bit more like a regression problem than the recommenderlab package.  This is the reason for its sklearn-like api.  The dataset is the input and predicted ratings are the featured output attribute.  **This framework can still be used for recommendations by choosing items with the highest predicted rating.**

We'll try 4 model types, 3 of which use a similarity matrix and take the K most similar users/items, however the expected rating is computed with slightly different formulas.  The last model type will be the baseline model.

- KNNBasic: Raw ratings are weighted by similarities.  Item-based uses the given user's rating of the similar items, while user-based computes the expected rating of an item based on the similar user's rating of that item
- KNNWithZScore: Similar to basic, but starts with the mean of a user/item and adds Z-scores weighted by similarity
- KNNBaseline: Starts with the base line rating and adds (baseline rating - actual rating) of similar items/users weighte by similarity

In [295]:
param_grid = {'k' : [30,40,50], 
              'sim_options' : {'user_based': [False, True], 
              'name': ['cosine', 'pearson', 'pearson_baseline']}
              
 }

In [296]:
def cv_mod(algo, data):

    cv = GridSearchCV(algo, param_grid, cv = 3, n_jobs = -1)# set cv folds to 3 in the interst of time
    cv.fit(data)

    full_frame = pd.DataFrame.from_dict( cv.cv_results)
    results_df = pd.concat([full_frame[['mean_test_rmse', 'mean_test_mae', 'param_k']], 
               json_normalize(cv.cv_results['param_sim_options'])], axis = 1)
    return cv, results_df


In [297]:
def graph_mod(results_df, algo):

    fig = tools.make_subplots(rows=1, cols=2, subplot_titles = ('Item Based', 'User Based'))
    i = 1
    for filter_type in  results_df.user_based.unique():
        for item in results_df.name.unique():
            df_temp = results_df.loc[(results_df.user_based == filter_type) & (results_df.name == item)]
            trace = go.Scatter(
                x = df_temp.param_k,
                y = df_temp.mean_test_rmse,
                mode = 'lines+markers',
                name = item,
            )
            fig.append_trace(trace,1,i)
            fig['layout']['yaxis1'].update(title = 'RMSE')
            fig['layout']['xaxis' + str(i)].update(title = 'K')
        i+=1
    fig['layout'].update(title=algo)
    return fig

In [298]:
from surprise.model_selection import cross_validate
algo = BaselineOnly()

# Run 5-fold cross-validation and print results
cv1 = cross_validate(algo, rec_data, measures=['RMSE', 'MAE'], cv=5)
pd.DataFrame.from_dict( cv1)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


Unnamed: 0,test_rmse,test_mae,fit_time,test_time
0,0.749864,0.570235,0.118682,0.070841
1,0.756086,0.575985,0.098736,0.065821
2,0.758694,0.574949,0.10472,0.067817
3,0.753087,0.571256,0.127659,0.497668
4,0.754748,0.570635,0.204451,0.116688


In [299]:
basic_cv, basic_results = cv_mod(KNNBasic, rec_data)
basic_fig = graph_mod(basic_results, 'KNN Basic')
py.iplot(basic_fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [300]:
zscore_cv, zscore_results = cv_mod(KNNWithZScore, rec_data)
z_fig = graph_mod(zscore_results, 'KNN With Z-Score')
py.iplot(z_fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [301]:
baseline_cv, baseline_results = cv_mod(KNNBaseline, rec_data)
base_fig = graph_mod(baseline_results, 'KNN KNNBaseline')
py.iplot(base_fig, layout = layout)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



## Content Based Filtering

In [302]:
path = 'https://raw.githubusercontent.com/TheFedExpress/DATA612/master/Project2/movies_metadata.csv'
def make_int(df, column):
    df = df.copy()
    df[column] =  df[column].map(lambda x: x.replace('-', ''))
    return df

movies_metadata = (pd.read_csv(path)
                       .loc[:, ['id', 'budget', 'overview', 'runtime']]
                       .pipe(make_int, 'id')
                       .drop_duplicates('id')
)

movies_metadata['id'] = movies_metadata['id'].astype(np.int32)

In [303]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import re
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models
import numpy as np

def lemmatize_stemming(text):
    stemmer = SnowballStemmer('english')
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(desc):
    words = []
    try:
        for item in simple_preprocess(desc, min_len = 3):
             if item not in STOPWORDS or item in ['he', 'she', 'her', 'his']:
                words.append(lemmatize_stemming(item))
        return words
    except(TypeError):
        return np.nan

movies_metadata['movie_words'] = movies_metadata['overview'].map(preprocess)
movie_words = movies_metadata[['id', 'movie_words']].dropna()

In [304]:
dictionary = corpora.Dictionary(movie_words['movie_words'])
dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=20000)
dictionary.compactify()
corpus = [dictionary.doc2bow(item) for item in movie_words['movie_words']]

In [305]:
lda = models.LdaModel(corpus = corpus, num_topics = 50, id2word = dictionary, passes = 10, alpha = .0005)

In [306]:
output = lda.print_topics(num_topics = 10)
topics = [lda.get_document_topics(element) for element in corpus]
output

[(29,
  '0.029*"greatest" + 0.022*"youth" + 0.017*"controversi" + 0.016*"roll" + 0.015*"decad" + 0.015*"biggest" + 0.013*"convers" + 0.012*"traffic" + 0.011*"roger" + 0.011*"cours"'),
 (1,
  '0.027*"comedian" + 0.025*"moscow" + 0.022*"wall" + 0.021*"seem" + 0.020*"air" + 0.019*"transport" + 0.019*"cold" + 0.018*"sinist" + 0.016*"supernatur" + 0.016*"christian"'),
 (23,
  '0.025*"bank" + 0.023*"insid" + 0.020*"hold" + 0.020*"trap" + 0.018*"bar" + 0.018*"fortun" + 0.015*"gold" + 0.014*"henri" + 0.013*"rob" + 0.012*"key"'),
 (22,
  '0.045*"night" + 0.034*"compani" + 0.027*"head" + 0.026*"blood" + 0.024*"club" + 0.017*"eat" + 0.017*"food" + 0.015*"busi" + 0.014*"job" + 0.014*"shop"'),
 (4,
  '0.023*"realiti" + 0.022*"achiev" + 0.021*"engin" + 0.020*"scienc" + 0.019*"generat" + 0.019*"technolog" + 0.018*"tortur" + 0.017*"influenc" + 0.017*"fiction" + 0.016*"resort"'),
 (5,
  '0.062*"island" + 0.037*"japanes" + 0.033*"amp" + 0.028*"light" + 0.025*"captain" + 0.022*"sam" + 0.017*"landscap" + 

In [307]:
all_clus = []
for i in range(len(topics)):
    for j in range(len(topics[i])):
        if topics[i][j][1] >= .1:
            all_clus.append({"id": movie_words.iloc[i, 0], "top_num":topics[i][j][0], "percentage":topics[i][j][1]})

In [308]:
new_df = (pd.DataFrame(all_clus)
              .pivot(index = 'id', columns = 'top_num', values = 'percentage')
)
full_df = movies_metadata.merge(new_df, on = 'id').rename(columns = {'id': 'movieId'})

In [309]:
#movies.groupby('userId').size().sort_values(ascending = False).head(5)
movies.groupby('movieId').size().sort_values(ascending = False).head(5)

movieId
4993.0    2302
7153.0    2167
2858.0    1901
3481.0    1504
3793.0    1404
dtype: int64

In [310]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

final_df = movies.merge(full_df, on = 'movieId')

def impute_median(df, column):
    df = df.copy()
    df[column] = df.loc[df[column] == 0, column].median()
    return df

X = (final_df.pipe(impute_median, 'budget')
         .pipe(impute_median, 'runtime')
         .fillna(0)
         .drop(columns = ['rating', 'userId', 'movieId', 'overview', 'movie_words'])
         .loc[final_df['userId'] == 741]
)
y = final_df.loc[final_df['userId'] == 741, 'rating']

xgb_param_grid = {
    'learning_rate': [.01, .05, .1],
    'max_depth': [3,5,7,9],
    'n_estimators': [750, 1000]
    
}
gb = xgb.XGBRegressor(objective = 'reg:linear', subsample = .6, colsample_bytree = .6, nthread = -1)
regressor = GridSearchCV(gb, xgb_param_grid, cv = 3, scoring = 'neg_mean_squared_error')
regressor.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.6, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=-1, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.6),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.01, 0.05, 0.1], 'max_depth': [3, 5, 7, 9], 'n_estimators': [750, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [311]:
full_frame = pd.DataFrame.from_dict( regressor.cv_results_)
pd.concat([full_frame[['mean_test_score']], 
               json_normalize(regressor.cv_results_['params'])], axis = 1)

Unnamed: 0,mean_test_score,learning_rate,max_depth,n_estimators
0,-0.628168,0.01,3,750
1,-0.640225,0.01,3,1000
2,-0.61146,0.01,5,750
3,-0.614949,0.01,5,1000
4,-0.599182,0.01,7,750
5,-0.601277,0.01,7,1000
6,-0.596463,0.01,9,750
7,-0.598364,0.01,9,1000
8,-0.665432,0.05,3,750
9,-0.665432,0.05,3,1000
