<span style="color:blue"><font size = "4">RECOMMENDER SYSTEM BY MATRIX FACTORIZATION USING SINGULAR VALUE DECOMPOSITION (SVD)</font></span>

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
from scipy.sparse.linalg import svds

In [2]:
# Load data
recomm = pd.read_csv('working_data.csv')
# Drop any leftover 'Unnamed: 0' index column picked up by read_csv
recomm = recomm.loc[:, ~recomm.columns.str.contains('Unnamed: 0', regex=False)]

In [3]:
# Make a checkpoint
df = recomm.copy()
# df.head()

# The dataset consists of 19 features.

- <span style="color:blue">INTEREST</span>: interest chosen at time of post
- <span style="color:blue">LIKE</span>: if the user liked a post. 1 if liked and 0 if not
- <span style="color:blue">POSTId</span>: post identification
- <span style="color:blue">SENTIMENT</span>: user's post comments
- <span style="color:blue">SHARE</span>: if the user shared a post. 1 if yes and 0 if not
- <span style="color:blue">TIMESTAMP</span>: time of the comment
- <span style="color:blue">USERId</span>: the user identification number
- <span style="color:blue">USER_CATEGORY</span>: the group to which the user belonged at signup
- <span style="color:blue">COUNTRY_CODE</span>: country code
- <span style="color:blue">COUNTRY_NAME</span>: country at the time of signup
- <span style="color:blue">CONTINENT</span>: continent
- <span style="color:blue">REGION</span>: region of the continent
- <span style="color:blue">COUNTRY_GNP</span>: the country's gross national product, i.e. the total value of goods produced and services provided by the country during one year
- <span style="color:blue">CITY_NAME</span>: city at the time of signup
- <span style="color:blue">CITY_DISTRICT</span>: district/county/province/state
- <span style="color:blue">CITY_POPULATION</span>: city population
- <span style="color:blue">COUNTRY_LANGUAGE</span>: national language(s)
- <span style="color:blue">PERCENT_SPOKEN</span>: percentage of people speaking the language(s)
- <span style="color:blue">LOC_RANK</span>:

Future features:
- <span style="color:blue">VIEW</span>: if a user viewed the post. 1 if viewed and 0 if not

In [4]:
# create post dataframe
post_df = df[['POSTId', 'SENTIMENT', 'INTEREST']]
post_df.head()

Unnamed: 0,POSTId,SENTIMENT,INTEREST
0,296,canon 30d and 40d are way sexier might upgrade...,Distribution
1,306,i though about upgrating to a different camera...,Transformation
2,307,i didnt see the need to go up to the xti nikon...,Distribution
3,665,other canon cameras i own include the canon eo...,Distribution
4,899,to 40d right before 50d was announced thanks a...,Transformation


In [5]:
df['INTEREST'].value_counts()

Transformation       5276
Distribution         5251
Technology input     5128
Animal Production    5023
Consumption          5007
Crop Production      4964
Name: INTEREST, dtype: int64

In [6]:
# Take an explicit copy so the edits below act on an independent frame
# (this avoids pandas' SettingWithCopyWarning)
group_df = df[['USERId','POSTId','INTEREST','TIMESTAMP','LIKE']].copy()
# Encode each interest group as an integer class
replace_map = {'INTEREST':{'Transformation': 1, 'Distribution': 2, 'Technology input': 3, 'Animal Production': 4,
                                  'Consumption': 5, 'Crop Production': 6}}
group_df = group_df.replace(replace_map)
group_df = group_df.rename(columns={'INTEREST': 'CLASS'})
group_df.head()


Unnamed: 0,USERId,POSTId,CLASS,TIMESTAMP,LIKE
0,1,296,2,1147880044,1.0
1,1,306,1,1147868817,1.0
2,1,307,2,1147868828,1.0
3,1,665,2,1147878820,1.0
4,1,899,1,1147868510,1.0


In [7]:
# Format group matrix
# One row per user and one column per post
group_matrix = group_df.pivot(index = 'USERId', columns ='POSTId', values = 'CLASS').fillna(0)
group_matrix.head()

USERId \ POSTId,1,2,3,4,5,6,7,9,10,11,...,200818,200838,201340,201646,202439,203375,203519,204542,204698,205106
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,3.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Normalize by each user's mean
# Convert dataframe to numpy array

G = group_matrix.values
user_class_mean = np.mean(G, axis=1)
G_demeaned = G - user_class_mean.reshape(-1, 1)

At a high level, Singular Value Decomposition (SVD) factors a matrix R
into three matrices; keeping only the k largest singular values then gives
the best rank-k approximation of R in the least-squares sense.
Mathematically, it decomposes R into two unitary matrices and a diagonal matrix:
    <span style="color:blue">R = UΣV^T</span> 

- R is the user_class matrix
- U is the user 'features' matrix
- Σ is the diagonal matrix of singular values (essentially weights)
- V^T is the post (sentiment) matrix

U represents how much users “like” each feature, and V^T
represents how relevant each feature is to each post.
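
The decomposition can be tried on a small example before running it on the full matrix. The sketch below is only an illustration: the 4 x 5 matrix `R` and its values are made up, and `k = 2` is an arbitrary choice. It shows the shapes returned by `svds` and how the three factors multiply back into a rank-k approximation.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy matrix: 4 users x 5 posts (values are made up).
R = np.array([
    [5.0, 4.0, 0.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0, 0.0],
    [1.0, 0.0, 4.0, 5.0, 0.0],
])

k = 2  # number of singular values kept; svds needs k < min(R.shape)
U, sigma, Vt = svds(R, k=k)

# svds returns sigma as a 1-D array, so build the diagonal matrix explicitly.
# Shapes: U is (4, k), diag(sigma) is (k, k), Vt is (k, 5).
R_k = U @ np.diag(sigma) @ Vt

# Frobenius-norm error of the rank-k approximation.
error = np.linalg.norm(R - R_k)
print(U.shape, sigma.shape, Vt.shape)
print(error)
```

The error equals the root of the sum of the squared singular values that were dropped, which is why a larger k always fits the input matrix more closely.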

In [9]:
# Singular Value Decomposition
# k = number of latent factors (singular values) to keep;
# the rank-k factors approximate the original group matrix.
# Note that svds returns the singular values as a 1-D array
# rather than a diagonal matrix, hence the np.diag call below.

U, sigma, Vt = svds(G_demeaned, k=50)
sigma = np.diag(sigma)

In [10]:
# Make predictions from the decomposed matrix:
# multiply U, Σ, and V^T to get the rank-k
# approximation of R, then add each user's mean
# back to get the predicted class.

prediction = np.dot(np.dot(U, sigma), Vt) + user_class_mean.reshape(-1, 1)
pred_df = pd.DataFrame(prediction, columns = group_matrix.columns)

In the future, with real data, split the data into training and validation sets
and choose k by minimizing the RMSE on the validation set.
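
One possible way to carry out that tuning is sketched below. It is not part of the notebook's pipeline: the matrix `G_full` is random stand-in data, the 20% hold-out fraction and the candidate values of k are arbitrary, and the helper `predict` is a hypothetical name. Held-out entries are set to 0 in the training matrix, following the notebook's convention that 0 means "unobserved".

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical stand-in for group_matrix.values: 60 users x 40 posts,
# classes 1..6 for observed entries and 0 for unobserved ones.
G_full = rng.integers(1, 7, size=(60, 40)).astype(float)
G_full[rng.random(G_full.shape) < 0.5] = 0.0

# Hold out 20% of the observed entries for validation.
obs_rows, obs_cols = np.nonzero(G_full)
n_val = int(0.2 * len(obs_rows))
val_idx = rng.choice(len(obs_rows), size=n_val, replace=False)
val_r, val_c = obs_rows[val_idx], obs_cols[val_idx]

G_train = G_full.copy()
G_train[val_r, val_c] = 0.0  # hide the validation entries from training


def predict(G, k):
    """Rank-k SVD prediction with each user's mean removed, then added back."""
    user_mean = G.mean(axis=1, keepdims=True)
    U, sigma, Vt = svds(G - user_mean, k=k)
    return U @ np.diag(sigma) @ Vt + user_mean


# RMSE on the held-out entries for each candidate k (all below min(G.shape)).
rmse_by_k = {}
for k in (2, 5, 10, 20):
    pred = predict(G_train, k)
    diff = pred[val_r, val_c] - G_full[val_r, val_c]
    rmse_by_k[k] = float(np.sqrt(np.mean(diff ** 2)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

On random data like this the RMSE values carry no meaning; the point is only the mechanics of hiding entries, refitting for each k, and comparing errors on the hidden entries.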

In [11]:
# Making Post Recommendations

def recommend_posts(predictions_df, userID, post_df, original_class_df, num_recommendations=5):
    
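    # Get and sort the user's predictions.
    # This positional lookup assumes USERId values are consecutive integers
    # starting at 1, so that row (userID - 1) belongs to this user.
    user_row_number = userID - 1
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)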
    
    # Get the user's data and merge with the post information.
    user_data = original_class_df[original_class_df.USERId == (userID)]
    user_full = (user_data.merge(post_df, how = 'left', left_on = 'POSTId', right_on = 'POSTId').
                     sort_values(['CLASS'], ascending=False)
                 )

    print('User {0} has already posted {1} posts.'.format(userID, user_full.shape[0]))
    print('Recommending the highest {0} predicted posts.'.format(num_recommendations))
    
    # Recommend the highest predicted posts that the user hasn't seen yet.
    recommendations = (post_df[~post_df['POSTId'].isin(user_full['POSTId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'POSTId',
               right_on = 'POSTId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

already_posted, predictions = recommend_posts(pred_df, 34, post_df, group_df, 10)

User 34 has already posted 712 posts.
Recommending the highest 10 predicted posts.
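
The lookup `userID - 1` inside `recommend_posts` only finds the right row when the USERId values run 1, 2, 3, ... without gaps. A label-based lookup is one way around that. The sketch below uses made-up data (`toy_matrix`, `toy_prediction` and `toy_pred_df` are hypothetical names) in which user 2 is missing, to show how the two lookups diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical group matrix whose USERId values have a gap (no user 2).
toy_matrix = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]],
    index=pd.Index([1, 3, 4], name='USERId'),
    columns=pd.Index([10, 11], name='POSTId'),
)

# Stand-in for the reconstructed prediction array.
toy_prediction = toy_matrix.to_numpy() + 0.5

# Carry the USERId index over so rows can be looked up by label.
toy_pred_df = pd.DataFrame(toy_prediction,
                           index=toy_matrix.index,
                           columns=toy_matrix.columns)

user_id = 3
by_label = toy_pred_df.loc[user_id]          # the row labelled USERId 3
by_position = toy_pred_df.iloc[user_id - 1]  # third row, which is USERId 4 here
print(by_label.name, by_position.name)
```

If the prediction frame were indexed this way in the notebook, the `rename` step in `recommend_posts` would also need to use the user's label instead of the row number, since the prediction column takes its name from the selected row.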


In [12]:
print('Interest groups of user: ', already_posted['INTEREST'].value_counts())
already_posted.head()

Interest groups of user:  Distribution         126
Consumption          126
Crop Production      124
Animal Production    118
Transformation       115
Technology input     103
Name: INTEREST, dtype: int64


Unnamed: 0,USERId,POSTId,CLASS,TIMESTAMP,LIKE,SENTIMENT,INTEREST
326,34,1207,6,1317762456,1.0,My daugjhter loves hers!\r\n,Consumption
591,34,2959,6,1317762300,1.0,Absolutely loves her @AmazonKindle &lt;3 Best ...,Crop Production
578,34,2959,6,1317762300,1.0,@gruvtopia maybe it would be best if you have ...,Distribution
579,34,2959,6,1317762300,1.0,after buying the 4 gb sandisk cf card i spent ...,Animal Production
580,34,2959,6,1317762300,1.0,is too firm to easily rotate\r\n,Crop Production


In [13]:
# already_posted[already_posted['INTEREST']=='Crop Production']['LIKE'].value_counts()

In [14]:
predictions

Unnamed: 0,POSTId,SENTIMENT,INTEREST
15969,1196,@charltonbrooker mate Bill Gates doesn't look ...,Technology input
25840,1196,My @AmazonKindle's battery just died. I've had...,Consumption
29362,1196,#FF U ARE AWESOME!! Love You All! @classic_twe...,Crop Production
18130,1196,Good Friday from @ladygaga who I found out thi...,Technology input
11609,1196,RT @AllstateNews: Loved seeing the 80 year fla...,Technology input
5425,1196,AND Then There IS www.amazon.com/Circumstantia...,Distribution
25417,1196,it is only for review\r\n,Distribution
276,1196,#fridayreads #nowreading &quot;Kiss of the Hig...,Transformation
5149,1196,Some day I may have one then I would be more t...,Technology input
1550,1196,br the canon came out on top in all catergorie...,Consumption
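
All ten rows above carry POSTId 1196, which suggests that `post_df` holds one row per comment instead of one row per post, so a single highly ranked post fills the whole list. A possible remedy, sketched here on made-up data (`toy_posts` and `toy_scores` are hypothetical), is to keep one row per POSTId before ranking.

```python
import pandas as pd

# Hypothetical miniature of post_df: several comment rows share a POSTId.
toy_posts = pd.DataFrame({
    'POSTId': [1196, 1196, 1196, 2959, 2959, 3001],
    'SENTIMENT': ['a', 'b', 'c', 'd', 'e', 'f'],
    'INTEREST': ['Consumption', 'Distribution', 'Consumption',
                 'Crop Production', 'Distribution', 'Transformation'],
})

# Hypothetical predicted scores for one user, indexed by POSTId.
toy_scores = pd.Series({1196: 0.9, 2959: 0.7, 3001: 0.4}, name='Predictions')
toy_scores.index.name = 'POSTId'

# Keep one row per post before ranking, so each POSTId appears only once.
unique_posts = toy_posts.drop_duplicates(subset='POSTId')

top = (unique_posts
       .merge(toy_scores.reset_index(), how='left', on='POSTId')
       .sort_values('Predictions', ascending=False)
       .head(2))
print(top[['POSTId', 'Predictions']])
```

`drop_duplicates` keeps the first comment it meets for each post, so the SENTIMENT and INTEREST shown for a recommended post are an arbitrary representative of that post's comments.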
