In [3]:
import pandas as pd
import pylab
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
#import seaborn as sns
import numpy as np
%matplotlib inline

# Content Based Recommender - Lectures

Content-based recommenders work best when there are many new items that are difficult to rate independently, so we rely on similarities between items instead. The basic idea is to find similarities between items and then base our recommendations on the identified characteristics. The approach builds a profile of the attributes a user finds interesting.


Just to clarify the definitions we are going to use:

### Content

* a set of descriptions (attributes) describing the products
* useful in recommendations

### TFIDF

* forward reference using vector references
* define relevant terms
* IDF (inverse document frequency): \# docs / \# docs with term
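
The ratio in the last bullet can be sketched in a few lines of plain Python. The corpus below is a hypothetical toy example (the document names and keywords are made up for illustration); the code prints both the raw ratio and its logarithm, which is a common way to dampen it.

```python
import math

# Hypothetical toy corpus: each document is a set of keywords.
docs = {
    "d1": {"soccer", "europe"},
    "d2": {"politics", "europe"},
    "d3": {"politics", "war"},
    "d4": {"politics", "europe", "security"},
}

def inverse_document_frequency(docs):
    """Return {term: (N / DF, log(N / DF))} for every term in the corpus."""
    n_docs = len(docs)
    terms = set().union(*docs.values())
    idf = {}
    for term in sorted(terms):
        # DF: number of documents that contain the term.
        df = sum(1 for words in docs.values() if term in words)
        ratio = n_docs / df
        idf[term] = (ratio, math.log(ratio))
    return idf

idf = inverse_document_frequency(docs)
for term, (ratio, log_ratio) in idf.items():
    print("{0:<9} N/DF = {1:.2f}  log(N/DF) = {2:.2f}".format(term, ratio, log_ratio))
```

Rare terms such as `soccer` get a larger value than frequent ones such as `politics`, which is what makes them more informative about a document.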


## Content based filtering profile


* build a vector of attributes (keywords)
* each attribute is a [dimension we are going to move along](https://en.wikipedia.org/wiki/Vector_space_model)
  * more keywords, more dimensions
* in this space we will define both the users and the items
* we can find users' preferences by finding objects close to them in this space
* we can ask users to create the profile directly
  * difficult for users, as we often don't know our own preferences
* we won't use other people's preferences to build your own (even if familiar)
* in this model we understand that items liked are important. We can't separate those two.
* system will be based on the search engine (index)
* entree style recommending - as there was always an option, users tended to explore a lot
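
To make "close in this space" concrete, here is a minimal sketch that ranks items by cosine similarity to a user profile. The keywords, item vectors and user vector are hypothetical, and cosine similarity is only one possible closeness measure; the notes above do not prescribe a specific one.

```python
import numpy as np

# Hypothetical keyword dimensions and item vectors (1 = keyword present).
keywords = ["soccer", "politics", "europe"]
items = {
    "item_a": np.array([1.0, 0.0, 1.0]),
    "item_b": np.array([0.0, 1.0, 1.0]),
    "item_c": np.array([1.0, 0.0, 0.0]),
}
# Hypothetical user profile in the same space (larger = stronger interest).
user = np.array([0.0, 2.0, 1.0])

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm else 0.0

# Items whose vectors point in a direction similar to the user's come first.
ranked = sorted(items, key=lambda name: cosine(user, items[name]), reverse=True)
print(ranked)
```

Because users and items live in the same keyword space, no other user's ratings are needed to produce the ranking.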


## Case based recommendations (K-space reasoning)

* have a database built around a set of relevant attributes
* query by example and retrieve relevant cases
* based on classical problem solving approach (DARPA)
  * detecting problems as cases
  * finding similarities between cases
  * leading to case-based maintenance
  * this then led to recommender systems
  * important part is feedback (critique)
  * review as case generation
    * detect sentiment as well
    * describe products as cases
    * solve problems using experience from past problems
    * adaptation is important here
* how to get latest information on the market
  * [conference](http://www.iccbr.org/iccbr14/)
  * [recsys](http://recsys.acm.org/)

# Assignment 2: Personalized Recommenders


## Read File

In [1]:
from __future__ import division #so I can have float as std and int as //

import pandas as pd
import pylab
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
import copy

#import pdb #debugger
%matplotlib inline



In [196]:
dataFile = "Assignment2.csv"
print("reading %s" % dataFile)
# Tab-separated file; "NULL" and empty strings are read as NaN.
# error_bad_lines is dropped: it was removed in pandas 2.0 and raising on bad lines is the default.
dfData = pd.read_csv(dataFile, delimiter="\t", encoding='utf-8-sig', na_values=["NULL", ""], header=0)

dataFile = "Assignment2Users.csv"
print("reading %s" % dataFile)
dfUsers = pd.read_csv(dataFile, delimiter="\t", encoding='utf-8-sig', na_values=["NULL", ""], header=0)


reading Assignment2.csv
reading Assignment2Users.csv


# Part 1
## Building simple user profile

Build a very simple profile of user preferences for attributes. In this profile, you’ll count the total number of positive and negative evaluations associated with each attribute, and create a profile with the total score for each attribute for each user.

In [305]:
user1= dfUsers.iloc[:,0]
user2= dfUsers.iloc[:,1]

In [76]:
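def userProfile(preferecesData,userVotes):
  """For each column of preferecesData, sum its values weighted by userVotes."""
  # Multiply every column element-wise by the votes (matched by position via .values).
  cols = [pd.DataFrame(preferecesData[col].values * userVotes.values, columns=[col]) for col in preferecesData]
  # Put the per-column frames side by side, then sum down the rows.
  votes = cols[0].join(cols[1:])
  votes = votes.sum()

  return votes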

In [198]:
user1Profile = userProfile(dfData,user1)
user2Profile = userProfile(dfData,user2)

In [311]:
dfData
user2Profile

baseball    -1.024564
economics    1.000000
politics     1.052786
Europe       1.500000
Asia        -0.447214
soccer      -1.024564
war         -0.077350
security     1.500000
shopping     0.000000
family      -0.447214
dtype: float64

## Applying user profiles

* Which document does the simple profile predict user 1 will like best?
* What score does that prediction get?
* How many documents does the model predict U2 will dislike (prediction score that is negative)?

In [251]:
user1Preferences = userProfile(dfData.transpose(),user1Profile)
user2Preferences = userProfile(dfData.transpose(),user2Profile)

#user1Preferences.sort_values(inplace=True, ascending=False)

print("User 1 fav doc is {0} with score of {1:.2f}.".format(user1Preferences.idxmax(), user1Preferences.max()))
print("User 2 will dislike {0} docs.".format((user2Preferences < 0).sum()))

user1Preferences.sort_values(inplace=True,ascending=False)
user1Preferences.head(3)

User 1 fav doc is doc16 with score of 3.33.
User 2 will dislike 4 docs.


doc16    3.333585
doc12    2.309021
doc1     2.256235
dtype: float64
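
Both steps of Part 1 amount to matrix-vector products, which a toy example makes easier to see. The attribute matrix and votes below are hypothetical, and `DataFrame.dot` is used here as a shorter alternative to the notebook's `userProfile` helper, not as a replacement for it.

```python
import pandas as pd

# Hypothetical 0/1 attribute matrix: rows are documents, columns are attributes.
docs = pd.DataFrame(
    {"baseball": [1, 0, 1], "politics": [0, 1, 1], "Europe": [0, 1, 0]},
    index=["docA", "docB", "docC"],
)
# Hypothetical votes: +1 like, -1 dislike, 0 not rated.
votes = pd.Series([1, -1, 0], index=docs.index)

# Step 1, profile: for each attribute, sum the votes of the documents that contain it.
profile = docs.mul(votes, axis=0).sum()
# The same profile written as a matrix-vector product.
profile_as_product = docs.T.dot(votes)

# Step 2, prediction: for each document, sum the profile scores of its attributes.
scores = docs.dot(profile)

print(profile)
print(scores)
```

Here `docA` gets the highest score because its only attribute, `baseball`, is the only one with a positive profile value.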

# Part 2
## Building weighted user profile

We want to explore whether our simple model may be counting these attribute-heavy documents too much. To try this out, make a copy of the attributes matrix on another sheet. Then we’re going to have you normalize each row to be a unit length vector. 

In [289]:
docWeight = np.sqrt(dfData.sum(axis=1)) # length of each row; equals the L2 norm when attributes are 0/1
dfDataWeigthed= (dfData.transpose()/docWeight).transpose() # each row scaled to unit length
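
A quick way to check the normalization is to confirm that every row ends up with length 1. The sketch below uses a hypothetical 0/1 matrix and the general L2 norm from NumPy; for 0/1 data it gives the same divisor as the square root of the row sum.

```python
import numpy as np

# Hypothetical 0/1 attribute matrix: rows are documents, columns are attributes.
X = np.array([
    [1, 0, 1, 1, 1],   # attribute-heavy document
    [0, 1, 0, 0, 0],   # single-attribute document
    [1, 1, 0, 0, 0],
], dtype=float)

# L2 norm of each row; keepdims=True lets the division broadcast row by row.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# For 0/1 entries x**2 == x, so the norm equals sqrt(number of attributes).
same_divisor = np.allclose(row_norms.ravel(), np.sqrt(X.sum(axis=1)))
print(same_divisor)
print(np.linalg.norm(X_unit, axis=1))
```

After scaling, a document with four attributes contributes 0.5 per attribute, while a single-attribute document contributes 1.0, so attribute-heavy documents no longer dominate the profile.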

## Applying user profiles

* Which document is now in second with this new model for user 1?
* What prediction score does it have?

In [290]:
user1Profile = userProfile(dfDataWeigthed,user1)
user2Profile = userProfile(dfDataWeigthed,user2)

user1Preferences = userProfile(dfDataWeigthed.transpose(),user1Profile)
user2Preferences = userProfile(dfDataWeigthed.transpose(),user2Profile)

In [291]:
user1Preferences.sort_values(inplace=True,ascending=False)
print(user1Preferences.head(3))

print("User 1 fav doc is {0} with score of {1:.2f}.".format(user1Preferences.idxmax(), user1Preferences.max()))

doc16    1.924646
doc6     1.370923
doc12    1.333114
dtype: float64
User 1 fav doc is doc16 with score of 1.92.


# Part 3
##  how common different terms are among our documents …

We’re going to include an IDF (inverse document frequency) term into our equation. Start with your spreadsheet from part 2. Add a row that shows 1/DF where DF is the number of documents in which each content attribute occurs. For example, baseball occurs in 4 documents, so baseball’s entry will be 0.25. Politics occurs in 10 documents, so it will get an IDF score of 0.1 (1 / 10).

Note that this is a far more dramatic computation than is usually used with large datasets (more common is a log-scaled form such as log(N / DF), where N is the total number of documents), but we need a dramatic value to see differences with a small dataset.

In [253]:
IDF = 1/dfData.sum(axis=0) # 1/DF: inverse of the number of documents containing each term
IDF

baseball     0.250000
economics    0.166667
politics     0.100000
Europe       0.090909
Asia         0.166667
soccer       0.166667
war          0.142857
security     0.166667
shopping     0.142857
family       0.200000
dtype: float64

In [286]:
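def userProfileWithIDF(preferecesData,userVotes,IDF):
  """Same as userProfile, but each attribute total is scaled by its IDF weight."""
  cols = [pd.DataFrame(preferecesData[col].values * userVotes.values, columns=[col]) for col in preferecesData]
  votes = cols[0].join(cols[1:])
  votes = votes.sum()
  # Both Series are indexed by attribute name, so the product aligns by label.
  votes = votes*IDF

  return votes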

In [288]:
user1Profile = userProfileWithIDF(dfDataWeigthed,user1,IDF)
user2Profile = userProfileWithIDF(dfDataWeigthed,user2,IDF) # user 2 needs the IDF weighting too

user1PreferencesType3 = userProfile(dfDataWeigthed.transpose(),user1Profile)
user2PreferencesType3 = userProfile(dfDataWeigthed.transpose(),user2Profile) # score with user 2's own profile

## Results

* Compare doc1 and doc9 for user1. What’s user1’s prediction for doc9 in the new IDF weighted model? See how there’s a dramatic difference from the prior model?
* Now let’s look at user 2. Look at doc6. It was moderately positive before and now is slightly negative. Why did that change?

In [323]:
d={'Approach2': user1Preferences, 'Approach3':user1PreferencesType3}
compareResults = pd.DataFrame(data=d)
compareResults.sort_index(inplace=True, ascending=False)
compareResults.head(4)

| | Approach2 | Approach3 |
|---|---|---|
| doc9 | 1.132724 | 0.179067 |
| doc8 | -0.370053 | -0.047530 |
| doc7 | -0.353553 | -0.058926 |
| doc6 | 1.370923 | 0.319432 |


Now look at doc6 for user 2. It was moderately positive before and is now slightly negative. It has two attributes: a common one the user really likes (Europe) and a rare one the user dislikes (baseball). In the prior models, the fact that the user liked Europe more than they disliked baseball was decisive, but this model recognizes that baseball is rarer than Europe and therefore should carry more weight (after all, there are plenty of other articles about Europe).
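
The sign flip can be reproduced by hand from numbers already printed in this notebook. This is a sketch under two assumptions: that the user 2 profile shown in Part 1 is the unit-normalized, pre-IDF profile (the execution counts suggest that cell was rerun after Part 2), and that doc6 has exactly the two attributes named above.

```python
import math

# Profile values copied from the user2Profile output shown earlier
# (assumed to be the unit-normalized profile without IDF).
profile = {"Europe": 1.5, "baseball": -1.024564}
# 1/DF: the text gives baseball 4 documents; the IDF output 0.090909 implies 11 for Europe.
idf = {"Europe": 1.0 / 11, "baseball": 1.0 / 4}

# doc6 is assumed to have exactly these two attributes, so after
# unit-length normalization each attribute has weight 1/sqrt(2).
weight = 1.0 / math.sqrt(2)

score_without_idf = sum(weight * profile[a] for a in profile)
score_with_idf = sum(weight * profile[a] * idf[a] for a in profile)

print("without IDF: {0:+.3f}".format(score_without_idf))
print("with IDF:    {0:+.3f}".format(score_with_idf))
```

Without IDF the larger Europe score wins; with IDF the Europe term is divided by 11 while the baseball term is only divided by 4, so the negative term dominates.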