# Recommendation Systems in Machine Learning

## Week 1 - Intro to Recommendation Systems

This week, we'll take a brief look at the libraries and frameworks we will use in later weeks of this course. We will rely heavily on NumPy, PyTorch, and pandas, so please try to get as familiar with them as you can, even outside of the weekly assignments. This week's assignment does not use PyTorch; we will use it more in the weeks on Recommender Systems using Deep Learning.

### Intro to NumPy

NumPy is a very widely used library for working with large, multi-dimensional arrays. Its core array operations are implemented in optimized, compiled C code, which makes large vector operations fast (you learn a little about these kinds of optimizations in COMPSCI 61C).

NumPy arrays are used in place of Python lists for two reasons. First, Python lists tend to be slower for operations like elementwise addition and multiplication. Second, a Python list can contain multiple types of objects. Normally this flexibility is a good thing, but for fast array operations, type checking every single element of a list can be very time-intensive. Thus, NumPy is the go-to library for machine learning data analysis and transformation.
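A rough way to see the speed difference is to time the same elementwise addition on a Python list and on a NumPy array. The sketch below is illustrative only: the size `n` and the variable names are assumptions made for this example, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # illustrative size, chosen arbitrarily
py_list = list(range(n))
np_arr = np.arange(n)

# Elementwise addition with a list comprehension (a pure Python loop)
list_time = timeit.timeit(lambda: [v + 1 for v in py_list], number=20)

# The same operation written as a single vectorized NumPy expression
numpy_time = timeit.timeit(lambda: np_arr + 1, number=20)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy array:        {numpy_time:.4f} s")

# Both approaches compute the same values
same = [v + 1 for v in py_list] == (np_arr + 1).tolist()
print(same)
```

On most machines the NumPy version should be noticeably faster, but treat the numbers as a rough indication rather than a benchmark.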

Convention is to use "np" as an alias for numpy during an import. NumPy isn't part of the Python standard library, so it must be installed before we can import it.

In [2]:
import numpy as np

NumPy arrays are the workhorse of the library. A NumPy array is essentially a bunch of data coupled with some metadata:

type: the type of objects in the array. This will typically be floating-point numbers for our purposes, but other types can be stored. The type of an array can be accessed via the dtype attribute.

shape: the dimensions of the array. This is given as a tuple, where element $i$ of the tuple tells you the "length" of the array in the $i$th dimension. For example, a 10-dimensional vector would have shape (10,), a 32-by-100 matrix would have shape (32, 100), etc. The shape of an array can be accessed via the shape attribute.
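As a quick sketch of these two pieces of metadata, the snippet below builds a few small arrays (the names `v`, `M`, `ints`, and `mixed` are made up for this illustration) and inspects their `shape` and `dtype` attributes.

```python
import numpy as np

v = np.zeros(10)               # a 10-dimensional vector
M = np.ones((32, 100))         # a 32-by-100 matrix
ints = np.array([1, 2, 3])     # integers only

print(v.shape, v.dtype)        # (10,) float64
print(M.shape, M.dtype)        # (32, 100) float64
print(ints.dtype.kind)         # i (an integer type; the exact width depends on the platform)

# Mixing ints and floats gives one common dtype for the whole array
mixed = np.array([1, 2.3, -6])
print(mixed.dtype)             # float64
```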

There are a number of ways to construct arrays. One is to pass a Python sequence (such as a list or tuple) to the np.array function:

In [3]:
np.array([1, 2.3, -6])

array([ 1. ,  2.3, -6. ])


We can also easily create ordered numerical lists:

In [4]:
# We zero index so you will actually get 0 to 6
print(np.arange(7))
# Remember the list won't include 9
print(np.arange(3, 9))

[0 1 2 3 4 5 6]
[3 4 5 6 7 8]


We can also customize these lists with a third parameter that specifies step size, similar to range() in Python loops.

In [5]:
np.arange(0.0, 100.0, 10.0)

array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.])

We can also very easily create multi-dimensional arrays:

In [6]:
arr = np.array([[1, 2.3, -6], [7, 8, 9]])
print(arr)
print(arr.shape)

[[ 1.   2.3 -6. ]
 [ 7.   8.   9. ]]
(2, 3)


There are also many convenience functions for constructing special arrays. Here are some that might be useful:

In [7]:
# The identity matrix of given size
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [8]:
# A matrix with the given vector on the diagonal
np.diag([1.1,2.2,3.3])

array([[1.1, 0. , 0. ],
       [0. , 2.2, 0. ],
       [0. , 0. , 3.3]])

In [9]:
#An array of all zeros or ones with the given shape
np.zeros((8,4)), np.ones((3, 2))

(array([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]), array([[1., 1.],
        [1., 1.],
        [1., 1.]]))

In [10]:
# An array with a given shape full of a specified value
np.full((3,4), 2.1)

array([[2.1, 2.1, 2.1, 2.1],
       [2.1, 2.1, 2.1, 2.1],
       [2.1, 2.1, 2.1, 2.1]])

In [11]:
# A random (standard normal distribution) array with the given shape
np.random.randn(5,6)

array([[ 1.33120519, -0.2233578 ,  1.42409446, -0.73391984,  0.64248592,
         1.57430852],
       [ 0.07381566, -1.91178617,  0.04670992,  1.56996407, -0.8194011 ,
         0.45466285],
       [-0.6236496 , -1.09849796, -0.48034171,  0.44860469,  1.35594904,
        -0.35381804],
       [ 0.76910052, -0.05129554,  2.64714135,  0.55192718, -0.13546387,
         0.4411307 ],
       [-0.05815896, -0.18239503, -2.42912096, -1.01888854,  0.9598789 ,
        -0.19781149]])

Now let's suppose we have some data in an array so we can start doing stuff with it.

In [12]:
A = np.random.randn(10,5); x = np.random.randn(5)
A

array([[ 0.9812457 , -0.68185447,  0.43871488,  1.27540767,  0.21331268],
       [ 0.0302278 , -0.0114613 , -0.75766957,  0.84555225,  1.5683728 ],
       [-0.72788951,  1.70482396, -1.00490179,  0.76797585, -0.13818457],
       [ 0.00431601,  1.322423  , -0.24440994,  0.14387856, -0.99637638],
       [ 0.3589508 , -0.46991103,  0.89700273, -1.36489127,  0.10434249],
       [-1.58805692,  0.51322277,  0.44112037,  0.63600991, -0.97783322],
       [ 0.37455723, -0.47594566,  0.01694609, -1.88540108,  0.96650334],
       [ 0.95070611, -0.21924678, -0.33407335, -1.73841527,  0.1065307 ],
       [ 1.07425043,  1.4072141 ,  0.17385397,  0.56987896,  0.30988446],
       [ 0.90525753, -1.14662661, -0.79407265,  2.36743134, -1.52346664]])

NumPy lets us efficiently apply the same function to every element in an array. You'll often need to, for example, exponentiate a bunch of values, but if you use a list comprehension or map with the builtin Python math functions it will be really slow. Instead, you can just write:

In [13]:
# log, sin, cos, etc. work similarly - try them out!
np.exp(A)

array([[ 2.66777742,  0.50567836,  1.55071308,  3.58016063,  1.23777161],
       [ 1.0306893 ,  0.98860413,  0.46875756,  2.32926379,  4.7988332 ],
       [ 0.48292713,  5.50041731,  0.36608058,  2.15539898,  0.87093793],
       [ 1.00432534,  3.75250267,  0.78316652,  1.15474386,  0.36921492],
       [ 1.43182636,  0.62505787,  2.45224205,  0.25540845,  1.10998054],
       [ 0.20432224,  1.67066671,  1.55444781,  1.88892882,  0.3761252 ],
       [ 1.45434733,  0.62129724,  1.01709049,  0.15176818,  2.62873657],
       [ 2.58753609,  0.8031235 ,  0.71600126,  0.17579877,  1.11241207],
       [ 2.92779748,  4.08456035,  1.18988179,  1.76805303,  1.3632676 ],
       [ 2.47256861,  0.31770671,  0.4520002 , 10.66994956,  0.217955  ]])

We can take the sum/mean/standard deviation/etc. of all the elements in an array:

In [14]:
np.sum(x), np.mean(x), np.std(x)

(0.5330664561355374, 0.10661329122710747, 1.3301026841374637)

You can also specify an axis over which to compute the sum if you want a vector of row/column sums (again, sum here can be replaced with mean or other operations):

In [15]:
# Create an array with numbers in the range 0,...,3 and then reshape it to a 2x2 matrix
B = np.arange(4).reshape((2,2))

# Original matrix
print(B)
# Column sum
print(np.sum(B, axis=0))
# Row sum
print(np.sum(B, axis=1))

[[0 1]
 [2 3]]
[2 4]
[1 5]
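The axis argument is easier to keep straight with a non-square matrix, where the row and column results have different lengths. A minimal sketch follows; the array `C` is an assumption introduced just for this example.

```python
import numpy as np

C = np.arange(6).reshape((2, 3))   # [[0 1 2], [3 4 5]]

col_sums = C.sum(axis=0)    # collapses the rows: one value per column
row_sums = C.sum(axis=1)    # collapses the columns: one value per row
row_means = C.mean(axis=1)  # mean works the same way as sum

print(col_sums)    # [3 5 7]
print(row_sums)    # [ 3 12]
print(row_means)   # [1. 4.]
```

A handy rule of thumb: the axis you pass is the one that disappears from the shape of the result.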


We can also perform common linear algebra operations in NumPy.

In [16]:
# Matrix-vector product. The dimensions have to match, of course
A.dot(x)
# Note that Python 3.5+ also has a slick notation, A @ x, which does the same thing

array([-3.4140895 ,  1.18779918,  5.8631209 ,  2.86321544, -2.26091446,
        2.92372466, -1.2459996 , -1.6818983 ,  1.68099548, -4.33446227])
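To make the matrix-vector product concrete, here is a small sketch with made-up values (the names `M` and `v` are assumptions for this example) that checks `dot`, the `@` operator, and a by-hand computation against each other.

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # shape (3, 2)
v = np.array([10.0, 1.0])    # shape (2,), matching the number of columns of M

via_dot = M.dot(v)
via_at = M @ v
# Each output entry is the dot product of one row of M with v
by_hand = np.array([np.sum(row * v) for row in M])

print(via_dot)                        # [12. 34. 56.]
print(np.allclose(via_dot, via_at))   # True
print(np.allclose(via_dot, by_hand))  # True
```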

In [17]:
# Transpose a matrix
A.T

array([[ 0.9812457 ,  0.0302278 , -0.72788951,  0.00431601,  0.3589508 ,
        -1.58805692,  0.37455723,  0.95070611,  1.07425043,  0.90525753],
       [-0.68185447, -0.0114613 ,  1.70482396,  1.322423  , -0.46991103,
         0.51322277, -0.47594566, -0.21924678,  1.4072141 , -1.14662661],
       [ 0.43871488, -0.75766957, -1.00490179, -0.24440994,  0.89700273,
         0.44112037,  0.01694609, -0.33407335,  0.17385397, -0.79407265],
       [ 1.27540767,  0.84555225,  0.76797585,  0.14387856, -1.36489127,
         0.63600991, -1.88540108, -1.73841527,  0.56987896,  2.36743134],
       [ 0.21331268,  1.5683728 , -0.13818457, -0.99637638,  0.10434249,
        -0.97783322,  0.96650334,  0.1065307 ,  0.30988446, -1.52346664]])

Now that you're familiar with NumPy, feel free to check out the documentation and see what else you can do. Documentation can be found here: https://docs.scipy.org/doc/

#### Exercises

1) Create a vector of size 10 containing zeros

In [20]:
a = np.zeros(10)

2) Now change the fifth value to be 5

In [22]:
## NOTE - the fifth value is referred to by index 4
a[4] = 5
a

array([0., 0., 0., 0., 5., 0., 0., 0., 0., 0.])


3) Create a vector with values ranging from 10 to 49

In [27]:
#arange functions similarly to Python's builtin range, so we have to go from 10 to 50
a = np.arange(10, 50)
a

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 49])

4) Reverse the previous vector (first element becomes last)

In [28]:
a = np.flip(a)
a

array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
       32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
       15, 14, 13, 12, 11, 10])

5) Create a 3x3 matrix with values ranging from 0 to 8. Create a 1D array first and then re-shape it

In [33]:
m = np.arange(0, 9)
m = m.reshape(3, 3)
m

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])


6) Create a 3x3x3 array with random values

In [36]:
a = np.random.randn(3, 3, 3)
a

array([[[ 0.44819499,  0.34563142,  0.55550849],
        [ 0.3430103 ,  1.24552369,  0.46070811],
        [ 0.4543051 , -0.41777337, -1.16548148]],

       [[-0.08226783, -0.22223126, -0.24895475],
        [-0.57752846,  0.75757441,  0.85848634],
        [-0.47014159, -0.15088304, -0.13159155]],

       [[-0.62269467,  0.10533377,  0.39280424],
        [ 1.22930398,  0.39146997,  1.30945547],
        [ 0.70623038,  0.92095499,  0.14409362]]])


7) Create a random array and find the sum, mean, and standard deviation

In [37]:
#length of the array could be anything here since we didn't specify
a = np.random.randn(10)
a.sum(), a.mean(), a.std()

(1.6038220066090334, 0.16038220066090333, 0.8038021977055199)

8) Make a diagonal matrix with values from 1-20 (try and create this and only type two numbers!)

In [39]:
a = np.diag(np.arange(1, 21))
a

array([[ 1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  7,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  8,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0

### Intro to pandas

pandas is a very widely used library for working with large tabular datasets. It has convenient functions for importing and analyzing data from files and databases, which makes it an easy-to-learn option for data analysis.

Convention dictates that we import pandas as "pd".

In [40]:
import pandas as pd

In pandas, a one-dimensional labeled array is referred to as a Series. You can create a Series by passing a list of values, letting pandas create a default integer index:

In [41]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
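The index does not have to be the default integers. As a small sketch (the Series `ratings_demo` and its movie labels are made up for this example), a Series can be keyed by labels, and its summary methods skip NaN values by default:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings keyed by movie label instead of 0, 1, 2, ...
ratings_demo = pd.Series([4.0, np.nan, 5.0], index=["Movie A", "Movie B", "Movie C"])

print(ratings_demo["Movie C"])      # 5.0 (lookup by label)
print(ratings_demo.isna().sum())    # 1 (one missing value)
print(ratings_demo.mean())          # 4.5 (NaN is skipped by default)
```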


A two-dimensional labeled table of data is referred to as a DataFrame in pandas. You can create a DataFrame by passing a NumPy array, along with a datetime index and labeled columns:

In [42]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [43]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.514882 | 0.802582 | 0.560848 | -0.57033 |
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |
| 2013-01-04 | -1.52445 | -0.783444 | -0.501695 | 1.512323 |
| 2013-01-05 | 0.387131 | -0.131827 | -0.425667 | 0.253814 |
| 2013-01-06 | -0.163197 | -1.185949 | -0.648501 | 0.917436 |


You can also create a DataFrame by passing a dict of objects that can be converted to a Series-like object.

In [44]:
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


Take note that the columns of the resulting DataFrame have different data types:

In [45]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

To view the top and bottom rows of the DataFrame, you can use the following commands (they also take an optional input for the number of rows you wish to display):

In [46]:
df.head(3)

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.514882 | 0.802582 | 0.560848 | -0.57033 |
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |


In [47]:
df.tail()

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |
| 2013-01-04 | -1.52445 | -0.783444 | -0.501695 | 1.512323 |
| 2013-01-05 | 0.387131 | -0.131827 | -0.425667 | 0.253814 |
| 2013-01-06 | -0.163197 | -1.185949 | -0.648501 | 0.917436 |


You can also display the index and the columns:

In [48]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [49]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one datatype for the entire array, while pandas DataFrames have one datatype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy datatype that can hold all of the datatypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [50]:
#df.to_numpy() and df.values return the same result
df.values, df.values.dtype

(array([[ 0.51488188,  0.8025819 ,  0.56084773, -0.57033043],
        [ 1.16102767,  0.99079134,  0.26890312, -0.80154352],
        [ 0.12185378,  1.92695672,  0.43646605,  0.62657343],
        [-1.52445017, -0.78344363, -0.50169508,  1.51232332],
        [ 0.38713125, -0.13182651, -0.42566738,  0.25381379],
        [-0.16319743, -1.18594903, -0.64850056,  0.9174358 ]]),
 dtype('float64'))

In [51]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

.describe() will show a quick summary of your DataFrame:

In [52]:
df.describe()

| | A | B | C | D |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.082874 | 0.269852 | -0.051608 | 0.323045 |
| std | 0.904122 | 1.178107 | 0.531945 | 0.886276 |
| min | -1.52445 | -1.185949 | -0.648501 | -0.801544 |
| 25% | -0.091935 | -0.620539 | -0.482688 | -0.364294 |
| 50% | 0.254493 | 0.335378 | -0.078382 | 0.440194 |
| 75% | 0.482944 | 0.943739 | 0.394575 | 0.84472 |
| max | 1.161028 | 1.926957 | 0.560848 | 1.512323 |
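One detail worth knowing when comparing these summaries with NumPy: by default np.std divides by n (the population standard deviation), while the std that pandas reports, including in .describe(), divides by n - 1 (the sample standard deviation). The sketch below uses made-up values (`values` and `s_demo` are assumptions for this example) to show the difference.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]
s_demo = pd.Series(values)

population_std = np.std(values)   # NumPy default: ddof=0, divides by n
sample_std = s_demo.std()         # pandas default: ddof=1, divides by n - 1

print(population_std)
print(sample_std)

# Passing ddof=1 to NumPy reproduces the pandas value
print(np.isclose(np.std(values, ddof=1), sample_std))  # True
```

For large arrays the two values are nearly identical, so this mostly matters for small samples.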


You can also transpose, sort your data by an axis, and sort by values:

In [53]:
df.T

| | 2013-01-01 | 2013-01-02 | 2013-01-03 | 2013-01-04 | 2013-01-05 | 2013-01-06 |
|---|---|---|---|---|---|---|
| A | 0.514882 | 1.161028 | 0.121854 | -1.52445 | 0.387131 | -0.163197 |
| B | 0.802582 | 0.990791 | 1.926957 | -0.783444 | -0.131827 | -1.185949 |
| C | 0.560848 | 0.268903 | 0.436466 | -0.501695 | -0.425667 | -0.648501 |
| D | -0.57033 | -0.801544 | 0.626573 | 1.512323 | 0.253814 | 0.917436 |


In [54]:
df.sort_index(axis=1, ascending=False)

| | D | C | B | A |
|---|---|---|---|---|
| 2013-01-01 | -0.57033 | 0.560848 | 0.802582 | 0.514882 |
| 2013-01-02 | -0.801544 | 0.268903 | 0.990791 | 1.161028 |
| 2013-01-03 | 0.626573 | 0.436466 | 1.926957 | 0.121854 |
| 2013-01-04 | 1.512323 | -0.501695 | -0.783444 | -1.52445 |
| 2013-01-05 | 0.253814 | -0.425667 | -0.131827 | 0.387131 |
| 2013-01-06 | 0.917436 | -0.648501 | -1.185949 | -0.163197 |


In [55]:
df.sort_values(by='B')

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-06 | -0.163197 | -1.185949 | -0.648501 | 0.917436 |
| 2013-01-04 | -1.52445 | -0.783444 | -0.501695 | 1.512323 |
| 2013-01-05 | 0.387131 | -0.131827 | -0.425667 | 0.253814 |
| 2013-01-01 | 0.514882 | 0.802582 | 0.560848 | -0.57033 |
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |


You can slice the rows of a DataFrame much like you slice a Python list, either by position or by index label. You can also select a column by its name.

In [56]:
df['A']

2013-01-01    0.514882
2013-01-02    1.161028
2013-01-03    0.121854
2013-01-04   -1.524450
2013-01-05    0.387131
2013-01-06   -0.163197
Freq: D, Name: A, dtype: float64

In [57]:
df[0:3]

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.514882 | 0.802582 | 0.560848 | -0.57033 |
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |


In [58]:
df['20130102':'20130104']

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-02 | 1.161028 | 0.990791 | 0.268903 | -0.801544 |
| 2013-01-03 | 0.121854 | 1.926957 | 0.436466 | 0.626573 |
| 2013-01-04 | -1.52445 | -0.783444 | -0.501695 | 1.512323 |


For multi-axis selection, we normally use the indexers .loc and .iloc, which take square brackets rather than parentheses (they can also be used for single-axis selection):

In [59]:
#For getting a cross section using a label
df.loc[dates[0]]

A    0.514882
B    0.802582
C    0.560848
D   -0.570330
Name: 2013-01-01 00:00:00, dtype: float64

In [60]:
#Selecting on a multi-axis
df.loc[:, ['A', 'B']]

| | A | B |
|---|---|---|
| 2013-01-01 | 0.514882 | 0.802582 |
| 2013-01-02 | 1.161028 | 0.990791 |
| 2013-01-03 | 0.121854 | 1.926957 |
| 2013-01-04 | -1.52445 | -0.783444 |
| 2013-01-05 | 0.387131 | -0.131827 |
| 2013-01-06 | -0.163197 | -1.185949 |


In [61]:
#Label slicing, with both endpoints included
df.loc['20130102':'20130104', ['A', 'B']]

| | A | B |
|---|---|---|
| 2013-01-02 | 1.161028 | 0.990791 |
| 2013-01-03 | 0.121854 | 1.926957 |
| 2013-01-04 | -1.52445 | -0.783444 |


.loc and .iloc are similar, but .iloc is used solely with integer positions, whereas .loc is used with label names.

In [62]:
df.iloc[3]

A   -1.524450
B   -0.783444
C   -0.501695
D    1.512323
Name: 2013-01-04 00:00:00, dtype: float64

In [63]:
df.iloc[3:5, 0:2]

| | A | B |
|---|---|---|
| 2013-01-04 | -1.52445 | -0.783444 |
| 2013-01-05 | 0.387131 | -0.131827 |
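One difference between the two indexers that is easy to trip over: label slices include both endpoints, while integer slices exclude the end position, just like Python lists. A minimal sketch with a made-up DataFrame (`demo` and its labels are assumptions for this example):

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

by_label = demo.loc["x":"y", "A"]   # label slice: both endpoints included
by_position = demo.iloc[1:2, 0]     # integer slice: end position excluded

print(by_label.tolist())      # [20, 30]
print(by_position.tolist())   # [20]
```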


There are various other useful pandas functions, but for reading data in this course, the above will cover most of what you will need. If you ever have any questions or want to learn more about some pandas functions, feel free to ask on Piazza or check out the pandas documentation - https://pandas.pydata.org/docs/user_guide/index.html

#### Exercises

1) Create a dataframe from a random NumPy array of size 5 by 5

In [69]:
df = pd.DataFrame(np.random.randn(5, 5))
df

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 2.621138 | 1.793274 | -0.714763 | -0.379673 | -0.305849 |
| 1 | 0.707764 | -1.328068 | 1.261731 | 0.412334 | -1.853433 |
| 2 | 0.212769 | -0.184537 | 0.091617 | 0.130344 | 1.171961 |
| 3 | 0.299444 | -0.271784 | -0.644286 | 1.056123 | 0.209448 |
| 4 | -0.157517 | -0.614733 | 0.10565 | -0.760786 | -2.388082 |


2) Sort the DataFrame you created by the first column

In [70]:
df.sort_values(by=0, ascending=False)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 2.621138 | 1.793274 | -0.714763 | -0.379673 | -0.305849 |
| 1 | 0.707764 | -1.328068 | 1.261731 | 0.412334 | -1.853433 |
| 3 | 0.299444 | -0.271784 | -0.644286 | 1.056123 | 0.209448 |
| 2 | 0.212769 | -0.184537 | 0.091617 | 0.130344 | 1.171961 |
| 4 | -0.157517 | -0.614733 | 0.10565 | -0.760786 | -2.388082 |


3) Print the mean of the first row of the DataFrame

In [71]:
np.mean(df.loc[0, :])

0.6028254025225829

Alright so that was a lot, but you can just enjoy this meme for now, and then move on to the rest of the assignment :)

![Funny Meme](https://pics.me.me/thumb_how-to-make-friends-69448574.png)

### Naive Recommender

For this part of the exercise, you don't have to actually code anything. We'd just like you to read along and follow what we're doing as we construct a naive recommender from scratch. At the end, we'll ask you to determine what type of recommender we created based on the code written.

The first thing we'll need to do is mount our Google Drive in this Colab notebook. This will allow us to access the files there. The code below will ask you to allow Colab to access your Google Drive. Please set the part of DRIVE_PREFIX after '/content/drive/MyDrive/' to where you put this notebook in your Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
DRIVE_PREFIX = '/content/drive/MyDrive/Rec Sys for ML Decal Assignments/Week 1/'

Now we can import the dataset. We'll be using the MovieLens dataset. Before you run this block of code, please make sure that 'ml-latest-small.zip' is in the folder with this notebook. This block of code will open the zip file and extract its contents into a folder called "MovieLens-Data".

In [73]:
import zipfile as zf
# The with-statement closes the zip file automatically, even if extraction fails
with zf.ZipFile("ml-latest-small.zip", 'r') as files:
    files.extractall("MovieLens-Data")

We'll be using "movies.csv" and "ratings.csv" from the "MovieLens-Data/ml-latest-small" folder:

In [74]:
movies = pd.read_csv("MovieLens-Data/ml-latest-small/movies.csv")
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [76]:
ratings = pd.read_csv("MovieLens-Data/ml-latest-small/ratings.csv")
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


For ease of use later down the line, let's merge these two DataFrames on the 'movieId' column:

In [77]:
data = ratings.merge(movies, on='movieId', how='left')
data.head(10)

| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) | Comedy\|Romance |
| 2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) | Action\|Crime\|Thriller |
| 3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) | Mystery\|Thriller |
| 4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) | Crime\|Mystery\|Thriller |
| 5 | 1 | 70 | 3.0 | 964982400 | From Dusk Till Dawn (1996) | Action\|Comedy\|Horror\|Thriller |
| 6 | 1 | 101 | 5.0 | 964980868 | Bottle Rocket (1996) | Adventure\|Comedy\|Crime\|Romance |
| 7 | 1 | 110 | 4.0 | 964982176 | Braveheart (1995) | Action\|Drama\|War |
| 8 | 1 | 151 | 5.0 | 964984041 | Rob Roy (1995) | Action\|Drama\|Romance\|War |
| 9 | 1 | 157 | 5.0 | 964984100 | Canadian Bacon (1995) | Comedy\|War |
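To see what how='left' does in a merge like this, here is a toy sketch with made-up data (`ratings_demo`, `movies_demo`, and their contents are assumptions for this example, not part of MovieLens). Every row of the left DataFrame is kept, and rows without a match on the right get NaN in the new columns.

```python
import pandas as pd

ratings_demo = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
movies_demo = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
})

merged = ratings_demo.merge(movies_demo, on="movieId", how="left")
print(merged)
# The rating for movieId 30 is kept, but its title is NaN because
# movies_demo has no row with that movieId.
```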


In this particular example, we want to create a recommender that will take in a movie and suggest similar movies based on how users rated movies in the past. Therefore, it would be more convenient to format our dataset based on each user - we will have a column for each movie in the dataset, and for each user a row that we will populate with their ratings for all the movies in the dataset. Let's make that DataFrame:

In [78]:
by_user = data.pivot_table(index='userId', columns='title', values='rating')
by_user.head()

| userId | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | ... | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | ... | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | ... | NaN | NaN | NaN |

(Columns are movie titles; the middle columns are omitted here for readability.)
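The reshaping step can be sketched on a tiny made-up dataset (`long_demo` and its values are assumptions for this example). Each (user, title) pair becomes one cell, and pairs with no rating become NaN, which is why the real table above is mostly empty.

```python
import pandas as pd

long_demo = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# One row per user, one column per title, ratings as the cell values
wide_demo = long_demo.pivot_table(index="userId", columns="title", values="rating")
print(wide_demo)
```

Note that pivot_table averages duplicate (user, title) entries by default, so a user who rated the same title twice would end up with the mean of those ratings.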


For our naive recommender system, let's just pick a movie and look for items similar to it. My pick is 'Star Wars: Episode V - The Empire Strikes Back (1980)' :)

The corrwith method in pandas will find the pairwise correlation between the column passed to it (in this case the column for 'Star Wars: Episode V - The Empire Strikes Back (1980)') and each column of the DataFrame it is called on. Each correlation uses only the users who rated both movies, so a movie with too few shared raters gets NaN:

In [79]:
correlations = by_user.corrwith(by_user['Star Wars: Episode V - The Empire Strikes Back (1980)'])
correlations.head(10)


title
'71 (2014)                                      NaN
'Hellboy': The Seeds of Creation (2004)         NaN
'Round Midnight (1986)                          NaN
'Salem's Lot (2004)                             NaN
'Til There Was You (1997)                       NaN
'Tis the Season for Love (2015)                 NaN
'burbs, The (1989)                         0.208556
'night Mother (1986)                            NaN
(500) Days of Summer (2009)                0.292616
*batteries not included (1987)             0.049029
dtype: float64
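A toy version of this step may help in reading the output above. In the sketch below (the DataFrame `wide_demo` and its column names are made up for this example), each row is a user and each column a movie; corrwith compares every column against the chosen one.

```python
import numpy as np
import pandas as pd

wide_demo = pd.DataFrame({
    "Target":     [5.0, 4.0, 2.0, 1.0, np.nan],
    "Similar":    [4.5, 4.0, 2.5, 1.0, 3.0],
    "Opposite":   [1.0, 2.0, 4.0, 5.0, np.nan],
    "No overlap": [np.nan, np.nan, np.nan, np.nan, 4.0],
})

corr_demo = wide_demo.corrwith(wide_demo["Target"])
print(corr_demo)
# "Similar" is rated in step with "Target", so its correlation is close to 1.
# "Opposite" is rated in reverse, so its correlation is -1.
# "No overlap" shares no raters with "Target", so its correlation is NaN.
```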

Let's clean up these results by putting them in a new DataFrame and removing all the NaN values. We'll also add the total number of ratings for each movie just to get a little more insight.

In [80]:
recommendation = pd.DataFrame(correlations, columns=['correlation'])
recommendation.dropna(inplace=True)
total = pd.DataFrame(data.groupby('title')['rating'].count())
recommendation = recommendation.join(total)
recommendation.head()

| title | correlation | rating |
|---|---|---|
| 'burbs, The (1989) | 0.208556 | 17 |
| (500) Days of Summer (2009) | 0.292616 | 42 |
| *batteries not included (1987) | 0.049029 | 7 |
| ...And Justice for All (1979) | -0.052414 | 3 |
| 10 Cent Pistol (2015) | 1.0 | 2 |
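Notice that '10 Cent Pistol (2015)' has a correlation of 1.0 from only 2 ratings. One likely reason is that a correlation computed from just two shared raters is always +1 or -1 (as long as the ratings differ), no matter how similar the movies really are. A small sketch with made-up ratings (`a` and `b` are assumptions for this example):

```python
import pandas as pd

# Ratings from the only two users who rated both hypothetical movies
a = pd.Series([3.0, 5.0])
b = pd.Series([1.0, 1.5])

two_point_corr = a.corr(b)
print(two_point_corr)  # approximately 1.0, even though the ratings look quite different
```

This is why it helps to require a minimum number of ratings before trusting a correlation.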


Finally, we can just sort this DataFrame by correlation, highest first. While we're at it, we can make sure that we're only considering movies that have more than 100 ratings.

In [81]:
final_recc = recommendation[recommendation['rating'] > 100].sort_values('correlation', ascending=False).reset_index()
final_recc.head(10)

| | title | correlation | rating |
|---|---|---|---|
| 0 | Star Wars: Episode V - The Empire Strikes Back... | 1.0 | 211 |
| 1 | Star Wars: Episode IV - A New Hope (1977) | 0.77797 | 251 |
| 2 | Star Wars: Episode VI - Return of the Jedi (1983) | 0.643464 | 196 |
| 3 | Raiders of the Lost Ark (Indiana Jones and the... | 0.487676 | 200 |
| 4 | Indiana Jones and the Last Crusade (1989) | 0.47247 | 140 |
| 5 | Spider-Man (2002) | 0.469999 | 122 |
| 6 | Terminator 2: Judgment Day (1991) | 0.453513 | 224 |
| 7 | Godfather, The (1972) | 0.428278 | 192 |
| 8 | Back to the Future (1985) | 0.427618 | 171 |
| 9 | Lord of the Rings: The Two Towers, The (2002) | 0.4083 | 188 |


Unsurprisingly, the most similar movie to 'Star Wars: Episode V - The Empire Strikes Back (1980)' is itself, with the other Star Wars movies from the original trilogy following close behind. Movies with Harrison Ford appear to be the runners-up after that. Interesting!
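The steps in this section can be collected into a single function, which makes it easy to try other movies. This is only a sketch: the function name `similar_movies`, its parameters `min_ratings` and `top_n`, and the tiny `toy` dataset are all assumptions introduced here so the example runs on its own. With the real data you would pass the merged `data` DataFrame and a title from it.

```python
import pandas as pd


def similar_movies(data, title, min_ratings=100, top_n=10):
    """Rank movies by rating correlation with `title` (hypothetical helper)."""
    # One row per user, one column per movie title
    by_user = data.pivot_table(index="userId", columns="title", values="rating")
    # Correlate every movie's ratings with the chosen movie's ratings
    correlations = by_user.corrwith(by_user[title])
    result = pd.DataFrame({"correlation": correlations}).dropna()
    # Attach the number of ratings per movie and filter out rarely rated ones
    result = result.join(data.groupby("title")["rating"].count())
    result = result[result["rating"] > min_ratings]
    return result.sort_values("correlation", ascending=False).head(top_n)


# Tiny made-up dataset so the sketch is self-contained
toy = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "title":  ["X", "Y", "Z"] * 3,
    "rating": [5.0, 4.0, 1.0, 3.0, 3.0, 3.5, 1.0, 2.0, 5.0],
})
toy_result = similar_movies(toy, "X", min_ratings=2)
print(toy_result)
```

In the toy data, "Y" is rated in step with "X" and "Z" is rated in reverse, so "Z" ends up at the bottom of the ranking with a negative correlation.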

#### Exercises

1) What type of recommender is this and why?

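Collaborative recommender system (item-based collaborative filtering). It recommends movies using only the patterns in how users rated them, without looking at any properties of the movies themselves, such as their genres.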

2) What problems do you think this recommender has? List at least two.

Anything is fine for this question, but here are some common ones you could have listed:

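1. Small sample size (many movies have only a handful of ratings, so their correlations are unreliable)
2. Low holistic information usage (only the ratings are used; information such as genres and timestamps is ignored)
3. Scalability (correlation computation is very inefficient for large sample sizes)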