# Recommendation Systems in Machine Learning

## Week 1 - Intro to Recommendation Systems

This week, we'll take a brief look at the libraries and frameworks we will use in later weeks of this course. Specifically, we will rely heavily on NumPy, PyTorch, and pandas; please get as familiar with them as you can, even outside of the weekly assignments. This week's assignment does not use PyTorch; we will use it more in the weeks on recommender systems using deep learning.

### Intro to NumPy

NumPy is a very widely used library for working with large, multi-dimensional arrays. Its core routines are written in heavily optimized C, which makes large vector operations fast (you learn a little about these kinds of optimizations in COMPSCI 61C).

NumPy is used in place of Python lists both because lists tend to be slower for operations like elementwise addition and multiplication, and because Python lists can contain multiple types of objects. Normally, this flexibility is a good thing, but for fast array operations, type checking every single element of a list can be very time-intensive. Thus, NumPy is the go-to library for machine learning data analysis and transformation.

Convention is to use "np" as an alias for numpy during an import. NumPy isn't part of the Python standard library, so it has to be installed separately and then imported.

In [1]:
import numpy as np

NumPy arrays are the workhorse of the library. A NumPy array is essentially a bunch of data coupled with some metadata:

type: the type of objects in the array. This will typically be floating-point numbers for our purposes, but other types can be stored. The type of an array can be accessed via the dtype attribute.

shape: the dimensions of the array. This is given as a tuple, where element $i$ of the tuple tells you the "length" of the array in the $i$th dimension. For example, a 10-dimensional vector would have shape (10,), a 32-by-100 matrix would have shape (32,100), etc. The shape of an array can be accessed via the shape attribute.

There are a number of ways to construct arrays. One is to pass in a Python sequence (such as a list or tuple) to the np.array function:

In [2]:
np.array([1, 2.3, -6])

array([ 1. ,  2.3, -6. ])
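
The dtype and shape metadata described above can be inspected directly. The following is a minimal sketch; the variable names `v` and `m` are made up for illustration, and the exact integer width of `m` depends on your platform, so only its kind is checked.

```python
import numpy as np

# Mixing ints and floats in the input upcasts everything to a float dtype
v = np.array([1, 2.3, -6])
print(v.dtype)   # float64
print(v.shape)   # (3,)

# A nested list of Python ints gives a 2-D integer array
# (the exact width, e.g. int64 vs int32, is platform-dependent)
m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.dtype.kind)  # 'i' means signed integer
print(m.shape)       # (2, 3)
```

Because every element shares one dtype, NumPy does not need to check types element by element during array operations.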


We can also easily create ordered numerical lists:

In [3]:
# We zero index so you will actually get 0 to 6
print(np.arange(7))
# Remember the list won't include 9
print(np.arange(3, 9))

[0 1 2 3 4 5 6]
[3 4 5 6 7 8]


We can also customize these lists with a third parameter that specifies step size, similar to range() in Python loops.

In [4]:
np.arange(0.0, 100.0, 10.0)

array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90.])

We can also very easily create multi-dimensional arrays:

In [5]:
arr = np.array([[1, 2.3, -6], [7, 8, 9]])
print(arr)
print(arr.shape)

[[ 1.   2.3 -6. ]
 [ 7.   8.   9. ]]
(2, 3)


There are also many convenience functions for constructing special arrays. Here are some that might be useful:

In [6]:
# The identity matrix of given size
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [7]:
# A matrix with the given vector on the diagonal
np.diag([1.1,2.2,3.3])

array([[1.1, 0. , 0. ],
       [0. , 2.2, 0. ],
       [0. , 0. , 3.3]])

In [8]:
#An array of all zeros or ones with the given shape
np.zeros((8,4)), np.ones((3, 2))

(array([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]]),
 array([[1., 1.],
        [1., 1.],
        [1., 1.]]))

In [9]:
# An array with a given shape full of a specified value
np.full((3,4), 2.1)

array([[2.1, 2.1, 2.1, 2.1],
       [2.1, 2.1, 2.1, 2.1],
       [2.1, 2.1, 2.1, 2.1]])

In [10]:
# A random (standard normal distribution) array with the given shape
np.random.randn(5,6)

array([[-0.82235625,  0.16166182,  0.8658556 , -0.64076653,  0.03832676,
        -1.62156733],
       [-1.82517917, -0.25607682,  0.40247866, -0.29578559,  2.5104027 ,
        -0.57768271],
       [ 0.19907821,  1.60737774,  0.62089205, -1.53224473,  0.42965091,
        -1.74627635],
       [ 0.32261841,  0.14125393, -0.46264801,  1.71736688,  0.17441449,
         0.06367158],
       [-1.80059151,  0.77768929, -0.29082781,  1.22128437, -1.30198883,
         1.03429833]])

Now let's suppose we have some data in an array so we can start doing stuff with it.

In [11]:
A = np.random.randn(10,5); x = np.random.randn(5)
A

array([[-0.76799093, -2.09422225, -2.79773646, -0.25232323,  0.31077338],
       [ 0.63500059,  0.44613292, -1.90698022, -0.81028264,  0.88787568],
       [ 0.83045473,  2.08686753,  1.72131392, -1.09020197, -1.69861669],
       [-0.89217255,  0.08532031,  1.23668415,  0.7087841 ,  1.2482638 ],
       [-1.27782798,  0.76522281, -0.34530848,  1.61172929, -1.35634624],
       [ 0.53234416, -0.89703836, -0.23577723, -0.15308264,  0.29125872],
       [ 0.67251842,  0.56367014, -0.29240107,  0.79528736,  2.04083892],
       [ 0.07430404, -0.23978856, -0.63862248, -0.38027023,  0.86842871],
       [ 0.47326669, -0.53416025, -0.36541147, -0.93265261,  1.4375769 ],
       [-0.25355079, -0.16693776, -0.47563675,  0.39558053, -0.65547734]])

NumPy lets us efficiently apply the same function to every element in an array. You'll often need to, for example, exponentiate a bunch of values, but if you use a list comprehension or map with the builtin Python math functions it will be really slow. Instead, you can just write:

In [12]:
# log, sin, cos, etc. work similarly - try them out!
np.exp(A)

array([[0.46394423, 0.123166  , 0.06094786, 0.77699355, 1.36447996],
       [1.88702326, 1.56225911, 0.14852823, 0.44473235, 2.42996214],
       [2.29436181, 8.059629  , 5.59187093, 0.3361486 , 0.18293641],
       [0.40976455, 1.08906585, 3.44417415, 2.03151962, 3.48428829],
       [0.27864186, 2.14947324, 0.70800191, 5.01147004, 0.25760027],
       [1.70291955, 0.40777556, 0.78995664, 0.85805881, 1.33811074],
       [1.95916512, 1.75710952, 0.74646909, 2.21507743, 7.69706371],
       [1.07713424, 0.7867942 , 0.52801928, 0.68367664, 2.38316327],
       [1.60522942, 0.58616131, 0.69391107, 0.3935085 , 4.21048101],
       [0.77604033, 0.84625228, 0.6214892 , 1.48524617, 0.51919417]])
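
To make the comparison with the built-in math functions concrete, here is a small sketch that computes the same exponentials both ways and checks that they agree. The array `values` is a made-up example, and no timings are shown because they vary by machine.

```python
import math
import numpy as np

values = np.array([0.0, 1.0, 2.0])

# Pure-Python approach: one math.exp call per element
slow = [math.exp(v) for v in values]

# Vectorized approach: a single call that loops in compiled code
fast = np.exp(values)

# Both approaches give the same numbers (up to floating-point rounding)
print(np.allclose(slow, fast))  # True
```

On large arrays the vectorized call is typically much faster, since the per-element loop runs in compiled code rather than in the Python interpreter.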

We can take the sum/mean/standard deviation/etc. of all the elements in an array:

In [13]:
np.sum(x), np.mean(x), np.std(x)

(-0.037244333573317114, -0.007448866714663422, 0.7229652193966093)

You can also specify an axis over which to compute the sum if you want a vector of row/column sums (again, sum here can be replaced with mean or other operations):

In [14]:
# Create an array with numbers in the range 0,...,3 and then reshape it to a 2x2 matrix
B = np.arange(4).reshape((2,2))

# Original matrix
print(B)
# Column sum
print(np.sum(B, axis=0))
# Row sum
print(np.sum(B, axis=1))

[[0 1]
 [2 3]]
[2 4]
[1 5]
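
With a square matrix it is easy to mix up the two axes, so here is the same idea sketched on a non-square array. The name `C` is just an illustrative choice.

```python
import numpy as np

C = np.arange(6).reshape((2, 3))   # [[0 1 2], [3 4 5]]

col_sums = np.sum(C, axis=0)    # collapses the rows: one value per column
row_sums = np.sum(C, axis=1)    # collapses the columns: one value per row
row_means = np.mean(C, axis=1)  # mean works the same way as sum

print(col_sums)   # [3 5 7]
print(row_sums)   # [ 3 12]
print(row_means)  # [1. 4.]
```

A useful way to remember this: the axis you pass is the one that disappears from the result's shape.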


We can also perform common linear algebra operations in NumPy.

In [15]:
# Matrix-vector product. The dimensions have to match, of course
A.dot(x)
# Note that in Python 3.5+ there is also a slick notation, A @ x, which does the same thing

array([ 0.76288905,  0.62036571, -1.00461857,  0.45338073,  3.25508825,
       -1.18183607,  0.66835199, -0.09264291, -1.31752801,  0.73581775])
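
As a small sketch of the dimension-matching rule, the example below uses a made-up 3-by-2 matrix `M` and checks that `dot` and `@` agree, then shows what happens when the vector has the wrong length.

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2)
v = np.array([10.0, 1.0])                            # shape (2,)

via_dot = M.dot(v)
via_at = M @ v
print(via_dot)                           # [12. 34. 56.]
print(np.array_equal(via_dot, via_at))   # True

# A length-3 vector does not match M's 2 columns, so NumPy raises ValueError
try:
    M @ np.ones(3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```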

In [16]:
# Transpose a matrix
A.T

array([[-0.76799093,  0.63500059,  0.83045473, -0.89217255, -1.27782798,
         0.53234416,  0.67251842,  0.07430404,  0.47326669, -0.25355079],
       [-2.09422225,  0.44613292,  2.08686753,  0.08532031,  0.76522281,
        -0.89703836,  0.56367014, -0.23978856, -0.53416025, -0.16693776],
       [-2.79773646, -1.90698022,  1.72131392,  1.23668415, -0.34530848,
        -0.23577723, -0.29240107, -0.63862248, -0.36541147, -0.47563675],
       [-0.25232323, -0.81028264, -1.09020197,  0.7087841 ,  1.61172929,
        -0.15308264,  0.79528736, -0.38027023, -0.93265261,  0.39558053],
       [ 0.31077338,  0.88787568, -1.69861669,  1.2482638 , -1.35634624,
         0.29125872,  2.04083892,  0.86842871,  1.4375769 , -0.65547734]])

Now that you're familiar with NumPy, feel free to check out the documentation and see what else you can do. The documentation can be found here: https://docs.scipy.org/doc/

#### Exercises

1) Create a vector of size 10 containing zeros

In [22]:
## FILL IN YOUR ANSWER HERE ##
zs = np.zeros((10,))

2) Now change the fifth value to be 5

In [23]:
## FILL IN YOUR ANSWER HERE ##
zs[4] = 5
zs

array([0., 0., 0., 0., 5., 0., 0., 0., 0., 0.])


3) Create a vector with values ranging from 10 to 49

In [26]:
## FILL IN YOUR ANSWER HERE ##
r = np.arange(10,50)

4) Reverse the previous vector (first element becomes last)

In [28]:
## FILL IN YOUR ANSWER HERE ##
r[::-1]

array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
       32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
       15, 14, 13, 12, 11, 10])

5) Create a 3x3 matrix with values ranging from 0 to 8. Create a 1D array first and then re-shape it

In [29]:
## FILL IN YOUR ANSWER HERE ##
np.arange(9).reshape((3,3))

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])


6) Create a 3x3x3 array with random values

In [32]:
## FILL IN YOUR ANSWER HERE ##
np.random.randn(3,3,3)

array([[[ 0.29234943, -1.00786922,  1.80898821],
        [-0.90216374, -0.80719954, -1.38957578],
        [-0.44703354, -3.08414505,  0.53207045]],

       [[ 0.25197737,  0.27885975, -0.33631307],
        [ 0.92558852, -0.46946098,  1.05415788],
        [-1.08768049, -0.01919276,  0.55575855]],

       [[ 0.31060558,  0.56550868, -0.30678269],
        [-0.70973509,  1.58173312,  0.61201619],
        [-1.86819578,  0.34566622,  0.67080609]]])


7) Create a random array and find the sum, mean, and standard deviation

In [34]:
## FILL IN YOUR ANSWER HERE ##
x = np.random.randn(100000)
# Reports the mean, variance, and standard deviation; use x.sum() for the sum
x.mean(),x.var(),x.std()

(-0.002582876987925224, 1.001273317693992, 1.0006364563086796)

8) Make a diagonal matrix with values from 1-20 (try and create this and only type two numbers!)

In [38]:
## FILL IN YOUR ANSWER HERE ##
np.diag(np.arange(1,21))

array([[ 1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  3,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  4,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  7,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  8,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0

### Intro to pandas

pandas is a very widely used library for dealing with large datasets. It has convenient functions for importing and analyzing data from large databases, which makes it an easy-to-learn option for data analysis.

Convention dictates that we import pandas as "pd".

In [39]:
import pandas as pd

In pandas, a one-dimensional labeled array is referred to as a Series. You can create a Series by passing a list of values, letting pandas create a default integer index:

In [40]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


A 2-dimensional labeled table is referred to as a DataFrame in pandas. You can create a DataFrame by passing a NumPy array, along with a datetime index and labeled columns:

In [41]:
dates = pd.date_range('20130101', periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [42]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

                   A         B         C         D
2013-01-01  0.489278  0.909886 -0.989561  2.029147
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-03  1.258268  1.275579  1.794955 -1.086337
2013-01-04  0.277834 -0.975103 -1.254463  1.478424
2013-01-05  0.171896 -1.000152  1.749716  0.856443
2013-01-06  0.437102 -0.899630  0.071820  0.350453


You can also create a DataFrame by passing a dict of objects that can be converted to a Series-like object.

In [43]:
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


Take note that the columns of the resulting DataFrame have different data types:

In [44]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

To view the top and bottom of the DataFrame, you can use the following commands (they also take an optional argument for the number of rows you wish to display):

In [45]:
df.head(3)

                   A         B         C         D
2013-01-01  0.489278  0.909886 -0.989561  2.029147
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-03  1.258268  1.275579  1.794955 -1.086337


In [46]:
df.tail()

                   A         B         C         D
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-03  1.258268  1.275579  1.794955 -1.086337
2013-01-04  0.277834 -0.975103 -1.254463  1.478424
2013-01-05  0.171896 -1.000152  1.749716  0.856443
2013-01-06  0.437102 -0.899630  0.071820  0.350453


You can also display the index and the columns:

In [47]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [48]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one datatype for the entire array, while pandas DataFrames have one datatype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy datatype that can hold all of the datatypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [49]:
#df.to_numpy() and df.values return the same result
df.values, df.values.dtype

(array([[ 0.48927798,  0.90988634, -0.98956099,  2.02914652],
        [-2.66167934, -0.70086534,  1.92897133,  1.60839082],
        [ 1.25826824,  1.27557866,  1.79495533, -1.08633664],
        [ 0.27783363, -0.97510301, -1.25446325,  1.47842375],
        [ 0.17189577, -1.00015166,  1.74971575,  0.85644254],
        [ 0.43710242, -0.89962974,  0.07182031,  0.35045326]]),
 dtype('float64'))

In [50]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
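
The dtype behaviour described above can be checked on small, made-up DataFrames. This is only a sketch, and the names `floats_only` and `mixed` are illustrative.

```python
import numpy as np
import pandas as pd

# Every column is float64, so the NumPy array can be float64 too
floats_only = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# A float column next to a string column has no common numeric type,
# so pandas falls back to the generic object dtype
mixed = pd.DataFrame({"a": [1.0, 2.0], "label": ["x", "y"]})

print(floats_only.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)        # object
```

If you only need the numeric columns as an array, selecting them first (for example `mixed[["a"]].to_numpy()`) avoids the object fallback.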

.describe() will show a quick summary of your DataFrame:

In [51]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.004550 -0.231714  0.550240  0.872753
std    1.356823  1.037740  1.466000  1.127505
min   -2.661679 -1.000152 -1.254463 -1.086337
25%    0.198380 -0.956235 -0.724216  0.476951
50%    0.357468 -0.800248  0.910768  1.167433
75%    0.476234  0.507198  1.783645  1.575899
max    1.258268  1.275579  1.928971  2.029147


You can also transpose, sort your data by an axis, and sort by values:

In [52]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.489278   -2.661679    1.258268    0.277834    0.171896    0.437102
B    0.909886   -0.700865    1.275579   -0.975103   -1.000152   -0.899630
C   -0.989561    1.928971    1.794955   -1.254463    1.749716    0.071820
D    2.029147    1.608391   -1.086337    1.478424    0.856443    0.350453


In [53]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01  2.029147 -0.989561  0.909886  0.489278
2013-01-02  1.608391  1.928971 -0.700865 -2.661679
2013-01-03 -1.086337  1.794955  1.275579  1.258268
2013-01-04  1.478424 -1.254463 -0.975103  0.277834
2013-01-05  0.856443  1.749716 -1.000152  0.171896
2013-01-06  0.350453  0.071820 -0.899630  0.437102


In [54]:
df.sort_values(by='B')

                   A         B         C         D
2013-01-05  0.171896 -1.000152  1.749716  0.856443
2013-01-04  0.277834 -0.975103 -1.254463  1.478424
2013-01-06  0.437102 -0.899630  0.071820  0.350453
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-01  0.489278  0.909886 -0.989561  2.029147
2013-01-03  1.258268  1.275579  1.794955 -1.086337


You can slice DataFrames very similarly to how you slice Python lists. However, you can also index by the name of a column.

In [55]:
df['A']

2013-01-01    0.489278
2013-01-02   -2.661679
2013-01-03    1.258268
2013-01-04    0.277834
2013-01-05    0.171896
2013-01-06    0.437102
Freq: D, Name: A, dtype: float64

In [56]:
df[0:3]

                   A         B         C         D
2013-01-01  0.489278  0.909886 -0.989561  2.029147
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-03  1.258268  1.275579  1.794955 -1.086337


In [57]:
df['20130102':'20130104']

                   A         B         C         D
2013-01-02 -2.661679 -0.700865  1.928971  1.608391
2013-01-03  1.258268  1.275579  1.794955 -1.086337
2013-01-04  0.277834 -0.975103 -1.254463  1.478424


For multi-axis selection, we normally use the indexers .loc[] and .iloc[] (they can also be used for single-axis selection). Note that they take square brackets rather than parentheses.

In [58]:
#For getting a cross section using a label
df.loc[dates[0]]

A    0.489278
B    0.909886
C   -0.989561
D    2.029147
Name: 2013-01-01 00:00:00, dtype: float64

In [59]:
#Selecting on a multi-axis
df.loc[:, ['A', 'B']]

                   A         B
2013-01-01  0.489278  0.909886
2013-01-02 -2.661679 -0.700865
2013-01-03  1.258268  1.275579
2013-01-04  0.277834 -0.975103
2013-01-05  0.171896 -1.000152
2013-01-06  0.437102 -0.899630


In [60]:
#Label slicing, with both endpoints included
df.loc['20130102':'20130104', ['A', 'B']]

                   A         B
2013-01-02 -2.661679 -0.700865
2013-01-03  1.258268  1.275579
2013-01-04  0.277834 -0.975103


.loc[] is similar to .iloc[], but .iloc[] is used solely with integer positions, whereas .loc[] is used with label names.

In [61]:
df.iloc[3]

A    0.277834
B   -0.975103
C   -1.254463
D    1.478424
Name: 2013-01-04 00:00:00, dtype: float64

In [62]:
df.iloc[3:5, 0:2]

                   A         B
2013-01-04  0.277834 -0.975103
2013-01-05  0.171896 -1.000152
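
The label-versus-position distinction is easiest to see when the index labels are integers that do not start at 0. The sketch below uses a made-up DataFrame `toy` with labels 10, 20, 30.

```python
import pandas as pd

toy = pd.DataFrame({"score": [0.1, 0.2, 0.3]}, index=[10, 20, 30])

by_label = toy.loc[10, "score"]   # label 10 is the first row
by_position = toy.iloc[0, 0]      # position 0 is also the first row
print(by_label, by_position)      # 0.1 0.1

# Label slices include both endpoints; position slices exclude the stop
label_slice = toy.loc[10:20]
position_slice = toy.iloc[0:1]
print(len(label_slice))     # 2
print(len(position_slice))  # 1
```

Here `toy.loc[0]` would raise a KeyError because no row is labeled 0, even though a row at position 0 exists.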


There are various other useful pandas functions, but for reading data in this course, the above will cover most of what you will need. If you ever have any questions or want to learn more about pandas, feel free to ask on Piazza or check out the pandas documentation: https://pandas.pydata.org/docs/user_guide/index.html

#### Exercises

1) Create a dataframe from a random NumPy array of size 5 by 5

In [67]:
## FILL IN YOUR ANSWER HERE ##
x = np.random.randn(25).reshape((5,5))
df3 = pd.DataFrame(x)
df3

          0         1         2         3         4
0  0.439551  0.668973  0.067848  0.959159 -1.425292
1 -1.334186 -1.021507  1.248180  2.121341 -0.267076
2  0.951336 -0.275212 -3.067838  0.099928 -2.657927
3  0.545632 -0.225004  0.875064 -2.513119  0.460493
4 -0.095847  0.168250 -0.351526 -0.198191 -1.175656


2) Sort the DataFrame you created by the first column

In [70]:
## FILL IN YOUR ANSWER HERE ##
df3.sort_values(0,ascending=False)

          0         1         2         3         4
2  0.951336 -0.275212 -3.067838  0.099928 -2.657927
3  0.545632 -0.225004  0.875064 -2.513119  0.460493
0  0.439551  0.668973  0.067848  0.959159 -1.425292
4 -0.095847  0.168250 -0.351526 -0.198191 -1.175656
1 -1.334186 -1.021507  1.248180  2.121341 -0.267076


3) Print the mean of the first row of the DataFrame

In [74]:
## FILL IN YOUR ANSWER HERE ##
np.mean(df3.loc[0,:])

0.14204772332526852

Alright so that was a lot, but you can just enjoy this meme for now, and then move on to the rest of the assignment :)

![Funny Meme](https://pics.me.me/thumb_how-to-make-friends-69448574.png)

### Naive Recommender

For this part of the exercise, you don't have to actually code anything. We'd just like you to read along and follow what we're doing as we construct a naive recommender from scratch. At the end, we'll ask you to determine what type of recommender we created based off the code written.


In [76]:
!pip3 install google.colab


The first thing we'll need to do is mount our Google Drive in this Colab notebook. This will allow us to access the files there. The code below will ask you to approve Colab's access to your Google Drive. Please set the part of DRIVE_PREFIX after '/content/drive/MyDrive/' to where you put this notebook in your Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive/')
DRIVE_PREFIX = '/content/drive/MyDrive/Rec Sys for ML Decal Assignments/Week 1/'


Now we can import the dataset. We'll be using the MovieLens dataset. Before you run this block of code, please make sure that 'ml-latest-small.zip' is in the folder with this notebook. This block of code will open the zip file and extract its contents into a folder called "MovieLens-Data".

In [2]:
import zipfile as zf
# The context manager closes the archive automatically, even if extraction fails
with zf.ZipFile(DRIVE_PREFIX + "ml-latest-small.zip", 'r') as files:
    files.extractall("MovieLens-Data")


We'll be using "movies.csv" and "ratings.csv" from the "MovieLens-Data" folder.

In [None]:
movies = pd.read_csv("MovieLens-Data/ml-latest-small/movies.csv")
movies.head()

In [None]:
ratings = pd.read_csv("MovieLens-Data/ml-latest-small/ratings.csv")
ratings.head()

For ease of use later down the line, let's merge these two DataFrames on the 'movieId' column:

In [None]:
data = ratings.merge(movies, on='movieId', how='left')
data.head(10)
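
To see what `how='left'` does in the merge above, here is a sketch on tiny made-up tables (`toy_ratings` and `toy_movies` are not part of the MovieLens data). Every rating row is kept, and a rating whose movieId has no match in the movies table gets NaN for the title.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],            # note: no entry for movieId 30
    "title": ["Movie A", "Movie B"],
})

# A left merge keeps all rows of the left table (the ratings)
merged = toy_ratings.merge(toy_movies, on="movieId", how="left")
print(merged)
```

With `how='inner'` instead, the rating for movieId 30 would be dropped entirely.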

In this particular example, we want to create a recommender that will take in a movie and suggest similar movies based on how users rated movies in the past. Therefore, it would be more convenient to format our dataset based on each user - we will have a column for each movie in the dataset, and for each user a row that we will populate with their ratings for all the movies in the dataset. Let's make that DataFrame:

In [None]:
by_user = data.pivot_table(index='userId', columns='title', values='rating')
by_user.head()
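
The reshaping done by `pivot_table` can be sketched on a small made-up ratings table (the name `toy` and its values are illustrative only). Each user becomes a row, each title becomes a column, and movies a user never rated show up as NaN.

```python
import pandas as pd

toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B"],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# One row per user, one column per title; the default aggregation is the mean
wide = toy.pivot_table(index="userId", columns="title", values="rating")
print(wide)
```

Since most users rate only a small fraction of all movies, the real `by_user` table is mostly NaN.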

For our naive recommender system, let's just pick a movie to look for similar items with. My pick is 'Star Wars: Episode V - The Empire Strikes Back (1980)' :)

The corrwith method in pandas will find the pairwise correlation between the column passed to it (in this case the column for 'Star Wars: Episode V - The Empire Strikes Back (1980)') and the columns of the dataframe it is called on:

In [None]:
correlations = by_user.corrwith(by_user['Star Wars: Episode V - The Empire Strikes Back (1980)'])
correlations.head(10)
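
Here is a sketch of how `corrwith` behaves on a tiny made-up user-by-movie table. The titles and ratings are invented. Each correlation is computed only over users who rated both movies, and a movie with no such overlap gets NaN.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    "Movie A": [5.0, 4.0, 1.0, 2.0, np.nan],
    "Movie B": [4.0, 5.0, 2.0, 1.0, 3.0],        # tends to agree with Movie A
    "Movie C": [1.0, 2.0, 5.0, np.nan, 4.0],     # tends to disagree with Movie A
    "Movie D": [np.nan, np.nan, np.nan, np.nan, 3.0],  # no user rated both A and D
})

corr = wide.corrwith(wide["Movie A"])
print(corr)

# Dropping NaN removes movies with no overlapping ratings
cleaned = corr.dropna()
print(list(cleaned.index))  # ['Movie A', 'Movie B', 'Movie C']
```

In this toy data Movie B correlates positively with Movie A, Movie C negatively, and Movie D cannot be compared at all.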

Let's clean up these results by putting them in a new DataFrame and removing all the NaN values. We'll also add the total number of ratings for each movie just to get a little more insight.

In [None]:
recommendation = pd.DataFrame(correlations, columns=['correlation'])
recommendation.dropna(inplace=True)
total = pd.DataFrame(data.groupby('title')['rating'].count())
recommendation = recommendation.join(total)
recommendation.head()

Finally, we can just sort this dataframe by the highest correlation. While we're at it, we can make sure that we're only considering movies that have more than 100 ratings.

In [None]:
final_recc = recommendation[recommendation['rating'] > 100].sort_values('correlation', ascending=False).reset_index()
final_recc.head(10)
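
The effect of the rating-count threshold can be sketched with a made-up table. The name `rec`, the titles, and the numbers are all hypothetical. A movie with a high correlation but very few ratings is filtered out before sorting.

```python
import pandas as pd

rec = pd.DataFrame(
    {"correlation": [1.0, 0.9, 0.95, 0.4], "rating": [211, 150, 3, 180]},
    index=pd.Index(["Movie A", "Movie B", "Movie C", "Movie D"], name="title"),
)

min_ratings = 100  # same cutoff as above; the best value depends on the dataset

# Keep well-rated movies only, then rank by correlation
top = (
    rec[rec["rating"] > min_ratings]
    .sort_values("correlation", ascending=False)
    .reset_index()
)
print(top["title"].tolist())  # ['Movie A', 'Movie B', 'Movie D']
```

Movie C is excluded despite its 0.95 correlation, because a correlation computed from only 3 ratings is not very trustworthy.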

Unsurprisingly, the most similar movie to 'Star Wars: Episode V - The Empire Strikes Back (1980)' is itself, with other Star Wars movies from the original trilogy following close behind. Movies with Harrison Ford appear to be the runners-up after that. Interesting!

#### Exercises

1) What type of recommender is this and why?

FILL YOUR ANSWER HERE

2) What problems do you think this recommender has? List at least two.

FILL YOUR ANSWER HERE