### Required Codio Assignment 19.1: Collaborative Filtering

**Expected Time = 90 minutes**

**Total Points = 50**

In this activity, you will use collaborative filtering to predict user ratings. This iterative process starts from our simple reviews dataset and aims to fill in the missing ratings for each user. Your regression models will be built using Scikit-Learn's `LinearRegression` estimator.

### Index


- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

### The Data

Again, you begin with data indexed by artists. You will add random values for `F1` and `F2` and use these to create regression models for each user. Then, tracking the coefficients of those models as user vectors, you fit new artist vectors and repeat the process. The goal remains to predict user ratings of unrated albums.
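
The alternating process described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the graded solution: `solve_factors` is a hypothetical helper, `np.linalg.lstsq` stands in for a no-intercept `LinearRegression`, the ratings are typed in by hand to match `reviews`, and the starting factors use an arbitrary seed, so the numbers will differ from the notebook's.

```python
import numpy as np

def solve_factors(ratings, fixed):
    """For each column of `ratings`, fit no-intercept least squares on the
    rows where that column is observed and return the coefficients."""
    out = []
    for j in range(ratings.shape[1]):
        mask = ~np.isnan(ratings[:, j])          # rows with a rating in column j
        coef, *_ = np.linalg.lstsq(fixed[mask], ratings[mask, j], rcond=None)
        out.append(coef)
    return np.array(out)

# Ratings as they appear in `reviews`: rows are artists, columns are users
R = np.array([
    [3, np.nan, 2, 3, 1],
    [4, 9, 5, np.nan, 1],
    [np.nan, np.nan, 8, 9, np.nan],
    [4, 3, 9, 4, 9],
    [4, 8, np.nan, 9, 5],
], dtype=float)

rng = np.random.default_rng(0)         # arbitrary seed, an assumption of this sketch
item_f = rng.normal(size=(5, 2))       # random starting artist factors (F1, F2)

for _ in range(2):
    user_f = solve_factors(R, item_f)      # users: regress ratings on artist factors
    item_f = solve_factors(R.T, user_f)    # artists: regress ratings on user factors

pred = item_f @ user_f.T               # predicted rating for every artist-user pair
```

Each pass holds one set of factors fixed and solves for the other, which is the same pattern Problems 2 through 5 walk through one step at a time.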

In [2]:
reviews = pd.read_csv('data/user_rated.csv', index_col = 0).iloc[:, :-2].T

In [3]:
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,3.0,,2.0,3.0,1.0
Clint Black,4.0,9.0,5.0,,1.0
Dropdead,,,8.0,9.0,
Anti-Cimex,4.0,3.0,9.0,4.0,9.0
Cardi B,4.0,8.0,,9.0,5.0


[Back to top](#-Index)

### Problem 1

#### Creating F1 and F2

**5 Points**

To begin, create two randomly instantiated vectors `F1` and `F2` as columns in your DataFrame. First set the seed by calling `np.random.seed(42)`, then draw numbers from a normal distribution using `np.random.normal(size = 5)`, once for `F1` and then once for `F2`.
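
As a small illustration of how seeding behaves, the sketch below draws two vectors and then reseeds. The variable names `f1`, `f2`, and `f1_again` are only for this example. Note that `np.random.seed` is a function to call; assigning to it would replace the function instead of seeding the generator.

```python
import numpy as np

np.random.seed(42)                   # seed NumPy's legacy global generator
f1 = np.random.normal(size=5)        # first draw after seeding
f2 = np.random.normal(size=5)        # second draw continues the same stream

np.random.seed(42)                   # reseeding restarts the stream
f1_again = np.random.normal(size=5)  # identical to f1
```

Because the second draw continues the stream, the order in which `F1` and `F2` are drawn matters for reproducing the values shown in the answer check.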

In [4]:
### GRADED
reviews['F1'] = ''
reviews['F2'] = ''

    
### BEGIN SOLUTION
np.random.seed(42)
reviews['F1'] = np.random.normal(size = 5)
reviews['F2'] = np.random.normal(size = 5)
### END SOLUTION

### ANSWER CHECK
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.496714,-0.234137
Clint Black,4.0,9.0,5.0,,1.0,-0.138264,1.579213
Dropdead,,,8.0,9.0,,0.647689,0.767435
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.52303,-0.469474
Cardi B,4.0,8.0,,9.0,5.0,-0.234153,0.54256


[Back to top](#-Index)

### Problem 2

#### Regression models for all users

**10 Points**

Complete the starter code given below to iterate over the first five columns of the `reviews` dataframe. To define `X`, drop the rows where the column `c` is NaN and select the `F1` and `F2` columns. The target variable `y` is set to the column `c` after dropping NaNs.

Next, use `X` and `y` to fit a linear regression model without an intercept to predict values of column `c` based on `F1` and `F2`. Assign this model to the variable `lr`.

Store the coefficients of each linear regression model in the list `uf` and, after the loop, convert this list to a NumPy array.
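
The steps above can be tried on a toy frame first. This is a minimal sketch, assuming a hypothetical DataFrame `toy` with one `rating` column whose observed values were constructed as `2 * F1 + 1 * F2`; none of these numbers come from the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one rating column plus two feature columns
toy = pd.DataFrame({
    'rating': [2.0, np.nan, 7.0, 8.0],
    'F1': [1.0, 5.0, 3.0, 4.0],
    'F2': [0.0, 1.0, 1.0, 0.0],
})

rated = toy.dropna(subset=['rating'])   # keep only the rows that have a rating
X = rated[['F1', 'F2']]                 # features for those rows
y = rated['rating']                     # observed ratings, aligned with X

# fit_intercept=False forces the fitted line through the origin
lr = LinearRegression(fit_intercept=False).fit(X, y)
coefs = lr.coef_                        # one coefficient per feature
```

Here the fit recovers the coefficients used to build the toy ratings, and the row with the missing rating plays no part in the fit.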


In [6]:
### GRADED
uf = '' 
for c in reviews.columns[:5]:
    X = ''
    y = ''
    lr = ''
    coefs = ''
    
    
### BEGIN SOLUTION
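uf = []
for c in reviews.columns[:5]:
    # Features: F1 and F2 for the artists that user c has rated
    X = reviews.dropna(subset = [c])[['F1', 'F2']]
    # Target: user c's observed ratings (same rows as X)
    y = reviews[c].dropna()
    # No intercept, so the two coefficients act as the user's factors
    lr = LinearRegression(fit_intercept=False).fit(X, y)
    coefs = lr.coef_
    uf.append(list(coefs))
# One row per user, one column per factor
uf = np.array(uf)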
### END SOLUTION

### ANSWER CHECK
uf.shape #should be (5, 2)

(5, 2)

[Back to top](#-Index)

### Problem 3

#### New model for artists

**10 Points**

Below, a dataframe `ui_df` is created using the coefficients from the previous problem. Use its `F1` and `F2` columns to build a new model for each artist and track each *artist's* coefficients. Assign the result as a NumPy array to `ifs` below.

HINT: The steps for this problem are similar to the ones in Problem 2.


In [11]:
ui_df = reviews.iloc[:, :-2].T
ui_df['F1'] = uf[:, 0]
ui_df['F2'] = uf[:, 1]
ui_df

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,3.820956,3.395762
Mandy,,9.0,,3.0,8.0,3.710347,7.006197
Lenny,2.0,5.0,8.0,9.0,,7.113263,3.952502
Joan,3.0,,9.0,4.0,9.0,5.240167,10.035759
Tino,1.0,1.0,,9.0,5.0,5.86328,2.197482


In [12]:
### GRADED
ifs = '' 

    
### BEGIN SOLUTION
ifs = []
for c in ui_df.columns[:5]:
    X = ui_df.dropna(subset = [c])[['F1', 'F2']]
    y = ui_df[c].dropna()
    lr = LinearRegression(fit_intercept=False).fit(X, y)
    coefs = lr.coef_
    ifs.append(list(coefs))
ifs = np.array(ifs)
### END SOLUTION

### ANSWER CHECK
ifs.shape

(5, 2)

[Back to top](#-Index)

### Problem 4

#### New model for users

**10 Points**

Below, a dataframe `if_df` is created using the coefficients from our linear models on artists. Use its `F1` and `F2` columns to fit a new model for each user, and assign the resulting array of coefficients to `uf2`.


HINT: The steps for this problem are similar to the ones in Problem 2.

In [14]:
if_df = reviews.copy().iloc[:, :-2]
if_df.loc[:, 'F1'] = ifs[:, 0]
if_df.loc[:, 'F2'] = ifs[:, 1]
if_df

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.16406,0.248037
Clint Black,4.0,9.0,5.0,,1.0,-0.207666,1.421081
Dropdead,,,8.0,9.0,,0.882355,0.436072
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.569998,-0.429204
Cardi B,4.0,8.0,,9.0,5.0,0.570041,0.670451


In [15]:
### GRADED
uf2 = '' 

    
### BEGIN SOLUTION
uf2 = []
for c in if_df.columns[:5]:
    X = if_df.dropna(subset = [c])[['F1', 'F2']]
    y = if_df[c].dropna()
    lr = LinearRegression(fit_intercept=False).fit(X, y)
    coefs = lr.coef_
    uf2.append(list(coefs))
uf2 = np.array(uf2)
### END SOLUTION

### ANSWER CHECK
uf2

array([[3.53046728, 3.4336384 ],
       [4.11783667, 7.26746079],
       [6.91421806, 4.47389919],
       [5.17072815, 9.36768386],
       [6.24342403, 1.6826658 ]])

[Back to top](#-Index)

### Problem 5

#### One more iteration

**5 Points**

Below, a dataframe `ui_df2` is created using the results of `uf2`. Use the features `F1` and `F2` to create regression models for each artist (the first five columns of `ui_df2`) and track the coefficients in `ifs2`.

HINT: The steps for this problem are similar to the ones in Problem 2.

In [17]:
ui_df2 = reviews.copy().iloc[:, :-2].T
ui_df2['F1'] = uf2[:, 0]
ui_df2['F2'] = uf2[:, 1]
ui_df2

Unnamed: 0,Michael Jackson,Clint Black,Dropdead,Anti-Cimex,Cardi B,F1,F2
Alfred,3.0,4.0,,4.0,4.0,3.530467,3.433638
Mandy,,9.0,,3.0,8.0,4.117837,7.267461
Lenny,2.0,5.0,8.0,9.0,,6.914218,4.473899
Joan,3.0,,9.0,4.0,9.0,5.170728,9.367684
Tino,1.0,1.0,,9.0,5.0,6.243424,1.682666


In [19]:
### GRADED
ifs2 = ''

    
### BEGIN SOLUTION
ifs2 = []
for c in ui_df2.columns[:5]:
    X = ui_df2.dropna(subset = [c])[['F1', 'F2']]
    y = ui_df2[c].dropna()
    lr = LinearRegression(fit_intercept=False).fit(X, y)
    coefs = lr.coef_
    ifs2.append(list(coefs))
ifs2 = np.array(ifs2)
### END SOLUTION

### ANSWER CHECK
ifs2

[Back to top](#-Index)

### Problem 6

#### Comparing Models

**10 Points**

The first iteration produced the item factors in `if_df`, and the last produced those in `if_df2`. Use each set of item factors as inputs to a `LinearRegression` model for Alfred and compute the `mean_squared_error` of each model. Which item factors did a better job as inputs to the model: `if_df` or `if_df2`? Assign your answer as a string to `ans6` below.
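
For reference, the mean squared error is the average of the squared differences between observed and predicted values. The sketch below computes it by hand with NumPy; `mse` is a hypothetical helper, and `pred_a` and `pred_b` are made-up predictions used only to show the comparison, not outputs of the models in this problem.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared residuals, the quantity `mean_squared_error` reports."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y = [3.0, 4.0, 4.0, 4.0]         # Alfred's four observed ratings
pred_a = [3.5, 4.0, 4.0, 3.5]    # hypothetical predictions from one model
pred_b = [3.0, 4.5, 3.0, 4.0]    # hypothetical predictions from another model

mse_a = mse(y, pred_a)           # (0.25 + 0 + 0 + 0.25) / 4 = 0.125
mse_b = mse(y, pred_b)           # (0 + 0.25 + 1 + 0) / 4 = 0.3125
better = 'a' if mse_a < mse_b else 'b'
```

The model with the smaller value fits the observed ratings more closely, which is the comparison `ans6` asks for.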

In [20]:
if_df2 = reviews.copy().iloc[:, :-2]
if_df2.loc[:, 'F1'] = ifs2[:, 0]
if_df2.loc[:, 'F2'] = ifs2[:, 1]
if_df2

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.129583,0.292646
Clint Black,4.0,9.0,5.0,,1.0,-0.178546,1.350443
Dropdead,,,8.0,9.0,,0.832828,0.501049
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.579597,-0.456446
Cardi B,4.0,8.0,,9.0,5.0,0.601892,0.668922


In [21]:
if_df.to_csv('data/Q.csv')
ui_df.to_csv('data/P.csv')

In [22]:
(ui_df[['F1', 'F2']]@if_df[['F1', 'F2']].T).T

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,1.46914,2.346515,2.147366,3.348941,1.506984
Clint Black,4.032171,9.185859,4.139643,13.173421,1.905196
Dropdead,4.852237,6.329049,8.0,9.0,6.131757
Anti-Cimex,4.541419,2.81815,9.47138,3.919665,8.262171
Cardi B,4.454797,6.812366,6.704814,9.715601,4.815617
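
The product above yields a predicted rating for every artist-user pair, including pairs that already have a rating. A minimal sketch of using the predictions only for the gaps, with Alfred's column copied by hand from the tables in this notebook (the names `observed`, `predicted`, and `filled` are illustrative):

```python
import numpy as np

# Alfred's column, in the artist order used by `reviews`
observed = np.array([3.0, 4.0, np.nan, 4.0, 4.0])     # NaN marks an unrated artist
predicted = np.array([1.46914, 4.032171, 4.852237, 4.541419, 4.454797])

# Keep real ratings where they exist; use predictions only for the gaps
filled = np.where(np.isnan(observed), predicted, observed)
```

Here only the Dropdead entry changes, since it is the one artist Alfred has not rated.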


In [23]:
reviews

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino,F1,F2
Michael Jackson,3.0,,2.0,3.0,1.0,0.496714,-0.234137
Clint Black,4.0,9.0,5.0,,1.0,-0.138264,1.579213
Dropdead,,,8.0,9.0,,0.647689,0.767435
Anti-Cimex,4.0,3.0,9.0,4.0,9.0,1.52303,-0.469474
Cardi B,4.0,8.0,,9.0,5.0,-0.234153,0.54256


In [24]:
(ui_df2[['F1', 'F2']]@if_df2[['F1', 'F2']].T).T

Unnamed: 0,Alfred,Mandy,Lenny,Joan,Tino
Michael Jackson,1.462328,2.660394,2.205232,3.411453,1.301465
Clint Black,4.006581,9.079066,4.807237,11.727308,1.157604
Dropdead,4.660695,7.070807,8.0,9.0,6.042798
Anti-Cimex,4.009446,3.187322,8.879586,3.891828,9.094049
Cardi B,4.421796,7.339858,7.154301,9.378471,4.883437


In [25]:
if_df[['F1', 'F2']]

Unnamed: 0,F1,F2
Michael Jackson,0.16406,0.248037
Clint Black,-0.207666,1.421081
Dropdead,0.882355,0.436072
Anti-Cimex,1.569998,-0.429204
Cardi B,0.570041,0.670451


In [26]:
from sklearn.metrics import mean_squared_error

In [27]:
### GRADED
ans6 = ''

    
### BEGIN SOLUTION
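# Item factors for the artists Alfred rated, from the first and last iterations
X1 = if_df.dropna(subset = ['Alfred'])[['F1', 'F2']]
X2 = if_df2.dropna(subset = ['Alfred'])[['F1', 'F2']]
# Alfred's observed ratings (the same in if_df and if_df2)
y = if_df['Alfred'].dropna()
# These models keep the default intercept, unlike the no-intercept fits in Problems 2-5
lr1 = LinearRegression().fit(X1, y)
mse1 = mean_squared_error(y, lr1.predict(X1))
lr2 = LinearRegression().fit(X2, y)
mse2 = mean_squared_error(y, lr2.predict(X2))
# The lower training MSE wins; a tie goes to if_df
ans6 = 'if_df2' if mse2 < mse1 else 'if_df'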
### END SOLUTION

### ANSWER CHECK
ans6

'if_df'