# Workshop Outline

## Intro to club 
Welcome back, club directors intro, GitHub account

## Intro to Spring Semester 
Goal: work on some data science projects together  
Objectives: write code collaboratively while learning about different data science techniques and tools  
Workshop list

## Data Science Project Components
Pieces to the project puzzle  
1. Data collection 
2. Data cleaning
3. Picking an analysis model
4. Running the model
5. Interpreting the results  

How do the pieces fit together? 

## User Based Collaborative Filtering with Movie Ratings
Goal: 2 parts - (1) suggest movies to a user based on their ratings and the items consumed by similar users, and (2) predict ratings in a test set based on each user's ratings and the items consumed by similar users

These are called "TopN" and "Prediction" tasks in Recommender Systems, respectively. 
We'll focus on the prediction task. 

### Steps:

1. Find a database of movie ratings (e.g., https://grouplens.org/datasets/movielens/100k/)
    - The dataset needs at least (user, item, rating) tuples
2. Split train/test sets
    - If using the link above, the data is already split into stable 5-fold CV datasets
    - Ultimate objective: for each observation in the test set, we want a rating prediction given the user and item
3. Build user-by-item matrix
    - Reshape the original dataset
    - Each rating value goes in the matching user-by-item matrix element
    - Missing values are NAs (can impute 0)
4. Choose a similarity measure
    - Popular options: Pearson Correlation Coefficient (pairwise comparison or imputed zero), Cosine Vector Similarity
    - These can get exotic
5. Produce user-user similarity matrix
    - For Pearson/Cosine, think: user-by-item x item-by-user => user-user
6. For each test set observation (user_u, item_i, rating_ui)
    - Find the nearest neighbors of user_u that have rated item_i
    - Select k of those neighbors
    - Average (or take a weighted average of) the ratings of those neighbors to get the prediction r_ui
    - Error? How do we measure error between the r_ui prediction and rating_ui? RMSE, MAE, something else?
7. Abstract the previous step to loop over k values
    - It's common to use fine-grained values of k between 1 and 20
    - Often go out as far as 100-150, in increments of 10-25
8. Produce a plot corresponding to step 7, with k on the x-axis and error on the y-axis
    - Optimal k values?
    - Describe the relationship
9. Rinse and repeat for all train/test splits
    - If you had to build a production-quality system, how would you choose k? Would you choose k?
    - Repeat the experiment with other similarity measures
    - Anything surprising?
    - Bottlenecks?
    - Effects of imputing?
    - Could compute item-item similarity instead (in step 5), thoughts?
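
Steps 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the helper names (`predict_rating`, `rmse_for_k`), the toy matrices, and the fallback to the user's own mean rating when no neighbor has rated the item are all assumptions. It expects a zero-filled user-by-item matrix and a user-user similarity matrix, both with 0-based indices.

```python
import numpy as np

def predict_rating(R, sim, u, i, k, weighted=False):
    """Predict user u's rating of item i from the k most similar users who rated i."""
    raters = np.flatnonzero(R[:, i] > 0)          # users who rated item i
    raters = raters[raters != u]                  # never use the target user
    raters = raters[~np.isnan(sim[u, raters])]    # drop undefined similarities
    if raters.size == 0:
        rated = R[u][R[u] > 0]
        return float(rated.mean()) if rated.size else float("nan")  # assumed fallback
    order = np.argsort(sim[u, raters])[::-1][:k]  # most similar first, keep k
    neighbors = raters[order]
    ratings = R[neighbors, i]
    if weighted:
        w = np.clip(sim[u, neighbors], 0, None)   # ignore negative similarity
        if w.sum() > 0:
            return float(np.dot(w, ratings) / w.sum())
    return float(ratings.mean())

def rmse_for_k(R, sim, test, k):
    """RMSE over (user, item, rating) test tuples for one value of k."""
    errors = [predict_rating(R, sim, u, i, k) - r for u, i, r in test]
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy training matrix (0 = unrated) and toy test tuples, both made up
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
    [5.0, 4.0, 1.0, 0.0],
])
sim = np.corrcoef(R)                 # Pearson with imputed zeros
test = [(1, 1, 3.0), (0, 2, 1.0)]

# Step 7: repeat the prediction for several k values
errors_by_k = {k: rmse_for_k(R, sim, test, k) for k in (1, 2, 3)}
```

Plotting the keys of `errors_by_k` against its values gives the plot from step 8. Swapping `np.square` for `np.abs` (and dropping the square root) would report MAE instead of RMSE.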


### Goal for today: get movie ratings data and reshape it to calculate similarity

1. Find a collection of movie ratings
2. Download the information
3. Inspect how the information is stored
4. Prepare data
5. Explore similarity scores

In [1]:
import pandas as pd
import numpy as np

# read in the data file - the first argument is the path to the file you want to load
# u1.base is tab-separated with no header row (columns: user id, item id, rating, timestamp)
data = pd.read_table('u1.base', sep='\t', header=None)
# don't forget to read in u1.test as well!

In [3]:
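# translate data into a numpy array where
# each row holds the ratings of one user,
# each column holds the ratings of one movie across all users
# this is typically referred to as a "ratings matrix" or "user-by-item rating matrix"
# columns 0, 1, 2 of data are user id, item id, rating; ids start at 1, so subtract 1
# unrated entries stay 0 (imputed zero)
# note: the matrix is sized from the training ids, so a larger id in the test set would not fit
data_array = np.zeros((data[0].max(), data[1].max()))
data_array[data[0].to_numpy() - 1, data[1].to_numpy() - 1] = data[2].to_numpy()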
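
The cell above fills unrated entries with 0. As a minimal sketch of the alternative from step 3 (keep missing ratings as NAs), the same reshape can be done with a pandas pivot. The tiny `ratings` frame and its column names are made up for illustration; with the real file you would pivot `data` on its columns 0, 1, and 2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same (user, item, rating) layout as u1.base
ratings = pd.DataFrame({
    "user":   [1, 1, 2, 3, 3],
    "item":   [1, 2, 2, 1, 3],
    "rating": [5, 3, 4, 2, 1],
})

# One row per user, one column per item; unrated pairs become NaN
ratings_matrix = ratings.pivot(index="user", columns="item", values="rating")

# Imputing 0 afterwards reproduces the zero-filled matrix
zero_filled = ratings_matrix.fillna(0).to_numpy()
```

One thing to watch: the pivot only creates rows and columns for ids that actually appear in the data, so a column's position is not guaranteed to equal the item id minus 1.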

In [4]:
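# calculate the user-user Pearson correlation coefficient matrix
# np.corrcoef treats each row (user) as a variable; zeros from unrated movies count as ratings here
data_corr = np.corrcoef(data_array)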
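
`np.corrcoef` on the zero-filled matrix is the "imputed zero" flavor of Pearson. As a rough sketch of the other two options named in step 4, the helpers below compute a pairwise Pearson correlation over co-rated items only, and a cosine similarity matrix from the zero-filled matrix. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def pairwise_pearson(u, v):
    """Pearson correlation over items both users rated (0 = unrated)."""
    both = (u > 0) & (v > 0)
    if both.sum() < 2:
        return np.nan  # not enough overlap to correlate
    a, b = u[both], v[both]
    if a.std() == 0 or b.std() == 0:
        return np.nan  # a constant vector has no defined correlation
    return np.corrcoef(a, b)[0, 1]

def cosine_similarity_matrix(R):
    """User-user cosine similarity: (user-by-item) x (item-by-user), normalized."""
    norms = np.linalg.norm(R, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
    return (R @ R.T) / np.outer(norms, norms)

# Toy zero-filled ratings matrix: 3 users, 4 items
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 2.0, 0.0, 1.0],
    [1.0, 0.0, 5.0, 4.0],
])
cos = cosine_similarity_matrix(R)
p01 = pairwise_pearson(R[0], R[1])  # uses only the 3 items both users rated
```

The pairwise version has to be called once per pair of users, so filling a full user-user matrix with it is much slower than the single matrix product used for cosine.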

Next steps: steps 6-9 from the list above.

