# Collaborative Filtering Project
## Intro to Machine Learning
### Thomas Cazort
---

## Setup:

In [17]:
from collections import defaultdict
from scipy.stats import pearsonr
import numpy as np
import math
import statistics

Store similarities in the dictionary SIM. This works like a sparse matrix: only the entries we explicitly set are stored.

Store ratings in the dictionary ITM:

In [11]:
SIM = defaultdict(dict)
ITM = defaultdict(dict)

ITM[*m*][*u*] stores the rating that user *u* gave movie *m*

SIM[*m1*][*m2*] stores the similarity score between movies *m1* and *m2*
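
As a small illustration of this layout, the sketch below fills a toy ratings dictionary and looks up the users who rated two movies in common. The sample lines and the names `sample_lines`, `toy_itm`, and `both` are made up for this example; it assumes each input line has the form `movie,user,rating`.

```python
from collections import defaultdict

# Made-up ratings, one "movie,user,rating" record per line (assumed format)
sample_lines = ["1,10,4.0", "1,11,3.0", "2,10,5.0"]

toy_itm = defaultdict(dict)
for line in sample_lines:
    m, u, r = line.strip().split(",")
    toy_itm[int(m)][int(u)] = float(r)

# Rating of movie 1 by user 10
print(toy_itm[1][10])

# Users who rated both movie 1 and movie 2
both = toy_itm[1].keys() & toy_itm[2].keys()
print(both)
```

Keying by movie first keeps all ratings of one movie together, which is what an item-based similarity needs.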

In [20]:
# Each line is "movie,user,rating"; the file is closed automatically
with open("netflix-small/ratings-train.txt") as ifile:
    for line in ifile:
        parts = line.strip().split(",")
        ITM[int(parts[0])][int(parts[1])] = float(parts[2])

### Similarity Computation:

Compute similarity between *i* and *j* and store this value in SIM[*i*][*j*]

I will be using the correlation-coefficient formula described in class.
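
Written out, the quantity the next cell computes appears to be the following, where $U$ is assumed to be the set of users who rated both movies, $r_{ui}$ is user $u$'s rating of movie $i$, and $\bar{r}_i$ is the mean of all ratings of movie $i$:

$$
w_{ij} = \frac{\sum_{u \in U} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U} (r_{uj} - \bar{r}_j)^2}}
$$

When the denominator is zero, the code sets $w_{ij} = 0$.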

In [25]:
for i in ITM.keys():
    # riBar: mean rating of movie i (computed once per i):
    riBar = statistics.mean(list(ITM[i].values()))
    for j in ITM.keys():
        if i == j:
            continue
        # rjBar: mean rating of movie j:
        rjBar = statistics.mean(list(ITM[j].values()))
        # SUM u e U (users who rated both i and j):
        numer = 0
        denomP1 = 0
        denomP2 = 0
        for u in ITM[i].keys():
            # Skip users who did not rate movie j:
            if u not in ITM[j]:
                continue
            rui, ruj = ITM[i][u], ITM[j][u]
            # Compute the Numerator of the Equation:
            numer += (rui - riBar) * (ruj - rjBar)
            # First part of Denominator:
            denomP1 += (rui - riBar)**2
            # Second part:
            denomP2 += (ruj - rjBar)**2
        # Compute similarity (0 if the denominator vanishes):
        denom = math.sqrt(denomP1) * math.sqrt(denomP2)
        wij = 0
        if denom != 0:
            wij = numer / denom
        # Add to SIM:
        SIM[i][j] = wij
SIM

defaultdict(dict,
            {25: {33: 0.9177056144534772,
              18: 0.8835061800196984,
              12: 0.9334372050866899,
              9: 1.0,
              11: 0.9991360689531967,
              1: 0.20450264387930367,
              8: 0.99992138430618,
              5: 0.9912132036892201,
              7: 0.999986294517121,
              32: 0.9702412474100943,
              23: 0.5294680844487841,
              31: 1.0000000000000002,
              26: 0.9482416757849285,
              34: 0.9720556948454189,
              21: 0.9306548308955052,
              13: 0.509425222262211,
              2: 0.745505108434702,
              14: 0.9857518312606128,
              17: 0.7470068800735559,
              3: 0.9958327051633817,
              24: 0.9563885923271341,
              4: 0.7771229997378502,
              6: 0.5303635038144766,
              27: 0.86675499729854,
              29: 1.0,
              20: 0,
              22: 0.9389989674685002,
              

## Testing:

In [14]:
with open("netflix-small/ratings-test.txt") as ifile:
    for line in ifile:
        parts = line.strip().split(",")
        # Cast IDs to int so they match the keys used in ITM and SIM
        movie = int(parts[0])
        user = int(parts[1])
        truerating = float(parts[2])

### K Neighbors:

Find the K nearest neighbors of movie from the weights stored in SIM:
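
A minimal sketch of this step, assuming "nearest" means the K largest similarity weights. The helper name `k_neighbors` and the toy weights are hypothetical and not part of the original notebook.

```python
import heapq

def k_neighbors(sim, movie, k):
    """Return up to k (neighbor, weight) pairs with the largest weights."""
    # .get avoids inserting a new key when sim is a defaultdict
    weights = sim.get(movie, {})
    return heapq.nlargest(k, weights.items(), key=lambda kv: kv[1])

# Toy similarity table with made-up values
toy_sim = {1: {2: 0.9, 3: 0.2, 4: 0.75}}
top2 = k_neighbors(toy_sim, 1, 2)
print(top2)  # [(2, 0.9), (4, 0.75)]
```

In the notebook this would be called as `k_neighbors(SIM, movie, K)` for each test line, with `K` chosen by the reader.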

### Prediction:

Predict the rating by user from the user's own ratings of the K neighbors:
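
One common way to do this is a similarity-weighted average of the user's ratings. The sketch below is an assumption rather than the formula from class: it only considers neighbors with a positive weight that the user has actually rated, and it falls back to a hypothetical `default` rating when no such neighbor exists. The name `predict_rating` and the toy data are made up for illustration.

```python
def predict_rating(itm, sim, movie, user, k, default=3.0):
    """Weighted average of the user's ratings over the top-k similar movies."""
    # (weight, rating) pairs for similar movies this user has rated
    candidates = [
        (w, itm[j][user])
        for j, w in sim.get(movie, {}).items()
        if w > 0 and user in itm.get(j, {})
    ]
    # Keep the k pairs with the largest similarity weight
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    total = sum(w for w, _ in top)
    if total == 0:
        return default  # assumed fallback when there is nothing to average
    return sum(w * r for w, r in top) / total

# Toy data with made-up values
toy_itm = {2: {10: 5.0}, 3: {10: 1.0}}
toy_sim = {4: {2: 0.75, 3: 0.25}}
pred = predict_rating(toy_itm, toy_sim, movie=4, user=10, k=2)
print(pred)  # 4.0
```

Restricting the neighbors to movies the user has rated means fewer than K movies may contribute to a prediction.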

### MSE:

Compute the Mean-Squared Error between the true and predicted ratings:
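
A minimal sketch of the error computation, assuming the true and predicted ratings have been collected into two equal-length lists while looping over the test file. The function name `mean_squared_error` and the sample values are hypothetical.

```python
def mean_squared_error(true_ratings, predicted_ratings):
    """Average of squared differences between true and predicted ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("rating lists must have the same length")
    if not true_ratings:
        raise ValueError("need at least one rating")
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return sum(squared) / len(squared)

# Made-up example: errors of 1 and -2 give (1 + 4) / 2
mse = mean_squared_error([4.0, 2.0], [3.0, 4.0])
print(mse)  # 2.5
```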