#### Lab 02 - Collaborative Filtering (User, Film, Rating)

Maintain User Engagement by Generating Recommendations from Other Users' Preferences

Tackle the High Sparsity Rate by Applying Minimum Interaction Thresholds to Films and Users

#### What Will We Do ?

Pivot the Rating Table

Limit the Films and the Users by Interaction Count to Decrease the Sparsity Rate

Join the Rating and Film Tables

Create a Model to Recommend Films

In [1]:
import warnings, pandas

warnings.filterwarnings("ignore")

In [2]:
# Use Rating Table

rating = pandas.read_table("ratings.csv", sep=",")

rating.iloc[:5]

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [3]:
rating.tail()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352
100835,610,170875,3.0,1493846415


In [4]:
null = rating.isnull().sum()

null.sum()

0

In [5]:
total = rating.shape[0]

total

100836

In [6]:
total_film = rating["MovieID"].nunique()

total_film

9724

In [7]:
total_user = rating["UserID"].nunique()

total_user

610

In Our Rating Table, There Are 610 Users & 9,724 Films

In [8]:
rata_rata_rating = rating["Rating"].mean()

rata_rata_rating

3.501556983616962

Across 100,836 User - Film Interactions, the Average Rating is 3.5 out of 5.0

Pivot the Rating Table and Compute the Sparsity Rate

In [9]:
# Create Rating Table

rating_table = rating.pivot_table(index="MovieID", columns="UserID", values="Rating")

# Create Sparse Rate Function to Check Sparse Rate

SparseRate = lambda table : round((table.isnull().sum().sum() / table.size) * 100, 3)

SparseRate(rating_table)

98.3

Our 98.3 % Sparsity Rate Can Signal Several Things.

* Our Table Mostly Consists of Missing Values (Filled with 0 Later On)

* Most Films Haven't Been Watched by Most Users

* Some Films Were Watched, but the User Didn't Rate Them

* Many Films Were Watched Fewer than 10 Times, and Many Users Watched Fewer than 50 Films
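To make the sparsity figure concrete, here is a minimal sketch on a made-up table. The `toy` ratings are hypothetical and only illustrate the calculation that `SparseRate` performs: count the empty cells of the pivot and divide by the total number of cells.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 4 films, 5 ratings in total
toy = pd.DataFrame({
    "UserID":  [1, 1, 2, 3, 3],
    "MovieID": [10, 20, 10, 30, 40],
    "Rating":  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Same layout as rating_table: films as rows, users as columns
toy_table = toy.pivot_table(index="MovieID", columns="UserID", values="Rating")

# Sparsity = share of empty (NaN) cells in the pivot, as a percentage
missing = int(toy_table.isnull().sum().sum())
sparsity = round(missing / toy_table.size * 100, 3)

print(toy_table.shape)    # (4, 3): 12 cells in total
print(missing, sparsity)  # 7 empty cells out of 12
```

With 5 ratings spread over 12 cells, 7 cells stay empty, so even this tiny table is more than half empty.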

Our Solution is to Set a Minimum Threshold on Film - User Interactions

The First Threshold is on Films. Our Minimum Threshold for a Film is

*This Film Must Be Rated More than 10 Times (the Code Below Uses a Strict `>`)*

In [10]:
minimum_film_interaction = 10

WatchFilm = rating.groupby("MovieID")["Rating"].agg(["count", "mean"])

WatchFilm.columns = ["TotalWatch", "Rating"]

WatchFilm = WatchFilm[WatchFilm["TotalWatch"] > minimum_film_interaction]

rating = rating[rating["MovieID"].isin(WatchFilm.index)]

rating["MovieID"].nunique()

2121

In [11]:
rating_table = rating.pivot_table(index="MovieID", columns="UserID", values="Rating")

SparseRate(rating_table)

93.845

The Last Threshold is on Users. Our Minimum Threshold for a User is

*This User Must Have Watched & Rated More than 50 Films (the Code Below Uses a Strict `>`)*

In [12]:
minimum_user_interaction = 50

UserWatch = rating.groupby("UserID")[["Rating"]].count()

UserWatch = UserWatch[UserWatch["Rating"] > minimum_user_interaction]

rating = rating[rating["UserID"].isin(UserWatch.index)]

In [13]:
rating_table = rating.pivot_table(index="MovieID", columns="UserID", values="Rating")

SparseRate(rating_table)

90.396

Our Final Sparsity Rate is 90.4 %
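Both thresholds follow the same pattern: count how often each key appears, then keep the rows whose key is frequent enough. The sketch below shows that pattern on made-up data. The helper `filter_by_count` and the `toy` frame are hypothetical and not part of the lab.

```python
import pandas as pd

def filter_by_count(frame, column, minimum):
    """Keep rows whose `column` value appears more than `minimum` times."""
    counts = frame[column].map(frame[column].value_counts())
    return frame[counts > minimum]

# Hypothetical toy ratings
toy = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 3],
    "MovieID": [10, 20, 30, 10, 20, 10],
    "Rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})

# Film threshold first, then user threshold, in the same order as the lab
step1 = filter_by_count(toy, "MovieID", 1)   # drops film 30 (rated once)
step2 = filter_by_count(step1, "UserID", 1)  # drops user 3 (one rating left)

print(step2)
```

Because the user filter removes rows after the film filter has run, a film can fall back under the film threshold. Whether to repeat both steps until nothing changes is a design choice that this lab does not make.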

Create the Film Table to Join with the Rating Table & Perform Feature Engineering

In [14]:
# Load the Movies Table as table_film

table_film = pandas.read_table("movies.csv", sep=",")

table_film = table_film.drop_duplicates("Title")

# Perform Genre Feature Engineering

table_film["Genre"] = table_film["Genre"].apply(lambda i : " ".join(i.split("|")))

table_film = table_film[table_film["Genre"] != "(no genres listed)"]

# Replace Hyphenated Genre Names so Each Genre Stays a Single Word

replacer = {"Sci-Fi":"SciFi", "Film-Noir":"FilmNoir"}

for i, t in replacer.items():

  table_film["Genre"] = table_film["Genre"].str.replace(i, t, regex=False)

table_film.iloc[:5]

Unnamed: 0,MovieID,Title,Genre
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,2,Jumanji (1995),Adventure Children Fantasy
2,3,Grumpier Old Men (1995),Comedy Romance
3,4,Waiting to Exhale (1995),Comedy Drama Romance
4,5,Father of the Bride Part II (1995),Comedy


In [15]:
# Join the Rating and Film Tables for Further Use

rating = rating.join(table_film.set_index("MovieID"), on="MovieID")

rating = rating.drop("Timestamp", axis="columns")

rating.iloc[:5]

Unnamed: 0,UserID,MovieID,Rating,Title,Genre
0,1,1,4.0,Toy Story (1995),Adventure Animation Children Comedy Fantasy
1,1,3,4.0,Grumpier Old Men (1995),Comedy Romance
2,1,6,4.0,Heat (1995),Action Crime Thriller
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),Mystery Thriller
4,1,50,5.0,"Usual Suspects, The (1995)",Crime Mystery Thriller


In [16]:
rating_table.iloc[:5, :8]

UserID,1,4,6,7,10,11,15,16
MovieID,,,,,,,,
1,4.0,,,4.5,,,2.5,
2,,,4.0,,,,,
3,4.0,,5.0,,,,,
5,,,5.0,,,,,
6,4.0,,4.0,,,5.0,,


In [17]:
# Fill Missing Value with 0

rating_table = rating_table.fillna(0)

rating_table.iloc[:5, :8]

UserID,1,4,6,7,10,11,15,16
MovieID,,,,,,,,
1,4.0,0.0,0.0,4.5,0.0,0.0,2.5,0.0
2,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0


In [18]:
rating_table.index.name

'MovieID'

In [19]:
len(rating_table.index)

2121

In [20]:
rating_table.columns.name

'UserID'

In [21]:
len(rating_table.columns)

352

What is a Sparse Matrix ? A Sparse Matrix is a Matrix in Which Most of the Values are 0

Sparse Matrices are Commonly Used in Collaborative Filtering Recommendation Systems
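As an illustration of why this format suits a mostly empty table, the sketch below builds a small `csr_matrix` from a made-up dense array and inspects what is actually stored. The `dense` values are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 films x 4 users table, 0 meaning "no rating"
dense = np.array([
    [4.0, 0.0, 0.0, 4.5],
    [0.0, 0.0, 4.0, 0.0],
    [4.0, 0.0, 5.0, 0.0],
])

sparse = csr_matrix(dense)

print(sparse.nnz)      # number of stored (non-zero) entries
print(sparse.data)     # the stored values, row by row
print(sparse.indices)  # column position of each stored value
print(sparse.indptr)   # where each row starts inside data / indices
```

Only the 5 non-zero ratings are kept, together with their column positions and row boundaries. The 7 zeros are implied rather than stored.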

In [22]:
from scipy.sparse import csr_matrix

rating_matrix = csr_matrix(rating_table.values)

rating_matrix.shape

(2121, 352)

In [23]:
rating_matrix.toarray()[:5, :8]

array([[4. , 0. , 0. , 4.5, 0. , 0. , 2.5, 0. ],
       [0. , 0. , 4. , 0. , 0. , 0. , 0. , 0. ],
       [4. , 0. , 5. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 5. , 0. , 0. , 0. , 0. , 0. ],
       [4. , 0. , 4. , 0. , 0. , 5. , 0. , 0. ]])

In [24]:
rating_matrix.toarray()[0, :10]

array([4. , 0. , 0. , 4.5, 0. , 0. , 2.5, 0. , 4.5, 3.5])

In [25]:
# Reset Rating Table

rating_reset = rating_table.reset_index()

rating_reset.shape

(2121, 353)

In [26]:
rating_reset.iloc[:10]

UserID,MovieID,1,4,6,7,10,11,15,16,17,...,600,601,602,603,604,605,606,607,608,610
0,1,4.0,0.0,0.0,4.5,0.0,0.0,2.5,0.0,4.5,...,2.5,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,5.0
1,2,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0
2,3,4.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
3,5,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.5,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
4,6,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,5.0
5,7,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,3.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0
6,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,10,0.0,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
8,11,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,2.5,3.0,0.0,0.0
9,12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
rating_reset.columns.name

'UserID'

In [28]:
rating_reset.index.name

Resetting the Rating Table Creates a Table in Which

* MovieID Becomes a Regular Column (the Column Axis is Still Named UserID)

* The Index is Just a Range from 0 to 2120, Which Matches the Row Positions in the Rating Matrix
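The effect of the reset can be sketched on a tiny made-up pivot. The names `toy_table` and `toy_reset` are hypothetical. The point is that, after the reset, the index label of a row equals its position, so a MovieID can be translated into a row number of the matrix.

```python
import pandas as pd

# Hypothetical pivot: MovieID as index, UserID as columns
toy_table = pd.DataFrame(
    [[4.0, 0.0], [0.0, 5.0], [3.0, 0.0]],
    index=pd.Index([1, 3, 6], name="MovieID"),
    columns=pd.Index([1, 4], name="UserID"),
)

toy_reset = toy_table.reset_index()

# Row position of MovieID 6, usable to index a matrix built from toy_table.values
position = toy_reset[toy_reset["MovieID"] == 6].index[0]

print(toy_reset)
print(position)
```

The reverse lookup works the same way: `toy_reset.iloc[position, 0]` gives back the MovieID stored in the first column.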

In [29]:
len(rating_reset.index)

2121

Create a Nearest Neighbors Model to Find Similar Films

Each Film is Represented by Its Row in the Rating Matrix, i.e. the List of Ratings from All Users

Similarity is Measured with Cosine Distance
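For intuition, cosine distance is 1 minus the cosine of the angle between two rating vectors. The sketch below computes it by hand with NumPy on made-up vectors. The function `cosine_distance` and the three `film_*` arrays are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity of two rating vectors."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

film_a = np.array([4.0, 0.0, 5.0, 0.0])
film_b = np.array([2.0, 0.0, 2.5, 0.0])  # same users, half the scores
film_c = np.array([0.0, 3.0, 0.0, 4.0])  # no user in common with film_a

print(cosine_distance(film_a, film_b))  # close to 0: same direction
print(cosine_distance(film_a, film_c))  # 1: no overlap at all
```

Because missing ratings are filled with 0, only users who rated both films contribute to the dot product. The length of a vector does not change the result, so a film with uniformly lower scores from the same users still counts as similar.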

In [None]:
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(metric="cosine", algorithm="brute")

model.fit(rating_matrix)

## `Main Model`

Creating Model Step by Step

### 🕵 Challenges

* Find Movies Similar to `Avengers, The (2012)`

* Find Movies Similar to `Guardians of the Galaxy (2014)`

* Find Movies Similar to `Forrest Gump (1994)`

* Find Movies Similar to `Toy Story (1995)`

In [31]:
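input_title = "Avengers"

# Find Every Title that Contains the Search Text

movies_indices = table_film[table_film["Title"].str.contains(input_title)]

# Pick the Third Match, Which is "Avengers, The (2012)" in This Dataset

movies_indices = movies_indices.iloc[2]['MovieID']

table_film[table_film["MovieID"] == movies_indices]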

Unnamed: 0,MovieID,Title,Genre
7693,89745,"Avengers, The (2012)",Action Adventure SciFi IMAX


In [32]:
rating[rating["MovieID"] == movies_indices].iloc[:8]

Unnamed: 0,UserID,MovieID,Rating,Title,Genre
1535,15,89745,2.0,"Avengers, The (2012)",Action Adventure SciFi IMAX
2182,18,89745,4.0,"Avengers, The (2012)",Action Adventure SciFi IMAX
3520,21,89745,4.0,"Avengers, The (2012)",Action Adventure SciFi IMAX
7891,52,89745,5.0,"Avengers, The (2012)",Action Adventure SciFi IMAX
9030,62,89745,4.0,"Avengers, The (2012)",Action Adventure SciFi IMAX
9400,63,89745,3.5,"Avengers, The (2012)",Action Adventure SciFi IMAX
11555,68,89745,4.5,"Avengers, The (2012)",Action Adventure SciFi IMAX
11950,73,89745,4.0,"Avengers, The (2012)",Action Adventure SciFi IMAX


In [33]:
# Convert the MovieID into Its Row Position in rating_reset / rating_matrix

movies_indices = rating_reset[rating_reset["MovieID"] == movies_indices]

movies_indices = movies_indices.index[0]

movies_indices

1961

In [34]:
rating_reset.iloc[1960:1965, :10]

UserID,MovieID,1,4,6,7,10,11,15,16,17
1960,89492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1961,89745,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
1962,89774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1963,89864,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1964,89904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
total_limit = 10

# Ask for One Extra Neighbor because the Nearest One is the Film Itself

total_limit += 1

distances, indices = model.kneighbors(rating_matrix[movies_indices], n_neighbors=total_limit)

sorter = list(zip(indices.flatten().tolist(), distances.flatten().tolist()))

indices = sorted(sorter, key=lambda x: x[1])

indices[:5]

[(1961, 0.0),
 (2060, 0.23810542775945065),
 (1821, 0.2771270484831855),
 (1907, 0.29182116047789386),
 (1970, 0.2938930732962426)]

In [36]:
molist = [i[0] for i in indices]

molist[:5]

[1961, 2060, 1821, 1907, 1970]

Look Up the MovieID for Each Row Position in Molist

In This Case the MovieIDs are 89745, 112852, etc
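The whole chain, from fitted model to row positions to film IDs, can be sketched on a small made-up matrix. The IDs in `toy_ids` and the ratings in `toy_matrix` are hypothetical. The model settings are the ones used in this lab.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical film IDs, one per matrix row
toy_ids = [101, 205, 330, 412]
toy_matrix = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 5.0],
    [5.0, 0.0, 0.0, 1.0],
]))

toy_model = NearestNeighbors(metric="cosine", algorithm="brute")
toy_model.fit(toy_matrix)

# Query with row 0; the result holds row positions, not film IDs
distances, positions = toy_model.kneighbors(toy_matrix[0], n_neighbors=3)

# Translate each row position back into a film ID
neighbour_ids = [toy_ids[p] for p in positions.flatten()]

print(neighbour_ids)
```

The first entry is the query film itself at distance 0, which is why the lab asks for one neighbor more than it needs.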

In [37]:
molist = [rating_reset.iloc[i, 0] for i in molist]

molist[:5]

[89745, 112852, 59315, 77561, 91529]

In [38]:
table_film[table_film["MovieID"].isin(molist)]

Unnamed: 0,MovieID,Title,Genre
6743,59315,Iron Man (2008),Action Adventure SciFi
7324,77561,Iron Man 2 (2010),Action Adventure SciFi Thriller IMAX
7372,79132,Inception (2010),Action Crime Drama Mystery SciFi Thriller IMAX
7620,87232,X-Men: First Class (2011),Action Adventure SciFi Thriller War
7693,89745,"Avengers, The (2012)",Action Adventure SciFi IMAX
7768,91529,"Dark Knight Rises, The (2012)",Action Adventure Crime IMAX
8151,102125,Iron Man 3 (2013),Action SciFi Thriller IMAX
8395,110102,Captain America: The Winter Soldier (2014),Action Adventure SciFi IMAX
8425,111362,X-Men: Days of Future Past (2014),Action Adventure SciFi
8475,112852,Guardians of the Galaxy (2014),Action Adventure SciFi


In [39]:
def OutputFilm(title, limit=5):
  """
  Recommend Films Similar to `title`

  Looks Up the Nearest Neighbors of the Film in the Rating Matrix and
  Returns Them with Their Mean Rating, Sorted by Rating (Highest First)
  """
  # Title -> MovieID -> Row Position in rating_reset / rating_matrix
  movies_index = table_film[table_film["Title"] == title]
  movies_index = movies_index["MovieID"].values[0]
  movies_index = rating_reset[rating_reset["MovieID"] == movies_index].index[0]

  # Ask for One Extra Neighbor because the Nearest One is the Film Itself
  limit += 1

  distances, indices = model.kneighbors(rating_matrix[movies_index], n_neighbors=limit)
  sorter = list(zip(indices.flatten().tolist(), distances.flatten().tolist()))
  indices = sorted(sorter, key=lambda x: x[1])

  # Row Positions -> MovieIDs, Skipping the First Entry (the Input Film)
  molist = [i[0] for i in indices]
  molist = [i for i in rating_reset.iloc[molist, 0].values][1:]
  result = table_film[table_film["MovieID"].isin(molist)]

  # Attach the Mean Rating of Each Recommended Film
  listrate = rating[rating["Title"].isin(result["Title"])]
  listrate = listrate.groupby("Title")["Rating"].mean()
  result = result.join(listrate, on="Title")

  return result.sort_values("Rating", ascending=False)

title = "Guardians of the Galaxy (2014)"

OutputFilm(title, 15)

Unnamed: 0,MovieID,Title,Genre,Rating
8683,122886,Star Wars: Episode VII - The Force Awakens (2015),Action Adventure Fantasy SciFi IMAX,3.984848
8438,111759,Edge of Tomorrow (2014),Action SciFi IMAX,3.95
8425,111362,X-Men: Days of Future Past (2014),Action Adventure SciFi,3.910714
7620,87232,X-Men: First Class (2011),Action Adventure SciFi Thriller War,3.865854
7693,89745,"Avengers, The (2012)",Action Adventure SciFi IMAX,3.858333
8695,122918,Guardians of the Galaxy 2 (2017),Action Adventure SciFi,3.847826
9223,152081,Zootopia (2016),Action Adventure Animation Children Comedy,3.846154
6743,59315,Iron Man (2008),Action Adventure SciFi,3.7625
8691,122904,Deadpool (2016),Action Adventure Comedy SciFi,3.738636
8395,110102,Captain America: The Winter Soldier (2014),Action Adventure SciFi IMAX,3.689655
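`OutputFilm` orders its result by mean rating, and the `isin` filter does not keep the order returned by `kneighbors`. If a ranking by similarity is wanted instead, one possible approach is to keep each distance next to its MovieID and merge from that frame. The sketch below uses made-up frames. `neighbours`, `toy_films` and the `Distance` column are hypothetical names.

```python
import pandas as pd

# Hypothetical neighbour result: MovieID with its cosine distance, nearest first
neighbours = pd.DataFrame({
    "MovieID":  [205, 412, 330],
    "Distance": [0.02, 0.23, 1.00],
})

# Hypothetical film table, stored in MovieID order
toy_films = pd.DataFrame({
    "MovieID": [101, 205, 330, 412],
    "Title":   ["Film A", "Film B", "Film C", "Film D"],
})

# A left merge starting from the neighbour frame keeps its row order
ranked = neighbours.merge(toy_films, on="MovieID", how="left")

print(ranked)
```

The merged frame lists the nearest film first and still carries the distance, so it can be sorted by either column afterwards.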


In [40]:
# Test Forrest Gump Film

title = "Forrest Gump (1994)"

OutputFilm(title, 5)

Unnamed: 0,MovieID,Title,Genre,Rating
277,318,"Shawshank Redemption, The (1994)",Crime Drama,4.438914
257,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,4.218884
1939,2571,"Matrix, The (1999)",Action SciFi Thriller,4.139908
97,110,Braveheart (1995),Action Drama War,3.974719
418,480,Jurassic Park (1993),Action Adventure SciFi Thriller,3.782383


In [41]:
# Test Toy Story Film

title = "Toy Story (1995)"

OutputFilm(title, 8)

Unnamed: 0,MovieID,Title,Genre,Rating
224,260,Star Wars: Episode IV - A New Hope (1977),Action Adventure SciFi,4.229798
257,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,4.218884
314,356,Forrest Gump (1994),Comedy Drama Romance War,4.126482
911,1210,Star Wars: Episode VI - Return of the Jedi (1983),Action Adventure SciFi,4.122754
2355,3114,Toy Story 2 (1999),Adventure Animation Children Comedy Fantasy,3.875
3194,4306,Shrek (2001),Adventure Animation Children Comedy Fantasy Ro...,3.873333
418,480,Jurassic Park (1993),Action Adventure SciFi Thriller,3.782383
123,150,Apollo 13 (1995),Adventure Drama IMAX,3.762987
