# Recommender Systems Walkthrough

### Intro

Two distinct types of recommender system (RS):

- Content Based Filtering

The aim of content-based recommendation is to create a 'profile' for each user and each item, then recommend items that are similar to ones the user has previously used.


- Collaborative Filtering

The aim of collaborative filtering (CF) is to find users with similar tastes and recommend the products those similar users liked.
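
To make the difference concrete, here is a minimal sketch of both ideas on toy data. The item profiles, the ratings matrix and the helper `cosine_similarity_matrix` are hypothetical and are not part of the dataset used later. Treating an unrated movie as a rating of 0 is a simplification made only to keep the example short.

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    return unit @ unit.T

# Content-based: hypothetical item profiles (columns: action, comedy, romance).
item_profiles = np.array([
    [1.0, 0.0, 0.0],   # movie 0: pure action
    [0.9, 0.1, 0.0],   # movie 1: mostly action
    [0.0, 0.2, 0.8],   # movie 2: mostly romance
])
item_sim = cosine_similarity_matrix(item_profiles)
liked = 0  # the user liked movie 0
# Rank the other movies by how similar their profile is to the liked one.
candidates = [i for i in np.argsort(-item_sim[liked]) if i != liked]
content_pick = candidates[0]

# Collaborative: hypothetical user x movie ratings (0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0],   # user 0 has not rated movie 2
    [5.0, 4.0, 1.0],   # user 1 rates like user 0
    [1.0, 2.0, 5.0],   # user 2 has very different tastes
])
user_sim = cosine_similarity_matrix(ratings)
target = 0
neighbours = [u for u in np.argsort(-user_sim[target]) if u != target]
nearest = neighbours[0]
# Borrow the most similar user's rating of movie 2 as the prediction.
predicted = ratings[nearest, 2]

print(content_pick, nearest, predicted)
```

The content-based half only looks at item attributes, while the collaborative half only looks at who rated what. A real system would average over several neighbours rather than copy a single one.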

Finally, I will implement a simple hybrid model.

![alt text](1_yrkvweErbifbPFkBUyZlOw.png)

### Data Prep

For simplicity, we use a small subset of the available data.

In [1]:
import pandas as pd

import numpy as np 

In [2]:
credits = pd.read_csv('credits.csv')
credits.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'credits.csv'

In [None]:
keywords = pd.read_csv('keywords.csv')
keywords.head(3)

In [None]:
links = pd.read_csv('links.csv')
links.head(3)

In [None]:
meta = pd.read_csv('movies_metadata.csv')
meta.head(3)

In [None]:
rating = pd.read_csv('ratings.csv')
rating.head(3)

### Cleaning meta data

In [None]:
# Keeping only a few columns for simplicity (this notebook is not trying to build the best model, just trying things out)
# .copy() avoids modifying a view of the original frame (SettingWithCopyWarning)
meta = meta[['id','imdb_id','title','overview','genres','vote_average','budget','runtime','adult']].copy()
meta['adult'] = meta['adult'].replace({'False': 0, 'True': 1})

In [None]:
meta.shape

In [None]:
meta.head(5)

In [None]:
# Keeps only the first genre id. This is a simplification, since a movie can have several genres.
meta['genres'] = meta['genres'].str.extract(r'(\d+)', expand=False)
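
The cell above keeps only the first genre id. As a hedged alternative, the sketch below keeps every genre. It assumes the `genres` column holds stringified lists of `{'id', 'name'}` dicts, in the same style as the `cast` and `keywords` columns parsed later. The `sample` frame and the helper `parse_genre_names` are hypothetical and only illustrate the idea.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical rows mimicking the assumed format of the genres column.
sample = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 18, 'name': 'Drama'}]",
        "[]",
    ],
})

def parse_genre_names(raw):
    """Return every genre name found in a stringified list of dicts."""
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed or missing entry
    if not isinstance(parsed, list):
        return []
    return [g['name'] for g in parsed if isinstance(g, dict) and 'name' in g]

sample['genre_names'] = sample['genres'].apply(parse_genre_names)
# One 0/1 indicator column per genre (multi-hot encoding).
genre_flags = sample['genre_names'].str.join('|').str.get_dummies(sep='|')
print(genre_flags)
```

A multi-hot encoding like this avoids treating the genre id as an ordered number, at the cost of one extra column per genre.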

In [None]:
meta.genres = pd.to_numeric(meta.genres, errors='coerce')

In [None]:
meta.isnull().sum() / meta.shape[0] * 100.00

In [None]:
meta = meta.drop([19730, 29503, 35587]) # Incorrect IDs

meta.dropna(inplace = True)

In [None]:
meta['id'] = meta['id'].astype('int')
meta.genres = meta.genres.astype('int')

#### links data

In [None]:
links.dropna(inplace = True)
links.head(3)

In [None]:
links['tmdbId'] = links['tmdbId'].astype('int')
links['imdbId'] = links['imdbId'].astype('int')

In [None]:
links.isnull().sum() / links.shape[0] * 100.00

In [None]:
links.dropna(inplace = True)

In [None]:
df = pd.merge(meta, links, left_on=['id'], right_on = ['tmdbId'], how='inner')
df.drop(['imdb_id','id'],axis = 1,inplace = True)

In [None]:
df

### Cleaning credits data

In [None]:
credits

In [None]:
df = pd.merge(df, credits, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

In [None]:
df = pd.merge(df, keywords, left_on=['tmdbId'], right_on = ['id'], how='inner')
df.drop(['id'],axis = 1,inplace = True)
df.head(3)

In [None]:
from ast import literal_eval
df['cast'] = df['cast'].apply(literal_eval)
df['crew'] = df['crew'].apply(literal_eval)
df['keywords'] =  df['keywords'].apply(literal_eval)
df['cast_size'] = df['cast'].apply(len)
df['crew_size'] = df['crew'].apply(len)

In [None]:
def get_director(x):
    # Return the name of the first crew member whose job is 'Director', or NaN if there is none
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

df['director'] = df['crew'].apply(get_director)

df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df['cast'] = df['cast'].apply(lambda x: x[:3])  # keep at most the first three names

df['keywords'] = df['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
df.drop(['crew'],axis = 1,inplace = True)

In [None]:
df.head(3)

## Content Based Filtering 

Goal: be able to group similar movies together and rank them.

Many different approaches:

- Recommend movies with similar descriptions, crew and cast, i.e. NLP
- Use tabular data, e.g. ratings, cost, etc.

I want to try a combination.


I will:

- Cluster descriptions, crew and cast separately, and make features out of these clusters.
- Then cluster the full dataframe.
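
The two steps above could be sketched as follows. This is only one possible reading of the plan: it assumes scikit-learn is available, uses TF-IDF plus k-means for the text columns, and runs on a hypothetical `toy` frame rather than the merged `df`. The helper `cluster_text`, the cluster counts and the choice of numeric columns are illustrative assumptions. The director column could be handled the same way as the cast.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the merged dataframe.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'overview': [
        'a cop chases a bank robber across the city',
        'two robbers plan a bank heist in the city',
        'a young couple falls in love in paris',
        'love and heartbreak for a couple in rome',
        'a detective hunts a robber after a heist',
        'a wedding brings an old couple back in love',
    ],
    'cast': [['Ann Lee', 'Bob Ray'], ['Bob Ray', 'Cy Fox'], ['Di Moon', 'Ed Poe'],
             ['Di Moon', 'Flo Hart'], ['Ann Lee', 'Cy Fox'], ['Ed Poe', 'Flo Hart']],
    'vote_average': [6.1, 6.8, 7.2, 5.9, 6.5, 7.0],
    'runtime': [110.0, 125.0, 98.0, 102.0, 118.0, 95.0],
})

def cluster_text(texts, n_clusters, **tfidf_kwargs):
    """TF-IDF encode the texts, then return one k-means label per text."""
    vectors = TfidfVectorizer(**tfidf_kwargs).fit_transform(texts)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(vectors)

# Step 1: cluster each text-like column separately.
toy['overview_cluster'] = cluster_text(toy['overview'], 2, stop_words='english')
# Join each name with underscores so 'Ann Lee' stays a single token.
cast_docs = toy['cast'].apply(lambda names: ' '.join(n.replace(' ', '_') for n in names))
toy['cast_cluster'] = cluster_text(cast_docs, 2, token_pattern=r'\S+')

# Step 2: one-hot the cluster labels, scale the numeric columns, cluster again.
labels = pd.get_dummies(toy[['overview_cluster', 'cast_cluster']].astype(str))
numeric = StandardScaler().fit_transform(toy[['vote_average', 'runtime']])
features = np.hstack([labels.to_numpy(dtype=float), numeric])
toy['movie_cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print(toy[['title', 'overview_cluster', 'cast_cluster', 'movie_cluster']])
```

One-hot encoding the intermediate cluster labels keeps k-means from reading them as ordered numbers. On real data the number of clusters would need tuning, for example with silhouette scores.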

## Collaborative Filtering

![alt text](1_qFweWAKML-SdpGndGMvLDw.png)