# Movie Recommendation Engine: Data Analysis and Visualization

#### The core dataset consists of three files: tag.csv (user-generated tags), rating.csv (user ratings), and movie.csv (movie master data).

#### The user picks a genre they want to watch and is then asked whether they like a few popular movies (unless we can offer autocomplete on the full list). A list of 3 movies is then presented. Under the hood, we use ratings and tags to select those movies by calculating distances.
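The notebook does not say which distance is used, so the following is only a minimal sketch of the idea, assuming cosine distance between per-movie rating vectors. The toy matrix, `cosine_distance`, and `recommend` are hypothetical names introduced for illustration, and tags are left out entirely.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix: rows are users, columns are movies.
# Values are made up; 0.0 stands for "not rated" in this sketch.
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 0.0],
        "B": [4.0, 5.0, 0.0, 1.0],
        "C": [1.0, 0.0, 5.0, 4.0],
        "D": [0.0, 1.0, 4.0, 5.0],
    },
    index=["u1", "u2", "u3", "u4"],
)


def cosine_distance(x, y):
    """Return 1 minus the cosine similarity of two rating vectors."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))


def recommend(liked, ratings_matrix, n=3):
    """Rank movies not in `liked` by mean cosine distance to the liked movies."""
    candidates = [m for m in ratings_matrix.columns if m not in liked]
    scores = {
        m: np.mean([
            cosine_distance(ratings_matrix[m].values, ratings_matrix[l].values)
            for l in liked
        ])
        for m in candidates
    }
    # Smallest distance first
    return sorted(scores, key=scores.get)[:n]


toy_recs = recommend(["A"], toy, n=2)
print(toy_recs)
```

In this toy matrix, movie B is rated much like movie A by the same users, so it ranks first for a user who says they like A.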

In [1]:
import pandas as pd
import numpy as np
data_dir = r"C:\Users\Ben\Documents\Data Sets\movielens_20m_dataset\\"

In [2]:
pd.set_option('display.max_rows', 20)

In [3]:
movies_master = pd.read_csv(data_dir+"movie.csv")

In [4]:
movies_master.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Manipulate movies data to show different genres

In [6]:
def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        exploded = pd.DataFrame({
            col:np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols})
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        return pd.concat([exploded, df.loc[lens==0, idx_cols]], sort=False) \
                 .fillna(fill_value) \
                 .loc[:, df.columns]

In [7]:
movies_master["genres"] = movies_master.genres.str.split('|')
movies = explode(movies_master, 'genres')

In [8]:
movies.head(5)

,movieId,title,genres,year
0,1,Toy Story (1995),Adventure,1995.0
1,1,Toy Story (1995),Animation,1995.0
2,1,Toy Story (1995),Children,1995.0
3,1,Toy Story (1995),Comedy,1995.0
4,1,Toy Story (1995),Fantasy,1995.0
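On pandas 0.25 or later, the built-in `DataFrame.explode` may be able to stand in for the custom `explode` helper. The sketch below uses a toy frame (`toy_movies`, a hypothetical name) with two rows copied from the `movie.csv` output shown earlier. One difference to be aware of: the built-in method yields `NaN` for empty lists rather than a configurable `fill_value`.

```python
import pandas as pd

# Toy frame in the same shape as movie.csv
toy_movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# Split the pipe-delimited string into a list, then emit one row per list element
toy_exploded = (
    toy_movies.assign(genres=toy_movies["genres"].str.split("|"))
              .explode("genres")
              .reset_index(drop=True)
)
print(toy_exploded)
```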


In [9]:
movies["genres"].unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'IMAX', 'Documentary', 'War', 'Musical',
       'Western', 'Film-Noir', '(no genres listed)'], dtype=object)

#### Number of movies per genre

In [10]:
print(movies_master.title.nunique(), "movies in the dataset")

27262 movies in the dataset


In [11]:
movies.groupby("genres").title.nunique()

genres
(no genres listed)      246
Action                 3519
Adventure              2328
Animation              1025
Children               1138
Comedy                 8369
Crime                  2938
Documentary            2471
Drama                 13337
Fantasy                1412
Film-Noir               330
Horror                 2610
IMAX                    196
Musical                1036
Mystery                1514
Romance                4124
Sci-Fi                 1740
Thriller               4178
War                    1194
Western                 676
Name: title, dtype: int64

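### Extract the release year from each movie title

In [5]:
# Take the last six characters of the title, e.g. "(1995)", and extract the digits inside the parentheses.
# This cell ran as In [5], before the genre split above, which is why `year` already appears in the In [8] output.
movies_master["year"] = movies_master.title.str[-6:]\
                                     .str.extract(r'(\d+)', expand=False)\
                                     .astype(float)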
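Slicing the last six characters assumes every title ends exactly with `(YYYY)`. As a hedged alternative, a pattern anchored to the end of the string can tolerate trailing whitespace and returns `NaN` when no year is present. The `titles` series below is illustrative, and its third entry is made up to show the missing case.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Shawshank Redemption, The (1994) ",   # trailing space added for illustration
    "Hypothetical Title Without Year",     # made-up entry with no year
])

# Capture exactly four digits inside parentheses at the end of the title
years = titles.str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
print(years.tolist())
```

The cast to `float` is what allows missing years to be stored as `NaN`, which is also why the `year` column prints as `1995.0` rather than `1995`.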

### Which movies are most watched (most rated) per genre?

In [12]:
ratings = pd.read_csv(data_dir+"rating.csv")

In [13]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


In [14]:
# Named aggregation replaces the dict-renaming form of agg, which pandas deprecated and later removed
popular_movies = ratings.groupby("movieId")\
                        .userId.agg(num_user="count")\
                        .sort_values("num_user", ascending=False)\
                        .reset_index()


In [15]:
pd.merge(popular_movies, movies_master, on = 'movieId').head(5)

,movieId,num_user,title,genres,year
0,296,67310,Pulp Fiction (1994),"[Comedy, Crime, Drama, Thriller]",1994.0
1,356,66172,Forrest Gump (1994),"[Comedy, Drama, Romance, War]",1994.0
2,318,63366,"Shawshank Redemption, The (1994)","[Crime, Drama]",1994.0
3,593,63299,"Silence of the Lambs, The (1991)","[Crime, Horror, Thriller]",1991.0
4,480,59715,Jurassic Park (1993),"[Action, Adventure, Sci-Fi, Thriller]",1993.0
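The table above ranks movies overall. To answer the per-genre question, one option is to join the rating counts onto the one-row-per-genre frame and keep the top rows in each group. This is a sketch on toy data: `toy_genres` and `toy_counts` are hypothetical stand-ins for `movies` and `popular_movies`, with the genre lists abbreviated.

```python
import pandas as pd

# Stand-in for the exploded `movies` frame (genres abbreviated for brevity)
toy_genres = pd.DataFrame({
    "movieId": [296, 296, 356, 356, 318, 318],
    "title": [
        "Pulp Fiction (1994)", "Pulp Fiction (1994)",
        "Forrest Gump (1994)", "Forrest Gump (1994)",
        "Shawshank Redemption, The (1994)", "Shawshank Redemption, The (1994)",
    ],
    "genres": ["Crime", "Drama", "Comedy", "Drama", "Crime", "Drama"],
})

# Stand-in for `popular_movies`, using counts from the output above
toy_counts = pd.DataFrame({
    "movieId": [296, 356, 318],
    "num_user": [67310, 66172, 63366],
})

top_per_genre = (
    toy_genres.merge(toy_counts, on="movieId")
              .sort_values(["genres", "num_user"], ascending=[True, False])
              .groupby("genres")
              .head(2)                 # keep the two most-rated movies per genre
              .reset_index(drop=True)
)
print(top_per_genre)
```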


In [16]:
print(sum(popular_movies.num_user < 30), "movies have fewer than 30 ratings")
# Keep movies with at least 30 ratings so the filter matches the count printed above
popular_movies = popular_movies[popular_movies.num_user >= 30]

14730 movies have fewer than 30 ratings


#### Can raters rate a movie more than once?

In [37]:
print("A user can rate a movie a maximum of", ratings.groupby(["userId","movieId"]).rating.count().max(), "time(s)")

A user can rate a movie a maximum of 1 time(s)
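If the maximum had been greater than 1, the offending rows would need to be located. A possible sketch uses `DataFrame.duplicated` on the `(userId, movieId)` pair. The `toy_ratings` frame is made up and deliberately contains one repeated pair.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2],
    "movieId": [2, 29, 2, 2],          # user 2 rates movie 2 twice (made-up duplicate)
    "rating":  [3.5, 3.5, 4.0, 2.0],
})

# keep=False marks every row of a repeated (userId, movieId) pair, not just the later ones
dupes = toy_ratings[toy_ratings.duplicated(subset=["userId", "movieId"], keep=False)]
print(dupes)
```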


### Pull rating data together

In [17]:
ratings = ratings[ratings["movieId"].isin(popular_movies["movieId"])]
movies = movies[movies["movieId"].isin(popular_movies["movieId"])]
movies_master = movies_master[movies_master["movieId"].isin(popular_movies["movieId"])]

#### Add the 80th-percentile rating for each movie. Users who rate a movie at or above this threshold are treated as "liking" it

In [18]:
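# Despite the name `quartiles`, this holds each movie's 80th-percentile rating (the cutoff for the top quintile)
quartiles = ratings.groupby("movieId").rating.quantile(q=.8, interpolation = 'midpoint').reset_index()
quartiles.columns = ["movieId", "top_quintile"]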

In [19]:
movies_master = pd.merge(movies_master, quartiles, on = "movieId")
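To show how the threshold could be applied, the sketch below computes it on a toy ratings frame and flags each rating at or above it. The `liked` column and the toy values are assumptions for illustration, since the notebook has not yet shown how the flag is built.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "movieId": [10] * 5 + [20] * 5,
    "rating":  [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 3.0, 3.5, 4.0, 4.0],
})

# Per-movie 80th percentile; 'midpoint' averages the two neighbouring ratings
# when the quantile position falls between them
thresholds = (
    toy_ratings.groupby("movieId")["rating"]
               .quantile(q=0.8, interpolation="midpoint")
               .rename("top_quintile")
               .reset_index()
)

# Attach each movie's threshold to its ratings and flag the "likes"
flagged = toy_ratings.merge(thresholds, on="movieId")
flagged["liked"] = flagged["rating"] >= flagged["top_quintile"]
print(flagged)
```

For movie 10 the threshold falls halfway between 4.0 and 5.0, so only the 5.0 rating counts as a like. For movie 20 both neighbouring ratings are 4.0, so both 4.0 ratings qualify.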

In [39]:
movies.head()

,movieId,title,genres,year
0,1,Toy Story (1995),Adventure,1995.0
1,1,Toy Story (1995),Animation,1995.0
2,1,Toy Story (1995),Children,1995.0
3,1,Toy Story (1995),Comedy,1995.0
4,1,Toy Story (1995),Fantasy,1995.0


### Understand tags data