## 1. Analysing movie ratings
<p>Here we demonstrate some analysis of movie ratings data and the preparation needed to turn it into a form suitable for later use with machine learning algorithms. The dataset is provided by <a href="https://grouplens.org/datasets/movielens/">GroupLens Research</a> and was collected from users of MovieLens, which is essentially a movie recommendation service (more about MovieLens <a href="https://movielens.org/">here</a>).</p>
<img src="https://md.ekstrandom.net/talks/2014/recsys2014/grouplens.png" width="150">
<p>In brief, the data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification and occupation), which is the kind of information that is often of interest when developing recommendation systems. The dataset contains about 1 million ratings collected from roughly 6,000 users on roughly 4,000 movies, and it is spread across three tables: ratings, user information and movie information.</p>
<hr>
<p>So let's begin! First we will import our data and print the first few lines to get a look at the data structure.
    <br><strong>Note: </strong>As we will see in the output following this code block, some of the data are sensitive, so they have been kept anonymous through an integer encoding. For the meaning of these integer codes we can always refer to the <em>README</em> file that is included in the <em>datasets</em> folder.</p>

In [73]:
import pandas as pd
from IPython.display import display

# Make display smaller
pd.options.display.max_rows = 10

# Create dataframe for users info
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('./datasets/movielens/users.dat', sep='::',
                      names=unames, engine='python')

# Create dataframe for ratings info
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('./datasets/movielens/ratings.dat', sep='::',
                       names=rnames, engine='python')

# Create dataframe for movies info
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('./datasets/movielens/movies.dat', sep='::',
                      names=mnames, engine='python')

# Print few first lines of each
# print("Users Table\n", users.head(), end='\n\n')
print("Users Table")
display(users.head())
print("Ratings Table")
display(ratings.head())
print("Movies Table")
display(movies.head())


Users Table


Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Ratings Table


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Movies Table


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
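
<p>As a small illustration of the integer encoding mentioned in the note, the sketch below maps the <code>age</code> codes to readable labels. The label table follows the age ranges as listed in the MovieLens 1M README (worth checking against your own copy), and <code>users_sample</code> and <code>AGE_LABELS</code> are hypothetical names used only for this example.</p>

```python
import pandas as pd

# Age-group codes as listed in the MovieLens 1M README
# (assumption: your copy of the README uses the same ranges).
AGE_LABELS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}

# A tiny stand-in for the users table, using the first five rows shown above.
users_sample = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "gender": ["F", "M", "M", "M", "M"],
    "age": [1, 56, 25, 45, 25],
})

# Map each integer code to its label; a code missing from the table becomes NaN.
users_sample["age_group"] = users_sample["age"].map(AGE_LABELS)
print(users_sample)
```

<p>The same approach would work for the <code>occupation</code> codes, given a lookup table built from the README.</p>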


## 2. Merging the tables
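<p>Working with three separate tables is not very convenient, so we are going to merge them into one. That way we get a broader view of the data. Since <code>ratings</code> shares the <code>user_id</code> column with <code>users</code> and the <code>movie_id</code> column with <code>movies</code>, pandas can infer the join column from the overlapping column names, so we don't strictly have to say which column to join on.</p>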
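
<p>To make the join keys concrete, here is a minimal sketch on toy data. The rows are made up; only the column names follow the real tables. It shows that <code>pd.merge</code> without <code>on=</code> joins on the columns both frames share, and that passing <code>on=</code> and <code>validate=</code> makes the intent explicit.</p>

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical rows, real column names).
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["A (1990)", "B (1995)"]})

# With no `on=` argument, pd.merge joins on the columns both frames share.
shared = users.columns.intersection(ratings.columns).tolist()
print(shared)

implicit = pd.merge(users, ratings)
explicit = pd.merge(users, ratings, on="user_id")
assert implicit.equals(explicit)

# validate= raises a MergeError if the key relationship is not the expected one
# (here: many ratings rows per movie, at most one movies row per movie_id).
merged = pd.merge(explicit, movies, on="movie_id", validate="many_to_one")
print(merged)
```

<p>Being explicit costs little and protects against surprises if two tables ever share an extra column name.</p>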
<p>One important note: merging big dataframes can be a costly operation in terms of resources such as memory, so in the world of Big Data this kind of operation is often performed with the help of cloud computing. Even in our project, a low-specification computer will have a hard time performing the merge (especially the last one, which produces a dataframe of <em>1M x 10</em> dimensions), so a workaround is to merge in chunks to allow better memory management.</p>
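
<p>The chunked workaround relies on the fact that an inner join can be computed piece by piece on the larger table. The sketch below checks this on a toy example. The CSV text and the <code>chunksize=2</code> value are made up for illustration, and the pieces are collected in a list only to keep the example short (the notebook appends each piece to a file instead).</p>

```python
import io
import pandas as pd

# Hypothetical miniature of the intermediate ratings/users file.
csv_text = (
    "user_id,movie_id,rating\n"
    "1,10,5\n"
    "1,20,3\n"
    "2,10,4\n"
    "3,30,2\n"
    "3,20,1\n"
)
movies = pd.DataFrame({"movie_id": [10, 20, 30], "title": ["A", "B", "C"]})

# Merge two rows at a time, so only one small chunk is joined per step.
pieces = [
    pd.merge(chunk, movies, on="movie_id")
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
chunked = pd.concat(pieces, ignore_index=True)

# Reference result: merge everything at once.
full = pd.merge(pd.read_csv(io.StringIO(csv_text)), movies, on="movie_id")

# Row order may differ between the two approaches, so sort before comparing.
key = ["user_id", "movie_id"]
same = chunked.sort_values(key).reset_index(drop=True).equals(
    full.sort_values(key).reset_index(drop=True)
)
print(same)
```

<p>This only works because the smaller table (<code>movies</code>) fits in memory as a whole; the chunking is applied to the large side of the join.</p>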

In [61]:
# Create a template with columns that appear in the final data file.
# First the  unique columns from "ratings" and "user".
data = pd.DataFrame(columns=(ratings.columns.append(users.columns)).unique())

# Same for after merging with "movies". Keep column order and save file
data = pd.DataFrame(columns=(data.columns.append(movies.columns)).unique())
data.to_csv('./datasets/data.csv', index_label=False)
cols = list(data.columns.values)

# First merge "ratings" with "users". 
df = pd.merge(users, ratings)

# Save it to csv (without the index column) and delete variables to free memory
df.to_csv('./datasets/ratings_users.csv', mode='w', index=False)
test = pd.read_csv('./datasets/ratings_users.csv')
del data, df, ratings, users

# Merge "ratings_users" with "movies" in chunks. Prevents memory failure
for chunk in pd.read_csv('./datasets/ratings_users.csv', chunksize=1000):
    df = pd.merge(chunk, movies, on='movie_id')
    # Rearrange cols to match the "data.csv" column order
    df = df[cols]
    # Append rows only; the header was already written with the template
    df.to_csv('./datasets/data.csv', mode='a', header=False, index=False)

# Free memory. Make dataframe from the finalized csv
del movies, df
data = pd.read_csv('./datasets/data.csv')
data.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
3,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
4,6,914,5,978237767,F,50,9,55117,My Fair Lady (1964),Musical|Romance
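
<p>In the merged table the release year is still embedded in <code>title</code> and the genres are packed into one pipe-separated string. As a rough sketch of how these could be turned into separate columns for later modelling, the example below uses two rows copied from the output above. It assumes every title ends with the year in parentheses, and <code>sample</code>, <code>genre_flags</code> and <code>features</code> are names introduced only for this illustration.</p>

```python
import pandas as pd

# Two rows copied from the merged output above.
sample = pd.DataFrame({
    "title": ["One Flew Over the Cuckoo's Nest (1975)", "My Fair Lady (1964)"],
    "genres": ["Drama", "Musical|Romance"],
})

# Assumption: every title ends with the release year in parentheses.
sample["year"] = (
    sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
)

# One indicator column per genre, split on the "|" separator.
genre_flags = sample["genres"].str.get_dummies(sep="|")

features = pd.concat([sample[["title", "year"]], genre_flags], axis=1)
print(features)
```

<p>If some titles do not follow the assumed pattern, the <code>astype(int)</code> step will fail on the resulting missing values, which is a useful early warning.</p>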


## 3. Exploring the dataset
<p>Now that we have our merged dataset we will perform some <em>Exploratory Data Analysis</em> <strong>(EDA)</strong> on it. As part of this exploration, we will compute the mean movie rating by gender. We will also filter down to titles that received at least an arbitrary number of ratings, say 300. In doing so we are, in some sense, trying to identify the movies that are more popular choices among users.</p>
<p>More interestingly, we can take the index produced by the <em>at least 300 ratings</em> filter and use it to select rows from the <em>ratings by gender</em> subset, then print the top 10 for female users. Again, this is a totally arbitrary selection and no assumptions are being made; it is just a showcase of gender-based preferences on a narrower "sample".</p>

In [64]:
# Reshape frame to get means of rating by gender
mean_ratings = data.pivot_table('rating', index='title', columns='gender',
                                aggfunc='mean')
mean_ratings.head()

title,F,M
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
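
<p>The filtering step described in the text above can be sketched as follows. This is a minimal example on made-up data, with a threshold of 3 standing in for the 300 used in the text. <code>MIN_RATINGS</code>, <code>ratings_by_title</code>, <code>active_titles</code> and <code>top_female</code> are names chosen for the illustration, not taken from the notebook.</p>

```python
import pandas as pd

# Hypothetical miniature of the merged table.
data = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B", "C", "C", "C"],
    "gender": ["F", "M", "F", "F", "M", "F", "M", "M"],
    "rating": [5, 3, 4, 2, 4, 3, 5, 4],
})

# Mean rating per title, one column per gender.
mean_ratings = data.pivot_table("rating", index="title", columns="gender",
                                aggfunc="mean")

# Count ratings per title and keep the titles at or above the threshold.
MIN_RATINGS = 3  # stands in for the 300 used in the text
ratings_by_title = data.groupby("title").size()
active_titles = ratings_by_title.index[ratings_by_title >= MIN_RATINGS]

# Restrict to the active titles, then rank by the mean rating in column "F".
top_female = mean_ratings.loc[active_titles].sort_values(by="F", ascending=False)
print(top_female.head(10))
```

<p>Here title <code>B</code> is dropped because it has only two ratings, and the remaining titles are ordered by their mean rating in the <code>F</code> column.</p>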
