## 1. Analysing movie ratings
<p>Here we demonstrate some analysis of movie ratings data and the preparation needed to turn it into a form suitable for later use with machine learning algorithms. The dataset is provided by <a href="https://grouplens.org/datasets/movielens/">GroupLens Research</a> and was collected from users of MovieLens, which is essentially a movie recommendation service (more about MovieLens <a href="https://movielens.org/">here</a>).</p>
<img src="https://md.ekstrandom.net/talks/2014/recsys2014/grouplens.png" width="150">
<p>As a brief description, the data provide movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification and occupation), which is the kind of information that is often of interest when developing recommendation systems. The dataset contains roughly 1 million ratings collected from 6,000 users on 4,000 movies, and it is spread across three tables: ratings, user information and movie information.</p>
<hr>
<p>So let's begin! First we will import our data and print the first few lines to have a look at the data structure.
    <br><strong>Note: </strong>As we will see in the output following this code block, some of the data are sensitive, so they have been kept anonymous through an integer encoding. The meaning of each integer code is documented in the <em>README</em> file that is included in the <em>datasets</em> folder.</p>

In [48]:
import pandas as pd
from IPython.display import display

# Make display smaller
pd.options.display.max_rows = 10

# Create dataframe for users info
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('./datasets/movielens/users.dat', sep='::',
                      names=unames, engine='python')

# Create dataframe for ratings info
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('./datasets/movielens/ratings.dat', sep='::',
                       names=rnames, engine='python')

# Create dataframe for movies info
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('./datasets/movielens/movies.dat', sep='::',
                      names=mnames, engine='python')

# Print few first lines of each
# print("Users Table\n", users.head(), end='\n\n')
print("Users Table")
display(users.head())
print("Ratings Table")
display(ratings.head())
print("Movies Table")
display(movies.head())


Users Table


,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Ratings Table


,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Movies Table


,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
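
<p>The integer codes mentioned in the note above can be turned back into readable labels with a simple lookup. The sketch below uses a small hand-built frame (<code>users_sample</code>, a hypothetical name) with the rows shown in the Users table. The age brackets in <code>AGE_LABELS</code> are written from memory of the MovieLens 1M README, so treat them as an assumption and check them against your own copy of the file.</p>

```python
import pandas as pd

# Age codes as recalled from the MovieLens 1M README (assumption: verify locally)
AGE_LABELS = {
    1: 'Under 18', 18: '18-24', 25: '25-34', 35: '35-44',
    45: '45-49', 50: '50-55', 56: '56+',
}

# Hand-built sample mirroring the first rows of the Users table
users_sample = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'gender': ['F', 'M', 'M', 'M', 'M'],
    'age': [1, 56, 25, 45, 25],
})

# Map each integer code to its label; unknown codes would become NaN
users_sample['age_group'] = users_sample['age'].map(AGE_LABELS)
print(users_sample)
```

<p>The same <code>Series.map</code> pattern would work for the occupation codes once their labels are copied from the README.</p>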


## 2. Merging the tables
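<p>Working with 3 different tables is not very convenient, so we are going to merge them into one. That way we will have a broader perspective. We don't have to be explicit about which column to join on: <code>ratings</code> and <code>users</code> share only the <code>user_id</code> column, <code>ratings</code> and <code>movies</code> share only <code>movie_id</code>, and <code>pd.merge</code> joins on the overlapping column names by default.</p>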
<p>One important note is that merging big dataframes can be a costly operation in terms of resources such as memory, so in the world of Big Data this kind of operation is usually performed with cloud computing. Even in our project, a low-specification computer can have a hard time performing the merge (especially the last one, which produces a dataframe of <em>1M x 10</em> dimensions), so a workaround is to merge in chunks to allow better memory management.</p>
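
<p>To make the join behaviour concrete, here is a minimal sketch on toy frames (<code>users_s</code>, <code>ratings_s</code> and <code>movies_s</code> are hypothetical stand-ins, not the real tables). It first lists the column names each pair of tables has in common, which is what <code>pd.merge</code> falls back to when <code>on</code> is omitted, and then performs the same joins with explicit keys.</p>

```python
import pandas as pd

# Toy stand-ins for the three tables (values are illustrative only)
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_s = pd.DataFrame({'user_id': [1, 1, 2],
                          'movie_id': [1193, 661, 1193],
                          'rating': [5, 3, 5]})
movies_s = pd.DataFrame({'movie_id': [661, 1193],
                         'title': ['James and the Giant Peach (1996)',
                                   "One Flew Over the Cuckoo's Nest (1975)"]})

# Column names shared by each pair: the default join keys when `on` is omitted
keys_ru = list(ratings_s.columns.intersection(users_s.columns))
keys_rm = list(ratings_s.columns.intersection(movies_s.columns))
print(keys_ru, keys_rm)

# Same joins with explicit keys; `validate` raises if a key is
# duplicated on the right-hand (lookup) side
merged = (ratings_s
          .merge(users_s, on='user_id', validate='many_to_one')
          .merge(movies_s, on='movie_id', validate='many_to_one'))
print(merged)
```

<p>Spelling out <code>on</code> is a little more verbose, but it protects against accidental joins if a new column with a shared name is ever added to one of the tables.</p>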

In [49]:
# Create a template with columns that appear in the final data file.
# First the  unique columns from "ratings" and "user".
data = pd.DataFrame(columns=(ratings.columns.append(users.columns)).unique())

# Same for after merging with "movies". Keep column order and save file
data = pd.DataFrame(columns=(data.columns.append(movies.columns)).unique())
data.to_csv('./datasets/data.csv', index_label=False)
cols = list(data.columns.values)

# First merge "ratings" with "users". 
df = pd.merge(users, ratings)

# Save it to csv and delete variables to free memory
df.to_csv('./datasets/ratings_users.csv', mode='w', index=False)
del data, df, ratings, users

# Merge "ratings_users" with "movies" in chunks. Prevents memory failure
for chunk in pd.read_csv('./datasets/ratings_users.csv', chunksize=1000):
    df = pd.merge(chunk, movies, on='movie_id')
    # Rearrange cols to match the "data.csv" column order
    df = df[cols]
    df.to_csv('./datasets/data.csv', mode='a', header=False, index=False)

# Free memory. Make dataframe from the finalized csv   
del[movies, df]
data = pd.read_csv('./datasets/data.csv')
data.head()

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
3,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
4,6,914,5,978237767,F,50,9,55117,My Fair Lady (1964),Musical|Romance
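
<p>The chunked merge pattern can be tried without touching the disk. The following sketch, under the assumption that in-memory <code>io.StringIO</code> buffers stand in for the CSV files, reads the left table in chunks of two rows, merges each chunk with a small lookup table, and appends the result to a single output, writing the header only once. The names <code>ratings_users_csv</code>, <code>movies_small</code> and <code>out</code> are illustrative.</p>

```python
import io
import pandas as pd

# In-memory stand-in for ratings_users.csv (illustrative values)
ratings_users_csv = io.StringIO(
    "user_id,movie_id,rating\n"
    "1,1193,5\n"
    "1,661,3\n"
    "2,1193,5\n"
)
movies_small = pd.DataFrame({
    'movie_id': [661, 1193],
    'title': ['James and the Giant Peach (1996)',
              "One Flew Over the Cuckoo's Nest (1975)"],
})

out = io.StringIO()  # stand-in for the final data.csv
for i, chunk in enumerate(pd.read_csv(ratings_users_csv, chunksize=2)):
    # Only one small chunk is held in memory while merging
    part = chunk.merge(movies_small, on='movie_id', how='left')
    # Write the header for the first chunk only, then append rows
    part.to_csv(out, header=(i == 0), index=False)

out.seek(0)
result = pd.read_csv(out)
print(result)
```

<p>Writing the header with the first chunk avoids having to create an empty template file beforehand, which is one possible simplification of the cell above.</p>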


## 3. Exploring the dataset
<p>Now that we have our merged dataset, we will perform some <em>Exploratory Data Analysis</em> <strong>(EDA)</strong> on it. As part of this exploration, we will find the mean movie rating by gender. We will also filter down to titles that received at least an arbitrary number of ratings, say 300. In doing so we are, in some sense, trying to identify the movies that are more active/popular choices among users.</p>
<p>More interestingly, we can use the index produced by the <em>"at least 300 ratings"</em> filter to select rows from the <em>ratings by gender</em> table, and then print the top 10 titles for females. Again, this is a totally arbitrary selection and no assumptions are being made; it is just a showcase of gender-based preferences on a narrower "sample".</p>
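
<p>Before running this on the full dataset, the same steps can be followed on a toy frame small enough to check by hand. This is only a sketch: the titles <code>A</code>, <code>B</code>, <code>C</code> and the threshold <code>MIN_RATINGS = 2</code> are made up for illustration (the real cell uses 300).</p>

```python
import pandas as pd

# Six toy ratings across three hypothetical titles
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'M', 'F'],
    'rating': [4, 2, 4, 5, 3, 1],
})

# One row per title, one column per gender, cells hold the mean rating
mean_by_gender = toy.pivot_table(values='rating', index='title',
                                 columns='gender', aggfunc='mean')

# Number of ratings each title received
counts = toy.groupby('title').size()

# Index of titles that reach the (illustrative) threshold
MIN_RATINGS = 2
popular = counts.index[counts >= MIN_RATINGS]

# Select those rows by label and rank them by the female mean
popular_means = mean_by_gender.loc[popular]
top_f = popular_means.sort_values(by='F', ascending=False)
print(top_f)
```

<p>Title <code>C</code> has no male ratings, so its <code>M</code> cell in <code>mean_by_gender</code> is <code>NaN</code>; it is also dropped by the threshold because it has a single rating.</p>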

In [68]:
# Reshape frame to get means of rating by gender
mean_ratings = data.pivot_table('rating', index='title', columns='gender',
                                aggfunc='mean')
print("Mean ratings by gender:")
display(mean_ratings.head())

# Get size of ratings by title
ratings_by_title = data.groupby('title').size()

# Create filter for at least 300 ratings and filter out by gender
active_titles = ratings_by_title.index[ratings_by_title >= 300]
active_mean_ratings = mean_ratings.loc[active_titles]
print("Titles with at least 300 ratings, by gender:")
display(active_mean_ratings)

# Get the 10 highest-rated films among females
active_sorted = active_mean_ratings.sort_values(by='F', ascending=False)
female_top10 = active_sorted.head(10)
print("Top 10 films by females:")
female_top10

Mean ratings by gender:


title,F,M
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024


Titles with at least 300 ratings, by gender:


title,F,M
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


Top 10 films by females:


title,F,M
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
"Shawshank Redemption, The (1994)",4.539075,4.560625
"Grand Day Out, A (1992)",4.537879,4.293255
To Kill a Mockingbird (1962),4.536667,4.372611
"Usual Suspects, The (1995)",4.513317,4.518248
It Happened One Night (1934),4.5,4.163934


## 4. Measuring Mean Difference
<p>OK, now we can proceed to extract statistics that could be useful for training a machine learning algorithm in the future. In essence, this tells us which films females rated higher than males (or vice versa, depending on which gender we are interested in studying; here we arbitrarily picked females).</p>
<p>We are going to subtract the female ratings from the male ratings. Before doing the actual subtraction we have to impute the missing values, which can be done by using the mean value of the column. The numbers in the resulting <em>'diff'</em> column are interpreted as follows: a negative number means that the movie was rated higher by women, and a positive number means that it was rated higher by men. Keep in mind that where a value was imputed (in the output, the 3.293716 repeated in the <em>F</em> column is such a value), the difference reflects the column mean rather than actual ratings from that gender.</p>
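
<p>The effect of mean imputation on the difference is easy to see on a tiny example. The sketch below uses a hypothetical three-title table (<code>toy_means</code>) and additionally records which rows contained an imputed value, so that they can be set aside if only observed differences are wanted. The <code>imputed</code> flag is an addition for illustration and is not part of the original cell.</p>

```python
import numpy as np
import pandas as pd

# Toy table of mean ratings; B has no female ratings, C has no male ratings
toy_means = pd.DataFrame(
    {'F': [4.0, np.nan, 2.0], 'M': [1.0, 3.0, np.nan]},
    index=pd.Index(['A', 'B', 'C'], name='title'),
)

# Remember which rows had a missing value before filling
imputed_mask = toy_means.isna().any(axis=1)

# Fill each column's gaps with that column's mean (F -> 3.0, M -> 2.0)
filled = toy_means.fillna(toy_means.mean())

# Negative: rated higher by females; positive: rated higher by males
filled['diff'] = filled['M'] - filled['F']
filled['imputed'] = imputed_mask

# Differences backed by real ratings from both genders
observed_only = filled[~filled['imputed']]
print(filled)
print(observed_only)
```

<p>Only title <code>A</code> has a difference based on ratings from both genders; for <code>B</code> and <code>C</code> the difference is computed against a column mean.</p>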

In [71]:
# Get mean difference. Sort them, fill nan values
# mean_ratings.fillna(0.0, axis=0, inplace=True)
mean_ratings.fillna(mean_ratings.mean(), inplace=True)
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
# mean_ratings.dropna(axis=0, inplace=True)

sorted_by_diff = mean_ratings.sort_values(by='diff')

# Print top10 for female preference
print("Top 10 female ratings:")
display(sorted_by_diff[:10])

# Print in reverse order to get the male preference
print("Top 10 male ratings:")
display(sorted_by_diff[::-1][:10])

Top 10 female ratings:


title,F,M,diff
"Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)",4.0,1.0,-3.0
"James Dean Story, The (1957)",4.0,1.0,-3.0
Country Life (1994),5.0,2.0,-3.0
Babyfever (1994),3.666667,1.0,-2.666667
"Woman of Paris, A (1923)",5.0,2.428571,-2.571429
Cobra (1925),4.0,1.5,-2.5
Even Dwarfs Started Small (Auch Zwerge haben klein angefangen) (1971),3.293716,1.0,-2.293716
Lotto Land (1995),3.293716,1.0,-2.293716
Venice/Venice (1992),3.293716,1.0,-2.293716
Mutters Courage (1995),3.293716,1.0,-2.293716


Top 10 male ratings:


title,F,M,diff
Tigrero: A Film That Was Never Made (1994),1.0,4.333333,3.333333
"Neon Bible, The (1995)",1.0,4.0,3.0
"Enfer, L' (1994)",1.0,3.75,2.75
Stalingrad (1993),1.0,3.59375,2.59375
Killer: A Journal of Murder (1995),1.0,3.428571,2.428571
Dangerous Ground (1997),1.0,3.333333,2.333333
In God's Hands (1998),1.0,3.333333,2.333333
Rosie (1998),1.0,3.333333,2.333333
"Flying Saucer, The (1950)",1.0,3.3,2.3
"Silence of the Palace, The (Saimt el Qusur) (1994)",1.0,3.217015,2.217015


## 5. Calculating Standard Deviation
<p>At first glance we can see the not so surprising fact that males tend to give higher ratings to action movies (or at least that is what the titles of such movies suggest), as opposed to the females' preference for non-action titles.</p>
<p>One can come up with many test cases and corresponding calculations here. To conclude our project we will try one of these cases: extracting the entries that cause the most disagreement among viewers, independent of gender identification. This can be achieved by measuring the <a href="https://simple.wikipedia.org/wiki/Standard_deviation">standard deviation</a> of the ratings.</p>
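
<p>As a minimal sketch of that idea, assuming a merged frame with <code>title</code> and <code>rating</code> columns, the standard deviation can be computed per title with a <code>groupby</code> and then sorted. The toy data, the title names and the <code>MIN_RATINGS</code> threshold below are illustrative, and this is one possible approach rather than the definitive one. Note that pandas computes the sample standard deviation (<code>ddof=1</code>) by default.</p>

```python
import pandas as pd

# Toy ratings: A splits opinion, B is unanimous, C has too few ratings
toy = pd.DataFrame({
    'title':  ['A'] * 4 + ['B'] * 4 + ['C'] * 2,
    'rating': [1, 5, 1, 5, 3, 3, 3, 3, 2, 4],
})

grouped = toy.groupby('title')['rating']
rating_std = grouped.std()   # sample standard deviation per title
counts = grouped.size()      # number of ratings per title

# Keep only titles with enough ratings for the spread to be meaningful
MIN_RATINGS = 3
controversial = rating_std[counts >= MIN_RATINGS].sort_values(ascending=False)
print(controversial)
```

<p>Filtering by the number of ratings first matters here for the same reason as before: a title with only a couple of ratings can show a large spread purely by chance.</p>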