## Description
This notebook explores a MovieLens data set of about 1 million ratings, collected from roughly 6,000 users on about 4,000 movies in the late 1990s and early 2000s. The data provides movie ratings, movie metadata (genres and year), and demographic data about the users. Such data is often of interest when developing recommendation systems based on machine learning algorithms.

## Contents
1. The Datasets
2. Analysis Plan

### The Datasets
1. Ratings
2. User information
3. Movies information

#### Key Variables
- Ratings
    - user_id
    - movie_id
    - rating
    - timestamp
- User information
    - user_id
    - gender
    - age
    - occupation
    - zip_code
- Movies information
    - movie_id
    - title
    - genres

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
movies_labels = ['movie_id', 'title', 'genres']
ratings_labels = ['user_id', 'movie_id', 'rating', 'timestamp']
users_labels = ['user_id', 'gender', 'age', 'occupation', 'zip_code']

users = pd.read_csv('data/movielens/users.dat', sep= '::', engine= 'python',
                    header= None, names= users_labels)
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip_code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


Notice that age and occupation are categorical variables stored as integer codes (the README file describes what each code means; we will look them up when needed), and genres is a pipe-separated list of categorical values.
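To make the coded age column readable, the codes can be mapped to labels. The sketch below assumes the age groups listed in the MovieLens 1M README (please check them against your copy); `AGE_LABELS`, `sample_users`, and `age_group` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Age codes as listed in the MovieLens 1M README (assumed; verify against your copy).
AGE_LABELS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}

# Tiny stand-in for the users table, built from the first rows shown above.
sample_users = pd.DataFrame({"user_id": [1, 2, 3, 4, 5],
                             "age": [1, 56, 25, 45, 25]})

# Map each code to its label; a code missing from the dict would become NaN.
sample_users["age_group"] = sample_users["age"].map(AGE_LABELS)
print(sample_users)
```

The same `map` call should work on the full `users` frame, provided the codes really match the README.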

In [3]:
ratings = pd.read_csv('data/movielens/ratings.dat', sep= '::', engine= 'python',
                    header= None, names= ratings_labels)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
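The `timestamp` values look like Unix epoch seconds. Treat that as an assumption to confirm against the README. If it holds, they can be converted to readable datetimes as sketched below; `sample_ratings` and `rated_at` are illustrative names.

```python
import pandas as pd

# Two rows copied from the ratings preview above.
sample_ratings = pd.DataFrame({
    "user_id": [1, 1],
    "movie_id": [1193, 661],
    "rating": [5, 3],
    "timestamp": [978300760, 978302109],
})

# Assumption: timestamps are seconds since 1970-01-01 UTC.
sample_ratings["rated_at"] = pd.to_datetime(sample_ratings["timestamp"], unit="s")
print(sample_ratings[["timestamp", "rated_at"]])
```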


In [4]:
movies = pd.read_csv('data/movielens/movies.dat', sep= '::', engine= 'python',
                    header= None, names= movies_labels)
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
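In the rows shown, the release year is embedded at the end of each title. A minimal sketch for pulling it into its own column follows, assuming every title ends with a four-digit year in parentheses (not verified for the whole file); `sample_movies` and `year` are illustrative names.

```python
import pandas as pd

# Three rows copied from the movies preview above.
sample_movies = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# Capture a four-digit year in parentheses at the end of the title.
year_text = sample_movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert the captured text to numbers; titles without a match would become NaN.
sample_movies["year"] = pd.to_numeric(year_text)
print(sample_movies)
```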


Analysing these three tables separately would be awkward. It is more convenient to merge them into a single table.

In [5]:
# merge movies and ratings (on movie_id), then merge the result with users (on user_id)
# pandas joins on the shared column names by default, but passing on= explicitly is good practice
df = pd.merge(pd.merge(movies, ratings, on= 'movie_id'), users, on= 'user_id')
df

Unnamed: 0,movie_id,title,genres,user_id,rating,timestamp,gender,age,occupation,zip_code
0,1,Toy Story (1995),Animation|Children's|Comedy,1,5,978824268,F,1,10,48067
1,48,Pocahontas (1995),Animation|Children's|Musical|Romance,1,5,978824351,F,1,10,48067
2,150,Apollo 13 (1995),Drama,1,5,978301777,F,1,10,48067
3,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi,1,4,978300760,F,1,10,48067
4,527,Schindler's List (1993),Drama|War,1,5,978824195,F,1,10,48067
...,...,...,...,...,...,...,...,...,...,...
1000204,3513,Rules of Engagement (2000),Drama|Thriller,5727,4,958489970,M,25,4,92843
1000205,3535,American Psycho (2000),Comedy|Horror|Thriller,5727,2,958489970,M,25,4,92843
1000206,3536,Keeping the Faith (2000),Comedy|Romance,5727,5,958489902,M,25,4,92843
1000207,3555,U-571 (2000),Action|Thriller,5727,3,958490699,M,25,4,92843
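When merging on explicit keys, it can help to have pandas check that the lookup tables really have one row per key. The sketch below uses toy frames (`toy_movies`, `toy_ratings`, `toy_users` are made-up stand-ins) and the `validate` argument of `DataFrame.merge`; it illustrates the idea and is not the notebook's own merge.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
toy_movies = pd.DataFrame({"movie_id": [1, 2], "title": ["A (1999)", "B (2000)"]})
toy_ratings = pd.DataFrame({"user_id": [10, 10, 11],
                            "movie_id": [1, 2, 1],
                            "rating": [5, 3, 4]})
toy_users = pd.DataFrame({"user_id": [10, 11], "gender": ["F", "M"]})

# validate="many_to_one" raises MergeError if a key is duplicated in the
# right-hand table, which would silently multiply rating rows otherwise.
merged = (
    toy_ratings
    .merge(toy_movies, on="movie_id", how="inner", validate="many_to_one")
    .merge(toy_users, on="user_id", how="inner", validate="many_to_one")
)
print(merged)
```

With an inner join and unique keys on the right, the merged table has at most as many rows as the ratings table.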


In [6]:
# Rearrange the columns
col = ['user_id', 'movie_id', 'rating', 'gender', 'age', 'timestamp', 'occupation', 'zip_code', 'title', 'genres']
df = df.reindex(columns= col)

# Convert age and occupation to the categorical dtype, which saves memory and can speed up grouping
df['age'] = df['age'].astype('category')
df['occupation'] = df['occupation'].astype('category')

df['gender'] = df['gender'].astype('category')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
 #   Column      Non-Null Count    Dtype   
---  ------      --------------    -----   
 0   user_id     1000209 non-null  int64   
 1   movie_id    1000209 non-null  int64   
 2   rating      1000209 non-null  int64   
 3   gender      1000209 non-null  category
 4   age         1000209 non-null  category
 5   timestamp   1000209 non-null  int64   
 6   occupation  1000209 non-null  category
 7   zip_code    1000209 non-null  object  
 8   title       1000209 non-null  object  
 9   genres      1000209 non-null  object  
dtypes: category(3), int64(4), object(3)
memory usage: 63.9+ MB
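The `+` in the memory figure means the object columns are not measured in full. To see what the categorical conversion buys, one can compare deep memory usage before and after on a synthetic column. This is a rough sketch; the exact numbers depend on the pandas version and string storage.

```python
import pandas as pd

# Synthetic gender-like column with only two distinct values.
gender_obj = pd.Series(["F", "M"] * 50_000)
gender_cat = gender_obj.astype("category")

# deep=True counts the actual string payload, not just the pointers.
obj_bytes = gender_obj.memory_usage(deep=True)
cat_bytes = gender_cat.memory_usage(deep=True)
print(f"as strings:  {obj_bytes:,} bytes")
print(f"as category: {cat_bytes:,} bytes")
```

On the real frame, `df.info(memory_usage="deep")` reports the full figure, including the string columns.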


In [8]:
df.iloc[0]

user_id                                 1
movie_id                                1
rating                                  5
gender                                  F
age                                     1
timestamp                       978824268
occupation                             10
zip_code                            48067
title                    Toy Story (1995)
genres        Animation|Children's|Comedy
Name: 0, dtype: object

In [10]:
# Getting mean movie rating, across gender
mean_ratings = df.pivot_table(values= 'rating', index= 'title', columns= 'gender', aggfunc= 'mean')
mean_ratings.head()

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
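For readers less familiar with `pivot_table`, the call above is roughly a group-by on two keys followed by moving one key into the columns. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Made-up ratings: two titles, rated by both genders.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [4, 2, 4, 5, 3],
})

# Mean rating per title and gender, with gender spread across the columns.
via_pivot = toy.pivot_table(values="rating", index="title",
                            columns="gender", aggfunc="mean")

# The same table built step by step: group, aggregate, then unstack.
via_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(via_pivot)
print(via_groupby)
```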


These means can be biased, because many movies were rated by only a handful of users. We therefore keep only movies with at least 250 ratings.

In [12]:
ratings_by_title = df.groupby('title').size()
ratings_by_title.sort_values(ascending= False).tail() # least-rated titles; the most-rated titles have around 3-4K ratings

title
Target (1995)                                                1
I Don't Want to Talk About It (De eso no se habla) (1993)    1
An Unforgettable Summer (1994)                               1
Never Met Picasso (1996)                                     1
Full Speed (1996)                                            1
dtype: int64

In [13]:
popular_movies = ratings_by_title.index[ratings_by_title >= 250]
popular_movies

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
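An alternative way to apply the same kind of threshold is to filter the rows directly, using a per-row count of ratings for each title. The sketch below uses made-up data and a toy threshold of 2 instead of 250; `toy`, `MIN_RATINGS`, and `popular_rows` are illustrative names.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 1, C has 2.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 2, 4, 5, 3, 1],
})

MIN_RATINGS = 2  # toy threshold; the notebook uses 250

# transform("size") repeats each title's rating count on every one of its rows.
counts = toy.groupby("title")["rating"].transform("size")

# Keep only rows belonging to sufficiently rated titles.
popular_rows = toy[counts >= MIN_RATINGS]
print(popular_rows)
```

Filtering the rows first means any later aggregation only ever sees the popular titles.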

In [14]:
mean_ratings = mean_ratings.loc[popular_movies]
mean_ratings

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


In [16]:
# top movies among males
top_males_ratings = mean_ratings.sort_values('M', ascending= False)
top_males_ratings.head()

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Godfather, The (1972)",4.3147,4.583333
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),4.481132,4.576628
"Shawshank Redemption, The (1994)",4.539075,4.560625
Raiders of the Lost Ark (1981),4.332168,4.520597
"Usual Suspects, The (1995)",4.513317,4.518248


#### Movies Disagreement

In [17]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
# sort the diff
ratings_disagreement = mean_ratings.sort_values('diff')

# movies rated higher by women than by men
ratings_disagreement.head()

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
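Sorting by `diff` and reading off one end of the table can also be written with `nsmallest` and `nlargest`, which avoids keeping track of the sort direction. A sketch on made-up means (`toy_means` is hypothetical):

```python
import pandas as pd

# Made-up mean ratings by gender for three titles.
toy_means = pd.DataFrame(
    {"F": [3.8, 2.3, 4.0], "M": [3.0, 2.9, 4.1]},
    index=pd.Index(["X", "Y", "Z"], name="title"),
)
toy_means["diff"] = toy_means["M"] - toy_means["F"]

# Most negative diff: rated higher by women.
preferred_by_women = toy_means.nsmallest(2, "diff")

# Most positive diff: rated higher by men.
preferred_by_men = toy_means.nlargest(2, "diff")

print(preferred_by_women)
print(preferred_by_men)
```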


In [18]:
# movies rated higher by men than by women
ratings_disagreement.tail() # or [::-1].head()

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Cable Guy, The (1996)",2.25,2.863787,0.613787
"Longest Day, The (1962)",3.411765,4.031447,0.619682
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351


That is one side of the coin: movies on which men and women disagree. Next we look for the movies that divided viewers the most, regardless of gender. This disagreement can be measured by the variance or standard deviation of the ratings.

In [21]:
ratings_discrimination = df.groupby('title')['rating'].std()

# keep only the popular movies (at least 250 ratings)
ratings_discrimination = ratings_discrimination.loc[popular_movies]

# movies with the most divisive ratings (highest standard deviation)
ratings_discrimination.sort_values(ascending= False)[:10]

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
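It can be useful to view the spread next to the count and the mean, and to remember that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to `ddof=0`. A sketch on made-up ratings (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Title A is polarising (only 1s and 5s); title B is rated consistently.
toy = pd.DataFrame({
    "title": ["A"] * 4 + ["B"] * 4,
    "rating": [1, 5, 1, 5, 3, 3, 4, 3],
})

# Count, mean, and sample standard deviation per title in one pass.
stats = toy.groupby("title")["rating"].agg(["size", "mean", "std"])
print(stats.sort_values("std", ascending=False))

# pandas' std matches NumPy only when ddof=1 is passed explicitly.
numpy_std_a = np.std([1, 5, 1, 5], ddof=1)
```

Both titles here have similar means, yet very different spreads, which is exactly what the standard deviation is meant to surface.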

#### Movies Genres

In [22]:
movies.genres.head()

0     Animation|Children's|Comedy
1    Adventure|Children's|Fantasy
2                  Comedy|Romance
3                    Comedy|Drama
4                          Comedy
Name: genres, dtype: object

In [23]:
movies.genres.str.split('|').head()

0     [Animation, Children's, Comedy]
1    [Adventure, Children's, Fantasy]
2                   [Comedy, Romance]
3                     [Comedy, Drama]
4                            [Comedy]
Name: genres, dtype: object

In [24]:
Out[23].explode()

0     Animation
0    Children's
0        Comedy
1     Adventure
1    Children's
1       Fantasy
2        Comedy
2       Romance
3        Comedy
3         Drama
4        Comedy
Name: genres, dtype: object
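Once the genres are exploded into one row per genre, they can be used as a grouping key. The following sketch computes a mean rating per genre on made-up ratings attached to three of the titles above; `toy`, `by_genre`, and `genre_means` are illustrative names, and the ratings are invented.

```python
import pandas as pd

# Three titles from the movies preview, each with one invented rating.
toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)",
              "Father of the Bride Part II (1995)"],
    "genres": ["Animation|Children's|Comedy", "Comedy|Romance", "Comedy"],
    "rating": [5, 3, 4],
})

# Split the pipe-separated string into a list, then emit one row per genre.
by_genre = toy.assign(genre=toy["genres"].str.split("|")).explode("genre")

# A movie with several genres contributes its rating to each of them.
genre_means = by_genre.groupby("genre")["rating"].mean()
print(genre_means)
```

Note that a rating is counted once per genre here, so the per-genre groups overlap.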