# Movie Data Analysis in Python 3 

## Today, we will be examining .csv files from MovieLens. This will include exploration, cleaning, and visualisation of datasets.

#### Import Libraries

In [1]:
import pandas as pd

#### Import MovieLens Dataset

Examine the file names within the folder with the !ls command.

In [2]:
!ls ./movielens/ml-latest/

genome-scores.csv  imdbpy-master  movies.csv   README.txt
genome-tags.csv    links.csv	  ratings.csv  tags.csv


Piping the !cat command into wc -l counts the lines in each .csv file.

In [4]:
!cat ./movielens/ml-latest/movies.csv | wc -l

45844


In [5]:
!cat ./movielens/ml-latest/tags.csv | wc -l

753171
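The line counts above include each file's header line, so a DataFrame read from the same file should have one row fewer. A minimal sketch of that relationship, using a small in-memory stand-in for movies.csv (the `csv_text` sample is hypothetical, not the real file):

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for movies.csv: one header line + three data lines.
csv_text = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# Rough equivalent of `wc -l`: count the newline characters.
line_count = csv_text.count("\n")

# read_csv consumes the first line as the header, so it is not a data row.
sample = pd.read_csv(io.StringIO(csv_text))
row_count = len(sample)

print(line_count, row_count)
```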


#### Read the dataset with the pandas read_csv function and assign it to the variable 'movies'.

In [6]:
# sep=',' sets the comma as the separator value.
movies = pd.read_csv('./movielens/ml-latest/movies.csv', sep=',')
print(type(movies))
movies.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
tags = pd.read_csv('./movielens/ml-latest/tags.csv', sep=',')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,1,318,narrated,1425942391
1,20,4306,Dreamworks,1459855607
2,20,89302,England,1400778834
3,20,89302,espionage,1400778836
4,20,89302,jazz,1400778841


In [8]:
ratings = pd.read_csv('./movielens/ml-latest/ratings.csv', sep=',')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


The 'timestamp' column in the ratings and tags files isn't really important to us, so we will delete it from both.

In [9]:
del ratings['timestamp']
del tags['timestamp']
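As an alternative to `del`, which modifies the DataFrame in place, `drop(columns=...)` returns a new DataFrame and leaves the original untouched. A small sketch on a hypothetical sample (`ratings_sample` is illustrative and mirrors the first rows of ratings.csv):

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of ratings.csv.
ratings_sample = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [110, 147, 858],
    "rating": [1.0, 4.5, 5.0],
    "timestamp": [1425941529, 1425942435, 1425941523],
})

# drop() returns a copy without the column; ratings_sample keeps all four columns.
ratings_trimmed = ratings_sample.drop(columns=["timestamp"])

print(list(ratings_trimmed.columns))
print(list(ratings_sample.columns))
```

This can be convenient in a notebook, where re-running a `del` cell raises a KeyError once the column is already gone.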

We can verify the deletion by checking the columns attribute of each DataFrame.

In [10]:
tags.columns

Index(['userId', 'movieId', 'tag'], dtype='object')

In [11]:
ratings.columns

Index(['userId', 'movieId', 'rating'], dtype='object')

In [13]:
movies.columns

Index(['movieId', 'title', 'genres'], dtype='object')

We can use easy functions such as 'describe' to tell us more about our datasets.

In [14]:
ratings['rating'].describe()

count    2.602429e+07
mean     3.528090e+00
std      1.065443e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

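Here we can see some useful numbers, such as the mean of all ratings being about 3.53. We can also easily look for correlations, although userId and movieId are identifiers rather than measurements, so the near-zero values below carry little meaning.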

In [15]:
ratings.corr()

Unnamed: 0,userId,movieId,rating
userId,1.0,-0.00141,-0.000159
movieId,-0.00141,1.0,-0.002841
rating,-0.000159,-0.002841,1.0


### Now we will look to clean our data

This includes looking through the datasets for anything that sticks out and getting rid of duplicates or null values, for example.

In [16]:
movies.shape

(45843, 3)

We can look to see if there are any null entries by running 'isnull'.

In [17]:
movies.isnull().any()

movieId    False
title      False
genres     False
dtype: bool
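Duplicates are mentioned above but not checked in this notebook. One way to look for them is sketched below on a hypothetical sample (`movies_sample` is illustrative and deliberately contains one repeated row); it says nothing about whether the real file contains duplicates.

```python
import pandas as pd

# Hypothetical sample with one fully repeated row (movieId 2).
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 2, 3],
    "title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Jumanji (1995)",
        "Grumpier Old Men (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(movies_sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
movies_dedup = movies_sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, movies_dedup.shape)
```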

Now we will move on to our tags file:

In [18]:
tags.shape

(753170, 3)

In [19]:
tags.isnull().any()

userId     False
movieId    False
tag         True
dtype: bool

We can easily take care of this by running the 'dropna' function.

In [21]:
tags = tags.dropna()

In [22]:
tags.shape

(753154, 3)

In [23]:
753170 - 753154 

16

We can run the function 'isnull' again to confirm that the 16 null tags were indeed dropped.

In [24]:
tags.isnull().any()

userId     False
movieId    False
tag        False
dtype: bool
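Rather than subtracting the two shapes by hand, the number of nulls per column can be read directly with `isnull().sum()`. A minimal sketch on a hypothetical sample (`tags_sample` is illustrative, with two missing tags):

```python
import pandas as pd

# Hypothetical sample with two missing tags.
tags_sample = pd.DataFrame({
    "userId": [1, 20, 20, 20],
    "movieId": [318, 4306, 89302, 89302],
    "tag": ["narrated", None, "England", None],
})

# isnull() yields booleans; summing them counts the nulls in each column.
null_counts = tags_sample.isnull().sum()

# Restricting dropna to the 'tag' column makes the intent explicit.
tags_clean = tags_sample.dropna(subset=["tag"])

print(null_counts)
print(tags_clean.shape)
```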

## Next we will explore and merge our data.

In [25]:
tags.head()

Unnamed: 0,userId,movieId,tag
0,1,318,narrated
1,20,4306,Dreamworks
2,20,89302,England
3,20,89302,espionage
4,20,89302,jazz


In [26]:
tags.groupby('movieId')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f0c117b4a58>
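The output above is only a lazy GroupBy object; nothing is computed until an aggregation is applied to it. As a sketch, counting the tags per movie on a hypothetical sample (`tags_sample` mirrors the head of the tags table):

```python
import pandas as pd

# Hypothetical sample mirroring tags.head().
tags_sample = pd.DataFrame({
    "userId": [1, 20, 20, 20, 20],
    "movieId": [318, 4306, 89302, 89302, 89302],
    "tag": ["narrated", "Dreamworks", "England", "espionage", "jazz"],
})

# size() counts the rows in each group: here, the number of tags per movie.
tags_per_movie = tags_sample.groupby("movieId").size()

print(tags_per_movie)
```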

In [27]:
tags.head()

Unnamed: 0,userId,movieId,tag
0,1,318,narrated
1,20,4306,Dreamworks
2,20,89302,England
3,20,89302,espionage
4,20,89302,jazz


In [28]:
movies[['title','genres']].head()

Unnamed: 0,title,genres
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,Jumanji (1995),Adventure|Children|Fantasy
2,Grumpier Old Men (1995),Comedy|Romance
3,Waiting to Exhale (1995),Comedy|Drama|Romance
4,Father of the Bride Part II (1995),Comedy


In [29]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

rating,movieId
0.5,404897
1.0,843310
1.5,403607
2.0,1762440
2.5,1255358
3.0,5256722
3.5,3116213
4.0,6998802
4.5,2170441
5.0,3812499


In [30]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.tail()

movieId,rating
176267,4.0
176269,3.5
176271,5.0
176273,1.0
176275,3.0


In [31]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

movieId,rating
1,66008
2,26060
3,15497
4,2981
5,15258


In [32]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()

movieId,rating
176267,1
176269,1
176271,1
176273,1
176275,1
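The tail of both tables shows movies with a single rating, where the mean is just that one vote. A possible way to see the mean and the count side by side is `agg` with both functions, sketched here on a hypothetical sample (`ratings_sample` and the threshold of two ratings are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample: movie 1 has three ratings, movie 2 has two, movie 3 has one.
ratings_sample = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# One groupby pass producing both statistics as columns.
movie_stats = ratings_sample.groupby("movieId")["rating"].agg(["mean", "count"])

# Keep only movies with at least two ratings (an arbitrary, illustrative threshold).
well_rated = movie_stats[movie_stats["count"] >= 2]

print(movie_stats)
print(well_rated)
```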


In [33]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0
3,1,1221,5.0
4,1,1246,5.0


In [34]:
del ratings['userId']

In [35]:
ratings.head()

Unnamed: 0,movieId,rating
0,110,1.0
1,147,4.5
2,858,5.0
3,1221,5.0
4,1246,5.0


In [36]:
movies_tags = pd.merge(movies, tags, how='inner')
movies_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,Pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,time travel
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,computer animation
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,funny
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,Pixar
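The merge above relies on pandas joining on every column name the two frames share, which here is just movieId. Naming the key explicitly, and optionally asking pandas to validate the relationship, makes the join easier to reason about. A sketch on hypothetical samples (`movies_sample` and `tags_sample` are illustrative):

```python
import pandas as pd

# Hypothetical samples: tags reference movie 1 twice and a movie absent from movies_sample.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
tags_sample = pd.DataFrame({
    "userId": [1250, 1652, 20],
    "movieId": [1, 1, 4306],
    "tag": ["Pixar", "funny", "Dreamworks"],
})

# on= names the join key; validate= raises if movieId is not unique in movies_sample.
merged = pd.merge(
    movies_sample, tags_sample,
    on="movieId", how="inner", validate="one_to_many",
)

print(merged)
```

With `how='inner'`, movies without tags and tags whose movieId is missing from the movies table are both dropped from the result.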


In [37]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
avg_ratings.head()

Unnamed: 0,movieId,rating
0,1,3.888157
1,2,3.236953
2,3,3.17555
3,4,2.875713
4,5,3.079565


In [38]:
movies_tags_ratings = movies_tags.merge(avg_ratings, on='movieId', how='inner')
movies_tags_ratings.head()

Unnamed: 0,movieId,title,genres,userId,tag,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,Pixar,3.888157
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,time travel,3.888157
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,computer animation,3.888157
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,funny,3.888157
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,Pixar,3.888157


So that is our final merged DataFrame with all the columns we need. We will export it later as a master .csv file. Suppose, though, that we'd like to compare two tags and see how they stack up by their ratings. We would first want to filter out the movies that don't carry the selected tags. This filtering is important, especially when dealing with large datasets (such as this one, with nearly one million entries).
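Filtering down to a handful of chosen tags can be sketched with `isin`. The sample below is hypothetical (`merged_sample` and `selected_tags` are illustrative names; the rows are shaped like movies_tags_ratings):

```python
import pandas as pd

# Hypothetical rows shaped like movies_tags_ratings.
merged_sample = pd.DataFrame({
    "movieId": [32, 253, 1385, 1732, 1732],
    "tag": ["Brad Pitt", "Tom Cruise", "Steven Seagal", "black comedy", "John Goodman"],
    "rating": [3.888769, 3.501930, 3.115369, 3.954323, 3.954323],
})

# The tags we want to compare (an illustrative choice).
selected_tags = ["Brad Pitt", "Tom Cruise"]

# isin() builds a boolean mask that is True only for rows with a selected tag.
battle = merged_sample[merged_sample["tag"].isin(selected_tags)]

print(battle)
```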

There are a few things to think about when selecting possible tags for our 'ratings battle'. First, we want to make sure there are enough movies within the tag to compare:

In [52]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['feel-good']

635

In [53]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['Tyler Perry']

6

As you can see, it wouldn't be valuable to compare 'feel-good' movies with 'Tyler Perry' movies, because the 'Tyler Perry' tag appears only 6 times, which is too few to provide reliable data. Conversely, there is also the problem of oversaturated or inaccurate tags:

In [54]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['sci-fi']

8040
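Note that `value_counts` on the merged table counts rows, that is, how many times a tag was applied, not how many distinct movies carry it. If the number of movies is what matters, `nunique` on movieId is one option. A sketch on a hypothetical sample (`merged_sample` is illustrative):

```python
import pandas as pd

# Hypothetical rows: one movie tagged 'Brad Pitt' by three users,
# two movies tagged 'Steven Seagal' (one of them by two users).
merged_sample = pd.DataFrame({
    "movieId": [32, 32, 32, 1385, 1385, 1382],
    "tag": ["Brad Pitt", "Brad Pitt", "Brad Pitt",
            "Steven Seagal", "Steven Seagal", "Steven Seagal"],
})

# Number of times each tag was applied (one row per user/movie/tag).
tag_rows = merged_sample["tag"].value_counts()

# Number of distinct movies carrying each tag.
tag_movies = merged_sample.groupby("tag")["movieId"].nunique()

print(tag_rows)
print(tag_movies)
```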

You can see it wouldn't be a great comparison to pit any of the above three categories against one another. The last problem would be data integrity:

In [59]:
movies_tags_ratings.iloc[143727]

movieId                        1732
title      Big Lebowski, The (1998)
genres                 Comedy|Crime
userId                         7806
tag                    black comedy
rating                      3.95432
Name: 143727, dtype: object

I can tell you that "The Big Lebowski", for example, is not a 'black comedy'. This might make one question the integrity of the entire dataset, but the simplest solution is probably to use a 'tighter', more accurate, and more concise category. This is one reason why initial data exploration is very important, and why at least a slight familiarity with the topic comes in handy. In the case of movies, I don't have the time to analyze all the movies I'm unfamiliar with, so I'll probably use actor tags, assuming that a movie tagged with an actor's name actually features that actor, as opposed to more nuanced categories such as 'black comedy'.

In [62]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['John Goodman']

127

In [63]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['Tom Cruise']

657

In [65]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['Brad Pitt']

909
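The cells above recompute `tag_counts` each time; it only needs to be computed once and can then be queried for several tags together. A sketch using `reindex`, which also copes with tags that never occur (the sample data and the `candidates` list are hypothetical):

```python
import pandas as pd

# Hypothetical tag column.
sample_tags = pd.Series(
    ["Brad Pitt", "Brad Pitt", "Tom Cruise", "John Goodman", "Brad Pitt"],
    name="tag",
)

# Compute the counts once.
tag_counts = sample_tags.value_counts()

# Look up several tags at once; tags that never occur get a count of 0
# instead of raising a KeyError.
candidates = ["Brad Pitt", "Tom Cruise", "Nonexistent tag"]
counts = tag_counts.reindex(candidates, fill_value=0)

print(counts)
```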

In [66]:
tag_counts = movies_tags_ratings['tag'].value_counts()
tag_counts['Steven Seagal']

52

Why not? Let's take a look at a slice of a particular data frame:

In [68]:
is_seagal = movies_tags_ratings['tag'].str.contains('Steven Seagal')
movies_tags_ratings[is_seagal][5:15]

Unnamed: 0,movieId,title,genres,userId,tag,rating
126124,1382,Marked for Death (1990),Action|Drama,172330,Steven Seagal,2.747428
126128,1382,Marked for Death (1990),Action|Drama,270123,Steven Seagal,2.747428
126137,1385,Under Siege (1992),Action|Drama|Thriller,71947,Steven Seagal,3.115369
126147,1385,Under Siege (1992),Action|Drama|Thriller,79623,Steven Seagal,3.115369
126154,1385,Under Siege (1992),Action|Drama|Thriller,172330,Steven Seagal,3.115369
126160,1385,Under Siege (1992),Action|Drama|Thriller,210476,Steven Seagal,3.115369
136067,1626,Fire Down Below (1997),Action|Drama|Thriller,172330,Steven Seagal,2.451911
136069,1626,Fire Down Below (1997),Action|Drama|Thriller,230606,Steven Seagal,2.451911
253481,4224,Exit Wounds (2001),Action|Thriller,172330,Steven Seagal,2.64966
263044,4466,Above the Law (1988),Action|Crime|Drama,930,Steven Seagal,2.808451
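One caveat with `str.contains`: it matches substrings (and treats the pattern as a regular expression by default), so it can select more rows than an exact comparison, which is what `value_counts` counted earlier. A sketch of the difference on a hypothetical sample (the tag 'Steven Seagal movie' is invented for illustration):

```python
import pandas as pd

# Hypothetical tags; the second one merely contains the actor's name.
sample = pd.DataFrame({
    "movieId": [1385, 1385, 253],
    "tag": ["Steven Seagal", "Steven Seagal movie", "Tom Cruise"],
})

# Substring match: True for any tag containing the text (regex disabled here).
is_seagal_substring = sample["tag"].str.contains("Steven Seagal", regex=False)

# Exact match: True only when the whole tag equals the text.
is_seagal_exact = sample["tag"] == "Steven Seagal"

print(is_seagal_substring.sum(), is_seagal_exact.sum())
```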


In [72]:
is_pitt = movies_tags_ratings['tag'].str.contains('Brad Pitt')
movies_tags_ratings[is_pitt][5:15]

Unnamed: 0,movieId,title,genres,userId,tag,rating
3494,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,30558,Brad Pitt,3.888769
3515,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,35246,Brad Pitt,3.888769
3519,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,35787,Brad Pitt,3.888769
3536,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,39928,Brad Pitt,3.888769
3548,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,39981,Brad Pitt,3.888769
3565,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,40251,Brad Pitt,3.888769
3600,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,46686,Brad Pitt,3.888769
3668,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,55884,Brad Pitt,3.888769
3692,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,59525,Brad Pitt,3.888769
3724,32,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),Mystery|Sci-Fi|Thriller,63234,Brad Pitt,3.888769
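The slice above repeats the same film once for every user who applied the tag, so averaging the rating column directly would weight heavily tagged films more. One possible approach, sketched on a hypothetical sample, is to keep one row per (tag, movie) pair before averaging (the sample rows reuse ratings shown in this notebook's outputs; `per_movie` and `actor_means` are illustrative names):

```python
import pandas as pd

# Hypothetical rows shaped like movies_tags_ratings.
sample = pd.DataFrame({
    "movieId": [32, 32, 32, 1382, 1385],
    "tag": ["Brad Pitt", "Brad Pitt", "Brad Pitt",
            "Steven Seagal", "Steven Seagal"],
    "rating": [3.888769, 3.888769, 3.888769, 2.747428, 3.115369],
})

# Keep a single row for each (tag, movie) pair so every film counts once.
per_movie = sample.drop_duplicates(subset=["tag", "movieId"])

# Mean of the per-movie average ratings for each tag.
actor_means = per_movie.groupby("tag")["rating"].mean()

print(actor_means)
```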


In [73]:
is_cruise = movies_tags_ratings['tag'].str.contains('Tom Cruise')
movies_tags_ratings[is_cruise][5:15]

Unnamed: 0,movieId,title,genres,userId,tag,rating
18250,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,103577,Tom Cruise,3.50193
18259,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,104338,Tom Cruise,3.50193
18322,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,147125,Tom Cruise,3.50193
18331,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,150936,Tom Cruise,3.50193
18334,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,151016,Tom Cruise,3.50193
18338,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,151340,Tom Cruise,3.50193
18351,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,156347,Tom Cruise,3.50193
18359,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,162958,Tom Cruise,3.50193
18395,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,186921,Tom Cruise,3.50193
18434,253,Interview with the Vampire: The Vampire Chroni...,Drama|Horror,196358,Tom Cruise,3.50193


In [80]:
is_goodman = movies_tags_ratings['tag'].str.contains('John Goodman')
movies_tags_ratings[is_goodman][5:15]

Unnamed: 0,movieId,title,genres,tag,rating
143735,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143757,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143802,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143810,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143845,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143872,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143895,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143927,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
143993,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323
144026,1732,"Big Lebowski, The (1998)",Comedy|Crime,John Goodman,3.954323


### Export to .csv

In [84]:
movies_tags_ratings.to_csv('out.csv', sep=',')
movies_tags_ratings[is_goodman].to_csv('goodman.csv', sep=',')
movies_tags_ratings[is_cruise].to_csv('cruise.csv', sep=',')
movies_tags_ratings[is_pitt].to_csv('pitt.csv', sep=',')
movies_tags_ratings[is_seagal].to_csv('seagal.csv', sep=',')
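By default `to_csv` also writes the DataFrame index as an unnamed first column. If that column is not wanted in the exported files, `index=False` leaves it out, and a small loop avoids repeating the call for each actor. A sketch under stated assumptions (the sample data, the `masks` dictionary, and the temporary output directory are all illustrative):

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows shaped like movies_tags_ratings.
sample = pd.DataFrame({
    "movieId": [32, 253, 1385],
    "tag": ["Brad Pitt", "Tom Cruise", "Steven Seagal"],
    "rating": [3.888769, 3.501930, 3.115369],
})

# One boolean mask per output file (illustrative names).
masks = {
    "pitt": sample["tag"] == "Brad Pitt",
    "cruise": sample["tag"] == "Tom Cruise",
}

# Write into a temporary directory so the sketch leaves no files behind in the cwd.
out_dir = tempfile.mkdtemp()
for name, mask in masks.items():
    sample[mask].to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)

# Reading a file back shows only the original columns, with no extra index column.
reloaded = pd.read_csv(os.path.join(out_dir, "pitt.csv"))
print(list(reloaded.columns))
```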

### Now Plot 

https://plot.ly/~bryanb0102/31/actor-battle-royale/