# In-Depth Examination: tag vs. tag ratings battle

We will take the MovieLens data and examine which of two tags (e.g., comedy vs. sci-fi) has the higher average rating.

We will do this by employing simple charts such as bar graphs.

## Import Libraries

In [3]:
import pandas as pd

## Inspect the MovieLens Dataset

In [4]:
!ls ./movielens/ml-latest/

genome-scores.csv  imdbpy-master  movies.csv   README.txt
genome-tags.csv    links.csv	  ratings.csv  tags.csv


In [5]:
!cat ./movielens/ml-latest/movies.csv | wc -l

45844


In [6]:
!cat ./movielens/ml-latest/tags.csv | wc -l

753171


### Read Dataset

In [7]:
movies = pd.read_csv('./movielens/ml-latest/movies.csv', sep=',')
print(type(movies))
movies.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
tags = pd.read_csv('./movielens/ml-latest/tags.csv', sep=',')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,1,318,narrated,1425942391
1,20,4306,Dreamworks,1459855607
2,20,89302,England,1400778834
3,20,89302,espionage,1400778836
4,20,89302,jazz,1400778841


In [9]:
ratings = pd.read_csv('./movielens/ml-latest/ratings.csv', sep=',')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [10]:
del ratings['timestamp']
del tags['timestamp']

In [11]:
tags.columns

Index(['userId', 'movieId', 'tag'], dtype='object')

In [12]:
movies.columns

Index(['movieId', 'title', 'genres'], dtype='object')

In [13]:
ratings.columns

Index(['userId', 'movieId', 'rating'], dtype='object')

In [14]:
ratings['rating'].describe()

count    2.602429e+07
mean     3.528090e+00
std      1.065443e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

In [15]:
tags['tag'].describe()

count     753154
unique     53508
top       sci-fi
freq        8040
Name: tag, dtype: object

In [16]:
ratings.corr()

Unnamed: 0,userId,movieId,rating
userId,1.0,-0.00141,-0.000159
movieId,-0.00141,1.0,-0.002841
rating,-0.000159,-0.002841,1.0


# Clean Data

In [17]:
movies.shape

(45843, 3)

In [18]:
movies.isnull().any()

movieId    False
title      False
genres     False
dtype: bool

In [19]:
tags.shape

(753170, 3)

In [20]:
tags.isnull().any()

userId     False
movieId    False
tag         True
dtype: bool

In [21]:
tags = tags.dropna()

In [22]:
tags.shape

(753154, 3)

In [23]:
tags.isnull().any()

userId     False
movieId    False
tag        False
dtype: bool
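
The null check and drop above can be sketched on a toy frame. This is a minimal illustration, not the notebook's own code: `toy_tags` and its values are made up, and `dropna(subset=["tag"])` is an alternative to the plain `dropna()` used above that restricts the check to the `tag` column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tags.csv; the values are invented for illustration.
toy_tags = pd.DataFrame({
    "userId": [1, 2, 3, 4],
    "movieId": [10, 10, 20, 30],
    "tag": ["funny", np.nan, "sci-fi", np.nan],
})

# Count missing values per column instead of only flagging their presence.
null_counts = toy_tags.isnull().sum()
print(null_counts)

# Drop only the rows whose 'tag' is missing; other columns are not checked.
clean_tags = toy_tags.dropna(subset=["tag"])
print(len(toy_tags) - len(clean_tags), "rows dropped")
```

Counting the nulls first makes it easy to confirm that the number of dropped rows matches the change in `shape` (here 753170 to 753154, i.e. 16 rows).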

# Explore Data

### Here we will group and merge data.

In [24]:
tags.head()

Unnamed: 0,userId,movieId,tag
0,1,318,narrated
1,20,4306,Dreamworks
2,20,89302,England
3,20,89302,espionage
4,20,89302,jazz


In [25]:
tags.groupby('movieId')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f6bbd65cc88>

In [26]:
tags.head()

Unnamed: 0,userId,movieId,tag
0,1,318,narrated
1,20,4306,Dreamworks
2,20,89302,England
3,20,89302,espionage
4,20,89302,jazz


In [27]:
tags['tag'].head()

0      narrated
1    Dreamworks
2       England
3     espionage
4          jazz
Name: tag, dtype: object

In [28]:
movies[['title','genres']].head()

Unnamed: 0,title,genres
0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,Jumanji (1995),Adventure|Children|Fantasy
2,Grumpier Old Men (1995),Comedy|Romance
3,Waiting to Exhale (1995),Comedy|Drama|Romance
4,Father of the Bride Part II (1995),Comedy


Here we count how often each of the tags we picked out during the initial examination occurs, to check whether both tags appear often enough, and in similar enough numbers, for a meaningful comparison.

In [29]:
tag_counts = tags['tag'].value_counts()
tag_counts["so bad it's good"]

231

In [30]:
tag_counts = tags['tag'].value_counts()
tag_counts["Adam Sandler"]

298

It seems that I have stumbled upon two tags worth comparing, given the tongue-in-cheek nature of the competition and the similar number of tagged values in both categories.

Now we will filter the tags down to the rows that include our selected categories.

In [31]:
is_so_bad = tags['tag'].str.contains("so bad it's good")

tags[is_so_bad][5:15]

Unnamed: 0,userId,movieId,tag
12138,4496,6995,so bad it's good
12444,4496,74754,so bad it's good
12974,4496,134170,so bad it's good
13928,5127,76,so bad it's good
15054,5655,26157,so bad it's good
27947,12009,74754,so bad it's good
35467,12788,2164,so bad it's good
35531,12788,100729,so bad it's good
52581,20820,76,so bad it's good
65102,25127,47810,so bad it's good


In [32]:
is_Adam_Sandler = tags['tag'].str.contains('Adam Sandler')

tags[is_Adam_Sandler][5:15]

Unnamed: 0,userId,movieId,tag
24326,10279,59900,Adam Sandler
25710,11108,5673,Adam Sandler
37828,13273,59900,Adam Sandler
43578,15800,66509,Adam Sandler
44893,16802,2694,Adam Sandler
47092,18471,45672,Adam Sandler
47135,18471,59900,Adam Sandler
47142,18471,65088,Adam Sandler
47223,18471,111617,Adam Sandler
47351,18480,3979,Adam Sandler
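
One caveat about the filters above: `str.contains` matches substrings, while `value_counts()` counts exact tag values, so the two can disagree when a longer tag embeds the phrase. The sketch below shows the difference on made-up data; `toy_tags` and its tag strings are assumptions for illustration only.

```python
import pandas as pd

# Invented tags: three contain the phrase, only one equals it exactly.
toy_tags = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "tag": [
        "Adam Sandler",
        "Adam Sandler movie",
        "not Adam Sandler",
        "so bad it's good",
    ],
})

# Substring match: keeps every tag that contains the text anywhere.
# regex=False treats the pattern as literal text, not a regular expression.
contains_mask = toy_tags["tag"].str.contains("Adam Sandler", regex=False)

# Exact match: keeps only tags equal to the text, as value_counts() counts them.
exact_mask = toy_tags["tag"] == "Adam Sandler"

print(contains_mask.sum(), exact_mask.sum())
```

If the filtered row count needs to equal the `value_counts()` figure, the exact comparison is the safer choice.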


## GroupBy

Combine movie ratings and create averages.

In [33]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

Unnamed: 0_level_0,movieId
rating,Unnamed: 1_level_1
0.5,404897
1.0,843310
1.5,403607
2.0,1762440
2.5,1255358
3.0,5256722
3.5,3116213
4.0,6998802
4.5,2170441
5.0,3812499


In [34]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.tail()

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
176267,4.0
176269,3.5
176271,5.0
176273,1.0
176275,3.0


In [35]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
1,66008
2,26060
3,15497
4,2981
5,15258


In [36]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()

Unnamed: 0_level_0,rating
movieId,Unnamed: 1_level_1
176267,1
176269,1
176271,1
176273,1
176275,1
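
The mean and count computed in separate cells above can also be produced in one pass. The following is a sketch under stated assumptions: `toy_ratings` is invented data, and `min_ratings` is a hypothetical threshold that the original analysis does not apply.

```python
import pandas as pd

# Invented ratings: movie 1 has three ratings, movie 2 two, movie 3 one.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# Named aggregation: one groupby yields both the average and the rating count.
movie_stats = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(mean_rating="mean", n_ratings="count")
    .reset_index()
)
print(movie_stats)

# Hypothetical cutoff: keep only movies with at least this many ratings.
min_ratings = 2
reliable = movie_stats[movie_stats["n_ratings"] >= min_ratings]
print(reliable)
```

Keeping the count next to the mean is useful because, as the `tail()` output shows, some movies have a single rating, so their "average" is just one user's opinion.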


# Merge DataFrames

In [37]:
tags.head()

Unnamed: 0,userId,movieId,tag
0,1,318,narrated
1,20,4306,Dreamworks
2,20,89302,England
3,20,89302,espionage
4,20,89302,jazz


In [38]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [39]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0
3,1,1221,5.0
4,1,1246,5.0


In [40]:
del ratings['userId']

In [41]:
ratings.head()

Unnamed: 0,movieId,rating
0,110,1.0
1,147,4.5
2,858,5.0
3,1221,5.0
4,1246,5.0


In [42]:
movies_tags = pd.merge(movies, tags, how='inner')
movies_tags.head()

Unnamed: 0,movieId,title,genres,userId,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,Pixar
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,time travel
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,computer animation
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,funny
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,Pixar


In [43]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
avg_ratings.head()

Unnamed: 0,movieId,rating
0,1,3.888157
1,2,3.236953
2,3,3.17555
3,4,2.875713
4,5,3.079565


In [44]:
movies_tags_ratings = movies_tags.merge(avg_ratings, on='movieId', how='inner')
movies_tags_ratings.head()

Unnamed: 0,movieId,title,genres,userId,tag,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,Pixar,3.888157
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,time travel,3.888157
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,computer animation,3.888157
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,funny,3.888157
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,Pixar,3.888157
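
The two merges above can be sketched with the join key spelled out. This is an illustrative variant on made-up frames, not the notebook's exact calls: the `toy_*` names and values are assumptions, and `on=` and `validate=` are optional `merge` arguments added here to make the join explicit and to check its cardinality.

```python
import pandas as pd

# Invented inputs: movie 3 has no tags, and one tag points at a missing movie 4.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 2, 4],
    "userId": [7, 8, 7, 9],
    "tag": ["funny", "Pixar", "funny", "orphan"],
})
toy_avg = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 2.5, 5.0]})

# Inner join on movieId: each movie may carry many tags (one-to-many).
movies_tags = pd.merge(
    toy_movies, toy_tags, on="movieId", how="inner", validate="one_to_many"
)

# Attach the per-movie average: many tag rows map to one rating row.
merged = movies_tags.merge(
    toy_avg, on="movieId", how="inner", validate="many_to_one"
)
print(merged)
```

With `how="inner"`, untagged movies and tags for unknown movies both drop out, and every tag row repeats its movie's average rating, as in the output above.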


The DataFrame above (movies_tags_ratings) is the final table that we will later export to .csv.

We will now re-check our selected parameters:

In [45]:
tag_counts_final = movies_tags_ratings['tag'].value_counts()
tag_counts_final["so bad it's funny"]

46

## So bad it's good vs. Adam Sandler

In [46]:
tag_counts_final = movies_tags_ratings['tag'].value_counts()
tag_counts_final["so bad it's good"]

231

In [47]:
tag_counts_final = movies_tags_ratings['tag'].value_counts()
tag_counts_final['Adam Sandler']

298

Now that we've confirmed the value counts in the merged data, we will create and apply new filters for the selected tags to complete our cleaned, merged dataset (movies_tags_ratings).

In [49]:
is_bad_final = movies_tags_ratings['tag'].str.contains("so bad it's good")

movies_tags_ratings[is_bad_final][5:15]

Unnamed: 0,movieId,title,genres,userId,tag,rating
9134,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,65615,so bad it's good,3.307961
9149,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,82695,so bad it's good,3.307961
9152,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,85029,so bad it's good,3.307961
9156,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,89621,so bad it's good,3.307961
9236,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,178616,so bad it's good,3.307961
9327,70,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,235828,so bad it's good,3.307961
9433,76,Screamers (1995),Action|Sci-Fi|Thriller,5127,so bad it's good,3.016247
9439,76,Screamers (1995),Action|Sci-Fi|Thriller,20820,so bad it's good,3.016247
9447,76,Screamers (1995),Action|Sci-Fi|Thriller,54983,so bad it's good,3.016247
9481,76,Screamers (1995),Action|Sci-Fi|Thriller,233747,so bad it's good,3.016247


In [51]:
is_adam_sandler_final = movies_tags_ratings['tag'].str.contains('Adam Sandler')

movies_tags_ratings[is_adam_sandler_final][5:15]

Unnamed: 0,movieId,title,genres,userId,tag,rating
10049,104,Happy Gilmore (1996),Comedy,45986,Adam Sandler,3.396512
10053,104,Happy Gilmore (1996),Comedy,67897,Adam Sandler,3.396512
10056,104,Happy Gilmore (1996),Comedy,79623,Adam Sandler,3.396512
10057,104,Happy Gilmore (1996),Comedy,80121,Adam Sandler,3.396512
10062,104,Happy Gilmore (1996),Comedy,113552,Adam Sandler,3.396512
10064,104,Happy Gilmore (1996),Comedy,123687,Adam Sandler,3.396512
10070,104,Happy Gilmore (1996),Comedy,125199,Adam Sandler,3.396512
10074,104,Happy Gilmore (1996),Comedy,150767,Adam Sandler,3.396512
10079,104,Happy Gilmore (1996),Comedy,167239,Adam Sandler,3.396512
10082,104,Happy Gilmore (1996),Comedy,171007,Adam Sandler,3.396512


In [52]:
movies_tags_ratings.head()

Unnamed: 0,movieId,title,genres,userId,tag,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,Pixar,3.888157
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1250,time travel,3.888157
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,computer animation,3.888157
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,funny,3.888157
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1652,Pixar,3.888157


# Create CSV files to import with plot.py

In [54]:
movies_tags_ratings.to_csv('out.csv', sep=',')
movies_tags_ratings[is_bad_final].to_csv('notfunny.csv', sep=',')
movies_tags_ratings[is_adam_sandler_final].to_csv('adamsandler.csv', sep=',')
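
As a rough sketch of the comparison these files feed into, the average rating per tag can be computed directly from a frame shaped like `movies_tags_ratings`. The data below is invented, and the two averaging choices are assumptions about how one might run the comparison, not what plot.py necessarily does.

```python
import pandas as pd

# Invented rows shaped like movies_tags_ratings: one row per (movie, user, tag),
# each carrying its movie's average rating.
toy = pd.DataFrame({
    "movieId": [70, 70, 70, 76, 104, 104, 200],
    "tag": ["so bad it's good"] * 4 + ["Adam Sandler"] * 3,
    "rating": [3.0, 3.0, 3.0, 2.0, 3.5, 3.5, 2.0],
})

# Row-weighted mean: a movie tagged by many users counts many times.
row_weighted = toy.groupby("tag")["rating"].mean()

# Per-movie mean: count each movie once per tag before averaging.
per_movie = (
    toy.drop_duplicates(["tag", "movieId"])
    .groupby("tag")["rating"]
    .mean()
)

print(row_weighted)
print(per_movie)
```

The two numbers can differ noticeably when a few popular movies attract most of a tag's uses, so it is worth deciding which one the bar chart should show. Note also that `to_csv` writes the row index by default; passing `index=False` avoids an extra unnamed column when the file is read back.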
