# 600_YouTube_merge


## Purpose

In this notebook we merge our YouTube trailer dataset with our movie industry dataset and make the final preparations for analysis.

## Datasets

 - Input: movies.pkl, YouTubeTrailers.csv
 - Output: movieTrailers.pkl

# Loading the Data

In [383]:
import os.path
import json  # for decoding a JSON response
import numpy as np  # needed for the np.int64 casts below
import pandas as pd


In [384]:
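# Check that both input files listed under Datasets are present
if not os.path.exists("movies.pkl") or not os.path.exists("YouTubeTrailers.csv"):
    print("Missing dataset file")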


In [385]:
movies = pd.read_pickle('movies.pkl') # Loading in the dataset

In [386]:
trailerData  = pd.read_csv('YouTubeTrailers.csv',  encoding='latin-1')

In [387]:
trailerData.head()

Unnamed: 0,Title,Published,Views,Likes,Dislikes,Comments
0,Good Will Hunting,2011,1470725,3475,79,133
1,Warrior,2011,4780270,11044,385,999
2,Being Elmo,2011,441364,4152,68,337
3,Senna,2011,435871,1520,30,203
4,Tomboy,2011,884320,5444,168,457


In [388]:
trailerData.shape

(4297, 6)

# Merging the Datasets

Both of our datasets are now clean, so we can merge them straight away, matching the name column of the movie dataset against the Title column of the trailer dataset.

In [397]:
movieTrailers = movies.merge(trailerData, left_on='name', right_on='Title')

In [398]:
movieTrailers

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,...,year,scoreRank,grossRank,HarMean,Title,Published,Views,Likes,Dislikes,Comments
0,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,...,1986,4651.5,6613.0,0.801096,Top Gun,2013,75516,137,38,9
1,0.0,Ministère de la Culture et de la Communication,France,Éric Rohmer,Drama,37455.0,Summer,R,1986-08-29,98,...,1986,6436.5,280.0,0.078453,Summer,2014,133507,1208,45,97
2,0.0,Fons Rademakers Produktie,Netherlands,Fons Rademakers,Drama,203781.0,The Assault,PG,1986-02-06,155,...,1986,5595.5,702.0,0.182758,The Assault,2012,2027519,1918,942,1065
3,0.0,Miramax,USA,Bob Weinstein,Comedy,2669366.0,Playing for Keeps,PG-13,1986-10-03,102,...,1986,159.0,2071.0,0.043041,Playing for Keeps,2012,1827077,2170,196,425
4,35000000.0,Eclectic Pictures,USA,Gabriele Muccino,Comedy,13101142.0,Playing for Keeps,PG-13,2012-12-07,105,...,2012,1552.0,3505.0,0.315389,Playing for Keeps,2012,1827077,2170,196,425
5,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,455579,2887,119,297
6,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,3594723,16329,1575,4803
7,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,41334,397,52,65
8,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,41334,397,52,65
9,100000000.0,Metro-Goldwyn-Mayer (MGM),USA,José Padilha,Action,58607007.0,RoboCop,PG-13,2014-02-12,117,...,2014,2716.5,5658.0,0.538321,RoboCop,2013,455579,2887,119,297


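We can see that the merge has worked, as each row now carries columns from both datasets. Titles shared by more than one film or trailer (RoboCop, for example) appear on several rows, which we deal with in the cleaning step.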
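The repeated RoboCop rows come from merging on the title alone: every movie row with a given title is paired with every trailer row with that title. The following is a minimal sketch on toy data. The `movies_demo` and `trailers_demo` frames are hypothetical stand-ins, not the real datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for the movie and trailer datasets.
movies_demo = pd.DataFrame({
    "name": ["RoboCop", "RoboCop", "Top Gun"],
    "year": [1987, 2014, 1986],
})
trailers_demo = pd.DataFrame({
    "Title": ["RoboCop", "RoboCop", "Top Gun"],
    "Published": [2013, 2013, 2013],
    "Views": [455579, 3594723, 75516],
})

# The default inner merge pairs each movie row with every trailer row of the same title.
merged_demo = movies_demo.merge(trailers_demo, left_on="name", right_on="Title")

# Count how many merged rows each title produced.
rows_per_title = merged_demo.groupby("name").size()
print(rows_per_title)
```

Two RoboCop films and two RoboCop trailers give four merged rows, which is why the merged dataset needs the remake filter applied later.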



# Cleaning the Data

We now want to remove any rows where Views, Likes, Dislikes or Comments is 0, as these would affect the analysis later.

In [399]:
# Keep only the rows where all four engagement columns are non-zero
engagementCols = ['Views', 'Likes', 'Dislikes', 'Comments']
movieTrailers = movieTrailers[(movieTrailers[engagementCols] != 0).all(axis=1)]

In [400]:
movieTrailers

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,...,year,scoreRank,grossRank,HarMean,Title,Published,Views,Likes,Dislikes,Comments
0,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,...,1986,4651.5,6613.0,0.801096,Top Gun,2013,75516,137,38,9
1,0.0,Ministère de la Culture et de la Communication,France,Éric Rohmer,Drama,37455.0,Summer,R,1986-08-29,98,...,1986,6436.5,280.0,0.078453,Summer,2014,133507,1208,45,97
2,0.0,Fons Rademakers Produktie,Netherlands,Fons Rademakers,Drama,203781.0,The Assault,PG,1986-02-06,155,...,1986,5595.5,702.0,0.182758,The Assault,2012,2027519,1918,942,1065
3,0.0,Miramax,USA,Bob Weinstein,Comedy,2669366.0,Playing for Keeps,PG-13,1986-10-03,102,...,1986,159.0,2071.0,0.043041,Playing for Keeps,2012,1827077,2170,196,425
4,35000000.0,Eclectic Pictures,USA,Gabriele Muccino,Comedy,13101142.0,Playing for Keeps,PG-13,2012-12-07,105,...,2012,1552.0,3505.0,0.315389,Playing for Keeps,2012,1827077,2170,196,425
5,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,455579,2887,119,297
6,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,3594723,16329,1575,4803
7,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,41334,397,52,65
8,13000000.0,Orion Pictures,USA,Paul Verhoeven,Action,53424681.0,RoboCop,R,1987-07-17,102,...,1987,5973.0,5527.0,0.842165,RoboCop,2013,41334,397,52,65
9,100000000.0,Metro-Goldwyn-Mayer (MGM),USA,José Padilha,Action,58607007.0,RoboCop,PG-13,2014-02-12,117,...,2014,2716.5,5658.0,0.538321,RoboCop,2013,455579,2887,119,297


Comparing the shape before and after the filter, we can see that there was a single 0 value in one of the columns.
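One way to make the shape comparison explicit is to record the row count before filtering and subtract the count afterwards. This is only a sketch on a hypothetical toy frame (`demo`), not the notebook's real data.

```python
import pandas as pd

engagement_cols = ["Views", "Likes", "Dislikes", "Comments"]

# Hypothetical toy frame; row "B" has a zero in the Dislikes column.
demo = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Views": [100, 200, 300],
    "Likes": [10, 20, 30],
    "Dislikes": [1, 0, 3],
    "Comments": [5, 6, 7],
})

rows_before = demo.shape[0]

# Keep only rows where every engagement column is non-zero.
demo_clean = demo[(demo[engagement_cols] != 0).all(axis=1)]

rows_removed = rows_before - demo_clean.shape[0]
print(f"Rows removed: {rows_removed}")
```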

The next step in cleaning our data is to deal with the movie remakes in the dataset. Some movies share a name but were released at different times, which is a problem because the trailer data would then be matched to the wrong movie. As our trailer data covers the remade movies, those are the films we'll keep.

We first convert our Published column to integers so we can compare it with the release year of the film. Any film that was released before its trailer was published can be dropped, as a trailer must come out before its movie.

In [401]:
movieTrailers.Published = movieTrailers.Published.astype(np.int64)

In [403]:
# Compare the release year of the movie with the publish year of the trailer,
# keeping only the rows where the movie was not released before the trailer
movieTrailers = movieTrailers[movieTrailers['year'] >= movieTrailers['Published']]
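The effect of this filter on a remake can be sketched with a toy frame. The `demo` frame below is hypothetical, and storing Published as strings is an assumption made purely to illustrate the cast.

```python
import pandas as pd

# Hypothetical toy frame: an original film and its remake, both matched to a 2013 trailer.
demo = pd.DataFrame({
    "Title": ["RoboCop", "RoboCop"],
    "year": [1987, 2014],
    "Published": ["2013", "2013"],  # assumed to be strings here, for illustration only
})

# Cast the publish year to integers so it can be compared with the release year.
demo["Published"] = demo["Published"].astype("int64")

# Keep rows where the film was released in or after the trailer's publish year.
kept = demo[demo["year"] >= demo["Published"]]
print(kept)
```

Only the 2014 remake survives, since the 1987 original could not have been promoted by a trailer published in 2013.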


We drop the columns that we won't need to answer our research question, which leaves us with a much clearer and more concise dataset.

In [404]:
movieTrailers = movieTrailers.drop(['company','country','budget','name','director','star','writer','rating','released','runtime'], axis=1)

In [405]:
movieTrailers

Unnamed: 0,genre,gross,score,votes,year,scoreRank,grossRank,HarMean,Title,Published,Views,Likes,Dislikes,Comments
4,Comedy,13101142.0,5.7,25725,2012,1552.0,3505.0,0.315389,Playing for Keeps,2012,1827077,2170,196,425
9,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,455579,2887,119,297
10,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,3594723,16329,1575,4803
11,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,41334,397,52,65
12,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,41334,397,52,65
14,Comedy,54731865.0,6.7,166912,2012,4127.5,5559.0,0.694861,Project X,2012,2469755,6006,393,418
17,Action,34507079.0,5.1,40431,2015,688.5,4871.0,0.176738,Hot Pursuit,2015,9582115,34231,1405,1269
18,Action,34507079.0,5.1,40431,2015,688.5,4871.0,0.176738,Hot Pursuit,2015,1162690,4333,210,148
20,Drama,7597898.0,6.6,32248,2012,3839.0,2951.0,0.489352,Promised Land,2012,516237,863,80,229
28,Animation,270329045.0,7.1,81466,2016,5122.0,6740.0,0.853801,Sing,2016,8829947,41639,2392,5924


We reset the index, as the rows dropped while cleaning the data have left gaps in it, so it is no longer consecutive.

In [406]:
movieTrailers.reset_index(inplace=True,drop=True)

In [407]:
movieTrailers.head()

Unnamed: 0,genre,gross,score,votes,year,scoreRank,grossRank,HarMean,Title,Published,Views,Likes,Dislikes,Comments
0,Comedy,13101142.0,5.7,25725,2012,1552.0,3505.0,0.315389,Playing for Keeps,2012,1827077,2170,196,425
1,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,455579,2887,119,297
2,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,3594723,16329,1575,4803
3,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,41334,397,52,65
4,Action,58607007.0,6.2,193792,2014,2716.5,5658.0,0.538321,RoboCop,2013,41334,397,52,65
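As a small illustration of what the reset does, the sketch below filters a hypothetical toy frame and then renumbers it. Passing `drop=True` discards the old index rather than keeping it as a new column.

```python
import pandas as pd

# Hypothetical toy frame with the default 0..3 index.
demo = pd.DataFrame({"Title": ["a", "b", "c", "d"]})

# Filtering rows leaves gaps in the index.
filtered = demo[demo["Title"].isin(["b", "d"])]
print(list(filtered.index))

# drop=True renumbers from 0 without adding the old index as a column.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```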


We change the types of our Views, Likes, Dislikes and Comments columns to integers, as this will help in the analysis later.

In [408]:
movieTrailers.Views = movieTrailers.Views.astype(np.int64)
movieTrailers.Likes = movieTrailers.Likes.astype(np.int64)
movieTrailers.Dislikes = movieTrailers.Dislikes.astype(np.int64)
movieTrailers.Comments = movieTrailers.Comments.astype(np.int64)
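The same casts can also be written as a single `astype` call with a column-to-type mapping, followed by a check of the resulting dtypes. This is a sketch on a hypothetical toy frame; the original string and float types are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with non-integer engagement columns.
demo = pd.DataFrame({
    "Views": ["100", "200"],   # assumed strings, for illustration
    "Likes": [10.0, 20.0],     # assumed floats, for illustration
})

# Cast several columns in one call by passing a column -> dtype mapping.
demo = demo.astype({"Views": np.int64, "Likes": np.int64})

# Confirm that the casts took effect.
print(demo.dtypes)
```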

Finally, we save the state of our dataset, as we are now ready to begin our analysis.

In [412]:
movieTrailers.to_pickle('movieTrailers.pkl')  # save the dataset to a pickle file
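A quick way to confirm that a pickle file was written correctly is to read it back and compare it with the frame in memory. The sketch below uses a hypothetical toy frame and a temporary directory, so it does not touch the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the final dataset.
demo = pd.DataFrame({"Title": ["Sing"], "Views": [8829947]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    demo.to_pickle(path)             # write the frame to disk
    restored = pd.read_pickle(path)  # read it straight back

# equals() compares values and dtypes of the two frames.
round_trip_ok = restored.equals(demo)
print(round_trip_ok)
```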