# ETL Project

## Finding Data:

DATA SOURCE: https://www.kaggle.com/datasnaek/youtube-new/data <br/>
Utilizing: <br/>
3 csv files with Video Information (Canada, US, and Britain) <br/>
3 json files with Category Assignment (Canada, US, and Britain) <br/>

## Data Cleanup & Analysis

Plan and document the following:
* The sources of data that you will extract from.
* The type of transformation needed for this data (cleaning, joining, filtering, aggregating, etc).
* The type of final production database to load the data into (relational or non-relational).
* The final tables or collections that will be used in the production database.

You will be required to submit a final technical report with the above information and steps required to reproduce your ETL process.

## Project Report:

Submit a Final Report that describes the following:
* Extract: your original data sources and how the data was formatted (CSV, JSON, pgAdmin 4, etc).
* Transform: what data cleaning or transformation was required.
* Load: the final database, tables/collections, and why this was chosen.

Please upload the report to GitHub and submit a link to Bootcampspot.

In [43]:
import os
import pandas as pd
import json
import requests
import numpy as np

# pd.options.display.max_rows = 3000

from pandas import json_normalize  # pandas.io.json.json_normalize was deprecated in pandas 1.0
from sqlalchemy import create_engine

# EXTRACT

In [2]:
#reading in Canada csv video info
ca_file = os.path.join("data", "CAvideos.csv")
CA_df = pd.read_csv(ca_file)

In [3]:
#reading in Canada json category keys
json_CA = os.path.join("data", "CA_category_id.json")
category_CA_df = pd.read_json(json_CA)

In [4]:
#reading in Great Britain csv video info
gb_file = os.path.join("data", "GBvideos.csv")
GB_df = pd.read_csv(gb_file)

In [5]:
#reading in Great Britain json category keys
json_GB = os.path.join("data", "GB_category_id.json")
category_GB_df = pd.read_json(json_GB)

In [6]:
#reading in United States csv video info
us_file = os.path.join("data", "USvideos.csv")
US_df = pd.read_csv(us_file)

In [7]:
#reading in United States json category keys
json_US = os.path.join("data", "US_category_id.json")
category_US_df = pd.read_json(json_US)
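The six reads above follow one naming pattern, so they could also be collected in a loop. The sketch below assumes the same `data` folder layout; the helper name `read_country_files` and the returned dictionary are illustrative and not part of the original notebook.

```python
import os
import pandas as pd

# Country codes used in the file names (CAvideos.csv, CA_category_id.json, ...)
COUNTRIES = ["CA", "GB", "US"]

def read_country_files(data_dir="data", countries=COUNTRIES):
    """Return {country_code: (videos_df, categories_df)} for each country."""
    frames = {}
    for code in countries:
        videos = pd.read_csv(os.path.join(data_dir, f"{code}videos.csv"))
        categories = pd.read_json(os.path.join(data_dir, f"{code}_category_id.json"))
        frames[code] = (videos, categories)
    return frames
```

With the files in place, `CA_df, category_CA_df = read_country_files()["CA"]` would give the same two frames as the cells above.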

# TRANSFORM

In [8]:
#clean up Canada category df with json_normalize (pulls dictionary items into their own column)
#drop static YouTube info not needed for video database, rename column and cast category_id to number for merge
CA_category_df = json_normalize(category_CA_df['items'])
CA_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)
CA_category_df.rename(columns={'id': 'category_id', 'snippet.title': 'category_name'}, inplace=True)
CA_categories = CA_category_df.astype({'category_id': 'int64'})

In [9]:
#adding a column to Canada video df to define which country info came from after upcoming concat
CA_df.insert(1,"country", "CA") 

In [10]:
#merge category_names into CA_df
CAfull_df = CA_df.merge(CA_categories, how='left', on="category_id")

In [11]:
# CAfull_df.count()     # video_id: 40881 - category_name: 40807  [NaN: 74]

In [12]:
#clean up Great Britain category df with json_normalize (pulls dictionary items into their own column)
#drop static YouTube info not needed for video database, rename column and cast category_id to number for merge
GB_category_df = json_normalize(category_GB_df['items'])
GB_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)
GB_category_df.rename(columns={'id': 'category_id', 'snippet.title': 'category_name'}, inplace=True)
GB_categories = GB_category_df.astype({'category_id': 'int64'})

In [13]:
#adding a column to Great Britain video df to define which country info came from after upcoming concat
GB_df.insert(1,"country", "GB")

In [14]:
#merge category_names into GB_df
GBfull_df = GB_df.merge(GB_categories, how='left', on="category_id")

In [15]:
# GBfull_df.count()     # video_id: 38916 - category_name: 38826  [NaN: 90]

In [16]:
#clean up United States category df with json_normalize (pulls dictionary items into their own column)
#drop static YouTube info not needed for video database, rename column and cast category_id to number for merge
US_category_df = json_normalize(category_US_df['items'])
US_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)
US_category_df.rename(columns={'id': 'category_id', 'snippet.title': 'category_name'}, inplace=True)
US_categories = US_category_df.astype({'category_id': 'int64'})
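The category cleanup is identical for all three countries, so it could be wrapped in one function. This is a minimal sketch: `build_categories` is a hypothetical helper, the two sample items are made up to mirror the shape of the `items` list, and it keeps the two needed columns rather than dropping the other four.

```python
import pandas as pd

def build_categories(items):
    """Flatten a list of category dicts into category_id / category_name."""
    flat = pd.json_normalize(items)
    flat = flat.rename(columns={"id": "category_id",
                                "snippet.title": "category_name"})
    # Keep only the columns needed for the merge; cast the id for joining
    flat = flat[["category_id", "category_name"]]
    return flat.astype({"category_id": "int64"})

# Hypothetical sample shaped like the 'items' entries in the category files
sample_items = [
    {"kind": "youtube#videoCategory", "etag": "abc", "id": "1",
     "snippet": {"channelId": "UC123", "title": "Film & Animation",
                 "assignable": True}},
    {"kind": "youtube#videoCategory", "etag": "def", "id": "10",
     "snippet": {"channelId": "UC123", "title": "Music",
                 "assignable": True}},
]
sample_categories = build_categories(sample_items)
```

In the notebook this would be called as `build_categories(category_US_df['items'].tolist())`, and likewise for CA and GB.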

In [17]:
#adding a column to United States video df to define which country info came from after upcoming concat
US_df.insert(1,"country", "US")

In [18]:
#merge category_names into US_df
USfull_df = US_df.merge(US_categories, how='left', on="category_id")

In [19]:
# USfull_df.count()     # video_id: 40949 - category_name: 40949  [NaN: 0]
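To see which `category_id` values are behind the NaN counts noted above, the merge can be run with `indicator=True`. The frames below are small made-up stand-ins for a video frame and a category frame.

```python
import pandas as pd

# Made-up stand-ins for a country's video frame and category frame
videos = pd.DataFrame({"video_id": ["a", "b", "c"],
                       "category_id": [10, 29, 29]})
categories = pd.DataFrame({"category_id": [10, 24],
                           "category_name": ["Music", "Entertainment"]})

# indicator=True adds a '_merge' column marking rows with no category match
merged = videos.merge(categories, how="left", on="category_id", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "category_id"].value_counts()
```

`unmatched` then lists each category_id that had no name in the JSON file, with the number of affected rows.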

In [20]:
#pull only the most recent stats per video (for each CA, GB, and US df)
#1> add MaxDate col: take most recent trending_date for each video_id and assign every copy of that video_id
#2> filter the df to only the videos where trending_date and MaxDate are the same

CAfull_df['MaxDate'] = CAfull_df.groupby('video_id').trending_date.transform('max') # CA_df COUNT: 40881(max)
final_CA_df = CAfull_df[CAfull_df['MaxDate'] == CAfull_df['trending_date']] # final_CA_df COUNT: 24427(max)

GBfull_df['MaxDate'] = GBfull_df.groupby('video_id').trending_date.transform('max') # GB_df COUNT: 38916(max)
final_GB_df = GBfull_df[GBfull_df['MaxDate'] == GBfull_df['trending_date']] # final_GB_df COUNT: 3300(max)

USfull_df['MaxDate'] = USfull_df.groupby('video_id').trending_date.transform('max') # US_df COUNT: 40949(max)
final_US_df = USfull_df[USfull_df['MaxDate'] == USfull_df['trending_date']] # final_US_df COUNT: 6354(max)
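One thing to check in the step above: `trending_date` values such as `17.14.11` are strings in YY.DD.MM order, so `max` on the raw strings compares text and may not return the latest calendar date. The sketch below uses made-up rows to show the difference when the column is parsed first; the counts recorded above came from the string comparison and would likely change.

```python
import pandas as pd

# Made-up rows: video "a" trends on 30 Nov 2017 and again on 1 Dec 2017
sample = pd.DataFrame({
    "video_id": ["a", "a", "b"],
    "trending_date": ["17.30.11", "17.01.12", "17.14.11"],
    "views": [100, 250, 40],
})

# Text comparison ranks "17.30.11" above "17.01.12"
string_max = sample.groupby("video_id")["trending_date"].transform("max")

# Parse YY.DD.MM into real dates, then take the per-video maximum
parsed = pd.to_datetime(sample["trending_date"], format="%y.%d.%m")
sample["MaxDate"] = parsed.groupby(sample["video_id"]).transform("max")
latest = sample[parsed == sample["MaxDate"]]
```

Here `latest` keeps the 1 Dec row for video "a", while the string-based maximum would have kept the 30 Nov row.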

In [21]:
full_frame = pd.concat([final_CA_df, final_GB_df, final_US_df], ignore_index=True) # full_frame COUNT: 34081(max)

In [22]:
# FILL NaN VALUES: 'description' [filling 1116 NaN descriptions]
full_frame['description'] = full_frame['description'].fillna("No description provided.")

In [33]:
# full_frame.count()     # video_id: 34081 - category_name: 34026  [NaN: 55 (others were dropped in the MaxDate filter)]

In [34]:
# FILL NaN VALUES: 'category_name'
# unsure why none of the json sets contained the category entry for 'Nonprofits & Activism' (category_id 29, found via Google)
# noting this category_id for future use (a stricter fill would only touch rows where category_id == 29)

In [35]:
# get index list of all category_id == 29 rows


# cat_id_29_idx = full_frame.index[full_frame['category_id'] == 29]
# cat_id_29_idx

In [36]:
#test frame based on category_id == 29
#run before and after setting np.array as series to replace NaN in category_name with 'Nonprofits...'


# validator = full_frame.loc[cat_id_29_idx]
# validator

In [37]:
cat_name_full_set = np.where(pd.isnull(full_frame.category_name), "Nonprofits & Activism", full_frame.category_name)

In [38]:
full_frame['category_name'] = pd.Series(cat_name_full_set)
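The `np.where` call above fills every missing `category_name`, whatever the `category_id`. If the fill should be limited to category_id 29, as the earlier comment suggests, a boolean mask is one option. The frame and the extra id 43 below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: two rows with category_id 29 and one other unmatched id
frame = pd.DataFrame({
    "category_id": [29, 10, 29, 43],
    "category_name": [np.nan, "Music", np.nan, np.nan],
})

# Fill only rows that are both missing a name and have category_id 29
is_missing_29 = frame["category_name"].isna() & (frame["category_id"] == 29)
frame.loc[is_missing_29, "category_name"] = "Nonprofits & Activism"
```

Any other unmatched id stays NaN, so it remains visible in a later `count()` check.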

In [40]:
#drop columns not used in this project
full_frame.drop(['thumbnail_link', 
                 'ratings_disabled', 
                 'video_error_or_removed', 
                 'MaxDate'] , axis=1, inplace=True)

Unnamed: 0,video_id,country,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,comments_disabled,description,category_name
0,0yIWz1XEeyc,CA,17.14.11,Jake Paul Says Alissa Violet CHEATED with LOGA...,DramaAlert,25,2017-11-13T07:37:51.000Z,"#DramaAlert|""Drama""|""Alert""|""DramaAlert""|""keem...",1309699,103755,4613,12143,False,‚ñ∫ Follow for News! - https://twitter.com/KEEMS...,News & Politics
1,FyZMnhUtLfE,CA,17.14.11,ÁåéÂú∫ | Game Of Hunting 12„ÄêTVÁâà„ÄëÔºàËÉ°Ê≠å„ÄÅÂºµÂòâË≠Ø„ÄÅÁ•ñÂ≥∞Á≠â‰∏ªÊºîÔºâ,Â§ßÂäáÁç®Êí≠,1,2017-11-12T16:00:01.000Z,"ÈõªË¶ñÂäá|""Â§ßÈô∏ÈõªË¶ñÂäá""|""ÁåéÂú∫""|""ËÅåÂú∫""|""ÂïÜÊàò""|""Áà±ÊÉÖ""|""ÈÉΩÂ∏Ç""|""ËÉ°Ê≠å""|""ÈôàÈæô""...",158815,218,30,186,False,Thanks for watching the drama! Help more peopl...,Film & Animation
2,7MxiQ4v0EnE,CA,17.14.11,Daang ( Full Video ) | Mankirt Aulakh | Sukh S...,Speed Records,10,2017-11-11T16:41:15.000Z,"punjabi songs|""punjabi bhangra""|""punjabi music...",5718766,127477,7134,8063,False,Song - Daang\nSinger - Mankirt Aulakh\nFaceboo...,Music
3,gifPYwArCVQ,CA,17.14.11,Fake Pet Smart Employee Prank!,NELK,23,2017-11-13T01:30:01.000Z,"prank|""pranks""|""nelk""|""nelkfilmz""|""nelkfilms""",557883,44558,621,9619,False,3 Days left to cop NELK merch: https://nelk.ca...,Comedy
4,8NHA23f7LvU,CA,17.14.11,Jason Momoa Wows Hugh Grant With Some Dothraki...,The Graham Norton Show,24,2017-11-10T19:06:23.000Z,"Graham Norton|""Graham Norton Show Official""|""E...",1496225,16116,236,605,False,I think Sarah Millican was very excited for th...,Entertainment


In [41]:
#reorganize columns
full_frame = full_frame[['video_id',
                         'title', 
                         'channel_title', 
                         'views', 
                         'likes', 
                         'dislikes', 
                         'comments_disabled', 
                         'comment_count', 
                         'description', 
                         'tags', 
                         'category_id', 
                         'category_name', 
                         'publish_time', 
                         'trending_date', 
                         'country']]

In [42]:
full_frame

Unnamed: 0,video_id,title,channel_title,views,likes,dislikes,comments_disabled,comment_count,description,tags,category_id,category_name,publish_time,trending_date,country
0,0yIWz1XEeyc,Jake Paul Says Alissa Violet CHEATED with LOGA...,DramaAlert,1309699,103755,4613,False,12143,‚ñ∫ Follow for News! - https://twitter.com/KEEMS...,"#DramaAlert|""Drama""|""Alert""|""DramaAlert""|""keem...",25,News & Politics,2017-11-13T07:37:51.000Z,17.14.11,CA
1,FyZMnhUtLfE,ÁåéÂú∫ | Game Of Hunting 12„ÄêTVÁâà„ÄëÔºàËÉ°Ê≠å„ÄÅÂºµÂòâË≠Ø„ÄÅÁ•ñÂ≥∞Á≠â‰∏ªÊºîÔºâ,Â§ßÂäáÁç®Êí≠,158815,218,30,False,186,Thanks for watching the drama! Help more peopl...,"ÈõªË¶ñÂäá|""Â§ßÈô∏ÈõªË¶ñÂäá""|""ÁåéÂú∫""|""ËÅåÂú∫""|""ÂïÜÊàò""|""Áà±ÊÉÖ""|""ÈÉΩÂ∏Ç""|""ËÉ°Ê≠å""|""ÈôàÈæô""...",1,Film & Animation,2017-11-12T16:00:01.000Z,17.14.11,CA
2,7MxiQ4v0EnE,Daang ( Full Video ) | Mankirt Aulakh | Sukh S...,Speed Records,5718766,127477,7134,False,8063,Song - Daang\nSinger - Mankirt Aulakh\nFaceboo...,"punjabi songs|""punjabi bhangra""|""punjabi music...",10,Music,2017-11-11T16:41:15.000Z,17.14.11,CA
3,gifPYwArCVQ,Fake Pet Smart Employee Prank!,NELK,557883,44558,621,False,9619,3 Days left to cop NELK merch: https://nelk.ca...,"prank|""pranks""|""nelk""|""nelkfilmz""|""nelkfilms""",23,Comedy,2017-11-13T01:30:01.000Z,17.14.11,CA
4,8NHA23f7LvU,Jason Momoa Wows Hugh Grant With Some Dothraki...,The Graham Norton Show,1496225,16116,236,False,605,I think Sarah Millican was very excited for th...,"Graham Norton|""Graham Norton Show Official""|""E...",24,Entertainment,2017-11-10T19:06:23.000Z,17.14.11,CA
5,fy-CuCzaPp8,Rooster Teeth Animated Adventures - Drunk Baby...,Rooster Teeth,308568,19541,70,False,495,Miles gets stuck at work one night watching ov...,"Rooster Teeth|""RT""|""animation""|""television""|""f...",1,Film & Animation,2017-11-13T14:00:03.000Z,17.14.11,CA
6,puqaWrEC7tY,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,343168,10172,666,False,2146,Today we find out if Link is a Nickelback amat...,"rhett and link|""gmm""|""good mythical morning""|""...",24,Entertainment,2017-11-13T11:00:04.000Z,17.14.11,CA
7,ZhhXLMbZ1rQ,DO COLLEGE KIDS KNOW 80s MUSIC? #8 (REACT: Do ...,REACT,549374,16832,248,False,3579,SUBSCRIBE THEN HIT THE 🔔! New Videos 12pm PT o...,"80s music|""80s songs""|""madonna""|""DO COLLEGE KI...",24,Entertainment,2017-11-12T20:00:01.000Z,17.14.11,CA
8,aVTAU_4i9AY,Throwing Things Into A Fan!,REKT,370827,12150,325,False,2352,Destroying Unbreakable Glasses! ➡ https://www....,"Industrial Fan|""Industrial Fan Destruction""|""T...",24,Entertainment,2017-11-12T23:00:00.000Z,17.14.11,CA
9,m-nZmgHWoEw,Mythical Dog Party,Good Mythical MORE,116676,4324,136,False,618,We're hanging with some of the dogs of the Myt...,"rhett and link|""good mythical more""|""rhett and...",24,Entertainment,2017-11-13T11:00:06.000Z,17.14.11,CA


# do we want to slice publish time down to just the date to match trending_date?
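If the answer is yes, one possible approach is to parse both columns and keep only the date part. This sketch uses two sample rows with values taken from the output above; the `publish_date` column name is an assumption.

```python
import pandas as pd

sample = pd.DataFrame({
    "publish_time": ["2017-11-13T07:37:51.000Z", "2017-11-10T19:06:23.000Z"],
    "trending_date": ["17.14.11", "17.14.11"],
})

# publish_time is an ISO 8601 UTC timestamp; keep only the calendar date
sample["publish_date"] = pd.to_datetime(sample["publish_time"], utc=True).dt.date

# trending_date is a YY.DD.MM string; parse it into the same date type
sample["trending_date"] = pd.to_datetime(
    sample["trending_date"], format="%y.%d.%m"
).dt.date
```

Both columns then hold plain `datetime.date` values and can be compared directly.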

# LOAD

FULL DATAFRAME +
* top 5 per country
* average rank by category overall
* average rank by category by country
* number of videos per category
* number of videos per category by country
* average number of views of top 10 overall
* average number of views of top 10 by country
* average number of views overall by country
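A rough sketch of how the full frame and two of the summaries listed above might be built and loaded. It assumes an in-memory SQLite connection as a stand-in for the production database (the notebook imports `create_engine`, so a SQLAlchemy engine could be passed to `to_sql` instead); the table names and the sample rows are made up.

```python
import sqlite3
import pandas as pd

# Made-up rows standing in for full_frame
frame = pd.DataFrame({
    "video_id": ["a", "b", "c", "d"],
    "country": ["CA", "CA", "GB", "US"],
    "category_name": ["Music", "Comedy", "Music", "Music"],
    "views": [500, 100, 300, 900],
})

# number of videos per category by country
videos_per_category = (
    frame.groupby(["country", "category_name"]).size()
         .reset_index(name="video_count")
)

# average number of views overall by country
avg_views_by_country = (
    frame.groupby("country")["views"].mean().reset_index(name="avg_views")
)

# Load each frame into its own table (hypothetical table names)
conn = sqlite3.connect(":memory:")
frame.to_sql("videos", conn, index=False, if_exists="replace")
videos_per_category.to_sql("videos_per_category", conn, index=False,
                           if_exists="replace")
avg_views_by_country.to_sql("avg_views_by_country", conn, index=False,
                            if_exists="replace")

# Read back a row count to confirm the load
loaded = pd.read_sql("SELECT COUNT(*) AS n FROM videos", conn)
```

The remaining summaries in the list could follow the same groupby-then-`to_sql` pattern once "rank" is defined.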