# ETL Project

## Finding Data:

DATA SOURCE: https://www.kaggle.com/datasnaek/youtube-new/data <br/>
Utilizing: <br/>
3 CSV files with video information (Canada, US, and Great Britain) <br/>
3 JSON files with category assignments (Canada, US, and Great Britain) <br/>

## Data Cleanup & Analysis

Plan and document the following:
* The sources of data that you will extract from.
* The type of transformation needed for this data (cleaning, joining, filtering, aggregating, etc).
* The type of final production database to load the data into (relational or non-relational).
* The final tables or collections that will be used in the production database.

You will be required to submit a final technical report with the above information and steps required to reproduce your ETL process.

## Project Report:

Submit a Final Report that describes the following:
* Extract: your original data sources and how the data was formatted (CSV, JSON, pgAdmin 4, etc).
* Transform: what data cleaning or transformation was required.
* Load: the final database, tables/collections, and why this was chosen.

Please upload the report to GitHub and submit a link to Bootcampspot.

In [1]:
import os
import pandas as pd
import json
import requests

# pd.options.display.max_rows = 3000

# pandas.io.json.json_normalize is deprecated since pandas 1.0; import from the top level instead
from pandas import json_normalize
from sqlalchemy import create_engine

# EXTRACT

In [2]:
#reading in Canada category keys
json_CA = os.path.join("data", "CA_category_id.json")
category_CA_df = pd.read_json(json_CA)
CA_category_df = json_normalize(category_CA_df['items'])
CA_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)
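
For reference, here is a minimal sketch of what this flattening step produces, using a small inline sample in place of `CA_category_id.json`. The sample values and the `category_title` column name are illustrative assumptions; the nested `items`/`snippet` layout mirrors the columns dropped above.

```python
import pandas as pd

# Inline stand-in for one *_category_id.json file (values are illustrative).
sample = {
    "kind": "youtube#videoCategoryListResponse",
    "etag": "placeholder-etag",
    "items": [
        {"kind": "youtube#videoCategory", "etag": "e1", "id": "1",
         "snippet": {"channelId": "c", "title": "Film & Animation", "assignable": True}},
        {"kind": "youtube#videoCategory", "etag": "e2", "id": "10",
         "snippet": {"channelId": "c", "title": "Music", "assignable": True}},
    ],
}

# json_normalize flattens nested keys into dotted column names.
categories = pd.json_normalize(sample["items"])
categories = categories.drop(
    columns=["etag", "kind", "snippet.assignable", "snippet.channelId"]
)

# Hypothetical tidy-up: a friendlier column name, and an integer id so it can
# be compared with a numeric category_id column (the ids in this sample are strings).
categories = categories.rename(columns={"snippet.title": "category_title"})
categories["id"] = categories["id"].astype(int)
```

After these steps only the category id and its title remain, which is all a later join against the video data would need.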

In [3]:
#reading in Canada csv video info
ca_file = os.path.join("data", "CAvideos.csv")
CA_df = pd.read_csv(ca_file)

In [4]:
#adding a column to define which country info came from
CA_df.insert(1,"country", "CA") 

In [5]:
#reading in Great Britain category keys
json_GB = os.path.join("data", "GB_category_id.json")
category_GB_df = pd.read_json(json_GB)
GB_category_df = json_normalize(category_GB_df['items'])
GB_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)

In [6]:
#reading in Great Britain csv video info
gb_file = os.path.join("data", "GBvideos.csv")
GB_df = pd.read_csv(gb_file)

In [7]:
#adding a column to define which country info came from
GB_df.insert(1,"country", "GB")

In [8]:
#reading in United States category keys
json_US = os.path.join("data", "US_category_id.json")
category_US_df = pd.read_json(json_US)
US_category_df = json_normalize(category_US_df['items'])
US_category_df.drop(['etag', 'kind', 'snippet.assignable', 'snippet.channelId'], axis=1, inplace=True)

In [9]:
#reading in United States csv video info
us_file = os.path.join("data", "USvideos.csv")
US_df = pd.read_csv(us_file)

In [10]:
#adding a column to define which country info came from
US_df.insert(1,"country", "US")
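
The three read-and-tag cells above repeat the same two steps per country. As a sketch only, they could be folded into one helper; `load_videos` is a hypothetical name, and the in-memory CSV below stands in for the real files.

```python
import io
import pandas as pd

def load_videos(source, country_code):
    """Read one country's video CSV and tag every row with its country code."""
    df = pd.read_csv(source)
    # Same position as in the cells above: the second column.
    df.insert(1, "country", country_code)
    return df

# Tiny in-memory CSV standing in for data/CAvideos.csv (illustrative rows).
demo_csv = io.StringIO(
    "video_id,trending_date,views\nabc,17.14.11,100\nxyz,17.15.11,250\n"
)
demo_df = load_videos(demo_csv, "CA")
```

With the real files, the same helper would be called once per country, for example `load_videos(os.path.join("data", "GBvideos.csv"), "GB")`.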

# TRANSFORM

In [11]:
CA_df['MaxDate'] = CA_df.groupby('video_id').trending_date.transform('max')
CA_df.count()

video_id                  40881
country                   40881
trending_date             40881
title                     40881
channel_title             40881
category_id               40881
publish_time              40881
tags                      40881
views                     40881
likes                     40881
dislikes                  40881
comment_count             40881
thumbnail_link            40881
comments_disabled         40881
ratings_disabled          40881
video_error_or_removed    40881
description               39585
MaxDate                   40881
dtype: int64

In [12]:
final_CA_df = CA_df[CA_df['MaxDate'] == CA_df['trending_date']]
final_CA_df.count()

video_id                  24427
country                   24427
trending_date             24427
title                     24427
channel_title             24427
category_id               24427
publish_time              24427
tags                      24427
views                     24427
likes                     24427
dislikes                  24427
comment_count             24427
thumbnail_link            24427
comments_disabled         24427
ratings_disabled          24427
video_error_or_removed    24427
description               23465
MaxDate                   24427
dtype: int64
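
One caveat for the `MaxDate` step: `trending_date` is read as plain text, so `max` compares strings character by character. If the values follow a `YY.DD.MM` pattern (an assumption about this dataset that is worth checking against the CSVs), the string maximum is not necessarily the latest date. The sketch below parses the column first and then keeps each video's most recent row; the sample rows are made up.

```python
import pandas as pd

rows = pd.DataFrame({
    "video_id": ["a", "a", "b"],
    "trending_date": ["17.30.11", "17.01.12", "18.05.01"],  # assumed YY.DD.MM
    "views": [100, 150, 70],
})

# String comparison picks 30 Nov over 1 Dec because "17.30" > "17.01".
string_max = rows.groupby("video_id")["trending_date"].transform("max")

# Parse to real dates, then take the per-video maximum.
rows["trending_dt"] = pd.to_datetime(rows["trending_date"], format="%y.%d.%m")
rows["MaxDate"] = rows.groupby("video_id")["trending_dt"].transform("max")
latest = rows[rows["trending_dt"] == rows["MaxDate"]]
```

Applying this to the real frames may change the row counts printed in this notebook, so the outputs would need to be regenerated.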

In [13]:
GB_df['MaxDate'] = GB_df.groupby('video_id').trending_date.transform('max')
GB_df.count()

video_id                  38916
country                   38916
trending_date             38916
title                     38916
channel_title             38916
category_id               38916
publish_time              38916
tags                      38916
views                     38916
likes                     38916
dislikes                  38916
comment_count             38916
thumbnail_link            38916
comments_disabled         38916
ratings_disabled          38916
video_error_or_removed    38916
description               38304
MaxDate                   38916
dtype: int64

In [14]:
final_GB_df = GB_df[GB_df['MaxDate'] == GB_df['trending_date']]
final_GB_df.count()

video_id                  3300
country                   3300
trending_date             3300
title                     3300
channel_title             3300
category_id               3300
publish_time              3300
tags                      3300
views                     3300
likes                     3300
dislikes                  3300
comment_count             3300
thumbnail_link            3300
comments_disabled         3300
ratings_disabled          3300
video_error_or_removed    3300
description               3244
MaxDate                   3300
dtype: int64

In [15]:
US_df['MaxDate'] = US_df.groupby('video_id').trending_date.transform('max')
US_df.count()

video_id                  40949
country                   40949
trending_date             40949
title                     40949
channel_title             40949
category_id               40949
publish_time              40949
tags                      40949
views                     40949
likes                     40949
dislikes                  40949
comment_count             40949
thumbnail_link            40949
comments_disabled         40949
ratings_disabled          40949
video_error_or_removed    40949
description               40379
MaxDate                   40949
dtype: int64

In [16]:
final_US_df = US_df[US_df['MaxDate'] == US_df['trending_date']]
final_US_df.count()

video_id                  6354
country                   6354
trending_date             6354
title                     6354
channel_title             6354
category_id               6354
publish_time              6354
tags                      6354
views                     6354
likes                     6354
dislikes                  6354
comment_count             6354
thumbnail_link            6354
comments_disabled         6354
ratings_disabled          6354
video_error_or_removed    6354
description               6256
MaxDate                   6354
dtype: int64

In [17]:
full_frame = pd.concat([final_CA_df, final_GB_df, final_US_df], ignore_index=True)
full_frame.count()

video_id                  34081
country                   34081
trending_date             34081
title                     34081
channel_title             34081
category_id               34081
publish_time              34081
tags                      34081
views                     34081
likes                     34081
dislikes                  34081
comment_count             34081
thumbnail_link            34081
comments_disabled         34081
ratings_disabled          34081
video_error_or_removed    34081
description               32965
MaxDate                   34081
dtype: int64
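
The category frames read in the extract step are not joined to the video data in the cells shown. A minimal sketch of that join follows, assuming the three category frames are stacked with a `country` key; the sample rows and the `category_title` name are illustrative.

```python
import pandas as pd

# Stand-in for full_frame (illustrative rows only).
videos = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "country": ["CA", "GB", "US"],
    "category_id": [10, 24, 10],
})

# Stand-ins for the three flattened category frames, stacked with a country key.
categories = pd.DataFrame({
    "country": ["CA", "GB", "US", "US"],
    "id": ["10", "24", "10", "24"],
    "snippet.title": ["Music", "Entertainment", "Music", "Entertainment"],
})

lookup = (
    categories
    .assign(category_id=categories["id"].astype(int))  # match the numeric ids
    .rename(columns={"snippet.title": "category_title"})
    [["country", "category_id", "category_title"]]
)

# A left join keeps every video even if its category is missing from the lookup.
labelled = videos.merge(lookup, on=["country", "category_id"], how="left")
```

Joining on both `country` and `category_id` avoids duplicating rows when the same id appears in more than one country's lookup.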

In [None]:
#drop columns not needed for the project
#NOTE: countries_merged is not defined in any cell shown in this notebook
countries_merged.drop(['tags_x', 'comment_count_x', 'thumbnail_link_x', 'comments_disabled_x', 'ratings_disabled_x','video_error_or_removed_x', 'tags_y', 'comment_count_y', 'thumbnail_link_y', 'comments_disabled_y', 'ratings_disabled_y', 'video_error_or_removed_y',  'tags', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled','video_error_or_removed','publish_time_x','publish_time_y','publish_time'] , axis=1, inplace=True)
countries_merged.head()

In [None]:
#seeing all columns in left merged df
countries_merged.columns

In [None]:
#renaming columns by countries
#these are the columns to be renamed
# 'video_id', 
# CA= 'country_x', 'trending_date_x', 'title_x','channel_title_x', 'category_id_x','views_x', 'likes_x', 'dislikes_x', 'description_x', 
# GB = 'country_y','trending_date_y', 'title_y', 'channel_title_y', 'category_id_y', 'description_y','views_y', 'likes_y', 'dislikes_y',
# US = 'country', 'trending_date', 'title', 'channel_title', 'category_id','views', 'likes', 'dislikes','description'

countries_df=countries_merged.rename(columns={
    'country_x':'Canada','trending_date_x':'trend_date_CA','title_x':'title_CA','channel_title_x':'channel_title_CA','category_id_x':'category_id_CA','views_x':'views_CA', 'likes_x':'likes_CA', 'dislikes_x': 'dislikes_CA', 'description_x':'description_CA',
    'country_y':'Great Britain','trending_date_y':'trend_date_GB', 'title_y':'title_GB', 'channel_title_y':'channel_title_GB', 'category_id_y':'category_id_GB','views_y':'views_GB','likes_y':'likes_GB', 'dislikes_y':'dislikes_GB','description_y':'description_GB',
    'country':'United States', 'trending_date': 'trend_date_US', 'title':'title_US', 'channel_title':'channel_title_US', 'category_id':'category_id_US','views':'views_US', 'likes':'likes_US', 'dislikes':'dislikes_US','description':'description_US'})
countries_df.head()
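
`countries_merged` is not defined in the cells shown here. The `_x`, `_y`, and unsuffixed names in the rename cell match what pandas produces when three frames with the same columns are merged in sequence on `video_id`, so one plausible reconstruction (an assumption, not recovered from the notebook) is two chained left merges. The `tiny` helper and its rows are illustrative.

```python
import pandas as pd

def tiny(country, title):
    # Minimal stand-in for one country's frame.
    return pd.DataFrame({"video_id": ["a"], "country": [country], "title": [title]})

ca, gb, us = tiny("CA", "t1"), tiny("GB", "t2"), tiny("US", "t3")

# First merge: overlapping columns get the default _x (left) and _y (right) suffixes.
# Second merge: the US columns no longer clash, so they keep their plain names.
countries_merged = (
    ca.merge(gb, on="video_id", how="left")
      .merge(us, on="video_id", how="left")
)
```

Passing explicit `suffixes=` to each merge would make the country of every column obvious and remove the need for a separate rename step.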

In [None]:
#US_csv_df is not defined in this notebook; US_df (with MaxDate from In [15]) is assumed here
is_max = US_df['MaxDate'] == US_df['trending_date']
US_max_date = US_df[is_max]
US_max_date.head(1)

In [None]:
US_max_date.sort_values('video_id')

# LOAD
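
The notebook imports `create_engine` but does not include a load cell. As a hedged sketch of the load step, the block below writes a frame to a relational table with `DataFrame.to_sql`. It uses an in-memory SQLite database through the standard-library `sqlite3` module so it runs without a database server; the table name `trending_videos` and the sample rows are hypothetical. With SQLAlchemy, an engine would be passed as `con` in the same way.

```python
import sqlite3
import pandas as pd

# Stand-in for full_frame (illustrative rows only).
frame = pd.DataFrame({
    "video_id": ["a", "b"],
    "country": ["CA", "US"],
    "views": [100, 250],
})

con = sqlite3.connect(":memory:")

# if_exists="replace" makes the load repeatable; index=False skips the row index.
frame.to_sql("trending_videos", con, if_exists="replace", index=False)

# Read the table back to confirm the load.
loaded = pd.read_sql("SELECT video_id, country, views FROM trending_videos", con)
con.close()
```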

Full DataFrame, plus the following summaries:
* top 5 per country
* average rank by category overall
* average rank by category by country
* number of videos per category
* number of videos per category by country
* average number of views of top 10 overall
* average number of views of top 10 by country
* average number of views overall by country
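
A few of the summaries listed above can be sketched with `groupby`. The sample rows and result names are made up, and "top" is taken to mean highest `views`, which is an assumption. The rank-based summaries are left out because no rank column appears in the frames shown.

```python
import pandas as pd

# Stand-in for full_frame (illustrative rows only).
frame = pd.DataFrame({
    "video_id": ["a", "b", "c", "d", "e"],
    "country": ["CA", "CA", "CA", "US", "US"],
    "category_id": [10, 10, 24, 10, 24],
    "views": [500, 300, 900, 700, 100],
})

# Number of videos per category, overall and by country.
per_category = frame.groupby("category_id").size()
per_category_country = frame.groupby(["country", "category_id"]).size()

# Top N videos per country by views (N=2 here; the plan calls for 5).
top_n = (
    frame.sort_values("views", ascending=False)
         .groupby("country")
         .head(2)
)

# Average views overall by country.
avg_views_country = frame.groupby("country")["views"].mean()
```

Each result is a small Series or DataFrame that could be written to its own table alongside the full frame.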