# Final Project
CS696 Big Data

Professor Whitney

Team:

Kristi Werry - 823386935

William Ritchie - 815829203

## Description
This project analyzes YouTube trending videos between the years 2017-2018. We have data from Canada, Germany, France, England, India, Japan, South Korea, Mexico, Russia, and the US. We attempt to analyze these video to find which characteristics could make a video on YouTube trending. We look at relationships between views, likes, dislikes, the number of comments, keywords in the titles, descriptions, and tags, popular publish times, and popular trending dates of videos. Many of these characteristics and more contribute to a video's trending status. 

## Running the Program
To run this project you must have pandas, pyspark, and wordcloud installed. To install wordcloud: pip install wordcloud

The dataset can be downloaded from:

Dataset Kaggle link: https://www.kaggle.com/datasnaek/youtube-new

If you download the dataset from kaggle be sure to unzip and place it in folder called: youtube-new and ensure that this folder  is placed in the same directory as the .ipynb file. Alternatively the projects github also contains the dataset as well. 

GitHub Project link: https://github.com/KristiWerry/YouTube_Trending_Data_Analysis

## Defects
The last section in the notebook runs a calculation for pearsons correlation between a few of the columns.  For some reason after running that section we are un able to run some of the prior sections.  The solution to this is to restart the kernal and clear the output.


## Imports

In [1]:
import numpy as np
import pandas as pd
import sys
from pyspark.sql import Row
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from pyspark.sql.types import BooleanType
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.types import TimestampType
from pyspark.sql.functions import *
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
from wordcloud import WordCloud, STOPWORDS 
from collections import Counter
import matplotlib.pyplot as plt

## Functions

In [2]:
# Loads data from multiple files and a specified schema into a single pyspark data frame
def Load_Datasets(sqlContext, files, schema):
    data = sqlContext.createDataFrame([], schema=schema)
    for file in files:
        temp_data = sqlContext.read.csv(path=file[0], schema=schema, dateFormat="yy.dd.MM", timestampFormat="yyyy-MM-dd")
        temp_data = temp_data.withColumn("country", lit(file[1]))
        data = data.union(temp_data)
    return data
   
    
# takes in a dataframe, a grouping, and a column name and returns 
# the average of that column based on the group
def average_by_group(df, group, column):
    return df.groupBy(group).agg(avg(col(column)).alias("average"))



# All credit for this function goes to this site: https://www.geeksforgeeks.org/generating-word-cloud-python/
# Takes a dataframe column and produces a word cloud of the most frequency words
def wordCloud(df, column, s):
    stopwords = set(STOPWORDS)
    comment_words = ''
    for val in df[column]: 
      
        # typecaste each val to string 
        val = str(val) 
  
        # split the value 
        tokens = val.split(s) 
      
        # Converts each token into lowercase 
        for i in range(len(tokens)): 
            tokens[i] = tokens[i].lower() 
      
        comment_words += " ".join(tokens)+" "
  
    wordcloud = WordCloud(width = 800, height = 800, 
                    background_color ='black', 
                    stopwords = stopwords, 
                    min_font_size = 10).generate(comment_words)

    # plot the WordCloud image                        
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
  
    plt.show()
    

#Takes a dataframe column and find the most common words. Then plots them.
def keyWords(df, column, s, amount):
    temp = Counter(" ".join(df[column].str.lower()).split(s)).most_common(amount)
    pd.DataFrame(temp, columns=['word', 'count']).plot(kind='bar',figsize=[15,10], x='word', y='count') 
    
#Takes a dataframe column, filters the words be getting rid of special characters and words less than length 3
# Then finds the most common words and plots them.
def keyWords_filtered(df, column, amount):
    pattern = '|'.join(['\|', ';', '-', '\(', '\)', '\[', '\]', '&', ' ', '\.', '\,', '\:', '\/', '\'', '\!', '\$', '’', '\\\\'])
    x = df[column].str.lower().str.replace(pattern, ' ').str.replace(r'\b(\w{1,3})\b', '')
    temp = Counter(" ".join(x).split()).most_common(20)
    pd.DataFrame(temp, columns=['word', 'count']).plot(kind='bar',figsize=[15,10], x='word', y='count')
    

# Converts the number range 1-7 into the corresponding week day
def convert_weekday(num):
    if num == 1:
        return "Monday"
    elif num ==2:
        return "Tuesday"
    elif num ==3:
        return "Wednesday"
    elif num ==4:
        return "Thursday"
    elif num ==5:
        return "Friday"
    elif num ==6:
        return "Saturday"
    elif num ==7:
        return "Sunday"
    else:
        return ""

## Importing Datasets

In [None]:
sqlContext = SparkSession.builder.appName("FinalProjectYoutube").getOrCreate();
root_dir = "youtube-new/";

# Set up the schema for reading in the data sets into a dataframe
customSchema = StructType([
  StructField("video_id", StringType(), True),
  StructField("trending_date", DateType(), True),
  StructField("title", StringType(), True),
  StructField("channel_title", StringType(), True),
  StructField("category_id", StringType(), True),
  StructField("publish_time", TimestampType(), True),
  StructField("tags", StringType(), True),
  StructField("views", IntegerType(), True),
  StructField("likes", IntegerType(), True),
  StructField("dislikes", IntegerType(), True),
  StructField("comment_count", IntegerType(), True),
  StructField("thumbnail_link", StringType(), True),
  StructField("comments_disabled", BooleanType(), True),
  StructField("ratings_disabled", BooleanType(), True),
  StructField("video_error_or_removed", BooleanType(), True),
  StructField("description", StringType(), True),
  StructField("country", StringType(), True)
])

# Associate csv files with respective countries
data_files = [
   (root_dir + "CAvideos.csv", "Canada"),
   (root_dir + "DEvideos.csv", "Germany"),
   (root_dir + "FRvideos.csv", "France"),
   (root_dir + "GBvideos.csv", "England"),
   (root_dir + "INvideos.csv", "India"),
   (root_dir + "JPvideos.csv", "Japan"),
   (root_dir + "KRvideos.csv", "South Korea"),
   (root_dir + "MXvideos.csv", "Mexico"),
   (root_dir + "RUvideos.csv", "Russia"),
   (root_dir + "USvideos.csv", "US"),
]

# Read in datasets
youtube_data_df = Load_Datasets(sqlContext, data_files, customSchema)

## Dataset Cleaning 

### Duplicate Rows

In [None]:
# When dropping the duplicate rows based on video_id we found that half of the dataset gets dropped.  So we now look at the 
# duplicate rows to find out more information about what is going on.  Looking at the duplicate rows you can see that the 
# same video can be trending for mulitple days and in different countries causing the same video to exist in multiple rows.
# We decided the "duplicates" were not actually truely duplicate rows, the information provided by these multiple entries 
# is still useful.  We determined a truely duplicate row requires the same: "video_id", "views", "likes", "dislikes",
# "country", and "trending_date" column values.
pandasdf = pd.DataFrame(data=youtube_data_df.take(100000), columns=youtube_data_df.columns)

pandas_df = pandasdf.loc[pandasdf['video_id'].duplicated()]

# We realized that a number of the duplicates have a "\n" for the video id, we decided to filter those rows out since
# they contain no useful information.  We remove these rows later on when dropping NA values from the dataset.
pandas_df = pandas_df[pandas_df['video_id'] != "\\n"]

# The video id value was manually selected by viewing the resulting dataframe from the previous line. You can see
# that this video was trending for multiple days and in mulitple countries, hence why it has mulitple rows in the 
# the dataset.
pandasdf[pandasdf['video_id'] == "n1WpP7iowLc"]

In [None]:
# The previous section determined that video_id was not sufficient in determining a truly duplicate row.  The following
# are a combination of the columns we think determine a truly duplicate row.  Meaning, if two rows have the same value
# in all of the below columns, then those two rows are indeed duplicates. 

# We found that if we included views, likes,or dislikes as part of the criteria, then some duplicate rows still existed
# because their view count differed only by roughly 10 views, but everyting else remained the same, which proves
# problematic in future analysis
compare_duplicate_cols = ["video_id", "country", "trending_date"]
row_count_with_dup = youtube_data_df.count()

# Drop duplicate rows, in the event where there is a duplicate row, we want to keep the one with the greatest number of
# views, so we sort the rows by their views column in descending order.  This is useful because the dropDuplicates
# function keeps the first instance of any pair of duplicate rows.
youtube_data_df = youtube_data_df.orderBy(col("views").desc()).dropDuplicates(compare_duplicate_cols)

# View duplicate row count information
num_duplicates = row_count_with_dup - youtube_data_df.count()
print("Number of duplicates: " + str(num_duplicates))
print("Remaining number of rows: " + str(youtube_data_df.count()))

### NA / Null Values

In [None]:
# The following are a list of the columns that we determined should not contain a null or NA value.  We could have
# done this when specifying the schema when we were importing the data, but we felt it necessary to learn more information
# about columns that contain nulls and NAs.  After playing around with the data we found that many of the values in the
# description column were NA.  We decided that this was okay because some videos might not have a description, thus
# this is why description is not included in the below list
no_nan_cols = ["video_id", "trending_date", "title", 
        "channel_title", "category_id",  
        "tags", "views", "likes", "dislikes", "comment_count", 
        "thumbnail_link", "comments_disabled", "ratings_disabled", 
        "video_error_or_removed", "country"]

row_count_with_nans = youtube_data_df.count()

# Drop nans from these columns
youtube_data_df = youtube_data_df.na.drop(subset=no_nan_cols) 

# View duplicate row count information
num_nans = row_count_with_nans - youtube_data_df.count()
print("Number of null,nans, and na's: " + str(num_nans))
print("Remaining number of rows: " + str(youtube_data_df.count()))

In [None]:
# We decided that it was advantageous to replace all of the nulls in the description column with empty strings, this
# way we do not need to check for nulls later on when working with this column in the dataset
youtube_data_df = youtube_data_df.fillna("", subset="description")

# Check that the description column contains no nulls
youtube_data_df.where(youtube_data_df.description.isNull()).count()

### Filter out #NAME? and #VALUE! from video_id

In [None]:
# In later analysis we discovered that some rows contained either #NAME? or #VALUE! in their video_id column, which
# we decided was unaccaptable due to questioning the validity of the data in that row.
youtube_data_df = youtube_data_df.filter(youtube_data_df.video_id != "#NAME?")
youtube_data_df = youtube_data_df.filter(youtube_data_df.video_id != "#VALUE!")

### Dropping duplicates but keeping max views

In [None]:
# The following filters out the duplicate video entries, leaving only one entry, the one with the highest view count.
# Having a separate dataframe setup this way proves convenient for some future analysis.

# We decided to drop duplicate rows with the same video id and country, and keep the ones with the largest view count
# The orderBy function call and the desc function call reorder the dataset rows in descending order based on the the
# value in the views column.  We wanted this because dropDuplicates function keeps the first instance of any duplicate
# pair, meaning in this case only the duplicate row with the highest view count will remain.
youtube_nodup_df = youtube_data_df.orderBy(col("views").desc()).dropDuplicates(["video_id", "country"])

## Analysis

### Likes, Dislikes, View, and Comment Count Averages 

In [None]:
#Out of the trending videos, average number of views
#change to not duplicate data instead of all_data
average_by_group(youtube_nodup_df, "country", "views").toPandas().plot(kind='bar', title="Average Number of Views per Country", figsize=[15,10], y='average', x='country')

In [None]:
average_by_group(youtube_nodup_df, "country", "likes").toPandas().plot(kind='bar', title="Average Number of Likes per Country", figsize=[15,10], y='average', x='country')

In [None]:
average_by_group(youtube_nodup_df, "country", "dislikes").toPandas().plot(kind='bar', title="Average Number of Dislikes per Country", figsize=[15,10], y='average', x='country')

In [None]:
average_by_group(youtube_nodup_df, "country", "comment_count").toPandas().plot(kind='bar', title="Average Number of Comments per Country", figsize=[15,10], y='average', x='country')

Analysis: We look at the averages of the number of views, likes, dislikes, and comment counts of each video because these characteristics play a large role in whether or not a video trends. If a video has a high number of views, then this means it is very popular and thus trending. As we can see from the data above, the US and England all have very high averages compared to the other countries. Canada is close behind. This could be due to the fact that YouTube is from the US; an English speaking country. These large averages could also be due to the amount of videos posted in these countries.  This shows that England and the US have a higher presence on Youtube than other countries.  It is interesting that England has higher averages than the US considering that the US founded youtube and has a higher overall population.  

### Average Title and Description Length

In [None]:
titlelen = youtube_nodup_df.withColumn('titleLength', length('title'))
average_by_group(titlelen, 'country', 'titleLength').toPandas().plot(kind='bar', title="Average Title Length per Country", figsize=[15,10], y='average', x='country')

In [None]:
desclen = youtube_nodup_df.withColumn('descriptionLength', length('description'))
average_by_group(desclen, 'country', 'descriptionLength').toPandas().plot(kind='bar', title="Average Desceription Length per Country", figsize=[15,10], y='average', x='country')

Analysis: By looking at the average lengths of the titles and descriptions, we wanted to see if there was a correlation between a more descriptive title or longer decription meant that a video was trending. As seen above, the length of a title is pretty even across the board where the average title length is more than 40 characters long. However, the average length of a description for these trending videos are much more chaotic. The average length of a description in South Korea is significatally lower than other countries. Japan also has a lower description average. This could be due to the fact that Korean and Japanese uses characters so the amount of characters they need to use to say the same thing is less. 

### Word Cloud
Prerequisite: pip install wordcloud

Showing the most common words in titles and descriptions for US trending videos in a word cloud.

In [None]:
pd_us_data_df = youtube_data_df.filter(youtube_data_df.country == "US").toPandas()
wordCloud(pd_us_data_df, 'title', ' ')

In [None]:
# This is gunna take a very long time!!!!!!!!!!!!!!!!!!!!!! ~40 mins
#wordCloud(pd_us_data_df, 'description', ' ')

Analysis: Creating a word cloud is a creative way to show the frequency of common words and phases. Here, we can see that the most common phase in a trending title is 'offical video' or 'offical trailer'. The most common words in a trending video's decription is subcribe, twitter, instagram, and follow. This shows that a common hook for viewers are offical videos (ie. an offical movie trailer opposed a fanmade video). The most common words in a description seem the be links promoting other social medias.

### Keyword Frequency Count (US data)

In [None]:
keyWords(pd_us_data_df, "title", " ", 20)

In [None]:
keyWords(pd_us_data_df, "description", " ", 20)

Analysis: After finding the most common 20 words for trending titles and descriptions, we have the most used word is ' in titles and 'the' in descriptions. The rest of the words aren't that helpful in determining common keywords to attract viewers either. Therefore, we decided to filter out special characters and words less than length 3. 

In [None]:
keyWords_filtered(pd_us_data_df, "title", 20)

In [None]:
keyWords_filtered(pd_us_data_df, "description", 20)

Analysis: After filtering out, we can see that the most common word in trending videos is 'official' and 'http' in descriptions. As mentioned before when looking at the word clouds, it seems like viewers perfer to watch offical videos and the description is used for other sites. 

### Looking at tags

In [None]:
# Most common tags
keyWords(pd_us_data_df, "tags", "|", 20)

Analsis: Looking at all trending videos for each country, it seems like the most common tags are 'funny' and 'comedy'. This could mean that many viewers seek out funny content opposed to other genres.

In [None]:
# word cloud of tags
wordCloud(pd_us_data_df, 'tags', '|')

Analysis: As we can see from the tags word cloud, the most common phases are 'music video', 'the voice', and 'to make'. This contradicts the single word phase of funny earlier. As we can see, many of YouTube's trending tags are also music and tutorial based. 

In [None]:
#average number of tags
#first split the tags to get an array of tags per video (no duplicates)
#then add a new column with the count of the number of tags
#then find the average number of tags with the function
data_tag_count = youtube_nodup_df.withColumn("tags", split("tags", "\|")).withColumn("tag_count", size("tags"))
average_by_group(data_tag_count, "country", "tag_count").toPandas().plot(kind='bar',figsize=[15, 10], title="Average Number of Tags per Country", y='average', x='country')

Analysis: By having a tag, that video is associated with a type of content. More tags means more changes of appearing in certain categories. We can see that most trending videos have on average about 15 tags. This is fairly even across all countries; however, US has the highest average number of tags. 

### Average Video Trending Duration

In [None]:
# The following counts the frequency a video_id appears in a country, which we are considering to be the trending 
# duration.  This of course does not account for videos that might re-trend in the same country, we attempted to
# solve this problem but it proved too difficult for pyspark
trend_video_duration_df = youtube_data_df.groupBy('video_id', 'country').count()
trend_video_duration_df.show()
average_by_group(trend_video_duration_df, "country", "count").toPandas().plot(kind='bar',figsize=[15, 10],title="Average Duration of a Trending Video Per Country", y='average', x='country')

In [None]:
# The following is the average trending duration across all videos in all countries
trend_video_duration_df.agg(avg(col("count"))).show()

Analysis: You can see that the average duration time across all countries is roughly 2 days, but looking at the graph of the average duration times per country shows two huge outliers, with an average US video trending approximately 6 days and an average English video trending approximately 12 days.  This shows the popularity of Youtube in those respective countries.  Meaning if you wanted to make a Youtube video that trended for a long duration, you would want to cater to the likes of these two countries.

### Video Publish Day of Week Frequency

In [None]:
# This section calculates what days of the week most trending videos are published on

# This allows convert_weekday to be used on a pyspark column
udf_convert_weekday = udf(convert_weekday, StringType())

# Calculate the frequency of videos published for each day of the week
publish_dow_freq = youtube_nodup_df.withColumn('Number Rep Day of Week', dayofweek(youtube_nodup_df.publish_time)).withColumn('Day of Week',udf_convert_weekday(dayofweek(youtube_nodup_df.publish_time))).groupBy("Day of Week", 'Number Rep Day of Week').count().orderBy('Number Rep Day of Week').drop('Number Rep Day of Week')
publish_dow_freq.show()

# Plot the results
ax = publish_dow_freq.toPandas().plot(kind='bar', figsize=[15,10], y = 'count', x = 'Day of Week', title='Number of Videos Published For Each Day of the Week')
ax.set_xlabel("Day of Week")
ax.set_ylabel("Frequency")

Analysis: You can see that videos are published pretty evenly across the whole week, with Friday and Saturday slightly more popular than the other days.  This is likely because a large population of people in most countries do not work on weekends, meaning the youtubers who posted the trending videos likely were attempting to cater to the free time people have on weekends. 

### Trending Video Day of Week Frequency

In [None]:
# This section calculates what days of the week most trending videos are actually trending

# This allows convert_weekday to be used on a pyspark column
udf_convert_weekday = udf(convert_weekday, StringType())

# Calculate the frequency of videos published for each day of the week
publish_dow_freq = youtube_nodup_df.withColumn('Number Rep Day of Week', dayofweek(youtube_nodup_df.trending_date)).withColumn('Day of Week',udf_convert_weekday(dayofweek(youtube_nodup_df.trending_date))).groupBy("Day of Week", 'Number Rep Day of Week').count().orderBy('Number Rep Day of Week').drop('Number Rep Day of Week')
publish_dow_freq.show()

# Plot the results
ax = publish_dow_freq.toPandas().plot(kind='bar', figsize=[15,10], y = 'count', x = 'Day of Week', title='Number of Videos That Started Trending on a Particular Day of the Week')
ax.set_xlabel("Day of Week")
ax.set_ylabel("Frequency")

Analysis: The above graph shows a very even spread of videos trending on any given day of the week.  It is interesting to note that Wednesday holds the highest number of videos trending.  This might be due to a lag effect where in youtubers post their videos just as the weekend starts, and the majority of the population who do not work weekends begin watching the videos, but the videos gaining popularity doesn't truly hit trending status until a few days later. 

### Total Number of Views by Date

In [None]:
#show total number of views by date
views_by_date = youtube_data_df.groupBy("trending_date").sum().orderBy("trending_date").toPandas()
pd.DataFrame(views_by_date, columns=['trending_date', 'sum(views)']).plot(kind='line',figsize=[15,10], title="Total Number of Views by Date", x='trending_date', y='sum(views)')

Analysis: It is interesting to see the total number of views of trending videos spike a certain times of the year. We can see the total number increased in Feb of 2018 and decline in the middle of may. I hypothesize that this is because younger people mostly contribute to the status of trending videos and during these times, these people are usually busy with tests or vacations instead of watching YouTube videos. This gives a clear insight on what days youtube clients are active and not active.

### Most Commen Keyword Used in Trending Videos Each Week

In [None]:
#organize trending dates by week and concatenate the titles into one cell
us_data_df = youtube_data_df.filter(youtube_data_df.country == "US")
new_df = us_data_df.withColumn("week",date_sub(next_day(col("trending_date"),"sunday"),7)).groupBy("week").agg(concat_ws(", ", collect_list(us_data_df.title)).alias("title")).orderBy("week")
new_df = new_df.toPandas()
new_df

In [None]:
#for all the titles, make them lowercased, get rid of special characters and words less then 3 characters long
pattern = '|'.join(['\|', ';', '-', '\(', '\)', '\[', '\]', '&', ' ', '\.', '\,', '\:', '\/', '\'', '\!', '\$', '’', '\\\\', '\'', '\"'])
new_df["title"] = new_df["title"].str.lower().str.replace(pattern, ' ').str.replace(r'\b(\w{1,3})\b', '')

#get a new df with the frequency counts of each word
freq = new_df["title"].str.split().apply(pd.value_counts)

#create a new df with the highest frequency, the word, and the date
high_freq_data = pd.DataFrame(freq.idxmax(axis=1), columns=["word"])
high_freq_data["value"] = pd.DataFrame(freq.max(axis=1), columns=['value'])
high_freq_data["week"] = new_df["week"]

#plot
high_freq_data.sort_values('week').plot(kind='bar',figsize=[15,10], title="Most Common Key word by Week", x='word', y='value')

In [None]:
high_freq_data #view table

Analysis: The graph above shows the most common keyword used in a trending video and the number of uses per week. As we can see, the most common word for most of the weeks is 'offical'. We saw that this is the most common keyword used in titles overall, but now we can see that it is also the most common word used in trending videos for most weeks. Other common words are the years '2017' and '2018'. As we can see, as the year 2017 ends, videos with 2017 are trending and once 2018 begins, the word 2018 is mostly used in titles. These could be due to the relevance of the years changing. 

### Category with Most Trending Videos (US data)

In [None]:
# Input the categories json as a dictionary
categories={}
data=pd.read_json("youtube-new/US_category_id.json")
for category in data["items"]:
    categories[category["id"]]=category["snippet"]["title"]#it Stores the category id with category name

categories #display categories

In [None]:
# Get the US data as a pandas df
us_data = youtube_nodup_df.filter(youtube_nodup_df.country == "US").toPandas()

# Make the id a string because the dict entries are strings
us_data["category_id"]=us_data["category_id"].astype(str)
# Make a new column with the category string read from the JSON dictionary
us_data["Category"]=us_data["category_id"].map(categories) 

In [None]:
# Group by category and plot 
us_data.groupby(["Category"]).count().plot(kind='bar',figsize=[15,10],title="Categories with Highest Number of Trending Videos", y="video_id")

Analysis: The graph above shows the number of videos trending per category. This shows that most trending videos are categorized under entertainment. This make a lot of sense because most videos can be argued that they are for entermainment. 

### Channel Title With the Most Trending Videos (2017 - 2018)

In [None]:
# Want only one count of each video with the most number of views
singleVideo_data = youtube_data_df.orderBy(col("views").desc()).dropDuplicates(["video_id"])

In [None]:
# Count the number of video each channel published
top_channels = singleVideo_data.groupBy("channel_title").count()

# Organize by channels with the most pubilshed videos within this time
# Take the top 20 and plot
top_channels.orderBy(top_channels['count'].desc()).limit(20).toPandas().plot(kind='bar',figsize=[15,10], title="Top 20 Channels with Highest Number of Trending Videos Posted", x='channel_title')

### Channel Title with the Most Total Views (2017 - 2018)

In [None]:
# Sum each channels total number of views within this time period
top_views_channels = singleVideo_data.groupBy("channel_title").sum().select(['channel_title', 'sum(views)'])

# Plot highest number of views per channel
top_views_channels.orderBy(top_views_channels['sum(views)'].desc()).limit(20).toPandas().plot(kind='bar',figsize=[15,10],title="Top 20 Total Number of Views by Channel", x='channel_title')

Analysis: One graph shows the top 20 channels that posted the most number of videos in this timeframe and the other graph shows the top 20 channels with the most total views. It is interesting to see that the more number of videoes trending does not mean the most total number of views. The channel 'SET India' posted the most amount of videos that were trending, but they don't even show up in the total number of views top 20. This could be due to viewers perfer quality over quantity.

### Average Time Between Publish Date to Trending Date

In [None]:
# We want single video data but with the shortest time between publish and trending times
leastTime_data = youtube_data_df.orderBy(col("views")).dropDuplicates(["video_id"])
date_differences_df = leastTime_data.withColumn("difference", datediff(col("trending_date"), col("publish_time"))).select("difference", "publish_time", "trending_date", "country", "channel_title")

In [None]:
average_by_group(date_differences_df, "country", "difference").toPandas().plot(kind='bar',figsize=[15,10],title="Average Number of Days Between Publish and Trending Date", y='average', x='country')

Analysis: As we can see from this graph, the average number of days between publish date and trending date of US and England videos are much higher than any other country. This could be due to the fact we saw earlier that the US and England contribute the most to a video's trending status and thus it takes longer for a video to break through and earn trending status. 

### Fastest Channels to Trending After Posting a Video

In [None]:
# Group by channel title and take an average of the difference since trending date could be different in different countries
# Take the first 40 channels
avg_day_trending = date_differences_df.groupBy("channel_title").agg(avg('difference')).orderBy("avg(difference)").na.drop().limit(40)
avg_day_trending.toPandas()

#### What about US channels?

In [None]:
# We want single video data but with the shortest time between publish and trending times
leastTime_data = us_data_df.orderBy(col("views")).dropDuplicates(["video_id"])

# Find the difference in publish and trending dates
date_differences_df = leastTime_data.withColumn("difference", datediff(col("trending_date"), col("publish_time"))).select("difference", "publish_time", "trending_date", "country", "channel_title")

# Take the averages of differences and take top 40 channels
avg_day_trending = date_differences_df.groupBy("channel_title").agg(avg('difference')).orderBy("avg(difference)").na.drop().limit(40)
avg_day_trending.toPandas()

Analysis: As we can see from the tables above, many channels have produced a trending video that gained trending status within the same day of publishing. 

### Finding correlation

In [None]:
col = ["views", 'likes', 'dislikes', 'comment_count']
vector_col = "corr_features"

inputdf = sqlContext.createDataFrame(youtube_nodup_df.select(*col).collect(), col)

# convert to vector column first
assembler = VectorAssembler(inputCols=inputdf.columns, outputCol=vector_col)
df_vector = assembler.transform(inputdf).select(vector_col)

# get correlation matrix
matrix = Correlation.corr(df_vector, vector_col)
print(str(matrix.head()[0]))

Analysis: Based on the correlation matrix you can see higher relations between: Number of views and the number of likes, as well as the number of likes and the number of comments.  This means that if a trending video has a large number of views, then that video is likely to have a large number of likes, which in turn means it will likely have a larger number of comments.

## Summary
Based on the analysis, we found that the following conditions are desireable when trying to create a trending video:
1. Market the video to cater to England and the US
2. The title should likely contain the word "official"
3. The description should contain other social media links
4. Posting the video on Friday or Saturday would be most optimal
5. Having at least 15 tags, one of which is the tag: "Funny"
6. The video should aim for the Entertainment Category
7. Quality over Quantity
8. The average length of the title should be approximately: 50 characters
9. The average description length should be approximately: 900 Characters
10. Posting the video in February is Optimal