# Introduction
This notebook is a quick analysis and visualization of the data in the USVideos section. Ultimately, if we see any promising trends in this data, I would like to find a way to predict the number of views a video will get within a particular timeframe, given more information we can mine with the YouTube API.

In [164]:
import numpy as np
import pandas as pd
import re
import json
import datetime

import matplotlib
import matplotlib.pyplot as plt

import plotly
# plotly.plotly moved to the separate chart_studio package in plotly 4 and is not used below
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot 
init_notebook_mode(connected=True)

import nltk 
import string

import sklearn
from sklearn import datasets, linear_model

matplotlib.style.use('ggplot')
matplotlib.rcParams.update({'font.size': 22})

In [2]:
df = pd.read_csv("../input/USvideos.csv")

# Variables and Basic Correlation of Numeric Features
<br/>
First, let's see what we're given: 

In [3]:
print(df.columns)

We can immediately isolate the numeric variables and check whether they are correlated. Let's leave out the `category_id` column for now, since the numbering seems to carry no significant meaning:

In [4]:
df_num = pd.concat([df.views, df.likes, df.dislikes, df.comment_count], axis=1)
num_corr = df_num.corr()
num_corr

`views` and `likes` have a remarkably strong correlation, as we would expect. However, the correlations between `views` and `dislikes` and between `views` and `comment_count`, while still positive, are not nearly as strong. I would have expected similar values.
Instead, note the strong correlation between `dislikes` and `comment_count`. This suggests that people who dislike a video are a reliable source of comments, even more so than people who like it.
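
As a rough cross-check on a correlation table like the one above, we could also look at a rank-based (Spearman) correlation, which is less sensitive to a handful of extremely popular videos. The sketch below runs on a tiny made-up table; `toy` and all of its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the numeric columns of the dataset
toy = pd.DataFrame({
    "views":         [100, 1_000, 10_000, 100_000, 5_000_000],
    "likes":         [5, 40, 600, 3_000, 200_000],
    "dislikes":      [1, 3, 20, 900, 4_000],
    "comment_count": [0, 5, 30, 1_200, 6_000],
})

# Pearson correlation (the default of DataFrame.corr) measures linear association
pearson = toy.corr()

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy.rank().corr()

print(pearson.round(2))
print(spearman.round(2))
```

If the two tables disagree strongly on the real data, the Pearson values are probably dominated by a few outliers.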

## Formatting Dates/Category
To use the dates provided, let's transform them into a `datetime64` format. Since we're only given the date on which the video started trending, we will omit the time of publication. Let's also create a new column with the category of the video (as opposed to the numerical id), just to improve readability.

In [5]:
df.trending_date = pd.to_datetime(df.trending_date, format='%y.%d.%m')
df['publish_date'] = pd.to_datetime(df.publish_time.map(lambda s: s[2:10]), format='%y-%m-%d')
# drop() returns a new DataFrame, so the result has to be assigned back
df = df.drop(['publish_time'], axis=1)

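categoryId = {}

# Use a context manager so the file is closed after reading
with open('../input/US_category_id.json') as f:
    categoryJSON = json.load(f)
for cat in categoryJSON["items"]:
    # Cast the id to int so that it matches the integer category_id column
    categoryId[int(cat['id'])] = cat["snippet"]["title"]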
    
df['Category'] = df.category_id.map(categoryId)
df.sample(5)
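
To illustrate the id-to-title mapping on its own, here is a minimal sketch on a made-up JSON excerpt. It assumes the ids are stored as strings in the JSON while `category_id` is an integer column; `raw`, `id_to_title` and `toy` are hypothetical names used only for this example.

```python
import json
import pandas as pd

# Hypothetical excerpt shaped like the "items" list of the category file;
# the ids are assumed to be strings, hence the int() cast below.
raw = """
{"items": [
    {"id": "1",  "snippet": {"title": "Film & Animation"}},
    {"id": "10", "snippet": {"title": "Music"}}
]}
"""
category_json = json.loads(raw)

id_to_title = {int(item["id"]): item["snippet"]["title"]
               for item in category_json["items"]}

toy = pd.DataFrame({"category_id": [10, 1, 99]})
# Series.map with a dict yields NaN for ids that have no entry
toy["Category"] = toy.category_id.map(id_to_title)
print(toy)
```

A quick `toy.Category.isna().sum()` after the mapping is a cheap way to spot a key-type mismatch, since every row would come back as NaN in that case.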

# More Numerical Features
There are plenty of interesting values that we can obtain from this data: 
    
### Time to Trending
Given the publish and trending dates, we can find how long it took the video to start trending: 

In [6]:
df['days_to_trend'] = df.trending_date - df.publish_date
df.days_to_trend = df.days_to_trend.dt.days
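
The date handling can be traced end to end on a couple of made-up rows. The sample strings below are only assumed to follow the same layout as the real columns (`yy.dd.mm` for the trending date, an ISO-style timestamp for the publish time).

```python
import pandas as pd

# Illustrative rows, not taken from the dataset
toy = pd.DataFrame({
    "trending_date": ["17.14.11", "18.02.01"],
    "publish_time": ["2017-11-13T17:13:01.000Z", "2017-12-30T08:00:00.000Z"],
})

# Year.day.month, so "18.02.01" is 2 January 2018
toy["trending_date"] = pd.to_datetime(toy.trending_date, format="%y.%d.%m")
# Characters 2..9 of the timestamp are "yy-mm-dd"
toy["publish_date"] = pd.to_datetime(toy.publish_time.str[2:10], format="%y-%m-%d")

# Subtracting two datetime columns gives timedeltas; .dt.days keeps whole days
toy["days_to_trend"] = (toy.trending_date - toy.publish_date).dt.days
print(toy[["publish_date", "trending_date", "days_to_trend"]])
```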

### Number of Tags
According to YouTube, tags are a major factor in how videos are ranked. It would be reasonable to think that more tags will cause the video to appear in more search results, and therefore get more views:

In [7]:
df['num_tags'] = df.tags.map((lambda x: 0 if x == '[none]' else (x.count('|')+1)))
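
The counting rule is easy to check by hand on a few made-up tag strings; `count_tags` is just a named version of the lambda above, introduced here for illustration.

```python
def count_tags(tags: str) -> int:
    """Count '|'-separated tags; '[none]' marks a video without tags."""
    return 0 if tags == "[none]" else tags.count("|") + 1

# Made-up examples in the same '|'-separated layout
examples = ['[none]', 'music', 'cat|"funny cat"|pets']
counts = [count_tags(t) for t in examples]
print(counts)  # [0, 1, 3]
```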

### Percentage Statistics
We have only been given the absolute number of views, likes, dislikes, etc. It may be more helpful to look at the <i>percentage</i> of views that lead to an interaction with the video. This could be a good indication of how impactful watching that video was:

In [8]:
df['%_like'] = 100*df.likes/df.views
df['%_dislike'] = 100*df.dislikes/df.views
df['%_comment'] = 100*df.comment_count/df.views
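
One caveat with ratios like these is a row with zero views, which would produce `inf` or `NaN`. A minimal sketch of a guarded version on made-up numbers (whether the real data contains such rows is not checked here):

```python
import numpy as np
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({"views": [1000, 200, 0], "likes": [50, 1, 0]})

# Replace zero views by NaN so that the ratio is NaN rather than inf
safe_views = toy.views.replace(0, np.nan)
toy["%_like"] = 100 * toy.likes / safe_views
print(toy)
```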

# Quick Statistics (Absolute)

What are the top videos in our feature categories? 
### **Views**

In [9]:
df_QuickStatistics = pd.concat([df.title, df.channel_title, df.views, df.likes, df.dislikes, df.comment_count, df.trending_date, df.num_tags], axis=1)
df_QuickStatistics.sort_values(by=['views'], ascending=False).head(10)

### **Likes**

In [10]:
df_QuickStatistics.sort_values(by=['likes'], ascending=False).head(10)

In [146]:
matplotlib.style.use('ggplot')
df.plot.scatter(x='likes', y='views', figsize = (25,10));

### **Dislikes**

In [11]:
df_QuickStatistics.sort_values(by=['dislikes'], ascending=False).head(10)

In [147]:
df.plot.scatter(x='dislikes', y='views', figsize = (25,10));

### **Tags** 

In [12]:
df_QuickStatistics.sort_values(by=['num_tags'], ascending=False).head(10)

In [148]:
df.plot.scatter(x='comment_count', y='views', figsize = (25,10));

### **Trending Videos from Channel**

In [13]:
df_QuickStatistics.channel_title.value_counts().head(10)

Overall, this was quite uninsightful. Let's try the same thing, but with percentage of viewers: 

# Quick Statistics (%)
We omit videos that disallow ratings and comments here, as they will always have the lowest interaction rates.
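
One way to do this omission explicitly is a boolean mask. The sketch below assumes the data has boolean columns named `ratings_disabled` and `comments_disabled`; those names are an assumption and should be checked against `df.columns`.

```python
import pandas as pd

# Made-up rows; the two *_disabled column names are assumed, not confirmed
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "ratings_disabled": [False, True, False],
    "comments_disabled": [False, False, True],
})

# Keep only videos where neither ratings nor comments are disabled
interactive = toy[~(toy.ratings_disabled | toy.comments_disabled)]
print(interactive.title.tolist())
```
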
### **% Liked**

In [14]:
df_QuickStatisticsP = pd.concat([df.title, df.channel_title, df.views, df['%_like'], df['%_dislike'], df['%_comment']], axis = 1)
df_QuickStatisticsP.sort_values(by=['%_like'], ascending=False).head(10)

In [149]:
df.plot.scatter(x='%_like', y='views', figsize = (25,10));

### **% Disliked** 

In [15]:
df_QuickStatisticsP.sort_values(by=['%_dislike'], ascending=False).head(10)

In [150]:
df.plot.scatter(x='%_dislike', y='views', figsize = (25,10));

### **% Commented**

In [16]:
df_QuickStatisticsP.sort_values(by=['%_comment'], ascending=False).head(10)

In [151]:
df.plot.scatter(x='%_comment', y='views', figsize = (25,10));

Interestingly enough, the top 10 videos by `%_comment` came from just two channels, and all of them were makeup videos. From the (rather anecdotal) top 10 by `%_like` and `%_dislike`, I noticed that the most frequently liked videos were all music videos and the most frequently disliked videos were largely about the FCC and net neutrality. However, these conclusions rest on nothing but our very brief look at the top of each of these rankings.

This brings us to our next question: what effect does the actual content of the videos have? 

# Video Content
How can we characterize the content of a video without actually watching it? Frankly, our options are quite limited. The biggest indicators we have are the video's tags, title and description. It's not immediately obvious what we should do with them. As a start, let's obtain a list of the most common tags and the most common words in titles and descriptions.

First, let's tokenize (get a list of words from) the titles, descriptions and tags:

In [33]:
df['title_tokenized'] = df.title.map((lambda x: nltk.word_tokenize(x.lower()) if isinstance(x, str) else []))
df['description_tokenized'] = df.description.map((lambda x: nltk.word_tokenize(x.lower()) if isinstance(x, str) else []))
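# Replace '|' and '"' with a space (not an empty string) so that adjacent tags are not glued together
df['tags_tokenized'] = df.tags.map((lambda x: nltk.word_tokenize(re.sub('[|"]', ' ', x.lower())) if isinstance(x, str) else []))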
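
Word-level tokenization breaks multi-word tags apart. If whole tags are of interest instead, a simple alternative is to split on the `|` separator. This is a sketch of a different technique than the NLTK tokenization above, and `split_tags` is a hypothetical helper name.

```python
def split_tags(tags: str) -> list:
    """Split a '|'-separated tag string into whole, lowercased tags."""
    if tags == "[none]":
        return []
    # Multi-word tags are wrapped in double quotes; strip them
    return [t.strip('"').lower() for t in tags.split("|")]

# Made-up example in the same layout as the tags column
tags = split_tags('Music|"Official Video"|"2018"')
print(tags)  # ['music', 'official video', '2018']
```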

Next, let's find the most common elements. For the sake of consistency, we will examine everything in lowercase and filter out all stopwords and punctuation:

In [61]:
stopWords = set(nltk.corpus.stopwords.words('english'))
# Leftover tokenizer fragments to drop; note that `x in junk` is a substring test
junk = "’`'s-n't»...--'"+'”'+'“'

def word_distribution(token_lists, extra_junk=''):
    # Flatten the token lists of every row (the old range(1, n) loop skipped the last row)
    accum = [tok for toks in token_lists for tok in toks]
    drop = junk + extra_junk
    accum = [x for x in accum
             if x.lower() not in stopWords
             and x not in string.punctuation
             and x not in drop]
    return nltk.FreqDist(accum)

title_distribution = word_distribution(df.title_tokenized)
descr_distribution = word_distribution(df.description_tokenized)
# tags_distribution is needed by the next cell but was never computed before
tags_distribution = word_distribution(df.tags_tokenized, extra_junk='|')

In [62]:
df_td = pd.DataFrame(title_distribution.most_common(25), columns = ['Word', 'Frequency'])
df_td['Index']=df_td.Word
noPrint = df_td.set_index('Index')

df_dd = pd.DataFrame(descr_distribution.most_common(25), columns = ['Word', 'Frequency'])
df_dd['Index']=df_dd.Word
noPrint = df_dd.set_index('Index')

df_tad = pd.DataFrame(tags_distribution.most_common(25), columns = ['Word', 'Frequency'])
df_tad['Index'] = df_tad.Word
noPrint = df_tad.set_index('Index')
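
The same counting idea can be sketched with only the standard library, which makes it easy to verify on a few made-up token lists. The tiny `stop_words` set is hand-picked for this example and is not NLTK's stopword list.

```python
from collections import Counter
import string

# Hand-picked stopwords for illustration only
stop_words = {"the", "a", "of", "in"}

# Made-up tokenized titles
token_lists = [
    ["the", "official", "trailer"],
    ["official", "video", "!"],
    ["a", "trailer", "of", "2018"],
]

# Count every token that is neither a stopword nor a punctuation character
counts = Counter(
    tok
    for toks in token_lists
    for tok in toks
    if tok not in stop_words and tok not in string.punctuation
)
top = counts.most_common(2)
print(top)  # [('official', 2), ('trailer', 2)]
```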

Plotting the 25 most common words in the titles, descriptions and tags of trending videos gives us:

In [63]:
trace1 = go.Bar(
    x = df_td.Word,
    y = df_td.Frequency,
    marker = dict(color = 'rgba(255, 255, 128, 0.5)',
    line=dict(color='rgb(0,0,0)',width=2))
)

trace2 = go.Bar(
    x = df_dd.Word,
    y = df_dd.Frequency,
    marker = dict(color = 'rgba(68, 71, 231, 0.5)',
    line=dict(color='rgb(0,0,0)',width=2))
)

trace3 = go.Bar(
    x = df_tad.Word,
    y = df_tad.Frequency,
    marker = dict(color = 'rgba(68, 231, 71, 0.5)',
    line=dict(color='rgb(0,0,0)',width=2))
)

layout1 = {
  'xaxis': {'title': 'Word'},
  'barmode': 'relative',
  'title': 'Most Common Words in Video Titles'
};

layout2 = {
  'xaxis': {'title': 'Word'},
  'barmode': 'relative',
  'title': 'Most Common Words in Video Descriptions'
};

layout3 = {
  'xaxis': {'title': 'Word'},
  'barmode': 'relative',
  'title': 'Most Common Tags'
};
fig1 = go.Figure(data = [trace1], layout = layout1)
fig2 = go.Figure(data = [trace2], layout = layout2)
fig3 = go.Figure(data = [trace3], layout = layout3)
iplot(fig1)
iplot(fig2)
iplot(fig3)

I think that the terms more or less speak for themselves, and personally I expected results along these lines. A large part of YouTube's trending content is centered around mainstream media such as music videos or movie trailers. Hence, we see terms such as "official", "video", "2018" and "trailer" a good amount. Note that the most common term found in the descriptions, by far, is `http`(s). In general, the description is centered around promoting the channel ("subscribe", "like", "new", "watch" ...) or promoting another source such as social media (e.g. "facebook", "twitter", "instagram") or a link ("http(s)"). Remarkably, videos with no tags at all appear more often than any individual tag.

### **Title/Description/Tags Length**
Conveniently, we have a list of all the tags/words for our videos. How long are titles and descriptions usually? 

We get the length in words of each title and description, and reuse the number of tags computed earlier:

In [154]:
df['title_length'] = df.title_tokenized.map((lambda s: len(s)))
df['descrption_length'] = df.description_tokenized.map((lambda s: len(s)))

trace1 = [go.Histogram(x = df.title_length, marker = dict(color = 'rgba(68, 231, 71, 0.5)'))]
trace2 = [go.Histogram(x = df.descrption_length, marker = dict(color = 'rgba(68, 71, 231, 0.5)'))]
trace3 = [go.Histogram(x = df.num_tags, marker = dict(color = 'rgba(231, 71, 68, 0.5)'))]

layout1 = {
  'xaxis': {'title': '# Words in Title'},
  'barmode': 'relative',
  'title': 'Length of Video Title (Words)',
  'yaxis': {'title': 'Freq'}
};
layout2 = {
  'xaxis': {'title': '# Words in Description'},
  'barmode': 'relative',
  'title': 'Length of Video Descriptions (Words)',
  'yaxis': {'title': 'Freq'}
};
layout3 = {
  'xaxis': {'title': 'Number of Tags'},
  'barmode': 'relative',
  'title': 'Number of Tags',
  'yaxis': {'title': 'Freq'}
};

fig1 = go.Figure(data = trace1, layout = layout1)
fig2 = go.Figure(data = trace2, layout = layout2)
fig3 = go.Figure(data = trace3, layout = layout3)

iplot(fig1)


In [155]:
iplot(fig2)

In [156]:
iplot(fig3)

# More Correlations
How do these numbers that we've obtained from videos stack up against views, likes, or comments? 

### **Number of Tags**

In [157]:
df.plot.scatter(x='num_tags', y='views', figsize = (25,10));

### **Title Length (Words)**

In [163]:
df.plot.scatter(x='title_length', y='views', figsize = (25,10));

Unfortunately (and perhaps expectedly), we don't see any significant correlation. 
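
Scatter plots of raw view counts are dominated by a few huge videos, so a numeric check on a log scale can back up the visual impression. The sketch below uses synthetic, independent data purely to show the mechanics; `toy` and its distributions are assumptions and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: tag counts and heavy-tailed view counts, drawn independently
toy = pd.DataFrame({
    "num_tags": rng.integers(0, 40, size=200),
    "views": rng.lognormal(mean=12, sigma=2, size=200),
})

# Compress the heavy tail before correlating
toy["log_views"] = np.log10(toy.views)
r = toy.num_tags.corr(toy.log_views)  # Pearson correlation by default
print(round(r, 3))
```

On the real data, the same two lines applied to `df.num_tags` and `df.views` would give a number to set beside the plots.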

# Conclusion
Well, that's about it for now. In the future, I could consider using NLTK to do a sentiment analysis on the titles/tags of the videos, and see how those correlate with views or likes and dislikes. 

One idea I would be interested in trying is to use the YouTube API to get information on the publishing channel (e.g. number of subscribers, video views, ratings, etc.). Given this information, I think it would be quite feasible to train a model to predict the number of views a video receives. Although we did not find any particularly remarkable trends in this initial look at the data, the idea remains plausible, especially given the vast number of features and videos we have access to. Check back at some point if you wan