# COGS 108 - Final Project

# Important
ONE, and only one, member of your group should upload this notebook to TritonED.
Each member of the group will receive the same grade on this assignment.
Keep the file name the same: submit the file 'FinalProject.ipynb'.
Only upload the .ipynb file to TED; do not upload any associated data. Make sure that any cells whose output you want graders to see have been executed.

# Group Members: Fill in the Student IDs of each group member here
Replace the lines below to list each person's full student ID, UCSD email, and full name.

* A11973566 -- mchemele@ucsd.edu -- Maggie Chemelekova
* A15673410
* A12910452
* A13534242
* A13410656
* A12963052

Start your project here.

# Introduction and Background

We were interested in determining whether a video's category influences its popularity, and whether that influence differs by region.

In a Kaggle data visualization, “Exploring Youtube Trending Statistics EDA” by user Donyoe, the author creates a series of graphs based on raw YouTube data such as views, likes, and dislikes. One graph set we looked at, Top Countries in Absolute Numbers, compares 5 countries by views, likes, dislikes, and comments. We became interested in whether videos trend faster in some countries than in others: while views and likes are similar for all countries, France has a much larger number of comments than the other countries, which are all roughly the same. Given this disparity, our project asks whether the region where a video is uploaded affects its time to trending. While this visualization inspired us to ask whether region affects a video’s popularity, it does not cover the whole of our question.

The other source that helped us was the YouTube-8M dataset, created by members of the Google AI Perception group, which categorizes 6.1 million public YouTube videos that have over 1000 views. Its page includes a histogram of the total number of videos by category, which reveals that Arts & Entertainment is the most uploaded video type. Looking at the trending section on YouTube, we saw that while trending videos span a variety of categories, the predominant category tended to be entertainment. These two sources, combined with the other categories shown in the Kaggle data visualization, led us to question whether different category types influence the popularity of videos.

References (include links):
* https://www.kaggle.com/donyoe/exploring-youtube-trending-statistics-eda
* https://research.google.com/youtube8m/index.html
* https://www.youtube.com/feed/trending

#### Our research question: 
Does category influence popularity (likes, dislikes, views, tags, etc.), and if so, does its influence differ between groups (e.g. Europeans vs. North Americans)?

#### Our main hypothesis and predictions?
YouTube videos involving animals are more popular in North America than they are in Europe. 

# Data Description

#### Dataset Name: Trending YouTube Video Statistics

#### Link to the dataset: https://www.kaggle.com/datasnaek/youtube-new 

#### Number of observations: 40950

#### 1-2 sentences describing the dataset:
This dataset consists of daily statistics for trending YouTube videos in 5 different countries - US, Canada, Great Britain, France, and Germany. It has 40950 observations and the features it contains are: video_id, trending_date, title, channel_title, category_id, publish_time, tags, views (total number of views), likes (total number of likes), dislikes (total number of dislikes), comment_count, thumbnail_link, comments_disabled (true or false), ratings_disabled (true or false), video_error_or_removed (true or false), and description.


# Data Cleaning/Pre-Processing

*** done by Maggie

In [1]:
# Display plots directly in the notebook instead of in a new window
%matplotlib inline

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

In [2]:
# Configure libraries
# The seaborn library makes plots look nicer
sns.set()
sns.set_context('talk')

# Round decimals when displaying DataFrames
pd.set_option('display.precision', 2)

In [3]:
# import data into DataFrames

us_videos = pd.read_csv('US_videos.csv')
canada_videos = pd.read_csv('Canada_videos.csv')
great_britain_videos = pd.read_csv('Great_Britain_videos.csv')
france_videos = pd.read_csv('France_videos.csv')
germany_videos = pd.read_csv('Germany_videos.csv')

In [4]:
# Check out the data

us_videos.head(5)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [5]:
# drop unnecessary columns from DataFrame

us_videos.drop(labels=['video_id', 'thumbnail_link', 'comments_disabled', 
                              'ratings_disabled', 'video_error_or_removed'], axis = 1, inplace = True)
us_videos.head(5)

Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,description
0,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,"One year after the presidential election, John..."
2,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,Today we find out if Link is a Nickelback amat...
4,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,I know it's been a while since we did this sho...


In [6]:
canada_videos.drop(labels=['video_id', 'thumbnail_link', 'comments_disabled', 
                              'ratings_disabled', 'video_error_or_removed'], axis = 1, inplace = True)

france_videos.drop(labels=['video_id', 'thumbnail_link', 'comments_disabled', 
                              'ratings_disabled', 'video_error_or_removed'], axis = 1, inplace = True)

germany_videos.drop(labels=['video_id', 'thumbnail_link', 'comments_disabled', 
                              'ratings_disabled', 'video_error_or_removed'], axis = 1, inplace = True)

great_britain_videos.drop(labels=['video_id', 'thumbnail_link', 'comments_disabled', 
                              'ratings_disabled', 'video_error_or_removed'], axis = 1, inplace = True)

In [7]:
# Renaming the columns of the dataframe

us_videos.columns = ["trendingDate", "title", "channel", "category", "published", "tags", "views",
              "likes", "dislikes", "comments", "description"]

canada_videos.columns = ["trendingDate", "title", "channel", "category", "published", "tags", "views",
              "likes", "dislikes", "comments", "description"]

france_videos.columns = ["trendingDate", "title", "channel", "category", "published", "tags", "views",
              "likes", "dislikes", "comments", "description"]

germany_videos.columns = ["trendingDate", "title", "channel", "category", "published", "tags", "views",
              "likes", "dislikes", "comments", "description"]

great_britain_videos.columns = ["trendingDate", "title", "channel", "category", "published", "tags", "views",
              "likes", "dislikes", "comments", "description"]

In [8]:
us_videos.head(2)

Unnamed: 0,trendingDate,title,channel,category,published,tags,views,likes,dislikes,comments,description
0,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,"One year after the presidential election, John..."


In [9]:
# Drop any rows with missing data, but only for the columns "category", "likes", "views", "dislikes", "comments"

us_videos.dropna(subset=['category','likes', 'views', 'dislikes', 'comments'], inplace = True)

canada_videos.dropna(subset=['category','likes', 'views', 'dislikes', 'comments'], inplace = True)

france_videos.dropna(subset=['category','likes', 'views', 'dislikes', 'comments'], inplace = True)

germany_videos.dropna(subset=['category','likes', 'views', 'dislikes', 'comments'], inplace = True)

great_britain_videos.dropna(subset=['category','likes', 'views', 'dislikes', 'comments'], inplace = True)

#us_videos
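
The drop, rename, and dropna steps above are repeated once per country. As a minimal sketch, they could be collected into one function and applied in a loop. The name `clean_videos` and the constants below are hypothetical and introduced only for illustration; the column names are taken from the `head()` output above. Unlike the cells above, this version returns a new DataFrame instead of modifying in place, and it renames by mapping rather than by position.

```python
import pandas as pd

# Columns removed in the cells above.
DROP_COLUMNS = ['video_id', 'thumbnail_link', 'comments_disabled',
                'ratings_disabled', 'video_error_or_removed']

# Old name -> new name; columns not listed keep their original names.
RENAME = {'trending_date': 'trendingDate', 'channel_title': 'channel',
          'category_id': 'category', 'publish_time': 'published',
          'comment_count': 'comments'}

# Rows missing any of these values are dropped.
REQUIRED = ['category', 'likes', 'views', 'dislikes', 'comments']


def clean_videos(raw):
    """Return a cleaned copy of one country's raw DataFrame (hypothetical helper)."""
    cleaned = raw.drop(columns=DROP_COLUMNS).rename(columns=RENAME)
    return cleaned.dropna(subset=REQUIRED)


# Usage sketch, assuming the raw CSVs have just been loaded and not yet modified:
# us_videos = clean_videos(pd.read_csv('US_videos.csv'))
```

Renaming by mapping avoids silently mislabeling columns if a file ever arrives with a different column order.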

In [10]:
us_videos["category"].unique()

array([22, 24, 23, 28,  1, 25, 17, 10, 15, 27, 26,  2, 19, 20, 29, 43],
      dtype=int64)

In [11]:
canada_videos["category"].unique()

array([10, 23, 24, 25, 22, 26,  1, 28, 20, 17, 29, 15, 19,  2, 27, 43, 30],
      dtype=int64)

In [12]:
france_videos["category"].unique()

array([24, 23, 20, 17, 22, 27, 26, 28,  2, 25,  1, 10, 43, 19, 15, 29, 30,
       44], dtype=int64)

In [13]:
germany_videos["category"].unique()

array([24, 23, 27, 22,  1,  2, 17, 26, 25, 10, 20, 43, 28, 29, 15, 19, 44,
       30], dtype=int64)

In [14]:
great_britain_videos["category"].unique()

array([26, 24, 10, 17, 25, 22, 23, 28, 15, 27,  1, 20,  2, 19, 29, 43],
      dtype=int64)

In [15]:
us_videos["category"].value_counts()

24    9964
10    6472
26    4146
23    3457
22    3210
25    2487
28    2401
1     2345
17    2174
27    1656
15     920
20     817
19     402
2      384
29      57
43      57
Name: category, dtype: int64

In [16]:
canada_videos["category"].value_counts()

24    13451
25     4159
22     4105
23     3773
10     3731
17     2787
1      2060
26     2007
20     1344
28     1155
27      991
19      392
15      369
2       353
43      124
29       74
30        6
Name: category, dtype: int64

In [17]:
france_videos["category"].value_counts()

24    9819
22    5719
23    4343
17    4342
10    3946
25    3752
26    2361
1     2157
20    1459
28     802
27     769
2      673
15     237
19     119
29     114
43      99
30      11
44       2
Name: category, dtype: int64

In [18]:
germany_videos["category"].value_counts()

24    15292
22     5988
25     2935
17     2752
23     2534
1      2376
10     2372
26     1745
20     1565
2       873
27      844
28      806
29      256
15      251
19      141
43      107
30        2
44        1
Name: category, dtype: int64

In [19]:
great_britain_videos["category"].value_counts()

10    13754
24     9124
22     2926
1      2577
26     1928
17     1907
23     1828
20     1788
25     1225
15      534
28      518
27      457
2       144
19       96
29       90
43       20
Name: category, dtype: int64
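
The raw counts above are hard to compare across countries because each country has a different total number of rows. A rough sketch of one way to normalize them is to compute the share of rows in a given category per country. The helper names are hypothetical, and treating category 15 as the animal category is an assumption that should be checked against YouTube's category list. For reference, category 15 accounts for 920 of the 40,949 US rows counted above (about 2.2%).

```python
import pandas as pd

# Assumption: category id 15 is the animal category ("Pets & Animals").
ANIMAL_CATEGORY = 15


def category_share(df, category_id=ANIMAL_CATEGORY):
    """Fraction of rows in df whose 'category' equals category_id."""
    return (df['category'] == category_id).mean()


def share_by_country(frames, category_id=ANIMAL_CATEGORY):
    """frames: dict mapping a country label to its cleaned DataFrame."""
    shares = {name: category_share(df, category_id) for name, df in frames.items()}
    # Largest share first, so the ranking is easy to read.
    return pd.Series(shares).sort_values(ascending=False)


# Usage sketch with the DataFrames from this notebook:
# share_by_country({'US': us_videos, 'Canada': canada_videos,
#                   'Great Britain': great_britain_videos,
#                   'France': france_videos, 'Germany': germany_videos})
```

Because the dataset holds daily statistics, a video that trends on several days likely contributes several rows, so these shares describe trending video-days rather than unique videos.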

# Data Visualization

*** TO DO!!! still needs to be done

In [None]:
# plot the data, using scatter_matrix, from Pandas.

fig = pd.plotting.scatter_matrix(us_videos)

In [None]:
# plot a bar chart showing the number of videos in each category

df2 = us_videos['category'].value_counts()
df2.plot.bar()

f1 = plt.gcf()

In [None]:
# plot a histogram of likes for videos with category = 15

df3 = us_videos[us_videos['category']==15]
df3['likes'].plot.hist()

f2 = plt.gcf()

# 920 total videos for category = 15, over half of those have 0 to a relatively small number of likes
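
Since most of these videos have few likes and a small number have very many, a histogram on the raw scale puts nearly everything in the first bin. One possible remedy, sketched here under the assumption that a log scale suits this data, is to plot log10(likes + 1) instead. The helper name `log_likes` is hypothetical.

```python
import numpy as np
import pandas as pd


def log_likes(df, category_id=15):
    """log10(likes + 1) for the rows of one category.

    Adding 1 keeps videos with zero likes defined (they map to 0).
    """
    likes = df.loc[df['category'] == category_id, 'likes']
    return np.log10(likes + 1)


# Usage sketch: log_likes(us_videos).plot.hist(bins=30)
```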

# Data Analysis and Results

*** TO DO!!! still needs to be done

We would like to explore the time window between a video being published and becoming trending, and how it relates to comments, views, description length (and possibly some word representations as well), tags, and several other features. Specifically, we would like to build models with various machine learning algorithms (e.g., regression, PCA, and other representation learning algorithms) to find basic characteristics of trending videos in different countries, and see how these characteristics vary between countries and the languages used. We would also like to explore how the time needed to trend relates to different categories of videos (e.g., music, animals, daily life).
In addition, if time permits, we will also explore this relationship for different channels. Since we do not have channel-level data, we will need to collect it ourselves (e.g., number of subscribers, number of videos, average number of views and comments per video) for such a study.
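
The time window between publishing and trending can be derived from the two date columns already in the data. The sketch below assumes `trendingDate` is formatted as yy.dd.mm and `published` is an ISO 8601 UTC timestamp, which is what the sample rows above show (for example `17.14.11` and `2017-11-13T17:13:01.000Z`). The function name `days_to_trending` is hypothetical.

```python
import pandas as pd


def days_to_trending(df):
    """Whole days between the publish date and the trending date, per row."""
    # Assumption: trendingDate is yy.dd.mm, e.g. '17.14.11' is 14 Nov 2017.
    trending = pd.to_datetime(df['trendingDate'], format='%y.%d.%m')
    # Parse as UTC, then drop the timezone and the time of day so that
    # only calendar dates are compared.
    published = (pd.to_datetime(df['published'], utc=True)
                 .dt.tz_localize(None)
                 .dt.normalize())
    return (trending - published).dt.days


# Usage sketch: median days to trending per category.
# us_videos.assign(days=days_to_trending(us_videos)).groupby('category')['days'].median()
```

A video that stays on the trending list for several days would appear with a growing value each day, so an analysis of time to first trending may want to keep only each video's earliest row.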

# Privacy/Ethics Considerations
This dataset is released under CC0: Public Domain, which means we can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. The dataset was collected using the YouTube API, which suggests it reflects the real situation. The data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count, but no personal information that could be used to identify a user.

# Conclusions and Discussion

*** TO DO!!! still needs to be done

YouTube is a large online platform where users upload, share, and watch videos. Multiple factors influence the number of views, likes, dislikes, and comments a video receives. Predicting whether a video will be popular on YouTube matters to content creators and advertising companies. To make a fair prediction, we would most likely have to examine multiple factors that might contribute to a video's popularity: thumbnail, title, description, date published, number of tags used, the number of users already subscribed to the channel, video category, previous user engagement, and possibly others. Some video categories might be more popular in some countries than in others. We hypothesized that videos in the US and Canada become trending faster than videos in France and Germany. If our hypothesis is wrong, we might need to collect more data and explore other factors that would better predict how popular a YouTube video will get.
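
One way the hypothesis about trending speed could be checked is a permutation test on the difference in median days to trending between two groups of countries. This is only a sketch: it is swapped in here in place of the t-test imported earlier because it does not assume normally distributed data, and the function name and parameters are hypothetical. The inputs are two arrays of days-to-trending values, however they are computed.

```python
import numpy as np


def permutation_test_median(a, b, n_permutations=10_000, seed=0):
    """One-sided permutation p-value for the claim median(a) < median(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = np.median(a) - np.median(b)
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        shuffled = rng.permutation(pooled)
        diff = np.median(shuffled[:a.size]) - np.median(shuffled[a.size:])
        if diff <= observed:
            count += 1
    # Add-one correction so the p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Usage sketch: a = days for US and Canada rows, b = days for France and Germany rows.
# p = permutation_test_median(a, b)
```

Rows in this dataset are not independent, since the same video can trend on several days and in several countries, so any p-value from this sketch should be treated as exploratory.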

Our dataset is missing some data that could be useful for making better predictions. Total channel subscriber count and channel age might also be confounding variables in our analysis and predictions.
