# INFO 300 Final Project

#### Metric Ideas:
- YouTube “Interaction”
    - Function of likes, dislikes, and comments (possibly taking into account `comments_disabled` and `ratings_disabled`)
    - Measures how much interaction users had with the content
- Views
- Category (get mapping through XX_category_id.json files)

- Time
    - Trending date
        - To discuss: our data contains only trending videos, so we would be comparing what trends most and least among already-trending videos, not among all videos. Do we want more data?
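
One possible shape for the “Interaction” idea in the list above is a weighted sum of likes, dislikes, and comment counts. This is only a sketch: the function name `interaction_score`, the equal default weights, and the toy values are all assumptions for illustration, and it presumes the three columns have already been scaled to comparable ranges.

```python
import pandas as pd

def interaction_score(frame, w_likes=1.0, w_dislikes=1.0, w_comments=1.0):
    """Hypothetical metric: weighted sum of scaled engagement columns."""
    return (w_likes * frame['likes']
            + w_dislikes * frame['dislikes']
            + w_comments * frame['comment_count'])

# Toy rows with made-up, already-scaled values (not taken from the dataset)
toy = pd.DataFrame({
    'likes': [0.5, 0.1],
    'dislikes': [0.25, 0.0],
    'comment_count': [0.25, 0.1],
})
toy['interaction'] = interaction_score(toy)
print(toy)
```

Dislikes count positively here because they are still a form of interaction; whether that is the right choice is an open design question.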

#### Northern Meteorological Seasons
- Spring: March 1 to May 31
- Summer: June 1 to August 31
- Fall: September 1 to November 30
- Winter: December 1 to February 28 (February 29 on a leap year)
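
The season boundaries above all fall on month boundaries, so a month number is enough to pick a season. A minimal sketch, where the helper name `season_of` is an assumption for illustration:

```python
def season_of(month):
    """Map a month number (1-12) to its northern meteorological season."""
    if month in (3, 4, 5):
        return 'spring'
    if month in (6, 7, 8):
        return 'summer'
    if month in (9, 10, 11):
        return 'fall'
    if month in (12, 1, 2):
        return 'winter'
    raise ValueError("month must be in 1..12, got %r" % (month,))

print(season_of(3), season_of(8), season_of(11), season_of(2))
```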

In [165]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the US dataset into a DataFrame
df = pd.read_csv('./../youtube-new/USvideos.csv')

print(f"This dataset contains {df.shape[0]} entries and {df.shape[1]} features.")

This dataset contains 40949 entries and 16 features.


In [166]:
print(f"The first video was published {df['publish_time'].min()} and the last video was published {df['publish_time'].max()}")

The first video was published 2006-07-23T08:24:11.000Z and the last video was published 2018-06-14T01:31:53.000Z


### Data Preprocessing

In [167]:
# Named category_json to avoid shadowing the standard-library json module
category_json = pd.read_json('./../youtube-new/US_category_id.json')
# Collect one [id, title] pair per category (the ids are strings in the JSON)
video_categories = [[item['id'], item['snippet']['title']] for item in category_json['items']]
print(video_categories)

[['1', 'Film & Animation'], ['2', 'Autos & Vehicles'], ['10', 'Music'], ['15', 'Pets & Animals'], ['17', 'Sports'], ['18', 'Short Movies'], ['19', 'Travel & Events'], ['20', 'Gaming'], ['21', 'Videoblogging'], ['22', 'People & Blogs'], ['23', 'Comedy'], ['24', 'Entertainment'], ['25', 'News & Politics'], ['26', 'Howto & Style'], ['27', 'Education'], ['28', 'Science & Technology'], ['29', 'Nonprofits & Activism'], ['30', 'Movies'], ['31', 'Anime/Animation'], ['32', 'Action/Adventure'], ['33', 'Classics'], ['34', 'Comedy'], ['35', 'Documentary'], ['36', 'Drama'], ['37', 'Family'], ['38', 'Foreign'], ['39', 'Horror'], ['40', 'Sci-Fi/Fantasy'], ['41', 'Thriller'], ['42', 'Shorts'], ['43', 'Shows'], ['44', 'Trailers']]
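
To attach these titles to the videos, the `[id, title]` pairs could be turned into a lookup dictionary. The sketch below uses a three-entry excerpt of the list printed above and a toy frame; it assumes `category_id` is loaded from the CSV as an integer, which is why the string ids are converted with `int`. The names `category_titles` and `toy` are illustrative only.

```python
import pandas as pd

# Excerpt of the [id, title] pairs printed above
video_categories = [['1', 'Film & Animation'], ['10', 'Music'], ['22', 'People & Blogs']]

# Assumption: category_id is an integer column, so convert the string ids
category_titles = {int(cat_id): title for cat_id, title in video_categories}

# Toy stand-in for the real DataFrame
toy = pd.DataFrame({'category_id': [22, 10, 1]})
toy['category'] = toy['category_id'].map(category_titles)
print(toy)
```

Ids missing from the dictionary would map to `NaN`, which makes unmapped categories easy to spot.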


#### Select Features and Filter Data

In [168]:
basic_features = ['views', 'likes', 'dislikes', 'comment_count']
exclude_features = ['thumbnail_link', 'comments_disabled', 'ratings_disabled', 'video_error_or_removed', 'description']

num_records_total = len(df)

# Filter out videos that have comments/ratings disabled or got removed
# and drop all excluded feature columns
df = df[(df[exclude_features[1]] == False) & 
        (df[exclude_features[2]] == False) &
        (df[exclude_features[3]] == False)].drop(columns=exclude_features)

# 696 records were filtered out. We could have analyzed these excluded records on their own,
# but a sample of ~700 out of ~41k is likely too small to yield interesting findings
num_records_filtered = num_records_total - len(df)

# Preview
df.iloc[0:1]

| | video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2kyS6SvSYSE | 17.14.11 | WE WANT TO TALK ABOUT OUR MARRIAGE | CaseyNeistat | 22 | 2017-11-13T17:13:01.000Z | SHANtell martin | 748374 | 57527 | 2966 | 15954 |


### Normalizing features

In [169]:
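# Normalize all feature columns by dividing each by its maximum value,
# so the largest entry becomes 1 and non-negative counts land in [0, 1]
def normalize(col):
    largest = col.max()
    # Leave an all-zero column unchanged to avoid dividing by zero
    if largest == 0:
        return col
    return col / largest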

for feature in basic_features:
    df[feature] = normalize(df[feature])
    
# Normalized preview
df[['title'] + basic_features][0:5]

| | title | views | likes | dislikes | comment_count |
|---|---|---|---|---|---|
| 0 | WE WANT TO TALK ABOUT OUR MARRIAGE | 0.003323 | 0.010247 | 0.001771 | 0.011717 |
| 1 | The Trump Presidency: Last Week Tonight with J... | 0.01074 | 0.017312 | 0.003671 | 0.00933 |
| 2 | Racist Superman \| Rudy Mancuso, King Bach & Le... | 0.014171 | 0.026013 | 0.003189 | 0.006008 |
| 3 | Nickelback Lyrics: Real or Fake? | 0.001524 | 0.001812 | 0.000398 | 0.001576 |
| 4 | I Dare You: GOING BALD!? | 0.009306 | 0.023555 | 0.001188 | 0.012866 |


### Creating Metrics

In [171]:
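# Split by season. trending_date is formatted YY.DD.MM (e.g. 17.14.11),
# so the month is the final two digits and each pattern is anchored with $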
spring = df[df['trending_date'].str.contains('0[345]$')] # 03, 04, 05
summer = df[df['trending_date'].str.contains('0[678]$')] # 06, 07, 08
fall = df[df['trending_date'].str.contains('09$|1[01]$')] # 09, 10, 11
winter = df[df['trending_date'].str.contains('12$|0[12]$')] # 12, 01, 02
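
As an alternative to matching the month with regular expressions, the dates could be parsed and the season looked up from the month. This is a sketch on toy dates, assuming the `YY.DD.MM` layout seen in the preview row (`17.14.11`); the `season_by_month` dictionary and the `season` column are names introduced here for illustration.

```python
import pandas as pd

# Toy trending dates in the YY.DD.MM layout (only the first appears in the preview)
toy = pd.DataFrame({'trending_date': ['17.14.11', '18.01.03', '18.14.06', '18.31.01']})

# Parse year, then day, then month
parsed = pd.to_datetime(toy['trending_date'], format='%y.%d.%m')

season_by_month = {
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall',
    12: 'winter', 1: 'winter', 2: 'winter',
}
toy['season'] = parsed.dt.month.map(season_by_month)
print(toy)
```

With a `season` column in place, a single `groupby('season')` could stand in for the four separate season frames.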

In [277]:
# Function that retrieves the title(s) of the video(s) with the *maximum* value for a given feature in a season
def get_max_metric(feature, season):
    season_df = globals()[season]  
    max_metric = season_df[season_df[feature] == season_df[feature].max()]
    return max_metric['title'].to_string(index=False)

# Function that retrieves the title(s) of the video(s) with the *minimum* value for a given feature in a season
def get_min_metric(feature, season):
    season_df = globals()[season]  
    min_metric = season_df[season_df[feature] == season_df[feature].min()]
    return min_metric['title'].to_string(index=False)

# Views Metric
print("The video(s) with the most views is %s" %(get_max_metric('views', 'spring')))
print("The video(s) with the fewest views is %s" %(get_min_metric('views', 'spring')))
print()

# Likes Metric
print("The video(s) with the most likes is %s" %(get_max_metric('likes', 'spring')))
print("The video(s) with the fewest likes is %s" %(get_min_metric('likes', 'spring')))
print()

# Dislikes Metric
print("The video(s) with the most dislikes is %s" %(get_max_metric('dislikes', 'spring')))
print("The video(s) with the fewest dislikes is %s" %(get_min_metric('dislikes', 'spring')))
print()

# Comment Count Metric
print("The video(s) with the most comments is %s" %(get_max_metric('comment_count', 'spring')))
print("The video(s) with the fewest comments is %s" %(get_min_metric('comment_count', 'spring')))
print()

# Combined Metric



The video(s) with the most views is Childish Gambino - This Is America (Official V...
The video(s) with the fewest views is President Trump set to announce 2020 re-electi...

The video(s) with the most likes is BTS (방탄소년단) 'FAKE LOVE' Official MV
The video(s) with the fewest likes is President Trump set to announce 2020 re-electi...

The video(s) with the most dislikes is Childish Gambino - This Is America (Official V...
The video(s) with the fewest dislikes is Bird Lands On News Anchor's Head (News Blooper)
  Rescued Chimp Helps Out on Flight over Africa

The video(s) with the most comments is BTS (방탄소년단) 'FAKE LOVE' Official MV
The video(s) with the fewest comments is Bird Lands On News Anchor's Head (News Blooper)
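
The max and min lookups differ only in which extreme they target, and the `globals()` lookup could be swapped for an explicit dictionary of season frames. A minimal sketch of that variation on toy data; `titles_at_extreme`, `season_frames`, and the toy rows are hypothetical names and values, not part of the notebook.

```python
import pandas as pd

def titles_at_extreme(season_df, feature, largest=True):
    """Return the titles of every row tied at the max (or min) of `feature`."""
    target = season_df[feature].max() if largest else season_df[feature].min()
    return season_df.loc[season_df[feature] == target, 'title'].tolist()

# Toy stand-in for one season's rows (made-up values)
toy_spring = pd.DataFrame({
    'title': ['Video A', 'Video B', 'Video C'],
    'views': [0.9, 0.1, 0.1],
})
season_frames = {'spring': toy_spring}

most_viewed = titles_at_extreme(season_frames['spring'], 'views')
least_viewed = titles_at_extreme(season_frames['spring'], 'views', largest=False)
print(most_viewed, least_viewed)
```

Returning a list rather than a formatted string keeps ties explicit, at the cost of losing the truncated display that `to_string` gives.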



In [270]:
# Double checking that there's actually a tie for fewest dislikes in spring

print("The video with the fewest dislikes is %s" %(get_min_metric('dislikes', 'spring')))
print()

# Note: this expression is not the last line of the cell, so its result is not displayed
spring[spring['dislikes'] == spring['dislikes'].min()]['title']
print(spring[spring['title'] == "Bird Lands On News Anchor's Head (News Blooper)"][basic_features])
print(spring[spring['title'] == "Rescued Chimp Helps Out on Flight over Africa"][basic_features])

The video with the fewest dislikes is Bird Lands On News Anchor's Head (News Blooper)
  Rescued Chimp Helps Out on Flight over Africa

          views     likes  dislikes  comment_count
21125  0.000038  0.000013  0.000002       0.000006
          views     likes  dislikes  comment_count
21640  0.000082  0.000036  0.000002       0.000012
21841  0.000211  0.000090  0.000007       0.000053
22063  0.000286  0.000121  0.000008       0.000073
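
The output above shows one title appearing on several rows, since a video can trend on more than one day, and only one of those rows sits at the minimum. A sketch of a more direct tie check on toy data, counting distinct `video_id` values among the rows at the minimum; the ids, titles, and dislike counts below are made up for illustration.

```python
import pandas as pd

# Toy rows: one video trends once, another trends on three days
toy = pd.DataFrame({
    'video_id': ['vid_a', 'vid_b', 'vid_b', 'vid_b'],
    'title': ['Bird clip', 'Chimp clip', 'Chimp clip', 'Chimp clip'],
    'dislikes': [2, 2, 7, 8],
})

# Rows sitting exactly at the minimum dislike value
tied_rows = toy[toy['dislikes'] == toy['dislikes'].min()]

# Number of distinct videos involved in the tie
tied_video_count = tied_rows['video_id'].nunique()

# How many trending-day rows each title has
rows_per_title = toy.groupby('title').size()
print(tied_rows)
print(tied_video_count)
print(rows_per_title)
```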


In [281]:
# Min and Max metrics for all basic features and all seasons
seasons = ['spring', 'summer', 'fall', 'winter']

def print_min_and_max_metrics(feature, season):
    print("The video(s) with the most %s is %s" %(feature, get_max_metric(feature, season)))
    print("The video(s) with the fewest %s is %s" %(feature, get_min_metric(feature, season)))
    print()

for season in seasons:
    for feature in basic_features:
        print_min_and_max_metrics(feature, season)

The video(s) with the most views is Childish Gambino - This Is America (Official V...
The video(s) with the fewest views is President Trump set to announce 2020 re-electi...

The video(s) with the most likes is BTS (방탄소년단) 'FAKE LOVE' Official MV
The video(s) with the fewest likes is President Trump set to announce 2020 re-electi...

The video(s) with the most dislikes is Childish Gambino - This Is America (Official V...
The video(s) with the fewest dislikes is Bird Lands On News Anchor's Head (News Blooper)
  Rescued Chimp Helps Out on Flight over Africa

The video(s) with the most comment_count is BTS (방탄소년단) 'FAKE LOVE' Official MV
The video(s) with the fewest comment_count is Bird Lands On News Anchor's Head (News Blooper)

The video(s) with the most views is Childish Gambino - This Is America (Official V...
The video(s) with the fewest views is Josh Groban - Granted (Official Lyric Video)

The video(s) with the most likes is BTS (방탄소년단) 'FAKE LOVE' Official MV
The video(s) with th