# Group 4

### About Dataset
UPDATE: The source code used to collect this data has been released by the dataset author.

#### Context

YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.

This dataset is a daily record of the top trending YouTube videos.

Note that this dataset is a structurally improved version of an earlier dataset by the same author.

#### Content

This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day.

EDIT: Now includes data from RU, MX, KR, JP and IN regions (Russia, Mexico, South Korea, Japan and India respectively) over the same time period.

Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

The data also includes a `category_id` field, whose meaning varies between regions. To retrieve the category name for a specific video, look up its `category_id` in the JSON file associated with that region. One such file is included for each region in the dataset.
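The lookup described above can be sketched as follows. This is a minimal example, assuming the category JSON follows the YouTube Data API layout (a top-level `items` list whose entries carry an `id` and a `snippet.title`). The helper name `load_category_names` and the inline sample are hypothetical stand-ins for a real file in the dataset folder.

```python
import json


def load_category_names(json_text):
    """Map category id (int) -> category title from a category JSON document.

    Assumes the layout {"items": [{"id": "...", "snippet": {"title": "..."}}]}.
    """
    payload = json.loads(json_text)
    return {int(item["id"]): item["snippet"]["title"] for item in payload["items"]}


# Small inline sample standing in for one region's category file (illustrative only).
sample_json = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "23", "snippet": {"title": "Comedy"}}
]}
"""

category_names = load_category_names(sample_json)
print(category_names[10])
```

With a real file, the same function could be fed the result of `open(path, encoding="utf-8").read()`, and the resulting dictionary passed to `Series.map` on the `category_id` column.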

For more information on specific columns in the dataset refer to the column metadata.

#### Acknowledgements

This dataset was collected using the YouTube API.

#### Inspiration

Possible uses for this dataset include:

* Sentiment analysis in a variety of forms.
* Categorising YouTube videos based on their comments and statistics.
* Training ML algorithms like RNNs to generate their own YouTube comments.
* Analysing what factors affect how popular a YouTube video will be.
* Statistical analysis over time.

## Members

* Shyam Akhil Nekkanti 
* Jun He (Helena) - 8903073
* Zheming Li (Brendan) - 8914152

## Field of Inquiry

Digital media and content analysis, focusing on YouTube trending videos.

## Potential Poorly Defined Question

**Do longer videos perform better?**

This question oversimplifies the relationship between video length and performance. It leaves "perform better" undefined and does not account for genre, audience retention, or platform algorithms.


## Research Question

What factors most strongly correlate with a video's likelihood of trending on YouTube across different regions?
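As a first, tentative look at this question, one option is a rank correlation between the engagement metrics. The sketch below uses the `views`, `likes`, `dislikes`, and `comment_count` values from the ten `CAvideos.csv` rows printed later in this document. Ranking the columns and then taking the Pearson correlation is equivalent to Spearman correlation. Because the dataset only contains videos that already trended, this measures how the metrics move together, not the likelihood of trending itself.

```python
import pandas as pd

# Values copied from the first ten rows of CAvideos.csv shown in this document.
sample = pd.DataFrame(
    {
        "views": [17158579, 1014651, 3191434, 2095828, 33523622,
                  1309699, 2987945, 748374, 4477587, 505161],
        "likes": [787425, 127794, 146035, 132239, 1634130,
                  103755, 187464, 57534, 292837, 4135],
        "dislikes": [43420, 1688, 5339, 1989, 21082,
                     4613, 9850, 2967, 4123, 976],
        "comment_count": [125882, 13030, 8181, 17518, 85067,
                          12143, 26629, 15959, 36391, 1484],
    }
)

# Pearson correlation of the ranks equals the Spearman rank correlation.
rank_corr = sample.rank().corr()

# Metrics ordered by how strongly they track views in this small sample.
views_corr = rank_corr["views"].drop("views").sort_values(ascending=False)
print(views_corr)
```

Ten rows are far too few to draw conclusions; the same two lines of analysis would apply unchanged to the full merged dataset.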

## Dataset

### [Trending YouTube Video Statistics](https://www.kaggle.com/datasets/datasnaek/youtube-new)

* Format: CSV, one file per region.
* Contains data for multiple regions, including US, GB, DE, CA, FR, RU, MX, KR, JP and IN.
* Includes various metrics such as views, likes, dislikes, comment count, and more.

### Questions to Consider for Choosing Data

1. Will it be one-time or ongoing?

    This is one-time. We are studying data up to a specific time point and using it to predict trends.
  
2. Will it be broadly implemented or implemented for one customer or interaction at a time?

    It will be broadly implemented as the data includes videos from multiple regions and various types.
 
3. Will it be automated?

    Data collection is not currently automated, but parts of the workflow, such as data collection and preprocessing, could be automated.

## Setup and Configuration

### 1. Install Anaconda
 
Anaconda is a Python distribution whose `conda` tool manages separate Python environments.

### 2. Configure the IDE

In the IDE, select the Python from the Anaconda environment as the interpreter.

### 3. Install Libraries

Use pip to install all the required libraries.

```python
!pip install numpy pandas matplotlib seaborn scikit-learn tensorflow keras
```



### 4. Coding

Import the required libraries, load the dataset, and perform data cleansing.

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

# Build the path to the Canadian (CA) dataset file relative to the working directory
cur_folder = os.getcwd()
filepath = os.path.join(cur_folder, "youtube-dataset", "CAvideos.csv")
print(filepath)

# Load the dataset into a pandas DataFrame
data = pd.read_csv(filepath)

# Display the first 10 rows of the dataset
data.head(10)
```

Output:

```text
/Users/brendan/Workspace/Personal/Conestoga/AI/analysis-math/data-vis/youtube-dataset/CAvideos.csv

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
4,2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
5,0yIWz1XEeyc,17.14.11,Jake Paul Says Alissa Violet CHEATED with LOGA...,DramaAlert,25,2017-11-13T07:37:51.000Z,"#DramaAlert|""Drama""|""Alert""|""DramaAlert""|""keem...",1309699,103755,4613,12143,https://i.ytimg.com/vi/0yIWz1XEeyc/default.jpg,False,False,False,► Follow for News! - https://twitter.com/KEEMS...
6,_uM5kFfkhB8,17.14.11,Vanoss Superhero School - New Students,VanossGaming,23,2017-11-12T23:52:13.000Z,"Funny Moments|""Montage video games""|""gaming""|""...",2987945,187464,9850,26629,https://i.ytimg.com/vi/_uM5kFfkhB8/default.jpg,False,False,False,Vanoss Merch Shop: https://vanoss.3blackdot.co...
7,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57534,2967,15959,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
8,JzCsM1vtn78,17.14.11,THE LOGANG MADE HISTORY. LOL. AGAIN.,Logan Paul Vlogs,24,2017-11-12T20:19:24.000Z,"logan paul vlog|""logan paul""|""logan""|""paul""|""o...",4477587,292837,4123,36391,https://i.ytimg.com/vi/JzCsM1vtn78/default.jpg,False,False,False,Join the movement. Be a Maverick ► https://Sho...
9,43sm-QwLcx4,17.14.11,Finally Sheldon is winning an argument about t...,Sheikh Musa,22,2017-11-10T14:10:46.000Z,"God|""Sheldon Cooper""|""Young Sheldon""",505161,4135,976,1484,https://i.ytimg.com/vi/43sm-QwLcx4/default.jpg,False,False,False,Sheldon is roasting pastor of the church\nyoun...
```


## Data Cleansing

**1. Import Data**

Load the CSV files for each region.
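One possible way to load the per-region files is sketched below. It assumes every region follows the `<REGION>videos.csv` naming seen with `CAvideos.csv`; the helper `load_region_frames` is a hypothetical name, and the demonstration writes two tiny stand-in files (with made-up values) to a temporary folder so the example runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_region_frames(data_dir, regions):
    """Read one `<REGION>videos.csv` file per region into a dict of DataFrames."""
    frames = {}
    for region in regions:
        csv_path = Path(data_dir) / f"{region}videos.csv"
        frames[region] = pd.read_csv(csv_path)
    return frames


# Demonstration with stand-in files; the real folder would be youtube-dataset/.
with tempfile.TemporaryDirectory() as tmp:
    for region in ("CA", "US"):
        (Path(tmp) / f"{region}videos.csv").write_text(
            "video_id,views\nabc123,100\n", encoding="utf-8"
        )
    frames = load_region_frames(tmp, ["CA", "US"])

print({region: df.shape for region, df in frames.items()})
```

If a regional file is not UTF-8 encoded, `pd.read_csv` accepts an `encoding` argument that could be passed through the helper.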


**2. Merge Data Sets**

Combine the regional data into a single table, adding a `region` column that records which file each row came from.
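A minimal sketch of this merge step, assuming the regional data is already held in a dictionary keyed by region code. The function name `merge_regions` and the small in-memory frames are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd


def merge_regions(frames):
    """Stack per-region DataFrames and record each row's origin in `region`."""
    tagged = [df.assign(region=region) for region, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


# Made-up frames standing in for the loaded regional CSV files.
frames = {
    "CA": pd.DataFrame({"video_id": ["a1", "a2"], "views": [10, 20]}),
    "US": pd.DataFrame({"video_id": ["a1"], "views": [30]}),
}

merged = merge_regions(frames)
print(merged)
```

The same video can trend in more than one region, so `video_id` alone does not identify a row after merging; the `region` column keeps those rows distinguishable.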


**3. Standardize Data**

* Convert `publish_time` and `trending_date` to a standard datetime format (`trending_date` is stored as `YY.DD.MM`, for example `17.14.11`).
  
* Ensure consistent formatting of view counts, likes, and dislikes across regions.
  
* Handle missing values and outliers.
  
* Extract additional features from the description and tags for analysis.
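The standardization steps above could look roughly like the sketch below. It assumes the column names shown in the `CAvideos.csv` output and a `YY.DD.MM` layout for `trending_date` (inferred from values such as `17.14.11`). The function name `standardize`, the derived columns `tag_list` and `tag_count`, and the choice to fill missing descriptions with an empty string are illustrative assumptions. Outlier handling is left out because the threshold is an analysis decision.

```python
import pandas as pd

COUNT_COLUMNS = ["views", "likes", "dislikes", "comment_count"]


def standardize(df):
    """Return a copy with parsed dates, numeric counts, and simple tag features."""
    out = df.copy()
    # trending_date looks like YY.DD.MM; unparseable values become NaT.
    out["trending_date"] = pd.to_datetime(
        out["trending_date"], format="%y.%d.%m", errors="coerce"
    )
    # publish_time is an ISO 8601 timestamp with a trailing Z (UTC).
    out["publish_time"] = pd.to_datetime(out["publish_time"], utc=True, errors="coerce")
    # Coerce counts to numbers; anything non-numeric becomes NaN.
    for col in COUNT_COLUMNS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Treat a missing description as empty text (an assumption, not a rule).
    out["description"] = out["description"].fillna("")
    # Tags are pipe-separated, with most entries wrapped in double quotes.
    out["tag_list"] = out["tags"].fillna("").map(
        lambda text: [tag.strip('"') for tag in text.split("|") if tag]
    )
    out["tag_count"] = out["tag_list"].map(len)
    return out


# Two rows adapted from the output above; views are strings to show coercion.
raw = pd.DataFrame(
    {
        "trending_date": ["17.14.11", "17.14.11"],
        "publish_time": ["2017-11-13T17:13:01.000Z", "2017-11-10T14:10:46.000Z"],
        "tags": ["SHANtell martin", 'God|"Sheldon Cooper"|"Young Sheldon"'],
        "views": ["748374", "505161"],
        "likes": [57534, 4135],
        "dislikes": [2967, 976],
        "comment_count": [15959, 1484],
        "description": ["placeholder description", None],
    }
)

clean = standardize(raw)
print(clean.dtypes)
```

Working on a copy keeps the raw frame available for comparison, which helps when checking how many values were coerced to `NaN` or `NaT`.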
