
# <center> Analyzing YouTube Trending Data using Python and Pandas (EDA) </center>

<img src="Img/Youtube.jpg" width="300">

# 1. Introduction
## 1.1 What is YouTube?
    
**YouTube** is a free video-sharing website that makes it easy to watch online videos. You can even create and upload your own videos to share with others. Created in 2005, YouTube is now one of the most popular sites on the Web, with around 2.6 billion users worldwide as of 2022. It is ranked as the second-most popular social network; the only platform with more active users is Facebook.
    
## [1.2 Context](https://www.kaggle.com/datasets/datasnaek/youtube-new) 
    
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, "To determine the year's top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they're not the most-viewed videos overall for the calendar year". Top performers on the YouTube trending list are music videos (such as the famously viral "Gangnam Style"), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
    
This dataset is a daily record of the top trending YouTube videos.
    
Note that this dataset is a structurally improved version of [this dataset.](https://www.kaggle.com/datasnaek/youtube)
    
## [1.3 Content](https://www.kaggle.com/datasets/datasnaek/youtube-new)
    
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day. 
    
EDIT : Now includes data from RU, MX, KR, JP, and IN regions (Russia, Mexico, South Korea, Japan and India respectively) over the same time period. 
    
Each region's data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. 
    
The data also includes a 'category_id' field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the five regions in the dataset.
    
For more information on specific columns in the dataset refer to the [column metadata.](https://www.kaggle.com/datasnaek/youtube-new/data)
    
## 1.4 Features 
    
- video_id : Unique code of the video
- title : Title of the video
- publishedAt : Time the video was published
- channelId : Unique code of the YouTube channel
- channelTitle : Title of the YouTube channel
- categoryId : Unique code of the category
- trending_date : Date the video was trending
- tags : Tags of the video
- view_count : Total number of views
- likes : Number of people who liked the video
- dislikes : Number of people who disliked the video
- comment_count : Number of comments on the video
- thumbnail_link : Link to the video's thumbnail
- comments_disabled : Whether comments are disabled for the video
- ratings_disabled : Whether ratings are disabled for the video
- description : Description of the video
    
## 1.5 Purpose of this Project 
    
In this project, we will analyze YouTube trending data using plain Python and then compare that code with an equivalent analysis using the Pandas module.

# 2. Analyze using Python 

## 2.1 Importing Dataset

In [1]:
# Load dataset using Context Manager 
import os 
import csv
import json 
import chardet 

def import_dataset(path, country, header = True, json_file = False, data_information = False) : 
    
    if json_file == True : 
        file_path = os.path.join(path, country + '_category_id.json')
        with open(file_path) as file : 
            category = json.load(file)   
        if data_information == True : 
            print("Dataset Information : ")
            print(f"1. File path : {file_path}")
            print(f"2. File size : {os.path.getsize(file_path)}")
        return category
        
    file_path = os.path.join(path, country + '_youtube_trending_data.csv')
    with open(file_path, mode = 'rb') as file : 
        raw_bytes = file.read() 
        file_encoding = chardet.detect(raw_bytes)['encoding']
    
    with open(file_path, mode = 'r', encoding = file_encoding) as file : 
        reader = csv.reader(file) 
        rows = list(reader) 
        if header == False : 
            rows = rows[0:]
            return rows 
        header = rows[0]
        rows = rows[1:]
    if data_information == True :
        print("Dataset Information : ")
        print(f"1. File path : {file_path}")
        print(f"2. File size : {os.path.getsize(file_path)}")
        print(f"3. Num of headers : {len(header)}")
        print(f"4. Num of rows : {len(rows)}")
    return header, rows     

In [2]:
KR_category = import_dataset('Dataset/', 'KR', json_file = True, data_information = True) 
print('\n')
KR_header, KR_youtube = import_dataset('Dataset/', 'KR', data_information = True)

Dataset Information : 
1. File path : Dataset/KR_category_id.json
2. File size : 10096


Dataset Information : 
1. File path : Dataset/KR_youtube_trending_data.csv
2. File size : 155363335
3. Num of headers : 16
4. Num of rows : 130554


Importing 'KR_youtube_trending_data.csv' takes some time because of its size (about 155 MB). The dataset has 16 columns and 130,554 rows.
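
The loader above reads the whole file twice: once as bytes for `chardet` and once as text. As a lighter alternative, the sketch below streams rows one at a time. It assumes the CSV is UTF-8 encoded, which the source does not confirm, and `iter_rows` is a hypothetical helper name. A small temporary file stands in for the real dataset.

```python
import csv
import os
import tempfile

def iter_rows(file_path, encoding="utf-8"):
    """Yield one dict per CSV row without loading the whole file.

    The UTF-8 default is an assumption; pass another encoding if needed.
    """
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(file_path, mode="r", encoding=encoding, newline="") as file:
        for row in csv.DictReader(file):
            yield row

# Tiny stand-in file so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", encoding="utf-8", newline="") as file:
        file.write("video_id,title,view_count\n")
        file.write('a1,"first, video",10\n')
        file.write("b2,second video,20\n")

    # Consume the generator while the temporary file still exists
    sample_rows = list(iter_rows(sample_path))

print(len(sample_rows), sample_rows[0]["title"])
```

Because each row is a dict keyed by column name, later code can read `row["title"]` directly instead of looking up a column index first.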

## 2.2 View Dataset

KR_category holds a lot of information in JSON format. We will extract the 'id' and 'title' fields from every {country}\_category\_id.json file and save them in the category_id_title variable.

In [3]:
# Get all file names in .json 

def get_filename_json(path) : 
    json_files = []
    filenames = os.listdir(path) # Use the path argument instead of a hard-coded folder
    for filename in filenames : 
        # splitext also handles names with no dot or with several dots
        extension = os.path.splitext(filename)[1]
        # Check extensions of file : append .json extension
        if extension == '.json' : 
            json_files.append(filename) 
    return json_files 

def extract_country(path) :
    filenames = get_filename_json(path)
    country_list = []
    for filename in filenames :
        country = filename.split('_')[0]
        country_list.append(country)
    return country_list 

countries = extract_country('Dataset/')

In [4]:
countries

['KR', 'US', 'JP', 'RU', 'IN', 'CA', 'DE', 'MX', 'GB', 'BR', 'FR']
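
As an alternative to the two helper functions above, the country codes can be collected in one step with `pathlib`. This is a minimal sketch: `list_countries` is a hypothetical name, and the temporary folder with empty files only stands in for `Dataset/`. Unlike `os.listdir`, the result is sorted, so the order differs from the output shown above.

```python
import tempfile
from pathlib import Path

def list_countries(path):
    """Return sorted country codes found in '<code>_category_id.json' names."""
    suffix = "_category_id.json"
    return sorted(
        file.name[: -len(suffix)]          # strip the fixed suffix, keep the code
        for file in Path(path).glob("*" + suffix)
    )

# Stand-in folder: two category files plus files that must be ignored
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("KR_category_id.json", "US_category_id.json",
                 "KR_youtube_trending_data.csv", "notes"):
        (Path(tmp_dir) / name).touch()
    found = list_countries(tmp_dir)

print(found)
```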

We will extract every categoryId-to-title pair from each {country}\_category\_id.json file.

In [5]:
# Extract Information of id : title 

def make_category_id_title(countries) : 
    id_title_dicts = {} 
    for country in countries : 
        file = import_dataset('Dataset/', country, json_file = True)
        for item in file['items'] : 
            id_title_dicts[item['id']] = item['snippet']['title']
    id_title_dicts = sorted(id_title_dicts.items(), key = lambda x: int(x[0]))
    id_title_dicts = {row[0] : row[1] for row in id_title_dicts} # Return sorted dictionary 
    return id_title_dicts

category_id_title = make_category_id_title(countries)

In [6]:
for k, v in category_id_title.items() : 
    print(f'categoryId : {k}, title : {v}')

categoryId : 1, title : Film & Animation
categoryId : 2, title : Autos & Vehicles
categoryId : 10, title : Music
categoryId : 15, title : Pets & Animals
categoryId : 17, title : Sports
categoryId : 18, title : Short Movies
categoryId : 19, title : Travel & Events
categoryId : 20, title : Gaming
categoryId : 21, title : Videoblogging
categoryId : 22, title : People & Blogs
categoryId : 23, title : Comedy
categoryId : 24, title : Entertainment
categoryId : 25, title : News & Politics
categoryId : 26, title : Howto & Style
categoryId : 27, title : Education
categoryId : 28, title : Science & Technology
categoryId : 29, title : Nonprofits & Activism
categoryId : 30, title : Movies
categoryId : 31, title : Anime/Animation
categoryId : 32, title : Action/Adventure
categoryId : 33, title : Classics
categoryId : 34, title : Comedy
categoryId : 35, title : Documentary
categoryId : 36, title : Drama
categoryId : 37, title : Family
categoryId : 38, title : Foreign
categoryId : 39, title : Horror
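
The sketch below shows the JSON shape that `make_category_id_title` relies on (`items`, `id`, `snippet`, `title`). The sample is a hypothetical, heavily trimmed stand-in for one category file, not real data from it.

```python
import json

# Hypothetical, heavily trimmed stand-in for one {country}_category_id.json file
sample_json = """
{"items": [
  {"id": "10", "snippet": {"title": "Music"}},
  {"id": "1",  "snippet": {"title": "Film & Animation"}}
]}
"""

category_file = json.loads(sample_json)

# Map id -> title, then order by the numeric value of the id
id_title = {item["id"]: item["snippet"]["title"] for item in category_file["items"]}
id_title = dict(sorted(id_title.items(), key=lambda pair: int(pair[0])))

print(id_title)
```

Note that the ids are strings, which is why the sort key converts them with `int`. Sorting them as plain strings would place "10" before "2".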


In [7]:
# View columns and rows in KR_header, KR_youtube 

print(KR_header, '\n')
for row in KR_youtube[:3] : 
    print(row)

['video_id', 'title', 'publishedAt', 'channelId', 'channelTitle', 'categoryId', 'trending_date', 'tags', 'view_count', 'likes', 'dislikes', 'comment_count', 'thumbnail_link', 'comments_disabled', 'ratings_disabled', 'description'] 

['uq5LClQN3cE', '안녕하세요 보겸입니다', '2020-08-09T09:32:48Z', 'UCu9BCtGIEr73LXZsKmoujKw', '보겸 BK', '24', '2020-08-12T00:00:00Z', '보겸|bokyem', '5947503', '53326', '105756', '139946', 'https://i.ytimg.com/vi/uq5LClQN3cE/default.jpg', 'False', 'False', '']
['I-ZbZCHsHD0', '부락토스의 계획 [총몇명 프리퀄]', '2020-08-12T09:00:08Z', 'UCRuSxVu4iqTK5kCh90ntAgA', '총몇명', '1', '2020-08-12T00:00:00Z', '총몇명|재밌는 만화|부락토스|루시퍼|총몇명 프리퀄|총몇명 스토리', '963384', '28244', '494', '3339', 'https://i.ytimg.com/vi/I-ZbZCHsHD0/default.jpg', 'False', 'False', '오늘도 정말 감사드립니다!!총몇명 스튜디오 - 총몇명, 십제곱, 5G민, MOVE혁, 찐스톤, Jin, 정영준♬ BGMPrivate Reflection by Kevin MacLeodLink: https://filmmusic.io/song/4241-private-reflectionLicense: http://creativecommons.org/licenses/by/4.0/Cinema Blockbuster Trailer 3 by Sascha EndeL

The column names are 'video_id', 'title', 'publishedAt', 'channelId', 'channelTitle', 'categoryId', 'trending_date', 'tags', 'view_count', 'likes', 'dislikes', 'comment_count', 'thumbnail_link', 'comments_disabled', 'ratings_disabled', 'description'. Each row of KR_youtube stores the corresponding values as raw strings, so the data is not yet tidy.

In [8]:
# Start and End of trending dates 

unique_trending_dates = [] 
for video in KR_youtube : 
    trending_date_index = KR_header.index('trending_date')
    date = video[trending_date_index]
    if date not in unique_trending_dates : 
        unique_trending_dates.append(date)

In [9]:
print(f"Starting date : {unique_trending_dates[0].split('T')[0]}")
print(f"Ending date : {unique_trending_dates[-1].split('T')[0]}")

Starting date : 2020-08-12
Ending date : 2022-06-06


The KR_youtube trending dataset starts on 2020-08-12 and ends on 2022-06-06.
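
The loop above takes the first and last unique values, so it depends on the rows being stored in date order. A minimal sketch that does not depend on row order is shown below; the sample list is illustrative and only mimics the format of the `trending_date` column.

```python
# Sample values in the same format as the trending_date column
trending_dates = [
    "2020-08-13T00:00:00Z",
    "2020-08-12T00:00:00Z",
    "2022-06-06T00:00:00Z",
    "2020-08-12T00:00:00Z",
]

# A set removes duplicates in one pass
unique_dates = set(trending_dates)

# ISO 8601 strings with a fixed layout sort in chronological order
first_day = min(unique_dates).split("T")[0]
last_day = max(unique_dates).split("T")[0]

print(first_day, last_day, len(unique_dates))
```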

## 2.3 Explore Dataset

Now we are ready to explore the KR YouTube trending dataset. In this project we will answer the following questions.

1. Which video trended the longest? (Find its title and number of trending days)
2. Which category trends the longest? (Compare mean trending days)
3. Which trending video is the most loved? (Find the top 30 loved trending videos)
4. Which trending video is the most hated? (Find the top 30 hated trending videos)
5. Which category trends most often? (Find each category and its total count)
6. Which category has the most views?
7. Which category takes the least time to start trending? (Compare mean time)

**What is the longest trending video?**

In [12]:
def longest_trending_video() : 
    video_trending_dates = {} # Store title and its trending dates in dictionary
    video_channel = {} # Store channel name and its title in dictionary

    for video in KR_youtube : 
        title_index = KR_header.index('title')
        channelTitle_index = KR_header.index('channelTitle')
        title = video[title_index]
        channelTitle = video[channelTitle_index]

        if title not in video_channel : 
            video_channel[title] = channelTitle

        if title not in video_trending_dates : 
            video_trending_dates[title] = 1
        else : 
            video_trending_dates[title] += 1
    
    top20_trending_dates_videos = sorted(video_trending_dates.items(), key = lambda x: x[1], reverse = True)[:20]
    return top20_trending_dates_videos

In [17]:
top20_trending_videos = longest_trending_video()
for title, days in top20_trending_videos : 
    print(f"{title[:80]}\t{days:>5}days")

Go Back (고백)	   42days
죄송합니다	   29days
안녕하세요 보겸입니다	   25days
컵라면 먹을 때 절대 일어날리 없는 상황들 ㅋㅋㅋ	   24days
대한민국 VS 투르크메니스탄 : FIFA 카타르 월드컵 2차 예선 하이라이트 - 2021.06.05	   24days
2주더놀릴예정	   23days
대박... 멧돼지를 아들로 삼은 할아버지｜KBS 주주클럽 050410 방송	   23days
극과 극을 살고 있는 인도의 갑부와 빈곤층들!	   23days
...	   22days
밤낮으로 우리집 비번 눌러대는 길고양이 좀 어떻게 해주세요ㅣStray Cat Enters A Door Lock Code Every Day	   22days
#shorts ｜김계란님 들고 더블폴	   22days
[진돗개 진솔쓰] 차박 일기. #Shorts	   22days
[도장TV 8회] 수박먹는 하영이^^;;	   22days
[짐승친구들] 예비군	   22days
The Feels	   22days
[20/21 UCL 결승] 맨시티 vs 첼시 H/L	   21days
어몽어스 VR좀비2(SCP) 8화 [어몽어스 애니메이션] / [AMONG US ANIMATION]	   21days
[안방1열 직캠4K] 에스파 'Next Level' 풀캠 (aespa Full Cam)│@SBS Inkigayo_2021.05.30.	   21days
황의조가 프랑스 감독 편견을 완전히 깨버렸던 경기	   21days
당신이 절대 '하얀색 신종 컵라면'을 알 수 없는 이유(마지막 소름 주의)	   21days


The longest-trending video stayed on the list for 42 days (more than a month!): 'Go Back (고백)', a song on the channel 'JangBeom June - Topic'. The second-longest stayed for 29 days (almost a month): '죄송합니다' (Sorry) on the channel '\[Nareum_TV\] 나름TV'.
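
Counting by title merges different videos that happen to share a title. Whether that happens in this dataset is not confirmed here, so the rows below are hypothetical. The sketch uses `collections.Counter` and compares counting by title with counting by `video_id`.

```python
from collections import Counter

# Hypothetical rows: (video_id, title), one row per trending day
rows = [
    ("id_a", "죄송합니다"),
    ("id_a", "죄송합니다"),
    ("id_b", "죄송합니다"),   # a different video that reuses the same title
    ("id_c", "Go Back (고백)"),
    ("id_c", "Go Back (고백)"),
    ("id_c", "Go Back (고백)"),
    ("id_c", "Go Back (고백)"),
]

# Same counting logic, two different keys
days_by_title = Counter(title for _, title in rows)
days_by_video = Counter(video_id for video_id, _ in rows)

print(days_by_title.most_common(2))
print(days_by_video.most_common(2))
```

In this toy data the shared title is credited with three days, while its most frequent video trended for only two.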

**What is the longest trending category?**

In [18]:
def longest_trending_category() : 
    video_category = {} # Store video : category 
    video_trending_dates = {} # Store title and its trending dates in dictionary

    for video in KR_youtube :
        categoryId_index = KR_header.index('categoryId')
        title_index = KR_header.index('title')
        categoryId = video[categoryId_index]
        title = video[title_index]

        if title not in video_category : 
            video_category[title] = category_id_title[categoryId]
        if title not in video_trending_dates : 
            video_trending_dates[title] = 1
        else : 
            video_trending_dates[title] += 1

    category_trending_dates = {} # Store total sum of trending dates by category 
    category_counts = {} # Store total counts of videos by category 

    for title, dates in video_trending_dates.items() : 
        category = video_category[title]
        if category not in category_trending_dates : 
            category_trending_dates[category] = dates
            category_counts[category] = 1
        else : 
            category_trending_dates[category] += dates 
            category_counts[category] += 1

    category_mean_trending_dates = {} # Store mean trending dates by category 
    for category in category_trending_dates : 
        category_mean_trending_dates[category] = round(category_trending_dates[category] / category_counts[category],3)
    category_mean_trending_dates = sorted(category_mean_trending_dates.items(), key = lambda x : x[1], reverse = True )
    return category_mean_trending_dates

In [19]:
category_trending_dates = longest_trending_category()
for category, dates in category_trending_dates : 
    print(f"{category:25}  {dates:>5} days")

Howto & Style              8.589 days
Travel & Events            8.353 days
Comedy                     8.309 days
Pets & Animals             8.294 days
People & Blogs             8.019 days
Music                      7.775 days
Film & Animation            7.73 days
Gaming                     7.696 days
Autos & Vehicles            7.66 days
Education                  7.562 days
Science & Technology       7.505 days
Entertainment               7.38 days
Sports                     7.058 days
Nonprofits & Activism      6.844 days
News & Politics             5.86 days


Videos in Howto & Style, Travel & Events, Comedy, Pets & Animals, and People & Blogs stay on the trending list longer than those in other categories (about 8.0 to 8.6 days on average). Nonprofits & Activism and News & Politics have the shortest runs (6.844 and 5.86 days), possibly because such videos focus on reporting short-lived facts about current events.

**What is the most loved trending video?**

In [20]:
def most_loved_trending_video() : 
    video_love_ratio = {} 
    for video in KR_youtube : 
        title_index = KR_header.index('title')
        likes_index = KR_header.index('likes')
        dislikes_index = KR_header.index('dislikes') 
        title = video[title_index]
        likes = video[likes_index]
        dislikes = video[dislikes_index]
        if int(dislikes) == 0 : 
            ratio = int(likes)
        else : 
            ratio = round(int(likes) / int(dislikes), 3)

        if title not in video_love_ratio : 
            video_love_ratio[title] = ratio
        elif ratio > video_love_ratio[title] : # Keep the highest ratio seen for this title
            video_love_ratio[title] = ratio 
    video_love_ratio = {video[0] : video[1] for video in sorted(video_love_ratio.items(), key = lambda x: x[1], reverse = True)[:30]}
    return video_love_ratio

In [21]:
top30_video_love_ratio = most_loved_trending_video() 
for title, ratio in top30_video_love_ratio.items() : 
    print(f"{title:90} \t{ratio}")

PSY - 'That That (prod. & feat. SUGA of BTS)' MV                                           	7258876
[CHOREOGRAPHY] Jin of BTS ‘슈퍼 참치’ Special Performance Video                                	4099435
BIGBANG - '봄여름가을겨울 (Still Life)' M/V                                                       	2886743
[CHOREOGRAPHY] BTS (방탄소년단) ‘Butter (Holiday Remix)’ Dance Practice                         	2720241
Stray Kids MANIAC M/V                                                                      	2462876
TXT (투모로우바이투게더) 'Good Boy Gone Bad' Official MV                                            	2097910
BTS (방탄소년단) ‘Proof’ Logo Trailer                                                           	1988498
Kendrick Lamar - The Heart Part 5                                                          	1780803
Red Velvet 레드벨벳 'Feel My Rhythm' MV                                                        	1772024
TREASURE - '직진 (JIKJIN)' M/V                                                               	1705442


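All of the top 30 loved videos are music videos, and videos by BTS appear most often. Note that the top values are whole numbers: when a row has zero dislikes, the function uses the raw like count as the ratio, so such rows dominate the ranking.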
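
One way to keep rows with zero dislikes from producing huge ratios is to use the share of likes among all votes, which always lies between 0 and 1. The sketch below is an alternative measure, not the method used above. The `min_votes` threshold is a hypothetical parameter chosen only for illustration.

```python
def like_share(likes, dislikes, min_votes=100):
    """Return likes / (likes + dislikes), or None when there are too few votes.

    min_votes is a hypothetical threshold chosen only for illustration.
    """
    total_votes = likes + dislikes
    if total_votes < min_votes:
        return None
    return round(likes / total_votes, 3)

first_row_share = like_share(53326, 105756)   # likes and dislikes of the first KR row
no_dislike_share = like_share(7258876, 0)     # a row that records zero dislikes
too_few_votes = like_share(3, 0)              # below the vote threshold

print(first_row_share, no_dislike_share, too_few_votes)
```

Rows with zero dislikes still tie at the maximum share, so a secondary sort key such as the like count would be needed to rank them.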

**What is the most hated trending video?**

In [22]:
def most_hated_trending_video() : 
    video_hate_ratio = {} 
    for video in KR_youtube : 
        title_index = KR_header.index('title')
        likes_index = KR_header.index('likes')
        dislikes_index = KR_header.index('dislikes') 
        title = video[title_index]
        likes = video[likes_index]
        dislikes = video[dislikes_index]
        if int(likes) == 0 : 
            ratio = int(dislikes)
        else : 
            ratio = round(int(dislikes) / int(likes), 3)

        if title not in video_hate_ratio : 
            video_hate_ratio[title] = ratio
        elif ratio > video_hate_ratio[title] : # Keep the highest ratio seen for this title
            video_hate_ratio[title] = ratio 
    video_hate_ratio = {video[0] : video[1] for video in sorted(video_hate_ratio.items(), key = lambda x: x[1], reverse = True)[:30]}
    return video_hate_ratio

In [23]:
top30_video_hate_ratio = most_hated_trending_video() 
for title, ratio in top30_video_hate_ratio.items() : 
    print(f"{title:80} \t{ratio:>10}")

죄송합니다.                                                                           	    13.083
죄송합니다                                                                            	     11.14
'전기'한테 한마디 하는 육지담,머니게임 리뷰 [육지담]                                                  	    10.368
보겸 충격의 뒷광고 증거!! 믿었던 보겸마저...                                                      	     6.602
샌드박스네트워크입니다.                                                                     	     6.069
고개 숙여 사과드립니다. 그리고 해명하고 싶습니다.                                                     	      5.88
간장게장 사장님을 만나고왔습니다.                                                               	     5.851
미슐랭 받은 끝판왕 라면 맛집 TOP3를 소개합니다!!!!                                                 	     5.679
드릴말씀있습니다.                                                                        	      4.58
평생 반성하면서 살겠습니다.                                                                  	     4.474
오가나 입니다.                                                              

Most of the top 30 hated videos are apology or clarification videos, and almost 90% of them come from individual YouTubers.

**Which category is trending most?**

In [24]:
def counts_of_trending_category() : 
    category_counts = {} # Total counts of category of whole days
    for video in KR_youtube : 
        categoryId_index = KR_header.index('categoryId')
        categoryId = video[categoryId_index]
        category = category_id_title[categoryId]

        if category not in category_counts : 
            category_counts[category] = 1
        else : 
            category_counts[category] += 1
    category_counts = {category[0] : category[1] for category in sorted(category_counts.items(), key = lambda x: x[1], reverse = True)}
    return category_counts

In [25]:
trending_category = counts_of_trending_category()
for category, counts in trending_category.items() :
    print(f"{category:25} {counts:>5}")

Entertainment             45459
People & Blogs            20247
Music                     13915
Sports                    10238
Comedy                     8189
News & Politics            6650
Howto & Style              5706
Gaming                     4796
Film & Animation           4280
Pets & Animals             2679
Education                  2618
Science & Technology       2214
Travel & Events            1713
Autos & Vehicles           1631
Nonprofits & Activism       219


Entertainment and People & Blogs account for most trending entries. In my view, videos in these categories are easier to produce than those in categories such as Science & Technology, which requires specialized knowledge, so they appear more often.

**Which category has the most view counts?**

In [26]:
def viewcounts_category() : 
    video_category = {} # Store video : category 
    video_max_view_counts = {} # Store title and its max view counts 

    for video in KR_youtube :
        categoryId_index = KR_header.index('categoryId')
        title_index = KR_header.index('title')
        view_counts_index = KR_header.index('view_count')
        categoryId = video[categoryId_index]
        title = video[title_index]
        view_counts = float(video[view_counts_index])

        if title not in video_category : 
            video_category[title] = category_id_title[categoryId]
        
        if title not in video_max_view_counts : 
            video_max_view_counts[title] = view_counts
        elif view_counts > video_max_view_counts[title] : # Keep the highest view count seen for this title
            video_max_view_counts[title] = view_counts 
            
    category_view_counts = {} # Store total view of videos by category 
    category_counts = {} # Store total counts of videos by category 

    for title, counts in video_max_view_counts.items() : 
        category = video_category[title]
        if category not in category_view_counts : 
            category_view_counts[category] = counts
            category_counts[category] = 1
        else : 
            category_view_counts[category] += counts  
            category_counts[category] += 1

    category_mean_view_counts = {} # Store mean view counts by category 
    for category in category_view_counts : 
        category_mean_view_counts[category] = round(category_view_counts[category] / category_counts[category], 3)
    category_mean_view_counts = sorted(category_mean_view_counts.items(), key = lambda x : x[1], reverse = True)
    return category_mean_view_counts

In [27]:
category_view_counts = viewcounts_category()
for category, counts in category_view_counts : 
    print(f"{category:25}  {counts:>5}")

Music                      6176700.25
Science & Technology       1689400.966
Gaming                     1442615.321
Entertainment              1282082.794
Film & Animation           1223461.641
Comedy                     1182256.492
Sports                     1094999.142
People & Blogs             997238.544
Travel & Events            986814.56
News & Politics            959725.411
Nonprofits & Activism      952001.906
Pets & Animals             892252.111
Autos & Vehicles           883427.325
Howto & Style              842406.798
Education                  835365.409


Music has the highest mean view count by a wide margin, followed by Science & Technology. The ranking by view count differs from the ranking by trending duration; we will return to this in the conclusion.

**Which category takes the least time to start trending?**

In [28]:
from datetime import datetime 
from statistics import median, mean

def reaction_time_category() : 
    video_category = {} # Store video : category 
    video_min_times = {} # Store title and its min times between publishedAt and trending_date

    for video in KR_youtube :
        categoryId_index = KR_header.index('categoryId')
        title_index = KR_header.index('title')
        publishedAt_index = KR_header.index('publishedAt')
        trending_date_index = KR_header.index('trending_date')
        categoryId = video[categoryId_index]
        title = video[title_index]
        publishedAt = datetime.strptime(video[publishedAt_index], '%Y-%m-%dT%H:%M:%SZ')
        trending_date = datetime.strptime(video[trending_date_index], '%Y-%m-%dT%H:%M:%SZ')
        reaction_time = trending_date - publishedAt 
        
        if reaction_time.days < 0 : 
            reaction_time = 0
        else : 
            reaction_time = reaction_time.total_seconds()//3600
        
        if title not in video_category : 
            video_category[title] = category_id_title[categoryId]
        
        if title not in video_min_times : 
            video_min_times[title] = reaction_time
        elif reaction_time < video_min_times[title] :  # Keep the shortest time seen for this title
            video_min_times[title] = reaction_time 

    category_total_reaction_times = {} # Store total hours of videos by category 

    for title, hours in video_min_times.items() : 
        category = video_category[title]
        if category not in category_total_reaction_times : 
            category_total_reaction_times[category] = [hours]
        else : 
            category_total_reaction_times[category].append(hours)

    category_mean_reaction_times = {} # Store [min, median, mean, max] hours by category 
    for category in category_total_reaction_times : 
        category_mean_reaction_times[category] = [min(category_total_reaction_times[category]), 
                                                  int(median(category_total_reaction_times[category])),
                                                  int(mean(category_total_reaction_times[category])),
                                                  max(category_total_reaction_times[category])]
    return category_mean_reaction_times

In [29]:
category_reaction_hours = reaction_time_category()
for category in category_reaction_hours : 
    print(f"{category:25}  {category_reaction_hours[category]}")

Entertainment              [0, 14, 28, 426.0]
Film & Animation           [0, 14, 20, 187.0]
People & Blogs             [0, 14, 25, 321.0]
Music                      [0, 14, 25, 517.0]
Comedy                     [0, 14, 23, 253.0]
Education                  [0, 16, 24, 182.0]
News & Politics            [0, 13, 18, 207.0]
Sports                     [0.0, 14, 20, 208.0]
Nonprofits & Activism      [0, 15, 29, 236.0]
Gaming                     [0, 15, 22, 226.0]
Travel & Events            [0, 15, 23, 230.0]
Pets & Animals             [0, 14, 20, 327.0]
Science & Technology       [0, 14, 21, 206.0]
Howto & Style              [0, 18, 27, 215.0]
Autos & Vehicles           [0, 15, 25, 183.0]


Each list shows the minimum, median, mean, and maximum number of hours between publication and trending. Reaction times do not differ much among categories. The category with the fastest reaction is News & Politics.
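
The reaction-time calculation can be isolated in a small helper. In the sketch below, `hours_to_trend` is a hypothetical name, the first two timestamp pairs come from the two KR rows printed earlier, and the third pair is made up. Returning an `int` in every case also avoids the mix of `0` and `0.0` seen in the output above.

```python
from datetime import datetime
from statistics import mean, median

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def hours_to_trend(published_at, trending_date):
    """Whole hours between publication and the trending timestamp, floored at 0."""
    published = datetime.strptime(published_at, TIME_FORMAT)
    trending = datetime.strptime(trending_date, TIME_FORMAT)
    delta = trending - published
    # A video published later on its trending day gives a negative delta
    return max(0, int(delta.total_seconds() // 3600))

pairs = [
    ("2020-08-09T09:32:48Z", "2020-08-12T00:00:00Z"),  # first KR row
    ("2020-08-12T09:00:08Z", "2020-08-12T00:00:00Z"),  # second KR row
    ("2020-08-11T10:00:00Z", "2020-08-12T00:00:00Z"),  # made-up example
]
hours = [hours_to_trend(published, trending) for published, trending in pairs]

print(hours, median(hours), round(mean(hours), 2))
```

In the rows shown, `trending_date` is always at midnight, so these hour values are only as precise as the day on which the video was recorded as trending.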

## 2.4 EDA Conclusion

The table below summarizes the category results from the exploration above. Each column lists the categories in ranked order for that measure.

|Mean trending duration|Total trending count|Mean view count|Least time to start trending|
|:---:|:---:|:---:|:---:|
|Howto & Style|Entertainment|Music|News & Politics| 
|Travel & Events|People & Blogs|Science & Technology|Sports|
|Comedy|Music|Gaming|Pets & Animals|     
|Pets & Animals|Sports|Entertainment|Science & Technology|
|People & Blogs|Comedy|Film & Animation|Autos & Vehicles|  
|Music|News & Politics|Comedy|Entertainment|          
|Film & Animation|Howto & Style|Sports|Film & Animation| 
|Gaming|Gaming|People & Blogs|Music| 
|Autos & Vehicles|Film & Animation|Travel & Events |Education|         
|Education|Pets & Animals|News & Politics|Nonprofits & Activism| 
|Science & Technology|Education|Nonprofits & Activism|Gaming|
|Entertainment|Science & Technology|Pets & Animals|People & Blogs|            
|Sports|Travel & Events|Autos & Vehicles|Comedy|  
|Nonprofits & Activism|Autos & Vehicles|Howto & Style|Travel & Events|  
|News & Politics|Nonprofits & Activism|Education|Howto & Style| 

We can see that the category rankings differ considerably across the four measures (mean trending duration, total trending count, mean view count, and time to start trending). People react quickly to News & Politics and Sports, but these categories also drop off the trending list quickly, as their mean trending durations show. Entertainment has the most trending entries, yet its mean trending duration is also short.

Using these results, we can shape a content strategy when starting a YouTube channel.

# 3. Analyze using Pandas

**Pandas** is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is mainly used for data analysis and associated manipulation of tabular data in DataFrames. Pandas allows importing data from various file formats such as CSV, JSON, SQL, and Excel.

This time, we will repeat what we did in '2. Analyze using Python' with the Pandas module, and then present the results with the Plotly module. The questions we will explore are listed below.

1. Which video trended the longest? (Find its title and number of trending days)
2. Which trending video is the most loved? (Find the top 10 loved trending videos)
3. Which trending video is the most hated? (Find the top 10 hated trending videos)
4. Which category is the most popular? (Find each category and its total count)
5. Which category has the most views?
6. Which video takes the least time to start trending?
7. Which category takes the least time to start trending? (Compare mean time)

## 3.1 Importing Dataset

In [30]:
import pandas as pd 

KR_youtube_df = pd.read_csv('Dataset/KR_youtube_trending_data.csv')
KR_youtube_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,uq5LClQN3cE,안녕하세요 보겸입니다,2020-08-09T09:32:48Z,UCu9BCtGIEr73LXZsKmoujKw,보겸 BK,24,2020-08-12T00:00:00Z,보겸|bokyem,5947503,53326,105756,139946,https://i.ytimg.com/vi/uq5LClQN3cE/default.jpg,False,False,


## 3.2 View Dataset

In [31]:
# Basic information about dataset 
print(f"Number of rows : {KR_youtube_df.shape[0]}")
print(f"Number of columns : {KR_youtube_df.shape[1]}")

Number of rows : 130554
Number of columns : 16


In [32]:
# Check dataset using info() method 
print(KR_youtube_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130554 entries, 0 to 130553
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   video_id           130554 non-null  object
 1   title              130554 non-null  object
 2   publishedAt        130554 non-null  object
 3   channelId          130554 non-null  object
 4   channelTitle       130554 non-null  object
 5   categoryId         130554 non-null  int64 
 6   trending_date      130554 non-null  object
 7   tags               130554 non-null  object
 8   view_count         130554 non-null  int64 
 9   likes              130554 non-null  int64 
 10  dislikes           130554 non-null  int64 
 11  comment_count      130554 non-null  int64 
 12  thumbnail_link     130554 non-null  object
 13  comments_disabled  130554 non-null  bool  
 14  ratings_disabled   130554 non-null  bool  
 15  description        127110 non-null  object
dtypes: bool(2), int64(5), object(9)

In [33]:
# Merge category_id_title on KR_youtube

category = pd.DataFrame({'categoryId' : category_id_title.keys(), 'category_title' : category_id_title.values()})
category = category.astype({'categoryId' : 'int64'})
KR_youtube_df = pd.merge(KR_youtube_df, category, on = 'categoryId', how = 'left')

In [34]:
KR_youtube_df.head(1)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description,category_title
0,uq5LClQN3cE,안녕하세요 보겸입니다,2020-08-09T09:32:48Z,UCu9BCtGIEr73LXZsKmoujKw,보겸 BK,24,2020-08-12T00:00:00Z,보겸|bokyem,5947503,53326,105756,139946,https://i.ytimg.com/vi/uq5LClQN3cE/default.jpg,False,False,,Entertainment
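
The merge step above can be sketched on a toy example. This is only an illustration: the two-entry `category_id_title` dict and the `videos` frame below are made-up stand-ins, and the dict keys are assumed to be strings (which would explain the `astype` cast in the notebook). The `validate` argument is an optional extra, not part of the original cell.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's category_id_title dict;
# the keys are assumed to be strings, hence the cast to int64 below.
category_id_title = {"10": "Music", "24": "Entertainment"}

category = pd.DataFrame({
    "categoryId": list(category_id_title.keys()),
    "category_title": list(category_id_title.values()),
}).astype({"categoryId": "int64"})

# Toy rows, not taken from the real dataset.
videos = pd.DataFrame({
    "title": ["a", "b", "c"],
    "categoryId": [24, 10, 99],
})

# validate="many_to_one" raises an error if a categoryId appears more
# than once in `category`, which would otherwise duplicate video rows.
merged = pd.merge(videos, category, on="categoryId", how="left",
                  validate="many_to_one")

# With how="left", rows whose categoryId has no title are kept with NaN.
unmatched = merged[merged["category_title"].isna()]
print(merged)
print(f"Rows without a category title: {len(unmatched)}")
```

Checking the number of unmatched rows after a left merge is a cheap way to spot category IDs that are missing from the lookup table.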


## 3.3 Explore Dataset 

**What is the longest trending video?**

In [35]:
top20_trending_video = KR_youtube_df.groupby('title', as_index = False).size().sort_values(by = 'size', ascending = False).reset_index(drop = True).head(20)
top20_trending_video.columns = ['Title', 'Days']
display(top20_trending_video)

Unnamed: 0,Title,Days
0,Go Back (고백),42
1,죄송합니다,29
2,안녕하세요 보겸입니다,25
3,컵라면 먹을 때 절대 일어날리 없는 상황들 ㅋㅋㅋ,24
4,대한민국 VS 투르크메니스탄 : FIFA 카타르 월드컵 2차 예선 하이라이트 - 2...,24
5,대박... 멧돼지를 아들로 삼은 할아버지｜KBS 주주클럽 050410 방송,23
6,극과 극을 살고 있는 인도의 갑부와 빈곤층들!,23
7,2주더놀릴예정,23
8,[도장TV 8회] 수박먹는 하영이^^;;,22
9,[진돗개 진솔쓰] 차박 일기. #Shorts,22


**What is the longest trending category?**

In [36]:
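# Count trending days per video, then average those counts within each category.
# Selecting ['size'] keeps the string 'title' column out of mean(), which newer pandas versions reject.
trending_category = KR_youtube_df.groupby(['category_title', 'title'], as_index = False).size().\
groupby('category_title', as_index = False)['size'].mean().reset_index(drop = True).sort_values(by = 'size', ascending = False)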
display(trending_category)

Unnamed: 0,category_title,size
6,Howto & Style,8.580451
11,Pets & Animals,8.294118
1,Comedy,8.280081
14,Travel & Events,8.275362
10,People & Blogs,8.009098
7,Music,7.773743
4,Film & Animation,7.711712
0,Autos & Vehicles,7.657277
5,Gaming,7.649123
2,Education,7.544669


**What is the most loved trending video?**

In [37]:
def like_ratio(row) : 
    if row['dislikes'] == 0 : 
        return row['likes'] 
    else : 
        return round(row['likes'] / row['dislikes'], 3)
KR_youtube_df['like_ratio'] = KR_youtube_df.apply(like_ratio, axis = 1)
top30_loved_videos = KR_youtube_df.groupby('title')['like_ratio'].max().reset_index().\
sort_values(by = 'like_ratio', ascending = False).head(30).reset_index(drop = True)
display(top30_loved_videos)

Unnamed: 0,title,like_ratio
0,PSY - 'That That (prod. & feat. SUGA of BTS)' MV,7258876.0
1,[CHOREOGRAPHY] Jin of BTS ‘슈퍼 참치’ Special Perf...,4099435.0
2,BIGBANG - '봄여름가을겨울 (Still Life)' M/V,2886743.0
3,[CHOREOGRAPHY] BTS (방탄소년단) ‘Butter (Holiday Re...,2720241.0
4,Stray Kids MANIAC M/V,2462876.0
5,TXT (투모로우바이투게더) 'Good Boy Gone Bad' Official MV,2097910.0
6,BTS (방탄소년단) ‘Proof’ Logo Trailer,1988498.0
7,Kendrick Lamar - The Heart Part 5,1780803.0
8,Red Velvet 레드벨벳 'Feel My Rhythm' MV,1772024.0
9,TREASURE - '직진 (JIKJIN)' M/V,1705442.0


**What is the most hated trending video?**

In [38]:
def hate_ratio(row) : 
    if row['likes'] == 0 : 
        return row['dislikes'] 
    else : 
        return round(row['dislikes'] / row['likes'], 3)
KR_youtube_df['hate_ratio'] = KR_youtube_df.apply(hate_ratio, axis = 1)
top30_hated_videos = KR_youtube_df.groupby('title')['hate_ratio'].max().reset_index().\
sort_values(by = 'hate_ratio', ascending = False).head(30).reset_index(drop = True)
display(top30_hated_videos)

Unnamed: 0,title,hate_ratio
0,죄송합니다.,13.083
1,죄송합니다,11.14
2,"'전기'한테 한마디 하는 육지담,머니게임 리뷰 [육지담]",10.368
3,보겸 충격의 뒷광고 증거!! 믿었던 보겸마저...,6.602
4,샌드박스네트워크입니다.,6.069
5,고개 숙여 사과드립니다. 그리고 해명하고 싶습니다.,5.88
6,간장게장 사장님을 만나고왔습니다.,5.851
7,미슐랭 받은 끝판왕 라면 맛집 TOP3를 소개합니다!!!!,5.679
8,드릴말씀있습니다.,4.58
9,평생 반성하면서 살겠습니다.,4.474


**Which category is trending most?**

In [39]:
category_counts = KR_youtube_df.groupby('category_title', as_index = False).size().sort_values(by = 'size', ascending = False)
display(category_counts)

Unnamed: 0,category_title,size
3,Entertainment,45459
10,People & Blogs,20247
7,Music,13915
13,Sports,10238
1,Comedy,8189
8,News & Politics,6650
6,Howto & Style,5706
5,Gaming,4796
4,Film & Animation,4280
11,Pets & Animals,2679


**Which category has the most views?**

In [40]:
pd.options.display.float_format = '{:.5f}'.format
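# Take each video's peak view count, then average the peaks within each category.
# Selecting ['view_count'] keeps the string 'title' column out of mean().
category_view_counts = KR_youtube_df.groupby(['category_title', 'title'], as_index = False)['view_count'].max()\
.groupby('category_title')['view_count'].mean().sort_values(ascending = False).reset_index()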
display(category_view_counts)

Unnamed: 0,category_title,view_count
0,Music,6173614.90447
1,Science & Technology,1689400.9661
2,Gaming,1437204.33174
3,Entertainment,1282412.37684
4,Film & Animation,1222907.95856
5,Comedy,1181850.67442
6,Sports,1095009.89662
7,People & Blogs,997950.95332
8,Travel & Events,973632.63768
9,News & Politics,961909.93034


**Which category takes the least time to start trending?**

In [41]:
KR_youtube_df['publishedAt'] = pd.to_datetime(KR_youtube_df['publishedAt'])
KR_youtube_df['trending_date'] = pd.to_datetime(KR_youtube_df['trending_date'])
KR_youtube_df['reaction_hours'] = KR_youtube_df['trending_date'] - KR_youtube_df['publishedAt']
KR_youtube_df['reaction_hours'] = KR_youtube_df['reaction_hours'].map(lambda x: 0 if x.total_seconds()//3600 < 0 else x.total_seconds()//3600)
reaction_times = KR_youtube_df.groupby(['category_title', 'title'])['reaction_hours'].min().groupby('category_title').agg(['min', 'median', 'mean', 'max'])
display(reaction_times)

Unnamed: 0_level_0,min,median,mean,max
category_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Autos & Vehicles,0.0,15.0,25.23474,183.0
Comedy,0.0,14.0,23.5723,253.0
Education,0.0,16.0,25.09798,182.0
Entertainment,0.0,14.0,28.33355,426.0
Film & Animation,0.0,14.0,20.35495,187.0
Gaming,0.0,15.0,22.8437,226.0
Howto & Style,0.0,18.0,27.32331,215.0
Music,0.0,14.0,25.77318,517.0
News & Politics,0.0,13.0,18.53527,207.0
Nonprofits & Activism,0.0,15.5,29.34375,236.0


We get the same results with Pandas as we did with pure Python.

# 4. Time Analysis between Python and Pandas 

In this chapter, we compare the execution time of the pure-Python code from '2. Analyze using Python' with the Pandas code from '3. Analyze using Pandas and Plotly'. Using the `time` module, we measure how long each version takes to do the same work.

## 4.1 Time Comparison of Importing Dataset

In [42]:
import time 

# Time of importing dataset using python
start = time.time() 
KR_header, KR_youtube = import_dataset('Dataset/', 'KR', data_information = True)
end = time.time() 
python_time = end - start 

# Time of importing dataset using pandas 
start = time.time() 
KR_youtube_df = pd.read_csv('Dataset/KR_youtube_trending_data.csv')
end = time.time()
pandas_time = end - start 

print(f"Execution time of python : {python_time} vs Execution time of pandas : {pandas_time}")
print(f"Execution ratio of python vs pandas : {python_time/pandas_time}")

Dataset Information : 
1. File path : Dataset/KR_youtube_trending_data.csv
2. File size : 155363335
3. Num of headers : 16
4. Num of rows : 130554
Execution time of python : 552.8068583011627 vs Execution time of pandas : 1.292646884918213
Execution ratio of python vs pandas : 427.65496497996776
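
A single `time.time()` measurement can be noisy, especially for calls that finish in milliseconds. One possible refinement, shown here only as a sketch, is a small helper that uses `time.perf_counter()` and keeps the best of several runs. The name `time_call` and its `repeat` parameter are hypothetical and do not appear in the notebook.

```python
import time

def time_call(func, *args, repeat=3, **kwargs):
    """Run func(*args, **kwargs) `repeat` times.

    Returns (best_seconds, result_of_last_run). time_call is a
    hypothetical helper, not part of the original notebook.
    """
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()   # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Usage with a stand-in workload instead of the real import functions.
elapsed, total = time_call(sum, range(1_000_000))
print(f"sum took {elapsed:.6f}s, result = {total}")
```

In the notebook this could wrap `import_dataset` and `pd.read_csv`, although repeating a loader that takes several minutes may not be worth the wait.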


## 4.2 Time Comparison of Exploring Dataset

In [45]:
import time
python_times = []
pandas_times = [] 

# Merge category_id_title on KR_youtube

category = pd.DataFrame({'categoryId' : category_id_title.keys(), 'category_title' : category_id_title.values()})
category = category.astype({'categoryId' : 'int64'})
KR_youtube_df = pd.merge(KR_youtube_df, category, on = 'categoryId', how = 'left')

# Python 
funct = [longest_trending_video, longest_trending_category, most_loved_trending_video, most_hated_trending_video, 
         counts_of_trending_category, viewcounts_category, reaction_time_category]
for i in range(len(funct)) : 
    start = time.time() 
    funct[i]()
    end = time.time()
    python_times.append(end - start) 
    
# Pandas 
start = time.time() 
q1 = KR_youtube_df.groupby('title', as_index = False).size().sort_values(by = 'size', ascending = False).reset_index(drop = True).head(20)
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
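# Selecting ['size'] keeps the string 'title' column out of mean(), which newer pandas versions reject.
q2 = KR_youtube_df.groupby(['category_title', 'title'], as_index = False).size().\
groupby('category_title', as_index = False)['size'].mean().reset_index(drop = True).sort_values(by = 'size', ascending = False)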
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
def like_ratio(row) : 
    if row['dislikes'] == 0 : 
        return row['likes'] 
    else : 
        return round(row['likes'] / row['dislikes'], 3)
KR_youtube_df['like_ratio'] = KR_youtube_df.apply(like_ratio, axis = 1)
q3 = KR_youtube_df.groupby('title')['like_ratio'].max().reset_index().\
sort_values(by = 'like_ratio', ascending = False).head(30).reset_index(drop = True)
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
def hate_ratio(row) : 
    if row['likes'] == 0 : 
        return row['dislikes'] 
    else : 
        return round(row['dislikes'] / row['likes'], 3)
KR_youtube_df['hate_ratio'] = KR_youtube_df.apply(hate_ratio, axis = 1)
q4 = KR_youtube_df.groupby('title')['hate_ratio'].max().reset_index().\
sort_values(by = 'hate_ratio', ascending = False).head(30).reset_index(drop = True)
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
q5 = KR_youtube_df.groupby('category_title', as_index = False).size().sort_values(by = 'size', ascending = False)
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
# Selecting ['view_count'] keeps the string 'title' column out of mean().
q6 = KR_youtube_df.groupby(['category_title', 'title'], as_index = False)['view_count'].max()\
.groupby('category_title')['view_count'].mean().sort_values(ascending = False).reset_index()
end = time.time() 
pandas_times.append(end - start) 


start = time.time() 
KR_youtube_df['publishedAt'] = pd.to_datetime(KR_youtube_df['publishedAt'])
KR_youtube_df['trending_date'] = pd.to_datetime(KR_youtube_df['trending_date'])
KR_youtube_df['reaction_hours'] = KR_youtube_df['trending_date'] - KR_youtube_df['publishedAt']
KR_youtube_df['reaction_hours'] = KR_youtube_df['reaction_hours'].map(lambda x: 0 if x.total_seconds()//3600 < 0 else x.total_seconds()//3600)
q7 = KR_youtube_df.groupby(['category_title', 'title'])['reaction_hours'].min().groupby('category_title').agg(['min', 'median', 'mean', 'max'])
end = time.time() 
pandas_times.append(end - start) 

In [46]:
for i in range(len(python_times)) : 
    python_time = python_times[i]
    pandas_time = pandas_times[i]
    print(f"=========== Question{i+1} ==========")
    print(f"Execution time of python : {python_time} vs Execution time of pandas : {pandas_time}")
    print(f"Execution ratio of python vs pandas : {python_time/pandas_time}")
    print('\n')

=========== Question1 ==========
Execution time of python : 0.06637978553771973 vs Execution time of pandas : 0.029551029205322266
Execution ratio of python vs pandas : 2.246276604327691


=========== Question2 ==========
Execution time of python : 0.06817150115966797 vs Execution time of pandas : 0.040929555892944336
Execution ratio of python vs pandas : 1.665581257172149


=========== Question3 ==========
Execution time of python : 0.16998028755187988 vs Execution time of pandas : 1.3795523643493652
Execution ratio of python vs pandas : 0.12321408881934486


=========== Question4 ==========
Execution time of python : 0.1865525245666504 vs Execution time of pandas : 1.4438941478729248
Execution ratio of python vs pandas : 0.1292009700582765


=========== Question5 ==========
Execution time of python : 0.04221820831298828 vs Execution time of pandas : 0.007451057434082031
Execution ratio of python vs pandas : 5.66606937156022


=========== Question6 ==========
Execution time of python : 0.10105609893798828 vs Execution time of pandas : 0.04294633865356445
Execution ratio of python vs pandas : 2.3530783323155497


=========== Question7 ==========
Execution time of python : 1.9336814880371094 vs Execution time of pandas 

## 4.3 Final Conclusion 

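When importing the dataset, the pure-Python loader takes about 552.80 s while pandas takes about 1.29 s, so pandas is roughly 427 times faster.

Across the seven questions, pandas is faster than pure Python except for questions 3 and 4, where it is about 8 times slower (ratios of about 0.12 and 0.13). The most likely cause is that those two cells call `DataFrame.apply` with `axis = 1`, which runs a Python function once per row instead of using pandas' compiled column operations. Because the core of pandas is implemented in compiled code (C and Cython, on top of NumPy), pandas is generally more efficient in time and cost for data analysis, provided the work is expressed as column-wise operations.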
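
To illustrate the point about questions 3 and 4, here is a minimal sketch of a column-wise version of the like ratio, compared against the row-wise `apply` version on three made-up rows. It assumes the same rule as the notebook's `like_ratio` (return `likes` when `dislikes` is 0) and should give the same values on typical data. It was not timed on the real dataset.

```python
import pandas as pd

# Toy rows (made up), including the dislikes == 0 edge case.
df = pd.DataFrame({"likes": [100, 50, 7], "dislikes": [4, 0, 3]})

# Row-wise version, same logic as like_ratio in the notebook.
def like_ratio(row):
    if row["dislikes"] == 0:
        return row["likes"]
    return round(row["likes"] / row["dislikes"], 3)

row_wise = df.apply(like_ratio, axis=1)

# Column-wise version: replace a zero denominator with 1, so that
# likes / 1 returns likes itself, then divide whole columns at once.
safe_dislikes = df["dislikes"].where(df["dislikes"] != 0, 1)
vectorized = (df["likes"] / safe_dislikes).round(3)

print(vectorized.tolist())  # [25.0, 50.0, 2.333]
```

The hate ratio follows the same pattern with the two columns swapped.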