**Problem statement**

Video is a fast-growing medium through which people communicate, share knowledge, and showcase skills. YouTube is one of the largest video-hosting platforms, with content spanning many professions, arts, and cultures across the world.

Viewers can express their opinion of a video through likes, dislikes, and comments. These platform features provide information about the sentiment towards the video.

This assignment covers the steps for programmatically extracting data from YouTube so that various attributes of a video can be analyzed.

**Steps performed**

### 1. Connect to the YouTube API using a Python client

#### a) Create a YouTube API key

A new project was created in the Google Cloud Console. The YouTube Data API was enabled for the project, and credentials (an API key) were generated to establish a connection.
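
Because the key grants access to the project's quota, it is usually safer to keep it out of the notebook itself. The following is a minimal sketch, assuming the key is stored in an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are hypothetical and not part of the original notebook.

```python
import os


def load_api_key(var_name="YOUTUBE_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key.")
    return key
```

With this helper, the connection cell could call `key = load_api_key()` instead of assigning the key as a string literal.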

#### b) Install the Google API Python client

The Google API Python client was installed on Windows by running the following command in the Command Prompt:

`pip install --upgrade google-api-python-client`

In [8]:
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

key = '<Provide your API key>'
api_service_name = 'youtube'
api_version = 'v3'

youtube = build(api_service_name, api_version, developerKey=key)

# Issue a small test search; execute() raises HttpError if the API rejects the key or request
request = youtube.search().list(
        part='snippet',
        maxResults=5,
        q='avatar movie'
    )
try:
    response = request.execute()
    print('Connection to YouTube API successful.')
except HttpError as error:
    print('Failed to connect to YouTube API:', error)


Connection to YouTube API successful.


### 2. Search and extract the data

#### a) Search videos related to the query string “avatar movie”

In [9]:
# Restrict results to videos so that the first item always carries a videoId
request = youtube.search().list(
        part='snippet',
        maxResults=1,
        q='avatar movie',
        type='video'
    )
response = request.execute()
video = response['items'][0]

#ID
videoId = video['id']['videoId']
print('ID: ',videoId)

#Channel ID
channelId = video['snippet']['channelId']
print("--Snippet--")
print('Channel ID: ',channelId)

#Video Description
description = video['snippet']['description']
print('Video Description: ',description)

# Channel Title
channelTitle = video['snippet']['channelTitle']
print('Channel Title: ',channelTitle)

#Video Title
title = video['snippet']['title']
print('Video Title: ',title)

ID:  RNj2cH5yjNA
--Snippet--
Channel ID:  UC0A86RKLCqTEUna3hPlEpzg
Video Description:  AVATAR Full Movie 2024: Pandora World | Superhero FXL Action Movies 2024 in English (Game Movie). Best Action Game ...
Channel Title:  Superhero FXL Games
Video Title:  AVATAR Full Movie 2024: Pandora World | Superhero FXL Action Movies 2024 in English (Game Movie)
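
A search response can mix videos with channels and playlists unless the request is limited with `type='video'`, and only video results carry `id.videoId`. As a minimal sketch, the hypothetical helper `extract_video_ids` below (not part of the original notebook) keeps only the video results. It is demonstrated on a hand-written dictionary shaped like a `search().list` response, so it runs without an API key.

```python
def extract_video_ids(response):
    """Return the videoId of every search result that is actually a video."""
    return [
        item["id"]["videoId"]
        for item in response.get("items", [])
        if item.get("id", {}).get("kind") == "youtube#video"
    ]


# Offline demonstration with a hand-written response in the search().list shape.
sample_response = {
    "items": [
        {"id": {"kind": "youtube#channel", "channelId": "chan1"}},
        {"id": {"kind": "youtube#video", "videoId": "vid1"}},
        {"id": {"kind": "youtube#playlist", "playlistId": "list1"}},
    ]
}
video_ids = extract_video_ids(sample_response)
```

Filtering on `kind` avoids a `KeyError` when the top result happens to be a channel or playlist.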


#### b) Provide the following statistics for query string “avatar movie” of top 50 videos sorted by relevance in the US region

In [10]:
import pandas as pd

# request for “avatar movie” of top 50 videos sorted by relevance in the US region
request = youtube.search().list(
        q='avatar movie',
        part='snippet',
        maxResults=50,
        regionCode='US',
        order='relevance',
        type='video'
    )
response = request.execute()

# Collect the video IDs from the search results
videoID = [item['id']['videoId'] for item in response['items']]

# Fetch the title and statistics of each video; missing fields are recorded as 'NA'
title, no_of_views, no_of_likes, no_of_comments = [], [], [], []
for videoId in videoID:
    request = youtube.videos().list(
        part='snippet,contentDetails,statistics',
        id=videoId
    )
    item = request.execute()['items'][0]
    title.append(item['snippet'].get('title', 'NA'))
    no_of_views.append(item['statistics'].get('viewCount', 'NA'))
    no_of_likes.append(item['statistics'].get('likeCount', 'NA'))
    no_of_comments.append(item['statistics'].get('commentCount', 'NA'))
 

# Export to csv file
df = pd.DataFrame({
    'Video ID': videoID,
    'Title':title,
    'no_of_views': no_of_views,
    'no_of_likes':no_of_likes,
    'no_of_comments':no_of_comments
})
df.to_csv('youtube_video_stats.csv', index=False)
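
The loop above sends one `videos().list` request per video. The `id` parameter of that endpoint also accepts a comma-separated list of IDs, so the same statistics could be gathered in far fewer requests. The sketch below assumes batches of up to 50 IDs per call. The helpers `chunked` and `stats_rows` are hypothetical, and the demonstration uses a hand-written response rather than a live API call.

```python
def chunked(ids, size=50):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def stats_rows(response):
    """Flatten a videos().list response dict into one row per returned video."""
    rows = []
    for item in response.get("items", []):
        stats = item.get("statistics", {})
        rows.append({
            "Video ID": item["id"],
            "Title": item.get("snippet", {}).get("title", "NA"),
            "no_of_views": stats.get("viewCount", "NA"),
            "no_of_likes": stats.get("likeCount", "NA"),
            "no_of_comments": stats.get("commentCount", "NA"),
        })
    return rows


# Offline demonstration with a hand-written response in the videos().list shape.
sample_response = {
    "items": [
        {"id": "vid1", "snippet": {"title": "First"},
         "statistics": {"viewCount": "100", "likeCount": "7", "commentCount": "3"}},
        {"id": "vid2", "snippet": {"title": "Second"},
         "statistics": {"viewCount": "50"}},  # likes and comments not reported
    ]
}
rows = stats_rows(sample_response)
```

With the real client, each chunk would be joined with `','.join(chunk)` and passed as the `id` argument, and the resulting rows could be handed directly to `pd.DataFrame`. Taking the video ID from each returned item also keeps the rows aligned if a video is missing from the response.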

### 3. Analyze the exported data obtained in 2.b and carry out the following tasks

#### a) Sort the data from 2.b by number of comments in descending order and take the video IDs and titles of the 10 videos with the most comments.

In [11]:
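# Counts arrive as strings, or 'NA' when no comment count is reported; coerce to nullable integers
df['no_of_comments'] = pd.to_numeric(df['no_of_comments'], errors='coerce').astype('Int64')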
sorted_df = df.sort_values(by='no_of_comments',ascending=False)
top_10_video_ids_and_titles = sorted_df.head(10)[['Video ID', 'Title','no_of_comments']]
top_10_video_ids_and_titles

| | Video ID | Title | no_of_comments |
|---|---|---|---|
| 2 | d9MyW72ELq0 | Avatar: The Way of Water \| Official Trailer | 42699 |
| 20 | a8Gx8wiNbs8 | Avatar: The Way of Water \| Official Teaser Tra... | 28934 |
| 28 | bDHD1ueL4a4 | The Weeknd - Nothing Is Lost (You Give Me Stre... | 9731 |
| 21 | RGx8rYbRVR4 | Why People Hate Avatar: A Lesson In Lazy Comme... | 5202 |
| 38 | doDrnJkgf-s | Zoe Saldana Emotional Avatar Scene Behind The ... | 5099 |
| 43 | f5Zx8iPek5I | Avatar 3 Will Introduce The Dark Side🔥 Of Na'v... | 4051 |
| 37 | mRrKdgpZ6kE | AGAINST JAKE..!??😲😲 \| AVATAR 3 #shorts #avatar | 2300 |
| 49 | PLtgIILX7E8 | AVATAR Full Movie 2023: Fallen Kingdom \| Super... | 2151 |
| 41 | X8SVkfbt8cs | Avatar: The Way of Water | 1833 |
| 32 | oFErWcXJLdw | TRAINING TO BE IN THE NEXT AVATAR MOVIE | 1763 |


#### b) Use a suitable method to retrieve comments of those top 10 videos from 3.a.

In [12]:
from pprint import pprint


def retrieve_comments(df):
    """Fetch the first page (up to 10) of comment threads for each video in df."""
    top_10_video_comments = {}
    for video_id in df['Video ID']:
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=10
        )
        response = request.execute()

        # Keep the response etag plus a trimmed record for each top-level comment
        comments = {
            'etag': response['etag'],
            'items': []
        }
        for item in response['items']:
            comment_info = {
                'id': item['id'],
                'kind': item['kind'],
                'snippet': {
                    'textDisplay': item['snippet']['topLevelComment']['snippet']['textDisplay']
                }
            }
            comments['items'].append(comment_info)
        top_10_video_comments[f'{video_id}_comments'] = comments
    return top_10_video_comments

top_video_comments = retrieve_comments(top_10_video_ids_and_titles)

pprint(top_video_comments)

{'PLtgIILX7E8_comments': {'etag': '0lOIoDKPlwnqEUc19lCcRXWvKyI',
                          'items': [{'id': 'UgzQPg-rwH_W7eyhxQx4AaABAg',
                                     'kind': 'youtube#commentThread',
                                     'snippet': {'textDisplay': 'The subtitles '
                                                                'are available '
                                                                'in English, '
                                                                'Hindi, '
                                                                'Portuguese, '
                                                                'French, '
                                                                'Spanish, '
                                                                'German, '
                                                                'Indonesian, '
                                                                'and Thai. '
                

In [13]:
import json

with open('top_10_video_comments.json', 'w') as json_file:
    json.dump(top_video_comments, json_file, indent=4)
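
The `retrieve_comments` function in 3.b requests a single page of 10 comment threads per video. When more comments are needed, the response's `nextPageToken` can be sent back as `pageToken` to fetch the next page. The sketch below illustrates that loop. The helper `collect_comment_texts` and its `fetch_page` callback are hypothetical, and the demonstration uses hand-written pages instead of live API calls.

```python
def collect_comment_texts(fetch_page, max_comments=100):
    """Follow nextPageToken until max_comments texts are collected or pages run out.

    fetch_page(page_token) must return a dict shaped like a
    commentThreads().list response.
    """
    texts = []
    page_token = None
    while len(texts) < max_comments:
        response = fetch_page(page_token)
        for item in response.get("items", []):
            texts.append(item["snippet"]["topLevelComment"]["snippet"]["textDisplay"])
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages
    return texts[:max_comments]


def _thread(text):
    """Build a minimal comment-thread item for the offline demonstration."""
    return {"snippet": {"topLevelComment": {"snippet": {"textDisplay": text}}}}


# Offline stand-in for the API: two hand-written pages linked by a token.
fake_pages = {
    None: {"items": [_thread("a"), _thread("b")], "nextPageToken": "p2"},
    "p2": {"items": [_thread("c")]},
}
texts = collect_comment_texts(lambda token: fake_pages[token])
```

With the real client, `fetch_page` would wrap `youtube.commentThreads().list(...)` for one video and forward the token as `pageToken`. Each extra page consumes additional API quota, which is why the sketch caps the total with `max_comments`.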

#### d) Function to get the likes-to-views ratio of the 10 videos with the most comments obtained in 3.a

In [14]:
import numpy as np

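# Restrict df to the top 10 videos from 3.a; columns present in both frames get _x/_y suffixes
merged_df = pd.merge(df, top_10_video_ids_and_titles, on='Video ID', how='right')
# Missing counts were stored as 'NA'; convert them to NaN so the columns become numeric
merged_df[['no_of_views', 'no_of_likes']] = merged_df[['no_of_views', 'no_of_likes']].replace('NA', np.nan).astype(float)

def ratios(df, col1, col2):
    """Return col1 / col2 per video, with NaN where col2 is zero or infinite."""
    df['likes vs views ratios'] = col1 / col2.replace([0, np.inf], np.nan)
    ratios_df = pd.DataFrame({
        'Video ID': df['Video ID'],
        'likes vs views ratios': df['likes vs views ratios'],
        'no_of_comments': df['no_of_comments_x']
    })
    return ratios_df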

ratios(merged_df,merged_df['no_of_likes'],merged_df['no_of_views'])

| | Video ID | likes vs views ratios | no_of_comments |
|---|---|---|---|
| 0 | d9MyW72ELq0 | 0.017571 | 42699 |
| 1 | a8Gx8wiNbs8 | 0.024084 | 28934 |
| 2 | bDHD1ueL4a4 | 0.02123 | 9731 |
| 3 | RGx8rYbRVR4 | 0.054284 | 5202 |
| 4 | doDrnJkgf-s | 0.077466 | 5099 |
| 5 | f5Zx8iPek5I | 0.045046 | 4051 |
| 6 | mRrKdgpZ6kE | NaN | 2300 |
| 7 | PLtgIILX7E8 | 0.004387 | 2151 |
| 8 | X8SVkfbt8cs | NaN | 1833 |
| 9 | oFErWcXJLdw | 0.017051 | 1763 |
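
Two rows above have no ratio, most likely because the like or view count was recorded as 'NA' for those videos. The following sketch, on made-up toy data rather than the exported results, shows how missing and zero counts both turn into NaN in the ratio instead of raising an error or producing `inf`.

```python
import numpy as np
import pandas as pd

# Toy data (not from the notebook): counts stored as strings, 'NA' when missing.
toy = pd.DataFrame({
    "Video ID": ["a", "b", "c"],
    "no_of_views": ["1000", "0", "200"],
    "no_of_likes": ["50", "3", "NA"],
})

# Coerce count strings to numbers; anything unparseable (such as 'NA') becomes NaN.
views = pd.to_numeric(toy["no_of_views"], errors="coerce")
likes = pd.to_numeric(toy["no_of_likes"], errors="coerce")

# Treat zero views as missing so the division never produces inf.
toy["likes vs views ratios"] = likes / views.replace(0, np.nan)
```

Here video `a` gets a ratio of 50/1000, while `b` (zero views) and `c` (missing likes) are left as NaN.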
