<a href="https://colab.research.google.com/github/PrasanthGubbala/Colab/blob/dev/Course1/Assignment/UoA_505_Assignment_Problem_Statement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Course1 : Foundation of information

**Assignment**: Data extraction and analysis from the social media platform YouTube (30 Marks)

**Problem statement**

Videos are a fast-growing medium through which people communicate, share knowledge, and showcase skills. YouTube is one of the largest video-hosting platforms, with content from many professions, arts, and cultures across the world.

Viewers can express their opinion of a video through likes, dislikes, and comments. These platform features indicate the sentiment toward the video.

This assignment covers programmatic data extraction from YouTube, followed by analysis of the extracted data to understand various attributes of a video.

**Steps to be performed**

1. Connect to the YouTube API using a Python client (5 Marks)



> 1.a Create a YouTube API key (3 marks)





> 1.b Install the Google API Python client (2 marks)

Refer to the [supporting](https://developers.google.com/youtube/v3/getting-started) link on how to create a YouTube API key.

Reference link: https://developers.google.com/youtube/v3/quickstart/python

In [None]:
# <code block>
# The YouTube API key is stored in an env file and loaded from there, so the credential does not need to be written in the notebook
!pip install python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


In [None]:
import os
os.getcwd()

'/content'

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from the env file
load_dotenv(dotenv_path='./env.env')

# Read the API key created for the YouTube Data API.
# The key is assumed to be stored under YOUTUBE_DATA_API_KEY in the env file;
# it is not hardcoded as a fallback value, so it never appears in the source.
YOUTUBE_DATA_API_KEY = os.getenv("YOUTUBE_DATA_API_KEY")

In [None]:
YOUTUBE_DATA_API_KEY

'AIzaSyApbn-hqTptAuBzVSQCvqGaC_Y779i1JqA'
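Echoing the key in a cell output exposes it to anyone who can read the notebook. The sketch below is one possible way to read the key and fail early when it is missing, and to print only a masked form of it. The helper names `load_api_key` and `mask` are hypothetical, and the environment variable name is an assumption.

```python
import os


def load_api_key(var_name="YOUTUBE_DATA_API_KEY"):
    """Return the API key stored in the environment variable `var_name`.

    Hypothetical helper: raises instead of falling back to a hardcoded key.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


def mask(key, visible=4):
    """Hide all but the last `visible` characters so the key can be shown safely."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]
```

In a notebook, `print(mask(load_api_key()))` would confirm that a key was loaded without revealing it.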

In [None]:
# installing google-api-python-client
!pip install google-api-python-client



# NOTE: Utility functions to make requests to youtube server

In [None]:
import googleapiclient.discovery
from pprint import pprint
from googleapiclient.discovery import build

In [None]:
api_service_name = "youtube"
api_version = "v3"
# DEVELOPER_KEY = YOUTUBE_DATA_API_KEY

In [None]:
def get_youtube_data(*args, **kwargs):
    '''Search YouTube and return the API response with the matching videos.'''
    api_service_name = kwargs.get('api_service_name', 'youtube')
    api_version = kwargs.get('api_version', 'v3')
    DEVELOPER_KEY = kwargs.get('DEVELOPER_KEY')

    # Search parameters; each default applies only when the caller omits the key
    part = kwargs.get('part', "id,snippet")
    type_ = kwargs.get('type', 'video')
    query = kwargs.get('query', "Life of Pi movie")
    videoDuration = kwargs.get('videoDuration', 'short')
    videoDefinition = kwargs.get('videoDefinition', 'high')
    maxResults = kwargs.get('maxResults', 5)
    regionCode = kwargs.get('regionCode', 'US')

    # Create the YouTube API client from the service name, version, and developer key
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=DEVELOPER_KEY)

    # Build the search request:
    #   part: resource parts to include in the response
    #   type: resource type to search for
    #   q: search query string
    #   videoDuration / videoDefinition: filters on video length and definition
    #   maxResults: maximum number of results to return
    #   regionCode: return results for the given country
    #   order: sort order of the results
    #   fields: restricts the response to the listed fields
    request = youtube.search().list(
        part=part,
        type=type_,
        q=query,
        videoDuration=videoDuration,
        videoDefinition=videoDefinition,
        maxResults=maxResults,
        regionCode=regionCode,
        order="relevance",
        fields="items(id(videoId),snippet(channelId,channelTitle,title,description))"
    )

    # Send the request to the YouTube Data API and return the response
    return request.execute()

def prepare_href(result):
    '''Add the watch URL (href) to the snippet of every search result.'''
    for item in result['items']:
        item['snippet']['href'] = f"https://www.youtube.com/watch?v={item['id']['videoId']}"
    return result

def get_video_stats(result, *args, **kwargs):
    '''Attach the statistics of each video in result['items'], looked up by video ID.'''
    api_service_name = kwargs.get('api_service_name', 'youtube')
    api_version = kwargs.get('api_version', 'v3')
    DEVELOPER_KEY = kwargs.get('DEVELOPER_KEY')

    # Create the YouTube API client from the service name, version, and developer key
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=DEVELOPER_KEY)

    for item in result['items']:
        # Request the statistics part for this video ID
        videos_response = youtube.videos().list(
            part="statistics",
            id=item['id']['videoId']
        ).execute()

        # The response has no items if the video is no longer available
        found = videos_response.get("items", [])
        item['statistics'] = found[0]["statistics"] if found else {}

    return result

def create_client(api_key):
    '''Create the YouTube API client used to send requests to the server.'''
    youtube = build('youtube', 'v3', developerKey=api_key)
    return youtube

def get_all_comments(video_id, youtube):
    '''Fetch every comment thread of a video, following pagination.'''
    comments = []
    nextPageToken = None

    while True:
        response = youtube.commentThreads().list(
            part='snippet',
            videoId=video_id,
            maxResults=100,  # largest page size the API allows, so fewer requests are needed
            pageToken=nextPageToken
        ).execute()

        comments.extend(response['items'])

        # A missing nextPageToken means the last page was reached
        nextPageToken = response.get('nextPageToken')
        if not nextPageToken:
            break

    return comments
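`get_video_stats` issues one `videos().list` request per video. Since the `id` parameter of that method accepts a comma-separated list, the lookups can be grouped. The following is a minimal sketch under the assumption that up to 50 IDs may be sent per request; `chunked` and `fetch_stats_batched` are hypothetical names, and `youtube` is a client such as the one returned by `create_client`.

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def fetch_stats_batched(youtube, video_ids, batch_size=50):
    """Return {video_id: statistics} using one videos().list call per batch.

    Videos that the API does not return are simply absent from the result.
    """
    stats = {}
    for batch in chunked(list(video_ids), batch_size):
        response = youtube.videos().list(
            part="statistics",
            id=",".join(batch),  # comma-separated IDs in a single request
        ).execute()
        for item in response.get("items", []):
            stats[item["id"]] = item.get("statistics", {})
    return stats
```

For the 50 search results used in this assignment, this would need a single request instead of 50, which also reduces API quota usage.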




Custom payload for making a request to the server


{
    "api_service_name": "youtube",
    "api_version":'v3',
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "part": "id,snippet",
    "type": "video",
    "query": "Life of Pi movie",
    "videoDuration": 'short',
    "videoDefinition": 'high',
    "maxResults": 5
}

In [None]:
# sample request
get_youtube_data_input = {
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "query": "Life of Pi movie",
    "maxResults": 5
}
get_youtube_data(**get_youtube_data_input)['items'][0]

{'id': {'videoId': '3mMN693-F3U'},
 'snippet': {'channelId': 'UCi8e0iOVk1fEOogdfu4YgfA',
  'title': 'Life of Pi Official Trailer #1 (2012) Ang Lee Movie HD',
  'description': 'Subscribe to TRAILERS: http://bit.ly/sxaw6h Subscribe to COMING SOON: http://bit.ly/H2vZUn Life of Pi Official Trailer #1 (2012) ...',
  'channelTitle': 'Rotten Tomatoes Trailers'}}

2. Search and extract the data



> 2.a Search videos related to the query string “avatar movie”
(For this part, choose/search one video of your choice and perform the data collection steps on that specific video) (3 marks)

> Output expected: ID, Snippet with the following attributes: Channel ID, Video Description, Channel Title, Video Title






Reference link:  https://developers.google.com/youtube/v3/docs/search/list

In [None]:
# < code block >

get_youtube_data_input = {
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "query": "avatar movie",   # here is the query string
    "maxResults": 5
}
result = get_youtube_data(**get_youtube_data_input)
result = prepare_href(result)
pprint(result)

{'items': [{'id': {'videoId': 'd9MyW72ELq0'},
            'snippet': {'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'channelTitle': 'Avatar',
                        'description': 'Set more than a decade after the '
                                       'events of the first film, “Avatar: The '
                                       'Way of Water” begins to tell the story '
                                       'of the Sully family (Jake, ...',
                        'href': 'https://www.youtube.com/watch?v=d9MyW72ELq0',
                        'title': 'Avatar: The Way of Water | Official '
                                 'Trailer'}},
           {'id': {'videoId': 'mdhEj_rG80Y'},
            'snippet': {'channelId': 'UC3IeKtmWtkr9e0IhPf5DmRg',
                        'channelTitle': 'Diamond Box',
                        'description': 'AVATAR - Movie Behind the Scenes | '
                                       'Hollywood movie Making Avatar is an '
           


> 2.b Provide the following statistics for the query string “avatar movie” for the top 50 videos sorted by relevance in the US region (7 marks)

> Output expected: video ID, title, number of views, number of likes, and number of comments, exported to a CSV file






Reference link: https://developers.google.com/youtube/v3/docs/videos/list

In [None]:
# < code block >

# regionCode is not passed here, so get_youtube_data falls back to its default ('US'); results are still sorted by relevance
get_youtube_data_input = {
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "query": "avatar movie",   # here is the query string
    "maxResults": 50
}
result = get_youtube_data(**get_youtube_data_input)
result = prepare_href(result)
print('total # of items: ',len(result['items']))
pprint(result)

total # of items:  50
{'items': [{'id': {'videoId': 'd9MyW72ELq0'},
            'snippet': {'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'channelTitle': 'Avatar',
                        'description': 'Set more than a decade after the '
                                       'events of the first film, “Avatar: The '
                                       'Way of Water” begins to tell the story '
                                       'of the Sully family (Jake, ...',
                        'href': 'https://www.youtube.com/watch?v=d9MyW72ELq0',
                        'title': 'Avatar: The Way of Water | Official '
                                 'Trailer'}},
           {'id': {'videoId': '5PSNL1qE6VY'},
            'snippet': {'channelId': 'UC2-BeLxzUBSs0uSrmzWhJuQ',
                        'channelTitle': '20th Century Studios',
                        'description': 'AVATAR takes us to a spectacular world '
                                       'beyond imagina

In [None]:
# Sorted by relevance, with the US region requested explicitly (regionCode)
get_youtube_data_input = {
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "query": "avatar movie",   # here is the query string
    "maxResults": 50,
    "regionCode":"US"  # requesting by Region Code EX: US
}
result = get_youtube_data(**get_youtube_data_input)
result = prepare_href(result)
print('total # of items: ',len(result['items']))
pprint(result)

total # of items:  50
{'items': [{'id': {'videoId': 'd9MyW72ELq0'},
            'snippet': {'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'channelTitle': 'Avatar',
                        'description': 'Set more than a decade after the '
                                       'events of the first film, “Avatar: The '
                                       'Way of Water” begins to tell the story '
                                       'of the Sully family (Jake, ...',
                        'href': 'https://www.youtube.com/watch?v=d9MyW72ELq0',
                        'title': 'Avatar: The Way of Water | Official '
                                 'Trailer'}},
           {'id': {'videoId': '5PSNL1qE6VY'},
            'snippet': {'channelId': 'UC2-BeLxzUBSs0uSrmzWhJuQ',
                        'channelTitle': '20th Century Studios',
                        'description': 'AVATAR takes us to a spectacular world '
                                       'beyond imagina

In [None]:
# Sorted by relevance in the US region (regionCode), with statistics included,
# e.g. commentCount, favoriteCount, likeCount, viewCount
get_youtube_data_input = {
    "DEVELOPER_KEY": YOUTUBE_DATA_API_KEY,
    "query": "avatar movie",   # here is the query string
    "maxResults": 50,
    "regionCode":"US"          # requesting by Region Code EX: US
}
result = get_youtube_data(**get_youtube_data_input)
result = prepare_href(result)
print('total # of items: ',len(result['items']))

# Fetch the statistics of each video by video ID
result = get_video_stats(result,**get_youtube_data_input)

pprint(result)

total # of items:  50
{'items': [{'id': {'videoId': 'd9MyW72ELq0'},
            'snippet': {'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'channelTitle': 'Avatar',
                        'description': 'Set more than a decade after the '
                                       'events of the first film, “Avatar: The '
                                       'Way of Water” begins to tell the story '
                                       'of the Sully family (Jake, ...',
                        'href': 'https://www.youtube.com/watch?v=d9MyW72ELq0',
                        'title': 'Avatar: The Way of Water | Official Trailer'},
            'statistics': {'commentCount': '43349',
                           'favoriteCount': '0',
                           'likeCount': '1043486',
                           'viewCount': '57488860'}},
           {'id': {'videoId': 'a8Gx8wiNbs8'},
            'snippet': {'channelId': 'UCgjxQJ6TlKqhHax8742ZMdA',
                        'chann

In [None]:
# Export the response data to a CSV file
import pandas as pd

df = pd.json_normalize(result.get('items'))
df.to_csv('YouTube_Video_Data.csv', index=False)  # index=False avoids an extra unnamed index column in the file
df.head(2)

Unnamed: 0,id.videoId,snippet.channelId,snippet.title,snippet.description,snippet.channelTitle,snippet.href,statistics.viewCount,statistics.likeCount,statistics.favoriteCount,statistics.commentCount
0,d9MyW72ELq0,UCgjxQJ6TlKqhHax8742ZMdA,Avatar: The Way of Water | Official Trailer,Set more than a decade after the events of the...,Avatar,https://www.youtube.com/watch?v=d9MyW72ELq0,57488860,1043486,0,43349
1,a8Gx8wiNbs8,UCgjxQJ6TlKqhHax8742ZMdA,Avatar: The Way of Water | Official Teaser Tra...,Set more than a decade after the events of the...,Avatar,https://www.youtube.com/watch?v=a8Gx8wiNbs8,27569184,683228,0,29184
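The expected output for 2.b lists only five fields: video ID, title, views, likes, and comments. As an alternative to exporting every normalized column, the sketch below writes just those fields with the standard `csv` module. The column names and the helpers `to_int`, `item_to_row`, and `export_csv` are assumptions made for illustration; the input is the list `result['items']` after statistics have been attached.

```python
import csv

CSV_COLUMNS = ["video_id", "title", "views", "likes", "comments"]


def to_int(value):
    """Convert an API count (a string) to int; return None when it is missing."""
    return int(value) if value is not None else None


def item_to_row(item):
    """Flatten one item of `result['items']` into the five expected columns."""
    stats = item.get("statistics", {})
    return {
        "video_id": item["id"]["videoId"],
        "title": item["snippet"]["title"],
        "views": to_int(stats.get("viewCount")),
        "likes": to_int(stats.get("likeCount")),
        "comments": to_int(stats.get("commentCount")),
    }


def export_csv(items, path):
    """Write one row per video to `path`, with a header and no index column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CSV_COLUMNS)
        writer.writeheader()
        writer.writerows(item_to_row(item) for item in items)
```

A missing count (for example when likes or comments are hidden for a video) is written as an empty cell rather than raising an error.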


3. Analyze the exported data obtained in 2.b and carry out the following tasks (15 marks)



> 3.a Sort the data from 2.b by comment count in descending order and take the video IDs and titles of the 10 videos with the most comments. (3 marks)



In [None]:
# < code block >
# Cast the comment count to a nullable integer so that sorting is numeric;
# errors='coerce' turns missing or malformed counts into <NA>
df['statistics.commentCount'] = pd.to_numeric(df['statistics.commentCount'], errors='coerce').astype('Int64')

# Sort the DataFrame by comment count in descending order
sorted_df = df.sort_values(by='statistics.commentCount', ascending=False)

# Take the 10 videos with the highest comment counts
# (copy so that columns can be added later without modifying a slice)
top_10_videos = sorted_df.head(10).copy()

priority_columns = ['id.videoId','snippet.title','statistics.commentCount','snippet.href']
top_10_videos[priority_columns]

Unnamed: 0,id.videoId,snippet.title,statistics.commentCount,snippet.href
0,d9MyW72ELq0,Avatar: The Way of Water | Official Trailer,43349,https://www.youtube.com/watch?v=d9MyW72ELq0
1,a8Gx8wiNbs8,Avatar: The Way of Water | Official Teaser Tra...,29184,https://www.youtube.com/watch?v=a8Gx8wiNbs8
5,o5F8MOz_IDw,Avatar: The Way of Water | New Trailer,13151,https://www.youtube.com/watch?v=o5F8MOz_IDw
2,5PSNL1qE6VY,Avatar | Official Trailer (HD) | 20th Century FOX,8935,https://www.youtube.com/watch?v=5PSNL1qE6VY
11,f5Zx8iPek5I,Avatar 3 Will Introduce The Dark Side🔥 Of Na&#...,3826,https://www.youtube.com/watch?v=f5Zx8iPek5I
18,QOg9LUIvaig,"AVATAR: THE LAST AIRBENDER | Water, Earth, Fir...",3606,https://www.youtube.com/watch?v=QOg9LUIvaig
7,kHNCaWjB-98,Zoe Saldana Performance Capture | AVATAR (2009...,3151,https://www.youtube.com/watch?v=kHNCaWjB-98
9,noVddIRzn3o,Sander VS Avatar Jake Sully #sandervs #satisfy...,3091,https://www.youtube.com/watch?v=noVddIRzn3o
23,a6VVrAZUnsc,Avatar: The Way of Water | Official IMAX® Trailer,2761,https://www.youtube.com/watch?v=a6VVrAZUnsc
44,vAL6i0Dm0-8,The COOLEST Experience in Singapore! (Avatar 2...,2521,https://www.youtube.com/watch?v=vAL6i0Dm0-8
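The same selection can be expressed without pandas. The sketch below assumes each video is a dict with an integer `comments` entry (a hypothetical layout, not the DataFrame used above) and skips videos that have no comment count instead of treating them as zero.

```python
def top_by_comments(rows, n=10):
    """Return the `n` rows with the highest comment count, highest first.

    `rows` are dicts with a "comments" entry (int or None); rows without a
    count are skipped.
    """
    counted = [row for row in rows if row.get("comments") is not None]
    return sorted(counted, key=lambda row: row["comments"], reverse=True)[:n]
```

Because `sorted` is stable, videos with equal comment counts keep their original relevance order.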



> 3.b Use a suitable method to retrieve the comments of the top 10 videos from 3.a. To do this, write a program that loops through each video ID from 3.a and passes the part parameter set to "snippet" to retrieve basic details about the comments. Execute this request and print the response using the pprint() method.
- Note: pprint() prints the API response in a more human-readable format.
- Reference link: [link](https://developers.google.com/youtube/v3/docs)


> **Output expected**: Use the Python library “pprint” to print the output of the program with the following properties: etag, items, id, kind, and snippet, where snippet includes the text display field that holds the comment text.






In [None]:
# < code block >

# Create the client connection to YouTube
client = create_client(YOUTUBE_DATA_API_KEY)
result = []
for i, row in top_10_videos.iterrows():
    # Collect the video details and all of its comments, fetched by video ID
    obj = {}
    obj['id.videoId'] = row['id.videoId']
    obj['snippet.title'] = row['snippet.title']
    obj['snippet.href'] = row['snippet.href']
    obj['statistics.commentCount'] = int(row['statistics.commentCount'])  # plain int so it can be written to JSON later
    comments = get_all_comments(row['id.videoId'], client)
    obj['comments'] = comments
    result.append(obj)
    pprint(comments)
    print('='*150)
print('Successfully fetched')

Streaming output truncated to the last 5000 lines.
                                  'id': 'Ugx1uH5GzVfFbmi34qF4AaABAg',
                                  'kind': 'youtube#comment',
                                  'snippet': {'authorChannelId': {'value': 'UCWYmhA2Qd-M9RgAxwV_WYug'},
                                              'authorChannelUrl': 'http://www.youtube.com/channel/UCWYmhA2Qd-M9RgAxwV_WYug',
                                              'authorDisplayName': 'thangalla '
                                                                   'sreenu',
                                              'authorProfileImageUrl': 'https://yt3.ggpht.com/QLfZoeHauOfV6-fx7trNCkK15Fg-qGsAoqtoWjhNfsPDAH4foiJmqeU91ByCIKlrFZfHPYtlZQ=s48-c-k-c0x00ffffff-no-rj',
                                              'canRate': True,
                                              'likeCount': 1,
                                              'publishedAt': '2022-11-02T13:05:24Z',
           

In [None]:
pprint(result)
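The printed response nests the comment text several levels deep. In a comment thread resource the text is found under `snippet.topLevelComment.snippet.textDisplay`. The sketch below pulls out just that field; `comment_texts` is a hypothetical helper, and its input is a list of comment threads such as the one returned by `get_all_comments`.

```python
def comment_texts(threads):
    """Return the `textDisplay` of each top-level comment in `threads`.

    Threads lacking the expected keys are skipped rather than raising.
    """
    texts = []
    for thread in threads:
        top = thread.get("snippet", {}).get("topLevelComment", {})
        text = top.get("snippet", {}).get("textDisplay")
        if text is not None:
            texts.append(text)
    return texts
```

For example, `comment_texts(result[0]['comments'])` would list the comment texts of the first of the top 10 videos.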



> 3.c Write a program to export the output of question 3.b in JSON file format and submit the file as part of the assignment (3 marks)



In [None]:
# < code block >

import json

file_path = "YouTube_Video_Data.json"

# Write the data to the JSON file
with open(file_path, 'w') as json_file:
    json.dump(result, json_file)
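Comments often contain emoji and non-English text, which `json.dump` escapes by default. A possible variant, sketched below, writes the file as UTF-8 with readable characters and reads it back to confirm the export. The helper names `export_json` and `load_json` are hypothetical; `default=str` is an assumed fallback that stores any value `json` cannot encode natively as a string.

```python
import json


def export_json(data, path):
    """Write `data` to `path` as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        # default=str is a fallback for values json cannot encode natively;
        # such values are stored as strings.
        json.dump(data, handle, ensure_ascii=False, indent=2, default=str)


def load_json(path):
    """Read the file back, e.g. to confirm that the export round-trips."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

After `export_json(result, "YouTube_Video_Data.json")`, comparing `len(load_json("YouTube_Video_Data.json"))` with `len(result)` is a quick sanity check.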

> 3.d Write a function to get the likes vs. views ratio of the top 10 videos obtained in 3.a with the highest comments (3 marks)




In [None]:
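# < code block >
# Find the greatest common divisor (GCD) using the Euclidean algorithm
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

likes = 'statistics.likeCount'
views = 'statistics.viewCount'

def likes_views_ratio(row):
    '''Return the likes:views ratio of one video as a string, reduced by the GCD.'''
    like_count, view_count = int(row[likes]), int(row[views])
    divisor = gcd(like_count, view_count) or 1  # avoid dividing by zero when both counts are 0
    return f"{like_count // divisor}:{view_count // divisor}"

# Work on a copy so that new columns are not written to a slice of sorted_df
top_10_videos = top_10_videos.copy()

# Method 1: reduced ratio as a "likes:views" string
top_10_videos['Ratio(likeCount:viewCount)'] = top_10_videos.apply(likes_views_ratio, axis=1)

# Method 2: decimal ratio; the API returns the counts as strings, so cast before dividing
top_10_videos['LikesToViewsRatio'] = top_10_videos[likes].astype('int64') / top_10_videos[views].astype('int64')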
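Python's standard library can do the reduction itself: `fractions.Fraction` divides numerator and denominator by their GCD on construction. The sketch below is one way to get both forms of the ratio for a single video; `likes_to_views` is a hypothetical name, and returning `(None, None)` for a video with zero views is an assumed convention.

```python
from fractions import Fraction


def likes_to_views(likes, views):
    """Return (reduced "likes:views" string, decimal ratio) for one video.

    Counts may be ints or numeric strings, as returned by the API.
    Returns (None, None) when the view count is zero.
    """
    likes, views = int(likes), int(views)
    if views == 0:
        return None, None
    reduced = Fraction(likes, views)  # Fraction reduces by the GCD itself
    return f"{reduced.numerator}:{reduced.denominator}", likes / views
```

For instance, `likes_to_views("10", "40")` returns `("1:4", 0.25)`.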

