Foundation of information

**Assignment**: Data extraction and analysis from the social media platform YouTube

**Problem statement**

Video is a fast-growing medium through which people communicate, share knowledge, and showcase skills. YouTube is one of the largest video-hosting platforms, with content from many professions, arts, and cultures across the world.

Viewers can express their opinion of a video through likes, dislikes, and comments. These platform features provide information on the sentiment towards the video.

This assignment covers programmatic data extraction from YouTube, followed by analysis of the extracted data to understand various attributes of a video.

**Steps to be performed**

1. Connect to the YouTube API using a Python client

> 1.a Create a YouTube API key

> 1.b Install the Google API Python client

Refer to the [supporting link](https://developers.google.com/youtube/v3/getting-started) for how to create a YouTube API key.

Reference link: https://developers.google.com/youtube/v3/quickstart/python

In [179]:
# Installing python packages via conda 
#import sys
#!conda install --yes --prefix {sys.prefix}  google-api-python-client
#!conda install --yes --prefix {sys.prefix}   google-auth-oauthlib google-auth-httplib2

# Installing python packages via pip
import sys
!{sys.executable} -m pip install --upgrade google-api-python-client
!{sys.executable} -m pip install --upgrade google-auth-oauthlib
!{sys.executable} -m pip install --upgrade google-auth-httplib2
!{sys.executable} -m pip install --upgrade prettytable

Collecting prettytable
  Downloading prettytable-3.9.0-py3-none-any.whl (27 kB)
Installing collected packages: prettytable
Successfully installed prettytable-3.9.0
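
Step 1.a produces an API key that the later cells need. As a minimal sketch of one way to keep that key out of the notebook, the helper below reads it from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for illustration; they are not part of the assignment or of the Google client library.

```python
import os


def load_api_key(env=None, var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is a hypothetical name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your YouTube API key."
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(load_api_key({"YOUTUBE_API_KEY": "demo-key"}))
```

The returned value could then be passed as `developerKey` when building the client with `googleapiclient.discovery.build`.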


2. Search and extract the data



> 2.a Search for videos related to the query string “avatar movie”
(For this part, choose one video from the search results and perform the data collection steps on that specific video.)

> Output expected: ID, and Snippet with the following attributes: Channel ID, Video Description, Channel Title, Video Title






Reference link:  https://developers.google.com/youtube/v3/docs/search/list

In [44]:
import os
import json # pretty print

import googleapiclient.discovery

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    # Read the key from the environment instead of hard-coding it (variable name is an assumption).
    DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    request = youtube.search().list(
        part="id,snippet",
        q="avatar movie",
        type="video"  # restrict results to videos so that every item has a videoId
    )
    
    response = request.execute()
    # DEBUG: uncomment the line below to print the whole result of the search query "avatar movie".
    # print(json.dumps(response, indent = 1))
    items = response['items']
    
    # Sort the items by title
    sorted_items = sorted(items, key=lambda x: x["snippet"]["title"])
    
    # Extract information for the first element
    first_element = sorted_items[0]
    video_id = first_element["id"]["videoId"]
    channel_id = first_element["snippet"]["channelId"]
    video_description = first_element["snippet"]["description"]
    channel_title = first_element["snippet"]["channelTitle"]
    video_title = first_element["snippet"]["title"]

    # Print the extracted information
    print("ID:", video_id)
    print("Snippet:")
    print("  Channel ID:", channel_id)
    print("  Video Description:", video_description)
    print("  Channel Title:", channel_title)
    print("  Video Title:", video_title)

if __name__ == "__main__":
    main()


ID: PLtgIILX7E8
Snippet:
  Channel ID: UC0A86RKLCqTEUna3hPlEpzg
  Video Description: AVATAR Full Movie 2023: Fallen Kingdom | Superhero FXL Action Movies 2023 in English (Game Movie). Best Action Game ...
  Channel Title: Superhero FXL Games
  Video Title: AVATAR Full Movie 2023: Fallen Kingdom | Superhero FXL Action Movies 2023 in English (Game Movie)
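
Search results can also contain channels and playlists, whose `id` object has no `videoId` key. The sketch below shows one defensive way to pull out the fields requested in 2.a. The helper name `extract_video_fields` and the sample item are made up for illustration and only mimic the shape of a search result.

```python
def extract_video_fields(item):
    """Pull the fields requested in 2.a out of one search result item."""
    snippet = item.get("snippet", {})
    return {
        # Channel and playlist results carry no "videoId" key, so use .get().
        "video_id": item.get("id", {}).get("videoId"),
        "channel_id": snippet.get("channelId"),
        "description": snippet.get("description"),
        "channel_title": snippet.get("channelTitle"),
        "title": snippet.get("title"),
    }


# Made-up item that only mimics the shape of a search result.
sample_item = {
    "id": {"kind": "youtube#video", "videoId": "abc123"},
    "snippet": {
        "channelId": "UC_example",
        "title": "Example title",
        "description": "Example description",
        "channelTitle": "Example channel",
    },
}
print(extract_video_fields(sample_item))
```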



> 2.b Provide the following statistics for the top 50 videos sorted by relevance in the US region

> Output expected: video ID, title, number of views, number of likes, and number of comments, exported to a CSV file


Reference link: https://developers.google.com/youtube/v3/docs/videos/list

In [111]:
import os
import json # pretty print
import csv
import googleapiclient.discovery

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    # Read the key from the environment instead of hard-coding it (variable name is an assumption).
    DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        chart="mostPopular",
        maxResults=50,
        regionCode="US"
    )
    
    response = request.execute()
    # DEBUG: The below prints whole query result. 
    # print(json.dumps(response, indent = 1))
    items = response['items']
    csv_file_path = "output.csv"
    
    formatted_data = []
    for item in items:
        video_id = item["id"]
        title = item["snippet"]["title"]
        # Statistics can be missing (e.g. hidden likes or disabled comments),
        # so use defaults instead of raising KeyError.
        statistics = item.get("statistics", {})
        view_count = statistics.get("viewCount", "N/A")
        like_count = statistics.get("likeCount", "N/A")
        # Default to "0" so that the integer sort on comments in step 3 still works.
        comment_count = statistics.get("commentCount", "0")

        formatted_data.append({
            "Video ID": video_id,
            "Title": title,
            "No of Views": view_count,
            "No of Likes": like_count,
            "No of Comments": comment_count
        })
    
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ["Video ID", "Title", "No of Views", "No of Likes", "No of Comments"]
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(formatted_data)

    print(f"CSV file '{csv_file_path}' has been created.")
        
if __name__ == "__main__":
    main()


CSV file 'output.csv' has been created.
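
The cell above uses the `mostPopular` chart. One other possible reading of "sorted by relevance" is a relevance-ordered search followed by a statistics lookup. The sketch below illustrates that two-step approach under stated assumptions: the function name `fetch_relevant_video_stats` is hypothetical, `youtube` is a client built as in the cells above, and the query string is left to the caller because 2.b does not name one.

```python
def fetch_relevant_video_stats(youtube, query, region="US", max_results=50):
    """Search by relevance, then look up statistics for the returned videos.

    `youtube` is a client built with googleapiclient.discovery.build.
    """
    search_response = youtube.search().list(
        part="id",
        q=query,
        type="video",  # only videos have statistics
        order="relevance",
        regionCode=region,
        maxResults=max_results,
    ).execute()
    video_ids = [item["id"]["videoId"] for item in search_response.get("items", [])]
    if not video_ids:
        return []

    videos_response = youtube.videos().list(
        part="snippet,statistics",
        id=",".join(video_ids),  # comma-separated list of video IDs
    ).execute()
    by_id = {item["id"]: item for item in videos_response.get("items", [])}

    rows = []
    # Walk the IDs in search order so the relevance ranking is preserved.
    for video_id in video_ids:
        item = by_id.get(video_id)
        if item is None:
            continue
        stats = item.get("statistics", {})
        rows.append({
            "Video ID": video_id,
            "Title": item["snippet"]["title"],
            "No of Views": stats.get("viewCount", "N/A"),
            "No of Likes": stats.get("likeCount", "N/A"),
            "No of Comments": stats.get("commentCount", "N/A"),
        })
    return rows
```

The returned rows use the same column names as the CSV written above, so they could be passed to the same `csv.DictWriter`.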


3. Analyze the exported data obtained in 2.b and carry out the following tasks

> 3.a Sort the data from 2.b by number of comments in descending order and take the video IDs and titles of the 10 videos with the most comments.



In [110]:
import csv

def read_and_sort_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        data = list(reader)

    # Sort data by 'No of Comments' in descending order
    sorted_data = sorted(data, key=lambda x: int(x['No of Comments']), reverse=True)

    return sorted_data

def print_top_10(data):
    print("Top 10 Videos with Highest Comments:")
    print("No of Comments\tVideo ID\tTitle")
    print("="*50)
    for video in data[:10]:
        print(f"{video['No of Comments']}\t{video['Video ID']}\t{video['Title']}")
    return data[:10]
if __name__ == "__main__":
    csv_file_path = "output.csv"
    sorted_data = read_and_sort_csv(csv_file_path)
    print_top_10(sorted_data)
  

Top 10 Videos with Highest Comments:
No of Comments	Video ID	Title
25794	Xty2gi5cMa8	Drake - First Person Shooter ft. J. Cole
17990	s_76M4c4LTo	MADAME WEB – Official Trailer (HD)
10549	jsI3mgcLJZA	The 72nd MISS UNIVERSE National Costume Show
9524	Ye3st9z6jQY	The Last of Us Part II Remastered - Announce Trailer | PS5 Games
9005	22_4K5lg-Us	I Built a DREAM DOG HOUSE and Hid It From My Dad!
8666	63Qn4ng2LTo	SMG4: SMG3's BOMB CAFE
8227	wYdAs5yhmhk	ARGENTINA vs. URUGUAY [0-2] | RESUMEN | ELIMINATORIAS SUDAMERICANAS | FECHA 5
7893	OhLeDrMYY4k	FOOD BATTLE 2023
7561	A6Z9gkJnfgw	The real reason I left Sweden.
7531	JdFRjsEZrmU	Acid vs Lava- Testing Liquids That Melt Everything
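
The sort above calls `int()` on every row, which fails if a comment count is blank or a placeholder such as "N/A". The sketch below shows a more tolerant variant that uses `heapq.nlargest` to select the top rows. The helper names and the CSV content are made up for illustration; only the column names come from 2.b.

```python
import csv
import heapq
import io


def comment_count(row):
    """Parse the comment count, treating blank or non-numeric values as 0."""
    try:
        return int(row["No of Comments"])
    except (KeyError, TypeError, ValueError):
        return 0


def top_n_by_comments(rows, n=10):
    """Return the n rows with the most comments, highest first."""
    return heapq.nlargest(n, rows, key=comment_count)


# Made-up CSV content using the column names from 2.b.
sample_csv = io.StringIO(
    "Video ID,Title,No of Views,No of Likes,No of Comments\n"
    "v1,First,100,10,5\n"
    "v2,Second,200,20,N/A\n"
    "v3,Third,300,30,12\n"
)
rows = list(csv.DictReader(sample_csv))
for row in top_n_by_comments(rows, n=2):
    print(row["No of Comments"], row["Video ID"], row["Title"], sep="\t")
```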



> 3.b Use a suitable method to retrieve the comments of the top 10 videos from 3.a. To do this, write a program that loops through each video ID from 3.a and sends a request with the part parameter set to "snippet" to retrieve basic details about the comments. Execute each request and print the response using the pprint() method.
- Note: pprint() prints the API response in a more human-readable format.
- Reference link: [link](https://developers.google.com/youtube/v3/docs)


> **Output expected**: Use the Python library "pprint" to print the output of the program with the following properties: etag, items, id, kind, and snippet, where snippet contains the textDisplay field that holds the comment text.






In [153]:
import os
import csv
from pprint import pprint

import googleapiclient.discovery

def read_and_sort_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        data = list(reader)

    # Sort data by 'No of Comments' in descending order
    sorted_data = sorted(data, key=lambda x: int(x['No of Comments']), reverse=True)

    return sorted_data

def get_top_10(data):
    return data[:10]

def print_top_comments(highest_comments_videos):
    results = []
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    # Read the key from the environment instead of hard-coding it (variable name is an assumption).
    DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    for video in highest_comments_videos:
        comments = []
        request = youtube.commentThreads().list(
        part="snippet",
        videoId=video['Video ID'].strip(),
        textFormat="plainText"
        )

        # Execute the request and retrieve comments
        response = request.execute()

        # Extract relevant information from the API response
        etag = response['etag']
        items = response.get('items', [])
        kind = response['kind']

        # Extract snippet information for each comment
        for item in items:
            comment_id = item['id']
            snippet = item['snippet']
            text_display = snippet['topLevelComment']['snippet']['textDisplay']

            comments.append({
                'textDisplay': text_display
            })

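        results.append({
            'etag': etag,
            'items': comments,
            'kind': kind,
            'videoId': video['Video ID'].strip(),
            'snippet': {
                # Text of the last comment retrieved for this video, or None when
                # the response contains no comments.
                'textDisplay': comments[-1]['textDisplay'] if comments else None
            }
        })
    return results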
        
if __name__ == "__main__":
    csv_file_path = "output.csv"
    sorted_data = read_and_sort_csv(csv_file_path)
    highest_comments_videos = get_top_10(sorted_data)
    results = print_top_comments(highest_comments_videos)
    pprint(results)


[{'etag': 'zJwnkxreMR-9BXkufCtOvrBAhn4',
  'items': [{'textDisplay': 'Seeing Drake and J. Cole together is another '
                            'thing and it made it better \n'
                            'This is history 👏🏻👏🏻👏🏻👏🏻👏🏻💯\n'
                            '🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥'},
            {'textDisplay': 'Gibson the vision is mash 💥💥'},
            {'textDisplay': 'Drezzy and Cole ❤❤'},
            {'textDisplay': 'Wait? So know one has mentioned any of the '
                            'Michael Jackson homages throughout?\n'
                            '\n'
                            'Maybe i missed the comment.\n'
                            '\n'
                            'But i saw references to: Scream, Bad, Thriller, '
                            'Billie Jean, Jam, History, Black or White, \n'
                            '\n'
                            'Maybe over thinking it. 😅'},
            {'textDisplay': 'I definitely think bardo or YB would be perfect '
          



> 3.c Write a program to export the output of question 3.b in JSON file format and submit the file as part of the assignment 



In [157]:
import os
import json
import csv
import googleapiclient.discovery

def read_and_sort_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        data = list(reader)

    # Sort data by 'No of Comments' in descending order
    sorted_data = sorted(data, key=lambda x: int(x['No of Comments']), reverse=True)

    return sorted_data

def get_top_10(data):
    return data[:10]

def print_top_comments(highest_comments_videos):
    results = []
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    # Read the key from the environment instead of hard-coding it (variable name is an assumption).
    DEVELOPER_KEY = os.environ["YOUTUBE_API_KEY"]

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    for video in highest_comments_videos:
        comments = []
        request = youtube.commentThreads().list(
        part="snippet",
        videoId=video['Video ID'].strip(),
        textFormat="plainText"
        )

        # Execute the request and retrieve comments
        response = request.execute()

        # Extract relevant information from the API response
        etag = response['etag']
        items = response.get('items', [])
        kind = response['kind']
        #video_id = response['id']

        # Extract snippet information for each comment
        for item in items:
            comment_id = item['id']
            snippet = item['snippet']
            text_display = snippet['topLevelComment']['snippet']['textDisplay']

            comments.append({
                'textDisplay': text_display
            })

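        results.append({
            'etag': etag,
            'items': comments,
            'kind': kind,
            'videoId': video['Video ID'].strip(),
            'snippet': {
                # Text of the last comment retrieved for this video, or None when
                # the response contains no comments.
                'textDisplay': comments[-1]['textDisplay'] if comments else None
            }
        })
    return results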
        
def export_as_json(results, json_file_path):
    # Convert the dictionary to a JSON-formatted string
    json_data = json.dumps(results, indent=2) 
    with open(json_file_path, 'w') as json_file:
        json_file.write(json_data)
    
if __name__ == "__main__":
    csv_file_path = "output.csv"
    json_file_path = "output.json"
    sorted_data = read_and_sort_csv(csv_file_path)
    highest_comments_videos = get_top_10(sorted_data)
    results = print_top_comments(highest_comments_videos)
    export_as_json(results, json_file_path)
    print(f"JSON file '{json_file_path}' has been created.")
    


JSON file 'output.json' has been created.
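
Comments often contain emoji and other non-ASCII text, which `json.dumps` escapes as `\uXXXX` sequences by default. The sketch below shows a variant of the export that writes UTF-8 and keeps such characters readable in the file. The sample data is made up and only mimics the shape of the results from 3.b.

```python
import json
import os
import tempfile


def export_as_json(results, json_file_path):
    """Write results as UTF-8 JSON, keeping emoji and other non-ASCII text readable."""
    with open(json_file_path, "w", encoding="utf-8") as json_file:
        json.dump(results, json_file, indent=2, ensure_ascii=False)


# Made-up data shaped like the results of 3.b.
sample_results = [{"videoId": "abc123", "items": [{"textDisplay": "Great video 👏"}]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.json")
    export_as_json(sample_results, path)
    # Read the file back to confirm that the round trip preserves the data.
    with open(path, encoding="utf-8") as json_file:
        restored = json.load(json_file)

print(restored == sample_results)
```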


> 3.d Write a function to get the likes vs views ratio of the top 10 videos with the highest comments obtained in 3.a




In [181]:
import csv
import googleapiclient.discovery
from prettytable import PrettyTable

def read_and_sort_csv(file_path):
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        data = list(reader)

    # Sort data by 'No of Comments' in descending order
    sorted_data = sorted(data, key=lambda x: int(x['No of Comments']), reverse=True)

    return sorted_data

def get_top_10(data):
    return data[:10]

def calculate_likes_vs_views_ratio(highest_comments_videos):
    ratio_data = {}

    for video in highest_comments_videos:
        likes = video['No of Likes'] if video['No of Likes'] != 'N/A' else 0
        likes = int(likes)
        views = video['No of Views'] if video['No of Views'] != 'N/A' else 0 
        views = int(views)
        # Avoid division by zero
        ratio = likes / views if views != 0 else 0

        ratio_data[video['Video ID']]= {'likes': likes, 'views': views, 'ratio': ratio}

    return ratio_data

def print_likes_vs_views_ratio(ratio_data):
    table = PrettyTable(['Video ID', 'Likes', 'Views', 'Likes vs Views Ratio'])

    for video_id, video_data in ratio_data.items():
        likes = video_data['likes']
        views = video_data['views']
        ratio = video_data['ratio']
        table.add_row([video_id, likes, views, f"{ratio:.4f}"])
    print(table)
    
if __name__ == "__main__":
    csv_file_path = "output.csv"
    sorted_data = read_and_sort_csv(csv_file_path)
    highest_comments_videos = get_top_10(sorted_data)
    ratio_data = calculate_likes_vs_views_ratio(highest_comments_videos)
    print_likes_vs_views_ratio(ratio_data)

+-------------+--------+----------+----------------------+
|   Video ID  | Likes  |  Views   | Likes vs Views Ratio |
+-------------+--------+----------+----------------------+
| Xty2gi5cMa8 | 560538 | 7773820  |        0.0721        |
| s_76M4c4LTo | 85839  | 14562238 |        0.0059        |
| jsI3mgcLJZA | 99539  | 7981365  |        0.0125        |
| Ye3st9z6jQY | 53787  | 1104015  |        0.0487        |
| 22_4K5lg-Us | 55543  | 2907477  |        0.0191        |
| 63Qn4ng2LTo | 50736  |  873703  |        0.0581        |
| wYdAs5yhmhk |   0    | 5330930  |        0.0000        |
| OhLeDrMYY4k | 148738 | 1029175  |        0.1445        |
| JdFRjsEZrmU | 282186 | 7048017  |        0.0400        |
| A6Z9gkJnfgw | 123612 | 1871026  |        0.0661        |
+-------------+--------+----------+----------------------+
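
The table reports 0 likes and a ratio of 0.0000 for `wYdAs5yhmhk`, which may reflect a like count stored as "N/A" in 2.b rather than a video with no likes. As a sketch of an alternative, the helpers below return `None` for an unknown ratio instead of 0, so that missing data is not confused with a real value. The names `likes_views_ratio` and `format_ratio` are hypothetical.

```python
def likes_views_ratio(likes, views):
    """Return likes / views, or None when either value is missing or views is 0."""
    try:
        likes, views = int(likes), int(views)
    except (TypeError, ValueError):
        return None  # e.g. the "N/A" placeholder written in 2.b
    return likes / views if views else None


def format_ratio(ratio):
    """Format a ratio for display, showing "N/A" when it is unknown."""
    return "N/A" if ratio is None else f"{ratio:.4f}"


# Values taken from the first and seventh rows of the table above.
print(format_ratio(likes_views_ratio("560538", "7773820")))
print(format_ratio(likes_views_ratio("N/A", "5330930")))
```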
