In [19]:
import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
from json import loads

In [20]:
url = 'https://www.youtube.com/@PW-Foundation/videos'

In [21]:
url

'https://www.youtube.com/@PW-Foundation/videos'

In [22]:
# Fetch the page HTML with an HTTP GET request
r = requests.get(url)
print(r)

<Response [200]>


In [23]:
# Create a BeautifulSoup object from the response text
youtube_html = BeautifulSoup(r.text , 'html.parser')

In [24]:
big_box = youtube_html.find_all('script')


In [25]:
len(big_box)

40

So far, the code imports the modules needed for scraping, sends an HTTP GET request to the channel's Videos page, parses the returned HTML with BeautifulSoup, and collects every script tag on the page (40 in this run).

In [26]:
def script_to_json(tags: list) -> dict:
    # Scan the script tags from last to first
    for tag in reversed(tags):
        text: str = tag.text
        if 'ytInitialData = {"responseContext"' in text:
            # Drop the 20-character prefix 'var ytInitialData = ' and the trailing ';'
            return loads(text[20:-1])

    raise ValueError('Required script tag is not found in the given tags')

This code defines a function script_to_json() that finds the script tag holding the ytInitialData object and parses its JSON into a Python dictionary.

The function takes a list of script tags as input and searches for the tag that contains the JSON data that we are interested in. It does this by iterating over the list of tags in reverse order (starting from the end of the list), and searching for the ytInitialData JSON object in the text attribute of each tag.

If the required ytInitialData JSON object is found, the function slices the JSON string out of the tag's text, skipping the first 20 characters (the 'var ytInitialData = ' prefix) and the last character (the trailing semicolon), and uses the loads() function from the json module to convert the string to a Python dictionary.

If the required ytInitialData JSON object is not found in any of the tags, the function raises a ValueError.

The resulting dictionary contains the data that we are interested in, which can be used for further analysis or manipulation.
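The fixed offsets in text[20:-1] only work while the script text has exactly the form 'var ytInitialData = {...};'. As a minimal sketch of a less brittle alternative, the JSON object can be located by searching for the first opening brace after the marker and letting json.JSONDecoder.raw_decode parse a single value from that position. The helper name extract_initial_data and the sample string are hypothetical and only illustrate the idea.

```python
from json import JSONDecoder


def extract_initial_data(script_text: str, marker: str = "ytInitialData") -> dict:
    """Parse the JSON object that follows `marker` in a script's text."""
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        raise ValueError(f"{marker!r} not found in script text")

    brace_pos = script_text.find("{", marker_pos)
    if brace_pos == -1:
        raise ValueError("No JSON object follows the marker")

    # raw_decode parses one JSON value starting at brace_pos and
    # ignores whatever comes after it (such as a trailing ';').
    data, _end = JSONDecoder().raw_decode(script_text, brace_pos)
    return data


# Hypothetical, heavily shortened script text
sample = 'var ytInitialData = {"responseContext": {"ok": true}, "contents": {}};'
print(extract_initial_data(sample))
```

With this approach the prefix length and any trailing characters no longer matter, as long as the marker is followed by a valid JSON object.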

In [32]:
youtube_data = script_to_json(big_box)

# Return the list of items in the channel's video grid
def get_contents_dict(data):
    return data['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['richGridRenderer']['contents']

This code defines a function get_contents_dict() that extracts the contents list from the ytInitialData dictionary.

The function takes the ytInitialData dictionary as input and returns the contents list, which holds one entry per video shown on the channel's Videos tab.

The contents list is reached by navigating through nested keys in the ytInitialData dictionary. In this case, the path is:

contents -> twoColumnBrowseResultsRenderer -> tabs[1] -> tabRenderer -> content -> richGridRenderer -> contents
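The lookup above hard-codes tabs[1], which assumes the video grid always sits in the second tab. A minimal sketch of a more defensive lookup, using only the keys that appear in get_contents_dict(), is to scan the tabs for the first one that contains a richGridRenderer. The function name find_video_grid and the toy dictionary are hypothetical.

```python
def find_video_grid(data: dict) -> list:
    """Return the contents list of the first tab that holds a richGridRenderer."""
    tabs = data["contents"]["twoColumnBrowseResultsRenderer"]["tabs"]
    for tab in tabs:
        # Tabs without a tabRenderer or without content are skipped safely
        content = tab.get("tabRenderer", {}).get("content", {})
        if "richGridRenderer" in content:
            return content["richGridRenderer"]["contents"]
    raise KeyError("No tab with a richGridRenderer was found")


# Toy structure with made-up values, shaped like the path used above
toy_data = {
    "contents": {
        "twoColumnBrowseResultsRenderer": {
            "tabs": [
                {"tabRenderer": {"content": {"sectionListRenderer": {}}}},
                {"tabRenderer": {"content": {"richGridRenderer": {"contents": ["video-1", "video-2"]}}}},
            ]
        }
    }
}
print(find_video_grid(toy_data))
```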

In [33]:
def get_videoUrl(data: dict, n: int = 5):
    # List of items in the channel's video grid
    contents = get_contents_dict(data)

    if n > 30:
        raise ValueError('Max Limit is 30.')

    result = []
    for i in range(n):
        # Build the watch URL from each video's ID
        result.append('https://www.youtube.com/watch?v=' +
                        contents[i]['richItemRenderer']['content']['videoRenderer']['videoId'])
    return result

get_videoUrl(youtube_data)

['https://www.youtube.com/watch?v=jXAb1evxaJc',
 'https://www.youtube.com/watch?v=2dn7XMxRtPE',
 'https://www.youtube.com/watch?v=Fks4dVnTb5M',
 'https://www.youtube.com/watch?v=nIuGXeISbSo',
 'https://www.youtube.com/watch?v=p9pqo970kjw']

This code defines a function get_videoUrl() that extracts the URLs of the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an optional parameter n that specifies the number of videos to retrieve (default value is 5). The function first extracts the contents dictionary using the get_contents_dict() function, and then iterates over the first n items in the contents list.

For each item, the function extracts the videoId from the videoRenderer dictionary and concatenates it with the YouTube video URL to create the full URL for the video. The resulting list result contains the URLs of the first n videos on the channel.

Note that the function raises a ValueError if the value of n is greater than 30. The data comes from the ytInitialData object embedded in the page's HTML, not from the YouTube API, and that object only holds the first batch of videos (about 30 in this case). Later videos are loaded dynamically as the page is scrolled, so they are not available to this scraper.
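Indexing contents[i] directly assumes that the list holds at least n entries and that every entry is a video. A minimal sketch of a more tolerant loop, assuming that non-video entries simply lack the richItemRenderer or videoRenderer key, skips such entries and stops once n URLs are collected. The function name video_urls and the toy list are hypothetical; the two video IDs are taken from the output above.

```python
BASE_URL = "https://www.youtube.com/watch?v="


def video_urls(contents: list, n: int = 5) -> list:
    """Collect up to n watch URLs, skipping entries that are not videos."""
    urls = []
    for item in contents:
        renderer = (
            item.get("richItemRenderer", {})
            .get("content", {})
            .get("videoRenderer")
        )
        if renderer is None:
            continue  # not a video entry (assumed shape), so skip it
        urls.append(BASE_URL + renderer["videoId"])
        if len(urls) == n:
            break
    return urls


# Toy list: two video entries and one hypothetical non-video entry
toy_contents = [
    {"richItemRenderer": {"content": {"videoRenderer": {"videoId": "jXAb1evxaJc"}}}},
    {"someOtherRenderer": {}},
    {"richItemRenderer": {"content": {"videoRenderer": {"videoId": "2dn7XMxRtPE"}}}},
]
print(video_urls(toy_contents, n=5))
```

If fewer than n videos are present, the function returns what it found instead of raising an IndexError.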

In [34]:
def get_thumbnails(data: dict, n: int = 5):
    # List of items in the channel's video grid
    contents = get_contents_dict(data)

    if n > 30:
        raise ValueError('Max Limit is 30')

    result = []
    for i in range(n):
        # Take the last entry in each video's thumbnails list
        result.append(contents[i]['richItemRenderer']['content']['videoRenderer']['thumbnail']['thumbnails'][-1]['url'])

    return result

get_thumbnails(youtube_data)

['https://i.ytimg.com/vi/jXAb1evxaJc/hqdefault.jpg?sqp=-oaymwEjCNACELwBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLDmeiri9cimEVHPiAh5ootidgIzIg',
 'https://i.ytimg.com/vi/2dn7XMxRtPE/hqdefault.jpg?sqp=-oaymwEjCNACELwBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLCNuCKbYYT7Bqo7b2xVfh27z3YKMw',
 'https://i.ytimg.com/vi/Fks4dVnTb5M/hqdefault.jpg?sqp=-oaymwEjCNACELwBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLCHXf5XebQJcUioL-AX9g1ZXcizVQ',
 'https://i.ytimg.com/vi/nIuGXeISbSo/hqdefault.jpg?sqp=-oaymwEjCNACELwBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLDnazWdDQJGyaWrFBfL4fojeCtPFg',
 'https://i.ytimg.com/vi/p9pqo970kjw/hqdefault.jpg?sqp=-oaymwEjCNACELwBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLAOxBxNmM_rbpVXQ9jFGTcaP2Dalg']

This code defines a function get_thumbnails() that extracts the URLs of the thumbnail images for the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an optional parameter n that specifies the number of thumbnails to retrieve (default value is 5). The function first extracts the contents dictionary using the get_contents_dict() function, and then iterates over the first n items in the contents list.

For each item, the function extracts a thumbnail URL by selecting the last item in the thumbnails list of the thumbnail dictionary, on the assumption that the list is ordered from the smallest image to the largest.

The resulting list result contains the URLs of the thumbnail images for the first n videos on the channel. These URLs can be used to download and display the thumbnail images for each video.
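Picking the last list item relies on the ordering of the thumbnails list. A minimal sketch that does not depend on ordering, assuming each thumbnail entry carries numeric width and height keys next to url (these two keys are an assumption and do not appear in the code above), selects the entry with the largest area. The function name and the toy values are hypothetical.

```python
def largest_thumbnail_url(thumbnails: list) -> str:
    """Return the URL of the thumbnail with the largest pixel area."""
    if not thumbnails:
        raise ValueError("thumbnails list is empty")
    # Entries without width or height (assumed keys) count as area 0
    best = max(thumbnails, key=lambda t: t.get("width", 0) * t.get("height", 0))
    return best["url"]


# Toy entries with made-up URLs and sizes, deliberately out of order
toy_thumbnails = [
    {"url": "https://example.com/large.jpg", "width": 336, "height": 188},
    {"url": "https://example.com/small.jpg", "width": 168, "height": 94},
]
print(largest_thumbnail_url(toy_thumbnails))
```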

Q3. Write a Python program to extract the titles of the first five videos.

In [36]:
def get_title(data: dict, n: int = 5):
    # List of items in the channel's video grid
    contents = get_contents_dict(data)

    if n > 30:
        raise ValueError('Max Limit is 30')

    result = []
    for i in range(n):
        # Take the text of the last run in each video's title
        result.append(contents[i]['richItemRenderer']['content']['videoRenderer']['title']['runs'][-1]['text'])

    return result

get_title(youtube_data)

['Big Announcement for Gulf Region Aspirants 🔥| Physics Wallah Gulf Channel Trailer🚀',
 'Arjuna JEE v/s Arjuna NEET 🏏- Class 11th Faculties ka Cricket Match 🔥',
 'How to Study Zoology in Class 11th? Ab Saare Doubts Solve Honge !! 🔥',
 'BIGGEST OFFER For Class - 8th ,9th & 10th Students 🤩 || Grab This Opportunity Now 🔥',
 'Launching PW प्रयोगशाला 2.0 🔥 || The Unbeatable is Loading...']

This code defines a function get_title() that extracts the titles of the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an optional parameter n that specifies the number of titles to retrieve (default value is 5). The function first extracts the contents dictionary using the get_contents_dict() function, and then iterates over the first n items in the contents list.

For each item, the function extracts the title of the video by selecting the last item in the runs list of the title dictionary.

The resulting list result contains the titles of the first n videos on the channel. These titles can be used to identify each video and to display the titles in a list or table.

In [37]:
def get_views(data: dict, n: int = 5):
    # List of items in the channel's video grid
    contents = get_contents_dict(data)

    if n > 30:
        raise ValueError('Max Limit is 30')

    result = []
    for i in range(n):
        # Strip the trailing ' views' and the thousands separators, then convert to int
        result.append(int(contents[i]['richItemRenderer']['content']['videoRenderer']['viewCountText']['simpleText']
                      [:-6].replace(',', '')))

    return result

get_views(youtube_data)

[52557, 248083, 8987, 34842, 30439]

This code defines a function get_views() that extracts the view counts of the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an optional parameter n that specifies the number of view counts to retrieve (default value is 5). The function first extracts the contents dictionary using the get_contents_dict() function, and then iterates over the first n items in the contents list.

For each item, the function extracts the view count of the video by selecting the simpleText value of the viewCountText dictionary. The view count is then converted to an integer by removing the last six characters (which represent " views") and any commas.

The resulting list result contains the view counts of the first n videos on the channel. These view counts can be used to measure the popularity of each video and to display the view counts in a list or table.
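The [:-6] slice only works when the text ends in exactly " views"; a string such as "1 view" would leave an empty string and make int() fail. A minimal sketch of a more forgiving parser keeps only the digits and treats text without digits as zero. It assumes the count is written out in full, as in the output above, and would give wrong results for abbreviated forms such as "1.2K views". The function name parse_view_count is hypothetical.

```python
import re


def parse_view_count(text: str) -> int:
    """Convert text such as '52,557 views' to an integer."""
    # Remove everything that is not a digit (separators, spaces, words)
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else 0


for sample in ["52,557 views", "1 view", "No views"]:
    print(sample, "->", parse_view_count(sample))
```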

In [40]:
def get_time_of_post(data: dict, n: int = 5):
    # List of items in the channel's video grid
    contents = get_contents_dict(data)

    if n > 30:
        raise ValueError('Max Limit is 30')

    result = []
    for i in range(n):
        # Relative publication time, such as '5 days ago'
        result.append(contents[i]['richItemRenderer']['content']['videoRenderer']['publishedTimeText']['simpleText'])

    return result

get_time_of_post(youtube_data)

['5 days ago', '7 days ago', '2 weeks ago', '2 weeks ago', '2 weeks ago']

This code defines a function get_time_of_post() that extracts the publication times of the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an optional parameter n that specifies the number of publication times to retrieve (default value is 5). The function first extracts the contents list using the get_contents_dict() function, and then iterates over the first n items in the contents list.

For each item, the function extracts the publication time of the video by selecting the simpleText value of the publishedTimeText dictionary. The value is a relative time such as '5 days ago', not an absolute date.

The resulting list result contains the relative publication times of the first n videos on the channel. They show roughly how long ago each video was published and can be displayed in a list or table.
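Relative strings such as "2 weeks ago" are hard to sort or compare. A minimal sketch, assuming the text always contains the pattern "<number> <unit>(s) ago" in English, converts it to an approximate datetime by subtracting the interval from a reference time. Months and years are approximated as 30 and 365 days, so the result is only an estimate. The function name approximate_publish_time is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Approximate length of each unit in seconds (month and year are rough)
UNIT_SECONDS = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}


def approximate_publish_time(text: str, now: datetime) -> datetime:
    """Estimate a publication datetime from text such as '5 days ago'."""
    match = re.search(r"(\d+)\s+(second|minute|hour|day|week|month|year)s?\s+ago", text)
    if match is None:
        raise ValueError(f"Unrecognized relative time: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(seconds=amount * UNIT_SECONDS[unit])


reference = datetime(2023, 1, 15)  # arbitrary reference time for the example
print(approximate_publish_time("5 days ago", reference))
print(approximate_publish_time("2 weeks ago", reference))
```

Passing the reference time explicitly keeps the function deterministic; in a real run it would be the time at which the page was fetched.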

In [51]:
def get_channel_video_details(data: dict, n: int):
    thumbnails = get_thumbnails(data, n)
    time_of_post = get_time_of_post(data, n)
    titles = get_title(data, n)
    video_urls = get_videoUrl(data, n)
    views = get_views(data, n)

    # One tuple per video; the column names below follow this order
    main_data = list(zip(video_urls, titles, time_of_post, views, thumbnails))

    df = DataFrame(
        main_data,
        columns=['video_urls', 'titles', 'time_of_posting', 'views', 'thumbnails_url'])

    return df

channel_data = get_channel_video_details(youtube_data, 30)
channel_data.to_csv('Abhi-pw_foundation.csv', index=False)

In [52]:
def get_channel_video_details(data: dict, n: int):
    thumbnails = get_thumbnails(data, n)
    time_of_posting = get_time_of_post(data, n)
    titles = get_title(data, n)
    video_urls = get_videoUrl(data, n)
    views = get_views(data, n)

    # One tuple per video; the column names below follow this order
    main_data = list(zip(video_urls, titles, thumbnails, time_of_posting, views))

    df = DataFrame(
        main_data,
        columns=['video_urls', 'title', 'thumbnail_url', 'time_of_posting', 'views'])

    return df

channel_data = get_channel_video_details(youtube_data, 10)
channel_data.to_csv('bhi-PW-Foundation.csv', index=False)

This code defines a function get_channel_video_details() that extracts the video details of the first n videos on the channel.

The function takes the ytInitialData dictionary as input and an integer n that specifies the number of videos to retrieve. The function calls the get_thumbnails(), get_time_of_post(), get_title(), get_videoUrl(), and get_views() functions to extract the thumbnail URLs, relative publication times, titles, video URLs, and view counts of the first n videos on the channel. It then zips these lists together into a single list of tuples called main_data.

The function then creates a Pandas DataFrame from main_data with the columns video_urls, title, thumbnail_url, time_of_posting, and views. Finally, the function returns the DataFrame.

The resulting DataFrame channel_data contains the details of the first 10 videos on the channel, including the video URLs, titles, thumbnail URLs, relative publication times, and view counts. The DataFrame is saved to a CSV file called bhi-PW-Foundation.csv with the to_csv() method.
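Naming columns by position only works while the order of the zipped lists and the order of the column names stay in sync. A minimal sketch of an alternative is to build one dictionary per video so that every value carries its own label; such a list of dictionaries can be passed directly to pandas.DataFrame(). The function name build_records is hypothetical, and the title and thumbnail values below are placeholders.

```python
def build_records(video_urls, titles, thumbnails, times, views) -> list:
    """Combine parallel lists into one labeled dictionary per video."""
    records = []
    for url, title, thumb, posted, count in zip(video_urls, titles, thumbnails, times, views):
        records.append({
            "video_url": url,
            "title": title,
            "thumbnail_url": thumb,
            "time_of_posting": posted,
            "views": count,
        })
    return records


# Toy input: URL, time, and view count from the outputs above; other values are placeholders
records = build_records(
    ["https://www.youtube.com/watch?v=jXAb1evxaJc"],
    ["Example title"],
    ["https://example.com/thumb.jpg"],
    ["5 days ago"],
    [52557],
)
print(records[0])
```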

 