# YouTube Data Analysis

The data collection phase retrieves information from the YouTube Data API to build a dataset for analysis. It gathers details about individual videos, channel statistics, and user comments. The API responses include metrics such as view counts, like counts, comment counts, video duration, and subscriber counts, along with the text of top-level comments. The collected data serves as the foundation for the exploratory data analysis (EDA) and visualizations that follow, which examine each channel's performance and user engagement patterns.

This notebook collects the data and then preprocesses it for analysis. The preprocessing code calculates title character length, converts data types, parses dates and times, extracts features, and derives new metrics. These steps put the data in a suitable format for analysis and modeling.

The subsequent sections will delve into exploratory data analysis (EDA) and visualizations to gain insights into the performance and engagement patterns of the YouTube channel based on the preprocessed data.

Let's dive into the preprocessing steps and uncover valuable insights from the YouTube data!

### Importing libraries 📚

In [202]:
#Imports
from itertools import chain
from googleapiclient.discovery import build
import pandas as pd
import isodate
from dateutil import parser  # used to parse 'publishedAt' during pre-processing

<div class="alert alert-block alert-info">
<b>Reminder:</b> If necessary, please download the required packages.
</div>


### Data Collection 🗂️

1. **Connecting to the YouTube API**: Establish a connection to the YouTube API using authentication credentials to access the required data.

2. **Creating Functions to Get Channel Data**: Develop functions to retrieve channel statistics such as subscriber counts, video counts, and other relevant metrics.

3. **Retrieving Video IDs Based on Channel Data**: Utilize the API to fetch video IDs associated with the channel to enable further data retrieval.

4. **Fetching Video Details Based on Video IDs**: Develop functions to gather detailed information about individual videos, including metrics such as view counts, like counts, comment counts, and video duration.

5. **Collecting Comments Data from the Videos**: Use API requests to extract user comments from the videos.

### Connecting to the YouTube API

<div class="alert alert-block alert-warning">
<b>Reminder:</b> Fill in the API key.
</div>

In [186]:
# Required parameters
api_key ='API_KEY'
api_service_name = "youtube"
api_version = "v3"

# Get credentials and create an API client
youtube = build(api_service_name, api_version, developerKey=api_key)
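
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from an environment variable. The helper name `load_api_key` and the variable name `YOUTUBE_API_KEY` are assumptions made for this example, not part of the original notebook or of the API client.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable.

    Both the function name and the default variable name are hypothetical,
    chosen only for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this in place, the cell above could use `api_key = load_api_key()` instead of a literal string.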

In [188]:
channel_ids = ['UCJQJAI7IjbLcpsjWdSzYz0Q', #Thu Vu data analytics
               'UC5DNytAJ6_FISueUfzZCVsw', #Code with Ania Kubów
               'UCV8e2g4IWQqK71bbzGDEI4Q', #Data Professor
               'UCn8ujwUInbJkBhffxqAPBVQ', #Dave Ebbelaar
               'UCWr0mx597DnSGLFk1WfvSkQ', #Hallden
               'UCcJQ96WlEhJ0Ve0SLmU310Q', #Internet Made Coder
               'UCq6XkhO5SZ66N04IcPbqNcw', #Keith Galli
               'UCiT9RITQ9PW6BhXK0y2jaeg', #Ken Jee
               'UC8wZnXYK_CGKlBcZp-GxYPA', #NeuralNine
               'UCxladMszXan-jfgzyeIMyvw', #Rob Mulla
               'UC7cs8q-gJRlGwj4A8OmCmXg', #Alex The Analyst
               'UC2UXDak6o7rBm23k3Vv5dww', #Tina Huang
               'UCtYLUTtgS3k1Fg4y5tAhLbw', #StatQuest with Josh Starmer
              ]

### Functions to retrieve the data

In [190]:
def get_channel_data(youtube_object, channel_list):
    all_data = []
    
    # Use the client passed in as an argument rather than the global `youtube`
    request = youtube_object.channels().list(
        part="snippet,contentDetails,statistics", 
        id=','.join(channel_list))

    response = request.execute()
    
    for item in response['items']:
        data = {'channelName': item['snippet']['title'],
                'totalVideos': item['statistics']['videoCount'],
                'subscribers': item['statistics']['subscriberCount'],
                'views': item['statistics']['viewCount'],
                'playlistId': item['contentDetails']['relatedPlaylists']['uploads']}
        all_data.append(data)
        
    return pd.DataFrame(all_data)


In [192]:
def get_video_ids(youtube, playlist_id):
    
    video_ids = []
    
    request = youtube.playlistItems().list(
        part="contentDetails",  # only contentDetails is read below
        playlistId = playlist_id,
        maxResults = 50)
    response = request.execute()
    
    for item in response['items']:
        video_ids.append(item['contentDetails']['videoId'])
        
    next_page_token = response.get('nextPageToken')
    while next_page_token is not None:
        request = youtube.playlistItems().list(
            part='contentDetails',
            playlistId = playlist_id,
            maxResults = 50,
            pageToken = next_page_token)
        response = request.execute()
    
        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])

        next_page_token = response.get('nextPageToken')
    return video_ids
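
The function above requests the first page, then keeps following `nextPageToken` until the API stops returning one. The same pattern can be written as a single loop. The sketch below is a network-free illustration under stated assumptions: `collect_all_pages` and `fetch_page` are hypothetical names, and the fake pages only mimic the `items` / `nextPageToken` shape of a `playlistItems` response.

```python
def collect_all_pages(fetch_page):
    """Collect items from every page of a paginated endpoint.

    fetch_page is a hypothetical callable: it takes a page token (None for
    the first page) and returns a dict with 'items' and, optionally,
    'nextPageToken'.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break
    return items


# Fake two-page response, used only to exercise the loop offline.
fake_pages = {
    None: {"items": ["a", "b"], "nextPageToken": "page2"},
    "page2": {"items": ["c"]},
}
result = collect_all_pages(lambda token: fake_pages[token])
```

Handling the first page inside the loop avoids repeating the request and the item-extraction code.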

In [194]:
def get_video_details(youtube, video_ids):

    all_video_info = []

    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50]))
        response = request.execute()

        for video in response['items']:
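            # Fields to keep from each part of the response.
            # Note: the API spells the statistics field 'favoriteCount', so the
            # 'favouriteCount' key below is never found and its column is all None.
            stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics': ['viewCount', 'likeCount', 'favouriteCount', 'commentCount'],
                             'contentDetails': ['duration', 'definition', 'caption']}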

            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:
                        # Field missing for this video (e.g. no tags or hidden likes)
                        video_info[v] = None

            all_video_info.append(video_info)
                         
    return pd.DataFrame(all_video_info)
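
Two ideas carry the function above: video IDs are sent in batches of 50 (the slice size used in the loop), and fields that a video lacks are stored as `None`. A minimal, network-free sketch of both follows. The helper names `chunked` and `extract_fields` are hypothetical, and the sample video dict is invented purely for illustration.

```python
def chunked(seq, size=50):
    """Yield consecutive slices of seq with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def extract_fields(video, fields_by_part):
    """Flatten selected fields of one video item, using None for missing ones."""
    info = {"video_id": video["id"]}
    for part, fields in fields_by_part.items():
        for field in fields:
            info[field] = video.get(part, {}).get(field)
    return info


# Invented sample item that mimics the shape of a videos().list() entry.
sample_video = {
    "id": "abc123",
    "snippet": {"title": "Demo"},
    "statistics": {"viewCount": "10"},
}
flat = extract_fields(
    sample_video,
    {"snippet": ["title", "tags"], "statistics": ["viewCount", "likeCount"]},
)
batch_sizes = [len(batch) for batch in chunked(list(range(120)))]
```

Using `dict.get` with a default is an alternative to the `try`/`except` block: it treats a missing part and a missing field the same way without catching unrelated errors.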

In [196]:
def get_comments_in_videos(youtube, video_ids):
    all_comments = []
    
    for video_id in video_ids:
        try:   
            request = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=video_id
            )
            response = request.execute()
        
            comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items'][0:10]]
            comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}

            all_comments.append(comments_in_video_info)
            
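        except Exception:
            # Typically raised when comments are disabled for the video.
            # Catching Exception (not a bare except) keeps Ctrl+C working.
            print('Could not get comments for video ' + video_id)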
        
    return pd.DataFrame(all_comments) 

### Creating Dataframes

In [204]:
channel_df = get_channel_data(youtube, channel_ids)

playlist_ids = channel_df['playlistId'].tolist()

# empty list
video_ids = []

for playlist in playlist_ids:
    video_ids.append(get_video_ids(youtube, playlist))
    
video_ids = list(chain.from_iterable(video_ids))

video_df = get_video_details(youtube, video_ids)

comments_df = get_comments_in_videos(youtube, video_ids)

Could not get comments for video Kbzr7p2cIbk
Could not get comments for video BgxBEKhaqyQ
Could not get comments for video dbOWqJxqbXw
Could not get comments for video 5IYDMiEyE90
Could not get comments for video Tx_cuqfX8a4
Could not get comments for video oMBGiUuqyk4
Could not get comments for video 8nJvjNnONbY


### Pre-processing ⚒️
- The code calculates the character length of each 'title' and stores it in a new column 'titleLength'.
- It converts the count columns ('viewCount', 'likeCount', 'commentCount') to numeric type; values that cannot be parsed become NaN through the 'coerce' option.
- The 'publishedAt' column is converted to a string type and then parsed to datetime format using the dateutil parser.
- The day name is extracted from the 'publishedAt' column and stored in a new column 'publishDayName'.
- The duration of the videos is converted to seconds and stored in a new column 'durationSecs', and to minutes in 'durationMinutes'.
- The 'favouriteCount' column, which contains only NaN values, is dropped, as are rows whose 'viewCount' is zero.
- The code calculates likes and comments per 1,000 views and stores the results in new columns 'likeRatio' and 'commentRatio', respectively.

In [None]:
# Title character length
video_df['titleLength'] = video_df['title'].str.len()

# Convert specified columns to numeric type, column by column
numeric_cols = ['viewCount', 'likeCount', 'commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Convert 'publishedAt' column to string type
video_df['publishedAt'] = video_df['publishedAt'].astype(str)

# Parse 'publishedAt' column to datetime format
video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x))

# Extract the day name from the 'publishedAt' column
video_df['publishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A"))

# Convert the ISO 8601 duration (e.g. 'PT4M13S') to integer seconds.
# Calling total_seconds() on each parsed timedelta avoids dtype casts whose
# behaviour differs between pandas versions.
video_df['durationSecs'] = video_df['duration'].apply(lambda x: int(isodate.parse_duration(x).total_seconds()))

# Calculate likes and comments per 1000 views ratio
video_df['likeRatio'] = video_df['likeCount'] / video_df['viewCount'] * 1000
video_df['commentRatio'] = video_df['commentCount'] / video_df['viewCount'] * 1000

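# Drops columns that contain only NaN values (such as 'favouriteCount').
# dropna(subset=[...]) would instead drop every row, since that column is all NaN.
video_df = video_df.dropna(axis=1, how='all')
# Drops rows with viewCount equal to zero; copy to avoid chained-assignment warnings
video_df = video_df[video_df.viewCount != 0].copy()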

# Convert durationSecs to minutes
video_df['durationMinutes'] = video_df['durationSecs'] / 60
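
To make the duration and ratio steps concrete, here is a small standard-library sketch of the same arithmetic on single values. It is a stand-in for illustration only: the notebook itself uses `isodate.parse_duration`, whereas `duration_to_seconds` below is a hypothetical helper that handles only the day/hour/minute/second forms assumed here, and `per_thousand_views` is likewise an invented name.

```python
import re

# Matches ISO 8601 durations limited to days, hours, minutes and seconds.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert a duration string such as 'PT1H2M3S' to whole seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def per_thousand_views(count, views):
    """Return count per 1,000 views, or None when there are no views."""
    if not views:
        return None
    return count / views * 1000


one_hour_video = duration_to_seconds("PT1H2M3S")  # 3600 + 120 + 3 seconds
```

Returning `None` for zero views mirrors the decision above to exclude zero-view rows rather than keep infinite ratios.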

### Saving retrieved data to CSV files

<div class="alert alert-block alert-info">
    <b>Reminder:</b> Retrieving data can be time-consuming and resource-intensive. It's advisable to save the retrieved data into CSV files for future use.
</div>

In [280]:
# Saves video data 
video_df.to_csv('video_data.csv', index=False)

# Saves comments data
comments_df.to_csv('comments_data.csv', index=False)

# Saves channels data
channel_df.to_csv('channel_data.csv', index=False)
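
One caveat when reloading these files: list-valued cells, such as the `comments` column, are written to CSV as their string representation and come back as plain strings. The sketch below shows one way to restore them with `ast.literal_eval`, using only the standard library and an invented sample row; `parse_comment_rows` is a hypothetical helper, not part of the original notebook.

```python
import ast
import csv
import io


def parse_comment_rows(csv_text):
    """Read comments CSV text and turn each 'comments' cell back into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["comments"] = ast.literal_eval(row["comments"])
        rows.append(row)
    return rows


# Build a tiny in-memory CSV with an invented row that mimics the saved layout.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=["video_id", "comments"])
writer.writeheader()
writer.writerow({"video_id": "abc123", "comments": str(["Great video!", "Thanks"])})

rows = parse_comment_rows(sample.getvalue())
```

With pandas, the equivalent step would be to apply `ast.literal_eval` to the `comments` column after `pd.read_csv`.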