# Task
Analyze a YouTube channel's performance using the YouTube API, build a regression model to predict views for a new video based on its title, tags, and description, and present the findings as a tool for creators to optimize video metadata.

## Set up YouTube API access

### Subtask:
Obtain API keys and set up the necessary libraries to interact with the YouTube Data API.


**Reasoning**:
Install the Google API client library for Python.



In [1]:
%pip install google-api-python-client



**Reasoning**:
Import the necessary library to interact with the YouTube Data API.



In [2]:
from googleapiclient.discovery import build
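
One way to avoid running the later cells with a placeholder key is to read the key from the environment and fail early. The sketch below is a minimal illustration, not part of the original notebook: the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions made for this example.

```python
import os

# Values that should never be sent to the API (assumed placeholders).
PLACEHOLDERS = {"", "YOUR_API_KEY"}


def load_api_key(env=os.environ, var_name="YOUTUBE_API_KEY"):
    """Return the key stored under `var_name`, or raise if it is missing or a placeholder.

    `var_name` is a hypothetical environment variable chosen for this sketch.
    """
    key = env.get(var_name, "").strip()
    if key in PLACEHOLDERS:
        raise RuntimeError(
            f"Set {var_name} to a valid YouTube Data API key before fetching data."
        )
    return key
```

With this in place, `api_key = load_api_key()` would stop the notebook at the first cell that needs credentials instead of producing an empty dataset further down.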

## Fetch channel data

### Subtask:
Write code to fetch video data (title, description, tags, views, etc.) for a given YouTube channel using the API.


**Reasoning**:
Write code to fetch video data for a given YouTube channel using the API, including handling pagination.



In [3]:
api_key = 'YOUR_API_KEY' # Replace with your actual API key
channel_id = 'YOUR_CHANNEL_ID' # Replace with the channel ID

youtube = build('youtube', 'v3', developerKey=api_key)

video_data = []
next_page_token = None

while True:
    search_response = youtube.search().list(
        channelId=channel_id,
        part='id',
        maxResults=50,
        type='video',
        pageToken=next_page_token
    ).execute()

    video_ids = [item['id']['videoId'] for item in search_response.get('items', [])]

    if not video_ids:
        break

    videos_response = youtube.videos().list(
        id=','.join(video_ids),
        part='snippet,statistics'
    ).execute()

    for item in videos_response.get('items', []):
        video_data.append({
            'title': item['snippet']['title'],
            'description': item['snippet']['description'],
            'tags': item['snippet'].get('tags', []),
            'view_count': item['statistics'].get('viewCount'),
            'like_count': item['statistics'].get('likeCount'),
            'comment_count': item['statistics'].get('commentCount')
        })

    next_page_token = search_response.get('nextPageToken')

    if not next_page_token:
        break

print(f"Fetched data for {len(video_data)} videos.")

HttpError: <HttpError 400 when requesting https://youtube.googleapis.com/youtube/v3/search?channelId=YOUR_CHANNEL_ID&part=id&maxResults=50&type=video&key=YOUR_API_KEY&alt=json returned "API key not valid. Please pass a valid API key.". Details: "[{'message': 'API key not valid. Please pass a valid API key.', 'domain': 'global', 'reason': 'badRequest'}]">
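
The request fails here because the placeholder key was never replaced. Once real data is available, each item returned by `videos().list` still needs flattening, because the statistics arrive as strings and individual fields may be absent. The following sketch shows one way to do that; `flatten_video_item` and `to_int_or_none` are hypothetical helper names, and the sample item in the usage lines is made up for illustration.

```python
def to_int_or_none(value):
    """Convert an API count (usually a numeric string) to int, or None if unusable."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def flatten_video_item(item):
    """Turn one `videos().list` item into the flat record used by this notebook."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "title": snippet.get("title", ""),
        "description": snippet.get("description", ""),
        "tags": snippet.get("tags", []),
        "view_count": to_int_or_none(stats.get("viewCount")),
        "like_count": to_int_or_none(stats.get("likeCount")),
        "comment_count": to_int_or_none(stats.get("commentCount")),
    }


# Made-up item, shaped like the fields the fetch loop reads.
sample_item = {
    "snippet": {"title": "Demo", "description": "A demo video."},
    "statistics": {"viewCount": "1500", "commentCount": "12"},
}
record = flatten_video_item(sample_item)
print(record["view_count"], record["like_count"], record["tags"])
```

Converting the counts at fetch time keeps the later numeric conversion step simple, although the original notebook defers it to the preprocessing cell.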

## Store the data

### Subtask:
Store the fetched data in a structured format, such as a pandas DataFrame or a database.


**Reasoning**:
Import the pandas library and create a DataFrame from the `video_data` list. Then, display the first 5 rows and the data types of the DataFrame.



In [4]:
import pandas as pd

df_video_data = pd.DataFrame(video_data)

display(df_video_data.head())
display(df_video_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Empty DataFrame


None
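
Because `video_data` is empty, the DataFrame above has no columns at all, which is what triggers the `KeyError: 'view_count'` in the later cells. A minimal sketch of one workaround is to pass the expected column names explicitly, so that even an empty fetch yields a frame with the right schema. The helper name `build_video_frame` and the constant `VIDEO_COLUMNS` are assumptions for this example.

```python
import pandas as pd

# Columns the rest of the notebook expects, in the order the fetch loop fills them.
VIDEO_COLUMNS = [
    "title", "description", "tags",
    "view_count", "like_count", "comment_count",
]


def build_video_frame(records):
    """Build a DataFrame with a fixed schema, even when `records` is empty."""
    df = pd.DataFrame(records, columns=VIDEO_COLUMNS)
    # Counts arrive as strings; unparseable or missing values become NaN.
    for col in ["view_count", "like_count", "comment_count"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


empty_frame = build_video_frame([])
print(list(empty_frame.columns))
print(len(empty_frame))
```

With a fixed schema, the downstream cells would report "not enough data" rather than failing on a missing column.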

## Explore and preprocess data

### Subtask:
Analyze the data to understand its structure and quality. Handle missing values, outliers, and format inconsistencies. Extract relevant features from title, description, and tags.


**Reasoning**:
Inspect the DataFrame and handle missing values and data type conversions for count columns.



In [6]:
display(df_video_data.head())
display(df_video_data.info())

# Convert count columns to numeric, coercing errors to NaN
for col in ['view_count', 'like_count', 'comment_count']:
    df_video_data[col] = pd.to_numeric(df_video_data[col], errors='coerce')

display(df_video_data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Empty DataFrame


None

KeyError: 'view_count'

**Reasoning**:
The previous attempt failed because the DataFrame was empty, so it had no `view_count` column. I will retry the steps to display the head and info of the DataFrame, convert the count columns to numeric, and then calculate and display the title length, description length, and number of tags.



In [7]:
display(df_video_data.head())
display(df_video_data.info())

# Convert count columns to numeric, coercing errors to NaN
for col in ['view_count', 'like_count', 'comment_count']:
    df_video_data[col] = pd.to_numeric(df_video_data[col], errors='coerce')

display(df_video_data.isnull().sum())

# Calculate length of title
df_video_data['title_length'] = df_video_data['title'].str.len()

# Calculate length of description, handling potential NaN values
df_video_data['description_length'] = df_video_data['description'].str.len().fillna(0).astype(int)

# Calculate number of tags, handling potential NaN or empty lists
df_video_data['num_tags'] = df_video_data['tags'].apply(lambda x: len(x) if isinstance(x, list) else 0)

display(df_video_data[['title_length', 'description_length', 'num_tags']].head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Empty DataFrame


None

KeyError: 'view_count'

## Feature engineering

### Subtask:
Create new features that could be useful for prediction, such as the length of the title/description, the number of tags, or sentiment scores from the text.
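
The three features used by the later cells (title length, description length, number of tags) can be computed by one small function, so that training and prediction share the same logic. This is a minimal sketch in plain Python; the name `extract_features` is an assumption, and sentiment scores are left out because the notebook never computes them.

```python
def extract_features(title, description, tags):
    """Return the three metadata features used by the regression model.

    Missing text is treated as empty, and anything that is not a list of tags
    counts as zero tags, mirroring the checks in the notebook's own cells.
    """
    title = title or ""
    description = description or ""
    tags = tags if isinstance(tags, list) else []
    return {
        "title_length": len(title),
        "description_length": len(description),
        "num_tags": len(tags),
    }


example = extract_features(
    "My Awesome New Video",
    "Check out this cool video!",
    ["awesome", "video", "cool"],
)
print(example)
```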


## Build a regression model

### Subtask:
Choose and train a regression model to predict video views based on the extracted and engineered features.


**Reasoning**:
Define features (X) and target (y), split data into training and testing sets, instantiate and train a Linear Regression model, and make predictions.



In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

# Ensure 'view_count' is numeric and handle potential NaNs in features
df_video_data['view_count'] = pd.to_numeric(df_video_data['view_count'], errors='coerce')
df_video_data['title_length'] = df_video_data['title'].str.len()
df_video_data['description_length'] = df_video_data['description'].str.len().fillna(0).astype(int)
df_video_data['num_tags'] = df_video_data['tags'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Drop rows with NaN in the target variable 'view_count'
df_video_data.dropna(subset=['view_count'], inplace=True)

# Define features (X) and target (y)
# Use only the engineered features that were successfully created
features = ['title_length', 'description_length', 'num_tags']
# Drop rows where any of the selected features are NaN
df_video_data.dropna(subset=features, inplace=True)

X = df_video_data[features]
y = df_video_data['view_count']

# Check if there's enough data to split
if len(X) > 1:
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Instantiate and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    print("Model training and prediction completed.")
else:
    print("Not enough data to train the model.")


KeyError: 'view_count'

## Evaluate the model

### Subtask:
Assess the performance of the trained model using appropriate metrics (e.g., R-squared, Mean Absolute Error).


**Reasoning**:
Import the necessary metrics from sklearn.metrics and check if the model exists before proceeding with evaluation. If the model does not exist, print a message indicating this.



In [9]:
from sklearn.metrics import r2_score, mean_absolute_error

if 'model' not in locals():
    print("Model was not trained. Evaluation cannot be performed.")
else:
    # Proceed with prediction and evaluation steps if the model exists
    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    print(f"R-squared: {r2}")
    print(f"Mean Absolute Error (MAE): {mae}")


Model was not trained. Evaluation cannot be performed.
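
Since no metrics could be produced here, it may help to see what the two metrics measure. The sketch below computes MAE and R-squared by hand from their standard definitions; the numbers are invented for illustration and are not results from any channel.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - SS_res / SS_tot; undefined when every actual value is identical."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented view counts, used only to exercise the formulas.
actual_views = [100, 200, 300, 400]
predicted_views = [110, 190, 310, 390]

mae_value = mean_absolute_error_manual(actual_views, predicted_views)
r2_value = r2_score_manual(actual_views, predicted_views)
print(mae_value, r2_value)
```

MAE is expressed in views, so it is easy to interpret for a creator, while R-squared reports the share of variance in views that the features explain.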


## Predict views for a new video

### Subtask:
Implement a function or tool that takes the title, description, and tags of a new video and uses the trained model to predict its views.


**Reasoning**:
Define the function to predict views for a new video based on its title, description, and tags, incorporating the checks and calculations as specified in the instructions.



In [10]:
import pandas as pd
import numpy as np

def predict_views(title, description, tags):
    """
    Predicts the view count for a new video using the trained model.

    Args:
        title (str): The title of the new video.
        description (str): The description of the new video.
        tags (list): A list of tags for the new video.

    Returns:
        float or None: The predicted view count or None if the model is not trained.
    """
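    # locals() inside a function only sees the function's own variables,
    # so look for the trained model in the module namespace instead.
    if 'model' not in globals():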
        print("Error: Model is not trained. Cannot predict views.")
        return None

    # Calculate features
    title_length = len(title)
    description_length = len(description) if description else 0
    num_tags = len(tags) if isinstance(tags, list) else 0

    # Create a DataFrame for prediction
    features_df = pd.DataFrame([[title_length, description_length, num_tags]],
                               columns=['title_length', 'description_length', 'num_tags'])

    # Predict views
    predicted_view_count = model.predict(features_df)[0]

    return predicted_view_count

# Example usage (assuming 'model' is trained and available)
# Replace with actual title, description, and tags for a new video
# new_video_title = "My Awesome New Video"
# new_video_description = "Check out this cool video!"
# new_video_tags = ["awesome", "video", "cool"]

# predicted_views = predict_views(new_video_title, new_video_description, new_video_tags)

# if predicted_views is not None:
#     print(f"Predicted view count: {predicted_views}")
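
An alternative design passes the model in as an argument instead of looking it up by name, which makes the function usable and testable without a notebook-wide `model` variable. The sketch below is an assumption-laden illustration: `predict_views_with` and the stand-in `SumModel` are hypothetical, and the only thing required of a real model is a scikit-learn style `predict` method.

```python
def predict_views_with(model, title, description, tags):
    """Predict views with an explicitly supplied model exposing `predict(rows)`."""
    if model is None:
        raise ValueError("A trained model is required to predict views.")
    # Feature order must match the order used at training time:
    # title_length, description_length, num_tags.
    row = [
        len(title or ""),
        len(description or ""),
        len(tags) if isinstance(tags, list) else 0,
    ]
    return float(model.predict([row])[0])


class SumModel:
    """Hypothetical stand-in for a trained model; it just sums each feature row."""

    def predict(self, rows):
        return [sum(row) for row in rows]


demo_prediction = predict_views_with(SumModel(), "abc", "de", ["x"])
print(demo_prediction)
```

A model that was fitted on a DataFrame may warn about missing feature names when given a plain list, in which case wrapping `row` in a DataFrame with the training column names, as the cell above does, is the safer choice.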


## Present the findings

### Subtask:
Summarize the analysis, the model's performance, and how creators can use the prediction tool.


**Reasoning**:
Summarize the project's goal, steps taken, model information, evaluation (or lack thereof), the prediction tool's usage, benefits, and limitations based on the previous steps. Since data fetching and model training were unsuccessful, the summary will reflect these limitations.



In [11]:
# Summarize the analysis, model performance, and tool usage.

print("Project Summary:")
print("The goal of this project was to analyze YouTube channel performance and build a regression model to predict video views based on metadata.")

print("\nSteps Taken:")
print("1. Set up YouTube API access (API key and channel ID were required but not fully provided).")
print("2. Attempted to fetch channel data using the API (unsuccessful due to missing valid API key and channel ID).")
print("3. Attempted to store the fetched data in a pandas DataFrame (unsuccessful as no data was fetched).")
print("4. Attempted to explore and preprocess the data, including feature engineering (unsuccessful due to empty DataFrame).")
print("5. Attempted to build and train a regression model (unsuccessful due to lack of data and required columns).")
print("6. Attempted to evaluate the model (unsuccessful as the model was not trained).")
print("7. Implemented a function `predict_views` to predict views for a new video, assuming a trained model is available.")

print("\nModel Information and Performance:")
print("A linear regression model was intended to be used, trained on features derived from video title, description, and tags (specifically, length of title, length of description, and number of tags).")
print("However, due to the failure in fetching and processing the data, the model could not be trained or evaluated.")
print("Therefore, no model performance metrics (like R-squared or MAE) are available to report at this time.")

print("\nUsing the Prediction Tool (`predict_views` function):")
print("The `predict_views` function is designed for creators to get a predicted view count for their new videos.")
print("To use it, a creator would call the function providing the title (string), description (string), and a list of tags (list of strings) for their new video.")
print("For example: `predicted_views = predict_views('My New Video Title', 'A great video description.', ['tag1', 'tag2'])`")
print("The function calculates the relevant features (title length, description length, number of tags) and uses the trained model to return a predicted view count.")
print("NOTE: This function currently requires a trained model to be available in the environment. As the model training was unsuccessful, calling this function now would result in an error message indicating that the model is not trained.")

print("\nPotential Benefits for Creators:")
print("If the model were successfully trained and performed well, creators could use this tool to:")
print("- Understand how their planned metadata (title, description, tags) might influence potential views.")
print("- Experiment with different metadata options and compare their predicted view counts.")
print("- Optimize their video metadata before publishing to potentially increase discoverability and views.")

print("\nLimitations:")
print("1. **Data Availability:** The primary limitation is the inability to fetch actual YouTube channel data due to missing or invalid API credentials.")
print("2. **Model Training:** Consequently, the regression model could not be trained on real-world data.")
print("3. **Model Performance:** Without a trained model, its performance cannot be evaluated, and the reliability of any potential predictions is unknown.")
print("4. **Feature Set:** The model was designed to use only basic metadata features (lengths and count). A more sophisticated model could potentially incorporate more advanced text analysis (e.g., sentiment, keywords) or other factors not available via the API (e.g., thumbnail impact, promotion efforts).")
print("5. **Generalizability:** Even with a trained model, its predictions would be specific to the channel whose data was used for training and might not generalize well to other channels.")
print("6. **Dependencies:** The prediction function relies on a trained model being present in the environment.")

Project Summary:
The goal of this project was to analyze YouTube channel performance and build a regression model to predict video views based on metadata.

Steps Taken:
1. Set up YouTube API access (API key and channel ID were required but not fully provided).
2. Attempted to fetch channel data using the API (unsuccessful due to missing valid API key and channel ID).
3. Attempted to store the fetched data in a pandas DataFrame (unsuccessful as no data was fetched).
4. Attempted to explore and preprocess the data, including feature engineering (unsuccessful due to empty DataFrame).
5. Attempted to build and train a regression model (unsuccessful due to lack of data and required columns).
6. Attempted to evaluate the model (unsuccessful as the model was not trained).
7. Implemented a function `predict_views` to predict views for a new video, assuming a trained model is available.

Model Information and Performance:
A linear regression model was intended to be used, trained on features d

## Summary:

### Data Analysis Key Findings

*   The process of fetching YouTube channel data using the API was unsuccessful due to invalid or missing API credentials (API key and Channel ID).
*   As a direct consequence of the data fetching failure, subsequent steps involving storing, exploring, preprocessing, and feature engineering the data could not be completed, resulting in empty data structures (e.g., an empty pandas DataFrame).
*   The regression model intended for predicting video views could not be trained due to the absence of the necessary data and columns, particularly the `view_count` column.
*   Consequently, the performance of the model could not be evaluated using metrics like R-squared or Mean Absolute Error.
*   Despite the inability to train the model, a Python function `predict_views` was successfully implemented, which is designed to calculate features (title length, description length, number of tags) and predict views for a new video, assuming a trained model is available.

### Insights or Next Steps

*   The critical next step is to obtain and use valid YouTube Data API credentials (API key and Channel ID) to successfully fetch channel data.
*   Once data fetching is successful, the subsequent steps of data storage, preprocessing, feature engineering, model training, and evaluation can be performed to build a functional prediction tool.
