# Capstone White Paper: Predicting YouTube Video Engagement Using Metadata and NLP

**Authors**  
Jose Guarneros  
Tysir Shehadey  
Applied Data Science Master’s Program  
Shiley Marcos School of Engineering, University of San Diego

---

## Abstract

This study explores the feasibility of predicting YouTube video popularity using only metadata and text-based features available at upload time. We collected metadata from over 3,000 videos using the YouTube Data API and performed exploratory data analysis, feature engineering, and supervised modeling with Linear Regression, Support Vector Regression (SVR), XGBoost, and CatBoost. While the gradient boosting models performed best, none of the models produced a positive R-squared value, indicating that metadata alone is not sufficient to predict engagement. These findings suggest that future prediction efforts need to incorporate richer data sources.

---

## Table of Contents

1. Business Background  
2. Problem Statement  
3. Summary of the Findings  
4. Business Questions  
5. Scope of Analysis  
6. Approach  
7. Limitations  
8. Solution Details  
9. Concluding Summary  
10. Call to Action

---

## 1. Business Background

YouTube serves as a massive platform for digital engagement, yet many content creators struggle to gain traction. While established creators benefit from tools and analytics, small channels often lack access to meaningful insights. Metadata, which is readily available at upload time, presents an opportunity for creators to make more informed decisions without relying on post-upload performance data.

---

## 2. Problem Statement

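Can YouTube video popularity, measured by view count, be reliably predicted using only metadata available before publishing? The candidate features include the video title, description, and publish time, together with engagement ratios such as likes per view. Engagement ratios only become observable after a video is published, so they supplement the upload-time features rather than satisfy the pre-publishing constraint.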
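To make the feature types named above concrete, the following is a minimal sketch of how they could be derived from a single video record. The function name `extract_features`, the output keys, and the assumed timestamp format are illustrative choices, not the project's actual code.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def extract_features(title, published_at, likes=None, views=None):
    """Derive simple metadata features from one video record (hypothetical helper)."""
    # Assumption: published_at is an ISO 8601 UTC string such as "2024-03-10T15:30:00Z".
    published = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%SZ")
    features = {
        "title_length": len(title),               # characters in the title
        "title_word_count": len(title.split()),   # whitespace-separated words
        "publish_weekday": WEEKDAYS[published.weekday()],
        "publish_hour": published.hour,
    }
    # Likes per view needs post-publication statistics; leave it empty otherwise.
    if likes is not None and views:
        features["likes_per_view"] = likes / views
    else:
        features["likes_per_view"] = None
    return features


example = extract_features(
    "How to Bake Sourdough Bread", "2024-03-10T15:30:00Z", likes=50, views=1000
)
print(example)
```

Keeping the ratio optional makes the distinction explicit: the title and publish-time features exist at upload time, while the ratio can only be filled in once statistics are available.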

---

## 3. Summary of the Findings

Of the four models we built and evaluated, CatBoost and XGBoost yielded the lowest RMSE scores, yet neither achieved a positive R-squared value. A non-positive R-squared means a model predicts no better than a constant guess of the mean view count, which indicates that metadata alone does not capture the complex factors driving viewership. Features such as title length and publish day showed weak or inconsistent patterns. The study also found that normalized metrics (likes per view) are more informative than raw counts.
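The two evaluation metrics mentioned above can be sketched in a few lines. The example below uses the standard definitions of RMSE and R-squared with made-up view counts (not the study's data) to show how a model can end up with a negative R-squared when it fits a heavy-tailed target poorly.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative view counts with one viral outlier (invented for this sketch).
views = [100, 200, 300, 400, 10_000]

# Baseline: always predict the mean view count. Its R-squared is zero.
mean_baseline = [sum(views) / len(views)] * len(views)

# A hypothetical model that misjudges which video goes viral.
poor_model = [5000, 5000, 5000, 5000, 500]

print("baseline:", rmse(views, mean_baseline), r_squared(views, mean_baseline))
print("poor model:", rmse(views, poor_model), r_squared(views, poor_model))
```

Because R-squared compares a model's squared error against that of the mean baseline, any model whose errors exceed the baseline's produces a negative value.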

---

## 4. Business Questions

- Can engagement be predicted using only metadata available at upload time?  
- Which metadata features show the strongest relationship with view count?  
- Which modeling approach is most effective using this data?

---

## 5. Scope of Analysis

**Included:** Publicly available metadata (title, publish time, engagement stats) from 3,046 videos across multiple topics.  
**Excluded:** Thumbnails, video content, recommendation placement, user watch behavior.  

This scope was chosen to reflect the real-world constraints that small creators face, relying only on lightweight, accessible inputs.

---

## 6. Approach

We collected metadata with the YouTube Data API and performed preprocessing and feature engineering in Python. Four models were tested: Linear Regression (as a baseline), SVR, XGBoost, and CatBoost. The gradient boosting models were selected for their effectiveness on structured (tabular) data. Feature scaling was applied where needed, and the data was split 80/20 into training and test sets.
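The collection and splitting steps can be sketched as follows. This is a simplified illustration under stated assumptions: the sample response mimics the general shape of a YouTube Data API `videos.list` result (counts arrive as strings, and `likeCount` may be absent), while the helper names `flatten_item` and `split_80_20`, the seed, and the sample values are hypothetical and not taken from the project.

```python
import random


def flatten_item(item):
    """Flatten one videos.list-style item into a flat record (hypothetical helper)."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})

    def to_int(value):
        # Counts are returned as strings; a missing count becomes None.
        return int(value) if value is not None else None

    return {
        "video_id": item.get("id"),
        "title": snippet.get("title", ""),
        "description": snippet.get("description", ""),
        "published_at": snippet.get("publishedAt"),
        "views": to_int(stats.get("viewCount")),
        "likes": to_int(stats.get("likeCount")),
    }


def split_80_20(records, seed=42):
    """Shuffle reproducibly, then split into 80% train and 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Invented stand-in for an API response; no network call is made here.
sample_response = {
    "items": [
        {
            "id": f"vid{i}",
            "snippet": {
                "title": f"Video {i}",
                "description": "",
                "publishedAt": "2024-03-10T15:30:00Z",
            },
            "statistics": {"viewCount": str(100 * (i + 1)), "likeCount": str(i)},
        }
        for i in range(10)
    ]
}
# Simulate a video whose like count is not exposed.
sample_response["items"][0]["statistics"].pop("likeCount")

records = [flatten_item(item) for item in sample_response["items"]]
train, test = split_80_20(records)
print(len(train), len(test))
```

Fixing the shuffle seed keeps the split reproducible, so every model is compared on the same held-out videos.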

---

## 7. Limitations

- High variance in video views  
- Absence of content-based features  
- No access to recommendation-system signals or user-level behavior  
- Limited text signal due to sparse or generic video titles

---

## 8. Solution Details

Although metadata does not support highly accurate predictions, it does offer descriptive insights for content planning. For example, our analysis suggests that videos published on Sundays and Mondays perform better on average. Ratios such as likes per view were more useful than raw counts. Given the models' weak predictive performance, creators should treat these patterns as planning heuristics rather than guarantees.
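A publish-day comparison of this kind can be reproduced with a simple group-by. The sketch below uses invented records and a hypothetical helper name, `mean_views_by_weekday`; it illustrates the style of analysis, not the study's data or results.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_views_by_weekday(records):
    """Average view count per publish weekday (hypothetical helper).

    records: iterable of (published_at, views) pairs, where published_at is
    assumed to be an ISO 8601 UTC string such as "2024-03-10T09:00:00Z".
    """
    totals = defaultdict(lambda: [0, 0])  # weekday -> [view sum, video count]
    for published_at, views in records:
        published = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%SZ")
        day = WEEKDAYS[published.weekday()]
        totals[day][0] += views
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in totals.items()}


# Invented example records, not drawn from the collected dataset.
sample = [
    ("2024-03-10T09:00:00Z", 1200),  # Sunday
    ("2024-03-10T18:00:00Z", 800),   # Sunday
    ("2024-03-11T12:00:00Z", 1500),  # Monday
    ("2024-03-13T12:00:00Z", 400),   # Wednesday
]

averages = mean_views_by_weekday(sample)
print(averages)
```

Since view counts are highly variable, a per-day mean can be dominated by a few viral videos, so checking the number of videos behind each average is a sensible companion step.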

---

## 9. Concluding Summary

This white paper demonstrates that while predicting engagement using only metadata has limitations, it is a valuable starting point for creators without access to richer tools. The results reveal the need for combining metadata with contextual or content-driven features. Future work should focus on extending feature sets and applying more advanced modeling techniques.

---

## 10. Call to Action

We encourage content creators, analysts, and researchers to explore enhanced modeling approaches using richer data sources, such as content embeddings or viewer demographics, to build more accurate forecasting tools for digital engagement.