# 🌐  **The Impact of Immigration on the UK – A YouTube Sentiment Analysis**

### Table of Contents

1. **Phase 1: Introduction**
   - [Overview](#overview)
   - [Project Objectives](#project-objectives)
   - [Google Developer Account Setup](#google-developer-account-setup)

2. **Phase 2: Data Collection**
   - [YouTube API Registration and Authentication](#youtube-api-registration-authentication)
   - [Writing Python Scripts to Fetch Comments](#writing-python-scripts-to-fetch-comments)

3. **Phase 3: Database Setup**
   - [Setting up a MySQL Database](#setting-up-a-mysql-database)
   - [Creating Tables in the Database](#creating-tables-in-the-database)
   - [Database Connection Testing](#database-connection-testing)

4. **Phase 4: ETL Pipeline Development**
   - [Building Airflow DAGs](#building-airflow-dags)
   - [Data Cleaning and Storage](#data-cleaning-and-storage)
   - [Database Integration Testing](#database-integration-testing)

5. **Phase 5: Train and Evaluate Sentiment Models**
   - [Conducting EDA and Preprocessing Comments Data](#conducting-eda-and-preprocessing-comments-data)
   - [TextBlob Sentiment Analysis](#textblob-sentiment-analysis)
   - [VADER Sentiment Analysis](#vader-sentiment-analysis)
   - [Comparing TextBlob and VADER](#comparing-textblob-and-vader)

6. **Phase 6: Model Deployment**
   - [Building a Streamlit App](#building-a-streamlit-app)
   - [Deploying the App on Streamlit/Huggingface](#deploying-the-app-on-streamlit-huggingface)


# 📚 1. **Introduction**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section introduces the project, sets out its objectives, and walks through the Google Developer Account setup.

</div>

### 🔍 **Overview**

This project analyzes the sentiment of YouTube comments using the YouTube API and Natural Language Processing (NLP) techniques. The workflow involves collecting comments from selected YouTube videos, preprocessing the data, training sentiment analysis models, and deploying the models in a web application for real-time analysis.

The YouTube comments are fetched using the YouTube API and classified into **positive, negative, or neutral** sentiments using a Hugging Face machine learning model. An **ETL pipeline** built with **Apache Airflow** automates the extraction, transformation, and storage of the data. The trained model is then deployed using **Streamlit** or **Hugging Face Spaces** for easy accessibility.
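
The three-way labelling described above can be sketched for score-based tools such as TextBlob and VADER (both listed under Phase 5), whose polarity and compound scores fall in the range -1 to 1. The function name `label_sentiment` and the ±0.05 cut-off are assumptions made for illustration; the cut-off follows the convention commonly used with VADER's compound score and may need tuning for this dataset.

```python
def label_sentiment(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a three-way sentiment label.

    The threshold is an assumed default, not a value taken from the project.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, for illustration only
for example_score in (0.62, -0.40, 0.01):
    print(example_score, label_sentiment(example_score))
```

Keeping the threshold as a parameter makes it easy to compare how many comments land in the neutral band under different cut-offs.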

### 📹 **YouTube Videos Analyzed**
1. **The “Migrant Crisis” Destroying Britain** – [Watch Here](https://www.youtube.com/watch?v=2F26DR3Be4g) *(The Young Turks)*
2. **The UK far-right blames immigrants for economic woes** – [Watch Here](https://www.youtube.com/watch?v=vAVsczZ3Oqk) *(Al Jazeera English)*

---

### 🎨 **Project Objectives**

- ✅ **Data Collection & Storage:** Fetch comments from YouTube videos using the **YouTube Data API** and store them in a structured database.
- ⚙️ **ETL Pipeline Automation:** Build an **Apache Airflow** pipeline to automate data fetching, cleaning, and storage.
- 🤖 **Sentiment Analysis Model:** Train a sentiment classification model using **Hugging Face transformers** to categorize comments as **positive, negative, or neutral**.
- 🛠️ **Model Deployment:** Deploy the trained model via **Streamlit** or **Hugging Face Spaces** for real-time sentiment analysis.

---

### 💻 **Google Developer Account Setup**

To use the YouTube API, follow these steps to create a Google Developer Account and enable the YouTube Data API:

1. Go to the [Google Developers Console](https://console.developers.google.com/).
2. Click on **"Select a project"** at the top of the screen.
3. Click on **"New Project"** and enter a project name.
4. Click **"Create"** to initialize the project.
5. Navigate to **"APIs & Services"** → **"Library"**.
6. Search for **"YouTube Data API v3"** and click on it.
7. Click **"Enable"** to activate the API for your project.
8. Go to **"Credentials"** and click on **"Create credentials"**.
9. Select **"API key"** and copy the generated key for use in your Python scripts.





# 📚 2. **Data Collection**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section covers collecting data from YouTube through the YouTube Data API, including the Python scripts that fetch the comments.

</div>

In [16]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the API key (stored in .env under the name "api_key")
api_key = os.getenv("api_key")

if not api_key:
    raise ValueError("api_key not loaded. Please check your .env file.")
else:
    print("API key loaded successfully.")

API key loaded successfully.


In [17]:
# Setting Up YouTube API Client
import googleapiclient.discovery
import googleapiclient.errors

api_service_name = "youtube"
api_version = "v3"

# Reuse the key loaded from .env in the previous cell
youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=api_key
)

In [18]:
# import pandas
import pandas as pd

# List of video IDs
video_ids = ["2F26DR3Be4g", "vAVsczZ3Oqk"]

# Fetch the first page (up to 100 top-level comments) for each video
comments = []

for video_id in video_ids:
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=100
    )
    response = request.execute()

    for item in response['items']:
        comment = item['snippet']['topLevelComment']['snippet']
        comments.append([
            video_id,  # Add video ID to identify the source video
            comment['authorDisplayName'],
            comment['publishedAt'],
            comment['updatedAt'],
            comment['likeCount'],
            comment['textDisplay']
        ])

# Create a DataFrame with an additional column for video ID
df = pd.DataFrame(comments, columns=['video_id', 'author', 'published_at', 'updated_at', 'like_count', 'text'])

# Display the DataFrame
df.head()


Unnamed: 0,video_id,author,published_at,updated_at,like_count,text
0,2F26DR3Be4g,@JimmyTheGiant,2024-11-28T11:27:08Z,2024-11-28T11:28:16Z,55,"📢Join my discord: <a href=""https://discord.gg/..."
1,2F26DR3Be4g,@ZIxWicced,2025-02-10T14:04:03Z,2025-02-10T14:04:03Z,0,RIP UK 😔
2,2F26DR3Be4g,@CFCNOTBUMMER,2025-02-10T09:48:48Z,2025-02-10T09:48:48Z,0,"Good lad, keep up the good work"
3,2F26DR3Be4g,@imdoneplus,2025-02-10T03:40:52Z,2025-02-10T03:40:52Z,0,"@<a href=""https://www.youtube.com/watch?v=2F26..."
4,2F26DR3Be4g,@scubatuna,2025-02-10T02:48:56Z,2025-02-10T02:48:56Z,0,why don&#39;t you use Australia as a example? ...
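
The request above returns a single page of at most 100 comment threads per video. When a video has more, the API response carries a `nextPageToken` that can be passed back as `pageToken` to get the next page. The following is a minimal sketch of that loop, assuming the `youtube` client built earlier; the function name `fetch_all_comments` and the `max_pages` safety cap are hypothetical and not part of the original notebook.

```python
def fetch_all_comments(youtube, video_id, max_pages=10):
    """Collect top-level comments for one video by following nextPageToken.

    max_pages is an assumed cap that keeps API quota use bounded.
    """
    rows = []
    page_token = None

    for _ in range(max_pages):
        params = {"part": "snippet", "videoId": video_id, "maxResults": 100}
        if page_token:
            # Only send pageToken once the API has handed one back
            params["pageToken"] = page_token

        response = youtube.commentThreads().list(**params).execute()

        for item in response.get("items", []):
            comment = item["snippet"]["topLevelComment"]["snippet"]
            rows.append([
                video_id,
                comment["authorDisplayName"],
                comment["publishedAt"],
                comment["updatedAt"],
                comment["likeCount"],
                comment["textDisplay"],
            ])

        # No token in the response means the last page has been reached
        page_token = response.get("nextPageToken")
        if not page_token:
            break

    return rows
```

The returned rows use the same column order as the DataFrame above, so they can be passed to `pd.DataFrame(rows, columns=[...])` unchanged.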


In [19]:
# check the columns
df.columns.to_list()

['video_id', 'author', 'published_at', 'updated_at', 'like_count', 'text']
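
The `text` column holds the API's `textDisplay` field, which is HTML: the sample rows above contain entities such as `&#39;` and `<a href=...>` tags. A minimal cleaning sketch using only the standard library is shown below; the helper name `clean_comment` is hypothetical, and the regular expressions are a simple approximation rather than a full HTML parser.

```python
import html
import re

BR_RE = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Turn an HTML-formatted comment into plain text."""
    text = BR_RE.sub(" ", text)    # line breaks become spaces
    text = TAG_RE.sub("", text)    # drop remaining tags, keep their inner text
    text = html.unescape(text)     # decode entities such as &#39; and &quot;
    return SPACE_RE.sub(" ", text).strip()


print(clean_comment("why don&#39;t you use Australia as a example?"))
# prints: why don't you use Australia as a example?
```

Tags are stripped before entities are decoded so that an escaped `&lt;` in a comment is not mistaken for markup. It could be applied with `df['text'].apply(clean_comment)`. Requesting `textFormat="plainText"` in the `commentThreads().list` call may be a simpler alternative.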

# 📚 3. **Database Setup**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section covers setting up the MySQL database and loading the collected comments into it.

</div>

In [35]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Setting up the MySQL connection
user = os.getenv("MYSQL_USERNAME") 
password = os.getenv("MYSQL_PASSWORD") 
host = os.getenv("MYSQL_HOST")

if not user or not password or not host:
    raise ValueError("Missing MySQL credentials. Please check your .env file.")
else:
    print("MySQL credentials loaded successfully.")


MySQL credentials loaded successfully.


In [36]:
# Import the MySQL connector
import mysql.connector

# Database connection (server only, no database selected yet)
def connect_to_db():
    return mysql.connector.connect(
        host=os.getenv("MYSQL_HOST", "localhost"),
        user=os.getenv("MYSQL_USERNAME"),
        password=os.getenv("MYSQL_PASSWORD")
    )

# Create database if it does not exist
def create_database():
    conn = connect_to_db()
    cursor = conn.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS uk_migration_youtube_comments_db")
    cursor.close()
    conn.close()

# Connect to the project database
def connect_to_db_with_db():
    return mysql.connector.connect(
        host=os.getenv("MYSQL_HOST", "localhost"),
        user=os.getenv("MYSQL_USERNAME"),
        password=os.getenv("MYSQL_PASSWORD"),
        database="uk_migration_youtube_comments_db"
    )

# Call the functions
create_database()
db = connect_to_db_with_db()

In [37]:
# Create table
def create_table():
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS youtube_comments (
            video_id VARCHAR(255),
            author VARCHAR(255),
            published_at DATETIME,
            updated_at DATETIME,
            like_count INT,
            text TEXT
        )
    """)
    conn.close()

# Call the function to create the table
create_table()

In [19]:
# Function to insert data into the table
def insert_data(df):
    # Work on a copy so the caller's DataFrame keeps its original timestamps
    data = df.copy()

    # Convert ISO 8601 timestamps to the MySQL DATETIME format
    for col in ('published_at', 'updated_at'):
        data[col] = pd.to_datetime(data[col]).dt.strftime('%Y-%m-%d %H:%M:%S')

    # One tuple per comment, in the column order of the INSERT statement
    rows = [
        (row['video_id'], row['author'], row['published_at'],
         row['updated_at'], int(row['like_count']), row['text'])
        for _, row in data.iterrows()
    ]

    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    try:
        # executemany runs the same parameterized statement for every row
        cursor.executemany("""
            INSERT INTO youtube_comments (video_id, author, published_at, updated_at, like_count, text)
            VALUES (%s, %s, %s, %s, %s, %s)
        """, rows)

        # Commit the transaction
        conn.commit()
    finally:
        # Always release the cursor and connection, even if the insert fails
        cursor.close()
        conn.close()


In [20]:
# Call the function to insert data
insert_data(df)
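
Because the `youtube_comments` table has no unique key, running the insert cell again appends the same comments a second time. One way to make the load repeatable is to store a comment identifier as the primary key and skip rows that already exist. The sketch below illustrates the idea with Python's built-in `sqlite3` as a stand-in, so it runs without a MySQL server; the `comment_id` column is hypothetical and would need to be filled from the `id` field the API returns for each item. In MySQL the equivalent statement is `INSERT IGNORE` with `%s` placeholders.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for MySQL
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE youtube_comments (
        comment_id TEXT PRIMARY KEY,  -- hypothetical column, not in the original schema
        video_id TEXT,
        text TEXT
    )
""")

# Made-up rows, for illustration only
rows = [
    ("c1", "2F26DR3Be4g", "first comment"),
    ("c2", "2F26DR3Be4g", "second comment"),
]


def load(rows):
    """Insert rows, silently skipping any comment_id that is already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO youtube_comments (comment_id, video_id, text) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


load(rows)
load(rows)  # the second run adds nothing

count = conn.execute("SELECT COUNT(*) FROM youtube_comments").fetchone()[0]
print(count)  # prints 2
```

Skipping duplicates keeps the first stored version of each comment; if updated like counts matter, an upsert would be needed instead.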

In [38]:
# Function to execute SQL queries and display results
def execute_query(query):
    conn = connect_to_db_with_db()
    cursor = conn.cursor()
    cursor.execute(query)
    
    # Fetch results and convert to a DataFrame
    result = cursor.fetchall()
    columns = [desc[0] for desc in cursor.description]  # Column names
    
    cursor.close()
    conn.close()
    
    return pd.DataFrame(result, columns=columns)


sqlquery = """

SELECT * FROM youtube_comments;

"""
# Execute the query and display the results
result_df = execute_query(sqlquery)
result_df

Unnamed: 0,video_id,author,published_at,updated_at,like_count,text
0,2F26DR3Be4g,@JimmyTheGiant,2024-11-28 11:27:08,2024-11-28 11:28:16,53,"📢Join my discord: <a href=""https://discord.gg/..."
1,2F26DR3Be4g,@ericmorneau8819,2025-01-28 18:04:38,2025-01-28 18:04:38,0,Isn&#39;t there a Dead Kennedy&#39;s song abou...
2,2F26DR3Be4g,@Andrewlang90,2025-01-28 05:24:26,2025-01-28 05:24:26,0,"While I agree, this is a global problem, but I..."
3,2F26DR3Be4g,@SC-zh6li,2025-01-27 23:47:34,2025-01-27 23:47:34,0,They mentioned targetting car washes and nailb...
4,2F26DR3Be4g,@AmatoryLayne2.0,2025-01-27 18:07:01,2025-01-27 18:07:01,0,Well damn and I thought we were having it bad ...
...,...,...,...,...,...,...
595,vAVsczZ3Oqk,@oolong2,2024-08-12 18:11:06,2024-08-12 18:11:06,4,&quot;Blaming immigrants&quot; is probably the...
596,vAVsczZ3Oqk,@samirkhan-mb4yp,2024-08-12 17:57:12,2024-08-12 17:57:12,0,When you are enemy of Muslim and migrant peopl...
597,vAVsczZ3Oqk,@UkUkBDIndia,2024-08-12 17:55:18,2024-08-12 17:55:18,0,the english wants everything for free ...
598,vAVsczZ3Oqk,@lilliankeane5731,2024-08-12 17:40:03,2024-08-12 17:40:03,0,It’s not the poor people that is the problem i...


# 📚 4. **ETL Pipeline Development**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section deals with ...

</div>

# 📚 5. **Train and Evaluate Sentiment Models**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section deals with ...

</div>

# 📚 6. **Model Deployment**
___

<div style="font-family: Avenir, sans-serif; font-size: 16px; line-height: 1.6; color: white; background-color: #333; padding: 10px; border-radius: 5px;">
This section deals with ...

</div>