# **Aim**

To analyze Jamaican-related content on YouTube in order to identify the categories, creators, and themes that resonate most with viewers, and to track how interest in Jamaican content has evolved over time. The analysis is intended to give content creators actionable insights for aligning their videos with audience preferences.





# **Research Questions**

What types of Jamaican-related content resonate most with audiences on YouTube, as measured by views, likes, comments, and engagement rate?

How has Jamaican YouTube content evolved between 2019 and 2024 across themes such as music, food, politics, sports, and travel?

What role do Jamaican creators vs. non-Jamaican creators play in shaping content about Jamaica?

How do audiences’ interests in Jamaican content shift over time?



# **Definition of Jamaican Content**



A video will be classified as Jamaican-related content if it satisfies **at least one of** the following:

**Channel-based criterion**

The channel country is listed as Jamaica (channel.snippet.country="JM" in metadata).

OR the channel is identified as owned/operated by a Jamaican creator (to be confirmed through metadata/manual checks).

**Video-based criterion**

The video explicitly mentions “Jamaica” or Jamaican-related keywords in the title, description, tags, or transcript.

OR the video’s subject matter is directly related to Jamaica (e.g., Jamaican food, music, sports, culture, politics).

**Hybrid criterion**

A video produced by a diaspora Jamaican creator (channel.snippet.country ≠ "JM", but the channel's content consistently focuses on Jamaican themes).
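
The three criteria can be sketched as a single rule. This is a minimal illustration, not the notebook's actual classifier: the keyword list, the `known_jamaican_channels` set (standing in for the manually confirmed Jamaican and diaspora creators), and the flat `video`/`channel` dictionaries are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; extend it after inspecting real titles and tags.
JAMAICA_KEYWORDS = ("jamaica", "jamaican", "kingston", "dancehall", "patois")
KEYWORD_RE = re.compile(r"\b(" + "|".join(JAMAICA_KEYWORDS) + r")\b", re.IGNORECASE)


def is_jamaican_related(video, channel, known_jamaican_channels=frozenset()):
    """Return the name of the first matching criterion, or None."""
    # Channel-based (and hybrid) criterion: country code "JM", or a channel
    # that was manually confirmed as Jamaican or diaspora-run.
    if channel.get("country") == "JM" or channel.get("id") in known_jamaican_channels:
        return "channel"
    # Video-based criterion: keyword in title, description, tags, or transcript.
    text = " ".join([
        video.get("title", ""),
        video.get("description", ""),
        " ".join(video.get("tags", [])),
        video.get("transcript", ""),
    ])
    if KEYWORD_RE.search(text):
        return "video"
    return None
```

Returning the criterion name rather than a bare boolean keeps a record of why each video was included, which may help when auditing borderline cases.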



# **Background Requirements**



**YouTube API Access:** A developer key from Google Cloud to fetch video and channel metadata.

**Libraries:**

google-api-python-client (fetch metadata)

youtube_transcript_api (pull transcripts when available)

pandas / numpy (data wrangling)

isodate (parse durations)

NLP/AI tools (scikit-learn, transformers or openai embeddings) for clustering/classification.

**Data Scope:**

~500–1000 videos initially.

Videos collected using broad "Jamaica" keyword + channel.snippet.country="JM".

**Metadata timeframe:** 2019–2024.



# **Methodology**



**Step 1: Data Collection**

Use the YouTube Data API to collect video-level metadata (title, description, tags, statistics, publish date, etc.) and channel-level metadata (country, subscriber count, total views, creation date).

Expand scope with transcript data when available.

**Step 2: Data Cleaning**

Identify and filter out irrelevant videos (e.g., “Jamaica Ave NYC” or “Agua de Jamaica drink”).

Create a “True Jamaican Content” flag via manual labeling of a sample, then scale with simple ML classification.
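
A first pass at the filtering step could be a small exclusion list applied before any manual labeling. The sketch below is only an illustration: the two patterns come from the examples above, and a real list would presumably grow as more false positives turn up. The function name `looks_irrelevant` is hypothetical.

```python
import re

# Illustrative exclusion patterns based on the two examples in the text.
FALSE_POSITIVE_PATTERNS = [
    r"\bjamaica\s+(ave|avenue)\b",  # Jamaica Avenue in Queens, NYC
    r"\bagua\s+de\s+jamaica\b",     # hibiscus drink
]
FALSE_POSITIVE_RE = re.compile("|".join(FALSE_POSITIVE_PATTERNS), re.IGNORECASE)


def looks_irrelevant(title, description=""):
    """True when the title or description matches a known false positive."""
    return bool(FALSE_POSITIVE_RE.search(f"{title} {description}"))
```

Videos flagged this way are only candidates for removal; a video about Jamaican food that happens to mention agua de Jamaica would also match, so a manual check of the flagged rows is still advisable.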

**Step 3: Categorization**

Use a hybrid approach:

Start with metadata (tags, titles, transcripts).

Apply AI-assisted clustering to group content into emergent categories (e.g., music, food, travel, politics, sports).
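
Before any clustering, the metadata-first half of the hybrid approach can be approximated with keyword scoring. The vocabulary below is a hypothetical seed list, and the tokenization is a plain whitespace split, so this is a rough baseline rather than the clustering step itself.

```python
from collections import Counter

# Hypothetical seed vocabulary for the themes named in the text.
CATEGORY_KEYWORDS = {
    "music": {"reggae", "dancehall", "riddim", "song", "album"},
    "food": {"jerk", "recipe", "cooking", "patty", "ackee"},
    "travel": {"travel", "resort", "beach", "vlog", "tour"},
    "politics": {"election", "parliament", "minister", "government"},
    "sports": {"sprint", "athletics", "football", "cricket", "olympics"},
}


def seed_category(text):
    """Pick the category whose keywords occur most often in the text."""
    tokens = Counter(text.lower().split())
    scores = {
        cat: sum(tokens[word] for word in words)
        for cat, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first category listed
    return best if scores[best] > 0 else "uncategorized"
```

Videos left as "uncategorized" are natural inputs for the clustering step, since they are the ones the seed vocabulary fails to describe.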

**Step 4: Analysis**

Descriptive stats: Most-viewed, top-engaged, top creators, category breakdowns.

Trend analysis: Category distribution by year (2019–2024).

Engagement analysis: Compare engagement ratios across categories.

Creator analysis: Jamaican channels vs. non-Jamaican channels producing Jamaican content.
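
The text does not define "engagement ratio", so the sketch below assumes one common choice, (likes + comments) / views, and averages it per category. The `category` key is also an assumption, since the collected data only carries YouTube's `category_id` until Step 3 assigns themes.

```python
from collections import defaultdict
from statistics import mean


def engagement_rate(views, likes, comments):
    """(likes + comments) / views, or None when the video has no views."""
    if not views:
        return None
    return (likes + comments) / views


def mean_engagement_by_category(rows):
    """Average the per-video engagement rate within each category."""
    by_category = defaultdict(list)
    for row in rows:
        rate = engagement_rate(row["views"], row["likes"], row["comments"])
        if rate is not None:
            by_category[row["category"]].append(rate)
    return {cat: mean(rates) for cat, rates in by_category.items()}
```

Averaging per-video rates weights every video equally; dividing total interactions by total views per category would instead let the most-viewed videos dominate. Which one suits the analysis depends on the question being asked.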

**Step 5: Visualization & Insights**

Timeline of Jamaican content evolution (stacked by categories).

Heatmap of engagement by theme.

Top Jamaican-related creators and videos.
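
The stacked timeline needs a count (or share) of videos per year and category. A minimal sketch of that aggregation follows, assuming each row has an ISO 8601 `publish_date` string and a `category` label assigned in Step 3; both function names are hypothetical.

```python
from collections import Counter


def counts_by_year_and_category(rows):
    """Count videos per (year, category) pair."""
    return Counter((int(row["publish_date"][:4]), row["category"]) for row in rows)


def share_by_year(counts):
    """Convert counts into each category's share of that year's videos."""
    totals = Counter()
    for (year, _category), n in counts.items():
        totals[year] += n
    return {(year, cat): n / totals[year] for (year, cat), n in counts.items()}
```

Plotting shares rather than raw counts may be preferable here, because the number of videos collected per year depends on how the API ranks search results, not only on how much was published.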

# **Python Code**

🔹 Step 1: Install the isodate Package

In [None]:
!pip install isodate

Collecting isodate
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: isodate
Successfully installed isodate-0.7.2


 🔹 Step 2: Connect to the API

In [None]:
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google.colab import userdata
import pandas as pd
#import getpass
import isodate
import time

# 🔑 Read the YouTube API key from Colab secrets so it never appears in the output
# (outside Colab, getpass.getpass("Enter your YouTube API Key: ") is an alternative)
API_KEY = userdata.get('YOUTUBE_DATA_V3_KEY')

# Initialize YouTube API client
youtube = build("youtube", "v3", developerKey=API_KEY)


🔹 Step 2b: Safe API Call Wrapper

In [None]:
def safe_api_call(request, retries=3, backoff=5):
    """Executes a YouTube API request with retries & quota handling."""
    for attempt in range(retries):
        try:
            return request.execute()
        except HttpError as e:
            error_msg = str(e)
            print(f"Attempt {attempt + 1}: API Error - {error_msg}")

            if "quotaExceeded" in error_msg:
                print("❌ Daily quota exceeded. Try again tomorrow.")
                return None
            elif "userRateLimitExceeded" in error_msg or "rateLimitExceeded" in error_msg:
                wait = backoff * (2 ** attempt)  # Exponential backoff
                print(f"⚠ Rate limit hit. Retrying in {wait} seconds...")
                time.sleep(wait)
                continue
            elif "badRequest" in error_msg or "400" in error_msg:
                print("❌ Bad request - check your parameters")
                return None
            else:
                print(f"❌ API error: {error_msg}")
                return None

    print("❌ Max retries exceeded")
    return None
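
The wrapper above decides what to do by searching the error's string form, which can misfire (for example, "400" may appear in a URL). A possible refinement, sketched under the assumption that the error body follows the usual Google API JSON layout (`error.errors[0].reason`), is to read the reason field directly. The helper name `extract_error_reason` is hypothetical.

```python
import json


def extract_error_reason(content):
    """Return the first `reason` in a Google API error body, or None."""
    try:
        payload = json.loads(content)
        return payload["error"]["errors"][0]["reason"]
    except (ValueError, KeyError, IndexError, TypeError):
        # Not JSON, or not in the expected layout: let the caller fall back
        # to the string-based checks.
        return None


# Example body in the assumed layout.
sample = b'{"error": {"code": 403, "errors": [{"reason": "quotaExceeded"}]}}'
print(extract_error_reason(sample))  # quotaExceeded
```

Inside the `except HttpError as e:` branch this would be called as `extract_error_reason(e.content)`, with the existing substring checks kept as a fallback when it returns `None`.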

🔹 Step 3: Search for Channels Matching the Keyword "Jamaica"

In [None]:
def search_jamaican_channels(max_results=50, page_token=None):
    """Search for channels related to Jamaica"""
    try:
        request = youtube.search().list(
            part="snippet",
            type="channel",
            q="Jamaica",
            maxResults=min(max_results, 50),  # API limit is 50
            pageToken=page_token
        )
        response = safe_api_call(request)
        return response.get("items", []) if response else []
    except Exception as e:
        print(f"Error in search_jamaican_channels: {e}")
        return []
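
`search_jamaican_channels` accepts a `page_token` but returns only the items, so a caller cannot follow `nextPageToken` to later pages. One way to page through results is a small generic loop, sketched below with a stand-in for the API. `collect_pages` and `FAKE_PAGES` are hypothetical names, and `fetch_page` is assumed to return the full response dictionary rather than just its items.

```python
def collect_pages(fetch_page, max_pages=5):
    """Call fetch_page(token) repeatedly, following nextPageToken."""
    items, token = [], None
    for _ in range(max_pages):
        response = fetch_page(token)
        if not response:
            break
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:
            break
    return items


# Stand-in for the API: two pages keyed by page token.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"]},
}
print(collect_pages(FAKE_PAGES.get))  # ['a', 'b', 'c']
```

With the real client, `fetch_page` could wrap `safe_api_call(youtube.search().list(..., pageToken=token))`. Each extra page of `search.list` consumes quota, so `max_pages` should stay small.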


🔹Step 4: Filter Channels by Country

In [None]:
def get_channel_metadata(channel_ids):
    """Get channel metadata and filter by country=JM"""
    if not channel_ids:
        return []

    # Split into batches of 50 (API limit)
    channels = []
    for i in range(0, len(channel_ids), 50):
        batch = channel_ids[i:i+50]

        request = youtube.channels().list(
            part="snippet,statistics,contentDetails",
            id=",".join(batch)
        )
        response = safe_api_call(request)

        if not response:
            continue

        for item in response.get("items", []):
            country = item["snippet"].get("country", "Unknown")
            if country == "JM":  # ✅ Jamaican channel
                channels.append({
                    "id": item["id"],
                    "title": item["snippet"]["title"],
                    "country": country
                    #"subscriber_count": int(item["statistics"].get("subscriberCount", 0) or 0),
                    #"video_count": int(item["statistics"].get("videoCount", 0) or 0)
                })

    return channels
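
Because the filter keeps only channels whose `country` is exactly "JM", channels that never set a country are dropped along with foreign ones. In the recorded run of Step 10a, only 2 of 50 candidate channels passed. A quick tally of country values, sketched below on hand-made items shaped like `channels.list` results, can show how many candidates are lost to a missing field rather than to a different country. `country_breakdown` is a hypothetical helper.

```python
from collections import Counter


def country_breakdown(channel_items):
    """Count channels per country code, with 'Unknown' for a missing field."""
    return Counter(
        item.get("snippet", {}).get("country", "Unknown")
        for item in channel_items
    )


# Hand-made items in the shape of channels.list results.
sample_items = [
    {"id": "c1", "snippet": {"title": "A", "country": "JM"}},
    {"id": "c2", "snippet": {"title": "B", "country": "US"}},
    {"id": "c3", "snippet": {"title": "C"}},
]
print(country_breakdown(sample_items))
```

If "Unknown" dominates the tally, those channels are good candidates for the manual check described under the channel-based criterion.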

🔹Step 5: Fetch Uploads from Jamaican Channels

In [None]:
def get_channel_uploads(channel_id, max_results=50, max_pages=8):
    """Get uploads from a channel's uploads playlist"""
    try:
        # Get the uploads playlist ID
        request = youtube.channels().list(
            part="contentDetails",
            id=channel_id
        )
        response = safe_api_call(request)

        if not response or not response.get("items"):
            print(f"No channel found for ID: {channel_id}")
            return []

        uploads_playlist_id = response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

        # Fetch videos from uploads playlist
        playlist_items = []
        next_page = None
        page_count = 0

        while page_count < max_pages:
            playlist_request = youtube.playlistItems().list(
                part="snippet",
                playlistId=uploads_playlist_id,
                maxResults=min(max_results, 50),
                pageToken=next_page
            )

            playlist_response = safe_api_call(playlist_request)
            if not playlist_response:
                break

            items = playlist_response.get("items", [])
            if not items:
                break

            playlist_items.extend(items)
            next_page = playlist_response.get("nextPageToken")
            page_count += 1

            if not next_page:
                break

            # Small delay to avoid rate limiting
            time.sleep(0.1)

        return playlist_items

    except Exception as e:
        print(f"Error getting uploads for channel {channel_id}: {e}")
        return []

🔹 Step 6: Search for Jamaican Content

In [None]:
def search_videos(query, max_results=20, region_code="JM", max_pages=90):
    """Search for videos with specific query"""
    videos = []
    next_page = None
    page_count = 0

    while page_count < max_pages:
        try:
            request = youtube.search().list(
                q=query,
                part="snippet",
                type="video",
                maxResults=min(max_results, 50),
                regionCode=region_code,
                pageToken=next_page,
                publishedAfter="2019-01-01T00:00:00Z"
            )

            response = safe_api_call(request)
            if not response:
                break

            items = response.get("items", [])
            if not items:
                break

            videos.extend(items)
            next_page = response.get("nextPageToken")
            page_count += 1

            if not next_page:
                break

            time.sleep(0.1)  # Small delay

        except Exception as e:
            print(f"Error in search_videos: {e}")
            break

    return videos
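
Paging through search results is the most quota-hungry part of this notebook. The sketch below estimates the cost of a run using the per-call unit costs and the default daily allowance as I understand them from the YouTube Data API v3 quota documentation. Treat the numbers as assumptions and check them against the current quota calculator.

```python
# Unit costs per call (assumed from the API's quota documentation).
QUOTA_COST = {
    "search.list": 100,
    "videos.list": 1,
    "channels.list": 1,
    "playlistItems.list": 1,
}
DEFAULT_DAILY_QUOTA = 10_000  # assumed default allowance per project


def estimate_quota(calls):
    """Sum the quota units for a mapping of method name -> number of calls."""
    return sum(QUOTA_COST[method] * count for method, count in calls.items())


# Worst case for search_videos with max_pages=90.
worst_case = estimate_quota({"search.list": 90})
print(worst_case, worst_case / DEFAULT_DAILY_QUOTA)  # 9000 0.9
```

Under these assumptions, a full 90-page search would use most of a day's allowance on its own. Since each call costs the same regardless of page size, raising `max_results` from 20 to 50 would fetch the same number of videos in fewer calls.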

🔹Step 7: Get Video Details

In [None]:
def get_video_details(video_ids):
    """Get detailed video information"""
    if not video_ids:
        return []

    all_videos = []
    # Process in batches of 50 (API limit)
    for i in range(0, len(video_ids), 50):
        batch = video_ids[i:i+50]

        request = youtube.videos().list(
            part="snippet,statistics,contentDetails",
            id=",".join(batch)
        )
        response = safe_api_call(request)

        if response:
            all_videos.extend(response.get("items", []))

        time.sleep(0.1)  # Small delay between requests

    return all_videos

 🔹 Step 8: Store Video Data into a DataFrame

In [None]:
def parse_videos(video_details):
    """Parse video details into DataFrame format"""
    data = []
    for video in video_details:
        try:
            snippet = video["snippet"]
            stats = video.get("statistics", {})
            content = video["contentDetails"]

            # Parse duration safely
            try:
                duration = isodate.parse_duration(content["duration"]).total_seconds()
            except Exception:
                duration = None

            data.append({
                "video_id": video["id"],
                "title": snippet["title"],
                "description": snippet.get("description", ""),
                "tags": ", ".join(snippet.get("tags", []) or []),
                "channel_id": snippet["channelId"],
                "channel_title": snippet["channelTitle"],
                "publish_date": snippet["publishedAt"],
                "duration_seconds": duration,
                "views": int(stats.get("viewCount", 0) or 0),
                "likes": int(stats.get("likeCount", 0) or 0),
                "comments": int(stats.get("commentCount", 0) or 0),
                "category_id": snippet.get("categoryId", "")
            })
        except Exception as e:
            print(f"Error parsing video {video.get('id', 'unknown')}: {e}")
            continue

    return pd.DataFrame(data)
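
The `duration` field arrives as an ISO 8601 duration string such as `PT4M13S`, which is why `isodate.parse_duration` is needed. To make the conversion concrete, here is a dependency-free sketch that handles only the day, hour, minute, and second components. It is an illustration of the format, not a replacement for `isodate`, which covers the full standard.

```python
import re

DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(value):
    """Convert e.g. 'PT4M13S' to seconds; return None if the format is unknown."""
    match = DURATION_RE.match(value)
    if not match:
        return None
    parts = {name: int(num) if num else 0 for name, num in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT4M13S"))   # 253
print(duration_to_seconds("PT1H2M3S"))  # 3723
```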


 🔹 Step 9: Get Channel Details

In [None]:
def get_channel_statistics(channel_ids):
    """Get channel statistics with batch processing for API efficiency"""
    if not channel_ids:
        return pd.DataFrame()

    channel_data = []

    # Process in batches of 50 (API limit)
    for i in range(0, len(channel_ids), 50):
        batch = channel_ids[i:i+50]

        request = youtube.channels().list(
            part="snippet,statistics",
            id=",".join(batch)
        )
        response = safe_api_call(request)

        if not response:
            continue

        for channel in response.get("items", []):
            snippet = channel["snippet"]
            stats = channel["statistics"]

            channel_data.append({
                "channel_id": channel["id"],
                "channel_start_date": snippet["publishedAt"],
                "channel_country": snippet.get("country", "Unknown"),
                "subscriber_count": int(stats.get("subscriberCount", 0) or 0),
                "channel_total_views": int(stats.get("viewCount", 0) or 0),
                "channel_video_count": int(stats.get("videoCount", 0) or 0)
            })

        # Small delay between batch requests
        if i + 50 < len(channel_ids):
            time.sleep(0.5)

    return pd.DataFrame(channel_data)

🔹 Step 10a: Collect Videos from Jamaican Channels

In [None]:
print("🔍 Step 10a: Collecting videos from Jamaican channels...")

jamaican_channels = search_jamaican_channels(max_results=50)

# Extract channel IDs, skipping malformed search results
jamaican_channel_ids = []
for ch in jamaican_channels:
    try:
        if (isinstance(ch.get("id"), dict) and
            "channelId" in ch["id"] and
            ch["id"]["channelId"]):
            jamaican_channel_ids.append(ch["id"]["channelId"])
    except (KeyError, TypeError) as e:
        print(f"Skipping invalid channel entry: {e}")
        continue

print(f"Found {len(jamaican_channel_ids)} channel IDs to process")

# Filter channels by country first (more efficient)
if jamaican_channel_ids:
    print("🔍 Filtering channels by country (JM)...")
    jamaican_channels_filtered = get_channel_metadata(jamaican_channel_ids)
    verified_channel_ids = [ch["id"] for ch in jamaican_channels_filtered]
    print(f"Verified {len(verified_channel_ids)} Jamaican channels")
else:
    print("❌ No channel IDs found")
    verified_channel_ids = []

# Collect videos from verified Jamaican channels
all_channel_videos = []
if verified_channel_ids:
    for i, cid in enumerate(verified_channel_ids):
        print(f"📹 Processing channel {i+1}/{len(verified_channel_ids)}: {cid}")

        uploads = get_channel_uploads(cid, max_results=30, max_pages=8)  # up to 8 pages of 30 videos per channel

        video_ids = []
        for item in uploads:
            try:
                if ("snippet" in item and
                    "resourceId" in item["snippet"] and
                    "videoId" in item["snippet"]["resourceId"]):
                    video_ids.append(item["snippet"]["resourceId"]["videoId"])
            except (KeyError, TypeError):
                continue

        # Get video details if we have valid IDs
        if video_ids:
            details = get_video_details(video_ids)
            if details:
                all_channel_videos.extend(details)
                print(f"  ✅ Collected {len(details)} videos")
            else:
                print(f"  ⚠ No video details retrieved")
        else:
            print(f"  ⚠ No video IDs found")

        # Small delay between channels
        time.sleep(1)

# Parse channel videos
df_channel_videos = parse_videos(all_channel_videos)

# Filter by date if we have data
if not df_channel_videos.empty:
    df_channel_videos['publish_date'] = pd.to_datetime(df_channel_videos['publish_date'])
    initial_count = len(df_channel_videos)
    df_channel_videos = df_channel_videos[df_channel_videos['publish_date'] >= '2019-01-01']
    final_count = len(df_channel_videos)
    print(f"✅ Channel videos: {final_count} (filtered from {initial_count} total)")
else:
    print("❌ No channel videos found for the specified criteria")

🔍 Step 10a: Collecting videos from Jamaican channels...
Found 50 channel IDs to process
🔍 Filtering channels by country (JM)...
Verified 2 Jamaican channels
📹 Processing channel 1/2: UCEL9_gFV9YRasbbsu5v8Ynw
  ✅ Collected 240 videos
📹 Processing channel 2/2: UC1c6TamEwT02iC4LKv9WGlQ
  ✅ Collected 179 videos
✅ Channel videos: 419 (filtered from 419 total)


🔹 Step 10b: Videos from Keyword Search

In [None]:
print("\n🔍 Step 10b: Collecting videos from keyword search...")

search_results = search_videos("Jamaica", max_results=20, max_pages=90)  # up to 90 pages of 20 results

# Enhanced video ID extraction with validation
search_video_ids = []
for item in search_results:
    try:
        if (isinstance(item, dict) and
            isinstance(item.get("id"), dict) and
            item["id"].get("videoId")):
            search_video_ids.append(item["id"]["videoId"])
    except (KeyError, TypeError):
        continue

# Remove any None or empty values
search_video_ids = [vid_id for vid_id in search_video_ids if vid_id and isinstance(vid_id, str)]

print(f"Found {len(search_video_ids)} video IDs from keyword search")

# Initialize empty DataFrame
df_search_videos = pd.DataFrame()

# Get video details if we have IDs
if search_video_ids:
    print("📹 Getting details for search videos...")
    search_details = get_video_details(search_video_ids)
    if search_details:
        df_search_videos = parse_videos(search_details)
        print(f"✅ Search videos: {len(df_search_videos)}")
    else:
        print("❌ No search video details retrieved")
else:
    print("❌ No search videos found for the specified criteria")


🔍 Step 10b: Collecting videos from keyword search...
Found 598 video IDs from keyword search
📹 Getting details for search videos...
✅ Search videos: 598


🔹 Step 10c: Combine & Deduplicate

In [None]:
print("\n🔄 Step 10c: Combining and deduplicating data...")

# Combine dataframes
dataframes_to_combine = []
if not df_channel_videos.empty:
    df_channel_videos['source'] = 'channel_search'
    dataframes_to_combine.append(df_channel_videos)
    print(f"  Channel videos: {len(df_channel_videos)}")

if not df_search_videos.empty:
    df_search_videos['source'] = 'keyword_search'
    dataframes_to_combine.append(df_search_videos)
    print(f"  Search videos: {len(df_search_videos)}")

if dataframes_to_combine:
    df_videos_combined = pd.concat(dataframes_to_combine, ignore_index=True)

    # Show deduplication stats
    before_dedup = len(df_videos_combined)
    df_videos_combined.drop_duplicates(subset="video_id", inplace=True)
    after_dedup = len(df_videos_combined)

    print(f"✅ Combined: {after_dedup} unique videos (removed {before_dedup - after_dedup} duplicates)")
else:
    print("❌ No data to combine")
    df_videos_combined = pd.DataFrame()


🔄 Step 10c: Combining and deduplicating data...
  Channel videos: 419
  Search videos: 598
✅ Combined: 901 unique videos (removed 116 duplicates)
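
In this run, 419 + 598 = 1,017 rows went in and 116 were dropped. Because `drop_duplicates` keeps the first occurrence by default and the channel rows are concatenated first, a video found by both routes keeps only the `channel_search` label. If the overlap between the two sources matters for the analysis, the labels can be merged instead. The sketch below shows the idea on plain dictionaries; `merge_sources` is a hypothetical helper, not part of the notebook.

```python
def merge_sources(records):
    """Deduplicate by video_id, keeping the first record and every source label."""
    merged = {}
    for record in records:
        video_id = record["video_id"]
        if video_id not in merged:
            merged[video_id] = {**record, "sources": {record["source"]}}
        else:
            merged[video_id]["sources"].add(record["source"])
    return list(merged.values())


rows = [
    {"video_id": "a", "source": "channel_search"},
    {"video_id": "b", "source": "channel_search"},
    {"video_id": "a", "source": "keyword_search"},
]
for row in merge_sources(rows):
    print(row["video_id"], sorted(row["sources"]))
```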


🔹 Step 10d: Get Channel Stats & Merge

In [None]:
print("\n📊 Step 10d: Getting channel statistics and merging...")

if not df_videos_combined.empty:
    unique_channels = df_videos_combined["channel_id"].unique().tolist()
    print(f"Getting stats for {len(unique_channels)} unique channels...")

    df_channels = get_channel_statistics(unique_channels)

    if not df_channels.empty:
        # Merge with channel stats
        df_final = df_videos_combined.merge(df_channels, on="channel_id", how="left")

        # Ensure 'publish_date' is in datetime format after merge
        try:
            df_final['publish_date'] = pd.to_datetime(df_final['publish_date'], errors='coerce')
        except Exception as e:
            print(f"Warning: Could not convert 'publish_date' to datetime after merge: {e}")

        # Show merge results
        channels_with_stats = df_final['subscriber_count'].notna().sum()
        print(f"✅ Merged channel stats: {channels_with_stats}/{len(df_final)} videos have channel data")

        # Display summary statistics
        print(f"\n📈 Final Dataset Summary:")
        print(f"  Total videos: {len(df_final):,}")
        print(f"  Unique channels: {df_final['channel_id'].nunique():,}")
        print(f"  Total views: {df_final['views'].sum():,}")

        # Add checks for min/max date before printing
        if pd.api.types.is_datetime64_any_dtype(df_final['publish_date']):
             print(f"  Date range: {df_final['publish_date'].min()} to {df_final['publish_date'].max()}")
        else:
             print("  Date range: Could not determine due to mixed data types in 'publish_date'")

        print(f"  Channels with country data: {df_final['channel_country'].notna().sum():,}")

    else:
        print("⚠ No channel statistics retrieved, using video data only")
        df_final = df_videos_combined
else:
    print("❌ No combined data to process")
    df_final = pd.DataFrame()


📊 Step 10d: Getting channel statistics and merging...
Getting stats for 303 unique channels...
✅ Merged channel stats: 901/901 videos have channel data

📈 Final Dataset Summary:
  Total videos: 901
  Unique channels: 303
  Total views: 1,770,210,203
  Date range: 2019-08-16 13:29:41+00:00 to 2025-08-27 04:52:51+00:00
  Channels with country data: 901
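
The reported date range runs to August 2025, past the 2019–2024 timeframe stated under Data Scope, because the collection code sets only a lower bound. A minimal sketch of a window check follows, assuming the study window closes at the end of 2024 (UTC). The constant and function names are hypothetical.

```python
from datetime import datetime, timezone

WINDOW_START = datetime(2019, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 1, 1, tzinfo=timezone.utc)  # exclusive upper bound (assumed)


def in_study_window(published_at):
    """True when an ISO 8601 timestamp such as '2019-08-16T13:29:41Z' is in range."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    timestamp = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    return WINDOW_START <= timestamp < WINDOW_END


print(in_study_window("2019-08-16T13:29:41Z"))  # True
print(in_study_window("2025-08-27T04:52:51Z"))  # False
```

On the DataFrame itself, a comparable filter would compare `df_final["publish_date"]` against `pd.Timestamp("2025-01-01", tz="UTC")`, provided the column is timezone-aware as the printed range suggests.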


🔹 Step 11: Save to CSV/Excel

In [None]:
print("\n💾 Step 11: Saving data...")

if not df_final.empty:
    try:
        # Save as CSV
        csv_filename = f"jamaican_youtube_data_{pd.Timestamp.now().strftime('%Y%m%d_%H%M')}.csv"
        df_final.to_csv(csv_filename, index=False)
        print(f"✅ Saved CSV: {csv_filename}")

        # Save as Excel
        excel_filename = f"jamaican_youtube_data_{pd.Timestamp.now().strftime('%Y%m%d_%H%M')}.xlsx"
        df_final.to_excel(excel_filename, index=False)
        print(f"✅ Saved Excel: {excel_filename}")

        print(f"\n🎉 Data collection complete!")
        print(f"Final dataset: {len(df_final)} videos from {df_final['channel_id'].nunique()} channels")

    except Exception as e:
        print(f"❌ Error saving files: {e}")
        print("Data is still available in df_final variable")
else:
    print("❌ No data to save")

print("\n" + "="*50)
print("COLLECTION SUMMARY")
print("="*50)
if 'df_final' in locals() and not df_final.empty:
    print(f"✅ SUCCESS: {len(df_final)} videos collected")
    print(f"📺 Channels: {df_final['channel_id'].nunique()}")
    print(f"👀 Total Views: {df_final['views'].sum():,}")
    if 'source' in df_final.columns:
        source_counts = df_final['source'].value_counts()
        for source, count in source_counts.items():
            print(f"📊 {source}: {count} videos")
else:
    print("❌ No data collected - check API key and network connection")


💾 Step 11: Saving data...
✅ Saved CSV: jamaican_youtube_data_20250827_0635.csv
❌ Error saving files: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel.
Data is still available in df_final variable

COLLECTION SUMMARY
✅ SUCCESS: 901 videos collected
📺 Channels: 303
👀 Total Views: 1,770,210,203
📊 keyword_search: 482 videos
📊 channel_search: 419 videos


In [None]:
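# Save as CSV
df_final.to_csv("jamaican_youtube_data_20250827_0635.csv", index=False)

# Save as Excel. Excel cannot store timezone-aware datetimes (the error reported
# in Step 11), so drop the timezone from publish_date on a copy first.
df_excel = df_final.copy()
df_excel["publish_date"] = df_excel["publish_date"].dt.tz_localize(None)
df_excel.to_excel("jamaican_youtube_data_20250827_0635.xlsx", index=False)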


Download the csv file

In [None]:
from google.colab import files
files.download("jamaican_youtube_data_20250827_0635.csv")

