# 02 - Extract Social Media Links from Channels

**Objective:** Extract social media links (Instagram, Twitter, TikTok, Discord, etc.) from YouTube channel About sections.

## Overview

This notebook:
1. Reads channel URLs from the Excel file produced by Notebook 01
2. Visits each channel's About page
3. Extracts and normalizes social media links
4. Pivots data so each channel is one row with platform columns
5. Saves results to Excel

## Setup

In [1]:
# Import required libraries
import sys
from pathlib import Path
import pandas as pd
import logging

# Add src directory to path
src_dir = Path.cwd() / 'src'
sys.path.insert(0, str(src_dir))

# Import custom modules - FIXED VERSION
from data_processor_COMPLETE_FIXED import setup_logging, DataProcessor
from youtube_scraper_COMPLETE_FIXED import YouTubeScraper
import config_UPDATED

# Setup logging
setup_logging("INFO")
logger = logging.getLogger(__name__)

print("[OK] All imports successful!")
print(f"[INFO] Using {config_UPDATED.MAX_WORKERS} workers for scraping")
print(f"[INFO] Max retries: {config_UPDATED.MAX_RETRIES}")

[INFO] Configuration loaded successfully
[INFO] Max workers: 6
[INFO] Headless mode: True
[INFO] UTF-8 encoding: True
[OK] All imports successful!
[INFO] Using 6 workers for scraping
[INFO] Max retries: 3


## Step 1: Load Channel URLs

Read the channel URLs from the Excel file produced by Notebook 01 (saved as `channels_found.xlsx` in the run below).

In [2]:
# Read the channelsfound.xlsx file
channels_file = input("Enter path to channelsfound.xlsx file: ").strip().strip('"')

df_channels = pd.read_excel(channels_file)

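# Filter out channels that weren't found (missing values or the 'Not Found' placeholder)
found_mask = df_channels['channel_url'].notna() & (df_channels['channel_url'] != 'Not Found')
channel_urls = df_channels.loc[found_mask, 'channel_url'].tolist()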

print(f"[OK] Loaded {len(channel_urls)} valid channel URLs")
print(f"[INFO] First 5 channels:")
for url in channel_urls[:5]:
    print(f"  - {url}")

Enter path to channelsfound.xlsx file:  C:\Users\ArabTech\02_youtube_channel_scraper\data\processed\channels_found.xlsx


[OK] Loaded 100 valid channel URLs
[INFO] First 5 channels:
  - https://www.youtube.com/@Aymanelgndy_999
  - https://www.youtube.com/@anowyt
  - https://www.youtube.com/@CallMePearlX
  - https://www.youtube.com/@dzairtubetv4557
  - https://www.youtube.com/@IhabSouda-b3q
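
The scraper's internals are not shown in this notebook, so the following is only a sketch of how a loaded channel URL might be mapped to its About page. The helper name `about_page_url` is hypothetical, and appending `/about` to the channel URL is an assumption about how `YouTubeScraper` navigates.

```python
from urllib.parse import urlsplit, urlunsplit


def about_page_url(channel_url: str) -> str:
    """Return the About-page URL for a channel URL (hypothetical helper)."""
    parts = urlsplit(channel_url.strip())
    # Drop any trailing slash, query string, and fragment before appending.
    path = parts.path.rstrip("/")
    if not path.endswith("/about"):
        path += "/about"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


for url in ["https://www.youtube.com/@anowyt", "https://www.youtube.com/@CallMePearlX/"]:
    print(about_page_url(url))
```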


## Step 2: Initialize Scraper

Create a scraper instance for extracting social media links.

**Note:** Link extraction runs with fewer workers than channel search (3 here; 3-4 is the suggested range) because it is a more complex task, and fewer parallel workers keep the run stable.

In [3]:
# Initialize scraper - use fewer workers for link extraction
link_workers = 3
scraper = YouTubeScraper(max_workers=link_workers, headless=True)

print("[OK] YouTubeScraper initialized for link extraction")
print(f"[INFO] Workers: {link_workers} (fewer than channel search for stability)")
print(f"[INFO] Note: Link extraction is slower than channel search")
print(f"[INFO] Expected time: 10-20 minutes for 100 channels")

[INFO] 2025-12-05 20:48:35 - youtube_scraper_COMPLETE_FIXED - [OK] YouTubeScraper initialized with 3 workers
[OK] YouTubeScraper initialized for link extraction
[INFO] Workers: 3 (fewer than channel search for stability)
[INFO] Note: Link extraction is slower than channel search
[INFO] Expected time: 10-20 minutes for 100 channels
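
How `YouTubeScraper` distributes work is not shown here. As a rough illustration of what a `max_workers` cap does, the sketch below assumes a thread pool; `fetch_links` and `scrape_all` are hypothetical names, and `fetch_links` is only a stand-in for the real per-channel scraping call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_links(channel_url: str) -> dict:
    """Hypothetical stand-in for scraping one channel's About page."""
    return {"channel_url": channel_url, "links": []}


def scrape_all(channel_urls, max_workers=3):
    """Process channels with at most `max_workers` tasks running at once."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_links, url): url for url in channel_urls}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # One failed channel should not abort the whole run.
                print(f"[WARN] {futures[future]} failed: {exc}")
    return results


demo = scrape_all([
    "https://www.youtube.com/@anowyt",
    "https://www.youtube.com/@CallMePearlX",
])
print(len(demo))
```

With browser-based scraping, each worker typically holds its own browser instance, so a lower cap trades speed for lower memory use and fewer timeouts.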


## Step 3: Extract Social Media Links

This step will visit each channel's About page and extract links to:
- Instagram
- TikTok
- X (Twitter)
- Facebook
- Discord
- Telegram
- Snapchat
- YouTube Channels
- Email addresses

⚠️ **This process is slower than channel search. Progress will update as channels are processed.**
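
One way to map an extracted URL to a platform label is a hostname lookup. The sketch below is an assumption, not the scraper's actual logic; `PLATFORM_DOMAINS` and `classify_link` are hypothetical names, and the labels follow the ones that appear in this notebook's output.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical domain-to-platform table; labels follow this notebook's output.
PLATFORM_DOMAINS = {
    "instagram.com": "Instagram",
    "tiktok.com": "TikTok",
    "x.com": "X (Twitter)",
    "twitter.com": "X (Twitter)",
    "facebook.com": "Facebook",
    "discord.gg": "Discord",
    "discord.com": "Discord",
    "t.me": "Telegram",
    "snapchat.com": "Snapchat",
    "youtube.com": "YouTube Channel",
}


def classify_link(url: str) -> Optional[str]:
    """Return the platform label for a URL, or None if it is not recognized."""
    if "://" not in url:
        url = "https://" + url  # handle scheme-less links such as t.me/...
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for domain, platform in PLATFORM_DOMAINS.items():
        # Match the domain itself or any subdomain, but not look-alikes.
        if host == domain or host.endswith("." + domain):
            return platform
    return None


print(classify_link("t.me/dzairtube_tv"))
print(classify_link("https://x.com/dzair_tube"))
```

Email addresses are not URLs and would need a separate check, for example a regular expression applied to the About text.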

In [4]:
# Extract social media links from all channels
print("[INFO] Extracting social media links...")
print(f"[INFO] Processing {len(channel_urls)} channels...")

all_results = scraper.extract_social_links(channel_urls)

print(f"[OK] Link extraction completed!")
print(f"[INFO] Total links found: {len(all_results)}")

[INFO] Extracting social media links...
[INFO] Processing 100 channels...
[INFO] 2025-12-05 20:48:46 - youtube_scraper_COMPLETE_FIXED - [INFO] Starting link extraction with 3 workers
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Get LATEST chromedriver version for google-chrome
[INFO] 2025-12-05 20:48:47 - WDM - Driver [C:\Users\ArabTech\.wdm\drivers\chromedriver\win64\134.0.6998.165\chromedriver-win32/chromedriver.exe] found in cache
[INFO] 2025-12-05 20:48:48 - WDM - Driver [C:\Users\ArabTech\.wdm\drivers\chromedriver\win64\134.0.6998.165\chromedriver-win32/chromedriver.exe] fo

## Step 4: View Raw Results

Show the raw extracted results.

In [5]:
# Show raw results
df_raw = pd.DataFrame(all_results)

print("Raw Results (first 15 rows):")
print(df_raw.head(15).to_string(index=False))

# Count by platform
print("\nLinks by Platform:")
platform_counts = df_raw['platform'].value_counts()
for platform, count in platform_counts.items():
    print(f"  {platform}: {count}")

Raw Results (first 15 rows):
                             channel_url        platform                                                           url
   https://www.youtube.com/@CallMePearlX          TikTok  https://www.tiktok.com/@callmepearlxd?_t=ZN-8zcMwKT5zQk&_r=1
https://www.youtube.com/@dzairtubetv4557        Telegram                                             t.me/dzairtube_tv
https://www.youtube.com/@dzairtubetv4557        Facebook                            https://www.facebook.com/Dzairtube
https://www.youtube.com/@dzairtubetv4557     X (Twitter)                                      https://x.com/dzair_tube
    https://www.youtube.com/@BilalFadili       Instagram                      https://www.instagram.com/bilal_fadilii/
    https://www.youtube.com/@BilalFadili          TikTok                          https://www.tiktok.com/@bilalfadilii
    https://www.youtube.com/@BilalFadili       Instagram                      https://www.instagram.com/bilal_fadilii/
    https://www.you
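
The raw rows above include tracking parameters (`?_t=...&_r=1`), a scheme-less link (`t.me/...`), and a repeated Instagram entry. A minimal cleanup sketch follows. The `TRACKING_PARAMS` set and both helper names are assumptions for illustration, not the behavior of `extract_social_links`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters, taken from URLs seen in this notebook's output.
TRACKING_PARAMS = {"_t", "_r", "igshid", "mibextid"}


def normalize_link(url: str) -> str:
    """Add a missing scheme, drop tracking parameters and trailing slashes."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    # Keep meaningful query parameters (e.g. Facebook's profile id).
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), urlencode(query), "")
    )


def dedupe_rows(rows):
    """Drop rows that repeat the same channel, platform, and normalized link."""
    seen, unique = set(), []
    for row in rows:
        # Email entries are addresses, not URLs, so they are left untouched.
        link = row["url"] if row["platform"] == "Email" else normalize_link(row["url"])
        key = (row["channel_url"], row["platform"], link)
        if key not in seen:
            seen.add(key)
            unique.append({**row, "url": link})
    return unique


rows = [
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "TikTok",
     "url": "https://www.tiktok.com/@bilalfadilii"},
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
]
print(len(dedupe_rows(rows)))
```

Stripping only a known set of parameters, rather than the whole query string, preserves links such as `profile.php?id=...` that depend on a query parameter to identify the account.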

## Step 5: Pivot Data

Transform the data so that each row represents one channel, with one column per social media platform.
This layout is easier to analyze and use.

In [6]:
# Pivot the data
df_pivot = DataProcessor.pivot_social_links(all_results)

print("Pivoted Results (first 10 channels):")
print(df_pivot.head(10).to_string())

print(f"\n[OK] Pivoted {len(df_pivot)} unique channels")
print(f"[INFO] Columns: {', '.join(df_pivot.columns.tolist())}")

[INFO] 2025-12-05 21:13:25 - data_processor_COMPLETE_FIXED - [OK] Pivoted 44 unique channels
Pivoted Results (first 10 channels):
platform                              channel_url                        Discord                                 Email                                                                 Facebook                                             Instagram                                  Snapchat Telegram                                                        TikTok                                             X (Twitter)                                                              YouTube Channel
0            https://www.youtube.com/@Aba.Stories                            NaN                                   NaN  https://www.facebook.com/profile.php?id=100076107540978&mibextid=LQQJ4d  https://instagram.com/aba.storie?igshid=YmMyMTA2M2Y=  https://www.snapchat.com/add/aba.stories      NaN       https://www.tiktok.com/@aba.stories?_t=8bm6plEz5ba&_r=1  https://x.com/abas
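
`DataProcessor.pivot_social_links` is defined elsewhere and is not shown here. A comparable reshape can be sketched with pandas `pivot_table`. The function name `pivot_links` is hypothetical, and keeping only the first link per channel and platform (`aggfunc="first"`) is an assumption about how duplicates are resolved.

```python
import pandas as pd


def pivot_links(rows):
    """Reshape long-format link rows into one row per channel (sketch)."""
    df = pd.DataFrame(rows, columns=["channel_url", "platform", "url"])
    wide = df.pivot_table(
        index="channel_url",
        columns="platform",
        values="url",
        aggfunc="first",  # assumption: keep the first link when several exist
    )
    return wide.reset_index()


rows = [
    {"channel_url": "https://www.youtube.com/@dzairtubetv4557", "platform": "Telegram",
     "url": "t.me/dzairtube_tv"},
    {"channel_url": "https://www.youtube.com/@dzairtubetv4557", "platform": "Facebook",
     "url": "https://www.facebook.com/Dzairtube"},
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
]
wide = pivot_links(rows)
print(wide.columns.tolist())
```

In a reshape like this, a channel that has no rows in the long-format input never appears in the result, which may be why 44 channels are pivoted from the 100 that were loaded.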

## Step 6: Save Results

Save results in both raw and pivoted formats.

In [7]:
# Save results in both formats

# Format 1: Raw data (all links)
raw_output = "socialmedialinks_raw.xlsx"
DataProcessor.save_results(all_results, raw_output)

# Format 2: Pivoted data (one row per channel)
pivoted_output = "socialmedialinks_pivoted.xlsx"
pivot_path = Path.cwd() / 'data' / 'processed' / pivoted_output
pivot_path.parent.mkdir(parents=True, exist_ok=True)
df_pivot.to_excel(pivot_path, index=False, engine='openpyxl')

print(f"[OK] Pivoted results saved to {pivot_path}")

print("\n[OK] Output files created:")
print(f"  1. {raw_output} (raw format - all links)")
print(f"  2. {pivoted_output} (pivoted format - RECOMMENDED for analysis)")

[INFO] 2025-12-05 21:13:27 - data_processor_COMPLETE_FIXED - [OK] Saved 209 results to data\processed\socialmedialinks_raw.xlsx
[OK] Pivoted results saved to C:\Users\ArabTech\02_youtube_channel_scraper\src\data\processed\socialmedialinks_pivoted.xlsx

[OK] Output files created:
  1. socialmedialinks_raw.xlsx (raw format - all links)
  2. socialmedialinks_pivoted.xlsx (pivoted format - RECOMMENDED for analysis)
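
Because the pivoted file's location is built from `Path.cwd()`, it depends on where the notebook is launched; in this run it resolved under `src\data\processed`. The sketch below shows one way to make the base directory explicit. The helper name `output_path` and its `base_dir` parameter are hypothetical and not part of `DataProcessor`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def output_path(filename: str, base_dir: Optional[Path] = None) -> Path:
    """Build <base_dir>/data/processed/<filename>, creating folders as needed."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    target = base / "data" / "processed" / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrate with a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    path = output_path("socialmedialinks_pivoted.xlsx", base_dir=Path(tmp))
    print(path.parent.is_dir())
```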


## Step 7: Summary Statistics

Generate summary statistics about the extracted links.

In [8]:
# Generate summary statistics
print("Summary Statistics")
print("=" * 50)

# Channels with links
# Note: df_pivot only has rows for channels that appear in all_results, so a
# channel with no extracted links may be absent here rather than counted below.
link_columns = df_pivot.iloc[:, 1:]
channels_total = len(df_pivot)
channels_with_links = link_columns.notna().any(axis=1).sum()
channels_with_no_links = channels_total - channels_with_links

print(f"Total channels processed: {channels_total}")
print(f"Channels with links: {channels_with_links}")
print(f"Channels with no links: {channels_with_no_links}")

print("\nLinks found by platform:")
for col in df_pivot.columns[1:]:
    count = df_pivot[col].notna().sum()
    print(f"  {col}: {count}")

print(f"\n[OK] Next step: Run Notebook 03 to get engagement metrics!")

Summary Statistics
Total channels processed: 44
Channels with links: 44
Channels with no links: 0

Links found by platform:
  Discord: 12
  Email: 12
  Facebook: 21
  Instagram: 28
  Snapchat: 2
  Telegram: 1
  TikTok: 16
  X (Twitter): 14
  YouTube Channel: 14

[OK] Next step: Run Notebook 03 to get engagement metrics!
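
The per-platform counts can also be derived directly from the raw rows, which gives a cross-check on the pivoted table. This is a sketch, assuming the rows are dictionaries with the `channel_url`, `platform`, and `url` keys shown in Step 4; the function name `channels_per_platform` is hypothetical.

```python
from collections import Counter


def channels_per_platform(rows):
    """Count the distinct channels that list each platform."""
    # A set of (channel, platform) pairs ignores repeated links.
    pairs = {(row["channel_url"], row["platform"]) for row in rows}
    return Counter(platform for _, platform in pairs)


rows = [
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
    {"channel_url": "https://www.youtube.com/@BilalFadili", "platform": "Instagram",
     "url": "https://www.instagram.com/bilal_fadilii/"},
    {"channel_url": "https://www.youtube.com/@dzairtubetv4557", "platform": "Facebook",
     "url": "https://www.facebook.com/Dzairtube"},
]
counts = channels_per_platform(rows)
for platform, count in sorted(counts.items()):
    print(f"  {platform}: {count}")
```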


## Summary

You now have social media links for every channel where at least one was found: 44 of the 100 channels loaded in this run.

### Files Created:
- `socialmedialinks_raw.xlsx` - All extracted links
- `socialmedialinks_pivoted.xlsx` - One row per channel (recommended)

### Next Steps:
1. ✅ Run **Notebook 03** to get engagement metrics
2. Optional: Review the Excel files to verify link accuracy
3. Optional: Create visualizations based on the data