# YouTube Analysis for Multidimensional Poverty Classification in Mexico

This implementation creates a text classification system to analyze YouTube comments and categorize content according to multidimensional poverty dimensions. The analysis follows CONEVAL's (Consejo Nacional de Evaluación de la Política de Desarrollo Social) framework with adaptations for real-time social media data.

We examine seven key dimensions of multidimensional poverty:

- **Income**: Employment status, wages, economic instability, unemployment
- **Access to Health Services**: Healthcare availability, medical infrastructure, health insurance
- **Educational Lag**: School dropout rates, educational access, academic delays
- **Access to Social Security**: Labor protection, social benefits, pension systems
- **Housing**: Living conditions, basic utilities (water, electricity), housing quality
- **Access to Food**: Food security, nutrition, food prices, hunger
- **Social Cohesion**: Community integration, discrimination, social exclusion, belonging

## Technical Methodology

### 1. Data Collection

**Search Parameters:**
- **Temporal Scope**: Full year analysis (2022: January 1 - December 31)
- **Geographic Coverage**: All 32 Mexican states
- **Search Terms**: State name + ["noticias", "news", "economía"] (3 queries per state)
- **Volume Limits**: 100 videos per query, 300 comments per video
- **Language Priority**: Spanish content prioritized via `relevanceLanguage="es"`
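
The search parameters above imply a fixed query pattern and an upper bound on the number of API requests. The following is a minimal sketch of both, not the notebook's own code: `build_search_terms` and `estimate_requests` are hypothetical helper names, the two-state sample list is illustrative only, and the page sizes (50 results per search page, 100 per comment page) are taken from the comments in the analyzer code.

```python
import math

# Suffixes from the search-term pattern: state name + one of these words.
QUERY_SUFFIXES = ["noticias", "news", "economía"]


def build_search_terms(states):
    """Map each state name to its three search queries."""
    return {state: [f"{state} {suffix}" for suffix in QUERY_SUFFIXES]
            for state in states}


def estimate_requests(n_states, videos_per_query=100, comments_per_video=300,
                      search_page_size=50, comment_page_size=100):
    """Upper bound on paginated API requests implied by the volume limits."""
    n_queries = n_states * len(QUERY_SUFFIXES)
    search_requests = n_queries * math.ceil(videos_per_query / search_page_size)
    comment_requests = (n_queries * videos_per_query
                        * math.ceil(comments_per_video / comment_page_size))
    return {"queries": n_queries,
            "search_requests": search_requests,
            "comment_requests": comment_requests}


# Illustrative sample only; a full run would list all 32 states.
sample_terms = build_search_terms(["Aguascalientes", "Nuevo León"])
print(sample_terms["Nuevo León"])
print(estimate_requests(n_states=32))
```

The comment-request figure is a ceiling: videos with fewer than 300 comments, or with comments disabled, need fewer requests.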

### 2. Text Preprocessing

**Preprocessing Steps:**
1. **HTML/Markup Removal**: Strip HTML tags and web links
2. **Character Normalization**: Replace punctuation and special characters with spaces, keeping letters (including Spanish accented characters), digits, and underscores
3. **Whitespace Normalization**: Remove extra spaces, convert to lowercase
4. **Length Filtering**: Exclude texts shorter than 10 characters
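
The four steps above can be sketched as a standalone function. This mirrors the regular expressions used in the notebook's `clean_text()` method, except that the explicit list of accented characters is omitted on the assumption that Python 3's `\w` already matches Unicode letters. `clean_comment`, `is_long_enough`, and the sample strings are illustrative names and data, not part of the original pipeline.

```python
import re

MIN_TEXT_LENGTH = 10  # step 4: drop texts shorter than this after cleaning


def clean_comment(text):
    """Apply steps 1-3: strip markup and links, drop punctuation, normalize."""
    text = re.sub(r"<.*?>", " ", text)     # HTML tags
    text = re.sub(r"http\S+", "", text)    # URLs
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation and special characters
    return re.sub(r"\s+", " ", text).strip().lower()


def is_long_enough(cleaned):
    """Apply step 4 to an already cleaned text."""
    return len(cleaned) >= MIN_TEXT_LENGTH


raw = "No hay <b>chamba</b> en Aguascalientes!!! https://example.com/video"
cleaned = clean_comment(raw)
print(cleaned)                                  # no hay chamba en aguascalientes
print(is_long_enough(cleaned))                  # True
print(is_long_enough(clean_comment("Ok!!")))    # False
```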

### 3. Embedding and Classification

**Embedding Generation:**
- **Model**: `paraphrase-multilingual-MiniLM-L12-v2` (384-dimensional embeddings)
- **Language Support**: Multilingual model that covers Spanish, English, and mixed-language content
- **Dimension Preprocessing**: Convert keyword lists to normalized text phrases for embedding
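
As a small illustration of the dimension preprocessing step, the sketch below flattens a multi-line keyword block into a single phrase that can be passed to the embedding model. `keywords_to_phrase` is a hypothetical helper name; the keyword block is a shortened version of the `INCOME` definition.

```python
def keywords_to_phrase(keyword_block):
    """Collapse a multi-line keyword block into one space-separated phrase."""
    return " ".join(keyword_block.split())


income_keywords = """
    empleo trabajo salario
    chamba lana nómina
"""
print(keywords_to_phrase(income_keywords))
# empleo trabajo salario chamba lana nómina
```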

**Classification Logic:**
1. Generate embeddings for both input text and poverty dimension definitions
2. Calculate cosine similarity between text and all dimension embeddings
3. Assign text to highest-scoring dimension if score ≥ 0.10 threshold
4. Classify as "OTHER" if below threshold (filters non-poverty content)
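
The classification logic above can be illustrated without loading the embedding model. The sketch below assumes toy three-dimensional vectors in place of real sentence embeddings; `cosine_similarity`, `classify`, and `toy_dimensions` are illustrative names. Only the argmax-plus-threshold rule is meant to match the pipeline.

```python
import math

MIN_DIMENSION_CONFIDENCE = 0.10  # same threshold as the pipeline


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text_vec, dimension_vecs, threshold=MIN_DIMENSION_CONFIDENCE):
    """Return (label, score): the best dimension, or "OTHER" below threshold."""
    scores = {name: cosine_similarity(text_vec, vec)
              for name, vec in dimension_vecs.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "OTHER", scores[best]
    return best, scores[best]


# Toy vectors standing in for real embeddings.
toy_dimensions = {"INCOME": [1.0, 0.0, 0.0], "HOUSING": [0.0, 1.0, 0.0]}
print(classify([0.9, 0.1, 0.0], toy_dimensions)[0])    # INCOME
print(classify([0.0, 0.05, 1.0], toy_dimensions)[0])   # OTHER
```

Because every text is assigned to its single best dimension, a low threshold such as 0.10 mainly removes texts that are nearly unrelated to all definitions; it does not resolve ties between closely scoring dimensions.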

### 4. Sentiment Analysis 

**Model**: `cardiffnlp/twitter-xlm-roberta-base-sentiment`
- **Input**: Cleaned text (the same preprocessed string used for dimension classification)
- **Output**: The sentiment class with the highest predicted probability (`negative`, `neutral`, or `positive`)  
- **Scoring**: The selected label is mapped to a fixed score:  
  - `negative` = –1.0  
  - `neutral` = 0.0  
  - `positive` = +1.0  
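
The label-to-score mapping can be sketched without loading the model. The example assumes the class order `negative`, `neutral`, `positive` for output indices 0, 1, 2, as the analysis code does; `score_from_logits` and the sample logits are illustrative. Softmax preserves ordering, so taking the argmax of the raw logits selects the same class as taking the argmax of the probabilities.

```python
LABELS = ["negative", "neutral", "positive"]  # assumed index order 0, 1, 2
LABEL_TO_SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}


def score_from_logits(logits):
    """Pick the highest-scoring class and map it to a fixed sentiment score."""
    predicted_index = max(range(len(logits)), key=lambda i: logits[i])
    label = LABELS[predicted_index]
    return label, LABEL_TO_SCORE[label]


print(score_from_logits([2.1, 0.3, -1.0]))   # ('negative', -1.0)
print(score_from_logits([-0.5, 0.2, 1.7]))   # ('positive', 1.0)
```

One consequence of this design is that model confidence is discarded: a text predicted `negative` with probability 0.40 contributes the same -1.0 as one predicted with probability 0.99.
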
### 5. Extracted Components

**State-Level Metrics:**
- **Dimension Coverage**: Percentage of analyzed texts (comments plus each video's title and description) per poverty dimension
- **Conditional Sentiment**: Average sentiment score per dimension per state
- **General Statistics**: Total videos and comments analyzed
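
Both state-level metrics can be derived from a running sentiment sum and a count per dimension. The sketch below assumes that accumulator layout (the same one used by `dimension_stats` in the analyzer); `summarize` and the numbers in `toy_stats` are made up for illustration and are not results from the analysis.

```python
def summarize(dimension_stats):
    """Turn per-dimension sums and counts into coverage and mean sentiment."""
    total = sum(s["count"] for s in dimension_stats.values())
    summary = {}
    for name, s in dimension_stats.items():
        summary[name] = {
            "coverage_pct": 100 * s["count"] / total if total else 0.0,
            "avg_sentiment": s["sentiment_sum"] / s["count"] if s["count"] else 0.0,
        }
    return summary


# Made-up accumulator values, for illustration only.
toy_stats = {
    "INCOME": {"sentiment_sum": -30.0, "count": 60},
    "HOUSING": {"sentiment_sum": 5.0, "count": 20},
    "OTHER": {"sentiment_sum": 0.0, "count": 20},
}
for name, metrics in summarize(toy_stats).items():
    print(name, metrics)
```

The guards against a zero total or zero count matter in practice, since a state may return no texts for a given dimension.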

In [1]:
# load necessary libraries
import pandas as pd
import numpy as np
import os
import re
import json
from datetime import datetime
from googleapiclient.discovery import build
from time import sleep
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import torch
import torch.nn.functional as F
from tqdm import tqdm

# load environment variables from .env file
load_dotenv()
YT_API_KEY = os.getenv("YT_API_KEY")

In [2]:
# mapping of Mexican states with their corresponding search terms
STATES_SEARCH_TERMS = {
    "Aguascalientes": ["Aguascalientes noticias", "Aguascalientes news", "Aguascalientes economía"]}

# Poverty dimension definitions with keywords. Each dimension contains a mix of formal Spanish terms, 
# Mexican slang, and English words to capture the diverse jargon used in YouTube comments
POVERTY_DIMENSIONS = {
    "INCOME": """
    empleo trabajo salario ingresos dinero economía sueldo ahorro impuestos
    chamba lana nómina billete jale job salary income money
    """,
    
    "ACCESS TO HEALTH SERVICES": """
    salud médico hospital medicina tratamiento atención clínica seguro
    sistema de salud servicios médicos doctor cuidado ir al doctor health insurance
    seguro médico doctor particular ir a consulta healthcare medical treatment 
    """,
    
    "EDUCATIONAL LAG": """
    educación escuela universidad maestro estudiante aprendizaje escuela pública
    clases formación conocimiento título bachillerato preparatoria escuela secundaria
    """,
    
    "ACCESS TO SOCIAL SECURITY": """
    seguridad social pensión jubilación contrato derechos laborales
    prestaciones protección IMSS ISSSTE afore finiquito ahorro para retiro
    cotizar retirement benefits social security worker rights informal job
    """,
    
    "HOUSING": """
    vivienda casa habitación hogar alquiler renta depa housing utilities
    servicios agua luz gas electricidad construcción propiedad rent 
    techo colonia vecindario urbanización asentamiento cuartito mortgage
    """,
    
    "ACCESS TO FOOD": """
    alimentación comida nutrición alimentos dieta cocinar recetas
    canasta básica food security nutrition meal groceries
    comida saludable dieta balanceada comida rápida comida chatarra
    """,
    
    "SOCIAL COHESION": """
    comunidad sociedad integración participación convivencia barrio raza community
    respeto diversidad solidaridad inclusión pertenencia 
    vecinos apoyo redes sociales confianza belonging inclusion
    """}

In [3]:
# confidence threshold: comments with similarity scores below this threshold will be classified as 'OTHER'
MIN_DIMENSION_CONFIDENCE = 0.10

# define constants for YouTube API usage
MAX_VIDEOS_PER_SEARCH = 100  
MAX_COMMENTS_PER_VIDEO = 300  
API_SLEEP_TIME = 0.5  

### SimpleTextProcessor
**Purpose**: Handles text preprocessing, dimension classification, and sentiment analysis

**Key Methods:**
- `clean_text()`: Normalizes and cleans input text
- `classify_dimension()`: Assigns text to poverty dimensions using embeddings
- `get_sentiment_score()`: Computes sentiment scores

In [None]:
class SimpleTextProcessor:
    def __init__(self):
        # initialize multilingual sentence embedding model
        # this model generates embeddings for both comments and poverty dimension definitions
        self.embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

        # load multilingual sentiment model 
        self.sentiment_model = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
        self.tokenizer = AutoTokenizer.from_pretrained(self.sentiment_model)
        self.model = AutoModelForSequenceClassification.from_pretrained(self.sentiment_model)

        # load poverty dimension names and definitions
        self.dimension_names = list(POVERTY_DIMENSIONS.keys())
        self.dimension_texts = []

        # format and join keyword definitions into single phrases per dimension
        for keywords in POVERTY_DIMENSIONS.values():
            word_list = keywords.strip().split()
            phrase = " ".join(word_list)
            self.dimension_texts.append(phrase)

        # precompute embeddings for each poverty dimension definition
        self.dimension_embeddings = self.embedder.encode(self.dimension_texts, convert_to_tensor=True)

    def clean_text(self, text):
        # remove HTML tags
        text = re.sub(r'<.*?>', ' ', text)
        # remove URLs
        text = re.sub(r'http\S+', '', text)
        # remove punctuation and special characters (preserving accented Spanish characters)
        text = re.sub(r'[^\w\sáéíóúüñÁÉÍÓÚÜÑ]', ' ', text)
        # normalize whitespace and convert to lowercase
        return re.sub(r'\s+', ' ', text).strip().lower()

    def classify_dimension(self, text):
        if not text:
            return "OTHER", 0.0

        # generate embedding for the input text
        embedding = self.embedder.encode(text, convert_to_tensor=True)
        # compute cosine similarity between input and all dimension definitions
        cosine_scores = util.cos_sim(embedding, self.dimension_embeddings)[0]
        # identify the most similar dimension
        max_idx = torch.argmax(cosine_scores).item()
        max_score = cosine_scores[max_idx].item()

        # if similarity is below threshold, classify as "OTHER"
        if max_score < MIN_DIMENSION_CONFIDENCE:
            return "OTHER", max_score

        # otherwise, return the best matching dimension and its similarity score
        return self.dimension_names[max_idx], max_score
    
    def get_sentiment_score(self, text):
        if not text:
            return 0.0

        # tokenize the input
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

        # run the model in inference mode: no gradient computation is needed (faster and more memory-efficient)
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits
            predicted_class = torch.argmax(logits, dim=1).item()  # 0=negative, 1=neutral, 2=positive

        # map class index to fixed score (floats, consistent with the 0.0 returned for empty text)
        class_to_score = {0: -1.0, 1: 0.0, 2: 1.0}
        return class_to_score[predicted_class]

### SimpleYouTubeAnalyzer  
**Purpose**: Manages YouTube API interactions and orchestrates analysis pipeline

**Key Methods:**
- `search_videos()`: Retrieves videos based on search terms and date range
- `get_video_comments()`: Extracts comments from individual videos
- `analyze_state_by_keywords()`: Processes complete state analysis workflow
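
Both retrieval methods rely on the same token-based pagination pattern. The sketch below isolates that pattern against a fake in-memory endpoint so it runs without an API key. `fetch_all` and `make_fake_api` are hypothetical names, and the fake endpoint only imitates the `items` and `nextPageToken` fields that the analyzer code reads from real responses.

```python
def fetch_all(fetch_page, max_items, page_size):
    """Collect up to max_items results by following nextPageToken links."""
    items = []
    page_token = None
    while len(items) < max_items:
        response = fetch_page(
            page_token=page_token,
            max_results=min(page_size, max_items - len(items)),
        )
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:  # no further pages available
            break
    return items[:max_items]


def make_fake_api(total_items):
    """Hypothetical stand-in for a paginated endpoint (not the real API)."""
    def fetch_page(page_token=None, max_results=50):
        start = int(page_token) if page_token else 0
        end = min(start + max_results, total_items)
        response = {"items": list(range(start, end))}
        if end < total_items:
            response["nextPageToken"] = str(end)
        return response
    return fetch_page


print(len(fetch_all(make_fake_api(500), max_items=100, page_size=50)))  # 100
print(len(fetch_all(make_fake_api(30), max_items=100, page_size=50)))   # 30
```

Requesting only the remaining number of items on the last page avoids fetching results that would be discarded.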

In [5]:
# handle YouTube API interactions and content analysis. 
class SimpleYouTubeAnalyzer:
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build("youtube", "v3", developerKey=api_key)
        self.processor = SimpleTextProcessor()

    # search for videos on YouTube based on a query and date range
    def search_videos(self, query, published_after, published_before, max_results=100):
        videos = []
        next_page_token = None
        
        try:
            while len(videos) < max_results:
                # request videos from YouTube API with pagination
                response = self.youtube.search().list(
                    q=query,
                    part="snippet",
                    maxResults=min(50, max_results - len(videos)),  # API limit is 50 per request
                    pageToken=next_page_token,
                    type="video",
                    order="relevance",
                    publishedAfter=published_after,
                    publishedBefore=published_before,
                    relevanceLanguage="es"  # prioritize Spanish content
                ).execute()
                
                # extract video information from API response
                for item in response.get("items", []):
                    if item["id"]["kind"] == "youtube#video":
                        videos.append({
                            "id": item["id"]["videoId"],
                            "title": item["snippet"]["title"],
                            "description": item["snippet"].get("description", ""),
                            "published_at": item["snippet"]["publishedAt"]
                        })
                
                # check if more pages are available
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(videos) >= max_results:
                    break
                
                # small pause to avoid quota issues
                sleep(API_SLEEP_TIME)
                
        except Exception as e:
            print(f"Error searching for '{query}': {e}")
        
        print(f"Found {len(videos)} videos for query '{query}'")
        return videos

    # retrieve comments for a specific video
    def get_video_comments(self, video_id, max_comments=300):
        comments = []
        next_page_token = None
        
        try:
            while len(comments) < max_comments:
                # request comments with pagination support
                response = self.youtube.commentThreads().list(
                    part="snippet",
                    videoId=video_id,
                    maxResults=min(100, max_comments - len(comments)),  # API limit is 100 per request
                    pageToken=next_page_token
                ).execute()
                
                # extract comment text from API response
                for item in response.get("items", []):
                    comment_text = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                    comments.append(comment_text)
                
                # check for additional pages
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(comments) >= max_comments:
                    break
                
                # pause to avoid quota issues 
                sleep(API_SLEEP_TIME)
                
        except Exception:
            # handle videos that have disabled comments (any other API error is
            # also silenced here); comments collected so far are still returned
            pass
        
        return comments

    def analyze_state_by_keywords(self, state_name, search_terms, date_range):
        print(f"\nAnalyzing {state_name}...")
        
        # initialize statistics tracking for all categories and 'OTHER'
        all_categories = list(POVERTY_DIMENSIONS.keys()) + ["OTHER"]
        dimension_stats = {cat: {"sentiment_sum": 0.0, "count": 0} for cat in all_categories}
        
        total_videos = 0
        total_comments = 0
        classification_stats = {cat: 0 for cat in all_categories}
        
        # process each search term for the current state
        for search_term in search_terms:
            print(f"  Searching for '{search_term}'...")
            
            # get relevant videos for this search term
            videos = self.search_videos(
                query=search_term,
                published_after=date_range["published_after"],
                published_before=date_range["published_before"],
                max_results=MAX_VIDEOS_PER_SEARCH
            )
            
            if not videos:
                continue
                
            total_videos += len(videos)
            
            # process each video and its comments
            for video in tqdm(videos, desc=f"Processing videos for '{search_term}'"):
                # extract comments from the current video
                comments = self.get_video_comments(video["id"], MAX_COMMENTS_PER_VIDEO)
                total_comments += len(comments)
                
                # combine video metadata with comments
                all_texts = [video["title"] + ". " + video["description"]] + comments
                
                # analyze each piece of text individually
                for text in all_texts:
                    clean = self.processor.clean_text(text)
                    
                    # skip very short texts 
                    if len(clean) < 10:
                        continue
                    
                    # classify text into poverty dimensions or 'OTHER'
                    category, confidence = self.processor.classify_dimension(clean)
                    
                    # update classification statistics for reporting
                    classification_stats[category] += 1
                    
                    # calculate sentiment score for all classified texts
                    sentiment = self.processor.get_sentiment_score(clean)
                    dimension_stats[category]["sentiment_sum"] += sentiment
                    dimension_stats[category]["count"] += 1
        
        # print classification statistics for this state
        total_texts = sum(classification_stats.values())
        print(f"  Classification statistics for {state_name}:")
        for category, count in classification_stats.items():
            percentage = (count / total_texts * 100) if total_texts > 0 else 0
            print(f"    {category}: {count} texts ({percentage:.1f}%)")
        
        print(f"  Analyzed {total_videos} videos and {total_comments} comments for {state_name}")
        return dimension_stats, total_videos, total_comments, classification_stats

## Output Files

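Results are saved as CSV files in the `TRIAL2/` directory:
- One file per state: `{state_name}.csv` (lowercase, with spaces replaced by underscores)
- Each file contains one row per dimension, including `Other`, with average sentiment, mention count, share of total texts, and the number of videos and comments analyzed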
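
The per-state output step can be sketched with the standard library alone. The example below assumes the same filename normalization as the main function (spaces to underscores, lowercase) and writes to a temporary directory instead of the real output folder. `state_filename`, `write_state_csv`, and the single sample row are illustrative and do not reflect actual results.

```python
import csv
import os
import tempfile


def state_filename(state):
    """Normalize a state name into its CSV filename."""
    return f"{state.replace(' ', '_').lower()}.csv"


def write_state_csv(output_dir, state, rows):
    """Write one row per dimension for a state and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, state_filename(state))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Made-up row, for illustration only.
sample_rows = [{"state": "Baja California Sur", "dimension": "Income",
                "avg_sentiment": -0.5, "mentions_count": 60}]

with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_state_csv(tmp_dir, "Baja California Sur", sample_rows)
    print(os.path.basename(written))  # baja_california_sur.csv
```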

In [6]:
# main execution function that processes all Mexican states
def analyze_all_states_simple():
    # initialize the YouTube analyzer with API credentials
    analyzer = SimpleYouTubeAnalyzer(YT_API_KEY)
    
    # define the analysis time period (2022 full year in this case)
    date_range = {
        "published_after": "2022-01-01T00:00:00Z",
        "published_before": "2022-12-31T23:59:59Z"}
    
    # create output directory for results
    os.makedirs("TRIAL2", exist_ok=True)
    
    # initialize lists for aggregated results
    all_results = []
    overall_classification_stats = {}
    
    # process each Mexican state individually
    for state, search_terms in STATES_SEARCH_TERMS.items():
        # analyze the current state using its specific search terms
        stats, total_videos, total_comments, classification_stats = analyzer.analyze_state_by_keywords(
            state_name=state,
            search_terms=search_terms,
            date_range=date_range)
        
        # accumulate classification statistics across all states
        for category, count in classification_stats.items():
            overall_classification_stats[category] = overall_classification_stats.get(category, 0) + count
        
        # prepare structured data for this state
        # total number of classified texts, computed once for the percentage column
        total_mentions = sum(s["count"] for s in stats.values())
        df_rows = []
        for category, v in stats.items():
            df_rows.append({
                "state": state,
                "dimension": category.replace("_", " ").title(),
                "avg_sentiment": v["sentiment_sum"] / v["count"] if v["count"] > 0 else 0,
                "mentions_count": v["count"],
                "percentage_of_total": (v["count"] / total_mentions * 100) if total_mentions > 0 else 0,
                "videos_analyzed": total_videos,
                "comments_analyzed": total_comments})
        
        # create DataFrame for this state's results
        df = pd.DataFrame(df_rows)
        
        # save state-specific results to CSV file
        output_file = f"TRIAL2/{state.replace(' ', '_').lower()}.csv"
        df.to_csv(output_file, index=False)
        print(f"Saved results to {output_file}")
        
        # add to aggregated results collection
        all_results.append(df)

if __name__ == "__main__":
    analyze_all_states_simple()




Analyzing Aguascalientes...
  Searching for 'Aguascalientes noticias'...
Found 100 videos for query 'Aguascalientes noticias'


Processing videos for 'Aguascalientes noticias': 100%|██████████| 100/100 [04:02<00:00,  2.43s/it]


  Searching for 'Aguascalientes news'...
Found 100 videos for query 'Aguascalientes news'


Processing videos for 'Aguascalientes news': 100%|██████████| 100/100 [07:14<00:00,  4.34s/it]


  Searching for 'Aguascalientes economía'...
Found 100 videos for query 'Aguascalientes economía'


Processing videos for 'Aguascalientes economía': 100%|██████████| 100/100 [02:42<00:00,  1.62s/it]


  Classification statistics for Aguascalientes:
    INCOME: 4819 texts (35.6%)
    ACCESS TO HEALTH SERVICES: 967 texts (7.1%)
    EDUCATIONAL LAG: 2072 texts (15.3%)
    ACCESS TO SOCIAL SECURITY: 325 texts (2.4%)
    HOUSING: 1001 texts (7.4%)
    ACCESS TO FOOD: 796 texts (5.9%)
    SOCIAL COHESION: 1338 texts (9.9%)
    OTHER: 2234 texts (16.5%)
  Analyzed 300 videos and 13767 comments for Aguascalientes
Saved results to TRIAL2/aguascalientes.csv
