## Search for the name of the State + 'news' / 'economy'
Then use the advance nlp approach to do embedding on a neutral set of words to capture comments that talk about each poverty dimension, and do the sentiment on these comments. 
Get the count of the total comments and videos analyzed, the counts of comments that belong to each dimension and the sentiment score condition on each dimension. 

Here there should not be bias since we are:
- doing a generic research (state + 'news' and 'noticias' - state + 'economy')
- doing the embedding basing on words associated to the different dimensions of poverty but in a neutral way ('work', 'salary' ..). In this way we are able to identfy comments that talk about these issues, but we are not necessarily filtering for those that already talk about them negatively. The sentiment is not necessarily negative 

In [1]:
import pandas as pd
import numpy as np
import os
import re
import json
from datetime import datetime
from googleapiclient.discovery import build
from time import sleep
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from tqdm import tqdm

# Load environment variables
load_dotenv()
YT_API_KEY = os.getenv("YT_API_KEY")

# Define states and search terms
STATES_SEARCH_TERMS = {
    "Guanajuato": [
        "Guanajuato noticias", 
        "Guanajuato news", 
        "Guanajuato economía"
    ],
    "Michoacán": [
        "Michoacán noticias", 
        "Michoacán new", 
        "Michoacán economía"
    ],
    "Sinaloa": [
        "Sinaloa noticias", 
        "Sinaloa news", 
        "Sinaloa economía"
    ],
    "Chihuahua": [
        "Chihuahua noticias", 
        "Chihuahua news", 
        "Chihuahua economía"
    ],
    "Guerrero": [
        "Guerrero noticias", 
        "Guerrero news", 
        "Guerrero economía"
    ],
    "Tamaulipas": [
        "Tamaulipas noticias", 
        "Tamaulipas news", 
        "Tamaulipas economía"
    ],
    "Baja California": [
        "Baja California noticias", 
        "Baja California news", 
        "Baja California economía"
    ],
    "Zacatecas": [
        "Zacatecas noticias", 
        "Zacatecas new", 
        "Zacatecas economía"
    ],
    "Colima": [
        "Colima noticias", 
        "Colima news", 
        "Colima economía"
    ],
    "Jalisco": [
        "Jalisco noticias", 
        "Jalisco news", 
        "Jalisco economía"]}

# Neutral keyword-based descriptions for poverty dimensions: around 30 words per dimension 
# (60% standard spanish, 30% mexican/spanish slang and 10% english)
POVERTY_DIMENSIONS = {
    "INCOME": """
    empleo, trabajo, salario, ingresos, dinero, economía, sueldo, ahorro, impuestos, 
    chamba, lana, nómina, billete, jale, job, salary, income, money
    """,
    
    "ACCESS TO HEALTH SERVICES": """
    salud, médico, hospital, medicina, tratamiento, atención, clínica, seguro,
    sistema de salud, vacunas, servicios médicos, doctor, cuidado, ir al doctor,
    seguro médico, doctor particular, ir a consulta, healthcare, medical treatment, 
    doctor appointment, health insurance
    """,
    
    "EDUCATIONAL LAG": """
    educación, escuela, universidad, maestro, estudiante, aprendizaje, 
    clases, formación, conocimiento, título, bachillerato, preparatoria, 
    primaria, materias, escuela lejos, sacar buenas notas,
    education, school, learning, degree, student loans
    """,
    
    "ACCESS TO SOCIAL SECURITY": """
    seguridad social, pensión, jubilación, contrato, derechos laborales, 
    prestaciones, protección, IMSS, ISSSTE, afore, finiquito, ahorro para retiro, 
    cotizar, retirement, benefits, social security, worker rights, informal job
    """,
    
    "HOUSING": """
    vivienda, casa, habitación, hogar, alquiler, renta,
    servicios, agua, luz, gas, electricidad, construcción, propiedad, 
    techo, colonia, vecindario, urbanización, asentamiento, cuartito, 
    depa, housing, rent, mortgage, utilities
    """,
    
    "ACCESS TO FOOD": """
    alimentación, comida, nutrición, alimentos, dieta,
    mercado, productos, frutas, verduras, carne, leche, básicos, 
    despensa, supermercado, tienda, comer, cocinar, 
    canasta básica, tragar, food security, nutrition, meal, groceries
    """,
    
    "SOCIAL COHESION": """
    comunidad, sociedad, integración, participación, convivencia, 
    respeto, diversidad, solidaridad, inclusión, pertenencia, 
    vecinos, familia, apoyo, redes sociales, confianza, 
    barrio, raza, community, belonging, inclusion
    """
}

# limits for scraping
MAX_VIDEOS_PER_SEARCH = 100  
MAX_COMMENTS_PER_VIDEO = 300  
API_SLEEP_TIME = 0.5  

class TextProcessor:
    def __init__(self):
        self.embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        self.tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
        self.model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
        self.dimension_names = list(POVERTY_DIMENSIONS.keys())
        self.dimension_embeddings = self.embedder.encode(list(POVERTY_DIMENSIONS.values()), convert_to_tensor=True)

    def clean_text(self, text):
        text = re.sub(r'<.*?>', ' ', text)
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'[^\w\sáéíóúüñÁÉÍÓÚÜÑ]', ' ', text)
        return re.sub(r'\s+', ' ', text).strip().lower()

    def classify_dimension(self, text):
        if not text:
            return None, 0.0
        embedding = self.embedder.encode(text, convert_to_tensor=True)
        cosine_scores = util.cos_sim(embedding, self.dimension_embeddings)[0]
        max_idx = torch.argmax(cosine_scores).item()
        return self.dimension_names[max_idx], cosine_scores[max_idx].item()

    def get_sentiment_score(self, text):
        if not text:
            return 0.0
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = self.model(**inputs)
        stars = torch.argmax(outputs.logits, dim=1).item() + 1
        return (stars - 3) / 2  # Normalize to [-1, 1]

class YouTubeAnalyzer:
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build("youtube", "v3", developerKey=api_key)
        self.processor = TextProcessor()

    def search_videos(self, query, published_after, published_before, max_results=MAX_VIDEOS_PER_SEARCH):
        """Search for videos using a keyword query."""
        videos = []
        next_page_token = None
        
        try:
            while len(videos) < max_results:
                response = self.youtube.search().list(
                    q=query,
                    part="snippet",
                    maxResults=min(50, max_results - len(videos)),  # YouTube API allows max 50 per request
                    pageToken=next_page_token,
                    type="video",
                    order="relevance",
                    publishedAfter=published_after,
                    publishedBefore=published_before,
                    relevanceLanguage="es"
                ).execute()
                
                for item in response.get("items", []):
                    if item["id"]["kind"] == "youtube#video":
                        videos.append({
                            "id": item["id"]["videoId"],
                            "title": item["snippet"]["title"],
                            "description": item["snippet"].get("description", ""),
                            "published_at": item["snippet"]["publishedAt"]
                        })
                
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(videos) >= max_results:
                    break
                
                sleep(API_SLEEP_TIME)  # Avoid quota exceeded errors
                
        except Exception as e:
            print(f"Error searching for '{query}': {e}")
        
        print(f"Found {len(videos)} videos for query '{query}'")
        return videos

    def get_video_comments(self, video_id, max_comments=MAX_COMMENTS_PER_VIDEO):
        """Get comments for a specific video."""
        comments = []
        next_page_token = None
        
        try:
            while len(comments) < max_comments:
                response = self.youtube.commentThreads().list(
                    part="snippet",
                    videoId=video_id,
                    maxResults=min(100, max_comments - len(comments)),  # YouTube API allows max 100 per request
                    pageToken=next_page_token
                ).execute()
                
                for item in response.get("items", []):
                    comment_text = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                    comments.append(comment_text)
                
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(comments) >= max_comments:
                    break
                
                sleep(API_SLEEP_TIME)  # Avoid quota exceeded errors
                
        except Exception as e:
            # Many videos have comments disabled, so we'll just pass silently
            pass
        
        return comments

    def analyze_state_by_keywords(self, state_name, search_terms, date_range):
        """Analyze a state by searching for videos using specified search terms."""
        print(f"\nAnalyzing {state_name}...")
        dimension_stats = {dim: {"sentiment_sum": 0.0, "count": 0} for dim in POVERTY_DIMENSIONS}
        total_videos = 0
        total_comments = 0
        
        # Search for videos with each search term
        for search_term in search_terms:
            print(f"  Searching for '{search_term}'...")
            videos = self.search_videos(
                query=search_term,
                published_after=date_range["published_after"],
                published_before=date_range["published_before"],
                max_results=MAX_VIDEOS_PER_SEARCH
            )
            
            if not videos:
                continue
                
            total_videos += len(videos)
            
            # Process videos
            for video in tqdm(videos, desc=f"Processing videos for '{search_term}'"):
                # Get video comments
                comments = self.get_video_comments(video["id"], MAX_COMMENTS_PER_VIDEO)
                total_comments += len(comments)
                
                # Concatenate title, description and comments for analysis
                all_texts = [video["title"] + ". " + video["description"]] + comments
                
                # Analyze each text
                for text in all_texts:
                    clean = self.processor.clean_text(text)
                    if len(clean) < 10:  # Skip very short texts
                        continue
                        
                    dimension, confidence = self.processor.classify_dimension(clean)
                    if confidence > 0.1:  # Only count if confidence is high enough
                        sentiment = self.processor.get_sentiment_score(clean)
                        dimension_stats[dimension]["sentiment_sum"] += sentiment
                        dimension_stats[dimension]["count"] += 1
        
        print(f"  Analyzed {total_videos} videos and {total_comments} comments for {state_name}")
        return dimension_stats, total_videos, total_comments

def analyze_all_states():
    analyzer = YouTubeAnalyzer(YT_API_KEY)
    date_range = {
        "published_after": "2022-01-01T00:00:00Z",
        "published_before": "2022-12-31T23:59:59Z"
    }
    
    # Create directories for results
    os.makedirs("yt_keyword_sentiment", exist_ok=True)
    
    # Store overall stats for summary
    all_results = []
    
    for state, search_terms in STATES_SEARCH_TERMS.items():
        stats, total_videos, total_comments = analyzer.analyze_state_by_keywords(
            state_name=state,
            search_terms=search_terms,
            date_range=date_range
        )
        
        # Create dataframe for this state
        df = pd.DataFrame([
            {
                "state": state,
                "dimension": dim.replace("_", " ").title(),
                "avg_sentiment": v["sentiment_sum"] / v["count"] if v["count"] else 0,
                "mentions_count": v["count"],
                "videos_analyzed": total_videos,
                "comments_analyzed": total_comments
            }
            for dim, v in stats.items()
        ])
        
        # Save state-specific results
        output_file = f"yt_keyword_sentiment/{state.replace(' ', '_').lower()}.csv"
        df.to_csv(output_file, index=False)
        print(f"Saved results to {output_file}")
        
        # Add to overall results
        all_results.append(df)
    
    # Combine all results into one dataframe
    if all_results:
        all_df = pd.concat(all_results)
        all_df.to_csv("yt_keyword_sentiment/all_states_results.csv", index=False)
        print("Saved combined results to yt_keyword_sentiment/all_states_results.csv")
    

if __name__ == "__main__":
    analyze_all_states()




Analyzing Guanajuato...
  Searching for 'Guanajuato noticias'...
Found 100 videos for query 'Guanajuato noticias'


Processing videos for 'Guanajuato noticias': 100%|██████████| 100/100 [04:00<00:00,  2.41s/it]


  Searching for 'Guanajuato news'...
Found 100 videos for query 'Guanajuato news'


Processing videos for 'Guanajuato news': 100%|██████████| 100/100 [04:56<00:00,  2.97s/it]


  Searching for 'Guanajuato economía'...
Found 100 videos for query 'Guanajuato economía'


Processing videos for 'Guanajuato economía': 100%|██████████| 100/100 [02:38<00:00,  1.59s/it]


  Analyzed 300 videos and 8987 comments for Guanajuato
Saved results to yt_keyword_sentiment/guanajuato.csv

Analyzing Michoacán...
  Searching for 'Michoacán noticias'...
Found 100 videos for query 'Michoacán noticias'


Processing videos for 'Michoacán noticias': 100%|██████████| 100/100 [05:03<00:00,  3.04s/it]


  Searching for 'Michoacán new'...
Found 100 videos for query 'Michoacán new'


Processing videos for 'Michoacán new': 100%|██████████| 100/100 [10:41<00:00,  6.42s/it]


  Searching for 'Michoacán economía'...
Found 100 videos for query 'Michoacán economía'


Processing videos for 'Michoacán economía': 100%|██████████| 100/100 [05:36<00:00,  3.37s/it]


  Analyzed 300 videos and 15046 comments for Michoacán
Saved results to yt_keyword_sentiment/michoacán.csv

Analyzing Sinaloa...
  Searching for 'Sinaloa noticias'...
Found 100 videos for query 'Sinaloa noticias'


Processing videos for 'Sinaloa noticias': 100%|██████████| 100/100 [06:47<00:00,  4.08s/it]


  Searching for 'Sinaloa news'...
Found 100 videos for query 'Sinaloa news'


Processing videos for 'Sinaloa news': 100%|██████████| 100/100 [11:12<00:00,  6.72s/it]


  Searching for 'Sinaloa economía'...
Found 100 videos for query 'Sinaloa economía'


Processing videos for 'Sinaloa economía': 100%|██████████| 100/100 [01:07<00:00,  1.48it/s]


  Analyzed 300 videos and 13467 comments for Sinaloa
Saved results to yt_keyword_sentiment/sinaloa.csv

Analyzing Chihuahua...
  Searching for 'Chihuahua noticias'...
Found 100 videos for query 'Chihuahua noticias'


Processing videos for 'Chihuahua noticias': 100%|██████████| 100/100 [02:22<00:00,  1.43s/it]


  Searching for 'Chihuahua news'...
Found 100 videos for query 'Chihuahua news'


Processing videos for 'Chihuahua news': 100%|██████████| 100/100 [02:13<00:00,  1.34s/it]


  Searching for 'Chihuahua economía'...
Found 100 videos for query 'Chihuahua economía'


Processing videos for 'Chihuahua economía': 100%|██████████| 100/100 [04:44<00:00,  2.85s/it]


  Analyzed 300 videos and 10811 comments for Chihuahua
Saved results to yt_keyword_sentiment/chihuahua.csv

Analyzing Guerrero...
  Searching for 'Guerrero noticias'...
Found 100 videos for query 'Guerrero noticias'


Processing videos for 'Guerrero noticias': 100%|██████████| 100/100 [02:36<00:00,  1.57s/it]


  Searching for 'Guerrero news'...
Found 100 videos for query 'Guerrero news'


Processing videos for 'Guerrero news': 100%|██████████| 100/100 [03:29<00:00,  2.09s/it]


  Searching for 'Guerrero economía'...
Found 100 videos for query 'Guerrero economía'


Processing videos for 'Guerrero economía': 100%|██████████| 100/100 [03:08<00:00,  1.88s/it]


  Analyzed 300 videos and 10180 comments for Guerrero
Saved results to yt_keyword_sentiment/guerrero.csv

Analyzing Tamaulipas...
  Searching for 'Tamaulipas noticias'...
Found 100 videos for query 'Tamaulipas noticias'


Processing videos for 'Tamaulipas noticias': 100%|██████████| 100/100 [01:57<00:00,  1.17s/it]


  Searching for 'Tamaulipas news'...
Found 100 videos for query 'Tamaulipas news'


Processing videos for 'Tamaulipas news': 100%|██████████| 100/100 [08:06<00:00,  4.86s/it]


  Searching for 'Tamaulipas economía'...
Found 100 videos for query 'Tamaulipas economía'


Processing videos for 'Tamaulipas economía': 100%|██████████| 100/100 [02:43<00:00,  1.63s/it]


  Analyzed 300 videos and 14120 comments for Tamaulipas
Saved results to yt_keyword_sentiment/tamaulipas.csv

Analyzing Baja California...
  Searching for 'Baja California noticias'...
Found 100 videos for query 'Baja California noticias'


Processing videos for 'Baja California noticias': 100%|██████████| 100/100 [02:13<00:00,  1.34s/it]


  Searching for 'Baja California news'...
Found 100 videos for query 'Baja California news'


Processing videos for 'Baja California news': 100%|██████████| 100/100 [03:38<00:00,  2.18s/it]


  Searching for 'Baja California economía'...
Found 100 videos for query 'Baja California economía'


Processing videos for 'Baja California economía': 100%|██████████| 100/100 [00:24<00:00,  4.15it/s]


  Analyzed 300 videos and 6178 comments for Baja California
Saved results to yt_keyword_sentiment/baja_california.csv

Analyzing Zacatecas...
  Searching for 'Zacatecas noticias'...
Found 100 videos for query 'Zacatecas noticias'


Processing videos for 'Zacatecas noticias': 100%|██████████| 100/100 [03:58<00:00,  2.38s/it]


  Searching for 'Zacatecas new'...
Found 100 videos for query 'Zacatecas new'


Processing videos for 'Zacatecas new': 100%|██████████| 100/100 [06:25<00:00,  3.85s/it]


  Searching for 'Zacatecas economía'...
Found 100 videos for query 'Zacatecas economía'


Processing videos for 'Zacatecas economía': 100%|██████████| 100/100 [05:03<00:00,  3.03s/it]


  Analyzed 300 videos and 17707 comments for Zacatecas
Saved results to yt_keyword_sentiment/zacatecas.csv

Analyzing Colima...
  Searching for 'Colima noticias'...
Found 100 videos for query 'Colima noticias'


Processing videos for 'Colima noticias': 100%|██████████| 100/100 [03:14<00:00,  1.94s/it]


  Searching for 'Colima news'...
Found 100 videos for query 'Colima news'


Processing videos for 'Colima news': 100%|██████████| 100/100 [11:31<00:00,  6.92s/it]


  Searching for 'Colima economía'...
Found 100 videos for query 'Colima economía'


Processing videos for 'Colima economía': 100%|██████████| 100/100 [02:59<00:00,  1.79s/it]


  Analyzed 300 videos and 21324 comments for Colima
Saved results to yt_keyword_sentiment/colima.csv

Analyzing Jalisco...
  Searching for 'Jalisco noticias'...
Found 100 videos for query 'Jalisco noticias'


Processing videos for 'Jalisco noticias': 100%|██████████| 100/100 [00:49<00:00,  2.01it/s]


  Searching for 'Jalisco news'...
Found 100 videos for query 'Jalisco news'


Processing videos for 'Jalisco news': 100%|██████████| 100/100 [01:19<00:00,  1.26it/s]


  Searching for 'Jalisco economía'...
Found 100 videos for query 'Jalisco economía'


Processing videos for 'Jalisco economía': 100%|██████████| 100/100 [00:21<00:00,  4.57it/s]


  Analyzed 300 videos and 2075 comments for Jalisco
Saved results to yt_keyword_sentiment/jalisco.csv
Saved combined results to yt_keyword_sentiment/all_states_results.csv
