## Search by state name + 'news' / 'economy'
Then use an advanced NLP approach: embed a neutral set of words for each poverty dimension to capture the comments that discuss that dimension, and run sentiment analysis on those comments.
For each state, get the total number of comments and videos analyzed, the number of comments that belong to each dimension, and the sentiment score conditional on each dimension.

There should be no bias here, since we are:
- running a generic search (state + 'noticias', state + 'news', state + 'economía')
- building the embeddings from words associated with the different dimensions of poverty, but neutral in tone ('work', 'salary', ...). This lets us identify comments that discuss these issues without filtering for those that already discuss them negatively, so the sentiment is not necessarily negative.
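
The approach above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the 2-D vectors are hypothetical stand-ins for real sentence embeddings, the helper names (`cosine`, `assign_dimension`, `summarize`) are invented for the example, and the 0.1 threshold is taken from the code cell below. Each comment goes to its most similar dimension, and sentiment is then averaged per dimension.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_dimension(comment_vec, dimension_vecs, threshold=0.1):
    """Return the most similar dimension, or None if similarity is too low."""
    best_name, best_score = None, -1.0
    for name, vec in dimension_vecs.items():
        score = cosine(comment_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None


def summarize(records, dimension_vecs):
    """records: iterable of (comment_vector, sentiment in [-1, 1])."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vec, sentiment in records:
        dim = assign_dimension(vec, dimension_vecs)
        if dim is None:
            continue  # comment does not talk about any dimension
        sums[dim] += sentiment
        counts[dim] += 1
    return {
        dim: {"count": counts[dim], "avg_sentiment": sums[dim] / counts[dim]}
        for dim in counts
    }


# Hypothetical toy vectors standing in for real embeddings.
toy_dimensions = {"INCOME": [1.0, 0.0], "HOUSING": [0.0, 1.0]}
toy_records = [
    ([0.9, 0.1], 0.5),   # close to INCOME, mildly positive
    ([0.8, 0.2], -1.0),  # close to INCOME, very negative
    ([0.1, 0.9], 0.0),   # close to HOUSING, neutral
]
summary = summarize(toy_records, toy_dimensions)
```

Because the dimension vectors are built from neutral words, the assignment step depends only on topic, and the sign of `avg_sentiment` comes entirely from the comments themselves.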

In [None]:
import pandas as pd
import numpy as np
import os
import re
import json
from datetime import datetime
from googleapiclient.discovery import build
from time import sleep
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from tqdm import tqdm

# Load environment variables
load_dotenv()
YT_API_KEY = os.getenv("YT_API_KEY")

# Define states and search terms
STATES_SEARCH_TERMS = {
    "Aguascalientes": ["Aguascalientes noticias", "Aguascalientes news", "Aguascalientes economía"],
    "Baja California": ["Baja California noticias", "Baja California news", "Baja California economía"],
    "Baja California Sur": ["Baja California Sur noticias", "Baja California Sur news", "Baja California Sur economía"],
    "Campeche": ["Campeche noticias", "Campeche news", "Campeche economía"],
    "Chiapas": ["Chiapas noticias", "Chiapas news", "Chiapas economía"],
    "Chihuahua": ["Chihuahua noticias", "Chihuahua news", "Chihuahua economía"],
    "Ciudad de México": ["Ciudad de México noticias", "Ciudad de México news", "Ciudad de México economía"],
    "Coahuila": ["Coahuila noticias", "Coahuila news", "Coahuila economía"],
    "Colima": ["Colima noticias", "Colima news", "Colima economía"],
    "Durango": ["Durango noticias", "Durango news", "Durango economía"],
    "Estado de México": ["Estado de México noticias", "Estado de México news", "Estado de México economía"],
    "Guanajuato": ["Guanajuato noticias", "Guanajuato news", "Guanajuato economía"],
    "Guerrero": ["Guerrero noticias", "Guerrero news", "Guerrero economía"],
    "Hidalgo": ["Hidalgo noticias", "Hidalgo news", "Hidalgo economía"],
    "Jalisco": ["Jalisco noticias", "Jalisco news", "Jalisco economía"],
    "Michoacán": ["Michoacán noticias", "Michoacán news", "Michoacán economía"],
    "Morelos": ["Morelos noticias", "Morelos news", "Morelos economía"],
    "Nayarit": ["Nayarit noticias", "Nayarit news", "Nayarit economía"],
    "Nuevo León": ["Nuevo León noticias", "Nuevo León news", "Nuevo León economía"],
    "Oaxaca": ["Oaxaca noticias", "Oaxaca news", "Oaxaca economía"],
    "Puebla": ["Puebla noticias", "Puebla news", "Puebla economía"],
    "Querétaro": ["Querétaro noticias", "Querétaro news", "Querétaro economía"],
    "Quintana Roo": ["Quintana Roo noticias", "Quintana Roo news", "Quintana Roo economía"],
    "San Luis Potosí": ["San Luis Potosí noticias", "San Luis Potosí news", "San Luis Potosí economía"],
    "Sinaloa": ["Sinaloa noticias", "Sinaloa news", "Sinaloa economía"],
    "Sonora": ["Sonora noticias", "Sonora news", "Sonora economía"],
    "Tabasco": ["Tabasco noticias", "Tabasco news", "Tabasco economía"],
    "Tamaulipas": ["Tamaulipas noticias", "Tamaulipas news", "Tamaulipas economía"],
    "Tlaxcala": ["Tlaxcala noticias", "Tlaxcala news", "Tlaxcala economía"],
    "Veracruz": ["Veracruz noticias", "Veracruz news", "Veracruz economía"],
    "Yucatán": ["Yucatán noticias", "Yucatán news", "Yucatán economía"],
    "Zacatecas": ["Zacatecas noticias", "Zacatecas news", "Zacatecas economía"]}


# Neutral keyword-based descriptions for poverty dimensions: up to about 30 words per dimension
# (roughly 60% standard Spanish, 30% Mexican Spanish slang and 10% English)
POVERTY_DIMENSIONS = {
    "INCOME": """
    empleo trabajo salario ingresos dinero economía sueldo ahorro impuestos
    chamba lana nómina billete jale job salary income money
    """,
    
    "ACCESS TO HEALTH SERVICES": """
    salud médico hospital medicina tratamiento atención clínica seguro
    sistema de salud servicios médicos doctor cuidado ir al doctor health insurance
    seguro médico doctor particular ir a consulta healthcare medical treatment 
    """,
    
    "EDUCATIONAL LAG": """
    educación escuela universidad maestro estudiante aprendizaje escuela pública
    clases formación conocimiento título bachillerato preparatoria escuela secundaria
    """,
    
    "ACCESS TO SOCIAL SECURITY": """
    seguridad social pensión jubilación contrato derechos laborales
    prestaciones protección IMSS ISSSTE afore finiquito ahorro para retiro
    cotizar retirement benefits social security worker rights informal job
    """,
    
    "HOUSING": """
    vivienda casa habitación hogar alquiler renta depa housing utilities
    servicios agua luz gas electricidad construcción propiedad rent 
    techo colonia vecindario urbanización asentamiento cuartito mortgage
    """,
    
    "ACCESS TO FOOD": """
    alimentación comida nutrición alimentos dieta cocinar recetas
    canasta básica food security nutrition meal groceries
    comida saludable dieta balanceada comida rápida comida chatarra
    """,
    
    "SOCIAL COHESION": """
    comunidad sociedad integración participación convivencia barrio raza community
    respeto diversidad solidaridad inclusión pertenencia 
    vecinos apoyo redes sociales confianza belonging inclusion
    """}

# limits for scraping
MAX_VIDEOS_PER_SEARCH = 100  
MAX_COMMENTS_PER_VIDEO = 300  
API_SLEEP_TIME = 0.5  

class TextProcessor:
    def __init__(self):
        self.embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        self.tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
        self.model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
        self.dimension_names = list(POVERTY_DIMENSIONS.keys())
        self.dimension_texts = []
        for keywords in POVERTY_DIMENSIONS.values():
            word_list = keywords.strip().split()
            phrase = " ".join(word_list)
            self.dimension_texts.append(phrase)
        self.dimension_embeddings = self.embedder.encode(self.dimension_texts, convert_to_tensor=True)

    def clean_text(self, text):
        text = re.sub(r'<.*?>', ' ', text)
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'[^\w\sáéíóúüñÁÉÍÓÚÜÑ]', ' ', text)
        return re.sub(r'\s+', ' ', text).strip().lower()

    def classify_dimension(self, text):
        if not text:
            return None, 0.0
        embedding = self.embedder.encode(text, convert_to_tensor=True)
        cosine_scores = util.cos_sim(embedding, self.dimension_embeddings)[0]
        max_idx = torch.argmax(cosine_scores).item()
        return self.dimension_names[max_idx], cosine_scores[max_idx].item()

    def get_sentiment_score(self, text):
        if not text:
            return 0.0
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = self.model(**inputs)
        stars = torch.argmax(outputs.logits, dim=1).item() + 1
        return (stars - 3) / 2  # Normalize to [-1, 1]

class YouTubeAnalyzer:
    def __init__(self, api_key):
        self.api_key = api_key
        self.youtube = build("youtube", "v3", developerKey=api_key)
        self.processor = TextProcessor()

    def search_videos(self, query, published_after, published_before, max_results=MAX_VIDEOS_PER_SEARCH):
        """Search for videos using a keyword query."""
        videos = []
        next_page_token = None
        
        try:
            while len(videos) < max_results:
                response = self.youtube.search().list(
                    q=query,
                    part="snippet",
                    maxResults=min(50, max_results - len(videos)),  # YouTube API allows max 50 per request
                    pageToken=next_page_token,
                    type="video",
                    order="relevance",
                    publishedAfter=published_after,
                    publishedBefore=published_before,
                    relevanceLanguage="es"
                ).execute()
                
                for item in response.get("items", []):
                    if item["id"]["kind"] == "youtube#video":
                        videos.append({
                            "id": item["id"]["videoId"],
                            "title": item["snippet"]["title"],
                            "description": item["snippet"].get("description", ""),
                            "published_at": item["snippet"]["publishedAt"]
                        })
                
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(videos) >= max_results:
                    break
                
                sleep(API_SLEEP_TIME)  # Avoid quota exceeded errors
                
        except Exception as e:
            print(f"Error searching for '{query}': {e}")
        
        print(f"Found {len(videos)} videos for query '{query}'")
        return videos

    def get_video_comments(self, video_id, max_comments=MAX_COMMENTS_PER_VIDEO):
        """Get comments for a specific video."""
        comments = []
        next_page_token = None
        
        try:
            while len(comments) < max_comments:
                response = self.youtube.commentThreads().list(
                    part="snippet",
                    videoId=video_id,
                    maxResults=min(100, max_comments - len(comments)),  # YouTube API allows max 100 per request
                    pageToken=next_page_token
                ).execute()
                
                for item in response.get("items", []):
                    comment_text = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
                    comments.append(comment_text)
                
                next_page_token = response.get("nextPageToken")
                if not next_page_token or len(comments) >= max_comments:
                    break
                
                sleep(API_SLEEP_TIME)  # Avoid quota exceeded errors
                
        except Exception:
            # Many videos have comments disabled, so ignore the error and return what was collected.
            # Note that this also hides other failures, such as quota errors.
            pass
        
        return comments

    def analyze_state_by_keywords(self, state_name, search_terms, date_range):
        """Analyze a state by searching for videos using specified search terms."""
        print(f"\nAnalyzing {state_name}...")
        dimension_stats = {dim: {"sentiment_sum": 0.0, "count": 0} for dim in POVERTY_DIMENSIONS}
        total_videos = 0
        total_comments = 0
        
        # Search for videos with each search term
        for search_term in search_terms:
            print(f"  Searching for '{search_term}'...")
            videos = self.search_videos(
                query=search_term,
                published_after=date_range["published_after"],
                published_before=date_range["published_before"],
                max_results=MAX_VIDEOS_PER_SEARCH
            )
            
            if not videos:
                continue
                
            total_videos += len(videos)
            
            # Process videos
            for video in tqdm(videos, desc=f"Processing videos for '{search_term}'"):
                # Get video comments
                comments = self.get_video_comments(video["id"], MAX_COMMENTS_PER_VIDEO)
                total_comments += len(comments)
                
                # Build the list of texts: title plus description as one text, then each comment
                all_texts = [video["title"] + ". " + video["description"]] + comments
                
                # Analyze each text
                for text in all_texts:
                    clean = self.processor.clean_text(text)
                    if len(clean) < 10:  # Skip very short texts
                        continue
                        
                    dimension, confidence = self.processor.classify_dimension(clean)
                    if confidence > 0.1:  # Only count if confidence is high enough
                        sentiment = self.processor.get_sentiment_score(clean)
                        dimension_stats[dimension]["sentiment_sum"] += sentiment
                        dimension_stats[dimension]["count"] += 1
        
        print(f"  Analyzed {total_videos} videos and {total_comments} comments for {state_name}")
        return dimension_stats, total_videos, total_comments

def analyze_all_states():
    analyzer = YouTubeAnalyzer(YT_API_KEY)
    date_range = {
        "published_after": "2020-01-01T00:00:00Z",
        "published_before": "2020-12-31T23:59:59Z"
    }
    
    # Create directories for results
    os.makedirs("yt_data_2020", exist_ok=True)
    
    # Store overall stats for summary
    all_results = []
    
    for state, search_terms in STATES_SEARCH_TERMS.items():
        stats, total_videos, total_comments = analyzer.analyze_state_by_keywords(
            state_name=state,
            search_terms=search_terms,
            date_range=date_range
        )
        
        # Create dataframe for this state
        df = pd.DataFrame([
            {
                "state": state,
                "dimension": dim.replace("_", " ").title(),
                "avg_sentiment": v["sentiment_sum"] / v["count"] if v["count"] else 0,
                "mentions_count": v["count"],
                "videos_analyzed": total_videos,
                "comments_analyzed": total_comments
            }
            for dim, v in stats.items()
        ])
        
        # Save state-specific results
        output_file = f"yt_data_2020/{state.replace(' ', '_').lower()}.csv"
        df.to_csv(output_file, index=False)
        print(f"Saved results to {output_file}")
        
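        # Add to overall results
        all_results.append(df)

    # Combine the per-state tables into one summary table
    return pd.concat(all_results, ignore_index=True)
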
if __name__ == "__main__":
    analyze_all_states()
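
The sentiment normalization used in `get_sentiment_score` can be checked in isolation. The sketch below assumes, as the code above does, that the classifier returns five logits corresponding to 1 to 5 stars; the helper names and the toy logits are hypothetical and only illustrate the mapping.

```python
def stars_from_logits(logits):
    """Index of the largest logit, shifted to a 1-5 star rating."""
    return max(range(len(logits)), key=lambda i: logits[i]) + 1


def normalize_stars(stars):
    """Map 1-5 stars linearly onto [-1, 1], with 3 stars as neutral."""
    return (stars - 3) / 2


# Full mapping from star rating to normalized score.
mapping = {stars: normalize_stars(stars) for stars in range(1, 6)}

# Hypothetical logits where the third class (3 stars) wins.
toy_logits = [0.1, 0.2, 2.5, 0.3, 0.0]
toy_score = normalize_stars(stars_from_logits(toy_logits))
```

Since only the arg-max class is used, the score takes one of five discrete values; the averages per dimension are therefore means over a five-point scale.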




Analyzing Querétaro...
  Searching for 'Querétaro noticias'...
Found 100 videos for query 'Querétaro noticias'


Processing videos for 'Querétaro noticias': 100%|██████████| 100/100 [01:24<00:00,  1.18it/s]


  Searching for 'Querétaro news'...
Found 100 videos for query 'Querétaro news'


Processing videos for 'Querétaro news': 100%|██████████| 100/100 [05:01<00:00,  3.01s/it]


  Searching for 'Querétaro economía'...
Found 100 videos for query 'Querétaro economía'


Processing videos for 'Querétaro economía': 100%|██████████| 100/100 [01:36<00:00,  1.04it/s]


  Analyzed 300 videos and 8876 comments for Querétaro
Saved results to yt_data_2020/querétaro.csv

Analyzing Quintana Roo...
  Searching for 'Quintana Roo noticias'...
Found 100 videos for query 'Quintana Roo noticias'


Processing videos for 'Quintana Roo noticias': 100%|██████████| 100/100 [00:52<00:00,  1.90it/s]


  Searching for 'Quintana Roo news'...
Found 100 videos for query 'Quintana Roo news'


Processing videos for 'Quintana Roo news': 100%|██████████| 100/100 [01:50<00:00,  1.10s/it]


  Searching for 'Quintana Roo economía'...
Found 100 videos for query 'Quintana Roo economía'


Processing videos for 'Quintana Roo economía': 100%|██████████| 100/100 [00:38<00:00,  2.56it/s]


  Analyzed 300 videos and 2968 comments for Quintana Roo
Saved results to yt_data_2020/quintana_roo.csv

Analyzing San Luis Potosí...
  Searching for 'San Luis Potosí noticias'...
Found 100 videos for query 'San Luis Potosí noticias'


Processing videos for 'San Luis Potosí noticias': 100%|██████████| 100/100 [01:44<00:00,  1.04s/it]


  Searching for 'San Luis Potosí news'...
Found 100 videos for query 'San Luis Potosí news'


Processing videos for 'San Luis Potosí news': 100%|██████████| 100/100 [02:04<00:00,  1.25s/it]


  Searching for 'San Luis Potosí economía'...
Found 100 videos for query 'San Luis Potosí economía'


Processing videos for 'San Luis Potosí economía': 100%|██████████| 100/100 [00:23<00:00,  4.34it/s]


  Analyzed 300 videos and 3394 comments for San Luis Potosí
Saved results to yt_data_2020/san_luis_potosí.csv

Analyzing Sinaloa...
  Searching for 'Sinaloa noticias'...
Found 100 videos for query 'Sinaloa noticias'


Processing videos for 'Sinaloa noticias': 100%|██████████| 100/100 [01:46<00:00,  1.06s/it]


  Searching for 'Sinaloa news'...
Found 100 videos for query 'Sinaloa news'


Processing videos for 'Sinaloa news': 100%|██████████| 100/100 [09:26<00:00,  5.66s/it]


  Searching for 'Sinaloa economía'...
Found 100 videos for query 'Sinaloa economía'


Processing videos for 'Sinaloa economía': 100%|██████████| 100/100 [02:03<00:00,  1.23s/it]


  Analyzed 300 videos and 13493 comments for Sinaloa
Saved results to yt_data_2020/sinaloa.csv

Analyzing Sonora...
  Searching for 'Sonora noticias'...
Found 100 videos for query 'Sonora noticias'


Processing videos for 'Sonora noticias': 100%|██████████| 100/100 [01:48<00:00,  1.08s/it]


  Searching for 'Sonora news'...
Found 100 videos for query 'Sonora news'


Processing videos for 'Sonora news': 100%|██████████| 100/100 [02:39<00:00,  1.60s/it]


  Searching for 'Sonora economía'...
Found 100 videos for query 'Sonora economía'


Processing videos for 'Sonora economía': 100%|██████████| 100/100 [00:47<00:00,  2.09it/s]


  Analyzed 300 videos and 4591 comments for Sonora
Saved results to yt_data_2020/sonora.csv

Analyzing Tabasco...
  Searching for 'Tabasco noticias'...
Found 100 videos for query 'Tabasco noticias'


Processing videos for 'Tabasco noticias': 100%|██████████| 100/100 [01:57<00:00,  1.18s/it]


  Searching for 'Tabasco news'...
Found 100 videos for query 'Tabasco news'


Processing videos for 'Tabasco news': 100%|██████████| 100/100 [06:33<00:00,  3.94s/it]


  Searching for 'Tabasco economía'...
Found 100 videos for query 'Tabasco economía'


Processing videos for 'Tabasco economía': 100%|██████████| 100/100 [01:15<00:00,  1.33it/s]


  Analyzed 300 videos and 8958 comments for Tabasco
Saved results to yt_data_2020/tabasco.csv

Analyzing Tamaulipas...
  Searching for 'Tamaulipas noticias'...
Found 100 videos for query 'Tamaulipas noticias'


Processing videos for 'Tamaulipas noticias': 100%|██████████| 100/100 [03:02<00:00,  1.83s/it]


  Searching for 'Tamaulipas news'...
Found 100 videos for query 'Tamaulipas news'


Processing videos for 'Tamaulipas news': 100%|██████████| 100/100 [08:27<00:00,  5.08s/it]


  Searching for 'Tamaulipas economía'...
Found 100 videos for query 'Tamaulipas economía'


Processing videos for 'Tamaulipas economía': 100%|██████████| 100/100 [01:19<00:00,  1.26it/s]


  Analyzed 300 videos and 13011 comments for Tamaulipas
Saved results to yt_data_2020/tamaulipas.csv

Analyzing Tlaxcala...
  Searching for 'Tlaxcala noticias'...
Found 100 videos for query 'Tlaxcala noticias'


Processing videos for 'Tlaxcala noticias': 100%|██████████| 100/100 [01:16<00:00,  1.30it/s]


  Searching for 'Tlaxcala news'...
Found 100 videos for query 'Tlaxcala news'


Processing videos for 'Tlaxcala news': 100%|██████████| 100/100 [03:31<00:00,  2.12s/it]


  Searching for 'Tlaxcala economía'...
Found 100 videos for query 'Tlaxcala economía'


Processing videos for 'Tlaxcala economía': 100%|██████████| 100/100 [01:55<00:00,  1.15s/it]


  Analyzed 300 videos and 5786 comments for Tlaxcala
Saved results to yt_data_2020/tlaxcala.csv

Analyzing Veracruz...
  Searching for 'Veracruz noticias'...
Found 100 videos for query 'Veracruz noticias'


Processing videos for 'Veracruz noticias': 100%|██████████| 100/100 [02:56<00:00,  1.76s/it]


  Searching for 'Veracruz news'...
Found 100 videos for query 'Veracruz news'


Processing videos for 'Veracruz news': 100%|██████████| 100/100 [09:00<00:00,  5.41s/it]


  Searching for 'Veracruz economía'...
Found 100 videos for query 'Veracruz economía'


Processing videos for 'Veracruz economía': 100%|██████████| 100/100 [03:50<00:00,  2.30s/it]


  Analyzed 300 videos and 16241 comments for Veracruz
Saved results to yt_data_2020/veracruz.csv

Analyzing Yucatán...
  Searching for 'Yucatán noticias'...
Found 100 videos for query 'Yucatán noticias'


Processing videos for 'Yucatán noticias': 100%|██████████| 100/100 [00:38<00:00,  2.62it/s]


  Searching for 'Yucatán news'...
Found 100 videos for query 'Yucatán news'


Processing videos for 'Yucatán news': 100%|██████████| 100/100 [02:11<00:00,  1.31s/it]


  Searching for 'Yucatán economía'...
Found 100 videos for query 'Yucatán economía'


Processing videos for 'Yucatán economía': 100%|██████████| 100/100 [00:43<00:00,  2.29it/s]


  Analyzed 300 videos and 2971 comments for Yucatán
Saved results to yt_data_2020/yucatán.csv

Analyzing Zacatecas...
  Searching for 'Zacatecas noticias'...
Found 100 videos for query 'Zacatecas noticias'


Processing videos for 'Zacatecas noticias': 100%|██████████| 100/100 [05:21<00:00,  3.22s/it]


  Searching for 'Zacatecas news'...
Found 100 videos for query 'Zacatecas news'


Processing videos for 'Zacatecas news': 100%|██████████| 100/100 [04:47<00:00,  2.88s/it]


  Searching for 'Zacatecas economía'...
Error searching for 'Zacatecas economía': <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/search?q=Zacatecas+econom%C3%ADa&part=snippet&maxResults=50&type=video&order=relevance&publishedAfter=2020-01-01T00%3A00%3A00Z&publishedBefore=2020-12-31T23%3A59%3A59Z&relevanceLanguage=es&key=<REDACTED>&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">
Found 0 videos for query 'Zacatecas economía'
  Analyzed 200 videos and 8858 comments for Zacatecas
Saved results to yt_data_2020/zacatecas.csv
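
The `quotaExceeded` error above is consistent with a rough quota estimate for this search plan. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes the commonly documented YouTube Data API v3 costs (100 units per `search.list` call, 1 unit per `commentThreads.list` call) and a default daily quota of 10,000 units, all of which may differ for a given project. The function names are hypothetical.

```python
import math

SEARCH_COST = 100     # assumed units per search.list call
COMMENT_COST = 1      # assumed units per commentThreads.list call
DAILY_QUOTA = 10_000  # assumed default daily quota


def search_calls(max_videos, page_size=50):
    """Number of search pages needed to collect max_videos results."""
    return math.ceil(max_videos / page_size)


def estimate_units(n_states, terms_per_state, max_videos, comment_pages_per_video=1):
    """Lower-bound quota estimate: at least one comment page per video."""
    queries = n_states * terms_per_state
    search_units = queries * search_calls(max_videos) * SEARCH_COST
    comment_units = queries * max_videos * comment_pages_per_video * COMMENT_COST
    return search_units + comment_units


# 32 states, 3 search terms each, 100 videos per search (values from the code above).
full_run = estimate_units(n_states=32, terms_per_state=3, max_videos=100)
days_needed = math.ceil(full_run / DAILY_QUOTA)
```

Under these assumptions a full run needs more than one day of quota, so the run has to be split across days or API keys, and a state whose search fails (as 'Zacatecas economía' did) ends up with fewer videos than the others.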
