# Telegram Analysis for Multidimensional Poverty Classification in Mexico

This notebook implements a text classification system that analyzes Telegram posts from Mexican news channels and categorizes their content according to seven dimensions of multidimensional poverty. The dimensions are those identified by CONEVAL (Consejo Nacional de Evaluación de la Política de Desarrollo Social), with some minor modifications.

Specifically, we consider:
   - **Income**: Employment, wages, economic instability
   - **Access to Health Services**: Healthcare availability, medical infrastructure
   - **Educational Lag**: School dropout, educational access, academic delays
   - **Access to Social Security**: Labor protection, social benefits, pension systems
   - **Housing**: Living conditions, basic services, housing quality
   - **Access to Food**: Food security, nutrition, food prices
   - **Social Cohesion**: Discrimination, social exclusion, community tensions

## Methodology Implemented

The analysis follows these key steps:

1. **Data Collection**: Extract posts from 10 major Mexican news channels on Telegram for a given year,
2. **Geographic Classification**: Assign posts to Mexican states based on textual mentions of state names,
3. **Poverty Dimension Classification**: Use sentence embeddings and cosine similarity to classify posts into the corresponding dimension,
4. **Output Generation**: Produce counts and percentages of posts per poverty dimension for each of the 32 Mexican states.

## Technical Approach

### Text Preprocessing 

- **HTML/Markup Removal**: Strip HTML tags and web links from Telegram posts
- **Character Normalization**: Keep only letters, numbers, and Spanish accented characters
- **Whitespace Normalization**: Clean extra spaces and convert to lowercase
- **Length Filtering**: Exclude very short texts (< 10 characters)
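
As a worked illustration of these four steps, the sketch below applies the same regular expressions used later in the notebook's `clean_text` method to a made-up post. The function name `preprocess`, the `min_length` parameter, and the sample strings are hypothetical and exist only for this example.

```python
import re

def preprocess(text, min_length=10):
    """Clean a post; return None if the cleaned text is too short (hypothetical helper)."""
    # strip HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # strip web links
    text = re.sub(r'http\S+', '', text)
    # keep letters, digits, whitespace, and Spanish accented characters
    text = re.sub(r'[^\w\sáéíóúüñÁÉÍÓÚÜÑ]', ' ', text)
    # collapse whitespace and lowercase
    cleaned = re.sub(r'\s+', ' ', text).strip().lower()
    # drop very short texts
    return cleaned if len(cleaned) >= min_length else None

sample = '<b>Desempleo</b> en Nuevo León: https://t.me/x ¡sube 5%!'
print(preprocess(sample))         # desempleo en nuevo león sube 5
print(preprocess('<b>Hola</b>'))  # None (only 4 characters after cleaning)
```

Note that punctuation such as `%` is dropped, so "5%" becomes "5" before the text is embedded.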

### Embeddings and Classification

We use the pre-trained model `hiiamsid/sentence_similarity_spanish_es`, a sentence transformer suited to Spanish. This model converts each text (both posts and poverty dimension definitions) into a 768-dimensional numerical vector that captures the semantic meaning of the text, so that similar concepts have similar vector representations.

We create embeddings for each of the seven poverty dimension definitions and for each Telegram post, then compute the cosine similarity between the post embedding and each dimension embedding. Each post is assigned to the dimension with the highest similarity score, but only if that score is at least 0.10. Posts that do not meet this threshold fall into the 'other' category, meaning they are not considered to discuss any of the poverty dimensions.

We believe this approach is well suited to our classification task because it considers the full context of the text, not just individual words (as simple word matching would). The 0.10 threshold is also intended to ensure that we only classify posts that are actually related to poverty topics, avoiding misclassifications due to pure noise.
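
The assignment rule can be sketched with toy vectors. The example below uses plain NumPy and 3-dimensional vectors standing in for the 768-dimensional embeddings. The vectors, the helper names `cosine_similarity` and `assign_dimension`, and the choice to return the string `"OTHER"` are assumptions made for illustration, not the notebook's actual implementation, which relies on `sentence_transformers`.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_dimension(post_vec, dimension_vecs, threshold=0.10):
    # score the post against every dimension anchor
    scores = {name: cosine_similarity(post_vec, vec)
              for name, vec in dimension_vecs.items()}
    # keep the best-scoring dimension only if it reaches the threshold
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return "OTHER", scores[best]

# toy anchors (hypothetical values, not real embeddings)
dimension_vecs = {
    "INCOME": np.array([1.0, 0.0, 0.0]),
    "HOUSING": np.array([0.0, 1.0, 0.0]),
}

close_to_income = np.array([0.9, 0.1, 0.0])
unrelated = np.array([0.0, 0.05, 1.0])

print(assign_dimension(close_to_income, dimension_vecs)[0])  # INCOME
print(assign_dimension(unrelated, dimension_vecs)[0])        # OTHER
```

The second vector is nearly orthogonal to both anchors, so its best score (about 0.05) stays below the threshold and the post is treated as unrelated.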

### Geographic Classification

We use regular expressions to find exact state name mentions in the post text and assign each post to the corresponding state. Posts mentioning multiple states are counted once for each state mentioned.

### Extracted Components 

The final output provides quantitative measures of how frequently each poverty dimension is discussed in relation to each Mexican state, serving as a real-time indicator for multidimensional poverty analysis.

In [None]:
# load necessary libraries
import pandas as pd
import numpy as np
import re
from tqdm import tqdm
import os
from sentence_transformers import SentenceTransformer, util
import torch
from dotenv import load_dotenv
from mongo_wrapper.mongo_wrapper import MongoWrapper
load_dotenv()

In [None]:
# define states to categorize
STATES = [
    "Aguascalientes", "Baja California", "Baja California Sur", "Campeche", "Chiapas", "Chihuahua",
    "Ciudad de México", "Coahuila", "Colima", "Durango", "Estado de México", "Guanajuato", 
    "Guerrero", "Hidalgo", "Jalisco", "Michoacán", "Morelos", "Nayarit", "Nuevo León", "Oaxaca", 
    "Puebla", "Querétaro", "Quintana Roo", "San Luis Potosí", "Sinaloa", "Sonora", "Tabasco", 
    "Tamaulipas", "Tlaxcala", "Veracruz", "Yucatán", "Zacatecas"]

In [None]:
# channels to analyze - these are the channels that have been previously scraped and stored in the Mongo database
# each collection name is the channel name followed by the year of the scrape
TARGET_CHANNELS = [
    "elpaismexico_2020",
    "ElUniversalOnline_2020",
    "proceso_unofficial_2020",
    "politicomx_2020",
    "lajornada_unofficial_2020",
    "larazondemexico_2020",
    "sinembargomx_2020",
    "elpaisamerica_2020",
    "animalpolitico_2020",
    "ElEconomista_MTY_2020"]

We start by defining the seven poverty dimensions using carefully selected Spanish keywords and phrases that capture the linguistic patterns associated with discussions of each aspect of poverty. These serve as semantic anchors for the classification algorithm.
Given the nature of the data we are analyzing, we maintained formal journalistic jargon, including only standard words and phrases in Spanish rather than colloquial expressions.

We extensively worked on building these lists, as initially some dimensions were overrepresented while others were underrepresented. We found that in some cases (for example, the housing dimension) it was optimal to include fewer words, as otherwise an unrealistic percentage of posts would fall into this dimension.

Through different analyses, we discovered that dimensions need to be handled differently, in the sense that some are easier to capture linguistically while others prove more challenging. Regardless of the specific analysis conducted, a crucial component is properly defining these word lists, since they largely determine the quality of the classification results.

In [None]:
# define dimensions of poverty 
POVERTY_DIMENSIONS = {
    "INCOME": """
    desempleo salario mínimo bajos ingresos deudas familiares pobreza laboral
    pérdida de empleo ingreso insuficiente precariedad laboral empleo informal
    falta de oportunidades laborales reducción de salario inestabilidad económica
    recesión subempleo despidos masivos contratos temporales informalidad
    costos de vida elevados falta de empleo formal insuficiencia salarial
    """,

    "ACCESS TO HEALTH SERVICES": """
    falta de acceso a servicios de salud hospitales saturados escasez de medicamentos 
    deficiencias en la atención médica carencia de personal médico emergencia sanitaria
    costos elevados de tratamientos cierre de centros de salud lista de espera prolongada
    equipos médicos inoperantes desabasto de vacunas falta de atención especializada
    """,

    "EDUCATIONAL_LAG": """
    deserción escolar suspensión de clases carencia de docentes  
    dificultades de acceso a la educación educación interrumpida rezago académico 
    falta de recursos escolares acceso desigual a la educación deficiencias en formación 
    básica carencia de materiales educativos 
    """,

    "ACCESS TO SOCIAL SECURITY": """
    empleo informal ausencia de prestaciones sociales falta de contrato laboral 
    exclusión del sistema de pensiones carencia de protección social trabajo precario 
    derechos laborales no garantizados falta de cotización al sistema desprotección estructural
    dificultades para acceder al seguro social informalidad laboral empleo sin afiliación
    """,

    "HOUSING": """
    vivienda precaria hacinamiento falta de servicios básicos 
    infraestructura deteriorada zonas marginadas viviendas inseguras
    """,
    
    "ACCESS TO FOOD": """
    inseguridad alimentaria acceso limitado a alimentos inflación precios
    raciones insuficientes pobreza alimentaria aumento de precios comedor comunitario
    canasta básica crisis alimentaria comida pobre ayuda alimentaria 
    insuficiencia nutricional alimentación deficiente encarecimiento de alimentos
    inflación en alimentos carencia alimentaria productos básicos banco de alimentos
    alimentos inaccesibles gasto alimentario elevado programas alimentarios
    """,

    "SOCIAL_COHESION": """
    discriminación étnica marginación social exclusión comunidades vulnerables
    conflictos intercomunitarios tensiones sociales barreras sociales 
    desigualdad aislamiento social
    """}

The `PovertyDimensionClassifier` class implements the core classification task through three steps:

**Initialization:**
- Loads a Spanish-optimized sentence transformer model (`hiiamsid/sentence_similarity_spanish_es`)
- Precomputes embeddings for all seven poverty dimension definitions
- Represents each poverty concept in a 768-dimensional vector space

**Text Preprocessing:**
- Removes HTML tags, URLs, and special characters
- Converts to lowercase and normalizes whitespace
- Filters out posts shorter than 10 characters 

**Classification Process:**
- Converts cleaned text into a 768-dimensional embedding vector 
- Computes cosine similarity between the post embedding and each pre-computed dimension embedding
- Selects the dimension with highest similarity score
- Applies a 0.10 threshold so that only posts with meaningful semantic overlap are classified, reducing false positives
- Classifies posts below the threshold as unrelated to poverty topics, falling into the 'other' category

In [None]:
# initialize the classifier with Spanish sentence embeddings model and precompute embeddings for all poverty dimensions
class PovertyDimensionClassifier:
    def __init__(self):
        # load Spanish sentence transformer model optimized for semantic similarity
        self.model = SentenceTransformer('hiiamsid/sentence_similarity_spanish_es')
        
        # store dimension names for easy reference
        self.dimension_names = list(POVERTY_DIMENSIONS.keys())
        
        # precompute embeddings for all poverty dimension descriptions
        self.dimension_embeddings = self.model.encode(
            list(POVERTY_DIMENSIONS.values()), 
            convert_to_tensor=True)
    
    # clean and preprocess text for better embedding quality
    def clean_text(self, text):
        if not isinstance(text, str):
            return ""
        
        # remove HTML tags that might appear in Telegram posts
        text = re.sub(r'<.*?>', ' ', text)
        
        # remove URLs and links
        text = re.sub(r'http\S+', '', text)
        
        # keep only alphanumeric characters and Spanish accented letters
        text = re.sub(r'[^\w\sáéíóúüñÁÉÍÓÚÜÑ]', ' ', text)
        
        # normalize whitespace and convert to lowercase
        return re.sub(r'\s+', ' ', text).strip().lower()
    
    # classify text into poverty dimensions using semantic similarity
    def classify_text(self, text, threshold=0.10):
        if not text:
            return None, 0.0
        
        # clean the input text
        cleaned_text = self.clean_text(text)
        
        # skip very short texts as they might lack semantic content
        if len(cleaned_text) < 10:
            return None, 0.0
        
        # generate embedding for the input text
        text_embedding = self.model.encode(cleaned_text, convert_to_tensor=True)
        
        # compute cosine similarity between text and all poverty dimensions
        cosine_scores = util.cos_sim(text_embedding, self.dimension_embeddings)[0]
        
        # find the dimension with highest similarity score
        max_idx = torch.argmax(cosine_scores).item()
        max_score = cosine_scores[max_idx].item()
        
        # classify into one of the dimensions only if similarity reaches the threshold
        if max_score >= threshold:
            return self.dimension_names[max_idx], max_score
        else:
            return None, max_score

This function handles the data extraction and geographic classification pipeline:

- **Database Connection:** Connects to MongoDB and keeps, from all available collections, only those of interest (i.e., the target channels and year)

- **Geographic Classification:** Creates regex patterns with word boundaries (`\b`), which prevent partial-word matches, to assign posts to the states they mention.

**Pipeline and Output:**
- Iterates through each channel
- Retrieves all posts from MongoDB collections
- Searches each post for state name mentions using regex patterns
- Stores posts for each mentioned state
- Returns dictionary with state names as keys
- Each value is a DataFrame containing posts mentioning that state
- Single posts can appear in multiple states if they mention multiple locations
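
The matching behaviour can be checked in memory, without MongoDB. The sketch below builds the same kind of pattern for a small, hypothetical subset of states (`SAMPLE_STATES`) and a made-up helper `states_mentioned`. It suggests two things worth keeping in mind: a state name fused into a longer token, such as a hashtag, is not matched, and a post mentioning "Baja California Sur" also matches "Baja California", because a word boundary follows "California".

```python
import re

# hypothetical subset of the full state list, for illustration only
SAMPLE_STATES = ["Baja California", "Baja California Sur", "Sonora"]

# one case-insensitive pattern per state, delimited by word boundaries
patterns = {
    state: re.compile(r'\b' + re.escape(state) + r'\b', re.IGNORECASE)
    for state in SAMPLE_STATES}

def states_mentioned(text):
    # return every state whose pattern is found in the text
    return [state for state, pattern in patterns.items() if pattern.search(text)]

print(states_mentioned("Lluvias en Sonora y Baja California Sur"))
# ['Baja California', 'Baja California Sur', 'Sonora']
print(states_mentioned("sonora registra calor"))  # ['Sonora']
print(states_mentioned("Tendencia #SonoraHoy"))   # []
```

If the overlap between the two Baja California names matters for the analysis, one possible adjustment (not part of the original pipeline) is a negative lookahead such as `(?!\s+Sur)` on the shorter name.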

In [None]:
# load Telegram posts from MongoDB and classify them by Mexican states
def load_state_posts():
    # connect to MongoDB
    mongo_client = MongoWrapper(
        db=os.getenv("MONGO_DB"),
        user=os.getenv("MONGO_USERNAME"),
        password=os.getenv("MONGO_PASSWORD"),
        ip=os.getenv("MONGO_IP"),
        port=os.getenv("MONGO_PORT"))
    
    # get all available collections and filter for those of interest
    all_channels = mongo_client.get_all_collections()
    available_target_channels = [
        channel for channel in TARGET_CHANNELS 
        if channel in all_channels]
    
    # initialize dictionary to store posts categorized by state
    state_posts = {state: [] for state in STATES}
    
    # create regex patterns for each state to identify mentions in posts - match complete state names, not partial matches
    state_patterns = {
        state: re.compile(r'\b' + re.escape(state) + r'\b', re.IGNORECASE) 
        for state in STATES}
    
    # process each available target channel
    for channel in tqdm(available_target_channels, desc="Loading channels"):
        # retrieve all posts from the current channel
        posts = mongo_client.get_collection_entries(collection=channel)
        print(f"Channel: {channel} - {len(posts)} posts found")
        # process each post in the channel
        for post in tqdm(posts, desc=f"Analyzing {channel}", leave=False):
            # fall back to an empty string when the text field is missing or None
            post_text = post.get('text') or ''
            # check if post mentions any of the Mexican states   
            for state, pattern in state_patterns.items():
                if pattern.search(post_text):
                    # store post if state is mentioned
                    state_posts[state].append(post_text)  
    
    # convert to df 
    for state in STATES:
        if state_posts[state]:
            state_posts[state] = pd.DataFrame(state_posts[state], columns=['text'])
        else:
            state_posts[state] = pd.DataFrame(columns=['text'])
    
    return state_posts

This function classifies posts into multidimensional poverty categories across Mexican states.

- The process begins by initializing the classifier, which then processes posts state by state, classifying each one according to its semantic similarity to the predefined poverty dimensions. As noted earlier, posts with a similarity score below 0.10 are labeled as "other".

- As it runs, the function tracks the number of posts classified under each dimension, to calculate both the absolute counts and the percentage distribution of posts across dimensions.

- The output is a dataframe, where each row corresponds to a single poverty dimension for a given state. It includes the number of matching posts, their percentage share, and the total number of posts analyzed. 
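
Because the output is in long format (one row per state and dimension), it can be convenient to reshape it into a state-by-dimension table for inspection. The sketch below does this with `DataFrame.pivot`. The rows are made-up placeholder values that only mimic the column layout described above; they are not results from the analysis.

```python
import pandas as pd

# hypothetical rows in the long format described above (values are invented)
results_df = pd.DataFrame([
    {"state": "Sonora", "dimension": "INCOME", "count": 3, "percentage": 30.0, "total_posts": 10},
    {"state": "Sonora", "dimension": "OTHER", "count": 7, "percentage": 70.0, "total_posts": 10},
    {"state": "Colima", "dimension": "INCOME", "count": 1, "percentage": 25.0, "total_posts": 4},
    {"state": "Colima", "dimension": "OTHER", "count": 3, "percentage": 75.0, "total_posts": 4},
])

# one row per state, one column per dimension, percentages as cell values
wide = results_df.pivot(index="state", columns="dimension", values="percentage")
print(wide)
```

The same call with `values="count"` gives the absolute counts instead of percentages.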

In [None]:
# analyze all posts for each state and classify them into poverty dimensions
def analyze_poverty_dimensions(state_posts):
    # initialize the classifier
    classifier = PovertyDimensionClassifier()
    
    # store results for all states and dimensions
    results = []

    # process each state individually
    for state, df in state_posts.items():
        print(f"\nAnalyzing {state} ({len(df)} posts)...")
    
        # initialize counters for each poverty dimension plus the "other" category - fallback for posts not related to 
        # any dimension or posts that do not exceed the threshold
        dimension_counts = {dim: 0 for dim in POVERTY_DIMENSIONS.keys()}
        dimension_counts["OTHER"] = 0  
    
        # classify each post in the current state
        for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Classifying {state}"):
            text = row['text']
        
            # get classification result from the classifier
            dimension, score = classifier.classify_text(text)
        
            # increment counter for the identified dimension or "other"
            if dimension:
                dimension_counts[dimension] += 1
            else:
                dimension_counts["OTHER"] += 1
        
        # calculate and print statistics (guard against states with no posts)
        total_posts = len(df)
        dimension_percentages = {
            dim: (count / total_posts) * 100 if total_posts else 0.0
            for dim, count in dimension_counts.items()}
        print(f"\nResults for {state}:")
        print(f"Total posts: {total_posts}")
        print("\nDistribution of posts across poverty dimensions:")
        
        for dim, count in dimension_counts.items():
            dim_name = dim if dim != "OTHER" else "Non-poverty posts"
            pct = dimension_percentages[dim]
            print(f"- {dim_name}: {count} posts ({pct:.1f}%)")
        
        # store results 
        for dim in list(POVERTY_DIMENSIONS.keys()) + ["OTHER"]:
            results.append({
                'state': state,
                'dimension': dim,
                'count': dimension_counts[dim],
                'percentage': dimension_percentages[dim],
                'total_posts': total_posts})
    
    # convert to df 
    results_df = pd.DataFrame(results)
    return results_df

In [None]:
# main function that implements the whole pipeline
def main():
    # load and geographically classify posts from MongoDB
    state_posts = load_state_posts()
    
    # analyze posts for poverty dimensions
    results = analyze_poverty_dimensions(state_posts)
    
    # export results to CSV for further analysis
    results.to_csv("tg_2020.csv", index=False)

if __name__ == "__main__":
    main()