# Google Trends Analysis for Multidimensional Poverty Classification in Mexico

This notebook analyzes Google search trends data to define a dimension-specific index that will be used to quantify multidimensional poverty across Mexican states. 

## Methodology Implemented

The analysis leverages Google Trends data to understand public search behavior related to poverty dimensions:

1. **Keyword Selection**: Carefully selected Spanish keywords that can serve as indicators of poverty
2. **Geographic Scope**: Analysis across all 32 Mexican states using state-specific Google Trends data
3. **Aggregation Strategy**: Compute individual keyword averages and then dimension averages

## Technical Approach

We retrieved Google Trends time series through SerpApi and stored them as separate collections in MongoDB. Because each SerpApi request can include up to five keywords, each MongoDB collection holds the search-volume data for one keyword combination (up to five keywords) in one state. The series run from 2004 to the present, but the analysis uses only the target years (2020 and 2022 in our case).

For each keyword, we extract the search volume for the years of interest and calculate annual averages. Afterwards, to create a single, more informative dimension-specific measure, we aggregate the individual keywords into our 7 poverty dimensions.
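
As a minimal sketch of the annual-averaging step, the snippet below computes per-keyword yearly means with pandas from a toy document. The field names (`interest_over_time`, `timeline_data`, `timestamp`, `values`, `query`, `extracted_value`) are the ones this notebook reads from the stored data; the numbers are invented for illustration, and the pandas `groupby` is an alternative to the dictionary-based implementation used in the cells below, not a replacement for it.

```python
import pandas as pd

# toy document: only the field names follow the stored data, the values are invented
raw_data = {
    "interest_over_time": {
        "timeline_data": [
            {"timestamp": "1580515200",  # 2020-02-01 UTC
             "values": [{"query": "crisis", "extracted_value": 40},
                        {"query": "desempleo", "extracted_value": 10}]},
            {"timestamp": "1583020800",  # 2020-03-01 UTC
             "values": [{"query": "crisis", "extracted_value": 60},
                        {"query": "desempleo", "extracted_value": 30}]},
            {"timestamp": "1612137600",  # 2021-02-01 UTC, not a target year
             "values": [{"query": "crisis", "extracted_value": 99},
                        {"query": "desempleo", "extracted_value": 99}]},
        ]
    }
}

# flatten the nested document into one record per (timestamp, keyword)
records = [
    {"year": pd.to_datetime(int(entry["timestamp"]), unit="s", utc=True).year,
     "query": value["query"],
     "value": value["extracted_value"]}
    for entry in raw_data["interest_over_time"]["timeline_data"]
    for value in entry["values"]
]
df = pd.DataFrame(records)

# keep only the target years, then average per year and keyword
target_years = [2020, 2022]
annual_means = (
    df[df["year"].isin(target_years)]
    .groupby(["year", "query"])["value"]
    .mean()
)
print(annual_means)
```

In this toy input only the two 2020 entries survive the filter, so the 2021 values do not affect the averages.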

In our analysis, we consider the dimensions of poverty identified by CONEVAL (Consejo Nacional de Evaluación de la Política de Desarrollo Social), with some minor modifications. Specifically, we consider:
   - **Income**: Employment, wages, economic instability
   - **Access to Health Services**: Healthcare availability, medical infrastructure
   - **Educational Lag**: School dropout, educational access, academic delays
   - **Access to Social Security**: Labor protection, social benefits, pension systems
   - **Housing**: Living conditions, basic services, housing quality
   - **Access to Food**: Food security, nutrition, food prices
   - **Social Cohesion**: Discrimination, social exclusion, community tensions

## Expected Output

The analysis generates CSV files for each target year containing:
- Individual keyword search volumes by state
- Aggregated dimension averages by state


This measure is used as a component that will be combined with components retrieved from other data sources in order to quantify each dimension of poverty.
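
To make the expected layout concrete, here is a small sketch of one such file, written to and read back from an in-memory buffer. The two states, the two keyword columns, the single dimension column, and all values are hypothetical; the real files contain every keyword and all seven dimensions.

```python
import io
import pandas as pd

# hypothetical rows: one row per state, keyword columns first, then dimension columns
rows = [
    {"state": "MX-AGU", "crisis": 12.5, "desempleo": 8.0, "income": 10.25},
    {"state": "MX-OAX", "crisis": 20.0, "desempleo": 0.0, "income": 20.0},
]
df = pd.DataFrame(rows)

# write without the index, as the notebook does, and read the result back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

dimension_cols = ["income"]
print(loaded[["state"] + dimension_cols])
```

In the second row the `income` value equals the only non-zero keyword, which matches the notebook's convention of treating zero averages as missing when aggregating.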

In [1]:
# load necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
from mongo_wrapper.mongo_wrapper import MongoWrapper
import os
from dotenv import load_dotenv
from collections import defaultdict
load_dotenv()

True

Below we define the Mexican states using codes compatible with the Google Trends API and list the keyword combinations to retrieve from MongoDB, exactly as they were requested from SerpApi. The combinations were grouped for API efficiency, so a single request may mix keywords from different dimensions; we therefore map each keyword to its poverty dimension afterwards.

Some dimensions have fewer keywords because informative search terms were harder to identify for them. In our case, the 'access to food' and 'housing' dimensions were the most challenging to capture.

Lastly, the comprehensive keyword list ensures we process all relevant search terms while avoiding duplicates.
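
Because the request groups and the dimension mapping are maintained separately, it can help to check them against each other. The sketch below does this on a subset of the lists used in this notebook; the variable names are hypothetical and the check itself is not part of the original pipeline.

```python
# subset of the request groups and of the dimension mapping used in this notebook
keyword_groups = [
    "crisis,desempleo,pobreza",
    "agua potable, FOVISSSTE",
    "servicios en la vivienda,vivienda del gobierno",
]
dimension_map = {
    "income": ["crisis", "desempleo", "pobreza"],
    "housing": ["apoyo Infonavit", "agua potable", "FOVISSSTE"],
}

# split each group into keywords, stripping the spaces that follow some commas
retrieved = {kw.strip() for group in keyword_groups for kw in group.split(",")}
mapped = {kw for words in dimension_map.values() for kw in words}

# keywords that were requested but never enter a dimension, and the reverse
retrieved_but_unmapped = sorted(retrieved - mapped)
mapped_but_not_retrieved = sorted(mapped - retrieved)
print(retrieved_but_unmapped)
print(mapped_but_not_retrieved)
```

On this subset, `apoyo Infonavit` is reported as not retrieved only because its request group was left out of the sample; the two housing phrases in the last group are retrieved but not assigned to any dimension.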

In [2]:
# Mexican states using ISO 3166-2 codes for Google Trends API compatibility
STATES = [
    "MX-AGU", "MX-BCN", "MX-BCS", "MX-CAM", "MX-CHP", "MX-CHH",
    "MX-COA", "MX-COL", "MX-DUR", "MX-GUA", "MX-GRO", "MX-HID",
    "MX-JAL", "MX-DIF", "MX-MIC", "MX-MOR", "MX-NAY", "MX-NLE",
    "MX-OAX", "MX-PUE", "MX-QUE", "MX-ROO", "MX-SLP", "MX-SIN",
    "MX-SON", "MX-MEX", "MX-TAB", "MX-TAM", "MX-TLA", "MX-VER",
    "MX-YUC", "MX-ZAC"]

# keyword combinations used in Google Trends searches - grouped to maximize API efficiency 
keywords = [
    "crisis,desempleo,pobreza",
    "conflictos,discriminación", 
    "violencia,becas,escuela secundaria,enfermedad,centro de salud",
    "pensiones,seguro social,ayuda alimentaria,banco de alimentos,comedor comunitario",
    "comida barata,receta pobre,apoyo Infonavit,ayuda renta,renta barata",
    "servicios en la vivienda,vivienda del gobierno", 
    "agua potable, FOVISSSTE",
    "tianguis, tiendeo, PromoDescuentos"]

# map individual keywords to CONEVAL's seven poverty dimensions
POVERTY_DIMENSIONS = {
    'income': ['crisis', 'desempleo', 'pobreza'],
    'access_to_health_services': ['enfermedad', 'centro de salud'],
    'educational_lag': ['becas', 'escuela secundaria'],
    'access_to_social_security': ['pensiones', 'seguro social'],
    'access_to_food': ['banco de alimentos', 'tianguis', 'tiendeo', 'PromoDescuentos'],
    'housing': ['apoyo Infonavit', 'agua potable', 'FOVISSSTE'],
    'social_cohesion': ['violencia', 'conflictos', 'discriminación']}


# create a list of all unique keywords for processing
ALL_KEYWORDS = []
for dimension_words in POVERTY_DIMENSIONS.values():
    ALL_KEYWORDS.extend(dimension_words)
ALL_KEYWORDS = list(set(ALL_KEYWORDS))  # remove duplicates

In [3]:
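# the time series start in 2004; extract the year of each entry so that we can filter the data by year of interest
def extract_year_from_timestamp(timestamp):
    # read the epoch timestamp as UTC so the year does not depend on the machine's local timezone
    from datetime import timezone
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).year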

The function `extract_individual_word_averages` processes Google Trends timeseries data to calculate annual averages for each keyword. This is just an intermediate step, as our goal is to extract just one component per dimension, so this initial averaging serves only to compute the dimension-specific average later on.

Afterwards, the function `calculate_dimension_averages` aggregates the individual keyword averages to create dimension averages. 
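
A small worked example of the dimension aggregation, assuming a toy mapping and invented annual averages for one state and one year. It follows the same convention as `calculate_dimension_averages`: keywords with a zero average are treated as missing and left out of the mean.

```python
import numpy as np

# toy mapping and toy annual averages (values invented for illustration)
dimensions = {
    "income": ["crisis", "desempleo", "pobreza"],
    "educational_lag": ["becas", "escuela secundaria"],
}
word_averages = {"crisis": 40.0, "desempleo": 20.0, "pobreza": 0, "becas": 0}

dimension_averages = {}
for dimension, words in dimensions.items():
    # keep only keywords with a positive average; zeros and absent keywords are skipped
    values = [word_averages[w] for w in words if word_averages.get(w, 0) > 0]
    dimension_averages[dimension] = round(float(np.mean(values)), 2) if values else 0

print(dimension_averages)
```

Here `income` averages only `crisis` and `desempleo` and comes out as 30.0, while `educational_lag` has no usable keyword and falls back to 0.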

In [4]:
# for each word, compute the average value for each year
def extract_individual_word_averages(raw_data, target_years=(2020, 2022)):
    if 'interest_over_time' not in raw_data:
        return {}
    
    timeline_data = raw_data['interest_over_time']['timeline_data']
    
    # define a dictionary: {year: {word: [values]}}
    yearly_word_data = defaultdict(lambda: defaultdict(list))
    
    for entry in timeline_data:
        timestamp = entry['timestamp']
        year = extract_year_from_timestamp(timestamp)
        
        # keep only the years we are interested in 
        if year in target_years:
            for value_entry in entry['values']:
                word = value_entry['query']
                extracted_value = value_entry['extracted_value']
                yearly_word_data[year][word].append(extracted_value)
    
    # compute the average for each word in each year
    result = {}
    for year in yearly_word_data:
        result[year] = {}
        for word in yearly_word_data[year]:
            values = yearly_word_data[year][word]
            if values:
                result[year][word] = round(np.mean(values), 2)
            else:
                result[year][word] = 0
    
    return result

# after computing the averages for each word, we aggregate words that belong to the same dimension and calculate its average 
def calculate_dimension_averages(word_averages):
    dimension_averages = {}
    
    for year, words_data in word_averages.items():
        dimension_averages[year] = {}
        
        for dimension, dimension_words in POVERTY_DIMENSIONS.items():
            available_values = []
            for word in dimension_words:
                if word in words_data and words_data[word] > 0:
                    available_values.append(words_data[word])
            if available_values:
                dimension_averages[year][dimension] = round(np.mean(available_values), 2)
            else:
                dimension_averages[year][dimension] = 0
    
    return dimension_averages

### Main Data Processing and Output Generation

The function `process_all_states` implements the complete data processing pipeline for all Mexican states:
- Establishes a MongoDB connection to retrieve Google Trends data
- Aggregates data from different collections that contain overlapping keywords
- Handles missing data by skipping unavailable collections
- Calculates final averages for keywords that appear in multiple collections
- Returns a dataset with a state-year-keyword structure, ready for dimension aggregation
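
The merging of keywords that appear in more than one collection can be sketched as follows. The per-collection results are hypothetical and stand in for the output of `extract_individual_word_averages` for a single state; `statistics.mean` is used here only to keep the example free of third-party imports.

```python
from collections import defaultdict
from statistics import mean

# hypothetical per-collection results for one state: {year: {keyword: annual average}}
per_collection = [
    {2020: {"crisis": 30.0, "pobreza": 10.0}},
    {2020: {"crisis": 50.0}, 2022: {"crisis": 20.0}},
]

# collect every annual average observed for each (year, keyword) pair
merged = defaultdict(lambda: defaultdict(list))
for collection_result in per_collection:
    for year, words in collection_result.items():
        for word, avg in words.items():
            merged[year][word].append(avg)

# average the collected values so each (year, keyword) pair ends up with one number
final = {
    year: {word: round(mean(values), 2) for word, values in words.items()}
    for year, words in merged.items()
}
print(final)
```

In this toy input `crisis` appears in both collections for 2020, so its final value is the mean of the two annual averages; keywords seen only once keep their single value.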

The function `create_output_files` then transforms this processed data into structured CSV files:
- Generates separate output files for each target year (2020 and 2022 in our case), building columns for individual keyword search volumes and aggregated dimension averages. 
- Calculates dimension-specific averages for each state, and handles missing data by assigning zero values. 

In [5]:
# process all states to extract individual words data and dimension averages
def process_all_states():
    mongo_client = MongoWrapper(
        db="serpapi_5y",
        user=os.getenv("MONGO_USERNAME"),
        password=os.getenv("MONGO_PASSWORD"),
        ip=os.getenv("MONGO_IP"),
        port=os.getenv("MONGO_PORT"))
    
    all_states_data = {}
    
    for state in STATES:
        state_word_data = defaultdict(lambda: defaultdict(list))
        
        for keyword_set in keywords:
            collection_name = f'serpapi_timeseries_{state}_{keyword_set}'
            try:
                gt_data = mongo_client.get_collection_entries(collection=collection_name)
                if gt_data:
                    raw_data = gt_data[0]
                    word_averages = extract_individual_word_averages(raw_data)
                    for year in word_averages:
                        for word, avg_value in word_averages[year].items():
                            if word in ALL_KEYWORDS:
                                state_word_data[year][word].append(avg_value)
            except Exception:
                continue
        
        if state_word_data:
            final_state_data = {}
            for year in [2020, 2022]:
                final_state_data[year] = {}
                for word in ALL_KEYWORDS:
                    if word in state_word_data[year] and state_word_data[year][word]:
                        final_state_data[year][word] = round(np.mean(state_word_data[year][word]), 2)
                    else:
                        final_state_data[year][word] = 0
            all_states_data[state] = final_state_data
    return all_states_data

# create output files for each year with individual words and dimension averages
def create_output_files(all_states_data):
    target_years = [2020, 2022]
    
    # create columns for individual keyword averages and dimension averages
    for year in target_years:
        rows = []
        for state, state_data in all_states_data.items():
            if year in state_data:
                row = {'state': state}
                for word in ALL_KEYWORDS:
                    row[word] = state_data[year].get(word, 0)
                
                dimension_averages = calculate_dimension_averages({year: state_data[year]})
                if year in dimension_averages:
                    for dimension, avg_value in dimension_averages[year].items():
                        row[dimension] = avg_value
                rows.append(row)
        
        if rows:
            df = pd.DataFrame(rows)
            word_cols = [col for col in df.columns if col in ALL_KEYWORDS]
            dimension_cols = list(POVERTY_DIMENSIONS.keys())
            ordered_cols = ['state'] + sorted(word_cols) + sorted(dimension_cols)
            df = df[ordered_cols]
            filename = f'gt_{year}.csv'
            df.to_csv(filename, index=False)

In [6]:
def main():
    all_states_data = process_all_states()
    if all_states_data:
        create_output_files(all_states_data)
        return all_states_data
    else:
        return None

if __name__ == "__main__":
    results = main()

2025-06-09 11:35:56,180 INFO Connected to serpapi_5y database on 206.81.16.39
