# 📊 **Re-Implementation of "Predicting Food Crises Using News Streams"**

---

#### 🔍 **Objective**

This notebook aims to **reproduce and analyze** the methodology presented in the paper:

📄 **Paper:** [Predicting food crises using news streams](https://www.science.org/doi/10.1126/sciadv.abm3449)  
📊 **Dataset:** [Harvard Dataverse Repository](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJDWUW)  
📜 **Original Code & Methods:** [GitHub - Regression Modeling (Step 5)](https://github.com/philippzi98/food_insecurity_predictions_nlp/blob/main/Step%205%20-%20Regression%20Modelling/README.md)

---

#### 🛠 **Methodology**

This implementation follows the **key steps** outlined in the paper to predict **food insecurity crises** using a combination of:
1️⃣ **Traditional Risk Factors** (conflict, climate, food prices, etc.)  
2️⃣ **News-Based Indicators** (text feature frequencies from news articles)  
3️⃣ **Lagging & Aggregation** (temporal dependencies at district, province, and country levels)  
4️⃣ **Machine Learning Models** (Random Forest, OLS, Lasso)

---

#### 🔗 **Reference Materials**

📄 **Supplementary Material:** Available in `supplemental_material_from_paper.pdf`  
📊 **Datasets Used:**

- `time_series_with_causes_zscore_full.csv` (Main dataset with time-series features)
- `famine-country-province-district-years-CS.csv` (Food insecurity classification)
- `matching_districts.csv` (Geographical standardization)


# 📚🔧 Import Libraries

In this notebook, we will use uv to manage our Python environment and packages efficiently. uv is a modern and fast package manager that simplifies virtual environment creation, and dependency installation. We will create a virtual environment, install necessary libraries, and ensure our environment stays consistent across different setups.


In [30]:
# # Uncoment the below cell to install `uv` if you have not already. You can also install it trhiugh `pip` by running `!pip install uv` but this will be within your current python environment and not globally.

# !curl -LsSf https://astral.sh/uv/install.sh | sh
# !uv venv world-bank
# !source world-bank/bin/activate

In [31]:
# !pip install -r requirements.txt

In [32]:
import pandas as pd
import numpy as np
from IPython.display import display, Image
import os
import gdown
import zipfile
from fuzzywuzzy import fuzz
import math

In [33]:
url = "https://drive.google.com/uc?id=1YoQ1hz9RlaLr2xW3KoKCfJPyyO2PErym"
output = "data.zip"

if not os.path.exists("./data"):
    gdown.download(url, output, quiet=False) 
    zipfile.ZipFile('data.zip', 'r').extractall()
else:
    print("You already have the data downloaded and extracted")

You already have the data downloaded and extracted


## 📂 Load and Clean Data

**Understanding the Time-Series Dataset & Column Selection**

This dataset contains **district-level time-series data** on food insecurity risk factors, including:

- **📅 Temporal Information:** `year`, `month`, `year_month`
- **📍 Geographical Identifiers:** `admin_code`, `admin_name`, `province`, `country`
- **🌍 Traditional Risk Factors:** Climate (`rain_mean`, `ndvi_mean`), conflict (`acled_count`), food prices (`p_staple_food`)
- **📰 News-Based Indicators:** Proportions of news articles mentioning crisis-related keywords (`conflict_0`, `famine_0`, etc.)
- **📉 Food Insecurity Label:** `fews_ipc` (Integrated Phase Classification)

🔥 **Columns We Will Drop & Why**
✔ **Redundant Aggregations:** `_1`, `_2` columns (province & country-level values) since we will recompute aggregations from scratch anyways.  
✔ **Unnamed/Index Columns:** `Unnamed: 0` as it is unnecessary. It is just a duplicate of default index.
✔ **Unnecessary Identifiers:** If `admin_code` and `admin_name`, after matching these to `matching_districts.csv`, we can drop them.

---

> ⚠️ **NOTE:**  
> For a detailed explanation of the dataset and features, refer to the [`explore_time_series.ipynb`](./explore_time_series.ipynb) notebook.


In [34]:
time_series = pd.read_csv('./data/time_series_with_causes_zscore_full.csv')
admins = pd.read_csv('./data/famine-country-province-district-years-CS.csv')
valid_matching = pd.read_csv('./data/matching_districts.csv')

In [35]:
time_series.head(5)

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,carbon_2,mayhem_0,mayhem_1,mayhem_2,dehydrated_0,dehydrated_1,dehydrated_2,mismanagement_0,mismanagement_1,mismanagement_2
0,0,30,Afghanistan,202,Kandahar,65.709343,31.043618,2009_07,2009,7,...,1.053,0.667,-0.171,-0.833,0.173667,0.168,1.284667,-0.073,-0.427667,0.668333
1,1,33,Afghanistan,202,Kandahar,65.709343,31.043618,2009_10,2009,10,...,-0.660812,-0.63658,-0.520247,-0.782913,-0.671587,-0.612254,-0.926921,-0.510467,-0.625133,-0.452467
2,2,36,Afghanistan,202,Kandahar,65.709343,31.043618,2010_01,2010,1,...,-0.134333,1.447667,-0.844333,0.778667,-0.676,-0.689667,0.293333,0.530333,-0.471333,0.955333
3,3,39,Afghanistan,202,Kandahar,65.709343,31.043618,2010_04,2010,4,...,-0.326927,-0.594877,0.16479,-0.90521,-0.62054,0.165794,0.045794,-1.0116,-0.8106,-0.2056
4,4,42,Afghanistan,202,Kandahar,65.709343,31.043618,2010_07,2010,7,...,-1.085146,-0.709913,-0.867913,-0.770247,-0.787921,-0.974587,-0.946921,-0.611133,-0.7098,-0.6228


In [36]:
t_variant_traditional_factors = [ 'p_staple_food']
t_variant_traditional_factors = ['ndvi_mean', 'ndvi_anom', 'rain_mean', 'rain_anom', 'et_mean', 'et_anom', 
                                    'acled_count', 'acled_fatalities', 'p_staple_food']
t_invariant_traditional_factors = ['area', 'cropland_pct', 'pop', 'ruggedness_mean', 'pasture_pct']
news_factors = [name for name in time_series.columns.values if '_0' in name]

In [37]:
news_factors[0]

'land seizures_0'

In [38]:
potential_extra_cols = set(time_series.columns.values) - set(t_variant_traditional_factors) - set(t_invariant_traditional_factors) - set(news_factors)
potential_extra_cols = [col for col in potential_extra_cols if not col.endswith(('_1', '_2', '_3'))]
print("Potential extra columns", sorted(potential_extra_cols))

Potential extra columns ['Unnamed: 0', 'admin_code', 'admin_name', 'centx', 'centy', 'change_fews', 'country', 'fews_ha', 'fews_ipc', 'fews_proj_med', 'fews_proj_med_ha', 'fews_proj_near', 'fews_proj_near_ha', 'index', 'month', 'year', 'year_month']


### 🌍 Admin Level Mapping: Standardizing Geographical Identifiers

In this section, we will **map and standardize** the `admin_code` and `admin_name` fields to their corresponding **district, province, and country names**. This step is **crucial** for ensuring **consistency** across different datasets and enabling **accurate aggregations** at multiple administrative levels.

🛠 **Why is Admin Level Mapping Important?**
✅ Different datasets may use **slightly different spellings or formats** for district names.  
✅ Some district names might be **missing or misspelled**, requiring standardization.  
✅ We need to **match and align** district names across various sources before aggregating at **province and country levels**.  
✅ Proper mapping allows us to **merge datasets correctly** without losing information.  

📌 **Steps in Admin Mapping**
1️⃣ **Load the `matching_districts.csv` file**, which provides the mapping between different district name variations.  
2️⃣ **Identify missing or unmatched `admin_name` values** and find their closest matches using fuzzy matching techniques.  
3️⃣ **Ensure that each `admin_code` uniquely maps to one `district`, `province`, and `country`.**  
4️⃣ **Replace inconsistent names** in the dataset with their standardized versions.  
5️⃣ **Aggregate data at the `province` and `country` levels** after ensuring all districts are correctly mapped.  


In [39]:
len(admins.country.unique())

39

In [40]:
admins.columns.values

array(['Unnamed: 0', 'country', 'district', 'year', 'month', 'CS',
       'province'], dtype=object)

In [41]:
admin_names = time_series['admin_name'].unique()
districts = admins['district'].unique()
provinces = admins['province'].unique()
countries = admins['country'].unique()

In [42]:
print (len(admin_names), len(districts), len(provinces), len(countries))
print (len(set(admin_names).difference(districts)))
missing_admin_names = set(admin_names).difference(districts)
print (len(missing_admin_names.difference(provinces)))
missing_admin_names = missing_admin_names.difference(provinces)

1142 4113 474 39
369
230


### Fuzzy String Matching for Missing Names

The function uses **fuzzy string matching** to find the best approximate matches for missing administrative names (e.g., districts and provinces). 

- Finds the **best matching district/province** for each missing name.
- Uses **fuzzy string matching** to calculate the similarity between missing names and known names.
- Returns a dictionary that maps each missing name to its closest match.


In [43]:
def find_matching(missing, names):
    matching_districts = {}
    for m in missing:
        max_overlap = 0
        nearest_d = None
        for d in names:
            d = str(d)
            dist = fuzz.partial_ratio(m, d)
            if dist > max_overlap:
                max_overlap = dist
                nearest_d = d
        matching_districts[m] = nearest_d
    return matching_districts


matching = find_matching(missing_admin_names, districts)
matching_p = find_matching(missing_admin_names, provinces)

# manually verify matching and update
for k in matching.keys():
    print (k, matching[k], matching_p[k])


Amran `Amran `Amran
Croix-Des-Bouquets Bo Ouest
Kajo-keji Kajo-Keji Kano
Meru South Meru Meru
Ville de Maradi Maridi Mara
Bankilaré Bankilare Sila
Kuria Kuria East Ituri
Lughaye Lughaya Bay
Marakwet Marakwet West Elgeyo-Marakwet
Hirat Wag Himra Hiiraan
Wardi Hawar Wadi Hawar Bari
Barh El Gazel Sud Barh el Gazel Sud Sud
Shabelle Shebelle Middle Shabelle
Al Fushqa Al Husha' Arusha
Gothèye Gotheye Gao
Sa'dah Sa`dah Sa`dah
Nandi North Nnewi North Nandi
Tillabéri Tillaberi Commune Tillaberi
Al Wazi'iyah Al Wazi`iyah Siaya
Maragua Maragwa Mara
North al Gazera Ganze North
Butembo City of Butembo Kemo
Bukavu City of Bukavu Busia
Gonave La Gonave Gao
Mangwe (South) Mangwe Southern
Sharq al Gazera Ganze Gaza
Belet Weyne Bale Benue
Gourma-Rharous Gourma Ghor
Um Badda Um Keddada Bay
North Shewa(R4) North Shewa North
Anse-D'Ainault `Ain Abia
Kolwezi.1 City of Kolwezi Kwale
Ad Damer Same Dhamar
Port-Salut Port Salut Salamat
Shar'ab As Salam Shar`ab As Salam Mara
Bura' Bura Guera
Lafon Lopa/Lafon Lac

### Encoding Decoding

`to_ascii_escaped(s)`: Converts a Unicode string to an ASCII-safe representation using **unicode-escape**.

`from_ascii_escaped(escaped)`: Converts the escaped ASCII string back into its original Unicode form.

In [44]:
def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    # Using 'unicode-escape' encoding produces a bytes object,
    # then decode it to get an ASCII string.
    return s.encode('unicode-escape').decode('ascii')

def from_ascii_escaped(escaped):
    """
    Convert the ASCII-escaped string back to the original Unicode string.
    """
    # Encode the ASCII string to bytes, then decode using 'unicode-escape'
    return escaped.encode('ascii').decode('unicode-escape')


### Finding the Province for a Given District or Province

`find_province(x)`, finds the **province** corresponding to a given administrative name. It accounts for:
- **Direct Lookups** (Exact match in known district/province lists)
- **Fuzzy Matching** (Using ASCII-safe transformation for inconsistent text encoding)
- **Validation Against a Predefined Mapping (`valid_matching`)**

In [45]:
# Define matched globally
matched = valid_matching['missing'].unique()

def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    return s.encode('unicode-escape').decode('ascii')

def find_province(x):
    try:
        # Ensure x is a Unicode string.
        if isinstance(x, bytes):
            x = x.decode('utf-8')
        
        # Direct lookup in districts or provinces.
        if x in districts:
            return admins[admins['district'] == x]['province'].values[0]
        elif x in provinces:
            return x

        # Convert x to an ASCII-escaped version.
        escaped_x = to_ascii_escaped(x)
        
        # Check if the escaped version is in matched.
        if escaped_x in matched:
            v = valid_matching[valid_matching['missing'] == escaped_x]
            if v['match'].values[0] == 'district':
                x2 = v['district'].values[0]
                return admins[admins['district'] == x2]['province'].values[0]
            elif v['match'].values[0] == 'province':
                return v['province'].values[0]
        
        # If no conditions are met, raise an exception.
        raise Exception("No matching province found")
    except Exception as e:
        raise Exception("Province not found for: {} ({})".format(x, e))


### Handling Admin Names with Accented Characters and Mapping to Provinces

Maps `admin_names` to provinces using the `find_province(a)` function.  
If a **direct lookup fails**, it tries to handle cases where the **admin name contains accented characters** (`é`, `è`, `ô`) ->  (encoding decoding issues resolved through directly replacing these with 'e' or 'o', leads to finding a valid match). 

In [46]:
admin_to_province = {}
for a in admin_names:
    try:
        admin_to_province[a] = find_province(a)
    except Exception as e:
        # Print the admin name that caused an error
        print("Error with:", a)
        # Check if a contains accented characters "é" or "è"
        if 'é' in a or 'è' in a or 'ô' in a:
            a_modified = a.replace('é', 'e').replace('è', 'e').replace('ô', 'o')
            # Check if the modified name is in districts
            if a_modified in districts:
                # Use the modified name to look up the province from admins
                try:
                    province = admins[admins['district'] == a_modified]['province'].values[0]
                    admin_to_province[a] = province
                    print(f"Replaced '{a}' with '{a_modified}', found province: {province}")
                except Exception as ex:
                    print(f"Modified name '{a_modified}' not found in admins: {ex}")
            else:
                print(f"Modified name '{a_modified}' not in districts.")
        else:
            print(f"No accented e found in '{a}'.")


Error with: Mangalmé
Replaced 'Mangalmé' with 'Mangalme', found province: Guera
Error with: La Pendé
Replaced 'La Pendé' with 'La Pende', found province: Logone Oriental
Error with: La Nya Pendé
Replaced 'La Nya Pendé' with 'La Nya Pende', found province: Logone Oriental
Error with: Lac-Léré
Replaced 'Lac-Léré' with 'Lac-Lere', found province: Mayo-Kebbi Ouest
Error with: Barh-Kôh
Replaced 'Barh-Kôh' with 'Barh-Koh', found province: Moyen-Chari
Error with: Aguié
Replaced 'Aguié' with 'Aguie', found province: Maradi
Error with: Bankilaré
Replaced 'Bankilaré' with 'Bankilare', found province: Tillaberi
Error with: Filingué
Replaced 'Filingué' with 'Filingue', found province: Tillaberi
Error with: Gothèye
Replaced 'Gothèye' with 'Gotheye', found province: Tillaberi
Error with: Gouré
Replaced 'Gouré' with 'Goure', found province: Zinder
Error with: Illéla
Replaced 'Illéla' with 'Illela', found province: Sokoto
Error with: Kantché
Replaced 'Kantché' with 'Kantche', found province: Zinder
Er

### Mapping Administrative Names to Provinces in time_series

Maps `admin_name` to their respective **provinces** using a precomputed dictionary - >`admin_to_province` in `time_series`.


In [47]:
time_series['province'] = time_series['admin_name'].apply(
    lambda x: admin_to_province[x] if x in admin_to_province else admin_to_province.get(x.replace('ô', 'o'))
)


# ⏳ Time Lagging & Feature Engineering

#### 📅 **Why Use Lagging?**

To predict food insecurity **for a given quarter**, we use:

- **6 months of historical values** for traditional & news-based features.
- **Province & country-level aggregations** to capture broader shocks.
- **6 quarters of lagged IPC phase values** to model temporal dependencies.

#### ⚡ **Optimized Lagging Approach**

To improve computational efficiency, we:
✔ Use `groupby()` for **fast province & country-level aggregations**.  
✔ Merge lagged data via `merge()` instead of slow `.apply()`.  
✔ Only keep **past data** to ensure no data leakage.


In [48]:
def add_time_lagged(features, start=3, end=9, diff=1, agg=True, time_series=time_series):
    levels = ['', '_province', '_country'] if agg else ['']
    
    # Work on a copy to avoid modifying the original during processing
    working_df = time_series.copy()
    
    # Precompute a mapping for each feature (with its suffix) for fast lookups. For each row, its lookup key will be: admin_code + '_' + year_month.
    lookup_maps = {}  # dict mapping f_s -> mapping dict
    for suffix in levels:
        for f in features:
            f_s = f + suffix
            # Build a mapping from key to first occurrence of f_s value.
            # Key: admin_code + '_' + year_month
            keys = working_df['admin_code'].astype(str) + '_' + working_df['year_month'].astype(str)
            # If there are duplicates, the first occurrence will be used.
            mapping = dict(zip(keys, working_df[f_s]))
            lookup_maps[f_s] = mapping

    # Prepare list to collect all new columns (as Series)
    new_cols = {}
    
    # Process each feature and lag combination
    for suffix in levels:
        for f in features:
            f_s = f + suffix
            mapping = lookup_maps[f_s]
            for t in range(start, end, diff):
                col_name = f"{f_s}_{t}"
                if col_name in time_series.columns:
                    continue
                
                # Compute lagged month and lagged year (vectorized)
                month = working_df['month']
                year = working_df['year']
                l_month = ((month - 1 - t) % 12) + 1
                l_year = np.where(month - t <= 0, year - 1, year) # If (month - t) is less than or equal to 0 (i.e., you’ve gone into the previous year), then l_year is year - 1; otherwise, it remains year.
                
                # Build the reference key: admin_code + '_' + "{l_year}_{l_month}"
                ref_key = working_df['admin_code'].astype(str) + '_' + \
                          l_year.astype(str) + '_' + \
                          l_month.astype(str)
                
                # Map the reference key to the lagged feature values using our precomputed mapping. Default to current value for working_df[f_s] if no match is found.
                lagged_values = ref_key.map(mapping)
                lagged_values = lagged_values.fillna(working_df[f_s])
                
                # Store the new column in our dictionary (preserving the original index)
                new_cols[col_name] = lagged_values
                
    # If any new columns were created, add them to the original time_series DataFrame.
    if new_cols:
        new_cols_df = pd.DataFrame(new_cols, index=working_df.index)
        time_series = pd.concat([time_series, new_cols_df], axis=1)
        
    return time_series


# Province & Country-Level Aggregation

This function aggregates feature values at the province and country levels to capture regional trends, aiding in food insecurity prediction. The process includes:

- **Grouping by year_month and level:** Data is grouped by year_month and the specified level (province or country) to calculate the mean of features, reflecting regional trends over time.

- **Applying transformations efficiently:** Instead of merging aggregated data, `transform("mean")` is used to directly assign the computed mean to each row, avoiding unnecessary joins and improving performance.  

#### ⚡ **Efficiency Gains**

- **Fast Aggregation**: Uses `groupby()` for efficient aggregation.
- **Avoids Costly Joins**: Eliminates the need for `merge()` by using `transform()` instead, reducing computational overhead.  
- **Memory Efficiency**: Converts the `level` column to a categorical type to reduce memory usage.

This approach ensures faster processing while maintaining the quality of aggregated features.


In [49]:
def add_agg_factors(features, level):
    global time_series  

    # Convert 'level' column to categorical for performance
    time_series[level] = time_series[level].astype('category')
    
    # Compute grouped mean values for the given features
    # TODO : Explain these arguments
    grouped_df = time_series.groupby(['year_month', level], observed=True, sort=False)[features].transform("mean")

    # Rename columns to include level
    grouped_df = grouped_df.rename(columns={f: f"{f}_{level}" for f in features})

    # Use pd.concat() to add all columns at once, avoiding fragmentation
    time_series = pd.concat([time_series, grouped_df], axis=1)

    return time_series


In [None]:
# Aggregating news factors
time_series = add_agg_factors(news_factors, level='country')
time_series = add_agg_factors(news_factors, level='province')

# Aggregating variant traditional factors
time_series = add_agg_factors(t_variant_traditional_factors, level='province')
time_series = add_agg_factors(t_variant_traditional_factors, level='country')

# Aggregating invariant traditional factors
time_series = add_agg_factors(t_invariant_traditional_factors, level='province')
time_series = add_agg_factors(t_invariant_traditional_factors, level='country')

# Drop null values
time_series = time_series.dropna()
time_series.head()

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,area_province,cropland_pct_province,pop_province,ruggedness_mean_province,pasture_pct_province,area_country,cropland_pct_country,pop_country,ruggedness_mean_country,pasture_pct_country
11,11,63,Afghanistan,202,Kandahar,65.709343,31.043618,2012_04,2012,4,...,54174.53381,1.417796,1379956.0,101047.1587,16.246279,18883.616188,7.335028,874755.441176,316314.116188,49.145729
12,12,66,Afghanistan,202,Kandahar,65.709343,31.043618,2012_07,2012,7,...,54174.53381,1.417796,1379956.0,101047.1587,16.246279,18883.616188,7.335028,874755.441176,316314.116188,49.145729
13,13,69,Afghanistan,202,Kandahar,65.709343,31.043618,2012_10,2012,10,...,54174.53381,1.417796,1379956.0,101047.1587,16.246279,18883.616188,7.335028,874755.441176,316314.116188,49.145729
14,14,72,Afghanistan,202,Kandahar,65.709343,31.043618,2013_01,2013,1,...,54174.53381,1.417796,1429508.0,101047.1587,16.246279,18883.616188,7.335028,901519.941176,316314.116188,49.145729
15,15,75,Afghanistan,202,Kandahar,65.709343,31.043618,2013_04,2013,4,...,54174.53381,1.417796,1429508.0,101047.1587,16.246279,18883.616188,7.335028,901519.941176,316314.116188,49.145729


# Add time lagged features


In [None]:
time_series = add_time_lagged(t_variant_traditional_factors, time_series=time_series)
time_series = add_time_lagged(news_factors, time_series=time_series)
time_series = add_time_lagged(['fews_ipc'], end=21, diff=3, agg=False, time_series=time_series)
time_series = add_time_lagged(['fews_proj_near'], start=3, end=4, diff=1, agg=False, time_series=time_series)

# Drop null values again
time_series = time_series.dropna()
time_series.shape


(28141, 4070)

In [53]:
time_series.head()

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,mismanagement_0_country_6,mismanagement_0_country_7,mismanagement_0_country_8,fews_ipc_3,fews_ipc_6,fews_ipc_9,fews_ipc_12,fews_ipc_15,fews_ipc_18,fews_proj_near_3
0,11,63,Afghanistan,202,Kandahar,65.709343,31.043618,2012_04,2012,4,...,0.450139,0.450139,0.450139,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,12,66,Afghanistan,202,Kandahar,65.709343,31.043618,2012_07,2012,7,...,0.398922,0.398922,0.398922,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,13,69,Afghanistan,202,Kandahar,65.709343,31.043618,2012_10,2012,10,...,0.2515,0.2515,0.2515,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,14,72,Afghanistan,202,Kandahar,65.709343,31.043618,2013_01,2013,1,...,0.362775,0.362775,0.362775,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,15,75,Afghanistan,202,Kandahar,65.709343,31.043618,2013_04,2013,4,...,0.2515,0.151222,0.151222,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# Run the Model


In [None]:
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from cuml.ensemble import RandomForestRegressor as cuRF
from cuml.linear_model import LinearRegression as cuLR
from cuml.linear_model import Lasso as cuLasso
import cupy as cp
import cudf

test_splits = [
    ((2012,7), (2013, 7)), 
    ((2013,7), (2014, 7)), 
    ((2014,7), (2015, 7)), 
    ((2016,7), (2017, 7)), 
]

train_splits = [ 
    ((2011,7), (2013,7)),
    ((2012,7), (2014,1)),
    ((2013,7), (2015,10)),
    ((2015,7), (2017,2)),
]

models = {
    'RF': cuRF(),
}

def get_agg_lagged_features(factors):
    return [f"{f}_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_province_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_country_{t}" for f in factors for t in range(3, 9)]

features = {
    'traditional': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors],
    
    'news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(news_factors)],
    
    'traditional+news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors + 
        get_agg_lagged_features(news_factors)],   
}

labels_df = time_series[['fews_ipc', 'year', 'month']]

def get_time_split(df, start, end):
    return df[
        (((df['year'] > start[0])) | ((df['year'] == start[0]) & (df['month'] >= start[1]))) &
        (((df['year'] < end[0])) | ((df['year'] == end[0]) & (df['month'] <= end[1])))
    ]

thresholds = {'traditional': (2.236, 3.125), 
              'news': (1.907, 2.712), 
              'traditional+news': (2.105, 3.314),
             }

def train_and_evaluate(train, test, f, D):
    results = []
    
    X_train = get_time_split(D, train[0], train[1]).drop(columns=['year', 'month']).to_numpy()# not sure how okay it is to do fillna. When me and Bilal were running this we were getting the error that cannot run the model on NaN values. First we dropped na but this was causing the shape of the X_train to be different from the y_train. So we decided to fillna with 0. - aysha & bilal
    
    y_train = get_time_split(labels_df, train[0], train[1]).drop(columns=['year', 'month']).to_numpy().ravel()

    nan_mask = np.isnan(X_train).any(axis=1)
    X_train = X_train[~nan_mask]
    y_train = y_train[~nan_mask]
    
    X_test = get_time_split(D, test[0], test[1]).drop(columns=['year', 'month']).to_numpy()
    y_test = get_time_split(labels_df, test[0], test[1]).drop(columns=['year', 'month']).to_numpy().ravel()
    nan_mask_test = np.isnan(X_test).any(axis=1)
    X_test = X_test[~nan_mask_test]
    y_test = y_test[~nan_mask_test]
    
    # check if X train is empty:
    print(f"Rows in X_train: {X_train.shape[0]} \nRows in X_test: {X_test.shape[0]}")
    if X_train.shape[0] <= 0:
        # print(f"X train is empty for {f} from {train}")
        return results
    if X_test.shape[0] <= 0:
        # print(f"X test is empty for {f} from {test}")
        return results
       
    # convert y_test into binary classification (1 if inside threshold, else 0)
    lower, upper = thresholds[f]
    y_test_binary = np.where((y_test >= lower) & (y_test <= upper), 1, 0)
    
    X_train = cp.asarray(X_train, dtype=cp.float32)
    y_train = cp.asarray(y_train, dtype=cp.float32)
    X_test = cp.asarray(X_test, dtype=cp.float32)
    y_test = cp.asarray(y_test, dtype=cp.float32)

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        rmse = np.sqrt(np.mean((y_test - preds) ** 2))

        stderr = np.std(y_test - preds) / np.sqrt(len(y_test))
        upper_bound = np.sqrt(rmse**2 + 1.96 * stderr)
        lower_bound = np.sqrt(rmse**2 - 1.96 * stderr)
        # precision, recall, _ = precision_recall_curve(y_test_binary, preds)
        # aucpr = auc(recall, precision)

        results.append({
            'method': name, 'split': test, 'features': f, 
            'rmse': rmse,
            # 'rmse': rmse, 'lower_bound': lower_bound, 'upper_bound': upper_bound,
            # 'aucpr': aucpr
        })

        print(f"Method: {name}, Split: {test}, Features: {f}, RMSE: {rmse:.4f} [{lower_bound:.4f}, {upper_bound:.4f}]")

        
        # completely removed the part where they were doing country-wise evaluation. Do not see point - aysha
    
    return results

# Run model evaluation
all_results = []
for train, test in zip(train_splits, test_splits):
    for f, D in features.items():
        try:
            result = train_and_evaluate(train, test, f, D)
            all_results.append(result)
        except Exception as e:
            print(f"Error: {e} on {train} & {test}")
            continue


fig_3a = pd.DataFrame([res for sublist in all_results for res in sublist])
# fig_3a.to_csv('fig_3a.csv', index=False)


Rows in X_train: 6110 
Rows in X_test: 5127
Method: RF, Split: ((2012, 7), (2013, 7)), Features: traditional, RMSE: 0.0004 [nan, 0.0033]
Rows in X_train: 6110 
Rows in X_test: 5127
Method: RF, Split: ((2012, 7), (2013, 7)), Features: news, RMSE: 0.0020 [nan, 0.0077]
Rows in X_train: 6110 
Rows in X_test: 5127
Method: RF, Split: ((2012, 7), (2013, 7)), Features: traditional+news, RMSE: 0.0020 [nan, 0.0077]
Rows in X_train: 7131 
Rows in X_test: 4976
Method: RF, Split: ((2013, 7), (2014, 7)), Features: traditional, RMSE: 0.0899 [0.0748, 0.1028]
Rows in X_train: 7131 
Rows in X_test: 4976
Method: RF, Split: ((2013, 7), (2014, 7)), Features: news, RMSE: 0.0635 [0.0476, 0.0761]
Rows in X_train: 7131 
Rows in X_test: 4976
Method: RF, Split: ((2013, 7), (2014, 7)), Features: traditional+news, RMSE: 0.0636 [0.0478, 0.0762]
Rows in X_train: 10361 
Rows in X_test: 5485
Method: RF, Split: ((2014, 7), (2015, 7)), Features: traditional, RMSE: 0.0061 [nan, 0.0141]
Rows in X_train: 10361 
Rows in X_t

## Data Splits and Train/Test Set Sizes

### 1. Split: ((2012, 7), (2013, 7))
- Rows in Train set: 6110
- Rows in Test set: 5127

### 2. Split: ((2013, 7), (2014, 7))
- Rows in Train set: 7131
- Rows in Test set: 4976

### 3. Split: ((2014, 7), (2015, 7))
- Rows in Train set: 10361
- Rows in Test set: 5485

### 4. Split: ((2016, 7), (2017, 7))
- Rows in Train set: 6472
- Rows in Test set: 3335

### Average sizes of Train and Test sets
- Rows in Train set: 7516
- Rows in Test set: 4730

In [55]:
fig_3a.groupby(by=['method', 'features'])['rmse'].mean()

method  features        
RF      news                0.021064
        traditional         0.025874
        traditional+news    0.021106
Name: rmse, dtype: float64