# 📊 **Re-Implementation of "Predicting Food Crises Using News Streams"**

---

#### 🔍 **Objective**

This notebook aims to **reproduce and analyze** the methodology presented in the paper:

📄 **Paper:** [Predicting food crises using news streams](https://www.science.org/doi/10.1126/sciadv.abm3449)  
📊 **Dataset:** [Harvard Dataverse Repository](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJDWUW)  
📜 **Original Code & Methods:** [GitHub - Regression Modeling (Step 5)](https://github.com/philippzi98/food_insecurity_predictions_nlp/blob/main/Step%205%20-%20Regression%20Modelling/README.md)

---

#### 🛠 **Methodology**

This implementation follows the **key steps** outlined in the paper to predict **food insecurity crises** using a combination of:
1️⃣ **Traditional Risk Factors** (conflict, climate, food prices, etc.)  
2️⃣ **News-Based Indicators** (text feature frequencies from news articles)  
3️⃣ **Lagging & Aggregation** (temporal dependencies at district, province, and country levels)  
4️⃣ **Machine Learning Models** (Random Forest, OLS, Lasso)

---

#### 🔗 **Reference Materials**

📄 **Supplementary Material:** Available in `supplemental_material_from_paper.pdf`  
📊 **Datasets Used:**

- `time_series_with_causes_zscore_full.csv` (Main dataset with time-series features)
- `famine-country-province-district-years-CS.csv` (Food insecurity classification)
- `matching_districts.csv` (Geographical standardization)


# 📚🔧 Import Libraries

In this notebook, we will use uv to manage our Python environment and packages efficiently. uv is a modern and fast package manager that simplifies virtual environment creation, and dependency installation. We will create a virtual environment, install necessary libraries, and ensure our environment stays consistent across different setups.


In [None]:
## Uncoment the below cell to install `uv` if you have not already. You can also install it trhiugh `pip` by running `!pip install uv` but this will be within your current python environment and not globally.

# !curl -LsSf https://astral.sh/uv/install.sh | sh
# !uv venv world-bank
# !source world-bank/bin/activate

In [None]:
%uv pip install -r requirements.txt

In [1]:
import pandas as pd
import numpy as np
import folium
from IPython.display import display, Image
import os
import gdown
import zipfile

In [2]:
url = "https://drive.google.com/uc?id=1YoQ1hz9RlaLr2xW3KoKCfJPyyO2PErym"
output = "data.zip"

if not os.path.exists("./data"):
    gdown.download(url, output, quiet=False) 
    zipfile.ZipFile('data.zip', 'r').extractall()
else:
    print("You already have the data downloaded and extracted")

You already have the data downloaded and extracted


## 📂 Load and Clean Data

**Understanding the Time-Series Dataset & Column Selection**

This dataset contains **district-level time-series data** on food insecurity risk factors, including:

- **📅 Temporal Information:** `year`, `month`, `year_month`
- **📍 Geographical Identifiers:** `admin_code`, `admin_name`, `province`, `country`
- **🌍 Traditional Risk Factors:** Climate (`rain_mean`, `ndvi_mean`), conflict (`acled_count`), food prices (`p_staple_food`)
- **📰 News-Based Indicators:** Proportions of news articles mentioning crisis-related keywords (`conflict_0`, `famine_0`, etc.)
- **📉 Food Insecurity Label:** `fews_ipc` (Integrated Phase Classification)

🔥 **Columns We Will Drop & Why**
✔ **Redundant Aggregations:** `_1`, `_2` columns (province & country-level values) since we will recompute aggregations from scratch anyways.  
✔ **Unnamed/Index Columns:** `Unnamed: 0` as it is unnecessary. It is just a duplicate of default index.
✔ **Unnecessary Identifiers:** If `admin_code` and `admin_name`, after matching these to `matching_districts.csv`, we can drop them.

---

> ⚠️ **NOTE:**  
> For a detailed explanation of the dataset and features, refer to the [`explore_time_series.ipynb`](./explore_time_series.ipynb) notebook.


In [2]:
time_series = pd.read_csv('./data/time_series_with_causes_zscore_full.csv')
admins = pd.read_csv('./data/famine-country-province-district-years-CS.csv')
valid_matching = pd.read_csv('./data/matching_districts.csv')

In [3]:
sorted(time_series.columns.values)

['Unnamed: 0',
 'abnormally low rainfall_0',
 'abnormally low rainfall_1',
 'abnormally low rainfall_2',
 'acled_count',
 'acled_fatalities',
 'acute hunger_0',
 'acute hunger_1',
 'acute hunger_2',
 'admin_code',
 'admin_name',
 'aid appeal_0',
 'aid appeal_1',
 'aid appeal_2',
 'aid workers died_0',
 'aid workers died_1',
 'aid workers died_2',
 'air attack_0',
 'air attack_1',
 'air attack_2',
 'alarming level_0',
 'alarming level_1',
 'alarming level_2',
 'anti-western policies_0',
 'anti-western policies_1',
 'anti-western policies_2',
 'apathy_0',
 'apathy_1',
 'apathy_2',
 'area',
 'asylum seekers_0',
 'asylum seekers_1',
 'asylum seekers_2',
 'authoritarian_0',
 'authoritarian_1',
 'authoritarian_2',
 'bad harvests_0',
 'bad harvests_1',
 'bad harvests_2',
 'blockade_0',
 'blockade_1',
 'blockade_2',
 'bombing campaign_0',
 'bombing campaign_1',
 'bombing campaign_2',
 'brain drain_0',
 'brain drain_1',
 'brain drain_2',
 'brutal government_0',
 'brutal government_1',
 'brutal 

In [6]:
time_series.head(5)

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,carbon_2,mayhem_0,mayhem_1,mayhem_2,dehydrated_0,dehydrated_1,dehydrated_2,mismanagement_0,mismanagement_1,mismanagement_2
0,0,30,Afghanistan,202,Kandahar,65.709343,31.043618,2009_07,2009,7,...,1.053,0.667,-0.171,-0.833,0.173667,0.168,1.284667,-0.073,-0.427667,0.668333
1,1,33,Afghanistan,202,Kandahar,65.709343,31.043618,2009_10,2009,10,...,-0.660812,-0.63658,-0.520247,-0.782913,-0.671587,-0.612254,-0.926921,-0.510467,-0.625133,-0.452467
2,2,36,Afghanistan,202,Kandahar,65.709343,31.043618,2010_01,2010,1,...,-0.134333,1.447667,-0.844333,0.778667,-0.676,-0.689667,0.293333,0.530333,-0.471333,0.955333
3,3,39,Afghanistan,202,Kandahar,65.709343,31.043618,2010_04,2010,4,...,-0.326927,-0.594877,0.16479,-0.90521,-0.62054,0.165794,0.045794,-1.0116,-0.8106,-0.2056
4,4,42,Afghanistan,202,Kandahar,65.709343,31.043618,2010_07,2010,7,...,-1.085146,-0.709913,-0.867913,-0.770247,-0.787921,-0.974587,-0.946921,-0.611133,-0.7098,-0.6228


In [4]:
t_variant_traditional_factors = ['ndvi_mean', 'ndvi_anom', 'rain_mean', 'rain_anom', 'et_mean', 'et_anom', 
                                    'acled_count', 'acled_fatalities', 'p_staple_food']
t_invariant_traditional_factors = ['area', 'cropland_pct', 'pop', 'ruggedness_mean', 'pasture_pct']

news_factors = [name for name in time_series.columns.values if '_0' in name]

In [5]:
news_factors[0]

'land seizures_0'

In [14]:
time_series.drop(columns=["Unnamed: 0", "year_month", "centx", "centy", 'change_fews', 'fews_ha','fews_proj_med', 'fews_proj_med_ha', 'fews_proj_near_ha'], inplace=True)

In [15]:
potential_extra_cols = set(time_series.columns.values) - set(t_variant_traditional_factors) - set(t_invariant_traditional_factors) - set(news_factors)
potential_extra_cols = [col for col in potential_extra_cols if not col.endswith(('_1', '_2', '_3'))]
print(potential_extra_cols)

['country', 'admin_code', 'fews_ipc', 'admin_name', 'index', 'year', 'month', 'fews_proj_near']


# ⏳ Time Lagging & Feature Engineering

#### 📅 **Why Use Lagging?**

To predict food insecurity **for a given quarter**, we use:

- **6 months of historical values** for traditional & news-based features.
- **Province & country-level aggregations** to capture broader shocks.
- **6 quarters of lagged IPC phase values** to model temporal dependencies.

#### ⚡ **Optimized Lagging Approach**

To improve computational efficiency, we:
✔ Use `groupby()` for **fast province & country-level aggregations**.  
✔ Merge lagged data via `merge()` instead of slow `.apply()`.  
✔ Only keep **past data** to ensure no data leakage.


In [5]:
def get_lagged(x, f, t):
    admin_code = x['admin_code']
    year = x['year']
    month = x['month']
    l_month = ((month-1-t)%12)+1
    l_year = year
    if month-t<=0:
        l_year -= 1
    ts=time_series[time_series['admin_code']==admin_code]
    lagged_year_month = '{}_{}'.format(l_year, l_month)
    if lagged_year_month in ts['year_month'].values:
        ts = ts[ts['year_month']==lagged_year_month]
        return ts[f].values[0]
    else:
        return x[f]
    

In [6]:
def add_time_lagged(features, start=3, end=9, diff=1, agg=True):
    if agg:
        levels = ['', '_province', '_country']
    else:
        levels = ['']
    for suffix in levels:
        for f in features:
            f_s = f+suffix
            for t in range(start,end,diff):
                if '{}_{}'.format(f_s,t) in time_series:
                    continue
                time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)

You already have the data downloaded and extracted


# Get Admin level mapping


In [None]:
# Adjust filepath (file also in GitHub repository)

In [None]:
len(admins.country.unique())

39

In [None]:
admin_names = time_series['admin_name'].unique()
districts = admins['district'].unique()
provinces = admins['province'].unique()
countries = admins['country'].unique()

In [None]:
print (len(admin_names), len(districts), len(provinces), len(countries))
print (len(set(admin_names).difference(districts)))
missing_admin_names = set(admin_names).difference(districts)
print (len(missing_admin_names.difference(provinces)))
missing_admin_names = missing_admin_names.difference(provinces)

1142 4113 474 39
369
230


In [None]:
import editdistance
from fuzzywuzzy import fuzz
def find_matching(missing, names):
    matching_districts = {}
    for m in missing:
        max_overlap = 0
        nearest_d = None
        for d in names:
            d = str(d)
            dist = fuzz.partial_ratio(m, d)
            if dist > max_overlap:
                max_overlap = dist
                nearest_d = d
        matching_districts[m] = nearest_d
    return matching_districts


matching = find_matching(missing_admin_names, districts)
matching_p = find_matching(missing_admin_names, provinces)
#manually verify matching and update
for k in matching.keys():
    print (k, matching[k], matching_p[k])


b'Baydhaba' Baydhabo Bay
b'Port De Paix' Port de Paix Pwani
b'Hwange' Kang Kwango
b'Jebrat al Sheikh' Jebrat El Sheikh Herat
b'Adan Yabaal' Aadan Yabaal `Adan
b'Cayes' Kayes Kayes
b'Chiengi' Chienge Singida
b'Kibale' Kibaale Kidal
b'Tillab\\xe9ri' Tillia Sila
b'Ad Douiem' El Douiem Ad Dali`
b'Sheikh' Sheekh Sahel
b'South Gonder' South Gondar Sud
b'Abu Hamad' Abu Hamed Hilmand
b'Um Kadada' Um Keddada Kandahar
b'Koibatek' Kibra Kogi
b'Muranga' Mkuranga Murang'a
b'Ville de Niamey' Ndia Niamey
b'Gour\\xe9' Goure Ghor
b'North Western Tigray' Lira Western
b'Balleyara' Bale Mara
b'Agnuak' Awgu Ouaka
b"Bura'" Bura Busia
b"Amanat Al 'Asimah" Arsi Amanat Al `Asimah
b'Ceca La Source' Cerca La Source Sud
b'Zvishavane' Zvishavane Urban Kinshasa
b'Teso' Tewor Tshopo
b'Ad Damer' Same Dhamar
b'Mayo-Lemi' Mayo-Lemie Bay
b'Barh El Gazel Sud' Barh el Gazel Sud Sud
b'Marakwet' Marawi Mara
b'Western Tigray' Lira Western
b"Anse-D'Ainault" `Ans Abia
b"Sa'dah" Sa`dah Sa`dah
b'Kantch\\xe9' Kantche Kano
b'UMP' 

In [None]:
# Adjust filepath (file also in GitHub repository)
# After validating the matches, the names are logged in this csv file

In [None]:
def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    # Using 'unicode-escape' encoding produces a bytes object,
    # then decode it to get an ASCII string.
    return s.encode('unicode-escape').decode('ascii')

def from_ascii_escaped(escaped):
    """
    Convert the ASCII-escaped string back to the original Unicode string.
    """
    # Encode the ASCII string to bytes, then decode using 'unicode-escape'
    return escaped.encode('ascii').decode('unicode-escape')

# Test the round-trip on each unique value from valid_matching['missing']:
for m in valid_matching['missing'].unique():
    # Ensure m is a Unicode string
    original = m.decode('utf-8') if isinstance(m, bytes) else m
    # Convert to an ASCII-escaped representation
    encoded = to_ascii_escaped(original)
    # Convert back from the ASCII-escaped representation to Unicode
    decoded = from_ascii_escaped(encoded)
    
    # Print the results
    print("Original: ", original)
    print("Encoded:  ", encoded)
    print("Decoded:  ", decoded)
    print("Round-trip equal:", original == decoded)
    print("-" * 40)


Original:  Port-Au-Prince
Encoded:   Port-Au-Prince
Decoded:   Port-Au-Prince
Round-trip equal: True
----------------------------------------
Original:  Teso
Encoded:   Teso
Decoded:   Teso
Round-trip equal: True
----------------------------------------
Original:  Tanganyka
Encoded:   Tanganyka
Decoded:   Tanganyka
Round-trip equal: True
----------------------------------------
Original:  Tayeeglow
Encoded:   Tayeeglow
Decoded:   Tayeeglow
Round-trip equal: True
----------------------------------------
Original:  Kadoma
Encoded:   Kadoma
Decoded:   Kadoma
Round-trip equal: True
----------------------------------------
Original:  Ad Dali'
Encoded:   Ad Dali'
Decoded:   Ad Dali'
Round-trip equal: True
----------------------------------------
Original:  MPongwe
Encoded:   MPongwe
Decoded:   MPongwe
Round-trip equal: True
----------------------------------------
Original:  Saint-Raphael
Encoded:   Saint-Raphael
Decoded:   Saint-Raphael
Round-trip equal: True
-------------------------------

In [None]:
# Define matched globally
matched = valid_matching['missing'].unique()

def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    return s.encode('unicode-escape').decode('ascii')

def find_province(x):
    try:
        # Ensure x is a Unicode string.
        if isinstance(x, bytes):
            x = x.decode('utf-8')
        
        # Direct lookup in districts or provinces.
        if x in districts:
            return admins[admins['district'] == x]['province'].values[0]
        elif x in provinces:
            return x

        # Convert x to an ASCII-escaped version.
        escaped_x = to_ascii_escaped(x)
        
        # Check if the escaped version is in matched.
        if escaped_x in matched:
            v = valid_matching[valid_matching['missing'] == escaped_x]
            if v['match'].values[0] == 'district':
                x2 = v['district'].values[0]
                return admins[admins['district'] == x2]['province'].values[0]
            elif v['match'].values[0] == 'province':
                return v['province'].values[0]
        
        # If no conditions are met, raise an exception.
        raise Exception("No matching province found")
    except Exception as e:
        raise Exception("Province not found for: {} ({})".format(x, e))


In [None]:
admin_to_province = {}
for a in admin_names:
    try:
        admin_to_province[a] = find_province(a)
    except Exception as e:
        # Print the admin name that caused an error
        print("Error with:", a)
        # Check if a contains accented characters "é" or "è"
        if 'é' in a or 'è' in a or 'ô' in a:
            a_modified = a.replace('é', 'e').replace('è', 'e').replace('ô', 'o')
            # Optionally, normalize if you expect other accents:
            # import unicodedata
            # a_modified = unicodedata.normalize('NFC', a_modified)
            
            # Check if the modified name is in districts
            if a_modified in districts:
                # Use the modified name to look up the province from admins
                try:
                    province = admins[admins['district'] == a_modified]['province'].values[0]
                    admin_to_province[a] = province
                    print(f"Replaced '{a}' with '{a_modified}', found province: {province}")
                except Exception as ex:
                    print(f"Modified name '{a_modified}' not found in admins: {ex}")
            else:
                print(f"Modified name '{a_modified}' not in districts.")
        else:
            print(f"No accented e found in '{a}'.")


Error with: Mangalmé
Replaced 'Mangalmé' with 'Mangalme', found province: Guera
Error with: La Pendé
Replaced 'La Pendé' with 'La Pende', found province: Logone Oriental
Error with: La Nya Pendé
Replaced 'La Nya Pendé' with 'La Nya Pende', found province: Logone Oriental
Error with: Lac-Léré
Replaced 'Lac-Léré' with 'Lac-Lere', found province: Mayo-Kebbi Ouest
Error with: Barh-Kôh
Replaced 'Barh-Kôh' with 'Barh-Koh', found province: Moyen-Chari
Error with: Aguié
Replaced 'Aguié' with 'Aguie', found province: Maradi
Error with: Bankilaré
Replaced 'Bankilaré' with 'Bankilare', found province: Tillaberi
Error with: Filingué
Replaced 'Filingué' with 'Filingue', found province: Tillaberi
Error with: Gothèye
Replaced 'Gothèye' with 'Gotheye', found province: Tillaberi
Error with: Gouré
Replaced 'Gouré' with 'Goure', found province: Zinder
Error with: Illéla
Replaced 'Illéla' with 'Illela', found province: Sokoto
Error with: Kantché
Replaced 'Kantché' with 'Kantche', found province: Zinder
Er

In [None]:
# print(admin_to_province)
for k, v in admin_to_province.items():
    print("key is : ", k)
    print("value is : ", v)

key is :  Kandahar
value is :  Kandahar
key is :  Kapisa
value is :  Kapisa
key is :  Khost
value is :  Khost
key is :  Kunar
value is :  Kunar
key is :  Kunduz
value is :  Kunduz
key is :  Laghman
value is :  Laghman
key is :  Logar
value is :  Logar
key is :  Nangarhar
value is :  Nangarhar
key is :  Paktika
value is :  Paktika
key is :  Paktya
value is :  Paktya
key is :  Samangan
value is :  Samangan
key is :  Sar-e-Pul
value is :  Sari Pul
key is :  Takhar
value is :  Takhar
key is :  Wardak
value is :  Wardak
key is :  Zabul
value is :  Zabul
key is :  Daykundi
value is :  Daykundi
key is :  Panjsher
value is :  Panjsher
key is :  Parwan
value is :  Parwan
key is :  Uruzgan
value is :  Uruzgan
key is :  Badakhshan
value is :  Badakhshan
key is :  Badghis
value is :  Badghis
key is :  Baghlan
value is :  Baghlan
key is :  Balkh
value is :  Balkh
key is :  Bamyan
value is :  Bamyan
key is :  Farah
value is :  Farah
key is :  Faryab
value is :  Faryab
key is :  Ghazni
value is :  Gh

In [None]:
# time_series['province'] = time_series['admin_name'].apply(lambda x: admin_to_province[x])
time_series['province'] = time_series['admin_name'].apply(
    lambda x: admin_to_province[x] if x in admin_to_province else admin_to_province.get(x.replace('ô', 'o'))
)


# Add province and country aggregate values


In [None]:
# def add_agg_factors(features, level='province'):
#     grouped_df = time_series.groupby(['year_month', level]).mean(numeric_only=True)
#     for f in features:
#         time_series['{}_{}'.format(f, level)] = time_series.apply(lambda x: grouped_df.loc[x['year_month'], x[level]][f], axis=1)


def add_agg_factors(features, level='province'):
    # First, create a cleaned version of the grouping column.
    # (If time_series[level] is already a string column, this works directly.)
    time_series['{}_clean'.format(level)] = time_series[level]
    
    # Now group by 'year_month' and the cleaned level.
    grouped_df = time_series.groupby(['year_month', '{}_clean'.format(level)]).mean(numeric_only=True)
    
    # For each feature, add a new column.
    for f in features:
        # Use the cleaned level for the lookup in the grouped DataFrame.
        time_series['{}_{}'.format(f, level)] = time_series.apply(
            lambda x: grouped_df.loc[x['year_month'], x['{}_clean'.format(level)]][f]
                      if pd.notnull(x['{}_clean'.format(level)]) and x['{}_clean'.format(level)] in grouped_df.loc[x['year_month']].index
                      else np.nan,
            axis=1
        )

In [None]:
add_agg_factors(news_factors)

KeyboardInterrupt: 

In [None]:
add_agg_factors(news_factors, level='country')
display(time_series.columns.values)

array(['Unnamed: 0', 'index', 'country', 'admin_code', 'admin_name',
       'centx', 'centy', 'year_month', 'year', 'month', 'fews_ipc',
       'fews_ha', 'fews_proj_near', 'fews_proj_near_ha', 'fews_proj_med',
       'fews_proj_med_ha', 'ndvi_mean', 'ndvi_anom', 'rain_mean',
       'rain_anom', 'et_mean', 'et_anom', 'acled_count',
       'acled_fatalities', 'p_staple_food', 'area', 'cropland_pct', 'pop',
       'ruggedness_mean', 'pasture_pct', 'change_fews', 'land seizures_0',
       'land seizures_1', 'land seizures_2', 'slashed export_0',
       'slashed export_1', 'slashed export_2', 'price rise_0',
       'price rise_1', 'price rise_2', 'mass hunger_0', 'mass hunger_1',
       'mass hunger_2', 'cyclone_0', 'cyclone_1', 'cyclone_2',
       'failed crops_0', 'failed crops_1', 'failed crops_2',
       'disruption to farming_0', 'disruption to farming_1',
       'disruption to farming_2', 'massive starvation_0',
       'massive starvation_1', 'massive starvation_2',
       'abnormall

In [None]:
add_agg_factors(t_variant_traditional_factors, level='province')
display(time_series.columns.values)

array(['Unnamed: 0', 'index', 'country', 'admin_code', 'admin_name',
       'centx', 'centy', 'year_month', 'year', 'month', 'fews_ipc',
       'fews_ha', 'fews_proj_near', 'fews_proj_near_ha', 'fews_proj_med',
       'fews_proj_med_ha', 'ndvi_mean', 'ndvi_anom', 'rain_mean',
       'rain_anom', 'et_mean', 'et_anom', 'acled_count',
       'acled_fatalities', 'p_staple_food', 'area', 'cropland_pct', 'pop',
       'ruggedness_mean', 'pasture_pct', 'change_fews', 'land seizures_0',
       'land seizures_1', 'land seizures_2', 'slashed export_0',
       'slashed export_1', 'slashed export_2', 'price rise_0',
       'price rise_1', 'price rise_2', 'mass hunger_0', 'mass hunger_1',
       'mass hunger_2', 'cyclone_0', 'cyclone_1', 'cyclone_2',
       'failed crops_0', 'failed crops_1', 'failed crops_2',
       'disruption to farming_0', 'disruption to farming_1',
       'disruption to farming_2', 'massive starvation_0',
       'massive starvation_1', 'massive starvation_2',
       'abnormall

In [None]:
add_agg_factors(news_factors, level='country')
add_agg_factors(t_variant_traditional_factors, level='province')
add_agg_factors(t_variant_traditional_factors, level='country')
add_agg_factors(t_invariant_traditional_factors, level='province')
add_agg_factors(t_invariant_traditional_factors, level='country')

In [None]:
time_series.to_csv('agg_province_features.csv')

# Add time lagged features


In [None]:
add_time_lagged(t_variant_traditional_factors)

  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}

In [None]:
add_time_lagged(news_factors)

  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}

In [None]:
add_time_lagged(['fews_ipc'], end=21, diff=3, agg=False)

  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)
  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)


In [None]:
add_time_lagged(['fews_proj_near'], start=3, end=4, diff=1, agg=False)

  time_series['{}_{}'.format(f_s,t)] = time_series.apply(lambda x: get_lagged(x, f_s, t), axis=1)


In [None]:
import math
def diebold_mariano(preds, labels):
    sq_error = [(p-l)**2 for p,l in zip(preds, labels)]
    mean = np.mean(sq_error)
    n = len(preds)
    gammas = {}
    m = max(n,int(math.ceil(np.cbrt(n))+2))
    for k in range(m):
        gammas[k] = 0
        for i in range(k+1, n):
            gammas[k] += (sq_error[i] - mean)*(sq_error[i-k] - mean)
        gammas[k] = gammas[k]/n
    sum_gamma = gammas[0]
    for k in range(1, m):
        sum_gamma += 2*gammas[k]
    return np.sqrt(sum_gamma/n)

In [None]:
display(time_series.columns.values)

array(['Unnamed: 0', 'index', 'country', 'admin_code', 'admin_name',
       'centx', 'centy', 'year_month', 'year', 'month', 'fews_ipc',
       'fews_ha', 'fews_proj_near', 'fews_proj_near_ha', 'fews_proj_med',
       'fews_proj_med_ha', 'ndvi_mean', 'ndvi_anom', 'rain_mean',
       'rain_anom', 'et_mean', 'et_anom', 'acled_count',
       'acled_fatalities', 'p_staple_food', 'area', 'cropland_pct', 'pop',
       'ruggedness_mean', 'pasture_pct', 'change_fews', 'land seizures_0',
       'land seizures_1', 'land seizures_2', 'slashed export_0',
       'slashed export_1', 'slashed export_2', 'price rise_0',
       'price rise_1', 'price rise_2', 'mass hunger_0', 'mass hunger_1',
       'mass hunger_2', 'cyclone_0', 'cyclone_1', 'cyclone_2',
       'failed crops_0', 'failed crops_1', 'failed crops_2',
       'disruption to farming_0', 'disruption to farming_1',
       'disruption to farming_2', 'massive starvation_0',
       'massive starvation_1', 'massive starvation_2',
       'abnormall

# Generate and save data for Fig 3A, B, C


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import linear_model

from sklearn.metrics import mean_squared_error
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.metrics import auc

test_splits = [
    ((2010,7), (2011, 7)), 
    ((2011,7), (2012, 7)),
    ((2012,7), (2013, 7)), 
    ((2013,7), (2014, 7)), 
    ((2014,7), (2015, 7)), 
    ((2015,7), (2016, 7)), 
    ((2016,7), (2017, 7)), 
    ((2017,7), (2018, 7)),
    ((2018,7), (2019, 7)), 
    ((2019,2), (2020, 2)),
]
train_splits = [
    ((2009,7), (2010,4)),
    ((2009,7), (2011,1)),
    ((2009,7), (2011,10)),
    ((2009,7), (2012,7)),
    ((2009,7), (2013,7)),
    ((2009,7), (2014,1)),
    ((2009,7), (2015,1)),
    ((2009,7), (2015,10)),
    ((2009,7), (2016,10)),
    ((2009,7), (2017,2))]
dev_splits = [
    ((2010,4), (2010, 7)),
    ((2011,1), (2011, 7)),
    ((2011,10), (2012, 7)),
    ((2012,7), (2013, 7)),
    ((2013,4), (2014, 7)),
    ((2014,1), (2015, 7)),
    ((2015,1), (2016, 7)),
    ((2015,10), (2017, 7)),
    ((2016,10), (2018, 7)),
    ((2017,2), (2019, 2)),
]
rf = RandomForestRegressor(max_features='auto', n_estimators=100, 
                             min_samples_split=0.5, min_impurity_decrease=0.001, random_state=0)
ols = LinearRegression()

lasso = linear_model.Lasso(alpha=0.1)

def get_agg_lagged_features(factors):
    return ['{}_{}'.format(f, t) for f, t in zip(factors, range(3,9))] + ['{}_province_{}'.format(f, t) for f, t in zip(factors, range(3,9))] + ['{}_country_{}'.format(f, t) for f, t in zip(factors, range(3,9))]
        

features = {
    'traditional': time_series[
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors
    ], 
    'news': time_series[
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
        get_agg_lagged_features(news_factors)
    ], 
    'traditional+news': time_series[
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors +
        get_agg_lagged_features(news_factors)
    ],
    'expert': time_series['fews_proj_near_3'],
    'expert+traditional': time_series[
        ['fews_proj_near_3'] +
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors
    ],
    'expert+news': time_series[
        ['fews_proj_near_3'] +
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
        get_agg_lagged_features(news_factors)
    ],
    'expert+traditional+news': time_series[
        ['fews_proj_near_3'] +
        ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors +
        get_agg_lagged_features(news_factors)
    ]
}

labels_df = time_series['fews_ipc']

def get_time_split(df, start, end):
    return df[df['year'] >= start[0] & df['month'] >= start[1] & df['year'] <= end[0] & df['month'] <= end[1]]


fig_3a = pd.DataFrame(columns=['method', 'split', 'features', 'country', 'rmse', 'lower_bound', 'upper_bound'])
fig_3b = pd.DataFrame(columns=['method', 'split', 'features', 'aucpr'])
fig_3c = pd.DataFrame(columns=['method', 'split', 'features', 'recall_at_80p'])

thresholds = {'traditional': (2.236, 3.125), 
              'news': (1.907, 2.712), 
              'traditional+news': (2.105, 3.314),
              'expert': (2, 3),
              'expert+news': (1.912, 2.813),
              'expert+traditional': (2.241, 3.132),
              'expert+traditional+news': (2.172, 3.321)
             }

for train, dev, test in zip(train_splits, dev_splits, test_splits):
    for f, D in features.items():
        X = get_time_split(D, train[0], train[1])
        y = get_time_split(labels_df, test[0], test[1])
        X_test = get_time_split(D, test[0], test[1])
        for name, regr in zip(['RF', 'OLS', 'Lasso'], [rf, ols, lasso]):
            regr.fit(X, y)
            preds = regr.predict(X_test)
            labels = get_time_split(labels_df, test[0], test[1])
            rmse = mean_squared_error(labels, preds, squared=False)
            stderr = diebold_mariano(preds, labels)
            upper_bound = np.sqrt(rmse**2 + 1.96*stderr)
            lower_bound = np.sqrt(rmse**2 - 1.96*stderr)
            precision, recall, thresholds = precision_recall_curve(labels, preds)
            auc_precision_recall = auc(recall, precision)
            _row = pd.DataFrame.from_dict({'method': [name], 'split': [test], 'features': [f], 'country': ['all'],
                                           'rmse': [rmse], 'lower_bound': [lower_bound], 'upper_bound': [upper_bound]},
                                          orient='columns')
            fig_3a = pd.concat([fig_3a, _row], axis=0)
            _row = pd.DataFrame.from_dict({'method': [name], 'split': [test], 'features': [f], 
                                           'aucpr': [auc_precision_recall]},
                                          orient='columns')
            fig_3b = pd.concat([fig_3b, _row], axis=0)
            print ("Method: {}, Split: {}, Features: {}, AUCPR: {}".format(name, test, f, auc_precision_recall))
            print ("Method: {}, Split: {}, Features: {}, RMSE: {} [{}, {}]".format(name, test, f, rmse, lower_bound, upper_bound))
            
            recall_at_80p = 0
            for p_t, p_t_add_3, p_t_min_3 in zip(preds, preds[3:] + [1,1,1], preds[:-3]+[5,5,5]):
                u_b = thresholds[f]['upper_bound']
                l_b = thresholds[f]['lower_bound']
                if p_t >= u_b and p_t_add_3 >= u_b and p_t_min_3 <= l_b:
                    recall_at_80p += 1
            
            _row = pd.DataFrame.from_dict({'method': [name], 'split': [test], 'features': [f], 
                                           'recall_at_80p': [recall_at_80p]},
                                          orient='columns')
            fig_3c = pd.concat([fig_3c, _row], axis=0)
            
            for country in time_series['country'].unique():
                c_id = X_test[X_test['country']==country]
                labels_c = labels[c_id]
                preds_c = preds[c_id]
                rmse = mean_squared_error(labels_c, preds_c, squared=False)
                stderr = diebold_mariano(preds_c, labels_c)
                upper_bound = np.sqrt(rmse**2 + 1.96*stderr)
                lower_bound = np.sqrt(rmse**2 - 1.96*stderr)
                _row = pd.DataFrame.from_dict({'method': [name], 'split': [test], 'features': [f], 'country': [country],
                                           'rmse': [rmse], 'lower_bound': [lower_bound], 'upper_bound': [upper_bound]},
                                          orient='columns')
                fig_3a = pd.concat([fig_3a, _row], axis=0)
                print ("Country: {}, Method: {}, Split: {}, Features: {}, RMSE: {} [{}, {}]".format(country, name, test, f, rmse, lower_bound, upper_bound))

fig_3a.to_csv('fig_3a.csv')
fig_3b.to_csv('fig_3b.csv')
fig_3c.to_csv('fig_3c.csv')

KeyError: 'year'