# 📊 **Re-Implementation of "Predicting Food Crises Using News Streams"**

---

#### 🔍 **Objective**

This notebook aims to **reproduce and analyze** the methodology presented in the paper:

📄 **Paper:** [Predicting food crises using news streams](https://www.science.org/doi/10.1126/sciadv.abm3449)  
📊 **Dataset:** [Harvard Dataverse Repository](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJDWUW)  
📜 **Original Code & Methods:** [GitHub - Regression Modeling (Step 5)](https://github.com/philippzi98/food_insecurity_predictions_nlp/blob/main/Step%205%20-%20Regression%20Modelling/README.md)

---

#### 🛠 **Methodology**

This implementation follows the **key steps** outlined in the paper to predict **food insecurity crises** using a combination of:
1️⃣ **Traditional Risk Factors** (conflict, climate, food prices, etc.)  
2️⃣ **News-Based Indicators** (text feature frequencies from news articles)  
3️⃣ **Lagging & Aggregation** (temporal dependencies at district, province, and country levels)  
4️⃣ **Machine Learning Models** (Random Forest, OLS, Lasso)

---

#### 🔗 **Reference Materials**

📄 **Supplementary Material:** Available in `supplemental_material_from_paper.pdf`  
📊 **Datasets Used:**

- `time_series_with_causes_zscore_full.csv` (Main dataset with time-series features)
- `famine-country-province-district-years-CS.csv` (Food insecurity classification)
- `matching_districts.csv` (Geographical standardization)


# 📚🔧 Import Libraries

In this notebook, we will use uv to manage our Python environment and packages efficiently. uv is a modern and fast package manager that simplifies virtual environment creation, and dependency installation. We will create a virtual environment, install necessary libraries, and ensure our environment stays consistent across different setups.


In [41]:
## Uncoment the below cell to install `uv` if you have not already. You can also install it trhiugh `pip` by running `!pip install uv` but this will be within your current python environment and not globally.

# !curl -LsSf https://astral.sh/uv/install.sh | sh
# !uv venv world-bank
# !source world-bank/bin/activate

In [42]:
# !uv pip install -r requirements.txt

In [43]:
import pandas as pd
import numpy as np
import folium
from IPython.display import display, Image
import os
import gdown
import zipfile
import editdistance
from fuzzywuzzy import fuzz
import math

In [44]:
url = "https://drive.google.com/uc?id=1YoQ1hz9RlaLr2xW3KoKCfJPyyO2PErym"
output = "data.zip"

if not os.path.exists("./data"):
    gdown.download(url, output, quiet=False) 
    zipfile.ZipFile('data.zip', 'r').extractall()
else:
    print("You already have the data downloaded and extracted")

You already have the data downloaded and extracted


## 📂 Load and Clean Data

**Understanding the Time-Series Dataset & Column Selection**

This dataset contains **district-level time-series data** on food insecurity risk factors, including:

- **📅 Temporal Information:** `year`, `month`, `year_month`
- **📍 Geographical Identifiers:** `admin_code`, `admin_name`, `province`, `country`
- **🌍 Traditional Risk Factors:** Climate (`rain_mean`, `ndvi_mean`), conflict (`acled_count`), food prices (`p_staple_food`)
- **📰 News-Based Indicators:** Proportions of news articles mentioning crisis-related keywords (`conflict_0`, `famine_0`, etc.)
- **📉 Food Insecurity Label:** `fews_ipc` (Integrated Phase Classification)

🔥 **Columns We Will Drop & Why**
✔ **Redundant Aggregations:** `_1`, `_2` columns (province & country-level values) since we will recompute aggregations from scratch anyways.  
✔ **Unnamed/Index Columns:** `Unnamed: 0` as it is unnecessary. It is just a duplicate of default index.
✔ **Unnecessary Identifiers:** If `admin_code` and `admin_name`, after matching these to `matching_districts.csv`, we can drop them.

---

> ⚠️ **NOTE:**  
> For a detailed explanation of the dataset and features, refer to the [`explore_time_series.ipynb`](./explore_time_series.ipynb) notebook.


In [45]:
time_series = pd.read_csv('./data/time_series_with_causes_zscore_full.csv', nrows=15)
admins = pd.read_csv('./data/famine-country-province-district-years-CS.csv')
valid_matching = pd.read_csv('./data/matching_districts.csv')

In [46]:
sorted(time_series.columns.values)

['Unnamed: 0',
 'abnormally low rainfall_0',
 'abnormally low rainfall_1',
 'abnormally low rainfall_2',
 'acled_count',
 'acled_fatalities',
 'acute hunger_0',
 'acute hunger_1',
 'acute hunger_2',
 'admin_code',
 'admin_name',
 'aid appeal_0',
 'aid appeal_1',
 'aid appeal_2',
 'aid workers died_0',
 'aid workers died_1',
 'aid workers died_2',
 'air attack_0',
 'air attack_1',
 'air attack_2',
 'alarming level_0',
 'alarming level_1',
 'alarming level_2',
 'anti-western policies_0',
 'anti-western policies_1',
 'anti-western policies_2',
 'apathy_0',
 'apathy_1',
 'apathy_2',
 'area',
 'asylum seekers_0',
 'asylum seekers_1',
 'asylum seekers_2',
 'authoritarian_0',
 'authoritarian_1',
 'authoritarian_2',
 'bad harvests_0',
 'bad harvests_1',
 'bad harvests_2',
 'blockade_0',
 'blockade_1',
 'blockade_2',
 'bombing campaign_0',
 'bombing campaign_1',
 'bombing campaign_2',
 'brain drain_0',
 'brain drain_1',
 'brain drain_2',
 'brutal government_0',
 'brutal government_1',
 'brutal 

In [47]:
time_series.head(5)

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,carbon_2,mayhem_0,mayhem_1,mayhem_2,dehydrated_0,dehydrated_1,dehydrated_2,mismanagement_0,mismanagement_1,mismanagement_2
0,0,30,Afghanistan,202,Kandahar,65.709343,31.043618,2009_07,2009,7,...,1.053,0.667,-0.171,-0.833,0.173667,0.168,1.284667,-0.073,-0.427667,0.668333
1,1,33,Afghanistan,202,Kandahar,65.709343,31.043618,2009_10,2009,10,...,-0.660812,-0.63658,-0.520247,-0.782913,-0.671587,-0.612254,-0.926921,-0.510467,-0.625133,-0.452467
2,2,36,Afghanistan,202,Kandahar,65.709343,31.043618,2010_01,2010,1,...,-0.134333,1.447667,-0.844333,0.778667,-0.676,-0.689667,0.293333,0.530333,-0.471333,0.955333
3,3,39,Afghanistan,202,Kandahar,65.709343,31.043618,2010_04,2010,4,...,-0.326927,-0.594877,0.16479,-0.90521,-0.62054,0.165794,0.045794,-1.0116,-0.8106,-0.2056
4,4,42,Afghanistan,202,Kandahar,65.709343,31.043618,2010_07,2010,7,...,-1.085146,-0.709913,-0.867913,-0.770247,-0.787921,-0.974587,-0.946921,-0.611133,-0.7098,-0.6228


In [48]:
t_variant_traditional_factors = ['rain_mean']
t_invariant_traditional_factors = ['area', 'pasture_pct']

news_factors = [name for name in time_series.columns.values if '_0' in name]

In [49]:
news_factors[0]

'land seizures_0'

In [50]:
print("Columns count BEFORE dropping: ", len(time_series.columns.values))

Columns count BEFORE dropping:  532


In [51]:
cols_to_drop = ["Unnamed: 0", "centx", "centy", 'change_fews', 'fews_ha', 'fews_proj_med', 'fews_proj_med_ha', 'fews_proj_near_ha'] + [col for col in time_series.columns if col.endswith(('_1', '_2', '_3'))]
time_series.drop(columns=cols_to_drop, inplace=True)

In [52]:
potential_extra_cols = set(time_series.columns.values) - set(t_variant_traditional_factors) - set(t_invariant_traditional_factors) - set(news_factors)
potential_extra_cols = [col for col in potential_extra_cols if not col.endswith(('_1', '_2', '_3'))]
print("Potential extra columns", potential_extra_cols)

Potential extra columns ['et_mean', 'year', 'fews_proj_near', 'country', 'acled_fatalities', 'month', 'ndvi_mean', 'acled_count', 'rain_anom', 'year_month', 'et_anom', 'admin_name', 'pop', 'ndvi_anom', 'p_staple_food', 'ruggedness_mean', 'admin_code', 'cropland_pct', 'fews_ipc', 'index']


In [53]:
print("Columns count after dropping: ", len(time_series.columns.values))

Columns count after dropping:  190


### 🌍 Admin Level Mapping: Standardizing Geographical Identifiers

In this section, we will **map and standardize** the `admin_code` and `admin_name` fields to their corresponding **district, province, and country names**. This step is **crucial** for ensuring **consistency** across different datasets and enabling **accurate aggregations** at multiple administrative levels.

🛠 **Why is Admin Level Mapping Important?**
✅ Different datasets may use **slightly different spellings or formats** for district names.  
✅ Some district names might be **missing or misspelled**, requiring standardization.  
✅ We need to **match and align** district names across various sources before aggregating at **province and country levels**.  
✅ Proper mapping allows us to **merge datasets correctly** without losing information.  

📌 **Steps in Admin Mapping**
1️⃣ **Load the `matching_districts.csv` file**, which provides the mapping between different district name variations.  
2️⃣ **Identify missing or unmatched `admin_name` values** and find their closest matches using fuzzy matching techniques.  
3️⃣ **Ensure that each `admin_code` uniquely maps to one `district`, `province`, and `country`.**  
4️⃣ **Replace inconsistent names** in the dataset with their standardized versions.  
5️⃣ **Aggregate data at the `province` and `country` levels** after ensuring all districts are correctly mapped.  


In [54]:
len(admins.country.unique())

39

In [55]:
admins.columns.values

array(['Unnamed: 0', 'country', 'district', 'year', 'month', 'CS',
       'province'], dtype=object)

In [56]:
admin_names = time_series['admin_name'].unique()
districts = admins['district'].unique()
provinces = admins['province'].unique()
countries = admins['country'].unique()

In [57]:
print (len(admin_names), len(districts), len(provinces), len(countries))
print (len(set(admin_names).difference(districts)))
missing_admin_names = set(admin_names).difference(districts)
print (len(missing_admin_names.difference(provinces)))
missing_admin_names = missing_admin_names.difference(provinces)

1 4113 474 39
0
0


### Fuzzy String Matching for Missing Names

The function uses **fuzzy string matching** to find the best approximate matches for missing administrative names (e.g., districts and provinces). 

- Finds the **best matching district/province** for each missing name.
- Uses **fuzzy string matching** to calculate the similarity between missing names and known names.
- Returns a dictionary that maps each missing name to its closest match.


In [58]:
def find_matching(missing, names):
    matching_districts = {}
    for m in missing:
        max_overlap = 0
        nearest_d = None
        for d in names:
            d = str(d)
            dist = fuzz.partial_ratio(m, d)
            if dist > max_overlap:
                max_overlap = dist
                nearest_d = d
        matching_districts[m] = nearest_d
    return matching_districts


matching = find_matching(missing_admin_names, districts)
matching_p = find_matching(missing_admin_names, provinces)

# manually verify matching and update
for k in matching.keys():
    print (k, matching[k], matching_p[k])


### Encoding Decoding

`to_ascii_escaped(s)`: Converts a Unicode string to an ASCII-safe representation using **unicode-escape**.

`from_ascii_escaped(escaped)`: Converts the escaped ASCII string back into its original Unicode form.

In [59]:
def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    # Using 'unicode-escape' encoding produces a bytes object,
    # then decode it to get an ASCII string.
    return s.encode('unicode-escape').decode('ascii')

def from_ascii_escaped(escaped):
    """
    Convert the ASCII-escaped string back to the original Unicode string.
    """
    # Encode the ASCII string to bytes, then decode using 'unicode-escape'
    return escaped.encode('ascii').decode('unicode-escape')


### Finding the Province for a Given District or Province

`find_province(x)`, finds the **province** corresponding to a given administrative name. It accounts for:
- **Direct Lookups** (Exact match in known district/province lists)
- **Fuzzy Matching** (Using ASCII-safe transformation for inconsistent text encoding)
- **Validation Against a Predefined Mapping (`valid_matching`)**

In [60]:
# Define matched globally
matched = valid_matching['missing'].unique()

def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    return s.encode('unicode-escape').decode('ascii')

def find_province(x):
    try:
        # Ensure x is a Unicode string.
        if isinstance(x, bytes):
            x = x.decode('utf-8')
        
        # Direct lookup in districts or provinces.
        if x in districts:
            return admins[admins['district'] == x]['province'].values[0]
        elif x in provinces:
            return x

        # Convert x to an ASCII-escaped version.
        escaped_x = to_ascii_escaped(x)
        
        # Check if the escaped version is in matched.
        if escaped_x in matched:
            v = valid_matching[valid_matching['missing'] == escaped_x]
            if v['match'].values[0] == 'district':
                x2 = v['district'].values[0]
                return admins[admins['district'] == x2]['province'].values[0]
            elif v['match'].values[0] == 'province':
                return v['province'].values[0]
        
        # If no conditions are met, raise an exception.
        raise Exception("No matching province found")
    except Exception as e:
        raise Exception("Province not found for: {} ({})".format(x, e))


### Handling Admin Names with Accented Characters and Mapping to Provinces

Maps `admin_names` to provinces using the `find_province(a)` function.  
If a **direct lookup fails**, it tries to handle cases where the **admin name contains accented characters** (`é`, `è`, `ô`) ->  (encoding decoding issues resolved through directly replacing these with 'e' or 'o', leads to finding a valid match). 

In [61]:
admin_to_province = {}
for a in admin_names:
    try:
        admin_to_province[a] = find_province(a)
    except Exception as e:
        # Print the admin name that caused an error
        print("Error with:", a)
        # Check if a contains accented characters "é" or "è"
        if 'é' in a or 'è' in a or 'ô' in a:
            a_modified = a.replace('é', 'e').replace('è', 'e').replace('ô', 'o')
            # Check if the modified name is in districts
            if a_modified in districts:
                # Use the modified name to look up the province from admins
                try:
                    province = admins[admins['district'] == a_modified]['province'].values[0]
                    admin_to_province[a] = province
                    print(f"Replaced '{a}' with '{a_modified}', found province: {province}")
                except Exception as ex:
                    print(f"Modified name '{a_modified}' not found in admins: {ex}")
            else:
                print(f"Modified name '{a_modified}' not in districts.")
        else:
            print(f"No accented e found in '{a}'.")


### Mapping Administrative Names to Provinces in time_series

Maps `admin_name` to their respective **provinces** using a precomputed dictionary - >`admin_to_province` in `time_series`.


In [62]:
time_series['province'] = time_series['admin_name'].apply(
    lambda x: admin_to_province[x] if x in admin_to_province else admin_to_province.get(x.replace('ô', 'o'))
)


In [63]:
time_series[["admin_name", "province"]]

Unnamed: 0,admin_name,province
0,Kandahar,Kandahar
1,Kandahar,Kandahar
2,Kandahar,Kandahar
3,Kandahar,Kandahar
4,Kandahar,Kandahar
5,Kandahar,Kandahar
6,Kandahar,Kandahar
7,Kandahar,Kandahar
8,Kandahar,Kandahar
9,Kandahar,Kandahar


# ⏳ Time Lagging & Feature Engineering

#### 📅 **Why Use Lagging?**

To predict food insecurity **for a given quarter**, we use:

- **6 months of historical values** for traditional & news-based features.
- **Province & country-level aggregations** to capture broader shocks.
- **6 quarters of lagged IPC phase values** to model temporal dependencies.

#### ⚡ **Optimized Lagging Approach**

To improve computational efficiency, we:
✔ Use `groupby()` for **fast province & country-level aggregations**.  
✔ Merge lagged data via `merge()` instead of slow `.apply()`.  
✔ Only keep **past data** to ensure no data leakage.


In [64]:
import pandas as pd
import numpy as np

def get_lagged(x, f, t, ts_dict):
    admin_code = x['admin_code']
    year, month = x['year'], x['month']

    # Compute lagged year and month
    l_month = ((month - 1 - t) % 12) + 1
    l_year = year - 1 if month - t <= 0 else year
    lagged_year_month = f"{l_year}_{l_month}"

    # Retrieve pre-filtered DataFrame
    ts = ts_dict.get(admin_code)

    # Ensure ts is a dictionary and extract only the requested feature
    if ts is not None and lagged_year_month in ts:
        lagged_values = ts[lagged_year_month]  # This might be a dictionary
        if isinstance(lagged_values, dict):
            return lagged_values.get(f, x[f])  # Extract only `f`
        return lagged_values  # Directly return value if not a dict

    return x[f]  # Fallback to original value


def add_time_lagged(features, start=3, end=9, diff=1, agg=True):
    levels = ['', '_province', '_country'] if agg else ['']

    # Convert 'year_month' to index and precompute admin_code groupings
    ts_dict = {ac: df.set_index('year_month')[features].to_dict(orient="index")
               for ac, df in time_series.groupby('admin_code')}

    for suffix in levels:
        for f in features:
            f_s = f + suffix
            for t in range(start, end, diff):
                lagged_col = f"{f_s}_{t}"

                if lagged_col in time_series.columns:
                    continue

                # Use list comprehension for better performance
                time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]

    return time_series  # Return modified DataFrame


# Province & Country-Level Aggregation

This function aggregates feature values at the province and country levels to capture regional trends, aiding in food insecurity prediction. The process includes:

- **Grouping by year_month and level:** Data is grouped by year_month and the specified level (province or country) to calculate the mean of features, reflecting regional trends over time.

- **Applying transformations efficiently:** Instead of merging aggregated data, `transform("mean")` is used to directly assign the computed mean to each row, avoiding unnecessary joins and improving performance.  

#### ⚡ **Efficiency Gains**

- **Fast Aggregation**: Uses `groupby()` for efficient aggregation.
- **Avoids Costly Joins**: Eliminates the need for `merge()` by using `transform()` instead, reducing computational overhead.  
- **Memory Efficiency**: Converts the `level` column to a categorical type to reduce memory usage.

This approach ensures faster processing while maintaining the quality of aggregated features.


In [65]:
def add_agg_factors(features, level='province'):
    global time_series  

    # Convert 'level' column to categorical for performance
    time_series[level] = time_series[level].astype('category')
    
    # Compute grouped mean values for the given features
    grouped_df = time_series.groupby(['year_month', level], observed=True, sort=False)[features].transform("mean")

    # Rename columns to include level
    grouped_df = grouped_df.rename(columns={f: f"{f}_{level}" for f in features})

    # Use pd.concat() to add all columns at once, avoiding fragmentation
    time_series = pd.concat([time_series, grouped_df], axis=1)

    return time_series


In [66]:
add_agg_factors(news_factors)

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,gastrointestinal_0_province,terrorist_0_province,warlord_0_province,d'etat_0_province,overthrow_0_province,convoys_0_province,carbon_0_province,mayhem_0_province,dehydrated_0_province,mismanagement_0_province
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,-0.192,-0.284333,-0.668667,0.647333,-0.891333,0.112667,1.265333,0.667,0.173667,-0.073
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.545727,-1.037016,-0.811291,-0.850261,-0.948892,-0.728972,-0.765146,-0.63658,-0.671587,-0.510467
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,1.506333,0.455,1.595667,0.571667,0.279,-0.868333,0.058333,1.447667,-0.676,0.530333
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-0.79397,-0.722159,-0.130521,0.04763,0.362613,0.480986,0.026073,-0.594877,-0.62054,-1.0116
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,-0.509394,-0.69435,-1.215958,-0.865261,-1.119225,-1.060638,-0.673479,-0.709913,-0.787921,-0.611133
5,45,Afghanistan,202,Kandahar,2010_10,2010,10,2.0,,0.095377,...,-0.691,1.168667,-0.279333,-0.296667,1.651333,-0.356,1.156667,1.202667,0.446667,0.696667
6,48,Afghanistan,202,Kandahar,2011_01,2011,1,2.0,,0.09262,...,-0.916,0.334333,-0.847,1.27,0.744333,1.328667,-0.705,0.738,0.358,0.357667
7,51,Afghanistan,202,Kandahar,2011_04,2011,4,2.0,2.0,0.131462,...,0.098364,-0.792159,-0.083854,-0.22937,0.050279,0.181986,-0.413594,0.278123,-0.23554,-0.8176
8,54,Afghanistan,202,Kandahar,2011_07,2011,7,1.0,1.0,0.106885,...,-0.820667,-0.696,-0.588667,0.714,1.049667,-0.59,-0.759667,-0.057,0.993333,1.217667
9,57,Afghanistan,202,Kandahar,2011_10,2011,10,1.0,1.0,0.103268,...,0.339,-0.851667,0.421,1.312,0.010667,1.056333,1.055333,0.844,0.652,-0.213


In [67]:
add_agg_factors(news_factors, level='country')
time_series.head(10)

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,gastrointestinal_0_country,terrorist_0_country,warlord_0_country,d'etat_0_country,overthrow_0_country,convoys_0_country,carbon_0_country,mayhem_0_country,dehydrated_0_country,mismanagement_0_country
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,-0.192,-0.284333,-0.668667,0.647333,-0.891333,0.112667,1.265333,0.667,0.173667,-0.073
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.545727,-1.037016,-0.811291,-0.850261,-0.948892,-0.728972,-0.765146,-0.63658,-0.671587,-0.510467
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,1.506333,0.455,1.595667,0.571667,0.279,-0.868333,0.058333,1.447667,-0.676,0.530333
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-0.79397,-0.722159,-0.130521,0.04763,0.362613,0.480986,0.026073,-0.594877,-0.62054,-1.0116
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,-0.509394,-0.69435,-1.215958,-0.865261,-1.119225,-1.060638,-0.673479,-0.709913,-0.787921,-0.611133
5,45,Afghanistan,202,Kandahar,2010_10,2010,10,2.0,,0.095377,...,-0.691,1.168667,-0.279333,-0.296667,1.651333,-0.356,1.156667,1.202667,0.446667,0.696667
6,48,Afghanistan,202,Kandahar,2011_01,2011,1,2.0,,0.09262,...,-0.916,0.334333,-0.847,1.27,0.744333,1.328667,-0.705,0.738,0.358,0.357667
7,51,Afghanistan,202,Kandahar,2011_04,2011,4,2.0,2.0,0.131462,...,0.098364,-0.792159,-0.083854,-0.22937,0.050279,0.181986,-0.413594,0.278123,-0.23554,-0.8176
8,54,Afghanistan,202,Kandahar,2011_07,2011,7,1.0,1.0,0.106885,...,-0.820667,-0.696,-0.588667,0.714,1.049667,-0.59,-0.759667,-0.057,0.993333,1.217667
9,57,Afghanistan,202,Kandahar,2011_10,2011,10,1.0,1.0,0.103268,...,0.339,-0.851667,0.421,1.312,0.010667,1.056333,1.055333,0.844,0.652,-0.213


In [68]:
add_agg_factors(t_variant_traditional_factors, level='province')

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,terrorist_0_country,warlord_0_country,d'etat_0_country,overthrow_0_country,convoys_0_country,carbon_0_country,mayhem_0_country,dehydrated_0_country,mismanagement_0_country,rain_mean_province
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,-0.284333,-0.668667,0.647333,-0.891333,0.112667,1.265333,0.667,0.173667,-0.073,0.353588
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-1.037016,-0.811291,-0.850261,-0.948892,-0.728972,-0.765146,-0.63658,-0.671587,-0.510467,0.409304
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,0.455,1.595667,0.571667,0.279,-0.868333,0.058333,1.447667,-0.676,0.530333,3.894158
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-0.722159,-0.130521,0.04763,0.362613,0.480986,0.026073,-0.594877,-0.62054,-1.0116,1.609664
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,-0.69435,-1.215958,-0.865261,-1.119225,-1.060638,-0.673479,-0.709913,-0.787921,-0.611133,0.393834
5,45,Afghanistan,202,Kandahar,2010_10,2010,10,2.0,,0.095377,...,1.168667,-0.279333,-0.296667,1.651333,-0.356,1.156667,1.202667,0.446667,0.696667,0.625036
6,48,Afghanistan,202,Kandahar,2011_01,2011,1,2.0,,0.09262,...,0.334333,-0.847,1.27,0.744333,1.328667,-0.705,0.738,0.358,0.357667,3.142909
7,51,Afghanistan,202,Kandahar,2011_04,2011,4,2.0,2.0,0.131462,...,-0.792159,-0.083854,-0.22937,0.050279,0.181986,-0.413594,0.278123,-0.23554,-0.8176,4.219678
8,54,Afghanistan,202,Kandahar,2011_07,2011,7,1.0,1.0,0.106885,...,-0.696,-0.588667,0.714,1.049667,-0.59,-0.759667,-0.057,0.993333,1.217667,0.367243
9,57,Afghanistan,202,Kandahar,2011_10,2011,10,1.0,1.0,0.103268,...,-0.851667,0.421,1.312,0.010667,1.056333,1.055333,0.844,0.652,-0.213,0.848252


In [69]:
add_agg_factors(t_variant_traditional_factors, level='country')
add_agg_factors(t_invariant_traditional_factors, level='province')
add_agg_factors(t_invariant_traditional_factors, level='country')

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,carbon_0_country,mayhem_0_country,dehydrated_0_country,mismanagement_0_country,rain_mean_province,rain_mean_country,area_province,pasture_pct_province,area_country,pasture_pct_country
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,1.265333,0.667,0.173667,-0.073,0.353588,0.353588,54174.53381,16.246279,54174.53381,16.246279
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.765146,-0.63658,-0.671587,-0.510467,0.409304,0.409304,54174.53381,16.246279,54174.53381,16.246279
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,0.058333,1.447667,-0.676,0.530333,3.894158,3.894158,54174.53381,16.246279,54174.53381,16.246279
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,0.026073,-0.594877,-0.62054,-1.0116,1.609664,1.609664,54174.53381,16.246279,54174.53381,16.246279
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,-0.673479,-0.709913,-0.787921,-0.611133,0.393834,0.393834,54174.53381,16.246279,54174.53381,16.246279
5,45,Afghanistan,202,Kandahar,2010_10,2010,10,2.0,,0.095377,...,1.156667,1.202667,0.446667,0.696667,0.625036,0.625036,54174.53381,16.246279,54174.53381,16.246279
6,48,Afghanistan,202,Kandahar,2011_01,2011,1,2.0,,0.09262,...,-0.705,0.738,0.358,0.357667,3.142909,3.142909,54174.53381,16.246279,54174.53381,16.246279
7,51,Afghanistan,202,Kandahar,2011_04,2011,4,2.0,2.0,0.131462,...,-0.413594,0.278123,-0.23554,-0.8176,4.219678,4.219678,54174.53381,16.246279,54174.53381,16.246279
8,54,Afghanistan,202,Kandahar,2011_07,2011,7,1.0,1.0,0.106885,...,-0.759667,-0.057,0.993333,1.217667,0.367243,0.367243,54174.53381,16.246279,54174.53381,16.246279
9,57,Afghanistan,202,Kandahar,2011_10,2011,10,1.0,1.0,0.103268,...,1.055333,0.844,0.652,-0.213,0.848252,0.848252,54174.53381,16.246279,54174.53381,16.246279


In [70]:
time_series.to_csv('agg_province_features.csv')

# Add time lagged features


In [71]:
time_series = add_time_lagged(t_variant_traditional_factors)

In [72]:
time_series = add_time_lagged(news_factors)

  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]


In [73]:
time_series = add_time_lagged(['fews_ipc'], end=21, diff=3, agg=False)

  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]
  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]


In [74]:
time_series = add_time_lagged(['fews_proj_near'], start=3, end=4, diff=1, agg=False)

  time_series[lagged_col] = [get_lagged(row, f_s, t, ts_dict) for _, row in time_series.iterrows()]


In [75]:
def diebold_mariano(preds, labels):
    sq_error = [(p-l)**2 for p,l in zip(preds, labels)]
    mean = np.mean(sq_error)
    n = len(preds)
    gammas = {}
    m = max(n,int(math.ceil(np.cbrt(n))+2))
    for k in range(m):
        gammas[k] = 0
        for i in range(k+1, n):
            gammas[k] += (sq_error[i] - mean)*(sq_error[i-k] - mean)
        gammas[k] = gammas[k]/n
    sum_gamma = gammas[0]
    for k in range(1, m):
        sum_gamma += 2*gammas[k]
    return np.sqrt(sum_gamma/n)

In [76]:
time_series.columns.shape

(3562,)

In [77]:
time_series.to_csv("our_results2.csv")

In [78]:
time_series.head()

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,mismanagement_0_country_6,mismanagement_0_country_7,mismanagement_0_country_8,fews_ipc_3,fews_ipc_6,fews_ipc_9,fews_ipc_12,fews_ipc_15,fews_ipc_18,fews_proj_near_3
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,-0.073,-0.073,-0.073,1.0,1.0,1.0,1.0,1.0,1.0,
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.510467,-0.510467,-0.510467,1.0,1.0,1.0,1.0,1.0,1.0,
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,0.530333,0.530333,0.530333,1.0,2.0,2.0,2.0,1.0,2.0,
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-1.0116,-1.0116,-1.0116,2.0,1.0,2.0,2.0,2.0,1.0,
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,-0.611133,-0.611133,-0.611133,1.0,1.0,1.0,1.0,1.0,1.0,


In [79]:
# find columns that contain a particular substring

list(filter(lambda x: 'co' in x, time_series.columns))

['country',
 'admin_code',
 'acled_count',
 'continued deterioration_0',
 'economic impoverishment_0',
 'conflict_0',
 'collapsing economy_0',
 'corrupt government_0',
 'continued strife_0',
 'ecological crisis_0',
 'coup_0',
 'economic crisis_0',
 'corruption_0',
 'collapse of government_0',
 'devastated the economy_0',
 'convoys_0',
 'continued deterioration_0_province',
 'economic impoverishment_0_province',
 'conflict_0_province',
 'collapsing economy_0_province',
 'corrupt government_0_province',
 'continued strife_0_province',
 'ecological crisis_0_province',
 'coup_0_province',
 'economic crisis_0_province',
 'corruption_0_province',
 'collapse of government_0_province',
 'devastated the economy_0_province',
 'convoys_0_province',
 'land seizures_0_country',
 'slashed export_0_country',
 'price rise_0_country',
 'mass hunger_0_country',
 'cyclone_0_country',
 'failed crops_0_country',
 'disruption to farming_0_country',
 'massive starvation_0_country',
 'abnormally low rainfall_

In [80]:
# find columns that begin with a particular substring
list(filter(lambda x: x.startswith("coun"), time_series.columns))


['country']

# Generate and save data


In [81]:
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import root_mean_squared_error, precision_recall_curve, auc

train_splits = [((2009,7), (2010,4)), ((2009,7), (2011,1)), ((2009,7), (2011,10)), 
                ((2009,7), (2012,7)), ((2009,7), (2013,7)), ((2009,7), (2014,1)), 
                ((2009,7), (2015,1)), ((2009,7), (2015,10)), ((2009,7), (2016,10)), 
                ((2009,7), (2017,2))]

dev_splits = [((2010,4), (2010, 7)), ((2011,1), (2011, 7)), ((2011,10), (2012, 7)), 
              ((2012,7), (2013, 7)), ((2013,4), (2014, 7)), ((2014,1), (2015, 7)), 
              ((2015,1), (2016, 7)), ((2015,10), (2017, 7)), ((2016,10), (2018, 7)), 
              ((2017,2), (2019, 2))]

test_splits = [((2010,7), (2011, 7)), ((2011,7), (2012, 7)), ((2012,7), (2013, 7)), 
               ((2013,7), (2014, 7)), ((2014,7), (2015, 7)), ((2015,7), (2016, 7)), 
               ((2016,7), (2017, 7)), ((2017,7), (2018, 7)), ((2018,7), (2019, 7)), 
               ((2019,2), (2020, 2))]

# just like them we will evaluate three dufferent models, Random Forest, OLS and Lasso. Random Forest is a tree-based model, OLS is a linear regression model and Lasso is a linear regression model with L1 regularization
models = {
    'RF': RandomForestRegressor(max_features='sqrt', n_estimators=100, min_samples_split=0.5, min_impurity_decrease=0.001, random_state=0),
    'OLS': LinearRegression(),
    'Lasso': Lasso(alpha=0.1)
}

def get_agg_lagged_features(factors):
    return [f"{f}_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_province_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_country_{t}" for f in factors for t in range(3, 9)]

features = {
    'traditional': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors],
    
    'news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(news_factors)],
    
    'traditional+news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors + 
        get_agg_lagged_features(news_factors)]
}

labels_df = time_series[['fews_ipc', 'year', 'month']]

def get_time_split(df, start, end):
    return df[
        (((df['year'] > start[0])) | ((df['year'] == start[0]) & (df['month'] >= start[1]))) &
        (((df['year'] < end[0])) | ((df['year'] == end[0]) & (df['month'] <= end[1])))
    ]

thresholds = {
    'traditional': (2.236, 3.125), 'news': (1.907, 2.712), 'traditional+news': (2.105, 3.314),
} # (lowerbound, upperbound)

def train_and_evaluate(train, dev, test, f, D):
    results = []

    X_train = get_time_split(D, train[0], dev[1]).drop(columns=['year', 'month']).fillna(0).to_numpy() # not sure how okay it is to do fillna. When me and Bilal were running this we were getting the error that cannot run the model on NaN values. First we dropped na but this was causing the shape of the X_train to be different from the y_train. So we decided to fillna with 0. - aysha & bilal
    y_train = get_time_split(labels_df, train[0], dev[1]).drop(columns=['year', 'month']).to_numpy().ravel()
    
    X_test = get_time_split(D, test[0], test[1]).drop(columns=['year', 'month']).fillna(0).to_numpy()
    y_test = get_time_split(labels_df, test[0], test[1]).drop(columns=['year', 'month']).to_numpy().ravel()
    
    # convert y_test into binary classification (1 if inside threshold, else 0)
    lower, upper = thresholds[f]
    y_test_binary = np.where((y_test >= lower) & (y_test <= upper), 1, 0)

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        rmse = root_mean_squared_error(y_test, preds)

        stderr = np.std(y_test - preds) / np.sqrt(len(y_test))
        upper_bound = np.sqrt(rmse**2 + 1.96 * stderr)
        lower_bound = np.sqrt(rmse**2 - 1.96 * stderr)

        precision, recall, _ = precision_recall_curve(y_test_binary, preds)
        aucpr = auc(recall, precision)

        results.append({
            'method': name, 'split': test, 'features': f, 
            'rmse': rmse, 'lower_bound': lower_bound, 'upper_bound': upper_bound,
            'aucpr': aucpr
        })

        print(f"Method: {name}, Split: {test}, Features: {f}, AUCPR: {aucpr:.4f}")
        print(f"Method: {name}, Split: {test}, Features: {f}, RMSE: {rmse:.4f} [{lower_bound:.4f}, {upper_bound:.4f}]")
        
        # completely removed the part where they were doing country-wise evaluation. Do not see point - aysha
    
    return results

# run in parallel on 4 cpu cores/decrease this if you do not want ur system to crash (speaking from experience)
all_results = Parallel(n_jobs=4)(
    delayed(train_and_evaluate)(train, dev, test, f, D) for train, dev, test in zip(train_splits, dev_splits, test_splits) for f, D in features.items()
)

fig_3a = pd.DataFrame([res for sublist in all_results for res in sublist])
fig_3a.to_csv('fig_3a.csv', index=False)




Method: RF, Split: ((2011, 7), (2012, 7)), Features: traditional, AUCPR: 0.5000
Method: RF, Split: ((2011, 7), (2012, 7)), Features: traditional, RMSE: 0.6266 [0.2275, 0.8565]
Method: OLS, Split: ((2011, 7), (2012, 7)), Features: traditional, AUCPR: 0.5000
Method: OLS, Split: ((2011, 7), (2012, 7)), Features: traditional, RMSE: 4.0611 [3.6230, 4.4564]
Method: Lasso, Split: ((2011, 7), (2012, 7)), Features: traditional, AUCPR: 0.5000
Method: Lasso, Split: ((2011, 7), (2012, 7)), Features: traditional, RMSE: 0.7176 [0.3812, 0.9405]
Method: RF, Split: ((2010, 7), (2011, 7)), Features: traditional, AUCPR: 0.5000
Method: RF, Split: ((2010, 7), (2011, 7)), Features: traditional, RMSE: 0.3683 [nan, 0.6549]
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: traditional, AUCPR: 0.5000
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: traditional, RMSE: 0.3164 [nan, 0.5964]
Method: Lasso, Split: ((2010, 7), (2011, 7)), Features: traditional, AUCPR: 0.5000
Method: Lasso, Split: ((2010, 7),



Method: RF, Split: ((2010, 7), (2011, 7)), Features: news, AUCPR: 0.9028
Method: RF, Split: ((2010, 7), (2011, 7)), Features: news, RMSE: 0.4091 [nan, 0.7131]
Method: RF, Split: ((2010, 7), (2011, 7)), Features: traditional+news, AUCPR: 0.5000
Method: RF, Split: ((2010, 7), (2011, 7)), Features: traditional+news, RMSE: 0.4435 [nan, 0.7491]
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: news, AUCPR: 0.7639
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: news, RMSE: 0.4581 [nan, 0.7787]
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: traditional+news, AUCPR: 0.5000
Method: OLS, Split: ((2010, 7), (2011, 7)), Features: traditional+news, RMSE: 0.4167 [nan, 0.7311]
Method: Lasso, Split: ((2010, 7), (2011, 7)), Features: news, AUCPR: 0.9028
Method: Lasso, Split: ((2010, 7), (2011, 7)), Features: news, RMSE: 0.5171 [nan, 0.8353]
Method: Lasso, Split: ((2010, 7), (2011, 7)), Features: traditional+news, AUCPR: 0.5000
Method: Lasso, Split: ((2010, 7), (2011, 7)), Features: tra



Method: RF, Split: ((2012, 7), (2013, 7)), Features: news, AUCPR: 0.5000
Method: RF, Split: ((2012, 7), (2013, 7)), Features: news, RMSE: 0.3876 [0.2015, 0.5098]
Method: OLS, Split: ((2012, 7), (2013, 7)), Features: news, AUCPR: 0.5000
Method: OLS, Split: ((2012, 7), (2013, 7)), Features: news, RMSE: 0.3448 [nan, 0.5946]
Method: Lasso, Split: ((2012, 7), (2013, 7)), Features: news, AUCPR: 0.5000
Method: Lasso, Split: ((2012, 7), (2013, 7)), Features: news, RMSE: 0.2283 [nan, 0.3744]
Method: RF, Split: ((2012, 7), (2013, 7)), Features: traditional+news, AUCPR: 0.5000
Method: RF, Split: ((2012, 7), (2013, 7)), Features: traditional+news, RMSE: 0.3317 [0.0851, 0.4613]
Method: OLS, Split: ((2012, 7), (2013, 7)), Features: traditional+news, AUCPR: 0.5000
Method: OLS, Split: ((2012, 7), (2013, 7)), Features: traditional+news, RMSE: 0.3183 [nan, 0.5574]
Method: Lasso, Split: ((2012, 7), (2013, 7)), Features: traditional+news, AUCPR: 0.5000
Method: Lasso, Split: ((2012, 7), (2013, 7)), Feature

ValueError: Found array with 0 sample(s) (shape=(0, 26)) while a minimum of 1 is required by RandomForestRegressor.