# 📊 **Re-Implementation of "Predicting Food Crises Using News Streams"**

---

#### 🔍 **Objective**

This notebook aims to **reproduce and analyze** the methodology presented in the paper:

📄 **Paper:** [Predicting food crises using news streams](https://www.science.org/doi/10.1126/sciadv.abm3449)  
📊 **Dataset:** [Harvard Dataverse Repository](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CJDWUW)  
📜 **Original Code & Methods:** [GitHub - Regression Modeling (Step 5)](https://github.com/philippzi98/food_insecurity_predictions_nlp/blob/main/Step%205%20-%20Regression%20Modelling/README.md)

---

#### 🛠 **Methodology**

This implementation follows the **key steps** outlined in the paper to predict **food insecurity crises** using a combination of:
1️⃣ **Traditional Risk Factors** (conflict, climate, food prices, etc.)  
2️⃣ **News-Based Indicators** (text feature frequencies from news articles)  
3️⃣ **Lagging & Aggregation** (temporal dependencies at district, province, and country levels)  
4️⃣ **Machine Learning Models** (Random Forest, OLS, Lasso)

---

#### 🔗 **Reference Materials**

📄 **Supplementary Material:** Available in `supplemental_material_from_paper.pdf`  
📊 **Datasets Used:**

- `time_series_with_causes_zscore_full.csv` (Main dataset with time-series features)
- `famine-country-province-district-years-CS.csv` (Food insecurity classification)
- `matching_districts.csv` (Geographical standardization)


# 📚🔧 Import Libraries

In this notebook, we will use uv to manage our Python environment and packages efficiently. uv is a modern and fast package manager that simplifies virtual environment creation, and dependency installation. We will create a virtual environment, install necessary libraries, and ensure our environment stays consistent across different setups.


In [44]:
## Uncoment the below cell to install `uv` if you have not already. You can also install it trhiugh `pip` by running `!pip install uv` but this will be within your current python environment and not globally.

# !curl -LsSf https://astral.sh/uv/install.sh | sh
# !uv venv world-bank
# !source world-bank/bin/activate

In [1]:
# !uv pip install -r requirements.txt

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display, Image
import os
import gdown
import zipfile
from fuzzywuzzy import fuzz
import math

In [3]:
url = "https://drive.google.com/uc?id=1YoQ1hz9RlaLr2xW3KoKCfJPyyO2PErym"
output = "data.zip"

if not os.path.exists("./data"):
    gdown.download(url, output, quiet=False) 
    zipfile.ZipFile('data.zip', 'r').extractall()
else:
    print("You already have the data downloaded and extracted")

You already have the data downloaded and extracted


## 📂 Load and Clean Data

**Understanding the Time-Series Dataset & Column Selection**

This dataset contains **district-level time-series data** on food insecurity risk factors, including:

- **📅 Temporal Information:** `year`, `month`, `year_month`
- **📍 Geographical Identifiers:** `admin_code`, `admin_name`, `province`, `country`
- **🌍 Traditional Risk Factors:** Climate (`rain_mean`, `ndvi_mean`), conflict (`acled_count`), food prices (`p_staple_food`)
- **📰 News-Based Indicators:** Proportions of news articles mentioning crisis-related keywords (`conflict_0`, `famine_0`, etc.)
- **📉 Food Insecurity Label:** `fews_ipc` (Integrated Phase Classification)

🔥 **Columns We Will Drop & Why**
✔ **Redundant Aggregations:** `_1`, `_2` columns (province & country-level values) since we will recompute aggregations from scratch anyways.  
✔ **Unnamed/Index Columns:** `Unnamed: 0` as it is unnecessary. It is just a duplicate of default index.
✔ **Unnecessary Identifiers:** If `admin_code` and `admin_name`, after matching these to `matching_districts.csv`, we can drop them.

---

> ⚠️ **NOTE:**  
> For a detailed explanation of the dataset and features, refer to the [`explore_time_series.ipynb`](./explore_time_series.ipynb) notebook.


In [4]:
time_series = pd.read_csv('./data/time_series_with_causes_zscore_full.csv')
admins = pd.read_csv('./data/famine-country-province-district-years-CS.csv')
valid_matching = pd.read_csv('./data/matching_districts.csv')

In [5]:
# sorted(time_series.columns.values)

In [6]:
time_series.head(5)

Unnamed: 0.1,Unnamed: 0,index,country,admin_code,admin_name,centx,centy,year_month,year,month,...,carbon_2,mayhem_0,mayhem_1,mayhem_2,dehydrated_0,dehydrated_1,dehydrated_2,mismanagement_0,mismanagement_1,mismanagement_2
0,0,30,Afghanistan,202,Kandahar,65.709343,31.043618,2009_07,2009,7,...,1.053,0.667,-0.171,-0.833,0.173667,0.168,1.284667,-0.073,-0.427667,0.668333
1,1,33,Afghanistan,202,Kandahar,65.709343,31.043618,2009_10,2009,10,...,-0.660812,-0.63658,-0.520247,-0.782913,-0.671587,-0.612254,-0.926921,-0.510467,-0.625133,-0.452467
2,2,36,Afghanistan,202,Kandahar,65.709343,31.043618,2010_01,2010,1,...,-0.134333,1.447667,-0.844333,0.778667,-0.676,-0.689667,0.293333,0.530333,-0.471333,0.955333
3,3,39,Afghanistan,202,Kandahar,65.709343,31.043618,2010_04,2010,4,...,-0.326927,-0.594877,0.16479,-0.90521,-0.62054,0.165794,0.045794,-1.0116,-0.8106,-0.2056
4,4,42,Afghanistan,202,Kandahar,65.709343,31.043618,2010_07,2010,7,...,-1.085146,-0.709913,-0.867913,-0.770247,-0.787921,-0.974587,-0.946921,-0.611133,-0.7098,-0.6228


In [7]:
t_variant_traditional_factors = ['ndvi_mean', 'ndvi_anom', 'rain_mean', 'rain_anom', 'et_mean', 'et_anom', 
                                    'acled_count', 'acled_fatalities', 'p_staple_food']
t_invariant_traditional_factors = ['area', 'cropland_pct', 'pop', 'ruggedness_mean', 'pasture_pct']
news_factors = [name for name in time_series.columns.values if '_0' in name]

In [8]:
news_factors[0]

'land seizures_0'

In [9]:
print("Columns count BEFORE dropping: ", len(time_series.columns.values))

Columns count BEFORE dropping:  532


In [10]:
cols_to_drop = ["Unnamed: 0", "centx", "centy", 'change_fews', 'fews_ha', 'fews_proj_med', 'fews_proj_med_ha', 'fews_proj_near_ha'] + [col for col in time_series.columns if col.endswith(('_1', '_2', '_3'))]
time_series.drop(columns=cols_to_drop, inplace=True)

In [11]:
potential_extra_cols = set(time_series.columns.values) - set(t_variant_traditional_factors) - set(t_invariant_traditional_factors) - set(news_factors)
potential_extra_cols = [col for col in potential_extra_cols if not col.endswith(('_1', '_2', '_3'))]
print("Potential extra columns", potential_extra_cols)

Potential extra columns ['country', 'year_month', 'admin_code', 'admin_name', 'fews_proj_near', 'year', 'index', 'fews_ipc', 'month']


In [12]:
print("Columns count after dropping: ", len(time_series.columns.values))

Columns count after dropping:  190


### 🌍 Admin Level Mapping: Standardizing Geographical Identifiers

In this section, we will **map and standardize** the `admin_code` and `admin_name` fields to their corresponding **district, province, and country names**. This step is **crucial** for ensuring **consistency** across different datasets and enabling **accurate aggregations** at multiple administrative levels.

🛠 **Why is Admin Level Mapping Important?**
✅ Different datasets may use **slightly different spellings or formats** for district names.  
✅ Some district names might be **missing or misspelled**, requiring standardization.  
✅ We need to **match and align** district names across various sources before aggregating at **province and country levels**.  
✅ Proper mapping allows us to **merge datasets correctly** without losing information.  

📌 **Steps in Admin Mapping**
1️⃣ **Load the `matching_districts.csv` file**, which provides the mapping between different district name variations.  
2️⃣ **Identify missing or unmatched `admin_name` values** and find their closest matches using fuzzy matching techniques.  
3️⃣ **Ensure that each `admin_code` uniquely maps to one `district`, `province`, and `country`.**  
4️⃣ **Replace inconsistent names** in the dataset with their standardized versions.  
5️⃣ **Aggregate data at the `province` and `country` levels** after ensuring all districts are correctly mapped.  


In [13]:
len(admins.country.unique())

39

In [14]:
admins.columns.values

array(['Unnamed: 0', 'country', 'district', 'year', 'month', 'CS',
       'province'], dtype=object)

In [15]:
admin_names = time_series['admin_name'].unique()
districts = admins['district'].unique()
provinces = admins['province'].unique()
countries = admins['country'].unique()

In [16]:
print (len(admin_names), len(districts), len(provinces), len(countries))
print (len(set(admin_names).difference(districts)))
missing_admin_names = set(admin_names).difference(districts)
print (len(missing_admin_names.difference(provinces)))
missing_admin_names = missing_admin_names.difference(provinces)

1142 4113 474 39
369
230


### Fuzzy String Matching for Missing Names

The function uses **fuzzy string matching** to find the best approximate matches for missing administrative names (e.g., districts and provinces). 

- Finds the **best matching district/province** for each missing name.
- Uses **fuzzy string matching** to calculate the similarity between missing names and known names.
- Returns a dictionary that maps each missing name to its closest match.


In [17]:
def find_matching(missing, names):
    matching_districts = {}
    for m in missing:
        max_overlap = 0
        nearest_d = None
        for d in names:
            d = str(d)
            dist = fuzz.partial_ratio(m, d)
            if dist > max_overlap:
                max_overlap = dist
                nearest_d = d
        matching_districts[m] = nearest_d
    return matching_districts


matching = find_matching(missing_admin_names, districts)
matching_p = find_matching(missing_admin_names, provinces)

# manually verify matching and update
for k in matching.keys():
    print (k, matching[k], matching_p[k])


Saint-Raphael Saint Raphael Santa Rosa
North al Gazera Ganze North
Mangwe (South) Mangwe Southern
Agnuak Awgu Ouaka
Nyala.1 Nyala Nyamira
Balleyara Bale Mara
La Pendé La Pende Lac
Muranga Mkuranga Murang'a
Al Rahd El Rahad Al Mahrah
Chegutu Chegutu Rural Hodh ech Chargui
Khartoum Bahri Khartoum Khartoum
Mwene-Ditu City of Mwene-Ditu Kitui
Bankilaré Bankilare Sila
Adan Yabaal Aadan Yabaal `Adan
Sharg En Nile Sahar Niger
Illéla Illela Sila
Tanganyka Tanga Tanga
Gonave La Gonave Gao
Siti Sirisia Simiyu
Galdogob Goldogob Edo
Hamashkorieb Hamashkoreib Ghor
North Shewa(R4) North Shewa North
Nahr Atbara Atbara Mara
Al Marawi'ah Marawi Mara
East al Gazera Ganze Gaza
Merawi Marawi Mara
Mt Elgon Mt. Elgon Khatlon
Ville de Niamey Ndia Niamey
Sheikan Shiekan Shinyanga
Gedio Gedeo Gedo
Chiredzi Chiredzi Rural Moyen-Chari
En Nuhud Al Nuhud Sud
Nandi South Nnewi South Nandi
Burtinle Butinle Iilemi triangle
Baw Bahr El Arab Western Bahr el Ghazal
As Salam Shar`ab As Salam Dar es Salaam
Belet Weyne Bal

### Encoding Decoding

`to_ascii_escaped(s)`: Converts a Unicode string to an ASCII-safe representation using **unicode-escape**.

`from_ascii_escaped(escaped)`: Converts the escaped ASCII string back into its original Unicode form.

In [18]:
def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    # Using 'unicode-escape' encoding produces a bytes object,
    # then decode it to get an ASCII string.
    return s.encode('unicode-escape').decode('ascii')

def from_ascii_escaped(escaped):
    """
    Convert the ASCII-escaped string back to the original Unicode string.
    """
    # Encode the ASCII string to bytes, then decode using 'unicode-escape'
    return escaped.encode('ascii').decode('unicode-escape')


### Finding the Province for a Given District or Province

`find_province(x)`, finds the **province** corresponding to a given administrative name. It accounts for:
- **Direct Lookups** (Exact match in known district/province lists)
- **Fuzzy Matching** (Using ASCII-safe transformation for inconsistent text encoding)
- **Validation Against a Predefined Mapping (`valid_matching`)**

In [19]:
# Define matched globally
matched = valid_matching['missing'].unique()

def to_ascii_escaped(s):
    """
    Convert a Unicode string to an ASCII-safe string using unicode-escape.
    This will replace non-ASCII characters with their escape sequences.
    """
    if isinstance(s, bytes):
        s = s.decode('utf-8')
    return s.encode('unicode-escape').decode('ascii')

def find_province(x):
    try:
        # Ensure x is a Unicode string.
        if isinstance(x, bytes):
            x = x.decode('utf-8')
        
        # Direct lookup in districts or provinces.
        if x in districts:
            return admins[admins['district'] == x]['province'].values[0]
        elif x in provinces:
            return x

        # Convert x to an ASCII-escaped version.
        escaped_x = to_ascii_escaped(x)
        
        # Check if the escaped version is in matched.
        if escaped_x in matched:
            v = valid_matching[valid_matching['missing'] == escaped_x]
            if v['match'].values[0] == 'district':
                x2 = v['district'].values[0]
                return admins[admins['district'] == x2]['province'].values[0]
            elif v['match'].values[0] == 'province':
                return v['province'].values[0]
        
        # If no conditions are met, raise an exception.
        raise Exception("No matching province found")
    except Exception as e:
        raise Exception("Province not found for: {} ({})".format(x, e))


### Handling Admin Names with Accented Characters and Mapping to Provinces

Maps `admin_names` to provinces using the `find_province(a)` function.  
If a **direct lookup fails**, it tries to handle cases where the **admin name contains accented characters** (`é`, `è`, `ô`) ->  (encoding decoding issues resolved through directly replacing these with 'e' or 'o', leads to finding a valid match). 

In [20]:
admin_to_province = {}
for a in admin_names:
    try:
        admin_to_province[a] = find_province(a)
    except Exception as e:
        # Print the admin name that caused an error
        print("Error with:", a)
        # Check if a contains accented characters "é" or "è"
        if 'é' in a or 'è' in a or 'ô' in a:
            a_modified = a.replace('é', 'e').replace('è', 'e').replace('ô', 'o')
            # Check if the modified name is in districts
            if a_modified in districts:
                # Use the modified name to look up the province from admins
                try:
                    province = admins[admins['district'] == a_modified]['province'].values[0]
                    admin_to_province[a] = province
                    print(f"Replaced '{a}' with '{a_modified}', found province: {province}")
                except Exception as ex:
                    print(f"Modified name '{a_modified}' not found in admins: {ex}")
            else:
                print(f"Modified name '{a_modified}' not in districts.")
        else:
            print(f"No accented e found in '{a}'.")


Error with: Mangalmé
Replaced 'Mangalmé' with 'Mangalme', found province: Guera
Error with: La Pendé
Replaced 'La Pendé' with 'La Pende', found province: Logone Oriental
Error with: La Nya Pendé
Replaced 'La Nya Pendé' with 'La Nya Pende', found province: Logone Oriental
Error with: Lac-Léré
Replaced 'Lac-Léré' with 'Lac-Lere', found province: Mayo-Kebbi Ouest
Error with: Barh-Kôh
Replaced 'Barh-Kôh' with 'Barh-Koh', found province: Moyen-Chari
Error with: Aguié
Replaced 'Aguié' with 'Aguie', found province: Maradi
Error with: Bankilaré
Replaced 'Bankilaré' with 'Bankilare', found province: Tillaberi
Error with: Filingué
Replaced 'Filingué' with 'Filingue', found province: Tillaberi
Error with: Gothèye
Replaced 'Gothèye' with 'Gotheye', found province: Tillaberi
Error with: Gouré
Replaced 'Gouré' with 'Goure', found province: Zinder
Error with: Illéla
Replaced 'Illéla' with 'Illela', found province: Sokoto
Error with: Kantché
Replaced 'Kantché' with 'Kantche', found province: Zinder
Er

### Mapping Administrative Names to Provinces in time_series

Maps `admin_name` to their respective **provinces** using a precomputed dictionary - >`admin_to_province` in `time_series`.


In [21]:
time_series['province'] = time_series['admin_name'].apply(
    lambda x: admin_to_province[x] if x in admin_to_province else admin_to_province.get(x.replace('ô', 'o'))
)


In [22]:
# time_series[["admin_name", "province"]]

# ⏳ Time Lagging & Feature Engineering

#### 📅 **Why Use Lagging?**

To predict food insecurity **for a given quarter**, we use:

- **6 months of historical values** for traditional & news-based features.
- **Province & country-level aggregations** to capture broader shocks.
- **6 quarters of lagged IPC phase values** to model temporal dependencies.

#### ⚡ **Optimized Lagging Approach**

To improve computational efficiency, we:
✔ Use `groupby()` for **fast province & country-level aggregations**.  
✔ Merge lagged data via `merge()` instead of slow `.apply()`.  
✔ Only keep **past data** to ensure no data leakage.


In [23]:
def add_time_lagged(features, start=3, end=9, diff=1, agg=True, time_series=time_series):
    levels = ['', '_province', '_country'] if agg else ['']
    
    # Work on a copy to avoid modifying the original during processing
    working_df = time_series.copy()
    
    # Precompute a mapping for each feature (with its suffix) for fast lookups.
    # For each row, its lookup key will be: admin_code + '_' + year_month.
    lookup_maps = {}  # dict mapping f_s -> mapping dict
    for suffix in levels:
        for f in features:
            f_s = f + suffix
            # Build a mapping from key to first occurrence of f_s value.
            # Key: admin_code + '_' + year_month
            keys = working_df['admin_code'].astype(str) + '_' + working_df['year_month'].astype(str)
            # If there are duplicates, the first occurrence will be used.
            mapping = dict(zip(keys, working_df[f_s]))
            lookup_maps[f_s] = mapping

    # Prepare list to collect all new columns (as Series)
    new_cols = {}
    
    # Process each feature and lag combination
    for suffix in levels:
        for f in features:
            f_s = f + suffix
            mapping = lookup_maps[f_s]
            for t in range(start, end, diff):
                col_name = f"{f_s}_{t}"
                if col_name in time_series.columns:
                    continue
                
                # Compute lagged month and lagged year (vectorized)
                month = working_df['month']
                year = working_df['year']
                l_month = ((month - 1 - t) % 12) + 1
                l_year = np.where(month - t <= 0, year - 1, year) # If (month - t) is less than or equal to 0 (i.e., you’ve gone into the previous year), then l_year is year - 1; otherwise, it remains year.
                
                # Build the reference key: admin_code + '_' + "{l_year}_{l_month}"
                ref_key = working_df['admin_code'].astype(str) + '_' + \
                          l_year.astype(str) + '_' + \
                          l_month.astype(str)
                
                # Map the reference key to the lagged feature values using our precomputed mapping.
                # Where no match is found, use the current value from working_df[f_s].
                lagged_values = ref_key.map(mapping)
                lagged_values = lagged_values.fillna(working_df[f_s])
                
                # Store the new column in our dictionary (preserving the original index)
                new_cols[col_name] = lagged_values
                
    # If any new columns were created, add them to the original time_series DataFrame.
    if new_cols:
        new_cols_df = pd.DataFrame(new_cols, index=working_df.index)
        time_series = pd.concat([time_series, new_cols_df], axis=1)
        
    return time_series


# Province & Country-Level Aggregation

This function aggregates feature values at the province and country levels to capture regional trends, aiding in food insecurity prediction. The process includes:

- **Grouping by year_month and level:** Data is grouped by year_month and the specified level (province or country) to calculate the mean of features, reflecting regional trends over time.

- **Applying transformations efficiently:** Instead of merging aggregated data, `transform("mean")` is used to directly assign the computed mean to each row, avoiding unnecessary joins and improving performance.  

#### ⚡ **Efficiency Gains**

- **Fast Aggregation**: Uses `groupby()` for efficient aggregation.
- **Avoids Costly Joins**: Eliminates the need for `merge()` by using `transform()` instead, reducing computational overhead.  
- **Memory Efficiency**: Converts the `level` column to a categorical type to reduce memory usage.

This approach ensures faster processing while maintaining the quality of aggregated features.


In [24]:
def add_agg_factors(features, level='province'):
    global time_series  

    # Convert 'level' column to categorical for performance
    time_series[level] = time_series[level].astype('category')
    
    # Compute grouped mean values for the given features
    # TODO : Explain these arguments
    grouped_df = time_series.groupby(['year_month', level], observed=True, sort=False)[features].transform("mean")

    # Rename columns to include level
    grouped_df = grouped_df.rename(columns={f: f"{f}_{level}" for f in features})

    # Use pd.concat() to add all columns at once, avoiding fragmentation
    time_series = pd.concat([time_series, grouped_df], axis=1)

    return time_series


In [25]:
t = add_agg_factors(news_factors)

In [26]:
t = add_agg_factors(news_factors, level='country')
time_series.head(10)

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,gastrointestinal_0_country,terrorist_0_country,warlord_0_country,d'etat_0_country,overthrow_0_country,convoys_0_country,carbon_0_country,mayhem_0_country,dehydrated_0_country,mismanagement_0_country
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,0.009191,-0.196791,-0.277796,-0.080313,-0.158093,-0.091979,-0.205168,-0.351945,-0.004046,0.020729
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.123518,-0.169664,-0.039284,0.096598,-0.145231,-0.058545,0.024883,0.039075,-0.060053,-0.11289
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,0.194285,-0.051662,0.11223,0.271666,0.351302,0.17547,0.236345,0.159935,0.198416,0.409551
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-0.073142,-0.122469,0.135643,0.204327,0.101177,-0.144836,0.22267,0.156061,0.27463,0.180825
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,0.15815,-0.068429,-0.161717,-0.187331,-0.13909,-0.142929,-0.010496,-0.081932,0.007388,0.138598
5,45,Afghanistan,202,Kandahar,2010_10,2010,10,2.0,,0.095377,...,0.01429,-0.021839,0.010606,-0.085573,0.147253,-0.040194,0.05393,-0.185165,0.172638,-0.085029
6,48,Afghanistan,202,Kandahar,2011_01,2011,1,2.0,,0.09262,...,-0.059419,-0.050372,-0.144964,-0.140235,-0.056617,-0.232453,-0.117227,0.117,-0.111434,-0.100282
7,51,Afghanistan,202,Kandahar,2011_04,2011,4,2.0,2.0,0.131462,...,-0.283481,-0.311409,-0.295271,-0.370462,-0.298859,-0.246249,-0.109567,-0.168876,-0.335914,-0.276873
8,54,Afghanistan,202,Kandahar,2011_07,2011,7,1.0,1.0,0.106885,...,0.198956,0.232582,0.165008,0.475172,0.257077,0.246208,0.089489,0.217305,0.162207,0.096265
9,57,Afghanistan,202,Kandahar,2011_10,2011,10,1.0,1.0,0.103268,...,0.456875,-0.102239,0.338327,0.302879,0.120221,0.354626,0.248271,0.316393,0.335178,0.232545


In [27]:
t = add_agg_factors(t_variant_traditional_factors, level='province')

In [28]:
t = add_agg_factors(t_variant_traditional_factors, level='country')
t = add_agg_factors(t_invariant_traditional_factors, level='province')
t = add_agg_factors(t_invariant_traditional_factors, level='country')

In [29]:
# time_series.to_csv('ours_agg_province_features_full.csv')

# Add time lagged features


In [30]:
time_series = add_time_lagged(t_variant_traditional_factors, time_series=time_series)

In [31]:
time_series.to_csv('ours_modifed_time_series_tvariant_all_rows.csv')

# raise Exception("Stop here")

In [32]:
time_series = add_time_lagged(news_factors, time_series=time_series)

In [33]:
time_series = add_time_lagged(['fews_ipc'], end=21, diff=3, agg=False, time_series=time_series)

In [34]:
time_series = add_time_lagged(['fews_proj_near'], start=3, end=4, diff=1, agg=False, time_series=time_series)

In [35]:
def diebold_mariano(preds, labels):
    sq_error = [(p-l)**2 for p,l in zip(preds, labels)]
    mean = np.mean(sq_error)
    n = len(preds)
    gammas = {}
    m = max(n,int(math.ceil(np.cbrt(n))+2))
    for k in range(m):
        gammas[k] = 0
        for i in range(k+1, n):
            gammas[k] += (sq_error[i] - mean)*(sq_error[i-k] - mean)
        gammas[k] = gammas[k]/n
    sum_gamma = gammas[0]
    for k in range(1, m):
        sum_gamma += 2*gammas[k]
    return np.sqrt(sum_gamma/n)

In [36]:
time_series.columns.shape

(3728,)

In [37]:
# time_series.to_csv("our_results_final_all_30.csv")

In [38]:
time_series.head()

Unnamed: 0,index,country,admin_code,admin_name,year_month,year,month,fews_ipc,fews_proj_near,ndvi_mean,...,mismanagement_0_country_6,mismanagement_0_country_7,mismanagement_0_country_8,fews_ipc_3,fews_ipc_6,fews_ipc_9,fews_ipc_12,fews_ipc_15,fews_ipc_18,fews_proj_near_3
0,30,Afghanistan,202,Kandahar,2009_07,2009,7,1.0,,0.106035,...,0.020729,0.020729,0.020729,1.0,1.0,1.0,1.0,1.0,1.0,
1,33,Afghanistan,202,Kandahar,2009_10,2009,10,1.0,,0.103009,...,-0.11289,-0.11289,-0.11289,1.0,1.0,1.0,1.0,1.0,1.0,
2,36,Afghanistan,202,Kandahar,2010_01,2010,1,2.0,,0.1096,...,0.409551,0.409551,0.409551,1.0,2.0,2.0,2.0,1.0,2.0,
3,39,Afghanistan,202,Kandahar,2010_04,2010,4,2.0,,0.111599,...,-0.11289,0.180825,0.180825,2.0,1.0,2.0,2.0,2.0,1.0,
4,42,Afghanistan,202,Kandahar,2010_07,2010,7,1.0,,0.096943,...,0.138598,0.138598,0.138598,1.0,1.0,1.0,1.0,1.0,1.0,


In [39]:
# raise Exception("Stop here")


In [40]:
# find columns that contain a particular substring

# list(filter(lambda x: 'co' in x, time_series.columns))

In [41]:
# find columns that begin with a particular substring
list(filter(lambda x: x.startswith("coun"), time_series.columns))


['country']

In [42]:
time_series[["fews_proj_near_3", "fews_proj_near", "year"]].to_csv("fews_proj_near_3.csv")

# Generate and save data


In [43]:
import numpy as np
import pandas as pd
from joblib import Parallel, delayed
# from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import root_mean_squared_error, precision_recall_curve, auc
from cuml.ensemble import RandomForestRegressor
import cudf
test_splits = [
    ((2010,7), (2011, 7)), 
    ((2011,7), (2012, 7)),
    ((2012,7), (2013, 7)), 
    ((2013,7), (2014, 7)), 
    ((2014,7), (2015, 7)), 
    ((2015,7), (2016, 7)), 
    ((2016,7), (2017, 7)), 
    ((2017,7), (2018, 7)),
    ((2018,7), (2019, 7)), 
    ((2019,2), (2020, 2)),
]
train_splits_old = [
    ((2009,7), (2010,4)),
    ((2009,7), (2011,1)),
    ((2009,7), (2011,10)),
    ((2009,7), (2012,7)),
    ((2009,7), (2013,7)),
    ((2009,7), (2014,1)),
    ((2009,7), (2015,1)),
    ((2009,7), (2015,10)),
    ((2009,7), (2016,10)),
    ((2009,7), (2017,2))]
dev_splits = [
    ((2010,4), (2010, 7)),
    ((2011,1), (2011, 7)),
    ((2011,10), (2012, 7)),
    ((2012,7), (2013, 7)),
    ((2013,4), (2014, 7)),
    ((2014,1), (2015, 7)),
    ((2015,1), (2016, 7)),
    ((2015,10), (2017, 7)),
    ((2016,10), (2018, 7)),
    ((2017,2), (2019, 2)),
]
# train_splits = train_splits_old + dev_splits
train_splits =  dev_splits

# just like them we will evaluate three dufferent models, Random Forest, OLS and Lasso. Random Forest is a tree-based model, OLS is a linear regression model and Lasso is a linear regression model with L1 regularization
models = {
    # 'RF': RandomForestRegressor(max_features='sqrt', n_estimators=100, min_samples_split=0.5, min_impurity_decrease=0.001, random_state=0)
    'RF': RandomForestRegressor(
        max_features='sqrt',  # Keep this
        n_estimators=500,  # Increase trees to reduce variance
        min_samples_split=5,  # 🚨 Fix this, should be an integer
        min_samples_leaf=2,  # Helps prevent overfitting
        max_depth=None,  # Allow full tree growth
        bootstrap=True,  # Default setting, makes it more robust
        random_state=0
    )
    # 'OLS': LinearRegression(),
    # 'Lasso': Lasso(alpha=0.1)
}

def get_agg_lagged_features(factors):
    return [f"{f}_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_province_{t}" for f in factors for t in range(3, 9)] + \
           [f"{f}_country_{t}" for f in factors for t in range(3, 9)]

features = {
    'traditional': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors],
    
    'news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(news_factors)],
    
    'traditional+news': time_series[['year', 'month'] + 
        [f"fews_ipc_{t}" for t in range(3, 21, 3)] + 
        get_agg_lagged_features(t_variant_traditional_factors) + 
        t_invariant_traditional_factors + 
        get_agg_lagged_features(news_factors)]
    
    # 'expert': time_series[['fews_proj_near_3' ] + ['year', 'month']],
    
    # 'expert+traditional': time_series[ ['year', 'month']+ 
    #     ['fews_proj_near_3'] +
    #     ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] + 
    #     get_agg_lagged_features(t_variant_traditional_factors) + 
    #     t_invariant_traditional_factors
    # ],
    # 'expert+news': time_series[ ['year', 'month'] +
    #     ['fews_proj_near_3'] +
    #     ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
    #     get_agg_lagged_features(news_factors)
    # ],
    # 'expert+traditional+news': time_series[ ['year', 'month'] +
    #     ['fews_proj_near_3'] +
    #     ['{}_{}'.format('fews_ipc', t) for t in range(3,21,3)] +
    #     get_agg_lagged_features(t_variant_traditional_factors) + 
    #     t_invariant_traditional_factors +
    #     get_agg_lagged_features(news_factors)
    # ]
}

labels_df = time_series[['fews_ipc', 'year', 'month']]

def get_time_split(df, start, end):
    return df[
        (((df['year'] > start[0])) | ((df['year'] == start[0]) & (df['month'] >= start[1]))) &
        (((df['year'] < end[0])) | ((df['year'] == end[0]) & (df['month'] <= end[1])))
    ]

thresholds = {'traditional': (2.236, 3.125), 
              'news': (1.907, 2.712), 
              'traditional+news': (2.105, 3.314),
            #   'expert': (2, 3),
            #   'expert+news': (1.912, 2.813),
            #   'expert+traditional': (2.241, 3.132),
            #   'expert+traditional+news': (2.172, 3.321)
             }

def train_and_evaluate(train, dev, test, f, D):
    results = []
    
    # D.to_csv(f"D_features_{f}.csv")
    # print("The feature is: ", f)
    # print("train split:", train)
    # print("Shape of D: ", D.shape)
    # print("All columns ", D.columns)
    

    X_train = get_time_split(D, train[0], train[1]).drop(columns=['year', 'month']).to_numpy()# not sure how okay it is to do fillna. When me and Bilal were running this we were getting the error that cannot run the model on NaN values. First we dropped na but this was causing the shape of the X_train to be different from the y_train. So we decided to fillna with 0. - aysha & bilal
    
    y_train = get_time_split(labels_df, train[0], train[1]).drop(columns=['year', 'month']).to_numpy().ravel()
    # print("The shape of X_train before removing nans is: ", X_train.shape)
    # shape_before = X_train.shape
    nan_mask = np.isnan(X_train).any(axis=1)
    X_train = X_train[~nan_mask]
    y_train = y_train[~nan_mask]
    
    
    X_test = get_time_split(D, test[0], test[1]).drop(columns=['year', 'month']).to_numpy()
    y_test = get_time_split(labels_df, test[0], test[1]).drop(columns=['year', 'month']).to_numpy().ravel()
    nan_mask_test = np.isnan(X_test).any(axis=1)
    X_test = X_test[~nan_mask_test]
    y_test = y_test[~nan_mask_test]
    
    # X_train.to_csv(f"X_train_{f}.csv")
    # print("The shape of X_train after removing nans is: ", X_train.shape)
    # print("The number of nans removed: ", shape_before[0] - X_train.shape[0])
    # print("fraction of nans", (shape_before[0]-X_train.shape[0])/shape_before[0])
    
    # print(f"Train is {X_train}")
    # print(f"Y train is {y_train}")
    # print(f"Test is {X_test}")
    # print(f"Y test is {y_test}")
    
    X_train = cudf.DataFrame(X_train)
    y_train = cudf.Series(y_train)
    X_test = cudf.DataFrame(X_test)
    y_test = cudf.Series(y_test)
    # convert y_test into binary classification (1 if inside threshold, else 0)
    lower, upper = thresholds[f]
    y_test_binary = np.where((y_test >= lower) & (y_test <= upper), 1, 0)

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)

        rmse = root_mean_squared_error(y_test, preds)

        # stderr = np.std(y_test - preds) / np.sqrt(len(y_test))
        # upper_bound = np.sqrt(rmse**2 + 1.96 * stderr)
        # lower_bound = np.sqrt(rmse**2 - 1.96 * stderr)

        # precision, recall, _ = precision_recall_curve(y_test_binary, preds)
        # aucpr = auc(recall, precision)

        results.append({
            'method': name, 'split': test, 'features': f, 
            'rmse': rmse,
            # 'rmse': rmse, 'lower_bound': lower_bound, 'upper_bound': upper_bound,
            # 'aucpr': aucpr
        })

        # print(f"Method: {name}, Split: {test}, Features: {f}, AUCPR: {aucpr:.4f}")
        # print(f"Method: {name}, Split: {test}, Features: {f}, RMSE: {rmse:.4f} [{lower_bound:.4f}, {upper_bound:.4f}]")
        print(f"Method: {name}, Split: {test}, Features: {f}, RMSE: {rmse:.4f}")
        
    #     # completely removed the part where they were doing country-wise evaluation. Do not see point - aysha
    
    return results

# run in parallel on 4 cpu cores/decrease this if you do not want ur system to crash (speaking from experience)
all_results = Parallel(n_jobs=4)(
    delayed(train_and_evaluate)(train, dev, test, f, D) for train, dev, test in zip(train_splits, dev_splits, test_splits) for f, D in features.items()
)


# all_results = []
# for train, dev, test in zip(train_splits, dev_splits, test_splits):
#     for f, D in features.items():
#         # print(f"Running for {f}")
        
#         # print(f"Train: {train}, Dev: {dev}, Test: {test}")
#         # # print(f"Features: {D.columns}")
#         # print(f"{D.shape}")
#         # print(f"{D}")
        
#         all_results.append(train_and_evaluate(train, dev, test, f, D))


fig_3a = pd.DataFrame([res for sublist in all_results for res in sublist])
fig_3a.to_csv('fig_3a.csv', index=False)


RuntimeError: Failed to dlopen libcudart.so.11.0

In [None]:
fig_3a.groupby(by=['method', 'features'])['rmse'].mean()

KeyError: 'method'

**If the train and the test dataset are the same:**
```
method  features        
RF      news                0.099734
        traditional         0.027667
        traditional+news    0.096902
Name: rmse, dtype: float64
```
**If we use their provided train-test split:**
```
method  features        
RF      news                0.400070
        traditional         0.134550
        traditional+news    0.390469
Name: rmse, dtype: float64
```

**If we use dev-test split:**
```
method  features        
RF      news                0.388210
        traditional         0.132620
        traditional+news    0.364279
Name: rmse, dtype: float64
```
**If we use train+dev-test split:**
```

```