# Process Surveys
We need to process the raw survey data so that we can use it to scrape images and as the basis for our models. From each LSMS survey we need two files: one containing the geovariables (latitude and longitude of each cluster) and one containing the consumption. Getting the data can be a bit tricky, because the two files are sometimes linked only through keys that live in other files.
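
To illustrate the linking problem described above, here is a minimal sketch of joining consumption to cluster coordinates through a key stored in a third table. All column names (`hhid`, `cluster_id`, `cons`) and values are hypothetical toy data, and the per-cluster averaging at the end is only an assumption about a typical next step, not necessarily what `LSMS.process_survey` does.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
geo = pd.DataFrame({
    "cluster_id": [1, 2],
    "lat": [13.5, 9.0],
    "lon": [2.1, 38.7],
})
cons = pd.DataFrame({
    "hhid": ["a", "b", "c"],
    "cons": [2.5, 4.0, 1.5],
})
# The link table maps each household to its cluster.
link = pd.DataFrame({
    "hhid": ["a", "b", "c"],
    "cluster_id": [1, 1, 2],
})

# Attach the cluster id to each household, then the cluster coordinates.
merged = (
    cons.merge(link, on="hhid", how="inner")
        .merge(geo, on="cluster_id", how="inner")
)

# Assumed next step: average household consumption per cluster location.
per_cluster = merged.groupby(
    ["cluster_id", "lat", "lon"], as_index=False
)["cons"].mean()
```

With `how="inner"`, households without a matching cluster are silently dropped, so it is worth comparing row counts before and after the merge.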

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
cd ..

C:\Users\Mohamed\Desktop\Master\MA1\ML\ML4Science\forked-cm110-poverty\src


In [3]:
%reload_ext autoreload
%autoreload 2

In [4]:
from lib.lsms import LSMS
from bidict import bidict
from tqdm import tqdm
import json
import pandas as pd

In [5]:
with open("../data/lsms/country_keys.json", "r") as f:
    metadata = json.load(f)

The cell above loads the JSON file that contains the processing rules for each country and year. See the Readme.md in the `data/LSMS` folder for the structure of the file. It can easily be extended with further surveys.

It is convenient to have one large file that covers all countries, so we also save the combined data.
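
As a rough sketch of what one entry in the rules file might look like, the dictionary below uses exactly the keys that the processing function in this notebook reads. The country code `XXX`, the year, the paths, and all values are hypothetical placeholders; the Readme.md remains the reference for the real structure. The helper `missing_keys` is also hypothetical and only shows one way to check an entry before processing.

```python
# Keys read from each country/year entry by the generic processing function.
REQUIRED_KEYS = {
    "special", "cons_path", "hh_path", "cons_key", "hhsize_key", "lat_key",
    "lon_key", "hhid_key", "rural_key", "rural", "urban", "multiply",
}


def missing_keys(entry: dict) -> set:
    """Return the required keys that are absent from one metadata entry."""
    return REQUIRED_KEYS - set(entry)


# Hypothetical entry: every value is a placeholder, not real survey metadata.
example_metadata = {
    "XXX": {
        "2015": {
            "special": False,
            "cons_path": "data/lsms/XXX_2015/consumption.csv",
            "hh_path": "data/lsms/XXX_2015/geovariables.csv",
            "cons_key": "total_cons",
            "hhsize_key": "hhsize",
            "lat_key": "lat",
            "lon_key": "lon",
            "hhid_key": "hhid",
            "rural_key": "sector",
            "rural": "RURAL",
            "urban": "URBAN",
            "multiply": 1,
        }
    }
}

# Report incomplete entries before starting the (slower) processing run.
for country, years in example_metadata.items():
    for year, entry in years.items():
        absent = missing_keys(entry)
        if absent:
            print(f"{country}:{year} is missing {sorted(absent)}")
```

Entries flagged as special are handled by dedicated functions and may need a different set of keys, so a check like this is mainly useful for the generic case.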

In [6]:
def process_set_countries(metadata: dict, countries: set[str], nominal: bool):
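    """
    Process every non-special survey year of the given countries and return the combined data.

    Each processed survey is also written to its own CSV file. Consumption can be
    processed in nominal or in real terms.

    Args:
        metadata: processing rules per country and year, as loaded from country_keys.json
        countries: country codes (keys of `metadata`) to process
        nominal: if True, process in nominal terms (ppp=1); otherwise in real terms (ppp=-1)

    Returns:
        A DataFrame containing the processed rows of all surveys.
    """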
    ppp = 1 if nominal else -1
    master_df: pd.DataFrame = pd.DataFrame()
    for country in tqdm(countries):
        for year in metadata[country]:

            cur = metadata[country][year]

            # Skip surveys flagged as special; they are processed by dedicated functions
            if cur['special']:
                print(f'{country}:{year} is special and was ignored')
                continue

            lsms = LSMS(country, year, cons_path=f"../{cur['cons_path']}", hh_path=f"../{cur['hh_path']}", ppp=ppp)
            lsms.read_data()
            lsms.process_survey(cons_key=cur["cons_key"], hhsize_key=cur["hhsize_key"], lat_key=cur["lat_key"],
                                lon_key=cur["lon_key"], hhid_key=cur["hhid_key"], rural_key=cur["rural_key"],
                                rural_tag=cur["rural"], urban_tag=cur["urban"], multiply=cur["multiply"])
            lsms.write_processed(f"../data/lsms/processed/{country}_{year}_{'nominal' if nominal else 'real'}.csv")
            master_df = pd.concat([master_df, lsms.processed])
    return master_df


In [9]:
from lib.process_uganda import process_uga_2009, process_uga_2010
from lib.process_tanzania import process_tza

def process_all(output: str, metadata: dict):

    countries = {'NER', 'ETH', 'MW', 'MLI', 'NG', 'TZA'}

    print("Processing countries ")
    df_countries_nominal = process_set_countries(metadata=metadata,
                                                countries=countries, nominal=True)
    df_countries_real = process_set_countries(metadata=metadata,
                                             countries=countries, nominal=False)

    print("Processing Tanzania 2014..")
    df_tza_nominal = process_tza(metadata['TZA']['2014'], '2014', ppp=1)
    df_tza_real = process_tza(metadata['TZA']['2014'], '2014', ppp=-1)
    print("Done.")

    print("Processing Uganda 2009..")
    matched_keys = metadata["UGA"]["2009"]
    df_uga_2009_nominal, df_uga_2009_real = process_uga_2009(matched_keys)
    print("Done.")


    print("Processing Uganda 2010..")
    matched_keys = metadata["UGA"]["2010"]
    df_uga_2010_nominal, df_uga_2010_real = process_uga_2010(matched_keys)
    print("Done.")

    print("Collecting all data..")
    df_all_nominal = pd.concat([df_countries_nominal, df_tza_nominal, df_uga_2009_nominal, df_uga_2010_nominal])
    df_all_real = pd.concat([df_countries_real, df_tza_real, df_uga_2009_real, df_uga_2010_real])
    print("Done.")

    # Drop rows with na values
    df_all_real.dropna(inplace=True)
    df_all_nominal.dropna(inplace=True)

    print("Writing to files..")
    df_all_nominal.to_csv(f'{output}/_all_nominal.csv')
    df_all_real.to_csv(f'{output}/_all_real.csv')
    print("Done.")

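
The collection step relies on `pd.concat` followed by `dropna`. The sketch below uses hypothetical toy frames to show two behaviours worth keeping in mind: by default `pd.concat` keeps each frame's own index labels, so labels can repeat in the combined frame (and in the CSV written from it), and `dropna()` removes every row that has a missing value in any column. Whether the real processed frames carry a default integer index is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for two processed surveys.
df_a = pd.DataFrame({"country": ["AAA", "AAA"], "cons": [1.0, np.nan]})
df_b = pd.DataFrame({"country": ["BBB", "BBB"], "cons": [2.0, 3.0]})

# Default concat: the index labels of both frames are kept, so they repeat.
combined = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows; dropna() then removes the NaN row.
cleaned = pd.concat([df_a, df_b], ignore_index=True).dropna()
```

If the repeated index is not wanted in the output file, passing `ignore_index=True` to `pd.concat` or `index=False` to `to_csv` are two possible options.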

In [10]:
process_all('../data/lsms/processed', metadata)

Processing countries 
100%|██████████| 6/6 [00:03<00:00,  2.00it/s]
TZA:2014 is special and was ignored
100%|██████████| 6/6 [00:05<00:00,  1.15it/s]
TZA:2014 is special and was ignored
Processing Tanzania 2014..
Done.
Processing Uganda 2009..
Done.
Processing Uganda 2010..
Done.
Collecting all data..
Done.
Writing to files..
Done.
