# Download LSMS
This notebook helps you to download the LSMS data. It still requires manual work, but reduces it. 

In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd drive/MyDrive/src/0_lsms_processing/

/content/drive/MyDrive/src/0_lsms_processing


In [3]:
from bs4 import BeautifulSoup
from tqdm import tqdm

import json
import os
import pandas as pd
import re
import requests

from typing import List, Set, Dict

In [4]:
# Select the continent (Africa or Asia)
continent: str = 'south_america'
path: str = f'../../data/continents/{continent}/countries_meta/'
path_lsms: str = f'../../data/continents/{continent}/lsms/'

In [5]:
df: pd.DataFrame = pd.read_csv(path + "countries_lsms_time_valid.csv")

You have to specify you World Bank login data in the accounts json, or just remove the following block and hard code you data in the block after. Please be careful to not push it to a public repository or to one which you will make once public.

The json format is the following:
```
{
  "worldbank": {
    "user": "xx",
    "pw": "xx"
  }
}
```

The json file should lay on the top level of the project.

In [6]:
with open("../../accounts.json", "r") as f:
    auth_data: any = json.load(f)

In [7]:
user: str = auth_data["worldbank"]["user"]
pw: str = auth_data["worldbank"]["pw"]

Perform Login

In [8]:
def login(session: requests.Session, user: str, pw: str) -> None:
    """Performs the login.

    Args:
        session (requests.Session): Session
        user (str): Username
        pw (str): Password
    """
    login_url: str = "https://microdata.worldbank.org/index.php/auth/login"
    login_params: Dict[str, str] = {
        "email": user,
        "password": pw,
        "submit": "Login"
    }
    session.post(login_url, data=login_params)

In [9]:
session: requests.Session = requests.Session()
login(session, user, pw)

Consent: Accept the consent form to download the data

In [10]:
for _, row in tqdm(df.iterrows(), total=len(df)):
    url: str = row["url"]
    surveyid: str = url.split("/")[-1]
    res: any = session.get(url + "/get-microdata").content
    soup: any = BeautifulSoup(res)
    surveytitle: str = soup.find("h1", {"id": "dataset-title"}).span.text
    submitparam: Dict[str, str] = {
        "surveytitle": surveytitle,
        "surveyid": surveyid,
        "id": "",
        "abstract": "Research project to predict poverty.",
        "chk_agree": "on",
        "submit": "Submit"
    }

    session.post(url + "/get-microdata", data=submitparam)

100%|██████████| 6/6 [00:04<00:00,  1.50it/s]


Download LSMS Surveys. You have to check for false positives in the end anyways. Just to reduce the click work.

In [11]:
# Check files to download with CSV and SPSS extensions
regex_csv: re.Pattern = re.compile(".*CSV.*")
regex_spss: re.Pattern = re.compile(".*SPSS.")

for _, row in tqdm(df.iterrows(), total=len(df)):
    path: str = path_lsms + f"raw/{row['name']}/{row['year']}"
    if not os.path.exists(path):
        os.makedirs(path)
    rl: str = row["url"]
    res: requests.Response = session.get(url + "/get-microdata").content
    soup: any = BeautifulSoup(res)
    if "Terms and conditions" in [x.text for x in soup.findAll("h1")]:
        data: Dict[str, str] = {
            "accept": "Accept"
        }
        res = session.post(url + "/get-microdata", data=data).content
        soup = BeautifulSoup(res)
    try:
        if soup.find("a", {"data-filename": regex_csv}) == None:
            regex = regex_spss
        else:
            regex = regex_csv
        href: any = soup.find("a", {"data-filename": regex})["href"]
        title: any = soup.find("a", {"data-filename": regex})["title"]

        if os.path.exists(f"{path}/{title}"):
            continue
        res = session.get(href)
        with open(f"{path}/{title}", "wb") as f:
            f.write(res.content)
    except:
        print(url) # for manual work
        login(session, user, pw) # automatically login to speed up the process

  0%|          | 0/6 [00:00<?, ?it/s]

https://microdata.worldbank.org/index.php/catalog/597


 17%|█▋        | 1/6 [00:00<00:03,  1.53it/s]

https://microdata.worldbank.org/index.php/catalog/597


 33%|███▎      | 2/6 [00:01<00:02,  1.52it/s]

https://microdata.worldbank.org/index.php/catalog/597


 50%|█████     | 3/6 [00:01<00:01,  1.52it/s]

https://microdata.worldbank.org/index.php/catalog/597


 67%|██████▋   | 4/6 [00:02<00:01,  1.51it/s]

https://microdata.worldbank.org/index.php/catalog/597


 83%|████████▎ | 5/6 [00:03<00:00,  1.50it/s]

https://microdata.worldbank.org/index.php/catalog/597


100%|██████████| 6/6 [00:04<00:00,  1.49it/s]


The following are missing and require manual work:

## Africa

- [Ghana 1999](https://microdata.worldbank.org/index.php/catalog/2331): Data hosted on gov. server
- [Ghana 1992](https://microdata.worldbank.org/index.php/catalog/2315): Data hosted on gov. server
- [Ghana 1989](https://microdata.worldbank.org/index.php/catalog/2314): Data hosted on gov. server
- [Ghana 1988](https://microdata.worldbank.org/index.php/catalog/2313): Data hosted on gov. server
- [Malawi 2011](https://microdata.worldbank.org/index.php/catalog/3016): Other Term and No CSV
- [South Africa 2015](https://microdata.worldbank.org/index.php/catalog/3062): Terms by worldbank
- [South Africa 2015](https://microdata.worldbank.org/index.php/catalog/2882): Data hosted on gov. server
- [South Africa 1999](https://microdata.worldbank.org/index.php/catalog/1576): Data hosted on gov. server
- [South Africa 1993](https://microdata.worldbank.org/index.php/catalog/297): No CSV, or SPSS
- [South Africa 1993](https://microdata.worldbank.org/index.php/catalog/902): Data hosted on gov. server

## Asia

- [Armenia 2017](https://microdata.worldbank.org/index.php/catalog/3591): Data in STATA format
- [Armenia 2018](https://microdata.worldbank.org/index.php/catalog/3617): Licensed dataset
- [Indonesia 2007](https://microdata.worldbank.org/index.php/catalog/1044): Data hosted on external website
- [Nepal 2010](https://microdata.worldbank.org/index.php/catalog/1000): Data hosted on external website
- [Nepal 2003](https://microdata.worldbank.org/index.php/catalog/74): Data hosted on external website
- [Nepal 1995](https://microdata.worldbank.org/index.php/catalog/2301): Data hosted on external website
- [Timor-Leste 2001](https://microdata.worldbank.org/index.php/catalog/75): Data in STATA format
- [Vietnam 2006](https://microdata.worldbank.org/index.php/catalog/2350): Data hosted on external website
- [Vietnam 2004](https://microdata.worldbank.org/index.php/catalog/2370): Data hosted on external website
- [Vietnam 2002](https://microdata.worldbank.org/index.php/catalog/2306): Data hosted on external website
- [Vietnam 1997](https://microdata.worldbank.org/index.php/catalog/2694): Data hosted on external website
- [Vietnam 1992](https://microdata.worldbank.org/index.php/catalog/1910): Data hosted on external website
- [Tajikistan 2016](https://microdata.worldbank.org/index.php/catalog/2985): Licensed dataset
- [Tajikistan 1999](https://microdata.worldbank.org/index.php/catalog/279): Data in STATA format

## South America

- Perù and Ecuador have very old surveys not in CSV formats