# Summary

## Overview of Qarik Project

Qarik provides a corpus of World Bank loans. The goal of this project is to offer insights into economic and development trends from this unstructured data. Group Bear approaches this goal by first extracting the text of each Loan Agreement and then cleaning it to extract the available features of interest. After extracting the relevant features, we obtain external economic data on the countries named in the Loan Agreements to aid our analysis of economic and development trends. We also provide two models (LDA and k-means) that cluster the corpus by the sector of the economy each loan affects, based on the project name and project description in each Loan Agreement. Additionally, we provide visualizations of the data using Tableau.

## Stakeholders and KPIs

### Stakeholders

- World Bank
- Government Bodies and Journalists
- Economic Analysts

### KPIs

- Successfully extract relevant features from over 95% of the documents. Relevant features include the name of the country taking the loan, the loan amount, the year of loan approval, the project name, and the project description.
- Preprocess the extracted data according to each country's economic reality – for example, by normalizing the loan by the country's GDP.
- Successfully cluster loans based on the project description, with over 90% accuracy. To check whether we were successful, we could take a sample of 100 loans and manually verify that each loan belongs to its assigned cluster.
- Interpretability of our analysis via data visualization. We would like our analysis to be easily understood by a non-technical audience – this is crucial if we want our analysis to have a real-world impact.
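The manual check described in the clustering KPI could be organized along the following lines. This is only a sketch: the function names, the fixed seed, and the dictionary of reviewer judgements are assumptions made for illustration, not part of the project's code.

```python
import random


def draw_review_sample(loan_ids, sample_size=100, seed=0):
    """Pick a reproducible random sample of loans for manual review."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same sample
    ids = sorted(loan_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def cluster_accuracy(judgements):
    """Fraction of reviewed loans judged to be in the right cluster.

    `judgements` maps loan id -> True/False, filled in by a human reviewer.
    """
    if not judgements:
        raise ValueError("no judgements supplied")
    return sum(1 for ok in judgements.values() if ok) / len(judgements)


if __name__ == "__main__":
    sample = draw_review_sample(range(3205))
    print(len(sample), "loans selected for review")
```

The resulting fraction can then be compared against the 90% target.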


## Extract Data

The corpus consists of 3205 World Bank Loan Agreements in PDF format. Of these, 400 are scanned images converted to PDF; the rest are standard PDFs. The dates of the loan agreements range from 1990 to 2021. Below is an example of the first two pages of a Loan Agreement that is a standard PDF.

<table><tr>
<td> <img src="ex_1.png" alt="Drawing" style="width: 300px;"/> </td>
<td> <img src="ex_2.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

Below is an example of the first two pages of a Loan Agreement that is a scanned image in pdf format.

<table><tr>
<td> <img src="ex1_1.png" alt="Drawing" style="width: 300px;"/> </td>
<td> <img src="ex1_2.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

To extract the text we tried two Python packages, PyMuPDF and pdfminer.six. Of the two, PyMuPDF performed better because it preserved the structure and order of the document. However, neither package was able to extract any text from the 400 scanned PDFs. For those documents we used pytesseract (a Python wrapper for the Tesseract OCR engine). The OCR approach has two disadvantages: some characters were not recognized correctly, so the extracted text contains typos, and it took significantly longer to run.
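The two-stage extraction described above could look roughly like the sketch below. It is not the project's actual script (those are in the Extract_Data folder). The function names, the `min_chars` threshold used to detect a missing text layer, and the 300 dpi rendering are all assumptions chosen for illustration.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True when the text layer is (almost) empty, as for scanned PDFs.

    `min_chars` is an assumed threshold, not a value from the project.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


def extract_with_pymupdf(pdf_path):
    """Read the embedded text layer page by page, in document order."""
    import fitz  # PyMuPDF, imported lazily so needs_ocr has no dependencies

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def extract_with_ocr(pdf_path, dpi=300):
    """Render each page to an image and run Tesseract OCR on it (slow)."""
    import io

    import fitz
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image))
    return texts


def extract_text(pdf_path):
    """Try the text layer first; fall back to OCR only when it is empty."""
    pages = extract_with_pymupdf(pdf_path)
    if needs_ocr(pages):
        pages = extract_with_ocr(pdf_path)
    return "\n".join(pages)
```

Trying the text layer first keeps the slow OCR step limited to the scanned documents.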

The Python scripts used to extract the text are located in the Extract_Data folder. Please refer to them for further information on the implementation.

The raw text files that were extracted from the original pdf files are located in the PyMuPdf_Text and Tesseract_Text folders.

## Clean Data

The features that we chose to extract from the Loan Agreements were as follows:

- Loan Amount (and Currency that it was in)
    - Extracted for ~95% of Loan Agreements with an estimated accuracy of ~93%
- Name of the Country
    - Extracted for 100% of Loan Agreements
- Date of Agreement
    - Extracted for 100% of Loan Agreements
- Project Name
    - Extracted for 92% of Loan Agreements
- Project Description
    - Extracted for 92% of Loan Agreements
    

The notebooks and scripts used to extract and clean this data from the text files are located in their respective subfolders of the Clean_Data folder. Please refer to them for further information on their implementation.
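As a rough illustration of how a feature such as the loan amount and its currency might be pulled from the raw text, the sketch below uses a regular expression. The pattern assumes the amount appears as a parenthesized figure such as `($150,000,000)` or `(EUR 25,500,000)`. That phrasing, the symbol-to-currency mapping, and the function name are assumptions, and the project's actual rules are in the Clean_Data folder.

```python
import re

# Assumed phrasing: the figure appears in parentheses after the amount in
# words, e.g. "one hundred fifty million Dollars ($150,000,000)".
AMOUNT_RE = re.compile(
    r"\(\s*(?P<symbol>[$€£]|[A-Z]{3})\s*(?P<digits>\d{1,3}(?:,\d{3})+)\s*\)"
)

# Assumed mapping from currency symbols to codes ("$" is treated as USD).
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP"}


def extract_loan_amount(text):
    """Return (amount, currency) for the first match, or None if not found."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    symbol = match.group("symbol")
    currency = SYMBOL_TO_CODE.get(symbol, symbol)
    amount = int(match.group("digits").replace(",", ""))
    return amount, currency


if __name__ == "__main__":
    sample = "an amount equal to one hundred fifty million Dollars ($150,000,000)"
    print(extract_loan_amount(sample))
```

Returning `None` when nothing matches makes it straightforward to count how many documents a rule fails on.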

## Finalized Data 

### Currency Conversion

Raw loan amount (and currency) data was further processed. First, all loans were converted to USD using historical exchange rate data from:

https://fxtop.com/en/historical-exchange-rates.php

Currency conversions were NOT adjusted for inflation. Conversions were done on a yearly basis, using the average exchange rate for that year. The code and details are available in "Convert Loan Amount Currency" in Clean_Data/loan_amount.
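A minimal sketch of the yearly-average conversion follows. The rate table holds hypothetical placeholder values, not figures taken from fxtop.com, and the function name and the "foreign currency units per 1 USD" convention are assumptions.

```python
# Hypothetical yearly average rates, expressed as units of foreign currency
# per 1 USD. Placeholder values for illustration only.
AVERAGE_RATE_PER_USD = {
    ("EUR", 2005): 0.80,
    ("JPY", 1995): 94.0,
}


def to_usd(amount, currency, year, rates=AVERAGE_RATE_PER_USD):
    """Convert a nominal amount to USD with that year's average rate.

    No inflation adjustment is applied, matching the description above.
    """
    if currency == "USD":
        return float(amount)
    try:
        rate = rates[(currency, year)]
    except KeyError:
        raise KeyError(f"no average rate for {currency} in {year}") from None
    return amount / rate


if __name__ == "__main__":
    print(to_usd(94_000_000, "JPY", 1995))
```

Keying the table by (currency, year) mirrors the choice to use one average rate per year rather than the rate on the signing date.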

### World Bank External Data

We gathered external data to use in an ML model. Since our loan documents already come from the World Bank, we took this additional information from World Bank sources as well:

https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators

https://info.worldbank.org/governance/wgi/

The following indicators were extracted from these sources:

- GDP per Capita (put into 2021 USD equivalent)
- Political stability / absence of violence/terrorism
- Literacy rate (as a percentage of adult population)
- Electricity usage (as a percentage of population)
- Gini coefficient (a measure of economic inequality)

Further details are provided in "Regression Data" in Clean_Data/loan_amount.

## Visualizations

Visualizations will be added later.

## Models

### Latent Dirichlet Allocation Model for Clustering

LDA is an unsupervised topic-modeling technique that we use to cluster documents by topic. The model uses the lemmatized words from the project name and description of each document to estimate, for each document, a probability distribution over the topics/clusters. Each document is then assigned to the topic/cluster with the maximum probability. Note that the model is stochastic, so the results change from run to run.
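The assignment step (taking the topic with maximum probability) can be sketched as follows. The input format, a mapping from a document id to its list of topic probabilities, and the function name are assumptions for illustration; the project's implementation is in the Cluster_Data folder.

```python
def assign_topics(doc_topic_probs):
    """Map each document to the index of its most probable topic.

    `doc_topic_probs` maps a document id to a list of probabilities,
    one per topic. Ties go to the lowest topic index.
    """
    assignments = {}
    for doc_id, probs in doc_topic_probs.items():
        assignments[doc_id] = max(range(len(probs)), key=probs.__getitem__)
    return assignments


if __name__ == "__main__":
    toy = {
        "loan_a": [0.10, 0.70, 0.20],
        "loan_b": [0.55, 0.05, 0.40],
    }
    print(assign_topics(toy))
```

Because only the top topic is kept, a document split almost evenly between two topics is still given a single label. The full distribution is worth keeping for such cases.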

The LDA model also takes the number of clusters as input. While the World Bank arranges loans into 11 sectors, we found that topic coherence was maximized with 7 topics/clusters. As a result, some topics/clusters are a combination of several sectors, which indicates relationships between different sectors of the economy. Below is a visualization of the 7 topics/clusters generated by the LDA model.

In addition to unigrams, the model was also run using bigrams. However, the topic coherence was lower, and the bigrams representing each topic were not as clear as the unigrams.

The LDA model implementation is located in the LDA folder in the Cluster_Data folder. The distribution for each document is saved as lda_sector_distr.csv. The topic/sector determined for each document is saved as lda_sector.csv.

```python
from IPython.display import IFrame

# Embed the saved LDA visualization (lda.html) in the notebook
IFrame(src='./lda.html', width=900, height=900)
```

### K-means / Gaussian Mixture Model for Clustering

We use regular expressions to extract the dates of all loan agreements from their file names.

We would like to divide the loans into sectors according to the project descriptions extracted from the documents. The World Bank website lists 11 sectors.

To perform this grouping, we extract and refine a dictionary of around 6000 words from all project descriptions, and use pre-trained GloVe word embeddings to assign each document a 300-dimensional vector. For each document, the vector is the average of the embeddings of all words in the document that appear in the dictionary. For each sector, we also take a description of the sector from the World Bank and assign it a vector using the same method. Next, we use a Gaussian Mixture Model (GMM) to cluster these vectors into 11 clusters. To tune the hyperparameters of the GMM, we use the BIC (Bayesian Information Criterion).
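The averaging and model-selection steps above might be sketched as follows. Several things here are assumptions rather than the project's code: the function names, the use of scikit-learn's `GaussianMixture`, and the choice of the covariance type as the hyperparameter tuned with BIC (the text does not say which hyperparameters were tuned).

```python
def average_embedding(tokens, embeddings, vocabulary, dim=300):
    """Average the embeddings of the tokens that appear in the vocabulary.

    Repeated tokens are counted once per occurrence. A document with no
    known words gets a zero vector.
    """
    vectors = [embeddings[t] for t in tokens if t in vocabulary and t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def fit_gmm_with_bic(X, n_components=11,
                     covariance_types=("full", "tied", "diag", "spherical"),
                     seed=0):
    """Fit one GMM per covariance type and keep the one with the lowest BIC."""
    from sklearn.mixture import GaussianMixture  # assumed library choice

    best_model, best_bic = None, float("inf")
    for covariance_type in covariance_types:
        model = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=seed,
        ).fit(X)
        bic = model.bic(X)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model, best_bic


if __name__ == "__main__":
    toy_embeddings = {"road": [1.0, 0.0], "water": [0.0, 1.0], "the": [5.0, 5.0]}
    print(average_embedding(["the", "road", "water"], toy_embeddings,
                            {"road", "water"}, dim=2))
```

A lower BIC indicates a better trade-off between fit and model complexity, which is why the loop keeps the minimum.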

### GDP Analysis

## Future Work

We attempted to build an ML model to predict the average loan amount from the following features of our finalized dataset:

- GDP per Capita (put into 2021 USD equivalent)
- Political stability / absence of violence/terrorism
- Literacy rate (as a percentage of adult population)
- Electricity usage (as a percentage of population)
- Gini coefficient (a measure of economic inequality)
- Sector of loan (Classified using LDA Clustering Model)

Several models were considered for this problem. The first approach used linear regression.