# Data Extraction

In this notebook, we extract the tables that describe poverty rates by region in Tunisia. The source is the PDF report *Carte de la pauvreté en Tunisie*, published by the National Institute of Statistics (INS) in collaboration with the World Bank (2020), available at: [link](https://ins.tn/sites/default/files/publicationS/pdf/Carte%20de%20la%20pauvreté%20en%20Tunisie_final_0.pdf).

The code in this notebook acts as a set of **parsers** that write one intermediate CSV file per required table. The next notebook, `02_cleaning.ipynb`, cleans the extracted data.


## 1. Extracting All Tables from the PDF Report

In [39]:
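import camelot
import os

pdf_path = "../data/raw/ins_tunisia_report_2020.pdf"

# Parse every page. The "stream" flavor infers columns from whitespace,
# which suits tables that have no ruling lines.
tables = camelot.read_pdf(pdf_path, pages='all', flavor='stream')

print(f"Found {len(tables)} tables in the PDF.")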

  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)


KeyboardInterrupt: 
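
Parsing every page with `pages='all'` can take a long time on a long report. As a minimal sketch, the helper below collapses a list of page numbers into the comma-separated string format that Camelot's `pages` argument accepts (for example `'1,3-5'`). The helper name `pages_spec` and the page numbers in the example are hypothetical and are not taken from the report.

```python
def pages_spec(pages):
    """Collapse page numbers into a Camelot-style string such as '3-5,9'."""
    ordered = sorted(set(pages))
    if not ordered:
        raise ValueError("at least one page number is required")

    ranges = []
    start = prev = ordered[0]
    for page in ordered[1:]:
        if page == prev + 1:
            # Still inside a run of consecutive pages.
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))

    # Single pages are written as "7", runs as "7-9".
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Hypothetical page numbers, for illustration only.
print(pages_spec([12, 13, 14, 20, 22, 23]))  # 12-14,20,22-23
```

The resulting string could then be passed as `camelot.read_pdf(pdf_path, pages=pages_spec(...), flavor='stream')`. Note that restricting the pages changes how many tables are returned, so any index into `tables` would shift accordingly.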

## 2. Review Tables Before Saving

In [40]:
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.df.head(), "\n")

Table 0:
                                 0
0  Carte de la pauvreté en Tunisie
1                   Septembre 2020 

Table 1:
   0                     1
0     CARTE DE LA PAUVRETE
1  1                       

Table 2:
   0                     1
0     CARTE DE LA PAUVRETE
1  2                       

Table 3:
                                                   0
0                                 Table des matières
1  Résumé ..........................................
2  National ........................................
3  Introduction ....................................
4  Méthodologie et enjeux .......................... 

Table 4:
                                                   0
0                               CARTE DE LA PAUVRETE
1  Délégations du gouvernorat de Nabeul ............
2  Délégations du gouvernorat de Zaghouan ..........
3  Délégations du gouvernorat de Bizerte ...........
4  Nord-Ouest : Beja, Jendouba, Kef et Seliana ..... 

Table 5:
                      

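Many of the extracted tables shown above are title pages, page headers, or table-of-contents entries with only one or two columns. One way to narrow the review is to filter by table shape before printing anything. The sketch below is an assumption-laden illustration: the function name `candidate_indices` and the thresholds `min_rows=5` and `min_cols=3` are hypothetical and would need tuning against the actual report.

```python
def candidate_indices(shapes, min_rows=5, min_cols=3):
    """Return indices of tables that are at least min_rows x min_cols."""
    return [
        index
        for index, (n_rows, n_cols) in enumerate(shapes)
        if n_rows >= min_rows and n_cols >= min_cols
    ]


# Hypothetical shapes, loosely modeled on title and contents pages
# followed by one real data table.
example_shapes = [(2, 1), (2, 2), (2, 2), (30, 1), (28, 1), (12, 6)]
print(candidate_indices(example_shapes))  # [5]
```

In this notebook the shapes could be collected with `[t.df.shape for t in tables]`, since each Camelot table exposes its content as a pandas DataFrame through `.df`.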
## 3. Save Tables: Poverty & School Dropout Rates (2015)

Save the tables that contain poverty rates per delegation and governorate, along with school dropout rates. The figures use 2015 data from the Ministry of Education and the National Institute of Statistics (INS).
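
The cell below relies on hard-coded table indices, which are easy to mistype. As a hedged sketch, the helpers here check a name-to-index mapping for duplicate or out-of-range indices and build the output path with the same naming scheme used in this notebook. The names `check_table_map` and `interim_csv_path`, the sample mapping, and `n_tables=100` are hypothetical and chosen only for illustration.

```python
def check_table_map(table_map, n_tables):
    """Return a list of problems found in a name -> table index mapping."""
    problems = []
    seen = {}
    for name, index in table_map.items():
        if not 0 <= index < n_tables:
            problems.append(f"{name}: index {index} is out of range")
        if index in seen:
            problems.append(f"{name}: index {index} already used by {seen[index]}")
        else:
            seen[index] = name
    return problems


def interim_csv_path(name, year=2015, folder="../data/interim"):
    """Build the CSV path, e.g. 'Ben Arous' -> '.../poverty_ben_arous_2015.csv'."""
    slug = name.lower().replace(" ", "_")
    return f"{folder}/poverty_{slug}_{year}.csv"


# Hypothetical three-entry mapping with a deliberate duplicate index.
sample = {"Tunis": 41, "Ben Arous": 45, "Ariana": 41}
print(check_table_map(sample, n_tables=100))
print(interim_csv_path("Ben Arous"))
```

Running `check_table_map(governorate_tables, len(tables))` before saving would flag an index that points past the end of `tables` or that is assigned to two governorates.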


In [43]:
# Save one table per governorate.
# Map each governorate name to the index of its table in `tables`.
governorate_tables = {
    "Tunis": 41,
    "Ariana": 43,
    "Ben Arous": 45,
    "Manouba": 46,
    "Nabeul": 50,
    "Zaghouan": 51,
    "Bizerte": 53,
    "Beja": 56,
    "Jendouba": 57,
    "Kef": 59,
    "Siliana": 63,
    "Sousse": 67,
    "Monastir": 69,
    "Mahdia": 70,
    "Sfax": 72,
    "Kairouan": 74,
    "Kasserine": 76,
    "Sidi Bouzid": 78,
    "Gabes": 81,
    "Medenine": 85,
    "Tataouine": 84,
    "Gafsa": 88,
    "Tozeur": 89,
    "Kebili": 91 
}
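# Make sure the output folder exists before writing.
os.makedirs("../data/interim", exist_ok=True)

for governorate, table_index in governorate_tables.items():
    table = tables[table_index]
    # Build a lowercase, underscore-separated file name, e.g. "ben_arous".
    output_path = f"../data/interim/poverty_{governorate.lower().replace(' ', '_')}_2015.csv"
    table.to_csv(output_path)
    print(f"Saved {governorate} table to {output_path}")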

Saved Tunis table to ../data/interim/poverty_tunis_2015.csv
Saved Ariana table to ../data/interim/poverty_ariana_2015.csv
Saved Ben Arous table to ../data/interim/poverty_ben_arous_2015.csv
Saved Manouba table to ../data/interim/poverty_manouba_2015.csv
Saved Nabeul table to ../data/interim/poverty_nabeul_2015.csv
Saved Zaghouan table to ../data/interim/poverty_zaghouan_2015.csv
Saved Bizerte table to ../data/interim/poverty_bizerte_2015.csv
Saved Beja table to ../data/interim/poverty_beja_2015.csv
Saved Jendouba table to ../data/interim/poverty_jendouba_2015.csv
Saved Kef table to ../data/interim/poverty_kef_2015.csv
Saved Siliana table to ../data/interim/poverty_siliana_2015.csv
Saved Sousse table to ../data/interim/poverty_sousse_2015.csv
Saved Monastir table to ../data/interim/poverty_monastir_2015.csv
Saved Mahdia table to ../data/interim/poverty_mahdia_2015.csv
Saved Sfax table to ../data/interim/poverty_sfax_2015.csv
Saved Kairouan table to ../data/interim/poverty_kairouan_2015.