## FIT5196 Task 2 in Assessment 1

### Student Name: Zhiqing Shu
### Student ID: 28217551

#### Date: 27/03/2018

Version: 3.0

Environment: Python 3.6.4 and Anaconda 5.1.0 (64-bit)

Libraries used: 

- pdftables.six 0.0.5 (for pdf file, included in Anaconda Python 3.6)
- re 2.2.1 (for regular expression, included in Anaconda Python 3.6)
- pandas (for dataframe, included in Anaconda Python 3.6)
- numpy (for numpy array, included in Anaconda Python 3.6)

## 1. Introduction

This task is to extract data from a PDF file. The file, named **"health.pdf"**, contains children's health data for 202 countries. The table spreads over four pages. The goal is to extract the table and save it in a CSV file where the first column contains the country names and the following 22 columns contain various health information.

The detailed requirements of this task are as follows:
- must correctly parse and extract the table;
- existing Python packages (like pdfminer or pdftables) can be used; however, APIs (like pdftables_api) that require API keys to push the PDF file to a server in order to get the file parsed must not be used;
- it is not required to extract the column labels. Except for the first column, which should be named **"Country Name"**, the other columns should be indexed with integers as shown in the figure;
- if a number is followed by the character "x" in the PDF file, the "x" must **be dropped in** your script;
- script must be written in a Jupyter notebook named as **"pdf.ipynb"**;
- the extracted table should be saved in a CSV file named as **"health.csv"**;
- the input file must only be **"health.pdf"**.

## 2. Import libraries

PDF is a document representation format, not a machine-readable data format like CSV, JSON, or XML, so specific libraries are needed to extract data from the original PDF file. 
In this task, `pdftables.get_tables()` will be used to extract tables from the PDF file. It is worth noting that each row in the table is extracted and stored in a **list**.

`re` is the library for regular expression.

In [None]:
from pdftables import get_tables
import re
import pandas as pd
import numpy as np

## 3. Parse PDF File

The `get_tables()` [function](https://moodle.vle.monash.edu/mod/resource/view.php?id=4845787) returns each page of the PDF file as a table; each table is a list of rows, and each row is a list of column values.

In [None]:
pdfFile = 'health.pdf'
pdf = open(pdfFile, 'rb')
tables = get_tables(pdf)
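The nesting described above can be pictured with a small hand-written stand-in. The values below are made up for illustration and do not come from health.pdf; `mock_tables` is a hypothetical name, not something pdftables provides.

```python
# Hypothetical stand-in for the object returned by get_tables():
# a list of pages, each page a list of rows, each row a list of cell strings.
mock_tables = [
    [["Country A", "95", "98"], ["Country B", "–", "71x"]],  # page 0
    [["Country C", "100", "0"]],                             # page 1
]

for page_number, page in enumerate(mock_tables):
    for row in page:
        # row[0] is the first cell; the remaining cells are the values
        print(page_number, row[0], row[1:])

# Indexing goes page -> row -> cell.
first_cell_of_second_page = mock_tables[1][0][0]
```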

Viewing the first ten rows of each page.

In [None]:
for table in tables:
    for row in table[:10]:
        print(row)
    print ('==========================\n')

Viewing the last ten rows of each page.

In [None]:
for table in tables:
    for row in table[-10:]:
        print (row)
    print ('==========================\n')

From the previous results, we can see that the first five lines and the last two lines of the first three pages are not required in the final CSV file, so we need to delete them. Furthermore, in addition to the lines removed from the previous pages, the fourth page contains some data description, which will also be excluded.

Before getting the lines we need, we should know the index range of each page.

In [None]:
print(len(tables[0]))
print(len(tables[1]))
print(len(tables[2]))
print(len(tables[3]))

For page_0, the slice we need is [5:67]: it excludes the first five lines (e.g. [5:]) and the last two lines (the largest index is 68 (69-1), and 68-2 = 66, so [:67], where the stop index 67 is exclusive). We can process page_1 and page_2 in the same way.
However, the content of page_3 is a little different. Fortunately, the number of lines we need in page_3 can easily be counted: just 14, so [5:19].

In [None]:
# exclude titles and '-'
page_0 = tables[0][5:67]
page_1 = tables[1][5:68]
page_2 = tables[2][5:68]
page_3 = tables[3][5:19]
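The stop indices above are computed by hand from each page's length. As a minimal sketch, assuming a page of 69 rows with five header rows and two trailing rows (the mock rows below are invented placeholders), a negative stop index expresses "drop the last two" without that arithmetic:

```python
# Mock page: 69 placeholder rows standing in for one extracted page.
mock_page = ["row %d" % i for i in range(69)]

explicit = mock_page[5:67]   # stop index computed from the length (69 - 2)
relative = mock_page[5:-2]   # same rows, counted from the end instead

print(explicit == relative)
print(len(relative), relative[0], relative[-1])
```

Both slices keep indices 5 to 66, i.e. 62 rows. The relative form would not help for page_3, where the trailing description block has a different length.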

Check that we have all the information we need.

In [None]:
print(page_0[:1])
print(page_0[-1:])

In [None]:
print(page_1[:1])
print(page_1[-1:])

In [None]:
print(page_2[:1])
print(page_2[-1:])

In [None]:
print(page_3[:1])
print(page_3[-1:])

A closer look at each page shows that some messiness remains. The problems in page_0, page_1 and page_2 are very similar: in each row, two adjacent values are merged into one cell. In page_0 and page_2 the merge occurs in the fifth item (index 4), while in page_1 it occurs in the sixth item (index 5).

According to the description in the original PDF, all data in this file is shown as percentages, so every value should be **between 0 and 100**. In addition, unavailable data is indicated by **'–'**, and the special character **'x'** marks data that refers to years or periods other than those specified in the column heading. As a result, we can write the regular expression as `r'(100|[1-9][0-9]?x?|–|0)'` and use the `re.findall()` [function](https://docs.python.org/3/library/re.html) to return all non-overlapping matches of the pattern in a string, as a list of strings.
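To make the behaviour of the pattern concrete, here is a small sketch that applies the pattern used in the notebook's cells to a few merged strings. The sample strings are made up and are not taken from health.pdf; `VALUE_PATTERN` is a name introduced only for this illustration.

```python
import re

# 100, a one- or two-digit number with an optional trailing 'x',
# the en dash used for missing data, or a lone 0.
VALUE_PATTERN = r'(100|[1-9][0-9]?x?|–|0)'

# Made-up merged strings.
samples = ['9995', '85x90', '–12', '100100', '0–']

for merged in samples:
    print(merged, '->', re.findall(VALUE_PATTERN, merged))
```

Each sample splits into two values, for example `'85x90'` into `'85x'` and `'90'`. The alternative `100` is listed first so that it is tried before the two-digit branch can consume `'10'`.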

Besides finding the matches, we also need to remove the matches and insert them into the correct position. Here, `pop()` [function](https://docs.python.org/3.6/tutorial/datastructures.html) will be used to remove the matches.

In [None]:
for i in range(len(page_0)):
    value = re.findall(r'(100|[1-9][0-9]?x?|–|0)', page_0[i].pop(4))
    page_0[i] = page_0[i][:4] + value + page_0[i][4:]
for i in range(len(page_1)):
    value = re.findall(r'(100|[1-9][0-9]?x?|–|0)', page_1[i].pop(5))
    page_1[i] = page_1[i][:5] + value + page_1[i][5:]
for i in range(len(page_2)):
    value = re.findall(r'(100|[1-9][0-9]?x?|–|0)', page_2[i].pop(4))
    page_2[i] = page_2[i][:4] + value + page_2[i][4:]
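The three loops above repeat the same pop-and-splice step with a different index. One possible way to factor it out is sketched below; `split_merged_cell` is a hypothetical helper (not part of the notebook or of any library), and the row is invented. Unlike `pop()`, it returns a new list and leaves the input row unchanged.

```python
import re

VALUE_PATTERN = r'(100|[1-9][0-9]?x?|–|0)'

def split_merged_cell(row, index, pattern=VALUE_PATTERN):
    """Return a new row in which the cell at `index` is replaced by its matches."""
    pieces = re.findall(pattern, row[index])
    return row[:index] + pieces + row[index + 1:]

# Made-up row with two values fused in position 4.
mock_row = ['Country A', '90', '95', '80', '4560', '30']
fixed_row = split_merged_cell(mock_row, 4)
print(fixed_row)
```

In the notebook this could be applied as `page_0 = [split_merged_cell(row, 4) for row in page_0]`, under the assumption that every row of the page has the merge at the same index.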

We need to check that each row in each page has been split. 

In [None]:
# get the length of each row
len(page_0[0]) 

In [None]:
# get the number of rows in each page
print(len(page_0))
print(len(page_1))
print(len(page_2))

In [None]:
# check if each row in every page has 23 items
num0 = 0
num1 = 0
num2 = 0
for item in page_0:
    if len(item) == 23:
        num0 += 1
print('num0:', num0)
for item in page_1:
    if len(item) == 23:
        num1 += 1
print('num1:', num1)
for item in page_2:
    if len(item) == 23:
        num2 += 1
print('num2:', num2)
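Counting the rows that have 23 items works, but it does not say which rows fail. A small sketch of an alternative that reports the offending positions directly is shown below; `rows_with_wrong_length` is a hypothetical helper and the mock page is invented.

```python
def rows_with_wrong_length(rows, expected=23):
    """Return (position, length) pairs for rows whose length differs from `expected`."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]

# Made-up page: the second row is one cell short.
mock_page = [['a'] * 23, ['b'] * 22, ['c'] * 23]
print(rows_with_wrong_length(mock_page))  # [(1, 22)]
```

An empty list means every row has the expected length.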

Since every row has the same length, we can merge the first three pages.

In [None]:
page_0_to_2 = page_0 + page_1 + page_2

Check if page_0_to_2 contains all rows.

In [None]:
len(page_0_to_2) == len(page_0) + len(page_1) + len(page_2)

We need to confirm that all data in page_0_to_2 has been processed correctly.
The three regular expressions operate on the three columns under the heading "Use of basic sanitation service(%)" in the PDF file. The only difference is that in page_0 and page_2 they split the "total" and "urban" data, while in page_1 they split the "urban" and "rural" data. Since the "total" value should lie between the "urban" and "rural" values, this relationship can be used to check the result.

In [None]:
for item in page_0_to_2:
    if '–' not in item[5] and '–' not in item[6]:
        # total > urban and total > rural
        if int(item[4]) > int(item[5]) and int(item[4]) > int(item[6]): 
            print(item)
        # total < urban and total < rural
        elif int(item[4]) < int(item[5]) and int(item[4]) < int(item[6]):
            print(item)
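The same rule can be written as a small predicate that also tolerates the `'–'` marker and a trailing `'x'`, which `int()` alone would reject. This is only a sketch: `to_number` and `total_outside_range` are hypothetical names, and the sample values are illustrative.

```python
def to_number(cell):
    """Convert a cell such as '85' or '85x' to int; return None for the '–' marker."""
    if cell == '–':
        return None
    return int(cell.rstrip('x'))

def total_outside_range(total, urban, rural):
    """True when total lies strictly above or strictly below both urban and rural."""
    values = [to_number(cell) for cell in (total, urban, rural)]
    if None in values:
        return False  # cannot compare when any value is missing
    t, u, r = values
    return t > max(u, r) or t < min(u, r)

print(total_outside_range('71', '8', '4'))    # True
print(total_outside_range('7', '18', '4'))    # False
print(total_outside_range('50', '–', '40'))   # False
```

A row passing this check is not proven correct; the check only flags splits that contradict the expected ordering.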

We get one abnormal row. Comparing it with the original PDF file reveals the error: the merged `"718"` is split into `"71"` and `"8"`, while in the PDF the values are `"7"` and `"18"`. The error arises because the two merged numbers have different numbers of digits, with the one-digit number before the two-digit one. Since the `?` quantifier is greedy (it matches zero or one time, preferring one), the first number produced by the regular expression is two-digit whenever possible. We need to re-split the incorrect data as follows:

In [None]:
for item in page_0_to_2:
    if '–' not in item[5] and '–' not in item[6]:
        # since 71 > 8 and 4, and the "<" situation does not exist
        if int(item[4]) > int(item[5]) and int(item[4]) > int(item[6]):
            data1 = re.findall(r'(\d)(\d)', item[4])
            data2 = re.findall(r'(\d)', item[5])
            item[4] = data1[0][0]
            item[5] = data1[0][1] + data2[0][0]

Check whether the problem has been fixed or not.

In [None]:
for row in page_0_to_2:
    if row[0] == "Ethiopia":
        print(row)
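The ambiguity behind this error can be made explicit by enumerating every valid way to cut a merged digit string, instead of accepting only the greedy one. The sketch below is an illustration rather than the method used in this notebook: `candidate_splits` is a hypothetical helper, it ignores the `'x'` suffix and the `'–'` marker, and it assumes each value is 0, 100, or a one- or two-digit number without a leading zero.

```python
import re

# A single value: 100, 0, or a one/two-digit number without a leading zero.
SINGLE_VALUE = re.compile(r'100|0|[1-9][0-9]?')

def candidate_splits(digits, parts):
    """List every way to cut `digits` into `parts` valid values."""
    if parts == 1:
        return [[digits]] if SINGLE_VALUE.fullmatch(digits) else []
    results = []
    # A value has at most 3 characters, and each remaining part needs at least 1.
    for cut in range(1, min(3, len(digits) - parts + 1) + 1):
        head = digits[:cut]
        if SINGLE_VALUE.fullmatch(head):
            for tail in candidate_splits(digits[cut:], parts - 1):
                results.append([head] + tail)
    return results

print(candidate_splits('718', 2))    # [['7', '18'], ['71', '8']]
print(candidate_splits('47775', 3))
```

For `'718'` there are two candidates, and only the relationship between the columns (or the PDF itself) can tell which one is right. The second string, `'47775'`, is a merged cell from the last page and has three candidates.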

There is another way to find the abnormal data. Because the `?` quantifier is greedy, a wrongly split pair will typically have a two-digit first number and a one-digit second number, so we can view just those rows in which the sixth item is one-digit and check them.
The code is as follows: 
```
for row in page_0_to_2:
    if len(row[5]) < 2 and row[5] != '–': 
        print(row)
```

After processing the first three pages and checking the correctness, we can start parsing the last page.

First, we need to have a look at page_3.

In [None]:
page_3

There are many merge problems in page_3, taking the first row as an example:
- 'Uganda397332' should be 'Uganda','39','73' and '32'
- '192817938978' should be '19','28','17','93','89' and '78'
- '07887' should be '0','78' and '87'
- '4781' should be '47' and '81'

These problems can be fixed one by one:

In [None]:
#'Uganda397332' to 'Uganda','39','73' and '32'
for item in range(len(page_3)):
    data_0_to_3 = re.findall(r'([a-zA-Z()\s]+|100|[1-9][0-9]?x?|–|0)', page_3[item].pop(0))
    page_3[item] = data_0_to_3 + page_3[item][0:]
page_3

The result seems to be right. However, we still need to check whether the first item of each row has been split into four values.

In [None]:
len(page_3[0])

In [None]:
num = 0

for item in page_3:
    if len(item) == 15:
        num += 1
print(num)

In [None]:
len(page_3)

Correct. Then, moving to next step.

In [None]:
#'192817938978' to '19','28','17','93','89' and '78'
for item in range(len(page_3)):
    data_4_to_9 = re.findall(r'(100|[1-9][0-9]?x?|–|0)', page_3[item].pop(4))
    page_3[item] = page_3[item][:4] + data_4_to_9 + page_3[item][4:] 
page_3

Check the length of each data again.

In [None]:
len(page_3[0])

In [None]:
num = 0
for item in page_3:
    if len(item) == 20:
        num += 1
print(num)

Correct. Then, moving to next step.

In [None]:
for item in range(len(page_3)):
    data_15_to_17 = re.findall(r'(100|[1-9][0-9]?x?|–|0)', page_3[item].pop(15))
    page_3[item] = page_3[item][:15] + data_15_to_17 + page_3[item][15:] 
page_3

Check the length of each data.

In [None]:
len(page_3[0])

In [None]:
num = 0
for item in page_3:
    if len(item) == 22:
        num += 1
print(num)

Correct. Moving to next step.

In [None]:
for item in range(len(page_3)):
    data_19_to_20 = re.findall(r'([1-9][0-9]?x?|–)', page_3[item].pop(19))
    page_3[item] = page_3[item][:19] + data_19_to_20 + page_3[item][19:] 
page_3

In [None]:
len(page_3[0])

In [None]:
num = 0
for item in page_3:
    if len(item) == 23:
        num += 1
print(num)

All rows are split into the same, expected length. However, this alone does not mean the values are correct.

The `'Uganda397332' to 'Uganda','39','73' and '32'` and `'192817938978' to '19','28','17','93','89' and '78'` can be checked by using the previous `'total', 'urban' and 'rural'` idea.

In [None]:
for item in page_3:
    if '–' not in item[2] and '–' not in item[3]:
        if int(item[1]) > int(item[2]) and int(item[1]) > int(item[3]):
            print(item)
        elif int(item[1]) < int(item[2]) and int(item[1]) < int(item[3]):
            print(item)

In [None]:
for item in page_3:
    if '–' not in item[5] and '–' not in item[6]:
        if int(item[4]) > int(item[5]) and int(item[4]) > int(item[6]):
            print(item)
        elif int(item[4]) < int(item[5]) and int(item[4]) < int(item[6]):
            print(item)

Since the data in the other columns has no such relationship, we can rely on the greediness of the `?` quantifier to look for abnormal data: apart from `'–'`, `'0'` and `'100'`, most values should be two-digit numbers (some with a trailing `'x'`), so a lone non-zero digit is worth checking against the PDF.

In [None]:
for row in page_3:
    if '–' not in row[7]:
        if len(row[7]) < 2 and int(row[7]) != 0 and int(row[7]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[8]:
        if len(row[8]) < 2 and int(row[8]) != 0 and int(row[8]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[9]:
        if len(row[9]) < 2 and int(row[9]) != 0 and int(row[9]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[15]:
        if len(row[15]) < 2 and int(row[15]) != 0 and int(row[15]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[16]:
        if len(row[16]) < 2 and int(row[16]) != 0 and int(row[16]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[17]:
        if len(row[17]) < 2 and int(row[17]) != 0 and int(row[17]) != 100: 
            print(row)
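The per-column checks above all follow the same pattern, so they could be folded into one loop over the column indices. A minimal sketch follows; `suspicious_rows` is a hypothetical helper, and the three-cell mock rows are invented to keep the example short.

```python
def suspicious_rows(rows, columns):
    """Return (column, row) pairs where a cell is a single non-zero digit."""
    found = []
    for column in columns:
        for row in rows:
            cell = row[column]
            # '–' marks missing data and a lone '0' is a legitimate value
            if cell != '–' and len(cell) < 2 and cell != '0':
                found.append((column, row))
    return found

# Made-up rows; column 2 of the second row is a lone '5'.
mock_rows = [['Country A', '47', '75'],
             ['Country B', '77', '5'],
             ['Country C', '0', '–']]
for column, row in suspicious_rows(mock_rows, columns=[1, 2]):
    print(column, row)
```

In the notebook, the equivalent call would be something like `suspicious_rows(page_3, [7, 8, 9, 15, 16, 17])`. A flagged row is only a candidate for inspection, since a genuine one-digit value looks the same.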

Comparing with the original PDF file reveals the error: the merged `"47775"` is split into `"47"`, `"77"` and `"5"`, while in the PDF the values are `"47"`, `"7"` and `"75"`. So we need to re-split the incorrect data as follows:

In [None]:
for item in page_3:
    if '–' not in item[17]:
        if int(item[17]) < 10: 
            data1 = re.findall(r'(\d)(\d)', item[16])
            data2 = re.findall(r'(\d)', item[17])
            item[16] = data1[0][0]
            item[17] = data1[0][1] + data2[0][0]

In [None]:
for row in page_3:
    if row[0] == "Venezuela (Bolivarian Republic of)":
        print(row)

Correct. Continuing checking.

In [None]:
for row in page_3:
    if '–' not in row[19]:
        if len(row[19]) < 2 and int(row[19]) != 0 and int(row[19]) != 100: 
            print(row)

In [None]:
for row in page_3:
    if '–' not in row[20]:
        if len(row[20]) < 2 and int(row[20]) != 0 and int(row[20]) != 100: 
            print(row)

After confirming the correctness, we can merge all data together.

In [None]:
entire_pdf = page_0_to_2 + page_3

Check the integrity of data.

In [None]:
len(entire_pdf)

In [None]:
num = 0
for item in entire_pdf:
    if len(item) == 23:
        num += 1
print(num)

In [None]:
type(entire_pdf)

We create a DataFrame by passing the list of rows.

In [None]:
data = pd.DataFrame(entire_pdf)

In [None]:
#df = data.copy()

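According to the requirement, "x" must be dropped in your script, so we replace 'x' with ''. The replacement is applied to the data columns only, so that country names containing an "x" are left intact.

In [None]:
# strip 'x' from the value columns only; column 0 holds the country names
value_cols = data.columns[1:]
data[value_cols] = data[value_cols].replace('x', '', regex=True)
data.tail()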

Based on the sample figure, the "–" should be replaced with "NaN".

In [None]:
data.replace('–',np.nan,inplace=True)
data.head()

We also need to reset the index (using country names).

In [None]:
data = data.set_index(data[0].values)
data.head()

Deleting the redundant column.

In [None]:
data = data.drop(0,axis=1)
data.head()

Reindexing all columns.

In [None]:
data.columns = list(range(len(data.columns))) 
data.head()

Setting the index name.

In [None]:
data.index.rename('Country Name', inplace=True)
data.head()
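The cleaning steps above can be tried end to end on a tiny frame before running them on the full table. The sketch below uses invented rows, and it makes one design choice explicit: the `'x'` removal is restricted to the value columns so that a name such as "Mexico" keeps its "x". The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same shape as the parsed table (name first, then values).
mock_rows = [['Mexico', '85x', '–'], ['Country B', '100', '7']]
frame = pd.DataFrame(mock_rows)

# Drop the 'x' marker from value columns only.
value_columns = frame.columns[1:]
frame[value_columns] = frame[value_columns].replace('x', '', regex=True)

# Turn the missing-data marker into NaN, then index by country name.
frame = frame.replace('–', np.nan)
frame = frame.set_index(0)
frame.index.name = 'Country Name'

# Renumber the remaining columns from 0.
frame.columns = list(range(len(frame.columns)))
print(frame)
```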

Finally, we store the data in CSV format using the pandas `to_csv` function. `encoding='utf-16'` is used to preserve non-English characters, and `sep='\t'` makes the file tab-separated.

In [None]:
data.to_csv('./health.csv',encoding='UTF-16',sep='\t')
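Because the file is written as UTF-16 with a tab separator, anything that reads it back has to be told the same two settings. The sketch below demonstrates the round trip on an invented frame written to a temporary file (the file name `mock_health.csv` and the sample values are assumptions for illustration).

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-ASCII country name.
frame = pd.DataFrame(
    {0: ['95', '80']},
    index=pd.Index(["Côte d'Ivoire", 'Country B'], name='Country Name'),
)

path = os.path.join(tempfile.mkdtemp(), 'mock_health.csv')
frame.to_csv(path, encoding='utf-16', sep='\t')

# The reader must be given the same encoding and separator as the writer.
restored = pd.read_csv(path, encoding='utf-16', sep='\t', index_col='Country Name')
print(restored)
```

Note that `read_csv` infers numeric types, so the values come back as numbers rather than strings.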

## 4. Summary

A PDF file is not easy to parse, even with an existing package such as pdftables, because the output may contain many unexpected merges of data. Because the data can vary widely in length and type, a regular expression may not handle every possibility, so it is necessary to check the correctness of the result using the relationships between the data or any other means available.

The main outcomes achieved while completing this task were:
- Working out the right regular expressions to get the desired result
- Using effective and efficient ways to check the correctness of the result
- Understanding the data, the relationships within it, and the meaning behind it

## 5. Reference

- Moodle FIT5196 Week 3. *Parsing PDF.* Retrieved from https://moodle.vle.monash.edu/mod/resource/view.php?id=4845787
- Python Software Foundation. (2018). *6.2. re — Regular expression operations.* Retrieved from https://docs.python.org/3/library/re.html
- Python Software Foundation. (2018). *5. Data Structures.* Retrieved from https://docs.python.org/3.6/tutorial/datastructures.html
- Chris Albon. (2017, December 20). *Replacing Values In pandas.* Retrieved from https://chrisalbon.com/python/data_wrangling/pandas_replace_values/