# Extracting data from a PDF

We are going to extract tabular data from the `exploration-of-solar-radiation-in-nigeria.pdf` file in the data folder. The package used here to extract the table is tabula-py. I learnt about it from a Colab notebook that [@Ayanlola2002](https://github.com/Ayanlola2002) posted on the [Omdena Nigeria Energy GitHub repo](https://github.com/OmdenaAI/omdena-nigeria-energy).

Thanks [@Ayanlola2002](https://github.com/Ayanlola2002) for your resource.

**Note:**
* You need the PDF file to run this code.
* You can also use this method to scrape tabular data from web pages: save the page as a PDF from your browser (use the print function), then extract the tables from that PDF using tabula-py.

## Step 1: Download and import the package

In [1]:
try: 
    import tabula
except ModuleNotFoundError:
    !pip install tabula-py
    
    import tabula

Collecting tabula-py
  Downloading tabula_py-2.2.0-py3-none-any.whl (11.7 MB)
Collecting distro
  Using cached distro-1.5.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.5.0 tabula-py-2.2.0
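
tabula-py is a wrapper around the Java library tabula-java, so a Java runtime has to be installed in addition to the Python package. The snippet below is a minimal sketch of a pre-flight check; the helper name `java_available` is hypothetical and not part of tabula-py, and it only checks that a `java` executable is on the `PATH`, not which version it is.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if not java_available():
    # tabula-py calls out to Java, so reading a PDF would fail without it.
    print("Java was not found on PATH; install a Java runtime before reading PDFs with tabula.")
```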


## Step 2: Read your PDF using tabula

Tabula automatically scans the requested page (here, page 7) for tables and returns each one it finds.

In [4]:
data = "./data/exploration-of-solar-radiation-in-nigeria.pdf"

result = tabula.read_pdf(data, pages=7, multiple_tables=True)
for table in result:
    print(table)
    print('')

    Year Month Quarter     Ibadan    Sokoto  Port Harcourt
0   2011   Jan      Q1  152.95440  216.6395       195.3404
1   2011   Feb      Q1  167.18160  221.9066       173.1943
2   2011   Mar      Q1  179.93890  244.9392       180.4480
3   2011   Apr      Q2  162.21800  242.0993       182.9117
4   2011   May      Q2  154.39700  232.1683       160.7612
5   2011   Jun      Q2  126.53000  189.8827       131.3527
6   2011   Jul      Q3   90.88166  168.1862       121.4725
7   2011   Aug      Q3   91.89994  163.3494       114.3021
8   2011   Sep      Q3  112.58800  218.2928       134.7724
9   2011   Oct      Q4  141.07450  231.2348       141.5836
10  2011   Nov      Q4  166.86530  259.2857       155.8608
11  2011   Dec      Q4  186.76990  220.9247       188.9337
12  2012   Jan      Q1  153.84540  233.9503       169.7561
13  2012   Feb      Q1  149.98740  236.8866       148.0371
14  2012   Mar      Q1  171.49560  255.0796       157.5473
15  2012   Apr      Q2  168.66280  254.1561       155.80

## Step 3: Write into a file using tabula's convert_into method

Make sure to create a `results` folder before running the next cell.
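
The folder can also be created from code instead of by hand. This is a minimal sketch, assuming the notebook's working directory is the project root, so that `./results` is the same folder as the one in the output path used in this step.

```python
from pathlib import Path

# "results" matches the folder in the output path passed to convert_into.
results_dir = Path("./results")

# parents=True creates any missing parent folders;
# exist_ok=True makes re-running the cell harmless.
results_dir.mkdir(parents=True, exist_ok=True)
```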

In [7]:
tabula.convert_into(data, './results/quaterly-solar-radiance-ibadan-sokoto-ph.csv', output_format='csv', pages=7)
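
To sanity-check the export, the CSV can be read back with Python's built-in `csv` module. The sketch below does not open the real file: it uses four rows copied from the Step 2 output as a stand-in, and it assumes the exported CSV starts with a header row carrying the same column names. The helper name `mean_by_quarter` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the exported CSV: rows copied from the Step 2 output.
# Assumption: the real file has this same header row.
SAMPLE_CSV = """Year,Month,Quarter,Ibadan,Sokoto,Port Harcourt
2011,Jan,Q1,152.95440,216.6395,195.3404
2011,Feb,Q1,167.18160,221.9066,173.1943
2011,Mar,Q1,179.93890,244.9392,180.4480
2011,Apr,Q2,162.21800,242.0993,182.9117
"""


def mean_by_quarter(csv_file, city):
    """Average one city's column for each (year, quarter) pair."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[(row["Year"], row["Quarter"])].append(float(row[city]))
    return {key: sum(values) / len(values) for key, values in groups.items()}


averages = mean_by_quarter(io.StringIO(SAMPLE_CSV), "Ibadan")
for (year, quarter), value in averages.items():
    print(year, quarter, round(value, 2))
```

To run this against the real export, pass `open(path, newline="")` in place of `io.StringIO(SAMPLE_CSV)`, where `path` is the CSV path used in the cell above.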

# Conclusion

Tabular data can be scraped from a PDF using the tabula-py package in Python.