<a href="https://colab.research.google.com/github/ExtractTable/ExtractTable-py/blob/master/example-code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install ExtractTable

In [0]:
from ExtractTable import ExtractTable

In [0]:
api_key = "YOUR_APIKEY_HERE"  # replace with your ExtractTable API key (must be a string)

**Create Session** with your API Key

In [0]:
et_sess = ExtractTable(api_key)

**Validate** the Key and check the plan usage

In [0]:
usage = et_sess.check_usage()

*If the cell above runs without an error, the API key is valid. Next, check the usage and trigger a file for processing.*

In [0]:
print(usage)

{'credits': 500, 'queued': 0, 'used': 132}


**credits**: Total number of credits attached to the API Key

**queued** : Number of triggered jobs that are still processing in the queue

**used**   : Number of credits already used 
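As a small illustration of reading this dictionary, the sketch below computes how many credits are left. The helper name `remaining_credits` is hypothetical, and it assumes that the remaining balance is simply `credits - used`, ignoring anything still queued; check your plan details before relying on it.

```python
from typing import Dict


def remaining_credits(usage: Dict[str, int]) -> int:
    """Credits left on the key, assuming remaining = credits - used."""
    return usage["credits"] - usage["used"]


# Sample values copied from the output shown above
usage_sample = {"credits": 500, "queued": 0, "used": 132}
remaining = remaining_credits(usage_sample)
print("Credits remaining:", remaining)
```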

In [0]:
# filepath = "image_path_or_image_url_with_tables"
# filepath = r'samples/BlurryImage.jpg'
filepath = "https://raw.githubusercontent.com/ExtractTable/ExtractTable-py/master/samples/QualityImage.jpg"

**Trigger** the process to extract tabular data from the file

In [0]:
table_data = et_sess.process_file(filepath=filepath)

<ins>Note</ins>: To <ins>process a PDF</ins>, pass the **pages** parameter to `process_file()`, as shown below
```python 
table_data = et_sess.process_file(filepath=Location_of_PDF_with_Tables, pages="all")
```
Sample values that `pages` accepts:

* pages = "2" - considers only 2nd page of the PDF
* pages = "1,3,5" - considers only 1st, 3rd and 5th page of the PDF
* pages = "1, 3-5" - considers 1st, 3rd, 4th and 5th page of the PDF
* pages = "all" - considers complete PDF
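To make the selection syntax above concrete, here is a minimal sketch that expands such a string into explicit page numbers. It is not the library's own parser: `expand_pages` and its `total_pages` argument are hypothetical, written only to show how the sample values read.

```python
from typing import List


def expand_pages(pages: str, total_pages: int) -> List[int]:
    """Expand a pages string such as "1, 3-5" into a sorted list of page numbers."""
    if pages.strip().lower() == "all":
        return list(range(1, total_pages + 1))
    selected = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-5" includes both ends
            start, end = (int(p) for p in part.split("-", 1))
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    # Drop page numbers that fall outside the document
    return sorted(p for p in selected if 1 <= p <= total_pages)


print(expand_pages("1, 3-5", total_pages=6))
```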

> By default, `process_file()` returns **only** the table data

> **Explore** all objects of the latest file processing with `et_sess.__dict__.keys()`. The objects available depend on the plan type of your API Key.

In [0]:
et_sess.__dict__.keys()

dict_keys(['api_key', '_session', 'ServerResponse', 'JobStatus', 'Lines', 'Pages', 'Tables'])

In [0]:
# Access the class objects as you want
print("Number of pages processed in this job:", et_sess.Pages)
print("Number of tables found in this job:", len(et_sess.Tables))
# print("Number of lines in the first page of this job:", len(et_sess.Lines[0]['LinesArray']))

# et_sess.Tables
# et_sess.Lines


Number of pages processed in this job: 1
Number of tables found in this job: 1
Number of lines in the first page of this job: 42


> **Understand the output**: The response of a triggered job is a JSON object in the below format. Note that the response depends on the plan type of the API Key.

```javascript
{
    "JobStatus": <string>,                              # Status of the triggered Process  @ JOB-LEVEL
    "Pages": <integer>,                                 # Number of pages processed in this request @ PAGE-LEVEL
    "Tables": [<list of key-value objects of table>     # List of all tables found @ TABLE-LEVEL
        {
            "Page": <integer>,                              ## Page number in which this table is found
            "CharacterConfidence": <float>,                 ## Accuracy of Characters recognized from the input-page
            "LayoutConfidence": <float>,                    ## Accuracy of table layout's design decision
            "TableJson": <dict>,                            ## Table Cell Text in key-value format with index orientation - {row#: {col#: <str>}}
            "TableCoordinates": <dict>,                     ## Top-left & Bottom-right Cell Coordinates - {row#: {col#: <list(x1,y1,x2,y2)>}}
            "TableConfidence": <dict>                       ## Cell level accuracy of detected characters - {row#: {col#: <float>}}
        },
    {...}                                               ## ... more "Tables" objects
    ],
    "Lines": [<list of key-value objects>               # Pagewise Line details @ PAGE-LEVEL
        {
            "Page": <integer>,                          # Page number in which the lines are found
            "CharacterConfidence": <float>,             # Average Accuracy of all Characters recognized from the input-page
            "LinesArray": [
                <list of key-value objects of line>     # Ordered list of lines in this page @ LINE-LEVEL
                {
                    "Line": <str>,                          ## Detected text of the complete line
                    "WordsArray": [
                        <list of key-value objects>         ## Word level details in this line @ WORD-LEVEL
                        {
                            "Conf": <float>,                    ### Accuracy of recognized characters of the word
                            "Word": <str>,                      ### Detected text of the word
                            "Loc": [x1, y1, x2, y2]             ### Top-left & Bottom-right coordinates, w.r.t the input-page width-height dimensions
                        },
                    {...}                                   ### More "WordsArray" objects
                    ]
                },
            {...}                                       ## More "LinesArray" objects
            ]
        },
    {...}                                               # More Pagewise "Lines" details
    ]
}
```
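The `{row#: {col#: value}}` orientation used by `TableJson` and `TableConfidence` can be walked with plain Python. The sketch below is illustrative only: the helper names and the sample table are made up, and the `0.8` threshold assumes confidences are reported on a 0 to 1 scale, which the format above does not state.

```python
from typing import Dict, List, Tuple


def table_json_to_rows(table_json: Dict) -> List[List[str]]:
    """Turn {row#: {col#: text}} into a list of rows, ordered by index."""
    rows = []
    for row_key in sorted(table_json, key=int):
        row = table_json[row_key]
        rows.append([row[col_key] for col_key in sorted(row, key=int)])
    return rows


def low_confidence_cells(table_confidence: Dict, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) positions whose confidence is below the threshold."""
    flagged = []
    for row_key in sorted(table_confidence, key=int):
        for col_key in sorted(table_confidence[row_key], key=int):
            if table_confidence[row_key][col_key] < threshold:
                flagged.append((int(row_key), int(col_key)))
    return flagged


# Invented sample shaped like one entry of "Tables"
sample_table = {
    "Page": 1,
    "TableJson": {
        "0": {"0": "FLC Code", "1": "Room Name"},
        "1": {"0": "RGOOTO1", "1": "Verandah"},
    },
    "TableConfidence": {
        "0": {"0": 0.99, "1": 0.98},
        "1": {"0": 0.62, "1": 0.97},
    },
}

rows = table_json_to_rows(sample_table["TableJson"])
flagged = low_confidence_cells(sample_table["TableConfidence"], threshold=0.8)
print(rows)
print(flagged)
```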

In [0]:
table_data     # Notice the default output is a list of pandas DataFrames, one per table

[          0                           1  ...      5                   6
 0  FLC Code                   Room Name  ...  W (m)  Ceiling Height (m)
 1   RGOOTO1  Indigenous Support Officer  ...    7.3                 2.7
 2   RGOOTO2         Instrum. Music Room  ...    7.3                 2.7
 3   RGOTO1A                    Verandah  ...    1.7                 3.0
 4   RGOTO1B              Eastern Stairs  ...    1.7                 N/A
 5   RGOTO2B              Western Stairs  ...    1.0                 N/A
 
 [6 rows x 7 columns]]

The default output is a list of pandas DataFrames, which you can convert to any other format; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
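In the sample output above, the table's header text sits in row 0 rather than in the column labels. One possible way to tidy that up before exporting is sketched below with standard pandas calls; `promote_first_row_to_header` is a hypothetical helper, and the small DataFrame stands in for one element of `table_data`.

```python
import pandas as pd


def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame whose first row becomes the column labels."""
    header = df.iloc[0].tolist()
    body = df.iloc[1:].reset_index(drop=True)
    body.columns = header
    return body


# Stand-in for table_data[0]; values abbreviated from the output above
sample = pd.DataFrame([
    ["FLC Code", "Room Name", "W (m)"],
    ["RGOOTO1", "Indigenous Support Officer", "7.3"],
    ["RGOOTO2", "Instrum. Music Room", "7.3"],
])

clean = promote_first_row_to_header(sample)
csv_text = clean.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Other exporters such as `DataFrame.to_json()` or `DataFrame.to_dict()` can be applied to `clean` in the same way.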

In [0]:
# If your API Key supports "Lines" - Sample to get Lines

all_page_lines = []
for each_page in et_sess.Lines:
  for each_line in each_page['LinesArray']:
    all_page_lines.append(each_line['Line'])
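The same structure can be walked one level deeper to reach individual words. The sketch below collects words whose `Conf` falls under a threshold, for example to flag text worth a manual check. The sample payload is invented to mirror the documented `Lines` format, `words_below` is a hypothetical helper, and the 0 to 1 confidence scale is an assumption.

```python
from typing import Dict, List, Tuple


def words_below(lines_payload: List[Dict], threshold: float) -> List[Tuple[int, str, float]]:
    """Return (page, word, confidence) for every word under the threshold."""
    flagged = []
    for page in lines_payload:
        for line in page["LinesArray"]:
            for word in line["WordsArray"]:
                if word["Conf"] < threshold:
                    flagged.append((page["Page"], word["Word"], word["Conf"]))
    return flagged


# Invented sample shaped like et_sess.Lines
sample_lines = [
    {
        "Page": 1,
        "CharacterConfidence": 0.9,
        "LinesArray": [
            {
                "Line": "Room Name",
                "WordsArray": [
                    {"Conf": 0.97, "Word": "Room", "Loc": [10, 10, 50, 20]},
                    {"Conf": 0.55, "Word": "Name", "Loc": [55, 10, 95, 20]},
                ],
            }
        ],
    }
]

suspect_words = words_below(sample_lines, threshold=0.8)
print(suspect_words)
```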

Play with the <ins>result</ins>:
- Check the complete server response of the latest job with `et_sess.ServerResponse.json()`
- Check the list of available table output formats with `ExtractTable._OUTPUT_FORMATS`
- Retrieve the result for as long as the `JobId` is unexpired, which is usually 24 hours
  - ```job_output = et_sess.get_result(job_id=JobID_HERE)```

## Social Media
Follow us on social media for library updates and free credits.

[![Image](https://cdn3.iconfinder.com/data/icons/socialnetworking/32/linkedin.png)](https://www.linkedin.com/company/extracttable)
&nbsp;&nbsp;&nbsp;&nbsp;
[![Image](https://abs.twimg.com/favicons/twitter.ico)](https://twitter.com/extracttable)