# Extract, Transform, and Load

ETL is a common necessity for data engineering and data processing pipelines.
The source of the data may be other structured databases, unstructured data stores, data APIs, etc.

ETL can be a simple data acquisition task, such as shown below.

![AutomatedDataAcquisition.png MISSING](../images/AutomatedDataAcquisition.png)

**Or, it may be part of a larger process to accumulate data and information in support of advanced analytical systems.**

![AutomatedDataAcquisition_to_Analytics.png MISSING](../images/AutomatedDataAcquisition_to_Analytics.png)

---

## In the context of ETL, you now have the tools to perform this activity.

In the data loading lab, you read in three data files and then massaged the Pandas data frame to prepare the data for loading and to understand the semantics of the data.
You then loaded the database with data from the files.

We just need to understand how to acquire data from a remote resource, such as the web or an API, and process it with Pandas.

Additionally, in this notebook we will see how to use the SQLAlchemy library to simplify data loading.

## Tasks:

 **Consider**:
 + https://en.wikipedia.org/wiki/Land_use_statistics_by_country   
 
In the cells below, 

 1. Define a table for information about the world's countries.
 1. Describe some challenges you foresee with the data.
 1. Review and modify code cells that pull down the data from the table into a data frame.
 1. Load the data into your database.
 1. Test loaded data with SQL queries.

### 1. Define Tables
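
As a starting point for this task, here is one possible table definition. It is a sketch, not the required answer. The column names follow the data frame built later in this notebook, with `country` listed first as the primary key; the specific types are assumptions inferred from the `df.dtypes` output further down. The snippet uses Python's built-in `sqlite3` in-memory database only to check that the statement parses; on the course PostgreSQL server you would run the `CREATE TABLE` in your own schema instead.

```python
import sqlite3

# Hypothetical DDL: column names mirror the data frame columns used in this
# notebook; the types are assumptions based on the values in the Wikipedia table.
CREATE_LAND_USE = """
CREATE TABLE land_use_statistics (
    country            VARCHAR(100) PRIMARY KEY,
    rank               INTEGER NOT NULL,
    cultivated_land_k  INTEGER,
    cultivated_land_p  REAL,
    arable_land_k      INTEGER,
    arable_land_p      REAL,
    permanent_crops_k  INTEGER,
    permanent_crops_p  REAL,
    other_land_k       REAL,
    other_land_p       REAL,
    total_area         REAL,
    date_yr            INTEGER
);
"""

# Dry-run the statement against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(CREATE_LAND_USE)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(land_use_statistics)")]
print(columns)
conn.close()
```

The area columns `other_land_k` and `total_area` are floating point here because the Vatican City row holds a fractional value (0.44).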

### 2. Describe the challenges

### 3. Data Scraping Code

In [1]:
#import the library to query a website
import requests
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup



In [2]:
# specify the url
url = "https://en.wikipedia.org/wiki/Land_use_statistics_by_country"
# Open website URL and return the html to the variable 'response'
response = requests.get(url)
print(response.encoding)
print(response)

UTF-8
<Response [200]>


The response we get from a web server is typically HTML content, which we can read from the response object. 
Below, when a `BeautifulSoup` object is created from the response, we explicitly pass the decoded text (`response.text`).

The encoding used to decode the text is 'UTF-8', as shown in the output above. 

[Click here for additional documentation about the response object.](http://docs.python-requests.org/en/master/user/quickstart/#response-content)
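
To make the distinction concrete: `response.content` holds the raw bytes sent by the server, while `response.text` is those bytes decoded using `response.encoding`. The sketch below mimics that decoding step with plain byte strings so that it runs without a network connection; the sample markup is made up for illustration.

```python
# Stand-in for response.content: raw UTF-8 bytes as a server would send them.
raw_bytes = "<td>São Tomé and Príncipe</td>".encode("utf-8")

# Stand-in for response.text: the same bytes decoded with the correct encoding.
decoded_ok = raw_bytes.decode("utf-8")

# Decoding with the wrong encoding does not raise an error here,
# but it garbles every accented letter.
decoded_wrong = raw_bytes.decode("latin-1")

print(decoded_ok)
print(decoded_wrong)
```

This is why it is worth printing `response.encoding` before parsing: country names with accents are the first thing to break when the encoding is guessed wrong.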



In [3]:
# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(response.text, "html")

#####  Basic Inspection
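Use the `prettify()` method to print the data in its nested HTML structure. It must be called with parentheses; the cell below prints `soup.prettify` without them, which shows the bound method followed by the raw, unindented document.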

In [4]:
print(soup.prettify)

<bound method Tag.prettify of <!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Land use statistics by country - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3171f8f5-b2af-4a7a-9afc-35bc955818d6","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Land_use_statistics_by_country","wgTitle":"Land use statistics by country","wgCurRevisionId":1043349551,"wgRevisionId":1043349551,"wgArticleId":20860130,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Agricultural land","Lists by country"],"wgPageContentLanguage":"en","w

We need to extract the table that lists land use statistics for each country. 

This table should be present in one of the HTML tags.

We can work with the tags to extract the data present in them.  
`soup.<tag>` (for example, `soup.title`) returns the first element with that tag name, including its opening and closing tags. 

Additionally, the `.string` value is the text between the tags.
Compare the two lines of output below.

In [5]:
print(soup.title)
print(soup.title.string)

<title>Land use statistics by country - Wikipedia</title>
Land use statistics by country - Wikipedia


**Identify the HTML tag**: 
The data is in a table. 
To identify the tag that holds the data, right-click the table in your browser and choose the Inspect (or Inspect Element) option. 

 * [Additional guide on webpage inspection](../resources/AnalyzingHTMLwithTheWebInspector.pdf)


<img src="../images/Wikipedia_Inspect_Screen.png">

**If we look at the inspected HTML source for the table,** 
abbreviated here to focus on the header and the first three rows of data.

```HTML
<table class="sortable wikitable jquery-tablesorter">

 <thead>
     <tr bgcolor="#ececec" valign="top">
         <th data-sort-type="number" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Rank</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Country</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cultivated <br> land <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cultivated <br> land <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Arable <br> land <br> (km<sup>2</sup>)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Arable <br> land <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Permanent <br> crops <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Permanent <br> crops <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Other <br> lands <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Other <br> lands <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Total <br> area <br> (km<sup>2</sup>)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Date
</th>
     </tr>
 </thead>
 <tbody>
<tr>
    <td>—</td>
    <td><span class="flagicon" style="padding-left:25px;">&nbsp;</span><b><a href="/wiki/World" title="World">World</a></b></td>
    <td>17,235,800</td>
    <td>11.6</td>
    <td>15,749,300</td>
    <td>10.6</td>
    <td>1,549,600</td>
    <td>1</td>
    <td>131.701.100</td>
    <td>88.4</td>
    <td>149,000,000</td>
    <td>2011
    </td>
</tr>
<tr>
    <td>1</td>
    <td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/23px-Flag_of_India.svg.png" decoding="async" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/35px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/45px-Flag_of_India.svg.png 2x" data-file-width="1350" data-file-height="900">&nbsp;</span><a href="/wiki/India" title="India">India</a></td>
    <td>1,891,761</td>
    <td>57</td>
    <td>1,753,694</td>
    <td>52.8</td>
    <td>138,067</td>
    <td>4.2</td>
    <td>1,395,502</td>
    <td>43</td>
    <td>3,287,263</td>
    <td>2011
    </td>
</tr>
<tr>
    <td>2</td>
    <td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" decoding="async" width="23" height="12" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" data-file-width="1235" data-file-height="650">&nbsp;</span><a href="/wiki/United_States" title="United States">United States</a></td>
    <td>1,681,826</td>
    <td>17.1</td>
    <td>1,652,028</td>
    <td>16.8</td>
    <td>29,798</td>
    <td>0.3</td>
    <td>8,151,691</td>
    <td>82.9</td>
    <td>9,833,517</td>
    <td>2011
    </td>
</tr>
<tr>
...
</tr></tbody><tfoot></tfoot></table>
```

We see that the table tag has class settings of:
 * sortable 
 * wikitable 
 * jquery-tablesorter
 
```HTML
<table class="sortable wikitable jquery-tablesorter">
```

We want to focus on the `wikitable`.  
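
A side note on why searching for `wikitable` alone is enough: `class` is a multi-valued attribute, and Beautiful Soup's `class_` filter matches a tag when any one of its classes equals the search value. The sketch below uses a tiny hand-written HTML snippet (not the real page) to show this. Also note that the `jquery-tablesorter` class is most likely added by JavaScript in the browser, so it may be absent from the HTML that `requests` downloads.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page; only the second table carries 'wikitable'.
html = """
<table class="infobox"><tr><td>not this one</td></tr></table>
<table class="sortable wikitable"><tr><td>this one</td></tr></table>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# class_ matches when ANY of the tag's classes equals the search value.
table = sample_soup.find("table", class_="wikitable")

print(table["class"])             # the full list of classes on the matched tag
print(table.find("td").string)    # the text inside the first cell
```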

In [6]:
# We can fetch all Tables with a find_all() 
all_tables=soup.find_all('table')
print(type(all_tables))
print(len(all_tables))


# We can find the first (only) occurrence 
right_table=soup.find('table', class_='wikitable')
print(type(right_table))

<class 'bs4.element.ResultSet'>
2
<class 'bs4.element.Tag'>


The `Tag` element is the table.

**Examining the HTML Table Header, we have these columns**

 * Rank
 * Country
 * Cultivated Land km^2
 * Cultivated Land %
 * Arable Land km^2
 * Arable Land %
 * Permanent Crops km^2
 * Permanent Crops %
 * Other lands km^2
 * Other lands %
 * Total Area
 * Date

Therefore, a simple approach is to iterate through the HTML table rows, the `<tr>...</tr>` and process the data elements.

Reviewing the HTML above, we see we need to skip the headers and the "World" row.

Additionally, we will skip any row outside the ranked rows, that is, any row where Rank is not a number.
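
One practical wrinkle: `locale.setlocale` raises an error on machines where the `en_US.UTF-8` locale is not installed, and a malformed cell (such as the `131.701.100` in the World row above) makes a strict conversion fail. As a hedged alternative, the hypothetical helpers below strip thousands separators by hand and return `None` when a cell cannot be parsed. They are not part of the original notebook.

```python
def parse_int(text):
    """Convert a string such as '1,891,761' to int; return None if it is not an integer."""
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned)
    except ValueError:
        return None


def parse_float(text):
    """Convert a string such as '52.8' or '9,833,517' to float; return None on failure."""
    cleaned = text.strip().replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_int("1,891,761"))      # 1891761
print(parse_float("0.44"))         # 0.44
print(parse_int("2011\n"))         # 2011
print(parse_int("131.701.100"))    # None
print(parse_int("\u2014"))         # None (the dash used as the World row's rank)
```

With helpers like these, the "is this a ranked row?" test becomes `parse_int(this_rank) is None`, and a single bad cell can be logged and skipped instead of stopping the whole loop.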


In [7]:
# We will use the locale library so that atoi and atof can convert strings
# with thousands separators (e.g. '1,891,761') to integers and floats, respectively.
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

rank=[]
country=[]
cultivated_land_k=[]
cultivated_land_p=[]
arable_land_k=[]
arable_land_p=[]
permanent_crops_k=[]
permanent_crops_p=[]
other_land_k=[]
other_land_p=[]
total_area=[]
date_yr=[]


# Skip the first row because we don't need the headers
for row in right_table.find_all("tr")[1:]:
    # For each row, pull out the td elements.
    cells = row.find_all('td')  # The data cells of this row
    
    if len(cells)>2: # Only extract information from table body rows, not heading rows
        
        
        this_rank = cells[0].find(text=True)
        print("Processing rank {}".format(this_rank))
        
        # Skip the row if the rank is not a number (e.g. the "World" row)
        if not this_rank.isnumeric():
            print("Non-Ranked, skipping")
            continue
        
        rank.append(locale.atoi(this_rank))
        
        # For the country name, we need the text inside the Country hyperlink (a)
        country_name = cells[1].find('a').find(text=True)
        print(country_name)
        
        country.append(country_name)
        
        # Convert the data from text to numeric data types
        cultivated_land_k.append(locale.atoi(cells[2].find(text=True)))
        cultivated_land_p.append(locale.atof(cells[3].find(text=True)))
        
        arable_land_k.append(locale.atoi(cells[4].find(text=True)))
        arable_land_p.append(locale.atof(cells[5].find(text=True)))
        
        permanent_crops_k.append(locale.atoi(cells[6].find(text=True)))
        permanent_crops_p.append(locale.atof(cells[7].find(text=True)))
        
        # Note: this is converted to float because the Vatican City row has a non-integer value
        other_land_k.append(locale.atof(cells[8].find(text=True)))
        other_land_p.append(locale.atof(cells[9].find(text=True)))
        
        total_area.append(locale.atof(cells[10].find(text=True)))
        date_yr.append(locale.atoi(cells[11].find(text=True)))


Processing rank —
Non-Ranked, skipping
Processing rank 1
India
Processing rank 2
United States
Processing rank 3
Russia
Processing rank 4
China
Processing rank 5
Brazil
Processing rank 6
Canada
Processing rank 7
Australia
Processing rank 8
Indonesia
Processing rank 9
Nigeria
Processing rank 10
Argentina
Processing rank 11
Ukraine
Processing rank 12
Sudan
Processing rank 13
Mexico
Processing rank 14
Kazakhstan
Processing rank 15
Turkey
Processing rank 16
Pakistan
Processing rank 17
France
Processing rank 18
Thailand
Processing rank 19
Iran
Processing rank 20
Ethiopia
Processing rank 21
Spain
Processing rank 22
Tanzania
Processing rank 23
Niger
Processing rank 24
Myanmar (Burma)
Processing rank 25
South Africa
Processing rank 26
Germany
Processing rank 27
Poland
Processing rank 28
Uganda
Processing rank 29
Vietnam
Processing rank 30
Philippines
Processing rank 31
Romania
Processing rank 32
Bangladesh
Processing rank 33
Italy
Processing rank 34
Morocco
Processing rank 35
Algeria
Processin

##### Now that we have built all our columns, stack into a data frame!

In [8]:
import pandas as pd

# Note: in the table definition above, we listed
# the country name first to use as a primary key

df=pd.DataFrame({'country': country,
                'rank': rank,
                'cultivated_land_k': cultivated_land_k,
                'cultivated_land_p': cultivated_land_p,
                'arable_land_k': arable_land_k,
                'arable_land_p': arable_land_p,
                'permanent_crops_k': permanent_crops_k,
                'permanent_crops_p': permanent_crops_p,
                'other_land_k': other_land_k,
                'other_land_p': other_land_p,
                'total_area': total_area,
                'date_yr': date_yr
                })


In [9]:
df.head()

Unnamed: 0,country,rank,cultivated_land_k,cultivated_land_p,arable_land_k,arable_land_p,permanent_crops_k,permanent_crops_p,other_land_k,other_land_p,total_area,date_yr
0,India,1,1891761,57.0,1753694,52.8,138067,4.2,1395502.0,43.0,3287263.0,2011
1,United States,2,1681826,17.1,1652028,16.8,29798,0.3,8151691.0,82.9,9833517.0,2011
2,Russia,3,1265267,7.4,1248169,7.3,17098,0.1,15832975.0,92.6,17098242.0,2011
3,China,4,1238013,12.9,1084461,11.3,153552,1.6,8358947.0,87.1,9596960.0,2011
4,Brazil,5,800485,9.4,732359,8.6,68126,0.8,7715285.0,90.6,8515770.0,2011


In [10]:
df.tail()

Unnamed: 0,country,rank,cultivated_land_k,cultivated_land_p,arable_land_k,arable_land_p,permanent_crops_k,permanent_crops_p,other_land_k,other_land_p,total_area,date_yr
188,San Marino,190,10,16.67,10,16.67,0,0.0,51.0,83.33,61.0,2005
189,Djibouti,191,9,0.04,9,0.04,0,0.0,22971.0,99.96,22980.0,2005
190,Nauru,192,0,0.0,0,0.0,0,0.0,21.0,100.0,21.0,2005
191,Monaco,193,0,0.0,0,0.0,0,0.0,2.0,100.0,2.0,2005
192,Vatican City,194,0,0.0,0,0.0,0,0.0,0.44,100.0,0.44,2005


### Check our column data types!
Does this match the data types we sketched out in the `CREATE TABLE` statement above?
If you need to adjust the definition, this would be the time.
Alternatively, we can adjust the columns using Pandas techniques.
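
For instance, if you want an integer column to match a 4-byte `INTEGER` in your table definition, `DataFrame.astype` can convert selected columns. The sketch below uses a three-row stand-in for `df` (values copied from the output above); which columns to convert, and to what type, is an assumption that depends on your own `CREATE TABLE`.

```python
import pandas as pd

# Small stand-in for the scraped data frame.
sample = pd.DataFrame({
    "country": ["India", "United States", "Russia"],
    "rank": [1, 2, 3],
    "total_area": [3287263.0, 9833517.0, 17098242.0],
})

# Convert only the named columns; all other columns keep their dtypes.
converted = sample.astype({"rank": "int32"})

print(converted.dtypes)
```

Be careful in the other direction: converting `total_area` to an integer type would truncate Vatican City's 0.44 to 0.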

In [11]:
df.dtypes

country               object
rank                   int64
cultivated_land_k      int64
cultivated_land_p    float64
arable_land_k          int64
arable_land_p        float64
permanent_crops_k      int64
permanent_crops_p    float64
other_land_k         float64
other_land_p         float64
total_area           float64
date_yr                int64
dtype: object

Once our Pandas data frame and the SQL table are in line with each other, we can load the data into the database.

---

### 4. Load the data into your database

This time, instead of the manual loading, we are going to use the SQLAlchemy library.


In [12]:
import getpass
mypasswd = getpass.getpass()
username = 'jch5x8'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

········


In [13]:
# Connect to the database
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy Connection Parameters
# ('postgresql' is the dialect name; the older 'postgres' alias was removed in SQLAlchemy 1.4)
postgres_db = {'drivername': 'postgresql',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database' :database}
engine = create_engine(URL(**postgres_db), echo=True)
del mypasswd


#### When you run the cell below, carefully examine the output so you see what the SQLAlchemy library is doing!

In [14]:

# With the SQLAlchemy engine created, use the data frame's to_sql method to load the table
df.to_sql('land_use_statistics', # The table to load
          engine,             # The engine created above
          schema= username,   # The schema where the table lives, our pawprint
          if_exists='append', # If the table already exists, append the new rows to the end of it
          index=False,        # A Pandas data frame has a row index; do not write it as a column
          chunksize=20)       # Insert 20 records from the data frame at a time


2021-11-22 12:33:23,284 INFO sqlalchemy.engine.base.Engine select version()
2021-11-22 12:33:23,285 INFO sqlalchemy.engine.base.Engine {}
2021-11-22 12:33:23,288 INFO sqlalchemy.engine.base.Engine select current_schema()
2021-11-22 12:33:23,288 INFO sqlalchemy.engine.base.Engine {}
2021-11-22 12:33:23,290 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2021-11-22 12:33:23,291 INFO sqlalchemy.engine.base.Engine {}
2021-11-22 12:33:23,293 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2021-11-22 12:33:23,293 INFO sqlalchemy.engine.base.Engine {}
2021-11-22 12:33:23,294 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2021-11-22 12:33:23,295 INFO sqlalchemy.engine.base.Engine {}
2021-11-22 12:33:23,296 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where n.nspname=%(schema)s and relname=%(name)s
2021-11-22 12:33:23

2021-11-22 12:33:23,337 INFO sqlalchemy.engine.base.Engine INSERT INTO jch5x8.land_use_statistics (country, rank, cultivated_land_k, cultivated_land_p, arable_land_k, arable_land_p, permanent_crops_k, permanent_crops_p, other_land_k, other_land_p, total_area, date_yr) VALUES (%(country)s, %(rank)s, %(cultivated_land_k)s, %(cultivated_land_p)s, %(arable_land_k)s, %(arable_land_p)s, %(permanent_crops_k)s, %(permanent_crops_p)s, %(other_land_k)s, %(other_land_p)s, %(total_area)s, %(date_yr)s)
2021-11-22 12:33:23,337 INFO sqlalchemy.engine.base.Engine ({'country': 'Kenya', 'rank': 41, 'cultivated_land_k': 62103, 'cultivated_land_p': 10.7, 'arable_land_k': 56879, 'arable_land_p': 9.8, 'permanent_crops_k': 5224, 'permanent_crops_p': 0.9, 'other_land_k': 518264.0, 'other_land_p': 89.3, 'total_area': 580367.0, 'date_yr': 2011}, {'country': 'United Kingdom', 'rank': 42, 'cultivated_land_k': 61631, 'cultivated_land_p': 25.3, 'arable_land_k': 61144, 'arable_land_p': 25.1, 'permanent_crops_k': 487

2021-11-22 12:33:23,359 INFO sqlalchemy.engine.base.Engine INSERT INTO jch5x8.land_use_statistics (country, rank, cultivated_land_k, cultivated_land_p, arable_land_k, arable_land_p, permanent_crops_k, permanent_crops_p, other_land_k, other_land_p, total_area, date_yr) VALUES (%(country)s, %(rank)s, %(cultivated_land_k)s, %(cultivated_land_p)s, %(arable_land_k)s, %(arable_land_p)s, %(permanent_crops_k)s, %(permanent_crops_p)s, %(other_land_k)s, %(other_land_p)s, %(total_area)s, %(date_yr)s)
2021-11-22 12:33:23,359 INFO sqlalchemy.engine.base.Engine ({'country': 'Honduras', 'rank': 101, 'cultivated_land_k': 14684, 'cultivated_land_p': 13.1, 'arable_land_k': 10201, 'arable_land_p': 9.1, 'permanent_crops_k': 4483, 'permanent_crops_p': 4.0, 'other_land_k': 97406.0, 'other_land_p': 86.9, 'total_area': 112090.0, 'date_yr': 2011}, {'country': 'Austria', 'rank': 102, 'cultivated_land_k': 14515, 'cultivated_land_p': 17.3, 'arable_land_k': 13844, 'arable_land_p': 16.5, 'permanent_crops_k': 671, '

2021-11-22 12:33:23,382 INFO sqlalchemy.engine.base.Engine INSERT INTO jch5x8.land_use_statistics (country, rank, cultivated_land_k, cultivated_land_p, arable_land_k, arable_land_p, permanent_crops_k, permanent_crops_p, other_land_k, other_land_p, total_area, date_yr) VALUES (%(country)s, %(rank)s, %(cultivated_land_k)s, %(cultivated_land_p)s, %(arable_land_k)s, %(arable_land_p)s, %(permanent_crops_k)s, %(permanent_crops_p)s, %(other_land_k)s, %(other_land_p)s, %(total_area)s, %(date_yr)s)
2021-11-22 12:33:23,383 INFO sqlalchemy.engine.base.Engine ({'country': 'São Tomé and Príncipe', 'rank': 162, 'cultivated_land_k': 479, 'cultivated_land_p': 49.7, 'arable_land_k': 87, 'arable_land_p': 9.1, 'permanent_crops_k': 392, 'permanent_crops_p': 40.6, 'other_land_k': 485.0, 'other_land_p': 50.3, 'total_area': 964.0, 'date_yr': 2011}, {'country': 'Trinidad and Tobago', 'rank': 163, 'cultivated_land_k': 471, 'cultivated_land_p': 9.2, 'arable_land_k': 251, 'arable_land_p': 4.9, 'permanent_crops_k
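
The log above shows one parameterized `INSERT` being executed per chunk of rows. To illustrate the idea without a PostgreSQL server, the sketch below does the same batching by hand with Python's built-in `sqlite3` module. The two-column table and the `chunked` helper are hypothetical simplifications, not what pandas runs internally.

```python
import sqlite3

rows = [("India", 1), ("United States", 2), ("Russia", 3), ("China", 4), ("Brazil", 5)]


def chunked(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE land_use_statistics (country TEXT PRIMARY KEY, rank INTEGER)")

batches = 0
for batch in chunked(rows, 2):          # chunk size 2 here; the notebook uses 20
    # Parameterized INSERT: the driver fills in the ? placeholders for each row.
    conn.executemany("INSERT INTO land_use_statistics VALUES (?, ?)", batch)
    batches += 1
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM land_use_statistics").fetchone()[0]
print(batches, loaded)
conn.close()
```

Because `if_exists='append'` adds rows every time the cell runs, rerunning the load would either duplicate the data or, if `country` is a primary key, fail with a uniqueness error. Emptying the table before reloading avoids both.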

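### 5. Test loaded data with SQL queries

In a `psql` session, run the query below, replacing `SSO` with your own schema name (the `username` value used above). The `\x` meta-command switches on expanded display, which prints one column per line.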



```SQL
\x
select * from SSO.land_use_statistics limit 2;
```

---

```
-[ RECORD 1 ]-----+--------------
country           | India
rank              | 1
cultivated_land_k | 1891761
cultivated_land_p | 57
arable_land_k     | 1753694
arable_land_p     | 52.8
permanent_crops_k | 138067
permanent_crops_p | 4.2
other_land_k      | 1.3955e+06
other_land_p      | 43
total_area        | 3.28726e+06
date_yr           | 2011
-[ RECORD 2 ]-----+--------------
country           | United States
rank              | 2
cultivated_land_k | 1681826
cultivated_land_p | 17.1
arable_land_k     | 1652028
arable_land_p     | 16.8
permanent_crops_k | 29798
permanent_crops_p | 0.3
other_land_k      | 8.15169e+06
other_land_p      | 82.9
total_area        | 9.83352e+06
date_yr           | 2011
```
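
Beyond eyeballing a couple of records, a few aggregate queries make quicker checks: the row count, the range of `rank`, and whether the cultivated and other-land percentages add up to roughly 100. The sketch below runs such queries against an in-memory SQLite table holding three rows copied from the outputs earlier in this notebook. The queries themselves are illustrative suggestions; on the course server you would run them in `psql` against your own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE land_use_statistics (
        country TEXT PRIMARY KEY, rank INTEGER,
        cultivated_land_p REAL, other_land_p REAL)
""")
conn.executemany(
    "INSERT INTO land_use_statistics VALUES (?, ?, ?, ?)",
    [("India", 1, 57.0, 43.0),
     ("United States", 2, 17.1, 82.9),
     ("Russia", 3, 7.4, 92.6)],
)

# Check 1: how many rows were loaded, and what range of ranks do they cover?
row_count, min_rank, max_rank = conn.execute(
    "SELECT COUNT(*), MIN(rank), MAX(rank) FROM land_use_statistics"
).fetchone()

# Check 2: rows whose two percentages do not sum to about 100.
suspect = conn.execute("""
    SELECT country FROM land_use_statistics
    WHERE ABS(cultivated_land_p + other_land_p - 100.0) > 0.1
""").fetchall()

print(row_count, min_rank, max_rank, suspect)
conn.close()
```

A row count that is a multiple of the expected number of countries is a quick hint that the load cell was run more than once.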

---

#### Now that the data is loaded, let's pull it back out!

In [15]:
df_backout = pd.read_sql_table(
    'land_use_statistics',
    con = engine,             # The engine created above
    schema= username   # The schema where the table lives, our pawprint
)

2021-11-22 12:34:50,021 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind in ('r', 'p')
2021-11-22 12:34:50,022 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8'}
2021-11-22 12:34:50,037 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind IN ('v', 'm')
2021-11-22 12:34:50,037 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8'}
2021-11-22 12:34:50,043 INFO sqlalchemy.engine.base.Engine 
            SELECT c.oid
            FROM pg_catalog.pg_class c
            LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
            WHERE (n.nspname = %(schema)s)
            AND c.relname = %(table_name)s AND c.relkind in
            ('r', 'v', 'm', 'f', 'p')
        
2021-11-22 12:34:50,043 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8', 'table_name': 'land_us

In [16]:
df_backout.head(10)

Unnamed: 0,country,rank,cultivated_land_k,cultivated_land_p,arable_land_k,arable_land_p,permanent_crops_k,permanent_crops_p,other_land_k,other_land_p,total_area,date_yr
0,India,1,1891761,57.0,1753694,52.8,138067,4.2,1395502.0,43.0,3287263.0,2011
1,United States,2,1681826,17.1,1652028,16.8,29798,0.3,8151691.0,82.9,9833517.0,2011
2,Russia,3,1265267,7.4,1248169,7.3,17098,0.1,15832975.0,92.6,17098242.0,2011
3,China,4,1238013,12.9,1084461,11.3,153552,1.6,8358947.0,87.1,9596960.0,2011
4,Brazil,5,800485,9.4,732359,8.6,68126,0.8,7715285.0,90.6,8515770.0,2011
5,Canada,6,519205,5.2,469281,4.7,49924,0.5,9465465.0,94.8,9984670.0,2011
6,Australia,7,487695,6.3,479954,6.2,7741,0.1,7253525.0,93.7,7741220.0,2011
7,Indonesia,8,478055,25.1,247598,13.0,230457,12.1,1426514.0,74.9,1904569.0,2011
8,Nigeria,9,412938,44.7,344577,37.3,68361,7.4,510830.0,55.3,923768.0,2011
9,Argentina,10,397598,14.3,386476,13.9,11122,0.4,2382802.0,85.7,2780400.0,2016


In [17]:
df_backout.tail(10)

Unnamed: 0,country,rank,cultivated_land_k,cultivated_land_p,arable_land_k,arable_land_p,permanent_crops_k,permanent_crops_p,other_land_k,other_land_p,total_area,date_yr
183,Bahrain,185,56,8.45,19,2.82,37,5.63,609.0,91.55,665.0,2005
184,Liechtenstein,186,40,25.0,40,25.0,0,0.0,120.0,75.0,160.0,2005
185,Singapore,187,20,2.94,10,1.47,10,1.47,663.0,97.06,683.0,2005
186,Tuvalu,188,17,66.67,0,0.0,17,66.67,9.0,33.33,26.0,2005
187,Andorra,189,10,2.13,10,2.13,0,0.0,458.0,97.87,468.0,2005
188,San Marino,190,10,16.67,10,16.67,0,0.0,51.0,83.33,61.0,2005
189,Djibouti,191,9,0.04,9,0.04,0,0.0,22971.0,99.96,22980.0,2005
190,Nauru,192,0,0.0,0,0.0,0,0.0,21.0,100.0,21.0,2005
191,Monaco,193,0,0.0,0,0.0,0,0.0,2.0,100.0,2.0,2005
192,Vatican City,194,0,0.0,0,0.0,0,0.0,0.44,100.0,0.44,2005
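
To go one step beyond comparing `head` and `tail` by eye, the two data frames can be compared programmatically. The sketch below uses two tiny stand-in frames in place of `df` and `df_backout`; it assumes, as is generally the case for SQL tables, that rows read back from the database come in no guaranteed order, so both frames are sorted before comparison.

```python
import pandas as pd

# Stand-ins for the original frame and the frame read back from the database.
df_sample = pd.DataFrame({"country": ["India", "Russia"], "rank": [1, 3]})
df_roundtrip = pd.DataFrame({"country": ["Russia", "India"], "rank": [3, 1]})

# Sort by the key column and reset the index so only the data is compared.
left = df_sample.sort_values("country").reset_index(drop=True)
right = df_roundtrip.sort_values("country").reset_index(drop=True)

frames_match = left.equals(right)
print(frames_match)
```

`DataFrame.equals` also compares dtypes, so a column that came back as a different numeric type would make the comparison report a mismatch even when the values look the same.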


# Save your notebook, then `File > Close and Halt`

---