# Subject: Data Science Foundation

## Session 14 - ArcGIS API for Python.

### Demo 5 - Descriptive Statistics using an HTML table to Pandas Data Frame to Portal Item

# Descriptive Statistics

Descriptive analytics is backward-looking: it focuses on telling users what happened. Typical questions it can answer are how many, when, and where.

Articles often present data in tabular form. When such data contains location information, it is much more insightful when presented as a map. This sample shows how pandas can be used to extract data from a table within a web page (in this case, a Wikipedia article) and how it can then be brought into the GIS for further analysis and visualization.

Note: to run this sample, you need a few extra libraries in your conda environment. If you don't have them, install them by running the following commands from cmd.exe or your shell:

- *conda install lxml* - Pythonic binding for the C libraries libxml2 and libxslt.
- *conda install html5lib* - HTML parser based on the WHATWG HTML specification.
- *conda install beautifulsoup4* - Python library designed for screen-scraping.
- *conda install matplotlib* - Publication quality figures in Python.

In [51]:
import pandas as pd

Let us read the Wikipedia article on List of countries by vehicles per capita as a pandas data frame object.

Note: This article lists countries by the number of road motor vehicles per 1,000 inhabitants. A car is different from a road motor vehicle: the latter includes automobiles but also vans, buses, freight and other trucks. The list, however, excludes motorcycles and other two-wheelers.

https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita

In [52]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_vehicles_per_capita")[0]
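To see what `pd.read_html` returns without depending on the live Wikipedia page, here is a minimal sketch that parses a small hand-written HTML table. The HTML snippet and the names `html`, `tables` and `sample` are made up for illustration; the call itself needs one of the parsers listed above (lxml, or html5lib with beautifulsoup4).

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of the Wikipedia table, written by hand.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Motor vehicles per 1,000 people</th></tr>
  <tr><td>1</td><td>San Marino</td><td>1263</td></tr>
  <tr><td>2</td><td>Monaco</td><td>899</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element found.
tables = pd.read_html(StringIO(html))
sample = tables[0]

print(len(tables))
print(sample)
```

Because `read_html` always returns a list, the `[0]` in the cell above simply picks the first table found on the page.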

In [53]:
df.head()

Unnamed: 0,0,1,2,3,4
0,Rank,Country,"Motor vehicles per 1,000 people",Total,Notes
1,1,San Marino,1263,,2014[1]
2,2,Monaco,899,,2014[1]
3,3,United States,797,"255,009,283[2]",2015[2]
4,4,New Zealand,774,3600000,2017[3]


In [54]:
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
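The two lines above promote the first data row to column labels and then remove that row. The same idea can be sketched on a tiny hand-built frame; `raw` and `promoted` are hypothetical names, and `drop(index=0)` is used here as an equivalent of the `reindex` call.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the header text sits in row 0.
raw = pd.DataFrame([
    ["Rank", "Country", "Motor vehicles per 1,000 people"],
    ["1", "San Marino", "1263"],
    ["2", "Monaco", "899"],
])

promoted = raw.copy()
promoted.columns = promoted.iloc[0]   # use row 0 as the column labels
promoted = promoted.drop(index=0)     # then remove that row from the data
promoted.columns.name = None          # drop the leftover "0" label on the column axis

print(promoted)
```

The last line is optional. Without it the column axis keeps the name `0`, which is likely the stray `0` printed at the top of the `df.dtypes` output. Depending on the page layout, passing `header=0` to `pd.read_html` may make this step unnecessary.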

In [55]:
df.head()

Unnamed: 0,Rank,Country,"Motor vehicles per 1,000 people",Total,Notes
1,1,San Marino,1263,,2014[1]
2,2,Monaco,899,,2014[1]
3,3,United States,797,"255,009,283[2]",2015[2]
4,4,New Zealand,774,3600000,2017[3]
5,5,Liechtenstein,750,,2014[1][4]


Let's check the data structure.

In [56]:
df.dtypes

0
Rank                               object
Country                            object
Motor vehicles per 1,000 people    object
Total                              object
Notes                              object
dtype: object

In [57]:
df.shape 

(193, 5)

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 1 to 193
Data columns (total 5 columns):
Rank                               192 non-null object
Country                            192 non-null object
Motor vehicles per 1,000 people    192 non-null object
Total                              15 non-null object
Notes                              192 non-null object
dtypes: object(5)
memory usage: 9.0+ KB
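The `info()` output reports 193 entries but only 192 non-null values in most columns, and just 15 in `Total`. A quick way to count missing values per column and to locate rows that are empty everywhere is sketched below on a small made-up frame (`sample` is a hypothetical stand-in for `df`).

```python
import pandas as pd

# Hypothetical sample with one completely empty row (index 3).
sample = pd.DataFrame({
    "Country": ["San Marino", "United States", "New Zealand", None],
    "Total": [None, "255,009,283[2]", "3600000", None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Rows in which every column is missing.
empty_rows = sample[sample.isna().all(axis=1)]

print(missing_per_column)
print(empty_rows.index.tolist())
```

On the real data, `df.isna().sum()` gives the same per-column picture as the non-null counts above, only expressed as missing values.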


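Let's find the ranking position of our country. Note the cell numbers: these cells were run after the renaming and numeric conversion steps shown further down, so their output already uses the renamed column and floating-point values.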

In [90]:
df.loc[df['Country'] == "Spain"]

Unnamed: 0,Rank,Country,Motorveh1_000people,Total,Notes
15,15.0,Spain,593.0,,2014[1]


In [91]:
df.loc[df['Country'] == "Portugal"]

Unnamed: 0,Rank,Country,Motorveh1_000people,Total,Notes
25,25.0,Portugal,548.0,,2010[12]
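Both lookups can be combined into a single filter with `isin`. This is a small sketch on a hand-built frame that reuses three rows from the outputs above; `sample` and `iberia` are hypothetical names.

```python
import pandas as pd

# Three rows copied from the outputs shown in this notebook.
sample = pd.DataFrame({
    "Rank": [3, 15, 25],
    "Country": ["United States", "Spain", "Portugal"],
    "Motorveh1_000people": [797, 593, 548],
})

# Keep only the rows whose Country is in the given list.
iberia = sample.loc[sample["Country"].isin(["Spain", "Portugal"])]

print(iberia)
```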


Let's check the descriptive statistics.

In [59]:
df.describe()

Unnamed: 0,Rank,Country,"Motor vehicles per 1,000 people",Total,Notes
count,192,192,192,15,192
unique,185,192,143,15,55
top,135,China,7,42192000,2010[1]
freq,3,1,4,1,48
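The summary above shows `count`, `unique`, `top` and `freq` because every column is still stored as text. The sketch below contrasts `describe()` on text and on numbers, using the five values from the head of the table; `values`, `as_text` and `as_number` are hypothetical names.

```python
import pandas as pd

# The first five "Motor vehicles per 1,000 people" values, stored as text.
values = pd.Series(["1263", "899", "797", "774", "750"], name="Motorveh1_000people")

as_text = values.describe()                    # count, unique, top, freq
as_number = pd.to_numeric(values).describe()   # count, mean, std, min, quartiles, max

print(as_text)
print(as_number)
```

Only the numeric version yields the mean, spread and quartiles usually meant by descriptive statistics, which is one reason the columns are converted to numbers.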


Let's rename the columns to prepare the data for a correlation analysis and also for mapping.

In [60]:
# Only this column needs a new name; the others already have usable names.
df.rename(columns={'Motor vehicles per 1,000 people': 'Motorveh1_000people'}, inplace=True)

In [61]:
df.head()

Unnamed: 0,Rank,Country,Motorveh1_000people,Total,Notes
1,1,San Marino,1263,,2014[1]
2,2,Monaco,899,,2014[1]
3,3,United States,797,"255,009,283[2]",2015[2]
4,4,New Zealand,774,3600000,2017[3]
5,5,Liechtenstein,750,,2014[1][4]


We need the "Motor vehicles per 1,000 people" column (Motorveh1_000people) in numeric format. Let us convert it and, while doing so, turn values that cannot be parsed into NaN, which stands for Not a Number.

In [73]:
# With errors='coerce', values that cannot be parsed are set to NaN.
converted_column = pd.to_numeric(df["Motorveh1_000people"], errors='coerce')
df['Motorveh1_000people'] = converted_column
df.head()

Unnamed: 0,Rank,Country,Motorveh1_000people,Total,Notes
1,1,San Marino,1263.0,,2014[1]
2,2,Monaco,899.0,,2014[1]
3,3,United States,797.0,"255,009,283[2]",2015[2]
4,4,New Zealand,774.0,3600000,2017[3]
5,5,Liechtenstein,750.0,,2014[1][4]
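To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up series that mixes clean values with entries that cannot be parsed. The values `"n/a"` and `"797[2]"` are invented for illustration.

```python
import pandas as pd

# Hypothetical text column: two clean values, then three problem entries.
raw = pd.Series(["1263", "899", "n/a", None, "797[2]"], dtype=object)

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric)
print(numeric.isna().sum())
```

Because NaN is a floating-point value, the converted column is stored as `float64`, which is why whole numbers are displayed as `1263.0` rather than `1263`.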


Repeat for the "Rank" column

In [75]:
converted_column = pd.to_numeric(df["Rank"], errors = 'coerce')
df['Rank'] = converted_column
df.head()

Unnamed: 0,Rank,Country,Motorveh1_000people,Total,Notes
1,1.0,San Marino,1263.0,,2014[1]
2,2.0,Monaco,899.0,,2014[1]
3,3.0,United States,797.0,"255,009,283[2]",2015[2]
4,4.0,New Zealand,774.0,3600000,2017[3]
5,5.0,Liechtenstein,750.0,,2014[1][4]
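The `Total` column is left as text in this demo. If it were needed as a number, a direct `to_numeric` call with `errors='coerce'` would turn values such as `255,009,283[2]` into NaN, so the footnote markers and thousands separators would have to be removed first. The following is only a sketch of one way to do that, on three values taken from the output above; `total`, `cleaned` and `total_numeric` are hypothetical names.

```python
import pandas as pd

# Three Total values as they appear in the table (None marks a missing value).
total = pd.Series([None, "255,009,283[2]", "3600000"], dtype=object)

cleaned = (
    total.str.replace(r"\[\d+\]", "", regex=True)   # drop footnote markers such as [2]
         .str.replace(",", "", regex=False)         # drop thousands separators
)

# Missing values stay missing; everything else becomes a number.
total_numeric = pd.to_numeric(cleaned, errors="coerce")

print(total_numeric)
```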


Let's calculate the correlation.

In [76]:
# pairwise correlation
df.drop(['Country', 'Total', 'Notes'], axis=1).corr(method='spearman')

Unnamed: 0,Rank,Motorveh1_000people
Rank,1.0,-0.997596
Motorveh1_000people,-0.997596,1.0
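A coefficient close to -1 is expected here, because `Rank` is derived from the vehicles-per-capita value: the higher the value, the lower the rank number. Spearman correlation is the Pearson correlation computed on ranks, which the sketch below checks on the first five rows of the table (`sample` is a hypothetical stand-in for `df`). The full table gives -0.997596 rather than exactly -1, presumably because some countries share a rank.

```python
import pandas as pd

# First five rows of the table: rank goes up as the vehicle rate goes down.
sample = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Motorveh1_000people": [1263, 899, 797, 774, 750],
})

# Spearman correlation computed directly ...
spearman = sample.corr(method="spearman")

# ... and as Pearson correlation of the ranked values.
pearson_on_ranks = sample.rank().corr(method="pearson")

print(spearman)
print(pearson_on_ranks)
```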


## Plot as a map

Let us connect to our GIS to geocode this data and present it as a map

In [79]:
from arcgis.gis import GIS
from getpass import getpass
import json

# Prompt for the password instead of storing it in the notebook.
gis = GIS("https://www.arcgis.com", "FSGutierres_BTS", getpass("Password: "))

In [80]:
fc = gis.content.import_data(df, {"CountryCode":"Country"})

In [92]:
map1 = gis.map('Spain')

In [93]:
map1

Let us use smart mapping to render the points with varying sizes representing the number of motor vehicles per 1,000 people.

In [94]:
map1.add_layer(fc, {"renderer":"ClassedSizeRenderer",
               "field_name": "Motorveh1_000people"})

Let us publish this layer as a feature collection item in our GIS

In [95]:
item_properties = {
    "title": "Worldwide Motor vehicles ownership",
    "tags": "Motor",
    "snippet": "Worldwide Motor vehicles ownership",
    "description": "test description",
    "text": json.dumps({"featureCollection": {"layers": [dict(fc.layer)]}}),
    "type": "Feature Collection",
    "typeKeywords": "Data, Feature Collection, Singlelayer",
    "extent" : "-102.5272,-41.7886,172.5967,64.984"
}

item = gis.content.add(item_properties)

Let us search for this item

In [96]:
search_result = gis.content.search("Worldwide Motor vehicles ownership")
search_result[0]