# Subject: Data Science Foundation

## Session 14 - ArcGIS API for Python.

### Exercise 2 - Descriptive Statistics: from an HTML table to a Pandas DataFrame to a Portal Item

Let us read the Wikipedia article "List of countries by cigarette consumption per capita",
which lists countries by annual per capita consumption of tobacco cigarettes.
We will explore the resulting DataFrame (descriptive statistics and correlation) and create a map.

https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita

In [1]:
import pandas as pd

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita")[0]

In [3]:
df.head()

Unnamed: 0,0,1,2
0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per ...
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23


In [4]:
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))

In [5]:
df.head()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33
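
The header fix above can be tried on a small hand-made frame. This is a minimal sketch, not the notebook's own code: the toy rows and the shortened column label `Cigarettes` are made up for illustration. It also clears the column axis name that the assignment leaves behind, which is likely where the stray `0` above the `df.dtypes` output comes from.

```python
import pandas as pd

# Toy frame mimicking a scraped table whose real header landed in row 0.
# The values and the "Cigarettes" label are illustrative assumptions.
raw = pd.DataFrame([
    ["Ranking", "Country/Territory", "Cigarettes"],
    ["1", "Montenegro", "4124.53"],
    ["2", "Belarus", "3831.62"],
])

promoted = raw.copy()
promoted.columns = promoted.iloc[0]   # use the first row as column labels
promoted = promoted.drop(index=0)     # remove that row from the data
promoted.columns.name = None          # clear the leftover axis name (0)

print(promoted)
```

`drop(index=0)` gives the same result as `reindex(df.index.drop(0))` and is usually easier to read.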


Let's check the data structure

In [6]:
df.dtypes

0
Ranking                                                  object
Country/Territory                                        object
Number of cigarettes per person aged ≥ 15 per year[7]    object
dtype: object

In [7]:
df.shape 

(182, 3)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 1 to 182
Data columns (total 3 columns):
Ranking                                                  182 non-null object
Country/Territory                                        182 non-null object
Number of cigarettes per person aged ≥ 15 per year[7]    182 non-null object
dtypes: object(3)
memory usage: 5.7+ KB


Let's find the ranking position of our country

In [9]:
df.loc[df['Country/Territory'] == "Spain"]

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
47,47,Spain,1264.74


In [10]:
df.loc[df['Country/Territory'] == "Portugal"]

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
55,55,Portugal,1114.11
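
The two lookups above can also be combined into a single boolean mask. The sketch below uses a small stand-in frame (the rows are copied from the outputs above, the frame name `toy` is an assumption) and `Series.isin` to select several countries at once.

```python
import pandas as pd

# Stand-in for the scraped table; values are still strings at this stage.
toy = pd.DataFrame({
    "Ranking": ["5", "47", "55"],
    "Country/Territory": ["Russia", "Spain", "Portugal"],
})

# One mask for several countries instead of one comparison per country.
mask = toy["Country/Territory"].isin(["Spain", "Portugal"])
subset = toy.loc[mask]

print(subset)
```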


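Let's check the descriptive statistics. Because every column is still of type object, `describe()` reports only count, unique, top and freq rather than numeric summaries.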

In [11]:
df.describe()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
count,182,182,182.0
unique,182,182,182.0
top,73,Hungary,895.24
freq,1,1,1.0
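
To see how the summary changes once a column is numeric, here is a small sketch on the five highest values shown earlier. It relies on `pd.to_numeric`, which the notebook applies to the real column further down; the variable names are illustrative.

```python
import pandas as pd

# The five highest values from the table, stored as text like the scraped data.
as_text = pd.Series(
    ["4124.53", "3831.62", "3023.15", "2732.23", "2690.33"], dtype=object
)
as_number = pd.to_numeric(as_text)

# Text columns are summarised by count / unique / top / freq.
print(as_text.describe())

# Numeric columns are summarised by count / mean / std / min / quartiles / max.
print(as_number.describe())
```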


Let's rename the columns to prepare the data for correlation analysis and for mapping

In [12]:
df.rename(columns={'Ranking': 'Ranking', 'Country/Territory': 'Country_Terr', 'Number of cigarettes per person aged ≥ 15 per year[7]': 'Nrcigar_pp'}, inplace=True)

In [13]:
df.head()

Unnamed: 0,Ranking,Country_Terr,Nrcigar_pp
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33


We need the "Number of cigarettes per person aged ≥ 15 per year[7]" column (now Nrcigar_pp) in numeric format. Let us convert it and, while doing so, turn any value that cannot be parsed into NaN (Not a Number).

In [14]:
converted_column = pd.to_numeric(df["Nrcigar_pp"], errors='coerce')  # with 'coerce', values that cannot be parsed become NaN
df['Nrcigar_pp'] = converted_column
df.head()

Unnamed: 0,Ranking,Country_Terr,Nrcigar_pp
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33
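
Since `errors='coerce'` silently replaces bad values with NaN, it can be worth checking how many values were lost. The sketch below uses made-up strings (including a footnote-style `n/a` and a thousands separator) to show one way to list the offending entries; none of these bad values are claimed to exist in the Wikipedia table.

```python
import pandas as pd

# Hypothetical raw strings; two of them cannot be parsed as numbers.
raw = pd.Series(["4124.53", "n/a", "2,690.33", "895.24"], dtype=object)

numeric = pd.to_numeric(raw, errors="coerce")  # unparseable values become NaN

# Values that were present in the input but are NaN after conversion.
failed = numeric.isna() & raw.notna()
offenders = raw[failed]
n_failed = int(failed.sum())

print(numeric)
print(f"{n_failed} value(s) could not be converted:", offenders.tolist())
```

If `n_failed` is greater than zero on the real data, inspecting `offenders` before continuing helps decide whether to clean the strings first or accept the missing values.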


Repeat for the "Ranking" column

In [15]:
converted_column = pd.to_numeric(df["Ranking"], errors = 'coerce')
df['Ranking'] = converted_column
df.head()

Unnamed: 0,Ranking,Country_Terr,Nrcigar_pp
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33


In [16]:
df.dtypes

0
Ranking           int64
Country_Terr     object
Nrcigar_pp      float64
dtype: object

Let's calculate the correlation

In [17]:
# pairwise correlation
df.drop(['Country_Terr'], axis=1).corr(method='spearman')

Unnamed: 0,Ranking,Nrcigar_pp
Ranking,1.0,-1.0
Nrcigar_pp,-1.0,1.0
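
A Spearman coefficient of exactly -1 is expected here: the ranking is derived from the consumption values, so one is a strictly decreasing function of the other. The sketch below, on the first five rows of the table, shows that Spearman is the Pearson correlation of the ranks and that Pearson on the raw values is weaker because the relationship is monotonic but not linear. The frame name `toy` is an assumption.

```python
import pandas as pd

# First five rows of the table above.
toy = pd.DataFrame({
    "Ranking": [1, 2, 3, 4, 5],
    "Nrcigar_pp": [4124.53, 3831.62, 3023.15, 2732.23, 2690.33],
})

# Spearman as computed by pandas.
spearman = toy.corr(method="spearman").loc["Ranking", "Nrcigar_pp"]

# Spearman by hand: rank each column, then take the Pearson correlation.
manual = toy.rank().corr(method="pearson").loc["Ranking", "Nrcigar_pp"]

# Pearson on the raw values measures linear association only.
pearson = toy.corr(method="pearson").loc["Ranking", "Nrcigar_pp"]

print(spearman, manual, pearson)
```

Because the ranking carries no information beyond the ordering of the values, this correlation mainly confirms that the table is sorted consistently.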


## Plot as a map

Let us connect to our GIS to geocode this data and present it as a map

In [18]:
from arcgis.gis import GIS
import json

gis = GIS("https://www.arcgis.com", "your_username", "your_password")  # replace with your own credentials; do not share real ones in a notebook
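
Writing a password directly in a notebook cell exposes it to anyone who receives the file. One common alternative is to read the credentials from environment variables. This is a sketch under assumptions: the helper `load_credentials` and the variable names `ARCGIS_USERNAME` and `ARCGIS_PASSWORD` are hypothetical and not part of the ArcGIS API.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from hypothetical environment variables."""
    try:
        return env["ARCGIS_USERNAME"], env["ARCGIS_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Set {missing.args[0]} before connecting") from None


# Usage (requires the arcgis package and real credentials in the environment):
# from arcgis.gis import GIS
# username, password = load_credentials()
# gis = GIS("https://www.arcgis.com", username, password)
```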

In [19]:
fc = gis.content.import_data(df, {"CountryCode":"Country_Terr"})

In [20]:
map1 = gis.map('Spain')

In [21]:
map1

Let us use smart mapping to render the points with varying sizes representing the number of cigarettes per person aged ≥ 15 per year

In [22]:
map1.add_layer(fc, {"renderer":"ClassedSizeRenderer",
               "field_name": "Nrcigar_pp"})

Let us publish this layer as a feature collection item in our GIS

In [23]:
item_properties = {
    "title": "Worldwide Number of cigarettes per person aged ≥ 15 per year",
    "tags" : "cigarettes, aged ≥ 15",
    "snippet": " Worldwide Number of cigarettes per person aged ≥ 15 per year",
    "description": "test description",
    "text": json.dumps({"featureCollection": {"layers": [dict(fc.layer)]}}),
    "type": "Feature Collection",
    "typeKeywords": "Data, Feature Collection, Singlelayer",
    "extent" : "-102.5272,-41.7886,172.5967,64.984"
}

item = gis.content.add(item_properties)

Let us search for this item

In [24]:
search_result = gis.content.search("Worldwide Number of cigarettes per person aged ≥ 15 per year")
search_result[0]