## Lab1 (continued) - Working with Jupyter Notebooks in IBM Watson Studio
In this notebook, we will conduct some early exploratory analysis on the `GoSales_Tx.csv` data file.

Tips:
* Code cells are identifiable by their `In [ ]:` prefix in the margin.
* To execute a cell in the notebook, select the cell and click the **Run** button, or hit **Ctrl-Enter**.
* Cells which have not been executed yet have empty brackets, while executed cells show a sequence number, e.g. `In [13]`.
* The result of a cell execution is displayed below the cell.
* To clear all execution statuses and outputs, use the `Cell/All Output/Clear` menu.

### Getting started:
* Select the code cell below, and **delete all its content**
* Open the data panel on the right using the    
`10`   
`01` button icon  (top right)
* From the data panel on the right, use the context menu on the `GoSales_Tx.csv` file to _Insert to code/Insert pandas DataFrame_

This generates the Python source code that creates a `df_data_1` pandas DataFrame from the `GoSales_Tx.csv` file.
>NOTE: If the name is different, change the variable name back to `df_data_1`

Then execute the cell (Ctrl-Enter or Run button). Upon completion, a subset of the data will be shown below the code cell.

In [1]:
import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_70ab70feb7fb4d8f9a47a408afd9f30f = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='ZATI1oq_TlsWqN-oEz3Wo1IPXWwOkCx0PXV3gj0d5eui',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_70ab70feb7fb4d8f9a47a408afd9f30f.get_object(Bucket='watstudworkshop-donotdelete-pr-basx79wonvxlys',Key='GoSales_Tx.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()

|   | IS_TENT | GENDER | AGE | MARITAL_STATUS | PROFESSION |
|---|---|---|---|---|---|
| 0 | False | M | 27 | Single | Professional |
| 1 | False | F | 39 | Married | Other |
| 2 | False | F | 39 | Married | Other |
| 3 | False | F | 56 | Unspecified | Hospitality |
| 4 | False | M | 45 | Married | Retired |


At this stage, the DataFrame should be loaded and its first 5 rows displayed.  
The following cell checks that the variable is indeed named `df_data_1`, and creates a convenience shortcut variable `df` for the `DataFrame`.

In [2]:
if 'df_data_1' not in globals():
    print("***************\nERROR: df_data_1 variable is not defined, please check the cell above\n***************")
else:
    # Create a convenience shortcut variable
    df = df_data_1

## Extract some facts about the dataset

In the following section, we will start using `Pandas` functions to query information about the data represented in the CSV file which was just loaded in a `DataFrame`

#### Finding out about data types
The first thing we will want to know, besides the column titles, is the data types of the columns.   
For this we use the `DataFrame`'s `dtypes` attribute. The data types were inferred when loading the CSV file. In this case we can see that:
* `IS_TENT` has been identified as a boolean
* `AGE` has been inferred as an integer
* `GENDER`, `MARITAL_STATUS` and `PROFESSION` are strings, which are stored with the generic `object` type.

Note that the types themselves are instances of `numpy` data types and are returned in a `Series` object indexed by column name.

In [3]:
df.dtypes

IS_TENT             bool
GENDER            object
AGE                int64
MARITAL_STATUS    object
PROFESSION        object
dtype: object
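
Because `dtypes` is a `Series` indexed by column name, a single column's type can be looked up by label. The sketch below illustrates this on a hypothetical five-row sample (not the real file, so that it runs on its own). It also uses `select_dtypes()`, which is not part of the original lab, to filter columns by type.

```python
import pandas as pd

# Hypothetical five-row sample mirroring the first rows of GoSales_Tx.csv
sample = pd.DataFrame({
    "IS_TENT": [False, False, False, False, False],
    "GENDER": ["M", "F", "F", "F", "M"],
    "AGE": [27, 39, 39, 56, 45],
    "MARITAL_STATUS": ["Single", "Married", "Married", "Unspecified", "Married"],
    "PROFESSION": ["Professional", "Other", "Other", "Hospitality", "Retired"],
})

# dtypes is a Series indexed by column name, so one entry can be looked up by label
age_dtype = sample.dtypes["AGE"]

# select_dtypes() filters columns by type; here we keep only the numeric ones
# (boolean columns are not counted as numeric)
numeric_columns = list(sample.select_dtypes(include="number").columns)

print(age_dtype)
print(numeric_columns)
```

In the notebook itself, the same calls can be applied directly to `df`.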

#### Counting full rows
* We'll use the DataFrame `count()` method in the next code cell to count the non-empty values in each column

In this dataset we should find 60252 values for each of the columns, which means that the table has no empty cells.

In [4]:
import numpy as np
df.count()

IS_TENT           60252
GENDER            60252
AGE               60252
MARITAL_STATUS    60252
PROFESSION        60252
dtype: int64
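
To make the behaviour of `count()` more concrete, here is a minimal sketch on a hypothetical sample that deliberately contains missing values (the real dataset has none). It contrasts `count()` with `len()` and with `isna().sum()`, which is an addition not used in the original lab.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two deliberately missing values
sample = pd.DataFrame({
    "GENDER": ["M", "F", None, "F"],
    "AGE": [27, np.nan, 39, 56],
})

non_empty = sample.count()       # non-missing values per column
missing = sample.isna().sum()    # missing values per column
total_rows = len(sample)         # row count, regardless of missing values

print(non_empty)
print(missing)
print(total_rows)
```

When `count()` returns the same number as `len()` for every column, as it does for `df`, the table has no empty cells.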

## Getting basic statistics
* Use `describe()` to get statistics on fields:
    * The numeric `AGE` column shows that the average age is about 34 years.
    * There are 9 unique `PROFESSION` values, of which the most frequent is 'Other', and 3 `MARITAL_STATUS` values, of which the most frequent is 'Married'

In [5]:
df.describe(include='all')

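|   | IS_TENT | GENDER | AGE | MARITAL_STATUS | PROFESSION |
|---|---|---|---|---|---|
| count | 60252 | 60252 | 60252.0 | 60252 | 60252 |
| unique | 2 | 2 |  | 3 | 9 |
| top | False | M |  | Married | Other |
| freq | 54241 | 31352 |  | 30779 | 24503 |
| mean |  |  | 34.187479 |  |  |
| std |  |  | 10.105477 |  |  |
| min |  |  | 17.0 |  |  |
| 25% |  |  | 26.0 |  |  |
| 50% |  |  | 33.0 |  |  |
| 75% |  |  | 41.0 |  |  |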


#### Investigate frequency of buying behavior
* Next, we want to know the proportion of positive buying decisions relative to the total number of records, so we count the occurrences of each value (here, just boolean True/False) of the **IS_TENT** column:

>NOTE: a `DataFrame` column can be accessed through the `df['colName']` notation, or through the `df.colName` notation, which is available only when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method.
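
The difference between the two notations can be sketched as follows. The sample DataFrame and its column names (`MARITAL STATUS` with a space, and `count`) are hypothetical and chosen only to trigger the two limitations of attribute access.

```python
import pandas as pd

# Hypothetical sample: one column name contains a space,
# another has the same name as a DataFrame method
sample = pd.DataFrame({
    "AGE": [27, 39],
    "MARITAL STATUS": ["Single", "Married"],
    "count": [1, 2],
})

ages = sample.AGE                    # valid identifier, no clash: attribute access works
status = sample["MARITAL STATUS"]    # space in the name: only bracket notation works
count_column = sample["count"]       # bracket notation returns the column
count_attribute = sample.count       # attribute access returns the count() method instead

print(callable(count_attribute))
```

The bracket notation works in every case, which makes it the safer choice when column names are not known in advance.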

In [6]:
# access column through generated attribute
df.IS_TENT.value_counts()

False    54241
True      6011
Name: IS_TENT, dtype: int64

We find a roughly one-to-nine ratio of buy vs non-buy, i.e. about one record in ten is a purchase.

#### Get statistics on the 3 string columns
Similarly, we count the `GENDER`, `MARITAL_STATUS` and `PROFESSION` values.   
You will notice some variations in the code syntax here:
* Using the array indexing notation for `GENDER`
* Applying `to_frame()` for `MARITAL_STATUS` to convert the returned `Series` to a `DataFrame`, which yields a prettier display in Jupyter notebook output
* Using positional access for the `PROFESSION` column (zero-based index 4) through the `iloc` selection `[:,4]`

In [7]:
# access column through indexing by column name
df['GENDER'].value_counts()

M    31352
F    28900
Name: GENDER, dtype: int64

In [8]:
# Convert value_counts() Series to DataFrame for prettier display
df.MARITAL_STATUS.value_counts().to_frame()

|   | MARITAL_STATUS |
|---|---|
| Married | 30779 |
| Single | 24549 |
| Unspecified | 4924 |


In [9]:
# Access PROFESSION by column index using iloc
df.iloc[:,4].value_counts()

Other           24503
Professional     8938
Sales            6708
Executive        5871
Trades           4008
Hospitality      3311
Student          2945
Retail           2785
Retired          1183
Name: PROFESSION, dtype: int64

#### Compute additional statistics
There are many ways to perform column-wise or row-wise computations with Pandas.   
We will show a few of them here. One thing to keep in mind for the following code cells is that the output of the `value_counts()` function is a pandas `Series` object, which is essentially an indexed list: the index is built from the distinct values (categories), and each value is the count of rows having that category.   
We will compute the relative percentage using several different techniques on the various columns:
* Apply the `Series` overloaded arithmetic operators to the `PROFESSION` counts, multiplying by 100 and dividing by the total count. You will notice the `dtype` is `float64`.
* Apply an unnamed Python __lambda__ function with `map()` to the `MARITAL_STATUS` counts, computing the percentage. Notice the `dtype` is `float64`.
* Apply an unnamed Python __lambda__ function with `map()` to the `GENDER` counts, formatting the result as a text string. Notice the `dtype` is `object`.
* Use the `groupby()` method on `IS_TENT`, dividing the size of each group by the total number of rows.

In [10]:
# Apply multiplication then division operators to Series, returning computed percentage
df.PROFESSION.value_counts()*100/df.PROFESSION.count()

Other           40.667530
Professional    14.834362
Sales           11.133240
Executive        9.744075
Trades           6.652061
Hospitality      5.495253
Student          4.887805
Retail           4.622253
Retired          1.963420
Name: PROFESSION, dtype: float64

In [11]:
# Use map() to apply an unnamed function that computes percentage
df.MARITAL_STATUS.value_counts().map(lambda x: 100*x/len(df.GENDER))

Married        51.083781
Single         40.743876
Unspecified     8.172343
Name: MARITAL_STATUS, dtype: float64

In [12]:
# Use map() to apply a function that returns percentage evaluated as a string
df.GENDER.value_counts().map(lambda x: '{0: >2.0f} %'.format(100*x/df.MARITAL_STATUS.count()))

M    52 %
F    48 %
Name: GENDER, dtype: object

In [13]:
# Variant using groupby to generate the percentage of each IS_TENT value
df.groupby(['IS_TENT']).size()*100/len(df.IS_TENT)

IS_TENT
False    90.023568
True      9.976432
dtype: float64

>The above output confirms the one-to-nine ratio of buy vs non-buy behaviour
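
As one more variant of the percentage computations above, `value_counts()` accepts a `normalize=True` argument that returns relative frequencies directly. This is a minimal sketch on a hypothetical ten-record sample, not a cell from the original lab.

```python
import pandas as pd

# Hypothetical sample with one buyer out of ten records
is_tent = pd.Series([True] + [False] * 9, name="IS_TENT")

# normalize=True returns relative frequencies (fractions that sum to 1)
# instead of absolute counts; multiplying by 100 gives percentages
percentages = is_tent.value_counts(normalize=True) * 100

print(percentages)
```

In the notebook, the equivalent call would be `df.IS_TENT.value_counts(normalize=True) * 100`.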

#### Analyzing buying behavior factors
Now, we would like to understand which factor drives the buying behavior.   
* Without going into Machine Learning yet, we can analyse the correlation between the `IS_TENT` indicator and each one of the other features or variables.
* For this we use the `crosstab()` function, which breaks down the values of one column according to the values of another one

Features with fewer distinct values are easier to interpret, so let's start with `GENDER`.

In [14]:
x_tent_gender=pd.crosstab(df.IS_TENT,[df.GENDER])
x_tent_gender

| IS_TENT (rows) / GENDER (columns) | F | M |
|---|---|---|
| False | 27230 | 27011 |
| True | 1670 | 4341 |


This shows that male customers buy a tent about 2.4 times as often as female ones (about 13.8% of men vs 5.8% of women).

* Similarly we can run the same on `PROFESSION`, but here we will get the result as a percentage of the total per column

In [15]:
x_tent_prof=pd.crosstab(df.IS_TENT,[df.PROFESSION])
(x_tent_prof/df.PROFESSION.value_counts()).applymap(lambda x:"{0: >2.1f}%".format(x*100))

| IS_TENT | Executive | Hospitality | Other | Professional | Retail | Retired | Sales | Student | Trades |
|---|---|---|---|---|---|---|---|---|---|
| False | 86.0% | 89.6% | 91.1% | 91.1% | 95.9% | 99.3% | 88.4% | 94.6% | 79.8% |
| True | 14.0% | 10.4% | 8.9% | 8.9% | 4.1% | 0.7% | 11.6% | 5.4% | 20.2% |


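This gives an indication that `Retired`, `Retail` and `Student` customers buy proportionally fewer tents than the average of about 10%, while `Trades` and `Executive` customers buy more.  
Other than that, it is not very conclusive, since a large subset (about 41%, labelled `Other`) has no specific profession.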
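
The per-column percentages computed above can also be obtained directly from `pd.crosstab()` through its `normalize` argument. The sketch below uses a small hypothetical sample rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample: 4 "Trades" customers (1 buyer) and 2 "Retired" customers (no buyer)
sample = pd.DataFrame({
    "IS_TENT": [True, False, False, False, False, False],
    "PROFESSION": ["Trades", "Trades", "Trades", "Trades", "Retired", "Retired"],
})

# normalize="columns" divides each cell by its column total,
# so every PROFESSION column sums to 1
shares = pd.crosstab(sample.IS_TENT, sample.PROFESSION, normalize="columns")

# Convert to percentages rounded to one decimal
percent = (shares * 100).round(1)
print(percent)
```

This keeps the result numeric, which is convenient if the percentages are to be sorted or plotted later; the string formatting used in the cell above is better suited to display only.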

* Let's try the same breakdown by `AGE`. We swap the axes so that all the values can be displayed

In [16]:
x_age_tent=pd.crosstab(df.AGE,[df.IS_TENT])
x_age_tent

| AGE (rows) / IS_TENT (columns) | False | True |
|---|---|---|
| 17 | 106 | 0 |
| 18 | 243 | 0 |
| 19 | 1289 | 240 |
| 20 | 1469 | 271 |
| 21 | 1422 | 272 |
| 22 | 1463 | 280 |
| 23 | 1658 | 329 |
| 24 | 1533 | 315 |
| 25 | 1989 | 262 |
| 26 | 2161 | 295 |


The results are a bit less obvious to grasp without a graphical representation.
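
One way to make the numbers easier to compare, even before plotting, is to compute the share of buyers per age instead of raw counts. This is a sketch on a hypothetical sample and is not part of the original lab; it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical sample: four 19-year-olds (1 buyer) and two 20-year-olds (1 buyer)
sample = pd.DataFrame({
    "AGE": [19, 19, 19, 19, 20, 20],
    "IS_TENT": [True, False, False, False, True, False],
})

# The mean of a boolean column is the fraction of True values,
# so grouping by AGE gives the buy rate for each age
buy_rate = sample.groupby("AGE")["IS_TENT"].mean() * 100

print(buy_rate)
```

Applied to `df`, the same expression would give one percentage per age, which is more directly comparable than the two count columns.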

#### Introducing Visualization
* As a glimpse into the next lab, we can use Brunel to quickly display the result

In [17]:
import brunel
# Turn the AGE index into a regular column, then rename the columns of the CrossTab for use by Brunel
x_age_tent = x_age_tent.reset_index()
x_age_tent.columns = ['AGE', 'NO_TENT', 'SUM_TENT']
%brunel data('x_age_tent') bar x(AGE) y(SUM_TENT)

<IPython.core.display.Javascript object>

## Conclusion
This concludes this first introduction to Jupyter Python notebooks using Pandas DataFrames.   

In this short example, you have learned the very basics of Pandas DataFrame and Series usage.