# Purpose
This notebook describes the typical activities carried out at the beginning of a project or thread when the customer shares new data. We try to understand the tables, columns and information flow. We also look for data issues and confirm them with the respective owners for resolution. At the end of this activity, the data sources and their treatment are finalized. Code in this notebook will not be part of the production code.

# Initialization

In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [8]:
%%time

# Standard library imports
import os.path as op

# Third-party imports
import pandas as pd
import great_expectations as ge

# Project imports
from ta_lib.core.api import display_as_tabs, initialize_environment

# Initialization
initialize_environment(debug=False, hide_warnings=True)

CPU times: user 104 µs, sys: 0 ns, total: 104 µs
Wall time: 108 µs


# Data

## Background

The customer is a distributor of electronic devices. They partner with manufacturers, carriers and refurbishers and sell to retailers. The selling price is the outcome of negotiation between sales representatives and retailers. The customer wants to understand the variation in selling price and determine optimal pricing with machine learning.

## Data download

In [9]:
import os
import tarfile
import urllib.request  # a bare `import urllib` does not guarantee that `urllib.request` is loaded

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("data/raw/", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

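def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download the housing archive and extract it into `housing_path`."""
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

# Uncomment to download the data
# fetch_housing_data()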

In [10]:
HOUSING_PATH

'data/raw/housing'

In [11]:
from ta_lib.core.api import create_context, list_datasets, load_dataset

In [12]:
config_path = op.join('conf', 'config.yml')
context = create_context(config_path)

In [13]:
list_datasets(context)

['/raw/housing',
 '/raw/orders',
 '/raw/product',
 '/cleaned/housing',
 '/cleaned/orders',
 '/cleaned/product',
 '/cleaned/sales',
 '/processed/housing',
 '/processed/sales',
 '/train/housing/features',
 '/train/housing/target',
 '/train/sales/features',
 '/train/sales/target',
 '/test/housing/features',
 '/test/housing/target',
 '/test/sales/features',
 '/test/sales/target',
 '/score/housing/output',
 '/score/sales/output']

In [14]:
# load datasets
housing_df = load_dataset(context, 'raw/housing')

# Exploratory Analysis

Given the raw data from data ingestion, we would now like to explore and learn more details about the data.


The output of this step is a summary report and a discussion of any pertinent findings.


In [15]:
# Import the eda API
import ta_lib.eda.api as eda

## Variable summary

In [16]:
display_as_tabs([('housing', housing_df.shape)])

In [17]:
sum1 = eda.get_variable_summary(housing_df)

display_as_tabs([('housing', sum1)])

In [18]:
housing_df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
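Raw counts like the ones above are easier to judge as a share of rows. The following is a minimal sketch, assuming a small hypothetical frame (`toy_df`) in place of `housing_df`. The helper name `missing_share` is an illustration only and is not part of `ta_lib`.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Hypothetical toy data standing in for housing_df
toy_df = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    }
)

print(missing_share(toy_df))
```

Calling `missing_share(housing_df)` in the notebook would give the same view for the real data.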

## Health Analysis

Get an overview of the overall health of the dataset. This is usually quick to compute and hopefully highlights some problems to focus on.



### Summary Plot

Provides a high-level summary of the dataset health.

**Watch out for:**

* too few numeric values
* high % of missing values
* high % of duplicate values
* high % of duplicate columns 

In [19]:
sum1, plot1 = eda.get_data_health_summary(housing_df, return_plot=True)

display_as_tabs([('housing', plot1)])

**Dev notes:**

<details>
1. Datatypes: The data has both numeric and other types, and most columns are numeric. `Numeric` covers float, int and date columns; everything else is categorized as `Others`. A column is assumed to hold `date` values if its name contains the string `date`.

2. Missing values: The plot seems to indicate that there are no missing values, but the `isna().sum()` output above shows 207 in `total_bedrooms`. The share is probably too small to be visible in the plot.

3. Duplicate observations: We are looking for duplicated rows in the data. The plot shows the % of rows that are an exact replica of another row (using `df.duplicated`).

4. Duplicate features: We are looking for duplicated columns in the data.

</details>
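The two duplicate checks in the notes above can be approximated with plain pandas. This is a sketch on hypothetical toy data, not the `ta_lib` implementation: the row check uses `df.duplicated` as the notes describe, while the transpose-based column check is an assumption about one reasonable way to find identical columns.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0, and column "b" repeats column "a"
toy_df = pd.DataFrame(
    {
        "a": [1, 2, 1, 3],
        "b": [1, 2, 1, 3],
        "c": [10, 20, 10, 30],
    }
)

# Percentage of rows that are an exact copy of an earlier row
pct_duplicate_rows = toy_df.duplicated().mean() * 100

# Columns whose values exactly match an earlier column:
# transpose so that columns become rows, then reuse `duplicated`
duplicate_columns = toy_df.columns[toy_df.T.duplicated()].tolist()

print(pct_duplicate_rows, duplicate_columns)
```

Note that `duplicated` keeps the first occurrence by default, so only the later copies are counted. Transposing a wide or mixed-type frame can be slow, so this approach suits small exploratory checks.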

### Missing Values summary

This provides an overall view focusing on the amount of missing values in the dataset.

**Watch out for:**
* A few columns have a significant number of missing values
* Most columns have a significant number of missing values


In [20]:
sum1, plot1 = eda.get_missing_values_summary(housing_df, return_plot=True)

display_as_tabs([('housing', plot1)])

**Dev notes:**

<details>
    
* By default, the following are considered missing/NA values: `[np.nan, pd.NaT, 'NA', None]`.
* Additional values can be passed to tigerml (`add_additional_na_values`).
* These are applied to all columns.
* Some of the above information can be learnt from the data discovery step (see discussion below).
    
</details>
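The idea of treating extra tokens as missing in every column can be illustrated with plain pandas. This sketch does not use tigerml's `add_additional_na_values`. The token list `EXTRA_NA_VALUES` and the toy frame are hypothetical and only show the effect on the missing-value counts.

```python
import numpy as np
import pandas as pd

# Hypothetical extra tokens that should count as missing in this example
EXTRA_NA_VALUES = ["NA", "unknown", "-"]

# Hypothetical toy data standing in for housing_df
toy_df = pd.DataFrame(
    {
        "ocean_proximity": ["NEAR BAY", "unknown", "INLAND", "-"],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    }
)

# Replace the tokens with NaN in every column, then count missing values
normalized = toy_df.replace(EXTRA_NA_VALUES, np.nan)
missing_counts = normalized.isna().sum()

print(missing_counts)
```

Because the replacement applies to all columns, a token that is a legitimate value in one column would also be turned into a missing value there. Check the token list against each column before using it.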

In [21]:
sum1 = eda.get_duplicate_columns(housing_df)

display_as_tabs([('housing', sum1)])

In [22]:
sum1 = eda.get_outliers(housing_df)

display_as_tabs([('housing', sum1)])

## Health Analysis report

Generate a report that collects all of the above in a single HTML file. This can be useful to share with a client.

In [23]:
from ta_lib.reports.api import summary_report

summary_report(housing_df, './housing.html')