# Purpose
This notebook describes the typical activities carried out at the beginning of a project or thread, when the customer shares new data. We try to understand the tables, columns, and information flow. We also look for data issues and confirm them with the respective owners for resolution. At the end of this activity, the data sources and their treatment are finalized. Code in this notebook will not be part of the production code.

# Initialization

In [25]:
%load_ext autoreload
%autoreload 2



In [26]:
%%time
# Standard-library imports
import os.path as op
import warnings

# Third-party imports
import pandas as pd
# import great_expectations as ge

# Project imports
from ta_lib.core.api import display_as_tabs, initialize_environment

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# Initialization
initialize_environment(debug=False, hide_warnings=True)

CPU times: user 232 μs, sys: 19 μs, total: 251 μs
Wall time: 253 μs


# Data

## Background

Our client (Manufacturer A) is a leading Food & Beverage manufacturer. The client wants to understand the growth patterns of consumer preferences (themes) and evaluate the positioning of its brand across those themes. The client also wants to identify the sales drivers of its products.

In [27]:
from ta_lib.core.api import create_context, list_datasets, load_dataset

In [28]:
config_path = op.join('conf', 'config.yml')
context = create_context(config_path)

In [29]:
list_datasets(context)

['/raw/FnB/google_search_data',
 '/raw/FnB/product_manufacturer_list',
 '/raw/FnB/sales_data',
 '/raw/FnB/social_media_data',
 '/raw/FnB/theme_list',
 '/raw/FnB/theme_product_list',
 '/cleaned/FnB/google_search_data',
 '/cleaned/FnB/product_manufacturer_list',
 '/cleaned/FnB/sales_data',
 '/cleaned/FnB/social_media_data',
 '/cleaned/FnB/theme_list',
 '/cleaned/FnB/theme_product_list',
 '/train/FnB/features',
 '/train/FnB/target',
 '/test/FnB/features',
 '/test/FnB/target',
 '/processed/FnB/client_data',
 '/score/FnB/output']

## Loading Datasets

In [30]:
sales_data = load_dataset(context, 'raw/FnB/sales_data')
product_manufacturer_list = load_dataset(context, 'raw/FnB/product_manufacturer_list')
google_search_data = load_dataset(context, 'raw/FnB/google_search_data')
social_media_data = load_dataset(context, 'raw/FnB/social_media_data')
Theme_list = load_dataset(context, 'raw/FnB/theme_list')
Theme_product_list = load_dataset(context, 'raw/FnB/theme_product_list')
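
The six `load_dataset` calls above follow the same pattern, so they could be collapsed into a loop. The sketch below shows one way to do that. `load_many` and its `loader` parameter are hypothetical names introduced here for illustration and are not part of `ta_lib`; in this notebook, `loader` would be something like `lambda key: load_dataset(context, key)`.

```python
def load_many(keys, loader):
    """Load several datasets and return them keyed by their short name.

    `loader` is any callable that maps a catalog key to a loaded object.
    """
    datasets = {}
    for key in keys:
        # Use the last path component as the short name, e.g. 'sales_data'.
        name = key.rstrip("/").split("/")[-1]
        datasets[name] = loader(key)
    return datasets


# Stand-in loader so the sketch runs without the project catalog.
raw_keys = ["raw/FnB/sales_data", "raw/FnB/theme_list"]
demo = load_many(raw_keys, loader=lambda key: f"<loaded {key}>")
print(sorted(demo))
```

Keeping the datasets in one dictionary would also let the later summary cells iterate over `demo.items()` instead of repeating one line per dataset.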

# Exploratory Analysis

Given the raw data from data ingestion, we would now like to explore and learn more details about the data.


The output of the step would be a summary report and discussion of any pertinent findings.


In [31]:
# Import the eda API
import ta_lib.eda.api as eda

## Variable Summary

In [32]:
# shapes of datasets
display_as_tabs([('sales_data', sales_data.shape),
                 ('social_media_data', social_media_data.shape), 
                 ('google_search_data', google_search_data.shape),
                 ('Theme_product_list', Theme_product_list.shape),
                 ('Theme_list', Theme_list.shape),
                 ('product_manufacturer_list', product_manufacturer_list.shape) ])


In [33]:
# Parse the date columns; unparseable social media dates become NaT
social_media_data['published_date'] = pd.to_datetime(
    social_media_data['published_date'], errors='coerce', infer_datetime_format=True
)
google_search_data['date'] = pd.to_datetime(google_search_data['date'], format='%d-%m-%Y')
sum1 = eda.get_variable_summary(sales_data)
sum2 = eda.get_variable_summary(social_media_data)
sum3 = eda.get_variable_summary(google_search_data)
sum4 = eda.get_variable_summary(Theme_product_list)
sum5 = eda.get_variable_summary(Theme_list)
sum6 = eda.get_variable_summary(product_manufacturer_list)

display_as_tabs([('sales_data', sum1),
                ('social_media_data', sum2),
                ('google_search_data', sum3),
                ('Theme_product_list', sum4),
                ('Theme_list', sum5),
                ('product_manufacturer_list', sum6)])


From the variable summaries of `sales_data`, `social_media_data`, `google_search_data`, `Theme_product_list`, `Theme_list`, and `product_manufacturer_list`, we observe that the datasets contain both `numeric` and `other` datatypes, and most columns appear to be `numeric`. A column is classified as `numeric` if it is a `float`, `int`, or `date`; everything else is categorized as `other`. A column is assumed to hold `date` values if its name contains the string `date`.
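
To make the classification rule concrete, here is a minimal sketch of it in plain pandas. It is not the `ta_lib` implementation; `classify_columns` and the demo column names are hypothetical, and the name-based date check is written directly from the description above.

```python
import pandas as pd


def classify_columns(df):
    """Label each column 'numeric' or 'other' using the rule described above."""
    labels = {}
    for col in df.columns:
        is_number = pd.api.types.is_numeric_dtype(df[col])
        is_datetime = pd.api.types.is_datetime64_any_dtype(df[col])
        # Assumption: a column also counts as a date if its name contains 'date'.
        has_date_name = "date" in str(col).lower()
        if is_number or is_datetime or has_date_name:
            labels[col] = "numeric"
        else:
            labels[col] = "other"
    return labels


# Hypothetical columns, for illustration only.
demo = pd.DataFrame(
    {
        "sales_units_value": [10.5, 20.0],
        "published_date": ["2019-01-05", "2019-01-12"],
        "vendor": ["A", "B"],
    }
)
print(classify_columns(demo))
```

Note that the name-based check labels `published_date` as `numeric` even though it is still stored as strings, which is why the date columns are parsed before the summary is computed.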

In [34]:
# Count missing values per column in each dataset
nan1 = sales_data.isna().sum()
nan2 = social_media_data.isna().sum()
nan3 = google_search_data.isna().sum()
nan4 = Theme_product_list.isna().sum()
nan5 = Theme_list.isna().sum()
nan6 = product_manufacturer_list.isna().sum()

display_as_tabs([('sales_data', nan1),
                ('social_media_data', nan2),
                ('google_search_data', nan3),
                ('Theme_product_list', nan4),
                ('Theme_list', nan5),
                ('product_manufacturer_list', nan6)])


The social media dataset has missing values in the Theme Id column, around 40% of its values.
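
Raw counts from `isna().sum()` are hard to compare across datasets of different sizes, so a percentage view can help confirm a figure like the 40% above. The following is a minimal sketch; `missing_share` is a hypothetical helper, and the demo frame is made-up data rather than the client's.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Made-up data: 2 of 5 'Theme Id' values are missing.
demo = pd.DataFrame(
    {
        "Theme Id": [1.0, np.nan, 3.0, np.nan, 5.0],
        "post_type": ["a", "b", "c", "d", "e"],
    }
)
print(missing_share(demo))
```

In the notebook this could be called as `missing_share(social_media_data)`.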

In [35]:
sum1 = eda.get_duplicate_columns(sales_data)
sum2 = eda.get_duplicate_columns(social_media_data)
sum3 = eda.get_duplicate_columns(google_search_data)
sum4 = eda.get_duplicate_columns(Theme_product_list)
sum5 = eda.get_duplicate_columns(Theme_list)
sum6 = eda.get_duplicate_columns(product_manufacturer_list)

display_as_tabs([('sales_data', sum1),
                ('social_media_data', sum2),
                ('google_search_data', sum3),
                ('Theme_product_list', sum4),
                ('Theme_list', sum5),
                ('product_manufacturer_list', sum6)])


Only `product_manufacturer_list` contains duplicate columns (5 of them).
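
For readers who want to cross-check the duplicate-column finding without `ta_lib`, here is a minimal pandas sketch. It is an assumption about what "duplicate" means here (columns with identical values and dtype), and `find_duplicate_columns` and the demo columns are hypothetical.

```python
import pandas as pd


def find_duplicate_columns(df):
    """Map each first-seen column to later columns holding identical values."""
    duplicates = {}
    seen = []  # columns confirmed not to duplicate an earlier column
    for col in df.columns:
        for kept in seen:
            # Series.equals treats NaNs in matching positions as equal.
            if df[col].equals(df[kept]):
                duplicates.setdefault(kept, []).append(col)
                break
        else:
            seen.append(col)
    return duplicates


# Made-up data: 'vendor_copy' repeats 'vendor'.
demo = pd.DataFrame(
    {
        "vendor": ["A", "B", "C"],
        "vendor_copy": ["A", "B", "C"],
        "product_id": [1, 2, 3],
    }
)
print(find_duplicate_columns(demo))
```

The pairwise comparison is quadratic in the number of columns, which is acceptable for small lookup tables such as `product_manufacturer_list`.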

In [36]:
sum1 = eda.get_outliers(sales_data)
sum2 = eda.get_outliers(social_media_data)
sum3 = eda.get_outliers(google_search_data)
sum4 = eda.get_outliers(Theme_product_list)
sum5 = eda.get_outliers(Theme_list)
sum6 = eda.get_outliers(product_manufacturer_list)

display_as_tabs([('sales_data', sum1),
                ('social_media_data', sum2),
                ('google_search_data', sum3),
                ('Theme_product_list', sum4),
                ('Theme_list', sum5),
                ('product_manufacturer_list', sum6)])

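
The notebook does not state which rule `eda.get_outliers` applies. As a point of comparison, the sketch below uses a common alternative, Tukey's IQR fences, to count outliers per numeric column. `iqr_outlier_counts`, the multiplier `k=1.5`, and the demo data are assumptions made for illustration, not the library's method.

```python
import pandas as pd


def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper fence.
    below = numeric.lt(q1 - k * iqr, axis="columns")
    above = numeric.gt(q3 + k * iqr, axis="columns")
    return (below | above).sum()


# Made-up data with one obvious spike.
demo = pd.DataFrame({"sales_units_value": [10, 12, 11, 13, 12, 250]})
print(iqr_outlier_counts(demo))
```

Non-numeric columns are ignored, and missing values never count as outliers because comparisons with NaN evaluate to False.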

### Data Health Summary

In [37]:
sum1, plot1 = eda.get_data_health_summary(sales_data, return_plot=True)
sum2, plot2 = eda.get_data_health_summary(social_media_data, return_plot=True)
sum3, plot3 = eda.get_data_health_summary(google_search_data, return_plot=True)
sum4, plot4 = eda.get_data_health_summary(Theme_product_list, return_plot=True)
sum5, plot5 = eda.get_data_health_summary(Theme_list, return_plot=True)
sum6, plot6 = eda.get_data_health_summary(product_manufacturer_list, return_plot=True)

display_as_tabs([('sales_data', plot1),
                ('social_media_data', plot2),
                ('google_search_data', plot3),
                ('Theme_product_list', plot4),
                ('Theme_list', plot5),
                ('product_manufacturer_list', plot6)])


Only `social_media_data` contains missing values.

### Generating Summary Reports

In [38]:
from ta_lib.reports.api import summary_report

summary_report(sales_data, 'reports/sales_data.html')
summary_report(social_media_data, 'reports/social_media_data.html')
summary_report(google_search_data, 'reports/google_search_data.html')
summary_report(Theme_product_list, 'reports/Theme_product_list.html')
summary_report(Theme_list, 'reports/Theme_list.html')
summary_report(product_manufacturer_list, 'reports/product_manufacturer_list.html')