# Exploratory data analysis (EDA)

In [167]:
import os
import pandas as pd
import plotly.express as px
from sklearn import decomposition 
import numpy as np      

## Data pipeline

The data pipeline, in the conventional sense, follows the ETL (extract, transform, load) process. Here, extraction is performed after you have answered the following questions:
1. What data are you using?
2. What are your inputs?
3. What are your labels?
4. How long will you spend getting the data?

Then worry about:
1. Reproducibility
2. Metadata
3. Data lineage
4. Balanced data sets

All of these will come in handy when you turn your code into a product.
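
Reproducibility, metadata, and lineage can start very small. The sketch below is one possible approach, not a prescribed one: it records a content hash and a timestamp for a data file so that a later change to the raw data is detectable. The helper name `describe_file` and the throwaway `demo.csv` file are assumptions made for illustration only.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return a small metadata record for a data file (hypothetical helper)."""
    with open(path, "rb") as handle:
        content = handle.read()
    return {
        "file_name": os.path.basename(path),
        "size_bytes": len(content),
        # The hash changes whenever the file content changes, which makes
        # silent updates to the raw data detectable.
        "sha256": hashlib.sha256(content).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("PRICE_PER_UNIT,QUANTITY_SOLD\n12.42,44\n")
    record = describe_file(demo_path)

print(json.dumps(record, indent=2))
```

Storing such a record next to every model you train is a cheap way to answer "which data produced this result?" later on.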

For this tutorial experiment we will use a structured data set about sales. To begin with, all attributes in the data set are inputs except the price, which is our label (target attribute). To keep it simple, the data is stored in a .csv file that is never updated or changed. In practice, you would use a script (or whatever works for you) to download or scrape your data at a specified frequency, depending on your use case: for example, daily updates for batch jobs or minute-level data for streaming jobs.

In [158]:
class Data:
    """Load, plot, and reduce the dimensionality of a tabular data set."""

    def __init__(self):
        # Directory where the data is located; here, the project root.
        self.data_dir = os.getcwd()
        self.raw = None  # raw data, set by get()

    def get(self, file_name):
        """Read a CSV file into self.raw and print a summary of its columns."""
        self.raw = pd.read_csv(os.path.join(self.data_dir, file_name))
        self.raw.info()  # info() prints the summary and returns None

    def plot(self, feature=None, target=None):
        """Scatter plot of one feature column against the target column."""
        assert self.raw is not None, "Use get(file_name) method to import data"
        fig = px.scatter(x=self.raw[feature], y=self.raw[target])
        fig.show()

    # Feature engineering: reduce dimensionality
    def pca(self, features, target):
        """Return the principal components of features, with target as the last column."""
        # n_components='mle' lets scikit-learn estimate the number of components
        pca = decomposition.PCA(n_components='mle')
        principal_components = pd.DataFrame(pca.fit_transform(features))
        principal_components.columns = [f'PC{index}' for index in principal_components]
        return pd.concat([principal_components, target], axis=1)
        

In [168]:
data = Data()

In [169]:
data.get('sales_data.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4093 entries, 0 to 4092
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PRODUCT_ID        4093 non-null   int64  
 1   CUSTOMER_ID       4093 non-null   int64  
 2   LOCATION_ID       4093 non-null   int64  
 3   TRANSACTION_DATE  4093 non-null   object 
 4   PRICE_PER_UNIT    4093 non-null   float64
 5   QUANTITY_SOLD     4093 non-null   int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 192.0+ KB


In [161]:
data.plot(feature='QUANTITY_SOLD', target='PRICE_PER_UNIT')

In [162]:
data.plot(feature='TRANSACTION_DATE', target='PRICE_PER_UNIT')
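
Note that `TRANSACTION_DATE` has dtype `object` in the summary above, meaning the dates are stored as strings, so a plot may treat them as labels rather than points in time. The following is a minimal sketch of converting them before plotting. It assumes the dates are written month/day/year, which the sample rows (for example `7/15/2020`) suggest, and it uses a small hand-made `sample` frame instead of the full file.

```python
import pandas as pd

# A few rows in the same shape as the sales data (values copied from the
# raw table in this notebook).
sample = pd.DataFrame({
    "TRANSACTION_DATE": ["12/6/2017", "7/15/2020", "9/6/2020", "5/8/2017"],
    "PRICE_PER_UNIT": [12.42, 16.00, 11.50, 10.47],
})

# Assumption: dates are month/day/year.
sample["TRANSACTION_DATE"] = pd.to_datetime(
    sample["TRANSACTION_DATE"], format="%m/%d/%Y"
)

# Sort chronologically so that a time axis reads left to right.
sample = sample.sort_values("TRANSACTION_DATE").reset_index(drop=True)
print(sample)
```

The same conversion could be applied to `data.raw['TRANSACTION_DATE']` before calling `data.plot`.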

In [163]:
data.raw.columns

Index(['PRODUCT_ID', 'CUSTOMER_ID', 'LOCATION_ID', 'TRANSACTION_DATE',
       'PRICE_PER_UNIT', 'QUANTITY_SOLD'],
      dtype='object')

In [164]:
# define the features used to predict the target variable
features = np.array(data.raw[['PRODUCT_ID', 'CUSTOMER_ID', 'LOCATION_ID', 'QUANTITY_SOLD']])
target = data.raw['PRICE_PER_UNIT']

In [173]:
# create a new data set; keep `data` intact so that data.raw stays available
reduced = data.pca(features=features, target=target)

In [178]:
reduced.iloc[:, :-1]

| | PC0 | PC1 | PC2 |
|---|---|---|---|
| 0 | 20.591670 | -16.594350 | -1.333751 |
| 1 | 41.394352 | 17.934596 | -1.348801 |
| 2 | -10.972193 | -14.665168 | -1.362137 |
| 3 | 30.484612 | -4.019388 | -1.337282 |
| 4 | -43.750588 | 21.301108 | -1.423641 |
| ... | ... | ... | ... |
| 4088 | -13.448134 | -17.805990 | 1.774299 |
| 4089 | 39.138231 | 6.986623 | 1.795182 |
| 4090 | -14.684752 | -19.377860 | 1.774740 |
| 4091 | 15.642493 | -22.878912 | 1.803568 |


Homework: what do PC0, ..., PCn represent?
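
One way to explore this question is to compute the components by hand on a small example. The sketch below uses synthetic data (not the sales data) and plain NumPy rather than scikit-learn; all variable names are chosen for illustration. It centers the columns, takes a singular value decomposition, and projects the data onto the resulting directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 columns. The second column is almost a
# multiple of the first, so most of the variance lies along one direction.
x = rng.normal(size=200)
features = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Step 1: center each column.
centered = features - features.mean(axis=0)

# Step 2: the rows of vt are orthonormal directions, ordered from the
# direction of largest variance to the smallest.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: project the data onto those directions. The columns of `scores`
# play the role of PC0, PC1, PC2.
scores = centered @ vt.T

# Share of the total variance captured by each component.
explained_variance = singular_values ** 2 / (len(features) - 1)
explained_ratio = explained_variance / explained_variance.sum()
print(explained_ratio)
```

Comparing the columns of `scores` with the original columns, and checking how they correlate with each other, is a good starting point for the homework.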

In [170]:
data.raw

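| | PRODUCT_ID | CUSTOMER_ID | LOCATION_ID | TRANSACTION_DATE | PRICE_PER_UNIT | QUANTITY_SOLD |
|---|---|---|---|---|---|---|
| 0 | 17 | 1 | 1 | 12/6/2017 | 12.42 | 44 |
| 1 | 22 | 1 | 1 | 7/15/2020 | 16.00 | 84 |
| 2 | 43 | 1 | 1 | 9/6/2020 | 11.50 | 26 |
| 3 | 17 | 1 | 1 | 5/8/2017 | 10.47 | 60 |
| 4 | 91 | 1 | 1 | 8/6/2020 | 12.66 | 34 |
| ... | ... | ... | ... | ... | ... | ... |
| 4088 | 43 | 4 | 2 | 10/11/2018 | 10.12 | 22 |
| 4089 | 17 | 4 | 2 | 12/19/2018 | 10.45 | 74 |
| 4090 | 43 | 4 | 2 | 8/15/2020 | 11.50 | 20 |
| 4091 | 17 | 4 | 2 | 11/22/2019 | 12.77 | 36 |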
