# Exploratory Data Analysis

1. Understanding Data Structure:
   - Examine data types, column names, and dataset size.
   - Check how the data is organized: which features are categorical vs. numerical, and how they are distributed and related.
2. Handling Missing Values:
   - Determine the presence and extent of missing values.
   - Visualize or summarize missing data to understand patterns.
   - Decide on strategies to handle missing data (e.g., imputation, removal).
3. Summary Statistics:
   - Generate summary statistics (mean, median, standard deviation, quartiles) to understand the central tendency and dispersion of numerical data.
   - Use frequency counts for categorical features to understand distributions.
4. Identifying Outliers:
   - Outliers can be identified through visualizations (e.g., boxplots) or statistical methods.
   - Outliers may be errors or valuable indicators, and understanding them is critical for further analysis.
5. Visualizing Distributions:
   - Plot histograms, boxplots, and density plots to visualize the distribution of numerical variables.
   - These help you understand the shape of the data (e.g., normal, skewed).
6. Analyzing Relationships:
   - Use scatterplots, pair plots, or correlation matrices to understand relationships between variables.
   - This step helps identify collinear features, which could influence model performance.
7. Feature Analysis:
   - Look at each feature individually (univariate analysis) to understand its distribution and typical values.
   - Use bivariate or multivariate analysis to see relationships among features, which can help in understanding how variables interact.
8. Identifying Data Quality Issues:
   - Look for errors such as inconsistencies, duplicates, or data entry mistakes.
   - This process helps identify problems to fix during data cleaning.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

## 1. Understanding Data Structure:

- Examine data types, column names, and dataset size.
- Check how the data is organized: which features are categorical vs. numerical, and how they are distributed and related.

In [3]:
df = pd.read_csv("../data/raw/2012_Data.csv", encoding='unicode_escape')


In [9]:
# Show the dataset shape
print(f"rows: {df.shape[0]:,} | columns: {df.shape[1]}")
# Show the first five rows
df.head()


rows: 1,037,205 | columns: 41


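|   | accounting_date | fiscal_year | fiscal_month | calendar_year | calendar_month | calendar_day | company_code | customer_code | customer_district_code | item_code | ... | value_quantity | value_price_adjustment | currency | item_source_class | invoice_number | line_number | invoice_date | customer_order_number | order_date | dss_update_time |
|---|-----------------|-------------|--------------|---------------|----------------|--------------|--------------|---------------|------------------------|-----------|-----|----------------|------------------------|----------|-------------------|----------------|-------------|--------------|-----------------------|------------|-----------------|
| 0 | 20120509 | 2012 | 11 | 2012 | 5 | 9 | 101 | 411800601 | 410 | GENIE8WWWBC | ... | 84.0 | 0 | AUD |  | 2217887 | 1 | 20120509 | 2865354 | 20120509 | 49:58.7 |
| 1 | 20120216 | 2012 | 8 | 2012 | 2 | 16 | 101 | 361000403 | 300 | GENIE8WWWBC | ... | 12.0 | 0 | AUD |  | 2185745 | 1 | 20120216 | 2833515 | 20120216 | 49:58.7 |
| 2 | 20120509 | 2012 | 11 | 2012 | 5 | 9 | 101 | 361000403 | 300 | GENIE8WWWBC | ... | 12.0 | 0 | AUD |  | 2217807 | 1 | 20120509 | 2864857 | 20120508 | 49:58.7 |
| 3 | 20120518 | 2012 | 11 | 2012 | 5 | 18 | 101 | 565540415 | 500 | GENIE8WWWBC | ... | 6.0 | 0 | AUD |  | 2222758 | 1 | 20120518 | 2869759 | 20120518 | 49:58.7 |
| 4 | 20120109 | 2012 | 7 | 2012 | 1 | 9 | 101 | 565540415 | 500 | GENIE8WWWBC | ... | 6.0 | 0 | AUD |  | 2170374 | 1 | 20120109 | 2819189 | 20120109 | 49:58.7 |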


| Field Name              | Data Type       | Description                                                                                   |
|-------------------------|-----------------|-----------------------------------------------------------------------------------------------|
| accounting_date         | Date            | The date when the financial transaction is recorded in the accounting system.                |
| fiscal_year             | Integer         | The year in which the transaction occurs, based on the company's fiscal calendar.            |
| fiscal_month            | Integer         | The month of the fiscal year during which the transaction takes place.                       |
| calendar_year           | Integer         | The year in which the transaction occurs, based on the standard calendar.                    |
| calendar_month          | Integer         | The month of the year during which the transaction takes place, based on the standard calendar. |
| calendar_day            | Integer         | The specific day of the month on which the transaction occurs.                               |
| company_code            | String          | A unique identifier for the company conducting the transaction.                              |
| customer_code           | String          | A unique identifier for the customer involved in the transaction.                            |
| customer_district_code  | String          | A code representing the geographical district of the customer.                               |
| item_code               | String          | A unique identifier for the item being sold.                                                 |
| business_area_code      | String          | A code representing the specific area of business related to the transaction.                |
| item_group_code         | String          | A code indicating the group to which the item belongs.                                       |
| item_class_code         | String          | A code categorizing the item based on its characteristics or type.                           |
| item_type               | String          | A descriptor indicating the nature or category of the item.                                  |
| bonus_group_code        | String          | A code identifying the group related to bonuses or incentives for sales.                     |
| environment_group_code  | String          | A code denoting the environmental category related to the product.                           |
| technology_group_code   | String          | A code representing the technology category associated with the item or service.             |
| commission_group_code   | String          | A code identifying the group that determines commission structures for sales.                |
| reporting_classification| String          | A classification used for reporting purposes, indicating how the transaction should be categorized. |
| light_source            | String          | A code indicating the source of lighting related to the item, if applicable.                 |
| warehouse_code          | String          | A code identifying the warehouse where the item is stored or shipped from.                   |
| abc_class_code          | String          | A classification code used in inventory management to indicate the importance of an item (e.g., A, B, C categories). |
| abc_class_volume        | Float           | The volume of goods associated with the ABC classification.                                  |
| business_chain_l1_code  | String          | A code representing the first level of the business chain for tracking and analysis.          |
| business_chain_l1_name  | String          | The name corresponding to the business chain level 1 code.                                   |
| contact_method_code     | String          | A code indicating the contact method used.                                                   |
| salesperson_code        | String          | A unique identifier for the salesperson associated with the transaction.                     |
| order_type_code         | String          | A code that categorizes the type of order.                                                   |
| market_segment          | String          | A descriptor of the specific market segment targeted by the transaction.                     |
| value_sales             | Float           | The monetary value of sales generated from the transaction.                                  |
| value_cost              | Float           | The cost associated with the transaction.                                                    |
| value_quantity          | Integer         | The quantity of items sold or transacted.                                                    |
| value_price_adjustment  | Float           | Any adjustments made to the price during the transaction (discounts, surcharges, etc.).      |
| currency                | String          | The currency in which the transaction is conducted.                                          |
| item_source_class       | String          | A classification indicating the source or origin of the item.                                |
| invoice_number          | String          | A unique identifier for the invoice related to the transaction.                              |
| line_number             | Integer         | The line item number on the invoice, indicating specific items.                              |
| invoice_date            | Date            | The date the invoice is issued.                                                              |
| customer_order_number   | String          | A unique identifier for the customer's order.                                                |
| order_date              | Date            | The date when the order was placed.                                                          |
| dss_update_time         | Timestamp       | The timestamp indicating when the data was last updated in the system.                       |
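
The data dictionary describes the date fields as dates and the code fields as strings, while the preview above shows dates stored as integers such as `20120509` and `company_code` as the number `101`. The following is a minimal sketch of how such columns could be converted, assuming the `YYYYMMDD` layout seen in the preview holds for every row. The `toy` frame is a small stand-in built from previewed values, not the real dataset.

```python
import pandas as pd

# Toy stand-in for df, using values copied from the df.head() preview.
toy = pd.DataFrame({
    "accounting_date": [20120509, 20120216, 20120518],
    "order_date": [20120509, 20120216, 20120518],
    "company_code": [101, 101, 101],
})

# Assumption: these columns hold YYYYMMDD integers in every row.
date_cols = ["accounting_date", "order_date"]
for col in date_cols:
    # errors="coerce" turns unparseable values into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col].astype(str), format="%Y%m%d", errors="coerce")

# Identifier-like codes are labels, not quantities, so store them as text.
toy["company_code"] = toy["company_code"].astype(str)

print(toy.dtypes)
```

Counting the `NaT` values after conversion (for example with `toy[date_cols].isna().sum()`) is a quick way to see whether the format assumption fails anywhere.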



In [12]:
# Count object (string-like) columns
num_obj = df.select_dtypes(include='object').shape[1]
print(f"count object column = {num_obj}")
# Count non-object (numerical) columns
num_numerical = df.select_dtypes(exclude='object').shape[1]
print(f"count numerical column = {num_numerical}")


count object column = 22
count numerical column = 19


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1037205 entries, 0 to 1037204
Data columns (total 41 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   accounting_date           1037205 non-null  int64  
 1   fiscal_year               1037205 non-null  int64  
 2   fiscal_month              1037205 non-null  int64  
 3   calendar_year             1037205 non-null  int64  
 4   calendar_month            1037205 non-null  int64  
 5   calendar_day              1037205 non-null  int64  
 6   company_code              1037205 non-null  int64  
 7   customer_code             1037205 non-null  object 
 8   customer_district_code    1037205 non-null  int64  
 9   item_code                 1037205 non-null  object 
 10  business_area_code        1037205 non-null  object 
 11  item_group_code           1037205 non-null  object 
 12  item_class_code           1037205 non-null  object 
 13  item_type                 1

## 2. Handling Missing Values:

- Determine the presence and extent of missing values.
- Visualize or summarize missing data to understand patterns.
- Decide on strategies to handle missing data (e.g., imputation, removal).
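
The steps above can be sketched as follows. This is an illustrative example on a toy frame, not the real dataset: the column names come from the data dictionary, but the values, the 50% drop threshold, and the choice of median imputation are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "value_sales": [100.0, np.nan, 250.0, 80.0],
    "item_source_class": [np.nan, np.nan, np.nan, "A"],
    "currency": ["AUD", "AUD", "AUD", "AUD"],
})

# Presence and extent: count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)

# One possible strategy (an assumption, not a rule):
# drop columns that are mostly empty, then impute the rest.
threshold = 50.0  # hypothetical cut-off, in percent
mostly_empty = missing.index[missing["pct_missing"] > threshold]
cleaned = toy.drop(columns=mostly_empty)

# Median imputation is robust to skewed monetary values.
cleaned["value_sales"] = cleaned["value_sales"].fillna(cleaned["value_sales"].median())
```

Whether to drop or impute depends on why the values are missing, so the summary table is worth reading before committing to either.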

## 3. Summary Statistics:

- Generate summary statistics (mean, median, standard deviation, quartiles) to understand the central tendency and dispersion of numerical data.
- Use frequency counts for categorical features to understand distributions.
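
A minimal sketch of both kinds of summary, using a toy frame with made-up values (including the `NZD` entry, which is purely illustrative) in place of the real data:

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "value_sales": [120.0, 80.0, 300.0, 80.0, 45.0],
    "value_quantity": [12.0, 6.0, 84.0, 6.0, 3.0],
    "currency": ["AUD", "AUD", "NZD", "AUD", "AUD"],
})

# Numerical columns: count, mean, std, min, quartiles (50% is the median), max.
numeric_summary = toy.describe()
print(numeric_summary)

# Categorical columns: absolute and relative frequencies.
currency_counts = toy["currency"].value_counts()
currency_share = toy["currency"].value_counts(normalize=True)
print(currency_counts)
print(currency_share)
```

Comparing the mean with the median (`50%`) row gives a first hint of skew: here the single large sale pulls the mean well above the median.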

## 4. Identifying Outliers:

- Outliers can be identified through visualizations (e.g., boxplots) or statistical methods.
- Outliers may be errors or valuable indicators, and understanding them is critical for further analysis.
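
One common statistical method is the interquartile range (IQR) rule, which is also what a boxplot's whiskers are based on. The sketch below is one possible implementation. The helper name `iqr_outlier_mask` and the sample values are hypothetical, and the multiplier `k = 1.5` is the conventional default rather than something the dataset dictates.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy quantities, made up for illustration.
value_quantity = pd.Series(
    [6.0, 12.0, 6.0, 84.0, 12.0, 6.0, 10.0, 8.0], name="value_quantity"
)

mask = iqr_outlier_mask(value_quantity)
print(value_quantity[mask])
```

Flagged rows are candidates for inspection, not automatic removal: a large order may be a genuine bulk purchase rather than an error.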

## 5. Visualizing Distributions:

- Plot histograms, boxplots, and density plots to visualize the distribution of numerical variables.
- These help you understand the shape of the data (e.g., normal, skewed).
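
As a rough illustration, the sketch below draws a histogram and a boxplot side by side and reports the sample skewness. The data are synthetic (a log-normal sample standing in for a right-skewed monetary column such as `value_sales`), so the shape shown says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic right-skewed sample; NOT the real value_sales column.
rng = np.random.default_rng(0)
value_sales = pd.Series(
    rng.lognormal(mean=4.0, sigma=1.0, size=1_000), name="value_sales"
)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the overall shape and where the values concentrate.
ax_hist.hist(value_sales, bins=30)
ax_hist.set_title("Histogram of value_sales")
ax_hist.set_xlabel("value_sales")

# Boxplot: shows the median, the quartiles, and points beyond the whiskers.
ax_box.boxplot(value_sales)
ax_box.set_title("Boxplot of value_sales")

fig.tight_layout()

# Positive skewness indicates a long right tail.
skewness = value_sales.skew()
print(f"skewness: {skewness:.2f}")
```

For strongly skewed columns, plotting on a log scale (or plotting `np.log1p` of the values) often makes the bulk of the distribution easier to read.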

## 6. Analyzing Relationships:

- Use scatterplots, pair plots, or correlation matrices to understand relationships between variables.
- This step helps identify collinear features, which could influence model performance.
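
A correlation matrix can be scanned programmatically for highly correlated pairs. The sketch below uses synthetic data in which sales and cost are both driven by quantity, so those three columns are collinear by construction. The 0.9 cut-off is an arbitrary illustrative threshold.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: sales and cost are built from quantity plus noise.
rng = np.random.default_rng(0)
quantity = rng.integers(1, 100, size=200).astype(float)
toy = pd.DataFrame({
    "value_quantity": quantity,
    "value_sales": quantity * 10.0 + rng.normal(0, 5, size=200),
    "value_cost": quantity * 6.0 + rng.normal(0, 5, size=200),
    "line_number": rng.integers(1, 10, size=200),
})

# Pearson correlation between every pair of numerical columns.
corr = toy.corr()
print(corr.round(2))

# Collect each unordered pair whose absolute correlation exceeds the cut-off.
cutoff = 0.9  # hypothetical threshold
high = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > cutoff
]
for a, b, r in high:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Pearson correlation only captures linear association; `toy.corr(method="spearman")` is an alternative when the relationship is monotonic but not linear.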

## 7. Feature Analysis:

- Look at each feature individually (univariate analysis) to understand its distribution and typical values.
- Use bivariate or multivariate analysis to see relationships among features, which can help in understanding how variables interact.
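
A small sketch of the difference between the two levels of analysis. The column names come from the data dictionary, while the category labels (`Retail`, `Trade`, `NOR`, `CRD`) and all values are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for df; labels and values are made up for illustration.
toy = pd.DataFrame({
    "market_segment": ["Retail", "Retail", "Trade", "Trade", "Trade"],
    "order_type_code": ["NOR", "NOR", "NOR", "CRD", "NOR"],
    "value_sales": [100.0, 50.0, 200.0, -40.0, 90.0],
})

# Univariate: one feature on its own.
univariate = toy["value_sales"].agg(["min", "median", "max"])
print(univariate)

# Bivariate (numerical vs. categorical): sales summarized per segment.
by_segment = toy.groupby("market_segment")["value_sales"].agg(["count", "sum", "mean"])
print(by_segment)

# Bivariate (categorical vs. categorical): how often each pair co-occurs.
segment_by_order_type = pd.crosstab(toy["market_segment"], toy["order_type_code"])
print(segment_by_order_type)
```

In this toy example the univariate view only reveals that a negative sale exists, while the cross-tabulation shows which segment and order type it belongs to.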

## 8. Identifying Data Quality Issues:

- Look for errors such as inconsistencies, duplicates, or data entry mistakes.
- This process helps identify problems to fix during data cleaning.
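
The checks below sketch three typical quality tests: exact duplicates, duplicate business keys, and inconsistent values. The toy rows are partly made up, and treating `invoice_number` plus `line_number` as a unique key is an assumption that should be confirmed against the source system, as is the expectation that an invoice is never dated before its order.

```python
import pandas as pd

# Toy stand-in for df; rows are partly made up to trigger each check.
toy = pd.DataFrame({
    "invoice_number": [2217887, 2217887, 2185745, 2217807],
    "line_number": [1, 1, 1, 1],
    "currency": ["AUD", "AUD", "aud ", "AUD"],
    "order_date": [20120509, 20120509, 20120216, 20120508],
    "invoice_date": [20120509, 20120509, 20120215, 20120509],
})

# 1. Rows that are exact copies of an earlier row.
n_exact_duplicates = int(toy.duplicated().sum())
print(f"exact duplicate rows: {n_exact_duplicates}")

# 2. Rows sharing the assumed key (invoice_number, line_number).
key_duplicates = toy[toy.duplicated(subset=["invoice_number", "line_number"], keep=False)]
print(key_duplicates)

# 3a. Inconsistent labels: normalize case and surrounding whitespace.
toy["currency"] = toy["currency"].str.strip().str.upper()

# 3b. Logical inconsistency: invoice dated before the order.
#     YYYYMMDD integers sort chronologically, so a plain comparison works.
invoiced_before_ordered = toy[toy["invoice_date"] < toy["order_date"]]
print(invoiced_before_ordered)
```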