# Data Loading and Initial Exploration

This notebook loads sales data from two file formats (CSV and Excel), runs an initial exploration (shape, `info()`, and descriptive statistics), and documents data quality issues such as missing values.

In [1]:
# Import necessary libraries
import pandas as pd

# Load data: the CSV is preferred; the Excel copy is used only if the CSV fails to load
sales_data = None
try:
    sales_csv = pd.read_csv('../data/raw/Global_Superstore2.csv', encoding='latin1')  # Adjust path and filename as needed
    print('CSV data loaded successfully')
    sales_data = sales_csv
except Exception as e:
    print(f'Error loading CSV data: {e}')

try:
    sales_excel = pd.read_excel('../data/raw/Global_Superstore2.xlsx')  # Adjust path and filename as needed
    print('Excel data loaded successfully')
    if sales_data is None:
        sales_data = sales_excel
except Exception as e:
    print(f'Error loading Excel data: {e}')

if sales_data is not None:
    # Initial exploration
    print('Data Shape:', sales_data.shape)
    print('Data Info:')
    sales_data.info()  # info() prints its report directly and returns None, so wrapping it in print() would also print "None"
    print('Data Description:')
    print(sales_data.describe(include='all'))
    
    # Document data quality issues
    missing_values = sales_data.isnull().sum()
    print('Missing values per column:')
    print(missing_values)
else:
    print('No data loaded successfully.')


CSV data loaded successfully


Excel data loaded successfully
Data Shape: (51290, 24)
Data Info:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Row ID          51290 non-null  int64  
 1   Order ID        51290 non-null  object 
 2   Order Date      51290 non-null  object 
 3   Ship Date       51290 non-null  object 
 4   Ship Mode       51290 non-null  object 
 5   Customer ID     51290 non-null  object 
 6   Customer Name   51290 non-null  object 
 7   Segment         51290 non-null  object 
 8   City            51290 non-null  object 
 9   State           51290 non-null  object 
 10  Country         51290 non-null  object 
 11  Postal Code     9994 non-null   float64
 12  Market          51290 non-null  object 
 13  Region          51290 non-null  object 
 14  Product ID      51290 non-null  object 
 15  Category        51290 non-null  object 
 16  Sub-Category    51290 non-null  object 
 17  Product Name    51290 non-null 

             Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
count   51290.00000           51290       51290       51290           51290   
unique          NaN           25035        1430        1464               4   
top             NaN  CA-2014-100111  18-06-2014  22-11-2014  Standard Class   
freq            NaN              14         135         130           30775   
mean    25645.50000             NaN         NaN         NaN             NaN   
std     14806.29199             NaN         NaN         NaN             NaN   
min         1.00000             NaN         NaN         NaN             NaN   
25%     12823.25000             NaN         NaN         NaN             NaN   
50%     25645.50000             NaN         NaN         NaN             NaN   
75%     38467.75000             NaN         NaN         NaN             NaN   
max     51290.00000             NaN         NaN         NaN             NaN   

       Customer ID    Customer Name   Segment      

Missing values per column:
Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment               0
City                  0
State                 0
Country               0
Postal Code       41296
Market                0
Region                0
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64
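
The counts above show that `Postal Code` is the only column with gaps: 41296 of 51290 rows are missing it (about 80.5%), which matches the 9994 non-null entries reported by `info()`. A minimal sketch of how such a summary could be condensed into counts and percentages follows. The helper name `summarize_missing` and the small `demo` frame are hypothetical and use made-up values; in the notebook the function would be called on `sales_data` instead.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary['missing_count'] > 0]
    return summary.sort_values('missing_count', ascending=False)

# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    'Order ID': ['A-1', 'A-2', 'A-3', 'A-4'],
    'Postal Code': [10001.0, None, None, None],
})
report = summarize_missing(demo)
print(report)
```

Filtering to the affected columns keeps the report short on wide tables, where most columns are complete.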

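A second issue visible in the output is that `Order Date` and `Ship Date` are stored as `object` (string) columns rather than datetimes. The most frequent values shown by `describe()` (`18-06-2014`, `22-11-2014`) suggest a day-month-year layout, but that format is an assumption inferred from those two values only. The sketch below shows one way the conversion might look; the `demo` frame and the `days_to_ship` column are hypothetical and use made-up values.

```python
import pandas as pd

# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    'Order Date': ['18-06-2014', '01-02-2014'],
    'Ship Date': ['22-06-2014', '05-02-2014'],
})

# Assumed format: day-month-year. errors='coerce' turns unparsable strings into NaT
for col in ['Order Date', 'Ship Date']:
    demo[col] = pd.to_datetime(demo[col], format='%d-%m-%Y', errors='coerce')

# With real datetimes, date arithmetic becomes possible
demo['days_to_ship'] = (demo['Ship Date'] - demo['Order Date']).dt.days
print(demo.dtypes)
print(demo)
```

Passing an explicit `format` avoids ambiguous month/day guessing, and checking `isnull().sum()` on the converted columns afterwards would reveal any rows that did not match the assumed layout.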