Using Python’s chardet library to guess CSV file encodings

In [5]:
%pip install chardet

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting chardet
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Downloading chardet-5.2.0-py3-none-any.whl (199 kB)
Installing collected packages: chardet
Successfully installed chardet-5.2.0
Note: you may need to restart the kernel to use updated packages.




In [6]:
import chardet

# Read the file in binary mode (the whole file here; a leading sample also works for large files)
with open('AdventureWorks_Customers.csv', 'rb') as f:
    raw_data = f.read()

# Detect the encoding
result = chardet.detect(raw_data)
print(result)


{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
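
A confidence of 0.73 is moderate, and ISO-8859-1 assigns a character to every possible byte value, so decoding with it never raises an error even when the guess is wrong. One way to cross-check the guess is to try stricter encodings first and fall back to ISO-8859-1 only when they fail. The following is a minimal sketch using only the standard library; the helper name `pick_encoding`, the candidate list, and the sample bytes are assumptions for illustration and are not part of chardet.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "cp1252", "iso-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one
            continue
        return name
    raise ValueError("none of the candidate encodings could decode the data")


# Illustrative sample: accented characters stored as single bytes
sample = "Zoë,Müller,$90,000".encode("iso-8859-1")
print(pick_encoding(sample))
```

If the helper returns `utf-8`, the file is very likely UTF-8, since arbitrary single-byte text rarely forms valid UTF-8 sequences. Otherwise a single-byte encoding such as the one chardet reported is a reasonable choice for `pd.read_csv`.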


In [None]:
# %% [markdown]
# # Sales & Revenue Analysis ETL Notebook
# This notebook extracts data from CSV files, cleans and transforms it, builds a star schema, and loads the data into MySQL.
# 
# **Datasets:**
# - AdventureWorks_Customers.csv
# - AdventureWorks_Products.csv
# - AdventureWorks_Sales_2015.csv, AdventureWorks_Sales_2016.csv, AdventureWorks_Sales_2017.csv

# %% [markdown]
# ## 1. Import Libraries

# %%
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import pymysql

# %% [markdown]
# ## 2. Read the CSV Files
# Adjust file paths if necessary.

# %%
# Read Customer and Product data using the specified encoding
customers_df = pd.read_csv("AdventureWorks_Customers.csv", encoding="ISO-8859-1")
products_df = pd.read_csv("AdventureWorks_Products.csv", encoding="ISO-8859-1")

# Read Sales data for multiple years with the same encoding
# Forward slashes avoid the invalid "\A" escape sequence and also work on Windows
sales_2015_df = pd.read_csv("AW Sales/AdventureWorks_Sales_2015.csv", encoding="ISO-8859-1")
sales_2016_df = pd.read_csv("AW Sales/AdventureWorks_Sales_2016.csv", encoding="ISO-8859-1")
sales_2017_df = pd.read_csv("AW Sales/AdventureWorks_Sales_2017.csv", encoding="ISO-8859-1")

# %% [markdown]
# ## 3. Data Cleaning & Transformation – Customers
# - Convert BirthDate to datetime (with multiple formats)
# - Clean AnnualIncome: Remove '$', commas, extra spaces then convert to numeric

# %%
# Clean BirthDate column
# infer_datetime_format is deprecated since pandas 2.0; format inference is now the default
customers_df['BirthDate'] = pd.to_datetime(customers_df['BirthDate'], errors='coerce')

# Clean AnnualIncome: remove '$', commas and extra spaces then convert to numeric
customers_df['AnnualIncome'] = customers_df['AnnualIncome'].replace(
    {r'\$': '', ',': ''}, regex=True).str.strip()  # raw string keeps the regex escape valid
customers_df['AnnualIncome'] = pd.to_numeric(customers_df['AnnualIncome'], errors='coerce')

# Check primary key uniqueness for CustomerKey
print("CustomerKey unique:", customers_df['CustomerKey'].is_unique)

# %% [markdown]
# ## 4. Data Cleaning & Transformation – Products
# - Transform the ProductSize column so that letter sizes (S, M, L, XL) become numeric.
#   Mapping: S → 44, M → 48, L → 52, XL → 62. Other values (like 0) are left as is.
# - Ensure numeric columns (ProductCost, ProductPrice) are in proper format.

# %%
# Define mapping for letter sizes
size_mapping = {
    'S': 44,
    'M': 48,
    'L': 52,
    'XL': 62
}

def transform_size(x):
    """
    Convert product size to numeric:
    - If x is one of the letter sizes (S, M, L, XL), map it to its numeric value.
    - If x is a string representing a whole number, return it as an integer.
    - If x is any other string, return np.nan.
    - If x is not a string (already numeric or missing), return it unchanged.
    """
    if isinstance(x, str):
        x = x.strip()  # remove extra spaces
        if x in size_mapping:
            return size_mapping[x]
        else:
            try:
                return int(x)
            except ValueError:
                return np.nan
    else:
        return x

# Apply transformation to ProductSize
products_df['ProductSize'] = products_df['ProductSize'].apply(transform_size)
products_df['ProductSize'] = pd.to_numeric(products_df['ProductSize'], errors='coerce')

# Check primary key uniqueness for ProductKey
print("ProductKey unique:", products_df['ProductKey'].is_unique)

# %% [markdown]
# ## 5. Data Cleaning & Transformation – Sales Data
# - Combine the sales data from 2015, 2016, and 2017.
# - Convert OrderDate and StockDate to datetime.
# - Check if the combination of OrderNumber and OrderLineItem is unique.
# - (Later, we will calculate revenue by joining with the Product dimension.)

# %%
# Concatenate sales data
sales_df = pd.concat([sales_2015_df, sales_2016_df, sales_2017_df], ignore_index=True)

# Convert date columns to datetime
sales_df['OrderDate'] = pd.to_datetime(sales_df['OrderDate'], errors='coerce')
sales_df['StockDate'] = pd.to_datetime(sales_df['StockDate'], errors='coerce')

# Check uniqueness of the composite key (OrderNumber + OrderLineItem)
sales_df['CompositeKey'] = sales_df['OrderNumber'].astype(str) + "_" + sales_df['OrderLineItem'].astype(str)
print("Composite Sales Key unique:", sales_df['CompositeKey'].is_unique)

# %% [markdown]
# ## 6. Handle Missing Values
# For this example, we drop rows with missing critical values. In a production system, you might impute missing values as needed.

# %%
customers_df = customers_df.dropna(subset=['CustomerKey', 'BirthDate', 'AnnualIncome'])
products_df = products_df.dropna(subset=['ProductKey', 'ProductPrice'])
sales_df = sales_df.dropna(subset=['OrderDate', 'ProductKey', 'CustomerKey', 'OrderQuantity'])

# %% [markdown]
# ## 7. Data Transformation – Derived Columns and Revenue Calculation
# - In the fact table, we calculate revenue by joining sales with product price:
#   Revenue = OrderQuantity * ProductPrice
# - Also add a normalized date column for joining with the Date dimension.

# %%
# Add a normalized date column (strip time from OrderDate)
sales_df['OrderDateNorm'] = sales_df['OrderDate'].dt.normalize()

# Join sales with products to get ProductPrice and calculate revenue
sales_df = sales_df.merge(products_df[['ProductKey', 'ProductPrice']], on='ProductKey', how='left')
sales_df['Revenue'] = sales_df['OrderQuantity'] * sales_df['ProductPrice']

# %% [markdown]
# ## 8. Create Dimension Tables for Star Schema
# - **Customer Dimension:** Based on customers_df.
# - **Product Dimension:** Based on products_df.
# - **Date Dimension:** Generated from the range of OrderDate values in sales_df.

# %%
# Customer Dimension
dim_customer = customers_df.drop_duplicates(subset=['CustomerKey'])

# Product Dimension
dim_product = products_df.drop_duplicates(subset=['ProductKey'])

# Date Dimension: Create a date table covering the range of OrderDateNorm
min_date = sales_df['OrderDateNorm'].min().date()
max_date = sales_df['OrderDateNorm'].max().date()
date_range = pd.date_range(start=min_date, end=max_date)
dim_date = pd.DataFrame({'Date': date_range})
dim_date['Year'] = dim_date['Date'].dt.year
dim_date['Month'] = dim_date['Date'].dt.month
dim_date['Day'] = dim_date['Date'].dt.day
dim_date['Weekday'] = dim_date['Date'].dt.day_name()

# %% [markdown]
# ## 9. Create the Fact Table – Sales
# We join sales with the customer and product dimensions.
# For simplicity, we include key columns and the calculated revenue.
# (The composite key of OrderNumber and OrderLineItem can serve as a unique identifier.)

# %%
# Left-join sales to the customer and product dimensions. Only the key columns are
# selected, so no attributes are added; extend the column lists to pull in more.
fact_sales = sales_df.merge(dim_customer[['CustomerKey']], on='CustomerKey', how='left') \
                     .merge(dim_product[['ProductKey']], on='ProductKey', how='left')

# Select and reorder columns for the fact table
fact_sales = fact_sales[['CompositeKey', 'OrderDate', 'StockDate', 'CustomerKey', 'ProductKey', 
                           'OrderQuantity', 'ProductPrice', 'Revenue']]

# %% [markdown]
# ## 10. Load Cleaned Data to MySQL
# Using SQLAlchemy, we load the dimension and fact tables into a MySQL database.
# Adjust the connection details if needed.

# %%
# MySQL connection details
username = 'root'
password = '12345'
host = 'localhost'
port = '3306'
database = 'case5'
engine = create_engine(f'mysql+pymysql://{username}:{password}@{host}:{port}/{database}')

# Load tables to MySQL (table names are in lower-case)
dim_customer.to_sql('dim_customer', engine, index=False, if_exists='replace')
dim_product.to_sql('dim_product', engine, index=False, if_exists='replace')
dim_date.to_sql('dim_date', engine, index=False, if_exists='replace')
fact_sales.to_sql('fact_sales', engine, index=False, if_exists='replace')

print("Data loaded to MySQL successfully!")



CustomerKey unique: True
ProductKey unique: True
Composite Sales Key unique: True
Data loaded to MySQL successfully!
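
The connection details in step 10 are hard-coded in the notebook, including the password. The sketch below shows one way to build the same `mysql+pymysql://` URL from environment variables, percent-encoding the user name and password so that characters such as `@` or `/` do not break URL parsing. The function name `build_mysql_url` and the variable names `MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_HOST`, `MYSQL_PORT`, and `MYSQL_DATABASE` are hypothetical; the defaults simply mirror the values used above.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ) -> str:
    """Build a mysql+pymysql URL from environment variables (hypothetical names)."""
    user = env.get("MYSQL_USER", "root")
    password = env.get("MYSQL_PASSWORD", "")
    host = env.get("MYSQL_HOST", "localhost")
    port = env.get("MYSQL_PORT", "3306")
    database = env.get("MYSQL_DATABASE", "case5")
    # quote_plus percent-encodes reserved characters in the credentials
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example with an explicit mapping instead of the real environment
print(build_mysql_url({"MYSQL_PASSWORD": "p@ss/word"}))
```

The resulting string can be passed to `create_engine` in place of the f-string used in step 10.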
