# 1. Import libraries and load the data

In [2]:
import pandas as pd

In [3]:
superstore_df = pd.read_csv('superstore_raw.csv')

# 2. Check the data

In [4]:
print(superstore_df.head())

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
1       2  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
2       3  CA-2017-138688  12/06/2017  16/06/2017    Second Class    DV-13045   
3       4  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   
4       5  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   

     Customer Name    Segment        Country             City       State  \
0      Claire Gute   Consumer  United States        Henderson    Kentucky   
1      Claire Gute   Consumer  United States        Henderson    Kentucky   
2  Darrin Van Huff  Corporate  United States      Los Angeles  California   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale     Florida   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale     Florida   

   Postal Code Region       Product ID         Cat

In [5]:
print(superstore_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float

In [6]:
superstore_df.describe(include='all')

,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
count,9800.0,9800,9800,9800,9800,9800,9800,9800,9800,9800,9800,9789.0,9800,9800,9800,9800,9800,9800.0
unique,,4922,1230,1326,4,793,793,3,1,529,49,,4,1861,3,17,1849,
top,,CA-2018-100111,05/09/2017,26/09/2018,Standard Class,WB-21850,William Brown,Consumer,United States,New York City,California,,West,OFF-PA-10001970,Office Supplies,Binders,Staple envelope,
freq,,14,38,34,5859,35,35,5101,9800,891,1946,,3140,19,5909,1492,47,
mean,4900.5,,,,,,,,,,,55273.322403,,,,,,230.769059
std,2829.160653,,,,,,,,,,,32041.223413,,,,,,626.651875
min,1.0,,,,,,,,,,,1040.0,,,,,,0.444
25%,2450.75,,,,,,,,,,,23223.0,,,,,,17.248
50%,4900.5,,,,,,,,,,,58103.0,,,,,,54.49
75%,7350.25,,,,,,,,,,,90008.0,,,,,,210.605


Check the maximum string length of each object-type column in the dataframe to determine the size n for each PostgreSQL VARCHAR(n) column.

In [9]:
# Create a function to calculate the maximum string length in each column of a dataframe
def max_length(df):
    # Receive a dataframe as input
    # Initialize an empty dictionary to store the maximum lengths
    lengths = {}
    for column in df.columns:
        # Check if the column is of type object
        if df[column].dtype == 'object':
            # Calculate the maximum length of strings in the column
            lengths[column] = df[column].astype(str).str.len().max()
        else:
            # Non-text columns have no string length
            lengths[column] = None
    return lengths

# Apply the function to the dataframe and list the longest columns first
max_lengths = max_length(superstore_df)
print(pd.Series(max_lengths).sort_values(ascending=False))

Product Name     127.0
Customer Name     22.0
State             20.0
City              17.0
Product ID        15.0
Category          15.0
Ship Mode         14.0
Order ID          14.0
Country           13.0
Sub-Category      11.0
Segment           11.0
Order Date        10.0
Ship Date         10.0
Customer ID        8.0
Region             7.0
Row ID             NaN
Postal Code        NaN
Sales              NaN
dtype: float64
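
The measured lengths can then be mapped to column types. The following is a minimal sketch, not the notebook's own method: `varchar_spec` is a hypothetical helper, and the 20% headroom is an arbitrary assumption to leave room for longer values in future loads.

```python
import math
import pandas as pd

def varchar_spec(max_lengths, headroom=0.2):
    """Map each text column to VARCHAR(n), padding n by `headroom` (hypothetical helper)."""
    specs = {}
    for column, length in max_lengths.items():
        if pd.isna(length):
            continue  # non-text columns need a numeric or date type instead
        n = math.ceil(length * (1 + headroom))
        specs[column] = f"VARCHAR({n})"
    return specs

# A few lengths copied from the output above; Sales has no string length
measured = {"Product Name": 127, "Customer Name": 22, "Customer ID": 8, "Sales": None}
specs = varchar_spec(measured)
for column, sql_type in specs.items():
    print(f"{column}: {sql_type}")
```

With `headroom=0`, the helper returns the exact measured lengths, which may be preferable for fixed-format codes such as Customer ID.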


# 3. Define tables and primary/foreign keys (PK/FK)

Star Schema:

dim_customer
* customer_id (PK)
* customer_name
* segment

dim_product
* product_id (PK)
* product_name
* category
* sub_category

dim_region
* region_id (PK) - will be created
* country
* state
* city
* postal_code
* region

dim_date
* date_id (PK)
* year
* quarter
* month
* day
* week_day
* order_date
* ship_date
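
The `info()` output shows that Order Date and Ship Date are stored as text, and the sample rows suggest a DD/MM/YYYY format (for example `16/06/2017`). A date dimension could be derived roughly as follows. This is a sketch under assumptions: it keeps one row per calendar date in a single `full_date` column rather than separate order and ship date columns, and the YYYYMMDD integer `date_id` is an illustrative choice, not something the notebook specifies.

```python
import pandas as pd

# Sample values copied from the head() output above (DD/MM/YYYY)
raw = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

# Collect every distinct calendar date used as an order or ship date
all_dates = pd.concat([raw["Order Date"], raw["Ship Date"]])
dates = (
    pd.to_datetime(all_dates, format="%d/%m/%Y")
    .drop_duplicates()
    .sort_values()
    .reset_index(drop=True)
)

dim_date = pd.DataFrame({
    "date_id": dates.dt.strftime("%Y%m%d").astype(int),  # assumed key format
    "full_date": dates,
    "year": dates.dt.year,
    "quarter": dates.dt.quarter,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "week_day": dates.dt.day_name(),
})
print(dim_date)
```

Passing an explicit `format` avoids pandas guessing month-first for ambiguous values such as `08/11/2017`.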

dim_ship_mode
* ship_mode_id (PK) - will be created
* ship_mode


fact_sales
* row_id (PK)
* customer_id (FK)
* product_id (FK)
* region_id (FK)
* date_id (FK)
* ship_mode_id (FK)
* sales
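
Because `region_id` does not exist in the raw data, it has to be generated. One possible way to do that is sketched below, assuming a small stand-in dataframe `df` in place of `superstore_df`. Only three location columns are used for brevity, and the sales amounts are placeholders, since the real values are not visible in the output above.

```python
import pandas as pd

# Tiny stand-in for superstore_df; locations come from the head() output above
df = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Customer ID": ["CG-12520", "CG-12520", "DV-13045"],
    "Country": ["United States"] * 3,
    "State": ["Kentucky", "Kentucky", "California"],
    "City": ["Henderson", "Henderson", "Los Angeles"],
    "Sales": [10.0, 20.0, 30.0],  # placeholder amounts
})

location_cols = ["Country", "State", "City"]

# One row per distinct location, numbered from 1 to form the surrogate key
dim_region = df[location_cols].drop_duplicates().reset_index(drop=True)
dim_region.insert(0, "region_id", range(1, len(dim_region) + 1))

# Replace the descriptive location columns in the fact rows with the new key
fact_sales = (
    df.merge(dim_region, on=location_cols, how="left")
      .drop(columns=location_cols)
)
print(dim_region)
print(fact_sales)
```

The same pattern (deduplicate, number, merge back) would apply to `ship_mode_id`.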

# 4. Define relational model

4.1 customers → orders (1:N relationship)
A single customer can place multiple orders, but each order belongs to exactly one customer.

* customers.customer_id is the primary key
* orders.customer_id is a foreign key referencing it

Why this relationship exists:  
In the raw CSV, the same customer appears many times across different rows. Storing customers in a separate table prevents duplication of names and segments, and allows customer‑level analysis.
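
Before splitting customers into their own table, it is worth confirming that each Customer ID really maps to a single name and segment. A minimal check is sketched below; `is_one_to_many` is a hypothetical helper, and the sample rows are taken from the `head()` output.

```python
import pandas as pd

def is_one_to_many(df, key, attributes):
    """Return True if every value of `key` maps to exactly one combination of `attributes`."""
    combos = df[[key] + attributes].drop_duplicates()
    return bool(combos[key].is_unique)

rows = pd.DataFrame({
    "Customer ID": ["CG-12520", "CG-12520", "SO-20335"],
    "Customer Name": ["Claire Gute", "Claire Gute", "Sean O'Donnell"],
    "Segment": ["Consumer", "Consumer", "Consumer"],
})
consistent = is_one_to_many(rows, "Customer ID", ["Customer Name", "Segment"])
print(consistent)  # True

# If the check passes, the customers table is just the deduplicated columns
customers = rows[["Customer ID", "Customer Name", "Segment"]].drop_duplicates()
print(customers)
```

If the check returned False, the conflicting rows would need to be resolved before `customer_id` could serve as a primary key.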

4.2 regions → orders (1:N relationship)
Each order is associated with one geographic location (city, state, region), but a region can contain many orders.

* regions.region_id is the primary key
* orders.region_id is a foreign key

Why this relationship exists:  
Location fields repeat heavily in the dataset. Creating a dedicated regions table avoids redundancy and enables geographic reporting (sales by state, region, etc.).
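
As an illustration of the geographic reporting this enables, the sketch below joins a hypothetical `orders` table to `regions` and totals sales by state. The table contents, the third order ID, and the order-level `sales` column are assumptions made for brevity.

```python
import pandas as pd

regions = pd.DataFrame({
    "region_id": [1, 2],
    "state": ["Kentucky", "California"],
    "city": ["Henderson", "Los Angeles"],
})
orders = pd.DataFrame({
    "order_id": ["CA-2017-152156", "CA-2017-138688", "HYPOTHETICAL-ORDER"],
    "region_id": [1, 2, 2],
    "sales": [100.0, 50.0, 25.0],  # placeholder amounts
})

# validate="many_to_one" raises if region_id is not unique in regions,
# which enforces the "1" side of the 1:N relationship during the join
joined = orders.merge(regions, on="region_id", how="left", validate="many_to_one")
sales_by_state = joined.groupby("state")["sales"].sum()
print(sales_by_state)
```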

4.3 orders → order_items (1:N relationship)
An order can contain multiple products, but each order item belongs to exactly one order.

* orders.order_id is the primary key
* order_items.order_id is a foreign key

Why this relationship exists:  
In the CSV, each Order ID appears multiple times, once per product purchased. This relationship reflects that structure and allows detailed line-item analysis.
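
The split into order-level and line-level tables could look roughly like this. It is a sketch on a stand-in dataframe: the order and customer IDs come from the `head()` output, while the product IDs and sales amounts are placeholders.

```python
import pandas as pd

rows = pd.DataFrame({
    "Row ID": [1, 2, 3],
    "Order ID": ["CA-2017-152156", "CA-2017-152156", "CA-2017-138688"],
    "Order Date": ["08/11/2017", "08/11/2017", "12/06/2017"],
    "Customer ID": ["CG-12520", "CG-12520", "DV-13045"],
    "Product ID": ["P-1", "P-2", "P-3"],  # placeholder IDs
    "Sales": [10.0, 20.0, 30.0],          # placeholder amounts
})

# Order-level attributes: one row per Order ID
orders = rows[["Order ID", "Order Date", "Customer ID"]].drop_duplicates()

# Line-level attributes: one row per CSV row, linked back by Order ID
order_items = rows[["Row ID", "Order ID", "Product ID", "Sales"]]

# Number of line items per order
items_per_order = order_items.groupby("Order ID").size()
print(orders)
print(items_per_order)
```

If `orders["Order ID"]` is not unique after deduplication, some order-level attribute varies within an order and belongs in `order_items` instead.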