# EDA on Retail Gold Schema Data

**Overview:**  
This notebook performs **Exploratory Data Analysis (EDA)** on the `gold` schema of our data warehouse. The schema contains fact and dimension tables covering sales, stock, customers, staff, products, and stores.  

**Schema Summary:**  
- **Fact Tables:**  
    - `fact_sales`: Contains transactional sales data with details such as quantity, price, discount, and dates.  
    - `fact_stocks`: Contains inventory data for stores and products.  

- **Dimension Tables:**  
    - `dim_customers`: Customer information including name, contact, and address.  
    - `dim_staffs`: Staff details including name, store association, and manager.  
    - `dim_products`: Product catalog with category, brand, pricing, and model year.  
    - `dim_stores`: Store details including address and contact information.  

**Objectives:**  
1. Connect to SQL Server and load the gold schema data into Python using **pyodbc**.  
2. Explore the data to identify **insights and trends**.  
3. Perform **summary statistics and visualizations** to understand distributions, relationships, and patterns.  
4. Create charts and plots to uncover interesting aspects of the business, such as sales per product, store performance, or customer segmentation.  
5. Prepare the data for further **analysis and reporting in Power BI**.  

**Tools & Libraries:**  
- Python (`pandas`, `numpy`)  
- SQL connection (`pyodbc`)  
- Data visualization (`matplotlib`, `seaborn`)  

This notebook will help generate insights from the gold dataset and create a foundation for **Power BI dashboards and reporting**.


# Loading Gold Schema Data into Python

**Overview:**  
We will load data from our SQL Server `gold` schema into Python using `pyodbc`. This will allow us to perform EDA, generate insights, and visualize the data.

**Notes:**  
- All tables (fact and dimension) will be loaded as pandas DataFrames.  
- These DataFrames will be used for charts, aggregations, and other exploratory tasks.  

In [None]:
import pyodbc
import pandas as pd

# Connection parameters
# Raw string so the backslash in the instance name is not read as an escape sequence
server = r'your_server_name\SQLEXPRESS'
database = 'DataWarehouse'

# Connection string
conn_str = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};DATABASE={database};Trusted_Connection=yes;'

In [13]:
# Connect
conn = pyodbc.connect(conn_str)

In [None]:
# Load all the tables
df_customers = pd.read_sql_query('SELECT * FROM gold.dim_customers', conn)
df_products  = pd.read_sql_query('SELECT * FROM gold.dim_products', conn)
df_staffs    = pd.read_sql_query('SELECT * FROM gold.dim_staffs', conn)
df_stores    = pd.read_sql_query('SELECT * FROM gold.dim_stores', conn)
df_sales     = pd.read_sql_query('SELECT * FROM gold.fact_sales', conn)
df_stocks    = pd.read_sql_query('SELECT * FROM gold.fact_stocks', conn)
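
The six queries above differ only in the table name, so they could also be written as a loop. The following is a minimal sketch rather than the notebook's actual method: `GOLD_TABLES` and `load_gold_tables` are hypothetical names introduced here, and the function assumes it receives an already open DB-API connection such as the `conn` created earlier.

```python
import pandas as pd

# Hypothetical constant: the six gold-schema tables listed in the schema summary.
GOLD_TABLES = [
    "dim_customers",
    "dim_products",
    "dim_staffs",
    "dim_stores",
    "fact_sales",
    "fact_stocks",
]


def load_gold_tables(conn, tables=GOLD_TABLES, schema="gold"):
    """Return a dict mapping each table name to a DataFrame of its rows.

    `conn` is assumed to be an open connection (for example from pyodbc.connect).
    Table names are taken from a fixed list, not from user input, so building
    the query with an f-string is acceptable in this sketch.
    """
    frames = {}
    for table in tables:
        frames[table] = pd.read_sql_query(f"SELECT * FROM {schema}.{table}", conn)
    return frames


# Possible usage with the connection from the earlier cell:
# frames = load_gold_tables(conn)
# df_sales = frames["fact_sales"]
```

Keeping the DataFrames in a dictionary makes it easy to iterate over all tables later, at the cost of slightly longer lookups than the individual `df_*` variables.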

In [16]:
# Print the shape (rows, columns) of each DataFrame
print(f"dim_customers: {df_customers.shape}")
print(f"dim_products:  {df_products.shape}")
print(f"dim_staffs:    {df_staffs.shape}")
print(f"dim_stores:    {df_stores.shape}")
print(f"fact_sales:    {df_sales.shape}")
print(f"fact_stocks:   {df_stocks.shape}")

dim_customers: (1445, 10)
dim_products:  (321, 8)
dim_staffs:    (10, 8)
dim_stores:    (3, 9)
fact_sales:    (4722, 19)
fact_stocks:   (939, 4)
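
Shapes alone do not reveal missing values or duplicated rows, which are usually worth checking before any aggregation. The sketch below shows one way to collect those counts per table. `summarize_frames` is a hypothetical helper, not part of the original notebook, and it relies only on standard pandas methods.

```python
import pandas as pd


def summarize_frames(frames):
    """Build a one-row-per-table summary of size, null cells, and duplicate rows.

    `frames` is a dict of {name: DataFrame}.
    """
    rows = []
    for name, df in frames.items():
        rows.append({
            "table": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            # Total number of missing cells across all columns
            "null_cells": int(df.isna().sum().sum()),
            # Rows that exactly repeat an earlier row
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("table")


# Possible usage with the DataFrames loaded above:
# summarize_frames({
#     "dim_customers": df_customers,
#     "dim_products": df_products,
#     "dim_staffs": df_staffs,
#     "dim_stores": df_stores,
#     "fact_sales": df_sales,
#     "fact_stocks": df_stocks,
# })
```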
