# Superstore Sales Data Modelling and Analysis in SQL

In [1]:
import pandas as pd

In [2]:
# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrap is needed
df = pd.read_csv('Superstore_data.csv')

In [3]:
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [4]:
list(df.columns)

['Row ID',
 'Order ID',
 'Order Date',
 'Ship Date',
 'Ship Mode',
 'Customer ID',
 'Customer Name',
 'Segment',
 'Country',
 'City',
 'State',
 'Postal Code',
 'Region',
 'Product ID',
 'Category',
 'Sub-Category',
 'Product Name',
 'Sales']

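From this list of columns, we can see that the sales data can be normalized: attributes that describe a customer or a product are repeated on every order line and can be moved into their own tables. <br>
This means: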
1. Customer ID, Customer Name, Segment, Country, City, State, Postal Code can become a separate table named "Customers"
2. Product ID, Category, and Sub-Category can become a separate table named "Products"
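
As a rough sketch of this split, assuming pandas, each candidate table is the de-duplicated projection of its columns. The sample below is hand-built from the first rows of the `df.head()` output and carries only a subset of the columns to stay small; the names `sample`, `customers`, `products`, and `sales` are illustrative and not part of the notebook.

```python
import pandas as pd

# Small sample mirroring the first three rows of df.head(); values copied from the output above.
sample = pd.DataFrame({
    "Order ID": ["CA-2017-152156", "CA-2017-152156", "CA-2017-138688"],
    "Customer ID": ["CG-12520", "CG-12520", "DV-13045"],
    "Customer Name": ["Claire Gute", "Claire Gute", "Darrin Van Huff"],
    "Segment": ["Consumer", "Consumer", "Corporate"],
    "Product ID": ["FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240"],
    "Category": ["Furniture", "Furniture", "Office Supplies"],
    "Sub-Category": ["Bookcases", "Chairs", "Labels"],
    "Sales": [261.96, 731.94, 14.62],
})

customer_cols = ["Customer ID", "Customer Name", "Segment"]
product_cols = ["Product ID", "Category", "Sub-Category"]

# One row per distinct customer / product.
customers = sample[customer_cols].drop_duplicates().reset_index(drop=True)
products = sample[product_cols].drop_duplicates().reset_index(drop=True)

# The sales table keeps only the IDs as references, plus the order-level facts.
sales = sample[["Order ID", "Customer ID", "Product ID", "Sales"]]

print(len(customers), len(products), len(sales))
```

On the real data the same projections would be taken from `df`, with the full column lists given above.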

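We can also see that Row ID is not really needed. The natural identifier is Order ID, but it repeats across rows (rows 0 and 1 above share CA-2017-152156), so on its own it cannot be the primary key for the "Sales" table; it would have to be combined with another column, such as Product ID, and that combination checked for uniqueness.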
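
Whether a column (or set of columns) can serve as a primary key can be checked directly, since a key must be unique per row. The following is a minimal sketch on the five rows shown in the `df.head()` output; the helper `is_candidate_key` is a hypothetical name introduced here, not something from the notebook.

```python
import pandas as pd

# Order ID / Product ID pairs copied from the five rows of df.head() above.
sample = pd.DataFrame({
    "Order ID": ["CA-2017-152156", "CA-2017-152156", "CA-2017-138688",
                 "US-2016-108966", "US-2016-108966"],
    "Product ID": ["FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10000240",
                   "FUR-TA-10000577", "OFF-ST-10000760"],
})

def is_candidate_key(frame, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not frame.duplicated(subset=columns).any()

order_id_unique = is_candidate_key(sample, ["Order ID"])
composite_unique = is_candidate_key(sample, ["Order ID", "Product ID"])
print(order_id_unique, composite_unique)
```

On this sample, Order ID alone is not unique while the pair (Order ID, Product ID) is. The same check would need to be run against the full `df` before settling on a key, since five rows prove nothing about the whole file.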

# Data Infrastructure

As the name of the project suggests, I will be transferring the data to SQL. For that, I will use the appropriate Python libraries to create a connection to a MySQL database.

In [17]:
# importing connector
import mysql.connector
import os
from dotenv import load_dotenv

In [33]:
load_dotenv()
db_user = os.getenv('DB_USER')
db_password = os.getenv('DB_PASSWORD')

# Establishing the connection
try:
    conn = mysql.connector.connect(
        host="localhost",
        user=db_user,
        password=db_password,
    )
except mysql.connector.Error as e:
    print(f"Error: {e}")
    raise  # re-raise so later cells do not run with `conn` undefined

In [34]:
# creating cursor
cursor = conn.cursor()

In [37]:
# Testing the connection
# Let us list all the Schemas/Databases in MySQL
cursor.execute("SHOW DATABASES")
dbs = cursor.fetchall()
for db in dbs:
    print(db)

('information_schema',)
('mysql',)
('parks_and_recreation',)
('performance_schema',)
('sakila',)
('sys',)
('world',)


Our connection is set up. Now let us create a separate database for our Superstore data.

In [38]:
# creating the database (in MySQL, CREATE SCHEMA is a synonym for CREATE DATABASE)
cursor.execute("CREATE SCHEMA IF NOT EXISTS superstore_db")

Now our data infrastructure is set up. Let us move on to data modelling.

# Data Modelling
Now we will create the tables discussed above:
1. Products
2. Customers
3. Sales
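
To make the target concrete, here is one possible shape for the three tables, written as DDL strings plus a small helper that runs them through a DB-API cursor. This is a sketch under stated assumptions, not the final model: the snake_case column names, the column types and lengths, the composite key on `sales`, and the names `TABLE_DDL` and `create_tables` are all assumptions introduced here. It also assumes each Customer ID and each Product ID maps to exactly one row, which should be verified against the data first.

```python
# Hypothetical DDL for the three tables. Column names, types, and keys are assumptions.
TABLE_DDL = {
    "products": """
        CREATE TABLE IF NOT EXISTS products (
            product_id   VARCHAR(20) PRIMARY KEY,
            category     VARCHAR(50),
            sub_category VARCHAR(50)
        )""",
    "customers": """
        CREATE TABLE IF NOT EXISTS customers (
            customer_id   VARCHAR(20) PRIMARY KEY,
            customer_name VARCHAR(100),
            segment       VARCHAR(50),
            country       VARCHAR(50),
            city          VARCHAR(50),
            state         VARCHAR(50),
            postal_code   VARCHAR(10)
        )""",
    "sales": """
        CREATE TABLE IF NOT EXISTS sales (
            order_id     VARCHAR(20),
            order_date   DATE,
            ship_date    DATE,
            ship_mode    VARCHAR(30),
            customer_id  VARCHAR(20),
            region       VARCHAR(20),
            product_id   VARCHAR(20),
            product_name VARCHAR(255),
            sales        DECIMAL(12, 4),
            PRIMARY KEY (order_id, product_id),
            FOREIGN KEY (customer_id) REFERENCES customers (customer_id),
            FOREIGN KEY (product_id) REFERENCES products (product_id)
        )""",
}

def create_tables(cursor, ddl=TABLE_DDL):
    """Run each CREATE TABLE statement in insertion order (parent tables before sales)."""
    for statement in ddl.values():
        cursor.execute(statement)
```

With the notebook's connection, this would be called as `create_tables(cursor)` after selecting the database, for example with `cursor.execute("USE superstore_db")`. The parent tables are listed first because `sales` references them through its foreign keys.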

To make the data modelling easier, I will use [Quick DBD](https://www.quickdatabasediagrams.com/), an online tool for drawing database diagrams.