## Northwind database

Northwind is a company that sells specialty foods. As my first project as a new hire, I was given a mandate to "__do something with this database__". Unfortunately, at the end of last week the whole computer science division got sick: it was someone's birthday and they all ate bad sheet cake. I wasn't aware those could *actually* go bad.

I was able to find an entity-relationship diagram (ERD) for the database but not much else. So, to break down this task:
1. First I am going to have to explore the database myself to see the basic metrics of the company.
2. Then I will use this basic data to formulate some hypotheses concerning some underlying trends.
3. Finally I will attempt to test out these ideas to prove myself right or wrong.

In [1]:
import sqlalchemy
from sqlalchemy import create_engine, inspect
from sqlalchemy.orm import Session, sessionmaker  # libraries I will use; this list grew as I progressed through the notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

In [200]:
#engine = create_engine('sqlite:///Northwind_large.sqlite', echo=True)
engine = create_engine('sqlite:///Northwind_small.sqlite', echo=True)
Session = sessionmaker(bind=engine)
session = Session()

inspector = inspect(engine) #checking the ERD against the actual database
inspector.get_table_names()

2019-01-25 15:09:19,388 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2019-01-25 15:09:19,389 INFO sqlalchemy.engine.base.Engine ()
2019-01-25 15:09:19,390 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2019-01-25 15:09:19,391 INFO sqlalchemy.engine.base.Engine ()
2019-01-25 15:09:19,392 INFO sqlalchemy.engine.base.Engine SELECT name FROM sqlite_master WHERE type='table' ORDER BY name
2019-01-25 15:09:19,393 INFO sqlalchemy.engine.base.Engine ()


['Category',
 'Customer',
 'CustomerCustomerDemo',
 'CustomerDemographic',
 'Employee',
 'EmployeeTerritory',
 'Order',
 'OrderDetail',
 'Product',
 'Region',
 'Shipper',
 'Supplier',
 'Territory']

So I found my first discrepancy: most of the table names are singular in the database, while the ERD has them as plural.  
  
I will submit a ticket to the department to fix this when they get back.
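
To make this check repeatable for the ticket, the comparison can be scripted. Below is a minimal sketch using the standard-library `sqlite3` module on an in-memory stand-in database rather than the real file. The `erd_names` list and the `compare_table_names` helper are hypothetical, since I only have the ERD as a diagram.

```python
import sqlite3


def compare_table_names(conn, erd_names):
    """Return (names only in the ERD, names only in the database)."""
    actual = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    expected = set(erd_names)
    return sorted(expected - actual), sorted(actual - expected)


# Stand-in database with two singular table names, as in the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Category (Id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE Product (Id INTEGER PRIMARY KEY)")

# Hypothetical ERD names (assumption: the diagram uses plurals).
erd_names = ["Categories", "Products"]

missing, undocumented = compare_table_names(conn, erd_names)
print("In ERD but not in database:", missing)
print("In database but not in ERD:", undocumented)
```

Against the real database, the same query on `sqlite_master` is what the inspector ran in the log above.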

In [50]:
inspector.get_columns('Product')

2019-01-25 13:06:07,614 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Product")
2019-01-25 13:06:07,614 INFO sqlalchemy.engine.base.Engine ()


[{'name': 'Id',
  'type': INTEGER(),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 1},
 {'name': 'ProductName',
  'type': VARCHAR(length=8000),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'SupplierId',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'CategoryId',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'QuantityPerUnit',
  'type': VARCHAR(length=8000),
  'nullable': True,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'UnitPrice',
  'type': DECIMAL(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'UnitsInStock',
  'type': INTEGER(),
  'nullable': False,
  'default': None,
  'autoincrement': 'auto',
  'primary_key': 0},
 {'name': 'UnitsOnOrder',
  'type': INTEGER(),
  'n

That output is a bit messy. I recall a function I have used previously that can help clean it up.

In [51]:
def get_column_info(table_name):  # quick function based on https://github.com/learn-co-curriculum/dsc-2-13-11-queries-with-sqlalchemy-lab
    """Print every column of a table, flagging primary-key columns."""
    col_list = inspector.get_columns(table_name)
    print(f'Table Name: {table_name} \n')
    
    for col in col_list:
        if col['primary_key']:  # non-zero for any column that is part of the primary key
            print(f"{col['name']}  ||PRIMARY KEY||  dtype: {col['type']}")
        else:
            print(f"{col['name']}     dtype: {col['type']}")
                  

In [52]:
get_column_info('Product')

Table Name: Product 

Id  ||PRIMARY KEY||  dtype: INTEGER
ProductName     dtype: VARCHAR(8000)
SupplierId     dtype: INTEGER
CategoryId     dtype: INTEGER
QuantityPerUnit     dtype: VARCHAR(8000)
UnitPrice     dtype: DECIMAL
UnitsInStock     dtype: INTEGER
UnitsOnOrder     dtype: INTEGER
ReorderLevel     dtype: INTEGER
Discontinued     dtype: INTEGER
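
One caveat: `get_column_info` prints its report and returns nothing, so wrapping a call in `print()` also prints `None`. A minimal sketch of a variant that returns the lines instead is below. It is an assumption-laden stand-in: it uses the standard-library `sqlite3` module and `PRAGMA table_info` directly rather than the SQLAlchemy inspector, and `describe_table` is a hypothetical name.

```python
import sqlite3


def describe_table(conn, table_name):
    """Return one formatted line per column instead of printing them."""
    rows = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
    lines = []
    # Each row is (cid, name, type, notnull, default_value, pk).
    for _cid, name, col_type, _notnull, _default, pk in rows:
        marker = "  ||PRIMARY KEY||" if pk else ""
        lines.append(f"{name}{marker}  dtype: {col_type}")
    return lines


# Toy table standing in for the real Product table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Product ("
    "Id INTEGER PRIMARY KEY, ProductName VARCHAR(8000), UnitPrice DECIMAL)"
)

product_lines = describe_table(conn, "Product")
print("\n".join(product_lines))
```

Returning the lines keeps the caller in charge of printing, so the stray `None` goes away.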


In [53]:
get_column_info('Employee')

2019-01-25 13:06:07,755 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Employee")
2019-01-25 13:06:07,756 INFO sqlalchemy.engine.base.Engine ()
Table Name: Employee 

Id  ||PRIMARY KEY||  dtype: INTEGER
LastName     dtype: VARCHAR(8000)
FirstName     dtype: VARCHAR(8000)
Title     dtype: VARCHAR(8000)
TitleOfCourtesy     dtype: VARCHAR(8000)
BirthDate     dtype: VARCHAR(8000)
HireDate     dtype: VARCHAR(8000)
Address     dtype: VARCHAR(8000)
City     dtype: VARCHAR(8000)
Region     dtype: VARCHAR(8000)
PostalCode     dtype: VARCHAR(8000)
Country     dtype: VARCHAR(8000)
HomePhone     dtype: VARCHAR(8000)
Extension     dtype: VARCHAR(8000)
Photo     dtype: BLOB
Notes     dtype: VARCHAR(8000)
ReportsTo     dtype: INTEGER
PhotoPath     dtype: VARCHAR(8000)


In [54]:
get_column_info('Supplier')

2019-01-25 13:06:07,780 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Supplier")
2019-01-25 13:06:07,780 INFO sqlalchemy.engine.base.Engine ()
Table Name: Supplier 

Id  ||PRIMARY KEY||  dtype: INTEGER
CompanyName     dtype: VARCHAR(8000)
ContactName     dtype: VARCHAR(8000)
ContactTitle     dtype: VARCHAR(8000)
Address     dtype: VARCHAR(8000)
City     dtype: VARCHAR(8000)
Region     dtype: VARCHAR(8000)
PostalCode     dtype: VARCHAR(8000)
Country     dtype: VARCHAR(8000)
Phone     dtype: VARCHAR(8000)
Fax     dtype: VARCHAR(8000)
HomePage     dtype: VARCHAR(8000)


  
So a quick check of a few tables shows that they align with the ERD, which is good. Still, I should inspect each table before querying it; that is good practice.
  
Now let's start checking out a few things:
- How much of what are we selling?
- Who are our main suppliers?
- What does our customer base look like?
- What is the geographical spread of our workforce?

Once we know these things, we will have a broad overview of the business. From there we will investigate any abnormalities or go spelunking for underlying trends.
___
---
Now let's make a connection to the engine and make sure it works.

In [55]:
con = engine.connect() #connecting the engine to be able to make queries

In [56]:
q = '''SELECT * FROM Product''' #simple query
df_product = pd.read_sql_query(q, engine) #puts the information from the query into a dataframe
df_product.head()

2019-01-25 13:06:07,883 INFO sqlalchemy.engine.base.Engine SELECT * FROM Product
2019-01-25 13:06:07,884 INFO sqlalchemy.engine.base.Engine ()


| | Id | ProductName | SupplierId | CategoryId | QuantityPerUnit | UnitPrice | UnitsInStock | UnitsOnOrder | ReorderLevel | Discontinued |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chai | 1 | 1 | 10 boxes x 20 bags | 18.0 | 39 | 0 | 10 | 0 |
| 1 | 2 | Chang | 1 | 1 | 24 - 12 oz bottles | 19.0 | 17 | 40 | 25 | 0 |
| 2 | 3 | Aniseed Syrup | 1 | 2 | 12 - 550 ml bottles | 10.0 | 13 | 70 | 25 | 0 |
| 3 | 4 | Chef Anton's Cajun Seasoning | 2 | 2 | 48 - 6 oz jars | 22.0 | 53 | 0 | 0 | 0 |
| 4 | 5 | Chef Anton's Gumbo Mix | 2 | 2 | 36 boxes | 21.35 | 0 | 0 | 0 | 1 |


Fantastic, now let's start fleshing out answers to those initial questions.

## How much of what are we selling?

Let's check the `Product`, `OrderDetail`, and `Category` tables.

In [57]:
print(get_column_info('Product'))
print(get_column_info('OrderDetail'))
print(get_column_info('Category'))

Table Name: Product 

Id  ||PRIMARY KEY||  dtype: INTEGER
ProductName     dtype: VARCHAR(8000)
SupplierId     dtype: INTEGER
CategoryId     dtype: INTEGER
QuantityPerUnit     dtype: VARCHAR(8000)
UnitPrice     dtype: DECIMAL
UnitsInStock     dtype: INTEGER
UnitsOnOrder     dtype: INTEGER
ReorderLevel     dtype: INTEGER
Discontinued     dtype: INTEGER
None
2019-01-25 13:06:07,976 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("OrderDetail")
2019-01-25 13:06:07,977 INFO sqlalchemy.engine.base.Engine ()
Table Name: OrderDetail 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
OrderId     dtype: INTEGER
ProductId     dtype: INTEGER
UnitPrice     dtype: DECIMAL
Quantity     dtype: INTEGER
Discount     dtype: FLOAT
None
2019-01-25 13:06:07,980 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Category")
2019-01-25 13:06:07,980 INFO sqlalchemy.engine.base.Engine ()
Table Name: Category 

Id  ||PRIMARY KEY||  dtype: INTEGER
CategoryName     dtype: VARCHAR(8000)
Description     dtype: VARCH

It looks like the ERD is wrong again. Some of the column names are stated incorrectly, e.g. `ProductID` in the ERD is just `Id` in the database.

In [156]:
q='''SELECT p.ProductName, c.CategoryName, COUNT(*) num_ordered \
FROM Product p \
LEFT JOIN OrderDetail o ON o.ProductId = p.Id \
LEFT JOIN Category c ON c.Id = p.CategoryId \
GROUP BY p.ProductName ORDER BY num_ordered DESC'''
df1 = pd.read_sql_query(q, engine)
df1.head()

2019-01-25 14:22:54,846 INFO sqlalchemy.engine.base.Engine SELECT p.ProductName, c.CategoryName, COUNT(*) num_ordered FROM Product p LEFT JOIN OrderDetail o ON o.ProductId = p.Id LEFT JOIN Category c ON c.Id = p.CategoryId GROUP BY p.ProductName ORDER BY num_ordered DESC
2019-01-25 14:22:54,846 INFO sqlalchemy.engine.base.Engine ()


| | ProductName | CategoryName | num_ordered |
|---|---|---|---|
| 0 | Raclette Courdavault | Dairy Products | 54 |
| 1 | Camembert Pierrot | Dairy Products | 51 |
| 2 | Gorgonzola Telino | Dairy Products | 51 |
| 3 | Guaraná Fantástica | Beverages | 51 |
| 4 | Gnocchi di nonna Alice | Grains/Cereals | 50 |


In [157]:
df1.CategoryName.value_counts()

Confections       13
Condiments        12
Beverages         12
Seafood           12
Dairy Products    10
Grains/Cereals     7
Meat/Poultry       6
Produce            5
Name: CategoryName, dtype: int64

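So it looks like Confections is the category with the most distinct products (13). Note that `value_counts()` here counts products per category, because `df1` has one row per product; it does not count orders. The per-product counts above suggest that Dairy Products hold the most frequently ordered individual items.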
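
To separate "how many products does a category have" from "how often is a category ordered", both can be aggregated in one query. The sketch below runs on a tiny in-memory stand-in built with the standard-library `sqlite3` module; the rows (including the "Toy Candy" products) are made up for illustration, and only the table and column names come from the schema above. It uses `COUNT(od.Id)` rather than `COUNT(*)` so that a product with no orders contributes zero instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Category (Id INTEGER PRIMARY KEY, CategoryName TEXT);
CREATE TABLE Product (Id INTEGER PRIMARY KEY, ProductName TEXT, CategoryId INTEGER);
CREATE TABLE OrderDetail (Id TEXT PRIMARY KEY, OrderId INTEGER, ProductId INTEGER, Quantity INTEGER);
-- Made-up rows for illustration only.
INSERT INTO Category VALUES (1, 'Beverages'), (2, 'Confections');
INSERT INTO Product VALUES (1, 'Chai', 1), (2, 'Toy Candy A', 2), (3, 'Toy Candy B', 2);
INSERT INTO OrderDetail VALUES
    ('1/1', 1, 1, 10), ('2/1', 2, 1, 5), ('3/1', 3, 1, 2), ('3/2', 3, 2, 1);
""")

query = """
SELECT c.CategoryName,
       COUNT(DISTINCT p.Id)          AS num_products,
       COUNT(od.Id)                  AS num_order_lines,
       COALESCE(SUM(od.Quantity), 0) AS units_ordered
FROM Category c
LEFT JOIN Product p ON p.CategoryId = c.Id
LEFT JOIN OrderDetail od ON od.ProductId = p.Id
GROUP BY c.CategoryName
ORDER BY num_order_lines DESC
"""
category_rows = conn.execute(query).fetchall()
for row in category_rows:
    print(row)
```

In this toy data Confections has more products but Beverages has more order lines, which is exactly the distinction the `value_counts()` call cannot show.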

## Who are our main suppliers?

In [168]:
print(get_column_info('Order'))
print(get_column_info('Supplier'))

Table Name: Order 

Id  ||PRIMARY KEY||  dtype: INTEGER
CustomerId     dtype: VARCHAR(8000)
EmployeeId     dtype: INTEGER
OrderDate     dtype: VARCHAR(8000)
RequiredDate     dtype: VARCHAR(8000)
ShippedDate     dtype: VARCHAR(8000)
ShipVia     dtype: INTEGER
Freight     dtype: DECIMAL
ShipName     dtype: VARCHAR(8000)
ShipAddress     dtype: VARCHAR(8000)
ShipCity     dtype: VARCHAR(8000)
ShipRegion     dtype: VARCHAR(8000)
ShipPostalCode     dtype: VARCHAR(8000)
ShipCountry     dtype: VARCHAR(8000)
None
Table Name: Supplier 

Id  ||PRIMARY KEY||  dtype: INTEGER
CompanyName     dtype: VARCHAR(8000)
ContactName     dtype: VARCHAR(8000)
ContactTitle     dtype: VARCHAR(8000)
Address     dtype: VARCHAR(8000)
City     dtype: VARCHAR(8000)
Region     dtype: VARCHAR(8000)
PostalCode     dtype: VARCHAR(8000)
Country     dtype: VARCHAR(8000)
Phone     dtype: VARCHAR(8000)
Fax     dtype: VARCHAR(8000)
HomePage     dtype: VARCHAR(8000)
None


In [163]:
q = '''SELECT s.CompanyName, s.Region, COUNT(*) num_of_orders FROM [Order] o \
LEFT JOIN OrderDetail od ON o.Id = od.OrderId \
LEFT JOIN Product p ON od.ProductId = p.Id \
LEFT JOIN Supplier s ON p.SupplierId = s.Id \
GROUP BY s.CompanyName \
ORDER BY num_of_orders DESC'''

df2 = pd.read_sql_query(q, engine)
df2.head()

2019-01-25 14:27:18,978 INFO sqlalchemy.engine.base.Engine SELECT s.CompanyName, s.Region, COUNT(*) num_of_orders FROM [Order] o LEFT JOIN OrderDetail od ON o.Id = od.OrderId LEFT JOIN Product p ON od.ProductId = p.Id LEFT JOIN Supplier s ON p.SupplierId = s.Id GROUP BY s.CompanyName ORDER BY num_of_orders DESC
2019-01-25 14:27:18,978 INFO sqlalchemy.engine.base.Engine ()


| | CompanyName | Region | num_of_orders |
|---|---|---|---|
| 0 | Plutzer Lebensmittelgroßmärkte AG | Western Europe | 179 |
| 1 | Pavlova, Ltd. | Victoria | 163 |
| 2 | Specialty Biscuits, Ltd. | British Isles | 126 |
| 3 | Gai pâturage | Western Europe | 105 |
| 4 | Norske Meierier | Scandinavia | 105 |


In [167]:
df2.Region.value_counts()

North America      6
Western Europe     6
Northern Europe    4
Southern Europe    3
Scandinavia        2
Eastern Asia       2
British Isles      2
South-East Asia    1
South America      1
NSW                1
Victoria           1
Name: Region, dtype: int64

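The supplier with the most order lines is based in Western Europe, and Western Europe is tied with North America as the region with the most suppliers (6 each), so it makes up one of the largest portions of our supply chain.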
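
The `value_counts()` call counts suppliers per region, not orders per region. A minimal sketch of aggregating by region directly is below, again on a made-up in-memory stand-in using the standard-library `sqlite3` module. The "Toy Supplier" rows are invented; only the table and column names are taken from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (Id INTEGER PRIMARY KEY, CompanyName TEXT, Region TEXT);
CREATE TABLE Product (Id INTEGER PRIMARY KEY, SupplierId INTEGER);
CREATE TABLE OrderDetail (Id TEXT PRIMARY KEY, OrderId INTEGER, ProductId INTEGER);
-- Made-up rows for illustration only.
INSERT INTO Supplier VALUES
    (1, 'Toy Supplier A', 'Western Europe'),
    (2, 'Toy Supplier B', 'Western Europe'),
    (3, 'Toy Supplier C', 'North America');
INSERT INTO Product VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO OrderDetail VALUES ('1/1', 1, 1), ('1/2', 1, 2), ('2/3', 2, 3), ('3/1', 3, 1);
""")

query = """
SELECT s.Region,
       COUNT(DISTINCT s.Id)       AS num_suppliers,
       COUNT(DISTINCT od.OrderId) AS num_orders,
       COUNT(od.Id)               AS num_order_lines
FROM Supplier s
LEFT JOIN Product p ON p.SupplierId = s.Id
LEFT JOIN OrderDetail od ON od.ProductId = p.Id
GROUP BY s.Region
ORDER BY num_order_lines DESC, s.Region
"""
region_rows = conn.execute(query).fetchall()
for row in region_rows:
    print(row)
```

Keeping `num_orders` (distinct orders) next to `num_order_lines` makes it clear which of the two a ranking is based on.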

## What does our customer base look like?

In [172]:
print(get_column_info('CustomerDemographic'))
print(get_column_info('Customer'))
print(get_column_info('CustomerCustomerDemo'))

Table Name: CustomerDemographic 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
CustomerDesc     dtype: VARCHAR(8000)
None
Table Name: Customer 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
CompanyName     dtype: VARCHAR(8000)
ContactName     dtype: VARCHAR(8000)
ContactTitle     dtype: VARCHAR(8000)
Address     dtype: VARCHAR(8000)
City     dtype: VARCHAR(8000)
Region     dtype: VARCHAR(8000)
PostalCode     dtype: VARCHAR(8000)
Country     dtype: VARCHAR(8000)
Phone     dtype: VARCHAR(8000)
Fax     dtype: VARCHAR(8000)
None
2019-01-25 14:43:38,607 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("CustomerCustomerDemo")
2019-01-25 14:43:38,608 INFO sqlalchemy.engine.base.Engine ()
Table Name: CustomerCustomerDemo 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
CustomerTypeId     dtype: VARCHAR(8000)
None


In [181]:
q = '''SELECT * FROM CustomerCustomerDemo'''

df3 = pd.read_sql_query(q, engine)
df3

2019-01-25 14:48:19,176 INFO sqlalchemy.engine.base.Engine SELECT * FROM CustomerCustomerDemo
2019-01-25 14:48:19,176 INFO sqlalchemy.engine.base.Engine ()


| | Id | CustomerTypeId |
|---|---|---|


It looks like `CustomerCustomerDemo` is an empty table. Either it is a new table or something went wrong. That means the only customer data I have to look at will come from the `Customer` table. Another ticket I need to submit.
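
Rather than discovering empty tables one query at a time, every table can be counted up front. Here is a minimal sketch, assuming the standard-library `sqlite3` module and a made-up in-memory database; `table_row_counts` is a hypothetical helper name. Table names are double-quoted because `Order` is a reserved word in SQL.

```python
import sqlite3


def table_row_counts(conn):
    """Map every table name in the database to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Toy database: one populated table and one empty table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Order" (Id INTEGER PRIMARY KEY)')
conn.execute("CREATE TABLE CustomerCustomerDemo (Id TEXT PRIMARY KEY, CustomerTypeId TEXT)")
conn.execute('INSERT INTO "Order" VALUES (1), (2)')

counts = table_row_counts(conn)
empty_tables = [name for name, n in counts.items() if n == 0]
print(counts)
print("Empty tables:", empty_tables)
```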

In [183]:
q = '''SELECT ContactTitle, Count(*) num_of_types \
FROM Customer  \
GROUP BY ContactTitle  \
ORDER BY num_of_types DESC'''

df4 = pd.read_sql_query(q, engine)
df4.head()

2019-01-25 14:51:39,916 INFO sqlalchemy.engine.base.Engine SELECT ContactTitle, Count(*) num_of_types FROM Customer GROUP BY ContactTitle ORDER BY num_of_types DESC
2019-01-25 14:51:39,917 INFO sqlalchemy.engine.base.Engine ()


| | ContactTitle | num_of_types |
|---|---|---|
| 0 | Owner | 17 |
| 1 | Sales Representative | 17 |
| 2 | Marketing Manager | 12 |
| 3 | Sales Manager | 11 |
| 4 | Accounting Manager | 10 |


In [187]:
q = '''SELECT Country, Region, Count(*) num_of_customers \
FROM Customer  \
GROUP BY Region  \
ORDER BY num_of_customers DESC'''

df5 = pd.read_sql_query(q, engine)
df5

2019-01-25 14:55:44,384 INFO sqlalchemy.engine.base.Engine SELECT Country, Region, Count(*) num_of_customers FROM Customer  GROUP BY Region  ORDER BY num_of_customers DESC
2019-01-25 14:55:44,384 INFO sqlalchemy.engine.base.Engine ()


| | Country | Region | num_of_customers |
|---|---|---|---|
| 0 | Germany | Western Europe | 28 |
| 1 | USA | North America | 16 |
| 2 | Brazil | South America | 16 |
| 3 | Spain | Southern Europe | 10 |
| 4 | UK | British Isles | 8 |
| 5 | Mexico | Central America | 5 |
| 6 | Denmark | Northern Europe | 4 |
| 7 | Finland | Scandinavia | 3 |
| 8 | Poland | Eastern Europe | 1 |


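It is good to see that our customer base lines up with our supply chain: Western Europe and North America are near the top of both lists. Note that the `Country` column above shows only one country per region, because the query groups by `Region` alone and SQLite fills in `Country` from an arbitrary row of each group. It is also intriguing to see the titles of our contacts. Now, finally, to answer our last question.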
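
The customer query selects `Country` while grouping only by `Region`, so SQLite reports a single, arbitrarily chosen country for each region. A minimal sketch of grouping by both columns is below. It runs on invented rows in an in-memory database via the standard-library `sqlite3` module; only the `Customer` column names come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (Id TEXT PRIMARY KEY, Country TEXT, Region TEXT);
-- Made-up rows for illustration only.
INSERT INTO Customer VALUES
    ('AAAAA', 'Germany', 'Western Europe'),
    ('BBBBB', 'France',  'Western Europe'),
    ('CCCCC', 'Germany', 'Western Europe'),
    ('DDDDD', 'USA',     'North America');
""")

query = """
SELECT Region, Country, COUNT(*) AS num_of_customers
FROM Customer
GROUP BY Region, Country
ORDER BY num_of_customers DESC, Country
"""
customer_rows = conn.execute(query).fetchall()
for row in customer_rows:
    print(row)
```

With both columns in the `GROUP BY`, every country gets its own row and the counts within a region add up to the regional total.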

## What is the geographical spread of our workforce?

In [191]:
print(get_column_info('Territory'))
print(get_column_info('Region'))
print(get_column_info('EmployeeTerritory'))
print(get_column_info('Employee'))

Table Name: Territory 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
TerritoryDescription     dtype: VARCHAR(8000)
RegionId     dtype: INTEGER
None
2019-01-25 15:00:41,623 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Region")
2019-01-25 15:00:41,624 INFO sqlalchemy.engine.base.Engine ()
Table Name: Region 

Id  ||PRIMARY KEY||  dtype: INTEGER
RegionDescription     dtype: VARCHAR(8000)
None
2019-01-25 15:00:41,627 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("EmployeeTerritory")
2019-01-25 15:00:41,628 INFO sqlalchemy.engine.base.Engine ()
Table Name: EmployeeTerritory 

Id  ||PRIMARY KEY||  dtype: VARCHAR(8000)
EmployeeId     dtype: INTEGER
TerritoryId     dtype: VARCHAR(8000)
None
2019-01-25 15:00:41,630 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("Employee")
2019-01-25 15:00:41,631 INFO sqlalchemy.engine.base.Engine ()
Table Name: Employee 

Id  ||PRIMARY KEY||  dtype: INTEGER
LastName     dtype: VARCHAR(8000)
FirstName     dtype: VARCHAR(8000)
Title     dtype:

In [210]:
q = '''SELECT e.LastName, e.Title, e.Region as based_from, r.RegionDescription FROM Employee e \
LEFT JOIN EmployeeTerritory et ON e.Id = et.EmployeeId \
LEFT JOIN Territory t ON et.TerritoryId = t.Id \
LEFT JOIN Region r ON t.RegionId = r.Id \
GROUP BY e.LastName \
ORDER BY e.Title'''

df6 = pd.read_sql_query(q, engine)
df6

2019-01-25 15:21:57,380 INFO sqlalchemy.engine.base.Engine SELECT e.LastName, e.Title, e.Region as based_from, r.RegionDescription FROM Employee e LEFT JOIN EmployeeTerritory et ON e.Id = et.EmployeeId LEFT JOIN Territory t ON et.TerritoryId = t.Id LEFT JOIN Region r ON t.RegionId = r.Id GROUP BY e.LastName ORDER BY e.Title
2019-01-25 15:21:57,380 INFO sqlalchemy.engine.base.Engine ()


| | LastName | Title | based_from | RegionDescription |
|---|---|---|---|---|
| 0 | Callahan | Inside Sales Coordinator | North America | Northern |
| 1 | Buchanan | Sales Manager | British Isles | Eastern |
| 2 | Davolio | Sales Representative | North America | Eastern |
| 3 | Dodsworth | Sales Representative | British Isles | Northern |
| 4 | King | Sales Representative | British Isles | Western |
| 5 | Leverling | Sales Representative | North America | Southern |
| 6 | Peacock | Sales Representative | North America | Eastern |
| 7 | Suyama | Sales Representative | British Isles | Western |
| 8 | Fuller | Vice President, Sales | North America | Eastern |

It is unclear what `RegionDescription` from the `Region` table represents. I will need to talk to one of the database engineers when they get back for some clarification.
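
Before raising the question, it may help to see every region an employee's territories roll up to, since `GROUP BY e.LastName` keeps only one `RegionDescription` per employee. The sketch below is a guess at a useful follow-up query, run on invented rows in an in-memory database with the standard-library `sqlite3` module. The territory IDs and the employee-to-territory assignments are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (Id INTEGER PRIMARY KEY, LastName TEXT);
CREATE TABLE Region (Id INTEGER PRIMARY KEY, RegionDescription TEXT);
CREATE TABLE Territory (Id TEXT PRIMARY KEY, TerritoryDescription TEXT, RegionId INTEGER);
CREATE TABLE EmployeeTerritory (Id TEXT PRIMARY KEY, EmployeeId INTEGER, TerritoryId TEXT);
-- Made-up rows for illustration only.
INSERT INTO Employee VALUES (1, 'Davolio'), (2, 'King');
INSERT INTO Region VALUES (1, 'Eastern'), (2, 'Western');
INSERT INTO Territory VALUES
    ('T1', 'Toy Town A', 1), ('T2', 'Toy Town B', 1), ('T3', 'Toy Town C', 2);
INSERT INTO EmployeeTerritory VALUES
    ('1/T1', 1, 'T1'), ('1/T2', 1, 'T2'), ('2/T1', 2, 'T1'), ('2/T3', 2, 'T3');
""")

query = """
SELECT e.LastName,
       COUNT(t.Id)                              AS num_territories,
       GROUP_CONCAT(DISTINCT r.RegionDescription) AS regions
FROM Employee e
LEFT JOIN EmployeeTerritory et ON e.Id = et.EmployeeId
LEFT JOIN Territory t ON et.TerritoryId = t.Id
LEFT JOIN Region r ON t.RegionId = r.Id
GROUP BY e.Id, e.LastName
ORDER BY e.LastName
"""
# regions is NULL for an employee with no territories, hence the guard.
coverage = {
    name: (n, set(regions.split(",")) if regions else set())
    for name, n, regions in conn.execute(query)
}
print(coverage)
```

If every employee turns out to map to exactly one region in the real data, that would suggest `RegionDescription` labels a sales region rather than a home location.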

So now that we have a brief overview of the company, we can start some hypothesis testing.
___
___

## Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?

Thoughts for the actual hypothesis testing: are any products that haven't been ordered discontinued? Also check the dates and times when the orders were placed.
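
As a starting point for the discount question, one option is a two-sided permutation test on the difference in mean order quantity between discounted and full-price order lines. This is only a sketch of one possible approach, not the method settled on in this notebook. The quantities below are made up, `permutation_p_value` is a hypothetical helper, and it uses only the standard library, so no p-value from a t distribution is involved.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up quantities per order line, for illustration only.
discounted = [30, 25, 40, 35, 28, 50]
full_price = [20, 15, 25, 18, 22, 30]

observed_diff, p_value = permutation_p_value(discounted, full_price)
print(f"Observed difference in means: {observed_diff:.2f}")
print(f"Permutation p-value: {p_value:.4f}")
```

On the real data the two lists would come from `OrderDetail`, split on whether `Discount` is zero, and the test could be repeated for each discount level.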