In [3]:
import pandas as pd

df = pd.read_csv("Data/train.csv", parse_dates = ["Order Date", "Ship Date"],
                dtype = {
                    "Ship Mode" : "category",
                    "Customer ID" : "category",
                    "Segment" : "category",
                    "Country" : "category",
                    "City" : "category",
                    "State" : "category",
                    "Postal Code" : "category",
                    "Region" : "category",
                    "Product ID" : "category",
                    "Category" : "category",
                    "Sub-Category" : "category",
                    "Product Name" : "category"
                }
                )

df.head(10)

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368
5,6,CA-2015-115812,09/06/2015,14/06/2015,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,FUR-FU-10001487,Furniture,Furnishings,Eldon Expressions Wood and Plastic Desk Access...,48.86
6,7,CA-2015-115812,09/06/2015,14/06/2015,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-AR-10002833,Office Supplies,Art,Newell 322,7.28
7,8,CA-2015-115812,09/06/2015,14/06/2015,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,TEC-PH-10002275,Technology,Phones,Mitel 5320 IP Phone VoIP phone,907.152
8,9,CA-2015-115812,09/06/2015,14/06/2015,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-BI-10003910,Office Supplies,Binders,DXL Angle-View Binders with Locking Rings by S...,18.504
9,10,CA-2015-115812,09/06/2015,14/06/2015,Standard Class,BH-11710,Brosina Hoffman,Consumer,United States,Los Angeles,California,90032,West,OFF-AP-10002892,Office Supplies,Appliances,Belkin F5C206VTEL 6 Outlet Surge,114.9
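
The preview still shows `Order Date` and `Ship Date` as day-first strings such as `16/06/2017`, which suggests the columns may not have been converted to datetimes on load. Below is a minimal sketch of an explicit conversion, assuming every date in the file follows the `DD/MM/YYYY` pattern. The `sample` frame and the `Days to Ship` column are illustrative stand-ins built from the first rows of the preview, not part of the original notebook.

```python
import pandas as pd

# Stand-in rows copied from the preview; the real frame comes from train.csv.
sample = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

# Assumption: all dates are day-first (DD/MM/YYYY), as "16/06/2017" implies.
for col in ["Order Date", "Ship Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y")

# Hypothetical helper column: days between ordering and shipping.
sample["Days to Ship"] = (sample["Ship Date"] - sample["Order Date"]).dt.days

print(sample.dtypes)
print(sample["Days to Ship"].tolist())
```

Passing an explicit `format` makes the conversion fail loudly on any row that does not match, which is usually preferable to a silent month/day swap.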


After reading in the data, some basic questions are worth investigating first:

Data Cleanup

Figure out the categories: which values exist for Region, Category, Sub-Category, Ship Mode, and Segment?


# Introduction
The focus of this project will be on answering the following questions:
### General Questions
1. How many items are sold on each order?
1. What is the price distribution of items sold?
1. What are the statistics of each main category?
1. What portion of sales does each Category represent?
1. What percentage of each Category is made up by each of its Sub-Categories?
### Product & Category Behavior
1. Which product Categories are sold together? Which product Sub-Categories are sold together?
1. Are there Categories, Sub-Categories, or Products that trend among certain Segments?
1. Does the popularity of products change over time?
### Customer Behavior
1. How does customer Segment impact ordering habits? (Sales, number of orders, order frequency)
1. How many orders come from repeat customers? How many orders do these customers place, how often do they place them, and are the order values consistent? Bin customers by number of orders and check whether they spend different amounts per transaction.
1. What is the revenue breakdown? (One-time vs. repeat customers, Segment)
1. How do sales break down per customer? (Build a distribution of % of customers vs. % of sales, e.g. 5% of customers account for 50% of sales.)
1. What is the average time it takes for a customer to reorder?
1. Is total spend indicative of more orders or just higher value orders?
1. How does Ship Mode relate to order price and customer segment?
1. How do monthly or seasonal sales trends vary across Regions, Categories, and Segments?
### Geography
1. What are the outlying states for purchase amount and frequency?
1. What are the outlying cities for purchase amount and frequency?
1. How are sales broken down by state and then within each state at the city level?
1. If additional time is available, map the sales by state as a whole and by individual cities







# General Exploration
This section covers general exploration and cleanup of the data before moving on to the questions above.

In [8]:
df.select_dtypes(include='category').nunique()  # unique values per categorical column

Ship Mode          4
Customer ID      793
Segment            3
Country            1
City             529
State             49
Postal Code      626
Region             4
Product ID      1861
Category           3
Sub-Category      17
Product Name    1849
dtype: int64

In [79]:
print(df.columns[df.isna().any()])
print(df[['Order ID', 'Customer ID', 'State', 'City', 'Postal Code']][df.isna().any(axis=1)])
#print('\n The count of empty strings is:', (df == '').sum()) #returned 0

# 5 unique customers are affected by the missing Postal Code for Burlington, Vermont.

Index(['Postal Code'], dtype='object')
            Order ID Customer ID    State        City Postal Code
2234  CA-2018-104066    QJ-19255  Vermont  Burlington         NaN
5274  CA-2016-162887    SV-20785  Vermont  Burlington         NaN
8798  US-2017-150140    VM-21685  Vermont  Burlington         NaN
9146  US-2017-165505    CB-12535  Vermont  Burlington         NaN
9147  US-2017-165505    CB-12535  Vermont  Burlington         NaN
9148  US-2017-165505    CB-12535  Vermont  Burlington         NaN
9386  US-2018-127292    RM-19375  Vermont  Burlington         NaN
9387  US-2018-127292    RM-19375  Vermont  Burlington         NaN
9388  US-2018-127292    RM-19375  Vermont  Burlington         NaN
9389  US-2018-127292    RM-19375  Vermont  Burlington         NaN
9741  CA-2016-117086    QJ-19255  Vermont  Burlington         NaN
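
The count of affected customers can be derived rather than read off the printout. The sketch below rebuilds the rows shown above in a stand-in frame (plus one complete row from the preview for contrast) and summarises the missing values per city. The names `rows`, `check`, and `affected` are illustrative only.

```python
import pandas as pd

# Stand-in data: the 11 rows printed above plus one complete row from the preview.
rows = [
    ("CA-2018-104066", "QJ-19255", "Burlington", None),
    ("CA-2016-162887", "SV-20785", "Burlington", None),
    ("US-2017-150140", "VM-21685", "Burlington", None),
    ("US-2017-165505", "CB-12535", "Burlington", None),
    ("US-2017-165505", "CB-12535", "Burlington", None),
    ("US-2017-165505", "CB-12535", "Burlington", None),
    ("US-2018-127292", "RM-19375", "Burlington", None),
    ("US-2018-127292", "RM-19375", "Burlington", None),
    ("US-2018-127292", "RM-19375", "Burlington", None),
    ("US-2018-127292", "RM-19375", "Burlington", None),
    ("CA-2016-117086", "QJ-19255", "Burlington", None),
    ("CA-2017-152156", "CG-12520", "Henderson", "42420"),
]
check = pd.DataFrame(rows, columns=["Order ID", "Customer ID", "City", "Postal Code"])

# Summarise the rows with a missing postal code, grouped by city.
mask = check["Postal Code"].isna()
affected = check.loc[mask].groupby("City").agg(
    rows=("Order ID", "size"),
    orders=("Order ID", "nunique"),
    customers=("Customer ID", "nunique"),
)
print(affected)
```

On the real data the same aggregation would run against `df` directly; because the ID columns are categorical there, `observed=True` in the `groupby` may be needed to avoid rows for unused categories.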


In [12]:
# Do these lists look set up correctly?
for col in ['Ship Mode', 'Segment', 'Region', 'Category', 'Sub-Category']:  # short enough to read in full
    print(col, list(df[col].unique()), '\n')  # converting to a list prints more cleanly

Ship Mode ['Second Class', 'Standard Class', 'First Class', 'Same Day'] 

Segment ['Consumer', 'Corporate', 'Home Office'] 

Region ['South', 'West', 'Central', 'East'] 

Category ['Furniture', 'Office Supplies', 'Technology'] 

Sub-Category ['Bookcases', 'Chairs', 'Labels', 'Tables', 'Storage', 'Furnishings', 'Art', 'Phones', 'Binders', 'Appliances', 'Paper', 'Accessories', 'Envelopes', 'Fasteners', 'Supplies', 'Machines', 'Copiers'] 



None of the lists viewed appear to contain duplicate or malformed entries.
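
A visual check works for short lists, but labels that differ only by case or stray whitespace are easy to miss. The following is a minimal sketch of an automated check, not something the original notebook ran. `near_duplicates` is a hypothetical helper, and the `'same day '` entry is an invented dirty label used only to show what a collision looks like.

```python
import pandas as pd

def near_duplicates(values):
    """Group labels that collide after trimming whitespace and lowercasing."""
    s = pd.Series(list(values), dtype="object")
    key = s.str.strip().str.lower()
    groups = s.groupby(key).agg(list)
    return {k: v for k, v in groups.items() if len(v) > 1}

ship_modes = ['Second Class', 'Standard Class', 'First Class', 'Same Day']

# The labels printed above do not collide with each other.
print(near_duplicates(ship_modes))

# Hypothetical dirty label added to show a detected collision.
print(near_duplicates(ship_modes + ['same day ']))
```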

In [18]:
#What states are missing?

all_states = {
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming"
}

#print(all_states - set(df['State'].unique()))
#print(list(df['State'].unique()))

The data covers the 48 contiguous states plus the District of Columbia (included for shipping purposes), which gives the 49 unique State values counted earlier. Alaska and Hawaii do not appear in the sales data.
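
The comparison behind this observation is a set difference in both directions. Here is a small sketch, assuming a hypothetical helper `compare_labels` and toy inputs. In the notebook the arguments would be `all_states` and `df['State'].unique()`.

```python
def compare_labels(expected, observed):
    """Return (expected but absent, present but unexpected) as sorted lists."""
    expected, observed = set(expected), set(observed)
    return sorted(expected - observed), sorted(observed - expected)

# Toy inputs standing in for all_states and the State column.
expected = {"Alaska", "Hawaii", "Vermont", "Virginia"}
observed = ["Vermont", "Virginia", "District of Columbia"]

missing, unexpected = compare_labels(expected, observed)
print(missing)     # expected labels that never occur in the data
print(unexpected)  # labels in the data that are not in the reference set
```

Checking both directions matters here: the first difference surfaces states with no sales, while the second surfaces values such as the District of Columbia that are not in the reference set.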

In [74]:
# Why are there 1861 unique Product IDs but only 1849 unique Product Names?

pot_duplicate_name = df.groupby('Product Name', observed = False)['Product ID'].nunique()
pot_duplicate_name = pot_duplicate_name[pot_duplicate_name > 1]

print(pot_duplicate_name.sum(), len(pot_duplicate_name))
#print(pot_duplicate_name)


pot_duplicate_id = df.groupby('Product ID', observed = False)['Product Name'].nunique()
pot_duplicate_id = pot_duplicate_id[pot_duplicate_id > 1]

print(pot_duplicate_id.sum(), len(pot_duplicate_id))
#print(df.groupby('Product ID', observed = False)['Product Name'].unique())

60 16
64 32


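There are 16 product names that are each associated with more than one product ID, covering 60 IDs in total, so these names account for 60 - 16 = 44 more IDs than names.
There are 32 product IDs that are each used for more than one name, covering 64 names in total, so these IDs account for 64 - 32 = 32 more names than IDs.

The net surplus of 44 - 32 = 12 IDs explains the gap between the 1861 unique product IDs and the 1849 unique product names.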
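
The same bookkeeping can be checked on a toy table. This is a sketch with invented product names and IDs, and `surplus` is a hypothetical helper. It mirrors the two `groupby(...).nunique()` steps above and confirms that the difference of the two surpluses equals the gap in unique counts.

```python
import pandas as pd

# Invented data: name "A" has three IDs, and ID "P3" is shared by names "B" and "C".
toy = pd.DataFrame({
    "Product Name": ["A", "A", "A", "B", "C", "D"],
    "Product ID":   ["P1", "P2", "P5", "P3", "P3", "P4"],
})

def surplus(frame, key, other):
    """Extra distinct `other` values beyond one per `key`, over keys that have several."""
    counts = frame.groupby(key)[other].nunique()
    counts = counts[counts > 1]
    return int(counts.sum() - len(counts))

extra_ids = surplus(toy, "Product Name", "Product ID")    # IDs beyond one per name
extra_names = surplus(toy, "Product ID", "Product Name")  # names beyond one per ID
gap = toy["Product ID"].nunique() - toy["Product Name"].nunique()

print(extra_ids, extra_names, gap)
```

The relationship is exact rather than coincidental: each surplus equals the number of distinct (name, ID) pairs minus the number of unique names or IDs respectively, so their difference is always the number of unique IDs minus the number of unique names.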
