<a href="https://colab.research.google.com/github/Requenamar3/Data-Mining/blob/main/Real-Time%20Inventory%20and%20Order%20Monitoring%20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Enhancing Inventory Management for Mother's Day Sales**


### **Introduction**

Managing inventory efficiently during peak sales periods like Mother’s Day is crucial, especially for a business without a comprehensive ERP or WMS. Our current process relies on manually analyzing multiple reports exported from ShipStation, which is time-consuming and never reflects stock in real time. This project aims to streamline and automate that process with data-driven insights.

Delayed inventory reporting and manual errors are the two problems this initiative targets. By implementing a real-time inventory and order monitoring system that connects directly to Shopify, we aim to see stock levels and order flows as they change. That visibility will let us predict potential shortages, make informed decisions about inventory distribution, and optimize product availability on our website.


### **Objectives**


*   Develop a real-time monitoring system to track hourly order intake and inventory status.
*   Predict which products are likely to run out of stock first and identify which locations need inventory adjustments.
*   Automate decision-making processes to improve efficiency, reduce costs, and better prepare for future demand.

This project will not only streamline operational processes but also serve as a scalable model for future high-demand periods, improving both customer satisfaction and profitability.

By transforming our inventory management approach, we expect significant cost savings and enhanced strategic decision-making capabilities.
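
The second objective (predicting which products run out first) can be illustrated with a simple "days of cover" calculation. This is only a minimal sketch under stated assumptions: the SKUs, the order rows, and the `on_hand` stock levels are hypothetical, and the Shopify exports used later in this notebook do not contain on-hand inventory at all.

```python
import pandas as pd

# Hypothetical recent order lines (illustrative values, not from the real export)
orders = pd.DataFrame({
    "variant_sku": ["SKU-A", "SKU-A", "SKU-A", "SKU-B", "SKU-B", "SKU-C"],
    "day": pd.to_datetime(["2024-04-01", "2024-04-01", "2024-04-02",
                           "2024-04-01", "2024-04-02", "2024-04-02"]),
    "ordered_item_quantity": [1, 1, 1, 1, 1, 1],
})

# Hypothetical on-hand stock per SKU (assumed to come from a separate inventory source)
on_hand = pd.Series({"SKU-A": 30, "SKU-B": 40, "SKU-C": 5}, name="on_hand")


def days_of_cover(orders, on_hand):
    """Estimate how many days each SKU lasts at its average daily sales rate."""
    n_days = orders["day"].nunique()
    units_sold = orders.groupby("variant_sku")["ordered_item_quantity"].sum()
    daily_rate = units_sold / n_days
    # Smallest value first: these SKUs are expected to sell out soonest
    return (on_hand / daily_rate).rename("days_of_cover").sort_values()


cover = days_of_cover(orders, on_hand)
print(cover)
```

A constant daily rate is a deliberate simplification; around Mother's Day the rate would need to reflect the expected demand peak rather than a plain historical average.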

## Data Collection

### Data Retrieval

In [198]:
# Install third-party packages before importing them
!pip install matplotlib
!pip install mlxtend -qqq
!pip install ydata_profiling
!pip freeze >> requirements.txt

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import datetime as dt
from datetime import datetime
import urllib.request

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
import scipy.stats as stats
import mlxtend.frequent_patterns
import mlxtend.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# pandas_profiling was renamed to ydata_profiling, the package installed above
from ydata_profiling import ProfileReport
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

# Remove the restriction on Jupyter that limits the columns displayed (the ... in the middle)
pd.set_option('display.max_columns', None)
# Docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html#

# Pretty Display of variables.  for instance, you can call df.head() and df.tail() in the same cell and BOTH display w/o print
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# List all magic commands. Run a line magic with a % prefix, e.g. %env
%lsmagic
# %env  -- list environment variables
# %%time  -- reports how long a cell took to run
# %%timeit -- runs a cell repeatedly and reports the average time per run (can be LONG)
# %pdb -- python debugger

# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required; compare the parsed numbers, not the raw strings
import sklearn
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)

print(np.__version__)
print(sklearn.__version__)






Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %shell  %store  %sx  %system  %tb  %tensorflow_version  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%bigquery  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%late

1.25.2
1.2.2


In [189]:
import requests
from io import BytesIO
from zipfile import ZipFile
import pandas as pd

# URL where the CSV files are located within a ZIP file
url = "https://github.com/Requenamar3/Data-Mining/blob/main/sales_2023-10-01_2024-04-25.zip?raw=true"

# Send an HTTP request to the URL and fail fast on an error status
response = requests.get(url)
response.raise_for_status()

# Open the ZIP file
zip_file = ZipFile(BytesIO(response.content))

# List all the file names in the zip
csv_file_names = [file for file in zip_file.namelist() if file.endswith('.csv')]
print("CSV files found in the ZIP:", csv_file_names)

# Function to get columns from a CSV file given its path within the zip
def get_columns_from_csv(zip_file, file_name):
    with zip_file.open(file_name) as csv_file:
        df = pd.read_csv(csv_file, nrows=1)  # Read only the first row to get the columns
        return df.columns.tolist()

# Now, you can call get_columns_from_csv for each CSV file
for file_name in csv_file_names:
    columns = get_columns_from_csv(zip_file, file_name)
    print(f"Columns in {file_name}:")
    print(columns)

# Don't forget to close the zip file when done
zip_file.close()




CSV files found in the ZIP: ['sales_2023-10-01_2024-04-25.csv', 'shopify_shipping_labels_2023-10-01_2024-04-25.csv']
Columns in sales_2023-10-01_2024-04-25.csv:
['month', 'financial_status', 'order_id', 'order_name', 'variant_sku', 'variant_id', 'variant_title', 'shipping_postal_code', 'shipping_region', 'product_price', 'product_id', 'customer_id', 'customer_type', 'customer_cohort_week', 'customer_cohort_month', 'market_name', 'product_title', 'product_type', 'adjustment', 'sale_kind', 'sale_line_type', 'billing_postal_code', 'purchase_option', 'cost_tracked', 'total_sales', 'ordered_item_quantity', 'orders', 'net_quantity', 'total_cost', 'shipping', 'taxes']
Columns in shopify_shipping_labels_2023-10-01_2024-04-25.csv:
['day', 'order_number', 'destination_country', 'carrier', 'shipping_service', 'tracking_number', 'label_cost_2', 'label_cost_savings_2', 'destination_postal_code', 'origin_postal_code', 'label_cost_savings', 'label_cost']


In [199]:
from zipfile import ZipFile

# URL to the raw ZIP file on GitHub
url = "https://github.com/Requenamar3/Data-Mining/blob/main/sales_2023-10-01_2024-04-25.zip?raw=true"

# Send an HTTP request to the URL
response = requests.get(url)
response.raise_for_status()  # This will raise an HTTPError if the request returned an unsuccessful status code

# Open the ZIP file
with ZipFile(BytesIO(response.content)) as zip_file:
    # Extract CSV files into the current working directory
    zip_file.extractall()

    # Now you can read the extracted files assuming these are the correct file names
    df_sales = pd.read_csv('sales_2023-10-01_2024-04-25.csv')
    df_shipping_labels = pd.read_csv('shopify_shipping_labels_2023-10-01_2024-04-25.csv')

# Join the two frames on the order identifier: 'order_name' in the sales export
# and 'order_number' in the shipping-label export hold the same values (e.g. '#802025')
bbox = pd.merge(df_sales, df_shipping_labels, left_on='order_name', right_on='order_number')

# bbox now contains the columns of both datasets for orders present in both
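
`pd.merge` defaults to an inner join, so sales rows without a shipping label are dropped silently, and a sale row is repeated once for every label that shares its order number. The sketch below, on small made-up frames (the order numbers and SKUs are illustrative, not taken from the export), shows one way to inspect both effects before trusting the merged row count.

```python
import pandas as pd

# Toy stand-ins for df_sales and df_shipping_labels (hypothetical values)
sales = pd.DataFrame({
    "order_name": ["#1001", "#1002", "#1003"],
    "variant_sku": ["SKU-A", "SKU-B", "SKU-A"],
})
labels = pd.DataFrame({
    "order_number": ["#1001", "#1003", "#1003"],
    "carrier": ["UPS", "UPS", "UPS"],
})

# A left join with indicator=True keeps every sale and flags where it matched
merged = pd.merge(sales, labels, left_on="order_name", right_on="order_number",
                  how="left", indicator=True)

# Sales that found no shipping label (an inner join would drop these)
unmatched = merged[merged["_merge"] == "left_only"]

# Orders with more than one label (these multiply the sale rows)
labels_per_order = labels["order_number"].value_counts()
multi_label_orders = labels_per_order[labels_per_order > 1]

print(len(sales), len(merged), len(unmatched))  # 3 4 1
```

Whether unmatched or multi-label orders should be kept depends on the analysis; the point is only to make the effect of the join visible.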




In [200]:
bbox.sample(5).T



Unnamed: 0,105698,85961,101075,51889,83580
month,2024-04,2024-03,2024-04,2024-01,2024-03
financial_status,paid,paid,paid,paid,paid
order_id,5640894546035,5580525699187,5647681388659,5512018886771,5616917545075
order_name,#892933,#877833,#895132,#854357,#887007
variant_sku,SQ1726938,BB-PET-SAFE,SQ9178530,SQ9178530,
variant_id,17097311191155,39407977103475,40953324142707,40808365752435,0
variant_title,6 Month Prepay,Month to Month,Month to Month,12 Month Prepay,
shipping_postal_code,01845,80621,63119,02138,11763
shipping_region,Massachusetts,Colorado,Missouri,Massachusetts,New York
product_price,0.0,49.99,59.99,0.0,0.0


In [201]:
# Keep only paid orders (successful transactions) with a non-zero product price,
# then drop rows with a missing variant_sku, which represent coupons

bbox = bbox[
    (bbox['financial_status'] == 'paid') &
    (bbox['product_price'] != 0.0)
].dropna(subset=['variant_sku'])


bbox




Unnamed: 0,month,financial_status,order_id,order_name,variant_sku,variant_id,variant_title,shipping_postal_code,shipping_region,product_price,product_id,customer_id,customer_type,customer_cohort_week,customer_cohort_month,market_name,product_title,product_type,adjustment,sale_kind,sale_line_type,billing_postal_code,purchase_option,cost_tracked,total_sales,ordered_item_quantity,orders,net_quantity,total_cost,shipping,taxes,day,order_number,destination_country,carrier,shipping_service,tracking_number,label_cost_2,label_cost_savings_2,destination_postal_code,origin_postal_code,label_cost_savings,label_cost
20,2023-10,paid,5304790319219,#802025,SQ9178530,17097313222771,Month to Month,60120,Illinois,44.99,1734774390899,5817602441331,Returning,2022-W27,2022-07,United States,Bloomsy Original,,No,order,product,60120,One-time,No,49.49,1,1,1,0.0,0.0,4.50,2023-10-10,#802025,United States,UPS,UPS® Ground,1ZC707H10326982108,7.19,12.35,60120,60131,12.35,7.19
22,2023-10,paid,5333373747315,#809296,SQ8862610,17097312436339,6 Month Prepay,53097,Wisconsin,324.99,1734774128755,6543958081651,Returning,2023-W12,2023-03,United States,Bloomsy Premium,,No,order,product,53097,One-time,No,342.86,1,1,1,0.0,0.0,17.87,2023-10-30,#809296,United States,UPS,UPS® Ground,1ZC707H10326098369,7.19,12.35,53097,60131,12.35,7.19
35,2023-10,paid,5299724877939,#800219,SQ8862610,17097312370803,Month to Month,60647,Illinois,54.99,1734774128755,3047430324339,Returning,2020-W19,2020-05,United States,Bloomsy Premium,,No,order,product,60647,One-time,No,60.49,1,1,1,0.0,0.0,5.50,2023-10-09,#800219,United States,UPS,UPS® Ground,1ZC707H10323539101,7.19,12.71,60647,60131,12.71,7.19
41,2023-10,paid,5333376630899,#809356,SQ1726938,17097311060083,Month to Month,53024,Wisconsin,49.99,1734774063219,5245846028403,Returning,2021-W31,2021-08,United States,Bloomsy Deluxe,,No,order,product,13027,One-time,No,52.74,1,1,1,0.0,0.0,2.75,2023-10-30,#809356,United States,UPS,UPS® Ground,1ZC707H10305950799,7.19,12.35,53024,60131,12.35,7.19
43,2023-10,paid,5308156182643,#803485,SQ3100795,17097313026163,Week to Week,92024,California,44.99,1734774325363,5514399678579,Returning,2022-W03,2022-01,United States,Bloomsy Weekly,,No,order,product,92024,One-time,No,48.25,1,1,1,0.0,0.0,3.26,2023-10-16,#803485,United States,UPS,UPS® Ground,1ZC707H10302629080,7.19,12.71,92024,90058,12.71,7.19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112002,2024-04,paid,5644211093619,#894524,SQ1726938,17097311060083,Month to Month,40207,Kentucky,49.99,1734774063219,3017732194419,Returning,2020-W18,2020-04,United States,Bloomsy Deluxe,,No,order,product,19130,One-time,No,52.99,1,1,1,0.0,0.0,3.00,2024-04-15,#894524,United States,UPS,UPS® Ground,1ZC707H10321913085,7.61,13.20,40207,60131,13.20,7.61
112005,2024-04,paid,5649313333363,#895464,SQ9178530,17097313222771,Month to Month,32211,Florida,44.99,1734774390899,5879118299251,Returning,2022-W29,2022-07,United States,Bloomsy Original,,No,order,product,32211,One-time,No,48.36,1,1,1,0.0,0.0,3.37,2024-04-11,#895464,United States,UPS,UPS® Ground,1ZC707H10318589764,7.67,15.03,32211,33166,15.03,7.67
112007,2024-04,paid,5671479410803,#900450,SQ9178530,40953324142707,Month to Month,98012,Washington,59.99,7251668664435,7340112216179,Returning,2024-W12,2024-03,United States,Bloomsy Original - Month to Month,,No,order,product,98258,Subscription,No,63.89,1,1,1,0.0,0.0,3.90,2024-04-24,#900450,United States,UPS,UPS® Ground,1ZC707H10300373654,7.40,10.21,98012,98528,10.21,7.40
112010,2024-04,paid,5642329784435,#893514,SQ9178530,17097313222771,Month to Month,44333,Ohio,39.99,1734774390899,2803806896243,Returning,2019-W52,2019-12,United States,Bloomsy Original,,No,order,product,94117,One-time,No,42.89,1,1,1,0.0,0.0,2.90,2024-04-05,#893514,United States,UPS,UPS® Ground,1ZC707H10325023164,7.61,12.72,44333,43217,12.72,7.61


In [202]:

# Convert 'month' and 'customer_cohort_month' to datetime
bbox['month'] = pd.to_datetime(bbox['month'], format='%Y-%m')
bbox['customer_cohort_month'] = pd.to_datetime(bbox['customer_cohort_month'], format='%Y-%m')
# 'day' is already ISO formatted (YYYY-MM-DD), so no format hint is needed
bbox['day'] = pd.to_datetime(bbox['day'])


In [203]:
# Add a column for the two-letter state code
# Dictionary mapping state name to state code
state_codes = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA',
    'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA',
    'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA',
    'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
    'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH',
    'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}

bbox['state_code'] = bbox['shipping_region'].map(state_codes)
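
`Series.map` returns NaN for any region name that is missing from the dictionary, so it is worth listing the names that failed to map. The sketch below uses a deliberately incomplete toy dictionary and made-up rows; `state_codes_demo` and `regions` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Deliberately incomplete mapping, for illustration only
state_codes_demo = {"Illinois": "IL", "Wisconsin": "WI"}

regions = pd.Series(
    ["Illinois", "District Of Columbia", "Wisconsin", "District Of Columbia"],
    name="shipping_region",
)

codes = regions.map(state_codes_demo)

# Region names that produced NaN, with how many rows each one affects
unmapped = regions[codes.isna()].value_counts()
print(unmapped)
```

Running the same check against `bbox['shipping_region']` would surface names such as 'District Of Columbia', which the 50-state dictionary above does not cover.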




In [204]:
bbox.head()



Unnamed: 0,month,financial_status,order_id,order_name,variant_sku,variant_id,variant_title,shipping_postal_code,shipping_region,product_price,product_id,customer_id,customer_type,customer_cohort_week,customer_cohort_month,market_name,product_title,product_type,adjustment,sale_kind,sale_line_type,billing_postal_code,purchase_option,cost_tracked,total_sales,ordered_item_quantity,orders,net_quantity,total_cost,shipping,taxes,day,order_number,destination_country,carrier,shipping_service,tracking_number,label_cost_2,label_cost_savings_2,destination_postal_code,origin_postal_code,label_cost_savings,label_cost,state_code
20,2023-10-01,paid,5304790319219,#802025,SQ9178530,17097313222771,Month to Month,60120,Illinois,44.99,1734774390899,5817602441331,Returning,2022-W27,2022-07-01,United States,Bloomsy Original,,No,order,product,60120,One-time,No,49.49,1,1,1,0.0,0.0,4.5,2023-10-10,#802025,United States,UPS,UPS® Ground,1ZC707H10326982108,7.19,12.35,60120,60131,12.35,7.19,IL
22,2023-10-01,paid,5333373747315,#809296,SQ8862610,17097312436339,6 Month Prepay,53097,Wisconsin,324.99,1734774128755,6543958081651,Returning,2023-W12,2023-03-01,United States,Bloomsy Premium,,No,order,product,53097,One-time,No,342.86,1,1,1,0.0,0.0,17.87,2023-10-30,#809296,United States,UPS,UPS® Ground,1ZC707H10326098369,7.19,12.35,53097,60131,12.35,7.19,WI
35,2023-10-01,paid,5299724877939,#800219,SQ8862610,17097312370803,Month to Month,60647,Illinois,54.99,1734774128755,3047430324339,Returning,2020-W19,2020-05-01,United States,Bloomsy Premium,,No,order,product,60647,One-time,No,60.49,1,1,1,0.0,0.0,5.5,2023-10-09,#800219,United States,UPS,UPS® Ground,1ZC707H10323539101,7.19,12.71,60647,60131,12.71,7.19,IL
41,2023-10-01,paid,5333376630899,#809356,SQ1726938,17097311060083,Month to Month,53024,Wisconsin,49.99,1734774063219,5245846028403,Returning,2021-W31,2021-08-01,United States,Bloomsy Deluxe,,No,order,product,13027,One-time,No,52.74,1,1,1,0.0,0.0,2.75,2023-10-30,#809356,United States,UPS,UPS® Ground,1ZC707H10305950799,7.19,12.35,53024,60131,12.35,7.19,WI
43,2023-10-01,paid,5308156182643,#803485,SQ3100795,17097313026163,Week to Week,92024,California,44.99,1734774325363,5514399678579,Returning,2022-W03,2022-01-01,United States,Bloomsy Weekly,,No,order,product,92024,One-time,No,48.25,1,1,1,0.0,0.0,3.26,2023-10-16,#803485,United States,UPS,UPS® Ground,1ZC707H10302629080,7.19,12.71,92024,90058,12.71,7.19,CA


In [208]:
bbox.columns



Index(['month', 'financial_status', 'order_id', 'order_name', 'variant_sku', 'variant_id', 'variant_title', 'shipping_postal_code', 'shipping_region', 'product_price', 'product_id', 'customer_id', 'customer_type', 'customer_cohort_week', 'customer_cohort_month', 'market_name', 'product_title', 'product_type', 'adjustment', 'sale_kind', 'sale_line_type', 'billing_postal_code', 'purchase_option', 'cost_tracked', 'total_sales', 'ordered_item_quantity', 'orders', 'net_quantity', 'total_cost', 'shipping', 'taxes', 'day', 'order_number', 'destination_country', 'carrier', 'shipping_service', 'tracking_number', 'label_cost_2', 'label_cost_savings_2', 'destination_postal_code', 'origin_postal_code', 'label_cost_savings', 'label_cost', 'state_code'], dtype='object')

In [210]:
# 'District Of Columbia' is not in state_codes, so .map() left those rows as NaN.
# Use .loc with a boolean mask to fill 'state_code' with 'DC' for exactly those rows.
is_dc = bbox['shipping_region'] == 'District Of Columbia'
bbox.loc[is_dc, 'state_code'] = bbox.loc[is_dc, 'state_code'].fillna('DC')




In [211]:
# Show every row that has a NaN in at least one column
nan_records = bbox[bbox.isna().any(axis=1)]
nan_records



Unnamed: 0,month,financial_status,order_id,order_name,variant_sku,variant_id,variant_title,shipping_postal_code,shipping_region,product_price,product_id,customer_id,customer_type,customer_cohort_week,customer_cohort_month,market_name,product_title,product_type,adjustment,sale_kind,sale_line_type,billing_postal_code,purchase_option,cost_tracked,total_sales,ordered_item_quantity,orders,net_quantity,total_cost,shipping,taxes,day,order_number,destination_country,carrier,shipping_service,tracking_number,label_cost_2,label_cost_savings_2,destination_postal_code,origin_postal_code,label_cost_savings,label_cost,state_code
20,2023-10-01,paid,5304790319219,#802025,SQ9178530,17097313222771,Month to Month,60120,Illinois,44.99,1734774390899,5817602441331,Returning,2022-W27,2022-07-01,United States,Bloomsy Original,,No,order,product,60120,One-time,No,49.49,1,1,1,0.0,0.0,4.50,2023-10-10,#802025,United States,UPS,UPS® Ground,1ZC707H10326982108,7.19,12.35,60120,60131,12.35,7.19,IL
22,2023-10-01,paid,5333373747315,#809296,SQ8862610,17097312436339,6 Month Prepay,53097,Wisconsin,324.99,1734774128755,6543958081651,Returning,2023-W12,2023-03-01,United States,Bloomsy Premium,,No,order,product,53097,One-time,No,342.86,1,1,1,0.0,0.0,17.87,2023-10-30,#809296,United States,UPS,UPS® Ground,1ZC707H10326098369,7.19,12.35,53097,60131,12.35,7.19,WI
35,2023-10-01,paid,5299724877939,#800219,SQ8862610,17097312370803,Month to Month,60647,Illinois,54.99,1734774128755,3047430324339,Returning,2020-W19,2020-05-01,United States,Bloomsy Premium,,No,order,product,60647,One-time,No,60.49,1,1,1,0.0,0.0,5.50,2023-10-09,#800219,United States,UPS,UPS® Ground,1ZC707H10323539101,7.19,12.71,60647,60131,12.71,7.19,IL
41,2023-10-01,paid,5333376630899,#809356,SQ1726938,17097311060083,Month to Month,53024,Wisconsin,49.99,1734774063219,5245846028403,Returning,2021-W31,2021-08-01,United States,Bloomsy Deluxe,,No,order,product,13027,One-time,No,52.74,1,1,1,0.0,0.0,2.75,2023-10-30,#809356,United States,UPS,UPS® Ground,1ZC707H10305950799,7.19,12.35,53024,60131,12.35,7.19,WI
43,2023-10-01,paid,5308156182643,#803485,SQ3100795,17097313026163,Week to Week,92024,California,44.99,1734774325363,5514399678579,Returning,2022-W03,2022-01-01,United States,Bloomsy Weekly,,No,order,product,92024,One-time,No,48.25,1,1,1,0.0,0.0,3.26,2023-10-16,#803485,United States,UPS,UPS® Ground,1ZC707H10302629080,7.19,12.71,92024,90058,12.71,7.19,CA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112002,2024-04-01,paid,5644211093619,#894524,SQ1726938,17097311060083,Month to Month,40207,Kentucky,49.99,1734774063219,3017732194419,Returning,2020-W18,2020-04-01,United States,Bloomsy Deluxe,,No,order,product,19130,One-time,No,52.99,1,1,1,0.0,0.0,3.00,2024-04-15,#894524,United States,UPS,UPS® Ground,1ZC707H10321913085,7.61,13.20,40207,60131,13.20,7.61,KY
112005,2024-04-01,paid,5649313333363,#895464,SQ9178530,17097313222771,Month to Month,32211,Florida,44.99,1734774390899,5879118299251,Returning,2022-W29,2022-07-01,United States,Bloomsy Original,,No,order,product,32211,One-time,No,48.36,1,1,1,0.0,0.0,3.37,2024-04-11,#895464,United States,UPS,UPS® Ground,1ZC707H10318589764,7.67,15.03,32211,33166,15.03,7.67,FL
112007,2024-04-01,paid,5671479410803,#900450,SQ9178530,40953324142707,Month to Month,98012,Washington,59.99,7251668664435,7340112216179,Returning,2024-W12,2024-03-01,United States,Bloomsy Original - Month to Month,,No,order,product,98258,Subscription,No,63.89,1,1,1,0.0,0.0,3.90,2024-04-24,#900450,United States,UPS,UPS® Ground,1ZC707H10300373654,7.40,10.21,98012,98528,10.21,7.40,WA
112010,2024-04-01,paid,5642329784435,#893514,SQ9178530,17097313222771,Month to Month,44333,Ohio,39.99,1734774390899,2803806896243,Returning,2019-W52,2019-12-01,United States,Bloomsy Original,,No,order,product,94117,One-time,No,42.89,1,1,1,0.0,0.0,2.90,2024-04-05,#893514,United States,UPS,UPS® Ground,1ZC707H10325023164,7.61,12.72,44333,43217,12.72,7.61,OH


In [219]:
# Count missing values per column
bbox.isnull().sum()



month                      0
variant_sku                0
product_price              0
customer_id                0
customer_cohort_week       0
customer_cohort_month      0
billing_postal_code        1
ordered_item_quantity      0
day                        0
order_number               0
destination_postal_code    0
origin_postal_code         0
tracking_number            0
state_code                 0
shipping_region            0
dtype: int64

In [213]:
# Select the columns needed for the rest of the analysis
columns_to_select = [
    'month', 'variant_sku', 'product_price',
    'customer_id', 'customer_cohort_week', 'customer_cohort_month',
    'billing_postal_code', 'ordered_item_quantity',
    'day', 'order_number', 'destination_postal_code', 'origin_postal_code',
    'tracking_number', 'state_code','shipping_region']

# Overwrite bbox with a copy that holds only the selected columns
bbox = bbox[columns_to_select].copy()

# Display the first few rows of the new DataFrame to verify
print(bbox.head())


        month variant_sku  product_price    customer_id customer_cohort_week customer_cohort_month billing_postal_code  ordered_item_quantity        day order_number destination_postal_code  origin_postal_code     tracking_number state_code shipping_region
20 2023-10-01   SQ9178530          44.99  5817602441331             2022-W27            2022-07-01               60120                      1 2023-10-10      #802025                   60120               60131  1ZC707H10326982108         IL        Illinois
22 2023-10-01   SQ8862610         324.99  6543958081651             2023-W12            2023-03-01               53097                      1 2023-10-30      #809296                   53097               60131  1ZC707H10326098369         WI       Wisconsin
35 2023-10-01   SQ8862610          54.99  3047430324339             2020-W19            2020-05-01               60647                      1 2023-10-09      #800219                   60647               60131  1ZC707H10323539101



In [215]:
bbox.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26844 entries, 20 to 112012
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   month                    26844 non-null  datetime64[ns]
 1   variant_sku              26844 non-null  object        
 2   product_price            26844 non-null  float64       
 3   customer_id              26844 non-null  int64         
 4   customer_cohort_week     26844 non-null  object        
 5   customer_cohort_month    26844 non-null  datetime64[ns]
 6   billing_postal_code      26843 non-null  object        
 7   ordered_item_quantity    26844 non-null  int64         
 8   day                      26844 non-null  datetime64[ns]
 9   order_number             26844 non-null  object        
 10  destination_postal_code  26844 non-null  object        
 11  origin_postal_code       26844 non-null  int64         
 12  tracking_number          26844 non-



In [217]:
# Create a ProfileReport object from the bbox dataframe.
profile = ProfileReport(bbox, title="check order placement speed", explorative=True)



In [218]:
profile



Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]



In [220]:
# Remove rows with any missing value (one row lacks billing_postal_code)
bbox = bbox.dropna()

bbox




Unnamed: 0,month,variant_sku,product_price,customer_id,customer_cohort_week,customer_cohort_month,billing_postal_code,ordered_item_quantity,day,order_number,destination_postal_code,origin_postal_code,tracking_number,state_code,shipping_region
20,2023-10-01,SQ9178530,44.99,5817602441331,2022-W27,2022-07-01,60120,1,2023-10-10,#802025,60120,60131,1ZC707H10326982108,IL,Illinois
22,2023-10-01,SQ8862610,324.99,6543958081651,2023-W12,2023-03-01,53097,1,2023-10-30,#809296,53097,60131,1ZC707H10326098369,WI,Wisconsin
35,2023-10-01,SQ8862610,54.99,3047430324339,2020-W19,2020-05-01,60647,1,2023-10-09,#800219,60647,60131,1ZC707H10323539101,IL,Illinois
41,2023-10-01,SQ1726938,49.99,5245846028403,2021-W31,2021-08-01,13027,1,2023-10-30,#809356,53024,60131,1ZC707H10305950799,WI,Wisconsin
43,2023-10-01,SQ3100795,44.99,5514399678579,2022-W03,2022-01-01,92024,1,2023-10-16,#803485,92024,90058,1ZC707H10302629080,CA,California
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112002,2024-04-01,SQ1726938,49.99,3017732194419,2020-W18,2020-04-01,19130,1,2024-04-15,#894524,40207,60131,1ZC707H10321913085,KY,Kentucky
112005,2024-04-01,SQ9178530,44.99,5879118299251,2022-W29,2022-07-01,32211,1,2024-04-11,#895464,32211,33166,1ZC707H10318589764,FL,Florida
112007,2024-04-01,SQ9178530,59.99,7340112216179,2024-W12,2024-03-01,98258,1,2024-04-24,#900450,98012,98528,1ZC707H10300373654,WA,Washington
112010,2024-04-01,SQ9178530,39.99,2803806896243,2019-W52,2019-12-01,94117,1,2024-04-05,#893514,44333,43217,1ZC707H10325023164,OH,Ohio


# Customer Lifetime Value (CLV)


In [221]:
# Package used for customer lifetime value modeling
!pip install lifetimes
import lifetimes
import seaborn as sns
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_frequency_recency_matrix
# Set the maximum number of rows to display to 500.
pd.set_option('display.max_rows', 500)
# Set the maximum number of columns to display to 500.
pd.set_option('display.max_columns', 500)
# Set the width of the display in characters to 1000.
pd.set_option('display.width', 1000)





In [222]:
bbox.isnull().sum()



month                      0
variant_sku                0
product_price              0
customer_id                0
customer_cohort_week       0
customer_cohort_month      0
billing_postal_code        0
ordered_item_quantity      0
day                        0
order_number               0
destination_postal_code    0
origin_postal_code         0
tracking_number            0
state_code                 0
shipping_region            0
dtype: int64

In [223]:
bbox.describe()



Unnamed: 0,month,product_price,customer_id,customer_cohort_month,ordered_item_quantity,day,origin_postal_code
count,26843,26843.0,26843.0,26843,26843.0,26843,26843.0
mean,2023-12-30 10:15:34.768841216,107.089768,5401662000000.0,2022-03-14 15:57:24.428714752,0.999963,2024-01-19 13:51:55.851432704,41772.306262
min,2023-10-01 00:00:00,1.99,1183570000000.0,2015-12-01 00:00:00,0.0,2023-10-02 00:00:00,1887.0
25%,2023-12-01 00:00:00,44.99,3553599000000.0,2020-12-01 00:00:00,1.0,2023-12-12 00:00:00,8085.0
50%,2024-01-01 00:00:00,54.99,5907361000000.0,2022-08-01 00:00:00,1.0,2024-01-18 00:00:00,33166.0
75%,2024-02-01 00:00:00,69.99,6621971000000.0,2023-05-01 00:00:00,1.0,2024-03-04 00:00:00,75041.0
max,2024-04-01 00:00:00,1184.88,7426236000000.0,2024-04-01 00:00:00,1.0,2024-04-24 00:00:00,98528.0
std,,142.666912,1688285000000.0,,0.006104,,31853.404411


In [224]:
def find_boundaries(df, variable, q1=0.05, q2=0.95):
    """Return the (upper, lower) quantile boundaries of df[variable]."""
    lower_boundary = df[variable].quantile(q1)  # lower quantile
    upper_boundary = df[variable].quantile(q2)  # upper quantile
    return upper_boundary, lower_boundary


def capping_outliers(df, variable):
    """Cap df[variable] in place at its 5th and 95th percentiles."""
    upper_boundary, lower_boundary = find_boundaries(df, variable)
    df[variable] = np.where(df[variable] > upper_boundary, upper_boundary,
                            np.where(df[variable] < lower_boundary, lower_boundary, df[variable]))
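
The same quantile capping (often called winsorizing) can also be written with `Series.clip`. The sketch below is an alternative formulation on made-up prices, not the notebook's own helper; `cap_with_clip` is a hypothetical name. It returns a new Series instead of modifying the frame in place.

```python
import pandas as pd


def cap_with_clip(series, q_low=0.05, q_high=0.95):
    """Return a copy of `series` capped at its q_low and q_high quantiles."""
    lower = series.quantile(q_low)
    upper = series.quantile(q_high)
    return series.clip(lower=lower, upper=upper)


# Illustrative prices with one very low and one very high value
prices = pd.Series([1.99, 44.99, 49.99, 54.99, 59.99, 69.99, 1184.88])
capped = cap_with_clip(prices)

print(prices.max(), "->", capped.max())
print(prices.min(), "->", capped.min())
```

Returning a new Series makes it explicit which frame receives the capped values, for example `bbox['product_price'] = cap_with_clip(bbox['product_price'])`.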



In [225]:
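# bbox1 is not defined earlier in this notebook; assuming it is a working copy of bbox,
# the capping below leaves bbox itself unchanged (as the next describe() output shows)
bbox1 = bbox.copy()
capping_outliers(bbox1, 'product_price')
capping_outliers(bbox1, 'ordered_item_quantity')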



In [226]:
bbox.describe()



Unnamed: 0,month,product_price,customer_id,customer_cohort_month,ordered_item_quantity,day,origin_postal_code
count,26843,26843.0,26843.0,26843,26843.0,26843,26843.0
mean,2023-12-30 10:15:34.768841216,107.089768,5401662000000.0,2022-03-14 15:57:24.428714752,0.999963,2024-01-19 13:51:55.851432704,41772.306262
min,2023-10-01 00:00:00,1.99,1183570000000.0,2015-12-01 00:00:00,0.0,2023-10-02 00:00:00,1887.0
25%,2023-12-01 00:00:00,44.99,3553599000000.0,2020-12-01 00:00:00,1.0,2023-12-12 00:00:00,8085.0
50%,2024-01-01 00:00:00,54.99,5907361000000.0,2022-08-01 00:00:00,1.0,2024-01-18 00:00:00,33166.0
75%,2024-02-01 00:00:00,69.99,6621971000000.0,2023-05-01 00:00:00,1.0,2024-03-04 00:00:00,75041.0
max,2024-04-01 00:00:00,1184.88,7426236000000.0,2024-04-01 00:00:00,1.0,2024-04-24 00:00:00,98528.0
std,,142.666912,1688285000000.0,,0.006104,,31853.404411


In [227]:
bbox['Total_Price'] = bbox['product_price'] * bbox['ordered_item_quantity']



In [228]:
bbox.columns



Index(['month', 'variant_sku', 'product_price', 'customer_id', 'customer_cohort_week', 'customer_cohort_month', 'billing_postal_code', 'ordered_item_quantity', 'day', 'order_number', 'destination_postal_code', 'origin_postal_code', 'tracking_number', 'state_code', 'shipping_region', 'Total_Price'], dtype='object')

In [229]:
bbox.head()



Unnamed: 0,month,variant_sku,product_price,customer_id,customer_cohort_week,customer_cohort_month,billing_postal_code,ordered_item_quantity,day,order_number,destination_postal_code,origin_postal_code,tracking_number,state_code,shipping_region,Total_Price
20,2023-10-01,SQ9178530,44.99,5817602441331,2022-W27,2022-07-01,60120,1,2023-10-10,#802025,60120,60131,1ZC707H10326982108,IL,Illinois,44.99
22,2023-10-01,SQ8862610,324.99,6543958081651,2023-W12,2023-03-01,53097,1,2023-10-30,#809296,53097,60131,1ZC707H10326098369,WI,Wisconsin,324.99
35,2023-10-01,SQ8862610,54.99,3047430324339,2020-W19,2020-05-01,60647,1,2023-10-09,#800219,60647,60131,1ZC707H10323539101,IL,Illinois,54.99
41,2023-10-01,SQ1726938,49.99,5245846028403,2021-W31,2021-08-01,13027,1,2023-10-30,#809356,53024,60131,1ZC707H10305950799,WI,Wisconsin,49.99
43,2023-10-01,SQ3100795,44.99,5514399678579,2022-W03,2022-01-01,92024,1,2023-10-16,#803485,92024,90058,1ZC707H10302629080,CA,California,44.99


In [232]:
unique=bbox['customer_id']
unique




20        5817602441331
22        6543958081651
35        3047430324339
41        5245846028403
43        5514399678579
              ...      
112002    3017732194419
112005    5879118299251
112007    7340112216179
112010    2803806896243
112012    1183739412595
Name: customer_id, Length: 26843, dtype: int64

In [234]:
# Count rows whose customer_id already appeared in an earlier row
num_duplicated_customer_ids = bbox['customer_id'].duplicated().sum()

# Display the number of duplicated customer IDs
print("Number of duplicated customer IDs:", num_duplicated_customer_ids)

# Collect the customer_id values that occur more than once
duplicated_customer_ids = bbox.loc[bbox['customer_id'].duplicated(), 'customer_id'].unique()


Number of duplicated customer IDs: 16915
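
`duplicated().sum()` counts every row after a customer's first one, so it equals the number of rows minus the number of distinct customers (here 26843 − 16915 = 9928 distinct customers). A per-customer order count is often easier to read. The sketch below uses made-up IDs and order numbers purely for illustration.

```python
import pandas as pd

# Toy order lines (hypothetical customer IDs and order numbers)
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_number": ["#1", "#2", "#3", "#4", "#5", "#6"],
})

# Rows beyond each customer's first row
n_duplicated_rows = orders["customer_id"].duplicated().sum()

# Distinct orders per customer, and the customers who ordered more than once
orders_per_customer = orders.groupby("customer_id")["order_number"].nunique()
repeat_customers = orders_per_customer[orders_per_customer > 1]

print(n_duplicated_rows)           # 3
print(repeat_customers.to_dict())  # {1: 2, 3: 3}
```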

In [230]:
# Build the per-customer summary used for CLV modeling.
# Arguments: customer ID column, transaction date column, then the monetary value column
clv = lifetimes.utils.summary_data_from_transaction_data(
    bbox, 'customer_id', 'day',
    monetary_value_col='Total_Price',
    observation_period_end='2024-04-23')
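
To make the summary columns concrete, the sketch below computes frequency, recency, T, and monetary value with plain pandas on a few made-up transactions. It is an approximation of what the `lifetimes` helper is understood to produce with daily granularity, written under that assumption; it is not the library's implementation, and `rfm_summary` is a hypothetical name.

```python
import pandas as pd


def rfm_summary(tx, customer_col, date_col, value_col, observation_period_end):
    """Per-customer frequency, recency, T (in days) and mean repeat-purchase value."""
    end = pd.Timestamp(observation_period_end)
    tx = tx[tx[date_col] <= end]

    # Collapse to one row per customer per day, summing that day's spend
    daily = tx.groupby([customer_col, date_col], as_index=False)[value_col].sum()
    grouped = daily.groupby(customer_col)
    first = grouped[date_col].min()
    last = grouped[date_col].max()

    summary = pd.DataFrame({
        "frequency": grouped[date_col].count() - 1,  # repeat purchase days
        "recency": (last - first).dt.days,           # first to last purchase
        "T": (end - first).dt.days,                  # first purchase to period end
    })

    # Average daily spend on repeat days only; 0 for one-time customers
    is_first_day = daily[date_col] == daily[customer_col].map(first)
    repeat_mean = daily[~is_first_day].groupby(customer_col)[value_col].mean()
    summary["monetary_value"] = repeat_mean.reindex(summary.index).fillna(0.0)
    return summary


# Illustrative transactions: customer 1 buys on three days, customer 2 once
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2],
    "day": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-11",
                           "2024-01-21", "2024-01-05"]),
    "Total_Price": [50.0, 10.0, 40.0, 60.0, 30.0],
})

summary = rfm_summary(tx, "customer_id", "day", "Total_Price", "2024-01-31")
print(summary)
```

Customer 1 ends up with frequency 2, recency 20, T 30, and a monetary value of 50.0, while the one-time customer 2 has frequency 0 and a monetary value of 0.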