# Midterm Project

**Problem Statement:**

This project is an in-depth analysis of SuperStore’s transaction data aimed at uncovering insights to support strategic business decisions. By exploring this dataset, I seek to understand the key drivers of sales and profitability, with a focus on customer segments, product categories, discounting practices, and regional performance.

The objectives of this analysis are:

1. Identify Financial Drivers: Conduct an in-depth examination of factors such as customer demographics, discount levels, and regional sales to determine which variables most significantly impact revenue and profitability. This will highlight high-value areas that SuperStore may prioritize to enhance overall financial performance.

2. Develop Predictive Models: Utilize linear regression to forecast sales and profit margins and logistic regression to classify orders as profitable or non-profitable. These models will aid in projecting future performance and guiding inventory, pricing, and marketing decisions.

3. Provide Actionable Recommendations: Based on the insights gained from financial analysis and predictive modeling, this project will offer data-driven strategies to optimize SuperStore’s revenue growth and profitability. Recommendations will cover discount policies, customer targeting, and inventory management.

This project serves as both a personal development initiative to refine my data science and financial analysis skills and a practical exploration of how data-driven insights can address real-world business challenges in the retail sector.




____

# Part 1: Data Processing

### 1.1 Importing the necessary libraries

In [34]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import skimpy as sp
import dtale as dt

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

### 1.2 Loading the dataset

In [75]:
# file path of the CSV file
file_path = '/Users/teslim/OneDrive/mlzoomcamp/superstore_dataset.csv'

# Load the dataset into a pandas DataFrame
df = pd.read_csv(file_path)

# Preview the loaded data
dt.show(df)



### 1.3 Data Overview

In [98]:
# check the first few rows of the dataset
df.head()

Unnamed: 0,order_id,order_date,ship_date,customer,manufactory,product_name,segment,category,subcategory,region,zip,city,state,country,discount,profit,quantity,sales,profit_margin
0,us-2020-103800,1/3/2019,1/7/2019,darren_powers,message_book,"message_book,_wirebound,_four_5_1/2""_x_4""_form...",consumer,office_supplies,paper,central,77095,houston,texas,united_states,0.2,5.5512,2,16.448,0.3375
1,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,gbc,gbc_standard_plastic_binding_systems_combs,home_office,office_supplies,binders,central,60540,naperville,illinois,united_states,0.8,-5.487,2,3.54,-1.55
2,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,avery,avery_508,home_office,office_supplies,labels,central,60540,naperville,illinois,united_states,0.2,4.2717,3,11.784,0.3625
3,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,safco,safco_boltless_steel_shelving,home_office,office_supplies,storage,central,60540,naperville,illinois,united_states,0.2,-64.7748,3,272.736,-0.2375
4,us-2020-141817,1/5/2019,1/12/2019,mick_brown,avery,avery_hi-liter_everbold_pen_style_fluorescent_...,consumer,office_supplies,art,east,19143,philadelphia,pennsylvania,united_states,0.2,4.884,3,19.536,0.25
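The preview suggests that `profit_margin` is simply `profit / sales`. A minimal sketch to check this, using only the five rows shown above copied into a toy DataFrame (the names `toy`, `recomputed`, and `matches` are illustrative and not part of the notebook):

```python
import numpy as np
import pandas as pd

# Values copied from the five preview rows; not the full dataset.
toy = pd.DataFrame({
    "profit": [5.5512, -5.487, 4.2717, -64.7748, 4.884],
    "sales": [16.448, 3.54, 11.784, 272.736, 19.536],
    "profit_margin": [0.3375, -1.55, 0.3625, -0.2375, 0.25],
})

# Recompute the margin and compare with a floating-point tolerance.
recomputed = toy["profit"] / toy["sales"]
matches = np.isclose(recomputed, toy["profit_margin"])
print(matches.all())
```

If the same relationship holds on the full `df`, `profit_margin` is a derived column and should probably not be used as a feature when predicting `profit`.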


In [76]:
# generate a summary of the data
sp.skim(df)

In [99]:
# check the data types of the columns
df.dtypes

order_id          object
order_date        object
ship_date         object
customer          object
manufactory       object
product_name      object
segment           object
category          object
subcategory       object
region            object
zip                int64
city              object
state             object
country           object
discount         float64
profit           float64
quantity           int64
sales            float64
profit_margin    float64
dtype: object
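`order_date` and `ship_date` are stored as `object` (plain strings), so date arithmetic is not available yet. A minimal sketch of one way to convert them, assuming the strings follow a month/day/year layout as the preview rows suggest (the toy frame and the `days_to_ship` column are hypothetical):

```python
import pandas as pd

# Two date pairs copied from the preview; the format is an assumption.
toy = pd.DataFrame({
    "order_date": ["1/3/2019", "1/5/2019"],
    "ship_date": ["1/7/2019", "1/12/2019"],
})

# Parse both columns as month/day/year.
for col in ["order_date", "ship_date"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y")

# With real datetimes, shipping delay becomes a simple subtraction.
toy["days_to_ship"] = (toy["ship_date"] - toy["order_date"]).dt.days
print(toy["days_to_ship"].tolist())  # [4, 7]
```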

### 1.4 Identifying categorical and numerical columns

In [None]:
# collect the names of the numerical columns
numerical_columns = df.select_dtypes(include=['number']).columns.tolist()
numerical_columns

['zip', 'discount', 'profit', 'quantity', 'sales', 'profit_margin']
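Note that `zip` is picked up as numerical only because it is stored as `int64`; a ZIP code is an identifier, so its mean or standard deviation has no meaning. A small sketch of one possible treatment, casting it to a zero-padded string (the value `2138` is a hypothetical example added to show the leading-zero case):

```python
import pandas as pd

# Two ZIP codes from the preview plus one hypothetical value that lost its leading zero.
toy = pd.DataFrame({"zip": [77095, 60540, 2138]})

# Cast to string and restore the five-digit width.
toy["zip"] = toy["zip"].astype(str).str.zfill(5)
print(toy["zip"].tolist())  # ['77095', '60540', '02138']
```

After such a cast, `zip` would appear in the categorical list rather than the numerical one.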

In [None]:
# collect the names of the categorical columns
categorical_columns = df.select_dtypes(include='object').columns.tolist()
categorical_columns

['order_id',
 'order_date',
 'ship_date',
 'customer',
 'manufactory',
 'product_name',
 'segment',
 'category',
 'subcategory',
 'region',
 'city',
 'state',
 'country']

In [None]:
# preview the categorical columns
df[categorical_columns].head()

Unnamed: 0,order_id,order_date,ship_date,customer,manufactory,product_name,segment,category,subcategory,region,city,state,country
0,US-2020-103800,1/3/2019,1/7/2019,Darren Powers,Message Book,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",Consumer,Office Supplies,Paper,Central,Houston,Texas,United States
1,US-2020-112326,1/4/2019,1/8/2019,Phillina Ober,GBC,GBC Standard Plastic Binding Systems Combs,Home Office,Office Supplies,Binders,Central,Naperville,Illinois,United States
2,US-2020-112326,1/4/2019,1/8/2019,Phillina Ober,Avery,Avery 508,Home Office,Office Supplies,Labels,Central,Naperville,Illinois,United States
3,US-2020-112326,1/4/2019,1/8/2019,Phillina Ober,SAFCO,SAFCO Boltless Steel Shelving,Home Office,Office Supplies,Storage,Central,Naperville,Illinois,United States
4,US-2020-141817,1/5/2019,1/12/2019,Mick Brown,Avery,Avery Hi-Liter EverBold Pen Style Fluorescent ...,Consumer,Office Supplies,Art,East,Philadelphia,Pennsylvania,United States


In [None]:
# convert the values in the categorical columns to lowercase and replace spaces with underscores

for col in categorical_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')
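The same normalization can be wrapped in a small helper that leaves the input untouched, which makes it easier to reuse and to test. This is a sketch, not the notebook's method: the function name `normalize_text_columns` is hypothetical, and the extra `.str.strip()` and `regex=False` are assumptions added for robustness.

```python
import pandas as pd

def normalize_text_columns(frame, columns):
    """Return a copy with the given string columns lowercased and underscored."""
    out = frame.copy()
    for col in columns:
        out[col] = (
            out[col]
            .str.strip()                         # drop stray leading/trailing spaces
            .str.lower()
            .str.replace(" ", "_", regex=False)  # literal replacement, no regex
        )
    return out

# Toy data for illustration only.
toy = pd.DataFrame({"segment": ["Home Office", " Consumer "], "sales": [3.54, 16.448]})
cleaned = normalize_text_columns(toy, ["segment"])
print(cleaned["segment"].tolist())  # ['home_office', 'consumer']
```

Passing an explicit column list also makes it possible to skip columns such as `order_date` and `ship_date`, which contain no letters or spaces to normalize.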

In [97]:
df.head()

Unnamed: 0,order_id,order_date,ship_date,customer,manufactory,product_name,segment,category,subcategory,region,zip,city,state,country,discount,profit,quantity,sales,profit_margin
0,us-2020-103800,1/3/2019,1/7/2019,darren_powers,message_book,"message_book,_wirebound,_four_5_1/2""_x_4""_form...",consumer,office_supplies,paper,central,77095,houston,texas,united_states,0.2,5.5512,2,16.448,0.3375
1,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,gbc,gbc_standard_plastic_binding_systems_combs,home_office,office_supplies,binders,central,60540,naperville,illinois,united_states,0.8,-5.487,2,3.54,-1.55
2,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,avery,avery_508,home_office,office_supplies,labels,central,60540,naperville,illinois,united_states,0.2,4.2717,3,11.784,0.3625
3,us-2020-112326,1/4/2019,1/8/2019,phillina_ober,safco,safco_boltless_steel_shelving,home_office,office_supplies,storage,central,60540,naperville,illinois,united_states,0.2,-64.7748,3,272.736,-0.2375
4,us-2020-141817,1/5/2019,1/12/2019,mick_brown,avery,avery_hi-liter_everbold_pen_style_fluorescent_...,consumer,office_supplies,art,east,19143,philadelphia,pennsylvania,united_states,0.2,4.884,3,19.536,0.25


