1. Problem Statement

The goal is to build a predictive model that estimates product sales based on key features like views, pricing, and category. This is a regression problem since the target variable (sales) is continuous

2. Hypothesis Generation

Generate hypotheses about how each feature (views, pricing, category) might influence product sales:

Higher views could lead to increased sales due to greater exposure.
Competitive pricing relative to cost might lead to higher sales.
Certain product categories may be more popular and hence drive more sales.

3. Getting the system ready and loading the data

In [13]:
# Supress warnings
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)


import pandas as pd
import numpy as np
import skimpy 

df = pd.read_csv('../data/sales.csv')


4. Understanding the data

In [3]:
df.head()

Unnamed: 0,date,id,category_id,sales,views,price_cost,price_retail
0,2022-02-24,1,3,0,0,0,0
1,2022-02-25,1,3,0,0,0,0
2,2022-02-26,1,3,0,0,0,0
3,2022-02-27,1,3,0,0,0,0
4,2022-02-28,1,3,0,0,0,0


In [4]:
dfRowCountStart = len(df.index)
dfRowCountStart

2548824

In [5]:
df.describe()

Unnamed: 0,id,category_id,sales,views,price_cost,price_retail
count,2548824.0,2548824.0,2548824.0,2548824.0,2548824.0,2548824.0
mean,1741.5,1.590752,0.3706074,38.2513,12193.18,17686.16
std,1005.167,0.6505913,5.039799,182.4299,18581.6,26413.87
min,1.0,1.0,0.0,0.0,0.0,0.0
25%,871.0,1.0,0.0,0.0,0.0,0.0
50%,1741.5,2.0,0.0,8.0,3546.0,7582.0
75%,2612.0,2.0,0.0,33.0,18148.0,25800.0
max,3482.0,4.0,1372.0,43148.0,292573.0,759077.0


The dataset contains a mix of numerical and categorical variables representing various attributes of loan applicants and their loan applications. The dataset appears to be suitable for analyzing factors influencing the sales price of a item compared to the retail price the views that they had for different items and categories.


In [6]:
# Assessing data quality, completeness, and relevance
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2548824 entries, 0 to 2548823
Data columns (total 7 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   date          object
 1   id            int64 
 2   category_id   int64 
 3   sales         int64 
 4   views         int64 
 5   price_cost    int64 
 6   price_retail  int64 
dtypes: int64(6), object(1)
memory usage: 136.1+ MB
None


- The dataset contains ... entries (rows) and ... columns.
- Each column represents a different variable or feature.
- The variables have different data types:
- ... columns are of type float64, representing numerical variables (e.g., CoapplicantIncome, LoanAmount).
- ... column is of type int64, representing a numerical variable (e.g., ApplicantIncome).
- ... columns are of type object, representing categorical variables (e.g., Gender, Married, Education).
- ... are missing values in several columns:
Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History have some missing values.
- The target variable (Loan_Status) is categorical and has two classes: Y (Yes) and N (No).
- Other categorical variables include Gender, Married, Education, Self_Employed, and Property_Area.
- Numerical variables include ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History.

In [8]:
#Identifying potential data issues and limitations.
from skimpy import skim
skim(df)

5. Exploratory Data Analysis

In [14]:
# For Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sn
import pandas as pd

In [10]:
df = pd.read_csv('../data/sales.csv')
df.head()

Unnamed: 0,date,id,category_id,sales,views,price_cost,price_retail
0,2022-02-24,1,3,0,0,0,0
1,2022-02-25,1,3,0,0,0,0
2,2022-02-26,1,3,0,0,0,0
3,2022-02-27,1,3,0,0,0,0
4,2022-02-28,1,3,0,0,0,0


In [11]:
df.tail()

Unnamed: 0,date,id,category_id,sales,views,price_cost,price_retail
2548819,2024-02-21,3482,2,0,41,2440,3170
2548820,2024-02-22,3482,2,0,23,2440,3172
2548821,2024-02-23,3482,2,0,14,2440,3172
2548822,2024-02-24,3482,2,0,17,2440,3172
2548823,2024-02-25,3482,2,0,25,2440,3172


 	i. Perform Univariate Analysis

In [16]:
(
    df['price_retail']
          .astype('')
)

TypeError: data type '' not understood

In [18]:
import pandas as pd
import plotly.express as px

# Assuming df is your DataFrame containing the 'price_retail' column
labels = (df['price_retail']
          .astype('')
)

# Rename columns for clarity
labels.columns = ['price_retail', 'Count']

# Create figure using Plotly
fig = px.bar(
    data_frame=labels, 
    x='price_retail', 
    y='Count', 
    title='Class Imbalance', 
    color='price_retail'
)

# Add titles & Display figure
fig.update_layout(xaxis_title='Sales price', yaxis_title='Sales made')
fig.show()

TypeError: data type '' not understood

 	ii. Perform Bivariate Analysis

Numeric Features

In [19]:
df.select_dtypes('').nunique()

TypeError: data type '' not understood

In [20]:
# this will change depending on the awnser above
# Select features to plot
plot_cols = ['sales', 'views', 'price_cost','price_cost']

# Plot numeric features against target
plt.Figure(figsize=(3,4))
for col in plot_cols:
    fig = px.box(data_frame=df[plot_cols], x=col, color=df['price_retail'], title=f'BoxPlot for {col} Feature against the Target')
    fig.update_layout(xaxis_title=f'{col} Feature')
    fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

Categorical Feature

In [21]:
df.select_dtypes('object').nunique()

date    732
dtype: int64

In [22]:
# this will change depending on the awnser above
# Select features to plot
plot_cols = ['category_id', 'id']

# Plot numeric features against target
plt.Figure(figsize=(3,4))
for col in plot_cols:
    fig = px.box(data_frame=df[plot_cols], x=col, color=df['price_retail'], title=f'BoxPlot for {col} Feature against the Target')
    fig.update_layout(xaxis_title=f'{col} Feature')
    fig.show()

KeyboardInterrupt: 

6. Missing value and outlier treatment

Any variation of data that is not applicable or any strange characters we will replace the data with null values

In [23]:
df.replace(['NaN', 'N/A', 'NA', 'n/a', 'n.a.', 'N#A', 'n#a', '?'], 'other', inplace=True)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2548824 entries, 0 to 2548823
Data columns (total 7 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   date          object
 1   id            int64 
 2   category_id   int64 
 3   sales         int64 
 4   views         int64 
 5   price_cost    int64 
 6   price_retail  int64 
dtypes: int64(6), object(1)
memory usage: 136.1+ MB


In [25]:
df.isnull().sum()

date            0
id              0
category_id     0
sales           0
views           0
price_cost      0
price_retail    0
dtype: int64

7. Evaluation Metrics for classification problem

8. Model building: part 1 (Apply Deep Learning classification algorithm without step 9)

In [26]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import mglearn
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge





In [None]:
# Assuming Sales_df is your DataFrame and 'Sales' is the column to be imputed
Sales_df = df[['sales']]

# Impute missing values using the mean strategy
imp_num = SimpleImputer(strategy='mean')
loan_df_imputed = pd.DataFrame(
    imp_num.fit_transform(Sales_df),  # Impute missing values
    columns=Sales_df.columns  # Keep the original column name
)

# Assign the imputed values back to the original DataFrame
df['sales'] = loan_df_imputed['sales']


# check missing values in %age
missing_values = (
    df.isnull().sum()/len(df)*100
).astype(int)

print(f'Column\t\t\t% missing')
print(f'{"-"}'*35)
missing_values

9. Feature engineering

10. Model building: part 2 (Apply Deep Learning classification algorithm with step 9)

11. Model deployment - Dash app on https://www.render.com