## Data and Task

### Data Link: https://www.kaggle.com/flenderson/sales-analysis

- Historical Sales and Active Inventory.
- Records of sold and unsold products and their characteristics.

### Task: **Predict whether a product was sold in the last 6 months, based on its historical features**

## Imports

In [56]:
!pip3 install -U scikit-learn

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/f3/74/eb899f41d55f957e2591cde5528e75871f817d9fb46d4732423ecaca736d/scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
     |████████████████████████████████| 22.3MB 62.6MB/s
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.1 threadpoolctl-2.1.0


In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, KFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline 


from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import f1_score, accuracy_score, plot_confusion_matrix, classification_report

## Data

In [13]:
data = pd.read_csv("SalesKaggle3.csv")

In [14]:
data

| | Order | File_Type | SKU_number | SoldFlag | SoldCount | MarketingType | ReleaseNumber | New_Release_Flag | StrengthFactor | PriceReg | ReleaseYear | ItemCount | LowUserPrice | LowNetPrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Historical | 1737127 | 0.0 | 0.0 | D | 15 | 1 | 6.827430e+05 | 44.99 | 2015 | 8 | 28.97 | 31.84 |
| 1 | 3 | Historical | 3255963 | 0.0 | 0.0 | D | 7 | 1 | 1.016014e+06 | 24.81 | 2005 | 39 | 0.00 | 15.54 |
| 2 | 4 | Historical | 612701 | 0.0 | 0.0 | D | 0 | 0 | 3.404640e+05 | 46.00 | 2013 | 34 | 30.19 | 27.97 |
| 3 | 6 | Historical | 115883 | 1.0 | 1.0 | D | 4 | 1 | 3.340110e+05 | 100.00 | 2006 | 20 | 133.93 | 83.15 |
| 4 | 7 | Historical | 863939 | 1.0 | 1.0 | D | 2 | 1 | 1.287938e+06 | 121.95 | 2010 | 28 | 4.00 | 23.99 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 198912 | 208023 | Active | 109683 | NaN | NaN | D | 7 | 1 | 2.101869e+05 | 72.87 | 2006 | 54 | 8.46 | 60.59 |
| 198913 | 208024 | Active | 416462 | NaN | NaN | D | 8 | 1 | 4.555041e+05 | 247.00 | 2009 | 65 | 8.40 | 74.85 |
| 198914 | 208025 | Active | 658242 | NaN | NaN | S | 2 | 1 | 1.692746e+05 | 50.00 | 2012 | 23 | 23.98 | 32.62 |
| 198915 | 208026 | Active | 2538340 | NaN | NaN | S | 2 | 1 | 3.775266e+05 | 46.95 | 2001 | 23 | 27.42 | 37.89 |


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198917 entries, 0 to 198916
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Order             198917 non-null  int64  
 1   File_Type         198917 non-null  object 
 2   SKU_number        198917 non-null  int64  
 3   SoldFlag          75996 non-null   float64
 4   SoldCount         75996 non-null   float64
 5   MarketingType     198917 non-null  object 
 6   ReleaseNumber     198917 non-null  int64  
 7   New_Release_Flag  198917 non-null  int64  
 8   StrengthFactor    198917 non-null  float64
 9   PriceReg          198917 non-null  float64
 10  ReleaseYear       198917 non-null  int64  
 11  ItemCount         198917 non-null  int64  
 12  LowUserPrice      198917 non-null  float64
 13  LowNetPrice       198917 non-null  float64
dtypes: float64(6), int64(6), object(2)
memory usage: 21.2+ MB


In [16]:
# Count the missing values in each column
data.isna().sum()

Order                    0
File_Type                0
SKU_number               0
SoldFlag            122921
SoldCount           122921
MarketingType            0
ReleaseNumber            0
New_Release_Flag         0
StrengthFactor           0
PriceReg                 0
ReleaseYear              0
ItemCount                0
LowUserPrice             0
LowNetPrice              0
dtype: int64

## Data Preparation/Preprocessing
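- We work only with rows whose `File_Type` is "Historical", so the first step is to drop the "Active" rows. The Active rows shown above have no `SoldFlag` or `SoldCount`, and the 75,996 non-null `SoldFlag` values match the number of historical rows left after filtering.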
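
One way to confirm that the missing targets sit in the Active rows is to count missing `SoldFlag` values per `File_Type`. The sketch below uses a small hand-made frame named `toy` (the values are illustrative and not taken from the dataset) that reuses the two real column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "File_Type": ["Historical", "Historical", "Active", "Active", "Active"],
    "SoldFlag": [0.0, 1.0, np.nan, np.nan, np.nan],
})

# Count missing SoldFlag values within each File_Type group.
missing_by_type = toy["SoldFlag"].isna().groupby(toy["File_Type"]).sum()
print(missing_by_type)

# Keeping only historical rows leaves no missing targets in this toy frame.
historical = toy.query("File_Type == 'Historical'")
remaining_missing = int(historical["SoldFlag"].isna().sum())
print(remaining_missing)
```

On the real data, the same two steps would be run on `data` instead of `toy`.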

In [17]:

def data_preparation(df):
    df = df.copy()
    
    # Only use historical data
    df = df.query("File_Type == 'Historical'")
    
    # Drop unused columns
    df = df.drop(['Order', 'File_Type', 'SKU_number', 'SoldCount'], axis=1)
    
    # Shuffle data
    df = df.sample(frac=1.0, random_state=1)
    
    # Split df into X and y
    y = df['SoldFlag']
    X = df.drop('SoldFlag', axis=1)
    
    return X, y

In [18]:
X, y = data_preparation(data)

In [19]:
X

| | MarketingType | ReleaseNumber | New_Release_Flag | StrengthFactor | PriceReg | ReleaseYear | ItemCount | LowUserPrice | LowNetPrice |
|---|---|---|---|---|---|---|---|---|---|
| 37862 | S | 12 | 1 | 545082.0 | 96.67 | 2011 | 12 | 73.74 | 101.33 |
| 35304 | S | 2 | 1 | 4273940.0 | 58.00 | 2002 | 32 | 85.60 | 23.98 |
| 26138 | D | 9 | 1 | 165834.0 | 76.95 | 2011 | 48 | 75.57 | 42.67 |
| 52327 | S | 22 | 1 | 79220.0 | 54.25 | 2012 | 31 | 36.47 | 22.49 |
| 6038 | D | 8 | 1 | 80014.0 | 38.99 | 2008 | 62 | 153.24 | 69.43 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20609 | D | 8 | 1 | 40841.0 | 103.24 | 2010 | 48 | 99.50 | 115.55 |
| 21440 | D | 0 | 0 | 1611172.0 | 86.64 | 2011 | 19 | 55.19 | 78.38 |
| 73349 | S | 2 | 1 | 1628317.0 | 69.99 | 2004 | 43 | 4.02 | 30.43 |
| 50057 | S | 2 | 1 | 1660915.0 | 44.00 | 2004 | 32 | 34.51 | 10.12 |


In [20]:

y

37862    0.0
35304    0.0
26138    0.0
52327    0.0
6038     0.0
        ... 
20609    0.0
21440    1.0
73349    0.0
50057    0.0
5192     1.0
Name: SoldFlag, Length: 75996, dtype: float64

## Narration
- `MarketingType` needs one-hot encoding. The rows shown above contain only the values D and S, so this amounts to a binary encoding.
- `SoldCount` is directly tied to `SoldFlag`: a nonzero count already tells us the product was sold. Keeping it would leak the target into the features, which no practical model could rely on, so we drop it.
- We remove unique identifiers such as `Order` and `SKU_number`.
- `File_Type` is no longer needed once only historical records remain.
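
The leakage concern about `SoldCount` can be checked directly: if thresholding `SoldCount` at zero reproduces `SoldFlag`, the column gives the answer away. This is a minimal sketch on a hand-made frame (`toy`, with illustrative values only); whether the relation holds on every row of the real data would need to be verified there.

```python
import pandas as pd

# Toy stand-in; values are illustrative only.
toy = pd.DataFrame({
    "SoldFlag": [0.0, 1.0, 0.0, 1.0],
    "SoldCount": [0.0, 1.0, 0.0, 3.0],
})

# If (SoldCount > 0) reproduces SoldFlag exactly, SoldCount leaks the target.
reconstructed = (toy["SoldCount"] > 0).astype(float)
leaks_target = bool((reconstructed == toy["SoldFlag"]).all())
print(leaks_target)
```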

## Pipeline

In [21]:
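def model_pipeline():
    # MarketingType appears to take two values (D, S), so drop='if_binary'
    # yields a single 0/1 column.
    # Note: sparse=False matches scikit-learn 0.24.1 installed above;
    # from version 1.2 the argument is named sparse_output.
    binary_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(sparse=False, drop='if_binary'))
    ])
    
    # ReleaseNumber is treated as a nominal category; values unseen during
    # fitting are encoded as all zeros instead of raising an error.
    nominal_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
    ])
    
    # Encode the two categorical columns; pass the remaining columns through unchanged.
    preprocessor = ColumnTransformer(transformers=[
        ('binary', binary_transformer, ['MarketingType']),
        ('nominal', nominal_transformer, ['ReleaseNumber'])
    ], remainder='passthrough')
    
    # Preprocessing followed by a random forest with a fixed seed.
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=1))
    ])
    
    return model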
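
To see what the preprocessing step produces, here is a small sketch that applies the same two encoders to a hand-made frame (`toy`, illustrative values only, with just three of the feature columns). It differs from the pipeline above in one labeled way: instead of passing a sparse-related argument to `OneHotEncoder`, it sets `sparse_threshold=0` on the `ColumnTransformer` to get a dense array, which avoids depending on a particular scikit-learn version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with three of the feature columns; values are illustrative only.
toy = pd.DataFrame({
    "MarketingType": ["D", "S", "D", "S"],
    "ReleaseNumber": [0, 2, 7, 2],
    "PriceReg": [46.00, 50.00, 72.87, 46.95],
})

# sparse_threshold=0 forces a dense result from the ColumnTransformer.
preprocessor = ColumnTransformer(transformers=[
    ("binary", OneHotEncoder(drop="if_binary"), ["MarketingType"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["ReleaseNumber"]),
], remainder="passthrough", sparse_threshold=0)

# 1 column for MarketingType + 3 for ReleaseNumber (0, 2, 7) + 1 passthrough.
encoded = preprocessor.fit_transform(toy)
print(encoded.shape)

# A ReleaseNumber not seen during fitting becomes an all-zero block.
unseen = preprocessor.transform(pd.DataFrame({
    "MarketingType": ["D"],
    "ReleaseNumber": [99],
    "PriceReg": [10.0],
}))
print(unseen)
```

One consequence of one-hot encoding `ReleaseNumber` is that the output width depends on how many distinct release numbers appear in the training fold.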

## Model Training and Evaluation

#### We use K-fold cross-validation

In [22]:
accs = []
f1s = []

# 5 folds without shuffling; the rows were already shuffled in data_preparation
kf = KFold(n_splits=5)

for train_idx, test_idx in kf.split(X):
    X_train = X.iloc[train_idx, :]
    X_test = X.iloc[test_idx, :]
    y_train = y.iloc[train_idx]
    y_test = y.iloc[test_idx]
    
    model = model_pipeline()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    accs.append(accuracy_score(y_test, y_pred))
    f1s.append(f1_score(y_test, y_pred, pos_label=1.0))

acc = np.mean(accs)
f1 = np.mean(f1s)

print("Accuracy: {:.2f}%".format(acc * 100))
print("F1-Score: {:.5f}".format(f1))

Accuracy: 83.54%
F1-Score: 0.23681
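
High accuracy together with a low F1-score is the pattern one would expect when sold products are a minority class and the model rarely predicts them. The class balance is not shown above, so this is a hypothesis to check (for example with `y.value_counts()`), not a confirmed explanation. The sketch below uses synthetic labels and a hypothetical set of predictions to show how the two metrics can diverge:

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels for illustration only: 83 negatives followed by 17 positives.
y_true = [0.0] * 83 + [1.0] * 17

# Hypothetical predictions: 2 false positives, 81 true negatives,
# then 3 true positives and 14 false negatives.
y_pred = [1.0] * 2 + [0.0] * 81 + [1.0] * 3 + [0.0] * 14

acc = accuracy_score(y_true, y_pred)          # (81 + 3) / 100 = 0.84
f1 = f1_score(y_true, y_pred, pos_label=1.0)  # 2*3 / (2*3 + 2 + 14) = 6/22

print("Accuracy: {:.2f}%".format(acc * 100))
print("F1-Score: {:.5f}".format(f1))
```

If the real target turns out to be imbalanced in this way, reporting precision and recall for the sold class alongside accuracy would make the model's behavior easier to read.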
