![AlbertsonsImgae.png](attachment:AlbertsonsImgae.png)

# Machine Learning - Personalization for Business in the Retail Industry

### Introduction
This case study is about identifying customers' dietary preferences based on their shopping history.

### Objective - Identifying the Customer's Dietary Preference

### Business Problem

Identifying customers' dietary preferences is crucial in the retail industry, particularly in food and beverages, for several reasons:

1. **Personalization:** Understanding dietary preferences allows retailers to provide personalized recommendations and offers to customers. This enhances the shopping experience and increases the likelihood of making a sale.
2. **Customer Satisfaction:** Catering to dietary preferences ensures that customers can find products that align with their needs and values. Satisfied customers are more likely to return and become loyal shoppers.
3. **Market Segmentation:** By identifying the different dietary segments (e.g. vegetarian, vegan, paleo, gluten-free), retailers can tailor their product offerings and marketing strategies to target specific customer groups effectively.
4. **Marketing Efficiency:** Retailers can create targeted marketing campaigns that resonate with specific dietary segments, leading to a higher return on marketing investment.
5. **Data-Driven Decision Making:** Gathering data on dietary preferences enables retailers to make data-driven decisions about product offerings, inventory management, and marketing strategies, leading to more informed choices.


### Type of Problem - Classification

We can use previous purchase data to identify each customer's dietary preference. Because the target (`dp`) is a category rather than a number, this is a classification problem.

### Classification 

A classification algorithm is a supervised learning technique that learns from labeled training data how to assign observations to classes. Once trained, the model assigns new observations to one of those classes.
1. **Logistic Regression:** A statistical model (also known as the logit model) that is often used for classification and predictive analytics. Logistic regression estimates the probability of an event occurring, such as voted or didn’t vote, based on a given dataset of independent variables. We will use **multinomial logistic regression**, in which the dependent variable has three or more possible outcomes that have no specified order. For example, a movie studio may want to predict which genre of film a moviegoer is likely to see in order to market films more effectively. In the same way, a multinomial logistic regression model can help identify a customer's dietary preference.

2. **RandomForestClassifier:** A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
-  Random forests are a popular supervised machine learning algorithm. 
    *  Random forests are for supervised machine learning, where there is a labeled target variable.
    *  Random forests can be used for solving regression (numeric target variable) and classification (categorical target variable) problems. 
    *  Random forests are an ensemble method, meaning they combine predictions from other models.
    *  Each of the smaller models in the random forest ensemble is a decision tree.
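
To make the two model families concrete, here is a minimal sketch that fits both on synthetic three-class data. The data, the `*_demo` names, and the hyperparameter values are assumptions for illustration only; they are not taken from `features.csv`. In recent scikit-learn versions the default solver fits a multinomial model when the target has more than two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# Fit one model of each family on the same data.
lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# predict_proba returns one probability per class; each row sums to 1.
lr_proba = lr_demo.predict_proba(X_demo[:5])
rf_proba = rf_demo.predict_proba(X_demo[:5])
print(lr_proba.round(3))
print(rf_proba.round(3))
```

Both models expose the same `fit` / `predict` / `predict_proba` interface, which is why the notebook can switch between them with a single `model_type` flag.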


# Importing the necessary libraries

In [None]:
!pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org -r ./../requirements.txt

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Reading the features csv file

In [None]:
df = pd.read_csv("features.csv")

### Shape attribute in Pandas enables us to obtain the shape of a DataFrame

In [None]:
df.shape

### Pandas head function returns the first 10 rows for the object based on position. 

In [None]:
df.head(10)

### The columns attribute holds the column labels of the DataFrame

In [None]:
df.columns.values

### Columns Description 
*  **customer_id:** unique customer identifier
*  **dp:** Dietary preference
*  **num_txn:** Number of transactions done by the household 
*  **basket_pct_MILK_FREE:** Percentage of milk free transactions
*  **basket_pct_EGG_FREE:** Percentage of egg free transactions
*  **basket_pct_GLUTEN_FREE:** Percentage of gluten free transactions
*  **basket_pct_VEGETARIAN:** Percentage of vegetarian transactions
*  **basket_pct_VEGAN:** Percentage of vegan transactions
*  **basket_pct_KETO:** Percentage of keto transactions
*  **basket_pct_PALEO:** Percentage of paleo transactions
*  **basket_pct_SHELLFISH_FREE:** Percentage of shellfish free transactions
*  **basket_pct_SOY_FREE:** Percentage of soy free transactions
*  **basket_pct_LACTOSE_FREE:** Percentage of lactose free transactions
*  **basket_pct_PEANUT_FREE:** Percentage of peanut free transactions
*  **basket_pct_PESCATARIAN:** Percentage of pescatarian transactions
*  **basket_pct_TREE_NUT_FREE:** Percentage of tree nut free transactions
*  **basket_pct_LOW_CARB:** Percentage of low carb transactions
*  **basket_pct_WHEAT_FREE:** Percentage of wheat free transactions
*  **basket_pct_NO_DIET_PREFERENCES:** Percentage of no dietary preferences transactions
*  **basket_pct_NO_DIET_RESTRICTIONS:** Percentage of No Dietary Restrictions transactions

### The Pandas describe() method returns a statistical summary of the data in the DataFrame.

    If the DataFrame contains numerical data, the description contains the following information for each column:

    count - The number of non-empty values.
    mean - The average (mean) value.
    std - The standard deviation.
    min - The minimum value.
    25% - The 25th percentile*.
    50% - The 50th percentile* (the median).
    75% - The 75th percentile*.
    max - The maximum value.

    *Percentile meaning: the value below which the given percentage of the values fall.

In [None]:
df.describe()

# Check if any columns have null values

In [None]:
df.isna().any()

In [None]:
# Count the number of null values in each column

In [None]:
df.isna().sum()

In [None]:
# Fill missing values with the column mean. The result is assigned back to the column
# because calling fillna(..., inplace=True) on a selected column is a chained assignment
# that newer pandas versions warn about or ignore.
# Note: int() truncates the mean to a whole number, as in the original analysis.
null_columns = [
    'basket_pct_MILK_FREE', 'basket_pct_GLUTEN_FREE', 'basket_pct_VEGETARIAN',
    'basket_pct_VEGAN', 'basket_pct_PALEO', 'basket_pct_SHELLFISH_FREE',
    'basket_pct_PEANUT_FREE', 'basket_pct_TREE_NUT_FREE',
]
for col in null_columns:
  df[col] = df[col].fillna(int(df[col].mean()))

In [None]:
df.isna().sum()
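
As a small illustration of mean imputation, the sketch below fills the gaps in a toy column. The column name `basket_pct_demo` and its values are hypothetical. It also shows why truncating the mean with `int()` deserves a second look: if the percentages were stored as fractions between 0 and 1 (an assumption about the file, not something the notebook states), the truncated mean would always be 0.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (hypothetical data, not from features.csv).
toy = pd.DataFrame({"basket_pct_demo": [10.0, np.nan, 30.0, np.nan, 50.0]})

# mean() skips NaN values: (10 + 30 + 50) / 3 = 30.0
col_mean = toy["basket_pct_demo"].mean()

# Assign the filled column back instead of modifying a selection in place.
toy["basket_pct_demo"] = toy["basket_pct_demo"].fillna(col_mean)
print(toy)

# Truncation discards the fractional part, so a mean below 1 becomes 0.
truncated_fraction = int(0.37)
print(truncated_fraction)
```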



In [None]:
(df == 0).sum(axis=0)

In [None]:

df = df.drop('basket_pct_LACTOSE_FREE', axis=1)

In [None]:
# Replace zero values with the column mean
# (the mean is computed before masking, so the zeros are included in it)
zero_fill_columns = [
    'basket_pct_MILK_FREE', 'basket_pct_EGG_FREE', 'basket_pct_GLUTEN_FREE',
    'basket_pct_VEGETARIAN', 'basket_pct_VEGAN', 'basket_pct_KETO',
    'basket_pct_PALEO', 'basket_pct_SHELLFISH_FREE', 'basket_pct_SOY_FREE',
    'basket_pct_PEANUT_FREE', 'basket_pct_PESCATARIAN', 'basket_pct_TREE_NUT_FREE',
    'basket_pct_LOW_CARB', 'basket_pct_WHEAT_FREE',
]
for col in zero_fill_columns:
  df[col] = df[col].mask(df[col] == 0, df[col].mean())

In [None]:
(df == 0).sum(axis=0)
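
The zero replacement above uses the mean of the whole column, zeros included. The toy sketch below (hypothetical values) contrasts that with the mean of the non-zero entries only, which may be worth considering if a zero is treated as "not observed" rather than a true measurement.

```python
import pandas as pd

# Toy series with two zero entries (hypothetical values).
s = pd.Series([0.0, 20.0, 0.0, 40.0])

mean_with_zeros = s.mean()              # (0 + 20 + 0 + 40) / 4 = 15.0
mean_without_zeros = s[s != 0].mean()   # (20 + 40) / 2 = 30.0

# mask() replaces the entries where the condition is True.
masked = s.mask(s == 0, mean_with_zeros)
print(masked.tolist())
```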

### Number of records for each unique value in the dp (dietary preference) column

In [None]:
df["dp"].value_counts()

# List of features and model choice

In [None]:
Y_feature = "dp"
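# Keep the original column order so that feature positions are reproducible across runs
# (iterating over a set of strings can give a different order in each Python session).
# The column description names the identifier customer_id, so both names are excluded.
id_columns = ["household_id", "customer_id"]
X_feature_list = [col for col in df.columns if col not in [Y_feature] + id_columns]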
model_type = "RF" # RF or LR. two models were trained and compared

### We choose the model_type: either RandomForestClassifier (RF) or Logistic Regression (LR)

## Cross-validation
*  Cross-validation is a statistical method used to estimate the skill of machine learning models.

*  It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem 

## K-fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

The general procedure is as follows:

* Shuffle the dataset randomly.
* Split the dataset into k groups
* For each unique group:
    * Take the group as a hold out or test data set
    * Take the remaining groups as a training data set
    * Fit a model on the training set and evaluate it on the test set
    * Retain the evaluation score and discard the model
* Summarize the skill of the model using the sample of model evaluation scores
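
The procedure above can be sketched by hand in a few lines. This is an illustrative version on synthetic data; the names `X_demo`, `y_demo`, `folds`, and `fold_model` are assumptions, and it does not stratify the folds the way the `StratifiedKFold` cell in this notebook does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5
rng = np.random.default_rng(0)
indices = rng.permutation(len(X_demo))   # shuffle the dataset randomly
folds = np.array_split(indices, k)       # split the dataset into k groups

scores = []
for i in range(k):
    test_idx = folds[i]                  # this group is the hold-out set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    fold_model = LogisticRegression(max_iter=1000).fit(X_demo[train_idx], y_demo[train_idx])
    # Retain the score; the model is overwritten on the next iteration.
    scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# Summarize the skill of the model across folds.
print(f"mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Every record lands in exactly one hold-out group, so each observation is used for testing once and for training k - 1 times.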

In [None]:
# K-fold
X = df[X_feature_list].to_numpy(); y = df[Y_feature].to_numpy()

strtfdKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state = 42)
kfolds = strtfdKFold.split(X, y)

table = []
for k, (train_rows, test_rows) in enumerate(kfolds):
  if model_type == "RF":
    X_train = X[train_rows]
    X_test = X[test_rows]
    model = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=10, random_state=42).fit(X_train, y[train_rows])    
  elif model_type == "LR":
    scaler = StandardScaler().fit(X[train_rows])
    X_train = scaler.transform(X[train_rows])
    X_test = scaler.transform(X[test_rows])
    model = LogisticRegression(C = 0.8, random_state=42).fit(X_train, y[train_rows])
    
  y_hat = model.predict(X_train)
  y_pred = model.predict(X_test)
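  # Micro-averaging pools the counts over all classes, so pos_label (which only applies
  # to average='binary') is omitted. For single-label multiclass data the micro-averaged
  # precision, recall and F1 score are all equal to the accuracy.
  train_precision, train_recall, train_fscore, _ = precision_recall_fscore_support(y[train_rows], y_hat, average='micro')
  test_precision, test_recall, test_fscore, _ = precision_recall_fscore_support(y[test_rows], y_pred, average='micro')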
       
  table.append([k, X_train.shape[0], 100. * train_precision, 100. * train_recall, 100. * train_fscore,\
                X_test.shape[0], 100. * test_precision, 100. * test_recall, 100. * test_fscore])
Kfold_performance_df = pd.DataFrame(table, columns = ['k', 'train_L', 'train_precision', 'train_recall', 'train_fscore',\
                                                 'test_L', 'test_precision', 'test_recall', 'test_fscore'])
Kfold_performance_df.round(decimals = 2)
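
The loop above fits the scaler on the training rows of each fold only, which keeps information from the test rows out of the preprocessing. A more compact way to get the same guarantee is to wrap the scaler and the model in a scikit-learn `Pipeline` and pass it to `cross_validate`. The sketch below uses synthetic data and assumed names (`X_demo`, `y_demo`, `pipe`); it is an alternative formulation, not the code used to produce the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# The pipeline refits the scaler inside every fold, on the training rows only.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.8, max_iter=1000, random_state=42),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipe, X_demo, y_demo, cv=cv, scoring="f1_micro", return_train_score=True,
)

print("train f1 (micro):", cv_results["train_score"].mean().round(3))
print("test  f1 (micro):", cv_results["test_score"].mean().round(3))
```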

### K-fold output parameters
*  **k:** Index of the fold (0 to 9)
*  **train_L:** Number of records used for training
*  **train_precision:** Precision on the train data
*  **train_recall:** Recall on the train data
*  **train_fscore:** F1 score on the train data
*  **test_L:** Number of records used for testing
*  **test_precision:** Precision on the test data
*  **test_recall:** Recall on the test data
*  **test_fscore:** F1 score on the test data

**F1 score** is an evaluation metric that combines a model's precision and recall into a single number. It is their harmonic mean, so it is high only when both precision and recall are high.
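
As a worked example, the sketch below computes precision, recall, and F1 per class from true positive (TP), false positive (FP), and false negative (FN) counts, and then checks the micro-averaged values that the cells in this notebook report. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for six customers (not from features.csv).
y_true = np.array(["keto", "keto", "vegan", "vegan", "vegan", "paleo"])
y_pred = np.array(["keto", "vegan", "vegan", "vegan", "paleo", "paleo"])

per_class = {}
for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))   # predicted c, actually c
    fp = np.sum((y_pred == c) & (y_true != c))   # predicted c, actually another class
    fn = np.sum((y_pred != c) & (y_true == c))   # actually c, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    per_class[str(c)] = (precision, recall, f1)
    print(f"{c}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Micro-averaging pools TP, FP and FN over all classes before computing the metrics.
micro_p, micro_r, micro_f, _ = precision_recall_fscore_support(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)
print(f"micro precision={micro_p:.3f} recall={micro_r:.3f} f1={micro_f:.3f} accuracy={accuracy:.3f}")
```

When every record has exactly one label, the micro-averaged precision, recall, and F1 are identical and equal to the accuracy, which is why the three columns of the K-fold table show the same number. Passing `average="macro"` or `average=None` instead would expose the differences between classes.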

<img src="Confusion_Matrix.png" width="65%"/>
<img src="Precision_Recall.png" width="65%"/>
<img src="F1_Score.png" width="65%"/>

**GridSearchCV** is a scikit-learn utility for hyperparameter tuning: it searches for the hyperparameter values that give the best cross-validated score for a given model. The performance of a model depends significantly on its hyperparameters. There is no way to know the best values in advance, so ideally we would try every candidate combination. Doing this manually could take considerable time and resources, so we use GridSearchCV to automate the search over a predefined grid of values.
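
A minimal sketch of the idea on synthetic data follows. The grid is deliberately tiny, and `X_demo`, `y_demo`, `param_grid`, and `search` are illustrative names rather than part of the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the real features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)

# Every combination in the grid is evaluated: 2 x 2 = 4 candidates, 3 folds each.
param_grid = {"max_depth": [2, 5], "min_samples_leaf": [1, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid,
    cv=3,
    scoring="f1_micro",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated score:", round(search.best_score_, 3))
```

The cost grows with the product of the grid sizes, so a grid of 3 x 3 x 3 values with the default 5-fold cross-validation already requires 135 model fits.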

In [None]:
# K-fold + GridSearchCV
X = df[X_feature_list].to_numpy(); y = df[Y_feature].to_numpy()

strtfdKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state = 42)
kfolds = strtfdKFold.split(X, y)

best_parameters = None
table = []
for k, (train_rows, test_rows) in enumerate(kfolds):
  # Tune the hyperparameters once, on the training rows of the first fold,
  # then reuse the best values for every fold.
  if best_parameters is None:
    print("capturing best hyper parameters")
    if model_type == "RF":
      model = RandomForestClassifier(random_state=42)
      parameters = {'n_estimators':[100, 200, 500], 'max_depth':[1, 5, 10], 'min_samples_leaf': [1, 5, 10]}
      X_train = X[train_rows]
      X_test = X[test_rows]
    elif model_type == "LR":
      model = LogisticRegression(random_state=42)
      parameters = {'C':[0.5, 0.75, 1, 1.5, 2.0]}
      scaler = StandardScaler().fit(X[train_rows])
      X_train = scaler.transform(X[train_rows])
      X_test = scaler.transform(X[test_rows])  
    model_GS = GridSearchCV(model, parameters)
    model_GS.fit(X_train, y[train_rows])
    best_parameters = model_GS.best_params_

  if model_type == "RF":
    X_train = X[train_rows]
    X_test = X[test_rows]
    model = RandomForestClassifier(**best_parameters, random_state=42).fit(X_train, y[train_rows])    
  elif model_type == "LR":
    scaler = StandardScaler().fit(X[train_rows])
    X_train = scaler.transform(X[train_rows])
    X_test = scaler.transform(X[test_rows])
    model = LogisticRegression(**best_parameters, random_state=42).fit(X_train, y[train_rows])
    
  y_hat = model.predict(X_train)
  y_pred = model.predict(X_test)
  # Micro-averaging pools the counts over all classes, so pos_label (which only applies
  # to average='binary') is omitted.
  train_precision, train_recall, train_fscore, _ = precision_recall_fscore_support(y[train_rows], y_hat, average='micro')
  test_precision, test_recall, test_fscore, _ = precision_recall_fscore_support(y[test_rows], y_pred, average='micro')
       
  table.append([k, X_train.shape[0], 100. * train_precision, 100. * train_recall, 100. * train_fscore,\
                X_test.shape[0], 100. * test_precision, 100. * test_recall, 100. * test_fscore])
Kfold_performance_df = pd.DataFrame(table, columns = ['k', 'train_L', 'train_precision', 'train_recall', 'train_fscore',\
                                                 'test_L', 'test_precision', 'test_recall', 'test_fscore'])
Kfold_performance_df.round(decimals = 2)

### Print the best parameters 

In [None]:
best_parameters

### Creating a dataframe of the actual and predicted labels for the last fold

In [None]:
test_results_df = pd.DataFrame({'actual_dp': y[test_rows], 'predicted_dp': y_pred})

### Pandas groupby is used for grouping the data according to the categories and applying a function to the categories.

In [None]:
test_results_df.groupby('actual_dp').count()

### Paleo is easier to detect (higher accuracy). The lowest accuracy is for pescatarian and keto

In [None]:
test_results_df[test_results_df['actual_dp'] == test_results_df['predicted_dp']].groupby(['actual_dp']).size() 
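
The cell above counts the correct predictions per class. To compare classes of different sizes, those counts can be divided by the number of records in each class, which gives a per-class accuracy (equivalently, the recall of each class). The sketch below uses a made-up results table with the same two column names; the values are hypothetical.

```python
import pandas as pd

# Made-up results table with the same columns as test_results_df (hypothetical values).
demo_results = pd.DataFrame({
    "actual_dp":    ["paleo", "paleo", "keto",  "keto", "vegan", "vegan", "vegan"],
    "predicted_dp": ["paleo", "paleo", "vegan", "keto", "vegan", "keto",  "vegan"],
})

# True where the prediction matches the actual label.
correct = demo_results["actual_dp"] == demo_results["predicted_dp"]

# The mean of a boolean column is the share of True values, here per actual class.
per_class_accuracy = correct.groupby(demo_results["actual_dp"]).mean()
print(per_class_accuracy.sort_values(ascending=False))
```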

### Getting the important features

In [None]:
feature_importance_df = pd.DataFrame()
if model_type == "RF":
  feature_importance_df["feature"] = X_feature_list
  feature_importance_df["importance"] = model.feature_importances_
  feature_importance_df.sort_values(by=["importance"], ascending = False, inplace = True)
feature_importance_df.head(25)
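
`feature_importances_` is only available for the random forest, so the table above stays empty when `model_type` is "LR". One model-agnostic alternative is permutation importance, which measures how much the score drops when a single feature is shuffled. The sketch below runs it on synthetic data; the feature names and the `*_demo` variables are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class data with placeholder feature names (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42,
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the mean drop in accuracy.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=42)
perm_df = pd.DataFrame(
    {"feature": feature_names, "importance": result.importances_mean}
).sort_values(by="importance", ascending=False)
print(perm_df)
```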