# First App Hosted Machine Learning Workflow Pipeline 
### Predicting Likelihood of Customer Purchase


## Table of contents

- [00. Project Overview](#overview-main)
    - [Context](#overview-context)
    - [Actions](#overview-actions)
    - [Results](#overview-results)
    - [Considerations/Next Steps](#overview-growth)
- [01. Data Overview](#data-overview)
- [02. Data Import and Preprocessing](#eda-overview)
- [03. Logistic Regression](#logreg-title)
- [04. Random Forest](#rf-title)
- [05. Prediction](#prediction)
- [06. Front-End Logic](#frontendcode)
- [07. Front-End Repo](#frontrepo)

## Project Overview <a name="overview-main"></a>

### Context <a name="overview-context"></a>

This is my first end-to-end pipeline. The goal is to host a simple Machine Learning (ML) model on a website where a user fills in a few fields and receives the predicted likelihood of a customer purchasing a product.

This project demonstrates how seamlessly and quickly a recommendation can be made to a user, much like the recommendations on shopping sites such as Amazon, Apple, or Temu. It also sheds light, at an approachable level, on the work required behind the scenes.

The project covers the work needed to extract, transform, and load (ETL) the data, clean it, choose a suitable ML model, validate the results, and implement the web hosting so that a user can enter their own values.

The model works to **predict the likelihood a customer will purchase a product given a small set of parameters**. This information could then be used to suggest that product to the customer.


### Actions <a name="overview-actions"></a>

1. Gather the data 
2. Scrub and prepare the data
3. Identify ML tools that would work best with this problem
    - Logistic Regression
    - Decision Tree
    - Random Forest
4. Assess ML outputs (and rework wherever possible) to arrive at defensible scores
5. Choose a website hosting platform (Streamlit)
6. Perform front-end work for user
7. Assess hosted website

This notebook goes through all those steps.




### Results <a name="overview-results"></a>

I successfully built and saved a model that predicts the chance of a customer purchasing a product (as a percentage) and hosted it with Streamlit by pointing it at a GitHub repo that houses the model and the front-end logic. An end user can obtain the results in a format more digestible than a Jupyter Notebook.

HOSTED WEBSITE:
https://emanuel071-ml-stremlit-web-app-dsi-streamlit-web-app-vitmyo.streamlit.app/

HOSTED REPO:
https://github.com/Emanuel071/ML_stremlit_web_app

Random Forest was chosen based on previous ML work and its past predictive performance there. Both models reached the same accuracy on the 20-row test set:

- Logistic Regression Accuracy: 0.85
- Random Forest Accuracy: 0.85


### Considerations/Next Steps <a name="overview-growth"></a>



Although the hosted website is simple, it helps demonstrate how large companies such as Amazon and Apple can quickly suggest products to a customer browsing their online store. Building this project made it evident how far the approach can go with more robust models and front-end work, and how larger organizations run the same kind of process at a much greater scale.

From a data point of view, further variables could be collected and further feature engineering undertaken, so that as much useful information as possible is available for predicting the likelihood of purchase.

Other considerations:

1. Normally we would spend more time on the rows removed due to **NA** values, on understanding the data, and on bringing in more parameters. The decision was made to simply run with the data, for the sole purpose of building a pipeline from start to finish and learning how Streamlit, the GitHub repo location of the saved model, and the model creation fit together.

2. Further work should have been performed to choose the most appropriate machine learning model rather than simply picking Random Forest; see the justification in (1).

3. Further front-end work could have been performed to make the user experience more engaging. This was the first project I hosted on a website; future projects will be more sophisticated :).
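
One way to make the model choice in (2) less arbitrary is k-fold cross-validation, because a single 20-row test split is coarse (each row is worth 5 points of accuracy). The sketch below is only an illustration on synthetic data: the `make_synthetic` helper, its made-up labelling rule, and the 5-fold setting are assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_synthetic(n=200, seed=42):
    """Hypothetical stand-in for pipeline_data.csv (same column names only)."""
    rng = np.random.default_rng(seed)
    age = rng.integers(18, 80, size=n).astype(float)
    credit = rng.integers(0, 1000, size=n).astype(float)
    gender = pd.Series(rng.choice(["M", "F"], size=n), dtype=object)
    # Made-up rule so the models have some signal to learn.
    purchase = ((credit < 400) | (age < 30)).astype(int)
    X = pd.DataFrame({"age": age, "gender": gender, "credit_score": credit})
    return X, pd.Series(purchase, name="purchase")


def build_pipeline(classifier):
    """Same preprocessing layout as the notebook, with a pluggable classifier."""
    numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
    categorical = Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="U")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessing = ColumnTransformer([
        ("numeric", numeric, ["age", "credit_score"]),
        ("categorical", categorical, ["gender"]),
    ])
    return Pipeline([("preprocessing_pipeline", preprocessing),
                     ("classifier", classifier)])


X, y = make_synthetic()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "logistic_regression": LogisticRegression(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(build_pipeline(estimator), X, y, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean and spread across folds give a steadier basis for comparison than one split; the numbers printed here describe the synthetic data only, not the project's dataset.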

## Data Overview <a name="data-overview"></a>

We will be predicting the likelihood that a customer will purchase a product (as a percentage) given three parameters. The model is trained on the dataset with the columns shown below.

After pre-processing the data in Python, the dataset for modelling contains the following information:
<br>
<br>

| **Variable Name** | **Variable Type** | **Description** |
|---|---|---|
| purchase | Dependent | Whether the customer purchased (1) or not (0); the model predicts the likelihood of purchase |
| age | Independent | Customer's age |
| gender | Independent | The gender provided by the customer |
| credit_score | Independent | The customer's most recent credit score |


## Imports and Preprocessing <a name="eda-overview"></a>

In [6]:
################################
#Pipelines - Basic Template
################################
# Import required Python packages
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [7]:
# Import sample data
data_path = "./Saved_files/pipeline_data.csv"
sample_data = pd.read_csv(data_path)
print(sample_data.shape)
sample_data.head()

(100, 4)


|  | purchase | age | gender | credit_score |
|---|---|---|---|---|
| 0 | 0 | 47.0 | F | 309.0 |
| 1 | 1 | 18.0 | F | 230.0 |
| 2 | 1 | 25.0 | M | 92.0 |
| 3 | 0 | 38.0 | M | 486.0 |
| 4 | 1 | 38.0 | M | 236.0 |


In [8]:
# Split data into input and output objects
X = sample_data.drop(['purchase'], axis=1)
Y = sample_data['purchase']
print(X.shape)
print(Y.shape)
X.head()

(100, 3)
(100,)


|  | age | gender | credit_score |
|---|---|---|---|
| 0 | 47.0 | F | 309.0 |
| 1 | 18.0 | F | 230.0 |
| 2 | 25.0 | M | 92.0 |
| 3 | 38.0 | M | 486.0 |
| 4 | 38.0 | M | 236.0 |


In [9]:
# Split data into training and test data
x_train, x_test, y_train, y_test = train_test_split(X,
                                                    Y, 
                                                    test_size=.2,
                                                    random_state=42,
                                                    stratify=Y)
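
The `stratify=Y` argument keeps the class proportions the same in the training and test sets. A minimal sketch of that effect, using made-up labels (30 purchases out of 100 rows is an assumption for illustration, not the balance of the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 positives and 70 negatives.
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Both splits keep the 30% positive rate of the full data.
print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```

Without `stratify`, a small test set of 20 rows can end up with a noticeably different share of purchases than the training set.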

In [10]:
# Specify numeric and categorical features
# (missing values are handled later by the imputers in the pipelines)
numeric_features  = ['age', 'credit_score']
categorical_features = ['gender']

In [11]:
################################
# Set up Pipelines
################################

# Numerical Feature Transformer
numeric_transformer =  Pipeline(steps = [('imputer',SimpleImputer()),
                                         ('scaler',StandardScaler())])
print(numeric_transformer)

Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler())])
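
To see what the numeric transformer does, the sketch below runs the same two steps on a tiny made-up column (the four values are assumptions for illustration). `SimpleImputer()` defaults to filling missing values with the column mean, and `StandardScaler()` then rescales to zero mean and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up single-column input with one missing value.
ages = np.array([[10.0], [20.0], [np.nan], [30.0]])

# Step 1 on its own: the NaN becomes the mean of the observed values (20.0).
imputed = SimpleImputer().fit_transform(ages)
print(imputed.ravel())

# Steps 1 and 2 together, as in the notebook's numeric_transformer.
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer()),
                                      ("scaler", StandardScaler())])
scaled = numeric_transformer.fit_transform(ages)
print(scaled.ravel())
```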


In [12]:
# Categorical Feature Transformer
categorical_transformer =  Pipeline(steps = [('imputer',SimpleImputer(strategy='constant',
                                                                      fill_value='U')), # a missing gender is filled with 'U' (unknown)
                                            ('ohe',OneHotEncoder(handle_unknown='ignore'))]) # unseen categories (e.g. 'N' or 'L') are encoded as all zeros instead of raising an error
print(categorical_transformer)

Pipeline(steps=[('imputer', SimpleImputer(fill_value='U', strategy='constant')),
                ('ohe', OneHotEncoder(handle_unknown='ignore'))])
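
The two comments on the categorical transformer can be checked on a small example. In this sketch the training and new values are made up for illustration: a missing gender is filled with `'U'`, and a category never seen during fitting (`'X'` here) is encoded as a row of zeros.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="U")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up data: the training column contains one missing value.
train = pd.DataFrame({"gender": pd.Series(["M", "F", np.nan], dtype=object)})
new = pd.DataFrame({"gender": pd.Series(["F", "X", np.nan], dtype=object)})

categorical_transformer.fit(train)
categories = categorical_transformer.named_steps["ohe"].categories_[0].tolist()
encoded = categorical_transformer.transform(new).toarray()

print(categories)  # learned categories after imputation
print(encoded)     # one row per new value, one column per learned category
```

Note that the `'U'` column only exists because the training column had a missing value; if training data has no missing genders, a missing value at prediction time is treated like any other unseen category.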


In [13]:
# Preprocessing Pipeline
preprocessing_pipeline = ColumnTransformer(transformers = [('numeric',numeric_transformer, numeric_features),
                                                           ('categorical', categorical_transformer,categorical_features)] )
print(preprocessing_pipeline)

ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('imputer', SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['age', 'credit_score']),
                                ('categorical',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(fill_value='U',
                                                                strategy='constant')),
                                                 ('ohe',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['gender'])])


## Logistic Regression <a name="logreg-title"></a>

In [14]:
################################
# Apply the Pipeline 
################################
# Logistic Regression
clf = Pipeline(steps=[('preprocessing_pipeline',preprocessing_pipeline),
                      ('classifier', LogisticRegression(random_state=42))])
clf.fit(x_train,y_train)

In [15]:
y_pred_class = clf.predict(x_test)
print(y_pred_class)

[1 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1]


In [16]:
accuracy_score(y_test,y_pred_class)

0.85

## Random Forest <a name="rf-title"></a>

In [18]:
# random forest
clf2 = Pipeline(steps=[('preprocessing_pipeline',preprocessing_pipeline),
                      ('classifier', RandomForestClassifier(random_state=42))])
clf2.fit(x_train,y_train)

In [19]:
y_pred_class2 = clf2.predict(x_test)
print(y_pred_class2)

[1 1 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1]


In [20]:
accuracy_score(y_test,y_pred_class2)

0.85
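
The two accuracy scores match because the two models made exactly the same predictions on the test set. The short check below copies the prediction arrays printed above; the true labels are not shown in this document, so only agreement between the models is computed.

```python
import numpy as np

# Predictions copied from the printed outputs of the two models above.
logreg_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
forest_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1])

agreement = (logreg_pred == forest_pred).mean()
n_test = len(logreg_pred)
n_errors = round((1 - 0.85) * n_test)  # accuracy 0.85 on 20 rows

print(f"models agree on {agreement:.0%} of the {n_test} test rows")
print(f"an accuracy of 0.85 corresponds to {n_errors} misclassified rows")
```

With only 20 test rows, identical scores say little about which model generalizes better.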

In [21]:
###################################
# save pipeline
###################################
import joblib
joblib.dump(clf2,'Saved_files/model.joblib')

['Saved_files/model.joblib']
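
A quick round-trip check can confirm that a saved pipeline predicts the same as the in-memory one. This sketch uses a small stand-in pipeline on random data and a temporary directory, all of which are assumptions for illustration rather than the project's actual model or paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline (not the project's model).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression(random_state=42))]).fit(X, y)

# Save to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print("round trip preserved predictions:", same_predictions)
```

Two practical points: `joblib.dump` does not create missing folders, so `Saved_files/` has to exist beforehand, and a saved pipeline is best loaded with the same scikit-learn version that produced it.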

## Prediction <a name="prediction"></a>

In [22]:
###################################################
# import pipeline object and predict on new data
###################################################

# Import required Python packages
import joblib
import pandas as pd
import numpy as np

In [23]:
# import pipeline 
clf = joblib.load('Saved_files/model.joblib')

In [24]:
# create new data frame
new_data = pd.DataFrame({"age" : [25, np.nan, 50],
                         "gender": ["M", "F", np.nan], 
                         "credit_score" : [200, 100, 500]})

In [25]:
# Pass new data in and receive predictions
clf.predict(new_data)

array([1, 0, 0], dtype=int64)

As shown above, we receive the predicted classes for three different customers, including rows with a missing age or gender.

The front-end code below, saved in a .py file, hosts the model created above behind simple user-facing logic.
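
The hosted app reports a probability rather than a class, which comes from `predict_proba`. The sketch below shows how its columns line up with `classes_`. It trains the same pipeline layout on synthetic data, so the training frame and its labelling rule are assumptions and the printed probabilities will not match the real model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training data (hypothetical; only the column names match the project).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "gender": pd.Series(rng.choice(["M", "F"], size=n), dtype=object),
    "credit_score": rng.integers(0, 1000, size=n).astype(float),
})
target = (train["credit_score"] < 400).astype(int)  # made-up rule

preprocessing = ColumnTransformer([
    ("numeric", Pipeline([("imputer", SimpleImputer()),
                          ("scaler", StandardScaler())]), ["age", "credit_score"]),
    ("categorical", Pipeline([
        ("imputer", SimpleImputer(strategy="constant", fill_value="U")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender"]),
])
clf = Pipeline([("preprocessing_pipeline", preprocessing),
                ("classifier", RandomForestClassifier(random_state=42))])
clf.fit(train, target)

# Same three customers as above, including the missing values.
new_data = pd.DataFrame({"age": [25, np.nan, 50],
                         "gender": pd.Series(["M", "F", np.nan], dtype=object),
                         "credit_score": [200, 100, 500]})

proba = clf.predict_proba(new_data)          # one column per entry in clf.classes_
positive_col = list(clf.classes_).index(1)   # column holding P(purchase = 1)
purchase_proba = proba[:, positive_col]
print(clf.classes_, purchase_proba)
```

Because the classes are `0` and `1`, column 1 holds the purchase probability, which is what indexing with `[0][1]` on a single-row input picks out.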

## Front-End Logic <a name="frontendcode"></a>

In [None]:
# title - what this app actually is 

# how to use
# 1. Age 
# 2. Gender 
# 3. credit score 
# Submit button 
import streamlit as st
import pandas as pd 
import joblib

# load our model pipeline object 
# model = joblib.load('C:/Users/Emanuel/Documents/GitHub/MachineLearningWork/Projects/Streamlit/model.joblib')
model = joblib.load('model.joblib')


# add title and instructions 
st.title('Purchase Prediction Model')
st.subheader('Enter customer information and submit for likelihood to purchase')

# age input form 

age = st.number_input(
    label="01. Enter the customer's age",
    min_value= 18,
    max_value=120,
    value = 35  
    )

# gender input form
gender = st.radio(
    label= "02. Enter the customer's gender",
    options= ['M','F']
)
# credit score input form 
credit_score = st.number_input(
    label="03. Enter the customer's credit score",
    min_value= 0,
    max_value=1000,
    value = 500 
    )

# Submit to model
if st.button("Submit For Prediction"):
    # store our data in a data frame for prediction 
    new_data = pd.DataFrame({"age": [age],
                            "gender":[gender],
                            "credit_score":[credit_score]})
    # apply model pipeline to the input data and extract probability prediction
    pred_proba = model.predict_proba(new_data)[0][1]

    # output prediction 
    st.subheader(f"Based on this customer's attributes, our model predicts a purchase probability of {pred_proba:.0%}")
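
One possible refinement, sketched below, is to pull the data handling out of the Streamlit callbacks so it can be tested without running the app. The helper names `build_input_frame` and `format_prediction` and the `StubModel` class are hypothetical; they are not part of the hosted code.

```python
import numpy as np
import pandas as pd


def build_input_frame(age, gender, credit_score):
    """Hypothetical helper: one-row frame with the column names the pipeline expects."""
    return pd.DataFrame({"age": [age],
                         "gender": [gender],
                         "credit_score": [credit_score]})


def format_prediction(pred_proba):
    """Hypothetical helper: render a probability as a whole-number percentage."""
    return f"Predicted purchase probability: {pred_proba:.0%}"


class StubModel:
    """Stand-in for the loaded pipeline; always returns a fixed probability."""

    def predict_proba(self, frame):
        return np.array([[0.25, 0.75]] * len(frame))


frame = build_input_frame(35, "M", 500)
message = format_prediction(StubModel().predict_proba(frame)[0][1])
print(message)
```

In the app, the button handler would then call these helpers with the real model loaded from `model.joblib` in place of the stub.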


## Front-End Repo <a name="frontrepo"></a>

The GitHub repo used for hosting can be found below, and also in the Results section above.

https://github.com/Emanuel071/ML_stremlit_web_app