
In [None]:
# Setting the environment variables

In [None]:
import os
import sys
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook --no-browser"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/home/ec2-user/spark-2.4.4-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

In [None]:
# Spark environment
from pyspark import SparkConf
from pyspark.sql import SparkSession

In [None]:
MAX_MEMORY = "14G"

spark = SparkSession \
    .builder \
    .appName("demo") \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

spark

23/02/20 14:40:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/02/20 14:40:47 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


# Ecommerce Churn Assignment

The aim of this assignment is to build a model that predicts whether a customer purchases an item after adding it to the cart. Because this is a classification problem, you are expected to apply all three models covered so far. You must select the most robust model and provide a solution that predicts churn in the most suitable manner.

For this assignment, you are provided with data from an e-commerce company for October 2019. Your task is to first analyse the data and then carry out the steps of the model-building process.

The broad tasks are:
- Data Exploration
- Feature Engineering
- Model Selection
- Model Inference

### Data description

The dataset stores information about customer sessions on the e-commerce platform. It records each activity and its associated parameters.

- **event_time**: Date and time when user accesses the platform
- **event_type**: Action performed by the customer
    - View
    - Cart
    - Purchase
    - Remove from cart
- **product_id**: Unique number to identify the product in the event
- **category_id**: Unique number to identify the category of the product
- **category_code**: Stores primary and secondary categories of the product
- **brand**: Brand associated with the product
- **price**: Price of the product
- **user_id**: Unique ID for a customer
- **user_session**: Session ID for a user


### Initialising the SparkSession

The dataset provided is 5 GB in size, so you are expected to increase the driver memory above its default value. Refer to notebook 1 for the steps involved.

In [None]:
# Loading the clean data

df=spark.read.parquet("Cleaned_df_final_parquet.parquet")

                                                                                

In [None]:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.types import IntegerType, FloatType

# Bin the hour of day into four 6-hour buckets: [0,6), [6,12), [12,18), [18,24]
bucketizer = Bucketizer(splits=[0, 6, 12, 18, 24], inputCol="Hour", outputCol="Hour_binned")
# handleInvalid="keep" puts NaN values into an extra bucket instead of raising an error
df_buck = bucketizer.setHandleInvalid("keep").transform(df)

# Bucketizer outputs doubles; cast the bucket index to an integer
df_buck = df_buck.withColumn("Hour_binned", df_buck["Hour_binned"].cast(IntegerType()))

# Replace missing secondary categories with an explicit placeholder label
df_buck = df_buck.fillna(value="no category", subset=["category_2"])
df_buck = df_buck.withColumn("price", df_buck["price"].cast(FloatType()))
# Drop the columns that are not required to build the model
df_buck = df_buck.drop("category_code", "user_id", "product_id", "brand", "Hour", "category_id", "user_session")
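To make the binning concrete, here is a minimal pure-Python sketch of how the splits `[0, 6, 12, 18, 24]` map an hour to a bucket index. It follows the documented `Bucketizer` convention (each bucket is `[lower, upper)`, except the last, which also includes its upper bound). The helper name `hour_bucket` is hypothetical and is not part of PySpark.

```python
from bisect import bisect_right

SPLITS = [0, 6, 12, 18, 24]  # same split points as the Bucketizer above


def hour_bucket(hour, splits=SPLITS):
    """Return the bucket index for `hour`, following the Bucketizer convention.

    Buckets are half-open [lower, upper), except the last one, which is closed.
    """
    if hour < splits[0] or hour > splits[-1]:
        raise ValueError(f"hour {hour} is outside the splits")
    if hour == splits[-1]:
        return len(splits) - 2  # the top boundary belongs to the last bucket
    return bisect_right(splits, hour) - 1


for h in (0, 5, 6, 13, 23):
    print(h, "->", hour_bucket(h))
```

Hours 0 to 5 fall in bucket 0, 6 to 11 in bucket 1, 12 to 17 in bucket 2, and 18 to 23 in bucket 3.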

## Task 3: Model Selection
Three models are used for classification:
- Logistic Regression
- Decision Tree
- Random Forest

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
from pyspark.ml.feature import StringIndexer

# Feature transformation for categorical features:
# index each string column into a numeric "<column>_cat" column
indexed = df_buck
for col_name in ["event_type", "category_1", "category_2", "brand_new"]:
    indexer = StringIndexer(inputCol=col_name, outputCol=col_name + "_cat")
    indexed = indexer.fit(indexed).transform(indexed)
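By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index 0.0 (which is consistent with `view` mapping to 0.0 in the `output.show()` table further down). The pure-Python sketch below imitates that ordering on a small made-up list. The function name and sample values are hypothetical, and the alphabetical tie-break is an assumption.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (an assumption here).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up sample, not taken from the dataset
sample = ["view", "view", "view", "cart", "cart", "purchase"]
mapping = frequency_index(sample)
print(mapping)
```

The indices carry no numeric meaning of their own; they only encode the frequency rank of each label.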

                                                                                

In [None]:
indexed.columns

['event_type',
 'price',
 'category_1',
 'category_2',
 'brand_new',
 'target',
 'Hour_binned',
 'event_type_cat',
 'category_1_cat',
 'category_2_cat',
 'brand_new_cat']

In [None]:
# Vector assembler to combine the selected features into a single vector column
# (category_2_cat is indexed above but is not included in this list)
assembler = VectorAssembler(inputCols=[
 'price',
 'Hour_binned',
 'event_type_cat',
 'category_1_cat',
 'brand_new_cat'], outputCol="features")

In [None]:
output = assembler.transform(indexed)

In [None]:
output.show()

+----------+------+-----------+--------------------+---------+------+-----------+--------------+--------------+--------------+-------------+--------------------+
|event_type| price| category_1|          category_2|brand_new|target|Hour_binned|event_type_cat|category_1_cat|category_2_cat|brand_new_cat|            features|
+----------+------+-----------+--------------------+---------+------+-----------+--------------+--------------+--------------+-------------+--------------------+
|      view|341.74|electronics|         no category|   xiaomi|     0|          1|           0.0|           0.0|           0.0|          4.0|[341.739990234375...|
|      view| 36.04|no category|         no category| no brand|     0|          1|           0.0|           1.0|           0.0|          1.0|[36.0400009155273...|
|      view| 34.11|no category|         no category|   Others|     0|          1|           0.0|           1.0|           0.0|          0.0|[34.1100006103515...|
|      view| 63.06|no catego

                                                                                

In [None]:
# Check the two columns required to build the model:
# the assembled feature vector and the target
output.select("features","target").show()

+--------------------+------+
|            features|target|
+--------------------+------+
|[341.739990234375...|     0|
|[36.0400009155273...|     0|
|[34.1100006103515...|     0|
|[63.0600013732910...|     0|
|[341.910003662109...|     0|
|[362.339996337890...|     0|
|[341.910003662109...|     0|
|[392.380004882812...|     0|
|[339.279998779296...|     0|
|[448.839996337890...|     0|
|[283.119995117187...|     0|
|[225.229995727539...|     0|
|[283.119995117187...|     0|
|[225.229995727539...|     0|
|[228.470001220703...|     0|
|[283.119995117187...|     0|
|[952.030029296875...|     0|
|[196.910003662109...|     0|
|(5,[0,4],[153.979...|     0|
|(5,[0,4],[166.539...|     0|
+--------------------+------+
only showing top 20 rows



### Model 3: Random Forest

In [None]:
model_df = output.select("features","target")

In [None]:
# Splitting the data into train and test (Remember you are expected to compare the model later)
training_df, test_df = model_df.randomSplit([0.7,0.3])

In [None]:
# Number of rows in train and test data
training_df.count()

                                                                                

29690946

In [None]:
test_df.count()

                                                                                

12727598
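`randomSplit` assigns each row to a split at random, so the weights `[0.7, 0.3]` are targets rather than exact proportions, and without a `seed` argument the counts change from run to run. A quick arithmetic check on the two counts reported above:

```python
train_rows = 29_690_946  # training_df.count() reported above
test_rows = 12_727_598   # test_df.count() reported above

total_rows = train_rows + test_rows
train_fraction = train_rows / total_rows

print(f"total rows:     {total_rows}")
print(f"train fraction: {train_fraction:.4f}")
print(f"test fraction:  {1 - train_fraction:.4f}")
```

Passing a fixed seed, for example `model_df.randomSplit([0.7, 0.3], seed=100)`, would make the split repeatable, which helps when the three models are compared on the same data. The seed value here is only an illustration.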

In [None]:
from pyspark.ml.classification import RandomForestClassifier

In [None]:
# Building the RF model

rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'target', \
                            maxDepth=5, impurity='gini', numTrees=25, seed=100)

In [None]:
# Fitting the model over the training set
rfmodel = rf.fit(training_df)

23/02/20 15:02:35 WARN MemoryStore: Not enough space to cache rdd_77_3 in memory! (computed 770.4 MB so far)
23/02/20 15:02:35 WARN MemoryStore: Not enough space to cache rdd_77_2 in memory! (computed 770.4 MB so far)
23/02/20 15:02:35 WARN BlockManager: Persisting block rdd_77_3 to disk instead.
23/02/20 15:02:35 WARN BlockManager: Persisting block rdd_77_2 to disk instead.
23/02/20 15:02:35 WARN MemoryStore: Not enough space to cache rdd_77_0 in memory! (computed 770.4 MB so far)
23/02/20 15:02:35 WARN BlockManager: Persisting block rdd_77_0 to disk instead.
23/02/20 15:02:44 WARN MemoryStore: Not enough space to cache rdd_77_1 in memory! (computed 1159.0 MB so far)
23/02/20 15:02:44 WARN BlockManager: Persisting block rdd_77_1 to disk instead.
23/02/20 15:04:37 WARN BlockManager: Persisting block rdd_77_4 to disk instead.
23/02/20 15:04:44 WARN MemoryStore: Not enough space to cache rdd_77_5 in memory! (

In [None]:
# Printing the forest obtained from the model
print(rfmodel.toDebugString)

RandomForestClassificationModel (uid=RandomForestClassifier_0e0ed3d148ba) with 25 trees
  Tree 0 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 1 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 2 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 3 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 4 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 5 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 6 (weight 1.0):
    If (feature 2 in {1.0,2.0})
     Predict: 1.0
    Else (feature 2 not in {1.0,2.0})
     Predict: 0.0
  Tree 7 
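The `feature N` indices in the debug string refer to positions in the assembled vector, which follow the `inputCols` order given to the `VectorAssembler`. A small sketch of that lookup (the helper name `feature_name` is hypothetical):

```python
# Same order as the inputCols passed to the VectorAssembler earlier
FEATURE_COLS = ["price", "Hour_binned", "event_type_cat", "category_1_cat", "brand_new_cat"]


def feature_name(index, cols=FEATURE_COLS):
    """Translate a 'feature N' index from toDebugString into a column name."""
    return cols[index]


print(feature_name(2))
```

In the portion of the output shown, every tree is a single split on feature 2, that is, on `event_type_cat`.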

In [None]:
# Applying the model on test set
predictions = rfmodel.transform(test_df)

In [None]:
predictions

DataFrame[features: vector, target: int, rawPrediction: vector, probability: vector, prediction: double]

In [None]:
predictions.show()

+--------------------+------+--------------------+--------------------+----------+
|            features|target|       rawPrediction|         probability|prediction|
+--------------------+------+--------------------+--------------------+----------+
|(5,[0],[0.8799999...|     0|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     0|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     1|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     1|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     1|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     1|[16.9378453510291...|[0.67751381404116...|       0.0|
|(5,[0],[0.8799999...|     1|[16.9378453510291...|[0.67751381404116...|       0.0|
|       (5,[0],[1.0])|     0|[16.9378453510291...|[0.67751381404116...|       0.0|
|       (5,[0],[1.0])|     0|[16.9378453510291...|[0.67751381404116...|       0.0|
|   

                                                                                

In [None]:
# Printing the required columns
predictions.select('target', 'rawPrediction', 'prediction', 'probability').show(10)

+------+--------------------+----------+--------------------+
|target|       rawPrediction|prediction|         probability|
+------+--------------------+----------+--------------------+
|     0|[16.9378453510291...|       0.0|[0.67751381404116...|
|     0|[16.9378453510291...|       0.0|[0.67751381404116...|
|     1|[16.9378453510291...|       0.0|[0.67751381404116...|
|     1|[16.9378453510291...|       0.0|[0.67751381404116...|
|     1|[16.9378453510291...|       0.0|[0.67751381404116...|
|     1|[16.9378453510291...|       0.0|[0.67751381404116...|
|     1|[16.9378453510291...|       0.0|[0.67751381404116...|
|     0|[16.9378453510291...|       0.0|[0.67751381404116...|
|     0|[16.9378453510291...|       0.0|[0.67751381404116...|
|     0|[16.9378453510291...|       0.0|[0.67751381404116...|
+------+--------------------+----------+--------------------+
only showing top 10 rows
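For a random forest, `probability` is `rawPrediction` normalised to sum to 1. Each tree contributes a class-probability vector that sums to 1, so with `numTrees=25` dividing the displayed raw score by 25 should reproduce the displayed probability. A quick check using the values from the table (both are truncated in the output, so the comparison uses a tolerance):

```python
num_trees = 25                         # numTrees used when building the forest
raw_class0 = 16.9378453510291          # first rawPrediction entry (truncated in the output)
shown_prob_class0 = 0.67751381404116   # first probability entry (truncated in the output)

prob_class0 = raw_class0 / num_trees
print(prob_class0)
print(abs(prob_class0 - shown_prob_class0) < 1e-9)
```

Because the class-0 probability is above 0.5, the model predicts 0.0 for these rows, including the rows whose `target` is 1.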



                                                                                

#### Feature Transformation (the code is the same; check the columns)

In [None]:
# Check if only the required columns are present to build the model
# If not, drop the redundant columns


In [None]:
# Categorising the attributes into its type - Continuous and Categorical


In [None]:
# Feature transformation for categorical features


In [None]:
# Vector assembler to combine all the features


In [None]:
# Pipeline for the tasks
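One possible way to express the indexing and assembling steps used earlier as a single `Pipeline` is sketched below. The function and constant names are hypothetical, the column lists mirror the cells above, and the PySpark imports sit inside the function so that the helper names can be inspected without a running Spark session. Treat this as a sketch under those assumptions rather than the assignment's reference solution.

```python
CATEGORICAL_COLS = ["event_type", "category_1", "category_2", "brand_new"]
# Same feature list as the VectorAssembler defined earlier in the notebook
FEATURE_COLS = ["price", "Hour_binned", "event_type_cat", "category_1_cat", "brand_new_cat"]


def indexed_col(name):
    """Name of the indexed column produced for a categorical column."""
    return name + "_cat"


def build_feature_pipeline():
    """Chain the StringIndexers and the VectorAssembler into one Pipeline."""
    # Imported lazily so the helpers above work without a Spark installation
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    indexers = [
        StringIndexer(inputCol=c, outputCol=indexed_col(c)) for c in CATEGORICAL_COLS
    ]
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    return Pipeline(stages=indexers + [assembler])
```

With a live session, this could be applied as `output = build_feature_pipeline().fit(df_buck).transform(df_buck)`.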


In [None]:
# Transforming the dataframe df


In [None]:
# Schema of the transformed df


In [None]:
# Checking the elements of the transformed df - Top 20 rows


In [None]:
# Storing the transformed df in S3 bucket to prevent repetition of steps again


#### Train-test split

In [None]:
# Splitting the data into train and test (Remember you are expected to compare the model later)


In [None]:
# Number of rows in train and test data


#### Model Fitting

In [None]:
# Building the model with hyperparameter tuning
# Create ParamGrid for Cross Validation
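Cross-validation over a parameter grid fits one model per parameter combination per fold, plus a final refit of the best combination, which matters on a training set of roughly 29.7 million rows. The sketch below only counts the fits for a hypothetical grid. The parameter values and the fold count are illustrative assumptions, not settings taken from this notebook.

```python
from itertools import product

# Hypothetical grid: illustrative values only
param_grid = {
    "maxDepth": [5, 10],
    "numTrees": [25, 50],
}
num_folds = 3  # assumed number of cross-validation folds

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
cv_fits = len(combinations) * num_folds
total_fits = cv_fits + 1  # plus one refit of the best combination on all training data

for combo in combinations:
    print(combo)
print("fits during cross-validation:", cv_fits)
print("total fits including final refit:", total_fits)
```

In PySpark the same grid would typically be built with `ParamGridBuilder` and run with `CrossValidator` from `pyspark.ml.tuning`.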


In [None]:
# Run cross-validation steps


In [None]:
# Fitting the models on transformed df


In [None]:
# Best model from the results of cross-validation


#### Model Analysis

Required Steps:
- Apply the fitted model to the test data
- Performance analysis
    - Appropriate Metric with reasoning

In [None]:
# Model evaluation
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")

In [None]:

accuracy = evaluator.evaluate(predictions)

                                                                                

In [None]:
# Model Accuracy
print(accuracy)

0.6940859540032612


In [None]:
# Test Error
print("Test Error = %g" % (1.0 - accuracy))

Test Error = 0.305914


In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Uses the "predictions" DataFrame obtained by applying rfmodel to test_df above

# Evaluate weighted precision (per-class precision weighted by class frequency)
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedPrecision")
precision = evaluator.evaluate(predictions)

# Evaluate weighted recall (per-class recall weighted by class frequency)
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedRecall")
recall = evaluator.evaluate(predictions)


                                                                                

In [None]:
recall

0.6940859540032612

In [None]:
precision

0.7442418095503369
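The weighted recall reported above is identical to the accuracy (0.6940859540032612). That is expected: weighting each class's recall by its share of the true labels gives exactly the fraction of correct predictions. The sketch below verifies this on a made-up confusion matrix. The counts are hypothetical and are not this notebook's results.

```python
# Hypothetical confusion matrix: counts[true_label][predicted_label]
counts = {
    0: {0: 50, 1: 10},
    1: {0: 25, 1: 15},
}

labels = sorted(counts)
total = sum(sum(row.values()) for row in counts.values())

accuracy = sum(counts[c][c] for c in labels) / total

weighted_recall = 0.0
weighted_precision = 0.0
for c in labels:
    support = sum(counts[c].values())              # rows whose true label is c
    predicted = sum(counts[t][c] for t in labels)  # rows predicted as c
    recall_c = counts[c][c] / support
    precision_c = counts[c][c] / predicted if predicted else 0.0
    weighted_recall += (support / total) * recall_c
    weighted_precision += (support / total) * precision_c

print("accuracy:          ", accuracy)
print("weighted recall:   ", weighted_recall)
print("weighted precision:", weighted_precision)
```

Because of this identity, weighted recall adds no information beyond accuracy. If the classes are imbalanced, per-class precision and recall for the purchase class may be more informative.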

#### Summary of the best Random Forest model