
In [None]:
# Setting the environment variables

In [None]:
import os
import sys
# Python interpreter used by the Spark driver and the workers
os.environ["PYSPARK_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"]="notebook --no-browser"
# Local Java and Spark installations (these paths are specific to this machine)
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_161/jre"
os.environ["SPARK_HOME"] = "/home/ec2-user/spark-2.4.4-bin-hadoop2.7"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# Make the py4j and pyspark packages bundled with Spark importable
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
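
The paths set above are hard-coded for one machine, so a wrong value tends to surface only later as an import or JVM error. A minimal sketch of an up-front check, assuming the same variable names as above; `missing_paths` is a hypothetical helper and not part of PySpark:

```python
import os

def missing_paths(env, keys=("JAVA_HOME", "SPARK_HOME", "PYLIB")):
    """Return the keys that are unset in `env` or point to a path that does not exist."""
    return [k for k in keys if not os.path.exists(env.get(k, ""))]

# Report anything that is not set up on the current machine.
problems = missing_paths(os.environ)
if problems:
    print("Check these variables:", ", ".join(problems))
```

Running this before importing `pyspark` gives a clearer message than a failed import would.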

# Ecommerce Churn Assignment

The aim of this assignment is to build a model that predicts whether a customer purchases an item after adding it to the cart. Since this is a classification problem, you are expected to apply your understanding of all three models covered so far. You must select the most robust model and use it to predict churn.

For this assignment, you are provided with data from an e-commerce company for October 2019. Your task is to first analyse the data and then carry out the steps of the model-building process.

The broad tasks are:
- Data Exploration
- Feature Engineering
- Model Selection
- Model Inference

### Data description

The dataset stores information about customer sessions on the e-commerce platform. Each record captures one activity and its associated parameters.

- **event_time**: Date and time at which the user accesses the platform
- **event_type**: Action performed by the customer
  - View
  - Cart
  - Purchase
  - Remove from cart
- **product_id**: Unique number to identify the product in the event
- **category_id**: Unique number to identify the category of the product
- **category_code**: Stores primary and secondary categories of the product
- **brand**: Brand associated with the product
- **price**: Price of the product
- **user_id**: Unique ID for a customer
- **user_session**: Session ID for a user
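
The prediction target implied by the task statement can be illustrated on a few toy events. This is only a sketch under an assumption: a carted product counts as purchased when the same `user_session` and `product_id` also have a purchase event. The records, the lowercase event type strings, and the `label_cart_events` helper are invented for illustration; the cleaned data loaded later in this notebook already carries an `is_purchased` column, and how it was derived is not shown here.

```python
def label_cart_events(events):
    """Label each carted (user_session, product_id) pair: 1 if also purchased, else 0."""
    purchased = {
        (e["user_session"], e["product_id"])
        for e in events
        if e["event_type"] == "purchase"
    }
    labels = {}
    for e in events:
        if e["event_type"] == "cart":
            key = (e["user_session"], e["product_id"])
            labels[key] = int(key in purchased)
    return labels

# Toy events, invented for illustration only.
events = [
    {"user_session": "s1", "product_id": 101, "event_type": "cart"},
    {"user_session": "s1", "product_id": 101, "event_type": "purchase"},
    {"user_session": "s1", "product_id": 102, "event_type": "cart"},
    {"user_session": "s2", "product_id": 101, "event_type": "view"},
    {"user_session": "s2", "product_id": 101, "event_type": "cart"},
]
labels = label_cart_events(events)
print(labels)
```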


### Initialising the SparkSession

The dataset provided is 5 GB in size, so you are expected to increase the driver memory beyond the default. Refer to notebook 1 for the steps involved.

In [None]:
# initialising the session with 14 GB driver memory
from pyspark import SparkConf
from pyspark.sql import SparkSession

MAX_MEMORY = "14G"

spark = SparkSession \
    .builder \
    .appName("inference") \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()

spark.catalog.clearCache()
spark
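
As a rough sanity check on the memory setting, the requested driver memory can be compared with the dataset size. This is a sketch only: `to_bytes` and `has_headroom` are hypothetical helpers, the factor of 2 is an arbitrary illustration, and on-disk size is only a loose proxy for the memory the data needs once loaded.

```python
# Binary multipliers for single-letter memory suffixes such as "14G" or "512m".
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def to_bytes(mem):
    """Convert a memory string with a single-letter suffix (e.g. '14G') to bytes."""
    mem = mem.strip().lower()
    return int(mem[:-1]) * UNITS[mem[-1]]

def has_headroom(driver_memory, dataset_gb, factor=2.0):
    """True if the driver memory is at least `factor` times the dataset size."""
    return to_bytes(driver_memory) >= factor * dataset_gb * 1024**3

print(has_headroom("14G", 5))
```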

<hr>

## Task 4: Model Inference

- Feature Importance
- Model Inference
- Feature Exploration

In [None]:
# Loading the saved feature importances
df_feature_imp = spark.read.parquet('feature_importances.parquet')
df_feature_imp.show()

+---+--------------------+-----+
|idx|                name|score|
+---+--------------------+-----+
|  0|               price|  0.0|
| 64|       cat_l2_en_ski|  0.0|
| 74| brand_red_en_xiaomi|  0.0|
| 73|  brand_red_en_apple|  0.0|
| 72|brand_red_en_samsung|  0.0|
| 71| brand_red_en_others|  0.0|
| 70|    cat_l2_en_shorts|  0.0|
| 69|     cat_l2_en_skirt|  0.0|
| 68| cat_l2_en_furniture|  0.0|
| 67|      cat_l2_en_belt|  0.0|
| 66|      cat_l2_en_sock|  0.0|
| 65|     cat_l2_en_scarf|  0.0|
| 63|    cat_l2_en_tennis|  0.0|
| 51|      cat_l2_en_fmcg|  0.0|
| 62|    cat_l2_en_jumper|  0.0|
| 61| cat_l2_en_snowboard|  0.0|
| 60|  cat_l2_en_umbrella|  0.0|
| 59|cat_l2_en_lawn_mower|  0.0|
| 58|cat_l2_en_cultivator|  0.0|
| 57|  cat_l2_en_cartrige|  0.0|
+---+--------------------+-----+
only showing top 20 rows
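
`show()` prints the first 20 rows in stored order, not in order of score. A sketch of ranking the features instead, assuming the table is small enough to collect to the driver (one row per feature); `top_features` is a hypothetical helper, and the sample rows and their scores are invented rather than taken from the real file:

```python
def top_features(rows, n=10):
    """Return the n rows with the highest score.

    `rows` can be dicts or PySpark Row objects with 'name' and 'score' fields.
    """
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:n]

# Invented sample rows, for illustration only.
sample = [
    {"idx": 0, "name": "price", "score": 0.25},
    {"idx": 64, "name": "cat_l2_en_ski", "score": 0.0},
    {"idx": 73, "name": "brand_red_en_apple", "score": 0.5},
]
for row in top_features(sample, n=2):
    print(row["name"], row["score"])
```

With the real data this could be called as `top_features(df_feature_imp.collect())`; the Spark-side equivalent is `df_feature_imp.orderBy("score", ascending=False).show()`.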



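#### As seen above, the price column is listed first, followed by a second-level category and then the brand columns

Note that every row displayed has a score of 0.0, so this listing reflects the stored row order rather than a ranking by importance.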
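
Because the categorical columns are one-hot encoded, comparing price with the category or brand as a whole means summing the scores of their encoded columns. A sketch, assuming the `_en_` infix separates the source column from the category value (as in `cat_l2_en_ski`); `group_scores` is a hypothetical helper and the scores below are invented:

```python
from collections import defaultdict

def group_scores(rows):
    """Sum importance scores per source column, using the text before '_en_' as the key."""
    totals = defaultdict(float)
    for r in rows:
        source = r["name"].split("_en_", 1)[0]
        totals[source] += r["score"]
    return dict(totals)

# Invented scores, for illustration only.
rows = [
    {"name": "price", "score": 0.5},
    {"name": "cat_l2_en_ski", "score": 0.25},
    {"name": "cat_l2_en_belt", "score": 0.125},
    {"name": "brand_red_en_apple", "score": 0.125},
]
grouped = group_scores(rows)
print(grouped)
```

Names without the infix, such as `price`, are kept as their own group.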

In [None]:
# Loading the cleaned data and inspecting the rows of the 'ski' second-level category
df = spark.read.parquet('cleaned_df.parquet')
df.filter("cat_l2 == 'ski'").show()

+----------+-------------------+------+---------+--------------------+-----------+------+------+-----------+---------+------------+
|product_id|        category_id| price|  user_id|        user_session|day_of_week|cat_l1|cat_l2|hour_bucket|brand_red|is_purchased|
+----------+-------------------+------+---------+--------------------+-----------+------+------+-----------+---------+------------+
|  21500029|2053013559255761525|166.07|512872982|e38877a3-a4c4-437...|          1| sport|   ski|        3.0|   others|           0|
|  20500641|2053013559348036219| 53.28|517091325|8f8c5748-a292-4dc...|          3| sport|   ski|        2.0|   others|           0|
|  20500356|2053013559348036219|501.69|513127542|16ddbe0f-3894-4cc...|          4| sport|   ski|        2.0|   others|           0|
|  20500014|2053013559348036219|568.61|519809800|7c993532-b6bf-492...|          6| sport|   ski|        1.0|   others|           0|
|  21500036|2053013559255761525|109.18|538042700|a8abb968-1ac6-412...|      