# Customer Segmentation using Spark
Customer segmentation is a marketing technique companies use to identify and group users who display similar characteristics. In this project, I use Spark with Python (PySpark) to build a customer segmentation model.

## Dataset Used:
ecommerce-dataset

### Imports

In [2]:
from pyspark.sql import SparkSession

### 1. Creating a SparkSession
A SparkSession is the entry point to Spark's functionality and is required to build a DataFrame in PySpark.

In [3]:
spark = (
    SparkSession.builder
    .appName("Customer Segmentation Spark")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "10g")
    .getOrCreate()
)


The code above builds a Spark session and sets a name for the application. It also enables off-heap memory and manually caps it at 10 GB (`10g`), which allows Spark to use memory outside the JVM heap for certain operations and reduces garbage-collection pressure.

### 2. Creating a DataFrame using Spark

In [10]:
df = spark.read.csv('ecommerce_data.csv',header=True,escape="\"")

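Note that we set the escape character to a double quote so that quotation marks embedded in quoted fields (for example, item descriptions that also contain commas) are parsed as part of the field.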

In [11]:
df.show(5, truncate=False)

+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|Description                        |Quantity|InvoiceDate   |UnitPrice|CustomerID|Country       |
+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+
|536365   |85123A   |WHITE HANGING HEART T-LIGHT HOLDER |6       |12/1/2010 8:26|2.55     |17850     |United Kingdom|
|536365   |71053    |WHITE METAL LANTERN                |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84406B   |CREAM CUPID HEARTS COAT HANGER     |8       |12/1/2010 8:26|2.75     |17850     |United Kingdom|
|536365   |84029G   |KNITTED UNION FLAG HOT WATER BOTTLE|6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
|536365   |84029E   |RED WOOLLY HOTTIE WHITE HEART.     |6       |12/1/2010 8:26|3.39     |17850     |United Kingdom|
+---------+---------+-----------------------------------+--------+--------------+---------+----------+--------------+

The dataframe consists of 8 variables:  
InvoiceNo: The unique identifier of each customer invoice.  
StockCode: The unique identifier of each item in stock.  
Description: The item purchased by the customer.  
Quantity: The number of each item purchased by a customer in a single invoice.  
InvoiceDate: The purchase date.  
UnitPrice: Price of one unit of each item.  
CustomerID: Unique identifier assigned to each user.  
Country: The country from which the purchase was made.  

### 3. Exploratory Data Analysis

Counting the total number of rows in the dataset:

In [12]:
df.count()

541909

Counting the number of unique customers present in the DataFrame:

In [13]:
df.select('CustomerID').distinct().count()

4373

Finding the country from which the most purchases are made, using groupBy: