# Customer Purchase Prediction & Effect of Micro-Numerosity

#### **Description**:

Customer purchase prediction uses machine learning to predict whether a customer will make a purchase from features such as age, gender, education, and review rating. The Effect of Micro-Numerosity, in this context, refers to how small variations in these features can influence purchasing behavior. By analyzing these attributes, a model can identify patterns and correlations that might not be apparent through traditional analysis methods.

#### **Key Features**:

1. **Age**: Different age groups may exhibit different purchasing behaviors. Younger customers might be more inclined towards trendy products, while older customers may prefer quality and reliability.

2. **Gender**: Gender-based preferences can significantly affect purchasing patterns. For instance, men and women might prioritize different aspects of a product.

3. **Education**: Education level can influence purchasing decisions, with more educated customers potentially focusing on the value and features of a product.

4. **Review**: Customer reviews play a crucial role in the decision-making process. Positive reviews can drive purchases, while negative reviews can deter potential buyers.

5. **Purchased**: The recorded purchase outcome for each customer (`Yes` or `No`). This column is the target label that the model is trained to predict.

In [0]:
dbutils.fs.ls("dbfs:/FileStore/tables/")

[FileInfo(path='dbfs:/FileStore/tables/Admission_Chance.csv', name='Admission_Chance.csv', size=12905, modificationTime=1720190058000),
 FileInfo(path='dbfs:/FileStore/tables/Cancer.csv', name='Cancer.csv', size=125204, modificationTime=1720190099000),
 FileInfo(path='dbfs:/FileStore/tables/Credit_Default.csv', name='Credit_Default.csv', size=101152, modificationTime=1720190106000),
 FileInfo(path='dbfs:/FileStore/tables/Customer_Purchase.csv', name='Customer_Purchase.csv', size=1489, modificationTime=1720190113000),
 FileInfo(path='dbfs:/FileStore/tables/Fish.csv', name='Fish.csv', size=6349, modificationTime=1720190119000),
 FileInfo(path='dbfs:/FileStore/tables/Ice_Cream.csv', name='Ice_Cream.csv', size=4872, modificationTime=1720190124000),
 FileInfo(path='dbfs:/FileStore/tables/Test1.csv', name='Test1.csv', size=108, modificationTime=1720158698000),
 FileInfo(path='dbfs:/FileStore/tables/Test2.csv', name='Test2.csv', size=192, modificationTime=1720158698000),
 FileInfo(path='dbfs:

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
spark = SparkSession.builder.appName('Customer Purchase Prediction & Effect of Micro-Numerosity').getOrCreate() 

In [0]:
spark

In [0]:

# Load the customer purchase CSV, using the first row as the header and inferring column types
df_pyspark = spark.read.csv('dbfs:/FileStore/tables/Customer_Purchase.csv', header=True, inferSchema=True)

In [0]:
df_pyspark.printSchema()

root
 |-- Customer ID: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Review: string (nullable = true)
 |-- Purchased: string (nullable = true)



In [0]:
df_pyspark

DataFrame[Customer ID: int, Age: int, Gender: string, Education: string, Review: string, Purchased: string]

In [0]:
df_pyspark.show()

+-----------+---+------+---------+-------+---------+
|Customer ID|Age|Gender|Education| Review|Purchased|
+-----------+---+------+---------+-------+---------+
|       1021| 30|Female|   School|Average|       No|
|       1022| 68|Female|       UG|   Poor|       No|
|       1023| 70|Female|       PG|   Good|       No|
|       1024| 72|Female|       PG|   Good|       No|
|       1025| 16|Female|       UG|Average|       No|
|       1026| 31|Female|   School|Average|      Yes|
|       1027| 18|  Male|   School|   Good|       No|
|       1028| 60|Female|   School|   Poor|      Yes|
|       1029| 65|Female|       UG|Average|       No|
|       1030| 74|  Male|       UG|   Good|      Yes|
|       1031| 98|Female|       UG|   Good|      Yes|
|       1032| 74|  Male|       UG|   Good|      Yes|
|       1033| 51|  Male|   School|   Poor|       No|
|       1034| 57|Female|   School|Average|       No|
|       1035| 15|  Male|       PG|   Poor|      Yes|
|       1036| 75|  Male|       UG|   Poor|    

In [0]:
# Drop any rows with null values (if any)
df_cleaned = df_pyspark.dropna()

In [0]:
# Convert categorical variables to numerical using StringIndexer
gender_indexer = StringIndexer(inputCol='Gender', outputCol='Gender_index')
education_indexer = StringIndexer(inputCol='Education', outputCol='Education_index')
review_indexer = StringIndexer(inputCol='Review', outputCol='Review_index')
purchased_indexer = StringIndexer(inputCol='Purchased', outputCol='Purchased_index')

df_indexed = gender_indexer.fit(df_cleaned).transform(df_cleaned)
df_indexed = education_indexer.fit(df_indexed).transform(df_indexed)
df_indexed = review_indexer.fit(df_indexed).transform(df_indexed)
df_indexed = purchased_indexer.fit(df_indexed).transform(df_indexed)
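
By default, `StringIndexer` assigns indices by descending label frequency, so the numbers it produces carry no natural order (for example, an `Education_index` of 0.0 does not mean "lowest education"). The plain-Python sketch below illustrates that rule on a hypothetical toy column. The helper name `frequency_desc_index` and the alphabetical tie-break are assumptions of this sketch, not something the notebook output confirms.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct string to a float index, most frequent label first.

    Ties are broken alphabetically here; that tie-break is an assumption
    made for this illustration.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column, not the notebook's actual data.
reviews = ["Good", "Good", "Good", "Poor", "Poor", "Average"]
mapping = frequency_desc_index(reviews)
print(mapping)  # {'Good': 0.0, 'Poor': 1.0, 'Average': 2.0}
```

Because a tree-based model splits on these indices as if they were ordered numbers, it may be worth checking the mapping before interpreting any split on an indexed column.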

In [0]:
df_indexed.show()

+-----------+---+------+---------+-------+---------+------------+---------------+------------+---------------+
|Customer ID|Age|Gender|Education| Review|Purchased|Gender_index|Education_index|Review_index|Purchased_index|
+-----------+---+------+---------+-------+---------+------------+---------------+------------+---------------+
|       1021| 30|Female|   School|Average|       No|         0.0|            1.0|         2.0|            0.0|
|       1022| 68|Female|       UG|   Poor|       No|         0.0|            2.0|         1.0|            0.0|
|       1023| 70|Female|       PG|   Good|       No|         0.0|            0.0|         0.0|            0.0|
|       1024| 72|Female|       PG|   Good|       No|         0.0|            0.0|         0.0|            0.0|
|       1025| 16|Female|       UG|Average|       No|         0.0|            2.0|         2.0|            0.0|
|       1026| 31|Female|   School|Average|      Yes|         0.0|            1.0|         2.0|            1.0|
|

In [0]:
# Assemble the feature columns into a single vector column for RandomForestClassifier
feature_columns = ['Age', 'Gender_index', 'Education_index', 'Review_index']
assembler = VectorAssembler(inputCols=feature_columns, outputCol='features')
df_assembled = assembler.transform(df_indexed)

# Copy the already-numeric target (Purchased_index) into a 'label' column
df_assembled = df_assembled.withColumn('label', col('Purchased_index'))

# Select final data for model training
data = df_assembled.select('features', 'label')

# Show schema of final prepared data
data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)



In [0]:
data.show()

+------------------+-----+
|          features|label|
+------------------+-----+
|[30.0,0.0,1.0,2.0]|  0.0|
|[68.0,0.0,2.0,1.0]|  0.0|
|    (4,[0],[70.0])|  0.0|
|    (4,[0],[72.0])|  0.0|
|[16.0,0.0,2.0,2.0]|  0.0|
|[31.0,0.0,1.0,2.0]|  1.0|
|[18.0,1.0,1.0,0.0]|  0.0|
|[60.0,0.0,1.0,1.0]|  1.0|
|[65.0,0.0,2.0,2.0]|  0.0|
|[74.0,1.0,2.0,0.0]|  1.0|
|[98.0,0.0,2.0,0.0]|  1.0|
|[74.0,1.0,2.0,0.0]|  1.0|
|[51.0,1.0,1.0,1.0]|  0.0|
|[57.0,0.0,1.0,2.0]|  0.0|
|[15.0,1.0,0.0,1.0]|  1.0|
|[75.0,1.0,2.0,1.0]|  0.0|
|[59.0,1.0,2.0,1.0]|  1.0|
|[22.0,0.0,2.0,1.0]|  1.0|
|[19.0,1.0,1.0,0.0]|  0.0|
|[97.0,1.0,0.0,1.0]|  1.0|
+------------------+-----+
only showing top 20 rows
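
Two rows in the output above print as `(4,[0],[70.0])` rather than as a bracketed list. This is the sparse vector notation `(size, [indices], [values])`, which `VectorAssembler` can emit when most entries are zero. The sketch below expands that notation in plain Python; `sparse_to_dense` is a hypothetical helper written for this illustration, not a PySpark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a full list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# The sparse row shown in the output: only position 0 (Age) is non-zero.
row = sparse_to_dense(4, [0], [70.0])
print(row)  # [70.0, 0.0, 0.0, 0.0]
```

Both notations describe the same feature values: here Age is 70 and all three indexed columns are 0.0.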



In [0]:
# Split data into training and test sets
(train_data, test_data) = data.randomSplit([0.8, 0.2], seed=42)
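
Note that `randomSplit` treats the weights as sampling probabilities, so on a small dataset the resulting split sizes are only approximately 80/20. The following is a conceptual sketch of that idea in plain Python, assuming each row is assigned by an independent uniform draw. It is not Spark's actual implementation, it uses a different random number generator, and the 50-row list is a hypothetical stand-in for the data.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using an independent uniform draw."""
    total = sum(weights)
    # Cumulative normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(50))  # hypothetical stand-in for a small dataset
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # not guaranteed to be exactly 40 and 10
```

With only a handful of test rows, each misclassified row shifts the reported metrics noticeably, so results from a single split should be read with caution.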

In [0]:
# Initialize RandomForestClassifier
rf = RandomForestClassifier(labelCol='label', featuresCol='features')

# Train model
model = rf.fit(train_data)

In [0]:
# Make predictions
predictions = model.transform(test_data)

# Show predictions
predictions.select('label', 'prediction', 'probability').show(10, False)

+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|0.0  |0.0       |[0.7215748418248418,0.2784251581751581] |
|1.0  |0.0       |[0.5558322510822511,0.4441677489177489] |
|1.0  |1.0       |[0.4135281385281385,0.5864718614718615] |
|1.0  |0.0       |[0.5724087301587303,0.4275912698412698] |
|1.0  |1.0       |[0.48524242424242436,0.5147575757575757]|
|0.0  |0.0       |[0.6155681818181818,0.3844318181818182] |
|1.0  |1.0       |[0.4133459595959596,0.5866540404040405] |
|0.0  |0.0       |[0.7284319846819847,0.27156801531801533]|
|1.0  |0.0       |[0.5088571428571428,0.4911428571428572] |
|1.0  |1.0       |[0.43600000000000005,0.564]             |
+-----+----------+----------------------------------------+
only showing top 10 rows
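
In the output above, `prediction` matches the index of the larger entry in `probability`. The sketch below reproduces that relationship for the first three rows, assuming no custom `thresholds` are set on the classifier (the notebook sets none). `predict_from_probability` is a hypothetical helper for illustration only.

```python
def predict_from_probability(probability):
    """Return the class index with the highest probability, as a float."""
    best = max(range(len(probability)), key=lambda i: probability[i])
    return float(best)


# Probability vectors copied from the first three rows of the output above.
probabilities = [
    [0.7215748418248418, 0.2784251581751581],
    [0.5558322510822511, 0.4441677489177489],
    [0.4135281385281385, 0.5864718614718615],
]
predicted = [predict_from_probability(p) for p in probabilities]
print(predicted)  # [0.0, 0.0, 1.0]
```

Several of the displayed probabilities sit close to 0.5, which suggests the model is not strongly confident about those rows.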



In [0]:
# Confusion matrix
predictions.groupBy('label', 'prediction').count().show()

# Evaluate the model using accuracy and class-weighted precision, recall, and F1-score
evaluator = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction')

# Accuracy
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
print(f"Accuracy: {accuracy}")

# Precision
precision = evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"})
print(f"Precision: {precision}")

# Recall
recall = evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"})
print(f"Recall: {recall}")

# F1-score
f1_score = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})
print(f"F1-Score: {f1_score}")


+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  1.0|       1.0|    5|
|  0.0|       1.0|    1|
|  1.0|       0.0|    3|
|  0.0|       0.0|    3|
+-----+----------+-----+

Accuracy: 0.6666666666666666
Precision: 0.7222222222222222
Recall: 0.6666666666666667
F1-Score: 0.6761904761904762
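
The reported figures can be checked by hand from the four confusion-matrix counts. The sketch below computes per-class precision, recall, and F1, then averages them weighted by the number of true instances of each class, which is what the `weightedPrecision`, `weightedRecall`, and `f1` metrics report. `weighted_metrics` is a hypothetical helper written for this check.

```python
# Counts copied from the confusion matrix above: (label, prediction) -> count.
confusion = {(1.0, 1.0): 5, (0.0, 1.0): 1, (1.0, 0.0): 3, (0.0, 0.0): 3}


def weighted_metrics(confusion):
    """Compute accuracy and support-weighted precision, recall, and F1."""
    labels = sorted({label for label, _ in confusion})
    total = sum(confusion.values())
    correct = sum(n for (label, pred), n in confusion.items() if label == pred)

    precision = recall = f1 = 0.0
    for c in labels:
        true_pos = confusion.get((c, c), 0)
        predicted = sum(n for (_, pred), n in confusion.items() if pred == c)
        actual = sum(n for (label, _), n in confusion.items() if label == c)
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = actual / total  # share of true instances belonging to class c
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


metrics = weighted_metrics(confusion)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, this gives 0.6667, 0.7222, 0.6667, and 0.6762, in line with the evaluator output. The test set here contains only 12 rows, so these values are coarse estimates.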


In [0]:
# Save the trained random forest model
model_path = "./Internship_Sem-6_models/Customer_Purchase_Prediction_model"
model.save(model_path)

In [0]:
dbutils.fs.ls("dbfs:/Internship_Sem-6_models/Customer_Purchase_Prediction_model")

[FileInfo(path='dbfs:/Internship_Sem-6_models/Customer_Purchase_Prediction_model/data/', name='data/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/Internship_Sem-6_models/Customer_Purchase_Prediction_model/metadata/', name='metadata/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/Internship_Sem-6_models/Customer_Purchase_Prediction_model/treesMetadata/', name='treesMetadata/', size=0, modificationTime=0)]