### About Dataset

Customer Purchases Behaviour Dataset

Subtitle:
Simulated Dataset of Customer Purchase Behavior

Description:
This dataset contains simulated data representing customer purchase behavior. It includes various features such as age, gender, income, education, region, loyalty status, purchase frequency, purchase amount, product category, promotion usage, and satisfaction score.

File Information:
* File Format: CSV
* Number of Rows: 100000
* Number of Columns: 12

Column Descriptors:
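* id: Identifier of the customer record (present in the file but absent from the original description).
* age: Age of the customer.
* gender: Gender of the customer (the dataset description gives 0 for Male, 1 for Female, but the loaded file stores this column as text).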
* income: Annual income of the customer.
* education: Education level of the customer.
* region: Region where the customer resides.
* loyalty_status: Loyalty status of the customer.
* purchase_frequency: Frequency of purchases made by the customer.
* purchase_amount: Amount spent by the customer in each purchase.
* product_category: Category of the purchased product.
* promotion_usage: Indicates whether the customer used promotional offers (0 for No, 1 for Yes).
* satisfaction_score: Satisfaction score of the customer.

https://www.kaggle.com/datasets/sanyamgoyal401/customer-purchases-behaviour-dataset

### Objective

To build a model that predicts a customer's purchase frequency (frequent, occasional, or rare) from their demographic and behavioral data.

### Snapshot

![Snapshot](Snap.png)

In [4]:
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, precision_recall_curve)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_graphviz

%matplotlib inline

In [6]:
df = pd.read_csv('customer_data.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 12 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   id                  100000 non-null  int64 
 1   age                 100000 non-null  int64 
 2   gender              100000 non-null  object
 3   income              100000 non-null  int64 
 4   education           100000 non-null  object
 5   region              100000 non-null  object
 6   loyalty_status      100000 non-null  object
 7   purchase_frequency  100000 non-null  object
 8   purchase_amount     100000 non-null  int64 
 9   product_category    100000 non-null  object
 10  promotion_usage     100000 non-null  int64 
 11  satisfaction_score  100000 non-null  int64 
dtypes: int64(6), object(6)
memory usage: 9.2+ MB


In [10]:
df.shape 

(100000, 12)

In [12]:
df.columns 

Index(['id', 'age', 'gender', 'income', 'education', 'region',
       'loyalty_status', 'purchase_frequency', 'purchase_amount',
       'product_category', 'promotion_usage', 'satisfaction_score'],
      dtype='object')

In [14]:
print(df.isnull().values.sum())

0
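
A single total hides which column a missing value would come from. As a minimal sketch, a per-column count can be read with `isnull().sum()`; the three-row frame `toy` below is hypothetical and only stands in for `df`.

```python
import pandas as pd

# Hypothetical frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [25, 40, None],
    "region": ["East", None, "West"],
    "income": [30000, 52000, 41000],
})

# One missing-value count per column instead of a single grand total.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# The grand total matches the isnull().values.sum() form used in the cell above.
total_missing = int(toy.isnull().values.sum())
print(total_missing)
```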


In [16]:
print(df.isnull())

          id    age  gender  income  education  region  loyalty_status  \
0      False  False   False   False      False   False           False   
1      False  False   False   False      False   False           False   
2      False  False   False   False      False   False           False   
3      False  False   False   False      False   False           False   
4      False  False   False   False      False   False           False   
...      ...    ...     ...     ...        ...     ...             ...   
99995  False  False   False   False      False   False           False   
99996  False  False   False   False      False   False           False   
99997  False  False   False   False      False   False           False   
99998  False  False   False   False      False   False           False   
99999  False  False   False   False      False   False           False   


### Since there are no missing values, let's proceed with a decision tree classifier

In [19]:
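# Label-encode every categorical feature; the target keeps its string labels.
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    if column != 'purchase_frequency':
        label_encoders[column] = LabelEncoder()
        df[column] = label_encoders[column].fit_transform(df[column])

# Features are all remaining columns (including id); the target is purchase_frequency.
X = df.drop(columns='purchase_frequency')
y = df['purchase_frequency']

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)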
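
The split above keeps the `id` column as a feature and does not stratify on the target. A possible variation, sketched here on synthetic data, drops the identifier and passes `stratify=` so that both partitions keep similar class proportions. The frame `toy`, its value ranges, and its class probabilities are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "id": np.arange(n),
    "age": rng.integers(18, 70, size=n),
    "income": rng.integers(10000, 90000, size=n),
    "purchase_frequency": rng.choice(
        ["frequent", "occasional", "rare"], size=n, p=[0.2, 0.3, 0.5]
    ),
})

# Drop the row identifier along with the target: it carries no customer information.
X_toy = toy.drop(columns=["id", "purchase_frequency"])
y_toy = toy["purchase_frequency"]

# stratify=y_toy keeps the class proportions similar in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Whether either change helps on the real data is not established here; the sketch only shows the mechanics.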


In [21]:
clf = DecisionTreeClassifier(random_state=42)

clf.fit(X_train, y_train)

In [23]:
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.3735
Confusion Matrix:
[[ 874 1188 1898]
 [1300 1775 2934]
 [2115 3095 4821]]
Classification Report:
              precision    recall  f1-score   support

    frequent       0.20      0.22      0.21      3960
  occasional       0.29      0.30      0.29      6009
        rare       0.50      0.48      0.49     10031

    accuracy                           0.37     20000
   macro avg       0.33      0.33      0.33     20000
weighted avg       0.38      0.37      0.38     20000
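
For context, the accuracy above can be compared with a majority-class baseline, the score obtained by always predicting the most common class. The sketch below recomputes both numbers from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Decision tree confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes,
# in the order frequent, occasional, rare.
cm_tree = np.array([
    [874, 1188, 1898],
    [1300, 1775, 2934],
    [2115, 3095, 4821],
])

# Row sums give the number of test rows per true class (the "support" column).
support = cm_tree.sum(axis=1)

# Accuracy is the share of predictions on the diagonal.
accuracy = np.trace(cm_tree) / cm_tree.sum()

# Always predicting the largest class gets that class's share of rows right.
majority_baseline = support.max() / support.sum()

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The decision tree's accuracy falls below this baseline, which is worth keeping in mind when reading the results.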



In [25]:
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf_rf.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = clf_rf.predict(X_test)

# Evaluate accuracy
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf}")

# Display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

# Display the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

Random Forest Accuracy: 0.4779
Confusion Matrix:
[[  62  389 3509]
 [  92  614 5303]
 [ 174  975 8882]]
Classification Report:
              precision    recall  f1-score   support

    frequent       0.19      0.02      0.03      3960
  occasional       0.31      0.10      0.15      6009
        rare       0.50      0.89      0.64     10031

    accuracy                           0.48     20000
   macro avg       0.33      0.33      0.27     20000
weighted avg       0.38      0.48      0.37     20000
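
Overall accuracy can hide how unevenly the classes are predicted. As a small worked check, the sketch below derives per-class recall and its unweighted mean (balanced accuracy) from the Random Forest confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Random Forest confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes,
# in the order frequent, occasional, rare.
cm_rf = np.array([
    [62, 389, 3509],
    [92, 614, 5303],
    [174, 975, 8882],
])
class_names = ["frequent", "occasional", "rare"]

# Recall per class: correct predictions divided by the true count of that class.
recall = np.diag(cm_rf) / cm_rf.sum(axis=1)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = recall.mean()
accuracy = np.trace(cm_rf) / cm_rf.sum()

for name, value in zip(class_names, recall):
    print(f"{name:>10}: recall {value:.2f}")
print(f"accuracy:          {accuracy:.4f}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
```

A balanced accuracy near one third is what uniform guessing among three classes would produce, so the higher overall accuracy here mostly reflects how often the model predicts rare.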



### Conclusion

The decision tree reached 37.35% accuracy on the test set, and the Random Forest reached 47.79%. Both figures are below the 50.2% that always predicting the majority class, rare (10,031 of 20,000 test rows), would achieve, and both models have a macro-averaged recall of 0.33, which is what random guessing among three classes would give.

The Random Forest's higher accuracy comes mostly from predicting rare far more often (recall 0.89 for rare, but 0.02 for frequent and 0.10 for occasional), so its macro F1 (0.27) is lower than the decision tree's (0.33). On this evidence, neither model shows that features such as income, education, and satisfaction score carry usable signal for purchase frequency in this simulated dataset, and neither is yet a suitable choice for predicting it.
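
To check which features a fitted forest actually relies on, its `feature_importances_` attribute can be inspected. The sketch below uses synthetic data in which the label depends only on `income`; the frame `toy_X`, the value ranges, and the labeling rule are all assumptions made for illustration, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (hypothetical ranges, not taken from the dataset).
rng = np.random.default_rng(0)
n = 2000
toy_X = pd.DataFrame({
    "income": rng.normal(50000, 15000, size=n),
    "age": rng.integers(18, 70, size=n),
    "satisfaction_score": rng.integers(1, 11, size=n),
})

# Made-up rule: the label depends on income alone.
toy_y = np.where(toy_X["income"] > 50000, "frequent", "rare")

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(toy_X, toy_y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(importances)
```

Applied to `clf_rf` and `X.columns`, the same two lines would rank the real features, although importances from a model that performs at chance level should be read with caution.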

In [None]:
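from IPython.display import display

# Export only the top levels (max_depth=3): the unpruned tree is likely too large to read.
dot_data = export_graphviz(
    clf, out_file=None, max_depth=3,
    feature_names=list(X.columns), class_names=clf.classes_.astype(str),
    filled=True, rounded=True, special_characters=True,
)

graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # writes the DOT source and a rendered PDF

display(graph)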
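
When the Graphviz binaries are not available, a plain-text view of the top of a tree is one alternative. The sketch below uses `sklearn.tree.export_text` on a small tree fitted to synthetic data; the frame `toy_X` and the labeling rule are hypothetical and chosen only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    "income": rng.normal(50000, 15000, size=n),
    "age": rng.integers(18, 70, size=n),
})
# Made-up rule so that the tree has something to learn.
toy_y = np.where(toy_X["income"] > 50000, "frequent", "rare")

toy_tree = DecisionTreeClassifier(random_state=42)
toy_tree.fit(toy_X, toy_y)

# Print only the first levels of the tree as indented text rules.
rules = export_text(toy_tree, feature_names=list(toy_X.columns), max_depth=2)
print(rules)
```

The same call with `clf` and `list(X.columns)` would print the first splits of the notebook's decision tree.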
