# Customer Prediction Model Analysis Report

## Introduction

In this analysis, we aimed to build a predictive model to determine whether a customer will purchase a product based on certain demographic factors such as age and gender, as well as estimated salary. We used the Naive Bayes classification algorithm and explored different variants, including Gaussian Naive Bayes and Multinomial Naive Bayes.

## Data Preprocessing

### Data Loading

We started by importing the necessary libraries and loading the "Social Network Ads" dataset. The dataset contains each customer's age, gender, estimated salary, and whether they made a purchase.

### Data Cleaning

We checked the dataset for missing values and found none, so no further cleaning was required.

### Feature Selection

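The classifier was trained on two features, selected by column position in the code (`dataset.iloc[:, [2, 3]]`):
- Age
- Estimated Salary

Gender is available in the dataset but is not among the columns passed to the model.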

### Data Encoding

We encoded the categorical variable "Gender" using Label Encoding to convert it into a numerical format suitable for machine learning algorithms.
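
As a minimal sketch of this step, assuming the column holds the string values `Male` and `Female` (the toy frame and the `Gender_encoded` column name below are illustrative, not taken from the real dataset), `LabelEncoder` assigns integer codes in sorted order of the class labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

encoder = LabelEncoder()
toy["Gender_encoded"] = encoder.fit_transform(toy["Gender"])

# classes_ lists the original labels; the label at position i is encoded as i.
gender_classes = list(encoder.classes_)
gender_codes = toy["Gender_encoded"].tolist()
print(gender_classes, gender_codes)
```

Keeping the fitted encoder around makes it possible to map codes back to labels later with `encoder.inverse_transform`.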

### Data Splitting

We split the dataset into a training set (65%) and a testing set (35%), using a fixed random seed (`random_state=0`) for reproducibility.

### Feature Scaling

We applied Standard Scaling so that each numerical feature has zero mean and unit variance. The scaler was fitted on the training set only and then applied to the test set.
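
The sketch below illustrates the fit-on-train, transform-on-test pattern on synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins for age and salary, not the real dataset, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for age and salary (not the real dataset).
X_demo = np.column_stack([
    rng.integers(18, 61, size=200),
    rng.integers(15000, 150001, size=200),
]).astype(float)
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

# Training columns end up with mean ~0 and standard deviation ~1.
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

The test set is scaled with the training statistics, so its columns will generally be close to, but not exactly, zero mean and unit variance.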

## Model Selection

We experimented with different Naive Bayes variants:

1. **Gaussian Naive Bayes:** We applied Gaussian Naive Bayes, which assumes that the features follow a Gaussian distribution.

2. **Multinomial Naive Bayes:** This variant is intended for non-negative, count-like features. It raised an error because standard scaling produces negative feature values, which Multinomial Naive Bayes does not accept.
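
The failure described in item 2 can be reproduced on synthetic data, as in the sketch below. The data and variable names are assumptions for illustration. Rescaling to a non-negative range with `MinMaxScaler` is one possible workaround that lets the model fit; it does not turn continuous features into the count data that Multinomial Naive Bayes is designed for.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for age and salary (not the real dataset).
X_raw = np.column_stack([
    rng.integers(18, 61, size=100),
    rng.integers(15000, 150001, size=100),
]).astype(float)
y_demo = (X_raw[:, 0] > 40).astype(int)  # arbitrary toy label

# Standardization centers each column at zero, so about half the values are negative.
X_std = StandardScaler().fit_transform(X_raw)

try:
    MultinomialNB().fit(X_std, y_demo)
    multinomial_error = None
except ValueError as exc:
    multinomial_error = str(exc)
print("MultinomialNB on standardized data:", multinomial_error)

# Min-max scaling maps every feature into [0, 1], which MultinomialNB accepts.
X_minmax = MinMaxScaler().fit_transform(X_raw)
mnb = MultinomialNB().fit(X_minmax, y_demo)
print("Fitted on min-max scaled data, classes:", mnb.classes_)
```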

## Model Evaluation

### Evaluation Metrics

The Gaussian Naive Bayes model achieved the following scores on the test set:
- **Accuracy:** 88.57%
- **Precision:** 0.86
- **Recall:** 0.82
- **F1 Score:** 0.84

Accuracy is the share of all test-set predictions that were correct. Precision is the fraction of predicted purchases that were actual purchases, recall is the fraction of actual purchases that the model identified, and the F1 score is the harmonic mean of precision and recall.
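
To make these definitions concrete, the sketch below computes precision, recall, and F1 by hand from a confusion matrix and checks them against scikit-learn. The labels are a made-up toy example, not the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (illustrative only): 4 actual purchases, 6 non-purchases.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # predicted positives that are correct
recall_manual = tp / (tp + fn)     # actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, recall_manual, f1_manual)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```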

### Visualization

We visualized the model evaluation metrics using a bar plot:

![Metrics Bar Plot](metrics_bar_plot.png)

## Conclusion

In this analysis, we built and evaluated a Gaussian Naive Bayes model for predicting customer purchases. On the test set it reached 88.57% accuracy, 0.86 precision, 0.82 recall, and an F1 score of 0.84. Multinomial Naive Bayes could not be fitted on the standardized features. Further exploration and optimization may improve performance.

## Future Work

1. Feature Engineering: Explore additional features or perform feature engineering to improve model performance.
2. Hyperparameter Tuning: Optimize hyperparameters to find the best model configuration.
3. Model Comparison: Compare Naive Bayes with other classification algorithms to identify the most suitable model for this task.
4. Deployment: Consider deploying the best-performing model for real-time customer prediction.
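
As a rough sketch of how items 2 and 3 might be approached, the example below tunes Gaussian Naive Bayes's `var_smoothing` with a grid search and compares the result against logistic regression using cross-validation. The data comes from `make_classification`, not the Social Network Ads dataset, and the grid values, the choice of logistic regression as the comparison model, and the F1 scoring are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature binary problem (stand-in for the real data).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Scaling lives inside the pipeline so it is refitted on each training fold.
nb_pipeline = make_pipeline(StandardScaler(), GaussianNB())
param_grid = {"gaussiannb__var_smoothing": np.logspace(-9, -1, 9)}

search = GridSearchCV(nb_pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_syn, y_syn)

logreg_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
logreg_scores = cross_val_score(logreg_pipeline, X_syn, y_syn, cv=5, scoring="f1")

print("Best var_smoothing:", search.best_params_["gaussiannb__var_smoothing"])
print("GaussianNB mean F1: {:.3f}".format(search.best_score_))
print("LogisticRegression mean F1: {:.3f}".format(logreg_scores.mean()))
```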

This analysis serves as a starting point for customer prediction and can be extended and refined for more accurate predictions.

# Importing necessary libraries

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from matplotlib.colors import ListedColormap

# Load the dataset

In [None]:
dataset = pd.read_csv('/kaggle/input/social-network-ads/Social_Network_Ads.csv')

# Data Preprocessing
# Check for missing values

In [None]:
missing_values = dataset.isnull().sum()
print("Missing Values:\n", missing_values)

# Summary statistics

In [None]:
data_summary = dataset.describe()
print("Data Summary:\n", data_summary)

# Understanding Data Distribution

In [None]:
plt.figure(figsize=(10, 6))
plt.subplot(2, 2, 1)
plt.hist(dataset['Age'], bins=20, color='skyblue', edgecolor='black')
plt.title('Age Distribution')

plt.subplot(2, 2, 2)
plt.hist(dataset['EstimatedSalary'], bins=20, color='salmon', edgecolor='black')
plt.title('Estimated Salary Distribution')

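plt.subplot(2, 2, 3)
# value_counts() returns labels and counts together, so each bar matches its label
# (unique() and value_counts() are not guaranteed to share the same order)
gender_counts = dataset['Gender'].value_counts()
plt.bar(gender_counts.index, gender_counts.values, color=['pink', 'lightblue'])
plt.title('Gender Distribution')

plt.subplot(2, 2, 4)
purchase_counts = dataset['Purchased'].value_counts()
# Cast the 0/1 labels to strings so they are drawn as categories
plt.bar(purchase_counts.index.astype(str), purchase_counts.values, color=['lightgreen', 'coral'])
plt.title('Purchase Distribution')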

plt.tight_layout()
plt.show()

# Extracting features and target variable

In [None]:
# Features: Age and EstimatedSalary (column positions 2 and 3); target: Purchased
X = dataset[['Age', 'EstimatedSalary']].values
y = dataset['Purchased'].values

# Encoding Categorical Data

In [None]:
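# X holds only the numeric Age and EstimatedSalary columns, so nothing in X needs encoding.
# Label-encoding X[:, 0] would replace Age with rank codes and distort later predictions.
# Gender is encoded on the DataFrame instead; it is not part of X.
labelencoder_gender = LabelEncoder()
dataset['Gender_encoded'] = labelencoder_gender.fit_transform(dataset['Gender'])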

# Splitting the dataset into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=0)

# Feature Scaling using Standardization

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Gaussian Naive Bayes Classifier

In [None]:
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Model Evaluation

In [None]:
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

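# scikit-learn does not raise ZeroDivisionError for undefined metrics;
# zero_division=0 returns 0 when no positive predictions or labels exist
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)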

print('Confusion Matrix:\n', cm)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
print('F1 Score: {:.2f}'.format(f1))

In [None]:
# Names of the metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score']

# Corresponding values
values = [accuracy, precision, recall, f1]  # f1 is the computed score; f1_score is the sklearn function

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(metrics, values, color=['skyblue', 'lightgreen', 'lightcoral', 'lightsalmon'])
plt.ylim(0, 1)  # All four metrics are fractions in the range [0, 1]
plt.xlabel('Metrics')
plt.ylabel('Value')
plt.title('Model Evaluation Metrics')
plt.show()


# Visualize decision boundary

In [None]:
def plot_decision_boundary(classifier, X_train, y_train):
    h = .02
    x_min, x_max = X_train[:, 0].min() - 0.1, X_train[:, 0].max() + 0.1
    y_min, y_max = X_train[:, 1].min() - 0.1, X_train[:, 1].max() + 0.1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdYlBu)
    plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=plt.cm.bwr)
    plt.xlabel('Age (standardized)')
    plt.ylabel('Estimated Salary (standardized)')
    plt.title('Gaussian Naive Bayes Decision Boundary')
    plt.show()

plot_decision_boundary(classifier, X_train, y_train)

# Prediction for a new customer

In [None]:
new_customer_age = 70
new_customer_salary = 80000

# Standardizing the input features
new_customer_data = scaler.transform([[new_customer_age, new_customer_salary]])

# Making the prediction
new_customer_prediction = classifier.predict(new_customer_data)

if new_customer_prediction[0] == 1:
    print('The new customer is likely to make a purchase.')
else:
    print('The new customer is not likely to make a purchase.')