<a href="https://colab.research.google.com/github/Hariprakashhp/Customer-Purchase-Behavior-Analysis-Project/blob/main/customer_purchase_behavior_analysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('/content/ecommerce_customer_data_custom_ratios.csv')  # store as `df` so `pd` stays bound to pandas

In [None]:
# Display the first few rows of the DataFrame to understand its structure
df.head()

In [None]:
# Get information about the DataFrame, including data types and non-null counts
df.info()

In [None]:
# Generate descriptive statistics of the numerical columns
df.describe()

In [None]:
# Check for missing values in each column and sum them up
df.isna().sum()

In [None]:
# Check for null values in the DataFrame (returns a boolean DataFrame)
df.isnull()

In [None]:
# Calculate the percentage of missing values for each column
df.isnull().sum() * 100 / len(df)

In [None]:
# Impute missing values in 'Returns' with the mode
mode_returns = df['Returns'].mode()[0]
# Assign the result back instead of using a chained inplace=True, which newer pandas versions warn about
df['Returns'] = df['Returns'].fillna(mode_returns)

# Verify that there are no more missing values in 'Returns'
display(df.isna().sum())

The missing values in the 'Returns' column were imputed with the mode (the column's most frequent value) in the previous cell. This is one of several ways to handle nulls in a dataset.

# Task
Analyze the data, build a model to predict churn, and evaluate the model's performance.

## Exploratory Data Analysis (EDA)

### Subtask:
Analyze the data to understand the distributions of features, identify patterns, and explore relationships between features and the target variable ('Churn'). This may include visualizations and summary statistics.


**Reasoning**:
Create visualizations of the feature distributions and of each feature's relationship with the 'Churn' variable.



In [None]:
# 1. Histogram for Customer Age distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Customer Age', kde=True)
plt.xlabel('Customer Age')
plt.ylabel('Frequency')
plt.title('Distribution of Customer Age')
plt.show()

# 2. Count plot for Gender and Churn relationship
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Gender', hue='Churn')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Relationship between Gender and Churn')
plt.show()

# 3. Count plot for Product Category and Churn relationship
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Product Category', hue='Churn')
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.title('Relationship between Product Category and Churn')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# 4. Box plot for Total Purchase Amount and Churn relationship
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='Churn', y='Total Purchase Amount')
plt.xlabel('Churn')
plt.ylabel('Total Purchase Amount')
plt.title('Relationship between Total Purchase Amount and Churn')
plt.show()

# 5. Count plot for Payment Method and Churn relationship
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Payment Method', hue='Churn')
plt.xlabel('Payment Method')
plt.ylabel('Count')
plt.title('Relationship between Payment Method and Churn')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# 6. Count plot for Returns and Churn relationship
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='Returns', hue='Churn')
plt.xlabel('Returns')
plt.ylabel('Count')
plt.title('Relationship between Returns and Churn')
plt.show()
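
The count plots above show raw counts per group, which can hide differences in churn rate when group sizes differ. A minimal sketch of a numeric complement follows. It uses a hypothetical toy table (`toy`, with made-up values) and assumes 'Churn' is coded as 0/1, which the source does not state explicitly.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook's dataset,
# but the values are invented for illustration only.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Churn':  [0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 target within each group is that group's churn rate
churn_rate_by_gender = toy.groupby('Gender')['Churn'].mean()
print(churn_rate_by_gender)

# Row-normalised crosstab: share of retained (0) and churned (1) customers per group
churn_share = pd.crosstab(toy['Gender'], toy['Churn'], normalize='index')
print(churn_share)
```

The same two calls could be applied to any categorical column in the real DataFrame, for example 'Product Category' or 'Payment Method'.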

## Data Preprocessing

### Subtask:
Prepare the data for modeling. This may involve handling categorical variables (e.g., one-hot encoding), scaling numerical features, and splitting the data into training and testing sets.
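
Scaling is listed as an optional preprocessing step. A minimal sketch of how it might look is shown here, on synthetic numeric data standing in for columns such as 'Product Price' and 'Customer Age'. The variable names (`X_num`, `y_toy`, `scaler`) are illustrative assumptions, not part of the notebook. The point is that the scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two numeric features and a binary target (assumed data)
X_num = rng.normal(loc=[250.0, 40.0], scale=[100.0, 12.0], size=(200, 2))
y_toy = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_num, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # apply the training statistics to the test rows

print("Train means after scaling:", X_tr_scaled.mean(axis=0).round(3))
print("Train stds after scaling:", X_tr_scaled.std(axis=0).round(3))
```

Fitting the scaler before the split would let test-set statistics leak into training, which is why the split comes first in this sketch.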


**Reasoning**:
Select relevant features, apply one-hot encoding to categorical features, define X and y, and split the data into training and testing sets.



**Reasoning**:
`pd.get_dummies` only works when `pd` refers to the pandas library, so the DataFrame must be stored under a different name. The DataFrame is therefore referenced as `df`, and one-hot encoding is performed with `pd.get_dummies`.



In [None]:
from sklearn.model_selection import train_test_split

# Select features for modeling
features = ['Product Category', 'Product Price', 'Quantity', 'Total Purchase Amount',
            'Payment Method', 'Customer Age', 'Returns', 'Gender']
X = df[features]
y = df['Churn']

# Apply one-hot encoding to the categorical features
categorical_features = ['Product Category', 'Payment Method', 'Gender']
X = pd.get_dummies(X, columns=categorical_features, drop_first=True)

# Split data into training and testing sets; stratify=y preserves the churn ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Display the shapes of the resulting sets and the head of the processed X
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
display(X.head())

## Model Selection and Building

### Subtask:
Choose an appropriate model for binary classification (churn prediction) and train it on the training data.

In [None]:
from sklearn.linear_model import LogisticRegression

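# Initialize and train the Logistic Regression model
# max_iter is raised because the features are unscaled and the default 100 iterations may not converge
model = LogisticRegression(random_state=42, max_iter=1000)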
model.fit(X_train, y_train)

## Model Evaluation

### Subtask:
Evaluate the performance of the trained model using appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score, AUC).
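
To make the metrics named above concrete, here is a small worked sketch that derives precision, recall, and F1 by hand from a confusion matrix and compares them with scikit-learn's functions. The labels `y_true` and `y_hat` are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Invented labels: 6 non-churners (0) and 4 churners (1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)  # of the predicted churners, how many really churned
recall_manual = tp / (tp + fn)     # of the real churners, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Precision:", precision_manual, "vs sklearn:", precision_score(y_true, y_hat))
print("Recall:", recall_manual, "vs sklearn:", recall_score(y_true, y_hat))
print("F1-score:", f1_manual, "vs sklearn:", f1_score(y_true, y_hat))
```

In this toy case accuracy is 0.7 even though only half of the churners are found, which is why recall and F1 are reported alongside accuracy.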

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score

# Predict on the test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate the model using various metrics
accuracy = accuracy_score(y_test, y_pred)
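# zero_division=0 returns 0 instead of raising a warning when the model predicts no positive cases
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, zero_division=0)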
auc = roc_auc_score(y_test, y_pred_proba)

# Display the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("AUC:", auc)

# (Optional) Perform cross-validation for a more robust evaluation
# cross_val_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
# print("\nCross-validation Accuracy Scores:", cross_val_scores)
# print("Mean Cross-validation Accuracy:", cross_val_scores.mean())
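
The optional cross-validation above scores accuracy only. A hedged sketch of a variant that keeps the class ratio in each fold and reports F1 and AUC follows. It runs on synthetic data from `make_classification` (an assumption, with a made-up 80/20 class split), not on the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for the real features and target
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# StratifiedKFold keeps roughly the same class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_syn, y_syn,
    cv=cv,
    scoring=['f1', 'roc_auc'],
)

print("Mean F1:", cv_results['test_f1'].mean())
print("Mean AUC:", cv_results['test_roc_auc'].mean())
```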

In [None]:
# Check the distribution of the target variable 'Churn' to identify potential imbalance
churn_distribution = df['Churn'].value_counts(normalize=True)
print("Churn Distribution:")
print(churn_distribution)

# Visualize the distribution of Churn
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Churn')
plt.title('Distribution of Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.show()
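
One way to see why class imbalance matters for evaluation is to compare against a baseline that always predicts the majority class. The sketch below uses invented labels with an assumed 80/20 split; the real churn ratio is whatever the cell above prints.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 80 non-churners (0) and 20 churners (1)
y_imb = np.array([0] * 80 + [1] * 20)
X_dummy = np.zeros((100, 1))  # DummyClassifier ignores the feature values

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_imb)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_imb, y_base)
baseline_recall = recall_score(y_imb, y_base)

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall on churners:", baseline_recall)
```

A model is only useful for churn prediction if it beats this kind of baseline on recall or F1, not just on accuracy.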

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train the Random Forest model, using class_weight='balanced' to handle imbalance
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Predict on the test set using the Random Forest model
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

# Evaluate the Random Forest model using various metrics
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_pred_proba_rf)

# Display the evaluation metrics for the Random Forest model
print("Random Forest Model Evaluation:")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)
print("F1-score:", f1_rf)
print("AUC:", auc_rf)

# E-commerce Customer Churn Prediction

This project aims to analyze e-commerce customer data and build a model to predict customer churn.

## Project Steps:

1.  **Data Loading**: The dataset was loaded into a pandas DataFrame.
2.  **Exploratory Data Analysis (EDA)**:
    *   Initial inspection of the data using `head()`, `info()`, and `describe()`.
    *   Checking for missing values and their percentages.
    *   Visualizing feature distributions and their relationships with 'Churn'.
    *   Visualizing the distribution of 'Churn' to identify class imbalance.
3.  **Data Preprocessing**:
    *   Handling missing values in the 'Returns' column by imputing with the mode.
    *   Selecting relevant features for the model.
    *   Applying one-hot encoding to categorical features ('Product Category', 'Payment Method', 'Gender').
    *   Splitting the data into training and testing sets.
4.  **Model Selection and Building**:
    *   A Logistic Regression model was trained first; it performed poorly on the imbalanced data.
    *   A Random Forest Classifier was then initialized with `class_weight='balanced'` to address the imbalance and trained on the same training data.
5.  **Model Evaluation**:
    *   The trained models were evaluated using metrics such as Accuracy, Precision, Recall, F1-score, and AUC.
    *   The Logistic Regression model did not predict the minority class (churn).
    *   The Random Forest model predicted some minority-class cases, but its overall performance, especially Recall and AUC, suggested that further tuning or additional imbalance-handling techniques would help.
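
As a brief illustration of what `class_weight='balanced'` does, the sketch below computes the weights scikit-learn documents for that setting, `n_samples / (n_classes * np.bincount(y))`, on invented labels with an assumed 80/20 split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 80 non-churners (0) and 20 churners (1)
y_imb = np.array([0] * 80 + [1] * 20)
classes = np.unique(y_imb)

# Weights as computed by scikit-learn for the 'balanced' setting
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_imb)

# The same weights from the documented formula
manual = len(y_imb) / (len(classes) * np.bincount(y_imb))

print(dict(zip(classes.tolist(), weights.tolist())))
print(manual)
```

The rarer class receives the larger weight, so errors on churners count more during training.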

## Results:

The analysis revealed that the dataset is imbalanced, which significantly impacts model performance. While the Random Forest model was able to make some predictions on the minority class (churn), its overall predictive power was limited, highlighting the challenges of predicting churn in this imbalanced dataset.

Further work could involve exploring advanced resampling techniques, hyperparameter tuning, or trying other models better suited for imbalanced classification to improve the model's performance.
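
As one possible starting point for the resampling idea, here is a minimal sketch of random oversampling with `sklearn.utils.resample`. The `train` table and its `feature` column are hypothetical stand-ins. In practice, resampling would be applied to the training split only, so that the test set keeps its original class ratio.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training table: 80 non-churners and 20 churners
train = pd.DataFrame({'feature': np.arange(100), 'Churn': [0] * 80 + [1] * 20})

majority = train[train['Churn'] == 0]
minority = train[train['Churn'] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle the rows
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced['Churn'].value_counts())
```

This is the simplest form of resampling; it duplicates existing minority rows rather than creating new ones, so it can increase overfitting on those rows.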