
# Customer Data Analysis and Machine Learning Models

This project analyzes a synthetic customer dataset with 50,000 rows and 12 columns.  
We perform exploratory data analysis (EDA) using 9 visualizations and apply 5 machine learning models.

## Dataset Description
- **Customer_ID**: Unique ID of the customer  
- **Age**: Age of the customer  
- **Gender**: Male or Female  
- **Income**: Annual income in USD  
- **Spending_Score**: Score representing spending behavior  
- **Education_Level**: Highest educational qualification  
- **Marital_Status**: Single, Married, or Divorced  
- **Purchase_Amount**: Total amount spent  
- **Num_Transactions**: Number of transactions made  
- **Product_Category**: Category of most purchased product  
- **Loyalty_Score**: Score representing customer loyalty  
- **Customer_Tenure**: Number of years as a customer  

We will perform data visualization, preprocessing, model training, and evaluation.
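
Before any plotting or modeling, it can help to confirm that the loaded frame matches the description above. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names, and the two-row frame with made-up values stands in for `customer_data.csv`.

```python
import pandas as pd

# Column names taken from the dataset description above
EXPECTED_COLUMNS = [
    "Customer_ID", "Age", "Gender", "Income", "Spending_Score",
    "Education_Level", "Marital_Status", "Purchase_Amount",
    "Num_Transactions", "Product_Category", "Loyalty_Score",
    "Customer_Tenure",
]


def check_schema(frame: pd.DataFrame) -> dict:
    """Report missing columns, unexpected columns, null counts, and duplicate IDs."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    extra = [c for c in frame.columns if c not in EXPECTED_COLUMNS]
    nulls = frame.isna().sum()
    duplicate_ids = 0
    if "Customer_ID" in frame.columns:
        duplicate_ids = int(frame["Customer_ID"].duplicated().sum())
    return {
        "missing_columns": missing,
        "extra_columns": extra,
        "null_counts": nulls[nulls > 0].to_dict(),
        "duplicate_ids": duplicate_ids,
    }


# Tiny stand-in frame (illustrative values only, one Income deliberately missing)
sample = pd.DataFrame({
    "Customer_ID": [1, 2],
    "Age": [34, 51],
    "Gender": ["Male", "Female"],
    "Income": [52000.0, None],
    "Spending_Score": [61, 40],
    "Education_Level": ["Bachelor", "Master"],
    "Marital_Status": ["Single", "Married"],
    "Purchase_Amount": [1200.5, 830.0],
    "Num_Transactions": [14, 9],
    "Product_Category": ["Electronics", "Clothing"],
    "Loyalty_Score": [7, 5],
    "Customer_Tenure": [3, 8],
})

report = check_schema(sample)
print(report)
```

On the real data, the same call would be `check_schema(df)` right after `pd.read_csv`; any non-empty entry in the report is a cue to clean the data before training.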


In [None]:

import pandas as pd

# Load dataset
df = pd.read_csv("customer_data.csv")

# Display basic information
df.info()
df.head()


## Exploratory Data Analysis (EDA) and Visualizations

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")

# 1. Age Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df["Age"], bins=30, kde=True, color="blue")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# 2. Gender Count
plt.figure(figsize=(6, 4))
sns.countplot(x="Gender", data=df, palette="coolwarm")
plt.title("Gender Distribution")
plt.show()

# 3. Income Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df["Income"], bins=30, kde=True, color="green")
plt.title("Income Distribution")
plt.xlabel("Income ($)")
plt.ylabel("Count")
plt.show()

# 4. Spending Score Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df["Spending_Score"], bins=30, kde=True, color="red")
plt.title("Spending Score Distribution")
plt.xlabel("Spending Score")
plt.ylabel("Count")
plt.show()

# 5. Purchase Amount vs. Income Scatter Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x="Income", y="Purchase_Amount", data=df, alpha=0.5)
plt.title("Purchase Amount vs. Income")
plt.xlabel("Income ($)")
plt.ylabel("Purchase Amount ($)")
plt.show()

# 6. Number of Transactions by Category
plt.figure(figsize=(8, 5))
sns.boxplot(x="Product_Category", y="Num_Transactions", data=df, palette="pastel")
plt.title("Number of Transactions by Product Category")
plt.xticks(rotation=45)
plt.show()

# 7. Loyalty Score Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df["Loyalty_Score"], bins=30, kde=True, color="purple")
plt.title("Loyalty Score Distribution")
plt.xlabel("Loyalty Score")
plt.ylabel("Count")
plt.show()

# 8. Purchase Amount by Marital Status
plt.figure(figsize=(8, 5))
sns.boxplot(x="Marital_Status", y="Purchase_Amount", data=df, palette="coolwarm")
plt.title("Purchase Amount by Marital Status")
plt.show()

# 9. Customer Tenure Distribution
plt.figure(figsize=(8, 5))
sns.histplot(df["Customer_Tenure"], bins=30, kde=True, color="brown")
plt.title("Customer Tenure Distribution")
plt.xlabel("Years")
plt.ylabel("Count")
plt.show()


## Machine Learning Models

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Encode categorical features
label_encoders = {}
categorical_columns = ["Gender", "Education_Level", "Marital_Status", "Product_Category"]

for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

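# Define features and target variable
# Note: the classifiers below expect discrete class labels, so Spending_Score
# is treated as a categorical target here (each distinct score is one class)
X = df.drop(columns=["Customer_ID", "Spending_Score"])
y = df["Spending_Score"]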

# Split first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize models (fixed seeds where supported, for reproducible results)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # extra iterations help the solver converge
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5)
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy:.2f}")
    print(classification_report(y_test, y_pred))
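
The classifiers above need discrete class labels as the target. If `Spending_Score` is stored as a continuous value, scikit-learn classifiers will typically reject it, and even as an integer score every distinct value becomes its own class. One option, sketched here as an assumption rather than as the notebook's method, is to bin the score into a few bands with `pd.qcut` before training. The band names and the randomly generated scores are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for df["Spending_Score"]
rng = np.random.default_rng(42)
scores = pd.Series(rng.uniform(1, 100, size=300), name="Spending_Score")

# Split into three equal-frequency bands (quantile-based cut)
bands = pd.qcut(scores, q=3, labels=["Low", "Medium", "High"])

# Show how many customers fall into each band
print(bands.value_counts().sort_index())
```

The resulting `bands` series could take the place of `y` before the train/test split. If a numeric prediction is wanted instead, a regressor with a regression metric is the alternative.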



## Conclusion

- We explored customer data using 9 visualizations.
- We trained 5 machine learning models to predict spending scores.
- Performance can differ from model to model; hyperparameter tuning (for example, a cross-validated grid search) is a natural next step for improving each one.
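
As one possible way to carry out the tuning mentioned above, the sketch below runs a cross-validated grid search over a Random Forest. It uses `make_classification` to generate stand-in data instead of the customer dataset, and the parameter grid is an arbitrary illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption): 500 samples, 10 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Illustrative grid; sensible ranges depend on the actual data
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5, 10]}

# Try every combination in the grid with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.2f}")
print(f"Held-out accuracy: {search.score(X_te, y_te):.2f}")
```

The same pattern applies to the other four models by swapping the estimator and its grid. Keeping the test split out of the search means the held-out score remains an honest estimate.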

Together, these steps form an end-to-end workflow: loading the data, exploring it visually, preprocessing it, and training and evaluating classifiers.
