# Final Project
## Customer Segmentation (Multiclass) - Classification

In this Final Project, we develop a multiclass Customer Segmentation system using a classification approach. The project aims to group customers into several distinct segments based on their characteristics and behavior. We compare the performance of several classification algorithms to find the model that identifies customer segments most accurately.

As the final stage, we integrate the model into an interactive Streamlit-based platform that lets users analyze data, explore customer segments, and retrain the model on their own, without needing advanced programming skills.

Author: Group 4, DataBender's
Members:
* Farhan Wily
* Ghazy Shidqy
*
*



## 1. Business Understanding

🎯 **Business Background**

An automotive company wants to expand into a new market with its existing products (P1–P5). According to market research, consumer behavior in the new market is similar to that in the current market. In the current market, the company has successfully divided customers into 4 segments (A, B, C, D) and tailored its marketing strategy to each segment. This strategy has proven highly effective.

🎯 **Business Objective**

Build a classification model from the labeled customer data of the existing market to predict the segments of new customers (2627 prospective customers) in the new market.

✅ **Expected Output**

* A predictive model that classifies each new customer into one of the segments A, B, C, or D.
* An analysis of the characteristics of each segment to support business strategy.



## 2. Data Understanding

📁 **Dataset**

There are two files:
* **Train.csv**: Data on existing customers that already carry a segment label (`Segmentation`)
* **Test.csv**: Data on prospective new customers whose segment is not yet known
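
As a quick sanity check of the split described above, the sketch below compares the columns of a labeled and an unlabeled table. It uses tiny in-memory stand-ins rather than the real files: the training rows are copied from the `df.head()` preview in this notebook, and the test row is a placeholder built by dropping the label from another preview row. It assumes that Test.csv has the same feature columns as Train.csv minus `Segmentation`, which this notebook section does not confirm.

```python
import io

import pandas as pd

# Stand-in for Train.csv: two rows copied from the df.head() preview.
train_csv = io.StringIO(
    "ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,"
    "Spending_Score,Family_Size,Var_1,Segmentation\n"
    "462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D\n"
    "462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A\n"
)

# Hypothetical stand-in for Test.csv (assumption: same features, no label).
test_csv = io.StringIO(
    "ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,"
    "Spending_Score,Family_Size,Var_1\n"
    "466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6\n"
)

train_sample = pd.read_csv(train_csv)
test_sample = pd.read_csv(test_csv)

# Columns present in the labeled data but absent from the unlabeled data.
label_only = [c for c in train_sample.columns if c not in test_sample.columns]
print(label_only)
```

With the real files, the same comparison could be run on the two loaded DataFrames to confirm that only the target column differs.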

### Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os
default_dir = "/content/drive/"
os.chdir(default_dir)

In [None]:
!ls

MyDrive


### Import Libraries

In [None]:
# Package imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.set_option('display.max_rows', None)

### Load Data

In [None]:
# Load data
df = pd.read_csv('/content/drive/MyDrive/Train.csv')
df.head()

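|   | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|--------|--------|-----|----|-----|---------------|-----|---------|-----|-------|---|
| 0 | 462809 | Male   | No  | 22 | No  | Healthcare    | 1.0 | Low     | 4.0 | Cat_4 | D |
| 1 | 462643 | Female | Yes | 38 | Yes | Engineer      | NaN | Average | 3.0 | Cat_4 | A |
| 2 | 466315 | Female | Yes | 67 | Yes | Engineer      | 1.0 | Low     | 1.0 | Cat_6 | B |
| 3 | 461735 | Male   | Yes | 67 | Yes | Lawyer        | 0.0 | High    | 2.0 | Cat_6 | B |
| 4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High    | 6.0 | Cat_6 | A |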


* ID: Unique ID
* Gender: Gender of the customer
* Ever_Married: Marital status of the customer
* Age: Age of the customer
* Graduated: Is the customer a graduate?
* Profession: Profession of the customer
* Work_Experience: Work experience in years
* Spending_Score: Spending score of the customer
* Family_Size: Number of family members for the customer (including the customer)
* Var_1: Anonymised category for the customer
* Segmentation: (target) Customer segment of the customer
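
The column list above can be turned into explicit feature groups before any modeling. The following is a minimal sketch on the five preview rows from `df.head()`, not on the full dataset. Treating `ID` as a non-feature identifier is an assumption, and the names `sample`, `numeric_cols`, and `categorical_cols` are illustrative only.

```python
import io

import pandas as pd

# The five rows shown in the df.head() preview, as an in-memory CSV.
sample = pd.read_csv(io.StringIO(
    "ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,"
    "Spending_Score,Family_Size,Var_1,Segmentation\n"
    "462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D\n"
    "462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A\n"
    "466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B\n"
    "461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B\n"
    "462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A\n"
))

target_col = "Segmentation"
id_col = "ID"  # assumption: an identifier, not a predictive feature

# Everything except the identifier and the target is a candidate feature.
features = sample.drop(columns=[id_col, target_col])

# Split candidate features by dtype.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Keeping the groups in named lists makes it easier to apply different preprocessing to numeric and categorical columns later.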

In [None]:
def unique_categorical_values(dataset, column_name):
    """
    Prints the number and list of unique values for a specified
    categorical column in a Pandas DataFrame.

    Note: nunique() excludes missing values by default, while unique()
    includes NaN, so the printed count can be one lower than the number
    of listed values when the column has missing entries.

    Args:
        dataset: The Pandas DataFrame.
        column_name: The name of the categorical column.
    """
    try:
        print(f"Number of unique {column_name}: {dataset[column_name].nunique()}")
        print(f"Unique {column_name}:")
        for value in dataset[column_name].unique():
            print(f"- {value}")
    except KeyError:
        print(f"Error: Column '{column_name}' not found in the dataset.")

In [None]:
unique_categorical_values(df, "Profession")

Number of unique Profession: 9
Unique Profession:
- Healthcare
- Engineer
- Lawyer
- Entertainment
- Artist
- Executive
- Doctor
- Homemaker
- Marketing
- nan


In [None]:
unique_categorical_values(df, "Spending_Score")

Number of unique Spending_Score: 3
Unique Spending_Score:
- Low
- Average
- High


In [None]:
unique_categorical_values(df, "Var_1")

Number of unique Var_1: 7
Unique Var_1:
- Cat_4
- Cat_6
- Cat_7
- Cat_3
- Cat_1
- Cat_2
- nan
- Cat_5
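
In the outputs above, the reported count is one lower than the number of listed values for `Profession` (9 vs. 10 entries) and `Var_1` (7 vs. 8 entries), because `nan` appears in the list but is not counted. The sketch below reproduces that behavior on a small made-up Series (not the notebook's data) and shows how `dropna=False` makes the missing values visible in the counts.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing entry (not taken from Train.csv).
s = pd.Series(["Cat_4", "Cat_6", np.nan, "Cat_4"], name="Var_1")

n_without_nan = s.nunique()           # NaN is excluded by default
n_with_nan = s.nunique(dropna=False)  # NaN is counted as its own value
listed = s.unique()                   # NaN is included in this array

print(n_without_nan, n_with_nan, len(listed))

# Frequency table that also reports how many entries are missing.
counts = s.value_counts(dropna=False)
print(counts)
```

`value_counts(dropna=False)` is often the more informative call during exploration, since it shows both the category frequencies and the number of missing entries in a single table.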


In [None]:
len(df)

8068

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB
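
The non-null counts in the `df.info()` output already determine how many values are missing per column: missing = total rows minus non-null count. A small worked example using only the numbers printed above (the variable names are illustrative):

```python
# Figures copied from the df.info() output above.
total_rows = 8068
non_null = {
    "Ever_Married": 7928,
    "Graduated": 7990,
    "Profession": 7944,
    "Work_Experience": 7239,
    "Family_Size": 7733,
    "Var_1": 7992,
}

# Missing count and share of rows per column.
missing = {col: total_rows - n for col, n in non_null.items()}
missing_pct = {col: 100 * m / total_rows for col, m in missing.items()}

# Print columns from most to least affected.
for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:<16} {missing[col]:>4}  {missing_pct[col]:5.2f}%")
```

`Work_Experience` is the most affected column, with 829 missing values (roughly 10% of the rows); every other column is missing less than 5%.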


In [None]:
df.describe()

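|       | ID            | Age       | Work_Experience | Family_Size |
|-------|---------------|-----------|-----------------|-------------|
| count | 8068.0        | 8068.0    | 7239.0          | 7733.0      |
| mean  | 463479.214551 | 43.466906 | 2.641663        | 2.850123    |
| std   | 2595.381232   | 16.711696 | 3.406763        | 1.531413    |
| min   | 458982.0      | 18.0      | 0.0             | 1.0         |
| 25%   | 461240.75     | 30.0      | 0.0             | 2.0         |
| 50%   | 463472.5      | 40.0      | 1.0             | 3.0         |
| 75%   | 465744.25     | 53.0      | 4.0             | 4.0         |
| max   | 467974.0      | 89.0      | 14.0            | 9.0         |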


In [None]:
# Check missing value
df.isnull().sum()

Unnamed: 0,0
ID,0
Gender,0
Ever_Married,140
Age,0
Graduated,78
Profession,124
Work_Experience,829
Spending_Score,0
Family_Size,335
Var_1,76
