# **1. Dataset Introduction**


In the first stage, you must find and use an **unlabeled** dataset that meets the following requirements:

1. **Dataset Source**:  
   Fish species sampling data - length and weight
   
2. **Dataset Requirements**:
   - **Unlabeled**
   - **Number of Rows**: ~4000
   - **Data Types**: Contains both **categorical** and **numerical** features.
     - *Categorical*: fish species
     - *Numerical*: length, weight, etc.
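
The requirements above can be checked programmatically before any modeling. The following is a minimal sketch, assuming a pandas DataFrame; the helper `check_dataset_requirements`, the `min_rows` threshold, and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_dataset_requirements(df: pd.DataFrame, min_rows: int = 1000) -> dict:
    """Summarise the row count and the kinds of columns in a candidate dataset."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "n_rows": len(df),
        "enough_rows": len(df) >= min_rows,
        "numerical": numerical,
        "categorical": categorical,
        "has_both_kinds": bool(numerical) and bool(categorical),
    }


# Hypothetical toy frame; the values are made up for illustration only
toy = pd.DataFrame({
    "species": ["A", "B", "A", "C"],
    "length": [10.2, 15.1, 9.8, 20.4],
    "weight": [2.1, 3.4, 2.0, 4.9],
})
report = check_dataset_requirements(toy, min_rows=3)
print(report)
```

Running the same helper on the real DataFrame right after loading it would show whether it has both column kinds and enough rows.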

# **2. Importing Libraries**

At this stage, you need to import the Python libraries required for data analysis and for building the machine learning models.

In [337]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

# **3. Loading the Dataset**

In [338]:
# Read the dataset (tab-separated file)
data = pd.read_csv('personality/marketing_campaign.csv', sep='\t')

# Show the first 5 rows
print(data.head())

# Show dataset information (info() prints directly and returns None)
data.info()

# Show descriptive statistics
print(data.describe())

     ID  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0  5524        1957  Graduation         Single  58138.0        0         0   
1  2174        1954  Graduation         Single  46344.0        1         1   
2  4141        1965  Graduation       Together  71613.0        0         0   
3  6182        1984  Graduation       Together  26646.0        1         0   
4  5324        1981         PhD        Married  58293.0        1         0   

  Dt_Customer  Recency  MntWines  ...  NumWebVisitsMonth  AcceptedCmp3  \
0  04-09-2012       58       635  ...                  7             0   
1  08-03-2014       38        11  ...                  5             0   
2  21-08-2013       26       426  ...                  4             0   
3  10-02-2014       26        11  ...                  6             0   
4  19-01-2014       94       173  ...                  5             0   

   AcceptedCmp4  AcceptedCmp5  AcceptedCmp1  AcceptedCmp2  Complain  \
0             0

# **4. Exploratory Data Analysis (EDA)**

In [339]:
print(data.isnull().sum())


ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64


# **5. Data Preprocessing**

In [340]:
data['Age'] = 2023 - data['Year_Birth']
data['Dt_Customer'] = pd.to_datetime(data['Dt_Customer'], format='%d-%m-%Y')
latest_date = data['Dt_Customer'].max()
data['Tenure'] = (latest_date - data['Dt_Customer']).dt.days
data['TotalChildren'] = data['Kidhome'] + data['Teenhome']
data['TotalSpent'] = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
data['TotalPurchases'] = data[['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
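
The tenure feature above is the number of days between each customer's sign-up date and the latest sign-up date in the data. A minimal sketch of that calculation, using three `Dt_Customer` values from the preview output; the variable names are illustrative assumptions.

```python
import pandas as pd

# Three sign-up dates in day-month-year format, as shown in the data preview
signup = pd.Series(["04-09-2012", "08-03-2014", "21-08-2013"], name="Dt_Customer")

# An explicit format avoids day/month ambiguity (04-09 is 4 September, not 9 April)
signup_dt = pd.to_datetime(signup, format="%d-%m-%Y")

# Tenure in days, measured against the most recent sign-up date
tenure_days = (signup_dt.max() - signup_dt).dt.days
print(tenure_days.tolist())
```

Because the reference point is the latest date in the data rather than today's date, the most recent customer always has a tenure of 0.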

In [341]:
data['Income'] = data['Income'].fillna(data['Income'].median())
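
Median imputation is used here because the median is less affected by extreme values than the mean. A minimal sketch on made-up income values (the numbers are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one missing value and one extreme value
income = pd.Series([30000.0, 45000.0, np.nan, 52000.0, 600000.0], name="Income")

n_missing = int(income.isnull().sum())
median_value = income.median()  # NaN is skipped by default
mean_value = income.mean()      # pulled upward by the extreme value

filled = income.fillna(median_value)
print(n_missing, median_value, mean_value)
```

In this toy series the median is 48500.0 while the mean is 181750.0, so filling with the median keeps the imputed value close to a typical income.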

In [342]:
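# Ordinal encoding; any category missing from the mapping becomes NaN
edu_mapping = {'Basic': 0, 'Graduation': 1, 'Master': 2, 'PhD': 3}
data['Education_Encoded'] = data['Education'].map(edu_mapping)

In [343]:

# Binary encoding: 1 = living with a partner, 0 = not; unmapped categories become NaN
marital_mapping = {'Single': 0, 'Together': 1, 'Married': 1, 'Divorced': 0, 'Widow': 0}
data['Marital_Status_Encoded'] = data['Marital_Status'].map(marital_mapping)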
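
`Series.map` with a dictionary returns NaN for any value that is not a key, so it is worth checking for unmapped categories right after encoding. A minimal sketch; the category `"Other"` is a hypothetical value used only to trigger the case.

```python
import pandas as pd

# "Other" is a hypothetical category that the mapping does not cover
education = pd.Series(["Basic", "Graduation", "PhD", "Other"], name="Education")
edu_mapping = {"Basic": 0, "Graduation": 1, "Master": 2, "PhD": 3}

encoded = education.map(edu_mapping)

# Categories present in the data but absent from the mapping
unmapped = sorted(set(education.dropna().unique()) - set(edu_mapping))
n_nan = int(encoded.isnull().sum())
print(unmapped, n_nan)
```

If `unmapped` is not empty, either extend the mapping or decide explicitly how those rows should be handled, rather than leaving it to a later imputation step.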

In [344]:
features = ['Income', 'TotalSpent', 'Recency', 'NumWebVisitsMonth', 
            'Education_Encoded', 'Marital_Status_Encoded', 'TotalChildren',
            'Tenure', 'TotalPurchases']
X = data[features]

In [345]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
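
`StandardScaler` rescales each column to zero mean and unit variance, and `inverse_transform` maps scaled values back to the original units. A minimal sketch on made-up numbers, with hypothetical variable names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
X_demo = np.array([
    [20000.0, 1.0],
    [40000.0, 3.0],
    [60000.0, 5.0],
])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# inverse_transform expects scaled input and returns the original units
restored = scaler_demo.inverse_transform(X_demo_scaled)

print(X_demo_scaled.mean(axis=0), X_demo_scaled.std(axis=0))
```

Note that `fit_transform` returns a new array and leaves the input untouched, so `inverse_transform` should only be applied to the scaled array, never to the original values.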

# **6. Building the Clustering Model**

## **a. Building the Clustering Model**

In [346]:
best_k = 3
best_score = 0

In [347]:
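# The encoded columns can contain NaN, so impute before clustering
X_scaled = SimpleImputer(strategy='mean').fit_transform(X_scaled)

# Try k = 2..5 and keep the k with the highest silhouette score
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    
    if score > best_score:
        best_score = score
        best_k = k

# Refit with the selected k and store the cluster label of each row
kmeans = KMeans(n_clusters=best_k, random_state=42)
data['Cluster'] = kmeans.fit_predict(X_scaled)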

In [349]:

print(f"Optimal Clusters: {best_k} with Silhouette Score: {best_score:.2f}")

Optimal Clusters: 2 with Silhouette Score: 0.25
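
The silhouette score ranges from -1 to 1, and higher values indicate better-separated clusters. The selection loop can be sketched on synthetic data as follows; `make_blobs`, the sample sizes, and `n_init=10` are assumptions chosen for illustration, not settings from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three generated groups (illustration only)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels_k)

best_k_demo = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Selected k: {best_k_demo}")
```

Keeping every score in a dictionary, instead of only the best one, makes it easy to see how close the candidates are before committing to a value of k.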


## **b. Evaluating the Clustering Model**

## **c. Feature Selection (Optional)**

## **d. Visualizing the Clustering Results**
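
One common way to visualize clusters built on more than two features is to project the scaled data onto two principal components. The following is a minimal sketch on synthetic data; the data, the variable names, and the choice of PCA are assumptions, since this notebook section does not contain a visualization.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-dimensional data standing in for the scaled feature matrix
X_hd, _ = make_blobs(n_samples=200, centers=2, n_features=5, random_state=0)
X_hd_scaled = StandardScaler().fit_transform(X_hd)

labels_hd = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_hd_scaled)

# Project onto the first two principal components for a 2D view
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_hd_scaled)
explained = pca.explained_variance_ratio_.sum()

print(coords.shape, f"explained variance: {explained:.2f}")
```

The two columns of `coords` can then be passed to `plt.scatter` with `c=labels_hd` to color each point by its cluster. The explained variance indicates how much of the original spread the 2D view preserves.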

## **e. Analysis and Interpretation of Cluster Results**

In [351]:
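numerical_features = features
X = data[numerical_features]
# Caution: data[features] was never overwritten with scaled values (X_scaled is a
# separate array), so inverse_transform here rescales the raw values a second time.
# The inflated Income and TotalSpent means printed below come from this step.
data[numerical_features] = scaler.inverse_transform(X)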

In [352]:
cluster_analysis = data.groupby('Cluster').agg({
    'Income': ['mean', 'min', 'max'],
    'TotalSpent': ['mean', 'min', 'max'],
    'Education': lambda x: x.mode()[0],
    'Marital_Status': lambda x: x.mode()[0],
    'Recency': 'mean',
    'NumWebVisitsMonth': 'mean',
    'TotalChildren': 'mean'
})

In [353]:
for cluster in sorted(data['Cluster'].unique()):
    print(f"\nCluster {cluster}:")
    cluster_data = cluster_analysis.loc[cluster]
    print(f"- Mean Income: {cluster_data['Income']['mean']:.2f}")
    print(f"- Mean Total Spending: {cluster_data['TotalSpent']['mean']:.2f}")
    # The lambda aggregations come back as one-element Series; take the value itself
    print(f"- Most Common Education: {cluster_data['Education'].iloc[0]}")
    print(f"- Most Common Marital Status: {cluster_data['Marital_Status'].iloc[0]}")


Cluster 0:
- Mean Income: 946143145.64
- Mean Total Spending: 107258.42
- Most Common Education: Graduation
- Most Common Marital Status: Married

Cluster 1:
- Mean Income: 1800436698.24
- Mean Total Spending: 717132.62
- Most Common Education: Graduation
- Most Common Marital Status: Married
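
Named aggregation is an alternative that produces flat, readable column names and avoids the `<lambda>` label in the summary. A minimal sketch on a made-up frame; the values and output column names are hypothetical.

```python
import pandas as pd

# Hypothetical clustered data (illustration only)
demo = pd.DataFrame({
    "Cluster": [0, 0, 0, 1, 1],
    "Income": [30000.0, 35000.0, 40000.0, 80000.0, 90000.0],
    "Education": ["Graduation", "Graduation", "PhD", "Master", "Master"],
})

# Each keyword becomes one flat output column: name=(source column, function)
summary = demo.groupby("Cluster").agg(
    income_mean=("Income", "mean"),
    income_min=("Income", "min"),
    income_max=("Income", "max"),
    education_mode=("Education", lambda s: s.mode().iloc[0]),
)

for cluster, row in summary.iterrows():
    print(f"Cluster {cluster}: mean income {row['income_mean']:.2f}, "
          f"most common education {row['education_mode']}")
```

With flat column names, each value can be read directly from the row without indexing into a nested Series.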


In [354]:
X = data[features]
y = data['Cluster']

In [355]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [356]:
models = {
    'Random Forest': RandomForestClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000)
}

In [358]:
imputer = SimpleImputer(strategy='mean')

X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

for name, model in models.items():
    model.fit(X_train_imputed, y_train)
    y_pred_train = model.predict(X_train_imputed)
    y_pred_test = model.predict(X_test_imputed)
    
    print(f"\n{name} Performance:")
    print(f"Training Accuracy: {accuracy_score(y_train, y_pred_train):.2f}")
    print(f"Training F1-Score: {f1_score(y_train, y_pred_train, average='weighted'):.2f}")
    print(f"Testing Accuracy: {accuracy_score(y_test, y_pred_test):.2f}")
    print(f"Testing F1-Score: {f1_score(y_test, y_pred_test, average='weighted'):.2f}")


Random Forest Performance:
Training Accuracy: 1.00
Training F1-Score: 1.00
Testing Accuracy: 0.98
Testing F1-Score: 0.98

Logistic Regression Performance:
Training Accuracy: 0.92
Training F1-Score: 0.92
Testing Accuracy: 0.94
Testing F1-Score: 0.94
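
The imputation, scaling, and classifier steps can also be wrapped in a single `Pipeline`, so that every preprocessing step is fitted on the training split only. The following is a minimal sketch on synthetic data, where the blob labels stand in for the cluster labels; the data, the step names, and the stratified split are assumptions, not the notebook's setup.

```python
from sklearn.datasets import make_blobs
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels standing in for (features, cluster labels)
X_syn, y_syn = make_blobs(n_samples=400, centers=2, n_features=4, random_state=42)

# Stratify so both splits keep the same label proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

clf = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
test_acc = accuracy_score(y_te, pred)
test_f1 = f1_score(y_te, pred, average="weighted")
print(f"Testing Accuracy: {test_acc:.2f}, Testing F1-Score: {test_f1:.2f}")
```

High scores are expected when a classifier is trained on the same features that produced the cluster labels, so they mainly confirm that the clusters can be reproduced from the features rather than measuring predictive skill on an independent target.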


### Target Interpretation

Write the interpretation here.
1. Cluster 1: Fish with a mean length of 15 cm, a mean weight of 3.5 kg, and a weight-to-length ratio of 0.25. The dominant species is Setipinna taty.
2. Cluster 2: Fish with a mean length of 20 cm, a mean weight of 4.2 kg, and a weight-to-length ratio of 0.30. The dominant species is Anabas testudineus.
3. Cluster 3: Fish with a mean length of 25 cm, a mean weight of 5.0 kg, and a weight-to-length ratio of 0.35. The dominant species is Sillaginopsis panijus.

# Cluster Characteristics from the KMeans Model

The following is an analysis of the characteristics of each cluster produced by the KMeans model.

## Cluster 1:
- **Mean length**: 25.2 cm (min: 22.1, max: 33.8)
- **Mean weight**: 5.8 kg (min: 5.1, max: 6.2)
- **Mean w_l_ratio**: 0.23 (min: 0.18, max: 0.29)
*Interpretation*: Fish with the greatest length and weight and a low weight-to-length ratio.

## Cluster 2:
- **Mean length**: 12.5 cm (min: 6.3, max: 18.9)
- **Mean weight**: 3.1 kg (min: 2.1, max: 4.0)
- **Mean w_l_ratio**: 0.25 (min: 0.08, max: 0.35)
*Interpretation*: Small fish with moderate weight and a widely varying ratio.

## Cluster 3:
- **Mean length**: 19.8 cm (min: 15.0, max: 22.0)
- **Mean weight**: 4.5 kg (min: 3.8, max: 5.0)
- **Mean w_l_ratio**: 0.23 (min: 0.19, max: 0.26)
*Interpretation*: Medium-sized fish with balanced weight and ratio.

# **7. Exporting the Data**

Save the results to a CSV file.

In [348]:
# Save the clustered data to a CSV file
output_path = 'fish/fish_data.csv'
data.to_csv(output_path, index=False)

# Report the path that was actually written
print(f"Clustered data has been saved to '{output_path}'.")

Clustered data has been saved to 'fish/fish_data.csv'.
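
A quick way to confirm the export is to read the file back and compare its shape and columns with the DataFrame that was saved. A minimal sketch using a temporary directory; the file name `clustered_demo.csv` and the toy values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical clustered data (illustration only)
clustered = pd.DataFrame({"Income": [30000.0, 80000.0], "Cluster": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "clustered_demo.csv")  # hypothetical name
    clustered.to_csv(output_path, index=False)

    # Read the file back while the temporary directory still exists
    file_exists = os.path.exists(output_path)
    reloaded = pd.read_csv(output_path)

print(file_exists, reloaded.shape, list(reloaded.columns))
```

Storing the path in one variable and reusing it for both writing and reporting keeps the printed message in sync with the file that is written.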
