<a href="https://colab.research.google.com/github/Scottzeng03/scott1040/blob/main/EX04_03_%E5%AE%A2%E6%88%B6%E5%88%86%E7%BE%A4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Definition

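Label-encode the 性別 (gender) column.
Cluster the customers into 3 groups and interpret the characteristics of each group.
Compute the Calinski-Harabasz score for k = 2 to 15 and find the best k.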

## Data Collection

In [None]:
!wget -O car_models.csv https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv

--2025-05-27 07:36:56--  https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2839 (2.8K) [text/plain]
Saving to: ‘car_models.csv’


2025-05-27 07:36:57 (23.3 MB/s) - ‘car_models.csv’ saved [2839/2839]



In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/imchihchao/aop113b/main/materials/04-customer.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,性別,年齡,收入（千）,消費指數（1~100）
0,女,74,38,81
1,女,51,71,91
2,女,30,65,10
3,女,88,49,17
4,女,55,48,70


## Data Preprocessing

### Data Cleaning

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   性別           200 non-null    object
 1   年齡           200 non-null    int64 
 2   收入（千）        200 non-null    int64 
 3   消費指數（1~100）  200 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 6.4+ KB


### Exploratory Analysis

In [None]:
df_cor = df.drop(columns='性別').corr()
df_cor

Unnamed: 0,年齡,收入（千）,消費指數（1~100）
年齡,1.0,0.031519,-0.127454
收入（千）,0.031519,1.0,0.031476
消費指數（1~100）,-0.127454,0.031476,1.0


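The pairwise correlations among 年齡 (age), 收入（千） (income, in thousands), and 消費指數（1~100） (spending index, 1 to 100) are all weak. The largest in magnitude is -0.127, between age and spending index, so no feature is a linear proxy for another.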

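### Data Splitting

K-means is unsupervised, so no train/test split is made here. This step only label-encodes 性別, mapping 女 (female) to 0 and 男 (male) to 1.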

In [None]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['女','男']])
df[['性別']] = encoder.fit_transform(df[['性別']])
df

Unnamed: 0,性別,年齡,收入（千）,消費指數（1~100）
0,0.0,74,38,81
1,0.0,51,71,91
2,0.0,30,65,10
3,0.0,88,49,17
4,0.0,55,48,70
...,...,...,...,...
195,1.0,86,84,82
196,1.0,59,52,30
197,0.0,63,29,61
198,1.0,67,80,9


### Feature Scaling
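
The notebook leaves this step empty and fits KMeans on the unscaled columns. Because k-means relies on Euclidean distance, columns with wide ranges (age, income, spending index) outweigh the 0/1 gender column. The following is a minimal sketch of standardizing before clustering, assuming scikit-learn's `StandardScaler`; the small `toy` frame and its English column names are hypothetical stand-ins for `df`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df: gender (0/1), age, income (thousands), spending index
toy = pd.DataFrame({
    "gender": [0.0, 0.0, 1.0, 1.0],
    "age": [74, 51, 30, 88],
    "income_k": [38, 71, 65, 49],
    "spending": [81, 91, 10, 17],
})

scaler = StandardScaler()
# fit_transform returns a NumPy array; wrap it to keep the column names
toy_scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# After scaling, every column has mean 0 and population standard deviation 1
print(toy_scaled.mean().round(6).tolist())
print(toy_scaled.std(ddof=0).round(6).tolist())
```

Whether to scale is a judgment call: it changes which features dominate the distance, so clusters and scores obtained on scaled data would differ from those recorded in this notebook.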

## Model Training

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)

In [None]:
df['cluster'] = kmeans.labels_
df

Unnamed: 0,性別,年齡,收入（千）,消費指數（1~100）,cluster
0,0.0,74,38,81,2
1,0.0,51,71,91,2
2,0.0,30,65,10,1
3,0.0,88,49,17,1
4,0.0,55,48,70,2
...,...,...,...,...,...
195,1.0,86,84,82,2
196,1.0,59,52,30,1
197,0.0,63,29,61,2
198,1.0,67,80,9,1


## Model Evaluation

In [None]:
df.groupby('cluster').mean()

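cluster,性別,年齡,收入（千）,消費指數（1~100）
0,0.563636,29.763636,54.6,72.709091
1,0.578947,55.289474,60.289474,19.578947
2,0.565217,68.449275,60.57971,72.826087

Reading the means: cluster 0 is the youngest group (mean age about 30) with high spending (about 73); cluster 1 is middle-aged (about 55) with low spending (about 20); cluster 2 is the oldest (about 68) with high spending (about 73). Mean income (about 55 to 61 thousand) and the share of men (about 56% to 58%) are nearly the same in every cluster, so age and spending index drive the split.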


In [None]:
from sklearn.metrics import calinski_harabasz_score
score = calinski_harabasz_score(df.drop(columns='cluster'), kmeans.labels_)
score

np.float64(88.85861356898553)
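
The Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom, so higher values indicate denser, better-separated clusters. The following is a sketch of that computation in NumPy on four hypothetical points; the function name `ch_score` is illustrative and not part of scikit-learn.

```python
import numpy as np

def ch_score(X, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # B: size-weighted squared distances of centroids to the overall mean
    within = 0.0   # W: squared distances of points to their own centroid
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy data: two tight pairs of points, 10 units apart
X_toy = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels_toy = [0, 0, 1, 1]
print(ch_score(X_toy, labels_toy))  # B = 100, W = 4, so (100 / 1) / (4 / 2) = 50.0
```

The score has no fixed upper bound, so it is only meaningful when comparing clusterings of the same data.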

## Model Tuning

In [None]:
df_nocluster = df.drop(columns='cluster')
for i in range(2,16):
  kmeans = KMeans(n_clusters=i)
  # fit on the features only, so the earlier 'cluster' labels do not leak in
  kmeans.fit(df_nocluster)
  score = calinski_harabasz_score(df_nocluster, kmeans.labels_)
  print(f"k={i} score={score}")


k=2 score=111.68583137120146
k=3 score=89.31584869867916
k=4 score=82.60483985065864
k=5 score=83.98870308066951
k=6 score=92.20887239022812
k=7 score=92.85715193421127
k=8 score=100.43488502100337
k=9 score=93.94005522518034
k=10 score=93.44181718282113
k=11 score=87.02431325295066
k=12 score=88.49684046019316
k=13 score=86.84496980171804
k=14 score=82.09900938277478
k=15 score=88.19034918795728
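
In the recorded run above the highest score is at k = 2 (about 111.7), with a second, lower peak at k = 8 (about 100.4). KMeans starts from random centroids, so these numbers shift between runs. The following sketch picks the best k programmatically with a fixed seed; `make_blobs` data is a hypothetical stand-in for `df_nocluster`, and the `n_init=10` and `random_state=0` values are assumptions, not settings from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Hypothetical stand-in for df_nocluster: 200 points in 4 dimensions
X_demo, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

scores = {}
for k in range(2, 16):
    # random_state and n_init are fixed so repeated runs give the same scores
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = calinski_harabasz_score(X_demo, model.labels_)

best_k = max(scores, key=scores.get)  # k with the highest score
print(best_k, scores[best_k])
```

Keeping the scores in a dictionary also makes it easy to plot them against k and look for secondary peaks rather than relying on the maximum alone.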


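**Tuning notes**

| Parameter      | Description                                                                                   |
| -------------- | --------------------------------------------------------------------------------------------- |
| `n_clusters`   | The number of clusters k. Too few merges distinct customer groups; too many splits them arbitrarily. Compare a score such as Calinski-Harabasz (higher is better) across candidate values. |
| `n_init`       | How many times k-means is run from different initial centroids; the run with the lowest inertia is kept. |
| `random_state` | Seed for centroid initialization. Fix it to make labels and scores reproducible between runs. |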

## Model Deployment

### Saving the Model
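
The notebook does not include a saving step. One common option is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal example under stated assumptions: the random `X_demo` array is a hypothetical stand-in for the encoded customer features, and the file name is an arbitrary choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the encoded customer features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 4))
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Save to a temporary directory; the file name is an arbitrary choice
path = os.path.join(tempfile.mkdtemp(), "kmeans_customers.joblib")
joblib.dump(model, path)

# Reload and confirm the restored model assigns the same clusters
restored = joblib.load(path)
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

A saved model should be loaded with the same scikit-learn version that wrote it, and any encoder or scaler used in preprocessing needs to be saved with it.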

### Inference
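
The notebook does not include an inference step. A fitted KMeans model assigns a new row to the nearest centroid with `predict`. The sketch below trains on the nine encoded rows displayed earlier in the notebook, only so that it runs on its own; the two `new_customers` rows are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

cols = ["性別", "年齡", "收入（千）", "消費指數（1~100）"]

# Nine encoded rows copied from the table shown earlier (a sample, not the full data)
train = pd.DataFrame(
    [[0, 74, 38, 81], [0, 51, 71, 91], [0, 30, 65, 10],
     [0, 88, 49, 17], [0, 55, 48, 70], [1, 86, 84, 82],
     [1, 59, 52, 30], [0, 63, 29, 61], [1, 67, 80, 9]],
    columns=cols,
)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)

# Hypothetical new customers, already label-encoded (女 = 0, 男 = 1)
new_customers = pd.DataFrame([[1, 25, 60, 85], [0, 70, 45, 15]], columns=cols)
pred = model.predict(new_customers)
print(pred)
```

Cluster numbers are arbitrary labels. To see which segment a prediction corresponds to, compare it against the per-cluster means of the model that produced it.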

## Conclusion