### Solution video: https://youtu.be/diP0q1YzVFg

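## Q. [Marketing] Automobile Market Segmentation
- An automobile company has divided its market into four segments to plan a new strategy.
- Using the existing customer segmentation data, predict which segment each new customer belongs to.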


- Target (y): "Segmentation" (1, 2, 3, 4)
- Evaluation metric: macro F1-score
- Data: train.csv, test.csv
- Submission format:
~~~
ID,Segmentation
458989,1
458994,2
459000,3
459003,4
~~~
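
Since the evaluation metric is the macro F1-score, it may help to see how it is computed: the F1 of each segment is calculated separately and then averaged with equal weight per class. The following is a minimal sketch on made-up toy labels (the `macro_f1` helper and the label arrays are illustrative assumptions, not part of the competition data), checked against `sklearn.metrics.f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels for the four segments (made up for illustration only).
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 2, 2, 2, 3, 4, 4, 4])

def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

manual = macro_f1(y_true, y_pred)
library = f1_score(y_true, y_pred, average="macro")
print(manual, library)
```

Because every class counts equally, a model that ignores a small segment is penalized more heavily than it would be under plain accuracy.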

### Submission notes
- Use the code below, replacing the prediction variable and the exam number with your own.
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

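### Notebook levels
- basic: use numeric features only -> train and predict on the test data
- intermediate: also use categorical features -> train and predict on the test data
- advanced: train with cross-validation (model evaluation) -> hyperparameter tuning -> predict on the test data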

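### Scoring for practice
- Write the final file as "submission.csv" instead of "<exam number>.csv", then click the "Submit" button at the bottom of the right-hand menu to see your score and rank on the leaderboard.
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('submission.csv', index=False)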
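
Before submitting, it can be worth checking that the file has the expected header and only valid segment labels. This is a minimal sketch under stated assumptions: the IDs and predictions are toy values copied from the submission format shown earlier, and the CSV is written to an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Toy IDs and predictions (illustrative values, not real model output).
toy_ids = pd.Series([458989, 458994, 459000, 459003], name="ID")
toy_pred = [1, 2, 3, 4]
submission = pd.DataFrame({"ID": toy_ids, "Segmentation": toy_pred})

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Basic sanity checks on the submission layout.
columns_ok = list(submission.columns) == ["ID", "Segmentation"]
labels_ok = bool(submission["Segmentation"].isin([1, 2, 3, 4]).all())
print(columns_ok, labels_ok)
```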


In [2]:
import pandas as pd
import numpy as np
train_df = pd.read_csv('/kaggle/input/big-data-analytics-certification-kr-2022/train.csv')
test_df = pd.read_csv('/kaggle/input/big-data-analytics-certification-kr-2022/test.csv')
print(train_df, test_df)

          ID  Gender Ever_Married  Age Graduated  Profession  Work_Experience  \
0     462809    Male           No   22        No  Healthcare              1.0   
1     466315  Female          Yes   67       Yes    Engineer              1.0   
2     461735    Male          Yes   67       Yes      Lawyer              0.0   
3     461319    Male          Yes   56        No      Artist              0.0   
4     460156    Male           No   32       Yes  Healthcare              1.0   
...      ...     ...          ...  ...       ...         ...              ...   
6660  463002    Male          Yes   41       Yes      Artist              0.0   
6661  464685    Male           No   35        No   Executive              3.0   
6662  465406  Female           No   33       Yes  Healthcare              1.0   
6663  467299  Female           No   27       Yes  Healthcare              1.0   
6664  461879    Male          Yes   37       Yes   Executive              0.0   

     Spending_Score  Family

In [12]:
#EDA
# print(train_df.head(10),test_df.head(10))
# print(train_df.shape,test_df.shape)
# print(train_df.describe(),test_df.describe())
print(train_df.describe(exclude = 'object'),test_df.describe(exclude = 'object'))
# Numeric columns: Age, Work_Experience, Family_Size
# Object columns: ['Gender','Ever_Married','Graduated','Profession','Spending_Score','Var_1']
# print(train_df.isnull().sum(),test_df.isnull().sum())
# print(train_df['Segmentation'].value_counts())
# print(train_df.info())

               Age  Work_Experience  Family_Size
count  6665.000000      6665.000000  6665.000000
mean     43.536084         2.629107     2.841110
std      16.524054         3.405365     1.524743
min      18.000000         0.000000     1.000000
25%      31.000000         0.000000     2.000000
50%      41.000000         1.000000     2.000000
75%      53.000000         4.000000     4.000000
max      89.000000        14.000000     9.000000                Age  Work_Experience  Family_Size
count  2154.000000      2154.000000  2154.000000
mean     43.461467         2.551532     2.837047
std      16.761895         3.344917     1.566872
min      18.000000         0.000000     1.000000
25%      30.000000         0.000000     2.000000
50%      41.000000         1.000000     2.000000
75%      52.000000         4.000000     4.000000
max      89.000000        14.000000     9.000000


In [8]:
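# Data preprocessing
# Separate the target, drop the ID from train, and keep the test IDs for the submission.
# Run this cell only once: pop/drop raise an error if the columns are already gone.
target = train_df.pop('Segmentation')
train_df = train_df.drop('ID', axis = 1)
id = test_df.pop('ID')  # note: this name shadows the built-in id(); it is reused in the submission cell
print(target)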

0       4
1       2
2       2
3       3
4       3
       ..
6660    2
6661    4
6662    4
6663    2
6664    2
Name: Segmentation, Length: 6665, dtype: int64


In [13]:
# Scale the numeric features
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
fcols = ['Age','Work_Experience','Family_Size']
for fcol in fcols:
    # Fit on train only, then apply the same transformation to test
    train_df[fcol] = sc.fit_transform(train_df[[fcol]])
    test_df[fcol] = sc.transform(test_df[[fcol]])
    
train_df    

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,Male,No,-1.303415,No,Healthcare,-0.478430,Low,0.760113,Cat_4
1,Female,Yes,1.420092,Yes,Engineer,-0.478430,Low,-1.207580,Cat_6
2,Male,Yes,1.420092,Yes,Lawyer,-0.772106,High,-0.551682,Cat_6
3,Male,Yes,0.754346,No,Artist,-0.772106,Average,-0.551682,Cat_6
4,Male,No,-0.698191,Yes,Healthcare,-0.478430,Low,0.104215,Cat_6
...,...,...,...,...,...,...,...,...,...
6660,Male,Yes,-0.153490,Yes,Artist,-0.772106,High,1.416011,Cat_6
6661,Male,No,-0.516624,No,Executive,0.108922,Low,0.760113,Cat_4
6662,Female,No,-0.637669,Yes,Healthcare,-0.478430,Low,-1.207580,Cat_6
6663,Female,No,-1.000803,Yes,Healthcare,-0.478430,Low,0.760113,Cat_6


In [14]:
# Label-encode the categorical columns (this is a classification task)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
cols = ['Gender','Ever_Married','Graduated','Profession','Spending_Score','Var_1']
for col in cols:
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col])    

test_df

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,-0.456102,1,2,-0.772106,2,-1.207580,5
1,1,1,-0.395579,1,5,1.577304,0,0.760113,5
2,1,1,0.935913,0,4,2.458333,1,-0.551682,5
3,1,1,0.209644,1,1,-0.772106,1,1.416011,3
4,1,1,1.056958,1,1,0.696275,2,0.104215,5
...,...,...,...,...,...,...,...,...,...
2149,0,0,-0.516624,1,3,-0.478430,2,-0.551682,5
2150,1,0,-0.879758,0,5,1.870980,2,0.760113,5
2151,0,0,-0.516624,1,1,-0.478430,2,-1.207580,5
2152,1,1,0.209644,1,4,-0.478430,1,1.416011,3
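
One caveat with the label-encoding cell above: `LabelEncoder.transform` raises an error if the test data contains a category that was not seen during fitting. A hedged alternative, assuming scikit-learn 0.24 or later, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`. The toy frames below are made up for illustration; the 'Marketing' value is hypothetical and appears only in the test frame.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; 'Marketing' is a made-up category that occurs only in test.
toy_train = pd.DataFrame({"Profession": ["Artist", "Doctor", "Engineer", "Artist"]})
toy_test = pd.DataFrame({"Profession": ["Doctor", "Marketing"]})

# Unseen categories are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(toy_train[["Profession"]])
test_codes = enc.transform(toy_test[["Profession"]])
print(train_codes.ravel(), test_codes.ravel())
```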


In [19]:
from sklearn.model_selection import train_test_split
x_tr, x_vr, t_tr, t_vr = train_test_split(train_df, target, test_size = 0.2, random_state = 2023)
print(x_tr)

      Gender  Ever_Married       Age  Graduated  Profession  Work_Experience  \
1386       1             1  0.391212          1           4        -0.478430   
1397       1             1  0.451734          1           0         1.283628   
5713       1             0 -0.637669          0           8         2.458333   
3722       0             0 -0.940281          0           3        -0.478430   
767        0             0 -0.335057          1           0         1.870980   
...      ...           ...       ...        ...         ...              ...   
6049       1             1  1.056958          1           3         0.989951   
2743       0             0 -0.940281          0           4        -0.772106   
6598       1             1 -0.274535          0           4         0.989951   
5657       1             1  1.722704          1           4        -0.478430   
4951       1             1 -0.214012          1           1        -0.478430   

      Spending_Score  Family_Size  Var_

In [26]:
from sklearn.ensemble import RandomForestClassifier
model_RFC = RandomForestClassifier(n_estimators = 200, max_depth = 8, random_state = 2023)
model_RFC.fit(x_tr,t_tr)




In [31]:
pred = model_RFC.predict(x_vr)
pred

array([3, 4, 3, ..., 3, 1, 4])

In [30]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_RFC, x_tr, t_tr, scoring='f1_macro', cv=5)
print(scores)
print(scores.mean())

[0.51019865 0.51012476 0.5273146  0.53054131 0.52995771]
0.5216274061323178
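
When `cv=5` is passed with a classifier, `cross_val_score` uses stratified folds without shuffling. If shuffled folds with a fixed seed are preferred, the splitter can be passed explicitly. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the competition data), with a smaller forest to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)

model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=2023)

# Explicit stratified splitter: each fold keeps the class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2023)
scores = cross_val_score(model, X, y, scoring="f1_macro", cv=cv)
print(scores)
print(scores.mean())
```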


In [32]:
from sklearn.metrics import f1_score
f1_score(t_vr, pred, average = 'macro')  # y_true first, then y_pred

0.5315857785169403

In [None]:
# Recorded scores: n_estimators=200 -> 0.5225550469718077, n_estimators=100 -> 0.5220513137066412
# Retrain on the full training data, then predict the test set
model_RFC.fit(train_df, target)
pred = model_RFC.predict(test_df)
pred

In [None]:
answer = pd.DataFrame({'ID' : id, 'Segmentation' : pred})
answer

In [None]:
answer.to_csv('submission.csv', index = False)

In [None]:
# Import libraries
import pandas as pd

In [None]:
# Load the data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

# 🍭 Basic level 🍭
- Goal: get a submission in, even if it uses only the numeric features! 👍

## EDA

In [None]:
# Check the data shapes
train.shape, test.shape

In [None]:
# Look at train samples
train.head()

In [None]:
# Look at test samples
test.head()

In [None]:
# Check the target distribution
train['Segmentation'].value_counts()

In [None]:
# Check missing values (train)
train.isnull().sum()

In [None]:
# Check missing values (test)
test.isnull().sum()

In [None]:
# Check dtypes
train.info()

## Preprocessing

In [None]:
# Separate the target (y, label) from train
target = train.pop('Segmentation')
target

In [None]:
# Set aside the test IDs for the submission
test_ID = test.pop('ID')

In [None]:
# Numeric columns (train)
# ['ID', 'Age', 'Work_Experience', 'Family_Size', 'Segmentation']
num_cols = ['Age', 'Work_Experience', 'Family_Size']
train = train[num_cols]
train.head(2)

In [None]:
# Numeric columns (test)
test = test[num_cols]
test.head(2)

## Model training and prediction

In [None]:
# Choose and train the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(train, target)
pred = rf.predict(test)
pred

In [None]:
# Prediction results -> DataFrame
# pd.DataFrame({'ID': test_ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

submit = pd.DataFrame({
    'ID': test_ID,
    'Segmentation': pred
})
submit

In [None]:
submit.to_csv("submission.csv", index=False)
# Score: 0.30477

# 🍭 Intermediate level 🍭
- Goal: make use of the categorical features

In [None]:
# Import libraries
import pandas as pd

In [None]:
# Load the data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

## EDA

In [None]:
# Look at train samples
train.head()

In [None]:
train.info()

In [None]:
train.describe(include="O")

## Preprocessing

In [None]:
# One-hot encoding
train = pd.get_dummies(train)
test = pd.get_dummies(test)
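
Calling `pd.get_dummies` on train and test separately only works when both frames contain exactly the same categories; otherwise the resulting columns differ and `predict` fails. A minimal sketch of one way to guard against this, on made-up toy frames (the 'Homemaker' and 'Lawyer' split is an assumption for illustration), is to reindex the test columns to match train.

```python
import pandas as pd

# Toy frames whose categories do not fully overlap (illustrative values).
toy_train = pd.DataFrame({"Age": [22, 67, 41],
                          "Profession": ["Artist", "Doctor", "Lawyer"]})
toy_test = pd.DataFrame({"Age": [35, 50],
                         "Profession": ["Artist", "Homemaker"]})

train_oh = pd.get_dummies(toy_train)
test_oh = pd.get_dummies(toy_test)

# Align test to the training columns: columns missing in test are filled
# with 0, and columns that exist only in test are dropped.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
print(list(train_oh.columns))
print(list(test_oh.columns))
```

In the real notebook the target and ID columns would be removed from train before aligning, so that test is not reindexed to include them.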

In [None]:
# Check dtypes
train.info()

In [None]:
# Separate the target (y, label) from train
target = train.pop('Segmentation')
target

In [None]:
train = train.drop("ID", axis=1)
train.head(1)

In [None]:
# Set aside the test IDs for the submission
test_ID = test.pop('ID')
test_ID

In [None]:
# Choose and train the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0)
rf.fit(train, target)
pred = rf.predict(test)
pred

In [None]:
# Prediction results -> DataFrame
# pd.DataFrame({'ID': test_ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

submit = pd.DataFrame({
    'ID': test_ID,
    'Segmentation': pred
})
submit

In [None]:
submit.to_csv("submission.csv", index=False)
# Score: 0.30381

# 🍭 Advanced level 🍭
- Goal: cross-validate and evaluate the model before submitting

In [None]:
# Load the data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

In [None]:
# Categorical columns
# train.select_dtypes(include='object').columns
# ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score','Var_1']
cat_cols = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score','Var_1']

In [None]:
## Label encoding
## Series.astype('category').cat.codes
for col in cat_cols:
    train[col] = train[col].astype('category').cat.codes
train

In [None]:
## The categories below show that cat.codes assigns labels in alphabetical order
test['Profession'].astype('category').cat.categories

In [None]:
## Label encoding
for col in cat_cols:
    test[col] = test[col].astype('category').cat.codes
test
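
Encoding train and test independently with `cat.codes` gives the same integers only when both frames contain the same set of values. If a category is missing from one of them, the codes shift. A minimal sketch of a safer variant, on made-up toy frames, reuses the categories learned from train when encoding test.

```python
import pandas as pd

# Toy frames: 'Cat_1' appears in train but not in test (illustrative values).
toy_train = pd.DataFrame({"Var_1": ["Cat_1", "Cat_4", "Cat_6"]})
toy_test = pd.DataFrame({"Var_1": ["Cat_6", "Cat_4"]})

# Independent encoding: codes depend on which values happen to be present.
independent = toy_test["Var_1"].astype("category").cat.codes.tolist()

# Shared encoding: reuse the categories learned from train.
categories = toy_train["Var_1"].astype("category").cat.categories
shared = pd.Categorical(toy_test["Var_1"], categories=categories).codes.tolist()

print(independent)  # [1, 0]
print(shared)       # [2, 1]
```

With the shared categories, 'Cat_6' maps to 2 in both frames, whereas the independent encoding maps it to 1 in test. Values that are not in the training categories receive the code -1.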

In [None]:
# Handle the ID and target columns
target = train.pop('Segmentation')
train = train.drop("ID", axis=1)
test_ID = test.pop('ID')

In [None]:
# Choose the model
# Hyperparameters to tune: max_depth, n_estimators
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0, max_depth=7, n_estimators=500)

In [None]:
# Cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, train, target, scoring='f1_macro', cv=5)
print(scores)
print(scores.mean())
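
The hyperparameters above were chosen by hand. One possible way to automate the search, shown here as a sketch on synthetic data rather than the competition data, is `GridSearchCV` with the same `f1_macro` scoring. The grid values are illustrative assumptions and are kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the real features (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# Small illustrative grid over the two hyperparameters tuned in this notebook.
param_grid = {"max_depth": [3, 7], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is already refit on all the data passed to `fit`, so it can be used directly for prediction.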

In [None]:
# Train on the full training set and predict the test set
rf.fit(train, target)
pred = rf.predict(test)
pred

In [None]:
# Prediction results -> DataFrame
# pd.DataFrame({'ID': test_ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

submit = pd.DataFrame({
    'ID': test_ID,
    'Segmentation': pred
})
submit.to_csv("submission.csv", index=False)
# Score: 0.32046