### Solution video: https://youtu.be/diP0q1YzVFg

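## Q. [Marketing] Automobile Market Segmentation
- An automobile company has divided its market into four segments to plan a new strategy.
- Using the existing customer classification data, predict which segment each new customer belongs to.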


- Target (y): "Segmentation" (1, 2, 3, 4)
- Evaluation metric: macro F1-score
- Data: train.csv, test.csv
- Submission format:
~~~
ID,Segmentation
458989,1
458994,2
459000,3
459003,4
~~~
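
As a rough illustration of the evaluation metric, the sketch below computes the macro F1-score with scikit-learn's `f1_score`. The labels are made up for the example and are not competition data. Macro averaging takes the F1-score of each class separately and then averages them with equal weight, so every segment counts the same regardless of its size.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted segments for eight customers
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]

# F1 for each class separately, in the order given by labels
per_class = f1_score(y_true, y_pred, average=None, labels=[1, 2, 3, 4])

# Macro F1 is the unweighted mean of the per-class scores
macro = f1_score(y_true, y_pred, average='macro', labels=[1, 2, 3, 4])

print(per_class.round(3))  # [0.667 0.8   0.667 0.8  ]
print(round(macro, 3))     # 0.733
```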

### Submission notes
- Adapt the code below, replacing the prediction variable and the exam ID in the file name with your own.
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

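### Notebook levels
- basic: use only the numerical columns -> train and predict on the test data
- intermediate: also use the categorical columns -> train and predict on the test data
- advanced: train with cross-validation (model evaluation) -> tune hyperparameters -> predict on the test data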

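### Scoring for practice
- Save the final file as "submission.csv" instead of "(exam ID).csv", then click the "submit" button at the bottom of the right-hand menu to see your score and rank on the leaderboard.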
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('submission.csv', index=False)
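
Before writing the file, it can help to check that the frame matches the submission format described above. The helper below is a hypothetical sketch: `check_submission` is not part of pandas or the competition, and the actual grader may apply rules beyond the ones tested here.

```python
import pandas as pd

def check_submission(df, expected_ids):
    """Hypothetical helper: return a list of format problems (empty means OK)."""
    if list(df.columns) != ['ID', 'Segmentation']:
        return ['columns must be exactly: ID, Segmentation']
    problems = []
    if df['ID'].tolist() != list(expected_ids):
        problems.append('IDs must match the test IDs, in the same order')
    if not df['Segmentation'].isin([1, 2, 3, 4]).all():
        problems.append('Segmentation values must be 1, 2, 3 or 4')
    return problems

# The four IDs from the sample submission format
ids = [458989, 458994, 459000, 459003]
good = pd.DataFrame({'ID': ids, 'Segmentation': [1, 2, 3, 4]})
bad = pd.DataFrame({'ID': ids, 'Segmentation': [1, 2, 3, 5]})

print(check_submission(good, ids))  # []
print(check_submission(bad, ids))   # one problem: invalid Segmentation value
```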


In [40]:
# Load libraries
import pandas as pd

In [41]:
# Load data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

# 🍭 Basic level 🍭
- Goal: make a submission using at least the numerical columns! 👍

## EDA

In [42]:
# Check data shapes
train.shape, test.shape

((6665, 11), (2154, 10))

In [43]:
# Preview the first rows of train
train.head(3)

,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,4
1,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,2
2,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,2


In [44]:
# Preview the first rows of test
test.head(3)

,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6


In [45]:
# Check the target distribution
train.Segmentation.value_counts()

4    1757
3    1720
1    1616
2    1572
Name: Segmentation, dtype: int64

In [46]:
# Check missing values (train): none
train.isnull().sum()

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64

In [47]:
# Check missing values (test): none
test.isnull().sum()

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
dtype: int64

In [48]:
# Check dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6665 entries, 0 to 6664
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               6665 non-null   int64  
 1   Gender           6665 non-null   object 
 2   Ever_Married     6665 non-null   object 
 3   Age              6665 non-null   int64  
 4   Graduated        6665 non-null   object 
 5   Profession       6665 non-null   object 
 6   Work_Experience  6665 non-null   float64
 7   Spending_Score   6665 non-null   object 
 8   Family_Size      6665 non-null   float64
 9   Var_1            6665 non-null   object 
 10  Segmentation     6665 non-null   int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 572.9+ KB


## Preprocessing

In [49]:
# Separate the target (y, label) from train; pop removes the column and returns it
target = train.pop('Segmentation')
target

0       4
1       2
2       2
3       3
4       3
       ..
6660    2
6661    4
6662    4
6663    2
6664    2
Name: Segmentation, Length: 6665, dtype: int64

In [50]:
# Separate the ID column from test (needed later for the submission file)
id = test.pop('ID')
id

0       458989
1       458994
2       459000
3       459003
4       459005
         ...  
2149    467950
2150    467954
2151    467958
2152    467961
2153    467968
Name: ID, Length: 2154, dtype: int64

In [51]:
# Numerical columns (train)
# All numeric columns: ['ID', 'Age', 'Work_Experience', 'Family_Size', 'Segmentation']
# ID is an identifier and Segmentation is the target, so neither is used as a feature
num_columns = ['Age', 'Work_Experience', 'Family_Size']
train = train[num_columns]
train.head(3)

,Age,Work_Experience,Family_Size
0,22,1.0,4.0
1,67,1.0,1.0
2,67,0.0,2.0
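
The numerical column list above is typed by hand. As a small sketch of an alternative, `select_dtypes` can derive it from the dtypes. The frame below is a hand-built stand-in using a few columns of the three rows shown by `train.head(3)`, not the full dataset.

```python
import pandas as pd

# Stand-in frame with a subset of the columns from train.head(3)
sample = pd.DataFrame({
    'ID': [462809, 466315, 461735],
    'Gender': ['Male', 'Female', 'Male'],
    'Age': [22, 67, 67],
    'Work_Experience': [1.0, 1.0, 0.0],
    'Family_Size': [4.0, 1.0, 2.0],
    'Segmentation': [4, 2, 2],
})

# Keep numeric columns, then drop the identifier and the target
num_columns = (
    sample.select_dtypes(include='number')
          .columns
          .drop(['ID', 'Segmentation'])
          .tolist()
)
print(num_columns)  # ['Age', 'Work_Experience', 'Family_Size']
```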


In [52]:
# Numerical columns (test)
test = test[num_columns]
test.head(3)

,Age,Work_Experience,Family_Size
0,36,0.0,1.0
1,37,8.0,4.0
2,59,11.0,2.0


## Model training and prediction

In [53]:
# Choose and train the model (no random_state is set, so predictions can vary between runs)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(train, target)
pred = rf.predict(test)
pred

array([2, 3, 3, ..., 4, 3, 1])
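
The cell above fits on all of the training data, so the only feedback comes from the leaderboard. One common way to get a local estimate first is to hold out part of the training data and score it with macro F1. The sketch below uses synthetic data from `make_classification` as a stand-in for the three numerical columns, so its score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 numerical features, 4 classes (assumption, not the real data)
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
y = y + 1  # shift labels from 0..3 to 1..4 to mimic Segmentation

# Hold out 20% of the rows, keeping the class proportions the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(random_state=0)
rf.fit(X_tr, y_tr)
val_pred = rf.predict(X_val)

# Same metric as the competition: macro F1
val_score = f1_score(y_val, val_pred, average='macro')
print(round(val_score, 3))
```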

In [54]:
# Prediction results -> DataFrame
# For the exam-style file name: pd.DataFrame({'ID': id, 'Segmentation': pred}).to_csv('003000000.csv', index=False)
result = pd.DataFrame({'ID': id, 'Segmentation': pred})
result

,ID,Segmentation
0,458989,2
1,458994,3
2,459000,3
3,459003,3
4,459005,2
...,...,...
2149,467950,1
2150,467954,1
2151,467958,4
2152,467961,3


In [55]:
result.to_csv("submission.csv", index=False)
# Score: 0.30477

# 🍭 Intermediate level 🍭
- Goal: make use of the categorical data

In [56]:
# Load libraries
import pandas as pd

In [57]:
# Load data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

## EDA

In [58]:
# Preview the first rows of train
train.head(3)

,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,4
1,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,2
2,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,2


In [59]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6665 entries, 0 to 6664
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               6665 non-null   int64  
 1   Gender           6665 non-null   object 
 2   Ever_Married     6665 non-null   object 
 3   Age              6665 non-null   int64  
 4   Graduated        6665 non-null   object 
 5   Profession       6665 non-null   object 
 6   Work_Experience  6665 non-null   float64
 7   Spending_Score   6665 non-null   object 
 8   Family_Size      6665 non-null   float64
 9   Var_1            6665 non-null   object 
 10  Segmentation     6665 non-null   int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 572.9+ KB


In [60]:
train.describe(include='O')

,Gender,Ever_Married,Graduated,Profession,Spending_Score,Var_1
count,6665,6665,6665,6665,6665,6665
unique,2,2,2,9,3,7
top,Male,Yes,Yes,Artist,Low,Cat_6
freq,3677,3944,4249,2192,3999,4476


## Preprocessing

In [61]:
# One-hot encoding (applied to train and test separately)
train = pd.get_dummies(train)
test = pd.get_dummies(test)
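
Calling `get_dummies` on each file separately produces matching columns only when both files contain the same category values. If a category appeared in just one of them, the frames would differ and the model could not predict on test. A minimal sketch of one way to guard against that, using tiny made-up frames, is to reindex the test columns to the train columns:

```python
import pandas as pd

# Made-up frames: 'Engineer' appears only in test; 'Doctor' and 'Lawyer' only in train
train_demo = pd.DataFrame({'Age': [22, 67, 40],
                           'Profession': ['Artist', 'Doctor', 'Lawyer']})
test_demo = pd.DataFrame({'Age': [36, 59],
                          'Profession': ['Artist', 'Engineer']})

train_enc = pd.get_dummies(train_demo)
test_enc = pd.get_dummies(test_demo)

# Force test to have exactly the train columns, in the same order:
# columns missing from test are added as 0, columns unseen in train are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
# ['Age', 'Profession_Artist', 'Profession_Doctor', 'Profession_Lawyer']
```

In the notebook itself, the target and ID columns would need to be removed before aligning, since train contains `Segmentation` and test does not.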

In [62]:
# Check dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6665 entries, 0 to 6664
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        6665 non-null   int64  
 1   Age                       6665 non-null   int64  
 2   Work_Experience           6665 non-null   float64
 3   Family_Size               6665 non-null   float64
 4   Segmentation              6665 non-null   int64  
 5   Gender_Female             6665 non-null   uint8  
 6   Gender_Male               6665 non-null   uint8  
 7   Ever_Married_No           6665 non-null   uint8  
 8   Ever_Married_Yes          6665 non-null   uint8  
 9   Graduated_No              6665 non-null   uint8  
 10  Graduated_Yes             6665 non-null   uint8  
 11  Profession_Artist         6665 non-null   uint8  
 12  Profession_Doctor         6665 non-null   uint8  
 13  Profession_Engineer       6665 non-null   uint8  
 14  Professi

In [63]:
# Separate the target (y, label) from train
target = train.pop('Segmentation')
target.head(2)

0    4
1    2
Name: Segmentation, dtype: int64

In [64]:
# Drop the ID column from train
train = train.drop('ID', axis=1)

In [65]:
# Separate the ID column from test
id = test.pop('ID')

In [66]:
# Choose and train the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state = 0)
rf.fit(train, target)
pred = rf.predict(test)
pred

array([1, 3, 3, ..., 2, 3, 4])

In [67]:
# Prediction results -> DataFrame
# For the exam-style file name: pd.DataFrame({'ID': id, 'Segmentation': pred}).to_csv('003000000.csv', index=False)
result = pd.DataFrame({'ID': id, 'Segmentation':pred})
result

,ID,Segmentation
0,458989,1
1,458994,3
2,459000,3
3,459003,3
4,459005,1
...,...,...
2149,467950,1
2150,467954,4
2151,467958,2
2152,467961,3


In [68]:
result.to_csv("submission.csv", index=False)
# Score: 0.30381

# 🍭 Advanced level 🍭
- Goal: evaluate the model with cross-validation before submitting

In [69]:
# Load data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

In [70]:
# Categorical columns
# train.select_dtypes(include='object').columns
# ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score','Var_1']
cat_cols = ['Gender', 'Ever_Married', 'Graduated', 'Profession', 'Spending_Score','Var_1']

In [71]:
## Label encoding
## Series.astype('category').cat.codes
for col in cat_cols:
    train[col] = train[col].astype('category').cat.codes
train

,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,1,0,22,0,5,1.0,2,4.0,3,4
1,466315,0,1,67,1,2,1.0,2,1.0,5,2
2,461735,1,1,67,1,7,0.0,1,2.0,5,2
3,461319,1,1,56,0,0,0.0,0,2.0,5,3
4,460156,1,0,32,1,5,1.0,2,3.0,5,3
...,...,...,...,...,...,...,...,...,...,...,...
6660,463002,1,1,41,1,0,0.0,1,5.0,5,2
6661,464685,1,0,35,0,4,3.0,2,4.0,3,4
6662,465406,0,0,33,1,5,1.0,2,1.0,5,4
6663,467299,0,0,27,1,5,1.0,2,4.0,5,2


In [72]:
## cat.codes numbers the categories in alphabetical order, as the category listing confirms
test['Profession'].astype('category').cat.categories

Index(['Artist', 'Doctor', 'Engineer', 'Entertainment', 'Executive',
       'Healthcare', 'Homemaker', 'Lawyer', 'Marketing'],
      dtype='object')
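
Because each call to `astype('category')` builds its category list from the values it sees, encoding train and test separately gives matching codes only when both contain the same set of values. The sketch below, on made-up frames, shows the mismatch and one possible fix: a shared `pd.CategoricalDtype` built from the training values.

```python
import pandas as pd

# Made-up frames: test is missing 'Artist'
train_demo = pd.DataFrame({'Profession': ['Lawyer', 'Artist', 'Doctor']})
test_demo = pd.DataFrame({'Profession': ['Lawyer', 'Doctor']})

# Separate encoding: in test, Doctor=0 and Lawyer=1, which disagrees with train
test_codes_sep = test_demo['Profession'].astype('category').cat.codes

# Shared encoding: fix the category list once, from the training data
shared = pd.CategoricalDtype(categories=sorted(train_demo['Profession'].unique()))
train_codes = train_demo['Profession'].astype(shared).cat.codes
test_codes = test_demo['Profession'].astype(shared).cat.codes

print(train_codes.tolist())     # [2, 0, 1]
print(test_codes_sep.tolist())  # [1, 0]
print(test_codes.tolist())      # [2, 1]
```

With a shared dtype, a value that never occurred in train is encoded as -1 rather than silently taking another category's code.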

In [73]:
## Label encoding
for col in cat_cols:
    test[col] = test[col].astype('category').cat.codes
test

,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,458989,0,1,36,1,2,0.0,2,1.0,5
1,458994,1,1,37,1,5,8.0,0,4.0,5
2,459000,1,1,59,0,4,11.0,1,2.0,5
3,459003,1,1,47,1,1,0.0,1,5.0,3
4,459005,1,1,61,1,1,5.0,2,3.0,5
...,...,...,...,...,...,...,...,...,...,...
2149,467950,0,0,35,1,3,1.0,2,2.0,5
2150,467954,1,0,29,0,5,9.0,2,4.0,5
2151,467958,0,0,35,1,1,1.0,2,1.0,5
2152,467961,1,1,47,1,4,1.0,1,5.0,3


In [74]:
# Separate the target and remove the ID columns
target = train.pop('Segmentation')
train = train.drop("ID", axis=1)
test_ID = test.pop('ID')

In [75]:
# Choose the model
# Hyperparameters tuned: max_depth, n_estimators
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0, max_depth=7, n_estimators=500)
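
The notebook does not show how `max_depth=7` and `n_estimators=500` were chosen. One possible way to search such values is `GridSearchCV` scored with macro F1, sketched below. The data is synthetic and the grid is an arbitrary small example kept short for speed, so the values it picks are not a recommendation for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data as a stand-in for the encoded training set (assumption)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Candidate values to try; every combination is evaluated with 3-fold CV
param_grid = {'max_depth': [3, 5, 7], 'n_estimators': [50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring='f1_macro',
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```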

In [76]:
# Cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rf, train, target, scoring='f1_macro', cv=5)
print(scores)
print(scores.mean())

[0.53130191 0.51695963 0.52121909 0.54069647 0.51119827]
0.5242750727554509
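
For a classifier, passing an integer `cv` makes `cross_val_score` use stratified folds without shuffling. If shuffled folds or a fixed seed are wanted, a `StratifiedKFold` object can be passed instead. The sketch below does this on synthetic data and reports the spread of the fold scores alongside the mean. The smaller `n_estimators` is only there to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data as a stand-in for the encoded training set (assumption)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rf = RandomForestClassifier(random_state=0, max_depth=7, n_estimators=100)

scores = cross_val_score(rf, X, y, scoring='f1_macro', cv=cv)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```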


In [77]:
# Train on the full training set and predict on test
rf.fit(train, target)
pred = rf.predict(test)
pred

array([1, 3, 2, ..., 1, 2, 4])

In [78]:
# Prediction results -> DataFrame
# For the exam-style file name: pd.DataFrame({'ID': test_ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

submit = pd.DataFrame({
    'ID': test_ID,
    'Segmentation': pred
})
submit.to_csv("submission.csv", index=False)
# Score: 0.32046