# 1. Problem Definition


## 1.1 Competition Overview

- DACON Credit Card User Delinquency Prediction AI Competition

- https://dacon.io/competitions/official/235713/overview/description

### 1.1.1 Evaluation
- Metric: Logloss
- Public score: computed on a randomly sampled 50% of the test data
- Private score: computed on the remaining 50% of the test data

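### 1.1.2 Background
- Credit card companies compute a credit score from the personal information and data that applicants submit. They use this score to predict how likely an applicant is to default or fall behind on credit card payments in the future.

- Many financial companies are now looking to build financial services with artificial intelligence (AI). The task is to develop an AI algorithm that predicts the degree of a user's payment delinquency and to uncover insights that can be proposed to the financial industry.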

### 1.1.3 Data to Analyze
- train: 26,457 rows, 20 columns
- test: 10,000 rows, 19 columns ("credit" excluded)

=> Train a model on the train data, then predict **"credit"** for the test data

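## 1.2 Data Domain Information
- index
- gender: gender
- car: whether the applicant owns a car
- reality: whether the applicant owns real estate
- child_num: number of children
- income_total: annual income
- income_type: income category ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
- edu_type: education level ['Higher education' ,'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']
- family_type: marital status ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
- house_type: type of housing ['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']
- DAYS_BIRTH: date of birth, counted backward from the data collection day (0); e.g., -1 means born one day before the collection day
- DAYS_EMPLOYED: employment start date, counted backward from the data collection day (0); e.g., -1 means the job started one day before the collection day. A positive value means the person is not employed
- FLAG_MOBIL: whether the applicant has a mobile phone
- work_phone: whether the applicant has a work phone
- phone: whether the applicant has a home phone
- email: whether the applicant has an email address
- occyp_type: occupation type
- family_size: family size
- begin_month: month the credit card was issued, counted backward from the data collection day (0); e.g., -1 means the card was issued one month before the collection day
- credit: credit rating based on the user's credit card payment delinquency => a lower value means a more creditworthy user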
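The backward-counted day columns can be hard to read at a glance. The following is a minimal sketch of how they might be interpreted; the helper names `days_to_years` and `is_unemployed` are hypothetical, and dividing by 365.25 is an assumption about how to approximate years, not something the competition specifies.

```python
def days_to_years(days):
    """Convert a negative day offset (counted back from collection day 0) to years.

    Assumption: 365.25 days per year is used as an approximation.
    """
    return -days / 365.25


def is_unemployed(days_employed):
    """Per the column description, a positive DAYS_EMPLOYED means not employed."""
    return days_employed > 0


# -15958 is close to the mean DAYS_BIRTH in train; 365243 is the maximum DAYS_EMPLOYED
age = days_to_years(-15958)
print(round(age, 1))
print(is_unemployed(365243))
print(is_unemployed(-1539))
```

A derived age or an "unemployed" flag like this could later serve as a feature, though this notebook feeds the raw columns to the model.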

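## 1.3 Evaluation Metric
=> The measure used to rank participants' final submissions

- This competition uses logloss as its evaluation metric. Logloss is one of the metrics used to evaluate classification models: the closer it is to 0, the more accurate the predictions, and it grows sharply as the probability assigned to the true class decreases.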
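To make the metric concrete, here is a minimal sketch of multiclass logloss in plain Python. The function name `multiclass_log_loss`, the clipping constant `eps`, and the toy probabilities are illustrative assumptions; the notebook itself relies on scikit-learn's `log_loss` later on.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


# One sample whose true class is 2, scored under three different predictions
confident = multiclass_log_loss([2], [[0.05, 0.05, 0.90]])
unsure = multiclass_log_loss([2], [[0.30, 0.30, 0.40]])
wrong = multiclass_log_loss([2], [[0.90, 0.05, 0.05]])
print(confident, unsure, wrong)
```

The loss rises from the confident correct prediction to the unsure one, and rises much further when the true class receives only 0.05, which is the "sharp increase" described above.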

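## 1.4 Approach to Solving the Problem

- This time, unlike the previous approach, the workflow is:
- **Data modeling -> check results -> EDA -> preprocessing -> improve model performance**

## 1.5 Section Ideas
- [11.30]
  - 1) Unlike last time, start with a basic model to get predictions first, then make further changes to improve performance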

# 2. Quick Data Exploration

## 2.1 Data Connection
- Connect to the URLs used to load the data

In [None]:
import requests
from io import StringIO

In [None]:
# Fetch the training data
orig_url = 'https://drive.google.com/file/d/1Rku0YUinqkflwUAgaT66uD4W1nhVEll5/view?usp=sharing'
file_id = orig_url.split('/')[-2]
down_url = 'https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(down_url).text
train_url = StringIO(url)

In [None]:
# Fetch the test data
orig_url = 'https://drive.google.com/file/d/1UpAcGoRB5zOqzNFw9EYMUwjLcWgzahYM/view?usp=sharing'
file_id = orig_url.split('/')[-2]
down_url = 'https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(down_url).text
test_url = StringIO(url)

In [None]:
# Fetch the submission data
orig_url = 'https://drive.google.com/file/d/157dN3myXQ-sIXNKiWRAVTs4fvU5ZChEJ/view?usp=sharing'
file_id = orig_url.split('/')[-2]
down_url = 'https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(down_url).text
sub_url = StringIO(url)
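The three cells above repeat the same URL rewriting. As a sketch, that step could be factored into one helper; `drive_download_url` is a hypothetical name, and the share link below uses a placeholder id rather than a real file.

```python
def drive_download_url(share_url):
    """Turn a Google Drive '.../file/d/<id>/view?...' link into a direct-download URL."""
    file_id = share_url.split('/')[-2]  # the id is the second-to-last path segment
    return 'https://drive.google.com/uc?export=download&id=' + file_id


# Hypothetical share link with a placeholder id
example = 'https://drive.google.com/file/d/FILE_ID/view?usp=sharing'
print(drive_download_url(example))
```

The result can then be passed to `requests.get(...).text` and wrapped in `StringIO`, exactly as the cells above do.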

## 2.2 Data Loading


### 2.2.1 Training Data

In [None]:
import pandas as pd

In [None]:
# Load the training data
X = pd.read_csv(train_url)

In [None]:
# Check basic information about the training data
display(X.info())

display(X.head().T)

display(X.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          26457 non-null  int64  
 1   gender         26457 non-null  object 
 2   car            26457 non-null  object 
 3   reality        26457 non-null  object 
 4   child_num      26457 non-null  int64  
 5   income_total   26457 non-null  float64
 6   income_type    26457 non-null  object 
 7   edu_type       26457 non-null  object 
 8   family_type    26457 non-null  object 
 9   house_type     26457 non-null  object 
 10  DAYS_BIRTH     26457 non-null  int64  
 11  DAYS_EMPLOYED  26457 non-null  int64  
 12  FLAG_MOBIL     26457 non-null  int64  
 13  work_phone     26457 non-null  int64  
 14  phone          26457 non-null  int64  
 15  email          26457 non-null  int64  
 16  occyp_type     18286 non-null  object 
 17  family_size    26457 non-null  float64
 18  begin_

None

Unnamed: 0,0,1,2,3,4
index,0,1,2,3,4
gender,F,F,M,F,F
car,N,N,Y,N,Y
reality,N,Y,Y,Y,Y
child_num,0,1,0,0,0
income_total,202500,247500,450000,202500,157500
income_type,Commercial associate,Commercial associate,Working,Commercial associate,State servant
edu_type,Higher education,Secondary / secondary special,Higher education,Secondary / secondary special,Higher education
family_type,Married,Civil marriage,Married,Married,Married
house_type,Municipal apartment,House / apartment,House / apartment,House / apartment,House / apartment


Unnamed: 0,index,child_num,income_total,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,family_size,begin_month,credit
count,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0
mean,13228.0,0.428658,187306.5,-15958.053899,59068.750728,1.0,0.224742,0.294251,0.09128,2.196848,-26.123294,1.51956
std,7637.622372,0.747326,101878.4,4201.589022,137475.427503,0.0,0.41742,0.455714,0.288013,0.916717,16.55955,0.702283
min,0.0,0.0,27000.0,-25152.0,-15713.0,1.0,0.0,0.0,0.0,1.0,-60.0,0.0
25%,6614.0,0.0,121500.0,-19431.0,-3153.0,1.0,0.0,0.0,0.0,2.0,-39.0,1.0
50%,13228.0,0.0,157500.0,-15547.0,-1539.0,1.0,0.0,0.0,0.0,2.0,-24.0,2.0
75%,19842.0,1.0,225000.0,-12446.0,-407.0,1.0,0.0,1.0,0.0,3.0,-12.0,2.0
max,26456.0,19.0,1575000.0,-7705.0,365243.0,1.0,1.0,1.0,1.0,20.0,0.0,2.0


### 2.2.2 Test Data

In [None]:
# Load the test data
test = pd.read_csv(test_url)

In [None]:
# Check basic information about the test data
display(test.info())

display(test.head())

display(test.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          10000 non-null  int64  
 1   gender         10000 non-null  object 
 2   car            10000 non-null  object 
 3   reality        10000 non-null  object 
 4   child_num      10000 non-null  int64  
 5   income_total   10000 non-null  float64
 6   income_type    10000 non-null  object 
 7   edu_type       10000 non-null  object 
 8   family_type    10000 non-null  object 
 9   house_type     10000 non-null  object 
 10  DAYS_BIRTH     10000 non-null  int64  
 11  DAYS_EMPLOYED  10000 non-null  int64  
 12  FLAG_MOBIL     10000 non-null  int64  
 13  work_phone     10000 non-null  int64  
 14  phone          10000 non-null  int64  
 15  email          10000 non-null  int64  
 16  occyp_type     6848 non-null   object 
 17  family_size    10000 non-null  float64
 18  begin_m

None

Unnamed: 0,index,gender,car,reality,child_num,income_total,income_type,edu_type,family_type,house_type,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,occyp_type,family_size,begin_month
0,26457,M,Y,N,0,112500.0,Pensioner,Secondary / secondary special,Civil marriage,House / apartment,-21990,365243,1,0,1,0,,2.0,-60.0
1,26458,F,N,Y,0,135000.0,State servant,Higher education,Married,House / apartment,-18964,-8671,1,0,1,0,Core staff,2.0,-36.0
2,26459,F,N,Y,0,69372.0,Working,Secondary / secondary special,Married,House / apartment,-15887,-217,1,1,1,0,Laborers,2.0,-40.0
3,26460,M,Y,N,0,112500.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-19270,-2531,1,1,0,0,Drivers,2.0,-41.0
4,26461,F,Y,Y,0,225000.0,State servant,Higher education,Married,House / apartment,-17822,-9385,1,1,0,0,Managers,2.0,-8.0


Unnamed: 0,index,child_num,income_total,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,family_size,begin_month
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,31456.5,0.4347,185043.3,-16020.4664,59776.6904,1.0,0.2276,0.2963,0.0856,2.2027,-26.2724
std,2886.89568,0.729102,101539.8,4197.672887,138121.224504,0.0,0.419304,0.456648,0.279786,0.898272,16.348557
min,26457.0,0.0,27000.0,-25152.0,-15661.0,1.0,0.0,0.0,0.0,1.0,-60.0
25%,28956.75,0.0,121500.0,-19483.25,-3153.0,1.0,0.0,0.0,0.0,2.0,-39.0
50%,31456.5,0.0,157500.0,-15606.0,-1577.0,1.0,0.0,0.0,0.0,2.0,-25.0
75%,33956.25,1.0,225000.0,-12539.0,-410.0,1.0,0.0,1.0,0.0,3.0,-12.0
max,36456.0,5.0,1575000.0,-7489.0,365243.0,1.0,1.0,1.0,1.0,7.0,0.0


### 2.2.3 Submission Data

In [None]:
# Load the submission data
sub = pd.read_csv(sub_url)

In [None]:
# Check basic information about the submission data
display(sub.info())

display(sub.head())

display(sub.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   index   10000 non-null  int64
 1   0       10000 non-null  int64
 2   1       10000 non-null  int64
 3   2       10000 non-null  int64
dtypes: int64(4)
memory usage: 312.6 KB


None

Unnamed: 0,index,0,1,2
0,26457,0,0,0
1,26458,0,0,0
2,26459,0,0,0
3,26460,0,0,0
4,26461,0,0,0


Unnamed: 0,index,0,1,2
count,10000.0,10000.0,10000.0,10000.0
mean,31456.5,0.0,0.0,0.0
std,2886.89568,0.0,0.0,0.0
min,26457.0,0.0,0.0,0.0
25%,28956.75,0.0,0.0,0.0
50%,31456.5,0.0,0.0,0.0
75%,33956.25,0.0,0.0,0.0
max,36456.0,0.0,0.0,0.0


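## 2.3 Section Ideas

- [11.30]
  - 1) sub needs a predicted probability for each of the three classes -> use predict_proba after training the model
  - 2) occyp_type (occupation column) has missing values in both train and test -> assumed to be occupations that cannot be categorized -> fill with "Etc"
  - 3) index is just a key -> can be excluded from training
  - 4) Many columns hold categorical data -> label encoding or one-hot encoding is needed
  - 5) Normalization -> not needed for "RandomForest" -> start with random forest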
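Idea 1) depends on `predict_proba` returning one column per class. The sketch below illustrates this on synthetic data; `X_toy`, `y_toy`, and the small `n_estimators` value are assumptions chosen only to keep the example fast, not the competition data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 60 samples, 4 features, three balanced classes
rng = np.random.RandomState(42)
X_toy = rng.rand(60, 4)
y_toy = np.repeat([0, 1, 2], 20)

clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy[:5])
print(clf.classes_)       # column order of the probabilities
print(proba.shape)        # one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```

The columns follow `clf.classes_`, so they line up with the `0`, `1`, `2` columns of the submission file as long as the labels are sorted the same way.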

# 3. Data Preprocessing

In [None]:
today = "1130_1"

## 3.1 Handling Missing Values

### 3.1.1 occyp_type

In [None]:
# Check for missing values
display(pd.DataFrame(X.isnull().sum(),columns = ["X"]))

display(pd.DataFrame(test.isnull().sum(),columns = ["test"]))


Unnamed: 0,X
index,0
gender,0
car,0
reality,0
child_num,0
income_total,0
income_type,0
edu_type,0
family_type,0
house_type,0


Unnamed: 0,test
index,0
gender,0
car,0
reality,0
child_num,0
income_total,0
income_type,0
edu_type,0
family_type,0
house_type,0


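About 30% of the occyp_type values are missing: 8,171 of 26,457 rows in train and 3,152 of 10,000 rows in test, based on the non-null counts reported by `info()`.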
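As a sketch of how the missing ratio can be read off directly, `isnull().mean()` gives the fraction of missing values per column. The `toy` frame below is made up for illustration; the two ratios at the end are computed from the non-null counts that `info()` reported for occyp_type.

```python
import pandas as pd

# Made-up frame: 2 of 5 occyp_type values are missing
toy = pd.DataFrame({
    "occyp_type": ["Laborers", None, "Drivers", None, "Managers"],
    "gender": ["F", "M", "F", "F", "M"],
})
missing_ratio = toy.isnull().mean()
print(missing_ratio)

# Ratios implied by the info() outputs (18286 and 6848 non-null values)
train_ratio = (26457 - 18286) / 26457
test_ratio = (10000 - 6848) / 10000
print(round(train_ratio, 3), round(test_ratio, 3))
```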

In [None]:
# Fill missing values in the occyp_type column with "Etc"
X["occyp_type"] = X["occyp_type"].fillna("Etc")
test["occyp_type"] = test["occyp_type"].fillna("Etc")

In [None]:
# Check for missing values again after filling
display(pd.DataFrame(X.isnull().sum(),columns = ["X"]))

display(pd.DataFrame(test.isnull().sum(),columns = ["test"]))


Unnamed: 0,X
index,0
gender,0
car,0
reality,0
child_num,0
income_total,0
income_type,0
edu_type,0
family_type,0
house_type,0


Unnamed: 0,test
index,0
gender,0
car,0
reality,0
child_num,0
income_total,0
income_type,0
edu_type,0
family_type,0
house_type,0


## 3.2 One-Hot Encoding

In [None]:
# Split train into numeric columns, categorical columns, and the target
v_i = ["child_num","income_total","DAYS_BIRTH","DAYS_EMPLOYED",
       "family_size", "begin_month"]
tr_i = X[v_i]
tr_s = X.drop(v_i,axis=1)
tr_s = tr_s.drop(["index","credit"],axis = 1)
tr_s = tr_s.astype("object")
tr_t = X["credit"]

In [None]:
# Split test into numeric columns and categorical columns
v_i = ["child_num","income_total","DAYS_BIRTH","DAYS_EMPLOYED",
       "family_size", "begin_month"]
te_i = test[v_i]
te_s = test.drop(v_i,axis=1) 
te_s = te_s.drop("index",axis = 1)
te_s = te_s.astype("object")

In [None]:
tr_dummy = pd.get_dummies(tr_s)
te_dummy = pd.get_dummies(te_s)
display(tr_dummy)

Unnamed: 0,gender_F,gender_M,car_N,car_Y,reality_N,reality_Y,income_type_Commercial associate,income_type_Pensioner,income_type_State servant,income_type_Student,income_type_Working,edu_type_Academic degree,edu_type_Higher education,edu_type_Incomplete higher,edu_type_Lower secondary,edu_type_Secondary / secondary special,family_type_Civil marriage,family_type_Married,family_type_Separated,family_type_Single / not married,family_type_Widow,house_type_Co-op apartment,house_type_House / apartment,house_type_Municipal apartment,house_type_Office apartment,house_type_Rented apartment,house_type_With parents,FLAG_MOBIL_1,work_phone_0,work_phone_1,phone_0,phone_1,email_0,email_1,occyp_type_Accountants,occyp_type_Cleaning staff,occyp_type_Cooking staff,occyp_type_Core staff,occyp_type_Drivers,occyp_type_Etc,occyp_type_HR staff,occyp_type_High skill tech staff,occyp_type_IT staff,occyp_type_Laborers,occyp_type_Low-skill Laborers,occyp_type_Managers,occyp_type_Medicine staff,occyp_type_Private service staff,occyp_type_Realty agents,occyp_type_Sales staff,occyp_type_Secretaries,occyp_type_Security staff,occyp_type_Waiters/barmen staff
0,1,0,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,1,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26452,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26453,1,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
26454,1,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26455,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
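Calling `pd.get_dummies` separately on train and test only yields matching columns when both contain the same categories. In this run both frames end up with 59 columns, so nothing breaks, but a guard is cheap. The sketch below uses made-up frames (`train_cat`, `test_cat`) to show one way to align test to the train columns with `reindex`.

```python
import pandas as pd

train_cat = pd.DataFrame({"income_type": ["Working", "Student", "Pensioner"]})
test_cat = pd.DataFrame({"income_type": ["Working", "Pensioner"]})  # no 'Student'

train_d = pd.get_dummies(train_cat)
test_d = pd.get_dummies(test_cat)

# Reindex test to the train columns; categories absent from test become all-zero columns
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
print(list(test_aligned.columns))
```

A category that appears only in test would be dropped by this `reindex`, which is usually acceptable because the model never saw it during training.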


## 3.3 Normalization

## 3.4 Combining the Data

In [None]:
# Concatenate the numeric columns and the one-hot encoded categorical columns
new_X = pd.concat([tr_i,tr_dummy],axis= 1)
new_test = pd.concat([te_i,te_dummy],axis= 1)
y = tr_t
display(new_X.info())
display(new_test.info())
# display(tr_t)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 59 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   child_num                               26457 non-null  int64  
 1   income_total                            26457 non-null  float64
 2   DAYS_BIRTH                              26457 non-null  int64  
 3   DAYS_EMPLOYED                           26457 non-null  int64  
 4   family_size                             26457 non-null  float64
 5   begin_month                             26457 non-null  float64
 6   gender_F                                26457 non-null  uint8  
 7   gender_M                                26457 non-null  uint8  
 8   car_N                                   26457 non-null  uint8  
 9   car_Y                                   26457 non-null  uint8  
 10  reality_N                               26457 non-null  ui

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 59 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   child_num                               10000 non-null  int64  
 1   income_total                            10000 non-null  float64
 2   DAYS_BIRTH                              10000 non-null  int64  
 3   DAYS_EMPLOYED                           10000 non-null  int64  
 4   family_size                             10000 non-null  float64
 5   begin_month                             10000 non-null  float64
 6   gender_F                                10000 non-null  uint8  
 7   gender_M                                10000 non-null  uint8  
 8   car_N                                   10000 non-null  uint8  
 9   car_Y                                   10000 non-null  uint8  
 10  reality_N                               10000 non-null  uin

None

In [None]:
# Save the data
new_X.to_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/new_X_{}.csv".format(today),index = False, encoding = "cp949")
y.to_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/y_{}.csv".format(today),index = False, encoding = "cp949")
new_test.to_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/new_test_{}.csv".format(today),index = False, encoding = "cp949")

# 4. Data Splitting

## 4.1 Data Splitting

In [None]:
# Load the data saved above

X = pd.read_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/new_X_{}.csv".format(today))
# Select the "credit" column so y is a 1-D Series rather than a one-column DataFrame
y = pd.read_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/y_{}.csv".format(today))["credit"]
test = pd.read_csv("/content/drive/Shareddrives/A-2/심종수/Card/prepro/new_test_{}.csv".format(today))

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25, random_state = 42, stratify = y)

## 4.2 Cross-Validation

# 5. Modeling

## 5.1 Model Selection

In [None]:
# # Random forest
# from sklearn.ensemble import RandomForestClassifier

# model = RandomForestClassifier(random_state = 42)
# model.fit(X_train,y_train)
# model.score(X_train,y_train)

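## 5.2 Hyperparameter Tuning


In [None]:
# RandomForestClassifier is imported here because the import in 5.1 is commented out
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV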

In [None]:
para = {'max_depth': [6, 10],
        'min_samples_leaf': [8, 12],
        'min_samples_split': [8, 16]}

# Create the random forest model
rf = RandomForestClassifier(random_state = 42)
# Set up the grid search
gs = GridSearchCV(rf, param_grid = para, cv = 3)
# Fit while running the grid search
gs.fit(X_train,y_train)
# Pick the best model
gs_best = gs.best_estimator_

display(gs.best_score_)
display(gs.best_params_)


  estimator.fit(X_train, y_train, **fit_params)

0.692420118939623

{'max_depth': 10, 'min_samples_leaf': 8, 'min_samples_split': 8}

In [None]:
display(gs_best.score(X_train,y_train))

0.6929744985384538
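The grid search above uses the default scoring for classifiers (accuracy), while the competition ranks by logloss. As a hedged sketch, `GridSearchCV` accepts `scoring="neg_log_loss"` so that model selection targets the same metric. The synthetic data, the small grid, and `n_estimators=10` below are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with three balanced classes
rng = np.random.RandomState(42)
X_toy = rng.rand(90, 4)
y_toy = np.repeat([0, 1, 2], 30)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid={"max_depth": [2, 4]},
    scoring="neg_log_loss",  # higher is better, so logloss is negated
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # flip the sign to read it as a logloss
```

With this setting, `best_score_` is the negated mean cross-validated logloss, so it can be compared directly with the hold-out logloss after flipping the sign.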

# 6. Model Performance Evaluation

## 6.1 Checking the logloss Metric

In [None]:
from sklearn.metrics import log_loss

pred_test = gs_best.predict_proba(X_test)

test_score = log_loss(y_test,pred_test)

display(test_score)

0.8040989699544631
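For context, a simple reference point is the logloss of a model that always predicts 1/3 for each of the three classes. The comparison below is a sketch of that arithmetic; the uniform baseline is an illustrative assumption, not a figure from the competition.

```python
import math

n_classes = 3
uniform_logloss = -math.log(1 / n_classes)  # every class gets probability 1/3
holdout_logloss = 0.8040989699544631        # value reported by the cell above

print(round(uniform_logloss, 4))
print(holdout_logloss < uniform_logloss)
```

The tuned random forest beats the uniform guess, though a baseline built from the actual class frequencies would be a stricter comparison.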


# 7. Prediction

## 7.1 Final Model Training

In [None]:
# # Random forest
# from sklearn.ensemble import RandomForestClassifier

# model_f = RandomForestClassifier(random_state = 42)
# model_f.fit(X,y)
# model_f.score(X_train,y_train)

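## 7.2 Final Prediction

In [None]:
pred = gs_best.predict_proba(test)

# Assign by column name: the class-probability columns read from the CSV are "0", "1", "2"
sub[["0", "1", "2"]] = pred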

display(sub)



Unnamed: 0,index,0,1,2
0,26457,0.106617,0.195608,0.697775
1,26458,0.118212,0.192228,0.689559
2,26459,0.117328,0.209243,0.673429
3,26460,0.125497,0.171118,0.703385
4,26461,0.124669,0.196290,0.679040
...,...,...,...,...
9995,36452,0.122807,0.211759,0.665434
9996,36453,0.132656,0.241556,0.625788
9997,36454,0.109388,0.168953,0.721659
9998,36455,0.117516,0.188253,0.694231
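Before submitting, it can help to confirm that each row's three probabilities sum to 1. The sketch below checks the first three rows shown above, copied by hand; the tolerance of `1e-5` is an assumption that allows for the six-decimal rounding in the display.

```python
# First three rows of the displayed submission (columns 0, 1, 2)
rows = [
    (0.106617, 0.195608, 0.697775),
    (0.118212, 0.192228, 0.689559),
    (0.117328, 0.209243, 0.673429),
]

sums = [sum(r) for r in rows]
all_ok = all(abs(s - 1.0) < 1e-5 for s in sums)
print(sums)
print(all_ok)
```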


In [None]:
# Create the submission file
sub.to_csv("/content/drive/Shareddrives/A-2/심종수/Card/sub/sub_{}.csv".format(today),index=False, encoding= "cp949")

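# Overall Review
- [1130]
  - 1) The scores suggest overfitting. (First attempt: trained on the split data; second attempt: trained on the full data.) -> Training on the full data does not by itself improve performance.

- [1130_1]
  - 1) Used grid search to select the best model
  - 2) At first, only 'n_estimators' and 'max_features' were tuned; after training and predicting, performance did not improve.
  - 3) Tuning 'max_depth', 'min_samples_leaf', and 'min_samples_split' instead, then training and predicting, improved performance considerably.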