# Trying Out Random Forest

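- The weakness of Decision Trees: **Overfitting!!**

    The model learns the training data so closely that the error keeps dropping during training but rises on real, unseen data.


- **Random Forest** as a remedy for overfitting
    
    A model that builds many Decision Trees, gathers their results, and averages them to improve performance. In other words, it applies the Ensemble principle.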
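The contrast between one tree and a forest can be sketched on synthetic data. This is only an illustration under assumptions: the dataset comes from `make_classification` with 20% of the labels flipped (it is not the iris data used below), and all variable names and parameter values are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels, so memorizing the training set does not generalize.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single unrestricted tree keeps splitting until it fits every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest trains many trees on bootstrap samples and averages their predictions.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
forest_train, forest_test = forest.score(X_train, y_train), forest.score(X_test, y_test)

print(f"tree   train={tree_train:.3f} test={tree_test:.3f}")
print(f"forest train={forest_train:.3f} test={forest_test:.3f}")
```

The gap between the tree's training score and its test score is the overfitting described above. How much the forest narrows that gap depends on the data and the settings.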

## Stage1. 데이터 불러오기

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/iris.csv')
df.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
Id               150 non-null int64
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [4]:
df.describe()  # this one is more useful

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


## Stage2. Feature Engineering and Visualization

## Stage3. Scikit-learn으로 Random Forest 구현하기

- With only 150 rows of data, no validation set is created
- The data is split into a training set and a test set only

### Splitting into a training set and a test set

In [5]:
from sklearn.model_selection import train_test_split

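# Features: every column except Species.
# Id is kept because it is reused for the CSV below; if the rows are ordered by species, it can leak the label.
train_data = df[['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
# The target is Species
target_data = df['Species']

# With the default settings the split is 75% train / 25% test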
x_train, x_test, y_train, y_test = train_test_split(train_data, target_data)

print(train_data.shape, x_train.shape, x_test.shape)
print(target_data.shape, y_train.shape, y_test.shape)

(150, 5) (112, 5) (38, 5)
(150,) (112,) (38,)
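Because `train_test_split` shuffles randomly, each run of the cell above yields a different split. A minimal sketch of a reproducible, class-balanced split follows. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of `data/iris.csv`, and the `random_state` value and variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled iris data as pandas objects (assumption: stands in for data/iris.csv).
iris = load_iris(as_frame=True)
features, target = iris.data, iris.target

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=42)

print(xs_train.shape, xs_test.shape)
print(ys_test.value_counts().sort_index())
```

With `stratify`, each species appears in the test set in roughly equal numbers, which matters when the test set is only 38 rows.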


### Random Forest modeling and writing the results to a CSV file

In [6]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10, max_depth=1)
forest.fit(x_train, y_train)
print('train accuracy:', forest.score(x_train, y_train))
print('test accuracy:', forest.score(x_test, y_test))

train accuracy: 1.0
test accuracy: 1.0
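A score measured on 38 test rows from a single split is hard to interpret on its own. Since the dataset is too small for a separate validation set, k-fold cross-validation is one possible check. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled iris data with only the four measurements (no `Id` column), and `cv=5` and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Four measurement columns only; the Id column from the CSV is not part of this data.
X_iris, y_iris = load_iris(return_X_y=True)

# Same hyperparameters as the cell above, with a fixed seed for repeatability.
cv_model = RandomForestClassifier(n_estimators=10, max_depth=1, random_state=0)

# Each of the 5 folds serves once as the held-out set.
cv_scores = cross_val_score(cv_model, X_iris, y_iris, cv=5)
print(cv_scores)
print('mean accuracy:', cv_scores.mean())
```

The spread across folds gives a rough sense of how much a single train/test split can vary.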


In [7]:
prediction = forest.predict(x_test)
prediction

array(['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-virginica', 'Iris-setosa',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor',
       'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
       'Iris-virginica', 'Iris-versicolor', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-setosa',
       'Iris-setosa', 'Iris-virginica'], dtype=object)
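A raw array of predicted labels is hard to read at a glance. One common way to summarize it is a confusion matrix, sketched here as a self-contained example. It assumes scikit-learn's bundled iris data and refits a small forest, so the numbers it prints are not those of the cells above; all names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, stratify=y_all, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)

# The diagonal counts correct predictions, so trace / total equals the accuracy.
print('accuracy from matrix:', cm.trace() / cm.sum())
```

Off-diagonal entries show which species get confused with each other, which a single accuracy number hides.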

In [8]:
result = pd.DataFrame({
    'Id': x_test['Id'],
    'Species': prediction
})

result.to_csv('data/iris_submit_forest.csv', index=False)

my_prediction = pd.read_csv('data/iris_submit_forest.csv')
my_prediction.head()

| | Id | Species |
|---|---|---|
| 0 | 13 | Iris-setosa |
| 1 | 38 | Iris-setosa |
| 2 | 121 | Iris-virginica |
| 3 | 71 | Iris-versicolor |
| 4 | 130 | Iris-virginica |
