## Kaggle [Gender Recognition by Voice]
- https://www.kaggle.com/primaryobjects/voicegender

### Data preprocessing and visualization, following this kernel
- https://www.kaggle.com/sushanthiray/d/primaryobjects/voicegender/experimenting-with-neural-networks-in-tensorflow/notebook

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df = pd.read_csv('data/voice.csv')
df.head()

In [None]:
# Returns a frame of the same shape as df: True where a value is null, False otherwise
pd.isnull(df)

In [None]:
# Returns the (row, column) positions of any True entries; empty arrays mean there are no nulls
np.where(pd.isnull(df))

In [None]:
print(np.where([True, False, True, False, False]))

In [None]:
!pip install missingno

In [None]:
import missingno
missingno.matrix(df)

Awesome. We don't have any nulls in the dataset, so that is one less thing to worry about. Now let us check how the labels are distributed.
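The matrix plot above can be complemented by a per-column count of missing values. The sketch below runs on a small made-up frame (only the column names mirror the dataset; the values and the name `toy` are invented for illustration), so it does not depend on `voice.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror voice.csv.
toy = pd.DataFrame({
    "meanfreq": [0.06, np.nan, 0.08],
    "sd": [0.064, 0.067, 0.084],
    "label": ["male", "female", None],
})

# Number of missing entries in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# True if any cell in the whole frame is missing.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

Applied to the real data, `df.isnull().sum()` should show zero for every column if the plot was read correctly.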

In [None]:
print("Number of male: {}".format(df[df.label == 'male'].shape[0]))
print("Number of female: {}".format(df[df.label == 'female'].shape[0]))

In [None]:
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(df.iloc[:,:-1].astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

The plot reveals some interesting correlations. meanfreq and centroid have the maximum possible correlation of 1, and the same holds for maxdom and dfrange. One feature from each of these pairs could therefore be dropped with essentially equivalent performance, since it adds no new information.
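One possible way to find such redundant features programmatically is to scan the upper triangle of the absolute correlation matrix. This is only a sketch on synthetic data: the frame `toy`, its columns `a`, `b`, `c`, and the `threshold` value are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic stand-in for the feature table: "b" is an exact linear copy of "a".
toy = pd.DataFrame({
    "a": base,
    "b": 2.0 * base + 1.0,
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.999  # arbitrary cutoff chosen for this sketch
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Because only the upper triangle is checked, the later column of each correlated pair is the one that gets dropped.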

In [None]:
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]

## Train / Validation (dev) / Test

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 123)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size = 0.375,
                                                  random_state = 123)

In [None]:
# Training 50%, validation 30%, test 20%
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
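
The 50/30/20 proportions follow from applying the second split to what is left after the first. A quick arithmetic check, with variable names chosen only for this sketch:

```python
# Fractions used in the two train_test_split calls above.
test_fraction = 0.2           # first split: share of all rows held out for testing
val_fraction_of_rest = 0.375  # second split: share of the remaining rows

remaining = 1.0 - test_fraction                  # 0.8 of the data is left
val_fraction = remaining * val_fraction_of_rest  # 0.8 * 0.375
train_fraction = remaining - val_fraction

print(round(train_fraction, 3), round(val_fraction, 3), round(test_fraction, 3))
```

The printed shapes can differ from these ideal fractions by a row or so, because the split sizes are rounded to whole rows.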

## Decision tree modeling
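- Try different parameter values for the tree depth
- For each parameter value, fit a model on the train set and evaluate it on the validation set ==> select the best-performing parameter and model
- Evaluate the model's predictive performance on the test set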

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

depth_set = [3, 4, 5, 6, 7, 8, 9, 10]
dt_models = []
accuracy_set = []
cm_set = []
train_accuracy_set = []

for depth in depth_set:
    model = DecisionTreeClassifier(max_depth = depth, random_state = 1)
    model.fit(X_train, y_train)
    y_train_hat = model.predict(X_train)
    y_val_hat = model.predict(X_val)
    train_accuracy = metrics.accuracy_score(y_train, 
                                            y_train_hat)
    accuracy = metrics.accuracy_score(y_val, y_val_hat)
    cm = metrics.confusion_matrix(y_val, y_val_hat)
    
    dt_models.append(model)
    accuracy_set.append(accuracy)
    train_accuracy_set.append(train_accuracy)
    cm_set.append(cm)

In [None]:
from pprint import pprint
pprint(accuracy_set)

In [None]:
pprint(train_accuracy_set)

In [None]:
# Result of the parameter search: index of the best model and its validation accuracy
max_value = max(accuracy_set)
max_index = accuracy_set.index(max_value)
print(max_index)
print(max_value)

In [None]:
# The best model
dt_models[max_index]

In [None]:
# Take the best model and evaluate its predictive performance on the test set
y_test_hat = dt_models[max_index].predict(X_test)
print(metrics.accuracy_score(y_test, y_test_hat))
print(metrics.confusion_matrix(y_test, y_test_hat))

## Random Forest
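- The number of trees and the number of features each tree considers at a split are the parameters
- For each parameter combination, fit a model on the train set and evaluate it on the validation set ==> select the best-performing parameters and model
- Evaluate the model's predictive performance on the test set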
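
Since the search keeps its results in flat lists, the position of the best score has to be mapped back to a parameter pair. A minimal sketch, assuming the loops run with `n_estimators` outermost and `max_features` innermost; `example_index` is a hypothetical value standing in for a real `max_index`, and `'sqrt'` is written here because it is what `'auto'` meant for classifiers in older scikit-learn releases.

```python
from itertools import product

n_estimators_set = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
max_features_set = ['sqrt', 'log2']

# Same visiting order as an outer loop over n_estimators
# and an inner loop over max_features.
param_grid = list(product(n_estimators_set, max_features_set))

example_index = 7  # hypothetical value, standing in for a real max_index
best_n_estimators, best_max_features = param_grid[example_index]
print(best_n_estimators, best_max_features)
```

Alternatively, appending the parameter pair next to each accuracy inside the loop avoids the index arithmetic altogether.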

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
n_estimators_set = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
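# 'sqrt' replaces 'auto': for classifiers the two were equivalent, and recent scikit-learn releases no longer accept 'auto'
max_features_set = ['sqrt', 'log2']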

rf_models = []
accuracy_set = []
cm_set = []

for n_estimators in n_estimators_set:
    for max_features in max_features_set:
        rf = RandomForestClassifier(n_estimators = n_estimators,
                                    max_features = max_features,
                                    random_state = 123)
        rf.fit(X_train, y_train)
        y_val_hat = rf.predict(X_val)
        accuracy = metrics.accuracy_score(y_val, y_val_hat)
        cm = metrics.confusion_matrix(y_val, y_val_hat)

        rf_models.append(rf)
        accuracy_set.append(accuracy)
        cm_set.append(cm)

In [None]:
accuracy_set

In [None]:
# Result of the parameter search: index of the best model and its validation accuracy
max_value = max(accuracy_set)
max_index = accuracy_set.index(max_value)
print(max_index)
print(max_value)

In [None]:
# The best model
rf_models[max_index]

In [None]:
# Take the best model and evaluate its predictive performance on the test set
y_test_hat = rf_models[max_index].predict(X_test)
print(metrics.accuracy_score(y_test, y_test_hat))
print(metrics.confusion_matrix(y_test, y_test_hat))

In [None]:
fi = rf_models[max_index].feature_importances_

In [None]:
fi

In [None]:
col_names = X.columns

In [None]:
for i, j in zip(col_names, fi): print(i, '\t', j)
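
The importances are easier to read when paired with their column names and sorted. A sketch on synthetic data, where the feature names `f0` to `f5` and the small forest are assumptions made for illustration rather than the model fitted above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the voice features (names f0..f5 are made up).
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
names = ["f{}".format(i) for i in range(X_toy.shape[1])]

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_toy, y_toy)

# Pair each importance with its column name and sort, largest first.
ranked = pd.Series(forest.feature_importances_, index=names)
ranked = ranked.sort_values(ascending=False)
print(ranked)
```

With the notebook's own objects, the same idea would be `pd.Series(fi, index=col_names).sort_values(ascending=False)`.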

In [None]:
print(metrics.classification_report(y_test, y_test_hat))

# Exercise
1. Using LogisticRegression, KNeighborsClassifier, etc., search over model hyperparameters and pick the best model.
2. For Decision Tree and Random Forest, search over a different set of hyperparameter candidates and pick the best model.
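
For the k-nearest-neighbors part of the first exercise, one possible shape of the search is sketched below. It runs on synthetic data rather than the voice dataset; the candidate list `k_set` and the standardization step are assumptions of this sketch (kNN is distance based, so unscaled features can dominate the distance).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Synthetic two-class data standing in for the voice features.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy,
                                          test_size=0.3, random_state=123)

# Standardize using statistics from the training part only.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_va_s = scaler.transform(X_tr), scaler.transform(X_va)

k_set = [1, 3, 5, 7, 9, 11]  # candidate values picked arbitrarily for this sketch
val_acc_by_k = {}
for k in k_set:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr_s, y_tr)
    val_acc_by_k[k] = metrics.accuracy_score(y_va, knn.predict(X_va_s))

# The k with the highest validation accuracy.
best_k = max(val_acc_by_k, key=val_acc_by_k.get)
print(best_k, val_acc_by_k[best_k])
```

Storing the scores in a dictionary keyed by `k` keeps each accuracy attached to the parameter that produced it.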

In [None]:
from sklearn.linear_model import LogisticRegression

# Hyperparameter candidates
penalty_set = ['l1', 'l2']
C_set = [0.01, 0.1, 1, 10, 100]
class_weight_set = [None, 'balanced']

# Lists declared up front to store the results
train_acc_set = []
val_acc_set = []
lrs = []

for penalty in penalty_set:
    for C in C_set:
        for class_weight in class_weight_set:
            # liblinear handles both 'l1' and 'l2'; the default lbfgs solver rejects 'l1'
            lr = LogisticRegression(penalty=penalty, C=C,
                                    class_weight=class_weight,
                                    solver='liblinear',
                                    random_state=2072)
            # Train the model
            lr.fit(X_train, y_train)
            lrs.append(lr)
            
            # Calculate training accuracy and validation accuracy
            y_train_hat = lr.predict(X_train)
            y_val_hat = lr.predict(X_val)
            train_acc = metrics.accuracy_score(y_train, y_train_hat)
            val_acc = metrics.accuracy_score(y_val, y_val_hat)
            train_acc_set.append(train_acc)
            val_acc_set.append(val_acc)
            

In [None]:
# Result of the parameter search: index of the best model and its validation accuracy
max_value = max(val_acc_set)
max_index = val_acc_set.index(max_value)
print(max_index)
print(max_value)

In [None]:
# Take the best model and evaluate its predictive performance on the test set
y_test_hat = lrs[max_index].predict(X_test)
print(metrics.accuracy_score(y_test, y_test_hat))
print(metrics.confusion_matrix(y_test, y_test_hat))

In [None]:
# Merge the training and validation sets, then evaluate on the test set
X_concat = pd.concat([X_train, X_val])
y_concat = pd.concat([y_train, y_val])
# Refit the best model on the merged data
best_lr = lrs[max_index]
best_lr.fit(X_concat, y_concat)
# Evaluate predictive performance on the test set
y_test_hat = best_lr.predict(X_test)
print(metrics.accuracy_score(y_test, y_test_hat))
print(metrics.confusion_matrix(y_test, y_test_hat))