<a href="https://www.kaggle.com/code/jungeunbaik/red-wine-quality?scriptVersionId=163252636" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd

df = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [2]:
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


# 1. Definition of each variable

1. Fixed acidity : It is one of the important factors explaining the chemical properties of wine, which affects the freshness and stability of wine. "Fixed acidity" refers to the amount of acid present in the pH value of 7 or less among these acids.

1. Volatile acidity : One of the important properties that determine the taste and aroma of wine. Volatile acid mainly exhibits acids such as acetic acid. When excessive acetic acid is present in wine, it can give a scent and taste like vinegar, which can deteriorate the quality. Microbial activities or spoilage that may occur during the wine production process result in an increase in acetic acid and must be adjusted to an appropriate level.

1. Citric acid : Citric acid is one of the naturally occurring organic acids found mainly in fruits. It is abundant in apples such as lemons and limes, grapes, tangerines, guava, and is also produced artificially.

1. Residual sugar : The amount of sugar that remains after fermentation is completed during the wine-making process, which is an important factor in determining the degree of sweetness of wine. When wine is fermented, the sugar in grapes changes to alcohol. However, if some sugar remains due to lack of full fermentation, the wine will taste sweeter. The amount of residual sugar varies depending on the style of wine, and there are various levels of sugar depending on the type of wine.

1. Chlorides : It is the amount of chlorine ions contained in wine. Chlorides are introduced into wine due to various causes during the wine making process. The amount of chlorides affects the taste and properties of wine. Too many chlorine ions give off an unpleasant salt taste. The acceptable levels of chlorides vary depending on the type and style of wine.

1. free sulfur dioxide : It is a glass form of sulfur sulfide gas, denoted SO2. It is related to wine production and preservation. Free sulfur dioxide increases the shelf life of wine by inhibiting microbial growth and preventing oxidation in wine. Excessive sulfur sulfide gives wine an unpleasant taste and smell. It is often labelled "Free SO2" on wine labels.

1. Total sulfur dioxide : the total amount of sulfur sulfide contained in wine. Sulfur sulfide plays an important role as a preservative used in the manufacture and preservation of wine. "Total sulfur dioxide" includes "free sulfur dioxide" and "bound sulfur dioxide". "bound sulfur dioxide" refers to sulfur sulfide in combination with other compounds in wine. Wine makers comply with industrial standards and regulations to maintain proper concentrations. Wine labels often state "Total SO2".

1. Density : The density of wine is determined mainly by factors such as alcohol content, sugar content, temperature, acidity, etc. Density is one of the important properties of evaluating and describing wine, and winemakers and consumers understand and choose the style and characteristics of wine considering its density.

1. pH : pH is expressed on a scale from 0 to 14, the lower the number, the more acidic, and the higher the basic. Neutral pH is 7, and wine is generally between about 2.8 and 4.0. The pH of wine is influenced by various factors in the wine-making process. The variety of grapes, the region of production, the timing of the harvest, and the fermentation conditions determine the pH of wine. pH also affects the taste and aroma of wine and its shelf life. In general, the more acidic a wine is, the more fresh it has. However, too high acidity also causes uncomfortable taste and affects the stability of wine.

1. Sulfates : Sulfides are used as preservatives added to some wines in the wine-making process. By using sulfides, they protect wine from oxygen, inhibit microbial growth, and improve the stability of wine. This helps to maintain the quality of wine and keep it fresh for a long period of time when it is stored and bottled. However, some people can show allergic reactions to sulfides, so wine labels often label them as "Contains Sulfites" or "Contains Sulfites."

1. alcohol: Refers to the amount of ethanol in wine. The alcohol content of wine is determined by the conversion of sugar from fermented grapes into alcohol during the wine production process. It is usually expressed as a percentage and typically ranges from about 8 to 15 percent. The alcohol content determines the taste and aroma of wine and how it feels in your mouth. Alcohol content is one of the tools that wine producers can use to control the sugar content of harvesting grapes, the management of the fermentation process, and the style of wine. Low alcohol content has a light style, and high alcohol content can have a rich and intense style.


# 2. Checking Dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


>There seems to be no null value data

In [4]:
df.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


>'Qulity' dataset seems to have only integer values. Let's check.

In [5]:
df['quality'].value_counts()

quality
5    681
6    638
7    199
4     53
8     18
3     10
Name: count, dtype: int64

In [6]:
import plotly as py
import cufflinks as cf
cf.go_offline(connected=True)
import plotly.graph_objects as go
from plotly.offline import iplot

In [7]:
corr = df.corr()
corr.iplot(kind='heatmap', colorscale='RdBu', dimensions=(700,700))

>There seems to be a correlation between some variables. Let's check it more precisely

In [8]:
high_corr = corr[(corr.abs() >= 0.5) & (corr != 1)].stack().reset_index()
high_corr.columns = ['Variable 1', 'Variable 2', 'Correlation']
high_corr = high_corr.drop_duplicates(subset = 'Correlation', keep = 'first')
high_corr.reset_index(drop=True)

Unnamed: 0,Variable 1,Variable 2,Correlation
0,fixed acidity,citric acid,0.671703
1,fixed acidity,density,0.668047
2,fixed acidity,pH,-0.682978
3,volatile acidity,citric acid,-0.552496
4,citric acid,pH,-0.541904
5,free sulfur dioxide,total sulfur dioxide,0.667666


>All variables that have higher correlation than 0.5 were filtered out. These variables can cause multicollinearity, so we need to check these vaiables.

In [9]:
from plotly.subplots import make_subplots

col_names = df.columns.tolist()
m,n = 1,1
fig = make_subplots(rows = 3, cols = 4, vertical_spacing = 0.05)
for i in col_names:
  fig.add_trace(go.Box(y = df[i], name = i), row = n, col = m)
  if m%4 == 0 : m, n = m-3, n+1
  else : m+=1
fig.update_layout(height=1500, width=1500)
fig.show()

# 3. Visualization

>Before visualizing the data, let's do data standardization

In [10]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x = min_max_scaler.fit_transform(df.iloc[:,:-1])
df.iloc[:,:-1] = x

In [11]:
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

col_names = df.columns.tolist()
m,n = 1,1
fig = make_subplots(rows = 3, cols = 4, vertical_spacing = 0.05, subplot_titles =  col_names[:-1])
for i in col_names[:-1]:
  fig.add_trace(go.Box(x = df['quality'], y = df[i], name = i), row = n, col = m)
  if m%4 == 0 : m, n = m-3, n+1
  else : m+=1
fig.update_layout(height=1000, width=1300, showlegend = True)
fig.show()

>In the case of volatile acidty, the quality tends to improve as the number decreases, and in the case of citric acid, the higher the number, the better the quality. Sulfates also show this tendency slightly, and alcohol has a similar distribution.

# 4. Data Preprocessing 
>First, let's divide quality data into good quality(7,8) and bad quality(3,4,5,6)

In [12]:
df.loc[df['quality'].isin([3,4,5,6]), 'quality'] = 0
df.loc[df['quality'].isin([7,8]), 'quality'] = 1

In [13]:
df['quality'].iplot(kind ='histogram', dimensions = (500,500))

>Now, let's filter out some important variables.

In [14]:
from sklearn.model_selection import train_test_split
X = df.iloc[:,:-1]
y = df['quality']
X_train, X_test, y_train , y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [15]:
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.neighbors import KNeighborsClassifier             # 1. K-Nearest Neighbor(KNN)
from sklearn.linear_model import LogisticRegression            # 2. Logistic Regression
from sklearn.svm import SVC                                    # 3. SVC
from sklearn.tree import DecisionTreeClassifier                # 4. Decision Tree
from sklearn.ensemble import RandomForestClassifier            # 5. Random Forest
from sklearn.ensemble import ExtraTreesClassifier              # 6. Extra Tree
from sklearn.ensemble import GradientBoostingClassifier        # 7. GBM
from sklearn.naive_bayes import GaussianNB                     # 8. GaussianNB
from xgboost import XGBClassifier                              # 9. XGBoost
from lightgbm import LGBMClassifier                            # 10. LightGBM

import warnings
warnings.filterwarnings('ignore')


You are using pyarrow version 11.0.0 which is known to be insecure. See https://www.cve.org/CVERecord?id=CVE-2023-47248 for further details. Please upgrade to pyarrow>=14.0.1 or install pyarrow-hotfix to patch your current version.



In [16]:
knn_model = KNeighborsClassifier()
logreg_model = LogisticRegression()
svc_model = SVC()
decision_model = DecisionTreeClassifier()
random_model = RandomForestClassifier()
extra_model = ExtraTreesClassifier()
gbm_model = GradientBoostingClassifier()
nb_model = GaussianNB()
xgb_model = XGBClassifier(eval_metric='logloss')
lgbm_model = LGBMClassifier(verbose = -1)

models = [
    knn_model,
    logreg_model,
    svc_model,
    decision_model,
    random_model,
    extra_model,
    gbm_model,
    nb_model,
    xgb_model,
    lgbm_model
]

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = dict()

for alg in models:
    alg.fit(X_train, y_train)
    score = cross_val_score(alg, X_train, y_train.values.ravel(), cv=k_fold, scoring='accuracy')
    results[alg.__class__.__name__] = np.mean(score)*100

>By using cross validation score, the accuaracy score of all model were evaluated.

In [17]:
sorted(results.items(), key=lambda x: x[1], reverse=True)

[('XGBClassifier', 90.69512795275591),
 ('RandomForestClassifier', 90.46136811023622),
 ('LGBMClassifier', 90.38262795275591),
 ('ExtraTreesClassifier', 90.22514763779527),
 ('GradientBoostingClassifier', 89.44389763779527),
 ('SVC', 87.02263779527559),
 ('LogisticRegression', 86.6326279527559),
 ('DecisionTreeClassifier', 86.47145669291338),
 ('KNeighborsClassifier', 86.39579232283465),
 ('GaussianNB', 82.95337106299213)]

>Testing the accuracy score of each model, it shows that XGBClassifier, LGBMClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier have high accuracy.

In [18]:
tree_models = [random_model, extra_model, gbm_model, xgb_model, lgbm_model]

>Let's filter out variables that have high feature importance

In [19]:
for alg in tree_models:
    try:
        print(alg.__class__.__name__)
        print(alg.feature_importances_)
    except:
        print(alg.__class__.__name__, "X")

RandomForestClassifier
[0.07207082 0.12174045 0.08610401 0.06368037 0.07162725 0.05657628
 0.07808298 0.09292617 0.06123356 0.12697387 0.16898425]
ExtraTreesClassifier
[0.07785172 0.10568544 0.0988586  0.0721837  0.06276434 0.06787091
 0.07839875 0.08842948 0.06338297 0.11633063 0.16824346]
GradientBoostingClassifier
[0.05250356 0.13377208 0.0434474  0.03291741 0.04220049 0.06350613
 0.06979342 0.04359016 0.04595275 0.14537736 0.32693924]
XGBClassifier
[0.06102113 0.11690516 0.05122033 0.07242035 0.05688294 0.0629416
 0.06992993 0.05309817 0.0726413  0.10496208 0.277977  ]
LGBMClassifier
[240 317 285 219 307 221 255 344 249 290 273]


In [20]:
random_model_importance = pd.DataFrame({'Feature':X.columns, 'random_model':random_model.feature_importances_})
extra_model_importance = pd.DataFrame({'Feature':X.columns, 'extra_model':extra_model.feature_importances_})
gbm_model_importance = pd.DataFrame({'Feature':X.columns, 'gbm_model':gbm_model.feature_importances_})
xgb_model_importance = pd.DataFrame({'Feature':X.columns, 'xgb_model':xgb_model.feature_importances_})
lgbm_model_importance = pd.DataFrame({'Feature':X.columns, 'lgbm_model':lgbm_model.feature_importances_})

In [21]:
from functools import reduce
data_frames = [
    random_model_importance,
    extra_model_importance,
    gbm_model_importance,
    xgb_model_importance,
    lgbm_model_importance,
]
importances = reduce(lambda  left,right: pd.merge(left, right, on=['Feature']), data_frames)

In [22]:
import numpy as np

col_sum = np.sum(importances, axis = 0)
col_sum

Feature         fixed acidityvolatile aciditycitric acidresidu...
random_model                                                  1.0
extra_model                                                   1.0
gbm_model                                                     1.0
xgb_model                                                     1.0
lgbm_model                                                   3000
dtype: object

In [23]:
importances['lgbm_model']=importances['lgbm_model']/3000

In [24]:
importances['avg'] = importances.iloc[:,1:6].mean(axis=1)
importances = importances.sort_values(by='avg', ascending=False)
importances

Unnamed: 0,Feature,random_model,extra_model,gbm_model,xgb_model,lgbm_model,avg
10,alcohol,0.168984,0.168243,0.326939,0.277977,0.091,0.206629
9,sulphates,0.126974,0.116331,0.145377,0.104962,0.096667,0.118062
1,volatile acidity,0.12174,0.105685,0.133772,0.116905,0.105667,0.116754
7,density,0.092926,0.088429,0.04359,0.053098,0.114667,0.078542
6,total sulfur dioxide,0.078083,0.078399,0.069793,0.06993,0.085,0.076241
2,citric acid,0.086104,0.098859,0.043447,0.05122,0.095,0.074926
0,fixed acidity,0.072071,0.077852,0.052504,0.061021,0.08,0.068689
4,chlorides,0.071627,0.062764,0.0422,0.056883,0.102333,0.067162
8,pH,0.061234,0.063383,0.045953,0.072641,0.083,0.065242
5,free sulfur dioxide,0.056576,0.067871,0.063506,0.062942,0.073667,0.064912


>Five variables with the highest feature importance were selected, and there is no correlation between these variables that we checked before.

In [25]:
X_train = X_train.loc[:, ['alcohol', 'sulphates', 'volatile acidity','density', 'total sulfur dioxide']]
X_test = X_test.loc[:, ['alcohol', 'sulphates', 'volatile acidity','density', 'total sulfur dioxide']]

In [26]:
for alg in models:
    alg.fit(X_train, y_train)
    score = cross_val_score(alg, X_train, y_train.values.ravel(),verbose = -1, cv=k_fold, scoring='accuracy')
    results[alg.__class__.__name__] = np.mean(score)*100

In [27]:
sorted(results.items(), key=lambda x: x[1], reverse=True)

[('RandomForestClassifier', 90.69574311023622),
 ('LGBMClassifier', 89.83452263779527),
 ('ExtraTreesClassifier', 89.67765748031496),
 ('GradientBoostingClassifier', 89.60199311023622),
 ('XGBClassifier', 89.51956200787401),
 ('SVC', 87.64825295275591),
 ('LogisticRegression', 86.7882627952756),
 ('KNeighborsClassifier', 86.08452263779527),
 ('DecisionTreeClassifier', 86.08267716535434),
 ('GaussianNB', 85.69266732283465)]

# 5. Modeling

Now let's find the optimal hyperparameter for the five best performing models.


>5.1 Gradient Boosting + GridSearchCV

In [28]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

In [29]:
learning_rate = [0.01, 0.05, 0.1]
n_estimators = [100, 500, 1000]
max_depth = [2, 3, 4]



hyperparams = {
    'learning_rate': learning_rate,
    'n_estimators': n_estimators,
    'max_depth': max_depth
}

gd=GridSearchCV(
    estimator = GradientBoostingClassifier(),
    param_grid = hyperparams,
    cv=10,
    scoring = "accuracy",
    refit = True
)

gd.fit(X_train, y_train.values.ravel())


print(gd.best_score_)
print(gd.best_params_)

0.8960383858267716
{'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 500}


In [30]:
gbm = GradientBoostingClassifier(learning_rate = 0.05,
                                 max_depth = 4,
                                 n_estimators = 500
)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
gbm_final = gbm.fit(X_train, y_train)
score1 = cross_val_score(gbm_final, X_train, y_train, cv=k_fold, scoring='accuracy')
np.mean(score1)*100

89.83267716535434

In [31]:
from sklearn.metrics import accuracy_score
gbm_final = gbm.fit(X_train,y_train)
y_pred = gbm_final.predict(X_test)
accuracy_score(y_test, y_pred)

0.9125

> 5.2 XGBoost + Bayesian Optimization

In [32]:
import numpy as np
from xgboost import XGBClassifier
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

pbounds = {
    'learning_rate' : (0.01, 0.5),
    'n_estimators' : (100, 2000),
    'max_depth' : (3,10),
    'min_child_weight' : (0,10),
    'subsample' : (0.5, 1.0),
    'colsample_bytree' : (0.5,1.0),
    'gamma' : (0, 5)
}

def xgboost_hyper_param(learning_rate,n_estimators, max_depth, min_child_weight, subsample, colsample_bytree, gamma):
  max_depth = int(max_depth)
  n_estimators = int(n_estimators)
  clf = XGBClassifier(
      learning_rate = learning_rate,
      n_estimators = n_estimators,
      max_depth = max_depth,
      min_child_weight = min_child_weight,
      subsample = subsample,
      colsample_bytree = colsample_bytree,
      gamma = gamma,
      random_state = 0,
      eval_metric = 'logloss')

  return np.mean(cross_val_score(clf, X, y.values.ravel(), cv=10, scoring='accuracy'))

optimizer = BayesianOptimization(f = xgboost_hyper_param, pbounds = pbounds, random_state = 1)


optimizer.maximize(init_points = 10, n_iter= 100)

|   iter    |  target   | colsam... |   gamma   | learni... | max_depth | min_ch... | n_esti... | subsample |
-------------------------------------------------------------------------------------------------------------
| [0m1        [0m | [0m0.8762   [0m | [0m0.7085   [0m | [0m3.602    [0m | [0m0.01006  [0m | [0m5.116    [0m | [0m1.468    [0m | [0m275.4    [0m | [0m0.5931   [0m |
| [0m2        [0m | [0m0.8681   [0m | [0m0.6728   [0m | [0m1.984    [0m | [0m0.274    [0m | [0m5.934    [0m | [0m6.852    [0m | [0m488.5    [0m | [0m0.9391   [0m |
| [0m3        [0m | [0m0.8718   [0m | [0m0.5137   [0m | [0m3.352    [0m | [0m0.2145   [0m | [0m6.911    [0m | [0m1.404    [0m | [0m476.4    [0m | [0m0.9004   [0m |
| [0m4        [0m | [0m0.8537   [0m | [0m0.9841   [0m | [0m1.567    [0m | [0m0.3492   [0m | [0m9.135    [0m | [0m8.946    [0m | [0m261.6    [0m | [0m0.5195   [0m |
| [0m5        [0m | [0m0.8655   [0m | [0m0.5849

In [33]:
optimizer.max

{'target': 0.8774528301886793,
 'params': {'colsample_bytree': 0.601508618086527,
  'gamma': 0.4476899033100151,
  'learning_rate': 0.12188901378171772,
  'max_depth': 7.359817788641856,
  'min_child_weight': 0.06834860589359892,
  'n_estimators': 1997.8514287654143,
  'subsample': 0.9340104885126539}}

In [34]:
from sklearn.metrics import accuracy_score
xgb = XGBClassifier(
      colsample_bytree= 0.6,
      gamma= 0.45,
      learning_rate= 0.12,
      max_depth= 7,
      min_child_weight= 0.07,
      n_estimators= 2000,
      subsample= 0.93,
      random_state = 0,
      eval_metric = 'logloss')

xgb_final = xgb.fit(X_train,y_train.values.ravel())
y_pred = xgb_final.predict(X_test)
accuracy_score(y_test, y_pred)

0.934375

> 5.3. LightGBM + Bayesian Optimization

In [35]:
import numpy as np
from lightgbm import LGBMClassifier
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score

pbounds = {
    'learning_rate': (0.01, 0.5),
    'n_estimators': (100, 1000),
    'max_depth': (3, 10),
    'min_child_weight': (0, 10),
    'subsample': (0.5, 1.0),
    'colsample_bytree': (0.5, 1.0),
    'gamma': (0, 5)
    # 'reg_lambda': (0, 1000, 'log-uniform'),
    # 'reg_alpha': (0, 1.0, 'log-uniform')
}

def lgbm_hyper_param(learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree, gamma):
    max_depth = int(max_depth)
    n_estimators = int(n_estimators)
    clf = LGBMClassifier(
        max_depth=max_depth,
        min_child_weight= min_child_weight,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        gamma=gamma,
        random_state=1,
        eval_metric='logloss',
        verbose=-1
        # reg_alpha=reg_alpha,
        # reg_lambda=reg_lambda
        
    )
    return np.mean(cross_val_score(clf, X_train, y_train.values.ravel(), cv=10, scoring='accuracy'))

optimizer = BayesianOptimization( f=lgbm_hyper_param, pbounds=pbounds, random_state=1)

optimizer.maximize(init_points=10, n_iter=100)

|   iter    |  target   | colsam... |   gamma   | learni... | max_depth | min_ch... | n_esti... | subsample |
-------------------------------------------------------------------------------------------------------------
| [0m1        [0m | [0m0.8788   [0m | [0m0.7085   [0m | [0m3.602    [0m | [0m0.01006  [0m | [0m5.116    [0m | [0m1.468    [0m | [0m183.1    [0m | [0m0.5931   [0m |
| [95m2        [0m | [95m0.8843   [0m | [95m0.6728   [0m | [95m1.984    [0m | [95m0.274    [0m | [95m5.934    [0m | [95m6.852    [0m | [95m284.0    [0m | [95m0.9391   [0m |
| [95m3        [0m | [95m0.8913   [0m | [95m0.5137   [0m | [95m3.352    [0m | [95m0.2145   [0m | [95m6.911    [0m | [95m1.404    [0m | [95m278.3    [0m | [95m0.9004   [0m |
| [0m4        [0m | [0m0.878    [0m | [0m0.9841   [0m | [0m1.567    [0m | [0m0.3492   [0m | [0m9.135    [0m | [0m8.946    [0m | [0m176.5    [0m | [0m0.5195   [0m |
| [0m5        [0m | [0m0.8702  

In [36]:
optimizer.max

{'target': 0.8999384842519685,
 'params': {'colsample_bytree': 0.9360937001250944,
  'gamma': 4.579146149631467,
  'learning_rate': 0.03803448554294551,
  'max_depth': 7.272482561992197,
  'min_child_weight': 1.4623459396141825,
  'n_estimators': 471.1065788468232,
  'subsample': 0.6251327796142989}}

In [37]:
lgbm = LGBMClassifier(
      colsample_bytree= 0.94,
      gamma= 4.58,
      learning_rate= 0.04,
      max_depth= 7,
      min_child_weight= 1.46,
      n_estimators= 500,
      subsample= 0.63,
      random_state = 0,
      eval_metric = 'logloss',
      verbose=-1)

lgbm_final = lgbm.fit(X_train,y_train.values.ravel())
y_pred = lgbm_final.predict(X_test)
accuracy_score(y_test, y_pred)

0.928125

> 5.4 ExtraTreesClassifier + GridSearchCV

In [38]:
from sklearn.ensemble import ExtraTreesClassifier    

hyperparams ={ 'n_estimators' : [500, 1000, 2000],
           'min_samples_leaf' : [2,4,8],
           'min_samples_split' : [2,4,8],
           #'max_depth' : [5,10,20, None]
           #'max_features' : [0.5,1,4]
            }

gd=GridSearchCV(
    estimator = ExtraTreesClassifier(),
    param_grid = hyperparams,                           
    cv=10,
    scoring = "accuracy",
    refit = True
)

gd.fit(X_train, y_train.values.ravel())

print(gd.best_score_)
print(gd.best_params_)

0.8866387795275591
{'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 500}


In [39]:
from sklearn.model_selection import StratifiedKFold
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

extra = ExtraTreesClassifier(n_estimators = 500,
                                 min_samples_leaf = 2,
                                 min_samples_split = 2,
)
extra.fit(X, y)
score1 = cross_val_score(extra, X_train, y_train.values.ravel(), cv=k_fold, scoring='accuracy')
np.mean(score1)*100

88.89886811023622

In [40]:
from sklearn.metrics import accuracy_score
extra_final = extra.fit(X_train,y_train)
y_pred = extra_final.predict(X_test)
accuracy_score(y_test, y_pred)

0.925

> 5.5 RandomForestClassifier + GridSearchCV

In [41]:
from sklearn.ensemble import RandomForestClassifier   

hyperparams ={ 'n_estimators' : [500, 1000, 2000],
           'min_samples_leaf' : [2, 4, 8],
           'min_samples_split' : [2, 4, 8],
           'max_depth' : [5,10,20]
           #'max_features' : [0.5,1,4]
            }


gd=GridSearchCV(
    estimator = RandomForestClassifier(),
    param_grid = hyperparams,                            
    cv=10,
    scoring = "accuracy",
    refit = True
)

gd.fit(X_train, y_train.values.ravel())

print(gd.best_score_)
print(gd.best_params_)

0.8936884842519686
{'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 2000}


In [42]:
from sklearn.model_selection import StratifiedKFold
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rfm = RandomForestClassifier(n_estimators = 2000,
                             min_samples_leaf = 2,
                             min_samples_split = 4,
                             max_depth = 20
)
rfm.fit(X, y)
score1 = cross_val_score(rfm, X_train, y_train.values.ravel(), cv=k_fold, scoring='accuracy')
np.mean(score1)*100

89.7576279527559

In [43]:
from sklearn.metrics import accuracy_score
rfm_final = rfm.fit(X_train,y_train)
y_pred = rfm_final.predict(X_test)
accuracy_score(y_test, y_pred)

0.9125

> Voting Classifier (hard, soft)


In [44]:
from sklearn.ensemble import VotingClassifier

grid_hard = VotingClassifier(estimators = [
        ('GBM', gbm_final),
        ('LightGBM', lgbm_final),
        ('XGBoost', xgb_final),
        ('Random Forest', rfm_final),
        ('Extra Trees', extra_final),
    ], voting = 'hard')

score = cross_val_score(grid_hard, X_train, y_train, cv=k_fold, scoring='accuracy')
print(np.mean(score)*100)

90.38201279527559


In [45]:
from sklearn.ensemble import VotingClassifier

grid_soft = VotingClassifier(estimators = [
        ('GBM', gbm_final),
        ('LightGBM', lgbm_final),
        ('XGBoost', xgb_final),
        ('Random Forest', rfm_final),
        ('Extra Trees', extra_final),
    ], voting = 'soft')

score = cross_val_score(grid_soft, X_train, y_train, cv=k_fold, scoring='accuracy')
print(np.mean(score)*100)

90.46075295275591


In [46]:
from sklearn.metrics import accuracy_score
hard_voting = grid_hard.fit(X_train,y_train)
y_pred = hard_voting.predict(X_test)
accuracy_score(y_test, y_pred)

0.928125

In [47]:
from sklearn.metrics import accuracy_score
soft_voting = grid_soft.fit(X_train,y_train)
y_pred = soft_voting.predict(X_test)
accuracy_score(y_test, y_pred)

0.925

In [48]:
from sklearn.metrics import classification_report

y_pred = hard_voting.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96       290
           1       0.61      0.63      0.62        30

    accuracy                           0.93       320
   macro avg       0.79      0.80      0.79       320
weighted avg       0.93      0.93      0.93       320



In [49]:
from sklearn.metrics import roc_auc_score

auc_score = roc_auc_score(y_test, hard_voting.predict(X_test))
print("AUC:", auc_score)

AUC: 0.7959770114942529


In [50]:
import plotly.graph_objs as go
from sklearn.metrics import roc_curve

y_pred = hard_voting.predict(X_test)

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

fig = go.Figure()

fig.add_trace(go.Scatter(x=fpr, y=tpr,
                         mode='lines',
                         line=dict(color='blue', width=2),
                         name='ROC Curve'))

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1],
                         mode='lines',
                         line=dict(color='red', dash='dash'),
                         showlegend=False))

fig.update_layout(title='ROC Curve',
                  xaxis_title='False Positive Rate',
                  yaxis_title='True Positive Rate',
                  xaxis=dict(range=[0, 1], dtick=0.1),
                  yaxis=dict(range=[0, 1], dtick=0.1),
                  width=800, height=600)

fig.show()

It's my first kaggle notebook so if you have anything to supplement, please leave comments.
Thank you for reading my notebook.