# Alcohol Consumption in Minors - Ensemble Classification

The data set contains data about high school students. Each row represents a single student, and the columns hold characteristics of the de-identified students. This is a binary classification task: predict whether a student drinks alcohol (the **Alc** column, stored as `alc` in the CSV: 1=Yes, 0=No). The prediction matters because it can help detect underage drinking and target intervention techniques.

## Description of Variables

The variables are described in "Alcohol - Data Dictionary.docx".

## Goal

Use the **alcohol.csv** data set and build a model to predict **Alc**. Build (at least) the models required below.

# Read and Prepare the Data
## Feature engineering: create one new variable from existing ones

In [1]:
# Common imports
import numpy as np
import pandas as pd

np.random.seed(42)

In [2]:
# We will predict the "alc" value in the data set:

alcohol = pd.read_csv("alcohol.csv")
alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,gender,alc
0,18,2,1,4,2,0,5,4,2,5,2,M,1
1,18,4,3,1,0,0,4,4,2,3,9,M,1
2,15,4,3,2,3,0,5,3,4,5,0,F,0
3,15,3,3,1,4,0,4,3,3,3,10,F,0
4,17,3,2,1,2,0,5,3,5,5,2,M,1


In [3]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(alcohol, test_size=0.3)
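
The split above relies on the global NumPy seed for reproducibility. As a hedged alternative, the sketch below passes `random_state` directly and stratifies on the target so that both parts keep the same class balance. The small `demo` frame is a made-up stand-in for `alcohol`, not the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the alcohol dataframe: 60% class 0, 40% class 1
demo = pd.DataFrame({
    "age": np.arange(100) % 5 + 15,
    "alc": [0] * 60 + [1] * 40,
})

# random_state makes the split repeatable regardless of the global seed;
# stratify keeps the share of each class equal in train and test
train_demo, test_demo = train_test_split(
    demo, test_size=0.3, random_state=42, stratify=demo["alc"])

print(train_demo["alc"].mean(), test_demo["alc"].mean())
```

Stratifying matters less when the classes are nearly balanced, as they are here, but it costs nothing and removes one source of run-to-run variation.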

In [4]:
train_set.isna().sum()

age           0
Medu          0
Fedu          0
traveltime    0
studytime     0
failures      0
famrel        0
freetime      0
goout         0
health        0
absences      0
gender        0
alc           0
dtype: int64

In [5]:
test_set.isna().sum()

age           0
Medu          0
Fedu          0
traveltime    0
studytime     0
failures      0
famrel        0
freetime      0
goout         0
health        0
absences      0
gender        0
alc           0
dtype: int64

### Data Preparation 

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import FunctionTransformer

In [7]:
#train = train_set.drop(['alc'], axis=1)
#test = test_set.drop(['alc'], axis=1)

In [8]:
train_y = train_set['alc']
test_y = test_set['alc']

train_inputs = train_set.drop(['alc'], axis=1)
test_inputs = test_set.drop(['alc'], axis=1)

### Feature Engineering: Let's derive a new column

In [9]:
def new_col(df):
    # Work on a copy so that the caller's dataframe is not modified
    df1 = df.copy()

    # Flag students who are 18 or older (1 = adult, 0 = minor)
    df1['IsAdult'] = np.where(df1['age'] >= 18, 1, 0)

    # Return only the new column; return df1 instead to check the calculation
    return df1[['IsAdult']]

In [10]:
#Let's test the new function:

# Send the new dataframe to the function we created
new_col(train_set)

Unnamed: 0,IsAdult
12759,0
4374,0
8561,0
10697,0
19424,0
...,...
16850,0
6265,0
11284,0
860,1


### Identify numerical and categorical columns

In [11]:
train_inputs.dtypes

age            int64
Medu           int64
Fedu           int64
traveltime     int64
studytime      int64
failures       int64
famrel         int64
freetime       int64
goout          int64
health         int64
absences       int64
gender        object
dtype: object

In [12]:
# Identify the numerical columns
numeric_columns = train_inputs.select_dtypes(include=[np.number]).columns.to_list()

# Identify the categorical columns
categorical_columns = train_inputs.select_dtypes('object').columns.to_list()

In [13]:
# Identify the binary columns so we can pass them through without transforming
#binary_columns = ['host_is_superhost', 'host_identity_verified']

In [14]:
numeric_columns

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'health',
 'absences']

In [15]:
categorical_columns

['gender']

In [16]:
transformed_columns = ['age']

### Pipeline

In [17]:
numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [18]:
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [19]:
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

In [20]:
my_new_column = Pipeline(steps=[('my_new_column', FunctionTransformer(new_col)),
                               ('scaler', StandardScaler())])

In [21]:
preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns),
        #('binary', binary_transformer, binary_columns),
        ('trans', my_new_column, transformed_columns)],
        remainder='passthrough')


### Transform: fit_transform() for TRAIN 

In [22]:
#Fit and transform the train data
train_x = preprocessor.fit_transform(train_inputs)

train_x

array([[ 0.66643886,  0.96597412,  0.90362635, ...,  1.        ,
         0.        , -0.36853945],
       [ 0.66643886, -0.93881619, -1.68666277, ...,  0.        ,
         1.        , -0.36853945],
       [ 0.66643886,  0.33104402,  0.04019664, ...,  0.        ,
         1.        , -0.36853945],
       ...,
       [ 0.66643886, -2.20867639, -2.55009248, ...,  1.        ,
         0.        , -0.36853945],
       [ 1.6195814 , -0.30388608, -1.68666277, ...,  0.        ,
         1.        ,  2.71341375],
       [ 1.6195814 , -0.30388608, -2.55009248, ...,  0.        ,
         1.        ,  2.71341375]])

In [23]:
train_x.shape

(23800, 14)

### Transform: transform() for TEST

In [24]:
# Transform the test data
test_x = preprocessor.transform(test_inputs)

test_x

array([[-1.23984621,  0.33104402,  1.76705606, ...,  1.        ,
         0.        , -0.36853945],
       [-1.23984621, -0.30388608,  0.04019664, ...,  0.        ,
         1.        , -0.36853945],
       [-0.28670367,  0.33104402,  0.04019664, ...,  0.        ,
         1.        , -0.36853945],
       ...,
       [ 0.66643886, -0.30388608,  0.04019664, ...,  1.        ,
         0.        , -0.36853945],
       [-1.23984621, -0.93881619,  0.04019664, ...,  0.        ,
         1.        , -0.36853945],
       [-1.23984621,  0.96597412,  0.04019664, ...,  1.        ,
         0.        , -0.36853945]])

In [25]:
test_x.shape

(10200, 14)

# Baseline:

In [26]:
train_y.value_counts()/len(train_y)

0    0.523487
1    0.476513
Name: alc, dtype: float64
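
The class shares above imply a majority-class baseline of about 52%: a model that always predicts 0 would be right that often on the training set. A minimal sketch of how this baseline could be computed with scikit-learn's `DummyClassifier` follows. The labels in `y_demo` are randomly generated placeholders with a similar balance, not the real `train_y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the class balance shown above
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.48).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class seen during fit
baseline_clf = DummyClassifier(strategy="most_frequent")
baseline_clf.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline_clf.predict(X_demo))
majority_share = np.bincount(y_demo).max() / len(y_demo)
print("Baseline accuracy:", baseline_acc)
```

In the notebook, fitting the dummy model on `train_x`, `train_y` and scoring it on `test_x`, `test_y` would give the baseline test accuracy to compare against the ensembles.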

# Hard voting classifier (should include at least two models)

### Hard voting classifier 1

In [27]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.linear_model import SGDClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier


dtree_clf = DecisionTreeClassifier(max_depth=10)
log_clf = LogisticRegression(multi_class='multinomial',solver = 'lbfgs', C=15, max_iter=1000)
#log_clf = LogisticRegression(multi_class='multinomial',solver = 'saga', C=12, max_iter=1000)
sgd_clf = SGDClassifier(max_iter=10000, tol=1e-3)
#sgd_clf2 = SGDClassifier(max_iter=10000, tol=1e-3,penalty='elasticnet')
voting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf), 
                        ('sgd', sgd_clf),
                       #('sgd2',sgd_clf2)
                       ],
            voting='hard')

voting_clf.fit(train_x, train_y)

VotingClassifier(estimators=[('dt', DecisionTreeClassifier(max_depth=10)),
                             ('lr',
                              LogisticRegression(C=15, max_iter=1000,
                                                 multi_class='multinomial')),
                             ('sgd', SGDClassifier(max_iter=10000))])

### Accuracy

In [28]:
from sklearn.metrics import accuracy_score

In [29]:
#Train accuracy

train_y_pred = voting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.826764705882353


In [30]:
#Test accuracy

test_y_pred = voting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8216666666666667


### Confusion Matrix

In [31]:
from sklearn.metrics import confusion_matrix

In [32]:
confusion_matrix(test_y, test_y_pred)

array([[4380,  918],
       [ 901, 4001]], dtype=int64)
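
To connect the matrix to the accuracy reported earlier, the sketch below recomputes accuracy, precision, and recall from the four counts. It assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = np.array([[4380, 918],
               [901, 4001]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all students classified correctly
precision = tp / (tp + fp)        # of predicted drinkers, how many really drink
recall = tp / (tp + fn)           # of actual drinkers, how many were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

For an intervention program, recall is arguably the more relevant number, since each false negative is a student who drinks but is not flagged.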

### Inspect each classifier's accuracy 

In [33]:
for clf in (dtree_clf, log_clf, sgd_clf, voting_clf):
    clf.fit(train_x, train_y.ravel())
    test_y_pred = clf.predict(test_x)
    print(clf.__class__.__name__, 'Test acc=', accuracy_score(test_y, test_y_pred))

DecisionTreeClassifier Test acc= 0.7995098039215687
LogisticRegression Test acc= 0.8213725490196079
SGDClassifier Test acc= 0.8116666666666666
VotingClassifier Test acc= 0.8208823529411765
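
The loop above compares models on the test set, which means the test set also ends up guiding model choice. A hedged alternative is to compare candidates with cross-validation on the training data only. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_x` and `train_y`; the names `candidates` and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

candidates = {
    "dt": DecisionTreeClassifier(max_depth=10, random_state=42),
    "lr": LogisticRegression(max_iter=1000),
}

cv_results = {}
for name, clf in candidates.items():
    # 5-fold cross-validated accuracy, computed on training data only
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(name, "CV acc = %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

The test set can then be held back for a single final evaluation of whichever model is chosen.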


### Hard voting Classifier 2

In [34]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.linear_model import SGDClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

dtree_clf = DecisionTreeClassifier(max_depth=10)
log_clf = LogisticRegression(multi_class='multinomial',solver = 'lbfgs', C=8, max_iter=1000)
sgd_clf = SGDClassifier(max_iter=10000, tol=1e-3)
svc_clf=SVC(kernel="poly", degree=2, coef0=1, C=1, gamma='scale')

voting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf), 
                        ('sgd', sgd_clf),
                       ('svc',svc_clf)],
            voting='hard')

voting_clf.fit(train_x, train_y)

VotingClassifier(estimators=[('dt', DecisionTreeClassifier(max_depth=10)),
                             ('lr',
                              LogisticRegression(C=8, max_iter=1000,
                                                 multi_class='multinomial')),
                             ('sgd', SGDClassifier(max_iter=10000)),
                             ('svc',
                              SVC(C=1, coef0=1, degree=2, kernel='poly'))])

### Accuracy 

In [35]:
#Train accuracy

train_y_pred = voting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8294117647058824


In [36]:
#Test accuracy

test_y_pred = voting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8237254901960784


### Confusion Matrix

In [37]:
confusion_matrix(test_y, test_y_pred)

array([[4563,  735],
       [1063, 3839]], dtype=int64)

### Inspect each classifier's accuracy 

In [38]:
for clf in (dtree_clf, log_clf, sgd_clf, svc_clf,voting_clf):
    clf.fit(train_x, train_y.ravel())
    test_y_pred = clf.predict(test_x)
    print(clf.__class__.__name__, 'Test acc=', accuracy_score(test_y, test_y_pred))

DecisionTreeClassifier Test acc= 0.7994117647058824
LogisticRegression Test acc= 0.8213725490196079
SGDClassifier Test acc= 0.8198039215686275
SVC Test acc= 0.8338235294117647
VotingClassifier Test acc= 0.826764705882353


# Soft voting classifier (should include at least two models)

### Soft voting Classifier 1

In [39]:
#Each model must implement predict_proba(). Otherwise, you can't use it for soft voting
#We can't use sgd_clf as defined above: with its default hinge loss, SGDClassifier has no predict_proba()

voting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf)],
            voting='soft')

voting_clf.fit(train_x, train_y)

VotingClassifier(estimators=[('dt', DecisionTreeClassifier(max_depth=10)),
                             ('lr',
                              LogisticRegression(C=8, max_iter=1000,
                                                 multi_class='multinomial'))],
                 voting='soft')
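
The comment in the cell above excludes the SGD model from soft voting. That restriction applies to the default hinge loss; as a sketch, an `SGDClassifier` fit with `loss="modified_huber"` does expose `predict_proba()` and can take part. The data below is synthetic, and the names `sgd_proba_clf` and `soft_demo_clf` are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# modified_huber is one of the SGD losses that provides probability estimates
sgd_proba_clf = SGDClassifier(loss="modified_huber", max_iter=10000,
                              tol=1e-3, random_state=42)
log_demo_clf = LogisticRegression(max_iter=1000)

soft_demo_clf = VotingClassifier(
    estimators=[("lr", log_demo_clf), ("sgd", sgd_proba_clf)],
    voting="soft")
soft_demo_clf.fit(X_demo, y_demo)

# Soft voting averages the class probabilities of the member models
proba = soft_demo_clf.predict_proba(X_demo[:5])
print(np.round(proba, 3))
```

Whether this helps on the alcohol data is untested here; the probabilities from `modified_huber` are fairly coarse, so the gain from soft voting may be small.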

In [40]:
#Train accuracy

train_y_pred = voting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8438235294117648


In [41]:
#Test accuracy

test_y_pred = voting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8156862745098039


### Soft Voting Classifier 2 

In [42]:
#Each model must implement predict_proba(). Otherwise, you can't use it for soft voting
#We can't use sgd_clf as defined above: with its default hinge loss, SGDClassifier has no predict_proba()
from sklearn.ensemble import RandomForestClassifier 
rnd_clf = RandomForestClassifier(n_estimators=600, n_jobs=-1)

voting_clf = VotingClassifier(
            estimators=[('dt', dtree_clf), 
                        ('lr', log_clf),
                         ('rnd',rnd_clf)
                       ],
            voting='soft')

voting_clf.fit(train_x, train_y)

VotingClassifier(estimators=[('dt', DecisionTreeClassifier(max_depth=10)),
                             ('lr',
                              LogisticRegression(C=8, max_iter=1000,
                                                 multi_class='multinomial')),
                             ('rnd',
                              RandomForestClassifier(n_estimators=600,
                                                     n_jobs=-1))],
                 voting='soft')

In [43]:
#Train accuracy

train_y_pred = voting_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9039915966386555


In [44]:
#Test accuracy

test_y_pred = voting_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.818921568627451


# Bagging classifier

In [45]:
from sklearn.ensemble import BaggingClassifier 


#If you want to do pasting, change "bootstrap=False"
#n_jobs=-1 means use all CPU cores
#bagging performs soft voting when the base estimator implements predict_proba(); otherwise it falls back to majority (hard) voting

#bag_clf = BaggingClassifier( 
#            SGDClassifier(), n_estimators=50, 
#            max_samples=1000, bootstrap=True, n_jobs=-1) 

#svc_clf=BaggingClassifier(
#        SVC(kernel="linear"),n_estimators=50, 
#           max_samples=1000, bootstrap=True, n_jobs=-1)

bag_clf =BaggingClassifier(SVC(kernel="poly", degree=2, coef0=1, C=1, gamma='scale'),n_estimators=50, 
           max_samples=1000, bootstrap=True, n_jobs=-1) #-->83%
#svc_clf =BaggingClassifier(SVC(kernel="rbf"),n_estimators=50, 
#          max_samples=1000, bootstrap=True, n_jobs=-1) -->82.6%

bag_clf.fit(train_x, train_y)

BaggingClassifier(base_estimator=SVC(C=1, coef0=1, degree=2, kernel='poly'),
                  max_samples=1000, n_estimators=50, n_jobs=-1)

### Accuracy

In [46]:
#Train accuracy

train_y_pred = bag_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8320588235294117


In [47]:
#Test accuracy

test_y_pred = bag_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc)) ##Best model so far

Test acc: 0.8306862745098039
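
Because each bagged estimator is trained on a bootstrap sample, the rows it never saw can serve as a built-in validation set. The sketch below shows how an out-of-bag (OOB) accuracy estimate could be requested with `oob_score=True`. It uses a shallow decision tree and synthetic data to stay fast; the notebook's SVC base estimator would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)

# The base estimator is passed positionally so that the call does not depend
# on whether the installed version names the parameter estimator or base_estimator
oob_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5),
    n_estimators=50, max_samples=200, bootstrap=True,
    oob_score=True, random_state=42)
oob_bag_clf.fit(X_demo, y_demo)

# Accuracy estimated from rows left out of each bootstrap sample
print("OOB accuracy estimate:", oob_bag_clf.oob_score_)
```

An OOB estimate gives a second opinion on generalization without touching the test set, which is useful when the test set is also being used to pick the "best model so far".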


# Random forest classifier

In [49]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf = RandomForestClassifier(n_estimators=1000, n_jobs=-1) # with n_estimators=500: train acc 98%, test acc 80.5%

rnd_clf.fit(train_x, train_y)

RandomForestClassifier(n_estimators=1000, n_jobs=-1)

In [50]:
#Train accuracy

train_y_pred = rnd_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.9829831932773109


In [51]:
#Test accuracy

test_y_pred = rnd_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8054901960784314
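
The gap between train accuracy (about 0.98) and test accuracy (about 0.81) suggests the fully grown trees are memorizing the training data. One common response is to limit tree size. The sketch below compares an unconstrained forest with a constrained one on synthetic, deliberately noisy data; the values `max_depth=4` and `min_samples_leaf=20` are illustrative guesses, not tuned settings for the alcohol data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 20% label noise, so that memorization is possible
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

settings = {
    "unconstrained": {},
    "constrained": {"max_depth": 4, "min_samples_leaf": 20},
}

gaps = {}
for label, params in settings.items():
    forest = RandomForestClassifier(n_estimators=100, random_state=42, **params)
    forest.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, forest.predict(X_tr))
    test_acc = accuracy_score(y_te, forest.predict(X_te))
    gaps[label] = (train_acc, test_acc)
    print(label, "train=%.3f test=%.3f" % (train_acc, test_acc))
```

Constraining the trees lowers train accuracy by design; whether test accuracy improves depends on the data and should be checked with cross-validation rather than assumed.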


# AdaBoost Classifier

In [55]:
from sklearn.ensemble import AdaBoostClassifier 

#Create Adaptive Boosting with Decision Stumps (depth=1) ##initial depth 1, test accuracy 82%
ada_clf = AdaBoostClassifier( 
            DecisionTreeClassifier(max_depth=1), n_estimators=500, 
            algorithm="SAMME.R", learning_rate=0.1) 

ada_clf.fit(train_x, train_y)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                   learning_rate=0.1, n_estimators=500)

In [56]:
#Train accuracy

train_y_pred = ada_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8209243697478992


In [57]:
#Test accuracy

test_y_pred = ada_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8195098039215686


# Gradient Boosting Classifier

In [61]:
#Use GradientBoosting-was not in assignment

from sklearn.ensemble import GradientBoostingClassifier#use depth 2,8,10

gbclf = GradientBoostingClassifier(max_depth=8, n_estimators=100, learning_rate=0.1) 

gbclf.fit(train_x, train_y)

GradientBoostingClassifier(max_depth=8)

In [62]:
#Train accuracy

train_y_pred = gbclf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.8914285714285715


In [63]:
#Test accuracy

test_y_pred = gbclf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.823921568627451


# Stochastic Gradient Boosting Classifier

In [64]:
#Stochastic gradient boosting: each tree is fit on a random 75% subsample of the training rows (subsample=0.75)
from sklearn.ensemble import GradientBoostingClassifier #check with depth 2

gbclf = GradientBoostingClassifier(max_depth=10, n_estimators=100, 
                                   learning_rate=0.1, subsample=0.75) 

gbclf.fit(train_x, train_y)

GradientBoostingClassifier(max_depth=10, subsample=0.75)

In [65]:
#Train accuracy

train_y_pred = gbclf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.944201680672269


In [66]:
#Test accuracy

test_y_pred = gbclf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8140196078431372


In [67]:
for x in range(1,30):
    gbclf = GradientBoostingClassifier(max_depth=8, n_estimators=x, learning_rate=1.0) 
    gbclf.fit(train_x, train_y.ravel())
    
    train_predictions = gbclf.predict(train_x)
    test_predictions = gbclf.predict(test_x)
    
    train_accuracy = round(accuracy_score(train_y, train_predictions),4)
    test_accuracy = round(accuracy_score(test_y, test_predictions),4)
    
    print('# Estimators = {}'.format(x) + "     " + 'Train accuracy = {}'.format(train_accuracy) + "   "
         'Test accuracy = {}'.format(test_accuracy))

# Estimators = 1     Train accuracy = 0.8171   Test accuracy = 0.7993
# Estimators = 2     Train accuracy = 0.8337   Test accuracy = 0.809
# Estimators = 3     Train accuracy = 0.8411   Test accuracy = 0.8102
# Estimators = 4     Train accuracy = 0.8474   Test accuracy = 0.8155
# Estimators = 5     Train accuracy = 0.8514   Test accuracy = 0.8126
# Estimators = 6     Train accuracy = 0.857   Test accuracy = 0.8138
# Estimators = 7     Train accuracy = 0.8632   Test accuracy = 0.8111
# Estimators = 8     Train accuracy = 0.8668   Test accuracy = 0.8108
# Estimators = 9     Train accuracy = 0.8683   Test accuracy = 0.8114
# Estimators = 10     Train accuracy = 0.8708   Test accuracy = 0.8095
# Estimators = 11     Train accuracy = 0.8746   Test accuracy = 0.8061
# Estimators = 12     Train accuracy = 0.8767   Test accuracy = 0.8055
# Estimators = 13     Train accuracy = 0.8782   Test accuracy = 0.8091
# Estimators = 14     Train accuracy = 0.884   Test accuracy = 0.8018
# Estimators = 15 

In [68]:
#none of the estimators beat best model (bagging classifier) test accuracy of 83%
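
The loop over `n_estimators` refits a new model for every value. As a sketch of a cheaper route, `staged_predict()` yields the predictions after each boosting stage of a single fitted model, so accuracy per stage can be read off one fit. The data is synthetic, and a separate validation split (`X_va`, `y_va`, hypothetical names) is used here instead of the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data, split into train and validation
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Fit once with the largest number of estimators of interest
staged_clf = GradientBoostingClassifier(max_depth=3, n_estimators=30,
                                        learning_rate=1.0, random_state=42)
staged_clf.fit(X_tr, y_tr)

# One accuracy value per boosting stage (1 tree, 2 trees, ..., 30 trees)
staged_acc = [accuracy_score(y_va, y_pred)
              for y_pred in staged_clf.staged_predict(X_va)]
best_n = int(np.argmax(staged_acc)) + 1
print("Best number of estimators on validation data:", best_n)
```

Choosing `best_n` on a validation split rather than the test set keeps the final test accuracy an honest estimate.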

In [69]:
import xgboost
xgb_clf = xgboost.XGBClassifier()

xgb_clf.fit(train_x, train_y)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [70]:
#Train accuracy

train_y_pred = xgb_clf.predict(train_x)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}' .format(train_acc))

Train acc: 0.867436974789916


In [71]:
#Test accuracy

test_y_pred = xgb_clf.predict(test_x)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}' .format(test_acc))

Test acc: 0.8266666666666667


In [None]:
#Above model is still not the best model

# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

# Extra Credit

In [72]:
alcohol_competition = pd.read_csv("alcohol_competition.csv")

In [73]:
alcohol_competition=alcohol_competition.drop(['ID'],axis=1)

In [74]:
subset_df = alcohol_competition[['age']]

# Send the new dataframe to the function we created
new_col(subset_df)

Unnamed: 0,IsAdult
0,0
1,0
2,0
3,1
4,0
...,...
995,0
996,0
997,1
998,0


In [75]:
alcohol_competition.dtypes

age            int64
Medu           int64
Fedu           int64
traveltime     int64
studytime      int64
failures       int64
famrel         int64
freetime       int64
goout          int64
health         int64
absences       int64
gender        object
dtype: object

In [76]:
# Transform the competition data
testalcohol = preprocessor.transform(alcohol_competition)

testalcohol

array([[ 0.66643886,  0.33104402,  0.04019664, ...,  1.        ,
         0.        , -0.36853945],
       [-1.23984621,  2.23583433,  2.63048577, ...,  1.        ,
         0.        , -0.36853945],
       [ 0.66643886,  0.33104402,  0.04019664, ...,  1.        ,
         0.        , -0.36853945],
       ...,
       [ 1.6195814 ,  0.33104402, -0.82323307, ...,  0.        ,
         1.        ,  2.71341375],
       [ 0.66643886, -0.30388608,  0.04019664, ...,  0.        ,
         1.        , -0.36853945],
       [ 1.6195814 ,  0.33104402, -0.82323307, ...,  1.        ,
         0.        ,  2.71341375]])

In [77]:
bestprediction = bag_clf.predict(testalcohol)

In [78]:
bestpredictiondf=pd.DataFrame(bestprediction,columns=['Alc'])

In [79]:
bestpredictiondf.insert(1, "ID", np.arange(1, 1001, 1).tolist(), False)

In [80]:
bestpredictiondf.head()

Unnamed: 0,Alc,ID
0,0,1
1,0,2
2,1,3
3,1,4
4,0,5


In [81]:
bestpredictiondf.to_csv('salian_tm_competition.csv',index=False)
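
The cells above drop the `ID` column and later rebuild it as 1 to 1000, which only matches the file if its IDs are sequential and there are exactly 1000 rows. A more defensive sketch is to keep the original IDs and attach them to the predictions. The small frame and the prediction array below are made-up placeholders for `alcohol_competition.csv` and `bag_clf.predict(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for alcohol_competition.csv (IDs deliberately not 1..n)
competition_demo = pd.DataFrame({"ID": [101, 102, 105], "age": [15, 18, 17]})

# Keep the IDs before dropping the column from the model inputs
ids = competition_demo["ID"]
features_demo = competition_demo.drop(columns=["ID"])

# Placeholder predictions; in the notebook these would come from the fitted model
predictions_demo = np.array([0, 1, 0])

# Pair each prediction with the ID of the row it was made for
submission = pd.DataFrame({"Alc": predictions_demo, "ID": ids.to_numpy()})
print(submission)
```

This keeps the submission correct even if the competition file is reordered or has a different number of rows.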