Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [70]:
# Familiar imports
import numpy as np
import pandas as pd

# For ordinal encoding categorical variables, splitting data
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# For training random forest model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
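To see what `index_col=0` changes without loading the full dataset, here is a minimal sketch on a made-up three-row CSV. The `csv_text` string and its values are hypothetical stand-ins for `train.csv`, not real competition data.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for train.csv (values are made up)
csv_text = "id,cat0,cont0,target\n1,B,0.20,8.1\n2,A,0.74,8.4\n4,B,0.61,8.0\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, `id` labels the rows and is not a feature column
print(with_index.columns.tolist())
print(with_index.index.tolist())

# Without it, pandas adds a default 0..n-1 index and keeps `id` as a column
print(without_index.columns.tolist())
print(without_index.index.tolist())
```

Keeping `id` out of the columns matters later: otherwise the model would treat the row identifier as a feature.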

In [71]:
# Load the training data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data
train.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13,target
1,B,B,B,C,B,B,A,E,C,N,...,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985,8.113634
2,B,B,A,A,B,D,A,F,A,O,...,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083,8.481233
3,A,A,A,C,B,D,A,D,A,F,...,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846,8.364351
4,B,B,A,C,B,D,A,E,C,K,...,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682,8.049253
6,A,A,A,C,B,D,A,E,A,N,...,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823,7.97226


In [72]:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300000 entries, 1 to 499999
Data columns (total 25 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   cat0    300000 non-null  object 
 1   cat1    300000 non-null  object 
 2   cat2    300000 non-null  object 
 3   cat3    300000 non-null  object 
 4   cat4    300000 non-null  object 
 5   cat5    300000 non-null  object 
 6   cat6    300000 non-null  object 
 7   cat7    300000 non-null  object 
 8   cat8    300000 non-null  object 
 9   cat9    300000 non-null  object 
 10  cont0   300000 non-null  float64
 11  cont1   300000 non-null  float64
 12  cont2   300000 non-null  float64
 13  cont3   300000 non-null  float64
 14  cont4   300000 non-null  float64
 15  cont5   300000 non-null  float64
 16  cont6   300000 non-null  float64
 17  cont7   300000 non-null  float64
 18  cont8   300000 non-null  float64
 19  cont9   300000 non-null  float64
 20  cont10  300000 non-null  float64
 21  cont11  30

In [73]:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200000 entries, 0 to 499995
Data columns (total 24 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   cat0    200000 non-null  object 
 1   cat1    200000 non-null  object 
 2   cat2    200000 non-null  object 
 3   cat3    200000 non-null  object 
 4   cat4    200000 non-null  object 
 5   cat5    200000 non-null  object 
 6   cat6    200000 non-null  object 
 7   cat7    200000 non-null  object 
 8   cat8    200000 non-null  object 
 9   cat9    200000 non-null  object 
 10  cont0   200000 non-null  float64
 11  cont1   200000 non-null  float64
 12  cont2   200000 non-null  float64
 13  cont3   200000 non-null  float64
 14  cont4   200000 non-null  float64
 15  cont5   200000 non-null  float64
 16  cont6   200000 non-null  float64
 17  cont7   200000 non-null  float64
 18  cont8   200000 non-null  float64
 19  cont9   200000 non-null  float64
 20  cont10  200000 non-null  float64
 21  cont11  20

The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [74]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
features.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
1,B,B,B,C,B,B,A,E,C,N,...,0.610706,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985
2,B,B,A,A,B,D,A,F,A,O,...,0.276853,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083
3,A,A,A,C,B,D,A,D,A,F,...,0.285074,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846
4,B,B,A,C,B,D,A,E,C,K,...,0.284667,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682
6,A,A,A,C,B,D,A,E,A,N,...,0.287595,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823


# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

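In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  The original version of this notebook ordinal-encodes every categorical column (that cell is kept below, commented out).  The cells after it instead one-hot encode the low-cardinality columns and ordinal-encode the high-cardinality ones, saving the encoded features as `X` and `X_test`.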
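As a quick refresher on how ordinal encoding behaves, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` are made up, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer. The notebook's own cells use the default encoder, which raises an error if the test data contains a category it did not see during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (made-up values); "Z" appears only in the test frame
toy_train = pd.DataFrame({"cat0": ["A", "B", "A"], "cat9": ["N", "F", "O"]})
toy_test = pd.DataFrame({"cat0": ["B", "A"], "cat9": ["F", "Z"]})

# Assumption: scikit-learn >= 0.24, which added handle_unknown="use_encoded_value"
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the learned mapping on the test data
train_encoded = encoder.fit_transform(toy_train)
test_encoded = encoder.transform(toy_test)

# Categories are sorted per column; each value is replaced by its position
print(encoder.categories_)
print(train_encoded)
print(test_encoded)
```

Fitting on the training data and only transforming the test data keeps the integer codes consistent between the two sets.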

In [None]:
## List of categorical columns
#object_cols = [col for col in features.columns if 'cat' in col]

## ordinal-encode categorical columns
#X = features.copy()
#X_test = test.copy()
#ordinal_encoder = OrdinalEncoder()
#X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
#X_test[object_cols] = ordinal_encoder.transform(test[object_cols])

## Preview the ordinal-encoded features
#X.head()

In [75]:
SHDF_catogorical = features.select_dtypes(include=['object']).copy()
SHDF_catogorical.head()

id,cat0,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9
1,B,B,B,C,B,B,A,E,C,N
2,B,B,A,A,B,D,A,F,A,O
3,A,A,A,C,B,D,A,D,A,F
4,B,B,A,C,B,D,A,E,C,K
6,A,A,A,C,B,D,A,E,A,N


In [76]:
print(SHDF_catogorical.isnull().values.sum())

0


In [77]:
# Collect the value counts of every categorical column
object_cols = [col for col in SHDF_catogorical.columns if SHDF_catogorical[col].dtype == "object"]

object_valcount = list(map(lambda col: SHDF_catogorical[col].value_counts(), object_cols))
d = dict(zip(object_cols, object_valcount))

# Print the value counts column by column
print("\n".join("{}\t{}".format(k, v) for k, v in d.items()))

cat0	A    193130
B    106870
Name: cat0, dtype: int64
cat1	A    154824
B    145176
Name: cat1, dtype: int64
cat2	A    253886
B     46114
Name: cat2, dtype: int64
cat3	C    263356
A     31726
D      4328
B       590
Name: cat3, dtype: int64
cat4	B    294737
A      2978
C      1772
D       513
Name: cat4, dtype: int64
cat5	B    149340
D    126137
C     20248
A      4275
Name: cat5, dtype: int64
cat6	A    290511
B      8018
C       928
D       292
I       136
H        56
E        45
G        14
Name: cat6, dtype: int64
cat7	E    276040
D     12144
B      8297
G      2870
F       562
C        36
A        31
I        20
Name: cat7, dtype: int64
cat8	C    111103
E     79844
A     76585
G     26128
D      5187
F       966
B       187
Name: cat8, dtype: int64
cat9	F    71249
I    59218
G    28253
L    20958
H    19925
K    18057
N    16704
B    14477
J    14266
O    14203
A    11029
M     7931
C     1603
D     1088
E     1039
Name: cat9, dtype: int64


In [78]:
object_cols = [col for col in SHDF_catogorical.columns if features[col].dtype == "object"]

object_nunique = list(map(lambda col: SHDF_catogorical[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('cat0', 2),
 ('cat1', 2),
 ('cat2', 2),
 ('cat3', 4),
 ('cat4', 4),
 ('cat5', 4),
 ('cat8', 7),
 ('cat6', 8),
 ('cat7', 8),
 ('cat9', 15)]

In [79]:
object_cols = [col for col in SHDF_catogorical.columns if 'cat' in col]
high_cardinality_cols = [col for col in object_cols if features[col].nunique() > 5]
high_cardinality_cols

['cat6', 'cat7', 'cat8', 'cat9']

In [80]:
# One-hot encode only the low-cardinality columns: 'cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5'.
# The high-cardinality columns 'cat6', 'cat7', 'cat8', 'cat9' are ordinal-encoded in a later cell.

SHDF_catogorical = pd.get_dummies(SHDF_catogorical, columns=['cat0','cat1','cat2','cat3','cat4','cat5'], prefix = ['OH_cat0','OH_cat1','OH_cat2','OH_cat3','OH_cat4','OH_cat5'])
X_test=pd.get_dummies(test, columns=['cat0','cat1','cat2','cat3','cat4','cat5'], prefix = ['OH_cat0','OH_cat1','OH_cat2','OH_cat3','OH_cat4','OH_cat5'])
#SHDF_catogorical.drop(['cat0','cat1','cat2','cat3','cat4','cat5'], axis=1)
print(SHDF_catogorical.head())

   cat6 cat7 cat8 cat9  OH_cat0_A  OH_cat0_B  OH_cat1_A  OH_cat1_B  OH_cat2_A  \
id                                                                              
1     A    E    C    N          0          1          0          1          0   
2     A    F    A    O          0          1          0          1          1   
3     A    D    A    F          1          0          1          0          1   
4     A    E    C    K          0          1          0          1          1   
6     A    E    A    N          1          0          1          0          1   

    OH_cat2_B  ...  OH_cat3_C  OH_cat3_D  OH_cat4_A  OH_cat4_B  OH_cat4_C  \
id             ...                                                          
1           1  ...          1          0          0          1          0   
2           0  ...          0          0          0          1          0   
3           0  ...          1          0          0          1          0   
4           0  ...          1          0       
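One caveat with calling `pd.get_dummies` separately on the training and test data: if a category is missing from one of them, the two frames end up with different columns. The sketch below shows one common way to guard against that with `DataFrame.reindex`. The toy frames are made up, and this alignment step is a suggestion, not something the cell above performs.

```python
import pandas as pd

# Toy frames (made-up values); the test frame has no "B" or "D" in cat3
toy_train = pd.DataFrame({"cat3": ["A", "B", "C", "D"], "cont0": [0.1, 0.2, 0.3, 0.4]})
toy_test = pd.DataFrame({"cat3": ["A", "C"], "cont0": [0.5, 0.6]})

train_oh = pd.get_dummies(toy_train, columns=["cat3"], prefix=["OH_cat3"])
test_oh = pd.get_dummies(toy_test, columns=["cat3"], prefix=["OH_cat3"])
print(test_oh.columns.tolist())  # fewer dummy columns than train_oh

# Give the test frame exactly the training columns, in the same order;
# dummy columns missing from the test data are filled with 0
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
print(test_oh.columns.tolist())
```

Note that `reindex` also drops any dummy column that exists only in the test frame, which is usually what you want because the model never saw that column during training.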

In [81]:
SHDF_catogorical

id,cat6,cat7,cat8,cat9,OH_cat0_A,OH_cat0_B,OH_cat1_A,OH_cat1_B,OH_cat2_A,OH_cat2_B,...,OH_cat3_C,OH_cat3_D,OH_cat4_A,OH_cat4_B,OH_cat4_C,OH_cat4_D,OH_cat5_A,OH_cat5_B,OH_cat5_C,OH_cat5_D
1,A,E,C,N,0,1,0,1,0,1,...,1,0,0,1,0,0,0,1,0,0
2,A,F,A,O,0,1,0,1,1,0,...,0,0,0,1,0,0,0,0,0,1
3,A,D,A,F,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1
4,A,E,C,K,0,1,0,1,1,0,...,1,0,0,1,0,0,0,0,0,1
6,A,E,A,N,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499993,A,E,A,I,0,1,0,1,1,0,...,0,0,0,1,0,0,0,0,0,1
499996,A,E,E,F,1,0,0,1,1,0,...,1,0,0,1,0,0,0,1,0,0
499997,A,E,G,F,0,1,0,1,1,0,...,1,0,0,1,0,0,0,0,1,0
499998,A,E,E,I,1,0,0,1,1,0,...,1,0,0,1,0,0,0,1,0,0


In [82]:
# Ordinal-encode the high-cardinality columns: fit on the training data only,
# then apply the same mapping to the test data
object_cols = high_cardinality_cols
ordinal_encoder = OrdinalEncoder()
SHDF_catogorical[high_cardinality_cols] = ordinal_encoder.fit_transform(SHDF_catogorical[high_cardinality_cols])
X_test[object_cols] = ordinal_encoder.transform(X_test[object_cols])

SHDF_catogorical.head()

id,cat6,cat7,cat8,cat9,OH_cat0_A,OH_cat0_B,OH_cat1_A,OH_cat1_B,OH_cat2_A,OH_cat2_B,...,OH_cat3_C,OH_cat3_D,OH_cat4_A,OH_cat4_B,OH_cat4_C,OH_cat4_D,OH_cat5_A,OH_cat5_B,OH_cat5_C,OH_cat5_D
1,0.0,4.0,2.0,13.0,0,1,0,1,0,1,...,1,0,0,1,0,0,0,1,0,0
2,0.0,5.0,0.0,14.0,0,1,0,1,1,0,...,0,0,0,1,0,0,0,0,0,1
3,0.0,3.0,0.0,5.0,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1
4,0.0,4.0,2.0,10.0,0,1,0,1,1,0,...,1,0,0,1,0,0,0,0,0,1
6,0.0,4.0,0.0,13.0,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1


In [83]:
X=SHDF_catogorical.copy()
X.head()

id,cat6,cat7,cat8,cat9,OH_cat0_A,OH_cat0_B,OH_cat1_A,OH_cat1_B,OH_cat2_A,OH_cat2_B,...,OH_cat3_C,OH_cat3_D,OH_cat4_A,OH_cat4_B,OH_cat4_C,OH_cat4_D,OH_cat5_A,OH_cat5_B,OH_cat5_C,OH_cat5_D
1,0.0,4.0,2.0,13.0,0,1,0,1,0,1,...,1,0,0,1,0,0,0,1,0,0
2,0.0,5.0,0.0,14.0,0,1,0,1,1,0,...,0,0,0,1,0,0,0,0,0,1
3,0.0,3.0,0.0,5.0,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1
4,0.0,4.0,2.0,10.0,0,1,0,1,1,0,...,1,0,0,1,0,0,0,0,0,1
6,0.0,4.0,0.0,13.0,1,0,1,0,1,0,...,1,0,0,1,0,0,0,0,0,1


In [84]:
X_test.head()

id,cat6,cat7,cat8,cat9,cont0,cont1,cont2,cont3,cont4,cont5,...,OH_cat3_C,OH_cat3_D,OH_cat4_A,OH_cat4_B,OH_cat4_C,OH_cat4_D,OH_cat5_A,OH_cat5_B,OH_cat5_C,OH_cat5_D
0,0.0,4.0,4.0,8.0,0.296227,0.686757,0.587731,0.392753,0.476739,0.37635,...,1,0,0,1,0,0,0,1,0,0
5,0.0,4.0,2.0,7.0,0.543707,0.364761,0.452967,0.929645,0.285509,0.860046,...,1,0,0,1,0,0,0,0,1,0
15,0.0,4.0,3.0,10.0,0.408961,0.296129,0.690999,0.740027,0.697272,0.6836,...,0,0,0,1,0,0,0,1,0,0
16,0.0,4.0,0.0,13.0,1.031239,0.356062,0.303651,0.895591,0.719306,0.77789,...,1,0,0,1,0,0,0,0,0,1
17,0.0,4.0,2.0,5.0,0.530447,0.729004,0.281723,0.444698,0.313032,0.431007,...,1,0,0,1,0,0,0,0,1,0


In [85]:
selected_columns=features[['cont0','cont1','cont2','cont3','cont4','cont5','cont6','cont7','cont8','cont9','cont10','cont11','cont12','cont13']].copy()
X.reset_index(drop=True, inplace=True)
selected_columns.reset_index(drop=True, inplace=True)
X = pd.concat([X,selected_columns],axis=1)
X.head()
#type(selected_columns)

Unnamed: 0,cat6,cat7,cat8,cat9,OH_cat0_A,OH_cat0_B,OH_cat1_A,OH_cat1_B,OH_cat2_A,OH_cat2_B,...,cont4,cont5,cont6,cont7,cont8,cont9,cont10,cont11,cont12,cont13
0,0.0,4.0,2.0,13.0,0,1,0,1,0,1,...,0.610706,0.400361,0.160266,0.310921,0.38947,0.267559,0.237281,0.377873,0.322401,0.86985
1,0.0,5.0,0.0,14.0,0,1,0,1,1,0,...,0.276853,0.533087,0.558922,0.516294,0.594928,0.341439,0.906013,0.921701,0.261975,0.465083
2,0.0,3.0,0.0,5.0,1,0,1,0,1,0,...,0.285074,0.650609,0.375348,0.902567,0.555205,0.843531,0.748809,0.620126,0.541474,0.763846
3,0.0,4.0,2.0,10.0,0,1,0,1,1,0,...,0.284667,0.66898,0.239061,0.732948,0.679618,0.574844,0.34601,0.71461,0.54015,0.280682
4,0.0,4.0,0.0,13.0,1,0,1,0,1,0,...,0.287595,0.686964,0.420667,0.648182,0.684501,0.956692,1.000773,0.776742,0.625849,0.250823


Next, we break off a validation set from the training data.
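If you're unsure how much data `train_test_split` holds out, the minimal sketch below shows the default behavior on made-up data. `X_toy` and `y_toy` are hypothetical arrays, not the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 3 made-up features
rng = np.random.default_rng(0)
X_toy = rng.random((100, 3))
y_toy = rng.random(100)

# With no test_size given, scikit-learn holds out 25% of the rows
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)
print(X_tr.shape, X_va.shape)

# An explicit split, here 80% training and 20% validation
X_tr2, X_va2, y_tr2, y_va2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
print(X_tr2.shape, X_va2.shape)
```

Fixing `random_state` makes the split reproducible, so validation scores from different runs can be compared.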

In [87]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Step 4: Train a model

Now that the data is prepared, the next step is to train a model.  

If you took the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course, then you learned about **[Random Forests](https://www.kaggle.com/dansbecker/random-forests)**.  In the code cell below, we fit a random forest model to the data.

In [88]:
# Define the model 
model = RandomForestRegressor(random_state=1)

# Train the model (will take about 10 minutes to run)
model.fit(X_train, y_train)
preds_valid = model.predict(X_valid)
print(mean_squared_error(y_valid, preds_valid, squared=False))

0.7377000429732766


In [93]:
# Define the model 
model = RandomForestRegressor(n_estimators=50,random_state=1)

# Train the model (will take about 10 minutes to run)
model.fit(X_train, y_train)
preds_valid = model.predict(X_valid)
print(mean_squared_error(y_valid, preds_valid, squared=False))

0.7413157248645301


In [92]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Define the model
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

# Fit the model
my_model_2.fit(X_train,y_train) 

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

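# Calculate MAE (scikit-learn expects y_true first, then y_pred)
mae_2 = mean_absolute_error(y_valid, predictions_2)

# Print MAE (note: not directly comparable to the RMSE values printed above)
print("Mean Absolute Error:" , mae_2)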


Mean Absolute Error: 0.5746810824776925


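In the two random forest cells above, we set `squared=False` so that `mean_squared_error` returns the root mean squared error (RMSE) on the validation data.  The XGBoost cell reports mean absolute error (MAE) instead, so its score is not directly comparable to the other two.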
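To make the RMSE calculation concrete, here is a minimal sketch that computes it by hand and with scikit-learn on made-up numbers. It takes the square root of `mean_squared_error` rather than passing `squared=False`, because newer scikit-learn releases deprecate that argument in favor of a separate `root_mean_squared_error` function. Check the version you have installed before relying on either one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([8.0, 7.5, 9.0, 8.5])  # made-up targets
y_pred = np.array([8.5, 7.0, 9.0, 7.5])  # made-up predictions

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value via scikit-learn, without relying on the `squared` argument
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than MAE does.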

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.

In [94]:
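# Use the model to generate predictions
# (select the training columns so X_test has the same features, in the same order, as X)
predictions = my_model_2.predict(X_test[X.columns])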

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

Once you have run the code cell above, follow the instructions below to submit to the competition:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

# Step 6: Keep Learning!

If you're not sure what to do next, you can begin by trying out more model types!
1. If you took the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course, then you learned about **[XGBoost](https://www.kaggle.com/alexisbcook/xgboost)**.  Try training a model with XGBoost, to improve over the performance you got here.

2. Take the time to learn about **LightGBM (LGBM)**, which is similar to XGBoost, since they both use gradient boosting to iteratively add decision trees to an ensemble.  In case you're not sure how to get started, **[here's a notebook](https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version)** that trains a model on a similar dataset.