Welcome to the **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**!  In this notebook, you'll learn how to make your first submission.

Before getting started, make your own editable copy of this notebook by clicking on the **Copy and Edit** button.

# Step 1: Import helpful libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [None]:
# Familiar imports
import numpy as np
import pandas as pd

# For ordinal encoding categorical variables, splitting data
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

# For scoring predictions (the models themselves are imported in Step 4)
# import lightgbm as lgb
from sklearn.metrics import mean_squared_error

# Step 2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
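To make the effect of `index_col=0` concrete, here is a minimal sketch on a made-up three-row CSV. The column names and values are assumptions for illustration only, not the competition data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for train.csv
csv_text = "id,cat0,cont0,target\n1,A,0.5,7.2\n2,B,0.1,8.1\n4,A,0.9,6.5\n"

# With index_col=0 the first column ('id') becomes the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, pandas creates a fresh 0..n-1 index and keeps 'id' as a column
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(without_index.columns))
print(list(with_index.index))
```

Keeping `id` in the index means it is not accidentally used as a feature, and it is still available later through `X_test.index` when building the submission.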

In [None]:
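def confirm(df):
    """Show the first rows, column info, and summary statistics of a DataFrame."""
    display(df.head())
    df.info()  # info() prints directly and returns None, so it needs no display()
    display(df.describe())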

In [None]:
# Load the training and test data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)



In [None]:
# Preview the data
confirm(train)

The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `features`).

In [None]:
# Separate target from features
y = train['target']
features = train.drop(['target'], axis=1)

# Preview features
# features.head()

# Step 3: Prepare the data

Next, we'll need to handle the categorical columns (`cat0`, `cat1`, ... `cat9`).  

In the **[Categorical Variables lesson](https://www.kaggle.com/alexisbcook/categorical-variables)** in the Intermediate Machine Learning course, you learned several different ways to encode categorical variables in a dataset.  In this notebook, we'll use ordinal encoding and save our encoded features as new variables `X` and `X_test`.
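As a rough illustration of what ordinal encoding does, the sketch below fits an `OrdinalEncoder` on a made-up single-column frame and reuses it on a made-up test frame. The `toy_*` names and the category values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with one categorical column
toy_train = pd.DataFrame({"cat0": ["B", "A", "C", "A"]})
toy_test = pd.DataFrame({"cat0": ["C", "B"]})

encoder = OrdinalEncoder()

# Fit on the training column only, then reuse the same mapping for the test column
train_encoded = encoder.fit_transform(toy_train[["cat0"]])
test_encoded = encoder.transform(toy_test[["cat0"]])

print(encoder.categories_)    # learned categories, in sorted order
print(train_encoded.ravel())  # each category replaced by its position in that order
print(test_encoded.ravel())
```

Fitting on the training data and only transforming the test data keeps the mapping consistent between the two. By default the encoder raises an error on a category it did not see during fitting; if that is a concern, `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` is one option to consider.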

In [None]:
# List of categorical columns
object_cols = [col for col in features.columns if 'cat' in col]

# ordinal-encode categorical columns
X = features.copy()
X_test = test.copy()
ordinal_encoder = OrdinalEncoder()
X[object_cols] = ordinal_encoder.fit_transform(features[object_cols])
X_test[object_cols] = ordinal_encoder.transform(test[object_cols])

# Preview the ordinal-encoded features
X.head()

Next, we break off a validation set from the training data.
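When no size is given, `train_test_split` shuffles the rows and holds out 25% of them. The sketch below shows this on made-up data; the `toy_*` and shortened variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target with 100 rows
toy_X = pd.DataFrame({"cont0": np.arange(100, dtype=float)})
toy_y = pd.Series(np.arange(100, dtype=float), name="target")

# Same call pattern as in this notebook: default split size, fixed seed
X_tr, X_va, y_tr, y_va = train_test_split(toy_X, toy_y, random_state=0)

print(len(X_tr), len(X_va))         # sizes of the training and validation parts
print(X_tr.index.equals(y_tr.index))  # features and target stay aligned
```

Fixing `random_state` makes the split reproducible, so scores from different models are computed on the same validation rows.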

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Step 4: Predict using XGBoost

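Here we use XGBoost instead of a random forest. It is usually faster to train and will probably give better prediction accuracy. The cells in this step are commented out; uncomment them to run the comparison.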

In [None]:
# from xgboost import XGBRegressor

# # Define the model
# my_model_1 = XGBRegressor(random_state=0)

# # Fit the model
# my_model_1.fit(X_train,y_train) 
# predictions_1 = my_model_1.predict(X_valid)

# # Calculate RMSE (squared=False returns the square root of the MSE)
# rmse_1 = mean_squared_error(y_valid, predictions_1, squared=False)

# print("Root Mean Squared Error:", rmse_1)



# Step 4.1: Predict using XGBoost with parameter tuning

Now let's try tuning XGBoost to get better accuracy.

In [None]:
# from xgboost import XGBRegressor

# # Define the model
# my_model_2 = XGBRegressor(n_estimators=5000, learning_rate=0.05, n_jobs=-1, random_state=0)

# # Fit the model, stopping once the validation score has not improved for 10 rounds
# my_model_2.fit(X_train, y_train, early_stopping_rounds=10, eval_set=[(X_valid, y_valid)], verbose=False)

# # Get predictions
# predictions_2 = my_model_2.predict(X_valid)

# # Calculate RMSE, as in Step 4, so the two models are comparable
# rmse_2 = mean_squared_error(y_valid, predictions_2, squared=False)

# # Uncomment to print RMSE
# print("Root Mean Squared Error:", rmse_2)

# Step 4.2: Predict using LightGBM


In [None]:
import optuna.integration.lightgbm as lgbo
from lightgbm import LGBMRegressor

In [None]:
opt_params = {
    "objective":"regression",
    "metric":"rmse"
}

In [None]:
reg_train = lgbo.Dataset(X_train,y_train)
reg_valid = lgbo.Dataset(X_valid,y_valid,reference=reg_train)

In [None]:
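# Note: with only 5 boosting rounds, early_stopping_rounds=100 can never trigger;
# raise num_boost_round for a more thorough search.
opt = lgbo.train(
    opt_params,
    reg_train,
    valid_sets=reg_valid,
    verbose_eval=False,
    num_boost_round=5,
    early_stopping_rounds=100,
)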

In [None]:
opt.params

In [None]:


lgbm_parameters = {
 'objective': 'regression',
 'metric': 'rmse',
 'feature_pre_filter': False,
 'lambda_l1': 2.1324554005212664e-05,
 'lambda_l2': 7.486212839933644,
 'num_leaves': 251,
 'feature_fraction': 1.0,
 'bagging_fraction': 0.5337542240432858,
 'bagging_freq': 3,
 'min_child_samples': 20,
 'num_iterations': 5,
}
# early_sr=64


lgbm_model = LGBMRegressor(**lgbm_parameters)
# eval_set expects a list of (X, y) pairs
lgbm_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=-1, categorical_feature=object_cols)



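In the code cell above, `'metric': 'rmse'` tells LightGBM to evaluate the root mean squared error (RMSE) on the validation data passed through `eval_set`. In the XGBoost cell of Step 4, passing `squared=False` to `mean_squared_error` serves the same purpose.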
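For reference, RMSE is the square root of the mean of the squared residuals. The sketch below computes it with NumPy alone; the `rmse` helper and the toy values are hypothetical and are not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical targets and predictions; residuals are 1, 0 and -2
toy_true = [3.0, 5.0, 7.0]
toy_pred = [2.0, 5.0, 9.0]

print(rmse(toy_true, toy_pred))  # sqrt((1 + 0 + 4) / 3)
```

In this notebook the same helper could be applied as `rmse(y_valid, lgbm_model.predict(X_valid))` to score the LightGBM model on the validation set. Computing the square root by hand also avoids depending on the `squared` argument of `mean_squared_error`, whose availability may vary with the installed scikit-learn version.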

# Step 5: Submit to the competition

We'll begin by using the trained model to generate predictions, which we'll save to a CSV file.
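A submission file here is a two-column CSV: one identifier column and one prediction column. The sketch below builds such a file in memory from made-up values and reads it back as a sanity check; the `toy_*` names, ids, and predictions are assumptions for illustration.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test.index and the model's predictions
toy_ids = pd.Index([0, 5, 15], name="id")
toy_predictions = np.array([8.1, 7.9, 8.4])

# Same layout as the notebook's output DataFrame
toy_output = pd.DataFrame({"Id": toy_ids, "target": toy_predictions})

# Write to an in-memory buffer instead of a file; index=False drops the row numbers
buffer = io.StringIO()
toy_output.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the CSV back and check that it has one row per id and no missing predictions
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip.shape, round_trip["target"].isna().sum())
```

A check like this, applied to the real `submission.csv`, can catch a missing `index=False` or a length mismatch before the file is uploaded.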

In [None]:
# Use the model to generate predictions
predictions = lgbm_model.predict(X_test)

# Save the predictions to a CSV file
output = pd.DataFrame({'Id': X_test.index,
                       'target': predictions})
output.to_csv('submission.csv', index=False)

Once you have run the code cell above, follow the instructions below to submit to the competition:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.

# Step 6: Keep Learning!

If you're not sure what to do next, you can begin by trying out more model types!
1. If you took the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course, then you learned about **[XGBoost](https://www.kaggle.com/alexisbcook/xgboost)**.  Try training a model with XGBoost, to improve over the performance you got here.

2. Take the time to learn about **Light GBM (LGBM)**, which is similar to XGBoost, since they both use gradient boosting to iteratively add decision trees to an ensemble.  In case you're not sure how to get started, **[here's a notebook](https://www.kaggle.com/svyatoslavsokolov/tps-feb-2021-lgbm-simple-version)** that trains a model on a similar dataset.