<a href="https://colab.research.google.com/github/PrashantDandriyal/Airplane-Severity-Competition-Hackerearth-CATBOOST/blob/master/Airplane_severity_catboost_acc951333_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing CatBoost

The next step is to install CatBoost in the environment. Colaboratory ships with many libraries preinstalled, and most others can be installed quickly with a simple *!pip install* command.  
You can ignore the warning message about the already imported enum package. Note that you need to install and import the library again every time you start a new Colab session.

In [10]:
!pip install catboost
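
Because the package has to be made available again in every new session, a small guard can check for it before reinstalling. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and not part of CatBoost or Colab.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be imported in this session."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("catboost"):
    print("catboost is already available in this session")
else:
    print("catboost is missing; run `!pip install catboost` first")
```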



## Download and prepare dataset
The next step is to download the dataset. GPU training is useful for large datasets: you get a good speedup starting from about 10k objects, and the speedup grows with the number of objects.
This notebook uses the HackerEarth airplane accident severity dataset, whose training file has 10,000 rows and 12 columns, which is right at that threshold.
The cells below download the archive with `wget` and extract it with `unzip`. The download is small (about 543K) and finishes in a few seconds.

In [11]:
!wget -O "3c055e822d5b11ea.zip" "https://he-s3.s3.amazonaws.com/media/hackathon/airplane-accident-severity-hackerearth-machine-learning-challenge/how-severe-can-an-airplane-accident-be-03e7a3f1/3c055e822d5b11ea.zip?Signature=DF7CGRrTwupnhGy5Q4MrgvW2bOQ%3D&Expires=1579011754&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5"

--2020-01-14 13:38:26--  https://he-s3.s3.amazonaws.com/media/hackathon/airplane-accident-severity-hackerearth-machine-learning-challenge/how-severe-can-an-airplane-accident-be-03e7a3f1/3c055e822d5b11ea.zip?Signature=DF7CGRrTwupnhGy5Q4MrgvW2bOQ%3D&Expires=1579011754&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5
Resolving he-s3.s3.amazonaws.com (he-s3.s3.amazonaws.com)... 52.219.40.209
Connecting to he-s3.s3.amazonaws.com (he-s3.s3.amazonaws.com)|52.219.40.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 555584 (543K) [application/zip]
Saving to: ‘3c055e822d5b11ea.zip?Signature=DF7CGRrTwupnhGy5Q4MrgvW2bOQ=&Expires=1579011754&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5’


2020-01-14 13:38:29 (379 KB/s) - ‘3c055e822d5b11ea.zip?Signature=DF7CGRrTwupnhGy5Q4MrgvW2bOQ=&Expires=1579011754&AWSAccessKeyId=AKIA6I2ISGOYH7WWS3G5’ saved [555584/555584]



In [12]:
!unzip "/content/3c055e822d5b11ea.zip"

Archive:  /content/3c055e822d5b11ea.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [13]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier

#Read training and testing files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

#Identify the datatype of variables
train.dtypes

Severity                    object
Safety_Score               float64
Days_Since_Inspection        int64
Total_Safety_Complaints      int64
Control_Metric             float64
Turbulence_In_gforces      float64
Cabin_Temperature          float64
Accident_Type_Code           int64
Max_Elevation              float64
Violations                   int64
Adverse_Weather_Metric     float64
Accident_ID                  int64
dtype: object

In [14]:
#summarize the training data: row count, column dtypes and non-null counts
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
Severity                   10000 non-null object
Safety_Score               10000 non-null float64
Days_Since_Inspection      10000 non-null int64
Total_Safety_Complaints    10000 non-null int64
Control_Metric             10000 non-null float64
Turbulence_In_gforces      10000 non-null float64
Cabin_Temperature          10000 non-null float64
Accident_Type_Code         10000 non-null int64
Max_Elevation              10000 non-null float64
Violations                 10000 non-null int64
Adverse_Weather_Metric     10000 non-null float64
Accident_ID                10000 non-null int64
dtypes: float64(6), int64(5), object(1)
memory usage: 937.6+ KB


In [15]:
#count the null values in each column
train.isnull().sum()

Severity                   0
Safety_Score               0
Days_Since_Inspection      0
Total_Safety_Complaints    0
Control_Metric             0
Turbulence_In_gforces      0
Cabin_Temperature          0
Accident_Type_Code         0
Max_Elevation              0
Violations                 0
Adverse_Weather_Metric     0
Accident_ID                0
dtype: int64

In [0]:
#separate the features (x) from the label (y)
x = train.drop('Severity',axis=1)
y = train.Severity

In [17]:
#show the dtypes of x; columns passed as cat_features are encoded by CatBoost internally,
#so no manual label encoding is needed
x.dtypes

Safety_Score               float64
Days_Since_Inspection        int64
Total_Safety_Complaints      int64
Control_Metric             float64
Turbulence_In_gforces      float64
Cabin_Temperature          float64
Accident_Type_Code           int64
Max_Elevation              float64
Violations                   int64
Adverse_Weather_Metric     float64
Accident_ID                  int64
dtype: object

In [18]:
#choose the columns CatBoost should treat as categorical: every column that is not float
#cate_features_index holds the positional indices of those columns
cate_features_index = np.where(x.dtypes != float)[0]
print(cate_features_index)

[ 1  2  6  8 10]
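
To make the dtype filter concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and `non_float_column_indices` is a hypothetical helper; it returns the positions of all non-float columns, which is the same selection rule as the `np.where` call above.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def non_float_column_indices(frame: pd.DataFrame) -> list:
    """Return positional indices of columns whose dtype is not floating point."""
    return [i for i, dtype in enumerate(frame.dtypes) if not is_float_dtype(dtype)]


# Toy data (illustrative values only, not taken from train.csv).
toy = pd.DataFrame({
    "Safety_Score": [49.2, 62.5, 63.1],
    "Days_Since_Inspection": [14, 10, 13],
    "Accident_Type_Code": [2, 2, 7],
    "Max_Elevation": [31335.5, 26024.7, 39269.1],
})

print(non_float_column_indices(toy))  # [1, 2]
```

On the real data this rule also selects `Accident_ID` (index 10), which is an integer identifier rather than a measured feature. Whether keeping it as a categorical feature helps is worth checking, for example by comparing against a run that drops it.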


In [0]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [0]:
#split x and y into a training set and a held-out validation set (named xtest/ytest here)
xtrain,xtest,ytrain,ytest = train_test_split(x,y,
                                             train_size=.85,
                                             random_state=1234)

In [0]:
#build the CatBoost model; use_best_model keeps the iteration that scores best on the eval_set,
#which limits overfitting
model = CatBoostClassifier(eval_metric='Accuracy',
                           use_best_model=True,
                           random_seed=42)

In [0]:
#fit the model; plot is an argument of fit(), so it is passed here rather than to the constructor
model.fit(xtrain,
          ytrain,
          cat_features=cate_features_index,
          eval_set=(xtest,ytest),
          plot=True)

In [21]:
#report accuracy on the held-out split; this is a single-split score, not a cross-validated one,
#and the same split was used to pick the best iteration, so prefer cross-validation for model evaluation
print('the test accuracy is :{:.6f}'.format(accuracy_score(ytest,model.predict(xtest))))

the test accuracy is :0.951333
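
The comment above recommends cross-validated accuracy. A minimal sketch of k-fold accuracy follows, assuming a generic `fit_predict(x_train, y_train, x_valid)` callable. That callable, `kfold_accuracy`, and the majority-class baseline are all hypothetical stand-ins, not the notebook's model; with CatBoost, the callable would fit a fresh classifier on each fold and return its predictions.

```python
import numpy as np


def majority_baseline(x_train, y_train, x_valid):
    """Hypothetical stand-in model: always predict the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    return np.full(len(x_valid), labels[np.argmax(counts)])


def kfold_accuracy(x, y, fit_predict, n_folds=5, seed=1234):
    """Return the validation accuracy of `fit_predict` on each of `n_folds` folds."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, valid_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(x[train_idx], y[train_idx], x[valid_idx])
        scores.append(float(np.mean(pred == y[valid_idx])))
    return scores


# Toy data (illustrative only): 20 rows, two features, two classes.
x_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.array(["minor"] * 15 + ["major"] * 5)
scores = kfold_accuracy(x_demo, y_demo, majority_baseline)
print("fold accuracies:", scores, "mean:", np.mean(scores))
```

The mean and spread of the fold scores give a steadier estimate than one 85/15 split, at the cost of training the model once per fold.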


In [0]:
#finally, predict on the test file; the predictions are the Severity class labels
pred = model.predict(test)
pred

In [0]:
#np.ravel flattens the predictions to 1-D whether predict returned shape (n, 1) or (n,)
col1 = test['Accident_ID']
submission = pd.DataFrame({'Accident_ID':col1,'Severity':np.ravel(pred)})

In [0]:
#write the submission file to the current working directory
submission.to_csv('out_popo.csv',index=False)
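
Before uploading, it can be worth reading the file back to confirm its shape. The sketch below uses the standard `csv` module; `check_submission` is a hypothetical helper, and the expected header is an assumption taken from the DataFrame built above rather than from the competition rules.

```python
import csv
import os


def check_submission(path, expected_header=("Accident_ID", "Severity")):
    """Validate the header and row widths, then return the number of data rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if tuple(header) != tuple(expected_header):
            raise ValueError(f"unexpected header: {header}")
        n_rows = 0
        for row in reader:
            # Every row needs one value per column and no empty cells.
            if len(row) != len(expected_header) or "" in row:
                raise ValueError(f"malformed row {n_rows + 1}: {row}")
            n_rows += 1
    return n_rows


# File name from the cell above; skipped if the file has not been written yet.
if os.path.exists("out_popo.csv"):
    print("rows in submission:", check_submission("out_popo.csv"))
```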