# Kaggle Challenge

This notebook is built for the AdVITya Kaggle Challenge.

The data is taken from Kaggle's official competition page and is available inside the 'data/' directory.

## Data Preprocessing

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/train.csv', index_col = 'ID')

In [3]:
data.head()

ID,VIN (1-10),County,City,State,ZIP Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,Expected Price ($1k)
EV33174,5YJ3E1EC6L,Snohomish,LYNNWOOD,WA,98037.0,2020.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,0,32.0,109821694,POINT (-122.287614 47.83874),PUGET SOUND ENERGY INC,50.0
EV40247,JN1AZ0CP8B,Skagit,BELLINGHAM,WA,98229.0,2011.0,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,0,40.0,137375528,POINT (-122.414936 48.709388),PUGET SOUND ENERGY INC,15.0
EV12248,WBY1Z2C56F,Pierce,TACOMA,WA,98422.0,2015.0,BMW,I3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,0,27.0,150627382,POINT (-122.396286 47.293138),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0
EV55713,1G1RD6E44D,King,REDMOND,WA,98053.0,2013.0,CHEVROLET,VOLT,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,0,45.0,258766301,POINT (-122.024951 47.670286),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9
EV28799,1G1FY6S05K,Pierce,PUYALLUP,WA,98375.0,2019.0,CHEVROLET,BOLT EV,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,0,25.0,296998138,POINT (-122.321062 47.103797),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78


Very early on, we decided that the following columns were of no use for the prediction, so they were dropped from the training data. The same is done for the testing data later.

In [4]:
garbage_columns = ['Base MSRP' , 'VIN (1-10)' , 'DOL Vehicle ID']

In [5]:
data_garbage_removed = data.drop(garbage_columns , axis= 1)

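With the garbage columns removed, we checked for NULL values. The per-column counts below add up to 1,455 missing values, most of them in 'Legislative District', 'Vehicle Location' and 'Electric Utility'.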

In [6]:
data_garbage_removed.isnull().sum()

County                                                 4
City                                                   9
State                                                 11
ZIP Code                                               6
Model Year                                             7
Make                                                   4
Model                                                 13
Electric Vehicle Type                                  0
Clean Alternative Fuel Vehicle (CAFV) Eligibility      0
Electric Range                                         0
Legislative District                                 169
Vehicle Location                                     510
Electric Utility                                     722
Expected Price ($1k)                                   0
dtype: int64

As there are 60,000+ training samples, dropping roughly 1,000 rows does not matter much; there is still enough data for training.

In [7]:
data_garbage_removed_na_dropped = data_garbage_removed.dropna()
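Because a single row can have nulls in several columns, the per-column counts do not directly give the number of rows that `dropna()` removes. A minimal sketch of how that number could be checked is given below; the toy frame and its values are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy frame standing in for the training data (values are invented).
toy = pd.DataFrame(
    {
        "County": ["King", None, "Pierce", "King"],
        "Electric Utility": ["PSE", "PSE", None, "PSE"],
        "Electric Range": [308, 73, 81, 38],
    }
)

rows_before = len(toy)
toy_dropped = toy.dropna()
rows_dropped = rows_before - len(toy_dropped)

# Rows with at least one null; this is exactly what dropna() removes.
rows_with_null = int(toy.isnull().any(axis=1).sum())

print(f"dropped {rows_dropped} of {rows_before} rows ({rows_dropped / rows_before:.1%})")
```

Running the same three lines on the real frame would confirm whether the loss is close to the estimate given above.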

In [8]:
data_garbage_removed_na_dropped.head()

ID,County,City,State,ZIP Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Vehicle Location,Electric Utility,Expected Price ($1k)
EV33174,Snohomish,LYNNWOOD,WA,98037.0,2020.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,32.0,POINT (-122.287614 47.83874),PUGET SOUND ENERGY INC,50.0
EV40247,Skagit,BELLINGHAM,WA,98229.0,2011.0,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,40.0,POINT (-122.414936 48.709388),PUGET SOUND ENERGY INC,15.0
EV12248,Pierce,TACOMA,WA,98422.0,2015.0,BMW,I3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,27.0,POINT (-122.396286 47.293138),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0
EV55713,King,REDMOND,WA,98053.0,2013.0,CHEVROLET,VOLT,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,45.0,POINT (-122.024951 47.670286),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9
EV28799,Pierce,PUYALLUP,WA,98375.0,2019.0,CHEVROLET,BOLT EV,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,25.0,POINT (-122.321062 47.103797),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78


Copies of the data were made at regular intervals to keep a backup of the current state.

In [9]:
data_cleaned = data_garbage_removed_na_dropped.copy()

## Feature Engineering

The columns 'County', 'City', 'State' and 'ZIP Code' are all correlated and describe the same thing (the owner's address). Merging them reduces the number of features, which we expected to speed up training and reduce overall error. It also lets the model relate the owner's address to the price through a single feature.

In [10]:
data_cleaned['Address'] = data_cleaned['County'] + '-' + data_cleaned['City'] + '-' + data_cleaned['State'] + '-' + data_cleaned['ZIP Code'].astype('str')

The 'Unique Model' feature merges the three features 'Make', 'Model' and 'Model Year', which together correlate better with the expected price.

In [11]:
data_cleaned['Unique Model'] = data_cleaned['Make'] + '-' + data_cleaned['Model'] + '-' + data_cleaned['Model Year'].astype('str')
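Since 'Model Year' and 'ZIP Code' are stored as floats, the combined keys end in '.0' (for example 'TESLA-MODEL 3-2020.0'). That is harmless as long as the training and testing data are formatted the same way. As an optional variant, and only as a sketch on made-up rows, the float could be cast to an integer first; this assumes the column has no missing values left.

```python
import pandas as pd

# Invented rows that mimic the relevant columns.
toy = pd.DataFrame(
    {
        "Make": ["TESLA", "NISSAN"],
        "Model": ["MODEL 3", "LEAF"],
        "Model Year": [2020.0, 2011.0],
    }
)

# Cast float years to int so the key reads 'TESLA-MODEL 3-2020'
# instead of 'TESLA-MODEL 3-2020.0'. Fails if a year is NaN.
year_text = toy["Model Year"].astype("int64").astype("str")
toy["Unique Model"] = toy["Make"] + "-" + toy["Model"] + "-" + year_text

print(toy["Unique Model"].tolist())
```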

After merging, the first five rows are shown here.

In [12]:
data_cleaned.head()

ID,County,City,State,ZIP Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Vehicle Location,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,Snohomish,LYNNWOOD,WA,98037.0,2020.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,32.0,POINT (-122.287614 47.83874),PUGET SOUND ENERGY INC,50.0,Snohomish-LYNNWOOD-WA-98037.0,TESLA-MODEL 3-2020.0
EV40247,Skagit,BELLINGHAM,WA,98229.0,2011.0,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,40.0,POINT (-122.414936 48.709388),PUGET SOUND ENERGY INC,15.0,Skagit-BELLINGHAM-WA-98229.0,NISSAN-LEAF-2011.0
EV12248,Pierce,TACOMA,WA,98422.0,2015.0,BMW,I3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,27.0,POINT (-122.396286 47.293138),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0,Pierce-TACOMA-WA-98422.0,BMW-I3-2015.0
EV55713,King,REDMOND,WA,98053.0,2013.0,CHEVROLET,VOLT,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,45.0,POINT (-122.024951 47.670286),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9,King-REDMOND-WA-98053.0,CHEVROLET-VOLT-2013.0
EV28799,Pierce,PUYALLUP,WA,98375.0,2019.0,CHEVROLET,BOLT EV,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,25.0,POINT (-122.321062 47.103797),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78,Pierce-PUYALLUP-WA-98375.0,CHEVROLET-BOLT EV-2019.0


We checked whether the same city can contain multiple ZIP codes, which is true for this dataset. This gave us confidence in our feature engineering.

In [13]:
data_cleaned.loc[data_cleaned['City'] == 'TACOMA']['ZIP Code']

ID
EV12248    98422.0
EV78401    98445.0
EV22778    98418.0
EV59516    98406.0
EV78000    98403.0
            ...   
EV46868    98406.0
EV49558    98409.0
EV4422     98409.0
EV59519    98465.0
EV423      98402.0
Name: ZIP Code, Length: 1434, dtype: float64
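The check above looks at one city by hand. A sketch of how the same question could be answered for every city at once is shown below, using a grouped count of distinct ZIP codes; the toy rows are invented for illustration.

```python
import pandas as pd

# Invented rows; only the two relevant columns are included.
toy = pd.DataFrame(
    {
        "City": ["TACOMA", "TACOMA", "TACOMA", "REDMOND"],
        "ZIP Code": [98422.0, 98445.0, 98422.0, 98053.0],
    }
)

# Number of distinct ZIP codes observed in each city.
zips_per_city = toy.groupby("City")["ZIP Code"].nunique()

# Cities that span more than one ZIP code.
multi_zip_cities = zips_per_city[zips_per_city > 1]
print(multi_zip_cities)
```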

Now that the features have been combined into better ones, we can remove the original columns, which are now obsolete.

In [14]:
original_cols = ['City' , 'State' ,'ZIP Code' , 'County' , 'Make' , 'Model' , 'Model Year']

In [15]:
data_cleaned_dropped = data_cleaned.drop(original_cols , axis = 1) 

In [16]:
data_cleaned_dropped.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Vehicle Location,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,32.0,POINT (-122.287614 47.83874),PUGET SOUND ENERGY INC,50.0,Snohomish-LYNNWOOD-WA-98037.0,TESLA-MODEL 3-2020.0
EV40247,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,40.0,POINT (-122.414936 48.709388),PUGET SOUND ENERGY INC,15.0,Skagit-BELLINGHAM-WA-98229.0,NISSAN-LEAF-2011.0
EV12248,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,27.0,POINT (-122.396286 47.293138),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0,Pierce-TACOMA-WA-98422.0,BMW-I3-2015.0
EV55713,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,45.0,POINT (-122.024951 47.670286),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9,King-REDMOND-WA-98053.0,CHEVROLET-VOLT-2013.0
EV28799,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,25.0,POINT (-122.321062 47.103797),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78,Pierce-PUYALLUP-WA-98375.0,CHEVROLET-BOLT EV-2019.0


To decide whether 'Electric Utility' and 'Legislative District' are of any use, we calculated the correlation between these features and the label.
However, the label had dtype 'object', which is not appropriate for calculating correlation, so we converted the column to float64.

In [17]:
data_cleaned_dropped['Expected Price ($1k)'] = data_cleaned_dropped['Expected Price ($1k)'].astype('float64')

After this correlation analysis, we decided to remove only the 'Vehicle Location' column from the training data.

In [18]:
# numeric_only=True keeps recent pandas versions from raising on the text columns
data_cleaned_dropped.corr(numeric_only=True)

,Electric Range,Legislative District,Expected Price ($1k)
Electric Range,1.0,0.041381,0.213697
Legislative District,0.041381,1.0,0.056133
Expected Price ($1k),0.213697,0.056133,1.0
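Note that the correlation table only contains numeric columns, so text columns such as 'Electric Utility' and 'Vehicle Location' do not appear in it. The sketch below reproduces that behaviour on made-up rows. It assumes pandas 1.5 or newer, where `DataFrame.corr` accepts `numeric_only`, and it uses `pd.to_numeric` as one possible way to convert a label that was read as text.

```python
import pandas as pd

# Invented rows; the price is deliberately stored as text.
toy = pd.DataFrame(
    {
        "Electric Range": [308, 73, 81, 38],
        "Electric Utility": ["A", "A", "B", "C"],
        "Expected Price ($1k)": ["50.0", "15.0", "18.0", "33.9"],
    }
)

# Convert the label to a numeric dtype; unparsable entries become NaN.
toy["Expected Price ($1k)"] = pd.to_numeric(toy["Expected Price ($1k)"], errors="coerce")

# Only numeric columns take part in the correlation matrix.
corr = toy.corr(numeric_only=True)
print(corr)
```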


In [19]:
non_correlation_columns = ['Vehicle Location']

In [20]:
data_cleaned_dropped = data_cleaned_dropped.drop(non_correlation_columns , axis = 1)

In [21]:
data_cleaned_dropped.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,32.0,PUGET SOUND ENERGY INC,50.0,Snohomish-LYNNWOOD-WA-98037.0,TESLA-MODEL 3-2020.0
EV40247,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,40.0,PUGET SOUND ENERGY INC,15.0,Skagit-BELLINGHAM-WA-98229.0,NISSAN-LEAF-2011.0
EV12248,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,27.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0,Pierce-TACOMA-WA-98422.0,BMW-I3-2015.0
EV55713,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,45.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9,King-REDMOND-WA-98053.0,CHEVROLET-VOLT-2013.0
EV28799,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,25.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78,Pierce-PUYALLUP-WA-98375.0,CHEVROLET-BOLT EV-2019.0


In [22]:
# data_cleaned_dropped = data_cleaned_dropped.drop(['Electric Utility'] , axis = 1)

We then dumped the entire data into a csv file to be shared with the team.

In [23]:
data_cleaned_dropped.to_csv('preprocessed_cleaned_data.csv')

## Note: encoding is still pending; it was performed after the test-data cleaning

# Testing Data Preprocessing

In [1]:
import pandas as pd

In [2]:
test_data = pd.read_csv('data/test.csv' , index_col = 'ID')

In [3]:
sample_submission = pd.read_csv('data/sample_submission.csv' )

In [4]:
len(test_data)

27580

In [5]:
test_data.shape

(27580, 16)

In [6]:
len(sample_submission)

27580

## Data Cleaning
The test data was checked for null values, and some were present. Unlike with the training data, we cannot drop those rows, because every test row needs a prediction. We decided to use imputation and fill the missing entries with appropriate values, calculated by `sklearn`'s `SimpleImputer`.

In [7]:
test_data.isnull().sum()

VIN (1-10)                                             0
County                                                 4
City                                                   3
State                                                  2
ZIP Code                                               3
Model Year                                             4
Make                                                   3
Model                                                  4
Electric Vehicle Type                                  0
Clean Alternative Fuel Vehicle (CAFV) Eligibility      0
Electric Range                                         0
Base MSRP                                              0
Legislative District                                  70
DOL Vehicle ID                                         0
Vehicle Location                                     253
Electric Utility                                     327
dtype: int64

In [8]:
from sklearn.impute import SimpleImputer

In [9]:
imputer = SimpleImputer(strategy='most_frequent')

In [10]:
imputed_test_data = imputer.fit_transform(test_data)

In [11]:
cols_to_impute = test_data.columns[test_data.isnull().any()]

In [12]:
len(cols_to_impute)

10

In [13]:
imputed_cols = imputer.fit_transform(test_data[cols_to_impute])

In [14]:
imputed_cols.shape

(27580, 10)

In [15]:
imputed_data = test_data.copy()

In [16]:
imputed_data[cols_to_impute] = imputed_cols
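The imputation steps above can be sketched end to end on a small frame. With `strategy='most_frequent'`, each missing entry is replaced by the most common value of its column, which works for text and numeric columns alike. The rows below are invented, and the cast to `object` is an assumption made here so that mixed column types are handled uniformly.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented rows with nulls in a text column and a numeric column.
toy = pd.DataFrame(
    {
        "County": ["King", "King", np.nan, "Pierce"],
        "Model Year": [2020.0, np.nan, 2020.0, 2011.0],
        "Electric Range": [308, 73, 81, 38],
    }
)

# Only the columns that actually contain nulls are imputed.
cols_with_nulls = toy.columns[toy.isnull().any()]

imputer = SimpleImputer(strategy="most_frequent")
filled = toy.copy()
filled[cols_with_nulls] = imputer.fit_transform(toy[cols_with_nulls].astype(object))

print(filled)
```

Note that the imputed columns come back with dtype `object`, so a numeric column may need to be cast back before further numeric work.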

After imputation, all null values were gone!

In [17]:
imputed_data.isnull().sum()

VIN (1-10)                                           0
County                                               0
City                                                 0
State                                                0
ZIP Code                                             0
Model Year                                           0
Make                                                 0
Model                                                0
Electric Vehicle Type                                0
Clean Alternative Fuel Vehicle (CAFV) Eligibility    0
Electric Range                                       0
Base MSRP                                            0
Legislative District                                 0
DOL Vehicle ID                                       0
Vehicle Location                                     0
Electric Utility                                     0
dtype: int64

As with the training data, these columns are of no use to us.

In [18]:
garbage_cols = ['Base MSRP' , 'DOL Vehicle ID' ,'VIN (1-10)']

In [19]:
dropped_test_data = imputed_data.drop(garbage_cols , axis = 1)

In [20]:
dropped_test_data.shape

(27580, 13)

In [21]:
dropped_test_data.head()

ID,County,City,State,ZIP Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Vehicle Location,Electric Utility
EV28368,Jefferson,PORT TOWNSEND,WA,98368.0,2012.0,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,24.0,POINT (-122.818016 48.080229),BONNEVILLE POWER ADMINISTRATION||PUGET SOUND E...
EV27088,King,ISSAQUAH,WA,98029.0,2021.0,FORD,F-150,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,5.0,POINT (-122.014191 47.559121),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)
EV58989,Pierce,PUYALLUP,WA,98372.0,2018.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215,25.0,POINT (-122.270761 47.205558),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)
EV6715,Pierce,TACOMA,WA,98403.0,2021.0,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,27.0,POINT (-122.459716 47.265523),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...
EV63251,Kitsap,BREMERTON,WA,98312.0,2021.0,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,35.0,POINT (-122.724682 47.57271),PUGET SOUND ENERGY INC


As with the training data, we engineered a few features to reduce training complexity.

In [22]:
dropped_test_data['Address'] = dropped_test_data['County'] + '-' + dropped_test_data['City']  + '-' + dropped_test_data['State'] + '-' + dropped_test_data['ZIP Code'].astype('str')

In [23]:
dropped_test_data['Unique Model'] = dropped_test_data['Make'] + '-' + dropped_test_data['Model'] + '-' + dropped_test_data['Model Year'].astype('str')

We then dropped the original columns.

In [24]:
original_cols = ['City' , 'State' ,'ZIP Code' , 'County' , 'Make' , 'Model' , 'Model Year']

In [25]:
clean_test_data = dropped_test_data.drop(original_cols , axis = 1)

In [26]:
clean_test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Vehicle Location,Electric Utility,Address,Unique Model
EV28368,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,24.0,POINT (-122.818016 48.080229),BONNEVILLE POWER ADMINISTRATION||PUGET SOUND E...,Jefferson-PORT TOWNSEND-WA-98368.0,NISSAN-LEAF-2012.0
EV27088,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,5.0,POINT (-122.014191 47.559121),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),King-ISSAQUAH-WA-98029.0,FORD-F-150-2021.0
EV58989,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215,25.0,POINT (-122.270761 47.205558),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),Pierce-PUYALLUP-WA-98372.0,TESLA-MODEL 3-2018.0
EV6715,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,27.0,POINT (-122.459716 47.265523),BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,Pierce-TACOMA-WA-98403.0,TESLA-MODEL 3-2021.0
EV63251,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,35.0,POINT (-122.724682 47.57271),PUGET SOUND ENERGY INC,Kitsap-BREMERTON-WA-98312.0,TESLA-MODEL Y-2021.0


In [27]:
cols_under_reconsideration = ['Vehicle Location']

In [28]:
total_clean_test_data = clean_test_data.drop(cols_under_reconsideration , axis = 1)

In [29]:
total_clean_test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,24.0,BONNEVILLE POWER ADMINISTRATION||PUGET SOUND E...,Jefferson-PORT TOWNSEND-WA-98368.0,NISSAN-LEAF-2012.0
EV27088,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,5.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),King-ISSAQUAH-WA-98029.0,FORD-F-150-2021.0
EV58989,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215,25.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),Pierce-PUYALLUP-WA-98372.0,TESLA-MODEL 3-2018.0
EV6715,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,27.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,Pierce-TACOMA-WA-98403.0,TESLA-MODEL 3-2021.0
EV63251,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,35.0,PUGET SOUND ENERGY INC,Kitsap-BREMERTON-WA-98312.0,TESLA-MODEL Y-2021.0


We dumped this data for sharing with the team

In [30]:
total_clean_test_data.to_csv('cleaned_test_data.csv')

# Data Encoding

The models we use need numeric input, so the textual columns have to be encoded as ordinal values first.
The biggest hurdle was that some categories present in the testing dataset do not appear in the training dataset. We therefore combined the categorical columns of both datasets and fitted the encoder on the combined data, so that the training and testing datasets share the same encoding when we train and test.

In [2]:
import pandas as pd

In [6]:
train_data = pd.read_csv('preprocessed_cleaned_data.csv' , index_col = 'ID')

In [7]:
clean_test_data = pd.read_csv('cleaned_test_data.csv' , index_col = 'ID')

In [8]:
clean_test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,24.0,BONNEVILLE POWER ADMINISTRATION||PUGET SOUND E...,Jefferson-PORT TOWNSEND-WA-98368.0,NISSAN-LEAF-2012.0
EV27088,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,5.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),King-ISSAQUAH-WA-98029.0,FORD-F-150-2021.0
EV58989,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215,25.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),Pierce-PUYALLUP-WA-98372.0,TESLA-MODEL 3-2018.0
EV6715,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,27.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,Pierce-TACOMA-WA-98403.0,TESLA-MODEL 3-2021.0
EV63251,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not b...,0,35.0,PUGET SOUND ENERGY INC,Kitsap-BREMERTON-WA-98312.0,TESLA-MODEL Y-2021.0


In [9]:
train_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,308,32.0,PUGET SOUND ENERGY INC,50.0,Snohomish-LYNNWOOD-WA-98037.0,TESLA-MODEL 3-2020.0
EV40247,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,73,40.0,PUGET SOUND ENERGY INC,15.0,Skagit-BELLINGHAM-WA-98229.0,NISSAN-LEAF-2011.0
EV12248,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,81,27.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,18.0,Pierce-TACOMA-WA-98422.0,BMW-I3-2015.0
EV55713,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,38,45.0,PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),33.9,King-REDMOND-WA-98053.0,CHEVROLET-VOLT-2013.0
EV28799,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,238,25.0,BONNEVILLE POWER ADMINISTRATION||CITY OF TACOM...,41.78,Pierce-PUYALLUP-WA-98375.0,CHEVROLET-BOLT EV-2019.0


The list of columns to be encoded is given below

In [10]:
encoding_cols = [ 'Electric Vehicle Type' , 'Clean Alternative Fuel Vehicle (CAFV) Eligibility' , 'Electric Utility' , 'Address' , 'Unique Model']

In [12]:
combined_encoding_data = [train_data[encoding_cols] , clean_test_data[encoding_cols]]

In [17]:
combined_encoding_data = pd.concat(combined_encoding_data)

In [18]:
combined_encoding_data.shape

(90661, 5)

In [19]:
from sklearn.preprocessing import OrdinalEncoder

In [20]:
encoder = OrdinalEncoder()

In [21]:
encoder.fit(combined_encoding_data)

OrdinalEncoder()

In [22]:
train_data[encoding_cols] = encoder.transform(train_data[encoding_cols])

In [23]:
clean_test_data[encoding_cols] = encoder.transform(clean_test_data[encoding_cols])
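Fitting the encoder on the combined data is one way to deal with categories that only occur in the testing data. A possible alternative, shown here only as a sketch and not as what this notebook does, is to fit on the training data alone and map unseen categories to a sentinel value. It assumes scikit-learn 0.24 or newer, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`; the rows are invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented rows: the test frame contains one category unseen in training.
train = pd.DataFrame({"Unique Model": ["TESLA-MODEL 3-2020.0", "NISSAN-LEAF-2011.0"]})
test = pd.DataFrame({"Unique Model": ["NISSAN-LEAF-2011.0", "FORD-F-150-2021.0"]})

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

train_codes = encoder.transform(train)
test_codes = encoder.transform(test)
print(test_codes.ravel())
```

Either way, ordinal codes follow the sorted order of the category strings, so the numeric distance between two codes carries no meaning of its own.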

After transforming the training and testing data with the same fitted encoder, we saved both again for final training and testing.

In [24]:
clean_test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0


In [29]:
train_data.shape

(63081, 8)

In [26]:
clean_test_data.shape

(27580, 7)

In [28]:
clean_test_data.to_csv('encoded_clean_test_data.csv')

In [27]:
train_data.to_csv('encoded_clean_training_data.csv')

# Training and Testing

Once the data is cleaned, the training step is straightforward.

In [1]:
import pandas as pd

In [2]:
train_data = pd.read_csv('encoded_clean_training_data.csv' , index_col = 'ID')

In [4]:
train_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,0.0,0.0,308,32.0,66.0,50.0,576.0,256.0
EV40247,0.0,0.0,73,40.0,66.0,15.0,538.0,210.0
EV12248,0.0,0.0,81,27.0,20.0,18.0,490.0,35.0
EV55713,1.0,0.0,38,45.0,67.0,33.9,234.0,69.0
EV28799,0.0,0.0,238,25.0,16.0,41.78,470.0,60.0


In [5]:
from sklearn.model_selection import train_test_split

In [6]:
test_data = pd.read_csv('encoded_clean_test_data.csv' , index_col = 'ID')

In [7]:
test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0


The data is divided into X (features) and y (labels).

In [8]:
X = train_data.drop(['Expected Price ($1k)'] , axis = 1)

In [9]:
y = train_data['Expected Price ($1k)']

In [10]:
X.shape

(63081, 7)

In [11]:
y.shape

(63081,)

The data was then split into training and validation sets using `train_test_split`. `random_state` was fixed at 300 throughout our training and testing, so that all results are consistent and comparable within our team.

In [24]:
import xgboost as xgb
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [25]:
training_data = pd.read_csv('encoded_clean_training_data.csv', index_col = 'ID')

In [26]:
training_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Expected Price ($1k),Address,Unique Model
EV33174,0.0,0.0,308,32.0,66.0,50.0,576.0,256.0
EV40247,0.0,0.0,73,40.0,66.0,15.0,538.0,210.0
EV12248,0.0,0.0,81,27.0,20.0,18.0,490.0,35.0
EV55713,1.0,0.0,38,45.0,67.0,33.9,234.0,69.0
EV28799,0.0,0.0,238,25.0,16.0,41.78,470.0,60.0


In [4]:
model = XGBRegressor()

In [5]:
X = training_data.drop(['Expected Price ($1k)'], axis = 1)

In [6]:
y = training_data['Expected Price ($1k)']

In [7]:
X.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV33174,0.0,0.0,308,32.0,66.0,576.0,256.0
EV40247,0.0,0.0,73,40.0,66.0,538.0,210.0
EV12248,0.0,0.0,81,27.0,20.0,490.0,35.0
EV55713,1.0,0.0,38,45.0,67.0,234.0,69.0
EV28799,0.0,0.0,238,25.0,16.0,470.0,60.0


In [8]:
y.head()

ID
EV33174    50.00
EV40247    15.00
EV12248    18.00
EV55713    33.90
EV28799    41.78
Name: Expected Price ($1k), dtype: float64

In [11]:
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state = 300)
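When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows, and a fixed `random_state` makes the shuffle repeatable. The sketch below illustrates both points on synthetic arrays; the array contents are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the label vector.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Two independent calls with the same seed.
a_train, a_val, ya_train, ya_val = train_test_split(X, y, random_state=300)
b_train, b_val, yb_train, yb_val = train_test_split(X, y, random_state=300)

print(a_train.shape, a_val.shape)       # 75% / 25% of the rows
print(np.array_equal(ya_val, yb_val))   # same seed, same validation rows
```

This is why team members who share the seed and the input file evaluate on the same validation rows.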

In [12]:
model.fit(train_x, train_y)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)

In [14]:
model.score(train_x, train_y)

0.9998449954801341

In [15]:
model.score(test_x, test_y)

0.881998540171633
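For scikit-learn-style regressors such as `XGBRegressor`, `score` returns the coefficient of determination R², so the two numbers above are R² on the training and validation splits. The gap between them hints at some overfitting. As a sketch, R² and RMSE can also be computed explicitly from predictions; the predicted values below are invented for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# True prices (in $1k) and made-up predictions.
y_true = np.array([50.0, 15.0, 18.0, 33.9, 41.78])
y_pred = np.array([48.0, 17.0, 18.0, 30.0, 45.0])

# R^2 = 1 - SS_res / SS_tot, the quantity a regressor's score() reports.
r2 = r2_score(y_true, y_pred)

# RMSE is in the same unit as the label, which makes it easier to read.
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(f"R^2 = {r2:.4f}, RMSE = {rmse:.3f}")
```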

In [16]:
test_data = pd.read_csv('final_test_data.csv', index_col = 'ID')

In [17]:
test_data.head()

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0


In [18]:
predicted_values = model.predict(test_data)

In [19]:
test_data

ID,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0
...,...,...,...,...,...,...,...
EV16462,1.0,2.0,19,18.0,33.0,47.0,95.0
EV19258,1.0,0.0,35,11.0,67.0,239.0,68.0
EV91826,0.0,1.0,0,41.0,67.0,180.0,279.0
EV5443,0.0,0.0,238,22.0,66.0,665.0,58.0


In [20]:
predicted_values

array([17.005924, 42.950695, 68.98781 , ..., 77.98914 , 20.019003,
       15.052367], dtype=float32)

In [21]:
predicted_dataframe = pd.DataFrame(data = { 'ID':test_data.index, 'Expected Price ($1k)' : predicted_values})

In [22]:
predicted_dataframe.set_index('ID')

Unnamed: 0_level_0,Expected Price ($1k)
ID,Unnamed: 1_level_1
EV28368,17.005924
EV27088,42.950695
EV58989,68.987808
EV6715,64.013527
EV63251,72.999001
...,...
EV16462,19.075752
EV19258,15.979469
EV91826,77.989143
EV5443,20.019003


In [23]:
predicted_dataframe.to_csv('XGBoost_submission.csv', index=False)
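
Before uploading, a quick structural check of the written file can catch a wrong header, a duplicated ID, or a non-numeric prediction. The sketch below uses only the standard library; `check_submission` is a hypothetical helper, and the expected header is taken from the two columns written above.

```python
import csv
import io

EXPECTED_HEADER = ["ID", "Expected Price ($1k)"]


def check_submission(handle, expected_rows=None):
    """Validate header, IDs and values; return the number of data rows."""
    reader = csv.reader(handle)
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    seen_ids = set()
    count = 0
    for row_id, price in reader:
        if row_id in seen_ids:
            raise ValueError(f"duplicate ID: {row_id}")
        seen_ids.add(row_id)
        float(price)  # raises ValueError if the prediction is not numeric
        count += 1
    if expected_rows is not None and count != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {count}")
    return count


# In-memory stand-in for the CSV; the two rows mirror the preview above.
sample = io.StringIO(
    "ID,Expected Price ($1k)\nEV28368,17.005924\nEV27088,42.950695\n"
)
print(check_submission(sample, expected_rows=2))
```

On the real file, the same function could be called as `check_submission(open('XGBoost_submission.csv', newline=''), expected_rows=len(test_data))`.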

In [1]:
from sklearn.model_selection import train_test_split
import catboost as cb
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV

In [2]:
d = pd.read_csv("final_train.csv", index_col = "ID")
d.head()

Unnamed: 0_level_0,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Expected Price ($1k),Address,Unique Model
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
EV33174,0.0,0.0,308,32.0,66.0,50.0,576.0,256.0
EV40247,0.0,0.0,73,40.0,66.0,15.0,538.0,210.0
EV12248,0.0,0.0,81,27.0,20.0,18.0,490.0,35.0
EV55713,1.0,0.0,38,45.0,67.0,33.9,234.0,69.0
EV28799,0.0,0.0,238,25.0,16.0,41.78,470.0,60.0


In [3]:
x = d.drop(["Expected Price ($1k)"], axis=1)

In [4]:
y = d["Expected Price ($1k)"]

In [5]:
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 300)

In [6]:
train_x.shape

(47310, 7)

In [7]:
train_y.shape

(47310,)

In [44]:
model = cb.CatBoostRegressor()

In [45]:
parameters = {'iterations': range(1000, 1500), 'learning_rate': [0.03, 0.05, 0.09, 0.1, 0.25, 0.35, 0.5, 0.6]}
clf = RandomizedSearchCV(model, parameters, random_state=300)

In [46]:
search = clf.fit(train_x, train_y)

0:	learn: 23.5750504	total: 5.63ms	remaining: 6.49s
1:	learn: 22.7143929	total: 10.7ms	remaining: 6.16s
2:	learn: 21.8938137	total: 15.4ms	remaining: 5.91s
3:	learn: 21.1370593	total: 20.1ms	remaining: 5.79s
4:	learn: 20.4099314	total: 24.6ms	remaining: 5.66s
5:	learn: 19.7561523	total: 29.3ms	remaining: 5.6s
6:	learn: 19.1412566	total: 33.9ms	remaining: 5.56s
7:	learn: 18.5379259	total: 38.6ms	remaining: 5.53s
8:	learn: 17.9812807	total: 43.3ms	remaining: 5.5s
9:	learn: 17.4431805	total: 48ms	remaining: 5.49s
10:	learn: 16.9342680	total: 52.4ms	remaining: 5.45s
11:	learn: 16.4630551	total: 57.1ms	remaining: 5.43s
12:	learn: 16.0416785	total: 61.8ms	remaining: 5.42s
13:	learn: 15.6363230	total: 66.7ms	remaining: 5.43s
14:	learn: 15.2462317	total: 71.5ms	remaining: 5.43s
15:	learn: 14.8886935	total: 76.6ms	remaining: 5.45s
16:	learn: 14.5509428	total: 81.5ms	remaining: 5.45s
17:	learn: 14.2209406	total: 86.2ms	remaining: 5.44s
18:	learn: 13.9378978	total: 91ms	remaining: 5.43s
...

In [47]:
search.best_params_

{'learning_rate': 0.35, 'iterations': 1218}

In [48]:
search.best_score_

0.9651000249605092
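
`RandomizedSearchCV` does not try every combination in `parameters`. With its defaults (`n_iter=10`, 5-fold cross-validation) it evaluates only ten sampled settings, and `best_score_` is the mean cross-validated score of the best one. The sketch below only illustrates the sampling idea with the standard library; it does not reproduce scikit-learn's actual random draws, and `sample_candidates` is a hypothetical helper.

```python
import random

param_grid = {
    "iterations": range(1000, 1500),
    "learning_rate": [0.03, 0.05, 0.09, 0.1, 0.25, 0.35, 0.5, 0.6],
}


def sample_candidates(grid, n_iter, seed):
    """Draw n_iter random parameter settings from a grid of discrete choices."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(list(values)) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


# Total number of distinct settings in the grid.
grid_size = 1
for values in param_grid.values():
    grid_size *= len(values)

candidates = sample_candidates(param_grid, n_iter=10, seed=300)
print(grid_size)        # 500 * 8 = 4000 possible settings
print(len(candidates))  # 10 settings sampled
```

Since only a small fraction of the grid is visited, the reported `best_params_` is the best of the sampled settings, not necessarily the best in the whole grid.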

In [23]:
model.fit(train_x, train_y)

0:	learn: 20.1200470	total: 37.7ms	remaining: 43.2s
1:	learn: 17.0498814	total: 43ms	remaining: 24.6s
2:	learn: 14.8895051	total: 48.4ms	remaining: 18.4s
3:	learn: 13.2753053	total: 53.7ms	remaining: 15.4s
4:	learn: 12.0652974	total: 58.8ms	remaining: 13.4s
5:	learn: 11.1970596	total: 64.3ms	remaining: 12.2s
6:	learn: 10.5308426	total: 69.4ms	remaining: 11.3s
7:	learn: 9.9659303	total: 74.4ms	remaining: 10.6s
8:	learn: 9.5195081	total: 79.7ms	remaining: 10.1s
9:	learn: 9.1306804	total: 84.9ms	remaining: 9.65s
10:	learn: 8.7525554	total: 90.1ms	remaining: 9.3s
11:	learn: 8.4311011	total: 95.2ms	remaining: 9.01s
12:	learn: 8.2274808	total: 100ms	remaining: 8.76s
13:	learn: 8.0208751	total: 105ms	remaining: 8.52s
14:	learn: 7.8738689	total: 110ms	remaining: 8.32s
15:	learn: 7.7837482	total: 115ms	remaining: 8.13s
16:	learn: 7.6046079	total: 120ms	remaining: 8s
17:	learn: 7.5184093	total: 126ms	remaining: 7.88s
18:	learn: 7.3719592	total: 132ms	remaining: 7.82s
19:	learn: 7.3117335
...

<catboost.core.CatBoostRegressor at 0x1b161395cc0>

In [25]:
model.score(test_x, test_y)

0.8812645018585319

In [72]:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y, model.predict(test_x))  # signature is (y_true, y_pred)

76.65129140511884
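
Because the target is expressed in $1k, the mean squared error above is in ($1k)². Taking the square root gives an error in the same units as the price. A minimal sketch, where the `mse` and `rmse` helpers are illustrative and not part of the notebook:

```python
import math


def mse(y_true, y_pred):
    """Mean squared error; symmetric in its two arguments."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


reported_mse = 76.65129140511884  # value printed by the cell above
print(round(math.sqrt(reported_mse), 2))  # 8.76, i.e. roughly $8.8k
```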

In [85]:
model.save_model('cat_boost_final.model')

In [88]:
test_d = pd.read_csv("final_test.csv", index_col="ID")
test_d.head()

Unnamed: 0_level_0,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0


In [89]:
predicted_values = model.predict(test_d)

In [98]:
test_d

Unnamed: 0_level_0,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Legislative District,Electric Utility,Address,Unique Model
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
EV28368,0.0,0.0,73,24.0,44.0,165.0,211.0
EV27088,0.0,1.0,0,5.0,67.0,208.0,99.0
EV58989,0.0,0.0,215,25.0,67.0,467.0,254.0
EV6715,0.0,1.0,0,27.0,20.0,481.0,257.0
EV63251,0.0,1.0,0,35.0,66.0,301.0,278.0
...,...,...,...,...,...,...,...
EV16462,1.0,2.0,19,18.0,33.0,47.0,95.0
EV19258,1.0,0.0,35,11.0,67.0,239.0,68.0
EV91826,0.0,1.0,0,41.0,67.0,180.0,279.0
EV5443,0.0,0.0,238,22.0,66.0,665.0,58.0


In [91]:
predicted_values

array([16.91984119, 42.93415945, 68.86985007, ..., 77.91703079,
       20.10635919, 15.2860361 ])

In [127]:
# predicted_values_2 comes from model2.predict(test_d) in cell In [125], which appears further down
predicted_dataframe = pd.DataFrame(data={'ID': test_d.index, 'Expected Price ($1k)': predicted_values_2})

In [128]:
predicted_dataframe.set_index('ID')

Unnamed: 0_level_0,Expected Price ($1k)
ID,Unnamed: 1_level_1
EV28368,16.748642
EV27088,42.275073
EV58989,68.673845
EV6715,63.943561
EV63251,73.082119
...,...
EV16462,18.611776
EV19258,16.229354
EV91826,77.702929
EV5443,20.259924


In [129]:
predicted_dataframe.to_csv('catboost_submission_2.csv', index=False)

In [49]:
model2 = cb.CatBoostRegressor(learning_rate = 0.35, iterations = 1218)

In [50]:
model2.fit(train_x, train_y)

0:	learn: 18.4896010	total: 13.5ms	remaining: 16.5s
1:	learn: 15.0004892	total: 25ms	remaining: 15.2s
2:	learn: 12.9579534	total: 36.2ms	remaining: 14.7s
3:	learn: 11.5473756	total: 46.4ms	remaining: 14.1s
4:	learn: 10.6025210	total: 56.8ms	remaining: 13.8s
5:	learn: 9.9564861	total: 67.3ms	remaining: 13.6s
6:	learn: 9.1755211	total: 77.5ms	remaining: 13.4s
7:	learn: 8.8334253	total: 87.9ms	remaining: 13.3s
8:	learn: 8.4045495	total: 98.4ms	remaining: 13.2s
9:	learn: 8.0708774	total: 109ms	remaining: 13.2s
10:	learn: 7.8566864	total: 119ms	remaining: 13.1s
11:	learn: 7.6238805	total: 129ms	remaining: 13s
12:	learn: 7.5250714	total: 139ms	remaining: 12.8s
13:	learn: 7.3424944	total: 154ms	remaining: 13.2s
14:	learn: 7.1591655	total: 172ms	remaining: 13.8s
15:	learn: 6.9681451	total: 186ms	remaining: 14s
16:	learn: 6.9068882	total: 197ms	remaining: 13.9s
17:	learn: 6.8306017	total: 208ms	remaining: 13.9s
18:	learn: 6.7526660	total: 219ms	remaining: 13.8s
19:	learn: 6.4979869	total: 229ms

<catboost.core.CatBoostRegressor at 0x1b161396740>

In [51]:
model2.score(test_x, test_y)

0.8816023918869802

In [125]:
predicted_values_2 = model2.predict(test_d)

In [126]:
predicted_values_2


array([16.74864206, 42.27507338, 68.67384486, ..., 77.70292887,
       20.25992441, 15.50487048])
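
The held-out scores reported in this notebook are close to one another, and collecting them side by side makes that explicit. In the sketch below the dictionary labels are descriptive names chosen here, the values are copied from the cell outputs, and the comparison assumes the held-out splits are comparable across models.

```python
# Held-out scores copied from the model.score / model2.score outputs.
held_out_scores = {
    "XGBoost model": 0.881998540171633,
    "CatBoost model": 0.8812645018585319,
    "CatBoost model2 (learning_rate=0.35, iterations=1218)": 0.8816023918869802,
}

# Rank from highest to lowest score.
ranked = sorted(held_out_scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")

best_name = ranked[0][0]
spread = ranked[0][1] - ranked[-1][1]
print(f"spread between best and worst: {spread:.4f}")
```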