# Estimate car price - Load data to SAP HANA
This notebook is part of a Machine Learning project that is described and available to download on 
<BR><a href="https://blogs.sap.com/2019/11/05/hands-on-tutorial-machine-learning-push-down-to-sap-hana-with-python/">https://blogs.sap.com/2019/11/05/hands-on-tutorial-machine-learning-push-down-to-sap-hana-with-python/</a>
<BR><BR>The purpose of this notebook is to load the data to SAP HANA that will be required by this project.

### Steps in this notebook
-  Load local CSV file into pandas data frame
-  Adjust data slightly, filter and translate
-  Save the data frame to SAP HANA as table

### Documentation
-  SAP HANA Python Client API for Machine Learning Algorithms:   
   https://help.sap.com/doc/0172e3957b5946da85d3fde85ee8f33d/latest/en-US/html/hana_ml.html
-  SAP HANA Predictive Analysis Library (PAL):  
   https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/latest/en-US/f652a8186a144e929a1ade7a3cb7abe8.html
-  Dataset: https://www.kaggle.com/bozungu/ebay-used-car-sales-data

### Load data from CSV file into pandas data frame
Begin by loading the historic data from the CSV file into a pandas data frame.

In [1]:
import pandas as pd
df_data = pd.read_csv('autos.csv', encoding = 'Windows-1252')

In [2]:
df_data.head(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
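
The `encoding` argument matters because the file contains German umlauts, for example `3TÜRER` in the output above. The following sketch is not part of the original notebook. It uses a tiny, made-up in-memory sample instead of `autos.csv` to illustrate that the same bytes decode correctly as Windows-1252 but cannot be decoded as UTF-8.

```python
import io
import pandas as pd

# Hypothetical two-column sample, encoded the same way as autos.csv
raw = "name,price\nGOLF_4_1_4__3TÜRER,1500\n".encode("cp1252")

# Reading with the matching encoding preserves the umlaut
sample = pd.read_csv(io.BytesIO(raw), encoding="Windows-1252")
print(sample.loc[0, "name"])

# Decoding the same bytes as UTF-8 raises an error
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```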


### Change column names to upper case
Having all column and table names in upper case simplifies working with the data once it is stored in SAP HANA, because upper-case names do not have to be enclosed in double quotes in SQL statements.

In [3]:
df_data.columns = map(str.upper, df_data.columns)
df_data.head(5)

Unnamed: 0,DATECRAWLED,NAME,SELLER,OFFERTYPE,PRICE,ABTEST,VEHICLETYPE,YEAROFREGISTRATION,GEARBOX,POWERPS,MODEL,KILOMETER,MONTHOFREGISTRATION,FUELTYPE,BRAND,NOTREPAIREDDAMAGE,DATECREATED,NROFPICTURES,POSTALCODE,LASTSEEN
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
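
An equivalent way to upper-case the headers is the vectorised `str.upper()` accessor of the column index, which returns an `Index` directly rather than a `map` iterator. The sketch below uses a small made-up frame (`df_demo`) with three of the original column names instead of the full dataset.

```python
import pandas as pd

# Small illustrative frame with three of the original column names
df_demo = pd.DataFrame({"dateCrawled": ["2016-03-24 11:52:17"],
                        "price": [480],
                        "powerPS": [0]})

# Vectorised upper-casing of all column names
df_demo.columns = df_demo.columns.str.upper()
print(list(df_demo.columns))
```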


### Filter on single car make
Reduce the dataset to Mercedes-Benz cars that are offered for sale by private sellers and that have no unrepaired damage. This focus can help identify patterns in the data. However, it also narrows the scope of the analysis: it will, for instance, no longer be possible to analyse the impact of unrepaired damage on a car's price.

In [4]:
df_data = df_data[df_data['BRAND'] == 'mercedes_benz']    # Keep only Mercedes Benz
df_data = df_data[df_data['OFFERTYPE'] == 'Angebot']      # Keep only cars for sale (excluding adverts for purchasing a car)
df_data = df_data[df_data['SELLER'] == 'privat']          # Keep only sales by private people (excluding commercial offers)
df_data = df_data[df_data['NOTREPAIREDDAMAGE'] == 'nein'] # Keep only cars that have no unrepaired damage
df_data.head(5)

Unnamed: 0,DATECRAWLED,NAME,SELLER,OFFERTYPE,PRICE,ABTEST,VEHICLETYPE,YEAROFREGISTRATION,GEARBOX,POWERPS,MODEL,KILOMETER,MONTHOFREGISTRATION,FUELTYPE,BRAND,NOTREPAIREDDAMAGE,DATECREATED,NROFPICTURES,POSTALCODE,LASTSEEN
19,2016-04-01 22:55:47,Mercedes_Benz_A_160_Classic_Klima,privat,Angebot,1850,test,bus,2004,manuell,102,a_klasse,150000,1,benzin,mercedes_benz,nein,2016-04-01 00:00:00,0,49565,2016-04-05 22:46:05
30,2016-04-03 15:48:11,Mercedes_Benz_E_250_D_Original_Zustand_!!,privat,Angebot,3300,test,limousine,1995,automatik,113,e_klasse,150000,1,diesel,mercedes_benz,nein,2016-04-03 00:00:00,0,53879,2016-04-05 15:16:05
34,2016-03-17 18:55:12,Mercedes_Benz_E_200_CDI_Automatik_Classic,privat,Angebot,3500,control,limousine,2004,automatik,122,e_klasse,150000,11,diesel,mercedes_benz,nein,2016-03-17 00:00:00,0,67071,2016-03-30 15:46:10
39,2016-03-25 15:50:30,Mercedes_Camper_D407,privat,Angebot,1500,test,bus,1984,manuell,70,andere,150000,8,diesel,mercedes_benz,nein,2016-03-25 00:00:00,0,22767,2016-03-27 03:17:02
49,2016-04-04 14:06:22,Mercedes_Benz_B180_Automatik,privat,Angebot,13500,test,bus,2012,automatik,109,b_klasse,150000,7,diesel,mercedes_benz,nein,2016-04-04 00:00:00,0,35576,2016-04-05 12:09:29
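
Note that a comparison such as `== 'nein'` also removes rows in which NOTREPAIREDDAMAGE is missing, because a missing value never equals a string. The sketch below illustrates this on made-up rows (`df_demo`, modelled loosely on the data above) and combines the four conditions into a single boolean mask. It is one possible formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up rows: only the first one satisfies all four conditions
df_demo = pd.DataFrame({
    "BRAND": ["mercedes_benz", "mercedes_benz", "audi", "mercedes_benz"],
    "OFFERTYPE": ["Angebot", "Angebot", "Angebot", "Gesuch"],
    "SELLER": ["privat", "privat", "privat", "privat"],
    "NOTREPAIREDDAMAGE": ["nein", np.nan, "ja", "nein"],
})

# Combine the conditions with & (each condition needs its own parentheses)
mask = (
    (df_demo["BRAND"] == "mercedes_benz")
    & (df_demo["OFFERTYPE"] == "Angebot")
    & (df_demo["SELLER"] == "privat")
    & (df_demo["NOTREPAIREDDAMAGE"] == "nein")
)
df_filtered = df_demo[mask]
print(len(df_filtered))
```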


### Reduce number of columns
Reduce complexity further by dropping columns that are not used in this project. In a real project you would probably not delete all these columns this early, if ever.

In [5]:
df_data = df_data.drop(['NOTREPAIREDDAMAGE',
                        'NAME', 
                        'DATECRAWLED', 
                        'SELLER', 
                        'OFFERTYPE', 
                        'ABTEST', 
                        'BRAND', 
                        'DATECREATED',
                        'NROFPICTURES', 
                        'POSTALCODE', 
                        'LASTSEEN', 
                        'MONTHOFREGISTRATION'],
                       axis = 1)
df_data.head(5)

Unnamed: 0,PRICE,VEHICLETYPE,YEAROFREGISTRATION,GEARBOX,POWERPS,MODEL,KILOMETER,FUELTYPE
19,1850,bus,2004,manuell,102,a_klasse,150000,benzin
30,3300,limousine,1995,automatik,113,e_klasse,150000,diesel
34,3500,limousine,2004,automatik,122,e_klasse,150000,diesel
39,1500,bus,1984,manuell,70,andere,150000,diesel
49,13500,bus,2012,automatik,109,b_klasse,150000,diesel


### Rename columns
Purely for legibility

In [6]:
df_data = df_data.rename(index = str, columns = {'YEAROFREGISTRATION': 'YEAR',
                                                 'POWERPS': 'HP'})
df_data.head(5)

Unnamed: 0,PRICE,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE
19,1850,bus,2004,manuell,102,a_klasse,150000,benzin
30,3300,limousine,1995,automatik,113,e_klasse,150000,diesel
34,3500,limousine,2004,automatik,122,e_klasse,150000,diesel
39,1500,bus,1984,manuell,70,andere,150000,diesel
49,13500,bus,2012,automatik,109,b_klasse,150000,diesel


### Translate column content to English
The original data was scraped from the German-language version of eBay, so several column values are in German.

In [7]:
df_data['MODEL'] = df_data['MODEL'].replace({'a_klasse': 'A-Class',
                                             'b_klasse': 'B-Class',
                                             'c_klasse': 'C-Class',
                                             'e_klasse': 'E-Class',
                                             'g_klasse': 'G-Class',
                                             'm_klasse': 'M-Class',
                                             's_klasse': 'S-Class',                                     
                                             'v_klasse': 'V-Class',                                       
                                             'cl': 'CL',  
                                             'sl': 'SL', 
                                             'gl': 'GL', 
                                             'clk': 'CLK',   
                                             'slk': 'SLK',
                                             'glk': 'GLK',  
                                             'sprinter': 'Sprinter',  
                                             'viano': 'Viano',  
                                             'vito': 'Vito',                                        
                                             'andere': 'Other'                                        
                                             })
df_data['GEARBOX']  = df_data['GEARBOX'].replace({'manuell': 'manual',
                                                  'automatik': 'automatic'})
df_data['FUELTYPE'] = df_data['FUELTYPE'].replace({'benzin': 'petrol'})
df_data.head(5)

Unnamed: 0,PRICE,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE
19,1850,bus,2004,manual,102,A-Class,150000,petrol
30,3300,limousine,1995,automatic,113,E-Class,150000,diesel
34,3500,limousine,2004,automatic,122,E-Class,150000,diesel
39,1500,bus,1984,manual,70,Other,150000,diesel
49,13500,bus,2012,automatic,109,B-Class,150000,diesel
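
`replace` with a dictionary leaves every value that is not a key of the dictionary unchanged, so categories missing from the mapping stay in German. The following sketch uses made-up rows (`df_demo`) and only part of the mapping to list such leftovers for review.

```python
import pandas as pd

# Made-up rows; 'elektro' is deliberately absent from the mapping
df_demo = pd.DataFrame({"GEARBOX": ["manuell", "automatik", "manuell"],
                        "FUELTYPE": ["benzin", "diesel", "elektro"]})

gearbox_map = {"manuell": "manual", "automatik": "automatic"}
fuel_map = {"benzin": "petrol"}

df_demo["GEARBOX"] = df_demo["GEARBOX"].replace(gearbox_map)
df_demo["FUELTYPE"] = df_demo["FUELTYPE"].replace(fuel_map)

# Values that are not a translation result were passed through unchanged
untouched = sorted(set(df_demo["FUELTYPE"]) - set(fuel_map.values()))
print(untouched)
```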


### Add ID column
An identifier column will be needed later on, i.e. when training Machine Learning models.

In [8]:
df_data.insert(0, 'CAR_ID', df_data.reset_index().index)
df_data.head(5)

Unnamed: 0,CAR_ID,PRICE,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE
19,0,1850,bus,2004,manual,102,A-Class,150000,petrol
30,1,3300,limousine,1995,automatic,113,E-Class,150000,diesel
34,2,3500,limousine,2004,automatic,122,E-Class,150000,diesel
39,3,1500,bus,1984,manual,70,Other,150000,diesel
49,4,13500,bus,2012,automatic,109,B-Class,150000,diesel
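
The same consecutive identifier can also be produced with `range(len(...))`, which some readers may find easier to follow than `reset_index().index`. A sketch on a made-up frame (`df_demo`) whose row labels have gaps, as they do after filtering:

```python
import pandas as pd

# Row labels with gaps, as left behind by the filter step
df_demo = pd.DataFrame({"PRICE": [1850, 3300, 3500]}, index=[19, 30, 34])

# Insert a running number 0..n-1 as the first column
df_demo.insert(0, "CAR_ID", range(len(df_demo)))
print(df_demo)
```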


### Move Price column to the end
Moving the column is required by the PAL algorithm that will be used in the next notebook to train a Machine Learning model. The column could also have been moved later on. See the documentation of the Predictive Analysis Library at the top of this notebook for the detailed requirements of the different algorithms.

In [9]:
df_data = df_data[pd.Index.append(df_data.columns.drop("PRICE"), pd.Index(['PRICE']))]
df_data.head(5)

Unnamed: 0,CAR_ID,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE,PRICE
19,0,bus,2004,manual,102,A-Class,150000,petrol,1850
30,1,limousine,1995,automatic,113,E-Class,150000,diesel,3300
34,2,limousine,2004,automatic,122,E-Class,150000,diesel,3500
39,3,bus,1984,manual,70,Other,150000,diesel,1500
49,4,bus,2012,automatic,109,B-Class,150000,diesel,13500
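
The reordering can also be written with a plain list of column names, which avoids the `pd.Index.append` call. A sketch on a made-up frame (`df_demo`):

```python
import pandas as pd

df_demo = pd.DataFrame({"CAR_ID": [0, 1],
                        "PRICE": [1850, 3300],
                        "YEAR": [2004, 1995]})

# All columns except PRICE in their current order, then PRICE last
ordered = [col for col in df_demo.columns if col != "PRICE"] + ["PRICE"]
df_demo = df_demo[ordered]
print(list(df_demo.columns))
```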


### Save the transformed data as CSV file for reference if desired
Having a backup of the transformed data available offline might be convenient. Storing the file is not needed to continue with the notebooks, though. If you do not want the file, comment out the command in the following cell by placing a # in front of it.

In [10]:
df_data.to_csv('usedcarprices.csv')
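
By default `to_csv` also writes the row index as a first column without a header; reading such a file back produces a column called `Unnamed: 0`. If that is not wanted, `index = False` omits it. The sketch writes a made-up frame (`df_demo`) to in-memory buffers instead of a file to compare the two header lines.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"CAR_ID": [0, 1], "PRICE": [1850, 3300]},
                       index=["19", "30"])

# Default: the index is written as a leading column with an empty header
with_index = io.StringIO()
df_demo.to_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```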

### Create table with unlabelled observations for inference
Create another table to store cars for which the price is to be predicted. Use the same table structure as for the historic data above. As the price is unknown for these cars and needs to be predicted, the PRICE column has to be removed.

In [11]:
df_topredict = pd.DataFrame(data = None, 
                            columns = df_data.columns.drop("PRICE"))
for xx in df_topredict.columns:
    df_topredict[xx] = df_topredict[xx].astype(df_data[xx].dtypes.name)
df_topredict.dtypes

CAR_ID          int64
VEHICLETYPE    object
YEAR            int64
GEARBOX        object
HP              int64
MODEL          object
KILOMETER       int64
FUELTYPE       object
dtype: object
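
An alternative sketch for obtaining an empty frame with the same columns and types is to slice zero rows from the historic frame. `iloc[0:0]` keeps the column dtypes, so no per-column cast is needed. The frame `df_hist` below is a made-up stand-in for `df_data`.

```python
import pandas as pd

# Made-up stand-in for the historic data
df_hist = pd.DataFrame({"CAR_ID": [0, 1],
                        "VEHICLETYPE": ["bus", "limousine"],
                        "YEAR": [2004, 1995],
                        "PRICE": [1850, 3300]})

# Drop the target column, then keep zero rows; dtypes are preserved
df_empty = df_hist.drop(columns="PRICE").iloc[0:0]
print(df_empty.dtypes)
```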

Add two vehicles whose prices have to be predicted. These imaginary cars are identical apart from their mileage: the second car has been driven 100,000 kilometers more than the first. We shall see how this additional mileage affects the estimated price.

In [12]:
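# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df_topredict = pd.concat([df_topredict,
                          pd.DataFrame([{'CAR_ID': 1, 
                                         'VEHICLETYPE': 'coupe', 
                                         'YEAR': 2006, 
                                         'GEARBOX': 'manual', 
                                         'HP': 231, 'MODEL': 'CLK', 
                                         'KILOMETER': 50000, 
                                         'FUELTYPE': 'petrol'}])],
                         ignore_index = True)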

In [13]:
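# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df_topredict = pd.concat([df_topredict,
                          pd.DataFrame([{'CAR_ID': 2, 
                                         'VEHICLETYPE': 'coupe', 
                                         'YEAR': 2006, 
                                         'GEARBOX': 'manual', 
                                         'HP': 231, 'MODEL': 'CLK', 
                                         'KILOMETER': 150000, 
                                         'FUELTYPE': 'petrol'}])],
                         ignore_index = True)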

In [14]:
df_topredict

Unnamed: 0,CAR_ID,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE
0,1,coupe,2006,manual,231,CLK,50000,petrol
1,2,coupe,2006,manual,231,CLK,150000,petrol
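
Both cars can also be created in one step by building a frame from a list of dictionaries. The sketch below is an alternative formulation, not the notebook's own code; `df_two_cars` is a hypothetical name used only here.

```python
import pandas as pd

# The two imaginary cars from above, differing only in CAR_ID and KILOMETER
cars = [
    {"CAR_ID": 1, "VEHICLETYPE": "coupe", "YEAR": 2006, "GEARBOX": "manual",
     "HP": 231, "MODEL": "CLK", "KILOMETER": 50000, "FUELTYPE": "petrol"},
    {"CAR_ID": 2, "VEHICLETYPE": "coupe", "YEAR": 2006, "GEARBOX": "manual",
     "HP": 231, "MODEL": "CLK", "KILOMETER": 150000, "FUELTYPE": "petrol"},
]
df_two_cars = pd.DataFrame(cars)

# Difference in mileage between the second and the first car
mileage_gap = df_two_cars.loc[1, "KILOMETER"] - df_two_cars.loc[0, "KILOMETER"]
print(mileage_gap)
```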


### Load data into SAP HANA
Load the pandas data frames to SAP HANA as new tables. This functionality is available from hana_ml version 1.0.7 onwards.
<BR>Verify the version you have installed.

In [15]:
import hana_ml
print(hana_ml.__version__)

2.6.20120900
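
The version check can also be done programmatically. The helper `version_tuple` below is a hypothetical illustration and assumes that every part of the version string is purely numeric, as in the output above. In the notebook, `installed` would be taken from `hana_ml.__version__`; here the printed value is hard-coded so that the sketch runs without the library.

```python
def version_tuple(version):
    """Turn a dotted version string such as '2.6.20120900' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

installed = "2.6.20120900"  # value printed above; normally hana_ml.__version__
required = "1.0.7"          # minimum version named in the text

# Tuples compare element by element, so (2, 6, ...) >= (1, 0, 7)
is_recent_enough = version_tuple(installed) >= version_tuple(required)
print(is_recent_enough)
```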


Instantiate a connection object to SAP HANA. We recommend keeping the credentials in the Secure User Store of the SAP HANA Client. Retrieving the credentials from the Secure User Store avoids having to specify them in clear text. See the blog on the SAP Community to which these notebooks belong for steps on how to use the Secure User Store.

In [16]:
import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(userkey = 'hana_hxe', encrypt = 'true', sslValidateCertificate = 'false')

Load the historic data into SAP HANA.

In [17]:
df_data_forhana = df_data.copy(deep = True)
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, 
                                                   pandas_df = df_data_forhana, 
                                                   table_name = 'USEDCARPRICES',
                                                   force = True,
                                                   replace = False)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.51s/it]


See the column types of the new SAP HANA table.

In [18]:
df_remote.dtypes()

[('CAR_ID', 'INT', 10, 10, 10, 0),
 ('VEHICLETYPE', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('YEAR', 'INT', 10, 10, 10, 0),
 ('GEARBOX', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('HP', 'INT', 10, 10, 10, 0),
 ('MODEL', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('KILOMETER', 'INT', 10, 10, 10, 0),
 ('FUELTYPE', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('PRICE', 'INT', 10, 10, 10, 0)]

Retrieve a small number of rows from the SAP HANA table for verification.

In [19]:
df_remote.head(5).collect()

Unnamed: 0,CAR_ID,VEHICLETYPE,YEAR,GEARBOX,HP,MODEL,KILOMETER,FUELTYPE,PRICE
0,0,bus,2004,manual,102,A-Class,150000,petrol,1850
1,1,limousine,1995,automatic,113,E-Class,150000,diesel,3300
2,2,limousine,2004,automatic,122,E-Class,150000,diesel,3500
3,3,bus,1984,manual,70,Other,150000,diesel,1500
4,4,bus,2012,automatic,109,B-Class,150000,diesel,13500


Similarly, load the data frame with the two vehicles whose prices are to be predicted into a separate SAP HANA table. If the message 'invalid table name' comes up on a red background, it can be ignored.

In [20]:
df_topredict_forhana = df_topredict.copy(deep = True)
df_topredict_remote = dataframe.create_dataframe_from_pandas(connection_context = conn,
                                                             pandas_df = df_topredict_forhana,
                                                             table_name = 'USEDCARPRICES_TOPREDICT',
                                                             force = True,
                                                             replace = False)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.52it/s]


See the column types of the SAP HANA table.

In [21]:
df_topredict_remote.dtypes()

[('CAR_ID', 'INT', 10, 10, 10, 0),
 ('VEHICLETYPE', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('YEAR', 'INT', 10, 10, 10, 0),
 ('GEARBOX', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('HP', 'INT', 10, 10, 10, 0),
 ('MODEL', 'NVARCHAR', 5000, 5000, 5000, 0),
 ('KILOMETER', 'INT', 10, 10, 10, 0),
 ('FUELTYPE', 'NVARCHAR', 5000, 5000, 5000, 0)]

All necessary data has now been loaded to SAP HANA. Continue with the next notebook, "05 Introduction".