# Code samples

First, we import pandas and the `preprocess` function that we implemented.

In [2]:
import pandas as pd
from preprocess import preprocess

To illustrate how our preprocessing works, let's load the dataset and look at its first rows.

In [3]:
df = pd.read_csv("CarsData.csv")
df.head()

| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize | Manufacturer |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I10 | 2017 | 7495 | Manual | 11630 | Petrol | 145 | 60.1 | 1.0 | hyundi |
| 1 | Polo | 2017 | 10989 | Manual | 9200 | Petrol | 145 | 58.9 | 1.0 | volkswagen |
| 2 | 2 Series | 2019 | 27990 | Semi-Auto | 1614 | Diesel | 145 | 49.6 | 2.0 | BMW |
| 3 | Yeti Outdoor | 2017 | 12495 | Manual | 30960 | Diesel | 150 | 62.8 | 2.0 | skoda |
| 4 | Fiesta | 2017 | 7999 | Manual | 19353 | Petrol | 125 | 54.3 | 1.2 | ford |


Let's see how our `preprocess` function transforms the data.

In [4]:
preprocess(df, target="price", verbose=False).head()

| | componenet0 | componenet1 | componenet2 | componenet3 | componenet4 | componenet5 | componenet6 | componenet7 | componenet8 | componenet9 | ... | componenet29 | componenet30 | componenet31 | componenet32 | componenet33 | componenet34 | componenet35 | componenet36 | componenet37 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.061325 | 1.086454 | -0.837091 | 0.603295 | 0.262565 | 0.117377 | 0.128906 | -0.016715 | 0.109262 | 0.067967 | ... | 0.019938 | 0.018612 | -0.001738 | 0.021694 | 0.632401 | 0.278334 | -0.019606 | -0.043779 | 0.014855 | 7495 |
| 1 | -0.15047 | 1.128804 | -0.862481 | 0.475072 | 0.295706 | 0.027867 | 0.665889 | 0.829599 | -0.006955 | -0.097599 | ... | -0.005759 | 0.005499 | 0.131387 | 0.006647 | -0.00123 | 0.012497 | 0.000178 | -0.000314 | 0.002219 | 10989 |
| 2 | -1.581833 | -0.123695 | 0.96008 | -0.099154 | 0.239616 | -0.220379 | -0.185822 | 0.069912 | -0.061182 | 0.918685 | ... | 0.146501 | -0.583324 | 0.02074 | -0.006295 | 0.015868 | 0.011957 | -0.236977 | 0.04994 | 0.014723 | 27990 |
| 3 | 0.16858 | -0.551741 | 0.395775 | 0.607504 | -0.581399 | 0.443354 | 0.092376 | 0.097957 | 0.202668 | 0.032486 | ... | -0.023202 | -0.007319 | -0.027988 | 0.020799 | -0.001232 | -0.017309 | 0.068689 | 0.071858 | 0.278447 | 12495 |
| 4 | 0.092386 | 0.741502 | -0.837758 | 0.169809 | -0.049627 | 0.276358 | -0.579975 | 0.089769 | -0.276826 | -0.127186 | ... | 0.010764 | 0.007696 | -0.001379 | -0.00043 | -0.001072 | -0.007729 | 0.002192 | 0.000616 | -0.008122 | 7999 |
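
The internals of `preprocess` are not shown in this notebook. To give a rough idea of what a PCA-based transformation of such a table can look like, here is a minimal sketch built on scikit-learn. The helper `pca_sketch`, its one-hot encoding and scaling steps, and the `component{i}` column names are assumptions made for illustration; they are not necessarily what `preprocess` does.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_sketch(df, target, n_components=None):
    """Hypothetical stand-in for a PCA-based preprocessing step."""
    features = df.drop(columns=[target])
    # One-hot encode the categorical columns so that every feature is numeric.
    encoded = pd.get_dummies(features, dtype=float)
    # Standardize so that large-valued columns (e.g. mileage) do not dominate.
    scaled = StandardScaler().fit_transform(encoded)
    # Project the standardized features onto the principal components.
    components = PCA(n_components=n_components).fit_transform(scaled)
    out = pd.DataFrame(
        components,
        columns=[f"component{i}" for i in range(components.shape[1])],
        index=df.index,
    )
    # Keep the target untouched and append it as the last column.
    out[target] = df[target]
    return out


# The five rows shown in the dataset preview above.
sample = pd.DataFrame({
    "model": ["I10", "Polo", "2 Series", "Yeti Outdoor", "Fiesta"],
    "year": [2017, 2017, 2019, 2017, 2017],
    "price": [7495, 10989, 27990, 12495, 7999],
    "transmission": ["Manual", "Manual", "Semi-Auto", "Manual", "Manual"],
    "mileage": [11630, 9200, 1614, 30960, 19353],
    "fuelType": ["Petrol", "Petrol", "Diesel", "Diesel", "Petrol"],
    "tax": [145, 145, 145, 150, 125],
    "mpg": [60.1, 58.9, 49.6, 62.8, 54.3],
    "engineSize": [1.0, 1.0, 2.0, 2.0, 1.2],
    "Manufacturer": ["hyundi", "volkswagen", "BMW", "skoda", "ford"],
})

result = pca_sketch(sample, target="price", n_components=2)
print(result)
```

As in the output above, the original columns are replaced by component columns and the target is kept as the last column. The actual values differ, since the sketch is fitted on five rows only.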


We can also tweak a few parameters. For example, instead of the default feature extraction method, PCA (Principal Component Analysis), we can use LDA (Linear Discriminant Analysis):

In [5]:
preprocess(df, target="price", feature_extraction_method="LDA", verbose=False).head()

| | componenet0 | componenet1 | componenet2 | componenet3 | componenet4 | componenet5 | componenet6 | componenet7 | componenet8 | componenet9 | ... | componenet29 | componenet30 | componenet31 | componenet32 | componenet33 | componenet34 | componenet35 | componenet36 | componenet37 | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.07843 | 1.162027 | -0.620544 | 0.102534 | 0.624363 | 0.044283 | 1.633284 | -0.180133 | 1.154254 | 0.868012 | ... | -1.71308 | 1.886796 | -6.55596 | 2.035413 | -0.174779 | 0.377262 | 0.27924 | 0.214726 | 0.131374 | 7495 |
| 1 | 1.804505 | -0.997714 | 0.431322 | -0.34648 | -0.089322 | 0.032255 | -0.463499 | -0.119391 | -0.574145 | -0.558497 | ... | -0.392219 | 0.092754 | -1.259338 | -1.706386 | -1.874932 | -0.169326 | 0.950268 | -1.22366 | -0.325899 | 10989 |
| 2 | -3.670387 | -1.827402 | -0.329004 | 1.22157 | 0.351202 | 0.478909 | -0.673353 | -0.708843 | 1.712731 | -0.42079 | ... | -0.56414 | -0.212347 | -0.155665 | 0.010871 | 0.324179 | 0.04912 | 0.122903 | 0.04034 | 0.04342 | 27990 |
| 3 | 0.130875 | -0.02603 | 0.256254 | 0.42446 | -0.65594 | 0.05072 | -0.30755 | 0.064457 | -0.004407 | -0.731562 | ... | 0.699752 | -0.207406 | -0.16563 | -1.460313 | 1.354471 | -3.865342 | -2.303212 | -0.470219 | -0.875656 | 12495 |
| 4 | 1.881615 | 0.048389 | 0.319601 | -0.420374 | 0.046675 | -0.050191 | -0.178517 | 0.094389 | -0.241788 | 0.112766 | ... | -0.284876 | 0.305662 | -0.03674 | 0.136912 | -0.208775 | 0.458313 | -1.667743 | 0.171841 | 0.056596 | 7999 |
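
Unlike PCA, LDA is supervised: it uses the labels to find directions that separate the classes. The sketch below shows the basic scikit-learn call on synthetic data. It assumes an already discrete label; how `preprocess` passes a numeric target such as `price` to LDA is not shown in this notebook, and the data, variable names, and column names here are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data (not from CarsData.csv): 60 rows, 4 features, 3 classes.
labels = np.repeat([0, 1, 2], 20)
# Shift every feature by the class index so that the classes are separable.
X = rng.normal(size=(60, 4)) + labels[:, None]

# LDA yields at most min(n_classes - 1, n_features) components, here 2.
lda = LinearDiscriminantAnalysis(n_components=2)
projected = lda.fit_transform(X, labels)

lda_result = pd.DataFrame(projected, columns=["component0", "component1"])
lda_result["label"] = labels
print(lda_result.head())
```

Note the constraint in the comment: the number of LDA components is bounded by both the number of classes minus one and the number of features, whereas PCA is bounded only by the number of samples and features.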
