# Code samples

First, we import the necessary libraries and the `preprocess` function that we have implemented.

In [1]:
import pandas as pd
from preprocess import preprocess

To illustrate how our preprocessing works, let's load the dataset used in our experiment.

In [2]:
df = pd.read_csv("CarsData.csv")
df.head()

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,Manufacturer
0,I10,2017,7495,Manual,11630,Petrol,145,60.1,1.0,hyundi
1,Polo,2017,10989,Manual,9200,Petrol,145,58.9,1.0,volkswagen
2,2 Series,2019,27990,Semi-Auto,1614,Diesel,145,49.6,2.0,BMW
3,Yeti Outdoor,2017,12495,Manual,30960,Diesel,150,62.8,2.0,skoda
4,Fiesta,2017,7999,Manual,19353,Petrol,125,54.3,1.2,ford


Let us see the effect of our `preprocess` function, which transforms the data.

In [3]:
preprocess(df, target="price", verbose=False).head()

Unnamed: 0,componenet0,componenet1,componenet2,componenet3,componenet4,componenet5,componenet6,componenet7,componenet8,componenet9,...,componenet21,componenet22,componenet23,componenet24,componenet25,componenet26,componenet27,componenet28,componenet29,price
0,-0.061325,1.086454,-0.837091,0.603295,0.262565,0.117377,0.128906,-0.016715,0.109262,0.067967,...,-0.038346,-0.049106,-0.145031,-0.043586,0.154055,-0.048407,0.04665,-0.006822,0.019938,7495
1,-0.15047,1.128804,-0.862481,0.475072,0.295706,0.027867,0.665889,0.829599,-0.006955,-0.097599,...,-0.318651,-0.06159,0.088273,0.020066,-0.048955,0.014696,-0.004634,0.018283,-0.005759,10989
2,-1.581833,-0.123695,0.96008,-0.099154,0.239616,-0.220379,-0.185822,0.069912,-0.061182,0.918685,...,-0.105118,-0.013767,0.024417,0.035767,0.224899,0.550826,-0.302162,0.05718,0.146501,27990
3,0.16858,-0.551741,0.395775,0.607504,-0.581399,0.443354,0.092376,0.097957,0.202668,0.032486,...,0.036848,0.035288,0.152585,0.031328,-0.032995,0.013367,-0.01018,0.013057,-0.023202,12495
4,0.092386,0.741502,-0.837758,0.169809,-0.049627,0.276358,-0.579975,0.089769,-0.276826,-0.127186,...,-0.008308,-0.005567,-0.039285,-0.006276,0.005925,-0.001399,-0.006205,-0.004326,0.010764,7999
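
The internals of `preprocess` are not shown here, so the following is only a rough sketch of the kind of pipeline that could produce output of this shape: one-hot encoding of the categorical columns, scaling, PCA, and re-attaching the target. The function name `sketch_preprocess`, the choice of `MinMaxScaler`, and the `n_components` parameter are assumptions made for illustration. The toy rows are a subset of the columns from the `df.head()` output above.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


def sketch_preprocess(df: pd.DataFrame, target: str, n_components: int = 2) -> pd.DataFrame:
    """Hypothetical stand-in for `preprocess`; not the authors' implementation."""
    features = df.drop(columns=[target])
    # One-hot encode categorical columns so that every feature is numeric.
    encoded = pd.get_dummies(features, dtype=float)
    # Scale each column to [0, 1] so that no single feature dominates the PCA.
    scaled = MinMaxScaler().fit_transform(encoded)
    # Project the scaled features onto the leading principal components.
    components = PCA(n_components=n_components).fit_transform(scaled)
    # The column prefix mirrors the spelling used in the output above.
    columns = [f"componenet{i}" for i in range(n_components)]
    result = pd.DataFrame(components, columns=columns, index=df.index)
    # Re-attach the untouched target as the last column.
    result[target] = df[target]
    return result


toy = pd.DataFrame({
    "model": ["I10", "Polo", "2 Series", "Yeti Outdoor", "Fiesta"],
    "year": [2017, 2017, 2019, 2017, 2017],
    "price": [7495, 10989, 27990, 12495, 7999],
    "mileage": [11630, 9200, 1614, 30960, 19353],
    "fuelType": ["Petrol", "Petrol", "Diesel", "Diesel", "Petrol"],
})
out = sketch_preprocess(toy, target="price", n_components=2)
print(out)
```

As in the real output, the target column passes through unchanged while all other columns are replaced by components.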


We can also tweak a few parameters. For example, instead of the default feature extraction method, PCA (Principal Component Analysis), we can use LDA (Linear Discriminant Analysis):

In [4]:
preprocess(df, target="price", feature_extraction_method="LDA", verbose=False).head()

Unnamed: 0,componenet0,componenet1,componenet2,componenet3,componenet4,componenet5,componenet6,componenet7,componenet8,componenet9,...,componenet21,componenet22,componenet23,componenet24,componenet25,componenet26,componenet27,componenet28,componenet29,price
0,4.07843,1.162027,-0.620544,0.102534,0.624363,0.044283,1.633284,-0.180133,1.154254,0.868012,...,0.564748,0.18756,0.194428,-2.02129,-2.442746,-2.180037,0.082778,1.700415,-1.71308,7495
1,1.804505,-0.997714,0.431322,-0.34648,-0.089322,0.032255,-0.463499,-0.119391,-0.574145,-0.558497,...,-0.066888,0.156256,0.070424,0.485914,0.500733,0.813873,-1.690842,1.631866,-0.392219,10989
2,-3.670387,-1.827402,-0.329004,1.22157,0.351202,0.478909,-0.673353,-0.708843,1.712731,-0.42079,...,-0.126226,0.141754,0.304181,0.017803,0.297379,-0.134972,-0.272485,0.097241,-0.56414,27990
3,0.130875,-0.02603,0.256254,0.42446,-0.65594,0.05072,-0.30755,0.064457,-0.004407,-0.731562,...,0.280038,-0.615872,-0.233987,-0.332073,0.025178,0.386584,-0.141617,1.09988,0.699752,12495
4,1.881615,0.048389,0.319601,-0.420374,0.046675,-0.050191,-0.178517,0.094389,-0.241788,0.112766,...,-0.325335,0.263164,0.571403,0.884114,-0.32029,-0.539746,-0.525142,-0.429076,-0.284876,7999
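
LDA is a supervised method: scikit-learn's `LinearDiscriminantAnalysis` expects discrete class labels and yields at most `n_classes - 1` components. How `preprocess` handles the continuous `price` target is not shown here. One possible approach, assumed purely for illustration, is to bin the target into quantile classes first. The data below is synthetic and the names `X`, `price`, and `price_class` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic numeric features and a continuous target (stand-ins for the car data).
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
price = 10_000 + 3_000 * X["f0"] - 2_000 * X["f1"] + rng.normal(scale=500, size=200)

# Assumption: turn the continuous target into 4 quantile-based classes.
price_class = pd.qcut(price, q=4, labels=False)

# With 4 classes, LDA can produce at most 3 components.
lda = LinearDiscriminantAnalysis(n_components=3)
components = lda.fit_transform(X, price_class)
print(components.shape)
```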


Next, let us increase the value of `feature_selection_treshold`. This will result in fewer columns in the final DataFrame:

In [6]:
preprocess(df, target="price", feature_selection_treshold=1000.0, verbose=False).head()

Unnamed: 0,componenet0,componenet1,componenet2,componenet3,componenet4,componenet5,componenet6,componenet7,componenet8,componenet9,...,componenet21,componenet22,componenet23,componenet24,componenet25,componenet26,componenet27,componenet28,componenet29,price
0,-0.581332,-1.299914,0.303025,0.031935,0.106172,0.10515,-0.214675,0.057518,-0.354522,-0.153743,...,0.000942,0.00671,0.031886,0.021367,0.130373,0.102298,0.095423,0.58616,0.323849,7495
1,-0.670748,-1.311547,0.314985,0.113871,0.602877,-0.785689,0.311405,-0.060477,0.069074,-0.220555,...,0.052477,0.12765,0.001709,-0.016905,-0.018741,-0.007202,-0.001215,0.001708,0.000887,10989
2,-1.203972,1.14332,0.001088,-0.060262,0.192244,0.078558,-0.736494,-0.69695,0.102396,-0.100952,...,0.058516,-0.015828,0.084734,0.060644,0.321746,-0.519186,0.012012,0.011036,-0.030893,27990
3,0.436375,0.511731,-0.383083,0.522358,0.247546,0.17858,-0.065219,0.072253,-0.249238,0.125777,...,0.000432,-0.014065,-0.011866,-0.048772,-0.100861,-0.05249,-0.022055,0.007355,0.063711,12495
4,-0.263704,-0.976091,0.112261,0.078665,-0.061263,0.011513,-0.058756,0.058622,-0.245612,-0.132717,...,-0.015842,-0.005316,0.029608,-0.042162,-0.099058,-0.030049,-0.016012,0.002782,0.015006,7999
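
The scoring function behind `feature_selection_treshold` is not shown here. The sketch below assumes, only for illustration, a univariate F-statistic filter built on scikit-learn's `f_regression`, where a feature is kept if its score exceeds the threshold. The helper `select` and the synthetic columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 500

# Synthetic features: one strongly related to the target, two barely or not at all.
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 5.0 * X["strong"] + 0.3 * X["weak"] + rng.normal(size=n)

# F-statistic of each feature against the target.
f_scores, _ = f_regression(X, y)
scores = pd.Series(f_scores, index=X.columns)


def select(threshold: float) -> list:
    """Keep only the features whose score exceeds the threshold (assumed rule)."""
    return list(scores[scores > threshold].index)


print(scores)
print(select(1000.0))
```

Under this reading, raising the threshold can only shrink the set of surviving features, which matches the behaviour described above.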


Increasing the value of the `variance_treshold` parameter has a similar effect:

In [14]:
preprocess(df, target="price", variance_treshold=0.02, verbose=False).head()

Unnamed: 0,componenet0,componenet1,componenet2,componenet3,componenet4,componenet5,componenet6,componenet7,componenet8,componenet9,...,componenet18,componenet19,componenet20,componenet21,componenet22,componenet23,componenet24,componenet25,componenet26,price
0,-0.14376,-1.197589,-0.7895,0.640042,0.139803,0.227712,0.161199,-0.061879,-0.096281,-0.234582,...,0.014345,0.094109,0.056279,-0.038185,0.047426,-0.001758,0.006547,-0.078953,-0.001912,7495
1,-0.233004,-1.246443,-0.816162,0.517579,0.13686,0.208445,0.37855,0.933181,0.188291,0.051168,...,-0.005642,-0.028377,0.050789,-0.290092,-0.1726,-0.002858,-0.050044,0.04464,-0.003367,10989
2,-1.504619,0.343596,1.124916,-0.10828,0.118863,-0.156361,0.451221,-0.121453,-0.75457,0.403208,...,0.020932,-0.040316,-0.233125,-0.123052,0.030225,-0.013286,0.014594,-0.046396,0.001958,27990
3,0.241604,0.655107,0.516039,0.617079,-0.676354,0.444135,0.020167,-0.010677,0.014737,-0.133096,...,0.015914,0.034665,-0.063268,0.096906,-0.084571,0.046863,-0.036875,0.213229,-0.000558,12495
4,0.009715,-0.853076,-0.820826,0.166001,0.040883,0.108627,-0.087063,-0.065556,-0.055231,-0.329621,...,0.013932,-0.054623,0.030938,-0.059205,0.023362,-0.020157,-0.000179,-0.031667,0.001467,7999
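
How `variance_treshold` is applied inside `preprocess` is not shown here. One common reading, assumed in this sketch, is that columns whose variance does not exceed the threshold are dropped. The helper `drop_low_variance` and the toy columns are hypothetical.

```python
import pandas as pd

X = pd.DataFrame({
    "a": [0.0, 1.0, 0.0, 1.0],  # sample variance about 0.333
    "b": [0.5, 0.5, 0.5, 0.6],  # sample variance 0.0025
    "c": [1.0, 1.0, 1.0, 1.0],  # constant column, variance 0
})


def drop_low_variance(frame: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """Keep only the columns whose sample variance exceeds the threshold (assumed rule)."""
    variances = frame.var()
    return frame.loc[:, variances > threshold]


print(list(drop_low_variance(X, 0.0).columns))
print(list(drop_low_variance(X, 0.02).columns))
```

With a threshold of 0.0 only the constant column is removed. Raising it to 0.02 also removes the nearly constant column `b`.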
