# Data pipeline

A data pipeline is a means of moving data from one place (the source) to a destination (such as a data warehouse). Along the way, data is transformed and optimized, arriving in a state that can be analyzed and used to develop business insights.

![Picture title](image-20220525-074202.png)

![Picture title](image-20220525-074340.png)

In [None]:
import pandas as pd
import joblib

from sklearn.impute import KNNImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer

## *Load data*

In [None]:
data = pd.read_csv('motorcycle.csv')
data.head()



Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
0,acabion,da vinci 650-vi,2011,Prototype / concept model,3.2,,804.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
1,acabion,gtbo 55,2007,Sport,2.6,1300.0,541.0,420.0,In-line four,four-stroke,...,360.0,,,,,,,,,
2,acabion,gtbo 600 daytona-vi,2011,Prototype / concept model,3.5,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
3,acabion,gtbo 600 daytona-vi,2021,Prototype / concept model,,,536.0,,Electric,Electric,...,420.0,,,Single disc,Single disc,,,,,
4,acabion,gtbo 70,2007,Prototype / concept model,3.1,1300.0,689.0,490.0,In-line four,four-stroke,...,300.0,,,,,,,,,Custom made.


***Data description:***
<br>

1.  **Brand**  - brand name of the motorcycle
2.  **Model**  - model name of the motorcycle
3.  **Year**  - year the motorcycle was built
4.  **Category**  - sub-class the motorcycle belongs to in the market (style of motorcycle)
5.  **Rating**  - review average out of 5 stars
6.  **Displacement (ccm)**  - engine size of the motorcycle in cubic centimeters (ccm)
7.  **Power (hp)**  - max power output in horsepower (hp) and kilowatt (kW) along with peak power rpm
8.  **Torque (Nm)**  - max torque in newton-meters (Nm) and foot-pounds (ft-lbs) along with peak torque rpm
9.  **Engine cylinder**  - number of cylinders in the engine as well as configuration
10.  **Engine stroke**  - number of stages to complete one power stroke of the engine
11.  **Gearbox**  - number of gears in transmission
12.  **Bore (mm)**  - diameter of each cylinder in millimeters (mm) and inches (in)
13.  **Stroke (mm)**  - distance within the cylinder a piston travels in millimeters (mm) and inches (in)
14.  **Transmission type**  - type of transmission of the motorcycle
15.  **Front brakes**  - type of front brake
16.  **Rear brakes**  - type of rear brake
17.  **Front tire**  - front tire size
18.  **Rear tire**  - rear tire size
19.  **Front suspension**  - front suspension type and configuration
20.  **Rear suspension**  - rear suspension type and configuration
21.  **Dry weight (kg)**  - weight of the motorcycle, without any fluids, in kilograms (kg) and pounds (lbs)
22.  **Wheelbase (mm)**  - distance between the points where the front and rear wheels touch the ground in millimeters (mm)
23.  **Fuel capacity (lts)**  - maximum capacity of fuel tank in liters (lts)
24.  **Fuel system**  - fuel delivery system into engine
25.  **Fuel control**  - valve configuration of the engine
26.  **Seat height (mm)**  - height from bottom of seat to the ground in millimeters (mm)
27.  **Cooling system**  - engine cooling system
28.  **Color options**  - different color options of the motorcycle model for that particular year

Column dtypes: float64(9), int64(1), object(18)






In [None]:
data.shape

(38472, 28)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38472 entries, 0 to 38471
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                38472 non-null  object 
 1   Model                38444 non-null  object 
 2   Year                 38472 non-null  int64  
 3   Category             38472 non-null  object 
 4   Rating               21788 non-null  float64
 5   Displacement (ccm)   37461 non-null  float64
 6   Power (hp)           26110 non-null  float64
 7   Torque (Nm)          16634 non-null  float64
 8   Engine cylinder      38461 non-null  object 
 9   Engine stroke        38461 non-null  object 
 10  Gearbox              32675 non-null  object 
 11  Bore (mm)            28689 non-null  float64
 12  Stroke (mm)          28689 non-null  object 
 13  Fuel capacity (lts)  31704 non-null  float64
 14  Fuel system          27844 non-null  object 
 15  Fuel control         22008 non-null 

In [None]:
data.describe()

Unnamed: 0,Year,Rating,Displacement (ccm),Power (hp),Torque (Nm),Bore (mm),Fuel capacity (lts),Dry weight (kg),Wheelbase (mm),Seat height (mm)
count,38472.0,21788.0,37461.0,26110.0,16634.0,28689.0,31704.0,22483.0,25493.0,24182.0
mean,2003.195883,3.401574,552.515072,50.77604,64.527173,72.596713,13.286191,164.151532,1423.113521,789.253246
std,20.083372,0.355631,545.394956,52.082094,63.884654,18.758621,6.01067,85.085133,172.645438,105.492167
min,1894.0,1.4,25.0,0.3,1.5,1.0,0.5,15.1,725.0,39.0
25%,2000.0,3.2,125.0,12.0,12.2,57.0,8.2,105.0,1321.0,743.0
50%,2010.0,3.4,397.2,30.0,57.0,73.0,13.5,145.0,1422.0,790.0
75%,2016.0,3.7,805.0,77.0,102.0,88.0,17.5,199.6,1500.0,830.0
max,2022.0,4.6,8277.0,804.0,712.0,176.0,64.34,1000.0,3327.0,7501.0


## *Feature engineering*

### Filling missing values (KNN imputer)

In [None]:
data.isna().sum()

Brand                      0
Model                     28
Year                       0
Category                   0
Rating                 16684
Displacement (ccm)      1011
Power (hp)             12362
Torque (Nm)            21838
Engine cylinder           11
Engine stroke             11
Gearbox                 5797
Bore (mm)               9783
Stroke (mm)             9783
Fuel capacity (lts)     6768
Fuel system            10628
Fuel control           16464
Cooling system          4214
Transmission type       5611
Dry weight (kg)        15989
Wheelbase (mm)         12979
Seat height (mm)       14290
Front brakes            1583
Rear brakes             1776
Front tire              6490
Rear tire               6464
Front suspension       12363
Rear suspension        12847
Color options          14144
dtype: int64
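
Raw counts are easier to compare as fractions of the total row count; for instance, `Torque (Nm)` is missing in 21838 of 38472 rows, roughly 57%. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from `motorcycle.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the motorcycle data.
toy = pd.DataFrame({
    "Brand": ["a", "b", "c", "d"],
    "Rating": [3.2, np.nan, np.nan, 3.5],
    "Torque (Nm)": [np.nan, 420.0, np.nan, np.nan],
})

# isna() marks missing cells; mean() over booleans gives the missing fraction per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression, `data.isna().mean()`, should work on the full frame.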

In [None]:
# Count numeric columns with the public API instead of the private _get_numeric_data()
len(data.select_dtypes(include='number').columns)

10

In [None]:
data.select_dtypes(include=['object'])

Unnamed: 0,Brand,Model,Category,Engine cylinder,Engine stroke,Gearbox,Stroke (mm),Fuel system,Fuel control,Cooling system,Transmission type,Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
0,acabion,da vinci 650-vi,Prototype / concept model,Electric,Electric,,,,,Liquid,Chain,Single disc,Single disc,,,,,
1,acabion,gtbo 55,Sport,In-line four,four-stroke,6-speed,63.0,Turbo. KKK Acabion Extended,,Liquid,,,,,,,,
2,acabion,gtbo 600 daytona-vi,Prototype / concept model,Electric,Electric,,,,,Liquid,,Single disc,Single disc,,,,,
3,acabion,gtbo 600 daytona-vi,Prototype / concept model,Electric,Electric,,,,,Liquid,,Single disc,Single disc,,,,,
4,acabion,gtbo 70,Prototype / concept model,In-line four,four-stroke,6-speed,63.0,Turbo. KKK Acabion Extended,,Liquid,,,,,,,,Custom made.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38467,zündapp,z 22,Sport,Single cylinder,two-stroke,,70.0,Carburettor,,Air,Belt,,,2.25-24,2.25-24,Druid fork,Rigid,
38468,zündapp,z 249,Sport,Single cylinder,two-stroke,3-speed,82.5,Carburettor,,Air,Belt,Expanding brake (drum brake),Expanding brake (drum brake),2.25-24,2.25-24,Druid fork,Rigid,
38469,zündapp,z 249,Sport,Single cylinder,two-stroke,3-speed,82.5,Carburettor,,Air,Belt,Expanding brake (drum brake),Expanding brake (drum brake),2.25-24,2.25-24,Druid fork,Rigid,
38470,zündapp,z 300,Sport,Single cylinder,two-stroke,,82.5,Carburettor,Overhead Valves (OHV),Air,Chain,Expanding brake (drum brake),Expanding brake (drum brake),2.85-26,2.85-26,,,


In [None]:
data.tail()

Unnamed: 0,Brand,Model,Year,Category,Rating,Displacement (ccm),Power (hp),Torque (Nm),Engine cylinder,Engine stroke,...,Dry weight (kg),Wheelbase (mm),Seat height (mm),Front brakes,Rear brakes,Front tire,Rear tire,Front suspension,Rear suspension,Color options
38467,zündapp,z 22,1924,Sport,,211.0,2.3,,Single cylinder,two-stroke,...,,,,,,2.25-24,2.25-24,Druid fork,Rigid,
38468,zündapp,z 249,1923,Sport,,249.0,2.8,,Single cylinder,two-stroke,...,76.0,,,Expanding brake (drum brake),Expanding brake (drum brake),2.25-24,2.25-24,Druid fork,Rigid,
38469,zündapp,z 249,1924,Sport,,249.0,2.8,,Single cylinder,two-stroke,...,76.0,,,Expanding brake (drum brake),Expanding brake (drum brake),2.25-24,2.25-24,Druid fork,Rigid,
38470,zündapp,z 300,1928,Sport,,298.0,26.0,,Single cylinder,two-stroke,...,105.0,,,Expanding brake (drum brake),Expanding brake (drum brake),2.85-26,2.85-26,,,
38471,zündapp,z 300,1929,Sport,,298.0,26.0,,Single cylinder,two-stroke,...,105.0,,,Expanding brake (drum brake),Expanding brake (drum brake),2.85-26,2.85-26,,,


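A column with dtype `object` can hold strings, integers, floats, or a mixture of them, so string operations on it are not reliably safe. The next cell therefore casts every `object` column to `str`. Note that in this run the cast also turns missing values into the literal string `'nan'`, which is why every categorical column reports 38472 non-null entries afterwards.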
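
As an illustration of the mixed-type problem, the sketch below builds a hypothetical `object` column (`stroke` is a made-up name and the values are invented) and converts only its non-missing entries to strings, so that missing values stay missing. This is one possible variant under that assumption, not the approach taken in the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing a float, a string and a missing value.
stroke = pd.Series([63.0, "82.5", np.nan], dtype="object")

# Inspect the Python types actually stored in the column.
print(stroke.map(type).tolist())

# Convert only the non-missing entries to str; na_action="ignore" leaves NaN untouched.
stroke_str = stroke.map(str, na_action="ignore")
print(stroke_str.isna().sum())
```

Keeping the missing entries as NaN means a later imputation step can still see them.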

In [None]:
cat_cols = data.select_dtypes(include=['object']).columns
data[cat_cols] = data[cat_cols].astype('str')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38472 entries, 0 to 38471
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                38472 non-null  object 
 1   Model                38472 non-null  object 
 2   Year                 38472 non-null  int64  
 3   Category             38472 non-null  object 
 4   Rating               21788 non-null  float64
 5   Displacement (ccm)   37461 non-null  float64
 6   Power (hp)           26110 non-null  float64
 7   Torque (Nm)          16634 non-null  float64
 8   Engine cylinder      38472 non-null  object 
 9   Engine stroke        38472 non-null  object 
 10  Gearbox              38472 non-null  object 
 11  Bore (mm)            28689 non-null  float64
 12  Stroke (mm)          38472 non-null  object 
 13  Fuel capacity (lts)  31704 non-null  float64
 14  Fuel system          38472 non-null  object 
 15  Fuel control         38472 non-null 

In [None]:
cat_cols

Index(['Brand', 'Model', 'Category', 'Engine cylinder', 'Engine stroke',
       'Gearbox', 'Stroke (mm)', 'Fuel system', 'Fuel control',
       'Cooling system', 'Transmission type', 'Front brakes', 'Rear brakes',
       'Front tire', 'Rear tire', 'Front suspension', 'Rear suspension',
       'Color options'],
      dtype='object')

In [None]:
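# Ordinal-encode the categorical columns; all other columns pass through unchanged
trans = [('categorical_transformer', OrdinalEncoder(), cat_cols)]
col_trans = ColumnTransformer(transformers=trans, remainder='passthrough')

# Second encoder, applied later in the pipeline to every column of col_trans' output
encoder = OrdinalEncoder()

# Fill each missing value with the mean of its 3 nearest neighbours
imputer = KNNImputer(n_neighbors=3)

# Rescale every feature to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))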

array([[0.00000000e+00, 4.21400000e+03, 9.00000000e+00, ...,
        1.21200000e+03, 2.44666667e+02, 1.82333333e+02],
       [0.00000000e+00, 7.22700000e+03, 1.20000000e+01, ...,
        1.12600000e+03, 4.79666667e+02, 1.76666667e+02],
       [0.00000000e+00, 7.22800000e+03, 9.00000000e+00, ...,
        1.21200000e+03, 3.54666667e+02, 2.12000000e+02],
       ...,
       [5.75000000e+02, 1.74300000e+04, 1.20000000e+01, ...,
        1.87000000e+02, 1.71333333e+02, 1.38000000e+02],
       [5.75000000e+02, 1.74380000e+04, 1.20000000e+01, ...,
        3.47000000e+02, 3.06000000e+02, 2.33333333e+02],
       [5.75000000e+02, 1.74380000e+04, 1.20000000e+01, ...,
        3.47000000e+02, 3.06000000e+02, 2.33333333e+02]])
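
The column order of a `ColumnTransformer` result can be surprising: the transformed columns come first, followed by the passed-through remainder. A small sketch on a hypothetical three-row frame (`toy` and `ct` are illustrative names, the values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; "Brand" is deliberately not the first column.
toy = pd.DataFrame({
    "Year": [2011, 2007, 2021],
    "Brand": ["acabion", "zündapp", "acabion"],
    "Power (hp)": [804.0, 541.0, np.nan],
})

ct = ColumnTransformer(
    transformers=[("categorical_transformer", OrdinalEncoder(), ["Brand"])],
    remainder="passthrough",
)

# Result columns: encoded Brand first, then Year and Power (hp) in their original order.
out = ct.fit_transform(toy)
print(out)
```

The missing `Power (hp)` value is passed through as NaN, so it is still available for imputation afterwards.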

## ***Make a pipeline***

*Fitting may take 5-15 minutes.*

In [None]:
pipeline = make_pipeline(col_trans, encoder, imputer, scaler)
pipeline.fit_transform(data)
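
For comparison, here is a sketch of a variant that encodes only the categorical column and leaves numeric values as they are before imputing and scaling. It runs on a hypothetical four-row frame (`toy`, `toy_pipeline` and the values are assumptions for illustration) and is not the pipeline fitted above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy frame with one categorical column and one missing numeric value.
toy = pd.DataFrame({
    "Brand": ["a", "a", "b", "b"],
    "Displacement (ccm)": [125.0, 250.0, 500.0, 1000.0],
    "Power (hp)": [10.0, np.nan, 50.0, 100.0],
})

toy_pipeline = make_pipeline(
    # Encode only the categorical column; numeric columns pass through as-is.
    ColumnTransformer([("cat", OrdinalEncoder(), ["Brand"])], remainder="passthrough"),
    # Replace the missing value with the mean of its 2 nearest neighbours.
    KNNImputer(n_neighbors=2),
    # Rescale every column to [0, 1].
    MinMaxScaler(),
)

result = toy_pipeline.fit_transform(toy)
print(result)
```

Once fitted, `toy_pipeline.transform(...)` applies the same encoding, imputation and scaling to new rows with the same columns.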

## ***Save and load the pipeline***

*joblib is typically more efficient than pickle for objects that carry large NumPy arrays, such as fitted scikit-learn estimators.*

In [None]:
# joblib has no save(); dump() writes the fitted pipeline to disk
joblib.dump(pipeline, 'pipeline.joblib')

In [None]:
pipeline = joblib.load('pipeline.joblib')
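
A quick way to check that persistence worked is to compare the output of the original and the reloaded object. The sketch below does this for a small hypothetical scaler rather than the full pipeline (`X`, `scaler` and the file name `scaler.joblib` are illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tiny dataset and a fitted transformer to persist.
X = np.array([[1.0], [3.0], [5.0]])
scaler = MinMaxScaler().fit(X)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.joblib")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The reloaded object should transform the data exactly like the original.
same = np.allclose(scaler.transform(X), restored.transform(X))
print(same)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.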
