# Combine sequential operations
> Chain preprocessing steps and a model into a single pipeline that can be used like any other classifier or regressor

- toc: true
- badges: false
- comments: true
- author: Cécile Gallioz
- categories: [sklearn]

# Preparation

In [1]:
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

In [2]:
# Render estimators as a diagram in the notebook
from sklearn import set_config
set_config(display='diagram')

In [3]:
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

## The target

In [4]:
target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()

Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64

In [5]:
target.value_counts(normalize=True)

Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64

## Continuation of preparation

In [6]:
data = myDataFrame.drop(columns=target_column)
data.columns

Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')

In [7]:
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]

In [8]:
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric,
    target,
    # random_state=42,  # uncomment for a reproducible split
    test_size=0.25)

In [9]:
data_train.describe()

| | Culmen Length (mm) | Culmen Depth (mm) |
|---|---|---|
| count | 256.0 | 256.0 |
| mean | 43.697266 | 17.251953 |
| std | 5.524607 | 1.923721 |
| min | 32.1 | 13.1 |
| 25% | 39.175 | 15.7 |
| 50% | 43.4 | 17.5 |
| 75% | 48.25 | 18.7 |
| max | 59.6 | 21.5 |
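
The two columns have different means and spreads, and this is what `StandardScaler` removes. The sketch below shows the transformation on made-up data, not on the penguin dataset: `toy_train` and its values are hypothetical stand-ins for `data_train`. It checks that the scaler matches the manual formula `(x - mean) / std`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for data_train: two columns on different scales.
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    "Culmen Length (mm)": rng.normal(44.0, 5.5, size=200),
    "Culmen Depth (mm)": rng.normal(17.0, 1.9, size=200),
})

# Learn the per-column mean and standard deviation, then apply them.
scaler = StandardScaler().fit(toy_train)
scaled = scaler.transform(toy_train)

# Manual version. StandardScaler divides by the population std (ddof=0),
# whereas DataFrame.describe() reports the sample std (ddof=1).
manual = (toy_train - toy_train.mean()) / toy_train.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After the transformation, each column has a mean of about 0 and a standard deviation of about 1, so neither feature dominates the optimization because of its units.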


# Model without normalization

In [10]:
model = LogisticRegression()
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

In [11]:
model_name = model.__class__.__name__
score = model.score(data_test, target_test)

In [12]:
print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model.n_iter_[0]} iterations")

The accuracy using a LogisticRegression is 0.907 with a fitting time of 0.021 seconds in 67 iterations


# Model with normalization: Pipeline
With standardized features, the solver converges in fewer iterations (15 here, against 67 without scaling).
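
The effect of scaling on the iteration count can be checked on synthetic data. This is a minimal sketch: the cluster centers and spreads are arbitrary values chosen to sit on a millimetre-like scale, not statistics from the penguin dataset. The counts it prints depend on the data and on the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three classes, two features far from zero.
rng = np.random.default_rng(0)
centers = np.array([[39.0, 18.3], [47.5, 15.0], [48.8, 18.4]])
X = np.vstack([rng.normal(c, [2.7, 1.1], size=(100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)

# Same classifier, fitted on raw features and on standardized features.
raw = LogisticRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# n_iter_ lives on the classifier, i.e. the last step of the pipeline.
raw_iters = int(raw.n_iter_[0])
scaled_iters = int(scaled[-1].n_iter_[0])
print(f"without scaling: {raw_iters} iterations")
print(f"with scaling:    {scaled_iters} iterations")
```

If the unscaled fit reaches the default `max_iter=100`, scikit-learn emits a `ConvergenceWarning`, which is itself a hint that the features should be scaled.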

In [13]:
model = make_pipeline(StandardScaler(), LogisticRegression())
model
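
`make_pipeline` names each step after its class, in lowercase. The short sketch below shows how the steps can be inspected, either by position or by name; it only builds the pipeline and does not fit it.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated from the class names.
step_names = [name for name, _ in model.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# A step can be reached by position or by name; both give the same object.
last_step = model[-1]
same_step = model.named_steps["logisticregression"]
print(last_step is same_step)  # True
```

Indexing with `model[-1]` is the way to reach the fitted classifier later on, for example to read its `n_iter_` attribute.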

In [14]:
start = time.time()
model.fit(data_train, target_train)
elapsed_time = time.time() - start

In [15]:
predicted_target = model.predict(data_test)
model_name = model.__class__.__name__
score = model.score(data_test, target_test)

In [16]:
print(f"The accuracy using a {model_name} is {score:.3f} "
      f"with a fitting time of {elapsed_time:.3f} seconds "
      f"in {model[-1].n_iter_[0]} iterations")

The accuracy using a Pipeline is 0.942 with a fitting time of 0.011 seconds in 15 iterations
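
To make explicit what the pipeline does, the sketch below compares it with the same steps written by hand. It uses synthetic data from `make_blobs` as a stand-in for the penguin data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin data: 3 classes, 2 features.
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Pipeline version: one fit call, one predict call.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Manual version: fit the scaler on the training set only,
# then reuse its statistics to transform the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
clf = LogisticRegression().fit(X_train_scaled, y_train)
X_test_scaled = scaler.transform(X_test)

same = np.array_equal(pipe.predict(X_test), clf.predict(X_test_scaled))
print(same)
```

The pipeline removes the risk of forgetting to transform the test set, or of fitting the scaler on it by mistake. Because it exposes `fit`, `predict` and `score` like any estimator, it can be passed wherever a classifier is expected.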
