Numerical features can be passed to most machine learning algorithms without any encoding, which makes
them a convenient starting point. In this tutorial we cover the following:

* Identify numerical data using data types
* Preprocess the numerical data using feature scaling
* Demonstrate the use of pipelines
* Compare the results of models with and without feature scaling

In [38]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In [1]:
us_adult_df = pd.read_csv('/content/drive/MyDrive/Supervised ML/Working with numerical & categorical data/US-Adult-Census.csv')

print(us_adult_df.head())

   age   workclass  fnlwgt      education  education-num       marital-status  \
0   25     Private  226802           11th              7        Never-married   
1   38     Private   89814        HS-grad              9   Married-civ-spouse   
2   28   Local-gov  336951     Assoc-acdm             12   Married-civ-spouse   
3   44     Private  160323   Some-college             10   Married-civ-spouse   
4   18           ?  103497   Some-college             10        Never-married   

           occupation relationship    race      sex  capital-gain  \
0   Machine-op-inspct    Own-child   Black     Male             0   
1     Farming-fishing      Husband   White     Male             0   
2     Protective-serv      Husband   White     Male             0   
3   Machine-op-inspct      Husband   Black     Male          7688   
4                   ?    Own-child   White   Female             0   

   capital-loss  hours-per-week  native-country   class  
0             0              40   United

In [5]:
print('(Rows, Columns)')
print(us_adult_df.shape)

(Rows, Columns)
(48842, 15)


In [2]:
print('>>> Row <<<')
print(us_adult_df.count())

>>> Row <<<
age               48842
workclass         48842
fnlwgt            48842
education         48842
education-num     48842
marital-status    48842
occupation        48842
relationship      48842
race              48842
sex               48842
capital-gain      48842
capital-loss      48842
hours-per-week    48842
native-country    48842
class             48842
dtype: int64


In [4]:
print('>>> Info <<<')
print(us_adult_df.info())

>>> Info <<<
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48842 non-null  object
 14  class           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
None
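
Every column reports 48842 non-null values, yet the preview above shows `?` entries in workclass and occupation, so missing values appear to be stored as a placeholder string rather than as NaN. A minimal sketch for counting such placeholders follows. It uses a hypothetical toy frame (`toy_df`) in place of `us_adult_df` and assumes the strings may carry leading whitespace, as the labels printed later (for example ' <=50K') suggest.

```python
import pandas as pd

# Hypothetical toy frame standing in for us_adult_df (values are illustrative)
toy_df = pd.DataFrame({
    "age": [25, 38, 18],
    "workclass": [" Private", " Private", " ?"],
    "occupation": [" Machine-op-inspct", " Farming-fishing", " ?"],
})

# Text columns to inspect, listed explicitly for this toy example
text_columns = ["workclass", "occupation"]

# Strip surrounding whitespace, then count cells equal to the '?' placeholder
placeholder_counts = toy_df[text_columns].apply(
    lambda col: col.str.strip().eq("?").sum()
)
print(placeholder_counts)
```

Since this tutorial only uses the numerical columns, these placeholders do not affect the models trained below.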


The education and education-num columns represent the same information, so we remove the
education-num column.
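
One way to check this claim before dropping a column is to count how many distinct education-num codes each education label maps to, and vice versa. The sketch below uses a small hand-made frame (`toy_df`, a hypothetical stand-in for `us_adult_df`), so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for us_adult_df (values are illustrative)
toy_df = pd.DataFrame({
    "education": ["11th", "HS-grad", "Assoc-acdm",
                  "Some-college", "Some-college", "HS-grad"],
    "education-num": [7, 9, 12, 10, 10, 9],
})

# Number of distinct numeric codes per education label
codes_per_label = toy_df.groupby("education")["education-num"].nunique()
# Number of distinct education labels per numeric code
labels_per_code = toy_df.groupby("education-num")["education"].nunique()

# The columns are redundant only if the mapping is one-to-one in both directions
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```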

In [6]:
us_adult_df.drop(columns='education-num', inplace=True)
data, target = us_adult_df.drop('class', axis=1), us_adult_df['class']

In [7]:
print(data.head())

   age   workclass  fnlwgt      education       marital-status  \
0   25     Private  226802           11th        Never-married   
1   38     Private   89814        HS-grad   Married-civ-spouse   
2   28   Local-gov  336951     Assoc-acdm   Married-civ-spouse   
3   44     Private  160323   Some-college   Married-civ-spouse   
4   18           ?  103497   Some-college        Never-married   

           occupation relationship    race      sex  capital-gain  \
0   Machine-op-inspct    Own-child   Black     Male             0   
1     Farming-fishing      Husband   White     Male             0   
2     Protective-serv      Husband   White     Male             0   
3   Machine-op-inspct      Husband   Black     Male          7688   
4                   ?    Own-child   White   Female             0   

   capital-loss  hours-per-week  native-country  
0             0              40   United-States  
1             0              50   United-States  
2             0              40   Unit

Checking the data type of each column

In [8]:
data.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64


List the distinct data types present in the data

In [9]:
data.dtypes.unique()

array([dtype('int64'), dtype('O')], dtype=object)

Select numerical columns and work with them

In [10]:
numerical_columns = data.select_dtypes(include=['int64']).columns
print(numerical_columns)

Index(['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week'], dtype='object')
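
Selecting on int64 works here because every numerical column happens to have that dtype. If a dataset mixes integer and float columns, passing `include="number"` to `select_dtypes` picks up all of them. A small sketch on a hypothetical toy frame (the `hourly-rate` column is invented for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one integer, one float and one text column
toy_df = pd.DataFrame({
    "age": [25, 38],                # stored as int64
    "hourly-rate": [12.5, 20.0],    # stored as float64 (invented column)
    "workclass": ["Private", "Local-gov"],
})

# Restricting to int64 misses the float column
int_only = toy_df.select_dtypes(include=["int64"]).columns.tolist()
# "number" matches every numeric dtype
all_numeric = toy_df.select_dtypes(include="number").columns.tolist()

print(int_only)
print(all_numeric)
```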


In [11]:
print(data[numerical_columns].head())

   age  fnlwgt  capital-gain  capital-loss  hours-per-week
0   25  226802             0             0              40
1   38   89814             0             0              50
2   28  336951             0             0              40
3   44  160323          7688             0              40
4   18  103497             0             0              30


In [12]:
data_numeric = data[numerical_columns]
print(data_numeric.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int64
 1   fnlwgt          48842 non-null  int64
 2   capital-gain    48842 non-null  int64
 3   capital-loss    48842 non-null  int64
 4   hours-per-week  48842 non-null  int64
dtypes: int64(5)
memory usage: 1.9 MB
None


Split the data into training and test sets

In [15]:
data_train, data_test, target_train, target_test = train_test_split(data_numeric, target, random_state=42)
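
`train_test_split` is called here without `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below reproduces the resulting split sizes on placeholder arrays with the same number of rows (the arrays are hypothetical). It also shows the optional `stratify` argument, which this tutorial does not use, for keeping class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 48842
X = np.arange(n_rows).reshape(-1, 1)  # placeholder feature matrix
# Placeholder labels: every fourth row is assigned to the positive class
y = np.where(np.arange(n_rows) % 4 == 0, " >50K", " <=50K")

# Default split: 25% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))

# Stratified split: class proportions are preserved in both parts
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=42, stratify=y
)
train_share = (y_train_s == " >50K").mean()
test_share = (y_test_s == " >50K").mean()
print(train_share, test_share)
```

With 48842 rows the default split yields 36631 training rows and 12211 test rows.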

In [17]:
data_train.describe()

             age     fnlwgt  capital-gain  capital-loss  hours-per-week
count    36631.0    36631.0       36631.0       36631.0         36631.0
mean   38.642352   189680.8   1087.077721     89.665311       40.431247
std    13.725748   105598.7   7522.692939    407.110175       12.423952
min         17.0    12285.0           0.0           0.0             1.0
25%         28.0   117724.5           0.0           0.0            40.0
50%         37.0   178033.0           0.0           0.0            40.0
75%         48.0   237731.0           0.0           0.0            45.0
max         90.0  1484705.0       99999.0        4356.0            99.0
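
The summary shows very different scales: fnlwgt has a standard deviation above 100,000, while hours-per-week has one of about 12. Standardization removes this difference by subtracting each column's mean and dividing by its standard deviation. A minimal sketch on the fnlwgt and hours-per-week values of the five rows previewed earlier, checking that `StandardScaler` matches the manual computation. Note that the scaler uses the population standard deviation (`ddof=0`), whereas `describe()` reports the sample one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# fnlwgt and hours-per-week for the first five rows shown earlier
X = np.array([
    [226802, 40],
    [89814, 50],
    [336951, 40],
    [160323, 40],
    [103497, 30],
], dtype=float)

# Manual standardization: (x - mean) / std, column by column (np.std uses ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation performed by scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates because of its units.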


## Model with Feature Scaling

### Automatically combining Scaling functions
A pipeline chains operations sequentially: it first applies the transformation (fit and transform) and then trains the model on the transformed data.

In [19]:
model_pip = make_pipeline(StandardScaler(), LogisticRegression())

Check each step of the pipeline using the named_steps property.

In [20]:
model_pip.named_steps

{'standardscaler': StandardScaler(),
 'logisticregression': LogisticRegression()}

### Training & Testing with Pipelines
As the transformer has already been included in the pipeline, we do not need to call the
transformer's fit or fit_transform function. We instead call the fit function of the pipeline, which
preprocesses the data and feeds it to train the predictor, in this case our LogisticRegression model.
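
To make that sequence explicit, the sketch below fits a pipeline and the equivalent manual steps on synthetic data (the data and variable names are hypothetical) and checks that both give the same predictions. The key point is that the scaler learns its statistics from the training data only and reuses them, unchanged, on the test data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one small-scale and one large-scale feature
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[40, 190_000], scale=[12, 100_000], size=(200, 2))
y_train = (X_train[:, 0] > 40).astype(int)
X_test = rng.normal(loc=[40, 190_000], scale=[12, 100_000], size=(50, 2))

# Pipeline: fit() scales then trains; predict() scales then predicts
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual equivalent of the two pipeline steps
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)         # learn mean/std on training data
model = LogisticRegression().fit(X_train_scaled, y_train)
manual_pred = model.predict(scaler.transform(X_test))  # reuse training statistics

print(np.array_equal(pipe_pred, manual_pred))
```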

In [21]:
import time
start_time = time.time()
model_pip.fit(data_train, target_train)
elapsed_time = time.time() - start_time

In [22]:
print(elapsed_time)

0.20160245895385742


In [23]:
predictions = model_pip.predict(data_test)
print(predictions)

[' <=50K' ' <=50K' ' >50K' ... ' <=50K' ' <=50K' ' <=50K']


Check the overall accuracy of the pipeline by calling the score() function with the test set.

In [25]:
score = model_pip.score(data_test, target_test)

In [26]:
print(f"The accuracy using a pipeline is {score:.3f} "
f"with a training time of {elapsed_time:.3f} seconds "
f"in {model_pip[-1].n_iter_[0]} iterations")

The accuracy using a pipeline is 0.807 with a training time of 0.202 seconds in 9 iterations


## Comparison with a Model without Feature Scaling

In [32]:
lr_model = LogisticRegression(max_iter=1000)

In [33]:
lr_start_time = time.time()
lr_model.fit(data_train, target_train)
lr_elapsed_time = time.time() - lr_start_time

In [34]:
lr_score = lr_model.score(data_test, target_test)

In [36]:
print(f"The accuracy using a LogisticRegression model without scaling is {lr_score:.3f} "
f"with a training time of {lr_elapsed_time:.3f} seconds "
f"in {lr_model.n_iter_[0]} iterations")

The accuracy using a LogisticRegression model without scaling is 0.807 with a training time of 1.108 seconds in 185 iterations
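
Both models reach the same accuracy, but the unscaled one needs 185 iterations instead of 9 and takes roughly five times longer to train. Iterative solvers such as the one behind `LogisticRegression` tend to converge more slowly when features have very different magnitudes. The sketch below repeats the comparison on synthetic data with one small-scale and one large-scale feature (all names and values are hypothetical). Iteration counts and timings depend on the machine and library version, so no expected output is shown, and the unscaled fit may emit a convergence warning.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: both features carry signal but differ hugely in magnitude
rng = np.random.default_rng(42)
n = 5000
small = rng.normal(40, 12, size=n)
large = rng.normal(190_000, 100_000, size=n)
X = np.column_stack([small, large])
noise = rng.normal(0, 1, size=n)
y = ((small - 40) / 12 + (large - 190_000) / 100_000 + noise > 0).astype(int)


def fit_and_time(model):
    """Fit the model on (X, y) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
raw_model = LogisticRegression(max_iter=1000)

scaled_seconds = fit_and_time(scaled_model)
raw_seconds = fit_and_time(raw_model)

print(f"scaled: {scaled_model[-1].n_iter_[0]} iterations, {scaled_seconds:.3f} s")
print(f"raw:    {raw_model.n_iter_[0]} iterations, {raw_seconds:.3f} s")
```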


In [37]:
lr_prediction = lr_model.predict(data_test)
print(lr_prediction)

[' <=50K' ' <=50K' ' >50K' ... ' <=50K' ' <=50K' ' <=50K']


In [39]:
pip_report = classification_report(target_test, predictions)
print(pip_report)

              precision    recall  f1-score   support

       <=50K       0.82      0.96      0.88      9354
        >50K       0.71      0.29      0.41      2857

    accuracy                           0.81     12211
   macro avg       0.77      0.63      0.65     12211
weighted avg       0.79      0.81      0.77     12211



In [40]:
lr_report = classification_report(target_test, lr_prediction)
print(lr_report)

              precision    recall  f1-score   support

       <=50K       0.82      0.96      0.88      9354
        >50K       0.71      0.29      0.41      2857

    accuracy                           0.81     12211
   macro avg       0.77      0.63      0.65     12211
weighted avg       0.79      0.81      0.77     12211
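
The two reports are identical to two decimal places. Rather than comparing them by eye, one option is to pass `output_dict=True` to `classification_report` and compare individual metrics, or to measure how often the two models agree. A sketch with short hypothetical label arrays in place of `target_test`, `predictions` and `lr_prediction`:

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical stand-ins for target_test, predictions and lr_prediction
y_true = np.array([" <=50K", " <=50K", " >50K", " >50K", " <=50K", " >50K"])
pred_a = np.array([" <=50K", " <=50K", " >50K", " <=50K", " <=50K", " <=50K"])
pred_b = np.array([" <=50K", " <=50K", " >50K", " <=50K", " <=50K", " <=50K"])

# output_dict=True returns nested dictionaries instead of a formatted string
report_a = classification_report(y_true, pred_a, output_dict=True)
report_b = classification_report(y_true, pred_b, output_dict=True)

# Fraction of test rows on which the two models predict the same label
agreement = (pred_a == pred_b).mean()

# Recall on the minority class, the weakest number in both reports
recall_a = report_a[" >50K"]["recall"]
recall_b = report_b[" >50K"]["recall"]
print(agreement, recall_a, recall_b)
```

In the reports above, the recall of 0.29 on the >50K class shows that both models miss most high-income cases despite the 0.81 accuracy, so accuracy alone gives an incomplete picture on this imbalanced target.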

