# Data Jobs Analysis

## Overview

Job boards are full of Big Data job advertisements from companies seeking different seniority levels (**Junior**, **Middle**, **Senior**, **Tech lead**...) and different roles (**Data Analyst**, **Data Engineer**, **Data Scientist**, etc.).

Our objective is to combine 4 datasets containing thousands of Big Data job advertisements and build a model that predicts the **salary** from a few variables such as **industry**, **location**, **company revenue**, **experience**, etc.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re

In [None]:
df_dataanalyst = pd.read_csv('/content/drive/MyDrive/Kaggle-Datasets/DataAnalyst.csv')
df_dataengineer = pd.read_csv('/content/drive/MyDrive/Kaggle-Datasets/DataEngineer.csv')
df_businessanalyst = pd.read_csv('/content/drive/MyDrive/Kaggle-Datasets/BusinessAnalyst.csv')
df_datascientist = pd.read_csv('/content/drive/MyDrive/Kaggle-Datasets/DataScientist.csv')

df = pd.concat([df_dataanalyst, df_dataengineer, df_businessanalyst, df_datascientist])

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors,Easy Apply,index
0,0,"Data Analyst, Center on Immigration and Justic...",$37K-$66K (Glassdoor est.),Are you eager to roll up your sleeves and harn...,3.2,Vera Institute of Justice\n3.2,"New York, NY","New York, NY",201 to 500 employees,1961,Nonprofit Organization,Social Assistance,Non-Profit,$100 to $500 million (USD),-1,True,
1,1,Quality Data Analyst,$37K-$66K (Glassdoor est.),Overview\n\nProvides analytical and technical ...,3.8,Visiting Nurse Service of New York\n3.8,"New York, NY","New York, NY",10000+ employees,1893,Nonprofit Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),-1,-1,
2,2,"Senior Data Analyst, Insights & Analytics Team...",$37K-$66K (Glassdoor est.),We’re looking for a Senior Data Analyst who ha...,3.4,Squarespace\n3.4,"New York, NY","New York, NY",1001 to 5000 employees,2003,Company - Private,Internet,Information Technology,Unknown / Non-Applicable,GoDaddy,-1,
3,3,Data Analyst,$37K-$66K (Glassdoor est.),Requisition NumberRR-0001939\nRemote:Yes\nWe c...,4.1,Celerity\n4.1,"New York, NY","McLean, VA",201 to 500 employees,2002,Subsidiary or Business Segment,IT Services,Information Technology,$50 to $100 million (USD),-1,-1,
4,4,Reporting Data Analyst,$37K-$66K (Glassdoor est.),ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl...,3.9,FanDuel\n3.9,"New York, NY","New York, NY",501 to 1000 employees,2009,Company - Private,Sports & Recreation,"Arts, Entertainment & Recreation",$100 to $500 million (USD),DraftKings,True,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12782 entries, 0 to 399
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         9854 non-null   object
 1   Job Title          12782 non-null  object
 2   Salary Estimate    12782 non-null  object
 3   Job Description    12782 non-null  object
 4   Rating             12782 non-null  object
 5   Company Name       12781 non-null  object
 6   Location           12782 non-null  object
 7   Headquarters       12782 non-null  object
 8   Size               12782 non-null  object
 9   Founded            12782 non-null  object
 10  Type of ownership  12782 non-null  object
 11  Industry           12782 non-null  object
 12  Sector             12782 non-null  object
 13  Revenue            12782 non-null  object
 14  Competitors        12782 non-null  object
 15  Easy Apply         12782 non-null  object
 16  index              7601 non-null   object


## Data Preprocessing

In [None]:
# First, we need to clean the data. We remove the columns we don't need:

def drop_features(df):
  df.drop(labels=['Unnamed: 0','index'], axis=1, inplace=True)
  return df

In [None]:
# Inspecting the unique values shows that the df_businessanalyst dataset has values in the 'Founded' column that are not years and values in the 'Rating' column that are not ratings.
# In fact, all the values in df_businessanalyst are shifted and placed in the wrong column.
# To fix this problem, we reassign the column order of these particular rows:

def fix_df_columns(df):
  sub_df_corrected = pd.DataFrame(df.loc[df['Rating'].astype(str).str.contains('[A-Za-z]') == True, :].to_numpy(), columns=['Job Title', 'Job Description',
       'Rating', 'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors',
       'Easy Apply', 'index', 'Unnamed: 0', 'Salary Estimate'])
  df = df[df['Founded'].astype(str).str.contains('[A-Za-z]') == False]
  df = df[df['Rating'].astype(str).str.contains('[A-Za-z]') == False]
  df = pd.concat([df,sub_df_corrected], axis=0)
  df.reset_index(drop=True, inplace=True)
  return df

In [None]:
def convert_dtypes(df):
  df['Rating'] = df['Rating'].astype(float)
  df['Founded'] = df['Founded'].astype(float)
  return df

In [None]:
def replace_null(df):
  df.replace({-1: np.nan, -1.0: np.nan, '-1': np.nan, 'Unknown': np.nan}, inplace=True)
  return df

In [None]:
# The Company Name column contains a '\n' after the company name, followed by a duplicate of the company's rating.
# We clean this field with the split method:

def split_company_name(x):
  if (type(x)==str):
    return x.split('\n')[0]
  else:
    return 'Unknown'

def clean_company_name(df):
  df['Company Name'] = df['Company Name'].apply(lambda x: split_company_name(x))
  return df
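
To make the cleaning rule concrete, here is a small standalone check of the split logic, using the `Company Name` value visible in the `df.head()` output above. The helper is restated with `isinstance` only so that the snippet runs on its own; treat it as a sketch rather than a replacement for the cell above.

```python
def split_company_name(x):
    # Keep only the text before the first newline; non-strings (e.g. NaN) become 'Unknown'.
    if isinstance(x, str):
        return x.split('\n')[0]
    return 'Unknown'

raw = 'Vera Institute of Justice\n3.2'   # value taken from the df.head() output
cleaned = split_company_name(raw)
missing = split_company_name(float('nan'))

print(cleaned)
print(missing)
```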

In [None]:
# The 'Sector' and 'Type of ownership' columns both contain a value named 'Government', which would create two identically named columns when we perform One Hot Encoding.
# Let's rename this value in one of these columns:

def replace_redundant_val(df):
  df['Sector'].replace({'Government':'Government Sector'}, inplace=True)
  return df

## Feature Engineering

In [None]:
def add_position(df):
  df['Position'] = 'Other'
  searchfor = ['data', 'analyst']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'Data Analyst'
  searchfor = ['data', 'engineer']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'Data Engineer'
  searchfor = ['data', 'scientist']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'Data Scientist'
  searchfor = ['machine learning', 'engineer']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'ML Engineer'
  searchfor = ['business', 'analyst']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'BI Analyst'
  searchfor = ['data', 'architect']
  df.loc[(df['Job Title'].str.lower().str.contains(searchfor[0])) & (df['Job Title'].str.lower().str.contains(searchfor[1])), 'Position'] = 'Data Architect'
  return df
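
Because the assignments in `add_position` run one after another, a title that matches several keyword pairs keeps the label of the last matching pair. The sketch below shows that behaviour in plain Python; `POSITION_RULES` and `label_position` are hypothetical names introduced only for illustration, and the second and third titles are made-up examples.

```python
# Hypothetical rule table mirroring the keyword pairs used in add_position, in the same order.
POSITION_RULES = [
    (('data', 'analyst'), 'Data Analyst'),
    (('data', 'engineer'), 'Data Engineer'),
    (('data', 'scientist'), 'Data Scientist'),
    (('machine learning', 'engineer'), 'ML Engineer'),
    (('business', 'analyst'), 'BI Analyst'),
    (('data', 'architect'), 'Data Architect'),
]

def label_position(title):
    # Later rules overwrite earlier ones, like the sequential .loc assignments.
    label = 'Other'
    lowered = title.lower()
    for keywords, position in POSITION_RULES:
        if all(keyword in lowered for keyword in keywords):
            label = position
    return label

titles = ['Quality Data Analyst', 'Senior Business Data Analyst', 'Software Developer']
labels = [label_position(t) for t in titles]
print(labels)
```

A title containing "business", "data" and "analyst" therefore ends up as `BI Analyst`, not `Data Analyst`; reordering the rules would change that outcome.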

In [None]:
def add_experience(df):
  df['Years of experience'] = df['Job Description'].apply(lambda x: find_experience(x))
  # Keywords in the job title are the more reliable signal for the required level, so they are checked first;
  # the years-of-experience estimate is only used as a fallback when the title gives no level.
  df['Level of experience'] = df['Job Title'].apply(lambda x : categorize_level_by_title(x))
  df.loc[df['Level of experience']=='Unknown','Level of experience'] = df.loc[df['Level of experience']=='Unknown']['Years of experience'].apply(lambda x : categorize_level_by_years(x))
  return df

In [None]:
# This function uses regular expression patterns to detect, in each job description, the years of experience required for the job
def find_experience(x):
  pattern1 = re.compile(r'\d+\+?(\s?\.?-?\s?|\s?to\s?)(\d?\s?years)') #maybe add \s[oO]f\s[eE]xperience
  matches = pattern1.finditer(x)

  filteredtext = ''

  for match in matches:
    filteredtext += (match.group(0) + ' ') 

  pattern2 = re.compile(r'\d+')
  matches = pattern2.finditer(filteredtext)
  years_found = np.array([])
  for match in matches:
    if (float(match.group(0)) <= 10):
      years_found = np.append(years_found, float(match.group(0)))

  if (len(years_found)==0):
    return 0
  else:
    return round(years_found.mean(),2)

def categorize_level_by_years(x):
  if x == 0:
    return 'Unknown'
  elif x <= 1.5:
    return 'Junior'
  elif x <= 3:
    return 'Middle'
  elif x <= 5:
    return 'Senior'
  elif x <= 10:
    return 'Technical lead'

def categorize_level_by_title(x):
  if (x.lower().find('junior') != -1) | (x.lower().find('jr') != -1):
    return 'Junior'
  elif x.lower().find('middle') != -1:
    return 'Middle'
  elif (x.lower().find('senior') != -1) | (x.lower().find('sr') != -1):
    return 'Senior'
  elif (x.lower().find('technical lead') != -1) | (x.lower().find('tech lead') != -1):
    return 'Technical lead'
  else:
    return 'Unknown'
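
As a quick illustration of how the extraction and the year thresholds combine, the following sketch restates the logic without NumPy and runs it on three made-up description snippets. The names `find_experience_sketch` and `level_from_years` are hypothetical; the regular expression and thresholds are copied from the cell above.

```python
import re

# Same pattern as in find_experience above.
YEARS_PATTERN = re.compile(r'\d+\+?(\s?\.?-?\s?|\s?to\s?)(\d?\s?years)')

def find_experience_sketch(text):
    # Gather every "N years"-style phrase, then average the numbers that are <= 10.
    phrases = ' '.join(m.group(0) for m in YEARS_PATTERN.finditer(text))
    years = [float(n) for n in re.findall(r'\d+', phrases) if float(n) <= 10]
    return round(sum(years) / len(years), 2) if years else 0

def level_from_years(years):
    # Same thresholds as categorize_level_by_years; values are capped at 10 upstream.
    if years == 0:
        return 'Unknown'
    if years <= 1.5:
        return 'Junior'
    if years <= 3:
        return 'Middle'
    if years <= 5:
        return 'Senior'
    return 'Technical lead'

samples = [
    'Requires 3+ years of SQL and 2-4 years of Python.',  # made-up text
    '5 to 7 years in analytics preferred.',               # made-up text
    'No experience requirement stated.',                  # made-up text
]
results = [(find_experience_sketch(s), level_from_years(find_experience_sketch(s)))
           for s in samples]
print(results)
```

Note that a range such as "2-4 years" contributes both of its bounds to the average, so the first sample averages 3, 2 and 4.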

In [None]:
# The Location and Headquarters fields pair each city with its region. For model training it is handier
# to keep only the region code in these columns, so we use a splitting function for this task as well:

def split_location(x):
  if (type(x)==str):
    return x.split(',')[-1].strip()
  else:
    return 'Unknown'

def redefine_locations(df):
  df['Location'] = df['Location'].apply(lambda x: split_location(x))
  df['Headquarters'] = df['Headquarters'].apply(lambda x: split_location(x))
  return df
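
A short check of the splitting rule, using location strings from the `df.head()` output above. The value `'Remote'` is a hypothetical input added to show what happens when there is no comma: the whole string is kept.

```python
def split_location(x):
    # Keep the text after the last comma, stripped of spaces; non-strings become 'Unknown'.
    if isinstance(x, str):
        return x.split(',')[-1].strip()
    return 'Unknown'

regions = [split_location(v) for v in ['New York, NY', 'McLean, VA', 'Remote', None]]
print(regions)
```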

In [None]:
# The Salary Estimate column contains strings that express the salary with extra $ and K symbols. Since we want a float,
# we use the re module to extract the lower and upper estimates and take the mean of both. One important caveat
# is that a few rows express the salary per hour, so the function handles that case as well:

def estimate_salary(x):
  if (type(x)==str):
    pattern = re.compile(r'\d+')
    ph_pattern = re.compile(r'[Pp]er\s[Hh]our')
    lower_upper = np.array([])
    matches = ph_pattern.finditer(x)
    ph_found = sum(1 for match in matches)
    matches = pattern.finditer(x)
    for match in matches:
      lower_upper = np.append(lower_upper, float(match.group(0)))
    if ph_found == 0:
      return lower_upper.mean()
    else:
      # Hourly rate: assuming a 40-hour week and 48 working weeks per year (taken here as the US average),
      # multiply to get the yearly salary, then divide by 1000 to express it in thousands of dollars (K).
      return (lower_upper.mean() * 40 * 48)/1000
  else:
    return np.nan

def redefine_salary(df):
  df['Salary Estimate'] = df['Salary Estimate'].apply(lambda x: estimate_salary(x))
  return df
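
The conversion can be checked by hand on two strings. The first comes from the `df.head()` output above; the hourly string is a hypothetical example whose exact wording is an assumption. The sketch restates the logic with plain Python lists instead of NumPy arrays.

```python
import re

HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 48  # the notebook's assumption

def estimate_salary_sketch(text):
    # Average every number in the string (lower and upper estimate).
    numbers = [float(n) for n in re.findall(r'\d+', text)]
    midpoint = sum(numbers) / len(numbers)
    if re.search(r'[Pp]er\s[Hh]our', text):
        # Hourly rate -> yearly salary, expressed in thousands of dollars.
        return midpoint * HOURS_PER_WEEK * WEEKS_PER_YEAR / 1000
    return midpoint

annual = estimate_salary_sketch('$37K-$66K (Glassdoor est.)')
hourly = estimate_salary_sketch('$20-$30 Per Hour (Glassdoor est.)')  # hypothetical string
print(annual, hourly)
```

Here the yearly posting gives (37 + 66) / 2 = 51.5, and the hourly one gives 25 × 40 × 48 / 1000 = 48.0, so both land on the same scale of thousands of dollars per year.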

In [None]:
def label_encode(df):
  df['Size'].replace({'1 to 50 employees': 1, '51 to 200 employees': 2, '201 to 500 employees': 3, '501 to 1000 employees': 4, 
                      '1001 to 5000 employees': 5, '5001 to 10000 employees': 6, '10000+ employees': 7}, inplace=True)
  
  df['Revenue'].replace({'$100 to $500 million (USD)':7, '$2 to $5 billion (USD)':10, '$50 to $100 million (USD)':6,
       '$1 to $2 billion (USD)': 9, '$5 to $10 billion (USD)':11, '$1 to $5 million (USD)': 2, '$25 to $50 million (USD)':5,
       '$10+ billion (USD)':12, 'Less than $1 million (USD)':1, '$10 to $25 million (USD)': 4, '$500 million to $1 billion (USD)': 8,
       '$5 to $10 million (USD)': 3}, inplace=True)
  
  df['Level of experience'].replace({'Junior':1, 'Middle':2, 'Senior':3, 'Technical lead':4}, inplace=True)
  return df

In [None]:
def one_hot_encode(df):
  df_location = pd.get_dummies(df['Location'])
  df_sector = pd.get_dummies(df['Sector'])
  df_typeofownership = pd.get_dummies(df['Type of ownership'])
  df_position = pd.get_dummies(df['Position'])
  df = pd.concat([df, df_location, df_sector, df_typeofownership, df_position], axis=1)
  df.drop(labels=['Location','Sector','Type of ownership','Position'], axis = 1, inplace=True)
  return df

def stat_impute_numerical_variables(df):
  df['Rating'].fillna(df['Rating'].mean(), inplace=True)
  df['Salary Estimate'].fillna(df['Salary Estimate'].median(), inplace=True)
  return df

def stat_impute_categorical_variables(df):
  df['Sector'].fillna(df['Sector'].mode()[0], inplace=True)
  df['Size'].fillna(df['Size'].mode()[0], inplace=True)
  df['Type of ownership'].fillna(df['Type of ownership'].mode()[0], inplace=True)
  return df

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# We can predict the missing values in Revenue with a classification model.
# To do so, we train the model on the independent features: Salary Estimate, Rating, Location, Size, Sector, Type of ownership

def ml_impute_revenue(df):
  scaler = MinMaxScaler()
  X_train = df.loc[(df['Revenue'].isna() == False) & (df['Revenue'] != 'Unknown / Non-Applicable'),[c for c in df.columns if c not in ('Job Title', 'Job Description', 'Company Name', 'Revenue',
                                                                                                                        'Headquarters', 'Founded', 'Industry', 'Competitors', 'Easy Apply',
                                                                                                                        'Position', 'Years of experience', 'Level of experience', 'BI Analyst', 'Data Analyst',
                                                                                                                        'Data Architect', 'Data Engineer', 'Data Scientist', 'ML Engineer', 'Other')]]
  X_train = scaler.fit_transform(X_train)
  y_train = df.loc[(df['Revenue'].isna() == False) & (df['Revenue'] != 'Unknown / Non-Applicable'), 'Revenue']
  y_train = y_train.astype('int')
  X_test = df.loc[(df['Revenue'].isna() == True) | (df['Revenue'] == 'Unknown / Non-Applicable'),[c for c in df.columns if c not in ('Job Title', 'Job Description', 'Company Name', 'Revenue',
                                                                                                                          'Headquarters', 'Founded', 'Industry', 'Competitors', 'Easy Apply',
                                                                                                                          'Position', 'Years of experience', 'Level of experience', 'BI Analyst', 'Data Analyst',
                                                                                                                        'Data Architect', 'Data Engineer', 'Data Scientist', 'ML Engineer', 'Other')]]
  X_test = scaler.transform(X_test)
  log_reg = LogisticRegression(max_iter=1000)
  log_reg.fit(X_train, y_train)
  y_pred = log_reg.predict(X_test)
  df.loc[(df['Revenue'].isna() == True) | (df['Revenue'] == 'Unknown / Non-Applicable'), 'Revenue'] = y_pred
  return df

# As with the missing values in Revenue, we can predict the unknown categories in Level of experience with a classification model.
# We train the model on the independent features: Salary Estimate, Position, Rating, Location, Size, Sector, Type of ownership

def ml_impute_level(df):
  scaler = MinMaxScaler()
  X_train = df.loc[(df['Level of experience'] != 'Unknown'),[c for c in df.columns if c not in ('Job Title', 'Job Description', 'Company Name', 'Revenue',
                                                                                                                        'Headquarters', 'Founded', 'Industry', 'Competitors', 'Easy Apply',
                                                                                                                        'Position', 'Years of experience', 'Level of experience')]]
  X_train = scaler.fit_transform(X_train)
  y_train = df.loc[(df['Level of experience'] != 'Unknown'), 'Level of experience']
  y_train = y_train.astype('int')
  X_test = df.loc[(df['Level of experience'] == 'Unknown'),[c for c in df.columns if c not in ('Job Title', 'Job Description', 'Company Name', 'Revenue',
                                                                                                                          'Headquarters', 'Founded', 'Industry', 'Competitors', 'Easy Apply',
                                                                                                                          'Position', 'Years of experience', 'Level of experience')]]
  X_test = scaler.transform(X_test)
  log_reg = LogisticRegression(max_iter=1000)
  log_reg.fit(X_train, y_train)
  y_pred = log_reg.predict(X_test)
  df.loc[(df['Level of experience'] == 'Unknown'), 'Level of experience'] = y_pred
  return df

In [None]:
def preprocess_df(df):
  df = fix_df_columns(df)
  df = drop_features(df)
  df = replace_null(df)
  df = replace_redundant_val(df)
  df = convert_dtypes(df)
  df = add_position(df)
  df = add_experience(df)
  df = clean_company_name(df)
  df = redefine_locations(df)
  df = redefine_salary(df)
  df = stat_impute_numerical_variables(df)
  df = stat_impute_categorical_variables(df)
  df = label_encode(df)
  df = one_hot_encode(df)
  df = ml_impute_revenue(df)
  df = ml_impute_level(df)
  return df

In [None]:
df = pd.concat([df_dataanalyst, df_dataengineer, df_businessanalyst, df_datascientist])
df = preprocess_df(df)

## Selecting and Training models
Now that preprocessing is complete, we use the salary-related independent features to predict the salary for new inputs. The independent features (predictors) are: Position, Level of experience, Rating, Location, Size, Sector, Revenue, and Type of ownership.

In [None]:
#X.head()

In [None]:
from sklearn.model_selection import train_test_split

X = df[[c for c in df.columns if c not in ('Salary Estimate','Job Title', 'Job Description','Company Name', 'Industry', 'Years of experience',
                                           'Headquarters', 'Founded', 'Competitors', 'Easy Apply')]]
y = df['Salary Estimate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
# Reuse the scaler fitted on the training split so that both splits share the same scaling
X_test = scaler.transform(X_test)
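
A common convention with min-max scaling is to learn the per-column minimum and range from the training split only and then apply those same statistics to any other data; in scikit-learn terms, `fit_transform` on the training data and `transform` elsewhere. The sketch below shows the idea in plain Python on made-up numbers; `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions.

```python
def fit_min_max(rows):
    # Learn per-column minimum and range from the given rows.
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    # A constant column would have range 0; fall back to 1.0 to avoid dividing by zero.
    ranges = [(max(c) - min(c)) or 1.0 for c in columns]
    return mins, ranges

def apply_min_max(rows, mins, ranges):
    # Scale every value with the statistics learned earlier.
    return [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)] for row in rows]

train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]  # made-up training rows
test = [[2.0, 60.0]]                             # made-up unseen row

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
test_scaled = apply_min_max(test, mins, ranges)
print(train_scaled)
print(test_scaled)
```

The unseen row scales to 0.25 and 1.25: a value above the training maximum falls outside [0, 1], which is expected and keeps the two splits on one consistent scale.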

In [None]:
# Linear Regression

from sklearn.linear_model import LinearRegression

lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)

# Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

# Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)

# Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor

gbt_reg = GradientBoostingRegressor()
gbt_reg.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

## Model Evaluation using Cross Validation - Before Tuning

In [None]:
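from sklearn.model_selection import cross_val_score

# cross_val_score clones each estimator and refits it on the folds of the data it receives
# (here the held-out split), so the models fitted above are not reused.
scores = cross_val_score(lr_reg, X_test, y_test, scoring="neg_mean_squared_error", cv = 10)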
print("Linear Regression : ", scores)

scores = cross_val_score(tree_reg, X_test, y_test, scoring="neg_mean_squared_error", cv = 10)
print("Regression Tree : ", scores)

scores = cross_val_score(rf_reg, X_test, y_test, scoring="neg_mean_squared_error", cv = 10)
print("Random Forest Regressor : ", scores)

scores = cross_val_score(gbt_reg, X_test, y_test, scoring="neg_mean_squared_error", cv = 10)
print("Gradient Boosting Regressor : ", scores)

Linear Regression :  [-9.04339448e+02 -6.81059053e+02 -7.79614258e+02 -8.89470911e+02
 -2.60703635e+21 -7.88413050e+02 -9.15053035e+02 -6.82100295e+22
 -9.93760380e+02 -9.18973857e+02]
Regression Tree :  [-1603.40897658 -1395.88811022 -1721.78486909 -1579.53012931
 -1861.66710163 -1469.11543745 -1750.69792914 -1786.85221608
 -1854.42148533 -1623.83356373]
Random Forest Regressor :  [-1035.10242162  -835.80208572 -1051.12645214  -992.84061653
 -1054.35840821  -957.59059659 -1153.07038146 -1234.70306532
 -1184.48001253  -991.70579737]
Gradient Boosting Regressor :  [ -892.51228044  -673.78258694  -789.98043482  -839.52712724
  -860.16437312  -815.27918933  -910.1273307  -1078.2228459
  -955.77388246  -905.01637446]
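
Negative mean squared errors are hard to read directly. One way to interpret them is to average the fold scores, flip the sign, and take the square root to get a root mean squared error in the same unit as the target. The sketch below does this for the Gradient Boosting scores printed above, assuming the salary is expressed in thousands of dollars as produced by `estimate_salary`.

```python
import math

# Fold scores copied from the Gradient Boosting Regressor output above.
gbt_scores = [-892.51228044, -673.78258694, -789.98043482, -839.52712724,
              -860.16437312, -815.27918933, -910.1273307, -1078.2228459,
              -955.77388246, -905.01637446]

mean_mse = -sum(gbt_scores) / len(gbt_scores)  # flip the sign of the negated MSE
rmse = math.sqrt(mean_mse)                     # back to the unit of the target

print(round(mean_mse, 2), round(rmse, 2))
```

The result is an error of roughly 29.5, that is, about 29.5 thousand dollars on a typical prediction under that unit assumption.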


## Hyperparameter Tuning using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'fit_intercept': [True, False],               
              'normalize':[True, False],
              'n_jobs' : [1, 10, 50, 100]}
# Linear Regression
lr_reg1 = GridSearchCV(LinearRegression(), parameters, cv=5, scoring='neg_mean_squared_error') 
lr_reg1.fit(X_train, y_train)
print("The best parameters obtained by CV:", lr_reg1.best_params_)
print("The best score obtained by CV = {:5.3f}".format(lr_reg1.best_score_))

The best parameters obtained by CV: {'fit_intercept': False, 'n_jobs': 1, 'normalize': True}
The best score obtained by CV = -359245240567123738624.000


In [None]:
parameters = {'criterion': ['mse', 'friedman_mse'],               
              'splitter':['best', 'random'],
              'max_features': ['auto', 'sqrt', 'log2']}
# Decision Tree Regressor
tree_reg2 = GridSearchCV(DecisionTreeRegressor(), parameters, cv=5, scoring='neg_mean_squared_error') 
tree_reg2.fit(X_train, y_train)
print("The best parameters obtained by CV:", tree_reg2.best_params_)
print("The best score obtained by CV = {:5.3f}".format(tree_reg2.best_score_))

The best parameters obtained by CV: {'criterion': 'friedman_mse', 'max_features': 'log2', 'splitter': 'random'}
The best score obtained by CV = -1363.583


In [None]:
parameters = {'n_estimators': [100, 150, 200],               
              'max_features' : ['auto', 'log2']}
# Random Forest Regressor
rf_reg3 = GridSearchCV(RandomForestRegressor(), parameters, cv=5, scoring='neg_mean_squared_error') 
rf_reg3.fit(X_train, y_train)
print("The best parameters obtained by CV:", rf_reg3.best_params_)
print("The best score obtained by CV = {:5.3f}".format(rf_reg3.best_score_))

The best parameters obtained by CV: {'max_features': 'log2', 'n_estimators': 200}
The best score obtained by CV = -955.291


In [None]:
parameters = {'learning_rate': [0.05, 0.1, 0.2, 0.3],
              'n_estimators': [100, 150, 200],
              'criterion': ['friedman_mse', 'mse'],
              'max_features' : ['auto', 'sqrt', 'log2']}
# Gradient Boosting Regressor
gbt_reg4 = GridSearchCV(GradientBoostingRegressor(), parameters, cv=5, scoring='neg_mean_squared_error')
gbt_reg4.fit(X_train, y_train)
print("The best parameters by CV:", gbt_reg4.best_params_)
print("The best score by CV = {:5.3f}".format(gbt_reg4.best_score_))

The best parameters by CV: {'criterion': 'mse', 'learning_rate': 0.1, 'max_features': 'auto', 'n_estimators': 150}
The best score by CV = -796.315


## Save the best model

In [None]:
import sklearn
sklearn.__version__


'0.22.2.post1'

In [None]:
import pickle

# Saving the model (the with block closes the file automatically)
with open("gbt_model.pkl", 'wb') as f_out:
    pickle.dump(gbt_reg4, f_out)

## Save the scaler

In [None]:
# Saving the scaler (the with block closes the file automatically)
with open("test_scaler.pkl", 'wb') as f_out:
    pickle.dump(scaler, f_out)
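
Reading a pickled object back follows the same pattern with the file opened in `'rb'` mode. The sketch below round-trips a stand-in dictionary through a temporary file, since the fitted objects are not available outside the notebook; the file name `artifact.pkl` and the dictionary contents are hypothetical. Pickled scikit-learn estimators are generally only safe to load with the library version that created them, which is a good reason to record `sklearn.__version__` next to the files.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted objects; any picklable Python object behaves the same way.
artifact = {'name': 'stand-in model', 'params': {'n_estimators': 150}}

path = os.path.join(tempfile.mkdtemp(), 'artifact.pkl')  # hypothetical file name

# Write the object to disk.
with open(path, 'wb') as f_out:
    pickle.dump(artifact, f_out)

# Read it back in binary mode.
with open(path, 'rb') as f_in:
    restored = pickle.load(f_in)

print(restored == artifact)
```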