## Good Resource
- https://github.com/solegalli/feature-engineering-for-machine-learning/tree/main/Section-08-Categorical-Encoding-Basic

I have several other files that preprocess the Canadian data from the Stack Overflow surveys, which I use to train the model. However, I will not share them: the work was done for a college course, and I am worried that people might use them to cheat.

In [4]:
import pandas as pd
import math
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import joblib

In [251]:
basePath = ""

In [227]:
data = pd.read_csv(f"{basePath}/Project/Resources/CanadaData.csv")
data

Unnamed: 0,OrgSize,Industry,Experience,Title,Country,City,Salary
0,500 to 999 employees,"Information Services, IT, Software Development...",5-9,"Developer, back-end",Canada,Edmonton Region,101490.0
1,10 to 19 employees,"Information Services, IT, Software Development...",5-9,Data scientist or machine learning specialist,Canada,Halifax Region,52046.0
2,"1,000 to 4,999 employees","Information Services, IT, Software Development...",10-*,"Developer, back-end",Canada,Edmonton Region,118962.0
3,100 to 499 employees,"Manufacturing, Transportation, or Supply Chain",10-*,"Developer, full-stack",Canada,North Shore Region,64686.0
4,20 to 99 employees,"Information Services, IT, Software Development...",2-4,"Developer, full-stack",Canada,Hamilton–Niagara Peninsula Region,59481.0
...,...,...,...,...,...,...,...
10242,10 to 19 employees,"Information Services, IT, Software Development...",10-*,"Developer, desktop or enterprise applications",Canada,Edmonton Region,150000.0
10243,100 to 499 employees,"Information Services, IT, Software Development...",10-*,"Developer, front-end",Canada,Lower Mainland–Southwest Region,90000.0
10244,20 to 99 employees,"Information Services, IT, Software Development...",10-*,"Developer, front-end",Canada,Annapolis Valley Region,70000.0
10245,10 to 19 employees,"Information Services, IT, Software Development...",10-*,"Developer, front-end",Canada,Cape Breton Region,50000.0


In [228]:
data = data.rename(columns={"OrgSize":"Company Size"})

In [229]:
experience_mapping = {
    '0-1': '0 to 1 years',
    '2-4': '2 to 4 years',
    '5-9': '5 to 9 years',
    '10-*': '10 or more years'
}

data['Experience'] = data['Experience'].map(experience_mapping)

data

Unnamed: 0,Company Size,Industry,Experience,Title,Country,City,Salary
0,500 to 999 employees,"Information Services, IT, Software Development...",5 to 9 years,"Developer, back-end",Canada,Edmonton Region,101490.0
1,10 to 19 employees,"Information Services, IT, Software Development...",5 to 9 years,Data scientist or machine learning specialist,Canada,Halifax Region,52046.0
2,"1,000 to 4,999 employees","Information Services, IT, Software Development...",10 or more years,"Developer, back-end",Canada,Edmonton Region,118962.0
3,100 to 499 employees,"Manufacturing, Transportation, or Supply Chain",10 or more years,"Developer, full-stack",Canada,North Shore Region,64686.0
4,20 to 99 employees,"Information Services, IT, Software Development...",2 to 4 years,"Developer, full-stack",Canada,Hamilton–Niagara Peninsula Region,59481.0
...,...,...,...,...,...,...,...
10242,10 to 19 employees,"Information Services, IT, Software Development...",10 or more years,"Developer, desktop or enterprise applications",Canada,Edmonton Region,150000.0
10243,100 to 499 employees,"Information Services, IT, Software Development...",10 or more years,"Developer, front-end",Canada,Lower Mainland–Southwest Region,90000.0
10244,20 to 99 employees,"Information Services, IT, Software Development...",10 or more years,"Developer, front-end",Canada,Annapolis Valley Region,70000.0
10245,10 to 19 employees,"Information Services, IT, Software Development...",10 or more years,"Developer, front-end",Canada,Cape Breton Region,50000.0
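One thing to keep in mind with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` instead of raising an error. The sketch below uses a toy column (the `"15+"` label is made up for illustration and is not in the survey data) to show a quick way to list such labels.

```python
import pandas as pd

experience_mapping = {
    "0-1": "0 to 1 years",
    "2-4": "2 to 4 years",
    "5-9": "5 to 9 years",
    "10-*": "10 or more years",
}

# Toy column; "15+" is a made-up label that the mapping does not cover.
raw = pd.Series(["0-1", "10-*", "15+"])
mapped = raw.map(experience_mapping)

# Labels missing from the dictionary come back as NaN after the mapping.
unmapped = raw[mapped.isna()].tolist()
print(unmapped)
```

An empty list here means every label was covered by the mapping.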


In [230]:
data.to_csv(f"{basePath}/ModifiedCanada.csv",index=False)

## One Hot Encoding
One-hot encoding turns each categorical value into a vector of 0s with a single 1.
Each vector has dimension 1×n, where n is the number of unique values in the encoded column; every unique value gets its own output column.

It works well with linear models and avoids introducing an artificial hierarchy between categorical values.

For example, one-hot encoding would be a poor choice for values such as ["small","medium","large"],

since in that case you want the encoding to preserve the order of the values.
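To make the difference concrete, here is a minimal plain-Python sketch of both ideas. The helper names `one_hot` and `ordinal` are made up for this illustration and are not the scikit-learn implementation.

```python
def one_hot(values):
    """Return (categories, vectors): one 0/1 vector per input value."""
    categories = sorted(set(values))  # n unique values -> n columns
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # a single 1 at the position of this category
        vectors.append(row)
    return categories, vectors


def ordinal(values, ordered_categories):
    """Replace each value by its position in an explicitly ordered list."""
    position = {cat: i for i, cat in enumerate(ordered_categories)}
    return [position[value] for value in values]


# No natural order between job titles: one-hot keeps them unrelated.
titles = ["back-end", "front-end", "full-stack", "back-end"]
categories, vectors = one_hot(titles)

# Sizes do have an order: ordinal encoding preserves it.
sizes = ["small", "large", "medium"]
codes = ordinal(sizes, ["small", "medium", "large"])

print(categories, vectors, codes)
```

With three unique titles each vector has three entries, while the sizes collapse into a single numeric column whose values respect small < medium < large.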

### Creating the training and testing dataset

In [231]:
# Drop the Country column: all the data is from Canada, which makes this column redundant.

data = data.drop(columns=["Country"])

In [232]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop("Salary", axis=1),  # predictors
    data["Salary"],  # target
    test_size=0.2,  # percentage of obs in test set
    random_state=42,  # seed to ensure reproducibility
)

X_train.shape, X_test.shape

((8197, 5), (2050, 5))
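The two shapes can be checked by hand. As far as I know, when `test_size` is a fraction, scikit-learn rounds the number of test rows up and gives the remaining rows to the training set. A small sketch of that arithmetic, using only the row counts shown above:

```python
import math

n_samples = 8197 + 2050  # total rows, taken from the two shapes above
test_size = 0.2

# Assumed rounding rule: test rows are rounded up, train rows get the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```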

### Using Scikit-learn

In [233]:
encoder = OneHotEncoder(categories="auto",drop=None,sparse_output=False,handle_unknown="infrequent_if_exist")

encoder.set_output(transform="pandas")

In [234]:
encoder.fit(X_train)

In [235]:
encoder.transform(X_train).shape

(8197, 91)

In [236]:
encoder.transform(X_test).shape

(2050, 91)

#### Using Column Transformer

In [237]:
CompanySizeOrderedCategories = ['2 to 9 employees', '10 to 19 employees', '20 to 99 employees',
     '100 to 499 employees', '500 to 999 employees', '1,000 to 4,999 employees',
     '5,000 to 9,999 employees', '10,000 or more employees']

ExperienceOrderedCategories = ['0 to 1 years', '2 to 4 years', '5 to 9 years','10 or more years']
transformer = ColumnTransformer(
    transformers=[
        ('oe_CompanySize', OrdinalEncoder(categories=[CompanySizeOrderedCategories]),['Company Size']),
        ('oe_Experience', OrdinalEncoder(categories=[ExperienceOrderedCategories]), ['Experience']),
        ('categorical', OneHotEncoder(categories="auto",drop=None,sparse_output=False,handle_unknown="infrequent_if_exist"), ["Industry","Title","City"])
    ],remainder="passthrough")

transformer.set_output(transform="pandas")

In [238]:
transformer.fit(X_train)

In [249]:
X_train_enc = transformer.transform(X_train)
X_test_enc = transformer.transform(X_test)

X_test_enc.head(1)

Unnamed: 0,oe_CompanySize__Company Size,oe_Experience__Experience,categorical__Industry_Advertising Services,categorical__Industry_Financial Services,categorical__Industry_Healthcare,categorical__Industry_Higher Education,"categorical__Industry_Information Services, IT, Software Development, or other Technology",categorical__Industry_Insurance,categorical__Industry_Legal Services,"categorical__Industry_Manufacturing, Transportation, or Supply Chain",...,categorical__City_Saint John–St. Stephen Region,categorical__City_Saskatoon–Biggar Region,categorical__City_Southeast Region,categorical__City_Southern Region,categorical__City_Thompson–Okanagan Region,categorical__City_Toronto Region,categorical__City_Vancouver Island and Coast Region,categorical__City_West Coast–Northern Peninsula–Labrador Region,categorical__City_Windsor-Sarnia Region,categorical__City_Winnipeg Region
5265,3.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
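The first two columns of this row can be read back against the ordered lists: `3.0` is the position of '100 to 499 employees' in the company size list and `2.0` is the position of '5 to 9 years' in the experience list (both counted from 0). A minimal sketch of the same behaviour on a few toy rows:

```python
from sklearn.preprocessing import OrdinalEncoder

experience_order = ["0 to 1 years", "2 to 4 years", "5 to 9 years", "10 or more years"]

# Passing the categories explicitly fixes the order instead of sorting alphabetically.
encoder = OrdinalEncoder(categories=[experience_order])

rows = [["5 to 9 years"], ["0 to 1 years"], ["10 or more years"]]
codes = encoder.fit_transform(rows).ravel().tolist()

print(codes)
```

Each code is simply the index of the value in `experience_order`.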


#### Pipeline Model Training

In [240]:
pipeline = Pipeline(steps=[('preprocessor', transformer),
                             ('regressor', LinearRegression())])

In [241]:
pipeline.fit(X_train, y_train)

In [242]:
predictions = pipeline.predict(X_test)
predictions

array([ 70348., 131204.,  60520., ...,  56000.,  70908.,  77580.])

In [250]:
Copy_X_test = X_test.copy()
Copy_X_test["Salary"] = y_test
Copy_X_test["Predicted Salary"] = predictions
Copy_X_test.head(5)

Unnamed: 0,Company Size,Industry,Experience,Title,City,Salary,Predicted Salary
5265,100 to 499 employees,"Information Services, IT, Software Development...",5 to 9 years,"Developer, QA or test",Windsor-Sarnia Region,86220.0,70348.0
2167,"10,000 or more employees","Information Services, IT, Software Development...",5 to 9 years,Data scientist or machine learning specialist,Calgary Region,187402.0,131204.0
9752,"10,000 or more employees","Information Services, IT, Software Development...",5 to 9 years,"Developer, back-end",North Shore Region,55000.0,60520.0
5009,"1,000 to 4,999 employees","Information Services, IT, Software Development...",5 to 9 years,"Developer, desktop or enterprise applications",Hamilton–Niagara Peninsula Region,83194.0,95284.0
9251,100 to 499 employees,Legal Services,10 or more years,"Developer, front-end",Cape Breton Region,53977.272727,53640.0


#### Analysing the Model Accuracy

In [244]:
mse = mean_squared_error(y_test, predictions)
math.sqrt(mse)

30329021058224.336

In [245]:
mae = mean_absolute_error(y_test, predictions)
mae

669856032218.2335
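These two numbers are far larger than any salary in the data, so they deserve a closer look. When a single row carries almost all of the error, RMSE divided by MAE comes out close to the square root of the number of rows. The sketch below defines both metrics in plain Python, checks that property on synthetic data, and then compares it with the two values reported above (2050 is the test set size from the split).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


n = 2050  # number of rows in the test set

# Synthetic case: every prediction is perfect except one very bad row.
y_true = [0.0] * n
y_pred = [0.0] * (n - 1) + [1_000_000.0]
synthetic_ratio = rmse(y_true, y_pred) / mae(y_true, y_pred)

# Values copied from the two cells above.
reported_rmse = 30329021058224.336
reported_mae = 669856032218.2335
reported_ratio = reported_rmse / reported_mae

print(synthetic_ratio, reported_ratio, math.sqrt(n))
```

All three printed values are about 45.28, which suggests that one test row (or very few) is responsible for nearly all of the error, while the other predictions, such as the five rows shown earlier, are in a plausible range. One possible cause, which I have not verified here, is a category in the test set that the encoder never saw during fitting.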

In [246]:
data["Company Size"].unique()

array(['500 to 999 employees', '10 to 19 employees',
       '1,000 to 4,999 employees', '100 to 499 employees',
       '20 to 99 employees', '10,000 or more employees',
       '2 to 9 employees', '5,000 to 9,999 employees'], dtype=object)

#### Testing the Model With Custom Input

In [247]:
Company_Size = ['500 to 999 employees']*4
Experience = ['0 to 1 years', '2 to 4 years', '5 to 9 years','10 or more years']
Industry = ['Information Services, IT, Software Development, or other Technology']*4
Title = ['Developer, back-end']*4
City = ['Montréal Region']*4
testData = {
    'Company Size': Company_Size,
    'Experience': Experience,
    'Industry':Industry,
    'Title': Title,
    'City': City
}

testDF = pd.DataFrame(testData)
testDF

Unnamed: 0,Company Size,Experience,Industry,Title,City
0,500 to 999 employees,0 to 1 years,"Information Services, IT, Software Development...","Developer, back-end",Montréal Region
1,500 to 999 employees,2 to 4 years,"Information Services, IT, Software Development...","Developer, back-end",Montréal Region
2,500 to 999 employees,5 to 9 years,"Information Services, IT, Software Development...","Developer, back-end",Montréal Region
3,500 to 999 employees,10 or more years,"Information Services, IT, Software Development...","Developer, back-end",Montréal Region


In [248]:
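# Scale the predicted annual salary by 1.33, then divide by 12 * 4 * 5 * 8 = 1920
# (months * weeks * days * hours) to express it as an hourly figure.
(pipeline.predict(testDF)*1.33)/(12*4*5*8)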

array([27.13754167, 38.73625   , 50.33772917, 61.9364375 ])

### Exporting the Model

In [None]:
# joblib.dump(pipeline, f"{basePath}/Canada.joblib")
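For reference, here is a minimal sketch of a save and reload round trip with `joblib`. It uses a tiny toy regression instead of the real pipeline, and the file name `toy_model.joblib` is made up for the example; the same two calls should work for a fitted `Pipeline` object.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Toy model standing in for the fitted pipeline: y = 2x + 1.
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # serialise the fitted estimator
    restored = joblib.load(model_path)  # load it back from disk

before = model.predict([[3.0]])
after = restored.predict([[3.0]])

print(before, after)
```

The reloaded object should give the same predictions as the original, as long as it is loaded with a compatible scikit-learn version.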