<font color="#483D8B">
<h1 align="center"> Semi-Supervised Learning</h1>
<h3 align="center"> Jeremy Lopez</h3>
<h3 align="center"> 11/28/18</h3>
</font>

***

# Overview

This notebook predicts how long a Mercedes-Benz car spends in testing based on its features. Note that the car features have been kept anonymous in the dataset (X0, X1, X2, ...). Because the dataset contains a fair amount of unlabeled data, the notebook also demonstrates an implementation of a semi-supervised learning method known as pseudo-labeling. Before the implementation begins, the provided dataset is preprocessed and explored. Once the data is ready, a pseudo-labeling regression model is implemented and tested using labeled and unlabeled data. The implemented model is then compared with an XGBoost regression model that does not perform pseudo-labeling.


***

# Background

### Supervised Learning

* Supervised learning is where you have both input variables AND an output variable


* An algorithm can be used to learn the mapping function from the input to the output
    * Y = f(X)
    
    
* The process of an algorithm learning from the training dataset can be pictured as:
    * A teacher supervising the learning process
    * Teacher has the correct answers
    * Algorithm iteratively makes predictions on the training data and is corrected by the teacher
    * Learning stops when the algorithm achieves an acceptable level of performance
    
    
* Examples:
    * Classification
        * output variable is a category
        * "red", "blue", "disease", "no disease"
    * Regression 
        * output variable is a real value
        * "dollars", "weight"
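
As a minimal sketch of the mapping Y = f(X), the snippet below fits a regression model on a handful of labeled points. The toy data and the choice of `LinearRegression` are assumptions made purely for illustration; they are not part of the Mercedes-Benz dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy labeled data (illustrative assumption): every input has a known output
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0  # the "correct answers" follow Y = 2X + 1

# Learn the mapping f from the labeled examples
reg = LinearRegression().fit(X, y)

# Predict the output for an input the model has not seen
prediction = reg.predict(np.array([[5.0]]))[0]
print(round(prediction, 2))
```

Because the output is a real value, this is a regression task; swapping in a classifier and categorical labels would give the classification case.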
        

### Unsupervised Learning

* Unlike supervised learning, unsupervised learning deals with input data that has no corresponding output variables


* The main goal with unsupervised learning is to model the distribution of the data in order to learn more about it.


* Algorithms are left to find and reveal the underlying structure of the data
    * There is no supervising teacher
    * With no teacher, there are no right answers
    
    
* Examples:
    * Clustering
        * want to discover groupings in the data
        * grouping customers by purchasing behavior
    * Association
        * discover rules that describe larger portions of your data
        * "people who buy X also tend to buy Y"
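
The clustering case can be sketched with a few unlabeled points. The data and the use of `KMeans` with two clusters are illustrative assumptions; no output variable is given to the algorithm at any point.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy unlabeled data (illustrative assumption): two well-separated groups
points = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 5.0], [5.1, 5.2],
])

# Let the algorithm discover the groupings on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = kmeans.labels_
print(cluster_ids)
```

The cluster numbers themselves are arbitrary; only the grouping of points carries meaning.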
        
        
### Semi-Supervised Learning
    

* A large amount of input data (X) is available, but only some of the corresponding output data (Y) is known


* Semi-supervised learning aims to infer the unknown outputs, that is, the labels of the unlabeled data


* Supervised and unsupervised learning techniques can be used to make predictions for the unlabeled data


* This notebook approaches a semi-supervised learning problem, on a dataset in which half of the rows are unlabeled, using a technique called pseudo-labeling


### What is Pseudo-labeling?

* First, train the model on the available labeled data


* Then, use the trained model to predict labels for the unlabeled data, which creates "pseudo-labels"


* Lastly, combine both the labeled and newly pseudo-labeled data in a new dataset that is used to train the model


* The pseudo-labeling process can be summarized with the image shown below: 

<img src="https://datawhatnow.com/wp-content/uploads/2017/08/pseudo-labeling-683x1024.png" alt="Testing" style="width:300px;height:450px">
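
The three steps can also be sketched in a few lines of code. This is a minimal illustration on synthetic data, not the implementation used later in this notebook; the `Ridge` model, the array names, and the 50/150 labeled/unlabeled split are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Synthetic data (illustrative assumption): 200 rows, only 50 of them labeled
X_all = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_all = X_all @ true_coef + rng.normal(scale=0.1, size=200)

X_labeled, y_labeled = X_all[:50], y_all[:50]
X_unlabeled = X_all[50:]  # the labels for these rows are withheld

# Step 1: train the model on the available labeled data
model = Ridge(alpha=1.0)
model.fit(X_labeled, y_labeled)

# Step 2: predict labels for the unlabeled data (the "pseudo-labels")
pseudo_labels = model.predict(X_unlabeled)

# Step 3: combine labeled and pseudo-labeled data, then retrain the model
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model.fit(X_combined, y_combined)

print(X_combined.shape, y_combined.shape)
```

Here all of the pseudo-labeled rows are used in step 3; in practice one may prefer to mix in only a fraction of them.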

***

# Data

The dataset used in this notebook is from the Mercedes-Benz Greener Manufacturing Kaggle competition. The data ('train.csv' and 'test.csv') can be downloaded from the competition website (https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data).

In [33]:
import pandas as pd

# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape, test.shape)
# (4209, 378) (4209, 377)

(4209, 378) (4209, 377)


In [34]:
train.head()

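|   | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | k | v | at | a | d | u | j | o | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | k | t | av | e | d | y | l | o | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | az | w | n | c | d | x | j | x | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | az | t | n | f | d | x | l | e | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | az | v | n | f | d | h | d | n | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |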


***

# Exploratory Data Analysis


A quick glance at the data shows that the first several features hold non-numerical values. Features X0-X8 are categorical variables, so we need to transform them into numerical values. This can be done using scikit-learn’s LabelEncoder class.

In [35]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

features = train.columns[2:]

for column_name in features:
    label_encoder = LabelEncoder() 
    
    # Get the column values
    train_column_values = list(train[column_name].values)
    test_column_values = list(test[column_name].values)
    
    # Fit the label encoder
    label_encoder.fit(train_column_values + test_column_values)
    
    # Transform the feature
    train[column_name] = label_encoder.transform(train_column_values)
    test[column_name] = label_encoder.transform(test_column_values)
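
To make the encoding step concrete, here is a small sketch on made-up category values (the lists below are assumptions, not rows from the dataset). As in the cell above, the encoder is fit on the train and test values together so that a category appearing only in the test set still receives a code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for a single feature
train_values = ["k", "az", "k", "t"]
test_values = ["az", "w"]  # "w" never appears in the training values

# Fit on the union so every category is known to the encoder
encoder = LabelEncoder()
encoder.fit(train_values + test_values)

train_encoded = encoder.transform(train_values)
test_encoded = encoder.transform(test_values)

print(list(encoder.classes_))
print(list(train_encoded), list(test_encoded))
```

The integer codes follow the sorted order of the category names, so they carry no meaningful ranking; they are simply a numeric stand-in for each category.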

In [43]:
train.head()

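|   | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | ... | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 130.81 | 37 | 23 | 20 | 0 | 3 | 27 | 9 | 14 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 6 | 88.53 | 37 | 21 | 22 | 4 | 3 | 31 | 11 | 14 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 7 | 76.26 | 24 | 24 | 38 | 2 | 3 | 30 | 9 | 23 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 9 | 80.62 | 24 | 21 | 38 | 5 | 3 | 30 | 11 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 78.02 | 24 | 23 | 38 | 5 | 3 | 14 | 3 | 13 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |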


The data is now ready for the model implementation.

***

# Models

### Implementation


First, we will create a function that builds an "augmented training set" consisting of both pseudo-labeled and labeled data. The arguments of the function are the model, the training and test set information (data and features), and a parameter called `sample_rate`. The `sample_rate` parameter controls the fraction of the pseudo-labeled data that is mixed with the initially labeled data. For example, setting `sample_rate` to 0 means that the model will only use the labeled data that is already available. Setting `sample_rate` to 0.5 means that the model will use all the labeled data plus half of the pseudo-labeled data.


In [36]:
from sklearn.utils import shuffle

def create_augmented_train(X, y, model, test, features, target, sample_rate):
    '''
    Create and return the augmented training set, which consists
    of pseudo-labeled and labeled data.
    '''
    # Number of pseudo-labeled rows to mix into the training set
    num_of_samples = int(len(test) * sample_rate)

    # Train the model and create the pseudo-labels
    model.fit(X, y)
    pseudo_labels = model.predict(test[features])

    # Add the pseudo-labels to a copy of the test set
    augmented_test = test.copy(deep=True)
    augmented_test[target] = pseudo_labels

    # Take a subset of the test set with pseudo-labels and append it onto
    # the training set
    sampled_test = augmented_test.sample(n=num_of_samples)
    temp_train = pd.concat([X, y], axis=1)
    augmented_train = pd.concat([sampled_test, temp_train])
    
    # Shuffle the augmented dataset and return it
    return shuffle(augmented_train)
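
As a quick worked example of how `sample_rate` translates into a row count, the snippet below applies the same `int(len(test) * sample_rate)` arithmetic to the 4209 test rows reported when the data was loaded. The particular rates and the `samples_by_rate` name are chosen only for illustration.

```python
# Number of rows in test.csv, as printed when the data was loaded
num_test_rows = 4209

# Same arithmetic as create_augmented_train: int() truncates toward zero
samples_by_rate = {
    rate: int(num_test_rows * rate)
    for rate in (0.0, 0.2, 0.5, 1.0)
}

for rate, num_of_samples in samples_by_rate.items():
    print(f"sample_rate={rate}: {num_of_samples} pseudo-labeled rows")
```

Since `int()` truncates, a rate of 0.5 yields 2104 rows rather than 2105.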

Next, we will need a fit function that trains the model using the augmented training set. It takes the same arguments as the `create_augmented_train()` function, because it passes them through whenever `sample_rate` is greater than 0.

In [37]:
def fit(X, y, model, test, features, target, sample_rate):
    
    # Train the model on the augmented training set if sample_rate > 0.0;
    # otherwise train on the labeled data only
    if sample_rate > 0.0:
        augmented_train = create_augmented_train(
            X, y, model, test, features, target, sample_rate
        )
        model.fit(
            augmented_train[features],
            augmented_train[target]
        )
    else:
        model.fit(X, y)
        
    return model

Since both of our implemented functions take in a lot of arguments, it will be easier to create a class in order to unify and clean all of our code. A class will also give us the opportunity to add more functions if we need any. Below is our defined class, `PseudoLabeler`, which contains the previously defined functions.

In [38]:
from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, RegressorMixin

class PseudoLabeler(BaseEstimator, RegressorMixin):
    
    def __init__(self, model, test, features, target, sample_rate=0.2, seed=42):
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        
        self.test = test
        self.features = features
        self.target = target
        
    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "test": self.test,
            "features": self.features,
            "target": self.target
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

        
    def fit(self, X, y):
        # Train on the augmented set when pseudo-labels are requested;
        # otherwise train on the labeled data only
        if self.sample_rate > 0.0:
            augmented_train = self.create_augmented_train(X, y)
            self.model.fit(
                augmented_train[self.features],
                augmented_train[self.target]
            )
        else:
            self.model.fit(X, y)
        
        return self


    def create_augmented_train(self, X, y):
        # Use the test set stored on the instance, not a global variable
        num_of_samples = int(len(self.test) * self.sample_rate)
        
        # Train the model and create the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.test[self.features])
        
        # Add the pseudo-labels to a copy of the test set
        augmented_test = self.test.copy(deep=True)
        augmented_test[self.target] = pseudo_labels
        
        # Take a subset of the test set with pseudo-labels and append it onto
        # the training set
        sampled_test = augmented_test.sample(n=num_of_samples)
        temp_train = pd.concat([X, y], axis=1)
        augmented_train = pd.concat([sampled_test, temp_train])

        return shuffle(augmented_train)
        
    def predict(self, X):
        return self.model.predict(X)
    
    def get_model_name(self):
        return self.model.__class__.__name__

Now that the `PseudoLabeler` class has been fully implemented, we need a base regressor to pass as its `model` parameter when initializing the class. Below, we will compare several regressors from the scikit-learn and XGBoost libraries using the cross-validated coefficient of determination (R2).
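
For reference, the coefficient of determination is defined as R2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the true values. The sketch below checks that definition against scikit-learn's `r2_score` on a few made-up values (the arrays are assumptions for illustration only).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean.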

In [39]:
from xgboost import XGBRegressor
from sklearn.linear_model import BayesianRidge, Ridge, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

In [40]:
model_factory = [
    RandomForestRegressor(),
    XGBRegressor(nthread=1),
    MLPRegressor(),
    Ridge(),
    BayesianRidge(),
    ExtraTreesRegressor(),
    ElasticNet(),
    KNeighborsRegressor(),
    GradientBoostingRegressor()
]

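# Labeled features and target used for cross-validation
X_train, y_train = train[features], train['y']

for model in model_factory:
    model.seed = 42
    num_folds = 3

    scores = cross_val_score(model, X_train, y_train, cv=num_folds, scoring='r2', n_jobs=8)
    score_description = " %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)

    # Note: the "CV-5" label is hard-coded; this loop uses num_folds = 3
    print('{model:25} CV-5 R2: {score}'.format(
        model=model.__class__.__name__,
        score=score_description
    ))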

RandomForestRegressor     CV-5 R2:  0.43 (+/- 0.02)
XGBRegressor              CV-5 R2:  0.55 (+/- 0.06)
MLPRegressor              CV-5 R2:  0.49 (+/- 0.07)
Ridge                     CV-5 R2:  0.52 (+/- 0.08)
BayesianRidge             CV-5 R2:  0.54 (+/- 0.07)
ExtraTreesRegressor       CV-5 R2:  0.37 (+/- 0.07)
ElasticNet                CV-5 R2:  0.35 (+/- 0.06)
KNeighborsRegressor       CV-5 R2:  0.27 (+/- 0.04)
GradientBoostingRegressor CV-5 R2:  0.54 (+/- 0.07)


Comparing the results of each model, XGBRegressor achieved the highest mean R2 score (0.55), with a spread (+/- 0.06) in line with the other models. Therefore, the `PseudoLabeler` class will be used with the XGBRegressor model as its base regressor. This is shown below.

In [41]:
target = 'y'

# Preprocess the data
X_train, X_test = train[features], test[features]
y_train = train[target]

# Create the PseudoLabeler with XGBRegressor as the base regressor
model = PseudoLabeler(
    XGBRegressor(nthread=1),
    test,
    features,
    target
)

# Train the model and use it to predict
model.fit(X_train, y_train)
model.predict(X_train)

array([117.541885,  92.00364 ,  76.15129 , ..., 110.52514 ,  92.19561 ,
        94.57087 ], dtype=float32)

### Performance (vs Default XGBoost Regression)


To test out the `PseudoLabeler`, we will compare it to the default XGBRegressor model using the R2-score as the evaluation metric again. 

In [42]:
model_factory = [
    XGBRegressor(nthread=1),
    
    PseudoLabeler(
        XGBRegressor(nthread=1),
        test,
        features,
        target,
        sample_rate=0.3
    ),
]

for model in model_factory:
    model.seed = 42
    num_folds = 8
    
    scores = cross_val_score(model, X_train, y_train, cv=num_folds, scoring='r2', n_jobs=8)
    score_description = "R2: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2)

    print('{model:25} CV-{num_folds} {score_cv}'.format(
        model=model.__class__.__name__,
        num_folds=num_folds,
        score_cv=score_description
    ))

XGBRegressor              CV-8 R2: 0.5671 (+/- 0.1596)
PseudoLabeler             CV-8 R2: 0.5693 (+/- 0.1604)


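As you can see, the PseudoLabeler has a slightly higher mean score (0.5693 vs. 0.5671) but also a marginally larger spread (+/- 0.1604 vs. +/- 0.1596). The gain is far smaller than the variation across folds, so these results suggest, at best, a slight advantage for the PseudoLabeler over the default XGBRegressor model.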
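
To put the two scores in perspective, the reported numbers can be compared directly. The short calculation below uses only the values printed above; the variable names are chosen for illustration, and expressing the gain in units of the fold-to-fold standard deviation is just one rough way to judge its size.

```python
# Values reported above as: mean (+/- 2 * standard deviation)
xgb_mean, xgb_two_std = 0.5671, 0.1596
pseudo_mean, pseudo_two_std = 0.5693, 0.1604

# Gain in mean R2 from pseudo-labeling
mean_gain = pseudo_mean - xgb_mean

# Standard deviation across folds for the baseline model
fold_std = xgb_two_std / 2

# Express the gain relative to the fold-to-fold variation
gain_in_std_units = mean_gain / fold_std

print(f"gain = {mean_gain:.4f}, fold std = {fold_std:.4f}, "
      f"ratio = {gain_in_std_units:.3f}")
```

The gain amounts to only a few percent of one standard deviation across folds, which is why it should be read as a small effect.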

***

# Conclusion



Pseudo-labeling allows us to utilize unlabeled data while training machine learning models. It can also improve the performance of other models if the process is tuned and made to work properly. In this case, however, the XGBRegressor model improved only slightly when it was wrapped with pseudo-labeling. Note that this dataset was obtained from a Kaggle competition, where even a very slight improvement in the R2 score can make a difference. In the context of the actual business problem, however, the gain is not very impressive, and the time spent implementing the PseudoLabeler class could instead have been used to improve the default XGBRegressor model and reduce the number of features in the dataset. Still, we end up with a model with a decent R2 score that predicts the amount of testing time for a vehicle given a combination of features.

References:

[1.] https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/
    
[2.] https://github.com/Weenkus/DataWhatNow-Codes/blob/master/pseudo_labeling_a_simple_semi_supervised_learning_method/pseudo_labeling_a_simple_semi_supervised_learning_method.ipynb
    
[3.] https://www.analyticsvidhya.com/blog/2017/09/pseudo-labelling-semi-supervised-learning-technique/
    
[4.] https://scikit-learn.org/stable/modules/label_propagation.html
    
[5.] https://www.kaggle.com/residentmario/notes-on-semi-supervised-learning

[6.] https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/