## Introduction

The goal of this notebook is to use previous learning materials to create predictions on the test dataset using an SGDClassifier and averages over multiple columns.

In the previous notebook we used averages for **ip**, **app**, **channel**...

In this notebook we will also use averages computed over the combination of **app** AND **channel**.

To cope with averages over multiple columns, the **AverageManager** class has been modified.
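
As a minimal sketch of the idea (the two small chunks and the `sum_and_count` helper are invented for illustration and are not part of the notebook), per-group sums and counts can be computed over several columns at once and accumulated from one chunk to the next:

```python
import pandas as pd

# Two hypothetical chunks of click data (values invented for illustration)
chunk_1 = pd.DataFrame({
    'app': [1, 1, 2, 2],
    'channel': [10, 10, 10, 20],
    'is_attributed': [1, 0, 0, 1],
})
chunk_2 = pd.DataFrame({
    'app': [1, 2, 3],
    'channel': [10, 20, 30],
    'is_attributed': [1, 1, 0],
})


def sum_and_count(df, by, target):
    """Hypothetical helper: per-group sum and count of the target."""
    return df.groupby(by)[target].agg(['sum', 'count'])


by = ['app', 'channel']
stats = sum_and_count(chunk_1, by, 'is_attributed')
# fill_value=0 keeps the groups that appear in only one of the two tables
stats = stats.add(sum_and_count(chunk_2, by, 'is_attributed'), fill_value=0)
# The average is derived from the accumulated sum and count
stats['average'] = stats['sum'] / stats['count']
print(stats)
```

Keeping the sum and the count, and not the average itself, is what makes it possible to update the statistics chunk after chunk.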

Please note that the code in this notebook uses advanced pandas and Python features, so you may need some time to really understand each line.

PLEASE ask questions.

At the end of this notebook you will know how to:
 - use the SGDClassifier **partial_fit()** method to update a classifier with newly received samples
 - manage several target averages using a Python **dict**
 - compute averages over multiple columns
 - create test predictions in gzip format
 
A public Kaggle kernel has been created to help you submit the predictions to the LeaderBoard:

https://www.kaggle.com/ogrellier/sgd-using-averages-of-previous-chunk

This will make it simpler for you to fork the script and use it for your own experiments.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.linear_model import SGDClassifier
import time
import gc

gc.enable()

Please change `file_path` so that it points to the train file on your system.

In [2]:
file_path = "../input/train.csv.zip"

Specify data types to limit memory usage
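
To get a feel for the saving, here is a small sketch on synthetic data (the column values and `n_rows` are assumptions, not the real click data) that compares the default 64-bit integers with the compact types:

```python
import numpy as np
import pandas as pd

n_rows = 100_000
rng = np.random.default_rng(0)

# Synthetic stand-in for a few click columns, explicitly stored as int64
raw = pd.DataFrame({
    'app': rng.integers(0, 500, n_rows, dtype=np.int64),
    'channel': rng.integers(0, 500, n_rows, dtype=np.int64),
    'is_attributed': rng.integers(0, 2, n_rows, dtype=np.int64),
})

# Same data stored with the compact types
small = raw.astype({'app': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8'})

raw_bytes = raw.memory_usage(index=False).sum()
small_bytes = small.memory_usage(index=False).sum()
print("int64 columns  : %d bytes" % raw_bytes)
print("compact columns: %d bytes" % small_bytes)
```

Passing the types directly to `pd.read_csv` through the `dtype` argument, as done in this notebook, avoids ever holding the 64-bit version in memory.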

In [3]:
dtypes = {
        'ip': 'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
    }
cols = [f_ for f_ in dtypes.keys()]

Create an average manager class

In [4]:
class AverageManager(object):
    """ Class that manages target averages for selected features and target encodes these features """
    def __init__(self, features, target):
        """ 
        Init an average manager for the given features and the specified target
        :param features : expected to be a list of lists of features used to group the data
        :param target : name of the target feature
        """
        # Check features: each entry must itself be a list of column names
        for f_ in features:
            if not isinstance(f_, list):
                raise ValueError('Features are expected to be provided as a list of lists')
        # averages contains the average data
        self.averages = {
            tuple(f_): None for f_ in features
        }
        # Prior contains the estimated prior of the target 
        self.prior = {'cum_sum': 0.0, 'nb_samples': 0.0}
        # Contains the name of the target column in the DataFrames
        self.target = target
        
    def update_averages(self, df):
        """Update averages information using samples available in df"""
        # update prior
        self.prior['cum_sum'] += df[self.target].sum()
        self.prior['nb_samples'] += df.shape[0]
        
        for f_ in self.averages.keys():
            # Create the groupby
            the_group = df[list(f_) + [self.target]].groupby(list(f_)).agg(['sum', 'count'])
            the_group.columns = the_group.columns.droplevel(0)
    
            # Update the average
            if self.averages[f_] is None:
                self.averages[f_] = the_group
            else:
                # pandas .add method makes sure groups that are not in both the_group and current averages
                # take a value of 0 before the addition takes place
                self.averages[f_] = the_group.add(self.averages[f_], fill_value=0.0)
            
            del the_group
            gc.collect()
            
    def apply_averages(self, df):
        """Apply calculated averages on df to target encode the features"""
        encoded = pd.DataFrame()
        for f_ in self.averages.keys():
            # Check averages are fitted
            if self.averages[f_] is None:
                raise ValueError('Averages have not been fitted yet')
            # Compute the average
            self.averages[f_]['average'] = self.averages[f_]['sum'] / self.averages[f_]['count']
            
            # The next few lines of code are here to allow mapping averages to several columns
            # Pandas does not allow using the map statement on several columns so we need to group them into 
            # one single column.
            # THIS IS THE TRICKY PART OF THIS CODE !
            # Now we need to encode for potentially several columns
            feat_name = '_' + '_'.join(list(f_))
            # Now group the columns as a single string to make it only one 
            add_str_feature(df, list(f_), feat_name)
            # Now group the average index columns as one string column and reindex using this string column
            the_average = self.averages[f_].reset_index()
            add_str_feature(the_average, list(f_), feat_name)
            the_average.set_index(feat_name, inplace=True)
            
            # finally map on the single string column
            encoded[feat_name] = df[feat_name].map(the_average['average']).astype(np.float32)
            prior = self.prior['cum_sum'] / self.prior['nb_samples']
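            # Assign the result back: an in-place fillna on a selected column is not reliable in recent pandas
            encoded[feat_name] = encoded[feat_name].fillna(prior)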
            # Drop feat_name from df
            del df[feat_name]
            gc.collect()
        
        return encoded
        

def add_str_feature(df_, features, name):
    """
    Add a string column `name` to df_ that concatenates the given features.
    It builds the same kind of key (with a trailing '_') as : 
    df_[name] = df_[features].apply(lambda row: '_'.join(row.astype(str)), axis=1, raw=True)
    However:
     - The addition of series is faster than the apply statement 
     - apply(lambda x: str(x)) is faster than df_[f].astype(str)
    """
    df_[name] = ''
    for f in features:
        df_[name] += df_[f].apply(lambda x: str(x)) + '_'
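
The string key trick used in `apply_averages` can be tried in isolation. This is a simplified sketch: `make_key`, the tiny tables and the `prior` value are made up for illustration and only mimic what the class does.

```python
import pandas as pd


def make_key(df, features):
    """Hypothetical helper: join several columns into one string key,
    e.g. app=1, channel=10 gives '1_10_'."""
    key = pd.Series('', index=df.index)
    for f in features:
        key = key + df[f].astype(str) + '_'
    return key


# Averages for two (app, channel) pairs; values invented for illustration
averages = pd.DataFrame({
    'app': [1, 2],
    'channel': [10, 20],
    'average': [0.5, 0.25],
})
# Index the averages by the single string key
averages.index = make_key(averages, ['app', 'channel'])

# New rows to encode; the pair (3, 30) has never been seen
new_rows = pd.DataFrame({'app': [1, 2, 3], 'channel': [10, 20, 30]})
prior = 0.1  # hypothetical global mean of the target

# map works on one column only, hence the single string key
encoded = make_key(new_rows, ['app', 'channel']).map(averages['average']).fillna(prior)
print(encoded.tolist())  # [0.5, 0.25, 0.1]
```

The trailing separator matters: without any separator, app 1 with channel 10 and app 11 with channel 0 would both produce the key `110`.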
            

## Read train dataset, update averages and train SGDClassifier

Averages are updated on the fly and used to target encode the features one chunk after the other.

SGDClassifier is trained using partial_fit.
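
Here is a minimal sketch of incremental training with `partial_fit` on synthetic data. The data, the number of chunks and the use of the default loss are assumptions made to keep the example short; the notebook itself uses a logistic loss on target encoded features.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Default (hinge) loss is used here only to keep the sketch version independent
clf = SGDClassifier(random_state=0)

# Feed the classifier one synthetic chunk at a time
for _ in range(5):
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    # classes lists every label the classifier may meet;
    # it is required on the first call to partial_fit
    clf.partial_fit(X, y, classes=np.array([0, 1]))

# Evaluate on a fresh chunk the classifier has never seen
X_new = rng.normal(size=(1000, 2))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
accuracy = clf.score(X_new, y_new)
print("Accuracy on unseen chunk: %.3f" % accuracy)
```

Each call to `partial_fit` makes a single pass over the chunk it receives, so the classifier never needs the full dataset in memory.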

Proceeding this way limits the use of target values from future samples to encode earlier data, which is what happens when averages are calculated on the whole training set: overall averages would use events that occurred on the last day of the training set to encode events that occurred on the first day of the training set.

This is something we need to take care of when predicting on time series. 

Although I am not totally sure it is important in the TalkingData situation, I believe it would still limit overfitting.
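
The training cell in this notebook updates the averages with a chunk and then encodes that same chunk, so each chunk still sees its own targets. A stricter variant, sketched here as an assumption and not as the method used in this notebook, encodes each chunk with statistics from earlier chunks only and updates them afterwards. The two tiny chunks are invented for illustration.

```python
import pandas as pd

# Two hypothetical chunks, in time order
chunks = [
    pd.DataFrame({'app': [1, 1, 2], 'is_attributed': [1, 0, 0]}),
    pd.DataFrame({'app': [1, 2, 3], 'is_attributed': [1, 1, 0]}),
]

stats = None                    # running per-app sum and count
cum_sum, nb_samples = 0.0, 0    # running totals used for the prior
encoded_chunks = []

for chunk in chunks:
    if stats is None:
        # No history yet: the first chunk cannot be encoded
        encoded = pd.Series(float('nan'), index=chunk.index)
    else:
        prior = cum_sum / nb_samples
        past_average = stats['sum'] / stats['count']
        # Unseen apps fall back to the prior computed on past chunks
        encoded = chunk['app'].map(past_average).fillna(prior)
    encoded_chunks.append(encoded)

    # Only now is the current chunk added to the history
    new_stats = chunk.groupby('app')['is_attributed'].agg(['sum', 'count'])
    stats = new_stats if stats is None else stats.add(new_stats, fill_value=0)
    cum_sum += chunk['is_attributed'].sum()
    nb_samples += len(chunk)

print(encoded_chunks[1].tolist())
```

With this ordering the first chunk has no usable encoding, which is the price to pay for never looking at the targets of the rows being encoded.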

In [5]:
start_time=time.time()
# Create average manager
used_features = ['app', 'os', 'channel']
avg_features = [['app'], ['os'], ['app', 'channel']]
avg_man = AverageManager(features=avg_features, target='is_attributed')

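# Init Classifier
# Note: recent scikit-learn releases name this loss 'log_loss'; 'log' is only accepted by older versions
clf = SGDClassifier(loss='log', tol=1e-2)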

# Read train file 
chunksize=20000000
for i_chunk, df in enumerate(pd.read_csv(file_path, 
                                         chunksize=chunksize, 
                                         dtype=dtypes, 
                                         usecols=used_features + ['is_attributed'])):
    # Update averages with the average manager
    avg_man.update_averages(df)
    
    # Apply averages using the average manager
    target_encoding = avg_man.apply_averages(df)
    
    # Update the SGDClassifier using current target encoding and calling partial_fit
    clf.partial_fit(X=target_encoding, y=df['is_attributed'], classes=[0, 1])
    
    # Get current predictions
    preds = clf.predict_proba(target_encoding)[:, 1]
    
    # Display the log_loss and AUC score on the current chunk
    print("Chunk %3d scores : loss %.6f auc %.6f [%5.1f min used so far]"
          % (i_chunk + 1, 
             log_loss(df['is_attributed'], preds),
             roc_auc_score(df['is_attributed'], preds),
             (time.time() - start_time) / 60))
    
    del target_encoding
    gc.collect()

Chunk   1 scores : loss 0.011074 auc 0.953510 [  1.6 min used so far]
Chunk   2 scores : loss 0.012969 auc 0.957749 [  3.1 min used so far]
Chunk   3 scores : loss 0.012627 auc 0.969659 [  4.7 min used so far]
Chunk   4 scores : loss 0.011466 auc 0.947955 [  6.2 min used so far]
Chunk   5 scores : loss 0.013295 auc 0.962948 [  7.8 min used so far]
Chunk   6 scores : loss 0.011556 auc 0.967542 [  9.4 min used so far]
Chunk   7 scores : loss 0.011155 auc 0.945865 [ 10.9 min used so far]
Chunk   8 scores : loss 0.013511 auc 0.949089 [ 12.5 min used so far]
Chunk   9 scores : loss 0.013967 auc 0.957319 [ 14.1 min used so far]
Chunk  10 scores : loss 0.010261 auc 0.972942 [ 14.5 min used so far]


In [6]:
start_time=time.time()
# Create placeholder for the predictions
predictions = None
# PLEASE CHANGE THE TEST FILE PATH TO YOUR OWN SETTINGS
test_file_path = '../input/test.csv.zip'
chunksize = 5000000
# Read the test file by chunks
for i_chunk, df in enumerate(pd.read_csv(test_file_path, 
                                         chunksize=chunksize, 
                                         dtype=dtypes, 
                                         usecols=used_features + ['click_id'])):
    if predictions is None:
        # Get the click ids
        # double square brackets are used to return a DataFrame and not a Series
        predictions = df[['click_id']].copy() 
        # Encode df using average manager
        target_encoding = avg_man.apply_averages(df)
        # Predict probabilities with SGD Classifier
        predictions['is_attributed'] = clf.predict_proba(target_encoding)[:, 1]
    else:
        # double square brackets are used to return a DataFrame and not a Series
        curr_preds = df[['click_id']].copy() 
        # Encode df using average manager
        target_encoding = avg_man.apply_averages(df)
        # Predict probabilities with SGD Classifier
        curr_preds['is_attributed'] = clf.predict_proba(target_encoding)[:, 1]
        # Stack predictions and current predictions
        predictions: pd.DataFrame = pd.concat([predictions, curr_preds], axis=0)
        # free memory
        del curr_preds
        
    # Free memory by deleting the current DataFrame
    del df
    gc.collect()
    
    # Display the time we spent so far
    print("%3d Chunks have been read in %5.1f minute" 
          % (i_chunk + 1, (time.time() - start_time) / 60))
    

  1 Chunks have been read in   0.3 minute
  2 Chunks have been read in   0.6 minute
  3 Chunks have been read in   0.9 minute
  4 Chunks have been read in   1.2 minute


Now that we have our predictions we need to store them in a file for submission.

In this contest the submission file is quite big. To reduce its size, both for storage and for submission over the web, we will use the following arguments:
 - float_format : it is used to limit the number of decimals written for floats
 - compression : pandas can store files in a compressed format called gzip
 
Writing this file can take some time! On my disk the file takes 108778KB.

In the following statement:
 - **float_format** limits the number of decimals to 6
 - **compression** tells pandas to compress the file in gzip format
 - **index=False** tells pandas to write only the features themselves in the file, without the DataFrame index
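
These arguments can be tried on a tiny DataFrame first. In this sketch the values and the file name `demo_predictions.csv.gz` are invented, and the file is written to a temporary folder so that nothing is left on disk:

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical prediction table
preds = pd.DataFrame({
    'click_id': [0, 1, 2],
    'is_attributed': [0.123456789, 0.5, 1e-9],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_predictions.csv.gz')
    # 6 decimals, gzip compression, no index column
    preds.to_csv(path, float_format='%.6f', compression='gzip', index=False)
    # read_csv infers the gzip compression from the .gz extension
    reloaded = pd.read_csv(path)

print(reloaded)
```

Reading the file back shows that only the two columns were written and that the probabilities were rounded to 6 decimals.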

In [7]:
predictions.to_csv('app_predictions.csv.gz', float_format='%.6f', compression='gzip', index=False)

If you want to avoid compressing the file or sending it through the web, which may also take some time, the best option is to log into your Kaggle account and create a kernel. You can then submit the result directly to the LeaderBoard.

## Exercise

Please use the previous code to create test predictions with more features and more feature combinations:
 - ip
 - device
 - channel
 - ip, app
 - app, device
 - app, os
 
Then share your results and let us know whether this helps you progress on the LeaderBoard.