## Introduction

The goal of this notebook is to use previous learning materials to create predictions on the test dataset and make your first submission.

We will use the prediction mechanism demonstrated in **02_read_data_by_chunks_and_make_predictions.ipynb** to create estimations of the target based on the **app** feature.

At the end of this notebook you will know how to:
 - use a groupby statement to create target predictions
 - create test predictions in a gzip format
 
A public Kaggle kernel has been created to help you submit these simple predictions to the leaderboard:

https://www.kaggle.com/ogrellier/basic-predictions-using-target-encoding-on-app

This will make it simpler for you to fork the script and use it on Kaggle.

In [1]:
import pandas as pd
import numpy as np

Please change the file_path so that it points to where the train file is on your system.

In [2]:
file_path = "../input/train.csv.zip"

From the hard work you have done in the first notebook, you can define the best data type for each column.

Again in this notebook we will exclude all time related columns. 

Here are the data type definitions:

In [3]:
dtypes = {
        'ip': 'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
    }
cols = [f_ for f_ in dtypes.keys()]

## Creating simple statistics with the app feature

The goal of this section is to calculate the probability that a given **app**lication will be attributed.
In other words, we will try to compute:
$$P\left ( is\_attributed= 1\mid app \right )$$

To do this we have to go through all training samples, sum up the number of times is_attributed=1 for each **app**lication, and count the number of occurrences of each **app**lication in the train file.
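On a small in-memory sample, the sum-and-count idea can be sketched as follows. The `toy` DataFrame and its values are made up for illustration; only the column names come from the train file.

```python
import pandas as pd

# Hypothetical toy data: six clicks spread over three apps
toy = pd.DataFrame({
    "app": [3, 3, 3, 5, 5, 9],
    "is_attributed": [0, 1, 0, 1, 1, 0],
})

# Per app: number of attributed clicks (sum) and number of clicks (count)
toy_stats = toy.groupby("app")["is_attributed"].agg(["sum", "count"])

# Estimated P(is_attributed = 1 | app)
toy_stats["average"] = toy_stats["sum"] / toy_stats["count"]
print(toy_stats)
```

Dividing the sum by the count gives the share of attributed clicks for each app, which is the estimate we are after.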

As shown in a previous notebook, the simplest way to do this on the entire train.csv file is:

In [9]:
# Create app_average and make it None
import gc    # needed for the gc.collect() calls below
import time
app_average = None
start_time = time.time()
chunksize = 20000000
for i_chunk, df in enumerate(pd.read_csv(file_path, chunksize=chunksize, dtype=dtypes, usecols=['app', 'is_attributed'])):
    # Make the groupby statement
    # The groupby statement uses sum and count to be able to compute averages over all samples
    the_group = df.groupby("app").agg(['sum', 'count'])
    the_group.columns = the_group.columns.droplevel(0)
    if app_average is None:
        app_average = the_group
    else:
        # pandas .add method makes sure apps that are not in both the_group and app_average
        # take value of 0 before the addition takes place
        app_average = the_group.add(app_average, fill_value=0.0)

    # Free memory by deleting the current DataFrame
    del df, the_group
    gc.collect()
    
    # Display the time we spent so far
    print("%3d Chunks have been read in %5.1f minute" 
          % (i_chunk + 1, (time.time() - start_time) / 60))
    
app_average.head(10)

  1 Chunks have been read in   0.2 minute
  2 Chunks have been read in   0.4 minute
  3 Chunks have been read in   0.6 minute
  4 Chunks have been read in   0.8 minute
  5 Chunks have been read in   1.0 minute
  6 Chunks have been read in   1.2 minute
  7 Chunks have been read in   1.4 minute
  8 Chunks have been read in   1.6 minute
  9 Chunks have been read in   1.8 minute
 10 Chunks have been read in   1.8 minute


| app | sum | count |
|---|---|---|
| 0 | 1005.0 | 3248.0 |
| 1 | 1230.0 | 5796274.0 |
| 2 | 5661.0 | 21642136.0 |
| 3 | 10261.0 | 33911780.0 |
| 4 | 5.0 | 126275.0 |
| 5 | 27263.0 | 375533.0 |
| 6 | 205.0 | 2464136.0 |
| 7 | 1182.0 | 1764954.0 |
| 8 | 6875.0 | 3731948.0 |
| 9 | 18823.0 | 16458268.0 |
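To see why sums and counts are accumulated instead of per-chunk averages, here is a minimal sketch on made-up data. The `toy` DataFrame and the `sum_and_count` helper are hypothetical and only serve the illustration; the accumulation uses the same `.add(..., fill_value=0.0)` call as the loop above.

```python
import pandas as pd

# Hypothetical toy data standing in for the train file
toy = pd.DataFrame({
    "app": [1, 1, 2, 2, 2, 7],
    "is_attributed": [0, 1, 0, 0, 1, 1],
})


def sum_and_count(frame):
    """Return the per-app sum and count of is_attributed."""
    return frame.groupby("app")["is_attributed"].agg(["sum", "count"])


# Process the data in two "chunks"; app 7 only appears in the second one
running = None
for chunk in (toy.iloc[:3], toy.iloc[3:]):
    stats = sum_and_count(chunk)
    # fill_value=0.0 treats an app missing from one side as 0 instead of NaN
    running = stats if running is None else stats.add(running, fill_value=0.0)

# Reference result computed on the whole data in one pass
full = sum_and_count(toy)
print(running)
print(full)
```

Sums and counts add up exactly across chunks, whereas averaging per-chunk averages would weight small and large chunks equally. The alignment with `fill_value` is probably also why the sums and counts in the output above are displayed as floats.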


Now create the average per **app**lication.

In [10]:
app_average['average'] = app_average['sum'] / app_average['count'] 

## Read the test dataset and make predictions

To do this we need to read the test file by chunks, as we did with the train file, and **map** each **app**lication to the average calculated above.

In [15]:
start_time = time.time()
# Collect the predictions of each chunk, then stack them once after the loop
chunk_predictions = []
# PLEASE CHANGE THE TEST FILE PATH TO YOUR OWN SETTINGS
test_file_path = '../input/test.csv.zip'
chunksize = 5000000
# Read the test file by chunks
for i_chunk, df in enumerate(pd.read_csv(test_file_path, chunksize=chunksize, dtype=dtypes, usecols=['click_id', 'app'])):
    # Double square brackets are used to return a DataFrame and not a Series;
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    curr_preds = df[['click_id']].copy()
    curr_preds['is_attributed'] = df['app'].map(app_average['average']).astype(np.float32)
    chunk_predictions.append(curr_preds)

    # Free memory by deleting the current DataFrame
    del df
    gc.collect()

    # Display the time we spent so far
    print("%3d Chunks have been read in %5.1f minute"
          % (i_chunk + 1, (time.time() - start_time) / 60))

# Stack the predictions of all chunks
predictions = pd.concat(chunk_predictions, axis=0)

  1 Chunks have been read in   0.1 minute
  2 Chunks have been read in   0.1 minute
  3 Chunks have been read in   0.2 minute
  4 Chunks have been read in   0.2 minute
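One thing to keep in mind with `map`: an app that never occurred in the train file has no entry in the averages, so its prediction comes out as NaN. The sketch below shows this on made-up data and one possible fallback. The names `toy_average`, `toy_test` and `global_rate` are hypothetical, and filling with the overall attribution rate is an assumption, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical per-app statistics, standing in for app_average
toy_average = pd.DataFrame(
    {"sum": [1.0, 3.0], "count": [4.0, 6.0]},
    index=pd.Index([3, 5], name="app"),
)
toy_average["average"] = toy_average["sum"] / toy_average["count"]

# Hypothetical test chunk; app 42 has no entry in toy_average
toy_test = pd.DataFrame({"click_id": [0, 1, 2], "app": [3, 42, 5]})

# Apps without a matching index entry are mapped to NaN
mapped = toy_test["app"].map(toy_average["average"])
print(mapped.isna().sum())  # number of clicks whose app was not seen

# One possible fallback (an assumption, not part of the original notebook):
# use the overall attribution rate for unseen apps
global_rate = toy_average["sum"].sum() / toy_average["count"].sum()
toy_test["is_attributed"] = mapped.fillna(global_rate).astype(np.float32)
print(toy_test)
```

Checking `predictions['is_attributed'].isna().sum()` before writing the submission file is a cheap way to find out whether this matters for your data.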


Now that we have our predictions we need to store them in a file for submission.

In this contest the submission file is quite big. To reduce its size, both for storage and submission over the web, we will use the following arguments:
 - float_format : it limits the number of decimals written for floats
 - compression : pandas can store files in a compressed format called gzip
 
Writing this file can take some time! On my disk the file takes 108778KB.

In the following statement:
 - **float_format** limits the number of decimals to 6
 - **compression** tells pandas to compress the file in gzip format
 - **index=False** tells pandas to write only the columns themselves to the file, without the DataFrame index

In [17]:
predictions.to_csv('app_predictions.csv.gz', float_format='%.6f', compression='gzip', index=False)
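To check what these arguments do, the file can be written and read back on a small example. This is a minimal sketch: `toy_predictions` and the temporary file name are made up, and the real submission file is of course much larger.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for the real ones
toy_predictions = pd.DataFrame({
    "click_id": [0, 1, 2],
    "is_attributed": np.array([0.123456789, 0.5, 0.000001234], dtype=np.float32),
})

# Write to a temporary directory so nothing is left on disk afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_predictions.csv.gz")
    toy_predictions.to_csv(out_path, float_format="%.6f", compression="gzip", index=False)
    # pandas infers gzip compression from the .gz extension when reading
    reloaded = pd.read_csv(out_path)

print(reloaded)
```

The reloaded frame contains only the two columns, with the predictions rounded to 6 decimals.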

If you want to avoid compressing the file or sending it through the web, which may also take some time, the best option is to log into your Kaggle account and create a kernel. You can then submit the result directly to the leaderboard.

## Exercise

Please use the previous code to create test predictions for:
 - ip
 - device
 - channel
 - os
 
And give us your results.