## Introduction

The goal of this notebook is to demonstrate the use of **defaultdict** to build statistics.

**defaultdict** is a subclass of the python **dict** that supplies a default value for any key that does not exist yet.

You can read more [here](https://docs.python.org/2/library/collections.html) 

In particular, a **defaultdict** creates missing keys on the fly, whereas with a standard python **dict** object you have to check that a key exists before using it.
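To make the difference concrete, here is a minimal sketch on a few made-up `(ip, is_attributed)` pairs. The names `clicks`, `plain_sum` and `auto_sum` are illustrative only and are not used elsewhere in this notebook.

```python
from collections import defaultdict

# Made-up (ip, is_attributed) pairs, for illustration only
clicks = [("ip_a", 1), ("ip_b", 0), ("ip_a", 0)]

# Plain dict: the key must exist before it can be incremented
plain_sum = {}
for ip, attributed in clicks:
    if ip not in plain_sum:
        plain_sum[ip] = 0.0
    plain_sum[ip] += attributed

# defaultdict: a missing key is created with float() == 0.0 on first access
auto_sum = defaultdict(float)
for ip, attributed in clicks:
    auto_sum[ip] += attributed

print(dict(auto_sum))
```

Both loops end with the same content; the **defaultdict** version simply drops the existence check.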

At the end of this notebook, you will know how to:
 - go through the rows of a DataFrame using an iterator
 - create a **defaultdict**, add new keys on the fly and compute simple statistics
 - create predictions using the **map** method of pd.Series objects
 

In [1]:
import pandas as pd
import numpy as np

Please change the file_path so that it points to where the train file is on your system  

In [2]:
file_path = "../input/train.csv.zip"

From the hard work you have done in the first notebook, you can define the best data type for each column.

Again, in this notebook we will exclude all time-related columns.

Here are the data type definitions:

In [3]:
dtypes = {
        'ip': 'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
    }
cols = list(dtypes)

To read the data in chunks, we will use the chunksize argument of the pd.read_csv method.

chunksize is the maximum number of rows in each chunk.

pandas will read the file and give access to the first N rows, then the following N rows, and so on until the end of the file.
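As a small sketch of this behaviour, the snippet below reads a made-up five-row CSV held in memory (`csv_text` is a stand-in for the real train file, not part of the dataset) with a chunksize of 2.

```python
import io
import pandas as pd

# Made-up CSV with 5 rows, standing in for the real train file
csv_text = "ip,is_attributed\n1,0\n2,1\n1,1\n3,0\n2,0\n"

chunk_sizes = []
for i_chunk, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2)):
    # Each chunk is a regular DataFrame with at most `chunksize` rows
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
```

The last chunk holds whatever rows remain, so it can be smaller than chunksize.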

We will use **defaultdict** from the python collections module to compute simple averages.

In [4]:
# chunksize is the maximum number of rows each chunk should be made of
import time
import gc
# Enable garbage collection
gc.enable()

# Please adapt the chunksize to your memory setup
chunksize = 10000000

# Import defaultdict
from collections import defaultdict

def update_dicts(row, sum_dict, count_dict):
    # Helper equivalent to the body of the loop below (it is not called in this cell)
    # row[0] is the ip address
    # row[1] is is_attributed
    sum_dict[row[0]] += row[1]
    count_dict[row[0]] += 1
    
# Create defaultdicts for sum and count
ip_attributed = defaultdict(float)
ip_count = defaultdict(float)

start_time = time.time()
for i_chunk, df in enumerate(pd.read_csv(file_path, chunksize=chunksize, dtype=dtypes, usecols=cols)):
    print("%3d Chunks have been read in %5.1f minute" 
          % (i_chunk + 1, (time.time() - start_time) / 60))
    # Go through the rows of the current DataFrame chunk
    # Please note that this is a lot quicker than using the .apply method
    for ip, attributed in df[['ip', 'is_attributed']].values:
        ip_attributed[ip] += attributed
        ip_count[ip] += 1
    
    # Free memory by deleting the current DataFrame
    del df
    gc.collect()


  1 Chunks have been read in   0.1 minute
  2 Chunks have been read in   0.5 minute
  3 Chunks have been read in   1.0 minute
  4 Chunks have been read in   1.4 minute
  5 Chunks have been read in   1.8 minute
  6 Chunks have been read in   2.2 minute
  7 Chunks have been read in   2.6 minute
  8 Chunks have been read in   3.1 minute
  9 Chunks have been read in   3.5 minute
 10 Chunks have been read in   4.0 minute
 11 Chunks have been read in   4.4 minute
 12 Chunks have been read in   4.9 minute
 13 Chunks have been read in   5.3 minute
 14 Chunks have been read in   5.7 minute
 15 Chunks have been read in   6.2 minute
 16 Chunks have been read in   6.6 minute
 17 Chunks have been read in   7.1 minute
 18 Chunks have been read in   7.5 minute
 19 Chunks have been read in   7.8 minute


As you can see, this is a lot slower than the **groupby** method we used in the previous notebook. However, this simple loop lets you do more complicated tasks.
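To check that the two approaches agree, here is a minimal sketch on a made-up DataFrame (`toy` and the other names are illustrative assumptions, not objects from this notebook). It computes the per-ip average once with two **defaultdict** objects and once with **groupby**.

```python
from collections import defaultdict
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"ip": [1, 2, 1, 3, 2, 1],
                    "is_attributed": [0, 1, 1, 0, 0, 1]})

# Loop version: accumulate a sum and a count per ip
ip_sum = defaultdict(float)
ip_n = defaultdict(float)
for ip, attributed in toy[["ip", "is_attributed"]].values:
    ip_sum[ip] += attributed
    ip_n[ip] += 1
loop_mean = {int(ip): ip_sum[ip] / ip_n[ip] for ip in ip_sum}

# groupby version: one vectorised call
groupby_mean = toy.groupby("ip")["is_attributed"].mean().to_dict()

print(loop_mean)
print(groupby_mean)
```

The loop is slower, but its body can hold any Python logic, which is harder to express in a single groupby call.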

In [5]:
print("Number of keys in train : ", len(ip_attributed))
# Turn each sum of is_attributed into an average by dividing by the ip count
for key in ip_attributed:
    ip_attributed[key] /= ip_count[key]
    
del ip_count
gc.collect()

Number of keys in train :  277396


46

In [6]:
# Second pass over the file: map each ip to its average attribution rate
start_time = time.time()
# Create placeholders for target and predictions to be able to compute the AUC score once the process has completed
target = None
predictions = None 
for i_chunk, df in enumerate(pd.read_csv(file_path, chunksize=chunksize, dtype=dtypes, usecols=['ip', 'is_attributed'])):
    print("%3d Chunks have been processed in %5.1f minute" 
          % (i_chunk + 1, (time.time() - start_time) / 60))
    if target is None:
        target = df['is_attributed'].values
        predictions = df['ip'].map(dict(ip_attributed)).values
    else:
        target = np.hstack((target, df['is_attributed'].values))
        predictions = np.hstack((predictions, df['ip'].map(dict(ip_attributed)).values))
        
    # Free memory by deleting the current DataFrame
    del df
    gc.collect()

  1 Chunks have been processed in   0.1 minute
  2 Chunks have been processed in   0.2 minute
  3 Chunks have been processed in   0.4 minute
  4 Chunks have been processed in   0.5 minute
  5 Chunks have been processed in   0.6 minute
  6 Chunks have been processed in   0.8 minute
  7 Chunks have been processed in   0.9 minute
  8 Chunks have been processed in   1.0 minute
  9 Chunks have been processed in   1.2 minute
 10 Chunks have been processed in   1.3 minute
 11 Chunks have been processed in   1.4 minute
 12 Chunks have been processed in   1.6 minute
 13 Chunks have been processed in   1.7 minute
 14 Chunks have been processed in   1.8 minute
 15 Chunks have been processed in   2.0 minute
 16 Chunks have been processed in   2.1 minute
 17 Chunks have been processed in   2.3 minute
 18 Chunks have been processed in   2.4 minute
 19 Chunks have been processed in   2.5 minute
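
The cell above converts the **defaultdict** to a plain dict before calling **map**. The sketch below, on made-up values, shows how the two behave for an ip that was never seen. According to the pandas documentation for Series.map, a dict subclass that defines `__missing__` supplies its default instead of NaN. Whether this is the reason for the conversion above is an assumption, and the names `ip_rate` and `ips` are illustrative only.

```python
from collections import defaultdict
import pandas as pd

# Made-up per-ip averages, for illustration only
ip_rate = defaultdict(float, {1: 0.5, 2: 0.0})
ips = pd.Series([1, 2, 99])  # 99 is an ip that was never seen

# Plain dict: the unseen ip is mapped to NaN
mapped_plain = ips.map(dict(ip_rate))

# defaultdict: the unseen ip is mapped to the default value float() == 0.0
mapped_default = ips.map(ip_rate)

print(mapped_plain.tolist())
print(mapped_default.tolist())
```

With a plain dict, unseen keys stay visible as NaN, which makes them easy to detect and fill explicitly.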


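Display the AUC score of this simple prediction on the training dataset.

The averages were computed on the same rows they are scored against, so this score is likely optimistic compared with unseen data.

Please note that this may take some time.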

In [7]:
from sklearn.metrics import roc_auc_score
print("AUC score of predictions using ip on the whole dataset = %.6f"
      % (roc_auc_score(target, predictions)))

AUC score of predictions using ip on the whole dataset = 0.825532
