<h1>Email recipient recommendation</h1>

<i>Thomas Boudou, Guillaume Richard, Antoine Simoulin</i>

<p style="text-align: justify">It has been shown that, at work, employees frequently forget to include one or more recipients before sending a message. Conversely, a message often reaches recipients who were never intended to receive it. Effective <b>email recipient recommendation</b> systems are therefore needed, both to increase productivity and to prevent information leakage.

In this challenge, you are asked to develop such a system: given the content and the date of a message, it recommends a list of <b>10 recipients ranked in decreasing order of relevance</b>.</p>
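
Because the output is a ranked list of 10 recipients, the cells below score predictions with `mapk(..., k=10)` imported from `accuracy_measure`. That module is not shown in this notebook, so the following is only a minimal sketch of the usual mean average precision at k. The names `apk_sketch` and `mapk_sketch` are hypothetical, and the real implementation may differ in details such as how duplicates are handled.

```python
from typing import Hashable, Sequence


def apk_sketch(actual: Sequence[Hashable], predicted: Sequence[Hashable], k: int = 10) -> float:
    """Average precision at k for a single message (assumed definition)."""
    actual_set = set(actual)
    if not actual_set:
        return 0.0
    seen = set()   # recipients already credited, so duplicates do not count twice
    hits = 0
    score = 0.0
    for rank, recipient in enumerate(predicted[:k], start=1):
        if recipient in actual_set and recipient not in seen:
            seen.add(recipient)
            hits += 1
            score += hits / rank   # precision at this rank
    return score / min(len(actual_set), k)


def mapk_sketch(actual, predicted, k: int = 10) -> float:
    """Mean of apk_sketch over all messages."""
    scores = [apk_sketch(a, p, k) for a, p in zip(actual, predicted)]
    return sum(scores) / len(scores) if scores else 0.0
```

A correct recipient placed near the top of the list contributes more than the same recipient placed near the bottom, which is why the order of the 10 recommendations matters and not only their membership.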

In [1]:
# Requirements
%matplotlib inline
import random
import pandas as pd
import numpy as np
# do not display warnings
import warnings
warnings.filterwarnings("ignore")

# Helper modules are saved in the "src/" directory.
import sys
sys.path.append('src/')
from accuracy_measure import *

In [2]:
from load_data import *

# load files
# Data are saved in "data/" directory
path_to_data = 'data/'
training, training_info, test, test_info, y_df = load_data(path_to_data)

# create address book
# /!\ can take 1-2 min
address_books = create_address_books(training, y_df)

# join train and test files
X_df = join_data(training_info, training)
X_sub_df = join_data(test_info, test)
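
`create_address_books` is defined in `src/load_data.py` and its code is not shown here. As a rough illustration of the idea, the sketch below builds one recipient counter per sender and recommends each sender's most frequent recipients. The function names, the `(sender, recipients)` input format, and the toy addresses are all assumptions made for this example, not the notebook's actual data structures.

```python
from collections import Counter, defaultdict


def build_address_books_sketch(messages):
    """Map each sender to a Counter of the recipients they wrote to.

    `messages` is an iterable of (sender, recipients) pairs (assumed format).
    """
    books = defaultdict(Counter)
    for sender, recipients in messages:
        books[sender].update(recipients)
    return books


def recommend_most_frequent(books, sender, k=10):
    """Return up to k recipients for `sender`, most frequent first."""
    counts = books.get(sender, Counter())  # unknown sender: empty list
    return [recipient for recipient, _ in counts.most_common(k)]


# Toy data, made up for the example.
toy_messages = [
    ("alice@example.com", ["bob@example.com", "carol@example.com"]),
    ("alice@example.com", ["bob@example.com"]),
    ("dave@example.com", ["carol@example.com"]),
]
toy_books = build_address_books_sketch(toy_messages)
print(recommend_most_frequent(toy_books, "alice@example.com", k=10))
```

A frequency-only recommender like this ignores the message content entirely, so it is best seen as a baseline against which a content-based predictor can be compared.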

In [3]:
import TFIDF_mod
from TFIDF_mod import TFIDF

# transform each mail body into tfidf vector
# /!\ function can take 1-2 min to execute
tfidf = TFIDF()  # lowercase instance name, so the TFIDF class is not shadowed if the cell is re-run
X_TFIDF = tfidf.fit_transform(X_df) # resulting shape : (43613, 275988)

In [17]:
from sklearn.model_selection import ShuffleSplit
import predictor
from predictor import Predictor_1

# splitting data for cross validation
skf = ShuffleSplit(n_splits=2, test_size=0.2)
print('%10s | %40s | %10s' %('sender_nb', 'sender', 'accuracy'))
print('%10s + %40s + %10s' %(10*'-', 40*'-', 10*'-'))
for train_is, test_is in skf.split(y_df):
    
    X_tfidf_train = X_TFIDF[train_is].copy()
    y_train = y_df.recipients.loc[train_is].copy()
    X_tfidf_test = X_TFIDF[test_is].copy()
    y_test = y_df.recipients.loc[test_is].copy()
    X_test_df = X_df.loc[test_is].copy()
    X_train_df = X_df.loc[train_is].copy()
    
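    pdt = {}          # one fitted predictor per sender
    accuracy = {}     # score per sender
    accuracy_TOT = 0  # running sum of the per-sender scores
    sender_test = X_test_df.sender.unique().tolist()
    # one row of 10 ranked recipients per test message
    y_pred = np.empty((X_test_df.shape[0], 10), dtype=object)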

    for sender_nb, sender in enumerate(sender_test):
        print('%10s | %40s | ' %(sender_nb, sender), end='')
        # indices corresponding to the sender
        sender_train_is = np.array(X_train_df.sender == sender)
        sender_test_is = np.array(X_test_df.sender == sender)
        
        pdt[sender] = Predictor_1(X_tfidf_train[sender_train_is], y_train[sender_train_is], sender, address_books)
        y_pred[sender_test_is] = pdt[sender].predict_1(X_tfidf_test[sender_test_is])
        
        accuracy[sender] = mapk(y_test[sender_test_is], y_pred[sender_test_is], k=10)
        accuracy_TOT += accuracy[sender]
        print('%.2f' %(accuracy[sender]))

    print('%30s'%(30*'-'))
    # unweighted mean of the per-sender scores (an accuracy, not an error)
    print('mean accuracy = %.2f' %(accuracy_TOT/len(accuracy)))

 sender_nb |                                   sender |   accuracy
---------- + ---------------------------------------- + ----------
         0 |                    jason.wolfe@enron.com | 0.20
         1 |                sara.shackleton@enron.com | 0.09
         2 |                       rick.buy@enron.com | 0.10
         3 |                  tanya.rohauer@enron.com | 0.29
         4 |                  david.forster@enron.com | 0.27
         5 |                  michelle.cash@enron.com | 0.11
         6 |                 ginger.dernehl@enron.com | 0.51
         7 |                  peter.keohane@enron.com | 0.22
         8 |                      eric.bass@enron.com | 0.13
         9 |                  chris.germany@enron.com | 0.06
        10 |                     sally.beck@enron.com | 0.09
        11 |                 kevin.m.presto@enron.com | 0.14
        12 |        enron_update@concureworkplace.com | 0.16
        13 |                 phillip.m.love@enron.com | 0.06
        14 |