<h1>Email recipient recommendation</h1>

<i>Thomas Boudou, Guillaume Richard, Antoine Simoulin</i>

<p style="text-align: justify">It has been shown that, at work, employees frequently forget to include one or more recipients before sending a message. Conversely, it is common for some recipients of a given message not to have been intended to receive it. To increase productivity and prevent information leakage, there is thus a pressing need for effective <b>email recipient recommendation</b> systems.

In this challenge, you are asked to develop such a system, which, given the content and the date of a message, recommends a list of <b>10 recipients ranked by decreasing order of relevance</b>.</p>

In [1]:
# Requirements
%matplotlib inline
import random
import pandas as pd
import numpy as np
# do not display warnings
import warnings
warnings.filterwarnings("ignore")

# Function files are saved in the "src/" directory.
import sys
sys.path.append('src/')
from accuracy_measure import *
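
The `mapk` function imported from `accuracy_measure` is used later to score predictions, but its source is not shown here. Below is a minimal sketch of mean average precision at k, assuming the usual definition; the names `apk_sketch` and `mapk_sketch` are hypothetical and the actual implementation in `src/` may differ.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for a single message (hypothetical helper)."""
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, candidate in enumerate(predicted[:k], start=1):
        # count each correct recipient once, weighted by its rank
        if candidate in actual and candidate not in seen:
            seen.add(candidate)
            hits += 1
            score += hits / rank
    return score / min(len(actual), k)


def mapk_sketch(actual_lists, predicted_lists, k=10):
    """Mean of the per-message average precision at k."""
    scores = [apk_sketch(a, p, k) for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores)
```

With this definition, a correct recipient placed first contributes more than the same recipient placed lower in the list, which is why the ranking order of the 10 recommendations matters.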

In [2]:
from load_data import *

# load files
# Data are saved in "data/" directory
path_to_data = 'data/'
training, training_info, test, test_info, y_df = load_data(path_to_data)

# create address book
# /!\ can take 1-2 min
address_books = create_address_books(training, y_df)

# join train and test files
X_df = join_data(training_info, training)
X_sub_df = join_data(test_info, test)
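
The internals of `create_address_books` are not shown in this notebook. As an illustration only, the sketch below assumes an address book maps each sender to a count of the recipients they have written to, and that a simple baseline recommends the most frequent ones. The function names and the input layout (parallel lists of senders and recipient lists) are assumptions, not the project's API.

```python
from collections import Counter


def build_address_books_sketch(senders, recipient_lists):
    """Count, for each sender, how often each recipient appears (hypothetical)."""
    books = {}
    for sender, recipients in zip(senders, recipient_lists):
        books.setdefault(sender, Counter()).update(recipients)
    return books


def recommend_most_frequent(books, sender, k=10):
    """Return up to k recipients, ranked by decreasing past frequency."""
    counts = books.get(sender, Counter())
    return [address for address, _ in counts.most_common(k)]
```

For example, `recommend_most_frequent(books, 'sally.beck@enron.com')` would return at most 10 addresses for that sender, and an empty list for a sender absent from the training data.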

In [3]:
from proper_name_extractor import *

In [4]:
# Extract proper names from the training messages (X_df)

X_tmp=add_proper_names(X_df)
X_tmp.to_csv('data_with_proper_names.tsv', sep='\t')

NameError: name 'add_proper_names' is not defined

In [117]:
# Build the name-to-address dictionaries

surname_link, recipients_link=create_name_dict(X_tmp, y_df)

In [193]:
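from sklearn.model_selection import ShuffleSplit
# Predictor_Names is called directly below, so import the name itself
# (assumes the class is defined in src/predictor.py)
from predictor import Predictor_Names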

# splitting data for cross validation
skf = ShuffleSplit(n_splits=2, test_size=0.2)
print('%10s | %40s | %10s' %('sender_nb', 'sender', 'accuracy'))
print('%10s + %40s + %10s' %(10*'-', 40*'-', 10*'-'))
for train_is, test_is in skf.split(y_df):
    
    X_tfidf_train = X_TFIDF[train_is].copy()
    y_train = y_df.recipients.loc[train_is].copy()
    X_tfidf_test = X_TFIDF[test_is].copy()
    y_test = y_df.recipients.loc[test_is].copy()
    X_test_df = X_df.loc[test_is].copy()
    X_train_df = X_df.loc[train_is].copy()
    
    pdt = {}        # one fitted predictor per sender
    accuracy = {}   # MAP@10 per sender
    accuracy_TOT = 0
    sender_test = X_test_df.sender.unique().tolist()
    y_pred = np.empty((X_test_df.shape[0],10),dtype=object)

    for sender in sender_test:
        print('%10s | %40s | ' %(sender_test.index(sender), sender), end='')
        # indices corresponding to the sender
        sender_train_is = np.array(X_train_df.sender == sender)
        sender_test_is = np.array(X_test_df.sender == sender)
        
        pdt[sender] = Predictor_Names(X_train_df[sender_train_is], y_train[sender_train_is], sender, address_books)
        y_pred[sender_test_is] = pdt[sender].pred(X_test_df[sender_test_is])
        
        accuracy[sender] = mapk(y_test[sender_test_is], y_pred[sender_test_is], k=10)
        accuracy_TOT += accuracy[sender]
        print('%.2f' %(accuracy[sender]))

    print('%30s'%(30*'-'))
    # mean of the per-sender MAP@10 scores (an accuracy, not an error)
    print('accuracy TOT = %.2f' %(accuracy_TOT/len(accuracy)))

 sender_nb |                                   sender |   accuracy
---------- + ---------------------------------------- + ----------
         0 |                    david.portz@enron.com | 0.39
         1 |                russell.diamond@enron.com | 0.23
         2 |                      eric.bass@enron.com | 0.30
         3 |                   paul.y'barbo@enron.com | 0.38
         4 |                    marie.heard@enron.com | 0.15
         5 |                     sally.beck@enron.com | 0.36
         6 |                  suzanne.adams@enron.com | 0.57
         7 |                     jane.tholt@enron.com | 0.28
         8 |                richard.shapiro@enron.com | 0.27
         9 |                 phillip.m.love@enron.com | 0.24
        10 |                grace.rodriguez@enron.com | 0.45
        11 |                  jim.schwieger@enron.com | 0.42
        12 |                  michelle.cash@enron.com | 0.46
        13 |                    susan.scott@enron.com | 0.27
        14 |