<h1>Email recipient recommendation</h1>

<i>Thomas Boudou, Guillaume Richard, Antoine Simoulin</i>

<p style="text-align: justify">It has been shown that employees frequently forget to include one or more recipients before sending a message at work. Conversely, some recipients of a given message were never meant to receive it. To increase productivity and prevent information leakage, there is a pressing need for effective <b>email recipient recommendation</b> systems.

In this challenge, you are asked to develop such a system: given the content and the date of a message, it recommends a list of <b>10 recipients ranked in decreasing order of relevance</b>.</p>

In [1]:
# Requirements
%matplotlib inline
import random
import pandas as pd
import numpy as np
# do not display warnings
import warnings
warnings.filterwarnings("ignore")

# Functions files are saved in "src/" directory.
import sys
sys.path.append('src/')
from accuracy_measure import *

In [2]:
from load_data import *

# load files
# Data are saved in "data/" directory
path_to_data = '../mail_recipients/data/'
training, training_info, test, test_info, y_df = load_data(path_to_data)

# create address books
# /!\ can take 1-2 min
address_books = create_address_books(training, y_df)

# join train and test files
X_df = join_data(training_info, training)
X_sub_df = join_data(test_info, test)
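
The helpers `load_data`, `create_address_books` and `join_data` live in `src/` and are not shown in this notebook. Judging from how `address_books` is used further down (`address_books[sender]` is iterated as `(recipient, count)` pairs and sliced from the front), a reasonable assumption is that each sender maps to a list of recipients sorted by decreasing frequency. The sketch below builds such a structure on toy data; `build_address_books` and the toy addresses are hypothetical and do not reproduce the actual helper.

```python
from collections import Counter, defaultdict

def build_address_books(sent_mails):
    """sent_mails: iterable of (sender, [recipients]) pairs (toy stand-in for the training data)."""
    counts = defaultdict(Counter)
    for sender, recipients in sent_mails:
        counts[sender].update(recipients)
    # Assumed layout: sender -> list of (recipient, count), most frequent first.
    return {sender: c.most_common() for sender, c in counts.items()}

toy_mails = [
    ("alice@enron.com", ["bob@enron.com", "carol@enron.com"]),
    ("alice@enron.com", ["bob@enron.com"]),
    ("dave@enron.com", ["alice@enron.com"]),
]
toy_books = build_address_books(toy_mails)

# Frequency baseline: recommend the sender's most frequent recipients.
top_recipients = [rec for rec, _ in toy_books["alice@enron.com"][:10]]
```

Under this assumption, slicing the first entries of a sender's address book already gives a simple frequency baseline, which is what `complete_prediction` falls back on later.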

<h2> TF-IDF </h2>

In [3]:
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import time

cachedStopWords = stopwords.words("english")

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

class TFIDF():
    """TF-IDF vectorizer applied to cleaned mail bodies."""

    def __init__(self):
        self.token_dict = {}
        self.tfidf=TfidfVectorizer(tokenizer=None, stop_words='english')
        # punctuation to strip: everything except '@' and '+'
        punctuation = string.punctuation.replace('@','').replace('+','')
        self._table = str.maketrans('','',punctuation)

    def _clean(self, text):
        # lowercase, strip punctuation, collapse whitespace, drop stop words
        words = text.lower().translate(self._table).split()
        return ' '.join([word for word in words if word not in cachedStopWords])

    def _clean_bodies(self, X):
        # map row position -> cleaned body
        return {i: self._clean(text) for i, text in enumerate(X.body.values)}

    def fit(self, X):
        # rebuilt on every call so no entries from a previous, longer X remain
        self.token_dict = self._clean_bodies(X)
        self.tfidf.fit(self.token_dict.values())

    def fit_transform(self, X):
        start_time = time.time()
        self.token_dict = self._clean_bodies(X)
        X_tfidf = self.tfidf.fit_transform(self.token_dict.values())

        print('performed Tf-Idf in %2i seconds.' % (time.time() - start_time))
        return X_tfidf

    def transform(self, Y):
        start_time = time.time()
        Y_dict = self._clean_bodies(Y)
        Y_tf_idf=self.tfidf.transform(Y_dict.values())

        print('performed Tf-Idf in %2i seconds.' % (time.time() - start_time))
        return Y_tf_idf
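
To make the cleaning step of the class above concrete, here is a minimal standalone sketch of what happens to one mail body before it reaches the vectorizer. The tiny stop-word set `TOY_STOPWORDS` and the function name `clean_body` are assumptions made for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "to", "is", "and", "of"}

# Punctuation to delete: everything except '@' and '+', as in the class above.
_PUNCT = string.punctuation.replace("@", "").replace("+", "")
_TABLE = str.maketrans("", "", _PUNCT)

def clean_body(text, stopwords=TOY_STOPWORDS):
    # Lowercase, strip punctuation, split on whitespace, drop stop words.
    words = text.lower().translate(_TABLE).split()
    return " ".join(w for w in words if w not in stopwords)

cleaned = clean_body("Please forward the report to bob.smith@enron.com, thanks!")
print(cleaned)  # please forward report bobsmith@enroncom thanks
```

Note that the dots inside an address are stripped as well, so `bob.smith@enron.com` becomes the single token `bobsmith@enroncom`; only `@` and `+` survive.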

<h3> Useful functions </h3>

In [135]:
from sklearn.metrics.pairwise import cosine_similarity
from proper_name_extractor import *

#Score vector creation
def score_vector(KNN_indices,sender_index,sender_AB,y,cos_dist_mat):
    recipient_scores=np.zeros((sum(sender_index),len(sender_AB)+1))
    for i in range(sum(sender_index)):
        d=np.array(KNN_indices[i])
        neigh_mails=y.values[d]#neighbour mails
        z=0
        for n_mail in neigh_mails:
            for rec in n_mail:
                if rec in sender_AB:
                    j=sender_AB[rec]#index in the score vector
                    recipient_scores[i,j]+=cos_dist_mat[i,d[z]]
            z=z+1
    return recipient_scores

#Label creation (from recipient addresses to 0/1 vector)
def create_labels(sender_train_is,sender_AB,y_train):
    recipient_labels=np.zeros((sum(sender_train_is),len(sender_AB)))
    i=0
    for rec_list in y_train[sender_train_is]:
        for rec in rec_list:
            if rec in sender_AB:
                j=sender_AB[rec]
                recipient_labels[i,j]=1 
        i=i+1
    return recipient_labels


#Name score creation (1 when a name found in the mail matches the recipient's name)
def create_name_scores(X_test,sender_test_is,sender_AB,dict_recipients,dict_names):
    # relies on the global recipient_surnames dict built by names(address_books)
    recipient_name_score=np.zeros((sum(sender_test_is),len(sender_AB)))
    for i,mail_names in enumerate(X_test.loc[sender_test_is].names):
        for n in mail_names:
            for rec in sender_AB:
                if recipient_surnames[rec]==n:
                    recipient_name_score[i,sender_AB[rec]]=1
    return recipient_name_score

#Complete prediction when <10
def complete_prediction(k, sender, address_books, res_temp, K=10):
    # k: number of recipients still to predict
    # candidates: the sender's K most frequent contacts that are not already predicted
    k_most = [elt[0] for elt in address_books[sender][:K] if elt[0] not in res_temp]
    k_most = k_most[:k]
    if len(k_most) < k: # the sender does not have enough contacts
        k_most.extend([0] * (k-len(k_most)))
    return k_most

#Computes the KNN on the distance matrix
def KNN(distance,k=30):
    indexes=[]
    for d in distance:
        indexes.append((-d).argsort()[:k])
    return np.array(indexes)

#Extract names at the beginning of the document
def extract_names(text, dict_n=dict_names, dict_m=dict_months,nb_words=5):
    
    name_list=[]
    text=re.sub(r'[^\w\s]',' ',text)
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])
    forward=False
    
    if nb_words is None:
        nb_words=len(text.split())
    
    dear=False
    
    count=0 # never incremented, so the count==0 test below is always true
    for z in text.split()[:nb_words]:
        if (z.lower()=='forwarded' or z.lower()=='original'):
            forward=True
        if(z.lower()=='dear' or z.lower()=='hi' or z.lower()=='thanks'):
            dear=True
            
        if (z.lower() in dict_n and z.lower()!=z and z.lower() not in dict_m and forward==False and (dear==True or count==0)):
            name_list.append(z.lower())

    if len(name_list)==0:
        name_list=['']

    return name_list#','.join([word for word in np.unique(name_list).tolist()])

#Add columns to the initial dataframe
def create_names_df(X_df):
    X_names=X_df.copy()

    list_names=[]
    for x in X_names.body:
        l_names=extract_names(x,nb_words=5)
        list_names.append(l_names)
    X_names['names']=list_names
    return X_names

#Associate a name (from dict_names) with each mail address
def names(address_books):
    recipient_name = {}
    for sender in address_books:
        for rec, value in address_books[sender]:
            if rec not in recipient_name:
                recipient_name[rec]='DefaultNULL'
                if '.' in rec[:rec.find('@')]:
                    found = rec[:rec.find('.')].lower()
                    if found in dict_names:
                        recipient_name[rec] = found
                    else:
                        found=rec[rec.find('.')+1:rec.find('@')].lower()
                        if found in dict_names:
                            recipient_name[rec] = found
    return recipient_name
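
The idea behind `KNN` and `score_vector` can be shown on a single mail: take the k most similar past mails of the sender and add each neighbour's similarity to the score of every recipient it was sent to. The following is a simplified pure-Python sketch of that idea under toy inputs; the function name `neighbour_scores` and the recipient labels are hypothetical, and the notebook's own version works on NumPy matrices and address-book indices instead.

```python
def neighbour_scores(similarities, neighbour_recipients, k=2):
    """Score candidate recipients for one mail from its k most similar past mails.

    similarities: similarity of the query mail to each past mail
    neighbour_recipients: recipient list of each past mail (same order)
    """
    # Indices of the k most similar past mails, most similar first.
    ranked = sorted(range(len(similarities)), key=lambda i: -similarities[i])[:k]
    scores = {}
    for i in ranked:
        for rec in neighbour_recipients[i]:
            # Each neighbour votes for its recipients with its similarity.
            scores[rec] = scores.get(rec, 0.0) + similarities[i]
    return scores

sims = [0.5, 0.25, 0.75]
past = [["bob"], ["carol"], ["bob", "dave"]]
scores = neighbour_scores(sims, past, k=2)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With `k=2` the second mail is ignored, so `carol` gets no score at all. In the notebook these scores are not used directly for ranking; they become one input feature of the per-recipient classifiers trained in the next section.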

<h2> Fitting </h2>

In [126]:
%%time 
#import TFIDF_mod
#from TFIDF_mod import TFIDF
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold

# splitting data for cross validation
#skf = ShuffleSplit(n_splits=1, test_size=0.02)
skf=KFold(n_splits=5, shuffle=True)
X_df=create_names_df(X_df)

X_tot=pd.merge(X_df,y_df,on='mid')

dict_recipients={}
j=0
for i,x in X_tot.iterrows():
    for rec in x.recipients:
        if rec not in dict_recipients:
            dict_recipients[rec]=j
            j=j+1
            
recipient_surnames=names(address_books)            
print('--------Cross-Validation Module--------')
for train_is, test_is in skf.split(y_df):
    print('\n Beginning of extraction \n ------------')
    ############Extraction + TF-IDF############
    X_train=X_df.loc[train_is]
    y_train = y_df.recipients.loc[train_is].copy()
    X_test=X_df.loc[test_is].copy()
    y_test = y_df.recipients.loc[test_is].copy()
    y_pred=y_test.copy()
    
    tf_idf = TFIDF()
    X_train_TFIDF=tf_idf.fit_transform(X_train)
    X_test_TFIDF=tf_idf.transform(X_test)
    
    print('Extraction done \n ------------')
    print('\n Beginning of prediction \n ------------')
    ############Prediction############
    sender_test = X_test.sender.unique().tolist()
    clf={}
    count=0
    L=len(sender_test)
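    y_pred=y_test.copy()

    #Total number of mails received by each recipient (over the test senders),
    #needed for rec_frequency below
    tot_rec_mails={}
    for sender in sender_test:
        for x in address_books[sender]:
            tot_rec_mails[x[0]]=tot_rec_mails.get(x[0],0)+x[1]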
    for sender in sender_test:
        #Isolation of sender's mails
        sender_train_is = np.array(X_train.sender == sender)
        sender_test_is = np.array(X_test.sender == sender)

        ############Feature extraction############
        
        #Finding the nearest neighbours of sender's mails
        cos_dist_mat=cosine_similarity(X_train_TFIDF[sender_train_is])-np.identity(sum(sender_train_is))
        cos_dist_mat_test=cosine_similarity(X_test_TFIDF[sender_test_is],X_train_TFIDF[sender_train_is])
        #cos_dist_mat=cosine_similarity(X_TFIDF[sender_train_is],X_TFIDF) #to try later
        
        #KNN
        KNN_indices=KNN(cos_dist_mat,k=50)
        KNN_indices_test=KNN(cos_dist_mat_test,k=50)

        #Sender number in the address book
        sender_AB={}
        id_to_sender={}
        sent_frequency={}
        rec_frequency={}
        n_mails=float(sum(sender_train_is))
        z=0
        for x in address_books[sender]:
            sender_AB[x[0]]=z
            id_to_sender[z]=x[0]
            sent_frequency[x[0]]=x[1]/n_mails
            rec_frequency[x[0]]=float(x[1])/tot_rec_mails[x[0]]
            z=z+1

        #Creation of the score vector
        recipient_scores=score_vector(KNN_indices,sender_train_is,sender_AB,y_train,cos_dist_mat) 

        #recipient_name_score=create_name_scores(X_train,sender_train_is,sender_AB,dict_recipients,dict_names)
        ############Train############

        #Creation of the labels for the classifier
        recipient_labels=create_labels(sender_train_is,sender_AB,y_train)

        #One classifier per recipient
        for rec in sender_AB:
            #Adding frequency feature
            #recipient_scores.T[len(sender_AB)]=sent_frequency[rec]
            x_fit=np.array([[x, sent_frequency[rec], rec_frequency[rec]] for x in recipient_scores.T[sender_AB[rec]]])
            #x_fit=np.concatenate((x_fit,
            #          recipient_name_score.T[sender_AB[rec]].reshape((len(recipient_name_score.T[sender_AB[rec]]),1))),
            #         axis=1)
            key=sender+','+rec
            #clf[key]=SVC()
            clf[key]=xgb.XGBClassifier(n_estimators=10)
            clf[key].fit(x_fit,recipient_labels.T[sender_AB[rec]])


        ############Test############

        #Creation of the test score vector
        recipient_scores=score_vector(KNN_indices_test,sender_test_is,sender_AB,y_train,cos_dist_mat_test) 
        recipient_labels=np.zeros((sum(sender_test_is),len(sender_AB))).T
        recipient_name_score=create_name_scores(X_test,sender_test_is,sender_AB,dict_recipients,dict_names)

        #Prediction
        pred=0
        for rec in sender_AB:
            #Adding frequency feature
            recipient_scores.T[len(sender_AB)]=sent_frequency[rec]
            x_fit=np.array([[x, 
                             sent_frequency[rec], 
                             rec_frequency[rec]] 
                            for x in recipient_scores.T[sender_AB[rec]]])
            #x_fit=np.concatenate((x_fit,
            #          recipient_name_score.T[sender_AB[rec]].reshape((len(recipient_name_score.T[sender_AB[rec]]),1))),
            #         axis=1)
            
            #Predict
            key=sender+','+rec
            recipient_labels[sender_AB[rec]]=(clf[key].predict_proba(x_fit)).T[1].T
            j=0
            for mail_names in X_test.loc[sender_test_is].names: #not "names", which would shadow names()
                for n in mail_names:
                    if recipient_surnames[rec]==n:
                        recipient_labels[sender_AB[rec],j]=1
                j=j+1
        recipient_labels=recipient_labels.T
        #Storage
        y_test_pred=[]
        for y in recipient_labels:
            y_tmp=[]
            max_rec=(-y).argsort()[:10]
            for rec_id in max_rec:
                y_tmp.append(id_to_sender[rec_id])
            if len(y_tmp) < 10:
                y_tmp.extend(complete_prediction(10-len(y_tmp),sender, address_books, y_tmp))
            y_test_pred.append(y_tmp)
        y_pred.loc[sender_test_is]=pd.Series(y_test_pred,index=y_pred.index[sender_test_is])
        if int((count*10)/L)>int(((count-1)*10)/L):
            print(round(float(count*100)/L))
        count=count+1
    print('End of prediction')
    print('------------')

--------Cross-Validation Module--------

 Beginning of extraction 
 ------------
performed Tf-Idf in 17 seconds.
performed Tf-Idf in  4 seconds.
Extraction done 
 ------------

 Beginning of prediction 
 ------------
10
20
30
40
50
60
70
80
90
End of prediction
------------

 Beginning of extraction 
 ------------
performed Tf-Idf in 17 seconds.
performed Tf-Idf in  5 seconds.
Extraction done 
 ------------

 Beginning of prediction 
 ------------


KeyboardInterrupt: 
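
The loop above stores `y_pred` for each fold but never scores it against `y_test`. One common way to score a ranked list of 10 recommendations is mean average precision at 10. Whether this matches the metric provided by `accuracy_measure` is an assumption, since that module is not shown; the function names below are hypothetical.

```python
def average_precision_at_k(predicted, actual, k=10):
    """Average precision of one ranked list against the true recipients."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, total = 0, 0.0
    seen = set()
    for rank, rec in enumerate(predicted[:k], start=1):
        # Count each true recipient once, at the rank where it first appears.
        if rec in actual and rec not in seen:
            hits += 1
            total += hits / rank
            seen.add(rec)
    return total / min(len(actual), k)

def mean_average_precision(all_predicted, all_actual, k=10):
    """Mean of the per-mail average precisions."""
    pairs = list(zip(all_predicted, all_actual))
    return sum(average_precision_at_k(p, a, k) for p, a in pairs) / len(pairs)

# One mail with true recipients bob and carol, found at ranks 1 and 3.
ap = average_precision_at_k(["bob", "eve", "carol"], ["bob", "carol"])
```

Under this assumption, a fold could be scored with `mean_average_precision(y_pred.tolist(), y_test.tolist())` right after the prediction loop.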

<h2> Submission </h2>

In [136]:
%%time 
#import TFIDF_mod
#from TFIDF_mod import TFIDF
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import KFold

# splitting data for cross validation
skf = ShuffleSplit(n_splits=1, test_size=0.02)
#skf=KFold(n_splits=5, shuffle=True)
X_df=create_names_df(X_df)
X_sub_df=create_names_df(X_sub_df)

print('--------Cross-Validation Module--------')
for train_is, test_is in skf.split(y_df):
    print('\n Beginning of extraction \n ------------')
    ############Extraction + TF-IDF############
    X_train=X_df.copy()
    y_train = y_df.recipients.copy()
    X_test=X_sub_df.copy()
    #y_test = y_df.recipients.loc[test_is].copy()
    y_pred=pd.Series([[''] for x in X_sub_df.sender])

    
    tf_idf = TFIDF()
    X_train_TFIDF=tf_idf.fit_transform(X_train)
    X_test_TFIDF=tf_idf.transform(X_test)
    
    print('Extraction done \n ------------')
    print('\n Beginning of prediction \n ------------')
    ############Prediction############
    sender_test = X_test.sender.unique().tolist()
    clf={}
    count=0
    L=len(sender_test)
    
    tot_rec_mails={}
    for sender in sender_test:
        for x in address_books[sender]:
            if x[0] in tot_rec_mails:
                tot_rec_mails[x[0]]=tot_rec_mails[x[0]]+x[1]
            else:
                tot_rec_mails[x[0]]=x[1]
    for sender in sender_test:
        #Isolation of sender's mails
        sender_train_is = np.array(X_train.sender == sender)
        sender_test_is = np.array(X_test.sender == sender)

        ############Feature extraction############
        
        #Finding the nearest neighbours of sender's mails
        cos_dist_mat=cosine_similarity(X_train_TFIDF[sender_train_is])-np.identity(sum(sender_train_is))
        cos_dist_mat_test=cosine_similarity(X_test_TFIDF[sender_test_is],X_train_TFIDF[sender_train_is])
        #cos_dist_mat=cosine_similarity(X_TFIDF[sender_train_is],X_TFIDF) #to try later
        
        #KNN
        KNN_indices=KNN(cos_dist_mat,k=50)
        KNN_indices_test=KNN(cos_dist_mat_test,k=50)

        #Sender number in the address book
        sender_AB={}
        id_to_sender={}
        sent_frequency={}
        rec_frequency={}
        n_mails=float(sum(sender_train_is))
        z=0
        for x in address_books[sender]:
            sender_AB[x[0]]=z
            id_to_sender[z]=x[0]
            sent_frequency[x[0]]=x[1]/n_mails
            rec_frequency[x[0]]=float(x[1])/tot_rec_mails[x[0]]
            z=z+1

        #Creation of the score vector
        recipient_scores=score_vector(KNN_indices,sender_train_is,sender_AB,y_train,cos_dist_mat) 

        #recipient_name_score=create_name_scores(X_train,sender_train_is,sender_AB,dict_recipients,dict_names)
        ############Train############

        #Creation of the labels for the classifier
        recipient_labels=create_labels(sender_train_is,sender_AB,y_train)

        #One classifier per recipient
        for rec in sender_AB:
            #Adding frequency feature
            #recipient_scores.T[len(sender_AB)]=sent_frequency[rec]
            x_fit=np.array([[x, sent_frequency[rec], rec_frequency[rec]] for x in recipient_scores.T[sender_AB[rec]]])
            #x_fit=np.concatenate((x_fit,
            #          recipient_name_score.T[sender_AB[rec]].reshape((len(recipient_name_score.T[sender_AB[rec]]),1))),
            #         axis=1)
            key=sender+','+rec
            #clf[key]=SVC()
            clf[key]=xgb.XGBClassifier(n_estimators=10)
            clf[key].fit(x_fit,recipient_labels.T[sender_AB[rec]])


        ############Test############

        #Creation of the test score vector
        recipient_scores=score_vector(KNN_indices_test,sender_test_is,sender_AB,y_train,cos_dist_mat_test) 
        recipient_labels=np.zeros((sum(sender_test_is),len(sender_AB))).T
        recipient_name_score=create_name_scores(X_test,sender_test_is,sender_AB,dict_recipients,dict_names)

        #Prediction
        pred=0
        for rec in sender_AB:
            #Adding frequency feature
            recipient_scores.T[len(sender_AB)]=sent_frequency[rec]
            x_fit=np.array([[x, 
                             sent_frequency[rec], 
                             rec_frequency[rec]] 
                            for x in recipient_scores.T[sender_AB[rec]]])
            #x_fit=np.concatenate((x_fit,
            #          recipient_name_score.T[sender_AB[rec]].reshape((len(recipient_name_score.T[sender_AB[rec]]),1))),
            #         axis=1)
            
            #Predict
            key=sender+','+rec
            recipient_labels[sender_AB[rec]]=(clf[key].predict_proba(x_fit)).T[1].T
            j=0
            for mail_names in X_test.loc[sender_test_is].names: #not "names", which would shadow names()
                for n in mail_names:
                    if recipient_surnames[rec]==n:
                        recipient_labels[sender_AB[rec],j]=1
                j=j+1
        recipient_labels=recipient_labels.T
        #Storage
        y_test_pred=[]
        for y in recipient_labels:
            y_tmp=[]
            max_rec=(-y).argsort()[:10]
            for rec_id in max_rec:
                y_tmp.append(id_to_sender[rec_id])
            if len(y_tmp) < 10:
                y_tmp.extend(complete_prediction(10-len(y_tmp),sender, address_books, y_tmp))
            y_test_pred.append(y_tmp)
        y_pred.loc[sender_test_is]=pd.Series(y_test_pred,index=y_pred.index[sender_test_is])
        if int((count*10)/L)>int(((count-1)*10)/L):
            print(round(float(count*100)/L))
        count=count+1
    print('End of prediction')
    print('------------')

create_submission(y_pred,X_sub_df)

--------Cross-Validation Module--------

 Beginning of extraction 
 ------------
performed Tf-Idf in 22 seconds.
performed Tf-Idf in  1 seconds.
Extraction done 
 ------------

 Beginning of prediction 
 ------------
10
20
30
40
50
60
70
80
90
End of prediction
------------
CPU times: user 8min 57s, sys: 11.5 s, total: 9min 8s
Wall time: 3min 33s


In [131]:
def create_submission(y_pred,X_test_df):

    predictions_towrite={}
    x_test=X_test_df.values
    for i in range(len(y_pred)):
        recipients=y_pred[i]
        mid=x_test[i][0]
        predictions_towrite[mid]=recipients

    count=0
    with open('./pred_custom.txt', 'w') as my_file:
        my_file.write('mid,recipients' + '\n')
        for ids, preds in predictions_towrite.items():
            count=count+1
            r=str(ids)+","
            for s in preds:
                r=r+" "+str(s)
            r=r+'\n'
            my_file.write(r)
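
To see the file layout that `create_submission` produces without touching the disk, the sketch below writes the same format into an in-memory buffer. The function name `write_submission` and the toy prediction are hypothetical; the line format is taken from the function above.

```python
import io

def write_submission(predictions, handle):
    """predictions: dict mapping message id -> ranked list of recipients."""
    handle.write("mid,recipients\n")
    for mid, recipients in predictions.items():
        # Mirrors the function above: each recipient is prefixed by a space.
        handle.write(str(mid) + "," + "".join(" " + str(r) for r in recipients) + "\n")

buffer = io.StringIO()
write_submission({1577: ["bob@enron.com", "carol@enron.com"]}, buffer)
submission_text = buffer.getvalue()
```

Each line holds the message id, a comma, and then the recipients separated by spaces, with one leading space after the comma.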

In [132]:
create_submission(y_pred,X_sub_df)

In [137]:
X_sub_df

Unnamed: 0,mid,date,body,sender,names
0,1577,2001-11-19 06:59:51,Note: Stocks of heating oil are very high for...,lorna.brennan@enron.com,[]
1,1750,2002-03-05 08:46:57,"Kevin Hyatt and I are going for ""sghetti"" at S...",julie.armstrong@enron.com,[kevin]
2,1916,2002-02-13 14:17:39,This was forwarded to me and it is funny. - Wi...,julie.armstrong@enron.com,[]
3,2094,2002-01-22 11:33:56,I will be in to and happy to assist too. I ma...,julie.armstrong@enron.com,[]
4,2205,2002-01-11 07:12:19,Thanks. I needed a morning chuckle.,julie.armstrong@enron.com,[]
5,2297,2002-01-11 14:37:19,Note: Westpath Expansion plans filed at NEBTr...,lorna.brennan@enron.com,[]
6,5300,2001-11-26 14:13:01,Here are Peggy s slides. -----Original Message...,stanley.horton@enron.com,[peggy]
7,5333,2001-11-19 13:44:18,Here s the information. -----Original Message-...,cindy.stark@enron.com,[]
8,6583,2002-01-18 05:00:48,I would like to know where and how this is goi...,darrell.schoolcraft@enron.com,[]
9,7460,2001-11-12 16:43:31,"Richard: Per Elliot s e-mail below, do you hav...",jennifer.thome@enron.com,"[richard, elliot]"


In [45]:
create_name_scores(X_test,sender_test_is,sender_AB,dict_recipients,dict_names)

TypeError: list indices must be integers or slices, not str