# WARNING
**Please make sure to "COPY AND EDIT NOTEBOOK" to use compatible library dependencies! DO NOT CREATE A NEW NOTEBOOK AND COPY+PASTE THE CODE - this will use latest Kaggle dependencies at the time you do that, and the code will need to be modified to make it work. Also make sure internet connectivity is enabled on your notebook**

# Preliminaries
Install required dependencies not already on the Kaggle image

In [1]:
# install sent2vec
!pip install git+https://github.com/epfml/sent2vec

Collecting git+https://github.com/epfml/sent2vec
  Cloning https://github.com/epfml/sent2vec to /tmp/pip-req-build-6usf0w9e
  Running command git clone -q https://github.com/epfml/sent2vec /tmp/pip-req-build-6usf0w9e
Building wheels for collected packages: sent2vec
  Building wheel for sent2vec (setup.py) ... done
  Created wheel for sent2vec: filename=sent2vec-0.0.0-cp36-cp36m-linux_x86_64.whl size=1139408 sha256=5d41d4ea2c3b1e451e3ee71fae31af75b6747a9ae33021dee09a8ed06e040cad
  Stored in directory: /tmp/pip-ephem-wheel-cache-8nd6mv10/wheels/f5/1a/52/b5f36e8120688b3f026ac0cefe9c6544905753c51d8190ff17
Successfully built sent2vec
Installing collected packages: sent2vec
Successfully installed sent2vec-0.0.0


Write the installed requirements to a file every time you run the notebook, in case you later need to go back and recover the dependencies. **MOST OF THESE REQUIREMENTS WOULD NOT BE NECESSARY FOR LOCAL INSTALLATION**

Requirements for each notebook are hosted in the companion GitHub repo and can be pulled down and installed here if needed. The companion repo is located at https://github.com/azunre/transfer-learning-for-nlp

In [2]:
!pip freeze > kaggle_image_requirements.txt

# Download IMDB Movie Review Dataset
Download and unpack the IMDB movie review archive

In [3]:
import random
import pandas as pd

# Download the review archive and extract it into the working directory

!wget -q "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!tar xzf aclImdb_v1.tar.gz

# Define Tokenization, Stop-word and Punctuation Removal Functions
Before proceeding, we must decide how many samples to draw from each class. We must also decide the maximum number of tokens per document (a movie review here, an email later in the notebook) and the maximum length of each token. This is done by setting the following overarching hyperparameters:

In [4]:
Nsamp = 1000 # number of samples per class (IMDB: 2*Nsamp reviews are drawn overall; emails: Nsamp each of 'spam' and 'not spam')
maxtokens = 200 # the maximum number of tokens per document
maxtokenlen = 100 # the maximum length of each token

**Tokenization**

In [5]:
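def tokenize(row):
    # split on single spaces and keep at most maxtokens tokens
    if row is None or row == "":  # compare by value; `is ''` tests object identity
        tokens = []
    else:
        tokens = str(row).split(" ")[:maxtokens]
    return tokens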
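
To make the effect of `maxtokens` concrete, here is a minimal sketch of the same splitting-and-truncating idea. The name `tokenize_demo` and the limit of 5 tokens are illustrative assumptions, not part of the notebook.

```python
# Illustrative stand-in for the notebook's tokenize(); the name and the
# small max_tokens value are assumptions chosen to keep the example short.
def tokenize_demo(text, max_tokens=5):
    if text is None or text == "":
        return []
    # split on single spaces, then keep only the first max_tokens tokens
    return str(text).split(" ")[:max_tokens]

sample = "This movie was surprisingly good and I would watch it again"
tokens = tokenize_demo(sample)
print(tokens)  # ['This', 'movie', 'was', 'surprisingly', 'good']
```

Note that splitting on a single space keeps punctuation attached to words and produces empty tokens for repeated spaces; both are cleaned up in later steps.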

**Use regular expressions to remove unnecessary characters**

Next, we define a function that removes punctuation marks and other non-word characters from the documents using Python's built-in `re` regular expression module. In the same step, we lower-case each token and truncate it to the hyperparameter `maxtokenlen` defined above.

In [6]:
import re

def reg_expressions(row):
    tokens = []
    try:
        for token in row:
            token = token.lower() # make all characters lower case
            token = re.sub(r'[\W\d]', "", token) # strip non-word characters and digits
            token = token[:maxtokenlen] # truncate token
            tokens.append(token)
    except Exception: # e.g. a non-string token; fall back to an empty token
        token = ""
        tokens.append(token)
    return tokens
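
The pattern `[\W\d]` removes every character that is not a letter or underscore. A small sketch, using a hypothetical helper `clean_token` and a made-up review fragment, shows why tokens such as `morebr` and `br` appear in the processed reviews: HTML `<br />` tags lose their brackets and slashes but keep the letters.

```python
import re

# Hypothetical helper mirroring the per-token steps of reg_expressions()
def clean_token(token, max_len=100):
    token = token.lower()                  # lower-case
    token = re.sub(r"[\W\d]", "", token)   # drop non-word characters and digits
    return token[:max_len]                 # truncate

raw = "more.<br /><br />The 2nd film".split(" ")
cleaned = [clean_token(t) for t in raw]
print(cleaned)  # ['morebr', 'br', 'the', 'nd', 'film']
```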

**Stop-word removal**

Stop-words are also removed. Stop-words are words that are very common in text but offer little useful information for classifying it, such as "is", "and", "the" and "are". The NLTK library contains a list of English stop-words (127 at the time of writing; the count differs between NLTK releases) that can be used to filter our tokenized strings.

In [7]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords = stopwords.words('english')    

# print(stopwords) # see default stopwords
# it may be beneficial to drop negation words from the removal list, as they can change the positive/negative meaning
# of a sentence
# stopwords.remove("no")
# stopwords.remove("nor")
# stopwords.remove("not")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
def stop_word_removal(row):
    # drop stop-words and empty tokens; return a list rather than a lazy filter object
    token = [token for token in row if token and token not in stopwords]
    return token
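
The following sketch illustrates the filtering step with a tiny hand-written stop-word set (an assumption made so the example runs without downloading NLTK data). It also shows that the membership test is case-sensitive: a capitalized "The" survives unless tokens are lower-cased before filtering.

```python
# Tiny illustrative stop-word set; the real list comes from NLTK
stop_words_demo = {"is", "and", "the", "are"}

def remove_stop_words(tokens, stop_words):
    # keep tokens that are non-empty and not in the stop-word set
    return [t for t in tokens if t and t not in stop_words]

tokens = ["The", "plot", "is", "thin", "and", "the", "acting", "", "flat"]

as_is = remove_stop_words(tokens, stop_words_demo)
print(as_is)    # ['The', 'plot', 'thin', 'acting', 'flat']

lowered = remove_stop_words([t.lower() for t in tokens], stop_words_demo)
print(lowered)  # ['plot', 'thin', 'acting', 'flat']
```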

# Assemble Embedding Vectors
The following functions are used to extract sent2vec embedding vectors for each review

In [9]:
import time
import sent2vec

s2v_model = sent2vec.Sent2vecModel()
start=time.time()
s2v_model.load_model('../input/sent2vec/wiki_unigrams.bin')
end = time.time()
print("Loading the sent2vec embedding took %d seconds"%(end-start))

Loading the sent2vec embedding took 43 seconds


In [10]:
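import numpy as np

def assemble_embedding_vectors(data):
    # embed each tokenized document as a single sent2vec vector
    vectors = []
    for item in data:
        vec = s2v_model.embed_sentence(" ".join(item))
        if vec is not None:
            vectors.append(vec)
    # stack the per-document rows once, instead of re-copying the array on every iteration
    return np.concatenate(vectors, axis=0) if vectors else None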
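
The stacking logic can be tried without the sent2vec model. In the sketch below, `fake_embed` is a hypothetical stand-in for `s2v_model.embed_sentence` that returns a random row of shape `(1, dim)`; the dimension of 4 is chosen only for readability.

```python
import numpy as np

def fake_embed(sentence, dim=4):
    # Hypothetical stand-in for s2v_model.embed_sentence: one row per sentence
    rng = np.random.default_rng(len(sentence))
    return rng.standard_normal((1, dim)).astype(np.float32)

docs = [["good", "movie"], ["bad", "plot", "twist"], ["fine"]]

# join tokens back into a sentence, embed it, and stack the rows
rows = [fake_embed(" ".join(tokens)) for tokens in docs]
matrix = np.concatenate(rows, axis=0)
print(matrix.shape)  # (3, 4)
```

Each document becomes one row, so the result has as many rows as documents and as many columns as the embedding dimension.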

# Putting It All Together To Assemble Dataset
Now, putting all the preprocessing steps together we assemble our dataset...

In [11]:
import os
import numpy as np

# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p]
    header = np.asarray(header)[p]
    return data, header

# load data in appropriate form
def load_data(path):
    data, sentiments = [], []
    for folder, sentiment in (('neg', 0), ('pos', 1)):
        folder = os.path.join(path, folder)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), 'r', encoding='utf-8') as reader:
                text = reader.read()
            text = tokenize(text)
            # note: stop-words are removed before lower-casing, so capitalized
            # stop-words (e.g. 'The') survive this step
            text = stop_word_removal(text)
            text = reg_expressions(text)
            data.append(text)
            sentiments.append(sentiment)
    # token lists have different lengths, so store them in an object array
    data_np = np.array(data, dtype=object)
    data, sentiments = unison_shuffle_data(data_np, sentiments)

    return data, sentiments

train_path = os.path.join('aclImdb', 'train')
test_path = os.path.join('aclImdb', 'test')
raw_data, raw_header = load_data(train_path)

print(raw_data.shape)
print(len(raw_header))

(25000,)
25000
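
The key property of `unison_shuffle_data` is that one permutation is applied to both arrays, so every document keeps its own label. A minimal sketch with made-up values, where the parity of each data value encodes its label, makes this easy to check:

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the example is repeatable

data = np.array([10, 11, 12, 13, 14])
labels = np.array([0, 1, 0, 1, 0])   # label equals value % 2

# draw one permutation and index both arrays with it
p = np.random.permutation(len(labels))
shuffled_data, shuffled_labels = data[p], labels[p]

# every value is still paired with its original label
still_aligned = all(d % 2 == l for d, l in zip(shuffled_data, shuffled_labels))
print(still_aligned)  # True
```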


In [12]:
# Subsample required number of samples
random_indices = np.random.choice(range(len(raw_header)),size=(Nsamp*2,),replace=False)
data_train = raw_data[random_indices]
header = raw_header[random_indices]

del raw_data, raw_header # huge and no longer needed, get rid of it

print("DEBUG::data_train::")
print(data_train)

DEBUG::data_train::
[list(['there', 'interesting', 'discussion', 'movie', 'is', 'moral', 'person', 'good', 'enough', 'need', 'something', 'morebr', 'br', 'the', 'movie', 'preaches', 'without', 'guidance', 'god', 'morally', 'good', 'person', 'enough', 'there', 'line', 'early', 'movie', 'you', 'i', 'look', 'person', 'morally', 'good', 'know', 'going', 'go', 'hellbr', 'br', 'while', 'i', 'christian', 'discussions', 'throughout', 'course', 'movie', 'fascinating', 'way', 'movie', 'intended', 'i', 'left', 'movie', 'stronger', 'feeling', 'morally', 'good', 'is', 'enough', 'the', 'arguments', 'discussions', 'presented', 'heavily', 'biased', 'much', 'crush', 'weight', 'ignorance', 'fanaticism', 'powerful', 'thing', 'especially', 'inferenced', 'minds', 'ignorant', 'uneducated', 'as', 'george', 'carlins', 'character', 'dogma', 'said', 'hook', 'em', 'theyre', 'youngbr', 'br', 'the', 'basic', 'premise', 'interesting', 'one', 'also', 'a', 'bible', 'scholar', 's', 'attempting', 'publish', 'book', 'sa

Display sentiments and their frequencies in the dataset, to ensure it is roughly balanced between classes

In [13]:
unique_elements, counts_elements = np.unique(header, return_counts=True)
print("Sentiments and their frequencies:")
print(unique_elements)
print(counts_elements)

Sentiments and their frequencies:
[0 1]
[1020  980]
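
Drawing `2*Nsamp` indices uniformly, as above, gives classes that are only approximately balanced. If exactly equal class sizes were wanted, one option is to sample each class separately. The function `balanced_indices` below is a hypothetical sketch of that alternative, not what the notebook does.

```python
import numpy as np

def balanced_indices(labels, n_per_class, seed=0):
    # Hypothetical helper: pick n_per_class indices from each class, then mix them
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picks = [
        rng.choice(np.flatnonzero(labels == c), size=n_per_class, replace=False)
        for c in np.unique(labels)
    ]
    return rng.permutation(np.concatenate(picks))

labels = np.array([0] * 50 + [1] * 50)
idx = balanced_indices(labels, n_per_class=10)
values, counts = np.unique(labels[idx], return_counts=True)
print(values, counts)  # [0 1] [10 10]
```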


**Featurize and Create Labels**

In [14]:
EmbeddingVectors = assemble_embedding_vectors(data_train)
print(EmbeddingVectors)

[[ 0.07333101  0.01511984 -0.03923827 ...  0.10405464 -0.08110102
   0.25688022]
 [-0.08809602 -0.34617755 -0.01950213 ...  0.13445397 -0.05908645
   0.02556805]
 [-0.0640822  -0.0522301  -0.03211598 ...  0.01951624  0.08673023
   0.10480079]
 ...
 [-0.14213526 -0.171769   -0.00846663 ... -0.04127606  0.0761945
   0.10384577]
 [ 0.22651275  0.03324785  0.1213126  ... -0.07920102  0.13148943
   0.12953816]
 [ 0.02885111 -0.15764742 -0.00427123 ...  0.05920245 -0.02075605
   0.00402516]]


In [15]:
data = EmbeddingVectors
del EmbeddingVectors

idx = int(0.7*data.shape[0])

# 70% of data for training
train_x = data[:idx,:]
train_y = header[:idx]
# remaining 30% for testing
test_x = data[idx:,:]
test_y = header[idx:] 

print("train_x/train_y list details, to make sure it is of the right form:")
print(len(train_x))
print(train_x)
print(train_y[:5])
print(len(train_y))

train_x/train_y list details, to make sure it is of the right form:
1400
[[ 0.07333101  0.01511984 -0.03923827 ...  0.10405464 -0.08110102
   0.25688022]
 [-0.08809602 -0.34617755 -0.01950213 ...  0.13445397 -0.05908645
   0.02556805]
 [-0.0640822  -0.0522301  -0.03211598 ...  0.01951624  0.08673023
   0.10480079]
 ...
 [ 0.0027437   0.01300152 -0.07666195 ... -0.02507846 -0.07781653
   0.03148237]
 [-0.07563806 -0.18189028 -0.00728056 ... -0.05255522  0.12547016
   0.2277591 ]
 [-0.11023627 -0.09230539 -0.06702375 ... -0.07511466 -0.00195359
   0.17991148]]
[0 1 0 1 0]
1400


# Single IMDB Task Baseline

In [16]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout

input_shape = (len(train_x[0]),)
sent2vec_vectors = Input(shape=input_shape)
dense = Dense(512, activation='relu')(sent2vec_vectors)
dense = Dropout(0.3)(dense)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=sent2vec_vectors, outputs=output)

Using TensorFlow backend.
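
For intuition, the computation this shallow network performs at inference time (when dropout is inactive) can be sketched in plain NumPy. The weights here are random placeholders rather than trained values, and the input dimension of 600 is an assumption about the embedding size.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 600, 512   # dim is an assumed embedding size

# random placeholder weights standing in for the trained Dense layers
W1 = rng.standard_normal((dim, hidden)) * 0.01
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, 1)) * 0.01
b2 = np.zeros(1)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)               # Dense(512, relu)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # Dense(1, sigmoid)

batch = rng.standard_normal((32, dim))
probs = forward(batch)
print(probs.shape)  # (32, 1)
```

Each output is a probability between 0 and 1 that the review is positive.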


In [17]:
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated `nb_epoch` argument
history = model.fit(train_x, train_y, validation_data=(test_x, test_y), batch_size=32,
                    epochs=10, shuffle=True)



Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Add Email Task, Train Single Email Task Baseline

Read Enron dataset and get a sense for the data by printing sample messages to screen

In [18]:
# Input data files are available in the "../input/" directory.
filepath = "../input/enron-email-dataset/emails.csv"

# Read the enron data into a pandas.DataFrame called emails
emails = pd.read_csv(filepath)

print("Successfully loaded {} rows and {} columns!".format(emails.shape[0], emails.shape[1]))
print(emails.head())

Successfully loaded 517401 rows and 2 columns!
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


Separate headers from the message bodies

In [19]:
import email

def extract_messages(df):
    messages = []
    for item in df["message"]:
        # Return a message object structure from a string
        e = email.message_from_string(item)    
        # get message body  
        message_body = e.get_payload()
        messages.append(message_body)
    print("Successfully retrieved message body from e-mails!")
    return messages

bodies = extract_messages(emails)

del emails

Successfully retrieved message body from e-mails!
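
The standard-library `email` module does the header/body separation. A short sketch on a made-up message shows what `message_from_string` and `get_payload` return:

```python
import email

# made-up message: headers, a blank line, then the body
raw = (
    "Message-ID: <123@example.com>\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Are you free at noon?\n"
)

msg = email.message_from_string(raw)
subject = msg["Subject"]
body = msg.get_payload()

print(subject)        # Lunch
print(body.strip())   # Are you free at noon?
```

For multipart messages `get_payload()` returns a list of sub-messages rather than a string, so a more defensive version would check `msg.is_multipart()` first.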


In [20]:
# extract random 10000 enron email bodies for building dataset
import random
bodies_df = pd.DataFrame(random.sample(bodies, 10000))

del bodies # these are huge, no longer needed, get rid of them

# expand default pandas display options to make emails more clearly visible when printed
pd.set_option('display.max_colwidth', 300)

bodies_df.head() # you could do print(bodies_df.head()), but Jupyter displays this nicer for pandas DataFrames

Unnamed: 0,0
0,"Barbara/Jeff:\n\nIn the TurboPark world there will be certain impacts on how power projects \ninvolving physical capacity are structured and implemented. I know some (but \nnot all) of it. I can say that it is fairly easy to bust the structure and \nwind up with assets on the balance sheet, a..."
1,"??? Dear Vince:\n?\n??? I have put together a list of finance working papers (some of which I \nbrought to the interview last wednesday) which I have written since 1995 \nmostly in support of my work but also (at least initially) as a learning \ntool. Several of them, however,?do contain inn..."
2,"Guys, what is the contingency plan to manage these outages in the event we \ncan't manage this as effectively in future situations?\n\nRegards\nDelainey\n---------------------- Forwarded by David W Delainey/HOU/ECT on 06/29/2000 \n12:25 PM ---------------------------\n\n\nDavid W Delainey\n06/29..."
3,"Lee,\nThat works well.\nCongratulations and thanks,\nBill\n\n\n\n\n\nlfried@uclink4.berkeley.edu on 10/26/2000 10:06:39 PM\nPlease respond to lfried@uclink4.berkeley.edu\n\n\nTo: jeff.dasovich@enron.com, William Hederman/TCO/ColumbiaGas@COLUMBIAGAS,\nAMosher@appanet.org, shapiro@haas.berkeley.ed..."
4,"GT, How is it going? Man, was that an ass-whipping or what that we took \nfrom OU. Looks like we salvaged some dignity with the CU game. \n\nI was planning on coming in for the Missouri game. I might have my brother \nand another friend with me. How are you set for space. Those guys could..."


Read and Preprocess Fraudulent "419" Email Corpus

In [21]:
filepath = "../input/fraudulent-email-corpus/fradulent_emails.txt"
with open(filepath, 'r',encoding="latin1") as file:
    data = file.read()

Split on the code word `From r` appearing close to the beginning of each email

In [22]:
fraud_emails = data.split("From r")

del data

print("Successfully loaded {} spam emails!".format(len(fraud_emails)))

Successfully loaded 3978 spam emails!
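
A small sketch on made-up text shows how `split` behaves with this marker. When the text begins with the marker, the first chunk is an empty string, which is why the first element is dropped in the next cell.

```python
# made-up two-message sample in the same style as the corpus file
mbox_text = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: URGENT\n\nbody one\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: HELLO\n\nbody two\n"
)

chunks = mbox_text.split("From r")
print(len(chunks))       # 3
print(repr(chunks[0]))   # ''
```

One caveat of this simple approach: if the string `From r` also occurs inside a message body, that message is split in two.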


In [23]:
fraud_bodies = extract_messages(pd.DataFrame(fraud_emails,columns=["message"]))

del fraud_emails

# skip the first chunk, which is whatever precedes the first `From r` marker
fraud_bodies_df = pd.DataFrame(fraud_bodies[1:])

del fraud_bodies

fraud_bodies_df.head() # you could do print(fraud_bodies_df.head()), but Jupyter displays this nicer for pandas DataFrames

Successfully retrieved message body from e-mails!


Unnamed: 0,0
0,"FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY G..."
1,"Dear Friend,\n\nI am Mr. Ben Suleman a custom officer and work as Assistant controller of the Customs and Excise department Of the Federal Ministry of Internal Affairs stationed at the Murtala Mohammed International Airport, Ikeja, Lagos-Nigeria.\n\nAfter the sudden death of the former Head of s..."
2,"FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF ELEME KINGDOM \nCHIEF DANIEL ELEME, PHD, EZE 1 OF ELEME.E-MAIL \nADDRESS:obong_715@epatra.com \n\nATTENTION:PRESIDENT,CEO Sir/ Madam. \n\nThis letter might surprise you because we have met\nneither in person nor by correspondence. But I believe\nit is..."
3,"FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF ELEME KINGDOM \nCHIEF DANIEL ELEME, PHD, EZE 1 OF ELEME.E-MAIL \nADDRESS:obong_715@epatra.com \n\nATTENTION:PRESIDENT,CEO Sir/ Madam. \n\nThis letter might surprise you because we have met\nneither in person nor by correspondence. But I believe\nit is..."
4,"Dear sir, \n \nIt is with a heart full of hope that I write to seek your help in respect of the context below. I am Mrs. Maryam Abacha the former first lady of the former Military Head of State of Nigeria General Sani Abacha whose sudden death occurred on 8th of June 1998 as a result of cardiac ..."


Tokenize each email (truncating to maxtokens), remove stop-words, then convert everything to lower-case, strip non-word characters and truncate each token to maxtokenlen

In [24]:
EnronEmails = bodies_df.iloc[:,0].apply(tokenize)
EnronEmails = EnronEmails.apply(stop_word_removal)
EnronEmails = EnronEmails.apply(reg_expressions)
EnronEmails = EnronEmails.sample(Nsamp)

del bodies_df

SpamEmails = fraud_bodies_df.iloc[:,0].apply(tokenize)
SpamEmails = SpamEmails.apply(stop_word_removal)
SpamEmails = SpamEmails.apply(reg_expressions)
SpamEmails = SpamEmails.sample(Nsamp)

del fraud_bodies_df

raw_data = pd.concat([SpamEmails,EnronEmails], axis=0).values

In [25]:
print("Shape of combined data is:")
print(raw_data.shape)
print("Data is:")
print(raw_data)

# create corresponding labels
Categories = ['spam','notspam']
header = ([1]*Nsamp)
header.extend(([0]*Nsamp))

Shape of combined data is:
(2000,)
Data is:
[list(['frommr', 'james', 'tshabalalatel', 'south', 'africa', 'good', 'day', '', '', 'urgent', 'response', 'let', 'start', 'introducing', 'myself', 'i', 'mr', 'james', 'tshabalala', 'bank', 'manager', 'commercial', 'bank', 'south', 'africa', 'i', 'came', 'know', 'private', 'search', 'reliable', 'personcompany', 'handle', 'confidential', 'portfolio', 'supervision', 'during', 'visit', 'pretoria', 'one', 'associates', 'there', 'confidence', 'gave', 'contact', 'address', 'with', 'srong', 'hope', 'search', 'i', 'decided', 'put', 'proposal', 'the', 'portfolio', 'a', 'german', 'man', 'called', 'mr', 'wolfgang', 'schiniter', '', 'years', 'age', 'prosperous', 'farmer', 'deposited', 'us', 'sum', 'twenty', 'million', 'dollarsusmillion', 'his', 'wife', 'mrs', 'helga', 'schiniter', 'legitimate', 'beneficiary', 'account', 'unfortunately', 'killed', 'air', 'crash', 'involving', 'concord', 'af', 'gonesse', 'france', 'our', 'serch', 'german', 'embassy', 'find

We are now ready to convert these into numerical vectors.

**Featurize and Create Labels**

In [26]:
EmbeddingVectors = assemble_embedding_vectors(raw_data)
print(EmbeddingVectors)

[[-0.01536004 -0.12274602  0.06190309 ...  0.0405392  -0.11272036
  -0.04488139]
 [-0.02831265 -0.08182404  0.20950457 ...  0.02200493 -0.01504305
  -0.07850857]
 [-0.0064414  -0.02040358 -0.02121366 ...  0.0932903  -0.14455533
  -0.09238692]
 ...
 [-0.3625687   0.07203355  0.11921321 ...  0.07324798 -0.10625815
  -0.19990994]
 [-0.06652257 -0.25758487 -0.10346989 ... -0.4478352  -0.48571077
   0.260466  ]
 [-0.18741988 -0.23663858  0.10094383 ... -0.03404405  0.02272647
   0.06112502]]


In [27]:
# shuffle raw data first
def unison_shuffle_data(data, header):
    p = np.random.permutation(len(header))
    data = data[p,:]
    header = np.asarray(header)[p]
    return data, header

data, header = unison_shuffle_data(EmbeddingVectors, header)

idx = int(0.7*data.shape[0])

# 70% of data for training
train_x2 = data[:idx,:]
train_y2 = header[:idx]
# remaining 30% for testing
test_x2 = data[idx:,:]
test_y2 = header[idx:] 

print("train_x2/train_y2 (emails) list details, to make sure it is of the right form:")
print(len(train_x2))
print(train_x2)
print(train_y2[:5])
print(len(train_y2))

train_x2/train_y2 (emails) list details, to make sure it is of the right form:
1400
[[-1.1175805  -0.751737    0.55355763 ... -0.6607673   0.21230514
  -0.30602062]
 [-0.03476024  0.01665223 -0.11075132 ...  0.05981117 -0.32611066
  -0.16099696]
 [-0.21694274 -0.03368444  0.05471903 ... -0.24358185  0.02436963
  -0.4627576 ]
 ...
 [-0.2266984  -0.0039735   0.25503    ...  0.14465979 -0.15364355
  -0.1912163 ]
 [-0.01915166 -0.05129508 -0.06456647 ...  0.11918928 -0.21648325
  -0.02365398]
 [-1.1175805  -0.751737    0.55355763 ... -0.6607673   0.21230514
  -0.30602062]]
[1 0 0 1 0]
1400


Train "just email" single-task shallow neural network

In [28]:
input_shape = (len(train_x2[0]),)
sent2vec_vectors = Input(shape=input_shape)
dense = Dense(512, activation='relu')(sent2vec_vectors)
dense = Dropout(0.3)(dense)
output = Dense(1, activation='sigmoid')(dense)
model = Model(inputs=sent2vec_vectors, outputs=output)

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated `nb_epoch` argument
history = model.fit(train_x2, train_y2, validation_data=(test_x2, test_y2), batch_size=32,
                    epochs=10, shuffle=True)



Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# "Double-Task" Email and IMDB System

In [29]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras.layers import concatenate

input1_shape = (len(train_x[0]),)
input2_shape = (len(train_x2[0]),)
sent2vec_vectors1 = Input(shape=input1_shape)
sent2vec_vectors2 = Input(shape=input2_shape)
combined = concatenate([sent2vec_vectors1,sent2vec_vectors2])
dense1 = Dense(512, activation='relu')(combined)
dense1 = Dropout(0.3)(dense1)
output1 = Dense(1, activation='sigmoid',name='classification1')(dense1)
output2 = Dense(1, activation='sigmoid',name='classification2')(dense1)
model = Model(inputs=[sent2vec_vectors1,sent2vec_vectors2], outputs=[output1,output2])
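
The structure of this two-input, two-output network can be sketched in NumPy: the two embedding vectors are concatenated, passed through one shared hidden layer, and then fed to two separate sigmoid heads. Weights are random placeholders, biases are omitted for brevity, and the dimension of 600 is an assumed embedding size.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 600, 512   # dim is an assumed embedding size

# placeholder weights: one shared layer, one output head per task
W_shared = rng.standard_normal((2 * dim, hidden)) * 0.01
W_head1 = rng.standard_normal((hidden, 1)) * 0.01   # IMDB sentiment head
W_head2 = rng.standard_normal((hidden, 1)) * 0.01   # email spam head

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_imdb, x_email):
    combined = np.concatenate([x_imdb, x_email], axis=1)   # concatenate inputs
    shared = np.maximum(combined @ W_shared, 0.0)           # shared Dense + relu
    return sigmoid(shared @ W_head1), sigmoid(shared @ W_head2)

p1, p2 = forward(rng.standard_normal((8, dim)), rng.standard_normal((8, dim)))
print(p1.shape, p2.shape)  # (8, 1) (8, 1)
```

Because both heads read the same hidden layer, row i of the review data is paired with row i of the email data in every batch. That pairing is arbitrary; it works here only because both training sets have the same number of rows (1400).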

In [30]:
model.compile(loss={'classification1': 'binary_crossentropy',
                    'classification2': 'binary_crossentropy'},
              optimizer='adam', metrics=['accuracy'])
# `epochs` replaces the deprecated `nb_epoch` argument
history = model.fit([train_x,train_x2],[train_y,train_y2],
                    validation_data=([test_x,test_x2],[test_y,test_y2]),
                    batch_size=32, epochs=10, shuffle=True)

  


Train on 1400 samples, validate on 600 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
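
Since no `loss_weights` are passed to `compile`, the quantity being minimized is the plain sum of the two per-task binary cross-entropy losses. A NumPy sketch with made-up labels and predictions illustrates the calculation:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # mean binary cross-entropy, with clipping to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

# made-up labels and predicted probabilities for the two tasks
y_imdb = np.array([1, 0, 1, 0]);  p_imdb = np.array([0.9, 0.2, 0.7, 0.1])
y_email = np.array([0, 0, 1, 1]); p_email = np.array([0.3, 0.1, 0.6, 0.8])

loss_imdb = binary_crossentropy(y_imdb, p_imdb)
loss_email = binary_crossentropy(y_email, p_email)
total_loss = loss_imdb + loss_email   # equal weighting of the two tasks
print(loss_imdb, loss_email, total_loss)
```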


In [31]:
from IPython.display import HTML
def create_download_link(title = "Download file", filename = "data.csv"):
    # quote the href so filenames containing spaces still produce a valid link
    html = '<a href="{filename}">{title}</a>'
    html = html.format(title=title,filename=filename)
    return HTML(html)

#create_download_link(filename='file.svg')

In [32]:
!rm -rf aclImdb
!rm aclImdb_v1.tar.gz