## Project context

We have been tasked with building an AI model that detects spam emails: using NLP, it should tell spam messages apart from legitimate ones.

## Dataset information

The dataset has 3 columns and 5796 rows:  

<ul>
<li>Category: specifies whether the mail is spam or not.  

1 --> Spam  
0 --> Not spam</li>  
<li>Message: raw text of the message.  
A mix of plain-text messages with headers, plus a few containing HTML tags.</li>  
<li>File_Name: unique message identifier.</li>  
</ul>

## Import libraries

In [54]:
import pandas as pd
import numpy as np

#preprocessing
from sklearn.model_selection import train_test_split

#modeling
import tensorflow as tf

#saving
import pickle

import warnings
warnings.filterwarnings(action="ignore")

## Dataset shape

In [33]:
data = pd.read_csv('data/Spam Email raw text for NLP.csv')
data

Unnamed: 0,CATEGORY,MESSAGE,FILE_NAME
0,1,"Dear Homeowner,\n\n \n\nInterest Rates are at ...",00249.5f45607c1bffe89f60ba1ec9f878039a
1,1,ATTENTION: This is a MUST for ALL Computer Use...,00373.ebe8670ac56b04125c25100a36ab0510
2,1,This is a multi-part message in MIME format.\n...,00214.1367039e50dc6b7adb0f2aa8aba83216
3,1,IMPORTANT INFORMATION:\n\n\n\nThe new domain n...,00210.050ffd105bd4e006771ee63cabc59978
4,1,This is the bottom line. If you can GIVE AWAY...,00033.9babb58d9298daa2963d4f514193d7d6
...,...,...,...
5791,0,"I'm one of the 30,000 but it's not working ver...",00609.dd49926ce94a1ea328cce9b62825bc97
5792,0,Damien Morton quoted:\n\n>W3C approves HTML 4 ...,00957.e0b56b117f3ec5f85e432a9d2a47801f
5793,0,"On Mon, 2002-07-22 at 06:50, che wrote:\n\n\n\...",01127.841233b48eceb74a825417d8d918abf8
5794,0,"Once upon a time, Manfred wrote :\n\n\n\n> I w...",01178.5c977dff972cd6eef64d4173b90307f0


In [34]:
#info of data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5796 entries, 0 to 5795
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   CATEGORY   5796 non-null   int64 
 1   MESSAGE    5796 non-null   object
 2   FILE_NAME  5796 non-null   object
dtypes: int64(1), object(2)
memory usage: 136.0+ KB


The dataset has no missing or aberrant values.  
The CATEGORY column is our target.  
The MESSAGE column holds the words, sentences and characters of the emails.  
FILE_NAME is a column we will drop during preprocessing, since it is not useful for our task.  
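The missing-value check above can be reproduced with pandas. The sketch below is only an illustration on a toy frame: the three rows are invented (one message is deliberately left empty), and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with the same three columns as the dataset; the rows are invented.
toy = pd.DataFrame({
    "CATEGORY": [1, 0, 1],
    "MESSAGE": ["Win a prize", None, "Cheap loans"],
    "FILE_NAME": ["a", "b", "c"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Number of rows per class (1 = spam, 0 = not spam)
class_counts = toy["CATEGORY"].value_counts()

print(missing_per_column)
print(class_counts)
```

On the real `data` frame, a column of zeros from `data.isna().sum()` would agree with the `5796 non-null` counts reported by `data.info()`.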

## Preprocessing

We start by turning the texts into sequences through tokenization.  
We determine the maximum sequence length, i.e. the largest number of words found in any row of the training set.  
The **tokenizer** assigns a number to each word it encounters. To avoid *shape* problems, the sequences are then padded with **0** so that they all have the same length.  
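To make the two steps concrete, here is a minimal pure-Python sketch of word indexing and zero padding. It is a simplification and not the Keras implementation: the helper names (`build_vocabulary`, `texts_to_sequences`, `pad_post`) and the sample texts are hypothetical, words are split on whitespace only, and indices are assigned in order of first appearance, whereas the Keras `Tokenizer` ranks words by frequency.

```python
def build_vocabulary(texts):
    """Map each distinct lower-cased word to an index starting at 1 (0 is kept for padding)."""
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def texts_to_sequences(texts, vocabulary):
    """Replace every known word by its index; words missing from the vocabulary are dropped."""
    return [[vocabulary[word] for word in text.lower().split() if word in vocabulary]
            for text in texts]


def pad_post(sequences, max_len):
    """Append zeros up to max_len; longer sequences keep only their last max_len tokens."""
    padded = []
    for sequence in sequences:
        kept = sequence[-max_len:] if max_len > 0 else []
        padded.append(kept + [0] * (max_len - len(kept)))
    return padded


train_texts = ["win a free prize now", "meeting at noon", "free free prize"]
vocabulary = build_vocabulary(train_texts)
train_sequences = texts_to_sequences(train_texts, vocabulary)
max_len = max(len(sequence) for sequence in train_sequences)
train_padded = pad_post(train_sequences, max_len)

# A new text reuses the training vocabulary and the training length
new_padded = pad_post(texts_to_sequences(["free lunch at noon today"], vocabulary), max_len)

print(train_padded)
print(new_padded)
```

Words that were never seen during fitting ("lunch", "today") simply disappear from the new sequence, which is why the vocabulary should be built on the training data only and then reused unchanged.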

In [55]:
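def get_sequences(texts, tokenizer, train=True, max_seq_length=None):
    """Turn raw texts into zero-padded integer sequences of equal length."""
    # Replace each known word by its integer index from the fitted tokenizer
    sequences = tokenizer.texts_to_sequences(texts)

    # On the training set, pad up to the longest sequence; otherwise reuse the given length
    if train:
        max_seq_length = max(len(sequence) for sequence in sequences)

    # Append zeros after each sequence (padding='post') up to max_seq_length
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')

    return sequences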

In the following code, we define X and y for our model and drop the FILE_NAME column.  
We then run a train_test_split to divide the dataset. The training set is used to train the model and makes up **0.7** of the data (**i.e. 70%**). The remaining **30% is the test set**.  
We then fit the tokenizer on the training set so that it builds its vocabulary.  
The tokenizer is saved in pickle format.  
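Saving the tokenizer matters because the application must encode new messages with exactly the same word indices as during training. The round trip can be sketched with the standard library alone; here a plain dictionary (`word_index`, with invented entries) stands in for the fitted tokenizer, and the file is written to a temporary directory rather than to the project folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer's vocabulary
word_index = {"free": 1, "prize": 2, "meeting": 3}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pickle")

# Save: same call pattern as in the notebook
with open(path, "wb") as handle:
    pickle.dump(word_index, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load: what the application would do at start-up
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == word_index)
```

A pickle file should only be loaded from a trusted source, since unpickling can execute arbitrary code.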

In [56]:
def preprocess_inputs(df):
    df = df.copy()
    
    # Drop FILE_NAME column
    df = df.drop('FILE_NAME', axis=1)
    
    # Split df into X and y
    y = df['CATEGORY']
    X = df['MESSAGE']
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    # Create tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)
    
    # Fit the tokenizer
    tokenizer.fit_on_texts(X_train)
    
    # Convert texts to sequences
    X_train = get_sequences(X_train, tokenizer, train=True)
    X_test = get_sequences(X_test, tokenizer, train=False, max_seq_length=X_train.shape[1])
    
    # Save the fitted tokenizer so the application can reuse it
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
    return X_train, X_test, y_train, y_test

In [57]:
X_train, X_test, y_train, y_test = preprocess_inputs(data)
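The size of each split can be checked by hand. The sketch below assumes the training share is rounded down and the test set takes the remainder, which is consistent with the 4057 rows reported by `X_train.shape` in the Training section.

```python
import math

n_rows = 5796       # rows in the dataset
train_size = 0.7    # share passed to train_test_split

# Assumption: the training share is floored, the test set gets the rest
n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train

print(n_train, n_test)
```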

## Explanation

In [38]:
# len(data["MESSAGE"][0])

In [39]:
# tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)
# tokenizer.fit_on_texts(data["MESSAGE"])
# seq = tokenizer.texts_to_sequences(data["MESSAGE"])

In [40]:
# max_len = 0
# for i in data["MESSAGE"]:
#     # Wrap the message in a list: a bare string would be tokenized character by character
#     seq = tokenizer.texts_to_sequences([i])[0]
#     tmp_len = len(seq)
#     if tmp_len > max_len:
#         max_len = tmp_len
#         max_phrase = i

In [41]:
# max_len

In [42]:
# max_len = np.max(list(map(lambda x : len(x),seq)))
# sequences = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len, padding='post')

In [43]:
# print(sequences[0])

In [44]:
# X_train, X_test, y_train, y_test = preprocess_inputs(data)

In [45]:
# X_train.shape

In [46]:
# y_train.value_counts()

## Training

In [47]:
X_train.shape

(4057, 14804)

In [48]:
inputs = tf.keras.Input(shape=(14804,))

embedding = tf.keras.layers.Embedding(
    input_dim=30000,
    output_dim=64
)(inputs)

flatten = tf.keras.layers.Flatten()(embedding)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(flatten)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

#compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)


print(model.summary())

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 14804)]           0         
                                                                 
 embedding_1 (Embedding)     (None, 14804, 64)         1920000   
                                                                 
 flatten_1 (Flatten)         (None, 947456)            0         
                                                                 
 dense_1 (Dense)             (None, 1)                 947457    
                                                                 
Total params: 2,867,457
Trainable params: 2,867,457
Non-trainable params: 0
_________________________________________________________________
None
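The parameter counts in the summary can be re-derived from the layer sizes. This is plain arithmetic on the values used above (vocabulary of 30000, embedding dimension 64, sequence length 14804); the variable names are only illustrative.

```python
vocab_size = 30000     # input_dim of the Embedding layer
embedding_dim = 64     # output_dim of the Embedding layer
seq_length = 14804     # padded sequence length

# One 64-dimensional vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Flatten concatenates the 64-dimensional vectors of all positions
flatten_units = seq_length * embedding_dim

# Dense(1): one weight per flattened unit, plus one bias
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```

About a third of the parameters sit in the single dense layer, a direct consequence of flattening a sequence padded to 14804 positions.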


In [49]:
history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
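Training stopped after 9 of the 100 allowed epochs because of the `EarlyStopping` callback. The sketch below mimics that rule in pure Python: stop once the validation loss has failed to improve for `patience` consecutive epochs. The function name and the loss values are invented for illustration, since the per-epoch losses are not shown in the log above; they were chosen so that the run also stops at epoch 9.

```python
def early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses, not taken from the actual run
hypothetical_val_losses = [0.30, 0.12, 0.08, 0.06, 0.05, 0.04, 0.045, 0.05, 0.041, 0.03]

stopped_epoch, best_epoch = early_stopping(hypothetical_val_losses, patience=3)
print(stopped_epoch, best_epoch)
```

With `restore_best_weights=True`, the model keeps the weights of the best epoch rather than those of the last one. In this invented example that would be epoch 6, and the improvement at epoch 10 is never reached.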


## Results

In [50]:
results = model.evaluate(X_test, y_test, verbose=0)

print("    Test Loss: {:.4f}".format(results[0]))
print("Test Accuracy: {:.2f}%".format(results[1] * 100))
print("     Test AUC: {:.4f}".format(results[2]))

    Test Loss: 0.0229
Test Accuracy: 99.19%
     Test AUC: 0.9989
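The accuracy figure comes from thresholding the sigmoid output: a message counts as spam when its predicted probability exceeds 0.5. A minimal sketch of that computation follows, with a hypothetical helper name and invented labels and probabilities.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Threshold probabilities into 0/1 labels and return (labels, share of correct labels)."""
    predictions = [1 if probability > threshold else 0 for probability in y_prob]
    correct = sum(int(prediction == truth) for prediction, truth in zip(predictions, y_true))
    return predictions, correct / len(y_true)


# Invented example: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 0, 1]
y_prob = [0.97, 0.02, 0.40, 0.10, 0.88]

predictions, accuracy = binary_accuracy(y_true, y_prob)
print(predictions, accuracy)
```

The AUC, by contrast, does not depend on a single threshold, so it is a useful complement when the cost of a missed spam differs from the cost of a blocked legitimate email.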


## Save model

In [51]:
from tensorflow.keras.models import save_model
# ##### SAVE MODEL JUST ONE TIME #####
save_model(model,'model.h5')

## Code refactored for application

In [21]:
# import pickle
#
# import tensorflow as tf
# from flask import Flask, render_template, request
# from tensorflow.keras.models import load_model

# app = Flask(__name__)

# # Padded length used at training time (X_train.shape[1])
# MAX_SEQ_LENGTH = 14804

# # Load the fitted tokenizer and the trained model once, at start-up
# with open('tokenizer.pickle', 'rb') as handle:
#     tokenizer = pickle.load(handle)
# model = load_model('model.h5')

# def get_sequences(texts, tokenizer, max_seq_length):
#     # Encode the texts with the training vocabulary and pad them to the training length
#     sequences = tokenizer.texts_to_sequences(texts)
#     return tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')

# @app.route('/')
# def home():
#     return render_template('index.html')

# @app.route('/predict', methods=['POST'])
# def predict():
#     message = request.form['message']
#     X = get_sequences([message], tokenizer, MAX_SEQ_LENGTH)
#     # Sigmoid output: probability that the message is spam
#     my_prediction = float(model.predict(X)[0][0])
#     return render_template('index.html', prediction=my_prediction)

# if __name__ == '__main__':
#     app.run()