# Data presentation



## Downloading the data

In [1]:
!wget https://marwachafii.github.io/assets/datasets/Sarcasm_Headlines_Dataset.json

--2019-11-29 09:01:16--  https://marwachafii.github.io/assets/datasets/Sarcasm_Headlines_Dataset.json
Resolving marwachafii.github.io (marwachafii.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to marwachafii.github.io (marwachafii.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5616833 (5.4M) [application/json]
Saving to: ‘Sarcasm_Headlines_Dataset.json.6’


2019-11-29 09:01:16 (103 MB/s) - ‘Sarcasm_Headlines_Dataset.json.6’ saved [5616833/5616833]



## Exploring the data

The downloaded file is in the [JSON](https://en.wikipedia.org/wiki/JSON) format.
Usually, a JSON file stores a single dictionary (a JSON object) or a list of dictionaries (a JSON array).

In order to read the file we will use the Python package `json` and then check the type of the loaded content (list or dict).

In [2]:
import json

file_name = "Sarcasm_Headlines_Dataset.json"
with open(file_name) as f:
  dataset = json.load(f)

print(type(dataset))

<class 'list'>


Since our dataset is a list, let us display the first item.

In [3]:
print("The first item:\n")
print(dataset[0])
print("\n")

print("The first item's keys:\n")
print(list(dataset[0].keys()))
print("\n")

print("The first item's values:\n")
print(list(dataset[0].values()))

The first item:

{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', 'headline': "former versace store clerk sues over secret 'black code' for minority shoppers", 'is_sarcastic': 0}


The first item's keys:

['article_link', 'headline', 'is_sarcastic']


The first item's values:

['https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5', "former versace store clerk sues over secret 'black code' for minority shoppers", 0]


When loading a list of dictionaries from a JSON file, you should not assume that all the list's dictionaries have the same keys.

The following example is a valid JSON:

```JSON
[
  {
    "a": 123,
    "b": "test"
  },
  {
    "d": false,
    "b": "test",
    "e": 12.6
  }
]
```

That being said, we will not be facing this issue with the dataset at hand.
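
The caveat above can be checked rather than assumed. The following is a minimal sketch, where `find_inconsistent_items` and `toy_items` are illustrative names that do not appear elsewhere in this notebook; the toy data mirrors the JSON example above, not the real dataset.

```python
def find_inconsistent_items(items, expected_keys):
    """Return the indices of the items whose key set differs from expected_keys."""
    expected = set(expected_keys)
    return [i for i, item in enumerate(items) if set(item.keys()) != expected]

# Toy data mirroring the JSON example above (not the real dataset).
toy_items = [
    {"a": 123, "b": "test"},
    {"d": False, "b": "test", "e": 12.6},
]

# Use the keys of the first item as the reference.
bad_indices = find_inconsistent_items(toy_items, toy_items[0].keys())
print(bad_indices)  # [1]
```

On the real data, `find_inconsistent_items(dataset, dataset[0].keys())` should return an empty list if every item has the same three keys.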

# Data preparation

We will need two lists:
- the first created from the `headline` value of every dataset item
- the second created from the `is_sarcastic` value of every dataset item

In [0]:
"""
The following code uses 'list comprehensions':
https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions
"""
headlines = [item["headline"] for item in dataset]
labels = [item["is_sarcastic"] for item in dataset]

Some basic stats:

In [None]:
import numpy as np
import keras

print(f"{np.sum(labels)} of {len(headlines)} headlines are sarcastic ({np.mean(labels)*100}%)")


Using TensorFlow backend.


11724 of 26709 headlines are sarcastic (43.89531618555543%)
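
To put the accuracies reported later in context, it can help to know what a trivial classifier that always predicts the most frequent class would score. This is a small sketch using only the counts printed above; `majority_baseline_accuracy` is an illustrative helper, not part of the notebook.

```python
def majority_baseline_accuracy(n_positive, n_total):
    """Accuracy obtained by always predicting the most frequent class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total

# Counts copied from the output above.
baseline = majority_baseline_accuracy(11724, 26709)
print(f"{baseline:.3f}")  # 0.561
```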


## Tokenization

First, we will build a dictionary that associates an integer with each word (for example, the 10k most frequent words). Remember to reserve one integer for words missing from the dictionary (1, for example) and one for the absence of a word (0, for example).

Next, we will use the dictionary to convert each sentence into a sequence of integers, and then pad these sequences with zeros so that all of them have the same fixed length.

Example of mapping words to integers and a phrase to a vector:

```
Word dictionary

new <--> 12356
president <--> 756
elected <--> 12374
unknown words <--> 1

a new president elected <--> [1, 12356, 756, 12374]

Notice we mapped the word "a" to 1 since it was not found in our word dictionary.

To obtain vectors of the same size when, for example, our longest sentence is 9 words long, we can add 5 zeros (zero padding) at the end of our vector (since our vector has only 4 items in it).

a new president elected <--> [1, 12356, 756, 12374, 0, 0, 0, 0, 0]
```
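
The mapping and padding shown above can be sketched in a few lines of plain Python. The names `encode` and `toy_word_to_int` are illustrative, the dictionary holds only the three words of the example, and the sketch splits on whitespace instead of using a real tokenizer. It also truncates sentences longer than the target length, which the example above does not need.

```python
PAD, UNKNOWN = 0, 1  # reserved codes, as described in the text

# Toy dictionary taken from the example above.
toy_word_to_int = {"new": 12356, "president": 756, "elected": 12374}

def encode(sentence, word_to_int, length):
    """Map each word to its code (UNKNOWN if absent), then pad or truncate to `length`."""
    codes = [word_to_int.get(word, UNKNOWN) for word in sentence.lower().split()]
    codes = codes[:length]
    return codes + [PAD] * (length - len(codes))

print(encode("a new president elected", toy_word_to_int, 9))
# [1, 12356, 756, 12374, 0, 0, 0, 0, 0]
```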




In [0]:
"""
You can proceed as follows (you are free to do otherwise):

- Create word dictionary from the headlines
- Select the 10000 most recurrent words
- Create a dictionary that maps every word from the 10000 list to a number
  0 is reserved for padding vectors and 1 for words not found in the 10000
  word list
"""

# Build the list of all the words found in all the headlines (sentences).
headlines = [item["headline"] for item in dataset]
word_list=[]
for sentence in headlines:
  words=keras.preprocessing.text.text_to_word_sequence(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
  for word in words:
    word_list.append(word)

In [0]:
# Build the list of all words without repetition
# (named unique_words so that the built-in `list` is not shadowed)
unique_words=[]
for word in word_list:
  if word not in unique_words:
    unique_words.append(word)

In [0]:
# Build the list holding the number of occurrences of each of these words
list_count=[]
for word in unique_words:
  count=word_list.count(word)
  list_count.append(count)

In [0]:
# Build the list of the 10000 most frequent words
most_recurrent_words=[]
for i in range(10000):
  maxcount=max(list_count)
  indice=list_count.index(maxcount)
  most_recurrent_words.append(unique_words[indice])
  # Mark the count as used instead of removing it, so that the indices of
  # list_count stay aligned with those of unique_words
  list_count[indice]=-1

In [18]:
print(most_recurrent_words[0])

to
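
The counting cells above scan the whole word list once per distinct word, which becomes slow as the corpus grows. As an alternative sketch (not the approach used in this notebook), the standard-library `collections.Counter` counts every word in a single pass. The names `top_words`, `toy_headlines` and `toy_codes` are illustrative, and the sketch splits on whitespace rather than calling `text_to_word_sequence`.

```python
from collections import Counter

def top_words(sentences, n):
    """Return the n most frequent words, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence.lower().split())
    return [word for word, _ in counts.most_common(n)]

# Toy headlines, for illustration only.
toy_headlines = ["man goes to work", "man returns to town", "dog runs to park"]

most_frequent = top_words(toy_headlines, 2)
print(most_frequent)  # ['to', 'man']

# Codes start at 2 because 0 (padding) and 1 (unknown word) are reserved.
toy_codes = {word: rank + 2 for rank, word in enumerate(most_frequent)}
print(toy_codes)  # {'to': 2, 'man': 3}
```

Storing the codes in a dictionary also makes each lookup a constant-time operation, whereas searching a list with `.index` takes time proportional to its length.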


In [0]:
# Build the integer codes of the 10000 most frequent words
# (position + 2, since 0 is reserved for padding and 1 for unknown words)
word_to_int = []
for word in most_recurrent_words:
  word_to_int.append(most_recurrent_words.index(word)+2)

In [0]:
# Length of the longest sentence, used for zero padding
len_sentence=[]
for sentence in headlines:
  words=keras.preprocessing.text.text_to_word_sequence(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
  len_sentence.append(len(words))
max_len_sentence=max(len_sentence)

In [0]:
# Convert every sentence into a vector of integers of length max_len_sentence using the dictionary.
list_vector=[]
for sentence in headlines:
  words=keras.preprocessing.text.text_to_word_sequence(sentence, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')
  vector=np.zeros(max_len_sentence)
  # enumerate gives each word its own position, even when a word is repeated
  for index, word in enumerate(words):
    if word not in most_recurrent_words:
      vector[index]=1
    else:
      vector[index]=word_to_int[most_recurrent_words.index(word)]
  list_vector.append(vector)

In [43]:
print(list_vector[0])

[3.080e+02 6.790e+02 2.298e+03 2.577e+03 3.820e+02 4.800e+01 2.746e+03
 3.112e+03 6.970e+03 6.000e+00 2.924e+03 5.274e+03 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00]


# Classification

**Deep neural network**

We will now create a neural network that takes a whole vector of indices (an encoded sentence) and outputs a prediction: "sarcasm" or "no sarcasm".

Guidelines:

- Remember to divide your set into train, val and test subsets
- The first layer of our network is an embedding layer (use `tf.keras.layers.Embedding`)
- Choose the hidden layers as you see fit
- The last layer has only one output neuron with a sigmoid activation function corresponding to a probability
- The applied cost function is a binary crossentropy
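
The first guideline can be sketched as a shuffled split over item indices. This is only one possible approach: `split_indices`, the 10% validation and test fractions and the seed are assumptions made for the illustration, and the cells below use contiguous slices without shuffling instead.

```python
import random

def split_indices(n_items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the indices 0..n_items-1 and cut them into train/val/test lists."""
    # Sampling every index without replacement yields a shuffled copy.
    indices = random.Random(seed).sample(range(n_items), n_items)
    n_val = int(n_items * val_fraction)
    n_test = int(n_items * test_fraction)
    n_train = n_items - n_val - n_test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

# 26709 is the number of headlines reported earlier.
train_idx, val_idx, test_idx = split_indices(26709)
print(len(train_idx), len(val_idx), len(test_idx))  # 21369 2670 2670
```

The index lists can then select the encoded sentences and their labels together, for example `[list_vector[i] for i in train_idx]` and `[labels[i] for i in train_idx]`, so that both stay aligned.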

In [76]:
# Build our first model

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Embedding, SimpleRNN, LSTM, Conv1D

X_train=list_vector[0:20710] 
X_val=list_vector[20710:23710]
X_test=list_vector[23710:26710]
Y_train=labels[0:20710]
Y_val=labels[20710:23710]
Y_test=labels[23710:26710]

X_train = np.asarray(X_train)
X_val = np.asarray(X_val )
X_test = np.asarray(X_test)

model = Sequential()
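# Note: the integer codes built above range from 0 to 10001, so an input_dim of 10002 would be needed to cover all of them
model.add(Embedding(10001, 64, input_length=max_len_sentence))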
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 40, 64)            640064    
_________________________________________________________________
flatten_13 (Flatten)         (None, 2560)              0         
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 2561      
Total params: 642,625
Trainable params: 642,625
Non-trainable params: 0
_________________________________________________________________
None
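
The parameter counts in the summary above can be reproduced by hand. The sketch below only restates the arithmetic; the variable names are illustrative.

```python
vocab_size, embedding_dim, sequence_length = 10001, 64, 40

embedding_params = vocab_size * embedding_dim    # one 64-value vector per integer code
flatten_size = sequence_length * embedding_dim   # 40 positions x 64 values, no parameters
dense_params = flatten_size * 1 + 1              # one weight per input plus one bias

print(embedding_params, flatten_size, dense_params, embedding_params + dense_params)
# 640064 2560 2561 642625
```

Almost all of the parameters sit in the embedding layer, which gives the model a lot of capacity relative to the size of the training set.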


In [64]:
batch_size = 32
epochs = 20

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])





Train on 20710 samples, validate on 3000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Test loss: 1.1471892492836337
Test accuracy: 0.7905968657808726


We obtain a test accuracy of 0.79, which is a good result. However, the training accuracy is 0.99, so the model is overfitting. To address this, we can reduce the number of epochs (early stopping) or enlarge the training set.
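
The early-stopping idea can be sketched independently of any framework: stop once the validation loss has not improved for a few consecutive epochs, and keep the best epoch seen so far. The function name, the `patience` value and the loss values below are hypothetical and were not measured in this notebook.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # training would stop here
    return best_epoch

# Hypothetical validation losses, for illustration only.
toy_val_losses = [0.60, 0.45, 0.40, 0.43, 0.50, 0.38]
print(early_stopping_epoch(toy_val_losses))  # 3
```

In this toy run the loop stops after epoch 5 and never sees the later value of 0.38, which is the usual trade-off when choosing the patience. Keras offers the same mechanism through its `EarlyStopping` callback.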

In [68]:

X_train=list_vector[0:21368] 
X_val=list_vector[21368:24710]
X_test=list_vector[24710:26710]
Y_train=labels[0:21368]
Y_val=labels[21368:24710]
Y_test=labels[24710:26710]

X_train = np.asarray(X_train)
X_val = np.asarray(X_val )
X_test = np.asarray(X_test)

model = Sequential()
model.add(Embedding(10001, 64, input_length=max_len_sentence))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

batch_size = 32
epochs = 5

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])


Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 40, 64)            640064    
_________________________________________________________________
flatten_8 (Flatten)          (None, 2560)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 2561      
Total params: 642,625
Trainable params: 642,625
Non-trainable params: 0
_________________________________________________________________
None
Train on 21368 samples, validate on 3342 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.4288494803268591
Test accuracy: 0.819909955007306


We obtain slightly better results (a test accuracy of about 0.82, up from 0.79).
We now try to improve the model by using an RNN layer (in order to add a dependency between the words of a sentence).

In [0]:
X_train=list_vector[0:21368] 
X_val=list_vector[21368:24710]
X_test=list_vector[24710:26710]
Y_train=labels[0:21368]
Y_val=labels[21368:24710]
Y_test=labels[24710:26710]

X_train = np.asarray(X_train)
X_val = np.asarray(X_val )
X_test = np.asarray(X_test)

model = Sequential()
model.add(Embedding(10001, 64, input_length=max_len_sentence))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

batch_size = 32
epochs = 10

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 40, 64)            640064    
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 32)                3104      
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 33        
Total params: 643,201
Trainable params: 643,201
Non-trainable params: 0
_________________________________________________________________
None
Train on 21368 samples, validate on 3342 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.755467976269691
Test accuracy: 0.800900450284747


The results are not really better.
Instead, we use an LSTM layer (same principle as the RNN, but it links two distant words of a sentence more effectively, because the simple RNN suffers from the vanishing gradient problem).

In [80]:
model = Sequential()
model.add(Embedding(10001, 64, input_length=max_len_sentence))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

batch_size = 32
epochs = 5

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, 40, 64)            640064    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_20 (Dense)             (None, 1)                 33        
Total params: 652,513
Trainable params: 652,513
Non-trainable params: 0
_________________________________________________________________
None
Train on 21368 samples, validate on 3342 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.5095036876923207
Test accuracy: 0.80040020010005
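
The recurrent-layer parameter counts reported in the two summaries (3104 for `SimpleRNN`, 12416 for `LSTM`) follow from the layer sizes. A short sketch of the arithmetic, with illustrative variable names:

```python
embedding_dim, units = 64, 32

# One recurrent block: input weights + recurrent weights + one bias per unit.
params_per_block = units * (embedding_dim + units + 1)

simple_rnn_params = params_per_block   # a SimpleRNN has a single such block
lstm_params = 4 * params_per_block     # an LSTM has four (three gates and the cell candidate)

print(simple_rnn_params, lstm_params)  # 3104 12416
```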


There is much less overfitting: the test accuracy is fairly good relative to the training accuracy. We could probably obtain better results with more epochs. We now add a Conv1D layer.

In [81]:
model = Sequential()
model.add(Embedding(10001, 64, input_length=max_len_sentence))
model.add(Conv1D(64,5,strides=1))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

print(model.summary())

batch_size = 32
epochs = 5

model.compile(loss='binary_crossentropy',
              optimizer='Adam',
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_19 (Embedding)     (None, 40, 64)            640064    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 36, 64)            20544     
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense_21 (Dense)             (None, 1)                 33        
Total params: 673,057
Trainable params: 673,057
Non-trainable params: 0
_________________________________________________________________
None
Train on 21368 samples, validate on 3342 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.42554853333271164
Test accuracy: 0.8139069534767384
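
The `Conv1D` line of the summary above (output shape `(None, 36, 64)` and 20544 parameters) can be checked the same way. The sketch assumes the default Keras `"valid"` padding, which matches the layer as written; the variable names are illustrative.

```python
sequence_length, in_channels = 40, 64
filters, kernel_size, strides = 64, 5, 1

# Without padding, the 5-word window must fit entirely inside the 40-word sequence.
output_length = (sequence_length - kernel_size) // strides + 1

# Each filter has kernel_size x in_channels weights plus one bias.
conv_params = kernel_size * in_channels * filters + filters

print(output_length, conv_params)  # 36 20544
```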


By adding a Conv1D layer, we observe a small improvement in accuracy, with neither overfitting nor underfitting.

We expected better results with the LSTM; the gap is probably due to the initial dataset (perhaps we needed to train on the whole dataset in order to cover all cases, and use other examples for the test and validation sets).