# TP3 & 4 - Artificial Intelligence

#### Marlon KUQI & Théo MARIE

### 1 - Recurrent neural network / LSTM : IMDB sentiment classification

![Encoders/Decoders for LSTM](https://qph.fs.quoracdn.net/main-qimg-febee5b881545802a75c064a84ecf85d)

In [2]:
'''
#Trains an LSTM model on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF + LogReg.
**Notes**
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''

from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

max_features = 20000
# cut texts after this number of words (among top max_features most common words)
maxlen = 80
batch_size = 32

# ---------- Data preparation -----------

# Loads data from the imdb dataset
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

# Pads or truncates every sequence so that all of them contain exactly "maxlen" word indices
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

# ---------- Model building -----------

print('Build model...')
model = Sequential()

# The Embedding layer maps each word index of the vocabulary (the "max_features" most frequent words) to a dense vector
# Each embedding vector has 128 dimensions
model.add(Embedding(max_features, 128))

# The LSTM layer will have an output size of 128
# "dropout" -> Fraction of the units to drop for the linear transformation of the inputs.
# "recurrent_dropout" -> Fraction of the units to drop for the linear transformation of the recurrent state.
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))

# The last layer is a fully connected (Dense) layer with a single unit; its sigmoid activation outputs a value between 0 and 1
model.add(Dense(1, activation='sigmoid'))

# Compiles the model
# "loss" -> Loss function to minimize, here binary_crossentropy (i.e. log loss)
# "optimizer" -> Algorithm used to minimize the loss, here Adam (we used SGD in the previous TPs)
# "metrics" -> List of metrics reported during training and evaluation
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Using TensorFlow backend.


Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test score: 1.1003549643659591
Test accuracy: 0.8094
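
The `sequence.pad_sequences` call in the cell above forces every review to exactly `maxlen` word indices. The snippet below is a minimal sketch of that behaviour in plain Python, assuming the Keras defaults (zeros are added at the front, and sequences that are too long are cut from the front so that the last `maxlen` indices are kept). The helper name `pad_pre` and the toy sequences are ours, not part of Keras or of the IMDB data.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad and left-truncate, like the pad_sequences defaults."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Keep only the last `maxlen` items when the sequence is too long
        seq = seq[max(0, len(seq) - maxlen):]
        # Fill the missing positions at the front with the padding value
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Two made-up "reviews" encoded as word indices
reviews = [[5, 8, 2], [7, 1, 4, 9, 3, 6]]
print(pad_pre(reviews, maxlen=4))  # [[0, 5, 8, 2], [4, 9, 3, 6]]
```

With `maxlen = 80`, a review longer than 80 words therefore keeps only its final 80 words under these defaults.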


#### Our first results give the following figures:

Test score: 1.1003549643659591

Test accuracy: 0.8094
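
The "Test score" returned by `model.evaluate` is the value of the `binary_crossentropy` loss on the test set. As a rough illustration of how this loss relates to accuracy, here is a minimal sketch of the formula in plain Python. The labels and probabilities are made up for the example (they are not outputs of our model), and the function names are ours.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def accuracy(y_true, y_pred, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Four confident correct predictions and one confident mistake (illustrative values)
y_true = [1, 0, 1, 0, 1]
y_pred = [0.99, 0.01, 0.98, 0.02, 0.001]

print("accuracy:", accuracy(y_true, y_pred))
print("log loss:", binary_crossentropy(y_true, y_pred))
```

In this toy case a single confident mistake pushes the mean loss above 1 while the accuracy is 0.8. One possible reading of our own figures (loss around 1.10 with about 81% accuracy) is therefore that the model makes some very confident wrong predictions on the test set, although we did not verify this.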

### 2 - Text classification: the Ohsumed dataset

In [38]:
import numpy as np
import pandas as pd
import os
from collections import defaultdict

In [None]:
categories = 23
current_category = "03"

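# Path prefixes built with os.path.join so that the separators work on any OS;
# the category number is appended to get e.g. ohsumed-first-20000/training/C03
training_path = os.path.join("ohsumed-first-20000", "training", "C")
test_path = os.path.join("ohsumed-first-20000", "test", "C")

#tmp_file = pd.read_csv("ohsumed-first-20000/training/C01")

# Number of occurrences of each whitespace-separated token in the category
wordcount = defaultdict(int)

for root, dirs, files in os.walk(training_path + current_category):
    for filename in files:
        print(filename)
        # Join with root so that files in sub-directories are also found
        with open(os.path.join(root, filename)) as file:
            for word in file.read().split():
                wordcount[word] += 1

# Print the counts once, after every file has been read
for k, v in wordcount.items():
    print(k, v)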
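
The counting loop above can also be sketched with `collections.Counter`, which gives the most frequent tokens directly. The example below is self-contained: it writes two tiny made-up files to a temporary directory instead of reading the Ohsumed folders, and the function name `count_words` is ours.

```python
import os
import tempfile
from collections import Counter


def count_words(directory):
    """Count whitespace-separated tokens in every file below `directory`."""
    counts = Counter()
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            with open(os.path.join(root, filename), encoding="utf-8") as f:
                counts.update(f.read().split())
    return counts


# Stand-in for one category folder, filled with two made-up documents
with tempfile.TemporaryDirectory() as tmp:
    samples = [("0000001", "chin implant material"),
               ("0000002", "implant material mesh material")]
    for name, text in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(text)
    counts = count_words(tmp)

print(counts.most_common(2))  # [('material', 3), ('implant', 2)]
```

Splitting on whitespace keeps punctuation attached to words (for example `mesh.` and `mesh` are counted separately), which is also the case in the loop above.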

In [8]:
df.head()

| Unnamed: 0 | ohsumed-first-20000-docs/ | 5 | Unnamed: 2 | Unnamed: 3 | 5.1 | Unnamed: 5 | Unnamed: 6 | 5.2 | Unnamed: 8 | Unnamed: 9 | 0 | Unnamed: 11 | Unnamed: 12 | mentoplasty | using | Mersilene | mesh. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 |  | Many | different | materials | are | available | for | augmentation | mentoplasty. |  |  |  |  |  |  |  |  |
| 1 |  | However, | the | optimal | implant | material | for | chin | implantation | has | yet | to | be | found. |  |  |  |
| 2 |  | The | material | provides | excellent | tensile | strength, | durability, | and | surgical | adaptability. |  |  |  |  |  |  |
| 3 |  | Based | on | this | 10-year | experience, | Mersilene | mesh | remains | our | material | of | choice | for | chin | augmentation. |  |
| 4 |  | 0 |  |  | intracranial | mucoceles | associated | with | phaeohyphomycosis | of | the | paranasal | sinuses. |  |  |  |  |
