<a href="https://colab.research.google.com/github/EmmanuelADAM/IntelligenceArtificiellePython/blob/master/TP_DeepLearningM1TNSI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis Model Based on Deep Learning

The goal of this lab is to develop a predictive machine learning model that uses the bag-of-words representation to classify the sentiment of movie reviews (text data).

## Movie review dataset
Download the dataset from this link:
[Movie Review Polarity Dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz)

## Data preparation


### Splitting into training and test data

The data is split into training data (90%) and test data (10%).
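
As a rough illustration of this split: the code later in this notebook holds out every review whose file name starts with `cv9` as test data. The sketch below assumes the files in each class folder are numbered `cv000` to `cv999` (an assumption based on names such as `cv001_19502.txt` used in this notebook); the numeric suffixes and the helper `is_test_file` are made up for the example.

```python
# Hypothetical file names: cv000 ... cv999, with a made-up numeric suffix.
filenames = [f"cv{i:03d}_{10000 + i}.txt" for i in range(1000)]

def is_test_file(filename):
    # Same rule as in process_docs: names starting with 'cv9' are held out.
    return filename.startswith('cv9')

train_files = [f for f in filenames if not is_test_file(f)]
test_files = [f for f in filenames if is_test_file(f)]

# 900 training files and 100 test files, i.e. a 90% / 10% split
print(len(train_files), len(test_files))
```

Under that naming assumption, the prefix rule gives a fixed, reproducible split without any random shuffling.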


### Loading and cleaning the data

Loading and cleaning the data consists of the following steps:
  - Split the text into tokens on whitespace
  - Remove all punctuation
  - Remove all words that are not made up solely of alphabetic characters
  - Remove all known stop words
  - Remove all words of length <= 1 character
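
The five steps above can be traced on a single sentence. This is only a sketch: `STOP_WORDS` is a tiny hand-written stand-in (an assumption, the notebook itself uses NLTK's English stop-word list), and `clean_text` is a hypothetical name chosen so as not to clash with the notebook's `clean_doc`.

```python
from string import punctuation

# Tiny stand-in stop-word list (assumption; the notebook uses NLTK's English list).
STOP_WORDS = {'this', 'is', 'a', 'the', 'and', 'of'}

def clean_text(text):
    # 1. split into tokens on whitespace
    tokens = text.split()
    # 2. remove punctuation from each token ("it's" becomes "its")
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # 3. keep only purely alphabetic tokens (drops '2hour', '1999' and empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # 4. drop stop words
    tokens = [w for w in tokens if w not in STOP_WORDS]
    # 5. drop tokens of one character or less
    tokens = [w for w in tokens if len(w) > 1]
    return tokens

print(clean_text("this is a 2-hour film , and it's the best b movie of 1999 !"))
# ['film', 'its', 'best', 'movie']
```

Note that punctuation is deleted rather than replaced by a space, which is why contractions such as "don't" show up as `dont` in the cleaned reviews.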

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
drive.mount("/content/drive", force_remount=True)

Mounted at /content/drive


In [3]:
from google.colab import files
uploaded = files.upload()

In [4]:
import nltk
nltk.download("stopwords")


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from numpy import array
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

 
# load doc into memory
def load_doc(filename):
	# open the file read-only; the with block closes it automatically
	with open(filename, 'r') as file:
		# read all text
		text = file.read()
	return text
 
# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if w not in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens
 

In [10]:
# load a document
filename = '/content/drive/My Drive/txt_sentoken/neg/cv001_19502.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)
# load another document
filename = '/content/drive/My Drive/txt_sentoken/pos/cv175_6964.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)

['happy', 'bastards', 'quick', 'movie', 'review', 'damn', 'bug', 'got', 'head', 'start', 'movie', 'starring', 'jamie', 'lee', 'curtis', 'another', 'baldwin', 'brother', 'william', 'time', 'story', 'regarding', 'crew', 'tugboat', 'comes', 'across', 'deserted', 'russian', 'tech', 'ship', 'strangeness', 'kick', 'power', 'back', 'little', 'know', 'power', 'within', 'going', 'gore', 'bringing', 'action', 'sequences', 'virus', 'still', 'feels', 'empty', 'like', 'movie', 'going', 'flash', 'substance', 'dont', 'know', 'crew', 'really', 'middle', 'nowhere', 'dont', 'know', 'origin', 'took', 'ship', 'big', 'pink', 'flashy', 'thing', 'hit', 'mir', 'course', 'dont', 'know', 'donald', 'sutherland', 'stumbling', 'around', 'drunkenly', 'throughout', 'hey', 'lets', 'chase', 'people', 'around', 'robots', 'acting', 'average', 'even', 'likes', 'curtis', 'youre', 'likely', 'get', 'kick', 'work', 'halloween', 'sutherland', 'wasted', 'baldwin', 'well', 'hes', 'acting', 'like', 'baldwin', 'course', 'real', '

### Defining a vocabulary

In [11]:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)

# load some docs in a directory
def process_docs(directory, vocab):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any review whose file name starts with 'cv9' (kept for the test set)
		if filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)

    
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('/content/drive/My Drive/txt_sentoken/pos', vocab)
process_docs('/content/drive/My Drive/txt_sentoken/neg', vocab)
# print the size of the vocab
print(len(vocab))
# print the 50 top words in the vocab
print(vocab.most_common(50))

44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('could', 1248), ('bad', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]


In [13]:
# keep tokens with a minimum number of occurrences
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))

25767
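
To see what the counting and filtering do on a small scale, here is a sketch with made-up token lists standing in for cleaned reviews. The names `docs_tokens`, `toy_vocab` and `kept` are illustrative only.

```python
from collections import Counter

# Toy token lists standing in for cleaned reviews (made-up data).
docs_tokens = [
    ['great', 'film', 'great', 'cast'],
    ['bad', 'film', 'weak', 'plot'],
    ['great', 'plot'],
]

toy_vocab = Counter()
for tokens in docs_tokens:
    # Counter.update adds one count per occurrence of each token
    toy_vocab.update(tokens)

# keep only the words seen at least twice across all documents
min_occurrence = 2
kept = [word for word, count in toy_vocab.items() if count >= min_occurrence]

print(toy_vocab['great'])  # 3
print(kept)                # ['great', 'film', 'plot']
```

Words seen only once (`cast`, `bad`, `weak` here) are dropped, which is how the real vocabulary shrinks from 44276 to 25767 entries.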


In [0]:
# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open the file for writing; the with block closes it automatically
	with open(filename, 'w') as file:
		# write text
		file.write(data)

# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')

## Bag-of-words representation

### Converting reviews to lines of tokens


In [0]:
# load doc, clean and return line of tokens
def doc_to_line(filename, vocab):
	# load the doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	return ' '.join(tokens)

In [0]:
# load all docs in a directory
def process_docs(directory, vocab, is_train=True):
	lines = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load and clean the doc
		line = doc_to_line(path, vocab)
		# add to list
		lines.append(line)
	return lines

In [17]:
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
# load all training reviews
positive_lines = process_docs('/content/drive/My Drive/txt_sentoken/pos', vocab)
negative_lines = process_docs('/content/drive/My Drive/txt_sentoken/neg', vocab)
# summarize what we have
print(len(positive_lines), len(negative_lines))

900 900


### Converting movie reviews to bag-of-words vectors

In [0]:
import keras
# create the tokenizer
tokenizer = keras.preprocessing.text.Tokenizer()
# fit the tokenizer on the documents
# negatives first, then positives, to match the label order (0 = negative, 1 = positive)
docs = negative_lines + positive_lines
tokenizer.fit_on_texts(docs)

In [19]:
# encode training data set
Xtrain = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtrain.shape)

(1800, 25768)
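
To make the `mode='freq'` encoding concrete, here is a hand-rolled sketch in plain Python; it is not the Keras implementation. It assumes, as far as we understand the Keras `Tokenizer`, that words are indexed from 1 by decreasing frequency (index 0 stays unused, which would explain the 25768 columns for a 25767-word vocabulary) and that each entry is the word's count divided by the number of tokens in the document. The helpers `fit_word_index` and `doc_to_freq_vector` are hypothetical names.

```python
from collections import Counter

def fit_word_index(docs):
    # Assign indices 1..N by decreasing frequency (index 0 stays unused).
    counts = Counter(word for doc in docs for word in doc.split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def doc_to_freq_vector(doc, word_index):
    # ignore words that were never seen when fitting the index
    tokens = [w for w in doc.split() if w in word_index]
    vector = [0.0] * (len(word_index) + 1)
    for word, count in Counter(tokens).items():
        # frequency = occurrences of the word / number of tokens in the document
        vector[word_index[word]] = count / len(tokens)
    return vector

toy_docs = ['great film great cast', 'bad film']
toy_index = fit_word_index(toy_docs)
print(toy_index)
# {'great': 1, 'film': 2, 'cast': 3, 'bad': 4}
print(doc_to_freq_vector(toy_docs[0], toy_index))
# [0.0, 0.5, 0.25, 0.25, 0.0]
```

Each review becomes one fixed-length row whatever its length, which is what lets the matrix be fed directly to a dense network.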


In [20]:
# load all test reviews
positive_lines = process_docs('/content/drive/My Drive/txt_sentoken/pos', vocab, False)
negative_lines = process_docs('/content/drive/My Drive/txt_sentoken/neg', vocab, False)
docs = negative_lines + positive_lines
# encode test data set
Xtest = tokenizer.texts_to_matrix(docs, mode='freq')
print(Xtest.shape)

(200, 25768)


## Sentiment analysis model

In [0]:
# TODO: give the number of words retained
n_words = 

In [0]:
from numpy import array
# define the outputs: 0 = negative, 1 = positive
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
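
These label arrays only make sense if row i of the input matrix and entry i of the label array refer to the same review, so the blocks of 0s and 1s have to follow the same class order as the document list. Below is a small sketch of building both together, with made-up review strings and a hypothetical helper `build_dataset`.

```python
def build_dataset(negative_lines, positive_lines):
    # Concatenate documents and labels in the same class order:
    # negatives (label 0) first, then positives (label 1).
    docs = negative_lines + positive_lines
    labels = [0] * len(negative_lines) + [1] * len(positive_lines)
    return docs, labels

# Made-up examples, only to show the alignment.
neg = ['boring plot', 'weak acting']
pos = ['great film']
toy_docs, toy_labels = build_dataset(neg, pos)

for doc, label in zip(toy_docs, toy_labels):
    print(label, doc)
```

Building documents and labels in one place makes it harder for the two orders to drift apart between the training and test cells.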

In [0]:
# TODO: define network
model = Sequential()
# try different numbers of layers and of neurons per layer,
# and specify the activation functions
model.add(
...
model.add(
# compile network // TODO: try several optimizers
model.compile(....

In [0]:
# fit network // TODO: try different numbers of epochs
model.fit(Xtrain, ytrain, epochs=50, verbose=2)

Epoch 1/50
 - 3s - loss: 0.6932 - acc: 0.4950
Epoch 2/50
 - 2s - loss: 0.6902 - acc: 0.6389
Epoch 3/50
 - 2s - loss: 0.6674 - acc: 0.7694
Epoch 4/50
 - 2s - loss: 0.5928 - acc: 0.8944
Epoch 5/50
 - 2s - loss: 0.4568 - acc: 0.9417
Epoch 6/50
 - 2s - loss: 0.3067 - acc: 0.9661
Epoch 7/50
 - 2s - loss: 0.1929 - acc: 0.9811
Epoch 8/50
 - 2s - loss: 0.1212 - acc: 0.9922
Epoch 9/50
 - 2s - loss: 0.0788 - acc: 0.9956
Epoch 10/50
 - 2s - loss: 0.0519 - acc: 0.9978
Epoch 11/50
 - 2s - loss: 0.0363 - acc: 1.0000
Epoch 12/50
 - 2s - loss: 0.0259 - acc: 1.0000
Epoch 13/50
 - 2s - loss: 0.0195 - acc: 1.0000
Epoch 14/50
 - 2s - loss: 0.0150 - acc: 1.0000
Epoch 15/50
 - 2s - loss: 0.0118 - acc: 1.0000
Epoch 16/50
 - 2s - loss: 0.0094 - acc: 1.0000
Epoch 17/50
 - 2s - loss: 0.0078 - acc: 1.0000
Epoch 18/50
 - 2s - loss: 0.0065 - acc: 1.0000
Epoch 19/50
 - 2s - loss: 0.0055 - acc: 1.0000
Epoch 20/50
 - 2s - loss: 0.0047 - acc: 1.0000
Epoch 21/50
 - 2s - loss: 0.0041 - acc: 1.0000
Epoch 22/50
 - 2s - lo

<keras.callbacks.History at 0x7fbcd6ae85c0>

In [0]:
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))

Test Accuracy: 90.500000


## Prediction for new reviews

In [0]:
# classify a review as negative (0) or positive (1)
def predict_sentiment(review, vocab, tokenizer, model):
	# clean
	tokens = clean_doc(review)
	# filter by vocab
	tokens = [w for w in tokens if w in vocab]
	# convert to line
	line = ' '.join(tokens)
	# encode
	encoded = tokenizer.texts_to_matrix([line], mode='freq')
	# prediction
	yhat = model.predict(encoded, verbose=0)
	return round(yhat[0,0])
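
The final `round(yhat[0,0])` turns the network output into a class. Assuming the last layer is a single sigmoid unit (the model definition is left as an exercise above, so this is an assumption), the output lies between 0 and 1 and rounding amounts to a threshold at 0.5. A plain-Python sketch, with `to_label` as a hypothetical helper:

```python
import math

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    # 1 = positive, 0 = negative
    return 1 if probability > threshold else 0

for z in (-2.0, 0.3, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.3f}  label={to_label(p)}")
```

Making the threshold explicit also makes it easy to change, for instance when one kind of error is considered more costly than the other.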

In [0]:
# test positive text
text = 'Best movie ever!'
print(predict_sentiment(text, vocab, tokenizer, model))
# test negative text
text = 'This is a bad movie.'
print(predict_sentiment(text, vocab, tokenizer, model))

1.0
0.0
