<a href="https://colab.research.google.com/github/OdysseusPolymetis/colabs_for_nlp/blob/main/3_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment analysis in French**

## 1. Imports

In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

In [2]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

In [3]:
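import nltk
# Opens the interactive downloader; the "popular" collection chosen below
# includes the punkt models that sent_tokenize needs.
nltk.download()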

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> popular


    Downloading collection 'popular'
       | 
       | Downloading package cmudict to /root/nltk_data...
       |   Package cmudict is already up-to-date!
       | Downloading package gazetteers to /root/nltk_data...
       |   Package gazetteers is already up-to-date!
       | Downloading package genesis to /root/nltk_data...
       |   Package genesis is already up-to-date!
       | Downloading package gutenberg to /root/nltk_data...
       |   Package gutenberg is already up-to-date!
       | Downloading package inaugural to /root/nltk_data...
       |   Package inaugural is already up-to-date!
       | Downloading package movie_reviews to /root/nltk_data...
       |   Package movie_reviews is already up-to-date!
       | Downloading package names to /root/nltk_data...
       |   Package names is already up-to-date!
       | Downloading package shakespeare to /root/nltk_data...
       |   Package shakespeare is already up-to-date!
       | Downloading package stopwords to /root/nlt


---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

In [4]:
spark = sparknlp.start()

In [5]:
!gdown --id 1GEgd5cQoJkTm5PRWfixxOKHe3uOlxFqo

Downloading...
From: https://drive.google.com/uc?id=1GEgd5cQoJkTm5PRWfixxOKHe3uOlxFqo
To: /content/miserables.txt
100% 3.17M/3.17M [00:00<00:00, 209MB/s]


## 2. The sentences to analyze



In [6]:
filepath_of_text = "/content/miserables.txt"

In [7]:
# Read the whole novel; the context manager closes the file afterwards
with open(filepath_of_text, encoding="utf-8") as f:
    full_text = f.read()

In [8]:
jaccuse="J’accuse enfin le premier conseil de guerre d’avoir violé le droit, en condamnant un accusé sur une pièce restée secrète, et j’accuse le second conseil de guerre d’avoir couvert cette illégalité, par ordre, en commettant à son tour le crime juridique d’acquitter sciemment un coupable. En portant ces accusations, je n’ignore pas que je me mets sous le coup des articles 30 et 31 de la loi sur la presse du 29 juillet 1881, qui punit les délits de diffamation. Et c’est volontairement que je m’expose. Quant aux gens que j’accuse, je ne les connais pas, je ne les ai jamais vus, je n’ai contre eux ni rancune ni haine. Ils ne sont pour moi que des entités, des esprits de malfaisance sociale. Et l’acte que j’accomplis ici n’est qu’un moyen révolutionnaire pour hâter l’explosion de la vérité et de la justice. Je n’ai qu’une passion, celle de la lumière, au nom de l’humanité qui a tant souffert et qui a droit au bonheur. Ma protestation enflammée n’est que le cri de mon âme. Qu’on ose donc me traduire en cour d’assises et que l’enquête ait lieu au grand jour ! J’attends."

In [9]:
from nltk.tokenize import sent_tokenize

In [10]:
# To analyze the whole novel instead of the excerpt, swap the two lines below.
# sentences = sent_tokenize(full_text, language="french")
sentences = sent_tokenize(jaccuse, language="french")

In [11]:
sentences

['J’accuse enfin le premier conseil de guerre d’avoir violé le droit, en condamnant un accusé sur une pièce restée secrète, et j’accuse le second conseil de guerre d’avoir couvert cette illégalité, par ordre, en commettant à son tour le crime juridique d’acquitter sciemment un coupable.',
 'En portant ces accusations, je n’ignore pas que je me mets sous le coup des articles 30 et 31 de la loi sur la presse du 29 juillet 1881, qui punit les délits de diffamation.',
 'Et c’est volontairement que je m’expose.',
 'Quant aux gens que j’accuse, je ne les connais pas, je ne les ai jamais vus, je n’ai contre eux ni rancune ni haine.',
 'Ils ne sont pour moi que des entités, des esprits de malfaisance sociale.',
 'Et l’acte que j’accomplis ici n’est qu’un moyen révolutionnaire pour hâter l’explosion de la vérité et de la justice.',
 'Je n’ai qu’une passion, celle de la lumière, au nom de l’humanité qui a tant souffert et qui a droit au bonheur.',
 'Ma protestation enflammée n’est que le cri de 

## 3. Build the pipeline

In [12]:
# Wrap the raw "text" column into Spark NLP's document annotation
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Multilingual LaBSE sentence embeddings, computed on each document
embeddings = BertSentenceEmbeddings \
    .pretrained("labse", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Pretrained French sentiment classifier that consumes the sentence embeddings
sentimentClassifier = ClassifierDLModel \
    .pretrained("classifierdl_bert_sentiment", "fr") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[
    document,
    embeddings,
    sentimentClassifier,
])

labse download started this may take some time.
Approximate size to download 1.7 GB
[OK!]
classifierdl_bert_sentiment download started this may take some time.
Approximate size to download 22.2 MB
[OK!]


## 4. Run on the text

In [13]:
# Every stage is pretrained, so fitting on an empty DataFrame only assembles the pipeline model
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)
# One row per sentence
df = spark.createDataFrame(pd.DataFrame({"text": sentences}))
result = pipelineModel.transform(df)

## 5. Results

In [14]:
result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("class")).show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|document                                                                                                                                                                                                                                                                                     |class   |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|J’accuse enfin le premier conseil de guerre d’avoir violé le droit, en condamnant un accusé sur une pièce re
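
Once the predicted labels are available, a quick tally shows how the text splits between classes. The block below is a minimal sketch of that step, assuming the values of the `class` column have already been collected into a plain Python list. The label names and the list itself are hypothetical illustrations, not the notebook's actual predictions, and `label_shares` is a helper introduced here.

```python
from collections import Counter

# Hypothetical labels standing in for the values of the "class" column;
# illustrative data only, not the output of the pipeline above.
labels = ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]


def label_shares(labels):
    """Return {label: fraction of sentences carrying that label}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


shares = label_shares(labels)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

The same function works unchanged on the labels of a full novel, where a per-class share is easier to read than a long table.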

# **Word Embeddings**

In [22]:
!pip install stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 691.3/691.3 KB 37.9 MB/s eta 0:00:00
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 240.9/240.9 KB 27.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=92fbaef34dea722673a245b1acf659bcc67bf759834419d37a3a75b5113222bc
  Stored in directory: /root/.cache/pip/wheels/86/62/9e/a6b27a681abcde69970dbc0326ff51955f3beac72f15696984
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-2.2.0 stan

In [29]:
!curl -L -s -o /content/le_rouge_et_le_noir.txt 'https://drive.google.com/uc?id=1gTZgRAh0hEad0YgUKLFdGk42OU8JS5vK&confirm=t'

In [30]:
rouge_et_noir = "/content/le_rouge_et_le_noir.txt"

In [31]:
import stanza
stanza.download('fr')
nlp_stanza = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos,lemma')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Downloading default packages for language: fr (French) ...
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 37132)
Traceback (most recent call last):
  File "/usr/lib/python3.8/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.8/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.8/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.8/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/local/lib/python3.8/dist-packages/pyspark/accumulators.py", line 262, in handle
    poll(accum_updates)
  File "/usr/local/lib/python3.8/dist-packages/pyspark/accumulators.py", line 235, in poll
    if func():
  File "/usr/local/lib/python3.8/dist-packages/pyspark/accumulators.py", line 239, 

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

INFO:stanza:Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |

INFO:stanza:Use device: gpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


In [38]:
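# Read the novel as UTF-8 and run the Stanza pipeline over the full text
with open(rouge_et_noir, encoding="utf-8") as f:
    rouge_stanza = nlp_stanza(f.read())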

In [39]:
# One list of lemmas per sentence, the input format Word2Vec expects
sents_rouge = [
    [word.lemma for word in sent.words]
    for sent in rouge_stanza.sentences
]

In [41]:
from gensim.models import Word2Vec
# `size` is the gensim 3.x parameter name; gensim >= 4.0 renamed it to `vector_size`.
model = Word2Vec(sentences=sents_rouge, size=200, window=5, min_count=2, workers=4)
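
The `window=5` argument bounds how far apart two lemmas may be in a sentence and still serve as context for one another. The block below is a minimal sketch of that idea on a toy sentence with a window of 1. It is not gensim's implementation (which, among other things, randomly shrinks the window during training), and both `context_pairs` and the toy lemmas are hypothetical names introduced for illustration.

```python
def context_pairs(sentence, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, sentence[j]


# Hypothetical lemmatized sentence, in the same shape as one entry of sents_rouge
toy_sentence = ["julien", "regarder", "madame", "de", "rênal"]
pairs = list(context_pairs(toy_sentence, window=1))

for center, context in pairs:
    print(center, "->", context)
```

A larger window produces more pairs per word and tends to capture topical rather than purely syntactic neighbours.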

In [42]:
sims = model.wv.most_similar('Julien', topn=10)

In [43]:
sims

[('Mathilde', 0.9868568181991577),
 ('indifférence', 0.9800305366516113),
 ('venir', 0.9796343445777893),
 ('femme', 0.9781731367111206),
 ('caprice', 0.976804256439209),
 ('sortir', 0.9744753837585449),
 ('ange', 0.9744090437889099),
 ('Paris', 0.9743478298187256),
 ('oui', 0.9740673303604126),
 ('reproche', 0.9740334749221802)]
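
The scores returned by `most_similar` are cosine similarities between word vectors. The block below is a minimal pure-Python sketch of that measure, using made-up three-dimensional vectors rather than the 200-dimensional vectors of the trained model. The function name `cosine_similarity` and the vectors are assumptions introduced for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.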

In [49]:
odd = model.wv.doesnt_match(['Sorel', 'Mole', 'Rênal', 'Paris'])
print("The odd one out is: {}".format(odd))

The odd one out is: Paris
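
Roughly speaking, `doesnt_match` normalizes the vectors of the given words, averages them, and returns the word whose vector is least similar to that average. The block below is a simplified sketch of that logic on made-up two-dimensional vectors. The helper names `unit` and `odd_one_out` and the vector values are hypothetical, and the code is not gensim's own.

```python
import math


def unit(v):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least aligned with the mean direction."""
    normed = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(vec[i] for vec in normed.values()) / len(normed)
                 for i in range(dim)])
    scores = {word: sum(a * b for a, b in zip(vec, mean))
              for word, vec in normed.items()}
    return min(scores, key=scores.get)


# Hypothetical toy vectors: three "person" names cluster together, the place does not
toy_vectors = {
    "Sorel": [1.0, 0.1],
    "Mole": [0.9, 0.2],
    "Rênal": [1.0, 0.0],
    "Paris": [0.1, 1.0],
}
intruder = odd_one_out(toy_vectors)
print(intruder)
```

With real embeddings the clusters are far less clean, so the result is best read as a hint about how the model groups the words rather than as a firm classification.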
