# Natural Language Processing First Steps with Python

This notebook covers basic NLP techniques: text preprocessing, similarity measures, sentiment analysis, text translation, and TF-IDF analysis.

It uses the following libraries:
- **NLTK**
- **SpaCy**
- **WordNet**
- **TextBlob**

## Libraries

In [134]:
import numpy as np

# NLTK
import nltk
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import stopwords, gutenberg

# SpaCy
# The trained pipeline "en_core_web_md" must be downloaded first, using the command below
#!python -m spacy download en_core_web_md
import spacy
nlp = spacy.load("en_core_web_md")

# Warnings
import warnings
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning) 
warnings.filterwarnings("ignore", message=r"\[W008\]", category=UserWarning)

# Wordnet
from nltk.corpus import wordnet as wn

# textblob
from textblob import TextBlob

# TF-IDF
import math

# Wikipedia
import wikipedia

In [None]:
# Some NLTK data packages may need to be downloaded first
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# nltk.download('all') # for wordnet languages

## Text processing

### NLTK Text processing

Load the text document "metamorphosis_clean.txt"

In [2]:
filename = 'metamorphosis_clean.txt'
# The with statement closes the file automatically
with open(filename, 'rt') as file:
    doc = file.read()

In [3]:
print('The text has {} characters.'.format(len(doc)))

The text has 119163 characters.


Split the text into sentences

In [87]:
sentences = sent_tokenize(doc)

Display the first 10 sentences

In [88]:
sentences[:10]

['One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin.',
 'He lay on\nhis armour-like back, and if he lifted his head a little he could\nsee his brown belly, slightly domed and divided by arches into stiff\nsections.',
 'The bedding was hardly able to cover it and seemed ready\nto slide off any moment.',
 'His many legs, pitifully thin compared\nwith the size of the rest of him, waved about helplessly as he\nlooked.',
 '"What\'s happened to me?"',
 'he thought.',
 "It wasn't a dream.",
 'His room,\na proper human room although a little too small, lay peacefully\nbetween its four familiar walls.',
 'A collection of textile samples\nlay spread out on the table - Samsa was a travelling salesman - and\nabove it there hung a picture that he had recently cut out of an\nillustrated magazine and housed in a nice, gilded frame.',
 'It showed\na lady fitted out with a fur hat and fur boa who sat upright,\nraising a heavy fur m

Split the first sentence into words

In [89]:
words = word_tokenize(sentences[0])
print(words)

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.']


Display the part-of-speech tag of each word

In [90]:
nltk.pos_tag(words)

[('One', 'CD'),
 ('morning', 'NN'),
 (',', ','),
 ('when', 'WRB'),
 ('Gregor', 'NNP'),
 ('Samsa', 'NNP'),
 ('woke', 'VBD'),
 ('from', 'IN'),
 ('troubled', 'JJ'),
 ('dreams', 'NNS'),
 (',', ','),
 ('he', 'PRP'),
 ('found', 'VBD'),
 ('himself', 'PRP'),
 ('transformed', 'VBN'),
 ('in', 'IN'),
 ('his', 'PRP$'),
 ('bed', 'NN'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('horrible', 'JJ'),
 ('vermin', 'NN'),
 ('.', '.')]

Display the first 100 words of the text, split on whitespace

In [91]:
doc_words = doc.split()
print(doc_words[0:100])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '"What\'s', 'happened', 'to', 'me?"', 'he', 'thought.', 'It', "wasn't", 'a', 'dream.', 'His', 'room,', 'a', 'proper', 'human']
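Splitting on whitespace leaves punctuation attached to the words ('morning,', 'vermin.'), whereas word_tokenize returns punctuation as separate tokens. As a rough illustration of that difference, the sketch below uses a simple regular expression. The pattern is an assumption made for this example, not the rule set that NLTK applies.

```python
import re

sentence = "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()

# Hypothetical pattern: a word (optionally with inner hyphens or apostrophes),
# or any single character that is neither a word character nor a space.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens[:3])
print(regex_tokens[:4])
```

The second list separates the comma from 'morning', which is closer to what the NLTK and SpaCy tokenizers return in this notebook.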


List the English stop words

In [92]:
en_stops = set(stopwords.words('english'))
print(en_stops)

{'whom', "shan't", "don't", 'm', 'because', 'from', 'but', 'nor', 'an', 'wouldn', 'again', 'it', "isn't", 'each', 'no', "shouldn't", 'too', 'yourselves', 'when', 'you', 'of', 'himself', 'to', 'above', 'his', 'shouldn', 'your', 'wasn', 'who', 'through', 'if', 'out', 'aren', 'ma', 'below', 'her', 'why', 'than', 'into', 'their', "she's", "you'll", "mightn't", "doesn't", 'some', "hasn't", 'under', 'isn', 'a', 'am', "hadn't", 'until', "wouldn't", 'be', 'where', 'before', 'once', "couldn't", 'all', 'don', 'did', 'after', 'at', 'do', 'there', 'about', 'me', 'what', "you're", 'very', 'won', 'yours', 'they', 'by', 'for', 'haven', 'so', 'here', 'only', 'more', "weren't", 'mightn', 'will', 'shan', 'yourself', 'which', 'as', 'ain', 'y', 'is', 'she', 'myself', 'any', "you've", 'then', 'same', "you'd", 'herself', 'down', 'didn', 'on', 'in', 'during', 'that', 'most', 'have', 'now', 'been', "didn't", 'mustn', 'over', 'those', 'the', 'are', 'ourselves', 'these', 'having', 'should', 'theirs', 'against',

Remove all stop words from the text and print the new text

In [93]:
text_without_en_stops = []  
for w in doc_words:  
    if w not in en_stops:  
        text_without_en_stops.append(w) 
print('The new text has {} words.\n'.format(len(text_without_en_stops)))
print(text_without_en_stops[:100])

The new text has 11346 words.

['One', 'morning,', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams,', 'found', 'transformed', 'bed', 'horrible', 'vermin.', 'He', 'lay', 'armour-like', 'back,', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly,', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections.', 'The', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'size', 'rest', 'him,', 'waved', 'helplessly', 'looked.', '"What\'s', 'happened', 'me?"', 'thought.', 'It', 'dream.', 'His', 'room,', 'proper', 'human', 'room', 'although', 'little', 'small,', 'lay', 'peacefully', 'four', 'familiar', 'walls.', 'A', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', '-', 'Samsa', 'travelling', 'salesman', '-', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice,', 'gilded', 'frame.', 'It', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat']
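The filtered list still contains 'He', 'The' and 'It' because the membership test is case-sensitive while the NLTK stop word list is lowercase, and tokens such as 'morning,' keep their punctuation. The sketch below shows one possible normalization before filtering. The small STOPWORDS set is a hypothetical stand-in for stopwords.words('english'), used so the example runs without downloading NLTK data.

```python
import string

# Hypothetical mini stop list standing in for stopwords.words('english').
STOPWORDS = {"he", "the", "it", "on", "his", "a", "was", "and", "to"}

def remove_stopwords(tokens, stopwords):
    """Drop stop words after stripping edge punctuation and lowercasing."""
    kept = []
    for token in tokens:
        normalized = token.strip(string.punctuation).lower()
        if normalized and normalized not in stopwords:
            kept.append(normalized)
    return kept

tokens = "He lay on his armour-like back, and the bedding was hardly able to cover it.".split()
filtered = remove_stopwords(tokens, STOPWORDS)
print(filtered)
```

Lowercasing loses the distinction between proper nouns and common words, so whether to apply it depends on what the cleaned text is used for.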


### SpaCy Text processing

- Tokenization

In [5]:
doc_= nlp(doc)
word_tokens = [token.text for token in doc_]
print(word_tokens[:100])

['One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', '\n', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', ' ', 'He', 'lay', 'on', '\n', 'his', 'armour', '-', 'like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', '\n', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', '\n', 'sections', '.', ' ', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', '\n', 'to', 'slide', 'off', 'any', 'moment', '.', ' ', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', '\n', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved']


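- Get the part of speech and fine-grained tag of each word

In [6]:
# Named pos_tags to avoid shadowing the pos_tag function imported from NLTK
pos_tags = [(token.text, token.pos_, token.tag_) for token in doc_]
print(pos_tags[:100])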

[('One', 'NUM', 'CD'), ('morning', 'NOUN', 'NN'), (',', 'PUNCT', ','), ('when', 'SCONJ', 'WRB'), ('Gregor', 'PROPN', 'NNP'), ('Samsa', 'PROPN', 'NNP'), ('woke', 'VERB', 'VBD'), ('from', 'ADP', 'IN'), ('troubled', 'ADJ', 'JJ'), ('dreams', 'NOUN', 'NNS'), (',', 'PUNCT', ','), ('he', 'PRON', 'PRP'), ('found', 'VERB', 'VBD'), ('\n', 'SPACE', '_SP'), ('himself', 'PRON', 'PRP'), ('transformed', 'VERB', 'VBD'), ('in', 'ADP', 'IN'), ('his', 'PRON', 'PRP$'), ('bed', 'NOUN', 'NN'), ('into', 'ADP', 'IN'), ('a', 'DET', 'DT'), ('horrible', 'ADJ', 'JJ'), ('vermin', 'NOUN', 'NN'), ('.', 'PUNCT', '.'), (' ', 'SPACE', '_SP'), ('He', 'PRON', 'PRP'), ('lay', 'VERB', 'VBD'), ('on', 'ADP', 'IN'), ('\n', 'SPACE', '_SP'), ('his', 'PRON', 'PRP$'), ('armour', 'NOUN', 'NN'), ('-', 'PUNCT', 'HYPH'), ('like', 'NOUN', 'NN'), ('back', 'NOUN', 'NN'), (',', 'PUNCT', ','), ('and', 'CCONJ', 'CC'), ('if', 'SCONJ', 'IN'), ('he', 'PRON', 'PRP'), ('lifted', 'VERB', 'VBD'), ('his', 'PRON', 'PRP$'), ('head', 'NOUN', 'NN'), (

- Remove stopwords 

In [7]:
word_tokens = [token.text for token in doc_ if not token.is_stop]
print(word_tokens[:100])

['morning', ',', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', ',', 'found', '\n', 'transformed', 'bed', 'horrible', 'vermin', '.', ' ', 'lay', '\n', 'armour', '-', 'like', ',', 'lifted', 'head', 'little', '\n', 'brown', 'belly', ',', 'slightly', 'domed', 'divided', 'arches', 'stiff', '\n', 'sections', '.', ' ', 'bedding', 'hardly', 'able', 'cover', 'ready', '\n', 'slide', 'moment', '.', ' ', 'legs', ',', 'pitifully', 'thin', 'compared', '\n', 'size', 'rest', ',', 'waved', 'helplessly', '\n', 'looked', '.', '\n\n', '"', 'happened', '?', '"', 'thought', '.', ' ', 'dream', '.', ' ', 'room', ',', '\n', 'proper', 'human', 'room', 'little', 'small', ',', 'lay', 'peacefully', '\n', 'familiar', 'walls', '.', ' ', 'collection', 'textile', 'samples', '\n', 'lay', 'spread', 'table', '-', 'Samsa', 'travelling', 'salesman']


- Remove punctuation (keep alphabetic tokens only)

In [8]:
word_tokens = [token.text for token in doc_ if not token.is_stop and token.is_alpha]
print(word_tokens[:100])

['morning', 'Gregor', 'Samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armour', 'like', 'lifted', 'head', 'little', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'ready', 'slide', 'moment', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'dream', 'room', 'proper', 'human', 'room', 'little', 'small', 'lay', 'peacefully', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'table', 'Samsa', 'travelling', 'salesman', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'lower', 'arm', 'viewer', 'Gregor', 'turned', 'look', 'window', 'dull', 'weather', 'Drops', 'rain', 'heard', 'hitting']


- Lemmatisation

In [9]:
word_tokens = [token.lemma_.lower() for token in doc_ if not token.is_stop and token.is_alpha]
print(word_tokens[:100])

['morning', 'gregor', 'samsa', 'wake', 'troubled', 'dream', 'find', 'transform', 'bed', 'horrible', 'vermin', 'lie', 'armour', 'like', 'lift', 'head', 'little', 'brown', 'belly', 'slightly', 'domed', 'divide', 'arch', 'stiff', 'section', 'bedding', 'hardly', 'able', 'cover', 'ready', 'slide', 'moment', 'leg', 'pitifully', 'thin', 'compare', 'size', 'rest', 'wave', 'helplessly', 'look', 'happen', 'think', 'dream', 'room', 'proper', 'human', 'room', 'little', 'small', 'lie', 'peacefully', 'familiar', 'wall', 'collection', 'textile', 'sample', 'lie', 'spread', 'table', 'samsa', 'travel', 'salesman', 'hang', 'picture', 'recently', 'cut', 'illustrate', 'magazine', 'house', 'nice', 'gild', 'frame', 'show', 'lady', 'fit', 'fur', 'hat', 'fur', 'boa', 'sit', 'upright', 'raise', 'heavy', 'fur', 'muff', 'cover', 'low', 'arm', 'viewer', 'gregor', 'turn', 'look', 'window', 'dull', 'weather', 'drop', 'rain', 'hear', 'hit']


In [13]:
' '.join(word_tokens[:200])

'morning gregor samsa wake troubled dream find transform bed horrible vermin lie armour like lift head little brown belly slightly domed divide arch stiff section bedding hardly able cover ready slide moment leg pitifully thin compare size rest wave helplessly look happen think dream room proper human room little small lie peacefully familiar wall collection textile sample lie spread table samsa travel salesman hang picture recently cut illustrate magazine house nice gild frame show lady fit fur hat fur boa sit upright raise heavy fur muff cover low arm viewer gregor turn look window dull weather drop rain hear hit pane feel sad sleep little bit long forget nonsense think unable sleep right present state position hard throw right roll try time shut eye look flounder leg stop begin feel mild dull pain feel oh god think strenuous career choose travel day day business like take effort business home curse travel worry make train connection bad irregular food contact different people time k

## Wordnet

WordNet is a lexical database for the English language that can be used to look up word meanings, synonyms, antonyms, and more.

Here are some examples: 

Print all synsets of the word **"happiness"**

In [14]:
syn_happiness = wn.synsets('happiness')
syn_happiness

[Synset('happiness.n.01'), Synset('happiness.n.02')]

This function returns every synset of the word 'happiness'. Each synset name encodes a lemma, a part of speech (noun (n), verb (v), ...) and a sense number, as in happiness.n.01.

A synset is a set of synonyms that share a common meaning. 

Each synset contains one or more lemmas, which represent a specific sense of a specific word.

Properties of the first synset

In [15]:
print('First synset is : ', syn_happiness[0].name())
print('Definition of that first synset : ', syn_happiness[0].definition())
print('Synonyms (lemmas) :' , syn_happiness[0].lemmas(), '\n', syn_happiness[0].lemmas()[0].name())
print('Example of the word in use in sentences : ', syn_happiness[0].examples())

First synset is :  happiness.n.01
Definition of that first synset :  state of well-being characterized by emotions ranging from contentment to intense joy
Synonyms (lemmas) : [Lemma('happiness.n.01.happiness'), Lemma('happiness.n.01.felicity')] 
 happiness
Example of the word in use in sentences :  []


All synonyms and antonyms for the word "happiness" :

In [119]:
synonyms = [] 
antonyms = [] 
  
for syn in syn_happiness : 
    for l in syn.lemmas(): 
        synonyms.append(l.name()) 
        if l.antonyms(): 
            antonyms.append(l.antonyms()[0].name()) 
  
print('Synonyms of happiness : ', '\n', set(synonyms), '\n') 
print('Antonyms of happiness : ', '\n',set(antonyms)) 

Synonyms of happiness :  
 {'happiness', 'felicity'} 

Antonyms of happiness :  
 {'unhappiness', 'sadness'}


#### All available languages in WordNet

The WordNet corpus reader gives access to the Open Multilingual WordNet,
using ISO-639 language codes.

So we can get word synonyms in another language.

Language codes :

In [120]:
print('Language codes :','\n', wn.langs())

Language codes : 
 dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'dan', 'ell', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])


Get the lemma names of the first synset of the word **"happiness"** in Japanese, Arabic and French

In [129]:
print('For Japanese :\n',set(syn_happiness[0].lemma_names('jpn')))
print('For Arabic :\n',set(syn_happiness[0].lemma_names('arb')))
print('For French:\n',set(syn_happiness[0].lemma_names('fra')))

For Japanese :
 {'果報', '幸', '幸い', '幸せ', '清福', '福禄', '利福', '倖せ', '倖', '仕合わせ', '福', '慶福', '幸福'}
For Arabic :
 {'هناء', 'نعِيم', 'سرور', 'بهْجة', 'سعادة'}
For French:
 {'joie', 'bonheur', 'félicité'}


## Similarity measures

### Similarity measures With Wordnet (NLTK)

- Synsets of happiness : 

In [16]:
wn.synsets('happiness')

[Synset('happiness.n.01'), Synset('happiness.n.02')]

- Synsets of health : 

In [37]:
wn.synsets('health')

[Synset('health.n.01'), Synset('health.n.02')]

- Synsets of home : 

In [38]:
wn.synsets('home')

[Synset('home.n.01'),
 Synset('dwelling.n.01'),
 Synset('home.n.03'),
 Synset('home_plate.n.01'),
 Synset('base.n.14'),
 Synset('home.n.06'),
 Synset('home.n.07'),
 Synset('family.n.01'),
 Synset('home.n.09'),
 Synset('home.v.01'),
 Synset('home.v.02'),
 Synset('home.a.01'),
 Synset('home.a.02'),
 Synset('home.s.03'),
 Synset('home.r.01'),
 Synset('home.r.02'),
 Synset('home.r.03')]

#### Similarity measures :

In [39]:
happiness = wn.synset('happiness.n.01')
health = wn.synset('health.n.01')
home = wn.synset('home.n.01')

**path_similarity**

In [40]:
print('Similarity between happiness and health : ', happiness.path_similarity(health))
print('Similarity between home and happiness : ', home.path_similarity(happiness))
print('Similarity between home and health : ', home.path_similarity(health))

Similarity between happiness and health :  0.09090909090909091
Similarity between home and happiness :  0.0625
Similarity between home and health :  0.05555555555555555
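The path similarity score is 1 / (1 + d), where d is the length of the shortest path between the two synsets in the hypernym/hyponym hierarchy. The sketch below applies that formula to a small hand-made taxonomy. The taxonomy and the helper names are hypothetical and only serve to illustrate the computation, they are not taken from WordNet.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); not WordNet's real hierarchy.
PARENT = {
    "cat": "feline",
    "feline": "carnivore",
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "animal",
    "car": "vehicle",
    "vehicle": "artifact",
    "animal": "entity",
    "artifact": "entity",
}

def neighbours(node):
    """Return the children and the parent of a node in the toy taxonomy."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def shortest_path_length(start, goal):
    """Breadth-first search over parent/child links, counting edges."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None

def path_similarity(a, b):
    """1 / (1 + shortest path length), or None when no path exists."""
    distance = shortest_path_length(a, b)
    return None if distance is None else 1 / (1 + distance)

print(path_similarity("cat", "dog"))  # 4 edges apart
print(path_similarity("cat", "car"))  # 7 edges apart
```

Read this way, the score 0.0909 = 1/11 printed above corresponds to a shortest path of 10 edges between the two synsets.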


**lch_similarity** : Leacock-Chodorow Similarity (Leacock and Chodorow 1998)

In [41]:
print('Similarity between happiness and health : ', happiness.lch_similarity(health))
print('Similarity between health and home : ', health.lch_similarity(home))
print('Similarity between home and happiness : ', home.lch_similarity(happiness))

Similarity between happiness and health :  1.2396908869280152
Similarity between health and home :  0.7472144018302211
Similarity between home and happiness :  0.8649974374866046
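The Leacock-Chodorow score rescales the path length with the depth of the taxonomy: -log(p / (2 * D)), where p is the shortest path length counted in nodes (edges + 1) and D is the maximum depth of the taxonomy. The sketch below assumes D = 19 for nouns. This value is inferred from the scores printed above, not read from NLTK, so it should be treated as an assumption.

```python
import math

def lch_similarity(path_edges, max_depth):
    """Leacock-Chodorow score: -log(p / (2 * D)), with p = path_edges + 1."""
    return -math.log((path_edges + 1) / (2.0 * max_depth))

# Assumption: D = 19 for the noun taxonomy (inferred from the printed scores).
NOUN_DEPTH = 19

print(lch_similarity(10, NOUN_DEPTH))  # compare with the first score above
print(lch_similarity(0, NOUN_DEPTH))   # identical synsets give the maximum score
```

Because the logarithm is monotonic, this measure ranks pairs of synsets in the same order as the path similarity; only the scale changes.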


**wup_similarity** : Wu-Palmer Similarity (Wu and Palmer 1994)

In [42]:
print('Similarity between happiness and health : ', happiness.wup_similarity(health))
print('Similarity between health and home : ', health.wup_similarity(home))
print('Similarity between home and happiness : ', home.wup_similarity(happiness))

Similarity between happiness and health :  0.4444444444444444
Similarity between health and home :  0.10526315789473684
Similarity between home and happiness :  0.11764705882352941
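The Wu-Palmer score compares the depth of the two synsets with the depth of their deepest common ancestor (least common subsumer, LCS): 2 * depth(LCS) / (depth(a) + depth(b)). The sketch below works this out on a hypothetical toy taxonomy in which depth is counted in nodes, so the root has depth 1. Both the taxonomy and that depth convention are assumptions of this example.

```python
# Hypothetical toy taxonomy (child -> parent); not WordNet's real hierarchy.
PARENT = {
    "cat": "feline", "feline": "carnivore",
    "dog": "canine", "canine": "carnivore",
    "carnivore": "animal", "animal": "entity",
    "car": "vehicle", "vehicle": "artifact", "artifact": "entity",
}

def ancestors(node):
    """Return the chain from the node up to the root, node included."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(node))

def wup_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), LCS = deepest shared ancestor."""
    chain_b = set(ancestors(b))
    lcs = next(node for node in ancestors(a) if node in chain_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("cat", "dog"))  # LCS is "carnivore" (depth 3)
print(wup_similarity("cat", "car"))  # LCS is the root "entity" (depth 1)
```

Two concepts that only meet at the root get a low score even when the path between them is short, which is why this measure can rank pairs differently from the path-based ones.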


### Similarity measures With SpaCy

Based on the cosine similarity between word vectors

In [47]:
print('Similarity between happiness and health : ', nlp('happiness').similarity(nlp('health')))
print('Similarity between health and home : ', nlp('health').similarity(nlp('home')))
print('Similarity between home and happiness : ', nlp('home').similarity(nlp('happiness')))

Similarity between happiness and health :  0.41790983756817446
Similarity between health and home :  0.2936136163478605
Similarity between home and happiness :  0.3032258847207985
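Cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends on the angle between the vectors and not on their length. The sketch below computes it in plain Python on made-up 3-dimensional vectors. The vectors are hypothetical; the vectors of a SpaCy pipeline have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy "word vectors", for illustration only.
happiness_vec = [0.9, 0.1, 0.3]
health_vec = [0.7, 0.2, 0.6]
home_vec = [0.1, 0.9, 0.2]

print(cosine_similarity(happiness_vec, health_vec))
print(cosine_similarity(happiness_vec, home_vec))
```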


# TextBlob : sentiment analysis, translation, TF-IDF analysis

In [32]:
textblob_python = TextBlob("Python is a beautiful high-level, general-purpose programming language !")
print(textblob_python)

Python is a beautiful high-level, general-purpose programming language !


In [33]:
textblob_python.tags

[('Python', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('beautiful', 'JJ'),
 ('high-level', 'NN'),
 ('general-purpose', 'JJ'),
 ('programming', 'NN'),
 ('language', 'NN')]

=> this gives the part-of-speech tag of each word, like nltk.pos_tag in NLTK and token.pos_ / token.tag_ in SpaCy

Get the noun phrases

In [34]:
textblob_python.noun_phrases

WordList(['python', 'beautiful high-level'])

In [35]:
textblob_paris = TextBlob('Paris is so beautiful, I like Paris, peopel are so nice !')
print(textblob_paris)

Paris is so beautiful, I like Paris, peopel are so nice !


In [36]:
textblob_paris.sentiment

Sentiment(polarity=0.8, subjectivity=1.0)

=> the sentence has a positive sentiment (polarity = 0.8 > 0, on a scale from -1 to 1) and is very subjective (subjectivity = 1, on a scale from 0 to 1)


## **Wikipedia text**

In [48]:
wiki = '''
"During the Iron Age, what is
now metropolitan France was inhabited by the Gauls, a Celtic people. Rome
annexed the area in 51 BC, holding it until the arrival of Germanic Franks
in 476, who formed the Kingdom of Francia. The Treaty of Verdun of 843
partitioned Francia into East Francia, Middle Francia and West Francia.West Francia, which became the Kingdom of France in 987, emerged as a
major European power in the Late Middle Ages, following its victory in the
Hundred Years' War (13371453). During the Renaissance, French culture
fourished and a global colonial empire was established, which by the 20th
century would become the second largest in the world. The 16th century
was dominated by religious civil wars between Catholics and Protestants
(Huguenots). France became Europe's dominant cultural, political, and military power in the 17th century under Louis XIV. In the late 18th century,
the French Revolution overthrew the absolute monarchy, establishing one of
modern history's earliest republics and drafting the Declaration of the Rights
of Man and of the Citizen, which expresses the nation's ideals to this day.")
'''

Convert wiki to a TextBlob object

In [51]:
blob_wiki = TextBlob(wiki)
blob_wiki

TextBlob("
"During the Iron Age, what is
now metropolitan France was inhabited by the Gauls, a Celtic people. Rome
annexed the area in 51 BC, holding it until the arrival of Germanic Franks
in 476, who formed the Kingdom of Francia. The Treaty of Verdun of 843
partitioned Francia into East Francia, Middle Francia and West Francia.West Francia, which became the Kingdom of France in 987, emerged as a
major European power in the Late Middle Ages, following its victory in the
Hundred Years' War (13371453). During the Renaissance, French culture

ourished and a global colonial empire was established, which by the 20th
century would become the second largest in the world. The 16th century
was dominated by religious civil wars between Catholics and Protestants
(Huguenots). France became Europe's dominant cultural, political, and military power in the 17th century under Louis XIV. In the late 18th century,
the French Revolution overthrew the absolute monarchy, establishing one of
modern histo

## Words and sentences of the Wikipedia text

In [53]:
print('\n Words of the wikipedia text : \n', blob_wiki.words, '\n')
print('Sentences of the wikipedia text :', '\n', blob_wiki.sentences)


 Words of the wikipedia text : 
 ['During', 'the', 'Iron', 'Age', 'what', 'is', 'now', 'metropolitan', 'France', 'was', 'inhabited', 'by', 'the', 'Gauls', 'a', 'Celtic', 'people', 'Rome', 'annexed', 'the', 'area', 'in', '51', 'BC', 'holding', 'it', 'until', 'the', 'arrival', 'of', 'Germanic', 'Franks', 'in', '476', 'who', 'formed', 'the', 'Kingdom', 'of', 'Francia', 'The', 'Treaty', 'of', 'Verdun', 'of', '843', 'partitioned', 'Francia', 'into', 'East', 'Francia', 'Middle', 'Francia', 'and', 'West', 'Francia.West', 'Francia', 'which', 'became', 'the', 'Kingdom', 'of', 'France', 'in', '987', 'emerged', 'as', 'a', 'major', 'European', 'power', 'in', 'the', 'Late', 'Middle', 'Ages', 'following', 'its', 'victory', 'in', 'the', 'Hundred', 'Years', 'War', '1337\x151453', 'During', 'the', 'Renaissance', 'French', 'culture', 'ourished', 'and', 'a', 'global', 'colonial', 'empire', 'was', 'established', 'which', 'by', 'the', '20th', 'century', 'would', 'become', 'the', 'second', 'largest', 'in',

## Sentiment analysis for each sentence

In [54]:
for sentence in blob_wiki.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.03958333333333333, subjectivity=0.20000000000000004)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.0, subjectivity=0.25)
Sentiment(polarity=0.0, subjectivity=0.10000000000000002)
Sentiment(polarity=0.02500000000000001, subjectivity=0.45)


## Translate the Wikipedia text to French and Arabic

In [62]:
blob_wiki.translate(from_lang = 'en', to='fr')

TextBlob(""Pendant l'âge du fer, ce qui est
Maintenant, la France métropolitaine était habitée par les Gaulois, un peuple celtique. Rome
annexé la zone en 51 avant JC, la tenant jusqu'à l'arrivée de Franks germaniques
en 476, qui a formé le royaume de Francia. Le traité de Verdun de 843
Francia partitionné dans East Francia, Middle Francia et West Francia.
Principale puissance européenne à la fin du Moyen Âge, après sa victoire dans le
Cent ans de guerre (1337 1453). Pendant la Renaissance, la culture française
ourous et un empire colonial mondial ont été créés, qui au 20
Century deviendrait le deuxième plus grand au monde. Le XVIe siècle
était dominé par les guerres civiles religieuses entre catholiques et protestants
(Huguenots). La France est devenue la puissance culturelle, politique et militaire dominante d'Europe au 17ème siècle sous Louis XIV. À la fin du XVIIIe siècle,
La Révolution française a renversé la monarchie absolue, établissant l'un des
Les premières républiques de l'h

In [63]:
blob_wiki.translate(from_lang = 'en', to='ar')

TextBlob(""خلال العصر الحديدي ، ما هو
الآن ، كان غالولز في فرنسا متروبوليتان ، وهو شعب سلتيك. روما
ضم المنطقة في 51 قبل الميلاد ، ممسكة بها حتى وصول الفرنجة الجرمانية
في 476 ، الذين شكلوا مملكة فرانسيا. معاهدة فيردون من 843
قامت فرانسيا بتقسيم إلى شرق فرانسيا ووسط فرانسيا وغرب فرانسيا. غرب فرانسيا ، التي أصبحت مملكة فرنسا في عام 987 ، ظهرت كـ
القوة الأوروبية الرئيسية في أواخر العصور الوسطى ، بعد فوزها في
حرب مائة عام (1337 1453). خلال عصر النهضة ، الثقافة الفرنسية
تم إنشاء إمبراطورية الاستعمارية العالمية الخاصة بنا ، والتي بحلول العشرين
سيصبح القرن ثاني أكبر أكبر في العالم. القرن السادس عشر
سيطر عليها الحروب الأهلية الدينية بين الكاثوليك والبروتستانت
(Huguenots). أصبحت فرنسا القوة الثقافية والسياسية والعسكرية السائدة في أوروبا في القرن السابع عشر تحت قيادة لويس الرابع عشر. في أواخر القرن الثامن عشر ،
أطاحت الثورة الفرنسية بالملكية المطلقة ، وتأسيس واحدة من
أول جمهوريات للتاريخ الحديث وصياغة إعلان الحقوق
من الرجل والمواطن ، الذي يعبر عن مُثُل الأمة حتى يومنا هذا. ")")

## TF-IDF analysis with textblob and wikipedia


- **tf** function : takes a word and a TextBlob as parameters and returns the frequency of the word inside the TextBlob (number of occurrences divided by the total number of words)

In [67]:
def tf(word, blobtext):
    if not isinstance(word, str):
        word = str(word)
    # WordList.count ignores case by default
    return blobtext.words.count(word) / len(blobtext.words)

-- Verification of the function

In [68]:
textblob_paris = TextBlob('Paris is so beautiful, I like Paris, peopel are so nice !')
tf('paris', textblob_paris)

0.18181818181818182

- **n_containing** function : takes a word and a list of TextBlobs as parameters and returns the number of TextBlobs in the list that contain this word

In [77]:
def n_containing(word, bloblist):
    c=0
    for blob in bloblist :
        if (blob.words.count(word)!=0):
            c+=1
    return c

-- Verification of the function

In [95]:
# bloblist contains some famous Paris quotes
bloblist= [textblob_paris,
           TextBlob('Paris is always a good idea'), # Audrey Hepburn
           TextBlob('Just add three letters to Paris, and you have paradise'), # Jules Renard
           TextBlob('London is a riddle. Paris is an explanation') # G. K. Chesterton
          ]
bloblist

[TextBlob("Paris is so beautiful, I like Paris, peopel are so nice !"),
 TextBlob("Paris is always a good idea"),
 TextBlob("Just add three letters to Paris, and you have paradise"),
 TextBlob("London is a riddle. Paris is an explanation")]

In [96]:
n_containing('paris', bloblist)

4

- **idf** function : takes a word and a list of TextBlobs as parameters and returns the inverse document frequency of the word, log(N / n_containing), where N is the number of TextBlobs in the list


In [97]:
def idf(word, bloblist):
    return math.log(len(bloblist) / n_containing(word, bloblist))

-- Verification

In [105]:
idf('paris',bloblist)

0.0
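The result is 0 because 'paris' occurs in all four documents, so idf = log(4 / 4). The idf function as written divides by n_containing and therefore raises a ZeroDivisionError for a word that appears in no document. The sketch below shows one common variant that adds 1 to the denominator. This smoothing is an assumption of the example, not part of the notebook, and the documents are plain lists of lowercase words standing in for blob.words.

```python
import math

def idf_smoothed(word, documents):
    """log(N / (1 + number of documents containing the word))."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing))

# Each document is a list of lowercase words (stand-ins for blob.words).
documents = [
    ["paris", "is", "always", "a", "good", "idea"],
    ["london", "is", "a", "riddle", "paris", "is", "an", "explanation"],
    ["just", "add", "three", "letters", "to", "paris"],
]

print(idf_smoothed("paris", documents))   # present in every document
print(idf_smoothed("london", documents))  # present in one document
print(idf_smoothed("berlin", documents))  # present in no document, no exception
```

A side effect of this variant is that a word present in every document gets a slightly negative value instead of 0, which may or may not be acceptable depending on how the scores are used.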

- **tfidf** function : takes a word, a TextBlob and a list of TextBlobs as parameters and returns the Term Frequency-Inverse Document Frequency, tf * idf


In [101]:
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

-- Verification

In [106]:
tfidf('paris', textblob_paris, bloblist)

0.0

=> All four documents contain the word 'Paris', so idf = log(4/4) = 0 and the TF-IDF is 0
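To see a non-zero score, the same computation can be run on a word that occurs in only one quote. The sketch below redoes the three steps without TextBlob, on the four quotes used above. The words helper is a rough, hypothetical stand-in for blob.words, based on a simple regular expression.

```python
import math
import re

QUOTES = [
    "Paris is so beautiful, I like Paris, peopel are so nice !",
    "Paris is always a good idea",
    "Just add three letters to Paris, and you have paradise",
    "London is a riddle. Paris is an explanation",
]

def words(text):
    """Lowercase word tokens; a rough stand-in for TextBlob's blob.words."""
    return re.findall(r"[a-z']+", text.lower())

def tf(word, doc_words):
    """Occurrences of the word divided by the document length."""
    return doc_words.count(word) / len(doc_words)

def idf(word, corpus):
    """log(N / number of documents containing the word)."""
    containing = sum(1 for doc_words in corpus if word in doc_words)
    return math.log(len(corpus) / containing)

def tfidf(word, doc_words, corpus):
    return tf(word, doc_words) * idf(word, corpus)

corpus = [words(quote) for quote in QUOTES]
last = corpus[-1]

print(tfidf("paris", last, corpus))   # the word is in all four quotes
print(tfidf("london", last, corpus))  # (1 / 8) * log(4 / 1)
```

'london' scores higher than 'paris' in the last quote even though both occur once there, because it is the only one of the two that distinguishes this quote from the others.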

## TextBlob from Wikipedia

In [121]:
bloblist_wiki=[]
topics = ["France","Python (programming language)", "Fox", "Isaac Newton", "Zinedine Zidane" ]
for topic in topics :
    bloblist_wiki.append(TextBlob(wikipedia.summary(topic)))

bloblist_wiki

[TextBlob("France (French: [fʁɑ̃s] ), officially the French Republic (French: République française), is a transcontinental country predominantly located in Western Europe and spanning overseas regions and territories in the Americas and the Atlantic, Pacific and Indian Oceans. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea; overseas territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean. Due to its several coastal territories, France has the largest exclusive economic zone in the world. France borders Belgium, Luxembourg, Germany, Switzerland, Monaco, Italy, Andorra, and Spain in continental Europe, as well as the Netherlands, Suriname, and Brazil in the Americas via its overseas territories in French Guiana and Saint Martin. Its eighteen integral regions (five of which are over

#### Wikipedia Text preprocessing 

Remove the stop words, change everything to lower case, and create a bag of words containing all other words.

In [123]:
en_stops = stopwords.words('english')

bag_of_other_words = []  
for blob in bloblist_wiki:  
    for w in blob.words.lower():  # change all words to lower case
        if w not in en_stops: # remove stop words
            bag_of_other_words.append(w) 
print('We have {} words'.format(len(bag_of_other_words)))
print(bag_of_other_words[:200])

We have 1290 words
['france', 'french', 'fʁɑ̃s', 'officially', 'french', 'republic', 'french', 'république', 'française', 'transcontinental', 'country', 'predominantly', 'located', 'western', 'europe', 'spanning', 'overseas', 'regions', 'territories', 'americas', 'atlantic', 'pacific', 'indian', 'oceans', 'metropolitan', 'area', 'extends', 'rhine', 'atlantic', 'ocean', 'mediterranean', 'sea', 'english', 'channel', 'north', 'sea', 'overseas', 'territories', 'include', 'french', 'guiana', 'south', 'america', 'saint', 'pierre', 'miquelon', 'north', 'atlantic', 'french', 'west', 'indies', 'many', 'islands', 'oceania', 'indian', 'ocean', 'due', 'several', 'coastal', 'territories', 'france', 'largest', 'exclusive', 'economic', 'zone', 'world', 'france', 'borders', 'belgium', 'luxembourg', 'germany', 'switzerland', 'monaco', 'italy', 'andorra', 'spain', 'continental', 'europe', 'well', 'netherlands', 'suriname', 'brazil', 'americas', 'via', 'overseas', 'territories', 'french', 'guiana', 'sain

#### Compute the TF-IDF of all words for all documents in the corpus

For this, we create a dictionary mapping each word to its TF-IDF measure

In [132]:
tf_idf_wiki = {}
for word in bag_of_other_words:
    for blob in bloblist_wiki:
        # Each pass overwrites the previous value, so the dictionary ends up
        # holding the score of the word in the last document of the corpus
        tf_idf_wiki[word] = tfidf(word, blob, bloblist_wiki)
# show the first 10 entries
list(tf_idf_wiki.items())[:10]

[('france', 0.0035584106092200196),
 ('french', 0.00533761591383003),
 ('fʁɑ̃s', 0.0),
 ('officially', 0.0),
 ('republic', 0.0),
 ('république', 0.0),
 ('française', 0.0),
 ('transcontinental', 0.0),
 ('country', 0.0017792053046100098),
 ('predominantly', 0.0)]

In [133]:
# print the first 100 entries
print(list(tf_idf_wiki.items())[:100])

[('france', 0.0035584106092200196), ('french', 0.00533761591383003), ('fʁɑ̃s', 0.0), ('officially', 0.0), ('republic', 0.0), ('république', 0.0), ('française', 0.0), ('transcontinental', 0.0), ('country', 0.0017792053046100098), ('predominantly', 0.0), ('located', 0.0), ('western', 0.0), ('europe', 0.0), ('spanning', 0.0), ('overseas', 0.0), ('regions', 0.0), ('territories', 0.0), ('americas', 0.0), ('atlantic', 0.0), ('pacific', 0.0), ('indian', 0.0), ('oceans', 0.0), ('metropolitan', 0.0), ('area', 0.0), ('extends', 0.0), ('rhine', 0.0), ('ocean', 0.0), ('mediterranean', 0.0), ('sea', 0.0), ('english', 0.0), ('channel', 0.0), ('north', 0.0), ('include', 0.0), ('guiana', 0.0), ('south', 0.0), ('america', 0.0), ('saint', 0.0), ('pierre', 0.0), ('miquelon', 0.0), ('west', 0.0), ('indies', 0.0), ('many', 0.0017792053046100098), ('islands', 0.0), ('oceania', 0.0), ('due', 0.0), ('several', 0.0035584106092200196), ('coastal', 0.0), ('largest', 0.0), ('exclusive', 0.0), ('economic', 0.0), (
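Because the inner loop assigns to the same key for every document, the values printed above are the scores of each word in the last summary of the list only. The sketch below shows one way to keep a separate score table per document and to rank the top terms of each. The mini corpus and topic names are hypothetical, and tfidf is redefined on plain word lists so the example runs on its own.

```python
import math

def tfidf(word, doc_words, corpus):
    """Term frequency times log(N / document frequency)."""
    tf = doc_words.count(word) / len(doc_words)
    df = sum(1 for other in corpus if word in other)
    return tf * math.log(len(corpus) / df)

# Hypothetical mini corpus: topic -> lowercase words with stop words removed.
docs = {
    "france": ["france", "country", "europe", "french", "country"],
    "python": ["python", "language", "programming", "language"],
    "fox": ["fox", "mammal", "europe", "small"],
}
corpus = list(docs.values())

# One {word: score} table per document, so no score overwrites another.
scores = {
    topic: {word: tfidf(word, doc_words, corpus) for word in set(doc_words)}
    for topic, doc_words in docs.items()
}

for topic, table in scores.items():
    top = sorted(table.items(), key=lambda item: item[1], reverse=True)[:2]
    print(topic, [(word, round(score, 3)) for word, score in top])
```

Keying the outer dictionary by topic makes it possible to ask which terms characterize each summary, which is the usual purpose of TF-IDF.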