This notebook is intended as an introduction to Python tools for *text mining* and *Natural Language Processing*.

**Text Mining** is the process of deriving high-quality information from text data. The overall goal is to turn text into data suitable for analysis by applying NLP techniques.

**Natural Language Processing (NLP)** is a subfield of linguistics, computer science and AI concerned with the interactions between computers and human language. In particular, NLP focuses on how to program computers to process and analyze large amounts of text data.


First of all, let's introduce one of the most widely used Python libraries for text mining and text data processing: *nltk*, which stands for Natural Language Toolkit.
For more information, see the documentation at https://www.nltk.org/.

In [None]:
!pip install nltk

### Tokenization
Tokenization is the first step in NLP: it consists of breaking strings into tokens, which are small structures or units. Tokenization involves three steps:

1. Breaking a complex sentence into words
2. Understanding the importance of each word with respect to the sentence
3. Producing a structural description of an input sentence


##### 1. Breaking a complex sentence into words
For this we simply use the function *word_tokenize* from the nltk library. The module *nltk.tokenize* offers many other functions for this purpose; more information at https://www.nltk.org/api/nltk.tokenize.html.

In [10]:
# LIBRARIES
import nltk
from nltk.tokenize import word_tokenize  # splits a string into word and punctuation tokens

# Sample text to tokenize
text = "Daniel Pennac is one of my favourite writers. His main saga is set in a neighborhood in the Parisian suburbs, called Belleville. Seven books belongs to the saga: the first one was written in 1991, and the last one in 2017. All the books were critical acclaimed, but personally the one I prefer is the second: the first scene, where a sheet of ice shape is compared to the African continent, is brilliant.   "

# Pass the string to word_tokenize to break it into tokens
token = word_tokenize(text)
token

['Daniel',
 'Pennac',
 'is',
 'one',
 'of',
 'my',
 'favourite',
 'writers',
 '.',
 'His',
 'main',
 'saga',
 'is',
 'set',
 'in',
 'a',
 'neighborhood',
 'in',
 'the',
 'Parisian',
 'suburbs',
 ',',
 'called',
 'Belleville',
 '.',
 'Seven',
 'books',
 'belongs',
 'to',
 'the',
 'saga',
 ':',
 'the',
 'first',
 'one',
 'was',
 'written',
 'in',
 '1991',
 ',',
 'and',
 'the',
 'last',
 'one',
 'in',
 '2017',
 '.',
 'All',
 'the',
 'books',
 'were',
 'critical',
 'acclaimed',
 ',',
 'but',
 'personally',
 'the',
 'one',
 'I',
 'prefer',
 'is',
 'the',
 'second',
 ':',
 'the',
 'first',
 'scene',
 ',',
 'where',
 'a',
 'sheet',
 'of',
 'ice',
 'shape',
 'is',
 'compared',
 'to',
 'the',
 'African',
 'continent',
 ',',
 'is',
 'brilliant',
 '.']
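
To build intuition for what the tokenizer does, here is a minimal sketch using only the standard `re` module. It is a simplified stand-in, not how *word_tokenize* is implemented: the function name `simple_tokenize` and the regular expression are assumptions made for illustration, and NLTK handles many cases (contractions, abbreviations, and so on) that this pattern does not.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word or a number; [^\w\s] matches one non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

sample = "His main saga is set in Belleville."
tokens = simple_tokenize(sample)
print(tokens)
```

As in the output above, punctuation marks come out as separate tokens rather than being attached to the preceding word.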

##### 2. Understanding the importance of each word with respect to the sentence
For this we use a method based on the *FreqDist* class from the *nltk.probability* module, which gives the frequency of each word within a text. More information at https://www.nltk.org/api/nltk.probability.html.

In [11]:
# Count how often each distinct token occurs:
# import FreqDist from nltk.probability and pass it the token list
from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist

FreqDist({'the': 9, 'is': 5, ',': 5, 'one': 4, '.': 4, 'in': 4, 'of': 2, 'saga': 2, 'a': 2, 'books': 2, ...})
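
Punctuation tokens such as ',' and '.' rank near the top of the distribution above. One possible way to keep them out of the counts is sketched below, using `collections.Counter` from the standard library as a stand-in for *FreqDist* and a short hand-written token list; the helper name `is_word` is hypothetical.

```python
from collections import Counter

def is_word(token):
    """Return True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Short hand-written token list (illustrative only)
tokens = ["the", "saga", ",", "the", "books", ".", "saga", "the", ":"]

# Count only the tokens that pass the filter
word_counts = Counter(t for t in tokens if is_word(t))
print(word_counts.most_common())
```

The same filter could be applied to the `token` list before passing it to *FreqDist*.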

If you have a large text and want to find the *n* most common words, you can use the *most_common* method. Note that statistics like these are more informative after stop-word removal and stemming.

Note also that if two words have the same frequency (as the words *one* and *in* in the example), *most_common* lists first the word that appears first in the text.

In [12]:
n = 4
fdist.most_common(n)

[('the', 9), ('is', 5), (',', 5), ('one', 4)]
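
The tie-breaking behaviour can be checked on a small example. The sketch below uses `collections.Counter` from the standard library (the class that *FreqDist* extends in NLTK 3) and assumes Python 3.7 or later, where elements with equal counts are returned in the order they were first encountered.

```python
from collections import Counter

# "one" and "in" both occur twice; "one" is seen first
tokens = ["one", "in", "the", "one", "in", "the", "the"]
counts = Counter(tokens)

# Ask for the two most common tokens: the tie is resolved by first appearance
top_two = counts.most_common(2)
print(top_two)
```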

### Stemming
Stemming usually refers to normalizing words into their base or root form.
For example:
    
                         | Waited -----> Wait |
                         | Waiting ----> Wait |
                         | Waits ------> Wait |
    
    
    
There are two popular algorithms used for stemming:
 - *Porter Stemming*: removes common morphological and inflectional endings from words
 - *Lancaster Stemming*: a more aggressive stemming algorithm
 
For a more complete overview of stemming with Python, see https://towardsdatascience.com/stemming-corpus-with-nltk-7a6a6d02d3e5.
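
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter or Lancaster algorithm, both of which apply many more rules and conditions; the function `naive_stem` and its suffix list are assumptions for illustration only.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when enough of the word remains to be a plausible stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Waited", "Waiting", "Waits"]:
    print(w, "->", naive_stem(w))
```

The minimum-length check is one simple guard against over-stripping short words such as "is"; real stemmers use more careful conditions.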

In [16]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

# Stem each word in the list and print it next to its stem
stm = ["waited", "waiting", "waits"]
for word in stm:
    print(word + ":" + pst.stem(word))

waited:wait
waiting:wait
waits:wait


In [20]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
stm = ["giving", "given", "given", "gave"]
for word in stm:
    print(word + ":" + lst.stem(word))

giving:giv
given:giv
given:giv
gave:gav


### Lemmatization
Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is that lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

For example, lemmatization would correctly reduce ‘caring’ to ‘care’, whereas a naive stemmer that simply cuts off the ‘ing’ suffix would turn it into ‘car’.
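
The contrast can be illustrated with two toy functions. Both are hypothetical helpers written for this example, not NLTK calls: `strip_ing` blindly removes a trailing 'ing', while `lookup_lemma` consults a tiny hand-written dictionary standing in for a real lexical resource such as WordNet.

```python
# Tiny hand-written lemma table (an assumption for illustration)
LEMMAS = {"caring": "care", "corpora": "corpus", "gave": "give"}

def strip_ing(word):
    """Crude stemming: drop a trailing 'ing' if present."""
    return word[:-3] if word.endswith("ing") else word

def lookup_lemma(word):
    """Crude lemmatization: dictionary lookup, falling back to the word itself."""
    return LEMMAS.get(word, word)

print(strip_ing("caring"))     # suffix stripping loses the final 'e'
print(lookup_lemma("caring"))  # the lookup returns a real word
```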

Lemmatization can be implemented in Python using several different tools:
 - Wordnet Lemmatizer
 - Spacy Lemmatizer
 - TextBlob 
 - Stanford CoreNLP

In [26]:
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # uncomment on first use to fetch the WordNet data
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

rocks : rock
corpora : corpus


### Stop Words
“Stop words” are the most common words in a language, such as “the”, “a”, “at”, “for”, “above”, “on”, “is”, “all”. These words carry little meaning on their own and are usually removed from texts before analysis. We can remove them using the nltk library.
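
The filtering step itself is simple, as the sketch below suggests. It uses a small hand-written stop-word set in place of NLTK's English list (an assumption made so the example runs without downloading any corpus data); the real list is considerably longer.

```python
# Small illustrative stop-word set; NLTK's English list is much longer
STOP_WORDS = {"the", "a", "at", "for", "above", "on", "is", "all", "of", "my"}

tokens = ["daniel", "pennac", "is", "one", "of", "my", "favourite", "writers"]

# Keep only the tokens that are not stop words
content_tokens = [t for t in tokens if t not in STOP_WORDS]
print(content_tokens)
```

Storing the stop words in a set keeps each membership check fast, which matters once the token list grows.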

In [29]:
# Import the stop-word list and the tokenizer from nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
text_old = "Daniel Pennac is one of my favourite writers. His main saga is set in a neighborhood in the Parisian suburbs, called Belleville. Seven books belongs to the saga: the first one was written in 1991, and the last one in 2017. All the books were critical acclaimed, but personally the one I prefer is the second: the first scene, where a sheet of ice shape is compared to the African continent, is brilliant.   "
# Use a set for fast membership checks
stop_words = set(stopwords.words("english"))
# Lowercase the text before tokenizing, since the stop-word list is lowercase
text1 = word_tokenize(text_old.lower())
print("ORIGINAL:")
print(text1)
# Use a new name for the result so the imported stopwords module is not overwritten
filtered = [x for x in text1 if x not in stop_words]
print()
print("AFTER STOP-WORD REMOVAL:")
print(filtered)

ORIGINAL:
['daniel', 'pennac', 'is', 'one', 'of', 'my', 'favourite', 'writers', '.', 'his', 'main', 'saga', 'is', 'set', 'in', 'a', 'neighborhood', 'in', 'the', 'parisian', 'suburbs', ',', 'called', 'belleville', '.', 'seven', 'books', 'belongs', 'to', 'the', 'saga', ':', 'the', 'first', 'one', 'was', 'written', 'in', '1991', ',', 'and', 'the', 'last', 'one', 'in', '2017.', 'all', 'the', 'books', 'were', 'critical', 'acclaimed', ',', 'but', 'personally', 'the', 'one', 'i', 'prefer', 'is', 'the', 'second', ':', 'the', 'first', 'scene', ',', 'where', 'a', 'sheet', 'of', 'ice', 'shape', 'is', 'compared', 'to', 'the', 'african', 'continent', ',', 'is', 'brilliant', '.']

AFTER STOP-WORD REMOVAL:
['daniel', 'pennac', 'one', 'favourite', 'writers', '.', 'main', 'saga', 'set', 'neighborhood', 'parisian', 'suburbs', ',', 'called', 'belleville', '.', 'seven', 'books', 'belongs', 'saga', ':', 'first', 'one', 'written', '1991', ',', 'last', 'one', '2017.', 'books', 'critical', 'acclaimed', ',', '