# Natural Language Processing in Python

 * Outline:
     * general introduction
     * problem definition
     * hands-on demo


## What is Natural Language Processing (NLP)?

 * automated processing of large amounts of textual data that would be impossible for a human to handle manually
 * common applications:
     * sentiment analysis
     * text summarization
     * predictions
     * human language modelling
     * ...

## Overview of steps
    
### Data preparation - preprocessing 

* bringing text into analyzable form
* lowercasing
* stemming
* lemmatization
* cleaning data (stop-word removal, noise removal)
* normalization
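
As a rough illustration of how several of these steps can be chained, here is a minimal sketch using only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example; a real pipeline would pick its steps according to the task.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "and"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip HTML tags and non-letters, then drop stop words."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)      # remove HTML markup
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only ASCII letters and whitespace
    tokens = text.split()                   # split on any run of whitespace
    return [t for t in tokens if t not in stop_words]

tokens = preprocess("<p>The SKY is blue, and the sun is BRIGHT!</p>")
print(tokens)  # ['sky', 'blue', 'sun', 'bright']
```

Stemming or lemmatization could be added as a final step on the returned tokens.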

### Apply data analysis techniques on preprocessed data

* **Bag-of-words, TF-IDF** (scikit-learn)
* n-gram and skip-gram techniques for context evaluation
* word embeddings (Gensim, Word2Vec)
* deep-learning RNNs
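
To make the n-gram idea concrete, the sketch below builds contiguous n-grams from a token list in plain Python. The helper name `ngrams` is an assumption for this example; skip-grams, which allow gaps between the tokens, are not covered here.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the sun in the sky is bright".split()
bigrams = ngrams(tokens, 2)
print(bigrams[:3])  # [('the', 'sun'), ('sun', 'in'), ('in', 'the')]
```

Unlike a plain bag of words, the bigrams keep some information about which words appear next to each other.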

* apply the KISS principle
    * the choice of technique depends on the problem you are solving
    * different data and tasks may require a different subset of preprocessing steps
    * deep learning may be overkill

## Data preprocessing


### Lowercasing

* simple, but important
* if keeping case is not important for your task, make sure all words are lowercase

In [3]:
texts=["CANADA","Canada","canadA","canada"] # four distinct words from a text processor's perspective
lower_words=[word.lower() for word in texts]
lower_words

['canada', 'canada', 'canada', 'canada']
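
For text that is not plain English, `str.lower()` may not be enough to make variants compare equal. The sketch below, which assumes German input as an illustrative case, contrasts it with the standard library's `str.casefold()`.

```python
words = ["Straße", "STRASSE", "strasse"]

lowered = [w.lower() for w in words]    # 'ß' is kept as-is by lower()
folded = [w.casefold() for w in words]  # casefold() maps 'ß' to 'ss'

print(lowered)  # ['straße', 'strasse', 'strasse']
print(folded)   # ['strasse', 'strasse', 'strasse']
```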

### Stemming

 * reducing words to a 'root' form (the stem), which approximates the canonical form of the original word
 * done by chopping off the ends of words
 * helps with sparsity issues and standardizes the vocabulary
 * especially useful in search applications
 * different algorithms exist
     * English: [Porter's algorithm](https://tartarus.org/martin/PorterStemmer/)
     * other languages: [Snowball](https://snowballstem.org/algorithms/), [CzechLight](http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt)
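
The "chopping off the ends of words" idea can be sketched with a naive suffix stripper. This is not Porter's algorithm: the suffix list and the `naive_stem` helper are assumptions made up for illustration, and real stemmers apply ordered rules with extra conditions.

```python
# Hypothetical suffix list; real stemmers such as Porter's apply ordered,
# condition-guarded rules rather than a flat list like this.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remainder is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["troubled", "troubles", "troubling", "trouble"]:
    print(w, "->", naive_stem(w))
```

Here 'troubled', 'troubles' and 'troubling' all become 'troubl', while 'trouble' is left unchanged because none of the listed suffixes match. Handling such cases consistently is part of what a proper stemming algorithm adds.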

In [7]:
import nltk # Natural Language Toolkit module for Python
import pandas as pd
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

In [8]:
# stem trouble variations
words=["trouble","troubled","troubles","troublesome"]
stemmed_words=[porter_stemmer.stem(word=word) for word in words]

stemdf= pd.DataFrame({'original_word': words,'stemmed_word': stemmed_words})
stemdf

|   | original_word | stemmed_word |
|---|---|---|
| 0 | trouble | troubl |
| 1 | troubled | troubl |
| 2 | troubles | troubl |
| 3 | troublesome | troublesom |


### Lemmatization

 * determines the proper root (lemma) of a word instead of just chopping off its ending
 * often no significant benefit over stemming for search and text classification
 * can affect runtime performance, since it is typically slower than a basic stemmer
 * e.g. lemmatization('better') -> 'good'
 * achieved with a dictionary of some sort or with rules
     * [WordNet for mappings](https://www.nltk.org/_modules/nltk/stem/wordnet.html)
     * [Rule-based approaches](https://www.semanticscholar.org/paper/A-Rule-based-Approach-to-Word-Lemmatization-Plisson-Lavrac/5319539616e81b02637b1bf90fb667ca2066cf14)
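
The dictionary idea can be sketched as a plain lookup with a fallback. The `LEMMA_DICT` mapping and the `lookup_lemma` helper are hypothetical and hand-written for this example; WordNet plays the role of the dictionary in NLTK.

```python
# Hypothetical hand-written lemma dictionary for illustration only.
LEMMA_DICT = {
    "better": "good",
    "best": "good",
    "troubling": "trouble",
    "troubled": "trouble",
    "troubles": "trouble",
}

def lookup_lemma(word, lemma_dict=LEMMA_DICT):
    """Return the dictionary lemma, or the lowercased word if it is unknown."""
    word = word.lower()
    return lemma_dict.get(word, word)

print([lookup_lemma(w) for w in ["Better", "troubling", "sky"]])
# ['good', 'trouble', 'sky']
```

A mapping such as 'better' -> 'good' cannot be obtained by stripping suffixes, which is why a lemmatizer needs a dictionary or linguistic rules.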

In [9]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\poso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


In [10]:
#lemmatize trouble variations
words=["trouble","troubling","troubled","troubles",]
lemmatized_words=[lemmatizer.lemmatize(word=word,pos='v') for word in words]
lemmatizeddf= pd.DataFrame({'original_word': words,'lemmatized_word': lemmatized_words})
lemmatizeddf=lemmatizeddf[['original_word','lemmatized_word']]
lemmatizeddf

|   | original_word | lemmatized_word |
|---|---|---|
| 0 | trouble | trouble |
| 1 | troubling | trouble |
| 2 | troubled | trouble |
| 3 | troubles | trouble |


### Stop-word removal

 * stop words are very common words in a language, e.g. 'a', 'the', 'is'
 * because they are so common, they carry little information about the text around them
 * e.g. a search for 'what is text processing' that matches on 'what' and 'is' returns many documents that have nothing to do with 'text' or 'processing'
 * removal is used in search, text classification, topic modelling, topic extraction, ...
 * a stop-word list can come from an external source or be custom-built for the domain language and problem definition
 * one can also set a criterion for when a word counts as a stop word, e.g. its frequency
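
A frequency criterion could look like the sketch below, which flags every word that occurs in at least a given fraction of the documents. The function name `frequent_words` and the 0.75 threshold are assumptions chosen for illustration.

```python
from collections import Counter

def frequent_words(documents, min_doc_fraction=0.75):
    """Return words that occur in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_fraction * len(documents)
    return {word for word, count in doc_freq.items() if count >= threshold}

docs = [
    "the sky is blue",
    "the sun is bright",
    "the sun in the sky is bright",
    "we can see the shining sun",
]
print(sorted(frequent_words(docs)))  # ['is', 'sun', 'the']
```

Note that 'sun' is flagged as well: on a small or topical corpus a purely frequency-based criterion can catch content words, so the resulting list is worth reviewing by hand.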

In [11]:
stopwords=['this','that','and','a','we','it','to','is','of','up','need']
text="this is a text full of content and we need to clean it up"

In [12]:
words=text.split(" ")
shortlisted_words=[]

# replace each stop word with the placeholder "W" so the removed positions stay visible
for w in words:
    if w not in stopwords:
        shortlisted_words.append(w)
    else:
        shortlisted_words.append("W")

print("original sentence = ",text)    
print("sentence with stop words removed= ",' '.join(shortlisted_words))

original sentence =  this is a text full of content and we need to clean it up
sentence with stop words removed=  W W W text full W content W W W W clean W W


### Normalization

 * transforming text into standard form
 * normalization('goooood') -> 'good'
 * applied to noisy text: social media comments, text messages, ...
 * useful for sentiment analysis and for highly unstructured clinical texts
 * no standard method; common approaches:
     * statistical machine translation (SMT)
     * dictionary-based approach
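
For elongated words such as 'goooood', a simple rule-based sketch is to shorten long runs of a repeated character. The `collapse_repeats` helper is a hypothetical name for this example; it is neither the SMT nor the dictionary approach, only a cheap heuristic.

```python
import re

def collapse_repeats(word, max_run=2):
    """Shorten any run of the same character to at most `max_run` characters."""
    # (.)\1{max_run,} matches a character followed by max_run or more copies of it
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

print(collapse_repeats("goooood"))  # good
print(collapse_repeats("sooo"))     # soo
```

The second result shows the limit of the heuristic: 'soo' is still not a word, so in practice the output would likely need to be checked against a dictionary.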

In [19]:
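def find_matching_key(word, dictionary):
    """Return the first key whose variants contain `word`, else the word itself."""
    for key, variants in dictionary.items():
        if word in variants:
            return key
    return word  # leave unknown words unchanged instead of raising IndexError

raw = ['2moro', '2mrrw', 'b4']
# ('b4',) needs the trailing comma: ('b4') is a plain string, not a tuple
translation_dictionary = {'tomorrow': ('2moro', '2mrrw'), 'before': ('b4',)}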
normalized = [find_matching_key(word, translation_dictionary) for word in raw]
normalized_df = pd.DataFrame({'raw': raw, 'normalized': normalized})
normalized_df

|   | raw | normalized |
|---|---|---|
| 0 | 2moro | tomorrow |
| 1 | 2mrrw | tomorrow |
| 2 | b4 | before |


### Noise removal

 * removal of characters, digits and pieces of text that can interfere with the text analysis
 * domain dependent: what counts as noise varies with the source of the text
 * punctuation removal, special character removal, formatting removal (html)

In [20]:
import nltk
import pandas as pd
import re
from nltk.stem import PorterStemmer

porter_stemmer=PorterStemmer()

In [21]:
# stem raw words with noise
raw_words=["..trouble..","trouble<","trouble!","<a>trouble</a>",'1.trouble']
stemmed_words=[porter_stemmer.stem(word=word) for word in raw_words]
stemdf= pd.DataFrame({'raw_word': raw_words,'stemmed_word': stemmed_words})
stemdf

|   | raw_word | stemmed_word |
|---|---|---|
| 0 | `..trouble..` | `..trouble..` |
| 1 | `trouble<` | `trouble<` |
| 2 | `trouble!` | `trouble!` |
| 3 | `<a>trouble</a>` | `<a>trouble</a>` |
| 4 | `1.trouble` | `1.troubl` |


In [22]:
def scrub_words(text):
    """Basic cleaning of texts."""
    
    # remove html markup
    text=re.sub(r"(<.*?>)","",text)
    
    # replace non-word characters (punctuation, symbols) and digits with a space
    text=re.sub(r"(\W|\d)"," ",text)
    
    # strip leading and trailing whitespace
    text=text.strip()
    return text

In [23]:
# stem words already cleaned
cleaned_words=[scrub_words(w) for w in raw_words]
cleaned_stemmed_words=[porter_stemmer.stem(word=word) for word in cleaned_words]
stemdf= pd.DataFrame({'raw_word': raw_words,'cleaned_word':cleaned_words,'stemmed_word': cleaned_stemmed_words})
stemdf=stemdf[['raw_word','cleaned_word','stemmed_word']]
stemdf

|   | raw_word | cleaned_word | stemmed_word |
|---|---|---|---|
| 0 | `..trouble..` | trouble | troubl |
| 1 | `trouble<` | trouble | troubl |
| 2 | `trouble!` | trouble | troubl |
| 3 | `<a>trouble</a>` | trouble | troubl |
| 4 | `1.trouble` | trouble | troubl |


## Analysing text

### Bag of words, TF-IDF

 * Bag of Words
     * simplified representation of a text as a bag (multiset) of its words
     * ignores word order and context
 * TF-IDF - term frequency-inverse document frequency
     * intended to reflect how important a word is to a document within a document collection
     * can be used to build vectors for texts, with the TF-IDF values as weights
### Cosine similarity

 * once sentences/documents are represented as vectors, their similarity can be computed with vector algebra


In [56]:
from IPython.display import Math, display
display(Math(r'\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta'))
display(Math(r'\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}'))

$$\vec{a} \cdot \vec{b} = \|\vec{a}\| \, \|\vec{b}\| \cos\theta$$

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$$

![Dot_Product](Dot_Product.png)
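
The cosine of the angle between two vectors is their dot product divided by the product of their norms. A minimal plain-Python sketch of that computation follows; the function name `cosine_sim` is an assumption for this example, and it expects two numeric sequences of equal length.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_sim([1, 0], [0, 1]))  # orthogonal vectors: 0.0
print(cosine_sim([1, 2], [2, 4]))  # parallel vectors: approximately 1
```

Because only the angle matters, a long document and a short one with the same word proportions come out as highly similar.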

In [46]:
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun"
)

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix.shape)

(4, 11)


In [52]:
# cosine similarity of the first document to every document (itself included)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

array([[1.        , 0.36651513, 0.52305744, 0.13448867]])

In [54]:
import math
cos_sim = 0.52305744
angle_in_radians = math.acos(cos_sim)
print(math.degrees(angle_in_radians))

58.462437107432784


### Sources

 * http://kavita-ganesan.com/text-preprocessing-tutorial/#.XHa4-ZNKhuU
     * https://github.com/kavgan/nlp-in-practice/tree/master/text-pre-processing
 * http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
 