# Natural Language Processing

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


* review
* spelling
* similarity
* sentiment
* languages

# TextBlob

In [2]:
from textblob import TextBlob

In [3]:
text_today = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!"""


In [4]:
wiki = TextBlob(text_today)

In [5]:
# what can we get from textblob
type(wiki)

textblob.blob.TextBlob

In [6]:
wiki.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex', 'Complex', 'is', 'better', 'than', 'complicated', 'Flat', 'is', 'better', 'than', 'nested', 'Sparse', 'is', 'better', 'than', 'dense', 'Readability', 'counts', 'Special', 'cases', 'are', "n't", 'special', 'enough', 'to', 'break', 'the', 'rules', 'Although', 'practicality', 'beats', 'purity', 'Errors', 'should', 'never', 'pass', 'silently', 'Unless', 'explicitly', 'silenced', 'In', 'the', 'face', 'of', 'ambiguity', 'refuse', 'the', 'temptation', 'to', 'guess', 'There', 'should', 'be', 'one', 'and', 'preferably', 'only', 'one', 'obvious', 'way', 'to', 'do', 'it', 'Although', 'that', 'way', 'may', 'not', 'be', 'obvious', 'at', 'first', 'unless', 'you', "'re", 'Dutch', 'Now', 'is', 'better', 'than', 'never', 'Although', 'never', 'is', 'often', 'better', 'than', 'right', 'now', 'If', 'the', 'implementation', 'is', 'hard', 'to', 'explain', 'it', 

In [7]:
# Parts of speech: wiki.tags returns (word, POS tag) pairs
wiki

TextBlob("Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!")

In [8]:
# Nouns? Filter wiki.tags for tags starting with 'NN', or look at wiki.noun_phrases
wiki

TextBlob("Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!")

In [9]:
from textblob import Word

In [10]:
w = Word("octopi")
w.lemmatize()

'octopus'

In [11]:
# how would we change the part of speech?
w = Word("went")
w.lemmatize('v')  # change to a verb

'go'

# Spelling

#### TextBlob also includes a spellchecker

In [12]:
spell = TextBlob("I havv goood speling")

In [13]:
spell.correct()

TextBlob("I have good spelling")
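
TextBlob's documentation describes its corrector as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below illustrates that general idea in plain Python and is not TextBlob's actual code: generate every string one edit away from the misspelled word, then pick the most frequent known candidate. The vocabulary `WORD_COUNTS` and its counts are made up for illustration, as are the helper names `edits1`, `correct_word`, and `correct_sentence`.

```python
from collections import Counter
import string

# Hypothetical toy vocabulary with invented frequencies (an assumption for
# illustration; a real corrector learns counts from a large corpus).
WORD_COUNTS = Counter({"i": 50, "have": 40, "good": 30, "spelling": 5, "is": 60, "the": 80})


def edits1(word):
    """Return every string one delete, transpose, replace, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise return the most frequent known word one edit away."""
    if word.lower() in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word.lower()) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible found, leave the word untouched
    return max(candidates, key=WORD_COUNTS.__getitem__)


def correct_sentence(sentence):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in sentence.split())


print(correct_sentence("I havv goood speling"))
```

With this toy vocabulary each misspelled word has exactly one known neighbour, so the frequency tie-break never comes into play; with a realistic vocabulary it decides between competing candidates.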

# Similarity between Documents

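To compare documents while normalizing for their length, take the cosine similarity between their word-count vectors.
<br>
Because word counts are never negative, this gives a score in [0, 1] for how 'similar' two documents are: 0 means they share no words, and 1 means their word proportions are identical. You might see this metric in recommendation engines.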

In [14]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
sim_docs = CountVectorizer()
sim_corpus = ['I ate a burger at burger queen and it was very good.',
           'I ate a hot dog at burger prince and it was bad',
          'I drove a racecar through your kitchen door',
          'I ate a hot dog at burger king and it was bad. I ate a burger at burger queen and it was very good']
sim_vector = sim_docs.fit_transform(sim_corpus)

In [16]:
sim_docs

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
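
The `lowercase=True` and `token_pattern` settings in the repr above explain why one-letter words such as "I" and "a" never become features. As a rough sketch of that default behaviour (not scikit-learn's implementation), the same pattern can be applied with the standard `re` module. The helper names `tokenize` and `count_vector` are hypothetical, and the alphabetical ordering of the vocabulary is assumed to mirror how `CountVectorizer` assigns column indices.

```python
import re

# Default pattern from the repr above: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

corpus = [
    "I ate a burger at burger queen and it was very good.",
    "I ate a hot dog at burger prince and it was bad",
    "I drove a racecar through your kitchen door",
    "I ate a hot dog at burger king and it was bad. "
    "I ate a burger at burger queen and it was very good",
]


def tokenize(doc):
    """Lowercase the text, then keep every match of the token pattern."""
    return TOKEN_PATTERN.findall(doc.lower())


# Map each distinct term to a column index, in alphabetical order.
vocabulary = {
    term: index
    for index, term in enumerate(sorted({tok for doc in corpus for tok in tokenize(doc)}))
}


def count_vector(doc):
    """Dense list of term counts, one slot per vocabulary entry."""
    counts = [0] * len(vocabulary)
    for tok in tokenize(doc):
        counts[vocabulary[tok]] += 1
    return counts


print(tokenize(corpus[1]))
print(vocabulary)
print(count_vector(corpus[0]))
```

The indices this produces for the second document (for example 13 for `prince` and 18 for `was`) line up with the sparse output printed for `sim_vector[1]`.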

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
# print(sim_vector[0])
print(type(sim_vector[0]))
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_sim_0 = pd.DataFrame(sim_vector[0].toarray(), columns=sim_docs.get_feature_names_out())
# df_sim_0

<class 'scipy.sparse.csr.csr_matrix'>


In [19]:
print(sim_vector[1])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_sim_1 = pd.DataFrame(sim_vector[1].toarray(), columns=sim_docs.get_feature_names_out())
# df_sim_1

  (0, 3)	1
  (0, 13)	1
  (0, 5)	1
  (0, 9)	1
  (0, 18)	1
  (0, 10)	1
  (0, 0)	1
  (0, 1)	1
  (0, 4)	1
  (0, 2)	1


In [20]:
# sanity check: a document compared with itself always has similarity 1
cosine_similarity(sim_vector[3], sim_vector[3])

array([[1.]])
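
Comparing a document with itself only confirms the upper bound. As a cross-check of what the metric computes, here is a minimal pure-Python sketch: the dot product of two word-count vectors divided by the product of their lengths. The helpers `bag_of_words` and `cosine` are hypothetical names, and the tokenization rule (lowercase, words of two or more characters) is an assumption borrowed from the `CountVectorizer` defaults shown earlier.

```python
import math
import re
from collections import Counter

docs = [
    "I ate a burger at burger queen and it was very good.",
    "I ate a hot dog at burger prince and it was bad",
    "I drove a racecar through your kitchen door",
]


def bag_of_words(doc):
    """Count lowercase tokens of two or more word characters."""
    return Counter(re.findall(r"\b\w\w+\b", doc.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


bags = [bag_of_words(doc) for doc in docs]
print(cosine(bags[0], bags[1]))  # two restaurant reviews: many shared words
print(cosine(bags[0], bags[2]))  # no shared words, so the score is 0.0
```

Dividing by the vector lengths is what normalizes for document size: repeating a document twice doubles every count but leaves its direction, and therefore its score against other documents, unchanged.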

# Gensim summarization and keywords

In [21]:
# gensim.summarization was removed in Gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization import summarize, keywords

In [22]:
# summarize a document 
summarize(text_today)

"Simple is better than complex.\nComplex is better than complicated.\nIf the implementation is hard to explain, it's a bad idea."

In [23]:
keywords(text_today)

'idea\nbeats\npurity\ncases\nreadability\ngreat'

# Sentiment

## VADER

In [24]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [25]:
si = SentimentIntensityAnalyzer()

In [26]:
vader_test = "I love you"
vader_test_2 = "not great... but the pasta is ok"
vader_test_3 = "I don't think this is a good idea"

In [27]:
si.polarity_scores(vader_test_2)

{'neg': 0.0, 'neu': 0.682, 'pos': 0.318, 'compound': 0.4215}

## TextBlob

In [28]:
# think about attributes, e.g. wiki.sentiment (polarity and subjectivity)
wiki

TextBlob("Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!")

# Other languages

In [30]:
chinese_blob = TextBlob(u"美丽优于丑陋")

In [31]:
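# translate() and detect_language() call the Google Translate web API;
# they were deprecated in TextBlob 0.16 and removed in 0.18
chinese_blob.translate(from_lang="zh-CN", to='en')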

TextBlob("Beauty is better than ugly")

In [32]:
b = TextBlob(u"بسيط هو أفضل من مجمع")

In [33]:
b.detect_language()

'ar'