### Natural Language Processing in Julia

- Natural Language Processing (NLP) == understanding (processing) everyday language.


#### Packages
+ Pkg.add("TextAnalysis")
+ Pkg.clone("WordTokenizers")
+ NLTK via PyCall


Natural language processing is a form of artificial intelligence that helps computers read and respond by simulating the human ability to understand everyday language. Many organizations use NLP techniques to optimize customer support, to make text analytics more efficient by surfacing the information they need, and to enhance social media monitoring. For example, a bank might implement NLP algorithms to optimize customer support, and a large consumer products brand might combine natural language processing and semantic analysis to improve its knowledge management strategy and social media monitoring.


### Reading/Creating A Document
+ Uses the hierarchy File -> String -> Tokens -> NGrams
+ Document()
+ FileDocument(filepath)
+ StringDocument("my string")
+ TokenDocument(String[])
+ NGramDocument(Dict())

In [1]:
using TextAnalysis

In [2]:
mystr = """The best error message is the one that never shows up.
You Learn More From Failure Than From Success. 
The purpose of software engineering is to control complexity, not to create it"""

"The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it"

In [3]:
# Basic Way
sd1 = Document(mystr)

A TextAnalysis.StringDocument

In [4]:
# Best Way
sd2 = StringDocument(mystr)

A TextAnalysis.StringDocument

In [5]:
# Reading from a file
filepath = "samplefile.txt"

"samplefile.txt"

In [6]:
# Basic Way
filedoc = Document("samplefile.txt")

A TextAnalysis.FileDocument

In [7]:
# Best Way
fd = FileDocument("samplefile.txt")

A TextAnalysis.FileDocument

#### There are also
+ TokenDocument()
+ NGramDocument()

In [8]:
# Working With Our Document
text(sd1)

"The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it"

#### What language is it?

In [9]:
# Getting the Base Info About it
language(sd1)

Languages.EnglishLanguage

### Tokenization With TextAnalysis
+ Word Tokens
+ Sentence Tokens

In [10]:
text(sd1)

"The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it"

In [11]:
# Word Tokens from a String Document
tokens(sd1)

32-element Array{SubString{String},1}:
 "The"        
 "best"       
 "error"      
 "message"    
 "is"         
 "the"        
 "one"        
 "that"       
 "never"      
 "shows"      
 "up."        
 "You"        
 "Learn"      
 ⋮            
 "purpose"    
 "of"         
 "software"   
 "engineering"
 "is"         
 "to"         
 "control"    
 "complexity,"
 "not"        
 "to"         
 "create"     
 "it"         

In [12]:
text(fd)

"The best error message is the one that never shows up.\r\nYou Learn More From Failure Than From Success. \r\nThe purpose of software engineering is to control complexity, not to create it"

In [13]:
# Word Tokens from a File Document
tokens(fd)

32-element Array{SubString{String},1}:
 "The"        
 "best"       
 "error"      
 "message"    
 "is"         
 "the"        
 "one"        
 "that"       
 "never"      
 "shows"      
 "up."        
 "You"        
 "Learn"      
 ⋮            
 "purpose"    
 "of"         
 "software"   
 "engineering"
 "is"         
 "to"         
 "control"    
 "complexity,"
 "not"        
 "to"         
 "create"     
 "it"         

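The two outputs above are consistent with plain whitespace splitting: punctuation stays attached (`"up."`, `"complexity,"`) and the `\r\n` line endings in the file make no difference. A minimal sketch in Base Julia, assuming whitespace splitting is a fair stand-in for what `tokens()` did here (it is not TextAnalysis's actual implementation, and the variable names are illustrative):

```julia
# The same three quotes with Unix ("\n") and Windows ("\r\n") line endings.
unix_text = "The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it"
windows_text = replace(unix_text, "\n" => "\r\n")

# split() without a delimiter splits on any run of whitespace and drops
# empty pieces, so punctuation stays glued to the preceding word.
unix_tokens = split(unix_text)
windows_tokens = split(windows_text)

println(length(unix_tokens))            # number of whitespace-separated tokens
println(unix_tokens == windows_tokens)  # line-ending style does not matter
println(unix_tokens[11])                # last token of the first sentence
```

If punctuation should become its own token, a tokenizer such as the one in WordTokenizers (next section) is the better fit.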
### Tokenization With WordTokenizers
+ Word Tokens
+ Sentence Tokens

In [15]:
using WordTokenizers

In [16]:
sd1

A TextAnalysis.StringDocument

In [17]:
# tokenize() takes a String, so extract the text from the TextAnalysis document first
tokenize(text(sd1))

33-element Array{SubString{String},1}:
 "The"        
 "best"       
 "error"      
 "message"    
 "is"         
 "the"        
 "one"        
 "that"       
 "never"      
 "shows"      
 "up."        
 "You"        
 "Learn"      
 ⋮            
 "of"         
 "software"   
 "engineering"
 "is"         
 "to"         
 "control"    
 "complexity" 
 ","          
 "not"        
 "to"         
 "create"     
 "it"         

In [18]:
tokenize("Hello world this is Julia")

5-element Array{SubString{String},1}:
 "Hello"
 "world"
 "this" 
 "is"   
 "Julia"

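The main difference from the TextAnalysis output is that punctuation can become a token of its own (`"complexity"` and `","`). A rough sketch of that idea with a regular expression in Base Julia; `punct_tokenize` is a hypothetical helper, not the WordTokenizers algorithm, and unlike the output above it also splits off every period:

```julia
# Hypothetical helper: runs of word characters are one token,
# and every other non-space character is a token by itself.
punct_tokenize(s::AbstractString) =
    [m.match for m in eachmatch(r"\w+|[^\w\s]", s)]

println(punct_tokenize("control complexity, not to create it"))
println(punct_tokenize("Hello world this is Julia"))
```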
#### Sentence Tokenization

First, solve the problem. Then, write the code. 
Fix the cause, not the symptom.
Simplicity is the soul of efficiency. Good design adds value faster than it adds cost.
In theory, theory and practice are the same. In practice, they’re not.
There are two ways of constructing a software design.
One way is to make it so simple that there are obviously no deficiencies.
And the other way is to make it so complicated that there are no obvious deficiencies.

In [19]:
# Read a file with sentences
sent_files = FileDocument("quotesfiles.txt")

A TextAnalysis.FileDocument

In [20]:
text(sent_files)

"\ufeffFirst, solve the problem. Then, write the code.\r\nFix the cause, not the symptom.\r\nSimplicity is the soul of efficiency.\r\nGood design adds value faster than it adds cost.\r\nIn theory, theory and practice are the same. In practice, they’re not.\r\nThere are two ways of constructing a software design.\r\nOne way is to make it so simple that there are obviously no deficiencies.\r\nAnd the other way is to make it so complicated that there are no obvious deficiencies."

In [23]:
# Sentence Tokenization
split_sentences(text(sent_files))

17-element Array{SubString{String},1}:
 "\ufeffFirst, solve the problem."                                                       
 "Then, write the code."                                                                 
 ""                                                                                      
 "Fix the cause, not the symptom."                                                       
 ""                                                                                      
 "Simplicity is the soul of efficiency."                                                 
 ""                                                                                      
 "Good design adds value faster than it adds cost."                                      
 ""                                                                                      
 "In theory, theory and practice are the same."                                          
 "In practice, they’re not."                                 

In [24]:
for sentence in split_sentences(text(sent_files))
    println(sentence)
end

﻿First, solve the problem.
Then, write the code.

Fix the cause, not the symptom.

Simplicity is the soul of efficiency.

Good design adds value faster than it adds cost.

In theory, theory and practice are the same.
In practice, they’re not.

There are two ways of constructing a software design.

One way is to make it so simple that there are obviously no deficiencies.

And the other way is to make it so complicated that there are no obvious deficiencies.
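The output above still carries the byte-order mark (`\ufeff`) from the file and an empty string for every `\r\n` line break. A small clean-up sketch in Base Julia, assuming a simple "split after `.`, `!` or `?`" rule is good enough; `clean_sentences` is a hypothetical helper and not a replacement for `split_sentences`:

```julia
# Hypothetical helper: strip the BOM, split after sentence-ending
# punctuation followed by whitespace, and drop empty pieces.
function clean_sentences(raw::AbstractString)
    no_bom = replace(raw, "\ufeff" => "")
    parts = split(no_bom, r"(?<=[.!?])\s+")
    return [strip(p) for p in parts if !isempty(strip(p))]
end

sample = "\ufeffFirst, solve the problem. Then, write the code.\r\nFix the cause, not the symptom."
for sentence in clean_sentences(sample)
    println(sentence)
end
```

This rule would wrongly split abbreviations such as "e.g. this", which is one reason to prefer a real sentence tokenizer and only filter its output.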


In [25]:
for sentence in split_sentences(text(sent_files))
    wordtokens = tokenize(sentence)
    println("Word token=> $wordtokens")
end

Word token=> SubString{String}["\ufeffFirst", ",", "solve", "the", "problem", "."]
Word token=> SubString{String}["Then", ",", "write", "the", "code", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["Fix", "the", "cause", ",", "not", "the", "symptom", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["Simplicity", "is", "the", "soul", "of", "efficiency", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["Good", "design", "adds", "value", "faster", "than", "it", "adds", "cost", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["In", "theory", ",", "theory", "and", "practice", "are", "the", "same", "."]
Word token=> SubString{String}["In", "practice", ",", "they", "’", "re", "not", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["There", "are", "two", "ways", "of", "constructing", "a", "software", "design", "."]
Word token=> SubString{String}[]
Word token=> SubString{String}["One", "way", "

### N-Grams
+ Combinations of multiple words
+ Useful for creating features during language modeling

In [26]:
mystr

"The best error message is the one that never shows up.\nYou Learn More From Failure Than From Success. \nThe purpose of software engineering is to control complexity, not to create it"

In [27]:
sd3 = StringDocument(mystr)

A TextAnalysis.StringDocument

In [28]:
# Unigram
ngrams(sd3)

Dict{SubString{String},Int64} with 28 entries:
  "engineering" => 1
  "Learn"       => 1
  "is"          => 2
  "From"        => 2
  "not"         => 1
  "one"         => 1
  "never"       => 1
  "up."         => 1
  "complexity," => 1
  "create"      => 1
  "software"    => 1
  "that"        => 1
  "it"          => 1
  "You"         => 1
  "Failure"     => 1
  "best"        => 1
  "shows"       => 1
  "purpose"     => 1
  "error"       => 1
  "the"         => 1
  "Success."    => 1
  "The"         => 2
  "Than"        => 1
  "of"          => 1
  "More"        => 1
  ⋮             => ⋮

In [29]:
# Bigrams (the result also contains the unigrams)
ngrams(sd3,2)


Dict{AbstractString,Int64} with 59 entries:
  "that never"     => 1
  "is to"          => 1
  "create"         => 1
  "that"           => 1
  "best"           => 1
  "Than From"      => 1
  "shows up."      => 1
  "purpose"        => 1
  "of"             => 1
  "purpose of"     => 1
  "More"           => 1
  "to"             => 2
  "the one"        => 1
  "is"             => 2
  "never"          => 1
  "complexity,"    => 1
  "software"       => 1
  "one that"       => 1
  "shows"          => 1
  "From Success."  => 1
  "The purpose"    => 1
  "message is"     => 1
  "engineering is" => 1
  "engineering"    => 1
  "not"            => 1
  ⋮                => ⋮
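The entry counts suggest that `ngrams(sd3, 2)` returns every n-gram up to length 2: 28 distinct unigrams plus 31 bigrams gives the 59 entries shown. A minimal sketch of that counting scheme in Base Julia; `count_ngrams` is a hypothetical helper written for illustration, not the TextAnalysis source:

```julia
# Hypothetical helper: count all n-grams of length 1..n over a token list.
function count_ngrams(words::Vector{<:AbstractString}, n::Int)
    counts = Dict{String,Int}()
    for k in 1:n                              # n-gram length
        for i in 1:(length(words) - k + 1)    # start position
            gram = join(words[i:i+k-1], " ")
            counts[gram] = get(counts, gram, 0) + 1
        end
    end
    return counts
end

words = split("to be or not to be")
bigram_counts = count_ngrams(words, 2)
println(bigram_counts["to be"])   # bigram seen twice
println(length(bigram_counts))    # 4 unigrams + 4 bigrams
```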

In [41]:
# Trigrams (the result also contains the unigrams and bigrams)
for trigram in ngrams(sd3,3)
    println(trigram)
end

Pair{AbstractString,Int64}("that never", 1)
Pair{AbstractString,Int64}("Success. The purpose", 1)
Pair{AbstractString,Int64}("You Learn More", 1)
Pair{AbstractString,Int64}("From Failure Than", 1)
Pair{AbstractString,Int64}("is to", 1)
Pair{AbstractString,Int64}("purpose of software", 1)
Pair{AbstractString,Int64}("create", 1)
Pair{AbstractString,Int64}("the one that", 1)
Pair{AbstractString,Int64}("software engineering is", 1)
Pair{AbstractString,Int64}("that", 1)
Pair{AbstractString,Int64}("control complexity, not", 1)
Pair{AbstractString,Int64}("best", 1)
Pair{AbstractString,Int64}("Than From", 1)
Pair{AbstractString,Int64}("shows up.", 1)
Pair{AbstractString,Int64}("purpose", 1)
Pair{AbstractString,Int64}("of", 1)
Pair{AbstractString,Int64}("purpose of", 1)
Pair{AbstractString,Int64}("More", 1)
Pair{AbstractString,Int64}("to", 2)
Pair{AbstractString,Int64}("the one", 1)
Pair{AbstractString,Int64}("is", 2)
Pair{AbstractString,Int64}("is the one", 1)
Pair{AbstractString,Int64}("one t

In [30]:
# Creating an n-gram dictionary by hand
my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                "or" => 1, "not" => 1,
                                "to" => 1, "be..." => 1)


Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

In [31]:
ngd = NGramDocument(my_ngrams)

A TextAnalysis.NGramDocument

In [32]:
# Reporting the n-gram complexity (n) recorded on the document
ngram_complexity(ngd)

1

In [33]:
my_ngrams2 = Dict{AbstractString,Int64}(
  "that never" => 1,"is to" => 1,"create" => 1,"that" => 1,"best" => 1,"Than From" => 1,"shows up." => 1,
    "purpose" => 1,"of" => 1,"purpose of" => 1,"More" => 1,"to" => 2,"the one" => 1,
    "is" => 2,"never" => 1,"complexity,"=> 1,"software" => 1,"one that" => 1)

Dict{AbstractString,Int64} with 18 entries:
  "that never"  => 1
  "shows up."   => 1
  "the one"     => 1
  "purpose"     => 1
  "is"          => 2
  "never"       => 1
  "complexity," => 1
  "is to"       => 1
  "of"          => 1
  "create"      => 1
  "purpose of"  => 1
  "software"    => 1
  "that"        => 1
  "More"        => 1
  "one that"    => 1
  "to"          => 2
  "best"        => 1
  "Than From"   => 1

In [34]:
ngd2 = NGramDocument(my_ngrams2)

A TextAnalysis.NGramDocument

In [35]:
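# Still 1 although the dictionary holds bigrams: n was not passed to NGramDocument, so it appears to default to 1
ngram_complexity(ngd2)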

1
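Both documents report a complexity of 1, even though the second dictionary contains two-word keys, so the value does not seem to be inferred from the keys. If you want to derive it from the data, one option is to take the largest word count among the keys. A sketch in Base Julia; `infer_complexity` is a hypothetical helper, not part of TextAnalysis:

```julia
# Hypothetical helper: the n-gram complexity is the largest number of
# whitespace-separated words found in any key. Errors on an empty Dict.
infer_complexity(ngs::AbstractDict) =
    maximum(length(split(key)) for key in keys(ngs))

unigrams = Dict("To" => 1, "be" => 2, "or" => 1, "not" => 1)
mixed    = Dict("that never" => 1, "is to" => 1, "create" => 1, "to" => 2)

println(infer_complexity(unigrams))
println(infer_complexity(mixed))
```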

#### You can use pip to install NLTK and download its data
+ pip install nltk
+ import nltk
+ nltk.download()
#### In Julia
+ using Conda
+ Conda.add("nltk")

### Parts of Speech
+ nltk.tag via PyCall


In [38]:
using PyCall

In [39]:
# Importing Part of Speech Tag from NLTK
@pyimport nltk.tag as ptag

In [41]:
# Tokenize with TextAnalysis (WordTokenizers would work as well)
ex = StringDocument("Julia is very fast but it is still young")


A TextAnalysis.StringDocument

In [43]:
# TextAnalysis.tokens()
mytokens = tokens(ex)

9-element Array{SubString{String},1}:
 "Julia"
 "is"   
 "very" 
 "fast" 
 "but"  
 "it"   
 "is"   
 "still"
 "young"

In [44]:
# Using NLTK tags for finding the part of speech of our tokens
ptag.pos_tag(mytokens)

9-element Array{Tuple{String,String},1}:
 ("Julia", "NNP")
 ("is", "VBZ")   
 ("very", "RB")  
 ("fast", "RB")  
 ("but", "CC")   
 ("it", "PRP")   
 ("is", "VBZ")   
 ("still", "RB") 
 ("young", "JJ") 
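Once the tokens are tagged, a common next step is to group or filter them by tag. A short sketch in Base Julia that reuses the tagged pairs printed above as literal data, so it runs without PyCall or NLTK; the variable names are illustrative:

```julia
# Tagged pairs copied from the NLTK output above.
tagged = [("Julia", "NNP"), ("is", "VBZ"), ("very", "RB"), ("fast", "RB"),
          ("but", "CC"), ("it", "PRP"), ("is", "VBZ"), ("still", "RB"),
          ("young", "JJ")]

# Group the words by their part-of-speech tag, keeping sentence order.
by_tag = Dict{String,Vector{String}}()
for (word, tag) in tagged
    push!(get!(by_tag, tag, String[]), word)
end

println(by_tag["RB"])                        # tokens tagged as adverbs
println([w for (w, t) in tagged if t == "JJ"])  # tokens tagged as adjectives
```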

### Word Inflection == Word Formation by adding to base/root word
+ Stemming (Basics) stem!()
+ Lemmatizing
+ How do we do these?
    + PyCall to the rescue
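To make the idea of stemming concrete, here is a deliberately naive suffix-stripping sketch in Base Julia. `naive_stem` is a hypothetical helper with a made-up suffix list; it is not the algorithm behind `stem!()` or any NLTK stemmer, and it does no lemmatizing (it cannot map "better" to "good").

```julia
# Hypothetical helper: lowercase the word and strip the first matching
# suffix, provided at least three characters of the word remain.
function naive_stem(word::AbstractString)
    w = lowercase(word)
    for suffix in ("ing", "ed", "es", "s")
        if endswith(w, suffix) && length(w) - length(suffix) >= 3
            return String(chop(w, tail = length(suffix)))
        end
    end
    return w
end

for word in ["shows", "constructing", "adds", "is"]
    println(word, " => ", naive_stem(word))
end
```

Real stemmers apply ordered rule sets with many exceptions, which is why the notebook leans on existing libraries instead.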

In [45]:
# List the names defined in TextAnalysis (whos is Julia 0.6; on Julia 1.x use varinfo)
whos(TextAnalysis)

              AbstractDocument     92 bytes  DataType
                        Corpus     40 bytes  UnionAll
               DirectoryCorpus      0 bytes  TextAnalysis.#DirectoryCorpus
                      Document      0 bytes  TextAnalysis.#Document
            DocumentTermMatrix    136 bytes  DataType
                  FileDocument    124 bytes  DataType
               GenericDocument     48 bytes  Union
                 NGramDocument    136 bytes  DataType
                       Stemmer    136 bytes  DataType
                StringDocument    124 bytes  DataType
                  TextAnalysis  23658 KB     Module
              TextHashFunction    124 bytes  DataType
                 TokenDocument    124 bytes  DataType
                        author      0 bytes  TextAnalysis.#author
                       author!      0 bytes  TextAnalysis.#author!
                   cardinality      0 bytes  TextAnalysis.#cardinality
                     documents      0 bytes  TextAnalysis.#docum

In [None]:
# Paths to the Python packages
# /usr/local/lib/python3.5/dist-packages/textblob
# /usr/local/lib/python3.5/dist-packages/nltk