# NLP Lexical analysis

Lexical analysis in NLP studies text at the level of words, with respect to their lexical meaning and part of speech. This level of linguistic processing uses a language's lexicon, which is a collection of individual lexemes. A lexeme is a basic unit of lexical meaning: an abstract unit of morphological analysis that represents the set of forms taken by a single root word. For example, "run", "runs", "ran" and "running" are forms of the same lexeme.

## Lexicon
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information, such as part-of-speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts.
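
As a rough illustration, a lexicon can be held in a plain Python dictionary keyed by word. The entries and the `lookup` helper below are made up for this sketch; they do not come from NLTK or from the tweet data.

```python
# A toy lexicon: each word maps to its possible parts of speech and short
# sense definitions. All entries are illustrative assumptions.
toy_lexicon = {
    "refuse": [
        {"pos": "verb", "sense": "to decline to do or accept something"},
        {"pos": "noun", "sense": "waste material"},
    ],
    "permit": [
        {"pos": "verb", "sense": "to allow"},
        {"pos": "noun", "sense": "an official document granting permission"},
    ],
}

def lookup(word, lexicon=toy_lexicon):
    """Return the (pos, sense) pairs recorded for a word, or [] if unknown."""
    return [(e["pos"], e["sense"]) for e in lexicon.get(word.lower(), [])]

print(lookup("Refuse"))
print(lookup("tweet"))
```

A word with more than one entry, such as "refuse", is exactly the kind of item that later needs disambiguation.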

In [3]:
import nltk
wordlist = nltk.corpus.words.words()
print(len(wordlist))
[w for w in wordlist if len(w) >= 23]

236736


['anthropomorphologically',
 'blepharosphincterectomy',
 'epididymodeferentectomy',
 'formaldehydesulphoxylate',
 'formaldehydesulphoxylic',
 'gastroenteroanastomosis',
 'hematospectrophotometer',
 'macracanthrorhynchiasis',
 'pancreaticoduodenostomy',
 'pathologicohistological',
 'pathologicopsychological',
 'pericardiomediastinitis',
 'phenolsulphonephthalein',
 'philosophicotheological',
 'Pseudolamellibranchiata',
 'pseudolamellibranchiate',
 'scientificogeographical',
 'scientificophilosophical',
 'tetraiodophenolphthalein',
 'thymolsulphonephthalein',
 'thyroparathyroidectomize',
 'transubstantiationalist']

Let's start building our lexicon from our tweet database.

In [4]:
import os
import pandas as pd
os.listdir()

['NLP_03_lexical_analysis.ipynb',
 'NLP_04_syntactic_analysis.ipynb',
 '__pycache__',
 'SimplePushdownAutomata',
 'Teacher',
 'NLP_05_semantic_analysis.ipynb',
 'NLP_02_preprocessing.ipynb',
 'NLP_01_intro.ipynb',
 'resources',
 '.ipynb_checkpoints']

In [5]:
dir_data = "../data/"
df = pd.read_csv(dir_data + 'finalizedtest.csv')
df.describe()

Unnamed: 0,senti
count,298.0
mean,2.006711
std,1.748483
min,0.0
25%,0.0
50%,2.0
75%,4.0
max,4.0


In [7]:
df.head()

Unnamed: 0,tweet,senti
0,"@united Oh, we are sure it's not planned, but ...",0
1,History exam studying ugh,0
2,@unnitallman yeah looks like that only! &quot;...,0
3,Loves twitter,4
4,@Mbjthegreat i really dont want AT&amp;T phone...,0


In [7]:
tweets_lexicon = {}
tokenizer = nltk.tokenize.TweetTokenizer()
for t in df["tweet"]:
    print(t)
    tokens = tokenizer.tokenize(t)

@united Oh, we are sure it's not planned, but it occurs absolutely consistently, it's usually the only YYJ flight that's Cancelled Flightled daily.
History exam studying ugh
@unnitallman yeah looks like that only! &quot;busy&quot; is fucking me so yeah.. its my &quot;GF&quot; 
Loves twitter
@Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal
I donâ€™t want either! RT @clayhebert: We might get pilotless planes before driverless cars - http://t.co/y4YOxaqI
Super cool: "@google: The next stop on the road to a self-driving car http://t.co/x4rem8zeKy http://t.co/UPdUUaTTb6"
Aw fuck - this night ended badly 
Ok so lots of buzz from IO2009 but how lucky are they - a Free G2!! http://is.gd/Hyzl
got a new pair of nike shoes. pics up later
 Hey any chance you have an update on Flight 99 Hartford to D.C.?
Learning jQuery 1.3 Book Review - http://cfbloggers.org/?c=30629
@kirstiealley my dentist is great but she's expensive...=(
loves chocolate milk  a

**Exercise:** Build a minimal lexicon holding each word in the tweet sentiment analysis database. Associate each word with the number of times it appears in the text. Perform a simple preprocessing step using noise removal and case normalization.

In [10]:
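def cleanString(special_chars, string):
    # Remove every character listed in special_chars from the string
    cleansed_string = string
    for char in special_chars:
        cleansed_string = cleansed_string.replace(char, '')
    return cleansed_string

tokenizer = nltk.tokenize.TweetTokenizer()
special_chars = ",.?!¬-\''=()%"
tweet_lexicon = {}
for t in df["tweet"]:
    # Noise removal; the counts below use the raw tweet, so pass
    # cleaned.lower() to the tokenizer to apply the preprocessing
    cleaned = cleanString(special_chars, t)
    tokens = tokenizer.tokenize(t)
    # Count how many times each token appears
    for token in tokens:
        tweet_lexicon[token] = tweet_lexicon.get(token, 0) + 1
print(tweet_lexicon)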

{'@united': 24, 'Oh': 5, ',': 99, 'we': 10, 'are': 18, 'sure': 1, "it's": 9, 'not': 14, 'planned': 1, 'but': 12, 'it': 28, 'occurs': 1, 'absolutely': 2, 'consistently': 1, 'usually': 1, 'the': 118, 'only': 9, 'YYJ': 1, 'flight': 5, "that's": 6, 'Cancelled': 2, 'Flightled': 2, 'daily': 1, '.': 228, 'History': 1, 'exam': 3, 'studying': 1, 'ugh': 3, '@unnitallman': 1, 'yeah': 2, 'looks': 3, 'like': 14, 'that': 29, '!': 118, '"': 30, 'busy': 1, 'is': 53, 'fucking': 13, 'me': 27, 'so': 15, '..': 13, 'its': 6, 'my': 48, 'GF': 2, 'Loves': 1, 'twitter': 1, '@Mbjthegreat': 1, 'i': 17, 'really': 8, 'dont': 1, 'want': 6, 'AT': 3, '&': 13, 'T': 3, 'phone': 4, 'service': 6, 'they': 6, 'suck': 2, 'when': 6, 'comes': 1, 'to': 96, 'having': 3, 'a': 77, 'signal': 1, 'I': 86, 'donâ': 1, '€': 13, '™': 2, 't': 1, 'either': 1, 'RT': 4, '@clayhebert': 1, ':': 44, 'We': 4, 'might': 1, 'get': 8, 'pilotless': 2, 'planes': 3, 'before': 4, 'driverless': 11, 'cars': 21, '-': 29, 'http://t.co/y4YOxaqI': 1, 'Super'
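
One possible way to complete the exercise with the preprocessing applied is sketched below. It is self-contained so it can run without the dataset: `sample_tweets` copies a few lines from the listing above, and a plain whitespace split stands in for `TweetTokenizer` (an assumption made for brevity). The helper names are hypothetical.

```python
from collections import Counter

# A few tweets copied from the listing above, standing in for df["tweet"]
sample_tweets = [
    "History exam studying ugh",
    "Loves twitter",
    "loves chocolate milk  a",
    "@kirstiealley my dentist is great but she's expensive...=(",
]

SPECIAL_CHARS = ",.?!¬-'=()%"

def clean_string(special_chars, string):
    """Noise removal plus case normalization."""
    for char in special_chars:
        string = string.replace(char, "")
    return string.lower()

def build_lexicon(tweets, special_chars=SPECIAL_CHARS):
    """Map each cleaned, lower-cased token to its number of occurrences."""
    lexicon = Counter()
    for tweet in tweets:
        # str.split() with no argument also collapses repeated spaces
        lexicon.update(clean_string(special_chars, tweet).split())
    return lexicon

lexicon = build_lexicon(sample_tweets)
print(lexicon.most_common(3))
```

With case normalization, "Loves" and "loves" collapse into a single entry, and `Counter.most_common` already gives the frequency ordering.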

Let's order the words according to their frequency

In [11]:
import operator
sorted_tokens = sorted(tweet_lexicon.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_tokens)

[('.', 228), ('the', 118), ('!', 118), (',', 99), ('to', 96), ('I', 86), ('a', 77), ('and', 64), ('?', 62), ('is', 53), ('my', 48), ('for', 46), (':', 44), ('in', 40), ('on', 38), ('you', 38), ('of', 35), ('...', 32), ('"', 30), ('that', 29), ('-', 29), ('it', 28), ('me', 27), ('good', 26), ('car', 25), ('@united', 24), ('be', 24), ('was', 24), ('with', 23), ('have', 22), ('cars', 21), ('at', 21), ('just', 21), ('/', 20), ('from', 19), ('are', 18), ('i', 17), ('this', 17), ('by', 17), ("'", 16), ('so', 15), ("I'm", 15), ('about', 15), ('not', 14), ('like', 14), ('(', 14), (')', 14), ('your', 14), ('fucking', 13), ('..', 13), ('&', 13), ('€', 13), ('got', 13), ('up', 13), ('Google', 13), ('love', 13), ('2', 13), ('go', 13), ('but', 12), ('now', 12), ('driverless', 11), ('The', 11), ('self-driving', 11), ('day', 11), ('we', 10), ('an', 10), ('out', 10), (':)', 10), ('can', 10), ('will', 10), ('him', 10), ('$', 10), ("it's", 9), ('only', 9), ('fuck', 9), ('night', 9), ('new', 9), ('great'

## Part of Speech Tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Let's build a simple POS tagger using the tweet corpus and the tags offered by nltk POS tagger

In [8]:
import nltk

text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('refuse', 'VBP'),
 ('to', 'TO'),
 ('permit', 'VB'),
 ('us', 'PRP'),
 ('to', 'TO'),
 ('obtain', 'VB'),
 ('the', 'DT'),
 ('refuse', 'NN'),
 ('permit', 'NN')]
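
The tags in this output come from the Penn Treebank tag set, which NLTK's default English tagger uses. The small lookup below covers only tags that appear in this notebook; the `PENN_TAGS` dictionary and the `describe` helper are illustrative names, not part of NLTK.

```python
# Penn Treebank tags seen in this notebook, with their usual descriptions
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBP": "verb, non-3rd person singular present",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBZ": "verb, 3rd person singular present",
    "TO": "to",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "JJ": "adjective",
    "RB": "adverb",
    "IN": "preposition or subordinating conjunction",
}

def describe(tagged_tokens):
    """Pair every (token, tag) tuple with a readable tag description."""
    return [(tok, tag, PENN_TAGS.get(tag, "unknown tag")) for tok, tag in tagged_tokens]

# Part of the tagged output copied from the cell above
tagged = [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
          ("the", "DT"), ("refuse", "NN")]
for row in describe(tagged):
    print(row)
```

Note that "refuse" receives VBP in its first occurrence and NN in its second: the tagger relies on context to choose between them.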

## Updating the lexicon

Let's update the lexicon to record the tags assigned to each token and the tag of the token that follows it.

**Exercise**
Remove stopwords from our previous lexicon.

Hint: check the stopword lists in the NLTK corpus module (`nltk.corpus.stopwords`).

In [14]:
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
sws = set(stopwords.words('spanish'))
print(sw)


{'each', 'y', 'from', 'any', 'so', 've', 'is', 'them', 'don', 'ain', 'the', 'his', "mustn't", 'doesn', 'who', 'my', 'of', 'under', 'our', 'you', 'at', 'it', "hasn't", 'have', 'being', 'before', 'only', 'd', 'as', 'm', 'me', 'ours', 'too', 'these', 'what', 'and', 'themselves', 'hasn', 'needn', 'because', 'has', 'hadn', 'why', 'weren', 'herself', 'theirs', 'most', 'aren', 'been', 'itself', 'against', "that'll", 'when', 'for', 'him', 'mustn', 'hers', 'than', 'shouldn', 'an', 'we', 'all', 'couldn', 'doing', "wouldn't", "didn't", "you're", 'that', 're', 'isn', "you'd", 'down', 'she', 'ourselves', 'after', "shan't", "you'll", 'just', 'll', 'while', 'himself', 'will', 'below', 'if', "haven't", 'whom', "don't", 'they', 'was', 'were', 'during', 'both', 'same', 't', 'your', 'did', 'be', 'can', 'yourself', 'by', 'again', 'shan', 'wasn', 'having', "shouldn't", 'to', 'o', "aren't", 'there', 'with', 'more', 'i', 'but', 'over', 'wouldn', 'about', 'ma', "needn't", 'yours', 'off', 'very', 'here', 'furt

In [34]:
#print(tweet_lexicon)
d_keys = list(tweet_lexicon.keys())
print(str(len(d_keys)) + " words in the dictionary")
deleted = 0
for i in d_keys:
    if i in sw and i in tweet_lexicon:
        deleted += 1
        del tweet_lexicon[i]
print("Deleted " + str(deleted) + " stopwords")
print(tweet_lexicon)

1823 words in the dictionary
Deleted 0 stopwords
{'@united': 24, 'Oh': 5, ',': 99, 'sure': 1, 'planned': 1, 'occurs': 1, 'absolutely': 2, 'consistently': 1, 'usually': 1, 'YYJ': 1, 'flight': 5, "that's": 6, 'Cancelled': 2, 'Flightled': 2, 'daily': 1, '.': 228, 'History': 1, 'exam': 3, 'studying': 1, 'ugh': 3, '@unnitallman': 1, 'yeah': 2, 'looks': 3, 'like': 14, '!': 118, '"': 30, 'busy': 1, 'fucking': 13, '..': 13, 'GF': 2, 'Loves': 1, 'twitter': 1, '@Mbjthegreat': 1, 'really': 8, 'dont': 1, 'want': 6, 'AT': 3, '&': 13, 'T': 3, 'phone': 4, 'service': 6, 'suck': 2, 'comes': 1, 'signal': 1, 'I': 86, 'donâ': 1, '€': 13, '™': 2, 'either': 1, 'RT': 4, '@clayhebert': 1, ':': 44, 'We': 4, 'might': 1, 'get': 8, 'pilotless': 2, 'planes': 3, 'driverless': 11, 'cars': 21, '-': 29, 'http://t.co/y4YOxaqI': 1, 'Super': 2, 'cool': 3, '@google': 2, 'The': 11, 'next': 3, 'stop': 1, 'road': 3, 'self-driving': 11, 'car': 25, 'http://t.co/x4rem8zeKy': 1, 'http://t.co/UPdUUaTTb6': 1, 'Aw': 1, 'fuck': 9, '
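
The membership test above is case-sensitive, so capitalized entries such as 'I', 'The' and 'We' stay in the lexicon even though their lower-case forms are stopwords. A minimal sketch of a case-insensitive filter follows; `small_sw` is a hand-picked subset standing in for `stopwords.words('english')`, the lexicon is a slice of the counts printed earlier, and `remove_stopwords` is a hypothetical helper name.

```python
# Hand-picked subset standing in for set(stopwords.words('english'))
small_sw = {"i", "we", "the", "to", "a", "is"}

# A slice of the counts printed earlier in the notebook
lexicon = {"the": 118, "The": 11, "I": 86, "i": 17, "We": 4, "car": 25, "flight": 5}

def remove_stopwords(lexicon, stopword_set):
    """Return a new dict without stopwords, comparing in lower case."""
    return {w: n for w, n in lexicon.items() if w.lower() not in stopword_set}

filtered = remove_stopwords(lexicon, small_sw)
print(filtered)
```

Building a new dictionary avoids deleting keys while iterating, which is the reason the loop above first copies the keys into a list.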

**Exercise** Update the lexicon to include, for each word, the tags generated when the tweet is tokenized and the tags of the word that follows it.

Hint: perform stopword removal.

In [1]:
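new_tweet_lexicon = {}
tokenizer = nltk.tokenize.TweetTokenizer()
noise_chars = ",.@?!¬-\''=()"
for t in df["tweet"]:
    #Remove Noise
    cleaned = "".join(c for c in t if c not in noise_chars)
    #Lower case
    cleaned = cleaned.lower()
    #Tokenize
    tokens = tokenizer.tokenize(cleaned)
    #Stop word removal
    tokens = [tok for tok in tokens if tok not in sw]
    tagged = nltk.pos_tag(tokens)
    for i, (word, tag) in enumerate(tagged):
        entry = new_tweet_lexicon.setdefault(word, {'ocurrences': 0, 'tags': [], 'next_tags': []})
        entry['ocurrences'] += 1
        #Get tags and append it to a list
        entry['tags'].append(tag)
        #Get next token's tag and append it to a list (the last token has none)
        if i + 1 < len(tagged):
            entry['next_tags'].append(tagged[i + 1][1])

print(new_tweet_lexicon)
for e in new_tweet_lexicon.keys():
    data = new_tweet_lexicon[e]
    print(e)
    print(data['ocurrences'])
    print(len(data['tags']))
    print(len(data['next_tags']))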

**Exercise** Assign probabilities to the tags using the tags observed in the corpus and their transitions: count each tag and divide it by the number of occurrences of the word. An example of the calculation is shown below.

In [36]:
print(new_tweet_lexicon['united'])

{'ocurrences': 27, 'tags': ['JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ', 'VBD', 'JJ', 'JJ', 'JJ', 'JJ', 'JJ'], 'next_tags': ['JJ', 'NN', 'CD', 'VBP', 'NNS', 'NN', 'NNS', 'VBD', 'NN', 'NN', 'NNS', 'NN', 'NN', 'NN', 'NN', 'VBD', 'VBN', 'NN', 'NN', 'NNP', 'NN', 'CD', 'NN', 'VBG', 'JJR', 'VBD', 'NNS']}


In [41]:
results = {'ocurrences': 27, 'tags': {'JJ': {'ocurrences': 26, 'probability': 0.9629629629629629}, 'VBD': {'ocurrences': 1, 'probability': 0.037037037037037035}}, 'next_tags': {'JJ': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NN': {'ocurrences': 12, 'probability': 0.4444444444444444}, 'CD': {'ocurrences': 2, 'probability': 0.07407407407407407}, 'VBP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNS': {'ocurrences': 4, 'probability': 0.14814814814814814}, 'VBD': {'ocurrences': 3, 'probability': 0.1111111111111111}, 'VBN': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'VBG': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'JJR': {'ocurrences': 1, 'probability': 0.037037037037037035}}}
print(results)

{'ocurrences': 27, 'tags': {'JJ': {'ocurrences': 26, 'probability': 0.9629629629629629}, 'VBD': {'ocurrences': 1, 'probability': 0.037037037037037035}}, 'next_tags': {'JJ': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NN': {'ocurrences': 12, 'probability': 0.4444444444444444}, 'CD': {'ocurrences': 2, 'probability': 0.07407407407407407}, 'VBP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNS': {'ocurrences': 4, 'probability': 0.14814814814814814}, 'VBD': {'ocurrences': 3, 'probability': 0.1111111111111111}, 'VBN': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'VBG': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'JJR': {'ocurrences': 1, 'probability': 0.037037037037037035}}}
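
The probabilities in this example are relative frequencies. Writing $c(w, t)$ for the number of times word $w$ was seen with tag $t$ and $c(w)$ for the number of occurrences of $w$ (notation introduced here only for illustration), the estimate for 'united' works out as:

$$
P(t \mid w) = \frac{c(w, t)}{c(w)}, \qquad
P(\mathrm{JJ} \mid \text{united}) = \frac{26}{27} \approx 0.963, \qquad
P(\mathrm{VBD} \mid \text{united}) = \frac{1}{27} \approx 0.037
$$

The same division applies to `next_tags`: NN follows 'united' 12 times out of 27, which gives $12/27 \approx 0.444$.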


In [2]:
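def countTags(list_tags):
    #count the tags in the list (used for both seen tags and next tags)
    #and turn each count into a relative frequency
    counts = {tag: list_tags.count(tag) for tag in list_tags}
    return {tag: {'ocurrences': n, 'probability': n / len(list_tags)} for tag, n in counts.items()}

for entry in new_tweet_lexicon.values():
    entry['tags'] = countTags(entry['tags'])
    entry['next_tags'] = countTags(entry['next_tags'])
print(new_tweet_lexicon)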

In [39]:
print(new_tweet_lexicon['united'])

{'ocurrences': 27, 'tags': {'JJ': {'ocurrences': 26, 'probability': 0.9629629629629629}, 'VBD': {'ocurrences': 1, 'probability': 0.037037037037037035}}, 'next_tags': {'JJ': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NN': {'ocurrences': 12, 'probability': 0.4444444444444444}, 'CD': {'ocurrences': 2, 'probability': 0.07407407407407407}, 'VBP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNS': {'ocurrences': 4, 'probability': 0.14814814814814814}, 'VBD': {'ocurrences': 3, 'probability': 0.1111111111111111}, 'VBN': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'NNP': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'VBG': {'ocurrences': 1, 'probability': 0.037037037037037035}, 'JJR': {'ocurrences': 1, 'probability': 0.037037037037037035}}}


Let's save our lexicon in a pickle file:

In [43]:
import pickle
path = "../../data/"
# The with-block closes the file once the lexicon has been written
with open(path + 'tweet_lexicon.pickle', 'wb') as lexicon_file:
    pickle.dump(new_tweet_lexicon, lexicon_file)
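
To check that the file can be read back, a round trip such as the one below may help. It is only a sketch: it writes to a temporary directory instead of `../../data/`, and `sample_lexicon` is a cut-down copy of the 'united' entry shown earlier.

```python
import os
import pickle
import tempfile

# Cut-down copy of the 'united' entry, standing in for new_tweet_lexicon
sample_lexicon = {
    "united": {
        "ocurrences": 27,
        "tags": {"JJ": {"ocurrences": 26, "probability": 26 / 27},
                 "VBD": {"ocurrences": 1, "probability": 1 / 27}},
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "tweet_lexicon.pickle")
    # Binary mode is required for pickle; the with-blocks close the files
    with open(file_path, "wb") as lexicon_file:
        pickle.dump(sample_lexicon, lexicon_file)
    with open(file_path, "rb") as lexicon_file:
        restored = pickle.load(lexicon_file)

print(restored == sample_lexicon)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.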

Detect ambiguous entries

In [44]:
#Ambiguous entries
ambiguous = []
determined = []
for word in new_tweet_lexicon.keys():
    entry = new_tweet_lexicon[word]
    tags = ",".join(entry['tags'].keys())
    if len(entry['tags'].keys()) > 1:
        print(word + " is ambiguous Can be " + tags)
        ambiguous.append(word)
    else:
        print(word + " is determined. Can be " + tags)
        determined.append(word)

united is ambiguous Can be JJ,VBD
oh is ambiguous Can be JJ,RB,UH
sure is determined. Can be NN
planned is determined. Can be VBN
occurs is determined. Can be VBZ
absolutely is determined. Can be RB
consistently is determined. Can be RB
usually is determined. Can be RB
yyj is determined. Can be JJ
flight is determined. Can be NN
thats is determined. Can be NNS
cancelled is determined. Can be VBD
flightled is ambiguous Can be JJ,VBD
history is determined. Can be NN
exam is ambiguous Can be NN,VBP
studying is determined. Can be VBG
unnitallman is determined. Can be JJ
yeah is ambiguous Can be UH,RB,NN
looks is determined. Can be VBZ
like is ambiguous Can be IN,VB
" is ambiguous Can be NNP,JJ,NN,VBP
busy is determined. Can be JJ
fucking is ambiguous Can be VBG,NN
gf is determined. Can be NN
loves is determined. Can be NNS
mbjthegreat is determined. Can be NN
really is determined. Can be RB
dont is ambiguous Can be JJ,NN,VB
want is ambiguous Can be VBP,VB
& is determined. Can be CC
phone i

## Frequency approach
The first and simplest approach to resolving word ambiguity is to assign each ambiguous word its most probable tag: if the word is in the ambiguous list, look up the tag with the highest probability in the lexicon. First, let's build a simple preprocessor class:

In [71]:
class PreProcessor():
    
    def __init__(self, special_chars=''):
        self.special_chars = special_chars
    
    def process(self, string):
        return self.cleanString(string)
    
    def cleanString(self, string):
        # Case normalization
        clean_string = string.lower()
        # Noise removal: strip every special character
        for char in self.special_chars:
            clean_string = clean_string.replace(char, '')
        # split() without arguments drops empty pieces, collapsing repeated spaces
        return " ".join(clean_string.split())

Now let's build a simple POS tagger class:

In [4]:
class simplePOS():
    
    def __init__(self, lexicon, ambiguous, determined, pre_processor = None):
        self.lexicon = lexicon
        self.ambiguous = ambiguous
        self.determined = determined
        self.preProcessor = pre_processor if pre_processor else PreProcessor(",.@?!¬-\''=()")
        self.tokenizer = nltk.tokenize.TweetTokenizer()
    
    def pos_tag(self, string):
        tagged_tokens = []
        if self.preProcessor:
            string = self.preProcessor.process(string)
        tokens = self.tokenizer.tokenize(string)
        for t in tokens:
            #Classify the tokens: pick the most probable tag stored in the lexicon;
            #tokens missing from the lexicon are tagged None
            tags = self.lexicon.get(t, {}).get('tags', {})
            best = max(tags, key=lambda tag: tags[tag]['probability']) if tags else None
            tagged_tokens.append((t, best))
        return tagged_tokens

Instantiate the tagger and assign POS tags to a sentence:

In [78]:
pos_tagger = simplePOS(new_tweet_lexicon, ambiguous, determined)
print(pos_tagger.pos_tag("United nails stocks"))

[('united', 'JJ'), ('nails', 'NNS'), ('stocks', 'NNS')]


How would you alter the probability according to the next tag?

For each word, consider the tag bigrams. For example, JJ NNS forms a bigram, so count how many times an NNS tag appears right after a JJ tag:
* Form all possible bigrams for JJ
* Count how many bigrams start with JJ
* Count how many bigrams starting with JJ are followed by NNS; dividing this count by the previous one gives the probability of the transition
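
The three steps above can be sketched as follows. The tag sequences are invented for the example, and `transition_probabilities` is a hypothetical helper name; in the notebook the sequences would come from tagging each tweet.

```python
from collections import Counter, defaultdict

def transition_probabilities(tag_sequences):
    """Estimate P(next_tag | tag) from bigram counts over tag sequences."""
    bigram_counts = defaultdict(Counter)
    for tags in tag_sequences:
        # zip pairs every tag with its successor, forming the bigrams
        for current, following in zip(tags, tags[1:]):
            bigram_counts[current][following] += 1
    probabilities = {}
    for current, followers in bigram_counts.items():
        total = sum(followers.values())  # bigrams that start with `current`
        probabilities[current] = {nxt: n / total for nxt, n in followers.items()}
    return probabilities

# Invented tag sequences, one list per sentence
sequences = [
    ["JJ", "NNS", "VBP"],
    ["JJ", "NN", "VBZ"],
    ["DT", "JJ", "NNS"],
    ["JJ", "NNS", "NNS"],
]
probs = transition_probabilities(sequences)
print(probs["JJ"])
```

Each row of `probs` sums to 1, so the nested dictionary can be read as the transition table of a Markov chain over tags.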

**Exercise** Given the lists of tags and next_tags, build the probability of a tag given another tag. The goal is to model the Markov state diagram.
Answer: what is the probability of a JJ tag given an NNS tag?