# Language Processing and Python
## Computing with Language: Texts and Words


In [None]:
from nltk.book import *

In [None]:
text1

### Searching Text
A concordance view shows every occurrence of a given word, together with some context.

In [None]:
text1.concordance("monstrous")

In [None]:
text2.concordance("affection")
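To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not how NLTK implements `concordance`; the helper name `concordance_lines`, its `width` parameter, and the sample sentence are assumptions made for illustration.

```python
def concordance_lines(tokens, word, width=2):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(' '.join(left + [tok] + right))
    return lines

# Hypothetical sample text, tokenized on whitespace
sample = 'the monstrous whale and the most monstrous size of it'.split()
hits = concordance_lines(sample, 'monstrous')
for line in hits:
    print(line)
```

Each printed line is one occurrence of the word with its neighbours, which is the same information the concordance cells above display for the full texts.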

**`similar` finds other words that appear in a similar range of contexts**

In [None]:
text1.similar("monstrous")

**common_contexts allows us to examine just the contexts that are shared by
two or more words, such as monstrous and very.**

In [None]:
text2.common_contexts(["monstrous", "very"])
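One way to picture shared contexts is to treat a context as the pair (previous word, next word) and intersect the sets for two words. The sketch below assumes that simplified definition; `contexts` is a hypothetical helper, and NLTK's own implementation may normalize and rank contexts differently.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs around each occurrence of `word`."""
    return {(tokens[i - 1], tokens[i + 1])
            for i in range(1, len(tokens) - 1)
            if tokens[i] == word}

# Hypothetical sample text
sample = 'a very pretty day and a monstrous pretty day but a very glad man'.split()

# Contexts in which both words occur
shared = contexts(sample, 'very') & contexts(sample, 'monstrous')
print(shared)
```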

A dispersion plot shows where a word occurs in a text, measured as the number of words from the beginning. Each stripe represents an instance of a word, and each row represents the entire text.

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
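The data behind such a plot is just the list of token offsets at which each word occurs. As a rough sketch (the `offsets` helper and the sample sentence are assumptions, not part of NLTK), those positions can be collected like this:

```python
def offsets(tokens, word):
    """Positions (counted in tokens from the start) at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Hypothetical sample text
sample = 'we the citizens value freedom and the citizens defend freedom'.split()

# One row per word, one stripe per offset
positions = {w: offsets(sample, w) for w in ['citizens', 'freedom', 'duties']}
print(positions)
```

A word that never occurs, such as `duties` here, simply gets an empty row.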

**generating some random text in the various styles we have just seen**

In [None]:
text3.generate()

### Counting Vocabulary

In [None]:
len(text3)

In [None]:
sorted(set(text3))

In [None]:
len(set(text3))

**lexical richness of the text**

In [None]:
# Only needed on Python 2, where / truncates integers; harmless on Python 3
from __future__ import division

In [None]:
len(text3) / len(set(text3))

In [None]:
text3.count("smote")

**compute what percentage of the text is taken up by a specific word**

In [None]:
100 * text4.count('a') / len(text4)

**How many times does the word lol appear in text5? How
much is this as a percentage of the total number of words in this text?**

In [None]:
text5.count("lol")

In [None]:
100 * text5.count('lol') / len(text5)

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)

In [None]:
percentage(4, 5)

In [None]:
percentage(text4.count('a'), len(text4))
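The same two measures can be checked by hand on a tiny token list, which makes the arithmetic easy to follow. The sample tokens below are an assumption chosen for illustration; the functions repeat the definitions given earlier so the cell runs on its own.

```python
def lexical_diversity(text):
    # Average number of times each distinct token is used
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

tokens = ['to', 'be', 'or', 'not', 'to', 'be']

diversity = lexical_diversity(tokens)                 # 6 tokens / 4 distinct tokens
share = percentage(tokens.count('to'), len(tokens))   # 100 * 2 / 6

print(diversity, round(share, 2))
```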

## A Closer Look at Python: Texts as Lists of Words ##
### List ###

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
lexical_diversity(sent1)

In [None]:
sent1.append("Some")

In [None]:
sent1

### Indexing Lists ###

In [None]:
text4[173]

In [None]:
text4.index('awaken')

In [None]:
text5[16715:16735]

In [None]:
sent = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']

In [None]:
sent[5:8]

In [None]:
sent[7]

In [None]:
sent[:3]  # items at indexes 0, 1 and 2
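Because indexes start at zero and a slice `m:n` stops just before `n`, it can help to verify the boundaries explicitly. The following sketch reuses the same ten-word list; the variable names `middle`, `head`, and `tail` are introduced here for illustration.

```python
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']

middle = sent[5:8]   # indexes 5, 6, 7, so n - m = 3 items
head = sent[:3]      # omitting the start means "from index 0"
tail = sent[3:]      # omitting the end means "through the last item"

print(middle)
print(head + tail == sent)   # the two halves rebuild the original list
```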

### String ###

In [None]:
name = 'Monty'

In [None]:
name[0]

In [None]:
name[:4]

In [None]:
name * 2

In [None]:
name + '!'

In [None]:
' '.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()

## Computing with Language: Simple Statistics ## 

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done','more', 'is', 'said', 'than', 'done']

In [None]:
tokens = set(saying)
tokens

In [None]:
tokens = sorted(tokens)
tokens

In [None]:
tokens[-2:]

## Frequency Distributions ##

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist1

In [None]:
vocabulary1 = fdist1.keys()

In [None]:
fdist1['whale']

In [None]:
fdist1.plot(50, cumulative=True)

In [None]:
#words that occur once only
fdist1.hapaxes()

In [None]:
#find the words from the vocabulary of the text that are more than 15 characters long
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)

In [None]:
# all words from the chat corpus that are longer than seven characters, that occur more than seven times
fdist5 = FreqDist(text5)
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
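For experimenting without the NLTK corpora, `collections.Counter` from the standard library can stand in for a frequency distribution. The sketch below is an approximation of the cells above on a made-up sentence, with thresholds lowered to suit the tiny sample; it is not the `FreqDist` API itself.

```python
from collections import Counter

# Hypothetical sample text
tokens = 'the whale and the sea and the ship'.split()
counts = Counter(tokens)

# Words that occur exactly once (what hapaxes() reports for a FreqDist)
hapaxes = sorted(w for w, c in counts.items() if c == 1)

# Words that are both long enough and frequent enough
frequent_long = sorted(w for w in counts if len(w) > 2 and counts[w] > 1)

print(counts['the'], hapaxes, frequent_long)
```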

## Collocations and Bigrams ##

**a list of word pairs, also known as bigrams**

In [None]:
# bigrams() returns a generator in NLTK 3, so wrap it in list() to see the pairs
bigram = list(bigrams(['more', 'is', 'said', 'than', 'done']))
bigram

In [None]:
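# Workaround for NLTK releases in which text4.collocations() raises an error.
# collocation_list() returns strings in some releases and (w1, w2) tuples in others,
# so handle both forms.
print('; '.join(c if isinstance(c, str) else ' '.join(c)
                for c in text4.collocation_list()))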
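A bigram list is simply every pair of adjacent words. As a minimal plain-Python sketch (using `zip` rather than NLTK's `bigrams` function), the pairs for the same five words can be built like this:

```python
words = ['more', 'is', 'said', 'than', 'done']

# Pair each word with its successor; zip stops at the shorter sequence
pairs = list(zip(words, words[1:]))
print(pairs)
```

A list of n words always yields n - 1 bigrams, here four.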

## Counting Other Things ##

In [None]:
[len(w) for w in text1]

In [None]:
fdist = FreqDist([len(w) for w in text1])

In [None]:
fdist

In [None]:
fdist.keys()

In [None]:
fdist.items()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
fdist.freq(3)
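The word-length distribution can be reproduced on a small scale with `collections.Counter`. This is a sketch under the assumption of a seven-token sample; `Counter` has no `freq` method, so the relative frequency is computed by hand.

```python
from collections import Counter

# Hypothetical sample tokens
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']
lengths = Counter(len(w) for w in tokens)

# Most frequent word length and how often it occurs
most_common_length, count = lengths.most_common(1)[0]

# Proportion of tokens that are four characters long
relative = lengths[4] / sum(lengths.values())

print(most_common_length, count, round(relative, 3))
```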

|function|explanation|
|----|-----|
|fdist = FreqDist(samples) |Create a frequency distribution containing the given samples|
|fdist[sample] += 1 |Increment the count for this sample (written `fdist.inc(sample)` in NLTK 2)|
|fdist['monstrous'] |Count of the number of times a given sample occurred|
|fdist.freq('monstrous') |Frequency of a given sample|
|fdist.N() |Total number of samples|
|fdist.keys() |The distinct samples (sorted by decreasing frequency only in NLTK 2; use `fdist.most_common()` for that order in NLTK 3)|
|for sample in fdist: |Iterate over the samples (in order of decreasing frequency only in NLTK 2)|
|fdist.max() |Sample with the greatest count|
|fdist.tabulate() |Tabulate the frequency distribution|
|fdist.plot() |Graphical plot of the frequency distribution|
|fdist.plot(cumulative=True) |Cumulative plot of the frequency distribution|
|fdist1 < fdist2 |Test if samples in fdist1 occur less frequently than in fdist2|

### Some word comparison operators ###
|function|explanation|
|----|-----|
|s.startswith(t) |Test if s starts with t|
|s.endswith(t) |Test if s ends with t|
|t in s |Test if t is contained inside s|
|s.islower() |Test if all cased characters in s are lowercase|
|s.isupper() |Test if all cased characters in s are uppercase|
|s.isalpha() |Test if all characters in s are alphabetic|
|s.isalnum() |Test if all characters in s are alphanumeric|
|s.isdigit() |Test if all characters in s are digits|
|s.istitle() |Test if s is titlecased (all words in s have initial capitals)|

In [None]:
sorted([w for w in set(text1) if w.endswith('ableness')])

In [None]:
sorted([term for term in set(text4) if 'gnt' in term])

In [None]:
sorted([item for item in set(text6) if item.istitle()])

In [None]:
sorted([item for item in set(sent7) if item.isdigit()])

In [None]:
[len(w) for w in text1]

In [None]:
[w.upper() for w in text1]

In [None]:
len(set([word.lower() for word in text1]))

In [None]:
#eliminate numbers and punctuation from the vocabulary count by filtering out any non-alphabetic items
len(set([word.lower() for word in text1 if word.isalpha()]))
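The effect of each normalization step is easier to see on a handful of tokens. In this sketch (the token list is made up for illustration), lowercasing merges case variants and the `isalpha()` filter then drops punctuation and numbers.

```python
# Hypothetical sample tokens
tokens = ['The', 'whale', ',', 'the', 'Whale', '!', '1851']

raw = len(set(tokens))                                        # every distinct token
lowered = len(set(w.lower() for w in tokens))                 # case variants merged
alpha = len(set(w.lower() for w in tokens if w.isalpha()))    # words only

print(raw, lowered, alpha)
```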