# **Assignment 1 - Group 7**

## Corpora
This assignment uses texts from three different genres:

- Text from a novel
- Text from a stage play
- Text from a TV show

Sources:

[Sherlock Holmes](https://sherlock-holm.es/stories/plain-text/cano.txt)

[Macbeth](https://dracor.org/shake/macbeth#downloads)

[Breaking Bad](https://transcripts.foreverdreaming.org/viewforum.php?f=165)


In [1]:
import nltk
import numpy
import matplotlib
import os
import re
from nltk.stem import SnowballStemmer

In [2]:
# collect the names of the corpus files in Corpora into a list
corpusList = os.listdir(os.getcwd() + "/Corpora/")

In [3]:
def readCleanSplit(corpus):
    with open("./Corpora/" + corpus, 'r', encoding='utf-8') as f:
        text = f.read()
        # raw string so that \w and \s reach the regex engine unchanged
        cleanedText = re.sub(r" *[^\w\s]+", " ", text)
        splitText = cleanedText.split()
    return splitText

In [4]:
# remove all punctuation and count the remaining tokens
def getLength(corpus):
    length = len(readCleanSplit(corpus))
    return length

# print result
for corpus in corpusList:
    print("The length of {} is {}".format(corpus, getLength(corpus)))


The length of BreakingBadSeason1.txt is 24875
The length of macbethspoken.txt is 16708
The length of SherlockHolmes.txt is 668228
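
The token counts above depend on how the substitution in `readCleanSplit` treats punctuation inside words. The sketch below applies the same pattern to a made-up sample string (`clean_split` and `sample` are illustrative names, not part of the notebook) to show that apostrophes and hyphens split a word into separate tokens, which is also why fragments such as `isn` show up in the word lists further down.

```python
import re

def clean_split(text):
    """Apply the readCleanSplit substitution to a string instead of a file."""
    # Optional spaces followed by one or more punctuation characters become one space
    cleaned = re.sub(r" *[^\w\s]+", " ", text)
    return cleaned.split()

# Hypothetical sample sentence, chosen to contain an apostrophe and a hyphen
sample = "It isn't easy, is it? Well-known facts..."
tokens = clean_split(sample)
print(tokens)
# ['It', 'isn', 't', 'easy', 'is', 'it', 'Well', 'known', 'facts']
```

Under this tokenization a contraction counts as two tokens, so the reported lengths are slightly higher than a count that keeps contractions whole.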


In [5]:
# define lexical diversity function
def lexDiv(corpus):
    wordList = readCleanSplit(corpus)
    tokens = len(wordList)
    types = len(set(w.lower() for w in wordList))
    return types / tokens

# print result
for corpus in corpusList:
    print("The lexical diversity of {} is {:.8f}".format(corpus,lexDiv(corpus)))


The lexical diversity of BreakingBadSeason1.txt is 0.11798995
The lexical diversity of macbethspoken.txt is 0.18996888
The lexical diversity of SherlockHolmes.txt is 0.02871924
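
The type/token ratio tends to fall as a text gets longer, because common words keep repeating while new types appear more slowly. The three corpora differ greatly in length, so the values above are not directly comparable. A minimal sketch of one possible workaround, assuming it is acceptable to compare only the first `n` tokens of each corpus (`ttr_first_n` and the toy data are hypothetical, not part of the notebook):

```python
def type_token_ratio(words):
    """Types divided by tokens, case-insensitive, as in lexDiv."""
    if not words:
        return 0.0
    return len(set(w.lower() for w in words)) / len(words)

def ttr_first_n(words, n):
    """Type/token ratio of the first n tokens only (n is an assumed sample size)."""
    return type_token_ratio(words[:n])

# Toy data: a 5-word vocabulary repeated, so extra length adds no new types
short_text = "the cat sat on mat".split()
long_text = short_text * 100

print(type_token_ratio(short_text))  # 1.0
print(type_token_ratio(long_text))   # 0.01
print(ttr_first_n(long_text, 5))     # 1.0
```

With the notebook's own helpers this could be called as `ttr_first_n(readCleanSplit(corpus), 16708)`, using the token count of the shortest corpus reported earlier as the sample size.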


In [6]:
# top 10 words with given initial
def initial_top10(corpus, initial):
    wordList = readCleanSplit(corpus)
    # only put words with initial X in list
    processedList = []
    for w in wordList:
        if w[0] == initial.upper() or w[0] == initial.lower():
            processedList.append(w.lower())
            
    fDist = nltk.FreqDist(nltk.Text(processedList))
    return fDist.most_common(10)
    
for corpus in corpusList:
    print("\n{}:".format(corpus))
    for vowel in "aeiou":
        print("The 10 most common words starting with '{}': \n{}".format(vowel, initial_top10(corpus,vowel)))
  


BreakingBadSeason1.txt:
The 10 most common words starting with 'a': 
[('a', 432), ('and', 290), ('all', 151), ('about', 124), ('are', 118), ('at', 100), ('as', 50), ('an', 41), ('any', 30), ('am', 29)]
The 10 most common words starting with 'e': 
[('even', 32), ('enough', 19), ('ever', 18), ('every', 18), ('emilio', 15), ('elliott', 15), ('else', 11), ('excuse', 11), ('easy', 10), ('everybody', 9)]
The 10 most common words starting with 'i': 
[('i', 935), ('it', 570), ('is', 264), ('in', 220), ('if', 62), ('into', 15), ('isn', 13), ('idea', 12), ('iron', 5), ('interest', 5)]
The 10 most common words starting with 'o': 
[('of', 236), ('on', 151), ('out', 110), ('okay', 104), ('one', 81), ('oh', 75), ('or', 51), ('our', 48), ('off', 37), ('over', 26)]
The 10 most common words starting with 'u': 
[('up', 119), ('us', 50), ('used', 18), ('uh', 16), ('understand', 15), ('until', 11), ('use', 7), ('um', 6), ('uncle', 4), ('understanding', 4)]

macbethspoken.txt:
The 10 most common words sta
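
The loop above calls `initial_top10` once per vowel, so each corpus file is read and cleaned five times. A sketch of a single-pass alternative, assuming the word list has already been produced once (for example by `readCleanSplit`); `top_by_initial` and the sample sentence are hypothetical:

```python
from collections import Counter, defaultdict

def top_by_initial(words, initials="aeiou", k=10):
    """Return the k most common lowercased words for each initial, in one pass."""
    wanted = set(initials.lower())
    counts = defaultdict(Counter)
    for w in words:
        lw = w.lower()
        if lw[0] in wanted:
            counts[lw[0]][lw] += 1
    return {ch: counts[ch].most_common(k) for ch in initials.lower()}

# Toy input instead of a corpus file
words = "An apple and an egg in an igloo on ice under an umbrella".split()
result = top_by_initial(words, k=2)
print(result["a"][0])  # ('an', 4)
print(result["e"])     # [('egg', 1)]
```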

In [7]:
# set the path to the corpus directory
corpus_path = "Corpora"

# get a list of all files in the corpus directory
corpus_files = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path)]

# labels for the files; this assumes os.listdir returns the files in this order
file_labels = ["Breaking Bad", "Macbeth", "Sherlock Holmes"]

# loop over the corpus files
for i, corpus_file in enumerate(corpus_files):
    # open the file and read in the entire corpus
    with open(corpus_file, 'r', encoding='utf-8') as f:
        corpus = f.read()

    # use a regular expression to split the corpus into sentences
    sentences = re.split('[.?!]', corpus)

    # find the longest sentence (by character count) and strip surrounding whitespace
    longest_sentence = max(sentences, key=len).strip()

    # count the number of words in the longest sentence
    num_words = len(longest_sentence.split())

    # print the longest sentence and its word count, starting with an informative label
    print(f"Longest sentence in {file_labels[i]}: {longest_sentence} ({num_words} words)")

Longest sentence in Breaking Bad: The right-handed isomer of the drug, Thalidomide is a perfectly fine good medicine to give to a pregnant woman to prevent morning sickness but, make the mistake of giving that same pregnant woman the left-handed isomer of the drug Thalidomide, and her child will be born with horrible birth defects (50 words)
Longest sentence in Macbeth: What’s more to do ,
Which would be planted newly with the time ,
As calling home our exiled friends abroad
That fled the snares of watchful tyranny ,
Producing forth the cruel ministers
Of this dead butcher and his fiend-like queen
( Who , as ’tis thought , by self and violent hands ,
Took off her life ) — this , and what needful else
That calls upon us , by the grace of grace ,
We will perform in measure , time , and place (88 words)
Longest sentence in Sherlock Holmes: "

     "When you combine the ideas of whistles at night, the presence of a
     band of gipsies who are on intimate terms with this old doctor, the
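
The cell above picks the longest sentence with `key=len`, which measures characters, and only afterwards reports a word count. The two measures can select different sentences. A small sketch on a made-up string (all names here are illustrative) contrasts them:

```python
import re

def split_sentences(text):
    """Split on ., ? and ! as in the notebook, then strip whitespace."""
    return [s.strip() for s in re.split(r"[.?!]", text)]

def longest_by_chars(text):
    return max(split_sentences(text), key=len)

def longest_by_words(text):
    return max(split_sentences(text), key=lambda s: len(s.split()))

# Hypothetical sample: few long words versus many short words
sample = "I am so so so so tired. Extraordinarily uncharacteristic behaviour! Ok?"
print(longest_by_chars(sample))  # Extraordinarily uncharacteristic behaviour
print(longest_by_words(sample))  # I am so so so so tired
```

Splitting on every full stop also cuts at abbreviations such as "Mr.", so the sentences found this way are approximations rather than exact sentence boundaries.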
  

In [9]:
# create a Snowball stemmer (NLTK) for English
stemmer = SnowballStemmer("english")

# loop through the corpus files
for i, corpus_file in enumerate(corpus_files):
    # longest_sentence is not recomputed here: it still holds the value left by
    # the last file processed in the previous cell, so every label gets the same text
    stemmed_longest_sentence = [stemmer.stem(word) for word in longest_sentence.split()]

    # print the stemmed version of the longest sentence for this file
    print(f"Stemmed version of the longest sentence in {file_labels[i]}: {' '.join(stemmed_longest_sentence)}")

Stemmed version of the longest sentence in Breaking Bad: " "when you combin the idea of whistl at night, the presenc of a band of gipsi who are on intim term with this old doctor, the fact that we have everi reason to believ that the doctor has an interest in prevent his stepdaught marriage, the die allus to a band, and, finally, the fact that miss helen stoner heard a metal clang, which might have been caus by one of those metal bar that secur the shutter fall back into it place, i think that there is good ground to think that the mysteri may be clear along those line
Stemmed version of the longest sentence in Macbeth: " "when you combin the idea of whistl at night, the presenc of a band of gipsi who are on intim term with this old doctor, the fact that we have everi reason to believ that the doctor has an interest in prevent his stepdaught marriage, the die allus to a band, and, finally, the fact that miss helen stoner heard a metal clang, which might have been caus by one of those m
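
The output above shows the Sherlock Holmes sentence under every label, because the stemming loop reuses the single `longest_sentence` value left over from the earlier cell. A minimal sketch of one way to stem each text's own longest sentence follows. The helper names and toy texts are hypothetical, and the lowercase-only fallback is an assumption added so that the sketch also runs where NLTK is not installed.

```python
import re

try:
    from nltk.stem import SnowballStemmer
    stem = SnowballStemmer("english").stem
except ImportError:
    # Fallback (assumption): without NLTK, only lowercase the words
    stem = str.lower

def longest_sentence(text):
    """Longest sentence by character count, split on ., ? and !."""
    return max(re.split(r"[.?!]", text), key=len).strip()

def stem_longest_sentences(texts_by_label):
    """Map each label to the stemmed form of that text's own longest sentence."""
    stemmed = {}
    for label, text in texts_by_label.items():
        sentence = longest_sentence(text)  # recomputed for every text
        stemmed[label] = " ".join(stem(word) for word in sentence.split())
    return stemmed

# Toy texts standing in for the corpus files
toy_texts = {
    "Text A": "Dogs bark. The dogs were running quickly.",
    "Text B": "Cats sleep. Many cats are sleeping soundly today.",
}
for label, stemmed in stem_longest_sentences(toy_texts).items():
    print(f"{label}: {stemmed}")
```

In the notebook, the same effect could be had by storing each file's longest sentence in a dictionary keyed by label inside the earlier loop and iterating over that dictionary here.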

In [10]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "Corpora"
mycorpus = PlaintextCorpusReader(corpus_root, '.*', encoding = "utf8")
macbeth = nltk.Text(mycorpus.words('macbethspoken.txt'))
nltk.FreqDist(macbeth)

FreqDist({',': 1393, '.': 1152, '’': 514, 'the': 502, 'and': 352, 'I': 348, 'of': 303, 'to': 297, '?': 239, 'a': 190, ...})
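
The most frequent entries in this distribution are punctuation marks, so the "top three words" have to be picked out by eye. A sketch of how they could be selected programmatically, assuming that a word is any purely alphabetic token and that case should be ignored (`top_words` and the toy token list are hypothetical; `collections.Counter` is used here because `nltk.FreqDist` is a subclass of it):

```python
from collections import Counter

def top_words(tokens, k=3):
    """Most common alphabetic tokens, lowercased; punctuation is skipped."""
    words = [t.lower() for t in tokens if t.isalpha()]
    return Counter(words).most_common(k)

# Toy token list in the style of the corpus reader's output
tokens = ["The", "king", ",", "the", "queen", ".", "The", "king", "’", "s", "crown", "?"]
print(top_words(tokens, k=2))  # [('the', 3), ('king', 2)]
```

With the notebook's objects this could be called as `top_words(mycorpus.words('macbethspoken.txt'))`. Fragments of contractions such as `s` are alphabetic and would still be counted.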

In [11]:
print("These are the concordances for the top three most frequent words in 'Macbeth':")
macbeth.concordance("the")
macbeth.concordance("and")
macbeth.concordance("I")

These are the concordances for the top three most frequent words in 'Macbeth':
Displaying 25 of 633 matches:
under , lightning , or in rain ? When the hurly - burly ’ s done , When the bat
hen the hurly - burly ’ s done , When the battle ’ s lost and won . That will b
e ’ s lost and won . That will be ere the set of sun . Where the place ? Upon t
at will be ere the set of sun . Where the place ? Upon the heath . There to mee
e set of sun . Where the place ? Upon the heath . There to meet with Macbeth . 
ul , and foul is fair ; Hover through the fog and filthy air . What bloody man 
eport , As seemeth by his plight , of the revolt The newest state . This is the
seemeth by his plight , of the revolt The newest state . This is the sergeant W
the revolt The newest state . This is the sergeant Who , like a good and hardy 
vity . — Hail , brave friend ! Say to the King the knowledge of the broil As th
Hail , brave friend ! Say to the King the knowledge of the broil As thou didst 
nd ! Say to
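
The concordance header reports 633 matches for "the", while the frequency distribution lists 502 for `'the'`. This is consistent with the concordance lookup ignoring case while `FreqDist` counts exact token strings, so `'The'` and `'the'` are separate entries there. A toy illustration of the two ways of counting (the token list is made up):

```python
from collections import Counter

tokens = ["The", "battle", "and", "the", "storm", "THE", "end"]

exact = Counter(tokens)                       # case-sensitive, like FreqDist on raw tokens
folded = Counter(t.lower() for t in tokens)   # case-insensitive

print(exact["the"])   # 1
print(folded["the"])  # 3
```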

In [12]:
sherlock = nltk.Text(mycorpus.words('SherlockHolmes.txt'))
nltk.FreqDist(sherlock)

FreqDist({',': 40636, '.': 33762, 'the': 33275, 'I': 17320, 'and': 16748, 'of': 16498, 'to': 15840, '"': 15114, 'a': 15076, 'that': 10724, ...})

In [13]:
print("These are the concordances for the top three most frequent words in 'Sherlock Holmes':")
sherlock.concordance("the")
sherlock.concordance("I")
sherlock.concordance("and")

These are the concordances for the top three most frequent words in 'Sherlock Holmes':
Displaying 25 of 36091 matches:
PART I ( Being a reprint from the reminiscences of John H . Watson , M 
of John H . Watson , M . D ., late of the Army Medical Department .) CHAPTER I 
 .) CHAPTER I Mr . Sherlock Holmes In the year 1878 I took my degree of Doctor 
ok my degree of Doctor of Medicine of the University of London , and proceeded 
and proceeded to Netley to go through the course prescribed for surgeons in the
the course prescribed for surgeons in the army . Having completed my studies th
tudies there , I was duly attached to the Fifth Northumberland Fusiliers as Ass
land Fusiliers as Assistant Surgeon . The regiment was stationed in India at th
he regiment was stationed in India at the time , and before I could join it , t
e time , and before I could join it , the second Afghan war had broken out . On
ed that my corps had advanced through the passes , and was already deep in the 
 the pass

In [14]:
breaking = nltk.Text(mycorpus.words('BreakingBadSeason1.txt'))
nltk.FreqDist(breaking)

FreqDist({'.': 2950, ',': 1920, "'": 1377, '?': 979, 'I': 934, 'you': 835, 'the': 544, 's': 524, 'to': 510, 'it': 434, ...})
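
The entries `"'"` and `'s'` in this distribution come from contractions being split into separate tokens. If contractions should count as single words, one option is a regular-expression tokenizer. The sketch below assumes that a word is a run of ASCII letters, optionally joined by a straight or curly apostrophe; `WORD_RE` and `tokenize_words` are hypothetical names.

```python
import re

# Letters, optionally joined by ' or ’ (an assumption about what counts as a word)
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def tokenize_words(text):
    """Return word tokens, keeping contractions such as it's in one piece."""
    return WORD_RE.findall(text)

sample = "Yeah. I think it's getting better. What’s more to do?"
words = tokenize_words(sample)
print(words)
# ['Yeah', 'I', 'think', "it's", 'getting', 'better', 'What’s', 'more', 'to', 'do']
```

Counting these tokens with `nltk.FreqDist(words)` would then rank whole contractions instead of their fragments.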

In [15]:
print("These are the concordances for the top three most frequent words in 'Breaking Bad':")
breaking.concordance("I")
breaking.concordance("you")
breaking.concordance("the")

These are the concordances for the top three most frequent words in 'Breaking Bad':
Displaying 25 of 935 matches:
My name is Walter Hartwell White . I live at 308 Negra Arroyo Lane Albuquer
 , this is not an admission of guilt . I am speaking to my family now . Skyler 
. Skyler you are the love of my life . I hope you know that . Walter Junior you
 learn about me in the next few days . I just want you to know that no matter h
 know that no matter how it may look , I only had you in my heart . Goodbye . (
u think you ' ll be home ? Same time . I don ' t want him dicking you around to
again . There was no hot water again . I have an easy fix for that . You wake u
to be the first person in the shower . I have an idea . How about buy a new hot
. Did you take your Echinacea ? Yeah . I think it ' s getting better . What the
n . We ' re watching our cholesterol , I guess . Not me . I want real bacon . N
g our cholesterol , I guess . Not me . I want real bacon . Not this fake crap .
ur veggie 