In [58]:
import nltk
from nltk.corpus import gutenberg
from nltk.corpus import movie_reviews

**Gutenberg Corpus:** A small selection of English literary texts taken from the Project Gutenberg archive, which itself hosts over 25,000 free electronic books.

**Words**

In [59]:
print("WORDS of Gutenberg Corpus : \n ", gutenberg.words())
print("\n No. of Words/Tokens in the Gutenberg Corpus : ", len(gutenberg.words()))

WORDS of Gutenberg Corpus : 
  ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

 No. of Words/Tokens in the Gutenberg Corpus :  2621613


**Unique Tokens**

In [60]:
unique_tokens = len(set(gutenberg.words()))
print("Number of unique tokens in Gutenberg Corpus:", unique_tokens)

Number of unique tokens in Gutenberg Corpus: 51156
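The count above is case-sensitive and includes punctuation and numbers, so `The` and `the` are treated as different types. The following is a minimal sketch of a normalized count, assuming that lowercasing and keeping only alphabetic tokens is the desired notion of "word". The helper name `vocabulary_stats` and the sample list (the first seven tokens shown earlier plus three made-up ones) are illustrative only.

```python
def vocabulary_stats(tokens):
    """Return (raw type count, normalized type count, type-token ratio)."""
    # Raw types: exactly what len(set(tokens)) counts
    raw_types = set(tokens)
    # Keep alphabetic tokens only, then fold case
    alpha_tokens = [t for t in tokens if t.isalpha()]
    normalized_types = {t.lower() for t in alpha_tokens}
    # Type-token ratio over the alphabetic tokens (0.0 for empty input)
    ttr = len(normalized_types) / len(alpha_tokens) if alpha_tokens else 0.0
    return len(raw_types), len(normalized_types), ttr


# Illustrative sample, not real corpus statistics
sample_tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']',
                 'The', 'the', 'emma']
raw_count, norm_count, ttr = vocabulary_stats(sample_tokens)
print(raw_count, norm_count, round(ttr, 2))
```

The same function could be called as `vocabulary_stats(gutenberg.words())` to get the corpus-wide figures.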


**Sentences**

In [61]:
print("Sentences of Gutenberg Corpus : \n ", gutenberg.sents())
print("\n No. of Sentences in the Gutenberg Corpus : ", len(gutenberg.sents()))

Sentences of Gutenberg Corpus : 
  [['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]

 No. of Sentences in the Gutenberg Corpus :  98552


**Movie Reviews Corpus:** A collection of movie reviews with pre-assigned sentiment labels (positive or negative).

**Words**

In [62]:
print("WORDS of Movie_Reviews Corpus : \n ", movie_reviews.words())
print("\n No. of Words/Tokens in the Movie_Reviews Corpus : ", len(movie_reviews.words()))

WORDS of Movie_Reviews Corpus : 
  ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

 No. of Words/Tokens in the Movie_Reviews Corpus :  1583820


**Unique Tokens**

In [63]:
unique_tokens = len(set(movie_reviews.words()))
print("\n Number of Unique tokens in Movie_Reviews Corpus:", unique_tokens)


 Number of Unique tokens in Movie_Reviews Corpus: 39768


**Sentences**

In [64]:
print("Sentences of Movie_Reviews Corpus : \n ", movie_reviews.sents())
print("\n No. of Sentences in the Movie_Reviews Corpus : ", len(movie_reviews.sents()))

Sentences of Movie_Reviews Corpus : 
  [['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]

 No. of Sentences in the Movie_Reviews Corpus :  71532


**<u>POS TAGGING</u>**

POS (part-of-speech) tagging is the task of labeling each word in a sentence with its part of speech, such as noun, verb, adjective, or adverb.

In [65]:
Gutenberg_sentences = gutenberg.sents()
print("The sentence at index 10 in the Gutenberg Corpus : \n", ' '.join(Gutenberg_sentences[10]))

# The tagger model must be downloaded once; uncomment on first run
#nltk.download('averaged_perceptron_tagger')
tagged_sentence = nltk.pos_tag(Gutenberg_sentences[10])
print("\n POS Tagged : \n", tagged_sentence)

The sentence at index 10 in the Gutenberg Corpus : 
 The danger , however , was at present so unperceived , that they did not by any means rank as misfortunes with her .

 POS Tagged : 
 [('The', 'DT'), ('danger', 'NN'), (',', ','), ('however', 'RB'), (',', ','), ('was', 'VBD'), ('at', 'IN'), ('present', 'JJ'), ('so', 'RB'), ('unperceived', 'JJ'), (',', ','), ('that', 'IN'), ('they', 'PRP'), ('did', 'VBD'), ('not', 'RB'), ('by', 'IN'), ('any', 'DT'), ('means', 'NNS'), ('rank', 'NN'), ('as', 'IN'), ('misfortunes', 'NNS'), ('with', 'IN'), ('her', 'PRP'), ('.', '.')]
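Because the tagger returns a plain list of `(word, tag)` tuples, the result can be summarized with the standard library alone. The sketch below tallies tags and groups words by tag. It hard-codes the first ten pairs from the output above so that it runs without the corpus or tagger model. The variable names are illustrative.

```python
from collections import Counter, defaultdict

# First ten (word, tag) pairs copied from the tagged output above
tagged = [('The', 'DT'), ('danger', 'NN'), (',', ','), ('however', 'RB'),
          (',', ','), ('was', 'VBD'), ('at', 'IN'), ('present', 'JJ'),
          ('so', 'RB'), ('unperceived', 'JJ')]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)

# Which words received each tag, in order of appearance
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tag_counts.most_common(3))
print(words_by_tag['JJ'])
```

Passing the full `tagged_sentence` instead of the hard-coded list gives the distribution for the whole sentence.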


**<u>PARSING</u>**

Parsing analyzes a sentence to determine its syntactic structure, typically as a tree derived from the rules of a grammar. It is a core task in natural language processing (NLP).

In [66]:
# Define a grammar rule for a simple sentence
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> DT NN
    VP -> VBZ NP | VBD NP
    DT -> 'the'
    NN -> 'cat' | 'dog'
    VBZ -> 'chases'
    VBD -> 'chased'
""")

# Create the parser
parser = nltk.RecursiveDescentParser(grammar)

# Parse a sentence
sentence = "the cat chases the dog"
for tree in parser.parse(nltk.word_tokenize(sentence)):
    print(tree)

(S (NP (DT the) (NN cat)) (VP (VBZ chases) (NP (DT the) (NN dog))))
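A toy grammar like this only accepts sentences built from its own vocabulary and rules. The sketch below shows one way to handle inputs it cannot parse. It makes two substitutions relative to the cell above: `nltk.ChartParser` in place of `RecursiveDescentParser`, and `str.split()` in place of `nltk.word_tokenize` so that no tokenizer data is needed. The helper `parse_or_empty` is a hypothetical name.

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> DT NN
    VP -> VBZ NP | VBD NP
    DT -> 'the'
    NN -> 'cat' | 'dog'
    VBZ -> 'chases'
    VBD -> 'chased'
""")
parser = nltk.ChartParser(grammar)


def parse_or_empty(sentence):
    """Return all parse trees for the sentence, or [] if none exist."""
    tokens = sentence.split()
    try:
        return list(parser.parse(tokens))
    except ValueError:
        # Raised when a word is not covered by the grammar's lexicon
        return []


trees = parse_or_empty("the cat chased the dog")
unknown_word = parse_or_empty("the bird chased the dog")
bad_order = parse_or_empty("cat the chased dog")
print(len(trees), len(unknown_word), len(bad_order))
```

An unknown word raises `ValueError` during the grammar's coverage check, whereas a sentence with known words in an ungrammatical order simply yields no trees.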


**<u>SPOKEN LANGUAGE</u>**

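**Tokenizing spoken language:** NLTK's `word_tokenize` function is a general-purpose word tokenizer. Applied to a transcript of speech, it splits the text into words and punctuation marks, and it separates contractions such as "don't" into two tokens.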

In [67]:
spoken_text = "um, I don't know, like, this is, like, really cool, you know?"

# Tokenize the spoken language
tokens = nltk.word_tokenize(spoken_text)

print(tokens)

['um', ',', 'I', 'do', "n't", 'know', ',', 'like', ',', 'this', 'is', ',', 'like', ',', 'really', 'cool', ',', 'you', 'know', '?']
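If contractions should stay whole, a pattern-based tokenizer is one alternative. The sketch below uses `nltk.tokenize.RegexpTokenizer`, which needs no downloaded data. The pattern is an assumption chosen for this example, not a general solution: it keeps a word together with a single apostrophe suffix and emits every other non-space character as its own token.

```python
from nltk.tokenize import RegexpTokenizer

spoken_text = "um, I don't know, like, this is, like, really cool, you know?"

# Words (optionally with one apostrophe suffix) or single punctuation marks.
# The group is non-capturing so that whole matches are returned.
tokenizer = RegexpTokenizer(r"\w+(?:'\w+)?|[^\w\s]")
tokens = tokenizer.tokenize(spoken_text)

print(tokens)
```

Here "don't" comes out as a single token, which can be convenient for word counts but differs from the Penn Treebank convention that `word_tokenize` follows.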


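**Removing filler words:** Spoken language often includes filler words, such as "um" or "like". These can be stripped with a regular expression from Python's built-in `re` module. Note that the simple pattern below removes only the words themselves and leaves the surrounding commas in place.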

In [68]:
import re

# Define a regular expression pattern to match filler words
filler_pattern = re.compile(r'\b(um|uh|like)\b')

# Remove filler words from the spoken language
spoken_text_without_fillers = re.sub(filler_pattern, '', spoken_text)

print(spoken_text_without_fillers)

, I don't know, , this is, , really cool, you know?
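The stray commas in that output can be avoided by letting the pattern also consume an optional comma and the whitespace that follow each filler. This is a minimal sketch under that assumption. The name `remove_fillers` is illustrative, and the pattern still cannot tell the filler "like" from the verb "like", so it would over-delete in a sentence such as "I like it".

```python
import re

# Filler word, then an optional comma and any trailing whitespace
FILLER_PATTERN = re.compile(r"\b(?:um|uh|like)\b,?\s*", flags=re.IGNORECASE)


def remove_fillers(text):
    """Strip filler words along with the comma and spaces that follow them."""
    return FILLER_PATTERN.sub("", text).strip()


spoken_text = "um, I don't know, like, this is, like, really cool, you know?"
cleaned = remove_fillers(spoken_text)
print(cleaned)
```

A more robust approach would use the tagged or tokenized form of the text to decide whether "like" is acting as a filler.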


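**Analyzing sentiment in spoken language:** NLTK's `nltk.sentiment` module includes `SentimentIntensityAnalyzer`, a VADER-based analyzer that returns negative, neutral, and positive proportions for a text, plus an overall compound score between -1 and 1. It requires the `vader_lexicon` resource, which can be fetched once with `nltk.download('vader_lexicon')`.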

In [69]:
from nltk.sentiment import SentimentIntensityAnalyzer

spoken_text = "I'm feeling pretty good about the presentation, although I did stumble over a few words."

# Create a sentiment analyzer object
sentiment_analyzer = SentimentIntensityAnalyzer()

# Analyze the sentiment of the spoken language
sentiment_scores = sentiment_analyzer.polarity_scores(spoken_text)

print(sentiment_scores)

{'neg': 0.0, 'neu': 0.568, 'pos': 0.432, 'compound': 0.765}
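The scores dictionary is often reduced to a single label using the compound value. The sketch below applies the thresholds commonly cited for VADER (at or above 0.05 is positive, at or below -0.05 is negative, anything in between is neutral). The function name `label_from_scores` is hypothetical, and the cut-offs are a convention that may need tuning for a given dataset.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style scores dict to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Scores copied from the output above
sentiment_scores = {'neg': 0.0, 'neu': 0.568, 'pos': 0.432, 'compound': 0.765}
label = label_from_scores(sentiment_scores)
print(label)
```

In practice the dictionary would come straight from `sentiment_analyzer.polarity_scores(...)` rather than being typed in.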


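**<u>SEMANTIC TAGGING</u>**

Semantic tagging assigns meaning-related labels, such as word senses or named-entity types, to the words in a text. It is not the same as part-of-speech tagging, which labels grammatical categories. The cell below uses `nltk.pos_tag` and therefore produces part-of-speech tags, which are a common first step before semantic tagging.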

In [70]:
# Use NLTK's default part-of-speech tagging function
tagger = nltk.pos_tag

# Tokenize the text
text = "I went to the store and bought some milk."
tokens = nltk.word_tokenize(text)

# Tag each token with its part of speech
tagged_tokens = tagger(tokens)

# Print the tagged tokens
print(tagged_tokens)

[('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('store', 'NN'), ('and', 'CC'), ('bought', 'VBD'), ('some', 'DT'), ('milk', 'NN'), ('.', '.')]
