## Snowball stemming
### Snowball is a framework and family of stemming algorithms developed by Martin Porter. Its English stemmer, also known as the Porter2 stemming algorithm, is another popular stemmer used in natural language processing (NLP).
### The English Snowball stemmer is a revision of the original Porter stemming algorithm and is generally regarded as more accurate and consistent.
### Snowball stemmers are language-specific: each supported language has its own rule set, so the language must be chosen when the stemmer is created.
### It follows a similar approach to the Porter stemming algorithm, applying a sequence of rules to strip suffixes from words. However, the Snowball version revises several of those rules and adds handling for exceptional words.

In [1]:
import nltk
from nltk.stem.snowball import SnowballStemmer
 
#the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')
 
#list of tokenized words
words = ['cared','university','fairly','easily','singing',
       'sings','sung','singer','sportingly']
 
#stems of each word
stem_words = [snow_stemmer.stem(w) for w in words]
     
#print stemming results
for e1,e2 in zip(words,stem_words):
    print(e1+' ----> '+e2)

cared ----> care
university ----> univers
fairly ----> fair
easily ----> easili
singing ----> sing
sings ----> sing
sung ----> sung
singer ----> singer
sportingly ----> sport
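
### Because each Snowball stemmer is tied to one language, it can help to see which languages are available. The sketch below lists them through the `SnowballStemmer.languages` attribute and stems one word per language. The sample words are illustrative choices that do not come from this notebook, and no expected stems are shown for the non-English words.

```python
from nltk.stem.snowball import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# One stemmer per language; the sample words are illustrative assumptions
samples = {"english": "running", "german": "laufen", "french": "courir"}

multilingual_stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    multilingual_stems[language] = stemmer.stem(word)
    print(language, word, '---->', multilingual_stems[language])
```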


## Porter stemming
### Porter stemming is a widely used stemming algorithm in natural language processing (NLP). Stemming is the process of reducing words to a base or root form (the stem), which helps group different variations of the same word. The stem need not be a dictionary word, and irregular forms are not merged: in the example below, "ran" is not reduced to "run".

In [2]:
from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
stemmer = PorterStemmer()

# Example words
words = ["running", "runs", "ran"]

# Stem each word
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['run', 'run', 'ran']


## Lancaster stemming
### Lancaster stemming is another popular stemming algorithm used in natural language processing (NLP). Like Porter and Snowball stemming, Lancaster stemming aims to reduce words to their base or root form. However, it employs a more aggressive stemming approach, often resulting in shorter stems compared to other algorithms.

In [4]:
from nltk.stem import LancasterStemmer

# Create an instance of the LancasterStemmer
stemmer = LancasterStemmer()

# Example words
words = ["running", "runs", "ran"]

# Stem each word
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)

['run', 'run', 'ran']
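
### The three sample words above produce the same stems under Porter and Lancaster, so they do not show the difference in aggressiveness. The following sketch runs all three stemmers side by side on a few extra words. The extra words are illustrative choices rather than part of the original notebook, and the differences depend on the words picked.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.snowball import SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# "running" appears in the cells above; the other words are illustrative
words = ["running", "maximum", "presumably", "provision"]

# Map each word to the stem produced by every stemmer
comparison = {}
for word in words:
    comparison[word] = {name: s.stem(word) for name, s in stemmers.items()}
    print(word, comparison[word])
```

### Comparing the printed stems word by word shows where Lancaster cuts further than the other two.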


## n-grams using nltk
### In natural language processing (NLP), an n-gram is a contiguous sequence of n items from a given text, where the items can be words, characters, or even larger units such as phrases. The "n" in n-gram represents the number of items in the sequence.
### Unigram: A unigram is an n-gram of size 1, that is, a single word. For example, in the sentence "I love to code," the unigrams are "I," "love," "to," and "code."
### Bigram: A bigram is an n-gram of size 2, that is, a sequence of two consecutive words. In the same sentence, the bigrams are "I love," "love to," and "to code."
### Trigram: A trigram is an n-gram of size 3, that is, a sequence of three consecutive words. In the same sentence, the trigrams are "I love to" and "love to code."

In [5]:
from nltk import ngrams
sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))
n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)

Enter the sentence: Hello, how are you, what are you doing, they both get placed finallu
Enter the value of n: 3
('Hello,', 'how', 'are')
('how', 'are', 'you,')
('are', 'you,', 'what')
('you,', 'what', 'are')
('what', 'are', 'you')
('are', 'you', 'doing,')
('you', 'doing,', 'they')
('doing,', 'they', 'both')
('they', 'both', 'get')
('both', 'get', 'placed')
('get', 'placed', 'finallu')


## n-grams without using nltk

In [6]:
# Input sentence
sentence = "This is an example sentence for n-gram generation."

# Tokenize the sentence into words
tokens = sentence.split()

# Set the value of n for n-gram generation
n = 3

# Generate n-grams
ngrams_list = []
for i in range(len(tokens) - n + 1):
    ngram = tokens[i:i+n]
    ngrams_list.append(tuple(ngram))

# Print the generated n-grams
for ngram in ngrams_list:
    print(ngram)

('This', 'is', 'an')
('is', 'an', 'example')
('an', 'example', 'sentence')
('example', 'sentence', 'for')
('sentence', 'for', 'n-gram')
('for', 'n-gram', 'generation.')
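
### The same sliding-window loop can be wrapped in a small reusable function. The sketch below uses a hypothetical helper named `generate_ngrams` (not part of nltk or of the original notebook) to reproduce the "I love to code" examples, and also applies it to characters, since the items of an n-gram do not have to be words.

```python
def generate_ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "I love to code".split()

unigrams = generate_ngrams(words, 1)
bigrams = generate_ngrams(words, 2)
trigrams = generate_ngrams(words, 3)
print(bigrams)   # [('I', 'love'), ('love', 'to'), ('to', 'code')]
print(trigrams)  # [('I', 'love', 'to'), ('love', 'to', 'code')]

# Character n-grams: a string is itself a sequence of items
char_trigrams = generate_ngrams("code", 3)
print(char_trigrams)  # [('c', 'o', 'd'), ('o', 'd', 'e')]
```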


## n-gram smoothing
### Add-one smoothing, also known as Laplace smoothing, is one of the simplest n-gram smoothing techniques. Smoothing is used in language modeling to address the problem of zero probabilities for unseen n-grams: it reduces the sparsity issue by assigning non-zero probabilities to unseen or rare n-grams. The basic idea behind add-one smoothing is to adjust the probability estimates by adding a constant value of 1 to the count of each n-gram. This redistributes some probability mass from seen n-grams to unseen n-grams.

In [None]:
# Theory
# Here's a step-by-step approach for implementing add-one smoothing for n-grams:
# 1.) Count the occurrences of each n-gram in your training corpus. Let's assume you have a dictionary or data structure that stores the counts.
# 2.) Calculate the total count of all n-grams in the training corpus. This is the sum of the counts of all n-grams.
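# 3.) Determine the vocabulary size. In this notebook it is the number of unique n-grams observed in the training corpus, i.e. the number of keys in your n-gram count dictionary. Note that for unseen n-grams to receive probability mass while all probabilities still sum to 1, the vocabulary size must count every n-gram that could occur, not only the observed ones.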
# 4.) Calculate the smoothed probability for each n-gram using the formula:
# P_smoothed = (count(n-gram) + 1) / (total_count + vocabulary_size)
# In this formula, count(n-gram) is the count of the specific n-gram, total_count is the sum of all n-gram counts, and vocabulary_size is the number of unique n-grams.

In [11]:
from collections import defaultdict

# Input corpus
corpus = "This is an example sentence for n-gram smoothing."

# Tokenize the corpus into words
tokens = corpus.split()

# Set the value of n for n-gram generation
n = 2

# Initialize n-gram count dictionary
ngram_counts = defaultdict(int)

# Generate n-grams and count their occurrences
for i in range(len(tokens) - n + 1):
    ngram = tuple(tokens[i:i+n])
    ngram_counts[ngram] += 1

# Calculate total count and vocabulary size
total_count = sum(ngram_counts.values())
vocabulary_size = len(ngram_counts)

# Apply add-one smoothing and calculate smoothed probabilities
smoothed_probabilities = {}
for ngram, count in ngram_counts.items():
    smoothed_probabilities[ngram] = (count + 1) / (total_count + vocabulary_size)

# Print the smoothed probabilities
for ngram, probability in smoothed_probabilities.items():
    print(ngram, probability)


('This', 'is') 0.14285714285714285
('is', 'an') 0.14285714285714285
('an', 'example') 0.14285714285714285
('example', 'sentence') 0.14285714285714285
('sentence', 'for') 0.14285714285714285
('for', 'n-gram') 0.14285714285714285
('n-gram', 'smoothing.') 0.14285714285714285
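
### The cell above smooths joint n-gram probabilities and only prints n-grams that were seen. As a complement, here is a minimal sketch of the common conditional form of add-one smoothing for bigrams, P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the number of unique words. This is a different formulation from the cell above, and the toy corpus and the helper name `laplace_bigram_prob` are assumptions made for illustration.

```python
from collections import Counter

# Toy corpus chosen for illustration
tokens = "the cat sat on the mat the cat ate".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word starts a bigram
vocabulary = sorted(set(tokens))
V = len(vocabulary)                    # number of unique words

def laplace_bigram_prob(prev_word, word):
    """Add-one estimate of P(word | prev_word)."""
    return (bigram_counts[(prev_word, word)] + 1) / (context_counts[prev_word] + V)

seen = laplace_bigram_prob("the", "cat")    # (2 + 1) / (3 + 6)
unseen = laplace_bigram_prob("the", "sat")  # (0 + 1) / (3 + 6)
total = sum(laplace_bigram_prob("the", w) for w in vocabulary)
print(seen, unseen, total)
```

### The unseen bigram ("the", "sat") gets a small non-zero probability, and the probabilities of all words following "the" still sum to 1.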


## POS Tagger
### POS tagging, short for Part-of-Speech tagging, is the process of assigning a grammatical category (part-of-speech tag) to each word in a given text or sentence. The part-of-speech tags represent the syntactic or grammatical role of the word within the sentence, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, etc.
### POS tagging is an important task in natural language processing (NLP) and is used in various applications, including text analysis, information extraction, machine translation, sentiment analysis, and more. It helps in understanding the structure of the sentence, disambiguating word meanings, and facilitating subsequent language processing tasks.

In [12]:
import nltk

# Training data for the POS tagger
training_data = [
    ("The cat is sitting on the mat", "DT NN VBZ VBG IN DT NN"),
    ("I love to eat pizza", "PRP VBP TO VB NN"),
    ("She is singing a song", "PRP VBZ VBG DT NN")
]

# Prepare the training data in the required format:
# pair each word with its hand-assigned tag from training_data
tagged_sentences = [list(zip(sentence.split(), tags.split())) for sentence, tags in training_data]

# Train the POS tagger
pos_tagger = nltk.DefaultTagger('NN')
pos_tagger = nltk.UnigramTagger(tagged_sentences, backoff=pos_tagger)
pos_tagger = nltk.BigramTagger(tagged_sentences, backoff=pos_tagger)

# Test the POS tagger on new sentences
test_sentence = "The dog is chasing the ball"
test_tokens = nltk.word_tokenize(test_sentence)
pos_tags = pos_tagger.tag(test_tokens)

# Print the POS tags
for word, tag in pos_tags:
    print(word, tag)


The DT
dog NN
is VBZ
chasing NN
the DT
ball NN
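
### The tagger above chains three taggers through the `backoff` argument. To show why the chain matters, the sketch below tags the same test sentence twice: once with a bigram tagger that has no backoff, and once with the full chain. It builds the training sentences from the words and tags listed in `training_data` and splits on whitespace, so no tokenizer or tagger model is needed.

```python
import nltk

# Hand-tagged training sentences (words and tags from training_data above)
train = [
    list(zip("The cat is sitting on the mat".split(), "DT NN VBZ VBG IN DT NN".split())),
    list(zip("I love to eat pizza".split(), "PRP VBP TO VB NN".split())),
    list(zip("She is singing a song".split(), "PRP VBZ VBG DT NN".split())),
]
tokens = "The dog is chasing the ball".split()

# Bigram tagger alone: any unseen (previous tag, word) context is tagged None
bigram_only = nltk.BigramTagger(train)
no_backoff_tags = bigram_only.tag(tokens)
print(no_backoff_tags)

# Full chain: bigram -> unigram -> default 'NN'
default_tagger = nltk.DefaultTagger('NN')
unigram_tagger = nltk.UnigramTagger(train, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train, backoff=unigram_tagger)
backoff_tags = bigram_tagger.tag(tokens)
print(backoff_tags)
```

### Without backoff, the tagger gives up at the first unseen context ("dog"), and every later word is tagged None as well because its context now contains None. With the chain, every word receives a tag.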


## Chunker
### In natural language processing (NLP), a chunker, also known as a shallow parser or phrase chunker, is a component that groups words in a sentence into meaningful chunks based on their grammatical structure. Chunking is a process of identifying and labeling contiguous sequences of words that belong together syntactically, such as noun phrases, verb phrases, prepositional phrases, and more.
### Chunking can be helpful in various NLP tasks such as information extraction, named entity recognition, relation extraction, and syntactic parsing. Chunkers use the POS tags assigned to each word in a sentence to identify and group words into chunks based on predefined grammatical patterns or rules. These rules are often defined using regular expressions or other pattern matching techniques.

In [14]:
import nltk

# Sample sentence
sentence = "The black cat is sitting on the mat"

# Tokenize the sentence into words
tokens = nltk.word_tokenize(sentence)

# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)

# Define a chunk grammar using regular expressions
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk noun phrases
    VP: {<VB.*><NP|PP>}    # Chunk verb phrases; runs before the PP rule, so no PP chunk exists yet
    PP: {<IN><NP>}         # Chunk prepositional phrases
"""

# Create a chunk parser using the defined grammar
chunk_parser = nltk.RegexpParser(chunk_grammar)

# Apply the chunk parser to the part-of-speech tagged sentence
chunks = chunk_parser.parse(pos_tags)

# Print the resulting chunks
print(chunks)


(S
  (NP The/DT black/JJ cat/NN)
  is/VBZ
  sitting/VBG
  (PP on/IN (NP the/DT mat/NN)))
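
### The output above contains no VP chunk because the grammar rules are applied in order: when the VP rule runs, the PP chunk has not been built yet, and "sitting" is followed by a plain IN token. The sketch below is a variation on the grammar above, not the original one: it moves the PP rule ahead of the VP rule. The tagged tokens are copied by hand from the output above, so no tagger model is needed.

```python
import nltk

# Tagged tokens copied from the chunker output above
pos_tags = [("The", "DT"), ("black", "JJ"), ("cat", "NN"), ("is", "VBZ"),
            ("sitting", "VBG"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Same rules as above, but PP is built before VP so the VP rule can see it
reordered_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk noun phrases
    PP: {<IN><NP>}         # Chunk prepositional phrases
    VP: {<VB.*><NP|PP>}    # Chunk verb phrases
"""

reordered_tree = nltk.RegexpParser(reordered_grammar).parse(pos_tags)
print(reordered_tree)

# Collect the labels of all subtrees, including the root S
chunk_labels = [subtree.label() for subtree in reordered_tree.subtrees()]
print(chunk_labels)
```

### With this ordering, "sitting" and the following prepositional phrase can be grouped into a VP chunk.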
