# NLP

A vertical split separates a dataset into x and y, where x holds the predictors and y is the response.

# Lexical Ambiguity
- A single word has more than one meaning.
- The same word can be understood in more than one way.
- Machines have difficulty interpreting such words.
- Ex.: "bank" can mean 1. a financial institution, or 2. a river bank.
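
One simple way a program can attempt to resolve the "bank" example is to compare the words around it with a short list of keywords for each sense and pick the sense with the most overlap (the idea behind the simplified Lesk approach). The sketch below is only an illustration: the `SENSES` keyword lists and the `pick_sense` helper are made up for this example and are not part of nltk.

```python
# Hypothetical mini sense inventory: keywords describing each sense of "bank".
SENSES = {
    "bank": {
        "financial institution": "institution deposit lend money account loan",
        "river bank": "sloping land beside river stream water shore",
    }
}

def pick_sense(word, sentence):
    """Return the sense whose keywords share the most words with the sentence."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, keywords in SENSES[word].items():
        overlap = len(context & set(keywords.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(pick_sense("bank", "I went to the bank to deposit money"))
print(pick_sense("bank", "We sat on the bank of the river"))
```

With no overlapping words at all, this sketch simply falls back to the first listed sense, which is one reason real systems rely on much richer context.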

# Syntactic Ambiguity
- When a sentence can be parsed in more than one way, the ambiguity is called syntactic ambiguity.
- Ex.: "The man saw a girl with a telescope." Either the man used a telescope to see the girl, or the girl had the telescope.


# Semantic ambiguity
- Arises when the meaning of the words themselves can be misinterpreted.


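# Anaphoric ambiguity
- Arises from the use of anaphora (words, such as pronouns, that refer back to something mentioned earlier) in a sentence.
- Ex.: "The horse ran up the hill. It was too steep. It soon got tired." The first "it" refers to the hill, the second to the horse.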

# Pragmatic ambiguity
- Arises when a statement is not specific, so the context of the phrase allows multiple interpretations.
- Ex.: "I love you too." can mean "I love you, just as you love me" or "I love you, just as someone else does".


# Cleaning Data
- Convert the data to lowercase.
- Remove punctuation, e.g. . , ' " ! ? @ $ -
- Remove stopwords, e.g. I, a, an.

- The nltk library (Natural Language Toolkit) provides tools for these NLP tasks.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string

from nltk import pos_tag
from nltk.util import ngrams

In [3]:
#Downloading nltk data(first time only)-
# nltk.download('punkt_tab')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger_eng') #supervised machine learning model for POS tagging

# Tokenization
- Splitting a phrase, sentence, paragraph, or entire text into smaller units (tokens), such as words, is called tokenization.

In [4]:
text='The quick brown fox jumped over the lazy dog!'

# Tokenize-
tokens=word_tokenize(text.lower())
tokens

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '!']

In [5]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [6]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
print(sorted(stopwords.words('english')))


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [8]:
tokens=[word for word in tokens if word not in string.punctuation]
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']


In [9]:
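# Build the stopword set once instead of reloading the list for every token
stop_words=set(stopwords.words('english'))
tokens=[word for word in tokens if word not in stop_words]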
print(tokens)

['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']


# Stemming
- Reducing a word to its stem (base form) is called stemming.
- Ex.: moving, move, moved -> move

- Stemming shortens the word, but the result is sometimes not a real word, so meaning can be lost.
- Ex.: studies -> studi
  

# About Lemmatization
- Lemmatization addresses this problem.
- The base word it returns (the lemma) is a valid, meaningful word.
- A lemmatizer uses a dictionary to look up the word and return its correct base form.
- Ex.: studies -> study

# Difference between stemming and lemmatization

| Stemming | Lemmatization |
| --- | --- |
| Faster. | Slower than stemming. |
| Rule-based approach (strips the end of the word to get the stem). | Dictionary-based approach. |
| Less accurate. | More accurate. |
| Used in tasks such as spam detection. | Used in tasks such as text summarization and question answering. |
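
The rule-based versus dictionary-based contrast can be sketched in a few lines of plain Python. This is a toy illustration only: `toy_stem` strips a handful of suffixes and is not the Porter algorithm, and `LEMMA_DICT` is a tiny hand-written dictionary standing in for a real lexical database such as WordNet.

```python
# Toy rule-based stemmer: strip a few common suffixes (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Remove the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary-based lemmatizer: look the word up, fall back to the word itself.
LEMMA_DICT = {"studies": "study", "moved": "move", "moving": "move"}

def toy_lemmatize(word):
    return LEMMA_DICT.get(word, word)

for w in ["studies", "moving", "moved"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer is fast because it only inspects the end of the word, but it can produce non-words such as "studi"; the lemmatizer returns real words but only for entries its dictionary knows.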


In [10]:
from nltk.stem import PorterStemmer  #PorterStemmer
stemmer=PorterStemmer()
stemmed_words=[stemmer.stem(word) for word in tokens]
print(stemmed_words)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


In [11]:
lemmatizer=WordNetLemmatizer()   #Lemmatizer (default pos='n' treats every word as a noun, so verbs like 'jumped' stay unchanged)
tokens=[lemmatizer.lemmatize(word) for word in tokens]
print(tokens)

['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']


In [12]:
#Another example-
# pos- parts of speech

#Sample text
text='Hello, All! This is an NLP example. I love learning about transformers, attention mechanisms, and embeddings'

#step1: Tokenization:
tokens=word_tokenize(text)
print("Tokens:",tokens)
print('\n')

#step 2: lowercasing
tokens=[word.lower() for word in tokens]
print('Lowercased Tokens:',tokens)
print('\n')
#step 3: Remove punctuation
tokens=[word for word in tokens if word not in string.punctuation]
print('Without punctuation:',tokens)
print('\n')

#step 4: Remove stopwords
stop_words=set(stopwords.words('english'))
tokens=[word for word in tokens if word not in stop_words]
print('Without stopwords:',tokens)
print('\n')

#step 5 : Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens=[lemmatizer.lemmatize(word) for word in tokens]
print('Lemmatized Tokens:',lemmatized_tokens)
print('\n')

#step 6 : Stemming
stemmer=PorterStemmer()
stemmed_words=[stemmer.stem(word) for word in lemmatized_tokens]
print('Stemmed tokens:',stemmed_words)
print('\n')

Tokens: ['Hello', ',', 'All', '!', 'This', 'is', 'an', 'NLP', 'example', '.', 'I', 'love', 'learning', 'about', 'transformers', ',', 'attention', 'mechanisms', ',', 'and', 'embeddings']


Lowercased Tokens: ['hello', ',', 'all', '!', 'this', 'is', 'an', 'nlp', 'example', '.', 'i', 'love', 'learning', 'about', 'transformers', ',', 'attention', 'mechanisms', ',', 'and', 'embeddings']


Without punctuation: ['hello', 'all', 'this', 'is', 'an', 'nlp', 'example', 'i', 'love', 'learning', 'about', 'transformers', 'attention', 'mechanisms', 'and', 'embeddings']


Without stopwords: ['hello', 'nlp', 'example', 'love', 'learning', 'transformers', 'attention', 'mechanisms', 'embeddings']


Lemmatized Tokens: ['hello', 'nlp', 'example', 'love', 'learning', 'transformer', 'attention', 'mechanism', 'embeddings']


Stemmed tokens: ['hello', 'nlp', 'exampl', 'love', 'learn', 'transform', 'attent', 'mechan', 'embed']




In [13]:
tokens

['hello',
 'nlp',
 'example',
 'love',
 'learning',
 'transformers',
 'attention',
 'mechanisms',
 'embeddings']

In [14]:
#step 7 : POS TAGGING-
pos_tags=pos_tag(tokens)
print('POS Tags:',pos_tags)
print('\n')

#step 8: Generate N-grams
bigrams=list(ngrams(tokens,2))
print('Bigrams:',bigrams)

POS Tags: [('hello', 'NN'), ('nlp', 'JJ'), ('example', 'NN'), ('love', 'IN'), ('learning', 'VBG'), ('transformers', 'NNS'), ('attention', 'NN'), ('mechanisms', 'NNS'), ('embeddings', 'NNS')]


Bigrams: [('hello', 'nlp'), ('nlp', 'example'), ('example', 'love'), ('love', 'learning'), ('learning', 'transformers'), ('transformers', 'attention'), ('attention', 'mechanisms'), ('mechanisms', 'embeddings')]
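
The N-gram step can also be written by hand, which makes it clear that an N-gram is just every run of n consecutive tokens. The helper name `make_ngrams` below is an illustrative choice, not an nltk function.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["hello", "nlp", "example", "love"]
bigrams = make_ngrams(tokens, 2)    # pairs of neighbouring tokens
trigrams = make_ngrams(tokens, 3)   # triples of neighbouring tokens
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```

A list of k tokens yields k - n + 1 N-grams, so four tokens give three bigrams and two trigrams.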


# POS tags

| POS tag | Meaning | Example |
| --- | --- | --- |
| NN | Noun, singular | example, attention |
| NNS | Noun, plural | transformers, mechanisms |
| VB | Verb, base form | run, eat, read |
| VBD | Verb, past tense | ate, ran |
| VBG | Verb, -ing form | running, sleeping |
| VBN | Verb, past participle | eaten |
| JJ | Adjective | beautiful |
| RB | Adverb | slowly, quickly |

In [15]:
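# WordNet POS constants (from nltk.corpus import wordnet), usable as the
# pos argument of WordNetLemmatizer.lemmatize:
# wordnet.NOUN
# wordnet.VERB
# wordnet.ADV
# wordnet.ADJ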
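
The tags produced by `pos_tag` are Penn Treebank tags, while the lemmatizer expects WordNet POS values, so a small mapping is usually needed between the two. The sketch below is one common simplification based on the first letter of the tag; `penn_to_wordnet` is a hypothetical helper name, and the single-letter strings are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV`, written out so the example runs without downloading any nltk data.

```python
# WordNet POS values as plain strings (the values of wordnet.NOUN, wordnet.VERB,
# wordnet.ADJ and wordnet.ADV in NLTK).
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS; default to noun."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("V"):
        return VERB
    if tag.startswith("R"):
        return ADV
    return NOUN

for tag in ["NN", "NNS", "VBD", "VBG", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The result can be passed as the `pos` argument, e.g. `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that a verb such as 'jumped' is lemmatized as a verb rather than left unchanged as a noun.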

# Bag of words
- A text representation technique in which text is represented as a collection (bag) of words: grammar and word order are ignored, but the frequency of each word matters.

Each sentence is treated as a document.<br>
Vocabulary -> the unique words across all documents.<br>
- Tokenization
- Vocabulary
- Vectorization (each document becomes a vector with one entry per vocabulary word: the word's count, or 1/0 in the binary variant that only records whether the word is present)

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
#sample documents
docs=['I love machine learning','Machine learning is fun']

#initialize count vectorizer
vectorizer=CountVectorizer(lowercase=True,stop_words='english')

#fit and transform the documents into a bag of words-
X=vectorizer.fit_transform(docs)
print(X)

# convert the result to an array
print(X.toarray())

#Display the vocabulary(features)-
print(vectorizer.get_feature_names_out())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6 stored elements and shape (2, 4)>
  Coords	Values
  (0, 2)	1
  (0, 3)	1
  (0, 1)	1
  (1, 3)	1
  (1, 1)	1
  (1, 0)	1
[[0 1 1 1]
 [1 1 0 1]]
['fun' 'learning' 'love' 'machine']
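
The three bag-of-words steps (tokenization, vocabulary, vectorization) can be reproduced by hand for the same two documents. This is a minimal sketch, not how `CountVectorizer` is implemented: it splits on whitespace, and the two-word `stop_words` set is a hand-picked assumption standing in for a full English stop-word list.

```python
from collections import Counter

docs = ["I love machine learning", "Machine learning is fun"]
# Tiny hand-picked stop-word set (an assumption for this example only).
stop_words = {"i", "is"}

# 1. Tokenization: lowercase, split on whitespace, drop stop words
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]

# 2. Vocabulary: the sorted set of unique words across all documents
vocabulary = sorted({w for doc in tokenized for w in doc})

# 3. Vectorization: one count per vocabulary word for each document
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[w] for w in vocabulary])

print(vocabulary)
print(vectors)
```

Each row has one column per vocabulary word, which is why the vector size grows with the vocabulary.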


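The vectorizer returns a sparse matrix because most entries of a bag-of-words matrix are zero: storing only the non-zero values saves memory and speeds up computation. `toarray()` converts the sparse matrix to a dense array, which is convenient for display.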

# Limitations of bag of words
- Loss of context (word order is ignored).
- High dimensionality: the vector size grows with the vocabulary.
- No semantic meaning is captured.

# TF-IDF (Term Frequency - Inverse Document Frequency)
- A statistical method that evaluates how important a word is to a document by calculating a score for each word.
- It is generally used for information retrieval and summarization.

Term frequency (TF) measures how often a word occurs in a sentence:

$$\text{TF}(w, s) = \frac{\text{number of times } w \text{ appears in sentence } s}{\text{total number of words in sentence } s}$$

Inverse document frequency (IDF) measures how unique a word is across all the documents:

$$\text{IDF}(w) = \log\left(\frac{\text{total number of sentences}}{\text{number of sentences containing } w}\right)$$

Rare words have high IDF values. The TF-IDF score of a word is the product $\text{TF} \times \text{IDF}$.
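
A small worked example of these two quantities, assuming the common logarithmic form of IDF (natural log of total sentences over sentences containing the word). The helper names `tf` and `idf` are illustrative, and `idf` assumes the word occurs in at least one sentence.

```python
import math

sentences = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
]

def tf(word, sentence):
    """Occurrences of the word divided by the sentence length."""
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    """Log of (number of sentences / number of sentences containing the word)."""
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

for word in ["the", "cat", "mat"]:
    score = tf(word, sentences[0]) * idf(word, sentences)
    print(word, "tf =", round(tf(word, sentences[0]), 3),
          "idf =", round(idf(word, sentences), 3),
          "tf-idf =", round(score, 3))
```

Words that occur in every sentence ("the", "cat") get an IDF of 0 and therefore a TF-IDF score of 0, while "mat", which occurs in only one sentence, gets a positive score.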

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
documents=['The cat sat on the mat','The dog barked at the cat']

In [19]:
vectorizer=TfidfVectorizer()

In [20]:
x=vectorizer.fit_transform(documents)
print(x)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10 stored elements and shape (2, 8)>
  Coords	Values
  (0, 7)	0.6050614264813995
  (0, 2)	0.30253071324069974
  (0, 6)	0.42519636159088015
  (0, 5)	0.42519636159088015
  (0, 4)	0.42519636159088015
  (1, 7)	0.6050614264813995
  (1, 2)	0.30253071324069974
  (1, 3)	0.42519636159088015
  (1, 1)	0.42519636159088015
  (1, 0)	0.42519636159088015


In [21]:
print('Vocabulary:',vectorizer.get_feature_names_out())

Vocabulary: ['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']


In [22]:
print("TF-IDF Matrix:\n",x.toarray())

TF-IDF Matrix:
 [[0.         0.         0.30253071 0.         0.42519636 0.42519636
  0.42519636 0.60506143]
 [0.42519636 0.42519636 0.30253071 0.42519636 0.         0.
  0.         0.60506143]]
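
The values in this matrix differ from the textbook TF × IDF formula because `TfidfVectorizer`, with its default settings (`smooth_idf=True`, `norm='l2'`), uses the raw word count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. The sketch below recomputes the matrix by hand under those assumptions; it splits on whitespace rather than using scikit-learn's tokenizer, which gives the same tokens for these two sentences.

```python
import math

documents = ["The cat sat on the mat", "The dog barked at the cat"]
docs = [d.lower().split() for d in documents]
vocabulary = sorted({w for d in docs for w in d})
n_docs = len(docs)

def smoothed_idf(word):
    """ln((1 + n) / (1 + df)) + 1, where df = documents containing the word."""
    df = sum(1 for d in docs if word in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

matrix = []
for d in docs:
    row = [d.count(w) * smoothed_idf(w) for w in vocabulary]  # raw count * idf
    norm = math.sqrt(sum(v * v for v in row))                 # L2 norm of the row
    matrix.append([round(v / norm, 8) for v in row])

print(vocabulary)
for row in matrix:
    print(row)
```

Because of the smoothing and the added 1, a word that appears in every document (such as "the") still receives a non-zero weight instead of being zeroed out.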


In [24]:
corpus=['The cat sat on the mat','The dog barked at the cat']

In [32]:
#generate TFIDF features with N-Grams(unigrams,bigrams)
vectorizer=TfidfVectorizer(ngram_range=(1,2))
X=vectorizer.fit_transform(corpus)
print(X)
print('\n')
#Print feature names(vocabulary)
print('Features (N-Grams):',vectorizer.get_feature_names_out())
print('\n')

#print tf-idf matrix-
print("TF-IDf matrix:\n",X.toarray())

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 20 stored elements and shape (2, 17)>
  Coords	Values
  (0, 13)	0.44915675016915907
  (0, 4)	0.22457837508457953
  (0, 11)	0.31563707021700443
  (0, 9)	0.31563707021700443
  (0, 8)	0.31563707021700443
  (0, 14)	0.22457837508457953
  (0, 5)	0.31563707021700443
  (0, 12)	0.31563707021700443
  (0, 10)	0.31563707021700443
  (0, 16)	0.31563707021700443
  (1, 13)	0.44915675016915907
  (1, 4)	0.22457837508457953
  (1, 14)	0.22457837508457953
  (1, 6)	0.31563707021700443
  (1, 2)	0.31563707021700443
  (1, 0)	0.31563707021700443
  (1, 15)	0.31563707021700443
  (1, 7)	0.31563707021700443
  (1, 3)	0.31563707021700443
  (1, 1)	0.31563707021700443


Features (N-Grams): ['at' 'at the' 'barked' 'barked at' 'cat' 'cat sat' 'dog' 'dog barked'
 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the dog' 'the mat']


TF-IDf matrix:
 [[0.         0.         0.         0.         0.22457838 0.31563707
  0.         0.         0.31563707 0.315637

# Word Embeddings
- Dense vector representations of words, learned by neural networks.
- Each word is mapped to a point in a multi-dimensional vector space in which relationships between words are preserved.
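
"Relationships are preserved" usually means that related words end up with vectors pointing in similar directions, which is commonly measured with cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real models learn vectors with far more dimensions from data.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example only.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_apple = cosine_similarity(embeddings["king"], embeddings["apple"])
print("king vs queen:", round(sim_king_queen, 3))
print("king vs apple:", round(sim_king_apple, 3))
```

With these toy vectors, "king" and "queen" score close to 1 while "king" and "apple" score much lower, mirroring how related words sit close together in a trained embedding space.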

# Advantages of word embeddings
- Both semantic and syntactic meaning are preserved.
- Embeddings trained on a large corpus capture the relationships between words.
- Accuracy on downstream tasks generally improves.


Commonly available word embedding models: Word2Vec (pretrained model by Google), GloVe (pretrained model by Stanford), FastText (pretrained model by Facebook).

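Two architectures in Word2Vec:
- CBOW (continuous bag of words): predicts a word from its surrounding context words.
- Skip-gram: predicts the surrounding context words from a word.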
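
The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below only generates (input, target) examples from a sentence with a context window of 1; it does not train a model, and `context_windows` is a hypothetical helper name.

```python
def context_windows(tokens, window=1):
    """Yield (center, context_words) for each position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "the cat sat on the mat".split()

# CBOW: the context words are the input, the center word is the target.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a separate target.
skipgram_examples = [
    (center, c) for center, context in context_windows(tokens) for c in context
]

print("CBOW:", cbow_examples[:3])
print("Skip-gram:", skipgram_examples[:4])
```

CBOW produces one example per position (many context words in, one word out), whereas skip-gram produces one example per (center, context) pair.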

GloVe: Global Vectors (for Word Representation)