# NLP

A vertical split separates the data into x and y, where x holds the predictors and y is the response.

# Lexical ambiguity
- A single word has more than one meaning, so it can be understood in more than one way.
- Machines have difficulty interpreting such words without context.
- Ex.: "bank" can mean 1. a financial institution or 2. a river bank.

# Syntactic ambiguity
- A sentence that can be parsed in more than one way is syntactically ambiguous.
- Ex.: "The man saw a girl with a telescope." Either the man used the telescope to see her, or the girl had the telescope.


# Semantic ambiguity
- Arises when the meaning of the words themselves can be misinterpreted.


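# Anaphoric ambiguity
- Arises from the use of anaphora (a word, such as a pronoun, that refers back to something mentioned earlier) in a sentence.
- Ex.: "The horse ran up the hill. It was too steep. It soon got tired." The first "it" refers to the hill, the second to the horse.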

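# Pragmatic ambiguity
- Arises when a statement is not specific, or when the context of the phrase allows multiple interpretations.
- Ex.: "I love you too." This can mean "I love you, just as you love me" or "I love you, as well as someone else."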


# Cleaning data
- Convert the text to lowercase.
- Remove punctuation, e.g. `. , ' " ! ? @ $ -`
- Remove stopwords, e.g. I, a, an.

- The nltk library (Natural Language Toolkit) is used for these NLP tasks.
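
The three cleaning steps can be sketched with the standard library alone. This is a minimal illustration, not the NLTK pipeline used in the cells that follow: the `STOPWORDS` set and the `clean` helper are hypothetical names, and the stopword list is a tiny hand-picked sample.

```python
import string

# Small hand-picked stopword set for illustration only (an assumption, not NLTK's list)
STOPWORDS = {"i", "a", "an", "the"}

def clean(text):
    """Lowercase the text, strip punctuation characters, and drop stopwords."""
    text = text.lower()
    # str.translate with a deletion table removes every punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOPWORDS]

print(clean("I saw a fox, and an owl!"))
# ['saw', 'fox', 'and', 'owl']
```

Note that "and" survives here only because the sample stopword set is so small; NLTK's English list includes it.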

In [17]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string

from nltk import pos_tag
from nltk.util import ngrams

In [16]:
#Downloading nltk data(first time only)-
# nltk.download('punkt_tab')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger_eng') #supervised machine learning model for POS tagging

# Tokenization
- Splitting a phrase, sentence, paragraph, or entire text into smaller units (tokens), such as words, is called tokenization.
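
To make the idea concrete, here is a rough regex-based tokenizer. It is only a sketch: `simple_tokenize` is a hypothetical helper, and `word_tokenize` from nltk applies more elaborate rules (for example around contractions), so the two will not always agree.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("The quick brown fox jumped over the lazy dog!"))
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '!']
```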

In [4]:
text='The quick brown fox jumped over the lazy dog!'

# Tokenize-
tokens=word_tokenize(text.lower())
tokens

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '!']

In [5]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [6]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
print(sorted(stopwords.words('english')))


['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [8]:
tokens=[word for word in tokens if word not in string.punctuation]
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
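
One caveat about the filter above: `word not in string.punctuation` is a substring test against the punctuation string. It works for single-character tokens such as `'!'`, but multi-character tokens such as `'...'` or `"''"` are not substrings and would slip through. A slightly stricter check, sketched here with a hypothetical `is_punctuation` helper and a hand-written token list, tests every character of the token:

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

# Hand-written sample tokens (not actual tokenizer output)
sample = ['hello', '...', "''", 'dog', '!']

naive = [tok for tok in sample if tok not in string.punctuation]
strict = [tok for tok in sample if not is_punctuation(tok)]

print(naive)   # ['hello', '...', "''", 'dog']
print(strict)  # ['hello', 'dog']
```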


In [9]:
stop_words=set(stopwords.words('english'))  # build the set once; set lookups are fast
tokens=[word for word in tokens if word not in stop_words]
print(tokens)

['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']


# Stemming
- Reducing a word to its stem (base form) is called stemming.
- Ex.: moving, move, moved -> move

- Stemming trims the word, but the result is sometimes not a real word, so the meaning can get lost.
- Ex.: studies -> studi
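
Both examples can be reproduced with NLTK's `PorterStemmer`, which needs no downloaded data. A short sketch, with a word list chosen only to match the examples above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["moving", "move", "moved", "studies"]
stems = [stemmer.stem(word) for word in words]

# Pair each word with its stem to see which results are real words
for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The first three words collapse to `move`, while `studies` becomes `studi`, which is not a dictionary word.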
  

# About lemmatization
- Lemmatization solves this problem: the base form it returns (the lemma) is always a meaningful word.
- It uses a dictionary to look up the word and returns the correct base form with that meaning.
- Ex.: studies -> study
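
The dictionary-lookup idea can be illustrated with a toy table. This is only a sketch under loud assumptions: `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, the table is hand-written, and NLTK's `WordNetLemmatizer` relies on the WordNet database rather than a small dict like this.

```python
# Toy lookup table standing in for a real lexical dictionary (hypothetical data)
LEMMA_TABLE = {"studies": "study", "mice": "mouse", "jumped": "jump"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)

print([toy_lemmatize(w) for w in ["studies", "mice", "jumped", "fox"]])
# ['study', 'mouse', 'jump', 'fox']
```

Because the result comes from a lookup rather than from trimming suffixes, every output is a real word, including irregular forms such as `mice`.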

# Difference between stemming and lemmatization

| Aspect | Stemming | Lemmatization |
| --- | --- | --- |
| Speed | Faster. | Slower than stemming. |
| Approach | Rule-based (just removes the end part to stem the word). | Dictionary-based. |
| Accuracy | Less accurate. | More accurate. |
| Typical use | Spam detection. | Text summarization, question answering. |


In [18]:
from nltk.stem import PorterStemmer  #PorterStemmer
stemmer=PorterStemmer()
stemmed_words=[stemmer.stem(word) for word in tokens]
print(stemmed_words)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


In [19]:
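lemmatizer=WordNetLemmatizer()   #Lemmatizer
# lemmatize() treats every word as a noun by default (pos='n'),
# so the verb 'jumped' is returned unchanged in the output below
tokens=[lemmatizer.lemmatize(word) for word in tokens]
print(tokens)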

['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']


In [None]:
#Another example-
# pos- parts of speech

#Sample text
text='Hello, All! This is an NLP example. I love learning about transformers, attention mechanisms, and embeddings'

#step1: Tokenization:
tokens=word_tokenize(text)
print("Tokens:",tokens)
print('\n')

#step 2: lowercasing
tokens=[word.lower() for word in tokens]
print('Lowercased Tokens:',tokens)
print('\n')
#step 3: Remove punctuation
tokens=[word for word in tokens if word not in string.punctuation]
print('Without punctuation:',tokens)
print('\n')

#step 4: Remove stopwords
stop_words=set(stopwords.words('english'))
tokens=[word for word in tokens if word not in stop_words]
print('Without stopwords:',tokens)
print('\n')

#step 5 : Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens=[lemmatizer.lemmatize(word) for word in tokens]
print('Lemmatized Tokens:',lemmatized_tokens)
print('\n')

#step 6 : Stemming
stemmer=PorterStemmer()
stemmed_words=[stemmer.stem(word) for word in lemmatized_tokens]
print('Stemmed tokens:',stemmed_words)
print('\n')

In [32]:
tokens

['hello',
 'nlp',
 'example',
 'love',
 'learning',
 'transformers',
 'attention',
 'mechanisms',
 'embeddings']

In [None]:
#step 7 : POS TAGGING-
pos_tags=pos_tag(tokens)
print('POS Tags:',pos_tags)
print('\n')

#step 8: Generate N-grams
bigrams=list(ngrams(tokens,2))
print('Bigrams:',bigrams)

Tokens: ['Hello', ',', 'All', '!', 'This', 'is', 'an', 'NLP', 'example', '.', 'I', 'love', 'learning', 'about', 'transformers', ',', 'attention', 'mechanisms', ',', 'and', 'embeddings']


Lowercased Tokens: ['hello', ',', 'all', '!', 'this', 'is', 'an', 'nlp', 'example', '.', 'i', 'love', 'learning', 'about', 'transformers', ',', 'attention', 'mechanisms', ',', 'and', 'embeddings']


Without punctuation: ['hello', 'all', 'this', 'is', 'an', 'nlp', 'example', 'i', 'love', 'learning', 'about', 'transformers', 'attention', 'mechanisms', 'and', 'embeddings']


Without stopwords: ['hello', 'nlp', 'example', 'love', 'learning', 'transformers', 'attention', 'mechanisms', 'embeddings']


Lemmatized Tokens: ['hello', 'nlp', 'example', 'love', 'learning', 'transformer', 'attention', 'mechanism', 'embeddings']


Stemmed tokens: ['hello', 'nlp', 'exampl', 'love', 'learn', 'transform', 'attent', 'mechan', 'embed']


POS Tags: [('hello', 'NN'), ('nlp', 'JJ'), ('example', 'NN'), ('love', 'IN'), ('lea
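
Step 8 builds bigrams with `ngrams(tokens, 2)`. As a small sketch of what that call produces, here it is on a shortened, hand-written token list (not the full list from the cell above); `ngrams` returns a generator of tuples, so it is wrapped in `list`:

```python
from nltk.util import ngrams

tokens = ["hello", "nlp", "example", "love"]

# Each n-gram is a tuple of n consecutive tokens
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))

print(bigrams)
# [('hello', 'nlp'), ('nlp', 'example'), ('example', 'love')]
print(trigrams)
# [('hello', 'nlp', 'example'), ('nlp', 'example', 'love')]
```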

# POS tags

| POS tag | Meaning | Example |
| --- | --- | --- |
| NN | Noun, singular | dog, book |
| NNS | Noun, plural | dogs, books |
| VB | Verb, base form | run, eat, read |
| VBD | Verb, past tense | ate, ran |
| VBG | Verb, -ing form | running, sleeping |
| VBN | Verb, past participle | eaten |
| JJ | Adjective | beautiful |
| RB | Adverb | slowly, quickly |

In [33]:
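# WordNet POS constants (from nltk.corpus import wordnet),
# accepted as the pos argument of lemmatizer.lemmatize():
# wordnet.NOUN
# wordnet.VERB
# wordnet.ADV
# wordnet.ADJ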
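
The tags returned by `pos_tag` follow the Penn Treebank scheme, while the lemmatizer expects WordNet POS values. One common way to bridge the two is a small mapping function. The sketch below uses a hypothetical helper, `penn_to_wordnet`, and a hand-written tagged list rather than real `pos_tag` output. It returns the plain letters `'n'`, `'v'`, `'a'`, `'r'`, which are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ`, and `wordnet.ADV`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective (wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb (wordnet.VERB)
    if tag.startswith("R"):
        return "r"  # adverb (wordnet.ADV)
    return "n"      # fall back to noun (wordnet.NOUN), the lemmatizer's default

# Hand-written (word, tag) pairs for illustration
tagged = [("fox", "NN"), ("jumped", "VBD"), ("quickly", "RB"), ("lazy", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('fox', 'n'), ('jumped', 'v'), ('quickly', 'r'), ('lazy', 'a')]
```

With the WordNet data downloaded, the result can be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that verbs are lemmatized as verbs instead of being treated as nouns.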

# Bag of words
- A text representation technique in which text is represented as a collection (bag) of words: grammar and word order are ignored, but the frequency of each word matters.

Each sentence is considered a document.<br>
Vocabulary -> the unique words across all documents.<br>
- Tokenization
- Vocabulary
- Vectorization (each document becomes a vector with one entry per vocabulary word: the word's count in that document, or, in the binary variant, 1 if the word is present and 0 otherwise)
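
The three steps can be sketched by hand before reaching for a library. This is a minimal illustration that assumes whitespace tokenization and no stopword removal, so its vocabulary differs from the `CountVectorizer` output in the next cell, which drops stopwords.

```python
from collections import Counter

docs = ["I love machine learning", "Machine learning is fun"]

# 1. Tokenization: lowercase and split on whitespace (a simplification)
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary: sorted unique words across all documents
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# 3. Vectorization: one count per vocabulary word for each document
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['fun', 'i', 'is', 'learning', 'love', 'machine']
print(vectors)
# [[0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 0, 1]]
```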

In [36]:
from sklearn.feature_extraction.text import CountVectorizer
#sample documents
docs=['I love machine learning','Machine learning is fun']

#initialize count vectorizer
vectorizer=CountVectorizer(lowercase=True,stop_words='english')

#fit and transform the documents into a bag of words-
X=vectorizer.fit_transform(docs)
print(X)

# convert the result to an array
print(X.toarray())

#Display the vocabulary(features)-
print(vectorizer.get_feature_names_out())

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6 stored elements and shape (2, 4)>
  Coords	Values
  (0, 2)	1
  (0, 3)	1
  (0, 1)	1
  (1, 3)	1
  (1, 1)	1
  (1, 0)	1
[[0 1 1 1]
 [1 1 0 1]]
['fun' 'learning' 'love' 'machine']
