<a href="https://colab.research.google.com/github/Mahemaran/Colab-notebooks/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **NLP**
* Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. It enables computers to understand, interpret, generate, and respond to human language in a meaningful way.

### **Corpus**
* A corpus (plural: corpora) is a large collection of texts or spoken language that is systematically organized and used for language analysis, research, or training models in Natural Language Processing (NLP) and linguistics.
* Real-world Data: It contains actual texts from books, articles, websites, spoken conversations, etc.
* Example: a tiny corpus of three sentences:
    * "Machine learning is amazing."
    * "Natural language processing is a part of AI."
    * "Deep learning models are used for text generation."
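
As a minimal sketch of what "using a corpus for language analysis" can look like, the snippet below stores the three example sentences in a Python list and counts word frequencies with the standard library. The `tokenize` helper and its letters-only rule are illustrative assumptions, not a standard tokenizer.

```python
import re
from collections import Counter

corpus = [
    "Machine learning is amazing.",
    "Natural language processing is a part of AI.",
    "Deep learning models are used for text generation.",
]

def tokenize(text):
    # Illustrative assumption: lowercase the text and keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word occurs across the whole corpus
word_counts = Counter(token for doc in corpus for token in tokenize(doc))

# The vocabulary is the set of distinct words seen in the corpus
vocabulary = sorted(word_counts)

print("Vocabulary size:", len(vocabulary))
print("Count of 'learning':", word_counts["learning"])
```

Real corpora are far larger, but the same two artifacts (a vocabulary and per-word counts) are the starting point for many of the techniques below.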

### **Tokenization**
* Tokenization is the process of splitting text into smaller units called tokens. These tokens can be words, subwords, sentences, or even characters, depending on the task. It is a fundamental step in Natural Language Processing (NLP) and machine learning.




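**Types of Tokenization**
* Word tokenization: split text into words and punctuation marks.
```
Text: "I love NLP!"
Tokens: ['I', 'love', 'NLP', '!']
```
* Subword tokenization: split words into smaller, reusable pieces.
```
Text: "playing"
Tokens: ['play', '##ing']  # '##' indicates subword continuation
```
* Sentence tokenization: split text into sentences.
```
Text: "Hello world. NLP is fun."
Tokens: ['Hello world.', 'NLP is fun.']
```
* Character tokenization: split text into individual characters.
```
Text: "NLP"
Tokens: ['N', 'L', 'P']
```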
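
The token types described above can be approximated with a few lines of standard-library Python. This is a rough sketch under stated assumptions: the function names, the regular expressions, and the two-entry `TOY_VOCAB` are all hypothetical, and the subword function only imitates the greedy longest-match idea behind WordPiece-style tokenizers rather than reproducing any library's implementation.

```python
import re

def word_tokens(text):
    # Runs of word characters are tokens; any other non-space character is its own token
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Naive rule: split on whitespace that follows '.', '!' or '?'
    # (this would wrongly split after abbreviations such as "U.K.")
    return re.split(r"(?<=[.!?])\s+", text.strip())

def char_tokens(text):
    return list(text)

def subword_tokens(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest vocabulary
    # entry that matches at the current position
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry fits
        tokens.append(piece)
        start = end
    return tokens

TOY_VOCAB = {"play", "##ing"}  # hypothetical two-entry vocabulary

print(word_tokens("I love NLP!"))
print(subword_tokens("playing", TOY_VOCAB))
print(sentence_tokens("Hello world. NLP is fun."))
print(char_tokens("NLP"))
```

Library tokenizers handle many cases these rules miss (contractions, abbreviations, Unicode), which is why the examples that follow rely on NLTK and Hugging Face instead.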



**Rule-based Tokenization**
```
text = "Hello, world!"
tokens = text.split()  # Basic split on whitespace; punctuation stays attached to words
print(tokens)  # Output: ['Hello,', 'world!']
```
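**NLTK**
```
import nltk
from nltk.tokenize import word_tokenize

# Tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')

text = "Tokenization is important for NLP."
print(word_tokenize(text))
# Output: ['Tokenization', 'is', 'important', 'for', 'NLP', '.']
```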
**Hugging Face Transformers**
```
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization is crucial for BERT.")
print(tokens)
# Output: ['token', '##ization', 'is', 'crucial', 'for', 'bert', '.']
```





### **Stop words**
Stop words are common words in a language that are often filtered out or removed during text preprocessing in Natural Language Processing (NLP) tasks because they usually do not carry significant meaning or contribute to the analysis.

**Examples of stop words in English include:**
* Articles: a, an, the
* Prepositions: in, on, at
* Conjunctions: and, or, but
* Pronouns: he, she, it, they
* Others: is, are, was, of, to, etc.
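
To make the filtering step concrete, here is a minimal sketch that removes stop words using a small hand-written set built from the examples listed above. The `STOP_WORDS` set and the whitespace tokenization are illustrative assumptions; real stop word lists, such as the NLTK one used next, are much longer.

```python
# Hand-written stop word set taken from the examples above (illustrative, not exhaustive)
STOP_WORDS = {
    "a", "an", "the",            # articles
    "in", "on", "at",            # prepositions
    "and", "or", "but",          # conjunctions
    "he", "she", "it", "they",   # pronouns
    "is", "are", "was", "of", "to",
}

def remove_stop_words(tokens):
    # Compare in lowercase so that "The" and "the" are treated alike
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = "The cat is on the mat and it is happy".split()
print(remove_stop_words(tokens))
```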

**NLTK (Natural Language Toolkit)**

```
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stop word list and tokenizer models
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

# Example text
text = "This is an example of stop word removal."

# Tokenize text
words = word_tokenize(text)

# Remove stop words (compare in lowercase so "This" matches "this")
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Original:", words)
print("Filtered:", filtered_words)
# Output:
# Original: ['This', 'is', 'an', 'example', 'of', 'stop', 'word', 'removal', '.']
# Filtered: ['example', 'stop', 'word', 'removal', '.']
```



### **Padding**
* Padding is the process of adding extra values (often zeros) to sequences so that all sequences have the same length.
* It gives every input the same shape, so a batch of sequences can be stored in a single fixed-size array.

**Common padding strategies:**
* Pre-padding: Add values at the beginning of the sequence.
* Post-padding: Add values at the end of the sequence.
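
The two strategies can be sketched in plain Python as follows. `pad_sequences_simple` is a hypothetical helper written for illustration; it is not the Keras function, although it mirrors the Keras default of dropping values from the front when a sequence is longer than `maxlen`.

```python
def pad_sequences_simple(sequences, maxlen, padding="pre", value=0):
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    if padding not in ("pre", "post"):
        raise ValueError("padding must be 'pre' or 'post'")

    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]             # too long: keep the last maxlen items
        fill = [value] * (maxlen - len(seq))  # how many padding values are missing
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[1, 2, 3], [4, 5], [6]]
print(pad_sequences_simple(sequences, maxlen=4, padding="pre"))
print(pad_sequences_simple(sequences, maxlen=4, padding="post"))
```

Pre-padding keeps the real tokens at the end of each row, while post-padding keeps them at the start; which is preferable depends on the model.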

**Padding in NLP (Tokenized Text)**
```
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample sequences of different lengths
sequences = [
    [1, 2, 3],
    [4, 5],
    [6]]

# Pad sequences to a maximum length of 4 (zeros are added at the front)
padded_sequences = pad_sequences(sequences, maxlen=4, padding='pre')
print(padded_sequences)
# Output:
# [[0 1 2 3]
#  [0 0 4 5]
#  [0 0 0 6]]
```



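### **One-hot encoding**
* One-hot encoding is a technique used in machine learning and data preprocessing to convert categorical data into a binary matrix (0s and 1s). Each category gets its own column, and each row has a 1 in the column of its category and 0s elsewhere.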
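
A minimal pure-Python sketch of the idea: each distinct category is assigned a column, and every value becomes a row with a single 1. The `one_hot_encode` helper is hypothetical and sorts the categories alphabetically, which is an assumption made here so the column order is predictable.

```python
def one_hot_encode(values):
    # One column per distinct category, in alphabetical order
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # mark the column of this value's category
        rows.append(row)
    return categories, rows

categories, rows = one_hot_encode(["Cat", "Dog", "Bird", "Cat"])
print(categories)
for row in rows:
    print(row)
```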

**Using OneHotEncoder from Scikit-learn**
```
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Example Data
data = {'Animal': ['Cat', 'Dog', 'Bird', 'Cat']}
df = pd.DataFrame(data)

# Initialize OneHotEncoder to return a dense array
# (scikit-learn 1.2+ uses sparse_output; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# Fit and Transform
encoded = encoder.fit_transform(df[['Animal']])

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Animal']))
print(encoded_df)

# Output:
#    Animal_Bird  Animal_Cat  Animal_Dog
# 0          0.0         1.0         0.0
# 1          0.0         0.0         1.0
# 2          1.0         0.0         0.0
# 3          0.0         1.0         0.0
```

### **Word embeddings**
* Word embeddings are used to convert text (which is inherently non-numeric) into a numerical format that machine learning models can process effectively.
* Word Embedding is a technique in Natural Language Processing (NLP) where words are represented as dense, continuous, and low-dimensional vectors in a numerical space. These vectors capture semantic relationships between words based on their meaning and context.
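
"Capturing semantic relationships" usually means that related words have vectors pointing in similar directions, which is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; the values in `toy_embeddings` are hypothetical, and real embeddings are learned from data and typically have tens to hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings (real ones are learned, not hand-written)
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); values near 1.0 mean similar direction
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print("king vs queen:", round(king_queen, 3))
print("king vs apple:", round(king_apple, 3))
```

With these toy vectors, "king" and "queen" score much closer to 1.0 than "king" and "apple".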

**Using Gensim for Word2Vec**
```
import gensim.downloader as api

# Load Pretrained Word2Vec Model (a large download, well over 1 GB, on first use)
model = api.load('word2vec-google-news-300')

# Get Vector for a Word (a 300-dimensional NumPy array)
vector = model['king']
print("Vector for 'king':", vector)

# Find Most Similar Words: returns topn (word, cosine similarity) pairs
similar_words = model.most_similar('king', topn=5)
print("Words similar to 'king':", similar_words)

# Illustrative output (abridged; the actual neighbors and scores depend on the model):
# Words similar to 'king':
# [('queen', 0.78), ('prince', 0.75), ('monarch', 0.72), ('ruler', 0.69)]
```

**Using Keras Embedding Layer**

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense

# Example Parameters
vocab_size = 1000   # Vocabulary size
embedding_dim = 64  # Size of embedding vector
input_length = 10   # Input sequence length

# Model with Embedding Layer
# The Input layer fixes the sequence length so the model is built before summary();
# recent Keras releases deprecate the Embedding `input_length` argument.
model = Sequential([
    Input(shape=(input_length,)),
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.summary()
```

### **TF-IDF**
* TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in Natural Language Processing (NLP) and information retrieval to evaluate the importance of a word within a document or a collection of documents (corpus). It is widely used for text classification, clustering, and search engines.
> TF-IDF is based on the idea that:
* Term Frequency (TF): A word is more important if it appears frequently in a document.
* Inverse Document Frequency (IDF): A word is less important if it appears frequently across many documents.
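
The two ideas combine into a single score, commonly written as tf-idf(t, d) = tf(t, d) × idf(t) with idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The sketch below computes this textbook variant by hand; the helper names and the letters-only tokenizer are assumptions for illustration. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from these.

```python
import math
import re

documents = [
    "I love machine learning.",
    "Machine learning is fun.",
    "Deep learning is a part of machine learning.",
]

def tokenize(text):
    # Illustrative assumption: lowercase and keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

def tf(term, tokens):
    # Share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term):
    # Terms that appear in every document get log(1) = 0
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_docs / doc_freq)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

# One {term: score} dictionary per document
scores = [{term: tf_idf(term, tokens) for term in set(tokens)} for tokens in tokenized]

for term, score in sorted(scores[0].items()):
    print(f"{term}: {score:.3f}")
```

In the first document, "learning" and "machine" score zero because they occur in all three documents, while "love" and "i" occur only there and receive the highest weight.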



**Using TfidfVectorizer from Scikit-learn**
```
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Example Corpus
documents = [
    "I love machine learning.",
    "Machine learning is fun.",
    "Deep learning is a part of machine learning."
]

# Initialize TF-IDF Vectorizer
# (by default it lowercases text and ignores one-character tokens such as "I" and "a")
vectorizer = TfidfVectorizer()

# Transform documents into TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (terms), sorted alphabetically
terms = vectorizer.get_feature_names_out()

# Convert the sparse TF-IDF matrix to a dense array
tfidf_array = tfidf_matrix.toarray()

# Show the terms and their corresponding TF-IDF values
df = pd.DataFrame(tfidf_array, columns=terms)
print(df.round(3))

# Output (rounded to 3 decimals):
#     deep    fun     is  learning   love  machine     of   part
# 0  0.000  0.000  0.000     0.453  0.767    0.453  0.000  0.000
# 1  0.000  0.663  0.504     0.391  0.000    0.391  0.000  0.000
# 2  0.433  0.000  0.330     0.512  0.000    0.256  0.433  0.433
```



### **Stemming**
* Stemming is a text preprocessing technique used in Natural Language Processing (NLP) to reduce words to their root form or stem.

**Example using Python:**
```
from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
stemmer = PorterStemmer()

# List of words to stem
words = ["running", "runner", "ran", "happily", "happiness"]

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

# Output (stems such as 'happili' need not be real words):
# ['run', 'runner', 'ran', 'happili', 'happi']
```
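
To show the kind of rule a stemmer applies, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm: the `SUFFIXES` list, the minimum stem length, and the `toy_stem` name are all assumptions chosen for illustration, and real stemmers apply many more rules in several ordered steps.

```python
# Suffixes to strip, longest first so "ness" is tried before "s"
SUFFIXES = ("ness", "ing", "ed", "ly", "s")
MIN_STEM_LENGTH = 3  # avoid reducing short words to almost nothing

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word  # no rule applied

for word in ["running", "happily", "happiness", "cats", "played", "ran"]:
    print(word, "->", toy_stem(word))
```

The output includes stems such as `runn` and `happi`, which illustrates why stemming is fast but does not guarantee valid words, and why an irregular form like "ran" is left untouched.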

### **Lemmatization**
* Lemmatization is a text preprocessing technique in Natural Language Processing (NLP) that aims to reduce words to their base or dictionary form (known as the lemma). Unlike stemming, which simply removes affixes from words, lemmatization involves more linguistic analysis to ensure that the resulting lemma is a valid word in the language.

**Lemmatization in Python using NLTK**
```
import nltk
from nltk.stem import WordNetLemmatizer

# Download required NLTK resources
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize some words
words = ["running", "better", "cats", "studies", "played"]

# pos='v' treats every word as a verb; the result depends on the POS given
# (for example, "better" becomes "good" only with pos='a' for adjective)
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)

# Output:
# ['run', 'better', 'cat', 'study', 'play']
```
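
The contrast with stemming can be sketched as a dictionary lookup keyed by the word and its part of speech. The `TOY_LEMMAS` table and `toy_lemmatize` function are hypothetical stand-ins for a real lexical database such as WordNet; they only illustrate why a lemmatizer can handle irregular forms and why the POS matters.

```python
# Hypothetical lemma table: (word, part of speech) -> dictionary form
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("better", "a"): "good",   # irregular adjective: no suffix rule could find this
    ("cats", "n"): "cat",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the table has no entry
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("ran", pos="v"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better", pos="v"))
```

The same surface form can map to different lemmas depending on the POS passed in, which is why lemmatization is often paired with POS tagging.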

### **POS Tagging**
Part-of-Speech (POS) Tagging is a fundamental task in Natural Language Processing (NLP) that involves assigning a grammatical category (such as noun, verb, adjective, adverb, etc.) to each word in a sentence. POS tagging helps in understanding the syntactic structure of a sentence, which is crucial for various downstream NLP tasks like named entity recognition (NER), parsing, information retrieval, and machine translation.

**NLTK for POS Tagging**
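```
import nltk

# Download tokenizer and tagger models
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Tokenize the text
tokens = nltk.word_tokenize(text)

# POS tagging (returns a list of (word, Penn Treebank tag) pairs)
tagged = nltk.pos_tag(tokens)

# Print POS tags
for word, tag in tagged:
    print(f'{word}: {tag}')

# Output (exact tags can vary with the tagger version):
# Apple: NNP
# is: VBZ
# looking: VBG
# at: IN
# buying: VBG
# U.K.: NNP
# startup: NN
# for: IN
# $: $
# 1: CD
# billion: NN
```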
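
For intuition about what a tagger decides, the following sketch assigns Penn Treebank tags with a tiny lexicon plus a few suffix and capitalization rules. Everything here (`LEXICON`, the rule order, the `toy_pos_tag` name) is a hypothetical simplification; NLTK's default tagger is a trained statistical model that uses context, not hand-written rules like these.

```python
# Hypothetical mini-lexicon of frequent function words
LEXICON = {"is": "VBZ", "at": "IN", "for": "IN", "the": "DT", "a": "DT"}

def toy_pos_tag(tokens):
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            tag = LEXICON[lower]        # known function word
        elif token[0].isdigit():
            tag = "CD"                  # cardinal number
        elif lower.endswith("ing"):
            tag = "VBG"                 # gerund / present participle
        elif lower.endswith("ly"):
            tag = "RB"                  # adverb
        elif token[0].isupper():
            tag = "NNP"                 # proper noun
        else:
            tag = "NN"                  # default: common noun
        tagged.append((token, tag))
    return tagged

for word, tag in toy_pos_tag("Apple is looking at buying a startup".split()):
    print(f"{word}: {tag}")
```

Rules like these break quickly (a sentence-initial common noun would be tagged `NNP`, for example), which is the motivation for context-aware statistical taggers.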



