<h1 align="center"><font color="green" size="6">Natural Language Processing (NLP)</font></h1>

<div style ="background-color: #90EE90;">.</div>

 - Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language in a meaningful way.
 - It plays a crucial role in applications such as sentiment analysis, machine translation, and text summarization, making it easier for machines to work with human language.

In [None]:
# Import the required libraries
import re  # regular expression operations for pattern-based text matching
import nltk  # Natural Language Toolkit, a core Python library for NLP
import string
from nltk.corpus import stopwords  # a corpus is a large collection of naturally occurring text
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [None]:
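# Download the NLTK data files used below
nltk.download('stopwords')  # stop-word lists
nltk.download('wordnet')    # lexical database used by WordNetLemmatizer
nltk.download('punkt')      # pretrained tokenizer models used by nltk.word_tokenize
# Note: recent NLTK releases may also require nltk.download('punkt_tab')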

In [None]:
# Sample questions
questions = [
    "What’s ur favorite adventure book!? 📚✨",
    "Are u a nite owl or an early bird!? 🌙🐦",
    "What’s your go-to source 4 learning something new!?",
]
# These are synthetic examples; real data could be collected by web scraping, for instance.


### 1. Tokenization

**Definition**: Tokenization is the process of breaking down text into smaller units, known as tokens. These can be words, phrases, or sentences.

**Purpose**: By tokenizing the input, we can analyze individual components of a sentence, which aids in understanding the overall meaning.

**Example**: With NLTK's word tokenizer, the sentence "I love pizza!" is split into the tokens ["I", "love", "pizza", "!"]; the punctuation mark becomes its own token.

In [None]:
# Step 1: Tokenization
# Tokenize each question into individual words (tokens)
def tokenize_question(que):
    tokens = nltk.word_tokenize(que)
    return tokens
# word_tokenize expects a single string, not a list,
# so the function is applied to each question in the list in turn.

In [None]:
tokenized_questions = [tokenize_question(que) for que in questions]
print("Tokenized Questions:")
for i, tokens in enumerate(tokenized_questions):
    print(f"Question {i+1}:", tokens)

### 2. Lowercasing
**Definition**: Lowercasing involves converting all characters in the text to lowercase.

**Purpose**: This ensures consistency and prevents treating the same words differently due to case differences (e.g., "Pizza" vs. "pizza"). It simplifies the matching process during analysis.

**Example**: "I Love Pizza!" becomes "i love pizza!".

In [None]:
#converting all tokens to lowercase for uniformity 
def lowercase_tokens(tokens):
    return [token.lower() for token in tokens]

In [None]:
lowercased_questions = [lowercase_tokens(tokens) for tokens in tokenized_questions]
print("\nLowercased Questions:")
for i, tokens in enumerate(lowercased_questions):
    print(f"Question {i+1}:", tokens)

### 3. Removing Stop Words
**Definition**: Stop words are common words that usually do not carry significant meaning in text analysis (e.g., "and", "the", "is").

**Purpose**: Removing stop words helps focus on the more meaningful words in a sentence, improving processing efficiency and relevancy.

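**Example**: "I love to eat pizza." would become "love eat pizza." Note that "I" is matched only after lowercasing, because NLTK's English stop-word list is lowercase.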

In [None]:
# Step 3: Removing Stop Words
# Stop words are common words like 'is', 'and', 'the' that don't add significant meaning to the text
# Lists for other languages can be loaded the same way; stopwords.fileids() shows what is available
stop_words = set(stopwords.words('english'))  # English stop words only

def remove_stop_words(tokens):
    return [token for token in tokens if token not in stop_words]

In [None]:
filtered_questions = [remove_stop_words(tokens) for tokens in lowercased_questions]
print("\nQuestions after Removing Stop Words:")
for i, tokens in enumerate(filtered_questions):
    print(f"Question {i+1}:", tokens)

### 4. Removing Punctuation
**Definition**: This step involves eliminating punctuation marks from the text.

**Purpose**: Punctuation tokens add noise to the analysis, so removing them helps in treating words uniformly.

**Example**: "Hello, how are you?" becomes "Hello how are you".
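
The example above can also be reproduced at the character level with the standard library alone. This is a minimal sketch rather than the approach used in the cells below: `strip_punctuation` is a hypothetical helper name, and `string.punctuation` covers ASCII punctuation only, so curly quotes and emojis pass through unchanged.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation from a string (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("Hello, how are you?"))  # Hello how are you
```

Working on characters keeps a hyphenated word such as "go-to" as "goto", whereas filtering whole tokens discards it entirely; which behavior is preferable depends on the task.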

In [None]:
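# Step 4: Removing Punctuation and Emojis
# Keep only tokens made up entirely of letters and digits.
# Caveat: any token containing another character is dropped, so a hyphenated token such as 'go-to' is removed as well.
def remove_punctuation_and_emojis(tokens):
    clean_tokens = [token for token in tokens if token.isalnum()]  # Retain only alphanumeric tokens
    return clean_tokens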

In [None]:
cleaned_questions = [remove_punctuation_and_emojis(tokens) for tokens in filtered_questions]
print("\nQuestions after Removing Punctuation and Emojis:")
for i, tokens in enumerate(cleaned_questions):
    print(f"Question {i+1}:", tokens)

### 5. Stemming and Lemmatization
**Definition**: Both stemming and lemmatization are techniques used to reduce words to their base or root form. Stemming strips suffixes using heuristic rules (e.g., "running" becomes "run") and can produce non-words, while lemmatization uses a vocabulary and the word's part of speech to convert words to their dictionary form.

**Purpose**: These techniques help in standardizing words, allowing models to recognize variations of a word as the same term, which is crucial for understanding intent.

**Example**:
 - Stemming: "running" → "run"
 - Lemmatization: "better" → "good" (when treated as an adjective)

In [None]:
# Step 5: Stemming and Lemmatization
# Stemming strips suffixes to reach a base form (e.g., 'running' -> 'run')
# Lemmatization maps words to their dictionary form (e.g., 'running' -> 'run' when treated as a verb)

# Using NLTK's PorterStemmer for stemming
stemmer = PorterStemmer()
def apply_stemming(tokens):
    return [stemmer.stem(token) for token in tokens]
    

In [None]:
stemmed_questions = [apply_stemming(tokens) for tokens in cleaned_questions]
print("\nStemmed Questions:")
for i, tokens in enumerate(stemmed_questions):
    print(f"Question {i+1}:", tokens)

In [None]:
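# Using NLTK's WordNetLemmatizer for lemmatization
# lemmatize() treats every token as a noun by default; pass pos='v' or pos='a'
# to get results such as 'running' -> 'run' or 'better' -> 'good'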
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

In [None]:
lemmatized_questions = [apply_lemmatization(tokens) for tokens in cleaned_questions]
print("\nLemmatized Questions:")
for i, tokens in enumerate(lemmatized_questions):
    print(f"Question {i+1}:", tokens)

### 6. Handling Numbers
**Definition**: This step involves deciding how to process numerical values in the text. Depending on the application, you might convert numbers to their word forms or leave them as is.

**Purpose**: Properly handling numbers ensures that the analysis can incorporate quantitative information effectively.

**Example**: "I have 2 apples" can remain as is or be transformed to "I have two apples".
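
One way to spell out numbers is a simple lookup over tokens. The sketch below is an illustration under stated assumptions: `DIGIT_WORDS` and `spell_out_digits` are hypothetical names, and only single digits are converted, with everything else left unchanged.

```python
# Word forms for single digits; larger numbers are left unchanged in this sketch.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def spell_out_digits(tokens):
    """Replace single-digit tokens with their word form (hypothetical helper)."""
    return [DIGIT_WORDS.get(token, token) for token in tokens]

print(spell_out_digits(["I", "have", "2", "apples"]))
# ['I', 'have', 'two', 'apples']
```

A blind lookup like this has limits: in the sample question "What’s your go-to source 4 learning something new!?", the "4" is shorthand for "for", and converting it to "four" would not recover the intended word. This is why the right handling depends on the application.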

### 7. Removing Special Characters
**Definition**: Special characters include symbols like @, #, $, etc., that don't contribute to the meaning of the text.

**Purpose**: Removing these characters helps to clean the input, ensuring that the focus remains on relevant words.

**Example**: "I love pizza!!!" becomes "I love pizza".
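
A regular expression is a common way to do this. The following is a minimal sketch, assuming that "special" means any character that is neither a word character nor whitespace; `remove_special_characters` is a hypothetical helper name.

```python
import re

# Anything that is not a word character (letter, digit, underscore) or whitespace.
_SPECIAL_CHARS = re.compile(r"[^\w\s]")

def remove_special_characters(text):
    """Delete special characters, then collapse leftover runs of whitespace."""
    cleaned = _SPECIAL_CHARS.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(remove_special_characters("I love pizza!!!"))         # I love pizza
print(remove_special_characters("Email me @ home #now $"))  # Email me home now
```

Because the pattern removes everything outside letters, digits, underscores, and whitespace, it also strips emojis and apostrophes; a narrower character class may be preferable when some symbols carry meaning.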

### 8. Text Normalization
**Definition**: Text normalization is the overall process of transforming text into a consistent format. This includes lowercasing, removing punctuation, handling numbers, and more.

**Purpose**: Normalization prepares text for further analysis by ensuring all input is in a standard format, making it easier to process and understand.

**Example**: "I LOVE PIZZA!!! 2 times." would be normalized to "i love pizza two times".
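
The steps can be chained into a single function. The sketch below uses only the standard library and reproduces the example above; `normalize_text` and `DIGIT_WORDS` are hypothetical names, whitespace splitting stands in for a real tokenizer, and only single digits are spelled out.

```python
import re

# Word forms for single digits; larger numbers are left unchanged in this sketch.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize_text(text):
    """Lowercase, strip punctuation and symbols, and spell out single digits."""
    text = text.lower()                      # lowercasing
    text = re.sub(r"[^\w\s]", "", text)      # drop punctuation, symbols, emojis
    tokens = text.split()                    # simple whitespace tokenization
    tokens = [DIGIT_WORDS.get(tok, tok) for tok in tokens]  # handle numbers
    return " ".join(tokens)

print(normalize_text("I LOVE PIZZA!!! 2 times."))  # i love pizza two times
```

The order of the steps matters: punctuation is removed before splitting so that "times." becomes "times", and digits are converted last so the lookup sees clean tokens.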