# **Text Preprocessing :- 1. Text Cleaning**

Welcome! I'm happy to help you learn NLP text preprocessing in a simple and effective way. I'll explain each concept clearly and step by step. We’ll start with the basics and then build up to more advanced techniques. This way, you'll be well-prepared for your interviews.

### What is NLP Text Preprocessing?

In Natural Language Processing (NLP), **text preprocessing** refers to the steps that are performed to clean and prepare raw text data for use in machine learning models or other NLP tasks (such as sentiment analysis, text classification, named entity recognition, etc.). Raw text data can be messy, and preprocessing is necessary to convert it into a clean, structured format that the model can understand.

### Why is Text Preprocessing Important?

1. **Cleaning the Data**: Raw text often contains noise such as punctuation, special characters, extra spaces, etc., which don’t contribute to the task at hand.
2. **Improving Model Performance**: Preprocessing ensures that irrelevant information is removed, so the model can focus on what’s important.
3. **Standardizing the Data**: Preprocessing makes sure that similar things (like “apple” and “Apple”) are treated the same way.

### Key Steps in Text Preprocessing

Let’s go step by step and cover the most common preprocessing techniques.

---

### 1. **Lowercasing**
   **Why**: Text might have uppercase and lowercase letters, but the same word in different cases should be treated as the same word.
   
   **Example**:
   - "Apple" → "apple"
   - "apple" → "apple"
   
   **How to do it?**:
   - Convert all text to lowercase using Python's `.lower()` method.
   
   **Code Example**:
   ```python
   text = "This is an Apple."
   text = text.lower()
   print(text)  # Output: "this is an apple."
   ```

---

### 2. **Removing Punctuation**
   **Why**: Punctuation marks like commas, periods, and exclamation points may not be useful in most NLP tasks. For instance, "apple." and "apple" should be treated as the same word.
   
   **Example**:
   - "apple." → "apple"
   
   **How to do it?**:
   - Use Python’s `string.punctuation` to remove punctuation marks.
   
   **Code Example**:
   ```python
   import string
   text = "Hello! How are you?"
   text = ''.join([char for char in text if char not in string.punctuation])
   print(text)  # Output: "Hello How are you"
   ```
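For longer texts, `str.translate` with a translation table is a common alternative to the character-by-character join above. This is a minimal sketch, assuming only the ASCII punctuation in `string.punctuation` needs removing; the helper name `remove_punctuation` is illustrative.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted).
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("Hello! How are you?"))  # Output: "Hello How are you"
```

Because `string.punctuation` covers ASCII only, typographic characters such as curly quotes are left in place and would need to be added to the table separately.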

---

### 3. **Removing Numbers**
   **Why**: In many NLP tasks, numbers are not useful unless the task specifically involves numerical data. For example, for text classification, numbers like "123" might not carry useful meaning.
   
   **Example**:
   - "I have 2 apples." → "I have apples."
   
   **How to do it?**:
   - Use regular expressions (`re.sub`) to remove numbers.
   
   **Code Example**:
   ```python
   import re
   text = "I have 2 apples."
   text = re.sub(r'\d+', '', text)
   print(text)  # Output: "I have  apples."
   ```
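As the output above shows, deleting a number leaves a double space behind. One possible way to handle this, sketched here with a hypothetical helper `remove_numbers`, is to tidy the spacing in the same step and optionally substitute a placeholder token instead of deleting the number outright.

```python
import re

def remove_numbers(text, placeholder=''):
    """Replace each run of digits with `placeholder`, then tidy the spacing."""
    text = re.sub(r'\d+', placeholder, text)
    # Collapse the double space that deleting a number leaves behind.
    return re.sub(r'\s+', ' ', text).strip()

print(remove_numbers("I have 2 apples."))         # Output: "I have apples."
print(remove_numbers("I have 2 apples.", "NUM"))  # Output: "I have NUM apples."
```

A placeholder keeps the information that a number was present, which can matter for tasks where quantities are relevant.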

---

### 4. **Tokenization**
   **Why**: Tokenization is the process of splitting text into smaller chunks (tokens), such as words or sentences. This is important because machine learning models work with individual tokens.

   **Example**:
   - "I have a pen." → ["I", "have", "a", "pen", "."]
   
   **How to do it?**:
   - We can use Python libraries like `nltk` to tokenize text into words or sentences.
   
   **Code Example (Word Tokenization)**:
   ```python
   from nltk.tokenize import word_tokenize
   text = "I have a pen."
   tokens = word_tokenize(text)
   print(tokens)  # Output: ['I', 'have', 'a', 'pen', '.']
   ```

   **Code Example (Sentence Tokenization)**:
   ```python
   from nltk.tokenize import sent_tokenize
   text = "I have a pen. I love to write."
   sentences = sent_tokenize(text)
   print(sentences)  # Output: ['I have a pen.', 'I love to write.']
   ```

---

### 5. **Stopwords Removal**
   **Why**: Stopwords are common words (e.g., "the", "is", "on") that do not carry much meaningful information for many NLP tasks. Removing them helps reduce the noise in the text.

   **Example**:
   - "I have a pen." → ["I", "pen"]
   
   **How to do it?**:
   - We can use a list of stopwords, like those available in the `nltk` library, and remove them from our text.

   **Code Example**:
   ```python
   from nltk.corpus import stopwords
   stop_words = set(stopwords.words('english'))
   tokens = ["I", "have", "a", "pen"]
   # The nltk list is lowercase and this test is case-sensitive, so "I" is kept.
   filtered_tokens = [word for word in tokens if word not in stop_words]
   print(filtered_tokens)  # Output: ['I', 'pen']
   ```

---

### 6. **Stemming and Lemmatization**
   **Why**: Stemming and lemmatization reduce words to their base or root form. Stemming can sometimes produce non-dictionary words, while lemmatization returns valid words.
   
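   **Example**:
   - "running" → "run" (stemming, or lemmatization as a verb)
   - "better" → "good" (lemmatization as an adjective only; a stemmer cannot make this mapping)
   
   **Stemming**:
   - Stemming strips word endings using fixed rules rather than a dictionary. The Porter stemmer used here removes suffixes only.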
   
   **Code Example (Stemming)**:
   ```python
   from nltk.stem import PorterStemmer
   ps = PorterStemmer()
   words = ["running", "jumps", "easily"]
   stemmed_words = [ps.stem(word) for word in words]
   print(stemmed_words)  # Output: ['run', 'jump', 'easili']
   ```

   **Lemmatization**:
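   - Lemmatization maps a word to its dictionary form (lemma) using a vocabulary and, when supplied, the word's part of speech. NLTK's `WordNetLemmatizer` treats every word as a noun unless a `pos` argument is passed, which is why "running" is unchanged in the output below.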
   
   **Code Example (Lemmatization)**:
   ```python
   from nltk.stem import WordNetLemmatizer
   lemmatizer = WordNetLemmatizer()
   words = ["running", "jumps", "easily"]
   lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
   print(lemmatized_words)  # Output: ['running', 'jump', 'easily']
   ```

---

### 7. **Removing Extra Whitespace**
   **Why**: Text data often contains unnecessary spaces, tabs, or newlines. Removing extra whitespace ensures the text is clean and uniform.
   
   **Example**:
   - "I    have a pen." → "I have a pen."
   
   **How to do it?**:
   - Use regular expressions (`re.sub`) to replace multiple spaces with a single space.
   
   **Code Example**:
   ```python
   import re
   text = "I    have a pen."
   text = re.sub(r'\s+', ' ', text).strip()
   print(text)  # Output: "I have a pen."
   ```

---

### 8. **Removing URLs, Mentions, and Hashtags (Optional)**
   **Why**: In social media or web-based text, URLs (e.g., `http://example.com`), mentions (e.g., `@user`), and hashtags (e.g., `#hashtag`) are often not needed for most NLP tasks.

   **How to do it?**:
   - Use regular expressions to remove these elements.
   
   **Code Example**:
   ```python
   import re
   text = "Visit http://example.com for more information. @user #hashtag"
   text = re.sub(r'http\S+|@\S+|#\S+', '', text)
   print(text)  # Output: "Visit  for more information.  "
   ```
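Hashtags often carry topical words worth keeping. The sketch below is one possible variant, with a hypothetical helper `clean_social_text`, that removes URLs and mentions, keeps the hashtag word without the `#`, and tidies the leftover spaces.

```python
import re

def clean_social_text(text):
    """Drop URLs and @mentions, keep hashtag words without the '#'."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)                    # mentions
    text = re.sub(r'#(\w+)', r'\1', text)               # '#nlp' -> 'nlp'
    return re.sub(r'\s+', ' ', text).strip()            # tidy leftover spaces

print(clean_social_text("Visit http://example.com for more information. @user #hashtag"))
# Output: "Visit for more information. hashtag"
```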

---

### 9. **Vectorization (Converting Text to Numbers)**
   After preprocessing, we need to convert text into numerical data so machine learning algorithms can process it.

   **Two common methods**:
   - **Bag of Words (BoW)**: Represents text as a matrix of word counts.
   - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs words based on their importance across all documents.

   **Code Example (BoW)**:
   ```python
   from sklearn.feature_extraction.text import CountVectorizer
   vectorizer = CountVectorizer()
   text = ["I have a pen.", "I have a book."]
   X = vectorizer.fit_transform(text)
   print(X.toarray())  # Output: Word count matrix
   ```

   **Code Example (TF-IDF)**:
   ```python
   from sklearn.feature_extraction.text import TfidfVectorizer
   vectorizer = TfidfVectorizer()
   text = ["I have a pen.", "I have a book."]
   X = vectorizer.fit_transform(text)
   print(X.toarray())  # Output: TF-IDF matrix
   ```
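To make "a matrix of word counts" concrete, here is a hand-rolled Bag of Words using only the standard library. It is a sketch for illustration, assuming already-cleaned, whitespace-separated text, and is not how scikit-learn implements it.

```python
from collections import Counter

docs = ["i have a pen", "i have a book"]

# Vocabulary: every distinct word across the documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
bow = []
for doc in docs:
    counts = Counter(doc.split())
    bow.append([counts[word] for word in vocab])

print(vocab)  # Output: ['a', 'book', 'have', 'i', 'pen']
print(bow)    # Output: [[1, 0, 1, 1, 1], [1, 1, 1, 1, 0]]
```

Note that `CountVectorizer` with default settings ignores single-character tokens, so its vocabulary for the same documents would not include "i" or "a".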

---

### Summary of Key Preprocessing Steps:

1. **Lowercasing**: Make the text uniform (all lowercase).
2. **Removing Punctuation & Numbers**: Clean the text by removing non-useful characters.
3. **Tokenization**: Split the text into smaller units (words or sentences).
4. **Stopword Removal**: Remove common words that don’t add much meaning.
5. **Stemming/Lemmatization**: Reduce words to their root form.
6. **Removing Extra Whitespace**: Clean up extra spaces in the text.
7. **Vectorization**: Convert text into numerical data.
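
The steps in this summary can be chained into a single function. The following is a minimal sketch under stated assumptions: `preprocess` and the tiny `STOP_WORDS` set are hypothetical names introduced for illustration, and a real project would use a fuller stopword list such as the one in nltk.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the nltk list).
STOP_WORDS = {"i", "a", "an", "and", "the", "is", "are", "have", "to", "for", "in", "on"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    tokens = text.split()  # split() with no argument also collapses extra whitespace
    return [token for token in tokens if token not in STOP_WORDS]

print(preprocess("I have 2 pens   and an Apple!"))  # Output: ['pens', 'apple']
```

The order matters: lowercasing first lets the lowercase stopword list match, and splitting last means extra whitespace never produces empty tokens.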

---

### Final Tip: Practice!

The best way to learn NLP preprocessing is by **practicing**. Try applying these techniques on different text datasets. Over time, you’ll gain more confidence in how to preprocess text efficiently.

---


In [1]:
import numpy as np
import pandas as pd
import re
import string


# **A. Load Dataset**

In [2]:
df = pd.read_csv("/content/IMDB Dataset.csv")

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

# **B. Start Text Cleaning**

- Acquired data is often not clean. It may contain HTML tags, spelling mistakes, or special characters. Let's look at some techniques for cleaning text data.

## 1. Unicode Normalization:
- Text data may contain symbols, emojis, graphic characters, or other special characters. We can either remove these characters or convert them into a consistent, machine-readable form. The cells below encode the text as UTF-8 bytes, which shows how each non-ASCII character is stored.

In [5]:
text = "Hello! Universe. How are you ????.EVERYTHING Ok in YOUR side."

In [8]:

# Encode the text as UTF-8 bytes
text = "GeeksForGeeks ????"
print(text.encode('utf-8'))

text1 = 'गीक्स फॉर गीक्स ????'
print(text1.encode('utf-8'))

b'GeeksForGeeks ????'
b'\xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xe0\xa4\xab\xe0\xa5\x89\xe0\xa4\xb0 \xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 ????'
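
Encoding to UTF-8, as in the cell above, turns text into bytes but does not change which characters it contains. Normalization in the Unicode sense is provided by the standard library's `unicodedata` module. The sketch below uses illustrative sample strings to show two spellings of the same word being made equal.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)  # Output: False

# NFC composes characters, so both spellings become the same string.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)  # Output: True

# NFKD decomposes characters; dropping the non-ASCII remainder
# gives a rough accent-stripped form.
ascii_only = unicodedata.normalize("NFKD", composed).encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # Output: cafe
```

Stripping to ASCII discards information, for example it would delete Devanagari text entirely, so it only suits corpora where that loss is acceptable.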


In [9]:
print(df['review'][0].encode('utf-8'))

b"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

## 2. Lowercasing

- Why: Text might have uppercase and lowercase letters, but the same word in different cases should be treated as the same word.

Example:

"Apple" → "apple"
"apple" → "apple"

- Convert all text to lowercase using Python's .lower() method.

In [11]:
text = "Hello!     Universe. How are you ????.EVERYTHING Ok in YOUR side."

text1 = text.lower()
print(text1)

hello!     universe. how are you ????.everything ok in your side.


In [12]:
df['review'][0].lower()

"one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the fa

## 3. Regex or Regular Expression:
A regular expression (regex) is a pattern language for searching strings for specific patterns.  
- **Suppose** our data contains phone numbers, email IDs, and URLs. We can find such text using regular expressions, and then keep or remove the matches as required.
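
As a small illustration of finding such patterns, the sketch below uses `re.findall` with deliberately simplified patterns. The sample text and the three pattern names are assumptions made for this example; real phone, email, and URL formats vary far more.

```python
import re

text = "Call 555-123-4567 or mail help@example.com, see https://example.com/docs"

# Simplified patterns for illustration only.
phone_pattern = r'\b\d{3}-\d{3}-\d{4}\b'
email_pattern = r'[\w.+-]+@[\w-]+\.[\w.-]+'
url_pattern = r'https?://\S+'

print(re.findall(phone_pattern, text))  # Output: ['555-123-4567']
print(re.findall(email_pattern, text))  # Output: ['help@example.com']
print(re.findall(url_pattern, text))    # Output: ['https://example.com/docs']
```

Swapping `re.findall` for `re.sub(pattern, '', text)` removes the matches instead of collecting them.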

## 3.1 Removing Punctuation
- Why: Punctuation marks like commas, periods, and exclamation points may not be useful in most NLP tasks. For instance, "apple." and "apple" should be treated as the same word.

Example:

"apple." → "apple"
How to do it?:

- Use Python’s string.punctuation to remove punctuation marks.

In [13]:
text1

'hello!     universe. how are you ????.everything ok in your side.'

In [16]:
import string
text1 = "hello!     universe. how are you ????.everything ok in your side."
text1 = ''.join([char for char in text1 if char not in string.punctuation ])

print(text1)


hello     universe how are you everything ok in your side


In [20]:
import string

text = df['review'][0]
text1 = ''.join([char for char in text if char not in string.punctuation])
text1

'One of the other reviewers has mentioned that after watching just 1 Oz episode youll be hooked They are right as this is exactly what happened with mebr br The first thing that struck me about Oz was its brutality and unflinching scenes of violence which set in right from the word GO Trust me this is not a show for the faint hearted or timid This show pulls no punches with regards to drugs sex or violence Its is hardcore in the classic use of the wordbr br It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary It focuses mainly on Emerald City an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda Em City is home to manyAryans Muslims gangstas Latinos Christians Italians Irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awaybr br I would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare Fo

## 3.2. Removing Numbers
Why: In many NLP tasks, numbers are not useful unless the task specifically involves numerical data. For example, for text classification, numbers like "123" might not carry useful meaning.

Example:

"I have 2 apples." → "I have apples."
How to do it?:

- Use regular expressions (re.sub) to remove numbers.

In [18]:
import re

text_num = "I have Rs.100000 in My Bank Account."

text = re.sub(r'\d+','',text_num)  # replace each run of digits with an empty string (any other replacement string also works)

print(text)



I have Rs. in My Bank Account.


In [21]:
import re

text1 = df['review'][0]

text = re.sub(r'\d+','', text1)
text


"One of the other reviewers has mentioned that after watching just  Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

## 3.3 Removing Extra Whitespace
Why: Text data often contains unnecessary spaces, tabs, or newlines. Removing extra whitespace ensures the text is clean and uniform.

Example:

"I    have a pen." → "I have a pen."
How to do it?:

- Use regular expressions (re.sub) to replace multiple spaces with a single space.

In [24]:
import re

text1 = "hello!     universe. how are you ????.everything ok in your side."

text = re.sub(r' {2,}',' ',text1)  # replace each run of two or more spaces with a single space
text

'hello! universe. how are you ????.everything ok in your side.'

### Other method (handles tabs and newlines too)

In [26]:
import re

text1 = "hello!     universe. how are you ????.everything ok in your side."

text = re.sub(r'\s+',' ',text1).strip()
text


'hello! universe. how are you ????.everything ok in your side.'

In [27]:
import re
text1 = df['review'][0]

text = re.sub(r'\s+',' ',text1).strip()

text


"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

## 3.4 Removing URLs, HTML Tags, Email IDs, and Hashtags (Optional)
Why: In social media or web-based text, URLs (e.g., http://example.com), mentions (e.g., @user), and hashtags (e.g., #hashtag) are often not needed for most NLP tasks.

How to do it?:

- Use regular expressions to remove these elements.

In [34]:
import re

text1 = """<gfg>
#GFG Geeks Learning together
url <https://www.geeksforgeeks.org/>,
email <acs@sdf.dv>"""

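# Remove the characters < . # * ? > one by one.
# Note: [<.#*?>] is a character class, so it strips those individual characters
# (including the dots inside the URL) rather than whole HTML tags.

html_1 = re.compile('[<.#*?>]')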

text_html_removed = html_1.sub(r'',text1)
text_html_removed




'gfg  \nGFG Geeks Learning together  \nurl https://wwwgeeksforgeeksorg/,  \nemail acs@sdfdv'
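
The output above loses the dots in the URL and email address because the character class removes single characters. To remove whole tags, a pattern that matches from `<` to the next `>` can be used instead. This is a minimal sketch with a hypothetical helper `strip_html_tags`; for messy real-world HTML, a proper parser is usually more robust than a regex.

```python
import re

# Matches '<', then any run of characters other than '>', then '>'.
TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    """Replace each HTML tag with a space so neighbouring words stay separate."""
    return TAG_PATTERN.sub(' ', text)

sample = "what happened with me.<br /><br />The first thing"
print(strip_html_tags(sample))  # Output: "what happened with me.  The first thing"
```

Each tag becomes one space, so consecutive tags leave a double space that the whitespace step from section 3.3 can collapse.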

In [40]:
# remove URL from the text:

text_html_removed

url_1 = re.compile(r'https?://\S+|www\.\S+')

text_url_removed = url_1.sub(r' ', text_html_removed)

text_url_removed

'gfg  \nGFG Geeks Learning together  \nurl    \nemail acs@sdfdv'

In [46]:
# remove email id:

text_url_removed  = "My email id is spra155@gmail.com"

# Use 0-9 (not 0-2) so digits in the local part match; \. matches a literal dot.
email = re.compile(r'[A-Za-z0-9._%+-]+@[\w-]+\.[\w.-]+')
text = email.sub(r'',text_url_removed)
text


'My email id is '

In [48]:
import re
text = df['review'][0]
def clean_text(text):
    # Remove HTML tags such as <br />, replacing each tag with a space
    html = re.compile(r'<.*?>')
    text = html.sub(r' ',text)
    # Remove URLs:
    url = re.compile(r'https?://\S+|www\.\S+')
    text = url.sub(r'',text)
    # Remove email IDs:
    email = re.compile(r'[A-Za-z0-9._%+-]+@[\w-]+\.[\w.-]+')
    text = email.sub(r'',text)
    return text


In [49]:
clean_text(text)

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.  The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.  It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.  I would say the main appeal of the show is due to the fact that it goes where other

# C. **Text Preprocessing**

## C.1. Tokenization

### **Using the split() function**

In [57]:
## sentence tokenization

text = """ Hi I am subhash. i am from Nagda junction. currently I am leaving at MR 10 square, Indore(MP)"""

tokenize_text = text.split('.')

tokenize_text

[' Hi I am subhash',
 ' i am from Nagda junction',
 ' currently I am leaving at MR 10 square, Indore(MP)']

In [55]:
sent2 = 'I am going to jaipur. I will stay there for 3 days. Let\'s hope the trip to be great'
sent2.split('.')

['I am going to jaipur',
 ' I will stay there for 3 days',
 " Let's hope the trip to be great"]

In [61]:
text = df['review'][0]
token_text = text.split('.')
token_text

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked",
 ' They are right, as this is exactly what happened with me',
 '<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO',
 ' Trust me, this is not a show for the faint hearted or timid',
 ' This show pulls no punches with regards to drugs, sex or violence',
 ' Its is hardcore, in the classic use of the word',
 '<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary',
 ' It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda',
 ' Em City is home to many',
 '',
 'Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more',
 '',
 '',
 '',
 'so scuffles, death stares, dodgy dealings and shady agreements are never far away',
 "<br /><

### **Using regular expressions**

In [64]:
# word tokenization.

import re

text = "Hi I am subhash!!"

w_token_text = re.findall(r'[\w]+',text)
w_token_text

['Hi', 'I', 'am', 'subhash']

In [65]:
import re

text = df['review'][0]

w_token_text = re.findall(r'[\w]+',text)
w_token_text

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 'you',
 'll',
 'be',
 'hooked',
 'They',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me',
 'br',
 'br',
 'The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO',
 'Trust',
 'me',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'This',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'Its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word',
 'br',
 'br',
 'It',
 'is',
 'called',
 'OZ',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentary',
 'It'

In [69]:
# sentence tokenization.
import re

text = df['review'][0]
s_token_text = re.compile(r'[.!?]').split(text)
s_token_text

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked",
 ' They are right, as this is exactly what happened with me',
 '<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO',
 ' Trust me, this is not a show for the faint hearted or timid',
 ' This show pulls no punches with regards to drugs, sex or violence',
 ' Its is hardcore, in the classic use of the word',
 '<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary',
 ' It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda',
 ' Em City is home to many',
 '',
 'Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more',
 '',
 '',
 '',
 'so scuffles, death stares, dodgy dealings and shady agreements are never far away',
 "<br /><

In [70]:
import re

text = "I am going to jaipur. I will stay there for 3 days ?? Let's hope! the trip to be great "

s_token_text = re.compile('[.!?]').split(text)
s_token_text

['I am going to jaipur',
 ' I will stay there for 3 days ',
 '',
 " Let's hope",
 ' the trip to be great ']
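
The split above leaves an empty string and stray spaces in the result. One possible refinement, sketched here on the same sample text, is to split on runs of sentence-ending punctuation and drop blank pieces.

```python
import re

text = "I am going to jaipur. I will stay there for 3 days ?? Let's hope! the trip to be great "

# Split on runs of sentence-ending punctuation, then strip and drop blank pieces.
sentences = [part.strip() for part in re.split(r'[.!?]+', text) if part.strip()]
print(sentences)
# Output: ['I am going to jaipur', 'I will stay there for 3 days', "Let's hope", 'the trip to be great']
```

This still splits on every period, so abbreviations such as "Dr." would be cut incorrectly; that is one reason to prefer a trained sentence tokenizer for real text.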

### **Using the NLTK libraries**

In [72]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [74]:
# word tokenization
text = "I am going to jaipur. "

new_1 = word_tokenize(text)
new_1

['I', 'am', 'going', 'to', 'jaipur', '.']

In [77]:
# sentence Tokenization.

text = "I am going to jaipur. I will stay there for 3 days ? Let's hope the trip to be great "

setn_token_1 = sent_tokenize(text)

setn_token_1

['I am going to jaipur.',
 'I will stay there for 3 days ?',
 "Let's hope the trip to be great"]

# **D. Stopwords Removal**
Why: Stopwords are common words (e.g., "the", "is", "on") that do not carry much meaningful information for many NLP tasks. Removing them helps reduce the noise in the text.

Example:

"I have a pen." → ["I", "pen"]
How to do it?:

- We can use a list of stopwords, like those available in the nltk library, and remove them from our text.

In [80]:
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [81]:
text = "I am very good in Chess and Reading but for a ."

tokens = word_tokenize(text)

stop_words = set(stopwords.words('english'))

filtered_word = [word for word in tokens if word not in stop_words]

filtered_word

['I', 'good', 'Chess', 'Reading', '.']
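
In the output above, 'I' survives because the stopword list is lowercase and the membership test is case-sensitive. The sketch below shows a case-insensitive variant. To stay self-contained it uses a small hand-written `stop_words` set, which is an assumption standing in for the nltk list.

```python
# Small illustrative subset of an English stopword list (all lowercase).
stop_words = {"i", "am", "very", "in", "and", "but", "for", "a"}

tokens = ["I", "am", "very", "good", "in", "Chess", "and", "Reading", "but", "for", "a", "."]

# Lowercase each token only for the membership test; kept tokens retain their case.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # Output: ['good', 'Chess', 'Reading', '.']

# Optionally drop punctuation-only tokens as well.
words_only = [t for t in filtered if any(ch.isalnum() for ch in t)]
print(words_only)  # Output: ['good', 'Chess', 'Reading']
```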

In [85]:
text = df['review'][0]

tokens = word_tokenize(text)

stop_words = set(stopwords.words('english'))

# Case-sensitive filter: capitalized stopwords such as 'The' and 'They' are kept.
filtered_words = [word for word in tokens if word not in stop_words]

# To drop capitalized stopwords as well, compare the lowercased token instead:
# filtered_words = [token for token in tokens if token.lower() not in stop_words]

filtered_words

['One',
 'reviewers',
 'mentioned',
 'watching',
 '1',
 'Oz',
 'episode',
 "'ll",
 'hooked',
 '.',
 'They',
 'right',
 ',',
 'exactly',
 'happened',
 'me.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'The',
 'first',
 'thing',
 'struck',
 'Oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 ',',
 'set',
 'right',
 'word',
 'GO',
 '.',
 'Trust',
 ',',
 'show',
 'faint',
 'hearted',
 'timid',
 '.',
 'This',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 ',',
 'sex',
 'violence',
 '.',
 'Its',
 'hardcore',
 ',',
 'classic',
 'use',
 'word.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'It',
 'called',
 'OZ',
 'nickname',
 'given',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentary',
 '.',
 'It',
 'focuses',
 'mainly',
 'Emerald',
 'City',
 ',',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 ',',
 'privacy',
 'high',
 'agenda',
 '.',
 'Em',
 'City',
 'home',
 'many',
 '..',
 'Aryans',
 ',',
 'Muslims',
 ',

# **E. Stemming**

Why: Stemming and lemmatization reduce words to their base or root form. Stemming can sometimes produce non-dictionary words, while lemmatization returns valid words.

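Example:

"running" → "run"
"better" → "good" (lemmatization only; a stemmer cannot map "better" to "good")

Stemming:

- Stemming strips word endings using fixed rules rather than a dictionary, so the result may not be a real word (e.g. "very" → "veri"). The Porter stemmer used here removes suffixes only.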

In [99]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

text = "I am very cool in Singing and Reading but for a .walk walks walking walked"
words = word_tokenize(text)

stemmed_words = [ps.stem(word) for word in words]
stemmed_words


['i',
 'am',
 'veri',
 'cool',
 'in',
 'sing',
 'and',
 'read',
 'but',
 'for',
 'a',
 '.walk',
 'walk',
 'walk',
 'walk']

In [101]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

text = df['review'][0]
words = word_tokenize(text)

stemmed_words = [ps.stem(word) for word in words]
stemmed_words

['one',
 'of',
 'the',
 'other',
 'review',
 'ha',
 'mention',
 'that',
 'after',
 'watch',
 'just',
 '1',
 'oz',
 'episod',
 'you',
 "'ll",
 'be',
 'hook',
 '.',
 'they',
 'are',
 'right',
 ',',
 'as',
 'thi',
 'is',
 'exactli',
 'what',
 'happen',
 'with',
 'me.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'the',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'oz',
 'wa',
 'it',
 'brutal',
 'and',
 'unflinch',
 'scene',
 'of',
 'violenc',
 ',',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'go',
 '.',
 'trust',
 'me',
 ',',
 'thi',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'heart',
 'or',
 'timid',
 '.',
 'thi',
 'show',
 'pull',
 'no',
 'punch',
 'with',
 'regard',
 'to',
 'drug',
 ',',
 'sex',
 'or',
 'violenc',
 '.',
 'it',
 'is',
 'hardcor',
 ',',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'it',
 'is',
 'call',
 'oz',
 'as',
 'that',
 'is',
 'the',
 'nicknam',
 'g

# **F. Lemmatization:**

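- Lemmatization reduces a word to its dictionary form (lemma) using a vocabulary lookup and, when supplied, the word's part of speech. NLTK's `WordNetLemmatizer` assumes every word is a noun unless a `pos` argument is passed.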



In [102]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()



[nltk_data] Downloading package wordnet to /root/nltk_data...


In [103]:
text = "I am very cool in Singing and Reading but for a .walk walks walking walked"
words = word_tokenize(text)

lemm_words = [lm.lemmatize(word) for word in words]
lemm_words

['I',
 'am',
 'very',
 'cool',
 'in',
 'Singing',
 'and',
 'Reading',
 'but',
 'for',
 'a',
 '.walk',
 'walk',
 'walking',
 'walked']
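
The output above leaves 'walking' and 'walked' unchanged because the lemmatizer looked them up as nouns. The toy sketch below imitates that behaviour with a small lookup table keyed by word and part of speech. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the entries stand in for a real dictionary such as WordNet.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
LEMMA_TABLE = {
    ("walks", "n"): "walk",
    ("walks", "v"): "walk",
    ("walking", "v"): "walk",
    ("walked", "v"): "walk",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

words = ["walk", "walks", "walking", "walked"]
print([toy_lemmatize(w) for w in words])           # Output: ['walk', 'walk', 'walking', 'walked']
print([toy_lemmatize(w, pos="v") for w in words])  # Output: ['walk', 'walk', 'walk', 'walk']
print(toy_lemmatize("better", pos="a"))            # Output: good
```

With NLTK the corresponding step is to pass the part of speech explicitly, for example `lm.lemmatize(word, pos='v')` for verbs.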

In [105]:
text = df['review'][0]
words = word_tokenize(text)

lem_filtered = [lm.lemmatize(word) for word in words]
lem_filtered

['One',
 'of',
 'the',
 'other',
 'reviewer',
 'ha',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 'you',
 "'ll",
 'be',
 'hooked',
 '.',
 'They',
 'are',
 'right',
 ',',
 'a',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'wa',
 'it',
 'brutality',
 'and',
 'unflinching',
 'scene',
 'of',
 'violence',
 ',',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO',
 '.',
 'Trust',
 'me',
 ',',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 '.',
 'This',
 'show',
 'pull',
 'no',
 'punch',
 'with',
 'regard',
 'to',
 'drug',
 ',',
 'sex',
 'or',
 'violence',
 '.',
 'Its',
 'is',
 'hardcore',
 ',',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'It',
 'is',
 'called',
 'OZ',
 'a',
 'that',
 'i

# **Advanced Preprocessing Techniques**

## Spelling Correction
What it is: Spelling correction detects and fixes misspelled words in text data, which often contains typos or informal language (like "gr8" instead of "great").

Why it’s important: Spelling errors can confuse models, especially if the misspelled words don’t match any known tokens in pre-trained models.

How to handle it:

- Use spell-checking libraries like TextBlob or pyspellchecker to automatically detect and correct misspelled words.
- Informal abbreviations such as "gr8" are often missing from a spell checker's dictionary, so they may need a separate slang lookup table.

Example:

"I hav a gr8 day" → "I have a great day"

In [112]:
from textblob import TextBlob

text = "You arr a goed Person but do nothhing.why?"

blob = TextBlob(text)

correct_words = blob.correct()

correct_words


TextBlob("You are a good Person but do nothing.why?")
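To make the idea of dictionary-based correction concrete, here is a minimal sketch that uses only Python's standard `difflib` module. It is not how TextBlob works internally: the tiny `VOCABULARY` list and the `correct_word` helper are assumptions made for illustration, and a real corrector would use a full dictionary with word frequencies.

```python
import difflib

# Hypothetical mini-dictionary; a real spell checker would use a full word list
VOCABULARY = ["you", "are", "a", "good", "person", "but", "do", "nothing", "why"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


text = "You arr a goed person but do nothhing"
corrected = " ".join(correct_word(token) for token in text.split())
print(corrected)  # you are a good person but do nothing
```

The `cutoff` parameter controls how similar a candidate must be before it replaces the input. Setting it too low causes valid but unknown words to be "corrected" into something else.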

## Handling Emojis and Emoticons
What it is: Emojis and emoticons (like 😀, 😢, or ❤️) are used in modern communication, especially in social media. They convey sentiment and meaning, but many text-processing tools ignore or mishandle them unless they are converted to words.

Why it’s important: Emojis carry important sentiment or information, especially in tasks like sentiment analysis or social media text classification.

How to handle it:

- Convert Emojis to Text: Use a library like `emoji` to convert emojis into a text representation (e.g., 😄 → ":grinning_face_with_smiling_eyes:").

Example:

Before: "I am so happy today 😄"

After handling emojis: "I am so happy today :grinning_face_with_smiling_eyes:"

In [114]:
pip install emoji


In [115]:
import emoji

text = "I am so happy today 😄"

dm_text = emoji.demojize(text)

dm_text

'I am so happy today :grinning_face_with_smiling_eyes:'

## Part-of-Speech Tagging (POS Tagging)
What it is: POS tagging assigns a grammatical category (such as noun, verb, or adjective) to each word in a sentence.

Why it’s important: POS tagging helps understand the structure of a sentence. For example, identifying whether a word is a noun or verb can help with tasks like Named Entity Recognition (NER) and syntactic parsing.

How to handle it:

- Use libraries like spaCy or nltk for POS tagging to identify the role of each word in the sentence.

Example: "The quick brown fox jumps over the lazy dog."

- "The" → Determiner (DET)
- "quick" → Adjective (ADJ)
- "fox" → Noun (NOUN)

In [116]:
import spacy

# load a pre trained model
nlp= spacy.load("en_core_web_sm")
text = "I am very cool in Singing and Reading but for a .walk walks walking walked"

doc = nlp(text)

# extract the pos tags:

for token in doc:
  print(token.text, token.pos_)

I PRON
am AUX
very ADV
cool ADJ
in ADP
Singing PROPN
and CCONJ
Reading PROPN
but CCONJ
for ADP
a DET
.walk NOUN
walks NOUN
walking VERB
walked VERB


## Named Entity Recognition (NER)
What it is: NER identifies entities such as people, organizations, locations, dates, and monetary amounts in text and labels each one with its type.

Why it’s important: NER is used in tasks like information extraction, document summarization, and question answering. For example, "Apple" might refer to the company or the fruit, so an NER model relies on the surrounding context to decide whether to label it as an organization.

How to handle it:

- Use NER tools (like spaCy) to detect and label entities in the text.

Example: "Apple is looking at buying U.K. startup for $1 billion."

- Apple → ORG (Organization)
- U.K. → GPE (Geopolitical Entity)
- $1 billion → MONEY

In [124]:
import spacy

# Load a pre-trained model
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying U.K. startup for $1 billion"


# Process the text
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)


Apple ORG
U.K. GPE
$1 billion MONEY


# All Preprocessing in One Function

In [130]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [131]:
import string

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk


def preprocess(text):
    """Run the full preprocessing pipeline on one string and return the intermediate results."""
    # tokenize the text
    tokens = word_tokenize(text)

    # remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

    # perform stemming and lemmatization
    # (the stems are returned for comparison; the later steps use the lemmas)
    stemmer = SnowballStemmer('english')
    lemmatizer = WordNetLemmatizer()
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # remove digits and punctuation
    cleaned_tokens = [token for token in lemmatized_tokens
                      if not token.isdigit() and token not in string.punctuation]

    # convert all tokens to lowercase
    lowercase_tokens = [token.lower() for token in cleaned_tokens]

    # perform part-of-speech (POS) tagging
    pos_tags = pos_tag(lowercase_tokens)

    # perform named entity recognition (NER); the tokens are already lowercased here,
    # so capitalization cues are gone, which is likely why no entities are found below
    named_entities = ne_chunk(pos_tags)

    return {"stems": stemmed_tokens, "tokens": lowercase_tokens,
            "pos_tags": pos_tags, "named_entities": named_entities}


# sample text to be preprocessed
text = 'GeeksforGeeks is a very famous edutech company in the IT industry.'
result = preprocess(text)

# print the preprocessed text
print("Original text:", text)
print("Preprocessed tokens:", result["tokens"])
print("POS tags:", result["pos_tags"])
print("Named entities:", result["named_entities"])


Original text: GeeksforGeeks is a very famous edutech company in the IT industry.
Preprocessed tokens: ['geeksforgeeks', 'famous', 'edutech', 'company', 'industry']
POS tags: [('geeksforgeeks', 'NNS'), ('famous', 'JJ'), ('edutech', 'JJ'), ('company', 'NN'), ('industry', 'NN')]
Named entities: (S geeksforgeeks/NNS famous/JJ edutech/JJ company/NN industry/NN)


# **What is Text Representation?**
Text representation is the process of converting raw text (which is in the form of strings or characters) into a numerical format that computers and machine learning models can understand. The goal is to represent text in a way that captures important information, such as meaning, relationships, and context.

## Why is Text Representation Important?
In NLP, we work with text, but machines understand numbers. Therefore, we need to convert text into numbers (vectors). The better the representation, the better our models can understand and predict based on text data.

## Types of Text Representation
There are several methods for representing text, ranging from simple techniques like Bag of Words to more advanced methods like word embeddings and contextualized representations. Let’s go over the most common ones, starting from the basics.

## **1. Bag of Words (BoW)**
What it is: The Bag of Words model represents text by treating it as an unordered collection (or "bag") of words, ignoring grammar and word order. Each unique word is assigned an index in a vocabulary, and each text is represented as a vector of word frequencies or binary indicators (whether a word appears or not).

Why it’s important: BoW is one of the simplest text representation methods and can be a good starting point. It’s useful for many simple NLP tasks like text classification.

How it works:

- First, build a vocabulary (a list of unique words from the entire dataset).
- Then, represent each document as a vector of word frequencies or binary values (1 if the word appears in the document, 0 if not).

Example: Consider these two sentences:

- "I love NLP"
- "NLP is fun"

The vocabulary would be: ["I", "love", "NLP", "is", "fun"]

Now, we represent each sentence as a vector of word counts:

- Sentence 1: [1, 1, 1, 0, 0] (1 for "I", "love", and "NLP")
- Sentence 2: [0, 0, 1, 1, 1] (1 for "NLP", "is", and "fun")
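The two steps above can be sketched by hand with the standard library. This is a minimal illustration that splits on whitespace and keeps the original casing; the `to_bow_vector` helper is a name introduced here, not part of any library.

```python
from collections import Counter

sentences = ["I love NLP", "NLP is fun"]

# Step 1: build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)


def to_bow_vector(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


# Step 2: turn every sentence into a count vector over the vocabulary
vectors = [to_bow_vector(sentence, vocabulary) for sentence in sentences]
print(vocabulary)  # ['I', 'love', 'NLP', 'is', 'fun']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Library implementations such as scikit-learn's `CountVectorizer` do the same thing but also lowercase the text and sort the vocabulary alphabetically, so the column order differs from this hand-built version.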

In [19]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

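# list of the sentences.
# "my" is used instead of "I" because CountVectorizer's default tokenizer
# lowercases the text and drops one-letter tokens such as "I".
sentences = ["my love NLP", "NLP is fun"]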

vectorizer = CountVectorizer()

# fit the model and transform the text into the vectors.

X = vectorizer.fit_transform(sentences)

# convert result into the array.

# print(X.toarray())

vectorizer.vocabulary_


{'my': 3, 'love': 2, 'nlp': 4, 'is': 1, 'fun': 0}

In [20]:
print(X.toarray())

[[0 0 1 1 1]
 [1 1 0 0 1]]


In [15]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'], 'output':[1,1,0,0]})
df


| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [16]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

bow = cv.fit_transform(df['text'])


In [17]:
cv.vocabulary_

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}

In [22]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())


[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]


In [23]:
# use new sentence
cv.transform(['campusx watch and write comment of campusx']).toarray()

array([[2, 1, 0, 1, 1]])

# Bag of N-Grams (Unigrams, Bigrams, Trigrams)

In [26]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


In [27]:
df = pd.DataFrame({'text':['people watch campusx','campusx watch campusx','people write comment','campusx write comment'], 'output':[1,1,0,0]})
df

| | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


## Unigrams

In [31]:

cv = CountVectorizer(ngram_range=(1,1))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
print(len(cv.vocabulary_))
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())
# use new sentence
print()
cv.transform(['campusx watch and write comment of campusx']).toarray()

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}
5
[[1 0 1 1 0]]
[[2 0 0 1 0]]
[[0 1 1 0 1]]
[[1 1 0 0 1]]



array([[2, 1, 0, 1, 1]])

## Unigrams and Bigrams

In [32]:
cv = CountVectorizer(ngram_range=(1,2))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
print(len(cv.vocabulary_))
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())
# use new sentence
print()
cv.transform(['campusx watch and write comment of campusx']).toarray()

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'write': 9, 'comment': 3, 'people write': 6, 'write comment': 10, 'campusx write': 2}
11
[[1 0 0 0 1 1 0 1 1 0 0]]
[[2 1 0 0 0 0 0 1 1 0 0]]
[[0 0 0 1 1 0 1 0 0 1 1]]
[[1 0 1 1 0 0 0 0 0 1 1]]



array([[2, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1]])

## Bigrams

In [33]:
cv = CountVectorizer(ngram_range=(2,2))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
print(len(cv.vocabulary_))
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())
# use new sentence
print()
cv.transform(['campusx watch and write comment of campusx']).toarray()

{'people watch': 2, 'watch campusx': 4, 'campusx watch': 0, 'people write': 3, 'write comment': 5, 'campusx write': 1}
6
[[0 0 1 0 1 0]]
[[1 0 0 0 1 0]]
[[0 0 0 1 0 1]]
[[0 1 0 0 0 1]]



array([[1, 0, 0, 0, 0, 1]])

## Trigrams

In [34]:
cv = CountVectorizer(ngram_range=(3,3))
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)
print(len(cv.vocabulary_))
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())
print(bow[3].toarray())
# use new sentence
print()
cv.transform(['campusx watch and write comment of campusx']).toarray()

{'people watch campusx': 2, 'campusx watch campusx': 0, 'people write comment': 3, 'campusx write comment': 1}
4
[[0 0 1 0]]
[[1 0 0 0]]
[[0 0 0 1]]
[[0 1 0 0]]



array([[0, 0, 0, 0]])
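The n-gram vocabularies above can be reproduced by hand. The sketch below assumes simple whitespace tokenization, and `word_ngrams` is a helper name introduced here for illustration: an n-gram is just a window of `n` adjacent tokens slid across the text.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "people watch campusx"
print(word_ngrams(sentence, 1))  # ['people', 'watch', 'campusx']
print(word_ngrams(sentence, 2))  # ['people watch', 'watch campusx']
print(word_ngrams(sentence, 3))  # ['people watch campusx']
```

This also explains the vectors for the new sentence: "campusx watch and write comment of campusx" contains only two bigrams seen during fitting ("campusx watch" and "write comment") and no known trigram, so its trigram vector is all zeros.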

## **2. Term Frequency-Inverse Document Frequency (TF-IDF)**
What it is: TF-IDF is an improvement over Bag of Words. While BoW simply counts word occurrences, TF-IDF multiplies each word's frequency in a document by a weight that shrinks as the word appears in more documents. This reduces the importance of common words (like "the," "is," etc.), which are often less informative.

TF (Term Frequency): How often a word appears in a document.

IDF (Inverse Document Frequency): How rare the word is across all documents. Words that appear in many documents get lower weights.

Why it’s important: TF-IDF is often more informative than BoW because it reduces the influence of common words and highlights the most distinctive words in each document.

Example:

- "I love NLP"
- "NLP is fun"

The word "NLP" appears in both sentences, so it has the lowest IDF and is down-weighted. "love", "is", and "fun" each appear in only one sentence, so they receive higher weights and do more to distinguish the two sentences. In the code below, "I" does not appear in the vocabulary at all because `TfidfVectorizer`'s default tokenizer drops one-letter tokens.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

# list of the sentences:

sentence = ["I Love NLP ","NLP is Fun"]

#Initialize a TF-IDF Vectorizer
tf = TfidfVectorizer()

# Fit the model and transform the sentences into vectors

X = tf.fit_transform(sentence)

print(tf.vocabulary_)
print()
# Convert the result to an array and print
print(X.toarray())

print(tf.get_feature_names_out())

{'love': 2, 'nlp': 3, 'is': 1, 'fun': 0}

[[0.         0.         0.81480247 0.57973867]
 [0.6316672  0.6316672  0.         0.44943642]]
['fun' 'is' 'love' 'nlp']
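To see where these numbers come from, the sketch below recomputes them with the standard library only. It assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the `idf` and `tfidf_vector` helpers are names introduced here for illustration.

```python
import math

# Lowercased tokens; the one-letter token "i" is left out to mirror
# TfidfVectorizer's default tokenizer, which keeps only tokens of 2+ characters
docs = [["love", "nlp"], ["nlp", "is", "fun"]]
vocabulary = sorted({word for doc in docs for word in doc})  # ['fun', 'is', 'love', 'nlp']
n_docs = len(docs)


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_vector(doc):
    """Raw term counts times IDF, scaled to unit Euclidean (L2) length."""
    weights = [doc.count(word) * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


for doc in docs:
    print([round(w, 4) for w in tfidf_vector(doc)])
# [0.0, 0.0, 0.8148, 0.5797]
# [0.6317, 0.6317, 0.0, 0.4494]
```

"nlp" occurs in both documents, so its IDF is exactly 1, while the words that occur in a single document get an IDF of about 1.405. After normalization this gives "love" a weight of 0.81 against 0.58 for "nlp" in the first sentence.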


# **Word Embeddings (Word2Vec, GloVe, FastText)**
- What it is: Word embeddings represent words as dense, continuous vectors (typically tens to hundreds of dimensions), capturing semantic relationships between words. Words with related meanings (e.g., "king" and "queen") will have similar vector representations.

- Word2Vec (by Google) and GloVe (by Stanford) are two of the most popular word embedding techniques.
- FastText (by Facebook) improves on Word2Vec by also considering subword information, making it useful for handling rare words.

- Why it’s important: Word embeddings capture the meaning of words in a way that count-based methods like BoW and TF-IDF cannot. They are widely used in NLP for tasks like sentiment analysis, machine translation, and question answering.

- Example: In Word2Vec, the vectors for words like "king" and "queen" will be close to each other in the vector space, showing that they have related meanings.

- Output: In the code below, a vector representing the word "nlp" (100 dimensions by default in gensim).

- Use Case: Word embeddings are often used for tasks that require understanding the semantic meaning of words, such as text classification and sentiment analysis.
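"Close to each other in the vector space" is usually measured with cosine similarity. The sketch below shows the computation on made-up 3-dimensional vectors; the numbers in `toy_vectors` are assumptions chosen for illustration and do not come from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only; real embeddings are learned from data
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))  # close to 1
print(cosine_similarity(toy_vectors["king"], toy_vectors["apple"]))  # much lower
```

With a trained gensim model, the same comparison is available through `model.wv.similarity("king", "queen")`, provided both words are in the model's vocabulary.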

In [50]:
from gensim.models import Word2Vec

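# Sample corpus (list of tokenized sentences)
sentences = [["i", "love", "nlp"], ["nlp", "is", "fun"]]

# min_count=1 keeps every word, even those that appear only once.
# With a corpus this small the vectors are not meaningful; real models
# are trained on millions of sentences.
model = Word2Vec(sentences, min_count=1)

# Get the vector for the word 'nlp'
vector = model.wv['nlp']

vector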

array([-5.3622725e-04,  2.3643136e-04,  5.1033497e-03,  9.0092728e-03,
       -9.3029495e-03, -7.1168090e-03,  6.4588725e-03,  8.9729885e-03,
       -5.0154282e-03, -3.7633716e-03,  7.3805046e-03, -1.5334714e-03,
       -4.5366134e-03,  6.5540518e-03, -4.8601604e-03, -1.8160177e-03,
        2.8765798e-03,  9.9187379e-04, -8.2852151e-03, -9.4488179e-03,
        7.3117660e-03,  5.0702621e-03,  6.7576934e-03,  7.6286553e-04,
        6.3508903e-03, -3.4053659e-03, -9.4640139e-04,  5.7685734e-03,
       -7.5216377e-03, -3.9361035e-03, -7.5115822e-03, -9.3004224e-04,
        9.5381187e-03, -7.3191668e-03, -2.3337686e-03, -1.9377411e-03,
        8.0774371e-03, -5.9308959e-03,  4.5162440e-05, -4.7537340e-03,
       -9.6035507e-03,  5.0072931e-03, -8.7595852e-03, -4.3918253e-03,
       -3.5099984e-05, -2.9618145e-04, -7.6612402e-03,  9.6147433e-03,
        4.9820580e-03,  9.2331432e-03, -8.1579173e-03,  4.4957981e-03,
       -4.1370760e-03,  8.2453608e-04,  8.4986202e-03, -4.4621765e-03,
      

#  **Contextualized Word Embeddings (BERT, ELMo)**
- What it is: Contextualized word embeddings like BERT (Bidirectional Encoder Representations from Transformers) and ELMo (Embeddings from Language Models) represent words in a way that depends on the context in which they appear. For example, the word "bank" will have different embeddings depending on whether it's in the sentence "I went to the bank" (financial institution) or "The river bank" (side of a river).

- Why it’s important: Unlike static embeddings (Word2Vec, GloVe), contextualized embeddings can differentiate the meaning of words depending on their surrounding context.

- Example:
  - "I went to the bank to withdraw money" → "bank" refers to a financial institution.
  - "The river bank was beautiful" → "bank" refers to the side of the river.

Code Example (using Hugging Face’s transformers for BERT):

- BERT will give different embeddings for the word "bank" in context. The code below prints the embeddings of every token in the first sentence; running it on the second sentence as well would give a different vector at the position of "bank".

In [54]:
from transformers import BertTokenizer, BertModel

# load the pre-trained BERT model and tokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


# encode the text and get embeddings.

text = " I went to the Bank to withdraw money."
inputs = tokenizer(text, return_tensors="pt")  # "pt" returns PyTorch tensors
outputs = model(**inputs)

# Get the embedding for each token (word piece): shape is (batch, tokens, 768)
embeddings = outputs.last_hidden_state
print(embeddings)


tensor([[[ 0.0986,  0.1992, -0.2363,  ..., -0.2191,  0.2045,  0.6251],
         [ 0.7358, -0.3095, -0.4123,  ..., -0.3383,  0.5981,  0.1020],
         [ 0.5886, -0.6031,  0.0208,  ..., -0.2261, -0.5764,  0.3983],
         ...,
         [ 0.8027, -0.6078,  0.0245,  ..., -0.0429, -0.5629,  0.2651],
         [ 0.6701,  0.1827, -0.4874,  ...,  0.1279, -0.2757, -0.5079],
         [ 0.7381,  0.1215,  0.0035,  ...,  0.2420, -0.4342, -0.5785]]],
       grad_fn=<NativeLayerNormBackward0>)


# **Sentence Embeddings (SBERT, USE)**
- What it is: While word embeddings represent individual words, sentence embeddings provide a single vector representation for an entire sentence. This is particularly useful when you need to understand the meaning of whole sentences rather than individual words.

- Why it’s important: Sentence embeddings allow models to capture sentence-level meaning, which is essential for tasks like semantic similarity, question answering, and document classification.
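One simple way to get from token vectors to a single sentence vector is mean pooling, that is, averaging the token vectors dimension by dimension. The sketch below uses made-up 3-dimensional vectors (the values in `token_vectors` are assumptions for illustration); SBERT-style models apply a pooling step of this kind on top of contextual token embeddings.

```python
def mean_pool(vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n_tokens = len(vectors)
    return [sum(values) / n_tokens for values in zip(*vectors)]


# Made-up token vectors for illustration only
token_vectors = {
    "i": [0.25, 0.5, 0.0],
    "love": [1.0, 0.25, 0.5],
    "nlp": [0.25, 0.75, 1.0],
}

sentence_vector = mean_pool([token_vectors[token] for token in ["i", "love", "nlp"]])
print(sentence_vector)  # [0.5, 0.5, 0.5]
```

The result has the same number of dimensions as each token vector, no matter how many tokens the sentence contains, which is what makes sentences of different lengths directly comparable.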

Popular Techniques:

- SBERT (Sentence-BERT): A modification of BERT that generates high-quality sentence embeddings.
- Universal Sentence Encoder (USE): A model by Google that generates sentence-level embeddings.

Code Example (using SBERT): the code cell after the summary below uses SBERT to generate one embedding that captures the meaning of each entire sentence.

# Summary of Text Representation Techniques:
- Bag of Words (BoW): Represents text as a vector of word counts (simple, but ignores word order and meaning).
- TF-IDF: Weighs word frequencies by how rare each word is across the corpus (down-weights words that appear in many documents).
- Word Embeddings: Uses vectors to represent words based on their meaning (Word2Vec, GloVe).
- Contextualized Word Embeddings: Provides different representations of the same word depending on its context (BERT, ELMo).
- Sentence Embeddings: Represents entire sentences as vectors, capturing their overall meaning (SBERT, USE).

In [55]:
from sentence_transformers import SentenceTransformer

# load the pre-trained SBERT model.

model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode sentences into vectors.

sentences  = ["I Love NLP","NLP is Fun"]
sentence_embeddings = model.encode(sentences)

# print embeddings

for i, sentence in enumerate(sentences):
  print(f"Sentence: {sentence}")
  print(f"Embeddings: {sentence_embeddings[i]}")


Sentence: I Love NLP
Embeddings: [-4.13823361e-03 -2.96884011e-02  5.89311905e-02 -4.00694571e-02
  8.18729848e-02  2.78293621e-03  6.77005872e-02  4.80399020e-02
  6.54386505e-02  3.86305675e-02 -9.38955508e-03 -1.80722810e-02
 -6.96899816e-02  6.42962456e-02  1.92711912e-02  8.19071308e-02
 -2.53259111e-02  6.97762473e-04 -3.26028466e-02 -3.14748436e-02
 -5.51633611e-02  1.23413451e-01  3.20147700e-03 -6.44584522e-02
  3.23912762e-02  3.48860994e-02 -1.99928973e-02 -2.39631552e-02
 -3.77601432e-03  2.16635759e-03 -2.34584627e-03  5.20627685e-02
  1.54693574e-02  4.19374965e-02 -4.15188931e-02  3.16199064e-02
  2.05578897e-02  1.90055501e-02  2.61998251e-02  6.31199330e-02
 -3.00359782e-02 -1.48233799e-02 -4.26204838e-02  3.58043015e-02
 -2.43016402e-03  2.06026062e-02 -4.44827229e-02  4.16266248e-02
  1.22226179e-02  5.75132575e-03  4.51940065e-03  2.67637847e-03
  4.55945134e-02  7.00789467e-02  1.30700199e-02  1.20276622e-02
 -7.26654101e-03 -1.99770611e-02 -2.25230884e-02 -9.088260

I'm thrilled to guide you through **Model Training** and **Model Evaluation** in NLP! This is a crucial part of building machine learning applications in natural language processing (NLP), and I'll make sure you understand it in a **simple** and **detailed** way so that you're ready to excel in your interviews.

### Introduction to Model Training and Evaluation

In NLP, model training involves teaching a machine learning model to understand patterns in text data. Once the model is trained, we need to evaluate how well it performs. **Model evaluation** tells us how effective the model is and whether it can generalize to new, unseen data.

### **1. Model Training in NLP**

Training a model involves feeding text data to an algorithm and allowing it to learn patterns from that data. This process is usually done in the following steps:

#### 1.1. **Data Preprocessing**:
Before training, the text data needs to be cleaned and preprocessed. This might involve:
- Tokenization (splitting text into words or subwords)
- Lowercasing all words
- Removing stopwords (common words like "the", "is", etc.)
- Lemmatization or stemming (reducing words to their base form)

#### 1.2. **Choosing the Right Model**:
For NLP tasks, you can use a variety of machine learning models depending on the problem. Some common types of models are:
- **Traditional models**: Logistic Regression, Naive Bayes, SVM, etc.
- **Deep learning models**: Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), Transformer models (like BERT, GPT).
- **Pre-trained models**: BERT, GPT, and other pre-trained transformer-based models.

#### 1.3. **Training the Model**:
Training involves using your preprocessed data to teach the model. During training, the model learns to predict the correct output (like classification labels or next words) based on input text.

**Example of training a basic model (logistic regression for text classification)**:

1. **Splitting**: Holding out part of the data for testing.
2. **Preprocessing**: Tokenizing and vectorizing text.
3. **Training**: Using the vectorized data to train the model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sample dataset (balanced, so both classes can appear in the training and test sets)
texts = ["I love NLP", "NLP is amazing", "Coding is fun", "I love solving problems",
         "I hate bugs", "This code is terrible", "Debugging is awful", "I hate slow builds"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 for positive sentiment, 0 for negative

# Step 1: Splitting the raw texts into training and test sets
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# Step 2: Preprocessing (fit the vectorizer on the training texts only)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Step 3: Training the model (Logistic Regression)
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 4: Predicting on test data
y_pred = model.predict(X_test)
print("Predictions:", y_pred)
```

In the above code:
- **CountVectorizer** converts text into numerical format (word counts). It is fitted on the training texts only, so no information from the test set leaks into training.
- **Logistic Regression** is a simple classification model.
- We split the dataset into training and testing data to train and evaluate the model. `stratify=labels` keeps both classes in each split, and `random_state` makes the split reproducible.

#### 1.4. **Hyperparameter Tuning**:
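In some cases, you may need to tune **hyperparameters**, which are settings chosen before training rather than learned from the data (like the regularization strength `C` in logistic regression, or the learning rate and number of layers in a neural network), to improve the model’s performance. Techniques like **Grid Search** (try every combination in a predefined grid) or **Random Search** (sample combinations at random) can be used to find the best combination of hyperparameters.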

---

### **2. Model Evaluation in NLP**

Once the model is trained, you need to **evaluate its performance** to see how well it’s doing. Model evaluation gives you insights into how accurate your model is and where it might be failing.

#### 2.1. **Evaluation Metrics**:
There are several metrics you can use to evaluate NLP models, depending on the type of task you’re performing (e.g., classification, regression, etc.). Some of the most common metrics are:

- **Accuracy**: The proportion of correct predictions out of all predictions. It is commonly used for classification tasks.
  \[
  \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}
  \]

- **Precision**: The proportion of true positive predictions out of all positive predictions.
  \[
  \text{Precision} = \frac{TP}{TP + FP}
  \]
  Where:
  - **TP** = True Positives
  - **FP** = False Positives

- **Recall (Sensitivity)**: The proportion of actual positive cases that were correctly identified.
  \[
  \text{Recall} = \frac{TP}{TP + FN}
  \]
  Where:
  - **FN** = False Negatives

- **F1-Score**: The harmonic mean of Precision and Recall. It’s useful when the class distribution is imbalanced (one class is more frequent than the other).
  \[
  \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  \]

- **Confusion Matrix**: A matrix that shows the true vs. predicted classifications in a table format. It helps to visualize the performance of a classification model.

- **ROC-AUC**: For binary classification, the **Receiver Operating Characteristic (ROC)** curve plots the True Positive Rate (Recall) against the False Positive Rate, and the **AUC (Area Under Curve)** measures the performance of the classifier.
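The formulas above can be checked by hand on a small example. The sketch below counts TP, FP, FN, and TN directly from two label lists; `classification_metrics` is a helper name introduced here, and the labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 true positives, 1 false positive, 1 false negative, 2 true negatives
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 6/8 = 0.75, while precision and recall are both 4/5 = 0.8, so the F1-score is also 0.8. The four counts are exactly the cells of the 2×2 confusion matrix.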

#### 2.2. **Example of Evaluating the Model**:

Let’s continue from the previous example, where we used logistic regression for text classification. We’ll now evaluate the model’s performance using **accuracy**, **precision**, **recall**, and **F1-score**.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

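# Evaluate the model (zero_division=0 returns 0 instead of raising a warning when
# a metric is undefined, which can happen on a very small test set)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
conf_matrix = confusion_matrix(y_test, y_pred)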

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")
print(f"Confusion Matrix:\n{conf_matrix}")
```

In the output:
- **Accuracy** tells you the overall correctness of the model.
- **Precision** and **Recall** give you insights into how well the model performs in identifying the correct class (e.g., positive sentiment).
- **F1-Score** is helpful when the classes are imbalanced (e.g., more negative than positive samples).
- **Confusion Matrix** helps you visualize how well the model is performing across different classes.

---

### **3. Cross-Validation**

Cross-validation is a technique used to evaluate a model’s performance by training and testing it on different subsets of the data. Because every sample is used for testing exactly once, the resulting score depends less on one lucky or unlucky split and gives a more reliable estimate of how the model will generalize to unseen data.

A commonly used cross-validation technique is **K-Fold Cross-Validation**, where the dataset is split into **K** roughly equal parts (folds). The model is trained on **K-1** folds and evaluated on the remaining fold. This process is repeated **K** times so that each fold serves as the test set once, and the K scores are averaged.
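The splitting logic itself is simple enough to sketch in plain Python. The `k_fold_indices` generator below is a name introduced here for illustration; like scikit-learn's `KFold` with its default settings, it uses contiguous folds and does not shuffle the data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal, contiguous folds."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(n_samples=6, k=3):
    print("train:", train, "test:", test)
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

Each index appears in exactly one test fold. For classification, a stratified variant that preserves the class ratio in every fold is usually preferable.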

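**Code Example** (using **KFold Cross-Validation**):
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stratified K-fold needs at least K samples per class, so use a balanced toy set
cv_texts = ["I love NLP", "NLP is amazing", "Coding is fun",
            "I hate bugs", "This is terrible", "Debugging is awful"]
cv_labels = [1, 1, 1, 0, 0, 0]
# The pipeline re-fits the vectorizer inside each fold, so no test text leaks into training
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, cv_texts, cv_labels, cv=3, scoring='accuracy')
print(f"Cross-Validation Scores: {scores}")
print(f"Mean Accuracy: {scores.mean()}")
```

In this example, the model is evaluated using **3-fold cross-validation** (i.e., splitting the data into 3 parts and training/testing the model on each fold). For classifiers, `cross_val_score` uses stratified folds, so the number of folds cannot exceed the number of samples in the smallest class. On a realistically sized dataset, 5 or 10 folds are common choices.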

---

### **4. Model Improvement and Fine-Tuning**

Once you have trained and evaluated your model, you might want to **improve** its performance. Here are some strategies to do so:

- **Feature Engineering**: Create new features from the data, such as using **n-grams** (combinations of adjacent words), to help the model capture more information.
- **Hyperparameter Tuning**: Experiment with different hyperparameters (like learning rate, number of layers, etc.) to improve model performance.
- **Ensemble Methods**: Combine predictions from multiple models to improve accuracy (e.g., Random Forests, Gradient Boosting).
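As a small illustration of the ensemble idea, the sketch below combines the predictions of three classifiers by majority vote (often called hard voting). The prediction lists are made up, and `majority_vote` is a helper name introduced here; Random Forests and Gradient Boosting combine their trees in more elaborate ways.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return, for each sample, the label predicted by the most models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


# Made-up predictions from three hypothetical models on four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

combined = majority_vote([model_a, model_b, model_c])
print(combined)  # [1, 0, 1, 1]
```

Each model makes one mistake on a different sample, and the vote cancels those individual errors out. Using an odd number of models avoids ties in binary classification.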

---

### **5. Example of Evaluating and Fine-Tuning a Pretrained Model (BERT)**

When using advanced models like **BERT**, you can fine-tune the model using your own data. Below is an example of how to fine-tune a **BERT** model for text classification:

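```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

# Placeholder data for illustration; replace with your own texts and labels
train_texts = ["I love NLP", "NLP is amazing", "I hate bugs", "This is terrible"]
train_labels = [1, 1, 0, 0]
test_texts = ["Coding is fun", "Debugging is awful"]
test_labels = [1, 0]

# Load pre-trained BERT model and tokenizer (num_labels=2 for binary classification)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize the dataset
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)


class TextDataset(torch.utils.data.Dataset):
    """Pairs the tokenizer output with labels in the format the Trainer expects."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


train_dataset = TextDataset(train_encodings, train_labels)
test_dataset = TextDataset(test_encodings, test_labels)

# Set up training arguments and trainer
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train and evaluate the model
trainer.train()
print(trainer.evaluate())
```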

This code snippet shows how to fine-tune a BERT model for a text classification task using the **Trainer API** from Hugging Face’s **transformers** library.

---

### **Conclusion**

In summary, here are the key steps you need to follow for **model training** and **model evaluation** in NLP:

1. **Preprocess the data** (tokenize, clean, vectorize).
2. **Choose the right model** based on the task (Logistic Regression, SVM, BERT, etc.).
3. **Train the model** using your processed data.
4. **Evaluate the model** using metrics like accuracy, precision, recall, F1-score, and confusion matrix.
5. Use **cross-validation** to check the model’s performance on different data splits.
6. Fine-tune the model for better results.

By mastering these steps, you’ll be well on your way to excelling in your NLP interviews and building high-performing models. Let me know if you need any further explanations or examples! I'm here to help you succeed.