
# Text Processing

Objective: Clean the text data so that it is uniform, lowercase, and ready to use.

### Lowercasing

In [11]:
import pandas as pd

# Initial data
reviews = [
    "I loved the product sooo much😍😍!!", "Worst purchase ever! It was a waste of money.!", "Absolutely fantastic! Exceeded my expectations. 🌟",
    "The quality is decent, but not worth the price.", "Super fast delivery! I'm really impressed 🚚💨",
    "Terrible experience. The product broke in a week! 😡", "Five stars! I’d buy it again without a doubt. ⭐⭐⭐⭐⭐",
    "Meh, it’s okay. Nothing special.", "This changed my life! Highly recommend it. 🙌", "Not as described. Very disappointed. 😞",
    "Great value for money. Totally worth it! 💸", "I wouldn’t recommend this to anyone. Awful!", "Cute design but not very durable. 🥲",
    "I’m obsessed! It works perfectly. 💕", "Save your money. This is pure junk. 🗑️",
    "The product is fine, but shipping took forever.", "Surprisingly good! I wasn't expecting much. 😊",
    "The quality is top-notch! Love it! ❤️", "It’s just okay. Nothing to write home about.",
    "Horrible customer service! Never buying again. 😠"
]

In [12]:
# Convert to DataFrame
df_reviews = pd.DataFrame(reviews, columns=["Reviews"])

# Lowercasing
df_reviews["Reviews_Lowercased"] = df_reviews["Reviews"].str.lower()

df_reviews.head()

|   | Reviews | Reviews_Lowercased |
|---|---|---|
| 0 | I loved the product sooo much😍😍!! | i loved the product sooo much😍😍!! |
| 1 | Worst purchase ever! It was a waste of money.! | worst purchase ever! it was a waste of money.! |
| 2 | Absolutely fantastic! Exceeded my expectations. 🌟 | absolutely fantastic! exceeded my expectations. 🌟 |
| 3 | The quality is decent, but not worth the price. | the quality is decent, but not worth the price. |
| 4 | Super fast delivery! I'm really impressed 🚚💨 | super fast delivery! i'm really impressed 🚚💨 |


### Removing punctuation and emojis

In [13]:
import re  # Regular expression module

df_reviews["Reviews_NoPunctEmojis"] = df_reviews["Reviews_Lowercased"].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df_reviews.head()

|   | Reviews | Reviews_Lowercased | Reviews_NoPunctEmojis |
|---|---|---|---|
| 0 | I loved the product sooo much😍😍!! | i loved the product sooo much😍😍!! | i loved the product sooo much |
| 1 | Worst purchase ever! It was a waste of money.! | worst purchase ever! it was a waste of money.! | worst purchase ever it was a waste of money |
| 2 | Absolutely fantastic! Exceeded my expectations. 🌟 | absolutely fantastic! exceeded my expectations. 🌟 | absolutely fantastic exceeded my expectations |
| 3 | The quality is decent, but not worth the price. | the quality is decent, but not worth the price. | the quality is decent but not worth the price |
| 4 | Super fast delivery! I'm really impressed 🚚💨 | super fast delivery! i'm really impressed 🚚💨 | super fast delivery im really impressed |


**.apply(lambda x: ...)**

.apply(): *Applies a function to each element in a column of the DataFrame.*

lambda x: *Creates an anonymous function (a function without a name). x represents each individual text entry in the column "Reviews_Lowercased".*

**re.sub(r'[^\w\s]', '', x)**

re.sub(): *A method from the re (Regular Expression) module used to substitute (or replace) patterns in text.*

r'[^\w\s]': *The pattern to match characters for removal:*

r'...': *The r before the string creates a raw string, where backslashes (\) are treated literally, which is useful in regular expressions.*

[ ... ]: *Denotes a character class, meaning it matches any single character inside the brackets.*

^: *A caret at the start of a character class means "NOT".*

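\w: *Matches "word characters": letters, digits, and the underscore. In Python 3 string patterns this is Unicode-aware by default, so it is broader than a-z, A-Z, 0-9, _ and also covers, for example, accented letters.*

\s: *Matches "whitespace characters" like spaces, tabs, and new lines.*

[^\w\s]: *Means "neither a word character nor whitespace", which effectively matches punctuation (apostrophes included), symbols, and emojis.*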
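
To see the pattern in action outside the DataFrame, here is a small sketch on two sample strings. The helper name `strip_non_word` and the second sample string are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

# Same pattern as in the cell above, compiled once for reuse
NON_WORD = re.compile(r"[^\w\s]")

def strip_non_word(text: str) -> str:
    """Remove every character that is neither a word character nor whitespace."""
    return NON_WORD.sub("", text)

# The apostrophe and both emojis are removed; the space before the emojis stays
print(repr(strip_non_word("i'm really impressed 🚚💨")))

# The accented letter (written as \u00e9), the digits, and the underscore are kept;
# "$" and "!" are removed
print(repr(strip_non_word("caf\u00e9_2 costs $5!")))
```

Because the apostrophe counts as punctuation, "i'm" becomes "im", which is what the output table above shows for row 4.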

### Removing stopwords

In [14]:
import nltk  # Natural Language Toolkit
nltk.download('stopwords')

# nltk.corpus gives access to the corpora and word lists that NLTK can download, including the stop-word lists.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# x.split() turns the string into a list of words; words that are not stop words are kept and re-joined with single spaces.
df_reviews["Reviews_NoStopwords"] = df_reviews["Reviews_NoPunctEmojis"].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)
df_reviews.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


|   | Reviews | Reviews_Lowercased | Reviews_NoPunctEmojis | Reviews_NoStopwords |
|---|---|---|---|---|
| 0 | I loved the product sooo much😍😍!! | i loved the product sooo much😍😍!! | i loved the product sooo much | loved product sooo much |
| 1 | Worst purchase ever! It was a waste of money.! | worst purchase ever! it was a waste of money.! | worst purchase ever it was a waste of money | worst purchase ever waste money |
| 2 | Absolutely fantastic! Exceeded my expectations. 🌟 | absolutely fantastic! exceeded my expectations. 🌟 | absolutely fantastic exceeded my expectations | absolutely fantastic exceeded expectations |
| 3 | The quality is decent, but not worth the price. | the quality is decent, but not worth the price. | the quality is decent but not worth the price | quality decent worth price |
| 4 | Super fast delivery! I'm really impressed 🚚💨 | super fast delivery! i'm really impressed 🚚💨 | super fast delivery im really impressed | super fast delivery im really impressed |
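
Two things stand out in this output. Row 3 loses "not", so "not worth the price" turns into "worth price". Row 4 keeps "im", because the apostrophe was stripped before filtering and "im" is not a stop word. The sketch below shows one possible way around both: filter first, strip punctuation afterwards, and optionally keep negations. It uses a tiny hypothetical stop-word set (`STOP_WORDS`) instead of NLTK's list, and the names `NEGATIONS` and `remove_stopwords` are made up for this example.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; NLTK's English list is much longer
STOP_WORDS = {"i", "i'm", "the", "is", "but", "not", "a", "it", "was", "of"}
# Words we may want to keep for sentiment-style tasks (an assumption, not an NLTK feature)
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(text, keep=frozenset()):
    """Drop stop words first, then strip punctuation and emojis from what is left."""
    kept = []
    for raw in text.lower().split():
        token = raw.strip(".,!?")            # trim edge punctuation, keep inner apostrophes
        if token in STOP_WORDS and token not in keep:
            continue                         # stop word that we were not asked to keep
        cleaned = re.sub(r"[^\w\s]", "", token)
        if cleaned:                          # skip tokens that were only symbols or emojis
            kept.append(cleaned)
    return " ".join(kept)

review = "The quality is decent, but not worth the price."
print(remove_stopwords(review))                   # negation dropped
print(remove_stopwords(review, keep=NEGATIONS))   # negation preserved
print(remove_stopwords("Super fast delivery! I'm really impressed 🚚💨"))
```

Whether to keep negations depends on the task; for sentiment analysis they usually carry meaning, so dropping them is a trade-off worth checking.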


### Stemming

In [15]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# ps.stem(word) strips suffixes to produce a stem, which is not always a dictionary word (e.g. "purchase" becomes "purchas")
df_reviews['Reviews_Stem'] = df_reviews['Reviews_NoStopwords'].apply(lambda x: ' '.join(ps.stem(word) for word in x.split()))

df_reviews.head()

|   | Reviews | Reviews_Lowercased | Reviews_NoPunctEmojis | Reviews_NoStopwords | Reviews_Stem |
|---|---|---|---|---|---|
| 0 | I loved the product sooo much😍😍!! | i loved the product sooo much😍😍!! | i loved the product sooo much | loved product sooo much | love product sooo much |
| 1 | Worst purchase ever! It was a waste of money.! | worst purchase ever! it was a waste of money.! | worst purchase ever it was a waste of money | worst purchase ever waste money | worst purchas ever wast money |
| 2 | Absolutely fantastic! Exceeded my expectations. 🌟 | absolutely fantastic! exceeded my expectations. 🌟 | absolutely fantastic exceeded my expectations | absolutely fantastic exceeded expectations | absolut fantast exceed expect |
| 3 | The quality is decent, but not worth the price. | the quality is decent, but not worth the price. | the quality is decent but not worth the price | quality decent worth price | qualiti decent worth price |
| 4 | Super fast delivery! I'm really impressed 🚚💨 | super fast delivery! i'm really impressed 🚚💨 | super fast delivery im really impressed | super fast delivery im really impressed | super fast deliveri im realli impress |
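
The point of stemming is that different forms of a word collapse to one shared stem, even if that stem (like "purchas" or "qualiti" above) is not a real word. The sketch below illustrates the idea with naive suffix stripping. It is a toy with a hypothetical suffix list, not the Porter algorithm, and its output differs from `PorterStemmer` (Porter turns "loved" into "love", as the table shows, while this toy produces "lov").

```python
from collections import defaultdict

# Hypothetical suffix list, checked in order; Porter's rule set is far larger and more careful
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, as long as at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)

# All four forms end up under the single stem "expect"
print(group_by_stem(["expected", "expecting", "expects", "expect"]))

# Over-stemming: the result is not a dictionary word
print(toy_stem("loved"))
```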


# Tokenization

In [None]:
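nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import sent_tokenize, word_tokenize

# print() is needed here: a notebook cell only displays the value of its last expression
print(word_tokenize("I don't like NLP"))  # word-level tokens
print(sent_tokenize("I don't like NLP"))  # sentence-level tokens

# Tokenize the stop-word-free (unstemmed) reviews into lists of words
df_reviews["Reviews_Tokenized"] = df_reviews['Reviews_NoStopwords'].apply(word_tokenize)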

In [None]:
df_reviews.head()
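
To make the idea of tokenization concrete without NLTK, here is a rough regex-based sketch of word and sentence splitting. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up, and the rules are a simplification, not NLTK's algorithm: for example, NLTK's `word_tokenize` splits "don't" into "do" and "n't", while this sketch splits at the apostrophe.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Return runs of word characters and individual punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(simple_word_tokenize("I don't like NLP"))
print(simple_sent_tokenize("Terrible experience. The product broke in a week!"))
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a trained tokenizer like the one NLTK downloads above.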

# Embedding

In [None]:
from gensim.models import Word2Vec

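# Train a small skip-gram model (sg=1) on the tokenized reviews.
# vector_size=3: each word gets a 3-dimensional vector; window=2: up to 2 context words on each side;
# min_count=1: keep every word, even those that occur only once.
model = Word2Vec(sentences=df_reviews['Reviews_Tokenized'].tolist(), vector_size=3, window=2, min_count=1, sg=1)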

print('Embedding of love: ', model.wv["love"])
print('Embedding of product: ', model.wv["product"])
print('Most similar to love: ', model.wv.most_similar("love"))
print('Most similar to product: ', model.wv.most_similar("product"))
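
`most_similar` ranks the other words in the vocabulary by the cosine similarity of their vectors. Below is a minimal pure-Python sketch of that ranking. The three words and their 3-dimensional vectors are invented for illustration and are not taken from the trained model, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up embeddings, chosen so that "adore" points roughly the same way as "love"
vectors = {
    "love": [0.9, 0.1, 0.0],
    "adore": [0.8, 0.2, 0.1],
    "junk": [-0.7, 0.1, 0.3],
}

print(most_similar("love", vectors))
```

With only 20 short reviews and 3 dimensions, the vectors learned by the model above are unlikely to capture real meaning, so its nearest neighbours are best read as a demonstration of the API rather than as a finding.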