### In the name of Allah, the Most Gracious, the Most Merciful


## Day 1 - Natural Language Processing

- Tokenization
- Stemming
- Lemmatization
- Bag-of-Words

### Text Preprocessing Techniques:

Tokenization -> Stop Words Removal -> Stemming -> Lemmatization


### 1. _`Tokenization`_

- Theory: Tokenization is the process of splitting text into smaller units called tokens, which can be words, symbols, or even single characters.
- The goal is to convert the raw text into an organized list that the program can process.


In [None]:
import nltk
from nltk.tokenize import word_tokenize
# Split Text into Words

# Note: newer NLTK releases may require nltk.download('punkt_tab') instead
nltk.download('punkt')

text = "Hello, how are you doing today?"
tokens = word_tokenize(text)

print(tokens)

['Hello', ',', 'how', 'are', 'you', 'doing', 'today', '?']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from nltk.tokenize import sent_tokenize
# Split Text into Sentences

text = "Hello, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue."
sentences = sent_tokenize(text)

print(sentences)

['Hello, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.']
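
The theory above notes that tokens can be words, symbols, or single characters. As a rough illustration only (a plain `re` sketch, not how `word_tokenize` works internally), the same text can be split at different granularities. The helper names `simple_word_tokens` and `char_tokens` are made up for this example.

```python
import re

def simple_word_tokens(text):
    # A token is either a run of word characters or a single non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Character-level tokens, ignoring whitespace
    return [ch for ch in text if not ch.isspace()]

text = "Hello, how are you doing today?"
word_level = simple_word_tokens(text)
char_level = char_tokens("Hello, how")

print(word_level)
print(char_level)
```

Word-level tokens keep the list short and meaningful, while character-level tokens produce a much longer list from a much smaller set of distinct symbols.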


### 2. _`StopWords Removal`_

- Theory: Stop words are words that are very common in a language but usually add little substantive meaning to the text, such as "in," "of," "to," "on," "this," and "or."
- In many natural language processing applications, such as text classification, these words are ignored because their presence rarely affects the underlying meaning of a text, and removing them reduces data size and speeds up processing.
- Stop-word removal is usually combined with other cleanup steps: converting all letters to lowercase and removing punctuation.


In [11]:
from nltk.corpus import stopwords
import string

# Download the stopwords from NLTK
nltk.download('stopwords')

# List of stop words in English
stop_words_english = set(stopwords.words('english'))
# List of punctuation characters
punctuation = set(string.punctuation)

cleaned_tokens = []
for token in tokens:
    # Convert to lowercase & Remove punctuation & Remove stop words
    clean_token = token.lower().strip(string.punctuation)

    if clean_token and clean_token not in stop_words_english:
        cleaned_tokens.append(clean_token)

print(cleaned_tokens)

['hello', 'today']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
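
To make the "reduces data size" point concrete, here is a minimal sketch that needs no downloads. The set `TOY_STOP_WORDS` and the function `remove_stop_words` are assumptions made for this example; the NLTK English list is much longer.

```python
import string

# Hypothetical mini stop list for illustration; NLTK's English list is much longer
TOY_STOP_WORDS = {"the", "is", "and", "a", "of", "to", "in", "on", "this", "or"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    cleaned = []
    for raw in text.split():
        # Lowercase and trim punctuation from both ends of the token
        token = raw.lower().strip(string.punctuation)
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned

sentence = "The weather is great, and Python is awesome."
kept = remove_stop_words(sentence)

print(kept)
print(len(sentence.split()), "tokens before,", len(kept), "after")
```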


### 3. _`Stemming`_

- Theory: Stemming is the process of reducing a word to its "stem", the common part shared by a group of related words.
- The goal is to map words that are similar in meaning but different in form to a single stem.
- This process is usually `quick`, but it ignores grammatical rules and may produce stems that are not real dictionary words (and may be meaningless).

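Example:
Words: "go", "going"
Stem: "go"

Stemming only strips suffixes, so irregular forms such as "went" and "gone" are left unchanged, and "goes" becomes "goe".

Another example:
Words: "houses", "house"
Stem: both forms reduce to a single shared stem.

However, a stem is not guaranteed to be a real word: "history" becomes "histori", and "hospital" may be reduced to something like "hospit", which is meaningless on its own.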


In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["go", "went", "gone", "goes", "going",
         'history', 'historical', 'historically']
stemmed_words = [ps.stem(w) for w in words]

print(stemmed_words)

['go', 'went', 'gone', 'goe', 'go', 'histori', 'histor', 'histor']
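
To see why suffix stripping is fast but blind to grammar, here is a deliberately naive sketch. It is not the Porter algorithm: the suffix list `SUFFIXES` and the function `toy_stem` are invented for illustration, so its results will not match the Porter output above in every case.

```python
# Hypothetical suffix list, checked in order; real stemmers use many more rules
SUFFIXES = ("ically", "ical", "ing", "es", "s", "y")

def toy_stem(word, min_stem_len=2):
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Strip the first matching suffix, as long as enough of the word remains
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

words = ["go", "went", "goes", "going", "history", "historical", "historically"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

The irregular form "went" is untouched because no rule matches it, and the "history" family collapses to "histor", which is not a dictionary word.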


### 4. _`Lemmatization`_

- Theory: Lemmatization is a more precise process than stemming. It aims to find the base form of a word (the lemma), which is a real word found in the dictionary.
- It takes grammatical rules into account (such as verb conjugation or noun case) so that the base form makes sense, but it is slower than stemming.

Example:
Words: "go", "went", "goes"
Lemma: "go" ("went" is the past tense of "go")

Another example:
Words: "ate", "eat"
Lemma: "eat"


In [15]:
from nltk.stem import WordNetLemmatizer

# Download the WordNet data from NLTK
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["go", "went", "gone", "goes", "going", 'better', 'running', 'fairly']

# pos='v' treats every word as a verb, so the adjective 'better' is left unchanged
lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]
print(lemmatized_words)

['go', 'go', 'go', 'go', 'go', 'better', 'run', 'fairly']


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
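
In the output above, "better" stays "better" because every word was looked up as a verb. Conceptually, a lemmatizer behaves like a dictionary lookup keyed on both the word and its part of speech. The sketch below mimics that idea with a tiny hand-written table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and are not part of WordNet or NLTK.

```python
# Hypothetical lookup table keyed by (word, part of speech); WordNet is far larger
TOY_LEMMAS = {
    ("went", "v"): "go",
    ("gone", "v"): "go",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return TOY_LEMMAS.get((word, pos), word)

as_verb = toy_lemmatize("better", "v")
as_adjective = toy_lemmatize("better", "a")

print(as_verb, as_adjective)
print(toy_lemmatize("went", "v"))
```

With the real `WordNetLemmatizer`, passing `pos='a'` for "better" would likewise be expected to give "good", which is why choosing the right part of speech matters.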


_`Stemming`: faster, ignores grammar rules, and may produce nonexistent stems. `Lemmatization`: slower, more accurate, and returns a base form that is a real dictionary word._


### 4. _`Bag-of-Words (BoW)`_

- Theory: Bag-of-Words is a feature extraction technique that `converts text into a numerical representation (a vector)` by counting how many times each word occurs in the text.
- In the Bag-of-Words model, word order and grammar (sentence structure, syntax, etc.) are ignored, and the focus is solely on the frequency of each word. Imagine taking all the words from a sentence or document and putting them in a bag. All that matters is the number of times each word appears, not its order.

Example:
Sentence 1: "The cat eats the fish."

Sentence 2: "The fish is eaten by the cat."

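In the Bag of Words model, once stop words are removed and the remaining words are lemmatized, both sentences are represented the same way because the same content words (cat, eat, fish) occur the same number of times in each. On the raw text, the counts differ slightly because of extra words such as "is" and "by".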
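
The comparison of the two sentences can be checked with a small sketch. It assumes a tiny hand-written stop list and lemma map (`TOY_STOP_WORDS` and `TOY_LEMMAS`, both invented for this example) in place of the NLTK resources.

```python
from collections import Counter
import re

# Hypothetical mini resources for illustration only
TOY_STOP_WORDS = {"the", "is", "by"}
TOY_LEMMAS = {"eats": "eat", "eaten": "eat"}

def content_word_counts(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Drop stop words, then map each remaining word to its lemma
    lemmas = [TOY_LEMMAS.get(t, t) for t in tokens if t not in TOY_STOP_WORDS]
    return Counter(lemmas)

bag_1 = content_word_counts("The cat eats the fish.")
bag_2 = content_word_counts("The fish is eaten by the cat.")

print(bag_1)
print(bag_1 == bag_2)
```

Both bags hold one count each for "cat", "eat", and "fish", so the model cannot tell the active sentence from the passive one.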


In [16]:
"""
The process of creating a Bag of Words model consists of two basic steps:
1. Creating a Vocabulary:
    We collect all the unique words from all the texts we have.
    This vocabulary is the basis of our model.

2. Vectorizing the Text:
    For each text, we create a vector whose length equals the size of the vocabulary.
    Each cell in this vector represents a word from the vocabulary, and its value is the number of times that word occurs in the text.
"""

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "The cat eats the fish.",
    "The fish is eaten by the cat."
]

# Create the Bag of Words model
vectorizer = CountVectorizer()

# Convert Text into Vectors
X = vectorizer.fit_transform(documents)

# Display the vocabulary (Unique Words)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display the vectors
print("Vectors:\n", X.toarray())

Vocabulary: ['by' 'cat' 'eaten' 'eats' 'fish' 'is' 'the']
Vectors:
 [[0 1 0 1 1 0 2]
 [1 1 1 0 1 1 2]]
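
The two steps (build a vocabulary, then count) can also be written by hand. This is a sketch rather than what `CountVectorizer` does internally; it assumes lowercasing and a "words of two or more characters" rule, which roughly matches the vectorizer's defaults. The names `tokenize`, `docs`, `vocabulary`, and `vectors` are chosen for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase words of 2+ characters, roughly like CountVectorizer's default
    return re.findall(r"\b\w\w+\b", text.lower())

docs = ["The cat eats the fish.", "The fish is eaten by the cat."]

# Step 1: vocabulary = sorted unique words across all documents
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# Step 2: one count vector per document, one cell per vocabulary word
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

On these two documents the hand-built vocabulary and count vectors line up with the `CountVectorizer` output shown above.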


_`Disadvantages:`_

- `Ignores word order`: This can lead to loss of meaning.

- `It produces large vectors`: The larger the vocabulary, the more dimensions each vector has, which contributes to the "curse of dimensionality."
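
Both disadvantages can be seen in a few lines. This is a minimal sketch with made-up sentences; the helper `bag` and the sample `corpus` are assumptions for illustration.

```python
from collections import Counter

def bag(text):
    # Word counts only; position in the sentence is discarded
    return Counter(text.lower().split())

# Opposite meanings, identical bags
same_bag = bag("dog bites man") == bag("man bites dog")
print(same_bag)

# Vector length equals vocabulary size, so it grows with every new word
corpus = ["dog bites man", "man bites dog", "the quick brown fox jumps"]
vocab_sizes = []
seen = set()
for doc in corpus:
    seen.update(doc.split())
    vocab_sizes.append(len(seen))
print(vocab_sizes)
```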


### 5. _`Converting Words to Numbers (Word Embeddings)`_

- Theory: Computers don't understand words; they understand numbers.
- Word embeddings convert each word into a numerical vector in a multidimensional space.
- Words with similar meanings have vectors that are close to each other in this space.
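
"Close to each other" is commonly measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors that were invented for illustration (they are not learned embeddings) to show how related words would score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made 3-dimensional vectors, invented for illustration (not learned)
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

A value near 1 means the vectors point in almost the same direction, and a value near 0 means they are close to orthogonal.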


_Now, we use the Gensim library to create a Word2Vec model, which is one of the most popular embedding methods._


In [8]:
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ["hello", "how", "are", "you"],
    ["I", "am", "doing", "well"],
    ["how", "about", "you"],
    ["I", "hope", "you", "are", "doing", "great"]
]

# Model training
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Accessing the vector for the word 'hello'
vector_hello = model.wv['hello']
print("Vector for 'hello':", vector_hello)

# Get similar words to 'hello'
print("Most similar words to 'hello':")
print(model.wv.most_similar('hello'))

Vector for 'hello': [ 7.0887972e-03 -1.5679300e-03  7.9474989e-03 -9.4886590e-03
 -8.0294991e-03 -6.6403709e-03 -4.0034545e-03  4.9892161e-03
 -3.8135587e-03 -8.3199050e-03  8.4117772e-03 -3.7470020e-03
  8.6086961e-03 -4.8957514e-03  3.9185942e-03  4.9220170e-03
  2.3926091e-03 -2.8188038e-03  2.8491246e-03 -8.2562361e-03
 -2.7655398e-03 -2.5911583e-03  7.2490061e-03 -3.4634031e-03
 -6.5997029e-03  4.3404270e-03 -4.7448516e-04 -3.5975564e-03
  6.8824720e-03  3.8723124e-03 -3.9002013e-03  7.7188847e-04
  9.1435025e-03  7.7546560e-03  6.3618720e-03  4.6673026e-03
  2.3844899e-03 -1.8416261e-03 -6.3712932e-03 -3.0181051e-04
 -1.5653884e-03 -5.7228567e-04 -6.2628710e-03  7.4340473e-03
 -6.5914928e-03 -7.2392775e-03 -2.7571463e-03 -1.5154004e-03
 -7.6357173e-03  6.9824100e-04 -5.3261113e-03 -1.2755442e-03
 -7.3651113e-03  1.9605684e-03  3.2731986e-03 -2.3138524e-05
 -5.4483581e-03 -1.7260861e-03  7.0849168e-03  3.7362587e-03
 -8.8810492e-03 -3.4135508e-03  2.3541022e-03  2.1380198e-03
 -9.

### 6. _`A Simple Practical Application: Sentiment Classification`_

- Theory: We will classify a simple sentence as "positive" or "negative."
- This type of task is the basis of many natural language processing applications.


In [17]:
# Bag-of-Words Example
from sklearn.feature_extraction.text import CountVectorizer

# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Sample data
documents = [
    ("I love this product", "positive"),
    ("This is the best purchase I've made", "positive"),
    ("I'm very satisfied with my experience", "positive"),
    ("This is a terrible product", "negative"),
    ("I hate this item", "negative"),
    ("This is the worst purchase I've made", "negative")
]

# Split data into texts and labels
texts = [d[0] for d in documents]
labels = [d[1] for d in documents]

# Pipeline -> (Vectorizer + Classifier)
# CountVectorizer -> Convert Text into Bag-of-Words
# MultinomialNB -> Naive Bayes Classifier
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
model_pipeline.fit(texts, labels)

# Test the model
test_sentences = [
    "I am very happy with this product",
    "This is the worst experience I've had"
]

predictions = model_pipeline.predict(test_sentences)

print("Text -> ", test_sentences[0])
print("Predicted Sentiment -> ", predictions[0])

print("Text -> ", test_sentences[1])
print("Predicted Sentiment -> ", predictions[1])

# Accuracy Check (only two test sentences, so this is not a reliable estimate)
test_labels = ["positive", "negative"]
accuracy = accuracy_score(test_labels, predictions)
print("Model Accuracy: ", accuracy)

Text ->  I am very happy with this product
Predicted Sentiment ->  positive
Text ->  This is the worst experience I've had
Predicted Sentiment ->  negative
Model Accuracy:  1.0
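
To show what the classifier step does with the word counts, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on the same training sentences. It is an illustration under simplifying assumptions, not scikit-learn's implementation; the functions `tokenize`, `train_naive_bayes`, and `predict` are written for this example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase words of 2+ characters, roughly like CountVectorizer's default
    return re.findall(r"\b\w\w+\b", text.lower())

def train_naive_bayes(samples):
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def predict(text, word_counts, doc_counts, vocabulary, alpha=1.0):
    total_docs = sum(doc_counts.values())
    # Words never seen in training are skipped
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokens:
            # Smoothed log likelihood of this word given the label
            score += math.log((word_counts[label][token] + alpha)
                              / (total_words + alpha * len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train_samples = [
    ("I love this product", "positive"),
    ("This is the best purchase I've made", "positive"),
    ("I'm very satisfied with my experience", "positive"),
    ("This is a terrible product", "negative"),
    ("I hate this item", "negative"),
    ("This is the worst purchase I've made", "negative"),
]
trained = train_naive_bayes(train_samples)
nb_predictions = [
    predict(sentence, *trained)
    for sentence in ["I am very happy with this product",
                     "This is the worst experience I've had"]
]
print(nb_predictions)
```

Each label's score is the log prior plus the sum of smoothed log word likelihoods, and the label with the highest score wins. Smoothing keeps a word that never appeared under one label from driving that label's probability to zero.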
