These lecture notes introduce Natural Language Processing (NLP), focusing on the core concepts of stop words and tokenization, with examples in a Google Colab setting.

**Title: Introduction to Natural Language Processing**

**1. Stop Words**

* **What are they?** Stop words are common words that appear frequently in text but usually contribute little to its core meaning. Examples include "the," "a," "an," "is," and "and."
* **Why remove them?**
    * **Efficiency:** Stop words take up space and processing time. Removing them shrinks the vocabulary and the input, which can make models faster to train.
    * **Focus:** Filtering out stop words puts the emphasis on more content-rich, meaningful words.
    * **Caveat (context):** In *some* tasks, stop words carry meaning (for example, negations such as "not" in sentiment analysis), so decide per task whether removing them is appropriate.
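
The caveat about context can be made concrete with a small sketch. The stop-word set below is a tiny hand-written list chosen for illustration (an assumption, not NLTK's actual list), and `remove_stop_words` is a hypothetical helper. It shows how dropping a negation word can make two opposite sentences look identical.

```python
# Tiny hand-written stop-word list, for illustration only (not NLTK's list).
ILLUSTRATIVE_STOP_WORDS = {"the", "was", "is", "a", "not"}

def remove_stop_words(sentence, stop_words):
    """Return the lowercase words of `sentence` that are not in `stop_words`."""
    return [w for w in sentence.lower().split() if w not in stop_words]

positive = remove_stop_words("the movie was good", ILLUSTRATIVE_STOP_WORDS)
negative = remove_stop_words("the movie was not good", ILLUSTRATIVE_STOP_WORDS)

print(positive)  # ['movie', 'good']
print(negative)  # ['movie', 'good'], the negation is gone

# One possible mitigation: keep negation words by removing them from the stop list.
sentiment_stop_words = ILLUSTRATIVE_STOP_WORDS - {"not"}
negative_kept = remove_stop_words("the movie was not good", sentiment_stop_words)
print(negative_kept)  # ['movie', 'not', 'good']
```

For sentiment-style tasks, trimming the stop list in this way is one option; another is to skip stop-word removal entirely.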

**Example (Colab):**

In [1]:
import nltk
nltk.download('stopwords')  # Download NLTK's stopwords database

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

sentence = "The cat sat on the mat with a hat."
filtered_words = [word for word in sentence.split() if word.lower() not in stop_words]

print(filtered_words)  # Output: ['cat', 'sat', 'mat', 'hat.'] (split() leaves the trailing period attached)

['cat', 'sat', 'mat', 'hat.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
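
Because `split()` only breaks on whitespace, the period stays attached to `hat.` in the output above. A minimal sketch of one way to handle this is to strip leading and trailing punctuation from each word before the stop-word check. The small stop-word set here is a hand-written stand-in (an assumption) so the snippet runs without downloading NLTK data.

```python
import string

# Hand-written stand-in for NLTK's English stop-word list (illustration only).
stop_words = {"the", "on", "with", "a"}

sentence = "The cat sat on the mat with a hat."

# Strip punctuation from both ends of each whitespace-separated word.
cleaned = [word.strip(string.punctuation) for word in sentence.split()]

# Drop empty strings and stop words (case-insensitive comparison).
filtered_words = [w for w in cleaned if w and w.lower() not in stop_words]

print(filtered_words)  # ['cat', 'sat', 'mat', 'hat']
```

Using a proper word tokenizer instead of `split()` is another option, since it separates punctuation into its own tokens.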


**2. Tokenization**

* **What is it?** Tokenization breaks down a piece of text into smaller units called tokens. These tokens can be:
    * **Words:** "This is a sample sentence." → ["This", "is", "a", "sample", "sentence", "."] (word tokenizers such as NLTK's word_tokenize keep punctuation as separate tokens)
    * **Sentences:** "This is a sample sentence. Here's another." → ["This is a sample sentence.", "Here's another."]
    * **Subwords:** "powerful" → ["power", "ful"] (useful for handling rare or complex words)

* **Why do it?** Machine learning models can't process raw text directly. Tokenization converts text into a structured sequence of units that algorithms can work with.
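
The three token granularities above can be sketched with the standard library alone. This is a naive illustration, not how NLTK works internally: the regular expressions are simplistic, the function names are hypothetical, and the subword vocabulary is a made-up set (real subword tokenizers learn their vocabulary from data).

```python
import re

def word_tokens(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def subword_tokens(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first; fall back to one character.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

words = word_tokens("This is a sample sentence.")
sentences = sentence_tokens("This is a sample sentence. Here's another.")
subwords = subword_tokens("powerful", {"power", "ful"})  # hypothetical vocabulary

print(words)      # ['This', 'is', 'a', 'sample', 'sentence', '.']
print(sentences)  # ['This is a sample sentence.', "Here's another."]
print(subwords)   # ['power', 'ful']
```

Rules this simple break on abbreviations, contractions, and similar cases, which is why library tokenizers are normally preferred.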

**Types of Tokenizers:**

   * **Word Tokenizers** (e.g., NLTK's word_tokenize)
   * **Sentence Tokenizers** (e.g., NLTK's sent_tokenize)
   * **More advanced tokenizers** that handle things like contractions, special characters, and complex word forms.

**Example (Colab):**

In [2]:
import nltk
from nltk import word_tokenize

nltk.download('punkt')  # Tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

text = "Natural language processing (NLP) is fascinating!"
tokens = word_tokenize(text)

print(tokens)  # Output: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'fascinating', '!']

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'fascinating', '!']


**Key Steps in NLP (Colab)**

## Converting Text to Sequences

**Why we do it:** Computers don't understand language the same way humans do. Most NLP models work with numerical data. Therefore, we need to convert our text into sequences of numbers for the model to process.

**How it works:**
1. **Building a Vocabulary:** A tokenizer (like the Keras Tokenizer) scans your text data and creates a vocabulary (essentially a dictionary) of unique words.
2. **Assigning Integers:** Each unique word in the vocabulary is assigned a unique integer.
3. **Text Transformation:** Your original text sentences are transformed into lists of integers, where each integer represents a corresponding word in the vocabulary.

**Example:**

Original sentence: "The cat sat on the mat."

Vocabulary: {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}

Integer sequence: [1, 2, 3, 4, 1, 5] ("the" appears twice, so its code 1 appears twice)
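
The three steps can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the helper names are hypothetical, and codes are assigned in order of first appearance, whereas the Keras Tokenizer orders words by frequency. Index 0 is left unused so that it can later serve as a padding value.

```python
import string

def normalize(text):
    """Lowercase the text, remove punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def build_vocabulary(texts):
    """Map each unique word to an integer, starting at 1."""
    vocabulary = {}
    for text in texts:
        for word in normalize(text):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary

def text_to_sequence(text, vocabulary):
    """Replace each known word with its integer code."""
    return [vocabulary[word] for word in normalize(text) if word in vocabulary]

sentence = "The cat sat on the mat."
vocabulary = build_vocabulary([sentence])
sequence = text_to_sequence(sentence, vocabulary)

print(vocabulary)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(sequence)    # [1, 2, 3, 4, 1, 5]
```

The sequence has six entries because the sentence has six words; the repeated word "the" maps to the same code both times.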



**Core Idea:** Instead of processing raw text directly, NLP models typically operate on numerical representations of words. We convert sentences into sequences of integers where each integer is a unique code for a word within our vocabulary.

**Why Do We Do This?** Computers excel at number manipulation, not at directly deciphering human language.


## 1. Tokenization

**Explanation:**
1. The tokenizer builds a vocabulary (a mapping of words to unique integers).
2. Each sentence is split into words.
3. Each word is replaced by its corresponding integer code from the vocabulary.


In [3]:
from keras.preprocessing.text import Tokenizer

# Sample sentences
sentences = [
    "The cat sat on the mat.",
    "Dogs are playful and loyal."
]

In [4]:
# Create a basic tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)  # Build vocabulary based on the text

In [5]:
tokenizer.word_index

{'the': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'mat': 5,
 'dogs': 6,
 'are': 7,
 'playful': 8,
 'and': 9,
 'loyal': 10}

In [6]:
# Convert sentences to sequences
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

[[1, 2, 3, 4, 1, 5], [6, 7, 8, 9, 10]]
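
A quick way to check the encoding is to map the integers back to words. The sketch below copies the `word_index` printed above into a plain dictionary so that it runs without Keras installed; the variable names `index_word` and `decoded` are chosen here for illustration.

```python
# word_index as printed by the tokenizer above, copied by hand.
word_index = {
    'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5,
    'dogs': 6, 'are': 7, 'playful': 8, 'and': 9, 'loyal': 10,
}

# Invert the mapping: integer code -> word.
index_word = {index: word for word, index in word_index.items()}

sequences = [[1, 2, 3, 4, 1, 5], [6, 7, 8, 9, 10]]

# Decode each sequence back into a space-separated string.
decoded = [" ".join(index_word[i] for i in seq) for seq in sequences]

print(decoded)  # ['the cat sat on the mat', 'dogs are playful and loyal']
```

The decoded text is lowercase and has no punctuation, which shows what information the tokenizer's default settings discard. The Keras Tokenizer also offers a `sequences_to_texts` method for the same purpose.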



## 2. Padding Sequences

**Core Idea:** Many neural network models, particularly recurrent architectures such as RNNs and GRUs, are trained on batches in which every input must have the same length. Padding achieves this by adding neutral tokens (usually zeros) to shorter sequences.

**Why Do This?** Equal-length sequences can be stacked into a single array and processed in batches, which makes training and inference more efficient.


**Why we do it:** Sentences in real-world text naturally vary in length, but neural models such as Recurrent Neural Networks (RNNs) and GRUs are usually fed batches of fixed-length inputs. Padding, together with truncation, bridges this gap.

**How it works:**
1. **Choosing a maximum length:** You choose a maximum length (e.g., 20 words).
2. **Shorter sequences:** Sequences shorter than the maximum length are padded with zeros (or a special padding token) at the beginning or end to reach the desired length.
3. **Longer sequences:** Sequences longer than the maximum length are truncated, at the beginning or end, to fit.

**Example:**

Sentences:
- "The dog barked loudly."
- "The sun is shining."

Maximum Length: 6

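Padded Sequences (assuming the vocabulary {'the': 1, 'dog': 2, 'barked': 3, 'loudly': 4, 'sun': 5, 'is': 6, 'shining': 7} and padding at the beginning):
- [0, 0, 1, 2, 3, 4] ('0' represents padding)
- [0, 0, 1, 5, 6, 7]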
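
The padding and truncation rules can be sketched in plain Python. The function `pad` below is a hypothetical helper written for illustration, not the Keras implementation; its parameter names are modeled on those of Keras's `pad_sequences`, and it assumes `maxlen` is at least 1.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Return copies of `sequences`, each exactly `maxlen` items long.

    padding:    'pre' adds `value` at the beginning, 'post' at the end.
    truncating: 'pre' drops items from the beginning, 'post' from the end.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Keep the last or the first `maxlen` items.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

sequences = [[1, 2, 3, 4], [5, 6], [1, 2, 3, 4, 5, 6, 7, 8]]

pre = pad(sequences, maxlen=6)
post = pad(sequences, maxlen=6, padding="post", truncating="post")

print(pre)   # [[0, 0, 1, 2, 3, 4], [0, 0, 0, 0, 5, 6], [3, 4, 5, 6, 7, 8]]
print(post)  # [[1, 2, 3, 4, 0, 0], [5, 6, 0, 0, 0, 0], [1, 2, 3, 4, 5, 6]]
```

Whether to pad at the beginning or the end depends on the model. For recurrent models, padding at the beginning keeps the real tokens closest to the final time step, which is one reason it is a common default.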


In [8]:
from keras.preprocessing.sequence import pad_sequences

# Example sequences of different lengths (hard-coded for illustration)
sequences = [[1, 2, 3, 4, 5], [6, 7, 8, 9]]

# Set the desired maximum length
maxlen = 6

# Apply padding; by default pad_sequences adds zeros at the beginning (padding='pre')
padded_sequences = pad_sequences(sequences, maxlen=maxlen)
print(padded_sequences)


[[0 1 2 3 4 5]
 [0 0 6 7 8 9]]

