#### Text Preprocessing in Python
- Python libraries such as NLTK and spaCy are widely used for text preprocessing in NLP tasks.

### 1. Tokenization 
- Word Tokenization
- Sentence Tokenization
- Character Tokenization
- Whitespace Tokenization
- Punctuation-Aware Tokenization

In [2]:
#pip install nltk

In [3]:
import nltk
nltk.download('punkt_tab')  # Download the Punkt tokenizer data needed by word_tokenize and sent_tokenize

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [4]:
from nltk.tokenize import word_tokenize

text = "I love programming!"

# Split the text into word and punctuation tokens
words = word_tokenize(text)
print(words)

['I', 'love', 'programming', '!']


- NLTK (Natural Language Toolkit) is a popular Python library for Natural Language Processing (NLP).
- punkt_tab is the pre-trained tokenizer data shipped for NLTK's Punkt tokenizer, which is used for breaking text down into words or sentences. Both word_tokenize and sent_tokenize rely on it, so it must be downloaded before tokenizing.

#### Sentence tokenization splits a text into sentences

In [None]:
from nltk.tokenize import sent_tokenize

# Sample text
text = "I love programming. It's fun!"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokens:", sentences)

Sentence Tokens: ['I love programming.', "It's fun!"]
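To see why a trained sentence tokenizer is useful, here is a minimal sketch of a naive rule-based splitter that uses only the standard library. The function name `naive_sent_split` is hypothetical and not part of NLTK. It splits after every `.`, `!` or `?` followed by whitespace, so it breaks on abbreviations such as "Dr.", which is the kind of case a trained model like Punkt is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on the simple example used above
simple = naive_sent_split("I love programming. It's fun!")
print(simple)

# Breaks on an abbreviation: "Dr." is treated as the end of a sentence
tricky = naive_sent_split("Dr. Smith loves programming. It's fun!")
print(tricky)
```

The second call returns three pieces instead of two, because the rule cannot tell an abbreviation from a sentence boundary.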

#### Character tokenization splits text into individual characters

In [None]:
# Example word
word = "programming"

# Character Tokenization (manual splitting)
char_tokens = list(word)

# Print the result
print("Character Tokenization:", char_tokens)

Character Tokenization: ['p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g']
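One caveat with `list(word)` is that it splits a string into Unicode code points, not necessarily into the characters a reader sees. The sketch below, which assumes an accented example word chosen purely for illustration, shows how a combining accent becomes its own token unless the text is normalized first with the standard `unicodedata` module.

```python
import unicodedata

# 'e' followed by a combining acute accent (U+0301): displays as one character
decomposed = "cafe\u0301"
decomposed_tokens = list(decomposed)
print(len(decomposed_tokens))  # 5 code points, the accent is a separate token

# NFC normalization merges the pair into a single precomposed code point
composed = unicodedata.normalize("NFC", decomposed)
composed_tokens = list(composed)
print(len(composed_tokens))  # 4 code points
```

Normalizing before character tokenization keeps token counts consistent across inputs that look identical on screen.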

#### Whitespace tokenization splits the text based on spaces (whitespace characters)

In [None]:
# Example sentence
text = "I love programming."

# Whitespace Tokenization (split() with no arguments splits on any run of whitespace)
whitespace_tokens = text.split()

# Print the result
print("Whitespace Tokenization:", whitespace_tokens)

Whitespace Tokenization: ['I', 'love', 'programming.']
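The difference between `split()` and `split(" ")` matters once the text contains repeated spaces, tabs, or newlines. The short sketch below uses a made-up messy string to compare the two calls.

```python
messy = "I  love\tprogramming.\n"

# No argument: splits on any run of whitespace and drops empty strings
any_whitespace = messy.split()
print(any_whitespace)

# Explicit single space: tabs and newlines are kept, double spaces yield ''
single_space = messy.split(" ")
print(single_space)
```

For tokenization, the no-argument form is usually the one you want, since it is robust to irregular spacing.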

#### Punctuation-Aware Tokenization handles punctuation separately from words

In [None]:
# Example sentence with punctuation
text = "Hello! How are you doing?"

# Word Tokenization (with punctuation handling)
word_tokens_with_punct = word_tokenize(text)

# Print the result
print("Punctuation-Aware Tokenization:", word_tokens_with_punct)

Punctuation-Aware Tokenization: ['Hello', '!', 'How', 'are', 'you', 'doing', '?']
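To make the idea concrete without NLTK, here is a rough regular-expression sketch of punctuation-aware tokenization. The function name `regex_tokenize` is hypothetical, and the pattern is a simplification: it is not how `word_tokenize` is implemented and it will treat contractions differently.

```python
import re

def regex_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    # \w+      : one or more letters, digits, or underscores
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens_regex = regex_tokenize("Hello! How are you doing?")
print(tokens_regex)
```

On this sentence the result matches the output shown above. On text with apostrophes, such as "It's", this pattern emits the apostrophe as its own token, so prefer a proper tokenizer for real data.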

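### 2. Stop Word Removal
- Stop words are very common words (such as "and", "it", "is") that are often removed before analysis because they carry little meaning on their own.

In [12]: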
from nltk.corpus import stopwords

In [13]:
# Example sentence (tokens)
tokens = ["I", "love", "programming", "and", "it", "is", "fun"]
tokens

['I', 'love', 'programming', 'and', 'it', 'is', 'fun']

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [15]:
# Get the list of stop words in English
stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 ...

In [16]:
# Print the stop words list (optional)
print("Stop Words List:", stop_words)

Stop Words List: {'being', 'to', 'until', 'their', 'when', 'ourselves', 'because', 'm', 'which', 'not', 'this', 'didn', 'down', 'there', 'few', 'o', 'at', 'who', 'any', 'ours', 'yourselves', 'other', "it's", 'theirs', "isn't", 'was', 'more', "needn't", "hadn't", 'has', 'she', "should've", 'with', 'll', 'weren', 'after', 'did', 'd', 'those', 'yourself', 'it', 'a', 'while', 'isn', 'for', 'nor', 'as', 'during', 'what', 'own', 'he', "shouldn't", 'between', 'is', 'very', "you'd", 's', 'whom', "didn't", "shan't", "aren't", 'again', 'over', "mightn't", "won't", 'further', 'most', 'had', 'from', "haven't", 'won', 'were', 'ain', 'her', 'herself', 'so', 'such', "she's", "mustn't", "you're", 'into', 'just', 'me', 'doing', 'an', 'wasn', 'do', 'but', "don't", 'am', 'through', 'its', 'where', 'under', 're', 'why', 'below', 'or', "you've", 'both', 'the', 'having', 'yours', 'our', 'up', 'aren', 'about', 'wouldn', 'him', 'your', 'are', "wouldn't", "hasn't", 'himself', 'some', "weren't", 'no', 'should', ...

In [17]:
nltk.download('stopwords')  # Already downloaded above; re-running is harmless and reports the package as up to date

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
# Remove stop words (lowercase each token first, because the NLTK list is all lowercase)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

In [19]:
# Print the filtered tokens
# Original tokens: ["I", "love", "programming", "and", "it", "is", "fun"]
print("Tokens after Stop Word Removal:", filtered_tokens)

Tokens after Stop Word Removal: ['love', 'programming', 'fun']
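The filtering step can be wrapped in a small reusable function. The sketch below is self-contained: `remove_stop_words` is a hypothetical helper, and `SAMPLE_STOP_WORDS` is a tiny hand-written set standing in for `stopwords.words('english')` so the example runs without downloading anything. It also shows why the `.lower()` call matters.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]

# Small stand-in for the NLTK English list (assumption for illustration)
SAMPLE_STOP_WORDS = {"i", "and", "it", "is"}

sample_tokens = ["I", "love", "programming", "and", "it", "is", "fun"]

# Case-insensitive filtering removes "I" as well
case_insensitive = remove_stop_words(sample_tokens, SAMPLE_STOP_WORDS)
print(case_insensitive)

# Without lowercasing, the capitalized "I" slips through
case_sensitive = [t for t in sample_tokens if t not in SAMPLE_STOP_WORDS]
print(case_sensitive)
```

Storing the stop words in a `set`, as the cell above does, keeps each membership check fast even for long token lists.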
