In [1]:
!pip install nltk



## Text Preprocessing

Text preprocessing is an essential step in natural language processing (NLP) tasks. It involves transforming raw text data into a format that is more suitable for analysis and machine learning algorithms. In this tutorial, we will cover various common techniques for text preprocessing. Let's dive in!


### Lowercasing
Converting all text to lowercase can help to normalize the data and reduce the vocabulary size. It ensures that words in different cases are treated as the same word. For example, "apple" and "Apple" will both be transformed to "apple".

In [2]:
sent = "Hello, I am your AI Sathi R@3#."

In [3]:
lower_sent = sent.lower()
lower_sent

'hello, i am your ai sathi r@3#.'
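
For text that goes beyond plain ASCII, `lower()` may not be enough to make two spellings compare equal. The sketch below compares it with Python's built-in `str.casefold()`, which is designed for caseless matching. The sample word is an illustrative assumption and is not part of the tutorial's data.

```python
# Compare lower() and casefold() on a word containing the German sharp s.
word = "Straße"  # illustrative sample, not from the original notebook

lowered = word.lower()     # 'ß' is already lowercase, so it is kept as-is
folded = word.casefold()   # casefold() additionally maps 'ß' to 'ss'

print(lowered)
print(folded)
```

For English-only text the two methods usually give the same result, so `lower()` remains a reasonable default.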

### Removal of Punctuation and Special Characters
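Punctuation marks and special characters carry little meaning for many NLP tasks and can usually be removed. Common punctuation marks include periods, commas, question marks, and exclamation marks. You can use string operations or regular expressions to remove them. A fixed list removes only the characters you list, so symbols such as `@` and `#` survive unless you add them, while a pattern such as `[^\w\s]` removes every character that is not a word character (letter, digit, or underscore) or whitespace.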

In [4]:
common_punctuation = ['.', ',', ':', ';', '!', '?', '(', ')', '"', "'"]

In [6]:
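# Keep every character that is not in the punctuation list.
# '@' and '#' are not in the list, so they remain in the output.
result = ""
for each in lower_sent:
    if each not in common_punctuation:
        result += each
result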

'hello i am your ai sathi r@3#'

In [7]:
import re

cleaned = re.sub(r'[^\w\s]','', lower_sent)
cleaned

'hello i am your ai sathi r3'
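
As a string-operation alternative to the loop and the regular expression, the following sketch uses `str.translate` with the standard library's `string.punctuation`. The variable names `remove_punct` and `cleaned_translate` are hypothetical and chosen only for this example.

```python
import string

sent = "Hello, I am your AI Sathi R@3#."

# Build a translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)

# Lowercase first, then drop the punctuation in a single pass.
cleaned_translate = sent.lower().translate(remove_punct)
print(cleaned_translate)
```

One difference to keep in mind: `string.punctuation` includes the underscore, whereas `[^\w\s]` keeps it because `\w` matches underscores.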

### Stop Word Removal
Stop words are commonly occurring words in a language, such as "a," "an," "the," "is," and "in." These words provide little semantic value and can be removed to reduce noise in the data. Libraries like NLTK provide a list of predefined stop words for different languages.

Before running the code, make sure you have downloaded the stop word list using the first cell below.

In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/shailesh/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
from nltk.corpus import stopwords

In [15]:
stopwords_eng = stopwords.words('english')

In [16]:
stopwords_eng

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [17]:
# split() with no argument also handles repeated spaces, tabs, and newlines
filtered = [word for word in cleaned.split() if word not in stopwords_eng]

In [19]:
" ".join(filtered)

'hello ai sathi r3'
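
The three steps covered so far can be combined into one helper. This is a minimal sketch, not a definitive implementation: the function name `preprocess` and the small `demo_stop_words` set are assumptions made so the example runs without downloading anything. In the notebook you could pass `set(stopwords.words('english'))` instead.

```python
import re

def preprocess(text, stop_words):
    """Lowercase the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    # Remove everything that is not a word character or whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Split on any whitespace and keep tokens that are not stop words.
    tokens = [tok for tok in no_punct.split() if tok not in stop_words]
    return " ".join(tokens)

# Hypothetical, hand-picked stop words used only for this demonstration.
demo_stop_words = {"i", "am", "your", "a", "an", "the", "is", "in"}

processed = preprocess("Hello, I am your AI Sathi R@3#.", demo_stop_words)
print(processed)
```

Passing the stop words as a set keeps each membership check fast, which starts to matter once the input grows beyond a single sentence.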