Natural Language Processing (NLP) is a field of artificial intelligence that enables computers to understand, interpret, and generate human language. It underpins interfaces that let people interact with machines in everyday language rather than in commands or code.


### Why NLP Is Needed
- Helps computers process and understand human language, making interactions more intuitive.
- Enables businesses to automate tasks like customer support through chatbots.
- Helps search engines retrieve results that are relevant to user queries.

### Uses of NLP
- Text Analysis: NLP is used in sentiment analysis to determine whether customer feedback is positive, negative, or neutral.
- Speech Recognition: Virtual assistants like Siri and Alexa convert spoken language into text.


## Tokenization
Tokenization is the process of splitting text into smaller units called tokens, typically words and punctuation marks. For example:
- Input: "I love natural language processing!"
- Tokenized Output: ["I", "love", "natural", "language", "processing", "!"]
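
The split shown above can be reproduced with a minimal regular-expression sketch using only the standard library. The function name `simple_tokenize` is hypothetical and chosen for illustration; real tokenizers handle many more cases (contractions, URLs, numbers with separators).

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love natural language processing!")
print(tokens)  # ['I', 'love', 'natural', 'language', 'processing', '!']
```

Each punctuation mark becomes its own token, which is why `"!"` appears as a separate element.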


In [1]:
# importing dependencies
from tensorflow.keras.preprocessing.text import Tokenizer

In [4]:
sentences = [
    'I love my Dog',
    'I love my Cat'
]

In [5]:
sentences

['I love my Dog', 'I love my Cat']

In [6]:
token = Tokenizer(num_words=100)

In [7]:
token.fit_on_texts(sentences)

In [9]:
index = token.word_index

In [10]:
index

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
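
To make the mapping above less opaque, here is a rough standard-library approximation of how such a word index can be built: lowercase the text, strip punctuation, count words, and number them from most to least frequent. The helper `build_word_index` is hypothetical, and it is only a sketch of the idea; the Keras `Tokenizer` differs in details such as its exact filter characters.

```python
from collections import Counter
import string

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    # Assumption: deleting ASCII punctuation approximates the tokenizer's filtering.
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

sentences = ['I love my Dog', 'I love my Cat']
print(build_word_index(sentences))
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```

Index 0 is left unused in this sketch, mirroring the 1-based numbering in the output above.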

In [1]:
import nltk

In [17]:
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Dips\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [9]:
data = "India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia."

In [None]:
# sentence tokenize
nltk.sent_tokenize(data)

['India (Hindi: Bhārat), officially the Republic of India, is a country in South Asia.',
 'It is the seventh-largest country by area, the second-most populous country, and the most populous democracy in the world.',
 'Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.',
 'In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand and Indonesia.']
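
For contrast, a naive rule-based splitter shows why sentence tokenization is usually left to a trained model. The sketch below (the name `naive_sent_split` is hypothetical) splits on whitespace that follows `.`, `!`, or `?`, which works on simple text but breaks on abbreviations.

```python
import re

def naive_sent_split(text: str) -> list[str]:
    """Split on whitespace that directly follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sent_split("India is in South Asia. It is large."))
# ['India is in South Asia.', 'It is large.']

print(naive_sent_split("Dr. Rao arrived. He sat down."))
# ['Dr.', 'Rao arrived.', 'He sat down.']
```

The second call wrongly treats the period in "Dr." as a sentence boundary, the kind of case that tokenizers trained on abbreviation patterns are designed to handle.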

In [11]:
# word tokenize
nltk.word_tokenize(data)

['India',
 '(',
 'Hindi',
 ':',
 'Bhārat',
 ')',
 ',',
 'officially',
 'the',
 'Republic',
 'of',
 'India',
 ',',
 'is',
 'a',
 'country',
 'in',
 'South',
 'Asia',
 '.',
 'It',
 'is',
 'the',
 'seventh-largest',
 'country',
 'by',
 'area',
 ',',
 'the',
 'second-most',
 'populous',
 'country',
 ',',
 'and',
 'the',
 'most',
 'populous',
 'democracy',
 'in',
 'the',
 'world',
 '.',
 'Bounded',
 'by',
 'the',
 'Indian',
 'Ocean',
 'on',
 'the',
 'south',
 ',',
 'the',
 'Arabian',
 'Sea',
 'on',
 'the',
 'southwest',
 ',',
 'and',
 'the',
 'Bay',
 'of',
 'Bengal',
 'on',
 'the',
 'southeast',
 ',',
 'it',
 'shares',
 'land',
 'borders',
 'with',
 'Pakistan',
 'to',
 'the',
 'west',
 ';',
 'China',
 ',',
 'Nepal',
 ',',
 'and',
 'Bhutan',
 'to',
 'the',
 'north',
 ';',
 'and',
 'Bangladesh',
 'and',
 'Myanmar',
 'to',
 'the',
 'east',
 '.',
 'In',
 'the',
 'Indian',
 'Ocean',
 ',',
 'India',
 'is',
 'in',
 'the',
 'vicinity',
 'of',
 'Sri',
 'Lanka',
 'and',
 'the',
 'Maldives',
 ';',
 'i

In [None]:
# POS Tags ( Part of Speech)

### Task: Cleaning text data
Remove punctuation and stop words from raw text.

In [12]:
import string

In [13]:
def cleaned_data(data):
    data = data.translate(str.maketrans('', '', string.punctuation))
    return data

raw_sentence = "Hi, I love to read books, Do you?"
cleaned_sentence = cleaned_data(raw_sentence)
print(cleaned_sentence)

Hi I love to read books Do you
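
The punctuation removal above relies on a translation table. As a small illustrative sketch: `str.maketrans('', '', chars)` builds a table that maps the code point of every character in `chars` to `None`, and `str.translate` drops any character mapped to `None`. Note that `string.punctuation` covers ASCII punctuation only.

```python
import string

# Table mapping each ASCII punctuation code point to None (meaning "delete").
table = str.maketrans("", "", string.punctuation)

print(table[ord(",")])  # None
print("Hi, I love to read books, Do you?".translate(table))
# Hi I love to read books Do you

# Curly quotes are not in string.punctuation, so they are left untouched.
curly = "\u201cHi\u201d"
print(curly.translate(table) == curly)  # True
```

Text copied from web pages or documents often contains such non-ASCII marks, so they may need to be added to the table explicitly.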


In [15]:
import string
import nltk
from nltk.corpus import stopwords


In [16]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dips\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:

def cleaned_data(data):
    data = data.translate(str.maketrans('', '', string.punctuation))
    data = ' '.join([word for word in data.split() if word.lower() not in stop_words])
    return data
raw_sentence = "Hi, I love to read books, Do you?"
cleaned_sentence = cleaned_data(raw_sentence)
print(cleaned_sentence)

Hi love read books


In [18]:
def clean_data_2(raw_text: str) -> str:
    # Set of single punctuation characters for exact-match lookup
    punctuation = set(string.punctuation)
    clean_data = []
    for word in nltk.word_tokenize(raw_text.lower()):
        # Keep tokens that are neither stop words nor standalone punctuation
        if word not in stop_words and word not in punctuation:
            clean_data.append(word)
    return " ".join(clean_data)

In [19]:
clean_data_2(data)

'india hindi bhārat officially republic india country south asia seventh-largest country area second-most populous country populous democracy world bounded indian ocean south arabian sea southwest bay bengal southeast shares land borders pakistan west china nepal bhutan north bangladesh myanmar east indian ocean india vicinity sri lanka maldives andaman nicobar islands share maritime border thailand indonesia'