# Introduction to NLP (Natural Language Processing)

**What is NLP?** Natural Language Processing (NLP) is a branch of **Artificial Intelligence** that helps computers understand, interpret, and respond to human language.
It acts as a bridge between human **communication** and **computer understanding**.

**How it works**
  > Breaking down language into smaller parts (words, sentences, paragraphs).\
  > Understanding meaning using grammar, syntax, and semantics.\
  > Using context to respond or take action.

**Applications**
  - Voice assistants (Alexa, Siri, Google Assistant)
  - Spell checkers (Google Docs, MS Word)
  - Chatbots (customer service, bookings)
  - Translation (Google Translate, DeepL)
  - Information extraction (search engines)
  - Keyword search (Google, Amazon search)
  - Making appointments, buying items, etc.

**Two main parts of NLP**
1. **NLU** - Natural Language Understanding
   > The computer reads text/speech and understands its meaning.\
   > Examples: Tokenization, POS tagging, Named Entity Recognition
2. **NLG** - Natural Language Generation
   > The computer generates natural-sounding text or speech.\
   > Examples: Chatbot replies, text summarization, machine translation
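
To make the NLU/NLG split concrete, here is a toy sketch in plain Python. It is not how real NLU or NLG systems work. The function names `understand` and `generate`, the keyword table, and the reply templates are all made up for illustration.

```python
import re

# Hypothetical keyword table: maps an intent name to trigger words.
INTENT_KEYWORDS = {
    "greeting": {"hello", "hi", "hey"},
    "booking": {"book", "appointment", "reserve"},
}

# Hypothetical reply templates, one per intent.
REPLIES = {
    "greeting": "Hello! How can I help you?",
    "booking": "Sure, what date would you like to book?",
    "unknown": "Sorry, I did not understand that.",
}


def understand(text):
    """Toy NLU step: lowercase, split into words, and match an intent."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"


def generate(intent):
    """Toy NLG step: turn the detected intent into a sentence."""
    return REPLIES[intent]


print(generate(understand("Hi there")))
print(generate(understand("I want to book an appointment")))
```

The first function goes from text to meaning (NLU) and the second from meaning back to text (NLG). A chatbot chains the two in that order.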

**Libraries we will use:**
> **NLTK** (Natural Language Toolkit) → general-purpose Python library for basic NLP tasks.\
> **spaCy** → fast, industrial-strength NLP processing.\
> **Gensim** → topic modeling, word embeddings.\
> **Stanford NLP (Stanza)** → advanced linguistic analysis.

- Import nltk for NLP tasks.
- Download the built-in language processing resources.
- Load a sample text document about AI.
- Use NLTK to break it down into smaller parts (tokens).

In [1]:
import os
import nltk
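# Run once to fetch the tokenizer data used below.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
# nltk.download('punkt')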

In [2]:

AI = '''Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of
humans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and
problem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.
It is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe
AI could solve major challenges and crisis situations.'''

In [3]:
AI

'Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of\nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.\nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.'

In [4]:
type(AI)

str

# Word Tokenization #
- word_tokenize() splits text into words and punctuation marks.
- Each resulting piece is called a token.
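
To see why a dedicated tokenizer is useful, the sketch below compares plain whitespace splitting with a small regular expression. The pattern and the helper name `simple_word_tokenize` are illustrative assumptions; this is a rough approximation of the idea, not the algorithm NLTK uses.

```python
import re

text = "With AI, machines perform problem-solving."

# Plain whitespace splitting leaves punctuation glued to the words.
print(text.split())

# Toy pattern: a run of word characters (hyphenated words kept whole),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_word_tokenize(s):
    """Rough approximation of word tokenization; not NLTK's algorithm."""
    return TOKEN_PATTERN.findall(s)


print(simple_word_tokenize(text))
```

With `split()`, the comma and the period stay attached (`'AI,'`, `'problem-solving.'`). The regex version returns them as separate tokens, which is the behaviour the bullets above describe.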

In [5]:
from nltk.tokenize import word_tokenize

In [6]:
AI_tokens = word_tokenize(AI)
AI_tokens

['Artificial',
 'Intelligence',
 'refers',
 'to',
 'the',
 'intelligence',
 'of',
 'machines',
 '.',
 'This',
 'is',
 'in',
 'contrast',
 'to',
 'the',
 'natural',
 'intelligence',
 'of',
 'humans',
 'and',
 'animals',
 '.',
 'With',
 'Artificial',
 'Intelligence',
 ',',
 'machines',
 'perform',
 'functions',
 'such',
 'as',
 'learning',
 ',',
 'planning',
 ',',
 'reasoning',
 'and',
 'problem-solving',
 '.',
 'Most',
 'noteworthy',
 ',',
 'Artificial',
 'Intelligence',
 'is',
 'the',
 'simulation',
 'of',
 'human',
 'intelligence',
 'by',
 'machines',
 '.',
 'It',
 'is',
 'probably',
 'the',
 'fastest-growing',
 'development',
 'in',
 'the',
 'World',
 'of',
 'technology',
 'and',
 'innovation',
 '.',
 'Furthermore',
 ',',
 'many',
 'experts',
 'believe',
 'AI',
 'could',
 'solve',
 'major',
 'challenges',
 'and',
 'crisis',
 'situations',
 '.']

In [7]:
len(AI_tokens)

81
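
Note that the count of 81 includes punctuation: in the output above, 11 tokens are punctuation marks (6 periods and 5 commas). If only words should be counted, the tokens can be filtered first. The sketch below uses a short hand-written list in place of `AI_tokens` so that it runs on its own; the helper `is_word` is a hypothetical name.

```python
# A short, hand-written token list standing in for AI_tokens.
tokens = ["With", "Artificial", "Intelligence", ",", "machines", "perform",
          "problem-solving", "."]


def is_word(token):
    """Treat a token as a word if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


words = [t for t in tokens if is_word(t)]
print(len(tokens), len(words))
```

Checking for any alphanumeric character, rather than calling `str.isalpha()` on the whole token, keeps hyphenated words such as `problem-solving`.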

# Sentence Tokenization # 
- sent_tokenize() splits text into sentences.
- It uses the pre-trained Punkt model, which looks at punctuation, capitalization, and known abbreviations to decide where a sentence ends.
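
As a point of comparison, here is a deliberately naive splitter that cuts after every `.`, `!` or `?` followed by whitespace. The function name `naive_sent_split` is made up for this sketch, and the rule is an assumption chosen to show the failure mode, not what sent_tokenize does.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    A deliberately naive rule, not what sent_tokenize does.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("AI is growing fast. Many experts agree."))
print(naive_sent_split("Dr. Smith studies AI. She teaches it too."))
```

The first call gives two clean sentences. The second wrongly breaks after `Dr.`, which is the kind of case a trained sentence tokenizer is meant to handle.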

In [9]:
from nltk.tokenize import sent_tokenize

In [10]:
AI_sent = sent_tokenize(AI)
AI_sent

['Artificial Intelligence refers to the intelligence of machines.',
 'This is in contrast to the natural intelligence of\nhumans and animals.',
 'With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving.',
 'Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.',
 'It is probably the fastest-growing development in the World of technology and innovation.',
 'Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']

In [11]:
len(AI_sent)

6

# Paragraph Tokenization 
- blankline_tokenize() splits text into paragraphs based on blank lines.
- This is helpful when working with larger documents.
- The sample text has single line breaks but no blank lines, so it comes back as one paragraph.

In [13]:
from nltk.tokenize import blankline_tokenize  # splits text wherever a blank line occurs
AI_blank = blankline_tokenize(AI)
AI_blank

['Artificial Intelligence refers to the intelligence of machines. This is in contrast to the natural intelligence of\nhumans and animals. With Artificial Intelligence, machines perform functions such as learning, planning, reasoning and\nproblem-solving. Most noteworthy, Artificial Intelligence is the simulation of human intelligence by machines.\nIt is probably the fastest-growing development in the World of technology and innovation. Furthermore, many experts believe\nAI could solve major challenges and crisis situations.']

In [14]:
len(AI_blank)

1
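
The result is 1 because the sample text contains single newlines but no blank line. The regex sketch below illustrates the same idea without NLTK. The helper name `split_paragraphs` is hypothetical, and treating this pattern as close to NLTK's behaviour is an assumption.

```python
import re


def split_paragraphs(text):
    """Split on blank lines (two newlines, optionally with spaces between)."""
    parts = re.split(r"\s*\n\s*\n\s*", text)
    return [p for p in parts if p]  # drop empty strings


one_paragraph = "First line\nsecond line of the same paragraph."
two_paragraphs = "First paragraph.\n\nSecond paragraph."

print(len(split_paragraphs(one_paragraph)))
print(len(split_paragraphs(two_paragraphs)))
```

A single `\n` is just a line wrap inside a paragraph; only an empty line between two blocks of text starts a new one.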

# TYPES OF N-GRAMS

### 1 - Bigram - a sequence of 2 consecutive tokens

### 2 - Trigram - a sequence of 3 consecutive tokens

### 3 - N-gram - a sequence of n consecutive tokens, for any n
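
All three come from the same sliding-window idea: move a window of size n across the token list one step at a time. Here is a minimal plain-Python sketch of that idea; `simple_ngrams` is a made-up helper for illustration, not the NLTK function.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so the window never runs past the end.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["I", "am", "studying", "NLP"]
print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3))
```

With n = 2 the window yields bigrams, with n = 3 trigrams, and a window longer than the token list yields nothing.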


In [22]:
import nltk

In [23]:
from nltk.util import bigrams, trigrams, ngrams

In [25]:
string = 'I am studying DataScience with AI under prakash senapati sir guidance'
quotes_tokens = nltk.word_tokenize(string)
quotes_tokens

['I',
 'am',
 'studying',
 'DataScience',
 'with',
 'AI',
 'under',
 'prakash',
 'senapati',
 'sir',
 'guidance']

In [26]:
string

'I am studying DataScience with AI under prakash senapati sir guidance'

In [28]:
len(quotes_tokens)

11

# Bigrams 

In [29]:
quotes_bigrams = list(nltk.bigrams(quotes_tokens))
quotes_bigrams

[('I', 'am'),
 ('am', 'studying'),
 ('studying', 'DataScience'),
 ('DataScience', 'with'),
 ('with', 'AI'),
 ('AI', 'under'),
 ('under', 'prakash'),
 ('prakash', 'senapati'),
 ('senapati', 'sir'),
 ('sir', 'guidance')]

# Trigrams

In [33]:
quotes_trigrams = list(nltk.trigrams(quotes_tokens))
quotes_trigrams

[('I', 'am', 'studying'),
 ('am', 'studying', 'DataScience'),
 ('studying', 'DataScience', 'with'),
 ('DataScience', 'with', 'AI'),
 ('with', 'AI', 'under'),
 ('AI', 'under', 'prakash'),
 ('under', 'prakash', 'senapati'),
 ('prakash', 'senapati', 'sir'),
 ('senapati', 'sir', 'guidance')]

In [34]:
len(quotes_trigrams)

9

# ngrams

In [36]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens))  # fails: the n-gram size n is a required argument
quotes_ngrams

TypeError: ngrams() missing 1 required positional argument: 'n'

In [38]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 4))      # ngrams of length 4
quotes_ngrams

[('I', 'am', 'studying', 'DataScience'),
 ('am', 'studying', 'DataScience', 'with'),
 ('studying', 'DataScience', 'with', 'AI'),
 ('DataScience', 'with', 'AI', 'under'),
 ('with', 'AI', 'under', 'prakash'),
 ('AI', 'under', 'prakash', 'senapati'),
 ('under', 'prakash', 'senapati', 'sir'),
 ('prakash', 'senapati', 'sir', 'guidance')]

In [40]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 5))      # ngrams of length 5 
quotes_ngrams

[('I', 'am', 'studying', 'DataScience', 'with'),
 ('am', 'studying', 'DataScience', 'with', 'AI'),
 ('studying', 'DataScience', 'with', 'AI', 'under'),
 ('DataScience', 'with', 'AI', 'under', 'prakash'),
 ('with', 'AI', 'under', 'prakash', 'senapati'),
 ('AI', 'under', 'prakash', 'senapati', 'sir'),
 ('under', 'prakash', 'senapati', 'sir', 'guidance')]

In [41]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 7))       # ngrams of length 7
quotes_ngrams

[('I', 'am', 'studying', 'DataScience', 'with', 'AI', 'under'),
 ('am', 'studying', 'DataScience', 'with', 'AI', 'under', 'prakash'),
 ('studying', 'DataScience', 'with', 'AI', 'under', 'prakash', 'senapati'),
 ('DataScience', 'with', 'AI', 'under', 'prakash', 'senapati', 'sir'),
 ('with', 'AI', 'under', 'prakash', 'senapati', 'sir', 'guidance')]

In [43]:
quotes_ngrams = list(nltk.ngrams(quotes_tokens, 15))      # n exceeds the 11 tokens, so no n-gram fits
quotes_ngrams

[]
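
The sizes of the outputs above follow a simple rule: without padding, a list of N tokens yields N - n + 1 n-grams, or none when n is larger than N. The helper `ngram_count` below is a hypothetical name used only to check that arithmetic for the 11-token sentence.

```python
def ngram_count(num_tokens, n):
    """Number of n-grams a sliding window yields over num_tokens tokens."""
    return max(0, num_tokens - n + 1)


# 11 tokens, as in quotes_tokens
for n in (2, 3, 4, 5, 7, 15):
    print(n, ngram_count(11, n))
```

This matches the results shown: 10 bigrams, 9 trigrams, 8, 7 and 5 n-grams for n = 4, 5 and 7, and an empty list for n = 15.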