## **Introduction to NLP**

**Natural Language Processing (NLP)** is a field of artificial intelligence (AI) focused on enabling computers to understand, interpret, and generate human language.

---

### **Key Goals**
1. **Language Understanding**: Interpreting text/speech for meaning.
2. **Language Generation**: Producing human-like language from data.

---

### **Applications**
- **Text Analysis**: Sentiment analysis, classification.
- **Translation**: Tools like Google Translate.
- **Speech Recognition**: Siri, Alexa.
- **Chatbots**: Conversational assistants.
- **Information Retrieval**: Search engines, summarization.
- **Spam Filtering**: Email filtering.

---

### **Core Components**
- **Syntax**: Sentence structure (e.g., parsing).
- **Semantics**: Meaning (e.g., synonyms).
- **Pragmatics**: Context and intent (e.g., sarcasm).
- **Morphology**: Word forms (roots, affixes).
- **Phonetics**: Speech sound processing.

---

### **Basic Tasks**
- **Tokenization**: Splitting text (e.g., "NLP is fun" → ["NLP", "is", "fun"]).
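- **Stopword Removal**: Removing very common words (e.g., "is", "the") that carry little meaning.
- **Stemming/Lemmatization**: Reducing words to a stem or dictionary base form (e.g., "walking" → "walk").
- **POS Tagging**: Labeling each token with its part of speech (e.g., "NLP/NN").
- **NER**: Identifying named entities such as people, places, and organizations (e.g., "Barack Obama").
- **Dependency Parsing**: Identifying grammatical relationships between words (e.g., subject, object).
- **Sentiment Analysis**: Detecting emotional tone (e.g., positive or negative).
- **Text Summarization**: Producing a condensed version of a text.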

---

### **Key Techniques**
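- **Bag of Words (BoW)**: Represents a document by its word counts, ignoring word order.
- **TF-IDF**: Weights a word by its frequency in a document, discounted by how many documents contain it.
- **Word Embeddings**: Dense word vectors that capture similarity (e.g., Word2Vec).
- **RNNs**: Neural networks that process text as a sequence, one token at a time.
- **Transformers**: Attention-based architectures behind models like BERT and GPT.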

---

### **Challenges**
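- **Ambiguity**: Words and sentences can have multiple meanings.
- **Context Understanding**: Sarcasm and idioms depend on context beyond the literal words.
- **Language Variability**: Dialects, styles, and informal writing differ widely.
- **Out-of-Vocabulary Words**: Rare or new words may be missing from the training vocabulary.
- **Computational Needs**: Large models demand substantial processing power.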


## **NLP Pipeline**

An **NLP pipeline** processes raw text into structured data or insights for tasks like sentiment analysis, text classification, or translation.

---

### **Key Stages**

1. **Text Acquisition**: Collect text from sources like social media, web scraping, or speech-to-text.
2. **Text Preprocessing**:
   - **Tokenization**: Split text into words/sentences.
   - **Lowercasing**: Convert to lowercase.
   - **Stopword Removal**: Remove common words (e.g., "is," "the").
   - **Stemming/Lemmatization**: Reduce words to root/base forms.
   - **Punctuation Removal**: Clean text of symbols and special characters.
3. **Text Representation**:
   - **Bag of Words (BoW)**: Word frequency vectors.
   - **TF-IDF**: Weighted word importance.
   - **Embeddings**: Dense representations (e.g., Word2Vec, BERT).
4. **Model Training**: Train a classical machine learning model (e.g., Naive Bayes, SVM) or a deep learning model (e.g., RNNs, Transformers).
5. **Evaluation**: Measure performance with metrics such as accuracy, F1-score, or BLEU.
6. **Post-Processing**: Normalize and format outputs.
7. **Deployment**: Use APIs (Flask, FastAPI) or cloud platforms.
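
The preprocessing stage listed above can be sketched in plain Python. This is a minimal illustration rather than a production pipeline: the `preprocess` function and the tiny `STOPWORDS` set are assumptions made for this example, and a real pipeline would typically use a library stopword list and tokenizer.

```python
import string

# Hypothetical, deliberately tiny stopword set used only for this sketch.
STOPWORDS = {"is", "the", "a", "an", "and", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("The movie is great, and the acting is superb!"))
# ['movie', 'great', 'acting', 'superb']
```

The order of the steps matters: lowercasing first lets the lowercase stopword set match "The", and removing punctuation before splitting keeps "great," from surviving as a separate token.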

## **Text Processing**

Text processing transforms raw text into clean, structured data for NLP tasks. It ensures consistency, reduces noise, and prepares data for analysis.

In [30]:
# Standard library
import re
import string
import time

# Third-party libraries
import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

In [2]:
df = pd.read_csv("IMDB Dataset.csv")
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
def remove_html(data):
    pattern = re.compile("<.*?>")
    return pattern.sub(r"", data)

In [4]:
df.review = df.review.apply(remove_html)
df

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. The filming tec... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative |
| 49997 | I am a Catholic taught in parochial elementary... | negative |
| 49998 | I'm going to have to disagree with the previou... | negative |
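
One thing to watch with `remove_html`: replacing a tag with the empty string glues together words that were separated only by a tag. A possible variant, sketched here under the hypothetical name `strip_html`, substitutes a space instead, decodes HTML entities, and collapses repeated whitespace. The sample string is made up for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")
WS_RE = re.compile(r"\s+")

def strip_html(text):
    """Replace tags with a space, decode HTML entities, collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()

sample = "A wonderful little production.<br /><br />The filming &amp; acting"
print(strip_html(sample))
# A wonderful little production. The filming & acting
```

Entities are decoded after tag removal so that escaped text such as `&lt;b&gt;` is not mistaken for a tag.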


In [5]:
def remove_url(s):
    pattern = re.compile(r"https?://\S+|www\.\S+")
    return pattern.sub(r"", s)

In [6]:
df.review = df.review.apply(remove_url)

In [7]:
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. The filming tec... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
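
The first five reviews contain no links, so the effect of `remove_url` is not visible in the preview. A quick check on a made-up sample string (the URLs are placeholders) shows what the pattern does and does not handle:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_url(s):
    """Delete http(s):// and www. links, up to the next whitespace."""
    return URL_RE.sub("", s)

sample = "Read more at https://example.com/review?id=1 or www.example.org today"
print(repr(remove_url(sample)))
# 'Read more at  or  today'

# \S+ runs to the next whitespace, so trailing punctuation is removed too.
print(repr(remove_url("See http://a.com.")))
# 'See '
```

The removed links leave doubled spaces behind, which a later whitespace-normalization step can clean up.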


In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
exclude = string.punctuation

In [10]:
def remove_punc(data):
    for char in exclude:
        data = data.replace(char, "")
    return data

In [11]:
s = 'string. With Punctuation? '
remove_punc(s)  # loop version: this was the slower of the two approaches

'string With Punctuation '

In [12]:
def remove_punc(s):
    return s.translate(str.maketrans("", "", exclude))

In [13]:
s = 'string. With Punctuation? '
remove_punc(s)  # str.translate version: this was the faster of the two approaches

'string With Punctuation '
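
The timing remarks on the two `remove_punc` versions can be checked with `timeit`. The sketch below renames them `remove_punc_loop` and `remove_punc_translate` (names chosen here for illustration) so both can coexist, confirms they produce the same result, and prints their timings. Actual numbers depend on the input text, Python version, and machine, so it is worth measuring on your own data.

```python
import string
import timeit

exclude = string.punctuation

def remove_punc_loop(s):
    """One str.replace pass over the text per punctuation character."""
    for char in exclude:
        s = s.replace(char, "")
    return s

# Build the translation table once; str.translate then makes a single pass.
PUNC_TABLE = str.maketrans("", "", exclude)

def remove_punc_translate(s):
    """Delete all punctuation characters in one pass."""
    return s.translate(PUNC_TABLE)

text = "string. With Punctuation? " * 1000
assert remove_punc_loop(text) == remove_punc_translate(text)

loop_time = timeit.timeit(lambda: remove_punc_loop(text), number=200)
translate_time = timeit.timeit(lambda: remove_punc_translate(text), number=200)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```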

In [14]:
incorrect_text = "ceertain condition during seveal ggeneration aree moodeified in the same manaer  ."
text_blob = TextBlob(incorrect_text)
text_blob.correct().string

'certain condition during several generation are modified in the same manner  .'

In [15]:
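def remove_stopwords(s):
    # Load the stopword list once and use a set for fast membership checks.
    stop_words = set(stopwords.words("english"))
    # The NLTK list is lowercase, so compare against the lowercased word.
    kept = [word for word in s.split() if word.lower() not in stop_words]
    return " ".join(kept)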

In [16]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mobcoder/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
# df.review = df.review.apply(remove_stopwords)

In [18]:
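def remove_emoji(s):
    # Character class covering several common emoji blocks (not exhaustive).
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
        "\U0001F680-\U0001F6FF"  # transport and map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        "\U00002700-\U000027BF"  # dingbats
        "]+"
    )
    return emoji_pattern.sub("", s)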

text_with_emojis = "Hello 🌟, how are you? 🚀"
clean_text = remove_emoji(text_with_emojis)
print(clean_text)  


Hello , how are you? 


In [19]:
# Tokenization

s1 = "I am going t delhi !"
word_tokenize(s1)

['I', 'am', 'going', 't', 'delhi', '!']

In [20]:
s2 = "I am going to do Phd. in A.I"
word_tokenize(s2)

['I', 'am', 'going', 'to', 'do', 'Phd', '.', 'in', 'A.I']
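
To see why a library tokenizer is worth using, compare it with a naive regular-expression tokenizer. The `simple_tokenize` function below is a hypothetical sketch, not how `word_tokenize` works internally: it splits text into runs of word characters and single punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I am going t delhi !"))
# ['I', 'am', 'going', 't', 'delhi', '!']
print(simple_tokenize("I am going to do Phd. in A.I"))
# ['I', 'am', 'going', 'to', 'do', 'Phd', '.', 'in', 'A', '.', 'I']
```

On the first sentence the result matches the `word_tokenize` output above. On the second, the regex breaks the abbreviation "A.I" into three tokens, whereas `word_tokenize` kept it as one.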

In [21]:
nlp = spacy.load('en_core_web_sm')

In [22]:
doc1 = nlp(s1)
doc2 = nlp(s2)

In [23]:
doc1

I am going t delhi !

In [24]:
doc2

I am going to do Phd. in A.I

In [25]:
# Stemming

ps = PorterStemmer()
def stem_words(s):
    return ' '.join([ps.stem(word) for word in s.split()])

In [26]:
s = "walk walks walking walked"
stem_words(s)

'walk walk walk walk'

In [28]:
lemmatizer = WordNetLemmatizer()
s = "He was runing and eating at same time . he was habit of swimmming after playing long hours in the sun"
# Tokenize, then drop punctuation tokens. Building a new list avoids
# removing items from a list while iterating over it.
sentence = [word for word in nltk.word_tokenize(s) if word not in exclude]

In [29]:
sentence

['He',
 'was',
 'runing',
 'and',
 'eating',
 'at',
 'same',
 'time',
 'he',
 'was',
 'habit',
 'of',
 'swimmming',
 'after',
 'playing',
 'long',
 'hours',
 'in',
 'the',
 'sun']
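
The difference between stemming and lemmatization can be illustrated without any external data. The sketch below is a toy: `toy_stem` strips suffixes by rule and `toy_lemmatize` looks words up in a small hand-written table. Both functions and the `LEMMAS` table are assumptions made for this example and are far cruder than `PorterStemmer` and `WordNetLemmatizer`.

```python
SUFFIXES = ("ing", "ed", "s")

# Hypothetical hand-written lemma table; a real lemmatizer uses a full lexicon.
LEMMAS = {"was": "be", "eating": "eat", "playing": "play", "hours": "hour", "ran": "run"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

words = ["was", "eating", "playing", "hours", "ran"]
print([toy_stem(w) for w in words])
# ['was', 'eat', 'play', 'hour', 'ran']
print([toy_lemmatize(w) for w in words])
# ['be', 'eat', 'play', 'hour', 'run']
```

Rule-based stripping cannot relate irregular forms such as "was" and "ran" to their base forms, while a dictionary lookup can, at the cost of needing the dictionary.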

## **Text Representation**

In [35]:
cv = CountVectorizer()
df = pd.DataFrame({
    'text': ['people watch campusx', 'campusx watch campusx', 'people write comment', 'campusx write comment'],
    'output': [1, 1, 0, 0]
})
df

|   | text | output |
|---|---|---|
| 0 | people watch campusx | 1 |
| 1 | campusx watch campusx | 1 |
| 2 | people write comment | 0 |
| 3 | campusx write comment | 0 |


In [37]:
# Bag of Words
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'campusx': 0, 'write': 4, 'comment': 1}


In [39]:
print(bow[0].toarray())
print(bow[1].toarray())

[[1 0 1 1 0]]
[[2 0 0 1 0]]


In [41]:
cv.transform(["campusx watch and write comment of campusx "]).toarray()

array([[2, 1, 0, 1, 1]])
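
What `CountVectorizer` computes here can be reproduced by hand. The sketch below is a simplified stand-in, not scikit-learn's implementation: it tokenizes with `str.split`, whereas `CountVectorizer` by default lowercases the text and ignores single-character tokens. For this corpus the two agree.

```python
from collections import Counter

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Vocabulary: every distinct word, sorted alphabetically, mapped to a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: i for i, word in enumerate(all_words)}

def bow_vector(text):
    """Count vocabulary words in text; words outside the vocabulary are ignored."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

print(vocab)
# {'campusx': 0, 'comment': 1, 'people': 2, 'watch': 3, 'write': 4}
print(bow_vector(docs[0]))  # [1, 0, 1, 1, 0]
print(bow_vector(docs[1]))  # [2, 0, 0, 1, 0]
print(bow_vector("campusx watch and write comment of campusx"))  # [2, 1, 0, 1, 1]
```

The last call shows how unseen words ("and", "of") are handled: they have no column in the vocabulary, so they are dropped.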

In [43]:
# N-grams: unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))

In [44]:
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 4, 'watch': 7, 'campusx': 0, 'people watch': 5, 'watch campusx': 8, 'campusx watch': 1, 'write': 9, 'comment': 3, 'people write': 6, 'write comment': 10, 'campusx write': 2}
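
With `ngram_range=(1, 2)` the vocabulary contains every single word plus every pair of adjacent words. A minimal pure-Python sketch (the `ngrams` helper is hypothetical, written for this example) rebuilds the same 11-term vocabulary:

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

# Collect all unigrams and bigrams across the corpus.
terms = set()
for doc in docs:
    tokens = doc.split()
    terms.update(ngrams(tokens, 1))
    terms.update(ngrams(tokens, 2))

# Sort alphabetically and assign column indices.
vocab = {term: i for i, term in enumerate(sorted(terms))}

print(len(vocab))  # 11
print(vocab["campusx watch"], vocab["write comment"])  # 1 10
```

Bigrams retain some word order ("campusx watch" and "watch campusx" are different features), but the vocabulary grows quickly as `n` increases.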


In [46]:
# TF-IDF
tfidf = TfidfVectorizer()
tfidf.fit_transform(df['text']).toarray()

array([[0.49681612, 0.        , 0.61366674, 0.61366674, 0.        ],
       [0.8508161 , 0.        , 0.        , 0.52546357, 0.        ],
       [0.        , 0.57735027, 0.57735027, 0.        , 0.57735027],
       [0.49681612, 0.61366674, 0.        , 0.        , 0.61366674]])

In [47]:
print(tfidf.idf_)

[1.22314355 1.51082562 1.51082562 1.51082562 1.51082562]
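
The numbers above can be reproduced by hand. With scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`), the inverse document frequency of a term $t$ is $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, where $n$ is the number of documents and $\text{df}(t)$ the number of documents containing $t$. Each row of term counts is multiplied by these weights and then scaled to unit length. The sketch below is a simplified re-implementation for this small corpus, using whitespace tokenization; names such as `doc_freq` and `tfidf_vector` are chosen for this example.

```python
import math
from collections import Counter

docs = [
    "people watch campusx",
    "campusx watch campusx",
    "people write comment",
    "campusx write comment",
]

vocab = sorted({word for doc in docs for word in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = {t: sum(t in doc.split() for doc in docs) for t in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(text):
    """Term count times IDF, then L2-normalized."""
    counts = Counter(text.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

print([round(idf[t], 4) for t in vocab])
# [1.2231, 1.5108, 1.5108, 1.5108, 1.5108]
print([round(x, 4) for x in tfidf_vector(docs[0])])
# [0.4968, 0.0, 0.6137, 0.6137, 0.0]
```

"campusx" appears in three of the four documents, so it receives the lowest IDF; the other terms each appear in two documents and share the same higher weight.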
