# NLP
Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate in human language.

<img src="https://cdn.prod.website-files.com/5ec6a20095cdf182f108f666/5f22908f09f2341721cd8901_AI%20poster.png" width="40%"></img>

## Why is it needed?

Natural Language Processing (NLP) is essential because it bridges the gap between human communication and machine understanding, enabling computers to process and analyze human language effectively. Here are the key reasons why NLP is needed, along with examples:

**1. Understanding Human Language**<br>
Computers inherently do not understand human languages, which are complex, context-dependent, and full of nuances like sarcasm, idioms, and dialects. NLP enables machines to interpret and generate human language for meaningful interaction.<br>
**e.g.,** Virtual assistants like Alexa and Siri use NLP to interpret spoken commands and provide relevant responses.

**2. Automating Repetitive Tasks**<br>
NLP automates tasks such as data entry, document classification, and summarization, saving time and reducing errors.<br>
**e.g.,** Customer service chatbots handle routine queries, freeing human agents for complex issues.

**3. Analyzing Unstructured Data**<br>
A significant portion of data is unstructured (e.g., social media posts, reviews). NLP extracts insights from this data for decision-making.<br>
**e.g.,** Sentiment analysis helps businesses understand customer opinions from reviews or tweets.

**4. Enhancing Productivity**<br>
NLP-powered tools streamline workflows by automating tasks like email sorting, invoice processing, or extracting key information from documents.<br>
**e.g.,** Accounting systems use NLP to populate databases from invoices automatically.

**5. Enabling Accessibility**<br>
NLP makes technology accessible to people with disabilities by supporting voice commands and text-to-speech systems.<br>
**e.g.,** Screen readers for visually impaired users leverage NLP for better comprehension.

**6. Improving Communication Across Languages**<br>
NLP facilitates real-time language translation, breaking down communication barriers globally.<br>
**e.g.,** Google Translate uses NLP to translate text while preserving meaning and context.

**7. Driving Innovation in Specialized Fields**<br>
NLP enables advancements in fields like healthcare (analyzing medical records) or autonomous systems (interpreting commands).<br>
**e.g.,** Clinical applications use NLP to summarize patient records efficiently.

**8. Knowledge Graph and QA systems**<br>
A knowledge graph (KG) is a structured network of entities (nodes) and their relationships (edges) that enables machines to understand context and meaning, making it a critical component in modern question-answering (QA) systems.<br>
**e.g.,** When asked, "Where was the painter of the Mona Lisa born?", the KG links "Mona Lisa" → "painted by" → "Leonardo da Vinci" → "born in" → "Italy" to derive the answer.


<img src="https://assets.zilliz.com/Figure_1_Knowledge_graphs_illustration_643cec06af.png"></img>

For example, a search for the film director James Cameron reveals information such as his date of birth, height, the movies and TV shows he directed, previous romantic partners, and the TED Talks he gave.
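
The multi-hop lookup described above can be sketched with a plain dictionary. This is only a toy illustration: the `KG` dictionary, the relation names, and the `follow` helper are assumptions made for this example, not a real knowledge-graph API.

```python
# A toy knowledge graph stored as (subject, relation) -> object triples.
# Entities and relation names are illustrative, not a real KG schema.
KG = {
    ("Mona Lisa", "painted by"): "Leonardo da Vinci",
    ("Leonardo da Vinci", "born in"): "Italy",
}

def follow(start, relations, graph=KG):
    """Walk the graph from `start`, following one edge per relation name."""
    node = start
    for relation in relations:
        node = graph.get((node, relation))
        if node is None:  # missing edge: the question cannot be answered
            return None
    return node

answer = follow("Mona Lisa", ["painted by", "born in"])
print(answer)  # Italy
```

A real QA system would also need to map the question text onto the entity and the chain of relations; that step is left out here.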

**9. Text parsing**<br>
Text parsing, also known as syntactic analysis, is the process of analyzing text to understand its structure and meaning based on grammatical rules, separating it into smaller components for further processing. 

<img src='https://cdn.botpenguin.com/assets/website/Parsing_4bcdfead23.webp' width='40%'></img>

## Approaches used for NLP
**1. Heuristic approach**<br>
Heuristics are rules of thumb that solve a problem quickly without learning from data. In NLP, this means hand-written rules and curated lexical resources.

e.g., regular expressions, WordNet, Open Mind Common Sense.
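
As a small illustration of the heuristic approach, the sketch below pulls email addresses and ISO-style dates out of text with regular expressions. The patterns and the sample sentence are assumptions chosen for brevity; real-world emails and dates need stricter rules.

```python
import re

# Deliberately simple patterns; they will miss or over-match unusual inputs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

text = "Contact support@example.com before 2024-05-31 or write to sales@example.org."

emails = EMAIL.findall(text)
dates = DATE.findall(text)

print(emails)  # ['support@example.com', 'sales@example.org']
print(dates)   # ['2024-05-31']
```

No training data is involved: all the "knowledge" lives in the two hand-written patterns.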

**2. ML approach**<br>
This approach is data-driven: the text is first converted into numeric features, and a learning algorithm is then trained on them.

e.g., Naive Bayes, SVM, logistic regression, LDA, hidden Markov models
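
To make "convert the text into numbers and then apply algorithms" concrete, here is a minimal multinomial Naive Bayes classifier written from scratch. It is a sketch under simplifying assumptions: whitespace tokenization, add-one smoothing, and a four-sentence training set invented for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict_nb(text, model):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior + sum of log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring story bad acting", "neg"),
]
model = train_nb(train)
print(predict_nb("great story", model))       # pos
print(predict_nb("bad boring movie", model))  # neg
```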

**3. Deep Learning Approach**<br>
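In classical ML, sequential (word-order) information is often lost when text is converted into numbers, for example with bag-of-words features. DL models consume the tokens in order, so the sequential information is preserved. Unlike ML, DL also needs no manual feature generation, because the model learns its own features.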

e.g., RNN, LSTM, GRU, Transformers
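
The loss of sequential information can be seen in a few lines. The sketch below compares an order-free bag-of-words representation with an order-preserving sequence of token ids, which is the kind of input a sequence model would receive. The helper names and the two sentences are assumptions made for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Order-free representation: word -> count."""
    return Counter(text.lower().split())

def to_id_sequence(text, vocab):
    """Order-preserving representation: one integer id per token."""
    return [vocab[word] for word in text.lower().split()]

a = "dog bites man"
b = "man bites dog"

# vocab maps bites -> 0, dog -> 1, man -> 2
vocab = {word: i for i, word in enumerate(sorted(set(a.split())))}

same_bag = bag_of_words(a) == bag_of_words(b)
seq_a = to_id_sequence(a, vocab)
seq_b = to_id_sequence(b, vocab)

print(same_bag)  # True: the two sentences are indistinguishable
print(seq_a)     # [1, 0, 2]
print(seq_b)     # [2, 0, 1]
```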


## Challenges in NLP

**1. Ambiguity**<br>
A single word or sentence can have several meanings. Humans resolve this easily, but machines do not.

e.g., I saw the boy on the bench with my binoculars. (Who has the binoculars, the speaker or the boy?)

**2. Contextual word**<br>
The same word can mean different things depending on the context.

e.g., I ran to the store because we ran out of supplies.

**3. Colloquialisms and slang**<br>
A colloquialism is a word or phrase that is used in conversation but not in formal speech or writing. For example, "pulling someone's leg" means teasing them, which is obvious to us, but a machine may read it literally.

**4. Synonyms**

**5. Tonal difference and irony**

**6. Spelling errors**

**7. Creativity**<br>
e.g., Poems, dialogs, scripts

**8. Many languages and their diversity**

## NLP Pipeline
The NLP pipeline is the set of steps followed to build end-to-end NLP software. It consists of the following steps:

1. Data acquisition

2. Text Preparation:
    - Text cleanup <br>(fixing spelling mistakes, removing emojis, etc.)
    - Basic preprocessing <br>(tokenization, removing punctuation and stopwords, etc.)
    - Advanced preprocessing <br>(chunking, part-of-speech (POS) tagging, coreference resolution, etc.)

3. Feature Engineering:<br>
Converting words into numbers. <br>
e.g., TF-IDF, Bag-of-words, word2vec

4. Modelling
    - Model building
    - Evaluation

5. Deployment
    - Deployment
    - Monitoring
    - Model update

This pipeline applies mainly to classical ML rather than DL. It is also not universal: it works well for tasks such as sentiment analysis or text summarization, but not for a chatbot.

The pipeline is also non-linear, i.e., we go back and forth between steps continuously, depending on the results.
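
The feature-engineering step (turning words into numbers) can be sketched from scratch for bag-of-words and TF-IDF. The three documents are made up for this example, and the unsmoothed `log(N / df)` form of IDF is one of several common variants; libraries such as scikit-learn use smoothed versions, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens):
    """Raw term counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def tfidf_vector(tokens):
    """Term frequency times inverse document frequency (unsmoothed)."""
    counts = Counter(tokens)
    vector = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        df = sum(1 for other in tokenized if word in other)
        idf = math.log(len(tokenized) / df)
        vector.append(tf * idf)
    return vector

bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "cat" ends up with a higher TF-IDF weight than "the" even though "the" occurs twice, because "the" also appears in another document.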

```mermaid
---
title: "1. Data Acquisition"
---
%%{init: {"flowchart": {"htmlLabels": true}}}%%
flowchart LR

    A{"<b>Data Acquisition</b>"} --> B("Available") & C("Other Sources") & D("Nowhere")
    subgraph Available[" "]
        B --> E["Already available in csv"] & F["In the data warehouse, \nneed a data engineer to retrieve the data"] & G["Less Data"]
        G --> H["Data \naugmentation"]
        H --> HA["Replacing some words with Synonyms"] & HB["Bigram flip | e.g., [king, man] to [man, king]"] & HC["Back-translate \n| Used to rearrange text \n| Converting into another lang & \nthen converting back"] & HD["Adding Noise"]
    end
    style A color:#000000, fill:#FFF9C4,  stroke:#000000

    subgraph OtherSources[" "]
        C --> CA["<p align='left'>1. Public Dataset <br> 2. Web Scraping <br> 3. APIs <br> 4. PDF <br> 5. Image <br> 6. Audio"] 
    end

```
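
Three of the augmentation techniques from the diagram (synonym replacement, bigram flip, adding noise) can be sketched with the standard library. The `SYNONYMS` table and all function names are hypothetical and exist only for this example. Back-translation is not shown because it needs a translation model or service.

```python
import random

# Hypothetical mini synonym table; a real project might draw on WordNet instead.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}

def replace_synonyms(tokens, rng):
    """Swap each word that has an entry in SYNONYMS for a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    flipped = list(tokens)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped

def add_noise(tokens, rng, p=0.1):
    """Randomly drop tokens with probability p (one simple kind of noise)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or list(tokens)  # never return an empty sentence

rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the quick dog looks happy".split()

print(replace_synonyms(tokens, rng))
print(bigram_flip(tokens, rng))
print(add_noise(tokens, rng))
```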




```mermaid
---
title: 2. Text preparation
---
%%{init: {"flowchart": {"htmlLabels": true}}}%%
flowchart LR

    A{"<b>Text Preparation</b>"} --> B(Cleaning) & C(Basic preprocessing) & D(Advanced preprocessing)
    subgraph Basic_preprocessing[" "]
    C --> CA[Must] & CC["Optional \n|Based on application"]
    CA --> CAA["Tokenization \n| Sentence or Word Tokenization"]
    CC --> CCA["<p align='left'>1. StopWords removal \n2. Word Stemming \n3. Removing Digits & punctuation \n4. Lower casing \n5. Language Detection"]
    end
    subgraph Advance_preprocessing[" "]
    D --> DA["<p align='left'>1. POS tagging\n2. Parsing\n3. Coreference resolution"]
    end

style A color:#000000, fill:#FFF9C4,  stroke:#000000
```
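
Tokenization, the one mandatory step in the diagram, can be approximated with regular expressions. This is a naive sketch: the function names are made up for this example, and the sentence splitter will break on abbreviations such as "Dr.", which dedicated tokenizers (for example those in NLTK) handle better.

```python
import re

def sentence_tokenize(text):
    """Naive split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Words (with an optional apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "NLP is fun. Isn't it? Yes!"
sentences = sentence_tokenize(text)
words = [word_tokenize(s.lower()) for s in sentences]  # lowercasing is optional

print(sentences)  # ['NLP is fun.', "Isn't it?", 'Yes!']
print(words)      # [['nlp', 'is', 'fun', '.'], ["isn't", 'it', '?'], ['yes', '!']]
```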


```mermaid
flowchart TB
A{"<b>Modeling</b>"} --> B["Heuristic approach \n (if little data)"] & C["ML models\n (if a moderate amount of data)"] & D["DL\n (if a lot of data)"] & E["Cloud API \n (if a large budget)"]
style A color:#000000, fill:#FFF9C4,  stroke:#000000
```

### Removing HTML tags

```python
import re

def remove_html(data):
    # Remove anything that looks like an HTML tag
    p = re.compile(r'<.*?>')
    return p.sub(r'', data)

data = "<html><head>Krishna.. <a href='google.com'> yes"
remove_html(data)
```

```text
'Krishna..  yes'
```
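
A regular expression like the one above can misfire on input such as a `>` inside an attribute value or markup-like text inside a `<script>` block. As a hedged alternative, the sketch below uses the standard library's `html.parser` module; the `TextExtractor` class and `strip_html` function are names invented for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_html("<p>Hello <b>world</b><script>var x = '<b>';</script></p>"))
```

For heavily malformed pages, a dedicated HTML library may still be the more robust choice.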

### Why remove punctuation?
Removing punctuation simplifies the text and reduces its variability, which streamlines later analysis. For many tasks, punctuation marks also carry little meaning of their own.

e.g., without this step, `hi!` and `hi` are treated as two different tokens, which needlessly increases complexity.

```python
import string

string.punctuation
```

```text
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
```

```python
def remove_punc(text):
    # Strip each punctuation character in turn
    for char in string.punctuation:
        text = text.replace(char, '')
    return text

text = r"""Removing stopwords is a common text-processing task. The words (like "is," "the," "at," etc.) usually don't contribute to the meaning"""
remove_punc(text)
```

```python
import time

# Time a single call of the loop-based version
start = time.time()
remove_punc(text)
time1 = time.time() - start
print(time1)
```

```text
0.0002079010009765625
```


```python
# Alternative: delete all punctuation in a single pass with str.translate
import string

def again_punc_remove(text):
    return text.translate(str.maketrans('', '', string.punctuation))
```

```python
import time

# Time a single call of the str.translate version
start = time.time()
again_punc_remove(text)
time2 = time.time() - start
print(time2)
```

```text
0.00016546249389648438
```
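
A single `time.time()` measurement of a call this short is noisy, so the two timings above are only indicative. A steadier comparison can be sketched with the standard `timeit` module, which repeats each call many times. The function names, the sample text, and the repetition count below are assumptions made for this example, and the actual numbers depend on the machine.

```python
import string
import timeit

def remove_punc_loop(text):
    for char in string.punctuation:
        text = text.replace(char, "")
    return text

# Build the translation table once and reuse it on every call.
PUNC_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_translate(text):
    return text.translate(PUNC_TABLE)

sample = 'Words (like "is," "the," "at," etc.) usually don\'t contribute much! ' * 50

# Both versions must agree before a speed comparison makes sense.
assert remove_punc_loop(sample) == remove_punc_translate(sample)

loop_time = timeit.timeit(lambda: remove_punc_loop(sample), number=200)
translate_time = timeit.timeit(lambda: remove_punc_translate(sample), number=200)
print(f"replace loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```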


### Spell correction

```python
from textblob import TextBlob

incorrect = "wjo iss thiis? worng speeling "
TextBlob(incorrect).correct().string
```

```text
'who iss this? wrong spelling '
```
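
To give a feel for how dictionary-based spell correction can work, here is a minimal sketch in the style of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input and keep the most frequent known word. The `WORD_FREQ` table is a toy assumption standing in for counts from a real corpus, and this is not presented as TextBlob's exact implementation, so results can differ from the output above.

```python
import string

# Toy word-frequency table standing in for a real corpus (assumption).
WORD_FREQ = {"who": 50, "is": 90, "this": 80, "wrong": 20, "spelling": 10}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

corrected = [correct(w) for w in "wjo iss thiis worng speeling".split()]
print(corrected)  # ['who', 'is', 'this', 'wrong', 'spelling']
```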

### Stop words

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
```

```text
[nltk_data] Downloading package stopwords to /home/naman/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
```


```python
stopwords.words('english')
```

```text
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an',
 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been',
 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn',
 "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't",
 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from',
 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven',
 "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself',
 "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in',
 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself',
 "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most',
 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not',
 'now', 'o', 'of', 'off', 'on', 'once',
 'on
```
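
Once a stop-word list is available, filtering a sentence is a short comprehension. The sketch below uses a small hand-picked `STOP_WORDS` set so that it runs without NLTK; that set and the `remove_stopwords` helper are assumptions made for this example. With NLTK installed, `set(stopwords.words('english'))` could be passed in instead.

```python
def remove_stopwords(text, stop_words):
    """Drop every token that appears in stop_words (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

# Small hand-picked subset for illustration only.
STOP_WORDS = {"a", "about", "an", "and", "are", "in", "is", "of", "on", "the"}

cleaned = remove_stopwords("The cat is on the mat", STOP_WORDS)
print(cleaned)  # cat mat
```

Using a `set` rather than a list keeps each membership test fast, which matters on large corpora.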

### Handling emojis

```python
import emoji  # third-party package: pip install emoji

print(emoji.demojize('''Python is  😀😃😄😁😆'''))
```

```text
Python is  :grinning_face::grinning_face_with_big_eyes::grinning_face_with_smiling_eyes::beaming_face_with_smiling_eyes::grinning_squinting_face:
```
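
When the `emoji` package is not available, a rough stand-in can be sketched with the standard `unicodedata` module. This is only an approximation: the `describe_symbols` helper is a name invented for this example, it uses official Unicode character names (which often differ from the names `emoji.demojize` produces), it also rewrites non-emoji symbols such as ©, and it does not handle multi-codepoint emoji.

```python
import unicodedata

def describe_symbols(text):
    """Replace each 'Symbol, other' (So) character with :its_unicode_name:."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "unknown_symbol")
            out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return "".join(out)

print(describe_symbols("Python is 😀"))  # Python is :grinning_face:
```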
