# Natural Language Processing (NLP)

## What is NLP?
NLP is a subfield of AI that enables computers to understand and process human language. Its main task is to program computers to analyze and process large amounts of natural language data.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.

## NLP Tasks
Human language is filled with ambiguities that make it incredibly difficult to write software that accurately determines the intended meaning of text or voice data. Homonyms, homophones, sarcasm, idioms, metaphors, grammar and usage exceptions, and variations in sentence structure are just a few of the irregularities of human language that take humans years to learn, but that programmers must teach natural language-driven applications to recognize and understand accurately from the start if those applications are going to be useful.

Several NLP tasks break down human text and voice data in ways that help the computer make sense of what it's ingesting. Some of these tasks include the following:

- **Speech recognition**, also called speech-to-text, is the task of reliably converting voice data into text data. Speech recognition is required for any application that follows voice commands or answers spoken questions. What makes speech recognition especially challenging is the way people talk—quickly, slurring words together, with varying emphasis and intonation, in different accents, and often using incorrect grammar.

- **Part of speech tagging**, also called grammatical tagging, is the process of determining the part of speech of a particular word or piece of text based on its use and context. Part of speech tagging identifies ‘make’ as a verb in ‘I can make a paper plane,’ and as a noun in ‘What make of car do you own?’

- **Word sense disambiguation** is the selection of the meaning of a word with multiple meanings through a process of semantic analysis that determines the sense that fits best in the given context. For example, word sense disambiguation helps distinguish the meaning of the verb 'make' in ‘make the grade’ (achieve) vs. ‘make a bet’ (place).

- **Named entity recognition**, or NER, identifies words or phrases as useful entities. NER identifies ‘Kentucky’ as a location or ‘Fred’ as a man's name.

- **Co-reference resolution** is the task of identifying if and when two words refer to the same entity. The most common example is determining the person or object to which a certain pronoun refers (e.g., ‘she’ = ‘Mary’),  but it can also involve identifying a metaphor or an idiom in the text  (e.g., an instance in which 'bear' isn't an animal but a large hairy person).

- **Sentiment analysis** attempts to extract subjective qualities—attitudes, emotions, sarcasm, confusion, suspicion—from text.

- **Natural language generation** is sometimes described as the opposite of speech recognition or speech-to-text; it's the task of putting structured information into human language.

## NLP tools and approaches

**Python and the Natural Language Toolkit (NLTK)**

The Python programming language provides a wide range of tools and libraries for attacking specific NLP tasks. Many of these are found in the Natural Language Toolkit, or NLTK, an open source collection of libraries, programs, and educational resources for building NLP programs.

The NLTK includes libraries for many of the NLP tasks listed above, plus libraries for subtasks, such as sentence parsing, word segmentation, stemming and lemmatization (methods of trimming words down to their roots), and tokenization (for breaking phrases, sentences, paragraphs and passages into tokens that help the computer better understand the text). It also includes libraries for implementing capabilities such as semantic reasoning, the ability to reach logical conclusions based on facts extracted from text.

**Statistical NLP, machine learning, and deep learning**



## NLP Use Cases

Natural language processing is the driving force behind machine intelligence in many modern real-world applications. Here are a few examples:

- **Spam detection:** You may not think of spam detection as an NLP solution, but the best spam detection technologies use NLP's text classification capabilities to scan emails for language that often indicates spam or phishing. These indicators can include overuse of financial terms, characteristic bad grammar, threatening language, inappropriate urgency, misspelled company names, and more. Spam detection is one of a handful of NLP problems that experts consider 'mostly solved' (although you may argue that this doesn’t match your email experience).

- **Machine translation:** Google Translate is an example of widely available NLP technology at work. Truly useful machine translation involves more than replacing words in one language with words of another.  Effective translation has to capture accurately the meaning and tone of the input language and translate it to text with the same meaning and desired impact in the output language. Machine translation tools are making good progress in terms of accuracy. A great way to test any machine translation tool is to translate text to one language and then back to the original. An oft-cited classic example: Not long ago, translating “The spirit is willing but the flesh is weak” from English to Russian and back yielded “The vodka is good but the meat is rotten.” Today, the result is “The spirit desires, but the flesh is weak,” which isn’t perfect, but inspires much more confidence in the English-to-Russian translation.

- **Virtual agents and chatbots:** Virtual agents such as Apple's Siri and Amazon's Alexa use speech recognition to recognize patterns in voice commands and natural language generation to respond with appropriate action or helpful comments. Chatbots perform the same magic in response to typed text entries. The best of these also learn to recognize contextual clues about human requests and use them to provide even better responses or options over time. The next enhancement for these applications is question answering, the ability to respond to our questions—anticipated or not—with relevant and helpful answers in their own words.

- **Social media sentiment analysis:** NLP has become an essential business tool for uncovering hidden data insights from social media channels. Sentiment analysis can analyze language used in social media posts, responses, reviews, and more to extract attitudes and emotions in response to products, promotions, and events. Companies can use this information in product designs, advertising campaigns, and more.

- **Text summarization:** Text summarization uses NLP techniques to digest huge volumes of digital text and create summaries and synopses for indexes, research databases, or busy readers who don't have time to read full text. The best text summarization applications use semantic reasoning and natural language generation (NLG) to add useful context and conclusions to summaries.

## NLP Pipeline

1. **Sentence Segmentation**

Sentence segmentation is the first step in building the NLP pipeline. It breaks a paragraph into separate sentences.

Example: Consider the following paragraph:

**Independence Day is one of the important festivals for every Indonesian citizen. It is celebrated on the 17th of August each year ever since the first President of Indonesia declared the proclamation. The day celebrates independence in the true sense.**

Sentence segmentation produces the following result:
- "Independence Day is one of the important festivals for every Indonesian citizen."
- "It is celebrated on the 17th of August each year ever since the first President of Indonesia declared the proclamation."
- "The day celebrates independence in the true sense."
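The split above can be approximated with a regular expression. This is a minimal sketch, assuming that a sentence ends with `.`, `!`, or `?` followed by whitespace and a capital letter; the helper name `split_sentences` is made up for illustration.

```python
import re

def split_sentences(paragraph):
    """Split a paragraph into sentences with a simple punctuation rule."""
    # Split on whitespace that follows '.', '!' or '?' and precedes a capital letter.
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph.strip())
    return [part for part in parts if part]

paragraph = (
    "Independence Day is one of the important festivals for every Indonesian citizen. "
    "It is celebrated on the 17th of August each year ever since the first President "
    "of Indonesia declared the proclamation. "
    "The day celebrates independence in the true sense."
)

sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
```

A rule like this breaks on abbreviations such as "Dr. Smith", which is one reason NLP libraries ship trained sentence segmenters.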

2. **Word Tokenization**

A word tokenizer breaks a sentence into separate words or tokens.

Example:

**The winter is very cold.**

Word Tokenizer generates the following result:

"The", "winter", "is", "very", "cold", "."

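A tokenizer for the example sentence can be sketched with one regular expression. It assumes that a token is either a run of word characters or a single punctuation mark; the function name `tokenize` is illustrative. Contractions such as "don't" would be split into three tokens by this rule, so library tokenizers use more careful patterns.

```python
import re

def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The winter is very cold.")
print(tokens)  # ['The', 'winter', 'is', 'very', 'cold', '.']
```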
3. **Stemming**

Stemming normalizes words into their base or root form.

Example:

The words intelligence, intelligent, and intelligently all reduce to the single stem "intelligen" (the exact stem depends on the stemming algorithm). In English, "intelligen" is not a word on its own; a stem does not need to have a meaning.
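The idea can be illustrated with a toy suffix-stripping function. The suffix list below is hypothetical and chosen only to reproduce this example; it is not the Porter algorithm or any library's rule set, and real stemmers may return a different stem for the same words.

```python
# Hypothetical suffix list, ordered so that "tly" is tried before "t".
SUFFIXES = ("tly", "ce", "t")

def naive_stem(word):
    """Strip the first matching suffix from the lowercased word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ("Intelligence", "intelligent", "intelligently")]
print(stems)  # ['intelligen', 'intelligen', 'intelligen']
```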

4. **Lemmatization**

Lemmatization is quite similar to stemming. It groups the different inflected forms of a word under a single base form, called the lemma. The main difference is that lemmatization produces a root word that has a meaning, whereas a stem may not be a real word.

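Example: The inflected forms "runs", "running", and "ran" share the lemma "run", which is a real word. Derivationally related words such as intelligence, intelligent, and intelligently are normally kept as separate lemmas rather than merged.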
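At its simplest, lemmatization is a dictionary lookup from an inflected form to its lemma. The sketch below assumes a tiny hand-written dictionary (`LEMMAS`, hypothetical); real lemmatizers rely on large lexicons and usually on the word's part of speech as well.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {
    "is": "be", "was": "be", "were": "be",
    "mice": "mouse",
    "runs": "run", "running": "run", "ran": "run",
}

def lemmatize(word):
    """Look the word up; unknown words are returned lowercased and unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

lemmas = [lemmatize(w) for w in ("Mice", "were", "running")]
print(lemmas)  # ['mouse', 'be', 'run']
```

Unlike the stem in the previous step, every value returned here is a real word.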

5. **Stop Words**

In English, there are a lot of words that appear very frequently like "is", "and", "the", and "a". NLP pipelines will flag these words as stop words. Stop words might be filtered out before doing any statistical analysis.

Example: He **is a** good boy.
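Filtering stop words amounts to a set membership test per token. This sketch assumes a stop word list containing only the four words named above; the lists shipped with NLP libraries are considerably longer.

```python
import re

# Only the four examples from the text; an assumption for illustration.
STOP_WORDS = {"is", "and", "the", "a"}

def remove_stop_words(sentence):
    """Tokenize on word characters and drop tokens found in STOP_WORDS."""
    tokens = re.findall(r"\w+", sentence)
    return [token for token in tokens if token.lower() not in STOP_WORDS]

kept = remove_stop_words("He is a good boy.")
print(kept)  # ['He', 'good', 'boy']
```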

6. **Dependency Parsing**

Dependency parsing determines how the words in a sentence are related to each other.

7. **POS tags**

POS stands for part of speech; the categories include noun, verb, adverb, and adjective. A POS tag indicates how a word functions within a sentence, both in meaning and grammatically. A word can have more than one part of speech depending on the context in which it is used.

Example: "**Google**" something on the Internet. (Here "Google" is used as a verb, although it is normally a proper noun.)

8. **Named Entity Recognition (NER)**

Named Entity Recognition (NER) is the process of detecting named entities such as a person's name, a movie title, an organization, or a location.

Example: **Steve Jobs** introduced iPhone at the Macworld Conference in San Francisco, California.
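The simplest baseline for NER is a lookup against a list of known names (a gazetteer). The sketch below assumes a hand-made dictionary (`GAZETTEER`) and label names that are hypothetical; practical NER systems are usually trained statistical models that also recognize names they have never seen.

```python
# Hypothetical gazetteer: known entity strings and assumed labels.
GAZETTEER = {
    "Steve Jobs": "PERSON",
    "iPhone": "PRODUCT",
    "Macworld Conference": "EVENT",
    "San Francisco": "LOCATION",
    "California": "LOCATION",
}

def find_entities(text):
    """Return (entity, label, start) tuples ordered by position in the text."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)  # first occurrence only, for brevity
        if start != -1:
            found.append((name, label, start))
    return sorted(found, key=lambda item: item[2])

sentence = (
    "Steve Jobs introduced iPhone at the Macworld Conference "
    "in San Francisco, California."
)
entities = find_entities(sentence)
for name, label, _ in entities:
    print(f"{name} -> {label}")
```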

9. **Chunking**

Chunking collects individual pieces of information and groups them into larger pieces of the sentence, such as noun phrases.
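One common way to chunk is to match a pattern over POS tags. The sketch below groups an optional determiner, any adjectives, and one or more nouns into a noun phrase. It assumes the tokens are already tagged; the tags are written by hand in Penn Treebank style for illustration, and `chunk_noun_phrases` is a made-up helper name.

```python
import re

def chunk_noun_phrases(tagged):
    """Group tokens whose tags match: optional DT, any JJ, one or more NN."""
    # Encode the tag sequence as a string such as "<DT><JJ><NN><VBD>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+", tag_string):
        # Count the tags before and inside the match to map back to token positions.
        first = tag_string[: match.start()].count("<")
        length = match.group().count("<")
        words = [word for word, _ in tagged[first : first + length]]
        chunks.append(" ".join(words))
    return chunks

# Hand-tagged example sentence (the tags are an assumption, not tagger output).
tagged = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
chunks = chunk_noun_phrases(tagged)
print(chunks)  # ['the little dog', 'the cat']
```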


## NLP Libraries

**Scikit-learn:** It provides a wide range of algorithms for building machine learning models in Python.

**Natural Language Toolkit (NLTK):** NLTK is a broad toolkit that covers most classic NLP techniques.

**Pattern:** It is a web mining module for NLP and machine learning.

**TextBlob:** It provides an easy interface for basic NLP tasks like sentiment analysis, noun phrase extraction, or POS tagging.

**Quepy:** Quepy is used to transform natural language questions into queries in a database query language.

**spaCy:** spaCy is an open-source NLP library used for data extraction, data analysis, sentiment analysis, and text summarization.

**Gensim:** Gensim works with large datasets and processes data streams.

# Example of NLP Implementation

## Text Classification using Bag-of-words

```python
# Import the bag-of-words vectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
```

```python
# Build a small dummy training set
x_train = [
    "the weather today is very cold",
    "what are you going to do in this hot weather?",
    "It is very hot today",
    "The whole world is experiencing a financial crisis",
    "The financial sector accounts for the largest profits in the USA",
    "Banks are the most important sector for state finances",
]

# One label per training sentence, in the same order
y_train = ["WEATHER", "WEATHER", "WEATHER", "FINANCE", "FINANCE", "FINANCE"]
```

```python
vectorizer = CountVectorizer(binary=True)
x_train_vector = vectorizer.fit_transform(x_train)

# Vocabulary learned from the training sentences
print(vectorizer.get_feature_names_out())

# One row per sentence, one column per vocabulary word (1 = word present)
print(x_train_vector.toarray())
```

```
['accounts' 'are' 'banks' 'cold' 'crisis' 'do' 'experiencing' 'finances'
 'financial' 'for' 'going' 'hot' 'important' 'in' 'is' 'it' 'largest'
 'most' 'profits' 'sector' 'state' 'the' 'this' 'to' 'today' 'usa' 'very'
 'weather' 'what' 'whole' 'world' 'you']
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0]
 [0 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0]
 [0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0]
 [1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0]]
```


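Each row corresponds to one training sentence and each column to one word of the vocabulary. In the first row, for example, the columns for "cold" and "is" are 1 because the first sentence contains those words.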
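To make the encoding concrete, the same kind of binary rows can be rebuilt by hand for two of the training sentences. This is a simplified sketch, not scikit-learn's implementation; it assumes lowercasing and tokens of two or more word characters, which mirrors the default behaviour of `CountVectorizer`.

```python
import re

docs = ["the weather today is very cold", "It is very hot today"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of all tokens seen in the documents.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# Each row marks with 1 the vocabulary words present in that document.
rows = []
for doc in docs:
    present = set(tokenize(doc))
    rows.append([1 if word in present else 0 for word in vocabulary])

print(vocabulary)  # ['cold', 'hot', 'is', 'it', 'the', 'today', 'very', 'weather']
for row in rows:
    print(row)
```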

Now let's train the model with SVM

```python
# Import the SVM module and train a linear classifier on the vectors
from sklearn import svm

clf = svm.SVC(kernel='linear')
clf.fit(x_train_vector, y_train)
```

And now make a prediction

```python
x_test = ['Stocks greatly affect the finances of a country']

# Encode the test sentence with the vocabulary learned during training
x_test_vector = vectorizer.transform(x_test)

clf.predict(x_test_vector)
```

```
array(['FINANCE'], dtype='<U7')
```

We have now built a simple text classifier!

**ADDITION**

In scikit-learn, `fit`, `transform`, and `fit_transform` are three methods used in data preprocessing and modeling.

- `fit` computes the necessary statistics from the training data, for example the vocabulary in the case of `CountVectorizer`. The learned parameters are stored on the object for later use.

- `transform` is used to apply the same preprocessing steps to the new or unseen data. This step ensures that the new data is in the same format as the original data.

- `fit_transform` is a combination of fit and transform. It first fits the model to the training data and then applies the transformation on the same data. This is a convenient way to apply both steps at once.

In summary, `fit` is used to learn the characteristics of the data and create a model, `transform` is used to apply the learned model to new data, and `fit_transform` is used to learn the model and apply it to the same data.