# <p style='text-align: center;'> Natural Language Processing </p>

## Text Mining:
- The process of exploring and analyzing large collections of unstructured textual data and deriving useful information, patterns, and actionable insights from them.


- Text mining (also referred to as text analytics) is an artificial intelligence (AI) technology that uses natural language processing (NLP) to transform the free (unstructured) text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning (ML) algorithms.


- Text mining employs a variety of methodologies to process the text, one of the most important of these being **Natural Language Processing (NLP)**.


## Need of Text Mining:
- An unprecedented amount of unstructured text data is generated every day. Text mining transforms this unstructured data into structured data that can be analyzed further. It helps identify business insights, reduce risk, support informed decisions, automate processes, and conduct market research using techniques such as sentiment analysis.


- The structured data created by text mining can be integrated into databases, data warehouses or business intelligence dashboards and used for descriptive, prescriptive or predictive analytics.


- Other applications of text mining include document summarization and entity extraction, which identifies people, places, organizations and other entities. Text mining can also be used for sentiment analysis, to identify and extract subjective information from written natural language.

## Natural Language Processing (NLP):
- Natural Language Processing (NLP) is a field of study that deals with understanding, interpreting, and manipulating human languages using computers.


- NLP is a part of computer science that aims to let computers understand language as naturally as a person does. This means a computer can interpret sentiment, recognize speech, answer questions, summarize text, and so on.


## NLP in Text Mining:
- Most significant information is written down in a natural language such as English, French, or German, and is not conveniently tagged.


- So, after identifying and extracting the content needed for text analytics, we use different NLP techniques to extract meaningful information from it.


- NLP helps computers to communicate with humans in their own language and perform other language-related tasks.


- NLP makes it feasible for computers to read text, hear speech, interpret it, measure sentiment and determine which parts are important.


## NLP Applications:
- Applications of Natural Language Processing (NLP):

![image.png](attachment:image.png)


## What Does Natural Language Toolkit (NLTK) Mean?
- The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, for use in statistical natural language processing (NLP).


- It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning. It also includes graphical demonstrations and sample data sets, and is accompanied by a cookbook and a book that explains the principles behind the language processing tasks that NLTK supports.


## NLTK Corpus/Corpora:
- A **corpus** is a large, structured collection of written or spoken texts used for linguistic analysis and the development of NLP tools. **Corpora** is the plural of corpus, i.e. several such collections.


- **Example:** 

    Corpus ---> One collection of texts (e.g. a set of movie reviews).
    
    Corpora --> Several collections (e.g. movie reviews, news articles, and tweets).


## Tasks and Tools in Natural Language Processing (NLP):
<b> 1. Syntactic and Semantic Analysis: </b>
    
- Syntactic analysis, also known as parsing or syntax analysis, identifies the syntactic structure of a text and the dependency relationships between words, represented on a diagram called a parse tree.

    
- Semantic analysis focuses on identifying the meaning of language. However, since language is polysemic and ambiguous, semantics is considered one of the most challenging areas in NLP.
    

- Semantic tasks analyze the structure of sentences, word interactions, and related concepts, in an attempt to discover the meaning of words, as well as understand the topic of a text.
    
    
<b> 2. Text Classification: </b>
    
- Text classification is the process of understanding the meaning of unstructured text and organizing it into predefined categories (tags). One of the most popular text classification tasks is sentiment analysis, which aims to categorize unstructured data by sentiment.

## Challenges of Natural Language Processing:
- There are many challenges in natural language processing, but one of the main reasons NLP is difficult is simply that human language is ambiguous. Even humans struggle to analyze and classify human language correctly.


- Take sarcasm, for example. How do you teach a machine to understand an expression that’s used to say the opposite of what’s true? While humans would easily detect sarcasm in a comment, it would be challenging to teach a machine how to interpret it.



## Text Preprocessing (Data Cleaning) Techniques for NLP:
Text data derived from natural language is unstructured and noisy. Text preprocessing involves transforming text into a clean and consistent format that can then be fed into a model for further analysis and learning.

Raw text data comes directly from various sources and is not cleaned. Uncleaned text contains useless information that skews results, so cleaning the data is always the first step. Applying some standard preprocessing techniques makes the data cleaner, and cleaner data can also help reduce overfitting.

In this article, we will see the following topics under text preprocessing:

### Tokenization:
- Tokenization means splitting text into meaningful units (tokens). There are **sentence tokenizers** as well as **word tokenizers**.


- A **sentence tokenizer** splits a paragraph into sentences, while a **word tokenizer** splits a sentence into individual words. Many libraries can perform tokenization, such as SpaCy, NLTK, and TextBlob.


- The simplest form of tokenization is splitting a sentence on spaces to get individual words.
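
The two kinds of tokenizer described above can be sketched with the standard library alone. This is a minimal illustration, not how SpaCy, NLTK, or TextBlob tokenize internally; the helper names `simple_sentence_tokenize` and `simple_word_tokenize` are made up for this example, and the regular expressions are simplifying assumptions (for instance, abbreviations such as "e.g." would be split incorrectly).

```python
import re

def simple_sentence_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fun. Tokenization splits text into units!"
sentences = simple_sentence_tokenize(text)
tokens = simple_word_tokenize(sentences[0])

print(sentences)  # ['NLP is fun.', 'Tokenization splits text into units!']
print(tokens)     # ['NLP', 'is', 'fun', '.']
```

In practice a library tokenizer is usually preferable, because it handles abbreviations, contractions, and other edge cases that these patterns ignore.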

### Punctuation Removal:
    
- Removing punctuation is a common step, since punctuation often adds little information or value to the data. Removing it also reduces the data size and therefore improves computational efficiency.
    
    
- The punctuation characters (the value of Python’s `string.punctuation`) are: `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``
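
One possible way to strip these characters is with `str.translate`, sketched below. The function name `remove_punctuation` is chosen for this example only. Note that apostrophes are removed too, which may or may not be desirable for a given task.

```python
import string

def remove_punctuation(text):
    """Delete every character that appears in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

cleaned = remove_punctuation("Hello, world! Isn't NLP fun?")
print(cleaned)  # Hello world Isnt NLP fun
```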

### Stop Words Removal:
- Stop words are words that occur frequently in sentences but carry no significant meaning. They are usually not important for prediction, so we remove them to reduce data size and help limit overfitting. Note: Before filtering stop words, make sure you lowercase the data, since the stop words in the list are lowercase.


- Using the NLTK library, we can filter out stop words from the dataset.

### Lowercasing:
- The method lower() converts all uppercase characters in a string into lowercase and returns the result.


- Lowercasing needs to be applied before "Stop Words Removal".
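
The order of the two steps (lowercase first, then filter) can be sketched as follows. The `STOP_WORDS` set here is a tiny hand-written assumption for illustration; a real list, such as the one shipped with NLTK, is much longer.

```python
# Tiny illustrative stop-word set (an assumption, not NLTK's actual list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

filtered = remove_stop_words("The cat is in THE garden")
print(filtered)  # ['cat', 'garden']
```

Without the call to `lower()`, "The" and "THE" would not match the lowercase entries and would stay in the output.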

### Spelling Correction:
- Text data extracted from customer reviews, blogs, or tweets is likely to contain some spelling mistakes.


- Correcting spelling mistakes can improve model accuracy.


- There are various libraries to fix spelling mistakes, but one of the most convenient is TextBlob.


- The method correct() works on TextBlob objects and corrects the spelling mistakes.
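
To show the idea without installing TextBlob, the sketch below swaps in a different technique: fuzzy matching against a known vocabulary with the standard library's `difflib.get_close_matches`. This is not how TextBlob's correct() works; the `VOCABULARY` list, the function name `correct_word`, and the 0.75 cutoff are assumptions made for this example.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
VOCABULARY = ["spelling", "mistake", "review", "customer"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

corrected = [correct_word(w) for w in "custmer reveiw with a speling".split()]
print(corrected)  # ['customer', 'review', 'with', 'a', 'spelling']
```

Words with no sufficiently similar vocabulary entry ("with", "a") are left as they are, so the quality of this approach depends entirely on the vocabulary supplied.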


### Stemming:
- Stemming converts words into their root form using a set of rules, irrespective of meaning. For example:


- “fish,” “fishes,” and “fishing” are stemmed into “fish”.


- “playing”, “played”, and “plays” are stemmed into “play”.


- Stemming helps to reduce the vocabulary size, which can improve accuracy.


- The simplest way to perform stemming is to use the NLTK or TextBlob library.


- NLTK provides various stemmers, e.g. SnowballStemmer and PorterStemmer; each follows a different set of rules to convert words into their root form.
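
The rule-based nature of stemming can be illustrated with a deliberately tiny suffix stripper. This is a toy, not the Porter or Snowball algorithm: the suffix list, the minimum stem length of 3, and the name `toy_stem` are all assumptions for this sketch. It does show why stems can be non-words, since rules are applied without looking at meaning.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip one known suffix if at least 3 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["fish", "fishes", "fishing", "playing", "played", "plays", "leaves"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['fish', 'fish', 'fish', 'play', 'play', 'play', 'leav']
```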

### Lemmatization:
- Lemmatization converts words into their dictionary root form (lemma) using a vocabulary mapping. It takes the word's part of speech and meaning into account, so it doesn’t generate meaningless root words. However, lemmatization is slower than stemming.


- “good,” “better,” and “best” are lemmatized into “good”.


- Lemmatization does not merge synonyms: “automobile”, “car”, and “truck” remain separate lemmas. Only inflected forms are reduced, e.g. “vehicles” is lemmatized into “vehicle”.


- Lemmatization usually gets better results than stemming.


- E.g. “leafs” is stemmed to “leaf” and “leaves” is stemmed to “leav”, while both “leafs” and “leaves” are lemmatized to “leaf”.


- Lemmatization can be done using the NLTK or TextBlob library.
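
The role of the vocabulary mapping and the part of speech can be sketched with a small lookup table. The table `LEMMA_TABLE`, the part-of-speech labels, and the function `toy_lemmatize` are hypothetical; real lemmatizers rely on a full dictionary such as WordNet.

```python
# Hypothetical mini-dictionary: (word, part of speech) -> lemma.
LEMMA_TABLE = {
    ("leaves", "noun"): "leaf",
    ("leafs", "noun"): "leaf",
    ("leaves", "verb"): "leave",
    ("better", "adj"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("leaves", "noun"))  # leaf
print(toy_lemmatize("leaves", "verb"))  # leave
print(toy_lemmatize("Better", "adj"))   # good
print(toy_lemmatize("car", "noun"))     # car
```

The same surface form ("leaves") maps to different lemmas depending on its part of speech, which a purely rule-based stemmer cannot distinguish.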


<b> Difference between Stemming Vs Lemmatization: </b>

![image-2.png](attachment:image-2.png)

<b> When to use Stemming and when to use Lemmatization: </b>
    
- When humans interact with the application's output, use lemmatization, because it produces real words. When there is no human interaction with the application, use stemming. 
    
    
- Why stemming in that case: stemming is faster than lemmatization, so when the output is not shown to humans, the faster option is preferred.

# <p style='text-align: center;'> Vectorization In Machine Learning </p>

## Vectorization:
- Vectorization is the process of converting textual data into numerical vectors. It is usually applied once the text is cleaned.


- Because most machine learning libraries operate on numerical arrays, this representation also lets them use fast array operations, which can reduce training time.


- These vectors are used to train a machine learning algorithm that generates useful predictions. 


- In short, vectorization is as important as removing unwanted data from the raw text or training an ML model on the data.

### 1. CountVectorizer (bag of words):
- CountVectorizer is one of the simplest techniques for converting text into vectors. It starts by tokenizing each document into a list of tokens (words). It then selects the unique tokens and creates a vocabulary. Finally, a sparse matrix of word frequencies is created, where each row represents a document (e.g. a sentence) and each column represents a unique word.


- Python’s scikit-learn has a class named CountVectorizer that provides a simple implementation for performing vectorization on text data.
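
The three steps described above (tokenize, build a vocabulary, count) can be sketched in plain Python. This is an illustration of the bag-of-words idea, not scikit-learn's implementation: the function `bag_of_words` is made up for this example, it tokenizes by lowercasing and splitting on whitespace, and it returns a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) with one row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Unique tokens across all documents, sorted for a stable column order.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```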

### 2. TF-IDF:
- Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval. It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus). Words within a text document are transformed into importance numbers by a text vectorization process. There are many different text vectorization scoring schemes, with TF-IDF being one of the most common.


- As its name implies, TF-IDF vectorizes/scores a word by multiplying the word’s Term Frequency (TF) with the Inverse Document Frequency (IDF).


- **Term Frequency:** TF of a term or word is the number of times the term appears in a document compared to the total number of words in the document.


              number of times the term appears in the document
    TF = ----------------------------------------------------------
                    total number of terms in the document
                    
                    
                    
- **Inverse Document Frequency:** IDF of a term is based on the inverse of the proportion of documents in the corpus that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).


                      total number of documents in the corpus
    IDF = log (---------------------------------------------------------)
                number of documents in the corpus that contain the term
    
  
- The TF-IDF of a term is calculated by multiplying TF and IDF scores.


    TF-IDF = TF * IDF


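- With the formulas above, TF-IDF scores are non-negative but not limited to 1. Libraries such as scikit-learn normalize each document vector by default (L2 norm), so the resulting scores lie between 0 and 1. In both cases, a higher score means the word is more important to the document.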


- TF-IDF is useful in many natural language processing applications. For example, Search Engines use TF-IDF to rank the relevance of a document for a query. TF-IDF is also employed in text classification, text summarization, and topic modeling.


- Note that there are different approaches to calculating the IDF score. The base-10 logarithm is often used in the calculation, but some libraries use the natural logarithm. In addition, 1 can be added to the denominator, as follows, in order to avoid division by zero.


                          total number of documents in the corpus
    IDF = log (---------------------------------------------------------------)
                number of documents in the corpus that contain the term + 1


<b> Numerical Example: </b>
    
- Imagine the term 't' appears 20 times in a document that contains a total of 100 words. The Term Frequency (TF) of 't' can be calculated as follows:  
    
    
            20
    TF = --------- = 0.2
            100         


- Assume a collection of related documents contains 10,000 documents. If 100 of the 10,000 documents contain the term 't', the Inverse Document Frequency (IDF) of 't' can be calculated (using a base-10 logarithm) as follows:
    
    
                 10000
    IDF = log (---------) = 2
                  100
    
    
- Using these two quantities, we can calculate the TF-IDF score of the term 't' for the document.   
    
    
    TF-IDF = 0.2 * 2 = 0.4
    
    
- Python’s scikit-learn has a class named TfidfVectorizer that implements TF-IDF vectorization for text data. 

    from sklearn.feature_extraction.text import TfidfVectorizer
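
The worked numerical example can be checked with a few lines of plain Python. This sketch follows the unsmoothed formulas given in this section with a base-10 logarithm; the function names are chosen for this example, and the result will generally differ from what TfidfVectorizer produces, since that class uses its own IDF variant and normalization.

```python
import math

def term_frequency(term_count, total_terms):
    """TF = occurrences of the term / total terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF = log10(total documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(20, 100)                   # 0.2
idf = inverse_document_frequency(10_000, 100)  # 2.0
tf_idf = tf * idf                              # 0.4

print(tf, idf, tf_idf)
```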

### 3. N-grams:
- An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (a long text dataset).


- Here, n is the number of consecutive items (usually words) that are combined together: n = 1 gives unigrams, n = 2 bigrams, and n = 3 trigrams.


- In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP (Natural Language Processing) tasks.


- When computing n-grams, you normally advance one word at a time (although in more complex scenarios you can move n words). N-grams are used for a variety of purposes.

![image.png](attachment:image.png)


For example, while creating language models, n-grams are utilized not only to create unigram models but also bigrams and trigrams.
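
A sliding window that advances one word at a time can be sketched as follows. The function name `ngrams` and the whitespace tokenization are choices made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples, advancing one token at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
print(len(trigrams))  # 3
```

If n is larger than the number of tokens, the function returns an empty list.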

### 4. Word2Vec:
- Word2Vec is a word embedding technique that uses neural networks to convert words into vectors such that semantically similar words have vectors that are close to each other in N-dimensional space, where N is the dimension of the vectors.


- Word2Vec can preserve semantic relations between words. For example, if we take the “king” vector, subtract the “man” vector, and add the “woman” vector, we get a vector close to the “queen” vector in N-dimensional space.

    Ex: king - man + woman ≈ queen
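
The vector arithmetic behind this analogy can be illustrated with hand-picked toy vectors. The values in `VECTORS` are invented for this sketch (axis 0 loosely stands for "royalty", axis 1 for "gender") and are not learned Word2Vec embeddings; real embeddings have many more dimensions and only approximate such relations.

```python
import math

# Hand-picked 2-D toy vectors; NOT learned embeddings.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, computed component by component.
target = tuple(
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)
nearest = max(VECTORS, key=lambda word: cosine(target, VECTORS[word]))

print(target)   # (1.0, -1.0)
print(nearest)  # queen
```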
    
    
There are two Word2Vec architectures:

   1. Skip-Gram
   2. CBOW

#### 1. Skip-Gram:
- Skip-Gram tries to predict several context words from a single input word.

![image.png](attachment:image.png)


- Here w[i] is the input word at position ‘i’ in the sentence. The output of the model contains two preceding words and two succeeding words with respect to location ‘i’.

#### 2. CBOW:
- CBOW stands for Continuous Bag of Words, trained to predict a single word from a series of context words. It is the mirror image of the Skip-Gram technique.

![image.png](attachment:image.png)


- Both techniques can generate vectors from text that capture semantic similarity. Skip-Gram works well with small datasets and represents rare words well. CBOW, however, trains faster and can better represent frequent words. In the experiments reported in the original paper, CBOW also took less time to train than Skip-Gram.
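
The difference between the two architectures is easiest to see in the training examples each one is fed. The sketch below only builds those examples with a window of two words on each side; it does not train a neural network, and the function names are chosen for this illustration.

```python
def context_window(tokens, i, window=2):
    """Return up to `window` words before and after position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def skip_gram_pairs(tokens, window=2):
    """Skip-Gram: one (input word, context word) pair per context word."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target word) example per position."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

tokens = "the quick brown fox jumps".split()
brown_pairs = [p for p in skip_gram_pairs(tokens) if p[0] == "brown"]

print(brown_pairs)
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(cbow_examples(tokens)[2])
# (['the', 'quick', 'fox', 'jumps'], 'brown')
```

Skip-Gram produces several examples per position (one per context word), while CBOW produces a single example per position, which is one intuitive reason CBOW tends to train faster.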

## Part-of-Speech (POS) Tagging:
- Part-of-Speech (POS) tagging is the process of assigning a part of speech to each word in a text.


- POS tagging can also be described as the process of converting a sentence, in the form of a list of words, into a list of tuples of the form (word, tag).


- Python’s NLTK library has a function named pos_tag that performs Part-of-Speech (POS) tagging.


- The following table lists the most frequent POS tags used in the Penn Treebank corpus:

![image.png](attachment:image.png)
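
The (word, tag) output format can be illustrated with a toy lookup tagger. The lexicon `TOY_LEXICON` and the function `toy_pos_tag` are hypothetical and far simpler than a real tagger; in practice one would call NLTK's `pos_tag` on a list of tokens, which requires a downloaded tagger model and chooses tags based on context.

```python
# Hypothetical mini-lexicon with Penn Treebank tags.
TOY_LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", "loudly": "RB"}

def toy_pos_tag(tokens):
    """Tag each token by dictionary lookup; unknown words default to NN."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]

tagged = toy_pos_tag("The dog barks loudly".split())
print(tagged)
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
```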