# Natural Language Processing ( NLP )

## Introduction to Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both valuable and meaningful. 

## Key Concepts in NLP

### 1. **Understanding Text Data**
Text data is unstructured data that can come from various sources, such as social media posts, reviews, articles, and emails. Unlike structured data (like numbers in a spreadsheet), text data requires special processing to extract meaningful insights.

**Example:** 
- A restaurant review like "The food was great but the service was slow." contains opinions and sentiments that need to be analyzed.

### 2. **Text Preprocessing**
Before analyzing text data, it needs to be cleaned and prepared. This involves several steps:

- **Tokenization:** Splitting text into individual words or tokens.
- **Lowercasing:** Converting all characters to lowercase to ensure uniformity.
- **Removing Punctuation:** Eliminating symbols that don't add meaning to the analysis.
- **Stop Words Removal:** Filtering out common words (e.g., "is," "and," "the") that don't provide significant information.
- **Stemming/Lemmatization:** Reducing words to their root forms (e.g., "running" becomes "run").

**Example:** 
For the review "The food was great," the preprocessing steps would result in the tokens: `["food", "great"]`.
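
The preprocessing steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the `STOP_WORDS` set and the `preprocess` function are hypothetical names introduced here, the stop-word list is deliberately tiny, and stemming is left out to keep the sketch dependency-free.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# real lists (for example the one shipped with NLTK) are much longer.
STOP_WORDS = {"the", "was", "is", "and", "a", "an"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The food was great"))  # ['food', 'great']
```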

### 3. **Text Representation**
Once the text is preprocessed, it needs to be converted into a format that machine learning algorithms can understand. Two common methods are:

- **Bag of Words (BoW):** Represents each document as a vector of word counts, ignoring word order. Each unique word in the dataset is a feature.
- **TF-IDF (Term Frequency-Inverse Document Frequency):** Assigns a score to each word based on its frequency in a document relative to its frequency in the entire corpus, highlighting important words.

**Example:**
For the sentences:
1. "I love pizza."
2. "Pizza is my favorite food."

The BoW representation would be:
```
| I | love | pizza | is | my | favorite | food |
|---|------|-------|----|----|----------|------|
| 1 |  1   |   1   | 0  | 0  |    0     |  0   |
| 0 |  0   |   1   | 1  | 1  |    1     |  1   |
```
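
As a rough sketch, the table above can be reproduced with the standard library, and the same counts can be turned into TF-IDF scores. The helper names (`tokenize`, `tf_idf`) are assumptions made for this example, tokens are lowercased (so "I" appears as "i"), and the TF-IDF formula used is the basic unsmoothed textbook variant; libraries such as scikit-learn apply smoothed variants that give different numbers.

```python
import math
from collections import Counter

docs = ["I love pizza.", "Pizza is my favorite food."]


def tokenize(text):
    """Lowercase the text, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()


tokenized = [tokenize(doc) for doc in docs]

# Vocabulary in order of first appearance, matching the table's columns.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# Bag of Words: one count per vocabulary word for each document.
counts_per_doc = [Counter(tokens) for tokens in tokenized]
bow = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

# Document frequency: in how many documents each word appears.
doc_freq = {word: sum(word in counts for counts in counts_per_doc) for word in vocabulary}


def tf_idf(counts):
    """Unsmoothed TF-IDF: (count / document length) * log(N / document frequency)."""
    total = sum(counts.values())
    n_docs = len(counts_per_doc)
    return {word: (n / total) * math.log(n_docs / doc_freq[word]) for word, n in counts.items()}


tfidf = [tf_idf(counts) for counts in counts_per_doc]

print(vocabulary)
print(bow)
print(tfidf[0])
```

Because "pizza" occurs in both sentences, its unsmoothed TF-IDF score is zero, while words unique to one sentence receive a positive score.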

### 4. **Machine Learning for Text**
After representing text data numerically, machine learning algorithms can be applied for various tasks:

- **Text Classification:** Assigning categories to text (e.g., spam detection in emails).
- **Sentiment Analysis:** Determining the sentiment expressed in text (positive, negative, neutral).
- **Named Entity Recognition (NER):** Identifying and classifying key entities in text (e.g., names of people, organizations).

### 5. **Evaluation Metrics**
To evaluate the performance of NLP models, several metrics can be used, including:

- **Accuracy:** The ratio of correctly predicted instances to the total instances.
- **Precision, Recall, F1-Score:** Metrics that provide insights into the model's performance, particularly in classification tasks.
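
To make the metrics above concrete, here is a small sketch that computes them by hand for a binary task. The function name `classification_metrics` and the toy labels are assumptions for illustration; in practice a library such as scikit-learn would normally be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up for this example): 1 = liked, 0 = not liked.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = classification_metrics(y_true, y_pred)
print(scores)
```

With these toy labels there are 2 true positives, 1 false positive and 1 false negative, so accuracy, precision, recall and F1 all come out to 2/3.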

## Importance of NLP in Data Science
As a data scientist or analyst, understanding NLP is crucial because:

- **Insight Generation:** NLP helps extract valuable insights from unstructured data, enabling better decision-making.
- **Enhanced Communication:** NLP applications like chatbots and virtual assistants improve user interaction with systems.
- **Market Analysis:** Analyzing customer feedback, social media posts, and reviews provides insights into market trends and customer preferences.

## Future of NLP
The field of NLP is rapidly evolving, with advancements in deep learning and neural networks leading to improved models and applications. Emerging trends include:

- **Contextual Understanding:** Models like BERT and GPT are capable of understanding the context of words, leading to more accurate results.
- **Multimodal NLP:** Integrating text with other data types (like images or audio) for richer insights.
- **Ethics in NLP:** Addressing biases and ethical considerations in NLP applications to ensure fairness and transparency.

## Conclusion
NLP is a powerful tool for data scientists and analysts to derive meaningful insights from textual data. By understanding the fundamental concepts, processes, and applications of NLP, professionals can leverage these techniques to enhance their data-driven decision-making capabilities.

---
## 1. Importing All The Necessary Libraries :-
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [2]:
data = pd.read_csv ( r"C:\Users\dell\Downloads\Restaurant_Reviews.tsv" , delimiter = "\t" , quoting = 3 )  # delimiter specified as tab.

data  # This will display the dataframe.

|     | Review | Liked |
|-----|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


---
## 2. Performing EDA ( Exploratory Data Analysis ) :-
---

In [3]:
data.shape  # This will showcase the number of rows and columns in the dataframe.

(1000, 2)

In [4]:
data.columns  # This will showcase all the columns in the dataframe.

Index(['Review', 'Liked'], dtype='object')

In [5]:
data.info ( )  # This will give the information of the dataframe.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [6]:
data.isna ( ).sum ( )  # There are no null values present in the dataframe.

Review    0
Liked     0
dtype: int64

In [7]:
data.describe ( )  # statistical summary of the dataframe.

|       | Liked |
|-------|-------|
| count | 1000.0 |
| mean | 0.5 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


In [8]:
data [ "Liked" ].value_counts ( )

Liked
1    500
0    500
Name: count, dtype: int64

---
## 3. Stopwords :-

### Stopwords in NLP: A Detailed Overview

#### **What are Stopwords?**
Stopwords are commonly used words in a language that carry little to no significant meaning in terms of understanding or interpreting the content of a sentence. Examples include words like "the," "is," "in," "and," "of," "to," "a," "on," etc. These words are used frequently in communication but don't provide much value when it comes to analyzing the essence of a text.

#### **Why Are Stopwords Used in Language?**
Stopwords serve grammatical purposes, linking other words together to form meaningful sentences. For example, in the sentence *"The cat is on the mat,"* the stopwords "the," "is," and "on" help structure the sentence, but the main information comes from the words "cat" and "mat." Even though stopwords are essential in everyday language, they usually add little meaning when analyzing text for tasks like classification, summarization, or sentiment analysis.

#### **Why Are Stopwords Removed in NLP?**
In many Natural Language Processing (NLP) tasks, the goal is to focus on the most informative parts of a text, and stopwords often clutter the data with unnecessary, repetitive words. Here are some reasons why stopwords are removed:

1. **Dimensionality Reduction:**
   - Text data often has high dimensionality, meaning there are a large number of unique words (features). Removing stopwords helps reduce the dimensionality of the data, making it easier to work with.
   - For example, if you analyze a large corpus of text and remove words like "the," "and," "is," etc., the remaining words are more likely to carry significant meaning, allowing the model to focus on the most relevant information. <br>
   <br>

2. **Improves Model Efficiency:**
   - By removing stopwords, you're reducing the size of your input data. This can speed up the training process of machine learning models, as they now have fewer features (words) to process.
   - For example, in a bag-of-words model or TF-IDF (Term Frequency-Inverse Document Frequency) representation, each word becomes a feature. Removing stopwords can reduce the number of features, making the model computationally more efficient.<br>
   <br>

3. **Enhances Signal-to-Noise Ratio:**
   - Stopwords act as "noise" in the data because they are common words that occur across many documents and don't help in distinguishing between different classes or meanings. Removing them helps increase the "signal" by focusing on the more meaningful words.
   - In tasks like sentiment analysis, the word *"not"* is an exception since it can alter the meaning of a sentence entirely. For example, *"not happy"* vs. *"happy"*.<br>
   <br>

4. **Improves Accuracy:**
   - In classification tasks, such as spam detection or sentiment analysis, the presence of too many stopwords might confuse the algorithm by giving more weight to irrelevant words. Removing them can improve the model's accuracy by focusing on words that carry real information about the target label.
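
The dimensionality argument from the list above can be checked on a toy corpus by counting unique words before and after stopword removal. The three reviews and the small `STOP_WORDS` set are made up for this sketch; on a real corpus the reduction depends on the stopword list used.

```python
# Hypothetical mini stop-word list and toy reviews, for illustration only.
STOP_WORDS = {"the", "was", "is", "and", "a"}
reviews = [
    "the food was great",
    "the service was slow",
    "a great place and a great menu",
]


def vocabulary(docs, stop_words=frozenset()):
    """Return the set of unique words, skipping any word in stop_words."""
    return {word for doc in docs for word in doc.split() if word not in stop_words}


full_vocab = vocabulary(reviews)
reduced_vocab = vocabulary(reviews, STOP_WORDS)

# Each unique word would become one feature (column) in a bag-of-words matrix.
print(len(full_vocab), len(reduced_vocab))  # 10 6
```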

#### **How Are Stopwords Identified?**
Stopwords lists are generally predefined for each language. These lists contain common words that are agreed upon to have little to no analytical value. Popular NLP libraries, such as **NLTK** (Natural Language Toolkit), **spaCy**, and **Scikit-learn**, come with built-in lists of stopwords for various languages. For example, NLTK provides a list of stopwords for English that can be imported and used directly.

#### **When Should Stopwords Be Removed?**
While removing stopwords is a common preprocessing step, it's not always necessary. In some NLP tasks, stopwords may play an important role, especially in understanding nuances in the language or maintaining grammatical correctness. Here are some scenarios where stopwords may or may not be removed:

- **When to Remove Stopwords:**
   - **Text Classification:** In tasks like spam detection, sentiment analysis, or topic classification, stopwords are generally removed because they don't contribute to identifying the target labels (spam vs. not spam, positive vs. negative sentiment).
   - **Information Retrieval:** In search engines, stopwords are often removed to improve retrieval accuracy. For example, when searching for "best restaurants in Paris," the word "in" can be safely ignored, and the search engine focuses on "best," "restaurants," and "Paris."<br>
   <br>

- **When to Keep Stopwords:**
   - **Sentiment Analysis (with nuance):** In sentiment analysis, especially when considering sentence structure, stopwords like *"not"* can change the meaning of a sentence entirely. For example, *"This movie is not bad"* has a positive sentiment, while *"This movie is bad"* is negative. Removing "not" would make the two sentences identical, so the first would likely be misclassified as negative.
   - **Text Generation:** In natural language generation tasks like chatbots or translation, stopwords should be preserved to maintain the grammatical structure and fluency of the output.
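
The effect of dropping "not" can be seen in a short sketch. The `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical mini stop-word list; it includes "not" on purpose.
STOP_WORDS = {"this", "is", "not"}


def remove_stopwords(sentence, stop_words):
    """Lowercase, split on whitespace, and drop every word in stop_words."""
    return [word for word in sentence.lower().split() if word not in stop_words]


positive = remove_stopwords("This movie is not bad", STOP_WORDS)
negative = remove_stopwords("This movie is bad", STOP_WORDS)
print(positive == negative)  # True: both reduce to ['movie', 'bad']

# Keeping "not" preserves the difference between the two sentences.
keep_negation = STOP_WORDS - {"not"}
print(remove_stopwords("This movie is not bad", keep_negation))  # ['movie', 'not', 'bad']
```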

#### **Challenges with Stopwords:**
1. **Language-Specific Stopwords:**
   - Stopwords vary between languages. For instance, the stopwords in English differ from those in French or Hindi. If you're working with a multilingual dataset, you'll need to remove stopwords according to each language.<br>
   <br>

2. **Custom Stopwords Lists:**
   - In some cases, the standard stopwords list may not be sufficient. For example, if you're analyzing scientific research papers, common terms like "data," "model," and "result" might occur frequently but offer little value. You can create a custom list of stopwords based on the specific domain or task you're working on.<br>
   <br>

3. **Stopwords in Short Texts:**
   - In short texts like tweets or SMS messages, removing stopwords may result in very little text remaining. In such cases, you might want to retain some of the stopwords to ensure that enough information is left for analysis.
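
Two of the challenges above, custom stopword lists and very short texts, can be sketched together. Everything here is an assumption for illustration: the word lists are tiny, and `min_tokens` is a hypothetical threshold below which the original tokens are kept rather than an almost empty result.

```python
BASE_STOP_WORDS = {"the", "a", "of", "in", "is", "on", "it"}  # hypothetical general list
DOMAIN_STOP_WORDS = {"data", "model", "result"}  # domain terms from the research-paper example


def filter_tokens(text, stop_words, min_tokens=2):
    """Remove stop words, but keep the original tokens if too few would remain."""
    tokens = text.lower().split()
    kept = [token for token in tokens if token not in stop_words]
    return kept if len(kept) >= min_tokens else tokens


all_stop_words = BASE_STOP_WORDS | DOMAIN_STOP_WORDS

paper_tokens = filter_tokens("The model is trained on the noisy data", all_stop_words)
short_tokens = filter_tokens("it is on", all_stop_words)

print(paper_tokens)  # ['trained', 'noisy']
print(short_tokens)  # ['it', 'is', 'on'] (nothing would remain, so the text is kept)
```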

#### **Example of Stopwords Removal:**

Consider the sentence:
- **Original:** *"The quick brown fox jumps over the lazy dog."*
- **After Stopword Removal:** *"quick brown fox jumps lazy dog"*

In the above sentence, both occurrences of "the" and the word "over" were removed, but the remaining words still convey the key meaning of the sentence.

#### **Stopwords and NLP Tasks:**
Here are some common NLP tasks where stopwords are relevant:
- **Text Classification:** Remove stopwords to improve classification accuracy by focusing on meaningful words.
- **Topic Modeling:** Removing stopwords helps uncover the core topics in a document, as common words are filtered out.
- **Sentiment Analysis:** Remove stopwords, but handle words like *"not"* carefully, as they can reverse the meaning of a sentence.
- **Document Summarization:** Stopwords removal ensures that the summary focuses on the key points, not filler words.
- **Information Retrieval (Search Engines):** Removing stopwords improves search accuracy by focusing on the keywords.

#### **Future of Stopwords in NLP:**
With advancements in deep learning models like BERT (Bidirectional Encoder Representations from Transformers) and GPT, the importance of stopwords might decrease. These models are capable of understanding context much more effectively than traditional models, so they might not rely on explicit stopword removal to achieve good performance. However, stopword removal is still a practical, useful preprocessing step in many cases, especially when using older or simpler NLP models like Bag of Words, TF-IDF, or Naive Bayes classifiers.

In summary, **stopwords** are common, insignificant words that are usually removed in NLP tasks to reduce noise, improve computational efficiency, and focus on the most meaningful parts of the text. While removing stopwords is a standard step in text preprocessing, the decision to remove them depends on the specific task and context.

In [9]:
import nltk  # "nltk" is a library; the name stands for Natural Language Toolkit.
from nltk.corpus import stopwords  # "stopwords" is a corpus that provides lists of common words which usually do not play an important role in a sentence.
nltk.download ( "stopwords" )  # This line ensures that the stopwords corpus is downloaded and available locally.

all_stopwords = stopwords.words ( "english" )  # This line fetches the stopwords of "english" language and stores it in the variable.
print ( all_stopwords )  # This will give the output by displaying all the stopwords.

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
len ( all_stopwords )  # This will show the number of stopwords.

179

---
## 4. Cleaning The Data :-
---

In [11]:
# 1. IMPORTING THE NECESSARY LIBRARIES :-

import re  # For regular expressions , which helps find and manipulate text patterns ( e.g., replacing certain characters ).
import nltk  # For natural language processing tools like stopwords.
nltk.download ( "stopwords" )  # Download the list of stopwords.
from nltk.corpus import stopwords  # Import stopwords for filtering.
from nltk.stem.porter import PorterStemmer  # Importing stemming class to get "root words".

# 2. PERFORMING DATA CLEANING :-

corpus = [ ]  # Created an empty list to store cleaned reviews.

for i in range ( 0 , 1000 ) :  # Creating a loop for 1000 records / reviews in my dataframe.
    review = re.sub ( "[^a-zA-Z]" , " " , data [ "Review" ] [ i ] )  # This will replace non-letter characters with space. "^" means not.
    review = review.lower ( )  # Convert all text to lower case.
    review = review.split ( )  # Split the review ( sentence ) into individual words ( Tokenization ).

# 2.1 STEMMING AND REMOVING STOPWORDS :-

    ps = PorterStemmer ( )  # Created an instance / object of the class "PorterStemmer ( )" for stemming and stored it in the variable "ps".
    all_stopwords = stopwords.words ( "english" )  # This gets a list containing all stopwords of english language and stores it in the variable.
    all_stopwords.remove ( "not" )  #  Removes the word "not" from the stopwords list because "not" can change the meaning of a sentence.
    review = [ ps.stem ( word ) for word in review if not word in set ( all_stopwords ) ]  # Stemming and stopword removal. ( list comprehension )

# 2.2 REJOINING THE CLEANED REVIEW AND JOINING IT TO THE CORPUS :-

    review = " ".join ( review )  # Join the stemmed words back into a single string.
    corpus.append ( review )  # Adding the cleaned review to the corpus.

# OVERALL PROCESS :-

# 1. Remove non-letter characters.
# 2. Convert to lowercase.
# 3. Split the text into individual words.
# 4. Remove stopwords (except for "not").
# 5. Stem words to reduce them to their base form.
# 6. Rejoin the cleaned words and store them in the corpus.

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Explanation Of The Code :-

---

### **1. IMPORTING THE NECESSARY LIBRARIES:**

```python
import re  # For regular expressions, which helps find and manipulate text patterns (e.g., replacing certain characters).
import nltk  # For natural language processing tools like stopwords.
nltk.download("stopwords")  # Download the list of stopwords.
from nltk.corpus import stopwords  # Import stopwords for filtering.
from nltk.stem.porter import PorterStemmer  # Importing stemming class to get "root words".
```

- **`re`:** Regular expressions library, used for text manipulation, such as pattern matching (like replacing non-alphabetic characters with spaces).
- **`nltk`:** A powerful library for working with text data. In this code, it’s being used to access stopwords.
- **`nltk.download("stopwords")`:** Downloads the list of stopwords, which are common words like "is", "the", "and" that are often removed in text processing.
- **`PorterStemmer`:** This is used for stemming, which reduces words to their base or root form (e.g., "loved" becomes "love").

---

### **2. PERFORMING DATA CLEANING:**

```python
corpus = []  # Created an empty list to store cleaned reviews.

for i in range(0, 1000):  # Looping through the first 1000 reviews in the dataset.
    review = re.sub("[^a-zA-Z]", " ", data["Review"][i])  # This replaces non-letter characters with spaces.
    review = review.lower()  # Convert all text to lowercase.
    review = review.split()  # Split the review (sentence) into individual words.
```

- **`corpus = []`:** This initializes an empty list called `corpus` where cleaned reviews will be stored.
- **`for i in range(0, 1000)`:** This loop processes each review one by one from the dataset (for 1000 reviews in this case).
- **`re.sub("[^a-zA-Z]", " ", data["Review"][i])`:** This line replaces every character in the review that is not a letter (e.g., punctuation, numbers) with a space.
- **`review.lower()`:** Converts the entire review to lowercase to maintain consistency and avoid treating "Love" and "love" as different words.
- **`review.split()`:** Splits the review into individual words (tokenization) so that each word can be processed separately.

---

### **2.1 STEMMING AND REMOVING STOPWORDS:**

```python
    ps = PorterStemmer()  # Created an instance of the PorterStemmer class to perform stemming.
    all_stopwords = stopwords.words("english")  # Get the list of stopwords in English.
    all_stopwords.remove("not")  # Remove the word "not" from stopwords to keep its meaning in sentiment analysis.
    review = [ps.stem(word) for word in review if not word in set(all_stopwords)]  # Stemming and removing stopwords.
```

- **`ps = PorterStemmer()`:** Creates an instance of the PorterStemmer class to later apply stemming (reducing words to their root form, like "playing" to "play").
- **`all_stopwords = stopwords.words("english")`:** Retrieves a list of common English stopwords (like "the", "is", "and").
- **`all_stopwords.remove("not")`:** Removes the word "not" from the stopword list, as it is important in sentiment analysis (e.g., "not happy" is different from "happy").
- **`review = [ps.stem(word) for word in review if not word in set(all_stopwords)]`:** This is a **list comprehension** that:
  - Stems each word (reduces it to its root form).
  - Removes any word that is in the stopword list.
  - Creates a cleaned version of the review.

---

### **2.2 REJOINING THE CLEANED REVIEW AND JOINING IT TO THE CORPUS:**

```python
    review = " ".join(review)  # Join the list of cleaned and stemmed words back into a single string.
    corpus.append(review)  # Add the cleaned review to the corpus.
```

- **`" ".join(review)`:** This takes the list of cleaned and stemmed words and joins them back into a single string, with each word separated by a space.
- **`corpus.append(review)`:** The cleaned and processed review is added to the `corpus` list. The `corpus` will contain all the cleaned reviews after the loop finishes.

---

### **OVERALL PROCESS:**

1. **Remove non-letter characters**: Regular expressions are used to replace anything that is not a letter with a space.
2. **Convert to lowercase**: Ensures consistency in the text by making everything lowercase.
3. **Split the text into individual words**: Tokenization is performed to break the review into separate words.
4. **Remove stopwords (except for "not")**: Common stopwords are removed, but "not" is kept for sentiment analysis.
5. **Stem words**: Words are reduced to their root form to avoid treating different tenses or forms as separate words (e.g., "loved" becomes "love"). Porter stems are not always dictionary words (e.g., "tasty" becomes "tasti").
6. **Rejoin the cleaned words**: The words are joined back together into a single string for each review.
7. **Store in the corpus**: The cleaned review is stored in the `corpus` for further analysis.

---

This code is preparing your text data for further analysis (like sentiment analysis or classification) by cleaning, tokenizing, removing irrelevant words, and reducing words to their root form.
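
The walkthrough above creates the stemmer and the stopword list again for every review. One possible refactoring, sketched below, builds them once and passes them in. The function names `clean_review` and `build_corpus` are hypothetical, and the demo uses a tiny made-up stopword list with no stemming so that the sketch runs without NLTK.

```python
import re


def clean_review(text, stop_words, stem=lambda word: word):
    """Apply the cleaning steps described above to a single review string."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)  # non-letters become spaces
    tokens = letters_only.lower().split()  # lowercase, then tokenize
    return " ".join(stem(token) for token in tokens if token not in stop_words)


def build_corpus(reviews, stop_words, stem=lambda word: word):
    """Clean every review, building the stopword set once and keeping "not"."""
    stop_set = set(stop_words) - {"not"}
    return [clean_review(review, stop_set, stem) for review in reviews]


# Hypothetical stand-ins so the sketch runs on its own.
demo_stop_words = ["the", "is", "this", "and", "was", "not"]
demo_reviews = ["Wow... Loved this place.", "Crust is not good."]

demo_corpus = build_corpus(demo_reviews, demo_stop_words)
print(demo_corpus)  # ['wow loved place', 'crust not good']
```

With NLTK available, the same functions could be called as `build_corpus(data["Review"], stopwords.words("english"), PorterStemmer().stem)`, which should give the stemmed output shown in the notebook, such as "love" instead of "loved".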

In [12]:
print ( corpus )  # This corpus includes the cleaned review with no stopwords ready for further processes.

['wow love place', 'crust not good', 'not tasti textur nasti', 'stop late may bank holiday rick steve recommend love', 'select menu great price', 'get angri want damn pho', 'honeslti tast fresh', 'potato like rubber could tell made ahead time kept warmer', 'fri great', 'great touch', 'servic prompt', 'would not go back', 'cashier care ever say still end wayyy overpr', 'tri cape cod ravoli chicken cranberri mmmm', 'disgust pretti sure human hair', 'shock sign indic cash', 'highli recommend', 'waitress littl slow servic', 'place not worth time let alon vega', 'not like', 'burritto blah', 'food amaz', 'servic also cute', 'could care less interior beauti', 'perform', 'right red velvet cake ohhh stuff good', 'never brought salad ask', 'hole wall great mexican street taco friendli staff', 'took hour get food tabl restaur food luke warm sever run around like total overwhelm', 'worst salmon sashimi', 'also combo like burger fri beer decent deal', 'like final blow', 'found place accid could not

---
## 3. Creating Bag Of Words ( BOW ) :-
---

In [13]:
from sklearn.feature_extraction.text import CountVectorizer  # Importing "CountVectorizer" from modules.

cv = CountVectorizer ( )  # Creating an instance / object of the class "CountVectorizer" and storing it in the variable "cv". This transforms textual data into numerical data.
X = cv.fit_transform ( corpus ).toarray ( )  # "fit" learns the vocabulary from the corpus and "transform" converts each cleaned review into a vector of word counts.

print ( "The Number Of Unique Features Or Words In Corpus =" , len ( X [ 0 ] ) )  # This will show how many unique words / columns there are.

The Number Of Unique Features Or Words In Corpus = 1566


## Purpose Of The Above Code :-

**Purpose :** The entire process converts your text data into a numeric form where each word is treated as a feature. This allows you to input this data into machine learning models.

**Bag-of-Words Model :** Each review is transformed into a vector, where the length of the vector is the number of distinct words in the entire corpus, and the values in the vector are the counts of words in the respective review.

---
## 4. Categorizing Data Into Independent Variable ( X ) And Dependent Variable ( Y ) :-
---

In [18]:
from sklearn.feature_extraction.text import CountVectorizer  # Importing "CountVectorizer" from modules.

cv = CountVectorizer ( max_features = 1500 )  # Creating an instance / object of the class "CountVectorizer" that keeps only the 1500 most frequent words.
X = cv.fit_transform ( corpus ).toarray ( )  # "fit" learns the vocabulary from the corpus and "transform" converts each cleaned review into a vector of word counts.
Y = data.iloc [ : , -1 ].values  # Selecting the dependent variable from the dataframe which is the column of "Liked".

print ( "The Number Of Unique Features Or Words In Corpus =" , len ( X [ 0 ] ) )  # This will show how many unique words / columns there are.
print ( )
print ( "X ( Independent Variable - Review ) :-" )
print ( X )
print ( )
print ( "Y ( Dependent Variable - Liked ) :-" )
print ( Y )

The Number Of Unique Features Or Words In Corpus = 1500

X ( Independent Variable - Review ) :-
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Y ( Dependent Variable - Liked ) :-
[1 0 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 1 0 1 0 1 1 1
 0 1 0 1 0 0 1 0 1 0 1 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1 1 1 0 0
 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0
 0 0 1 1 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1
 1 0 1 1 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 1 0 1 0 1 0 0
 0 0 0 1 1 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1
 0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1
 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 1 1 0 0 0 1 1
 0 1 1 0 0 1 0 0 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 0 0 1 1

## Summary Till Now :-

1. **Textual Data Conversion**: In typical data modeling, features (independent variables) are numerical. Since the reviews are textual data, they are converted into numerical format using **Bag of Words** (with `CountVectorizer`), where each word in the corpus becomes a feature in the feature matrix `X`.

2. **X (Independent Variable)**: Now, after converting the text data, you have a numerical matrix (`X`) where each row represents a review, and each column corresponds to a word or feature. These vectors act as your independent variables (just like regular numerical features in normal modeling).

3. **Y (Dependent Variable)**: Your `Y` was already numerical (binary in this case, representing whether a review is liked or not), so it remains the dependent variable.

4. **Prediction Model**: Once you have `X` and `Y`, you can use them to train a machine learning model. The goal of this model will be to predict whether a review is positive (liked = 1) or negative (liked = 0). Since you're analyzing the sentiment (positive or negative) of the reviews, this process becomes a **sentiment analysis predictive model**.
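
The idea behind limiting the vocabulary with `max_features` can be sketched with the standard library: count every word in the corpus, keep only the most frequent ones, and build count vectors over that reduced vocabulary. This is a simplified illustration, not scikit-learn's implementation; `top_k_vocabulary` and `vectorize` are hypothetical helpers, the toy corpus is made up, and the handling of ties between equally frequent words may differ from the library's.

```python
from collections import Counter


def top_k_vocabulary(corpus, k):
    """Return the k most frequent words in the corpus, sorted alphabetically."""
    counts = Counter(word for review in corpus for word in review.split())
    return sorted(word for word, _ in counts.most_common(k))


def vectorize(review, vocabulary):
    """Count how often each vocabulary word occurs in one review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocabulary]


# Toy cleaned reviews (made up, in the style of the corpus above).
toy_corpus = [
    "wow love place",
    "crust not good",
    "not tasti textur nasti",
    "love place not crowd",
]

vocab = top_k_vocabulary(toy_corpus, k=3)
features = [vectorize(review, vocab) for review in toy_corpus]

print(vocab)     # ['love', 'not', 'place']
print(features)  # [[1, 0, 1], [0, 1, 0], [0, 1, 0], [1, 1, 1]]
```

Rare words such as "crust" or "nasti" fall outside the top 3 and contribute no column, which is the same trade-off made when the notebook reduces the 1566 features to 1500.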

---
## 5. Creating A Function For Automation Of Model Creation :-
---

In [26]:
# Importing all the necessary libraries :-

from sklearn import metrics 
from sklearn.model_selection import cross_val_score , train_test_split
from sklearn.metrics import accuracy_score , f1_score , classification_report , confusion_matrix

In [27]:
def train ( model , x , y ) :  # Creating a custom function which takes input as the model and "x" , "y" data.
    
    x_train , x_test , y_train , y_test = train_test_split ( x , y , test_size = 0.2 , random_state = 40 )  # Splitting the data into training ( 80% ) and testing ( 20% ) sets.

    print ( "1. TRAIN TEST SPLIT :-" )  # HEADING.
    print ( )
    print ( "The Shape Of x :" , x.shape )  # Information about the data split.
    print ( "The Shape Of y :" , y.shape )  # Information about the data split.
    print ( )                                                                                         ####################
    print ( "The Shape Of x_train :" , x_train.shape )  # Information about the data split.           # TRAIN TEST SPLIT # 
    print ( "The Shape Of x_test :" , x_test.shape )    # Information about the data split.           ####################
    print ( )
    print ( "The Shape Of y_train :" , y_train.shape )  # Information about the data split.
    print ( "The Shape Of y_test :" , y_test.shape )    # Information about the data split.
    print ( )
    print ( "*" * 70 ) 
    
                                                                                                       ##################
    model.fit ( x_train , y_train )                                                                    # MODEL TRAINING #
                                                                                                       ##################

    print ( )
    cv_score = cross_val_score ( model , x , y , scoring = "accuracy" , cv = 5 )
    print ( "2. MODEL REPORT ( CROSS VALIDATION ) :-" )  # HEADING.
    print ( )
    print ( "Scoring - accuracy" )  # accuracy.
    print ( cv_score )
    print ( )
    cv_score = np.abs ( np.mean ( cv_score ) )                                                          ####################
    print ( "Average Accuracy =" , cv_score )                                                          # CROSS VALIDATION #
    print ( )                                                                                           ####################
    cv_score = cross_val_score ( model , x , y , scoring = "f1" , cv = 5 )  
    print ( "F1 Score" )  # f1 score.
    print ( cv_score )
    print ( )
    cv_score = np.mean ( cv_score )
    print ( "The Average Of f1 Score =" , cv_score )
    print ( )
    print ( "*" * 70 ) 


    print ( )
    print ( "3. ACCURACY :-" )  # HEADING.
    print ( )
    print ( "Accuracy On The Test Data" )
    y_test_pred = model.predict ( x_test )  # Prediction on test data.                                  ###############################
    print ( "accuracy_score Accuracy =" , accuracy_score ( y_test , y_test_pred ) )                     # PREDICTION & ACCURACY CHECK #
    print ( )                                                                                           ###############################
    print ( "Accuracy On The Training Data" )
    y_train_pred = model.predict ( x_train )  # Prediction on training data.
    print ( "accuracy_score Accuracy =" , accuracy_score ( y_train , y_train_pred ) )
    print ( )
    print ( "Accuracy On The Complete Data" )
    y_pred = model.predict ( x )  # Prediction on complete data.
    print ( "accuracy_score Accuracy =" , accuracy_score ( y , y_pred ) )
    print ( )
    print ( "*" * 70 ) 


    print ( )                                                                                           
    print ( "4. CONFUSION MATRIX :-" )  # HEADING.                                                        
    print ( )                                                                                                                                            
    print ( "Confusion Matrix Of Test Data" )                                                            ####################
    print ( confusion_matrix ( y_test , y_test_pred ) )  # Test data.                                    # CONFUSION MATRIX #
    print ( )                                                                                            ####################
    print ( "Confusion Matrix Of Training Data" )
    print ( confusion_matrix ( y_train , y_train_pred ) )  # Training data.
    print ( )
    print ( "Confusion Matrix Of Complete Data" )
    print ( confusion_matrix ( y , y_pred ) )  # Complete data.
    print ( )
    print ( "*" * 70 )


    print ( )                                                                                           
    print ( "5. CLASSIFICATION REPORT :-" )  # HEADING.                                                        
    print ( )                                                                                                                                            
    print ( "Classification Report Of Test Data" )                                                            #########################
    print ( classification_report ( y_test , y_test_pred ) )  # Test data.                                    # CLASSIFICATION REPORT #
    print ( )                                                                                                 #########################
    print ( "Classification Report Of Training Data" )
    print ( classification_report ( y_train , y_train_pred ) )  # Training data.
    print ( )
    print ( "Classification Report Of Complete Data" )
    print ( classification_report ( y , y_pred ) )  # Complete data.
    print ( )
    print ( "*" * 70 )
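
The helper above prints its report. For side-by-side comparison of several models it can be handy to return the headline numbers as well. The following is only a minimal sketch of that idea: the function name `evaluate`, the `random_state` value, and the synthetic demo data are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate(model, x, y, test_size=0.2, random_state=42):
    """Fit on a train split and return the headline metrics as a dict.

    `evaluate` is a hypothetical companion to the printing helper; the
    split size mirrors the 800/200 split shown in the reports.
    """
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    model.fit(x_train, y_train)
    y_test_pred = model.predict(x_test)
    return {
        # cross_val_score clones the estimator, so the fitted model is untouched.
        "cv_accuracy": cross_val_score(model, x, y, scoring="accuracy", cv=5).mean(),
        "cv_f1": cross_val_score(model, x, y, scoring="f1", cv=5).mean(),
        "train_accuracy": accuracy_score(y_train, model.predict(x_train)),
        "test_accuracy": accuracy_score(y_test, y_test_pred),
        "test_f1": f1_score(y_test, y_test_pred),
    }


# Synthetic stand-in data, used only so the sketch runs on its own.
x_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
report = evaluate(LogisticRegression(max_iter=1000), x_demo, y_demo)
for metric, value in report.items():
    print(f"{metric:<15} {value:.3f}")
```

Collecting one such dict per model would allow the results to be placed in a single table instead of being read off separate printed reports.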

---
## 6. Creating The Random Forest Classification Model :-
---

In [28]:
from sklearn.ensemble import RandomForestClassifier  # Importing class "RandomForestClassifier" from module "ensemble" from library "sklearn".

rf = RandomForestClassifier ( )  # Created an instance / object of the class "RandomForestClassifier ( )" and stored it in the variable "rf".

train ( rf , X , Y )  # Calling the "CUSTOM FUNCTION" which I previously created and specifying the parameters of model and data.

1. TRAIN TEST SPLIT :-

The Shape Of x : (1000, 1500)
The Shape Of y : (1000,)

The Shape Of x_train : (800, 1500)
The Shape Of x_test : (200, 1500)

The Shape Of y_train : (800,)
The Shape Of y_test : (200,)

**********************************************************************

2. MODEL REPORT ( CROSS VALIDATION ) :-

Scoring - accuracy
[0.745 0.76  0.79  0.815 0.81 ]

Average Accuracy = 0.784

F1 Score
[0.72625698 0.73863636 0.76744186 0.79787234 0.79581152]

The Average Of f1 Score = 0.7652038132183685

**********************************************************************

3. ACCURACY :-

Accuracy On The Test Data
accuracy_score Accuracy = 0.8

Accuracy On The Training Data
accuracy_score Accuracy = 0.99625

Accuracy On The Complete Data
accuracy_score Accuracy = 0.957

**********************************************************************

4. CONFUSION MATRIX :-

Confusion Matrix Of Test Data
[[95 19]
 [21 65]]

Confusion Matrix Of Training Data
[[384   2]
 [  1 413]]

Confusion

---
## 7. Creating The Naive Bayes Classification Model :-
---

In [30]:
from sklearn.naive_bayes import GaussianNB  # Importing class "GaussianNB" from module "naive_bayes" from library "sklearn".

nb = GaussianNB ( )  # Created an instance / object of the class "GaussianNB ( )" and stored it in the variable "nb".

train ( nb , X , Y )  # Calling the "CUSTOM FUNCTION" which I previously created and specifying the parameters of model and data.

1. TRAIN TEST SPLIT :-

The Shape Of x : (1000, 1500)
The Shape Of y : (1000,)

The Shape Of x_train : (800, 1500)
The Shape Of x_test : (200, 1500)

The Shape Of y_train : (800,)
The Shape Of y_test : (200,)

**********************************************************************

2. MODEL REPORT ( CROSS VALIDATION ) :-

Scoring - accuracy
[0.705 0.665 0.665 0.705 0.645]

Average Accuracy = 0.677

F1 Score
[0.74458874 0.71966527 0.70742358 0.74678112 0.69527897]

The Average Of f1 Score = 0.7227475366356415

**********************************************************************

3. ACCURACY :-

Accuracy On The Test Data
accuracy_score Accuracy = 0.665

Accuracy On The Training Data
accuracy_score Accuracy = 0.92375

Accuracy On The Complete Data
accuracy_score Accuracy = 0.872

**********************************************************************

4. CONFUSION MATRIX :-

Confusion Matrix Of Test Data
[[59 55]
 [12 74]]

Confusion Matrix Of Training Data
[[325  61]
 [  0 414]]

Confusi

---
## 8. Creating The Logistic Classification Model :-
---

In [31]:
from sklearn.linear_model import LogisticRegression  # Importing class "LogisticRegression" from module "linear_model" from library "sklearn".

lr = LogisticRegression ( )  # Created an instance / object of the class "LogisticRegression ( )" and stored it in the variable "lr".

train ( lr , X , Y )  # Calling the "CUSTOM FUNCTION" which I previously created and specifying the parameters of model and data.

1. TRAIN TEST SPLIT :-

The Shape Of x : (1000, 1500)
The Shape Of y : (1000,)

The Shape Of x_train : (800, 1500)
The Shape Of x_test : (200, 1500)

The Shape Of y_train : (800,)
The Shape Of y_test : (200,)

**********************************************************************

2. MODEL REPORT ( CROSS VALIDATION ) :-

Scoring - accuracy
[0.8   0.775 0.785 0.82  0.785]

Average Accuracy = 0.793

F1 Score
[0.7979798  0.76190476 0.75706215 0.82178218 0.78817734]

The Average Of f1 Score = 0.785381244979303

**********************************************************************

3. ACCURACY :-

Accuracy On The Test Data
accuracy_score Accuracy = 0.805

Accuracy On The Training Data
accuracy_score Accuracy = 0.96625

Accuracy On The Complete Data
accuracy_score Accuracy = 0.934

**********************************************************************

4. CONFUSION MATRIX :-

Confusion Matrix Of Test Data
[[88 26]
 [13 73]]

Confusion Matrix Of Training Data
[[377   9]
 [ 18 396]]

Confusio

---
## 9. Creating The KNN ( K-Nearest Neighbors ) Classification Model :-
---

In [32]:
from sklearn.neighbors import KNeighborsClassifier # Importing class "KNeighborsClassifier" from module "neighbors" from library "sklearn".

knn = KNeighborsClassifier ( )  # Created an instance / object of the class "KNeighborsClassifier ( )" and stored it in the variable "knn".

train ( knn , X , Y )  # Calling the "CUSTOM FUNCTION" which I previously created and specifying the parameters of model and data.

1. TRAIN TEST SPLIT :-

The Shape Of x : (1000, 1500)
The Shape Of y : (1000,)

The Shape Of x_train : (800, 1500)
The Shape Of x_test : (200, 1500)

The Shape Of y_train : (800,)
The Shape Of y_test : (200,)

**********************************************************************

2. MODEL REPORT ( CROSS VALIDATION ) :-

Scoring - accuracy
[0.72  0.685 0.665 0.72  0.65 ]

Average Accuracy = 0.688

F1 Score
[0.69892473 0.6440678  0.6035503  0.69892473 0.66666667]

The Average Of f1 Score = 0.662426844300083

**********************************************************************

3. ACCURACY :-

Accuracy On The Test Data
accuracy_score Accuracy = 0.69

Accuracy On The Training Data
accuracy_score Accuracy = 0.8225

Accuracy On The Complete Data
accuracy_score Accuracy = 0.796

**********************************************************************

4. CONFUSION MATRIX :-

Confusion Matrix Of Test Data
[[83 31]
 [31 55]]

Confusion Matrix Of Training Data
[[341  45]
 [ 97 317]]

Confusion 

---
## 10. Creating The SVM ( Support Vector Machine ) Classification Model :-
---

In [33]:
from sklearn.svm import SVC # Importing class "SVC" from module "svm" from library "sklearn".

svc = SVC ( kernel = "linear" )  # Created an instance / object of the class "SVC ( )" and stored it in the variable "svc".

train ( svc , X , Y )  # Calling the "CUSTOM FUNCTION" which I previously created and specifying the parameters of model and data.

1. TRAIN TEST SPLIT :-

The Shape Of x : (1000, 1500)
The Shape Of y : (1000,)

The Shape Of x_train : (800, 1500)
The Shape Of x_test : (200, 1500)

The Shape Of y_train : (800,)
The Shape Of y_test : (200,)

**********************************************************************

2. MODEL REPORT ( CROSS VALIDATION ) :-

Scoring - accuracy
[0.785 0.775 0.785 0.82  0.795]

Average Accuracy = 0.792

F1 Score
[0.7839196  0.76923077 0.76243094 0.82524272 0.8       ]

The Average Of f1 Score = 0.788164804978768

**********************************************************************

3. ACCURACY :-

Accuracy On The Test Data
accuracy_score Accuracy = 0.81

Accuracy On The Training Data
accuracy_score Accuracy = 0.9825

Accuracy On The Complete Data
accuracy_score Accuracy = 0.948

**********************************************************************

4. CONFUSION MATRIX :-

Confusion Matrix Of Test Data
[[91 23]
 [15 71]]

Confusion Matrix Of Training Data
[[378   8]
 [  6 408]]

Confusion 

---
## Best Performing Model ( LOGISTIC REGRESSION ) :-

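- Logistic Regression stands out as the best model based on:

  - **Test Accuracy**: 80.5%, within half a point of the highest test accuracy (81.0%, linear SVM)

  - **Highest F1 Score on Test Data**: 0.79 (the linear SVM's test confusion matrix gives nearly the same value)

  - **Highest Cross-Validation Accuracy**: 79.3%

It offers the best balance between performance on the test data and generalization across the dataset, with solid accuracy and F1 scores. It also overfits less than Random Forest or SVM: its gap between training and test accuracy is about 16 points (96.6% vs 80.5%), compared with about 20 points for Random Forest (99.6% vs 80.0%) and about 17 points for SVM (98.3% vs 81.0%). All three still fit the training data noticeably better than the test data, so some overfitting remains.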
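
As a cross-check, the test-set figures can be recomputed directly from the confusion matrices printed in the reports. The sketch below copies those five matrices by hand and assumes scikit-learn's layout (rows are true labels, columns are predictions); the helper name `summarize` is only illustrative.

```python
# Test-set confusion matrices copied from the reports above.
# Layout follows scikit-learn: rows are true labels, columns are predictions,
# so each matrix is [[TN, FP], [FN, TP]] for labels (0, 1).
test_confusion = {
    "Random Forest": [[95, 19], [21, 65]],
    "Naive Bayes": [[59, 55], [12, 74]],
    "Logistic Regression": [[88, 26], [13, 73]],
    "KNN": [[83, 31], [31, 55]],
    "SVM (linear)": [[91, 23], [15, 71]],
}


def summarize(matrix):
    """Derive accuracy, precision, recall and F1 for the positive class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


summary = {name: summarize(matrix) for name, matrix in test_confusion.items()}
for name, scores in summary.items():
    print(f"{name:<20} accuracy={scores['accuracy']:.3f}  f1={scores['f1']:.3f}")
```

The recomputed accuracies match the printed `accuracy_score` values, which is a quick sanity check that the matrices were read correctly.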

Finally, by applying hyperparameter tuning I can make the results of the Logistic Regression model more accurate, and lastly deploy the model for actual use.
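
One possible way to approach that tuning step is a grid search over the regularization strength. This is a sketch only: the parameter grid, `max_iter` value, and synthetic data are assumptions chosen so the example runs on its own, and in the project the Bag of Words matrix `X` and labels `Y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Bag of Words matrix and the labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical grid: C is the inverse of the regularization strength,
# so smaller values mean stronger regularization.
param_grid = {"C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Because the search is itself cross-validated, a held-out test set should still be used for the final accuracy check.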

So, this was the entire lifecycle of my `NLP Sentiment Analysis Predictive Model Project`.

---

# FINAL CONCLUSION :-


In this project, I aimed to build a machine learning model using NLP to classify restaurant reviews as positive or negative (a binary classification task). The key steps and concepts involved in the project were:


### **1. Problem Definition**:
- **Goal**: Develop a model to predict whether a restaurant review is positive or negative using a labeled dataset containing two columns: 'Review' and 'Liked' (1 or 0).
  

### **2. Data Preprocessing**:
- **Dataset**: 1000 rows, 2 columns.
- **Steps**:
  - **Text Cleaning**: Removal of special characters, punctuation, and stopwords.
  - **Lowercasing**: Convert all text to lowercase.
  - **Tokenization**: Split text into individual words.
  - **Stemming**: Reduce words to their root form using the Porter Stemmer.
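
The steps above can be sketched roughly as follows. This is not the project's exact code: the stop word list is a tiny hypothetical one, and stemming is left as a pluggable `stem` argument (the project used the Porter Stemmer, for example the `stem` method of NLTK's `PorterStemmer`) so that the sketch needs only the standard library.

```python
import re

# Tiny illustrative stop word list (hypothetical); a real project would use a
# fuller list such as the one shipped with NLTK.
STOP_WORDS = {"the", "was", "is", "and", "a", "an", "but", "of", "to"}


def preprocess(review, stem=lambda word: word):
    """Clean one review and return its list of tokens."""
    # Text cleaning: replace anything that is not a letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", review)
    # Lowercasing, then tokenization on whitespace.
    tokens = letters_only.lower().split()
    # Stop word removal, followed by stemming of the remaining tokens.
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The food was great but the service was slow."))
# ['food', 'great', 'service', 'slow']
```

With the default identity `stem`, words are kept as they are; passing a real stemmer changes only the last step.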
  

### **3. Feature Extraction**:
- **Bag of Words (BoW)**: Converted text reviews into numerical feature vectors.
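
To make the idea concrete, here is a small pure-Python sketch of Bag of Words counting. It is an illustration rather than the vectorizer used in the project; the function name `bag_of_words` and the toy documents are assumptions. The feature matrix in the reports has 1500 columns, which would correspond to `max_features=1500` here, although the original vectorizer settings are not shown in this section.

```python
from collections import Counter


def bag_of_words(token_lists, max_features=None):
    """Return (vocabulary, count_matrix) for a list of tokenized documents."""
    # Count how often each word occurs across the whole corpus.
    corpus_counts = Counter(token for tokens in token_lists for token in tokens)
    # Keep only the most frequent words if a cap is given,
    # then sort alphabetically for a stable column order.
    vocabulary = sorted(word for word, _ in corpus_counts.most_common(max_features))
    column = {word: index for index, word in enumerate(vocabulary)}

    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token in tokens:
            if token in column:
                row[column[token]] += 1
        matrix.append(row)
    return vocabulary, matrix


docs = [["food", "great"], ["service", "slow"], ["great", "service", "great"]]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['food', 'great', 'service', 'slow']
for row in matrix:
    print(row)
```

Each row is one review and each column is one vocabulary word, which is the shape a classifier expects as input.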


### **4. Model Training & Evaluation**:
- **Models Used**: I applied multiple classification algorithms to compare their performance:
  1. **Random Forest Classifier**
  2. **Naive Bayes**
  3. **Logistic Regression**
  4. **K-Nearest Neighbors (KNN)**
  5. **Support Vector Machine (SVM)**

- **Evaluation Metrics**:
  - **Accuracy**: Fraction of predictions that are correct.
  - **F1 Score**: Harmonic mean of precision and recall, balancing the two.
  - **Cross-Validation**: 5-fold cross-validation to check that performance generalizes beyond a single train/test split.
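
The per-fold accuracy scores printed in the reports can be summarized side by side. The sketch below copies those fold scores by hand and adds the standard deviation as a rough indication of how stable each model is across folds; the variable names are illustrative.

```python
from statistics import mean, stdev

# 5-fold cross-validation accuracy scores copied from the reports above.
cv_accuracy = {
    "Random Forest": [0.745, 0.76, 0.79, 0.815, 0.81],
    "Naive Bayes": [0.705, 0.665, 0.665, 0.705, 0.645],
    "Logistic Regression": [0.8, 0.775, 0.785, 0.82, 0.785],
    "KNN": [0.72, 0.685, 0.665, 0.72, 0.65],
    "SVM (linear)": [0.785, 0.775, 0.785, 0.82, 0.795],
}

# Mean and sample standard deviation of the fold scores for each model.
cv_summary = {name: (mean(scores), stdev(scores)) for name, scores in cv_accuracy.items()}

# Print the models from best to worst mean accuracy.
ranked = sorted(cv_summary.items(), key=lambda item: item[1][0], reverse=True)
for name, (average, spread) in ranked:
    print(f"{name:<20} mean={average:.3f}  std={spread:.3f}")
```

The means reproduce the "Average Accuracy" lines of the reports, with Logistic Regression (0.793) and the linear SVM (0.792) very close at the top.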


### **5. Model Performance**:
After testing all models, **Logistic Regression** provided the best overall results:
  - **Test Accuracy**: 80.5%
  - **F1 Score**: 0.79
  - **Cross-Validation Accuracy**: 79.3%


### **6. Conclusion**:
- **Logistic Regression** emerged as the best performing model, providing a robust balance between accuracy and generalization. 
- I successfully automated the model creation and evaluation process through custom Python functions, simplifying model comparison and reporting.

- This project helped solidify my understanding of Natural Language Processing (NLP), text vectorization techniques (BoW), and various classification algorithms.

- I learned many new things and strengthened my existing skills.

- I am now comfortable writing automation functions for training, testing, and accuracy checks of a model.

- I really enjoyed working on this project.

- Project Owner ( Made By ) :- **Shubham Parihar**.