<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:center;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b> **NLP and Machine Learning Assignment: Sentiment Analysis on Reviews Dataset**</b></div>


## **Objective**
The goal of this assignment is to learn how to process textual data, extract features, train various machine learning models, and evaluate their performance on a reviews dataset.

---

## **Part 1: Setup and Data Loading**

1. **Load the Dataset**  
   - Use `pandas` to read the dataset from a CSV file.
   - Display the first few rows of the dataset to understand its structure.
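
As a minimal sketch of this step, the snippet below reads a CSV and inspects its structure. The assignment does not name the file or its columns, so the in-memory CSV and the column names `review` and `sentiment` are assumptions made only so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. In the notebook, replace this with
# pd.read_csv("<path to your dataset>.csv"). The column names "review" and
# "sentiment" are assumptions, not given by the assignment.
sample_csv = io.StringIO(
    "review,sentiment\n"
    "Great product and fast delivery,positive\n"
    "Terrible quality and it broke quickly,negative\n"
    "Works as described,positive\n"
)

df = pd.read_csv(sample_csv)

# Inspect the structure: first rows, shape, and column types.
print(df.head())
print(df.shape)
print(df.dtypes)
```

In a notebook, leaving `df.head()` as the last expression of a cell renders the result as a formatted table instead of plain text.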

---

## **Part 2: Data Preprocessing**

Preprocess the text data using the following steps:

1. **Convert all text to lowercase**  
2. **Remove punctuation, numbers, and other special characters**  
3. **Tokenize the text**  
4. **Remove stop words using NLTK**  
5. **Apply lemmatization using WordNetLemmatizer from NLTK**
6. **Remove URLs** (apply this before step 2; once punctuation is stripped, URLs can no longer be matched reliably)
7. **Apply any additional preprocessing you find useful**
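
The steps above could be combined into a single function along the following lines. This is a sketch, not a reference solution: the function name `preprocess`, the two regular expressions, and the tiny demo stop-word set are all assumptions. The stop-word set and the lemmatizer are passed in as arguments so the example runs without downloaded NLTK data; in the notebook you would pass `set(stopwords.words("english"))` and `WordNetLemmatizer().lemmatize`.

```python
import re

# Assumed patterns: URLs starting with http(s):// or www., and anything
# that is not a lowercase letter or whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def preprocess(text, stop_words=frozenset(), lemmatize=lambda token: token):
    """Return a cleaned, space-joined token string for one review."""
    text = text.lower()                       # lowercase everything
    text = URL_PATTERN.sub(" ", text)         # drop URLs while they are still intact
    text = NON_LETTER_PATTERN.sub(" ", text)  # drop punctuation, digits, symbols
    tokens = text.split()                     # simple whitespace tokenization
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    tokens = [lemmatize(t) for t in tokens]              # lemmatization
    return " ".join(tokens)


# Tiny stand-in stop-word set so the sketch runs without NLTK data.
demo_stop_words = {"the", "is", "at", "it", "and"}

cleaned = preprocess(
    "The battery is GREAT!!! See https://example.com/review?id=42 and buy it.",
    stop_words=demo_stop_words,
)
print(cleaned)
```

To apply it to a whole column, something like `df["clean_text"] = df["review"].apply(preprocess, stop_words=..., lemmatize=...)` would work, where the column names are again hypothetical.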

---

## **Part 3: Feature Extraction**

Extract features using two different methods:

1. **Bag of Words (Frequency Count)**  
   - Use `CountVectorizer` from `sklearn` to extract features.

2. **TF-IDF**  
   - Use `TfidfVectorizer` from `sklearn` to extract features.
  
3. **Combine Bag of Words (Frequency Count) and TF-IDF Features**

---

## **Part 4: Data Splitting**

Split the data into training and test sets:

1. Use `train_test_split` from `sklearn` to split the data.
2. Use 80% of the data for training and 20% for testing.
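
A minimal sketch of the 80/20 split follows. The feature matrix and labels are synthetic placeholders, and `random_state` and `stratify` are optional additions that the assignment does not require; they are shown because they make the split reproducible and keep the class balance similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and labels: 10 samples, 3 features, balanced classes.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% held out for testing, 80% for training
    random_state=42,  # assumption: fixed seed so the split is reproducible
    stratify=y,       # assumption: keep class proportions similar in both sets
)

print(X_train.shape, X_test.shape)
print(len(y_train), len(y_test))
```

`train_test_split` accepts sparse matrices as well, so the output of a vectorizer can be passed in directly without converting it to a dense array.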

---

## **Part 5: Model Training**

Train three different machine learning models:

1. **Random Forest**  
2. **Support Vector Machine (SVM)**  
3. **Naive Bayes**  

- Use `sklearn`'s implementations for these models.
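
The three models can be trained in a loop over a dictionary, as sketched below. The toy reviews, the label encoding (1 = positive, 0 = negative), and the hyperparameters (`n_estimators=100`, a linear SVM kernel) are assumptions chosen for illustration. Note that `MultinomialNB` expects non-negative features, which both count and TF-IDF features satisfy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Hypothetical toy training data; 1 = positive, 0 = negative.
train_texts = [
    "great product love it",
    "excellent quality great value",
    "love the fast delivery",
    "terrible product broke fast",
    "awful quality waste of money",
    "hate it terrible value",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="linear"),
    "Naive Bayes": MultinomialNB(),
}

for name, model in models.items():
    model.fit(X_train, train_labels)
    print(f"{name} trained on {X_train.shape[0]} samples")

# Reuse the fitted vectorizer (transform, not fit_transform) for unseen text.
X_new = vectorizer.transform(["great value love it"])
print(models["Naive Bayes"].predict(X_new))
```

Keeping the models in a dictionary makes the later evaluation and plotting steps a matter of iterating over the same names.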

---

## **Part 6: Evaluation**

Evaluate each model on the test data:

1. Calculate and print the following metrics:
   - **Accuracy**
   - **Precision**
   - **Recall**
   - **F1-score**
   - **Confusion Matrix**
   - **Classification Report**
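
The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them for one hypothetical set of predictions; the labels are made up for illustration and assume a binary task with 1 as the positive class. For a multi-class dataset, `precision_score`, `recall_score`, and `f1_score` need an explicit `average=` argument such as `"macro"` or `"weighted"`.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes, in sorted label order.
cm = confusion_matrix(y_test, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
print("Confusion matrix:")
print(cm)
print(classification_report(y_test, y_pred))
```

Wrapping these calls in a helper that returns a dictionary per model would make it easy to collect the scores into a single DataFrame for comparison.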

---

## **Part 7: Comparative Analysis**

Create a comparison graph of the model performance metrics:

1. Plot a bar graph comparing the **Accuracy**, **Precision**, **Recall**, and **F1-score** for each model.
2. Use `matplotlib` or `seaborn` for plotting.
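
A grouped bar chart is one straightforward way to draw this comparison with `matplotlib`. In the sketch below, the scores are placeholders invented purely to make the plot render; they are not results from any model and should be replaced with the metrics you computed. The `Agg` backend line is only there so the script runs without a display and can be dropped inside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import numpy as np

metrics = ["Accuracy", "Precision", "Recall", "F1-score"]

# Placeholder scores for illustration only; these are NOT real results.
scores = {
    "Random Forest": [0.80, 0.78, 0.82, 0.80],
    "SVM": [0.85, 0.84, 0.86, 0.85],
    "Naive Bayes": [0.75, 0.77, 0.72, 0.74],
}

x = np.arange(len(metrics))  # one group of bars per metric
width = 0.25                 # width of a single bar

fig, ax = plt.subplots(figsize=(8, 4))
for i, (model_name, values) in enumerate(scores.items()):
    # Shift each model's bars so the three sit side by side within a group.
    ax.bar(x + (i - 1) * width, values, width, label=model_name)

ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.set_ylabel("Score")
ax.set_title("Model comparison")
ax.legend()
fig.tight_layout()
# In a notebook, call plt.show() here to display the figure.
```

With `seaborn`, the same chart can be produced by reshaping the scores into a long-format DataFrame and calling `sns.barplot` with `hue` set to the model name.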

---

## **Part 8: Submission**

1. **Submit a Jupyter Notebook**  
   - Ensure the notebook contains the completed code for all parts.



<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:30px;font-family:Georgia;text-align:center;display:fill;border-radius:5px;background-color:#254E58;overflow:hidden"><b> Step by Step Implementation</b></div>

<a id="1"></a>
# <div style="padding:20px;color:white;margin:0;font-size:24px;font-family:Georgia;text-align:Left;display:fill;border-radius:10px;background-color:#254E58;overflow:hidden"><b> Import Required Libraries</b></div>

In [4]:
# Data Handling and Manipulation
import pandas as pd
import numpy as np

# Text Preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Model Training
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Model Evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Data Splitting
from sklearn.model_selection import train_test_split

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Download necessary NLTK data files (only need to run once)
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\waqar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\waqar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## **Part 1: Setup and Data Loading**

1. **Load the Dataset**  
   - Use `pandas` to read the dataset from a CSV file.
   - Display the first few rows of the dataset to understand its structure.

---

In [6]:
# Code TODO
# Must display the output as a DataFrame

## **Part 2: Data Preprocessing**

Preprocess the text data using the following steps:

1. **Convert all text to lowercase**  
2. **Remove punctuation, numbers, and other special characters**  
3. **Tokenize the text**  
4. **Remove stop words using NLTK**  
5. **Apply lemmatization using WordNetLemmatizer from NLTK**
6. **Remove URLs** (apply this before step 2; once punctuation is stripped, URLs can no longer be matched reliably)
7. **Apply any additional preprocessing you find useful**


---

In [8]:
# Code TODO
# Must display the output as a DataFrame

## **Part 3: Feature Extraction**

Extract features using two different methods:

1. **Bag of Words (Frequency Count)**  
   - Use `CountVectorizer` from `sklearn` to extract features.

2. **TF-IDF**  
   - Use `TfidfVectorizer` from `sklearn` to extract features.
     
3. **Combine Bag of Words (Frequency Count) and TF-IDF Features**


## Required Vectorizer Parameters

Both `CountVectorizer` and `TfidfVectorizer` accept parameters that control how text is transformed into features. For this assignment, you must set the parameters shown below; the comments explain what each one does:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example with CountVectorizer
count_vectorizer = CountVectorizer(
    token_pattern=r'(?u)\b\w\w+\b',  # Matches tokens of two or more word characters
    ngram_range=(1, 1),              # Only includes unigrams (single words)
    analyzer='word',                 # Analyzes text by splitting into words (not characters)
    max_features=10                  # Keeps only the 10 most frequent terms in the corpus
)

# Example with TfidfVectorizer (same parameters apply)
tfidf_vectorizer = TfidfVectorizer(
    token_pattern=r'(?u)\b\w\w+\b',
    ngram_range=(1, 1),
    analyzer='word',
    max_features=10
)
```

### Note: You must print the extracted features as a DataFrame

---

In [10]:
# Code TODO

## **Part 4: Data Splitting**

Split the data into training and test sets:

1. Use `train_test_split` from `sklearn` to split the data.
2. Use 80% of the data for training and 20% for testing.

---


In [12]:
# Code TODO
# Must display the output as a DataFrame

## **Part 5: Model Training**

Train three different machine learning models:

1. **Random Forest**  
2. **Support Vector Machine (SVM)**  
3. **Naive Bayes**  

- Use `sklearn`'s implementations for these models.

---

In [19]:
# Code TODO

## **Part 6: Evaluation**

Evaluate each model on the test data:

1. Calculate and print the following metrics:
   - **Accuracy**
   - **Precision**
   - **Recall**
   - **F1-score**
   - **Confusion Matrix**
   - **Classification Report**


---


In [22]:
# Code TODO
# Must display the output as a DataFrame

## **Part 7: Comparative Analysis**

Create a comparison graph of the model performance metrics:

1. Plot a bar graph comparing the **Accuracy**, **Precision**, **Recall**, and **F1-score** for each model.
2. Use `matplotlib` or `seaborn` for plotting.

---

In [25]:
# Code TODO

## **Part 8: Submission**

1. **Submit a Jupyter Notebook**  
   - Ensure the notebook contains the completed code for all parts.