## Question

The dataset contains text of online travel reviews (in Column Review) with an associated Rating (column Overall_Rating). 
The objective is to train a classifier to predict the rating from the Review text. 
You are free to choose the model's architecture, but you should describe and justify your design choices.
Train the model and assess it as appropriate in machine learning. You are allowed to preprocess the data however you want 
(e.g. using pretrained embeddings, dropping some features, just a bag-of-words), but the predictive model must be trained 
by yourself from scratch (no pretrained predictor). 

### Problem definition
- ***Dataset***: Online travel reviews with their corresponding ratings
- ***Inputs***: Travel reviews
- ***Output***: Predict the rating from the review

We are training a classifier to predict the rating from the review text.

### Step-by-step plan
- **Load data**
    - *loading the data*
    - *splitting the data into train and test*
- **Data preprocessing**
    - **Text cleaning:** *Remove noise (punctuation, stopwords)*
    - **Text normalization:** *Lowercasing, stemming, lemmatization*
    - **Tokenization:** *Split text into words, or subwords, or characters*
- **Feature extraction/embeddings**
    - *Bag of words*
    - *TF-IDF*
- **Model Selection**
    - *Naive Bayes classifier*
- **Model Training**
    - *Train on the train dataset*
    - *Monitor loss and accuracy on validation dataset*
    - **Techniques**
        - *Hyperparameter tuning*
        - *Cross validation*
        - *Regularization (dropout, weight decay)*
- **Model evaluation**
    - **Metrics**
        - *Accuracy, precision, recall, f1 score*
        - *Confusion matrix*
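
The plan above can be sketched end to end on a few made-up reviews. This is only a minimal illustration under stated assumptions, not the notebook's final model: the names `toy_reviews`, `toy_ratings`, `pipeline`, and `param_grid` are hypothetical, and the grid values are arbitrary. Note that for Naive Bayes the additive-smoothing parameter `alpha` plays the regularizing role, since dropout and weight decay apply to neural networks rather than to this classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reviews and ratings, invented purely for illustration.
toy_reviews = [
    "great flight and friendly crew",
    "friendly staff, great food",
    "comfortable seats and great service",
    "smooth boarding, friendly crew",
    "terrible delay and rude staff",
    "lost luggage, terrible service",
    "rude crew and dirty seats",
    "long delay, dirty cabin",
]
toy_ratings = [5, 5, 5, 5, 1, 1, 1, 1]

# Feature extraction (TF-IDF) and the classifier live in one pipeline,
# so the vectorizer is refit inside each cross-validation fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB()),
])

# Hypothetical grid: unigrams vs. unigrams+bigrams, and two smoothing strengths.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

# 2-fold cross-validation only because the toy set is tiny.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_reviews, toy_ratings)

best_params = search.best_params_
prediction = search.predict(["rude staff and a terrible delay"])[0]
print(best_params, prediction)
```

Keeping the vectorizer inside the pipeline avoids leaking vocabulary or IDF statistics from validation folds into training.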

In [9]:
## Installing some libraries and packages.
%pip install pandas numpy matplotlib seaborn nltk scikit-learn

In [53]:
## Importing necessary libraries
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
# from nltk.corpus import stopwords
# from nltk.stem import PorterStemmer, WordNetLemmatizer
# from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

## Loading the dataset

In [71]:
## Define the necessary columns to read
columns_to_read = ["Overall_Rating", "Review"]
df = pd.read_csv("Airline_Reviews.csv", usecols=columns_to_read)

In [72]:
## Check whether there are any null values
df.isnull().sum()

Overall_Rating    0
Review            0
dtype: int64
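
A zero count from `isnull()` only rules out missing values; a review consisting solely of whitespace is still a non-null string. The sketch below shows the difference on a hypothetical `toy_df` (the values are invented and do not come from `Airline_Reviews.csv`), so whether such rows exist in the real data is left as an open check.

```python
import pandas as pd

# Invented rows: one real review and two whitespace-only "reviews".
toy_df = pd.DataFrame({
    "Overall_Rating": [9, 1, 5],
    "Review": ["Great flight", "   ", "\t"],
})

# isnull() does not flag whitespace-only strings.
null_counts = toy_df.isnull().sum()

# Count reviews that are empty once surrounding whitespace is stripped.
blank_reviews = toy_df["Review"].str.strip().eq("").sum()

print(null_counts["Review"], blank_reviews)
```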

In [73]:
# Split the data into the feature (X, the review text) and the target (y, the rating)
X = df["Review"]
y = df["Overall_Rating"]

In [74]:
# Splitting data into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
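
If the ratings turn out to be imbalanced, a plain random split can leave the test set with a different class mix than the training set. One possible variation, shown here on hypothetical toy data (`toy_X`, `toy_y`), is to pass `stratify` to `train_test_split` so that both parts keep roughly the same proportions. This is a sketch of an alternative, not what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, deliberately imbalanced labels: 15 reviews rated 1, 5 rated 9.
toy_X = pd.Series([f"review {i}" for i in range(20)])
toy_y = pd.Series([1] * 15 + [9] * 5)

# stratify=toy_y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(train_share)
print(test_share)
```

Stratification requires every class to have at least two samples, so very rare ratings would need to be handled before splitting.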

## Data Preprocessing