<a href="https://www.kaggle.com/code/devaanshpuri/restaurant-review-classification-nlp-bow?scriptVersionId=201507868" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Restaurant review classification using NLP (bag of words), a random forest classifier, and grid search (some parameters are fixed in advance for faster processing)

#### * You can try other classification techniques, such as naive Bayes, XGBoost, or logistic regression, and keep whichever model fits the cleaned data best.
#### * Stopwords are words that carry little information for classification tasks, e.g. conjunctions ("and", "but") or articles ("a", "an", "the").
#### * The Porter stemmer is a stemming technique that reduces words to a common stem, e.g. "running" to "run" and "happiness" to "happi". The stem is not always a dictionary word.
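
A minimal sketch of stopword filtering on a single review, to make the idea concrete. The small `STOPWORDS` set and the `remove_stopwords` helper are hypothetical stand-ins chosen for illustration; the notebook itself uses NLTK's full English list. "not" is deliberately left out of the set because it reverses the sentiment of a review.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only;
# NLTK's English list is much longer. "not" is intentionally absent.
STOPWORDS = {"a", "an", "the", "and", "but", "is", "was", "this", "it"}


def remove_stopwords(text):
    """Lowercase the text, keep only letters, and drop stopwords."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return [w for w in words if w not in STOPWORDS]


print(remove_stopwords("The food was not good, but the service is fast!"))
```

With "not" kept, "not good" stays distinguishable from "good" after cleaning.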

### Import the basic libraries

In [1]:
import pandas as pd
import numpy as np
import re
import nltk

### Importing the dataset

In [2]:
dataset = pd.read_csv("/kaggle/input/restaurant-review-bag-of-words/Restaurant_Reviews.tsv", delimiter='\t', quoting=3)  # quoting=3 (csv.QUOTE_NONE) treats quote characters as plain text


### Importing modules for cleaning the dataset

In [3]:
nltk.download("stopwords")  # stopwords such as "a", "and", "the" contribute little to classification
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # maps related words to one stem, e.g. "love" and "loved" both become "love"

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Cleaning the dataset

In [4]:
corpus = []
ps = PorterStemmer()  # create the stemmer once, outside the loop
all_stopwords = set(stopwords.words('english'))  # a set gives fast membership checks
all_stopwords.discard('not')  # keep "not": it reverses the sentiment of a review
for i in range(len(dataset)):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  # replace everything except letters with a space
    review = review.lower()  # lowercase every letter
    review = review.split()  # split the review into words
    review = [ps.stem(word) for word in review if word not in all_stopwords]  # drop stopwords, stem the rest
    review = ' '.join(review)  # rejoin the stemmed words with spaces
    corpus.append(review)
    

### Converting the corpus to a bag-of-words matrix

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # keep the 1500 most frequent words as features
X = cv.fit_transform(corpus).toarray()  # fit_transform returns a sparse matrix; toarray() makes it dense
y = dataset.iloc[:,-1].values
len(X[0])


1500
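
To show what the count matrix contains, here is a simplified, hand-rolled bag of words. It is only a sketch of the idea, not how `CountVectorizer` is implemented: the `bag_of_words` helper and the three-document toy corpus are made up for illustration, tokens are split on whitespace only, and ties in frequency are broken alphabetically (an assumption of this sketch).

```python
from collections import Counter


def bag_of_words(corpus, max_features):
    """Return (vocabulary, count_matrix) for whitespace-tokenised documents."""
    tokenised = [doc.split() for doc in corpus]
    totals = Counter(word for doc in tokenised for word in doc)
    # Keep the most frequent words, then order the columns alphabetically.
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    vocabulary = sorted(top)
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


toy_corpus = ["love place", "not love food", "food good food"]
vocab, counts = bag_of_words(toy_corpus, max_features=3)
print(vocab)   # one column per kept word
print(counts)  # one row per document
```

Each row is one review and each column is one word, so with `max_features=1500` every review becomes a vector of 1500 counts.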

### Training and testing split of the data

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
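
The split above can be pictured as a seeded shuffle followed by a cut. The sketch below uses only the standard library; `split_train_test` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects, since the two use different random number generators.

```python
import random


def split_train_test(rows, test_size=0.20, seed=0):
    """Shuffle row indices with a fixed seed and hold out a `test_size` share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(1000))  # stand-in for the 1000 reviews
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed is what makes the split repeatable between runs, which is the role `random_state = 0` plays in the notebook.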

### Applying grid search to find the best-fitting parameters

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [150],  # Number of trees
    'criterion': ['gini', 'entropy'],  # Criteria for splitting
    'max_depth': [10,20],  # Maximum depth of trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required at a leaf node
    'bootstrap': [True, False]  # Whether to bootstrap samples
}
classifier = RandomForestClassifier()
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_


Fitting 5 folds for each of 72 candidates, totalling 360 fits
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=150; total time=   0.8s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=150; total time=   0.8s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=150; total time=   0.8s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=150; total time=   0.9s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=2, min_samples_split=2, n_estimators=150; total time=   0.8s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=150; total time=   0.7s
[CV] END bootstrap=True, criterion=gini, max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=150; total time=   0.7s
[output truncated]
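
The "72 candidates, totalling 360 fits" line in the log follows directly from the grid: grid search tries every combination of the listed values, once per cross-validation fold. A short standard-library sketch of that arithmetic, reusing the same `param_grid` values (the `candidates` and `n_folds` names are introduced here for illustration):

```python
from itertools import product

param_grid = {
    "n_estimators": [150],
    "criterion": ["gini", "entropy"],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Every combination of one value per parameter is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 5  # matches cv=5 in the GridSearchCV call

print(len(candidates))            # 1 * 2 * 2 * 3 * 3 * 2 candidates
print(len(candidates) * n_folds)  # each candidate is fitted once per fold
```

Because the count is a product, adding a second value for `n_estimators` would double the number of fits, which is why that parameter is fixed here.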

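### Fitting the best model

In [8]:
# GridSearchCV already refits the best estimator on the full training set (refit=True by default),
# so this call is optional; it retrains the same configuration.
best_model.fit(X_train,y_train)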


### Prediction 

In [9]:
y_pred = best_model.predict(X_test)

### Confusion matrix for evaluation 

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[90  7]
 [35 68]]


0.79
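
Accuracy alone hides how the errors are distributed. The sketch below derives accuracy, precision, recall, and F1 from the confusion matrix printed above, using scikit-learn's layout (rows are true labels, columns are predicted labels, in the order 0, 1). Treating label 1 as the positive class is an assumption of this sketch.

```python
# Confusion matrix copied from the output above.
cm = [[90, 7],
      [35, 68]]

(tn, fp), (fn, tp) = cm  # row 0: true label 0, row 1: true label 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted 1s that are truly 1
recall = tp / (tp + fn)     # share of true 1s that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The 35 false negatives against only 7 false positives show that the model misses about a third of the label-1 reviews, so recall is the weaker side of this classifier.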