# NLP Restaurant Reviews

![img](https://cdn.basedlabs.ai/62a0d420-8334-11ef-88e7-7521375aecdf.jpg)

## Import Libraries

In [1]:
import numpy as np  
import pandas as pd 

## Load Dataset

In [2]:
df = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')

## EDA

In [3]:
df.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [5]:
import re  # Regular expressions for text manipulation
import nltk  # Natural language processing library

# Requires the NLTK stop word list; if it is missing, run nltk.download('stopwords') once
from nltk.corpus import stopwords  # List of common words that carry little meaning

from nltk.stem.porter import PorterStemmer  # 'PorterStemmer' class for reducing words to their stems

In [6]:
# Create an empty list to hold the cleaned reviews
corpus = []

# Build the stemmer and the stop word set once, outside the loop
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Loop over every review in the dataset (1000 rows)
for i in range(len(df)):

    # Take the i-th review and replace every non-letter character with a space
    review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])

    # Convert all letters to lowercase
    review = review.lower()

    # Split the review into words
    review = review.split()

    # Drop the stop words and stem each remaining word
    review = [ps.stem(word) for word in review if word not in stop_words]

    # Join the stemmed words back into a single cleaned string
    review = ' '.join(review)

    # Append the cleaned text to the 'corpus' list
    corpus.append(review)
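
To make the cleaning steps concrete, the sketch below applies them to a single review. It is a simplified illustration rather than the notebook's exact pipeline: `clean_review` and `TOY_STOP_WORDS` are hypothetical names, the stop word set is a tiny hand-picked stand-in for NLTK's English list, and stemming is left out so the example runs with the standard library alone.

```python
import re

# Hypothetical, hand-picked stop words standing in for NLTK's English list
TOY_STOP_WORDS = {"this", "is", "the", "and", "was"}

def clean_review(text, stop_words=TOY_STOP_WORDS):
    """Replace non-letters with spaces, lowercase, split, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    words = letters_only.lower().split()
    return ' '.join(word for word in words if word not in stop_words)

cleaned = clean_review("Wow... Loved this place.")
print(cleaned)  # wow loved place
```

In the notebook itself, the Porter stemmer would additionally reduce `loved` to `love`.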

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# 'max_features' keeps only the 1500 most frequent words. The value is a tunable parameter, chosen by trial to get better results
cv = CountVectorizer(max_features=1500)

# Fit on the texts in 'corpus', extract the features, and convert the result to an array
# X holds the independent variables (features): the bag of words representation.
X = cv.fit_transform(corpus).toarray() 

# y is the target: whether each review is positive (1) or negative (0)
y = df.iloc[:, 1].values 

This code snippet creates a Bag of Words model to convert text data into a numerical format.

CountVectorizer: This is used to transform the text data into numerical features based on word frequency. It counts the occurrences of each word in the corpus and creates a matrix representation.

X: This variable contains the numerical representation of the cleaned text (corpus). Each row represents a review, and each column corresponds to a word. The values in the matrix indicate the frequency of each word in the respective review.

y: This variable indicates whether each review is classified as positive or negative. It serves as the target variable for machine learning models, providing the labels needed for training.
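
The word-count matrix described above can be seen on a small scale in the following sketch. The three-review `toy_corpus` and the `toy_` variable names are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus of already-cleaned reviews
toy_corpus = ["wow love place", "crust good", "love crust love"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# Recover the words in column order from the fitted vocabulary
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocabulary)  # ['crust', 'good', 'love', 'place', 'wow']
print(toy_X)
# [[0 0 1 1 1]
#  [1 1 0 0 0]
#  [1 0 2 0 0]]
```

Each row is one review and each column one word; the 2 in the last row records that "love" appears twice in the third review.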



## Splitting Corpus into Training and Test Set
For this, we use the train_test_split function from sklearn.model_selection. Common splits are 70/30, 80/20, 85/15, or 75/25; here I choose 75/25 via "test_size".
X is the bag of words matrix; y is 1 for a positive review and 0 for a negative one.

In [8]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
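
Because train_test_split shuffles the rows randomly, each run of the cell above yields a different split and therefore slightly different scores. A minimal sketch of a reproducible, class-balanced split on toy data follows; the `random_state=42` and `stratify` settings and the `toy_` names are assumptions for illustration and were not used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, balanced 0/1 labels
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```

With a fixed seed, rerunning the split returns the same rows every time, which makes reported metrics repeatable.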

## Model Creation

In [9]:
# Create a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# 'n_estimators' sets the number of trees in the forest
# 'criterion' sets the measure used to evaluate the quality of a split
model = RandomForestClassifier(n_estimators = 501,
                            criterion = 'entropy')

In [10]:
# Fit the model to the training set
model.fit(X_train, y_train) 

In [11]:
# Predict the test set results
y_pred = model.predict(X_test)
y_pred

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0])

In [19]:
# Build the confusion matrix
from sklearn.metrics import confusion_matrix, f1_score

cm = confusion_matrix(y_test, y_pred)
cm

array([[118,  17],
       [ 36,  79]])

In [20]:
# Compute the accuracy and F1 scores
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1:.2f}")

Model Accuracy: 78.80%
F1 Score: 0.75
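
Both scores can be recovered by hand from the confusion matrix shown earlier, which is a useful sanity check. The sketch below assumes scikit-learn's layout for binary labels (rows are true classes, columns are predicted classes, so the flattened order is TN, FP, FN, TP); the precision and recall lines are added for illustration and were not computed in the notebook.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[118, 17],
               [36, 79]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all reviews classified correctly
precision = tp / (tp + fp)         # share of predicted positives that are right
recall = tp / (tp + fn)            # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy * 100:.2f}%")  # Accuracy: 78.80%
print(f"Precision: {precision:.2f}")       # Precision: 0.82
print(f"Recall: {recall:.2f}")             # Recall: 0.69
print(f"F1 Score: {f1:.2f}")               # F1 Score: 0.75
```

The gap between precision and recall shows that the model misses positive reviews (36 false negatives) more often than it mislabels negative ones (17 false positives).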
