## Library Imports and Data Preprocessing

Importing all the required libraries

In [179]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer,PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [180]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Loading Data

This corpus contains hotel reviews. Some of the reviews are authentic, while others are fake (deceptive). Our aim is to classify each review as deceptive or authentic.

In [181]:
Data = pd.read_csv('./deceptive-opinion.csv')

In [182]:
Data.head()

Unnamed: 0,deceptive,hotel,polarity,source,text
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...


### Data Preprocessing

Here are the steps we will take for data preprocessing:

* **Converting to Lower Case:** All review text is converted to lowercase so that words such as "Hotel" and "hotel" map to the same token.

* **Removing URLs and HTML Tags:** URLs and HTML tags are removed, since they carry no information about the wording of a review.

* **Removing GIFs:** This step is meant to strip GIFs and other embedded media. The pattern used for it in the code, `(!.*)`, deletes everything from the first exclamation mark to the end of the line, so any review text that follows a `!` on that line is removed as well.

* **Removing Special Characters:** Digits, punctuation marks, and symbols are replaced with spaces, leaving only letters and whitespace.

* **Tokenization:** The cleaned text is split into individual word tokens with NLTK's `word_tokenize`.

* **Lemmatization:** Each token is reduced to its dictionary base form with `WordNetLemmatizer`, which consolidates inflected forms of the same word. Because `lemmatize` treats every token as a noun by default, words such as "was" and "as" come out as "wa" and "a" in the cleaned text.

In [183]:
# This function will be used to clean the Corpus
def clean_text(text):
    text = text.lower()                                                             # Convert to lowercase
    text = re.sub(r'http\S+', '', text)                                             # Remove URLs
    text = re.sub(r'<.*?>', '', text)                                               # Remove HTML tags
    text = re.sub(r'(!.*)','', text)                                                # Intended to remove GIFs; deletes from the first '!' to the end of the line
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)                                        # Remove special characters
    tokens = word_tokenize(text)                                                    # Tokenization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]                        # Lemmatization
    cleaned_text = ' '.join(tokens)                                                 # Joining the tokens
    return cleaned_text
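
To see what the regular-expression steps in `clean_text` do before tokenization, here is a minimal sketch that reproduces only those steps with the standard `re` module. The helper name `regex_clean` and the two sample sentences are made up for illustration; tokenization and lemmatization are left out so the snippet runs without NLTK data.

```python
import re

def regex_clean(text):
    """Apply only the regex steps of clean_text (no tokenization or lemmatization)."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)        # drop URLs
    text = re.sub(r'<.*?>', '', text)          # drop HTML tags
    text = re.sub(r'(!.*)', '', text)          # drop from the first '!' to end of line
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)   # replace non-letters with spaces
    return text

# Hypothetical review snippets, not taken from the dataset
with_bang = regex_clean("Great stay! Would book again. See http://example.com")
without_bang = regex_clean("Room 12 was clean, <b>quiet</b> and cheap.")

print(repr(with_bang))
print(without_bang.split())
```

In the first sample everything after the exclamation mark disappears, which suggests that reviews containing a `!` early on lose most of their text in this pipeline.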

In [184]:
Data['cleaned_text'] = Data['text'].apply(clean_text)

Let's view our cleaned data.

In [185]:
Data.head()

Unnamed: 0,deceptive,hotel,polarity,source,text,cleaned_text
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...,we stayed for a one night getaway with family ...
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...,triple a rate with upgrade to view room wa le ...
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...,this come a little late a i m finally catching...
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...,the omni chicago really delivers on all front ...
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...,i asked for a high floor away from the elevato...


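### Creating TF-IDF Vectorizer
Term frequency-inverse document frequency (TF-IDF) weights each term by how often it occurs in a document, scaled down by how many documents in the corpus contain it, and so turns each review into a numeric vector. Here we use scikit-learn's `TfidfVectorizer`.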
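
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a three-document toy corpus. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which match scikit-learn's default settings as far as I know. The function name `tfidf` and the toy sentences are made up, and the whitespace tokenizer is a simplification: scikit-learn's default token pattern also drops single-character tokens.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows (toy sketch)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["the room was clean", "the room was dirty", "great location"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s}{weight:.3f}")
```

In the first document, "clean" receives a higher weight than "the" because it appears in fewer documents, which is the behavior the vectorizer relies on to emphasize distinctive words.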

In [186]:
vectorizer = TfidfVectorizer()                          # sklearn tf-idf vectorizer
X = vectorizer.fit_transform(Data['cleaned_text'])      # Creating vectors

In [187]:
X.shape

(1600, 7424)

In [188]:
embed = X.toarray()                                                 # Converting the sparse TF-IDF matrix to a dense array
embed_frame = pd.DataFrame(embed)
Features = pd.concat([Data,embed_frame],axis=1)                     # Collecting everything in a DataFrame
Features['y'] = np.where(Features['deceptive']=='truthful',1,0)     # Binary label: 1 = truthful, 0 = deceptive
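
Calling `toarray()` stores every zero explicitly, so it can help to estimate the size of the dense matrix first. The back-of-the-envelope sketch below uses the shape reported above and assumes 8-byte `float64` values, the default dtype of `TfidfVectorizer`.

```python
n_rows, n_cols = 1600, 7424   # X.shape reported above
bytes_per_value = 8           # assumes float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes:,} bytes = {dense_bytes / 1024**2:.1f} MiB")
```

That size is manageable for this corpus, but it grows linearly with both the number of reviews and the vocabulary size.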

In [189]:
Features.head()

Unnamed: 0,deceptive,hotel,polarity,source,text,cleaned_text,0,1,2,3,...,7415,7416,7417,7418,7419,7420,7421,7422,7423,y
0,truthful,conrad,positive,TripAdvisor,We stayed for a one night getaway with family ...,we stayed for a one night getaway with family ...,0.157792,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,truthful,hyatt,positive,TripAdvisor,Triple A rate with upgrade to view room was le...,triple a rate with upgrade to view room wa le ...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,truthful,hyatt,positive,TripAdvisor,This comes a little late as I'm finally catchi...,this come a little late a i m finally catching...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,truthful,omni,positive,TripAdvisor,The Omni Chicago really delivers on all fronts...,the omni chicago really delivers on all front ...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,truthful,hyatt,positive,TripAdvisor,I asked for a high floor away from the elevato...,i asked for a high floor away from the elevato...,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


## Modeling

We will use a train-test split to evaluate our models, holding out 30% of the reviews for testing.

In [190]:
# Keep only the TF-IDF columns and the label y
Features = Features.drop(columns=['hotel','polarity','source','text','cleaned_text','deceptive'])
# y is the last column; everything before it is a feature
X, y = Features.iloc[:, :-1], Features.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
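
As a quick check of what `test_size=0.3` means for this dataset, the sketch below works out the split sizes by hand. It assumes the test count is the fraction of the 1600 rows rounded up, with the remainder going to training, which is how I understand `train_test_split` to behave when only `test_size` is given.

```python
import math

n_samples = 1600   # rows in X, from X.shape
test_size = 0.3    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # held-out reviews
n_train = n_samples - n_test                # reviews used for fitting
print(n_train, n_test)
```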

In [191]:
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7414,7415,7416,7417,7418,7419,7420,7421,7422,7423
0,0.157792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.071018,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Random Forest Classifier

In [192]:
from sklearn.ensemble import RandomForestClassifier

In [193]:
rfclf = RandomForestClassifier(n_estimators=500)
rfclf.fit(X_train,y_train)
y_hat = rfclf.predict(X_test)
print("Accuracy using a Random Forest Classifier",accuracy_score(y_test,y_hat))

Accuracy using a Random Forest Classifier 0.8333333333333334


### Logistic Regression

In [194]:
from sklearn.linear_model import LogisticRegression

In [195]:
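logClf = LogisticRegression()
logClf.fit(X_train,y_train)
y_hat = logClf.predict(X_test)
print("Accuracy using Logistic Regression",accuracy_score(y_test,y_hat))

Accuracy using Logistic Regression 0.8416666666666667

The figure above was recorded when this cell fitted a `RandomForestClassifier(n_estimators=500)` under the name `logClf`; rerun the cell to obtain the logistic regression accuracy.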


### LightGBM Model

In [196]:
from lightgbm import LGBMClassifier

In [197]:
lgbm = LGBMClassifier(verbose=-1)
lgbm.fit(X_train,y_train)
y_hat = lgbm.predict(X_test)
print("Accuracy using Light Gradient Boosting Classifier",accuracy_score(y_test,y_hat))

Accuracy using Light Gradient Boosting Classifier 0.8020833333333334
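
To put the reported accuracies in concrete terms, the sketch below converts each one back into a count of correctly classified test reviews. It assumes a test set of 480 reviews (30% of 1600); the dictionary keys are the notebook cell labels that printed each score.

```python
n_test = 480  # assumed test-set size: 30% of 1600 reviews

# Accuracies as printed by the cells above
reported = {
    "In [193]": 0.8333333333333334,
    "In [195]": 0.8416666666666667,
    "In [197]": 0.8020833333333334,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
correct = {cell: round(acc * n_test) for cell, acc in reported.items()}
for cell, count in correct.items():
    print(f"{cell}: {count} of {n_test} correct")
```

The gaps between the scores amount to a handful of reviews on a single split, so they are probably best read as rough indications rather than a firm ranking of the models.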
