# Online Reviews Analysis - (Part-2)
This article is an extension of [Online Reviews Analysis - (Part-1)](https://www.linkedin.com/pulse/online-reviews-analysis-part-1-anas-buhayh/?trackingId=CkpqzrXBTMWsvpjKb3ZW5A%3D%3D). The first part of this series explained some techniques that can be used to summarize reviews and measure feature sentiment. This article is shorter, to reduce redundancy.

In this article, we will use the machine learning library XGBoost to check whether there is a relationship between customers' ratings and the adjectives they used in their reviews. Unlike the first article, this one is mostly code, and it can also serve as a reference for natural language processing; some understanding of machine learning is required to get the most out of the material.

## Importing libraries and data

We will start by importing the libraries and the data. We also need to drop the empty reviews, which are stored as the string 'none' in this dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

df = pd.read_csv('product_reviews.csv').drop(columns=['Product'])
df = df[df['Review_Text'] != 'none']
df = df.reset_index(drop=True)

## Creating corpus

Creating the corpus works the same way as in the previous article, with one difference: here we keep only the adjectives in each review.

In [9]:
import re
import nltk
nltk.download('stopwords')
# Note: nltk.pos_tag also needs NLTK's POS tagger model to be available locally.
from nltk.corpus import stopwords

# Build the stopword set once instead of on every loop iteration
all_stopwords = set(stopwords.words('english'))
# Penn Treebank tags for adjectives (base, comparative, superlative)
adjective_tags = {'JJ', 'JJR', 'JJS'}

corpus = []

for text in df['Review_Text']:
    # Replace non-letters with spaces, lowercase, and tokenize on whitespace
    review = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    review = [word for word in review if word not in all_stopwords]
    # Keeping only adjectives
    tags = nltk.pos_tag(review)
    review = [word for word, pos in tags if pos in adjective_tags]
    corpus.append(' '.join(review))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bhiha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
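
To see what the adjective filter keeps, here is a minimal sketch on a hand-tagged review. The tags below are written by hand for illustration (they are an assumption, not real `nltk.pos_tag` output), which keeps the example runnable without downloading any NLTK data. `keep_adjectives` is a hypothetical helper name.

```python
# Penn Treebank tags that mark adjectives: base, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def keep_adjectives(tagged_words):
    """Return only the words whose POS tag is an adjective tag."""
    return [word for word, tag in tagged_words if tag in ADJECTIVE_TAGS]

# Hypothetical, hand-tagged tokens in the (word, tag) shape that pos_tag returns.
tagged = [
    ("great", "JJ"), ("phone", "NN"), ("battery", "NN"),
    ("lasts", "VBZ"), ("longer", "JJR"), ("best", "JJS"), ("purchase", "NN"),
]

adjectives = keep_adjectives(tagged)
print(" ".join(adjectives))  # great longer best
```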


## Training the model

We need to turn the features (the adjectives) into vectors in order to train the model, because machine learning models cannot work with text in its raw form. In simple terms, feature [vectorizing](https://towardsdatascience.com/different-techniques-to-represent-words-as-vectors-word-embeddings-3e4b9ab7ceb4) means converting words to numeric vectors. Here, `CountVectorizer` represents each review by how many times each vocabulary word appears in it.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = df.iloc[:, -1].values
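
The idea behind `CountVectorizer` can be sketched in plain Python: build a vocabulary from the corpus, then count how often each vocabulary word occurs in each document. This is an illustration of the bag-of-words idea on a made-up corpus, not scikit-learn's implementation (which, among other differences, returns a sparse matrix and by default ignores single-character tokens). `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(corpus):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    matrix = []
    for doc in corpus:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Hypothetical adjective-only reviews, in the same shape as `corpus` above.
toy_corpus = ["great good", "bad", "good good cheap"]

vocabulary, X_toy = bag_of_words(toy_corpus)
print(vocabulary)  # ['bad', 'cheap', 'good', 'great']
print(X_toy)       # [[0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 2, 0]]
```

Each row is one review and each column is one adjective, which is the shape the classifier is trained on.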

Now we split the data into training and test sets: 80% of the data is used for training and the remaining 20% for testing.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
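
Because a plain random split can leave rare ratings under-represented in the test set, one option is a stratified split, which keeps the class proportions roughly equal in both parts. The sketch below illustrates the idea in plain Python on made-up labels; `stratified_split` is a hypothetical helper, not part of scikit-learn.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with about test_size of each class in the test part."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical, imbalanced ratings: ten 5-star reviews and five 1-star reviews.
toy_labels = [5] * 10 + [1] * 5

train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))            # 12 3
print(sorted(toy_labels[i] for i in test_idx))  # [1, 5, 5]
```

In scikit-learn itself, `train_test_split` accepts a `stratify` argument (for example `stratify=y`) for the same purpose.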


It's time to train the model. I used the [XGBoost](https://xgboost.readthedocs.io/en/latest/) classifier to predict the customer rating.

In [13]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)
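
One caveat when reproducing this step: newer XGBoost releases expect the class labels passed to `XGBClassifier` to be consecutive integers starting at 0, so star ratings such as 1 to 5 may need to be re-encoded first. The sketch below shows one way to do that in plain Python, assuming the ratings are the integers 1 to 5 (the article does not state the label values); the helper names are hypothetical.

```python
def encode_labels(labels):
    """Map each distinct label to an index 0..n_classes-1, in sorted order."""
    classes = sorted(set(labels))
    to_index = {label: index for index, label in enumerate(classes)}
    return [to_index[label] for label in labels], classes

def decode_labels(encoded, classes):
    """Map encoded indices back to the original labels."""
    return [classes[index] for index in encoded]

# Hypothetical ratings, assumed to be integers from 1 to 5.
ratings = [5, 1, 3, 5, 4, 2]

encoded, classes = encode_labels(ratings)
print(encoded)                          # [4, 0, 2, 4, 3, 1]
print(decode_labels(encoded, classes))  # [5, 1, 3, 5, 4, 2]
```

scikit-learn's `LabelEncoder` provides the same mapping through `fit_transform` and `inverse_transform`.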

Let's check the accuracy of the results.

In [15]:
from sklearn.metrics import confusion_matrix, accuracy_score

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)

acs = accuracy_score(y_test, y_pred)
print(acs)

[[ 31   1   2   3  32]
 [  4   0   2   0  14]
 [  1   1   1   4  16]
 [  1   0   0   2  46]
 [  1   0   2   5 314]]
0.7204968944099379
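
Accuracy alone can be misleading here, because most test reviews fall in the last class. As a rough check, the sketch below recomputes the accuracy from the confusion matrix printed above and compares it with a baseline that always predicts the most frequent class; it also computes per-class recall. It assumes the rows are true ratings and the columns are predictions, which is the convention `confusion_matrix` uses.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted).
cm = [
    [31, 1, 2, 3, 32],
    [4, 0, 2, 0, 14],
    [1, 1, 1, 4, 16],
    [1, 0, 0, 2, 46],
    [1, 0, 2, 5, 314],
]

row_totals = [sum(row) for row in cm]            # number of true samples per class
total = sum(row_totals)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions

accuracy = correct / total
baseline = max(row_totals) / total               # always predict the largest class
recall = [cm[i][i] / row_totals[i] for i in range(len(cm))]

print(f"accuracy: {accuracy:.3f}")    # accuracy: 0.720
print(f"baseline: {baseline:.3f}")    # baseline: 0.667
print([round(r, 3) for r in recall])  # [0.449, 0.0, 0.043, 0.041, 0.975]
```

Under these assumptions, the model beats the majority-class baseline by only about five percentage points, and the three middle classes are almost never predicted correctly.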
