# Data Cleaning, EDA and Feature Engineering for NLP and Sequence generation


---


The following code performs :

1.  Data Cleaning on the scraped web-page data.
1.  Dataset preparation.
1.   Exploratory Data Analysis on the preparted dataset.
1.  Creating the Bag of Words model.
2.  Creating a classification model.
2.  Testing the classifier on a test set.
1.  Analysing the predicion accuracy.



In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Importing the libraries
import numpy as np
import re
import pandas as pd
import nltk
import json

**Reading Data from JSON object**

In [0]:

with open('drive/My Drive/TextMercato/training_set.json') as f:
    data = json.load(f)
dataset = pd.DataFrame(data)

with open('drive/My Drive/TextMercato/test_set.json') as f:
   data = json.load(f)
test_data = pd.DataFrame(data)

## Analysing the Raw data 

In [0]:
dataset

Unnamed: 0,short-reviews,short-reviews-stars
0,This HDD not connecting and beep sound coming ...,1.0 out of 5 stars
1,First time defective and no replacement.,1.0 out of 5 stars
2,Do not buy this product!,1.0 out of 5 stars
3,I had a bad experience with it,2.0 out of 5 stars
4,From 4Tb u have 3.,5.0 out of 5 stars
5,Request for replacement. Faulty product,1.0 out of 5 stars
6,50-50,3.0 out of 5 stars
7,Very Bad Product of Seagate. Damaged after 3 m...,1.0 out of 5 stars
8,Does that mean that the current one which I ha...,1.0 out of 5 stars
9,One Star,1.0 out of 5 stars


In [0]:
print(dataset.shape)

(4910, 2)


In [0]:
print(dataset.columns)

Index(['short-reviews', 'short-reviews-stars'], dtype='object')


**Conclusion**

1.   The dataset is acollection of short-reviews and star-rating of each review.
2.   The dataset contains 4910 reviews.



## Data Cleaning for NLP



The dataset has to be pre-processed before any valuable insights can be drawn from it.

The dataset is to be converted in such a way that it can be used to predict whether a review is positive or negative.



**Notes:**



*   If the star-rating is above 2.5 the review is considered as 'Positive' and 'Negative' if star-rating is less than or equal to 2.5 .

*   This problem is a binary classifier.

*  The dependent variable or target value is the rating.


---




Data Cleaning involves the following steps: (In the order as performed in the code below)

1.   Removing all numbers
1.   Removing all special characters
1.   Removing all stopwords
2.   Stemming


###Preparing the Training-set

In [0]:
#nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []           # List to store all the words after cleaning 
ps = PorterStemmer()  # Stemmer object
Y =[]                 # List to srore cleaned and updated ratings (0 if star-rating <= 2.5 and 1 otherwise)

for i in range(0,4910):   # Loop to iterate through each row in the dataset
    
    review = re.sub('[^a-zA-Z]', ' ', dataset['short-reviews'][i])     #Removing all numbers special characters from short-reviews
    
    rating = re.sub('(out.*)', ' ', dataset['short-reviews-stars'][i]) #Cleaning short-reviews-stars
    
    rating = float(rating.strip())                                     #Converting short-reviews-stars into float
    
    review = review.lower().split() #Convert to lowercase and split into words
    
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))] #Removing stop words & performing stemming
    
    review = ' '.join(review)  #Joining each word in a review with space
     
    corpus.append(review)  #Appending each cleaned review to corpus
    
    Y.append(rating)

Y_train = [0 if (x <= 2.5) else 1 for x in Y]
Y_train= np.array(Y_train)                       # Vector containing Dependent Variable/ Target Value

#Preparing the Bag of words Model

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=500)
X_train= cv.fit_transform(corpus).toarray()  #Sparse Matrix ()Independent variables set)


###Preparing the Test-set

In [0]:
#Performing all the same cleaning operations on testdata

Y =[]
corpus = []

#preparing training Data

for i in range(0,2210):   #Loop to iterate through each row in the testdata 
    
    review = re.sub('[^a-zA-Z]', ' ', test_data['short-reviews'][i])
    
    rating = re.sub('(out.*)', ' ', test_data['short-reviews-stars'][i])
    
    rating = float(rating.strip())
    
    review = review.lower().split()
    
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    
    review = ' '.join(review)
    
    corpus.append(review)
    
    Y.append(rating)

Y_test = [0 if (x <= 2.5) else 1 for x in Y]
Y_test= np.asanyarray(Y_test)

#4 Bag of words Model
cv2 = CountVectorizer(max_features= 500)
X_test= cv2.fit_transform(corpus).toarray()

## Building the Classifier 

**Steps:**


1.  Initializing the Classifier
1.  Fitting the classifier with training_set
2.  Predicting the resuls for test_set
2.  Comparing the Predicted Values and Actual results with a Cofusion Matrix



In [0]:
from sklearn.metrics import confusion_matrix


# Naive Bayes Classifier

from sklearn.naive_bayes import GaussianNB
classifier1 = GaussianNB()                  #Initializing the Gaussian Classifier
classifier1.fit(X_train,Y_train)            
Y_pred1 = classifier1.predict(X_test)
cm1 = confusion_matrix(Y_test,Y_pred1)



# KNN Classifier p=1(manhattan_distance)

from sklearn.neighbors import KNeighborsClassifier   
classifier2 = KNeighborsClassifier(n_neighbors=5, p=1, algorithm = 'auto')   #Initializing the KNN Classifier with manhattan_distance metric
classifier2.fit(X_train,Y_train)
Y_pred2 = classifier2.predict(X_test)
cm2 = confusion_matrix(Y_test,Y_pred2)



# KNN Classifier p=2 (euclidean_distance)

from sklearn.neighbors import KNeighborsClassifier          
classifier3 = KNeighborsClassifier(n_neighbors=5, p=2)     #Initializing the KNN Classifier with euclidean_distance metric
classifier3.fit(X_train,Y_train)
Y_pred3 = classifier3.predict(X_test)
cm3 = confusion_matrix(Y_test,Y_pred3)



# Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression
classifier4 = LogisticRegression()                         #Initializing the Logistic Regression Classifier                
classifier4.fit(X_train,Y_train)
Y_pred4 = classifier4.predict(X_test)
cm4 = confusion_matrix(Y_test,Y_pred4)



# Support vector classifier

from sklearn.svm import SVC 
classifier5 = SVC(kernel='linear')                         #Initializing the Support Vector Machine Classifier
classifier5.fit(X_train,Y_train)
Y_pred5 = classifier5.predict(X_test)
cm5 = confusion_matrix(Y_test,Y_pred5)



# KNN Classifier (chebyshev_distance)

from sklearn.neighbors import KNeighborsClassifier
classifier6 = KNeighborsClassifier(n_neighbors=5, metric= 'chebyshev')    #Initializing the KNN Classifier with chebyshev_distance metric
classifier6.fit(X_train,Y_train)
Y_pred6 = classifier6.predict(X_test)
cm6 = confusion_matrix(Y_test,Y_pred6)



## Comparing different Classifiers

**Comparing the Confusion Matrix**

In [0]:
print("\nNaive Bayes Classifier:\n",cm1)
print("\nKNN Classifier p=1(manhattan_distance):\n",cm2)
print("\nKNN Classifier p=2 (euclidean_distance):\n",cm3)
print("\nLogistic Regression Classifier:\n",cm4)
print("\nSupport vector classifier:\n",cm5)
print("\nKNN Classifier (chebyshev_distance):\n",cm6)


Naive Bayes Classifier:
 [[ 201  218]
 [ 784 1007]]

KNN Classifier p=1(manhattan_distance):
 [[  69  350]
 [ 106 1685]]

KNN Classifier p=2 (euclidean_distance):
 [[  69  350]
 [ 105 1686]]

Logistic Regression Classifier:
 [[  89  330]
 [ 212 1579]]

Support vector classifier:
 [[ 131  288]
 [ 291 1500]]

KNN Classifier (chebyshev_distance):
 [[  17  402]
 [  10 1781]]


### Comparing the Metrics of each classifier

**Metrics:**


---


**Accuracy = (True Positives + True Negatives) / Number of Observations**

**Precision = True Positive / (True Positive + False Positive) or True Negative / (True Negative + False Negative)**

**Recall = True Positive / (True Positive + False Negative) or True Negative / (True Negative + False Positive)**

**F1 Score = (2 x Precision x Recall) / (Precision x Recall)**

In [0]:
from sklearn.metrics import classification_report

print("\nNaive Bayes Classifier:\n",classification_report(Y_test,Y_pred1))
print("\nKNN Classifier p=1(manhattan_distance):\n",classification_report(Y_test,Y_pred2))
print("\nKNN Classifier p=2 (euclidean_distance):\n",classification_report(Y_test,Y_pred3))
print("\nLogistic Regression Classifier:\n",classification_report(Y_test,Y_pred4))
print("\nSupport vector classifier:\n",classification_report(Y_test,Y_pred5))
print("\nKNN Classifier (chebyshev_distance):\n",classification_report(Y_test,Y_pred6))



Naive Bayes Classifier:
              precision    recall  f1-score   support

          0       0.20      0.48      0.29       419
          1       0.82      0.56      0.67      1791

avg / total       0.70      0.55      0.60      2210


KNN Classifier p=1(manhattan_distance):
              precision    recall  f1-score   support

          0       0.39      0.16      0.23       419
          1       0.83      0.94      0.88      1791

avg / total       0.75      0.79      0.76      2210


KNN Classifier p=2 (euclidean_distance):
              precision    recall  f1-score   support

          0       0.40      0.16      0.23       419
          1       0.83      0.94      0.88      1791

avg / total       0.75      0.79      0.76      2210


Logistic Regression Classifier:
              precision    recall  f1-score   support

          0       0.30      0.21      0.25       419
          1       0.83      0.88      0.85      1791

avg / total       0.73      0.75      0.74      2

**Conclusion**

From the above metrics it can be noted that the K-Nearest Neighbors Classifier with chebyshev_distance metric is the most efficient predictor for the given dataset.

Attained Accuracy : 81.35%