# FAKE PRODUCT REVIEW, ANALYSIS AND MONITORING:

## MODULE 1: Data Cleaning
Data cleaning (or data cleansing) is the process of detecting and correcting, or removing, corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty data.
## MODULE 2: Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a step in data analysis used to gain a better understanding of aspects of the data such as:
– main features of data
– variables and relationships that hold between them
– identifying which variables are important for our problem
## MODULE 3: Corpus
This module has three major functions that lead to the creation of the corpus:
### Tokenization:
The text is broken down into words and phrases so that all the meaningful words it contains can be examined.
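As a rough illustration of tokenization, the sketch below splits a review into lowercase word tokens with a regular expression. The notebook itself relies on NLTK's `word_tokenize`; the `simple_tokenize` helper and the sample sentence are hypothetical stand-ins, chosen so the example runs without any NLTK data downloads.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens (hypothetical helper, not NLTK)."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped
    return re.findall(r"\w+", text.lower())

tokens = simple_tokenize("Great product, works well!")
print(tokens)
```

A regex tokenizer like this is cruder than `word_tokenize` (for example, it splits contractions at the apostrophe), so treat it only as a picture of the idea.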
### Stop-Word Elimination:
Stop-words (common words that carry little meaning on their own) are identified and removed. This reduces the time needed to train and test the model.
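Stop-word removal can be sketched as a simple filter over a token list. The `SAMPLE_STOPS` set and the `remove_stop_words` helper below are assumptions made for illustration; the notebook uses the full NLTK English stop-word list instead.

```python
# A tiny hand-picked stop-word set, for illustration only (assumption);
# the notebook uses the full NLTK English list instead.
SAMPLE_STOPS = {"i", "was", "the", "a", "is", "it"}

def remove_stop_words(tokens, stops=SAMPLE_STOPS):
    """Keep only the tokens that are not in the stop-word set."""
    # Compare in lowercase so "I" and "i" are treated alike
    return [tok for tok in tokens if tok.lower() not in stops]

filtered = remove_stop_words(["I", "was", "studying", "the", "manual"])
print(filtered)
```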
### Stemming:
All the words in the input string are reduced to their root form by removing any unwanted prefix or suffix, which makes the process more efficient.
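A minimal sketch of stemming with NLTK's `PorterStemmer`, the same stemmer the notebook imports. The word list is made up for illustration. Note that a stem need not be a dictionary word, and that suffix-stripping stemmers generally leave irregular forms such as "ran" unchanged.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Made-up sample words: two inflections each of "run" and "study"
words = ["running", "runs", "studies", "studying"]

# Each word is reduced to its Porter stem
stems = [porter.stem(w) for w in words]
print(list(zip(words, stems)))
```

The point of the step is that different inflections of one word collapse to a single feature, which keeps the vocabulary smaller.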
## MODULE 4: Feature Engineering
Feature engineering is the extraction of data and its conversion into a format that the machine learning model can understand. In layman's terms, it is the conversion of all the string values into numbers.
### Bag of Words Model
It counts the individual words in each review and their frequencies. These counts are stored in a pandas DataFrame so that the text can be processed numerically.
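To make the idea concrete, here is a small sketch that builds a bag-of-words table by hand with `collections.Counter`. The two sample reviews are made up, and the notebook itself uses scikit-learn's `CountVectorizer` for this step; the sketch only shows what such a table contains.

```python
from collections import Counter
import pandas as pd

docs = ["good product good price", "bad product"]  # made-up sample reviews

# Count word occurrences per document
counts = [dict(Counter(doc.split())) for doc in docs]

# One row per document, one column per word; words absent from a document count as 0
bow = pd.DataFrame(counts).fillna(0).astype(int)
bow = bow[sorted(bow.columns)]
print(bow)
```

Each row is one review and each column is one word, so the table grows wider as the vocabulary grows.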
### Dummy Variables
Dummy variables are used to convert categorical data into a numerical DataFrame. This DataFrame is then added to the Bag of Words DataFrame to form the final dataset.
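A short sketch of dummy encoding with `pandas.get_dummies`, assuming a two-valued column with the values "Y" and "N". The sample frame and the `VERIFIED_FLAG` column name are hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({"VERIFIED_PURCHASE": ["Y", "N", "Y"]})  # made-up values

# One indicator column per category ("N" and "Y")
dummies = pd.get_dummies(sample["VERIFIED_PURCHASE"])

# For a two-valued field a single indicator column carries all the information
sample["VERIFIED_FLAG"] = dummies["Y"].astype(int)
print(sample)
```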
## MODULE 5: Random Forest Classifier
Random Forest Classifier: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [2]:
import numpy as np
import nltk
import string
import bs4 as bs
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer,LancasterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix,classification_report
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

ModuleNotFoundError: No module named 'ipywidgets'

In [None]:
df=pd.read_excel("C:/Users/Infosystem/Documents/ipython/Fakemain/amazon_reviews.xlsx")
df.head()

# Module 1: Data Cleaning:

Data cleaning (or data cleansing) is the process of detecting and correcting, or removing, corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty data.

### Columns not used are removed:
1) DOC_ID
2) PRODUCT_TITLE
3) REVIEW_TITLE

In [None]:
del df['DOC_ID']
del df['PRODUCT_TITLE']
del df['REVIEW_TITLE']

In [None]:
df.head()

Rows that contain any null values are removed entirely. This data set is already clean and has no null values, but real-world data sets usually contain many, and those rows have to be dropped.
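As a small sketch of this step, the snippet below counts missing values per column and drops incomplete rows on a toy frame. The `toy` data is made up; only the column names echo the data set.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column
toy = pd.DataFrame({
    "RATING": [5, np.nan, 3],
    "REVIEW_TEXT": ["great", "ok", None],
})

# Number of missing values in each column
print(toy.isnull().sum())

# dropna() returns a new frame, so the result has to be assigned to be kept
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "rows removed")
```

A per-column count like this gives the same information as a null-value heatmap in numeric form, which can be easier to read on large frames.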

### Heatmap to check for Null Values:

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

### Removing rows with Null Values:

In [None]:
df = df.dropna()  # dropna() returns a new DataFrame, so assign the result
df

No rows removed.

# Module 2: Exploratory Data Analysis:

Exploratory Data Analysis:        
EDA is a step in data analysis used to gain a better understanding of aspects of the data such as:      
– main features of data.      
– variables and relationships that hold between them.        
– identifying which variables are important for our problem.

### Plot: Count for Verified Purchase

In [None]:
sns.countplot(x='VERIFIED_PURCHASE', data=df)

### Plot: Count for Real and Fake Data

Our dataset contains an equal number of real and fake reviews for training and testing.

### Plot: Ratings Against Count 

In [None]:
sns.countplot(x='RATING', data=df)

### Plot: Product Category and Label Against Ratings

In [None]:
sns.set(rc={'figure.figsize':(17,13)})
sns.barplot(x='RATING',y='PRODUCT_CATEGORY',data=df,hue='LABEL')

### Plot: Label and Verified Purchase Against Count

In [None]:
sns.countplot(x='LABEL', data=df)

In [None]:
sns.countplot(y='LABEL',data=df,palette='coolwarm',hue='VERIFIED_PURCHASE')

A general trend is visible here. Reviews from unverified purchases are mostly fake, and far fewer of them are real. Reviews from verified purchases are mostly real: among verified purchases, the ratio of fake to real reviews is about 1:5.
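The trend read off the plot can also be checked numerically. The sketch below uses `pandas.crosstab` on a made-up frame; the label values "fake" and "real" and the counts are assumptions for illustration, not figures from the data set.

```python
import pandas as pd

# Made-up labels and purchase flags, for illustration only
toy = pd.DataFrame({
    "LABEL": ["fake", "fake", "fake", "real", "real", "real"],
    "VERIFIED_PURCHASE": ["N", "N", "Y", "Y", "Y", "N"],
})

# Share of each label within each verified-purchase group (each column sums to 1)
shares = pd.crosstab(toy["LABEL"], toy["VERIFIED_PURCHASE"], normalize="columns")
print(shares)
```

Running the same call on `df` would give the fake-to-real proportions behind the bars in the plot.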

# Category Selection

In [None]:
cats=list(df['PRODUCT_CATEGORY'].unique())
cats

In [None]:
a=widgets.Combobox(
    value='Health & Personal Care',
    placeholder='Choose a category',
    options=cats,
    description='Combobox:',
    ensure_option=True,
    disabled=False
)
display(a)

In [None]:
k=a.value
k

In [None]:
test = pd.DataFrame(df[df["PRODUCT_CATEGORY"]==k].drop('PRODUCT_CATEGORY',axis=1))
test.columns = ["LABEL","RATING","VERIFIED_PURCHASE","PRODUCT_ID","REVIEW_TEXT"]
test.reset_index(inplace=True)
test

# Module 3: Corpus

In [None]:
stops = set(stopwords.words("english"))
porter = PorterStemmer()
lancaster=LancasterStemmer()

In [None]:
def stemSentence(sentence):
    # Strip punctuation characters
    sentence = ''.join(char for char in sentence if char not in string.punctuation)
    # Drop English stop-words (case-insensitive comparison)
    sentence = ' '.join(word for word in sentence.split() if word.lower() not in stops)
    # Tokenize, then reduce each token to its Porter stem
    # (an early `return sentence` previously skipped this stemming step)
    token_words = word_tokenize(sentence)
    return ' '.join(porter.stem(word) for word in token_words)

In [None]:
stemSentence('ran run running!!')

In [None]:
stemSentence('study studied i was studying')

In [None]:
test['CORPUS']=test['REVIEW_TEXT'].apply(stemSentence)

In [None]:
test

# Module 4: Feature Engineering

In [None]:
corpus = test['CORPUS']
corpus

In [None]:
vectorizer=CountVectorizer()

In [None]:
bow=vectorizer.fit_transform(corpus)
print(bow.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())

In [None]:
bow1=pd.DataFrame(bow.toarray(),columns=vectorizer.get_feature_names_out())
bow1

In [None]:
VP=pd.get_dummies(test['VERIFIED_PURCHASE'])
VP

In [None]:
bow1["VERIFIED_PURCHASE"]=VP["Y"]
bow1['Fake']=test['LABEL']

In [None]:
bow1

# Module 5: Random Forest Classifier:
Random Forest Classifier: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
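The "averaging" in this definition can be seen directly in scikit-learn. The sketch below, on synthetic data from `make_classification` (an assumption, unrelated to the review data set), compares the forest's class probabilities with the mean of the probabilities from its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the mechanism
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from each individual decision tree
per_tree = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own probabilities
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X))
print(matches)
```

The predicted class is then the one with the highest averaged probability, which is why a single over-fitted tree has limited influence on the result.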

In [None]:
X=bow1.drop('Fake',axis=1)
y=bow1['Fake']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=101)

In [None]:
rd=RandomForestClassifier(n_estimators=200)

In [None]:
rd.fit(X_train,y_train)

In [None]:
predpipe=rd.predict(X_test)

In [None]:
print(confusion_matrix(y_test,predpipe))
print('\n')
print(classification_report(y_test,predpipe))

In [None]:
X_test

In [None]:
predpipe=rd.predict(X_test.iloc[[0]])
predpipe

