# Insightly - The Naive Bayes Implementation

This notebook walks through the implementation of the Naive Bayes version of Insightly.

### Setup

Import required packages:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

### Data Preprocessing

Let's start by looking at the data:

In [5]:
df = pd.read_csv('amazon_reviews.csv')
print(df.columns)
df.head()

Index(['Unnamed: 0', 'reviewerName', 'overall', 'reviewText', 'reviewTime',
       'day_diff', 'helpful_yes', 'helpful_no', 'total_vote',
       'score_pos_neg_diff', 'score_average_rating', 'wilson_lower_bound'],
      dtype='object')


| | Unnamed: 0 | reviewerName | overall | reviewText | reviewTime | day_diff | helpful_yes | helpful_no | total_vote | score_pos_neg_diff | score_average_rating | wilson_lower_bound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NaN | 4.0 | No issues. | 2014-07-23 | 138 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 1 | 1 | 0mie | 5.0 | Purchased this for my device, it worked as adv... | 2013-10-25 | 409 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 2 | 1K3 | 4.0 | it works as expected. I should have sprung for... | 2012-12-23 | 715 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 3 | 1m2 | 5.0 | This think has worked out great.Had a diff. br... | 2013-11-21 | 382 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 4 | 4 | 2&1/2Men | 5.0 | Bought it with Retail Packaging, arrived legit... | 2013-07-13 | 513 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |


Only two columns are of interest:
- `overall`: the review score (out of 5) that the product was given
- `reviewText`: the content of the review, which will be used for sentiment analysis

The remaining columns can be dropped:

In [6]:
df = df[['overall', 'reviewText']]
df.head()

| | overall | reviewText |
|---|---|---|
| 0 | 4.0 | No issues. |
| 1 | 5.0 | Purchased this for my device, it worked as adv... |
| 2 | 4.0 | it works as expected. I should have sprung for... |
| 3 | 5.0 | This think has worked out great.Had a diff. br... |
| 4 | 5.0 | Bought it with Retail Packaging, arrived legit... |


Now we can take some steps to clean the data:

In [7]:
# remove missing values
df = df.dropna()

# remove duplicate reviews
df = df.drop_duplicates()

# remove empty reviews (zero-length text)
df = df[df['reviewText'].str.len() > 0]

The data is cleaned, and model development can begin.
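To make the effect of each cleaning step concrete, here is a small sketch on a made-up DataFrame (the `toy` data and the variable names are illustrative assumptions, not part of the Amazon dataset). It also shows one caveat: a length check alone keeps whitespace-only reviews, so stripping first may be worth considering if the real data contains them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicate row, one missing review,
# one empty review, and one whitespace-only review.
toy = pd.DataFrame({
    "overall": [5.0, 5.0, 4.0, 1.0, 3.0],
    "reviewText": ["Works great.", "Works great.", np.nan, "", "   "],
})

# Same first two steps as the notebook: drop missing values, then exact duplicate rows.
cleaned = toy.dropna().drop_duplicates()

# The notebook's length filter removes "" but keeps "   " (its length is 3).
nonempty = cleaned[cleaned["reviewText"].str.len() > 0]

# Possible stricter variant (an assumption, not the notebook's method):
# strip whitespace before measuring the length.
stripped = cleaned[cleaned["reviewText"].str.strip().str.len() > 0]

print(len(toy), len(cleaned), len(nonempty), len(stripped))
```

Note that `drop_duplicates()` compares whole rows, so two reviews with identical text but different scores would both be kept.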

### Model Development

The process of developing a Naive Bayes model is made rather simple by `scikit-learn`.
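As a rough illustration of how little code this can take, the sketch below trains a Naive Bayes classifier on a handful of made-up reviews. It assumes bag-of-words features from `CountVectorizer` and a `MultinomialNB` classifier with binary labels (1 = positive, 0 = negative); the toy texts, the labels, and these modelling choices are assumptions for demonstration and not necessarily what Insightly uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews with assumed binary sentiment labels.
texts = [
    "works great, love it",
    "great card, fast and reliable",
    "love this, works perfectly",
    "stopped working, terrible",
    "terrible card, lost my data",
    "slow and unreliable, do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# Word counts feed a multinomial Naive Bayes model (Laplace smoothing by default).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["great and reliable", "terrible, stopped working"])
print(pred)
```

On real data, the model would be fit on a training split (for example one produced by `train_test_split`, imported above) and evaluated on the held-out portion.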