# AI NI Academy 2
***
This is a complementary notebook to go alongside the Azure ML Studio project that we will be taking you through. Feel free to follow along with this notebook, or save it for later so you can compare the code and learn how to build this model with Python!
***

## Imports
***

In [20]:
import pandas as pd
import numpy as np
import re

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support, classification_report

## Data
***
Load your data using the pandas library so that we have something to work with. It's good practice to understand your data before you start working with it. We do this with the .head() method, which displays the first 5 records.

__Tip:__ If you put a number inside the parentheses, it will show that many records instead of the default of 5.
***
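As a small illustration of the tip above, here is a sketch that uses a made-up DataFrame (`toy_df` and its values are hypothetical and stand in for the real Electronics.csv data) to compare the default preview with a shorter one.

```python
import pandas as pd

# Hypothetical toy data standing in for the real reviews file
toy_df = pd.DataFrame({
    "overall": [5, 1, 3, 2, 1, 4, 5],
    "summary": ["a", "b", "c", "d", "e", "f", "g"],
})

default_preview = toy_df.head()   # no argument: first 5 rows
short_preview = toy_df.head(2)    # explicit argument: first 2 rows

print(len(default_preview), len(short_preview))
```

Both calls return a new DataFrame, so the original data is left untouched.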

In [2]:
reviews_df = pd.read_csv("Electronics.csv")

In [3]:
reviews_df.head()

| Unnamed: 0.1 | Unnamed: 0 | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 528881469 | [0, 0] | 5 | We got this GPS for my husband who is an (OTR)... | 06 2, 2013 | AO94DHGC771SJ | amazdnu | Gotta have GPS! | 1370131200 |
| 1 | 1 | 528881469 | [12, 15] | 1 | I'm a professional OTR truck driver, and I bou... | 11 25, 2010 | AMO214LNFCEI4 | Amazon Customer | Very Disappointed | 1290643200 |
| 2 | 2 | 528881469 | [43, 45] | 3 | Well, what can I say. I've had this unit in m... | 09 9, 2010 | A3N7T0DY83Y4IG | C. A. Freeman | 1st impression | 1283990400 |
| 3 | 3 | 528881469 | [9, 10] | 2 | Not going to write a long review, even thought... | 11 24, 2010 | A1H8PY3QHMQQA0 | Dave M. Shaw "mack dave" | Great grafics, POOR GPS | 1290556800 |
| 4 | 4 | 528881469 | [0, 0] | 1 | I've had mine for a year and here's what we go... | 09 29, 2011 | A24EV6RXELQZ63 | Wayne Smith | Major issues, only excuses for support | 1317254400 |


In [4]:
# We reduce the amount of data we are working with here because 300,000 rows would take a while to process
reviews_df_small = reviews_df.head(1000)

In [5]:
reviews_df_small.count()["overall"]

1000

In [6]:
reviews_df_small["overall"][0]

5

In [7]:
# We convert the ratings to binary labels so that we are working with binary classification
# instead of multi-class classification, because we simply want to know if a review is positive or negative.
# Ratings at or above the threshold become 1 (positive); lower ratings become 0 (negative).
threshold = 3

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
reviews_df_small = reviews_df_small.assign(
    overall=np.where(reviews_df_small["overall"] >= threshold, 1, 0)
)
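To see what the threshold does to individual ratings, here is a minimal sketch on made-up star ratings (`toy_ratings` and the `sentiment` column are hypothetical names, not part of the notebook's data). Note that a rating equal to the threshold counts as positive.

```python
import numpy as np
import pandas as pd

# Hypothetical star ratings from 1 to 5
toy_ratings = pd.DataFrame({"overall": [5, 1, 3, 2, 4]})
threshold = 3

# np.where(condition, value_if_true, value_if_false) works element by element
toy_ratings = toy_ratings.assign(
    sentiment=np.where(toy_ratings["overall"] >= threshold, 1, 0)
)

print(toy_ratings["sentiment"].tolist())
```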


In [8]:
reviews_df_small.head()

| Unnamed: 0.1 | Unnamed: 0 | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 528881469 | [0, 0] | 1 | We got this GPS for my husband who is an (OTR)... | 06 2, 2013 | AO94DHGC771SJ | amazdnu | Gotta have GPS! | 1370131200 |
| 1 | 1 | 528881469 | [12, 15] | 0 | I'm a professional OTR truck driver, and I bou... | 11 25, 2010 | AMO214LNFCEI4 | Amazon Customer | Very Disappointed | 1290643200 |
| 2 | 2 | 528881469 | [43, 45] | 1 | Well, what can I say. I've had this unit in m... | 09 9, 2010 | A3N7T0DY83Y4IG | C. A. Freeman | 1st impression | 1283990400 |
| 3 | 3 | 528881469 | [9, 10] | 0 | Not going to write a long review, even thought... | 11 24, 2010 | A1H8PY3QHMQQA0 | Dave M. Shaw "mack dave" | Great grafics, POOR GPS | 1290556800 |
| 4 | 4 | 528881469 | [0, 0] | 0 | I've had mine for a year and here's what we go... | 09 29, 2011 | A24EV6RXELQZ63 | Wayne Smith | Major issues, only excuses for support | 1317254400 |


In [9]:
reviews_df_small.count()

Unnamed: 0        1000
asin              1000
helpful           1000
overall           1000
reviewText         998
reviewTime        1000
reviewerID        1000
reviewerName       995
summary           1000
unixReviewTime    1000
dtype: int64

In [10]:
reviews_df_small = reviews_df_small.dropna()
reviews_df_small = reviews_df_small.reset_index()
reviews_df_small.count()

index             993
Unnamed: 0        993
asin              993
helpful           993
overall           993
reviewText        993
reviewTime        993
reviewerID        993
reviewerName      993
summary           993
unixReviewTime    993
dtype: int64

## Feature Engineering
***
We need to do a bit of work with the data before we can train our model with it and get predictions. 

To Do: 
- Separate the sentiment labels from the associated review text
- Replace the punctuation and numbers found in the review text with spaces
- Turn all the text to lowercase
***
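The cleaning steps in the list above can be sketched on a single string before applying them to the whole column. The helper name `clean_review` and the sample sentence are made up for illustration.

```python
import re


def clean_review(text: str) -> str:
    """Replace non-word characters and digits with spaces, then lowercase."""
    text = re.sub(r"\W", " ", text)   # punctuation and other non-word characters
    text = re.sub(r"\d", " ", text)   # digits
    return text.lower()


sample = "I'm using 2 GPS units!"
cleaned = clean_review(sample)
print(repr(cleaned))
```

Each removed character is replaced by a space rather than deleted, so words on either side of punctuation do not run together.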

In [11]:
sentiment_label = reviews_df_small["overall"]
review_text = reviews_df_small["reviewText"]

In [12]:
# Here we replace the punctuation and numbers with spaces and make all the text lowercase.
# The vectorised .str methods return a new Series, which avoids pandas' SettingWithCopyWarning.
review_text = (
    review_text.str.replace(r"\W", " ", regex=True)
    .str.lower()
    .str.replace(r"\d", " ", regex=True)
)


In [13]:
review_text.head(10)

0    we got this gps for my husband who is an  otr ...
1    i m a professional otr truck driver  and i bou...
2    well  what can i say   i ve had this unit in m...
3    not going to write a long review  even thought...
4    i ve had mine for a year and here s what we go...
5    i am using this with a nook hd   it works as d...
6    the cable is very wobbly and sometimes disconn...
7    this adaptor is real easy to setup and use rig...
8    this adapter easily connects my nook hd       ...
9    this product really works great but i found th...
Name: reviewText, dtype: object

## Let's Train
***
Ok, so now we have formatted and organised our data. We need to set it up to be fed into our model; to do that we will need to do the following: 

To Do:
- Assign the review text and sentiment data to X and Y
- Split the data into training data and testing data
- Use a count vectorizer to count the words, creating a bag of words
- Use a tfidf transformer to reduce the significance of more common words, like "the", "it" and "a"
- Train a Logistic Regression model and a Support Vector Machine
- Find the accuracy of both
- Generate a classification report
***
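Before running the steps above on the real reviews, it can help to see them end to end on a tiny example. The sketch below uses made-up sentences and labels (`toy_texts` and `toy_labels` are hypothetical) and, as an assumption for illustration, trains on the TF-IDF weighted features. The point to notice is that the test text goes through `transform` only, using the vectorizer and transformer that were fitted on the training text.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy reviews: 1 = positive, 0 = negative
toy_texts = [
    "great product works well", "love it works great",
    "excellent value very happy", "really good and reliable",
    "terrible broke after a day", "poor quality very disappointed",
    "awful do not buy", "bad cable stopped working",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

x_train, x_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, train_size=0.75, random_state=42, stratify=toy_labels
)

# Fit the bag of words and the TF-IDF weights on the training text only
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

clf = LogisticRegression(random_state=0).fit(x_train_tfidf, y_train)

# Reuse the fitted objects on the test text: transform, not fit_transform
x_test_tfidf = tfidf_transformer.transform(count_vect.transform(x_test))
toy_predictions = clf.predict(x_test_tfidf)
toy_accuracy = accuracy_score(y_test, toy_predictions)
print(toy_accuracy)
```

With so few examples the accuracy figure itself is not meaningful; the sketch only shows how the pieces connect.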

In [14]:
X = review_text
Y = sentiment_label

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, random_state=42)

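# Build the bag of words: one column per word, one row per training review
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(x_train)

# Re-weight the counts so that very common words carry less significance.
# Note: the models in the following cells are fitted on x_train_counts,
# so x_train_tfidf is computed here but not used for training.
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)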



In [15]:
clf = LogisticRegression(random_state=0).fit(x_train_counts, y_train)
predictions = clf.predict(count_vect.transform(x_test))

In [23]:
print(metrics.accuracy_score(y_test, predictions))

0.8492462311557789


In [17]:
clf_svm = svm.SVC().fit(x_train_counts, y_train)
svm_predictions = clf_svm.predict(count_vect.transform(x_test))

In [24]:
print(metrics.accuracy_score(y_test, svm_predictions))

0.7989949748743719


In [19]:
precision_recall_fscore_support(y_test, predictions, average="binary")

(0.864406779661017, 0.9622641509433962, 0.9107142857142857, None)

In [48]:
precision_recall_fscore_support(y_test, svm_predictions, average="binary")

(0.7989949748743719, 1.0, 0.888268156424581, None)

In [21]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

          0       0.73      0.40      0.52        40
          1       0.86      0.96      0.91       159

avg / total       0.84      0.85      0.83       199



In [22]:
print(classification_report(y_test, svm_predictions))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        40
          1       0.80      1.00      0.89       159

avg / total       0.64      0.80      0.71       199
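The SVM report above, with a recall of 1.00 for class 1 and 0.00 for class 0, is consistent with a model that labels every test review as positive. As a rough check, the sketch below recomputes the class 1 scores for such an "always positive" classifier, using only the support counts from the report (40 negative and 159 positive reviews).

```python
# Support counts taken from the classification report
negatives = 40
positives = 159

# An "always positive" classifier: every review is predicted as class 1
true_positives = positives
false_positives = negatives
false_negatives = 0

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives / (positives + negatives)

print(precision, recall, f1, accuracy)
```

The results match the SVM's precision, recall, F1 and accuracy figures, which suggests that the accuracy of about 0.80 mostly reflects the class imbalance in the test set rather than anything the SVM has learned about the text.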



