# Language Warmup Full Model

## Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import string
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

yelpDataset = pd.read_csv('Yelp.txt', sep='\t', header=None, encoding='latin-1')
yelpDataset.columns = ['review', 'sentiment']
stopword = nltk.corpus.stopwords.words('english')
stopword = [word for word in stopword if word != 'not']
lm = nltk.WordNetLemmatizer()


#def removePunct(text):
   # noPunct = ''.join([char for char in text if char not in string.punctuation])
   # return noPunct

def tokenize(text):
    tokens = re.split(r'\W+', text)
    return tokens

def onlyAlpha(tokenizedList):
    text = [word for word in tokenizedList if word.isalpha()]
    return text

def noStop(tokenizedList):
    text = [word for word in tokenizedList if word not in stopword]
    return text

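def posTag(tokenizedList):
    # nltk.pos_tag takes the whole token list and returns (word, tag) pairs.
    # It needs the 'averaged_perceptron_tagger' resource; this helper is not used in the pipeline below.
    return nltk.pos_tag(tokenizedList)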

def lemmatize(tokenizedList):
    text = ' '.join([lm.lemmatize(word) for word in tokenizedList])
    return text

yelpDataset['review_tokens'] = yelpDataset['review'].apply(lambda x: tokenize(x.lower()))
yelpDataset['review_alpha'] = yelpDataset['review_tokens'].apply(lambda x: onlyAlpha(x))
yelpDataset['review_nostops'] = yelpDataset['review_alpha'].apply(lambda x: noStop(x))
yelpDataset['review_lemmatized'] = yelpDataset['review_nostops'].apply(lambda x: lemmatize(x))

df1 = pd.DataFrame(data = yelpDataset['review_lemmatized'])
#creates a list that can be vectorized later
df1 = df1['review_lemmatized'].tolist()
df2 = pd.DataFrame(data = yelpDataset['sentiment'])
df2 = df2['sentiment'].tolist()

yelpDataset.head()

print(df1)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Randy_B15\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Randy_B15\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
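
To make the cleaning steps concrete, here is a minimal, dependency-free sketch of the tokenize, alphabetic-filter, and stopword steps applied to one sentence. The sentence and the tiny `demo_stopwords` set are made up for illustration (the notebook uses NLTK's full English list minus 'not'), and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical, hand-picked stopwords; the notebook uses NLTK's English list without 'not'
demo_stopwords = {"this", "at", "all", "the", "a", "is"}

def clean(text):
    # Lowercase, then split on runs of non-word characters
    tokens = re.split(r"\W+", text.lower())
    # Drop empty strings and tokens containing digits or underscores
    tokens = [t for t in tokens if t.isalpha()]
    # Drop stopwords; 'not' survives because it is not in the set
    return [t for t in tokens if t not in demo_stopwords]

cleaned = clean("Wow... Loved this place, not bad at all!")
print(cleaned)  # ['wow', 'loved', 'place', 'not', 'bad']
```

Keeping 'not' matters for sentiment: without it, "not bad" and "bad" would reduce to the same token.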


## Feature Engineering and Vectorization

In [2]:
# Vectorize data as binary 1-gram features (the commented-out line below would add 2-grams)

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, lowercase=False)
#vectorizer = CountVectorizer(binary=True, lowercase=False, ngram_range=(1, 2))
vector = vectorizer.fit_transform(df1)
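
As a rough picture of what `binary=True` produces, the sketch below builds presence/absence vectors by hand for two made-up documents. It is not how `CountVectorizer` is implemented, and it splits on whitespace rather than using scikit-learn's default token pattern; it only illustrates the shape of the output: one column per vocabulary word, in sorted order, holding 1 when the word occurs in the document.

```python
docs = ["good food good service", "not good"]  # made-up, already-cleaned reviews

# Vocabulary: every distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def to_binary_vector(doc):
    words = set(doc.split())
    # 1 if the word is present at all, regardless of how often it appears
    return [1 if term in words else 0 for term in vocab]

matrix = [to_binary_vector(doc) for doc in docs]
print(vocab)   # ['food', 'good', 'not', 'service']
print(matrix)  # [[1, 1, 0, 1], [0, 1, 1, 0]]
```

Note that "good" appears twice in the first document but still contributes a single 1.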

In [3]:
# Change to a numpy array

data = vector.toarray()
print(type(data))

<class 'numpy.ndarray'>


In [4]:
# Split into train, validation, and test sets by row position (the slices assume 1000 rows)

x_train = np.concatenate([data[:300], data[-300:]])
y_train = np.concatenate([df2[:300], df2[-300:]])
x_val = np.concatenate([data[300:400], data[600:700]])
y_val = np.concatenate([df2[300:400], df2[600:700]])
x_test = data[400:600]
y_test = np.array(df2[400:600])
print(x_train.shape)
print(x_val.shape)
print(x_test.shape)

(600, 1763)
(200, 1763)
(200, 1763)
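
A quick sanity check on a positional split like this is to confirm that the three slices are disjoint and together cover every row. The sketch below does that on plain row indices, assuming 1000 rows (consistent with the 600/200/200 shapes printed above); list slicing stands in for `np.concatenate`.

```python
n_rows = 1000  # assumed row count, consistent with the printed shapes
idx = list(range(n_rows))

train = idx[:300] + idx[-300:]       # rows 0-299 and 700-999
val = idx[300:400] + idx[600:700]    # rows 300-399 and 600-699
test = idx[400:600]                  # rows 400-599

# The three sets should be disjoint and together cover every row
all_rows = train + val + test
print(len(train), len(val), len(test))  # 600 200 200
print(sorted(all_rows) == idx)          # True
```

Because the split is positional, it is only representative if the file is not ordered by sentiment; shuffling first would be an alternative.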


## Model Architecture

### Model Creation 

This is where we finally build our neural network to process the vectorized data we made above. The process consists of importing the libraries we need (keras in our case), selecting the kind of model we wish to work with, creating the neural network with its number of nodes and layers, selecting the optimizer, loss, and metric functions, training the model, and finally testing it on held-out data.

In [5]:
#Import the models module from keras; it provides the Sequential model class
from keras import models

#Import the layers module from keras; it provides the Dense layers we stack inside the model
from keras import layers


#Make our model a Sequential model: a linear stack of layers
model = models.Sequential()

#The first layer is a dense layer with 16 nodes and the relu activation function; input_shape tells it
#how many features each review vector has
#Worth noting here that the number of layers and number of nodes are hyperparameters rather than fixed rules; in our experience
#with this particular model, 1 hidden dense layer with 16 nodes allowed for maximal accuracy on the test data
#The output is a single-node dense layer whose sigmoid activation gives a probability between 0 and 1

model.add(layers.Dense(16, activation = 'relu', input_shape = (x_train.shape[1],)))
model.add(layers.Dense(1,  activation = 'sigmoid'))

Using TensorFlow backend.
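
To get a feel for the size of this network, the arithmetic below counts its trainable parameters. It assumes the 1763-feature input reported by the shapes printed earlier; a dense layer has one weight per (input, unit) pair plus one bias per unit.

```python
n_features = 1763   # width of the vectorized data, from the shapes printed earlier
hidden_units = 16
output_units = 1

# One weight per (input, unit) pair plus one bias per unit
hidden_params = n_features * hidden_units + hidden_units
output_params = hidden_units * output_units + output_units

print(hidden_params)                  # 28224
print(output_params)                  # 17
print(hidden_params + output_params)  # 28241
```

With only 600 training reviews and roughly 28,000 parameters, the model has plenty of capacity to memorize the training set, which is why the validation curve is watched later on.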


### Model Compilation

Now that we have built our model, we must choose a metric, a loss function, and an optimizer.

There is a lot of freedom here: many different loss functions, evaluation metrics, and optimizers are available.

In our experience with this model, the following optimizer, loss, and metric were the most useful,
but this can vary depending on the project you are working on.

In [6]:
model.compile(optimizer = 'rmsprop',           #Set our optimizer function as "rmsprop"
              loss = 'binary_crossentropy',    #Set up our loss function as "binary cross entropy"
              metrics = ['accuracy'])          #Set up our metric function, will use "accuracy" here
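
For intuition about the loss, here is a small hand-written version of binary cross-entropy. It is a sketch of the formula, not Keras's implementation; the `eps` clipping constant and the example probabilities are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # eps is an assumed clipping constant that keeps log() away from 0
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = binary_crossentropy([1, 0], [0.9, 0.2])
unsure = binary_crossentropy([1, 0], [0.6, 0.5])
print(round(confident, 4))  # 0.1643
print(confident < unsure)   # True
```

The loss falls as the predicted probabilities move toward the true labels, which is what the optimizer exploits during training.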

### Model Training
Now that we have created our model and defined how it will be evaluated and how it will minimize the loss, we will train it.

To keep track of how our model is doing, we want a history of each epoch so we can see how the model is improving and whether it is overfitting.

To train the model, we give it the x_train and y_train data we set aside earlier, set the number of epochs (full passes over the training data), set the batch size (the number of samples per gradient update), and, optionally, have it evaluate a validation set after each epoch.

Again, worth noting here that the number of epochs and the batch size are hyperparameters that will vary from model to model. In our case, we wished to maximize accuracy, so a greater number of epochs was desirable.
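
To see what these two settings mean in practice, the arithmetic below works out how many gradient updates the training run performs, using the 600 training samples from the split and the epoch and batch-size values passed to `fit` in the next cell.

```python
import math

n_train = 600      # training samples, from the split above
batch_size = 64
epochs = 20

# The final batch of each epoch is smaller when batch_size does not divide n_train
updates_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (updates_per_epoch - 1) * batch_size

print(updates_per_epoch)           # 10
print(last_batch)                  # 24
print(updates_per_epoch * epochs)  # 200
```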

In [7]:
history = model.fit(x_train,                            #Input the x_train data (features)
                       y_train,                         #Input the y_train data (labels)
                       epochs=20,                       #Set our model to go through 20 epochs
                       batch_size=64,                   #Set our batch size to 64 samples per gradient update
                       validation_data=(x_val, y_val))  #Also evaluate the model on the validation set, x_val, y_val, after each epoch

#To prevent overfitting, will plot the model accuracy against the validation accuracy and see where divergence occurs

#Will use the matplotlib.pyplot library to access the necessary plotting functions we will need to plot our output

import matplotlib.pyplot as plt

#Plot the model accuracy by calling history.history['acc'] (newer Keras versions name this key 'accuracy')
plt.plot(history.history['acc'], label = "Model Accuracy")

#Add the validation accuracy to our plot by calling history.history['val_acc']
plt.plot(history.history['val_acc'], label = "Validation Accuracy")

#Add a grid to the plot for neatness in analysis
plt.grid()

#Add a legend to our plot
plt.legend()

#Adding a label to the x-axis 
plt.xlabel("Epochs")

#Adding a label to the y-axis (Keras reports accuracy as a fraction between 0 and 1, not a percentage)
plt.ylabel("Accuracy")

#Adding a limit to the x-axis range, want to make sure we get the full 0-20 epoch range
plt.xlim(0,20)

Train on 600 samples, validate on 200 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


(0, 20)
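
Besides reading the divergence point off the plot, the epoch with the best validation accuracy can be picked programmatically. The sketch below uses a made-up dictionary shaped like `history.history` (the numbers are illustrative, not results from this notebook), and the hypothetical helper `pick` handles the key-name difference between older and newer Keras versions.

```python
# Made-up numbers shaped like history.history; not results from this notebook
demo_history = {
    "acc":     [0.60, 0.75, 0.85, 0.92, 0.97],
    "val_acc": [0.58, 0.70, 0.76, 0.75, 0.73],
}

def pick(history, *names):
    # Older Keras uses 'acc'/'val_acc'; newer versions use 'accuracy'/'val_accuracy'
    for name in names:
        if name in history:
            return history[name]
    raise KeyError(names)

val_acc = pick(demo_history, "val_acc", "val_accuracy")
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1  # epochs are 1-based
print(best_epoch)  # 3
```

In this toy example training accuracy keeps climbing after epoch 3 while validation accuracy drops, which is the overfitting pattern the plot is meant to reveal.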

### Model Testing

Now that we have trained the model while watching for overfitting, let's see how well it does on our test data set.

In [9]:
#This step is as simple as calling the model's evaluate method with our x_test and y_test data sets;
#it returns the loss followed by each compiled metric, so results[1] is the accuracy
results = model.evaluate(x_test, y_test)

#We've tested it, now let's print out how accurate our model is on the testing data set
print ("Accuracy:", results[1])

Accuracy: 0.76
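
For a binary classifier with a sigmoid output, the accuracy metric amounts to thresholding each predicted probability at 0.5 and counting matches with the true labels. The sketch below shows that calculation by hand on made-up labels and probabilities; it is an illustration of the idea, not the Keras code path.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    # Predict class 1 when the sigmoid output exceeds the threshold
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for y, yp in zip(y_true, y_pred) if y == yp)
    return correct / len(y_true)

# Made-up labels and probabilities, for illustration only
demo_acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.2])
print(demo_acc)  # 0.75
```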


ValueError: Error when checking input: expected dense_1_input to have shape (1763,) but got array with shape (1,)