# Language Warmup Full Model

## Data Cleaning

### Getting the Cleaning Tools Ready

To clean the dataset, we will use Pandas, NumPy, regular expressions (Python's **re** module), NLTK, and an autocorrect library. The custom solutions are explained further via inline comments. In this first cell we simply import the libraries and, for some of them, select the specific modules we need.

In [11]:
import pandas as pd
import numpy as np
import string
import re
import nltk

# 'spell' will be our spell checker and corrector for cleaning purposes
# (this import matches older autocorrect releases; newer releases provide a Speller class instead)
from autocorrect import spell

# A stopword for our purposes is a word that doesn't add a lot of insight to the sentiment of a sentence
nltk.download('stopwords')
nltk.download('wordnet')

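# A lemmatizer is a way for us to find the root of a word.
# Given the right part of speech, 'grows', 'grew', and 'grown' all evaluate to 'grow'.
# Note that lm.lemmatize(word) treats words as nouns unless a pos argument such as pos='v' is passed.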
lm = nltk.WordNetLemmatizer()

stopword = nltk.corpus.stopwords.words('english')

# We are removing the word 'not' to avoid a situation where for example 'not good' == 'good' 
stopword = [word for word in stopword if word != 'not']

[nltk_data] Downloading package stopwords to /Users/user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Reading in the Data

These two lines of code organize the dataset into labelled columns and rows, called a dataframe, which makes it easy to manipulate. We can create functions to manipulate specific columns, and for our purposes, we'd like to work with the reviews only. To do that, first we must read in the data. We can use Pandas' **pd.read_csv** as seen below to read a specific file, **Yelp.txt**. We just have to indicate that each column is separated by a tab, or **\t**, which we have done. **header** is set to None because the file has no header row, so we want its first line to be read as data. Finally, encoding has been set to **latin-1**. This is because of a decoding error that was thrown at us. Reading the file with another character set seemed to fix the problem and create no new ones, so we didn't look back.

Now that the dataset is read in, we can name our columns as they appear from left to right by assigning to the **.columns** attribute in Pandas.

In [12]:
yelpDataset = pd.read_csv('Yelp.txt', sep='\t', header=None, encoding='latin-1')
yelpDataset.columns = ['review', 'sentiment']
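
As a quick illustration of what **header=None** and **sep='\t'** do, here is a minimal sketch that reads a tiny in-memory sample instead of **Yelp.txt**. The two rows and the name `demo` are invented for this example; only the layout (review text, a tab, then a 0/1 label) is assumed to match the real file.

```python
import io
import pandas as pd

# Two invented rows in the assumed layout of Yelp.txt: review text, a tab, then a 0/1 label
sample = "Great tacos and friendly staff.\t1\nThe soup was cold.\t0\n"

# header=None tells pandas that the first line is data, not column names
demo = pd.read_csv(io.StringIO(sample), sep='\t', header=None)
print(demo.columns.tolist())  # default integer labels: [0, 1]

# Name the columns from left to right, as in the cell above
demo.columns = ['review', 'sentiment']
print(demo.shape)  # (2, 2)
```

Without **header=None**, the first review would be consumed as the column names and silently dropped from the data.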

### Creating the Cleaning Functions

With our data set up this way, cleaning is fairly simple. Each of the functions below takes a single review (or its list of tokens) and returns a cleaned version. The purpose of each function is explained with inline comments.

In [13]:
# This function takes each review and breaks it up into its individual words
def tokenize(text):
    # r'\W+' splits on runs of non-word characters such as whitespace, commas or periods
    tokens = re.split(r'\W+', text)
    return tokens

# This function removes every token from 'review' that is NOT purely alphabetical (numbers, leftover punctuation, empty strings etc)
def onlyAlpha(tokenizedList):
    text = [word for word in tokenizedList if word.isalpha()]
    return text

# This function removes any instance of a stopword in each review
def noStop(tokenizedList):
    text = [word for word in tokenizedList if word not in stopword]
    return text

# This function checks for and corrects spelling errors in each word of each review
def spellCheck(tokenizedList):
    text = [spell(word) for word in tokenizedList]
    return text

# This function uses the lemmatizer mentioned earlier on each word of each review
def lemmatize(tokenizedList):
    # ' '.join turns the lemmatized tokens back into a full sentence
    text = ' '.join([lm.lemmatize(word) for word in tokenizedList])
    return text
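
To see why **onlyAlpha** is a useful follow-up to **tokenize**, here is a small standalone sketch using only the **re** module. The sample sentence is invented, and the two functions are repeated in compact form so the snippet runs on its own.

```python
import re

def tokenize(text):
    return re.split(r'\W+', text)

def onlyAlpha(tokenizedList):
    return [word for word in tokenizedList if word.isalpha()]

tokens = tokenize("Great food, 10/10!")
print(tokens)             # ['Great', 'food', '10', '10', '']
print(onlyAlpha(tokens))  # ['Great', 'food']
```

Splitting leaves an empty string behind when a review ends in punctuation, and numbers survive as tokens; filtering with **isalpha** drops both.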




### Applying the Cleaning Functions

To apply each function, we decided to use lambda functions. They suit this step because each cleaning function is a short, single-purpose transformation that we run in a fixed sequence. On the left of each expression we name a new column (usually to reflect the change), and on the right we apply a function to each row of our specified column. This leaves us with a new column of data with the function applied.
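
The pattern described above might look like the following sketch. It is not the original cell: the sample rows, the short stopword set, the lowercasing step, and the intermediate column names are all assumptions made for illustration, and the spell-checking and lemmatizing steps are left out so the snippet runs without NLTK data. Only the final column name, **review_lemmatized**, is taken from the code that follows in this notebook.

```python
import re
import pandas as pd

# Invented stand-in for yelpDataset
demo = pd.DataFrame({'review': ['The food was not good!!', 'Loved the 2 tacos.'],
                     'sentiment': [0, 1]})

demo_stopwords = {'the', 'was'}  # hypothetical short list; 'not' is deliberately kept

def tokenize(text):
    return re.split(r'\W+', text)

def onlyAlpha(tokenizedList):
    return [word for word in tokenizedList if word.isalpha()]

def noStop(tokenizedList):
    return [word for word in tokenizedList if word not in demo_stopwords]

# Left side: the new column. Right side: a lambda applied to every row of an existing column.
# Lowercasing here is an assumption, made so that 'The' matches the lowercase stopword list.
demo['review_tokenized'] = demo['review'].apply(lambda x: tokenize(x.lower()))
demo['review_alpha'] = demo['review_tokenized'].apply(lambda x: onlyAlpha(x))
demo['review_nostop'] = demo['review_alpha'].apply(lambda x: noStop(x))

# Stand-in for the lemmatize step: only rejoins the tokens into a sentence
demo['review_lemmatized'] = demo['review_nostop'].apply(lambda x: ' '.join(x))

print(demo['review_lemmatized'].tolist())  # ['food not good', 'loved tacos']
```

Keeping each step in its own column makes it easy to inspect what every function changed before moving on to the next one.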

In [3]:
# Create the random junk data
# This was just a test for me to figure out how to work it with fake sets of randomly generated sentences
# The outputs weren't exactly as expected since the sentences generated didn't follow grammar rules
'''
import random
import numpy as np

with open("FeatureCreate\\words.txt") as f:
    words = f.readlines()
words = [x.strip() for x in words]

phrases = []
for i in range (0,100):
    phraselength = random.randint(5,15)
    phrase = []
    for j in range(0,phraselength):
        choice = random.randint(0,len(words) - 1) # randint includes both ends, so stop at the last valid index
        phrase.append(words[choice])
        sentence = ' '.join(phrase)
    phrases.append(sentence)

    
print(len(phrases))
# REMOVE THIS CELL
'''

Now that the data is clean, we must turn our 2-dimensional DataFrame into two separate lists, one of reviews and one of sentiments, before our Feature Engineering and Vectorization can proceed.

In [4]:
# Selecting a single column gives a 1-dimensional pandas Series,
# which we can turn into a list using Pandas' .tolist method
review = yelpDataset['review_lemmatized'].tolist()

# The same is done to our sentiment column
sentiment = yelpDataset['sentiment'].tolist()

'''
#Make fake y values (fake sentiments)

y_dat = [0] * 100
for i in range(0,50):
    y_dat[i] = 1
        
print(y_dat)
'''

## Feature Engineering and Vectorization

In [5]:
# Vectorize data using single words (1-grams); an optional 1- and 2-gram version is commented out below

# Importing a useful class that converts our data set of sentences into a large matrix of 0's and 1's.
# The matrix has each row representing a sentence and each column representing a word from the data set (no words are repeated).
# For each sentence, a 1 is placed in the columns of the words present in the sentence.
# If a word isn't in the sentence, a 0 is put in that column.

from sklearn.feature_extraction.text import CountVectorizer

# Setting binary=False would make the matrix count the frequency of a word in the sentence, instead of just marking its presence.
# For some reason this wasn't working unless lowercase=False; possibly some problem with the cleaned data.

vectorizer = CountVectorizer(binary=True, lowercase=False)

# This next line could be used instead if we wished to make the program more sophisticated, but larger/slower.
# Instead of only a column for every word, there would also be columns for every pair of words placed next to each other in the data set.
# The 2 can be raised to any larger integer to include longer word sequences, but this makes the matrix much larger.

#vectorizer = CountVectorizer(binary=True, lowercase=False, ngram_range=(1, 2))

# Simply calling fit_transform on our cleaned list of reviews, 'review'
# (the name 'phrases' only existed in the disabled junk-data test cell)

vector = vectorizer.fit_transform(review)
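
To make the matrix described in the comments above concrete, here is a minimal sketch on two invented, already-cleaned reviews. The names `docs`, `unigrams`, `bigrams`, `matrix`, and `wide` are made up for this example; the **CountVectorizer** settings mirror the ones used in the cell.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented, already-cleaned reviews used only for illustration
docs = ['food not good', 'good food good service']

# One column per distinct word
unigrams = CountVectorizer(binary=True, lowercase=False)
matrix = unigrams.fit_transform(docs).toarray()
print(sorted(unigrams.vocabulary_))  # ['food', 'good', 'not', 'service']
print(matrix.tolist())               # [[1, 1, 1, 0], [1, 1, 0, 1]]

# Adding 2-grams: one extra column per distinct pair of adjacent words
bigrams = CountVectorizer(binary=True, lowercase=False, ngram_range=(1, 2))
wide = bigrams.fit_transform(docs).toarray()
print(matrix.shape, wide.shape)      # (2, 4) (2, 9)
```

Note that 'good' occurs twice in the second review but is still marked with a single 1, because **binary=True** records presence rather than frequency. Even on this tiny example, adding 2-grams more than doubles the number of columns.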

In [6]:
# Change the sparse matrix to a dense numpy array just so it's easier to manipulate

data = vector.toarray()
print(type(data))

<class 'numpy.ndarray'>


In [7]:
# Split into train, test, and validate sets
# 'sentiment' is the list of labels built earlier, in the same row order as 'data'

# Train: first 300 and last 300 rows
x_train = np.concatenate([data[:300], data[-300:]])
y_train = np.concatenate([sentiment[:300], sentiment[-300:]])
# Validate: rows 300-399 and 600-699
x_val = np.concatenate([data[300:400], data[600:700]])
y_val = np.concatenate([sentiment[300:400], sentiment[600:700]])
# Test: rows 400-599
x_test = data[400:600]
y_test = np.asarray(sentiment[400:600])
print(x_train.shape)
print(x_val.shape)
print(x_test.shape)

(600, 1741)
(200, 1741)
(200, 1741)
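
One way to sanity-check a manual split like this is to apply the same slices to a plain array of row indices and confirm that no row lands in two sets. This is only a sketch: `n_rows = 1000` is an assumption inferred from the 600 + 200 + 200 rows printed above, and the `*_idx` names are invented for the example.

```python
import numpy as np

n_rows = 1000  # assumption: total number of reviews, inferred from the printed shapes
idx = np.arange(n_rows)

# The same slices as the split above, applied to row indices instead of data
train_idx = np.concatenate([idx[:300], idx[-300:]])
val_idx = np.concatenate([idx[300:400], idx[600:700]])
test_idx = idx[400:600]

all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
print(len(all_idx) == len(np.unique(all_idx)))      # True: every row is used exactly once
```

The check relies on the dataset having exactly `n_rows` rows; with a different row count, the negative slice would cover different rows and the sets could overlap or leave gaps.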


## Model Architecture

In [8]:
### Model Creation 

This is where we finally make our neural network to process the vectorized data we made above. This process consists of importing the libraries we need (Keras in our case), selecting the kind of model we wish to work with, creating the neural network with its number of nodes and layers, selecting the optimizer, loss, and metric functions, training, and then finally testing the model on our data.