<a href="https://colab.research.google.com/github/TheMaxnificent1/Sample_WebPage/blob/main/Copy_of_12_1_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing Data

So far you have created a simple system for making a call on if a review has a positive or negative overall rating. This system works as long as the review uses words that you have predetermined as positive or negative.

But what about reviews with more complicated words or phrases? In order to have a more nuanced and general understanding of text from the reviews, you are going to build a neural network that is trained using data from reviews. This network will build a more nuanced understanding of the review data.

 There are three main stages of creating this network: 

* Importing and cleaning up the data
* Vectorizing the data
* Creating and training the network


To start, you are going to import the data and do some basic clean-up on it. 

1. Download [this](https://drive.google.com/file/d/18UVOyFEZlPyTmFFkuueRIjV44RqBJhqX/view?usp=sharing) file and upload it to the runtime 
2. Run the cell below to import required libraries. 



In [3]:
import pandas as pd
import numpy as np

import string
from string import punctuation
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

import tensorflow as tf

from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Sequential 

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

import joblib

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


2. Run the cell below to read data into your program and see the first few rows. 

In [4]:
# Load in the data
data = pd.read_csv('Reviews.csv')

# Print the first few rows of data
data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


## Cleaning Up Data


Now that you have the data imported, it's time to extract only the parts you need to train the network.

1. Drop UserId, Id and Time from your data since you don't need it. 
2. Use data.head() to see your data. 

<details><summary>Click for Code</summary>

```
data = data.drop(['UserId', 'Id', 'Time' ], axis=1)

data.dropna(inplace=True)

data.head()
```
</details>

In [5]:
# Drop unneeded data
data = data.drop(['UserId', 'Id', 'Time' ], axis=1)
data.dropna(inplace=True)

# Add data.head() here:
data.head()

Unnamed: 0,ProductId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,Text
0,B001E4KFG0,delmartian,1,1,5,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,B00813GRG4,dll pa,0,0,1,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,B000LQOCH0,"Natalia Corres ""Natalia Corres""",1,1,4,"""Delight"" says it all",This is a confection that has been around a fe...
3,B000UA0QIQ,Karl,3,3,2,Cough Medicine,If you are looking for the secret ingredient i...
4,B006K2ZZ7K,"Michael D. Bigham ""M. Wassir""",0,0,5,Great taffy,Great taffy at a great price. There was a wid...


##Adding a Polarity Column


Now it's time to add a new column that keeps a label of the rating as either positive, negative, or neutral. In order to do this, you will use the star rating of the reviews. If the reviewer gave the item 4 or 5 stars (more than 3) then it's a positive review, if the reviewer gave the item 1 or 2 stars (less than 3) then it's a negative review, and if it's 3 stars, then it is neutral.


1. Look at the code to see how a function can get applied to the data. 
2. Run the cell to apply the lamda function. 

In [6]:
# Add a column for Positive, Negative, and Neutral reviews.
data['Polarity_Rating'] = data['Score'].apply(lambda x: 'Positive' if x > 3 else('Neutral' if x==3 else 'Negative'))

## Sampling the Data


Now it's time to separate the data into separate groups depending on if it's positive negative or neutral. You will use these lists to make sure that there is an equal amount of each kind of review. 

1. Look at the cell below to see how to separate the positive data. 
2. Repeat the code for the negative and nuetral datasets. 

In [7]:
data_positive = data[data['Polarity_Rating'] == 'Positive']
# Add code to make negative and neutral datasets
data_negative = data[data['Polarity_Rating'] == 'Negative']
data_neutral = data[data['Polarity_Rating'] == 'Neutral']

2. Print out the shape of each list

In [8]:
# Add print statements here:
print(data_positive.shape)
print(data_negative.shape)
print(data_neutral.shape)



(443766, 8)
(82007, 8)
(42638, 8)


3. Use the `sample` method to get a sample that is a size that makes sensse for these datasets. Each dataset should be the same size. 

<details><summary>Click for code</summary>

```

data_positive = data_positive.sample(8000)

data_negative = data_negative.sample(8000)

data_neutral = data_neutral.sample(8000)



In [9]:
# Use the sample method here: 
data_positive = data_positive.sample(8000)

data_negative = data_negative.sample(8000)

data_neutral = data_neutral.sample(8000)

4. Print out the shape of the datasets againn to make sure they are the same size. 

In [10]:
# Add print statements here
print(data_positive.shape)
print(data_negative.shape)
print(data_neutral.shape)

 

(8000, 8)
(8000, 8)
(8000, 8)


5. Add a print statement to prinnt the shape of your new data 
6. Run the cell to combine the lists together to create one large dataset 



In [11]:
data = pd.concat([data_positive, data_negative, data_neutral])
# Add your print statement
print(data)

         ProductId              ProfileName  HelpfulnessNumerator  \
320796  B001EQ56VC                  kradkin                     0   
349745  B001HTIYWY      B. Lima "galaxypie"                     0   
127159  B000HDL1P8             Jane Shepard                     0   
213938  B000EPMP40                      JCH                     0   
291861  B001EO5Y8Y                  Betsy C                     0   
...            ...                      ...                   ...   
510290  B001ULOTKU  K. Eickhoff "Amazonfan"                     3   
82418   B0007IQQXA         C. F. Hill "CFH"                     0   
136031  B005SPQENY                    Lemon                     1   
291083  B005HG9ESG        Lisa Stubblefield                     0   
215506  B0014X5O1C         Dustin G. Rhodes                     0   

        HelpfulnessDenominator  Score  \
320796                       1      4   
349745                       0      5   
127159                       0      5   
213938 

## Cleaning up the Text

Now that you have one large list of the data, it's time to clean the actual text in it. You will create a function that removes punctuation and stop words from the text. The string library has a built-in list of stop words. Stop words are words that don't contain important information and are extremely common in the English language. These are words like: "is", "are", "the". 


1. Run the cell to add a function to remove stop words and punctuation from the text



In [12]:
def text_cleanup(text):
  stopwrds = stopwords.words('english')
  no_punc = [char for char in text if char not in string.punctuation]
  no_punc = ''.join(no_punc)
  return ' '.join([word for word in no_punc.split() if word.lower not in stopwrds])


2. Create a new column called reviews that applies the text cleanup function

<details><summary>Click for code</summary>

```
data['reviews'] = data['Text'].apply(text_cleanup)
```
</details>

In [13]:
data['reviews'] = data['Text'].apply(text_cleanup)
# Add a data.head() to see your data
data.head()

Unnamed: 0,ProductId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,Text,Polarity_Rating,reviews
320796,B001EQ56VC,kradkin,0,1,4,Good base but add more seasons,We make a lot of taco or taco-like meals and t...,Positive,We make a lot of taco or tacolike meals and th...
349745,B001HTIYWY,"B. Lima ""galaxypie""",0,0,5,A Most Delicious Granola,I love this cereal. It is by far the best gran...,Positive,I love this cereal It is by far the best grano...
127159,B000HDL1P8,Jane Shepard,0,0,5,Easy & delicious,"Low in calories, makes a large quantity, very ...",Positive,Low in calories makes a large quantity very ta...
213938,B000EPMP40,JCH,0,0,5,BEST GLUTEN FREE COOKIE EVER!,I received a basket of gluten free items for C...,Positive,I received a basket of gluten free items for C...
291861,B001EO5Y8Y,Betsy C,0,0,5,Loved it!,"I like really strong coffee, so when I got my ...",Positive,I like really strong coffee so when I got my K...


Now you can remmake your dataset using only the information you want.

3. Look at the code to re-create your dataset. 
4. Add a `data.head()` statement to see your new dataset.

In [14]:
data = data[["reviews", "Polarity_Rating"]]
# Add data.head() here 
data.head()

Unnamed: 0,reviews,Polarity_Rating
320796,We make a lot of taco or tacolike meals and th...,Positive
349745,I love this cereal It is by far the best grano...,Positive
127159,Low in calories makes a large quantity very ta...,Positive
213938,I received a basket of gluten free items for C...,Positive
291861,I like really strong coffee so when I got my K...,Positive


##One Hot Encoding

One-hot encoding is a process of creating data categories that the network can understand, and that won't have an inherent bias. In this case, you will create a matrix of three columns (for positive, negative, and neutral) and put either a 1 or 0 in the row for the review text. This way the network can be trained using the matrix to validate it's guesses.

1. Look through the three lines of code to see how this is done. 
2. Add a `data.head()` statement to see your new data.

In [15]:
one_hot = pd.get_dummies(data["Polarity_Rating"])

data = pd.concat([data, one_hot], axis=1)
data.drop(['Polarity_Rating'], axis=1, inplace=True)

# Add a data.head to see your data.
data.head()

Unnamed: 0,reviews,Negative,Neutral,Positive
320796,We make a lot of taco or tacolike meals and th...,0,0,1
349745,I love this cereal It is by far the best grano...,0,0,1
127159,Low in calories makes a large quantity very ta...,0,0,1
213938,I received a basket of gluten free items for C...,0,0,1
291861,I like really strong coffee so when I got my K...,0,0,1


## Train and Test Split


Now that you have all the data cleaned up and formatted in a way that makes sense to the network, it's time to split it into train and test sets for the network! In order to split the data into these groups, you are going to use a library called sklearn. This should already be imported, and all you will need to do is call its train_test_split function to get a train and test data set.

To feed data into this network you need an input set and an output set. The input will be the reviews, and the polarity will be the output. First you will create these datasets, and then you will split them into train and test sets. 

1. Set x_rev to the reviews column 
2. Set y_pol to the polarity matrix by dropping the reviews column
3. Run the cell to use train_test_split to make your training and testing datasets. 

<details><summary>Click for code</summary>

```
x_rev = data["reviews"].values

y_pol = data.drop("reviews", axis=1)
```
</details>

In [16]:
x_rev = data["reviews"].values
y_pol = data.drop("reviews", axis=1)

x_rev_train, x_rev_test, y_pol_train, y_pol_test = train_test_split(x_rev, y_pol, test_size=0.30, shuffle=True)

### Recap

You can use a few Python libraries to clean up and organize data from a spreadsheet. First, you separated the data into three groups and sampled from each of them to ensure that you have an equal amount of each type of review. Then, you removed stopwords and create a matrix for the polarity rating. Finally, you split the data into the appropriate groups to feed into a network to train it. 


* Learn about text vectorizing in the next section. 



#Vectorizing Text

So far you have cleaned up all the text data from the reviews and separated them into input and output sets. Now that you have only the important words of the review, and a matrix to represent their polarity, it's time to turn that text into numbers that the network can understand. You will do this using a process called Vectorizing.

**Vectorizing** is the process of converting text data into numerical data so that a neural network can perform calculations with it.  

In order to do this, you are going to use sklearn. This library has a lot of built-in libraries for reworking data into its numerical form. This process has two phases: fit and transform. 

* The **fit** stage creates a dictionary of counts of each word from a dataset. This dictionary is called a vocabulary. 

* The **transform** stage is all about taking a data set and applying it to the vocabulary created during the fit stage. 

Luckily, the **sklearn** library has built-in functions that will do most of the work for you during this stage. You are going to use a vectorizer called **CountVectorizor** that uses the count of each word to do the vectorization process. Follow these steps to create a vocabulary from the review data, and then transform the input datasets to that vocabulary. 

## Fit Stage


First, you are going to create a vocabulary using the fit function. You will use all of the review data from the dataset and create a dictionary of the frequency of each word. Then, you will export that data and save it.

1. Run the cell to create a  vectorizer object

In [17]:
vect = CountVectorizer()

Since we are somewhat limited in the Google Colab notebook about how much space and power the computer has, you are going to set a maximum amount of features for the vectorizer.

2. Set the max_features to 15000

In [18]:
vect.max_features = 15000

Call the fit function to create the vocab. You are going to use all the review data so that the input is a consistent size. 

3. Put x_rev in the parentheses below

In [19]:
vect.fit(x_rev)

CountVectorizer(max_features=15000)

4. Add a print statement to view vocab
5. Run the cell to save the vocab as a variable and see the vocab


In [20]:
vocab = vect.vocabulary_



6. Run this cell to export the vocab

In [21]:
joblib.dump(vocab, "vocab.pkl")

['vocab.pkl']

7. In the file sidebar, download your vocab so that you can use it again later. 
![image](https://i.imgur.com/2ybphhM.png)

##Transform 
The transform is the stage that actually applies the data you are going to use to the vocabulary created by the fit stage. Follow these steps to transform the data into a form that the network can understand.

1. Look at the code below to see how to create a new datasest called x_rev_train_v for the vectorized data that has been transformed. 
2. Repeaat that line wwith the test data.

In [22]:
# Transform the training data
x_rev_train_v = vect.transform(x_rev_train)

# Transform the testing data
vocab_v=vect.transform(vocab)

3. Look at the code below to see how the training dataset gets converted to an array. 
4. Repeat that code with the test data. 

In [23]:
x_rev_train_v = x_rev_train_v.toarray()

# Convert the test data to an array
vocab_v=vocab_v.toarray()

Print out the shape of these datasets to verify that they have been vectorized. They may have different amounts of reviews, but they are both transformed onto the vocabulary, so the second argument should be 15000.

5. Add a print statement for the shape of each dataset.

In [24]:
# Add print statements 
print(vocab_v.shape)
print(x_rev_train_v.shape)

(15000, 15000)
(16800, 15000)


Data needs to be vectorized in order to be run through the network. You used a python library to do this in two stages: fit and transform. You also exported the vocabulary to be used later, and uploaded it to Drive.

* Experiment with the vectorization process by changing the max features and seeing how that changes the shape of the datasets.



# Making a Network 
So far you have cleaned up and prepared the data and vectorized it. Now, it's officially time to create the network to understand the text. Now that we have text as number input and three categories of output, this is essentially just a classification problem. Thus, you can create a network that does classification.

**Classification** is a kind of machine learning problem where a network learns what category different data belongs in.

In order to create the model, you will use a sequential model. You will create an input layer, several calculation layers, dropout layers, and an output layer. 

1. Create a sequential model

In [25]:
model = Sequential()

###Input Layer


The input layer is the first layer of the network. It takes all of the text input and starts to perform calculations with it. This layer needs to be pretty large so that the network can start to get an idea of what it's dealing with. This one is set to 4000 neurons, but if you want to go larger you can!

Also, for this layer, you will be deciding on an activation algorithm. Relu is one that works well for these purposes. 

1. Add a Dense layer wwith `units=15000` and `activation="relu`

<details><summary>Click for Code</summary>

```
model.add(Dense(units=15000, activation="relu"))
```
</details>


In [26]:
# Add an input layer 
model.add(Dense(units=15000, activation="relu"))

###Dropout Layers


Another important aspect of this model is the dropout layers. These help prevent the network from relying too much on specific neurons and force all the neurons to do some work. This is especially important for large networks like this one. This one is set to drop out at a rate of 0.5 which means half of the neurons. 

1. Add a dropout layer with a rate of 0.5

<details><summary>Click for Code</summary>

```
model.add(Dropout(0.5))

```
</details>

In [27]:
# Add a dropout layer 
model.add(Dropout(0.5))

### Calculation Layers.


Once you have the input, it's time to start actually doing calculations with it. This part of the network will do the bulk of the calculation. It's important to include dropout layers in this part of the network as well so that the network continues to be versatile. 

1. Add as many calculation layers to this network as you please. 
2. In between each layer, add a dropout layer. 

<details><summary>Click for an example network</summary>

```
model.add(Dense(units=2000, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=500, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=250, activation="relu"))
model.add(Dropout(0.5))

```
</details>

In [28]:
# Add your calculation layers
model.add(Dropout(0.5))
model.add(Dense(units=500, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=250, activation="relu"))
model.add(Dropout(0.5))

### Output Layer


Finally, you will add a layer that outputs the network's decision. The number of neurons for this layer is going to be the number of possible categories. In this case, there are three possible outputs: positive, negative, neutral. The activation algorithm will be softmax, an algorithm that creates a probability that each review belongs in each category.

1. Add an output layer with 3 neurons: one for each outcome, and `activation="softmax"`

In [29]:
# Add your output layer
model.add(Dense(3, activation='softmax'))

### Compiling the Network


Now that you have all the layers set up, it's time to finally compile the network. First, you will create a variable to hold the optimizer and then you will compile the network.

The optimizer is going to be a general-purpose algorithm called adam that will work well for this purpose. The loss algorithm is categorical crossentropy since this problem is bout sorting data into categories, and the metric to watch is accuracy. 

1. set the learning rate to `0.001` in the parentheses.
2. Run the cell to compile the model. 


In [30]:
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
# Compile the network 
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

### Fit Data to the Network


Next, you will train the network using the data that you have prepared. During this stage, you will also decide on batch size and the number of epochs. Finally, you will set the validation data to the test datasets you created.

1. Set the batch_size to 256
2. Set the epochs to 10 
3. Run the cell below to stsart training the network. 

In [31]:
x = x_rev_train_v
y = y_pol_train
model.fit(
    x = x_rev_train_v,
    y = y_pol_train,
    batch_size = 256,
    epochs = 10,
    validation_data = (x,y)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f4e72decc50>

###Evaluating the Network


Once the network is done training, you will get the scores from the model, and print out the accuracy. This will give you a good idea of how the network performs on the test data. This network should give an accuracy of about 0.70, which is pretty good. 

1. Add a print statement to print out `scores[1]`
2. Run the cell to get and print the scores.

In [32]:
scores = model.evaluate(x, y, verbose=0)
# Print scores[1] to see the test accuracy
print(scores[1])

0.996666669845581


###Save the Model


Make sure you add the code to export the model. This way you won't have to wait for it to train in the future. 

1. Run this cell to save your model. 

2. Right-click on the model in the File window and Download it. 

In [None]:
model.save('sentiments.h5')

This network is very large and contains a lot of information. It will take quite a while to train, so go ahead and let it run as long as it needs. Just be sure to leave the tab open while it makes its calculations. 

###Recap

Creating a network that can classify text based on sentiment required a sequential network with a bunch of layers of neurons. You created, trained, and exported this model. 