# 1. Introduction

## Problem Statement: 

In this project, sentiment analysis was performed with natural language processing on labelled online reviews from Amazon, Yelp and IMDb. Each review was preprocessed with the spaCy package for Python: it was tokenized, lemmatized and filtered for stop words so that it could be vectorized and passed to a machine learning model.
The data was split into training and testing datasets, and a pipeline was created that vectorizes the preprocessed text with either a count vectorizer or a TF-IDF vectorizer and then trains and evaluates a machine learning model (a support vector machine).
**Conclusion:** the resulting model was about 79.4% accurate.

## What is Natural Language Processing?

Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Human language is astoundingly complex and diverse. We express ourselves in infinite ways, both verbally and in writing. Not only are there hundreds of languages and dialects, but within each language is a unique set of grammar and syntax rules, terms and slang. When we write, we often misspell or abbreviate words, or omit punctuation. When we speak, we have regional accents, and we mumble, stutter and borrow terms from other languages. 

While supervised and unsupervised learning, and specifically deep learning, are now widely used for modeling human language, there’s also a need for syntactic and semantic understanding and domain expertise that are not necessarily present in these machine learning approaches. NLP is important because it helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics. 

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Natural language processing includes many different techniques for interpreting human language, ranging from statistical and machine learning methods to rules-based and algorithmic approaches. We need a broad array of approaches because the text- and voice-based data varies widely, as do the practical applications. 

Basic NLP tasks include tokenization and parsing, lemmatization/stemming, part-of-speech tagging, language detection and identification of semantic relationships. If you ever diagramed sentences in grade school, you’ve done these tasks manually before. 

In general terms, NLP tasks break down language into shorter, elemental pieces, try to understand relationships between the pieces and explore how the pieces work together to create meaning.

These underlying tasks are often used in higher-level NLP capabilities, such as:

**Content categorization** A linguistic-based document summary, including search and indexing, content alerts and duplication detection.

**Topic discovery and modeling** Accurately capture the meaning and themes in text collections, and apply advanced analytics to text, like optimization and forecasting.

**Contextual extraction** Automatically pull structured information from text-based sources.

**Sentiment analysis** Identifying the mood or subjective opinions within large amounts of text, including average sentiment and opinion mining. 

**Speech-to-text and text-to-speech conversion** Transforming voice commands into written text, and vice versa. 

**Document summarization** Automatically generating synopses of large bodies of text.

**Machine translation** Automatic translation of text or speech from one language to another.


## spaCy for NLP

spaCy is written in Cython, a C extension of Python designed to give C-like performance to Python programs, which makes it a fast library. It provides a concise API for accessing its methods and properties, which are backed by trained machine learning (including deep learning) models.

Using spaCy starts with creating a pipeline, which is done by loading a model. The package provides several types of models, which contain language information such as vocabularies, trained vectors, syntax and entities.

These pipelines output a wide range of document properties, such as tokens, each token’s reference index, part-of-speech tags, entities, vectors, sentiment and vocabulary. Let’s explore some of these properties.

**Tokenization:** Every spaCy document is tokenized into sentences and further into tokens which can be accessed by iterating the document.

**Part of Speech Tagging:** Part-of-speech tags are properties of a word that are defined by how the word is used in a grammatically correct sentence. These tags can be used as text features in information filtering, statistical models, and rule-based parsing.

**Entity Detection** spaCy includes a fast entity recognition model that can identify entity phrases in a document. Entities can be of different types, such as person, location, organization, dates and numerals. These entities can be accessed through the “.ents” property.

**Dependency Parsing** One of the most powerful features of spaCy is its fast and accurate syntactic dependency parser, which can be accessed through a lightweight API. The parser can also be used for sentence boundary detection and phrase chunking. The relations can be accessed through properties such as “.children”, “.root” and “.ancestors”.

**Noun Phrases** Dependency trees can also be used to generate noun phrases.

**Word to Vectors Integration** spaCy also provides built-in integration of dense, real-valued vectors representing distributional similarity information. These are GloVe vectors. GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
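
As a minimal sketch of token access, the snippet below uses a blank English pipeline (`spacy.blank("en")`). This needs no downloaded model, but it therefore provides only the tokenizer and lexical flags such as `is_stop` and `is_punct`; part-of-speech tags, entities and dependencies would require a trained model such as `en_core_web_sm`. The sample sentence and variable names are illustrative assumptions.

```python
import spacy

# Blank pipeline: tokenizer only, no tagger, parser or entity recognizer.
nlp = spacy.blank("en")
doc = nlp("Great for the jawbone.")

# Iterating a Doc yields Token objects.
tokens = [token.text for token in doc]

# Lexical flags are available without a trained model.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]

print(tokens)
print(content_words)
```

The same two flags are what a stop-word and punctuation filter in a preprocessing step would rely on.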

## Dataset

The data set contains about 1,000 online reviews each from Amazon, Yelp and IMDb. For each site, about 500 reviews are labelled positive and about 500 are labelled negative. The data for each site is provided as a text file, which needs to be loaded into a dataframe.

In [1]:
import spacy
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns


# 2. Importing and Exploring Data 

The data for each site is stored in a separate txt file, so each file was imported separately and the resulting tables were then combined, with the site name as the key.

## Importing and assembling data

In [8]:
data_yelp = pd.read_table('yelp_labelled.txt')
data_amazon = pd.read_table('amazon_cells_labelled.txt')
data_imdb = pd.read_table('imdb_labelled.txt')

# Collecting the tables in a list so they can be processed and combined together
combined_col= [data_amazon,data_imdb,data_yelp]

# To observe how the data in each individual dataset is structured
print(data_amazon.columns)


Index(['So there is no way for me to plug it in here in the US unless I go by a converter.', '0'], dtype='object')


**NOTE** The output above shows that the files have no header row, so the first review and its label were read as the column names. Each file has two columns: the review text, followed by a label that is 0 for a negative review and 1 for a positive review.
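
One way to avoid the header problem at load time, sketched here with an in-memory stand-in for one of the files, is to pass `header=None` together with explicit `names` to `pd.read_csv`. The two sample rows and the variable names `raw`, `lost_first_row` and `kept_all_rows` are illustrative assumptions. Without `header=None`, pandas treats the first review as the header row, so that review is missing from the data.

```python
import io
import pandas as pd

# Illustrative stand-in for a tab-separated file with no header row.
raw = "Great for the jawbone.\t1\nNot enough volume.\t0\n"

# Default behaviour: the first line becomes the column header and is lost as data.
lost_first_row = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every line as data; names= supplies the column labels.
kept_all_rows = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["Review", "Label"]
)

print(len(lost_first_row), len(kept_all_rows))
print(list(kept_all_rows.columns))
```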

In [15]:
# Add headers to the columns of each dataset

for df in combined_col:
    df.columns = ["Review","Label"]
for df in combined_col:
    print(df.columns)
    


Index(['Review', 'Label'], dtype='object')
Index(['Review', 'Label'], dtype='object')
Index(['Review', 'Label'], dtype='object')


In [16]:
# To record which dataset each review came from, the company name is added as a key (the outer index level)

company = [ "Amazon", "imdb", "yelp"]

comb_data = pd.concat(combined_col,keys = company)

In [17]:
# Exploring the structure of the new data frame

print(comb_data.shape)

comb_data.head()

(2745, 2)


| | | Review | Label |
|---|---|---|---|
| Amazon | 0 | Good case, Excellent value. | 1 |
| Amazon | 1 | Great for the jawbone. | 1 |
| Amazon | 2 | Tied to charger for conversations lasting more... | 0 |
| Amazon | 3 | The mic is great. | 1 |
| Amazon | 4 | I have to jiggle the plug to get it to line up... | 0 |
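
The two leading columns in the preview (the site name and the row number) are index levels: `pd.concat(..., keys=...)` stores the keys as the outer level of a MultiIndex rather than as a regular column. If an ordinary `Company` column is wanted, one option is to name the index levels and move the outer one into the frame. The sketch below uses two tiny illustrative frames; the level name `row` and the variable names are assumptions made here.

```python
import pandas as pd

# Tiny illustrative frames standing in for the per-site datasets.
amazon = pd.DataFrame({"Review": ["The mic is great."], "Label": [1]})
yelp = pd.DataFrame({"Review": ["Very poor service."], "Label": [0]})

# keys= labels each block in the outer level of a MultiIndex.
combined = pd.concat([amazon, yelp], keys=["Amazon", "yelp"], names=["Company", "row"])

# Select one site's rows through the outer index level.
amazon_only = combined.loc["Amazon"]

# Turn the outer level into an ordinary column and drop the per-file row numbers.
flat = combined.reset_index(level="Company").reset_index(drop=True)

print(flat)
```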


In [19]:
comb_data.to_csv("Sentiment_Analysis_Dataset")

print(comb_data.columns)

print(comb_data.isnull().sum())

Index(['Review', 'Label'], dtype='object')
Review    0
Label     0
dtype: int64


## Preprocessing the data using spaCy and training the machine learning model using sklearn

In this stage, the spaCy package for Python is used to lemmatize the reviews and remove stop words from the assembled dataset.

In [38]:
import spacy
import en_core_web_sm
from  spacy.lang.en.stop_words import STOP_WORDS
nlp = en_core_web_sm.load()
#nlp = spacy.load('en')

# To build a list of stop words for filtering
stopwords = list(STOP_WORDS)
print(stopwords)

['until', 'afterwards', 'although', 'therein', 'towards', 'amongst', 'hers', 'she', 'along', 'how', 'move', 'formerly', 'forty', 'namely', 'when', 'whereby', 'either', 'for', 'under', 'ca', 'done', 'her', 'own', 'former', 'put', 'well', 'twenty', 'were', 'nobody', 'some', 'beforehand', 'could', 'from', 'go', 'since', 'make', 'whether', 'part', 'that', 'call', 'thereby', 'my', 'many', 'next', 'anyone', 'nevertheless', 'them', 'bottom', 'was', 'do', 'noone', 'over', 'behind', 'if', 'more', 'their', 'two', 'using', 'themselves', 'herself', 'again', 'ever', 'thru', 'doing', 'the', 'full', 'fifteen', 'its', 'see', 'they', 'is', 'about', 'everything', 'his', 'himself', 'anyhow', 'less', 'wherever', 'upon', 'whole', 'against', 'yourselves', 'those', 'often', 'hundred', 'empty', 'five', 'itself', 'therefore', 'top', 'any', 'ours', 'after', 'whom', 'up', 'does', 'show', 'your', 'have', 'latter', 'take', 'down', 'while', 'indeed', 'amount', 'out', 'becoming', 'six', 'where', 'without', 'might', 

Thus, the stop words have been listed.

The data is split into training and test datasets before being fed into the machine learning pipeline. A class named 'predictors', which inherits from sklearn's TransformerMixin class, is used as the first step of the pipeline and performs the cleaning of the data. The second step of the pipeline vectorizes the cleaned data.

During vectorization, each review is tokenized, and the tokens are lemmatized and filtered for stop words and punctuation using the function 'my_tokenizer'; pronouns are kept in their lower-cased form instead of the '-PRON-' lemma. Both CountVectorizer and TfidfVectorizer were tried in turn to decide which works better.

The third step of the pipeline is the classifier. In this case, a linear support vector machine classifier was chosen. **Other methods could be explored in the future.**
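
To make the difference between the two vectorizers concrete, here is a dependency-free sketch of what each computes for a toy corpus. It is not scikit-learn's implementation: whitespace tokenization, the three sample reviews and the helper names are assumptions made here. The TF-IDF weighting follows the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row.

```python
import math
from collections import Counter

# Toy corpus (illustrative, already lower-cased and cleaned).
docs = ["great phone great battery", "poor battery", "great service"]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def count_vector(tokens):
    """Bag-of-words: raw term counts in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf_vector(tokens):
    """Counts reweighted by smoothed idf, then scaled to unit L2 length."""
    n_docs = len(tokenized)
    weights = []
    for term, count in zip(vocab, count_vector(tokens)):
        df = sum(term in doc for doc in tokenized)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights.append(count * idf)
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

count_rows = [count_vector(doc) for doc in tokenized]
tfidf_rows = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(count_rows[0])
print([round(w, 3) for w in tfidf_rows[0]])
```

In the first review, `phone` and `battery` each occur once, but `phone` appears in fewer reviews and so receives the larger TF-IDF weight; that reweighting is what TF-IDF adds over plain counts.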


In [69]:
import string
punctuations = string.punctuation
# Creating a Spacy Parser
from spacy.lang.en import English
parser = English()

In [52]:
def my_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return mytokens

In [53]:
# ML Packages
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

In [77]:
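# Custom transformer that cleans the raw text before vectorization
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Apply the basic cleaning to every review
        return [clean_text(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; y is optional so the step also works without labels
        return self
    def get_params(self, deep=True):
        return {}

# Basic function to clean the text: trim surrounding whitespace and lower-case
def clean_text(text):
    return text.strip().lower()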
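
Inheriting from `TransformerMixin` supplies `fit_transform`, which calls `fit` and then `transform`. The sketch below shows this with a hypothetical `LowercaseCleaner` class that is not part of this notebook. As assumptions made for the illustration, it also inherits from `BaseEstimator`, which provides `get_params`, and gives `y` a default of `None` so the transformer can be fitted without labels.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LowercaseCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer: strips and lower-cases each text."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self keeps the scikit-learn convention.
        return self

    def transform(self, X):
        return [text.strip().lower() for text in X]

# fit_transform is inherited from TransformerMixin.
cleaned = LowercaseCleaner().fit_transform(["  Great Product. ", "NOT my thing."])
print(cleaned)
```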

In [78]:
# Vectorization
vectorizer = CountVectorizer(tokenizer = my_tokenizer, ngram_range=(1,1)) 
classifier = LinearSVC()

In [79]:
# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = my_tokenizer)

In [80]:
# Splitting Data Set
from sklearn.model_selection import train_test_split

In [81]:
# Features and Labels
X = comb_data['Review']
ylabels = comb_data['Label']

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=42)






In [83]:
# Create the pipeline to clean, tokenize, vectorize, and classify using CountVectorizer
pipe_countvect = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])
# Fit our data
pipe_countvect.fit(X_train,y_train)
# Predicting with a test dataset
sample_prediction = pipe_countvect.predict(X_test)

# Prediction Results
# 1 = Positive review
# 0 = Negative review
for (sample,pred) in zip(X_test,sample_prediction):
    print(sample,"Prediction=>",pred)
    
# Accuracy on the held-out test set
print("Test accuracy: ",pipe_countvect.score(X_test,y_test))
# Accuracy on the training set, for comparison
print("Training accuracy: ",pipe_countvect.score(X_train,y_train))

Disappointment.. I hate anything that goes in my ear. Prediction=> 0
It is a true classic.   Prediction=> 0
Great product. Prediction=> 1
This is a great restaurant at the Mandalay Bay. Prediction=> 1
It finds my cell phone right away when I enter the car. Prediction=> 1
It is simple to use and I like it. Prediction=> 1
Of all the dishes, the salmon was the best, but all were great. Prediction=> 1
I love the Pho and the spring rolls oh so yummy you have to try. Prediction=> 1
Their Research and Development division obviously knows what they're doing. Prediction=> 0
Still it's quite interesting and entertaining to follow.   Prediction=> 1
Very poor service. Prediction=> 0
Oh yeah, and the storyline was pathetic too.   Prediction=> 0
Strike 2, who wants to be rushed. Prediction=> 0
Every element of this story was so over the top, excessively phony and contrived that it was painful to sit through.   Prediction=> 0
The battery works great! Prediction=> 1
I am so tired of clichés that is ju

He owns the film, just as Spacek owned Coal Miner's Daughter" and Quaid owned "Great Balls of Fire.   Prediction=> 1
A standout scene.   Prediction=> 0
Worst hour and a half of my life!Oh my gosh!   Prediction=> 0
Its not user friendly. Prediction=> 1
Also, it's a real treat to see Anthony Quinn playing Crazy Horse.   Prediction=> 1
This frog phone charm is adorable and very eye catching. Prediction=> 1
it was a drive to get there. Prediction=> 1
this is the worst sushi i have ever eat besides Costco's. Prediction=> 0
They have a plethora of salads and sandwiches, and everything I've tried gets my seal of approval. Prediction=> 0
The only good thing was that it fits comfortably on small ears. Prediction=> 1
Everything was perfect the night we were in. Prediction=> 1
The writer, Gorman Bechard, undoubtedly did his homework because all references are industry and character-age appropriate.   Prediction=> 1
The chicken I got was definitely reheated and was only ok, the wedges were cold an

the movie is littered with overt racial slurs towards the black cast members and in return the whites are depicted as morons and boobs.   Prediction=> 0
Disapointing Results. Prediction=> 0
Tried to go here for lunch and it was a madhouse. Prediction=> 0
* Both the Hot & Sour & the Egg Flower Soups were absolutely 5 Stars! Prediction=> 0
This may be the only bad film he ever made.   Prediction=> 0
The film deserves strong kudos for taking this stand, for having exceptional acting from its mostly lesser-known cast and for the super-intelligent script that doesn't insult the audience or take the easy way out when it comes to white racism.   Prediction=> 0
Don't make the same mistake I did. Prediction=> 0
Even in my BMW 3 series which is fairly quiet, I have trouble hearing what the other person is saying. Prediction=> 0
Virgin Wireless rocks and so does this cheap little phone! Prediction=> 0
In fact, I liked it better than Interview With a Vampire and I liked this Lestat (Stuart Townsen

You need two hands to operate the screen.This software interface is decade old and cannot compete with new software designs. Prediction=> 1
It was horrendous.   Prediction=> 0
We won't be returning. Prediction=> 0
I highly recommend these and encourage people to give them a try. Prediction=> 1
seems like a good quick place to grab a bite of some familiar pub food, but do yourself a favor and look elsewhere. Prediction=> 1
Jawbone Era is awesome too! Prediction=> 1
If you are looking for a movie with a terrific cast, some good music(including a Shirley Jones rendition of "The Way You Look Tonight"), and an uplifting ending, give this one a try.   Prediction=> 1
This place is like Chipotle, but BETTER. Prediction=> 0
Their monster chicken fried steak and eggs is my all time favorite. Prediction=> 1
I Was Hoping for More. Prediction=> 0
I came over from Verizon because cingulair has nicer cell phones.... the first thing I noticed was the really bad service. Prediction=> 1
Not my thing. Pr

The fries were not hot, and neither was my burger. Prediction=> 0
We would recommend these to others. Prediction=> 1
This product is great... it makes working a lot easier I can go to the copier while waiting on hold for something. Prediction=> 1
It was attached to a gas station, and that is rarely a good sign. Prediction=> 0
Pretty awesome place. Prediction=> 1
The flair bartenders are absolutely amazing! Prediction=> 1
Anyway, this FS restaurant has a wonderful breakfast/lunch. Prediction=> 1
Pros:-Good camera - very nice pictures , also has cool styles like black and white, and more. Prediction=> 1
The menu is always changing, food quality is going down & service is extremely slow. Prediction=> 0
Love This Phone. Prediction=> 1
You can not answer calls with the unit, never worked once! Prediction=> 0
NOBODY identifies with these characters because they're all cardboard cutouts and stereotypes (or predictably reverse-stereotypes).   Prediction=> 1
Though The Wind and the Lion is told

Still, it makes up for all of this with a super ending that depicts a great sea vessel being taken out by the mighty frost.   Prediction=> 1
The poor batter to meat ratio made the chicken tenders very unsatisfying. Prediction=> 0
Bacon is hella salty. Prediction=> 1
I have seen many movies starring Jaclyn Smith, but my god this was one of her best, though it came out 12 years ago.   Prediction=> 0
Must have been an off night at this place. Prediction=> 0
Not enough volume. Prediction=> 0
When I received my Pita it was huge it did have a lot of meat in it so thumbs up there. Prediction=> 0
Good value, works fine - power via USB, car, or wall outlet. Prediction=> 1
If she had not rolled the eyes we may have stayed... Not sure if we will go back and try it again. Prediction=> 0
And it was way to expensive. Prediction=> 1
I just don't know how this place managed to served the blandest food I have ever eaten when they are preparing Indian cuisine. Prediction=> 0
The selection was probably t
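
The TF-IDF vectorizer defined earlier is not fitted in the cells shown here. A comparable pipeline could be sketched as follows. The six toy reviews and the names `pipe_tfidf`, `train_texts` and `train_labels` are assumptions made for illustration, scikit-learn's default tokenizer stands in for the spaCy-based one, and lower-casing is left to the vectorizer. No accuracy figure should be read into such a tiny sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data (illustrative only; 1 = positive, 0 = negative).
train_texts = [
    "great product", "love this phone", "pretty awesome place",
    "very poor service", "worst hour of my life", "not my thing",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Vectorize with TF-IDF weights, then classify with a linear SVM.
pipe_tfidf = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])
pipe_tfidf.fit(train_texts, train_labels)

predictions = pipe_tfidf.predict(["awesome phone", "poor product"])
train_accuracy = pipe_tfidf.score(train_texts, train_labels)
print(predictions, train_accuracy)
```

Fitting a count-based and a TF-IDF pipeline on the same `X_train`/`y_train` split and comparing `score` on `X_test` would show which vectorizer works better for this data.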

In [61]:
# Another random review
pipe_countvect.predict(["This was a great movie"])

array([1], dtype=int64)

In [66]:

example = ["I do enjoy my job",
 "What a poor product!,I will have to get a new one",
 "I feel amazing!"]

In [68]:
pipe_countvect.predict(example)

array([1, 0, 1], dtype=int64)

# 3. Conclusion

As observed, the model was about 79.4% accurate.
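
For a classifier, the `score` method of a scikit-learn pipeline reports accuracy, the fraction of predictions that match the true labels. A dependency-free sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))   # 3 of 5 correct
print(accuracy(y_pred, y_pred))   # predictions against themselves always match
```

This is why accuracy has to be measured against held-out true labels such as `y_test`; comparing predictions with themselves carries no information.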


# Source:

1. https://en.wikipedia.org/wiki/Natural_language_processing
2. https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html
3. https://github.com/Jcharis/Natural-Language-Processing-Tutorials/blob/master/Text%20Classification%20With%20Machine%20Learning,SpaCy,Sklearn(Sentiment%20Analysis)/Text%20Classification%20&%20Sentiment%20Analysis%20with%20SpaCy,Sklearn.ipynb