## Sentiment Analysis

### What is it? 

* Sentiment Analysis (SA) is the process of identifying the opinion (sentiment) expressed about a given subject in written or spoken language.
* It is one of the subfields of Natural Language Processing (NLP), alongside Machine Translation, Question Answering, Language Modeling, etc., and it extracts opinions and their attributes from text or speech.
* Sentiment Analysis is commonly framed as a supervised learning (classification) problem, which is the approach taken in this exercise.


* SA usually consists of four tasks:
   * opinion identification: identifying the text that contains an opinion
   * feature extraction: identifying the aspects being commented on, such as a product's price or color
   * sentiment classification: deciding whether the polarity of the opinion is positive, negative, or neutral
   * visualization and summarization of results

### Basic Terminology

* Stop words -- Words that carry little information, such as 'the', 'is', 'a', etc.; they are usually removed during preprocessing.
* Tokenization -- Segmentation of text into words (a form of feature extraction)
* Lemmatization -- Assigning the base forms of words (the lemma of 'spoke' is 'speak' and the lemma of 'languages' is 'language')
* Stemming -- Reducing a word to its stem or root form, which, unlike a lemma, need not be a dictionary word (car, cars, car's, cars' --> car)
* Word Embedding -- Mapping words to vectors of numbers where words with similar meaning have a similar numerical representation.
* Text Classification -- Assigning categories to a document or parts of it
* N-grams -- Consideration of a group of consecutive words (phrases) rather than single words to extract meaning. Helps with better understanding of text, e.g. the bi-gram 'not happy' instead of the single token 'happy'.
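
As a rough illustration of tokenization, stop-word removal, and n-grams from the list above, here is a minimal pure-Python sketch. The tiny `STOP_WORDS` set and the helper names `tokenize`, `remove_stop_words`, and `ngrams` are made up for this example; NLTK's real English stop-word list is much longer.

```python
import re

# Tiny illustrative stop-word set (an assumption, not NLTK's list).
STOP_WORDS = {"the", "is", "a", "i", "am"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("I am not happy with the speaker")
print(tokens)
print(remove_stop_words(tokens))
print(ngrams(tokens, 2))  # the bi-grams include ('not', 'happy')
```

The bi-gram `('not', 'happy')` keeps the negation attached to the word it modifies, which single tokens cannot do.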

### Open Source NLP libraries:

* Many open source libraries are at our service when we want to implement NLP models:

  * NLTK
  * Spark NLP
  * spaCy

### Import libraries

In [8]:
'''
Natural Language Toolkit (nltk) is a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning.
The stopwords corpus used below must be downloaded once with nltk.download('stopwords').
'''
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

import numpy as np 
import pandas as pd 
import re

import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

import os   # for changing the working directory
os.chdir('/Users/axa4/Documents/Supervised Learning/Sentiment-Analysis-for-fun')  # replace with your own project directory
print(os.getcwd())  # confirm the current working directory


### Work with the provided toy dataset `data/toy_dataset.tsv` and load it into a Pandas dataframe

In [9]:
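# count the lines in the file, header row included (optional)
with open('./data/toy_dataset.tsv') as f:
    lines = [line.rstrip() for line in f]
print(len(lines))

# load into a dataframe; quoting=3 (csv.QUOTE_NONE) treats quote characters as ordinary text
dataset = pd.read_csv('./data/toy_dataset.tsv', delimiter='\t', quoting=3)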

27
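
Free-text reviews often contain stray double quotes, which is why the file is read with `quoting=3`. The sketch below shows the effect on a small inline TSV. The sample rows and the `label` column are made up for illustration; only the `verified_reviews` column name comes from this notebook.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV; the second review starts with an unmatched quote.
sample_tsv = (
    'verified_reviews\tlabel\n'
    'Love it, "best" speaker\t1\n'
    '"Stopped working after a week\t0\n'
)

# quoting=3 is the same constant as csv.QUOTE_NONE:
# quote characters are kept as ordinary text instead of delimiting fields.
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', quoting=csv.QUOTE_NONE)

print(sample.shape)
print(sample['verified_reviews'][1])
```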


### Preprocess data

In [24]:
# corpus will store your cleaned dataset after preprocessing.
corpus = []
# build the stemmer and the stop-word set once, outside the loop
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

for review in dataset['verified_reviews']:
    data = re.sub('[^a-zA-Z]', ' ', review)  # replace every non-letter with a space
    data = data.lower()
    data = data.split()
    # drop stop words, then stem the remaining words
    data = [stemmer.stem(word) for word in data if word not in stop_words]
    data = ' '.join(data)
    corpus.append(data)
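
To see what the loop above does to a single review, the same steps can be wrapped in a small function. This is a sketch under assumptions: `clean_review` and `DEMO_STOP_WORDS` are hypothetical names, and the three-word stop set stands in for `stopwords.words('english')` so that the example runs without downloading the NLTK corpus.

```python
import re

from nltk.stem.porter import PorterStemmer

# Small illustrative stop-word set (an assumption); the notebook uses stopwords.words('english').
DEMO_STOP_WORDS = {'the', 'is', 'a'}

def clean_review(text, stop_words, stemmer):
    """Apply the cleaning steps of the preprocessing loop to one review."""
    text = re.sub('[^a-zA-Z]', ' ', text)   # replace every non-letter with a space
    words = text.lower().split()            # lowercase and split on whitespace
    stems = [stemmer.stem(w) for w in words if w not in stop_words]
    return ' '.join(stems)

cleaned = clean_review('Loved the speakers, sound is amazing!!!', DEMO_STOP_WORDS, PorterStemmer())
print(cleaned)
```

Stems need not be dictionary words. Also keep in mind that NLTK's English stop-word list contains 'not', so negations are removed by this preprocessing.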

#### Representing Text as Numerical Vectors: Bag of Words
+ We first need to represent texts in a way that the learning algorithm can process.
+ To do so, we convert the corpus into a matrix of token counts: one row per review and one column per word in the vocabulary.
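
The idea of a matrix of token counts can be sketched by hand with the standard library. This is only an illustration of the bag-of-words representation, not how `CountVectorizer` is implemented, and the two reviews in `docs` are made up.

```python
from collections import Counter

# Two made-up, already-cleaned reviews (hypothetical data).
docs = ['love speaker love sound', 'speaker stop work']

# Vocabulary: every distinct token, sorted so that the columns have a fixed order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, each cell holding a count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['love', 'sound', 'speaker', 'stop', 'work']
for row in matrix:
    print(row)     # [2, 1, 1, 0, 0] then [0, 0, 1, 1, 1]
```

Word order is discarded: each review is described only by how often each vocabulary word occurs in it.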

#### TODO: Inspect the source code below to understand how we represent the text as a matrix for further processing

In [25]:
from sklearn.feature_extraction.text import CountVectorizer


# Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
vectorizer = CountVectorizer(max_features=1400)

# fit_transform() learns the vocabulary from the corpus and returns the document-term count matrix.
# Notice that the vectorizer by default stores everything in a sparse matrix; toarray() gives us the dense version.
X = vectorizer.fit_transform(corpus).toarray()

# the labels are taken from the fifth column of the dataframe (index 4)
y = dataset.iloc[:,4].values

#### Using `train_test_split()` from `sklearn.model_selection`, split our dataset into `train` and `test` sets. Set `test_size` to 0.2


In [26]:
from sklearn.model_selection import train_test_split

# X_test and y_test contain 20% of our data, which we reserve for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
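
To check how many rows end up in each set, the split can be tried on stand-in data. The sketch below uses 26 samples, which is an assumption about the size of the toy dataset; `X_demo` and `y_demo` are hypothetical arrays, not the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 26 samples with 3 made-up features each (not the toy dataset).
X_demo = np.arange(26 * 3).reshape(26, 3)
y_demo = np.array([0, 1] * 13)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# 0.2 * 26 = 5.2, and scikit-learn rounds the test size up to 6 rows.
print(X_tr.shape, X_te.shape)  # (20, 3) (6, 3)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.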

### Selecting a Classifier 

+ We will use RandomForestClassifier() from the scikit-learn library as our classifier.
+ Please note, there are a number of classifiers that you can use for Sentiment Analysis.

#### Import RandomForestClassifier() from the sklearn library and create an instance of it

In [33]:
from sklearn.ensemble import RandomForestClassifier

sentiment_classifier = RandomForestClassifier(n_estimators = 100, random_state = 0)

#### Fit the classifier and predict the test set results

In [34]:
# In scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T)
sentiment_classifier.fit(X_train, y_train)

# Here we are predicting the test set results
y_pred = sentiment_classifier.predict(X_test)
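
The `fit`/`predict` pattern can be tried in isolation on a tiny made-up count matrix. This is a sketch with hypothetical data (`X_toy`, `y_toy`, `X_new`, `y_new`); since `confusion_matrix` is already imported at the top of this notebook, the sketch also shows one possible way to compare predictions with true labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Made-up counts: column 0 stands for a "positive" word, column 1 for a "negative" word.
X_toy = np.array([[2, 0], [1, 0], [3, 0], [0, 2], [0, 1], [0, 3]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_toy, y_toy)                 # learn from labelled examples

X_new = np.array([[4, 0], [0, 4]])    # unseen rows with the same two columns
y_new = np.array([1, 0])              # their (made-up) true labels
pred = clf.predict(X_new)             # one predicted label per row

# Rows of the confusion matrix are true labels, columns are predicted labels.
cm = confusion_matrix(y_new, pred, labels=[0, 1])
print(pred.shape, cm.shape)
```

Correct predictions are counted on the diagonal of `cm`, and every entry off the diagonal is a misclassification.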

Continue to **exercise2.ipynb**