# D213 - Advanced Data Analytics
### NLM3 Task 2: Sentiment Analysis Using Neural Networks
#### Advanced Data Analytics — D213
#### PRFA — NLM3
> André Davis
> StudentID: 010630641
> MSDA
>
> Competencies
> 4030.7.1 : Constructing Neural Networks
> The graduate builds neural networks in the context of machine-learning modeling.
> 
> 4030.7.3 : Natural Language Processing
> The graduate extracts insights from text data using effective and appropriate natural language processing (NLP) models.

##### Table of Contents

 <ul>
    <li><a href="#documentation">Documentation</a></li>
    <li><a href="#research-question">A1: Research Question</a></li>
    <li><a href="#objectives">A2: Objectives and Goals of Analysis</a></li>
    <li><a href="#neural-networks-identification">A3: Prescribed Neural Network Identification</a></li>
    <li><a href="#data-exploration">B1: Exploratory Data Analysis</a></li>
    <li><a href="#tokenization-process">B2: Tokenization</a></li>
    <li><a href="#padding-process">B3: Padding Process</a></li> 
    <li><a href="#categories-of-sentiment">B4: Categories Of Sentiment</a></li>
    <li><a href="#data-preparation">B5: Steps To Prepare the Data</a></li>
    <li><a href="#copy-of-prepared-data">B6: Prepared Dataset</a></li>
    <li><a href="#tensorflow-model-summary">C1: Model Summary</a></li>
    <li><a href="#network-architecture">C2: Network Architecture</a></li>
    <li><a href="#hyperparameters">C3: Hyperparameters</a></li>
    <li><a href="#stopping-criteria">D1: Stopping Criteria</a></li>
    <li><a href="#fitness">D2: Fitness</a></li>
    <li><a href="#training-process">D3: Training Process</a></li>
    <li><a href="#predictive-accuracy">D4: Predictive Accuracy</a></li>
    <li><a href="#source-code">E: Code</a></li> 
    <li><a href="#functionality">F: Functionality</a></li>
    <li><a href="#recommendations">G: Recommendations</a></li>
    <li><a href="#reporting">H: Reporting</a></li>
    <li><a href="#code-references">I: Sources for Third Party Code</a></li>
    <li><a href="#source-references">J: Source References</a></li>    
  </ul>

<a id="documentation"></a>
# Documentation

 * [TensorFlow](https://www.tensorflow.org/)
 * [Keras](https://keras.io/)
     * [Dot Products](https://www.khanacademy.org/math/multivariable-calculus/thinking-about-multivariable-function/x786f2022:vectors-and-matrices/a/dot-products-mvc)    

<a id="research-question"></a>
# A1: Research Question

Can the sentiment polarity (positive or negative) of an IMDb movie review be predicted with reasonable reliability from the text of the review alone?

<a id="objectives"></a>
# A2: Objectives and Goals of Analysis

The main goal of this analysis is to build a neural network model that classifies an IMDb movie review as positive or negative, with reasonable accuracy, from its text alone. A secondary goal is to compare different network architectures and hyperparameter settings to find the combination that performs best on this dataset.

<a id="neural-networks-identification"></a>
# A3: Prescribed Neural Network Identification

<a id="data-exploration"></a>
# B1: Exploratory Data Analysis

In [1]:
!pip install numpy
!pip install pandas
!pip install scikit-learn
!pip install tensorflow



In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf

'''
File format as presented in the readme.txt:

=======
Format:
=======
sentence \t score \n


=======
Details:
=======
Score is either 1 (for positive) or 0 (for negative)
'''
# The file has no header row, so column names are supplied explicitly.
imdb_columns = ['review', 'sentiment_score']
# sep is a regular expression (one or more tabs), which requires the Python parsing engine.
imdb_reviews = pd.read_csv('./imdb_labelled.txt', engine='python', sep=r'\t+', header=None, names=imdb_columns)
# info() prints its summary directly and returns None, so it is not wrapped in print().
imdb_reviews.info()
print(imdb_reviews.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   review           1000 non-null   object
 1   sentiment_score  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
                                              review  sentiment_score
0  A very, very, very slow-moving, aimless movie ...                0
1  Not sure who was more lost - the flat characte...                0
2  Attempting artiness with black & white and cle...                0
3       Very little music or anything to speak of.                  0
4  The best scene in the movie was when Gerardo i...                1
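
The summary above reports column types and row counts but not how long the reviews are, which matters later when a padding length has to be chosen. The following is a minimal sketch of one way to check that. It assumes a small stand-in DataFrame with the same two columns as `imdb_reviews`; the second and third rows are hypothetical, and `sample_reviews` and `word_count` are names introduced here for illustration only.

```python
import pandas as pd

# Stand-in for imdb_reviews: same column names, mostly hypothetical rows.
sample_reviews = pd.DataFrame({
    'review': [
        'Very little music or anything to speak of.',
        'A wonderful film with a great cast.',  # hypothetical row
        'Boring.',                              # hypothetical row
    ],
    'sentiment_score': [0, 1, 0],
})

# Count whitespace-separated words in each review.
sample_reviews['word_count'] = sample_reviews['review'].str.split().str.len()

# Summary statistics give a first idea of a sensible maximum sequence length.
print(sample_reviews['word_count'].describe())
print(f"Longest review: {sample_reviews['word_count'].max()} words")
```

Applied to the full dataset, the same two lines could be run against `imdb_reviews['review']` directly. Splitting on whitespace only approximates the token count a real tokenizer would produce.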


In [3]:
'''
readme.txt states that the data should contain 500 positive and 500 negative sentences, a 50/50 split.

Verifying dataset is complete
'''

positive_sentiments = len(imdb_reviews[imdb_reviews['sentiment_score'] == 1])
negative_sentiments = len(imdb_reviews[imdb_reviews['sentiment_score'] == 0])

print(f'Positive Sentiments Loaded: {positive_sentiments}')
print(f'Negative Sentiments Loaded: {negative_sentiments}')

assert positive_sentiments == 500, 'Failed to load all the positive sentiment scores'
assert negative_sentiments == 500, 'Failed to load all the negative sentiment scores'


Positive Sentiments Loaded: 500
Negative Sentiments Loaded: 500
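
Because the two classes are evenly balanced, a stratified split can preserve that 50/50 ratio in both the training and test partitions. The following is a rough sketch using `train_test_split`, which is already imported above. The 80/20 split size and the fixed `random_state` are illustrative assumptions, not settings taken from this notebook, and `demo_reviews` is a synthetic frame standing in for `imdb_reviews`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for imdb_reviews: 10 positive and 10 negative rows.
demo_reviews = pd.DataFrame({
    'review': [f'placeholder review {i}' for i in range(20)],
    'sentiment_score': [1] * 10 + [0] * 10,
})

# value_counts() reports both class sizes in a single call.
print(demo_reviews['sentiment_score'].value_counts())

# stratify keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    demo_reviews['review'],
    demo_reviews['sentiment_score'],
    test_size=0.2,     # assumed split size
    random_state=42,   # assumed seed, for repeatability
    stratify=demo_reviews['sentiment_score'],
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Without `stratify`, a random split of a small dataset can drift away from the original class ratio, which makes accuracy figures harder to interpret.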


<a id="tokenization-process"></a>
# B2: Tokenization

<a id="padding-process"></a>
# B3: Padding Process

<a id="categories-of-sentiment"></a>
# B4: Categories Of Sentiment

<a id="data-preparation"></a>
# B5: Steps To Prepare the Data

<a id="copy-of-prepared-data"></a>
# B6: Prepared Dataset

<a id="tensorflow-model-summary"></a>
# C1: Model Summary

<a id="network-architecture"></a>
# C2: Network Architecture

<a id="hyperparameters"></a>
# C3: Hyperparameters


<a id="stopping-criteria"></a>
# D1: Stopping Criteria

<a id="fitness"></a>
# D2: Fitness

<a id="training-process"></a>
# D3: Training Process

<a id="predictive-accuracy"></a>
# D4: Predictive Accuracy


<a id="source-code"></a>
# E: Code

<a id="functionality"></a>
# F: Functionality

<a id="recommendations"></a>
# G: Recommendations

<a id="reporting"></a>
# H: Reporting

<a id="code-references"></a>
# I: Sources for Third Party Code

<a id="source-references"></a>
# J: Source References

 * Kotzias, D. (2015). Sentiment Labelled Sentences. UCI Machine Learning Repository. https://doi.org/10.24432/C57604 <br /> <br />
 * Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (3rd ed.). O'Reilly Media. <br /><br />