# Sentiment Analysis
### Using the Spark NLP pre-trained SentimentDL model <br/>
- First, I applied a few pre-processing steps. <br/>
> Removing undesired data. <br/>
> Checking for null values and removing them. <br/>
> Spell checking. <br/>
> Cleaning stopwords. <br/>
- Then, I built a pipeline of Spark NLP annotators and models.
> I used **GloVe embeddings** (`glove_100d`) as the word embeddings. <br/>
> I used the Spark NLP **sentimentdl_glove_imdb** pre-trained model as the sentiment classifier.


In [None]:
! pip install -q pyspark==3.1.2 spark-nlp


In [None]:
import sparknlp 
spark= sparknlp.start()

In [None]:
from sparknlp.base import *
from sparknlp.annotator import * 
from pyspark.ml import Pipeline
import pyspark.sql.functions as F
from pyspark.ml.feature import SQLTransformer, StringIndexer
from pyspark.sql.functions import explode, col, when, isnan, count

In [None]:
df_train= spark.read\
    .option("header", True)\
    .csv("/content/Train.csv")


In [None]:
df_train.count()

40000

In [None]:
df_train.show(5, truncate=40)

+----------------------------------------+----------------------------------------+
|                                    text|                                   label|
+----------------------------------------+----------------------------------------+
|"I grew up (b. 1965) watching and lov...| during lunch and after school. We al...|
|When I put this movie in my DVD playe...|                                       0|
|"Why do people who do not know what a...| I'll put out my own movie and prove ...|
|Even though I have great interest in ...|                                       0|
|"Im a die hard Dads Army fan and noth...|                                       1|
+----------------------------------------+----------------------------------------+
only showing top 5 rows



It seems there are some undesired values in the 'label' column. We will keep only the rows whose 'label' is '0' or '1' and drop the rest.

In [None]:
df_train.groupBy("label").count().show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
| so little substance|    1|
| giving us a few ...|    1|
| it´s instead a l...|    1|
| a bunch of lonel...|    1|
| ""La Noche del T...|    1|
| ""Nightmare"" is...|    1|
| says a character...|    1|
| he has beautiful...|    1|
| not even uninten...|    1|
| you should get t...|    1|
| as I'd thought i...|    1|
|  and let's remember|    1|
| for the simple r...|    1|
| be a smartie. Co...|    1|
| then this film i...|    1|
|           the thief|    1|
| the camera focus...|    1|
| this independent...|    1|
| almost anyone wo...|    1|
| while others wil...|    1|
+--------------------+-----+
only showing top 20 rows



In [None]:
df_train= df_train.filter((df_train["label"]==0) | (df_train["label"]==1))
df_train.count()

22922
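The stray `label` values look like fragments of review text, which suggests that reviews containing commas and quotes were split across columns when the file was read. The sketch below illustrates the effect in plain Python on a hypothetical two-row sample (not taken from `Train.csv`): splitting on every comma leaks text into the label position, while a quote-aware parser keeps the review intact. Spark's reader is more sophisticated than a naive split, so this is only an analogy. If this is indeed the cause, the CSV reader's `quote`, `escape` and `multiLine` options may be worth trying before filtering rows out.

```python
import csv
import io

# Hypothetical sample in the same shape as Train.csv (text,label).
# The first review is quoted and contains commas and doubled quotes ("").
raw = (
    'text,label\n'
    '"I watched ""Nightmare"", and it was fine, mostly.",1\n'
    'Plain review without commas.,0\n'
)

# Naive parsing: split every data line on commas, ignoring quotes.
naive_rows = [line.split(",") for line in raw.strip().split("\n")[1:]]

# Quote-aware parsing: csv.reader treats "" inside a quoted field as a literal quote.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

# Read the second column of each row as the "label".
naive_labels = [row[1] for row in naive_rows]
parsed_labels = [row[1] for row in parsed_rows]

print(naive_labels)   # the first "label" is a piece of the review text
print(parsed_labels)  # both labels are the expected digits
```

With the naive split, the first row ends up with four fields instead of two, which is the same kind of damage visible in the `groupBy("label")` output.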

Now, we will check whether there are any null, NaN or blank values in the dataset.

In [None]:
df_train.select([count(when(col(c).isNull() | \
                            isnan(c) | \
                            (col(c)== " "), c)).alias(c) for c in df_train.columns]).show()

+----+-----+
|text|label|
+----+-----+
|   0|    0|
+----+-----+
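A side note on this check: the comparison `col(c) == " "` matches only a string made of exactly one space, so empty strings or longer runs of whitespace would not be counted. Below is a plain-Python sketch of a broader per-value test. The names `is_missing` and `samples` are hypothetical and only serve the illustration. In Spark, the same idea would presumably be expressed by trimming the column before comparing it to an empty string.

```python
import math

def is_missing(value):
    """Return True for None, float NaN, or a string that is empty or whitespace-only."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

# Hypothetical values covering each case, plus two valid entries.
samples = [None, float("nan"), " ", "", "   ", "A great movie", "0"]

flags = [is_missing(v) for v in samples]
missing_count = sum(flags)
print(flags, missing_count)
```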



As shown above, there are no null values, so we can start building the pipeline for sentiment analysis. <br/>
First, I will create the annotators and models.

In [None]:
documentAssembler= DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# tokenizing
tokenizer= Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# correcting the misspellings in the text
spell_checker= ContextSpellCheckerModel.pretrained("spellcheck_dl")\
    .setInputCols(["token"])\
    .setOutputCol("spell_checked")

# cleaning stopwords
stopwords_cleaner= StopWordsCleaner.pretrained("stopwords_en", "en")\
    .setInputCols(["spell_checked"])\
    .setOutputCol("cleaned")\
    .setCaseSensitive(False)

# GloVe word embeddings
word_embedding= WordEmbeddingsModel.pretrained("glove_100d")\
    .setInputCols(["document","cleaned"])\
    .setOutputCol("embeddings")

# sentence embedding by averaging the token embeddings
sentence_embedding= SentenceEmbeddings()\
    .setInputCols(["document" ,"embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

# pre-trained SentimentDL model
classifier= SentimentDLModel.pretrained("sentimentdl_glove_imdb")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")


spellcheck_dl download started this may take some time.
Approximate size to download 111.4 MB
[OK!]
stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
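The `AVERAGE` pooling strategy set on `SentenceEmbeddings` can be pictured as an element-wise mean over the token vectors of a document. Here is a minimal plain-Python sketch of that idea, using hypothetical 3-dimensional vectors (the `glove_100d` vectors have 100 dimensions). It is meant as an illustration of the concept, not as Spark NLP's actual implementation.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical embeddings for three tokens of one review.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)
```

However many tokens a review has, the pooled vector keeps the dimensionality of a single token embedding, so the classifier receives one fixed-size vector per document.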


Now it is time to put all the annotators and models into a pipeline and fit it to our dataset.

In [None]:
nlp_pipeline= Pipeline(stages=[ 
                               documentAssembler,
                               tokenizer,
                               spell_checker,
                               stopwords_cleaner,
                               word_embedding,
                               sentence_embedding,
                               classifier
])

model= nlp_pipeline.fit(df_train.limit(200))
result= model.transform(df_train.limit(200))

In [None]:
result.columns

['text',
 'label',
 'document',
 'token',
 'spell_checked',
 'cleaned',
 'embeddings',
 'sentence_embeddings',
 'sentiment']
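The column list mirrors the order of the pipeline stages: each annotator reads columns that an earlier stage has already produced. The following plain-Python sketch checks that wiring using the input and output column names from this notebook. The helper `check_wiring` is a hypothetical name for illustration and is not part of Spark NLP.

```python
# (stage name, input columns, output column), using the names from this notebook.
stages = [
    ("documentAssembler", ["text"], "document"),
    ("tokenizer", ["document"], "token"),
    ("spell_checker", ["token"], "spell_checked"),
    ("stopwords_cleaner", ["spell_checked"], "cleaned"),
    ("word_embedding", ["document", "cleaned"], "embeddings"),
    ("sentence_embedding", ["document", "embeddings"], "sentence_embeddings"),
    ("classifier", ["sentence_embeddings"], "sentiment"),
]

def check_wiring(stages, source_columns):
    """Return the columns after all stages; raise if a stage reads a column not yet produced."""
    available = list(source_columns)
    for name, inputs, output in stages:
        missing = [c for c in inputs if c not in available]
        if missing:
            raise ValueError(f"{name} needs missing column(s): {missing}")
        available.append(output)
    return available

columns = check_wiring(stages, ["text", "label"])
print(columns)
```

Reordering the stages, for example placing `stopwords_cleaner` before `spell_checker`, would make this check raise, because `spell_checked` would not exist yet.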

The sentiment results for each review look like the following.

In [None]:
result.select("sentence_embeddings.result", "sentiment.result").show(5, truncate=140)

+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|                                                                                                                                      result|result|
+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|[When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie w...| [pos]|
|[Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is to...| [neg]|
|["Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen ...| [pos]|
|[A terrible movie as everyone has said. What made me laugh was the cameo appearance by Scott McNeal