## Reddit Analysis - NLP

### Initializing Spark Session

In [1]:
import findspark
findspark.init()

In [2]:
!/mnt/miniconda/bin/pip install sparknlp

Collecting sparknlp
  Downloading sparknlp-1.0.0-py3-none-any.whl (1.4 kB)
Collecting spark-nlp
  Downloading spark_nlp-3.4.3-py2.py3-none-any.whl (144 kB)
[K     |████████████████████████████████| 144 kB 30.7 MB/s eta 0:00:01
[?25hInstalling collected packages: spark-nlp, sparknlp
Successfully installed spark-nlp-3.4.3 sparknlp-1.0.0


In [3]:
import pyspark.sql.functions as F
from pyspark.sql.functions import col, lit,size
from pyspark.sql import SparkSession
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

In [4]:
spark = SparkSession.builder \
        .appName("reddit") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2") \
    .master('yarn') \
    .getOrCreate()

Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c342d107-ad2a-4bf6-85b2-decf3f433d62;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;3.4.2 in central
	found com.typesafe#config;1.4.1 in central
	found org.rocksdb#rocksdbjni;6.5.3 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.603 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.code.findbugs#annotations;3.0.1 in central
	found net.jcip#jcip-annotations;1.0 in central
	found com.google.code.findbugs#jsr305;3.0.1 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found

In [5]:
spark

### Reading the entire dataset from s3

In [6]:
df_full = spark.read.parquet('s3://ssp88-labdata2/eda_df_full')

                                                                                

In [7]:
df_full.printSchema()

root
 |-- all_awardings: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_created_utc: double (nullable = true)
 |-- author_flair_richtext: string (nullable = true)
 |-- author_flair_type: string (nullable = true)
 |-- author_fullname: string (nullable = true)
 |-- author_patreon_flair: boolean (nullable = true)
 |-- author_premium: boolean (nullable = true)
 |-- awarders: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- can_mod_post: boolean (nullable = true)
 |-- collapsed: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- gilded: long (nullable = true)
 |-- gildings: string (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- no_follow: boolean (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- p

#### Data Text Checks

In [8]:
# Most Common Words
import pyspark.sql.functions as f
common_word = df_full.withColumn('word', f.explode(f.split(f.col('body'), ' '))) \
  .groupBy('word') \
  .count() \
  .sort('count', ascending=False) \
  .limit(10)

In [9]:
common_word.show()

22/04/16 23:39:50 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+----+--------+
|word|   count|
+----+--------+
| the|14798077|
|  to|10378945|
|   a| 9822211|
| and| 7735354|
|  of| 6655809|
|   I| 5741511|
|  is| 5584017|
|that| 5065040|
| you| 5022835|
|  in| 4913229|
+----+--------+



                                                                                

Cleaning is required as stopwords cannot be the top 10 words with highest count

In [10]:
#Distribution of Text Lengths
df_full = df_full.withColumn("comment_length", F.length(col('body')))
df_full.select('body','comment_length').sort('comment_length', ascending=False).show(10)

                                                                                

+--------------------+--------------+
|                body|comment_length|
+--------------------+--------------+
|**Money in Electi...|         11252|
|&gt; 
&gt; 
&gt; ...|         10271|
|Original comment ...|         10192|
|Original comment ...|         10190|
|LOL wait didnt yo...|         10166|
|Biases in Stops, ...|         10145|
|Part 2.

&gt; You...|         10081|
|&gt;I do understa...|         10079|
|UNDELETED comment...|         10074|
|UNDELETED comment...|         10074|
+--------------------+--------------+
only showing top 10 rows



It can be observed that some of the texts with the higher comment length are not useful for analysis (deleted etc). These words/sentences can be removed later.

##### Important words according to TF-IDF

In [159]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

sentenceData = df_full.select(df_full["body"])


tokenizer = Tokenizer(inputCol="body", outputCol="words")
wordsData = tokenizer.transform(sentenceData)


hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

for features_label in rescaledData.select("features", "words").take(3):
    print(features_label)

[Stage 152:>                                                        (0 + 1) / 1]

Row(features=SparseVector(20, {3: 0.6267, 7: 0.7122, 10: 1.7161, 19: 0.6819}), words=['so,', 'theyre', 'like', 'australian', 'geese?'])
Row(features=SparseVector(20, {0: 4.6762, 1: 3.639, 2: 1.1016, 3: 3.7599, 4: 3.136, 5: 3.5421, 6: 3.0408, 7: 4.2731, 8: 2.0492, 9: 4.9559, 10: 1.7161, 11: 3.14, 12: 2.6444, 13: 1.6626, 15: 3.1381, 16: 2.78, 17: 3.8672, 18: 4.4873, 19: 0.6819}), words=['by', 'that', 'definition', 'literally', 'any', 'food', 'is', 'a', 'drug.', "there's", 'nothing', 'special', 'about', "sugar's", 'effect', 'on', 'the', 'reward', 'system.', 'you', 'eat', 'food,', 'you', 'feel', 'good.', 'the', 'tongue', 'enjoys', 'the', 'sweetness', 'and', 'you', 'get', 'a', 'hit', 'of', 'dopamine', 'from', 'your', 'reward', 'system.', 'the', 'sugar', "isn't", 'binding', 'to', 'any', 'receptors', 'in', 'your', 'brain', "it's", 'entirely', 'your', "brain's", 'own', 'response', 'to', 'positive', 'stimulus.', 'other', 'examples', 'are', 'finishing', 'paperwork', 'or', 'getting', 'a', 'massag

                                                                                

#### Creating Dummy Variables

In [11]:
df_full_reddit = df_full.withColumn("Pandemic_Freakout", F.regexp_extract('body', \
                                                        r'(?i)\bcovid\b|(?i)\bpandemic\b|(?i)\bcovid-19\b|(?i)\bcorona\b|(?i)\bvirus\b|(?i)\bmasks\b|(?i)\hospital\b',0))
df_full_reddit = df_full_reddit.withColumn("Arrest_Freakout", F.regexp_extract('body', \
                                                        r'(?i)\barrest\b|(?i)\bofficer\b|(?i)\bpolice\b|(?i)\bcop\b|(?i)\bstab\b|(?i)\billegal\b|(?i)\brutal\b',0))

In [12]:
df_full_reddit = df_full_reddit.withColumn("Pandemic_Freakout",F.lower(F.col('Pandemic_Freakout')))
df_full_reddit = df_full_reddit.withColumn("Arrest_Freakout",F.lower(F.col('Arrest_Freakout')))

### Cleaning the data 

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/hadoop/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
from nltk.corpus import stopwords
eng_stopwords = stopwords.words('english')
#eng_stopwords.append('xxxx')

In [15]:
from sparknlp.base import Finisher, DocumentAssembler
from sparknlp.annotator import (Tokenizer, Normalizer,
                                LemmatizerModel, StopWordsCleaner)
from pyspark.ml import Pipeline

In [16]:
documentAssembler = DocumentAssembler() \
     .setInputCol('body') \
     .setOutputCol('document')
tokenizer = Tokenizer() \
     .setInputCols(['document']) \
     .setOutputCol('token')
# note normalizer defaults to changing all words to lowercase.
# Use .setLowercase(False) to maintain input case.
normalizer = Normalizer() \
     .setInputCols(['token']) \
     .setOutputCol('normalized') \
     .setLowercase(True)
# note that lemmatizer needs a dictionary. So I used the pre-trained
# model (note that it defaults to english)
lemmatizer = LemmatizerModel.pretrained() \
     .setInputCols(['normalized']) \
     .setOutputCol('lemma')
stopwords_cleaner = StopWordsCleaner() \
     .setInputCols(['lemma']) \
     .setOutputCol('clean_lemma') \
     .setCaseSensitive(False) \
     .setStopWords(eng_stopwords)
# finisher converts tokens to human-readable output
finisher = Finisher() \
     .setInputCols(['clean_lemma']) \
     .setCleanAnnotations(False)

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[ | ]lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
Download done! Loading the resource.
[ / ]

                                                                                

[OK!]


In [17]:
pipeline = Pipeline() \
     .setStages([
           documentAssembler,
           tokenizer,
           normalizer,
           lemmatizer,
           stopwords_cleaner,
           finisher
     ])

In [18]:
df_full_clean = pipeline.fit(df_full_reddit).transform(df_full_reddit)

In [19]:
df_full_clean.printSchema()

root
 |-- all_awardings: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_created_utc: double (nullable = true)
 |-- author_flair_richtext: string (nullable = true)
 |-- author_flair_type: string (nullable = true)
 |-- author_fullname: string (nullable = true)
 |-- author_patreon_flair: boolean (nullable = true)
 |-- author_premium: boolean (nullable = true)
 |-- awarders: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- can_mod_post: boolean (nullable = true)
 |-- collapsed: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- gilded: long (nullable = true)
 |-- gildings: string (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- no_follow: boolean (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- p

In [20]:
from pyspark.sql.functions import concat_ws

df_full_new = df_full_clean.withColumn("text", concat_ws(" ", "clean_lemma.result"))

#### Finding the Sentiment of comments

In [21]:
#Sentiment 
document_t = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use_t = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("use_embeddings")

docClassifier_t = SentimentDLModel.pretrained('sentimentdl_use_twitter', lang = 'en') \
  .setInputCols(["use_embeddings"])\
  .setOutputCol("sentiment")

pipeline_t = Pipeline(
    stages = [
        document_t,
        use_t,
        docClassifier_t
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ]tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ — ]Download done! Loading the resource.
[ \ ]

2022-04-16 23:41:30.869956: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


[ | ]

2022-04-16 23:41:31.300613: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499995000 Hz


[ / ]

2022-04-16 23:41:35.912778: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 60236800 exceeds 10% of free system memory.
2022-04-16 23:41:35.959925: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 60236800 exceeds 10% of free system memory.
2022-04-16 23:41:36.078408: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 60236800 exceeds 10% of free system memory.


[ — ]

2022-04-16 23:41:36.143312: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 60236800 exceeds 10% of free system memory.
2022-04-16 23:41:36.183993: W external/org_tensorflow/tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 60236800 exceeds 10% of free system memory.


[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[ | ]sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[ / ]Download done! Loading the resource.
[OK!]


In [22]:
pipelineModel_t = pipeline_t.fit(df_full_new)
result_t = pipelineModel_t.transform(df_full_new)

In [23]:
result_t.printSchema()

root
 |-- all_awardings: string (nullable = true)
 |-- author: string (nullable = true)
 |-- author_created_utc: double (nullable = true)
 |-- author_flair_richtext: string (nullable = true)
 |-- author_flair_type: string (nullable = true)
 |-- author_fullname: string (nullable = true)
 |-- author_patreon_flair: boolean (nullable = true)
 |-- author_premium: boolean (nullable = true)
 |-- awarders: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- can_mod_post: boolean (nullable = true)
 |-- collapsed: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- gilded: long (nullable = true)
 |-- gildings: string (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- no_follow: boolean (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- p

Selecting the required comments to store in s3

In [24]:
sentiment_df = result_t.select(F.explode(F.arrays_zip('document.result', 'sentiment.result')).alias("cols"),
                               F.expr("author").alias("author"),
                               F.expr("controversiality").alias("controversiality"),
                               F.expr("score").alias("score"),
                               F.expr("total_awards_received").alias("total_awards_received"),
                               F.expr("comment_date").alias("comment_date"),
                               F.expr("year").alias("year"),
                               F.expr("month").alias("month"),
                               F.expr("hour").alias("hour"),
                               F.expr("comment_length").alias("comment_length"),
                               F.expr("Arrest_Freakout").alias("Arrest_Freakout"),
                               F.expr("Pandemic_Freakout").alias("Pandemic_Freakout")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("sentiment"),
       "author","controversiality","score","total_awards_received","comment_date","year","month","hour","comment_length","Arrest_Freakout","Pandemic_Freakout")

In [25]:
sentiment_df.write.parquet("s3://ssp88-labdata2/sentiment_df/")

                                                                                

In [66]:
#Read parquet 
sentiment_read = spark.read.parquet('s3://ssp88-labdata2/sentiment_df')

                                                                                

In [67]:
sentiment_read.printSchema()

root
 |-- document: string (nullable = true)
 |-- sentiment: string (nullable = true)
 |-- author: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- score: long (nullable = true)
 |-- total_awards_received: long (nullable = true)
 |-- comment_date: string (nullable = true)
 |-- year: string (nullable = true)
 |-- month: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- comment_length: integer (nullable = true)
 |-- Arrest_Freakout: string (nullable = true)
 |-- Pandemic_Freakout: string (nullable = true)



In [68]:
!/mnt/miniconda/bin/pip install altair



In [69]:
import altair as alt
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

### Graphs - Business Questions 

##### Sentiment Count 

In [70]:
sentiment_count = sentiment_read.groupby('sentiment').agg(F.count('document'))

In [71]:
sentiment_count = sentiment_count.withColumnRenamed('count(document)','sentiment_count')

In [72]:
sentiment_count.show()

                                                                                

+---------+---------------+
|sentiment|sentiment_count|
+---------+---------------+
| positive|        9880051|
|     null|         108800|
|  neutral|         968815|
| negative|        7163286|
+---------+---------------+



In [73]:
sentiment_count = sentiment_count.toPandas()

                                                                                

In [74]:
sentiment_count_altered = sentiment_count.dropna(subset=['sentiment'])

In [75]:
sentiment_count_altered.head()

Unnamed: 0,sentiment,sentiment_count
0,positive,9880051
2,neutral,968815
3,negative,7163286


In [178]:
fig = (alt.Chart(sentiment_count_altered).mark_bar().encode(
    y=alt.Y('sentiment_count', axis = alt.Axis(title = "Count")),
    x=alt.X('sentiment', axis = alt.Axis(title = "Sentiment"),sort='-y'),
    color=alt.value('#7fc97f'),
    tooltip=['sentiment','sentiment_count']
)).properties(title={"text":'Sentiment Count',"subtitle" : "Sentiment of authors through each comment"},width = 500, height = 500)

fig.save("fig1.html")
fig


There are very few comments overall having neutral sentiment compared to the other two sentiments. Majority of the comments have a positive sentiment, this might be the case as the authors might leave motivational or happy comments under freakout videos or else there are comparatively more video under happy freakout category.

##### Sentiments through time

In [87]:
sentiment_time = sentiment_read.groupby('year','month','sentiment').agg(F.count('document'))

In [88]:
sentiment_time = sentiment_time.withColumnRenamed('count(document)','count')

In [89]:
sentiment_time.show(5)

                                                                                

+----+-----+---------+------+
|year|month|sentiment| count|
+----+-----+---------+------+
|2020|   04| negative|236243|
|2019|   08|  neutral| 22363|
|2020|   01| positive|231719|
|2020|   11| negative|379104|
|2019|   09|     null|  2636|
+----+-----+---------+------+
only showing top 5 rows



In [90]:
sentiment_time_df = sentiment_time.toPandas()

                                                                                

In [93]:
sentiment_time_df = sentiment_time_df.dropna(subset=['sentiment'])

In [94]:
sentiment_time_df

Unnamed: 0,year,month,sentiment,count
0,2020,11,negative,379104
1,2020,04,negative,236243
2,2019,08,neutral,22363
4,2020,01,positive,231719
5,2020,12,neutral,41181
...,...,...,...,...
90,2019,08,positive,240089
91,2020,01,neutral,22618
92,2021,01,neutral,61535
93,2020,10,positive,473585


In [95]:
sentiment_time_df['time_concat'] = sentiment_time_df["year"] + "_" + sentiment_time_df["month"]
sentiment_time_df

Unnamed: 0,year,month,sentiment,count,time_concat
0,2020,11,negative,379104,2020_11
1,2020,04,negative,236243,2020_04
2,2019,08,neutral,22363,2019_08
4,2020,01,positive,231719,2020_01
5,2020,12,neutral,41181,2020_12
...,...,...,...,...,...
90,2019,08,positive,240089,2019_08
91,2020,01,neutral,22618,2020_01
92,2021,01,neutral,61535,2021_01
93,2020,10,positive,473585,2020_10


In [96]:
sentiment_time_df = sentiment_time_df.sort_values(["year","month"]).reset_index().drop('index',axis = 1)

In [179]:
fig2 = (alt.Chart(sentiment_time_df).mark_line().encode(
    x=alt.X('time_concat', axis = alt.Axis(title = "Timeframe")),
    y=alt.Y('count', axis = alt.Axis(title = "Count of Comments")),
    color='sentiment',    
    tooltip=['time_concat','count']
)).resolve_scale(x='independent').properties(title={"text":'Count of Comments',"subtitle" : "Relationship between Time Period and Frequency of Comments for each Sentiment"},width = 500, height = 500).interactive()

fig2.save("fig2.html")
fig2

##### Sentiment of authors around covid

In [181]:
sentiment_covid = sentiment_read.groupby('Pandemic_Freakout','sentiment').agg(F.avg('score'))

In [182]:
sentiment_covid = sentiment_covid.withColumnRenamed('avg(score)','average_score')

In [183]:
sentiment_covid.show()

                                                                                

+-----------------+---------+------------------+
|Pandemic_Freakout|sentiment|     average_score|
+-----------------+---------+------------------+
|           corona|  neutral|18.661516853932586|
|            virus|  neutral| 11.61215932914046|
|            virus| negative|11.208408528841655|
|            masks| positive|15.556957011851445|
|         pandemic| positive| 15.93591145121618|
|            covid|  neutral| 12.82067415730337|
|                 |     null|           5.79875|
|                 |  neutral|14.973012614589047|
|            masks| negative|14.785119574844996|
|         pandemic|  neutral| 24.19124087591241|
|           corona| negative|13.306956201693044|
|         pandemic| negative| 18.04567284132118|
|                 | negative|13.945360261043453|
|            virus| positive|10.075320849989481|
|            covid| positive| 17.26737085805238|
|                 | positive|14.725523070829738|
|            covid| negative|19.250790647384783|
|           corona| 

In [184]:
sentiment_covid = sentiment_covid.toPandas()

                                                                                

In [185]:
sentiment_covid = sentiment_covid.dropna(subset=['sentiment'])

##### Summary Table

In [186]:
sentiment_covid.head(20)

Unnamed: 0,Pandemic_Freakout,sentiment,average_score
0,corona,neutral,18.661517
1,virus,neutral,11.612159
2,virus,negative,11.208409
3,masks,positive,15.556957
4,pandemic,positive,15.935911
5,covid,neutral,12.820674
7,,neutral,14.973013
8,masks,negative,14.78512
9,pandemic,neutral,24.191241
10,corona,negative,13.306956


In [187]:
fig3 = (alt.Chart(sentiment_covid).mark_bar().encode(
    x=alt.X('Pandemic_Freakout', axis = alt.Axis(title = "Covid Terms"),sort='-y'),
    y=alt.Y('average_score', axis = alt.Axis(title = "Average Score")),
    color='sentiment',
    tooltip=['Pandemic_Freakout','average_score','sentiment']
)).properties(title={"text":'Covid Sentiment',"subtitle" : "Sentiment of authors through each comment revolving around covid"},width = 500, height = 500)

fig.save("fig3.html")
fig3

##### controversiality

In [190]:
sentiment_bar = sentiment_read.groupby('controversiality','sentiment').agg(F.avg('score'))

In [191]:
sentiment_bar = sentiment_bar.withColumnRenamed('avg(score)','average_score')

In [192]:
sentiment_bar.show()

                                                                                

+----------------+---------+-------------------+
|controversiality|sentiment|      average_score|
+----------------+---------+-------------------+
|               0|  neutral| 15.731455497859258|
|               1| negative| 0.6144336834713404|
|               1| positive| 0.3619774973596332|
|               0| negative| 14.649875335629206|
|               1|     null|0.27723321620122066|
|               0|     null|  6.087501088081398|
|               1|  neutral| 0.5500021304699817|
|               0| positive| 15.488053330822497|
+----------------+---------+-------------------+



In [193]:
sentiment_bar = sentiment_bar.toPandas()

                                                                                

In [194]:
sentiment_bar = sentiment_bar.dropna(subset=['sentiment']).reset_index().drop('index',axis = 1)

##### Summary Table

In [195]:
sentiment_bar.head(10)

Unnamed: 0,controversiality,sentiment,average_score
0,0,neutral,15.731455
1,1,negative,0.614434
2,1,positive,0.361977
3,0,negative,14.649875
4,1,neutral,0.550002
5,0,positive,15.488053


In [197]:
sentiment_bar["controversiality"] = sentiment_bar["controversiality"].astype("category")

In [198]:
fig4 = (alt.Chart(sentiment_bar).mark_bar().encode(
    x=alt.X('controversiality', axis = alt.Axis(title = "Covid Terms")),
    y=alt.Y('average_score', axis = alt.Axis(title = "Score")),
    color='sentiment',
    tooltip=['controversiality','average_score','sentiment']
)).properties(title={"text":'Covid Sentiment',"subtitle" : "Sentiment of authors through each comment revolving around covid"},width = 500, height = 500)

fig4

In [None]:
spark.stop()
