# About the Dataset
Originally, we planned to scrape conversation data from [Yahoo Finance](https://finance.yahoo.com/quote/AA/community), since it is recent and diverse. However, web scraping proved too slow: after 20 hours of running on a local machine, the script had generated only about 30 MB of data. I also tried running the script on a CIMS server, but Google Chrome ran very slowly inside the virtual box. Given these time and data-size constraints, I instead used the StockTwits dataset (about 194.3 MB) collected by a Udacity team. The dataset contains messages from StockTwits, a social media app, and those messages are similar to posts on Twitter. It is available in the public domain and contains sufficient data. A more detailed description can be found [here](https://vkontech.com/sentiment-analysis-of-stocktwits-messages-using-lstm-in-pytorch/).

 
# Exploratory data analysis & cleansing
Here, I define a schema for the DataFrame and call z.show() to display some rows of the dataset. In total, there are 4 columns and 1,548,010 rows. Column names and types are shown in the printSchema output.

In [2]:
val filePath = "project/comments.csv"
val schema = "index STRING, message_body STRING, sentiment INT, timestamp TIMESTAMP"
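// An explicit schema is supplied, so the inferSchema option is not needed.
val rawDF = spark.read.schema(schema)
  .option("header", "true")
  .option("multiLine", "true")  // quoted fields may contain line breaks
  .option("escape", "\"")       // a doubled quote inside a quoted field is a literal quote
  .csv(filePath)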
z.show(rawDF)

In [3]:
rawDF.printSchema

In [4]:
val filePath2 = "project/output.csv"

In [5]:
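// No header option is set, so the columns get the default names _c0, _c1, ...
val rawDF2 = spark.read.csv(filePath2)
// Drop the row whose _c1 is the literal "Symbols" (presumably the file's header line).
val filtered = rawDF2.filter(rawDF2("_c1") =!= "Symbols").cache()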
z.show(filtered)

In [6]:
filtered.printSchema


In [7]:
val joinedDF = rawDF.join(broadcast(filtered), rawDF("index") === filtered("_c0"))

In [8]:
z.show(joinedDF)

In [9]:
joinedDF.printSchema
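
The join above is an inner join, so any message whose index has no match in output.csv is silently dropped from joinedDF. Below is a minimal sketch of one way to count such rows with a left anti join. The toy rows, the local session, and the names messages, symbols, and unmatchedCount are assumptions for illustration and do not come from the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Local session for this sketch only; inside the notebook `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("join-coverage-sketch")
  .getOrCreate()
import spark.implicits._

// Toy stand-ins for rawDF and filtered (made-up values).
val messages = Seq(
  ("0", "$NFLX looks strong"),
  ("1", "no symbols extracted for this one"),
  ("2", "$TSLA and $NFLX today")
).toDF("index", "message_body")

val symbols = Seq(
  ("0", "$NFLX"),
  ("2", "$TSLA,$NFLX")
).toDF("_c0", "_c1")

// Left anti join: keeps the rows of `messages` that have no match in `symbols`.
// These are exactly the rows an inner join would drop.
val unmatched = messages.join(symbols, messages("index") === symbols("_c0"), "left_anti")
val unmatchedCount = unmatched.count()
```

In the notebook itself, the equivalent check would be rawDF.join(filtered, rawDF("index") === filtered("_c0"), "left_anti").count().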

In [10]:
// _c1 holds the comma-separated symbols for each message; split it into an array column.
val newDF = joinedDF.select(
  $"index", $"message_body", $"sentiment", $"timestamp",
  split(col("_c1"), ",").alias("list_of_stocks"))

In [11]:
z.show(newDF)

In [12]:
// Spark passes array columns to Scala UDFs as a Seq, so Seq[String] is used as the parameter type.
val convert_list = udf((values: Seq[String]) => values.toList)


In [13]:
// withColumn keeps all existing columns, so only the new column needs to be added.
val converted = newDF.withColumn("list_of_symbols", convert_list(col("list_of_stocks")))
z.show(converted)

In [14]:
// explode emits one row per symbol in list_of_symbols; to_date drops the time of day.
val flatted = converted.select(
  $"list_of_symbols", $"index", $"message_body", $"sentiment",
  to_date($"timestamp").alias("timestamp"),
  explode($"list_of_symbols").alias("flatted_symbol"))


In [15]:
z.show(flatted)

In [16]:
val groupedDF = flatted.groupBy("flatted_symbol", "timestamp").agg(avg("sentiment"))
z.show(groupedDF)
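
To make the explode and groupBy steps easier to check by hand, here is a small sketch of the same logic on plain Scala collections, with no Spark involved. The Message case class, the sample rows, and the sentiment values are hypothetical and only mirror the columns used above.

```scala
import java.time.LocalDate

// Hypothetical record type mirroring the notebook's columns.
case class Message(symbols: List[String], sentiment: Int, date: LocalDate)

val messages = List(
  Message(List("$NFLX", "$TSLA"), 2, LocalDate.parse("2018-07-02")),
  Message(List("$NFLX"), -2, LocalDate.parse("2018-07-02")),
  Message(List("$TSLA"), 1, LocalDate.parse("2018-07-03"))
)

// "Explode": one (symbol, date, sentiment) row per symbol in each message.
val exploded: List[(String, LocalDate, Int)] = for {
  m <- messages
  s <- m.symbols
} yield (s, m.date, m.sentiment)

// Group by (symbol, date) and average the sentiment,
// analogous to groupBy("flatted_symbol", "timestamp").agg(avg("sentiment")).
val dailyAvg: Map[(String, LocalDate), Double] =
  exploded
    .groupBy { case (symbol, date, _) => (symbol, date) }
    .map { case (key, rows) => key -> (rows.map(_._3.toDouble).sum / rows.size) }
```

A message that mentions several symbols contributes its sentiment once to each of them, which is why the first sample message counts toward both $NFLX and $TSLA.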


In [17]:
val removeDF = groupedDF
                .withColumn("flatted_symbol", regexp_replace(col("flatted_symbol"), "\\$", ""))
z.show(removeDF)

In [18]:
// Abbreviated day name ("Mon", "Tue", ...) for each date.
val dfWithDayOfWeek = removeDF.withColumn("dayOfWeek", date_format(col("timestamp"), "E"))
// Move weekend dates to the following Monday; weekdays are left unchanged.
val df4 = dfWithDayOfWeek.withColumn("shiftedDate",
  when(col("dayOfWeek") === "Sat", date_add(col("timestamp"), 2))
    .when(col("dayOfWeek") === "Sun", date_add(col("timestamp"), 1))
    .otherwise(col("timestamp")))
z.show(df4)
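
The weekend rule can also be written as a plain Scala function, which makes it easy to test on individual dates. This is only a sketch of the same rule using java.time; the function name shiftToMonday is hypothetical and is not used elsewhere in the notebook.

```scala
import java.time.{DayOfWeek, LocalDate}

// Saturday moves forward two days and Sunday one day, so both land on Monday.
// Weekdays are returned unchanged. Market holidays are not handled.
def shiftToMonday(date: LocalDate): LocalDate = date.getDayOfWeek match {
  case DayOfWeek.SATURDAY => date.plusDays(2)
  case DayOfWeek.SUNDAY   => date.plusDays(1)
  case _                  => date
}

val saturday = LocalDate.parse("2018-07-07")
val sunday = LocalDate.parse("2018-07-08")
val wednesday = LocalDate.parse("2018-07-04")

val shifted = List(saturday, sunday, wednesday).map(shiftToMonday)
```

Note that in the notebook the shift is applied after the daily average has been computed, so a symbol can end up with separate rows for Saturday, Sunday, and Monday that all share the same shiftedDate.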

In [19]:
val nflx = df4.filter(col("flatted_symbol") === "NFLX").select(col("shiftedDate"), col("avg(sentiment)")).sort(col("shiftedDate"))

In [20]:
z.show(nflx)

In [21]:
// val finalDF = nflx.withColumn("date", to_date($"shiftedDate"))
//                 .withColumn("flatted_symbol", $"flatted_symbol")
//                 .withColumn("sentiment", $"avg(sentiment)")
//                 .withColumn("dayOfWeek", $"flatted_symbol")
//                 .withColumn("flatted_symbol", $"flatted_symbol")

In [23]:
rawDF.columns.length

In [24]:
val tsla = df4.filter(col("flatted_symbol") === "TSLA").select(col("shiftedDate"), col("avg(sentiment)")).sort(col("shiftedDate"))

In [25]:
z.show(tsla)

In [26]:
// Spark writes a directory of part files at this path rather than a single CSV file.
val outputPath = "project/cleanedComments.csv"
df4.write.mode("overwrite").csv(outputPath)
