<h2 align="center" style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Forecasting Stock Prices using Sentiment Analysis and Time Series Models: An Advanced Data Analytics Approach</h2>

### **Table of Contents**

- [Introduction](#Introduction)
   - Assessment Overview
   - Objectives
   - Data Source and Storage
- [Install and Import Required Libraries](#Install-and-Import-Required-Libraries)
- [Load Dataset](#Load-Dataset)
- [Data Exploration](#Data-Exploration)
- [Data Preprocessing](#Data-Preprocessing)

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Introduction</h3>

### Assessment Overview

### Objectives

### Data Source and Storage

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Install and Import Required Libraries</h3>

In [1]:
!pip install -q pyspark pandas

In [2]:
import os
import sqlite3
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, lit, to_date, avg, stddev, desc, first
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType
from pyspark.sql.window import Window

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Load Dataset</h3>

#### **Initialize Spark Session and Define Data Paths**

In [3]:
spark = SparkSession.builder \
    .appName("Stock Tweet Analysis") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "10g") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/21 10:21:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
spark.sparkContext.setLogLevel("ERROR")

In [5]:
tweet_data_path = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stocktweet/stocktweet.csv"
stock_price_folder = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stockprice"
db_path = "stock_analysis.db"

#### **Define Data Schemas**

In [6]:
# Define the companies to analyze
companies = ['AAPL', 'AMZN', 'MSFT', 'TSLA', 'GOOGL', 'FB']

In [7]:
# Define schema for tweet data
tweet_schema = StructType([
    StructField("id", StringType(), True),
    StructField("date", StringType(), True),
    StructField("ticker", StringType(), True),
    StructField("tweet", StringType(), True)
])

In [8]:
# Define schema for stock price data
stock_schema = StructType([
    StructField("Date", StringType(), True),
    StructField("Open", DoubleType(), True),
    StructField("High", DoubleType(), True),
    StructField("Low", DoubleType(), True),
    StructField("Close", DoubleType(), True),
    StructField("Adj Close", DoubleType(), True),
    StructField("Volume", LongType(), True)
])
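
When an explicit schema is passed together with `header=True`, Spark's CSV reader (with its default `enforceSchema` setting) applies the schema by column position rather than matching on header names. A quick header check before loading can therefore catch a file whose columns are ordered differently. The sketch below uses only the standard library; `header_matches` and `EXPECTED_STOCK_COLUMNS` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import csv
import io

# Column names copied from stock_schema above (assumed to be the expected file layout)
EXPECTED_STOCK_COLUMNS = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]


def header_matches(csv_file, expected):
    """Return True if the first row of csv_file equals the expected column list."""
    header = next(csv.reader(csv_file), [])
    return [name.strip() for name in header] == list(expected)


# Illustration with an in-memory header line; for a real file use open(path, newline="")
sample = io.StringIO("Date,Open,High,Low,Close,Adj Close,Volume\n")
print(header_matches(sample, EXPECTED_STOCK_COLUMNS))
```

The same check could be run on each `{ticker}.csv` file before calling `spark.read.csv`.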

#### **Define Helper Functions for Loading Data**

In [9]:
def load_tweet_data():
    df = spark.read.csv(tweet_data_path, header=True, schema=tweet_schema)
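    # Parse the date string into a DateType column; with Spark's default
    # (non-ANSI) settings, strings that do not match the pattern become NULL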
    df = df.withColumn("date", to_date(col("date"), "MM/dd/yyyy"))
    # Filter tweets for selected companies
    df = df.filter(col("ticker").isin(companies))
    return df

In [10]:
def load_stock_data(ticker):
    file_path = os.path.join(stock_price_folder, f"{ticker}.csv")
    df = spark.read.csv(file_path, header=True, schema=stock_schema)
    # Parse the date string into a DateType column
    df = df.withColumn("Date", to_date(col("Date"), "yyyy-MM-dd"))
    # Add ticker column
    df = df.withColumn("ticker", lit(ticker))
    return df

#### **Load Datasets using Helper Functions**

In [11]:
print("Loading tweet data...")
tweets_df = load_tweet_data()
print("Tweet data loaded")

Loading tweet data...
Tweet data loaded


In [12]:
print("Loading stock price data...")
stock_dfs = {}
for company in companies:
    stock_dfs[company] = load_stock_data(company)
print("Stock price data loaded")

Loading stock price data...
Stock price data loaded
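
Because Spark reads lazily, a missing `{ticker}.csv` file may only surface as an error at load or at the first action. A minimal sketch of an upfront check is shown below, assuming one CSV per ticker as in `load_stock_data`; `missing_price_files` is a hypothetical helper, and the demo uses a temporary folder rather than the Kaggle path.

```python
import os
import tempfile


def missing_price_files(folder, tickers):
    """Return the tickers that have no '<ticker>.csv' file inside folder."""
    return [t for t in tickers if not os.path.isfile(os.path.join(folder, f"{t}.csv"))]


# Demo: a temporary folder that contains a file for AAPL only
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "AAPL.csv"), "w").close()
    print(missing_price_files(demo_folder, ["AAPL", "MSFT"]))
```

In this notebook the call would be `missing_price_files(stock_price_folder, companies)`, run before the loading loop.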


<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Data Exploration</h3>

#### **Viewing First 5 Rows of Each Stock DataFrame**

In [13]:
def show_top_rows(df, name):
    """Display the top 5 rows of a DataFrame"""
    print(f"\n{name} Top 5 Rows:")
    df.show(5, truncate=False)

In [14]:
for ticker, df in stock_dfs.items():
    show_top_rows(df, ticker)


AAPL Top 5 Rows:
+----------+-----------------+-----------------+-----------------+-----------------+-----------------+---------+------+
|Date      |Open             |High             |Low              |Close            |Adj Close        |Volume   |ticker|
+----------+-----------------+-----------------+-----------------+-----------------+-----------------+---------+------+
|2019-12-31|72.48249816894531|73.41999816894531|72.37999725341797|73.4124984741211 |71.52082061767578|100805600|AAPL  |
|2020-01-02|74.05999755859375|75.1500015258789 |73.79750061035156|75.0875015258789 |73.15264892578125|135480400|AAPL  |
|2020-01-03|74.2874984741211 |75.1449966430664 |74.125           |74.35749816894531|72.44145965576172|146322800|AAPL  |
|2020-01-06|73.44750213623047|74.98999786376953|73.1875          |74.94999694824219|73.0186767578125 |118387200|AAPL  |
|2020-01-07|74.95999908447266|75.2249984741211 |74.37000274658203|74.59750366210938|72.67527770996094|108872000|AAPL  |
+----------+-----------------+----------

#### **Statistical Summary of Each Stock DataFrame**

In [15]:
def show_summary(df, name):
    """Display statistical summary of a DataFrame"""
    print(f"\n{name} Statistical Summary:")
    df.describe().show()

In [16]:
# spark.conf.set("spark.sql.debug.maxToStringFields", 10)

In [17]:
for ticker, df in stock_dfs.items():
    show_summary(df, ticker)


AAPL Statistical Summary:
+-------+------------------+-----------------+------------------+------------------+------------------+--------------------+------+
|summary|              Open|             High|               Low|             Close|         Adj Close|              Volume|ticker|
+-------+------------------+-----------------+------------------+------------------+------------------+--------------------+------+
|  count|               254|              254|               254|               254|               254|                 254|   254|
|   mean| 95.17796276122566|96.57026571739377| 93.82802144748958| 95.26071827805887| 93.30824790413924|1.5734118582677165E8|  NULL|
| stddev|22.014833707521472|22.09909824629524| 21.57955997463146|21.810136925990065|21.574201492769824| 6.978351522681883E7|  NULL|
|    min| 57.02000045776367|           57.125| 53.15250015258789|56.092498779296875| 54.77680206298828|            46691300|  AAPL|
|    max| 138.0500030517578|138.7899932861328|134.33999633789062|136.6900024
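
The `stddev` row reported by `describe()` is the sample standard deviation (denominator n - 1), not the population value. As a small worked example, the sketch below recomputes the mean and both standard deviations for the first five AAPL `Close` values printed earlier, using Python's `statistics` module in place of Spark.

```python
import statistics

# First five AAPL 'Close' values from the "Top 5 Rows" output above
closes = [
    73.4124984741211,
    75.0875015258789,
    74.35749816894531,
    74.94999694824219,
    74.59750366210938,
]

mean_close = statistics.fmean(closes)
sample_sd = statistics.stdev(closes)       # divides by n - 1, like Spark's stddev
population_sd = statistics.pstdev(closes)  # divides by n

print(f"mean={mean_close:.4f} sample_sd={sample_sd:.4f} population_sd={population_sd:.4f}")
```

With 254 rows per ticker the two estimates differ very little, but the distinction matters when comparing against tools that default to the population formula.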

#### **Checking for Missing Values**

In [18]:
def check_missing_values(df, name):
    """Count NULL values in each column of a DataFrame"""
    print(f"\nMissing Values in {name}:")
    df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [19]:
for ticker, df in stock_dfs.items():
    check_missing_values(df, ticker)


Missing Values in AAPL:
+----+----+----+---+-----+---------+------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|ticker|
+----+----+----+---+-----+---------+------+------+
|   0|   0|   0|  0|    0|        0|     0|     0|
+----+----+----+---+-----+---------+------+------+


Missing Values in AMZN:
+----+----+----+---+-----+---------+------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|ticker|
+----+----+----+---+-----+---------+------+------+
|   0|   0|   0|  0|    0|        0|     0|     0|
+----+----+----+---+-----+---------+------+------+


Missing Values in MSFT:
+----+----+----+---+-----+---------+------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|ticker|
+----+----+----+---+-----+---------+------+------+
|   0|   0|   0|  0|    0|        0|     0|     0|
+----+----+----+---+-----+---------+------+------+


Missing Values in TSLA:
+----+----+----+---+-----+---------+------+------+
|Date|Open|High|Low|Close|Adj Close|Volume|ticker|
+----+----+----+---+-----+----

#### **Explore Tweet Data**

In [20]:
show_top_rows(tweets_df, "Tweet Data")
show_summary(tweets_df, "Tweet Data")
check_missing_values(tweets_df, "Tweets")


Tweet Data Top 5 Rows:
+------+----------+------+-------------------------------------------------------------------------------------------------------------------------------------------+
|id    |date      |ticker|tweet                                                                                                                                      |
+------+----------+------+-------------------------------------------------------------------------------------------------------------------------------------------+
|100001|2020-01-01|AMZN  |$AMZN Dow futures up by 100 points already 🥳                                                                                              |
|100002|2020-01-01|TSLA  |$TSLA Daddy's drinkin' eArly tonight! Here's to a PT of ohhhhh $1000 in 2020! 🍻                                                           |
|100003|2020-01-01|AAPL  |$AAPL We’ll been riding since last December from $172.12 what to do. Decisions decisions hmm 🤔. I have 20 mins to dec

+---+----+------+-----+
| id|date|ticker|tweet|
+---+----+------+-----+
|  0|4177|     0|    0|
+---+----+------+-----+

In [21]:
# Tweet counts
print("\nTweet Count by Company:")
tweets_df.groupBy("ticker").count().orderBy(desc("count")).show()

print("\nTweet Count by Date (Top 10):")
tweets_df.groupBy("date").count().orderBy(desc("count")).show(10)


Tweet Count by Company:
+------+-----+
|ticker|count|
+------+-----+
|  TSLA| 4341|
|  AAPL| 1721|
|  AMZN|  407|
|  MSFT|  271|
|    FB|  204|
| GOOGL|   17|
+------+-----+


Tweet Count by Date (Top 10):
+----------+-----+
|      date|count|
+----------+-----+
|      NULL| 4177|
|2020-03-09|  137|
|2020-01-09|  114|
|2020-01-05|  108|
|2020-02-09|  102|
|2020-04-09|  101|
|2020-08-09|   78|
|2020-09-09|   65|
|2020-10-09|   63|
|2020-03-03|   60|
+----------+-----+
only showing top 10 rows
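
The 4177 NULL dates deserve a closer look. Every non-NULL date in the top 10 has a day of month of 12 or less, which is the pattern one would expect if the raw file stores day-first dates (`dd/MM/yyyy`) while `load_tweet_data` parses them month-first (`MM/dd/yyyy`). This is an assumption that has to be confirmed against the raw CSV. The sketch below illustrates the effect with Python's `strptime` (which uses `%d/%m/%Y` style codes rather than Spark's pattern letters); the two raw strings are hypothetical.

```python
from datetime import date, datetime, timedelta


def try_parse(text, fmt):
    """Return a date if text matches fmt, otherwise None (analogous to a NULL)."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


# Hypothetical raw strings, assuming the file stores day-first dates
for raw in ["03/09/2020", "25/09/2020"]:
    print(raw,
          "| month-first:", try_parse(raw, "%m/%d/%Y"),
          "| day-first:", try_parse(raw, "%d/%m/%Y"))

# Share of 2020 calendar days whose day of month exceeds 12; such a day
# cannot be read as a month, so month-first parsing of day-first text fails
days_2020 = [date(2020, 1, 1) + timedelta(days=i) for i in range(366)]
unparseable_share = sum(d.day > 12 for d in days_2020) / len(days_2020)

# NULL dates divided by total tweets (6961 is the sum of the per-ticker counts above)
null_share = 4177 / 6961

print(f"calendar share: {unparseable_share:.3f}, observed NULL share: {null_share:.3f}")
```

If the two shares are close and the raw file confirms day-first strings, changing the pattern in `load_tweet_data` to `"dd/MM/yyyy"` would be the natural fix; the counts by date would then need to be recomputed.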



<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Data Preprocessing</h3>