<h2 align="center" style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Forecasting Stock Prices using Sentiment Analysis and Time Series Models: An Advanced Data Analytics Approach</h2>

### **Table of Contents**

- [Introduction](#Introduction)
   - Assessment Overview
   - Objectives
   - Data Source and Storage
- [Install and Import Required Libraries](#Install-and-Import-Required-Libraries)
- [Load Dataset](#Load-Dataset)

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Introduction</h3>

### Assessment Overview

### Objectives

### Data Source and Storage

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Install and Import Required Libraries</h3>

In [1]:
!pip install -q pyspark pandas matplotlib seaborn plotly

In [2]:
import os
from datetime import datetime, timedelta

import sqlite3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when, lit, to_date, avg, stddev, desc, first
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType
from pyspark.sql.window import Window

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Load Dataset</h3>

#### **Initialize Spark Session and Define Data Paths**

In [3]:
# Start (or reuse) a Spark session with 10 GB of off-heap memory enabled
spark = SparkSession.builder \
    .appName("Stock Tweet Analysis") \
    .config("spark.memory.offHeap.enabled", "true") \
    .config("spark.memory.offHeap.size", "10g") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/21 10:21:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
spark.sparkContext.setLogLevel("ERROR")

In [5]:
tweet_data_path = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stocktweet/stocktweet.csv"
stock_price_folder = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stockprice"
db_path = "stock_analysis.db"

#### **Define Data Schemas**

In [6]:
# Define the companies to analyze
companies = ['AAPL', 'AMZN', 'MSFT', 'TSLA', 'GOOGL', 'FB']

In [7]:
# Define schema for tweet data
tweet_schema = StructType([
    StructField("id", StringType(), True),
    StructField("date", StringType(), True),
    StructField("ticker", StringType(), True),
    StructField("tweet", StringType(), True)
])

In [8]:
# Define schema for stock price data
stock_schema = StructType([
    StructField("Date", StringType(), True),
    StructField("Open", DoubleType(), True),
    StructField("High", DoubleType(), True),
    StructField("Low", DoubleType(), True),
    StructField("Close", DoubleType(), True),
    StructField("Adj Close", DoubleType(), True),
    StructField("Volume", LongType(), True)
])

#### **Define Helper Functions for Loading Data**

In [9]:
def load_tweet_data():
    """Load the tweet CSV, parse its dates, and keep only the selected companies."""
    df = spark.read.csv(tweet_data_path, header=True, schema=tweet_schema)
    # Parse the MM/dd/yyyy date string into a date column
    df = df.withColumn("date", to_date(col("date"), "MM/dd/yyyy"))
    # Keep only tweets whose ticker is in `companies`
    df = df.filter(col("ticker").isin(companies))
    return df
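
The tweet loader parses dates with the Spark pattern `MM/dd/yyyy`. Strings that do not match the pattern typically become null (or raise an error under ANSI mode), so it can help to spot-check a few raw values first. The following is a minimal sketch using only the standard library. It assumes `%m/%d/%Y` is an adequate Python stand-in for the Spark pattern. The helper `matches_format` and the sample strings are hypothetical and not taken from the dataset. Python's parser is not identical to Spark's, so treat this as an approximate check only.

```python
from datetime import datetime

# Assumed Python counterpart of the Spark pattern "MM/dd/yyyy"
PY_TWEET_DATE_FORMAT = "%m/%d/%Y"

def matches_format(value, fmt=PY_TWEET_DATE_FORMAT):
    """Return True if `value` parses under `fmt`, False otherwise."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        # ValueError: string does not fit the format; TypeError: value is not a string
        return False

# Hypothetical sample values for illustration only
samples = ["01/15/2020", "2020-01-15", "13/01/2020", None]
results = {s: matches_format(s) for s in samples}
print(results)
```

Any sample that comes back `False` is a candidate for a null date after `to_date`, which is worth checking before filtering or joining on the date column.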

In [10]:
def load_stock_data(ticker):
    """Load the price CSV for one ticker, parse its dates, and tag rows with the ticker."""
    file_path = os.path.join(stock_price_folder, f"{ticker}.csv")
    df = spark.read.csv(file_path, header=True, schema=stock_schema)
    # Parse the yyyy-MM-dd date string into a date column
    df = df.withColumn("Date", to_date(col("Date"), "yyyy-MM-dd"))
    # Add a ticker column so rows stay identifiable once frames are combined
    df = df.withColumn("ticker", lit(ticker))
    return df
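
Because `load_stock_data` builds each path as `<stock_price_folder>/<ticker>.csv`, a missing file for any ticker would make the read fail. The sketch below shows one possible pre-flight check using only the standard library. The helper name `find_missing_price_files` is hypothetical, and the demo runs against a temporary directory rather than the Kaggle input folder.

```python
import os
import tempfile

def find_missing_price_files(folder, tickers):
    """Return the tickers whose '<ticker>.csv' file is absent from `folder`."""
    return [t for t in tickers
            if not os.path.isfile(os.path.join(folder, f"{t}.csv"))]

# Demo against a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as demo_dir:
    # Create an empty AAPL.csv; MSFT.csv is deliberately left out
    open(os.path.join(demo_dir, "AAPL.csv"), "w").close()
    demo_missing = find_missing_price_files(demo_dir, ["AAPL", "MSFT"])

print(demo_missing)  # ['MSFT']
```

In this notebook, the equivalent call would be `find_missing_price_files(stock_price_folder, companies)`, run before the loading loop. An empty list means every expected file is present.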

#### **Load Datasets using Helper Functions**

In [11]:
print("Loading tweet data...")
# Spark evaluates lazily: this builds the read plan, and rows are only read when an action runs
tweets_df = load_tweet_data()
print("Tweet data loaded")

Loading tweet data...
Tweet data loaded


In [12]:
print("Loading stock price data...")
# Map each ticker to its own price DataFrame
stock_dfs = {company: load_stock_data(company) for company in companies}
print("Stock price data loaded")

Loading stock price data...
Stock price data loaded


<h3 style="background-color:#2D3436;color:white;border-radius:8px;padding:15px">Data Exploration</h3>