### Tomes, Christopher
### CS4650 Big Data and Cloud COmputing
### Cal Poly Pomona

### Github: https://github.com/Ctomes/Project_5_Stocks
### Youtube Video: https://youtu.be/8mYGxZ74g8c

### Final Score:
![Alt Text](portfolio.jpg)

These were screenshots of all of my predictions:

![Alt Text](screenshot1.jpg)
![Alt Text](screenshot2.jpg)
![Alt Text](screenshot3.jpg)
![Alt Text](screenshot4.jpg)
![Alt Text](screenshot5.jpg)
![Alt Text](screenshot6.jpg)
![Alt Text](screenshot7.jpg)



## This Notebook will show the process I used to build a program that creates Stock Predictions.

### The stocks considered: "AAPL","AMZN","GOOGL","MSFT","NFLX","TSLA","NVDA","INTC"


## Gather Data:
To make a prediction we are going to use a few datasets: Historic Stock Information, Twitter Engagement, Google Trends, and Other Stock Dependencies. 

# Historic Stock Information Gathering:

In [7]:
import requests
import json
import os

Stock Information is available through many databases and services. 
For this assignment I decided to use TwelveData which offers a free tier for their Stock API which is updated in realtime. 
You will need to create an account and recieve an API key.

In [8]:

# api key is stored in file: 'api_key.txt'
api_key = ""

with open("api_key.txt", 'r') as file:
    api_key = file.readline().strip()

# Time to start collecting information from: 
start_date = "04/01/2023 8:00 PM"

# The model will predict the price of stock based on one interval forward. 
# Supports: 1min, 5min, 15min, 30min, 45min, 1h, 2h, 4h, 1day, 1week, 1month
interval = "4h"

# Stocks to train on:
tickers = {"AAPL","AMZN","GOOGL","MSFT","NFLX","TSLA","NVDA","INTC"}

We will now query the API and store the data into text files for later referal. 

In [10]:
print('Requesting Data:')
for ticker in tickers:
    url = f"https://api.twelvedata.com/time_series?symbol={ticker}&interval={interval}&format=JSON&start_date={start_date}&apikey={api_key}"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        filename = f"{ticker}.json"
        filepath = os.path.join("nb_stock_data", filename)
        with open(filepath, "w") as f:
            json.dump(data, f)
        print(f"JSON data saved to file {filepath}!")
    else:
        print("Request failed with status code:", response.status_code)
        break;
print('Loop Complete')    

Requesting Data:
JSON data saved to file notebook_stock_data\MSFT.json!
JSON data saved to file notebook_stock_data\AMZN.json!
JSON data saved to file notebook_stock_data\GOOGL.json!
JSON data saved to file notebook_stock_data\TSLA.json!
JSON data saved to file notebook_stock_data\INTC.json!
JSON data saved to file notebook_stock_data\AAPL.json!
JSON data saved to file notebook_stock_data\NFLX.json!
JSON data saved to file notebook_stock_data\NVDA.json!
Loop Complete


# Gathering Twitter Engagement

The module used for this part is snscrape. It is a scraper for social networking services and we will be using their TwitterSearchScraper to gather Tweet data.

In [43]:
# importing libraries and packages
import snscrape.modules.twitter as sntwitter
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [13]:
#Create a dictionary relating each Ticker to Twitter username:
stocks = {"AAPL": "Apple","AMZN": "amazon","GOOGL": "Google","MSFT": "Microsoft","NFLX": "netflix","TSLA": "Tesla","NVDA": "nvidia","INTC" : "intel"}

# Creating list to append tweet data 
tweets_list1 = []
for stock in stocks:
   print(stocks[stock])

   if stocks[stock] == 'Apple':
      #For AAPL: Apple account doesn't tweet, they will recieve a default value for now.
      continue

# Using TwitterSearchScraper to scrape each account's tweets and append to list
   for i,tweet in enumerate(sntwitter.TwitterSearchScraper('from:'+stocks[stock]).get_items()):

      if i>100: # Number of tweets to store
         break
      # Declare the attributes to be returned
      tweets_list1.append([stock,tweet.date, tweet.id, tweet.rawContent, tweet.replyCount, tweet.likeCount, tweet.quoteCount, tweet.viewCount, tweet.vibe, tweet.retweetCount, tweet.conversationId]) 

    
# Create a Dataframe from the tweets list above 
tweets_df1 = pd.DataFrame(tweets_list1, columns=['STOCK', 'Datetime', 'Tweet Id', 'Text', 'Reply Count', 'Like Count', 'Quote Count', 'View Count', 'Vibe', 'Retweet Count', "Conversation Id"])

filename = "tweets_nb.csv"
tweets_df1.to_csv(filename)

Apple
amazon
Google
Microsoft
netflix
Tesla


Could not translate t.co card URL on tweet 1648805800092073986


nvidia
intel


# Process and Clean Twitter Data
Clean the data, remove uneccessary columns, and calculate our own metric to determine a good/bad tweet.

In [59]:
# Load the CSV file into a DataFrame
df = pd.read_csv('tweets_nb.csv')

columns_to_remove = ['Tweet Id', 'Text', 'Quote Count','Vibe', 'Retweet Count', 'Conversation Id', 'Reply Count']

# Drop the unnamed index column and other unecessary columns
df = df.iloc[:, 1:]
df = df.drop(columns_to_remove, axis=1)

# Get the unique values from the first column
unique_values = df['STOCK'].unique()

# Create a dictionary to store the split DataFrames
dfs = {}

# Split the DataFrame based on unique values in the first column
for value in unique_values:
    dfs[value] = df[df['STOCK'] == value].copy()


# Create the directory if it doesn't exist
os.makedirs("nb_twitter_data", exist_ok=True)

# Access the split DataFrames using the unique values
for value, split_df in dfs.items():
    try:


        # Remove nulls
        dfs[value] = dfs[value].dropna()
        
        #group by day.
        dfs[value]['date'] = pd.to_datetime(split_df['Datetime']).dt.date
        dfs[value] = dfs[value].groupby('date').sum(['Like Count', 'View Count', 'Engagement Score'])

        #Calculate the Engagement for any particular day.
        dfs[value]['Engagement Score']  = dfs[value]['Like Count'] / dfs[value]['View Count']
        dfs[value]['Engagement Score']  = dfs[value]['Engagement Score'] - dfs[value]['Engagement Score'].median()

        print(f"DataFrame for {value}:")
        print(dfs[value])

        # Store to CSV 
        filename = (f"{value}.csv")
        filepath = os.path.join("nb_twitter_data", filename)
        dfs[value].to_csv(filepath)
    except Exception as e:
        print(f"An error occurred while writing CSV file for '{value}': {str(e)}")


DataFrame for AMZN:
            Like Count  View Count  Engagement Score
date                                                
2023-05-15         596    175266.0          0.000421
2023-05-16         182    106802.0         -0.001275
2023-05-17         257    167267.0         -0.001443
2023-05-18        1350    431495.0          0.000149
2023-05-19         140     46990.0          0.000000
DataFrame for GOOGL:
            Like Count  View Count  Engagement Score
date                                                
2023-05-18        2129    853940.0          0.000000
2023-05-19        1890    698404.0          0.000213
2023-05-20           0        28.0         -0.002493
DataFrame for MSFT:
            Like Count  View Count  Engagement Score
date                                                
2023-05-05         603    214130.0          0.000000
2023-05-08         721    348736.0         -0.000749
2023-05-09         431    213213.0         -0.000795
2023-05-10           5      1703.0    

# Google Trends Data Collection:

In [60]:
from pytrends.request import TrendReq
import time

def predict_interest(keyword, starttime):
    time.sleep(1)
    prediction ={
                 "current_trend": 1.0,
                 "predicted_trend": 1.0,
                 "delta_trend": 0}
    
    if keyword == 'NA':
        return prediction
    
    # Set up Google Trends API
    pytrends = TrendReq(hl='en-US', tz=360)
    
    timeframe = starttime
    # Query Google Trends for interest over time
    pytrends.build_payload(kw_list=[keyword], timeframe=timeframe)
    interest_over_time = pytrends.interest_over_time()
    # Convert the data to a pandas DataFrame
    df = pd.DataFrame(interest_over_time)
    df = df.drop(df.index[-1])

    return df[keyword]


# Build Model

In [78]:
import json
from sklearn.linear_model import LinearRegression
import update_trends_data
import numpy as np
import warnings

# Suppress the specific warning
warnings.filterwarnings("ignore", category=UserWarning)

In [62]:
# Determine how many days in the past.
def date_to_ints(df, col_name):
    datetime_col = pd.to_datetime(df[col_name])
    days_since_today = (datetime_col.dt.day)
    return days_since_today

In [64]:
# Convert a datetime into the seconds of the day for training. 
def datetime_to_seconds(df, col_name):

    # Extract the datetime column and convert it to datetime type if it's not already
    datetime_col = pd.to_datetime(df[col_name])

    # Convert datetime values to the number of seconds since midnight
    seconds_since_midnight = (datetime_col.dt.hour * 3600) + (datetime_col.dt.minute * 60) + datetime_col.dt.second

    return seconds_since_midnight

In [67]:
# Get the path to the data directory
data_dir = './nb_stock_data/'

# Get a list of all JSON files in the data directory 
json_files = [f for f in os.listdir(data_dir) if f.endswith('.json')]

In [104]:
conversion = []
best_meta = 0
total_conversion = 0.0

# Loop through the JSON files and load the data from each file
for json_file in json_files:
    with open(data_dir + json_file, 'r') as f:
        data = json.load(f)

    # Access the 'meta' and 'values' keys in the loaded JSON data
    meta_data = data['meta']
    values_data = data['values']

    # Convert the data to a Pandas DataFrame
    df = pd.DataFrame(values_data)
    df['next_close'] = df['close'].shift(+1)

    # Convert datetime column to datetime type
    df['datetime'] = pd.to_datetime(df['datetime'])

    #use name of files 'AMZN,GOOGL, to calculate interest.
    stock = json_file.split('.')[0]

    #Query Google Trends for particular Stocks interest from today to last 3 months. 
    stocktrends = update_trends_data.predict_interest(keyword=stock, starttime='today 3-m')

    #Load the Twitter Engagement Scores
    filename = (f"{stock}.csv")
    filepath = os.path.join("nb_twitter_data", filename)
    twitter = pd.read_csv(filepath)
    
    twitter['date'] = pd.to_datetime(twitter['date']).dt.date
    df['date'] = pd.to_datetime(df['datetime']).dt.date

    length = len(df)
    df['like_count'] = np.zeros(length)
    df['view_count'] = np.zeros(length)
    df['engagement'] = np.zeros(length)
    for index, row in df.iterrows():
        date = row['date']
        tempval = 0
        for t_index, t_row in twitter.iterrows():
            t_date = t_row['date']
            if date == t_date:
                df.at[index, 'like_count'] = t_row['Like Count']
                df.at[index, 'view_count'] = t_row['View Count']
                df.at[index, 'engagement'] = t_row['Engagement Score']
                tempval = 1

        date = pd.to_datetime(date)
        
    # Check if the date exists in the data structure
        if date in stocktrends:
            df.at[index, 'value'] = stocktrends[date]
       

    # Calculate the mean of the 'value' column
    mean_value = df['value'].mean()

    # Fill missing values in the 'value' column with the mean
    df['value'].fillna(mean_value, inplace=True)



# Convert all other columns to numeric type
    df[['open', 'high', 'low', 'close', 'volume', 'next_close']] = df[['open', 'high', 'low', 'close', 'volume', 'next_close']].apply(pd.to_numeric)
    df['time'] = df['datetime'].dt.time
    
    df['time'] = datetime_to_seconds(df, 'datetime')
    # Convert 'date' column to real numbers
    df['date'] = date_to_ints(df, 'date')




    #print('Beginning Correlation Dependencies:')
    #This code now adds a correlation component relating each other potential Stock to the current stock. These companies are related and there should be treated as such. 
    #Area of improvement is grabbing the pretrained datasets and then combining at the end instead of doing it here. Then we can grab values like trends and engagement.
    for other_json in json_files:
        if other_json == json_file:
            #print('Skipping: ',other_json)
            continue
        #print('Beginning: ',other_json)
        with open(data_dir + other_json, 'r') as f:
            other_json_data = json.load(f)
            other_vals = pd.DataFrame(other_json_data['values'])
            df[other_json.split('.')[0] + '_open'] = other_vals['open'].apply(pd.to_numeric)
            df[other_json + '_close'] = other_vals['close'].apply(pd.to_numeric)
            
            df[other_json.split('.')[0] + 'volume'] = other_vals['volume'].apply(pd.to_numeric)

    

    predicted_class= df.iloc[0]
    df.drop(df.head(1).index, inplace=True)

# Create a new dataframe with the closing prices and the next closing price
# extract the time component
    columns_to_drop = ['next_close', 'datetime'] # no need for datetime since we have date component. 

    y = df['next_close']
    X = df.drop(columns_to_drop, axis=1)


    print('Beginning LR for', stock)
    model = LinearRegression()
    model.fit(X, y)

    # Predict the next closing price based on the most recent closing price
    last_close = predicted_class.drop(columns_to_drop)
    next_close = model.predict([last_close])

    
    print('Current price:',last_close[2] )
    print('Predicted next closing price:',next_close[0] )
    print((next_close[0]/last_close[2]), ' of my money is expected to exist after trade')

    #Store predictions/conversions
    total_conversion= total_conversion + (next_close[0]/last_close[2])
    conversion.append([meta_data['symbol'], (next_close[0]/last_close[2]), last_close[2]])
    

Beginning LR for AAPL
Current price: 174.94
Predicted next closing price: 176.2078765532871
1.0072474937309197  of my money is expected to exist after trade
Beginning LR for AMZN
Current price: 115.7
Predicted next closing price: 115.02477064756432
0.9941639641103226  of my money is expected to exist after trade
Beginning LR for GOOGL
Current price: 122.15
Predicted next closing price: 123.0870941088829
1.007671666875832  of my money is expected to exist after trade
Beginning LR for INTC
Current price: 29.84
Predicted next closing price: 29.813158784989348
0.9991004954755144  of my money is expected to exist after trade
Beginning LR for MSFT
Current price: 317.26999
Predicted next closing price: 318.67473063525887
1.0044275874792281  of my money is expected to exist after trade
Beginning LR for NFLX
Current price: 364.23001
Predicted next closing price: 351.8781428136666
0.9660877279542853  of my money is expected to exist after trade
Beginning LR for NVDA
Current price: 311.13
Predict

In [105]:
for conv in conversion:
    print(conv)
    if conv[1] > 1.0:
        total_conversion+= conv[1]
print(total_conversion)
print('My pie chart of assets should be:')
for conv in conversion:
    if conv[1] < 1:
        print('SELL: ',conv[0], 'shares.')
        continue
    conv[1]/=total_conversion
    print('BUY: ', conv[0], (int)(conv[1]*1000000/conv[2]), 'shares.')

['AAPL', 1.0072474937309197, 174.94]
['AMZN', 0.9941639641103226, 115.7]
['GOOGL', 1.007671666875832, 122.15]
['INTC', 0.9991004954755144, 29.84]
['MSFT', 1.0044275874792281, 317.26999]
['NFLX', 0.9660877279542853, 364.23001]
['NVDA', 1.0125148230889494, 311.13]
['TSLA', 0.9860001094732429, 178.7]
12.009075439363222
My pie chart of assets should be:
BUY:  AAPL 479 shares.
SELL:  AMZN shares.
BUY:  GOOGL 686 shares.
SELL:  INTC shares.
BUY:  MSFT 263 shares.
SELL:  NFLX shares.
BUY:  NVDA 270 shares.
SELL:  TSLA shares.
