# Stock Data Project 
Objective: Predict future stock prices (e.g., next day, week, or month) using historical data and external factors.

Type of Prediction:

Regression: Predict exact future prices.

Classification: Predict price movement direction (up/down/stable).
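
As a rough illustration of the two prediction types, the sketch below derives both targets from a `Close` column. The helper name `add_targets`, the column names `Target_Close` and `Target_Direction`, and the `stable_band` threshold are assumptions made for this example, not part of the project code.

```python
import pandas as pd

def add_targets(prices: pd.DataFrame, horizon: int = 1, stable_band: float = 0.002) -> pd.DataFrame:
    """Attach a regression target and a direction label for a given horizon.

    `horizon` is counted in trading days (rows); `stable_band` is a
    hypothetical threshold below which a move counts as "stable".
    """
    out = prices.copy()
    # Regression target: the closing price `horizon` rows ahead.
    out["Target_Close"] = out["Close"].shift(-horizon)
    # Forward return, used only to derive the class label.
    fwd_return = out["Target_Close"] / out["Close"] - 1
    out["Target_Direction"] = "stable"
    out.loc[fwd_return > stable_band, "Target_Direction"] = "up"
    out.loc[fwd_return < -stable_band, "Target_Direction"] = "down"
    # The last `horizon` rows have no future price, so drop them.
    return out.dropna(subset=["Target_Close"])

# Made-up prices, only to show the mechanics.
toy = pd.DataFrame({"Close": [100.0, 101.0, 101.1, 99.0, 99.0]})
labeled = add_targets(toy, horizon=1)
print(labeled)
```

Setting `horizon` to roughly 5 or 21 rows would approximate the weekly and monthly cases mentioned in the objective.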

# Data Collection

In [165]:
import pandas as pd
import pyodbc
import numpy as np

In [166]:
import yfinance as yf
ticker = "GE"

df = yf.Ticker(ticker).history(period="max")
print(df.head())


                               Open      High       Low     Close  Volume  \
Date                                                                        
1962-01-02 00:00:00-05:00  0.623745  0.634141  0.617508  0.621666  432682   
1962-01-03 00:00:00-05:00  0.618548  0.618548  0.613350  0.615429  296467   
1962-01-04 00:00:00-05:00  0.615429  0.620627  0.602954  0.608152  368581   
1962-01-05 00:00:00-05:00  0.608152  0.609192  0.582163  0.592559  546862   
1962-01-08 00:00:00-05:00  0.592558  0.592558  0.573846  0.591519  620978   

                           Dividends  Stock Splits  
Date                                                
1962-01-02 00:00:00-05:00        0.0           0.0  
1962-01-03 00:00:00-05:00        0.0           0.0  
1962-01-04 00:00:00-05:00        0.0           0.0  
1962-01-05 00:00:00-05:00        0.0           0.0  
1962-01-08 00:00:00-05:00        0.0           0.0  


# Data Cleaning 

Data Wrangler suggests the data is already fairly clean, but to ensure data integrity I'm going to apply some standard cleaning steps with pandas anyway.

We have to ensure there is no missing data, a consistent format, no unwanted observations, no incorrect data, and no duplicates. We don't need to manage outliers.

* Removal of rows with missing data

In [167]:
df = df.dropna()

* Ensures that the dates are in the correct format

In [168]:
df = df.reset_index()

# Convert tz-aware timestamps to naive UTC; naive timestamps are left as they are
if df["Date"].dt.tz is not None:
    df["Date"] = df["Date"].dt.tz_convert(None)
# Truncate to midnight so that only the calendar date remains
df["Date"] = df["Date"].dt.normalize()

* Ensures that there is no wrong data

In [169]:
columns = list(df.columns)
columns.remove('Date')
columns.remove('Volume')

# Ensure that the columns are of the correct data type
for column in columns:
    df[column] = pd.to_numeric(df[column], errors='coerce', downcast='float')

df['Volume'] = pd.to_numeric(df['Volume'], errors='coerce', downcast='integer')


# Make sure that no number is negative; if one is, flip its sign
columns.append('Volume')
df[columns] = df[columns].abs()

* Removing duplicates 

In [170]:
df.drop_duplicates(inplace = True)
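
The cleaning steps above can also be collected into one function, which makes them easy to reuse on other tickers. This is a minimal sketch on made-up data: `clean_prices` is a hypothetical helper, and it uses `tz_localize(None)` (keep the local calendar date) rather than the `tz_convert(None)` call used above, which is a deliberate variation.

```python
import pandas as pd

def clean_prices(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a frame whose index is a datetime index named 'Date'."""
    out = raw.dropna().reset_index()
    # Drop any timezone and keep only the calendar date.
    if out["Date"].dt.tz is not None:
        out["Date"] = out["Date"].dt.tz_localize(None)
    out["Date"] = out["Date"].dt.normalize()
    # Coerce every non-date column to numeric, then take absolute values.
    value_cols = [c for c in out.columns if c != "Date"]
    out[value_cols] = out[value_cols].apply(pd.to_numeric, errors="coerce").abs()
    return out.drop_duplicates().reset_index(drop=True)

# Toy input with one missing value, one duplicated row, and negative volumes.
idx = pd.DatetimeIndex(
    ["2025-02-10", "2025-02-11", "2025-02-11", "2025-02-12"],
    tz="America/New_York", name="Date",
)
toy = pd.DataFrame(
    {"Close": [205.22, 208.82, 208.82, None],
     "Volume": [3787500, -4110900, -4110900, 4595200]},
    index=idx,
)
cleaned = clean_prices(toy)
print(cleaned)
```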

# More Data 

* This is where I collect more data for the model to learn from and do some feature engineering. Many of the steps repeat those from the previous sections.

In [171]:

df['SMA_10'] = df['Close'].rolling(window = 10).mean()
df['SMA_50'] = df['Close'].rolling(window = 50).mean()
df['SMA_100'] = df['Close'].rolling(window = 100).mean()
df['SMA_200'] = df['Close'].rolling(window = 200).mean()
print(df.tail())

            Date        Open        High         Low       Close   Volume  \
15883 2025-02-10  206.139999  206.660004  203.399994  205.220001  3787500   
15884 2025-02-11  205.000000  209.149994  204.440002  208.820007  4110900   
15885 2025-02-12  206.699997  211.419998  206.139999  209.639999  4595200   
15886 2025-02-13  211.119995  211.300003  206.270004  208.360001  3763300   
15887 2025-02-14  208.660004  209.160004  204.970001  208.270004  4267200   

       Dividends  Stock Splits      SMA_10      SMA_50     SMA_100     SMA_200  
15883        0.0           0.0  203.010001  180.747072  181.801186  172.836767  
15884        0.0           0.0  204.449002  181.237477  182.091682  173.088728  
15885        0.0           0.0  205.503001  181.824152  182.358277  173.334488  
15886        0.0           0.0  205.782001  182.354077  182.586053  173.568424  
15887        0.0           0.0  206.252000  182.916547  182.800270  173.791262  


In [172]:
df['EMA_10'] = df['Close'].ewm(span=10).mean()
df['EMA_50'] = df['Close'].ewm(span=50).mean()
df['EMA_100'] = df['Close'].ewm(span=100).mean()
df['EMA_200'] = df['Close'].ewm(span=200).mean()
print(df.tail())

            Date        Open        High         Low       Close   Volume  \
15883 2025-02-10  206.139999  206.660004  203.399994  205.220001  3787500   
15884 2025-02-11  205.000000  209.149994  204.440002  208.820007  4110900   
15885 2025-02-12  206.699997  211.419998  206.139999  209.639999  4595200   
15886 2025-02-13  211.119995  211.300003  206.270004  208.360001  3763300   
15887 2025-02-14  208.660004  209.160004  204.970001  208.270004  4267200   

       Dividends  Stock Splits      SMA_10      SMA_50     SMA_100  \
15883        0.0           0.0  203.010001  180.747072  181.801186   
15884        0.0           0.0  204.449002  181.237477  182.091682   
15885        0.0           0.0  205.503001  181.824152  182.358277   
15886        0.0           0.0  205.782001  182.354077  182.586053   
15887        0.0           0.0  206.252000  182.916547  182.800270   

          SMA_200      EMA_10      EMA_50     EMA_100    EMA_200  
15883  172.836767  201.969537  186.177389  180.18

In [173]:
import ta

df['RSI'] = ta.momentum.rsi(df['Close'], window = 14, fillna = True)
df['MACD'] = ta.trend.macd(df['Close'], fillna = True)
df['MACD_signal'] = ta.trend.macd_signal(df['Close'], fillna = True)
df['Stochastic'] = ta.momentum.stoch(df['High'], df['Low'], df['Close'], fillna = True)
df['Bollinger_High'] = ta.volatility.bollinger_hband(df['Close'], fillna = True)
df['Bollinger_Low'] = ta.volatility.bollinger_lband(df['Close'], fillna = True)
df['ATR'] = ta.volatility.average_true_range(df['High'], df['Low'], df['Close'], fillna = True)
df['OBV'] = ta.volume.on_balance_volume(df['Close'], df['Volume'], fillna = True)
df['Chaikin'] = ta.volume.chaikin_money_flow(df['High'], df['Low'], df['Close'], df['Volume'], fillna = True)
df['ADX'] = ta.trend.adx(df['High'], df['Low'], df['Close'], fillna = True)
df['SAR_Down'] = ta.trend.psar_down(df['High'], df['Low'], df['Close'], fillna = True)
df['SAR_Up'] = ta.trend.psar_up(df['High'], df['Low'], df['Close'], fillna = True)
df['MA_Crossover'] = df['SMA_10'] - df['SMA_50']
df['BB_Width'] = df['Bollinger_High'] - df['Bollinger_Low']
print(df.tail())

            Date        Open        High         Low       Close   Volume  \
15883 2025-02-10  206.139999  206.660004  203.399994  205.220001  3787500   
15884 2025-02-11  205.000000  209.149994  204.440002  208.820007  4110900   
15885 2025-02-12  206.699997  211.419998  206.139999  209.639999  4595200   
15886 2025-02-13  211.119995  211.300003  206.270004  208.360001  3763300   
15887 2025-02-14  208.660004  209.160004  204.970001  208.270004  4267200   

       Dividends  Stock Splits      SMA_10      SMA_50  ...  Bollinger_High  \
15883        0.0           0.0  203.010001  180.747072  ...      216.224343   
15884        0.0           0.0  204.449002  181.237477  ...      216.226974   
15885        0.0           0.0  205.503001  181.824152  ...      216.538182   
15886        0.0           0.0  205.782001  182.354077  ...      216.579162   
15887        0.0           0.0  206.252000  182.916547  ...      215.856579   

       Bollinger_Low       ATR         OBV   Chaikin        AD

In [174]:
import requests
apikey = '####'

url = f'https://www.alphavantage.co/query?function=FEDERAL_FUNDS_RATE&interval=daily&apikey={apikey}&datatype=csv'
df_fed = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=CPI&interval=monthly&apikey={apikey}&datatype=csv'
df_inflation = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=UNEMPLOYMENT&apikey={apikey}&datatype=csv'
df_unemployment = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=EARNINGS&symbol={ticker}&apikey={apikey}'
r = requests.get(url)

dataAnnual = r.json()['annualEarnings']
dataQuarter = r.json()['quarterlyEarnings']
df_annual_earnings = pd.DataFrame(dataAnnual)
df_quarterly_earnings = pd.DataFrame(dataQuarter)

url = f'https://www.alphavantage.co/query?function=ALL_COMMODITIES&interval=monthly&apikey={apikey}&datatype=csv'
df_commodities = pd.read_csv(url)

print(df_fed)
print(df_inflation)
print(df_unemployment)
print(df_annual_earnings)
print(df_quarterly_earnings)
print(df_commodities)

        timestamp  value
0      2025-02-13   4.33
1      2025-02-12   4.33
2      2025-02-11   4.33
3      2025-02-10   4.33
4      2025-02-09   4.33
...           ...    ...
25791  1954-07-05   0.88
25792  1954-07-04   1.25
25793  1954-07-03   1.25
25794  1954-07-02   1.25
25795  1954-07-01   1.13

[25796 rows x 2 columns]
       timestamp    value
0     2025-01-01  317.671
1     2024-12-01  315.605
2     2024-11-01  315.493
3     2024-10-01  315.664
4     2024-09-01  315.301
...          ...      ...
1340  1913-05-01    9.700
1341  1913-04-01    9.800
1342  1913-03-01    9.800
1343  1913-02-01    9.800
1344  1913-01-01    9.800

[1345 rows x 2 columns]
      timestamp  value
0    2025-01-01    4.0
1    2024-12-01    4.1
2    2024-11-01    4.2
3    2024-10-01    4.1
4    2024-09-01    4.1
..          ...    ...
920  1948-05-01    3.5
921  1948-04-01    3.9
922  1948-03-01    4.0
923  1948-02-01    3.8
924  1948-01-01    3.4

[925 rows x 2 columns]
   fiscalDateEnding reportedEPS
0    

In [175]:
url = f'https://www.alphavantage.co/query?function=NEWS_SENTIMENT&tickers={ticker}&apikey={apikey}&sort=LATEST&limit=1000'
r = requests.get(url)
data = r.json()['feed']
df_sentiments = pd.DataFrame(data)
print(df_sentiments)

                                                 title  \
0    Genpact Chief Accounting Officer Trades $226K ...   
1    Portland General Electric  ( POR )  Q4 Earning...   
2    US Stocks Set For A Volatile Start: Analyst Di...   
3    Top Wall Street Forecasters Revamp Portland Ge...   
4          Ameren  ( AEE )  Lags Q4 Earnings Estimates   
..                                                 ...   
670  One of Wall Street's favorite ways to control ...   
671  HNA arm Bohai to soon start US$5 billion sale ...   
672  Circuit Breaker Market Size is Expected to Rea...   
673  Carlisle  ( CSL )  Q4 Earnings Beat Estimates,...   
674  Process Automation And Instrumentation Market ...   

                                                   url   time_published  \
0    https://www.benzinga.com/insights/news/25/02/4...  20250214T150435   
1    https://www.zacks.com/stock/news/2415983/portl...  20250214T120503   
2    https://www.benzinga.com/government/regulation...  20250214T110518   
3  

In [176]:
# Lagged and smoothed prices
df['Lag 1'] = df['Close'].shift(1)
df['7 Day Avg'] = df['Close'].rolling(window=7).mean()
# Intraday return (open to close); note the plural name
df['Daily Returns'] = (df['Close'] - df['Open']) / df['Open']
df['Price to Volume Ratio'] = df['Close'] / df['Volume']
# Calendar features
df['Day of the Week'] = df['Date'].dt.dayofweek
df['Quarter'] = df['Date'].dt.quarter
# Close-to-close return; a different quantity from 'Daily Returns' above
df['Daily Return'] = df['Close'].pct_change()
# Intraday high-low range, used here as a simple volatility proxy
df['Volatility'] = df['High'] - df['Low']
# Interaction features
df['Price Volume Interaction'] = df['Daily Return'] * df['Volume']
df['RSI * Volume'] = df['RSI'] * df['Volume']
df['MACD / Bollinger Band Width'] = df['MACD'] / df['BB_Width']

Finlight key = ####

In [177]:
dropColumns = [1,3,5,6,7,8,9]

df_sentiments.drop(df_sentiments.columns[dropColumns], axis = 1, inplace=True)

df_sentiments['date'] = pd.to_datetime(df_sentiments['time_published'], format = '%Y%m%dT%H%M%S')
df_sentiments['Date'] = df_sentiments['date'].dt.date

dropColumns = [1, 6]
df_sentiments.drop(df_sentiments.columns[dropColumns], axis = 1, inplace = True)


print(df_sentiments.head())

                                               title  \
0  Genpact Chief Accounting Officer Trades $226K ...   
1  Portland General Electric  ( POR )  Q4 Earning...   
2  US Stocks Set For A Volatile Start: Analyst Di...   
3  Top Wall Street Forecasters Revamp Portland Ge...   
4        Ameren  ( AEE )  Lags Q4 Earnings Estimates   

                                             summary  overall_sentiment_score  \
0  Making a noteworthy insider sell on February 1...                 0.242835   
1  Portland General Electric (POR) delivered earn...                 0.150400   
2  U.S. stock futures traded close to the flatlin...                 0.192551   
3  Portland General Electric Company POR will rel...                 0.118315   
4  Ameren (AEE) delivered earnings and revenue su...                 0.172792   

  overall_sentiment_label                                   ticker_sentiment  \
0        Somewhat-Bullish  [{'ticker': 'GE', 'relevance_score': '0.056143...   
1        Somewha

I have mostly figured this out, but I'm not sure yet how to incorporate it into my data. I will look into this in the future; for now we'll leave it. I'll also figure out the Finlight API later.
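
One possible way to fold the sentiment scores into the price data is to reduce them to one row per day and left-join that onto the trading days. The sketch below uses made-up numbers; `Sentiment_Mean` and `Article_Count` are hypothetical column names. In the real frames, `df_sentiments['Date']` holds Python `date` objects, so it would need `pd.to_datetime` before the merge.

```python
import pandas as pd

# Toy stand-ins for the price frame and the per-article sentiment frame.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2025-02-12", "2025-02-13", "2025-02-14"]),
    "Close": [209.64, 208.36, 208.27],
})
articles = pd.DataFrame({
    "Date": pd.to_datetime(["2025-02-14", "2025-02-14", "2025-02-13"]),
    "overall_sentiment_score": [0.2, 0.4, 0.1],
})

# One row per day: the mean score plus the number of articles behind it.
daily_sentiment = (
    articles.groupby("Date")["overall_sentiment_score"]
    .agg(Sentiment_Mean="mean", Article_Count="count")
    .reset_index()
)

# A left join keeps every trading day; days without news get NaN and a count of 0.
merged = prices.merge(daily_sentiment, on="Date", how="left")
merged["Article_Count"] = merged["Article_Count"].fillna(0).astype(int)
print(merged)
```

Keeping the article count next to the mean lets a model distinguish "no news" from "neutral news". Articles published after the market close arguably belong to the next trading day, which this sketch does not handle.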

In [178]:
SectorTicker = {'COMMUNICATION SERVICES': 'XLC',
              'CONSUMER DISCRETIONARY': 'XLY',
              'CONSUMER STAPLES': 'XLP',
              'ENERGY': 'XLE',
              'FINANCIALS': 'XLF',
              'HEALTHCARE': 'XLV',
              'INDUSTRIALS': 'XLI',
              'MATERIALS': 'XLB',
              'REAL ESTATE': 'VNQ',
              'TECHNOLOGY': 'XLK',
              'UTILITIES': 'XLU'}


info = yf.Ticker(ticker).get_info()
sectorName = info['sector'].upper()
sectorSymbol = SectorTicker[sectorName]

df_sector = yf.Ticker(sectorSymbol).history(period="max")
df_sector.sort_values(by = 'Date', inplace = True)

print(df_sector)



                                 Open        High         Low       Close  \
Date                                                                        
1998-12-22 00:00:00-05:00   14.622297   14.671531   14.622297   14.671531   
1998-12-23 00:00:00-05:00   14.779837   14.966924   14.779837   14.927537   
1998-12-24 00:00:00-05:00   15.065394   15.134320   15.016160   15.124474   
1998-12-28 00:00:00-05:00   15.124485   15.213105   15.124485   15.203259   
1998-12-29 00:00:00-05:00   15.282026   15.419879   15.282026   15.419879   
...                               ...         ...         ...         ...   
2025-02-10 00:00:00-05:00  138.220001  138.610001  137.630005  138.559998   
2025-02-11 00:00:00-05:00  138.100006  138.679993  137.699997  138.610001   
2025-02-12 00:00:00-05:00  136.630005  138.240005  136.410004  137.750000   
2025-02-13 00:00:00-05:00  137.740005  138.250000  137.309998  137.889999   
2025-02-14 00:00:00-05:00  138.080002  138.250000  137.389999  137.550003   

# More Data Cleaning
Just to make sure

In [179]:
df = df.dropna()

In [180]:
print(df.dtypes)

Date                           datetime64[ns]
Open                                  float32
High                                  float32
Low                                   float32
Close                                 float32
Volume                                  int32
Dividends                             float32
Stock Splits                          float32
SMA_10                                float64
SMA_50                                float64
SMA_100                               float64
SMA_200                               float64
EMA_10                                float64
EMA_50                                float64
EMA_100                               float64
EMA_200                               float64
RSI                                   float64
MACD                                  float64
MACD_signal                           float64
Stochastic                            float64
Bollinger_High                        float64
Bollinger_Low                     

In [181]:
print(df.isna().sum())

Date                           0
Open                           0
High                           0
Low                            0
Close                          0
Volume                         0
Dividends                      0
Stock Splits                   0
SMA_10                         0
SMA_50                         0
SMA_100                        0
SMA_200                        0
EMA_10                         0
EMA_50                         0
EMA_100                        0
EMA_200                        0
RSI                            0
MACD                           0
MACD_signal                    0
Stochastic                     0
Bollinger_High                 0
Bollinger_Low                  0
ATR                            0
OBV                            0
Chaikin                        0
ADX                            0
SAR_Down                       0
SAR_Up                         0
MA_Crossover                   0
BB_Width                       0
Lag 1     

In [182]:
print(df.isin([np.inf, -np.inf]).sum())

Date                           0
Open                           0
High                           0
Low                            0
Close                          0
Volume                         0
Dividends                      0
Stock Splits                   0
SMA_10                         0
SMA_50                         0
SMA_100                        0
SMA_200                        0
EMA_10                         0
EMA_50                         0
EMA_100                        0
EMA_200                        0
RSI                            0
MACD                           0
MACD_signal                    0
Stochastic                     0
Bollinger_High                 0
Bollinger_Low                  0
ATR                            0
OBV                            0
Chaikin                        0
ADX                            0
SAR_Down                       0
SAR_Up                         0
MA_Crossover                   0
BB_Width                       0
Lag 1     

In [183]:
df['Price to Volume Ratio'] = df['Price to Volume Ratio'].replace([np.inf, -np.inf], np.nan) 
df['MACD / Bollinger Band Width'] = df['MACD / Bollinger Band Width'].replace([np.inf, -np.inf], np.nan) 
df['Daily Returns'] = df['Daily Returns'].replace([np.inf, -np.inf], np.nan)

In [184]:
column = df_annual_earnings.columns[1] 
df_annual_earnings[column] = pd.to_numeric(df_annual_earnings[column], downcast='float')

column = df_annual_earnings.columns[0]
df_annual_earnings[column] = pd.to_datetime(df_annual_earnings[column])
print(df_annual_earnings.dtypes)


fiscalDateEnding    datetime64[ns]
reportedEPS                float32
dtype: object


In [185]:
column = df_commodities.columns[0]
df_commodities[column] = pd.to_datetime(df_commodities[column])

column = df_commodities.columns[1] 
df_commodities[column] = pd.to_numeric(df_commodities[column], errors='coerce', downcast='float')
print(df_commodities.dtypes)
df_commodities.dropna(inplace=True)


timestamp    datetime64[ns]
value               float32
dtype: object


In [186]:
column = df_fed.columns[0]
df_fed[column] = pd.to_datetime(df_fed[column])
print(df_fed.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [187]:
column = df_inflation.columns[0]
df_inflation[column] = pd.to_datetime(df_inflation[column])
print(df_inflation.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [188]:
df_quarterly_earnings.drop(columns='fiscalDateEnding', inplace=True)

In [189]:
column = df_quarterly_earnings.columns[0]
df_quarterly_earnings[column] = pd.to_datetime(df_quarterly_earnings[column])

columns = list(df_quarterly_earnings.columns)

for num in range(1, len(columns) - 1):
    df_quarterly_earnings[columns[num]] = pd.to_numeric(df_quarterly_earnings[columns[num]], downcast='float')
df_quarterly_earnings['reportTime'] = df_quarterly_earnings['reportTime'].apply(lambda x: 1 if x == 'post-market' else -1 if x == 'pre-market' else 0)
df_quarterly_earnings['reportTime'] = pd.to_numeric(df_quarterly_earnings['reportTime'], downcast='integer')
print(df_quarterly_earnings.dtypes)

reportedDate          datetime64[ns]
reportedEPS                  float32
estimatedEPS                 float32
surprise                     float32
surprisePercentage           float32
reportTime                      int8
dtype: object


In [190]:
print(df_sector.dtypes)

Open             float64
High             float64
Low              float64
Close            float64
Volume             int64
Dividends        float64
Stock Splits     float64
Capital Gains    float64
dtype: object


In [191]:
column = df_unemployment.columns[0]
df_unemployment[column] = pd.to_datetime(df_unemployment[column])
print(df_unemployment.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [192]:
df_overall_sentiments = df_sentiments.copy()
columns = [0, 1, 3, 4]
df_overall_sentiments.drop(df_overall_sentiments.columns[columns], axis = 1, inplace=True)
print(df_overall_sentiments.head())

   overall_sentiment_score        Date
0                 0.242835  2025-02-14
1                 0.150400  2025-02-14
2                 0.192551  2025-02-14
3                 0.118315  2025-02-14
4                 0.172792  2025-02-13


In [193]:
date = 'Date'
df_annual_earnings.rename(columns = {'fiscalDateEnding': date, 'reportedEPS': 'AnnualEarnings'}, errors='ignore', inplace = True)
df_commodities.rename(columns={'timestamp': date, 'value': 'Commodities_Index'}, errors='ignore', inplace=True)
df_fed.rename(columns = {'timestamp': date, 'value':'Fed_Rate'}, errors='ignore', inplace=True)
df_inflation.rename(columns = {'timestamp': date, 'value':'Inflation'}, errors='ignore', inplace=True)
df_quarterly_earnings.rename(columns = {'reportedDate': date, 'reportedEPS': 'QuarterEarnings', 'estimatedEPS': 'EstimatedQuarterEarnings', 'surprise': 'QuarterSurprise', 'surprisePercentage': 'QuarterSurprisePercentage', 'reportTime': 'QuarterReportTime'}, errors='ignore', inplace = True)
df_sector.rename(columns = {'timestamp': date, 'Open': 'SectorOpen', 'High':'SectorHigh', 'Low':'SectorLow', 'Close': 'SectorClose', 'Volume':'SectorVolume'}, errors='ignore', inplace = True)
df_unemployment.rename(columns = {'timestamp': date, 'value':'UnemploymentRate'}, errors='ignore', inplace = True)


In [194]:
dropColumns = ['Dividends','Stock Splits', 'Capital Gains']
df_sector = df_sector.reset_index()
df_sector.drop(columns=dropColumns, inplace=True)


In [195]:
# Convert tz-aware timestamps to naive UTC; naive timestamps are left as they are
if df_sector["Date"].dt.tz is not None:
    df_sector["Date"] = df_sector["Date"].dt.tz_convert(None)
# Truncate to midnight so that only the calendar date remains
df_sector["Date"] = df_sector["Date"].dt.normalize()

# Data Loading 

In [196]:
from sqlalchemy import create_engine
server = 'dahomey.database.windows.net'
database = 'Stock Data'
username = 'ttshiamala'
password = '####'
driver = '{ODBC Driver 18 for SQL Server}'

# Connect to Azure SQL
connection_string = f"DRIVER={driver};SERVER={server};PORT=1433;DATABASE={database};UID={username};PWD={password}"
conn = pyodbc.connect(connection_string)

engine = create_engine('mssql+pyodbc://', creator=lambda: conn)
df.to_sql('Stock', engine, if_exists = 'replace', index=False)

print('Data inserting is a success!')

Data inserting is a success!


In [197]:
list_df = [df_annual_earnings, df_commodities, df_fed, df_inflation, df_overall_sentiments, df_quarterly_earnings, df_sector,  df_unemployment]
list_tables = ['AnnualEarnings', 'Commodities', 'FederalFundsRate', 'Inflation', 'Sentiments', 'QuarterlyEarnings', 'SectorPrices', 'Unemployment']

conn = pyodbc.connect(connection_string)
engine = create_engine('mssql+pyodbc://', creator=lambda: conn)
for i, dataFrame in enumerate(list_df):
    dataFrame.to_sql(list_tables[i], engine, if_exists = 'replace', index=False)
conn.close()

# SQL Query and Loading
This query joins all the data tables that were collected and loaded into the database.

In [198]:
with open("Query.sql", "r") as file:
    query = file.read()

conn = pyodbc.connect(connection_string)
engine = create_engine('mssql+pyodbc://', creator=lambda: conn)
finalDF = pd.read_sql(sql=query, con=engine, parse_dates=['Date'])
conn.close()

The next step is to make sure that the data makes sense. For example, not all unemployment values were carried over because of the left join, so I may have to revisit how I joined these tables. I need to account for days when the stock wasn't trading (the same applies to commodities). I also have to double-check the sentiments and handle days with multiple stories (by averaging them). After that I have to clean the dataset once more and handle missing values and the like. Actually, now that I'm taking a closer look, there are some spurious rows. They may come from the overall sentiments table, which can contain multiple entries for the same date.

I fixed it, and the dataset is almost perfect. I just need to clean it one more time, and then it is ready for machine learning.
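
The dropped unemployment values are what an exact-date join would produce: the monthly figures are stamped on the first of the month, which is often not a trading day. The join itself lives in `Query.sql`, so the following is only a pandas sketch of one alternative, using `pd.merge_asof` to attach the most recent monthly value to each trading day. All numbers are made up.

```python
import pandas as pd

trading_days = pd.DataFrame({
    "Date": pd.to_datetime(["2024-12-30", "2024-12-31", "2025-01-02", "2025-01-03"]),
    "Close": [170.0, 171.0, 172.0, 173.0],
})
unemployment = pd.DataFrame({
    "Date": pd.to_datetime(["2024-12-01", "2025-01-01"]),
    "UnemploymentRate": [4.1, 4.0],
})

# merge_asof needs both frames sorted on the key. direction="backward"
# picks the latest monthly row dated on or before each trading day.
joined = pd.merge_asof(
    trading_days.sort_values("Date"),
    unemployment.sort_values("Date"),
    on="Date",
    direction="backward",
)
print(joined)
```

Here 2025-01-01 is a market holiday, so an exact match would leave January without a value, while the as-of join fills it from 2025-01-02 onward. One caveat: the timestamp is the reference month, not the release date, so using the value this early may leak information the market did not yet have.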

# Model Datasets
This is a big deal. The way we build the datasets and handle missing values will affect the machine learning algorithms. Some tree-based implementations, such as XGBoost, can handle missing values natively; support in random forests depends on the library and version. Linear regression, for instance, cannot handle missing values directly.

There is a plethora of ways to handle the situation. Based on some research, forward fill may be a viable option. There are some downsides, though. For example, quarterly earnings aren't released every day; they're released quarterly, so having them present on every row wouldn't make sense. One fix is to add another feature that counts the days since the earnings announcement, to capture residual effects.

I think I'll use different methods to optimize the performance of different models. I'll use Random Forest, XGBoost, CNN, and LSTM as my main models. In the future I want to try Linear Regression, SVM, and Transformers. However, Transformers require more data than I have, SVM is more commonly used for classification than for continuous price prediction, and linear regression assumes a linear relationship, which seems unlikely.
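
A minimal sketch of the forward-fill idea combined with a days-since-announcement feature is shown below. The data is made up, and `LastEarningsDate` and `DaysSinceEarnings` are hypothetical column names. An as-of merge is used here because it forward-fills the latest announcement onto each trading day in one step.

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-22", "2025-01-23", "2025-01-24", "2025-01-27"]),
    "Close": [188.0, 200.0, 201.0, 199.0],
})
earnings = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-23"]),
    "QuarterEarnings": [1.32],  # made-up value
})

# Keep the announcement date in its own column so it survives the join.
earnings = earnings.assign(LastEarningsDate=earnings["Date"])
features = pd.merge_asof(daily, earnings, on="Date", direction="backward")

# Calendar days elapsed since the most recent announcement
# (NaN for rows before the first announcement).
features["DaysSinceEarnings"] = (features["Date"] - features["LastEarningsDate"]).dt.days
features = features.drop(columns="LastEarningsDate")
print(features)
```

With this layout the earnings figure stays on every row, but the model can learn to discount it as `DaysSinceEarnings` grows.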

I used Power BI to visualize the stock and noticed that GE stock crashes a lot. It made me wonder whether the type of stock affects the model. It turns out it does! So these are the major things I have to consider when creating these models:
1. Which model to use
2. How much data to use for the model
3. Which features to use
4. Dimensionality reduction algorithms?
5. Handling missing values
6. Which stock and sector to analyze, and which types of features it is susceptible to
7. Normalization
8. Time frame of the prediction
- Are you predicting daily, weekly, or monthly prices? Different models perform better on different time horizons.
9. Data splitting
10. Overfitting and regularization
- Apply techniques like early stopping, dropout (for neural networks), or L1/L2 regularization to prevent overfitting.
11. Model explainability
- Use techniques like SHAP values or LIME to understand which factors drive predictions.
12. Compute resources
13. Hyperparameter tuning
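
For the normalization and data splitting items, one common approach with time series is to split chronologically (no shuffling) and to compute the scaling statistics on the training portion only. The sketch below shows that on synthetic data; `chronological_split` and `standardize` are hypothetical helpers, and the 70/15/15 proportions are an assumption.

```python
import numpy as np
import pandas as pd

def chronological_split(frame, train_frac=0.7, val_frac=0.15):
    """Split a date-sorted frame into train/validation/test without shuffling."""
    n = len(frame)
    train_end = int(round(n * train_frac))
    val_end = int(round(n * (train_frac + val_frac)))
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]

def standardize(train, others, columns):
    """Scale with the training mean/std only, so no future information leaks in."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0).replace(0, 1.0)  # guard against constant columns

    def scale(part):
        scaled = part.copy()
        scaled[columns] = (part[columns] - mean) / std
        return scaled

    return scale(train), [scale(part) for part in others]

# Synthetic random-walk prices, only to exercise the helpers.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Date": pd.date_range("2024-01-01", periods=100, freq="B"),
    "Close": 100 + rng.normal(0, 1, 100).cumsum(),
    "Volume": rng.integers(1_000, 5_000, 100).astype(float),
})

train, val, test = chronological_split(data)
train_s, (val_s, test_s) = standardize(train, [val, test], ["Close", "Volume"])
print(len(train), len(val), len(test))
```

A random split would let the model train on days that come after the ones it is tested on, which tends to overstate performance on price data.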