# Stock Data Project 
Objective: Predict future stock prices (e.g., next day, week, or month) using historical data and external factors.

Type of Prediction:

Regression: Predict exact future prices.

Classification: Predict price movement direction (up/down/stable).

# Data Collection

In [1]:
import pandas as pd
import pyodbc


* Data Retreival From Alpha Vantage

API Key to Alpha Vantage (source for data): HPUL5XC5C1RFHRAQ

In [182]:
function = 'TIME_SERIES_DAILY'
symbol = 'IBM'
outputsize = 'full'
datatype = 'csv'
apikey = 'HPUL5XC5C1RFHRAQ'


url = f'https://www.alphavantage.co/query?function={function}&symbol={symbol}&outputsize={outputsize}&apikey={apikey}&datatype={datatype}'
df = pd.read_csv(url)
df.sort_values(by = 'timestamp', inplace = True)
print(df)

       timestamp    open     high     low   close    volume
6336  1999-11-01   98.50   98.810   96.37   96.75   9551800
6335  1999-11-02   96.75   96.810   93.69   94.81  11105400
6334  1999-11-03   95.87   95.940   93.50   94.37  10369100
6333  1999-11-04   94.44   94.440   90.00   91.56  16697600
6332  1999-11-05   92.75   92.940   90.19   90.25  13737600
...          ...     ...      ...     ...     ...       ...
4     2025-01-02  221.82  222.490  217.60  219.94   2579498
3     2025-01-03  220.55  223.660  220.55  222.65   3873578
2     2025-01-06  223.00  224.350  220.75  222.67   2847128
1     2025-01-07  223.35  226.711  222.83  223.96   3299701
0     2025-01-08  223.91  224.900  220.83  223.18   2619768

[6337 rows x 6 columns]


# Data Cleaning 

Using Data Wrangler the data seems pretty clean but to ensure data integrity, I'm going to use some data cleaning methods using pandas.

We have to ensure there's no missing data, a consistent format, removal of unwanted observations, incorrect data, and removal of duplicates. We don't need to manage outliers.

* Removal of rows with missing data

In [183]:
df = df.dropna()

* Ensures that the dates are in the correct format

In [184]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

* Ensures that there is no wrong data

In [185]:
columns = list(df.columns)
columns.remove('timestamp')
columns.remove('volume')

# Ensure that the columns are of the correct data type
for column in columns:
    df[column] = pd.to_numeric(df[column], errors='coerce', downcast='float')

df['volume'] = pd.to_numeric(df['volume'], errors='coerce', downcast='integer')

#Make sure that all numbers are positive, if not make it positive
columns.append('volume')
for column in columns:
    for i in df.index:
        if df.loc[i, column] < 0:
            df.loc[i, column] = df.loc[i, column] * -1

* Removing duplicates 

In [186]:
df.drop_duplicates(inplace = True)

# Data Transforming

* Slight formating for column names

In [187]:
df.rename(columns = {'timestamp':'Date', 'open':'Open', 'high':'High', 'low':'Low', 'close':'Close', 'volume':'Volume'}, inplace = True)
print(df.head())

           Date       Open       High        Low      Close    Volume
6336 1999-11-01  98.500000  98.809998  96.370003  96.750000   9551800
6335 1999-11-02  96.750000  96.809998  93.690002  94.809998  11105400
6334 1999-11-03  95.870003  95.940002  93.500000  94.370003  10369100
6333 1999-11-04  94.440002  94.440002  90.000000  91.559998  16697600
6332 1999-11-05  92.750000  92.940002  90.190002  90.250000  13737600


# Data Loading 

* Loading the data into the SQL Database provided by the cloud service Azure. This only needs to happen once.

In [9]:
server = 'dahomey.database.windows.net'
database = 'Stock Data'
username = 'ttshiamala'
password = 'Bear8486!?'
driver = '{ODBC Driver 18 for SQL Server}'

# Connect to Azure SQL
connection_string = f"DRIVER={driver};SERVER={server};PORT=1433;DATABASE={database};UID={username};PWD={password}"
conn = pyodbc.connect(connection_string)
cursor = conn.cursor()
print("Connected to Azure SQL Database!")
try:
    cursor.execute('TRUNCATE TABLE StockPrices')
    conn.commit()
    for index, row in df.iterrows():
        cursor.execute(
            """
            INSERT INTO StockPrices ([Date], [Open], [High], [Low], [Close], [Volume]) 
            VALUES (?, ?, ?, ?, ?, ?)
            """,
            row.Date, row.Open, row.High, row.Low, row.Close, row.Volume
    )
except pyodbc.IntegrityError as e:
    print(e)


conn.commit()
cursor.close()
conn.close()
print('Data inserting is a success!')

Connected to Azure SQL Database!
Data inserting is a success!


# More Data 

* This is where I will be collecting more data for the AI to learn from as well as do some feature engineering. A lot of repeated steps for previous.

In [10]:
import requests
url = f'https://www.alphavantage.co/query?function=DIVIDENDS&symbol={symbol}&apikey={apikey}'

r = requests.get(url)
data = r.json()["data"]

df_div = pd.DataFrame(data)
print(df_div)

    ex_dividend_date declaration_date record_date payment_date amount
0         2024-11-12       2024-10-30  2024-11-12   2024-12-10   1.67
1         2024-08-09       2024-07-29  2024-08-09   2024-09-10   1.67
2         2024-05-09       2024-04-30  2024-05-10   2024-06-10   1.67
3         2024-02-08       2024-01-30  2024-02-09   2024-03-09   1.66
4         2023-11-09       2023-10-30  2023-11-10   2023-12-09   1.66
..               ...              ...         ...          ...    ...
99        2000-02-08             None        None         None   0.12
100       1999-11-08             None        None         None   0.12
101       1999-08-06             None        None         None   0.12
102       1999-05-06             None        None         None   0.24
103       1999-02-08             None        None         None   0.22

[104 rows x 5 columns]


In [11]:
df_div.drop(df_div.columns[1:4], axis = 1, inplace=True)
print(df_div.head())

  ex_dividend_date amount
0       2024-11-12   1.67
1       2024-08-09   1.67
2       2024-05-09   1.67
3       2024-02-08   1.66
4       2023-11-09   1.66


In [12]:
url = f'https://www.alphavantage.co/query?function=SPLITS&symbol={symbol}&apikey={apikey}'
r = requests.get(url)
data = r.json()["data"]
df_split = pd.DataFrame(data)

print(df_split)

  effective_date split_factor
0     2021-11-04       1.0460
1     1999-05-27       2.0000


In [13]:
df_div['ex_dividend_date'] = pd.to_datetime(df_div['ex_dividend_date'])
df_div['amount'] = pd.to_numeric(df_div['amount'], errors='coerce',downcast='float')

df_split['effective_date'] = pd.to_datetime(df_split['effective_date'])
df_split['split_factor'] = pd.to_numeric(df_split['split_factor'], errors='coerce', downcast ='float')

print(df_div.dtypes)
print(df_split.dtypes)

ex_dividend_date    datetime64[ns]
amount                     float32
dtype: object
effective_date    datetime64[ns]
split_factor             float32
dtype: object


In [188]:

df['SMA_10'] = df['Close'].rolling(window = 10).mean()
df['SMA_50'] = df['Close'].rolling(window = 50).mean()
df['SMA_100'] = df['Close'].rolling(window = 100).mean()
df['SMA_200'] = df['Close'].rolling(window = 200).mean()
print(df.tail())

        Date        Open        High         Low       Close   Volume  \
4 2025-01-02  221.820007  222.490005  217.600006  219.940002  2579498   
3 2025-01-03  220.550003  223.660004  220.550003  222.649994  3873578   
2 2025-01-06  223.000000  224.350006  220.750000  222.669998  2847128   
1 2025-01-07  223.350006  226.710999  222.830002  223.960007  3299701   
0 2025-01-08  223.910004  224.899994  220.830002  223.179993  2619768   

       SMA_10      SMA_50   SMA_100    SMA_200  
4  222.148000  220.609401  216.6979  197.48890  
3  222.395999  220.417401  217.0296  197.63545  
2  222.270999  220.215800  217.3464  197.77900  
1  222.331000  220.327201  217.6628  197.93930  
0  222.456000  220.497401  217.9551  198.10100  


In [189]:
df['EMA_10'] = df['Close'].ewm(span=10).mean()
df['EMA_50'] = df['Close'].ewm(span=50).mean()
df['EMA_100'] = df['Close'].ewm(span=100).mean()
df['EMA_200'] = df['Close'].ewm(span=2000).mean()
print(df.tail())

        Date        Open        High         Low       Close   Volume  \
4 2025-01-02  221.820007  222.490005  217.600006  219.940002  2579498   
3 2025-01-03  220.550003  223.660004  220.550003  222.649994  3873578   
2 2025-01-06  223.000000  224.350006  220.750000  222.669998  2847128   
1 2025-01-07  223.350006  226.710999  222.830002  223.960007  3299701   
0 2025-01-08  223.910004  224.899994  220.830002  223.179993  2619768   

       SMA_10      SMA_50   SMA_100    SMA_200      EMA_10      EMA_50  \
4  222.148000  220.609401  216.6979  197.48890  222.708353  221.867401   
3  222.395999  220.417401  217.0296  197.63545  222.697742  221.898091   
2  222.270999  220.215800  217.3464  197.77900  222.692698  221.928362   
1  222.331000  220.327201  217.6628  197.93930  222.923118  222.008034   
0  222.456000  220.497401  217.9551  198.10100  222.969822  222.053993   

      EMA_100     EMA_200  
4  214.823300  152.341384  
3  214.978284  152.411783  
2  215.130595  152.482131  
1  2

In [191]:
import ta

df['RSI'] = ta.momentum.rsi(df['Close'], window = 14, fillna = True)
df['MACD'] = ta.trend.macd(df['Close'], fillna = True)
df['MACD_signal'] = ta.trend.macd_signal(df['Close'], fillna = True)
df['Stochastic'] = ta.momentum.stoch(df['High'], df['Low'], df['Close'], fillna = True)
df['Bollinger_High'] = ta.volatility.bollinger_hband(df['Close'], fillna = True)
df['Bollinger_Low'] = ta.volatility.bollinger_lband(df['Close'], fillna = True)
df['ATR'] = ta.volatility.average_true_range(df['High'], df['Low'], df['Close'], fillna = True)
df['OBV'] = ta.volume.on_balance_volume(df['Close'], df['Volume'], fillna = True)
df['Chaikin'] = ta.volume.chaikin_money_flow(df['High'], df['Low'], df['Close'], df['Volume'], fillna = True)
df['ADX'] = ta.trend.adx(df['High'], df['Low'], df['Close'], fillna = True)
df['SAR_Down'] = ta.trend.psar_down(df['High'], df['Low'], df['Close'], fillna = True)
df['SAR_Up'] = ta.trend.psar_up(df['High'], df['Low'], df['Close'], fillna = True)
df['MA_Crossover'] = df['SMA_10'] - df['SMA_50']
df['BB_Width'] = df['Bollinger_High'] - df['Bollinger_Low']
print(df.tail())

        Date        Open        High         Low       Close   Volume  \
4 2025-01-02  221.820007  222.490005  217.600006  219.940002  2579498   
3 2025-01-03  220.550003  223.660004  220.550003  222.649994  3873578   
2 2025-01-06  223.000000  224.350006  220.750000  222.669998  2847128   
1 2025-01-07  223.350006  226.710999  222.830002  223.960007  3299701   
0 2025-01-08  223.910004  224.899994  220.830002  223.179993  2619768   

       SMA_10      SMA_50   SMA_100    SMA_200  ...  Bollinger_High  \
4  222.148000  220.609401  216.6979  197.48890  ...      237.888484   
3  222.395999  220.417401  217.0296  197.63545  ...      237.084915   
2  222.270999  220.215800  217.3464  197.77900  ...      235.892401   
1  222.331000  220.327201  217.6628  197.93930  ...      233.511605   
0  222.456000  220.497401  217.9551  198.10100  ...      232.910306   

   Bollinger_Low       ATR        OBV   Chaikin        ADX    SAR_Down  \
4     216.209515  4.339795  353046012 -0.102213  19.622705  

In [133]:
url = f'https://www.alphavantage.co/query?function=FEDERAL_FUNDS_RATE&interval=daily&apikey={apikey}&datatype=csv'
df_fed = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=CPI&interval=monthly&apikey={apikey}&datatype=csv'
df_inflation = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=UNEMPLOYMENT&apikey={apikey}&datatype=csv'
df_unemployment = pd.read_csv(url)

url = f'https://www.alphavantage.co/query?function=EARNINGS&symbol={symbol}&apikey={apikey}'
r = requests.get(url)
dataAnnual = r.json()['annualEarnings']
dataQuarter = r.json()['quarterlyEarnings']
df_annual_earnings = pd.DataFrame(dataAnnual)
df_quarterly_earnings = pd.DataFrame(dataQuarter)

url = f'https://www.alphavantage.co/query?function=ALL_COMMODITIES&interval=monthly&apikey={apikey}&datatype=csv'
df_commodities = pd.read_csv(url)

print(df_fed)
print(df_inflation)
print(df_unemployment)
print(df_annual_earnings)
print(df_quarterly_earnings)
print(df_commodities)

        timestamp  value
0      2025-01-07   4.33
1      2025-01-06   4.33
2      2025-01-05   4.33
3      2025-01-04   4.33
4      2025-01-03   4.33
...           ...    ...
25754  1954-07-05   0.88
25755  1954-07-04   1.25
25756  1954-07-03   1.25
25757  1954-07-02   1.25
25758  1954-07-01   1.13

[25759 rows x 2 columns]
       timestamp    value
0     2024-11-01  315.493
1     2024-10-01  315.664
2     2024-09-01  315.301
3     2024-08-01  314.796
4     2024-07-01  314.540
...          ...      ...
1338  1913-05-01    9.700
1339  1913-04-01    9.800
1340  1913-03-01    9.800
1341  1913-02-01    9.800
1342  1913-01-01    9.800

[1343 rows x 2 columns]
      timestamp  value
0    2024-11-01    4.2
1    2024-10-01    4.1
2    2024-09-01    4.1
3    2024-08-01    4.2
4    2024-07-01    4.3
..          ...    ...
918  1948-05-01    3.5
919  1948-04-01    3.9
920  1948-03-01    4.0
921  1948-02-01    3.8
922  1948-01-01    3.4

[923 rows x 2 columns]
   fiscalDateEnding reportedEPS
0    

In [37]:
url = f'https://www.alphavantage.co/query?function=NEWS_SENTIMENT&tickers={symbol}&apikey={apikey}&sort=LATEST&limit=1000'
r = requests.get(url)
data = r.json()['feed']
df_sentiments = pd.DataFrame(data)
print(df_sentiments)

                                                title  \
0      1 Top Quantum Computing Stock to Buy Right Now   
1   Okta Rises 9% in a Month: Is the Stock a Screa...   
2   Quantum Computing: The New AI? A Look at the R...   
3   Should You Buy This Artificial Intelligence  (...   
4   IBM Solution Drives Innovation in Defense Appl...   
5   Rigetti Computing Surges 692% in a Month: Is T...   
6   The Zacks Analyst Blog Highlights Goldman Sach...   
7   Quantum Computing, Rigetti, D-Wave Skyrocket U...   
8   4 "Dogs of the Dow" Stocks for Investors to Wa...   
9                  3 Top Tech Stocks to Buy Right Now   
10  Can Cognizant's Expanding Partner Base Push th...   
11  IBM And Illinois Team Up To Supercharge Quantu...   
12  Palantir Stock Up 313% YTD: Is the Party Over,...   
13  2025 Investment Themes: AI, Energy, Defense Re...   
14  D-WAVE QUANTUM Surges 470% Year to Date: Is Th...   
15  IBM Watsonx to Drive AI Adoption in Defense Se...   
16  Adobe Stock Ahead of Q4 Ear

In [192]:
df['Lag 1'] = df['Close'].shift(1)
df['7 Day Avg'] = df['Close'].rolling(window=7).mean()
df['Daily Returns'] = (df['Close'] - df['Open']) / df['Open']
df['Price to Volume Ratio'] = df['Close'] / df['Volume']
df['Day of the Week'] = df['Date'].dt.dayofweek
df['Quarter'] = df['Date'].dt.quarter
df['Daily Return'] = df['Close'].pct_change()
df['Volatility'] = df['High'] - df['Low']
df['Price Volume Interaction'] = df['Daily Return'] * df['Volume']
df['RSI * Volume'] = df['RSI'] * df['Volume']
df['MACD / Bollinger Band Width'] = df['MACD'] / df['BB_Width']

Finlight key = sk_4470114c6d50ee45e7309af37287f659d87a2b8255f771e2d6815c36864bce5a

In [39]:
dropColumns = [1,3,5,6,7,8,9]

df_sentiments.drop(df_sentiments.columns[dropColumns], axis = 1, inplace=True)

df_sentiments['date'] = pd.to_datetime(df_sentiments['time_published'], format = '%Y%m%dT%H%M%S')
df_sentiments['Date'] = df_sentiments['date'].dt.date

dropColumns = [1, 6]
df_sentiments.drop(df_sentiments.columns[dropColumns], axis = 1, inplace = True)


print(df_sentiments.head())

                                               title  \
0     1 Top Quantum Computing Stock to Buy Right Now   
1  Okta Rises 9% in a Month: Is the Stock a Screa...   
2  Quantum Computing: The New AI? A Look at the R...   
3  Should You Buy This Artificial Intelligence  (...   
4  IBM Solution Drives Innovation in Defense Appl...   

                                             summary  overall_sentiment_score  \
0  The next big thing in technology could very we...                 0.208344   
1  Gaining adoption of OKTA's Identity Threat Pro...                 0.437569   
2  Quantum computing is emerging as a game-changi...                 0.343692   
3  Looking for an undervalued AI opportunity at t...                 0.248196   
4  Lockheed Martin is set to incorporate Internat...                 0.271068   

  overall_sentiment_label                                   ticker_sentiment  \
0        Somewhat-Bullish  [{'ticker': 'GOOG', 'relevance_score': '0.1400...   
1               

In [155]:
from textblob import TextBlob
rows = len(df_sentiments.iloc[:,0])
columns = ['title', 'summary', 'overall_sentiment_label']
df_TB_sentiments = df_sentiments.copy()
for column in columns:
    for row in range(0, rows):
        df_TB_sentiments.loc[row, column] = TextBlob(df_TB_sentiments.loc[row, column]).sentiment.polarity

print(df_TB_sentiments)

       title   summary  overall_sentiment_score overall_sentiment_label  \
0   0.392857      0.12                 0.208344                     0.0   
1        0.0     0.225                 0.437569                     0.0   
2   0.159091 -0.133333                 0.343692                     0.0   
3       -0.6     -0.15                 0.248196                     0.0   
4        0.0  0.107143                 0.271068                     0.0   
5        0.5       0.0                 0.231811                     0.0   
6        0.0       0.0                 0.332417                     0.0   
7        0.0  0.066667                 0.140376                     0.0   
8        0.0       0.5                 0.347166                     0.0   
9   0.392857      0.15                 0.290068                     0.0   
10    -0.275  0.334343                 0.331298                     0.0   
11       0.0     -0.05                 0.389159                     0.0   
12       0.5     0.175   

In [156]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

df_vader_sentiments = df_sentiments.copy()
columns = ['title', 'summary', 'overall_sentiment_label']
analyzer = SentimentIntensityAnalyzer()
for column in columns:
    df_vader_sentiments[column] = df_vader_sentiments[column].apply(lambda x: analyzer.polarity_scores(x)['compound'])
print(df_vader_sentiments.head())


    title  summary  overall_sentiment_score  overall_sentiment_label  \
0  0.2023   0.2716                 0.208344                      0.0   
1 -0.3818   0.7003                 0.437569                      0.0   
2  0.2023   0.7650                 0.343692                      0.0   
3  0.4767   0.4215                 0.248196                      0.0   
4  0.8316   0.4767                 0.271068                      0.0   

                                    ticker_sentiment        Date  
0  [{'ticker': 'GOOG', 'relevance_score': '0.1400...  2024-12-24  
1  [{'ticker': 'MSFT', 'relevance_score': '0.1241...  2024-12-23  
2  [{'ticker': 'MSFT', 'relevance_score': '0.1667...  2024-12-22  
3  [{'ticker': 'MSFT', 'relevance_score': '0.1024...  2024-12-22  
4  [{'ticker': 'LMT', 'relevance_score': '0.27127...  2024-12-20  


In [157]:
df_TB_sentiments.drop(df_TB_sentiments.columns[[3, 4]], axis = 1, inplace = True)
df_vader_sentiments.drop(df_vader_sentiments.columns[[3, 4]], axis = 1, inplace = True)

In [158]:
import json

with open('ticker_sentiments.json', 'w') as file:
    json.dump(df_sentiments['ticker_sentiment'][0], file, indent = 4)
print('1')

1


In [159]:

filePath = r"C:\Users\ty725\OneDrive\Documents\Winter Project\ticker_sentiments.json"
df_ticker_sentiments = pd.read_json(filePath)
print(df_ticker_sentiments)

  ticker  relevance_score  ticker_sentiment_score ticker_sentiment_label
0   GOOG         0.140076                0.185753       Somewhat-Bullish
1    IBM         0.668247                0.388151                Bullish


I figured this out kind of, just not sure how to incorporate this into my data. I will look into this in the future but for now we'll leave it. Will also figure out finlight api later

In [54]:
SectorTicker = {'COMMUNICATION SERVICES': 'XLC',
              'CONSUMER DISCRETIONARY': 'XLY',
              'CONSUMER STAPLES': 'XLP',
              'ENERGY': 'XLE',
              'FINANCIALS': 'XLF',
              'HEALTHCARE': 'XLV',
              'INDUSTRIALS': 'XLI',
              'MATERIALS': 'XLB',
              'REAL ESTATE': 'XLRE',
              'TECHNOLOGY': 'XLK',
              'UTILITIES': 'XLU'}

url = f'https://www.alphavantage.co/query?function=OVERVIEW&symbol={symbol}&apikey={apikey}'
r = requests.get(url)
sectorName = r.json()['Sector']
sectorSymbol = SectorTicker[sectorName]

url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={sectorSymbol}&apikey={apikey}&datatype=csv&outputsize=full'
df_sector = pd.read_csv(url)
df_sector.sort_values(by = 'timestamp', inplace = True)
print(df_sector)



       timestamp    open      high       low   close   volume
6335  1999-11-01   42.63   42.9700   42.4100   42.44   230800
6334  1999-11-02   42.44   42.9400   42.1400   42.25   156300
6333  1999-11-03   42.88   43.1300   42.6300   42.88   209200
6332  1999-11-04   43.44   43.4700   42.9100   43.20   409100
6331  1999-11-05   44.25   44.2500   43.4700   43.67   898000
...          ...     ...       ...       ...     ...      ...
4     2024-12-31  235.11  235.2700  231.7600  232.52  4690615
3     2025-01-02  234.36  235.0200  229.7801  231.97  6385156
2     2025-01-03  233.39  235.9999  232.9300  235.75  4560565
1     2025-01-06  238.44  241.0620  237.8100  238.75  4487661
0     2025-01-07  240.00  240.0000  233.1400  233.96  5003649

[6336 rows x 6 columns]


# More Data Cleaning
Just to make sure

In [193]:
print(df.dtypes)


Date                           datetime64[ns]
Open                                  float32
High                                  float32
Low                                   float32
Close                                 float32
Volume                                  int32
SMA_10                                float64
SMA_50                                float64
SMA_100                               float64
SMA_200                               float64
EMA_10                                float64
EMA_50                                float64
EMA_100                               float64
EMA_200                               float64
RSI                                   float64
MACD                                  float64
MACD_signal                           float64
Stochastic                            float64
Bollinger_High                        float64
Bollinger_Low                         float64
ATR                                   float64
OBV                               

In [194]:
columns = list(df.columns)
for column in columns:
    df[column] = df[column].bfill()

In [173]:
column = df_annual_earnings.columns[1] 
df_annual_earnings[column] = pd.to_numeric(df_annual_earnings[column], downcast='float')

column = df_annual_earnings.columns[0]
df_annual_earnings[column] = pd.to_datetime(df_annual_earnings[column])
print(df_annual_earnings.dtypes)


fiscalDateEnding    datetime64[ns]
reportedEPS                float32
dtype: object


In [135]:
column = df_commodities.columns[0]
df_commodities[column] = pd.to_datetime(df_commodities[column])

column = df_commodities.columns[1] 
df_commodities[column] = pd.to_numeric(df_commodities[column], errors='coerce', downcast='float')
print(df_commodities.dtypes)
df_commodities.dropna(inplace=True)


timestamp    datetime64[ns]
value               float32
dtype: object


In [136]:
print(df_div.dtypes)

ex_dividend_date    datetime64[ns]
amount                     float32
dtype: object


In [137]:
column = df_fed.columns[0]
df_fed[column] = pd.to_datetime(df_fed[column])
print(df_fed.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [143]:
column = df_inflation.columns[0]
df_inflation[column] = pd.to_datetime(df_inflation[column])
print(df_inflation.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [174]:
df_quarterly_earnings.drop(columns='fiscalDateEnding', inplace=True)

In [175]:
column = df_quarterly_earnings.columns[0]
df_quarterly_earnings[column] = pd.to_datetime(df_quarterly_earnings[column])

columns = list(df_quarterly_earnings.columns)

for num in range(1, len(columns) - 1):
    df_quarterly_earnings[columns[num]] = pd.to_numeric(df_quarterly_earnings[columns[num]], downcast='float')
df_quarterly_earnings['reportTime'] = df_quarterly_earnings['reportTime'].apply(lambda x: 1 if x == 'post-market' else -1 if x == 'pre-market' else 0)
df_quarterly_earnings['reportTime'] = pd.to_numeric(df_quarterly_earnings['reportTime'], downcast='integer')
print(df_quarterly_earnings.dtypes)

reportedDate          datetime64[ns]
reportedEPS                  float32
estimatedEPS                 float32
surprise                     float32
surprisePercentage           float32
reportTime                      int8
dtype: object


In [144]:
print(df_sector.dtypes)

timestamp     object
open         float64
high         float64
low          float64
close        float64
volume         int64
dtype: object


In [145]:
print(df_split.dtypes)

effective_date    datetime64[ns]
split_factor             float32
dtype: object


In [160]:
df_TB_sentiments.drop(df_TB_sentiments.columns[[0, 1]], axis = 1, inplace=True)
print(df_TB_sentiments.dtypes)

overall_sentiment_score    float64
Date                        object
dtype: object


In [162]:
column = df_TB_sentiments.columns[1]
df_TB_sentiments[column] = pd.to_datetime(df_TB_sentiments[column])


In [164]:
column = df_unemployment.columns[0]
df_unemployment[column] = pd.to_datetime(df_unemployment[column])
print(df_unemployment.dtypes)

timestamp    datetime64[ns]
value               float64
dtype: object


In [None]:
df_vader_sentiments.drop(df_vader_sentiments.columns[0], axis = 1, inplace=True)

In [168]:
df_vader_sentiments['Date'] = pd.to_datetime(df_vader_sentiments['Date'])
print(df_vader_sentiments.dtypes)

summary                           float64
overall_sentiment_score           float64
Date                       datetime64[ns]
dtype: object


I dont think the textblob/vader experiment worked. The data looks a bit off so we're just going to stick with the overall sentiment score directly from the API.

In [170]:
df_overall_sentiments = df_sentiments.copy()
columns = [0, 1, 3, 4]
df_overall_sentiments.drop(df_overall_sentiments.columns[columns], axis = 1, inplace=True)
print(df_overall_sentiments.head())

   overall_sentiment_score        Date
0                 0.208344  2024-12-24
1                 0.437569  2024-12-23
2                 0.343692  2024-12-22
3                 0.248196  2024-12-22
4                 0.271068  2024-12-20


In [209]:
date = 'Date'
df_annual_earnings.rename(columns = {'fiscalDateEnding': date, 'reportedEPS': 'AnnualEarnings'}, errors='ignore', inplace = True)
df_commodities.rename(columns={'timestamp': date, 'value': 'Commodities_Index'}, errors='ignore', inplace=True)
df_div.rename(columns = {'ex_dividend_date': date, 'amount':'Dividend'}, errors='ignore', inplace = True)
df_fed.rename(columns = {'timestamp': date, 'value':'Fed_Rate'}, errors='ignore', inplace=True)
df_inflation.rename(columns = {'timestamp': date, 'value':'Inflation'}, errors='ignore', inplace=True)
df_quarterly_earnings.rename(columns = {'reportedDate': date, 'reportedEPS': 'QuarterEarnings', 'estimatedEPS': 'EstimatedQuarterEarnings', 'surprise': 'QuarterSurprise', 'surprisePercentage': 'QuarterSurprisePercentage', 'reportTime': 'QuarterReportTime'}, errors='ignore', inplace = True)
df_sector.rename(columns = {'timestamp': date, 'open': 'SectorOpen', 'high':'SectorHigh', 'low':'SectorLow', 'close': 'SectorClose', 'volume':'SectorVolume'}, errors='ignore', inplace = True)
df_split.rename(columns = {'effective_date': date}, errors='ignore', inplace = True)
df_unemployment.rename(columns = {'timestamp': date, 'value':'UnemploymentRate'}, errors='ignore', inplace = True)


# More Date Loading 

In [179]:
%pip install sqlalchemy

Collecting sqlalchemy
  Using cached SQLAlchemy-2.0.36-cp312-cp312-win_amd64.whl.metadata (9.9 kB)
Collecting typing-extensions>=4.6.0 (from sqlalchemy)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting greenlet!=0.4.17 (from sqlalchemy)
  Using cached greenlet-3.1.1-cp312-cp312-win_amd64.whl.metadata (3.9 kB)
Using cached SQLAlchemy-2.0.36-cp312-cp312-win_amd64.whl (2.1 MB)
Using cached greenlet-3.1.1-cp312-cp312-win_amd64.whl (299 kB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, greenlet, sqlalchemy
Successfully installed greenlet-3.1.1 sqlalchemy-2.0.36 typing-extensions-4.12.2
Note: you may need to restart the kernel to use updated packages.


In [213]:
from sqlalchemy import create_engine
server = 'dahomey.database.windows.net'
database = 'Stock Data'
username = 'ttshiamala'
password = 'Bear8486!?'
driver = '{ODBC Driver 18 for SQL Server}'

# Connect to Azure SQL
connection_string = f"DRIVER={driver};SERVER={server};PORT=1433;DATABASE={database};UID={username};PWD={password}"
conn = pyodbc.connect(connection_string)

engine = create_engine('mssql+pyodbc://', creator=lambda: conn)
df.to_sql('Stock', engine, if_exists = 'replace', index=False)

print('Data inserting is a success!')

Data inserting is a success!


In [215]:
conn.close()

In [214]:
list_df = [df_annual_earnings, df_commodities, df_div, df_fed, df_inflation, df_overall_sentiments, df_quarterly_earnings, df_sector, df_split, df_unemployment]
list_tables = ['AnnualEarnings', 'Commodities', 'Dividends', 'FederalFundsRate', 'Inflation', 'Sentiments', 'QuarterlyEarnings', 'SectorPrices', 'Split', 'Unemployment']

conn = pyodbc.connect(connection_string)
engine = create_engine('mssql+pyodbc://', creator=lambda: conn)
for i, dataFrame in enumerate(list_df):
    dataFrame.to_sql(list_tables[i], engine, if_exists = 'replace', index=False)