1) Load the data from the CSV file into a pandas DataFrame named df

- a. Initially, df should have 1989 rows and 6 columns
- b. Add a column called sscore to df
    - i. Fill the sscore column with the ‘compound’ sentiment analysis score based on the daily headline news for each day.
    - ii. Calculate the average (mean) ‘compound’ score for the column sscore and store this average in a variable named avg

In [1]:
import pandas as pd

In [8]:
df = pd.read_csv("stockdailyhlnews.csv")

In [11]:
df.shape

(1989, 6)

In [9]:
df.head()

,date,weekday,president,sp500,ibm,news
0,8/8/2008,Friday,republican,1296.32,87.77,"b""Georgia 'downs two Russian warplanes' as cou..."
1,8/11/2008,Monday,republican,1305.32,86.26,b'Why wont America and Nato help us? If they w...
2,8/12/2008,Tuesday,republican,1289.59,85.32,b'Remember that adorable 9-year-old who sang a...
3,8/13/2008,Wednesday,republican,1285.83,85.72,b' U.S. refuses Israel weapons to attack Iran:...
4,8/14/2008,Thursday,republican,1292.93,86.5,b'All the experts admit that we should legalis...


In [19]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; download it once if it is not already installed.
nltk.download('vader_lexicon', quiet=True)

# Create an instance of the SentimentIntensityAnalyzer.
sia = SentimentIntensityAnalyzer()

# Return the 'compound' sentiment score for one day's headline text.
def get_sentiment_score(text):
    return sia.polarity_scores(text)['compound']
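
For context on what the 'compound' value represents: VADER sums the valence of the words it recognizes and squashes that total into the range -1 to 1. The sketch below reproduces that squashing step as it appears, to the best of my knowledge, in the VADER reference code (total / sqrt(total² + 15)). The function name normalize_valence and the input totals are illustrative and not part of this notebook.

```python
import math

def normalize_valence(total, alpha=15):
    """Squash a summed valence into [-1, 1]; alpha=15 is assumed from VADER's reference code."""
    score = total / math.sqrt(total * total + alpha)
    return max(-1.0, min(1.0, score))

# Hypothetical summed valences, from strongly negative to strongly positive
for total in (-4.0, 0.0, 1.0, 4.0):
    print(total, round(normalize_valence(total), 4))
```

A total of 0 maps to a compound score of 0, and the curve flattens as the total grows, so a long, emotionally loaded headline block cannot exceed 1 in absolute value.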

In [20]:
# Fill the sscore column with the 'compound' score of each day's headlines
df['sscore'] = df['news'].apply(get_sentiment_score)

In [22]:
# Store the mean compound score in a variable named avg, as the task requires.
avg = df['sscore'].mean()

2) Convert the weekday and president columns to dummy variables

- a. Add the dummy variables (columns) to the original data frame df

In [28]:
df = pd.get_dummies(df, columns=["weekday","president"])
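
One detail worth seeing on a small example: passing columns=... to pd.get_dummies replaces the listed columns with their dummy columns instead of keeping both. The toy frame below uses made-up rows, and the names toy, replaced, dummies and kept are illustrative. It also sketches one way to keep the original columns alongside the dummies, if that is how "add to the original data frame" is meant.

```python
import pandas as pd

# Toy frame with illustrative values (not the real data set)
toy = pd.DataFrame({
    "weekday": ["Friday", "Monday", "Friday"],
    "president": ["republican", "republican", "democrat"],
    "ibm": [87.77, 86.26, 85.32],
})

# columns=... swaps the listed columns for their dummy columns
replaced = pd.get_dummies(toy, columns=["weekday", "president"])
print(list(replaced.columns))

# Alternative: build the dummies separately and concatenate to keep the originals
dummies = pd.get_dummies(toy[["weekday", "president"]])
kept = pd.concat([toy, dummies], axis=1)
print(list(kept.columns))
```

Either way, the president_democrat column used later as the classification target is present.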

3) Is the IBM stock price influenced by the sentiment compound score and/or the S&P 500 index?


a. Use from statsmodels.formula.api import ols for this linear regression task


In [30]:
from statsmodels.formula.api import ols

b. Store the adjusted R-squared in a variable named adj_rsquared


In [46]:
# Fit a multiple linear regression model
model = ols('ibm ~ sp500 + sscore', data=df).fit()
model.summary()

Dep. Variable:,ibm,R-squared:,0.369
Model:,OLS,Adj. R-squared:,0.368
Method:,Least Squares,F-statistic:,579.5
Date:,"Fri, 17 Feb 2023",Prob (F-statistic):,5.64e-199
Time:,13:20:57,Log-Likelihood:,-8759.7
No. Observations:,1989,AIC:,17530.0
Df Residuals:,1986,BIC:,17540.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,55.9497,1.733,32.289,0.000,52.551,59.348
sp500,0.0373,0.001,34.043,0.000,0.035,0.039
sscore,-0.9153,0.801,-1.142,0.253,-2.487,0.656

Omnibus:,1303.547,Durbin-Watson:,0.005
Prob(Omnibus):,0.000,Jarque-Bera (JB):,128.395
Skew:,0.207,Prob(JB):,1.32e-28
Kurtosis:,1.826,Cond. No.,6070.0
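
The coefficient table can be read as a prediction equation: ibm ≈ Intercept + coef(sp500) × sp500 + coef(sscore) × sscore. A minimal sketch with the coefficients copied from the summary; the function name predict_ibm and the neutral sscore of 0.0 are assumptions made for illustration.

```python
# Coefficients copied from the regression summary
INTERCEPT = 55.9497
B_SP500 = 0.0373
B_SSCORE = -0.9153

def predict_ibm(sp500, sscore):
    """Fitted IBM price for a given S&P 500 level and sentiment score."""
    return INTERCEPT + B_SP500 * sp500 + B_SSCORE * sscore

# sp500 is taken from the first row of df.head(); sscore = 0.0 is hypothetical
fitted = predict_ibm(1296.32, 0.0)
print(round(fitted, 2))
```

The actual IBM price on that first row is 87.77, so the fitted value misses by a wide margin, which is in line with an R-squared of only 0.369.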


In [47]:

# Retrieve the R-squared value, the number of observations, and the number of predictors
r_squared = model.rsquared
nobs = model.nobs
k = model.df_model  # number of independent variables (2: sp500 and sscore)

# Calculate the adjusted R-squared value (same number as model.rsquared_adj)
adj_rsquared = 1 - (1 - r_squared) * (nobs - 1) / (nobs - k - 1)

# Print the adjusted R-squared value
print(f"Adjusted R-squared: {adj_rsquared:.3f}")

Adjusted R-squared: 0.368
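
As a check on the formula, the same arithmetic can be done by hand with the rounded figures printed in the summary (R-squared 0.369, 1989 observations, 2 predictors). The variable names here are illustrative.

```python
r_squared = 0.369   # R-squared from the summary (rounded)
nobs = 1989         # No. Observations
k = 2               # Df Model: sp500 and sscore

# Adjusted R-squared penalizes R-squared for the number of predictors
adj = 1 - (1 - r_squared) * (nobs - 1) / (nobs - k - 1)
print(f"{adj:.3f}")
```

With nearly 2000 rows and only two predictors, the penalty factor (nobs - 1) / (nobs - k - 1) is about 1.001, which is why the adjusted value barely differs from R-squared itself.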


c. Store the p-value of the F-statistic in a variable named f_pvalue


In [48]:
# Retrieve the F-statistic and its p-value
f_stat = model.fvalue
f_pvalue = model.f_pvalue

# Print the p-value of the F-statistic
print(f"P-value of F-statistic: {f_pvalue:.4f}")

P-value of F-statistic: 0.0000


d. Store the p-value of sscore in a variable named sscore_pvalue


In [49]:
# Look up the p-value by coefficient name rather than by position
sscore_pvalue = model.pvalues['sscore']

# Print the p-value of sscore
print(f"p-value of sscore: {sscore_pvalue:.4f}")

p-value of sscore: 0.2534


e. Store the p-value of sp500 in a variable named sp500_pvalue


In [53]:
# Look up the p-value by coefficient name rather than by position
sp500_pvalue = model.pvalues['sp500']

# Print the p-value of sp500
print(f"p-value of sp500: {sp500_pvalue:.4f}")

p-value of sp500: 0.0000


f. If a relationship exists between sscore and the IBM stock price, store the boolean value True in a variable named sscore_rel; otherwise, set sscore_rel to False

In [54]:
sscore_rel = False  # p-value of 0.2534 is not significant at the conventional 0.05 level (assumed threshold)

g. If a relationship exists between the S&P 500 index and the IBM stock price, store the boolean value True in a variable named sp500_rel; otherwise, set sp500_rel to False

In [55]:
sp500_rel = True  # p-value is effectively 0, far below the conventional 0.05 level (assumed threshold)
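
The two flags can also be derived from the p-values instead of being typed in by hand. The sketch below assumes a significance level of 0.05, which the assignment does not state, and uses the p-values as printed in the regression output. The names ALPHA, has_relationship and the *_check variables are illustrative.

```python
ALPHA = 0.05  # assumed significance level; the assignment does not specify one

def has_relationship(p_value, alpha=ALPHA):
    """True when the coefficient is statistically significant at the given level."""
    return bool(p_value < alpha)

# p-values as printed in the regression output
sscore_rel_check = has_relationship(0.2534)
sp500_rel_check = has_relationship(0.0)

print(sscore_rel_check, sp500_rel_check)
```

In the notebook itself the same function could be called with sscore_pvalue and sp500_pvalue, so the flags stay correct if the model is refitted.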

4) Can we predict whether a Republican or a Democrat will be in the White House based on the S&P 500 index, the IBM stock price, and sscore?

a. Use sklearn for this classification problem:
   - i. from sklearn.model_selection import train_test_split
   - ii. from sklearn.linear_model import LogisticRegression
   - iii. split the data into training and test datasets using 80% training & 20% test and a random seed value of 10
   - iv. store the logistic model score in a variable named logmodel_score

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [62]:
X = df[["sp500","ibm","sscore"]]
y = df["president_democrat"]

In [64]:
# Split the data into training and test datasets using 80% training & 20% test and a random seed of 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
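
The split sizes can be worked out ahead of time. As far as I know, scikit-learn rounds the test share up and gives the remainder to the training set; the sketch below applies that rule to the 1989 rows. The variable names are illustrative.

```python
import math

n_rows = 1989
test_size = 0.2

# Test share is rounded up; the training set gets the remaining rows
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

These counts agree with the support totals shown in the two classification reports (398 test rows and 1591 training rows).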

In [79]:
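logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
# Store the model score (mean accuracy on the test set), as item iv asks.
logmodel_score = logreg.score(X_test, y_test)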

In [85]:
from sklearn.metrics import classification_report
# Classification report on the test set (20% of the data)
print(classification_report(y_test, y_pred, target_names=["republican","democrat"]))

              precision    recall  f1-score   support

  republican       0.90      0.62      0.73        29
    democrat       0.97      0.99      0.98       369

    accuracy                           0.97       398
   macro avg       0.94      0.81      0.86       398
weighted avg       0.97      0.97      0.96       398



In [87]:
# Classification report on the training set (80% of the data)
y_train_pred = logreg.predict(X_train)
print(classification_report(y_train, y_train_pred, target_names=["republican","democrat"]))

              precision    recall  f1-score   support

  republican       0.91      0.64      0.75        83
    democrat       0.98      1.00      0.99      1508

    accuracy                           0.98      1591
   macro avg       0.95      0.82      0.87      1591
weighted avg       0.98      0.98      0.98      1591
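
The high accuracy should be read against the class imbalance. The sketch below compares the test accuracy with a baseline that always predicts the majority class. The correct-prediction counts are reconstructed from the rounded test report (an assumption, not notebook output): 18 of 29 republican rows and 367 of 369 democrat rows are consistent with the reported precision and recall.

```python
# Test-set counts reconstructed from the rounded report (an assumption, not notebook output)
rep_support, dem_support = 29, 369
rep_correct = 18   # 18 / 29 rounds to the reported recall of 0.62
dem_correct = 367  # 367 / 369 rounds to the reported recall of 0.99

total = rep_support + dem_support
accuracy = (rep_correct + dem_correct) / total

# Baseline: always predict the majority class (democrat)
baseline = dem_support / total

print(f"accuracy={accuracy:.2f} baseline={baseline:.2f}")
```

Because democrat rows make up most of the test set, the baseline is already high, so the republican recall of 0.62 says more about the model than the overall accuracy does.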

