Problem Statement¶

Business Context¶

The prices of the stocks of companies listed under a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment being factors that play a significant role. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and accurately interpret its impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

Problem Definition¶

With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

Data Dictionary¶


Date : The date the news was released
News : The content of news articles that could potentially affect the company's stock price
Open : The stock price (in $) at the beginning of the day
High : The highest stock price (in $) reached during the day
Low : The lowest stock price (in $) reached during the day
Close : The adjusted stock price (in $) at the end of the day
Volume : The number of shares traded during the day
Label : The sentiment polarity of the news content
    1: positive
    0: neutral
    -1: negative






Please read the instructions carefully before starting the project.¶

This is a commented Python Notebook file in which all the instructions and tasks to be performed are mentioned.


Blanks '_' are provided in the notebook that need to be filled with appropriate code to get the correct result. Every '_' blank is accompanied by a comment that briefly describes what needs to be filled in.
Identify the task to be performed correctly, and only then proceed to write the required code.
Run the code cells sequentially from the beginning to avoid unnecessary errors.
Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit it. Any mathematical or computational details that are a graded part of the project can be included in the Appendix section of the presentation.


Note: If the free-tier GPU of Google Colab is not accessible (due to unavailability or exhaustion of daily limit or other reasons), the following steps can be taken:


Wait for 12-24 hours until the GPU is accessible again or the daily usage limits are reset.

Switch to a different Google account and resume working on the project from there.

Try using the CPU runtime:

To use the CPU runtime, click on Runtime => Change runtime type => CPU => Save
One can also click on the Continue without GPU option to switch to a CPU runtime (kindly refer to the snapshot below)
The instructions for running the code on the CPU are provided in the respective sections of the notebook.






Installing and Importing the necessary libraries¶

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# installing the sentence-transformers and gensim libraries for word embeddings
!pip install -U sentence-transformers gensim transformers tqdm -q


In [None]:
# upgrading the core data libraries together with the embedding and transformer libraries
!pip install -U numpy pandas gensim sentence-transformers transformers tqdm -q


# To manipulate and analyze data
import pandas as pd
import numpy as np

# To visualize data
import matplotlib.pyplot as plt
import seaborn as sns

# To use time-related functions
import time

# To parse JSON data
import json

# To build, tune, and evaluate ML models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

# To load/create word embeddings
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# To work with transformer models
import torch
from sentence_transformers import SentenceTransformer

# To implement progress bar related functionalities
from tqdm import tqdm
tqdm.pandas()

# To ignore unnecessary warnings
import warnings
warnings.filterwarnings('ignore')


Loading the dataset¶

In [None]:
# Mount Google Drive (needed only when running on Google Colab with the dataset stored in Drive)
from google.colab import drive
drive.mount('/content/drive')


In [None]:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
stock_news = pd.read_csv("/content/drive/MyDrive/Stock Market news sentiment analysis/stock_news.csv")  # Read the CSV file into a DataFrame


In [None]:
#Creating a copy of the dataset
stock = stock_news.copy()


Data Overview¶

Displaying the first few rows of the dataset¶

In [None]:
stock.head()


Understanding the shape of the dataset¶

In [None]:
stock.shape


In [None]:
(349, 8)

Observations:

1. The data has 349 rows and 8 columns.

Checking the data types of the columns¶

In [None]:
stock.info()


In [None]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    349 non-null    object 
 1   News    349 non-null    object 
 2   Open    349 non-null    float64
 3   High    349 non-null    float64
 4   Low     349 non-null    float64
 5   Close   349 non-null    float64
 6   Volume  349 non-null    int64  
 7   Label   349 non-null    int64  
dtypes: float64(4), int64(2), object(2)
memory usage: 21.9+ KB


Observations:
1. All columns are fully populated (349 non-null values each) and span several data types.


'Date' and 'News' are of object type; 'Date' needs to be converted to datetime.
Open, High, Low, and Close are numeric columns of float type, and Volume and Label are of integer type.


In [None]:
stock['Date'] = pd.to_datetime(stock['Date'])


We have now converted the 'Date' column to datetime type.

Checking the statistical summary¶

In [None]:
stock.describe()  # Check the statistical summary


Observations

Row Count:

All columns have 349 values, indicating no missing values in the numeric fields.

Central Tendencies:

Open, High, Low, and Close prices have means around 44–46.

The median (50%) values are close to the mean, indicating no strong skewness in price distributions.

Volume:

Mean trading volume is about 128 million shares, with a wide range from ~45 million to ~244 million.

This large spread and high standard deviation (~43 million) suggest significant variability in daily trading activity.

Label:

The mean is about -0.054, which implies that negative labels (-1) slightly outnumber positive labels (1).

Outliers:

There is a wide gap between the min and max values in all columns. This alone does not confirm outliers, so the distributions are examined in the plots that follow.
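A wide min-max range does not by itself establish outliers. One common convention, sketched here on made-up numbers rather than the project data, flags values lying more than 1.5 times the interquartile range beyond the quartiles. The helper name `iqr_outliers` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return arr[(arr < lower) | (arr > upper)]

# Hypothetical daily volumes in millions of shares (not taken from the dataset)
sample_volumes = [95, 100, 105, 110, 115, 120, 125, 130, 244]
flagged = iqr_outliers(sample_volumes)
print(flagged)
```

The same helper could be applied to a real column, for example `iqr_outliers(stock["Volume"])`, to count how many days fall outside the fences.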

Checking the duplicate values¶

In [None]:
stock.duplicated().sum()


In [None]:
0

Observation:

1. There are no duplicate rows in the dataset.

Checking for missing values¶

In [None]:
stock.isnull().sum()  # Count the missing values in each column


Observation:

1. There are no missing values in the dataset.

Exploratory Data Analysis¶

Univariate Analysis¶

In [None]:
sns.countplot(data=stock, x="Label", stat="percent");


Observation:

The data is imbalanced.

Class 0 (neutral) has the highest proportion, likely over 45% of the dataset.

Classes 1 (positive) and -1 (negative) each appear to account for roughly 25–30%.

This means the model may become biased toward predicting neutral sentiment unless the class imbalance is handled.
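Reading proportions off a bar chart is approximate. A minimal sketch of how exact class shares and inverse-frequency weights could be computed is shown here on a made-up label series, not the real `stock["Label"]` column. The weight formula is the "balanced" heuristic, n_samples / (n_classes * class_count), as used by scikit-learn's `class_weight="balanced"` option.

```python
import pandas as pd

# Hypothetical labels standing in for stock["Label"] (not the real data)
labels = pd.Series([0, 0, 0, 0, 1, 1, -1, -1, -1, 0])

# Share of each class, as fractions that sum to 1
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions)

# "Balanced" weights: n_samples / (n_classes * class_count);
# rarer classes receive larger weights
counts = labels.value_counts().sort_index()
class_weights = (len(labels) / (len(counts) * counts)).to_dict()
print(class_weights)
```

Such weights could be passed to classifiers that accept a `class_weight` argument, as an alternative to resampling the data.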

In [None]:
sns.displot(data=stock[['Open', 'High', 'Low', 'Close']], kind="kde", palette="tab10");


Observations:

The KDE curves are roughly symmetric, with no significant skewness.
The peaks of the KDE curves show where stock prices fall most frequently, which is around $45 to $50.

In [None]:
sns.histplot(stock, x='Volume');  # Plot a histogram of Volume


Observations:

The distribution is slightly right-skewed.
The bars on the right side correspond to high-volume trading days.
There is a high concentration in the 80M to 125M range.

In [None]:
#Calculating the total number of words present in the news content.
stock['news_len'] = stock['News'].apply(lambda x: len(x.split(' ')))


stock['news_len'].describe()  # Print the statistical summary of the news content length


Observations:

Most news entries are around 49–50 words long.
The standard deviation is 5.73, indicating low dispersion.
The min (19) and max (61) lie within a fairly narrow range.
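The word counts above come from splitting on a single space. As a small illustration on a made-up sentence (not from the dataset), `split(" ")` and `split()` can give different counts when the text contains consecutive spaces:

```python
# Hypothetical text with a double space between two words
text = "Shares rose  sharply today"

# split(" ") yields an empty string for the extra space, inflating the count
count_single_space = len(text.split(" "))

# split() with no argument collapses any run of whitespace
count_whitespace = len(text.split())

print(count_single_space, count_whitespace)
```

If the news text contains irregular spacing, using `split()` for `news_len` would give slightly lower and arguably more accurate counts.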

In [None]:
sns.histplot(data=stock, x="news_len");


Observations:

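The distribution is approximately bell-shaped and centered around 49–50 words, matching the mean and median.
The minimum of 19 words lies far below the center, so there is a short left tail rather than perfect symmetry.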

Bivariate Analysis¶

In [None]:
sns.heatmap(
    stock.select_dtypes(include=np.number).corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
);


Observations:

Open, High, Low, and Close all have very high correlations (~1.00) with each other.

Volume shows low correlation with price features and Label (likely near 0).

news_len has very weak or no correlation with Label or price columns.

Label doesn’t show strong correlation (positive or negative) with any numeric features.

In [None]:
plt.figure(figsize=(10, 8))
for i, variable in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=stock, x="Label", y=variable)
    plt.tight_layout(pad=2)
plt.show()


Observations:

The median tends to be higher for label 1 (positive) and lower for label -1 (negative) for all four variables (Open, High, Low, Close). This suggests some alignment between price level and label.

Price alone is not enough to predict sentiment or trend, highlighting the importance of the news data.

There are a few points outside the whiskers; these could be outliers due to major market events.

In [None]:
sns.boxplot(data=stock, x="Label", y="Volume");


Observation:

The boxes for all three labels (-1, 0, 1) likely overlap significantly, indicating no strong difference in trading volume across sentiments.

Therefore, volume is not a good predictor of sentiment.

There are a few points outside the whiskers, which could be outliers.

In [None]:
stock_daily = stock.groupby('Date').agg(
    {
        'Open': 'mean',
        'High': 'mean',
        'Low': 'mean',
        'Close': 'mean',
        'Volume': 'mean',
    }
).reset_index()  # Group by the 'Date' column

stock_daily.set_index('Date', inplace=True)
stock_daily.head()


Observations:

Averaging the stock metrics per date yields one row per trading day, which prepares the dataset for time-series plots and date-aligned features.
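To see what the per-date aggregation does, here is a toy sketch with hypothetical rows. It assumes that price and volume values repeat for every news item published on the same date, which the source does not state explicitly; under that assumption the mean simply collapses the repeats.

```python
import pandas as pd

# Hypothetical rows: two news items on the first date, one on the second
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-02", "2019-01-02", "2019-01-03"]),
    "Close": [40.0, 40.0, 42.0],
    "Volume": [100, 100, 150],
})

# One row per date; the mean leaves values unchanged when they repeat within a date
toy_daily = toy.groupby("Date").agg({"Close": "mean", "Volume": "mean"})
print(toy_daily)
```

If the values did differ within a date, the mean would blend them, so checking `toy.groupby("Date")["Close"].nunique()` on the real data would confirm whether the assumption holds.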

In [None]:
plt.figure(figsize=(15,5))
sns.lineplot(stock_daily.drop("Volume", axis=1));  # Plot all variables except Volume


Observations:

The lines for Open, High, Low, and Close are likely very close to one another, with only small differences within each day. This suggests stable intraday behaviour.

Open and Close cross repeatedly, fluctuating around each other.

In [None]:
# Create a figure and axis
fig, ax1 = plt.subplots(figsize=(15,5))

# Lineplot on primary y-axis
sns.lineplot(data=stock_daily.reset_index(), x='Date', y='Close', ax=ax1, color='blue', marker='o', label='Close Price')

# Create a secondary y-axis
ax2 = ax1.twinx()

# Lineplot on secondary y-axis
sns.lineplot(data=stock_daily.reset_index(), x='Date', y='Volume', ax=ax2, color='gray', marker='o', label='Volume')

ax1.legend(bbox_to_anchor=(1,1));


Observations:

There are clear volume spikes on specific dates.

These spikes are not frequent.

In some areas, volume increases without a significant price move, which may hint at speculative trading.

In a few places, a price drop or rise follows a volume surge, but not consistently, suggesting no strong or reliable correlation between Volume and Close price in this dataset.

Without factoring in news sentiment, volume alone doesn't explain price behavior.

Data Preprocessing¶

In [None]:
stock["Date"].describe()  # Print the statistical summary of the 'Date' column


Train-test-validation Split¶

In [None]:
X_train = stock[(stock['Date'] < '2019-04-01')].reset_index()
X_val = stock[(stock['Date'] >= '2019-04-01') & (stock['Date'] < '2019-04-16')].reset_index()
X_test = stock[stock['Date'] >= '2019-04-16'].reset_index()


Observations:
The model is trained on data up to March 31, 2019.
It is validated on data from April 1 to April 15, 2019.
It is tested on data from April 16, 2019 onwards.
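A chronological split like this should leave the three parts disjoint and ordered in time. The sketch below checks those properties on a hypothetical frame with one row per day; only the cutoff dates are taken from the notebook, and the frame itself is an illustrative stand-in.

```python
import pandas as pd

# Hypothetical frame with one row per day (stand-in for the stock data)
toy = pd.DataFrame({"Date": pd.date_range("2019-03-25", "2019-04-20", freq="D")})
toy["Label"] = 0

# Same cutoff dates as the notebook cell
train = toy[toy["Date"] < "2019-04-01"]
val = toy[(toy["Date"] >= "2019-04-01") & (toy["Date"] < "2019-04-16")]
test = toy[toy["Date"] >= "2019-04-16"]

# The three parts should cover every row exactly once and be ordered in time
assert len(train) + len(val) + len(test) == len(toy)
assert train["Date"].max() < val["Date"].min() <= val["Date"].max() < test["Date"].min()
print(len(train), len(val), len(test))
```

Splitting by date rather than at random keeps later news out of the training set, which mirrors how the model would be used on future data.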

In [None]:
# Pick the 'Label' column as the target variable
y_train = X_train["Label"].copy()
y_val = X_val["Label"].copy()
y_test = X_test["Label"].copy()


Observations:
Let's isolate the target variable from the training, validation, and test data.

In [None]:
# Print the shapes of X_train, X_val, X_test, y_train, y_val, and y_test
print("Train data shape",X_train.shape)
print("Validation data shape",X_val.shape)
print("Test data shape ",X_test.shape)

print("Train label shape",y_train.shape)
print("Validation label shape",y_val.shape)
print("Test label shape ",y_test.shape)


In [None]:
Train data shape (286, 10)
Validation data shape (21, 10)
Test data shape  (42, 10)
Train label shape (286,)
Validation label shape (21,)
Test label shape  (42,)


Observations:

The training set (286 rows) is used to fit the models.

The test set (42 rows) is held out for the final unbiased evaluation.

The validation set (21 rows) is used for tuning hyperparameters or monitoring overfitting, though it is rather small.

Word Embeddings¶

Word2Vec¶

In [None]:
# Creating a list of all words in our data
words_list = [item.split(" ") for item in stock['News'].values]


Observations:
The result is a list of lists, where each inner list contains individual words from a single news entry.

In [None]:
# Creating an instance of Word2Vec
vec_size = 300
model_W2V = Word2Vec(words_list, vector_size = vec_size, min_count = 1, window=5, workers = 6)


Observations:

We are training the embeddings on the news text itself, so the vector space is tuned to the financial language used in this dataset.

Each word is represented as a 300-dimensional vector.

The model looks at up to 5 words before and after the target word to get context.
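To make the context window concrete, the sketch below lists the (target, context) pairs produced by a fixed symmetric window on a made-up headline. It is a simplified illustration: the function `context_pairs` is hypothetical, a window of 2 is used for brevity, and details of gensim's actual training procedure are not modeled.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs within a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical headline, tokenized on whitespace
tokens = "apple shares fall after weak guidance".split()
pairs = context_pairs(tokens, window=2)

# Context words seen for the target word "fall"
print([c for t, c in pairs if t == "fall"])
```

With `window=5`, as in the notebook, each target word would be paired with up to ten neighbours, five on each side.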

In [None]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(list(model_W2V.wv.key_to_index)))


In [None]:
Length of the vocabulary is 4682


Observations:

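With only 349 news entries, a vocabulary of 4,682 unique tokens is quite large, so the text is highly diverse. Part of this size comes from the tokenization: splitting on spaces without lowercasing or stripping punctuation treats variants such as "Stock" and "stock," as separate tokens.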
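How much tokenization affects vocabulary size can be checked with a small comparison. The sketch below uses two made-up headlines (not from the dataset) and contrasts a plain space split with a lowercased, punctuation-free tokenization; the regular expression is one possible cleaning choice, not the notebook's method.

```python
import re

# Hypothetical headlines (not from the dataset)
docs = [
    "Apple shares fall, analysts say.",
    "Analysts say Apple shares rise",
]

# Vocabulary as built in the notebook: split on single spaces, no cleaning
raw_vocab = {tok for doc in docs for tok in doc.split(" ")}

# Vocabulary after lowercasing and keeping only alphanumeric runs
clean_vocab = {tok for doc in docs for tok in re.findall(r"[a-z0-9]+", doc.lower())}

print(len(raw_vocab), len(clean_vocab))
```

Running the same comparison on `stock['News']` would show how many of the 4,682 entries are case or punctuation variants of the same word.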

Let's check out a few word embeddings obtained using the model.

In [None]:
# Checking the word embedding of a sample word
word = "stock"
model_W2V.wv[word]


In [None]:
array([-2.02370016e-03,  3.85143384e-02,  7.38847163e-03,  1.15291467e-02,
        2.00947933e-03, -5.64098991e-02,  2.88168229e-02,  8.78903419e-02,
        3.51623236e-03, -2.24954356e-02,  4.80430946e-03, -2.08250582e-02,
       -4.17812308e-03,  1.41794533e-02, -2.62250751e-02, -2.65340116e-02,
        2.17638779e-02, -4.45793802e-03,  5.90414228e-03, -2.59113293e-02,
       -2.28653476e-02,  7.17722392e-03,  2.86644120e-02,  1.45488195e-02,
        2.37827003e-02,  2.16769011e-04, -3.53674144e-02,  8.24647956e-03,
       -2.75574941e-02, -4.30375747e-02,  1.14744985e-02, -2.63173841e-02,
        3.62951984e-03, -9.64764412e-03, -1.46250066e-03,  2.41914615e-02,
        1.32997306e-02, -3.46401036e-02, -3.30136716e-03, -1.33008696e-02,
       -1.79803912e-02,  3.66124325e-03,  1.24798482e-03, -1.89772677e-02,
        1.90459695e-02,  3.81468870e-02,  7.93073792e-03,  1.89931095e-02,
       -8.83688161e-04,  2.24095676e-02,  1.07543403e-02, -1.01133082e-02,
       -2.00914145e-02,  1.42094390e-02, -7.14597665e-03,  2.59550977e-02,
        1.81324463e-02,  2.65261997e-03,  6.70872582e-03,  2.26455019e-03,
       -1.32609634e-02, -9.58424248e-03, -5.87600225e-05,  1.10583669e-02,
       -3.83577892e-04,  1.58287697e-02,  6.03994727e-03,  1.19034443e-02,
       -2.59206183e-02, -1.00583630e-02, -5.93438745e-04,  2.29454897e-02,
        3.49893421e-02, -2.00777352e-02,  9.38899163e-03,  1.51570356e-02,
       -3.22390832e-02,  2.79513886e-03, -1.47075858e-02,  2.54175682e-02,
       -1.45046217e-02, -2.60928292e-02,  5.42339589e-03,  6.22998551e-02,
        6.63138926e-03, -2.38971552e-03, -1.19596329e-02,  6.08431641e-03,
        3.84879857e-02,  1.53537197e-02,  3.10060605e-02, -1.47249755e-02,
        1.97192077e-02, -9.04652334e-05,  3.72045264e-02,  3.41158248e-02,
        2.90387049e-02, -1.49073955e-02, -1.40394904e-02,  2.25287415e-02,
        4.84373374e-03,  3.37902806e-03,  2.86013465e-02,  1.44332983e-02,
        5.19103091e-03, -1.88415963e-02, -1.35437334e-02,  1.90909859e-02,
       -1.91446021e-02,  5.84838679e-03, -4.88654859e-02, -2.65293997e-02,
       -2.09432142e-03,  2.52165608e-02,  1.63117368e-02,  2.27713455e-02,
       -4.67295386e-03, -2.83080176e-03,  5.24494797e-02, -4.56885733e-02,
        9.77566186e-03,  2.44479682e-02,  2.58740634e-02, -4.23302269e-03,
       -2.07164623e-02,  1.98443308e-02,  1.47945639e-02, -3.74232829e-02,
       -9.80343786e-04,  1.72036402e-02,  1.37910610e-02,  4.17229980e-02,
        1.96382087e-02, -3.92887779e-02,  9.40188672e-03,  2.33070869e-02,
       -1.08923726e-02, -3.11206467e-02, -4.12568599e-02, -4.27741371e-02,
        1.49169341e-02, -3.87324318e-02, -1.44671425e-02,  1.65003873e-02,
        1.46092027e-02, -1.44975837e-02, -5.94140291e-02, -9.48371459e-03,
        2.92368010e-02, -1.98204275e-02,  7.76614109e-03, -4.90351729e-02,
       -1.53922606e-02, -1.49823148e-02,  9.64776799e-03,  2.70108432e-02,
       -3.83031256e-02, -1.87026951e-02, -1.62569387e-03,  3.13762873e-02,
        8.95851012e-03,  2.66921744e-02, -4.30809371e-02,  2.93493159e-02,
       -1.43961487e-02,  5.69403684e-03, -1.05064530e-02, -5.18172979e-04,
        3.63384350e-03,  5.88431284e-02, -3.42956278e-03,  5.88557450e-03,
        2.89908852e-02,  1.09565631e-03,  6.01799507e-03,  6.65885676e-03,
        1.23199809e-03, -2.28796266e-02, -5.00208884e-03, -9.07766167e-03,
       -3.00300810e-02,  2.10266896e-02, -3.84385064e-02, -1.95439253e-02,
       -1.31303091e-02,  1.24289759e-03,  4.35774066e-02,  3.75294946e-02,
        1.90611873e-02, -4.01339047e-02,  1.43286772e-02, -5.76434610e-03,
       -4.02763672e-02,  5.63421147e-03,  4.36682813e-03, -3.01019177e-02,
       -1.27976737e-03, -2.92233359e-02,  1.64516382e-02, -6.60820911e-03,
       -2.76073068e-02,  1.76811498e-02,  1.45875663e-03, -1.85649190e-02,
        5.57664596e-03, -1.02657564e-02, -1.00212637e-02,  1.94658358e-02,
       -1.59273520e-02, -1.09196948e-02, -8.18825234e-03, -4.15847264e-02,
       -8.51060543e-03, -1.13751506e-02,  3.62898782e-02, -3.52023803e-02,
       -2.81806011e-02, -5.56098484e-02, -4.01809700e-02, -3.16726528e-02,
        1.78488418e-02,  1.28440158e-02, -9.36801638e-03, -2.43084580e-02,
       -1.79043449e-02, -1.77519098e-02,  8.75886902e-03, -1.32264511e-03,
       -2.67409105e-02,  1.29193468e-02,  3.44820321e-02, -1.95862725e-02,
       -2.81904433e-02,  3.23534645e-02, -1.88057646e-02,  2.16532890e-02,
        9.85416002e-04,  8.21708608e-03,  8.25796928e-03, -5.68522513e-02,
        1.05931126e-02, -2.26191636e-02, -2.56322436e-02,  4.15708683e-03,
        7.05406244e-04, -4.46722843e-02, -9.67185758e-03,  1.36953406e-02,
        3.39121767e-03,  2.23920196e-02,  1.10871177e-02,  2.68111820e-03,
        2.78904196e-02, -4.00367146e-03, -3.39278840e-02, -2.04164088e-02,
        5.20696044e-02,  1.30199296e-02, -5.05266227e-02, -2.95350775e-02,
        2.24199090e-02,  1.88536830e-02,  1.77051444e-02, -5.74403554e-02,
       -3.97968851e-02,  4.97319689e-03,  1.64199471e-02,  1.99773945e-02,
       -3.22963372e-02,  1.19243981e-02, -2.01818515e-02,  8.43999675e-04,
       -4.22420958e-03, -7.17755593e-03,  3.02492771e-02,  1.69387776e-02,
        1.77205540e-02,  1.38274552e-02, -3.47679481e-02, -5.96952718e-03,
        1.45195425e-02, -2.88863946e-03, -1.75575987e-02,  1.88626368e-02,
       -2.56911619e-03, -8.58032261e-04, -4.64688465e-02,  2.47844663e-02,
        4.50863643e-03,  3.98884378e-02,  1.48933362e-02,  4.97774296e-02,
        3.46692540e-02,  2.09749257e-03,  4.47769687e-02,  5.48829548e-02,
        6.29755855e-03, -2.47722007e-02,  3.19137312e-02, -1.99621152e-02],
      dtype=float32)

Observations:

Here we retrieve the embedding vector for a specific word from the trained Word2Vec model.

This vector represents the learned meaning of the word "stock". Each number in the array is one coordinate in the learned embedding space.

In [None]:
# Checking the word embedding of another sample word
word = "economy"
model_W2V.wv[word]


In [None]:
array([-8.6877943e-04,  5.5111004e-03,  3.1234843e-03,  1.9725766e-03,
       -1.0286444e-03, -1.1523341e-02,  6.8392050e-03,  1.6077708e-02,
       -2.0085778e-03, -2.3545995e-03, -2.6055875e-03, -3.5587696e-03,
        8.4835966e-04, -3.4791735e-04, -4.4354405e-03, -1.8257098e-03,
        7.1369959e-03,  1.8228008e-03,  3.6506944e-03, -2.9859466e-03,
       -5.8936933e-03,  2.6904689e-03,  3.9673694e-03, -8.2602567e-04,
        1.2995644e-03, -1.2910553e-03, -7.2880471e-03,  2.2825177e-03,
       -6.5748282e-03, -1.0171713e-02, -1.7078748e-03, -6.5229172e-03,
        2.7615535e-03, -2.8482943e-03, -2.6545017e-03,  6.1369566e-03,
        2.9023468e-05, -5.5956258e-03, -1.9870940e-04, -3.1498112e-03,
       -5.5275564e-03,  7.9306000e-04,  1.6425314e-03, -3.7326270e-03,
        3.8806575e-03,  4.0099472e-03, -7.1850308e-04,  1.4117225e-03,
       -1.9630163e-04,  9.7942282e-04,  4.5337928e-03, -4.3110494e-03,
       -4.0972866e-03,  5.5956724e-04, -2.9442455e-03,  6.9537484e-03,
        1.2918224e-03,  3.7239769e-03,  1.2372801e-03, -1.2438655e-03,
       -5.4985085e-03,  2.0703797e-03,  1.6731581e-03, -9.5998711e-04,
       -3.6942697e-04,  2.2490190e-03,  1.9247471e-03,  2.2153379e-03,
       -5.8220564e-03, -2.1577000e-03, -1.9286217e-03,  5.9648682e-03,
        7.1366001e-03, -1.2127272e-03,  4.3389271e-03, -1.3387295e-03,
       -3.3358363e-03,  5.7192618e-04, -3.9373102e-04,  5.7296236e-03,
       -1.0210776e-03, -1.0709617e-03, -2.9002421e-04,  1.0644648e-02,
        3.0951640e-03, -1.2751345e-03, -3.5143355e-03,  1.3207578e-03,
        8.9779943e-03, -2.8577787e-04,  2.7915332e-03, -5.1037432e-03,
        2.2205317e-03,  5.1190864e-06,  4.0356140e-03,  4.8252675e-03,
        3.2133411e-03, -2.0716034e-03, -5.2870288e-03,  9.5594500e-04,
        1.5487644e-03, -2.3288892e-03,  7.3226341e-03,  4.5133703e-03,
        1.6062388e-03, -5.2143475e-03, -2.1692889e-03,  4.6883696e-03,
       -3.6568732e-03,  2.4207907e-03, -8.3082672e-03, -2.4435406e-03,
       -1.5103437e-03,  5.5681444e-03,  4.9501439e-03,  6.4602972e-04,
       -2.5205181e-03,  2.7473655e-03,  8.5757310e-03, -9.3031572e-03,
        1.2144811e-03,  2.2114308e-03,  4.0966757e-03,  2.8600701e-04,
       -2.5704412e-03,  2.5278712e-03,  2.4172633e-03, -3.4288352e-03,
        2.0280904e-03,  1.8974062e-03, -1.1207573e-03,  9.0576271e-03,
        3.4168190e-03, -5.8506909e-03,  2.9124315e-03,  5.3634760e-03,
       -1.4440703e-03, -2.9492246e-03, -6.8697925e-03, -1.0114933e-02,
        1.3560512e-03, -5.8401725e-03, -3.0529783e-03,  5.0153369e-03,
        1.1827443e-04,  9.4005745e-04, -1.0841720e-02, -5.7719095e-04,
        6.2819226e-03, -1.8897885e-04,  5.4145075e-04, -9.5243286e-03,
        3.4421205e-04, -2.3903900e-03, -2.1635432e-04,  4.7022891e-03,
       -9.6979076e-03, -2.7871397e-03,  3.6319576e-03,  2.1450315e-03,
        2.5569621e-04,  1.2551072e-03, -7.7905809e-03,  1.7053240e-03,
       -1.0903481e-03,  3.7276561e-04, -4.6073678e-03,  1.6211224e-03,
       -1.5303899e-03,  8.7628737e-03, -2.1601256e-03,  3.9231228e-03,
        2.1007047e-03, -6.0012436e-04,  2.0293370e-03, -2.2638009e-03,
       -1.8304841e-03, -1.4480957e-04, -3.3087004e-03,  1.7282459e-03,
       -1.2069512e-03,  9.6603250e-04, -7.8050075e-03, -2.4136882e-03,
        1.7090503e-04, -2.3250896e-03,  7.8882584e-03,  6.8761432e-03,
        5.2181459e-03, -8.7883994e-03,  2.5784061e-04,  2.5986737e-04,
       -6.3331877e-03, -1.6203275e-03, -1.3605240e-03, -6.3294275e-03,
        8.9286180e-04, -6.7126313e-03,  3.6996641e-04, -1.2321732e-03,
       -4.8298039e-03,  3.9796927e-03,  7.3203223e-04, -3.6099970e-03,
       -4.2232219e-05, -2.1782957e-03, -2.4144985e-03, -9.8983306e-05,
       -4.7149956e-03, -2.4010236e-03, -1.4423471e-03, -3.8979403e-03,
        1.0773386e-03,  2.9024086e-04,  4.8353914e-03, -2.6542691e-03,
       -6.9536557e-03, -1.1154762e-02, -5.2453042e-03, -4.5367801e-03,
        5.1821209e-03,  5.3422980e-04, -4.8581949e-03, -3.3539773e-03,
       -3.4500735e-03, -1.9160253e-03, -1.9428607e-03,  8.9589454e-04,
       -3.8112425e-03,  3.5778163e-03,  6.5584593e-03, -4.8891925e-03,
       -6.4077214e-03,  7.1996725e-03, -5.7292301e-03,  1.1635459e-03,
       -3.0754148e-03,  2.7606643e-03, -1.0789001e-03, -9.4054285e-03,
        5.2300277e-03, -3.0558892e-03, -5.5007818e-03, -6.5188104e-04,
       -1.3859775e-03, -7.9595065e-03, -1.9017389e-03,  1.8910174e-03,
        2.8099702e-03,  1.6452414e-04, -3.9207513e-04, -3.5164423e-05,
        5.2971132e-03, -1.2920285e-04, -5.3078081e-03, -4.7863061e-03,
        6.8559395e-03,  4.3372279e-03, -1.0577702e-02, -2.7018059e-03,
        1.9179860e-03,  3.1575090e-03,  4.5187605e-04, -6.2088030e-03,
       -5.3681750e-03,  2.8660258e-03,  3.9195600e-03,  3.0380145e-03,
       -2.0083869e-03,  3.2118238e-03, -5.6123245e-03,  3.6715283e-03,
       -1.0871743e-03,  5.5917521e-04,  2.2596852e-03,  4.2654024e-03,
        4.2723427e-03,  4.8454939e-03, -2.5330544e-03, -3.9823661e-03,
        4.5803837e-03, -7.0822971e-06, -2.8075743e-03,  4.0284093e-03,
       -8.7813579e-04, -1.8713846e-03, -6.6320118e-03,  3.1595784e-03,
        6.1246625e-04,  9.0274690e-03,  1.4528614e-03,  6.2677590e-03,
        3.0564803e-03, -1.8720072e-03,  7.6385629e-03,  9.4839931e-03,
        3.2013867e-03, -6.7807804e-03,  2.1265058e-03, -1.1234141e-03],
      dtype=float32)

Observations:

Here we retrieve the embedding vector for another word from the trained Word2Vec model.

This vector represents the learned meaning of the word "economy". Each number in the array is one coordinate in the learned embedding space.
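Individual coordinates are hard to interpret on their own; embeddings are usually compared through the angle between vectors. A minimal sketch of cosine similarity follows, using tiny made-up vectors in place of the 300-dimensional ones; the helper `cosine_similarity` is written here for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1 means same direction."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional vectors standing in for 300-dimensional embeddings
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, different length
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b), cosine_similarity(vec_a, vec_c))
```

For the trained model, gensim exposes the same measure as `model_W2V.wv.similarity("stock", "economy")`.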

In [None]:
# Retrieving the words present in the Word2Vec model's vocabulary
words = list(model_W2V.wv.key_to_index.keys())

# Retrieving word vectors for all the words present in the model's vocabulary
wvs = model_W2V.wv[words].tolist()

# Creating a dictionary of words and their corresponding vectors
word_vector_dict = dict(zip(words, wvs))


Observation:

wvs is a list of 4,682 word vectors, each of dimension 300, stored as plain Python lists. word_vector_dict allows fast lookup of any word's vector.

In [None]:
def average_vectorizer_Word2Vec(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    # (dictionary lookup is faster than scanning the `words` list)
    words_in_vocab = [word for word in doc.split() if word in word_vector_dict]

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector


Observation:

This function transforms a news entry into a sentence embedding by averaging the Word2Vec vectors of its in-vocabulary words.
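The averaging step can be traced by hand on a toy vocabulary. The sketch below uses hypothetical 2-dimensional vectors and a helper named `average_vector`, both invented for illustration; it mirrors the logic of skipping unknown words and falling back to zeros when no word is known.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings (the notebook uses 300 dimensions)
toy_vectors = {"stock": [1.0, 3.0], "rises": [3.0, 1.0]}

def average_vector(doc, vectors, size):
    """Mean of the vectors of in-vocabulary words; zeros if none are known."""
    known = [np.asarray(vectors[w]) for w in doc.split() if w in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

print(average_vector("stock rises today", toy_vectors, 2))   # "today" is skipped
print(average_vector("unknown words only", toy_vectors, 2))  # falls back to zeros
```

Averaging discards word order, so two entries with the same words in a different order receive the same vector.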

In [None]:
# creating a dataframe of the vectorized documents
start = time.time()

X_train_wv = pd.DataFrame(X_train["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
X_val_wv = pd.DataFrame(X_val["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
X_test_wv = pd.DataFrame(X_test["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])

end = time.time()
print('Time taken ', (end-start))


In [None]:
Time taken  0.5024166107177734


In [None]:
print(X_train_wv.shape, X_val_wv.shape, X_test_wv.shape)


In [None]:
(286, 300) (21, 300) (42, 300)


Observation:
All news entries have been converted into 300-dimensional vectors using the average_vectorizer_Word2Vec() function.

The shape of each resulting DataFrame:

X_train_wv: (286, 300)

X_val_wv: (21, 300)

X_test_wv: (42, 300)

GloVe¶

In [None]:
# load the Stanford GloVe model
filename = '/content/drive/MyDrive/Stock Market news sentiment analysis/glove.6B.100d.txt.word2vec'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=False)


Observation:
We load pre-trained GloVe embeddings that were converted to Word2Vec format.

In [None]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(glove_model.index_to_key))


In [None]:
Length of the vocabulary is 400000


Observation:
The GloVe model contains 400,000 unique words

Let's check out a few word embeddings.

In [None]:
# Checking the word embedding of a sample word
word = "stock"
glove_model[word]


In [None]:
array([ 8.6341e-01,  6.9648e-01,  4.5794e-02, -9.5708e-03, -2.5498e-01,
       -7.4666e-01, -2.2086e-01, -4.4615e-01, -1.0423e-01, -9.9931e-01,
        7.2550e-02,  4.5049e-01, -5.9912e-02, -5.7837e-01, -4.6540e-01,
        4.3429e-02, -5.0570e-01, -1.5442e-01,  9.8250e-01, -8.1571e-02,
        2.6523e-01, -2.3734e-01,  9.7675e-02,  5.8588e-01, -1.2948e-01,
       -6.8956e-01, -1.2811e-01, -5.2265e-02, -6.7719e-01,  3.0190e-02,
        1.8058e-01,  8.6121e-01, -8.3206e-01, -5.6887e-02, -2.9578e-01,
        4.7180e-01,  1.2811e+00, -2.5228e-01,  4.9557e-02, -7.2455e-01,
        6.6758e-01, -1.1091e+00, -2.0493e-01, -5.8669e-01, -2.5375e-03,
        8.2777e-01, -4.9102e-01, -2.6475e-01,  4.3015e-01, -2.0516e+00,
       -3.3208e-01,  5.1845e-02,  5.2646e-01,  8.7452e-01, -9.0237e-01,
       -1.7366e+00, -3.4727e-01,  1.6590e-01,  2.7727e+00,  6.5756e-02,
       -4.0363e-01,  3.8252e-01, -3.0787e-01,  5.9202e-01,  1.3468e-01,
       -3.3851e-01,  3.3646e-01,  2.0950e-01,  8.5905e-01,  5.1865e-01,
       -1.0657e+00, -2.6371e-02, -3.1349e-01,  2.3231e-01, -7.0192e-01,
       -5.5737e-01, -2.3418e-01,  1.3563e-01, -1.0016e+00, -1.4221e-01,
        1.0372e+00,  3.5880e-01, -4.2608e-01, -1.9386e-01, -3.7867e-01,
       -6.9646e-01, -3.9989e-01, -5.7782e-01,  1.0132e-01,  2.0123e-01,
       -3.7153e-01,  5.0837e-01, -3.7758e-01, -2.6205e-01, -9.3676e-01,
        1.0053e+00,  8.4393e-01, -2.4698e-01,  1.7339e-01,  9.4473e-01],
      dtype=float32)

Observation:

This 100-dimensional vector encodes the semantic meaning of the word "stock", learned by the pre-trained GloVe model from a corpus of 6 billion tokens of general text.

In [None]:
# Checking the word embedding of a random word
word = "economy"
glove_model[word]


In [None]:
array([-0.19382  ,  1.017    ,  1.076    ,  0.02954  , -0.39192  ,
       -1.3891   , -0.87873  , -0.63162  ,  0.9643   , -0.43035  ,
       -0.34868  ,  0.22736  , -0.40296  ,  0.15641  , -0.16813  ,
       -0.15343  , -0.15799  , -0.27612  ,  0.18088  , -0.28386  ,
        0.49847  ,  0.29864  ,  0.32353  ,  0.18108  , -0.59623  ,
       -0.54165  , -0.70019  , -0.64956  , -0.69063  ,  0.18084  ,
       -0.38581  ,  0.56086  , -0.40313  , -0.38777  , -0.70615  ,
        0.20657  ,  0.34171  , -0.23393  , -0.35882  , -0.2201   ,
       -0.76182  , -1.2047   ,  0.4339   ,  1.1656   ,  0.1836   ,
       -0.21601  ,  0.93198  , -0.059616 , -0.11624  , -1.3259   ,
       -0.79772  , -0.0074957, -0.0889   ,  1.4749   ,  0.31157  ,
       -2.2952   , -0.058351 ,  0.39353  ,  1.4983   ,  0.74023  ,
       -0.20109  ,  0.098124 , -0.73081  , -0.32294  ,  0.16703  ,
        0.87431  , -0.041624 , -0.51022  ,  1.0737   , -0.4257   ,
        1.0581   ,  0.19859  , -0.60087  , -0.33906  ,  0.60243  ,
       -0.091581 , -0.47201  ,  0.74933  , -0.60168  , -0.44178  ,
        0.77391  ,  0.81114  , -1.2889   ,  0.32055  , -0.36117  ,
       -0.88078  ,  0.055524 , -0.26837  , -0.33688  , -1.4359   ,
        0.85666  ,  0.32025  , -0.15361  , -0.30208  , -0.38208  ,
        0.30508  ,  0.75374  , -0.68041  ,  0.98619  , -0.19628  ],
      dtype=float32)

Observation:

This 100-dimensional vector encodes the semantic meaning of the word "economy", learned by the same pre-trained GloVe model.
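Raw embedding values are hard to interpret on their own; they are usually compared through cosine similarity. The following is a minimal sketch in plain Python, using toy 3-dimensional vectors invented for illustration (they are not real GloVe values). The helper name cosine_similarity is an assumption of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
stock = [1.0, 2.0, 0.0]
economy = [2.0, 4.0, 0.0]   # same direction as `stock`
weather = [0.0, 0.0, 3.0]   # orthogonal to `stock`

print(cosine_similarity(stock, economy))  # close to 1.0
print(cosine_similarity(stock, weather))  # 0.0
```

With the loaded model, gensim's glove_model.similarity("stock", "economy") should give the same kind of score for the real vectors.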

In [None]:
# Retrieving the words present in the GloVe model's vocabulary
glove_words = glove_model.index_to_key

# Creating a dictionary of words and their corresponding vectors
glove_word_vector_dict = dict(zip(glove_model.index_to_key,list(glove_model.vectors)))


Observations:

glove_words now holds 400,000 unique words from the glove.6B.100d model.

In [None]:
vec_size=100


In [None]:
def average_vectorizer_GloVe(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in glove_words]

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(glove_word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector


In [None]:
# creating a dataframe of the vectorized documents
start = time.time()
X_train_gl = pd.DataFrame(X_train["News"].apply(average_vectorizer_GloVe).tolist(),
                          columns=['Feature ' + str(i) for i in range(vec_size)])
X_val_gl = pd.DataFrame(X_val["News"].apply(average_vectorizer_GloVe).tolist(),
                        columns=['Feature ' + str(i) for i in range(vec_size)])
X_test_gl = pd.DataFrame(X_test["News"].apply(average_vectorizer_GloVe).tolist(),
                         columns=['Feature ' + str(i) for i in range(vec_size)])
end = time.time()
print('Time taken ', (end - start))


In [None]:
Time taken  27.41490125656128


Observation:

Sentence-level embeddings are generated with GloVe for the train, validation, and test datasets. This step takes about 27 seconds, compared with about 0.5 seconds for Word2Vec.
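The long runtime of this cell is likely dominated by the membership test `word in glove_words`: glove_words is a Python list of 400,000 entries, so each test scans the list. A minimal sketch of the same averaging with a dictionary lookup is shown below. It is written in plain Python on a toy vocabulary; the function name average_vectors and the toy vectors are assumptions of this sketch, not part of the notebook.

```python
def average_vectors(doc, word_vectors, vec_size):
    """Average the vectors of the tokens of `doc` that are in `word_vectors`."""
    total = [0.0] * vec_size
    count = 0
    for word in doc.split():
        vec = word_vectors.get(word)  # dict lookup, constant time on average
        if vec is None:
            continue                  # skip out-of-vocabulary tokens
        total = [t + x for t, x in zip(total, vec)]
        count += 1
    if count == 0:
        return total                  # all-zero vector when no token is known
    return [t / count for t in total]

# Toy 2-dimensional vocabulary, invented for illustration
toy_vectors = {"stock": [1.0, 0.0], "rises": [0.0, 1.0]}
print(average_vectors("stock rises sharply", toy_vectors, 2))  # [0.5, 0.5]
```

In the notebook, the equivalent change would be to test membership against glove_word_vector_dict (or a set built from glove_words) instead of the list; the resulting features are the same.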

In [None]:
print(X_train_gl.shape, X_val_gl.shape, X_test_gl.shape)


In [None]:
(286, 100) (21, 100) (42, 100)


Observations:

All 3 datasets have the expected 100 features, matching the GloVe embedding dimension.

A sentence embedding was generated for each news item, and the row counts (286, 21, 42) match those of the Word2Vec features, so no rows were lost during vectorization.

Sentence Transformer¶

In [None]:
#Defining the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Now we are loading all-MiniLM-L6-v2, a lightweight yet high-performing transformer model.

In [None]:
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


This checks if a CUDA-enabled GPU is available and sets it as the active device

In [None]:
# encoding the dataset
start = time.time()

X_train_st = model.encode(X_train["News"].values, show_progress_bar=True, device=device)
X_val_st = model.encode(X_val["News"].values, show_progress_bar=True, device=device)
X_test_st = model.encode(X_test["News"].values, show_progress_bar=True, device=device)

end = time.time()
print("Time taken ",(end-start))


In [None]:
Time taken  1.2995922565460205


In [None]:
print(X_train_st.shape, X_val_st.shape, X_test_st.shape)


In [None]:
(286, 384) (21, 384) (42, 384)


Observation:

Each news item is now represented by a 384-dimensional vector produced by the all-MiniLM-L6-v2 transformer model.

These embeddings can be passed directly to any classification model.


Sentiment Analysis¶

Model Evaluation Criterion¶

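The models are compared on accuracy and on support-weighted recall, precision, and F1-score. Weighted F1 is also the metric optimized during hyperparameter tuning (scoring='f1_weighted').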

Utility Functions¶

In [None]:
def plot_confusion_matrix(model, predictors, target):
    """
    Plot a confusion matrix to visualize the performance of a classification model.

    Parameters:
    model (sklearn classifier): The fitted classification model to evaluate.
    predictors (array-like): The independent variables used for predictions.
    target (array-like): The true labels.

    Returns:
    None: Displays the confusion matrix plot.
    """
    pred = model.predict(predictors)  # Make predictions using the classifier.

    # Fix the label order explicitly so the tick labels match the matrix rows and columns.
    # Without `labels`, confusion_matrix orders the classes by sorted label: -1, 0, 1.
    label_list = [-1, 0, 1]
    cm = confusion_matrix(target, pred, labels=label_list)  # Compute the confusion matrix.

    plt.figure(figsize=(5, 4))  # Create a new figure with a specified size.
    sns.heatmap(cm, annot=True, fmt='.0f', cmap='Blues', xticklabels=label_list, yticklabels=label_list)
    # Plot the confusion matrix using a heatmap with annotations.

    plt.ylabel('Actual')  # Label for the y-axis.
    plt.xlabel('Predicted')  # Label for the x-axis.
    plt.title('Confusion Matrix')  # Title of the plot.
    plt.show()  # Display the plot.


Observations:
This function visualizes model performance as a confusion matrix of actual versus predicted labels.
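When no label order is passed, scikit-learn's confusion_matrix arranges rows and columns by sorted label, so for the labels -1, 0, and 1 the first row is class -1. The sketch below reproduces that ordering rule in plain Python on a tiny invented example, to make clear which class each row and column refers to. The helper confusion_counts is hypothetical and stands in for the library call.

```python
def confusion_counts(y_true, y_pred, labels=None):
    """Return (labels, matrix); rows are actual classes, columns are predicted classes."""
    if labels is None:
        # Default ordering rule: every label that appears, in sorted order
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

# Tiny invented example with the three sentiment labels
y_true = [0, 0, 1, -1, -1, -1]
y_pred = [0, 1, 1, -1, -1, 0]

labels, matrix = confusion_counts(y_true, y_pred)
print(labels)          # [-1, 0, 1]
for row in matrix:
    print(row)         # [2, 1, 0] then [0, 1, 1] then [0, 0, 1]
```

Tick labels on a heatmap of such a matrix need to follow the same order, otherwise counts are attributed to the wrong class.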

In [None]:
def model_performance_classification_sklearn(model, predictors, target):
    """
    Compute various performance metrics for a classification model using sklearn.

    Parameters:
    model (sklearn classifier): The classification model to evaluate.
    predictors (array-like): The independent variables used for predictions.
    target (array-like): The true labels for the dependent variable.

    Returns:
    pandas.DataFrame: A DataFrame containing the computed metrics (Accuracy, Recall, Precision, F1-score).
    """
    pred = model.predict(predictors)  # Make predictions using the classifier.

    acc = accuracy_score(target, pred)  # Compute Accuracy.
    recall = recall_score(target, pred,average='weighted')  # Compute Recall.
    precision = precision_score(target, pred,average='weighted')  # Compute Precision.
    f1 = f1_score(target, pred,average='weighted')  # Compute F1-score.

    # Create a DataFrame to store the computed metrics.
    df_perf = pd.DataFrame(
        {
            "Accuracy": [acc],
            "Recall": [recall],
            "Precision": [precision],
            "F1": [f1],
        }
    )

    return df_perf  # Return the DataFrame with the metrics.


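Observations:
This function evaluates a classification model and returns its accuracy together with support-weighted recall, precision, and F1-score in a one-row DataFrame.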
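To make the `average='weighted'` option concrete, the sketch below recomputes the four metrics in plain Python: each per-class score is weighted by that class's share of the true labels. It also shows why the Recall column always equals Accuracy under this averaging. The function weighted_scores and the sample labels are assumptions of this sketch; undefined per-class precision is treated as 0.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, recall, precision, f1) with support-weighted averaging."""
    n = len(y_true)
    support = Counter(y_true)      # number of true instances per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    recall = precision = f1 = 0.0
    for label, count in support.items():
        weight = count / n
        r = correct[label] / count
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        recall += weight * r
        precision += weight * p
        f1 += weight * f
    return accuracy, recall, precision, f1

# Invented labels for illustration
acc, rec, prec, f1 = weighted_scores([0, 0, 1, -1], [0, 1, 1, 0])
print(acc, rec, prec)  # 0.5 0.5 0.375
```

Because each class's recall is weighted by its support, the weighted recall reduces to the number of correct predictions divided by the number of samples, which is the accuracy.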

Base Model - Word2Vec¶

In [None]:
# Building the model

#Uncomment only one of the snippets related to fitting the model to the data

#base_wv = GradientBoostingClassifier(random_state = 42)
base_wv = RandomForestClassifier(random_state=42)
#base_wv = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_wv.fit(X_train_wv, y_train)


In [None]:
RandomForestClassifier(random_state=42)


Observation :

Now, we are training a Random Forest Classifier on the Word2Vec-based sentence embeddings

In [None]:
plot_confusion_matrix(base_wv,X_train_wv,y_train)


Observations:

scikit-learn's confusion_matrix orders rows and columns by sorted label (-1, 0, 1), so the diagonal reads as follows.

Correct predictions (diagonal values):

Class -1: 82 instances correctly classified as -1.

Class 0: 138 instances correctly classified as 0.

Class 1: 66 instances correctly classified as 1.

Misclassifications (off-diagonal values):

All off-diagonal entries are 0, so the model makes no errors on the training data. A perfect fit on all 286 training samples suggests that the model is overfitting.

In [None]:
plot_confusion_matrix(base_wv,X_val_wv,y_val)


Observations:

Reading the matrix in sorted label order (-1, 0, 1):

Class -1 has no correct predictions; 4 of its instances are misclassified as class 0. This may indicate that class -1 is underrepresented or difficult to distinguish from class 0 in this dataset.

Class 0 has 3 instances misclassified as class -1 and 2 as class 1.

Overall, only 7 of the 21 validation samples are classified correctly.

In [None]:
#Calculating different metrics on training data
base_train_wv = model_performance_classification_sklearn(base_wv,X_train_wv,y_train)
print("Training performance:\n", base_train_wv)


In [None]:
Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0


In [None]:
#Calculating different metrics on validation data
base_val_wv = model_performance_classification_sklearn(base_wv,X_val_wv,y_val)
print("Validation performance:\n",base_val_wv)


In [None]:
Validation performance:
    Accuracy    Recall  Precision        F1
0  0.333333  0.333333       0.25  0.285714


Observations:

The model is achieving perfect performance on the training set, with all metrics (accuracy, recall, precision, and F1-score) equal to 1.

This suggests that the model is perfectly fitting the training data.

The model's performance on the validation set is significantly lower than on the training set, suggesting poor generalization.

Base Model - GloVe¶

In [None]:
#Building the model

#Uncomment only one of the snippets related to fitting the model to the data

#base_gl = GradientBoostingClassifier(random_state = 42)
base_gl = RandomForestClassifier(random_state=42)
#base_gl = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_gl.fit(X_train_gl, y_train)


In [None]:
RandomForestClassifier(random_state=42)


Observation :

Now, we are training a Random Forest Classifier on the GloVe-based sentence embeddings.

In [None]:
plot_confusion_matrix(base_gl, X_train_gl, y_train)


Observation:

Correct predictions (diagonal values, in sorted label order -1, 0, 1):

Class -1: 82 instances were correctly classified as -1.

Class 0: 138 instances were correctly classified as 0.

Class 1: 66 instances were correctly classified as 1.

Misclassifications (off-diagonal values):

All off-diagonal entries are 0, so every training prediction is correct. With no misclassifications at all on the training data, the model is likely overfitting.

In [None]:
plot_confusion_matrix(base_gl, X_val_gl, y_val)


Observation:

Class -1 (the first row in sorted label order -1, 0, 1) has no correct predictions, and 4 of its instances are misclassified as class 0.

Overall, 10 of the 21 validation samples are classified correctly, compared with all 286 training samples.

The model is not generalizing well.

In [None]:
#Calculating different metrics on training data
base_train_gl = model_performance_classification_sklearn(base_gl, X_train_gl, y_train)
print("Training performance:\n", base_train_gl)


In [None]:
Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0


In [None]:
#Calculating different metrics on validation data
base_val_gl = model_performance_classification_sklearn(base_gl, X_val_gl, y_val)
print("Validation performance:\n",base_val_gl)


In [None]:
Validation performance:
    Accuracy   Recall  Precision        F1
0   0.47619  0.47619   0.400794  0.426871


Observation:

The model performs perfectly on the training data, achieving perfect scores for all metrics (accuracy, recall, precision, and F1-score). As noted earlier, this can be a sign of overfitting.

The performance on the validation set is considerably lower compared to the training set.

Base Model - Sentence Transformer¶

In [None]:
# Building the model

#Uncomment only one of the snippets related to fitting the model to the data

#base_st = GradientBoostingClassifier(random_state = 42)
base_st = RandomForestClassifier(random_state=42)
#base_st = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_st.fit(X_train_st, y_train)


In [None]:
RandomForestClassifier(random_state=42)


Observation :
Now, we are training a Random Forest Classifier on the Sentence Transformer embeddings.

In [None]:
plot_confusion_matrix(base_st, X_train_st, y_train)


Observation:

Correct predictions (diagonal, in sorted label order -1, 0, 1):

Class -1: 82 instances correctly classified as -1.

Class 0: 138 instances correctly classified as 0.

Class 1: 66 instances correctly classified as 1.

Misclassifications (off-diagonal):

All off-diagonal entries are 0, so every training instance has been correctly classified. Perfect classification on the training data often indicates overfitting.

In [None]:
plot_confusion_matrix(base_st, X_val_st, y_val)


Observations:

The model classifies 13 of the 21 validation samples correctly, 12 of which belong to class 0 (the second row in sorted label order -1, 0, 1).

The other two classes contribute only one correct prediction between them, so the comparatively high accuracy comes almost entirely from the largest class.

In [None]:
#Calculating different metrics on training data
base_train_st = model_performance_classification_sklearn(base_st, X_train_st, y_train)
print("Training performance:\n", base_train_st)


In [None]:
Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0


In [None]:
#Calculating different metrics on validation data
base_val_st = model_performance_classification_sklearn(base_st, X_val_st, y_val)
print("Validation performance:\n",base_val_st)


In [None]:
Validation performance:
    Accuracy    Recall  Precision        F1
0  0.619048  0.619048   0.533333  0.504762


Observation:

The model is overfitting on the training data, as indicated by a perfect performance (accuracy, recall, precision, F1-score of 1.0).

The validation performance has significantly dropped compared to the training data.

Tuned Model - Word2Vec¶

Note: The parameter grid provided below is a sample grid. It can be modified depending on the compute power of the system being used.

In [None]:
start = time.time()

# Choose the type of classifier.

#Uncomment only one of the snippets corresponding to the base model trained previously

#tuned_wv = GradientBoostingClassifier(random_state = 42)
tuned_wv = RandomForestClassifier(random_state=42)
#tuned_wv = DecisionTreeClassifier(random_state=42)



parameters = {
    'max_depth': np.arange(3,7),
    'min_samples_split': np.arange(5,12,2),
    'max_features': ['log2', 'sqrt', 0.2, 0.4]
}

# Run the grid search
grid_obj = GridSearchCV(tuned_wv, parameters, scoring='f1_weighted',cv=5,n_jobs=-1)
grid_obj = grid_obj.fit(X_train_wv, y_train)

end = time.time()
print("Time taken ",(end-start))

# Set the clf to the best combination of parameters
tuned_wv = grid_obj.best_estimator_


In [None]:
Time taken  144.49306058883667


In [None]:
# Fit the best algorithm to the data.
tuned_wv.fit(X_train_wv, y_train)


In [None]:
RandomForestClassifier(max_depth=6, min_samples_split=11, random_state=42)


Observation:

GridSearchCV is used to tune the RandomForestClassifier with 5-fold cross-validation, selecting the parameter combination with the best weighted F1-score.

The following hyperparameters are tuned:

max_depth: The maximum depth of each tree (3, 4, 5, or 6).

min_samples_split: The minimum number of samples required to split an internal node (5, 7, 9, or 11).

max_features: The number of features to consider when looking for the best split ('log2', 'sqrt', or a fraction of the features, 0.2 or 0.4).

The best estimator uses max_depth=6 and min_samples_split=11; max_features does not appear in the output because the selected value equals the default.
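The size of this search explains most of the runtime. As a rough sketch, the grid can be enumerated in plain Python, with the np.arange ranges written out as lists; the variable names below are assumptions of this sketch.

```python
from itertools import product

# The same grid as above, with the np.arange ranges expanded
param_grid = {
    "max_depth": [3, 4, 5, 6],               # np.arange(3, 7)
    "min_samples_split": [5, 7, 9, 11],      # np.arange(5, 12, 2)
    "max_features": ["log2", "sqrt", 0.2, 0.4],
}

# Every combination of one value per hyperparameter
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv_folds = 5
print(len(candidates))             # 64 candidate settings
print(len(candidates) * cv_folds)  # 320 cross-validation fits
```

With the default refit=True, GridSearchCV then fits the best candidate once more on the full training set, so best_estimator_ is already fitted when it is returned.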

In [None]:
plot_confusion_matrix(tuned_wv,X_train_wv,y_train)


Observations:

This is the performance of the tuned RandomForestClassifier on the training data, read in sorted label order (-1, 0, 1).

Correct predictions (diagonal):

Class -1: 76 instances correctly classified as -1.

Class 0: 138 instances correctly classified as 0.

Class 1: 53 instances correctly classified as 1.

Misclassifications (off-diagonal):

Class -1: 6 instances were misclassified as 0.

Class 0: no instances were misclassified.

Class 1: 2 instances were misclassified as -1, and 11 instances as 0.

In total, 267 of the 286 training samples are classified correctly. Most errors are class 1 instances predicted as class 0, followed by class -1 instances predicted as class 0, so the tuned model leans toward the largest class.

In [None]:
plot_confusion_matrix(tuned_wv,X_val_wv,y_val)


Observation:

The confusion matrix for the tuned RandomForestClassifier on the validation data, read in sorted label order (-1, 0, 1):

Class -1: 4 instances are misclassified as class 0.

Class 0: 8 instances are correctly classified; 3 are misclassified as class -1 and 1 as class 1.

Class 1: no instances are misclassified as class -1.

Overall, 8 of the 21 validation samples are classified correctly (accuracy 0.381), and all of them belong to class 0. The model has difficulty separating the two smaller classes from class 0, the largest class in the training data.

In [None]:
#Calculating different metrics on training data
tuned_train_wv=model_performance_classification_sklearn(tuned_wv,X_train_wv,y_train)
print("Training performance:\n",tuned_train_wv)


In [None]:
Training performance:
    Accuracy    Recall  Precision        F1
0  0.933566  0.933566   0.939727  0.932458


In [None]:
#Calculating different metrics on validation data
tuned_val_wv = model_performance_classification_sklearn(tuned_wv,X_val_wv,y_val)
print("Validation performance:\n",tuned_val_wv)


In [None]:
Validation performance:
    Accuracy    Recall  Precision        F1
0  0.380952  0.380952   0.268908  0.315271


Observation:

The model performs well on the training set, with accuracy, recall, precision, and F1-score all around 0.93. Compared with the base model's perfect scores, the tuned settings fit the training data less closely.

On the validation set, accuracy drops to 38.1% (from 93.4% on training), with weighted precision of 0.27 and weighted F1 of 0.32.

The large gap shows that the tuned model still does not generalize well to the validation data.

Tuned Model - GloVe¶

In [None]:
start = time.time()

#Uncomment only one of the snippets corresponding to the base model trained previously

#tuned_gl_model = GradientBoostingClassifier(random_state = 42)
tuned_gl_model = RandomForestClassifier(random_state=42)
#tuned_gl_model = DecisionTreeClassifier(random_state=42)

parameters = {
    'max_depth': np.arange(3,7),
    'min_samples_split': np.arange(5,12,2),
    'max_features': ['log2', 'sqrt', 0.2, 0.4]
}

# Run the grid search
grid_obj = GridSearchCV(tuned_gl_model, parameters, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_gl, y_train)

end = time.time()
print("Time taken ",(end-start))

# Set the clf to the best combination of parameters
tuned_gl = grid_obj.best_estimator_


In [None]:
Time taken  84.29355716705322


In [None]:
# Fit the best algorithm to the data.
tuned_gl.fit(X_train_gl, y_train)


In [None]:
RandomForestClassifier(max_depth=6, max_features=0.2, min_samples_split=7,
                       random_state=42)


Observation:

GridSearchCV is run on the GloVe training features with the same parameter grid (max_depth, min_samples_split, max_features) and the same 5-fold, weighted-F1 setup as for Word2Vec.

The best estimator uses max_depth=6, max_features=0.2, and min_samples_split=7.

In [None]:
plot_confusion_matrix(tuned_gl, X_train_gl, y_train)


Observation:

The confusion matrix for the tuned model on training data, read in sorted label order (-1, 0, 1), shows the following:

Correct predictions (diagonal):

Class -1: 79 instances correctly classified as -1.

Class 0: 138 instances correctly classified as 0.

Class 1: 63 instances correctly classified as 1.

Misclassifications (off-diagonal):

Class -1: 3 instances misclassified as 0.

Class 0: no instances misclassified.

Class 1: 1 instance misclassified as -1, and 2 instances misclassified as 0.

In total, 280 of the 286 training samples are classified correctly. Class 0 is classified perfectly, and the few errors in classes -1 and 1 are mostly predictions of class 0.

In [None]:
plot_confusion_matrix(tuned_gl, X_val_gl, y_val)


Observations:

The confusion matrix for the tuned model on validation data, read in sorted label order (-1, 0, 1), shows the following:

Class 0: 10 instances are correctly classified; 1 is misclassified as class -1 and 1 as class 1.

Overall, 12 of the 21 validation samples are classified correctly (accuracy 0.571), so classes -1 and 1 together contribute only 2 correct predictions.

Most of the errors therefore fall on the two smaller classes, which may indicate that their features overlap with those of class 0.

In [None]:
#Calculating different metrics on training data
tuned_train_gl = model_performance_classification_sklearn(tuned_gl, X_train_gl, y_train)
print("Training performance:\n",tuned_train_gl)


In [None]:
Training performance:
    Accuracy    Recall  Precision        F1
0  0.979021  0.979021   0.979545  0.978968


In [None]:
#Calculating different metrics on validation data
tuned_val_gl = model_performance_classification_sklearn(tuned_gl, X_val_gl, y_val)
print("Validation performance:\n",tuned_val_gl)


In [None]:
Validation performance:
    Accuracy    Recall  Precision        F1
0  0.571429  0.571429   0.539683  0.530612


Observation:

The model performs very well on the training set, with accuracy, recall, precision, and F1-score all close to 0.98.

Performance on the validation set is considerably lower (accuracy 0.571, F1 0.531). The gap between training and validation performance suggests overfitting.

Tuned Model - Sentence Transformer¶

In [None]:
start = time.time()

# Choose the type of classifier.

#Uncomment only one of the snippets corresponding to the base model trained previously

#tuned_st_model = GradientBoostingClassifier(random_state = 42)
tuned_st_model = RandomForestClassifier(random_state=42)
#tuned_st_model = DecisionTreeClassifier(random_state=42)

parameters = {
    'max_depth': np.arange(3,7),
    'min_samples_split': np.arange(5,12,2),
    'max_features': ['log2', 'sqrt', 0.2, 0.4]
}

# Run the grid search
grid_obj = GridSearchCV(tuned_st_model, parameters, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_obj = grid_obj.fit(X_train_st, y_train)

end = time.time()
print("Time taken ",(end-start))

# Set the clf to the best combination of parameters
tuned_st = grid_obj.best_estimator_


In [None]:
Time taken  165.79925274848938


In [None]:
# Fit the best algorithm to the data.
tuned_st.fit(X_train_st, y_train)


In [None]:
RandomForestClassifier(max_depth=6, max_features=0.2, min_samples_split=11,
                       random_state=42)


Observation:

GridSearchCV is run on the Sentence Transformer training features with the same parameter grid (max_depth, min_samples_split, max_features) and the same 5-fold, weighted-F1 setup as for Word2Vec.

The best estimator uses max_depth=6, max_features=0.2, and min_samples_split=11.

In [None]:
plot_confusion_matrix(tuned_st, X_train_st, y_train)


Observation:

The confusion matrix for the tuned model on the training data, read in sorted label order (-1, 0, 1):

Correct predictions (diagonal values):

Class -1: 81 instances correctly classified as -1.

Class 0: 138 instances correctly classified as 0.

Class 1: 66 instances correctly classified as 1.

Misclassifications (off-diagonal values):

Class -1: 1 instance misclassified as class 0.

Class 0: no instances misclassified.

Class 1: no instances misclassified.

The model fits the training data almost perfectly: 285 of 286 samples are classified correctly, with a single class -1 instance predicted as class 0.

In [None]:
plot_confusion_matrix(tuned_st, X_val_st, y_val)


Observations:

The confusion matrix for the tuned model on validation data, read in sorted label order (-1, 0, 1):

Class -1: 4 instances are misclassified as class 0.

Class 0: 12 instances are correctly classified, with none misclassified.

Overall, 12 of the 21 validation samples are classified correctly (accuracy 0.571), and all of them belong to class 0. The model therefore recognizes the largest class well but does not correctly identify any instance of classes -1 or 1 on the validation set.

In [None]:
#Calculating different metrics on training data
tuned_train_st = model_performance_classification_sklearn(tuned_st, X_train_st, y_train)
print("Training performance:\n",tuned_train_st)


In [None]:
Training performance:
    Accuracy    Recall  Precision        F1
0  0.996503  0.996503   0.996529  0.996499


In [None]:
#Calculating different metrics on validation data
tuned_val_st = model_performance_classification_sklearn(tuned_st, X_val_st, y_val)
print("Validation performance:\n",tuned_val_st)


In [None]:
Validation performance:
    Accuracy    Recall  Precision        F1
0  0.571429  0.571429   0.326531  0.415584


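Observations:

The model performs exceptionally well on the training set, with accuracy, recall, precision, and F1-score all close to 1.0.

Validation accuracy is only 57.14%, the same as the tuned GloVe model and below the base Sentence Transformer model (61.90%). The weighted F1-score of 0.416 is lower than both, so the tuned model is overfitting rather than improving on the base model.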
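The reported validation scores for this model (accuracy 0.571429, precision 0.326531, F1 0.415584) have a simple reading. As a sketch under an assumption not stated in the notebook, namely that the largest class holds 12 of the 21 validation samples, a classifier that predicts that class for every sample produces the same weighted scores to six decimals. The function name majority_baseline_scores is hypothetical.

```python
def majority_baseline_scores(majority_count, n_samples):
    """Weighted metrics when every sample is predicted as the majority class."""
    share = majority_count / n_samples       # support weight of the majority class
    accuracy = share                         # only majority-class samples are correct
    recall = share * 1.0                     # majority recall is 1, all other recalls are 0
    precision = share * (majority_count / n_samples)  # majority precision = correct / all predictions
    f1_majority = 2 * majority_count / (majority_count + n_samples)
    f1 = share * f1_majority                 # other classes contribute 0
    return accuracy, recall, precision, f1

# Assumption: 12 of the 21 validation samples belong to the largest class
scores = [round(x, 6) for x in majority_baseline_scores(12, 21)]
print(scores)  # [0.571429, 0.571429, 0.326531, 0.415584]
```

A model whose validation metrics match this baseline adds nothing over always predicting the largest class, which is a useful reference point when comparing the six models.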

Model Performance Summary and Final Model Selection¶

In [None]:
#training performance comparison

models_train_comp_df = pd.concat(
    [base_train_wv.T,
     base_train_gl.T,
     base_train_st.T,
     tuned_train_wv.T,
     tuned_train_gl.T,
     tuned_train_st.T,
    ],axis=1
)

models_train_comp_df.columns = [
    "Base Model (Word2Vec)",
    "Base Model (GloVe)",
    "Base Model (Sentence Transformer)",
    "Tuned Model (Word2Vec)",
    "Tuned Model (GloVe)",
    "Tuned Model (Sentence Transformer)",
]

print("Training performance comparison:")
models_train_comp_df


In [None]:
Training performance comparison:


Observations:

Training Performance Comparison:
The table compares the training performance of the base models (Word2Vec, GloVe, Sentence Transformer) and the tuned models (Word2Vec, GloVe, Sentence Transformer).

1. All base models (Word2Vec, GloVe, and Sentence Transformer) show perfect scores across all metrics (Accuracy, Recall, Precision, and F1-score), which typically indicates overfitting, i.e. the models have memorized the training data.

2. The Tuned Model (Sentence Transformer) achieves near-perfect results on all metrics (Accuracy, Recall, Precision, and F1-score):


Accuracy: 0.996503

Precision: 0.996529

Recall: 0.996503

F1-Score: 0.996499

In [None]:
#validation performance comparison

models_val_comp_df = pd.concat(
    [base_val_wv.T,
     base_val_gl.T,
     base_val_st.T,
     tuned_val_wv.T,
     tuned_val_gl.T,
     tuned_val_st.T,
     ],axis=1
)

models_val_comp_df.columns = [
    "Base Model (Word2Vec)",
    "Base Model (GloVe)",
    "Base Model (Sentence Transformer)",
    "Tuned Model (Word2Vec)",
    "Tuned Model (GloVe)",
    "Tuned Model (Sentence Transformer)",
]

print("Validation performance comparison:")
models_val_comp_df


In [None]:
Validation performance comparison:


Observations:

Validation Performance Comparison:
The table compares the validation performance of the base models (Word2Vec, GloVe, Sentence Transformer) and the tuned models (Word2Vec, GloVe, Sentence Transformer).

1. The base models (Word2Vec, GloVe, and Sentence Transformer) have low accuracy and precision, particularly Word2Vec and GloVe.

2. Base Model (Sentence Transformer) performs better than the other base models, with an accuracy of 61.9%.

3. Tuned models show improvements in most cases, especially Tuned Model (GloVe), which performs much better than the base GloVe model.

4. Tuned Model (Sentence Transformer) reaches an accuracy of 0.571429, although its precision is lower at 0.326531.

Overall, Base Model (Sentence Transformer) and Tuned Model (Sentence Transformer) appear to perform best on the training and validation sets when compared with the other models.

Model Performance Check on Test Data¶

In [None]:
plot_confusion_matrix(tuned_st, X_test_st, y_test)  # Confusion matrix for the final model on the test data


Observations:

Let's run the test data through the tuned Sentence Transformer model. The confusion matrix for the tuned model on the test data shows the following.

Class 0: 12 instances correctly classified and 1 misclassified as class 1. The model is doing fairly well with class 0.

Class 1: 19 instances correctly classified and 1 misclassified as class 0. The model is performing very well with class 1.

Class -1: 9 instances correctly classified and no misclassifications. The model is performing excellently with class -1.

In [None]:
# Calculating different metrics on test data
final_model_test = model_performance_classification_sklearn(tuned_st, X_test_st, y_test)
print("Test performance for the final model:\n", final_model_test)


In [None]:
Test performance for the final model:
    Accuracy   Recall  Precision        F1
0   0.47619  0.47619   0.380952  0.342857


Observation:

The model's accuracy on the test set is about 47.6%, far below the 99.65% reached on the training data and also below the 57.14% on validation data.

Given the high performance on the training set and the much lower performance on the test set, the model is overfitting.

However, the model still seems to perform better than the other models.
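The size of the overfitting gap can be made explicit with a few lines of arithmetic. This is a minimal sketch that uses only the accuracy values printed in the outputs above; the dictionary names are illustrative.

```python
# Accuracy values copied from the outputs reported in this section.
accuracy = {
    "train": 0.996503,
    "validation": 0.571429,
    "test": 0.47619,
}

# Gap between training accuracy and each held-out split, in percentage points.
gaps = {
    split: round((accuracy["train"] - value) * 100, 2)
    for split, value in accuracy.items()
    if split != "train"
}
print(gaps)
```

A gap of this size on both held-out splits suggests the tuned model has learned patterns specific to the training set rather than patterns that generalize.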

Weekly News Summarization¶

Important Note: It is recommended to run this section of the project independently from the previous sections in order to avoid runtime crashes due to RAM overload.

In [None]:
# Installation for GPU llama-cpp-python
# run the following line when a GPU is being used (comment it out otherwise)
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
#!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q


In [None]:
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 40.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.0/62.0 kB 203.8 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 228.4 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.4/16.4 MB 270.8 MB/s eta 0:00:00
  Building wheel for llama-cpp-python (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.2.4 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.4 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.4 which is incompatible.


In [None]:
# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download

# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama

# Importing the library for data manipulation
import pandas as pd

from tqdm import tqdm # For progress bar related functionalities
tqdm.pandas()


In [None]:
stock_news = pd.read_csv("/content/drive/MyDrive/Stock Market news sentiment analysis/stock_news.csv")  # Load the dataset


In [None]:
data = stock_news.copy()


In [None]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"


model_path = hf_hub_download(
    repo_id=model_name_or_path,  # Hugging Face repository that hosts the model
    filename=model_basename      # GGUF file to download from that repository
)


In [None]:
# Run this cell if the runtime is connected to a GPU.
llm = Llama(
    model_path=model_path,  # Path to the model
    n_gpu_layers=100,       # Number of layers transferred to GPU
    n_ctx=4500,             # Context window
)


In [None]:
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


In [None]:
# uncomment and run the following code in case GPU is not being used

# llm = Llama(
#     model_path=model_path,
#     n_ctx=4500, # Context window
#     n_cores=-2 # Number of CPU cores to use
# )


In [None]:
data["Date"] = pd.to_datetime(data['Date'])  # Convert the 'Date' column to datetime format.


In [None]:
# Group the data by week using the 'Date' column.
weekly_grouped = data.groupby(pd.Grouper(key='Date', freq='W'))


In [None]:
# Aggregate the grouped data on a weekly basis:
# concatenate 'News' values into a single string separated by ' || '.
weekly_grouped = weekly_grouped.agg(
    {
        'News': lambda x: ' || '.join(x)  # Join the news values with ' || ' separator.
    }
).reset_index()

print(weekly_grouped.shape)


In [None]:
(18, 2)


Observations:

This step aggregates the dataset by week using the "Date" column and joins each week's news items into a single string. The result has 18 weekly rows and 2 columns.
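In pandas, `freq='W'` is a weekly frequency anchored on Sunday, so each group is labelled with the Sunday that ends its week. The following standard-library sketch mimics that bucketing without pandas. The helper name `week_ending_sunday` and the sample rows are hypothetical and only serve as illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_sunday(day: date) -> date:
    """Return the Sunday that closes the week containing `day`."""
    # date.weekday(): Monday == 0 ... Sunday == 6
    return day + timedelta(days=6 - day.weekday())


# Hypothetical rows standing in for (Date, News) pairs.
rows = [
    (date(2019, 1, 2), "item A"),
    (date(2019, 1, 4), "item B"),
    (date(2019, 1, 7), "item C"),
]

weekly = defaultdict(list)
for day, text in rows:
    weekly[week_ending_sunday(day)].append(text)

# Join each week's items with the same ' || ' separator used above.
weekly_news = {week: " || ".join(items) for week, items in sorted(weekly.items())}
```

Here the first two items fall in the week ending Sunday 2019-01-06, while the third starts a new week ending 2019-01-13.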

In [None]:
weekly_grouped


In [None]:
# creating a copy of the data
data_1 = weekly_grouped.copy()


Note:


The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.

As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.



For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:


Role: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role

Example: You are an expert data analyst specializing in news content analysis.


Task: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective

Example: Analyze the provided news headline and return the main topics contained within it.


Instructions: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly

Example:




In [None]:
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.


Output Format: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output

Example: Return the output in JSON format with keys as the topic number and values as the actual topic.




Full Prompt Example:

In [None]:
You are an expert data analyst specializing in news content analysis.

Task: Analyze the provided news headline and return the main topics contained within it.

Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.

Return the output in JSON format with keys as the topic number and values as the actual topic.

Sample Output:

{"1": "Politics", "2": "Economy", "3": "Health" }

Observations:

Next, we build a process that summarizes the weekly news headlines that could potentially impact stock prices.

In [None]:
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
    import json
    try:
        # Find the indices of the opening and closing curly braces
        json_start = json_str.find('{')
        json_end = json_str.rfind('}')

        if json_start != -1 and json_end != -1:
            extracted_category = json_str[json_start:json_end + 1]  # Extract the JSON object
            data_dict = json.loads(extracted_category)
            return data_dict
        else:
            print(f"Warning: JSON object not found in response: {json_str}")
            return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return {}


Observation:

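This code defines extract_json_data, which takes the text between the first '{' and the last '}' in the model response and parses it into a Python dictionary. It returns an empty dictionary when no JSON object is found or when parsing fails.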
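Parsing alone does not guarantee that the dictionary has the structure the task asks for (one key for Positive Events and one for Negative Events). A minimal sketch of a follow-up check is shown below. The helper `validate_events` is hypothetical and not part of the notebook, and the sample response string is hand-written.

```python
import json

EXPECTED_KEYS = ("Positive Events", "Negative Events")


def validate_events(parsed: dict) -> bool:
    """Check that both expected keys exist and each maps to a list of strings."""
    for key in EXPECTED_KEYS:
        events = parsed.get(key)
        if not isinstance(events, list):
            return False
        if not all(isinstance(event, str) for event in events):
            return False
    return True


# A hand-written response with extra text around the JSON object.
raw = 'Here is the summary: {"Positive Events": ["a"], "Negative Events": ["b"]} Done.'

# Same brace-slicing idea as extract_json_data: first '{' to last '}'.
parsed = json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
print(validate_events(parsed))
```

A check like this could be applied to each parsed response so that rows with missing or malformed keys are flagged for a retry instead of silently producing empty columns.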

In [None]:
#Defining the response function
def response_mistral_1(prompt, news):
    model_output = llm(
      f"""
      [INST]
      {prompt}
      News Articles: {news}
      [/INST]
      """,
      max_tokens=200,      # maximum tokens to generate
      temperature=0.7,       # temperature value
      top_p=0.95,            # top_p value
      top_k=40,              # top_k value
      echo=False,
    )

    final_output = model_output["choices"][0]["text"]

    return final_output


Observation:

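The function response_mistral_1 wraps the prompt and the news text in [INST] ... [/INST] tags, sends them to the model with the given sampling settings (max_tokens=200, temperature=0.7, top_p=0.95, top_k=40), and returns the generated text.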
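Because the f-string in the function is indented, the text sent to the model also contains that leading whitespace. One possible way to keep the template tidy is to assemble the string in a small helper. This is a sketch only: `build_mistral_prompt` is a hypothetical name, and whether the extra whitespace affects the model's output has not been measured here.

```python
def build_mistral_prompt(prompt: str, news: str) -> str:
    """Wrap the instructions and the news text in [INST] ... [/INST] tags."""
    # strip() removes stray leading/trailing whitespace from both parts.
    return f"[INST]\n{prompt.strip()}\n\nNews Articles: {news.strip()}\n[/INST]"


# Hand-written inputs, used only to show the resulting layout.
example = build_mistral_prompt("  Summarize the news.  ", " Apple cut its forecast. ")
print(example)
```

The returned string could then be passed to `llm(...)` in place of the inline f-string.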

In [None]:
news = data_1.loc[0, 'News']


In [None]:
print(len(news.split(' ')))
news


In [None]:
2611


In [None]:
' The tech sector experienced a significant decline in the aftermarket following Apple\'s Q1 revenue warning. Notable suppliers, including Skyworks, Broadcom, Lumentum, Qorvo, and TSMC, saw their stocks drop in response to Apple\'s downward revision of its revenue expectations for the quarter, previously announced in January. ||  Apple lowered its fiscal Q1 revenue guidance to $84 billion from earlier estimates of $89-$93 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple\'s stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 10 ||  Apple cut its fiscal first quarter revenue forecast from $89-$93 billion to $84 billion due to weaker demand in China and fewer iPhone upgrades. CEO Tim Cook also mentioned constrained sales of Airpods and Macbooks. Apple\'s shares fell 8.5% in post market trading, while Asian suppliers like Hon ||  This news article reports that yields on long-dated U.S. Treasury securities hit their lowest levels in nearly a year on January 2, 2019, due to concerns about the health of the global economy following weak economic data from China and Europe, as well as the partial U.S. government shutdown. Apple ||  Apple\'s revenue warning led to a decline in USD JPY pair and a gain in Japanese yen, as investors sought safety in the highly liquid currency. Apple\'s underperformance in Q1, with forecasted revenue of $84 billion compared to analyst expectations of $91.5 billion, triggered risk aversion mood in markets || Apple CEO Tim Cook discussed the company\'s Q1 warning on CNBC, attributing US-China trade tensions as a factor. Despite not mentioning iPhone unit sales specifically, Cook indicated Apple may comment on them again. Services revenue is projected to exceed $10.8 billion in Q1. 
Cook also addressed the lack of ||  Roku Inc has announced plans to offer premium video channels on a subscription basis through its free streaming service, The Roku Channel. Partners include CBS Corp\'s Showtime, Lionsgate\'s Starz, and Viacom Inc\'s Noggin. This model follows Amazon\'s successful Channels business, which generated an estimated ||  Wall Street saw modest gains on Wednesday but were threatened by fears of a global economic slowdown following Apple\'s shocking revenue forecast cut, blaming weak demand in China. The tech giant\'s suppliers and S&P 500 futures also suffered losses. Reports of decelerating factory activity in China and the euro zone ||  Apple\'s fiscal first quarter revenue came in below analysts\' estimates at around $84 billion, a significant drop from the forecasted range of $89-$93 billion. The tech giant attributed the shortfall to lower iPhone revenue and upgrades, as well as weakness in emerging markets. Several brokerages had already reduced their production estimates ||  Apple Inc. lowered its quarterly sales forecast for the fiscal first quarter, underperforming analysts\' expectations due to slowing Chinese economy and trade tensions. The news sent Apple shares tumbling and affected Asia-listed suppliers like Hon Hai Precision Industry Co Ltd, Taiwan Semiconductor Manufacturing Company, and LG Innot ||  The Australian dollar experienced significant volatility on Thursday, plunging to multi-year lows against major currencies due to automated selling, liquidity issues, and a drought of trades. The largest intra-day falls in the Aussie\'s history occurred amid violent movements in AUD/JPY and AUD/ ||  In early Asian trading on Thursday, the Japanese yen surged as the U.S. dollar and Australian dollar collapsed in thin markets due to massive stop loss sales triggered by Apple\'s earnings warning of sluggish iPhone sales in China and risk aversion. The yen reached its lowest levels against the U.S. 
dollar since March  ||  The dollar fell from above 109 to 106.67 after Apple\'s revenue warning, while the 10-year Treasury yield also dropped to 2.61%. This followed money flowing into US government paper. Apple\'s shares and U.S. stock index futures declined, with the NAS ||  RBC Capital maintains its bullish stance on Apple, keeping its Outperform rating and $220 price target. However, analyst Amit Daryanani warns of ongoing iPhone demand concerns, which could impact pricing power and segmentation efforts if severe. He suggests potential capital allocation adjustments if the stock underperforms for several quarters ||  Oil prices dropped on Thursday as investor sentiment remained affected by China\'s economic slowdown and turmoil in stock and currency markets. US WTI Crude Oil fell by $2.10 to $45.56 a barrel, while International Brent Oil was down $1.20 at $54.26 ||  In this news article, investors\' concerns about a slowing Chinese and global economy, amplified by Apple\'s revenue warning, led to a significant surge in the Japanese yen. The yen reached its biggest one-day rise in 20 months, with gains of over 4% versus the dollar. This trend was driven by automated ||  In Asia, gold prices rose to over six-month highs on concerns of a global economic slowdown and stock market volatility. Apple lowered its revenue forecast for the first quarter, leading Asian stocks to decline and safe haven assets like gold and Japanese yen to gain. Data showed weakened factory activity in Asia, particularly China, adding to ||  Fears of a global economic slowdown led to a decline in the US dollar on Thursday, as the yen gained ground due to its status as a safe haven currency. The USD index slipped below 96, and USD JPY dropped to 107.61, while the yen strengthened by 4.4%. 
||  In Thursday trading, long-term US Treasury yields dropped significantly below 2.6%, reaching levels not seen in over a year, as investors shifted funds from stocks to bonds following Apple\'s warning of decreased revenue due to emerging markets and China\'s impact on corporate profits, with the White House advisor adding to concerns of earnings down ||  Gold prices have reached their highest level since mid-June, with the yellow metal hitting $1,291.40 per ounce due to investor concerns over a slowing economy and Apple\'s bearish revenue outlook. Saxo Bank analyst Ole Hansen predicts gold may reach $1,300 sooner ||  Wedbush analyst Daniel Ives lowered his price target for Apple from $275 to $200 due to concerns over potential iPhone sales stagnation, with an estimated 750 million active iPhones worldwide that could cease growing or even decline. He maintains an Outperform rating and remains bullish on the long ||  Oil prices rebounded on Thursday due to dollar weakness, signs of output cuts by Saudi Arabia, and weaker fuel oil margins leading Riyadh to lower February prices for heavier crude grades sold to Asia. The Organization of the Petroleum Exporting Countries (OPEC) led by Saudi Arabia and other producers ||  This news article reports on the impact of Apple\'s Q1 revenue warning on several tech and biotech stocks. Sesen Bio (SESN) and Prana Biotechnology (PRAN) saw their stock prices drop by 28% and 11%, respectively, following the announcement. Mellanox Technologies (ML ||  Gold prices reached within $5 of $1,300 on Thursday as weak stock markets and a slumping dollar drove investors towards safe-haven assets. The U.S. stock market fell about 2%, with Apple\'s rare profit warning adding to investor unease. COMEX gold futures settled at $1 ||  The FDIC Chair, Jelena McWilliams, expressed no concern over market volatility affecting the U.S banking system due to banks\' ample capital. 
She also mentioned a review of the CAMELS rating system used to evaluate bank health for potential inconsistencies and concerns regarding forum shopping. This review comes from industry ||  Apple cut its quarterly revenue forecast for the first time in over 15 years due to weak iPhone sales in China, representing around 20% of Apple\'s revenue. This marks a significant downturn during Tim Cook\'s tenure and reflects broader economic concerns in China exacerbated by trade tensions with the US. U ||  Goldman analyst Rod Hall lowered his price target for Apple from $182 to $140, citing potential risks to the tech giant\'s 2019 numbers due to uncertainties in Chinese demand. He reduced his revenue estimate for the year by $6 billion and EPS forecast by $1.54 ||  Delta Air Lines lowered its fourth-quarter revenue growth forecast to a range of 3% from the previous estimate of 3% to 5%. Earnings per share are now expected to be $1.25 to $1.30. The slower pace of improvement in late December was unexpected, and Delta cited this as ||  Apple\'s profit warning has significantly impacted the stock market and changed the outlook for interest rates. The chance of a rate cut in May has increased to 15-16% from just 3%, according to Investing com\'s Fed Rate Monitor Tool. There is even a 1% chance of two cuts in May. ||  The White House advisor, Kevin Hassett, stated that a decline in Chinese economic growth would negatively impact U.S. firm profits but recover once a trade deal is reached between Washington and Beijing. He also noted that Asian economies, including China, have been experiencing significant slowdowns since last spring due to U.S. tariffs ||  The White House economic adviser, Kevin Hassett, warned that more companies could face earnings downgrades due to ongoing trade negotiations between the U.S. and China, leading to a decline in oil prices on Thursday. 
WTI crude fell 44 cents to $44.97 a barrel, while Brent crude inched ||  Japanese stocks suffered significant losses on the first trading day of 2019, with the Nikkei 225 and Topix indices both falling over 3 percent. Apple\'s revenue forecast cut, citing weak iPhone sales in China, triggered global growth concerns and sent technology shares tumbling. The S&P 50 ||  Investors withdrew a record $98 billion from U.S. stock funds in December, with fears of aggressive monetary policy and an economic slowdown driving risk reduction. The S&P 500 fell 9% last month, with some seeing declines as a buying opportunity. Apple\'s warning of weak iPhone sales added ||  Apple\'s Q1 revenue guidance cut, resulting from weaker demand in China, led to an estimated $3.8 billion paper loss for Berkshire Hathaway due to its $252 million stake in Apple. This news, coupled with broad market declines, caused a significant $21.4 billion decrease in Berk ||  This news article reports that a cybersecurity researcher, Wish Wu, planned to present at the Black Hat Asia hacking conference on how to bypass Apple\'s Face ID biometric security on iPhones. However, his employer, Ant Financial, which operates Alipay and uses facial recognition technologies including Face ID, asked him to withdraw ||  OPEC\'s production cuts faced uncertainty as oil prices were influenced by volatile stock markets, specifically due to Apple\'s lowered revenue forecast and global economic slowdown fears. US WTI and Brent crude both saw gains, but these were checked by stock market declines. Shale production is expected to continue impacting the oil market in ||  Warren Buffett\'s Berkshire Hathaway suffered significant losses in the fourth quarter due to declines in Apple, its largest common stock investment. Apple cut its revenue forecast, causing a 5-6% decrease in Berkshire\'s Class A shares. 
The decline resulted in potential unrealized investment losses and could push Berk ||  This news article reports that on Thursday, the two-year Treasury note yield dropped below the Federal Reserve\'s effective rate for the first time since 2008. The market move suggests investors believe the Fed will not be able to continue tightening monetary policy. The drop in yields was attributed to a significant decline in U.S ||  The U.S. and China will hold their first face-to-face trade talks since agreeing to a 90-day truce in their trade war last month. Deputy U.S. Trade Representative Jeffrey Gerrish will lead the U.S. delegation for negotiations on Jan. 7 and 8, ||  Investors bought gold in large quantities due to concerns over a global economic slowdown, increased uncertainty in the stock market, and potential Fed rate hikes. The precious metal reached its highest price since June, with gold ETF holdings also seeing significant increases. Factors contributing to this demand include economic downturn, central bank policy mistakes, and ||  Delta Air Lines Inc reported lower-than-expected fourth quarter unit revenue growth, citing weaker than anticipated late bookings and increased competition. The carrier now expects total revenue per available seat mile to rise about 3 percent in the period, down from its earlier forecast of 3.5 percent growth. Fuel prices are also expected to ||  U.S. stocks experienced significant declines on Thursday as the S&P 500 dropped over 2%, the Dow Jones Industrial Average fell nearly 3%, and the Nasdaq Composite lost approximately 3% following a warning of weak revenue from Apple and indications of slowing U.S. factory activity, raising concerns ||  President Trump expressed optimism over potential trade talks with China, citing China\'s current economic weakness as a potential advantage for the US. 
This sentiment was echoed by recent reports of weakened demand for Apple iPhones in China, raising concerns about the overall health of the Chinese economy. The White House is expected to take a strong stance in ||  Qualcomm secured a court order in Germany banning the sale of some iPhone models due to patent infringement, leading Apple to potentially remove these devices from its stores. However, third-party resellers like Gravis continue selling the affected iPhones. This is the third major effort by Qualcomm to ban Apple\'s iPhones glob ||  Oil prices rose on Friday in Asia as China confirmed trade talks with the U.S., with WTI gaining 0.7% to $47.48 and Brent increasing 0.7% to $56.38 a barrel. The gains came after China\'s Commerce Ministry announced that deputy U.S. Trade ||  Gold prices surged past the psychologically significant level of $1,300 per ounce in Asia on Friday due to growing concerns over a potential global economic downturn. The rise in gold was attributed to weak PMI data from China and Apple\'s reduced quarterly sales forecast. Investors viewed gold as a safe haven asset amidst ||  In an internal memo, Huawei\'s Chen Lifang reprimanded two employees for sending a New Year greeting on the company\'s official Twitter account using an iPhone instead of a Huawei device. The incident caused damage to the brand and was described as a "blunder" in the memo. The mistake occurred due to ||  This news article reports on the positive impact of trade war talks between Beijing and Washington on European stock markets, specifically sectors sensitive to the trade war such as carmakers, industrials, mining companies, and banking. Stocks rallied with mining companies leading the gains due to copper price recovery. Bayer shares climbed despite a potential ruling restricting || Amazon has sold over 100 million devices with its Alexa digital assistant, according to The Verge. 
The company is cautious about releasing hardware sales figures and did not disclose holiday numbers for the Echo Dot. Over 150 products feature Alexa, and more than 28,000 smart home || The Supreme Court will review Broadcom\'s appeal in a shareholder lawsuit over the 2015 acquisition of Emulex. The case hinges on whether intent to defraud is required for such lawsuits, and the decision could extend beyond the Broadcom suit. An Emulex investor filed a class action lawsuit ||  The Chinese central bank announced a fifth reduction in the required reserve ratio (RRR) for banks, freeing up approximately 116.5 billion yuan for new lending. This follows mounting concerns about China\'s economic health amid slowing domestic demand and U.S. tariffs on exports. Premier Li Keqiang || The stock market rebounded strongly on Friday following positive news about US-China trade talks, a better-than-expected jobs report, and dovish comments from Federal Reserve Chairman Jerome Powell. The Dow Jones Industrial Average rose over 746 points, with the S&P 500 and Nasdaq Com'

Observation:

The concatenated news for the first week contains 2,611 space-separated words, which is long relative to the 4,500-token context window configured for the model.
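A quick budget check can indicate whether a week's news is likely to fit in the context window together with the generated tokens. The sketch below uses a rough tokens-per-word ratio, which is an assumption and not a measured value; an exact count would require the model's own tokenizer. The function name `fits_in_context` is hypothetical.

```python
N_CTX = 4500           # context window passed to Llama above
MAX_TOKENS = 200       # generation limit used in response_mistral_1
TOKENS_PER_WORD = 1.3  # assumed rough ratio, not a measured value


def fits_in_context(word_count: int, overhead_words: int = 0) -> bool:
    """Estimate whether input plus generated tokens stay within the window."""
    estimated_tokens = (word_count + overhead_words) * TOKENS_PER_WORD
    return estimated_tokens + MAX_TOKENS <= N_CTX


print(fits_in_context(2611))
```

The `overhead_words` argument leaves room for the instruction prompt that is prepended to the news text.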

In [None]:
# Defining the prompt for this task
prompt = """

You are an expert data analyst specializing in news content analysis.
Task: Analyze the provided weekly news articles and identify the top three positive events and the top three negative events that are most likely to impact the stock price.
Instructions:
1. Read the weekly news articles carefully.
2. Identify key events with positive sentiment and negative sentiment.
3. For positive events, list the top three events; similarly, list the top three negative events.
4. Return the output in JSON format with two keys: "Positive Events" and "Negative Events", each mapping to a list of events.
Example:
{"Positive Events": ["Event1", "Event2", "Event3"], "Negative Events": ["Event1", "Event2", "Event3"]}

"""


In [None]:
%%time
summary = response_mistral_1(prompt, news)
print(summary)


In [None]:
 Title: Apple's Revenue Warning Triggers Global Market Declines: Tech Stocks Drop, Yen Surges, Treasury Yields Fall

       The tech sector experienced a significant decline in the aftermarket following Apple's Q1 revenue warning. Notable suppliers such as Skyworks, Broadcom, Lumentum, Qorvo, TSMC, and Hon Hai Precision Industry saw their stocks drop in response to Apple's downward revision of its revenue expectations for the quarter.

       Apple lowered its fiscal Q1 revenue guidance from $89-$93 billion to $84 billion due to weaker than expected iPhone sales. The announcement caused a significant drop in Apple's stock price and negatively impacted related suppliers, leading to broader market declines for tech indices such as Nasdaq 100 and the S&P 500.

       Investors sought safety in the highly liquid Japanese y
CPU times: user 18.1 s, sys: 4.94 s, total: 23 s
Wall time: 23.3 s


Observations:

The model has extracted key information from the weekly news and generated a concise summary of events that could impact stock prices.

The summary correctly identifies the impact on the tech sector and the broader market reaction to Apple's revenue warning.

However, the response is a titled prose summary rather than the JSON object requested in the prompt, and it is cut off mid-sentence.
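
One way to check whether a response follows the format requested in the prompt is to parse it and verify that both keys are present and map to lists. This is a sketch only; `follows_requested_format` and `REQUIRED_KEYS` are hypothetical names and are not part of the notebook.

```python
import json

# Keys requested in the prompt
REQUIRED_KEYS = ("Positive Events", "Negative Events")


def follows_requested_format(response: str) -> bool:
    """Return True if the response is a JSON object whose two keys map to lists."""
    try:
        parsed = json.loads(response.strip())
    except json.JSONDecodeError:
        return False  # prose, markdown, or truncated output
    if not isinstance(parsed, dict):
        return False
    return all(isinstance(parsed.get(key), list) for key in REQUIRED_KEYS)


print(follows_requested_format('{"Positive Events": ["A"], "Negative Events": ["B"]}'))
print(follows_requested_format("Title: Apple's Revenue Warning Triggers Global Market Declines"))
```

Running a check like this on a single response before applying the prompt to every week makes format problems visible early.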

In [None]:
%%time
data_1['Key Events'] = data_1['News'].progress_apply(lambda x: response_mistral_1(prompt,x))


In [None]:
  0%|          | 0/18 [00:00<?, ?it/s]Llama.generate: prefix-match hit
 11%|█         | 2/18 [00:24<03:14, 12.13s/it]Llama.generate: prefix-match hit
 17%|█▋        | 3/18 [00:41<03:31, 14.12s/it]Llama.generate: prefix-match hit
 22%|██▏       | 4/18 [00:55<03:19, 14.27s/it]Llama.generate: prefix-match hit
 28%|██▊       | 5/18 [01:09<03:05, 14.24s/it]Llama.generate: prefix-match hit
 33%|███▎      | 6/18 [01:29<03:13, 16.09s/it]Llama.generate: prefix-match hit
 39%|███▉      | 7/18 [01:41<02:43, 14.83s/it]Llama.generate: prefix-match hit
 44%|████▍     | 8/18 [01:52<02:14, 13.47s/it]Llama.generate: prefix-match hit
 50%|█████     | 9/18 [02:03<01:56, 12.90s/it]Llama.generate: prefix-match hit
 56%|█████▌    | 10/18 [02:14<01:37, 12.17s/it]Llama.generate: prefix-match hit
 61%|██████    | 11/18 [02:24<01:21, 11.59s/it]Llama.generate: prefix-match hit
 67%|██████▋   | 12/18 [02:38<01:12, 12.13s/it]Llama.generate: prefix-match hit
 72%|███████▏  | 13/18 [02:52<01:04, 12.83s/it]Llama.generate: prefix-match hit
 78%|███████▊  | 14/18 [03:07<00:54, 13.60s/it]Llama.generate: prefix-match hit
 83%|████████▎ | 15/18 [03:19<00:39, 13.07s/it]Llama.generate: prefix-match hit
 89%|████████▉ | 16/18 [03:31<00:25, 12.65s/it]Llama.generate: prefix-match hit
 94%|█████████▍| 17/18 [03:46<00:13, 13.27s/it]Llama.generate: prefix-match hit
100%|██████████| 18/18 [03:58<00:00, 13.02s/it]Llama.generate: prefix-match hit
100%|██████████| 18/18 [04:10<00:00, 13.92s/it]

In [None]:
CPU times: user 3min 35s, sys: 32 s, total: 4min 7s
Wall time: 4min 10s


Observation:

The model processes the weekly data through progress_apply, which calls response_mistral_1 on each row of the News column in the data_1 DataFrame and displays a progress bar. All 18 weekly rows were processed in about 4 minutes 10 seconds of wall time.
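
As a small illustration of how progress_apply works, the sketch below registers tqdm with pandas and applies a function row by row on a toy DataFrame. `fake_model` is a hypothetical stand-in for the real model call, and the toy data is invented for the example.

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_apply on pandas Series and DataFrames
tqdm.pandas()


def fake_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for the model call; returns a word count string."""
    return f"{len(text.split())} words"


toy = pd.DataFrame({"News": ["first week news", "second week has more news"]})

# Same pattern as in the notebook: one call per row, with a progress bar
toy["Key Events"] = toy["News"].progress_apply(lambda x: fake_model("prompt", x))
print(toy)
```

Because each row triggers a separate call, total run time grows roughly linearly with the number of weeks.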

In [None]:
data_1["Key Events"].head()  # Print the first 5 rows of the 'Key Events' column


In [None]:
data_1['model_response_parsed'] = data_1['Key Events'].apply(extract_json_data)
data_1.head()


In [None]:
Warning: JSON object not found in response:  {
        "Positive Events": [
          "IBM's stock price increased after hours due to better-than-expected earnings and revenue, with its cloud computing business contributing positively.",
          "Huawei is expanding its presence in Europe with the launch of the new Honor View20 smartphone, which offers advanced camera features at a lower price point than rivals Samsung and Apple.",
          "FireEye's stock price surged after Baird added it to their 'Fresh Picks' list, citing a recent decline in shares and confident 2019 guidance."
        ],
        "Negative Events": [
          "The Swiss National Bank (SNB) governor stated that negative interest rates and foreign currency market intervention are necessary to prevent a strong Swiss franc from causing deflation in the country.",
          "The Dow, S&P 500, and Nasdaq experienced significant losses on Tuesday despite
Warning: JSON object not found in response:  Based on the provided news articles, here are the top three positive events and top three negative events that are most likely to impact Apple's stock price:

       **Positive Events:**

       1. Apple reported stronger-than-expected earnings for Q1 2023, with GAAP EPS coming in at $4.18 versus the estimated $4.17 and revenue surpassing expectations at $84.3 billion compared to forecasts of $81 billion.
       2. Apple is collaborating with Aetna and CVS on a new health app called Attain, offering customized fitness challenges with rewards, including a free Apple Watch Series 3 for participants who don't already own one.
       3. Gold hit an eight-month high due to anticipation for clues on U.S monetary policy from Federal Reserve Chairman Jerome Powell's news conference and prospects of fewer rate hikes and
Warning: JSON object not found in response:  based on the provided news articles, here are the top three positive events and negative events that are most likely to impact the stock price of tech companies:

      "Positive Events": [
         "JPMorgan suggests Apple should acquire Netflix due to its leading engagement level and original content.",
         "Ultimate Software rallied after accepting a $331.50 per share takeover offer from a Hellman Friedman-led consortium.",
         "Apple's French division reached an agreement to pay undeclared back taxes, estimated at around 571 million euros."
      ],

      "Negative Events": [
         "Fears of a Chinese economic slowdown led to caution ahead of earnings from tech giants Alphabet and Apple.",
         "Sony warned of potential impact on chip and image sensor divisions due to weakening global smartphone demand.",
         "Two U.S.
Warning: JSON object not found in response:  Based on the provided news articles, here are the top three positive events and the top three negative events that are most likely to impact the stock price:

      1. Positive Events:
          a. Warner Bros adopts inclusion riders policy following Frances McDormand's advocacy, major stars commit to pushing for it.
          b. Garmin reports stronger-than-expected fourth quarter earnings and revenue, driven by robust demand for wearable fitness devices and aviation products.
          c. Apple collaborates with Ant Financial Services Group and local banks in China, providing interest-free financing for iPhone purchases.

      2. Negative Events:
          a. Kraft Heinz suffers significant loss in premarket trade due to disappointing earnings report and SEC investigation.
          b. WhatsApp security bug allows iPhone users to bypass privacy feature, potential impact on user trust.
          c. Apple's vehicle project may shift
Warning: JSON object not found in response: Based on the provided news articles, here are the top three positive and negative events that are most likely to impact stock prices:

      "Positive Events": [
          "The S&P 500 and Nasdaq posted their best weekly gains since November following reports of progress in U.S.-China trade talks.",
          "In a preliminary ruling, U.S. District Court Judge Gonzalo Curiel ordered Qualcomm to pay nearly $1 billion in patent royalty rebate payments to Apple.",
          "Apple launched a new television advertising campaign emphasizing its commitment to data privacy."
      ],

      "Negative Events": [
          "The United States opposes France's digital services tax, which can negatively impact American businesses, including tech giants like Apple and Google.",
          "Boeing's NYSE BA stock experienced significant losses in premarket trade after the second crash of a 7
Warning: JSON object not found in response:  based tariffs and economic slowdown. This news is likely to negatively impact Apple's sales and stock price.

        Based on the provided news articles, here are the top three positive events and negative events that are most likely to impact the stock prices:

        Positive Events:
        1. Foxconn announcing the completion of its new factory in Wisconsin by the end of this year to manufacture liquid crystal display screens. This news is positive for Foxconn as it indicates growth and expansion, which could potentially lead to increased revenue and profits.
        2. Samsung Electronics reporting strong sales for its new Galaxy flagship smartphones in China despite significant market share losses to Chinese rivals like Huawei. This news is positive for Samsung as it shows that the company's products are still competitive in the world's largest smartphone market, which could potentially lead to increased revenue and profits.
        3. Apple introducing updated AirPods headphones
Warning: JSON object not found in response: Based on the provided news articles, here are the top three positive events and top three negative events that are most likely to impact the stock price:

Positive Events:
1. Apple's new television and movie streaming service announcement, which includes original content and a gaming service for iPhones and iPads. The subscription services could potentially boost Apple's revenue streams beyond hardware sales.
2. Viacom's contract renewal with AT&T, preventing a blackout of its channels for DirecTV users. This news likely relieved investors who were concerned about potential disruptions to Viacom's revenue stream.
3. Tesla's lawsuit dismissal over production claims for its Model 3 car. The positive news surrounding the electric vehicle manufacturer could potentially boost investor confidence and drive up stock prices.

Negative Events:
1. The inversion of the U.S. Treasury yield curve, which had occurred on Friday but returned
Warning: JSON object not found in response:  Based on the provided news articles, here are the top three positive events and the top three negative events that are most likely to impact the stock prices:

       **Positive Events:**
       1. Apple and other consumer brands, including Louis Vuitton and Gucci, reduced prices for their products in China following a cut in the country's value-added tax rate. This could lead to increased sales for these companies, especially Apple whose stock price decreased initially due to the price cuts but later rebounded on optimism over improved demand.
       2. Japan Display's entry into the OLED market by supplying OLED screens for the Apple Watch. This represents a significant breakthrough for the cash-strapped company and could lead to increased revenue and profitability.
       3. The S&P 500, Dow Jones Industrial Average, and Nasdaq Composite closed higher due to optimism over U.S.-China
Warning: JSON object not found in response:  {
        "Positive Events": [
          "Oprah Winfrey and Prince Harry have partnered to create an Apple documentary aimed at promoting mental health awareness.",
          "Delta Airlines' Q1 earnings surpassed expectations leading to a 2.7% increase in DAL stock.",
          "Apple has nearly doubled the number of suppliers using clean energy for production, including major iPhone manufacturers Hon Hai Precision Industry and Taiwan Semiconductor Manufacturing."
        ],
        "Negative Events": [
          "Mobile phone shipments to China dropped by 6 percent in March, marking the fifth consecutive month of decline.",
          "Google raised YouTube TV's monthly membership fee by 25% to $49.99, effective April 10 for new subscribers and May 13 for existing ones.",
          "Apple is under investigation by the Dutch competition agency, ACM, for alleg
Warning: JSON object not found in response: {"Positive Events": ["Apple's potential entry into the free music streaming market sending shockwaves through the industry leading to potential growth opportunities for companies like Amazon and Spotify",
        "TomTom reporting a 14% increase in first quarter revenue and securing contracts to supply high definition maps to major carmakers",
        "Chinese government considering drafting stimulus measures to boost car and electronics sales, potentially leading to increased demand for these industries"],

         "Negative Events": ["Apple facing a securities fraud lawsuit alleging concealment of weakened iPhone demand leading to significant stock price drop",
          "Taiwan Semiconductor Manufacturing Company reporting a steep quarterly profit drop due to weak global demand for smartphones and the prolonged U.S.-China trade war",
          "Samsung Electronics reported issues with the displays of its upcoming foldable smartphone, raising concerns over a smooth launch
Warning: JSON object not found in response:  based on the provided news articles, here are the top three positive events and the top three negative events that are most likely to impact the stock prices:

      **"Positive Events":**
      [1. "Chinese tech giant Tencent Holdings invests in Argentine mobile banking service Uala",
      2. "Snap reports better-than-expected earnings for Q1",
      3. "ASM International beats first quarter expectations"]

      **"Negative Events":**
      [1. "Taiwan's export orders continued to decline at a faster-than-expected rate in March",
      2. "LG Electronics announces it would cease smartphone production in South Korea and shift manufacturing to Vietnam",
      3. "Boeing's first quarter report reveals a 21% profit drop due to the ongoing grounding of its 737 Max aircraft"]

     
Warning: JSON object not found in response:  based on the provided news articles, here are the top three positive events and top three negative events that are most likely to impact the stock price:

        **Positive Events:**
        1. Spotify reported better-than-expected Q1 revenue growth, reaching 100 million paid subscribers and aiming for over 30% per year growth.
        2. The S&P 500 reached a new intraday record high, fueled by strong consumer spending data, tame inflation, and hopes of trade deal resolution.
        3. Apple's earnings report exceeded expectations, leading to a post-market surge in shares and easing concerns about the bull run's sustainability.

        **Negative Events:**
        1. The Czech Finance Ministry is finalizing plans to impose a digital tax on global internet giants, which could negatively impact companies like Google, Apple, Facebook, and Amazon.


Observation:

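For the records that are successfully parsed, the positive and negative events are extracted. The 12 warnings above come from responses that were not valid JSON, either because the model answered with prose or markdown lists or because the output was cut off before the closing brace.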
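
The notebook's `extract_json_data` is defined earlier and is not shown in this section. As an illustration of one common approach, and of why truncated responses fail, the hypothetical helper below takes the span between the first `{` and the last `}` and tries to parse it. It is a sketch under that assumption, not the notebook's implementation.

```python
import json
from typing import Optional


def extract_first_json_object(response: str) -> Optional[dict]:
    """Try to parse the text between the first '{' and the last '}' as JSON."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end <= start:
        return None  # no braces at all, or output cut off before the closing brace
    try:
        parsed = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


ok = 'Here you go: {"Positive Events": ["A"], "Negative Events": ["B"]}'
truncated = '{"Positive Events": ["A", "B"], "Negative Events": ["C'
print(extract_first_json_object(ok))
print(extract_first_json_object(truncated))
```

A helper like this tolerates extra text around the JSON object, but it cannot recover a response that ends before the object is closed or one that lists the events as plain prose.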

In [None]:
model_response_parsed = pd.json_normalize(data_1['model_response_parsed'])
model_response_parsed.head()


In [None]:
final_output = pd.concat([data_1.reset_index(drop=True),model_response_parsed],axis=1)
final_output.drop(['Key Events','model_response_parsed'], axis=1, inplace=True)
final_output.columns = ['Week End Date', 'News', 'Week Positive Events', 'Week Negative Events']

final_output.head()


Observations:

The final output DataFrame concatenates the weekly data with the parsed model responses; the first few rows are displayed.

The Week End Date column contains dates representing the end of each week.

The News column contains the combined news text for each week.

The Week Positive Events column contains the model's identified positive events for the week.

The Week Negative Events column contains the model's identified negative events for the week.

Some rows have NaN in the Week Positive Events and Week Negative Events columns. This indicates that the model's response for those weeks could not be parsed as JSON, not that the model found no positive or negative events.
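
The NaN values can be reproduced on toy data. The sketch below assumes that a failed parse yields an empty dictionary (the notebook's `extract_json_data` is not shown here, so this is an assumption), and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical parsed responses; the empty dict stands for a failed parse
parsed = [
    {"Positive Events": ["A"], "Negative Events": ["B"]},
    {},
    {"Positive Events": ["C"], "Negative Events": ["D"]},
]

events = pd.json_normalize(parsed)

# Rows without parsed content show up as NaN in both event columns
failed = events["Positive Events"].isna()
print(events)
print(f"{int(failed.sum())} of {len(events)} rows have no parsed events")
```

Counting the NaN rows in this way gives a quick measure of how often the model followed the requested output format.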

Conclusions and Recommendations¶


1. The model identifies key positive and negative events from weekly news articles that could significantly impact stock prices, although a number of responses did not follow the requested JSON format and could not be parsed.

2. The connection between sentiment analysis and market movement is clear.

3. Businesses should consider integrating this model with predictive tools to forecast market movements based on weekly news. This would be helpful for:

a) Pricing strategies

b) Marketing or sales efforts

c) Customer communications

d) Risk management

The ability to extract and analyze sentiment-driven insights from news data enables businesses to not only understand the current market conditions but also to anticipate future trends.
