<a href="https://colab.research.google.com/github/ShauryaRawat10/Artificial-Intelligence/blob/main/SentimentalAnalysis_StockMarket_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement
Business Context
The prices of the stocks of companies listed under a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment being factors that play a significant role. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and accurately interpret its impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

## Problem Definition
With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

## Data Dictionary
- Date : The date the news was released
- News : The content of news articles that could potentially affect the company's stock price
- Open : The stock price `(in $)` at the beginning of the day
- High : The highest stock price `(in $)` reached during the day
- Low : The lowest stock price `(in $)` reached during the day
- Close : The adjusted stock price `(in $)` at the end of the day
- Volume : The number of shares traded during the day
- Label : The sentiment polarity of the news content
  - 1: positive
  - 0: neutral
  - -1: negative

## Installing and Importing Necessary Libraries

In [None]:
%pip install --upgrade --force-reinstall sentence-transformers==2.7.0 transformers==4.40.2 bitsandbytes==0.46.0 accelerate==1.7.0 sentencepiece==0.2.0 pandas==2.2.2 numpy==2.0.2 matplotlib==3.10.0 seaborn==0.13.2 torch==2.6.0 scikit-learn==1.6.1


In [None]:
%pip install gensim

In [None]:
# To manipulate and analyze data
import pandas as pd
import numpy as np

# To visualize data
import matplotlib.pyplot as plt
import seaborn as sns

# To use time-related functions
import time

# To parse JSON data
import json

# To build, tune, and evaluate ML models
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier # DecisionTreeClassifier lives in sklearn.tree, not sklearn.ensemble

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score

# To load/create word embeddings
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# To work with transformer models
import torch
from sentence_transformers import SentenceTransformer

# To implement progress bar related functionalities
from tqdm import tqdm
tqdm.pandas()

# To ignore unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Check library versions
print("pandas:   ", pd.__version__)
print("numpy:    ", np.__version__)
print("seaborn:  ", sns.__version__)
print("torch:    ", torch.__version__)

## Loading the dataset

In [None]:
stock_news = pd.read_csv('https://raw.githubusercontent.com/ShauryaRawat10/Data-Science/main/Generative%20AI/Storage/stock_news.csv', engine='python')

In [None]:
stock = stock_news.copy()

## Data Overview

In [None]:
stock.head()

In [None]:
stock.tail()

In [None]:
stock.shape

- Total 349 rows and 8 columns in the Stock market dataset

In [None]:
stock.columns

In [None]:
stock.duplicated().sum()

- No duplicates

In [None]:
stock.isnull().sum()

- No null values

In [None]:
stock.nunique()

- Every value in the News column is unique
- Label (Sentiment) has 3 categories

In [None]:
stock.info()

- 6 columns are of a numerical data type; the other 2 (Date and News) are of object type

#### Convert Date to DateTime format

In [None]:
stock['Date'] = pd.to_datetime(stock['Date'])

In [None]:
stock.dtypes

In [None]:
pd.set_option('display.float_format', '{:.2f}'.format)
stock.describe()

- Observations:
  - For the 349 records in the dataset, the stock has:
    - Stock Volume of 244439200
    - Highest Stock price: 67
    - Lowest Stock price: 36.25

In [None]:
ax = sns.countplot(data=stock, x='Label', stat="percent")

# Add percentage values at the top of each bar
# With stat="percent", each bar height is already a percentage of all rows
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height:.2f}%',                            # Display percentage with 2 decimals
                (p.get_x() + p.get_width() / 2., height),    # Position at the top of the bar
                ha='center', va='center',                    # Alignment
                fontsize=8, color='black', fontweight='bold', # Styling
                xytext=(0, 5), textcoords='offset points')   # Adjust text position

For the reported news:
- 48.7% have neutral sentiment
- 28.4% have negative sentiment
- 22.9% have positive sentiment
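
The same percentages can be read off numerically instead of from the plot. The sketch below uses a small made-up label series (an assumption standing in for `stock['Label']`) and `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy labels standing in for stock['Label'] (hypothetical values, not the real data)
labels = pd.Series([0, 0, -1, 1, 0, -1, 0, 1], name="Label")

# Share of each sentiment class in percent, ordered by label value (-1, 0, 1)
label_pct = labels.value_counts(normalize=True).sort_index() * 100
print(label_pct.round(1))
```

On the real data, replacing `labels` with `stock['Label']` should reproduce the figures listed above.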

In [None]:
sns.boxplot(data=stock, x='Open')

- Most opening prices fall between about $35 and $63, with a few outliers above $65

In [None]:
sns.boxplot(data=stock, x='Close')

- Closing prices mostly fall between about $37 and $55, with outliers around $63-65

In [None]:
# Density Plot of Price (Open,High,Low,Close)
for i in ["Open","High","Low","Close"]:
    sns.kdeplot(stock[i], label=i, fill=True) # 'fill' replaces the deprecated 'shade' argument
plt.xlabel("Price")
plt.ylabel("Density")
plt.title("Density Plot of Stock Prices")
plt.legend()
plt.show()

- Stock Price mostly landed in the 40-50 range

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='High', data=stock, color='blue', label='High')
sns.lineplot(x='Date', y='Low', data=stock, color='red', label='Low')
plt.title('Stocks High and Low price')
plt.xticks(
    rotation=45,                     # rotate for readability
    ha='right'                       # right-align the labels
)
plt.show()

In [None]:
sns.heatmap( stock.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt=".2f" )

- Open, Close, High, Low are highly correlated
- Volume is negatively correlated with High, Open, Close, Low and Label

In [None]:
stock['news_len'] = stock['News'].apply(lambda x: len(x.split(' '))) # Calculating the total number of words present in the news content column.
stock['news_len'].describe()                                         # Print the statistical summary for the news content length after splitting into words.

- News items contain about 50 words on average
- The longest item has 61 words and the shortest has 19

In [None]:
plt.figure(figsize=(8, 4))

for i, variable in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=stock, x="Label", y=variable) # Label = Sentiment (1) - Positive, (0) - Neutral, (-1) - Negative
    plt.tight_layout(pad=2)

plt.show()

In [None]:
sns.boxplot(
    data=stock, x="Label", y="Volume"  # Label = Sentiment (1) - Positive, (0) - Neutral, (-1) - Negative
);

In [None]:
stock_daily = stock.groupby('Date').agg(
    {
        'Open': 'mean',
        'High': 'mean',
        'Low': 'mean',
        'Close': 'mean',
        'Volume': 'mean',
    }
).reset_index()                             # Group the 'stock' DataFrame by the 'Date' column and average each day's values

stock_daily.set_index('Date', inplace=True) # Use 'Date' as the index
stock_daily.sample(n=10, random_state=42)   # Random sample of 10 rows; random_state makes the selection reproducible

In [None]:
plt.figure(figsize=(12,3))
sns.lineplot(stock_daily.drop("Volume", axis=1));

- There appears to be a cyclic pattern (seasonality) in the rise and fall of the stock price
- In the first half of a month the price tends to surge; in the second half it drops and stays mostly low
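
One way to probe a pattern like this beyond eyeballing the line plot is to aggregate the daily series to a coarser period and compare the averages. The sketch below is illustrative only: it uses a made-up daily series (an assumption, not the real `stock_daily`) and pandas' `resample("W")`, whose weekly bins end on Sunday by default.

```python
import pandas as pd

# Toy daily closing prices (hypothetical values, not the real data)
toy_daily = pd.DataFrame(
    {"Close": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.to_datetime(
        ["2019-01-02", "2019-01-03", "2019-01-04", "2019-01-07", "2019-01-08"]
    ),
)

# Average the closing price per calendar week
weekly_close = toy_daily["Close"].resample("W").mean()
print(weekly_close)
```

Applied to `stock_daily["Close"]`, the weekly averages would show whether the first weeks of each month really sit above the later ones.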

In [None]:
fig, ax1 = plt.subplots(figsize=(15,5))      # Create a figure and axis
sns.lineplot(data=stock_daily.reset_index(), x='Date', y='Close', ax=ax1, color='blue', marker='o', label='Close Price') # Lineplot on primary y-axis
ax2 = ax1.twinx()                            # Create a secondary y-axis
sns.lineplot(data=stock_daily.reset_index(), x='Date', y='Volume', ax=ax2, color='gray', marker='o', label='Volume') # Lineplot on secondary y-axis
ax1.legend(bbox_to_anchor=(1,1));            # Legend of the primary axis (Close Price), anchored at the top-right corner

In [None]:
# 'Date' has no missing values (see the null check above), so no imputation is needed.
# Filling a datetime column with a string such as "Unknown" would break the date-based split below.
stock["Date"].describe()

## Test Train Split

In [None]:
X_train = stock[(stock['Date'] < '2019-04-01')].reset_index()    # Select all rows where the 'Date' is before '2019-04-01'
X_val = stock[(stock['Date'] >= '2019-04-01') & (stock['Date'] < '2019-04-16')].reset_index()    # Select all rows where the 'Date' is from '2019-04-01' to '2019-04-16' (excluded)
X_test = stock[stock['Date'] >= '2019-04-16'].reset_index()      # Select all rows where the 'Date' is from '2019-04-16' till the end.

In [None]:
# The 'Label' column is the target variable (stored in lowercase y_* variables)
y_train = X_train["Label"].copy()
y_val = X_val["Label"].copy()
y_test = X_test["Label"].copy()

In [None]:
# Print the shape of X_train,X_val,X_test,y_train,y_val and y_test
print("\nTrain data shape",X_train.shape)
print("Validation data shape",X_val.shape)
print("Test data shape ",X_test.shape)
line= '_' * 25
print(line)
print("\nTrain Label shape",y_train.shape)
print("Validation Label shape",y_val.shape)
print("Test Label shape ",y_test.shape)

## **Word Embeddings**

### Model 1 - Word2Vec

In [None]:
words_list = [item.split(" ") for item in stock['News'].values] # Creating a list of all words in our data

In [None]:
# Creating an instance of Word2Vec
vec_size = 300 # Determines the number of features used to represent each word in the vector space. A higher vec_size can increase computational complexity as it captures more nuances.
model_W2V = Word2Vec(words_list, vector_size = vec_size, min_count = 1, window=5, workers = 6) # Model will learn these embeddings by analyzing word co-occurrences within a context window of 5 words.

<h3>Word2Vec Parameters for our Model</h3>

  <table>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Value</th>
      <th>Comment</th>
    </tr>
    <tr>
      <td><code>vec_size</code></td>
      <td>Dimensionality of word vectors</td>
      <td>300</td>
      <td>It determines the number of features used to represent each word in the vector space.</td>
    </tr>
    <tr>
      <td><code>model_W2V</code></td>
      <td>Word2Vec model instance</td>
      <td>-</td>
      <td>The Word2Vec model learns these representations by analyzing the co-occurrence patterns of words in the input text.</td>
    </tr>
    <tr>
      <td><code>words_list</code></td>
      <td>Input data (list of sentences or words)</td>
      <td>-</td>
      <td>This argument represents the input data for the model. The model will learn word embeddings based on the words and their contexts within these sentences.</td>
    </tr>
    <tr>
      <td><code>vector_size</code></td>
      <td>Dimensionality of word vectors</td>
      <td>300</td>
      <td>In this case, it is set to 300, meaning that <strong>each word</strong> will be represented by a vector with 300 dimensions.</td>
    </tr>
    <tr>
      <td><code>min_count</code></td>
      <td>Minimum word frequency to be included</td>
      <td>1</td>
      <td>Specifies the minimum number of times a word must appear in the training data to be included in the model's vocabulary.</td>
    </tr>
    <tr>
      <td><code>window</code></td>
      <td>Context window size</td>
      <td>5</td>
      <td>Context window around a target word. The model considers words within a window before and after the target word <strong>to learn its vector representation</strong>.</td>
    </tr>
    <tr>
      <td><code>workers</code></td>
      <td>Number of worker threads</td>
      <td>6</td>
      <td>Using multiple workers can significantly speed up the training process, especially for large datasets.</td>
    </tr>
  </table>


In [None]:
print("Length of the vocabulary is", len(list(model_W2V.wv.key_to_index))) # Size of the vocabulary or number of unique words that the Word2Vec model has learned representations for.

- The vocabulary contains 4692 unique words

In [None]:
word = "stock"     # Selected word used frequently.
model_W2V.wv[word] # Observe the word embedding of a selected word

In [None]:
word = "economy"   # Second selected word
model_W2V.wv[word] # Observe the word embedding of the second selected word

In [None]:
words = list(model_W2V.wv.key_to_index.keys()) # Retrieve the words present in the Word2Vec model's vocabulary
wvs = model_W2V.wv[words].tolist()             # Retrieve word vectors for all the words present in the model's vocabulary
word_vector_dict = dict(zip(words, wvs))       # Create a dictionary of words and their corresponding vectors

In [None]:
def average_vectorizer_Word2Vec(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Create a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in word_vector_dict] # Dict lookup is much faster than searching the 'words' list

    # Add the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(word_vector_dict[word])

    # Divide by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector
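
To make the averaging step concrete, here is a minimal sketch on a two-word toy vocabulary. The function `average_vector` and the vectors in `toy_vectors` are hypothetical and introduced only for illustration. Unlike the function above, it takes the word-to-vector dictionary as an argument and infers the dimensionality from it, so it does not depend on a global `vec_size`.

```python
import numpy as np

def average_vector(doc, word_vectors):
    """Mean of the vectors of in-vocabulary tokens; a zero vector if none are known."""
    dim = len(next(iter(word_vectors.values())))  # infer dimensionality from the dict
    known = [word_vectors[w] for w in doc.split() if w in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 2-dimensional "embeddings" (made-up values)
toy_vectors = {
    "stock": np.array([1.0, 0.0]),
    "rises": np.array([0.0, 1.0]),
}

print(average_vector("stock rises sharply", toy_vectors))  # "sharply" is out of vocabulary
print(average_vector("unknown words", toy_vectors))        # no known words at all
```

Out-of-vocabulary words are simply skipped, so a document made up only of unknown words maps to the zero vector. Such rows carry no signal for the classifier.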

In [None]:
# Create a dataframe of the vectorized documents
start = time.time()

X_train_wv = pd.DataFrame(X_train["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
X_val_wv = pd.DataFrame(X_val["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
X_test_wv = pd.DataFrame(X_test["News"].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])

end = time.time()
print('Time taken ', (end-start))

In [None]:
print(X_train_wv.shape,'train split\n', X_val_wv.shape,'validation split\n', X_test_wv.shape,'test split\n') # Train-Validation-Test Splits

### Model 2 - GloVe

In [None]:
# Download and unpack the GloVe vectors (Stanford NLP); -nc and -n skip files that already exist
!wget -nc http://nlp.stanford.edu/data/glove.6B.zip
!unzip -n glove.6B.zip

# Convert GloVe to word2vec format
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

# Load the converted model
filename = 'glove.6B.100d.txt.word2vec'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=False)

In [None]:
print("Length of the vocabulary is", len(glove_model.index_to_key)) # Check the size of the vocabulary

In [None]:
word = "stock"    # Select the word embedding for first word. A very frequently used word.
glove_model[word] # View the word embedding of selected word

In [None]:
word = "economy"  # Select the word embedding for a second word.
glove_model[word] # View the word embedding of selected word

In [None]:
glove_words = glove_model.index_to_key                                                 # Retrieve the words present in the GloVe model's vocabulary
glove_word_vector_dict = dict(zip(glove_model.index_to_key,list(glove_model.vectors))) # Create a dictionary of words and their corresponding vectors

In [None]:
# Each word is represented by a 100-dimensional vector (100 features).
vec_size = 100 # Number of dimensions of the GloVe embedding space. Note: this overwrites the vec_size of 300 used for Word2Vec above.

In [None]:
def average_vectorizer_GloVe(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    # (a dict lookup is far faster than searching the 'glove_words' list)
    words_in_vocab = [word for word in doc.split() if word in glove_word_vector_dict]

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(glove_word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector

In [None]:
# Create a dataframe of the vectorized documents
start = time.time()

X_train_gl = pd.DataFrame(X_train["News"].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)]) # Apply GloVe on 'News' column for Training set.
X_val_gl = pd.DataFrame(X_val["News"].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])     # Apply GloVe on 'News' column For Validation set.
X_test_gl = pd.DataFrame(X_test["News"].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])   # Apply GloVe on 'News' column for Testing set.

end = time.time()
print('Time taken ', (end-start))

In [None]:
print(f'Time taken to create a dataframe of the vectorized documents using GloVe: \033[1m{end - start:.6f} seconds\033[0m.') # Shown with 6 decimal places; \033[0m resets the bold style.

In [None]:
print('For GloVe:\n', X_train_gl.shape,'train split\n', X_val_gl.shape,'validation split\n', X_test_gl.shape,'test split\n') # Train-Validation-Test Splits

### Model 3 - Sentence Transformer

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') # Pretrained sentence-embedding model; maps each text to a 384-dimensional vector.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # Setting the device to GPU if available, else CPU

In [None]:
# Encoding the dataset splits with the Transformer Model
start = time.time()

X_train_st = model.encode(X_train["News"].values, show_progress_bar=True, device=device) # Apply Sentence Transformer on 'News' column for the Training set.
X_val_st = model.encode(X_val["News"].values, show_progress_bar=True, device=device)     # Apply Sentence Transformer on 'News' column for the Validation set.
X_test_st = model.encode(X_test["News"].values, show_progress_bar=True, device=device)   # Apply Sentence Transformer on 'News' column for the Test set.

end = time.time()
print(f'Time taken using a Transformer: \033[1m{end - start:.6f} seconds\033[0m.') # Shown with 6 decimal places; \033[0m resets the bold style.

In [None]:
print('For our Transformer model:\n', X_train_st.shape,'train split\n', X_val_st.shape,'validation split\n', X_test_st.shape,'test split\n') # Train-Validation-Test Splits

## Model Evaluation Criterion

For each embedding model we will compare:

- Accuracy and weighted F1 score of the sentiment predictions (Label), together with the confusion matrix of each model.
- Computational cost: the time and resources required to train and use each model, measured here as processing time.
- The same classifier is used for all 3 embeddings so that the comparison is like for like.
- More extensive training could be run (Colab permitting) by trying each of the following classifiers:
  - GradientBoostingClassifier
  - RandomForestClassifier
  - DecisionTreeClassifier

We need to consider these factors:
- Vector size: While not the sole determinant, larger vectors (like the Sentence Transformer's 384 dimensions) can potentially capture more complex relationships but may also be computationally more expensive.
- Training data: The quality and size of the data used to train each embedding model significantly impact performance.
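
Because the three sentiment classes are imbalanced, the metrics in this notebook use weighted averaging: each class's score is weighted by its number of true samples (its support). The following sketch computes a weighted F1 score by hand on made-up labels to show what that means; `weighted_f1` is a hypothetical helper, written to mirror what `f1_score(..., average='weighted')` is documented to do.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += n_cls * f1  # weight each class by its support
    return total / len(y_true)

# Made-up labels: -1 = negative, 0 = neutral, 1 = positive
y_true = [-1, -1, 0, 0, 0, 1]
y_pred = [-1, 0, 0, 0, 1, 1]

score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case every class happens to have an F1 of 2/3, so the weighted average is also 2/3. On imbalanced real data the majority class (neutral) dominates the weighted score.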

In [None]:
def plot_confusion_matrix(model, predictors, target):
    """
    Plot a confusion matrix to visualize the performance of a classification model.

    Parameters:
    model (sklearn classifier): The fitted classification model.
    predictors (array-like): The independent variables used for predictions.
    target (array-like): The true labels.

    Returns:
    None: Displays the confusion matrix plot.
    """
    pred = model.predict(predictors)  # Make predictions using the classifier.

    label_list = [-1, 0, 1]  # Fixed label order: negative, neutral, positive.
    cm = confusion_matrix(target, pred, labels=label_list)  # Rows and columns follow label_list.

    plt.figure(figsize=(5, 4))  # Create a new figure with a specified size.
    sns.heatmap(cm, annot=True, fmt='.0f', cmap='Blues', xticklabels=label_list, yticklabels=label_list)
    # Plot the confusion matrix using a heatmap with annotations.

    plt.ylabel('Actual')  # Label for the y-axis.
    plt.xlabel('Predicted')  # Label for the x-axis.
    plt.title('Confusion Matrix')  # Title of the plot.
    plt.show()  # Display the plot.
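
When reading a confusion matrix it matters which row and column belong to which label. The sketch below uses made-up labels to show that `confusion_matrix` orders rows and columns by the `labels` argument (or by sorted label value when it is omitted), so any tick labels drawn on a heatmap need to be listed in that same order.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: -1 = negative, 0 = neutral, 1 = positive
y_true = [-1, -1, 0, 0, 0, 1]
y_pred = [-1, 0, 0, 0, 1, 1]
labels = [-1, 0, 1]

# Rows are actual classes, columns are predicted classes, both in 'labels' order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Without 'labels', scikit-learn sorts the label values, giving the same order here
cm_default = confusion_matrix(y_true, y_pred)
print((cm == cm_default).all())
```

The first row, for example, says that of the two truly negative items one was predicted negative and one neutral.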

In [None]:
def model_performance_classification_sklearn(model, predictors, target):
    """
    Compute various performance metrics for a classification model using sklearn.

    Parameters:
    model (sklearn classifier): The classification model to evaluate.
    predictors (array-like): The independent variables used for predictions.
    target (array-like): The true labels for the dependent variable.

    Returns:
    pandas.DataFrame: A DataFrame containing the computed metrics (Accuracy, Recall, Precision, F1-score).
    """
    pred = model.predict(predictors)  # Make predictions using the classifier.

    acc = accuracy_score(target, pred)  # Compute Accuracy.
    recall = recall_score(target, pred,average='weighted')  # Compute Recall.
    precision = precision_score(target, pred,average='weighted')  # Compute Precision.
    f1 = f1_score(target, pred,average='weighted')  # Compute F1-score.

    # Create a DataFrame to store the computed metrics.
    df_perf = pd.DataFrame(
        {
            "Accuracy": [acc],
            "Recall": [recall],
            "Precision": [precision],
            "F1": [f1],
        }
    )

    return df_perf  # Return the DataFrame with the metrics.

## Untuned - Model Training

#### Untuned: Word2Vec

In [None]:
# Building the model

#Uncomment only one of the snippets related to fitting the model to the data

#base_wv = GradientBoostingClassifier(random_state = 42)
#base_wv = RandomForestClassifier(random_state=42)
base_wv = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_wv.fit(X_train_wv, y_train)

In [None]:
plot_confusion_matrix(base_wv,X_train_wv,y_train) # Training

In [None]:
plot_confusion_matrix(base_wv,X_val_wv,y_val) # Validation

In [None]:
# Calculating different metrics on training data
base_train_wv = model_performance_classification_sklearn(base_wv,X_train_wv,y_train)
print("Training performance:\n", base_train_wv)

In [None]:
# Calculating different metrics on validation data
base_val_wv = model_performance_classification_sklearn(base_wv,X_val_wv,y_val)
print("Validation performance:\n",base_val_wv)

#### Untuned: GloVe

In [None]:
#Building the model

# Uncomment only one of the classifiers below

#base_gl = GradientBoostingClassifier(random_state = 42)
#base_gl = RandomForestClassifier(random_state=42)
base_gl = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_gl.fit(X_train_gl, y_train) # Fit the chosen model on the GloVe train data

In [None]:
plot_confusion_matrix(base_gl,X_train_gl,y_train) # Confusion matrix for the train data on GloVe

In [None]:
plot_confusion_matrix(base_gl,X_val_gl,y_val) # Confusion matrix for the validation data on GloVe

In [None]:
#Calculating different metrics on training data
base_train_gl=model_performance_classification_sklearn(base_gl,X_train_gl,y_train) # Calculate model performance for the training data on a GloVe Model
print("Training performance:\n", base_train_gl)

In [None]:
#Calculating different metrics on validation data
base_val_gl = model_performance_classification_sklearn(base_gl,X_val_gl,y_val) # Calculate model performance for the validation data on a GloVe model.
print("Validation performance:\n",base_val_gl)

#### Untuned: Sentence Transformer

In [None]:
# Building the model

# Uncomment only one of the classifiers below

#base_st = GradientBoostingClassifier(random_state = 42)
#base_st = RandomForestClassifier(random_state=42)
base_st = DecisionTreeClassifier(random_state=42)

# Fitting on train data
base_st.fit(X_train_st, y_train) # Fit the chosen model on the Sentence Transformer train data

In [None]:
plot_confusion_matrix(base_st,X_train_st,y_train) # Confusion matrix for the train data on our Transformer model.

In [None]:
plot_confusion_matrix(base_st,X_val_st,y_val) # Confusion matrix for the validation data on our Transformer model.

In [None]:
#Calculating different metrics on training data
base_train_st=model_performance_classification_sklearn(base_st,X_train_st,y_train) # Model performance for the training data on our Transformer Model.
print("Training performance:\n", base_train_st)

In [None]:
#Calculating different metrics on validation data
base_val_st = model_performance_classification_sklearn(base_st,X_val_st,y_val)  # Model performance for the validation data on our Transformer Model.
print("Validation performance:\n",base_val_st)


<h2>Decision Tree Classifier Tuning Parameters</h2>

<table>
  <tr>
    <th>Parameter</th>
    <th>Description</th>
    <th>Values</th>
  </tr>
  <tr>
    <td>max_depth</td>
    <td>Maximum depth of the tree</td>
    <td>[3, 4, 5, 6]</td>
  </tr>
  <tr>
    <td>min_samples_split</td>
    <td>Minimum number of samples required to split an internal node</td>
    <td>[5, 7, 9, 11]</td>
  </tr>
  <tr>
    <td>max_features</td>
    <td>Number of features considered when splitting a node</td>
    <td>['log2', 'sqrt', 0.2, 0.4]</td>
  </tr>
</table>
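
The size of this search space determines how long tuning takes. As a quick check, the sketch below enumerates the grid from the table with `itertools.product` and multiplies by the number of cross-validation folds (5, matching `cv=5` in the grid search cell). The variable names are illustrative.

```python
from itertools import product

param_grid = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_split": [5, 7, 9, 11],
    "max_features": ["log2", "sqrt", 0.2, 0.4],
}
cv_folds = 5  # matches cv=5 in the grid search

# Every combination of one value per parameter is one candidate model
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "fits")
```

With 4 values per parameter there are 4 x 4 x 4 = 64 candidates, so one grid search trains 320 trees per embedding, plus one final refit on the full training set.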


In [None]:
start = time.time()

# Choose the type of classifier.

# Uncomment only one of the snippets corresponding to the base model trained previously

#base_estimator = GradientBoostingClassifier(random_state = 42)
#base_estimator = RandomForestClassifier(random_state=42)
base_estimator = DecisionTreeClassifier(random_state=42)

parameters = {
    'max_depth': np.arange(3,7),
    'min_samples_split': np.arange(5,12,2),
    'max_features': ['log2', 'sqrt', 0.2, 0.4]
}

# Run a separate grid search per embedding so that every "tuned" model is actually tuned
# (previously only the Word2Vec model was searched; the other two kept default parameters)
def tune(X_train_emb):
    grid_obj = GridSearchCV(base_estimator, parameters, scoring='f1_weighted', cv=5, n_jobs=-1)
    grid_obj.fit(X_train_emb, y_train)
    return grid_obj.best_estimator_  # Best parameter combination, refit on the full training data

tuned_wv = tune(X_train_wv)
tuned_gl = tune(X_train_gl)
tuned_st = tune(X_train_st)

end = time.time()
print("Time taken ",(end-start))

#### Tuned: Word2Vec

In [None]:
# Fit the best algorithm to the data.
tuned_wv.fit(X_train_wv, y_train)

In [None]:
plot_confusion_matrix(tuned_wv,X_train_wv,y_train)

In [None]:
plot_confusion_matrix(tuned_wv,X_val_wv,y_val)

In [None]:
#Calculating different metrics on training data
tuned_train_wv=model_performance_classification_sklearn(tuned_wv,X_train_wv,y_train)
print("Training performance:\n",tuned_train_wv)

In [None]:
#Calculating different metrics on validation data
tuned_val_wv = model_performance_classification_sklearn(tuned_wv,X_val_wv,y_val)
print("Validation performance:\n",tuned_val_wv)

#### Tuned: GloVe

In [None]:
# Fit the best algorithm to the data.
tuned_gl.fit(X_train_gl, y_train) # Fit the chosen model on the train data

In [None]:
plot_confusion_matrix(tuned_gl,X_train_gl, y_train) # Confusion matrix for the train data

In [None]:
plot_confusion_matrix(tuned_gl,X_val_gl,y_val) # Confusion matrix for the validation data

In [None]:
# Metrics on training data
tuned_train_gl=model_performance_classification_sklearn(tuned_gl,X_train_gl, y_train) # Model performance for the training data on GloVe model.
print("Training performance:\n",tuned_train_gl)

In [None]:
#Calculating different metrics on validation data
tuned_val_gl = model_performance_classification_sklearn(tuned_gl,X_val_gl,y_val) # Model performance for the validation data on GloVe model.
print("Validation performance:\n",tuned_val_gl)

#### Tuned: Sentence Transformer

In [None]:
# Fit the best algorithm to the data.
tuned_st.fit(X_train_st, y_train) # Fit the chosen model on the train data

In [None]:
plot_confusion_matrix(tuned_st,X_train_st,y_train) # Confusion matrix for the train data

In [None]:
plot_confusion_matrix(tuned_st,X_val_st,y_val) # Confusion matrix for the validation data

In [None]:
# Metrics on training data
tuned_train_st=model_performance_classification_sklearn(tuned_st,X_train_st,y_train) # Model performance for the training data
print("Training performance:\n",tuned_train_st)

In [None]:
# Metrics on validation data
tuned_val_st = model_performance_classification_sklearn(tuned_st,X_val_st,y_val) # Model performance for the validation data
print("Validation performance:\n",tuned_val_st)

## Model Selection

#### Model Performance Summary

In [None]:
#training performance comparison

models_train_comp_df = pd.concat(
    [base_train_wv.T,
     base_train_gl.T,
     base_train_st.T,
     tuned_train_wv.T,
     tuned_train_gl.T,
     tuned_train_st.T,
    ],axis=1
)

models_train_comp_df.columns = [
    "Base Model (Word2Vec)",
    "Base Model (GloVe)",
    "Base Model (Sentence Transformer)",
    "Tuned Model (Word2Vec)",
    "Tuned Model (GloVe)",
    "Tuned Model (Sentence Transformer)",
]

print("Training performance comparison:")
models_train_comp_df

In [None]:
# Validation performance comparison

models_val_comp_df = pd.concat(
    [base_val_wv.T,
     base_val_gl.T,
     base_val_st.T,
     tuned_val_wv.T,
     tuned_val_gl.T,
     tuned_val_st.T,
     ],axis=1
)

models_val_comp_df.columns = [
    "Base Model (Word2Vec)",
    "Base Model (GloVe)",
    "Base Model (Sentence Transformer)",
    "Tuned Model (Word2Vec)",
    "Tuned Model (GloVe)",
    "Tuned Model (Sentence Transformer)",
]

print("Validation performance comparison:")
models_val_comp_df

## Model Performance Check on Testing dataset

In [None]:
# The final model (tuned_st) was already fitted on the training data.
# It is not refit here: fitting on the test set would leak the test labels into the model.
# It is evaluated on the Sentence Transformer test features, matching the features it was trained on.
plot_confusion_matrix(tuned_st,X_test_st,y_test) # Confusion matrix for the final model and test data.

In [None]:
# Calculating different metrics on test data
final_model_test = model_performance_classification_sklearn(tuned_st,X_test_st,y_test) # Final model's performance with the test data.
print("Test performance for the final model:\n",final_model_test)

- Best Model: Sentence Transformer

## **Weekly News Summarization**

## Installing and Importing the necessary libraries

In [None]:
# Install version 0.1.85 of llama-cpp-python, built with GPU (cuBLAS) support
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q # Invoked as a shell command executed within Jupyter/Google Colab.

# CPU-only installation of llama-cpp-python
# Uncomment and run the following line if no GPU is available
#!CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q

In [None]:
%%time
from huggingface_hub import hf_hub_download          # Function to download the model from the Hugging Face model hub
from llama_cpp import Llama                          # Importing the Llama class from the llama_cpp module
import pandas as pd                                  # Importing the library for data manipulation
from tqdm import tqdm                                # For progress bar related functionalities
tqdm.pandas()

## Loading the data

In [None]:
stock_data = pd.read_csv('https://raw.githubusercontent.com/ShauryaRawat10/Data-Science/main/Generative%20AI/Storage/stock_news.csv', engine='python')

In [None]:
data = stock_data.copy()                            # Make a DataFrame copy for analysis

## Wordcount to check on tokens

In [None]:
# Total word count across the 'News' column, as a rough proxy for the number of tokens
import re

words = re.findall(r'\b\w+\b', ", ".join(data['News'].astype(str)).lower()) #Find all words, convert to lower case
print("Total word count:", len(words))

In [None]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"  # Hugging Face repository that hosts the model
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"          # Quantized (Q6_K) GGUF file within that repository

model_path = hf_hub_download(                                  # Download the roughly 7-billion-parameter model and return its local path
    repo_id=model_name_or_path,                                # Repository to download from
    filename=model_basename                                    # File to download
)                                                              # A progress bar is shown until the download completes

In [None]:
# Connect the runtime to a GPU so that model layers can be offloaded to it for computation.
llm = Llama(                                                   # Create an instance of the Llama class from the llama-cpp-python library.
    model_path=model_path,                                     # Local path to the downloaded GGUF file, which holds the quantized model weights.
    n_gpu_layers=100,                                          # Maximum number of layers to offload to the GPU; the load log reports how many were offloaded.
    n_ctx=4500,                                                # Context window in tokens: the prompt and the generated reply must fit within it together.
)

In [None]:
# Aggregating data weekly
data["Date"] = pd.to_datetime(data['Date'])                                     # Convert the 'Date' column to datetime format.

In [None]:
weekly_grouped = data.groupby(pd.Grouper(key='Date', freq='W'))                 # Group the data by week using the 'Date' column.

In [None]:
weekly_grouped_full = weekly_grouped.apply(lambda x: x).reset_index(drop=True)  # Display all rows from the grouped Dataframe
print(weekly_grouped_full)

In [None]:
# Aggregate each week: join the "News" items with ' || ', do the same for the matching "Volume" and "Label" values, and reset the index.
weekly_aggregated = weekly_grouped.agg(
    {
        "News": lambda x: " || ".join(x),
        "Volume": lambda x: " || ".join(map(str, x)),  # Convert 'Volume' values to strings before joining
        "Label": lambda x: " || ".join(map(str, x)),   # Convert 'Label' values to strings before joining
    }
).reset_index()
weekly_aggregated
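
The weekly grouping relies on the pandas `'W'` frequency, which closes each week on a Sunday. The sketch below mirrors that bucketing in plain Python so the week boundaries are easy to inspect. The helper names and the toy rows are hypothetical and exist only for illustration. Unlike the pandas grouper, it does not emit empty weeks.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending_sunday(day):
    """Return the Sunday that closes the week containing `day` (a Sunday maps to itself)."""
    return day + timedelta(days=6 - day.weekday())  # Monday=0 ... Sunday=6


def aggregate_weekly(rows, sep=" || "):
    """Join the news of each week with `sep`; `rows` are (date, news) pairs."""
    buckets = defaultdict(list)
    for day, news in rows:
        buckets[week_ending_sunday(day)].append(news)
    return {week: sep.join(items) for week, items in sorted(buckets.items())}


# Hypothetical rows, for illustration only.
toy_rows = [
    (date(2019, 1, 2), "headline A"),  # Wednesday
    (date(2019, 1, 6), "headline B"),  # Sunday, still the same week as A
    (date(2019, 1, 7), "headline C"),  # Monday, opens the next week
]
weekly = aggregate_weekly(toy_rows)
for week, joined in weekly.items():
    print(week, "->", joined)
```

Because a Sunday belongs to the week it closes, a headline published on Sunday is summarized together with the preceding Monday-to-Saturday news.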

In [None]:
# Compare processed files ...
print("\nCompare processed files ...")
print("\nweekly_grouped_full: ",weekly_grouped_full.shape, "with Columns: ", weekly_grouped_full.columns)
print("weekly_aggregated:    ",weekly_aggregated.shape,"with Columns: ", weekly_aggregated.columns)

In [None]:
weekly_aggregated_copy = weekly_aggregated.copy()

## Utilities

In [None]:
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
    import json
    try:
        # Find the indices of the opening and closing curly braces
        json_start = json_str.find('{')
        json_end = json_str.rfind('}')

        if json_start != -1 and json_end != -1:
            extracted_category = json_str[json_start:json_end + 1]  # Extract the JSON object
            data_dict = json.loads(extracted_category)
            return data_dict
        else:
            print(f"Warning: JSON object not found in response: {json_str}")
            return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return {}
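
The parser above takes everything between the first `{` and the last `}`. A minimal standalone sketch of the same idea follows, to show how it behaves on typical replies. The function name `extract_json_object` and the sample replies are hypothetical. The variant also guards against a `}` that appears before the first `{`.

```python
import json


def extract_json_object(text):
    """Parse the span from the first '{' to the last '}' as JSON; return {} on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:  # no braces, or a '}' that precedes the first '{'
        return {}
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}


# Hypothetical model replies, for illustration only.
chatty_reply = 'Here are the topics:\n{"1": "First topic.", "2": "Second topic."}\nHope this helps.'
truncated_reply = '{"1": "First topic.", "2": "Sec'  # cut off before the closing brace

print(extract_json_object(chatty_reply))
print(extract_json_object(truncated_reply))
```

The second case matters in practice: a reply that hits the `max_tokens` limit before the closing brace yields an empty dictionary, so an unexpectedly empty result may point to truncation rather than to a bad prompt.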

In [None]:
# Response function only creating Key Events
def response_mistral_1(prompt, news):
    model_output = llm(
      f"""
      [INST]
      {prompt}
      News Articles: {news}
      [/INST]
      """,
      max_tokens=150,       # Cap the reply at 150 tokens; longer replies are cut off.
      temperature=0,        # Temperature 0 makes decoding effectively deterministic; higher values give more varied output.
      top_p=0.95,           # Nucleus sampling: sample only from the smallest set of tokens whose cumulative probability reaches 0.95.
      top_k=50,             # Consider at most the 50 most probable tokens at each step.
      stop=['INST'],        # Stop generating as soon as the reply contains the string 'INST', e.g. if the model starts to echo an [INST] tag.
      echo=False,
    )

    final_output = model_output["choices"][0]["text"]

    return final_output
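
To make the `top_k` and `top_p` settings concrete, here is a toy sketch of how the two filters narrow the candidate tokens before one is sampled. It is a simplified illustration, not the llama.cpp implementation. The function name and the probabilities are made up.

```python
def filter_candidates(probs, top_k=50, top_p=0.95):
    """Return the tokens that survive top-k and then top-p (nucleus) filtering."""
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    # top_k: keep at most k of the most probable tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


# Made-up next-token probabilities, for illustration only.
toy_probs = {"rose": 0.50, "fell": 0.30, "held": 0.12, "soared": 0.05, "banana": 0.03}

print(filter_candidates(toy_probs, top_k=50, top_p=0.95))  # the least likely token is dropped
print(filter_candidates(toy_probs, top_k=2, top_p=0.95))   # top_k is the binding limit here
```

With `temperature=0` the most probable token is chosen at every step, so both filters have little practical effect in this notebook. They start to matter once the temperature is raised.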

In [None]:
# Response function with 'Label' data incorporated
def response_mistral_2(prompt, news, labels):
    # Pair each news item with its label; 'news' and 'labels' must be equal-length lists, not ' || '-joined strings
    formatted_news_and_labels = "\n".join([f"News: {n} | Label: {l}" for n, l in zip(news, labels)])

    # Construct the prompt with both the news and the labels
    model_output = llm(
      f"""
      [INST]
      {prompt}
      News Articles and Labels:
      {formatted_news_and_labels}
      [/INST]
      """,
      max_tokens=150,   # Set max tokens to limit response length
      temperature=0,    # Set temperature for a more predictable response
      top_p=0.95,       # Set top_p for diversity
      top_k=50,         # Limit to top 50 tokens for better focus
      stop=['INST'],    # Stop at the end of the instruction
      echo=False,
    )

    final_output = model_output["choices"][0]["text"]
    return final_output
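
The pairing step in `response_mistral_2` uses `zip`, which iterates over whatever it is given. The sketch below isolates that step to show why the weekly ' || '-joined strings should be split into lists first. The helper name and the sample strings are hypothetical.

```python
def format_news_and_labels(news, labels):
    """Pair each news item with its label, one pair per line."""
    return "\n".join(f"News: {n} | Label: {l}" for n, l in zip(news, labels))


# Hypothetical weekly strings in the ' || '-joined shape, for illustration only.
weekly_news = "headline A || headline B"
weekly_labels = "1 || -1"

# Splitting first gives one pair per article.
paired = format_news_and_labels(weekly_news.split(" || "), weekly_labels.split(" || "))
print(paired)

# Passing the joined strings directly pairs single characters instead of articles.
char_pairs = format_news_and_labels(weekly_news, weekly_labels)
print(char_pairs.splitlines()[0])
```

No error is raised in the second case, so the problem only shows up as a nonsensical prompt.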

In [None]:
news = weekly_aggregated_copy.loc[0, 'News'] # Aggregated news for the first week, used to test the prompts

In [None]:
print(len(news.split(' '))) # Word count of the first week's aggregated news
news
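
The word count is only a proxy for the number of tokens the model will see. Below is a rough sketch of a budget check against the context window, using the notebook's `n_ctx=4500` and `max_tokens=150` as defaults. The function name is hypothetical, and the `tokens_per_word` ratio of 1.3 is an assumed rule of thumb, not a measured value.

```python
import math


def fits_context(text, instructions="", n_ctx=4500, max_tokens=150, tokens_per_word=1.3):
    """Roughly check that the prompt plus the reply fit in the context window.

    tokens_per_word is an assumed rule of thumb; real tokenizers vary with the text.
    """
    n_words = len(text.split()) + len(instructions.split())
    estimated_prompt_tokens = math.ceil(n_words * tokens_per_word)
    budget = n_ctx - max_tokens  # tokens left for the prompt once the reply is reserved
    return estimated_prompt_tokens, budget, estimated_prompt_tokens <= budget


# Synthetic texts of known length, for illustration only.
short_week = "word " * 1000
long_week = "word " * 4000

print(fits_context(short_week))
print(fits_context(long_week))
```

For an exact count, the text can be tokenized with the loaded model instead, for example `len(llm.tokenize(news.encode("utf-8")))`. Weeks that exceed the budget would need to be shortened or split before prompting.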

In [None]:
# ------------------------------------------PROMPT 1 NOW WORKING ---
prompt1 = """
You are an expert data analyst specializing in news analysis and sentiment analysis.

Task: Analyze the provided news headlines and return the main topics within them.  Each event should be listed once, even if mentioned multiple times.

Instructions:
1. Read each news headline carefully to identify the main subjects or entities it mentions.
2. Determine the key events or actions described in the headline.
3. Extract relevant keywords that represent these same topics and summarize each.
4. List the resulting summarized topics concisely in a uniform numbered format (1, 2, 3, ...), always starting at 1, with one number per topic.
5. Be sure to use uniform formatting for the output and always end each row with a period.

Return the output results in JSON format with keys as the topic number and values as the actual topic
"""

In [None]:
%%time
summary = response_mistral_1(prompt1, news) # JSON output with a simple prompt; topics are generated cleanly.
print(summary)

In [None]:
# ---------------------------------------------PROMPT 2 -----
prompt2 = """
You are an expert data analyst specializing in sentiment analysis.

Task: Analyze the numbered Key Events and return corresponding Labels.

Instructions:
1. Read the numbered Key Events carefully.
2. Determine the Label that matches each Key Event and number the Labels to match the Key Events.
3. For each instance of a -1 create the text "Negative News Event" and for each instance of a 1 create the text "Positive News Event", matching each Key Event's number.
4. List the matching Labels as text and make sure there is exactly one Label per Key Event.
5. Be sure to use uniform formatting for the output and always end each row with a period.

"""

In [None]:
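%%time
# prompt2 needs news paired with labels, so split the first week's ' || '-joined strings and use response_mistral_2.
week_news = weekly_aggregated_copy.loc[0, 'News'].split(' || ')
week_labels = weekly_aggregated_copy.loc[0, 'Label'].split(' || ')
summary = response_mistral_2(prompt2, week_news, week_labels)
print(summary)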

In [None]:
# TEST
prompt3 = """
You are an expert data analyst specializing in stock market news article analysis that affects the financial market.
Task: Analyze the news headlines and determine which news articles are positive or negative in sentiment.
Instructions:
1. Read each individual news article; articles are separated by ' || '.
2. Identify whether the article contains positive or negative sentiment based on optimistic or pessimistic indicators.
3. Extract each article and create a summary based on the sentiment (Positive or Negative).
4. Summarize results by grouping by date into weeks, include the individual news articles and count the number of Positive (1) and Negative (-1) sentiments.
Output the results in JSON format.
"""

In [None]:
%%time
summary = response_mistral_1(prompt3, news) # Use the test prompt defined in the previous cell.
print(summary)

In [None]:
# ------------------------------------------PROMPT 2.1 ----
prompt = """
You are an expert data analyst specializing in news analysis and sentiment analysis.

Task: Analyze the provided news headlines and return the main topics for each of them.  Each event should be listed once, even if mentioned multiple times.

Instructions:
1. Read each news headline carefully to identify the main subjects or entities it mentions.
2. Determine the key events or actions described in the headline.
3. Extract relevant keywords that represent these same topics and summarize each.
4. List these resulting summarized topics in a concise manner.
5. Be sure to use uniform formatting for the output.

"""

In [None]:
summary_nonjson = response_mistral_1(prompt, news) # Apply the non-JSON prompt to the first week's news; passing the whole DataFrame would only send its printed representation.
print(summary_nonjson)

In [None]:
%%time
data['Key Events'] = data['News'].progress_apply(lambda x: response_mistral_1(prompt,x))

In [None]:
data.head() # Inspect the new 'Key Events' column