## **Problem Statement**

### Business Context

The prices of the stocks of companies listed under a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment being factors that play a significant role. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and accurately interpret its impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

### Problem Definition

With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with developing an AI-driven sentiment analysis system that automatically processes and analyzes news articles to gauge market sentiment, and with summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

### Data Dictionary

* `Date` : The date the news was released
* `News` : The content of news articles that could potentially affect the company's stock price
* `Open` : The stock price (in \$) at the beginning of the day
* `High` : The highest stock price (in \$) reached during the day
* `Low` :  The lowest stock price (in \$) reached during the day
* `Close` : The adjusted stock price (in \$) at the end of the day
* `Volume` : The number of shares traded during the day
* `Label` : The sentiment polarity of the news content
    * 1: positive
    * 0: neutral
    * -1: negative

## **Install and Import necessary libraries**

In [None]:
# installing the sentence-transformers and gensim libraries for word embeddings
!pip install numpy==1.26.4 \
             scikit-learn==1.6.1 \
             scipy==1.13.1 \
             gensim==4.3.3 \
             sentence-transformers==3.4.1 \
             pandas==2.2.2

In [None]:
# Import Pandas: Used for data manipulation and analysis.
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format

# Import NumPy: Used for numerical operations and array handling.
import numpy as np

# Matplotlib and Seaborn: Used for creating visualizations.
import matplotlib.pyplot as plt
import seaborn as sns

# Word2Vec and Sentence Transformer: Used to generate text embeddings.
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer

# To build and evaluate ML models
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Import TensorFlow and Keras for deep learning model building
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

# To implement progress bar related functionalities
from tqdm import tqdm
tqdm.pandas()

# To ignore unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

## **Loading the dataset**

In [None]:
# Mount the google drive
from google.colab import drive
drive.mount('/content/drive')

# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/ColabNotebooks/Dataset/stock_news.csv')


## **Data Overview**

**Display the number of rows and columns**

In [None]:
# Show the number of rows and columns in the data
data.shape

**Display the first 10 rows of the dataframe**

In [None]:
# Show the first 10 rows of the dataframe
data.head(10)

**Display the data type of the columns**

In [None]:
# Show datatype of columns
data.info()

The `Date` column is of type `object`, so it is converted to a datetime type below.

**Convert the data type of Date as date**

In [None]:
# Convert Date column datatype to date
data['Date'] = pd.to_datetime(data['Date'])

**Display the statistical summary of data**

In [None]:
# Show statistical summary of data. It is transposed to show the columns as rows.
data.describe().T

* The data covers January 2019 to April 2019.
* Over this period, the opening price ranged from 37.57 to 66.82.
* The highest price was 67.06.
* The lowest price was 37.30.

**Check for duplicates**

In [None]:
# Check for duplicate rows
duplicate_rows = data[data.duplicated()]

# Print the number of duplicate rows
print(len(duplicate_rows))

There are no duplicate rows.

**Check for missing values**

In [None]:
# Check for missing values
data.isnull().sum()

There are no missing values.

## **Exploratory Data Analysis**

### **Univariate Analysis**

* Distribution of individual variables
* Compute and check the distribution of the length of news content

In [None]:
# Show the distribution of news polarity in percent
sns.countplot(data=data, x='Label', stat="percent")
plt.title('Distribution of polarity of news')
plt.show()

The graph shows that about 50% of the news items are labeled neutral, about 28% are labeled negative, and about 22% are labeled positive.

**Density Plot Of Price (Open, High, Low, Close)**

In [None]:
# Show the distribution of the four stock prices (Open, High, Low, Close)
sns.displot(data=data[['Open','High','Low','Close']], kind='kde',palette="tab10")


* All four prices show a similar bimodal distribution, with one peak around 40-50 and another around 60-70.
* The four curves nearly overlap, indicating that the open, high, low and close prices differ little within a day.

**Histogram of Volume**

In [None]:
# Show the distribution of trading volume
sns.histplot(data=data, x='Volume', kde=True, stat="frequency")
plt.title('Histogram of Volume')

* X-axis: Range of trading volumes.
* Y-axis: Number of observations in each bin divided by the bin width (`stat="frequency"`).
* The distribution is right-skewed.
* Lower trading volumes are common.
* There are some outliers with high trading volumes on the right.
* Most days have low trading volume. Some days have high trading volume, possibly due to market-moving news.


**Statistical Summary on News Length**

In [None]:
# Calculate the number of words in each news article
data['news_length'] = data['News'].apply(lambda x: len(x.split()))
data['news_length'].describe()

* The average word count for each article is 48.35
* The minimum word count for each article is 18
* The max word count for each article is 60


**Histogram on News Length**

In [None]:
# Show the distribution of News word count
sns.histplot(data=data, x='news_length', kde=True, stat="frequency")
plt.title('Histogram of Article Word Count')

The histogram shows that:
* Most articles have a word count between 45 and 55.
* Very few articles have a low word count.


### **Bivariate Analysis**

* Correlation
* Sentiment Polarity vs Price
* Date vs Price


**Correlation Heatmap**

In [None]:
# Show the correlation between variables
cols= ['Open','High','Low','Close','Volume','news_length']
corr = data[cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')

* There is a strong positive correlation between the open, high, low and close variables.
* News length has a weak correlation with the price variables, so the length of a news item shows no meaningful linear relationship with price.
* Volume also has a weak correlation with the price variables.

**Box Plot between Label vs Price**


In [None]:
# Show box plot of label vs price
plt.figure(figsize=(10, 8))

for i, variable in enumerate(['Open', 'High', 'Low', 'Close']):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(data=data, x="Label", y=variable)
    plt.tight_layout(pad=2)

plt.show()

* The price distribution is similar across the three sentiment labels.
* Prices show little variation with news sentiment.


**Date vs Price**

In [None]:
# Group data by date
data_daily = data.groupby('Date').agg(
    {
        'Open': 'mean',
        'High': 'mean',
        'Low': 'mean',
        'Close': 'mean',
        'Volume': 'mean',
    }
).reset_index()

data_daily.set_index('Date', inplace=True)
data_daily.head()

In [None]:
# Show the line plot of price over time
plt.figure(figsize=(15,5))
sns.lineplot(data=data_daily.drop("Volume", axis=1));

* The four prices move in sync: on any given day they are closely related.
* The recurring highs and lows may reflect periodic market events, although the plot alone does not confirm this.


## **Data Preprocessing**

Transform the raw data into a clean, organized and usable format for the models to learn from.

In [None]:
# Check the date to figure out how to split the data
data['Date'].describe()

## **Train-Test Split**

Split the data into training and test sets. Since this is time series data, the split is chronological: train on data from 2019-01-02 to 2019-03-31 and test on the April data.

In [None]:
# Split data for train and test
X_train = data[data['Date'] < '2019-04-01']
X_test = data[data['Date'] >= '2019-04-01']

In [None]:
# Assign Label as target variable
y_train = X_train['Label'].copy()
y_test = X_test['Label'].copy()

In [None]:
#Check the shape of training and testing data to confirm split
print('Shape of training data:', X_train.shape)
print('Shape of testing data:', X_test.shape)
print('Shape of training target:', y_train.shape)
print('Shape of testing target:', y_test.shape)
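
The date-based split can be sanity-checked on a small example. The sketch below is only an illustration: `toy`, `toy_train` and `toy_test` are hypothetical names and the rows are made up, with the same `Date` and `Label` columns and the same cutoff as above. It confirms that no date crosses the cutoff and prints the class shares per split, which is worth checking because a chronological split does not stratify the labels.

```python
import pandas as pd

# Hypothetical toy data standing in for the real news dataframe
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-02", "2019-02-15", "2019-03-31",
                            "2019-04-01", "2019-04-15", "2019-04-30"]),
    "Label": [0, 1, -1, 0, 0, 1],
})

# Same cutoff as the split above: everything before April is training data
cutoff = pd.Timestamp("2019-04-01")
toy_train = toy[toy["Date"] < cutoff]
toy_test = toy[toy["Date"] >= cutoff]

# The latest training date must be earlier than the earliest test date
print(toy_train["Date"].max(), "<", toy_test["Date"].min())

# Class shares per split: large differences hint at a label shift over time
print(toy_train["Label"].value_counts(normalize=True).sort_index())
print(toy_test["Label"].value_counts(normalize=True).sort_index())
```

If the class shares differ strongly between the two periods, test metrics should be read with that shift in mind.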

## **Word Embeddings**

In this project we'll try a Word2Vec-based model and a Sentence Transformer-based model to generate vector representations of the news text.

### **Text embedding using Word2Vec**

Step 1: Data Preparation

In [None]:
# Create data_word2vec to store tokens, embedding independently without modifying the original dataframe.
data_word2vec = data.copy()


In [None]:
# Using simple_preprocess function to tokenize and lowercase all words.
from gensim.utils import simple_preprocess
data_word2vec['tokens'] = data_word2vec['News'].apply(lambda x: simple_preprocess(x))


In [None]:
# Check the tokens
#print(data_word2vec['tokens'])

Step 2: Model Training

In [None]:
# Train the Word2Vec model on the tokenized news articles.
# Note: with workers > 1, gensim results are not fully reproducible even with a fixed seed.
from gensim.models import Word2Vec
vec_size = 300
model_word2vec = Word2Vec(sentences=data_word2vec['tokens'], vector_size=vec_size, window=5, min_count=1, workers=4,seed=42)

In [None]:
# Check the vocabulary learned by the model
#print(model_word2vec.wv.index_to_key)

Step 3: Create a Dictionary of words

In [None]:
# Create a list of all unique words the model has learned for embedding
words = list(model_word2vec.wv.index_to_key)
word_vectors = {word: model_word2vec.wv[word].tolist() for word in words}

Step 4: Average Vector Calculation

In [None]:
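# Average the word vectors to get a document encoding: all in-vocabulary words of a news article are averaged into a single fixed-size vector.
# The text is tokenized with simple_preprocess, the same tokenizer used to train Word2Vec, so that the tokens match the model vocabulary.
def average_vector(doc):
    vectors = [word_vectors[word] for word in simple_preprocess(doc) if word in word_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(vec_size)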
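
To make the averaging step concrete, here is a minimal sketch with a made-up vocabulary. Everything in it (`toy_vectors`, `toy_dim`, `toy_average_vector`) is hypothetical and only mirrors the idea of `average_vector`; real Word2Vec vectors would have `vec_size` dimensions. It also shows why the tokenization must match the vocabulary: tokens that keep their capital letters or punctuation are simply not found.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (real ones would have vec_size dimensions)
toy_vectors = {
    "stock": np.array([1.0, 0.0, 2.0]),
    "rises": np.array([3.0, 2.0, 0.0]),
}
toy_dim = 3

def toy_average_vector(tokens):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.mean(known, axis=0) if known else np.zeros(toy_dim)

# Whitespace splitting keeps case and punctuation, so both lookups miss
raw_tokens = "Stock rises!".split()   # ['Stock', 'rises!']
# Tokens as a lowercasing, punctuation-stripping tokenizer would produce them
clean_tokens = ["stock", "rises"]

print(toy_average_vector(raw_tokens))    # [0. 0. 0.]
print(toy_average_vector(clean_tokens))  # [2. 1. 1.]
```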

In [None]:
# Creating a dataframe of the vectorized articles
X_train_wv = pd.DataFrame(X_train['News'].apply(average_vector).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
X_test_wv = pd.DataFrame(X_test['News'].apply(average_vector).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])



In [None]:
# Check the shape of the data
print(X_train_wv.shape)
print(X_test_wv.shape)

### **Text embedding using Sentence Transformer**

**Define the Model**

In [None]:
# Define the model
# model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

**Encoding the data**

In [None]:
# Encoding the dataset
X_train_st = model.encode(X_train['News'].values, show_progress_bar=True)
X_test_st = model.encode(X_test['News'].values, show_progress_bar=True)

In [None]:
# Check the shape of the embedding arrays
print(X_train_st.shape)
print(X_test_st.shape)

## **Sentiment Analysis**

#### **Model Evaluation Criterion**

##### **Utility Functions**

In [None]:
def plot_confusion_matrix(actual, predicted):
    """
    Plot a confusion matrix to visualize the performance of a classification model.

    Parameters:
    actual (array-like): The true labels.
    predicted (array-like): The predicted labels from the model.

    Returns:
    None: Displays the confusion matrix plot.
    """

    # Use the sorted labels present in either array, so that the tick labels
    # always match the row/column order of the matrix
    label_list = np.unique(np.concatenate([np.asarray(actual), np.asarray(predicted)]))

    # Compute the confusion matrix: rows are actual labels, columns are predicted labels
    cm = confusion_matrix(actual, predicted, labels=label_list)

    # Create a new figure with a specified size
    plt.figure(figsize=(5, 4))

    # Plot the confusion matrix using a heatmap with annotations
    sns.heatmap(cm, annot=True, fmt='.0f', cmap='Blues', xticklabels=label_list, yticklabels=label_list)

    # Label for the y-axis
    plt.ylabel('Actual')

    # Label for the x-axis
    plt.xlabel('Predicted')

    # Title of the plot
    plt.title('Confusion Matrix')

    # Display the plot
    plt.show()
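
When reading these plots, it helps to know how `sklearn.metrics.confusion_matrix` lays out its result: rows correspond to the true labels, columns to the predicted labels, and both follow the order given in `labels` (sorted order if `labels` is omitted). The sketch below uses made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) to show this layout.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: -1 = negative, 0 = neutral, 1 = positive
y_true_toy = [-1, -1, 0, 0, 0, 1]
y_pred_toy = [-1, 0, 0, 0, 1, 1]

# Row i / column j follow this order: first true labels, then predicted labels
labels = [-1, 0, 1]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm_toy)
# [[1 1 0]   true -1: one predicted -1, one predicted 0
#  [0 2 1]   true  0: two predicted 0, one predicted 1
#  [0 0 1]]  true  1: one predicted 1
```

Tick labels on a heatmap of this matrix therefore have to be listed in the same order as `labels`.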

In [None]:
def model_performance_classification_sklearn(predicted, target):
    """
    Compute various performance metrics for a classification model using sklearn.

    Parameters:
    predicted (array-like): The labels predicted by the model.
    target (array-like): The true labels for the dependent variable.

    Returns:
    pandas.DataFrame: A DataFrame containing the computed metrics (Accuracy, Recall, Precision, F1-score).
    """

    # Compute Accuracy
    acc = accuracy_score(target, predicted)
    # Compute Recall (weighted by the number of true samples per class)
    recall = recall_score(target, predicted, average='weighted')
    # Compute Precision (weighted by the number of true samples per class)
    precision = precision_score(target, predicted, average='weighted')
    # Compute F1-score (weighted by the number of true samples per class)
    f1 = f1_score(target, predicted, average='weighted')

    # Create a DataFrame to store the computed metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": [acc],
            "Recall": [recall],
            "Precision": [precision],
            "F1": [f1],
        }
    )
    # Return the DataFrame with the metrics
    return df_perf
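
The metrics above use `average='weighted'`, which weights each class by its number of true samples. On imbalanced data this can hide a class that is never predicted. The sketch below, on made-up labels (`y_true_imb` and `y_pred_imb` are hypothetical), compares weighted and macro averaging for a model that always predicts the majority class.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced data: 8 samples of class 0, 2 samples of class 1
y_true_imb = [0] * 8 + [1] * 2
# A degenerate model that always predicts the majority class
y_pred_imb = [0] * 10

# Weighted average: per-class scores weighted by class frequency
recall_weighted = recall_score(y_true_imb, y_pred_imb, average="weighted", zero_division=0)
precision_weighted = precision_score(y_true_imb, y_pred_imb, average="weighted", zero_division=0)

# Macro average: every class counts equally, so the missed class is visible
recall_macro = recall_score(y_true_imb, y_pred_imb, average="macro", zero_division=0)

print(recall_weighted)     # 0.8 (equals accuracy for weighted recall)
print(recall_macro)        # 0.5
print(precision_weighted)  # 0.64 (up to floating-point rounding)
```

Reporting a macro-averaged score next to the weighted one is one way to see whether minority classes are being ignored.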

### **Build Random Forest Models using different text embeddings**


### **Build Random Forest Model using text embedding from Word2Vec**

In [None]:
# Build the Random Forest Model
# n_estimators: the number of trees in the forest
# max_depth: the maximum depth of each tree
# random_state: seed for random number generator
rf_wv = RandomForestClassifier(n_estimators=300,max_depth=3,random_state=42)
# Fit the model on training data
rf_wv.fit(X_train_wv, y_train)

**Check Training and Test Performance**

In [None]:
# Predicting on train data
y_pred_train = rf_wv.predict(X_train_wv)

# Predicting on test data
y_pred_test = rf_wv.predict(X_test_wv)

**Confusion Matrix**

In [None]:
# Plot confusion matrix of train data
plot_confusion_matrix(y_train,y_pred_train)

* The model overpredicts positive sentiment.
* 47 actual positive items were classified as neutral.
* 60 actual positive items were classified as negative.
* Very few negative items were correctly classified as negative. This suggests there are too few negative examples in the training data.



In [None]:
# Plot confusion matrix of test data
plot_confusion_matrix(y_test,y_pred_test)

* There are no predictions for negative sentiment, so the model fails to identify negative news on unseen data.
* 16 positive items were misclassified as neutral.
* 28 positive items were correctly classified as positive.
* 13 positive items were misclassified as negative.


**Classification Report**

In [None]:
# Calculate the key evaluation metrics on train data
rf_wv_train_performance = model_performance_classification_sklearn(y_pred_train, y_train)
print("rf_wv_Training_Performance:\n",rf_wv_train_performance)

# Calculate the key evaluation metrics on test data
rf_wv_test_performance = model_performance_classification_sklearn(y_pred_test, y_test)
print("rf_wv_Test_Performance:\n",rf_wv_test_performance)

**Training Performance:** Accuracy is 60% and precision is 52%.  
**Test Performance:** Accuracy is 46% and precision is only 29%, so many test predictions are wrong. The gap between training and test accuracy is moderate, which suggests limited overfitting at this depth.

**Tuning applied** (`n_estimators = 300` in all runs)

| max_depth | Train Accuracy | Train Precision | Train F1 | Test Accuracy | Test Precision | Test F1 |
|---|---|---|---|---|---|---|
| 3 | 60% | 52% | 0.51 | 46% | 29% | 0.34 |
| 5 | 89% | 91% | 0.89 | 43% | 27% | 0.33 |
| 7 | 100% | 100% | 1.00 | 46% | 32% | 0.36 |

Based on the tuning above, `max_depth = 3` with `n_estimators = 300` is the most balanced setting: deeper trees push training accuracy towards 100% without improving test accuracy, so they only add overfitting.
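
A depth comparison like this can be produced with a small loop instead of re-running the cell by hand. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the embedding features, a random rather than chronological split, and fewer trees to keep it fast; the names `X_syn`, `y_syn` and `sweep` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the embedding features (an assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

rows = []
for depth in [3, 5, 7]:
    # Fewer trees than in the notebook, only to keep the sketch fast
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    rf.fit(X_tr, y_tr)
    rows.append({
        "max_depth": depth,
        "train_acc": accuracy_score(y_tr, rf.predict(X_tr)),
        "test_acc": accuracy_score(y_te, rf.predict(X_te)),
        "test_f1": f1_score(y_te, rf.predict(X_te), average="weighted"),
    })

# One row per depth, so train and test scores can be compared side by side
sweep = pd.DataFrame(rows)
print(sweep)
```

Selecting the depth on the test set, as done here and in the table, makes the test score optimistic; a separate validation period would avoid that.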

### **Build Random Forest Model using text embedding from Sentence Transformer**


In [None]:
# Build the model
rf_st = RandomForestClassifier(n_estimators=300,max_depth=5,random_state=42)
# Fit the model on training data
rf_st.fit(X_train_st, y_train)

**Check Training and Test Performance**

In [None]:
# Predict on train data
y_pred_train = rf_st.predict(X_train_st)

# Predict on test data
y_pred_test = rf_st.predict(X_test_st)

**Confusion Matrix**

In [None]:
# Plot confusion matrix of train data
plot_confusion_matrix(y_train,y_pred_train)


* Neutral and negative sentiments are classified with high accuracy on the training data.
* Most positive sentiments are also correctly identified as positive.


In [None]:
# Plot confusion matrix of test data
plot_confusion_matrix(y_test,y_pred_test)

* Neutral items are classified correctly as neutral, but there are only very few of them.
* No negative sentiments are predicted.
* Of the 61 positive items, only 32 were correctly identified, so the model struggles with positive sentiment.
* This might be due to class imbalance, i.e. not having enough data in all classes.



**Classification Report**

In [None]:
# Calculate the key evaluation metrics on train data
rf_st_train_performance = model_performance_classification_sklearn(y_pred_train, y_train)
print("rf_st_Training_Performance:\n",rf_st_train_performance)

# Calculate the key evaluation metrics on test data
rf_st_test_performance = model_performance_classification_sklearn(y_pred_test, y_test)
print("rf_st_Test_Performance:\n",rf_st_test_performance)

**Training Performance**: Accuracy is 98% and precision is 98%.

**Test Performance**: Accuracy is 54% and precision is 54%. The large gap between training and test scores shows that the model overfits the training data, even though its test accuracy is higher than that of the Word2Vec-based model.

**Tuning applied** (`n_estimators = 300` in all runs)

| max_depth | Train Accuracy | Train Precision | Train F1 | Test Accuracy | Test Precision | Test F1 |
|---|---|---|---|---|---|---|
| 3 | 69% | 80% | 0.65 | 52% | 53% | 0.38 |
| 5 | 98% | 98% | 0.98 | 54% | 54% | 0.41 |
| 7 | 100% | 100% | 1.00 | 54% | 54% | 0.41 |

With `max_depth = 10` the model was also overfitting (no figures recorded).

### **Build Neural Network Models using different text embeddings**

### **Build Neural Network Model using text embedding from Word2Vec**

In [None]:
# Map the labels
label_mapping = {1: 2, 0: 1, -1: 0}

y_train_mapped_wv = y_train.map(label_mapping)
y_test_mapped_wv = y_test.map(label_mapping)

# Convert features Dataframe to a numpy array
X_train_wv_np = np.array(X_train_wv)
X_test_wv_np = np.array(X_test_wv)
y_train_mapped_wv = np.array(y_train_mapped_wv)
y_test_mapped_wv = np.array(y_test_mapped_wv)

In [None]:
# Check the shape after converting to array
print(X_train_wv_np.shape)
print(X_test_wv_np.shape)
print(y_train_mapped_wv.shape)
print(y_test_mapped_wv.shape)

In [None]:
# Check the number of features in the input vector
X_train_wv_np.shape[1]

In [None]:
import gc

# Clear tensorflow/keras previous session
tf.keras.backend.clear_session()
gc.collect()

# Define Model
model_wv = Sequential()

# Input Layer: Input layer units = number of features in your input vector = vec_size
model_wv.add(Dense(128, input_dim=vec_size, activation='relu'))

# Dropout layer: To reduce overfitting set the dropout rate between .2 - .5
model_wv.add(Dropout(0.3))

# Hidden Layer 1
model_wv.add(Dense(64, activation='relu'))
# Dropout layer: To reduce overfitting set the dropout rate between .2 - .5
model_wv.add(Dropout(0.3))

# Hidden Layer 2
model_wv.add(Dense(32, activation='relu'))

# Output Layer:
model_wv.add(Dense(3, activation='softmax'))

# Compile the model
model_wv.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
model_wv.summary()

In [None]:
# Fit model on training data
# - epochs: number of times the model will see the entire training data (typically 5 to 50)
# - batch_size: number of samples the model will process before updating weights (commonly 16 to 128)
history = model_wv.fit(
    X_train_wv_np, y_train_mapped_wv,
    validation_data=(X_test_wv_np, y_test_mapped_wv),
    epochs=10,
    batch_size=32
)

**Check Training and Test Performance**

In [None]:
# Predict probabilities on training data
y_train_pred_probs = model_wv.predict(X_train_wv_np)

# Convert probabilities to labels
y_train_preds = tf.argmax(y_train_pred_probs, axis=1).numpy()

In [None]:
# Predict probabilities on test data
y_test_pred_probs = model_wv.predict(X_test_wv_np)

# Convert probabilities to labels
y_test_preds = tf.argmax(y_test_pred_probs, axis=1).numpy()

In [None]:
# Convert back to [-1, 0, 1] to match utility function expectations
label_mapping = {2: 1, 1: 0, 0: -1}

y_train_preds = np.array([label_mapping[index] for index in y_train_preds])
y_test_preds = np.array([label_mapping[index] for index in y_test_preds])
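
The path from softmax probabilities back to sentiment labels can be traced on a small example. The sketch below uses made-up probabilities; `to_index`, `to_label` and the toy arrays are hypothetical names that mirror the two `label_mapping` dictionaries above. It also shows what happens when predictions in the `[-1, 0, 1]` encoding are compared with true labels in the `[0, 1, 2]` encoding: the score becomes meaningless, so both sides must use the same encoding.

```python
import numpy as np

# Forward and reverse mapping, mirroring the label_mapping dictionaries
to_index = {-1: 0, 0: 1, 1: 2}
to_label = {v: k for k, v in to_index.items()}

# Hypothetical true labels in both encodings
y_true_labels = np.array([-1, 0, 0, 1])
y_true_index = np.array([to_index[v] for v in y_true_labels])  # [0, 1, 1, 2]

# Hypothetical softmax outputs, one row per sample, one column per class index
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])

# argmax picks the most probable class index per row
pred_index = probs.argmax(axis=1)                          # [0, 1, 1, 2]
# Map the class indices back to the original sentiment labels
pred_labels = np.array([to_label[i] for i in pred_index])  # [-1, 0, 0, 1]

# Same encoding on both sides: every prediction is counted as correct
acc_same = (pred_labels == y_true_labels).mean()
# Mixed encodings: the same perfect predictions score zero
acc_mixed = (pred_labels == y_true_index).mean()

print(acc_same)   # 1.0
print(acc_mixed)  # 0.0
```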

**Confusion Matrix**

In [None]:
# Plot confusion matrix for train data
# Predictions were mapped back to [-1, 0, 1], so compare them with the original labels
plot_confusion_matrix(y_train, y_train_preds)

* Only neutral sentiment is predicted correctly.
* The model fails to predict positive and negative sentiment.

In [None]:
# Plot confusion matrix for test data
# Predictions were mapped back to [-1, 0, 1], so compare them with the original labels
plot_confusion_matrix(y_test, y_test_preds)

* Only neutral sentiment is predicted correctly on the test data: the model classifies everything as neutral.
* The model fails to predict positive and negative sentiment on the test data.

**Classification Report**

In [None]:
# Calculate the key evaluation metrics on train data
# Predictions are in the [-1, 0, 1] encoding, so the original labels are used as the target
NN_train_wv_performance = model_performance_classification_sklearn(y_train_preds, y_train)
print("NN_train_wv_performance:\n",NN_train_wv_performance)

# Calculate the key evaluation metrics on test data
NN_test_wv_performance = model_performance_classification_sklearn(y_test_preds, y_test)
print("NN_test_wv_performance:\n",NN_test_wv_performance)

**Training data**: In the recorded run, the model performs poorly on the training data and struggles to learn meaningful patterns.
* Accuracy: 29%
* Precision: 29%
* F1-Score: 0.13

**Test data**: In the recorded run, the model also performs poorly on the test data.
* Accuracy: 27%
* Precision: 27%
* F1-Score: 0.11

### **Build Neural Network Model using text embedding from Sentence Transformer**

In [None]:
# Map labels
label_mapping = {1:2,0:1,-1:0}

y_train_mapped_st = y_train.map(label_mapping)
y_test_mapped_st = y_test.map(label_mapping)

# Convert features Dataframe to a numpy array
X_train_st_np = np.array(X_train_st)
X_test_st_np = np.array(X_test_st)
y_train_mapped_st = np.array(y_train_mapped_st)
y_test_mapped_st = np.array(y_test_mapped_st)

In [None]:
from collections import Counter

# Count the frequency of each label
label_counts = Counter(y_train_mapped_st)

# Print the results
for label, count in label_counts.items():
    print(f"Label {label}: {count} occurrences")


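This shows that the dataset is imbalanced. Keep in mind that these are the mapped labels (0 = negative, 1 = neutral, 2 = positive): a majority for label 1 means that neutral news outnumber the other two sentiments, in line with the roughly 50% neutral share seen in the EDA.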

In [None]:
# Check shape after converting to array
print(X_train_st_np.shape)
print(X_test_st_np.shape)
print(y_train_mapped_st.shape)
print(y_test_mapped_st.shape)

In [None]:
# Check the number of features in the input vector
X_train_st_np.shape[1]

In [None]:
from sklearn.utils.class_weight import compute_class_weight
# Compute balanced class weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train_mapped_st),
    y=y_train_mapped_st
)
# Convert to Dict
Class_weight_dict = dict(enumerate(class_weights))
Class_weight_dict
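
The `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rare classes get larger weights. The sketch below checks this formula against `compute_class_weight` on made-up labels; `y_toy`, `manual` and `sk` are hypothetical names.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical mapped labels: 2 samples of class 0, 6 of class 1, 2 of class 2
y_toy = np.array([0, 0, 1, 1, 1, 1, 1, 1, 2, 2])

classes = np.unique(y_toy)
counts = np.bincount(y_toy)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual = len(y_toy) / (len(classes) * counts)

# Same weights computed by scikit-learn
sk = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print(manual)  # approximately [1.667, 0.556, 1.667]
print(sk)
```

Because the class indices are 0, 1 and 2, `dict(enumerate(...))` yields exactly the index-to-weight mapping that Keras expects for `class_weight`.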

In [None]:
import gc

# Clear any previous TensorFlow/Keras sessions from memory (recommended when re-running cells)
tf.keras.backend.clear_session()
gc.collect()

# Define the model
model_st = Sequential()

# Input layer: input_dim is the size of the input vector (768 for all-mpnet-base-v2 embeddings)
model_st.add(Dense(256, input_dim=768, activation='relu'))

# Dropout layer: To reduce overfitting set the dropout rate between .2 - .5
model_st.add(Dropout(0.4))

# Hidden Layer 1
model_st.add(Dense(128, activation='relu'))
# Dropout layer: To reduce overfitting set the dropout rate between .2 - .5
model_st.add(Dropout(0.3))

# Hidden Layer 2 (disabled)
#model_st.add(Dense(32, activation='relu'))

# Output Layer:
model_st.add(Dense(3, activation='softmax'))

# Compile the model
model_st.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
model_st.summary()

In [None]:
# Fit model on training data
# Class weights are applied to address the class imbalance in the data
history = model_st.fit(
    X_train_st_np, y_train_mapped_st,
    validation_data=(X_test_st_np, y_test_mapped_st),
    epochs=10,
    batch_size=32,
    class_weight=Class_weight_dict
)

**Check Training and Test Performance**

In [None]:
# Predict probabilities on training data
y_train_pred_probs = model_st.predict(X_train_st_np)

# Convert probabilities to labels
y_train_preds = tf.argmax(y_train_pred_probs, axis=1).numpy()

In [None]:
# Predict probabilities on test data
y_test_pred_probs = model_st.predict(X_test_st_np)

# Convert probabilities to labels
y_test_preds = tf.argmax(y_test_pred_probs, axis=1).numpy()

In [None]:
# Predictions are kept in the 0/1/2 encoding here (0 = negative, 1 = neutral, 2 = positive),
# so that they can be compared directly with y_train_mapped_st and y_test_mapped_st.



In [None]:

from collections import Counter

# Count how often each class index is predicted on the training data
# (indices follow label_mapping: 0 = negative, 1 = neutral, 2 = positive)
label_counts = Counter(y_train_preds)

# Print the results
for label, count in label_counts.items():
    print(f"Label {label}: {count} occurrences")


**Confusion Matrix**

In [None]:
# Plot confusion matrix for train data (both arrays use the 0/1/2 encoding)
plot_confusion_matrix(y_train_mapped_st,y_train_preds)

* Most predictions fall on the diagonal, i.e. they are correct.
* The few misclassifications occur mainly between neutral and negative.
* The model classifies the training data very well.

In [None]:
# Plot confusion matrix for test data
plot_confusion_matrix(y_test_mapped_st,y_test_preds)


* The model predicts positive sentiment best, but some positive items are still misclassified into the other categories.
* Negative sentiment has many misclassifications.
* The results reflect the class imbalance in the data.

**Classification Report**

In [None]:
# Calculate the key evaluation metrics on train data
NN_train_st_performance = model_performance_classification_sklearn(y_train_preds, y_train_mapped_st)
print("NN_train_st_performance:\n",NN_train_st_performance)

# Calculate the key evaluation metrics on test data
NN_test_st_performance = model_performance_classification_sklearn(y_test_preds, y_test_mapped_st)
print("NN_test_st_performance:\n",NN_test_st_performance)

**Training data**: The model fits the training data almost perfectly.
Accuracy: 100%, Precision: 100%, F1-Score: 1.00

**Test data**: Performance drops sharply on the test data, which indicates overfitting.
Accuracy: 43%, Precision: 43%, F1-Score: 0.39

### **Model Performance Summary and Final Model Selection**

In [None]:
# Concat the training performance matrix from different models
# Transpose performance dataframe to make metric as rows
model_train_performance = pd.concat(
                            [
                            rf_wv_train_performance.T, # Random Forest using Word2Vec embeddings
                            rf_st_train_performance.T, # Random Forest with Sentence Transformer embeddings
                            NN_train_wv_performance.T, # Neural Network with Word2Vec embeddings
                            NN_train_st_performance.T  # Neural Network with Sentence Transformer embeddings
                            ],axis=1)
# Assign columns names for the performance matrix
model_train_performance.columns = [
                            'rf_wv_Training_Performance',
                            'rf_st_Training_Performance',
                            'NN_wv_Training_Performance',
                            'NN_st_Training_Performance'
                          ]
# Print the training performance matrix
print("model_train_performance:\n",model_train_performance)


**rf_st_Training_Performance** and **NN_st_Training_Performance**:
- Have the highest training scores, with accuracy, precision and recall close to 100%.
- Scores this high on training data are a warning sign of overfitting and must be checked against the test scores.

**rf_wv_Training_Performance**
- Performs moderately, with accuracy 60%, recall 60% and precision 52%.
- Its lower training scores leave less room for a gap between training and test performance.

**NN_wv_Training_Performance**
- Poor performance on training data.
- The model struggles to learn.


In [None]:
# Concat the testing performance matrix from different models
# Transpose performance dataframe to make metric as rows
model_test_performance = pd.concat(
                            [
                            rf_wv_test_performance.T, # Random Forest using Word2Vec embeddings
                            rf_st_test_performance.T, # Random Forest with Sentence Transformer embeddings
                            NN_test_wv_performance.T, # Neural Network with Word2Vec embeddings
                            NN_test_st_performance.T  # Neural Network with Sentence Transformer embeddings
                            ],axis=1)
# Assign columns names for the performance matrix
model_test_performance.columns = [
                            'rf_wv_Test_Performance',
                            'rf_st_Test_Performance',
                            'NN_wv_Test_Performance',
                            'NN_st_Test_Performance'
                          ]
# Print the testing performance matrix
print("model_test_performance:\n",model_test_performance)

**rf_st_Test_Performance**:
- Has the best overall test performance.
- Accuracy, precision and recall are all 54%.
- The drop from 98% training accuracy shows that the model overfits, even though it still generalizes best among the four models.

**rf_wv_Test_Performance**
- Performs moderately, with accuracy 46% and recall 46%.
- Precision is only 29%.

**NN_st_Test_Performance**
- Performs better than the NN Word2Vec model.
- Accuracy 43%, precision 43% and F1 score 0.39, against 100% on training data, indicate overfitting rather than underfitting.

**NN_wv_Test_Performance**
- Poor performance on test data as well.
- The model struggles to learn, i.e. it underfits.
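
One way to compare the models on generalization is to look at the gap between training and test accuracy. The sketch below simply tabulates the accuracy figures quoted in the notes of this notebook and computes that gap; the `reported` dataframe and its column names are hypothetical.

```python
import pandas as pd

# Accuracy figures as quoted in the notes above (not recomputed here)
reported = pd.DataFrame(
    {
        "train_acc": [0.60, 0.98, 0.29, 1.00],
        "test_acc": [0.46, 0.54, 0.27, 0.43],
    },
    index=["rf_wv", "rf_st", "NN_wv", "NN_st"],
)

# A large positive gap means the model fits the training data much better
# than unseen data, which is the usual symptom of overfitting
reported["gap"] = reported["train_acc"] - reported["test_acc"]

print(reported.sort_values("gap", ascending=False))
```

Read this way, a small gap combined with low scores (as for `NN_wv`) points to underfitting, while a large gap (as for `NN_st` and `rf_st`) points to overfitting.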

## **Conclusions and Recommendations**

**Conclusion**
- **1. Best Performing Model (Test Set):**
  * **Random Forest with Sentence Transformer Embeddings**

    - Test Accuracy: 54%
    - Test F1 Score: 0.41
    - This model performs best overall on the test set, although the gap to its 98% training accuracy shows that it overfits.
- **2. 2nd Best Performing Model:**
  * **Random Forest with Word2Vec**
    -  60% training accuracy vs 46% test accuracy

- **3. Neural Network Models Underperformed**
  - Although NN_ST reaches very high accuracy on training data, its test results indicate overfitting.
  - NN_W2V shows extremely poor accuracy and F1 scores, even on training data. This indicates:
    - Not enough data to effectively train a neural network (deep models usually need more samples)
    - Possibly poor hyperparameter settings

**Recommendation**
- Use Random Forest + Sentence Transformer Embeddings as the primary model for predicting market sentiment for the startup, because it has the best test accuracy and F1 score of the four models. Its gap between training and test performance should be reduced before relying on it.

**Steps taken to improve accuracy**:
- Attempted to tune the hyperparameters of Random Forest:
  * `max_depth` up to 7 was tried, but the model was overfitting.
  * `n_estimators` up to 300 was tried. It could be increased further, for example up to 500.
- Attempted to improve neural network performance by using class weights to counter the class imbalance.






