# Stock Sentiment Analysis Project

## Overview

This project focuses on predicting the impact of news headlines on stock prices for specific companies. The dataset contains historical top headlines related to these companies, along with labels indicating whether the stock price had a positive or negative impact. The primary goal is to develop a machine learning model that can analyze news sentiment and make predictions about the likely effect on stock prices.

## Dataset

The dataset consists of the following columns:

- **Top1, Top2, ...:** Headlines corresponding to specific time periods.
- **Label (0 or 1):** Binary labels indicating a negative (0) or positive (1) impact on stock prices.

## Project Steps

1. **Data Exploration and Cleaning:**
   - Check for missing values and duplicates.
   - Explore the distribution of labels and headlines.

2. **Text Preprocessing:**
   - Tokenize, lemmatize, or remove stop words from the headlines.

3. **Exploratory Data Analysis (EDA):**
   - Analyze patterns and correlations in the data.

4. **Feature Engineering:**
   - Create additional features or transform existing ones.

5. **Model Training:**
   - Utilize machine learning algorithms for sentiment analysis.

6. **Model Evaluation:**
   - Assess model performance using relevant metrics.

## Implementation

The project will be implemented using [programming language/framework] and relevant libraries for natural language processing and machine learning.

## Conclusion

The success of the project depends on the quality of the dataset, effective text preprocessing, and the selection of an appropriate machine learning model. Regular updates and model refinement may be necessary to ensure continued accuracy in predicting stock sentiment.

---

*Note: Customize the placeholders like [programming language/framework] with the actual tools you plan to use in your project.*


### **Import Required Liabraries**

In [16]:
# Data Analysis and Manipulation
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Natural Language Processing (NLP)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Text Preprocessing
import re
import numpy as np


### **First Look of Data**

In [17]:
# Reading the CSV file named "data.csv" with ISO-8859-1 encoding
df = pd.read_csv('./data.csv', encoding='ISO-8859-1')

# Displaying the first few rows of the DataFrame to get an overview
df.head()


Unnamed: 0,Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
0,2000-01-03,0,A 'hindrance to operations': extracts from the...,Scorecard,Hughes' instant hit buoys Blues,Jack gets his skates on at ice-cold Alex,Chaos as Maracana builds up for United,Depleted Leicester prevail as Elliott spoils E...,Hungry Spurs sense rich pickings,Gunners so wide of an easy target,...,Flintoff injury piles on woe for England,Hunters threaten Jospin with new battle of the...,Kohl's successor drawn into scandal,The difference between men and women,"Sara Denver, nurse turned solicitor",Diana's landmine crusade put Tories in a panic,Yeltsin's resignation caught opposition flat-f...,Russian roulette,Sold out,Recovering a title
1,2000-01-04,0,Scorecard,The best lake scene,Leader: German sleaze inquiry,"Cheerio, boyo",The main recommendations,Has Cubie killed fees?,Has Cubie killed fees?,Has Cubie killed fees?,...,On the critical list,The timing of their lives,Dear doctor,Irish court halts IRA man's extradition to Nor...,Burundi peace initiative fades after rebels re...,PE points the way forward to the ECB,Campaigners keep up pressure on Nazi war crime...,Jane Ratcliffe,Yet more things you wouldn't know without the ...,Millennium bug fails to bite
2,2000-01-05,0,Coventry caught on counter by Flo,United's rivals on the road to Rio,Thatcher issues defence before trial by video,Police help Smith lay down the law at Everton,Tale of Trautmann bears two more retellings,England on the rack,Pakistan retaliate with call for video of Walsh,Cullinan continues his Cape monopoly,...,South Melbourne (Australia),Necaxa (Mexico),Real Madrid (Spain),Raja Casablanca (Morocco),Corinthians (Brazil),Tony's pet project,Al Nassr (Saudi Arabia),Ideal Holmes show,Pinochet leaves hospital after tests,Useful links
3,2000-01-06,1,Pilgrim knows how to progress,Thatcher facing ban,McIlroy calls for Irish fighting spirit,Leicester bin stadium blueprint,United braced for Mexican wave,"Auntie back in fashion, even if the dress look...",Shoaib appeal goes to the top,Hussain hurt by 'shambles' but lays blame on e...,...,Putin admits Yeltsin quit to give him a head s...,BBC worst hit as digital TV begins to bite,How much can you pay for...,Christmas glitches,"Upending a table, Chopping a line and Scoring ...","Scientific evidence 'unreliable', defence claims",Fusco wins judicial review in extradition case,Rebels thwart Russian advance,Blair orders shake-up of failing NHS,Lessons of law's hard heart
4,2000-01-07,1,Hitches and Horlocks,Beckham off but United survive,Breast cancer screening,Alan Parker,Guardian readers: are you all whingers?,Hollywood Beyond,Ashes and diamonds,Whingers - a formidable minority,...,Most everywhere: UDIs,Most wanted: Chloe lunettes,Return of the cane 'completely off the agenda',From Sleepy Hollow to Greeneland,Blunkett outlines vision for over 11s,"Embattled Dobson attacks 'play now, pay later'...",Doom and the Dome,What is the north-south divide?,Aitken released from jail,Gone aloft


##

In [18]:
# Selecting rows where the 'Date' is before January 1, 2015, for training set
train = df[df['Date'] < '20150101']

# Selecting rows where the 'Date' is after December 31, 2014, for the test set
test = df[df['Date'] > '20141231']

In [19]:
import pandas as pd

def preprocess_headlines(df):
    """
    Preprocesses headlines in a DataFrame by removing punctuations, converting to lowercase,
    and joining into a list of strings.

    Parameters:
    - df: DataFrame containing headlines.

    Returns:
    - List of preprocessed and joined headlines.
    """

    # Extracting the relevant columns (headlines) from the DataFrame
    data = df.iloc[:, 2:27]

    # Removing punctuations from the headlines using regex
    data.replace("[^a-zA-Z]", " ", regex=True, inplace=True)

    # Generating a list of integers from 0 to 24
    list1 = [i for i in range(25)]

    # Creating a new list of strings for column names
    new_Index = [str(i) for i in list1]

    # Renaming column names for ease of access
    data.columns = new_Index

    # Converting headlines to lowercase
    for index in new_Index:
        data[index] = data[index].str.lower()

    # Joining the words in each row into a single string
    headlines = [' '.join(str(x) for x in data.iloc[row, 0:25]) for row in range(len(data.index))]

    return headlines

# Example usage with the 'train' DataFrame
headlines = preprocess_headlines(train)

# Displaying the first 5 preprocessed and joined headlines from the training set
print(headlines[0])


a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title


In [20]:
headlines[0]

'a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'

### **Models Training**

In [22]:
# count ectorizer
from sklearn.feature_extraction.text import CountVectorizer
# Random forest
from sklearn.ensemble import RandomForestClassifier

In [24]:
# Bag of words
cv=CountVectorizer(ngram_range=(2,2))
traindataset=cv.fit_transform(headlines)

In [25]:
traindataset[0]

<1x584289 sparse matrix of type '<class 'numpy.int64'>'
	with 138 stored elements in Compressed Sparse Row format>

In [28]:
rfc=RandomForestClassifier(n_estimators=200,criterion='entropy')
rfc=rfc.fit(traindataset,train['Label'])

In [29]:
test_transform=[]
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
test_dataset=cv.transform(test_transform)
predictions=rfc.predict(test_dataset)

In [30]:
predictions

array([1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,

In [31]:
from sklearn.metrics import confusion_matrix,confusion_matrix,accuracy_score

In [33]:
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(accuracy_score)
report=classification_report(test['Label'],predictions)
print(report)

[[139  47]
 [ 10 182]]
<function accuracy_score at 0x000001A98022D580>
              precision    recall  f1-score   support

           0       0.93      0.75      0.83       186
           1       0.79      0.95      0.86       192

    accuracy                           0.85       378
   macro avg       0.86      0.85      0.85       378
weighted avg       0.86      0.85      0.85       378



In [34]:
headlines[0]

'a  hindrance to operations   extracts from the leaked reports scorecard hughes  instant hit buoys blues jack gets his skates on at ice cold alex chaos as maracana builds up for united depleted leicester prevail as elliott spoils everton s party hungry spurs sense rich pickings gunners so wide of an easy target derby raise a glass to strupar s debut double southgate strikes  leeds pay the penalty hammers hand robson a youthful lesson saints party like it s      wear wolves have turned into lambs stump mike catches testy gough s taunt langer escapes to hit     flintoff injury piles on woe for england hunters threaten jospin with new battle of the somme kohl s successor drawn into scandal the difference between men and women sara denver  nurse turned solicitor diana s landmine crusade put tories in a panic yeltsin s resignation caught opposition flat footed russian roulette sold out recovering a title'

Same Model with tfidf

In [35]:
#Doing same  by tfidf
# Bag of words
tfidf=TfidfVectorizer(ngram_range=(2,2))
traindataset=tfidf.fit_transform(headlines)

In [37]:
rfc=RandomForestClassifier(n_estimators=200,criterion='entropy')
rfc=rfc.fit(traindataset,train['Label'])

In [38]:
test_transform=[]
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
test_dataset=tfidf.transform(test_transform)
predictions=rfc.predict(test_dataset)

In [39]:
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(accuracy_score)
report=classification_report(test['Label'],predictions)
print(report)

[[146  40]
 [ 10 182]]
<function accuracy_score at 0x000001A98022D580>
              precision    recall  f1-score   support

           0       0.94      0.78      0.85       186
           1       0.82      0.95      0.88       192

    accuracy                           0.87       378
   macro avg       0.88      0.87      0.87       378
weighted avg       0.88      0.87      0.87       378



using naive bayes Model

In [41]:
from sklearn.naive_bayes import MultinomialNB

In [42]:
naive=MultinomialNB()
naive.fit(traindataset,train['Label'])

In [43]:
test_transform=[]
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
test_dataset=tfidf.transform(test_transform)
predictions=naive.fit(traindataset,train['Label']).predict(test_dataset)

In [44]:
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)
score=accuracy_score(test['Label'],predictions)
print(accuracy_score)
report=classification_report(test['Label'],predictions)
print(report)

[[130  56]
 [  0 192]]
<function accuracy_score at 0x000001A98022D580>
              precision    recall  f1-score   support

           0       1.00      0.70      0.82       186
           1       0.77      1.00      0.87       192

    accuracy                           0.85       378
   macro avg       0.89      0.85      0.85       378
weighted avg       0.89      0.85      0.85       378



# Stock Sentiment Analysis Conclusions

## Overview

This sentiment analysis project aimed to predict the impact of news headlines on stock prices using various natural language processing (NLP) techniques and machine learning classifiers. The dataset included labeled headlines, with "0" indicating a negative impact and "1" indicating a positive impact on stock prices.

## Bag of Words with Random Forest Classifier

- **Accuracy:** 85%
- The Bag of Words representation with a Random Forest Classifier showed solid performance.
- Precision, recall, and F1-score were reasonably balanced for both positive and negative sentiments.
- The model demonstrated effectiveness in capturing the sentiment patterns in the headlines.

## TF-IDF with Random Forest Classifier

- **Accuracy:** 87%
- TF-IDF vectorization outperformed Bag of Words, achieving higher accuracy.
- The model demonstrated improved precision, recall, and F1-score for both positive and negative sentiments.
- TF-IDF, capturing the importance of words, proved valuable in sentiment analysis.

## TF-IDF with Naive Bayes Classifier

- **Accuracy:** 85%
- Naive Bayes, a simpler classifier, performed competitively with Random Forest.
- The model showed high precision and recall, especially for positive sentiments.
- Naive Bayes demonstrated effectiveness despite its inherent assumptions.

## General Conclusions

- **TF-IDF Outperformed Bag of Words:** TF-IDF consistently outperformed the Bag of Words representation, emphasizing the importance of word weighting in sentiment analysis.
- **Classifier Impact:** The choice of classifier significantly influenced model performance. Random Forest and Naive Bayes proved effective for this task.
- **Applicability:** These models can be valuable tools for predicting sentiment based on news headlines, providing insights into potential stock price movements.
- **Further Exploration:** Experimentation with hyperparameter tuning, additional feature engineering, or exploring deep learning models could potentially enhance predictive capabilities.

In conclusion, this sentiment analysis project demonstrated the effectiveness of NLP techniques and machine learning models in predicting stock sentiment based on news headlines. The choice of vectorization technique and classifier played a crucial role in achieving accurate predictions. The insights derived from this analysis can be valuable for investors and financial analysts in making informed decisions. Further refinement and exploration of advanced techniques could lead to even more robust models.
