## NLP: Sentiment Analysis on the relation between News title and Share Prices

Objective: This project studies relation between stock price and news titles with the trend of Inc. Meta's share price and related news on internet at the same time.

Data source: 

Implementation steps:
1. Prepocessing data
2. Training data & Evaluation
- Utilize the nltk library's NaiveBayesClassifier to classify titles as positive or negative.
- Compare the label proportion with the historical shares prices change of the company during the same period
- Explore the correlation between the positive rate and the stock prices.
3. Finding

In [1]:
import nltk
import string
import pandas as pd

from nltk import NaiveBayesClassifier
from nltk.corpus import stopwords
import ipywidgets as widgets
from ipywidgets import interact_manual
import plotly.express as px

In [2]:
word_apple = ['Apple', 'iPhone', 'iPad' , 'MacBook', 'iMac', 'iOS', 'macOS'
              , 'App Store ','iTunes', 'iCloud', 'AirPods', 'HomePod', 'Siri', 'Tim Cook', 'Steve Jobs']
word_mcs = ['Microsoft', 'Windows', 'Office', 'Azure', 'Xbox', 'Surface', 'Outlook',
            'Excel', 'Word', 'PowerPoint', 'Visual Studio', 'SQL Server', 
            'SharePoint', 'OneDrive','Azure', 'MSN', 'Bill Gates']
word_meta = ['Meta', 'Facebook', 'Instagram', 'WhatsApp', 'Oculus', 
             'Virtual Reality', 'VR', 'Augmented Reality', 'AR', 'Messenger',
             'Mark Zuckerberg', 'Sheryl Sandberg', 'Andrew Bosworth',
             'Mike Schroepfer','Chris Cox', 'David Wehner', 'Javier Olivan',
             'Adam Mosseri','Naomi Gleit', 'Fidji Simo']

In [5]:
all_df = pd.read_excel('/Users/nguyenhien/Desktop/OneDrive/2. Learning/2.3 Data Science/@python/1. Machine Learning/Project/Lattest/NLP_Newstitles/Train_title_sen_analyst.xlsx')

#df_title = all_df
df_title = all_df[(all_df['Date'].dt.month == 10) & (all_df['Date'].dt.year == 2022)].iloc[:, [0,1,2]]

# 1. Split the data into training and testing sets
split = int(len(all_df) * 0.8)
my_df_train = all_df.iloc[:split, 0:2]
my_df_test = all_df.iloc[split:, 0:2]

df_title['Company'] = df_title['Title'].apply(lambda x: 'Meta Inc.' if any(word in x for word in word_meta)
                                       else 'Microsoft Inc.' if any(word in x for word in word_mcs)
                                       else 'Apple Inc.' if any(word in x for word in word_apple)
                                       else None)
df_title                                       


Unnamed: 0,Title,Sign,Date,Company
883,WhatsApp bans over 23 lakh Indian accounts in ...,Negative,2022-10-01,Meta Inc.
884,How banks can -1otiate the US and European san...,Negative,2022-10-01,
885,Recession fears hit Big Tech firms as Facebook...,Negative,2022-10-01,Meta Inc.
886,Indian Army bets on hire-to-retire platform Zi...,Positive,2022-10-02,
887,People still don't know what metaverse is all ...,Negative,2022-10-02,Apple Inc.
...,...,...,...,...
1049,Blurring lines between sci-fi and reality: why...,Negative,2022-10-30,
1050,Honeywell appoints Microsoft's Rajesh Rege as ...,Negative,2022-10-30,Microsoft Inc.
1051,CBI seeks metadata of 'Yogi' email from Micros...,Negative,2022-10-30,Microsoft Inc.
1052,"Workers walk out of iPhone factory, highlighti...",Negative,2022-10-31,Apple Inc.


In [6]:
# 2. Preprocess the training data
stop_words = set(stopwords.words("english") + list(string.punctuation))

def preprocess_text(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
    return filtered_tokens

# Tokenize and preprocess the positive, negative, and neutral texts
positive_texts = [preprocess_text(text) for text in my_df_train[my_df_train['Sign'] == 'Positive']['Title']]
negative_texts = [preprocess_text(text) for text in my_df_train[my_df_train['Sign'] == 'Negative']['Title']]

# 3. Build a bag-of-words model
def build_bag_of_words_features_filtered(words):
    return {word: 1 for word in words}

# 4. Build the feature sets
positive_features = [(build_bag_of_words_features_filtered(text), 'Positive') for text in positive_texts]
negative_features = [(build_bag_of_words_features_filtered(text), 'Negative') for text in negative_texts]

# 5. Train the classifier
classifier = NaiveBayesClassifier.train(positive_features + negative_features)

# 6. Test the accuracy
training_accuracy = None
training_accuracy = nltk.classify.util.accuracy(classifier, positive_features[:split]+negative_features[:split])*100

test_accuracy = None
test_accuracy = nltk.classify.util.accuracy(classifier, positive_features[:split]+negative_features[:split])*100

test_accuracy, training_accuracy

(93.56512714063311, 93.56512714063311)

## 2. Apply model and doing sentiment analyst
### 2.1 sentiment analyst

In [7]:
# Ánh xạ các dòng văn bản sang nhãn tương ứng và gán vào cột "Label"
df_title['Label'] = None
df_title['Label'] = df_title['Title'].map(lambda x: classifier.classify({word: True for word in nltk.word_tokenize(x)}))

df_title


Unnamed: 0,Title,Sign,Date,Company,Label
883,WhatsApp bans over 23 lakh Indian accounts in ...,Negative,2022-10-01,Meta Inc.,Positive
884,How banks can -1otiate the US and European san...,Negative,2022-10-01,,Negative
885,Recession fears hit Big Tech firms as Facebook...,Negative,2022-10-01,Meta Inc.,Negative
886,Indian Army bets on hire-to-retire platform Zi...,Positive,2022-10-02,,Positive
887,People still don't know what metaverse is all ...,Negative,2022-10-02,Apple Inc.,Negative
...,...,...,...,...,...
1049,Blurring lines between sci-fi and reality: why...,Negative,2022-10-30,,Negative
1050,Honeywell appoints Microsoft's Rajesh Rege as ...,Negative,2022-10-30,Microsoft Inc.,Positive
1051,CBI seeks metadata of 'Yogi' email from Micros...,Negative,2022-10-30,Microsoft Inc.,Negative
1052,"Workers walk out of iPhone factory, highlighti...",Negative,2022-10-31,Apple Inc.,Negative


In [8]:
# Loại bỏ các dòng có giá trị None ở bất kỳ cột nào
df_title = df_title.dropna()

# Tạo bảng mới với tổng các dòng nhóm bởi  "Date", "Company"
df_short_1 = df_title.groupby(["Date", "Company"]).size().reset_index(name="Sum")

# Tính tỷ trọng của Sign trong ngày và trong cùng công ty
df_short_2 = df_title.groupby(["Sign", "Date", "Company"]).size().reset_index(name="Percentage")
df_short_2["Percentage"] = (df_short_2["Percentage"] * 100) / df_short_2.groupby(["Date", "Company"])["Percentage"].transform("sum")

df_short_nofilter = df_short_2
# Filter Company
df_short_2 = df_short_2[df_short_2['Company'] == 'Meta Inc.']
df_short_2 = df_short_2.sort_values('Sign', ascending=False)

In [9]:
color_palette =   ["#A4C3A2","#A8C686", '#EAE7D6', "#FF9C6E", "#E45756"]

filtered_df = df_short_2[df_short_2["Company"] == 'Meta Inc.']
fig = px.bar(filtered_df, x="Date", y="Percentage", color="Sign",
                title=f"Label Proportion for Meta Inc.", color_discrete_sequence=color_palette,
                category_orders={"Sign": ["ExtremPositive", "Positive", "Neutral", "Negative", "ExtremNegative"]})
fig.update_layout(barmode="stack")
fig.update_layout(xaxis_title='Quarter', yaxis_title='Label Proportion %',
                    xaxis=dict(rangeslider=dict(visible=True)))
fig.show()




### 2.2 Shares price history

In [11]:
# Read dữ liệu
stock_history_df = pd.read_csv('/Users/nguyenhien/Desktop/OneDrive/2. Learning/2.3 Data Science/@python/1. Machine Learning/Project/Lattest/NLP_Newstitles/Actual_title_sen_AAPL_history.csv', sep = ';')
stock_history_df = stock_history_df.iloc[:, [0,4,7]]
# Xử lý cột 'Date' sang định dạng datetime
stock_history_df['Date'] = pd.to_datetime(stock_history_df['Date'], format='%d/%m/%Y')
# Xử lý cột 'Close'
stock_history_df['Close'] = stock_history_df['Close'].astype(str)
stock_history_df['Close'] = stock_history_df['Close'].str.replace('.', '').astype(float) / 1000000
stock_history_df['Close'] = stock_history_df['Close'].apply(lambda x: round(x, 2))

#Lấy sự thay đổi mỗi ngày
stock_history_df['daily_change'] = stock_history_df.groupby('Company')['Close'].pct_change()
stock_history_df['change_percent'] = (stock_history_df['Close'] - stock_history_df['Close'].shift(1)) / stock_history_df['Close'].shift(1) * 100

stock_history_df = stock_history_df[(stock_history_df['Date'].dt.month == 10) & (stock_history_df['Date'].dt.year == 2022)]
stock_history_all = stock_history_df
stock_history_df = stock_history_df[stock_history_df['Company'] == 'Meta Inc.']

stock_history_df

@interact_manual(company=widgets.Dropdown(options=stock_history_df['Company'].unique()), layout=widgets.Layout(width='200px', height='10px'))
# Tạo hàm để vẽ biểu đồ tương tác
def plot_interactive_chart(company):
    filtered_df = stock_history_df[stock_history_df["Company"] == company]
    # Vẽ biểu đồ line chart
    fig = px.line(filtered_df, x='Date', y='change_percent', color='Company', title='Stock Prices Change')
    fig.update_layout(xaxis_title='Quarter', yaxis_title='Daily Change',
                    xaxis=dict(rangeslider=dict(visible=True)))
    # Hiển thị biểu đồ
    fig.show()


interactive(children=(Dropdown(description='company', options=('Meta Inc.',), value='Meta Inc.'), Button(descr…

### 3. Finding
The analysis initially uncovers a weak correlation between the proportion of labels and share price. However, as we increase the levels of labeling the correlation strengthens. This highlights the challenge of making predictions solely using the Naive Bayes Classifier, but demonstrates the feasibility of achieving better result by incorporating multiple libraries

Limitations
1. News headlines often discuss operational decisions that typically (1) don't have a direct impact on stock prices, or (2) they mentions the events in financial reports that do affect stock prices, but only when there is a significant increase
or decrease in share price. Therefore, it may not always provide investors with timely information from financial reports to stay up-to-date.
2. Binary labels (positive, negative, extremely positive, and extremely negative) don't indicate the exact percentage of how much the title affects the share price or how much the share price will change based on the news.
To overcome these limitations, we can take additional steps:
- Broaden the range of news sources to ensure comprehensive coverage, capturing information that is not included in
the Economics Time
- Use additional and advanced libraries to analyze multiple features such as the "Title," "Target Company," "Date," and
"Share Price Change" as inputs and assign scores as outputs for the machine learning model. From using this method we can do the absolute correlation rate.