=====================================================================================================================
# **Tesla News Sentiment Stock Prediction**

This notebook builds an application that forecasts the next closing price of several EV-related stocks (NIO, NVDA, PCRFY, and TSLA in the data below). It scores the sentiment of news text about Tesla, a leading player in the EV industry, combines those scores with daily price data, and trains regression models on the result. The aim is to help investors make informed decisions and to show how news about Tesla relates to price movements across these stocks.

Team Members:

1. M. Gifhari Heryndra (Data Analyst)
2. Fadhil Athallah (Data Scientist)
3. Reski Hidayat (Data Engineer)

=====================================================================================================================

# **1.** **Import Libraries** 

In [57]:
import pandas as pd
import numpy as np

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

import xgboost as xgb
import joblib
import random

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import L2

import warnings
warnings.filterwarnings('ignore')


# **2.** **Data Loading**

Loading ```tweets_cleaned.csv```

In [58]:
Tweet=pd.read_csv('tweets_cleaned.csv')
Tweet

Unnamed: 0,Date,tweet_processed
0,2023-04-10,tesla open new megafactory shanghai china comp...
1,2023-04-10,china hold military drill around taiwan ftx co...
2,2023-04-09,watch tesla chief executive elon musk making p...
3,2023-04-09,tesla model x starting show age
4,2023-04-09,market biggest company apple tesla microsoft i...
...,...,...
22340,2010-02-17,plane owned tesla engineer crash 3 dead
22341,2010-02-17,plane owned tesla engineer crash 3 dead
22342,2010-01-14,move thomas edison nikola tesla pioneer altern...
22343,2009-09-15,bloomberg news electric sportscar maker tesla ...


Next, we add new columns to the dataset: a news count and the sentiment scores.

The code below aggregates tweets by date. It joins all texts published on the same day into one string, stores how many there were in `news_count`, and keeps one row per date.

In [59]:
# Concatenate titles and count news
Tweet['tweet_processed'] = Tweet.groupby('Date')['tweet_processed'].transform(lambda x: ' '.join(x))
Tweet['news_count'] = Tweet.groupby('Date')['tweet_processed'].transform('count')
Tweet = Tweet.drop_duplicates(subset=['Date', 'tweet_processed'])
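
The same per-date aggregation can also be written as a single `groupby(...).agg(...)`. The sketch below is an alternative formulation, not the notebook's own code, and it runs on a small made-up frame rather than on `tweets_cleaned.csv`; the name `daily` is introduced here for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the real rows come from tweets_cleaned.csv.
toy = pd.DataFrame({
    'Date': ['2023-04-10', '2023-04-10', '2023-04-09'],
    'tweet_processed': [
        'tesla open new megafactory',
        'china hold military drill',
        'tesla model x starting show age',
    ],
})

# One row per date: join the texts and count how many there were.
daily = (
    toy.groupby('Date', as_index=False)
       .agg(tweet_processed=('tweet_processed', ' '.join),
            news_count=('tweet_processed', 'count'))
)
print(daily)
```

This form produces one row per date directly, so no `drop_duplicates` step is needed afterwards.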

Next, the VADER sentiment analyzer adds sentiment scores computed from each day's concatenated text.

In [60]:
# Initialize VADER sentiment analyzer (needs the 'vader_lexicon' resource)
nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()

# Function to calculate sentiment
def analyze_sentiment(text):
    return sia.polarity_scores(text)

# Apply sentiment analysis to each concatenated title
Tweet['sentiment'] = Tweet['tweet_processed'].apply(analyze_sentiment)

# Extract sentiment scores
Tweet['positive'] = Tweet['sentiment'].apply(lambda x: x['pos'])
Tweet['negative'] = Tweet['sentiment'].apply(lambda x: x['neg'])
Tweet['neutral'] = Tweet['sentiment'].apply(lambda x: x['neu'])
Tweet['compound'] = Tweet['sentiment'].apply(lambda x: x['compound'])

In [61]:
#  Function to categorize sentiment
def categorize_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# Label each row from its compound score (already computed above),
# then normalize 'Date' to datetime.date so it can be merged with the stock data
Tweet['sentiment_category'] = Tweet['compound'].apply(categorize_sentiment)
Tweet['Date'] = pd.to_datetime(Tweet['Date'], utc=True).dt.date
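
The ±0.05 cutoffs can be checked at their boundaries without running VADER. The sketch below restates `categorize_sentiment` and applies it to hand-picked compound scores; 0.9201 and 0.0772 also appear in this notebook's outputs, while the remaining values are made up for illustration.

```python
def categorize_sentiment(compound_score):
    """Map a VADER compound score in [-1, 1] to a label using +/-0.05 cutoffs."""
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    return 'Neutral'

# Includes both boundary values, which fall on the Positive/Negative side.
samples = [0.9201, 0.0772, 0.05, 0.0, -0.05, -0.6]
labels = [categorize_sentiment(s) for s in samples]
print(dict(zip(samples, labels)))
```

Because both comparisons are inclusive, only scores strictly between -0.05 and 0.05 are labeled `Neutral`.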

Loading the stock data from `merged_data.csv`

In [62]:
stock = pd.read_csv('merged_data.csv')
stock.reset_index(inplace=True)
stock['Next Close'] = stock['Close'].shift(-1)
stock.dropna(inplace=True)
stock['Date'] = pd.to_datetime(stock['Date'], utc=True).dt.date
stock

Unnamed: 0,index,Date,Company_ID,Open,High,Low,Close,Volume,Dividends,Stock Splits,Adj Close,Next Close
0,0,2018-09-12,NIO,6.000000,6.930000,5.350000,6.600000,66849000,0.0,0.0,6.600000,11.600000
1,1,2018-09-13,NIO,6.620000,12.690000,6.520000,11.600000,158346500,0.0,0.0,11.600000,9.900000
2,2,2018-09-14,NIO,12.660000,13.800000,9.220000,9.900000,172473600,0.0,0.0,9.900000,8.500000
3,3,2018-09-17,NIO,9.610000,9.750000,8.500000,8.500000,56323900,0.0,0.0,8.500000,7.680000
4,4,2018-09-18,NIO,8.730000,9.100000,7.670000,7.680000,41827600,0.0,0.0,7.680000,8.500000
...,...,...,...,...,...,...,...,...,...,...,...,...
12046,12046,2023-03-30,TSLA,195.580002,197.330002,194.419998,195.279999,110252200,0.0,0.0,195.279999,207.460007
12047,12047,2023-03-31,TSLA,197.529999,207.789993,197.199997,207.460007,170222100,0.0,0.0,207.460007,194.770004
12048,12048,2023-04-03,TSLA,199.910004,202.690002,192.199997,194.770004,169545900,0.0,0.0,194.770004,192.580002
12049,12049,2023-04-04,TSLA,197.320007,198.740005,190.320007,192.580002,126463800,0.0,0.0,192.580002,185.520004
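
One thing to watch: `shift(-1)` on the whole frame pairs the last row of each company with the first `Close` of whichever company follows it in the file. This appears to be what happens in the NVDA row for 2023-04-06 in the merged output further down, where `Next Close` is 15.495338 against a `Close` of 270.283295. A minimal sketch of a per-company shift follows, assuming the intent is "the same company's close on its next trading day". The toy prices and the two `next_close_*` column names are made up for illustration.

```python
import pandas as pd

# Toy prices for illustration only (not taken from merged_data.csv).
toy = pd.DataFrame({
    'Date': ['2023-04-03', '2023-04-04', '2023-04-03', '2023-04-04'],
    'Company_ID': ['NIO', 'NIO', 'TSLA', 'TSLA'],
    'Close': [9.0, 9.5, 194.0, 192.0],
})

# Global shift: NIO's last row picks up TSLA's first close (194.0).
toy['next_close_global'] = toy['Close'].shift(-1)

# Per-company shift: the last row of each company has no next close (NaN).
toy = toy.sort_values(['Company_ID', 'Date'])
toy['next_close_by_company'] = toy.groupby('Company_ID')['Close'].shift(-1)

print(toy)
```

With the grouped version, a following `dropna()` removes the last row of every company instead of only the last row of the file.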


Merging the `stock` and `Tweet` data on `Date`

In [127]:
# Merge tweets data with stock data on 'Date'
merged_data = pd.merge(stock, Tweet, on='Date', how='inner')

# Sort by date
merged_data = merged_data.sort_values(by='Date')
merged_data

Unnamed: 0,index,Date,Company_ID,Open,High,Low,Close,Volume,Dividends,Stock Splits,Adj Close,Next Close,tweet_processed,news_count,sentiment,positive,negative,neutral,compound,sentiment_category
4536,5122,2008-07-08,PCRFY,16.691616,16.867801,16.599693,16.852480,349094,0.0,0.0,16.852480,16.201361,tesla tap chrysler executive,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,0.000,1.000,0.0000,Neutral
4535,1279,2008-07-08,NVDA,2.793185,2.861983,2.706042,2.758786,180528800,0.0,0.0,2.758786,2.710628,tesla tap chrysler executive,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,0.000,1.000,0.0000,Neutral
4537,1579,2009-09-15,NVDA,3.715074,3.802217,3.680675,3.756352,54613600,0.0,0.0,3.756352,3.655448,bloomberg news electric sportscar maker tesla ...,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,0.000,1.000,0.0000,Neutral
4538,5422,2009-09-15,PCRFY,11.990054,12.075420,11.912449,12.036617,190405,0.0,0.0,12.036617,12.090944,bloomberg news electric sportscar maker tesla ...,1,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.000,0.000,1.000,0.0000,Neutral
4540,5506,2010-01-14,PCRFY,13.099814,13.309350,13.014449,13.231744,842698,0.0,0.0,13.231744,13.138618,move thomas edison nikola tesla pioneer altern...,1,"{'neg': 0.14, 'neu': 0.698, 'pos': 0.163, 'com...",0.163,0.140,0.698,0.0772,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4530,8834,2023-04-05,PCRFY,9.235722,9.285217,9.215925,9.255521,216544,0.0,0.0,9.255521,9.005571,tesla sprawling manufacturing hub austin texas...,5,"{'neg': 0.121, 'neu': 0.752, 'pos': 0.127, 'co...",0.127,0.121,0.752,0.0772,Positive
4531,12050,2023-04-05,TSLA,190.520004,190.679993,183.759995,185.520004,133882500,0.0,0.0,185.520004,185.059998,tesla sprawling manufacturing hub austin texas...,5,"{'neg': 0.121, 'neu': 0.752, 'pos': 0.127, 'co...",0.127,0.121,0.752,0.0772,Positive
4532,1149,2023-04-06,NIO,8.940000,9.070000,8.830000,9.010000,23019800,0.0,0.0,9.010000,7.570035,link correction tesla employee privately share...,13,"{'neg': 0.034, 'neu': 0.844, 'pos': 0.122, 'co...",0.122,0.034,0.844,0.9201,Positive
4533,4992,2023-04-06,NVDA,265.754749,270.713149,264.185245,270.283295,39765400,0.0,0.0,270.283295,15.495338,link correction tesla employee privately share...,13,"{'neg': 0.034, 'neu': 0.844, 'pos': 0.122, 'co...",0.122,0.034,0.844,0.9201,Positive


Dropping the columns that are not used for modeling, based on domain knowledge and their values

In [128]:
merged_data = merged_data.drop(columns=['Date','index', 'sentiment','tweet_processed','Dividends','Stock Splits','Adj Close'])

merged_data

Unnamed: 0,Company_ID,Open,High,Low,Close,Volume,Next Close,news_count,positive,negative,neutral,compound,sentiment_category
4536,PCRFY,16.691616,16.867801,16.599693,16.852480,349094,16.201361,1,0.000,0.000,1.000,0.0000,Neutral
4535,NVDA,2.793185,2.861983,2.706042,2.758786,180528800,2.710628,1,0.000,0.000,1.000,0.0000,Neutral
4537,NVDA,3.715074,3.802217,3.680675,3.756352,54613600,3.655448,1,0.000,0.000,1.000,0.0000,Neutral
4538,PCRFY,11.990054,12.075420,11.912449,12.036617,190405,12.090944,1,0.000,0.000,1.000,0.0000,Neutral
4540,PCRFY,13.099814,13.309350,13.014449,13.231744,842698,13.138618,1,0.163,0.140,0.698,0.0772,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4530,PCRFY,9.235722,9.285217,9.215925,9.255521,216544,9.005571,5,0.127,0.121,0.752,0.0772,Positive
4531,TSLA,190.520004,190.679993,183.759995,185.520004,133882500,185.059998,5,0.127,0.121,0.752,0.0772,Positive
4532,NIO,8.940000,9.070000,8.830000,9.010000,23019800,7.570035,13,0.122,0.034,0.844,0.9201,Positive
4533,NVDA,265.754749,270.713149,264.185245,270.283295,39765400,15.495338,13,0.122,0.034,0.844,0.9201,Positive


This is the final dataset used for modeling.

# **3.** **Modeling**
For modeling, we will try four different regression algorithms to compare and select the best one based on its `R2` and `RMSE` scores. The algorithms include `Linear Regression`, `XGBoost`, `Random Forest`, and `SVR`.
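
For reference, both metrics can be computed by hand. The sketch below uses made-up targets and predictions (not values from this dataset) to show how MSE, RMSE, and R² relate to the residual and total sums of squares.

```python
import math

# Made-up targets and predictions, for illustration only.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 16.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual and total sums of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n           # mean squared error
rmse = math.sqrt(mse)      # same units as the target (a price)
r2 = 1 - ss_res / ss_tot   # share of target variance explained

print(f"MSE={mse}, RMSE={rmse:.4f}, R^2={r2}")
```

RMSE is expressed in the units of `Next Close`, so it depends on the price level of the stocks in the test set, whereas R² is unitless.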


## **Preprocess**

In [129]:
# Drop rows with missing values
final_data = merged_data.dropna()

# Separate features and target
X = final_data.drop(columns=['Next Close'])
y = final_data['Next Close'].values

# Define categorical features
categorical_features = ['Company_ID', 'sentiment_category']

# Define preprocessor with OneHotEncoder for categorical features and StandardScaler for numeric features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [col for col in X.columns if col not in categorical_features]),
        ('cat', OneHotEncoder(categories='auto'), categorical_features)  
    ]
)

### **Linear Regression** 

In [130]:
# Define the model pipeline
model_LinReg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model_LinReg.fit(X_train, y_train)
# Make predictions
y_pred = model_LinReg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R^2 Score: {r2}")


Mean Squared Error (MSE): 13.664077063408547
Root Mean Squared Error (RMSE): 3.6964952405499654
R^2 Score: 0.9977945352839853


### **XGBoost**

In [22]:
# Define the model pipeline with XGBoost
model_XGB = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', xgb.XGBRegressor())
])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model_XGB.fit(X_train, y_train)

# Make predictions
y_pred = model_XGB.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R^2 Score: {r2}")

Mean Squared Error (MSE): 23.242072724584393
Root Mean Squared Error (RMSE): 4.821003290248244
R^2 Score: 0.9962485888301679


### **Random Forest** 


In [24]:
# Define the model pipeline with Random Forest
model_RF = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor())
])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model_RF.fit(X_train, y_train)

# Make predictions
y_pred = model_RF.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R^2 Score: {r2}")

Mean Squared Error (MSE): 17.476215418576402
Root Mean Squared Error (RMSE): 4.180456364869319
R^2 Score: 0.9971792330871466


### **SVR**

In [41]:
# Define the model pipeline with SVR
model_SVR = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVR())
])

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVR model
model_SVR.fit(X_train, y_train)

# Make predictions using SVR
y_pred = model_SVR.predict(X_test)

# Evaluate the SVR model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R^2 Score: {r2}")

Mean Squared Error (MSE): 451.14040674618104
Root Mean Squared Error (RMSE): 21.240066072076637
R^2 Score: 0.927183208611163


## **Best Model**
Based on RMSE, Linear Regression is the best model, with an RMSE of about 3.70. To see whether it can be improved further, we tune it with a grid search. The best estimator is then saved and used as the final model.

In [131]:
# Define the model pipeline
model_LinReg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Define the parameter grid
param_grid = {
    'model__fit_intercept': [True, False],
    'model__positive': [True, False]
}

# Create GridSearchCV object
grid_search = GridSearchCV(model_LinReg, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the grid search
grid_search.fit(X_train, y_train)

# Get the best estimator
best_model = grid_search.best_estimator_

# Make predictions with the best model
y_pred = best_model.predict(X_test)

# Evaluate the best model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mse)

print("Best Parameters:", grid_search.best_params_)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R^2 Score: {r2}")

Best Parameters: {'model__fit_intercept': True, 'model__positive': False}
Mean Squared Error (MSE): 13.664077063408547
Root Mean Squared Error (RMSE): 3.6964952405499654
R^2 Score: 0.9977945352839853


In [132]:
joblib.dump(best_model, 'model_LinReg.pkl')

['model_LinReg.pkl']

# **4.** **Model Inference**

In [118]:
import random
import pandas as pd
import joblib

# Load the trained model
model_LinReg = joblib.load('model_LinReg.pkl')

# Define lists to store generated data and predictions
random_data_list = []
predictions = []

# Generate 5 random data points and predictions
for _ in range(5):
    # Define the random data for each data point
    random_data = {
        'Company_ID': random.choice(['NVDA', 'TSLA', 'NIO', 'PCRFY']),  # Random company ID
        'Open': random.uniform(10, 100),  # Random open price
        'High': random.uniform(10, 100),  # Random high price
        'Low': random.uniform(10, 100),   # Random low price
        'Close': random.uniform(10, 100), # Random close price
        'Volume': random.randint(1000, 100000),  # Random volume
        'Next Close': random.uniform(10, 100), # Random placeholder target; dropped before prediction
        'news_count': random.randint(1, 100),    # Random news count
        'positive': random.uniform(0, 1),    # Random sentiment scores
        'negative': random.uniform(0, 1),
        'neutral': random.uniform(0, 1),
        'compound': random.uniform(-1, 1),
        'sentiment_category': random.choice(['Positive', 'Negative', 'Neutral'])  # Random sentiment category
    }
    # Append the random data to the list
    random_data_list.append(random_data)

# Convert the list of dictionaries into a DataFrame
random_df = pd.DataFrame(random_data_list)

random_df


Unnamed: 0,Company_ID,Open,High,Low,Close,Volume,Next Close,news_count,positive,negative,neutral,compound,sentiment_category
0,TSLA,31.95564,37.103098,87.367544,91.940307,28540,98.804392,48,0.120125,0.159145,0.316411,-0.505889,Negative
1,NVDA,81.90081,76.929805,65.69575,29.64814,82770,40.437696,25,0.774185,0.229554,0.978182,0.820361,Positive
2,PCRFY,34.356343,44.217,47.360117,83.588258,88565,84.477127,63,0.622942,0.036139,0.201703,-0.442476,Neutral
3,TSLA,56.816132,92.72052,29.989682,55.324113,41486,92.414447,100,0.387441,0.638307,0.04117,-0.994928,Neutral
4,NVDA,73.833979,38.101907,91.095349,85.818926,78765,13.888175,51,0.297852,0.499941,0.862836,0.536272,Neutral


In [119]:
# Extract features (X) for prediction
X_random = random_df.drop(['Next Close'], axis=1)  # Drop 'Next Close' as it's not used for prediction

# Make predictions
predicted_next_close = model_LinReg.predict(X_random)

print("Predicted Next Close Prices for 5 Random Data Points:")
for i, pred in enumerate(predicted_next_close):
    print(f"Next Close for Random Case {i+1}: {pred}")


Predicted Next Close Prices for 5 Random Data Points:
Next Close for Random Case 1: -0.8285348643031369
Next Close for Random Case 2: 252.63851346460362
Next Close for Random Case 3: 45.20808613718755
Next Close for Random Case 4: 73.06512828351129
Next Close for Random Case 5: 221.6911907449757
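
The random rows above draw every column independently, so `Low` can exceed `High` (Case 1 has `Low` 87.37 against `High` 37.10) and the three sentiment proportions need not sum to 1 as VADER's do, up to rounding. A sketch of a generator that keeps each row internally consistent follows. `make_row` is a name introduced here, and the numeric ranges are the same arbitrary ones used above.

```python
import random

def make_row(rng):
    """Return one synthetic feature row with consistent OHLC and sentiment values."""
    low = rng.uniform(10, 90)
    high = rng.uniform(low, 100)
    # Open and close must fall inside the day's range.
    open_, close = rng.uniform(low, high), rng.uniform(low, high)

    # Normalize three random weights so positive + negative + neutral = 1.
    weights = [rng.random() + 1e-9 for _ in range(3)]
    total = sum(weights)
    positive, negative, neutral = (w / total for w in weights)

    # Derive the category from the compound score instead of drawing it separately.
    compound = rng.uniform(-1, 1)
    if compound >= 0.05:
        category = 'Positive'
    elif compound <= -0.05:
        category = 'Negative'
    else:
        category = 'Neutral'

    return {
        'Company_ID': rng.choice(['NVDA', 'TSLA', 'NIO', 'PCRFY']),
        'Open': open_, 'High': high, 'Low': low, 'Close': close,
        'Volume': rng.randint(1000, 100000),
        'news_count': rng.randint(1, 100),
        'positive': positive, 'negative': negative, 'neutral': neutral,
        'compound': compound,
        'sentiment_category': category,
    }

rng = random.Random(42)  # fixed seed so the rows are reproducible
rows = [make_row(rng) for _ in range(5)]
print(rows[0])
```

Passing `pd.DataFrame(rows)` to the loaded pipeline would work the same way as `X_random`, since the preprocessor selects columns by name.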


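The saved model loads correctly and returns a next-close prediction for every input row, so it is ready to be packaged for deployment. Because the inputs here are random rather than realistic market data, the predicted values themselves (including the negative price in Case 1) are not meaningful.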