
# Stock Movement Analysis Using Social Media Sentiment

This notebook predicts stock prices from the sentiment of Twitter data. The cells run on simulated data as a demonstration of the following steps:

1. **Data Scraping**:
   - Scrape historical tweets using the `GetOldTweets3` library.
   - Fetch historical stock prices using the `yfinance` library.

2. **Data Cleaning & Feature Engineering**:
   - Clean tweets (remove punctuation, stopwords, hyperlinks, etc.).
   - Perform sentiment analysis using the `VADER` sentiment analyzer.
   - Extract additional features like tweet volume, sentiment trends, and stock-specific keywords.

3. **Machine Learning Models**:
   - Predict stock prices using Random Forest and Support Vector Regression (SVR).
   - Evaluate models using metrics such as RMSE, MAE, and R² Score.

4. **Visualization**:
   - Visualize sentiment trends, stock price predictions, and feature importance.
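
The tweet-cleaning step in item 2 could look roughly like the sketch below. It is an illustration rather than code from this notebook: `clean_tweet`, the two regular expressions, and the tiny `STOPWORDS` set are assumptions, and a real run would likely use a fuller stopword list. VADER treats punctuation and capitalization as intensity cues, so it may be preferable to score the raw text and use the cleaned text only for keyword features.

```python
import re
import string

# Illustrative stopword list (an assumption); swap in a fuller list for real data
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for", "it", "this", "that"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Strip hyperlinks, @mentions, punctuation, and stopwords from one tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_tweet("Check this: $AAPL is UP!!! https://t.co/abc @user"))
# -> check aapl up
```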
    

## Step 1: Imports

In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
import seaborn as sns
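import warnings
import nltk

warnings.filterwarnings("ignore")
sns.set(style="whitegrid")

# Fetch the VADER lexicon once; SentimentIntensityAnalyzer() raises LookupError without it
nltk.download("vader_lexicon", quiet=True)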
    

## Step 2: Data Scraping (Simulated)

In [None]:

# Simulate the scraped data for demonstration purposes.
# Replace this cell with real scraping code (GetOldTweets3 and yfinance) to use real data.
np.random.seed(42)  # make the simulated series reproducible

# Example stock data
stock_data = pd.DataFrame({
    "Date": pd.date_range(start="2023-01-01", periods=100),
    "Prices": np.cumsum(np.random.normal(0, 1, 100)) + 100
})

# Example tweets data
tweets_data = pd.DataFrame({
    "Date": pd.date_range(start="2023-01-01", periods=100),
    "Tweets": ["Example tweet text"] * 100
})

# Save the simulated data for further steps
stock_data.to_csv("stock_data.csv", index=False)
tweets_data.to_csv("tweets_data.csv", index=False)
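
The simulated prices above stand in for a real download. A minimal sketch of fetching real closing prices into the same `Date`/`Prices` layout, assuming `yfinance` is installed and network access is available: `fetch_prices` and `to_price_frame` are hypothetical helper names, and the ticker and date range in the usage comment are placeholders.

```python
import pandas as pd


def to_price_frame(history: pd.DataFrame) -> pd.DataFrame:
    """Reduce a date-indexed OHLCV frame to the Date/Prices columns used in this notebook."""
    dates = pd.to_datetime(history.index)
    if dates.tz is not None:
        # Drop the exchange timezone so dates compare equal to naive tweet dates
        dates = dates.tz_localize(None)
    return pd.DataFrame({"Date": dates.normalize(), "Prices": history["Close"].to_numpy()})


def fetch_prices(ticker: str, start: str, end: str) -> pd.DataFrame:
    """Download daily history for one ticker (hypothetical helper; needs network access)."""
    import yfinance as yf  # imported lazily so the pure helper above works offline

    history = yf.Ticker(ticker).history(start=start, end=end)
    return to_price_frame(history)


# Example call (placeholder ticker and dates):
# stock_data = fetch_prices("AAPL", "2023-01-01", "2023-04-11")
```

Real price data has no rows for weekends and market holidays, so an inner merge with daily tweet data would drop those days.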
    

## Step 3: Feature Engineering

In [None]:

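# Load the data, parsing Date as datetime so both frames merge on the same dtype
stock_data = pd.read_csv("stock_data.csv", parse_dates=["Date"])
tweets_data = pd.read_csv("tweets_data.csv", parse_dates=["Date"])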

# Sentiment Analysis
sia = SentimentIntensityAnalyzer()
tweets_data["Compound"] = tweets_data["Tweets"].apply(lambda x: sia.polarity_scores(x)["compound"])

# Add tweet volume (simulated as random for this example)
tweets_data["TweetVolume"] = np.random.randint(50, 200, len(tweets_data))

# Merge with stock data
merged_data = pd.merge(stock_data, tweets_data, on="Date")

# Add rolling average of sentiment score
merged_data["RollingSentiment"] = merged_data["Compound"].rolling(window=5, min_periods=1).mean()
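
The cell above assumes one tweet per day and draws `TweetVolume` at random. With real scraped data there would typically be many tweets per day, and both features could be derived by aggregation. A minimal sketch, assuming one row per tweet with a timestamp in `Date` and a VADER score in `Compound`; `daily_tweet_features` is a hypothetical helper name.

```python
import pandas as pd


def daily_tweet_features(tweets: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-tweet data to one row per calendar day."""
    # Truncate timestamps to midnight so all tweets from the same day share a key
    tweets = tweets.assign(Date=pd.to_datetime(tweets["Date"]).dt.normalize())
    daily = (
        tweets.groupby("Date")
        .agg(
            Compound=("Compound", "mean"),      # average sentiment that day
            TweetVolume=("Compound", "count"),  # number of scored tweets that day
        )
        .reset_index()
    )
    return daily


raw = pd.DataFrame({
    "Date": ["2023-01-01 09:00", "2023-01-01 15:30", "2023-01-02 10:00"],
    "Compound": [0.5, -0.1, 0.2],
})
print(daily_tweet_features(raw))
```

The result has one row per day and can be merged with the stock data on `Date` in the same way as `tweets_data`.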
    

## Step 4: Model Training and Evaluation

In [None]:

# Prepare data for training
X = merged_data[["Compound", "TweetVolume", "RollingSentiment"]]
y = merged_data["Prices"]

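# Chronological train-test split: the rows form a time series, so hold out the last 20%
# instead of shuffling, which would let future prices leak into training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)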

# Random Forest Model
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)

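# Support Vector Regression (features are standardized first; the RBF kernel is scale-sensitive)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, gamma=0.1))
svr.fit(X_train, y_train)
svr_predictions = svr.predict(X_test)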

# Evaluation
models = {"Random Forest": rf_predictions, "SVR": svr_predictions}
results = {}

for model_name, predictions in models.items():
    results[model_name] = {
        "RMSE": np.sqrt(mean_squared_error(y_test, predictions)),
        "MAE": mean_absolute_error(y_test, predictions),
        "R²": r2_score(y_test, predictions)
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df)
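
A single train-test split gives one estimate of each metric. For time-ordered data, one way to get several estimates is expanding-window cross-validation, where each fold trains on the past and validates on the block that follows. A minimal sketch using scikit-learn's `TimeSeriesSplit`, run on synthetic stand-in data; `time_series_rmse`, `X_demo`, and `y_demo` are hypothetical names, and the notebook's own `X` and `y` could be passed in their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score


def time_series_rmse(model, X, y, n_splits=5):
    """Return the RMSE of each expanding-window fold, oldest fold first."""
    cv = TimeSeriesSplit(n_splits=n_splits)
    # scikit-learn maximizes scores, so RMSE is reported negated
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores


# Synthetic stand-in data (an assumption): three features and a random-walk target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 100 + np.cumsum(rng.normal(size=100))

fold_rmse = time_series_rmse(RandomForestRegressor(n_estimators=50, random_state=0), X_demo, y_demo)
print(fold_rmse.round(2))
```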
    

## Step 5: Visualization

In [None]:

# Plot actual vs. predicted prices
plt.figure(figsize=(12, 6))
plt.plot(y_test.values, label="Actual Prices", marker="o")
plt.plot(rf_predictions, label="RF Predicted Prices", marker="x")
plt.plot(svr_predictions, label="SVR Predicted Prices", marker="s")
plt.legend()
plt.title("Actual vs Predicted Prices")
plt.xlabel("Test Samples")
plt.ylabel("Prices")
plt.show()

# Feature importance for Random Forest
importances = rf.feature_importances_
plt.figure(figsize=(8, 4))
sns.barplot(x=X.columns, y=importances)
plt.title("Feature Importance (Random Forest)")
plt.ylabel("Importance Score")
plt.show()
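
The impurity-based `feature_importances_` plotted above are computed from the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one feature's values are shuffled. A minimal sketch on synthetic stand-in data in which only the first feature carries signal; `X_demo`, `y_demo`, and `model` are hypothetical names, and the notebook's `rf`, `X_test`, and `y_test` could be used in their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (an assumption): the target depends on feature 0 only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# Fit on the first 150 rows, then measure importance on the held-out 50
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_demo[:150], y_demo[:150])

result = permutation_importance(
    model, X_demo[150:], y_demo[150:], n_repeats=10, random_state=0
)
print(result.importances_mean)
```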
    