
# **Airline Sentiment Analysis Using RNN**
**PROJECT DETAIL**:
1. HIMANSHI SHARMA (055012)
2. MUSKAN BOHRA (055025)

## **Objective**
The goal of this project is to analyze and classify the sentiment of tweets about major U.S. airlines using a Recurrent Neural Network (RNN). The work involves preprocessing the tweet text, training an RNN model to classify each tweet as negative or non-negative (positive and neutral combined), and visualizing key insights that can help airlines improve their services and reputation.

---

## **1. Installing & Importing Libraries**
We begin by installing and importing essential Python libraries needed for data analysis, visualization, and building deep learning models using PyTorch.

Libraries used:
- `pandas`, `numpy`: Data handling
- `matplotlib`, `seaborn`, `wordcloud`: Data visualization
- `torch`, `torch.nn`, `torch.optim`: Neural network modeling
- `sklearn`: Train-test split and model evaluation

---

## **2. Uploading and Extracting the Dataset**
The dataset, in zipped format, is uploaded and extracted into the Colab environment. The file `Tweets.csv` is read using pandas for further processing.

---

## **3. Load Dataset**
The dataset contains multiple fields, but we select only two columns:
- `text`: The tweet content
- `airline_sentiment`: The sentiment label (positive, negative, or neutral)

We then simplify the sentiment labels by mapping:
- `negative` → 0
- `neutral` and `positive` → 1

This transforms the problem into a binary classification task: negative vs. non-negative.

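The label mapping above can be sketched in plain Python as follows. The names `LABEL_MAP` and `to_binary_label` are illustrative and not taken from the project code; with a pandas DataFrame `df`, the same dictionary could be applied through `df["airline_sentiment"].map(LABEL_MAP)`.

```python
# Illustrative mapping from the three original labels to the binary target.
LABEL_MAP = {"negative": 0, "neutral": 1, "positive": 1}


def to_binary_label(sentiment: str) -> int:
    """Return 0 for negative tweets and 1 for neutral or positive ones."""
    return LABEL_MAP[sentiment.strip().lower()]


labels = [to_binary_label(s) for s in ["negative", "neutral", "positive"]]
print(labels)  # [0, 1, 1]
```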
---

## **4. Preprocessing**
To clean the tweet text data, we define a function that:
- Removes URLs and mentions
- Removes special characters
- Converts text to lowercase

Each tweet is then cleaned and stored in a new column.

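A minimal sketch of the tweet-cleaning function described in this section is shown below. The function name `clean_tweet` and the exact regular expressions are assumptions; in particular, this version also drops digits, which the original function may or may not do.

```python
import re


def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions and special characters, then lowercase."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"@\w+", " ", text)              # strip @mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # keep letters and whitespace only
    return re.sub(r"\s+", " ", text).strip().lower()


print(clean_tweet("@united Flight DELAYED again!! http://t.co/abc"))
# flight delayed again
```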
---

### **Tokenization and Vocabulary**
- We tokenize the cleaned text into words and count word frequency.
- A vocabulary dictionary is created to map each word to a unique integer.
- Each tweet is then represented as a sequence of integers using this vocabulary.

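The vocabulary step can be sketched as follows. The helper names `build_vocab` and `encode` are hypothetical, and reserving index 0 for padding is an assumption. Words missing from the vocabulary are skipped, in line with the out-of-vocabulary behaviour noted under the limitations.

```python
from collections import Counter

PAD_INDEX = 0  # assumed: index 0 is reserved for padding


def build_vocab(texts):
    """Map each word to a unique integer, most frequent words first."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {"<pad>": PAD_INDEX}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a cleaned tweet into a list of integers, skipping unseen words."""
    return [vocab[word] for word in text.split() if word in vocab]


tweets = ["flight delayed again", "great flight"]
vocab = build_vocab(tweets)
print(encode("great flight crew", vocab))  # [4, 1]
```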
---

### **Padding Sequences**
Tweets vary in length. To make them uniform, all sequences are padded (or truncated) to a fixed length (50 words). This ensures that input data is consistent for the RNN.

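Padding and truncation to 50 tokens can be illustrated with a short helper. Padding on the right with index 0 is an assumption here; the project may pad on the left or use a different padding value.

```python
MAX_LEN = 50  # fixed sequence length stated in the text


def pad_sequence(seq, max_len=MAX_LEN, pad_value=0):
    """Truncate to max_len, then right-pad with pad_value."""
    seq = list(seq)[:max_len]
    return seq + [pad_value] * (max_len - len(seq))


print(pad_sequence([1, 2, 3], max_len=5))  # [1, 2, 3, 0, 0]
print(len(pad_sequence(range(60))))        # 50
```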
---

## **5. Train-Test Split**
The data is split into training (80%) and test (20%) sets. The model is trained on one set and evaluated on the other, so the test accuracy estimates how well the model generalizes to unseen tweets and makes overfitting visible.

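An 80/20 split with scikit-learn might look like the sketch below. The toy data, the fixed `random_state`, and the use of `stratify` (which keeps the class ratio similar in both sets) are assumptions and not confirmed details of the project.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for padded sequences and binary labels.
X = [[i, i + 1] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 8 2
```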
---

## **6. DataLoader Preparation**
A custom `TweetDataset` class converts the padded sequences and their labels to PyTorch tensors. Wrapping it in PyTorch’s `DataLoader` then provides efficient batching and shuffling.

This prepares the data for RNN training in manageable mini-batches.

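One possible shape for `TweetDataset` is sketched below. The constructor arguments, the tensor dtypes, and the batch size of 2 are assumptions chosen for a small runnable example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TweetDataset(Dataset):
    """Holds padded integer sequences and their binary labels as tensors."""

    def __init__(self, sequences, labels):
        self.x = torch.tensor(sequences, dtype=torch.long)
        self.y = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


# Toy data: three sequences already padded to length 4.
sequences = [[1, 2, 0, 0], [3, 4, 5, 0], [6, 0, 0, 0]]
labels = [0, 1, 0]
loader = DataLoader(TweetDataset(sequences, labels), batch_size=2, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([2, 4]) torch.Size([2])
```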
---

## **7. RNN Model Definition**
The model consists of:
- **Embedding layer**: Converts word indices to dense vectors
- **LSTM layer**: Captures sequential relationships in text
- **Fully Connected (Linear) layer**: Produces a single output value
- **Sigmoid activation**: Converts output to a probability (0–1)

The model is initialized with appropriate vocabulary size, embedding dimensions, and hidden layer size.

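The four components listed above could be assembled as in the following sketch. The class name `SentimentLSTM`, the embedding and hidden sizes, and `padding_idx=0` (which assumes index 0 is the padding token) are illustrative choices, not the project's confirmed settings.

```python
import torch
import torch.nn as nn


class SentimentLSTM(nn.Module):
    """Embedding -> LSTM -> Linear -> Sigmoid, as described in the text."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)            # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)       # h_n: (1, batch, hidden_dim)
        logits = self.fc(h_n[-1])               # (batch, 1)
        return torch.sigmoid(logits).squeeze(1) # (batch,) probabilities


model = SentimentLSTM(vocab_size=100)
probs = model(torch.randint(0, 100, (4, 50)))  # 4 tweets of 50 tokens
print(probs.shape)  # torch.Size([4])
```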
---



## **8. Model Training**
The training loop runs for 5 epochs using:
- **Binary Cross-Entropy Loss** for classification
- **Adam Optimizer** for gradient updates

After each epoch, loss is printed, and a loss curve is plotted to visualize the model’s learning process.

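A training loop of this kind is sketched below on random stand-in data, using a deliberately small model (`TinyLSTM`, a hypothetical name). The learning rate, batch size, and model sizes are assumptions. The collected `epoch_losses` list is what a loss curve would be drawn from, for example with `matplotlib.pyplot.plot(epoch_losses)`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


class TinyLSTM(nn.Module):
    """Small stand-in for the sentiment model, for demonstration only."""

    def __init__(self, vocab_size=20, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        _, (h_n, _) = self.lstm(self.embedding(x))
        return torch.sigmoid(self.fc(h_n[-1])).squeeze(1)


# Random stand-in data: 64 sequences of 10 token indices with binary labels.
x = torch.randint(1, 20, (64, 10))
y = torch.randint(0, 2, (64,)).float()
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = TinyLSTM()
criterion = nn.BCELoss()  # binary cross-entropy on sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

epoch_losses = []
for epoch in range(5):
    model.train()
    running = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * xb.size(0)
    epoch_losses.append(running / len(loader.dataset))
    print(f"Epoch {epoch + 1}: loss = {epoch_losses[-1]:.4f}")
```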
---

## **9. Model Evaluation**
The model is evaluated on the test set by:
- Generating predictions
- Comparing with true labels
- Calculating **accuracy**
- Plotting a **confusion matrix** to understand prediction errors

**Predictions:**
The model’s predictions were inverse-transformed back to original scale using MinMaxScaler.inverse_transform().

The predicted stock prices were compared with the actual values using plots.

Evaluation Metrics:
Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)

These metrics help quantify how close the predictions are to actual prices.
---

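The accuracy and confusion-matrix computation from the evaluation step can be sketched without any library. The 0.5 decision threshold and the function name `evaluate` are assumptions; scikit-learn's `accuracy_score` and `confusion_matrix` produce the same quantities.

```python
def evaluate(probabilities, targets, threshold=0.5):
    """Return (accuracy, confusion matrix) with matrix[true][predicted]."""
    preds = [1 if p >= threshold else 0 for p in probabilities]
    matrix = [[0, 0], [0, 0]]
    for true_label, pred in zip(targets, preds):
        matrix[int(true_label)][pred] += 1
    accuracy = (matrix[0][0] + matrix[1][1]) / len(preds)
    return accuracy, matrix


acc, cm = evaluate([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(acc, cm)  # 0.5 [[1, 1], [1, 1]]
```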
## **10. Data Visualization**
Multiple visualizations are created to understand the dataset and insights:

### **Word Clouds**
- For **negative** tweets: Common words expressing dissatisfaction
- For **non-negative** tweets: Words representing positive or neutral experiences

### **Tweet Length Distribution**
- Shows how the length of tweets varies with sentiment
- Helps understand if longer tweets are more likely to be negative/positive

### **Airline-wise Negative Sentiment**
- Bar chart showing which airlines receive the most negative tweets
- United Airlines had the most, indicating potential service issues

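The counts behind the airline-wise bar chart could be computed as in this sketch. The rows are made-up toy data, and the assumption that each tweet carries an airline name alongside its sentiment is not stated in the text. The resulting counts could then be passed to a bar plot such as `matplotlib.pyplot.bar`.

```python
from collections import Counter


def negative_counts(rows):
    """Count negative tweets per airline from (airline, sentiment) pairs."""
    return Counter(airline for airline, sentiment in rows if sentiment == "negative")


# Hypothetical toy rows, not real dataset values.
rows = [
    ("United", "negative"),
    ("Delta", "positive"),
    ("United", "negative"),
    ("Delta", "negative"),
]
counts = negative_counts(rows)
print(counts.most_common())  # [('United', 2), ('Delta', 1)]
```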


---

## **11. Managerial Insights**
- Most tweets were negative, reflecting dissatisfaction with airline services.
- United Airlines had the highest count of negative tweets.
- The most common issues were delays, baggage loss, and customer service.
- Businesses can use real-time sentiment monitoring to mitigate PR issues and improve customer support.

---
## **12. Challenges and Limitations**
- **Class Imbalance**: The dataset was dominated by negative tweets (~63%), which may bias the model toward predicting the negative class.
- **Sarcasm Detection**: The RNN model may fail to capture sarcastic tones or contextually ambiguous expressions.
- **Model Simplicity**: While RNNs are powerful, Transformer-based models (like BERT) can provide higher accuracy in sentiment classification tasks.
- **Vocabulary Limitation**: Words not seen in the training set (out-of-vocabulary) are ignored, which may affect predictions.

---

## **13. Future Work**
Several enhancements could improve performance and deployment readiness:
- Use pre-trained word embeddings such as GloVe or FastText to improve word representation.
- Replace the unidirectional LSTM with a Bidirectional LSTM (BiLSTM) to capture both past and future context.
- Expand the model to multi-class sentiment classification (negative, neutral, positive).
- Deploy the model behind a Flask or Streamlit interface for real-time tweet classification.
- Integrate the Twitter API to fetch live tweets and classify sentiment continuously.
- Add an attention mechanism to focus on the most relevant words in a tweet during classification.

---

## **14. Conclusion**
The project demonstrates how RNNs can be effectively used for sentiment classification from social media text. The model was successfully trained on airline tweets and achieved promising accuracy on the test set. Visualizations provided meaningful insights into customer sentiment that can help airlines refine their operations.

This end-to-end solution showcases the real-world application of deep learning and NLP for customer feedback analysis.



