# Twitter Sentiment Analysis Using Machine Learning 

Social media platforms like Twitter serve as real-time hubs for public opinion, making sentiment analysis a useful tool for businesses, governments, and researchers. This project develops a machine learning pipeline that classifies the sentiment expressed in tweets. Trained on a labeled set of public tweets, the model assigns each tweet to one of the dataset's sentiment categories (here: positive, negative, neutral, or irrelevant) based on its textual content.

The core objectives of this project are:
1. **Understanding Sentiment Trends**: Identifying how users feel about a particular topic, product, or event.
2. **Automating Sentiment Classification**: Building a robust and scalable pipeline to classify tweets automatically.
3. **Providing Actionable Insights**: Offering predictions that can be used for business strategies, public relations, or market analysis.

#### Methodology

1. **Dataset Preparation**:
   - The dataset is sourced from a training file containing tweets labeled with sentiment categories (e.g., positive, negative).
   - Initial steps include data cleaning to handle missing values in the `Tweet` column, ensuring the model is trained on complete and reliable data.

2. **Text Representation**:
   - Textual data is converted into numerical form using the **TF-IDF Vectorizer**. This method highlights the importance of words in individual tweets relative to the dataset.

3. **Model Training**:
   - The **Naive Bayes classifier** (the multinomial variant), a common baseline for text classification, is chosen. A pipeline chains the TF-IDF vectorizer and the classifier, so the same vectorization is applied at training time and at prediction time.

4. **Evaluation**:
   - The dataset is split into training and testing sets to evaluate model performance.
   - Metrics such as **accuracy score** and **classification report** (precision, recall, F1-score) assess the model’s predictive capability.

5. **Validation**:
   - A function is provided to accept new datasets, preprocess them, and predict sentiments for unseen tweets.

6. **Output**:
   - Predictions for new tweets are saved to a CSV file (`predicted_sentiments.csv`). The overall (most frequent) predicted sentiment for the dataset is printed to the console.

#### Applications

- **Customer Feedback Analysis**: Analyze user feedback to improve services and products.
- **Social Media Monitoring**: Track public opinion on specific campaigns or events.
- **Crisis Management**: Detect negative sentiments during crises and respond proactively.
- **Market Research**: Understand consumer preferences and sentiment trends in industries.

In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Path to the training dataset
data_path = r"C:/Users/Asus/Desktop/Rishabh/twitter/twitter_training.csv"

In [3]:
# Loading the training dataset
data = pd.read_csv(data_path)
print("Training Data Loaded:")
print(data.head())

Training Data Loaded:
   Unnamed: 0    ID     Category Sentiment  \
0           0  2401  Borderlands  Positive   
1           1  2401  Borderlands  Positive   
2           2  2401  Borderlands  Positive   
3           3  2401  Borderlands  Positive   
4           4  2401  Borderlands  Positive   

                                               Tweet  
0  I am coming to the borders and I will kill you...  
1  im getting on borderlands and i will kill you ...  
2  im coming on borderlands and i will murder you...  
3  im getting on borderlands 2 and i will murder ...  
4  im getting into borderlands and i can murder y...  


In [4]:
# Checking if required columns are present
if 'Tweet' not in data.columns or 'Sentiment' not in data.columns:
    raise ValueError("The dataset must contain 'Tweet' and 'Sentiment' columns.")

In [5]:
# Handling missing values in the 'Tweet' column
data.dropna(subset=['Tweet'], inplace=True)  # Dropping rows where 'Tweet' is NaN

#### Splitting data into training and testing sets

In [6]:
X = data['Tweet']
y = data['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
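
The split above is purely random. One possible variation, shown here only as a sketch on invented data, is to pass `stratify` so that the class proportions in the test set match those of the full dataset. The `toy` DataFrame and its 80/20 class balance are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 Negative tweets, 20 Irrelevant tweets
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(100)],
    "Sentiment": ["Negative"] * 80 + ["Irrelevant"] * 20,
})

# stratify keeps the 80/20 class ratio in both the training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Tweet"],
    toy["Sentiment"],
    test_size=0.2,
    random_state=42,
    stratify=toy["Sentiment"],
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

With stratification, the 20-row test set contains 16 `Negative` and 4 `Irrelevant` tweets, mirroring the full dataset. Whether this changes the reported metrics on the real data has not been measured here.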

#### Creating and training the model pipeline

In [7]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

#### Evaluating the model on the test set

In [8]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 72.17%
Classification Report:
              precision    recall  f1-score   support

  Irrelevant       0.96      0.39      0.56      2624
    Negative       0.64      0.90      0.75      4463
     Neutral       0.85      0.61      0.71      3589
    Positive       0.70      0.83      0.76      4123

    accuracy                           0.72     14799
   macro avg       0.79      0.68      0.70     14799
weighted avg       0.77      0.72      0.71     14799
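
The report shows high precision but low recall for `Irrelevant`. A confusion matrix is one way to see where the missed tweets end up. The sketch below uses small invented label lists (not the notebook's `y_test` and `y_pred`) to show how per-class precision and recall follow from the matrix.

```python
from sklearn.metrics import confusion_matrix

labels = ["Irrelevant", "Negative", "Positive"]

# Hypothetical true and predicted labels, chosen so that two of the three
# Irrelevant tweets are misclassified as Negative
y_true_toy = ["Irrelevant", "Irrelevant", "Irrelevant", "Negative", "Negative", "Positive"]
y_pred_toy = ["Irrelevant", "Negative", "Negative", "Negative", "Negative", "Positive"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all tweets truly in the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all tweets predicted as the class

for name, p, r in zip(labels, precision, recall):
    print(f"{name:<11} precision={p:.2f} recall={r:.2f}")
```

In this toy case `Irrelevant` has precision 1.00 and recall 0.33: every tweet predicted as `Irrelevant` is correct, but most truly irrelevant tweets are absorbed by `Negative`. Running `confusion_matrix(y_test, y_pred)` on the real predictions would show whether the same pattern explains the numbers above.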



In [9]:
# Function to load and process the validation dataset
def load_validation_data(file_path):
    if file_path.endswith('.csv'):
        validation_data = pd.read_csv(file_path)
    elif file_path.endswith('.xlsx'):
        validation_data = pd.read_excel(file_path)
    else:
        raise ValueError("Please upload a CSV or Excel file.")
    
    # Ensuring the file has the 'Tweet' column
    if 'Tweet' not in validation_data.columns:
        raise ValueError("The uploaded file must contain a 'Tweet' column.")
    
    return validation_data

In [10]:
# Main function to run the sentiment analysis
def main():
    # Gets file path from user input
    file_path = input("Please enter the path to your validation CSV or Excel file: ")
    
    try:
        # Loads validation data
        validation_data = load_validation_data(file_path)

        # Handles missing values in validation data
        validation_data.dropna(subset=['Tweet'], inplace=True)  # Drops rows where 'Tweet' is NaN

        # Predicts sentiments on validation data in a single batch call
        validation_data['Predicted_Sentiment'] = model.predict(validation_data['Tweet'])
        
        # Displays the prediction results
        print("Prediction Results:")
        print(validation_data[['Tweet', 'Predicted_Sentiment']])
        print("Overall result:", validation_data['Predicted_Sentiment'].mode()[0])

        # Saves results to a CSV file
        output_file = "predicted_sentiments.csv"
        validation_data.to_csv(output_file, index=False)
        print(f"Results saved to {output_file}.")

    except Exception as e:
        print(f"An error occurred: {e}")
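
The overall result reported by `main()` is the mode of the predicted labels. As a sketch of how that summary could be extended, the example below also computes the share of each predicted label. The `predicted` Series is invented for illustration and stands in for the `Predicted_Sentiment` column.

```python
import pandas as pd

# Hypothetical predictions standing in for validation_data['Predicted_Sentiment']
predicted = pd.Series(
    ["Negative", "Positive", "Negative", "Neutral"],
    name="Predicted_Sentiment",
)

overall = predicted.mode()[0]                     # most frequent label
shares = predicted.value_counts(normalize=True)   # fraction of tweets per label

print("Overall result:", overall)
print(shares)
```

Reporting the shares alongside the mode shows how dominant the overall label really is: here `Negative` is the mode but covers only half of the tweets.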

In [12]:
# Runs the main function
if __name__ == "__main__":
    main()

Please enter the path to your validation CSV or Excel file: C:\Users\Asus\Desktop\Rishabh\twitter\validation_dataset2.xlsx
Prediction Results:
                                                Tweet Predicted_Sentiment
0   BBC News - Amazon boss Jeff Bezos rejects clai...             Neutral
1   @Microsoft Why do I pay for WORD when it funct...            Negative
2   CSGO matchmaking is so full of closet hacking,...            Negative
3   Now the President is slapping Americans in the...             Neutral
4   Hi @EAHelp I’ve had Madeleine McCann in my c...              Negative
5   Thank you @EAMaddenNFL!! \n\nNew TE Austin Hoo...            Positive
6   Rocket League, Sea of Thieves or Rainbow Six: ...            Positive
7   my ass still knee-deep in Assassins Creed Odys...            Positive
8   FIX IT JESUS ! Please FIX IT ! What In the wor...            Negative
9   The professional dota 2 scene is fucking explo...            Positive
10  Itching to assassinate \n\n#TCCGif #Ass

### Final Conclusion

This project demonstrates a baseline machine learning approach to sentiment analysis on Twitter data. A single scikit-learn pipeline handles vectorization and classification, which keeps the solution simple while reaching about 72% accuracy on the held-out test set.

#### Key Takeaways

1. **Model Performance**:
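   - The model combines a TF-IDF vectorizer with a Multinomial Naive Bayes classifier, both of which are fast to train on text data.
   - On the held-out test set the model reaches 72.17% accuracy, with a macro-averaged F1-score of 0.70. Performance is uneven across classes: `Irrelevant` has high precision (0.96) but low recall (0.39), so many irrelevant tweets are assigned to other classes.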

2. **Scalability**:
   - Because vectorization lives inside the pipeline, a new dataset only needs a `Tweet` column. After rows with missing tweets are dropped, the raw text is passed straight to the model with no separate preprocessing step.

3. **Practical Applications**:
   - This tool can empower businesses to gain insights into customer sentiments, enabling data-driven decision-making.
   - Beyond businesses, governments and NGOs can utilize this model to gauge public sentiment on policies or events.

4. **Challenges Addressed**:
   - Rows with a missing `Tweet` value are dropped before training and prediction, so the vectorizer only receives actual text.
   - Automating the prediction and saving results streamlined the workflow for end-users.

5. **Future Scope**:
   - **Enhanced Models**: Experimenting with advanced machine learning models like Support Vector Machines (SVM) or deep learning architectures (e.g., LSTMs).
   - **Broader Sentiment Categories**: Expanding beyond the four labels used here (positive, negative, neutral, irrelevant) to capture nuanced emotions like excitement or anger.
   - **Real-time Analysis**: Implementing real-time data fetching and sentiment classification directly from Twitter’s API.
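
As a rough illustration of the "Enhanced Models" idea, the sketch below swaps the Naive Bayes step for scikit-learn's `LinearSVC` while keeping the same pipeline shape. The four training tweets are invented, and no claim is made that this model would outperform Naive Bayes on the real dataset; that would need to be measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (not from the project's dataset)
train_texts = [
    "I love this game",
    "great fun, love it",
    "I hate this game",
    "awful bugs, hate it",
]
train_labels = ["Positive", "Positive", "Negative", "Negative"]

# Same structure as the notebook's pipeline, with a linear SVM as the classifier
svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_model.fit(train_texts, train_labels)

preds = list(svm_model.predict(["love the great fun", "awful bugs hate"]))
print(preds)
```

Because the classifier is the only step that changes, the evaluation and prediction code from the notebook could be reused unchanged with `svm_model` in place of `model`.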

---

In conclusion, this project underscores the transformative potential of machine learning in understanding human emotions and opinions at scale. By translating unstructured text data into actionable insights, this tool bridges the gap between raw social media data and strategic decision-making.