## Twitter Sentiment Analysis Project

Here is the complete project walkthrough with all the Python code. You can copy and paste this entire code block into a single Google Colab cell and run it.

### Step 1: Setup, Load Data, and Explore

First, we import our libraries and load a suitable dataset from Kaggle. This dataset contains tweets already labeled as `Positive`, `Negative`, `Neutral`, and `Irrelevant`. We drop the `Irrelevant` rows so the remaining labels match the three classes in your project description.

### Step 2: Text Preprocessing (NLTK)

This is where we clean the text data. We'll create a function that performs the exact steps you listed:

1.  **Clean Text:** Remove URLs, hashtags, mentions, and non-alphanumeric characters.
2.  **Tokenize:** Split the text into individual words.
3.  **Remove Stopwords:** Filter out common words (`"the"`, `"is"`, `"in"`) that don't add meaning.
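4.  **Lemmatize:** Reduce words to their dictionary form ("studies" -> "study", "tweets" -> "tweet"). NLTK's `WordNetLemmatizer` treats every word as a noun unless it is given a part-of-speech tag, so verb forms such as "running" are left unchanged by the code in this walkthrough.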
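
To make the first three steps concrete, here is a minimal sketch that runs them on a single made-up tweet using only the standard library. The `STOP_WORDS` set and the `clean_and_tokenize` helper are illustrative assumptions, not part of the project code, and lemmatization is left out because it needs the NLTK downloads.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch;
# the full code uses NLTK's English stopword list instead).
STOP_WORDS = {"the", "is", "in", "a", "i", "at", "this", "so"}

def clean_and_tokenize(tweet):
    """Strip URLs, mentions, hashtags and non-letters, then drop stopwords."""
    tweet = re.sub(r"http\S+", "", tweet)      # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)         # remove mentions
    tweet = re.sub(r"#\w+", "", tweet)         # remove hashtags
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)  # keep letters and whitespace only
    tokens = tweet.lower().split()             # simple whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

sample = "@bob The service in this cafe is SO slow!!! #fail http://t.co/xyz"
print(clean_and_tokenize(sample))  # ['service', 'cafe', 'slow']
```

Only the words that carry meaning survive, which is what the later TF-IDF step works with.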

### Step 3: Feature Engineering (Scikit-learn)

A machine learning model can't read words. We must convert the cleaned text into a numerical format. We will use **TF-IDF** (Term Frequency-Inverse Document Frequency), a standard and effective method.

  * **TF (Term Frequency):** How often a word appears in a single tweet.
  * **IDF (Inverse Document Frequency):** How "important" a word is. Words that appear in *every* tweet (like "tweet") are less important than words that appear in only a few (like "awesome").
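
The two definitions above can be worked through by hand. The sketch below computes TF-IDF weights in plain Python for three made-up tweets, assuming the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` applies with its default settings. The helper names (`doc_freq`, `idf`, `tfidf`) are hypothetical.

```python
import math
from collections import Counter

docs = ["awesome tweet", "boring tweet", "tweet tweet again"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many tweets does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    counts = Counter(doc)                                   # term frequency
    weights = {t: c * idf(t) for t, c in counts.items()}    # TF * IDF
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm
    return {t: w / norm for t, w in weights.items()}

first = tfidf(tokenized[0])
print(first)
```

In the first tweet, "awesome" ends up with a larger weight than "tweet", because "tweet" appears in every document and therefore gets the lowest possible IDF.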

### Step 4: Model Training and Evaluation

This is the machine learning part.

1.  **Split Data:** We'll use 80% of our data to *train* the model and 20% to *test* it.
2.  **Train Model:** We'll use a **Multinomial Naive Bayes** classifier. It's a classic, fast, and highly effective model for text classification.
3.  **Evaluate:** We'll check the model's performance on the unseen test data and print the **accuracy score** and a detailed **classification report**.
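
To show what the accuracy score and the per-class numbers in a classification report measure, here is a small sketch that computes them by hand on made-up labels. The `accuracy` and `precision_recall` helpers are hypothetical stand-ins for illustration; the project code uses scikit-learn's `accuracy_score` and `classification_report` instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # times the model said `label`
    actual = sum(t == label for t in y_true)     # times `label` was correct
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# 0 = Negative, 1 = Neutral, 2 = Positive (made-up example labels)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(accuracy(y_true, y_pred))             # 4 of 6 correct
print(precision_recall(y_true, y_pred, 1))  # Neutral: precision 2/3, recall 1.0
```

Accuracy alone can hide a class the model handles badly, which is why the per-class precision and recall are worth reading as well.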

-----

## Complete Python Code for Google Colab

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# --- 1. SETUP AND DATA LOADING ---

# Download necessary NLTK data (stopwords, tokenizer, lemmatizer)
# You only need to run these lines once
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')  # Tokenizer data that word_tokenize needs in recent NLTK releases

# Load the dataset
# files.upload() opens a file picker; choose 'twitter_training.csv' when prompted
from google.colab import files
uploaded = files.upload()

# Assumption: this is the Kaggle "Twitter Entity Sentiment Analysis" file, which
# ships without a header row and has 4 columns (id, entity, sentiment, text).
# If your copy has a header row, drop header=None/names and select its own columns.
df = pd.read_csv('twitter_training.csv', header=None,
                 names=['tweet_id', 'entity', 'sentiment', 'text'])

# We only need the tweet text and the sentiment label.
df = df[['text', 'sentiment']]

# Drop rows with missing text
df = df.dropna(subset=['text'])

# Drop 'Irrelevant' to match the Positive/Negative/Neutral goal.
# .copy() avoids pandas' SettingWithCopyWarning when we add a column below.
df = df[df['sentiment'] != 'Irrelevant'].copy()
# Map sentiments to integers (optional, but it fixes the class order used in the report)
df['sentiment_label'] = df['sentiment'].map({'Positive': 2, 'Neutral': 1, 'Negative': 0})

print("Data loaded and cleaned. Head:")
print(df.head())
print("\nSentiment distribution:")
print(df['sentiment'].value_counts())


# --- 2. TEXT PREPROCESSING (NLTK) ---

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Ensure the input is a string
    if not isinstance(text, str):
        return "" # Return empty string for non-string inputs

    # 1. Clean: Remove URLs, mentions, hashtags, and non-alphanumeric chars
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'#\w+', '', text)     # Remove hashtags
    text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove punctuation/numbers

    # 2. Tokenize
    tokens = word_tokenize(text.lower())

    # 3. Remove Stopwords and Lemmatize
    cleaned_tokens = []
    for token in tokens:
        if token not in stop_words:
            cleaned_tokens.append(lemmatizer.lemmatize(token))

    # 4. Join back into a string
    return " ".join(cleaned_tokens)

print("\nStarting preprocessing...")
# Apply the preprocessing function to our text column
df['processed_text'] = df['text'].apply(preprocess_text)
print("Preprocessing complete. Processed text example:")
print(df[['text', 'processed_text']].head())


# --- 3. FEATURE ENGINEERING (SCIKIT-LEARN) ---

print("\nStarting feature engineering (TF-IDF)...")

# Split the text first so the vectorizer never sees the test tweets
# 80% for training, 20% for testing; stratify keeps the class proportions equal
text_train, text_test, y_train, y_test = train_test_split(
    df['processed_text'], df['sentiment_label'],
    test_size=0.2, random_state=42, stratify=df['sentiment_label']
)

# Initialize the TF-IDF Vectorizer
# max_features=5000 keeps only the 5000 most frequent terms in the training text
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Learn the vocabulary and IDF weights from the training split only,
# then reuse them to transform the test split
X_train = tfidf_vectorizer.fit_transform(text_train)
X_test = tfidf_vectorizer.transform(text_test)

print("TF-IDF training matrix shape:", X_train.shape)


# --- 4. MODEL TRAINING AND EVALUATION ---

print("\nTraining model...")

# Initialize the Multinomial Naive Bayes classifier
model = MultinomialNB()

# Train the model
model.fit(X_train, y_train)

print("Model trained.")


# --- 5. EVALUATION ---

print("\nEvaluating model performance...")

# Make predictions on the unseen test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\n--- Model Accuracy ---")
print(f"Accuracy: {accuracy * 100:.2f}%")

# Print the detailed classification report
print("\n--- Classification Report ---")
target_names = ['Negative', 'Neutral', 'Positive'] # Must match the order of our labels (0, 1, 2)
print(classification_report(y_test, y_pred, target_names=target_names))


# --- 6. HOW TO USE THE MODEL (Example) ---

print("\n--- Testing Model on New Tweets ---")
new_tweets = [
    "This is the best movie I have ever seen! Amazing.",
    "I'm not sure how I feel about this product, it's just okay.",
    "What a terrible experience. I will never buy this again.",
    "The airline lost my luggage, I am so angry!"
]

# 1. Preprocess the new tweets
processed_new_tweets = [preprocess_text(tweet) for tweet in new_tweets]

# 2. Transform using the *same* TF-IDF vectorizer
new_tweets_tfidf = tfidf_vectorizer.transform(processed_new_tweets)

# 3. Predict
predictions = model.predict(new_tweets_tfidf)

# Map predictions back to labels
predicted_labels = [target_names[p] for p in predictions]

for tweet, label in zip(new_tweets, predicted_labels):
    print(f"Tweet: '{tweet}'")
    print(f"Predicted Sentiment: {label}\n")


Data loaded and cleaned. Head:
                                                text sentiment  \
0  im getting on borderlands and i will murder yo...  Positive   
1  I am coming to the borders and I will kill you...  Positive   
2  im getting on borderlands and i will kill you ...  Positive   
3  im coming on borderlands and i will murder you...  Positive   
4  im getting on borderlands 2 and i will murder ...  Positive   

   sentiment_label  
0                2  
1                2  
2                2  
3                2  
4                2  

Sentiment distribution:
sentiment
Negative    22358
Positive    20655
Neutral     18108
Name: count, dtype: int64

Starting preprocessing...
Preprocessing complete. Processed text example:
                                                text  \
0  im getting on borderlands and i will murder yo...   
1  I am coming to the borders and I will kill you...   
2  im getting on borderlands and i will kill you ...   
3  im coming on borderlands and 

### How to Achieve 85% Accuracy

The code above will likely yield an accuracy between 70% and 80%. Reaching the **85%** accuracy mentioned in your project description requires **tuning and experimentation**.

Here are the steps you would take to improve the score:

1.  **Try a Different Model:** Naive Bayes is a good baseline, but other models often perform better on this task. Replace `MultinomialNB()` with one of these:
      * `from sklearn.linear_model import LogisticRegression`
        `model = LogisticRegression(max_iter=1000)`
      * `from sklearn.svm import LinearSVC`
        `model = LinearSVC()`
2.  **Tune Hyperparameters:** You can use `GridSearchCV` from Scikit-learn to search for the best settings automatically. To tune the vectorizer (e.g., finding the optimal `max_features` for TF-IDF) together with the model, wrap both in a `Pipeline` so the search can vary their parameters jointly.
3.  **Use a Better Dataset:** A larger, more balanced dataset (like the 1.6 million-tweet Sentiment140 dataset, though it's only positive/negative) can produce a more robust model.
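
As a rough illustration of the first two suggestions, the sketch below wraps a TF-IDF vectorizer and a `LogisticRegression` model in a `Pipeline` and lets `GridSearchCV` try a few settings. The toy `texts` and `labels`, the step names `tfidf` and `clf`, and the parameter grid are assumptions chosen so the example runs in seconds. With the real data you would pass the training portion of `df['processed_text']` and `df['sentiment_label']` and use a larger `cv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only: 2 = Positive, 0 = Negative, 1 = Neutral
texts = [
    "love this great phone", "awesome service really happy",
    "great game love it", "happy with this awesome update",
    "hate this terrible phone", "awful service really angry",
    "terrible game hate it", "angry about this awful update",
    "phone arrives on monday", "service opens at nine",
    "game releases next week", "update is scheduled today",
]
labels = [2] * 4 + [0] * 4 + [1] * 4

# The vectorizer is refit inside every cross-validation fold,
# so no information from the held-out fold leaks into training.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {
    "tfidf__max_features": [20, None],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` is a pipeline refit on all the data passed to `fit`. It accepts raw preprocessed strings directly, e.g. `search.best_estimator_.predict(["terrible update"])`.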