<a href="https://colab.research.google.com/github/Merina62/AI-and-ML/blob/main/MerinaShrestha_Worksheet_8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Helper Function for Text Cleaning:

Implement a helper function as in the Text Preprocessing notebook and complete the following pipeline.

# Build a Text Cleaning Pipeline

In [8]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

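# Download the 'stopwords' dataset.
# word_tokenize and WordNetLemmatizer also need the 'punkt'/'punkt_tab' and
# 'wordnet' resources; those are downloaded in a later cell before first use.
nltk.download('stopwords')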

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def text_cleaning_pipeline(dataset, rule="lemmatize"):
    """
    Clean the input text data.

    Args:
        dataset (str): Text to clean.
        rule (str): Either 'lemmatize' or 'stem'.

    Returns:
        str: Cleaned and processed text.
    """
    # Convert to lowercase
    data = dataset.lower()

    # Remove URLs (http\S+ already covers https links)
    data = re.sub(r'http\S+|www\S+', '', data)

    # Remove non-ASCII characters (drops emojis and any accented letters)
    data = re.sub(r'[^\x00-\x7F]+', '', data)

    # Remove punctuation and other unwanted characters
    data = re.sub(r'[^\w\s]', '', data)

    # Tokenize the text
    tokens = nltk.word_tokenize(data)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Apply stemming or lemmatization
    if rule == "lemmatize":
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    elif rule == "stem":
        tokens = [stemmer.stem(word) for word in tokens]
    else:
        # Fail loudly instead of silently returning an empty string
        raise ValueError(f"Unknown rule {rule!r}: pick 'lemmatize' or 'stem'")

    return " ".join(tokens)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
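The regex steps of the pipeline can be tried in isolation, without any NLTK downloads. The following is a minimal sketch; the function name `strip_noise` and the sample tweet are made up for illustration, and the `@\w+` mention step follows the assignment instructions.

```python
import re

def strip_noise(text):
    """Regex-only part of the cleaning steps (no tokenizer or stopwords)."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"@\w+", "", text)             # @mentions
    text = re.sub(r"[^\x00-\x7F]+", "", text)    # non-ASCII, e.g. emojis
    text = re.sub(r"[^\w\s]", "", text)          # punctuation and symbols
    return " ".join(text.split())                # collapse leftover whitespace

# Hypothetical sample tweet
sample = "RT @user: Check https://example.com NOW!!! 😀 #Trump"
print(strip_noise(sample))
```

Mentions have to be removed before punctuation; otherwise the `@` is stripped first and the username survives as an ordinary word.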


# Text Classification using Machine Learning Models


### 📝 Instructions: Trump Tweet Sentiment Classification

1. **Load the Dataset**  
   Load the dataset named `"trump_tweet_sentiment_analysis.csv"` using `pandas`. Ensure the dataset contains at least two columns: the tweet `"text"` and a label column (named `"Sentiment"` in the file used here).

2. **Text Cleaning and Tokenization**  
   Apply a text preprocessing pipeline to the `"text"` column. This should include:
   - Lowercasing the text  
   - Removing URLs, mentions, punctuation, and special characters  
   - Removing stopwords  
   - Tokenization (optional: stemming or lemmatization)
   - Completing the `text_cleaning_pipeline` function above

3. **Train-Test Split**  
   Split the cleaned and tokenized dataset into **training** and **testing** sets using `train_test_split` from `sklearn.model_selection`.

4. **TF-IDF Vectorization**  
   Import and use the `TfidfVectorizer` from `sklearn.feature_extraction.text` to transform the training and testing texts into numerical feature vectors.

5. **Model Training and Evaluation**  
   Import **Logistic Regression** (or any machine learning model of your choice) from `sklearn.linear_model`. Train it on the TF-IDF-embedded training data, then evaluate it using the test set.  
   - Print the **classification report** using `classification_report` from `sklearn.metrics`.


# Read the data.

In [10]:
import pandas as pd
import numpy as np

In [11]:
df = pd.read_csv('/content/drive/MyDrive/AI- 6CS012/Week-8/trum_tweet_sentiment_analysis.csv')

In [12]:
df.head()

| Unnamed: 0 | text | Sentiment |
|---|---|---|
| 0 | RT @JohnLeguizamo: #trump not draining swamp b... | 0 |
| 1 | ICYMI: Hackers Rig FM Radio Stations To Play A... | 0 |
| 2 | Trump protests: LGBTQ rally in New York https:... | 1 |
| 3 | "Hi I'm Piers Morgan. David Beckham is awful b... | 0 |
| 4 | RT @GlennFranco68: Tech Firm Suing BuzzFeed fo... | 0 |
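Before cleaning, it can help to confirm that the expected columns are present and to look at the class balance. This is a sketch on a hypothetical two-row stand-in for the real CSV; the column names `text` and `Sentiment` are taken from the `df.head()` output above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file, so the snippet runs on its own
csv_text = "text,Sentiment\nGreat rally today,1\nTerrible news,0\n"
sample_df = pd.read_csv(io.StringIO(csv_text))

# Check that the columns used later in the notebook exist
required = {"text", "Sentiment"}
missing = required - set(sample_df.columns)
if missing:
    raise KeyError(f"Missing columns: {sorted(missing)}")

# Fraction of rows in each class
class_share = sample_df["Sentiment"].value_counts(normalize=True)
print(class_share)
```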


# Text Cleaning and Tokenization

Apply a text preprocessing pipeline to the `"text"` column. This should include:

- Lowercasing the text
- Removing URLs, mentions, punctuation, and special characters
- Removing stopwords
- Tokenization (optional: stemming or lemmatization)
- Completing the `text_cleaning_pipeline` function above

In [13]:
!pip install nltk
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('stopwords')
nltk.download('punkt')      # Punkt tokenizer models
nltk.download('wordnet')    # Needed by WordNetLemmatizer
nltk.download('punkt_tab')  # Tokenizer tables expected by word_tokenize in newer NLTK releases

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def text_cleaning_pipeline(text, rule="lemmatize"):
    # Lowercase
    text = text.lower()

    # Remove URLs, mentions, punctuation/special chars, and non-ASCII characters
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatize (the only rule this version supports)
    if rule != "lemmatize":
        raise ValueError(f"Unsupported rule {rule!r}; use 'lemmatize'.")
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)

# Apply to dataset
df['cleaned_text'] = df['text'].apply(text_cleaning_pipeline)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# Train-Test Split

Split the cleaned and tokenized dataset into training and testing sets using `train_test_split` from `sklearn.model_selection`.

In [14]:
from sklearn.model_selection import train_test_split

X = df['cleaned_text']
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
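The test set reported at the end of this notebook holds 248563 negative and 121462 positive tweets, so the classes are roughly 2:1. One optional variant, not used in the cell above, is to pass `stratify` so that both splits keep that ratio. A sketch on toy data (the `texts` and `labels` lists are made up):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data: 8 negative (0) and 4 positive (1) examples
texts = [f"doc {i}" for i in range(12)]
labels = [0] * 8 + [1] * 4

# stratify=labels keeps the 2:1 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
print(Counter(y_tr), Counter(y_te))
```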


# TF-IDF Vectorization

Import and use the `TfidfVectorizer` from `sklearn.feature_extraction.text` to transform the training and testing texts into numerical feature vectors.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
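To see what the vectorizer computes, here is a pure-Python sketch of TF-IDF weighting under scikit-learn's default settings as I understand them: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization. The toy corpus and the helper name `tfidf` are made up, and the real vectorizer also applies its own tokenization and the `max_features` cap.

```python
import math
from collections import Counter

# Toy corpus standing in for the cleaned training tweets
docs = ["great rally great crowd", "bad news", "great news"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df_counts = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    """Return L2-normalized TF-IDF weights for one document of the corpus."""
    tf = Counter(doc_tokens)
    weights = {
        t: count * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

The document frequencies come only from the fitted corpus, which is why the cell above calls `fit_transform` on the training texts and plain `transform` on the test texts.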


# Model Training and Evaluation

Import Logistic Regression (or any machine learning model of your choice) from `sklearn.linear_model`. Train it on the TF-IDF-embedded training data, then evaluate it using the test set.

Print the classification report using `classification_report` from `sklearn.metrics`.

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Predict
y_pred = model.predict(X_test_tfidf)

# Evaluation
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.93      0.96      0.94    248563
           1       0.90      0.86      0.88    121462

    accuracy                           0.92    370025
   macro avg       0.92      0.91      0.91    370025
weighted avg       0.92      0.92      0.92    370025
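The two average rows can be approximately recomputed from the per-class rows: the macro average is the plain mean over classes, and the weighted average weights each class by its support. The sketch below uses the two-decimal values printed above, so its results match the report only to within rounding.

```python
# Per-class rows copied from the report: (precision, recall, f1, support)
rows = {
    0: (0.93, 0.96, 0.94, 248563),
    1: (0.90, 0.86, 0.88, 121462),
}
total = sum(r[3] for r in rows.values())

# Macro average: unweighted mean over the classes
macro = [sum(r[i] for r in rows.values()) / len(rows) for i in range(3)]

# Weighted average: mean over the classes weighted by support
weighted = [sum(r[i] * r[3] for r in rows.values()) / total for i in range(3)]

print("macro   ", [f"{m:.3f}" for m in macro])
print("weighted", [f"{w:.3f}" for w in weighted])
```

The support-weighted recall equals the overall accuracy by construction, which is why both appear as 0.92 in the report.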

