<a href="https://colab.research.google.com/github/Gopikuppala7/MachineLearning/blob/main/Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To classify tweets with Natural Language Processing (NLP) techniques and TF-IDF, we first preprocess the text (cleaning, stop-word removal, stemming, and lemmatization) and then apply TF-IDF to turn it into numeric features for classification.

Step 1: Load the Data
First, we need to load the tweet data from the CSV file.

In [9]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/Tweet_Data.csv')

# Display the first few rows of the dataframe
df.head()

Unnamed: 0,id,created_at,full_text,sentiment
0,1.03e+18,24/08/2018 05:46,"b""@twolfuk wow, that was quiet a dramatic tale...",1
1,1.03e+18,24/08/2018 05:46,"b""No stream this weekend i'm going to a music ...",1
2,1.03e+18,24/08/2018 05:46,"b""yes it is true, a stream schedule so you kno...",1
3,1.03e+18,24/08/2018 05:45,"b""@ThorntonParsons @1776Stonewall You know the...",1
4,1.03e+18,24/08/2018 05:45,"b""@CalamityDeath I honestly barely play too lo...",1
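The `full_text` values above begin with `b"`, which suggests each tweet was saved as the printed form of a Python bytes object rather than as decoded text. If so, escape sequences such as `\xe2\x80\x99` (a curly apostrophe in UTF-8) survive as literal characters, which would explain tokens like `youxexxr` in the TF-IDF vocabulary later on. Below is a minimal sketch of undoing this before cleaning, assuming the strings are valid bytes literals. `decode_bytes_literal` is a hypothetical helper, not part of the original notebook.

```python
import ast

def decode_bytes_literal(text):
    """Turn a string such as 'b"caf\\xc3\\xa9"' back into readable text.

    Values that do not look like a bytes literal are returned unchanged.
    """
    if isinstance(text, str) and text[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(text)  # safely parse the literal
        except (ValueError, SyntaxError):
            return text  # malformed literal: leave it alone
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
    return text

# Made-up sample in the same shape as the rows shown above
sample = r'b"you\xe2\x80\x99re welcome"'
decoded = decode_bytes_literal(sample)
```

In the notebook this could be applied with `df['full_text'].apply(decode_bytes_literal)` before the cleaning step.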


Step 2: Data Cleaning
Before further processing, clean the text: strip URLs, drop every character that is not a letter or whitespace (numbers, punctuation, emoji), and lowercase the result.

In [10]:
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    # Keep only ASCII letters and whitespace. Flags must be passed by keyword:
    # the fourth positional argument of re.sub is `count`, not `flags`.
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
    text = text.lower()  # Convert to lowercase to maintain consistency
    return text

df['cleaned_text'] = df['full_text'].apply(clean_text)
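One detail worth checking in a cleaning function like this is how flags reach `re.sub`. Its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value passed as the fourth positional argument is read as a replacement limit. The sketch below shows the difference on a made-up string; `count=int(re.I | re.A)` reproduces what the positional form effectively does.

```python
import re

PATTERN = r'[^a-zA-Z\s]'
sample = "1" * 300 + "abc"  # made-up input: 300 digits followed by letters

# re.I | re.A has the integer value 258, so as `count` it caps the
# number of replacements at 258 and leaves the rest of the digits in place.
limit = int(re.I | re.A)
as_count = re.sub(PATTERN, '', sample, count=limit)

# Passed by keyword, the flags behave as intended and every digit is removed.
as_flags = re.sub(PATTERN, '', sample, flags=re.I | re.A)

print(limit, len(as_count), as_flags)
```

Tweets are short, so the cap is rarely hit in practice, but the keyword form avoids the surprise altogether.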

Step 3: Remove Stop Words
Stop words are common words that carry minimal useful information for analysis. Removing them helps focus on important words.

In [11]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

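# Download the stop-word list and the Punkt tokenizer models used by word_tokenize
# (recent NLTK releases may also need nltk.download('punkt_tab'))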
import nltk
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_words = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_words)

df['filtered_text'] = df['cleaned_text'].apply(remove_stopwords)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
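Because the cleaning step strips apostrophes, contractions arrive here as `dont` or `youre`, while the NLTK English list stores them as `don't` and `you're`, so they are not filtered. A minimal sketch of one way around this is to apply the same letters-only normalization to the stop list. The tiny stop list and both helper names below are hypothetical stand-ins; in the notebook the list would come from `stopwords.words('english')`.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
SAMPLE_STOP_WORDS = {"i", "the", "a", "is", "don't", "you're"}

def normalize_stop_words(words):
    """Strip non-letters from each stop word, mirroring the text cleaning."""
    return {re.sub(r"[^a-z]", "", w.lower()) for w in words}

def remove_stopwords_simple(text, stop_words):
    """Whitespace-tokenize and drop tokens found in stop_words."""
    return " ".join(t for t in text.split() if t not in stop_words)

cleaned = "i dont think the stream is live"  # made-up cleaned tweet

raw = remove_stopwords_simple(cleaned, SAMPLE_STOP_WORDS)
norm = remove_stopwords_simple(cleaned, normalize_stop_words(SAMPLE_STOP_WORDS))

print(raw)   # "dont" survives with the unmodified list
print(norm)  # "dont" is removed once the list is normalized
```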


Step 4: Stemming and Lemmatization
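Stemming reduces words to a root form by stripping suffixes, so the result is not always a real word (for example, "absolut"). Lemmatization, on the other hand, maps words to linguistically valid dictionary forms (lemmas). The cell below stems first and then lemmatizes the stems. Since WordNetLemmatizer looks words up in WordNet and treats them as nouns by default, truncated stems are often returned unchanged.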

In [12]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem_and_lemmatize(text):
    tokens = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in tokens]
    lemmatized = [lemmatizer.lemmatize(word) for word in stemmed]
    return ' '.join(lemmatized)

df['processed_text'] = df['filtered_text'].apply(stem_and_lemmatize)

[nltk_data] Downloading package wordnet to /root/nltk_data...
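The order of the two operations matters, because a lemmatizer can only map words it recognizes. The sketch below uses a toy suffix stripper and a toy lemma dictionary to illustrate the effect. Both are deliberately simplified stand-ins, not the Porter algorithm or WordNet.

```python
# Toy rules and dictionary, made up for illustration only
TOY_SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
TOY_LEMMAS = {"studies": "study", "geese": "goose", "streams": "stream"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix, replacement in TOY_SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

stem_then_lemma = toy_lemmatize(toy_stem("studies"))  # lookup sees "studi"
lemma_only = toy_lemmatize("studies")                 # lookup sees "studies"

print(stem_then_lemma, lemma_only)
```

Once the stemmer has produced a non-word, the lookup has nothing to match, so the lemmatizer adds little. Running only one of the two, or lemmatizing before stemming, is a common alternative.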


Step 5: Applying TF-IDF
Transform the processed text into vectors using TF-IDF.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # considering only the top 1000 features
tfidf_features = tfidf_vectorizer.fit_transform(df['processed_text'])

# Convert to array to view as a DataFrame, using the updated method to get feature names
feature_array = tfidf_features.toarray()
tfidf_df = pd.DataFrame(feature_array, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()


Unnamed: 0,abl,absolut,abt,accept,account,across,act,activ,actual,ad,...,yet,yoongi,youd,youll,young,your,youtub,youv,youxexxr,yxexxal
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
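To make the numbers in such a matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` uses with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It splits on whitespace, which is a simplification, since the real vectorizer's default token pattern also drops single-character tokens. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by 0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Made-up two-document corpus
vocab, rows = tfidf_rows(["good stream", "bad stream"])
print(vocab)
print(rows[0])
```

In the first row, "good" gets a higher weight than "stream" because "stream" appears in both documents and therefore has a lower IDF.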


Step 6: Classification
Now, use these TF-IDF features to train a classifier. Let's use a simple logistic regression model as an example.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Use the 'sentiment' column as the target label
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['sentiment'], test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.76355
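In the cells above, the vectorizer is fit on every row before the train/test split, so the IDF weights are computed with the test tweets included. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a pipeline, so that everything is fit on the training portion only. The sketch below uses a tiny made-up corpus in place of `df['processed_text']` and `df['sentiment']`; `max_iter=1000` is an assumption chosen to reduce the chance of convergence warnings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for df['processed_text'] and df['sentiment']
texts = [
    "love great stream", "great fun music", "love fun weekend", "happi great day",
    "hate bad stream", "bad bore music", "hate aw weekend", "sad bad day",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

# Split the raw text, so the vectorizer never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

clf = make_pipeline(
    TfidfVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)            # fits TF-IDF and the model on train only
test_accuracy = clf.score(X_test, y_test)
print(f'Accuracy: {test_accuracy}')
```

With a corpus this small the accuracy itself is meaningless. The point is only the order of operations.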


Train the Support Vector Classifier
We first refit the TF-IDF vectorizer with max_features=100 (down from 1000), then split the data into training and testing sets and train an SVC model.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=100)
tfidf_features = tfidf_vectorizer.fit_transform(df['processed_text'])

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Use the 'sentiment' column as the target variable
X_train, X_test, y_train, y_test = train_test_split(tfidf_features, df['sentiment'], test_size=0.2, random_state=42)

# Initialize and train the SVC
model = SVC(kernel='linear')  # Using a linear kernel
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))


Accuracy: 0.666575
              precision    recall  f1-score   support

          -1       0.70      0.58      0.63     20007
           1       0.64      0.76      0.69     19993

    accuracy                           0.67     40000
   macro avg       0.67      0.67      0.66     40000
weighted avg       0.67      0.67      0.66     40000
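The precision, recall, and F1 columns in the report above can be reproduced by hand from true-positive, false-positive, and false-negative counts for each class. Here is a minimal sketch; the label lists are made up for illustration and are not taken from the tweet data, and `per_class_scores` is a hypothetical helper.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: four negative and four positive examples
y_true_demo = [-1, -1, -1, -1, 1, 1, 1, 1]
y_pred_demo = [-1, -1, 1, 1, 1, 1, 1, -1]

neg_scores = per_class_scores(y_true_demo, y_pred_demo, -1)
pos_scores = per_class_scores(y_true_demo, y_pred_demo, 1)
print(neg_scores, pos_scores)
```

The same pattern as in the report shows up here: a model that leans toward predicting `1` gets higher recall on that class and lower recall on `-1`.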



Perform PCA to reduce the 100 TF-IDF features to 50 principal components, because the SVC takes very long to train.

In [17]:
# Perform PCA to reduce dimensionality
from sklearn.decomposition import PCA
pca = PCA(n_components=50)  # Reduce features to 50 principal components
tfidf_pca_features = pca.fit_transform(tfidf_features.toarray())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_pca_features, df['sentiment'], test_size=0.2, random_state=42)

# Initialize and train the SVC with a linear kernel
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)

# Predict and evaluate the model
from sklearn.metrics import classification_report, accuracy_score

y_pred = svc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(classification_rep)

Accuracy: 0.646525
              precision    recall  f1-score   support

          -1       0.68      0.56      0.61     20007
           1       0.63      0.73      0.67     19993

    accuracy                           0.65     40000
   macro avg       0.65      0.65      0.64     40000
weighted avg       0.65      0.65      0.64     40000
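If training time is the main concern, two substitutions are worth trying, sketched here as assumptions rather than as the notebook's method. `TruncatedSVD` works directly on the sparse TF-IDF matrix, so no `toarray()` call is needed. `LinearSVC` generally scales better to large sample counts than `SVC(kernel='linear')`, though it optimizes a slightly different loss by default, so results will not be identical. Putting both in a pipeline also means the reduction is fit on the training rows only. The corpus below is made up, and `n_components=2` is chosen only to suit its size.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up stand-ins for df['processed_text'] and df['sentiment']
texts = [
    "love great stream", "great fun music", "love fun weekend", "happi great day",
    "hate bad stream", "bad bore music", "hate aw weekend", "sad bad day",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

svm_clf = make_pipeline(
    TfidfVectorizer(max_features=100),
    TruncatedSVD(n_components=2, random_state=0),  # accepts sparse input
    LinearSVC(),                                   # linear SVM, liblinear-based
)
svm_clf.fit(X_train, y_train)

y_pred_demo = svm_clf.predict(X_test)
n_reduced = svm_clf.named_steps['truncatedsvd'].components_.shape[0]
print(n_reduced, list(y_pred_demo))
```

On the full dataset, `n_components=50` would mirror the PCA setting above.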