## Data loading
Loading the "emotions.csv" dataset and limiting the number of tweets per emotion to 10,000.


In [3]:
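import pandas as pd

try:
    df = pd.read_csv('emotions.csv')
    # Keep at most 10,000 rows per label. The stable sort keeps rows grouped by
    # label in their original order, matching the layout groupby().apply() produced,
    # and head() avoids the warning that apply() raises on the grouping column.
    df = (
        df.sort_values('label', kind='stable')
          .groupby('label')
          .head(10000)
          .reset_index(drop=True)
    )
    print(f"Shape of the dataframe: {df.shape}")
    print(f"Data types:\n{df.dtypes}")
    display(df.head())
except FileNotFoundError:
    print("Error: 'emotions.csv' not found.")
    df = pd.DataFrame()
except Exception as e:
    print(f"An error occurred: {e}")
    df = pd.DataFrame()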

Shape of the dataframe: (60000, 2)
Data types:
text     object
label     int64
dtype: object



Unnamed: 0,text,label
0,ive enjoyed being able to slouch about relax a...,0
1,i dont know i feel so lost,0
2,i was beginning to feel quite disheartened,0
3,i can still lose the weight without feeling de...,0
4,im feeling a little like a damaged tree and th...,0
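
To see what the per-label cap does, here is a minimal sketch on a toy frame using `GroupBy.head`. The rows are invented for illustration; only the `text` and `label` column names come from the dataset, and the cap is 2 instead of 10,000.

```python
import pandas as pd

# Toy stand-in for emotions.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "label": [1, 0, 0, 1, 0, 2],
})

cap = 2  # the notebook uses 10,000

# A stable sort groups rows by label while preserving their original order
# within each label; head(cap) then keeps the first `cap` rows of each label.
capped = (
    toy.sort_values("label", kind="stable")
       .groupby("label")
       .head(cap)
       .reset_index(drop=True)
)

counts = capped["label"].value_counts().sort_index().to_dict()
print(counts)
```

Labels with fewer rows than the cap are kept whole, so the result is only balanced when every label has at least `cap` rows.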


## Data cleaning
Cleaning the text data by filling missing values, removing URLs, mentions, hashtags, and punctuation, converting to lowercase, removing stopwords, and lemmatizing the remaining tokens.


In [5]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
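import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# The stopword list and WordNet data must be available locally;
# quiet=True suppresses the download log
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)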

# Replace missing values with empty strings so the string operations below do not fail
print(f"Number of missing values before handling: {df['text'].isnull().sum()}")
df['text'] = df['text'].fillna('')
print(f"Number of missing values after handling: {df['text'].isnull().sum()}")

# Example tweets before cleaning
print("\nExample tweets before cleaning:")
print(df['text'][0:3])

# Remove URLs, mentions, hashtags, and punctuation
def clean_tweet(tweet):
    tweet = re.sub(r"http\S+|www\S+", '', tweet)  # URLs (http\S+ also matches https links)
    tweet = re.sub(r'@\w+', '', tweet)            # @mentions
    tweet = re.sub(r'#\w+', '', tweet)            # hashtags, including the tag word
    tweet = re.sub(r'[^\w\s]', '', tweet)         # remaining punctuation
    return tweet

df['text'] = df['text'].apply(clean_tweet)

# Lowercase, tokenize, drop stopwords and non-alphanumeric tokens, then lemmatize
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words and word.isalnum()]
    return " ".join(tokens)

df['text'] = df['text'].apply(preprocess_text)

# Example tweets after cleaning
print("\nExample tweets after cleaning:")
print(df['text'][0:3])

Number of missing values before handling: 0
Number of missing values after handling: 0

Example tweets before cleaning:
0    ive enjoyed being able to slouch about relax a...
1                           i dont know i feel so lost
2           i was beginning to feel quite disheartened
Name: text, dtype: object

Example tweets after cleaning:
0    ive enjoyed able slouch relax unwind frankly n...
1                                  dont know feel lost
2                    beginning feel quite disheartened
Name: text, dtype: object
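
The sample rows above contain no URLs, mentions, or hashtags, so the effect of the regex step is not visible in the output. A minimal sketch on a made-up tweet, using the same four substitutions and only the standard library:

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URLs, @mentions, hashtags, and punctuation from a tweet."""
    tweet = re.sub(r"http\S+|www\S+", "", tweet)  # URLs
    tweet = re.sub(r"@\w+", "", tweet)            # @mentions
    tweet = re.sub(r"#\w+", "", tweet)            # hashtags, tag word included
    tweet = re.sub(r"[^\w\s]", "", tweet)         # remaining punctuation
    return tweet

# Invented example; not taken from the dataset.
raw = "Loving this! @friend check https://example.com #happy :)"

# The substitutions leave runs of spaces behind, so collapse them for display.
cleaned = " ".join(clean_tweet(raw).split())
print(cleaned)
```

Note that the hashtag pattern removes the tag word itself, so any emotion word that appears only as a hashtag is lost to the classifier.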


## Data wrangling
Converting the cleaned text data to numerical features using TF-IDF, keeping the 2,000 most frequent terms.


In [11]:
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep the 2,000 most frequent terms; min_df=1 and max_df=1.0 are the library defaults
vectorizer = TfidfVectorizer(max_features=2000, min_df=1, max_df=1.0)

# Fit and transform the 'text' column
X = vectorizer.fit_transform(df['text'])

# Check that X is sparse. issparse() inspects X directly, so the fitted
# vectorizer is not refit on dummy input.
print(f"Is X a sparse matrix? {issparse(X)}")
print(f"Shape of the TF-IDF matrix: {X.shape}")

Is X a sparse matrix? True
Shape of the TF-IDF matrix: (60000, 2000)
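
For intuition about the values stored in `X`, the sketch below computes TF-IDF weights by hand for a three-document toy corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is what scikit-learn documents as the `TfidfVectorizer` default. The corpus and the helper name `tfidf_row` are made up.

```python
import math
from collections import Counter

docs = ["feel lost", "feel happy", "feel happy today"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df_counts = Counter(term for toks in tokenized for term in set(toks))

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    raw = {
        t: tf[t] * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = tfidf_row(tokenized[0])
print({t: round(w, 3) for t, w in row.items()})
```

In the first document, "lost" receives a larger weight than "feel" because "feel" occurs in every document and therefore carries little discriminating information.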


## Data splitting
Splitting the data into training (80%) and testing (20%) sets, stratified by label so that both sets keep the same class proportions.


In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42, stratify=df['label'])

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (48000, 2000)
X_test shape: (12000, 2000)
y_train shape: (48000,)
y_test shape: (12000,)
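
The cells above fit the vectorizer on all 60,000 rows and split afterwards, so the vocabulary and IDF statistics also reflect the test rows. An alternative ordering, sketched here on a toy corpus as an assumption rather than the notebook's method, splits the raw text first and fits the vectorizer on the training portion only. The texts, labels, and variable names are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus with two balanced classes.
texts = ["feel lost", "feel sad", "so sad today", "feel down",
         "feel happy", "so happy today", "feel great", "great day"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, keeping class proportions equal in both parts.
text_tr, text_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

demo_vectorizer = TfidfVectorizer(max_features=2000)
X_tr = demo_vectorizer.fit_transform(text_tr)  # vocabulary and IDF learned from train only
X_te = demo_vectorizer.transform(text_te)      # test rows reuse the fitted vocabulary

print(X_tr.shape, X_te.shape)
```

Terms that occur only in the test texts are ignored by `transform`, which mirrors how the model would see genuinely new tweets.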


## Model training
Training one SVM model per kernel (linear, RBF, polynomial, sigmoid) on the training data, with all other hyperparameters left at their scikit-learn defaults.


In [13]:
from sklearn import svm

kernels = ['linear', 'rbf', 'poly', 'sigmoid']
models = {}

for kernel in kernels:
    model = svm.SVC(kernel=kernel, random_state=42)
    model.fit(X_train, y_train)
    models[kernel] = model

## Model evaluation
Evaluating the performance of each trained SVM model on the test set using accuracy and weighted-average precision, recall, and F1 score.

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

evaluation_results = {}

for kernel, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    evaluation_results[kernel] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

    print(f"Evaluation results for {kernel} kernel:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("-" * 20)

Evaluation results for linear kernel:
Accuracy: 0.9080
Precision: 0.9095
Recall: 0.9080
F1 Score: 0.9073
--------------------
Evaluation results for rbf kernel:
Accuracy: 0.9024
Precision: 0.9037
Recall: 0.9024
F1 Score: 0.9018
--------------------
Evaluation results for poly kernel:
Accuracy: 0.8618
Precision: 0.8635
Recall: 0.8618
F1 Score: 0.8621
--------------------
Evaluation results for sigmoid kernel:
Accuracy: 0.9077
Precision: 0.9090
Recall: 0.9077
F1 Score: 0.9070
--------------------
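
In the output above, weighted recall matches accuracy for every kernel. That is expected: weighting each class's recall by its share of the true labels gives the overall fraction of correct predictions. A small hand-computed check on made-up labels:

```python
from collections import Counter

# Invented labels for illustration.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
n = len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

support = Counter(y_true)  # number of true samples per class
weighted_recall = 0.0
for cls, n_cls in support.items():
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    recall_cls = hits / n_cls                  # per-class recall
    weighted_recall += recall_cls * n_cls / n  # weight by class share

print(accuracy, weighted_recall)
```

Weighted precision and F1 do not collapse in this way, which is why they differ slightly from accuracy in the results.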


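Based on the evaluation results, the linear kernel performed best (accuracy 0.9080, F1 score 0.9073), narrowly ahead of the sigmoid kernel (accuracy 0.9077, F1 score 0.9070); the polynomial kernel was clearly the weakest (accuracy 0.8618). Hyperparameter tuning using GridSearchCV for the linear kernel is the next step; the model saved in the final cell is the untuned linear SVM trained above.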
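
A minimal sketch of what a GridSearchCV run over `C` for a linear-kernel SVC could look like, on a synthetic dataset. The grid values, `cv=3`, the `f1_weighted` scoring choice, and the `X_demo`/`y_demo` names are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features and emotion labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=3, random_state=42,
)

param_grid = {"C": [0.1, 1, 10]}  # assumed grid, not from the notebook

search = GridSearchCV(
    SVC(kernel="linear", random_state=42),
    param_grid,
    scoring="f1_weighted",  # same averaging as the evaluation cell
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated weighted F1: {search.best_score_:.4f}")
```

With the default `refit=True`, `search.best_estimator_` is retrained on all the data passed to `fit` and could be saved in place of the untuned model.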

In [16]:
import joblib

best_model = models['linear']

# Define the filename for saving the model
filename = 'best_svm_model.joblib'

# Save the model
joblib.dump(best_model, filename)

print(f"Best model saved as {filename}")

Best model saved as best_svm_model.joblib
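
A saved SVM can only classify new text together with the fitted TF-IDF vectorizer that produced its features. One way to keep the two together, sketched here as an assumption rather than what the notebook does, is to wrap them in a scikit-learn `Pipeline` and dump that single object. The toy texts and the filename `svm_pipeline_demo.joblib` are made up.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training data for illustration.
texts = ["feel lost", "feel sad", "so sad today",
         "feel happy", "so happy today", "feel great"]
labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes raw text and then classifies it.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", random_state=42))
pipeline.fit(texts, labels)

# Save to a temporary directory and load it back.
path = os.path.join(tempfile.mkdtemp(), "svm_pipeline_demo.joblib")
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline accepts raw strings directly.
same = bool((restored.predict(texts) == pipeline.predict(texts)).all())
print(same)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and with a compatible scikit-learn version.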
