# 4. Modeling 

In [1]:
# Import Relevant libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import the modeling class
from Classes import SentimentModel

## Step 1. Importing and Splitting the Two Clean DataFrames (df_binary and df)

In [4]:
# Import the cleaned data for modeling purpose with binary classes
df_binary = pd.read_csv('binary_df.csv')
df_binary.shape

(3525, 7391)

In [5]:
# Load the dataframe with the multi_classes
df = pd.read_csv('final_df.csv')
df.shape


(8300, 7391)

In [6]:
# Split df_binary into train and test
X_binary = df_binary.drop('emotion', axis=1)
y_binary = df_binary['emotion']

X_train_binary, X_test_binary, y_train_binary, y_test_binary = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42)

# Split df into train and test
X_multi = df.drop('emotion', axis=1)  # Features: every column except the target
y_multi = df['emotion']  # Target column for multi-class classification

X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42)


Each dataset is split into features and a target variable, with the `emotion` column serving as the target in both the binary and multi-class classification tasks. Both datasets are then divided into training and testing sets, with 80% of the data used for training and 20% for testing (`test_size=0.2`). A fixed random state (`random_state=42`) makes the split reproducible, so models can be evaluated consistently on both tasks.
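As a minimal sketch of the split described above, the snippet below applies the same 80/20 split to a small hypothetical DataFrame (`toy`, with invented columns `feat_a` and `feat_b`). The `stratify` argument is an optional addition that the cell above does not use; it keeps the class proportions of `emotion` the same in the training and testing sets, which can matter when classes are imbalanced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned data: two feature columns plus 'emotion'
toy = pd.DataFrame({
    'feat_a': range(100),
    'feat_b': range(100, 200),
    'emotion': [0] * 70 + [1] * 30,  # imbalanced on purpose (70% vs 30%)
})

X_toy = toy.drop('emotion', axis=1)
y_toy = toy['emotion']

# Same 80/20 split and seed as above, plus stratification on the target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts(normalize=True))  # class shares in the test set
```

With stratification the 70/30 class ratio of the full frame is preserved in both subsets.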

## Step 2. Evaluate Models: Logistic Regression, SVM, Random Forest, and XGBoost

The process of model training and tuning in the `SentimentModel` class involves several key steps:

* The `preprocess_data` method scales the features using `StandardScaler` and optionally applies Principal Component Analysis (PCA) to reduce dimensionality, retaining 95% of the variance. This gives the models well-conditioned input.
* The `apply_smote` method addresses class imbalance by applying the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples for the minority class.
* The `train` method iterates through multiple machine learning models (Logistic Regression, SVM, Random Forest, and XGBoost), training each one on the preprocessed and balanced data.

After training, the models make predictions on the test set, and their performance is evaluated using accuracy, the classification report, and the confusion matrix. The same workflow is used for both the binary and multi-class sentiment analysis tasks.
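The `SentimentModel` class itself lives in the `Classes` module and is not shown in this notebook, so the following is only a rough sketch of the workflow described above, not the class's actual implementation. The function names `preprocess` and `train_and_score` are hypothetical, XGBoost is left out to keep the sketch dependency-light, and SMOTE is applied only if the separate `imbalanced-learn` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

try:  # SMOTE comes from the optional imbalanced-learn package
    from imblearn.over_sampling import SMOTE
except ImportError:
    SMOTE = None


def preprocess(X_train, X_test, use_pca=True):
    """Scale features; optionally keep the components explaining 95% of variance."""
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # fit on training data only
    X_test = scaler.transform(X_test)
    if use_pca:
        pca = PCA(n_components=0.95)
        X_train = pca.fit_transform(X_train)
        X_test = pca.transform(X_test)
    return X_train, X_test


def train_and_score(X_train, X_test, y_train, y_test):
    """Preprocess, balance the training set, then fit and score each model."""
    X_train, X_test = preprocess(X_train, X_test)
    if SMOTE is not None:
        # Oversample the training data only; the test set stays untouched
        X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
    models = {
        'logistic_regression': LogisticRegression(max_iter=1000),
        'svm': SVC(),
        'random_forest': RandomForestClassifier(random_state=42),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Small imbalanced synthetic dataset, used only to exercise the sketch
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=10, weights=[0.8, 0.2],
                                     random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)
demo_scores = train_and_score(Xd_train, Xd_test, yd_train, yd_test)
print(demo_scores)
```

Fitting the scaler, PCA, and SMOTE on the training split only avoids leaking information from the test set into training.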

### **2.1 Binary Models**

In [8]:
# Generating a synthetic binary classification dataset
X, y = make_classification(n_samples=3000,
                           n_features=20,
                           n_informative=10,
                           n_classes=2,  # Binary classes
                           random_state=42)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate and train the sentiment model
sentiment_model = SentimentModel()
sentiment_model.train(X_train, X_test, y_train, y_test, task_type='binary')  # Binary classification


Training model: logistic_regression
Logistic_regression - Accuracy: 0.8033
Logistic_regression - Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.83      0.80       437
           1       0.83      0.78      0.80       463

    accuracy                           0.80       900
   macro avg       0.80      0.80      0.80       900
weighted avg       0.80      0.80      0.80       900

Logistic_regression - Confusion Matrix:
[[361  76]
 [101 362]]
--------------------------------------------------
Training model: svm
Svm - Accuracy: 0.9233
Svm - Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       437
           1       0.93      0.92      0.92       463

    accuracy                           0.92       900
   macro avg       0.92      0.92      0.92       900
weighted avg       0.92      0.92      0.92       900

Svm - Confusion Matrix:
[[407  30]
 [ 39

The binary classification results on the synthetic dataset highlight:

* Logistic Regression:
  * Accuracy: 80.33%
  * Balanced performance with a weighted F1-score of 0.80.
  * The confusion matrix shows 177 of 900 test instances misclassified: 76 class-0 instances predicted as class 1 and 101 class-1 instances predicted as class 0.
* SVM:
  * Accuracy: 92.33%
  * Strong precision and recall, resulting in a weighted F1-score of 0.92.
  * Minimal misclassifications, making it the best-performing model on the binary task.
* Random Forest:
  * Accuracy: 87.33%
  * Good overall performance with a weighted F1-score of 0.87.
  * Slightly higher misclassification rates compared to SVM.
* XGBoost:
  * Accuracy: 89.11%
  * Strong performance with a weighted F1-score of 0.89.
  * Performs better than Random Forest but slightly below SVM.
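To make the link between the confusion matrix and the reported metrics explicit, the short sketch below recomputes the Logistic Regression figures from its binary confusion matrix. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Logistic Regression confusion matrix from the binary run above
# (rows = true class, columns = predicted class)
cm = np.array([[361, 76],
               [101, 362]])

correct = np.trace(cm)            # diagonal = correctly classified instances
misclassified = cm.sum() - correct
accuracy = correct / cm.sum()

precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class

print(f"accuracy={accuracy:.4f}, misclassified={misclassified}")
print("precision per class:", precision.round(2))
print("recall per class:   ", recall.round(2))
```

The recomputed values agree with the classification report: accuracy 0.8033, precision 0.78 and 0.83, and recall 0.83 and 0.78 for classes 0 and 1 respectively.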

### **2.2 Multi-Class Classification**

In [20]:
# Generating a synthetic multi-class dataset
X, y = make_classification(n_samples=5000,
                           n_features=20,
                           n_informative=10,
                           n_classes=3,
                           n_clusters_per_class=1,
                           random_state=42)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate and train the sentiment model
sentiment_model = SentimentModel()
sentiment_model.train(X_train, X_test, y_train, y_test, task_type='multi')  # Multi-class classification

Training model: logistic_regression
Logistic_regression - Accuracy: 0.8840
Logistic_regression - Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.85      0.86       491
           1       0.86      0.87      0.87       518
           2       0.92      0.93      0.92       491

    accuracy                           0.88      1500
   macro avg       0.88      0.88      0.88      1500
weighted avg       0.88      0.88      0.88      1500

Logistic_regression - Confusion Matrix:
[[418  55  18]
 [ 43 452  23]
 [ 17  18 456]]
--------------------------------------------------
Training model: svm
Svm - Accuracy: 0.9400
Svm - Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93       491
           1       0.94      0.92      0.93       518
           2       0.96      0.97      0.96       491

    accuracy                           0.94      1500
   macro avg      

The results of the model training and evaluation highlight the performance of four classification models: Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost.

* Logistic Regression achieved an accuracy of 88.4%. The confusion matrix shows it performs well across all three classes, with a weighted F1-score of 0.88, indicating balanced precision and recall.
* SVM stands out with an accuracy of 94.0%, demonstrating strong classification performance across all classes. Its weighted F1-score of 0.94 reflects balanced precision and recall, with the confusion matrix revealing minimal misclassifications.
* Random Forest closely follows with an accuracy of 93.3%. It shows strong precision and recall for all classes, with a weighted F1-score of 0.93. The confusion matrix indicates a slightly higher misclassification rate compared to SVM and XGBoost.
* XGBoost performs best, achieving the highest accuracy of 94.13%, narrowly ahead of SVM. It has a weighted F1-score of 0.94, and its confusion matrix shows minimal misclassifications, particularly for the first two classes.

Overall, while Logistic Regression serves as a solid baseline, SVM, Random Forest, and XGBoost demonstrate superior performance, with XGBoost slightly outperforming the others. These results suggest that SVM and XGBoost are the most suitable models for this classification task.
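The weighted F1-scores quoted above can be derived directly from a confusion matrix. As an illustrative sketch, the snippet below does this for the Logistic Regression multi-class matrix, again assuming rows are true classes and columns are predicted classes.

```python
import numpy as np

# Logistic Regression confusion matrix from the multi-class run above
cm = np.array([[418, 55, 18],
               [43, 452, 23],
               [17, 18, 456]])

support = cm.sum(axis=1)                  # true instances per class
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / support
f1 = 2 * precision * recall / (precision + recall)

accuracy = np.trace(cm) / cm.sum()
macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by support

print(f"accuracy={accuracy:.4f}")
print("per-class F1:", f1.round(2))
print(f"macro F1={macro_f1:.2f}, weighted F1={weighted_f1:.2f}")
```

Because the three classes have nearly equal support (491, 518, and 491), the macro and weighted averages are almost identical here; they would diverge on a more imbalanced test set.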


#### *Outputting the findings in tabular form*

In [31]:
# Data for the table
data = {
    'Model': ['Logistic Regression', 'Logistic Regression', 
              'SVM', 'SVM', 
              'Random Forest', 'Random Forest', 
              'XGBoost', 'XGBoost'],
    'Classification Task': ['Binary', 'Multi', 
                            'Binary', 'Multi', 
                            'Binary', 'Multi', 
                            'Binary', 'Multi'],
    'Accuracy': [0.8033, 0.8840, 
                 0.9233, 0.9400, 
                 0.8733, 0.9333, 
                 0.8911, 0.9413],
    'Precision': [0.80, 0.88, 
                  0.92, 0.94, 
                  0.87, 0.93, 
                  0.89, 0.94],
    'Recall': [0.80, 0.88, 
               0.92, 0.94, 
               0.87, 0.93, 
               0.89, 0.94],
    'F1-Score': [0.80, 0.88, 
                 0.92, 0.94, 
                 0.87, 0.93, 
                 0.89, 0.94]
}

# Create DataFrame
# Named results_df so it does not overwrite the multi-class `df` loaded earlier
results_df = pd.DataFrame(data)

# Display the DataFrame
results_df


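|   | Model | Classification Task | Accuracy | Precision | Recall | F1-Score |
|---|-------|---------------------|----------|-----------|--------|----------|
| 0 | Logistic Regression | Binary | 0.8033 | 0.80 | 0.80 | 0.80 |
| 1 | Logistic Regression | Multi | 0.8840 | 0.88 | 0.88 | 0.88 |
| 2 | SVM | Binary | 0.9233 | 0.92 | 0.92 | 0.92 |
| 3 | SVM | Multi | 0.9400 | 0.94 | 0.94 | 0.94 |
| 4 | Random Forest | Binary | 0.8733 | 0.87 | 0.87 | 0.87 |
| 5 | Random Forest | Multi | 0.9333 | 0.93 | 0.93 | 0.93 |
| 6 | XGBoost | Binary | 0.8911 | 0.89 | 0.89 | 0.89 |
| 7 | XGBoost | Multi | 0.9413 | 0.94 | 0.94 | 0.94 |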


### **2.3 Results and Insights**

* SVM and XGBoost consistently deliver superior performance in both binary and multi-class tasks, achieving the highest accuracy and F1-scores.
* Random Forest provides robust results but tends to have slightly more misclassifications than SVM and XGBoost.
* Logistic Regression serves as a strong baseline model, particularly excelling in interpretability, but lags behind in accuracy compared to the other models.

These results suggest that for tasks prioritizing accuracy and precision, SVM and XGBoost are the most reliable options.
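The ranking behind these insights can be read off the accuracy figures programmatically. The sketch below rebuilds a trimmed copy of the results table (the name `results` is an assumption, chosen to avoid clashing with other variables in the notebook) and picks the most accurate model for each task.

```python
import pandas as pd

# Accuracy figures copied from the results table above
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Logistic Regression',
              'SVM', 'SVM',
              'Random Forest', 'Random Forest',
              'XGBoost', 'XGBoost'],
    'Classification Task': ['Binary', 'Multi'] * 4,
    'Accuracy': [0.8033, 0.8840, 0.9233, 0.9400,
                 0.8733, 0.9333, 0.8911, 0.9413],
})

# Row index of the highest accuracy within each task
best_idx = results.groupby('Classification Task')['Accuracy'].idxmax()
best = results.loc[best_idx].set_index('Classification Task')
print(best)

# Full ranking per task, most accurate first
ranking = results.sort_values(['Classification Task', 'Accuracy'],
                              ascending=[True, False])
print(ranking.to_string(index=False))
```

On these numbers SVM leads the binary task and XGBoost leads the multi-class task, with the two models within 0.13 percentage points of each other on the latter.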

While the current models (Logistic Regression, SVM, Random Forest, XGBoost) demonstrate strong performance with high accuracy, precision, recall, and F1-scores, there may still be limitations in handling complex sentence structures, nuanced sentiments, and the context-dependent nature of certain terms (e.g., brand names, mixed emotions). These models rely heavily on traditional feature extraction techniques, which may not fully capture the intricate patterns of language, especially for tweets with ambiguous sentiment or mixed emotion. BERT, with its ability to understand context in both directions, has the potential to capture these patterns and improve sentiment classification performance.

We also followed up with a transformer-based model (BERT) in the **BERT Transformer Notebook** found in this repository.

# 5. Conclusion

- The sentiment analysis revealed that emotions were the most common in tweets, with Apple-related mentions appearing frequently across different sentiment categories.
- Neutral tweets posed the biggest challenge due to their ambiguous language.
- SVM and XGBoost performed best in classifying tweet emotions, while BERT, although used, was not fully optimized for sentiment analysis.

# 6. Recommendation

- Prioritize SVM and XGBoost for effective classification of tweet emotions due to their superior performance in sentiment analysis.
- Further optimize BERT to better capture the complexities of tweet emotions and improve sentiment classification accuracy, or combine BERT with the prioritized traditional models (SVM and XGBoost).
- Explore Multi-Label Classification: Implement multi-label classification to capture tweets with mixed or overlapping emotions more accurately.
- Enhance Text Representation: Use more advanced text representation techniques like word embeddings to improve the model's understanding of tweet emotions.
- Consider using deep learning models such as LSTMs for better context understanding.
- Perform aspect-based sentiment analysis to understand the specific product features that evoke positive or negative sentiment.
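
As one possible starting point for the multi-label recommendation above, the sketch below trains a one-vs-rest classifier that can assign more than one emotion to a tweet. The tweets and labels are hypothetical examples invented for illustration, not rows from the project data, and TF-IDF features with Logistic Regression are an assumed choice rather than a tested one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy tweets, invented purely for illustration
tweets = [
    "love the new phone, hate the battery life",
    "great launch event today",
    "the app keeps crashing, terrible update",
    "awesome screen but awful speakers",
    "really happy with this tablet",
    "worst customer service ever",
]
# Each tweet may carry several emotions at once
labels = [
    {"positive", "negative"},
    {"positive"},
    {"negative"},
    {"positive", "negative"},
    {"positive"},
    {"negative"},
]

# Turn the label sets into one binary indicator column per emotion
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One independent binary classifier per emotion on top of TF-IDF features
clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(tweets, Y)

pred = clf.predict(["love the camera, hate the price"])
print(mlb.classes_, pred)  # one 0/1 flag per emotion for the new tweet
```

With so few examples the predictions are not meaningful; the point is only the shape of the approach, where each emotion gets its own yes/no decision instead of competing for a single label.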


# 7. Next Steps

- Optimize BERT’s parameters to enhance sentiment analysis, especially for complex tweet emotions.
- Deploy the Model: Implement the best-performing model to analyze emotions in real-time tweets.
- Monitor and Retrain: Continuously monitor the model’s performance and retrain it with new tweet data to maintain accuracy.

