
## **Feature Engineering**


### Model Evaluation Libraries

1. `sklearn.model_selection` (from scikit-learn library):
   - Library for model selection and evaluation, including train-test split, cross-validation, and parameter tuning.

2. `sklearn.svm.LinearSVC` (from scikit-learn library):
   - Support Vector Machines (SVM) implementation for classification tasks with linear kernel in scikit-learn.

3. `sklearn.metrics` (from scikit-learn library):
   - Collection of evaluation metrics and methods for assessing the quality of predictions in machine learning models.

4. `pandas`:
   - Data manipulation and analysis library that provides data structures and tools for working with structured data, particularly tabular data.

These libraries are commonly used in machine learning and data analysis tasks to split data for training and testing, build classification models using Support Vector Machines, and evaluate model performance using various metrics and tools.

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd 

import warnings 
warnings.filterwarnings("ignore")



5. `sklearn.naive_bayes.GaussianNB` (from scikit-learn library):
   - Implementation of the Gaussian Naive Bayes algorithm for classification, assuming that features follow a Gaussian distribution.



In [2]:
from sklearn.naive_bayes import GaussianNB


##### Code Explanation

- `Model Class`
    1. `LinearSVC_model`:
    - Trains a Linear Support Vector Classifier model, evaluates it on the test dataset, makes predictions on the validation dataset, and provides accuracy, classification report, and confusion matrix for both test and validation datasets.

    2. `NaiveBayes_model`:
    - Trains a Gaussian Naive Bayes model, evaluates it on the test dataset, makes predictions on the validation dataset, and provides accuracy, classification report, and confusion matrix for both test and validation datasets.

    These methods encapsulate the functionality to train, evaluate, make predictions, and analyze the performance of machine learning models, specifically the Linear Support Vector Classifier and Gaussian Naive Bayes models, on test and validation datasets.

    Note:

        Irrelevant:0 , Negative:1, Neutral:2, Positive:3

In [3]:
class Models:
    def __init__(self, x_train, x_test, y_train, y_test, x_valid, y_valid, original_validation_data):
        self.x_train = pd.read_csv(x_train)
        self.x_test = pd.read_csv(x_test)
        self.y_train = pd.read_csv(y_train)
        self.y_test = pd.read_csv(y_test)
        self.x_valid = pd.read_csv(x_valid)
        self.y_valid = pd.read_csv(y_valid)
        self.original_validation_data = pd.read_csv(original_validation_data)
    
    def LinearSVC_model(self):

        # Initialize the LinearSVC model
        self.linear_svc_model = LinearSVC(C=1.0, random_state=42)
        # Train the LinearSVC model
        self.linear_svc_model.fit(self.x_train, self.y_train)
        # Make predictions
        y_pred = self.linear_svc_model.predict(self.x_test)
        # Evaluate the model
        accuracy = accuracy_score(self.y_test, y_pred)
        report = classification_report(self.y_test, y_pred)
        # Calculate the confusion matrix
        conf_matrix = confusion_matrix(self.y_test, y_pred)
        print("-" * 100)
        print("\n   SVM Results for test dataset:")
        print("\n       Accuracy:", accuracy)
        print("\n       Classification Report:")
        print(report)
        print("\n       Confusion Matrix:")
        print(conf_matrix)
        print("-" * 100)

        # Make predictions
        y_pred = self.linear_svc_model.predict(self.x_valid)
        predicted_sentiment_df = pd.DataFrame(y_pred, columns=['predicted_svm_sentiment'])
        # Concatenate the new DataFrame with the original DataFrame
        df_with_predicted_sentiment = pd.concat([self.original_validation_data, predicted_sentiment_df], axis=1)
        df_with_predicted_sentiment.to_csv("/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/validation_with_predicted_columns.csv")
        # Evaluate the model
        accuracy = accuracy_score(self.y_test, y_pred)
        report = classification_report(self.y_valid, y_pred)
        # Calculate the confusion matrix
        conf_matrix = confusion_matrix(self.y_valid, y_pred)
        print("\n   SVM Results for validation dataset:")
        print("\n       Accuracy:", accuracy)
        print("\n       Classification Report:")
        print(report)
        print("\n       Confusion Matrix:")
        print(conf_matrix)
        print("-" * 100)

    def NaiveBayes_model(self):

        # Initialize the Naive Bayes model
        self.nb_model = GaussianNB()
        # Train the Naive Bayes model
        self.nb_model.fit(self.x_train, self.y_train)
        # Make predictions
        y_pred = self.nb_model.predict(self.x_test)
        # Evaluate the model
        accuracy = accuracy_score(self.y_test, y_pred)
        report = classification_report(self.y_test, y_pred)
        # Calculate the confusion matrix
        conf_matrix = confusion_matrix(self.y_test, y_pred)
        print("-" * 100)
        print("\n   Naive Bayes Results for test dataset:")
        print("\n       Accuracy:", accuracy)
        print("\n        Classification Report:")
        print(report)
        print("\n       Confusion Matrix:")
        print(conf_matrix)
        print("-" * 100)

        # Make predictions
        y_pred = self.nb_model.predict(self.x_valid)
        predicted_naive_df = pd.DataFrame(y_pred, columns=['predicted_naive_sentiment'])
        # Concatenate the new DataFrame with the original DataFrame
        df_with_predicted_naive = pd.concat([self.original_validation_data, predicted_naive_df], axis=1)
        df_with_predicted_naive.to_csv("/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/validation_with_predicted_columns.csv")
        # Evaluate the model
        accuracy = accuracy_score(self.y_valid, y_pred)
        report = classification_report(self.y_valid, y_pred)
        # Calculate the confusion matrix
        conf_matrix = confusion_matrix(self.y_valid, y_pred)
        print("\n   Naive Bayes Results for validation dataset:")
        print("\n       Accuracy:", accuracy)
        print("\n        Classification Report:")
        print(report)
        print("\n       Confusion Matrix:")
        print(conf_matrix)
        print("-" * 100)




In [5]:
x_train = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_train.csv"
x_test = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_test.csv"
y_train = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_training.csv"
y_test = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_test.csv"
original_validation_data = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/raw/twitter_validation.csv"
y_valid = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_validation.csv"
x_valid = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_validation.csv"
svm_model = Models(x_train, x_test, y_train, y_test,x_valid, y_valid, original_validation_data)
svm_model.LinearSVC_model()

----------------------------------------------------------------------------------------------------

   SVM Results for test dataset:

       Accuracy: 0.268

       Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        92
           1       0.00      0.00      0.00       126
           2       0.27      1.00      0.42       134
           3       0.00      0.00      0.00       148

    accuracy                           0.27       500
   macro avg       0.07      0.25      0.11       500
weighted avg       0.07      0.27      0.11       500


       Confusion Matrix:
[[  0   0  92   0]
 [  0   0 126   0]
 [  0   0 134   0]
 [  0   0 148   0]]
----------------------------------------------------------------------------------------------------

   SVM Results for validation dataset:

       Accuracy: 0.268

       Classification Report:
              precision    recall  f1-score   support

           0       0.

In [7]:
original_validation_data1 = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/validation_with_predicted_columns.csv"

naiive_Baysian_model = Models(x_train, x_test, y_train, y_test, x_valid, y_valid, original_validation_data1)
naiive_Baysian_model.NaiveBayes_model()


----------------------------------------------------------------------------------------------------

   Naive Bayes Results for test dataset:

       Accuracy: 0.292

        Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        92
           1       0.00      0.00      0.00       126
           2       0.27      0.99      0.43       134
           3       0.81      0.09      0.16       148

    accuracy                           0.29       500
   macro avg       0.27      0.27      0.15       500
weighted avg       0.31      0.29      0.16       500


       Confusion Matrix:
[[  0   0  92   0]
 [  0   0 124   2]
 [  0   0 133   1]
 [  0   0 135  13]]
----------------------------------------------------------------------------------------------------

   Naive Bayes Results for validation dataset:

       Accuracy: 0.304

        Classification Report:
              precision    recall  f1-score   support

   

## **Analysing Results:**
Based on the provided results, both the Support Vector Machine (SVM) model and the Naive Bayes model have relatively low accuracy when tested on the validation dataset. Here's an analysis comparing the two models and discussing reasons for the low accuracy:

1. Accuracy Comparison:
   - SVM Accuracy: 26.8% (Test), 26.8% (Validation)
   - Naive Bayes Accuracy: 29.2% (Test), 30.4% (Validation)

2. Precision, Recall, and F1-Score:
   - Both models exhibit low precision, recall, and F1-scores for various sentiment classes, indicating poor classification performance across different sentiment categories.

Reasons for Low Accuracy:
   1. Inadequate Representativeness of Features:
      - The features used to represent the text data, such as word frequency or embeddings, may not capture the underlying sentiment patterns effectively, leading to poor discrimination between sentiment categories.

   2. Complexity of Sentiment Expression:
      - The sentiment expression in the text data may be nuanced, subtle, or context-dependent, making it challenging for the models to accurately capture the sentiment polarity.

   3. Class Imbalance:
      - The dataset may have an unequal distribution of sentiment classes, leading to biased model learning and poorer performance on minority classes.

   4. Feature-Target Mismatch:
      - The features may not fully capture the information essential for sentiment classification, leading to a weak correspondence between features and target sentiment labels.

   5. Overfitting or Underfitting:
      - Both models may be overfitting or underfitting the training data, resulting in poor generalization to unseen data.

`Note` :

   Feature extraction was base on Word2Vec model, by provided result is obviously comprehensible that it has poor effect on results because :

   1. **Limited Semantic Representation with Word2Vec**:
      - Word2Vec may not fully capture the complex semantics and sentiment nuances in the text data, leading to misrepresentations that hinder the accurate classification of sentiment categories.

   2. **Lack of Contextual Information**:
      - Word2Vec model may struggle to capture contextual information necessary for sentiment analysis, resulting in a disconnect between the semantic features learned and the true sentiment expressions in different contexts.

   3. **Difficulty in Handling Negation and Sarcasm**:
      - Word2Vec may struggle to represent negations, sarcasm, and sentiment reversals effectively, leading to misclassifications and lowered accuracy, especially in cases where sentiment is conveyed through nuanced language patterns.

   4. **Model Complexity and Non-linear Relationships**:
      - Both SVM and Naive Bayes models may struggle to capture non-linear relationships and intricate patterns in the Word2Vec embeddings, which could lead to suboptimal performance in multi-class sentiment classification tasks.

   5. **Imbalance and Ambiguity in Sentiment Expressions**:
      - The sentiment dataset may contain imbalanced class distributions, ambiguous expressions, and diverse sentiment manifestations, making the task of sentiment classification inherently challenging.

