**Input Datasets and Preprocessing**

This section loads three datasets (training, test, and validation) and then applies a preprocessing function to the tweet text.
The preprocessing function performs the following steps:


1.   Removing every character that is not an English or Persian letter (special characters and digits)
2.   Converting to lowercase
3.   Tokenizing the text
4.   Removing English stopwords
5.   Lemmatization
6.   Joining the tokens back into a single string

Finally, the cell prints the first rows of each dataset after preprocessing.


In [7]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download nltk resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')


# Load the data
data_train = pd.read_csv('twitter_training.csv', encoding='latin-1')
data_test = pd.read_csv('twitter_test.csv', encoding='latin-1')
data_validation = pd.read_csv('twitter_validation.csv', encoding='latin-1')

# Data Preprocessing
# Build the stopword set and the lemmatizer once instead of on every call
stopwords_set = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if isinstance(text, str):  # Non-string values (e.g. NaN) are returned unchanged
        # Keep only English and Persian letters; replace everything else with a space
        text = re.sub(r'[^a-zA-Zآ-ی]', ' ', text)

        # Convert to lowercase
        text = text.lower()

        # Tokenize the text
        tokens = nltk.word_tokenize(text)

        # Remove English stopwords (set membership is faster than list membership)
        tokens = [token for token in tokens if token not in stopwords_set]

        # Lemmatization
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

        # Join the tokens back into a single string
        text = ' '.join(tokens)
    return text

data_train['Tweet content'] = data_train['Tweet content'].apply(preprocess_text)
data_test['Tweet content'] = data_test['Tweet content'].apply(preprocess_text)
data_validation['Tweet content'] = data_validation['Tweet content'].apply(preprocess_text)


print(data_train.head())
print(data_test.head())
print(data_validation.head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


   Tweet ID       entity sentiment                 Tweet content
0      2401  Borderlands  Positive  im getting borderland murder
1      2401  Borderlands  Positive            coming border kill
2      2401  Borderlands  Positive    im getting borderland kill
3      2401  Borderlands  Positive   im coming borderland murder
4      2401  Borderlands  Positive  im getting borderland murder
   Tweet ID     entity   sentiment  \
0      3364   Facebook  Irrelevant   
1       352     Amazon     Neutral   
2      8312  Microsoft    Negative   
3      4371      CS-GO    Negative   
4      4433     Google     Neutral   

                                       Tweet content  
0  mentioned facebook struggling motivation go ru...  
1  bbc news amazon bos jeff bezos reject claim co...  
2  microsoft pay word function poorly samsungus c...  
3  csgo matchmaking full closet hacking truly awf...  
4  president slapping american face really commit...  
   Tweet ID                             entity sent
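
The character-removal step is the least obvious part of the preprocessing, so here is a minimal sketch of it in isolation. It reuses the same character class as preprocess_text; the helper name strip_non_letters and the sample string are assumptions made for illustration and are not part of the notebook.

```python
import re

# Same character class as in preprocess_text: keep a-z, A-Z and the
# Unicode range from U+0622 to U+06CC; everything else becomes a space.
NON_LETTER = re.compile(r'[^a-zA-Zآ-ی]')

def strip_non_letters(text):
    """Replace unwanted characters with spaces, then collapse repeated spaces."""
    cleaned = NON_LETTER.sub(' ', text)
    return ' '.join(cleaned.split())

sample = "Hello, World! 123 سلام"  # hypothetical input, not taken from the dataset
print(strip_non_letters(sample))
```

Because the Persian part of the pattern is a plain Unicode range, it keeps every code point between its two endpoints. Arabic-Indic digits (U+0660 to U+0669) fall inside that range, so they would survive this step even though ASCII digits are removed.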

**Feature Selection**


This section performs the feature selection and extraction step. First, NaN values in the tweet text are replaced with empty strings. Second, the text is converted to numerical features using TF-IDF.

We use TfidfVectorizer from scikit-learn to convert the preprocessed text into numerical features. The vectorizer is fitted on the training data only, and the fitted vectorizer is then used to transform the test and validation data. We also store the corresponding sentiment labels in separate variables.

The output shows the resulting sparse TF-IDF matrix for the training data. Each line gives a (row, column) index pair, that is (tweet, vocabulary term), followed by the TF-IDF weight.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

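# Replace NaN values with an empty string
# (assigning back avoids the chained-assignment warning that recent pandas raises for inplace=True on a column)
data_train['Tweet content'] = data_train['Tweet content'].fillna('')
data_test['Tweet content'] = data_test['Tweet content'].fillna('')
data_validation['Tweet content'] = data_validation['Tweet content'].fillna('')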

# Feature Extraction
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train['Tweet content'])
X_test = vectorizer.transform(data_test['Tweet content'])
X_validation = vectorizer.transform(data_validation['Tweet content'])

y_train = data_train['sentiment']
y_test = data_test['sentiment']
y_validation = data_validation['sentiment']

print(X_train)

  (0, 15143)	0.6544346822104361
  (0, 2604)	0.4074132771999723
  (0, 9148)	0.42357797560585864
  (0, 11123)	0.47572194280161867
  (1, 12633)	0.501903248747078
  (1, 2601)	0.6902302356270013
  (1, 4201)	0.521224856202602
  (2, 12633)	0.5399671738206703
  (2, 2604)	0.45351894519199304
  (2, 9148)	0.4715129512311874
  (2, 11123)	0.5295578857587865
  (3, 4201)	0.48600512487910713
  (3, 15143)	0.6313859910022002
  (3, 2604)	0.3930644917893328
  (3, 11123)	0.45896737820002736
  (4, 15143)	0.6544346822104361
  (4, 2604)	0.4074132771999723
  (4, 9148)	0.42357797560585864
  (4, 11123)	0.47572194280161867
  (5, 15143)	0.6544346822104361
  (5, 2604)	0.4074132771999723
  (5, 9148)	0.42357797560585864
  (5, 11123)	0.47572194280161867
  (6, 12060)	0.250620536967174
  (6, 25500)	0.2898143722283676
  :	:
  (74679, 11001)	0.27102604405314906
  (74679, 26307)	0.19824373582358248
  (74680, 16812)	0.41140948118007103
  (74680, 15912)	0.19642521538823637
  (74680, 6456)	0.27944591398588514
  (74680, 25673)
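
To make the fit/transform split concrete, the following is a small sketch on a toy corpus. The documents and the variable names (train_docs, test_docs, X_train_toy, X_test_toy) are hypothetical; only TfidfVectorizer and its fit_transform, transform, and vocabulary_ members come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, shaped like the preprocessed tweets above
train_docs = [
    "im getting borderland murder",
    "coming border kill",
    "im getting borderland kill",
]
test_docs = ["borderland kill unseenword"]

vectorizer = TfidfVectorizer()
X_train_toy = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_test_toy = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(X_train_toy.shape, X_test_toy.shape)
print(sorted(vectorizer.vocabulary_))  # the terms behind the column indices
```

Both matrices have the same number of columns because the vocabulary is fixed at fit time. A word that appears only in the test data (here "unseenword") gets no column and contributes nothing to the test vector.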

**Model Training**

In this part, we create instances of the Naive Bayes (MultinomialNB) and SVM (SVC) models. We then train these models using the preprocessed training data and the corresponding sentiment labels.

We therefore use two training methods: 1. Naive Bayes  2. SVM

Both are widely used methods for classification.

Fitting the SVM takes a long time on this dataset and is computationally expensive,
whereas fitting Naive Bayes is very fast.

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Model Training
svm_model = SVC()
svm_model.fit(X_train, y_train)  # Fit the SVM model with the training data

svm_predictions = svm_model.predict(X_test)  # Make predictions using the trained model

# Evaluation
print("SVM Classification Report:")
print(classification_report(y_test, svm_predictions))
print("SVM Confusion Matrix:")
print(confusion_matrix(y_test, svm_predictions))
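
The text above also mentions a Naive Bayes model. The following is a minimal, self-contained sketch of how MultinomialNB can be trained and evaluated on TF-IDF features. The toy texts, labels, and variable names (toy_vectorizer, X_tr, X_te, nb_model, nb_predictions) are assumptions for illustration; in the notebook the same calls would presumably be applied to X_train, y_train, and X_test.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical toy data standing in for the tweet text and sentiment columns
train_texts = ["love game great", "great fun love", "hate bug awful", "awful crash hate"]
train_labels = ["Positive", "Positive", "Negative", "Negative"]
test_texts = ["love great fun", "awful bug crash"]
test_labels = ["Positive", "Negative"]

toy_vectorizer = TfidfVectorizer()
X_tr = toy_vectorizer.fit_transform(train_texts)
X_te = toy_vectorizer.transform(test_texts)

nb_model = MultinomialNB()  # default additive smoothing (alpha=1.0)
nb_model.fit(X_tr, train_labels)
nb_predictions = nb_model.predict(X_te)

print(classification_report(test_labels, nb_predictions))
```

MultinomialNB only needs per-class feature totals, which is why fitting it is fast compared with SVC on a large sparse matrix.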

**Saving Predictions to Excel**

In this part, we define a function save_predictions() that takes the predicted sentiment labels, the original data, and a filename as input. The function creates a copy of the data, adds a new column called "Predicted Sentiment" containing the predictions, and saves the modified data to an Excel file using to_excel(). We then call this function for both Naive Bayes and SVM predictions, passing the predictions, the test data, and the desired filenames.

In [6]:
def save_predictions(predictions, data, filename):
    data_with_predictions = data.copy()
    data_with_predictions['Predicted Sentiment'] = predictions
    data_with_predictions.to_excel(filename, index=False)

save_predictions(svm_predictions, data_test, 'svm_predictions.xlsx')

NameError: ignored
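
The copy-then-annotate pattern used by save_predictions() can be checked on a small in-memory frame. This is a sketch under stated assumptions: the helper with_predictions, the toy frame, and the length check are hypothetical additions, and the result is written as CSV to a string buffer instead of to an Excel file so that the example does not depend on an Excel writer engine being installed.

```python
import io
import pandas as pd

def with_predictions(predictions, data):
    """Return a copy of data with a 'Predicted Sentiment' column added."""
    if len(predictions) != len(data):
        raise ValueError("predictions and data must have the same number of rows")
    data_with_predictions = data.copy()  # leave the caller's frame untouched
    data_with_predictions['Predicted Sentiment'] = list(predictions)
    return data_with_predictions

# Hypothetical stand-ins for data_test and the model predictions
toy_test = pd.DataFrame({
    'Tweet content': ["love great fun", "awful bug crash"],
    'sentiment': ["Positive", "Negative"],
})
toy_predictions = ["Positive", "Negative"]

result = with_predictions(toy_predictions, toy_test)

buffer = io.StringIO()
result.to_csv(buffer, index=False)  # CSV used here in place of to_excel()
print(buffer.getvalue())
```

Whichever output format is used, the predictions variable has to exist before the call. The NameError shown above is presumably the result of running this cell before the cell that defines svm_predictions.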