<h3>Task 1 - Sentiment analysis</h3>
In this task we use a dataset from Kaggle, which originates from the following article: <b>Malo, Pekka, et al. "Good debt or bad debt: Detecting semantic orientations in economic texts." Journal of the Association for Information Science and Technology 65.4 (2014): 782-796.</b> The dataset contains sentences labelled with a sentiment (positive, neutral or negative).

The task is to train a sentiment analysis model on this data and to use the F1 score as the metric for picking the best classifier.
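
As a reminder of what the metric measures, the sketch below computes a per-class F1 score from precision and recall and then averages the classes weighted by their support, which is the idea behind a weighted F1 score. The helper names `f1_per_class` and `weighted_f1` and the label lists are made up for illustration; they are not part of the notebook.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from per-class precision and recall."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    per_class = f1_per_class(y_true, y_pred)
    return sum(per_class[c] * support[c] for c in support) / len(y_true)

# Made-up labels, for illustration only
y_true = ["neutral", "neutral", "positive", "negative", "neutral", "positive"]
y_pred = ["neutral", "neutral", "positive", "neutral", "neutral", "negative"]

print(f1_per_class(y_true, y_pred))
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

Because each class is weighted by its support, a frequent class influences the weighted score more than a rare one.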

In [None]:
# importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import sklearn
import seaborn as sns

: 

In [None]:
# reading the dataset
df = pd.read_csv(r'C:\Users\Atharva Tawde\Desktop\ignite projects\task 1\data_task_1.csv')
df.head()

: 

<h3><b>Data Visualisation</b></h3>
The pie chart below shows the distribution of the positive, neutral and negative sentiments across the dataset. The data is skewed: the neutral sentiment accounts for almost 54% of the data, while the negative sentiment accounts for only 15%. This may be because the dataset is financial in nature and most of the sentences simply state facts about the stock markets of various countries.

In [None]:
# plotting the sentiment distribution using a pie chart
plt.figure(figsize=(5,5))
df['Sentiment'].value_counts().plot(kind='pie', autopct='%1.0f%%')
plt.title('Sentiment distribution across the dataset');

: 
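
To see why the skew matters for evaluation, here is a small sketch of a baseline that always predicts the most frequent class. The label list is synthetic: it is built from the approximate shares quoted above (54% neutral, 15% negative), and the 31% positive share is simply the assumed remainder, not a figure read from the dataset.

```python
from collections import Counter

# Synthetic labels built from the approximate shares quoted above;
# the 31% positive share is an assumed remainder.
labels = ["neutral"] * 54 + ["positive"] * 31 + ["negative"] * 15

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A baseline that always predicts the most frequent class
baseline_pred = [majority_class] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, baseline_pred)) / len(labels)

# Recall on the minority class: how many negative sentences the baseline finds
negative_hits = sum(
    t == "negative" and p == "negative" for t, p in zip(labels, baseline_pred)
)
negative_recall = negative_hits / counts["negative"]

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {accuracy:.2f}")
print(f"negative recall: {negative_recall:.2f}")
```

Such a baseline looks acceptable on accuracy while never finding a single negative sentence, which is one reason to look at per-class scores rather than accuracy alone.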

In [None]:
!pip install nltk
import nltk
nltk.download('stopwords')

: 

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

: 

**Pre-processing:**
The data is pre-processed as follows:<br>
1. Converting the text to lower case so that the same word in different cases is not treated as two words.
2. Removing punctuation.
3. Tokenising the text and removing English stopwords.
4. Lemmatising the words to reduce the number of distinct words used in training.

In [None]:
# text preprocessing
import re

# build the stopword set and the lemmatizer once instead of on every call
stop_words = set(nltk.corpus.stopwords.words('english'))
lemmatizer = nltk.WordNetLemmatizer()

def pre_processing(text):
    # converting to lower case
    text = text.lower()
    # removing punctuation (str.replace matches literally, so a regex substitution is needed)
    text = re.sub(r'[^\w\s]', '', text)
    # tokenising and removing stopwords
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    # lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

: 

In [None]:
# splitting the data into train and test
from sklearn.model_selection import train_test_split

X = df['Sentence']
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

: 

In [None]:
# preprocessing the data
X_train = X_train.apply(pre_processing)
X_test = X_test.apply(pre_processing)

X_train.shape, X_test.shape

: 

**Testing the models:** We tried several classifiers from the `sklearn` library and used the F1 score as the metric to choose the best model for the task. The sentences are first vectorised with tf-idf so that the classifiers receive numerical input.
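
To make the vectorisation step concrete, the sketch below computes tf-idf weights by hand for three made-up sentences. It uses the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how the `TfidfVectorizer` defaults are documented; the whitespace tokenisation is a simplification, and the helper name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised tf-idf vectors) for whitespace-tokenised docs."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each token occurs
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors

# Made-up sentences, for illustration only
docs = ["profit rose sharply", "profit fell", "profit was unchanged"]
vocab, vectors = tfidf_vectors(docs)

first = dict(zip(vocab, vectors[0]))
print(vocab)
print(first)
```

A word that appears in every sentence ("profit") receives a lower weight than a word specific to one sentence ("rose"), so the classifier sees the distinguishing words more strongly.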

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score, classification_report
from sklearn.pipeline import Pipeline

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Support Vector Classifier': SVC(),
    'Random Forest': RandomForestClassifier(n_estimators = 1000,max_depth=1000),
    'Gradient Boosting': GradientBoostingClassifier()
}

tfidf = TfidfVectorizer()

# Evaluate each classifier
f1_scores = {}
for name, clf in classifiers.items():
    pipeline = Pipeline([
        ('tfidf', tfidf),
        ('classifier', clf)
    ])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Predict on test data
    y_pred = pipeline.predict(X_test)

    # Calculate F1 score
    f1 = f1_score(y_test, y_pred, average='weighted')
    f1_scores[name] = f1

# Print F1 scores for all classifiers
print("F1 Scores for all classifiers:")
for name, score in f1_scores.items():
    print(f"{name}: {score:.4f}")

: 

In [None]:
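# print the classification report for the best performing classifier
model = max(f1_scores, key=f1_scores.get)
print(f"Best Classifier: {model}")
# y_pred from the loop above belongs to the last classifier fitted, so predict again with the best one
# (Pipeline fits its steps in place, so tfidf and the stored classifier are already trained on X_train)
y_pred = classifiers[model].predict(tfidf.transform(X_test))
print(classification_report(y_test, y_pred))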

: 

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# fix the label order explicitly so the matrix rows and columns match the display labels
labels = sorted(y.unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot(cmap='BuPu', colorbar=False)
plt.title('Confusion Matrix')
plt.show()

: 

**Conclusion**: The confusion matrix is plotted above, and we observe that the `LogisticRegression()` model is a good classifier, with an F1 score of approximately 67%, compared with the other classifier models. The classification report also gives the F1 score for each class; the neutral class has the highest score, likely because of its large support in the dataset.