## Objective
Build a minimal sentiment classifier for cryptocurrency-related Reddit comments, accessible via a basic REST API.

## Background

We are developing an advanced sentiment analysis pipeline to analyze cryptocurrency-related discussions from social media platforms, especially Reddit. Accurate, comprehensive, and scalable sentiment analysis is critical to various downstream analytical applications.

**The classifier will be deployed to handle approximately 10 million comments daily within a computational budget of roughly $100/month.**

Because the classifier must handle a large volume of comments daily on a budget of about $100/month, running inference with transformers could be both slow and costly. With the time limit and budget in mind, let's start with Logistic Regression and Naive Bayes, which I have personally found to be very useful in cases like this.

If time allows, we will also try an SVM and a transformer (BERT base uncased); we already have a lowercased version of the data prepared for transformers.
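
To make the deployment constraint concrete, the figures above can be turned into an average rate and a per-volume budget. This is a back-of-the-envelope sketch only; the 30-day month and the function names are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_rate(comments_per_day: int) -> float:
    """Average comments per second needed to keep up with the daily volume."""
    return comments_per_day / SECONDS_PER_DAY

def budget_per_million(monthly_budget_usd: float, comments_per_day: int,
                       days_per_month: int = 30) -> float:
    """USD available per one million comments (assumes a 30-day month)."""
    millions_per_month = comments_per_day * days_per_month / 1_000_000
    return monthly_budget_usd / millions_per_month

rate = required_rate(10_000_000)
cost = budget_per_million(100, 10_000_000)
print(f"Required average throughput: {rate:.1f} comments/s")
print(f"Budget: ${cost:.2f} per million comments")
```

Any candidate model has to sustain roughly this rate on hardware that fits the budget, which is the main argument for starting with linear models.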

In [None]:
# Colab is recommended for a faster run and to avoid unmet library dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Load Data

In [109]:
crypto_data_base=pd.read_csv('/content/crypto_data_base.csv')

In [110]:
# crypto_data_base=pd.read_csv('crypto_data_base.csv') # if required to replicate

In [111]:
crypto_data_base.Sentiment.unique()

array(['Positive', 'Negative'], dtype=object)

### Mapping Labels

In [112]:
sentiment_mapping = {'Positive': 1, 'Negative': 0}

In [113]:
crypto_data_base['label']=crypto_data_base['Sentiment'].map(sentiment_mapping)

In [114]:
crypto_data_base.label.isna().sum()

np.int64(0)

In [115]:
crypto_data_base.head()

Unnamed: 0,Comment_demojified_cleaned,Sentiment,label
0,i bought 2200 at the ico at 0.50$ per coin. ho...,Positive,1
1,harmony one algorand cardano solana vechain go...,Positive,1
2,honestly after reading this post and many of t...,Negative,0
3,in bear market is where money is made. i will ...,Positive,1
4,funny how people think bitcoins risk is compar...,Negative,0


## Model Development (Base Tfidf)

In [None]:
X=crypto_data_base['Comment_demojified_cleaned']
y=crypto_data_base['label']

In [121]:
# stratify on y to preserve the positive/negative ratio in both splits
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
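
As a rough illustration of what `stratify=y` achieves, the sketch below splits each class separately so that the test set keeps the overall class ratio. It is a simplified pure-Python stand-in, not scikit-learn's implementation, and the toy label counts (230 negative, 270 positive) are assumed for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels: 230 negative (0) and 270 positive (1)
labels = [0] * 230 + [1] * 270
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(test_counts.items()))
```

Each class contributes 20 percent of its samples to the test set, so the negative/positive proportion is the same in both splits.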

In [122]:
tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 3),
    max_df=0.95,
    min_df=2,
    use_idf=True
)

Let's fit the vectorizer on the training text only, then use it to transform both the train and test sets.

In [123]:
tfidf_vectorizer.fit(X_train_text)

In [124]:
X_train_tfidf = tfidf_vectorizer.transform(X_train_text)
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)
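
The reason for fitting on the training text only is that the vocabulary and document frequencies should not be influenced by the test set. The sketch below mimics how `min_df` and `max_df` prune the vocabulary, using whitespace tokenization and unigrams only. It is a simplified pure-Python illustration with made-up comments, not what `TfidfVectorizer` does internally.

```python
from collections import Counter

def build_vocabulary(train_docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Made-up training comments
train_docs = [
    "btc to the moon",
    "btc is a scam",
    "eth to the moon btc",
    "sell btc now",
]
vocabulary = build_vocabulary(train_docs)

# Terms never seen in training (here "doge") are dropped at transform time.
test_doc = "doge to the moon"
known_terms = [t for t in test_doc.split() if t in vocabulary]
print(vocabulary)
print(known_terms)
```

Here "btc" is removed because it appears in every training document (above `max_df`), and terms that occur in a single document fall below `min_df`.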

### Logistic Regression

In [125]:
classifier = LogisticRegression(solver='liblinear', random_state=42)

To avoid tuning values by hand later, let's grid-search over a reasonably wide range now so that we get to a good accuracy faster.

In [126]:
param_grid_clf = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

In [127]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [128]:
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid_clf, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [129]:
print(grid_search.best_params_)

{'C': 10, 'penalty': 'l2'}


In [130]:
best_classifier = grid_search.best_estimator_

In [131]:
y_pred = best_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)

In [132]:
print(f"Test Set Accuracy: {accuracy:.4f}")

Test Set Accuracy: 0.7500


In [133]:
print(confusion_matrix(y_test, y_pred))

[[27 19]
 [ 6 48]]
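
The confusion matrix above follows scikit-learn's layout: rows are true classes and columns are predicted classes, in the order Negative (0), Positive (1). A small sketch of how accuracy and per-class recall follow from it (the helper name is illustrative):

```python
def summarize(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return correct / total, recall

cm = [[27, 19],
      [6, 48]]
accuracy, recall = summarize(cm)
print(f"accuracy={accuracy:.2f}")
print(f"recall: negative={recall[0]:.2f}, positive={recall[1]:.2f}")
```

Most of the errors are negative comments predicted as positive (19 of 46), so the model is noticeably weaker on the negative class.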


The accuracy is not good enough, so let's move on to Naive Bayes.

### Naive Bayes

In [134]:
classifier = MultinomialNB()

In [None]:
param_grid_clf = {
    'alpha': [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


In [136]:
grid_search = GridSearchCV(estimator=classifier, param_grid=param_grid_clf, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train_tfidf, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [None]:
best_classifier = grid_search.best_estimator_
print("Evaluating the Naive Bayes")
y_pred = best_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.4f}")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Evaluating the best Naive Bayes classifier on the test set (TF-IDF transformed)...
Test Set Accuracy: 0.7800

Classification Report:
              precision    recall  f1-score   support

    Negative       0.88      0.61      0.72        46
    Positive       0.74      0.93      0.82        54

    accuracy                           0.78       100
   macro avg       0.81      0.77      0.77       100
weighted avg       0.80      0.78      0.77       100


Confusion Matrix:
[[28 18]
 [ 4 50]]


## Model Development (Variable Tfidf)

Let's make the TF-IDF settings part of the hyperparameter search to save time.

In [138]:
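# NOTE: this column is not present in the CSV loaded above, so this cell raises the KeyError shown below
# and X keeps its previous value (the 'Comment_demojified_cleaned' column)
X=crypto_data_base['Comment_demojified_cleaned_all_processed']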
y=crypto_data_base['label']

KeyError: 'Comment_demojified_cleaned_all_processed'

In [139]:
# stratify on y to preserve the positive/negative ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

In [None]:
# pipeline so that we can take tfidf params inside hyperparameter search
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear', random_state=42, class_weight='balanced'))
])
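
`class_weight='balanced'` reweights each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. A quick sketch of that formula follows; the training counts used here (184 negative, 216 positive) are assumed for illustration and are not taken from the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical training labels
y_toy = [0] * 184 + [1] * 216
weights = balanced_class_weights(y_toy)
print({cls: round(w, 3) for cls, w in sorted(weights.items())})
```

The minority class receives the larger weight, so its errors count slightly more in the loss.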

In [None]:
param_grid = {
    # pipeline parameters must be named <step name>__<parameter name>, otherwise GridSearchCV raises an error
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.75, 0.95, 1.0],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'clf__penalty': ['l1', 'l2'] 
}
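
The `step__parameter` naming lets a single grid cover both the vectorizer and the classifier, but the number of fits grows multiplicatively. A short sketch of the count for the grid above with 5-fold cross-validation:

```python
from math import prod

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.75, 0.95, 1.0],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'clf__penalty': ['l1', 'l2'],
}

# Every combination of values is one candidate; each is fitted once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_folds = 5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```

Adding one more option to any list multiplies the total, so it is worth keeping each list short.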

In [142]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


In [None]:
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters")
print(grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

best_model_pipeline = grid_search.best_estimator_

print("Evaluating")
y_pred = best_model_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Fitting 5 folds for each of 360 candidates, totalling 1800 fits

Best parameters found by GridSearchCV for the pipeline:
{'clf__C': 10, 'clf__penalty': 'l1', 'tfidf__max_df': 0.75, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1), 'tfidf__use_idf': True}
Best cross-validation accuracy for the pipeline: 0.8025

Evaluating the best model pipeline on the test set...
Test Set Accuracy: 0.7000

Classification Report:
              precision    recall  f1-score   support

    Negative       0.74      0.54      0.62        46
    Positive       0.68      0.83      0.75        54

    accuracy                           0.70       100
   macro avg       0.71      0.69      0.69       100
weighted avg       0.71      0.70      0.69       100


Confusion Matrix:
[[25 21]
 [ 9 45]]


### Naive Bayes

In [146]:
X=crypto_data_base['Comment_demojified_cleaned']
y=crypto_data_base['label']

In [147]:
# stratify on y to preserve the positive/negative ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

In [148]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

In [None]:
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.75, 0.95, 1.0],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__use_idf': [True, False],
    'clf__alpha': [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]
}

In [150]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


In [None]:
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters")
print(grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

best_model_pipeline = grid_search.best_estimator_

print("Evaluating")
y_pred = best_model_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.4f}")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Fitting 5 folds for each of 216 candidates, totalling 1080 fits

Best parameters found by GridSearchCV for the pipeline:
{'clf__alpha': 0.01, 'tfidf__max_df': 0.75, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1), 'tfidf__use_idf': False}
Best cross-validation accuracy for the pipeline: 0.8153

Evaluating the best model pipeline on the test set...
Test Set Accuracy: 0.7333

Classification Report:
              precision    recall  f1-score   support

    Negative       0.74      0.65      0.70        49
    Positive       0.73      0.80      0.76        56

    accuracy                           0.73       105
   macro avg       0.73      0.73      0.73       105
weighted avg       0.73      0.73      0.73       105


Confusion Matrix:
[[32 17]
 [11 45]]


### Linear SVM

In [152]:
from sklearn.svm import LinearSVC

In [155]:
X=crypto_data_base['Comment_demojified_cleaned']
y=crypto_data_base['label']

In [156]:
# stratify on y to preserve the positive/negative ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

In [157]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC(random_state=42, class_weight='balanced', dual=True))
])

In [None]:
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__max_df': [0.75, 0.95, 1.0],
    'tfidf__min_df': [1, 2, 3],
    'tfidf__use_idf': [True, False],
    'clf__C': [0.01, 0.1, 1, 10, 100]
}

In [159]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters")
print(grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

best_model_pipeline = grid_search.best_estimator_

print("Evaluating")
y_pred = best_model_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.4f}")

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))




After trying both versions of the data with the baseline models, the highest accuracy we reached with a TF-IDF-based approach was about 80 percent. TF-IDF on the base data was not performing very well, so let's move on to a transformer-based approach.

## Moving away from TFIDF

Change the runtime to GPU here for faster replication

In [None]:
# Colab is recommended for a faster run and to avoid environment issues
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import  precision_recall_fscore_support

In [2]:
crypto_data_transformer=pd.read_csv('/content/crypto_data_transformer.csv')

In [3]:
crypto_data_transformer.columns

Index(['Comment_demojified_cleaned', 'Sentiment'], dtype='object')

In [41]:
# moving to transformer based approach
import torch
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer
)

In [42]:
model_id="distilbert-base-uncased"

In [43]:
sentiment_mapping = {'Positive': 1, 'Negative': 0} # as before

In [44]:
crypto_data_transformer['label'] = crypto_data_transformer['Sentiment'].map(sentiment_mapping)  # as before
crypto_data_transformer['label'] = crypto_data_transformer['label'].astype(int)

In [45]:
# convert to plain lists, as we did earlier when validating sentiments; the tokenizer needs lists to generate encodings
text = crypto_data_transformer['Comment_demojified_cleaned'].to_list()
labels=crypto_data_transformer['label'].tolist()

In [46]:
# as before: hold out 20 percent for validation, leaving enough data for training
train_texts, val_texts, train_labels, val_labels = train_test_split(text, labels, test_size=0.20, random_state=42, stratify=labels)

In [47]:
# Important step: tokenize the data with the model's own tokenizer before training; the data was prepared with this phase in mind
tokenizer = DistilBertTokenizerFast.from_pretrained(model_id)

In [48]:
# code from huggingface
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=450)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=450)
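
Conceptually, `truncation=True` with `max_length=450` cuts any sequence longer than 450 tokens, and `padding=True` pads every sequence to the longest one in the batch, with an attention mask marking the real tokens. The sketch below imitates that on plain lists of token ids. It ignores special tokens such as `[CLS]` and `[SEP]`, uses arbitrary toy ids, and is not the tokenizer's actual implementation.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy batch: a short sequence and one that exceeds max_length
batch = pad_and_truncate([[101, 2145, 102], [101, 2228, 2632, 20255, 102, 7]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the whole list is passed to the tokenizer at once, a single long comment makes every row long, which is why the encodings printed in the following cells are mostly zeros.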

In [49]:
train_encodings.keys() # some checks to understand how to make our custom dataset in pytorch

dict_keys(['input_ids', 'attention_mask'])

In [50]:
for key,value in train_encodings.items():
  print(f"{key}: {value}")

input_ids: [[101, 2145, 2228, 2632, 20255, 5685, 2038, 2009, 2035, 2293, 2023, 9226, 102, 0, 0, 0, ... (output truncated)

In [51]:
class CryptoDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        inputs = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()} # returns the input ids and attention mask for one item at a time
        inputs['labels'] = torch.tensor(self.labels[idx]) # the label for that same item
        return inputs

    def __len__(self):
        return len(self.labels)

In [52]:
train_dataset = CryptoDataset(train_encodings, train_labels)
val_dataset = CryptoDataset(val_encodings, val_labels)

In [29]:
train_dataset.__getitem__(0) # sanity check

{'input_ids': tensor([  101, 28855, 14820,  1012,  3806,  9883,  2024,  2205,  2152,  2000,
          2191,  2009,  4276,  2478,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
          ... (output truncated)

In [71]:
model = DistilBertForSequenceClassification.from_pretrained(model_id, num_labels=len(sentiment_mapping))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
# the most basic trainer config, with changes to weight decay, warmup and logging steps
training_args = TrainingArguments(
    output_dir="./fine_tuned_distilbert_sentiment_crypto_2",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=20,
    weight_decay=0.1,
    logging_dir='./logs',
    logging_steps=20,
    eval_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    overwrite_output_dir=True,
    report_to="none",
)

In [74]:
# adapted from an older notebook that I used for research on the effect of preprocessing techniques on transformers
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }
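
For reference, `pred.predictions` holds one row of logits per sample, and `argmax(-1)` picks the index of the largest logit as the predicted class. A dependency-free sketch of the same computation on made-up logits:

```python
def argmax_accuracy(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return preds, correct / len(labels)

# Made-up logits for four samples, columns = [Negative, Positive]
toy_logits = [[1.2, -0.3], [-0.8, 0.9], [0.1, 0.4], [2.0, 1.5]]
toy_labels = [0, 1, 0, 0]
preds, acc = argmax_accuracy(toy_logits, toy_labels)
print(preds, acc)
```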

In [75]:
# now create a Trainer and train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

With a higher number of epochs we started to see the model overfit, so we are reducing the number of epochs.

In [76]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
20,0.6964,0.689327,0.542857
40,0.6801,0.671985,0.647619
60,0.6165,0.608546,0.685714
80,0.5207,0.521032,0.8
100,0.3547,0.472605,0.790476
120,0.2935,0.457622,0.8
140,0.2064,0.381786,0.847619
160,0.1396,0.400347,0.847619
180,0.1298,0.436758,0.828571
200,0.0951,0.425171,0.847619


TrainOutput(global_step=270, training_loss=0.29377144155678925, metrics={'train_runtime': 208.7522, 'train_samples_per_second': 19.976, 'train_steps_per_second': 1.293, 'total_flos': 485498190582000.0, 'train_loss': 0.29377144155678925, 'epoch': 10.0})

The model did overfit, judging by the training and validation losses, but a baseline of about 85 percent with minimal effort is a good starting point. I suspect the data has some minor issues with wrong labels; with a validation set of only about 100 data points, getting fewer than 10 of them wrong will need more analysis of data quality.
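
One way to act on the overfitting seen in the log is to stop once the validation loss has not improved for a few evaluations and keep the best checkpoint. The sketch below replays the logged validation losses with a patience of 2 evaluations. The patience value and the helper are assumptions for illustration and were not part of the training run above.

```python
# (step, validation loss) pairs copied from the training log
val_log = [
    (20, 0.689327), (40, 0.671985), (60, 0.608546), (80, 0.521032),
    (100, 0.472605), (120, 0.457622), (140, 0.381786), (160, 0.400347),
    (180, 0.436758), (200, 0.425171),
]

def early_stop(log, patience=2):
    """Return (best_step, stop_step) for a list of (step, validation_loss)."""
    best_step, best_loss, bad_evals = None, float("inf"), 0
    for step, loss in log:
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step  # stop here, keep the best checkpoint
    return best_step, log[-1][0]

best_step, stop_step = early_stop(val_log)
print(best_step, stop_step)
```

With the Hugging Face `Trainer`, `EarlyStoppingCallback` offers this kind of behaviour, so a hand-written loop like this should not be needed in practice.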

In [77]:
eval_results = trainer.evaluate()

In [78]:
eval_results

{'eval_loss': 0.4450535774230957,
 'eval_accuracy': 0.8571428571428571,
 'eval_runtime': 1.546,
 'eval_samples_per_second': 67.919,
 'eval_steps_per_second': 4.528,
 'epoch': 10.0}

In [79]:
model.save_pretrained('./fine_tuned_distilbert_sentiment_crypto_2')
tokenizer.save_pretrained('./fine_tuned_distilbert_sentiment_crypto_2')

('./fine_tuned_distilbert_sentiment_crypto_2/tokenizer_config.json',
 './fine_tuned_distilbert_sentiment_crypto_2/special_tokens_map.json',
 './fine_tuned_distilbert_sentiment_crypto_2/vocab.txt',
 './fine_tuned_distilbert_sentiment_crypto_2/added_tokens.json',
 './fine_tuned_distilbert_sentiment_crypto_2/tokenizer.json')

Given the time available and the lack of further improvement in accuracy, the earlier suspicion that the data quality is not good enough may well be the cause.

Some other experiments could be run with better models, but with better models scalability would be greatly hampered.
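
To put the scalability concern in numbers, the evaluation speed reported above (`eval_samples_per_second` of 67.919) can be extrapolated to a full day. This is only a rough sketch: it assumes that rate is sustained around the clock on the same hardware, which a small validation run cannot guarantee.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TARGET_PER_DAY = 10_000_000  # daily volume from the Background section

def daily_capacity(samples_per_second: float) -> float:
    """Comments per day if the given rate were sustained for 24 hours."""
    return samples_per_second * SECONDS_PER_DAY

capacity = daily_capacity(67.919)  # eval_samples_per_second from eval_results
print(f"Estimated capacity: {capacity:,.0f} comments/day")
print(f"Share of the daily target: {capacity / TARGET_PER_DAY:.0%}")
```

Under these assumptions a single instance of the fine-tuned model falls short of the daily target, and a larger model would be slower still.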