This implementation uses the Python library autocorrect to correct misspelled sentiment labels.

In [None]:
!pip install autocorrect



Importing required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report,accuracy_score
from sklearn.pipeline import make_pipeline
from autocorrect import Speller
import warnings

Suppressing warnings of category UserWarning, which would otherwise make cell outputs noisy

In [None]:
warnings.filterwarnings("ignore", category=UserWarning)

Using pandas to read the data files (the fields are tab-separated, hence sep='\t')

In [None]:
crowd_sourced_data = pd.read_csv('/content/crowdsourced_train.csv',sep='\t')


gold_train_data = pd.read_csv('/content/gold_train.csv',sep='\t')


test_data = pd.read_csv('test.csv',sep='\t')
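A minimal sketch of what reading a tab-separated file looks like, using a small in-memory sample instead of the real files. The two sample rows are hypothetical; only the column names "text" and "sentiment" are taken from how the datasets are used later in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real files.
# Fields are separated by tabs, so the comma inside the text is harmless.
sample = (
    "text\tsentiment\n"
    "Great phone, love it\tpositive\n"
    "It arrived on Tuesday\tneutral\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t")

print(df.columns.tolist())
print(df.shape)
```

Checking the columns and shape right after loading is a cheap way to confirm that the separator was interpreted as intended.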


*   Creating a scikit-learn pipeline so that vectorization and classification can be fitted and tuned as a single estimator
*   Combining TfidfVectorizer and LinearSVC in the pipeline



In [None]:
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(dual=False))
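make_pipeline names each step after its lowercased class name, which is why the tuning grid later in this notebook uses the prefixes "tfidfvectorizer__" and "linearsvc__". A short sketch to confirm the step names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(dual=False))

# Each step is a (name, estimator) pair; the names are generated automatically.
step_names = [name for name, _ in pipeline.steps]
print(step_names)  # ['tfidfvectorizer', 'linearsvc']
```

A parameter of a step is then addressed as "<step name>__<parameter>", for example "linearsvc__C".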



*   Copying the original dataset so that the spelling corrections are applied to a new dataset
*   Using the "Speller" class of "autocorrect" with "lang" set to "en"
*   Reference: [https://github.com/filyp/autocorrect]





In [None]:
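# Raw labels as a plain list (handy for inspection; not used in the cells below)
data = crowd_sourced_data['sentiment'].tolist()
# Work on a copy so the original crowd-sourced data stays unchanged
corrected_CSdata = crowd_sourced_data.copy()

# English spell checker from autocorrect
spell_correction = Speller(lang='en')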

Defining a function that maps misspelled sentiment labels to their correct spellings

In [None]:
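def correct_spelling(term):
    # Remove surrounding whitespace before comparing
    term = term.strip()
    # Malformed label that is handled explicitly
    if term.lower() == 'neutral l':
        return 'neutral'
    # Already a valid label apart from letter case
    elif term.lower() in ['positive', 'neutral', 'negative']:
        return term.lower()
    # Otherwise let autocorrect suggest a corrected spelling
    else:
        return spell_correction(term)

corrected_CSdata['sentiment'] = corrected_CSdata['sentiment'].apply(correct_spelling)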
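Because the set of valid labels is known in advance, label cleanup can also be sketched with the standard library alone. The example below is not the notebook's method: it swaps autocorrect for difflib's fuzzy matching against the three valid labels. The function name normalize_label and the cutoff of 0.6 are assumptions made for illustration.

```python
from difflib import get_close_matches

CANONICAL_LABELS = ["positive", "neutral", "negative"]


def normalize_label(term, cutoff=0.6):
    """Map a noisy label to the closest valid label (hypothetical helper)."""
    cleaned = term.strip().lower()
    if cleaned in CANONICAL_LABELS:
        return cleaned
    # Pick the single most similar valid label, if any is similar enough.
    matches = get_close_matches(cleaned, CANONICAL_LABELS, n=1, cutoff=cutoff)
    # Fall back to the cleaned input when nothing is close enough.
    return matches[0] if matches else cleaned


print(normalize_label(" Postive "))
print(normalize_label("neutral l"))
```

Matching against a closed label set can never return a word outside that set, whereas a general-purpose spell checker may; on the other hand, the cutoff has to be chosen with care so that unrelated strings are not forced into a label.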

Defining a function that tunes the hyper-parameters with a grid search, fits the pipeline on the given training data, predicts on the given test data, and prints the classification report and accuracy score

In [None]:
def implementation(data, test):

  # Parameter names are prefixed with the pipeline step they belong to
  param_grid = {
    'tfidfvectorizer__max_features': [1000, 5000, 10000],
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'linearsvc__C': [0.1, 1, 100]
  }

  # 5-fold cross-validated search over the grid, using all CPU cores
  grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)

  grid_search.fit(data['text'], data['sentiment'])

  # Best pipeline, refitted on the whole training set
  best_model = grid_search.best_estimator_

  # Predict on the test set passed as an argument (not the global test_data)
  predict = best_model.predict(test['text'])

  print("Classification report for given data is:")
  print(classification_report(test['sentiment'], predict))
  print("Accuracy:", accuracy_score(test['sentiment'], predict))
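To get a feel for the cost of the search: the grid has 3 values for each of 3 parameters, so 27 candidate settings, and each is evaluated on 5 folds. A small standard-library sketch of that arithmetic (the variable names are chosen for illustration):

```python
from itertools import product

param_grid = {
    "tfidfvectorizer__max_features": [1000, 5000, 10000],
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "linearsvc__C": [0.1, 1, 100],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate setting.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 27 135
```

With the default refit behaviour of GridSearchCV, one more fit on the full training data follows for the best setting.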

Running implementation() on the dataset "crowd_sourced_data"

In [None]:
implementation(crowd_sourced_data,test_data)

Classification report for given data is:
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         0
     Neutral       0.00      0.00      0.00         0
    Positive       0.00      0.00      0.00         0
    negative       0.52      0.31      0.39      1077
     neutral       0.59      0.46      0.52      2597
    positive       0.71      0.33      0.45      1850

    accuracy                           0.39      5524
   macro avg       0.30      0.18      0.23      5524
weighted avg       0.62      0.39      0.47      5524

Accuracy: 0.38848660391021
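The rows "Negative", "Neutral" and "Positive" have a support of 0 because those capitalised labels never occur as true labels in the test set. They appear in the report only because the model, trained on the uncorrected crowd-sourced labels, predicts them. Every such prediction counts as an error, which helps explain the low accuracy. A minimal sketch of how such label variants can be spotted, using short hypothetical label lists in place of the real columns:

```python
from collections import Counter

# Hypothetical label lists standing in for the real 'sentiment' columns.
train_labels = ["positive", "Positive", "neutral", "Neutral", "negative", "Negative"]
test_labels = ["positive", "neutral", "negative", "neutral"]

# Labels the model can predict but that never occur in the test set.
unseen_in_test = sorted(set(train_labels) - set(test_labels))
print(unseen_in_test)

# Frequency of each training label variant.
print(Counter(train_labels))
```

In the notebook itself, crowd_sourced_data['sentiment'].value_counts() would list the variants that are actually present.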


Running implementation() on the dataset "gold_train_data"

In [None]:
implementation(gold_train_data,test_data)

Classification report for given data is:
              precision    recall  f1-score   support

    negative       0.77      0.29      0.42      1077
     neutral       0.62      0.86      0.72      2597
    positive       0.74      0.61      0.67      1850

    accuracy                           0.67      5524
   macro avg       0.71      0.59      0.60      5524
weighted avg       0.69      0.67      0.65      5524

Accuracy: 0.6660028964518465


Running implementation() on the dataset "corrected_CSdata", which is the copy of crowd_sourced_data whose sentiment labels were spell-corrected above.

In [None]:
implementation(corrected_CSdata,test_data)

Classification report for given data is:
              precision    recall  f1-score   support

    negative       0.59      0.40      0.47      1077
     neutral       0.59      0.82      0.69      2597
    positive       0.74      0.47      0.57      1850

    accuracy                           0.62      5524
   macro avg       0.64      0.56      0.58      5524
weighted avg       0.64      0.62      0.61      5524

Accuracy: 0.6205648081100652
