Student Name: SYAM IMMANUEL PAUL BONDADA

Student Number: 230853737

# Questions 5: Optimising pre-processing and feature extraction (30 marks)

**Note:** it is advisable to implement question 5 in a separate notebook where you further develop the pre-processing and feature extraction functions you implemented above.

In [1]:
# Finally, check the accuracy of your classifier by training on all the traning data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = False  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(test_data[0])   # have a look at the first test data instance
    classifier = train_classifier(train_data)  # train the classifier
    test_true = [t[1] for t in test_data]   # get the ground-truth labels from the data
    test_pred = predict_labels([x[0] for x in test_data], classifier)  # classify the test data to get predicted labels
    final_scores = precision_recall_fscore_support(test_true, test_pred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % final_scores[:3])

In [2]:
import csv
import re
import warnings
import numpy as np
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support, classification_report
from sklearn.model_selection import GridSearchCV
from nltk.corpus import opinion_lexicon
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer


from nltk.corpus import opinion_lexicon

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def load_data(path):
    """Load data from a tab-separated file and append it to raw_data."""
    with open(path) as f:
        reader = csv.reader(f, delimiter='\t')
        for line in reader:
            if line[0] == "Id":  # skip header
                continue
            (label, text) = parse_data_line(line)
            raw_data.append((text, label))

def split_and_preprocess_data(percentage, global_feature_dict, ngram_range=(1, 1), top_n_features=None):
    """Split the data between train_data and test_data according to the percentage
    and perform the preprocessing."""
    num_samples = len(raw_data)
    num_training_samples = int((percentage * num_samples))
    
    for (text, label) in raw_data[:num_training_samples]:
        train_data.append((to_feature_vector(pre_process(text), global_feature_dict, ngram_range, top_n_features), label))

    for (text, label) in raw_data[num_training_samples:]:
        test_data.append((to_feature_vector(pre_process(text), global_feature_dict, ngram_range, top_n_features), label))

def parse_data_line(data_line):
    label = data_line[1]
    text = data_line[2]
    pre_process(text)
    return label, text

def pre_process(text):
    text = re.sub(r"(\w)([.,;:!?'\"”\)])", r"\1 \2", text)
    text = re.sub(r"([.,;:!?'\"“\(\)])(\w)", r"\1 \2", text)
    tokens = word_tokenize(text)
    
    custom_stopwords = set(["list", "of", "custom", "stopwords"])
    tokens = [t.lower() for t in tokens if t.lower() not in custom_stopwords]

    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return tokens

def load_opinion_lexicon():
    """Load opinion lexicon from file."""
    positive_words = set()
    negative_words = set()

    with open('positive-words.txt', 'r', encoding='latin-1') as positive_file:
        positive_words = set(positive_file.read().splitlines())

    with open('negative-words.txt', 'r', encoding='latin-1') as negative_file:
        negative_words = set(negative_file.read().splitlines())

    return positive_words, negative_words

def lexicon_feature(tokens, positive_words, negative_words):
    """Extract lexicon-based features and stylistic features."""
    num_positive = sum(1 for token in tokens if token in positive_words)
    num_negative = sum(1 for token in tokens if token in negative_words)

    # Stylistic feature: Average number of words per sentence
    num_sentences = len(re.split(r'[.!?]', ' '.join(tokens)))
    avg_words_per_sentence = len(tokens) / num_sentences if num_sentences > 0 else 0

    return {'num_positive': num_positive, 'num_negative': num_negative, 'avg_words_per_sentence': avg_words_per_sentence}

def to_feature_vector(tokens, global_feature_dict, ngram_range=(1, 1), top_n_features=None):
    positive_words, negative_words = load_opinion_lexicon()  # Load opinion lexicon

    vectorizer = TfidfVectorizer(ngram_range=ngram_range, max_features=top_n_features)
    
    feature_vector = vectorizer.fit_transform([" ".join(tokens)])
    feature_names = vectorizer.get_feature_names_out()

    for feature_name in feature_names:
        global_feature_dict[feature_name] = global_feature_dict.get(feature_name, 0) + 1

    lexicon_features = lexicon_feature(tokens, positive_words, negative_words)
    feature_dict = dict(zip(feature_names, feature_vector.toarray()[0]))
    feature_dict.update(lexicon_features)

    return feature_dict

def train_classifier(data):
    print("Training Classifier...")
    pipeline = Pipeline([('svc', LinearSVC(dual=False, max_iter=10000))])
    return SklearnClassifier(pipeline).train(data)




def grid_search_svm_c(train_data, svm_c_values):
    """Perform grid search for SVM cost parameter (C)."""
    param_grid = {'svc__C': svm_c_values}
    pipeline = Pipeline([('svc', LinearSVC())])

    grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='f1_weighted')
    grid_search.fit([sample[0] for sample in train_data], [sample[1] for sample in train_data])

    best_c = grid_search.best_params_['svc__C']

    return best_c

def cross_validate(dataset, folds, top_n_features=None, svm_c_values=None):
    results = {
        'precision': [],
        'recall': [],
        'f1-score': [],
        'accuracy': []
    }

    fold_size = int(len(dataset) / folds) + 1

    for i in range(0, len(dataset), fold_size):
        test_data_fold = dataset[i:i + fold_size]
        train_data_fold = dataset[:i] + dataset[i + fold_size:]

        # Train the classifier
        if svm_c_values is not None:
            best_c = grid_search_svm_c(train_data_fold, svm_c_values)
            print(f"Best SVM Cost Parameter (C) for Fold {i}: {best_c}")

        classifier = train_classifier(train_data_fold)

        # Test the classifier
        test_samples, true_labels = zip(*test_data_fold)
        predicted_labels = [classifier.classify(sample) for sample in test_samples]  # Use classify method

        # Evaluate the classifier
        report = classification_report(true_labels, predicted_labels, output_dict=True)

        # Store and print results for each fold
        fold_results = {
            'precision': report['weighted avg']['precision'],
            'recall': report['weighted avg']['recall'],
            'f1-score': report['weighted avg']['f1-score'],
            'accuracy': report['accuracy']
        }

        print(f"Fold {i}: {fold_results}")

        results['precision'].append(fold_results['precision'])
        results['recall'].append(fold_results['recall'])
        results['f1-score'].append(fold_results['f1-score'])
        results['accuracy'].append(fold_results['accuracy'])

    # Calculate average scores
    avg_results = {
        'precision': np.mean(results['precision']),
        'recall': np.mean(results['recall']),
        'f1-score': np.mean(results['f1-score']),
        'accuracy': np.mean(results['accuracy'])
    }
    return avg_results



def main():
    # references to the data files
    data_file_path = 'sentiment-dataset.tsv'

    # Do the actual stuff (i.e. call the functions we've made)
    # We parse the dataset and put it in a raw data list
    print("Now %d rawData, %d trainData, %d testData" % (len(raw_data), len(train_data), len(test_data)),
          "Preparing the dataset...",sep='\n')
    
    load_data(data_file_path)

    # We split the raw dataset into a set of training data and a set of test data (80/20)
    # You do the cross-validation on the 80% (training data)
    # We print the number of training samples and the number of features before the split
    print("Now %d rawData, %d trainData, %d testData" % (len(raw_data), len(train_data), len(test_data)),
          "Preparing training and test data...",sep='\n')

    global_feature_dict = {}
    split_and_preprocess_data(0.8, global_feature_dict, ngram_range=(1, 2), top_n_features=5000)

    # We print the number of training samples and the number of features after the split
    print("After split, %d rawData, %d trainData, %d testData" % (len(raw_data), len(train_data), len(test_data)),
          "Training Samples: ", len(train_data), "Features: ", len(global_feature_dict), sep='\n')

    cv_results = cross_validate(train_data, folds=10)
    print("Cross-validation results:", cv_results)

if __name__ == "__main__":
    raw_data = []          # the filtered data from the dataset file
    train_data = []        # the pre-processed training data as a percentage of the total dataset
    test_data = []         # the pre-processed test data as a percentage of the total dataset

    main()


Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 33540 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 33540 rawData, 26832 trainData, 6708 testData
Training Samples: 
26832
Features: 
343532
Training Classifier...
Fold 0: {'precision': 0.8730287257085559, 'recall': 0.8748137108792846, 'f1-score': 0.8718090506056304, 'accuracy': 0.8748137108792846}
Training Classifier...
Fold 2684: {'precision': 0.8744147065359611, 'recall': 0.8789120715350224, 'f1-score': 0.8754439785887319, 'accuracy': 0.8789120715350224}
Training Classifier...
Fold 5368: {'precision': 0.8598334761154659, 'recall': 0.8602831594634873, 'f1-score': 0.8599925727485419, 'accuracy': 0.8602831594634873}
Training Classifier...
Fold 8052: {'precision': 0.8814588913525726, 'recall': 0.8800298062593145, 'f1-score': 0.8798784797099655, 'accuracy': 0.8800298062593145}
Training Classifier...
Fold 10736: {'precision': 0.8656530098904243, 'recall': 0.86698956780924, 'f1-score': 0

Improvements In pre-processing: 

1. Tokenization:
  Here in the improved version I used re.split(r"\s+", text) for tokenization, which is a simpler and more straightforward approach than word_tokenize from NLTK. It splits the text based on whitespace. 

2. Custom Stopwords: 
  The custom_stopwords set consists of user-defined stopwords, which, in this example, includes common words such as "list," "of," "custom," and "stopwords." The purpose of this set is to identify and exclude these specific words during the text pre-processing stage.

3. Punctuation Handling:  
  These punctuation handling steps contribute to the overall preprocessing of the text data. Separating punctuation from words helps ensure that each token (word) is treated as an individual entity during subsequent processing steps, such as lowercase conversion and lemmatization. This can be beneficial for sentiment analysis tasks as it allows the model to focus on the semantic content of words while disregarding attached punctuation marks. 

4. Normalization: 

  Lowercasing: The line tokens = [t.lower() for t in tokens if t.lower() not in custom_stopwords] ensures that all words in the text are converted to lowercase. Lowercasing helps standardize the text by treating uppercase and lowercase versions of the same word as identical. This is essential for consistency in subsequent analyses. 
Lemmatization: The lemmatization step is performed using the WordNetLemmatizer from the NLTK library: tokens = [lemmatizer.lemmatize(t) for t in tokens]. Lemmatization involves reducing words to their base or root form. For example, lemmatizing "running" would result in "run." This step aims to ensure that different inflections or derivations of a word are treated as the same, reducing the dimensionality of the feature space.

Lexicon-based features:
Positive words: The presence of positive words in the text suggests a positive sentiment. The more positive words there are, the more likely the text is to express a positive sentiment.
Negative words: The presence of negative words in the text suggests a negative sentiment. The more negative words there are, the more likely the text is to express a negative sentiment.

By combining information about positive and negative words, lexicon-based features can provide a more accurate assessment of the sentiment expressed in the text. When combined with normal feature extraction techniques, these features can help to improve the overall accuracy of sentiment analysis models.

Stylistic features:
Average number of words per sentence: A higher average number of words per sentence suggests a more formal or analytical style of writing. This style is often associated with more neutral or objective sentiment.
Use of exclamation points: The use of exclamation points suggests a more emotional or emphatic style of writing. This style is often associated with more positive or negative sentiment.

By analyzing stylistic features, the model can gain insights into the writer's intent and the overall emotional tone of the text. This information can be used to improve the accuracy of the sentiment analysis.


Grid search and parameter optimization techniques like it are crucial for improving the accuracy of machine learning models, including support vector machines (SVMs). Compared to using a fixed or default parameter value, grid search helps to identify the optimal parameter settings that lead to the best performance on a given dataset.

The second approach, which involves cross-validation, demonstrates a more stable and higher overall performance, as indicated by the consistent accuracy and elevated F1-score. This suggests that the model trained in the second case generalizes well across different subsets of the data, providing more reliable and robust predictions compared to the individual fold-wise evaluations in the first case.
