# Notebook for Programming Question 3
Welcome to the programming portion of the assignment! . We will be using [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), so if you have never used it before, take a quick look through this introduction: [Working with Google Colab](https://docs.google.com/document/d/1LlnXoOblXwW3YX-0yG_5seTXJsb3kRdMMRYqs8Qqum4/edit?usp=sharing).

We'll also be programming in Python, which we will assume a basic familiarity with. Python has fantastic community support and we'll be using numerous packages for machine learning (ML) and natural language processing (NLP) tasks.

### Learning Objectives
In this problem we will implement logistic regression and test it on a sentiment analysis dataset.

### Writing Code
Look for the keyword "TODO" and fill in your code in the empty space.
HINT: Adding a bias is equivalent to adding a special token to your features (e.g. \<BIAS\>) with count = 1. This could simplify your implementation (although this is not required).

### Data preprocessing

#### Class and function for loading data

In [None]:
# Import libraries
import argparse
import time

# Define a class to store a single sentiment example
class SentimentExample:
    def __init__(self, words, label):
        self.words = words
        self.label = label # 0 or 1

    def __repr__(self):
        return repr(self.words) + "; label=" + repr(self.label)

    def __str__(self):
        return self.__repr__()


# Reads sentiment examples in the format [0 or 1]<TAB>[raw sentence]; tokenizes and cleans the sentences.
def read_sentiment_examples(infile) -> list[SentimentExample]:
    f = open(infile, encoding='iso8859')
    exs = []
    for line in f:
            fields = line.strip().split(" ")
            label = 0 if "0" in fields[0] else 1
            exs.append(SentimentExample(fields[1:], label))
    f.close()
    return exs

#### Download and load the data

In [76]:
!curl -o train-sent.txt https://raw.githubusercontent.com/Tsegaye-misikir/NLP-rug/main/sentiment/train-sent.txt
!curl -o dev-sent.txt https://raw.githubusercontent.com/Tsegaye-misikir/NLP-rug/main/sentiment/dev-sent.txt
train_file = 'train-sent.txt'
dev_file = 'dev-sent.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  721k  100  721k    0     0  4626k      0 --:--:-- --:--:-- --:--:-- 4746k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 94400  100 94400    0     0  1097k      0 --:--:-- --:--:-- --:--:-- 1298k


#### Indexer for examples
This section contains code for an Indexer which is useful for creating a mapping between words and indices. It has already been implemented for you. Do read it and try to understand what the functions.

In [77]:
# Bijection between objects and integers starting at 0. Useful for mapping
# labels, features, etc. into coordinates of a vector space.

# This class creates a mapping between objects (here words) and unique indices
# For example: apple->1, banana->2, and so on
class Indexer(object):
    def __init__(self):
        self.objs_to_ints = {}
        self.ints_to_objs = {}

    def __repr__(self):
        return str([str(self.get_object(i)) for i in range(0, len(self))])

    def __str__(self):
        return self.__repr__()

    def __len__(self):
        return len(self.objs_to_ints)

    # Returns the object corresponding to the particular index
    def get_object(self, index):
        if (index not in self.ints_to_objs):
            return None
        else:
            return self.ints_to_objs[index]

    def contains(self, object):
        return self.index_of(object) != -1

    # Returns -1 if the object isn't present, index otherwise
    def index_of(self, object):
        if (object not in self.objs_to_ints):
            return -1
        else:
            return self.objs_to_ints[object]

    # Adds the object to the index if it isn't present, always returns a nonnegative index
    def add_and_get_index(self, object, add=True):
        if not add:
            return self.index_of(object)
        if (object not in self.objs_to_ints):
            new_idx = len(self.objs_to_ints)
            self.objs_to_ints[object] = new_idx
            self.ints_to_objs[new_idx] = object
        return self.objs_to_ints[object]

In [78]:
# Load the data from the files
train_exs = read_sentiment_examples(train_file)
print("train: ", train_exs[:5])
dev_exs = read_sentiment_examples(dev_file)
print("dev: ", dev_exs[:5])
n_pos = 0
n_neg = 0
for ex in train_exs:
    if ex.label == 1:
        n_pos += 1
    else:
        n_neg += 1
print("%d train examples: %d positive, %d negative" % (len(train_exs), n_pos, n_neg))
print("%d dev examples" % len(dev_exs))

train:  [['a', 'stirring', ',', 'funny', 'and', 'finally', 'transporting', 're-imagining', 'of', 'beauty', 'and', 'the', 'beast', 'and', '1930s', 'horror', 'films']; label=1, ['apparently', 'reassembled', 'from', 'the', 'cutting-room', 'floor', 'of', 'any', 'given', 'daytime', 'soap', '.']; label=0, ['they', 'presume', 'their', 'audience', 'wo', "n't", 'sit', 'still', 'for', 'a', 'sociology', 'lesson', ',', 'however', 'entertainingly', 'presented', ',', 'so', 'they', 'trot', 'out', 'the', 'conventional', 'science-fiction', 'elements', 'of', 'bug-eyed', 'monsters', 'and', 'futuristic', 'women', 'in', 'skimpy', 'clothes', '.']; label=0, ['this', 'is', 'a', 'visually', 'stunning', 'rumination', 'on', 'love', ',', 'memory', ',', 'history', 'and', 'the', 'war', 'between', 'art', 'and', 'commerce', '.']; label=1, ['jonathan', 'parker', "'s", 'bartleby', 'should', 'have', 'been', 'the', 'be-all-end-all', 'of', 'the', 'modern-office', 'anomie', 'films', '.']; label=1]
dev:  [['one', 'long', 's

### Define Logistic Regression model

In [79]:
# Import libraries
from collections import Counter
from typing import List
import numpy as np
import math

#### Define feature extractors



In [80]:
from typing import Dict, Union
# Feature extraction base type. Takes an example and returns an indexed list of features.
class FeatureExtractor(object):
    # Extract features. Includes a flag add_to_indexer to control whether the indexer should be expanded.
    # At test time, any unseen features should be discarded, but at train time, we probably want to keep growing it.
    def extract_features(self, ex: "SentimentExample", add_to_indexer: bool) -> Dict[int, float]:
        raise Exception("Don't call me, call my subclasses")


# Extracts unigram bag-of-words features from a sentence. It's up to you to decide how you want to handle counts
class UnigramFeatureExtractor(FeatureExtractor):
    def __init__(self, indexer: Indexer):
        self.indexer = indexer

    def extract_features(self, ex:"SentimentExample", add_to_indexer=False) -> Dict[int, float]:
        features: Dict[int, float] = Counter()
        for w in ex.words:
            feat_idx: int = self.indexer.add_and_get_index(w) \
                if add_to_indexer else self.indexer.index_of(w)
            if feat_idx != -1:
                features[feat_idx] += 1.0
        return features


# Bigram feature extractor analogous to the unigram one.
class BigramFeatureExtractor(FeatureExtractor):
    def __init__(self, indexer: Indexer):
        self.indexer = indexer

    def extract_features(self, ex:"SentimentExample", add_to_indexer=False) -> Dict[int, float]:
        features: Dict[int, float] = Counter()
        for i in range(len(ex.words) - 1):
            w = ex.words[i] + "||" + ex.words[i + 1]
            feat_idx = self.indexer.add_and_get_index(w) \
                if add_to_indexer else self.indexer.index_of(w)
            if feat_idx != -1:
                features[feat_idx] += 1.0
        return features

#### Define base classifiers

In [81]:
# Sentiment classifier base type
class SentimentClassifier(object):
    # Makes a prediction for the given
    def predict(self, ex: "SentimentExample"):
        raise Exception("Don't call me, call my subclasses")


# Always predicts the positive class
class AlwaysPositiveClassifier(SentimentClassifier):
    def predict(self, ex: "SentimentExample") -> int:
        return 1

#### Logistic Regression class

In [None]:
class LogisticRegressionClassifier(SentimentClassifier):
    def __init__(self, feat_extractor: FeatureExtractor, train_examples, num_iters=50, reg_lambda=0.0, learning_rate=0.1):
        # TODO: Initialize the logistic regression model

        # Arguments: feat_extractor is unigram or bigram, train_examples is train dataset
        # num_iters is the number of epochs, reg_lambda is the regularization parameter
        # learning_rate is the learning rate used in gradient descent

        # STEP 1: Define variables for weights and biases, and initialize them

        # STEP 2: Call the train() function. (This has already been done for you)

        ##### SOLUTION START #####
        shape = len(feat_extractor.indexer)
        self.feature_ext = feat_extractor
        self.weights = np.random.uniform(0, reg_lambda, shape)
        self.biases = 0

        ##### SOLUTION END #####

        self.train(feat_extractor, train_examples, num_iters, reg_lambda, learning_rate)

    def train(self, feat_extractor: FeatureExtractor, train_examples:list["SentimentExample"], num_iters=50, reg_lambda=0.0, learning_rate=0.1):
        # TODO: Function for training the logistic regression model

        # STEP 1: Write a 'for' loop which iterates over the dataset num_iters times

        # STEP 2: Write an inner 'for' loop for each step of gradient descent
        # You can use stochastic gradient descent or mini-batch SGD

        # STEP 3: In each step of gradient descent apply the update rule
        # to weights and biases

        ##### SOLUTION START #####

        for epoch in range(num_iters): # num_iters is the number of epochs
            for ex in train_examples: # iterate over the dataset
                x = feat_extractor.extract_features(ex, add_to_indexer=True)
                y = ex.label

                # Compute linear combination of weights and features
                # z = np.dot(self.weights, np.array(list(x.values()))) + self.biases # copilot
                z = np.dot(self.weights, x) + self.biases # TODO: x might have to be converted into a np.array
                y_hat = self.sigmoid(z)

                # Compute gradients
                gradient_w = (y_hat - y) * x + reg_lambda * self.weights # TODO: x: np.array?n
                gradient_b = (y_hat - y)

                # Update weights and biases
                self.weights -= learning_rate * gradient_w
                self.biases -= learning_rate * gradient_b

        ##### SOLUTION END #####

    def sigmoid(self, z):
        """Sigmoid function for logistic regression."""
        return 1 / (1 + np.exp(-z)) 

    def predict(self, ex):
        # TODO: Logistic regression model's prediction for a single example

        ##### SOLUTION START #####

        x = self.feat_extractor.extract_features(ex, add_to_indexer=False)
        z = np.dot(self.weights, x) + self.biases
        return 1 if self.sigmoid(z) > 0.5 else 0

        ##### SOLUTION END #####

#### Training function for logistic regression

In [None]:
# Train a logsitic regression model on the given training examples using the given FeatureExtractor
def train_lr(train_exs: List["SentimentExample"], feat_extractor: FeatureExtractor, reg_lambda) -> LogisticRegressionClassifier:
    # TODO: Function for training logistic regression model.
    # Populate the feature_extractor.
    # Initialize and return an object of instance LogisticRegressionClassifier

    ##### SOLUTION START #####
    X_train: list["SentimentExample"] = [feat_extractor.extract_features(ex) for ex in train_exs]  
    
    # Convert to numpy arrays
    X_train = np.array(X_train)
    #y_train = np.array(y_train)
    
    # Initialize and train the logistic regression model
    model = LogisticRegressionClassifier(feat_extractor, X_train, reg_lambda=reg_lambda)
    model.train(feat_extractor, )
    
    return model
    ##### SOLUTION END #####
    pass

In [84]:
# Main entry point for your modifications. Trains and returns one of several models depending on the options passed
def train_model(feature_type:str, model_type:str, train_exs, reg_lambda=0.0):
    # Initialize feature extractor
    if feature_type == "unigram":
        # Add additional preprocessing code here
        feat_extractor = UnigramFeatureExtractor(Indexer())
    elif feature_type == "bigram":
        # Add additional preprocessing code here
        feat_extractor = BigramFeatureExtractor(Indexer())
    else:
        raise Exception("Pass unigram or bigram")

    # Train the model
    if model_type == "AlwaysPositive":
        model = AlwaysPositiveClassifier()
    elif model_type == "LogisticRegression":
        model = train_lr(train_exs, feat_extractor, reg_lambda=reg_lambda)
    else:
        raise Exception("Pass AlwaysPositive or LogisticRegression")
    return model

### Functions for evaluating the model

In [85]:
# Evaluates a given classifier on the given examples
def evaluate(classifier, exs):
    return print_evaluation([ex.label for ex in exs], [classifier.predict(ex) for ex in exs])


# Prints accuracy comparing golds and predictions, each of which is a sequence of 0/1 labels.
def print_evaluation(golds, predictions):
    num_correct = 0
    num_pos_correct = 0
    num_pred = 0
    num_gold = 0
    num_total = 0
    if len(golds) != len(predictions):
        raise Exception("Mismatched gold/pred lengths: %i / %i" %
                        (len(golds), len(predictions)))
    for idx in range(0, len(golds)):
        gold = golds[idx]
        prediction = predictions[idx]
        if prediction == gold:
            num_correct += 1
        if prediction == 1:
            num_pred += 1
        if gold == 1:
            num_gold += 1
        if prediction == 1 and gold == 1:
            num_pos_correct += 1
        num_total += 1

    print("Accuracy: %i / %i = %.2f %%" %
          (num_correct, num_total,
           num_correct * 100.0 / num_total))
    return num_correct * 100.0 / num_total

# Evaluate on train and dev dataset
def eval_train_dev(model):
    print("===== Train Accuracy =====")
    train_acc = evaluate(model, train_exs)
    print("===== Dev Accuracy =====")
    eval_acc = evaluate(model, dev_exs)
    return [train_acc, eval_acc]

### Unigram vs Bigram

In [86]:
# Evaluate logistic regression with unigram features
unigram_model = train_model('unigram', 'LogisticRegression', train_exs)
eval_train_dev(unigram_model)

TypeError: LogisticRegressionClassifier.__init__() missing 2 required positional arguments: 'feat_extractor' and 'train_examples'

In [None]:
# Evaluate logistic regression with bigram features
unigram_model = train_model('bigram', 'LogisticRegression', train_exs)
eval_train_dev(unigram_model)

### Logistic regression with regularization

In [None]:
# TODO: Experiment with different regularization parameters and plot train and dev accuracies
# You can either hard code the values or,
# write a loop that calculates accuracies for different parameters

### SOLUTION START ###

### SOLUTION END ###