### Coursework 1 - revised and can be used with automarker

In this coursework you will complete two kinds of classification task: image classification and text classification.

The specific tasks and their marking schemes are provided in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. You are expected to submit your Jupyter Notebook and all lab reports as a single PDF file. You may implement any additional functions you need to carry out each task.

#### Task 1

In this task, you are provided with images from three classes (cars, bikes and people) in real-world settings. You are also provided with code for obtaining features for these images, specifically histogram of oriented gradients (HoG) features. You need to implement a boosting-based classifier that can be used to classify the images.

This task is worth 30 points out of 100 points. 
Implementing a working boosting-based classifier and validating it by cross-validation on the training set is worth 15 of the 30 points. 10 points are based on an evaluation on a separate test dataset, carried out at marking time. Finally, 5 points are reserved for the analysis of this part of the task and for presenting it well in a lab report.
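
As an illustration of the cross-validation step, the sketch below scores a classifier over stratified folds. It is not the marking procedure: the helper name `cross_validate`, the synthetic data and the use of a shallow `DecisionTreeClassifier` as a stand-in are all assumptions made for the example. Any object with `fit` and `predict` methods, such as the boosting classifier for this task, could be passed in the same way.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def cross_validate(make_classifier, X, y, n_splits=5, seed=0):
    """Return the validation accuracy of each fold (hypothetical helper)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        clf = make_classifier()                  # fresh model for every fold
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[val_idx])
        scores.append(accuracy_score(y[val_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in data: 3 classes, 20 features (not the coursework images)
rng = np.random.default_rng(0)
y_demo = np.repeat(np.array(['bikes', 'cars', 'people']), 30)
X_demo = rng.normal(size=(90, 20))
X_demo[:, 0] += np.repeat([0.0, 3.0, 6.0], 30)   # make feature 0 informative

cv_scores = cross_validate(lambda: DecisionTreeClassifier(max_depth=2), X_demo, y_demo)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean together with the spread across folds gives a more honest picture than a single train/validation split.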

Note that the weak learner in your boosting classifier can be a decision tree from your previous ML1 coursework or a decision stump. Use the image_dataset directory provided with the assignment and save it in the same directory as the Python notebook.
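
For readers unfamiliar with decision stumps, here is a minimal sketch of a weighted stump: a single threshold on a single feature, chosen to minimise the weighted error. The function names `fit_stump` and `predict_stump` and the toy data are hypothetical, and the exhaustive search is written for clarity rather than speed; it would be slow on the full HoG feature vectors.

```python
import numpy as np


def fit_stump(X, y, weights):
    """Return ((feature, threshold, left_label, right_label), weighted_error)."""
    classes = np.unique(y)
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            labels = []
            for side in (left, ~left):
                # Weighted majority label on this side of the threshold
                totals = [weights[side & (y == c)].sum() for c in classes]
                labels.append(classes[int(np.argmax(totals))])
            pred = np.where(left, labels[0], labels[1])
            err = weights[pred != y].sum() / weights.sum()
            if err < best_err:
                best_err, best = err, (j, t, labels[0], labels[1])
    return best, best_err


def predict_stump(stump, X):
    j, t, left_label, right_label = stump
    return np.where(X[:, j] <= t, left_label, right_label)


X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y_toy = np.array(['cars', 'cars', 'bikes', 'bikes'])
w_toy = np.full(4, 0.25)

stump, err = fit_stump(X_toy, y_toy, w_toy)
print(stump, err)
```

Because the stump takes sample weights, it can be dropped into a boosting loop in place of a library tree.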

#### Write your image feature extraction code

In [1]:
def obtain_dataset(folder_name):
    import cv2
    import numpy as np
    import glob

    # assuming 128x128 size images and HoGDescriptor length of 34020
    hog_feature_len=34020
    hog = cv2.HOGDescriptor()
    
    X = np.empty([0,hog_feature_len], dtype=float)
    ytargets = []
    
    directories_list = ['bikes', 'people', 'cars']
    for directory in directories_list:
        for filename in glob.glob(folder_name+'/'+directory+'/*.png'):
            
            im = cv2.imread(filename)
            h = hog.compute(im).reshape(1,-1)

            X = np.append(X, h, axis=0)
            ytargets.append(directory)

    y = np.array(ytargets)

    return (X,y)

In [2]:
# # Optional function for those who want to include pre-processing for train data in obtain dataset
# def obtain_dataset_train_test(folder_name_train, folder_name_test):
#     import cv2
#     import numpy as np    
#     # assuming 128x128 size images and HoGDescriptor length of 34020
#     hog_feature_len=34020
#     hog = cv2.HOGDescriptor()
#     #code for obtaining hog feature for one image file name
#     # im = cv2.imread(image_filename)
#     # h = hog.compute(image)
#     # use this to read all images in the three directories and obtain the set of features X and train labels Y
#     # you can assume there are three different classes in the image dataset
#     #Process train images first, do NOT include test images in this stage
#     # Process test images over here while ensuring that there is no data leakage from train to test
#     # For instance dimensionality reduction should not be computed over train and test images jointly
#     return (X_train, y_train, X_test, y_test) 
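
The leakage warning in the optional function above can be illustrated with dimensionality reduction. In the sketch below, which uses random stand-in data and an arbitrary choice of 20 components (both assumptions), the scaler and the PCA projection are fitted on the training features only and then applied unchanged to the test features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in features (not real HoG vectors)
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(60, 200))
X_test_demo = rng.normal(size=(15, 200))

# Fit every preprocessing step on the training data only
scaler = StandardScaler().fit(X_train_demo)
pca = PCA(n_components=20).fit(scaler.transform(X_train_demo))

# Apply the already-fitted transforms to both sets
X_train_red = pca.transform(scaler.transform(X_train_demo))
X_test_red = pca.transform(scaler.transform(X_test_demo))
print(X_train_red.shape, X_test_red.shape)
```

Fitting the scaler or PCA on the concatenation of both sets would let test statistics influence the training features, which is exactly what the comments warn against.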

#### Boosting classifier class

In [3]:
class BoostingClassifier:
    # You need to implement this classifier. 
    def __init__(self):
        #implement initialisation
        self.num_classifiers = None
        self.classifiers = []
        self.classes = []
    
    def fit(self, X,y):
        #implement training of the boosting classifier
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier
        
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        
        # number of classifiers for images
        if len(self.classes) == 3:
            
            # You can change this value to impact the value of 
            # M for the image dataset
            self.num_classifiers = 155
            
        # number of classifiers for text
        elif len(self.classes) == 2:
            
            # You can change this value to impact the value of 
            # M for the text dataset
            self.num_classifiers = 350
        
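        # Start from an empty ensemble so that repeated calls to fit()
        # (e.g. during cross-validation) do not accumulate old classifiers
        self.classifiers = []

        # Initialize weights to 1/N
        weights = np.full(n_samples, 1 / n_samples, dtype=np.float64)

        for classifier_index in range(self.num_classifiers):
            classifier_tm = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights).predict

            # Indicator of misclassified training samples
            I = np.where(classifier_tm(X) != y, 1, 0)
            errM = np.sum(weights * I) / np.sum(weights)
            # Keep the error away from 0 and 1 so the logarithm stays finite
            errM = np.clip(errM, 1e-10, 1 - 1e-10)

            # Classifier weight; the log(K - 1) term handles K > 2 classes
            aM = np.log((1 - errM) / errM) + np.log(len(self.classes) - 1)

            # Up-weight misclassified samples, then renormalise to avoid overflow
            weights = weights * np.exp(aM * I)
            weights = weights / np.sum(weights)

            self.classifiers.append((aM, classifier_tm))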
        
    
    def predict(self, X):
        # implement prediction of the boosting classifier
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier
        
        classes = self.classes
        
        predictions = np.zeros([X.shape[0],1])
        # Through each class
        for clas in classes:
            
            class_sum = np.zeros([X.shape[0], 1])
            
            # Through each classifier
            for classifier_index in range(self.num_classifiers):
                
                aM, classifier_tm = self.classifiers[classifier_index]

                # To update sum
                I = np.expand_dims(np.where(classifier_tm(X) == clas, 1, 0), axis=1)
                class_sum += aM * I
                
            predictions = np.append(predictions, class_sum, axis=1)
        
        predictions = predictions[:,1:]

        index_predictions = np.argmax(predictions, axis=1)

        classes_array = np.array(classes)
        y_pred = classes_array[index_predictions]
        
        return y_pred
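
The classifier weight used in `fit` has a simple interpretation that the sketch below tries to make concrete. The helper names `multiclass_alpha` and `update_weights` are hypothetical, and the renormalisation inside `update_weights` is a common convention included here as an assumption. With K classes, random guessing has error (K - 1)/K, and the weight is zero at exactly that error, so any weak learner that beats chance receives a positive vote.

```python
import numpy as np


def multiclass_alpha(err, n_classes):
    """Classifier weight: log((1 - err) / err) + log(K - 1)."""
    return np.log((1.0 - err) / err) + np.log(n_classes - 1)


def update_weights(weights, misclassified, alpha):
    """Up-weight misclassified samples, then renormalise to sum to 1."""
    new_weights = weights * np.exp(alpha * misclassified)
    return new_weights / new_weights.sum()


# With K = 3, the weight changes sign at err = 2/3 (random guessing)
for err in (0.2, 0.5, 2.0 / 3.0, 0.8):
    print(err, multiclass_alpha(err, 3))

w = np.full(4, 0.25)
miss = np.array([1, 0, 0, 0])        # only the first sample was misclassified
w_new = update_weights(w, miss, multiclass_alpha(0.25, 3))
print(w_new)
```

In the binary case (K = 2) the extra term vanishes and the weight changes sign at an error of 0.5.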


### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [4]:
def test_func_boosting_image(image_dataset_train, image_dataset_test):
    import numpy as np
    import cv2
    import glob
    from sklearn.metrics import accuracy_score
    
    (X_train, Y_train) = obtain_dataset(image_dataset_train)
    (X_test, Y_test) = obtain_dataset(image_dataset_test) # optionally replace the two calls with a single call to obtain_dataset_train_test() function
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    y_pred = bc.predict(X_test)
    acc = accuracy_score(Y_test, y_pred)
    return acc

#### Task 2

In this task, you need to classify the above dataset using a Support Vector Machine (SVM).

This task is worth 25 points out of 100 points. You are allowed to use existing library functions such as scikit-learn for obtaining the SVM. The main idea is to analyse the dataset using different kinds of kernels. You are also expected to write your own custom kernels. The marking will be 15 points for analysing the dataset using various kernels including your own kernels, 5 points for the performance on the test dataset and 5 points for a lab report that provides the analysis and comparisons.
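
One way to plug a custom kernel into scikit-learn is to pass a callable as the `kernel` argument of `SVC`; the callable receives two sample matrices and returns their Gram matrix. The sketch below uses a histogram intersection kernel on synthetic non-negative features. Both the kernel choice and the data are assumptions for illustration, not part of the provided solution.

```python
import numpy as np
from sklearn import svm


def histogram_intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]); intended for non-negative features."""
    K = np.empty((A.shape[0], B.shape[0]))
    for j in range(B.shape[0]):
        K[:, j] = np.minimum(A, B[j]).sum(axis=1)
    return K


# Synthetic non-negative features: each class puts extra mass in different bins
rng = np.random.default_rng(0)
X0 = rng.random((40, 10))
X0[:, :5] += 2.0
X1 = rng.random((40, 10))
X1[:, 5:] += 2.0
X_demo = np.vstack([X0, X1])
y_demo = np.array([0] * 40 + [1] * 40)

clf = svm.SVC(kernel=histogram_intersection_kernel)
clf.fit(X_demo, y_demo)
train_acc = clf.score(X_demo, y_demo)
print(train_acc)
```

With a callable kernel, scikit-learn keeps the training samples itself, so no separate bookkeeping of the training matrix is needed at prediction time.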

In [5]:
class SVMClassifier:
 
    def __init__(self):
        self.image_classifier = None
        self.text_classifier = None
        
        # Change these parameters to test the corresponding dataset
        # with the other kernels. Choose from this list:
        # ['linear', 'poly', 'rbf', 'sigmoid', 'laplacian', 'log', 'cauchy']
        self.image_kernel = 'laplacian'
        self.text_kernel = 'laplacian'

        self.image_X_train = None
        self.text_X_train = None

    def fit_image(self, X, y):
        #training of the SVM 
        # providing for separate image kernels 
        from sklearn import svm

        if self.image_kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
            clf = svm.SVC(kernel=self.image_kernel)
            clf.fit(X, y)
        else:
            clf = svm.SVC(kernel='precomputed')
            
            if self.image_kernel == 'laplacian':
                kernel_train = self.laplacian_kernel(X, X)
                
            elif self.image_kernel == 'log':
                kernel_train = self.log_kernel(X, X)

            elif self.image_kernel == 'cauchy':
                kernel_train = self.cauchy_kernel(X, X)
                        
            clf.fit(kernel_train, y)
            self.image_X_train = X
        
        self.image_classifier = clf
    
    def fit_text(self, X,y):
        # training of the SVM
        from sklearn import svm

        if self.text_kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
            clf = svm.SVC(kernel=self.text_kernel)
            clf.fit(X, y)
        else:
            clf = svm.SVC(kernel='precomputed')
            
            if self.text_kernel == 'laplacian':
                kernel_train = self.laplacian_kernel(X, X)
                
            elif self.text_kernel == 'log':
                kernel_train = self.log_kernel(X, X)

            elif self.text_kernel == 'cauchy':
                kernel_train = self.cauchy_kernel(X, X)
                        
            clf.fit(kernel_train, y)
            self.text_X_train = X
        
        self.text_classifier = clf
    
    
    def predict_image(self, X):
        # prediction routine for the SVM
        from sklearn import svm

        clf = self.image_classifier
        
        if self.image_kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
            y_pred = clf.predict(X)
            
        else:
            X_train = self.image_X_train
            
            if self.image_kernel == 'laplacian':
                kernel_test = self.laplacian_kernel(X, X_train)
                
            elif self.image_kernel == 'log':
                kernel_test = self.log_kernel(X, X_train)

            elif self.image_kernel == 'cauchy':
                kernel_test = self.cauchy_kernel(X, X_train)
        
            y_pred = clf.predict(kernel_test)
        
        return y_pred
        

    def predict_text(self, X):
        # prediction routine for the SVM
        from sklearn import svm

        clf = self.text_classifier
        
        if self.text_kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
            y_pred = clf.predict(X)
            
        else:
            X_train = self.text_X_train
            
            if self.text_kernel == 'laplacian':
                kernel_test = self.laplacian_kernel(X, X_train)
                
            elif self.text_kernel == 'log':
                kernel_test = self.log_kernel(X, X_train)

            elif self.text_kernel == 'cauchy':
                kernel_test = self.cauchy_kernel(X, X_train)
        
            y_pred = clf.predict(kernel_test)
        
        return y_pred
    
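    def laplacian_kernel(self, X, y):
        # Laplacian-style kernel: exp(-||x - y|| / sigma), with Euclidean distance
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        from scipy import sparse

        # Estimate sigma from the second argument, which is the training set in
        # both fit and predict, so train and test kernels share one bandwidth
        ref = y.toarray() if sparse.issparse(y) else np.asarray(y)
        sigma = np.sqrt(ref.shape[1] * np.var(ref))

        euclid_dist = euclidean_distances(X, y)
        # The exponential was missing, which left only a scaled negative distance
        result = np.exp(-euclid_dist / sigma)

        return result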
    
    
    def log_kernel(self, X, y):
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances

        d = 1
        euclid_dist = euclidean_distances(X, y)
        result = -np.log(euclid_dist**d + 1)
        return result

    
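    def cauchy_kernel(self, X, y):
        # Cauchy kernel: 1 / (1 + ||x - y||^2 / sigma^2)
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        from scipy import sparse

        # Estimate sigma from the second argument, which is the training set in
        # both fit and predict, so train and test kernels share one bandwidth
        ref = y.toarray() if sparse.issparse(y) else np.asarray(y)
        sigma = ref.shape[1] * np.var(ref)

        euclid_dist = euclidean_distances(X, y)

        result = 1 / (1 + (euclid_dist**2 / sigma**2))
        return result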
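
When `SVC(kernel='precomputed')` is used, as in the class above, the Gram matrices must have specific shapes: `(n_train, n_train)` for fitting and `(n_test, n_train)` for prediction, both computed with the same kernel parameters. The sketch below checks this on synthetic data with a Laplacian-style kernel; the fixed bandwidth `sigma = 2.0` and the helper `gram` are assumptions for the example.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(1)
X_tr = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)),
                  rng.normal(4.0, 1.0, size=(30, 4))])
y_tr = np.array(['neg'] * 30 + ['pos'] * 30)
X_te = np.vstack([rng.normal(0.0, 1.0, size=(5, 4)),
                  rng.normal(4.0, 1.0, size=(5, 4))])
y_te = np.array(['neg'] * 5 + ['pos'] * 5)

sigma = 2.0  # hypothetical fixed bandwidth, shared by both Gram matrices


def gram(A, B):
    """Laplacian-style kernel exp(-||a - b|| / sigma) between rows of A and B."""
    return np.exp(-euclidean_distances(A, B) / sigma)


K_train = gram(X_tr, X_tr)   # shape (n_train, n_train)
K_test = gram(X_te, X_tr)    # shape (n_test, n_train): rows are test samples

# A valid kernel gives a symmetric Gram matrix with no negative eigenvalues
min_eig = np.linalg.eigvalsh(K_train).min()

clf = svm.SVC(kernel='precomputed').fit(K_train, y_tr)
y_pred = clf.predict(K_test)
print(K_train.shape, K_test.shape, min_eig, (y_pred == y_te).mean())
```

Checking the smallest eigenvalue is a quick sanity test for a hand-written kernel, since some distance-based functions are only conditionally positive definite.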
 

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [6]:
def test_func_svm_image(image_dataset_train, image_dataset_test):
    import numpy as np
    import cv2
    from sklearn.metrics import accuracy_score
    
    (X_train, Y_train) = obtain_dataset(image_dataset_train)
    (X_test, Y_test) = obtain_dataset(image_dataset_test) # optionally replace the two calls with a single call to obtain_dataset_train_test() function
    sc = SVMClassifier()
    sc.fit_image(X_train, Y_train)
    y_pred = sc.predict_image(X_test)
    acc = accuracy_score(Y_test, y_pred)
    return acc

#### Task 3

In this task, you need to perform sentiment analysis on the provided dataset. The dataset consists of movie reviews labelled with their sentiment, which is either positive or negative. You need to train a boosting-based classifier and cross-validate it on the dataset provided. The method will be evaluated against an external test set.

This task is worth 25 points out of 100 points. 15 points will be for implementing the pre-processing and Bag of Words based feature extractor correctly and evaluating the boosting based classifier for the text features and validating it by cross-validation on the training set. 5 points are based on the evaluation carried out on a separate test dataset that will be done at the time of evaluation. Finally 5 points are reserved for analysis of this part of the task and presenting it well in a lab report.

Use the movie_review_train.csv file provided with the assignment, and save it in the same directory as the Python notebook.

#### Process the text and obtain a bag of words-based features 

In [7]:
# def extract_bag_of_words(train_file):
#     import numpy as np
#     import pandas as pd
#     from sklearn.feature_extraction.text import CountVectorizer
#     from nltk.tokenize import RegexpTokenizer
#     data = pd.read_csv(train_file)
#     # Shuffling to remove bias
#     data = data.sample(frac=1)
#     data = data.sample(frac=1)
#     x = data['review']
#     y = data['sentiment']
#     count_vect = CountVectorizer()
#     X = count_vect.fit_transform(x)
#     return X, y

In [8]:
def extract_bag_of_words_train_test(train_file, test_file):
    # Write your preprocessor to process the text
    # Write your own bag of words feature extractor using nltk and scikit-learn
    # return (X_train,y_train,X_test,y_test)
    
    # Process training data first and ensure the test data is not used while extracting bag of words feature vector
    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.tokenize import RegexpTokenizer
    
    train_data = pd.read_csv(train_file)
    # Shuffling to remove bias
    train_data = train_data.sample(frac=1)
    
    x_train = train_data['review']
    y_train = train_data['sentiment']
    
    token = RegexpTokenizer(r'[a-z]+')
    count_vect = CountVectorizer(lowercase=True, stop_words='english', tokenizer=token.tokenize)
    
    X_train = count_vect.fit_transform(x_train)

    # Process testing data here. Ensure that test data is not used above
    test_data = pd.read_csv(test_file)

    x_test_data = test_data['review']
    y_test = test_data['sentiment']
    
    # Using transform instead of fit_transform to prevent leakage
    X_test = count_vect.transform(x_test_data)

    return X_train, y_train, X_test, y_test
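
The difference between `fit_transform` and `transform` can be seen on a tiny example. The reviews below are made up for illustration, and the vectorizer uses scikit-learn's default tokenisation rather than the custom tokenizer above. Words that appear only in the test text never enter the vocabulary, so the test matrix has the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews, not taken from the coursework dataset
train_reviews = ["a great film with a great cast", "a dull and boring film"]
test_reviews = ["great cast but boring soundtrack"]

vectorizer = CountVectorizer()
X_train_demo = vectorizer.fit_transform(train_reviews)   # learns the vocabulary
X_test_demo = vectorizer.transform(test_reviews)         # reuses it unchanged

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_test_demo.toarray())
```

Here "but" and "soundtrack" are silently dropped from the test vector because the training reviews never contained them.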

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [9]:
def test_func_boosting_text(text_dataset_train, text_dataset_test):
    import numpy as np
    import cv2
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.tokenize import RegexpTokenizer
    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(text_dataset_train, text_dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    y_pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, y_pred)
    return acc

#### Task 4

In this task, you need to classify the above movie review dataset using a Support Vector Machine (SVM).

This task is worth 20 points out of 100 points. You are allowed to use existing library functions such as scikit-learn for obtaining the SVM. The main idea is to analyse the dataset using different kinds of kernels. You are also expected to write your own custom text kernels. The marking will be 10 points for analysing the dataset using various kernels including your own kernels, 5 points for the performance on the test dataset and 5 points for a lab report that provides the analysis and comparisons.
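
As one possible custom text kernel, the sketch below uses cosine similarity between bag-of-words vectors, which removes the effect of review length. The kernel choice, the made-up reviews and the variable names are assumptions for illustration. The Gram matrices are passed to `SVC(kernel='precomputed')`, and sparse inputs are accepted directly by `cosine_similarity`.

```python
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up reviews, not taken from the coursework dataset
train_reviews = ["great film great acting", "wonderful and great story",
                 "loved the wonderful cast", "boring film bad acting",
                 "terrible and boring story", "hated the terrible cast"]
train_labels = np.array(['positive'] * 3 + ['negative'] * 3)
test_reviews = ["great wonderful acting", "bad terrible story"]

vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(train_reviews)   # sparse count matrix
X_te = vectorizer.transform(test_reviews)

K_train = cosine_similarity(X_tr, X_tr)          # dense, shape (6, 6)
K_test = cosine_similarity(X_te, X_tr)           # dense, shape (2, 6)

clf = svm.SVC(kernel='precomputed').fit(K_train, train_labels)
y_pred = clf.predict(K_test)
print(y_pred)
```

Cosine similarity is the linear kernel applied to length-normalised vectors, so it is a valid kernel on count features.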

### Test function that will be called to evaluate your code. Separate train and test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. We will also be evaluating the cross-validation performance with a set train and val split.

In [10]:
def test_func_svm_text(text_dataset_train, text_dataset_test):
    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn import svm
    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.tokenize import RegexpTokenizer
    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(text_dataset_train, text_dataset_test)
    sc = SVMClassifier()
    sc.fit_text(X_train, Y_train)
    y_pred = sc.predict_text(X_test)
    acc = accuracy_score(Y_test, y_pred)
    return acc