### Coursework 2

In this coursework you will complete two text classification tasks.

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and their marking are provided in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions you need for a task.


#### Task 1

In this task, you will perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. Train an SVM-based classifier on the training data and check it on the sample test dataset provided. The method will also be evaluated against an external test set. Do not hardcode any dimensions or sample counts in your code: testing is automated, and hardcoded values break it.

You may use existing libraries such as scikit-learn and numpy to implement the SVM. However, you are expected to write your own kernels.

The main idea is to analyse the dataset using different kinds of kernels, including your own custom text kernels. Refer to the documentation provided [here](https://scikit-learn.org/stable/modules/svm.html) (section 1.4.6.2) and the example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 
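
As a rough illustration of the custom-kernel mechanism referenced above, the sketch below passes a Python callable to `svm.SVC`. The kernel name `linear_text_kernel` and the toy data are assumptions made for illustration only; this is not the kernel expected for the task. scikit-learn calls the function with two sample matrices and expects the Gram matrix of shape `(n_samples_X, n_samples_Y)` back.

```python
import numpy as np
from scipy import sparse
from sklearn import svm

def linear_text_kernel(X, Y):
    """Hypothetical example kernel: plain dot product between feature rows."""
    K = X @ Y.T
    # Sparse inputs give a sparse product; return a dense Gram matrix
    return K.toarray() if sparse.issparse(K) else np.asarray(K)

# Toy sparse "bag-of-words" features for four documents (made-up values)
X_toy = sparse.csr_matrix(
    np.array([[2, 0, 1], [1, 0, 2], [0, 3, 0], [0, 2, 1]], dtype=float)
)
y_toy = np.array([1, 1, -1, -1])

clf = svm.SVC(kernel=linear_text_kernel, C=1)
clf.fit(X_toy, y_toy)
toy_pred = clf.predict(X_toy)  # one label per row of X_toy
```

Because the kernel only sees the two matrices it is given, nothing in it depends on the number of samples or features, which keeps it usable with a different test file.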

#### Process the text and obtain bag-of-words features

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
import re

In [2]:
def extract_bag_of_words_train_test(train_file, test_file):
    import nltk
    nltk.download('punkt') # tokenizer models used by nltk.word_tokenize
    nltk.download('stopwords') # stopword list used to filter out uninformative words
    # Assumption: the POS tagger and WordNet data used below may not be installed yet
    nltk.download('averaged_perceptron_tagger', quiet=True)
    nltk.download('wordnet', quiet=True)
    from nltk.corpus import stopwords, wordnet as wn # wn supplies the WordNet part-of-speech constants
    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer() # lemmatizer shared by all reviews

    def Clean_Data(review):
        review = review.lower() # convert to lower case
        review = re.sub(r'\[.*?\]', '', review) # remove text enclosed in square brackets
        review = re.sub(r'\W', ' ', review) # replace every non-word character (punctuation, slashes, newlines) with a space
        # The next two patterns run after punctuation has already been replaced,
        # so no '/', '.', '<' or '>' is left for them to match
        review = re.sub(r'https?://\S+|www\.\S+', '', review) # intended to remove links
        review = re.sub(r'<.*?>+', '', review) # intended to remove HTML tags such as <br>
        review = re.sub('br', '', review) # drop the "br" left over from <br> tags (this also removes "br" inside words)
        review = re.sub(r'[^\w\s]', '', review) # remove any remaining punctuation
        review = re.sub(r'\n', '', review) # remove newline characters
        review = re.sub(r'\w*\d\w*', '', review) # remove tokens that contain digits
        return review

    def nltk_pos_tagger(nltk_tag):
        # map a Penn Treebank tag (as returned by nltk.pos_tag) to the matching WordNet part of speech
        if nltk_tag.startswith('J'):
            return wn.ADJ
        elif nltk_tag.startswith('V'):
            return wn.VERB
        elif nltk_tag.startswith('N'):
            return wn.NOUN
        elif nltk_tag.startswith('R'):
            return wn.ADV
        else:          
            return None

    def lemmatize_sentence(sentence):
        nltk_tagged = nltk.pos_tag(sentence) # POS-tag the already tokenised review
        wordnet_tagged = map(lambda x: (x[0], nltk_pos_tagger(x[1])), nltk_tagged) # pair each token with its WordNet part of speech
        lemmatized_sentence = []

        # lemmatize each token according to its tag
        for word, tag in wordnet_tagged:
            if tag is None:
                lemmatized_sentence.append(word) # no WordNet tag: keep the token unchanged
            else:
                lemmatized_sentence.append(wnl.lemmatize(word, tag)) # otherwise lemmatize the token using its part of speech
        return " ".join(lemmatized_sentence)

    vectorizer = TfidfVectorizer(min_df = 5, #ignore terms that appear in fewer than 5 reviews
                               max_df = 0.8, #ignore terms that appear in more than 80% of reviews
                               sublinear_tf = True, #scale term frequency as 1 + log(tf)
                               use_idf = True,  
                               ngram_range = (1,2), #use unigrams and bigrams (two-word combinations)
                               max_features = 10000) #keep only the 10000 most frequent terms

    # Read the CSV files
    Data_train = pd.read_csv(train_file) # e.g. 'movie_review_train.csv'
    Data_test = pd.read_csv(test_file) # e.g. 'movie_review_test.csv'
    pd.set_option('display.max_colwidth',5000)

    # Use the NLTK stopword list for the English language
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # Apply the same preprocessing to the training and test reviews
    for data in (Data_train, Data_test):
        # Clean the raw text
        data['review'] = data['review'].apply(Clean_Data)

        # Tokenise the cleaned reviews with NLTK
        data['review'] = data['review'].apply(nltk.word_tokenize)

        # Remove all stopwords from the token lists
        data['review'] = data['review'].apply(lambda x: [word for word in x if word not in stop_words])

        # Lemmatize the tokens and join them back into a single string
        data['review_token'] = data['review'].apply(lemmatize_sentence)

        # Encode the sentiment as an integer label: 1 for positive, -1 for negative
        data['Value'] = data['sentiment'].apply(lambda x: 1 if x == 'positive' else -1)
    
    X_train = vectorizer.fit_transform(Data_train['review_token'])
    y_train = np.array(Data_train["Value"])
    
    X_test = vectorizer.transform(Data_test['review_token'])
    y_test = np.array(Data_test["Value"])
    
    return (X_train,y_train,X_test,y_test)
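
The order of the cleaning steps matters: in `Clean_Data` the non-word characters are replaced before the link and tag patterns run, so those patterns have nothing left to match. The following is a minimal sketch of an alternative ordering, not the preprocessing that produced the results in this notebook. The function name `clean_review` and the sample string are hypothetical.

```python
import re

def clean_review(review):
    """Hypothetical variant of Clean_Data that strips markup before punctuation."""
    review = review.lower()
    review = re.sub(r'<[^>]+>', ' ', review)                # HTML tags such as <br />
    review = re.sub(r'https?://\S+|www\.\S+', ' ', review)  # links
    review = re.sub(r'\[.*?\]', ' ', review)                # text in square brackets
    review = re.sub(r'\w*\d\w*', ' ', review)               # tokens containing digits
    review = re.sub(r'[^\w\s]', ' ', review)                # remaining punctuation
    return re.sub(r'\s+', ' ', review).strip()              # collapse whitespace

sample = "A brilliant film!<br />See www.example.com [spoiler] 10/10"
cleaned = clean_review(sample)
print(cleaned)  # a brilliant film see
```

Removing whole tags first means a word such as "brilliant" keeps its leading "br", which a blanket removal of the string "br" would strip.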

In [3]:
class SVMClassifier:
    def __init__(self):
        from sklearn import svm
        # SVM with a custom kernel; SVC's degree parameter only affects the
        # built-in 'poly' kernel, so it is not passed here
        self.clf = svm.SVC(kernel=self.laplacian_kernel, C = 1)
        
    # Custom kernels. Each takes two sample matrices and returns the Gram matrix
    # of shape (n_samples_X, n_samples_y), see
    # https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
    # The kernel formulas are based on
    # http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/#source_code
    
    def laplacian_kernel(self, X, y):
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        import scipy.sparse
        
        # Estimate the scale sigma from the first argument
        # (the training data during fit, the test data during predict)
        if scipy.sparse.issparse(X): # np.var needs a dense array, so densify sparse input
            sigma = np.sqrt(X.shape[1] * np.var(X.toarray()))
        else:
            sigma = np.sqrt(X.shape[1] * np.var(X))

        euclid_dist = euclidean_distances(X, y) # pairwise Euclidean distances between rows of X and y
        # Scaled negative distance, i.e. the exponent of the Laplacian kernel
        # exp(-||x - y|| / sigma); the exponential itself is not applied here
        result = (- euclid_dist/ sigma)

        return result
    
    def cauchy_kernel(self, X, y):
        import numpy as np
        from sklearn.metrics.pairwise import euclidean_distances
        import scipy.sparse
        
        # Estimate the scale sigma from the first argument (no square root here, unlike laplacian_kernel)
        if scipy.sparse.issparse(X): # np.var needs a dense array, so densify sparse input
            sigma = X.shape[1] * np.var(X.toarray())
        else:
            sigma = X.shape[1] * np.var(X)
            
        euclid_dist = euclidean_distances(X, y) # pairwise Euclidean distances between rows of X and y
        result = 1 / (1+ (euclid_dist**2 / sigma**2)) # Cauchy kernel: 1 / (1 + ||x - y||^2 / sigma^2)
        return result
    
    def fit(self, X, Y):
        # fitting the model
        f = self.clf.fit(X,Y)
        return f
    
    def predict(self, X):
        # prediction routine for the SVM
        prediction = self.clf.predict(X)
        return prediction 
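
For comparison, a commonly used form of the Laplacian kernel is `exp(-||x - y|| / sigma)`, whereas `laplacian_kernel` above returns the scaled negative distance without the exponential and re-estimates `sigma` from whichever matrix is passed first. A minimal sketch of the exponential form with a fixed `sigma` follows. The helper name `make_laplacian_kernel` and the value of `sigma` are assumptions for illustration, and this variant was not used to produce the accuracy reported below.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def make_laplacian_kernel(sigma=1.0):
    """Return a kernel callable k(X, Y) = exp(-||x - y|| / sigma) with a fixed sigma."""
    def kernel(X, Y):
        # euclidean_distances accepts dense arrays and scipy sparse matrices
        return np.exp(-euclidean_distances(X, Y) / sigma)
    return kernel

# Toy check on made-up points that are 5 units apart
pts = np.array([[0.0, 0.0], [3.0, 4.0]])
K = make_laplacian_kernel(sigma=5.0)(pts, pts)
# Diagonal entries are exp(0) = 1; off-diagonal entries are exp(-5 / 5) = exp(-1)
```

Such a callable could be passed as `svm.SVC(kernel=make_laplacian_kernel(sigma))`. Fixing `sigma` up front keeps the kernel identical at training and prediction time.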

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [4]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [5]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\justi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\justi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Accuracy: 0.874


### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the reviews.

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.

In [6]:
class BoostingClassifier:
    # You need to implement this classifier. 
    def __init__(self):
        from sklearn.tree import DecisionTreeClassifier
        # Weak learner: a shallow decision tree, cloned and refitted in every boosting round
        self.model = DecisionTreeClassifier(criterion = 'entropy', max_depth=7)
        # Number of boosting rounds
        self.n_estimators = 1000
        self.classifiers = []
        self.alphas = []
        
    def fit(self, X,y):
        from sklearn import base
        import numpy as np
        # Number of training samples
        N = X.shape[0]
        # Start from uniform sample weights
        weights = np.full(N, (1 / N), dtype=np.float32)
        # Fitted weak learners, one per round
        classifiers = []
        # Voting weight (alpha) of each weak learner
        alphas = []
        
        for i in range(self.n_estimators):
            
            # Fit a fresh copy of the weak learner on the currently weighted samples
            f = base.clone(self.model).fit(X,y, sample_weight=weights)
            
            # Add the fitted tree to the list of classifiers
            classifiers.append(f)
            
            # Use the weak learner to predict the training labels
            pred = f.predict(X)
            
            # Mark the samples that were misclassified
            incorrect = np.where(pred != y, 1, 0)
            
            # Weighted error, kept strictly inside (0, 1) so the logarithm below stays finite
            error = float(np.sum(weights * incorrect))
            error = min(max(error, 1e-10), 1 - 1e-10)
            
            # Voting weight of this learner
            a = np.log((1 - error) / error) / 2
            alphas.append(a)
            
            # Increase the weights of misclassified samples and decrease the rest
            weights = weights * np.exp(-a * y * pred)
            
            # Normalise so that the weights sum to 1
            weights = weights / np.sum(weights)
            
        # Store the trained ensemble
        self.classifiers = classifiers
        self.alphas = alphas
        
        return self
    
    def predict(self, X):
        import numpy as np
        # implement prediction of the boosting classifier
        prediction = np.zeros(X.shape[0])
        classifiers = self.classifiers
        alphas = self.alphas
        
        # Loop through our list of classifiers and corresponding alphas
        for i in range(len(alphas)):
            
            # Generate our weighted prediction
            prediction += alphas[i] * classifiers[i].predict(X)
        
        # Determine whether the weighted prediction is overall positive or negative
        prediction = np.sign(prediction)
        
        return prediction
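
To make the weight update in `fit` concrete, here is one boosting round worked through on made-up numbers: five samples with uniform weights, one of which is misclassified. The labels and predictions are hypothetical; only the formulas mirror the code above.

```python
import math

y    = [1, 1, -1, -1, 1]    # hypothetical true labels
pred = [1, 1, -1, -1, -1]   # hypothetical weak-learner output (last sample is wrong)
weights = [1 / len(y)] * len(y)

# Weighted error: total weight of the misclassified samples
error = sum(w for w, yi, pi in zip(weights, y, pred) if yi != pi)

# Voting weight of the weak learner
alpha = math.log((1 - error) / error) / 2

# Misclassified samples are scaled by exp(+alpha), correct ones by exp(-alpha)
weights = [w * math.exp(-alpha * yi * pi) for w, yi, pi in zip(weights, y, pred)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(error, 3), round(alpha, 3), [round(w, 3) for w in weights])
# prints: 0.2 0.693 [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.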

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [7]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [8]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")



Accuracy: 0.838

