##### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2022 Semester 1

## Assignment 2: Sentiment Classification of Tweets | Model

This notebook contains the code for selecting the features that give the most accurate model.

**Full Name:** `Xavier Travers`

**Student ID:** `1178369`

First, import the data and the necessary Python modules for the pipeline.

In [2]:
import pandas as pd
from pprint import pprint

# import the feature extractors
from features import *
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# classifiers
from sklearn.base import BaseEstimator
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# pipeline and parameter optimisation
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# import the data
train_data = pd.read_csv("../datasets/Train.csv", sep=',')
test_data = pd.read_csv("../datasets/Test.csv", sep=',')

### Classifier Container Object
This wraps a single classifier so that `GridSearchCV` can swap between different classifiers and compare them.

Reference: https://stackoverflow.com/questions/50285973/pipeline-multiple-classifiers

In [3]:
class ClassifierContainer(BaseEstimator):

    def __init__(
        self, 
        estimator = GaussianNB(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """ 

        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        # forward any extra fit parameters to the wrapped classifier
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
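
A minimal sketch of how a container like this can be searched over, assuming a toy corpus and an arbitrary choice of candidate classifiers and `alpha` values (none of these come from the assignment data). `SwitchableClassifier` is a hypothetical, compact stand-in for `ClassifierContainer` so that the example runs on its own. The classifier is swapped through the `clf__estimator` parameter, and its own hyperparameters are reached through `clf__estimator__<name>`.

```python
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class SwitchableClassifier(BaseEstimator):
    """Hypothetical stand-in for ClassifierContainer (same idea, fewer methods)."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        return self.estimator.score(X, y)


# toy data, invented purely for illustration
texts = [
    "love this great day", "hate this awful day",
    "great happy love", "awful sad hate",
    "so happy and great", "so sad and awful",
    "love love happy", "hate hate sad",
]
labels = ["positive", "negative"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SwitchableClassifier(MultinomialNB())),
])

# one dict per candidate classifier; each dict carries that classifier's own grid
param_grid = [
    {"clf__estimator": [MultinomialNB()], "clf__estimator__alpha": [0.1, 1.0]},
    {"clf__estimator": [LogisticRegression()]},
]

# explicit stratified folds so every training fold contains both classes
search = GridSearchCV(pipeline, param_grid, cv=StratifiedKFold(n_splits=2))
search.fit(texts, labels)

print(search.best_params_)
```

Using a list of dicts keeps each classifier's hyperparameters separate, so `alpha` is only tried with `MultinomialNB` and never passed to `LogisticRegression`.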

### Parameter Selection Pipeline

This is where parameter selection happens: the pipeline selects the best-performing combination of parameters.

Reference: https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html

In [None]:
# extract the training inputs and outputs
train_input = train_data['text'].values
train_output = train_data['sentiment'].values