## Question 3
In this question, we will work on a [Kaggle competition](https://www.kaggle.com/t/24eae58e93964645985d075969ab1eb8). First, you need to download the dataset from the competition.

We will train a model for a binary sentiment classification task. The dataset consists of reviews of businesses (restaurants, bars, dentists, and so on), and the label is either positive or negative.

First, we load the training dataset from a CSV file. It has two columns: each row contains the input text and the corresponding binary label.

In [1]:
import pandas as pd
data = pd.read_csv('data/train_data.csv')

In [2]:
x_train = data['text']
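# str(...).upper() also covers the case where pandas parses TRUE/FALSE as booleans.
y_train = [1 if str(label).upper() == 'TRUE' else 0 for label in data['label']]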

We apply a simple preprocessing step that removes punctuation and converts the text to lowercase.

In [3]:
import re
def remove_punctuation(x):
    return re.sub(r'[^\w\s]', '', x)
def prepropess(x):
    x = remove_punctuation(x)
    x = x.lower()
    return x

x_train = [prepropess(x) for x in x_train]
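
As a quick check of what this preprocessing does, the sketch below applies the same two steps to a made-up review. The function name `clean_text` and the sample string are illustrative only and are not part of the assignment. Note that `\w` matches letters, digits, and underscores, so those characters are kept, while apostrophes are removed along with the other punctuation.

```python
import re

def clean_text(text):
    # Same two steps as above: drop every character that is neither a word
    # character (letter, digit, underscore) nor whitespace, then lowercase.
    return re.sub(r'[^\w\s]', '', text).lower()

sample = "Great food, GREAT service!!! Don't miss it."  # made-up review
cleaned = clean_text(sample)
print(cleaned)
```

Because the apostrophe is stripped, "Don't" becomes the single token `dont`, which is worth keeping in mind when inspecting the vocabulary later.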


Since the model cannot read text directly, we need to transform the text into a form the model can understand. For that purpose, we will use tf-idf (term frequency–inverse document frequency) to encode each input as a vector in which each entry corresponds to a word.

#### Count vectorization
A document contains many words, which may contribute differently to the label. Intuitively, a word that appears more often in the text should be more important and should be assigned a higher weight. The simplest approach is to assign weights based on the number of occurrences of each word in the document. This is referred to as term frequency (tf). In the implementation below, each count is also divided by the total number of words in the document.


In [4]:
from collections import defaultdict
def get_count_dict(data):
    word_dict = defaultdict(int)
    for word in data.split():
        word_dict[word] += 1
    return word_dict

def get_word_set(data):
    word_set = set()
    for line in data:
        for word in line.split():
            word_set.add(word)
    return word_set

word_set = get_word_set(x_train)

In [5]:
def compute_tf(doc):
    word_dict = get_count_dict(doc)
    doc = doc.split()
    num_words = len(doc)
    for word in word_dict:
        word_dict[word] = word_dict[word] / float(num_words)
    return word_dict

In [6]:
tf_data = []
for doc in x_train:
    tf_data.append(compute_tf(doc))

You can also use ```CountVectorizer``` from the ```sklearn``` package; however, it does not normalize by the number of words in each document.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorized_x_train = vectorizer.fit_transform(x_train)
print(vectorizer.get_feature_names_out())
vectorized_x_train[0].toarray()

['00' '000' '0000' ... 'zztaine' 'zzzzaaaacccchhh' 'émigré']


array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
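
If normalized term frequencies are wanted from ```CountVectorizer``` output, one possible approach is to divide each row by its sum. The sketch below does this with `sklearn.preprocessing.normalize` using the L1 norm on a made-up two-document corpus; `toy_docs` is an illustrative name and not part of the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

toy_docs = ["good food good service", "bad food"]  # made-up examples

counts = CountVectorizer().fit_transform(toy_docs)  # raw counts (sparse matrix)
# Counts are non-negative, so the L1 norm of a row equals the row sum.
tf = normalize(counts, norm='l1', axis=1)

print(tf.toarray())
```

One caveat: the default tokenizer of ```CountVectorizer``` drops single-character tokens, so the row sum can differ slightly from `len(doc.split())` used in `compute_tf`.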

#### n-grams
In the example above, we consider each word separately. However, in text data some words often occur together, for instance "machine learning" or "linear regression", and this information is ignored by this vectorization method.

To include more meaningful information in the text features, we can treat a group of n consecutive words, called an n-gram, as a feature and count its occurrences. To avoid a very large number of features, we only use small n (1-grams, 2-grams, 3-grams). This is done by simply setting ```ngram_range``` in ```CountVectorizer```.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectorized_x_train = vectorizer.fit_transform(x_train)
print(vectorizer.get_feature_names_out())
vectorized_x_train[0].toarray()

['00' '00 more' '00 more positive' ... 'émigré' 'émigré who'
 'émigré who in']


array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

#### TF-IDF
Term frequency by itself has a serious problem: it only considers the importance of a word within a single document. When all documents are about the same topic, for instance *movie reviews*, words such as "movie" and "actor" will appear many times but convey almost no information about the meaning of an individual review. We therefore apply inverse document frequency (idf) to downweight words according to the number of documents that contain them. We compute idf as follows
$$
\operatorname{idf}(w)=\log\frac{N}{\operatorname{df}(w)}
$$
where $N$ is the total number of documents in the dataset and $\operatorname{df}(w)$ is the number of documents that contain the word $w$.

Finally, we combine the two definitions and compute the tf-idf of a word $w$ in a document $d$ as
$$
\operatorname{tf\text{-}idf}(w,d)=\operatorname{tf}(w,d)\cdot\operatorname{idf}(w)
$$
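
The two formulas can be checked by hand on a tiny corpus. The sketch below is a minimal illustration on three made-up "reviews"; the helper names `tf`, `idf`, and `tf_idf` are chosen for this example only.

```python
import math

# Made-up corpus of three tiny reviews (already lowercased, no punctuation).
docs = ["good food", "bad food", "good good service"]
N = len(docs)

def tf(word, doc):
    # Occurrences of the word divided by the number of words in the document.
    words = doc.split()
    return words.count(word) / len(words)

def idf(word):
    # log(N / df), where df is the number of documents containing the word.
    df = sum(1 for doc in docs if word in doc.split())
    return math.log(N / df)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

print(tf_idf("food", docs[0]))     # tf = 1/2, idf = log(3/2)
print(tf_idf("service", docs[2]))  # tf = 1/3, idf = log(3/1)
```

A word that occurs in all $N$ documents has $\operatorname{idf}(w)=\log 1=0$, so its tf-idf is zero no matter how often it appears.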

In [9]:

import math

def compute_idf(data):
    N = len(data)  # total number of documents
    idf = defaultdict(int)
    # Document frequency: count each word once per document it appears in.
    for doc in data:
        for word in set(doc.split()):
            idf[word] += 1
    # Convert document frequencies to idf(w) = log(N / df(w)).
    for word in idf:
        idf[word] = math.log(N / idf[word])
    return idf


def compute_tfidf(doc, idf):
    tf = compute_tf(doc)
    tfidf = defaultdict(int)
    for word, value in tf.items():
        tfidf[word] = value * idf[word]
    return tfidf

idf = compute_idf(x_train)



In [10]:
tfidf_data = []
for doc in x_train:
    tfidf_data.append(compute_tfidf(doc, idf))

You obtain a list ```tfidf_data``` of dictionaries, one per document, where the keys are words and the values are their tf-idf scores in that document.

Similarly, you can also use ```TfidfVectorizer``` from ```sklearn``` to extract features.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorized_x_train = vectorizer.fit_transform(x_train)
print(vectorizer.get_feature_names_out())
vectorized_x_train[0].toarray()

['00' '000' '0000' ... 'zztaine' 'zzzzaaaacccchhh' 'émigré']


array([[0., 0., 0., ..., 0., 0., 0.]])

Your task is to use the classification models we have learnt so far, including kNN, Decision Trees, and Perceptron, to predict the sentiment of each review in the ```test_data.csv``` and ```priv_test_data.csv``` files.
- You need to try every vectorization method covered in this notebook: count vectorization, n-grams, and tf-idf.
- You will submit an output ```.csv``` file to Kaggle. Details on how to submit can be found on the [competition](https://www.kaggle.com/t/24eae58e93964645985d075969ab1eb8) page. You can try any technique to improve the result, e.g. changing the objective function, adding regularization, or preprocessing the data. You are allowed to use any external package for the preprocessing step or for learning the model.
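
One way to organize the comparison asked for above is to loop over every vectorizer and model pair and score each on a held-out validation split. The sketch below is only an outline under stated assumptions: `compare_pipelines`, the dictionary keys, the hyperparameters (such as `n_neighbors=3`), and the toy corpus are all illustrative choices, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def compare_pipelines(texts, labels, test_size=0.25, seed=1):
    """Return {(vectorizer_name, model_name): validation accuracy}."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels)
    # Factories so that every combination gets a fresh, unfitted object.
    vectorizers = {
        'count': lambda: CountVectorizer(),
        'ngram(1,3)': lambda: CountVectorizer(ngram_range=(1, 3)),
        'tfidf': lambda: TfidfVectorizer(),
    }
    models = {
        'kNN': lambda: KNeighborsClassifier(n_neighbors=3),
        'tree': lambda: DecisionTreeClassifier(random_state=seed),
        'perceptron': lambda: Perceptron(random_state=seed),
    }
    scores = {}
    for v_name, make_vec in vectorizers.items():
        for m_name, make_model in models.items():
            # The pipeline fits the vectorizer on the training split only.
            pipe = make_pipeline(make_vec(), make_model())
            pipe.fit(X_tr, y_tr)
            scores[(v_name, m_name)] = accuracy_score(y_val, pipe.predict(X_val))
    return scores

# Tiny made-up corpus, only to show that the function runs.
toy_texts = ["great food", "lovely staff", "great service", "loved it",
             "amazing place", "really great", "terrible food", "rude staff",
             "awful service", "hated it", "horrible place", "really bad"]
toy_labels = [1] * 6 + [0] * 6
for key, acc in compare_pipelines(toy_texts, toy_labels).items():
    print(key, acc)
```

On the real data, the same function could be called with the preprocessed `x_train` and `y_train`. Accuracies on a corpus this small are not meaningful; the point is only the structure of the comparison.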



In [12]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [13]:
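def ngram(text):
    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(ngram_range=(1, 3))
    # Return the fitted vectorizer (not the transformed matrix) so that it can
    # be reused on new data via vectorizer.transform(...), as tfidf() does.
    vectorizer = vectorizer.fit(text)
    return vectorizer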

def tfidf(text):
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer()
    vectorizer = vectorizer.fit(text)
    return vectorizer

In [14]:
# Hold out 22% of the training data for validation. Distinct names are used so
# that the full x_train / y_train are not overwritten by the split.
X_tr, X_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.22, random_state=1)

def train(X_train, y_train, vectorizer_fn, model):
    # vectorizer_fn takes the training texts and returns a fitted vectorizer.
    vectorizer = vectorizer_fn(X_train)
    vectorized_x_train = vectorizer.transform(X_train)
    model = model.fit(vectorized_x_train, y_train)
    return model, vectorizer

# The final model is trained on the full training set.
model, vectorizer = train(x_train, y_train, tfidf, DecisionTreeClassifier(max_depth=300))

In [15]:
def load_predict(file, vectorizer, model):
    data = pd.read_csv(file)
    x = data['text']
    x = [prepropess(text) for text in x]
    vectorized_x = vectorizer.transform(x)
    labels = model.predict(vectorized_x)
    return labels

In [16]:
y_test = load_predict('data/test_data.csv', vectorizer, model)
y_pri_test = load_predict('data/priv_test_data.csv', vectorizer, model)

y_all = np.concatenate((y_test, y_pri_test))

result = pd.DataFrame({'id':range(len(y_all)), 'label': y_all})
result.to_csv('results.csv', index=False)