# Text Classification

In this assignment, you will work on the [OffensEval](https://sites.google.com/site/offensevalsharedtask/) shared task. This challenge has been part of the 2019 and 2020 editions of SemEval and focuses on the identification of offensive language in social media platforms. In particular, you are solving subtasks A and B of the 2019 edition:

* **SubTask A: Offensive language identification.** The goal of this subtask is to discriminate between offensive and non-offensive posts. Offensive posts include insults, threats, and other types of non-acceptable language. This subtask can be addressed as a Binary Text Classification problem.


* **SubTask B: Automatic categorization of offense types.** The goal is to predict whether an offensive post is targeted or not. A post is considered targeted if it contains insults or threats to an individual or group. An untargeted offensive post contains non-acceptable language that is not targeted at anyone in particular. In this assignment, you will work on a version of this subtask where the goal is to classify posts as targeted, untargeted, or non-offensive. This version of the subtask can be addressed as a Multiclass Text Classification problem.

You will work with [scikit-learn](https://scikit-learn.org/stable/), a **Python** Machine Learning library that provides a wide range of tools, including some for text data. Specifically, you will use the following objects and functions:

In [3]:
!pip install -r '/content/sample_data/Assignment2+-+requirements.txt'

Collecting ipython==8.5.0 (from -r /content/sample_data/Assignment2+-+requirements.txt (line 1))
  Downloading ipython-8.5.0-py3-none-any.whl (752 kB)
Collecting jupyter==1.0.0 (from -r /content/sample_data/Assignment2+-+requirements.txt (line 2))
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting nbimporter==0.3.4 (from -r /content/sample_data/Assignment2+-+requirements.txt (line 3))
  Downloading nbimporter-0.3.4-py3-none-any.whl (4.9 kB)
Collecting pytest==7.1.3 (from -r /content/sample_data/Assignment2+-+requirements.txt (line 4))
  Downloading pytest-7.1.3-py3-none-any.whl (298 kB)
Collecting pandas==1.3.5 (from -r /content/sample_data/Assignment2+-+requirements.txt (line 5))
  Downloading pandas-1.3.5-cp310-cp

In [4]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score, accuracy_score

The data for the assignment consists of 13240 tweets for training and 860 tweets for testing, with annotations for both **SubTask A** (*True* or *False*) and **SubTask B** (*TIN*, *UNT* and *NOT*). The dataset also includes a Sentiment Analysis annotation for each tweet, which you will use later in the assignment. The dataset can be loaded into two `DataFrames` as follows:

In [6]:
train = pd.read_csv("/content/sample_data/train.tsv", sep="\t")
test = pd.read_csv("/content/sample_data/test.tsv", sep="\t")
train[["tweet", "sentiment", "subtask_a", "subtask_b"]]

Unnamed: 0,tweet,sentiment,subtask_a,subtask_b
0,@USER She should ask a few native Americans wh...,neutral,True,UNT
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,negative,True,TIN
2,Amazon is investigating Chinese employees who ...,neutral,False,NOT
3,"@USER Someone should'veTaken"" this piece of sh...",negative,True,UNT
4,@USER @USER Obama wanted liberals &amp; illega...,negative,False,NOT
...,...,...,...,...
13235,@USER Sometimes I get strong vibes from people...,negative,True,TIN
13236,Benidorm ✅ Creamfields ✅ Maga ✅ Not too sh...,positive,False,NOT
13237,@USER And why report this garbage. We don't g...,negative,True,TIN
13238,@USER Pussy,negative,True,UNT


## Text Representation - [6 Marks]

In order to apply Text Classification to both subtasks, we first need to convert the text of the tweets into numerical feature vectors. For this assignment, you will use a bag-of-words representation based on tf-idf. This representation can be obtained with **scikit-learn** using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer). `TfidfVectorizer` provides a number of pre-processing steps, such as tokenization and stop-word removal, and other options to represent the text.
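To make the tf-idf weighting more concrete, here is a minimal hand-rolled sketch on a hypothetical three-sentence corpus. It assumes whitespace tokenization and mirrors the smoothed idf and L2 normalization documented as the `TfidfVectorizer` defaults; the real class tokenizes differently (for example, it lower-cases the text and drops single-character tokens), so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased; tokens are split on whitespace.
docs = [
    "you are great",
    "you are rude",
    "great weather today",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1, so rarer terms get larger weights.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_vector(tokens):
    """Return the L2-normalized tf-idf vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print([round(v, 2) for v in vectors[1]])
```

Each document becomes a vector with one position per vocabulary term; terms that occur in fewer documents (such as "rude") receive a higher idf than common ones (such as "you").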

You must complete the code for the `create_tfidfvectorizer` function. The function must create and return a `TfidfVectorizer` with all parameters at their default value. Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer) to learn how.

In [7]:
def create_tfidfvectorizer():   # 3 Marks
    return TfidfVectorizer()

In [8]:
vectorizer = create_tfidfvectorizer()

Now that you have created the `TfidfVectorizer`, the next step is to apply it to the dataset and get the representation of the tweets. For this, `TfidfVectorizer` first needs to learn the vocabulary and idf values from the train set. Then, it can be used to transform both the train and test sets. When applied to the text data, `TfidfVectorizer` will pre-process it according to the parameters used.
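The fit-then-transform workflow can be illustrated on a tiny, hypothetical corpus (the `toy_*` names are made up for this sketch). The vocabulary is learned only from the train texts, so a word that appears only in the test text is ignored and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; "awful" never appears in the train texts.
toy_train = ["you are great", "you are rude"]
toy_test = ["you are awful"]

toy_vectorizer = TfidfVectorizer()

# Learn vocabulary and idf values from the train texts, then encode them.
toy_train_x = toy_vectorizer.fit_transform(toy_train)

# Reuse the learned vocabulary and idf values for the test text.
toy_test_x = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_x.shape, toy_test_x.shape)
```

Here `fit_transform` is simply a shortcut for calling `fit` followed by `transform` on the same data.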

You must complete the code for the `run_vectorizer` function. The function takes the vectorizer created previously, and the tweets of the train and test sets. The function should train the vectorizer on the train text to learn the vocabulary and the idf values, and apply it to transform both the train and test tweets. The expected output is the result of these transformations where each tweet should be represented with a vector of 19083 dimensions:
> Shape of train input data: (13240, 19083)  
Shape of test input data: (860, 19083)

In [9]:
def run_vectorizer(vectorizer, train, test):   # 3 Marks

    # Fit vectorizer on the train data
    vectorizer.fit(train)

    # Transform the train and test data
    train_trnsfrm = vectorizer.transform(train)
    test_trnsfrm = vectorizer.transform(test)

    return train_trnsfrm, test_trnsfrm

In [10]:
train_x, test_x = run_vectorizer(vectorizer, train["tweet"], test["tweet"])
print(f"Shape of train input data: {train_x.get_shape()}")
print(f"Shape of test input data: {test_x.get_shape()}")

Shape of train input data: (13240, 19083)
Shape of test input data: (860, 19083)


## Logistic Regression - [8 Marks]

Having obtained the feature vectors from the text, you can proceed with training a classifier to make predictions about the offensive language of a tweet. You will begin by creating a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) with **scikit-learn**. To keep the exercise simple, you are going to use the default options, which include the [Limited-memory BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm for optimization. Note that the default strategy for the Multiclass case depends on the **scikit-learn** version: older releases default to *one-vs-all* (also called one-vs-rest), while releases from 0.22 onward fit a multinomial (softmax) model when this solver is used. The [SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) allows implementing a Logistic Regression classifier that works with Stochastic Gradient Descent, but you won't use it for this assignment.

You must complete the code for the `create_model` function. The function should create and return a `LogisticRegression` with the default options, except that the maximum number of training iterations should be increased to 1000 to ensure that the model converges.

In [11]:
def create_model():   # 3 Marks
    return LogisticRegression(max_iter=1000)

In [12]:
model = create_model()

Just based on the target labels you use for training, the `LogisticRegression` you have created is able to automatically recognize the type of classification problem you are working on, Binary or Multiclass. In the following exercise, you will implement the code to train the model and make predictions on the test set. The same solution will be used for both **SubTask A** and **SubTask B**.
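As a rough illustration of this behaviour, the sketch below fits the same kind of model on random, hypothetical features with two different label sets. After fitting, `classes_` lists the labels found in the targets and `coef_` has one row for the binary problem and one row per class for the multiclass one. The data is random noise, so only the shapes are meaningful here, not the predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical random features: 30 samples with 4 dimensions each.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

# Two label sets shaped like the ones used in SubTask A and SubTask B.
binary_y = np.array([True, False] * 15)
multi_y = np.array(["NOT", "TIN", "UNT"] * 10)

binary_clf = LogisticRegression(max_iter=1000).fit(X, binary_y)
multi_clf = LogisticRegression(max_iter=1000).fit(X, multi_y)

# The problem type is inferred from the labels seen during fit.
print(binary_clf.classes_, binary_clf.coef_.shape)
print(multi_clf.classes_, multi_clf.coef_.shape)
```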

You must complete the code for the `run_model` function. The function takes as input the model, the train feature vectors, the train target labels and the test feature vectors. The function should train the model using the train features and labels, and return the predictions for the given test features.

In [14]:
def run_model(model, train_x, train_y, test_x):   # 5 Marks

    # Train the model
    model.fit(train_x, train_y)

    # Predict the model on the test data
    predictions = model.predict(test_x)

    return predictions

In [15]:
prediction = run_model(model, train_x, train["subtask_a"], test_x)
test['prediction_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,False
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,False
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,False


In [16]:
prediction = run_model(model, train_x, train["subtask_b"], test_x)
test['prediction_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,NOT
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,NOT
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,NOT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


You can now evaluate the performance of the `LogisticRegression` on the test set by computing different metrics with the true labels and the predictions obtained in the previous step. **SubTask A** can be evaluated with `accuracy` and `binary f1`, while for **SubTask B** `micro f1` and `macro f1` can be applied. If all went well, you should see results like the following:
> \*\*\* SubTask A \*\*\*  
accuracy: 0.80  
binary f1: 0.49  
>
> \*\*\* SubTask B \*\*\*    
micro f1: 0.78  
macro f1: 0.46
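To see why these metrics can diverge, the following sketch computes them by hand on a small, hypothetical set of labels (not taken from the dataset) in which the classifier favours the majority class. The helper `f1_per_class` is made up for this illustration. In single-label classification, `micro f1` pools all decisions and coincides with accuracy, whereas `macro f1` averages the per-class scores and is pulled down by poorly predicted rare classes.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels: the predictions lean heavily towards "NOT".
y_true = ["NOT", "NOT", "NOT", "NOT", "TIN", "TIN", "UNT", "UNT"]
y_pred = ["NOT", "NOT", "NOT", "NOT", "TIN", "NOT", "NOT", "NOT"]

labels = sorted(set(y_true))
correct = sum(t == p for t, p in zip(y_true, y_pred))
errors = len(y_true) - correct

accuracy = correct / len(y_true)

# Micro: each error counts once as a false positive and once as a false negative.
micro_f1 = 2 * correct / (2 * correct + errors + errors)

# Macro: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)

print(f"accuracy: {accuracy:0.2f}")
print(f"micro f1: {micro_f1:0.2f}")
print(f"macro f1: {macro_f1:0.2f}")
```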

In [17]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_a']):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.80
binary f1: 0.49

*** SubTask B ***
micro f1: 0.78
macro f1: 0.46


## Balancing the Dataset - [3 Marks]

The differences observed between the metrics used in the above evaluation indicate that the **OffensEval** dataset is not balanced. In **SubTask A**, getting an `accuracy` much higher than the `binary f1` can mean that the number of `False` cases is larger than the number of `True` cases. Similarly, obtaining very different `micro f1` and `macro f1` scores in **SubTask B** is a hint that some of the classes are more frequent than others. This can be verified with the following code lines:
> ```python
> train.groupby(by="subtask_a")[["tweet"]].count().reset_index()
> ```

|    | subtask_a   |   tweet |
|---:|:------------|--------:|
|  0 | False       |    8840 |
|  1 | True        |    4400 |

> ```python
> train.groupby(by="subtask_b")[["tweet"]].count().reset_index()
> ```

|    | subtask_b   |   tweet |
|---:|:------------|--------:|
|  0 | NOT         |    8840 |
|  1 | TIN         |    3876 |
|  2 | UNT         |     524 |

One solution that can mitigate this problem is to assign weights to the classes in a way that reduces the influence of the most frequent ones. **Scikit-learn** makes it easy to apply such an approach by setting the appropriate option when creating the model. The goal of the next exercise is to create a new version of the `LogisticRegression` that handles the unbalanced dataset better.
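As a rough guide to what such weights look like, the sketch below applies the inverse-frequency heuristic that **scikit-learn** documents for balanced class weights, `n_samples / (n_classes * count)`, to the **SubTask B** counts shown above. The variable names are chosen for this illustration only.

```python
# Class counts for SubTask B, copied from the table above.
counts = {"NOT": 8840, "TIN": 3876, "UNT": 524}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency heuristic: rarer classes receive larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"{label}: {weight:0.2f}")
```

With these counts, an error on a *UNT* tweet weighs roughly 17 times more than an error on a *NOT* tweet, which pushes the model to pay more attention to the rare class.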

You must complete the code for the `create_balanced_model` function. The function should create and return a `LogisticRegression` equal to the one created by `create_model` with the only difference being that this version automatically adjusts class weights. Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to learn which parameter to set.

In [18]:
def create_balanced_model():   # 3 Marks
    # max_iter parameter set to 1000 and the class_weight parameter set to 'balanced' to auto adjust weights.
    return LogisticRegression(max_iter=1000, class_weight='balanced')

In [19]:
balanced_model = create_balanced_model()
prediction = run_model(balanced_model, train_x, train["subtask_a"], test_x)
test['prediction_balanced_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_balanced_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_balanced_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,True
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,False
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,True


In [20]:
prediction = run_model(balanced_model, train_x, train["subtask_b"], test_x)
test['prediction_balanced_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_balanced_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_balanced_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,TIN
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,TIN
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,UNT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


The new model should narrow the gap between `accuracy` and `binary f1` in **SubTask A**, and between `micro f1` and `macro f1` in **SubTask B**. You will observe a small decrease in the `accuracy` and `micro f1` scores, but at the same time `binary f1` and `macro f1` will improve significantly:

> \*\*\* SubTask A \*\*\*  
accuracy: 0.78  
binary f1: 0.61  
>
> \*\*\* SubTask B \*\*\*  
micro f1: 0.75  
macro f1: 0.59

In [21]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_balanced_a']):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_balanced_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_balanced_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_balanced_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.78
binary f1: 0.61

*** SubTask B ***
micro f1: 0.75
macro f1: 0.59


## Additional Features - [3 Marks]

When working with linear classifiers for Text Classification, if additional information related to the input texts is available, it is often a good idea to incorporate this information in the form of additional features to the text representation. In this assignment, the result of a Sentiment Analysis on the tweets is provided in the *sentiment* column of the `DataFrames`. The goal of this last exercise is to incorporate this information into the input vectors of the `LogisticRegression`.

There are different ways to achieve this using **scikit-learn**, but a very handy approach, especially in combination with **pandas** `DataFrame`, is to create a [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). A `ColumnTransformer` can apply different encoding approaches to different columns of the input data separately and concatenate them to generate a single feature vector.

You must complete the code for the `create_column_transformer` function. The function must create and return a `ColumnTransformer` with two transformers:

*  A `TfidfVectorizer` that should be applied to the text of the tweets.
*  A [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) that should encode the annotations of the Sentiment Analysis.

You must use the default parameters for both transformers.

In [40]:
def create_column_transformer():   # 3 Marks

    # Create TfidfVectorizer for the tweet text
    tfidf_vec = TfidfVectorizer()

    # Create OneHotEncoder to encode sentiment annotations
    onehot_enc = OneHotEncoder()

    # Create the ColumnTransformer with the specified transformers
    column_transform = ColumnTransformer(
        transformers=[
            ('tfidf', tfidf_vec, 'tweet'),    # Apply TfidfVectorizer to the 'tweet' column
            ('onehot', onehot_enc, ['sentiment'])  # Apply OneHotEncoder to the 'sentiment' column
        ]
    )

    return column_transform

The `ColumnTransformer` can now be run using the `run_vectorizer` function you implemented above. Notice that, in this case, the whole train and test `DataFrames` are passed to `run_vectorizer` along with the `ColumnTransformer`. The function needs no changes, because a `ColumnTransformer` exposes the same `fit` and `transform` methods as a `TfidfVectorizer`. The output of the new feature extraction strategy should be a vector of 19086 dimensions per tweet, 3 more dimensions than the previous approach:

> Shape of train input data: (13240, 19086)  
Shape of test input data: (860, 19086)
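The three extra dimensions come from the one-hot encoding of the sentiment labels, one column per distinct value. A minimal sketch on a hypothetical three-row `DataFrame` (the `toy` names are made up for this illustration) shows how the final width is the size of the tf-idf vocabulary plus the number of sentiment categories.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the same column names as the assignment.
toy = pd.DataFrame({
    "tweet": ["you are great", "you are rude", "nice weather"],
    "sentiment": ["positive", "negative", "neutral"],
})

toy_transformer = ColumnTransformer(
    transformers=[
        # A plain string selects a 1-D column, as text vectorizers expect.
        ("tfidf", TfidfVectorizer(), "tweet"),
        # A list selects a 2-D block of columns, as OneHotEncoder expects.
        ("onehot", OneHotEncoder(), ["sentiment"]),
    ]
)
toy_x = toy_transformer.fit_transform(toy)

n_text = len(toy_transformer.named_transformers_["tfidf"].vocabulary_)
n_sentiment = len(toy_transformer.named_transformers_["onehot"].categories_[0])
print(n_text, n_sentiment, toy_x.shape)
```

Depending on how sparse the combined features are, `ColumnTransformer` may return either a sparse matrix or a dense array, which is why the sketch reads the `shape` attribute instead of calling `get_shape()`.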

In [41]:
column_transformer = create_column_transformer()
train_x_sentiment, test_x_sentiment = run_vectorizer(column_transformer, train, test)
print(f"Shape of train input data: {train_x_sentiment.get_shape()}")
print(f"Shape of test input data: {test_x_sentiment.get_shape()}")

Shape of train input data: (13240, 19086)
Shape of test input data: (860, 19086)


In [42]:
prediction = run_model(balanced_model, train_x_sentiment, train["subtask_a"], test_x_sentiment)
test['prediction_sentiment_a'] = prediction
test[['id', 'tweet', 'subtask_a', 'prediction_sentiment_a']]

Unnamed: 0,id,tweet,subtask_a,prediction_sentiment_a
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,True,True
1,27014,"#ConstitutionDay is revered by Conservatives, ...",False,False
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,False,False
3,13876,#Watching #Boomer getting the news that she is...,False,False
4,60133,#NoPasaran: Unity demo to oppose the far-right...,True,False
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,True,True
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,False,False
857,67018,3 people just unfollowed me for talking about ...,True,True
858,50665,#WednesdayWisdom Antifa calls the right fascis...,False,True


In [43]:
prediction = run_model(balanced_model, train_x_sentiment, train["subtask_b"], test_x_sentiment)
test['prediction_sentiment_b'] = prediction
test[['id', 'tweet', 'subtask_b', 'prediction_sentiment_b']]

Unnamed: 0,id,tweet,subtask_b,prediction_sentiment_b
0,15923,#WhoIsQ #WheresTheServer #DumpNike #DECLASFISA...,TIN,TIN
1,27014,"#ConstitutionDay is revered by Conservatives, ...",NOT,NOT
2,30530,#FOXNews #NRA #MAGA #POTUS #TRUMP #2ndAmendmen...,NOT,NOT
3,13876,#Watching #Boomer getting the news that she is...,NOT,NOT
4,60133,#NoPasaran: Unity demo to oppose the far-right...,TIN,NOT
...,...,...,...,...
855,73439,#DespicableDems lie again about rifles. Dem Di...,TIN,TIN
856,25657,#MeetTheSpeakers 🙌 @USER will present in our e...,NOT,NOT
857,67018,3 people just unfollowed me for talking about ...,UNT,UNT
858,50665,#WednesdayWisdom Antifa calls the right fascis...,NOT,NOT


The addition of the Sentiment Analysis to the input feature vector should help in both **SubTask A** and **SubTask B**. Compared with the balanced model, all the metrics should improve somewhat, especially `binary f1` and `macro f1`:

> \*\*\* SubTask A \*\*\*  
accuracy: 0.79  
binary f1: 0.66    
>
> \*\*\* SubTask B \*\*\*  
micro f1: 0.77  
macro f1: 0.62

In [44]:
print("*** SubTask A ***")
print(f"accuracy: {accuracy_score(test['subtask_a'], test['prediction_sentiment_a'] ):0.2f}")
print(f"binary f1: {f1_score(test['subtask_a'], test['prediction_sentiment_a'], average='binary'):0.2f}")
print("")
print("*** SubTask B ***")
print(f"micro f1: {f1_score(test['subtask_b'], test['prediction_sentiment_b'], average='micro'):0.2f}")
print(f"macro f1: {f1_score(test['subtask_b'], test['prediction_sentiment_b'], average='macro'):0.2f}")

*** SubTask A ***
accuracy: 0.79
binary f1: 0.66

*** SubTask B ***
micro f1: 0.77
macro f1: 0.62
