# Lab 11 Tasks - Solutions

A common text classification task involves automatically determining the language in which a document is written, based on previously-labelled example documents.

In this notebook, we will look at automatically classifying the text from tweets as either English or non-English. The dataset we will use is a subset of the [UMass Global English on Twitter Dataset](https://www.kaggle.com/rtatman/the-umass-global-english-on-twitter-dataset).

## Task 1 - Preprocessing

Read the Twitter dataset from the tab-separated file 'tweet-language.tsv' into a Pandas DataFrame, where the row index is given by the column 'Tweet ID'.

In [1]:
import pandas as pd
df = pd.read_csv("tweet-language.tsv", sep="\t").set_index("Tweet ID")
print("Read %d documents" % len(df))
df.head(5)

Read 6759 documents


Tweet ID,Tweet,English
285903159434563584,volkan konak adami tribe sokar yemin ederim :d,0
285965965118824448,i felt my first flash of violence at some fool...,1
286057979831275520,ladies drink and get in free till 10:30,1
286216100784521216,watching #miranda on bbc1!!! u r hilarious,1
286525170670243840,all over twitter because you and your friends ...,1


Our target label for classification here is going to be the column 'English' -- a value of 1 indicates that a tweet is in English, while a value of 0 indicates it is written in another language.

From this column, check the number of tweets in the dataset for each class.

In [2]:
target = df["English"]
target.value_counts()

English
1    3704
0    3055
Name: count, dtype: int64
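
The counts above show that the two classes are only mildly imbalanced. As a rough reference point for the accuracy scores later in the notebook, the sketch below computes the class proportions and the accuracy that would be obtained by always predicting the most frequent class. The Series `target_demo` is a hypothetical stand-in rebuilt from the printed counts, so that the example runs without the data file; with the real data, `target` could be used directly.

```python
import pandas as pd

# Hypothetical stand-in for df["English"], rebuilt from the counts printed above
target_demo = pd.Series([1] * 3704 + [0] * 3055, name="English")

# Relative frequency of each class
proportions = target_demo.value_counts(normalize=True)
print(proportions.round(4))

# Accuracy of a trivial classifier that always predicts the most frequent class
majority_baseline = proportions.max()
print("Majority-class baseline accuracy = %.4f" % majority_baseline)
```

Any classifier we train should comfortably beat this baseline to be considered useful.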


Using the DataFrame and functionality from scikit-learn, create a vector representation of the documents. For real applications we would want to use a custom tokenizer to handle the specifics of tweets (e.g. mentions, hashtags etc.). However, for this example we can just use the standard scikit-learn tokenizer and a simple *CountVectorizer*.

Note that we should not filter out any "stop words" here. For language detection, very common words such as "the" or "and" are often strong indicators of the language of a tweet, so they can be useful features.

In [3]:
# the content for all documents
documents = df["Tweet"]
# apply the vectorization process
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df = 10, stop_words=None)
X = vectorizer.fit_transform(documents)
# check the size of the resulting representation
print(X.shape)

(6759, 892)


In [4]:
# check the number of terms/words in our preprocessed vocabulary
terms = vectorizer.get_feature_names_out()
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 892 distinct terms


## Task 2 - Classification and Train/Test Evaluation

Train a kNN classification model with 3 neighbours, and evaluate the accuracy of this model using a single train/test split, so that we have 70% of the tweets in the training set and 30% in the test set.

In [5]:
# perform the split - note test_size=0.3 means 30% assigned to the test set
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(X, target, test_size=0.3)
# we will just check how many tweets in each set
print("Training set has %d tweets" % data_train.shape[0] )
print("Test set has %d tweets" % data_test.shape[0] )

Training set has 4731 tweets
Test set has 2028 tweets


In [6]:
# prepare the k-NN classification model, for 3 nearest neighbours in this case
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, target_train)

In [7]:
# make predictions for the tweets in the test set
predicted = model.predict(data_test)
predicted

array([0, 0, 1, ..., 0, 1, 0])

In [8]:
# now we will evaluate the performance of the classifier
from sklearn.metrics import accuracy_score
print("Accuracy = %.4f" % accuracy_score(target_test, predicted))

Accuracy = 0.8235
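
To make the behaviour of the 3-nearest-neighbour classifier more concrete, the following sketch inspects the neighbours used for a single prediction. The data is a hypothetical toy example in which each row holds the counts of two made-up terms; it assumes the default Euclidean distance used by *KNeighborsClassifier*.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: counts of two terms per document
X_toy = np.array([[2, 0], [1, 0], [3, 0], [0, 2], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0])  # 1 = English, 0 = non-English

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# A new document that only contains the second term
query = np.array([[0, 3]])

# Look up the 3 closest training documents and their labels
distances, indices = knn.kneighbors(query)
neighbour_labels = y_toy[indices[0]]
print("Neighbour labels:", neighbour_labels)

# The prediction is the majority label among those neighbours
vote = knn.predict(query)[0]
print("Predicted class:", vote)
```

Two of the three nearest neighbours carry the label 0, so the query document is assigned to the non-English class.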


Repeat the classification and evaluation process using a different train/test split. Did the classifier achieve the same accuracy score as before?

In [9]:
data_train, data_test, target_train, target_test = train_test_split(X, target, test_size=0.3)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(data_train, target_train)
predicted = model.predict(data_test)
print("Accuracy = %.4f" % accuracy_score(target_test, predicted))

Accuracy = 0.8240
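
The two accuracy scores differ because `train_test_split` shuffles the data randomly on each call. If a repeatable split is needed, one option is to fix the `random_state` parameter; the `stratify` parameter can additionally keep the class proportions similar in both sets. The sketch below uses hypothetical toy arrays in place of `X` and `target`, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and target
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 8 + [1] * 12)

# Two calls with the same seed produce the same split
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

same_split = np.array_equal(a_test, b_test)
print("Identical test sets:", same_split)
```

A fixed seed makes a single experiment reproducible, but the score still depends on which split that seed happens to produce.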


## Task 3 - Classification and Cross-Validation

If we re-run the evaluation above several times, we will get different performance scores depending on the randomly-generated training/test split that we are using. A more robust strategy involves using *k-fold cross-validation* to evaluate a classifier.

Evaluate the kNN classifier from above, but this time using 5-fold cross-validation. The model in each fold should be evaluated using accuracy. Calculate the overall average accuracy across all 5 folds.


In [10]:
from sklearn.model_selection import cross_val_score
# create a single classifier
model = KNeighborsClassifier(n_neighbors=3)
# apply 5-fold cross-validation, measuring accuracy each time
acc_scores = cross_val_score(model, X, target, cv=5, scoring="accuracy")

In [11]:
# represent the results as a Pandas Series
labels = ["Fold %d" % i for i in range(1,len(acc_scores)+1)]
s_acc = pd.Series(acc_scores, index = labels)
s_acc

Fold 1    0.831361
Fold 2    0.822485
Fold 3    0.821746
Fold 4    0.830621
Fold 5    0.829756
dtype: float64

In [12]:
# overall average accuracy
print("Mean accuracy: %.4f" % s_acc.mean())

Mean accuracy: 0.8272
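
For reference, the sketch below spells out roughly what `cross_val_score` does internally when it is given a classifier and `cv=5`: the data is divided into 5 stratified folds, and each fold is used once as the test set while the model is trained on the remaining folds. The synthetic data from `make_classification` is an assumption made so that the example runs on its own; it is not the tweet dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tweet data (assumption for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Manual 5-fold loop: train on 4 folds, test on the remaining fold
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_toy, y_toy):
    fold_model = KNeighborsClassifier(n_neighbors=3)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_pred = fold_model.predict(X_toy[test_idx])
    manual_scores.append(accuracy_score(y_toy[test_idx], fold_pred))

# The helper function used in the solution above
helper_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X_toy, y_toy, cv=5, scoring="accuracy"
)

print("Manual loop:     ", np.round(manual_scores, 4))
print("cross_val_score: ", np.round(helper_scores, 4))
print("Mean accuracy: %.4f" % np.mean(manual_scores))
```

Because every document is used for testing exactly once, the mean over the folds depends far less on one lucky or unlucky split than a single train/test evaluation does.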
