In [1]:
# imports
import pandas as pd

## Dataset (Pavlick Formality Scores)
**Download**

To run this notebook, download the following .csv files from [here](https://huggingface.co/datasets/osyvokon/pavlick-formality-scores/tree/main) and save them in ./data (hint: click on the file size to start the download):
- all.csv
- test.csv
- train.csv

### All Data

In [2]:
# load all data from csv into pandas dataframe
df_all = pd.read_csv('./data/all.csv')
df_all.head()

Unnamed: 0,domain,avg_score,sentence
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...
1,answers,-1.8,it's a Holiday Inn for terroists.
2,answers,-2.0,Good Luck and don't give up!
3,answers,0.2,Most SHC victims are found near a heat source.
4,answers,-1.6,Tanay: I did have an opinion.


In [3]:
df_all.shape

(11274, 3)

## AI003 ticket:
1. Remove null values
2. Tokenization (sentences to words)
3. Normalisation
    - Stemming (Create a function to stem text with chosen stemmer)
    - Lemmatisation
4. Use some stopwords (carefully curate)
5. Vectorize

### Check data for null values

In [4]:
null_sum = df_all.isnull().sum()
 
# printing the number of null values present
print('Number of NaN values present: ' + str(null_sum))

Number of NaN values present: domain       0
avg_score    0
sentence     0
dtype: int64


No null values to remove in the dataset :)
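Had any values been missing, they could be dropped before tokenization. A minimal sketch on a small made-up frame (`df_demo` and its rows are hypothetical and not taken from the dataset):

```python
import pandas as pd

# hypothetical toy frame with one missing score and one missing sentence
df_demo = pd.DataFrame({
    'domain': ['answers', 'news', 'blog'],
    'avg_score': [-1.4, None, 0.6],
    'sentence': ['Good Luck!', 'Markets rose today.', None],
})

# drop rows where either the text or the label is missing
df_clean = df_demo.dropna(subset=['sentence', 'avg_score']).reset_index(drop=True)

print(len(df_demo), '->', len(df_clean))
```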

### Tokenization

In [5]:
from nltk.tokenize import word_tokenize

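def tokenize(sentence):
    # split the sentence into word tokens, then keep only purely alphabetic ones;
    # this drops punctuation, numbers and contraction fragments such as "n't"
    tokens = word_tokenize(sentence)
    return [w for w in tokens if w.isalpha()]

df_all['tokenized'] = df_all['sentence'].apply(tokenize)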
df_all.head()

Unnamed: 0,domain,avg_score,sentence,tokenized
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"[Pimp, Successfully, complete, all, the, Snatc..."
1,answers,-1.8,it's a Holiday Inn for terroists.,"[it, a, Holiday, Inn, for, terroists]"
2,answers,-2.0,Good Luck and don't give up!,"[Good, Luck, and, do, give, up]"
3,answers,0.2,Most SHC victims are found near a heat source.,"[Most, SHC, victims, are, found, near, a, heat..."
4,answers,-1.6,Tanay: I did have an opinion.,"[Tanay, I, did, have, an, opinion]"
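The `isalpha` filter explains why "don't" shows up as just `do` in the table above. The sketch below applies the same filter to a hand-written token list. The list is an assumption about what `word_tokenize` returns for this sentence (Treebank-style splitting of "don't" into `do` and `n't`), written out by hand so the example runs without any NLTK data.

```python
# assumed word_tokenize output for "Good Luck and don't give up!"
assumed_tokens = ['Good', 'Luck', 'and', 'do', "n't", 'give', 'up', '!']

kept = [w for w in assumed_tokens if w.isalpha()]
dropped = [w for w in assumed_tokens if not w.isalpha()]

print(kept)     # alphabetic tokens only
print(dropped)  # punctuation and the contraction fragment
```

Because the negation fragment is discarded, "do" and "don't" become indistinguishable after tokenization, which may matter if contractions turn out to be a formality cue.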


### Stemming

In [6]:
from nltk.stem import PorterStemmer

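ps = PorterStemmer()  # create the stemmer once and reuse it for every row

def stem(tokens):
    # stem each token in a list of tokens (the Porter stemmer also lowercases)
    return [ps.stem(w) for w in tokens]

df_all['stemmed'] = df_all['tokenized'].apply(stem)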
df_all.head()

Unnamed: 0,domain,avg_score,sentence,tokenized,stemmed
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"[Pimp, Successfully, complete, all, the, Snatc...","[pimp, success, complet, all, the, snatch, loc..."
1,answers,-1.8,it's a Holiday Inn for terroists.,"[it, a, Holiday, Inn, for, terroists]","[it, a, holiday, inn, for, terroist]"
2,answers,-2.0,Good Luck and don't give up!,"[Good, Luck, and, do, give, up]","[good, luck, and, do, give, up]"
3,answers,0.2,Most SHC victims are found near a heat source.,"[Most, SHC, victims, are, found, near, a, heat...","[most, shc, victim, are, found, near, a, heat,..."
4,answers,-1.6,Tanay: I did have an opinion.,"[Tanay, I, did, have, an, opinion]","[tanay, i, did, have, an, opinion]"
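Stems are not necessarily dictionary words, as the truncated forms in the table suggest. A small sketch that stems a few words from the sample rows in isolation (the `words` list is picked by hand for illustration):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# a few words taken from the sample sentences above
words = ['Successfully', 'complete', 'victims', 'source']
stems = [ps.stem(w) for w in words]

for word, stemmed in zip(words, stems):
    print(word, '->', stemmed)
```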


### Lemmatisation

Lemmatisation is not used as a pre-processing step, because information in the words that could help determine formality may be lost. For example, lemmatising 'better' to 'good' may make a model less effective at determining formality, as words can be transformed into more or less formal lemmas.

### Stopwords

In [7]:
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # build the set once for fast lookups

def stopper(tokens):
    # keep only the tokens that are not in NLTK's English stop word list
    return [w for w in tokens if w not in stop_words]

df_all['stopped'] = df_all['stemmed'].apply(stopper)
df_all.head()


Unnamed: 0,domain,avg_score,sentence,tokenized,stemmed,stopped
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"[Pimp, Successfully, complete, all, the, Snatc...","[pimp, success, complet, all, the, snatch, loc...","[pimp, success, complet, snatch, locat, level]"
1,answers,-1.8,it's a Holiday Inn for terroists.,"[it, a, Holiday, Inn, for, terroists]","[it, a, holiday, inn, for, terroist]","[holiday, inn, terroist]"
2,answers,-2.0,Good Luck and don't give up!,"[Good, Luck, and, do, give, up]","[good, luck, and, do, give, up]","[good, luck, give]"
3,answers,0.2,Most SHC victims are found near a heat source.,"[Most, SHC, victims, are, found, near, a, heat...","[most, shc, victim, are, found, near, a, heat,...","[shc, victim, found, near, heat, sourc]"
4,answers,-1.6,Tanay: I did have an opinion.,"[Tanay, I, did, have, an, opinion]","[tanay, i, did, have, an, opinion]","[tanay, opinion]"
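One thing to watch with this ordering: the stop word list contains unstemmed words, while the tokens being filtered are already stemmed, so a stop word whose stem differs from its surface form can slip through. The sketch below illustrates the effect with a small hand-picked stop list (a stand-in for NLTK's English list, used so the example needs no corpus download) and a made-up sentence.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# small hand-picked stop list, standing in for NLTK's English list
demo_stop_words = {'this', 'was', 'the', 'a'}

tokens = ['This', 'was', 'the', 'answer']

# order used in this notebook: stem first, then filter
stem_then_stop = [w for w in (ps.stem(t) for t in tokens) if w not in demo_stop_words]

# alternative order: filter on the lowercased token, then stem
stop_then_stem = [ps.stem(t) for t in tokens if t.lower() not in demo_stop_words]

print(stem_then_stop)
print(stop_then_stem)
```

Whether this matters for formality prediction is an open question; it is shown here only to make the interaction between the two steps visible.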


## Vectorisation

1. Count Vectorization (Bag of Words)
2. TF-IDF Vectorization
3. Hashing Vectorization


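We will now use scikit-learn's built-in text feature extraction module, whose vectorizers lowercase, tokenize and remove stop words in a single step. Note that they do not stem; stemming would require passing a custom tokenizer or analyzer.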

In [8]:
# first, let's drop the columns we added for 'tokenized', 'stemmed' and 'stopped'
df_all = df_all.drop(['tokenized', 'stemmed', 'stopped'], axis=1)
df_all.head(0)

Unnamed: 0,domain,avg_score,sentence


### 1. Count vectorizer (Bag of Words)

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

params1 = {'lowercase' : True, #Convert all characters to lowercase before tokenizing
          'stop_words': 'english', #Use sklearn built in corpus for stop word removal
          'max_df': 1.0, #maximum document frequency: lower this to ignore very frequent words (1.0 keeps all)
          'min_df': 0.0, #minimum document frequency: raise this to ignore very rare words (0.0 keeps all)
          'analyzer' : 'word', #tokenize to words
          'ngram_range': (1,1), #use single words (unigrams) only
         }

vectorizer1 = CountVectorizer(**params1) #initialise the vectorizer
vectorizer1.fit(df_all['sentence'].tolist()) #fit vectorizer to entire corpus

df_all['count_vector'] = df_all.apply(lambda x: vectorizer1.transform([x['sentence']]), axis=1)

df_all.head()

Unnamed: 0,domain,avg_score,sentence,count_vector
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"(0, 27)\t1\n (0, 3540)\t1\n (0, 9180)\t1\n..."
1,answers,-1.8,it's a Holiday Inn for terroists.,"(0, 7640)\t1\n (0, 8217)\t1\n (0, 15577)\t1"
2,answers,-2.0,Good Luck and don't give up!,"(0, 5077)\t1\n (0, 6993)\t1\n (0, 9466)\t1"
3,answers,0.2,Most SHC victims are found near a heat source.,"(0, 7452)\t1\n (0, 10541)\t1\n (0, 14069)\..."
4,answers,-1.6,Tanay: I did have an opinion.,"(0, 4756)\t1\n (0, 10993)\t1\n (0, 15403)\t1"


We can view the format of each sentence vector:

In [10]:
df_all['count_vector'].iloc[2]

<1x17324 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

And the locations in which the words appear in the feature vector:

In [11]:
df_all['count_vector'].iloc[2].nonzero()[1]

array([5077, 6993, 9466], dtype=int32)

And how this relates to each feature:

In [12]:
features = vectorizer1.get_feature_names_out()

print('Original sentence: \n', df_all['sentence'].iloc[2])
print('Sentence transformed back from vector: \n', features[df_all['count_vector'].iloc[2].nonzero()[1]])

Original sentence: 
 Good Luck and don't give up!
Sentence transformed back from vector: 
 ['don' 'good' 'luck']


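This example shows that lowercasing and stop word removal are working correctly ('and', 'give' and 'up' have been removed). Note that CountVectorizer does not stem, and that its default tokenizer splits "don't" at the apostrophe, which is why 'don' appears as a feature.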
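To see exactly which tokens the vectorizer produces for a sentence, its analyzer can be called directly, without fitting. A small sketch (the names `analyze` and `tokens` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# the analyzer chains preprocessing, tokenizing and stop word filtering
analyze = vectorizer.build_analyzer()

tokens = analyze("Good Luck and don't give up!")
print(tokens)
```

The default analyzer lowercases the text, keeps tokens of two or more word characters and filters stop words. There is no stemming step in this chain.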

### 2. TF-IDF

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

params2 = {'lowercase' : True, #Convert all characters to lowercase before tokenizing
          'stop_words': 'english', #Use sklearn built in corpus for stop word removal
          'max_df': 1.0, #maximum document frequency: lower this to ignore very frequent words (1.0 keeps all)
          'min_df': 0.0, #minimum document frequency: raise this to ignore very rare words (0.0 keeps all)
          'analyzer' : 'word', #tokenize to words
          'ngram_range': (1,1), #use single words (unigrams) only
         }

vectorizer2 = TfidfVectorizer(**params2) #initialise the vectorizer
vectorizer2.fit(df_all['sentence'].tolist()) #fit vectorizer to entire corpus

df_all['tf_idf_vector'] = df_all.apply(lambda x: vectorizer2.transform([x['sentence']]), axis=1)
df_all.head()

Unnamed: 0,domain,avg_score,sentence,count_vector,tf_idf_vector
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"(0, 27)\t1\n (0, 3540)\t1\n (0, 9180)\t1\n...","(0, 15103)\t0.3894924264295973\n (0, 14441)..."
1,answers,-1.8,it's a Holiday Inn for terroists.,"(0, 7640)\t1\n (0, 8217)\t1\n (0, 15577)\t1","(0, 15577)\t0.6225307646760007\n (0, 8217)\..."
2,answers,-2.0,Good Luck and don't give up!,"(0, 5077)\t1\n (0, 6993)\t1\n (0, 9466)\t1","(0, 9466)\t0.7093871675370852\n (0, 6993)\t..."
3,answers,0.2,Most SHC victims are found near a heat source.,"(0, 7452)\t1\n (0, 10541)\t1\n (0, 14069)\...","(0, 16600)\t0.4528952035324624\n (0, 14546)..."
4,answers,-1.6,Tanay: I did have an opinion.,"(0, 4756)\t1\n (0, 10993)\t1\n (0, 15403)\t1","(0, 15403)\t0.7086807170172069\n (0, 10993)..."


We can view the format of each sentence vector (the same shape as the count vector, but holding floating-point weights instead of integer counts):

In [14]:
df_all['tf_idf_vector'].iloc[2]

<1x17324 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

And the locations in which the words appear in the feature vector:

In [15]:
df_all['tf_idf_vector'].iloc[2].nonzero()[1]

array([9466, 6993, 5077], dtype=int32)

And how this relates to each feature:

In [16]:
features = vectorizer2.get_feature_names_out()

print('Original sentence: \n', df_all['sentence'].iloc[2])
print('Sentence transformed back from vector: \n', features[df_all['tf_idf_vector'].iloc[2].nonzero()[1]])

Original sentence: 
 Good Luck and don't give up!
Sentence transformed back from vector: 
 ['luck' 'good' 'don']
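The weights themselves can be reproduced by hand. The sketch below assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts as tf), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1, and compares a manual calculation with `TfidfVectorizer` on a hypothetical mini-corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# hypothetical mini-corpus
docs = ["good luck", "good opinion", "heat source near heat"]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each term

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # tf * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalise each row

sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

Terms that appear in fewer documents receive a larger idf, which is why a rarer word ends up with a higher weight than a common one within the same sentence.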


### 3. Hashing Vectorisation

This text vectorizer implementation uses the hashing trick to map each token string directly to an integer feature index.

This strategy has several advantages:

- it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory.

- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.

- it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

- there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

- there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

- no IDF weighting as this would render the transformer stateful.
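The idea behind the hashing trick can be sketched in a few lines of plain Python. This is only an illustration: it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn uses a signed 32-bit MurmurHash3, and the helper name `hashed_counts` is hypothetical.

```python
import hashlib

def hashed_counts(tokens, n_features):
    """Count tokens into a fixed-size vector using a hash of each token."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).digest()
        # token -> column index, with no vocabulary lookup involved
        index = int.from_bytes(digest[:4], 'little') % n_features
        vec[index] += 1
    return vec

tokens = ['good', 'luck', 'don', 'good']
vec = hashed_counts(tokens, n_features=16)

print(vec)
```

With `n_features` this small, distinct tokens can easily land in the same column, which is the collision risk described above. Since nothing records which token produced which index, the mapping cannot be inverted.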



In [17]:
from sklearn.feature_extraction.text import HashingVectorizer


params3 = {'n_features': 2**14,#The number of features (columns) in the output matrices.
           #Small numbers of features are likely to cause hash collisions,
           #but large numbers will cause larger coefficient dimensions in linear learners.
           'lowercase' : True, #Convert all characters to lowercase before tokenizing
           'stop_words': 'english', #Use sklearn built in corpus for stop word removal
           'analyzer' : 'word', #tokenize to words
           'ngram_range': (1,1),
          }

vectorizer3 = HashingVectorizer(**params3) #initialise the vectorizer
vectorizer3.fit(df_all['sentence'].tolist()) #no-op: HashingVectorizer is stateless, so fit learns nothing from the corpus

df_all['hashing_vector'] = df_all.apply(lambda x: vectorizer3.transform([x['sentence']]), axis=1)
df_all.head()

Unnamed: 0,domain,avg_score,sentence,count_vector,tf_idf_vector,hashing_vector
0,answers,-1.4,Pimp (10) Successfully complete all the Snatch...,"(0, 27)\t1\n (0, 3540)\t1\n (0, 9180)\t1\n...","(0, 15103)\t0.3894924264295973\n (0, 14441)...","(0, 929)\t-0.3779644730092272\n (0, 3046)\t..."
1,answers,-1.8,it's a Holiday Inn for terroists.,"(0, 7640)\t1\n (0, 8217)\t1\n (0, 15577)\t1","(0, 15577)\t0.6225307646760007\n (0, 8217)\...","(0, 7725)\t-0.5773502691896258\n (0, 13975)..."
2,answers,-2.0,Good Luck and don't give up!,"(0, 5077)\t1\n (0, 6993)\t1\n (0, 9466)\t1","(0, 9466)\t0.7093871675370852\n (0, 6993)\t...","(0, 65)\t0.5773502691896258\n (0, 2230)\t0...."
3,answers,0.2,Most SHC victims are found near a heat source.,"(0, 7452)\t1\n (0, 10541)\t1\n (0, 14069)\...","(0, 16600)\t0.4528952035324624\n (0, 14546)...","(0, 1880)\t-0.4472135954999579\n (0, 1898)\..."
4,answers,-1.6,Tanay: I did have an opinion.,"(0, 4756)\t1\n (0, 10993)\t1\n (0, 15403)\t1","(0, 15403)\t0.7086807170172069\n (0, 10993)...","(0, 261)\t-0.5773502691896258\n (0, 363)\t-..."


We can view the format of each sentence vector:

In [18]:
df_all['hashing_vector'].iloc[2]

<1x16384 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>
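Two details of the values above are worth checking in isolation: the vectorizer needs no fitting, and each row is L2-normalised by default, with signs that can be negative because `alternate_sign=True` is the default. A small sketch using the third sample sentence:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**14, stop_words='english')

# no fit needed: the mapping depends only on the constructor parameters
X = vectorizer.transform(["Good Luck and don't give up!"])

row = X[0].toarray().ravel()
print(X.shape, X.nnz)
print(np.linalg.norm(row))      # L2 norm of the row
print(np.abs(row[row != 0]))    # magnitudes of the non-zero entries
```

With three surviving tokens and unit row length, each entry has magnitude 1/sqrt(3), which matches the 0.577 values in the table.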

By using the hashing vectorizer, we can control the dimensionality of the vectors through the 'n_features' parameter (here 2**14 = 16,384 columns, compared with 17,324 for the vocabulary-based vectorizers). Fewer columns mean fewer model coefficients to learn, at the cost of more hash collisions. However, because the inverse transform cannot be computed, when evaluating the model one will not be able to determine which words the model treats as good indicators of informal or formal sentences.