Author: Abel Stanley

NIM: 13517068

# Week 03 Handson - Data Preprocessing #02

To use a learning model, we need to input numerical data to the model. However, we often
get non-numerical data as input, e.g., text data. Thus, to use text as input to the learning
model, we need to do pre-processing and convert it to numerical data.


The steps below are typical pre-processing steps for text data:

1. Tokenization

2. Normalization

3. Cleaning

4. Lemmatization/stemming

## Tokenization

In [39]:
import nltk
from nltk.tokenize import word_tokenize

# we have two raw texts here that we want to pre-process
text1 = "After watching two hours non stop, \
he says that the film is really fantastic #brilliant."
text2 = "Foods sold there are little bit pricy, \
meanwhile the taste is not delicious #notrecommended. "

tokens1 = word_tokenize(text1)
print("tokens1:\n", tokens1)
tokens2 = word_tokenize(text2)
print("\n\ntokens2:\n", tokens2)

tokens1:
 ['After', 'watching', 'two', 'hours', 'non', 'stop', ',', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '#', 'brilliant', '.']


tokens2:
 ['Foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', ',', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '#', 'notrecommended', '.']


## Normalization

In this block of code, we try one normalization step: converting the tokens to lowercase.

In [40]:
# convert to lowercase
normalized_words1 = [w.lower() for w in tokens1]
print("normalized words1:\n", normalized_words1)
normalized_words2 = [w.lower() for w in tokens2]
print("\n\nnormalized words2:\n", normalized_words2)

normalized words1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', ',', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '#', 'brilliant', '.']


normalized words2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', ',', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '#', 'notrecommended', '.']


### Cleaning 01: remove punctuation

In [41]:
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
punc_removed1 = [w.translate(table) for w in normalized_words1]
print("punc_removed1:\n", punc_removed1)
punc_removed2 = [w.translate(table) for w in normalized_words2]
print("\n\npunc_removed2:\n", punc_removed2)

punc_removed1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', '', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', '', 'brilliant', '']


punc_removed2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', '', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', '', 'notrecommended', '']


### Cleaning 02: remove non-alphabetic tokens

In [42]:
# remove remaining tokens that are not alphabetic
isalpha_words1 = [word for word in punc_removed1 if word.isalpha() ]
print("isalpha_words1:\n", isalpha_words1)
isalpha_words2 = [word for word in punc_removed2 if word.isalpha() ]
print("\n\nisalpha_words2:\n", isalpha_words2)

isalpha_words1:
 ['after', 'watching', 'two', 'hours', 'non', 'stop', 'he', 'says', 'that', 'the', 'film', 'is', 'really', 'fantastic', 'brilliant']


isalpha_words2:
 ['foods', 'sold', 'there', 'are', 'little', 'bit', 'pricy', 'meanwhile', 'the', 'taste', 'is', 'not', 'delicious', 'notrecommended']


### Cleaning 03: remove stop words

In [43]:
# filter out stop words
from nltk.corpus import stopwords
# nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
# print("stop words:\n", stop_words, "\n")  # uncomment this if you want
# to see stop word examples
stopWords_removed1 = [w for w in isalpha_words1 if w not in stop_words]
print("stopWords removed1:\n", stopWords_removed1)
stopWords_removed2 = [w for w in isalpha_words2 if w not in stop_words]
print("\n\nstopWords removed2:\n", stopWords_removed2)

stopWords removed1:
 ['watching', 'two', 'hours', 'non', 'stop', 'says', 'film', 'really', 'fantastic', 'brilliant']


stopWords removed2:
 ['foods', 'sold', 'little', 'bit', 'pricy', 'meanwhile', 'taste', 'delicious', 'notrecommended']


## Stemming

In [44]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_word1 = [ps.stem(w) for w in stopWords_removed1 ]
print("stemmed_word1:\n", stemmed_word1)
stemmed_word2 = [ps.stem(w) for w in stopWords_removed2 ]
print("\n\nstemmed_word2:\n", stemmed_word2)

stemmed_word1:
 ['watch', 'two', 'hour', 'non', 'stop', 'say', 'film', 'realli', 'fantast', 'brilliant']


stemmed_word2:
 ['food', 'sold', 'littl', 'bit', 'prici', 'meanwhil', 'tast', 'delici', 'notrecommend']


## Lemmatization

In [45]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words1 = [lemmatizer.lemmatize(w) for w in stopWords_removed1]
print("lemmatized_words1:\n", lemmatized_words1)
lemmatized_words2 = [lemmatizer.lemmatize(w) for w in stopWords_removed2]
print("\n\nlemmatized_words2:\n", lemmatized_words2)

lemmatized_words1:
 ['watching', 'two', 'hour', 'non', 'stop', 'say', 'film', 'really', 'fantastic', 'brilliant']


lemmatized_words2:
 ['food', 'sold', 'little', 'bit', 'pricy', 'meanwhile', 'taste', 'delicious', 'notrecommended']


## Example of Converting Preprocessed Text into Numerical Features

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

# merge the two texts into one list (you may also try the stemmed words)
two_preprocessed_txt = [lemmatized_words1, lemmatized_words2]

# the texts are already tokenized, so pass the token lists through unchanged
def dummy(doc):
    return doc

# define the tfidf vectorizer
tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy,
    preprocessor=dummy,
    token_pattern=None)

# train / learn from the given data
model = tfidf.fit(two_preprocessed_txt)
# transform to numerical features using the trained model
numerical_features = model.transform(two_preprocessed_txt).toarray()
# ==> these numerical features can then be used by a model,
#     e.g., for classification into sentiment classes: positive and negative

print("numerical features of text1:\n", numerical_features[0],
"; shape:", numerical_features[0].shape)
print("\n\nnumerical features of text2:\n", numerical_features[1],
"; shape:", numerical_features[1].shape)

numerical features of text1:
 [0.         0.31622777 0.         0.31622777 0.31622777 0.
 0.31622777 0.         0.         0.31622777 0.         0.
 0.31622777 0.31622777 0.         0.31622777 0.         0.31622777
 0.31622777] ; shape: (19,)


numerical features of text2:
 [0.33333333 0.         0.33333333 0.         0.         0.33333333
 0.         0.33333333 0.33333333 0.         0.33333333 0.33333333
 0.         0.         0.33333333 0.         0.33333333 0.
 0.        ] ; shape: (19,)
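
The values above can be checked by hand. The following is a minimal sketch in plain Python, assuming scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which the inverse document frequency is $\ln\frac{1+N}{1+DF(t)} + 1$ and each document vector is scaled to unit length. The helper name `tfidf_l2` is made up for this illustration and is not part of any library.

```python
import math
from collections import Counter

# the lemmatized token lists printed earlier in this notebook
docs = [
    ['watching', 'two', 'hour', 'non', 'stop', 'say', 'film',
     'really', 'fantastic', 'brilliant'],
    ['food', 'sold', 'little', 'bit', 'pricy', 'meanwhile', 'taste',
     'delicious', 'notrecommended'],
]

def tfidf_l2(docs):
    """Hypothetical helper: smoothed TF-IDF followed by L2 normalization."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in docs for t in set(doc))
    # assumption: smoothed IDF, as in scikit-learn's defaults
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_l2(docs)
print(len(vocab))                                 # 19
print(round(rows[0][vocab.index('film')], 8))     # 0.31622777
print(round(rows[1][vocab.index('food')], 8))     # 0.33333333
```

Because no term occurs in both texts, every term has the same IDF, so after normalization each nonzero entry is simply $1/\sqrt{n}$, where $n$ is the number of distinct terms in the text: $1/\sqrt{10} \approx 0.316$ for text1 and $1/\sqrt{9} \approx 0.333$ for text2.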


#  Question 01 (Q01)

a. What are tokenization, normalization and cleaning?

b. What is/are the difference(s) between stemming and lemmatization?

### Answer:

#### a. What are tokenization, normalization and cleaning?
##### Tokenization
Tokenization is the process of splitting text into smaller units called tokens, for example a document into paragraphs, or sentences into words. It is done before the text is transformed into vectors, and it also makes it easier to filter out unnecessary tokens.


##### Normalization
Normalization maps terms to a standard form, either through simple rules or through linguistic reductions such as stemming and lemmatization.

Words that look different because of casing or spelling but have the same meaning need to be processed consistently. Normalization ensures that these words are treated equally. For example, numbers can be changed to their word equivalents, or all of the text can be converted to one case.

##### Cleaning
Cleaning removes the less useful parts of the text through stop-word removal, handling of capitalization and special characters, and other details.
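
The three steps can be sketched with the standard library alone. This is a rough illustration rather than a replacement for `nltk`: the regular expression is a simplified tokenizer, and `STOP_WORDS` is a tiny hand-picked set assumed only for this example.

```python
import re

# assumption: a tiny illustrative stop-word list, not nltk's full list
STOP_WORDS = {"the", "is", "that", "he"}

def tokenize(text):
    # tokenization: split into runs of word characters or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    # normalization: map every token to lowercase
    return [t.lower() for t in tokens]

def clean(tokens):
    # cleaning: drop non-alphabetic tokens and stop words
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

text = "He says that the film is really fantastic #brilliant."
tokens = tokenize(text)
normalized = normalize(tokens)
cleaned = clean(normalized)

print(tokens)
print(normalized)
print(cleaned)   # ['says', 'film', 'really', 'fantastic', 'brilliant']
```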

#### b. What is/are the difference(s) between stemming and lemmatization?

##### 1. The main difference is how they work and the result each of them returns:

Stemming algorithms work by cutting off the end or the beginning of a word, using a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting succeeds in some cases but not always, which is the main limitation of the approach. For example, the Porter stemmer used earlier turns 'really' into 'realli' and 'fantastic' into 'fantast', neither of which is a word.

Lemmatization, on the other hand, relies on a morphological analysis of the word. It needs detailed dictionaries that the algorithm can look through to link an inflected form back to its lemma. For example, the WordNet lemmatizer used earlier turns 'hours' into 'hour' and leaves 'really' and 'fantastic' unchanged.
    
##### 2. A lemma is the base form of all its inflectional forms, whereas a stem is not. This is why regular dictionaries are lists of lemmas, not stems. This has two consequences:

First, the same stem can be shared by the inflectional forms of different lemmas, which adds noise to search results. In fact, it is very common for a single surface form to be an instance of several lemmas.

Second, the same lemma can correspond to forms with different stems, which still need to be treated as the same word. For example, a typical Greek verb has different stems for its perfective and imperfective forms. A stemming algorithm cannot relate them to the same verb, but lemmatization can.
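
The contrast can be made concrete with a toy sketch. Both functions below are deliberately simplistic and hypothetical: `toy_stem` strips a few suffixes by rule (it is not the Porter algorithm), and `toy_lemmatize` looks words up in a small hand-written dictionary that stands in for a real lexical resource such as WordNet.

```python
# assumption: a made-up suffix list, checked in this order
SUFFIXES = ("ing", "ies", "es", "s", "ed")

def toy_stem(word):
    # rule-based: cut the first matching suffix, keeping at least 3 characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# assumption: a tiny hand-written dictionary mapping forms to lemmas
LEMMA_DICT = {
    "studies": "study",
    "studying": "study",
    "was": "be",
    "better": "good",
}

def toy_lemmatize(word):
    # dictionary-based: look the form up, fall back to the word itself
    return LEMMA_DICT.get(word, word)

for w in ["studies", "studying", "was", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# studying -> study | study
# was -> was | be
# better -> better | good
```

The stemmer produces two different stems ('stud' and 'study') for forms of the same lemma and cannot handle irregular forms such as 'was', while the dictionary lookup maps all of them to a valid base form.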

#  Question 02 (Q02)

Please explain what TF-IDF is!

Note: 

(i) you can insert picture (if you want) in the answer, and then upload all the
materials (this ipynb file and the pictures) into one zip file to the course portal, 

(ii) you can also use mathematical equations here, for example: you can write log2(Pi) by using
$log_{2}(P_{i})$.

### Answer:

#### Definition

tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

#### What a “relevant word” means

We can come up with a more or less subjective definition driven by our intuition: a word’s relevance is proportional to the amount of information that it gives about its context (a sentence, a document or a full dataset). That is, the most relevant words are those that would help us, as humans, to better understand a whole document without reading it all.

As pointed out, relevant words are not necessarily the most frequent words since stopwords like “the”, “of” or “a” tend to occur very often in many documents.

There is another caveat: if we want to summarize a document relative to a whole dataset about a specific topic (say, movie reviews), there will be words other than stopwords (like “character” or “plot”) that occur many times in the document as well as in many other documents. These words are not useful for summarizing a document because they have little discriminating power; they say very little about what the document contains compared to the other documents.


#### The Algorithm

For a term $t$ in a document $d$, the weight $W(t,d)$ of term $t$ in document $d$ is given by:

$W(t,d) = TF(t,d) \times \log\left(\frac{N}{DF(t)}\right)$

Where:

$TF(t,d)$ is the number of occurrences of t in document d.

$DF(t)$ is the number of documents containing the term t.

$N$ is the total number of documents in the corpus.
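
As a small worked example of this formula, the sketch below computes the weights for one document of a made-up three-document corpus. The corpus and the function names are illustrative assumptions, and the natural logarithm is assumed because the formula does not state a base.

```python
import math

# assumption: a tiny made-up corpus of already preprocessed documents
corpus = [
    ["film", "plot", "fantastic", "film"],
    ["film", "plot", "boring"],
    ["film", "food", "delicious"],
]

def tf(term, doc):
    # TF(t, d): number of occurrences of the term in the document
    return doc.count(term)

def df(term, corpus):
    # DF(t): number of documents that contain the term
    return sum(1 for doc in corpus if term in doc)

def weight(term, doc, corpus):
    # W(t, d) = TF(t, d) * log(N / DF(t)), natural log assumed
    return tf(term, doc) * math.log(len(corpus) / df(term, corpus))

doc = corpus[0]
for term in sorted(set(doc)):
    print(term, round(weight(term, doc, corpus), 3))
# fantastic 1.099
# film 0.0
# plot 0.405
```

Although 'film' is the most frequent word in the first document, it occurs in every document, so $\log(N/DF) = \log(1) = 0$ and its weight vanishes. The rarer word 'fantastic' gets the highest weight.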

#  Question 03 (Q03)

What are other methods that can be used to convert "preprocessed text" to "numerical
features" other than TF-IDF? From what you mention, what are methods that keep the
semantic?

### Answer:

#### One possible alternative: Word Embedding

Word Embedding converts a word to an n-dimensional vector. Words which are related, such as ‘house’ and ‘home’, map to similar n-dimensional vectors, while dissimilar words such as ‘house’ and ‘airplane’ have dissimilar vectors. In this way the ‘meaning’ of a word can be reflected in its embedding, and a model is then able to use this information to learn the relationships between words. The benefit of this method is that a model trained on the word ‘house’ will be able to react to the word ‘home’ even if it has never seen that word in training.
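
Similarity between embeddings is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# assumption: hand-picked toy vectors, not taken from any trained model
embeddings = {
    "house":    [0.90, 0.80, 0.10],
    "home":     [0.85, 0.75, 0.20],
    "airplane": [0.10, 0.20, 0.95],
}

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_related = cosine(embeddings["house"], embeddings["home"])
sim_unrelated = cosine(embeddings["house"], embeddings["airplane"])
print("house vs home:    ", round(sim_related, 3))
print("house vs airplane:", round(sim_unrelated, 3))
```

With these toy values, 'house' and 'home' point in almost the same direction, while 'house' and 'airplane' do not.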

#### The difference between Word Embedding and TF-IDF

Word Embedding:

    1. Multidimensional vector that attempts to capture a word's relationship to other words
    2. Often trained on a large external corpus
    3. Must be applied to each word individually
    4. More memory intensive
    5. Ideal for problems involving a single word, such as word translation
    
TF-IDF:

    1. Uses a sparse matrix where each word maps to just a single value; captures no meaning
    2. Trained without external data
    3. Can be applied to each training document at once
    4. Less memory intensive
    5. Ideal for problems with many words and larger document files
    
#### Method that keeps the semantics in Word Embedding

With the Word Embedding method, we can keep the semantics of each word by modeling its relationship to other words. To train a model on whole texts, we must be able to pass multiple words into the model simultaneously. One solution is to concatenate the word vectors and pass the combined vector.

For example:

We concatenate 20 words, each represented by a 300-dimensional embedding, which yields a 6,000-dimensional vector. What if a text is not exactly 20 words long? Texts with fewer than 20 words are padded at the end of the vector with zeroes, and the model will learn not to assign any meaning to these values. Texts with more than 20 words are truncated: only the first 20 words are kept and the rest are dropped.

The result of this algorithm is a machine-readable table of size (number of documents) x (length of the feature vector).
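
The concatenate, pad and truncate scheme can be sketched as follows. To keep the output readable, the sketch assumes 3-dimensional embeddings and a limit of 4 words instead of 300 and 20. The embedding values are made up, and mapping unknown words to a zero vector is an additional assumption of this example.

```python
EMBED_DIM = 3   # the text above uses 300
MAX_WORDS = 4   # the text above uses 20

# assumption: made-up embeddings standing in for a trained model
embeddings = {
    "film":      [0.2, 0.7, 0.1],
    "really":    [0.5, 0.1, 0.3],
    "fantastic": [0.9, 0.6, 0.4],
    "food":      [0.1, 0.3, 0.8],
    "delicious": [0.8, 0.5, 0.7],
}

def text_to_vector(words):
    vector = []
    # truncate: keep only the first MAX_WORDS words
    for word in words[:MAX_WORDS]:
        # assumption: words without an embedding become a zero vector
        vector.extend(embeddings.get(word, [0.0] * EMBED_DIM))
    # pad: fill the end with zeroes up to the fixed length
    vector.extend([0.0] * (MAX_WORDS * EMBED_DIM - len(vector)))
    return vector

docs = [
    ["film", "really", "fantastic"],                           # shorter: padded
    ["food", "delicious", "film", "really", "fantastic"],      # longer: truncated
]
features = [text_to_vector(doc) for doc in docs]

print(len(features), "x", len(features[0]))   # 2 x 12
print(features[0])
```

Every row has the same length, `MAX_WORDS * EMBED_DIM`, regardless of how many words the original text contained.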