## Approaching text classification/regression

Problems that involve text are generally known as Natural Language Processing (NLP) problems. Like image problems, NLP problems are quite different from tabular ones: you need to build pipelines you have never built for tabular data. There are many types of NLP problems, and the most common is the classification of strings. For computers, everything is numbers, so text has to be turned into numbers before a model can use it.

Let’s start with a fundamental task: sentiment classification. We will try to classify the sentiment of movie reviews. You have a text, and there is a sentiment associated with it. How do you approach this kind of problem? You start with the basics. One review maps to one target variable. A review is a bunch of sentences, so while you may so far have seen only single sentences being classified, in this problem we will classify multiple sentences at once.


### How would we start with the problem?

A simple way is to create one list that contains all the positive words and another that contains all the negative words. If a sentence contains

* more positive words than negative words, the review is positive;
* more negative words than positive words, the review is negative;
* an equal number of positive and negative words, the review is neutral.

This is one of the oldest approaches, and the code is given below.

In [1]:
def find_sentiment(sentence, pos, neg):
    
    """ This function returns sentiment of sentence 
    :param sentence: sentence, a string 
    :param pos: set of positive words 
    :param neg: set of negative words 
    :return: returns positive, negative or neutral sentiment """ 
    
    #split sentence on whitespace
    #"this is a sentence!" becomes: 
    #["this", "is", "a", "sentence!"] 
    #note that split() without arguments splits on any whitespace 
    #to split on single spaces only, use .split(" ") 
    sentence = sentence.split() 
    
    #make sentence into a set 
    sentence = set(sentence) 
    #check number of common words with positive 
    num_common_pos = len(sentence.intersection(pos)) 
    
    #check number of common words with negative 
    num_common_neg = len(sentence.intersection(neg)) 
    
    #make conditions and return 
    #early returns remove the need for elif/else 
    
    if num_common_pos > num_common_neg:
        return "positive" 
    
    if num_common_pos < num_common_neg:
        return "negative" 
    
    return "neutral" 

However, this approach leaves a lot out of consideration. As you can see, our split() is also not perfect. If you use split(), a sentence like: 

“hi, how are you?” 

gets split into [“hi,”, “how”, “are”, “you?”] 


This is not ideal: the comma and the question mark stay attached to the words. This method is therefore not recommended unless some preprocessing handles these special characters before the split. Splitting a string into a list of words is known as tokenization. One of the most popular tokenizers comes from **NLTK (Natural Language Toolkit).** 

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/shubhangi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
from nltk.tokenize import word_tokenize

sentence = "hi, how are you?"


sentence.split()

['hi,', 'how', 'are', 'you?']

In [6]:
word_tokenize(sentence)

['hi', ',', 'how', 'are', 'you', '?']

As you can see, with NLTK’s word_tokenize the same sentence is split in a much better manner. Comparing against a list of words will also work much better now! This is what we will apply to our first model to detect sentiment.
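To see why tokenization matters for the word-list approach, here is a minimal sketch. It uses a small regular-expression tokenizer as a stand-in for word_tokenize (so it runs without downloading NLTK data) and lowercases the text. Both of these choices, the word lists, and the review are assumptions made up for illustration.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens.

    This is only a rough stand-in for nltk's word_tokenize.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


def find_sentiment(tokens, pos, neg):
    """Return 'positive', 'negative' or 'neutral' for a list of tokens."""
    tokens = set(tokens)
    num_common_pos = len(tokens & pos)
    num_common_neg = len(tokens & neg)
    if num_common_pos > num_common_neg:
        return "positive"
    if num_common_pos < num_common_neg:
        return "negative"
    return "neutral"


# hypothetical word lists, far too small for real use
pos_words = {"good", "great", "awesome"}
neg_words = {"bad", "boring", "awful"}

review = "Great cast, but the plot was boring. Boring and awful!"

# plain split(): punctuation and capital letters hide every match
with_split = find_sentiment(review.split(), pos_words, neg_words)

# tokenized and lowercased: "great", "boring" and "awful" are found
with_tokens = find_sentiment(simple_tokenize(review), pos_words, neg_words)

print(with_split)   # neutral
print(with_tokens)  # negative
```

With plain split() none of the words match, so the review looks neutral. After tokenization, two negative words outnumber one positive word.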

One of the basic models that you should always try with a classification problem in NLP is bag of words. In bag of words, we create a huge sparse matrix that stores counts of all the words in our corpus (corpus = all the documents = all the sentences). For this, we will use CountVectorizer from scikit-learn. Let’s see how it works. 

In [7]:
from sklearn.feature_extraction.text import CountVectorizer 
#create a corpus of sentences 
corpus = [ "hello, how are you?", 
          "im getting bored at home. And you? What do you think?",
          "did you know about counts", "let's see if this works!", "YES!!!!" ] 

#initialize CountVectorizer 
ctv = CountVectorizer() 
#fit the vectorizer on corpus 
ctv.fit(corpus) 

corpus_transformed = ctv.transform(corpus) 

In [9]:
print(corpus_transformed)

  (0, 2)	1
  (0, 9)	1
  (0, 11)	1
  (0, 22)	1
  (1, 1)	1
  (1, 3)	1
  (1, 4)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 13)	1
  (1, 17)	1
  (1, 19)	1
  (1, 22)	2
  (2, 0)	1
  (2, 5)	1
  (2, 6)	1
  (2, 14)	1
  (2, 22)	1
  (3, 12)	1
  (3, 15)	1
  (3, 16)	1
  (3, 18)	1
  (3, 20)	1
  (4, 21)	1


Our corpus is now a sparse matrix: the first sample has four non-zero elements, the second has ten, the third has five, and so on. Each of these elements also has a count associated with it. 

Some are seen twice, some are seen only once. For example, in sample 2 (row 1), we see that column 22 has a value of two. Why is that? And what is column 22? The way CountVectorizer works is it first tokenizes the sentence and then assigns a value to each token. So, each token is represented by a unique index. These unique indices are the columns that we see. The CountVectorizer stores this information. 

In [10]:
print(ctv.vocabulary_)

{'hello': 9, 'how': 11, 'are': 2, 'you': 22, 'im': 13, 'getting': 8, 'bored': 4, 'at': 3, 'home': 10, 'and': 1, 'what': 19, 'do': 7, 'think': 17, 'did': 6, 'know': 14, 'about': 0, 'counts': 5, 'let': 15, 'see': 16, 'if': 12, 'this': 18, 'works': 20, 'yes': 21}
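As a small sketch of how this mapping can be used (assuming the same corpus and the default CountVectorizer settings as above), we can look up a token's column in vocabulary_ and index the sparse matrix with it. The variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "hello, how are you?",
    "im getting bored at home. And you? What do you think?",
    "did you know about counts",
    "let's see if this works!",
    "YES!!!!",
]

ctv = CountVectorizer()
corpus_transformed = ctv.fit_transform(corpus)

# column index assigned to the token "you"
you_column = ctv.vocabulary_["you"]

# count of "you" in the second sentence (row 1)
you_count = corpus_transformed[1, you_column]
print(you_column, you_count)  # 22 2

# a dense view is fine for a toy corpus, but not for a real dataset
dense = corpus_transformed.toarray()
print(dense.shape)  # (5, 23)
```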


We see that index 22 belongs to “you”, and in the second sentence we have used “you” twice. Thus, the count is 2. I hope it is now clear what bag of words is. But we are missing some special characters, and sometimes those can be useful too. For example, “?” denotes a question in most sentences. Let’s integrate word_tokenize from NLTK into CountVectorizer and see what happens. 

In [11]:
from sklearn.feature_extraction.text import CountVectorizer 
from nltk.tokenize import word_tokenize 

#create a corpus of sentences 
corpus = [ "hello, how are you?", 
          "im getting bored at home. And you? What do you think?", 
          "did you know about counts", "let's see if this works!", 
          "YES!!!!" ] 

#initialize CountVectorizer with word_tokenize from nltk 
#as the tokenizer 

ctv = CountVectorizer(tokenizer=word_tokenize, token_pattern=None) 

#fit the vectorizer on corpus 
ctv.fit(corpus) 

corpus_transformed = ctv.transform(corpus) 

print(ctv.vocabulary_) 

{'hello': 14, ',': 2, 'how': 16, 'are': 7, 'you': 27, '?': 4, 'im': 18, 'getting': 13, 'bored': 9, 'at': 8, 'home': 15, '.': 3, 'and': 6, 'what': 24, 'do': 12, 'think': 22, 'did': 11, 'know': 19, 'about': 5, 'counts': 10, 'let': 20, "'s": 1, 'see': 21, 'if': 17, 'this': 23, 'works': 25, '!': 0, 'yes': 26}


Now we have more tokens in the vocabulary. Thus, we can create a sparse matrix from all the sentences in the IMDB dataset and build a model. The ratio of positive to negative samples in this dataset is 1:1, so we can use accuracy as the metric. 

We will use StratifiedKFold and create a single script to train five folds. Which model to use, you ask? Which is the fastest model for high-dimensional sparse data? Logistic regression. We will use logistic regression on this dataset to start with and to create our first actual benchmark.
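As a rough sketch of what such a script might look like (this is not the actual script, and it uses a tiny made-up corpus instead of IMDB), the steps are: split with StratifiedKFold, fit CountVectorizer on the training fold only, train logistic regression, and report accuracy. The default CountVectorizer tokenizer and max_iter=1000 are assumptions chosen to keep the example small and self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# made-up stand-in for the IMDB reviews: 1 = positive, 0 = negative
pos_reviews = [f"a great and enjoyable movie number {i}" for i in range(10)]
neg_reviews = [f"a boring and terrible movie number {i}" for i in range(10)]
reviews = pos_reviews + neg_reviews
targets = [1] * 10 + [0] * 10

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

accuracies = []
for fold, (train_idx, valid_idx) in enumerate(kf.split(reviews, targets)):
    train_texts = [reviews[i] for i in train_idx]
    valid_texts = [reviews[i] for i in valid_idx]
    y_train = [targets[i] for i in train_idx]
    y_valid = [targets[i] for i in valid_idx]

    # fit the vocabulary on the training fold only
    ctv = CountVectorizer()
    ctv.fit(train_texts)
    x_train = ctv.transform(train_texts)
    x_valid = ctv.transform(valid_texts)

    # max_iter is raised here as an assumption, not taken from the text
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)

    preds = model.predict(x_valid)
    accuracy = accuracy_score(y_valid, preds)
    accuracies.append(accuracy)
    print(f"Fold: {fold}, Accuracy = {accuracy}")
```

Fitting the vectorizer inside the loop keeps the validation fold out of the vocabulary, which makes the reported accuracy a fairer estimate.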

The code is in the script **rg.py**; running it gives the following output:

<img src="../text_prob1.png">


* We notice the following things when we run the code:
    * The accuracy is quite good for a first trial.
    * We also receive a warning message, because the number of features (i.e. the vocabulary size) is much larger than the number of training examples.
    
    
Now we will use another algorithm called MultinomialNB from scikit-learn.
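Swapping the model is a small change. As a hedged sketch (again on a made-up toy corpus, and not the actual script), the same evaluation can be written more compactly with a scikit-learn pipeline and cross_val_score. Using a pipeline here is our own choice for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up stand-in for the IMDB reviews: 1 = positive, 0 = negative
pos_reviews = [f"a great and enjoyable movie number {i}" for i in range(10)]
neg_reviews = [f"a boring and terrible movie number {i}" for i in range(10)]
reviews = pos_reviews + neg_reviews
targets = [1] * 10 + [0] * 10

# the pipeline refits CountVectorizer on each training fold
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, reviews, targets, cv=kf, scoring="accuracy")

for fold, score in enumerate(scores):
    print(f"Fold: {fold}, Accuracy = {score}")
```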


The code is in the script **mb.py**; running it gives the following output:


<img src="../text_prob2.png">

We see that this score is low, but the naïve Bayes model is superfast. 

Another method in NLP that many people these days tend to ignore, or do not care to learn about, is **TF-IDF**. 
TF is term frequency, and IDF is inverse document frequency. The terms may sound difficult, but things become apparent with the formulae for TF and IDF. 

<img src="../ss1.jpeg">
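As a minimal sketch, assuming the figure shows the common textbook definitions (TF is the count of a term in a document divided by the number of terms in that document, and IDF is the log of the number of documents divided by the number of documents containing the term), the formulae can be written in plain Python. The function names and the tiny pre-tokenized corpus are made up for illustration.

```python
import math
from collections import Counter

# toy corpus, already tokenized
docs = [
    ["hello", "how", "are", "you"],
    ["did", "you", "know", "about", "counts"],
    ["yes"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "you" appears in two documents, "hello" in only one,
# so "hello" gets the higher weight in the first document
score_you = tf_idf("you", docs[0], docs)
score_hello = tf_idf("hello", docs[0], docs)
print(score_you, score_hello)
```

Terms that occur in many documents get a low IDF, so their weight drops even when they are frequent within a single document.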



Similar to CountVectorizer in scikit-learn, we have TfidfVectorizer. Let’s try using it the same way we used CountVectorizer. 


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from nltk.tokenize import word_tokenize 

#create a corpus of sentences 
corpus = ["hello, how are you?", 
          "im getting bored at home. And you? What do you think?",
          "did you know about counts", 
          "let's see if this works!", 
          "YES!!!!" ] 

#initialize TfidfVectorizer with word_tokenize from nltk 
#as the tokenizer 
tfv = TfidfVectorizer(tokenizer=word_tokenize, token_pattern=None) 
#fit the vectorizer on corpus 
tfv.fit(corpus) 
corpus_transformed = tfv.transform(corpus) 
print(corpus_transformed)

  (0, 27)	0.2965698850220162
  (0, 16)	0.4428321995085722
  (0, 14)	0.4428321995085722
  (0, 7)	0.4428321995085722
  (0, 4)	0.35727423026525224
  (0, 2)	0.4428321995085722
  (1, 27)	0.35299699146792735
  (1, 24)	0.2635440111190765
  (1, 22)	0.2635440111190765
  (1, 18)	0.2635440111190765
  (1, 15)	0.2635440111190765
  (1, 13)	0.2635440111190765
  (1, 12)	0.2635440111190765
  (1, 9)	0.2635440111190765
  (1, 8)	0.2635440111190765
  (1, 6)	0.2635440111190765
  (1, 4)	0.42525129752567803
  (1, 3)	0.2635440111190765
  (2, 27)	0.31752680284846835
  (2, 19)	0.4741246485558491
  (2, 11)	0.4741246485558491
  (2, 10)	0.4741246485558491
  (2, 5)	0.4741246485558491
  (3, 25)	0.38775666010579296
  (3, 23)	0.38775666010579296
  (3, 21)	0.38775666010579296
  (3, 20)	0.38775666010579296
  (3, 17)	0.38775666010579296
  (3, 1)	0.38775666010579296
  (3, 0)	0.3128396318588854
  (4, 26)	0.2959842226518677
  (4, 0)	0.9551928286692534
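These values do not follow the plain formula exactly. With its default settings, scikit-learn's TfidfVectorizer uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit L2 norm. As a sketch, the last row (“YES!!!!”) can be reproduced by hand, assuming word_tokenize splits it into one “yes” and four “!” tokens and that “!” occurs in two of the five documents.

```python
import math

n_docs = 5

# document frequencies in the corpus and raw counts in the last sentence
doc_freq = {"yes": 1, "!": 2}
term_count = {"yes": 1, "!": 4}

# smoothed IDF, as used by TfidfVectorizer with default settings
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in doc_freq}

# raw tf-idf, then L2 normalization of the row
raw = {t: term_count[t] * idf[t] for t in term_count}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {t: raw[t] / norm for t in raw}

print(weights)
```

The two weights come out at roughly 0.296 for “yes” and 0.955 for “!”, matching columns 26 and 0 of row 4 in the output above.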


Now we will replace **CountVectorizer** with **TfidfVectorizer**.

The code is in the script **rg1.py**; running it gives the following output:

<img src="../text_prob2.png">


We see that these scores are a bit higher than with CountVectorizer, so this becomes the new benchmark that we want to beat. 

Another interesting concept in NLP is n-grams. **N-grams** are combinations of words in order. N-grams are easy to create. You just need to take care of the order. To make things even more comfortable, we can use the n-gram implementation from NLTK.

In [6]:
from nltk import ngrams 
from nltk.tokenize import word_tokenize 
#let's see 3 grams 
N = 3 

#input sentence 
sentence = "hi, how are you?" 

#tokenized sentence 
tokenized_sentence = word_tokenize(sentence) 

#generate n_grams 
n_grams = list(ngrams(tokenized_sentence, N)) 
print(n_grams)

[('hi', ',', 'how'), (',', 'how', 'are'), ('how', 'are', 'you'), ('are', 'you', '?')]
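Because only the order matters, the same result can be obtained without NLTK. A minimal sketch using zip (the helper name make_ngrams is made up):

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip over n shifted copies of the token list
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["hi", ",", "how", "are", "you", "?"]

trigrams = make_ngrams(tokens, 3)
print(trigrams)
# [('hi', ',', 'how'), (',', 'how', 'are'), ('how', 'are', 'you'), ('are', 'you', '?')]

bigrams = make_ngrams(tokens, 2)
print(len(bigrams))  # 5
```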


Similarly, we can also create 2-grams, 4-grams, and so on. These n-grams become a part of our vocabulary, and when we calculate counts or TF-IDF, we treat each n-gram as one entirely new token. So, in a way, we are incorporating context to some extent. Both the CountVectorizer and TfidfVectorizer implementations in scikit-learn offer n-grams through the ngram_range parameter, which takes a minimum and a maximum n. By default, this is (1, 1). When we change it to (1, 3), we are looking at unigrams, bigrams and trigrams. The code change is minimal. Since we have had the best result so far with TF-IDF, let’s see if including n-grams up to trigrams improves the model. The only change required is in the initialization of TfidfVectorizer. 


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

#use unigrams, bigrams and trigrams as tokens
tdidf_vsc = TfidfVectorizer(
    tokenizer=word_tokenize,
    token_pattern=None,
    ngram_range=(1, 3)
)
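To see what ngram_range adds to the vocabulary, here is a small sketch on a single made-up sentence. It uses CountVectorizer with its default tokenizer, which is an assumption made to keep the example independent of NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["hi how are you"]

# unigrams only (the default) versus unigrams, bigrams and trigrams
unigram_vec = CountVectorizer(ngram_range=(1, 1)).fit(sentence)
trigram_vec = CountVectorizer(ngram_range=(1, 3)).fit(sentence)

unigram_vocab = sorted(unigram_vec.vocabulary_)
trigram_vocab = sorted(trigram_vec.vocabulary_)

print(unigram_vocab)
# ['are', 'hi', 'how', 'you']
print(trigram_vocab)
# ['are', 'are you', 'hi', 'hi how', 'hi how are', 'how', 'how are', 'how are you', 'you']
```

Four words already produce nine features, so on a real corpus the vocabulary grows quickly when n-grams are included.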


We will therefore modify our previous code, **rg1.py**. The new code is in **tvf_rg_trigram**, and running it gives the following output:

<img src="../text_prob4.png">

This looks okay, but we do not see any improvement. You can try to improve it on your own. There is a lot more to the basics of NLP. One term that you must be aware of is **stemming.** Another is **lemmatization.** Both reduce a word to its smallest form. 

* In the case of stemming, the processed word is called the stemmed word, and in the case of lemmatization, it is known as the lemma. 
* Lemmatization is the more thorough of the two, because it has to return a valid word, while stemming is more popular and widely used. 
* Both stemming and lemmatization come from linguistics, and you need in-depth knowledge of a given language if you plan to make a stemmer or lemmatizer for it.
* Both stemming and lemmatization can be done easily with the NLTK package.

The code below uses the most common choices: the Snowball stemmer and the WordNet lemmatizer. 

In [13]:
from nltk.stem import WordNetLemmatizer 
from nltk.stem.snowball import SnowballStemmer 

#initialize lemmatizer 
lemmatizer = WordNetLemmatizer()

#initialize stemmer 
stemmer = SnowballStemmer("english") 
words = ["fishing", "fishes", "fished"] 
for word in words:
    print(f"word={word}") 
    print(f"stemmed_word={stemmer.stem(word)}") 
    print(f"lemma={lemmatizer.lemmatize(word)}") 
    print("") 

word=fishing
stemmed_word=fish
lemma=fishing

word=fishes
stemmed_word=fish
lemma=fish

word=fished
stemmed_word=fish
lemma=fished



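* Stemming gives us the smallest form of a word, which may or may not be a word in the dictionary of the language the word belongs to. 
* In the case of lemmatization, however, the result will be a word. In the output above, “fishing” and “fished” are left unchanged because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed.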
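As a small illustration of the first point, the Snowball stemmer can return a stem that is not a dictionary word. The word “studies” is our own example and is not taken from the output above.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# "fishing" reduces to a real word, "studies" does not
stems = {word: stemmer.stem(word) for word in ["fishing", "studies"]}
print(stems)
# {'fishing': 'fish', 'studies': 'studi'}
```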

One more topic that you should be aware of is topic extraction. Topic extraction can be done using non-negative matrix factorization (NMF) or latent semantic analysis (LSA), which is based on singular value decomposition (SVD). These are decomposition techniques that reduce the data to a given number of components. You can fit any of these on the sparse matrix obtained from CountVectorizer or TfidfVectorizer. Let’s apply it to the output of the TfidfVectorizer that we have used before.

The code is in **svd.py** 

<img src="../text_prob5.png">
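As a hedged sketch of the idea (not the contents of svd.py), we can fit TruncatedSVD from scikit-learn on a TF-IDF matrix and list the highest-weighted terms of each component. The toy corpus, the default tokenizer, the number of components, and the choice of five terms per component are all assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up corpus with two loose themes: football and finance
corpus = [
    "the team won the football match",
    "the striker scored a late goal in the match",
    "the football season starts with a home match",
    "the bank raised interest rates again",
    "investors worry about interest rates and inflation",
    "the central bank fights inflation",
]

tfv = TfidfVectorizer()
corpus_transformed = tfv.fit_transform(corpus)

# reduce the sparse TF-IDF matrix to two components
svd = TruncatedSVD(n_components=2, random_state=42)
corpus_svd = svd.fit_transform(corpus_transformed)

# map column indices back to terms
index_to_term = {index: term for term, index in tfv.vocabulary_.items()}

top_terms = []
for component in svd.components_:
    # indices of the five largest weights in this component
    top_indices = component.argsort()[::-1][:5]
    top_terms.append([index_to_term[int(i)] for i in top_indices])

for number, terms in enumerate(top_terms):
    print(f"Component {number}: {terms}")
```

On a corpus this small the components are only suggestive. On a real dataset, the top terms of each component give a rough idea of the topics present.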
