# Feature Extraction from Dicts
Representing a large dataset as a dense matrix can require a huge amount of memory and compute time. For example, a dense matrix for a dataset with 1 million samples and 1 million features has $10^{12}$ entries, which takes about 1 TB even at one byte per entry and about 8 TB with 64-bit floats. Datasets of this size are not uncommon in practice. However, most of them have sparse features and can therefore be represented much more efficiently. For example, consider a dataset with the following feature matrix.

|    | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 |
|----|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| s1 | 0  | 0.5| 0  | 0  | 0  | 2.3| 0  | 0  |
| s2 | 1.9| 0  | 0  | 0  | 0  | 0  | 0  | 2.1|
| s3 | 0  | 0  | 0.9| 1.3| 0  | 0  | 0  | 1.1|
| s4 | 0  | 0  | 0  | 0  | 0  | 0.1| 3.1| 0  |
| s5 | 1.9| 0  | 0  | 0  | 0  | 2.3| 0  | 0  |

In the above dataset, most of the entries in the feature matrix are zero. Thus, it is more efficient to store the features as a sparse matrix and perform the machine learning operations on the sparse matrix. The SciPy package [scipy.sparse](https://docs.scipy.org/doc/scipy/reference/sparse.html) defines several matrix and array classes for different sparse matrix formats. You can represent your data as a sparse matrix and pass it to many of Scikit-learn's machine learning algorithms.
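
As a rough illustration of the savings, the sketch below builds a synthetic 1000 x 1000 matrix in which about 1% of the entries are non-zero and compares the memory used by the dense array with that of its Compressed Sparse Row (CSR) form. The matrix size, the fill level, and the random seed are arbitrary choices made for this example.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 1000

# Dense array that is almost entirely zero: at most 10,000 of the
# 1,000,000 entries receive a random value (duplicate positions overlap).
dense = np.zeros((n_samples, n_features))
rows = rng.integers(0, n_samples, size=10_000)
cols = rng.integers(0, n_features, size=10_000)
dense[rows, cols] = rng.random(10_000)

# CSR keeps only the non-zero values plus two index arrays.
X = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print("dense :", dense_bytes, "bytes")
print("sparse:", sparse_bytes, "bytes")
```

The CSR version stores the same information; `X.toarray()` recovers the dense array exactly.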

Sometimes it is more convenient to represent a dataset with sparse features using Python dictionaries. You can represent the above dataset as a list of Python dictionaries as follows:

| sample | Features |
:-------:|:-------- |
s1       | {'f2': 0.5, 'f6': 2.3} |
s2       | {'f1': 1.9, 'f8': 2.1} |
s3       | {'f3': 0.9, 'f4': 1.3, 'f8': 1.1} |
s4       | {'f6': 0.1, 'f7': 3.1} |
s5       | {'f1': 1.9, 'f6': 2.3} |

Scikit-learn provides the class <em>sklearn.feature_extraction.DictVectorizer</em> to convert a list of Python dictionaries into a NumPy array or a scipy.sparse matrix. The full documentation of <em>sklearn.feature_extraction.DictVectorizer</em> can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html).

## Working with Scikit-learn DictVectorizer
The original code can be found [here](https://scikit-learn.org/stable/modules/feature_extraction.html).

### Creating a small dataset

In [2]:
measurements = [
     {'city': 'Dubai', 'temperature': 33.},
     {'city': 'London', 'temperature': 12.},
     {'city': 'San Francisco', 'temperature': 18.},
 ]

### Importing the DictVectorizer

In [3]:
from sklearn.feature_extraction import DictVectorizer

### Transforming the data

In [9]:
vec = DictVectorizer(sparse=True)

vec.fit(measurements)
X = vec.transform(measurements)

# Alternatively, you can call fit_transform() to run fit() and transform() in one step
# X = vec.fit_transform(measurements)

In [10]:
type(X)

scipy.sparse._csr.csr_matrix

In [11]:
X

<3x4 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [12]:
X.toarray()

array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

### Viewing the feature names

In [13]:
vec.get_feature_names_out()

array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'],
      dtype=object)

## Another Example

### The dataset

In [14]:
movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
               {'category': ['animation', 'family'], 'year': 2011},
               {'year': 1974}]

### Transforming dataset

In [15]:
X = vec.fit_transform(movie_entry)
print(vec.get_feature_names_out())

['category=animation' 'category=drama' 'category=family'
 'category=thriller' 'year']


In [16]:
X.toarray()

array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])

### What happens to unseen features?

In [17]:
vec.transform({'category': ['thriller'],
               'unseen_feature': '3'}).toarray()

array([[0., 0., 0., 1., 0.]])

# Feature Extraction from Raw Text
In many real-life machine learning problems, raw text is the input to the system. For example, in a news classification task that assigns an article to a category such as politics, sports, or business, the input is the article itself. However, machine learning algorithms cannot operate on raw text directly. To work with textual data, several preprocessing steps convert it into a standard format such as vectors of real numbers. Common preprocessing pipelines involve steps like
* Tokenization
* Lower casing
* Stemming
* Lemmatization
* Stop word removal
* Vectorization

In this tutorial, we will first go through some examples of those steps using the [NLTK library](https://www.nltk.org/). Then we will see how those steps can be performed using the Scikit-learn [Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html) module.

## Text Preprocessing Pipeline
Please refer to [this blog](https://medium.com/mlearning-ai/nlp-tokenization-stemming-lemmatization-and-part-of-speech-tagging-9088ac068768) for a better understanding.

### Tokenization

In [18]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [19]:
text = "Hello everyone! Welcome to my blog post on Medium. We are studying Natural Language Processing."

Sentence tokenization

In [20]:
tokens_sents = nltk.sent_tokenize(text)
print(tokens_sents)

['Hello everyone!', 'Welcome to my blog post on Medium.', 'We are studying Natural Language Processing.']


Word tokenization

In [21]:
tokens_words = nltk.word_tokenize(text)
print(tokens_words)

['Hello', 'everyone', '!', 'Welcome', 'to', 'my', 'blog', 'post', 'on', 'Medium', '.', 'We', 'are', 'studying', 'Natural', 'Language', 'Processing', '.']


### Lower casing

In [22]:
token_words_lowercased = [w.lower() for w in tokens_words]
print(token_words_lowercased)

['hello', 'everyone', '!', 'welcome', 'to', 'my', 'blog', 'post', 'on', 'medium', '.', 'we', 'are', 'studying', 'natural', 'language', 'processing', '.']


### Stemming

In [23]:
from nltk.stem import PorterStemmer

In [24]:
ps = PorterStemmer()
words = ["civilization", "boy's", "boyes", "boy"]
print([ps.stem(w) for w in words])

['civil', "boy'", 'boy', 'boy']


### Lemmatization

In [25]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [26]:
text = "The striped bats are hanging on their feet for best"
words = nltk.word_tokenize(text)

In [27]:
words

['The',
 'striped',
 'bats',
 'are',
 'hanging',
 'on',
 'their',
 'feet',
 'for',
 'best']

In [28]:
print([ps.stem(w) for w in words])

['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']


In [29]:
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in words])

['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']


### Stop words removal

In [30]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [31]:
sw = stopwords.words('english')
print(len(sw))
print(sw[:10])
print(sw[-10:])

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
['shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]


In [32]:
print(words)

['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']


In [33]:
words_stopwords_removed = [w for w in words if w.lower() not in sw]
print(words_stopwords_removed)

['striped', 'bats', 'hanging', 'feet', 'best']


### Vectorization
Vectorization is the process of representing the list of words obtained after the previous preprocessing steps (i.e., tokenization, stop word removal, stemming, lemmatization, etc.) as a vector of numbers. Consider the following example of review comments:

**Review 1**: Game of Thrones is an amazing tv series!

**Review 2**: Game of Thrones is the best tv series!

**Review 3**: Game of Thrones is so great

A possible vectorization for the above dataset can be


|       | amazing | an | best | game | great | is | of | series | so | the | thrones | tv |
|:------|:-------:|:--:|:----:|:----:|:-----:|:--:|:--:|:------:|:--:|:---:|:-------:|:--:|
| **0** | 1       | 1  | 0    | 1    | 0     | 1  | 1  | 1      | 0  | 0   | 1       | 1  |
| **1** | 0       | 0  | 1    | 1    | 0     | 1  | 1  | 1      | 0  | 1   | 1       | 1  |
| **2** | 0       | 0  | 0    | 1    | 1     | 1  | 1  | 0      | 1  | 0   | 1       | 0  |

The above feature representation is called <em>Bag of Words</em> representation.
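
The table can be reproduced with a few lines of plain Python. This is a minimal sketch: the `tokenize` helper is a simplified stand-in written for this example (lower casing plus a regular expression), not a full tokenizer.

```python
import re
from collections import Counter

reviews = [
    "Game of Thrones is an amazing tv series!",
    "Game of Thrones is the best tv series!",
    "Game of Thrones is so great",
]

def tokenize(text):
    # Lower-case the text and keep runs of word characters only.
    return re.findall(r"\w+", text.lower())

tokenized = [tokenize(review) for review in reviews]

# The vocabulary is the sorted set of all distinct tokens.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Each review becomes a vector of per-term counts.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Word order is discarded: only how often each vocabulary term occurs in a review is kept, which is why the representation is called a "bag" of words.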

## Bag of Words Representation
Scikit-learn provides the class <em>sklearn.feature_extraction.text.CountVectorizer</em> to convert a list of raw texts into a sparse matrix of bag-of-words features. The class internally performs the necessary preprocessing steps such as tokenization and lower casing, plus optional stop word removal (through the <em>stop_words</em> parameter), and returns the bag-of-words representation. Stemming and lemmatization are not built in, but they can be added to the pipeline with a custom tokenizer or analyzer if needed. The complete documentation of Scikit-learn <em>CountVectorizer</em> can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer).

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

In [35]:
vectorizer = CountVectorizer(lowercase=True,
                             tokenizer=None,
                             token_pattern=r'(?u)\b\w\w+\b',
                             stop_words=None,
                             ngram_range=(1, 1),
                             vocabulary=None,
                             binary=False,
                             max_df=1.0,
                             min_df=1,
                             max_features=None)

In [36]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)

In [38]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [37]:
X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

In [39]:
vectorizer.vocabulary_.get('document')

1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

In [40]:
vectorizer.transform(['Something completely new.']).toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])

### Bigrams

In [47]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b',
                                    min_df=1)

In [48]:
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()

In [43]:
X_2

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]])

In [44]:
bigram_vectorizer.get_feature_names_out()

array(['and', 'and the', 'document', 'first', 'first document', 'is',
       'is the', 'is this', 'one', 'second', 'second document',
       'second second', 'the', 'the first', 'the second', 'the third',
       'third', 'third one', 'this', 'this is', 'this the'], dtype=object)
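
With `ngram_range=(1, 2)` the vectorizer counts single words as well as pairs of adjacent words. The plain-Python sketch below mimics that extraction to show where the 21 features come from. The helper `word_ngrams` is written for this example and is not part of Scikit-learn.

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def word_ngrams(text, min_n=1, max_n=2):
    # Same token pattern as the bigram vectorizer above, after lower casing.
    tokens = re.findall(r"\b\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the document.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

ngram_vocab = sorted({gram for doc in corpus for gram in word_ngrams(doc)})

print(word_ngrams(corpus[2]))
print(len(ngram_vocab))
```

Unlike single words, bigrams keep some local word order: "is this" and "this is" are separate features, which lets a model tell the question in the last document apart from the statement in the first.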

## Tf–idf term weighting
Tf-idf (term frequency - inverse document frequency) is a measure of the importance of a word in a document relative to a set of documents. Tf-idf scores provide an alternative to bag-of-words features. You can refer to [this article](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to learn more about Tf-idf scoring. Scikit-learn provides the class <em>sklearn.feature_extraction.text.TfidfVectorizer</em> to convert a corpus of raw text into the Tf-idf feature representation. The complete documentation of the class can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer).

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [50]:
X.toarray()

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

In [51]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)
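
To see where these numbers come from, the sketch below recomputes them in plain Python under the default <em>TfidfVectorizer</em> settings: raw term counts, the smoothed inverse document frequency $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, and L2 normalization of each row. The helper name `tfidf_row` is chosen for this example.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Default tokenization: lower-case, tokens of two or more word characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df = Counter(token for doc in docs for token in set(doc))

# Smoothed idf (the smooth_idf=True default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in docs]
for row in tfidf:
    print([round(v, 4) for v in row])
```

A word that occurs in every document, such as "the", gets the smallest idf and therefore the lowest weight among the words present in a row, while "second", which occurs twice in a single document, gets the highest.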

## A Complete Example of Feature Extraction from Textual Data
The following code snippets are based on the code given [here](https://scikit-learn.org/stable/datasets/real_world.html).

In [52]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [53]:
from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [54]:
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:10])

(11314,)
(11314,)
[ 7  4  4  1 14 16 13  3  2  4]


In [55]:
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

print(list(newsgroups_train.target_names))
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)
print(newsgroups_train.target[:10])

['alt.atheism', 'sci.space']
(1073,)
(1073,)
[0 1 1 1 0 1 1 0 0 0]


### Converting the texts into vectors

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)

(2034, 34118)


In [57]:
vectors.nnz / float(vectors.shape[0])

159.0132743362832
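
The value above is the average number of non-zero features per document. A small back-of-the-envelope sketch, using only the figures reported in the two outputs above, shows how sparse the Tf-idf matrix is and what a dense 64-bit float version would cost.

```python
# Figures copied from the outputs above.
n_docs, n_features = 2034, 34118
avg_nonzero_per_doc = 159.0132743362832

# Total stored entries and the fraction of the matrix that is non-zero.
total_nonzero = round(avg_nonzero_per_doc * n_docs)
density = avg_nonzero_per_doc / n_features

# Memory for a dense float64 array of the same shape, in megabytes.
dense_megabytes = n_docs * n_features * 8 / 1e6

print(total_nonzero)
print(f"{density:.2%}")
print(f"{dense_megabytes:.0f} MB")
```

Fewer than half a percent of the entries are non-zero, which is why the vectorizer returns a sparse matrix.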

###  Vectorizing the test data

In [58]:
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)

### Training a Logistic Regression Classifier

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

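# penalty=None disables regularization (scikit-learn >= 1.2);
# older releases spell this penalty='none'
clf = LogisticRegression(penalty=None)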
clf.fit(vectors, newsgroups_train.target)

pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average='macro')



0.877168967755968

### Viewing the most informative features

In [60]:
import numpy as np
def show_top10(classifier, vectorizer, categories):
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        top10 = np.argsort(np.abs(classifier.coef_[i]))[-10:]
        print("%s:\t\t%s" % (category, " ".join(feature_names[top10])))
show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism:		is tek islamic space edu atheism caltech god atheists keith
comp.graphics:		file polygon tiff files 3d god space points image graphics
sci.space:		launch toronto pat orbit god alaska moon henry nasa space
talk.religion.misc:		that buffalo who jesus beast objective morality space christian god
