# Chapter 3: Feature Extraction & Preprocessing

## Extracting features from categorical variables
- **Categorical**, or **nominal**, variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is represented using one binary feature for each of its possible values.

In [1]:
from sklearn.feature_extraction import DictVectorizer

# sort=True orders the features alphabetically; with sort=False they keep
# insertion order, giving [[1. 0. 0.] [0. 1. 0.] [0. 0. 1.]]
onehot_encoder = DictVectorizer(sort=True)
instances = [{"City":"New York"}, {"City":"San Francisco"}, {"City":"Chapel Hill"}]
print("One-Hot-Encoder\n", onehot_encoder.fit_transform(instances).toarray())

One-Hot-Encoder
 [[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
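
To make the encoding concrete, here is a minimal plain-Python sketch of one-hot encoding. It is only an illustration, not how `DictVectorizer` is implemented; the helper name `one_hot_encode` is hypothetical, and it assumes alphabetically sorted features, as in the cell above.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary feature per distinct value."""
    categories = sorted(set(values))  # alphabetical, like DictVectorizer(sort=True)
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one feature is "hot" per instance
        rows.append(row)
    return categories, rows


cities = ["New York", "San Francisco", "Chapel Hill"]
categories, encoded = one_hot_encode(cities)
print(categories)
for row in encoded:
    print(row)
```

Each row contains a single 1, in the column that belongs to that instance's city.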


## The bag-of-words representation model
- Bag-of-words can be thought of as an extension to one-hot encoding.
- The bag-of-words model can be used effectively for document classification and retrieval despite the limited information that it encodes.
- A collection of documents is called a **corpus**.

In [2]:
corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']

- Our corpus has eight unique words, so each document will be represented by a vector with eight elements. The number of elements that comprise a feature vector is called the vector's dimension.
- *CountVectorizer* converts the characters in the documents to lowercase, and tokenizes the documents. **Tokenization** is the process of splitting a string into **tokens**.
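
As a rough illustration of lowercasing and tokenization, the sketch below uses a regular expression that, as far as we know, matches `CountVectorizer`'s default `token_pattern` (tokens of two or more word characters). The helper name `tokenize` is hypothetical and the code is a sketch, not scikit-learn's implementation.

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(document):
    """Lowercase the document, then split it into word tokens."""
    return TOKEN_PATTERN.findall(document.lower())


print(tokenize("UNC played Duke in basketball"))
print(tokenize("I ate a sandwich"))
```

Because the pattern requires at least two characters, single-letter words such as "I" and "a" never become tokens.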

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', "I ate a sandwich"]  # added a third document
vectorizer = CountVectorizer()
print("Vectorizer: \n",vectorizer.fit_transform(corpus).todense())
print("Vocabulary: \n",vectorizer.vocabulary_)

Vectorizer: 
 [[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
Vocabulary: 
 {'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}
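
The `vocabulary_` mapping tells us which column belongs to which word. As a small sketch, we can invert the mapping printed above and decode the first row of the count matrix; the variable names here are our own.

```python
# Vocabulary and first row copied from the output above
vocabulary = {'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1,
              'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}
first_row = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Invert the mapping: column index -> word
index_to_word = {index: word for word, index in vocabulary.items()}

# Collect the words whose count is non-zero in the first document
words = [index_to_word[i] for i, count in enumerate(first_row) if count > 0]
print(words)
```

The decoded words are those of the first document, listed in column (alphabetical) order rather than in sentence order, which is exactly the information a bag-of-words discards.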


- The feature vectors of the first two documents are more similar to each other than either is to the third document's vector when compared using a metric such as **Euclidean distance**. The Euclidean distance between two vectors is equal to the **Euclidean norm**, or L2 norm, of the difference between the two vectors:
<img src="images\euclidean distance1.jpg"  width ="100"></img>
- The Euclidean norm of a vector is equal to the vector's magnitude, which is given by the following equation:
<img src="images\euclidean distance2.jpg"  width ="200"></img>
- scikit-learn's `euclidean_distances` function can be used to calculate the distance between two or more vectors.

In [4]:
from sklearn.metrics.pairwise import euclidean_distances

counts = [[0,1,1,0,1,0,1,0,0,1] ,[0,1,1,1,0,1,0,0,1,0], [1,0,0,0,0,0,0,1,0,0]]
print("Distance between first and second documents: ",euclidean_distances([counts[0]], [counts[1]]))
print("Distance between first and third documents: ",euclidean_distances([counts[0]], [counts[2]]))
print("Distance between second and third documents: ",euclidean_distances([counts[1]], [counts[2]]))

Distance between first and second documents:  [[2.44948974]]
Distance between first and third documents:  [[2.64575131]]
Distance between second and third documents:  [[2.64575131]]
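
We can check these numbers by hand. The first two count vectors differ in six positions, each by 1, so their distance is √6 ≈ 2.449; the first and third differ in seven positions, giving √7 ≈ 2.646. A minimal plain-Python sketch of the same calculation (the helper name `euclidean_distance` is our own):

```python
import math


def euclidean_distance(u, v):
    """L2 norm of the difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


counts = [[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
          [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
          [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]

d01 = euclidean_distance(counts[0], counts[1])
d02 = euclidean_distance(counts[0], counts[2])
d12 = euclidean_distance(counts[1], counts[2])
print(d01, d02, d12)
```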


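- High-dimensional feature vectors that have many zero-valued elements are called **sparse vectors**.
- Using high-dimensional data creates two problems. The first is that high-dimensional vectors require more memory than smaller vectors.
- The second problem is known as the **curse of dimensionality**, or the **Hughes effect**: as the dimensionality of the feature space grows, more training data is needed for a model to generalize well.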
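
The memory problem is usually eased by storing only the non-zero elements. `fit_transform` returns a SciPy sparse matrix for this reason, which is why the cells above call `.toarray()` or `.todense()` before printing. The sketch below illustrates the idea with a plain dictionary; it is not SciPy's actual storage format, and the helper name `to_sparse` is hypothetical.

```python
def to_sparse(row):
    """Keep only the non-zero entries as {column index: value}."""
    return {i: value for i, value in enumerate(row) if value != 0}


# Count vector of the third document ("I ate a sandwich")
dense_row = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
sparse_row = to_sparse(dense_row)

print(sparse_row)
print(len(dense_row), "dense values vs", len(sparse_row), "stored values")
```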

## Stop-word filtering
- A second strategy for reducing the dimensionality of the feature space is to remove words that are common to most of the documents in the corpus. These words are called **stop words**.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', "I ate a sandwich"]
vectorizer = CountVectorizer(stop_words = 'english')
print(vectorizer.fit_transform(corpus).todense())
print("vocabulary \n", vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
vocabulary 
 {'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
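
Conceptually, stop-word filtering just drops tokens that appear in a fixed list. The sketch below uses a deliberately tiny, hypothetical stop list (`STOP_WORDS`); scikit-learn's built-in English list is much longer. The tokenizer regex is an assumption chosen to mimic `CountVectorizer`'s defaults.

```python
import re

# Hypothetical, minimal stop list for illustration only
STOP_WORDS = {"in", "the", "a", "i"}


def tokenize_without_stop_words(document):
    """Lowercase, tokenize, and drop any token found in STOP_WORDS."""
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    return [token for token in tokens if token not in STOP_WORDS]


corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', "I ate a sandwich"]
filtered = [tokenize_without_stop_words(document) for document in corpus]
vocabulary = sorted({token for document in filtered for token in document})
print(vocabulary)
```

With "in" and "the" removed, the vocabulary shrinks from ten words to eight, matching the vector length in the output above.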


## Stemming and lemmatization

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["He ate the sandwiches", "Every sandwich was eaten by him"]
vectorizer = CountVectorizer(stop_words = 'english', binary=True)
print(vectorizer.fit_transform(corpus).todense())
print("vocabulary \n", vectorizer.vocabulary_)

[[1 0 0 1]
 [0 1 1 0]]
vocabulary 
 {'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}


In [7]:
corpus = ["I am gathering ingredients for the sandwich", "There were many wizards at the gathering"]

- In the first document *gathering* is a verb; in the second it is a noun. We will use the **Natural Language Toolkit (NLTK)** to stem and lemmatize the corpus.

In [21]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
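# nltk.download('wordnet')  # uncomment and run once to fetch the WordNet data

lemmatizer = WordNetLemmatizer()
# The lemma depends on the part of speech: 'v' = verb, 'n' = noun
print(lemmatizer.lemmatize('gathering','v'))
print(lemmatizer.lemmatize('gathering','n') + "\n")

# For comparison, a stemmer strips affixes without considering part of speech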
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("gathering"))

gather
gathering

gather
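
The difference between the two approaches can be sketched without NLTK. The toy code below is an assumption-laden illustration only: `toy_stem` is a crude suffix stripper that is far simpler than the Porter algorithm, and `toy_lemmatize` looks words up in a hypothetical hand-written table that stands in for WordNet.

```python
# Hypothetical lemma table keyed by (word, part of speech); WordNet plays this role in NLTK
LEMMAS = {("gathering", "v"): "gather", ("gathering", "n"): "gathering"}


def toy_lemmatize(word, pos):
    """Look up the dictionary form of a word for a given part of speech."""
    return LEMMAS.get((word, pos), word)


def toy_stem(word):
    """Strip one common suffix by rule, ignoring part of speech entirely."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(toy_lemmatize("gathering", "v"))
print(toy_lemmatize("gathering", "n"))
print(toy_stem("gathering"))
```

The lemmatizer needs to know the part of speech to choose between the two results, whereas the stemmer always applies the same rule.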


- Let's compare lemmatization with stemming

In [1]:
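from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
# One-time setup (resource names may differ between NLTK versions):
# nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')

wordnet_tags = ['n','v']  # parts of speech we lemmatize: nouns and verbs
corpus = ['He ate the sandwiches','Every sandwich was eaten by him']
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print("Stemmer:")
print([[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])

def lemmatize(token, tag):
    # Part-of-speech tags from pos_tag start with 'N' for nouns and 'V' for verbs
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token

tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print("Lemmatizer:")
print([[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])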

Stemmer:


LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\Ahmed/nltk_data'
    - 'D:\\Istalled-Program\\Anaconda\\nltk_data'
    - 'D:\\Istalled-Program\\Anaconda\\share\\nltk_data'
    - 'D:\\Istalled-Program\\Anaconda\\lib\\nltk_data'
    - 'C:\\Users\\Ahmed\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
