>Real-world data is messy, and we can't feed it directly to a machine learning model. Algorithms are fairly naive on their own and cannot work out of the box on raw data, so we need to engineer meaningful features that they can understand and consume. This process of crafting features from raw data is widely known as feature engineering. Let's apply some of it to our data.

In [146]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [147]:
# The first step is to define some sample data

data = [
    {'price' : 850000, 'rooms' : 4, 'neighborhood' : 'Queen Anne'},
    {'price' : 700000, 'rooms' : 3, 'neighborhood' : 'Fremont'},
    {'price' : 650000, 'rooms' : 3, 'neighborhood' : 'Wallingford'},
    {'price' : 600000, 'rooms' : 2, 'neighborhood' : 'Fremont'}
]

>We can't feed the categorical data (`neighborhood`) directly to our machine learning model. We will now look at different techniques for encoding categorical features as numeric quantities. The techniques that we will cover are the following:
- Replacing values
- Encoding labels
- One-Hot encoding
- Binary encoding
- Backward difference encoding
- Miscellaneous features


<font size = 4> Replacing Values

In [148]:
# Map each category to an integer by hand

df = pd.DataFrame(data)
df['local'] = df.neighborhood.map({'Queen Anne' : 1, 'Fremont' : 2, 'Wallingford': 3})
df

Unnamed: 0,neighborhood,price,rooms,local
0,Queen Anne,850000,4,1
1,Fremont,700000,3,2
2,Wallingford,650000,3,3
3,Fremont,600000,2,2


<font size = 4>Label Encoder

>To convert categorical text data into numerical data that a model can work with, we can use the `LabelEncoder` class. To label encode the `neighborhood` column, we import `LabelEncoder` from scikit-learn, fit and transform the column, and store the encoded values in a new column. Let’s have a look at the code.

In [149]:
df = pd.DataFrame(data)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['local'] = encoder.fit_transform(df['neighborhood'])
df

Unnamed: 0,neighborhood,price,rooms,local
0,Queen Anne,850000,4,1
1,Fremont,700000,3,0
2,Wallingford,650000,3,2
3,Fremont,600000,2,0


<font size = 4> One Hot Encoding

In [188]:
# Convert the data with one-hot encoding

from sklearn.feature_extraction import DictVectorizer
vect = DictVectorizer(sparse = True, dtype = int)
x = vect.fit_transform(data)
x

<4x5 sparse matrix of type '<class 'numpy.int32'>'
	with 12 stored elements in Compressed Sparse Row format>

In [151]:
vect.get_feature_names()  # in scikit-learn >= 1.0, use vect.get_feature_names_out()

['neighborhood=Fremont',
 'neighborhood=Queen Anne',
 'neighborhood=Wallingford',
 'price',
 'rooms']

>One clear disadvantage is that a feature with many possible values can greatly increase the size of your dataset. But because the encoded data contains mostly zeros, a sparse matrix output can be a very efficient solution.

In [152]:
vec = DictVectorizer(sparse = False, dtype = int)
x = vec.fit_transform(data)
x
# pd.DataFrame(x, columns=vec.get_feature_names())  # x is already a dense array here

array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int32)
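
>To see where the first three columns of this array come from, here is a minimal plain-Python sketch of one-hot encoding for the `neighborhood` values alone. It is an illustration of the idea, not the `DictVectorizer` implementation.

```python
# Plain-Python sketch of one-hot encoding for a single categorical column.
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

categories = sorted(set(neighborhoods))   # one output column per category
one_hot = [[1 if name == cat else 0 for cat in categories]
           for name in neighborhoods]

for name, row in zip(neighborhoods, one_hot):
    print(f'{name:12s} {row}')
```

>Each row contains exactly one 1, so no category is numerically "larger" than another.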

>You can also use `sklearn.preprocessing.OneHotEncoder` and `sklearn.feature_extraction.FeatureHasher` for similar encoding.

>While one-hot encoding solves the problem of unequal weights given to categories within a feature, it is not very useful when there are many categories, because it creates one new column per category, which can lead to the curse of dimensionality. The **“curse of dimensionality”** refers to the fact that in high-dimensional spaces data becomes sparse and many methods stop working well. We will see two methods for keeping the number of dimensions down.

<font size = 3> Binary Encoding

>This technique is not as intuitive as the previous ones. In this technique, first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns. This encodes the data in fewer dimensions than one-hot.

First install the package: `pip install category_encoders`

In [153]:
import category_encoders as ce
df = pd.DataFrame(data)
encoder = ce.binary.BinaryEncoder(cols = ['neighborhood'])
df_binary = encoder.fit_transform(df)
df_binary

Unnamed: 0,neighborhood_0,neighborhood_1,neighborhood_2,price,rooms
0,0,0,1,850000,4
1,0,1,0,700000,3
2,0,1,1,650000,3
3,0,1,0,600000,2
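
>The bit patterns in this table can be reproduced by hand. The sketch below assumes that ordinal codes start at 1 and follow the order of first appearance, which is what the table suggests. The exact code assignment and column count in `category_encoders` may differ between versions.

```python
# Plain-Python sketch of binary encoding (assumption: ordinal codes start at 1
# and follow the order of first appearance, as the table above suggests).
neighborhoods = ['Queen Anne', 'Fremont', 'Wallingford', 'Fremont']

ordinal = {}
for name in neighborhoods:
    ordinal.setdefault(name, len(ordinal) + 1)   # Queen Anne=1, Fremont=2, Wallingford=3

width = max(ordinal.values()).bit_length()       # bits needed for the largest code
binary_rows = [[int(bit) for bit in format(ordinal[name], f'0{width}b')]
               for name in neighborhoods]
print(binary_rows)                               # [[0, 1], [1, 0], [1, 1], [1, 0]]
```

>Three categories fit in two bit columns here, and the table above carries one additional all-zero leading column. The saving grows with cardinality: 1000 categories need only 10 bit columns instead of 1000 one-hot columns.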


>Though this method is useful, the number of dimensions can still be high. Now let's see how we can apply PCA along with one-hot encoding to deal with the curse of dimensionality. To know more about the curse of dimensionality, read this article: [Click Here](https://medium.freecodecamp.org/the-curse-of-dimensionality-how-we-can-save-big-data-from-itself-d9fa0f872335).

In [154]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
# Project the first two one-hot columns (Fremont, Queen Anne) onto two principal components
x_pca = pca.fit_transform(x[:, 0:2])
print(x_pca)

[[ 0.85566188 -0.28344797]
 [-0.54795577 -0.11065475]
 [ 0.24024967  0.50475746]
 [-0.54795577 -0.11065475]]


In [155]:
# Let's apply this to a larger one-hot-style matrix, reducing 8 columns to 4

z = np.eye(10, 8)
print(z)
pca = PCA(n_components = 4)
z_pca = pca.fit_transform(z)
z_pca

[[1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0.]]


array([[ 1.44359862e-17,  9.35414347e-01, -1.84588162e-17,
         5.98179592e-17],
       [-7.42834160e-03, -1.33630621e-01, -7.00065635e-03,
         9.22224668e-01],
       [ 3.59527634e-01, -1.33630621e-01, -1.07751846e-01,
        -1.28287683e-01],
       [-1.88814138e-01, -1.33630621e-01, -1.94870178e-01,
        -2.04536968e-01],
       [ 6.42214080e-01, -1.33630621e-01,  1.95336849e-01,
        -1.28287683e-01],
       [-3.52947815e-01, -1.33630621e-01, -6.63780198e-01,
        -1.28287683e-01],
       [-5.39022767e-01, -1.33630621e-01,  6.79646276e-01,
        -1.28287683e-01],
       [ 8.64713477e-02, -1.33630621e-01,  9.84197531e-02,
        -2.04536968e-01],
       [ 3.55592934e-18,  4.16333634e-17, -2.24202824e-17,
         4.40853504e-16],
       [ 3.55592934e-18,  4.16333634e-17, -2.24202824e-17,
         4.40853504e-16]])

<font size = 4>Text Features

>We can convert text to a set of representative numerical values to feed as features to our machine learning model.

In [156]:
# We need to create our sample data 
sample = ['We are learning data science', 'How good we are?', 'Data Science is interesting!']

In [192]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [190]:
m = vec.fit_transform(sample) # fit (learn the vocabulary) and transform (count the words) in one step
m

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 13 stored elements in Compressed Sparse Row format>

In [159]:
vec.get_feature_names()

['are',
 'data',
 'good',
 'how',
 'interesting',
 'is',
 'learning',
 'science',
 'we']

In [160]:
# Let's visualize the sparse matrix that we have created
pd.DataFrame(m.toarray(), columns = vec.get_feature_names())

Unnamed: 0,are,data,good,how,interesting,is,learning,science,we
0,1,1,0,0,0,0,1,1,1
1,1,0,1,1,0,0,0,0,1
2,0,1,0,0,1,1,0,1,0
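
>The table above can be rebuilt with a few lines of plain Python, which makes the three steps explicit: tokenize, collect a sorted vocabulary, count. This is a sketch that mirrors the `lowercase=True` and `token_pattern` settings shown in the `CountVectorizer` output earlier; the helper name `tokenize` is our own.

```python
import re
from collections import Counter

sample = ['We are learning data science', 'How good we are?', 'Data Science is interesting!']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring the token_pattern shown in the CountVectorizer output.
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: every distinct token, in sorted order (one column per token)
vocabulary = sorted({token for doc in sample for token in tokenize(doc)})

# Document-term matrix: one row per document, one count per vocabulary term
counts = []
for doc in sample:
    freq = Counter(tokenize(doc))
    counts.append([freq[term] for term in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```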


In [198]:
sample

['We are learning data science',
 'How good we are?',
 'Data Science is interesting!']

>The problem with this approach is that it gives too much weight to words that appear very frequently but are not very informative. One way to fix this is to use **term frequency-inverse document frequency (TF-IDF)**, which weights each word count by a measure of how many documents the word appears in, so that words common to most documents count for less.

In [193]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
# pd.DataFrame(X.toarray(), columns = vec.get_feature_names())

In [162]:
type(X) # sparse matrices take less space and are fast to compute with

scipy.sparse.csr.csr_matrix

In [163]:
test_data = ['I am bad at data science'] # Test data
test_data_dtm = vec.transform(test_data) # Converting the data to document term matrix

# Here we apply transform only, not fit_transform, so the vocabulary learned from `sample` is reused
pd.DataFrame(test_data_dtm.toarray(), columns = vec.get_feature_names())

Unnamed: 0,are,data,good,how,interesting,is,learning,science,we
0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0


- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
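
>The fit/transform split and the `0.707107` weights in the test row can be reproduced by hand. The sketch below assumes scikit-learn's default TF-IDF settings: raw counts as term frequency, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are our own.

```python
import math
import re
from collections import Counter

train = ['We are learning data science', 'How good we are?', 'Data Science is interesting!']
test = 'I am bad at data science'

def tokenize(text):
    return re.findall(r'\b\w\w+\b', text.lower())

# "fit": learn the vocabulary and document frequencies from the training data only
train_tokens = [set(tokenize(doc)) for doc in train]
vocabulary = sorted(set().union(*train_tokens))
n_docs = len(train)
idf = {term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in train_tokens))) + 1
       for term in vocabulary}

# "transform": count known terms, weight by idf, then L2-normalize the row
def transform(text):
    freq = Counter(t for t in tokenize(text) if t in idf)   # unseen tokens are dropped
    weights = [freq[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

row = transform(test)
print([round(w, 6) for w in row])
```

>Only `data` and `science` survive from the test sentence. They have the same document frequency, so after normalization each gets a weight of 1/√2 ≈ 0.707107.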

## Working with the real data

In [164]:
# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
sms.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [165]:
sms.shape

(5572, 2)

In [166]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [167]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [168]:
sms.head()

Unnamed: 0,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [195]:
# Creating the feature matrix and Targets

X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [170]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
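
>The shapes above follow from the default split size. When `test_size` is not given, `train_test_split` holds out 25% of the rows, rounding the test count up. A quick check of the arithmetic:

```python
import math

n_samples = 5572
test_fraction = 0.25                            # default when test_size is not given

n_test = math.ceil(n_samples * test_fraction)   # test count is rounded up
n_train = n_samples - n_test
print(n_train, n_test)                          # 4179 1393
```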


In [171]:
vec = TfidfVectorizer()

In [172]:
# Fit and transform the training data in a single step
X_train_dtm = vec.fit_transform(X_train)
X_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.float64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [173]:
# transform testing data (using fitted vocabulary) into a document-term matrix
# Only apply transform to the test data

X_test_dtm = vec.transform(X_test)
X_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.float64'>'
	with 17604 stored elements in Compressed Sparse Row format>

In [174]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [196]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

Wall time: 0 ns


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
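
>To give a feel for what the model learns, here is a minimal plain-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha = 1.0`, as in the output above). It uses raw word counts on a hypothetical toy dataset rather than the TF-IDF matrix, and it is a simplified illustration, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Hypothetical toy training set (not taken from the SMS data)
train = [('free prize now', 'spam'), ('free entry win', 'spam'),
         ('see you now', 'ham'), ('call me later', 'ham')]
alpha = 1.0                                   # Laplace smoothing

word_counts = {'spam': Counter(), 'ham': Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocabulary = set(word_counts['spam']) | set(word_counts['ham'])

def log_score(text, label):
    # log P(label) + sum over words of log P(word | label), with smoothing
    total = sum(word_counts[label].values()) + alpha * len(vocabulary)
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word in vocabulary:                # words never seen in training are ignored
            score += math.log((word_counts[label][word] + alpha) / total)
    return score

def predict(text):
    return max(('ham', 'spam'), key=lambda label: log_score(text, label))

print(predict('free prize'))                  # spam
print(predict('see you later'))               # ham
```

>Smoothing keeps a word that never occurred with one class from driving that class's probability to zero.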

In [176]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [177]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.964824120603015

In [178]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1208,    0],
       [  49,  136]], dtype=int64)
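
>The confusion matrix says more than the accuracy score alone. Here is a short sketch that derives accuracy, precision, recall and the "always predict ham" baseline from the four numbers above. Rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the cell above: rows are true classes, columns are predictions
tn, fp = 1208, 0      # true ham:  predicted ham, predicted spam
fn, tp = 49, 136      # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)          # share of predicted spam that really is spam
recall = tp / (tp + fn)             # share of real spam that was caught
null_accuracy = (tn + fp) / total   # accuracy of always predicting ham

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(null_accuracy, 4))
```

>The model never flags ham as spam (precision 1.0), but it misses 49 of the 185 spam messages in the test set.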

In [179]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

Series([], Name: message, dtype: object)

In [180]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

147     FreeMsg Why haven't you replied to my text? I'...
1745    Someone has conacted our dating service and en...
1064    We have new local dates in your area - Lots of...
4460    Welcome to UK-mobile-date this msg is FREE giv...
2680    New Tones This week include: 1)McFly-All Ab..,...
1217    You have 1 new voicemail. Please call 08719181...
3766    Someone U know has asked our dating service 2 ...
5566    REMINDER FROM O2: To get 2.50 pounds free call...
881     Reminder: You have not downloaded the content ...
3132    LookAtMe!: Thanks for your purchase of a video...
2295     You have 1 new message. Please call 08718738034.
5269    If you don't, your prize will go to another cu...
5110      You have 1 new message. Please call 08715205273
1045    We know someone who you know that fancies you....
4965    Dear Voucher holder Have your next meal on us....
2583    3 FREE TAROT TEXTS! Find out about your love l...
4348    U 447801259231 have a secret admirer who is lo...
943     How ab

In [181]:
# Notice that both have the same shape but not the same type
# scikit-learn handles this difference for us

print(type(y_test))
print(y_test.shape)
print(type(y_pred_class))
print(y_pred_class.shape)

<class 'pandas.core.series.Series'>
(1393,)
<class 'numpy.ndarray'>
(1393,)


In [182]:
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [183]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([0.02471001, 0.00471482, 0.05713988, ..., 0.03443199, 0.65287927,
       0.0045776 ])

In [143]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9883926973330948
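
>AUC can be read as the probability that a randomly chosen spam message gets a higher predicted probability than a randomly chosen ham message. A minimal sketch of that definition on hypothetical toy scores (the function name `roc_auc` is our own):

```python
def roc_auc(labels, scores):
    # Fraction of (positive, negative) pairs in which the positive example
    # gets the higher score; ties count as half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))    # 0.75
```

>Because only the ranking of the scores matters, AUC can be high even when the probabilities themselves are poorly calibrated.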

## Remove stop words 

In [184]:
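# The stop-word list and tokenizer need NLTK data packages; download them once if missing,
# e.g. import nltk; nltk.download('stopwords'); nltk.download('punkt')
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))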

In [185]:
sentence = "this is a foo bar sentence"
print([i for i in sentence.lower().split() if i not in stop])

['foo', 'bar', 'sentence']


In [187]:
x = [i for i in word_tokenize(sentence.lower()) if i not in stop]
x

['foo', 'bar', 'sentence']
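
>Both approaches give the same result here only because the sentence has no punctuation. The sketch below uses a hypothetical, deliberately tiny stop-word list and a simple regular expression in place of NLTK. It shows why splitting on whitespace alone is fragile.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer
stop = {'this', 'is', 'a', 'the', 'and'}

sentence = 'This is a foo bar sentence, and this is the end.'

# Whitespace split keeps punctuation attached to the words
by_split = [w for w in sentence.lower().split() if w not in stop]
# Regex tokenization keeps only runs of word characters
by_regex = [w for w in re.findall(r'\w+', sentence.lower()) if w not in stop]

print(by_split)   # ['foo', 'bar', 'sentence,', 'end.']
print(by_regex)   # ['foo', 'bar', 'sentence', 'end']
```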