# Feature Engineering - part 2

*Purpose*

Test importing the new `feature_engineering` module and start exploring whether its engineered features help a baseline Logistic Regression model.


*Results*

- Stacking the 7 new features onto the document-term matrix of our best LR model so far (2,001 columns in the run below) did not improve accuracy: it stays at 0.9854, so the new features just get lost.
- Using SVD to reduce the document-term matrix to 300 components and stacking the new features onto this semantic space yields much lower accuracy (0.8967 at best) than the document-term matrix.
- One hypothesis is that a simple Logistic Regression model is not the best way to validate these new representations.

*Next Steps*

- A possible next step is to visualize the engineered features to understand why they reach about 0.88 accuracy on their own but do not improve accuracy when stacked onto the document-term matrix, and only partly help the SVD semantic space.

In [1]:
import re
import os
import time
import json
import numpy as np
import pandas as pd

import feature_engineering as Fe

import urlextract
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split

### Load Data

In [2]:
with open("contractions_map.json") as f:
    contractions_map = json.load(f)

url_extractor = urlextract.URLExtract()
lemmatizer = WordNetLemmatizer()

# load X, y train subsets
raw_path = os.path.join("..","data","1_raw")
X_train = pd.read_csv(os.path.join(raw_path, "X_train.csv"))
y_train = pd.read_csv(os.path.join(raw_path, "y_train.csv"))

# create arrays
X_array = np.array(X_train.iloc[:,0]).ravel()
y_array = np.array(y_train.iloc[:,0]).ravel()

In [3]:
X_array.shape, y_array.shape

((3900,), (3900,))

In [38]:
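import warnings

start_time = time.time()

# RuntimeWarnings (likely from documents left with no tokens) are silenced here.
# Warnings are not raised as exceptions by default, and an `except` that did fire
# would leave clean_docs and X_transformed undefined.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    clean_docs, X_transformed = Fe.DocumentToFeaturesCounterTransformer().fit_transform(X_array)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} m {secs:0.0f} s')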

Elapsed: 0 m 3 s


In [39]:
[(ix, val) for ix, val in enumerate(X_array[12:15])]

[(0, "K come to nordstrom when you're done"),
 (1, ':-) :-)'),
 (2, 'Okay... I booked all already... Including the one at bugis.')]

In [40]:
[(ix, val) for ix, val in enumerate(clean_docs[12:15])]

[(0, 'k come to nordstrom when you are done'),
 (1, ''),
 (2, 'okay i booked all already including the one at bugis')]

### New Features

In [41]:
# columns: dlen_raw, dlen_cln, n_tokns, tkn_maxL, tkn_meanL, tkn_stdL, rsr_
print(X_transformed[12:15])

[[36.     37.      8.      9.      3.75    2.222   0.5   ]
 [ 7.      0.      0.      0.         nan     nan  0.    ]
 [59.     52.     10.      9.      4.3     2.3259  0.5116]]
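Judging only from the printed rows, the first six columns look like simple length statistics of the raw and cleaned text. The sketch below is an assumption about what such a transformer might compute, not the actual `feature_engineering` code; the function name `length_features` is hypothetical, and the seventh column (`rsr_`) is left out because its definition is not shown here.

```python
import math
import statistics


def length_features(raw_doc, clean_doc):
    """Return (dlen_raw, dlen_cln, n_tokns, tkn_maxL, tkn_meanL, tkn_stdL)."""
    # Token lengths of the cleaned document, split on whitespace (assumed tokenization).
    token_lengths = [len(tok) for tok in clean_doc.split()]
    if token_lengths:
        max_len = max(token_lengths)
        mean_len = statistics.fmean(token_lengths)
        std_len = statistics.pstdev(token_lengths)  # population std, like np.std
    else:
        # No tokens left after cleaning: mean and std are undefined (NaN).
        max_len, mean_len, std_len = 0, math.nan, math.nan
    return (len(raw_doc), len(clean_doc), len(token_lengths),
            max_len, mean_len, std_len)


row = length_features("K come to nordstrom when you're done",
                      "k come to nordstrom when you are done")
empty = length_features(":-) :-)", "")
print([round(v, 4) for v in row])
print(empty)
```

Under these assumptions the first call reproduces the first six values printed for row 12, and the emoticon-only message yields the same NaN pattern as row 13, which is why the imputation step that follows is needed.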


In [46]:
# impute nans with zeros
X_transformed[np.isnan(X_transformed)] = 0

In [47]:
print(X_transformed[12:15])

[[36.     37.      8.      9.      3.75    2.222   0.5   ]
 [ 7.      0.      0.      0.      0.      0.      0.    ]
 [59.     52.     10.      9.      4.3     2.3259  0.5116]]


In [48]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('std_scaler', StandardScaler()), 
                 ('log_reg', LogisticRegression(solver="liblinear", random_state=42))])

Use only the pipeline's scaler step, then cross-validate a separate model on the scaled features.

In [49]:
X_scaled = pipe['std_scaler'].fit_transform(X_transformed)

In [50]:
from sklearn.model_selection import cross_val_score

log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, X_scaled, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8785 (+/- 0.0093)
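Scaling the full matrix before cross-validation lets each fold's scaler see its own validation rows. With only a mean and a standard deviation per column the effect is probably small, but passing the whole pipeline to `cross_val_score` avoids it. The sketch below uses synthetic data from `make_classification` as a stand-in for the 7 engineered features; `X_demo`, `y_demo` and `cv_pipe` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 engineered features (not the SMS data).
X_demo, y_demo = make_classification(n_samples=500, n_features=7,
                                     n_informative=4, random_state=42)

# Passing the pipeline means the scaler is re-fit on the training part of every fold.
cv_pipe = make_pipeline(StandardScaler(),
                        LogisticRegression(solver="liblinear", random_state=42))
cv_scores = cross_val_score(cv_pipe, X_demo, y_demo, cv=10, scoring="accuracy")
print(f"Accuracy: {cv_scores.mean():0.4f} (+/- {cv_scores.std():0.4f})")
```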


Use the full pipeline on a single train/test split and predict once.

In [51]:
from sklearn.metrics import accuracy_score

# note: this reuses the names X_train / y_train, replacing the DataFrames loaded earlier
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y_array, test_size=0.33, random_state=42)

pipe.fit(X_train, y_train)
y_preds = pipe.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test, y_preds):0.4f}')

Accuracy: 0.8718


### Document-Term Matrix plus engineered features

Trigrams with `vocab_size=100000` gave the best speed and accuracy so far; the run below uses `n_grams=2` and `vocabulary_size=2000`.

In [52]:
import cleanup_module as Cmod
from sklearn.feature_extraction.text import TfidfTransformer

dtm_pipe = Pipeline([('counter', Cmod.DocumentToNgramCounterTransformer(n_grams=2)),
                     ('bow', Cmod.WordCounterToVectorTransformer(vocabulary_size=2000))
                    ])

In [53]:
start_time = time.time()

X_transformed_dtm = dtm_pipe.fit_transform(X_array)

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} m {secs:0.0f} s')

Elapsed: 0 m 7 s


In [54]:
X_transformed_dtm

<3900x2001 sparse matrix of type '<class 'numpy.intc'>'
	with 57752 stored elements in Compressed Sparse Row format>

In [55]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, X_transformed_dtm, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9854 (+/- 0.0059)



- *What will combining them result in?*

In [56]:
import scipy.sparse as sp

X_stacked = sp.hstack((X_scaled, X_transformed_dtm))

In [57]:
X_stacked

<3900x2008 sparse matrix of type '<class 'numpy.float64'>'
	with 85052 stored elements in COOrdinate format>
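`sp.hstack` accepts a mix of dense arrays and sparse matrices and returns COO format by default, as the output above shows. A minimal sketch with toy matrices (all values made up for illustration) shows how the `format` argument can request CSR directly, which supports the row slicing that cross-validation relies on.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins: two dense "engineered" columns and a small sparse count matrix.
dense_part = np.array([[0.5, -1.0],
                       [1.5, 0.0],
                       [-2.0, 2.0]])
sparse_part = sp.csr_matrix(np.array([[1, 0, 0],
                                      [0, 0, 2],
                                      [0, 3, 0]]))

# format="csr" skips the intermediate COO result that hstack returns by default.
stacked = sp.hstack((dense_part, sparse_part), format="csr")
print(stacked.shape, stacked.format, stacked.dtype)
```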

#### Scaled vs Unscaled


In [58]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, X_stacked, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9854 (+/- 0.0065)


In [59]:
X_stacked_unscaled = sp.hstack((X_transformed, X_transformed_dtm))
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, X_stacked_unscaled, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.9846 (+/- 0.0065)


- What about using SVD for dimensionality reduction and then stacking the new features?

### SVD plus New Features

In [62]:
from scipy.sparse.linalg import svds
from sklearn.utils.extmath import svd_flip

start_time = time.time()
# astype(np.float64) is a portable replacement for the older asfptype() call
U, Sigma, VT = svds(X_transformed_dtm.astype(np.float64).T, # transposed to a term-document matrix
                    k=300) # k = number of components / "topics"
    
# reverse outputs
Sigma = Sigma[::-1]
U, VT = svd_flip(U[:, ::-1], VT[::-1])

mins, secs = divmod(time.time() - start_time, 60)
print(f'Elapsed: {mins:0.0f} min {secs:0.0f} sec')

Elapsed: 0 min 3 sec


In [63]:
U.shape, Sigma.shape, VT.shape

((2001, 300), (300,), (300, 3900))

In [64]:
V = VT.T
V.shape, y_array.shape

((3900, 300), (3900,))
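One detail that may matter for the results that follow: the columns of `V` are unit-norm singular vectors, so with 3,900 documents each entry is small, and a regularized model may barely move away from predicting the majority class. Document coordinates in a latent semantic space are often taken as `V` scaled by `Sigma`, which equals projecting the document-term matrix onto `U`. The sketch below checks that identity on a small random matrix; it is an illustration under those assumptions, not a claim about what will work best on the SMS data.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Small random document-term matrix (documents x terms), for illustration only.
X_demo = sp.random(60, 40, density=0.2, format="csr", random_state=0)

# Same call pattern as the cell above: decompose the term-document matrix X.T.
U_k, sigma_k, VT_k = svds(X_demo.T, k=5)
V_k = VT_k.T

col_norms = np.linalg.norm(V_k, axis=0)  # each right singular vector has unit norm
doc_coords = V_k * sigma_k               # V scaled column-wise by the singular values
projected = X_demo @ U_k                 # documents projected onto the term vectors

norms_are_unit = np.allclose(col_norms, 1.0)
coords_match = np.allclose(doc_coords, projected)
print(norms_are_unit, coords_match)
```

If the scaled coordinates were tried in the notebook, `V * Sigma` would replace `V` before stacking; whether that changes the accuracy would need to be measured.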

In [65]:
# convert to sparse matrix
V_sparse = sp.csr_matrix(V)
X_scaled_sparse = sp.csr_matrix(X_scaled)

# stack
V_stacked = sp.hstack((V_sparse, X_scaled_sparse))
V_stacked

<3900x307 sparse matrix of type '<class 'numpy.float64'>'
	with 1197000 stored elements in COOrdinate format>

**SVD alone**

In [66]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, V_sparse, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8674 (+/- 0.0012)


**SVD plus new features**

In [67]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, V_stacked, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8967 (+/- 0.0115)


How about just adding raw document length?

In [68]:
V_stacked = sp.hstack((V_sparse, X_transformed[:,0:1]))

In [69]:
log_clf = LogisticRegression(solver="liblinear", random_state=42)

score = cross_val_score(log_clf, V_stacked, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8513 (+/- 0.0084)


Just clean document length?

In [70]:
V_stacked = sp.hstack((V_sparse, X_transformed[:,1:2]))
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, V_stacked, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8515 (+/- 0.0086)


Just number of tokens?

In [71]:
V_stacked = sp.hstack((V_sparse, X_transformed[:,2:3]))
log_clf = LogisticRegression(solver="liblinear", random_state=42)
score = cross_val_score(log_clf, V_stacked, y_array, cv=10, verbose=0, scoring='accuracy', n_jobs=-1)
print(f'Accuracy: {score.mean():0.4f} (+/- {np.std(score):0.4f})')

Accuracy: 0.8533 (+/- 0.0085)
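The three single-feature checks above repeat the same cell with a different column slice, so they could be folded into one loop. The sketch below runs on random stand-in data, since the real `V_sparse` and `X_transformed` are not reproduced here; `base`, `extra`, `labels` and `results` are illustrative names, and the column names come from the header comment earlier in the notebook.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_docs = 300
base = sp.csr_matrix(rng.normal(size=(n_docs, 20)))  # stand-in for V_sparse
extra = rng.normal(size=(n_docs, 3))                 # stand-in for X_transformed[:, :3]
labels = (extra[:, 0] + rng.normal(scale=0.5, size=n_docs) > 0).astype(int)

names = ["dlen_raw", "dlen_cln", "n_tokns"]
results = {}
for col, name in enumerate(names):
    # Stack one extra column at a time onto the base representation.
    stacked = sp.hstack((base, extra[:, col:col + 1]), format="csr")
    clf = LogisticRegression(solver="liblinear", random_state=42)
    scores = cross_val_score(clf, stacked, labels, cv=10, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():0.4f} (+/- {scores.std():0.4f})")
```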


*Final Results*

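- Stacking all seven scaled features onto the SVD space raises accuracy from 0.8674 to 0.8967, still far below the document-term matrix (0.9854); stacking a single unscaled length or token-count feature lowers it to about 0.85.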
- It's possible that a more complex algorithm such as Random Forests might yield different results.
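
A minimal sketch of how the Random Forest idea could be tried, assuming the same stacking approach as above. It runs on synthetic data from `make_classification`, so the printed score says nothing about the SMS task; `sparse_block` and `dense_block` are hypothetical stand-ins for the document-term matrix and the engineered features.

```python
import scipy.sparse as sp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: the first 23 columns play the sparse block, the last 7 the dense one.
X_all, y_demo = make_classification(n_samples=400, n_features=30,
                                    n_informative=8, random_state=42)
sparse_block = sp.csr_matrix(X_all[:, :23])
dense_block = X_all[:, 23:]
X_demo = sp.hstack((sparse_block, dense_block), format="csr")

# Random forests accept sparse input and split on thresholds, so no scaling step is used.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
rf_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Accuracy: {rf_scores.mean():0.4f} (+/- {rf_scores.std():0.4f})")
```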

---