# CHAPTER 4: FEATURE ENGINEERING AND SELECTION
## Text data
In this document, I cover feature engineering on text data for machine learning applications. Raw text is unstructured, so it has to be cleaned and converted into numeric features before most learning algorithms can use it.

Using Python libraries such as NLTK, scikit-learn, and gensim, I demonstrate text normalization, bag-of-words and TF-IDF vectorization, document similarity and clustering, topic models, and word embeddings.

#### *Jose Ruben Garcia Garcia*
#### *February 2024*
#### *Reference: Practical Machine Learning Python Problems Solver*

## Feature engineering on text data

### Data preparation

In [38]:
#importing libraries
import pandas as pd
import numpy as np
import re
import nltk

In [3]:
#Creating a corpus to manage
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

|  | Document | Category |
|---|---|---|
| 0 | The sky is blue and beautiful. | weather |
| 1 | Love this blue and beautiful sky! | weather |
| 2 | The quick brown fox jumps over the lazy dog. | animals |
| 3 | The brown fox is quick and the blue dog is lazy! | animals |
| 4 | The sky is very blue and the sky is very beaut... | weather |
| 5 | The dog is lazy but the brown fox is quick! | animals |


### Text pre-processing

Before feature engineering, the text has to be preprocessed, cleaned, and normalized. There are many preprocessing techniques, some of them quite intricate. This section does not cover each one in depth; a later chapter on text classification and sentiment analysis explores many of them in more detail. Some popular preprocessing techniques are:

- Text tokenization and conversion to lowercase
- Removal of special characters
- Expansion of contractions
- Elimination of stopwords
- Spelling correction
- Stemming
- Lemmatization
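
The normalization function in this section does not expand contractions, so the following is a minimal sketch of how that step could look, using only the standard library. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical names introduced for illustration, and the mapping is deliberately tiny.

```python
import re

# Hypothetical, deliberately small mapping; a real project would use a fuller list.
CONTRACTIONS = {
    "isn't": "is not",
    "it's": "it is",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation pattern that matches any key as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded, lowercase form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1).lower()], text)


expanded = expand_contractions("It's blue and it isn't grey")
print(expanded)
```

The replacement is always lowercase, which is acceptable here because the text is lowercased during normalization anyway.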

In [4]:
wpt = nltk.WordPunctTokenizer()
# The stopword list has to be downloaded once beforehand: nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
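    # remove special characters (the pattern keeps letters, digits, and whitespace);
    # flags must be passed by keyword, because the fourth positional argument of re.sub is count
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)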
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

In [5]:
norm_corpus = normalize_corpus(corpus)
norm_corpus

array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog', 'brown fox quick blue dog lazy',
       'sky blue sky beautiful today', 'dog lazy brown fox quick'],
      dtype='<U30')

### Bag of Words Model

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1],
       [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]])

In [10]:
#Getting feature names
vocab = cv.get_feature_names_out()
pd.DataFrame(cv_matrix, columns=vocab)

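|  | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| 5 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |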
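
To make the counts above less of a black box, here is a small standard-library sketch of the same idea: build a sorted vocabulary, then count each term per document. It assumes whitespace tokenization, which happens to agree with `CountVectorizer` on this normalized corpus but is not how the class tokenizes in general. The names `docs`, `vocabulary`, and `bow_matrix` are illustrative.

```python
from collections import Counter

# The normalized corpus from the earlier cell, copied here so the sketch runs on its own.
docs = [
    'sky blue beautiful',
    'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'brown fox quick blue dog lazy',
    'sky blue sky beautiful today',
    'dog lazy brown fox quick',
]

# Vocabulary: every distinct token, sorted alphabetically to fix the column order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term, each cell a raw count.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Word order is discarded entirely, which is why the representation is called a "bag" of words.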


### Bag of N-Grams Model

In [12]:
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)

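|  | beautiful sky | beautiful today | blue beautiful | blue dog | blue sky | brown fox | dog lazy | fox jumps | fox quick | jumps lazy | lazy brown | lazy dog | love blue | quick blue | quick brown | sky beautiful | sky blue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |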
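
Each column above is a bigram, that is, a pair of adjacent tokens. As a rough sketch of how such n-grams can be produced by hand, the hypothetical helper `ngrams` below slides a window of length `n` over a token list:

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, each joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'sky blue sky beautiful today'.split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Unlike single-word counts, bigrams keep some local word order: `sky blue` and `blue sky` become separate features.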


### TF-IDF Model


In [15]:
# Import the TfidfVectorizer class from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object (tv) with specified parameters:
# min_df=0.: Ignore terms that have a document frequency strictly lower than this threshold.
# max_df=1.: Ignore terms that have a document frequency strictly higher than this threshold.
# use_idf=True: Enable inverse-document-frequency reweighting.
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)

# Transform the normalized corpus (norm_corpus) into a TF-IDF matrix (tv_matrix)
# using the fit_transform() method of the TfidfVectorizer object.
tv_matrix = tv.fit_transform(norm_corpus)

# Convert the TF-IDF matrix to a dense NumPy array representation using toarray() method.
tv_matrix = tv_matrix.toarray()

# Retrieve the feature names (vocabulary) from the TfidfVectorizer object 
# using the get_feature_names_out() method.
vocab = tv.get_feature_names_out()

# Create a DataFrame using the TF-IDF matrix and feature names, rounded to two decimal places for clarity,
# using pd.DataFrame().
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)


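|  | beautiful | blue | brown | dog | fox | jumps | lazy | love | quick | sky | today |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.6 | 0.52 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.0 |
| 1 | 0.46 | 0.39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.66 | 0.0 | 0.46 | 0.0 |
| 2 | 0.0 | 0.0 | 0.38 | 0.38 | 0.38 | 0.54 | 0.38 | 0.0 | 0.38 | 0.0 | 0.0 |
| 3 | 0.0 | 0.36 | 0.42 | 0.42 | 0.42 | 0.0 | 0.42 | 0.0 | 0.42 | 0.0 | 0.0 |
| 4 | 0.36 | 0.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.72 | 0.52 |
| 5 | 0.0 | 0.0 | 0.45 | 0.45 | 0.45 | 0.0 | 0.45 | 0.0 | 0.45 | 0.0 | 0.0 |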
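
The weights above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document row. The helper names `idf` and `tfidf_row` are illustrative and not part of any library.

```python
import math

# The normalized corpus, copied here so the sketch runs on its own.
docs = [
    'sky blue beautiful',
    'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'brown fox quick blue dog lazy',
    'sky blue sky beautiful today',
    'dog lazy brown fox quick',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw term count times idf, then scaled so the row has unit L2 norm."""
    raw = {term: tokens.count(term) * idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


row0 = tfidf_row(tokenized[0])
row4 = tfidf_row(tokenized[4])
print({term: round(weight, 2) for term, weight in sorted(row0.items())})
print({term: round(weight, 2) for term, weight in sorted(row4.items())})
```

Terms that occur in fewer documents, such as `today`, get a larger idf and therefore more weight than common terms such as `blue`.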


### Document Similarity


In [16]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df

|  | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.753128 | 0.0 | 0.185447 | 0.807539 | 0.0 |
| 1 | 0.753128 | 1.0 | 0.0 | 0.139665 | 0.608181 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.784362 | 0.0 | 0.839987 |
| 3 | 0.185447 | 0.139665 | 0.784362 | 1.0 | 0.109653 | 0.933779 |
| 4 | 0.807539 | 0.608181 | 0.0 | 0.109653 | 1.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.839987 | 0.933779 | 0.0 | 1.0 |
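
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following sketch computes it in plain Python for three documents, using the TF-IDF rows rounded to two decimals as printed in the TF-IDF section. Because of that rounding the results only approximate the matrix above. The function name `cosine` is illustrative.

```python
import math


def cosine(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rounded TF-IDF rows for documents 0, 1, and 2 (columns in vocabulary order).
doc0 = [0.6, 0.52, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.0]
doc1 = [0.46, 0.39, 0.0, 0.0, 0.0, 0.0, 0.0, 0.66, 0.0, 0.46, 0.0]
doc2 = [0.0, 0.0, 0.38, 0.38, 0.38, 0.54, 0.38, 0.0, 0.38, 0.0, 0.0]

sim_01 = cosine(doc0, doc1)  # two weather documents with shared terms
sim_02 = cosine(doc0, doc2)  # no shared terms at all
print(sim_01, sim_02)
```

Documents 0 and 1 score roughly 0.75 because they share `sky`, `blue`, and `beautiful`, while documents 0 and 2 share no terms and score zero.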


### Clustering documents using similarity features

In [18]:
# Import the KMeans class from sklearn.cluster
from sklearn.cluster import KMeans

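# Initialize a KMeans object (km) with specified parameters:
# n_clusters=2: Number of clusters to form.
# n_init=10: Number of random initializations; set explicitly so scikit-learn does not warn about its changing default.
km = KMeans(n_clusters=2, n_init=10)

# Fit the KMeans model to the similarity DataFrame (similarity_df).
# fit() is sufficient here because the transformed distances are not used.
km.fit(similarity_df)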

# Retrieve the cluster labels assigned to each data point from the fitted KMeans model using the labels_ attribute.
cluster_labels = km.labels_

# Create a DataFrame to store the cluster labels with the column name 'ClusterLabel'
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])

# Concatenate the original corpus DataFrame (corpus_df) with the cluster labels DataFrame along the columns axis.
pd.concat([corpus_df, cluster_labels], axis=1)




|  | Document | Category | ClusterLabel |
|---|---|---|---|
| 0 | The sky is blue and beautiful. | weather | 1 |
| 1 | Love this blue and beautiful sky! | weather | 1 |
| 2 | The quick brown fox jumps over the lazy dog. | animals | 0 |
| 3 | The brown fox is quick and the blue dog is lazy! | animals | 0 |
| 4 | The sky is very blue and the sky is very beaut... | weather | 1 |
| 5 | The dog is lazy but the brown fox is quick! | animals | 0 |
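
Cluster IDs from K-Means are arbitrary: the same grouping may come back as 0/1 in one run and 1/0 in another. A simple way to check whether a clustering matches the known categories regardless of label names is sketched below. The helper `same_partition` is a hypothetical name and not part of scikit-learn.

```python
def same_partition(labels_a, labels_b):
    """Return True when both labelings group the items identically, ignoring label names."""
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must always pair with the same label on the other side.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


categories = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
cluster_ids = [1, 1, 0, 0, 1, 0]  # the ClusterLabel column shown above

matches = same_partition(categories, cluster_ids)
print(matches)
```

For larger or noisier data, a graded measure such as scikit-learn's `adjusted_rand_score` is usually more informative than this all-or-nothing check.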


### Topic models

In [21]:
# Import the LatentDirichletAllocation class from sklearn.decomposition
from sklearn.decomposition import LatentDirichletAllocation

# Initialize a LatentDirichletAllocation object (lda) with specified parameters:
# n_components=2: Number of topics to find.
# max_iter=100: Maximum number of iterations for the EM algorithm.
# random_state=42: Seed for random number generation for reproducibility.
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)

# Fit the LatentDirichletAllocation model to the TF-IDF matrix (tv_matrix)
# using the fit_transform() method.
dt_matrix = lda.fit_transform(tv_matrix)

# Create a DataFrame to store the topic distributions for each document
# with column names 'T1' and 'T2'
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features

|  | T1 | T2 |
|---|---|---|
| 0 | 0.190548 | 0.809452 |
| 1 | 0.176804 | 0.823196 |
| 2 | 0.846184 | 0.153816 |
| 3 | 0.814863 | 0.185137 |
| 4 | 0.180516 | 0.819484 |
| 5 | 0.839172 | 0.160828 |
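
Each row above is a distribution over the two topics, so a document can be assigned to its dominant topic by taking the position of the largest value. A minimal sketch, with the document-topic values copied from the output above and illustrative variable names:

```python
# Document-topic weights (T1, T2) copied from the output above.
doc_topics = [
    (0.190548, 0.809452),
    (0.176804, 0.823196),
    (0.846184, 0.153816),
    (0.814863, 0.185137),
    (0.180516, 0.819484),
    (0.839172, 0.160828),
]

# Index of the largest weight in each row: 0 means T1, 1 means T2.
dominant = [max(range(len(row)), key=lambda k: row[k]) for row in doc_topics]
print(dominant)
```

Documents 0, 1, and 4 lean toward T2 and the rest toward T1, which lines up with the weather and animals categories in this corpus.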


### Show topics and their weights


In [24]:
from sklearn.decomposition import LatentDirichletAllocation

# Initialize LDA with specified parameters
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)

# Fit LDA model to the TF-IDF matrix
dt_matrix = lda.fit_transform(tv_matrix)

# Retrieve the topic-term matrix
tt_matrix = lda.components_

# Iterate over each topic to print top tokens with weights
for topic_weights in tt_matrix:
    topic = [(token, weight) for token, weight in zip(vocab, topic_weights)]
    # Sort the topic tokens by weight in descending order
    topic = sorted(topic, key=lambda x: -x[1])
    # Keep only tokens with a weight above 0.6
    topic = [item for item in topic if item[1] > 0.6]
    # Print the top tokens for the topic
    print(topic)
    print()


[('brown', 1.7273638692668467), ('dog', 1.7273638692668467), ('fox', 1.7273638692668467), ('lazy', 1.7273638692668467), ('quick', 1.7273638692668467), ('jumps', 1.0328325272484777), ('blue', 0.7731573162915626)]

[('sky', 2.264386643135622), ('beautiful', 1.9068269319456903), ('blue', 1.7996282104933266), ('love', 1.148127242397004), ('today', 1.0068251160429935)]



### Clustering documents using topic model features

In [25]:
# Import the KMeans class from sklearn.cluster
from sklearn.cluster import KMeans

# Initialize a KMeans object (km) with specified parameters:
# n_clusters=2: Number of clusters to form.
# n_init=10: Number of random initializations; set explicitly so scikit-learn does not warn about its changing default.
km = KMeans(n_clusters=2, n_init=10)

# Fit the KMeans model to the topic features DataFrame (features).
# fit() is sufficient here because the transformed distances are not used.
km.fit(features)

# Retrieve the cluster labels assigned to each data point from the fitted KMeans model using the labels_ attribute.
cluster_labels = km.labels_

# Create a DataFrame to store the cluster labels with the column name 'ClusterLabel'
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])

# Concatenate the original corpus DataFrame (corpus_df) with the cluster labels DataFrame along the columns axis.
# axis=1: Concatenate along the columns axis.
pd.concat([corpus_df, cluster_labels], axis=1)




|  | Document | Category | ClusterLabel |
|---|---|---|---|
| 0 | The sky is blue and beautiful. | weather | 0 |
| 1 | Love this blue and beautiful sky! | weather | 0 |
| 2 | The quick brown fox jumps over the lazy dog. | animals | 1 |
| 3 | The brown fox is quick and the blue dog is lazy! | animals | 1 |
| 4 | The sky is very blue and the sky is very beaut... | weather | 0 |
| 5 | The dog is lazy but the brown fox is quick! | animals | 1 |


### Word Embeddings

Word embeddings can be used for feature extraction and language modeling. This representation maps each word or phrase to a dense numeric vector such that semantically similar words tend to lie close to each other, and that closeness can be quantified from the embeddings.

For this example I will use the Word2Vec model, originally developed at Google and implemented here by gensim, with the following parameters:

- vector_size (named size before gensim 4.0): Represents the feature vector size for each word in the corpus when transformed.

- window: Sets the context window size, that is, how many neighboring words on each side of a word are treated as belonging to the same context during training.

- min_count: Specifies the minimum frequency a word needs across the corpus to be included in the final vocabulary when training the model.

- sample: Used to downsample the effects of words which occur very frequently.
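
To illustrate what the context window means, the sketch below lists the (target, context) word pairs that a window of a given size produces for one tokenized document. This is a simplification for intuition only: the actual gensim trainer adds further details, such as downsampling of frequent words, that are not modeled here. The function name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for every word and its neighbors within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = context_pairs(['sky', 'blue', 'beautiful'], window=1)
print(pairs)
```

With a window of 10, as used in this example, every word in these short documents falls inside the context of every other word in the same document.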

In [28]:
# Import the Word2Vec class from gensim.models
from gensim.models import Word2Vec

# Initialize a WordPunctTokenizer object from NLTK
wpt = nltk.WordPunctTokenizer()

# Tokenize the normalized corpus using the WordPunctTokenizer
# tokenized_corpus: A list of tokenized documents
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# Set values for various Word2Vec parameters
feature_size = 10     # Word vector dimensionality
window_context = 10   # Context window size
min_word_count = 1    # Minimum word count
sample = 1e-3         # Downsample setting for frequent words

# Create a Word2Vec model
# w2v_model: Word2Vec model trained on the tokenized corpus
w2v_model = Word2Vec(sentences=tokenized_corpus, vector_size=feature_size, 
                     window=window_context, min_count=min_word_count,
                     sample=sample)


In [29]:
w2v_model.wv['sky']

array([ 0.07380505, -0.01533471, -0.04536613,  0.06554051, -0.0486016 ,
       -0.01816018,  0.0287658 ,  0.00991874, -0.08285215, -0.09448818],
      dtype=float32)

In [34]:
def average_word_vectors(words, model, vocabulary, num_features):
    """
    Calculate the average word vector for a list of words.
    
    Parameters:
    - words: A list of words to calculate the average vector for.
    - model: The Word2Vec model containing word vectors.
    - vocabulary: A set of words in the model's vocabulary.
    - num_features: The dimensionality of the word vectors.
    
    Returns:
    - feature_vector: The average word vector for the input list of words.
    """
    # Initialize a zero vector for the feature representation
    feature_vector = np.zeros((num_features,), dtype="float64")
    # Count the words that are found in the model's vocabulary
    nwords = 0
    
    for word in words:
        # Skip words the model has no vector for
        if word in vocabulary:
            nwords += 1
            # Word vectors are accessed through model.wv in gensim 4.x
            feature_vector = np.add(feature_vector, model.wv[word])
    
    # Average only if at least one word was found, to avoid dividing by zero
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector


# NOTE: the next cell calls averaged_word_vectorizer, which this notebook never defines.
# The definition below is a reconstruction inferred from that call, not the author's verified code.
def averaged_word_vectorizer(corpus, model, num_features):
    """Return one averaged word vector per tokenized document in corpus."""
    # index_to_key lists the vocabulary words in gensim 4.x
    vocabulary = set(model.wv.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)



In [35]:
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array)

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.041 | 0.023496 | -0.002957 | 0.021184 | -0.032642 | -0.02787 | 0.055925 | 0.030505 | -0.05313 | -0.073217 |
| 1 | 0.009201 | 0.026787 | 0.010757 | 0.030243 | -0.005814 | -0.036322 | 0.044707 | 0.037997 | -0.046948 | -0.070347 |
| 2 | -0.029849 | 0.023591 | 0.004554 | -0.029321 | 0.034318 | -0.004285 | -0.009695 | 0.001931 | -0.013142 | 0.030991 |
| 3 | -0.033462 | 0.023669 | 0.007271 | -0.014668 | 0.002782 | -0.02458 | 0.015932 | 0.028622 | -0.023004 | 0.014064 |
| 4 | 0.037648 | 0.016684 | -4.4e-05 | 0.039924 | -0.040712 | -0.016636 | 0.051486 | 0.010691 | -0.054663 | -0.049233 |
| 5 | -0.039082 | 0.02793 | -0.001482 | -0.03562 | 0.021944 | -0.015263 | 0.006201 | 0.016401 | -0.017574 | 0.024404 |


In [36]:
# Import the AffinityPropagation class from sklearn.cluster
from sklearn.cluster import AffinityPropagation

# Initialize an AffinityPropagation object (ap)
ap = AffinityPropagation()

# Fit the AffinityPropagation model to the Word2Vec feature array (w2v_feature_array)
ap.fit(w2v_feature_array)

# Retrieve the cluster labels assigned to each data point from the fitted AffinityPropagation model
cluster_labels = ap.labels_

# Create a DataFrame to store the cluster labels with the column name 'ClusterLabel'
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])

# Concatenate the original corpus DataFrame (corpus_df) with the cluster labels DataFrame along the columns axis
# axis=1: Concatenate along the columns axis
pd.concat([corpus_df, cluster_labels], axis=1)


|  | Document | Category | ClusterLabel |
|---|---|---|---|
| 0 | The sky is blue and beautiful. | weather | 0 |
| 1 | Love this blue and beautiful sky! | weather | 0 |
| 2 | The quick brown fox jumps over the lazy dog. | animals | 1 |
| 3 | The brown fox is quick and the blue dog is lazy! | animals | 1 |
| 4 | The sky is very blue and the sky is very beaut... | weather | 0 |
| 5 | The dog is lazy but the brown fox is quick! | animals | 1 |
