
## **Feature Engineering**


### Feature Engineering Libraries
1. `pandas`: is a powerful data manipulation and analysis library for Python, providing data structures and operations for manipulating numerical tables and time series.

2. `gensim.models.Word2Vec`: is a tool for learning word embeddings and deep learning models for natural language processing.

3. `sklearn.preprocessing.OneHotEncoder` is used for converting categorical variables into a form that can be provided to ML algorithms to do a better job in prediction.

4. `numpy` is a fundamental package for scientific computing with Python, providing support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

In [14]:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.preprocessing import OneHotEncoder
import numpy as np 


##### Code Explanation

- `Feature Engineering Class`

    The "FeatureEngineering" class facilitates the creation of feature vectors using word embeddings and one-hot encoding for categorical variables. Here's a breakdown of its functions:

    1. `__init__`: Initializes the class by reading the input and output CSV files into dataframes, and extracts essential columns like "entity", "tweetcontent", and "sentiment".

    2. `average_word_vectors`: Computes the average word vectors based on a given Word2Vec model, vocabulary, and the number of features.

    3. `word_to_vec`: Processes the text data, tokenizes it, trains a Word2Vec model to create word embeddings, transforms tokenized text into average word vectors, performs one-hot encoding on the "entity" column, combines word vectors and entity features, and saves the resulting feature vectors as a CSV file specified by the "save_path" parameter.



In [13]:
class FeatureEngineering:
    def __init__(self, x_path, y_path):
        self.df_x = pd.read_csv(x_path)
        self.df_y = pd.read_csv(y_path)
        self.entity = self.df_x['entity']
        self.text = self.df_x["tweetcontent"]
        self.sentiment = self.df_y["sentiment"]
    
    # Convert tokenized text to average word vectors
    def average_word_vectors(self, words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary:
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])

        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector
    
    def word_to_vec(self, save_path):
        
        # Preprocess the text data to ensure there are no NaNs and all entries are strings
        self.text = self.text.fillna("")  # Replace NaNs with empty strings
        self.text = self.text.astype(str)  # Ensure all entries are of type str
        # Preprocess and tokenize the text data
        tokenized_text = self.text.str.split()  # Tokenize the text data

        # Train Word2Vec model to create word embeddings
        word2vec_model = Word2Vec(sentences=tokenized_text, vector_size=100, window=5, min_count=1, workers=4)  # Adjust parameters as needed

        
        # Transform each tokenized text into average word vectors
        X_text_train_w2v = [ self.average_word_vectors(tokens, word2vec_model, word2vec_model.wv.index_to_key, 100) for tokens in self.text]

        # Perform one-hot encoding on the "entity" column
        # Note: One-hot encoding might not be suitable if "entity" has many unique values, but here we dont have problem.
        encoder = OneHotEncoder(sparse=False)
        entity_encoded = encoder.fit_transform(self.df_x[['entity']])

        # Combine word vectors and entity features
        vectorized_feature = pd.concat([pd.DataFrame(X_text_train_w2v), pd.DataFrame(entity_encoded)], axis=1)
        vectorized_feature.to_csv(save_path)
        # vectorized_feature.to_csv("/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature")
        return vectorized_feature
    


### Feature extraction for each dataset
the result are saved in dataset/processed/...


`twitter_training`

`twitter_test`

`twitter_validation`

In [10]:
x_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/x_twitter_training.csv"
y_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_training.csv"
save_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_train.csv"
word2vec = FeatureEngineering(x_path, y_path)
word2vec.word_to_vec(save_path)



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,-0.510210,0.329607,0.187727,-0.130115,0.208569,-0.171933,-0.023382,0.895186,-0.183727,0.304573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.604116,0.389101,0.104229,-0.138763,0.233229,-0.189435,0.014863,1.039501,-0.282258,0.294499,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.632097,0.368336,0.085285,-0.100172,0.270511,-0.211743,-0.040240,1.033592,-0.283484,0.348261,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.513339,0.324849,0.163870,-0.134573,0.206742,-0.201047,-0.008785,0.901811,-0.177600,0.292814,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.510210,0.329607,0.187727,-0.130115,0.208569,-0.171933,-0.023382,0.895186,-0.183727,0.304573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74677,-0.780766,0.418759,-0.031023,-0.067492,0.401715,-0.208004,-0.083094,1.196726,-0.350497,0.479689,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74678,-0.767273,0.410999,-0.027105,-0.065119,0.393747,-0.198727,-0.083464,1.172742,-0.343818,0.466080,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74679,-0.762606,0.402580,-0.027047,-0.057850,0.396219,-0.197538,-0.088723,1.159278,-0.338355,0.461525,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74680,-0.738909,0.420830,-0.005152,-0.086406,0.377505,-0.187701,-0.069887,1.177424,-0.335205,0.446001,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
x_test_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/x_twitter_test.csv"
y_test_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_test.csv"
save_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_test.csv"
word2vec = FeatureEngineering(x_test_path, y_test_path)
word2vec.word_to_vec(save_path)



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,-0.001880,-0.003641,-0.002030,-0.001055,0.001244,-0.001371,0.002660,0.000059,-0.002425,-0.000343,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.001808,-0.001752,0.000165,-0.002452,0.001323,-0.000767,0.000862,-0.001034,-0.003024,0.000989,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.001147,-0.003895,0.000904,-0.001410,0.001812,-0.000254,0.000636,-0.000288,-0.002540,0.001465,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.001035,-0.004208,-0.000856,-0.002770,0.000080,-0.001193,0.002557,-0.000088,-0.000201,0.000353,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.001014,-0.004061,0.001509,-0.001045,0.000038,-0.000557,0.001927,-0.001203,-0.001120,-0.000528,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,-0.001304,-0.004325,0.000982,-0.003215,0.000355,-0.001381,0.000432,-0.001408,-0.000582,0.002490,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
496,-0.002101,-0.001763,-0.000681,-0.002095,0.001591,-0.001691,0.000953,-0.001007,-0.002414,0.002264,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
497,-0.002849,-0.001472,-0.004176,-0.001352,-0.000116,-0.004096,0.002124,-0.001306,-0.000959,0.000502,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
498,-0.000761,-0.002292,-0.000535,-0.001636,0.000952,-0.002170,0.001425,-0.000049,-0.002728,0.000926,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
x_validate_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/x_twitter_validation.csv"
y_validate_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/y_twitter_validation.csv"
save_path = "/home/asma-rashidian/Documents/DrRahmani_projects/project2-DM-24-Azar-1402/dataset/processed/vectorized_feature_validation.csv"
word2vec = FeatureEngineering(x_test_path, y_test_path)
word2vec.word_to_vec(save_path)



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,-0.001880,-0.003641,-0.002030,-0.001055,0.001244,-0.001371,0.002660,0.000059,-0.002425,-0.000343,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,-0.001808,-0.001752,0.000165,-0.002452,0.001323,-0.000767,0.000862,-0.001034,-0.003024,0.000989,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.001147,-0.003895,0.000904,-0.001410,0.001812,-0.000254,0.000636,-0.000288,-0.002540,0.001465,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.001035,-0.004208,-0.000856,-0.002770,0.000080,-0.001193,0.002557,-0.000088,-0.000201,0.000353,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.001014,-0.004061,0.001509,-0.001045,0.000038,-0.000557,0.001927,-0.001203,-0.001120,-0.000528,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,-0.001304,-0.004325,0.000982,-0.003215,0.000355,-0.001381,0.000432,-0.001408,-0.000582,0.002490,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
496,-0.002101,-0.001763,-0.000681,-0.002095,0.001591,-0.001691,0.000953,-0.001007,-0.002414,0.002264,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
497,-0.002849,-0.001472,-0.004176,-0.001352,-0.000116,-0.004096,0.002124,-0.001306,-0.000959,0.000502,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
498,-0.000761,-0.002292,-0.000535,-0.001636,0.000952,-0.002170,0.001425,-0.000049,-0.002728,0.000926,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
