# Dataset: Finance. Feature extractor: TF-IDF and GloVe, concatenated

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score
from scipy.sparse import hstack


In [5]:
# Load the dataset into a pandas DataFrame
df = pd.read_csv('/content/finance.csv')
df.head()


Unnamed: 0,Sentence,Sentiment
0,The GeoSolutions technology will leverage Bene...,positive
1,"$ESI on lows, down $1.50 to $2.50 BK a real po...",negative
2,"For the last quarter of 2010 , Componenta 's n...",positive
3,According to the Finnish-Russian Chamber of Co...,neutral
4,The Swedish buyout firm has sold its remaining...,neutral


In [6]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['Sentence'], df['Sentiment'], test_size=0.2, random_state=42)

In [7]:
# Create TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

In [25]:
!pip install -U gensim
import gensim.downloader as api
glove_vectors = api.load('glove-wiki-gigaword-100')




In [30]:
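def get_word_embeddings(texts):
    """Return one averaged GloVe vector per text (whitespace tokenization)."""
    embeddings = []
    for text in texts:
        tokens = text.split()
        word_embeddings = []
        for word in tokens:
            if word in glove_vectors.key_to_index:
                word_embeddings.append(glove_vectors[word])
            else:
                # Out-of-vocabulary tokens contribute a zero vector
                word_embeddings.append(np.zeros(glove_vectors.vector_size))
        if not word_embeddings:
            # Empty text: avoid np.mean over an empty list, which yields NaN
            word_embeddings.append(np.zeros(glove_vectors.vector_size))
        embedding = np.mean(word_embeddings, axis=0)
        embeddings.append(embedding)
    return np.array(embeddings)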

In [31]:
X_train_glove = get_word_embeddings(X_train)
X_test_glove = get_word_embeddings(X_test)

# Combine TF-IDF features and GloVe embeddings using hstack
X_train_combined = hstack((X_train_features, X_train_glove))
X_test_combined = hstack((X_test_features, X_test_glove))


In [33]:
svm = SVC()
svm.fit(X_train_combined, y_train)


In [35]:
# Make predictions on the test set
y_pred = svm.predict(X_test_combined)

# Calculate evaluation metrics
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)

Precision: 0.7044145631015707
Recall: 0.7100085543199316
F1 Score: 0.6725764420236192


The provided code performs text classification using a combination of TF-IDF features and pre-trained GloVe word embeddings. Here's a technical summary of the code:

1. The dataset is loaded from a CSV file into a pandas DataFrame. The code uses two of its columns: 'Sentence', containing the input text, and 'Sentiment', containing the corresponding labels (positive, negative, or neutral).

2. The dataset is split into training and testing sets using the `train_test_split` function from scikit-learn. The split uses an 80% training and 20% testing ratio (`test_size=0.2`), with `random_state=42` so that it is reproducible.

3. The TF-IDF vectorizer is initialized using `TfidfVectorizer` from scikit-learn. This vectorizer converts the text data into numerical features based on term frequency-inverse document frequency.

4. The TF-IDF vectorizer is fit on the training data using its `fit_transform` method, and the same vectorizer is applied to transform the testing data using the `transform` method.

5. The pre-trained GloVe word embeddings are loaded using the `gensim.downloader` module. In this case, the 'glove-wiki-gigaword-100' pre-trained vectors are used.

6. A function `get_word_embeddings` is defined to convert the text data into GloVe-based features. For each text sample, the function splits the text on whitespace, looks up the GloVe vector for each token (substituting a zero vector for tokens that are not in the GloVe vocabulary), and averages the vectors to obtain a single fixed-length representation of the whole sample.

7. The `get_word_embeddings` function is applied to both the training and testing data to obtain the GloVe embeddings.

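8. The TF-IDF features and GloVe embeddings are combined by concatenating them horizontally using `hstack` from `scipy.sparse`, which accepts the sparse TF-IDF matrix alongside the dense embedding array and returns a sparse matrix.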

9. An SVM classifier is initialized using `SVC` from scikit-learn with its default hyperparameters, which include an RBF kernel.

10. The SVM classifier is trained on the combined features (`X_train_combined`) and corresponding labels (`y_train`) using the `fit` method.

11. Predictions are made on the test set by calling the `predict` method of the SVM classifier with the combined features (`X_test_combined`).

12. Evaluation metrics (precision, recall, and F1 score) are calculated using `precision_score`, `recall_score`, and `f1_score` from scikit-learn. Because the target has three classes, the `average` parameter is set to `'weighted'`: each metric is computed per class and then averaged, with each class weighted by its number of true samples in the test set.

13. The precision, recall, and F1 score are printed as the final output.
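
The averaging described in step 6 can be illustrated without downloading the GloVe vectors. The sketch below is a minimal, hedged example: `toy_vectors`, `VECTOR_SIZE`, and `average_embedding` are hypothetical names, the 3-dimensional vectors are made up, and the lowercasing step is an assumption that the notebook itself does not make (the 'glove-wiki-gigaword-100' vocabulary appears to be lowercase, so capitalized tokens would otherwise be treated as unknown).

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the 100-dimensional GloVe table.
toy_vectors = {
    "profit": np.array([1.0, 0.0, 2.0]),
    "rose": np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3


def average_embedding(text, vectors, size):
    """Mean of per-token vectors; unknown tokens contribute a zero vector."""
    # Lowercasing is an assumption added here; the notebook splits without it.
    tokens = text.lower().split()
    if not tokens:
        # Guard for empty input: np.mean over an empty list would give NaN.
        return np.zeros(size)
    rows = [vectors.get(token, np.zeros(size)) for token in tokens]
    return np.mean(rows, axis=0)


known = average_embedding("Profit rose", toy_vectors, VECTOR_SIZE)
with_oov = average_embedding("Profit rose sharply", toy_vectors, VECTOR_SIZE)
empty = average_embedding("", toy_vectors, VECTOR_SIZE)

print(known)     # mean of the two known vectors
print(with_oov)  # the unknown token "sharply" pulls the mean toward zero
print(empty)     # all zeros
```

Note that treating unknown tokens as zero vectors shrinks the average for texts with many out-of-vocabulary tokens; skipping such tokens instead is a possible alternative, though it would change the features and therefore the reported scores.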

This code combines TF-IDF features with pre-trained GloVe word embeddings and trains an SVM classifier to perform text classification. On the held-out test set, the classifier reaches a weighted precision of about 0.704, a weighted recall of about 0.710, and a weighted F1 score of about 0.673.
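
To make the `'weighted'` averaging concrete, here is a minimal pure-Python sketch of a support-weighted F1 score. The function name `weighted_f1` and the toy labels are hypothetical and are not taken from the finance dataset; the sketch is meant to mirror what `f1_score(..., average='weighted')` computes, not to replace it.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to its support.
        total += f1 * n_true
    return total / len(y_true)


# Toy labels for illustration only.
y_true = ["neutral", "neutral", "neutral", "positive", "positive", "negative"]
y_pred = ["neutral", "neutral", "positive", "positive", "positive", "neutral"]
score = weighted_f1(y_true, y_pred)
print(round(score, 3))  # 0.6
```

Because the weighted F1 is an average of per-class F1 scores rather than the harmonic mean of the weighted precision and weighted recall, it can fall below both of them, as it does in the output reported above.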