## Portfolio 2 - script 1 - tfidf vectorizer

*By Sofie Mosegaard, 07-03-2024*

In this script, the text data is vectorized and the resulting feature matrices are saved to disk as objects. This way, the data only needs to be vectorized once in total rather than once per script.

### Import packages

In [29]:
# System tools
import os
import sys
import scipy as sp
import scipy.sparse  # load the submodule explicitly so that sp.sparse is available

# Data munging tools
import pandas as pd

# Machine learning packages
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For saving
from joblib import dump, load

### Reading in the data

In [3]:
# Load the CSV file into a pandas DataFrame
filepath = os.path.join(
                        "..",
                        "in",
                        "fake_or_real_news.csv"
                        )

data = pd.read_csv(filepath)

In [4]:
# Create the data variables
X = data["text"]
y = data["label"]

### Train-test split

An 80:20 train:test split is created from the input X (the text for the model) and y (the classification labels). To ensure reproducibility, the random state is set to 123.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,    
                                                    test_size = 0.2,
                                                    random_state = 123)
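
As a minimal sketch of what the split does, the example below runs the same call on a hypothetical toy DataFrame (the `toy` data is made up for illustration and is not the news dataset). It only shows the resulting sizes and that a fixed `random_state` gives the same split on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the "text" and "label" columns
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": ["FAKE", "REAL"] * 5,
})

# Same settings as in the notebook: 20% test data, fixed random state
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"], toy["label"], test_size=0.2, random_state=123
)

# Running the split again with the same random state selects the same rows
X_tr_again, X_te_again, _, _ = train_test_split(
    toy["text"], toy["label"], test_size=0.2, random_state=123
)

print(len(X_tr), len(X_te))
print(list(X_te.index) == list(X_te_again.index))
```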

### Vectorizing and feature extraction

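Next, I want to create a vectorizer object. The vectorizer will use unigrams and bigrams, lowercase all text so that the same word is always counted as the same token, ignore very common terms (those occurring in more than 95% of the documents) and very rare terms (those occurring in fewer than 5% of the documents), and keep only the top 500 features, ranked by term frequency across the corpus.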

The vectorizer object will then be used to transform the text data into vectors of numbers.

In [6]:
# Create vectorizer object
vectorizer = TfidfVectorizer(ngram_range = (1,2),
                             lowercase =  True,
                             max_df = 0.95,
                             min_df = 0.05,
                             max_features = 500)

In [7]:
# Fit the vectorizer to the training data and transform it
X_train_features = vectorizer.fit_transform(X_train)

# Transform the test data with the already fitted vectorizer (no refitting)
X_test_features = vectorizer.transform(X_test)
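
The difference between `fit_transform` and `transform` can be sketched on a hypothetical pair of tiny train and test sets (the sentences below are made up). The vocabulary is learned from the training texts only, so a word that appears only in the test text is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, not the news dataset
train_texts = ["fake news spreads fast", "real news takes time"]
test_texts = ["breaking news spreads"]

demo_vectorizer = TfidfVectorizer()

# Learn the vocabulary and weights from the training texts, then transform them
train_matrix = demo_vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary for the test texts
test_matrix = demo_vectorizer.transform(test_texts)

# Both matrices have one column per term in the training vocabulary
print(train_matrix.shape, test_matrix.shape)

# "breaking" was never seen during fitting, so it has no column
print("breaking" in demo_vectorizer.vocabulary_)
```

Fitting on the training data only keeps information from the test set out of the features.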

### Save the vectorizer and feature extracted objects 

As vectorizing and feature extraction can be very time-consuming on larger datasets, I will save the vectorizer and the vectorized feature matrices. By doing so, later scripts can simply load the objects and use the extracted features.

In [38]:
# Save the vectorizer
dump(vectorizer, "../models/tfidf_vectorizer.joblib")

['../models/tfidf_vectorizer.joblib']

In [None]:
# Save the SciPy sparse matrix objects (X_train_features and X_test_features)
sp.sparse.save_npz('../models/X_train_features_sparse_matrix.npz', X_train_features)
sp.sparse.save_npz('../models/X_test_features_sparse_matrix.npz', X_test_features)
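
A later script could load these objects back roughly as follows. This is a minimal round-trip sketch on a hypothetical toy corpus that writes to a temporary directory instead of `../models/`; the file names inside the temporary directory are placeholders. It uses `joblib.load` for the vectorizer and `scipy.sparse.load_npz` for the matrix.

```python
import os
import tempfile

from joblib import dump, load
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not the news dataset
docs = ["fake news spreads fast", "real news takes time"]

vec = TfidfVectorizer()
features = vec.fit_transform(docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder paths inside a temporary directory
    vec_path = os.path.join(tmp_dir, "tfidf_vectorizer.joblib")
    matrix_path = os.path.join(tmp_dir, "features_sparse_matrix.npz")

    # Save both objects
    dump(vec, vec_path)
    sparse.save_npz(matrix_path, features)

    # Load them back, as a later script would
    vec_loaded = load(vec_path)
    features_loaded = sparse.load_npz(matrix_path)

# The reloaded objects match the originals
same_vocabulary = vec_loaded.vocabulary_ == vec.vocabulary_
same_matrix = bool((features_loaded.toarray() == features.toarray()).all())
print(same_vocabulary, same_matrix)
```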