<a href="https://colab.research.google.com/github/SergioJF10/MLT-ESI-UCLM_CIS/blob/main/products/Notebooks/Models/TF_IDF_POS_and_Extra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TF-IDF Vectorization, POS Tagging and Extra Features
In this Colab notebook, we develop the fourth and final approach: TF-IDF vectorization with POS tagging, plus two extra features:
1. Number of words in an opinion.
2. Number of sentences in an opinion.

Again, note that we did not include N-grams because of their high memory demand.
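
To give a rough sense of why N-grams are costly, the sketch below counts the distinct terms in a toy corpus with and without bigrams. The `ngram_vocabulary` helper and the three example sentences are illustrative assumptions, not part of the project code, and real vectorizers tokenize differently.

```python
def ngram_vocabulary(docs, max_n):
    """Collect every distinct n-gram (n = 1..max_n) over whitespace tokens."""
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab


# Hypothetical opinions, used only for illustration
toy_docs = [
    "the battery lasts long",
    "the screen is bright",
    "battery life is short",
]

unigrams = ngram_vocabulary(toy_docs, 1)
unigrams_and_bigrams = ngram_vocabulary(toy_docs, 2)

print("unigram terms:", len(unigrams))
print("unigram + bigram terms:", len(unigrams_and_bigrams))
```

Every extra term becomes an extra column in the vectorized matrix, so on a large corpus the vocabulary (and the memory needed to hold the features) can grow quickly once bigrams or trigrams are added.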

In [1]:
%%capture
!pip install nltk
import nltk
nltk.download("popular")
import json
from tqdm import tqdm
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

!cp '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning Tecniques/Natural Language Processing/products.csv' 'products.csv'
!cp '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning Tecniques/Natural Language Processing/x_train.json' 'x_train.json'
!cp '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning Tecniques/Natural Language Processing/y_train.json' 'y_train.json'
!cp '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning Tecniques/Natural Language Processing/x_test.json' 'x_test.json'
!cp '/content/gdrive/MyDrive/Colab Notebooks/Machine Learning Tecniques/Natural Language Processing/y_test.json' 'y_test.json'

Mounted at /content/gdrive


# 0. Loading the Data
From the Preprocessing notebook, we obtain the following files with the data ready to be vectorized.
- x_train.json
- x_test.json
- y_train.json
- y_test.json
- word_count.json
- sents_count.json

_Note: Please upload all the files listed above. They can be found in the "Data/Interim" folder in the `products` project folder._
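
Before loading, it may help to confirm that every file is actually in the working directory. The following is a minimal sketch; the `missing_files` helper is a hypothetical name introduced here, and it assumes the files sit in the current folder.

```python
from pathlib import Path

# File names taken from the list above
REQUIRED_FILES = [
    "x_train.json",
    "x_test.json",
    "y_train.json",
    "y_test.json",
    "word_count.json",
    "sents_count.json",
]


def missing_files(names, folder="."):
    """Return the names that do not exist as files inside `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


missing = missing_files(REQUIRED_FILES)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All input files are present.")
```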

In [4]:
x_train = []
x_test = []
y_train = []
y_test = []
word_count = []
sent_count = []
with open('x_train.json', 'r', encoding='utf-8') as x_train_file:
  x_train = json.load(x_train_file)
with open('x_test.json', 'r', encoding='utf-8') as x_test_file:
  x_test = json.load(x_test_file)
with open('y_train.json', 'r', encoding='utf-8') as y_train_file:
  y_train = json.load(y_train_file)
with open('y_test.json', 'r', encoding='utf-8') as y_test_file:
  y_test = json.load(y_test_file)
with open('word_count.json', 'r', encoding='utf-8') as word_file:
  word_count = json.load(word_file)
with open('sents_count.json', 'r', encoding='utf-8') as sent_file:
  sent_count = json.load(sent_file)

Once the file descriptors have been used, we will delete them to save RAM.

In [5]:
del x_test_file
del x_train_file
del y_test_file
del y_train_file
del word_file
del sent_file

The next step is to split the word and sentence count arrays correctly. Recall that in the NLP [preprocessing colab](https://colab.research.google.com/github/SergioJF10/MLT-ESI-UCLM_CIS/blob/main/products/Notebooks/NLP/NLP_products.ipynb#scrollTo=WnzTadlOEGas) we used [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) with `shuffle=False`, which means the data is returned in the same order as the input, only split into two parts.

So now we can simply slice the arrays with the code below: the first `len(x_train)` entries form the train part and the remaining entries form the test part, in the correct order to match the X data.

In [9]:
word_train = word_count[:len(x_train)]
word_test = word_count[len(x_train):]

sent_train = sent_count[:len(x_train)]
sent_test = sent_count[len(x_train):]
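
The slicing above only lines up if each count array has exactly one entry per opinion. As a hedged sketch, the helper below performs the same cut but checks the lengths first; `split_by_train_size` is a hypothetical name, and the toy list merely stands in for `word_count`.

```python
def split_by_train_size(values, n_train, n_test):
    """Cut `values` into train/test parts after checking the total length."""
    if len(values) != n_train + n_test:
        raise ValueError(
            f"expected {n_train + n_test} values, got {len(values)}"
        )
    return values[:n_train], values[n_train:]


# Toy data standing in for word_count (3 train opinions, 2 test opinions)
toy_counts = [12, 7, 30, 5, 18]
toy_train, toy_test = split_by_train_size(toy_counts, n_train=3, n_test=2)

print(toy_train)
print(toy_test)
```

In this notebook the equivalent call would be `split_by_train_size(word_count, len(x_train), len(x_test))`, and likewise for `sent_count`.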

# 1. TF-IDF Vectorization, POS Tagging and Extra Features
Let's now apply vectorization techniques to the preprocessed data to prepare the input for the models.