<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# ***Term Frequency - Inverse Document Frequency***

$ \ $



Term Frequency-Inverse Document Frequency (TF-IDF) is a popular technique used in natural language processing and information retrieval to quantify the importance of a word (term) in a document relative to a collection of documents. It is based on the idea that words that appear frequently in a document and are rare in other documents are likely to be more relevant in describing the content of that document.

### $\color{lightgreen}{\text{Mathematical Details}}$

Consider a collection of documents represented as a set of $N$ documents $D = \{d_1, d_2, ..., d_N\}$. Each document $d_i$ consists of a sequence of words or terms.

$(1)$ $\color{lightblue}{\text{Term Frequency (TF)}}$

The term frequency of a term $t$ in a document $d_i$ is the number of times $t$ appears in $d_i$. It is denoted as $TF(t, d_i)$. The simplest way to compute the term frequency is by using the raw count:

$$\begin{cases}
TF(t, d_i) = \text{Number of occurrences of term } t \text{ in document } d_i,\\ \\
t = \text{term}, \ d_{i} =  \text{document}
\end{cases}$$
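As a minimal sketch of the raw-count definition above, the snippet below counts terms in a single document with `collections.Counter`. The helper name `term_frequency` and the whitespace tokenization are illustrative assumptions, not part of the lab.

```python
from collections import Counter


def term_frequency(document: str) -> Counter:
    """Return the raw count TF(t, d) of every term t in one document."""
    # Assumption: lowercase and split on whitespace; real tokenizers are more careful.
    tokens = document.lower().split()
    return Counter(tokens)


tf_doc = term_frequency("We like dogs and we like cats")

# A Counter returns 0 for terms that never occur in the document.
print(tf_doc["we"], tf_doc["dogs"], tf_doc["planes"])
```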

$(2)$ $\color{lightblue}{\text{Inverse Document Frequency (IDF)}}$

The inverse document frequency of a term $t$ measures how much information the term provides across the entire document collection. It is calculated as the logarithm of the ratio of the total number of documents $|D|$ to the number of documents that contain the term $t$. The version used here adds $1$ to the denominator so that the ratio stays defined even for a term that appears in no document. The formula for IDF is:

$$\begin{cases}
IDF(t, D) = \log\left({\dfrac{|D|}{|\{d \in D: t \in d\}|+1}}\right), \\ \\
t = \text{term}, \ D = \text{set of documents}
\end{cases}$$
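The IDF formula above can be sketched in a few lines of plain Python. The function name `inverse_document_frequency` and the toy corpus are hypothetical, and each document is represented as a set of terms for simplicity.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    """IDF(t, D) = log(|D| / (df(t) + 1)), the variant written above."""
    n_docs = len(tokenized_docs)
    # df(t): number of documents that contain the term at least once
    doc_freq = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(n_docs / (doc_freq + 1))


# Hypothetical corpus of four documents, each stored as a set of terms
docs = [
    {"we", "like", "dogs"},
    {"we", "like", "cars"},
    {"they", "like", "planes"},
    {"birds", "fly"},
]

idf_dogs = inverse_document_frequency("dogs", docs)  # in 1 of 4 documents
idf_like = inverse_document_frequency("like", docs)  # in 3 of 4 documents
idf_cats = inverse_document_frequency("cats", docs)  # in no document
print(idf_dogs, idf_like, idf_cats)
```

The rarer the term, the larger its IDF; the `+ 1` keeps the last call from dividing by zero.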

$(3)$ $\color{lightblue}{\text{TF-IDF}}$

The TF-IDF score of a term $t$ in a document $d_i$ is the product of its term frequency (TF) and inverse document frequency (IDF):

$$\begin{cases}
\color{yellow}{\text{TF_IDF}}(t, d_i, D) = TF(t, d_i) \cdot IDF(t, D), \\ \\
t = \text{term}, \ d_{i} =  \text{document}, \  D = \text{set of documents}
\end{cases}$$

$(4)$ $\color{lightblue}{\text{TF-IDF Matrix}}$

The TF-IDF matrix is a matrix representation of the TF-IDF scores for all terms in all documents. It is an $N \times M$ matrix, where $N$ is the number of documents and $M$ is the total number of unique terms in the entire document collection. Each element $\color{yellow}{\text{TF_IDF}}(i,j)$ in the matrix represents the TF-IDF score of term $t_j$ in document $d_i$.

$(5)$ $\color{lightblue}{\text{Normalization}}$

In practice, the raw TF-IDF values can be further normalized to prevent bias towards long documents. One common normalization technique is the L2 normalization. For each document $d_i$, the L2 normalized TF-IDF scores are computed as:

$$\color{yellow}{\text{TF_IDF}}_{\text{norm}}(i, j, D) = \frac{\color{yellow}{\text{TF_IDF}}(t_j, d_i, D)}{\sqrt{\sum\limits_{t_k \in d_i}{(\color{yellow}{\text{TF_IDF}}(t_k, d_i, D))^2}}}$$

where $\sum\limits_{t_k \in d_i}{(\color{yellow}{\text{TF_IDF}}(t_k, d_i, D))^2}$ is the sum of the squares of all TF-IDF scores for terms in document $d_i$.
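A short NumPy sketch of the L2 normalization formula above follows. The helper name `l2_normalize_rows` and the choice to leave all-zero rows untouched are assumptions made for illustration.

```python
import numpy as np


def l2_normalize_rows(matrix):
    """Divide every row by its Euclidean (L2) norm."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Assumption: leave all-zero rows unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return matrix / norms


# Hypothetical scores: one document with two non-zero terms, one empty document
scores = np.array([[3.0, 4.0, 0.0],
                   [0.0, 0.0, 0.0]])
normalized = l2_normalize_rows(scores)
print(normalized)
```

The first row has norm $\sqrt{3^2 + 4^2} = 5$, so it becomes $(0.6, 0.8, 0)$ and its squared entries sum to $1$.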

## $\color{lightgreen}{\text{Conclusion}}$

In summary, Term Frequency-Inverse Document Frequency (TF-IDF) is a mathematical technique used to represent the importance of a term in a document relative to a collection of documents. It combines term frequency (TF), which measures how often a term occurs within a document, with inverse document frequency (IDF), which measures how rare the term is across the entire document collection. The resulting TF-IDF scores provide a numerical representation of the significance of each term in each document, making it a valuable tool in text processing, information retrieval, and text mining tasks.

$ \ $

-----

## ***Objectives***

$ \ $

After completing this lab you will be able to:

*   Understand what term frequency and tf-idf matrices are

*   Explain the intuition behind both matrices and how they are calculated

*   Apply tf-idf to a corpus of text and find the most important word in each document

$ \ $

-----

## ***Installing required libraries***

$ \ $


The following required modules are pre-installed in the Skills Network Labs environment.

In [1]:
!pip install skillsnetwork==0.19.2

Collecting skillsnetwork==0.19.2
  Downloading skillsnetwork-0.19.2-py3-none-any.whl (12 kB)
Collecting jedi>=0.16 (from ipython->skillsnetwork==0.19.2)
  Downloading jedi-0.18.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, skillsnetwork
Successfully installed jedi-0.18.2 skillsnetwork-0.19.2


In [2]:
import re
import skillsnetwork
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [3]:
# Suppress numpy data type warnings
import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)
warnings.filterwarnings("ignore", category = UserWarning)
warnings.filterwarnings("ignore", category = RuntimeWarning)
warnings.filterwarnings("ignore", category = FutureWarning)

$ \ $

----


## ***Example (Term Frequency - Inverse Document Frequency)***

$ \ $

Let's consider two documents with one sentence each:

$(0)$ "We like dogs and cats"

$(1)$ "We like cars and planes"

To vectorize these two documents into a Term Frequency (TF) matrix, we count the occurrences of each word in each document. The resulting matrix is as follows:

$$ \begin{array}{cccccccc}
\text{doc} & \text{We} & \text{like} & \text{and} & \text{dogs} & \text{cats} & \text{cars} & \text{planes} \\
\hline
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\
\end{array} $$

In this matrix, each row represents a document, and each column represents a unique word (term). The value in each cell represents the count of the corresponding term in the respective document. Now, let's convert this Term Frequency (TF) matrix into a TF-IDF matrix. The TF-IDF score of each term is calculated using the following formula:

$\color{lightblue}{\text{Term Frequency (TF)}}$: $ TF(t,d)$
is the raw count of the term $t$ in document $d$.

$\color{lightblue}{\text{Inverse Document Frequency (IDF):}}$ The IDF term is used to measure the importance of a term across the entire corpus of documents. It is calculated as:

$$ \text{IDF}(t, D) = \log\left(\dfrac{|D|}{{|\{d \in D: t \in d\}|+1}} \right) $$

where:

- $|D|$ is the total number of documents in the corpus.

- $ |\{d \in D: t \in d\}| $ is the number of documents in the corpus where the term $t$ appears.

Note that we add $1$ to the denominator to prevent division by zero for a term that appears in no document. A side effect is that a term appearing in every document gets a slightly negative IDF, since $|D| / (|D| + 1) < 1$.

$\color{lightblue}{\text{TF-IDF Calculation:}}$ Finally, the TF-IDF score of a term $t$ in a document $d$ is obtained by multiplying its Term Frequency (TF) with its Inverse Document Frequency (IDF):

$$ \text{TF_IDF}(t, d, D) = TF(t,d) \times \text{IDF}(t, D) $$

The resulting TF-IDF matrix will have the same dimensions as the TF matrix, and each element $ \text{TF_IDF}(t, d, D) $ represents the TF-IDF score of term $t$ in document $d$.

* In document $1$, the term `like` has $TF = 1$, so its TF-IDF value is:

$$ 1 \times \log\left(\dfrac{2}{2+1}\right) = \log\left(\dfrac{2}{3}\right) \approx -0.405465 $$

* For the term `dogs`, which appears only in document $0$ with $TF = 1$, the TF-IDF value in that document is:

$$ 1 \times \log\left(\dfrac{2}{1+1}\right) = \log\left(\dfrac{2}{2}\right) = \log(1) = 0 $$

* Performing the same calculations for all the elements in the matrix, we obtain:

$$
\begin{array}{cccccccc}
\text{doc} & \text{We} & \text{like} & \text{and} & \text{dogs} & \text{cats} & \text{cars} & \text{planes} \\
\hline
0 & -0.405465 & -0.405465 & -0.405465 & 0 & 0 & 0 & 0 \\
1 & -0.405465 & -0.405465 & -0.405465 & 0 & 0 & 0 & 0 \\
\end{array}
$$

The TF-IDF matrix provides a more refined representation of the importance of each term in each document, considering both the term's frequency in the document and its rarity across the entire collection of documents. As shown in the matrix, common words like `We`, `like`, and `and` have lower TF-IDF values (here negative) because they appear in both documents and are less informative for distinguishing between them. Words that appear in only one document, like `dogs`, `cats`, `cars`, and `planes`, have higher values. With only two documents and the $+1$ in the denominator, their IDF is exactly $\log(1) = 0$, so in this tiny example they score $0$ whether or not they occur in the document; in a larger corpus, rare terms receive clearly positive scores in the documents that contain them. This weighting is a crucial step in information retrieval and text analysis as it helps to emphasize the unique characteristics of each document while downplaying the impact of common words that appear across the entire corpus.

In summary, the TF-IDF matrix provides a numerical representation of the importance of each term in each document relative to the entire corpus. This technique is widely used in information retrieval, text mining, and natural language processing tasks to identify significant words and reduce the impact of common terms that appear across all documents.
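To check the hand calculation, the sketch below applies the IDF formula from this section, $\log\left(|D| / (df + 1)\right)$, to the two-document TF matrix using NumPy. The variable names are illustrative.

```python
import numpy as np

terms = ["we", "like", "and", "dogs", "cats", "cars", "planes"]

# Term frequency matrix: one row per document, one column per term
tf = np.array([[1, 1, 1, 1, 1, 0, 0],
               [1, 1, 1, 0, 0, 1, 1]], dtype=float)

n_docs = tf.shape[0]
# Document frequency: in how many documents each term occurs
doc_freq = (tf > 0).sum(axis=0)

# IDF with +1 in the denominator, as in the formula of this section
idf = np.log(n_docs / (doc_freq + 1))

# Broadcasting multiplies every row of tf by the per-term idf vector
tfidf = tf * idf
print(np.round(tfidf, 6))
```

Terms shared by both documents come out at $\log(2/3) \approx -0.405465$, and terms found in a single document come out at $0$.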

$ \ $

----

### ***Doing it in code***

$ \ $

$(1)$ `CountVectorizer` is the scikit-learn class that converts a list of document strings into a term frequency matrix. Its signature is:

```python
CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```

$ \ $

$(2)$ `TfidfTransformer` is the scikit-learn class that converts a term frequency matrix into a tf-idf matrix. Its signature is:

```python
TfidfTransformer(*, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
```
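As a side note, scikit-learn also provides `TfidfVectorizer`, which combines the two classes above in a single step. The sketch below, using a toy corpus chosen for illustration, checks that both routes give the same matrix when default parameters are used.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = ["We like dogs and cats", "We like cars and planes"]

# Two-step route: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step route is used in this lab because it exposes the intermediate term frequency matrix.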



$ \ $

-----

## ***Example (Term Frequency - Inverse Document Frequency)***

$ \ $

$(1)$ Let's implement the example above using these functions.

In [4]:
# Define a list D holding two strings, one per document in the corpus.
D = ["We like dogs and cats", "We like cars and planes"]

In [5]:
# Create a CountVectorizer instance, cv, to turn the text into a term frequency (TF) matrix.
cv = CountVectorizer()

# Fit cv on the corpus D and build the TF matrix of its documents.
tf_mat = cv.fit_transform(D)

In [6]:
# Convert the sparse matrix tf_mat into a dense array
data = tf_mat.toarray()

# Get the feature (term) names learned by cv
columnas = cv.get_feature_names_out()

# Build a pandas DataFrame named tf: one row per document, one column per term, each cell holding a count.
tf = pd.DataFrame(data, columns = columnas)

# Display the DataFrame
tf

   and  cars  cats  dogs  like  planes  we
0    1     0     1     1     1       0   1
1    1     1     0     0     1       1   1


$ \ $

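$(2)$ We instantiate a `TfidfTransformer` object named `tfidf_trans` to compute the TF-IDF representation of the corpus. Note that scikit-learn uses a different IDF variant from the hand calculation above: with `smooth_idf = False` it computes $\ln\left(|D| / |\{d \in D: t \in d\}|\right) + 1$, and it L2-normalizes each row by default, so the values below differ from the hand-computed matrix.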

In [7]:
# Create a TfidfTransformer instance, tfidf_trans, to compute the TF-IDF representation of the corpus.
tfidf_trans = TfidfTransformer(smooth_idf = False)

# Apply it to the term frequency DataFrame tf to get a TF-IDF matrix holding the relative importance of each term in each document.
tfidf_mat = tfidf_trans.fit_transform(tf)

In [8]:
# Convert the sparse matrix tfidf_mat into a dense array
data = tfidf_mat.toarray()

# Get the feature (term) names seen by tfidf_trans
columnas = tfidf_trans.get_feature_names_out()

# Build a pandas DataFrame named tfidf: one row per document, one column per term, each cell holding a TF-IDF value.
tfidf = pd.DataFrame(data, columns = columnas)

# Display the DataFrame
tfidf

        and      cars      cats      dogs      like    planes        we
0  0.338381  0.000000  0.572929  0.572929  0.338381  0.000000  0.338381
1  0.338381  0.572929  0.000000  0.000000  0.338381  0.572929  0.338381


$ \ $

$(3)$ We calculate a non-normalized TF-IDF representation by multiplying the IDF weights (`tfidf_trans.idf_`) by the term frequency matrix `tf`. Both this array and the `tfidf` DataFrame are useful for text analysis and text mining tasks, as they provide a numerical representation that reflects both the frequency of terms and their relative importance in the corpus. This makes it easy to compare and analyze documents based on their content.

In [9]:
# The result is wrapped in a DataFrame with the same terms and documents as the TF-IDF DataFrame.
pd.DataFrame(tfidf_trans.idf_ * tf.to_numpy(), columns = tfidf_trans.get_feature_names_out())

   and      cars      cats      dogs  like    planes   we
0  1.0  0.000000  1.693147  1.693147   1.0  0.000000  1.0
1  1.0  1.693147  0.000000  0.000000   1.0  1.693147  1.0
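The values $1.0$ and $1.693147$ follow from scikit-learn's IDF definition, which differs from the textbook formula used in the hand calculation. The sketch below recomputes both documented variants with NumPy and compares the unsmoothed one with the fitted `idf_` attribute; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["We like dogs and cats", "We like cars and planes"]
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# smooth_idf=False: idf(t) = ln(n / df(t)) + 1
idf_unsmoothed = np.log(n_docs / doc_freq) + 1
# smooth_idf=True (the default): idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_smoothed = np.log((1 + n_docs) / (1 + doc_freq)) + 1

fitted_unsmoothed = TfidfTransformer(smooth_idf=False).fit(counts)
fitted_smoothed = TfidfTransformer(smooth_idf=True).fit(counts)

print(np.allclose(fitted_unsmoothed.idf_, idf_unsmoothed))
print(np.allclose(fitted_smoothed.idf_, idf_smoothed))
```

With two documents, a term found in both gets $\ln(2/2) + 1 = 1$ and a term found in one gets $\ln(2/1) + 1 \approx 1.693147$.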


$ \ $

$(4)$ We calculate the inner product of the first document's TF-IDF vector with itself, that is, its squared L2 norm. This is done by multiplying the first row of the TF-IDF DataFrame by itself element by element and summing the results. Because `TfidfTransformer` applies L2 normalization by default (`norm='l2'`), the result is $1$, up to floating-point error.

In [10]:
# Compute the inner product of the first document's TF-IDF row with itself, i.e. its squared L2 norm.
# This multiplies the first row of the tfidf DataFrame by itself element-wise and sums the products.
# The rows are L2-normalized, so the result is 1 up to floating-point error.
np.dot(tfidf.iloc[0,:], tfidf.iloc[0,:])

1.0
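The link between the unnormalized scores and the normalized ones can be sketched by hand: dividing each unnormalized row by its L2 norm should reproduce the values $0.338381$ and $0.572929$ reported by `TfidfTransformer`. The array below is typed in from the two-document example, and the variable names are illustrative.

```python
import numpy as np

# idf of a term found in one of the two documents (smooth_idf=False)
rare = np.log(2) + 1

# Unnormalized TF-IDF rows; columns: and, cars, cats, dogs, like, planes, we
raw = np.array([
    [1.0, 0.0, rare, rare, 1.0, 0.0, 1.0],
    [1.0, rare, 0.0, 0.0, 1.0, rare, 1.0],
])

# Divide each row by its Euclidean norm
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
normalized = raw / row_norms

# The squared norm of every normalized row should be 1
squared_norms = (normalized ** 2).sum(axis=1)
print(np.round(normalized[0], 6))
print(squared_norms)
```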

$ \ $

----

# ***Example***

$ \ $

$(1)$ Let's try creating a tf-idf matrix ourselves! Below we load a [dataset from Kaggle](https://www.kaggle.com/datasets/vivmankar/physics-vs-chemistry-vs-biology?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2022-01-01) of text, made up of short user comments (stored in the `Comment` column). This is an open-domain dataset that is free to use.


In [11]:
# Define a variable 'URL' holding the address of the CSV file with the TF-IDF lab data.
URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML0187EN-SkillsNetwork/labs/module%203/data/tfidf.csv'

# Use pandas to read the CSV file from the URL and store it in the DataFrame 'df'.
df = pd.read_csv(URL)

# Keep only the second column of 'df' using '.iloc[:, 1]'.
# Note: pandas positions start at 0, so '.iloc[:, 1]' is the second column (position 1).
df = df.iloc[:, 1]

$ \ $

$(2)$ Let's look at some sample rows from the dataset we loaded.

In [12]:
df.head()

0    Personally I have no idea what my IQ is. I’ve ...
1    I'm skeptical. A heavier lid would be needed t...
2    I think I have 100 cm of books on the subject....
3    Is chemistry hard in uni. Ive read somewhere t...
4    In addition to the other comment, you can crit...
Name: Comment, dtype: object

$ \ $

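$(3)$ We convert this collection of documents into a term frequency matrix. Note that this dataset contains numbers, which we want to remove for simplicity's sake. For this purpose, we will use the following function and pass it to `CountVectorizer(preprocessor=preprocess_text)` as an argument. Because a custom `preprocessor` replaces the built-in preprocessing step (including lowercasing), the function also converts the text to lowercase itself.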

In [13]:
# This function applies basic preprocessing to a given text
def preprocess_text(text):

    # Convert the whole text to lowercase so that letter case is uniform.
    text = text.lower()

    # Use a regular expression (regex) to remove every numeric digit from the text.
    # 're.sub' replaces each run of digits with the empty string '',
    # which effectively deletes the digits.
    text = re.sub(r'\d+', '', text)

    # Return the preprocessed text after both transformations.
    return text
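A quick usage sketch of the preprocessing function on a made-up sentence (the function is repeated here so that the snippet runs on its own):

```python
import re


def preprocess_text(text):
    # Lowercase, then delete every run of digits
    text = text.lower()
    return re.sub(r'\d+', '', text)


sample = "I think I have 100 cm of Books on H2O"
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

Note that digits are removed wherever they occur, so `100` leaves a double space behind and `H2O` becomes `ho`.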

$ \ $

$(4)$ We also want to limit the `CountVectorizer` to the $500$ most frequent words using the `max_features` argument. We apply the `CountVectorizer` to the `df` Series and name the columns after the features returned by `cv.get_feature_names_out()`.

In [14]:
# Create a CountVectorizer instance, cv, with two parameters:
#  max_features = 500: cap the number of features (terms) used in the term frequency (TF) matrix.
#  preprocessor = preprocess_text: use the 'preprocess_text' function defined above to clean the text before building the TF matrix.
cv = CountVectorizer(max_features = 500, preprocessor = preprocess_text)

# Apply the CountVectorizer to the Series 'df' to build the TF matrix of its documents.
# fit_transform performs two steps:
# 1. Learns the vocabulary of the text.
# 2. Transforms the documents in 'df' into a term frequency (TF) matrix.
tf = cv.fit_transform(df)

# Finally, wrap the TF matrix in a DataFrame using pd.DataFrame.
# 'toarray()' converts the sparse TF matrix into a dense array,
# and the 'columns' argument receives the feature (term) names from the CountVectorizer.
pd.DataFrame(tf.toarray(), columns = cv.get_feature_names_out())

      able  about  above  acid  acids  actually  add  after  again  ago  ...  wouldn  wrong  www  yeah  year  years  yes  you  your  yourself
0        0      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    0     0         0
1        0      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    0     0         0
2        1      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    1     0         0
3        0      0      0     0      0         0    0      1      0    0  ...       0      0    0     0     0      0    0    0     0         0
4        0      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    1     0         0
...    ...    ...    ...   ...    ...       ...  ...    ...    ...  ...  ...     ...    ...  ...   ...   ...    ...  ...  ...   ...       ...
1581     0      4      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0   10     6         0
1582     0      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    1     1         0
1583     0      1      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    3     0         0
1584     0      0      0     0      0         0    0      0      0    0  ...       0      0    0     0     0      0    0    0     0         0


$ \ $

$(5)$ Now that we have a term frequency matrix, we can apply the ***tf-idf*** transformation to it to obtain a matrix whose values represent how important each word is to its document. For this, we apply the `TfidfTransformer` to the `tf` matrix and name the columns after the features returned by `cv.get_feature_names_out()`.

In [15]:
# Create a TfidfTransformer instance, tfidf_trans.
# No extra parameters are passed, so the TfidfTransformer defaults are used.
tfidf_trans = TfidfTransformer()

# Apply the TfidfTransformer to the term frequency (TF) matrix 'tf' obtained earlier.
# fit_transform performs two steps:
# 1. Learns the idf weights from the TF matrix.
# 2. Transforms the TF matrix into a normalized tf-idf matrix.
# (The sparse matrix 'tf' could also be passed directly, without 'toarray()'.)
tfidf_mat = tfidf_trans.fit_transform(tf.toarray())

# Wrap the normalized tf-idf matrix in a DataFrame using pd.DataFrame.
# 'toarray()' converts the sparse tf-idf matrix into a dense array,
# and the 'columns' argument receives the feature (term) names from the CountVectorizer.
tfidf = pd.DataFrame(tfidf_mat.toarray(), columns = cv.get_feature_names_out())

# Display the DataFrame
tfidf

         able     about  above  acid  acids  actually  add     after  again  ago  ...  wouldn  wrong  www  yeah  year  years  yes       you      your  yourself
0     0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.000000  0.000000       0.0
1     0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.000000  0.000000       0.0
2     0.11354  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.044232  0.000000       0.0
3     0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.188718    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.000000  0.000000       0.0
4     0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.079460  0.000000       0.0
...       ...       ...    ...   ...    ...       ...  ...       ...    ...  ...  ...     ...    ...  ...   ...   ...    ...  ...       ...       ...       ...
1581  0.00000  0.214699    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.331533  0.308927       0.0
1582  0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.096378  0.149678       0.0
1583  0.00000  0.121809    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.225714  0.000000       0.0
1584  0.00000  0.000000    0.0   0.0    0.0       0.0  0.0  0.000000    0.0  0.0  ...     0.0    0.0  0.0   0.0   0.0    0.0  0.0  0.000000  0.000000       0.0


As we can see above, both the term frequency and tf-idf matrices contain a lot of 0's. When dealing with a very large corpus of text, or a corpus with a large number of unique words/features, we usually store the information in a sparse format that keeps only the non-zero entries and their positions. This saves space in RAM because the zeros are never stored. In the next cell we stack the DataFrame into a long Series and keep only the non-zero scores.


In [16]:
# Use 'stack()' on the 'tfidf' DataFrame to turn it into a stacked Series.
# The stacked Series holds every element of the normalized tf-idf matrix under a hierarchical index,
# where the outer level is the row (document) number and the inner level is the term.
dense_tfidf = tfidf.stack()

# Filter the stacked Series with 'dense_tfidf != 0' to keep only the entries that differ from zero.
# The result contains just the non-zero values of the normalized tf-idf matrix,
# i.e. the importance of each term in each document where it occurs.
dense_tfidf_non_zero = dense_tfidf[dense_tfidf != 0]

# Display the result
dense_tfidf_non_zero

0     an       0.137536
      and      0.154283
      be       0.109092
      been     0.398127
      by       0.153027
                 ...   
1585  to       0.208749
      up       0.163900
      video    0.559576
      want     0.189222
      you      0.181626
Length: 51416, dtype: float64

In [18]:
import numpy as np
from scipy.sparse import csr_matrix

# A list with the (row, column, value) coordinates of the non-zero entries of a matrix (1-based)
lista = [(1, 1, 2), (1, 2, 3), (3, 4, 1), (2, 4, 4), (4, 3, 1)]

# Get the matrix dimensions
num_rows = max(item[0] for item in lista)
num_cols = max(item[1] for item in lista)

# Initialize a num_rows x num_cols matrix filled with zeros
A = np.zeros((num_rows, num_cols))

# Fill A with the values from the list at the corresponding positions
for row, col, value in lista:
    A[row-1, col-1] = value  # Subtract 1 to convert to 0-based indices

# Convert the numpy matrix A into a sparse matrix in CSR format
sparse_A = csr_matrix(A)

print(sparse_A)


  (0, 0)	2.0
  (0, 1)	3.0
  (1, 3)	4.0
  (2, 3)	1.0
  (3, 2)	1.0
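The cell above builds a dense array first and then converts it. As an alternative sketch, `csr_matrix` can also be built directly from the coordinate triplets, so the zeros are never materialized. The variable names here are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same 1-based (row, column, value) triplets as in the cell above
entries = [(1, 1, 2), (1, 2, 3), (3, 4, 1), (2, 4, 4), (4, 3, 1)]

# Split the triplets into 0-based row indices, column indices, and values
rows = np.array([r - 1 for r, _, _ in entries])
cols = np.array([c - 1 for _, c, _ in entries])
values = np.array([v for _, _, v in entries], dtype=float)

# Matrix shape inferred from the largest row and column index
shape = (int(rows.max()) + 1, int(cols.max()) + 1)

# csr_matrix((data, (row_ind, col_ind)), shape=...) skips the dense intermediate
sparse_direct = csr_matrix((values, (rows, cols)), shape=shape)
print(sparse_direct.toarray())
```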
