# Text Data Representation (Text Preprocessing)

Text data representation in NLP (Natural Language Processing) refers to the process of converting raw text into a structured and numerical format that can be understood and processed by machine learning algorithms or other NLP models. Since machines primarily work with numerical data, text data representation is essential for extracting meaningful information and patterns from textual content.

There are different approaches for representing text data in NLP, including:

1. **Bag-of-Words (BoW)**: The BoW representation treats each document as a collection of words and creates a vector representation based on the frequency or presence of words in the document. It disregards the order or sequence of words and represents text as a set of word occurrences or frequencies.

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**: TF-IDF is a weighted representation that assigns a score to each word based on its frequency in a document and its rarity across the entire corpus. It emphasizes words that are frequent in a document but rare in the overall corpus, capturing their relative importance.

3. **Word Embeddings**: Word embeddings are dense, low-dimensional vector representations that capture semantic and contextual relationships between words. Models like Word2Vec, GloVe, and FastText learn to represent words in a continuous vector space, capturing their meaning and similarity.

4. **Neural Network-based representations**: Deep learning models, such as recurrent neural networks (RNNs) or transformers, can learn contextual representations of words or documents. They leverage the sequential or hierarchical structure of the text to capture dependencies and generate dense vector representations.

5. **Character-level representations**: Instead of focusing on words, character-level representations operate at the individual character level. This approach can capture morphological or spelling variations and is useful for tasks like text generation or sentiment analysis.

**Note:** The choice of representation depends on the specific task, dataset size, language complexity, and available resources. An effective representation lets NLP models extract meaningful information from text and supports tasks such as text classification, sentiment analysis, machine translation, and information extraction.
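
As a small illustration of the character-level idea from the list above, the sketch below splits words into overlapping character trigrams. The helper name `char_ngrams` and the example words are assumptions made for this illustration, not part of any library.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

singular = char_ngrams("document")
plural = char_ngrams("documents")

print(singular)  # ['doc', 'ocu', 'cum', 'ume', 'men', 'ent']

# Spelling variants share most of their character n-grams,
# which is why this representation tolerates small variations.
shared = set(singular) & set(plural)
print(len(shared))  # 6
```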

In [1]:
# Import packages
import numpy as np
import pandas as pd 

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import warnings
warnings.filterwarnings("ignore")
# seeding
np.random.seed(123)

In [3]:
# load data

data = pd.read_csv('../Data/swahili_news_titles_clean_data.csv')

In [4]:
# show sample of the data

data.head() 

| | headlines | word_count |
|---|---|---|
| 0 | kiongozi wawili kikaangoni | 3 |
| 1 | ruangwa yafanya kweli miradi maji | 5 |
| 2 | marekani yaishuku urusi kukiuka mkataba nyuklia | 6 |
| 3 | watumishi umma kuula | 3 |
| 4 | ccm yatoa siri ushindi uchaguzi | 5 |


## 1. Bag of Words (BoW)

Bag-of-Words (BoW) is a fundamental concept in Natural Language Processing (NLP) used to represent text documents as numerical feature vectors. It involves creating a vocabulary of the unique words in the corpus and then quantifying the presence or frequency of these words in each document. The order and grammar of the words are disregarded, and only the occurrence information is kept.

By representing documents as collections of word counts or frequencies, BoW enables machines to process and analyze text data using numerical methods and algorithms. It serves as a basis for various NLP tasks such as text classification, sentiment analysis, topic modeling, and information retrieval.
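
To make the idea concrete before using a library, here is a minimal sketch that builds a vocabulary and count vectors by hand. The function name `bag_of_words` and the toy sentences are assumptions for illustration, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for words that do not occur in this document
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)

print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost: any reordering of the words in a document produces the same vector.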

### How to implement in Python?


To implement the Bag-of-Words (BoW) representation in Python, you can use the CountVectorizer class from the scikit-learn library.

CountVectorizer transforms a collection of strings into a term-frequency representation. By default it lowercases the text, splits it into word tokens of two or more characters, and drops punctuation.

The result is one vector per document, with as many dimensions as there are distinct words in the corpus. Each unique word has its own dimension, and the value in that dimension is the number of times the word occurs in the document (0 if it does not occur).


In [5]:
# Quick example
text = [
    'This is the first document', 'This document is the second document',
    'And this is the third one', 'is this the first document',
]

count_vect = CountVectorizer()
count_matrix = count_vect.fit_transform(text)
count_array = count_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=count_vect.get_feature_names_out())
df 

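| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |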


### Apply CountVectorizer to the Swahili news headlines dataset

In [9]:
# how to implement using countvectorizer

count_vectorizer = CountVectorizer()

# fit and transform the data with count vectorizer
count_data = count_vectorizer.fit_transform(data['headlines'].values.astype('U'))

In [10]:
# show the transformed data

count_data[:10].toarray() 

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [11]:
# get features names
cv_feature_names = count_vectorizer.get_feature_names_out()

# create dataframe for the transformed data
count_data_df = pd.DataFrame(count_data.toarray(),
                             columns=list(cv_feature_names))

# view top 5 rows
count_data_df.head()

Unnamed: 0,aa,aacha,aachana,aachane,aache,aachia,aachie,aachiliwa,aachiwa,aachiwe,...,zuku,zulu,zuma,zumbukuku,zungu,zuri,zutah,zutta,zuttah,zverev
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# check the shape 
count_data_df.shape 

(31024, 22015)
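
A matrix of this shape is almost entirely zeros, which is why CountVectorizer returns a SciPy sparse matrix rather than a dense array. The sketch below, run on the small quick-example corpus so it stays self-contained, shows one way to inspect sparsity and estimate the dense size without calling toarray(); the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
    "is this the first document",
]

matrix = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix

n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)               # share of non-zero cells
dense_bytes = n_rows * n_cols * matrix.dtype.itemsize  # memory needed if densified

print(matrix.shape, matrix.nnz)
print(f"density: {density:.2%}, dense size: {dense_bytes} bytes")
```

Applying the same arithmetic to the (31024, 22015) matrix with 8-byte integers gives roughly 5.5 GB for the dense form, so it is usually safer to keep the data sparse and densify only small slices, as `count_data[:10].toarray()` does above.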

## 2. Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical representation technique used in Natural Language Processing (NLP) to measure the importance of words in a document. It takes into account two factors:

- Term frequency (TF)
- Inverse document frequency (IDF)


**(a) Term frequency**: It represents how often a word appears in a document. 

**(b) Inverse document frequency**: It measures how rare a word is across all documents in a collection.
    
TF-IDF assigns higher weights to words that occur frequently within a specific document but are less common in the overall document collection. This allows TF-IDF to highlight words that are both important within a document and distinct across the entire corpus. By calculating TF-IDF values for each word, NLP models can capture the relative significance of words and use them for tasks like information retrieval, text classification, and document ranking.

### Mathematical Formula
TF-IDF has two parts:

1. TF = (Number of times the term appears in the document) / (Total number of terms in the document)
2. IDF = log(Total number of documents / Number of documents containing the term)

TF-IDF Terminology:
- T = Term (word)
- F = Frequency
- D = Document (set of words)

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

![](https://media.licdn.com/dms/image/C5612AQFcV0nH5A23ow/article-inline_image-shrink_1000_1488/0/1626417114643?e=1690416000&v=beta&t=RuKyVENGafUleFdv1S8EWUeqyooxWF2ZS3ql5FoIjcY)
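
The formula can be sketched directly in plain Python. The function names and the three-sentence corpus below are assumptions for illustration, the natural logarithm is used because the formula does not fix a base, and the code assumes the term occurs in at least one document.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term appears in at least one document."""
    n_containing = sum(term in doc for doc in corpus_tokens)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

corpus = ["the cat sat", "the dog sat", "the cat ran"]
corpus_tokens = [doc.split() for doc in corpus]

# "the" occurs in every document, so its IDF (and its TF-IDF) is zero
print(tf_idf("the", corpus_tokens[0], corpus_tokens))  # 0.0

# "dog" occurs in only one document, so it gets a positive weight
print(tf_idf("dog", corpus_tokens[1], corpus_tokens))
```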

### How to implement in Python?

To implement Term Frequency-Inverse Document Frequency (TF-IDF) in Python, you can use the TfidfVectorizer class from scikit-learn.

TfidfVectorizer converts text documents into a matrix of TF-IDF values. Its fit_transform method learns the vocabulary and IDF weights from the corpus and returns the TF-IDF representation of each document as a sparse matrix.

Finally, you can get the feature names with get_feature_names_out() and convert the sparse matrix to a dense array with toarray().
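
Once fitted, the same vectorizer can encode text it has not seen before with transform(), which reuses the learned vocabulary and IDF weights. The sketch below assumes a small training corpus and one made-up new sentence; words outside the learned vocabulary are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
    "is this the first document",
]

tfidf_vect = TfidfVectorizer()
train_matrix = tfidf_vect.fit_transform(train_docs)  # learn vocabulary and IDF

# Encode unseen text with the already-fitted vectorizer
new_docs = ["This is a new document"]
new_matrix = tfidf_vect.transform(new_docs)

print(train_matrix.shape)  # (4, 9)
print(new_matrix.shape)    # (1, 9)
print(new_matrix.nnz)      # 3 ("this", "is", "document"; "new" is not in the vocabulary)
```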

In [25]:
# Quick example
text = [
    'This is the first document', 'This document is the second document',
    'And this is the third one', 'is this the first document',
]

tfidf_vect = TfidfVectorizer(ngram_range=(1,1))
tfidf_matrix = tfidf_vect.fit_transform(text)
tfidf_array = tfidf_matrix.toarray()
df = pd.DataFrame(data=tfidf_array, columns=tfidf_vect.get_feature_names_out())
df 

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
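
These values do not match the plain TF * IDF formula given earlier, because TfidfVectorizer's defaults use raw counts for TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scale each row to unit L2 length. The sketch below reproduces the table in plain Python under those defaults; the helper name `tfidf_row` and the whitespace tokenization are simplifications that happen to be sufficient for this corpus.

```python
import math

docs = [
    "This is the first document",
    "This document is the second document",
    "And this is the third one",
    "is this the first document",
]

tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency and smoothed IDF (scikit-learn's default variant)
df = {w: sum(w in toks for toks in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Raw count * smoothed IDF, then L2-normalize the row (illustrative helper)."""
    raw = [tokens.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(toks) for toks in tokenized]

# Row 0, columns "document", "first", "is"
print([round(rows[0][vocab.index(w)], 6) for w in ("document", "first", "is")])
# [0.469791, 0.580286, 0.384085]
```

The smoothing also explains why words that appear in every document, such as "is" and "the", keep a non-zero weight here instead of dropping to zero.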


### Apply TfidfVectorizer to the Swahili news headlines dataset

In [17]:
# how to implement using TFIDF Vectorizer

tfidf_vectorizer = TfidfVectorizer()

# fit and transform the data with tfidf vectorizer
tfidf_data = tfidf_vectorizer.fit_transform(data['headlines'].values.astype('U'))

In [18]:
# show the transformed data

tfidf_data[:10].toarray() 

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [19]:
# get feature names from the fitted TF-IDF vectorizer
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# create dataframe for the transformed data
tfidf_data_df = pd.DataFrame(tfidf_data.toarray(),
                             columns=list(tfidf_feature_names))

# view top 5 rows
tfidf_data_df.head()

Unnamed: 0,aa,aacha,aachana,aachane,aache,aachia,aachie,aachiliwa,aachiwa,aachiwe,...,zuku,zulu,zuma,zumbukuku,zungu,zuri,zutah,zutta,zuttah,zverev
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
# check the shape 
tfidf_data_df.shape 

(31024, 22015)