**Objective**:
**To explore and implement various text representation techniques, including TF-IDF and Transformer-based models, for building and evaluating a machine learning pipeline. This involves converting text data into numerical formats, training classifiers (such as SVM) for tasks like text classification, and understanding the advantages and limitations of methods like TF-IDF in natural language processing (NLP) workflows. Additionally, the project will cover concepts of autoencoding models like BERT, autoregressive models like GPT, and their applications in both natural language understanding (NLU) and generation (NLG).**

In [None]:
#BertViz, exBERT and TensorBoard


Points from the paper

- Transformer-based NLP has come to dominate the field (Python Transformers library)
- tokenizer

- BERT model - autoencoding models
- Autoregressive models - GPT
- Train and fine-tune the models
- NLU, NLG problems
- Text classification, token classification, text representation
- tensorflow, pytorch, conda, transformers and sentence-transformers
- BoW (bag-of-words)

In [None]:
scikit-learn
nltk==3.5.0
gensim==3.8.3
fasttext
keras>=2.3.0
transformers>=4.0.0

**What Does This Code Do?**

- It converts the toy text data into a numerical representation using TF-IDF.
- It prints the size of the vocabulary (unique words) and the shape of the document-term matrix.
- It creates a DataFrame to easily view the TF-IDF values for each word in each document.

**TF-IDF Vectorization**

**What is TF-IDF?**

TF-IDF stands for Term Frequency-Inverse Document Frequency. It represents text numerically so that it can be fed to machine learning models. A term's weight grows with how often it occurs in a document and shrinks with the number of documents that contain it, so words concentrated in a few documents get higher weights than words that are common across all documents.
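To make the weighting concrete, the sketch below recomputes the values by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings: raw counts as term frequency, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The names `docs`, `doc_freq`, `manual` and `reference` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

# Term frequency: raw counts, one row per document, one column per word.
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

weighted = counts * idf
# L2-normalize each row so every document vector has unit length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))  # True
```

Words that occur in every document ("the", "cat") get the minimum IDF of 1, while words that occur in a single document get the largest IDF.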

**1. Import Libraries**

In [None]:
#sklearn: A popular machine learning library in Python.
#numpy: A library for numerical operations.
#pandas: A library for data manipulation and analysis.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
import numpy as np

In [None]:
import pandas as pd


**2. Define the Toy Corpus**


In [None]:
#toy_corpus is a list of simple sentences that we will use as our text data.

In [None]:
toy_corpus=['the fat cat sat on the mat', 'the big cat slept','the dog chased a cat']

In [None]:
toy_corpus

['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

**3. Initialize and Fit the Vectorizer**


In [None]:
#TfidfVectorizer(): This creates an object that can convert text data into a TF-IDF matrix.
#fit_transform(): This method learns the vocabulary and computes the TF-IDF values for the toy_corpus.

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
vectorizer

In [None]:
corpus_tfidf=vectorizer.fit_transform(toy_corpus)

In [None]:
corpus_tfidf

<3x10 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

**4. Print Vocabulary Size and Matrix Shape**


In [None]:
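#vectorizer.vocabulary_: A dictionary mapping each word to its column index in the matrix.
#corpus_tfidf.shape: The dimensions of the TF-IDF matrix (number of documents x number of unique words).
#Note: the default tokenizer keeps only tokens of two or more characters, so 'a' is not in the vocabulary.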

In [None]:
print(f'The vocabulary size is {len(vectorizer.vocabulary_)}')

The vocabulary size is 10


In [None]:
print(f'The document-term matrix shape is {corpus_tfidf.shape}')

The document-term matrix shape is (3, 10)
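The corpus contains eleven distinct words, yet the vocabulary has ten entries. A short sketch of why: with default settings, `TfidfVectorizer` only keeps tokens of at least two word characters, so "a" is discarded. The second vectorizer below uses a relaxed `token_pattern` purely as an illustration; the names `default_vec` and `keep_short` are not from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

# Default tokenization: tokens must have two or more word characters.
default_vec = TfidfVectorizer().fit(docs)
print(sorted(default_vec.vocabulary_, key=default_vec.vocabulary_.get))
print('a' in default_vec.vocabulary_)  # False

# Illustrative variation: also keep single-character tokens.
keep_short = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)
print(len(keep_short.vocabulary_))     # 11
```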


**5. Create a DataFrame for Better Visualization**

In [None]:
#corpus_tfidf.toarray(): Converts the sparse matrix to a dense array.
#pd.DataFrame(): Creates a DataFrame from the dense array.
#get_feature_names_out(): Gets the words corresponding to the columns of the DataFrame.

In [None]:
df=pd.DataFrame(np.round(corpus_tfidf.toarray(),2))

In [None]:
df.columns=vectorizer.get_feature_names_out()

In [None]:
df

    big   cat  chased   dog   fat   mat    on   sat  slept   the
0  0.00  0.25    0.00  0.00  0.42  0.42  0.42  0.42   0.00  0.49
1  0.61  0.36    0.00  0.00  0.00  0.00  0.00  0.00   0.61  0.36
2  0.00  0.36    0.61  0.61  0.00  0.00  0.00  0.00   0.00  0.36
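One common use of these document vectors is comparing documents with each other. The sketch below is an optional illustration, not a step of the original notebook: it computes pairwise cosine similarity between the three rows with scikit-learn's `cosine_similarity`. The names `tfidf` and `sim` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

tfidf = TfidfVectorizer().fit_transform(docs)

# sim[i, j] is the cosine similarity between document i and document j.
sim = cosine_similarity(tfidf)
print(sim.round(2))
```

The documents only share "the" and "cat", which carry the lowest IDF, so the off-diagonal similarities stay well below 1.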


In [None]:
toy_corpus

['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

Advantages and disadvantages of TF-IDF

Advantages
- easy to implement
- human-friendly (interpretable) output
- domain adaptation

Disadvantages
- curse of dimensionality (one dimension per vocabulary word)
- no solution for unseen words
- semantic relations (such as is-a) are hardly captured
- word order is ignored
- slow for large vocabularies
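Two of these limitations, unseen words and ignored word order, can be checked directly on the toy corpus. This is a minimal sketch; the test sentences and the names `vec`, `unseen`, `original` and `swapped` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']
vec = TfidfVectorizer().fit(docs)

# Unseen words: 'zebra' is not in the fitted vocabulary, so it maps to an all-zero vector.
unseen = vec.transform(['zebra']).toarray()
print(unseen.any())  # False

# Word order: swapping subject and object changes the meaning but not the vector.
original = vec.transform(['the dog chased a cat']).toarray()
swapped = vec.transform(['the cat chased a dog']).toarray()
print(np.allclose(original, swapped))  # True
```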

LSA (Latent Semantic Analysis): captures pairwise correlations between terms by projecting the document-term matrix into a lower-dimensional space.
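A minimal LSA sketch, assuming the usual scikit-learn recipe of applying `TruncatedSVD` to the TF-IDF matrix. The choice of two components and the names `tfidf`, `svd` and `reduced` are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']

tfidf = TfidfVectorizer().fit_transform(docs)  # 3 documents x 10 terms

# Project the 10-dimensional TF-IDF vectors onto 2 latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(reduced.shape)          # (3, 2)
print(svd.components_.shape)  # (2, 10): each latent dimension is a weighting over terms
```

Each document is now described by two dense values instead of ten sparse ones, which is how LSA addresses the dimensionality problem.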

In [None]:
# NLU pipeline

# tokenization, stemming, noun phrase detection, chunking, stop word elimination
# ML pipeline
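Two of the preprocessing steps listed here, tokenization and stop word elimination, can be sketched with scikit-learn alone by asking a vectorizer for its analyzer. Stemming, noun phrase detection and chunking would need a library such as nltk and are not shown. The name `analyzer` and the use of scikit-learn's built-in English stop word list are choices made for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer applies to each document:
# lowercasing, tokenization, then stop word removal.
analyzer = TfidfVectorizer(stop_words='english').build_analyzer()

tokens = analyzer('The fat cat sat on the mat')
print(tokens)
```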

**Machine Learning Pipeline with SVM**

**What is a Machine Learning Pipeline?**

A pipeline is a sequence of steps executed in order, such as data preprocessing followed by model training. It helps organize and streamline the machine learning workflow.

**What Does This Code Do?**

- It defines a simple classifier and the labels for our toy data.
- It trains the classifier on the TF-IDF matrix.
- It makes predictions on the same data and prints the predicted labels.

**1. Import Libraries**

In [None]:
#make_pipeline: A function to create a pipeline.
#SVC: Support Vector Classifier, a type of machine learning model used for classification tasks.

In [None]:
from sklearn.pipeline import make_pipeline

In [None]:
from sklearn.svm import SVC

**2. Define Labels and Initialize the Classifier**

In [None]:
#labels: The target labels for our toy data (e.g., categories 0 and 1).
#SVC(): Initializes a Support Vector Classifier.

In [None]:
labels=[0,1,0]

In [None]:
clf=SVC()

**3. Train the Classifier**

In [None]:
#fit(): Trains the classifier on the TF-IDF values (df.to_numpy(), rounded to two decimals above) and the labels.

In [None]:
clf.fit(df.to_numpy(), labels)

**4. Make Predictions**

In [None]:
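#predict(): Uses the trained classifier to predict the labels for the input data.
#Here the input is the training data itself, so the result only shows that the model fits its training examples.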

In [None]:
clf.predict(df.to_numpy())

array([0, 1, 0])
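`make_pipeline` is imported above, but the classifier is fitted on a precomputed matrix. The sketch below shows one way the two steps could be chained so that the pipeline accepts raw text directly. The test sentence and the names `targets`, `text_clf` and `preds` are made up for illustration, and with only three training documents the prediction carries no real meaning.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ['the fat cat sat on the mat', 'the big cat slept', 'the dog chased a cat']
targets = [0, 1, 0]

# Chain vectorization and classification: fit() runs both steps in order.
text_clf = make_pipeline(TfidfVectorizer(), SVC())
text_clf.fit(docs, targets)

# New text is vectorized with the vocabulary learned during fit(), then classified.
preds = text_clf.predict(['the big dog slept'])
print(preds)
```

Keeping the vectorizer inside the pipeline ensures that new text is always transformed with the same vocabulary and IDF weights that were learned during training.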

In [None]:
# Language modeling and generation