# 0201-02 - NLP Embedding - Exercise Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

Context : 

Let's get the party started ! 

Data  : 

**You can find the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/cabe445634a092d308d17a883a305a75/raw/9f785f0f02739ac6352e1d583323771d55270221/nlp.csv).**

## Preliminaries

### System

These commands show and change the current working directory:

Uncomment these lines if needed. 

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

In [None]:
# cd ..

In [None]:
# ls

Install the required libraries : 

In [None]:
# !pip install -r requirements.txt >> pip.log
# !pip freeze >> pip.freeze

### Import 

In [None]:
import os, sys, warnings
import pickle
from IPython.display import display

In [None]:
import pandas as pd
import numpy as np

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
from sklearn.base import *
from sklearn.preprocessing import *
from sklearn.impute import *
from sklearn.model_selection import *
from sklearn.decomposition import *
from sklearn.ensemble import *
from sklearn.pipeline import *
from sklearn.feature_extraction import *
from sklearn.dummy import *
from sklearn.feature_extraction.text import *

# from lightgbm import *
# from xgboost import *

from sklearn.linear_model import *
from sklearn.neighbors import *

In [None]:
import nltk
import wordcloud

from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.tokenize import wordpunct_tokenize

import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [None]:
import gensim

from gensim.models import KeyedVectors
from gensim.downloader import load

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import preprocess_string

In [None]:
# import transformers

In [None]:
# from openai import OpenAI
import requests

### Graphs and Settings

In [None]:
sns.set()

In [None]:
# warnings.filterwarnings('ignore')
warnings.filterwarnings(action="once")

In [None]:
DISPLAY = True

### Third-Party Tools

We need some third-party resources : 

In [None]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("words")

Some string assets : 

In [None]:
stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)
word_dict = words.words()

We need to download spacy : 

In [None]:
# !python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
# !python -m spacy download en_core_web_lg

Word2Vec : 

And load the spaCy model :

### Data

url of the dataset :

In [None]:
url = "https://gist.githubusercontent.com/AlexandreGazagnes/cabe445634a092d308d17a883a305a75/raw/d2014e8a34bba3c1be3ec8936bb338fb42888f24/nlp.csv"

Download the dataset : 

In [None]:
df = pd.read_csv(url)
df.head(5)

Keep a copy of the df : 

In [None]:
DF = df.copy()

## King - Man + Woman

### With Spacy

Tokenize 'King' : 

Extract the vector : 

Length ?

Same for Man : 

Fancy calculation ! 

Length ?

Reshape new vector : 

Compute Similarity : 
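The steps above (vector arithmetic, reshape, similarity) can be sketched without any model. This is a minimal illustration that assumes made-up 3-dimensional vectors in place of real spaCy vectors; `toy_vectors` and `cosine_similarity` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Made-up 3D "embeddings" for illustration only (NOT real spaCy vectors).
# The dimensions can loosely be read as: royalty, maleness, femaleness.
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# The "fancy calculation": king - man + woman
new_vector = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Many similarity helpers expect a 2D array of shape (n_samples, n_features)
new_vector_2d = new_vector.reshape(1, -1)

# Score every known word against the new vector and keep the best match
scores = {word: cosine_similarity(new_vector, vec) for word, vec in toy_vectors.items()}
closest = max(scores, key=scores.get)
print(closest)
```

With real embeddings the nearest word is not guaranteed to be "queen"; the result depends on the model and its vocabulary.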

v1 is : 

vect is  : 

In [None]:
vect = nlp.vocab[v1]
vect

text is : 

In [None]:
vect.text

Whoooo .... not so good ! 

Let's do the same with a "huge" model : 

In [None]:

!python -m spacy download en_core_web_lg
# !python -m spacy download en_core_web_trf

In [None]:
nlp = spacy.load("en_core_web_lg")

Good ? ...

Just re-run the previous cells with this model.

What are your conclusions ?

Let's try one last trick : 

In [None]:
doc = "He is one of the most famous kings:  Richard III was the last king of England to die in battle"

In [None]:
doc = "Fifteen months after the death of King George VI, his daughter Elizabeth is crowned Queen of England"

In [None]:
doc = "a female, it's a woman, or a lady, a human of female sex."

In [None]:
doc = "a boy, a guy, or a man, it's a human being of male sex."
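One possible reading of this trick: describe each concept with a short sentence and work with the document vector instead of a single word vector. By default, spaCy's `Doc.vector` is the average of the token vectors. The sketch below reproduces that averaging on made-up 2D vectors; `toy_vectors` and `toy_doc_vector` are hypothetical names used only for illustration.

```python
import numpy as np

# Made-up 2D vectors for illustration only (NOT real spaCy vectors)
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "boy": np.array([0.8, 0.2]),
    "woman": np.array([0.0, 1.0]),
    "lady": np.array([0.2, 0.8]),
}


def toy_doc_vector(text, vectors):
    """Average the vectors of the tokens of `text` that are in `vectors`."""
    tokens = [tok.strip(".,'") for tok in text.lower().split()]
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


male_doc = toy_doc_vector("a boy, a guy, or a man", toy_vectors)
female_doc = toy_doc_vector("it's a woman, or a lady", toy_vectors)
print(male_doc, female_doc)
```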

### With Doc2Vec

Let's do the same with a pretrained Doc2Vec model : 

## Using Gensim

### Prepare Data

Create y vector : 

Create X : 

Cross validation : 

In [None]:
def cv():
    return StratifiedShuffleSplit(n_splits=5, test_size=0.25)


cv()
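To see what this splitter does, here is a small sketch on hypothetical imbalanced labels (`X_toy` and `y_toy` are made up). It assumes a fixed `random_state`, which the `cv()` helper above does not set, so that the splits are reproducible.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 75% class 0, 25% class 1
y_toy = np.array([0] * 60 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# random_state is an addition for reproducibility
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=42)

test_sizes = []
test_ratios = []
for train_idx, test_idx in splitter.split(X_toy, y_toy):
    test_sizes.append(len(test_idx))
    # Share of class 1 in the test fold: it should match the global share
    test_ratios.append(y_toy[test_idx].mean())

print(test_sizes, test_ratios)
```

Each of the 5 splits is drawn independently, so the same sample may appear in several test folds, unlike with `StratifiedKFold`.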

### By Hand

Our documents : 

Init spacy : 

Preprocess (clean) the corpus : 

In [None]:
tokenized_docs = [
    [
        token.lemma_
        for token in nlp(doc.lower())
        if not token.is_stop and not token.is_punct
    ]
    for doc in documents
]
tokenized_docs[:10]

The key concept here is the tagged document : a list of tokens plus an id (tag)
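As a minimal sketch of that structure: gensim's `TaggedDocument` is a namedtuple with the fields `words` and `tags`. The block below uses a stand-in namedtuple, `ToyTaggedDocument` (a hypothetical name), so it runs without gensim; `toy_tokenized_docs` is made-up data.

```python
from collections import namedtuple

# Stand-in with the same two fields as gensim's TaggedDocument
# (hypothetical name, chosen so it does not shadow the real class)
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

# Made-up tokenized documents
toy_tokenized_docs = [
    ["king", "rule", "country"],
    ["queen", "crown", "england"],
]

# Each document gets its position in the corpus as its tag (id)
toy_tagged_docs = [
    ToyTaggedDocument(words=tokens, tags=[i])
    for i, tokens in enumerate(toy_tokenized_docs)
]
print(toy_tagged_docs[1])
```

With gensim installed, replacing `ToyTaggedDocument` with `TaggedDocument` should give objects with the same fields.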

Train the Doc2Vec model

sm : 

md : 

lg : 

Get the vectors : 

Data Type : 

Length ? : 

Rebuild a 'special' X : 

Shape : 

Grid : 

Resultize : 

In [None]:
def resultize(grid):

    res = pd.DataFrame(grid.cv_results_)
    cols = [i for i in res.columns if "split" not in i]
    res = res.loc[:, cols]
    res = res.round(2).sort_values("mean_test_score", ascending=False).head(10)

    return res


resultize(grid)

### Using a pipeline

What is "passthrough" ? 
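In a scikit-learn `Pipeline`, setting a step to the string `"passthrough"` skips that step: the data goes through unchanged. A small sketch on made-up data (`X_toy` is hypothetical):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with two columns on very different scales
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

pipe = Pipeline([("scaler", StandardScaler())])

# With a real scaler, each column is centered and scaled
scaled = pipe.fit_transform(X_toy)

# With "passthrough", the step is skipped and the data is returned as is
pipe.set_params(scaler="passthrough")
untouched = pipe.fit_transform(X_toy)

print(scaled.mean(axis=0), untouched)
```

This is what lets a grid search compare "no scaling" against several scalers within a single parameter grid.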

Param grid : 

In [None]:
param_grid = {
    "scaler": [
        "passthrough",
        StandardScaler(),
        QuantileTransformer(n_quantiles=100),
        # MinMaxScaler(),
        Normalizer(),
    ],
    "reductor": [PCA()],
    "reductor__n_components": [0.7, 0.85, 0.9, 0.95, 0.99],
    "estimator": [RandomForestClassifier(), LogisticRegression()],
}
param_grid

New grid : 
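A possible sketch of such a grid, assuming the pipeline steps are named `scaler`, `reductor` and `estimator` as in the param grid above. The data is synthetic (`make_classification`), the grid is reduced to keep the run short, and `toy_grid` is a hypothetical name chosen so it does not overwrite `grid`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the document vectors and their labels
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Step names must match the keys of the param grid
pipe = Pipeline(
    [
        ("scaler", "passthrough"),
        ("reductor", PCA()),
        ("estimator", LogisticRegression(max_iter=1000)),
    ]
)

# Reduced grid: 2 scalers x 2 PCA settings = 4 candidates
small_param_grid = {
    "scaler": ["passthrough", StandardScaler()],
    "reductor__n_components": [0.85, 0.95],
}

toy_grid = GridSearchCV(
    pipe,
    small_param_grid,
    cv=StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0),
    scoring="accuracy",
)
toy_grid.fit(X_toy, y_toy)
print(toy_grid.best_params_)
```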

Results 

### Using a custom transformer 

Our Transformer : 

In [None]:
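class Doc2VecTransformer(BaseEstimator, TransformerMixin):
    """Train a Doc2Vec model on the fitted corpus and embed raw documents."""

    def __init__(self, vector_size=500, window=5, min_count=5, epochs=100):

        # Hyperparameters, exposed to GridSearchCV through get_params / set_params
        self.vector_size = vector_size
        self.window = window
        self.min_count = min_count
        self.epochs = epochs
        self.model = None

    def fit(self, X, y=None):

        # Accept either a list or a pandas Series of raw strings
        if not isinstance(X, list):
            _X = X.values.tolist()
        else:
            _X = X

        # One tagged document per text: tokens + position in the corpus
        tagged_docs = [
            TaggedDocument(words=preprocess_string(doc), tags=[i])
            for i, doc in enumerate(_X)
        ]
        model = Doc2Vec(
            vector_size=self.vector_size,
            window=self.window,  # was not passed to the model before
            min_count=self.min_count,
            epochs=self.epochs,
        )
        model.build_vocab(tagged_docs)
        model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)
        self.model = model

        return self

    def transform(self, X, y=None):

        if not isinstance(X, list):
            _X = X.values.tolist()
        else:
            _X = X

        # Iterate over the converted list (_X), not the raw input
        vectors = [self.model.infer_vector(preprocess_string(i)) for i in _X]

        # Return a 2D array of shape (n_documents, vector_size)
        return np.array(vectors)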
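The same `fit` / `transform` contract can be checked without gensim. The sketch below uses a hypothetical toy transformer, `TextStatsTransformer`, to show that the parameters stored in `__init__` are what `clone` (and therefore `GridSearchCV`) copies.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class TextStatsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: [n_chars, n_words of at least min_word_length]."""

    def __init__(self, min_word_length=1):
        # Only store the hyperparameter, under the same name as the argument
        self.min_word_length = min_word_length

    def fit(self, X, y=None):
        # Nothing to learn for this toy example
        return self

    def transform(self, X, y=None):
        rows = []
        for doc in X:
            kept = [w for w in doc.split() if len(w) >= self.min_word_length]
            rows.append([len(doc), len(kept)])
        return np.array(rows)


toy_docs = ["The king rules", "A queen"]

transformer = TextStatsTransformer(min_word_length=4)
features = transformer.fit_transform(toy_docs)

# clone builds a fresh, unfitted copy from the __init__ parameters
cloned = clone(transformer)
print(features, cloned.get_params())
```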

Original df : 

In [None]:
df

Init the Doc2Vec transformer : 

Fit : 

Transform : 

New param grid : 

New Grid : 

Fit : 

Results : 

Testing various transformer params : 

Grid : 

Fit : 

Results : 

### Using OpenAI GPT Embedding

Init your client : 

Doc : 

Just a try : 

What is response : 

The vector : 

With a custom transformer : 

In [None]:
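# Needed here because this import is commented out in the Import section
from openai import OpenAI


class OpenAIVecTransformer(BaseEstimator, TransformerMixin):

    def __init__(self, model="text-embedding-3-small"):

        self.model = model
        # OpenAI() reads the API key from the OPENAI_API_KEY environment variable
        self.client = OpenAI()

    def fit(self, X, y=None):

        # Nothing to learn: the embedding model is hosted remotely
        return self

    def transform(self, X, y=None):

        # Accept either a list or a pandas Series of raw strings
        if not isinstance(X, list):
            _X = X.values.tolist()
        else:
            _X = X

        # One API call per document; keep the embedding of the single returned item
        get_vect = (
            lambda i: self.client.embeddings.create(input=i, model=self.model)
            .data[0]
            .embedding
        )
        # Iterate over the converted list (_X), not the raw input
        X_ = [get_vect(i) for i in _X]

        return X_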
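The embedding logic can be tried offline, without an API key, by swapping in a stub client. The sketch below assumes only the response shape used above (`response.data[i].embedding`) and that results come back in input order. `FakeEmbeddings`, `FakeClient` and `embed_documents` are hypothetical names, and the fake vectors are meaningless.

```python
from types import SimpleNamespace


class FakeEmbeddings:
    """Hypothetical stand-in for `client.embeddings` returning fake vectors."""

    def create(self, input, model):
        # Accept a single string or a list of strings
        texts = [input] if isinstance(input, str) else list(input)
        data = [
            SimpleNamespace(embedding=[float(len(t)), float(len(t.split()))])
            for t in texts
        ]
        return SimpleNamespace(data=data)


class FakeClient:
    """Hypothetical stand-in for the OpenAI client."""

    def __init__(self):
        self.embeddings = FakeEmbeddings()


def embed_documents(client, documents, model="text-embedding-3-small"):
    # Send the whole batch in one request instead of one request per document
    response = client.embeddings.create(input=documents, model=model)
    return [item.embedding for item in response.data]


vectors = embed_documents(FakeClient(), ["king", "queen of england"])
print(vectors)
```

Batching several documents per request usually reduces the number of API calls, at the cost of handling request size limits.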