<a href="https://colab.research.google.com/github/Jorgecardetegit/NLP/blob/main/Elon_Musk_Bot_with_BlenderBot_using_HuggingFace_%F0%9F%A4%97.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Elon Musk Bot with Blender Bot** 🚀

##  **1. Open-domain chatbots**
Open-domain chatbots are designed to engage in conversations on a wide range of topics without being restricted to a specific domain or set of functions. Unlike closed-domain chatbots, which are developed for specific tasks like customer support or booking reservations, open-domain chatbots aim to understand and generate human-like responses in any conversation.

**Key Characteristics**:

- **Versatility:** They can handle a vast array of topics, from casual chit-chat to more complex discussions.
- **Generative Responses:** Instead of relying on predefined responses, they generate replies based on the context of the conversation.
- **Learning Capability:** Many open-domain chatbots use machine learning models that have been trained on large datasets to improve their conversational abilities.

##  **2. Blender Bot**

BlenderBot is an open-domain chatbot developed by Facebook AI Research (FAIR) and released in 2020. At release, its largest variant (9.4 billion parameters) was one of the largest open-domain chatbots available. The models were pre-trained on a large corpus of public Reddit discussions and then fine-tuned on smaller, curated dialogue datasets.

BlenderBot and OpenAI's GPT models (like GPT-3 or GPT-4) are both well-known generative models used for open-domain conversation. They share a generative nature and training on vast amounts of data, but they differ in architecture (BlenderBot is an encoder-decoder Transformer, while GPT models are decoder-only), in training methodology, and in design goals.

**Key Characteristics**:

- **Blended Skills:** BlenderBot was fine-tuned on the "Blended Skill Talk" (BST) dataset, which combines several conversational skills (showing empathy, providing knowledge, and keeping a consistent personality) within a single conversation, so that one system learns to blend them.

- **Generative Design:** Similar to other advanced chatbots like GPT-3 or GPT-4, BlenderBot generates responses rather than relying on predefined answers. This allows it to craft more nuanced and contextually relevant replies.

- **Training Data:** FAIR trained the model with ParlAI, its open-source dialogue research framework, which provides access to a wide range of dialogue datasets. This includes data from both simulated and real user interactions.

### **2.1 blenderbot-400M-distill**

In this project we will use the blenderbot-400M-distill version. See the following Hugging Face model page for a more in-depth explanation; you can also try the model directly through the hosted inference widget on that page.

https://huggingface.co/facebook/blenderbot-400M-distill?text=Hey+until+when+aer+you+updated
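
As a rough illustration of what the hosted widget does, the same checkpoint can be queried locally with the `transformers` library. This is a minimal sketch, not the notebook's own pipeline: it assumes PyTorch is installed and that the weights can be downloaded, `chat_once` is a hypothetical helper name, and `max_new_tokens=60` is an arbitrary choice.

```python
def chat_once(message, model_id="facebook/blenderbot-400M-distill"):
    """Generate a single reply to `message` (hypothetical helper for illustration)."""
    # Imported lazily so that defining the function does not require the libraries.
    from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

    tokenizer = BlenderbotTokenizer.from_pretrained(model_id)
    model = BlenderbotForConditionalGeneration.from_pretrained(model_id)

    # Encode the user message as a batch containing one sequence (PyTorch tensors).
    inputs = tokenizer([message], return_tensors="pt")
    # Generate a reply; the length cap is an arbitrary illustrative value.
    reply_ids = model.generate(**inputs, max_new_tokens=60)
    return tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]
```

Calling `chat_once("Hello, how are you?")` downloads the weights on first use, so it needs network access and may take a while.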

#### **2.1.1 Model distillation**
This model is a distilled version of Facebook's BlenderBot with approximately 400 million parameters. It was produced through knowledge distillation, a technique in which a smaller model (the student) is trained to replicate the behavior of a larger model (the teacher).
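
To make the student/teacher idea concrete, here is a minimal NumPy sketch of a generic soft-target distillation loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. This is the textbook formulation and an assumption on our part; it is not necessarily the exact recipe used to produce this checkpoint. The function names and `temperature=2.0` are illustrative.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with temperature scaling."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, averaged over the batch."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return float(np.mean(kl) * temperature ** 2)

teacher = [[2.0, 1.0, 0.1]]
same_loss = distillation_loss(teacher, teacher)             # student matches teacher
diff_loss = distillation_loss([[0.1, 1.0, 2.0]], teacher)   # student disagrees
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is often combined with the usual cross-entropy loss on the ground-truth labels.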


## **3. Dataset (Joe Rogan Experience 1169 - Elon Musk)**
The data used for this project is taken from the Kaggle dataset **Joe Rogan Experience 1169 - Elon Musk**. This dataset consists of the transcript of Elon Musk's interview on Joe Rogan's podcast in September 2018.

The dataset contains 3 columns:

- **Timestamp:** When the phrase was said.
- **Speaker:** Name of the person who speaks.
- **Text:** The actual phrase.
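
One way these three columns could be turned into chatbot training examples is to pair each of the guest's utterances with the utterance that precedes it. The sketch below is only an illustration under assumptions: `build_pairs` is a hypothetical helper, the speaker label `"Elon Musk"` is assumed to match the value stored in the **Speaker** column, and the rows are placeholders rather than quotes from the transcript.

```python
def build_pairs(rows, target_speaker="Elon Musk"):
    """Pair each utterance by `target_speaker` with the other speaker's preceding utterance."""
    pairs = []
    for previous, current in zip(rows, rows[1:]):
        is_reply = current["Speaker"] == target_speaker
        from_other = previous["Speaker"] != target_speaker
        if is_reply and from_other:
            pairs.append((previous["Text"], current["Text"]))
    return pairs

# Placeholder rows for illustration only; these are not quotes from the transcript.
rows = [
    {"Timestamp": "[00:00:00]", "Speaker": "Joe Rogan", "Text": "<host question 1>"},
    {"Timestamp": "[00:00:05]", "Speaker": "Elon Musk", "Text": "<guest answer 1>"},
    {"Timestamp": "[00:00:09]", "Speaker": "Joe Rogan", "Text": "<host question 2>"},
    {"Timestamp": "[00:00:12]", "Speaker": "Elon Musk", "Text": "<guest answer 2>"},
]
pairs = build_pairs(rows)
```

Each resulting tuple is a (context, response) pair in which the response is always spoken by the target speaker.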

Check the link to know more about the dataset:

https://www.kaggle.com/datasets/christianlillelund/joe-rogan-experience-1169-elon-musk

### **3.1 Alternative dataset (Elon Musk interviews)**

A much larger dataset containing video transcriptions from several different interviews with Elon Musk.

Check the link to see the dataset:

https://www.kaggle.com/datasets/folefac/elon-musk-interviews


                                                      


# 1. Import libraries and install dependencies

In [None]:
import importlib.util

# Function to check if a library is installed
def is_library_installed(name):
    spec = importlib.util.find_spec(name)
    return spec is not None

# Check if both 'transformers' and 'datasets' are installed
if not is_library_installed('transformers') or not is_library_installed('datasets'):
    !pip install transformers datasets

In [5]:
import tensorflow as tf
import numpy as np
import io
import os
import pandas as pd
import re
import string
import time
from numpy import random
import tensorflow_datasets as tfds
import tensorflow_probability as tfp
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import Dense, Flatten, InputLayer, BatchNormalization, Dropout, Input, LayerNormalization
from tensorflow.keras.losses import BinaryCrossentropy, CategoricalCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Accuracy, TopKCategoricalAccuracy, CategoricalAccuracy, SparseCategoricalAccuracy
from tensorflow.keras.optimizers import Adam
from google.colab import drive
from google.colab import files
from datasets import load_dataset
from transformers import create_optimizer, DataCollatorForSeq2Seq, DataCollatorForLanguageModeling, BlenderbotTokenizerFast

In [None]:
# Maximum sequence length, in tokens
MAX_LENGTH = 256

# 2. Import dataset

In [None]:
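# Install the Kaggle CLI and register the API token (upload kaggle.json to the working directory first)
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
# Download and extract the transcript dataset
!kaggle datasets download -d christianlillelund/joe-rogan-experience-1169-elon-musk
!unzip -o "/content/joe-rogan-experience-1169-elon-musk.zip" -d "/content/dataset/"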

In [4]:
# Load the CSV as a DatasetDict with a single "train" split
filepath = "/content/dataset/joe-rogan-experience-1169-elon-musk.csv"
dataset = load_dataset('csv', data_files=filepath)

# 3. Basic EDA

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Timestamp', 'Speaker', 'Text'],
        num_rows: 1831
    })
})

In [6]:
dataset["train"][0]

{'Timestamp': '[00:00:00]',
 'Speaker': 'Joe Rogan',
 'Text': 'Ah, ha, ha, ha. Four, three, two, one, boom. Thank you. Thanks for doing this, man. Really appreciate it.'}

# 4. Preprocessing

In [None]:
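model_id = "facebook/blenderbot-400M-distill"
# truncation_side="left" drops tokens from the start of an over-long input,
# so the most recent part of the conversation is kept
tokenizer = BlenderbotTokenizerFast.from_pretrained(model_id, truncation_side="left")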
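
The effect of `truncation_side="left"` can be illustrated without loading the tokenizer. The sketch below is a simplified stand-in that works on plain lists of token ids; the real tokenizer also accounts for special tokens, and the helper names are hypothetical.

```python
def truncate_left(token_ids, max_length):
    """Keep the last `max_length` tokens, dropping the oldest ones."""
    start = max(len(token_ids) - max_length, 0)
    return token_ids[start:]

def truncate_right(token_ids, max_length):
    """Keep the first `max_length` tokens, dropping the newest ones."""
    return token_ids[:max_length]

history = list(range(10))  # stand-in token ids, oldest first
kept_left = truncate_left(history, 4)    # [6, 7, 8, 9]
kept_right = truncate_right(history, 4)  # [0, 1, 2, 3]
```

For a chatbot, left truncation is usually the more sensible choice, because the latest turns of the dialogue matter most for the next reply.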