## 1. Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'pandas'

In [None]:
import pandas as pd

# Function to load data
def load_data(url):
    try:
        df = pd.read_csv(url)
        print("Data loaded successfully!")
        return df
    except Exception as e:
        print(f"Failed to load data: {e}")

# URLs of the datasets
urls = {
    "sample_submission": "https://raw.githubusercontent.com/JohannG3/DS_ML/main/sample_submission.csv",
    "training_data": "https://raw.githubusercontent.com/JohannG3/DS_ML/main/training_data.csv",
    "unlabelled_test_data": "https://raw.githubusercontent.com/JohannG3/DS_ML/main/unlabelled_test_data.csv"
}

# Load datasets
sample_submission = load_data(urls['sample_submission'])
training_data = load_data(urls['training_data'])
unlabelled_test_data = load_data(urls['unlabelled_test_data'])

# Display the first few rows of the training data to confirm it's loaded correctly
training_data.head()


take a part to test, a part for hyperparameters (k-fold)

In [None]:
training_data.head(5)

Unnamed: 0,id,sentence,difficulty
0,0,Les coûts kilométriques réels peuvent diverger...,C1
1,1,"Le bleu, c'est ma couleur préférée mais je n'a...",A1
2,2,Le test de niveau en français est sur le site ...,A1
3,3,Est-ce que ton mari est aussi de Boston?,A1
4,4,"Dans les écoles de commerce, dans les couloirs...",B1


In [None]:
print(training_data.describe())

                id
count  4800.000000
mean   2399.500000
std    1385.784976
min       0.000000
25%    1199.750000
50%    2399.500000
75%    3599.250000
max    4799.000000


In [None]:
# Check for missing values
print(training_data.isnull().sum())
# Optionally, drop rows with missing values
training_data.dropna(inplace=True)

id            0
sentence      0
difficulty    0
dtype: int64


In [None]:
sample_submission.head(5)

Unnamed: 0,id,difficulty
0,0,A1
1,1,A1
2,2,A1
3,3,A1
4,4,A1


In [None]:
unlabelled_test_data.head(5)

Unnamed: 0,id,sentence
0,0,Nous dûmes nous excuser des propos que nous eû...
1,1,Vous ne pouvez pas savoir le plaisir que j'ai ...
2,2,"Et, paradoxalement, boire froid n'est pas la b..."
3,3,"Ce n'est pas étonnant, car c'est une saison my..."
4,4,"Le corps de Golo lui-même, d'une essence aussi..."


## 2.1 Baseline TF-IDF vectorization

To set up a baseline model for classifying the difficulty of French sentences, we can use a simple machine learning model with TF-IDF vectorization. This approach will help us quickly assess the nature of the problem and the effectiveness of basic techniques before diving into more complex models like neural networks.

Here’s how we’ll proceed:

TF-IDF Vectorization: Convert the text data from the sentence column into a format that a machine learning algorithm can process. TF-IDF stands for Term Frequency-Inverse Document Frequency, a numerical statistic that reflects the importance of a word to a document in a corpus.
Model Selection: Use a simple yet robust classifier like Logistic Regression, which is often effective for baseline models in text classification tasks.
Training the Model: Train the logistic regression model on the training data.
Prediction: Predict the difficulty level on a subset of the training data to see how the model performs.
Let's start by setting up the TF-IDF vectorization and training the Logistic Regression model. We'll split the training data into training and validation sets to evaluate the model's performance.

In [None]:
# Prepare the data
X = training_data['sentence']
y = training_data['difficulty']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

# Fit and transform the training data and transform the validation data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_val_tfidf = tfidf_vectorizer.transform(X_val)

# Initialize and train a Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Predict the difficulty level on the validation set
y_pred = model.predict(X_val_tfidf)

# Evaluate the model
report = classification_report(y_val, y_pred)

report

'              precision    recall  f1-score   support\n\n          A1       0.48      0.60      0.53       166\n          A2       0.31      0.28      0.29       158\n          B1       0.36      0.27      0.31       166\n          B2       0.35      0.29      0.32       153\n          C1       0.35      0.38      0.36       152\n          C2       0.42      0.51      0.46       165\n\n    accuracy                           0.39       960\n   macro avg       0.38      0.39      0.38       960\nweighted avg       0.38      0.39      0.38       960\n'

The baseline Logistic Regression model using TF-IDF vectorization has been trained and evaluated on the validation set. Here are the classification results:

Model Performance
Accuracy: 39%
Precision, Recall, F1-Score: Varied across different difficulty levels, which shows how the model performs for each class specifically.
Detailed Performance by Class
A1: Precision = 48%, Recall = 60%, F1-score = 53%
A2: Precision = 31%, Recall = 28%, F1-score = 29%
B1: Precision = 36%, Recall = 27%, F1-score = 31%
B2: Precision = 35%, Recall = 29%, F1-score = 32%
C1: Precision = 35%, Recall = 38%, F1-score = 36%
C2: Precision = 42%, Recall = 51%, F1-score = 46%
Interpretation
The model's overall accuracy at 39% indicates there is significant room for improvement. However, for a baseline, this gives us a starting point to understand how more complex models might improve upon these metrics.
The model performs best on classes A1 and C2, which suggests that some difficulty levels may have more distinctive linguistic features than others.
Next Steps
Given the baseline performance, you might consider:

Enhancing Preprocessing: Exploring better text preprocessing and feature engineering strategies.
Model Experimentation: Testing more complex models such as Support Vector Machines, Random Forests, or advanced neural networks like LSTM or BERT for NLP tasks.
Hyperparameter Tuning: Adjusting model parameters and vectorization settings to improve performance.

In [None]:
# Transform the unlabelled test data using the fitted TF-IDF vectorizer
X_test_tfidf = tfidf_vectorizer.transform(unlabelled_test_data['sentence'])

# Use the trained model to predict the difficulty level of the unlabelled test data
unlabelled_test_predictions = model.predict(X_test_tfidf)

# Create a DataFrame for the predicted difficulties
predicted_difficulties_1 = pd.DataFrame({
    'id': unlabelled_test_data['id'],
    'difficulty': unlabelled_test_predictions
})

predicted_difficulties_1.head()

Unnamed: 0,id,difficulty
0,0,C2
1,1,A2
2,2,A1
3,3,A2
4,4,C2


In [None]:
# Save the DataFrame to a CSV file
predicted_difficulties_1.to_csv('predicted_difficulties_1.csv', index=False)

## 2.2  TF-IDF vectorization & Support Vector Machine (SVM) model

Let's proceed by refining the TF-IDF vectorization and then training a Support Vector Machine (SVM) model on the data. We'll also adjust the n-gram range and experiment with other TF-IDF parameters to see if they help improve the model's performance.

I'll start by updating the TF-IDF vectorization and then fitting the SVM model.

In [None]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

# Update TF-IDF vectorizer with potentially better settings
tfidf_vectorizer_enhanced = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))

# Create a pipeline with the updated TF-IDF and SVM
svm_model = make_pipeline(tfidf_vectorizer_enhanced, SVC(kernel='linear', C=1))

# Fit the pipeline on the full training data
svm_model.fit(X_train, y_train)

# Evaluate the model using cross-validation on the training set
cv_scores = cross_val_score(svm_model, X_train, y_train, cv=5, scoring='accuracy')

# Calculate mean cross-validation accuracy
mean_cv_accuracy = np.mean(cv_scores)

mean_cv_accuracy

0.4354166666666666

The updated model, which uses an enhanced TF-IDF vectorization coupled with a Support Vector Machine (SVM) classifier, achieved a mean cross-validation accuracy of approximately 43.54% on the training set. This shows an improvement over the baseline Logistic Regression model.

Next Steps:
To further enhance the model's performance, we could consider the following strategies:

Hyperparameter Tuning: Experiment with different values for SVM's C parameter or try different kernels (e.g., RBF, polynomial).
Feature Engineering: Integrate word embeddings like Word2Vec or GloVe to capture deeper semantic meanings.
Advanced Models: Test deep learning approaches, such as LSTM or Transformer-based models like BERT, for potentially better handling of contextual information in text.
Data Augmentation: If applicable, generate synthetic data or use techniques like back-translation to increase the training dataset size and variability.

In [None]:
# Transform and predict the unlabelled test data using the trained SVM pipeline
unlabelled_test_predictions_svm = svm_model.predict(unlabelled_test_data['sentence'])

# Create a DataFrame for the predicted difficulties with SVM
predicted_difficulties_svm = pd.DataFrame({
    'id': unlabelled_test_data['id'],
    'difficulty': unlabelled_test_predictions_svm
})

predicted_difficulties_svm.head()

Unnamed: 0,id,difficulty
0,0,C2
1,1,B1
2,2,A1
3,3,A1
4,4,C2


In [None]:
# Save the DataFrame to a CSV file
predicted_difficulties_svm.to_csv('predicted_difficulties_svm.csv', index=False)

Great! With a precision of 45%, there's still room for improvement.

## 2.3 TF-IDF, SVM, Hyperparameters

Let’s explore some advanced strategies to further enhance the model's performance:

1. Hyperparameter Tuning
We'll start with tuning the hyperparameters of the SVM model, specifically the C parameter and trying different kernels. This might help in improving the model's ability to generalize better to unseen data.

2. Feature Engineering
We could integrate word embeddings, which provide a richer representation of the sentence semantics. Using pre-trained embeddings like Word2Vec or GloVe can capture deeper contextual meanings that might be missed by traditional TF-IDF.

3. Advanced Models
If the SVM with tuned hyperparameters and enhanced features still doesn’t achieve the desired performance, we might consider using deep learning models. LSTM (Long Short-Term Memory) networks or Transformer-based models like BERT are particularly effective for text-based tasks and handle context better in sentences.

Step 1: Hyperparameter Tuning for SVM
Let's begin by tuning the SVM's hyperparameters. We’ll use a grid search approach over a range of C values and try different kernels like RBF and polynomial. This step will help us determine the best set of parameters for our SVM model.

In [None]:
from sklearn.model_selection import GridSearchCV

# Define a range of C values to test
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['linear', 'rbf']
}

# Create a new pipeline for the SVM with the enhanced TF-IDF
svm_pipeline = make_pipeline(TfidfVectorizer(max_features=5000, ngram_range=(1, 3)),
                             SVC(random_state=42))

# Set up the grid search with cross-validation
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=3, scoring='accuracy', verbose=1)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

best_params, best_score

Fitting 3 folds for each of 6 candidates, totalling 18 fits


({'svc__C': 10, 'svc__kernel': 'rbf'}, 0.425)

In [None]:
# Update the SVM pipeline with the best parameters found
svm_best = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 3)),
    SVC(kernel='rbf', C=10, random_state=42)
)

# Fit the updated pipeline on the entire training data
svm_best.fit(X_train, y_train)

# Evaluate the model using cross-validation on the training set
cv_scores_best = cross_val_score(svm_best, X_train, y_train, cv=3, scoring='accuracy')

# Calculate mean cross-validation accuracy for the best model
mean_cv_accuracy_best = np.mean(cv_scores_best)
mean_cv_accuracy_best

0.425

In [None]:
# Predict the difficulty level of the unlabelled test data using the optimized SVM model
unlabelled_test_predictions_optimized_svm = svm_best.predict(unlabelled_test_data['sentence'])

# Create a DataFrame for the predicted difficulties with the optimized SVM model
predicted_difficulties_optimized_svm = pd.DataFrame({
    'id': unlabelled_test_data['id'],
    'difficulty': unlabelled_test_predictions_optimized_svm
})

predicted_difficulties_optimized_svm.head()

Unnamed: 0,id,difficulty
0,0,C2
1,1,B1
2,2,A2
3,3,A1
4,4,C2


In [None]:
# Save the DataFrame to a CSV file
predicted_difficulties_optimized_svm.to_csv('predicted_difficulties_optimized_svm.csv', index=False)

## 2.4 BERT

2.4.1

Great choice! Implementing a deep learning model, particularly using a Transformer-based approach like BERT (Bidirectional Encoder Representations from Transformers), can be highly effective for NLP tasks such as text classification due to its ability to understand the context of words in a sentence.

Plan for Implementing BERT:
Model Selection: We'll use a pre-trained BERT model suited for classification. The transformers library by Hugging Face provides several variations of BERT that are pre-trained and can be fine-tuned for specific tasks.
Data Preparation: The input data needs to be tokenized specifically for BERT using its tokenizer, which converts text into tokens that BERT has been trained on.
Model Training: Fine-tune the model on our specific dataset, adjusting the top layers to predict the difficulty levels of sentences.
Evaluation: Assess the model's performance on the validation set and optimize as necessary.
Setting Up the Environment:
To proceed, we'll need to install the transformers and torch libraries, which provide the pre-trained models and the necessary functionality to train them with PyTorch.

Let's start by setting up the tokenizer and preparing the data for BERT. I'll load a basic BERT model suited for text classification and prepare the training data accordingly.

In [None]:
# First, let's install the transformers library if it's not already installed
"""
!pip install transformers

# Import necessary modules from transformers
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the training data
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)

# Prepare labels (convert difficulty levels to integers for training)
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(y_train)

# Import torch dataset
import torch

class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Convert the encoded training data into a Dataset object
train_dataset = Dataset(train_encodings, train_labels)

# Check if the dataset is set up correctly
train_dataset[0]
"""



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


{'input_ids': tensor([  101,  2006,  1050,  1005,  4372,  4066, 14674,  1010,  3802, 25175,
          1010,  1046,  1005,  4372,  9932,  9388,  2890,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

2.4.2

In [None]:
#pip install datasets

Collecting datasets
  Downloading datasets-2.19.0-py3-none-any.whl (542 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m8.4 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub>=0.21.2 (from datasets)
  Downloading huggingface_hub-0.22.2-py3-none-any

In [None]:
#pip install accelerate



In [None]:
#!pip install transformers[torch] -U

Collecting transformers[torch]
  Downloading transformers-4.40.1-py3-none-any.whl (9.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.0/9.0 MB[0m [31m41.0 MB/s[0m eta [36m0:00:00[0m
Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.29.3-py3-none-any.whl (297 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.6/297.6 kB[0m [31m31.9 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch-

In [None]:
# Ensure all necessary libraries are installed
#!pip install transformers datasets accelerate torch
"""
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
from datasets import Dataset, load_metric

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)

# Prepare label mapping
label_to_id = {label: i for i, label in enumerate(training_data['difficulty'].unique())}

def encode_sentences(sentences, labels):
    # Convert labels to IDs
    label_ids = [label_to_id[label] for label in labels]
    # Tokenize and encode sentences with padding and truncation
    inputs = tokenizer(sentences, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    return {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'], 'labels': torch.tensor(label_ids)}

# Prepare dataset
train_sentences = training_data['sentence'].tolist()
train_labels = training_data['difficulty'].tolist()

# Encode all data
train_encodings = [encode_sentences([sentence], [label]) for sentence, label in zip(train_sentences, train_labels)]

# Creating a combined dataset from individual encodings
input_ids = torch.cat([enc['input_ids'] for enc in train_encodings], dim=0)
attention_mask = torch.cat([enc['attention_mask'] for enc in train_encodings], dim=0)
labels = torch.cat([enc['labels'] for enc in train_encodings], dim=0)

# Create Hugging Face dataset
train_dataset = Dataset.from_dict({
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'labels': labels
})

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch"
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

# Start training
trainer.train()
"""

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

## 2.5 Bert easy

In [None]:
!pip install transformers[torch] -U



In [None]:
pip install datasets



In [None]:
pip install accelerate



In [None]:
import accelerate
print(accelerate.__version__)

0.29.3


In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
from datasets import load_dataset, Dataset

# Load tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=6)  # Adjust num_labels based on your task

# Prepare label mapping
label_to_id = {label: i for i, label in enumerate(sorted(training_data['difficulty'].unique()))}

def encode_sentences(sentences, labels):
    inputs = tokenizer(sentences, truncation=True, padding="max_length", max_length=512, return_tensors="pt")
    labels = [label_to_id[label] for label in labels]
    return {'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask'], 'labels': torch.tensor(labels)}

# Prepare dataset
train_sentences = training_data['sentence'].tolist()
train_labels = training_data['difficulty'].tolist()

# Encoding the data
encoded_data = [encode_sentences([sentence], [label]) for sentence, label in zip(train_sentences, train_labels)]
input_ids = torch.cat([item['input_ids'] for item in encoded_data], dim=0)
attention_mask = torch.cat([item['attention_mask'] for item in encoded_data], dim=0)
labels = torch.cat([item['labels'] for item in encoded_data], dim=0)

# Create dataset object
train_dataset = Dataset.from_dict({'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels})

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch"
)

# Initialize and run the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)
trainer.train()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

## 2.6 Word Embedding

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.initializers import Constant

# Example data
texts = ['I love machine learning', 'Deep learning is amazing', 'NLP is interesting']
labels = [0, 1, 1]  # Example binary labels

# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Prepare embedding matrix
word_index = tokenizer.word_index
num_tokens = len(word_index) + 1
embedding_dim = 100  # Depends on the GloVe embeddings you use

# Load GloVe word embeddings
embedding_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Model architecture
model = Sequential()
model.add(Embedding(num_tokens,
                    embedding_dim,
                    embeddings_initializer=Constant(embedding_matrix),
                    input_length=max([len(seq) for seq in sequences]),
                    trainable=False))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Pad sequences and convert labels to array
data = pad_sequences(sequences, maxlen=max([len(seq) for seq in sequences]))
labels = np.asarray(labels)

# Train model
model.fit(data, labels, epochs=10, verbose=1)


FileNotFoundError: [Errno 2] No such file or directory: 'glove.6B.100d.txt'