In [None]:
!pip install datasets
!pip install accelerate -U
!pip install sentencepiece sacremoses
!python -m spacy download fr_core_news_sm
!pip install evaluate

# Exploring Camembert for French Text Difficulty Classification

## Introduction

In this notebook, we will explore the use of Camembert, a popular pre-trained language model for the French language, to tackle the task of classifying French text difficulty. Text difficulty classification is a fundamental natural language processing (NLP) task that has various applications, such as educational content assessment, language learning, and content recommendation.

Camembert is a transformer-based language model, and it has demonstrated impressive performance on various NLP tasks in the French language. Leveraging its pre-trained representations, we will train and evaluate a model to predict the difficulty level of French texts.



## Approach

Our approach will involve the following steps:

1. **Data Preparation**: We will load and preprocess the dataset, including text tokenization, data splitting, and any necessary data augmentation.

2. **Model Selection**: We will fine-tune a Camembert model for our text difficulty classification task. Fine-tuning involves adapting the pre-trained model to our specific task by training it on our labeled dataset.

3. **Model Training**: We will train the Camembert-based model on the training data and monitor its performance using appropriate evaluation metrics.

4. **Evaluation**: After training, we will evaluate the model's performance on a separate validation or test dataset. We will assess its accuracy, precision, recall, F1-score, and any other relevant metrics.

5. **Inference**: We will demonstrate how to use the trained model to predict the difficulty level of new, unseen French texts.

6. **Analysis and Interpretation**: We will analyze the model's predictions, inspect its behavior, and identify potential areas for improvement.

## Dependencies

Before we begin, ensure that you have the following dependencies installed:

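- Python 3.x
- PyTorch
- Transformers library (Hugging Face Transformers), plus `datasets`, `evaluate`, and `accelerate`
- TensorFlow/Keras (for the convolutional classifier)
- spaCy with the `fr_core_news_sm` model, and NLTK
- Scikit-learn, pandas, and NumPy
- Matplotlib (for data visualization)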

You can install these libraries using `pip` or `conda` as needed.

Let's get started with data loading and preparation!

In [None]:
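import nltk
from nltk.corpus import stopwords  # needed for stopwords.words() below
import spacy
from google.colab import drive

nltk.download('stopwords')
french_stopwords = set(stopwords.words('french'))  # French stop-word list from NLTK
nlp = spacy.load('fr_core_news_sm')  # Load the French model
drive.mount('/content/drive')  # Colab only: gives access to files stored on Google Drive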

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Text Classification using Camembert and TensorFlow

This notebook demonstrates text classification using the Camembert model from the Hugging Face Transformers library and TensorFlow. The code follows these steps:

1. **Data Loading and Preprocessing:**
   - Load a CSV dataset with columns 'sentence' for text and 'difficulty' for labels.
   - Rename columns to 'text' and 'labels' and drop the 'id' column if present.
   - Apply text preprocessing by converting text to lowercase and removing non-alphabet characters.
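
As a quick illustration of the cleaning step, the sketch below applies the same lowercase-and-filter idea to a single sentence. The sample sentence and the `clean` helper name are made up for this example; the character class mirrors the one used in the notebook's `initial_clean` function.

```python
import re

# Same idea as the notebook's cleaning step: lowercase the text, then replace
# every character that is not a (possibly accented) French letter with a space.
NON_LETTERS = r'[^a-zàâçéèêëîïôûùüÿñæœ]'

def clean(text):
    return re.sub(NON_LETTERS, ' ', text.lower())

sample = "C'est l'été 2024 !"   # hypothetical example sentence
cleaned = clean(sample)

print(cleaned.split())   # ['c', 'est', 'l', 'été']
```

Note that apostrophes become spaces, so elided forms such as `c'est` turn into two tokens, and digits and punctuation are discarded entirely.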

In [None]:
import re

import numpy as np
import pandas as pd
import spacy
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, CamembertTokenizer, TFAutoModel


# Load the dataset
file_path = '/content/training_data.csv'  # Replace with your file path
data = pd.read_csv(file_path)


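# Rename the columns and drop 'id' if the file has one
data = data.rename(columns={'sentence': 'text', 'difficulty': 'labels'}).drop(columns=['id'], errors='ignore')


def initial_clean(text):
    """Lowercase the text and replace every non-letter character with a space."""
    text = text.lower()
    text = re.sub(r'[^a-zàâçéèêëîïôûùüÿñæœ]', ' ', text)
    return text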

data['text'] = data['text'].apply(initial_clean)

# Encode the difficulty labels as integers
# (for demonstration purposes I only use 2k rows of data)
LE = LabelEncoder()
data['labels'] = LE.fit_transform(data['labels'])
data.head()


Unnamed: 0,text,labels
0,les coûts kilométriques réels peuvent diverger...,4
1,le bleu c est ma couleur préférée mais je n a...,0
2,le test de niveau en français est sur le site ...,0
3,est ce que ton mari est aussi de boston,0
4,dans les écoles de commerce dans les couloirs...,2


# Text Classification with Camembert and Transformers

This code prepares the data and the Camembert model for fine-tuning with Hugging Face Transformers (the `AutoModelForSequenceClassification` class used here is the PyTorch implementation). It consists of the following steps:

1. **Dataset Splitting:**
   - We split the `data` into training and validation datasets using `train_test_split` with an 80-20 ratio.

2. **Dataset Preparation:**
   - We create `train_dataset` and `val_dataset` from Pandas DataFrames using the `Dataset` class from the `datasets` library.

3. **Tokenizer Initialization:**
   - We initialize a tokenizer for Camembert using `AutoTokenizer` from Transformers.

4. **Model Initialization:**
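   - We load a Camembert-based model for sequence classification using `AutoModelForSequenceClassification` from Transformers, specifying `num_labels=6` (one output per difficulty level). The weights are read from a checkpoint directory on Google Drive rather than from the `camembert-base` hub model.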
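
The split in the next cell is random, and no seed is passed to `train_test_split`, so the rows that end up in each set change from run to run. The sketch below shows the idea of a seeded 80-20 split in plain Python; the `split_80_20` helper is hypothetical and only illustrates the principle. With scikit-learn, passing `random_state` to `train_test_split` serves the same purpose.

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle a copy of `rows` with a fixed seed and cut it 80/20."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))          # stand-in for the DataFrame rows
train, val = split_80_20(rows)

print(len(train), len(val))     # 8 2
```

Calling `split_80_20(rows)` again with the same seed returns the same two lists, which makes validation scores comparable between runs.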

In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 80/20 train/validation split
train_data, val_data = train_test_split(data, test_size=0.2)
train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
# Load the classifier from a checkpoint saved on Google Drive
bmodel = AutoModelForSequenceClassification.from_pretrained("/content/drive/MyDrive/cambert_french_finetuned", num_labels=6)

# Model Training Configuration and Evaluation Metrics

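This code defines the training configuration for the model and sets up the evaluation metric. It includes the following steps:

1. **Training Configuration:**
   - We create `training_args` using `TrainingArguments` from Transformers, specifying the `output_dir` where model checkpoints and results will be saved.

2. **Evaluation Metric:**
   - We load the accuracy metric with `evaluate.load("accuracy")` and define `compute_metrics`, which takes the argmax of the logits as the predicted class and compares it with the reference labels.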
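
The metric function in the next cell turns logits into class predictions with an argmax and compares them with the labels. A minimal NumPy sketch of that arithmetic follows; the logits, labels, and the `accuracy_from_logits` helper name are made up for illustration and are not part of the `evaluate` library.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return float((predictions == labels).mean())

logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 0.3, 0.9],
                   [1.5, 1.7, 0.0],
                   [0.0, 0.1, 3.0]])
labels = np.array([0, 2, 0, 2])

acc = accuracy_from_logits(logits, labels)
print(acc)   # 0.75 (three of the four rows are classified correctly)
```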

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")
import numpy as np
import evaluate
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Model Training with Hugging Face Transformers

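This cell configures fine-tuning with the Hugging Face `Trainer`. It sets the number of epochs to 19 with evaluation at the end of each epoch, then builds a `Trainer` from the model, the tokenized training and validation datasets, and the `compute_metrics` function.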

In [None]:
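from transformers import TrainingArguments, Trainer

def tokenize_function(examples):
    # Pad/truncate every text to the same length so examples can be batched
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the train/validation datasets built earlier
traintokenized_datasets = train_dataset.map(tokenize_function, batched=True)
valtokenized_datasets = val_dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(output_dir="test_trainer", num_train_epochs=19, evaluation_strategy="epoch")

trainer = Trainer(
    model=bmodel,  # the Camembert classifier loaded above (`model` is not defined at this point)
    args=training_args,
    train_dataset=traintokenized_datasets,
    eval_dataset=valtokenized_datasets,
    compute_metrics=compute_metrics,
)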

### Code Explanation

This code snippet encompasses several steps in processing and saving a natural language model:

1. **Saving the Model:**
   - `trainer.save_model("/content/drive/roberta_french_finetuned/")`: This command saves the trainer's model to the specified directory in Google Drive, which is mounted in the Google Colab environment. Despite the directory name, the model in this notebook is a Camembert classifier; Camembert is built on the RoBERTa architecture.

2. **Extracting French Text Data:**
   - `frenc_texts = data['text'].values`: Here, we're extracting the 'text' column from a DataFrame `data`. This column contains French texts, and these texts are stored in the `frenc_texts` variable for further processing.

3. **Handling Part-of-Speech Tags:**
   - `all_pos_tags = {tag for tag in nlp.get_pipe("morphologizer").labels}`: This line creates a set of all part-of-speech (POS) tags available in the `morphologizer` component of a spaCy NLP model (`nlp`). The morphologizer is responsible for determining the morphological features (like tense, gender, number) and POS tags of words in a sentence.
   - `tag_dict = {tag: i for i, tag in enumerate(all_pos_tags)}`: This line creates a dictionary mapping each POS tag to a unique integer. This is useful for tasks that require a numerical representation of POS tags, such as feature encoding in machine learning models.


In [None]:
trainer.save_model("/content/drive/roberta_french_finetuned/")
frenc_texts=data['text'].values
all_pos_tags = {tag for tag in nlp.get_pipe("morphologizer").labels}
tag_dict = {tag: i for i, tag in enumerate(all_pos_tags)}

### Code Explanation

This code snippet involves defining a function for extracting part-of-speech (POS) tags from text and processing these tags for analysis:

1. **Function to Extract POS Tags:**
   - `def get_pos_tags(text): ...`: This function, `get_pos_tags`, takes a string `text` as input and returns a list of POS tags corresponding to each token in the text.
   - `doc = nlp(text)`: The text is processed by a spaCy NLP model (`nlp`), which tokenizes the text and performs various NLP tasks including POS tagging.
   - `return [token.pos_ for token in doc]`: The function returns a list of POS tags, where each tag corresponds to the part of speech of a token in the input text.

2. **Applying the Function to a DataFrame:**
   - `data['pos_tags'] = data['text'].apply(get_pos_tags)`: This line applies the `get_pos_tags` function to each row in the 'text' column of a DataFrame `data`. It creates a new column 'pos_tags' in the DataFrame, where each entry is a list of POS tags for the corresponding text.

3. **Getting Unique POS Tags:**
   - `all_pos_tags = set(tag for tags in data['pos_tags'] for tag in tags)`: This line extracts all unique POS tags present in the entire DataFrame. It iterates over each list of tags in the 'pos_tags' column and adds each tag to a set, ensuring that each POS tag is represented only once.

4. **Creating One-Hot Encoded Vectors for POS Tags:**
   - This block of code builds a binary feature matrix from the POS tags. The code calls this one-hot encoding; because several tags can be present in the same text, each row is strictly a multi-hot (presence/absence) vector rather than a vector with a single 1.
   - `one_hot_vectors = []`: Initializes an empty list to store the encoded vectors.
   - The `for` loop iterates over each list of tags in the 'pos_tags' column. For each list of tags:
       - `vector = [1 if pos_tag in tags else 0 for pos_tag in all_pos_tags]`: A binary vector is created for the current list of tags. Each element in the vector corresponds to a tag in `all_pos_tags`. If the tag is present in the current list of tags, the element is set to 1; otherwise, it is set to 0. How often a tag occurs is not recorded.
   - `one_hot_vectors.append(vector)`: Each vector is appended to the `one_hot_vectors` list. This results in a list of binary vectors representing the presence or absence of each POS tag in each text.
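
To make the encoding concrete, the sketch below builds the presence vector for one made-up tag sequence and, for comparison, a count vector that also records how often each tag occurs. The tag list, the fixed tag order, and the count variant are illustrative assumptions; the notebook itself only uses the presence form.

```python
from collections import Counter

tag_vocab = ["ADJ", "DET", "NOUN", "VERB"]       # hypothetical fixed tag order
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]    # tags for one made-up sentence

# Presence/absence vector, as in the notebook
presence = [1 if tag in tags else 0 for tag in tag_vocab]

# Count vector: an alternative that keeps tag frequencies
counts = Counter(tags)
count_vector = [counts[tag] for tag in tag_vocab]

print(presence)       # [0, 1, 1, 1]
print(count_vector)   # [0, 2, 2, 1]
```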


In [None]:
def get_pos_tags(text):
    doc = nlp(text)
    return [token.pos_ for token in doc]

# Apply the function to create a new column with POS tags
data['pos_tags'] = data['text'].apply(get_pos_tags)

# Get unique POS tags in the entire DataFrame
all_pos_tags = set(tag for tags in data['pos_tags'] for tag in tags)

# Create a one-hot encoded matrix for POS tags
one_hot_vectors = []
for tags in data['pos_tags']:
    vector = [1 if pos_tag in tags else 0 for pos_tag in all_pos_tags]
    one_hot_vectors.append(vector)


In [None]:

# Convert the list of one-hot vectors into a DataFrame
pos_tags_df = pd.DataFrame(one_hot_vectors, columns=list(all_pos_tags))

# Concatenate the POS tags DataFrame with your original DataFrame
df = pd.concat([data, pos_tags_df], axis=1)


In [None]:
bmodel.cuda()

CamembertForSequenceClassification(
  (roberta): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0-11): 12 x CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=Tru

### Code Explanation

This code snippet involves a function that uses a BERT model to generate embeddings from text:

1. **Importing PyTorch:**
   - `import torch`: This line imports the PyTorch library, which is a popular framework for deep learning and is commonly used for tasks involving neural networks.

2. **Function to Get BERT Embeddings:**
   - `def get_bert_embeddings(text): ...`: This function, `get_bert_embeddings`, takes a string `text` as input and returns its embedding generated by a BERT model.
   
3. **Tokenizing and Encoding Text:**
   - `encoded_input = tokenizer(text, return_tensors='pt', padding="max_length", max_length=512, truncation=True)`: 
     - The text is tokenized and encoded using a tokenizer compatible with a BERT model. 
     - `return_tensors='pt'` indicates that the output will be PyTorch tensors.
     - `padding="max_length"` and `max_length=512` ensure that all encoded texts are padded or truncated to the same length for batch processing.
     - `truncation=True` allows the tokenizer to truncate texts longer than the maximum length.

4. **Moving Encoded Input to CUDA:**
   - `encoded_input = {key: value.to('cuda') for key, value in encoded_input.items()}`: 
     - This line moves the encoded input to a CUDA-enabled device (typically a GPU) for faster processing. 
     - This is necessary for leveraging GPU acceleration during model inference.

5. **Model Inference Without Gradient Calculation:**
   - `with torch.no_grad(): ...`: 
     - This context manager tells PyTorch not to calculate gradients during the following operations, which is standard practice during inference to reduce memory usage and computation.
   - `output = bmodel(**encoded_input, output_hidden_states=True)`: 
     - The BERT model (`bmodel`) generates output for the encoded input. 
     - `output_hidden_states=True` indicates that the model should return all hidden states.

6. **Extracting and Pooling Embeddings:**
   - `return output.hidden_states[-1].mean(dim=1).squeeze().cpu().numpy()`: 
     - The function returns the mean-pooled vector of the last layer's hidden states.
     - `output.hidden_states[-1]` accesses the last layer's hidden states.
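     - `.mean(dim=1)` computes the mean across the sequence dimension, effectively pooling the embeddings. Because every input is padded to 512 tokens, the padding positions are included in this average.
     - `.squeeze()` removes any singleton dimensions. For a batch containing a single text, this also removes the batch dimension.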
     - `.cpu().numpy()` moves the tensor back to the CPU and converts it to a NumPy array, making it compatible with libraries like scikit-learn.

This function is typically used to convert text into a dense numerical representation, capturing linguistic features as understood by the BERT model, and can be used for various downstream machine learning or NLP tasks.
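
Since the function pads every input to 512 tokens, a plain mean over the sequence dimension also averages the vectors at padding positions. A common alternative is to weight the average by the attention mask. The sketch below shows the arithmetic in NumPy on a toy array; the numbers are invented, and this is a possible variant under those assumptions rather than what the notebook's function does.

```python
import numpy as np

# Toy "last hidden state": 1 sequence, 4 token positions, 2 dimensions.
hidden = np.array([[[1.0, 2.0],
                    [3.0, 4.0],
                    [9.0, 9.0],     # padding position
                    [9.0, 9.0]]])   # padding position
attention_mask = np.array([[1, 1, 0, 0]])   # 1 = real token, 0 = padding

# Plain mean, as in the notebook: padding positions are averaged in
plain_mean = hidden.mean(axis=1)

# Masked mean: zero out padding positions and divide by the number of real tokens
mask = attention_mask[..., None].astype(hidden.dtype)   # shape (1, 4, 1)
masked_mean = (hidden * mask).sum(axis=1) / mask.sum(axis=1)

print(plain_mean)    # [[5.5 6. ]]
print(masked_mean)   # [[2. 3.]]
```

In the notebook's function the mask would come from `encoded_input['attention_mask']`, which the tokenizer returns alongside the input ids.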


In [None]:
import torch
def get_bert_embeddings(text):
    """Return mean-pooled last-layer hidden states for a text or a list of texts."""
    # Tokenize and encode the text, padding/truncating to 512 tokens
    encoded_input = tokenizer(text, return_tensors='pt', padding="max_length", max_length=512, truncation=True)

    # Move encoded input to the GPU (the model was moved there with bmodel.cuda())
    encoded_input = {key: value.to('cuda') for key, value in encoded_input.items()}

    # Run the model without tracking gradients and request all hidden states
    with torch.no_grad():
        output = bmodel(**encoded_input, output_hidden_states=True)

    # Mean pooling over the sequence dimension; move to CPU for compatibility with scikit-learn
    return output.hidden_states[-1].mean(dim=1).squeeze().cpu().numpy()


### Code Explanation

This code snippet processes text data in batches to generate embeddings using a BERT model and then integrates these embeddings into a DataFrame:

1. **Setting the Batch Size:**
   - `batch_size=8`: Defines the size of each batch for processing the text data. Here, the batch size is set to 8, meaning that 8 texts will be processed together in each batch.

2. **Creating Text Batches:**
   - `text_batches = [df['text'][i:i + batch_size] for i in range(0, len(df), batch_size)]`: 
     - This line creates batches of texts from the DataFrame `df`. 
     - It slices the 'text' column of `df` into smaller chunks, each containing `batch_size` number of texts. 
     - These batches are stored in the list `text_batches`.

3. **Processing Batches to Generate Embeddings:**
   - `embeddings = []`: Initializes an empty list to store embeddings.
   - The `for` loop iterates over each batch in `text_batches`:
     - `embeddings.append(get_bert_embeddings(list(batch)))`: 
       - For each batch, the `get_bert_embeddings` function is called to generate embeddings for the texts in the batch. 
       - The embeddings for each batch are then appended to the `embeddings` list.

4. **Concatenating the Embeddings:**
   - `embeddings = np.concatenate(embeddings)`: 
     - After processing all batches, the embeddings from each batch are concatenated into a single NumPy array. 
     - This results in a collective array of embeddings for all texts in the DataFrame.

5. **Reshaping and Assigning Embeddings to DataFrame:**
   - `flattened_embeddings = embeddings.reshape((len(df), -1))`: 
     - The embeddings array is reshaped to have a shape of `(len(df), -1)`. 
     - This ensures that there is one embedding vector per row in the DataFrame, matching the number of texts.
   - The `for` loop iterates over each column in the `flattened_embeddings`:
     - `df[f'embed_{i}'] = flattened_embeddings[:, i]`: 
       - Each column of the `flattened_embeddings` is assigned to a new column in the DataFrame `df`. 
       - The columns are named `embed_0`, `embed_1`, and so on, each containing a different dimension of the embedding vectors.

This approach is useful for processing large text datasets with BERT embeddings, as it batches the texts to manage memory usage efficiently and then integrates the resulting embeddings into the original DataFrame for further analysis or machine learning tasks.
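
One detail of the batching logic is worth checking: the embedding function squeezes its output, so a final batch containing a single text would come back as a 1-D array and could not be concatenated with the 2-D batches. The sketch below illustrates this with a hypothetical `fake_embed` stand-in that returns one row per text; it does not call the real model.

```python
import numpy as np

def fake_embed(texts, dim=4):
    """Hypothetical stand-in for the embedding function: one row per text."""
    return np.ones((len(texts), dim))

texts = [f"phrase {i}" for i in range(9)]   # 9 texts -> batches of 8 and 1
batch_size = 8
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

# Keeping the batch axis (no squeeze) makes every result 2-D ...
embeddings = np.concatenate([fake_embed(batch) for batch in batches])
print(embeddings.shape)   # (9, 4)

# ... whereas squeezing a batch of one drops that axis.
last_squeezed = fake_embed(batches[-1]).squeeze()
print(last_squeezed.shape)   # (4,)
```

When the number of rows is an exact multiple of the batch size, as it can be for some datasets, the issue never shows up, which is why it is easy to miss.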


In [None]:
batch_size=8
text_batches = [df['text'][i:i + batch_size] for i in range(0, len(df), batch_size)]

# Process batches and concatenate the results
embeddings = []
for batch in text_batches:

    embeddings.append(get_bert_embeddings(list(batch)))

# Concatenate the embeddings
embeddings = np.concatenate(embeddings)

# Assign the embeddings to the DataFrame
# df['embed'] = embeddings

flattened_embeddings = embeddings.reshape((len(df), -1))

# Assign the embeddings to the DataFrame
for i in range(flattened_embeddings.shape[1]):
    df[f'embed_{i}'] = flattened_embeddings[:, i]


In [None]:
df.to_csv('french_text_with_embeddings.csv')

### Code Explanation

This code snippet calculates sentence complexity features and updates a DataFrame with these features:

1. **Defining Sentence Complexity Function:**
   - `def sentence_features(text): ...`: A function to calculate the number of words and average word length in a given text.
   - `words = text.split()`: Splits the text into words.
   - `return len(words), ...`: Returns the number of words.
   - `... sum(len(word) for word in words) / len(words) if words else 0`: Calculates the average word length; if there are no words, it returns 0.

2. **Applying the Function to DataFrame:**
   - `df['num_words'], df['avg_word_length'] = zip(*df['text'].apply(sentence_features))`: 
     - Applies `sentence_features` to the 'text' column in DataFrame `df`.
     - Stores the number of words and average word length in two new columns, 'num_words' and 'avg_word_length'.

3. **Converting Numerical Features to Strings:**
   - `df['num_words_str'] = df['num_words'].astype(str)`: Converts the 'num_words' column to string format and stores it in a new column 'num_words_str'.
   - `df['avg_word_length_str'] = df['avg_word_length'].astype(str)`: Converts the 'avg_word_length' column to string format and stores it in a new column 'avg_word_length_str'.

This approach allows for easy analysis of text complexity within the DataFrame by adding relevant features in both numerical and string formats.
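
As a small worked example, the sketch below restates the feature function with an explicit intermediate variable and applies it to a made-up sentence and to an empty string.

```python
def sentence_features(text):
    words = text.split()
    # Average word length, or 0 for an empty text to avoid dividing by zero
    avg_len = sum(len(word) for word in words) / len(words) if words else 0
    return len(words), avg_len

print(sentence_features("le chat noir dort"))   # (4, 3.5)
print(sentence_features(""))                    # (0, 0)
```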


In [None]:
# Sentence complexity features
def sentence_features(text):
    words = text.split()
    # Returns (word count, average word length); the average is 0 for an empty text
    return len(words), (sum(len(word) for word in words) / len(words) if words else 0)

df['num_words'], df['avg_word_length'] = zip(*df['text'].apply(sentence_features))
# Convert numerical features to string
df['num_words_str'] = df['num_words'].astype(str)
df['avg_word_length_str'] = df['avg_word_length'].astype(str)




In [None]:
X=df.drop(['text','pos_tags','labels','num_words_str','avg_word_length_str'],axis=1)
y=df['labels']

In [None]:
X.head()

Unnamed: 0,PRON,NUM,ADV,NOUN,ADP,PUNCT,CCONJ,ADJ,PROPN,SCONJ,...,embed_760,embed_761,embed_762,embed_763,embed_764,embed_765,embed_766,embed_767,num_words,avg_word_length
0,0,0,1,1,1,1,1,1,0,0,...,-0.045885,-0.047388,-0.173737,-0.153386,-0.046336,0.100206,-0.222565,0.11575,38,5.736842
1,1,0,1,1,0,1,1,0,1,0,...,-0.215935,0.330025,0.270201,-0.007197,0.046525,-0.050677,-0.007611,0.120119,12,4.25
2,0,0,0,1,1,1,0,0,0,0,...,-0.24028,0.313979,0.246006,-0.045984,0.03535,-0.048163,-0.028361,0.09254,13,4.153846
3,1,0,1,1,1,1,0,0,1,1,...,-0.210785,0.302628,0.280659,-0.038862,0.044026,-0.051162,-0.008161,0.122888,8,4.125
4,1,1,1,1,1,1,1,1,0,0,...,0.169169,0.033796,0.200843,0.081495,0.103755,-0.025925,0.183891,-0.027395,34,5.176471


In [None]:
# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### General Overview of the Code

This code snippet covers two main aspects of machine learning: data preprocessing and model training with hyperparameter tuning:

1. **Data Preprocessing:**
   - The snippet begins with scaling the training and test datasets (`X_train` and `X_test`) using `MinMaxScaler` from `sklearn.preprocessing`. This is a standard practice in machine learning to normalize data within a specific range, typically 0 to 1, which can enhance the performance of many algorithms.

2. **Machine Learning Model Training:**
   - `GridSearchCV`, `SVC`, and `classification_report` are imported from scikit-learn, alongside pandas, NumPy, and the other libraries used earlier.
   - A Support Vector Machine (SVM) is then tuned with `GridSearchCV`, which tries every combination of 'C', 'gamma', and 'kernel' in the grid (4 × 4 × 2 = 32 candidates) with 5-fold cross-validation and keeps the best-scoring one as `best_svc`.
   - Finally, the training and test arrays are reshaped from `(samples, features)` to `(samples, features, 1)`, the 3-D layout that the `Conv1D` layers of the Keras model defined later expect.

In summary, the code scales the features to the 0-1 range with min-max normalization and then runs a grid search over SVM hyperparameters.
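
The scaling step fits the minimum and maximum on the training data only and reuses them for the test data. The NumPy sketch below reproduces that arithmetic by hand on made-up numbers; it is not the scikit-learn implementation and does not guard against constant columns.

```python
import numpy as np

X_train_toy = np.array([[1.0, 10.0],
                        [3.0, 20.0],
                        [5.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# "Fit": column-wise minimum and maximum of the training data only
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

def scale(X):
    # "Transform": the same formula is applied to any data set
    return (X - col_min) / (col_max - col_min)

train_scaled = scale(X_train_toy)
test_scaled = scale(X_test_toy)

print(train_scaled.min(), train_scaled.max())   # 0.0 1.0
print(test_scaled)                              # [[0.25 1.5 ]]
```

The second test value is larger than anything seen in training, so it lands outside the 0-1 range. This is expected when the scaler is fitted on the training set alone.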


In [None]:
# Min-max scaling: fit on the training set only, then apply the same transform to both sets
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Model training with hyperparameter tuning
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5, verbose=2, n_jobs=-1)
grid_search.fit(X_train, y_train)

best_svc = grid_search.best_estimator_

# Add a trailing channel axis: Conv1D expects input shaped (samples, steps, channels)
X_train_r = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test_r = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

Fitting 5 folds for each of 32 candidates, totalling 160 fits


### Overview of Deep Learning Model Construction Using Keras

This code snippet is focused on building and configuring a deep learning model using Keras, a high-level neural networks API:

1. **Import Statements:**
   - The snippet starts by importing necessary components from Keras such as `Sequential`, `Embedding`, `Dense`, `Conv1D`, `MaxPooling1D`, `GlobalMaxPooling1D`, and various others. These are key building blocks for constructing neural network layers.

2. **Setting Up Callbacks:**
   - `callback_list` is defined with three types of callbacks: `EarlyStopping` (to stop training when the accuracy metric stops improving), `ModelCheckpoint` (to save the model after every epoch where the validation loss improves), and `ReduceLROnPlateau` (to reduce the learning rate when a metric has stopped improving). These callbacks help in optimizing the training process and preventing overfitting.

3. **Building the Model:**
   - The model is built using a functional API approach. 
   - It starts with defining an input layer `text_input_layer`.
   - Several convolutional layers (`Conv1D`) followed by max-pooling layers (`MaxPooling1D`) are added. This structure is common in processing sequential data (like text) where local patterns are important.
   - After multiple convolution and pooling layers, a `GlobalMaxPooling1D` layer is used to downsample the entire feature map to a single vector per map.
   - This is followed by a dense layer with ReLU activation and an output layer with softmax activation (for multi-class classification, as indicated by 6 units in the output layer).
   - The model is then compiled with the RMSprop optimizer, categorical crossentropy as the loss function, and accuracy as the metric.

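4. **Model Summary and Compilation:**
   - `model.summary()` prints the layers, their output shapes, and their parameter counts.
   - The model is compiled with `RMSprop(learning_rate=0.001)`, the `categorical_crossentropy` loss, and accuracy (`'acc'`) as the metric. Because the loss is categorical crossentropy, the integer labels must be one-hot encoded before training, which a later cell does with `OneHotEncoder`.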

This code is typical for building a convolutional neural network for tasks like text or sequence classification, showcasing the flexibility and ease of using Keras for deep learning tasks.
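
The output shapes and parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults for these layers ('valid' padding and stride 1 for `Conv1D`, stride equal to the pool size for `MaxPooling1D`) and starts from the 787 input features shown in the summary; the helper names are made up for this example.

```python
def conv1d_len(n, kernel_size=3):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the input
    return n - kernel_size + 1

def maxpool1d_len(n, pool_size=3):
    # Stride equals pool_size; an incomplete trailing window is dropped
    return n // pool_size

def conv1d_params(in_channels, filters=256, kernel_size=3):
    # One kernel of size (kernel_size x in_channels) per filter, plus one bias each
    return kernel_size * in_channels * filters + filters

n = 787
lengths = [n]
for _ in range(5):               # five Conv1D + MaxPooling1D pairs
    n = conv1d_len(n)
    lengths.append(n)
    n = maxpool1d_len(n)
    lengths.append(n)

print(lengths)              # [787, 785, 261, 259, 86, 84, 28, 26, 8, 6, 2]
print(conv1d_params(1))     # 1024
print(conv1d_params(256))   # 196864
```

The first values match the summary (785, 261, 259, 86, 84), and the parameter counts match the 1024 and 196864 reported for the first two convolutional layers.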


In [None]:
import keras
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Conv1D, MaxPooling1D, GlobalMaxPooling1D
from keras import Model, layers
from keras import Input

from keras.optimizers import RMSprop

callback_list = [
    keras.callbacks.EarlyStopping(
        patience=20,
        monitor='acc',
    ),

    keras.callbacks.ModelCheckpoint(
        monitor='val_loss',
        save_best_only=True,
        filepath='model/movie_sentiment_m1.h5',
    ),

    keras.callbacks.ReduceLROnPlateau(
        patience=1,
        factor=0.1,
    )
]

# layer developing
text_input_layer = Input(shape=(X_train_r.shape[1],X_train_r.shape[2],))
# embedding_layer = Embedding(X_train.shape[1], )(text_input_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_input_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = Conv1D(256, 3, activation='relu')(text_layer)
text_layer = MaxPooling1D(3)(text_layer)
text_layer = GlobalMaxPooling1D()(text_layer)
text_layer = Dense(256, activation='relu')(text_layer)
output_layer = Dense(6, activation='softmax')(text_layer)
model = Model(text_input_layer, output_layer)
model.summary()
model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['acc'])

# multi-input test

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 787, 1)]          0         
                                                                 
 conv1d (Conv1D)             (None, 785, 256)          1024      
                                                                 
 max_pooling1d (MaxPooling1  (None, 261, 256)          0         
 D)                                                              
                                                                 
 conv1d_1 (Conv1D)           (None, 259, 256)          196864    
                                                                 
 max_pooling1d_1 (MaxPoolin  (None, 86, 256)           0         
 g1D)                                                            
                                                                 
 conv1d_2 (Conv1D)           (None, 84, 256)           196864

In [None]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the integer labels, as required by the categorical crossentropy loss
# (in scikit-learn >= 1.2 this argument is named `sparse_output`)
encoder = OneHotEncoder(sparse=False)
y_train = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test = encoder.transform(y_test.values.reshape(-1, 1))



In [None]:

history = model.fit(X_train_r, y_train, epochs=50, batch_size=128, callbacks=callback_list,
                    validation_data=(X_test_r, y_test))


Epoch 1/50
Epoch 2/50
 3/30 [==>...........................] - ETA: 0s - loss: 1.7594 - acc: 0.2630

  saving_api.save_model(


Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50


In [None]:
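# Model evaluation
y_pred = model.predict(X_test_r)  # softmax probabilities, one column per class
# inverse_transform maps each row back to the label of its largest entry
print(classification_report(encoder.inverse_transform(y_test), encoder.inverse_transform(y_pred)))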

              precision    recall  f1-score   support

           0       0.94      0.97      0.96       166
           1       0.95      0.89      0.92       158
           2       0.88      0.92      0.90       166
           3       0.89      0.93      0.91       153
           4       0.87      0.89      0.88       152
           5       0.95      0.88      0.91       165

    accuracy                           0.91       960
   macro avg       0.91      0.91      0.91       960
weighted avg       0.91      0.91      0.91       960
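
To make the last two rows of the report concrete, this small sketch recomputes the macro and weighted precision averages from the per-class values printed above. Because the inputs are already rounded to two decimals, the result is only an approximate check, not the exact computation scikit-learn performs.

```python
# Per-class precision and support, copied from the report above
precision = [0.94, 0.95, 0.88, 0.89, 0.87, 0.95]
support = [166, 158, 166, 153, 152, 165]

# Macro average: every class counts equally
macro_precision = sum(precision) / len(precision)

# Weighted average: each class counts in proportion to its support
total = sum(support)
weighted_precision = sum(p * n for p, n in zip(precision, support)) / total

print(round(macro_precision, 2), round(weighted_precision, 2))  # prints: 0.91 0.91
```

The two averages agree here because the classes are nearly balanced (152 to 166 samples each).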



In [None]:
df=pd.read_csv('https://raw.githubusercontent.com/DalipiDenis/assign/main/unlabelled_test_data.csv')

In [None]:
df = df.rename(columns={'sentence': 'text', 'difficulty': 'labels'})
df['text'] = df['text'].apply(initial_clean)


### Overview of POS Tagging and One-Hot Encoding in a DataFrame

This code snippet is designed for natural language processing (NLP), specifically for extracting part-of-speech (POS) tags from text data and converting these tags into a one-hot encoded format:

1. **POS Tagging Function:**
   - `def get_pos_tags(text): ...`: A function is defined to get the POS tags for a given text.
   - `doc = nlp(text)`: Runs the text through the `nlp` pipeline (presumably the spaCy `fr_core_news_sm` model downloaded at the top of the notebook), which tokenizes the text and assigns a POS tag to each token.
   - `return [token.pos_ for token in doc]`: Returns a list of POS tags for each token in the text.

2. **Applying POS Tagging to DataFrame:**
   - `df['pos_tags'] = df['text'].apply(get_pos_tags)`: Applies the `get_pos_tags` function to each row in the 'text' column of a DataFrame `df`, creating a new column 'pos_tags' that contains the list of POS tags for each text.

3. **Extracting Unique POS Tags:**
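   - `all_pos_tags = set(tag for tags in data['pos_tags'] for tag in tags)`: Collects the unique POS tags found in `data['pos_tags']`. Note that this reads from `data`, not `df`; `data` is presumably the training DataFrame built earlier in the notebook, so the tag vocabulary (and therefore the set of feature columns) matches the one used for training.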

4. **Creating One-Hot Encoded Vectors:**
   - The code initializes an empty list `one_hot_vectors` to store the encoded vectors.
   - For each list of tags in `df['pos_tags']`, it creates a binary vector with one element per tag in `all_pos_tags`: 1 if that tag occurs anywhere in the text, 0 otherwise.
   - Because several elements can be 1 at once, each vector is strictly a multi-hot (presence/absence) encoding: it records which POS tags appear in a text, not how often or in what order. The vectors are appended to `one_hot_vectors`.

The process of extracting POS tags and converting them into a one-hot encoded format is common in NLP, as it transforms textual data into a numerical form that can be easily used in various machine learning models.


In [None]:
def get_pos_tags(text):
    doc = nlp(text)
    return [token.pos_ for token in doc]

# Apply the function to create a new column with POS tags
df['pos_tags'] = df['text'].apply(get_pos_tags)

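# Collect the unique POS tags from `data` (presumably the training DataFrame
# defined earlier), so the feature columns match those used for training
all_pos_tags = set(tag for tags in data['pos_tags'] for tag in tags)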

# Create a one-hot encoded matrix for POS tags
one_hot_vectors = []
for tags in df['pos_tags']:
    vector = [1 if pos_tag in tags else 0 for pos_tag in all_pos_tags]
    one_hot_vectors.append(vector)
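
The encoding above can be checked on a toy input. This is a minimal sketch with hand-written tag lists standing in for spaCy output; unlike the notebook cell, it sorts the tag vocabulary so that the column order is deterministic, which is a design choice of this example only.

```python
# Toy POS tag lists, standing in for the output of get_pos_tags
sentences_tags = [
    ["DET", "NOUN", "VERB"],
    ["PRON", "VERB", "ADV", "ADV"],
]

# Sorted vocabulary gives a reproducible column order
vocabulary = sorted({tag for tags in sentences_tags for tag in tags})

# One binary presence vector per sentence
vectors = []
for tags in sentences_tags:
    present = set(tags)
    vectors.append([1 if tag in present else 0 for tag in vocabulary])

print(vocabulary)  # ['ADV', 'DET', 'NOUN', 'PRON', 'VERB']
print(vectors)     # [[0, 1, 1, 0, 1], [1, 0, 0, 1, 1]]
```

The repeated `ADV` in the second sentence still yields a single 1, which shows that counts are not preserved.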


### Overview of Integrating NLP Features into DataFrame and Batch Processing for Embeddings

This code snippet demonstrates the integration of NLP features into a DataFrame and processes text data in batches to generate embeddings:

1. **Creating DataFrame from One-Hot Vectors:**
   - `pos_tags_df = pd.DataFrame(one_hot_vectors, columns=list(all_pos_tags))`: Converts the list of one-hot encoded vectors (created earlier for POS tags) into a Pandas DataFrame. Each column represents a unique POS tag.

2. **Concatenating DataFrames:**
   - `df = pd.concat([df, pos_tags_df], axis=1)`: Concatenates the new DataFrame containing POS tag features with the original DataFrame `df`, enhancing it with additional linguistic features for each text entry.

3. **Batch Processing for Embedding Generation:**
   - `batch_size=8`: Sets the size of each batch for processing.
   - `text_batches = [df['text'][i:i + batch_size]...`: Splits the 'text' column of `df` into smaller batches. Each batch contains `batch_size` texts.
   - The following loop processes each batch to generate embeddings (likely with a BERT-style model, as the `get_bert_embeddings` function name suggests):
     - `embeddings.append(get_bert_embeddings(list(batch)))`: Appends the generated embeddings of each batch to the `embeddings` list.

4. **Concatenating and Reshaping Embeddings:**
   - `embeddings = np.concatenate(embeddings)`: Concatenates all embeddings into a single NumPy array.
   - `flattened_embeddings = embeddings.reshape((len(df), -1))`: Reshapes the array to one row per text in `df`, flattening any remaining dimensions into a single feature vector.

5. **Assigning Embeddings to DataFrame:**
   - The final loop iterates over each dimension of the `flattened_embeddings` and assigns each to a new column in the DataFrame `df`. The columns are named `embed_0`, `embed_1`, etc., storing different dimensions of the embedding vectors for each text.

This code is a practical example of how to enrich a DataFrame with advanced NLP features, like POS tags and text embeddings, which can be crucial for downstream machine learning or text analysis tasks.


In [None]:

# Convert the list of one-hot vectors into a DataFrame
pos_tags_df = pd.DataFrame(one_hot_vectors, columns=list(all_pos_tags))

# Concatenate the POS tags DataFrame with your original DataFrame
df = pd.concat([df, pos_tags_df], axis=1)


batch_size = 8

# Split the texts into consecutive batches of at most `batch_size` rows
text_batches = [df['text'][i:i + batch_size] for i in range(0, len(df), batch_size)]

# Embed the texts one batch at a time
embeddings = []
for batch in text_batches:
    embeddings.append(get_bert_embeddings(list(batch)))

# Stack the per-batch arrays into one array with a row per text
embeddings = np.concatenate(embeddings)

# Flatten any trailing dimensions so each text has a single feature vector
flattened_embeddings = embeddings.reshape((len(df), -1))

# Assign the embeddings to the DataFrame
for i in range(flattened_embeddings.shape[1]):
    df[f'embed_{i}'] = flattened_embeddings[:, i]
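
The batching and reshaping steps can be sketched without loading a model. In the example below, `fake_embed` is a hypothetical stand-in for `get_bert_embeddings` that returns one small vector per text; `batched` is also a helper introduced only for this illustration.

```python
import numpy as np

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts, dim=4):
    """Hypothetical stand-in for get_bert_embeddings: one vector per text."""
    return np.array([[float(len(text))] * dim for text in texts])

texts = [f"phrase {i}" for i in range(19)]

# Embed batch by batch, then stack the results row-wise
parts = [fake_embed(batch) for batch in batched(texts, batch_size=8)]
embeddings = np.concatenate(parts)

# One row per text, remaining dimensions flattened
flattened = embeddings.reshape((len(texts), -1))
column_names = [f"embed_{i}" for i in range(flattened.shape[1])]
```

With 19 texts and a batch size of 8, the batches hold 8, 8 and 3 texts. When the embedding has many dimensions, building a single DataFrame from `flattened` with `column_names` and joining it once with `pd.concat(..., axis=1)` avoids the fragmentation warning pandas can emit when columns are inserted one at a time.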


In [None]:
# Sentence complexity features
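def sentence_features(text):
    """Return (number of words, average word length) for a whitespace-split text."""
    words = text.split()
    # Guard against empty texts to avoid dividing by zero
    avg_word_length = sum(len(word) for word in words) / len(words) if words else 0
    return len(words), avg_word_length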

df['num_words'], df['avg_word_length'] = zip(*df['text'].apply(sentence_features))
# Convert numerical features to string
df['num_words_str'] = df['num_words'].astype(str)
df['avg_word_length_str'] = df['avg_word_length'].astype(str)
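
A quick check of the sentence features on a few toy strings, including an empty one. The function is restated so the example runs on its own; the sample sentences are made up for illustration.

```python
def sentence_features(text):
    """Return (number of words, average word length) for a whitespace-split text."""
    words = text.split()
    avg_word_length = sum(len(word) for word in words) / len(words) if words else 0.0
    return len(words), avg_word_length

texts = ["Le chat dort.", "", "Nous mangeons des pommes rouges."]

# zip(*...) turns a sequence of (count, average) pairs into two parallel tuples
num_words, avg_word_length = zip(*(sentence_features(t) for t in texts))

print(num_words)  # (3, 0, 5)
```

Punctuation stays attached to the neighbouring word (`dort.` counts as five characters), so the average length is slightly inflated for short sentences.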




In [None]:
# Keep only the numeric feature columns (POS indicators, embeddings, sentence statistics)
X = df.drop(['text', 'pos_tags', 'id', 'num_words_str', 'avg_word_length_str'], axis=1)


In [None]:
X_val = scaler.transform(X)
X_val_r = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
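
The two lines above reuse the scaler fitted on the training data and then add a trailing channel axis for the `Conv1D` layers. The sketch below mimics this with NumPy on toy matrices; it assumes the scaler standardizes each feature (mean 0, standard deviation 1), which the notebook does not confirm in this section.

```python
import numpy as np

# Toy stand-ins for the training and new feature matrices (4 features each)
X_train_toy = np.array([[1.0, 10.0, 0.0, 5.0],
                        [2.0, 20.0, 1.0, 5.5],
                        [3.0, 30.0, 0.0, 6.5]])
X_new_toy = np.array([[2.0, 15.0, 1.0, 6.0],
                      [4.0, 40.0, 0.0, 5.0]])

# Statistics come from the training data only (assumed standardization)
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# New data is transformed with the training statistics, never refitted
X_new_scaled = (X_new_toy - mean) / std

# Conv1D expects (samples, steps, channels); each feature becomes one step
X_new_r = X_new_scaled.reshape((X_new_scaled.shape[0], X_new_scaled.shape[1], 1))
```

The reshape changes only the array's shape: `X_new_r[:, :, 0]` is identical to `X_new_scaled`.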

In [None]:
y_val = model.predict(X_val_r)



In [None]:
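submit_df = pd.read_csv('https://raw.githubusercontent.com/DalipiDenis/assign/main/unlabelled_test_data.csv')
# encoder.inverse_transform returns a column vector of shape (n, 1);
# flatten it with ravel() so LE.inverse_transform receives a 1-D array
submit_df['difficulty'] = LE.inverse_transform(encoder.inverse_transform(y_val).ravel())
submit_df.drop(['sentence'], axis=1, inplace=True)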
submit_df.to_csv('submission_2.csv',index=False)
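
The submission step chains two decoders: class scores are mapped to a class index, and the index is mapped back to the original difficulty label. The sketch below reproduces that chain with NumPy only. The label names are hypothetical placeholders, and `LE` is assumed to be a label encoder fitted earlier in the notebook.

```python
import numpy as np

# Hypothetical difficulty labels, in the order a label encoder might have assigned them
class_names = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Toy class scores for two sentences (one row per sentence)
scores = np.array([[0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
                   [0.00, 0.10, 0.10, 0.20, 0.50, 0.10]])

# Step 1: pick the highest-scoring class; kept as a column vector of shape (n, 1)
column = scores.argmax(axis=1).reshape(-1, 1)

# Step 2: flatten to 1-D before looking up the label names
flat = column.ravel()
predicted = [class_names[i] for i in flat]

print(predicted)  # ['A1', 'C1']
```

Flattening before the lookup mirrors what a 1-D label decoder expects, and passing the column vector directly is what typically triggers a `column_or_1d` conversion warning.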

