<a href="https://colab.research.google.com/github/MarianoChic09/MSc-ORT-Deep-Learning/blob/main/Obligatorio/obligatorio_deep_learning_2023_c_WANDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![AIRBNB](https://www.stevenridercpa.au/wp-content/uploads/2022/09/airbnb-tax.jpeg)

# Deep Learning Mandatory Assignment
## Semester 2 - 2023
-------

## Problem

You are given a dataset with information about accommodations listed on AirBnB, together with their prices. The training set is approximately 1.5 GB and the test set approximately 0.5 GB. The dataset has 84 predictor variables, which you may use as you see fit.

The goal is to assign the correct price to each listed accommodation.

In addition to the dataset, you are given this notebook, which contains the data-loading script and a baseline model based on a feed-forward architecture.

------

## Assignment

### A) <u>Participation in the Kaggle competition</u>:
The goal of this part is to take part in the Kaggle competition and achieve, at a minimum, a Mean Absolute Error below 70. [->Link to the competition<-](https://www.kaggle.com/t/69c648e3aa214d1f812bf2314c8d4ffa).

### B) <u>Use of grid search (or an equivalent)</u>:
To search for optimal models, you must run a grid search that is as comprehensive and methodical as possible. We strongly recommend [Weights and Biases](https://wandb.ai/site).

### C) <u>You must also research and implement the following techniques</u>:
#### 1. [Batch Normalization](https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/)
#### 2. [Gradient Normalization and/or Gradient Clipping](https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/)


As in every assignment, the neatness of the submission, the data preprocessing, the visualizations, and the exploration of alternative techniques will also be evaluated.

-------

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Setup
### 1.1 Imports

In [2]:
%cd /content/drive/MyDrive/Colab Notebooks/Datasets

/content/drive/MyDrive/Colab Notebooks/Datasets


In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

### 1.2 Setting seeds

In [4]:
np.random.seed(117)
tf.random.set_seed(117)

## 2. Data loading

In [5]:
file_path = './obligatorio_DL/public_train_data.csv'
df = pd.read_csv(file_path)

## 3. Exploratory data analysis
### 3.1 Dimensions

In [6]:
df.shape

(326287, 85)

### 3.2 Column information and data types

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326287 entries, 0 to 326286
Data columns (total 85 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              326287 non-null  int64  
 1   Last Scraped                    326286 non-null  object 
 2   Name                            326018 non-null  object 
 3   Summary                         315651 non-null  object 
 4   Space                           228792 non-null  object 
 5   Description                     326188 non-null  object 
 6   Experiences Offered             326287 non-null  object 
 7   Neighborhood Overview           192513 non-null  object 
 8   Notes                           130729 non-null  object 
 9   Transit                         200649 non-null  object 
 10  Access                          177108 non-null  object 
 11  Interaction                     169193 non-null  object 
 12  House Rules     

### 3.3 View the first rows of the dataset

In [8]:
df.head(3)

Unnamed: 0,id,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,...,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features,Price
0,0,2017-05-12,Grand Loft in the heart of historic Antwerp,Best location for visiting Antwerp!! Beautiful...,Welcome in Antwerp!! The loft is situated on t...,Best location for visiting Antwerp!! Beautiful...,none,,,,...,10.0,9.0,,,strict,2.0,2.6,"51.21938762207894, 4.4034442505151885","Host Has Profile Pic,Instant Bookable",159.0
1,1,2017-05-03,"CHARMING, CLEAN & COZY BUNGALOW!",Very centrally located and less than 15 min fr...,"Well lit, private entrance with small patio.",Very centrally located and less than 15 min fr...,none,"Quiet. Pretty tree lined streets, safe area.",Has dining table and high back desk chair.,"Uber, bus line and metro link is less than 5 m...",...,,,,"City of Los Angeles, CA",flexible,1.0,,"34.1892692286356, -118.41993491931177","Host Has Profile Pic,Is Location Exact",49.0
2,2,2017-05-09,la casa di maurizio,"nice apartment with view to via veneto , very ...",,"nice apartment with view to via veneto , very ...",none,,,,...,,,,,flexible_new,1.0,,"41.90859623057272, 12.493518028459327","Host Has Profile Pic,Is Location Exact",75.0


### 3.4 Descriptive statistics

In [9]:
df.describe()

Unnamed: 0,id,Host ID,Host Response Rate,Host Listings Count,Host Total Listings Count,Latitude,Longitude,Accommodates,Bathrooms,Bedrooms,...,Review Scores Rating,Review Scores Accuracy,Review Scores Cleanliness,Review Scores Checkin,Review Scores Communication,Review Scores Location,Review Scores Value,Calculated host listings count,Reviews per Month,Price
count,326287.0,326287.0,250845.0,325971.0,325970.0,326287.0,326287.0,326244.0,325300.0,325873.0,...,243160.0,242584.0,242732.0,242378.0,242710.0,242423.0,242347.0,325689.0,246983.0,326287.0
mean,163143.0,32367570.0,93.408264,9.586,9.586026,38.042816,-15.323924,3.270764,1.239482,1.358072,...,92.880063,9.524713,9.326067,9.691416,9.708253,9.468215,9.321031,6.881531,1.486211,138.229041
std,94191.087979,31745720.0,17.536835,57.399711,57.399797,22.910029,70.101677,2.037446,0.574784,0.921763,...,8.569521,0.855361,1.038858,0.731702,0.723143,0.805116,0.906478,42.025986,1.752082,149.790527
min,0.0,19.0,0.0,0.0,0.0,-38.224427,-123.218712,1.0,0.0,0.0,...,20.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.01,0.0
25%,81571.5,6869780.0,98.0,1.0,1.0,38.923154,-73.968081,2.0,1.0,1.0,...,90.0,9.0,9.0,10.0,10.0,9.0,9.0,1.0,0.32,55.0
50%,163143.0,21867370.0,100.0,1.0,1.0,42.304549,0.090277,2.0,1.0,1.0,...,95.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,0.89,90.0
75%,244714.5,47991660.0,100.0,3.0,3.0,50.863658,12.342749,4.0,1.0,2.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,2.0,2.04,150.0
max,326286.0,135088500.0,100.0,1114.0,1114.0,55.994889,153.637837,18.0,8.0,96.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,752.0,223.0,999.0


In [10]:
df.columns

Index(['id', 'Last Scraped', 'Name', 'Summary', 'Space', 'Description',
       'Experiences Offered', 'Neighborhood Overview', 'Notes', 'Transit',
       'Access', 'Interaction', 'House Rules', 'Thumbnail Url', 'Medium Url',
       'Picture Url', 'XL Picture Url', 'Host ID', 'Host URL', 'Host Name',
       'Host Since', 'Host Location', 'Host About', 'Host Response Time',
       'Host Response Rate', 'Host Acceptance Rate', 'Host Thumbnail Url',
       'Host Picture Url', 'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Security Deposit',
       'Cleaning Fee', 'Guests Included', 'Extra Peop

# Bert Fin

In [11]:
df.shape

(326287, 85)

In [12]:
columnas_con_NaNs_mayores_a_50_porciento = df.columns[df.isnull().sum() > 0.5*df.shape[0]]
columnas_con_NaNs_mayores_a_50_porciento

Index(['Notes', 'Host Acceptance Rate', 'Neighbourhood Group Cleansed',
       'Square Feet', 'Security Deposit', 'Has Availability', 'License',
       'Jurisdiction Names'],
      dtype='object')

## 4. Baseline model

### 4.1 Selecting relevant features

In [13]:
drop_columns = list(columnas_con_NaNs_mayores_a_50_porciento)
drop_columns.extend(['Host ID','Host URL','Price']) # Drop columns with no predictive meaning, such as the host ID and URL; 'Price' is the target and is re-added below
drop_columns

['Notes',
 'Host Acceptance Rate',
 'Neighbourhood Group Cleansed',
 'Square Feet',
 'Security Deposit',
 'Has Availability',
 'License',
 'Jurisdiction Names',
 'Host ID',
 'Host URL',
 'Price']

In [14]:
# features = ['Bathrooms', 'Bedrooms']  # Replace with the relevant features
features = df.columns.drop(drop_columns)
target = 'Price'
df = df[[*features, target]]
# df.dropna(inplace=True)

In [15]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df_train = df.drop('Price',axis=1)
numerical_cols = df_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

unique_counts = df_train.drop(numerical_cols,axis=1).nunique()

# Mean string length of the non-null values in each non-numerical column
mean_string_length = df_train.drop(numerical_cols, axis=1).apply(lambda col: col.dropna().astype(str).apply(len).mean())

# Define the thresholds
unique_value_limit = 20  # Treat a column as categorical if it has fewer than 20 unique values
string_length_limit = 20  # Treat a column as text if its mean string length is 20 or more

# Identify the categorical and text columns
categorical_cols = unique_counts[(unique_counts < unique_value_limit) & (mean_string_length < string_length_limit)].index.tolist()
text_cols = unique_counts[(unique_counts >= unique_value_limit) | (mean_string_length >= string_length_limit)].index.tolist()

# df['combined_text'] = df[text_cols].apply(lambda x: ' '.join(str(x)), axis=1)
# text_data = df['combined_text']

X_train, X_test, y_train, y_test = train_test_split(df.drop('Price', axis=1), df['Price'], test_size=0.2, random_state=0)

# Now split the text data
X_text_train = X_train.loc[:,text_cols ]
X_text_test = X_test.loc[:,text_cols ]
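
The heuristic in the cell above splits the non-numerical columns with two thresholds: a column is treated as categorical only if it has few unique values and short strings, and as text otherwise. A minimal sketch of the same rule on a toy table follows. It is written in plain Python so it runs without pandas, and `classify_columns` and the toy values are hypothetical names and data introduced only for illustration.

```python
def classify_columns(columns, unique_value_limit=20, string_length_limit=20):
    """Split non-numerical columns into categorical and text columns.

    `columns` maps a column name to its list of values (None means missing).
    """
    categorical_cols, text_cols = [], []
    for name, values in columns.items():
        present = [str(v) for v in values if v is not None]
        if not present:
            continue  # a column with no values is neither categorical nor text
        unique_count = len(set(present))
        mean_length = sum(len(v) for v in present) / len(present)
        if unique_count < unique_value_limit and mean_length < string_length_limit:
            categorical_cols.append(name)
        else:
            text_cols.append(name)
    return categorical_cols, text_cols


# Hypothetical toy data, not taken from the AirBnB dataset
toy_columns = {
    "Room Type": ["Entire home/apt", "Private room", None, "Private room"],
    "Summary": [
        "Bright loft in the historic centre, close to everything.",
        "Quiet room with a private entrance and a small patio.",
        None,
        "Nice apartment with a view, very well connected.",
    ],
}

categorical_cols_toy, text_cols_toy = classify_columns(toy_columns)
print(categorical_cols_toy)
print(text_cols_toy)
```

With both thresholds at 20, a short label column such as `Room Type` would go through the one-hot encoder, while a free-text column such as `Summary` would be routed to the text model.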

In [16]:
!pip install wandb



In [17]:
# Import the W&B Python Library and log into W&B
import wandb

wandb.login()

# We created the project in W&B through its web interface
project = "obligatorio_dl"
entity = "marian-ai"

wandb: Currently logged in as: mariano-chicatun (marian-ai). Use `wandb login --relogin` to force relogin


In [18]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [19]:
# import dask.dataframe as dd
# from dask.multiprocessing import get

# dask_df = dd.from_pandas(X_text_train, npartitions=20)  # Partition dataframe
# dask_result = dask_df.map_partitions(lambda df: df.applymap(preprocess_text)).compute(scheduler='multiprocessing')


In [20]:
# !pip install fasttext

In [21]:
import re
from bs4 import BeautifulSoup

def basic_cleaning(text):
    """Strip HTML, drop unusual characters and collapse whitespace; non-strings become ''."""
    # Missing values (NaN) and any other non-string cells are mapped to an empty string
    if not isinstance(text, str):
        return ''

    # Remove HTML tags using BeautifulSoup (this also decodes HTML entities)
    text = BeautifulSoup(text, "html.parser").get_text()

    # Replace any entities that are still escaped after parsing
    text = text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")

    # Remove special characters or punctuation (customize regex as needed)
    text = re.sub(r'[^a-zA-Z0-9.,!?/:;\"\'\s]', '', text)

    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning to every cell of the text columns
X_text_train = X_text_train.applymap(basic_cleaning)
X_text_test = X_text_test.applymap(basic_cleaning)


X_text_train['combined_text'] = X_text_train.apply(lambda x: ' '.join(x), axis=1)
X_text_test['combined_text'] = X_text_test.apply(lambda x: ' '.join(x), axis=1)



In [22]:
!pip install transformers



In [23]:
# Load BERT tokenizer and model
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-multilingual-uncased')
seed = 42

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [24]:
from transformers import BertTokenizer
import tensorflow as tf

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
max_length = 128  # Reduce the sequence length if needed

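def batch_encode(texts, batch_size=256):
    """Tokenize the texts in chunks and return (input_ids, attention_masks) as tensors."""
    input_ids = []
    attention_masks = []

    for i in range(0, len(texts), batch_size):
        # Convert the slice to a plain list of strings before passing it to the tokenizer
        batch = list(texts[i:i+batch_size])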
        encoded = tokenizer.batch_encode_plus(
            batch,
            add_special_tokens=True,
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_attention_mask=True,
            return_tensors='tf'
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])

    input_ids = tf.concat(input_ids, axis=0)
    attention_masks = tf.concat(attention_masks, axis=0)
    return input_ids, attention_masks
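
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal sketch of what happens to a single sequence of token ids. It is a simplified stand-in, not the tokenizer's implementation: the real BERT tokenizer keeps the closing `[SEP]` token when it truncates, whereas this toy version simply cuts the list. The function name `pad_or_truncate`, the token ids and `pad_id=0` are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length items long."""
    kept = token_ids[:max_length]            # truncation: drop everything past max_length
    n_pad = max_length - len(kept)           # how many padding positions are needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids for a short and a long text
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row ends up with the same length, the per-batch tensors can be concatenated along axis 0, which is what `batch_encode` relies on.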

In [25]:
# Example usage
train_input_ids, train_attention_masks = batch_encode(X_text_train.combined_text)
test_input_ids, test_attention_masks = batch_encode(X_text_test.combined_text)

In [31]:
def create_dataset(input_ids, attention_masks, labels, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((
        {
            'input_ids': input_ids,
            'attention_mask': attention_masks
        },
        labels
    ))
    dataset = dataset.shuffle(len(labels)).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset

# Standardize the target; the scaler is fit on the training prices only
price_scaler = StandardScaler()
price_scaler.fit(y_train.to_numpy().reshape(-1, 1))
train_labels = price_scaler.transform(y_train.to_numpy().reshape(-1, 1))
test_labels = price_scaler.transform(y_test.to_numpy().reshape(-1, 1))

train_dataset = create_dataset(train_input_ids, train_attention_masks, train_labels)
test_dataset = create_dataset(test_input_ids, test_attention_masks, test_labels)
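
Because the target is standardized, any MAE computed against these labels is expressed in standard deviations of the price rather than in price units. To compare it with the competition threshold of 70, it has to be multiplied by the standard deviation used by the scaler (for a fitted `StandardScaler`, that is `price_scaler.scale_[0]`). The sketch below checks this relationship in plain Python. The prices and predictions are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical prices; StandardScaler uses the population standard deviation
prices = [55.0, 90.0, 150.0, 49.0, 159.0]
mu, sigma = mean(prices), pstdev(prices)


def scale(price):
    """Standardize a price, as StandardScaler.transform would."""
    return (price - mu) / sigma


def unscale(z):
    """Map a standardized value back to price units (inverse_transform)."""
    return z * sigma + mu


# Hypothetical predictions that are each 0.1 standard deviations too high
preds_scaled = [scale(p) + 0.1 for p in prices]

mae_scaled = mean(abs(z - scale(p)) for z, p in zip(preds_scaled, prices))
mae_price = mean(abs(unscale(z) - p) for z, p in zip(preds_scaled, prices))

print(mae_scaled, mae_price, mae_scaled * sigma)
```

The last two printed values agree up to floating-point rounding, so a validation MAE reported on the scaled labels can be converted to price units with a single multiplication.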


In [34]:
# Define the model
def build_model():
    input_ids = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_mask')

    bert_output = bert_model([input_ids, attention_mask])
    cls_token_output = bert_output.last_hidden_state[:, 0, :]
    output = tf.keras.layers.Dense(1)(cls_token_output)

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-8),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=[tf.keras.metrics.MeanAbsoluteError()])
    return model

model = build_model()
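
Part C of the assignment asks for gradient clipping, which the `build_model` cell above does not yet apply. The sketch below shows, in plain Python, what clipping by global norm computes: if the joint L2 norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor. The function name and the gradient values are hypothetical. In Keras the same effect is normally requested through optimizer arguments such as `clipnorm`, `global_clipnorm` or `clipvalue`, for example `tf.keras.optimizers.Adam(learning_rate=5e-5, global_clipnorm=1.0)`.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most clip_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        return gradients, global_norm
    factor = clip_norm / global_norm
    return [[g * factor for g in grad] for grad in gradients], global_norm


# Hypothetical gradients for two weight tensors; joint norm is sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm_before, norm_after)
```

Clipping changes only the length of the update, not its direction, which is why it is a common safeguard against exploding gradients.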

In [35]:
# Assuming you have a compiled model 'model'
history = model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=1
)






In [None]:
file_path2 = './obligatorio_DL/private_data_to_predict.csv'
data_for_kaggle = pd.read_csv(file_path2)

# Clean only the text columns, exactly as was done for the training data
kaggle_test = data_for_kaggle[text_cols].applymap(basic_cleaning)

kaggle_test['combined_text'] = kaggle_test.apply(lambda x: ' '.join(x), axis=1)

kaggle_input_ids, kaggle_attention_masks = batch_encode(kaggle_test.combined_text)

# Evaluate on the held-out validation split (the Kaggle data has no labels)
test_loss, test_mae = model.evaluate(test_dataset)

# Predict prices for the Kaggle data
predictions = model.predict(x=[kaggle_input_ids, kaggle_attention_masks])

# Rescale predictions back to the original price units
y_pred = price_scaler.inverse_transform(predictions)

# Build the submission file with one predicted price per listing id
submission = pd.DataFrame({
    'id': data_for_kaggle['id'].to_numpy(),
    'expected': y_pred.ravel()
})
submission['expected'] = submission['expected'].fillna(0)
submission.to_csv("output_to_submit.csv", index=False)



# Numerical and categorical data

In [None]:
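# Still to be tested together with the code above:
from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

# Assuming your numeric and categorical preprocessing is already defined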
X_num_cat_train = preprocessor.fit_transform(X_train)
X_num_cat_test = preprocessor.transform(X_test)

# Assuming you have already defined your tokenizer and bert_model
encoded_corpus_train = tokenizer(text=X_text_train.combined_text.tolist(), ...)
encoded_corpus_test = tokenizer(text=X_text_test.combined_text.tolist(), ...)

train_inputs = encoded_corpus_train['input_ids']
train_masks = encoded_corpus_train['attention_mask']
test_inputs = encoded_corpus_test['input_ids']
test_masks = encoded_corpus_test['attention_mask']

def build_composite_model(bert_model, num_cat_shape):
    # Text data input
    input_ids = Input(shape=(300,), dtype=tf.int32, name='input_ids')
    attention_mask = Input(shape=(300,), dtype=tf.int32, name='attention_mask')

    # BERT model for text data
    bert_output = bert_model([input_ids, attention_mask])
    cls_token_output = bert_output.last_hidden_state[:, 0, :]

    # Numeric and Categorical data input
    num_cat_input = Input(shape=(num_cat_shape,), name='num_cat_input')

    # Combine BERT output with numeric and categorical data
    combined = concatenate([cls_token_output, num_cat_input])

    # Add dense layers for combined data
    hidden_layer = Dense(64, activation='relu')(combined)
    output = Dense(1)(hidden_layer)

    # Create the model
    model = Model(inputs=[input_ids, attention_mask, num_cat_input], outputs=output)
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

    return model

composite_model = build_composite_model(bert_model, X_num_cat_train.shape[1])

history = composite_model.fit(
    x=[train_inputs, train_masks, X_num_cat_train],
    y=train_labels,
    validation_data=([test_inputs, test_masks, X_num_cat_test], test_labels),
    batch_size=16,
    epochs=5
)

test_loss, test_mae = composite_model.evaluate([test_inputs, test_masks, X_num_cat_test], test_labels)
predictions = composite_model.predict([test_inputs, test_masks, X_num_cat_test])



In [None]:
X_num_cat_kaggle = preprocessor.transform(data_for_kaggle)
# Join the text columns of each row into a single string, skipping missing values
data_for_kaggle['all_text'] = data_for_kaggle[text_cols].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

X_text_kaggle_sequences = tokenizer.texts_to_sequences(data_for_kaggle['all_text'])
X_text_kaggle_padded = pad_sequences(X_text_kaggle_sequences, maxlen=max_length)

kaggle_results = model.predict([X_text_kaggle_padded, X_num_cat_kaggle])


KeyboardInterrupt: ignored

In [None]:
# Build the submission file with one predicted price per listing id
submission = pd.DataFrame({
    'id': data_for_kaggle['id'].to_numpy(),
    'expected': np.asarray(kaggle_results).ravel()
})
submission['expected'] = submission['expected'].fillna(0)
submission.to_csv("output_to_submit.csv", index=False)

In [None]:
embedding_dim = word2vec_model.vector_size
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))

for word, i in tokenizer.word_index.items():
    if word in word2vec_model:
        embedding_matrix[i] = word2vec_model[word]


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input

# Assuming `embedding_matrix` is already defined as per your previous code
embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    weights=[embedding_matrix],
    trainable=False  # Set to True if you want to fine-tune the embeddings
)

input_text = Input(shape=(None,), dtype='int32')
embedded_text = embedding_layer(input_text)
# With the default return_sequences=False, the LSTM returns only its last output
lstm_output = LSTM(32)(embedded_text)
# Price prediction is a regression task: a single linear output unit
output = Dense(1)(lstm_output)

model = Model(inputs=input_text, outputs=output)

model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

model.summary()

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

X_num_cat_train = preprocessor.fit_transform(X_train)
X_num_cat_test = preprocessor.transform(X_test)


In [None]:
def preprocess_data(df, imputer_strategy_numeric, imputer_strategy_categorical, text_tokenizer_num_words, text_max_length):
    # Work on a copy so that repeated sweep runs do not add 'combined_text' to the shared DataFrame
    df = df.copy()
    df_train = df.drop('Price', axis=1)
    numerical_cols = df_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

    unique_counts = df_train.drop(numerical_cols, axis=1).nunique()
    mean_string_length = df.apply(lambda col: col.dropna().astype(str).apply(len).mean())

    unique_value_limit = 20
    string_length_limit = 20

    categorical_cols = unique_counts[(unique_counts < unique_value_limit) & (mean_string_length < string_length_limit)].index.tolist()
    text_cols = unique_counts[(unique_counts >= unique_value_limit) | (mean_string_length >= string_length_limit)].index.tolist()

    # Join the text columns of each row into a single string, skipping missing values
    df['combined_text'] = df[text_cols].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)

    X_train, X_test, y_train, y_test = train_test_split(df.drop('Price', axis=1), df['Price'], test_size=0.2, random_state=0)

    X_text_train = X_train['combined_text']
    X_text_test = X_test['combined_text']

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy=imputer_strategy_numeric)),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy=imputer_strategy_categorical, fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)])

    X_num_cat_train = preprocessor.fit_transform(X_train)
    X_num_cat_test = preprocessor.transform(X_test)

    tokenizer = Tokenizer(num_words=text_tokenizer_num_words)
    tokenizer.fit_on_texts(X_text_train)

    max_length = text_max_length

    X_text_train_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_train), maxlen=max_length)
    X_text_test_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_test), maxlen=max_length)

    return X_num_cat_train, X_text_train_padded, y_train, X_num_cat_test, X_text_test_padded, y_test


import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, Flatten, concatenate, Dropout

def build_model(neurons, dropout, optimizer, text_tokenizer_num_words, embedding_output_dim, embedding_matrix):
    # Text data input
    text_input = Input(shape=(100,), name='text_input')
    text_embedding = Embedding(
        input_dim=text_tokenizer_num_words,
        output_dim=embedding_output_dim,
        weights=[embedding_matrix],
        trainable=True
    )(text_input)
    lstm_output, _, _ = LSTM(neurons[0], return_sequences=True, return_state=True)(text_embedding)

    num_cat_input = Input(shape=(neurons[1],), name='num_cat_input')

    combined_input = concatenate([lstm_output[:, -1, :], num_cat_input])

    hidden_layer = Dense(neurons[2], activation='relu')(combined_input)
    hidden_dropout = Dropout(dropout)(hidden_layer)

    for n in neurons[3:]:
        hidden_layer = Dense(n, activation='relu')(hidden_dropout)
        hidden_dropout = Dropout(dropout)(hidden_layer)

    # Output layer
    output_layer = Dense(1)(hidden_dropout)

    # Create the model
    model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)

    # Compile the model
    model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mae'])

    return model

# def build_model(neurons, dropout, optimizer,text_tokenizer_num_words,embedding_output_dim):
#     text_input = Input(shape=(100,), name='text_input')
#     num_cat_input = Input(shape=(neurons[0],), name='num_cat_input')  # assuming the first layer neuron count for input shape

#     text_embedding = Embedding(input_dim=text_tokenizer_num_words, output_dim=embedding_output_dim)(text_input)
#     text_flatten = Flatten()(text_embedding)

#     combined_input = concatenate([text_flatten, num_cat_input])

#     hidden_layer = Dense(neurons[0], activation='relu')(combined_input)
#     hidden_dropout = Dropout(dropout)(hidden_layer)

#     for n in neurons[1:]:
#         hidden_layer = Dense(n, activation='relu')(hidden_dropout)
#         hidden_dropout = Dropout(dropout)(hidden_layer)

#     output_layer = Dense(1)(hidden_dropout)  # Assuming a regression task

#     model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)
#     model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mae'])

#     return model
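
Part C of the assignment also asks for batch normalization, which neither version of `build_model` above includes. As a minimal sketch of what such a layer computes in training mode, the plain-Python function below normalizes each feature over the batch and then applies a learned scale (`gamma`) and shift (`beta`). The function name, the activations and `eps=1e-3` are assumptions made for illustration. At inference time a real layer uses moving averages instead of batch statistics, which this sketch omits. In Keras the layer is `tf.keras.layers.BatchNormalization()`, which could be placed after each `Dense` layer.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then apply the learned scale and shift."""
    n_rows = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n_rows)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        col_mean = sum(column) / n_rows
        col_var = sum((x - col_mean) ** 2 for x in column) / n_rows
        for i, x in enumerate(column):
            out[i][j] = gamma[j] * (x - col_mean) / math.sqrt(col_var + eps) + beta[j]
    return out


# Hypothetical activations: the second feature is 200 times larger than the first
activations = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm(activations, gamma=[1.0, 1.0], beta=[0.0, 0.0])

for row in normalized:
    print(row)
```

After normalization both features have zero mean and nearly the same spread, regardless of their original scale, which is the property that tends to stabilize training.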


In [None]:
import pprint

sweep_config = {
    'name': 'sweep_example',
    'method': 'grid',
    'metric': {
        'name': 'val_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'dropout': {
            'value': 0.1
        },
        'neurons': {
            'values': [[32, 2], [64, 32, 2]]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'imputer_strategy_numeric': {
            'values': ['mean', 'median', 'most_frequent', 'constant']
        },
        'imputer_strategy_categorical': {
            'values': ['most_frequent', 'constant']
        },
        'text_tokenizer_num_words': {
            'values': [5000, 10000, 15000]
        },
        'text_max_length': {
            'values': [50, 100, 150]
        },
        'embedding_output_dim': {
            'values': [8, 16, 32]
        }
    }
}

pprint.pprint(sweep_config)

{'method': 'grid',
 'metric': {'goal': 'minimize', 'name': 'val_loss'},
 'name': 'sweep_example',
 'parameters': {'dropout': {'value': 0.1},
                'embedding_output_dim': {'values': [8, 16, 32]},
                'imputer_strategy_categorical': {'values': ['most_frequent',
                                                            'constant']},
                'imputer_strategy_numeric': {'values': ['mean',
                                                        'median',
                                                        'most_frequent',
                                                        'constant']},
                'neurons': {'values': [[32, 2], [64, 32, 2]]},
                'optimizer': {'values': ['adam', 'sgd']},
                'text_max_length': {'values': [50, 100, 150]},
                'text_tokenizer_num_words': {'values': [5000, 10000, 15000]}}}
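
With `'method': 'grid'`, a sweep enumerates every combination of the listed values, so the number of runs is the product of the number of options per parameter. The sketch below counts them for a copy of the `parameters` block above. The helper name `count_grid_runs` is hypothetical, and the count assumes that a single `value` entry contributes exactly one option.

```python
from math import prod

# Copy of the 'parameters' block of sweep_config
parameters = {
    'dropout': {'value': 0.1},
    'neurons': {'values': [[32, 2], [64, 32, 2]]},
    'optimizer': {'values': ['adam', 'sgd']},
    'imputer_strategy_numeric': {'values': ['mean', 'median', 'most_frequent', 'constant']},
    'imputer_strategy_categorical': {'values': ['most_frequent', 'constant']},
    'text_tokenizer_num_words': {'values': [5000, 10000, 15000]},
    'text_max_length': {'values': [50, 100, 150]},
    'embedding_output_dim': {'values': [8, 16, 32]},
}


def count_grid_runs(params):
    """Number of combinations in a grid: the product of the option counts."""
    return prod(len(spec['values']) if 'values' in spec else 1 for spec in params.values())


total_runs = count_grid_runs(parameters)
print(total_runs)
```

A grid of this size is expensive at 10 epochs per run, so an agent limited with `count=10` would cover only a small part of it. Trimming the value lists or switching the sweep `method` to `random` or `bayes` are common ways to keep the search tractable.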


In [None]:
from tensorflow.keras.layers import Dropout, Input, Embedding, Flatten, Dense, concatenate
from tensorflow.keras.models import Model

def train():
    with wandb.init() as run:
        config = run.config
        X_num_cat_train, X_text_train_padded, y_train, X_num_cat_test, X_text_test_padded, y_test = preprocess_data(df,config.imputer_strategy_numeric,config.imputer_strategy_categorical,config.text_tokenizer_num_words,config.text_max_length)
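        # build_model also needs the vocabulary size and a pretrained `embedding_matrix` of shape
        # (text_tokenizer_num_words, embedding_output_dim), assumed to be defined in an earlier cell
        model = build_model(
            neurons=config.neurons,
            dropout=config.dropout,
            optimizer=config.optimizer,
            text_tokenizer_num_words=config.text_tokenizer_num_words,
            embedding_output_dim=config.embedding_output_dim,
            embedding_matrix=embedding_matrix,
        )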
        wandb_callback = wandb.keras.WandbCallback()
        model.fit([X_text_train_padded, X_num_cat_train], y_train, validation_data=([X_text_test_padded, X_num_cat_test], y_test), epochs=10, batch_size=32, callbacks=[wandb_callback])





In [None]:
sweep_id = wandb.sweep(sweep_config, project=project, entity=entity)
wandb.agent(sweep_id, function=train, count=10, project=project, entity=entity)

Create sweep with ID: dnybgb90
Sweep URL: https://wandb.ai/marian-ai/obligatorio_dl/sweeps/dnybgb90


wandb: Agent Starting Run: eqlo005i with config:
wandb: 	dropout: 0.1
wandb: 	embedding_output_dim: 8
wandb: 	imputer_strategy_categorical: most_frequent
wandb: 	imputer_strategy_numeric: mean
wandb: 	neurons: [32, 2]
wandb: 	optimizer: adam
wandb: 	text_max_length: 50
wandb: 	text_tokenizer_num_words: 5000
wandb: Currently logged in as: mariano-chicatun (marian-ai). Use `wandb login --relogin` to force relogin


wandb: Ctrl + C detected. Stopping sweep.


# Previous stage

In [None]:


# Define the numerical preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Define the categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Fit the preprocessor on the full DataFrame and transform it
df_preprocessed = preprocessor.fit_transform(df)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_text_train)
max_length = 100
X_text_train_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_train), maxlen=max_length)
X_text_test_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_test), maxlen=max_length)

X_num_cat_train = preprocessor.fit_transform(X_train)
X_num_cat_test = preprocessor.transform(X_test)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-22-d8fce40df4a2>", line 21, in <cell line: 21>
    tokenizer.fit_on_texts(X_text_train)
  File "/usr/local/lib/python3.10/dist-packages/keras/src/preprocessing/text.py", line 293, in fit_on_texts
    seq = text_to_word_sequence(
  File "/usr/local/lib/python3.10/dist-packages/keras/src/preprocessing/text.py", line 80, in text_to_word_sequence
    seq = input_text.split(split)
KeyboardInterrupt


In [None]:
df[text_cols].dtypes

### Preprocess the text data
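A minimal sketch of what the padding step does to tokenized text, assuming the Keras `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`, fill value 0). The helper `pad_pre` and the toy sequences are hypothetical and written in pure Python only to illustrate the behavior; they are not part of the notebook's pipeline.

```python
def pad_pre(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences with 'pre' padding and truncation."""
    padded = []
    for seq in sequences:
        # Keep only the last `maxlen` tokens (truncation removes from the front)
        seq = list(seq)[-maxlen:] if maxlen > 0 else []
        # Left-pad with `value` until the row has exactly `maxlen` entries
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Toy token ids (made up): one short sequence and one long sequence
toy_sequences = [[4, 7], [1, 2, 3, 4, 5, 6]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)  # [[0, 0, 4, 7], [3, 4, 5, 6]]
```

With these defaults, long listings keep their final tokens rather than their first ones, which is worth keeping in mind when choosing `max_length`.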

### Combine all the data

In [None]:
print(len(X_text_train_padded), len(X_num_cat_train), len(y_train))
print(np.any(np.isnan(X_text_train_padded)), np.any(np.isnan(X_num_cat_train)), np.any(np.isnan(y_train)))


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, concatenate
from tensorflow.keras.models import Model

# Assume X_num_cat_train and X_num_cat_test are your numerical and categorical data split into training and test sets
# X_num_cat_train, X_num_cat_test, y_train, y_test = train_test_split(df_preprocessed, df['Price'], test_size=0.2, random_state=0)

# Separate input layers
text_input = Input(shape=(100,), name='text_input')
num_cat_input = Input(shape=(X_num_cat_train.shape[1],), name='num_cat_input')

# Text data path
text_embedding = Embedding(input_dim=5000, output_dim=16)(text_input)
text_flatten = Flatten()(text_embedding)

# Combine the processing paths
combined_input = concatenate([text_flatten, num_cat_input])

# Continue with your model
hidden_layer = Dense(128, activation='relu')(combined_input)
output_layer = Dense(1)(hidden_layer)  # Assuming a regression task

model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])



### 4.2 Split the data into training and test sets

### 4.3 Define the model

### 4.4 Train

In [None]:
history = model.fit([X_text_train_padded, X_num_cat_train], y_train, epochs=10, batch_size=32, validation_split=0.2)


### 4.5 Evaluate on the test set

In [None]:
loss, mae = model.evaluate([X_text_test_padded, X_num_cat_test], y_test, verbose=0)
print(f'Test Loss: {loss}')
print(f'Test MAE: {mae}')

## 5 Generating the output for the Kaggle competition

In [None]:
file_path2 = './obligatorio_DL/private_data_to_predict.csv'
data_for_kaggle = pd.read_csv(file_path2)

In [None]:
# text_cols.remove('combined_text')

In [None]:
print(text_cols)

In [None]:
X_num_cat_kaggle = preprocessor.transform(data_for_kaggle)
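# Join the text columns of each row into one string
# (assumption: this should mirror how the training text was built earlier in the notebook)
data_for_kaggle['all_text'] = data_for_kaggle[text_cols].apply(lambda x: ' '.join(x.astype(str)), axis=1)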

X_text_kaggle_sequences = tokenizer.texts_to_sequences(data_for_kaggle['all_text'])
X_text_kaggle_padded = pad_sequences(X_text_kaggle_sequences, maxlen=max_length)

kaggle_results = model.predict([X_text_kaggle_padded, X_num_cat_kaggle])


In [None]:
# Build the submission file: one row per listing with its predicted price.
# A separate name is used so the training DataFrame `df` is not overwritten.
submission = pd.DataFrame({
    'id': data_for_kaggle['id'],
    'expected': kaggle_results.ravel(),
})
submission['expected'] = submission['expected'].fillna(0)
submission.to_csv("output_to_submit.csv", index=False)

# With WandB


In [None]:
import pprint

# sweep_config = {
# 'name': 'sweep_example',
# 'method': 'grid',
# 'metric': {
#     'name': 'val_loss',
#     'goal': 'minimize'
# },
# 'parameters': {
#     'dropout':{'value': 0.1},
#     'neurons':{
#         'values': [[32,2],[64,32,2]]
#         },
#     'optimizer': {
#         'values': ['adam', 'sgd']
#         }
# }
# }

sweep_config = {
    'name': 'sweep_example',
    'method': 'grid',
    'metric': {
        'name': 'val_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'dropout': {'value': 0.1},
        'neurons': {'values': [[32, 2], [64, 32, 2]]},
        'optimizer': {'values': ['adam', 'sgd']},
        'imputer_strategy': {'values': ['mean', 'median', 'most_frequent']},  # Imputer strategies
        'scaler': {'values': ['standard', 'minmax']},  # Scaler options: standard scaler or minmax scaler
        'text_embedding_dim': {'values': [16, 32]},  # Embedding dimensions for text data
    }
}

pprint.pprint(sweep_config)
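A rough sketch of how many runs a full grid over this configuration implies. It copies the `parameters` block from `sweep_config` and enumerates every combination with `itertools.product`. The helper `candidate_values` is a hypothetical name introduced here for illustration, and the enumeration only approximates how a grid sweep walks the space.

```python
from itertools import product

# Copy of the 'parameters' block from the sweep configuration above
parameters = {
    'dropout': {'value': 0.1},
    'neurons': {'values': [[32, 2], [64, 32, 2]]},
    'optimizer': {'values': ['adam', 'sgd']},
    'imputer_strategy': {'values': ['mean', 'median', 'most_frequent']},
    'scaler': {'values': ['standard', 'minmax']},
    'text_embedding_dim': {'values': [16, 32]},
}


def candidate_values(spec):
    """Return the list of candidates for one parameter (fixed 'value' or list of 'values')."""
    return spec['values'] if 'values' in spec else [spec['value']]


names = list(parameters)
grid = [dict(zip(names, combo))
        for combo in product(*(candidate_values(parameters[n]) for n in names))]

print(len(grid))  # 48
print(grid[0])
```

Since the grid has 48 combinations, an agent started with `count=10` would cover only part of it; raising `count` or trimming the grid are the two obvious options.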

In [None]:
import sys
import traceback
# def run_train():
#     try:
#         with wandb.init(config=None, project=project, entity=entity):
#             # initialize model
#             config = wandb.config
#             print(config)
#             model= get_model(config.neurons, config.optimizer, config.dropout)
#             tf.keras.backend.clear_session()
#             wandb_callback = wandb.keras.WandbCallback()
#             model.fit([X_text_train_padded, X_num_cat_train], y_train,
#                       epochs=5, batch_size=128, validation_split=0.2,
#                       callbacks=[wandb_callback], max_queue_size=3, workers=2)
#     except Exception as e:
#         # exit gracefully, so wandb logs the problem
#         print(traceback.print_exc(), file=sys.stderr)
#         exit(1)

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def run_train():
    try:
        with wandb.init(config=None, project=project, entity=entity):
            # initialize model
            config = wandb.config
            print(config)

            # Configure preprocessing based on sweep config
            numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=config.imputer_strategy)),
                ('scaler', StandardScaler() if config.scaler == 'standard' else MinMaxScaler())])

            preprocessor = ColumnTransformer(
                transformers=[('num', numeric_transformer, numerical_cols)])

            X_num_cat_train = preprocessor.fit_transform(X_train)
            X_num_cat_test = preprocessor.transform(X_test)

            # Clear any previous Keras state before building the model for this run
            tf.keras.backend.clear_session()
            model = get_model(config.neurons, config.optimizer, config.dropout, config.text_embedding_dim)
            wandb_callback = wandb.keras.WandbCallback()
            model.fit([X_text_train_padded, X_num_cat_train], y_train,
                      epochs=5, batch_size=128, validation_split=0.2,
                      callbacks=[wandb_callback], max_queue_size=3, workers=2)

    except Exception:
        # Print the traceback and exit with a non-zero status so wandb records the failure
        traceback.print_exc()
        sys.exit(1)

In [None]:
sweep_id = wandb.sweep(sweep_config, project=project, entity=entity)
wandb.agent(sweep_id, function=run_train, count=10, project=project, entity=entity)


# BERT fine-tuning

Switch to Keras

In [None]:
import re
def treat_euro(text):
    text = re.sub(r'(euro[^s])|(euros)|(€)', ' euros', text)
    return text
def treat_m2(text):
    text = re.sub(r'(m2)|(m²)', ' m²', text)
    return text

def filter_ibans(text):
    pattern = r'fr\d{2}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{2}|fr\d{20}|fr[ ]\d{2}[ ]\d{3}[ ]\d{3}[ ]\d{3}[ ]\d{5}'
    text = re.sub(pattern, '', text)
    return text
def remove_space_between_numbers(text):
    text = re.sub(r'(\d)\s+(\d)', r'\1\2', text)
    return text
def filter_emails(text):
    pattern = r'(?:(?!.*?[.]{2})[a-zA-Z0-9](?:[a-zA-Z0-9.+!%-]{1,64}|)|\"[a-zA-Z0-9.+!% -]{1,64}\")@[a-zA-Z0-9][a-zA-Z0-9.-]+(.[a-z]{2,}|.[0-9]{1,})'
    text = re.sub(pattern, '', text)
    return text
def filter_ref(text):
    pattern = r'(\(*)(ref|réf)(\.|[ ])\d+(\)*)'
    text = re.sub(pattern, '', text)
    return text
def filter_websites(text):
    pattern = r'(http\:\/\/|https\:\/\/)?([a-z0-9][a-z0-9\-]*\.)+[a-z][a-z\-]*'
    text = re.sub(pattern, '', text)
    return text
def filter_phone_numbers(text):
    pattern = r'(?:(?:\+|00)33[\s.-]{0,3}(?:\(0\)[\s.-]{0,3})?|0)[1-9](?:(?:[\s.-]?\d{2}){4}|\d{2}(?:[\s.-]?\d{3}){2})|(\d{2}[ ]\d{2}[ ]\d{3}[ ]\d{3})'
    text = re.sub(pattern, '', text)
    return text

def clean_text(text):
    text = text.lower()
    text = text.replace(u'\xa0', u' ')
    text = treat_m2(text)
    text = treat_euro(text)
    text = filter_phone_numbers(text)
    text = filter_emails(text)
    text = filter_ibans(text)
    text = filter_ref(text)
    text = filter_websites(text)
    text = remove_space_between_numbers(text)
    return text
df['cleaned_description'] = df.description.apply(clean_text)

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
encoded_corpus = tokenizer(text=df.cleaned_description.tolist(),
                            add_special_tokens=True,
                            padding='max_length',
                            truncation='longest_first',
                            max_length=300,
                            return_attention_mask=True)
input_ids = encoded_corpus['input_ids']
attention_mask = encoded_corpus['attention_mask']

import numpy as np
def filter_long_descriptions(tokenizer, descriptions, max_len):
    indices = []
    lengths = tokenizer(descriptions, padding=False,
                     truncation=False, return_length=True)['length']
    for i in range(len(descriptions)):
        if lengths[i] <= max_len-2:
            indices.append(i)
    return indices
short_descriptions = filter_long_descriptions(tokenizer,
                               df.cleaned_description.tolist(), 300)
input_ids = np.array(input_ids)[short_descriptions]
attention_mask = np.array(attention_mask)[short_descriptions]
labels = df.prix.to_numpy()[short_descriptions]
from sklearn.model_selection import train_test_split
test_size = 0.1
seed = 42
train_inputs, test_inputs, train_labels, test_labels = \
            train_test_split(input_ids, labels, test_size=test_size,
                             random_state=seed)
train_masks, test_masks, _, _ = train_test_split(attention_mask,
                                        labels, test_size=test_size,
                                        random_state=seed)

from sklearn.preprocessing import StandardScaler
price_scaler = StandardScaler()
price_scaler.fit(train_labels.reshape(-1, 1))
train_labels = price_scaler.transform(train_labels.reshape(-1, 1))
test_labels = price_scaler.transform(test_labels.reshape(-1, 1))

import torch
from torch.utils.data import TensorDataset, DataLoader
batch_size = 32
def create_dataloaders(inputs, masks, labels, batch_size, shuffle=True):
    input_tensor = torch.tensor(inputs)
    mask_tensor = torch.tensor(masks)
    # Cast labels to float32 so they match the dtype of the model outputs
    labels_tensor = torch.tensor(labels, dtype=torch.float32)
    dataset = TensorDataset(input_tensor, mask_tensor,
                            labels_tensor)
    dataloader = DataLoader(dataset, batch_size=batch_size,
                            shuffle=shuffle)
    return dataloader
train_dataloader = create_dataloaders(train_inputs, train_masks,
                                      train_labels, batch_size)
# Keep evaluation data in its original order
test_dataloader = create_dataloaders(test_inputs, test_masks,
                                     test_labels, batch_size,
                                     shuffle=False)
import torch.nn as nn
class BertRegressor(nn.Module):

    def __init__(self, drop_rate=0.2, freeze_bert=False):

        super(BertRegressor, self).__init__()
        D_in, D_out = 768, 1

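        # BertModel (imported above) matches the multilingual BERT checkpoint
        self.bert = \
                   BertModel.from_pretrained('bert-base-multilingual-uncased')
        # Optionally freeze the encoder so only the regression head is trained
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        self.regressor = nn.Sequential(
            nn.Dropout(drop_rate),
            nn.Linear(D_in, D_out))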
    def forward(self, input_ids, attention_masks):

        outputs = self.bert(input_ids, attention_masks)
        class_label_output = outputs[1]
        outputs = self.regressor(class_label_output)
        return outputs
model = BertRegressor(drop_rate=0.2)

import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU.")
else:
    print("No GPU available, using the CPU instead.")
    device = torch.device("cpu")
model.to(device)
# Use PyTorch's AdamW; the implementation in transformers is deprecated
from torch.optim import AdamW
optimizer = AdamW(model.parameters(),
                  lr=5e-5,
                  eps=1e-8)
from transformers import get_linear_schedule_with_warmup
epochs = 5
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                 num_warmup_steps=0, num_training_steps=total_steps)

loss_function = nn.MSELoss()
from torch.nn.utils import clip_grad_norm_
def train(model, optimizer, scheduler, loss_function, epochs,
          train_dataloader, device, clip_value=2):
    for epoch in range(epochs):
        print(epoch)
        print("-----")
        model.train()
        for step, batch in enumerate(train_dataloader):
            print(step)
            batch_inputs, batch_masks, batch_labels = \
                               tuple(b.to(device) for b in batch)
            model.zero_grad()
            outputs = model(batch_inputs, batch_masks)
            loss = loss_function(outputs.squeeze(),
                             batch_labels.squeeze())
            loss.backward()
            # Rescale gradients in place so their global L2 norm is at most clip_value
            clip_grad_norm_(model.parameters(), clip_value)
            optimizer.step()
            scheduler.step()

    return model
model = train(model, optimizer, scheduler, loss_function, epochs,
              train_dataloader, device, clip_value=2)

def evaluate(model, loss_function, test_dataloader, device):
    model.eval()
    test_loss, test_r2 = [], []
    for batch in test_dataloader:
        batch_inputs, batch_masks, batch_labels = \
                                 tuple(b.to(device) for b in batch)
        with torch.no_grad():
            outputs = model(batch_inputs, batch_masks)
        loss = loss_function(outputs, batch_labels)
        test_loss.append(loss.item())
        r2 = r2_score(outputs, batch_labels)
        test_r2.append(r2.item())
    return test_loss, test_r2
def r2_score(outputs, labels):
    labels_mean = torch.mean(labels)
    ss_tot = torch.sum((labels - labels_mean) ** 2)
    ss_res = torch.sum((labels - outputs) ** 2)
    r2 = 1 - ss_res / ss_tot
    return r2

def predict(model, dataloader, device):
    model.eval()
    output = []
    for batch in dataloader:
        batch_inputs, batch_masks, _ = \
                                  tuple(b.to(device) for b in batch)
        with torch.no_grad():
            output += model(batch_inputs,
                            batch_masks).view(1,-1).tolist()[0]
    return output

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
val_set = val_data[['id_annonce', 'description', 'prix']].copy()
val_set['cleaned_description'] = \
                val_set.description.apply(clean_text)
encoded_val_corpus = \
                tokenizer(text=val_set.cleaned_description.tolist(),
                          add_special_tokens=True,
                          padding='max_length',
                          truncation='longest_first',
                          max_length=300,
                          return_attention_mask=True)
val_input_ids = np.array(encoded_val_corpus['input_ids'])
val_attention_mask = np.array(encoded_val_corpus['attention_mask'])
val_labels = val_set.prix.to_numpy()
val_labels = price_scaler.transform(val_labels.reshape(-1, 1))
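# Build the validation loader without shuffling so predictions stay aligned with val_set rows
val_dataset = TensorDataset(torch.tensor(val_input_ids),
                            torch.tensor(val_attention_mask),
                            torch.tensor(val_labels, dtype=torch.float32))
val_dataloader = DataLoader(val_dataset, batch_size=batch_size,
                            shuffle=False)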
y_pred_scaled = predict(model, val_dataloader, device)

y_test = val_set.prix.to_numpy()
# inverse_transform expects a 2D array; flatten back to 1D afterwards
y_pred = price_scaler.inverse_transform(
    np.array(y_pred_scaled).reshape(-1, 1)).ravel()
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
# Alias the import so it does not shadow the torch-based r2_score defined above
from sklearn.metrics import r2_score as sklearn_r2_score
mae = mean_absolute_error(y_test, y_pred)
mdae = median_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
mdape = ((pd.Series(y_test) - pd.Series(y_pred))\
         / pd.Series(y_test)).abs().median()
r_squared = sklearn_r2_score(y_test, y_pred)
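The training loop above clips gradients by their global norm before each optimizer step. The following NumPy sketch illustrates that idea on made-up gradients: if the combined L2 norm of all gradients exceeds `max_norm`, every gradient is scaled by the same factor. The function `clip_by_global_norm` and the toy values are hypothetical, and the small `eps` term is an assumption about how the scaling factor is stabilized.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Never scale up: the factor is capped at 1.0 when the norm is already small
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Toy gradients for two parameter tensors (made up): global norm = sqrt(9 + 16 + 144) = 13
toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=2.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before)  # 13.0
print(norm_after)   # just under 2.0
```

Because all tensors share one scaling factor, the direction of the overall update is preserved and only its magnitude is limited.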


In [None]:
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Your preprocessing functions and data loading code here...
# ...

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-multilingual-uncased')

# Assume df is your DataFrame and 'cleaned_description' is the text field
encoded_corpus = tokenizer(
    text=df.cleaned_description.tolist(),
    add_special_tokens=True,
    padding='max_length',
    truncation='longest_first',
    max_length=300,
    return_attention_mask=True,
    return_tensors='np'  # NumPy arrays so train_test_split can index them
)

input_ids = encoded_corpus['input_ids']
attention_mask = encoded_corpus['attention_mask']
labels = df.prix.to_numpy()

# Splitting the data
test_size = 0.1
seed = 42
train_inputs, test_inputs, train_labels, test_labels = \
    train_test_split(input_ids, labels, test_size=test_size, random_state=seed)
train_masks, test_masks, _, _ = train_test_split(attention_mask, labels, test_size=test_size, random_state=seed)

# Scaling the labels
price_scaler = StandardScaler()
price_scaler.fit(train_labels.reshape(-1, 1))
train_labels = price_scaler.transform(train_labels.reshape(-1, 1))
test_labels = price_scaler.transform(test_labels.reshape(-1, 1))

# Define the model
def build_model():
    input_ids = tf.keras.layers.Input(shape=(300,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(300,), dtype=tf.int32, name='attention_mask')

    bert_output = bert_model([input_ids, attention_mask])
    cls_token_output = bert_output.last_hidden_state[:, 0, :]
    output = tf.keras.layers.Dense(1)(cls_token_output)

    model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5, epsilon=1e-8),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=[tf.keras.metrics.MeanAbsoluteError()])
    return model

model = build_model()

# Train the model
history = model.fit(
    x=[train_inputs, train_masks],
    y=train_labels,
    validation_data=([test_inputs, test_masks], test_labels),
    batch_size=32,
    epochs=5
)

# Evaluate and predict
test_loss, test_mae = model.evaluate(x=[test_inputs, test_masks], y=test_labels)
predictions = model.predict(x=[test_inputs, test_masks])

# Rescale predictions
y_pred = price_scaler.inverse_transform(predictions)
# ...
# Your evaluation code here
# ...
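Because the labels are standardized before training, the MAE that Keras reports is in scaled units, not price units. A minimal NumPy sketch, on made-up prices, of how the two relate: standardization is an affine map, so the MAE in price units equals the scaled MAE times the scaler's standard deviation. The arrays below are illustrative assumptions, not values from the dataset.

```python
import numpy as np

# Made-up prices standing in for the training labels
prices = np.array([50.0, 80.0, 120.0, 200.0, 550.0])

# StandardScaler standardizes with the mean and the population std (ddof=0)
mean, std = prices.mean(), prices.std()
scaled_true = (prices - mean) / std

# Hypothetical model outputs in scaled units
scaled_pred = scaled_true + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

# MAE as a model trained on scaled labels would report it
mae_scaled = np.mean(np.abs(scaled_true - scaled_pred))

# Undo the scaling (what inverse_transform does) and compute MAE in price units
pred_prices = scaled_pred * std + mean
mae_prices = np.mean(np.abs(prices - pred_prices))

print(mae_scaled, mae_prices, mae_scaled * std)
```

Any MAE threshold stated in price units should therefore be compared against the rescaled value, not against the metric logged during training.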
