![AIRBNB](https://www.stevenridercpa.au/wp-content/uploads/2022/09/airbnb-tax.jpeg)

# Deep Learning Assignment
## Semester 2 - 2023
-------

## Problem

The dataset contains information on accommodations listed on AirBnB together with their prices. The training set is approximately 1.5 GB and the test set 0.5 GB. It has 84 predictor variables, which you may use as you see fit.

The goal is to assign the correct price to each listed accommodation.

Along with the dataset you are given this notebook, which contains the data-loading script and a baseline model with a feed-forward architecture.

------

## Assignment

### A) <u>Participation in the Kaggle competition</u>:
The goal of this item is to take part in the Kaggle competition and achieve, at a minimum, a Mean Absolute Error below 70 points. [->Link to the competition<-](https://www.kaggle.com/t/69c648e3aa214d1f812bf2314c8d4ffa).
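
For reference, the Mean Absolute Error averages the absolute difference between the true and the predicted price, so a score of 70 means predictions are off by 70 price units on average. The sketch below uses the three prices shown by `df.head(3)` later in this notebook; the predictions and the helper name `mean_absolute_error` are made up for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [159.0, 49.0, 75.0]   # prices of the first three listings
y_pred = [120.0, 60.0, 95.0]   # made-up predictions
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (39 + 11 + 20) / 3
```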

### B) <u>Use of Grid Search (or equivalent)</u>:
To search for optimal models, run a grid search that is as comprehensive and methodical as possible. We strongly recommend [Weights and Biases](https://wandb.ai/site).
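
To get a feel for how large a grid becomes, the sketch below enumerates every combination of the hyperparameter values used in the `sweep_config` defined later in this notebook. It relies only on `itertools.product`; the `grid` dictionary is a plain-Python stand-in for the W&B configuration, not part of the W&B API.

```python
from itertools import product

# Same values as the sweep_config defined later in this notebook
grid = {
    'dropout': [0.1],
    'neurons': [[32, 2], [64, 32, 2]],
    'optimizer': ['adam', 'sgd'],
    'imputer_strategy_numeric': ['mean', 'median', 'most_frequent', 'constant'],
    'imputer_strategy_categorical': ['most_frequent', 'constant'],
    'text_tokenizer_num_words': [5000, 10000, 15000],
    'text_max_length': [50, 100, 150],
    'embedding_output_dim': [8, 16, 32],
}

# One dictionary per combination of one value for each hyperparameter
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
n_configs = len(configs)
print(n_configs)
print(configs[0])
```

The grid has 864 combinations, while the agent call later in the notebook is limited to `count=10` runs, so covering the whole grid would require a larger count or a smaller grid.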

### C) <u>You must also research and implement the following techniques</u>:
#### 1. [Batch Normalization](https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/)
#### 2. [Gradient Normalization y/o Gradient Clipping](https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/)
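
As a rough illustration of what these two techniques compute, here is a NumPy-only sketch: a batch-normalization forward pass in training mode (no running statistics) and gradient clipping by global L2 norm. The function names and the toy data are assumptions made for this example. In Keras, the corresponding building blocks are the `BatchNormalization` layer and the `clipnorm`, `global_clipnorm` and `clipvalue` optimizer arguments.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

rng = np.random.default_rng(117)
x = rng.normal(loc=50.0, scale=10.0, size=(32, 4))  # batch of 32 samples, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # joint L2 norm is 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(out.mean(axis=0), out.std(axis=0))
print(norm_before, norm_after)
```

After normalization every feature has mean close to 0 and standard deviation close to 1, and the clipped gradients keep their direction while their joint norm is reduced to `max_norm`.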


As in all assignments, the neatness of the submission, the data preprocessing, the visualizations, and the exploration of alternative techniques will also be evaluated.

-------

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 1. Setup
### 1.1 Imports

In [2]:
%cd /content/drive/MyDrive/Colab Notebooks/Datasets

/content/drive/MyDrive/Colab Notebooks/Datasets


In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

### 1.2 Setting seeds

In [4]:
np.random.seed(117)
tf.random.set_seed(117)

## 2. Data loading

In [5]:
file_path = './obligatorio_DL/public_train_data.csv'
df = pd.read_csv(file_path)

## 3. Exploratory data analysis
### 3.1 Dimensions

In [6]:
df.shape

(326287, 85)

### 3.2 Column and data type information

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326287 entries, 0 to 326286
Data columns (total 85 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              326287 non-null  int64  
 1   Last Scraped                    326286 non-null  object 
 2   Name                            326018 non-null  object 
 3   Summary                         315651 non-null  object 
 4   Space                           228792 non-null  object 
 5   Description                     326188 non-null  object 
 6   Experiences Offered             326287 non-null  object 
 7   Neighborhood Overview           192513 non-null  object 
 8   Notes                           130729 non-null  object 
 9   Transit                         200649 non-null  object 
 10  Access                          177108 non-null  object 
 11  Interaction                     169193 non-null  object 
 12  House Rules     

### 3.3 View the first rows of the dataset

In [8]:
df.head(3)

Unnamed: 0,id,Last Scraped,Name,Summary,Space,Description,Experiences Offered,Neighborhood Overview,Notes,Transit,...,Review Scores Location,Review Scores Value,License,Jurisdiction Names,Cancellation Policy,Calculated host listings count,Reviews per Month,Geolocation,Features,Price
0,0,2017-05-12,Grand Loft in the heart of historic Antwerp,Best location for visiting Antwerp!! Beautiful...,Welcome in Antwerp!! The loft is situated on t...,Best location for visiting Antwerp!! Beautiful...,none,,,,...,10.0,9.0,,,strict,2.0,2.6,"51.21938762207894, 4.4034442505151885","Host Has Profile Pic,Instant Bookable",159.0
1,1,2017-05-03,"CHARMING, CLEAN & COZY BUNGALOW!",Very centrally located and less than 15 min fr...,"Well lit, private entrance with small patio.",Very centrally located and less than 15 min fr...,none,"Quiet. Pretty tree lined streets, safe area.",Has dining table and high back desk chair.,"Uber, bus line and metro link is less than 5 m...",...,,,,"City of Los Angeles, CA",flexible,1.0,,"34.1892692286356, -118.41993491931177","Host Has Profile Pic,Is Location Exact",49.0
2,2,2017-05-09,la casa di maurizio,"nice apartment with view to via veneto , very ...",,"nice apartment with view to via veneto , very ...",none,,,,...,,,,,flexible_new,1.0,,"41.90859623057272, 12.493518028459327","Host Has Profile Pic,Is Location Exact",75.0


### 3.4 Descriptive statistics

In [9]:
df.describe()

Unnamed: 0,id,Host ID,Host Response Rate,Host Listings Count,Host Total Listings Count,Latitude,Longitude,Accommodates,Bathrooms,Bedrooms,...,Review Scores Rating,Review Scores Accuracy,Review Scores Cleanliness,Review Scores Checkin,Review Scores Communication,Review Scores Location,Review Scores Value,Calculated host listings count,Reviews per Month,Price
count,326287.0,326287.0,250845.0,325971.0,325970.0,326287.0,326287.0,326244.0,325300.0,325873.0,...,243160.0,242584.0,242732.0,242378.0,242710.0,242423.0,242347.0,325689.0,246983.0,326287.0
mean,163143.0,32367570.0,93.408264,9.586,9.586026,38.042816,-15.323924,3.270764,1.239482,1.358072,...,92.880063,9.524713,9.326067,9.691416,9.708253,9.468215,9.321031,6.881531,1.486211,138.229041
std,94191.087979,31745720.0,17.536835,57.399711,57.399797,22.910029,70.101677,2.037446,0.574784,0.921763,...,8.569521,0.855361,1.038858,0.731702,0.723143,0.805116,0.906478,42.025986,1.752082,149.790527
min,0.0,19.0,0.0,0.0,0.0,-38.224427,-123.218712,1.0,0.0,0.0,...,20.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.01,0.0
25%,81571.5,6869780.0,98.0,1.0,1.0,38.923154,-73.968081,2.0,1.0,1.0,...,90.0,9.0,9.0,10.0,10.0,9.0,9.0,1.0,0.32,55.0
50%,163143.0,21867370.0,100.0,1.0,1.0,42.304549,0.090277,2.0,1.0,1.0,...,95.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0,0.89,90.0
75%,244714.5,47991660.0,100.0,3.0,3.0,50.863658,12.342749,4.0,1.0,2.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,2.0,2.04,150.0
max,326286.0,135088500.0,100.0,1114.0,1114.0,55.994889,153.637837,18.0,8.0,96.0,...,100.0,10.0,10.0,10.0,10.0,10.0,10.0,752.0,223.0,999.0


In [10]:
df.columns

Index(['id', 'Last Scraped', 'Name', 'Summary', 'Space', 'Description',
       'Experiences Offered', 'Neighborhood Overview', 'Notes', 'Transit',
       'Access', 'Interaction', 'House Rules', 'Thumbnail Url', 'Medium Url',
       'Picture Url', 'XL Picture Url', 'Host ID', 'Host URL', 'Host Name',
       'Host Since', 'Host Location', 'Host About', 'Host Response Time',
       'Host Response Rate', 'Host Acceptance Rate', 'Host Thumbnail Url',
       'Host Picture Url', 'Host Neighbourhood', 'Host Listings Count',
       'Host Total Listings Count', 'Host Verifications', 'Street',
       'Neighbourhood', 'Neighbourhood Cleansed',
       'Neighbourhood Group Cleansed', 'City', 'State', 'Zipcode', 'Market',
       'Smart Location', 'Country Code', 'Country', 'Latitude', 'Longitude',
       'Property Type', 'Room Type', 'Accommodates', 'Bathrooms', 'Bedrooms',
       'Beds', 'Bed Type', 'Amenities', 'Square Feet', 'Security Deposit',
       'Cleaning Fee', 'Guests Included', 'Extra Peop

In [11]:
df.shape

(326287, 85)

In [12]:
columnas_con_NaNs_mayores_a_50_porciento = df.columns[df.isnull().sum() > 0.5*df.shape[0]]
columnas_con_NaNs_mayores_a_50_porciento

Index(['Notes', 'Host Acceptance Rate', 'Neighbourhood Group Cleansed',
       'Square Feet', 'Security Deposit', 'Has Availability', 'License',
       'Jurisdiction Names'],
      dtype='object')
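
The same selection can be written with the per-column fraction of missing values, which is often easier to read and to threshold. A minimal sketch on a made-up frame (the column names are borrowed from the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Price': [159.0, 49.0, 75.0, 90.0],
    'Square Feet': [np.nan, np.nan, np.nan, 800.0],  # 75% missing
    'Bedrooms': [1.0, np.nan, 2.0, 1.0],             # 25% missing
})

missing_fraction = toy.isnull().mean()               # fraction of NaNs per column
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()
toy_reduced = toy.drop(columns=mostly_missing)
print(missing_fraction)
print(mostly_missing)
```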

In [16]:
print(f"Size before removing duplicates: {df.shape}")
# keep=False drops every row whose value is repeated, including all of its copies
df.drop_duplicates(subset = "id", keep=False, inplace=True)
df.drop_duplicates(subset = "Description", keep=False, inplace=True)
df.drop_duplicates(subset = "Summary", keep=False, inplace=True)
print(f"Size after removing duplicates: {df.shape}")

Size before removing duplicates: (317501, 75)
Size after removing duplicates: (299223, 75)
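
Two details of `drop_duplicates` are worth keeping in mind here: `keep=False` removes every copy of a repeated value rather than leaving one, and missing values count as duplicates of each other, so rows with a NaN `Summary` are dropped as well. The toy frame below is invented to show the difference; the last variant is only one possible way of keeping the rows without text.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'Summary': ['cozy loft', 'cozy loft', 'sea view', np.nan, np.nan],
})

drop_all = toy.drop_duplicates(subset='Summary', keep=False)      # removes every repeated row
keep_first = toy.drop_duplicates(subset='Summary', keep='first')  # keeps one copy of each value

# Deduplicate only the rows that have a Summary and keep the NaN rows untouched
has_text = toy['Summary'].notna()
keep_nan = pd.concat([
    toy[has_text].drop_duplicates(subset='Summary', keep='first'),
    toy[~has_text],
]).sort_index()

print(drop_all['id'].tolist())
print(keep_first['id'].tolist())
print(keep_nan['id'].tolist())
```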


In [17]:
df.isna().sum()/len(df)

id                                0.000000
Last Scraped                      0.000003
Name                              0.000725
Summary                           0.000000
Space                             0.310200
                                    ...   
Calculated host listings count    0.001761
Reviews per Month                 0.245482
Geolocation                       0.000000
Features                          0.000595
Price                             0.000000
Length: 75, dtype: float64

In [19]:
numericals = df.select_dtypes(['float','int']).columns
print(numericals)

Index(['id', 'Host Response Rate', 'Host Listings Count',
       'Host Total Listings Count', 'Latitude', 'Longitude', 'Accommodates',
       'Bathrooms', 'Bedrooms', 'Beds', 'Cleaning Fee', 'Guests Included',
       'Extra People', 'Minimum Nights', 'Maximum Nights', 'Availability 30',
       'Availability 60', 'Availability 90', 'Availability 365',
       'Number of Reviews', 'Review Scores Rating', 'Review Scores Accuracy',
       'Review Scores Cleanliness', 'Review Scores Checkin',
       'Review Scores Communication', 'Review Scores Location',
       'Review Scores Value', 'Calculated host listings count',
       'Reviews per Month', 'Price'],
      dtype='object')


## 4. Baseline model

### 4.1 Select relevant features

In [20]:
drop_columns = list(columnas_con_NaNs_mayores_a_50_porciento)
drop_columns.extend(['Host ID','Host URL','Price']) # Drop columns that carry no useful signal, such as the host ID and URL
drop_columns

['Notes',
 'Host Acceptance Rate',
 'Neighbourhood Group Cleansed',
 'Square Feet',
 'Security Deposit',
 'Has Availability',
 'License',
 'Jurisdiction Names',
 'Host ID',
 'Host URL',
 'Price']

In [21]:
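# features = ['Bathrooms', 'Bedrooms']  # Replace with the relevant features
# Some columns in drop_columns may already have been removed from df
# (it went from 85 to 75 columns); errors='ignore' skips them instead of raising KeyError.
features = df.columns.drop(drop_columns, errors='ignore')
target = 'Price'
df = df[[*features, target]]
# df.dropna(inplace=True)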

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

df_train = df.drop('Price',axis=1)
numerical_cols = df_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

unique_counts = df_train.drop(numerical_cols,axis=1).nunique()

# Mean string length of each non-numerical column
mean_string_length = df_train.drop(numerical_cols, axis=1).apply(lambda col: col.dropna().astype(str).apply(len).mean())

# Define the thresholds
unique_value_limit = 20  # e.g. treat a column as categorical if it has fewer than 20 unique values
string_length_limit = 20  # e.g. treat a column as text if its mean string length is at least 20

# Identify the categorical and text columns
categorical_cols = unique_counts[(unique_counts < unique_value_limit) & (mean_string_length < string_length_limit)].index.tolist()
text_cols = unique_counts[(unique_counts >= unique_value_limit) | (mean_string_length >= string_length_limit)].index.tolist()

# df['combined_text'] = df[text_cols].apply(lambda x: ' '.join(str(x)), axis=1)
# text_data = df['combined_text']

X_train, X_test, y_train, y_test = train_test_split(df.drop('Price', axis=1), df['Price'], test_size=0.2, random_state=0)

# Now split the text data
X_text_train = X_train.loc[:,text_cols ]
X_text_test = X_test.loc[:,text_cols ]
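
The heuristic above can be checked on a tiny frame before running it on the full dataset. In this sketch the column names come from the dataset, while the `Room Type` values and the shortened summaries are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Room Type': ['Entire home/apt', 'Private room', 'Private room', 'Shared room'],
    'Summary': [
        'Best location for visiting Antwerp!! Beautiful loft',
        'Very centrally located and less than 15 min from downtown',
        'nice apartment with view to via veneto',
        'Quiet. Pretty tree lined streets, safe area.',
    ],
})

unique_counts = toy.nunique()
mean_len = toy.apply(lambda col: col.dropna().astype(str).str.len().mean())

unique_value_limit = 20
string_length_limit = 20

# Categorical: few distinct values AND short strings; everything else is treated as text
is_categorical = (unique_counts < unique_value_limit) & (mean_len < string_length_limit)
categorical = unique_counts[is_categorical].index.tolist()
text = unique_counts[~is_categorical].index.tolist()
print(categorical, text)
```

Note that a short free-text column with many distinct values (for example a host name) also ends up in the text group under this rule.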

Things to try:

- https://medium.com/ilb-labs-publications/list-price-prediction-with-scraped-real-estate-data-df5e49f14547
- https://medium.com/ilb-labs-publications/fine-tuning-bert-for-a-regression-task-is-a-description-enough-to-predict-a-propertys-list-price-cf97cd7cb98a
- https://huggingface.co/bert-base-multilingual-cased

In [None]:
!pip install transformers

In [24]:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = TFBertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [25]:
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 13, 768), dtype=float32, numpy=
array([[[ 0.252442  , -0.53208315,  0.4496354 , ...,  1.1428416 ,
         -0.62380266, -0.07453373],
        [ 0.53735656,  0.07007895,  0.48796287, ...,  1.0800393 ,
         -0.5559526 , -0.55644315],
        [ 0.46855706, -0.32863307,  0.4781716 , ...,  1.1656268 ,
         -0.738888  , -0.39201605],
        ...,
        [ 0.48219427, -1.0821985 ,  0.90263855, ...,  1.8029281 ,
         -1.1342983 , -0.10343595],
        [ 0.12827995, -0.65504634,  0.37227005, ...,  0.781748  ,
         -0.91728044, -0.04007214],
        [ 0.16448566, -0.5238817 ,  0.66378266, ...,  0.79642314,
         -0.73260677, -0.2641508 ]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[ 0.3479018 , -0.04663803,  0.45640555, -0.24642844, -0.05631402,
         0.58653045,  0.48069143,  0.24746308, -0.56629646,  0.42599723,
         0.01355295, -0.335

In [None]:
!pip install wandb

In [None]:
# Import the W&B Python Library and log into W&B
import wandb

wandb.login()

# We create a project in WandB through its web interface
project = "obligatorio_dl"
entity = "marian-ai"

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
import dask.dataframe as dd

# Note: preprocess_text is defined in a later cell and must be run before this one
dask_df = dd.from_pandas(X_text_train, npartitions=20)  # Partition the dataframe
dask_result = dask_df.map_partitions(lambda part: part.applymap(preprocess_text)).compute(scheduler='multiprocessing')


In [None]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.11.1-py3-none-any.whl (227 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4199774 sha256=af43f44d6c7a50e05ca1055f3009f6a1b1b7bd14b185a536b49cde686d24ee25
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.11.1


In [None]:
from gensim.models import KeyedVectors

# 'wiki.multi.en.vec' is a plain-text word-vector file, while fasttext.load_model
# expects a binary .bin model. The vectors are therefore loaded with gensim
# (assumption: the file is in word2vec text format). Unlike a full fastText model,
# this gives no subword vectors for out-of-vocabulary words.
ft = KeyedVectors.load_word2vec_format('wiki.multi.en.vec')

def get_embedding(text):
    # Average the vectors of the words that exist in the vocabulary
    words = [word for word in text.lower().split() if word in ft]
    if not words:
        return np.zeros(ft.vector_size)
    return np.mean([ft[word] for word in words], axis=0)

text = "Bonjour, comment ça va ?"
embedding = get_embedding(text)

text_es = "Buenos dias como estas?"
embedding_es = get_embedding(text_es)

print(text)
print(f"The embedding is {embedding}")

print(text_es)
print(f"The embedding is {embedding_es}")

In [None]:
from joblib import Parallel, delayed

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Non-string values (e.g. NaN) are treated as empty text
    if not isinstance(text, str):
        return ''

    def strip_html(text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def remove_between_square_brackets(text):
        return re.sub(r'\[[^]]*\]', '', text)

    def remove_special_characters(text):
        # Keeps ASCII letters and whitespace only; note that this also drops
        # accented letters, which matters for non-English listings
        return re.sub(r'[^a-zA-Z\s]', '', text)

    def remove_stop_words(text):
        return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

    def lemmatize(text):
        return " ".join(LEMMATIZER.lemmatize(word) for word in text.split())

    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_special_characters(text)
    text = text.lower()
    text = remove_stop_words(text)
    text = lemmatize(text)
    return text

def preprocess_column(col):
    return col.apply(preprocess_text)

# Work on copies so the assignments below do not write into a slice of X_train / X_test
X_text_train = X_text_train.copy()
X_text_test = X_text_test.copy()

# Parallel returns one processed Series per column, in the order of text_cols
train_results = Parallel(n_jobs=-1)(delayed(preprocess_column)(X_text_train[col]) for col in text_cols)
test_results = Parallel(n_jobs=-1)(delayed(preprocess_column)(X_text_test[col]) for col in text_cols)
for col, train_col, test_col in zip(text_cols, train_results, test_results):
    X_text_train[col] = train_col
    X_text_test[col] = test_col

# Join the cleaned text columns of each row into a single string
X_text_train['combined_text'] = X_text_train[text_cols].apply(lambda x: ' '.join(x), axis=1)
X_text_test['combined_text'] = X_text_test[text_cols].apply(lambda x: ' '.join(x), axis=1)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
  soup = BeautifulSoup(text, "html.parser")
  soup = BeautifulSoup(text, "html.parser")


KeyboardInterrupt: ignored

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

texts = X_text_train['combined_text'].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

X_train_sequences = tokenizer.texts_to_sequences(texts)
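
`texts_to_sequences` returns one list of word ids per text, and those lists have different lengths, so they need to be brought to a common length before they can be batched (later cells use `pad_sequences` for this). The pure-Python sketch below mirrors what I understand to be the Keras defaults, padding and truncating at the front; `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                         # drop ids from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

sequences = [[4, 7], [1, 2, 3, 4, 5, 6], []]
result = pad_or_truncate(sequences, maxlen=4)
print(result)
```

Because id 0 is used as the padding value, the tokenizer's word ids start at 1, which is why the embedding matrix built below has `len(tokenizer.word_index) + 1` rows.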


In [None]:
from gensim.models import KeyedVectors

word2vec_model = KeyedVectors.load_word2vec_format('word2vec.bin', binary=True)


In [None]:
embedding_dim = word2vec_model.vector_size
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))

for word, i in tokenizer.word_index.items():
    if word in word2vec_model:
        embedding_matrix[i] = word2vec_model[word]
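
When building an embedding matrix this way it helps to track how many vocabulary words were actually found in the pretrained vectors, since words that are not found keep an all-zero row. A small sketch with hypothetical stand-ins for `tokenizer.word_index` and the word2vec model:

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the pretrained vectors
word_index = {'loft': 1, 'antwerp': 2, 'cozy': 3}
pretrained = {
    'loft': np.array([0.1, 0.2, 0.3]),
    'cozy': np.array([0.4, 0.5, 0.6]),
}
embedding_dim = 3

# Row 0 is reserved for padding; row i holds the vector of the word with index i
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
hits = 0
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[i] = vector
        hits += 1

coverage = hits / len(word_index)
print(f"coverage: {coverage:.2f}")
```

A low coverage on the real vocabulary would suggest that the pretrained vectors do not match the language or the cleaning applied to the listings.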


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input

# Assumes `embedding_matrix` was built in the previous cell
embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False  # Set to True to fine-tune the embeddings
)

input_text = Input(shape=(None,), dtype='int32')
embedded_text = embedding_layer(input_text)
lstm_output = LSTM(32)(embedded_text)  # last hidden state of the sequence
output = Dense(1)(lstm_output)  # single linear unit: the price is a regression target

model = Model(inputs=input_text, outputs=output)

# The competition metric is MAE, so it is used as the loss as well
model.compile(optimizer='adam', loss='mean_absolute_error', metrics=['mae'])

model.summary()

In [None]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

X_num_cat_train = preprocessor.fit_transform(X_train)
X_num_cat_test = preprocessor.transform(X_test)


In [None]:
def preprocess_data(df,imputer_strategy_numeric,imputer_strategy_categorical,text_tokenizer_num_words,text_max_length):
    df = df.copy()  # avoid adding 'combined_text' to the caller's DataFrame
    df_train = df.drop('Price', axis=1)
    numerical_cols = df_train.select_dtypes(include=['float64', 'int64']).columns.tolist()

    non_numerical = df_train.drop(numerical_cols, axis=1)
    unique_counts = non_numerical.nunique()
    mean_string_length = non_numerical.apply(lambda col: col.dropna().astype(str).apply(len).mean())

    unique_value_limit = 20
    string_length_limit = 20

    categorical_cols = unique_counts[(unique_counts < unique_value_limit) & (mean_string_length < string_length_limit)].index.tolist()
    text_cols = unique_counts[(unique_counts >= unique_value_limit) | (mean_string_length >= string_length_limit)].index.tolist()

    # Join the text columns of each row into one string (missing values become '')
    df['combined_text'] = df[text_cols].fillna('').astype(str).apply(lambda x: ' '.join(x), axis=1)

    X_train, X_test, y_train, y_test = train_test_split(df.drop('Price', axis=1), df['Price'], test_size=0.2, random_state=0)

    X_text_train = X_train['combined_text']
    X_text_test = X_test['combined_text']

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy=imputer_strategy_numeric)),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy=imputer_strategy_categorical, fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    # sparse_threshold=0 forces a dense array, which Keras and np.isnan can consume directly
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)],
        sparse_threshold=0)

    # Fit on the training split only, then apply the same transformation to the test split
    X_num_cat_train = preprocessor.fit_transform(X_train)
    X_num_cat_test = preprocessor.transform(X_test)

    tokenizer = Tokenizer(num_words=text_tokenizer_num_words)
    tokenizer.fit_on_texts(X_text_train)

    max_length = text_max_length

    X_text_train_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_train), maxlen=max_length)
    X_text_test_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_test), maxlen=max_length)

    return X_num_cat_train, X_text_train_padded, y_train, X_num_cat_test, X_text_test_padded, y_test


import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, Flatten, concatenate, Dropout

def build_model(neurons, dropout, optimizer, text_tokenizer_num_words, embedding_output_dim, text_max_length, num_cat_dim):
    # Text branch: a trainable embedding learned from scratch (its size is a sweep
    # parameter, so no pretrained matrix is loaded) followed by an LSTM with neurons[0] units
    text_input = Input(shape=(text_max_length,), name='text_input')
    text_embedding = Embedding(
        input_dim=text_tokenizer_num_words,
        output_dim=embedding_output_dim
    )(text_input)
    lstm_output = LSTM(neurons[0])(text_embedding)  # last hidden state only

    # Numerical + one-hot encoded categorical branch
    num_cat_input = Input(shape=(num_cat_dim,), name='num_cat_input')

    hidden = concatenate([lstm_output, num_cat_input])

    # The remaining entries of `neurons` are the sizes of the dense hidden layers
    for n in neurons[1:]:
        hidden = Dense(n, activation='relu')(hidden)
        hidden = Dropout(dropout)(hidden)

    # Output layer: a single linear unit for the predicted price
    output_layer = Dense(1)(hidden)

    # Create the model
    model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)

    # Compile the model
    model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mae'])

    return model

# def build_model(neurons, dropout, optimizer,text_tokenizer_num_words,embedding_output_dim):
#     text_input = Input(shape=(100,), name='text_input')
#     num_cat_input = Input(shape=(neurons[0],), name='num_cat_input')  # assuming the first layer neuron count for input shape

#     text_embedding = Embedding(input_dim=text_tokenizer_num_words, output_dim=embedding_output_dim)(text_input)
#     text_flatten = Flatten()(text_embedding)

#     combined_input = concatenate([text_flatten, num_cat_input])

#     hidden_layer = Dense(neurons[0], activation='relu')(combined_input)
#     hidden_dropout = Dropout(dropout)(hidden_layer)

#     for n in neurons[1:]:
#         hidden_layer = Dense(n, activation='relu')(hidden_dropout)
#         hidden_dropout = Dropout(dropout)(hidden_layer)

#     output_layer = Dense(1)(hidden_dropout)  # Assuming a regression task

#     model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)
#     model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mae'])

#     return model


In [None]:
import pprint

sweep_config = {
    'name': 'sweep_example',
    'method': 'grid',
    'metric': {
        'name': 'val_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'dropout': {
            'value': 0.1
        },
        'neurons': {
            'values': [[32, 2], [64, 32, 2]]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'imputer_strategy_numeric': {
            'values': ['mean', 'median', 'most_frequent', 'constant']
        },
        'imputer_strategy_categorical': {
            'values': ['most_frequent', 'constant']
        },
        'text_tokenizer_num_words': {
            'values': [5000, 10000, 15000]
        },
        'text_max_length': {
            'values': [50, 100, 150]
        },
        'embedding_output_dim': {
            'values': [8, 16, 32]
        }
    }
}

pprint.pprint(sweep_config)

{'method': 'grid',
 'metric': {'goal': 'minimize', 'name': 'val_loss'},
 'name': 'sweep_example',
 'parameters': {'dropout': {'value': 0.1},
                'embedding_output_dim': {'values': [8, 16, 32]},
                'imputer_strategy_categorical': {'values': ['most_frequent',
                                                            'constant']},
                'imputer_strategy_numeric': {'values': ['mean',
                                                        'median',
                                                        'most_frequent',
                                                        'constant']},
                'neurons': {'values': [[32, 2], [64, 32, 2]]},
                'optimizer': {'values': ['adam', 'sgd']},
                'text_max_length': {'values': [50, 100, 150]},
                'text_tokenizer_num_words': {'values': [5000, 10000, 15000]}}}


In [None]:
def train():
    with wandb.init() as run:
        config = run.config
        (X_num_cat_train, X_text_train_padded, y_train,
         X_num_cat_test, X_text_test_padded, y_test) = preprocess_data(
            df, config.imputer_strategy_numeric, config.imputer_strategy_categorical,
            config.text_tokenizer_num_words, config.text_max_length)
        # The input sizes must match the data produced for this particular run
        model = build_model(config.neurons, config.dropout, config.optimizer,
                            config.text_tokenizer_num_words, config.embedding_output_dim,
                            config.text_max_length, X_num_cat_train.shape[1])
        wandb_callback = wandb.keras.WandbCallback()
        model.fit([X_text_train_padded, X_num_cat_train], y_train, validation_data=([X_text_test_padded, X_num_cat_test], y_test), epochs=10, batch_size=32, callbacks=[wandb_callback])





In [None]:
sweep_id = wandb.sweep(sweep_config, project=project, entity=entity)
wandb.agent(sweep_id, function=train, count=10, project=project, entity=entity)

Create sweep with ID: dnybgb90
Sweep URL: https://wandb.ai/marian-ai/obligatorio_dl/sweeps/dnybgb90

wandb: Agent Starting Run: eqlo005i with config:
wandb: 	dropout: 0.1
wandb: 	embedding_output_dim: 8
wandb: 	imputer_strategy_categorical: most_frequent
wandb: 	imputer_strategy_numeric: mean
wandb: 	neurons: [32, 2]
wandb: 	optimizer: adam
wandb: 	text_max_length: 50
wandb: 	text_tokenizer_num_words: 5000
wandb: Currently logged in as: mariano-chicatun (marian-ai). Use `wandb login --relogin` to force relogin

wandb: Ctrl + C detected. Stopping sweep.


# Previous stage

In [None]:


# Numerical preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Categorical preprocessing pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)])

# Fit the preprocessor on the full dataset and transform it
df_preprocessed = preprocessor.fit_transform(df)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_text_train)
max_length = 100
X_text_train_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_train), maxlen=max_length)
X_text_test_padded = pad_sequences(tokenizer.texts_to_sequences(X_text_test), maxlen=max_length)

X_num_cat_train = preprocessor.fit_transform(X_train)
X_num_cat_test = preprocessor.transform(X_test)

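Fitting the preprocessor on `X_train` and only calling `transform` on `X_test` keeps test-set statistics out of the imputation and scaling steps. A minimal sketch of that idea for a single numeric column, written in plain Python instead of scikit-learn (the helper names `fit_numeric` and `transform_numeric` and the toy values are illustrative assumptions, not part of the notebook):

```python
import math
import statistics


def fit_numeric(column):
    """Learn the imputation mean and the scaling std from training values only."""
    observed = [v for v in column if not math.isnan(v)]
    mean = statistics.fmean(observed)
    imputed = [mean if math.isnan(v) else v for v in column]
    # Population standard deviation (ddof=0); fall back to 1.0 for constant columns
    std = statistics.pstdev(imputed) or 1.0
    return mean, std


def transform_numeric(column, mean, std):
    """Apply statistics learned on the training data to any split."""
    imputed = [mean if math.isnan(v) else v for v in column]
    return [(v - mean) / std for v in imputed]


train = [1.0, 2.0, float("nan"), 3.0]
test = [float("nan"), 10.0]

mean, std = fit_numeric(train)                       # statistics come from train only
train_scaled = transform_numeric(train, mean, std)
test_scaled = transform_numeric(test, mean, std)     # test reuses the train statistics
```

The missing test value is filled with the training mean, so it maps to 0.0 after scaling; the test split never influences `mean` or `std`.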

In [None]:
df[text_cols].dtypes

### Preprocess the text data
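The tokenization step pads every listing to `max_length = 100` token ids with `pad_sequences`. As a rough sketch of what that call does under the Keras defaults (`padding='pre'`, `truncating='pre'`, pad value 0), here is a plain-Python equivalent; `pad_pre` is a hypothetical helper written for illustration and is not the Keras implementation:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                     # pre-truncation: drop tokens from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pre-padding: zeros go first
    return padded


example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

With these defaults, long descriptions lose their first tokens rather than their last ones, which matters if the most informative text tends to come at the start of a listing.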

### Combine all the data

In [None]:
print(len(X_text_train_padded), len(X_num_cat_train), len(y_train))
print(np.any(np.isnan(X_text_train_padded)), np.any(np.isnan(X_num_cat_train)), np.any(np.isnan(y_train)))


In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, concatenate
from tensorflow.keras.models import Model

# Assume X_num_cat_train and X_num_cat_test are your numerical and categorical data split into training and test sets
# X_num_cat_train, X_num_cat_test, y_train, y_test = train_test_split(df_preprocessed, df['Price'], test_size=0.2, random_state=0)

# Separate input layers
text_input = Input(shape=(100,), name='text_input')
num_cat_input = Input(shape=(X_num_cat_train.shape[1],), name='num_cat_input')

# Text data path
text_embedding = Embedding(input_dim=5000, output_dim=16)(text_input)
text_flatten = Flatten()(text_embedding)

# Combine the processing paths
combined_input = concatenate([text_flatten, num_cat_input])

# Continue with your model
hidden_layer = Dense(128, activation='relu')(combined_input)
output_layer = Dense(1)(hidden_layer)  # Assuming a regression task

model = Model(inputs=[text_input, num_cat_input], outputs=output_layer)
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])
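To get a feel for the size of this two-branch architecture, the trainable parameters can be counted by hand. The sketch below assumes a hypothetical 50 columns coming out of the tabular preprocessor (the real width is `X_num_cat_train.shape[1]`, which is not shown here); `count_params` is an illustrative helper, not a Keras function:

```python
def count_params(vocab_size, embed_dim, seq_len, n_tabular, hidden_units):
    """Count trainable parameters of: Embedding -> Flatten -> concat -> Dense -> Dense(1)."""
    embedding = vocab_size * embed_dim                    # one vector per token id
    concat_width = seq_len * embed_dim + n_tabular        # flattened text + tabular features
    hidden = concat_width * hidden_units + hidden_units   # weights + biases
    output = hidden_units + 1                             # weights + bias of the regression head
    return {
        "embedding": embedding,
        "hidden": hidden,
        "output": output,
        "total": embedding + hidden + output,
    }


# n_tabular=50 is an assumed value used only for this example
params = count_params(vocab_size=5000, embed_dim=16, seq_len=100,
                      n_tabular=50, hidden_units=128)
print(params)
```

Because the embedding is flattened, the first dense layer sees 1,600 text inputs, so it holds most of the weights outside the embedding table itself.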



### 4.2 Split the data into training and test sets

### 4.3 Define the model

### 4.4 Train

In [None]:
history = model.fit([X_text_train_padded, X_num_cat_train], y_train, epochs=10, batch_size=32, validation_split=0.2)
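With `validation_split=0.2`, Keras holds out the last 20% of the samples passed to `fit`, taken before any shuffling. The sketch below mimics that index arithmetic in plain Python; `keras_style_split` is a hypothetical helper for illustration only:

```python
import math


def keras_style_split(n_samples, validation_split):
    """Return (train_indices, val_indices): validation is the tail of the data."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return list(range(split_at)), list(range(split_at, n_samples))


train_idx, val_idx = keras_style_split(n_samples=10, validation_split=0.2)
print(train_idx, val_idx)
```

If the rows of the training arrays are ordered in some meaningful way (for example by city or by price), the tail may not be representative, so shuffling the data beforehand or passing an explicit `validation_data` is worth considering.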


### 4.5 Evaluate on the test set

In [None]:
loss, mae = model.evaluate([X_text_test_padded, X_num_cat_test], y_test, verbose=0)
print(f'Test Loss: {loss}')
print(f'Test MAE: {mae}')
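The model is trained with mean squared error but the competition is scored with mean absolute error, and the two react very differently to a few badly predicted listings. A small sketch with made-up prices (the numbers are illustrative assumptions, not results from this dataset):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Three small misses and one large miss on an expensive listing
y_true = [100.0, 150.0, 200.0, 1000.0]
y_pred = [110.0, 140.0, 210.0, 600.0]

mae_value = mean_absolute_error(y_true, y_pred)
mse_value = mean_squared_error(y_true, y_pred)
print(mae_value, mse_value)
```

In this toy case the single large error accounts for almost the entire MSE, so a model trained on MSE is pushed hard to fit expensive outliers. Training directly on `'mean_absolute_error'`, or on a log-transformed price, are alternatives worth trying when the target metric is MAE.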

## 5 Generating the output for the Kaggle competition

In [None]:
file_path2 = './obligatorio_DL/private_data_to_predict.csv'
data_for_kaggle = pd.read_csv(file_path2)

In [None]:
# text_cols.remove('combined_text')

In [None]:
print(text_cols)

In [None]:
X_num_cat_kaggle = preprocessor.transform(data_for_kaggle)
# Join the text columns of each row into a single string (cast each value to str first)
data_for_kaggle['all_text'] = data_for_kaggle[text_cols].apply(lambda x: ' '.join(x.astype(str)), axis=1)

X_text_kaggle_sequences = tokenizer.texts_to_sequences(data_for_kaggle['all_text'])
X_text_kaggle_padded = pad_sequences(X_text_kaggle_sequences, maxlen=max_length)

kaggle_results = model.predict([X_text_kaggle_padded, X_num_cat_kaggle])
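When several text fields are merged into one string per row, `' '.join(...)` has to receive the individual values; if it receives the string representation of the whole row, it puts a space between every character. A small sketch of the difference, using a plain list as a stand-in for one DataFrame row (the example values are made up):

```python
row = ["Cozy loft", "Near the beach"]

# Joining the repr of the whole row: iterates character by character
char_split = " ".join(str(row))

# Joining the values themselves: one space between fields
joined = " ".join(str(value) for value in row)

print(char_split)
print(joined)
```

A tokenizer fitted on normal text would map the character-split version almost entirely to unknown or single-letter tokens, so the text branch of the model would receive close to no signal.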


In [None]:
# Build the submission: one row per listing with its id and predicted price.
# A separate name is used so the original `df` is not overwritten, and ids keep their dtype.
submission = pd.DataFrame({
    'id': data_for_kaggle['id'].to_numpy(),
    'expected': kaggle_results.ravel(),
})
submission['expected'] = submission['expected'].fillna(0)
submission.to_csv("output_to_submit.csv", index=False)

# With WandB

In [None]:
import pprint

sweep_config = {
    'name': 'sweep_example',
    'method': 'grid',
    'metric': {
        'name': 'val_loss',
        'goal': 'minimize'
    },
    'parameters': {
        'dropout': {'value': 0.1},
        'neurons': {'values': [[32, 2], [64, 32, 2]]},
        'optimizer': {'values': ['adam', 'sgd']},
        'imputer_strategy': {'values': ['mean', 'median', 'most_frequent']},  # Imputer strategies
        'scaler': {'values': ['standard', 'minmax']},  # Scaler options: standard scaler or minmax scaler
        'text_embedding_dim': {'values': [16, 32]},  # Embedding dimensions for text data
    }
}

pprint.pprint(sweep_config)
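A `grid` sweep tries the Cartesian product of all listed values, so the number of runs grows quickly as parameters are added. The sketch below expands the same parameter block with `itertools.product` to count the combinations; `expand_grid` is a hypothetical helper for illustration and is not part of the wandb API:

```python
import itertools

parameters = {
    'dropout': {'value': 0.1},
    'neurons': {'values': [[32, 2], [64, 32, 2]]},
    'optimizer': {'values': ['adam', 'sgd']},
    'imputer_strategy': {'values': ['mean', 'median', 'most_frequent']},
    'scaler': {'values': ['standard', 'minmax']},
    'text_embedding_dim': {'values': [16, 32]},
}


def expand_grid(parameters):
    """List every configuration a grid search over `parameters` would visit."""
    names = list(parameters)
    # A fixed 'value' behaves like a list with a single choice
    choices = [spec['values'] if 'values' in spec else [spec['value']]
               for spec in parameters.values()]
    return [dict(zip(names, combo)) for combo in itertools.product(*choices)]


grid = expand_grid(parameters)
print(len(grid))
print(grid[0])
```

This grid has 48 configurations. The agent cell further down passes `count=10`, which should cap the number of runs that agent executes, so only part of the grid would be explored unless the count is raised or the agent is launched again.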

In [None]:
import sys
import traceback
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def run_train():
    try:
        with wandb.init(config=None, project=project, entity=entity):
            # initialize model
            config = wandb.config
            print(config)

            # Configure preprocessing based on sweep config
            numeric_transformer = Pipeline(steps=[
                ('imputer', SimpleImputer(strategy=config.imputer_strategy)),
                ('scaler', StandardScaler() if config.scaler == 'standard' else MinMaxScaler())])

            preprocessor = ColumnTransformer(
                transformers=[('num', numeric_transformer, numerical_cols)])

            X_num_cat_train = preprocessor.fit_transform(X_train)
            X_num_cat_test = preprocessor.transform(X_test)

            # Reset Keras global state before building this run's model
            tf.keras.backend.clear_session()
            model = get_model(config.neurons, config.optimizer, config.dropout, config.text_embedding_dim)
            wandb_callback = wandb.keras.WandbCallback()
            model.fit([X_text_train_padded, X_num_cat_train], y_train,
                      epochs=5, batch_size=128, validation_split=0.2,
                      callbacks=[wandb_callback], max_queue_size=3, workers=2)

    except Exception:
        # Print the traceback to stderr and exit with a non-zero code so wandb records the failure
        traceback.print_exc()
        sys.exit(1)

In [None]:
sweep_id = wandb.sweep(sweep_config, project=project, entity=entity)
wandb.agent(sweep_id, function=run_train, count=10, project=project, entity=entity)
