<a href="https://colab.research.google.com/github/SethuGopalan/Listing/blob/master/ListingPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from scipy.sparse import csr_matrix


In [None]:
# Load the dataset
data = pd.read_csv('/content/sample_data/listings.csv')

# Fill missing values
data['reviews_per_month'] = data['reviews_per_month'].fillna(0)
data['last_review'] = data['last_review'].fillna('2020-01-01')

# Select features and target
features = ['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type',
            'minimum_nights', 'number_of_reviews', 'reviews_per_month',
            'calculated_host_listings_count', 'availability_365']
target = 'price'

X = data[features]
y = data[target]

# Preprocess features
numerical_features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
                      'reviews_per_month', 'calculated_host_listings_count', 'availability_365']
categorical_features = ['neighbourhood_group', 'neighbourhood', 'room_type']

# Standardize numerical features:
# StandardScaler from sklearn.preprocessing rescales each numerical column to a mean of 0
# and a standard deviation of 1. Putting columns such as latitude and number_of_reviews on a
# comparable scale generally helps gradient-based models, like the neural network below, train more stably.
numerical_transformer = StandardScaler()

# One-hot encode categorical features:
# OneHotEncoder from sklearn.preprocessing turns each categorical column into one binary column
# per category: 1 where the row belongs to that category, 0 otherwise.
# With handle_unknown='ignore', a category that shows up at transform time but was not seen
# during fit is encoded as all zeros instead of raising an error.
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Create a preprocessor using ColumnTransformer:
# ColumnTransformer applies a different transformer to each group of columns and
# concatenates the results into a single feature matrix.
# num: StandardScaler, applied to the numerical features.
# cat: OneHotEncoder, applied to the categorical features.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply the preprocessor to the data:
# fit_transform(X) learns the scaling statistics and the category lists from X, then returns
# the standardized numerical columns followed by the one-hot encoded categorical columns.
# Note that this fits on every row, including rows that later end up in the test set.
X_processed = preprocessor.fit_transform(X)

# Names and arguments used by the train_test_split call in a later cell:
# X_train: The feature matrix for the training set.
# X_test: The feature matrix for the testing set.
# y_train: The target vector for the training set.
# y_test: The target vector for the testing set.
# X_processed: The preprocessed feature matrix (inputs), built from all the columns used to make predictions.
# y: The target vector (output), here the price column to predict.
# test_size=0.2: The proportion of the dataset to include in the test split. Here, 0.2 means 20% of
# the data is used for testing and the remaining 80% for training.
# random_state=42: A seed for the random number generator. Fixing it makes the split reproducible:
# every run with the same seed and the same data produces the same split.

# Example scenario:
# With a dataset of 100 rows, train_test_split shuffles the rows and splits them into two sets:

# Training set: 80 rows (80% of the data) – X_train and y_train
# Testing set: 20 rows (20% of the data) – X_test and y_test
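
The preprocessing steps described in the comments above can be checked on a tiny example. The sketch below uses made-up rows (the values and the "Hotel room" category are assumptions, not taken from listings.csv) and a separate `toy_preprocessor`. Unlike the cell above, it fits on the training rows only and then transforms a test row, which is one way to see what handle_unknown='ignore' does with an unseen category.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows, for illustration only.
train = pd.DataFrame({
    "minimum_nights": [1.0, 2.0, 3.0, 6.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})
test = pd.DataFrame({
    "minimum_nights": [3.0],
    "room_type": ["Hotel room"],  # category never seen during fit
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["minimum_nights"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array so it is easy to inspect
)

train_out = toy_preprocessor.fit_transform(train)  # learn mean/std and categories
test_out = toy_preprocessor.transform(test)        # reuse what was learned

scaled = train_out[:, 0]
print("scaled mean:", scaled.mean())
print("scaled std: ", scaled.std())
print("one-hot columns for the unseen category:", test_out[0, 1:])
```

The first output column is the standardized number; the remaining columns are the one-hot indicators, one per category seen during fit.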


In [None]:
if isinstance(X_processed, csr_matrix):
    X_processed = X_processed.toarray()

In [None]:
y = y.to_numpy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
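
The 100-row scenario from the earlier comments can be reproduced directly. This is a minimal sketch on stand-in arrays (`X_demo` and `y_demo` are hypothetical names, not part of the notebook); it checks the 80/20 sizes, that the same random_state gives the same split, and that features and targets stay aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features; row i starts with the value 3 * i.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same test rows.
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(y_te, y_te_again))

# Each feature row still matches its target after shuffling.
print(np.array_equal(X_te[:, 0], 3 * y_te))
```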


When you combine StandardScaler and OneHotEncoder output in a ColumnTransformer, the result can come back as a sparse matrix. A sparse matrix is one in which most of the elements are zero, which is typical after one-hot encoding a column with many categories, such as neighbourhood. Sparse formats store only the nonzero entries, so they save memory and computation on large datasets with many zeros.

Check if the data is sparse:
if isinstance(X_processed, csr_matrix):
This line checks whether X_processed is an instance of the csr_matrix class. csr_matrix stands for Compressed Sparse Row matrix, a specific sparse matrix type provided by the SciPy library.

Convert the sparse matrix to a dense array:
X_processed = X_processed.toarray()
If X_processed is a csr_matrix, this line converts it to a dense array (also known as a dense matrix): a standard NumPy array in which all values, including zeros, are stored explicitly.
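
To make the memory trade-off concrete, here is a small sketch that builds a one-hot-like matrix by hand and compares the two storage formats. The matrix size (1000 rows, 200 categories) and the variable names are assumptions chosen for illustration; the actual shape of X_processed depends on the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_rows, n_categories = 1000, 200  # hypothetical sizes
rng = np.random.default_rng(0)

# One-hot-like dense matrix: exactly one 1.0 per row, zeros elsewhere.
dense = np.zeros((n_rows, n_categories))
dense[np.arange(n_rows), rng.integers(0, n_categories, size=n_rows)] = 1.0

sparse = csr_matrix(dense)

# CSR keeps three arrays: the nonzero values, their column indices, and row pointers.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print("dense bytes: ", dense.nbytes)
print("sparse bytes:", sparse_bytes)

# Same check-and-convert pattern as in the notebook cell.
restored = sparse.toarray() if isinstance(sparse, csr_matrix) else sparse
print("round trip matches:", np.array_equal(restored, dense))
```

Converting to dense gives up that saving, so it is a reasonable choice only when the dense array fits comfortably in memory.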

In [None]:
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)


In [None]:
# Build a simple neural network model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),  # one input per preprocessed feature column
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)  # Output layer for regression
])

# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Train the model
model.fit(train_dataset, epochs=50)

# Evaluate the model
loss, mae = model.evaluate(test_dataset)
print("Test Loss:", loss)
print("Test MAE:", mae)

# Make predictions
predictions = model.predict(test_dataset)
print("Predictions:", predictions[:5])  # Print first 5 predictions


Epoch 1/50
...
Epoch 50/50
Test Loss: 71138.3671875
Test MAE: 76.11277770996094
Predictions: [[116.79138]
 [173.43144]
 [153.9025 ]
 [276.8338 ]
 [211.07025]]
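
The two reported numbers are in different units: the loss is mean squared error (squared price units), while MAE is in price units. The sketch below shows how the two metrics are computed on made-up prices (the `y_true` and `y_pred` values are hypothetical) and takes the square root of the reported test loss so it can be compared with the MAE.

```python
import math
import numpy as np

# Hypothetical prices and predictions; the last row is a large miss.
y_true = np.array([100.0, 150.0, 200.0, 1000.0])
y_pred = np.array([110.0, 140.0, 210.0, 400.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # what loss='mse' measures
mae = np.mean(np.abs(errors))   # what metrics=['mae'] measures
rmse = math.sqrt(mse)           # back in price units
print("toy MSE:", mse, "toy MAE:", mae, "toy RMSE:", rmse)

# Square root of the test loss printed above.
reported_rmse = math.sqrt(71138.3671875)
print("RMSE from reported test loss:", reported_rmse)
```

The reported loss corresponds to an RMSE of roughly 267, well above the MAE of about 76. As in the toy data, a gap like that usually means a small number of listings with very large errors dominate the squared loss; checking the price column for extreme values would be a natural next step.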
