# Model Training using Maximum Likelihood Estimation (MLE)

Now that we have completed **data preprocessing**, we’ll train our first regression model using the **Maximum Likelihood Estimation (MLE)** principle.

---

### What is MLE in Linear Regression?

For a dataset with Gaussian noise, the **Ordinary Least Squares (OLS)** solution in Linear Regression is equivalent to the **Maximum Likelihood Estimate (MLE)**.

Mathematically, the linear model assumes that each observation \(i\) satisfies:

\[
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_n x_{in} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
\]

The model parameters \(\beta\) are chosen to **maximize the likelihood** of observing the given data.  
Under the Gaussian noise assumption, this is equivalent to minimizing the **sum of squared errors (SSE)**, which is exactly the objective that Linear Regression (OLS) minimizes.
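
To see why, write \(\hat{y}_i\) for the model's prediction for observation \(i\) and \(m\) for the number of observations (both symbols are introduced here for illustration). The Gaussian log-likelihood is

\[
\log L(\beta, \sigma^2) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2
\]

Only the second term depends on \(\beta\), so maximizing the likelihood over \(\beta\) means minimizing the SSE. The sketch below checks this numerically on synthetic data (not the hotel dataset). The names `X_demo`, `y_demo`, and `gaussian_log_likelihood` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows the linear-Gaussian model (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

def gaussian_log_likelihood(intercept, coef, sigma2):
    """Log-likelihood of y_demo under y = intercept + X·coef + N(0, sigma2) noise."""
    residuals = y_demo - (intercept + X_demo @ coef)
    m = len(y_demo)
    return -0.5 * m * np.log(2 * np.pi * sigma2) - residuals @ residuals / (2 * sigma2)

# OLS fit; the MLE of the noise variance is SSE / m
ols = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - ols.predict(X_demo)
sigma2_mle = residuals @ residuals / len(y_demo)

ll_at_ols = gaussian_log_likelihood(ols.intercept_, ols.coef_, sigma2_mle)

# Nudging the coefficients away from the OLS solution should never raise the likelihood
ll_perturbed = [
    gaussian_log_likelihood(
        ols.intercept_, ols.coef_ + rng.normal(scale=0.05, size=3), sigma2_mle
    )
    for _ in range(100)
]

print("Log-likelihood at OLS solution:", round(ll_at_ols, 2))
print("Best log-likelihood among perturbed fits:", round(max(ll_perturbed), 2))
```

Every perturbed fit has a larger SSE than the OLS fit, so its log-likelihood is lower.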

---

We’ll now:
1. Load the preprocessed dataset  
2. Split it into training and testing sets  
3. Train a Linear Regression model (MLE)  
4. Evaluate its performance  
5. Prepare a Kaggle submission file


In [None]:
# Step 1: Upload preprocessed CSV file (Option 1: manual upload)

from google.colab import files
import pandas as pd

# Upload the preprocessed file from your local system
uploaded = files.upload()  # Choose "hotel_data_preprocessed.csv" when prompted


Saving hotel_data_preprocessed.csv to hotel_data_preprocessed (1).csv


### Step 2: Load and Inspect the Preprocessed Data
We’ll now read the uploaded file into a DataFrame and inspect its shape and first few rows.


In [None]:
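# Step 2: Load the uploaded preprocessed dataset
# Note: Colab saved this upload as "hotel_data_preprocessed (1).csv" (see the message above),
# so the path below reads the earlier copy. Point read_csv at the new name if the files differ.

df = pd.read_csv("hotel_data_preprocessed.csv")

print("File successfully loaded!")
print("Shape of preprocessed data:", df.shape)
df.head(3)


File successfully loaded!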
Shape of preprocessed data: (1104, 75)


Unnamed: 0,Id,PropertyClass,ZoningCategory,RoadAccessLength,LandArea,RoadType,PlotShape,LandElevation,UtilityAccess,PlotConfiguration,...,EnclosedVerandaArea,SeasonalPorchArea,ScreenPorchArea,SwimmingPoolArea,ExtraFacilityValue,MonthSold,YearSold,DealType,DealCondition,HotelValue
0,775,20,RL,110.0,14226,Pave,Reg,Lvl,AllPub,Corner,...,0,0,0,0,0,7,2007,New,Partial,348515.0
1,673,20,RL,80.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,6,2006,WD,Normal,165000.0
2,234,20,RL,75.0,10650,Pave,Reg,Lvl,AllPub,Corner,...,0,0,0,0,0,2,2010,WD,Normal,128200.0


### Step 3: Split Features and Target Variable
We'll separate the feature matrix (**X**) from the target variable (**y**).  
If `Log_HotelValue` exists (from preprocessing), we'll use it as the target, since the log transform reduces skewness.


In [None]:
# Step 3: Split data into features and target

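target_col = 'Log_HotelValue' if 'Log_HotelValue' in df.columns else 'HotelValue'

# Drop both versions of the target so the raw value cannot leak into the features
X = df.drop(columns=['HotelValue', 'Log_HotelValue'], errors='ignore')
y = df[target_col]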

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)


Feature matrix shape: (1104, 74)
Target vector shape: (1104,)


### Step 4: Train-Test Split
We'll hold out 20% of the rows for testing and train on the remaining 80%, fixing `random_state` so the split is reproducible.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)


Training data shape: (883, 74)
Testing data shape: (221, 74)


### ⚙️ Step 5: Train Linear Regression (MLE)
Linear Regression from `scikit-learn` fits parameters using **Ordinary Least Squares (OLS)**,  
which is mathematically the same as **Maximum Likelihood Estimation (MLE)** under Gaussian noise.


### Fix: Encode Any Remaining Categorical Columns

Fitting the model on the current `X_train` fails with `ValueError: could not convert string to float`, which means some columns still contain text (for example `ZoningCategory` and `RoadType`).

We'll identify all **object (string)** columns and apply **one-hot encoding** (`pd.get_dummies`) so that the model receives purely numeric inputs.


In [None]:
# Identify categorical columns
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", cat_cols)

# Apply one-hot encoding
X_encoded = pd.get_dummies(X, columns=cat_cols, drop_first=True)

print("Encoding complete!")
print("New shape after encoding:", X_encoded.shape)


Categorical columns to encode: ['ZoningCategory', 'RoadType', 'PlotShape', 'LandElevation', 'UtilityAccess', 'PlotConfiguration', 'LandSlope', 'District', 'NearbyTransport1', 'NearbyTransport2', 'PropertyType', 'HotelStyle', 'RoofDesign', 'RoofMaterial', 'ExteriorPrimary', 'ExteriorSecondary', 'ExteriorQuality', 'ExteriorCondition', 'FoundationType', 'BasementHeight', 'BasementCondition', 'BasementExposure', 'BasementFacilityType1', 'BasementFacilityType2', 'HeatingType', 'HeatingQuality', 'CentralAC', 'ElectricalSystem', 'KitchenQuality', 'PropertyFunctionality', 'ParkingType', 'ParkingFinish', 'ParkingQuality', 'ParkingCondition', 'DrivewayType', 'DealType', 'DealCondition']
Encoding complete!
New shape after encoding: (1104, 222)
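
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The three rows reuse the `LandArea` and `PlotShape` values shown in `df.head(3)` above; the variable names are illustrative. With every category kept, the dummy columns of a feature always sum to 1, which duplicates the intercept; dropping the first (alphabetically sorted) category removes that redundancy and makes it the baseline.

```python
import pandas as pd

# Toy frame: one numeric and one categorical column
toy = pd.DataFrame({
    "LandArea": [14226, 11250, 10650],
    "PlotShape": ["Reg", "IR1", "Reg"],
})

# Keep every category: one dummy column per level
full = pd.get_dummies(toy, columns=["PlotShape"])

# Drop the first level ("IR1"), which becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["PlotShape"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with all dummy columns equal to 0 in `reduced` therefore means "baseline category", not "missing".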


### Re-run Train-Test Split and Model Training
Now that all features are numeric, we can redo the split on the encoded data and fit the MLE (Linear Regression) model.


In [None]:
# Recreate train-test split using encoded data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# Train the MLE model (Linear Regression)
mle_model = LinearRegression()
mle_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = mle_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MLE Model Performance (After Encoding):")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.3f}")


MLE Model Performance (After Encoding):
Mean Squared Error: 1318606035.41
R² Score: 0.704
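
MSE is expressed in squared target units, which makes a value like 1318606035.41 hard to interpret. Taking the square root gives the RMSE, which is in the same units as `HotelValue`. A minimal sketch using the number printed above (inside the notebook, `np.sqrt(mse)` on the live variable would do the same):

```python
import math

# MSE reported above for the 20% hold-out split
mse_holdout = 1318606035.41

# RMSE is in the same units as the target (HotelValue)
rmse_holdout = math.sqrt(mse_holdout)

print(f"Hold-out RMSE: {rmse_holdout:,.0f}")
```

That is roughly 36,300, so a typical prediction on the hold-out set is off by an amount of that order.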


In [None]:
import pandas as pd

# Load the uploaded test data
test_df = pd.read_csv("test.csv")

# Check first few rows
test_df.head()


Unnamed: 0,Id,PropertyClass,ZoningCategory,RoadAccessLength,LandArea,RoadType,ServiceLaneType,PlotShape,LandElevation,UtilityAccess,...,ScreenPorchArea,SwimmingPoolArea,PoolQuality,BoundaryFence,ExtraFacility,ExtraFacilityValue,MonthSold,YearSold,DealType,DealCondition
0,893,20,RL,70.0,8414,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,,0,2,2006,WD,Normal
1,1106,60,RL,98.0,12256,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,4,2010,WD,Normal
2,414,30,RM,56.0,8960,Pave,Grvl,Reg,Lvl,AllPub,...,0,0,,,,0,3,2010,WD,Normal
3,523,50,RM,50.0,5000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,10,2006,WD,Normal
4,1037,20,RL,89.0,12898,Pave,,IR1,HLS,AllPub,...,0,0,,,,0,9,2009,WD,Normal


In [None]:
# Load the preprocessed training data (the one you uploaded earlier)
train_df = pd.read_csv("hotel_data_preprocessed.csv")

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)


Train shape: (1104, 75)
Test shape: (260, 80)


In [None]:
# Separate target variable and features
y_train = train_df['HotelValue']  # use 'Log_HotelValue' if you had applied log earlier
X_train = train_df.drop(columns=['Id', 'HotelValue', 'Log_HotelValue'], errors='ignore')

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)


X_train shape: (1104, 73)
y_train shape: (1104,)


In [None]:
# One-hot encode both train and test sets
train_encoded = pd.get_dummies(X_train, drop_first=True)
test_encoded = pd.get_dummies(test_df.drop(columns=['Id'], errors='ignore'), drop_first=True)

# Align both encoded dataframes to have the same columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

print("Train encoded shape:", train_encoded.shape)
print("Test encoded shape:", test_encoded.shape)


Train encoded shape: (1104, 221)
Test encoded shape: (260, 221)
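
One caveat worth checking with this approach: because train and test are encoded separately with `drop_first=True`, the dropped baseline category is chosen per DataFrame. If a category is absent from the test set, a different level can be dropped there, and `align` then fills the missing column with 0. The toy sketch below (hypothetical `road` column and values) shows the effect and one possible remedy, fixing the category list from the training data with `pd.CategoricalDtype` before encoding. It is an illustration under these assumptions, not the notebook's method.

```python
import pandas as pd

train = pd.DataFrame({"area": [10, 20, 30], "road": ["Grvl", "Pave", "Pave"]})
test = pd.DataFrame({"area": [15, 25], "road": ["Pave", "Pave"]})  # "Grvl" never appears

# Separate encoding: in test, "Pave" is the only (hence first) level and gets dropped
naive_train = pd.get_dummies(train, drop_first=True)
naive_test = pd.get_dummies(test, drop_first=True)
naive_train, naive_test = naive_train.align(
    naive_test, join="left", axis=1, fill_value=0
)

# Shared categories: both frames use the level list seen in training
road_dtype = pd.CategoricalDtype(categories=sorted(train["road"].unique()))
fixed_train = pd.get_dummies(train.astype({"road": road_dtype}), drop_first=True)
fixed_test = pd.get_dummies(test.astype({"road": road_dtype}), drop_first=True)

print("Naive test encoding of road_Pave:", naive_test["road_Pave"].astype(int).tolist())
print("Fixed test encoding of road_Pave:", fixed_test["road_Pave"].astype(int).tolist())
```

In the naive version the `Pave` rows in the test set end up looking like the `Grvl` baseline. With shared categories they are encoded the same way as in training. Test values that were never seen in training become missing under this scheme and are encoded as all zeros.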


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the MLE model (Linear Regression)
mle_model = LinearRegression()

# Fit the model on training data
mle_model.fit(train_encoded, y_train)

# Predict on the training data (in-sample evaluation, so these scores are optimistic)
y_pred_train = mle_model.predict(train_encoded)

# Evaluate model performance
mse = mean_squared_error(y_train, y_pred_train)
r2 = r2_score(y_train, y_pred_train)

print("📈 MLE Model Performance on Training Data:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.4f}")


📈 MLE Model Performance on Training Data:
Mean Squared Error (MSE): 275846869.42
R² Score: 0.9371
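
The training R² of 0.9371 is noticeably higher than the 0.704 measured on the hold-out split earlier, which is expected when a model is scored on the same rows it was fitted to. K-fold cross-validation is one common way to estimate out-of-sample performance while still using all rows. The sketch below uses synthetic stand-in data from `make_regression` (an assumption for illustration; the sizes are arbitrary).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the encoded training data (not the hotel dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=100, n_informative=10, noise=20.0, random_state=0
)

model = LinearRegression()

# In-sample score: fit and evaluate on the same rows
train_r2 = model.fit(X_demo, y_demo).score(X_demo, y_demo)

# Out-of-sample estimate: each fold is scored on rows the model did not see
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print(f"In-sample R²: {train_r2:.3f}")
print(f"5-fold CV R²: {cv_r2.mean():.3f} (std {cv_r2.std():.3f})")
```

In the notebook, the same call could be made with `train_encoded` and `y_train` in place of the synthetic arrays.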


In [None]:
# Check for missing values in test data
print("Missing values in test_encoded:", test_encoded.isnull().sum().sum())

# If any NaNs exist, fill them (recommended: use 0 or mean of columns)
test_encoded = test_encoded.fillna(0)
# Alternatively:
# test_encoded = test_encoded.fillna(test_encoded.mean())

# Now safe to predict
y_pred_test = mle_model.predict(test_encoded)

# Prepare submission DataFrame
submission = pd.DataFrame({
    "Id": test_df["Id"],
    "HotelValue": y_pred_test
})

submission.to_csv("submission.csv", index=False)
print("Submission file created successfully: submission.csv")
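
If the model is trained on `Log_HotelValue` instead, its predictions are on the log scale and need to be transformed back before they go into the submission file. The sketch below assumes the preprocessing step used `np.log1p` (this is not confirmed by the notebook; if a plain `np.log` was used, the inverse is `np.exp`). The three values are the `HotelValue` entries shown in `df.head(3)`.

```python
import numpy as np

# Sample targets on the original scale
hotel_values = np.array([348515.0, 165000.0, 128200.0])

# Assumed preprocessing transform: log(1 + value)
log_targets = np.log1p(hotel_values)

# Stand-in for model.predict(...) when the model was trained on the log target
log_predictions = log_targets

# Invert the transform so the submission is on the original HotelValue scale
restored = np.expm1(log_predictions)

print(restored)
```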
