# House Price Prediction - ML Workflow

This notebook loads and explores a house price dataset, then walks through a complete machine learning workflow:
1. **Data Preparation** - Creating features (X) and target (y)
2. **Train-Test Split** - Dividing data for training and evaluation
3. **Feature Scaling** - Standardizing features with StandardScaler
4. **Model Training** - Training a regression model
5. **Prediction & Evaluation** - Making predictions and assessing performance

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv("D:/1_1_DS_AI_Internship/data/house_price.csv")

# Display basic information about the dataset
print("=" * 50)
print("DATASET OVERVIEW")
print("=" * 50)
print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:\n{df.head()}")
print("\nData types and missing values:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print(f"\nBasic statistics:\n{df.describe()}")

DATASET OVERVIEW
Dataset shape: (4600, 18)

First few rows:
                  date      price  bedrooms  bathrooms  sqft_living  sqft_lot  \
0  2014-05-02 00:00:00   313000.0       3.0       1.50         1340      7912   
1  2014-05-02 00:00:00  2384000.0       5.0       2.50         3650      9050   
2  2014-05-02 00:00:00   342000.0       3.0       2.00         1930     11947   
3  2014-05-02 00:00:00   420000.0       3.0       2.25         2000      8030   
4  2014-05-02 00:00:00   550000.0       4.0       2.50         1940     10500   

   floors  waterfront  view  condition  sqft_above  sqft_basement  yr_built  \
0     1.5           0     0          3        1340              0      1955   
1     2.0           0     4          5        3370            280      1921   
2     1.0           0     0          4        1930              0      1966   
3     1.0           0     0          4        1000           1000      1963   
4     1.0           0     0          4        1140        

## 1. Data Preparation

In this step, we separate our data into:
- **Features (X)**: All numeric columns except 'price' (the target variable)
- **Target (y)**: Only the 'price' column

This separation is crucial for supervised learning as the model learns to predict `y` (price) based on `X` (features).

In [2]:
# Step 1: Select numeric columns only (to avoid errors with text data)
numeric_df = df.select_dtypes(include=[np.number])

# Step 2: Remove rows with missing values to ensure clean data
numeric_df = numeric_df.dropna()

print("=" * 50)
print("DATA PREPARATION")
print("=" * 50)

# Step 3: Create X (features) by dropping the target column 'price'
X = numeric_df.drop(columns=['price'])
print(f"\nFeatures (X) shape: {X.shape}")
print(f"Features (X) columns: {list(X.columns)}")

# Step 4: Create y (target) by isolating only the 'price' column
y = numeric_df['price']
print(f"\nTarget (y) shape: {y.shape}")
print("Target variable: 'price'")
print(f"\nFirst 5 feature values:\n{X.head()}")
print(f"\nFirst 5 target values:\n{y.head().values}")

DATA PREPARATION

Features (X) shape: (4600, 12)
Features (X) columns: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']

Target (y) shape: (4600,)
Target variable: 'price'

First 5 feature values:
   bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  \
0       3.0       1.50         1340      7912     1.5           0     0   
1       5.0       2.50         3650      9050     2.0           0     4   
2       3.0       2.00         1930     11947     1.0           0     0   
3       3.0       2.25         2000      8030     1.0           0     0   
4       4.0       2.50         1940     10500     1.0           0     0   

   condition  sqft_above  sqft_basement  yr_built  yr_renovated  
0          3        1340              0      1955          2005  
1          5        3370            280      1921             0  
2          4        1930              0      1
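
Selecting only numeric columns silently discards every text or date column: here the 18 original columns shrink to 13 numeric ones (12 features plus `price`). A minimal sketch on a tiny made-up frame shows one way to list what gets dropped before committing to the numeric-only subset. The `toy` frame and its `city` column are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the 'price' and 'bedrooms' names mirror the notebook.
toy = pd.DataFrame({
    "price": [313000.0, 420000.0, 550000.0],
    "bedrooms": [3.0, 3.0, 4.0],
    "city": ["A", "B", "C"],  # text column, excluded by select_dtypes
})

# Keep numeric columns only, then record which columns were left out.
numeric_toy = toy.select_dtypes(include=[np.number])
dropped_columns = [c for c in toy.columns if c not in numeric_toy.columns]

# Same X / y separation as in the preparation step.
X_toy = numeric_toy.drop(columns=["price"])
y_toy = numeric_toy["price"]

print(f"Dropped non-numeric columns: {dropped_columns}")
print(f"Feature columns: {list(X_toy.columns)}")
```

Printing the dropped columns makes it an explicit decision whether to encode them or leave them out.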

## 2. Train-Test Split

We split the data into:
- **Training Set (80%)**: Used to train the model
- **Test Set (20%)**: Used to evaluate the model's performance on unseen data

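Evaluating on held-out data does not prevent overfitting by itself, but it exposes it: a model that scores much better on the training set than on the test set has memorized the training data rather than generalized. The test score is therefore the more realistic measure of model performance.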

In [3]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("=" * 50)
print("TRAIN-TEST SPLIT")
print("=" * 50)
print(f"\nTraining set size: {X_train.shape[0]} samples ({(X_train.shape[0]/len(X))*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} samples ({(X_test.shape[0]/len(X))*100:.1f}%)")
print(f"\nTraining features shape: {X_train.shape}")
print(f"Test features shape: {X_test.shape}")

TRAIN-TEST SPLIT

Training set size: 3680 samples (80.0%)
Test set size: 920 samples (20.0%)

Training features shape: (3680, 12)
Test features shape: (920, 12)
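
To illustrate why the held-out set matters, the following sketch compares training and test R² for a model that is free to memorize its training data. It uses synthetic data and a `DecisionTreeRegressor`, both of which are assumptions made for illustration and are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(scale=2.0, size=500)

# Same 80/20 split settings as in the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# An unconstrained tree can fit the training set almost perfectly.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2_demo = tree.score(X_tr, y_tr)
test_r2_demo = tree.score(X_te, y_te)

print(f"Train R^2: {train_r2_demo:.3f}")
print(f"Test R^2:  {test_r2_demo:.3f}")
```

A large gap between the two scores is the signal to look for; the training score alone would suggest a near-perfect model.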


## 3. Feature Scaling

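Before training the model, we standardize the features using StandardScaler, which rescales each feature to zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics leak into training. For ordinary least squares, standardization mainly makes the coefficients comparable across features; it matters more for regularized or gradient-based models.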
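
One way to guarantee that the scaler only ever sees training data is to bundle it with the model in a scikit-learn pipeline. The sketch below is a hedged illustration on synthetic data (an assumption, not the house price dataset). It also checks that, for plain linear regression, predictions with and without scaling agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(2000, 800, size=300),  # loosely resembles square footage
    rng.normal(3, 1, size=300),       # loosely resembles a bedroom count
])
y_demo = 150 * X_demo[:, 0] + 10000 * X_demo[:, 1] + rng.normal(scale=5000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training data only;
# predict() reuses those statistics on the test data.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
raw = LinearRegression().fit(X_tr, y_tr)

scaled_train = pipe.named_steps["standardscaler"].transform(X_tr)
print("Train means after scaling:", scaled_train.mean(axis=0).round(6))
print("Train std devs after scaling:", scaled_train.std(axis=0).round(6))
print("Same predictions with and without scaling:",
      np.allclose(pipe.predict(X_te), raw.predict(X_te)))
```

The pipeline is an alternative to calling `fit_transform` and `transform` by hand; either approach is fine as long as the scaler is never fit on test data.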

In [None]:
# Fit the StandardScaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("=" * 50)
print("FEATURE SCALING")
print("=" * 50)
print(f"\nOriginal training features (first 3 rows):\n{X_train.head(3)}")
print(f"\nScaled training features (first 3 rows):\n{X_train_scaled[:3]}")
print(f"\nFeature means after scaling: {X_train_scaled.mean(axis=0)[:5]}")
print(f"Feature std devs after scaling: {X_train_scaled.std(axis=0)[:5]}")

## 4. Model Training

We train a Linear Regression model on the scaled training data. The model learns the relationship between the features and the target variable (house prices) by finding the coefficients that minimize the squared error on the training set.

In [None]:
# Train the Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

print("=" * 50)
print("MODEL TRAINING")
print("=" * 50)
print("\nModel trained successfully!")
print(f"Number of features: {len(model.coef_)}")
print("\nTop 5 feature coefficients (by magnitude):")
# Features are standardized, so coefficients are ranked by absolute value
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values('Coefficient', key=lambda s: s.abs(), ascending=False)
print(feature_importance.head())
print(f"\nModel intercept: ${model.intercept_:,.2f}")

## 5. Prediction & Model Evaluation

We make predictions on both the training and test sets, then evaluate performance using:
- **Mean Squared Error (MSE)**: Average squared difference between predicted and actual values, in squared units of the target
- **Root Mean Squared Error (RMSE)**: Square root of MSE, in the same units as the target variable
- **R² Score**: Proportion of variance explained by the model (at most 1, higher is better; it can be negative on test data when the model does worse than predicting the mean)

In [None]:
# Make predictions on training and test sets
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)

# Calculate performance metrics
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("=" * 50)
print("PREDICTION & MODEL EVALUATION")
print("=" * 50)

# MSE is in squared dollars, so only RMSE carries the "$" prefix
print("\nTRAINING SET PERFORMANCE:")
print(f"  Mean Squared Error (MSE): {train_mse:,.2f}")
print(f"  Root Mean Squared Error (RMSE): ${train_rmse:,.2f}")
print(f"  R² Score: {train_r2:.4f}")

print("\nTEST SET PERFORMANCE:")
print(f"  Mean Squared Error (MSE): {test_mse:,.2f}")
print(f"  Root Mean Squared Error (RMSE): ${test_rmse:,.2f}")
print(f"  R² Score: {test_r2:.4f}")

print("\n" + "=" * 50)
print("SAMPLE PREDICTIONS")
print("=" * 50)
comparison = pd.DataFrame({
    'Actual Price': y_test.values[:10],
    'Predicted Price': y_test_pred[:10],
    'Difference': y_test.values[:10] - y_test_pred[:10],
    'Error %': (abs(y_test.values[:10] - y_test_pred[:10]) / y_test.values[:10] * 100)
})
print(f"\n{comparison.to_string()}")
print(f"\nMean absolute error (MAE) on test set: ${abs(y_test.values - y_test_pred).mean():,.2f}")
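
The metrics used in the evaluation step can be checked by hand. A minimal sketch on made-up numbers computes MSE, RMSE, and R² directly from their definitions and prints them alongside nothing else, so the values can be compared against scikit-learn's `mean_squared_error` and `r2_score`. The `actual` and `predicted` arrays are hypothetical and not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices, for illustration only.
actual = np.array([300000.0, 450000.0, 500000.0, 750000.0])
predicted = np.array([320000.0, 430000.0, 540000.0, 700000.0])

residuals = actual - predicted
mse = np.mean(residuals ** 2)   # units: dollars squared
rmse = np.sqrt(mse)             # units: dollars

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot        # negative when the model is worse than the mean

print(f"MSE:  {mse:,.2f}  (sklearn: {mean_squared_error(actual, predicted):,.2f})")
print(f"RMSE: {rmse:,.2f}")
print(f"R^2:  {r2:.4f}  (sklearn: {r2_score(actual, predicted):.4f})")
```

Here the RMSE is 35,000, which reads directly as a typical error in dollars, whereas the MSE is expressed in squared dollars and is harder to interpret on its own.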