<a href="https://colab.research.google.com/github/MohameddAkmall/Codveda-Internship/blob/main/Codveda_House_Predection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Load and Inspect Data

In [None]:
import pandas as pd

# Column names from Boston Housing dataset
columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
    "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"
]

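# Load dataset (whitespace-delimited text with no header row, despite the .csv extension)
# sep=r"\s+" replaces delim_whitespace=True, which is deprecated in recent pandas versions
df = pd.read_csv("/content/4) house Prediction Data Set.csv", sep=r"\s+", header=None, names=columns)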

# Inspect
print(df.head())
df.info()  # prints its own summary and returns None, so no print() is needed


Handle Missing Data

In [None]:
# Check missing values
print(df.isnull().sum())

# Option 1: Fill with mean
df.fillna(df.mean(), inplace=True)

# Option 2: Drop missing rows
# df.dropna(inplace=True)


Encode Categorical Variables

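CHAS is categorical (0 = not near river, 1 = near river).
Since it is already binary and stored as 0/1, it needs no one-hot encoding. pd.get_dummies only converts object, string, or category columns, so the call below leaves an all-numeric DataFrame unchanged.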

In [None]:
df = pd.get_dummies(df, drop_first=True)


Normalize or Standardize Numerical Features

Standardization: mean = 0, std = 1

Normalization: scale between 0–1
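
To make the difference between the two scalings concrete, here is a minimal sketch on a made-up single-column array (the values are illustrative and not taken from the housing data). It uses scikit-learn's StandardScaler and MinMaxScaler and checks the resulting mean, standard deviation, minimum, and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature column with one large value (illustrative only)
values = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

# Normalization: rescale linearly so the minimum maps to 0 and the maximum to 1
normalized = MinMaxScaler().fit_transform(values)

print("Standardized mean:", standardized.mean(), "std:", standardized.std())
print("Normalized min:", normalized.min(), "max:", normalized.max())
```

The notebook cell that follows uses standardization, which is the usual choice for linear models and K-Means.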

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = df.drop("MEDV", axis=1)
y = df["MEDV"]

X_scaled = scaler.fit_transform(X)


Split Dataset into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


Linear Regression Model

Train Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


Interpret Model Coefficients

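Each coefficient shows the estimated effect of that feature on house price, keeping the other features constant.

Because the model is trained on standardized features, each coefficient is the change in MEDV per one-standard-deviation increase in that feature, not per raw unit. Example: if the RM coefficient = 3, then a one-standard-deviation increase in the average number of rooms raises the predicted price by ~3 units (MEDV is recorded in $1,000s).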
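
A minimal sketch of converting a coefficient learned on standardized inputs back to original units, assuming a single synthetic rooms-like feature. The data, the true slope of 3.0, and the names scaler_demo and lin are made up for illustration. Dividing by the scaler's scale_ attribute (the per-feature standard deviation) gives the effect of one raw unit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: price rises by 3.0 per extra room, plus small noise
rooms = rng.normal(loc=6.0, scale=0.7, size=200)
price = 3.0 * rooms + 5.0 + rng.normal(scale=0.1, size=200)

# Standardize the feature, as the notebook does before fitting
scaler_demo = StandardScaler()
rooms_scaled = scaler_demo.fit_transform(rooms.reshape(-1, 1))

lin = LinearRegression().fit(rooms_scaled, price)

coef_per_std = lin.coef_[0]                           # change per +1 standard deviation
coef_per_unit = coef_per_std / scaler_demo.scale_[0]  # change per +1 room

print(f"Per standard deviation: {coef_per_std:.3f}")
print(f"Per original unit:      {coef_per_unit:.3f}")
```

With the notebook's objects, the same conversion would be model.coef_ / scaler.scale_, provided scaler is the StandardScaler that produced X_scaled.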

In [None]:
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_
})
print(coefficients)


Evaluate Model Performance

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Predictions
y_pred = model.predict(X_test)

# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)


Visualize Predictions vs Actual Values

In [None]:
import matplotlib.pyplot as plt

# Scatter plot of actual vs predicted
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred, color="blue", alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color="red", linewidth=2)  # perfect prediction line
plt.xlabel("Actual MEDV")
plt.ylabel("Predicted MEDV")
plt.title("Actual vs Predicted House Prices")
plt.show()


Decision Tree Classifier (categorize prices)

The house price (MEDV) is continuous (e.g., 24.0, 21.6, 33.4...).
But DecisionTreeClassifier is for classification, meaning it expects discrete labels (like 0, 1, 2 for classes).

That's why we will use DecisionTreeRegressor instead.
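
If a classifier were required, the continuous target would first have to be binned into discrete classes. The sketch below shows one way to do that with pd.qcut on synthetic data. The features, prices, and the choice of three equal-sized bands are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for two features and a continuous price
features = rng.normal(size=(300, 2))
prices = 22.0 + 5.0 * features[:, 0] + rng.normal(scale=1.0, size=300)

# Bin prices into three quantile bands: 0 = low, 1 = medium, 2 = high
price_band = pd.qcut(prices, q=3, labels=False)

# A classifier can now be trained on the discrete bands
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(features, price_band)

print("Classes:", np.unique(price_band))
print("Training accuracy:", clf.score(features, price_band))
```

Binning discards the exact price, which is why the regressor is the more natural fit for MEDV.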

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Train a Decision Tree Regressor (unpruned)
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

# Predict
y_pred = tree_reg.predict(X_test)

# Evaluate
print("R² Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))


Prune the Tree

In [None]:
# Pruned tree using manual constraints
pruned_tree = DecisionTreeRegressor(
    random_state=42,
    max_depth=3,             # Limit tree depth
    min_samples_split=50,    # Minimum samples to split a node
    min_samples_leaf=20      # Minimum samples in a leaf
)
pruned_tree.fit(X_train, y_train)

# Predictions
y_pred_pruned = pruned_tree.predict(X_test)

# Evaluation
print("Pruned Tree R² Score:", r2_score(y_test, y_pred_pruned))
print("Pruned Tree Mean Squared Error:", mean_squared_error(y_test, y_pred_pruned))


Visualize the Unpruned Tree

In [None]:
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20,10))
tree.plot_tree(tree_reg, feature_names=X.columns, filled=True, max_depth=3)  # tree_reg is the unpruned model trained above
plt.title("Unpruned Decision Tree (Top 3 Levels)")
plt.show()


Visualize the Pruned Tree

In [None]:
plt.figure(figsize=(20,10))
tree.plot_tree(pruned_tree, feature_names=X.columns, filled=True, max_depth=3)
plt.title("Pruned Decision Tree (Top 3 Levels)")
plt.show()
print("Unpruned R²:", r2_score(y_test, y_pred))
print("Pruned R²:", r2_score(y_test, y_pred_pruned))
print("Unpruned MSE:", mean_squared_error(y_test, y_pred))
print("Pruned MSE:", mean_squared_error(y_test, y_pred_pruned))



| Metric                                | Unpruned Tree | Pruned Tree | Interpretation |
| :------------------------------------ | :------------ | :---------- | :------------- |
| **R² (coefficient of determination)** | 0.858         | 0.684       | Both scores are computed on the test set, so on this split the unpruned tree generalizes better. The pruning constraints (max_depth=3, min_samples_split=50, min_samples_leaf=20) appear too strict and make the tree underfit. |
| **MSE (Mean Squared Error)**          | 10.42         | 23.16       | The pruned tree's test error is more than twice as high, which points to underfitting rather than improved generalization. Looser constraints may give a better trade-off. |
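
Whether a tree overfits is easiest to judge by comparing training and test scores side by side. The following is a minimal sketch on synthetic data from make_regression (not the housing data), reusing the same pruning constraints as above. The variable names are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=500, n_features=13, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

configs = {
    "unpruned": {},
    "pruned": {"max_depth": 3, "min_samples_split": 50, "min_samples_leaf": 20},
}

results = {}
for name, params in configs.items():
    reg = DecisionTreeRegressor(random_state=42, **params).fit(X_tr, y_tr)
    results[name] = {
        "train": r2_score(y_tr, reg.predict(X_tr)),
        "test": r2_score(y_te, reg.predict(X_te)),
    }
    print(f"{name:9s} train R2 = {results[name]['train']:.3f}, test R2 = {results[name]['test']:.3f}")
```

A large gap between training and test R² signals overfitting, while low scores on both signal underfitting.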


K-Means Clustering
We'll cluster the Housing dataset into 3 groups with K-Means on the standardized features, then check whether the groups line up with Low, Medium, and High house prices.

Elbow Method (choose optimal k)

In [None]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Elbow method to determine optimal number of clusters
inertia = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)   # we already scaled X earlier
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(6,4))
plt.plot(K, inertia, 'bo-')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.title("Elbow Method for Optimal k")
plt.show()


Apply KMeans with k=3

In [None]:
# Apply KMeans with k=3 (Low, Medium, High groups)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Add cluster labels to dataframe
df["Cluster"] = clusters


Visualize Clusters (2D Scatter)

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(df["RM"], df["MEDV"], c=df["Cluster"], cmap="viridis")
plt.xlabel("Average Rooms per Dwelling (RM)")
plt.ylabel("Median Value of Homes (MEDV)")
plt.title("KMeans Clustering of House Prices")

plt.show()




What the graph represents

X-axis: Average number of rooms per dwelling (RM)

Y-axis: Median value of homes (MEDV)

Each dot represents one census tract (a neighborhood, not an individual house), and the color shows which cluster the K-Means algorithm assigned it to.

🟢 Cluster 1 – Low-priced Homes

Houses in this group have fewer rooms (typically 4–6 per dwelling).

They correspond to lower median home values (around $10,000–$20,000; MEDV is recorded in units of $1,000, so roughly 10–20 on the plot's y-axis).

These are likely older, smaller properties or those located in less desirable neighborhoods.

🟡 Cluster 2 – Medium-priced Homes

Houses have a moderate number of rooms (around 6–7).

The median value of homes in this cluster is mid-range (roughly $20,000–$30,000).

These homes represent the average housing market segment, possibly suburban or mid-income neighborhoods.

🟣 Cluster 3 – High-priced Homes

Houses in this cluster have more rooms (around 7–8+).

They correspond to high median home values (around $35,000–$50,000).

These are likely large, newer homes in desirable areas with better amenities.

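The clustering is based on the similarity of data points across all 13 standardized features in X_scaled. MEDV was not an input to K-Means, and RM was only one of the features, so the room and price ranges above describe how the clusters appear in this plot rather than what the algorithm grouped on. Cluster label numbers are arbitrary, so confirm which label matches which price band before relying on the names.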
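
One way to check which cluster label corresponds to which price band is to summarize each cluster with a groupby. The sketch below does this on synthetic data with three deliberately well-separated groups. The demo frame and its values are made up, and it clusters on RM and MEDV directly only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic stand-in: three well-separated groups of (rooms, value)
demo = pd.DataFrame({
    "RM": np.concatenate([rng.normal(5.0, 0.2, 50), rng.normal(6.5, 0.2, 50), rng.normal(8.0, 0.2, 50)]),
    "MEDV": np.concatenate([rng.normal(15, 1, 50), rng.normal(25, 1, 50), rng.normal(45, 1, 50)]),
})

demo["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(demo[["RM", "MEDV"]])

# Per-cluster size and summary statistics, ordered from cheapest to most expensive
profile = demo.groupby("Cluster").agg(
    n=("MEDV", "size"),
    rm_mean=("RM", "mean"),
    medv_mean=("MEDV", "mean"),
    medv_min=("MEDV", "min"),
    medv_max=("MEDV", "max"),
).sort_values("medv_mean")

print(profile)
```

On the notebook's data, the same groupby could be run on df once df["Cluster"] has been assigned.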

Random Forest

In [None]:
# Task 1: Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=None,
    random_state=42
)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on test data
y_pred = rf_model.predict(X_test)

# Evaluate performance
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"Random Forest R²: {r2:.4f}")
print(f"Random Forest MSE: {mse:.4f}")

# Cross-validation (5-fold)
cv_scores = cross_val_score(rf_model, X, y, cv=5, scoring='r2')
print(f"Cross-Validation R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Feature Importance
importances = rf_model.feature_importances_
feature_names = X.columns

# Visualize feature importance
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 5))
sns.barplot(x=importances[indices], y=feature_names[indices], palette="viridis")
plt.title("Feature Importance (Random Forest Regressor)")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.show()


Explanation:

✅ RandomForestRegressor — fits continuous target values like prices.

✅ R² (coefficient of determination) — measures how well the model explains variance (closer to 1 = better).

✅ MSE (Mean Squared Error) — lower = better fit.

✅ Feature importance — shows which input variables most affect predicted prices.
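
For reference, both metrics can be computed by hand. This is a small worked example with made-up actual and predicted values (not model output), using the standard definitions MSE = mean((y - ŷ)²) and R² = 1 - SS_res / SS_tot.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([20.0, 22.0, 24.0, 26.0, 28.0])
y_hat = np.array([21.0, 21.0, 25.0, 25.0, 29.0])

# MSE: average of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual)  # 1.0
print("R2:", r2_manual)    # 0.875
```

These match what mean_squared_error and r2_score compute for the same inputs.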

Neural Networks with TensorFlow/Keras

In [None]:
# Task 3: Neural Networks with TensorFlow/Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

# 1️⃣ Preprocessing
# Scale the features for better NN convergence
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


In [None]:

# 2️⃣ Build the Neural Network
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),  # input layer + 1st hidden layer
    Dropout(0.2),  # prevents overfitting
    Dense(32, activation='relu'),  # 2nd hidden layer
    Dense(1)  # output layer for regression
])

# 3️⃣ Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])

# 4️⃣ Train the model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=16,
    verbose=1
)


In [None]:

# 5️⃣ Evaluate performance
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f"\nNeural Network R²: {r2:.4f}")
print(f"Neural Network MSE: {mse:.4f}")

# 6️⃣ Visualize training vs validation loss
plt.figure(figsize=(8,5))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss over Epochs')
plt.xlabel('Epochs')
plt.ylabel('MSE Loss')
plt.legend()
plt.show()
