<a href="https://colab.research.google.com/github/DAVINNCI5/pro.py/blob/main/PROJECT_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**GENERAL OBJECTIVES**

i.	The main objective of this project is create a machine learning model that accurately predicts house prices.
ii.	Compare the performance of various machine learning algorithms to determine the most effective method for predicting house prices.


This line loads the dataset containing apartment listings into a DataFrame called df. The file apartments.csv includes information about the available rooms or apartments—such as price, location, number of bedrooms, availability, and other relevant details. By loading this data, we can begin cleaning, analyzing, and eventually using it to help tenants find suitable rental options.

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/apartments.csv')

# Display the first few rows
print(df.head())

# Check for any missing values
print(df.isnull().sum())


                                              title  \
0   3 Bedroom Apartment / Flat to rent in Riverside   
1  3 Bedroom Apartment / Flat to rent in Kileleshwa   
2       3 Bedroom Apartment / Flat to rent in Nyali   
3   3 Bedroom Apartment / Flat to rent in Lavington   
4  1 Bedroom Apartment / Flat to rent in Kileleshwa   

                                   location  bedrooms  bathrooms     price  \
0  Riverside Dr Nairobi, Riverside, Nairobi         3          3   200 000   
1                       Kileleshwa, Nairobi         3          4    70 000   
2          Links Rd Mombasa, Nyali, Mombasa         3          2    38 000   
3    Near Valley Arcade, Lavington, Nairobi         3          3    80 000   
4                       Kileleshwa, Nairobi         1          1   110 000   

        rate  
0  Per Month  
1  Per Month  
2  Per Month  
3  Per Month  
4  Per Month  
title          0
location     376
bedrooms       0
bathrooms      0
price          0
rate           0
dtype: i

This line removes the title column from the dataset as it contains unstructured text that's not useful for analysis. It helps keep only relevant features for filtering or modeling rental listings.

In [3]:
# Drop 'title' column
df = df.drop(columns=['title'])


This line converts the location column into multiple binary (0/1) columns using one-hot encoding. It allows the model to understand location as separate features without treating them as numeric values.

In [4]:
# One-hot encode 'location'
df = pd.get_dummies(df, columns=['location'], drop_first=True)


This block cleans and standardizes numeric features like bedrooms, bathrooms, price, and rate to prepare the data for modeling. It ensures that values are in a uniform scale, which improves the performance of many machine learning algorithms.

In [5]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assuming 'df' is your DataFrame

# Replace spaces and commas in numeric columns and convert to numeric type
for col in ['bedrooms', 'bathrooms', 'price', 'rate']:
    # Convert the column to string type first to handle potential mixed data types
    df[col] = df[col].astype(str).str.replace(' ', '').str.replace(',', '')
    # Now, convert to numeric, handling errors by coercing them to NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Initialize the scaler
scaler = StandardScaler()

# Scale numerical columns
df[['bedrooms', 'bathrooms', 'price', 'rate']] = scaler.fit_transform(df[['bedrooms', 'bathrooms', 'price', 'rate']])
# Initialize the scaler (This line seems redundant and can be removed)
# scaler = StandardScaler()

# Scale numerical columns (This line seems redundant and can be removed)
# df[['bedrooms', 'bathrooms', 'price', 'rate']] = scaler.fit_transform(df[['bedrooms', 'bathrooms', 'price', 'rate']])

  updated_mean = (last_sum + new_sum) / updated_sample_count
  T = new_sum / new_sample_count
  new_unnormalized_variance -= correction**2 / new_sample_count


ChatGPT said:
This line creates a demo column is_available to simulate availability, labeling listings with a rate below the median as available. It's a synthetic feature just for testing or modeling purposes.

In [6]:
# Create a synthetic 'is_available' column based on rate (just for demo purposes)
df['is_available'] = (df['rate'] < df['rate'].median()).astype(int)



This code splits the data into training and testing sets to prepare for model building. X contains the apartment features, while y holds the target (is_available), with 80% used for training and 20% for testing.

In [7]:
from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = df.drop(columns=['is_available'])
y = df['is_available']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [8]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer  # Import SimpleImputer

# Assuming 'df' is your DataFrame

# ... (Your previous code for data loading, one-hot encoding, etc.)

# Replace spaces and commas in numeric columns and convert to numeric type
# Step 1: Identify numeric columns
numeric_cols = ['bedrooms', 'bathrooms', 'price', 'rate']

# Step 2: Check which columns are fully null
print(df[numeric_cols].isnull().sum())

# Step 3: Drop columns with ALL missing values (like 'rate', if applicable)
numeric_cols = [col for col in numeric_cols if df[col].notnull().sum() > 0]

# Step 4: Impute only the valid numeric columns
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
imputed_data = imputer.fit_transform(df[numeric_cols])

# Step 5: Assign imputed data back
df[numeric_cols] = imputed_data

# Initialize the scaler
scaler = StandardScaler()

# Scale numerical columns
# Use numeric_cols here to ensure all four columns are scaled
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# ... (Rest of your code for creating 'is_available', train-test split, model training, and evaluation)

bedrooms        0
bathrooms       0
price           0
rate         2520
dtype: int64


ows with missing values are removed from both training and testing sets to ensure clean input for the model. The corresponding labels (y_train, y_test) are also aligned with the cleaned feature sets.

In [9]:
# Drop all rows with any NaN values in X_train and X_test
X_train = X_train.dropna()
y_train = y_train.loc[X_train.index]

X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]

# Confirm it's clean
print("X_train NaNs:\n", X_train.isnull().sum())


X_train NaNs:
 bedrooms                                                0
bathrooms                                               0
price                                                   0
rate                                                    0
location_Jabavu court, Kilimani, Nairobi                0
location_Kikuyu Town Bus park Kikuyu, Kikuyu, Kikuyu    0
location_Kileleshwa Nairobi, Kileleshwa, Nairobi        0
location_Kileleshwa, Nairobi                            0
location_Kilimani, Nairobi                              0
location_Links Rd Mombasa, Nyali, Mombasa               0
location_Muthaiga, Nairobi                              0
location_Near Valley Arcade, Lavington, Nairobi         0
location_Nyali, Mombasa                                 0
location_Off Othaya road, Lavington, Nairobi            0
location_Riverside Dr Nairobi, Riverside, Nairobi       0
location_Shanzu, Mombasa                                0
location_Thika Rd Nairobi, Kahawa Wendani, Nairobi      0

ChatGPT said:
These assertions ensure that all missing values have been successfully removed from X_train and X_test. If any NaNs remain, an error is raised to catch the issue early before modeling.

In [10]:
assert not X_train.isnull().any().any(), "Still some NaNs in X_train!"
assert not X_test.isnull().any().any(), "Still some NaNs in X_test!"


These lines print the dimensions of the training feature set (X_train) and the target labels (y_train). It's a quick sanity check to confirm that the data is aligned and ready for model training.

In [11]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)


X_train shape: (0, 20)
y_train shape: (0,)


In [12]:
print(y_train.value_counts())


Series([], Name: count, dtype: int64)


This re-splits the preprocessed data into training and testing sets, ensuring the distribution of the is_available target remains balanced across both sets using stratify=y.

In [13]:
from sklearn.model_selection import train_test_split

# Assuming df['is_available'] is still intact and clean
X = df.drop(columns=['is_available'])
y = df['is_available']

# Re-split after all preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [14]:
print("New X_train shape:", X_train.shape)


New X_train shape: (2016, 20)


Columns with only missing values are dropped from X_train, and X_test is reindexed to match, ensuring compatibility. Then, missing values are filled using median imputation to prepare clean and consistent data for modeling.

In [15]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Drop columns with all NaN values from X_train (since median imputation can't handle them)
X_train_clean = X_train.dropna(axis=1, how='all')

# Ensure X_test has the same columns in the same order
X_test_clean = X_test.reindex(columns=X_train_clean.columns, fill_value=0)

# Create the imputer and fit on the cleaned X_train
imputer = SimpleImputer(strategy='median')

# Fit and transform X_train
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train_clean),
                               columns=X_train_clean.columns,
                               index=X_train_clean.index)

# Transform X_test using the same imputer
X_test_imputed = pd.DataFrame(imputer.transform(X_test_clean),
                              columns=X_test_clean.columns,
                              index=X_test_clean.index)


In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 🧪 OPTIONAL: Create a sample dataset (skip this if you have X and y)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y = pd.Series(y, name="target")

# 🧼 Inject some missing values randomly (for demonstration)
rng = np.random.default_rng(42)
missing_mask = rng.choice([True, False], size=X.shape, p=[0.1, 0.9])
X = X.mask(missing_mask)

# 1️⃣ Stratified Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 🔍 Confirm class distribution
print("Class distribution in y_train:\n", y_train.value_counts())

# 2️⃣ Drop columns with all NaNs in training set
X_train_clean = X_train.dropna(axis=1, how='all')

# Ensure test set has same columns
X_test_clean = X_test.reindex(columns=X_train_clean.columns, fill_value=0)

# 3️⃣ Impute missing values (median)
imputer = SimpleImputer(strategy='median')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train_clean),
                               columns=X_train_clean.columns,
                               index=X_train_clean.index)

X_test_imputed = pd.DataFrame(imputer.transform(X_test_clean),
                              columns=X_test_clean.columns,
                              index=X_test_clean.index)

# 🛡 Safety check: make sure training labels contain at least 2 classes
if len(y_train.unique()) < 2:
    raise ValueError("y_train contains only one class. Check your dataset or splitting method.")

# 4️⃣ Train Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_imputed, y_train)






Class distribution in y_train:
 target
1    402
0    398
Name: count, dtype: int64


This code builds a full ML pipeline: it handles missing values, splits the data, trains a logistic regression model, and evaluates its performance. It simulates real-world data issues and prepares a clean dataset for accurate prediction.

In [17]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
rf_model.fit(X_train_imputed, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test_imputed)

# Evaluate
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

print("✅ Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred_rf))
print("\n🧮 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


✅ Accuracy: 0.875

📊 Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.86      0.87        99
           1       0.87      0.89      0.88       101

    accuracy                           0.88       200
   macro avg       0.88      0.87      0.87       200
weighted avg       0.88      0.88      0.87       200


🧮 Confusion Matrix:
 [[85 14]
 [11 90]]


In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # Import train_test_split

# ... (Your previous code for data loading, preprocessing, etc.)

# Ensure X and y have consistent indices before splitting
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

# 1️⃣ Stratified Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ... (Rest of your code for data cleaning, imputation, etc.)

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
rf_model.fit(X_train_imputed, y_train.loc[X_train_imputed.index])  # Align y_train with X_train_imputed

# ... (Rest of your code for prediction and evaluation)
# Evaluate
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

print("✅ Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred_rf))
print("\n🧮 Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


✅ Accuracy: 0.875

📊 Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.86      0.87        99
           1       0.87      0.89      0.88       101

    accuracy                           0.88       200
   macro avg       0.88      0.87      0.87       200
weighted avg       0.88      0.88      0.87       200


🧮 Confusion Matrix:
 [[85 14]
 [11 90]]


**SAVE** **MODEL**

In [19]:
import joblib
import os

# Create a 'models' directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the trained model
joblib.dump(rf_model, 'models/availability_model.pkl')

# Save the imputer used for handling missing values
joblib.dump(imputer, 'models/imputer.pkl')


['models/imputer.pkl']

deploy

In [40]:
# Save app.py
%%writefile app.py
import streamlit as st
import joblib
import pandas as pd

model = joblib.load('availability_model.pkl')
imputer = joblib.load('imputer.pkl')

st.title("🏡 Room Availability Checker")
st.markdown("""
Welcome to the Room Availability Checker!
Please input the details of a room, and click **"Check Availability"** to see if it's currently available.
""")

bedrooms = st.number_input("Bedrooms", 0, 10, 1)
bathrooms = st.number_input("Bathrooms", 0, 10, 1)
price = st.number_input("Price (KES)", 0.0, 100000.0, 1000.0)
rate = st.number_input("Rate (optional)", 0.0, 5.0, 0.0)

if st.button("Check Availability"):
    input_df = pd.DataFrame([{
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'price': price,
        'rate': rate
    }])
    input_df = imputer.transform(input_df)
    prediction = model.predict(input_df)[0]
    result = "✅ Available!" if prediction == 1 else "❌ Not Available"
    st.success(result) if prediction == 1 else st.error(result)


Overwriting app.py


In [41]:
# Save requirements.txt
%%writefile requirements.txt
streamlit
joblib
pandas


Writing requirements.txt


In [42]:
from google.colab import files
uploaded = files.upload()


Saving imputer (1).pkl to imputer (1) (1).pkl
Saving availability_model (2).pkl to availability_model (2) (1).pkl


In [56]:
!git config --global user.email "davidversion3560@gmail.com"
!git config --global user.name "DAVINNCI5"


In [63]:
!echo "# pro.py" >> README.md
!git init
!git commit -m "first commit"
!git branch -M main
!echo "# pro.py" >> README.md
!git init
!git add .
!git commit -m "first commit"
!git branch -M main
!git remote add origin https://github.com/DAVINNCI5/pro.py.git
!git push -u origin main

Reinitialized existing Git repository in /content/.git/
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   README.md[m

no changes added to commit (use "git add" and/or "git commit -a")
Reinitialized existing Git repository in /content/.git/
[main 6f6b6c7] first commit
 1 file changed, 2 insertions(+)
error: remote origin already exists.
fatal: could not read Username for 'https://github.com': No such device or address
