# House Price Prediction using Machine Learning

This notebook demonstrates how to predict house prices using machine learning techniques. We'll go through the following steps:

1. Data Loading and Exploration
   - Load training and test datasets
   - Examine data structure and features
   - Check for missing values

2. Data Preprocessing
   - Handle missing values
   - Convert categorical variables to numerical
   - Select important features

3. Model Building
   - Split data into training and validation sets
   - Train a Random Forest model
   - Evaluate model performance

4. Make Predictions
   - Predict house prices on test data
   - Create submission file

This is a great beginner project to learn:
- Data preprocessing and cleaning
- Feature engineering
- Machine learning model training
- Model evaluation and prediction


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Load the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Display the first few rows of the training data
train_data.head()

# Data Preprocessing
# Check for missing values
missing_values = train_data.isnull().sum()
missing_values[missing_values > 0]


LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
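
The raw counts above are easier to act on when each is shown as a share of all rows next to the column's dtype, since that suggests whether a column needs a numeric fill, a "not present" category, or should be dropped. The following is a minimal sketch of such a summary. The helper name `missing_summary` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count, percentage, and dtype for every column with missing values."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(1),
        'dtype': df.dtypes.astype(str),
    })
    # Keep only columns that actually have gaps, largest first
    summary = summary[summary['missing'] > 0]
    return summary.sort_values('missing', ascending=False)


# Toy stand-in for train_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotArea': [8450, 9600, 11250, 9550],
})
print(missing_summary(toy))
```

Calling `missing_summary(train_data)` on the real data would rank columns such as `PoolQC`, `MiscFeature`, and `Alley` well ahead of `Electrical`, which has a single gap.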

In [2]:
# Handle missing values (example strategies)
train_data.fillna({
    'LotFrontage': train_data['LotFrontage'].mean(),
    'Alley': 'No Alley',
    'BsmtQual': 'No Basement',
    'GarageType': 'No Garage',
    'PoolQC': 'No Pool',
    'FireplaceQu': 'No Fireplace',
    'Fence': 'No Fence',
}, inplace=True)

# Convert categorical variables to numerical (one-hot encoding)
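train_data = pd.get_dummies(train_data, drop_first=True)

# Separate the features from the target.
# 'Id' is assumed to be a row identifier; errors='ignore' skips it if absent.
X = train_data.drop(columns=['Id', 'SalePrice'], errors='ignore')
y = train_data['SalePrice']

# Fill the numeric gaps not handled above (e.g. MasVnrArea, GarageYrBlt) with the column median
X = X.fillna(X.median(numeric_only=True))

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)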

# Model training
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
val_predictions = model.predict(X_val)

# Model evaluation
mse = mean_squared_error(y_val, val_predictions)
print(f'Mean Squared Error: {mse}')

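# Visualize feature importance (top 20 features, most important first)
importances = pd.Series(model.feature_importances_, index=X.columns)
top_importances = importances.sort_values(ascending=False).head(20)
sns.barplot(x=top_importances.values, y=top_importances.index)
plt.title('Top 20 Feature Importances')
plt.show()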

# Feature importance calculation using all features
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})  # Use all features
top_features = feature_importance_df.sort_values(by='Importance', ascending=False).head(20)['Feature']

# Correlation heatmap for top 20 features
plt.figure(figsize=(12, 10))
sns.heatmap(train_data[top_features].corr(), annot=True, fmt=".2f", cmap='viridis')  # Using 'viridis' for clarity
plt.title('Correlation Heatmap of Top 20 Important Features')
plt.show()

# Distribution of SalePrice
plt.figure(figsize=(10, 6))
sns.histplot(train_data['SalePrice'], bins=30, kde=True)
plt.title('SalePrice Distribution')
plt.xlabel('SalePrice')
plt.ylabel('Frequency')
plt.show()
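
Step 4 of the outline (predicting on the test data and creating a submission file) is not shown in the cells above. One way it could look is sketched below on toy data. The main point is that the test frame needs the same fills as the training frame and exactly the same encoded columns the model was trained on. Here the test data is encoded without `drop_first` and then reindexed to the training columns, so a category that happens to be the only one present in the test set is not silently dropped. The helper name `align_to_training`, the toy values, and the `Id`/`SalePrice` submission layout are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def align_to_training(test_encoded: pd.DataFrame, train_columns: pd.Index) -> pd.DataFrame:
    """Give the encoded test frame exactly the training columns, in the same order.

    Dummy columns missing from the test data are added as 0; columns the
    model never saw during training are dropped.
    """
    return test_encoded.reindex(columns=train_columns, fill_value=0)


# Toy stand-ins for train_data and test_data (hypothetical values)
train = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'LotArea': [8450, 9600, 11250, 9550],
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
    'SalePrice': [208500, 181500, 223500, 140000],
})
test = pd.DataFrame({
    'Id': [5, 6],
    'LotArea': [10000, 7000],
    'Street': ['Pave', 'Pave'],
})

# Training side: same encoding as in the notebook
X = pd.get_dummies(train.drop(columns=['Id', 'SalePrice']), drop_first=True)
y = train['SalePrice']
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Test side: encode every category, then keep only the training columns
test_encoded = pd.get_dummies(test.drop(columns=['Id']))
X_test = align_to_training(test_encoded, X.columns)

submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': model.predict(X_test)})

# Passing a path, e.g. submission.to_csv('submission.csv', index=False),
# would write the file; without a path the CSV text is returned instead.
print(submission.to_csv(index=False))
```

With the real data, the same `fillna` mapping and median fill used for `train_data` would be applied to `test_data` before encoding, and `X.columns` would come from the training matrix built earlier.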
