# House Price Analysis
This notebook analyzes the house price dataset to understand the factors affecting house prices and to build a predictive model.

## Dataset Description
The dataset contains information about various houses, including the number of bedrooms, number of bathrooms, living area, lot area, number of floors, presence of waterfront, number of views, condition, grade, area excluding basement, basement area, year built, year renovated, postal code, latitude, longitude, living area renovation, lot area renovation, number of schools nearby, distance from the airport, and the price of the house.

### Columns
- `id`: Unique ID for each house
- `Date`: Date of the sale
- `number of bedrooms`: Number of bedrooms
- `number of bathrooms`: Number of bathrooms
- `living area`: Living area in square feet
- `lot area`: Lot area in square feet
- `number of floors`: Number of floors
- `waterfront present`: Presence of a waterfront
- `number of views`: Number of views
- `condition of the house`: Condition of the house
- `grade of the house`: Grade of the house
- `Area of the house(excluding basement)`: Area of the house excluding basement in square feet
- `Area of the basement`: Area of the basement in square feet
- `Built Year`: Year the house was built
- `Renovation Year`: Year the house was renovated
- `Postal Code`: Postal code of the house
- `Lattitude`: Latitude of the house location
- `Longitude`: Longitude of the house location
- `living_area_renov`: Renovated living area in square feet
- `lot_area_renov`: Renovated lot area in square feet
- `Number of schools nearby`: Number of schools nearby
- `Distance from the airport`: Distance from the airport
- `Price`: Price of the house

## Problem Statement
The main objective of this analysis is to understand the factors that influence house prices and to build a predictive model that can estimate the price of a house based on its features.

## Feature Importance
Identifying which features are most important in predicting house prices will help in building a more accurate model. We will use feature selection techniques and correlation analysis to determine the most significant features.
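As a rough sketch of the correlation analysis mentioned above, the features can be ranked by the absolute value of their Pearson correlation with `Price`. The helper name `rank_features_by_correlation` and the toy data are invented for illustration and are not part of the dataset or the notebook.

```python
import pandas as pd

def rank_features_by_correlation(frame: pd.DataFrame, target: str = "Price") -> pd.Series:
    """Return absolute Pearson correlations with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    correlations = numeric.corr()[target].drop(target)
    return correlations.abs().sort_values(ascending=False)

# Toy data, made up for this example only (not taken from the real dataset).
toy = pd.DataFrame({
    "living area": [1000, 1500, 2000, 2500, 3000],
    "number of floors": [1, 2, 1, 2, 1],
    "Price": [200000, 290000, 410000, 500000, 610000],
})

ranking = rank_features_by_correlation(toy)
print(ranking)
```

A ranking like this only captures linear, one-feature-at-a-time relationships, so it is a starting point for feature selection rather than a final answer.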

## Import Libraries and Load Dataset
In this step, we import the necessary libraries and load the dataset.

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('/path/to/your/dataset.csv')

# Display the first few rows of the dataset
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '/path/to/your/dataset.csv'

## Dataset Overview
Here, we display the basic information about the dataset to understand its structure and contents.

In [None]:
df.info()

### Summary of Dataset
The dataset consists of 23 columns and 14619 rows. It includes various features such as the number of bedrooms, bathrooms, living area, lot area, number of floors, etc. This summary gives us an idea of the data types and the presence of any missing values.

## Data Cleaning and Preprocessing
In this step, we handle missing values, convert data types if necessary, and perform any other preprocessing tasks needed to prepare the data for analysis.

In [None]:
# Checking for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

In [None]:
# Dropping rows with missing values (if any)
df.dropna(inplace=True)

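# Convert 'Date' to datetime. origin='julian' treats the values as Julian day
# numbers, so this only works if the column is actually stored that way.
df['Date'] = pd.to_datetime(df['Date'], unit='D', origin='julian')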

### Data Cleaning Summary
After handling missing values and converting data types, the dataset is now clean and ready for further analysis.
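The result of a numeric date conversion depends on the `origin` passed to `pd.to_datetime`. The sketch below contrasts a Julian day number with a spreadsheet-style serial day count. The sample values are invented, and the serial-count format is only an assumption, since the actual encoding of the `Date` column is not stated here.

```python
import pandas as pd

# Hypothetical raw values, invented for illustration only.
julian_days = pd.Series([2457509.5])  # Julian day number
serial_days = pd.Series([42491])      # spreadsheet-style serial day count (assumed format)

# Julian day numbers are counted from the Julian epoch.
from_julian = pd.to_datetime(julian_days, unit="D", origin="julian")

# Spreadsheet serial counts are conventionally measured from 1899-12-30.
from_serial = pd.to_datetime(serial_days, unit="D", origin="1899-12-30")

print(from_julian.iloc[0], from_serial.iloc[0])
```

Checking a handful of converted values against known sale dates is a quick way to confirm that the chosen origin matches the data.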

## Exploratory Data Analysis (EDA)
We perform exploratory data analysis to uncover patterns, trends, and relationships in the data.

In [None]:
df.describe().T

In [None]:
# Drop identifier and date columns, which are not used in the analysis below
df = df.drop(columns=['id', 'Date'])

In [None]:
# Drop location columns
df = df.drop(columns=['Longitude', 'Lattitude', 'Postal Code'])

In [None]:
df.head()

### Distribution of House Prices
In this step, we plot the distribution of house prices to understand their spread and central tendency. A histogram with a kernel density estimate (KDE) overlay provides a clear visualization of the distribution.

In [None]:
# Plotting the distribution of house prices
plt.figure(figsize=(10, 6))
sns.histplot(df['Price'], kde=True, bins=30)
plt.title('Distribution of House Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()


#### Central Tendency:
The tallest bars of the histogram and the peak of the KDE curve mark the most common price range in the dataset.
#### Spread:
The width of the histogram along the x-axis shows the range of prices, from the lowest to the highest.
#### Skewness:
An asymmetric distribution indicates skewness. If many houses sit at lower prices and only a few at extremely high prices, the distribution has a long right tail and is right-skewed.
#### Outliers:
Isolated bars far from the bulk of the distribution indicate outliers: houses priced much higher or lower than the rest of the data points.
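The visual reading of skewness and outliers can be backed by numbers. Here is a minimal sketch, assuming the common 1.5 × IQR rule for flagging outliers (a convention chosen for this example, not something the notebook prescribes). The prices are invented toy values.

```python
import pandas as pd

# Toy prices in thousands, invented for illustration only.
prices = pd.Series([200, 220, 240, 250, 260, 280, 300, 1200], name="Price")

# Positive skewness means a longer right tail (a few very expensive houses).
skewness = prices.skew()

# Flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
outliers = prices[is_outlier]

print(f"skewness: {skewness:.2f}")
print(outliers)
```

Applied to the real data, the same steps would use `df['Price']` in place of the toy series.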

In [None]:
# Plotting the correlation heatmap
plt.figure(figsize=(14, 10))
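# numeric_only=True skips any non-numeric columns, which would otherwise raise an error
correlation_matrix = df.corr(numeric_only=True)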
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()


#### Strong Positive Correlations: 
Some features show strong positive correlations with each other and with the target variable 'Price'. For example, 'living area' might have a high positive correlation with 'Price', indicating that larger living areas are associated with higher prices.

#### Strong Negative Correlations:
There may be features with strong negative correlations with 'Price' or other features. For instance, 'Distance from the airport' might have a negative correlation with 'Price', suggesting that houses farther from the airport tend to have lower prices.

#### Weak or No Correlations:
Some features may show very weak or no correlation with 'Price', indicating they might not be significant predictors for the target variable.

#### Feature Interrelationships:
The heatmap helps in understanding the interrelationships between features. For example, 'number of bedrooms' and 'living area' might show a strong correlation, indicating that houses with more bedrooms tend to have larger living areas.
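On a large heatmap, strongly related feature pairs can be hard to spot by eye. The following sketch lists pairs whose absolute correlation exceeds a threshold. The helper name `top_correlated_pairs`, the 0.8 threshold, and the toy values are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    corr = frame.corr(numeric_only=True)
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=abs, ascending=False)

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "number of bedrooms": [2, 3, 3, 4, 5],
    "living area": [900, 1400, 1500, 2100, 2600],
    "Distance from the airport": [50, 60, 55, 52, 58],
})

strong_pairs = top_correlated_pairs(toy)
print(strong_pairs)
```

Pairs that appear in this list are candidates for multicollinearity checks before a linear model is fitted.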

In [None]:
df.head()

In [None]:
# Plotting Price vs. Number of Bedrooms
plt.figure(figsize=(12, 6))
sns.boxplot(x='number of bedrooms', y='Price', data=df)
plt.title('Price vs. Number of Bedrooms')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Price')
plt.show()

# Plotting Price vs. Number of Bathrooms
plt.figure(figsize=(12, 6))
sns.boxplot(x='number of bathrooms', y='Price', data=df)
plt.title('Price vs. Number of Bathrooms')
plt.xlabel('Number of Bathrooms')
plt.ylabel('Price')
plt.show()


### Price vs. Number of Bedrooms
#### Observations:

- Median Price: The median price rises with the number of bedrooms, so houses with more bedrooms tend to be more expensive.
- Price Range: The interquartile range (the box, from lower to upper quartile) differs across bedroom counts. A taller box means more price variability for houses with that number of bedrooms.
- Outliers: Individual points beyond the whiskers mark houses priced well above or below most houses with the same number of bedrooms.

Summary: The boxplot shows a positive relationship between the number of bedrooms and house prices: as the number of bedrooms increases, the median price tends to rise. Each category nevertheless shows substantial variability and outliers.

### Price vs. Number of Bathrooms
#### Observations:

- Median Price: The median price rises with the number of bathrooms, so houses with more bathrooms tend to be more expensive.
- Price Range: The interquartile range (the box, from lower to upper quartile) differs across bathroom counts. A taller box means more price variability for houses with that number of bathrooms.
- Outliers: Individual points beyond the whiskers mark houses priced well above or below most houses with the same number of bathrooms.

Summary: The boxplot shows a positive relationship between the number of bathrooms and house prices: as the number of bathrooms increases, the median price tends to rise. Each category nevertheless shows substantial variability and outliers.
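The quantities a boxplot draws (median, quartiles, and interquartile range) can also be tabulated per category, which makes the comparison between groups explicit. This is a minimal sketch on invented toy data. The same pattern would apply to `number of bathrooms`.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "number of bedrooms": [2, 2, 3, 3, 3, 4, 4],
    "Price": [200000, 240000, 300000, 320000, 900000, 450000, 470000],
})

# One row per bedroom count, one column per quartile.
quartiles = (
    toy.groupby("number of bedrooms")["Price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# The IQR corresponds to the height of each box in the boxplot.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

In this toy example the single expensive three-bedroom house widens that group's IQR without moving its median much, which is why the median is the more robust summary.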

In [None]:

# Plotting Year Built vs. Price
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Built Year', y='Price', data=df)
plt.title('Year Built vs. Price')
plt.xlabel('Year Built')
plt.ylabel('Price')
plt.show()


### Year Built vs. Price
#### Observations:

- Price Trend Over Time: The scatterplot shows the relationship between the year a house was built and its price. Houses built in more recent years generally tend to have higher prices, but with considerable variability.
- Variability: Prices vary widely for houses built in any given year. Some older houses still command high prices, likely due to renovations, historical value, or desirable locations.
- Outliers: Some houses are priced far above or below the general trend for their build year.
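A scatterplot with thousands of points can hide the underlying trend. One way to summarise it is to group houses by decade built and compare median prices. The sketch below uses invented toy values, and the decade binning is a choice made for this example.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Built Year": [1925, 1948, 1962, 1967, 1994, 1999, 2008, 2014],
    "Price": [520000, 310000, 280000, 300000, 390000, 410000, 450000, 560000],
})

# Integer division maps each year to the start of its decade, e.g. 1967 -> 1960.
decade = (toy["Built Year"] // 10) * 10

# The median is less sensitive to outliers than the mean.
median_by_decade = toy.groupby(decade)["Price"].median()
print(median_by_decade)
```

In this toy data the oldest house is among the most expensive, mirroring the observation that some older houses still command high prices.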