This project demonstrates a basic linear regression model to predict housing prices based on various factors such as location, population, and median income. The dataset used is Housing.csv, and the code is implemented in Python.
- main.py: Python script containing the data processing, visualization, and model training.
- Housing.csv: Dataset used for model training and testing.
- README.md: Documentation of the project.
The Housing.csv dataset includes features like:
- `longitude` and `latitude`: Geographic coordinates.
- `housing_median_age`: Age of the housing in the area.
- `total_rooms` and `total_bedrooms`: Counts of rooms and bedrooms.
- `population` and `households`: Population and household counts.
- `median_income`: Median income in the area.
- `median_house_value`: Target variable for prediction.
- `ocean_proximity`: Proximity to the ocean (categorical feature).
- Data Preparation:
  - Stratified sampling is used so that the training and test sets have a consistent income distribution.
  - One-hot encoding converts the categorical `ocean_proximity` feature to numeric columns.
  - Missing values in `total_bedrooms` are filled with the column median before training.
- Data Visualization:
  - A scatter plot visualizes housing data distribution by population and geographic location.
- Model Training:
  - We use linear regression to model the relationship between the features and the target (`median_house_value`).
  - Predictions are generated on a sample to compare with actual values and evaluate model performance.
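
The preparation and training steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of main.py: it assumes pandas and scikit-learn are installed, builds a small synthetic table in place of Housing.csv (the values and the `ocean_proximity` categories are made up), and the helper names, income bin edges, and 80/20 split are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedShuffleSplit


def make_synthetic_housing(n_rows=200, seed=0):
    """Hypothetical stand-in for Housing.csv: same column names, made-up values."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "longitude": rng.uniform(-124.0, -114.0, n_rows),
        "latitude": rng.uniform(32.0, 42.0, n_rows),
        "housing_median_age": rng.integers(1, 53, n_rows).astype(float),
        "total_rooms": rng.integers(100, 5000, n_rows).astype(float),
        "total_bedrooms": rng.integers(20, 1000, n_rows).astype(float),
        "population": rng.integers(50, 3000, n_rows).astype(float),
        "households": rng.integers(20, 1000, n_rows).astype(float),
        "median_income": rng.uniform(0.5, 15.0, n_rows),
        "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n_rows),
    })
    noise = rng.normal(0.0, 20000.0, n_rows)
    df["median_house_value"] = 40000.0 * df["median_income"] + noise
    # Blank out a few values so the imputation step has something to fill.
    df.loc[df.index[::25], "total_bedrooms"] = np.nan
    return df


def stratified_split(df, test_size=0.2, seed=42):
    """Split so both sets share a similar income distribution."""
    # Bin edges are an assumption made for this sketch.
    income_cat = pd.cut(df["median_income"],
                        bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=False)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=seed)
    train_idx, test_idx = next(splitter.split(df, income_cat))
    return df.iloc[train_idx], df.iloc[test_idx]


def prepare_features(train, test, target="median_house_value"):
    """Fill missing total_bedrooms with the median and one-hot encode ocean_proximity."""
    # The median is taken from the training set only (a choice made here).
    bedrooms_median = train["total_bedrooms"].median()
    encoded = []
    for part in (train, test):
        part = part.copy()
        part["total_bedrooms"] = part["total_bedrooms"].fillna(bedrooms_median)
        encoded.append(pd.get_dummies(part, columns=["ocean_proximity"], dtype=float))
    X_train = encoded[0].drop(columns=[target])
    # Align test columns with train in case a category is absent from the test set.
    X_test = encoded[1].drop(columns=[target]).reindex(columns=X_train.columns,
                                                       fill_value=0.0)
    return X_train, train[target], X_test, test[target]


df = make_synthetic_housing()
train_set, test_set = stratified_split(df)
X_train, y_train, X_test, y_test = prepare_features(train_set, test_set)

model = LinearRegression().fit(X_train, y_train)
sample_predictions = model.predict(X_test.iloc[:5])
print("Predictions:", sample_predictions.round(2))
print("Actual values:", y_test.iloc[:5].to_numpy().round(2))
```

To try this on the real data, `make_synthetic_housing()` could be swapped for `pd.read_csv("Housing.csv")`, assuming the file uses the column names listed earlier.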
Sample predictions vs. actual values:
- Predictions: [88983.15, 305351.35, 153334.71, 184302.55, 246840.19]
- Actual values: [72100., 279600., 82700., 112500., 238300.]
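
One way to turn the sample above into a single error figure is to compute the mean absolute error and root mean squared error by hand. The sketch below uses only the five values listed; the choice of these two metrics is an assumption, since the source does not say how performance is scored, and five rows are too few to characterize the model.

```python
import math

# Values copied from the sample output above.
predictions = [88983.15, 305351.35, 153334.71, 184302.55, 246840.19]
actual = [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]

# Signed error per row: positive means the model predicted too high.
errors = [p - a for p, a in zip(predictions, actual)]

mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE weights large misses more heavily than MAE, so a gap between the two suggests that a few rows account for much of the error.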
- Clone this repository.
- Install required packages: `pip install -r requirements.txt`
- Run the script: `python main.py`