# House Price Prediction Project: Environment Setup

This notebook documents the setup process and planning for an O.A machine learning project to predict house prices.
Project Goal: Build a regression model that predicts house prices from a range of features, using Kaggle's House Prices: Advanced Regression Techniques dataset.

Date: [February 2025]

Project Overview

Problem Statement
House prices are influenced by numerous factors including location, size, amenities, and market conditions. Manual estimation is complex and often inaccurate. Machine learning offers a solution by analyzing patterns in historical data to make accurate predictions.

Approach
This project will use supervised learning regression techniques to predict house prices. Based on the Week 1 lecture by Ainish, this falls under the supervised learning category since we have labeled data (known house prices) to train our model.

Dataset
The project will use Kaggle's House Prices: Advanced Regression Techniques dataset, which includes:
- Numeric features: square footage, number of rooms, year built, etc.
- Categorical features: neighborhood, house style, condition, etc.
- Target variable: the sale price of each house (`SalePrice`)
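
To make the numeric/categorical split concrete, here is a minimal sketch using pandas `select_dtypes`. The small DataFrame is made up for illustration (only its column names mirror the dataset), and `split_feature_types` is a hypothetical helper, not part of the project code.

```python
import pandas as pd


def split_feature_types(df, target="SalePrice"):
    """Return (numeric_columns, categorical_columns), excluding the target."""
    features = df.drop(columns=[target])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
    return numeric_cols, categorical_cols


# Made-up rows for illustration only; not taken from the real dataset.
toy_df = pd.DataFrame(
    {
        "LotArea": [8000, 9500, 11000],
        "YearBuilt": [1995, 2003, 1978],
        "Neighborhood": ["A", "B", "A"],
        "HouseStyle": ["1Story", "2Story", "1Story"],
        "SalePrice": [180000, 220000, 150000],
    }
)

numeric_cols, categorical_cols = split_feature_types(toy_df)
print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

On the real data the same call would be made on the DataFrame loaded from `train.csv`. Some integer-coded columns may still be categorical in meaning, so the result is a starting point rather than a final classification.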

Evaluation Metrics
As discussed in the lecture, we'll evaluate our regression model using:
- RMSE (Root Mean Squared Error)
- MAE (Mean Absolute Error)
- R² (R-squared)
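
As a rough illustration of what these three metrics measure, the sketch below computes them by hand in plain Python. The prices are made up, and `regression_metrics` is a hypothetical helper; in the project itself the equivalent functions from scikit-learn would likely be used instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for two equal-length sequences of numbers.

    Assumes y_true is not constant (otherwise R² is undefined).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),             # penalizes large errors more
        "mae": sum(abs(e) for e in errors) / n,    # average absolute error
        "r2": 1 - ss_res / ss_tot,                 # share of variance explained
    }


# Made-up prices for illustration only.
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 290000, 260000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Every prediction here is off by exactly 10,000, so RMSE and MAE are both 10,000 and R² is 0.968. RMSE and MAE are in the same units as the price, while R² is unitless.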

Based on Week 1 lecture materials, this project applies several key concepts:

 Artificial Intelligence & Machine Learning
This project uses machine learning, a subset of AI, to create a system that learns from housing data to make price predictions without being explicitly programmed with real estate valuation rules.

 Types of Machine Learning Used
- Supervised Learning: We're using labeled data (houses with known prices) to train a model that can predict prices for new houses.
- Regression: Since we're predicting a continuous value (house price) rather than categories, this is specifically a regression problem.

 Statistical Concepts
The project will apply statistical concepts such as:
- Correlation analysis between features and house prices
- Distribution analysis of numeric features
- Hypothesis testing to identify significant predictors
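
A first pass at the correlation analysis might look like the sketch below, which ranks numeric features by the strength of their Pearson correlation with the target. The values are made up for illustration; the real analysis would run on the DataFrame loaded from `train.csv`.

```python
import pandas as pd

# Made-up values for illustration only.
toy_df = pd.DataFrame(
    {
        "LotArea": [8000, 9500, 11000, 7000, 12000],
        "YearBuilt": [1995, 2003, 1978, 1960, 2010],
        "SalePrice": [180000, 220000, 200000, 120000, 300000],
    }
)

# Pearson correlation of each numeric feature with the target,
# sorted by absolute value so strong negative relationships also rank high.
correlations = (
    toy_df.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=abs, ascending=False)
)
print(correlations)

# Skewness gives a quick read on the shape of the target's distribution.
print("SalePrice skew:", toy_df["SalePrice"].skew())
```

Correlation only captures linear relationships, so a low value does not by itself mean a feature is useless.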

 Data Science Workflow
We'll follow the data science workflow outlined in the lecture:
1. Data collection and cleaning
2. Exploratory data analysis
3. Feature engineering
4. Model selection and training
5. Evaluation and optimization

In [None]:
# Check Python version and installed packages
import sys
from importlib.metadata import distributions

print(f"Python version: {sys.version}")

# List installed packages using importlib.metadata
# (pkg_resources is deprecated and may be missing on recent Python versions)
installed_packages_list = sorted(
    f"{dist.metadata['Name'].lower()}=={dist.version}"
    for dist in distributions()
    if dist.metadata["Name"]  # skip entries with broken metadata
)
print("\nInstalled packages:")
for package in installed_packages_list[:10]:  # Show first 10 packages
    print(f"  {package}")
print("  ...")

 Environment Setup

 Python and Virtual Environment
For this project, I'm using:
- Python 3.13.2 as the programming language
- Pipenv for virtual environment and package management

As discussed in the lecture, virtual environments allow:
- Isolation of project dependencies
- Consistent environments across different machines
- Prevention of package conflicts between projects
- Easy sharing of project requirements

 Setup Process
1. Installed Pipenv: `pip install pipenv`
2. Created a new environment: `pipenv install`
3. Installed data science packages: `pipenv install numpy pandas matplotlib seaborn scikit-learn jupyter`
4. Activated the environment: `pipenv shell`
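
After these steps, one quick way to confirm that the environment is active and the packages are visible is to check whether each one can be found by the import system. This is a minimal sketch: `check_modules` is a hypothetical helper, and the module names are the usual import names (scikit-learn imports as `sklearn`; `jupyter_core` is used here as a stand-in for the Jupyter install).

```python
from importlib.util import find_spec

# Import names for the packages installed above.
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "jupyter_core"]


def check_modules(module_names):
    """Map each module name to True if it can be found in the active environment."""
    return {name: find_spec(name) is not None for name in module_names}


for name, available in check_modules(REQUIRED_MODULES).items():
    print(f"{name}: {'OK' if available else 'MISSING'}")
```

If anything shows as MISSING, the notebook is probably running on a different interpreter than the one Pipenv created.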

 Version Control with Git
Based on the lecture, version control is essential for tracking changes and collaborating on projects.

 Git Setup
1. Initialized Git repository: `git init`
2. Created .gitignore file to exclude:
   - Virtual environment files (`/.venv/`)
   - Large datasets (`/data/*.csv`)
   - Jupyter checkpoints (`.ipynb_checkpoints/`, written without a leading slash so it also matches inside `notebooks/`)
   - Cached Python files (`__pycache__/`, `*.pyc`)

 GitHub Integration (Future)
In the next session, I'll connect this local repository to GitHub for:
- Remote backup
- Version history tracking
- Potential collaboration
- Project portfolio showcasing

 AWS Cloud Computing 

The lecture highlighted the importance of cloud computing for scalable data science projects.

 Planned AWS Services
1. S3 (Simple Storage Service)
   - Store large datasets
   - Share results and visualizations

2. EC2 (Elastic Compute Cloud)
   - Run computationally intensive models
   - Scale resources as needed

3. SageMaker (Potential)
   - Build, train, and deploy ML models
   - Access pre-built algorithms

 Implementation Timeline
AWS integration will be implemented in later stages of the project when we need additional computational resources or deployment options.

 Project Structure

I've organized the project with the following structure:
HousePricePrediction/
├── data/ # Dataset files
├── notebooks/ # Jupyter notebooks (including this file)
├── docs/ # Documentation
├── Pipfile # Pipenv package requirements
├── Pipfile.lock # Locked dependencies
└── README.md # Project overview
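
For reference, the same layout could be recreated with a short `pathlib` script. This is only a sketch: `create_project_skeleton` is a hypothetical helper, and the example runs in a temporary directory so it does not touch the real project.

```python
import tempfile
from pathlib import Path


def create_project_skeleton(root):
    """Create the folders and README from the layout above; return what exists."""
    root = Path(root)
    for folder in ("data", "notebooks", "docs"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Pipfile and Pipfile.lock are generated by Pipenv, so only README.md is created here.
    (root / "README.md").touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())


# Demonstrate in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    print(create_project_skeleton(Path(tmp) / "HousePricePrediction"))
```

Because every step uses `exist_ok=True`, running it again on an existing project leaves current files in place.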

 Reflections and Learnings

 Key Takeaways from Week 1
- I've learned that AI is the broader field of creating intelligent machines, ML is a subset focused on learning from data, statistics provides the mathematical foundation, and data science combines all these with domain expertise to extract insights.
- I now understand the distinctions between supervised learning (using labeled data to predict outcomes), unsupervised learning (finding patterns in unlabeled data), reinforcement learning (learning through trial and error), and semi-supervised learning (combining labeled and unlabeled data).
- I've gained experience setting up Python, pipenv for virtual environment management, and Jupyter Notebook as my IDE for this project.
- I've learned the importance of proper project organization, documentation, and planning before diving into implementation.

 Challenges Encountered
- The `python` command was not recognized in the Windows command prompt even though Python was installed. The cause was that the install directory was missing from the PATH environment variable.
- Initially, I found it challenging to grasp how pipenv manages dependencies and creates isolated environments.
- I encountered some confusion about how to start, stop, and create new notebooks in the Jupyter environment.

 How I Overcame These Challenges
- For Python PATH issues, I reinstalled Python with the "Add to PATH" option checked and verified the installation using alternative commands like `py --version`.
- I researched virtual environments through documentation and tutorials to better understand how pipenv isolates project dependencies.
- I learned Jupyter Notebook commands through trial and error, discovering how to use keyboard shortcuts to efficiently manage the notebook environment.
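
For the PATH problem in particular, a small diagnostic like the one below can show which commands the shell would actually find. It is a sketch under the assumption that the commands of interest are `python`, `py`, `pipenv`, and `jupyter`; `locate_commands` is a hypothetical helper built on the standard library's `shutil.which`.

```python
import shutil
import sys


def locate_commands(commands):
    """Return the full path that PATH resolves each command to, or None if not found."""
    return {command: shutil.which(command) for command in commands}


# "py" is the Windows launcher; the others depend on PATH being set up correctly.
for command, location in locate_commands(["python", "py", "pipenv", "jupyter"]).items():
    print(f"{command}: {location or 'not found on PATH'}")

# The interpreter actually running this code, regardless of PATH.
print(f"Current interpreter: {sys.executable}")
```

Comparing the `python` entry with `sys.executable` is a quick way to spot a notebook that is running outside the Pipenv environment.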

 Questions for Further Exploration
- How do different regression algorithms compare in performance for house price prediction tasks?
- What techniques are most effective for handling categorical variables in housing datasets?

In [1]:
import os

import pandas as pd

DATA_DIR = '../data'

# List files in the data directory (empty list if the folder is missing)
print("Files in the data directory:")
data_files = sorted(os.listdir(DATA_DIR)) if os.path.isdir(DATA_DIR) else []
print(data_files)

# Load the training dataset
train_data_path = os.path.join(DATA_DIR, 'train.csv')
if os.path.exists(train_data_path):
    train_df = pd.read_csv(train_data_path)
    print(f"\nTraining dataset loaded successfully with {train_df.shape[0]} rows and {train_df.shape[1]} columns")

    # Display the first few rows
    print("\nFirst 5 rows of the training dataset:")
    print(train_df.head())
else:
    print(f"\nError: Could not find {train_data_path}")

Files in the data directory:
['data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']

Training dataset loaded successfully with 1460 rows and 81 columns

First 5 rows of the training dataset:
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    All