# Report: Predicting House Prices in California Housing Dataset

## Introduction:

The objective of this report is to document the process of implementing and comparing the performance of two models, Linear Regression and an Artificial Neural Network (ANN), for predicting house prices using the California Housing dataset.

### Dataset Overview:

- The California Housing dataset contains district-level features describing housing in California.
- Features include median income, house age, population, and location (latitude and longitude), among others.
- The target variable is the median house value for California districts.

## Methodology:

### 1. Data Preprocessing:

#### Data Exploration:

- Explored the dataset's features using summary statistics and basic column information (data types and non-null counts).
- Identified potential outliers and missing values.
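
The exploration step can be sketched as a small helper. This is a minimal illustration, assuming the data is held in a pandas DataFrame; the function name `summarize` is hypothetical and not taken from the original work.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column summary statistics plus a missing-value count."""
    # describe() covers numeric columns: count, mean, std, min, quartiles, max
    summary = df.describe().T
    # Align the missing-value counts to the same columns by index
    summary["n_missing"] = df.isna().sum()
    return summary
```

Calling `summarize(df)` on the housing DataFrame gives one row per numeric feature, which makes extreme minimum or maximum values and columns with missing entries easy to spot.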

#### Outlier Handling:

- Used the z-score method from the `scipy.stats` module to handle outliers.
- Removed data points whose absolute z-score exceeded 3.
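
A minimal sketch of this filter is shown here. It assumes the z-scores are computed per feature column and that a row is dropped if any of its features exceeds the threshold; the report does not say whether the target was also included. The function name `remove_outliers` is hypothetical.

```python
import numpy as np
from scipy import stats


def remove_outliers(X: np.ndarray, y: np.ndarray, threshold: float = 3.0):
    """Drop rows where any feature's absolute z-score exceeds `threshold`."""
    # Standardize each column, then take absolute values
    z = np.abs(stats.zscore(X, axis=0))
    # Keep a row only if every feature is within the threshold
    keep = (z <= threshold).all(axis=1)
    return X[keep], y[keep]
```

The same boolean mask is applied to `X` and `y` so that features and targets stay aligned after filtering.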

#### Data Splitting:

- Split the dataset into training and testing sets using an 80-20 split.
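
One way to produce such a split with scikit-learn is sketched below. The fixed `seed` is an assumption added for reproducibility; the report does not state whether a random seed was set. The function name `split_80_20` is hypothetical.

```python
from sklearn.model_selection import train_test_split


def split_80_20(X, y, seed: int = 42):
    """Return X_train, X_test, y_train, y_test with 20% held out for testing."""
    # test_size=0.2 reserves one fifth of the rows for evaluation
    return train_test_split(X, y, test_size=0.2, random_state=seed)
```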

### 2. Linear Regression:

#### Model Implementation:

- Implemented a Linear Regression model using scikit-learn.
- Trained the model on the training set and made predictions on the testing set.
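
The fit-and-predict step might look like the following sketch, which uses scikit-learn's `LinearRegression` with its default settings. The function name `fit_linear_regression` is hypothetical.

```python
from sklearn.linear_model import LinearRegression


def fit_linear_regression(X_train, y_train, X_test):
    """Fit ordinary least squares on the training set and predict the test set."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, y_pred
```

The fitted `model.coef_` and `model.intercept_` can be inspected afterwards, which is part of what makes this model easy to interpret.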

#### Performance Evaluation:

- Evaluated the model's performance using metrics:
  - Mean Squared Error (MSE)
  - R2 Score
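
Both metrics are available in `sklearn.metrics`. The helper below is a small sketch that returns them together so the same function can be reused for any model's predictions; the name `evaluate` is hypothetical.

```python
from sklearn.metrics import mean_squared_error, r2_score


def evaluate(y_true, y_pred) -> dict:
    """Return the MSE and R2 score for one set of predictions."""
    return {
        "mse": float(mean_squared_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }
```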

### 3. Artificial Neural Network (ANN):

#### Model Implementation:

- Implemented a simple ANN for regression using TensorFlow/Keras.
- Designed the architecture with one fully connected layer of 64 ReLU neurons that receives the input features, followed by an output layer (1 neuron, linear activation).
- Trained the ANN on the training set.
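
The architecture described above can be sketched in Keras as follows. The layer sizes and activations come from the report; the Adam optimizer, MSE loss, epoch count, and batch size are assumptions, since the report does not state them. The function names `build_ann` and `train_ann` are hypothetical.

```python
from tensorflow import keras


def build_ann(n_features: int) -> keras.Model:
    """Build a 64-unit ReLU layer followed by a single linear output unit."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    # Assumed training setup: Adam optimizer with mean squared error loss
    model.compile(optimizer="adam", loss="mse")
    return model


def train_ann(model: keras.Model, X_train, y_train, epochs: int = 50, batch_size: int = 32):
    """Train the network; epochs and batch_size are illustrative defaults."""
    return model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
```

For the dataset's feature matrix, `build_ann(X_train.shape[1])` creates the model and `train_ann(model, X_train, y_train)` returns the Keras training history.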

#### Performance Evaluation:

- Evaluated the ANN's performance using the same metrics as Linear Regression.

### 4. Comparison and Analysis:

- Compared the performance metrics of Linear Regression and ANN.
- Discussed the strengths and weaknesses of each model.
- Analyzed whether the complexity of the ANN provides better predictive performance compared to Linear Regression.

### 5. Visualization:

- Created scatter plots to compare predicted values of Linear Regression and ANN with actual values.
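
A plot of this kind could be produced with matplotlib as sketched here. The side-by-side layout, axis labels, and diagonal reference line are assumptions about presentation, not a description of the original figures. The function name `plot_predictions` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_predictions(y_true, preds_by_model: dict):
    """Draw one predicted-vs-actual scatter plot per model, side by side."""
    n = len(preds_by_model)
    fig, axes = plt.subplots(1, n, figsize=(6 * n, 5), squeeze=False)
    lo, hi = float(np.min(y_true)), float(np.max(y_true))
    for ax, (name, y_pred) in zip(axes[0], preds_by_model.items()):
        ax.scatter(y_true, y_pred, s=8, alpha=0.4)
        # Points on this diagonal are perfect predictions
        ax.plot([lo, hi], [lo, hi], color="red", linewidth=1)
        ax.set_xlabel("Actual median house value")
        ax.set_ylabel("Predicted median house value")
        ax.set_title(name)
    fig.tight_layout()
    return fig
```

Passing a dictionary such as `{"Linear Regression": y_pred_lr, "ANN": y_pred_ann}` puts both models on the same axis ranges, so the spread around the diagonal can be compared directly.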

## Results:

### Model Performance:

#### Linear Regression Metrics:
   - Mean Squared Error (Linear Regression): 0.4234
   - R2 Score (Linear Regression): 0.6511

#### ANN Metrics:
   - Mean Squared Error (ANN): 2.2342
   - R2 Score (ANN): -0.8123

### Insights:

- The simpler Linear Regression model outperformed the more complex ANN in this scenario.
- The negative R2 Score for the ANN means it fit the test data worse than a baseline that always predicts the mean house value.
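
For reference, the standard definition of the R2 Score shows why it can drop below zero. Here $y_i$ are the actual test values, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the actual values:

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

The denominator is the squared error of a constant prediction $\bar{y}$. Whenever the model's squared error (the numerator) is larger than that, the ratio exceeds 1 and $R^2$ becomes negative.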

### Challenges Encountered:

1. **Model Tuning:**
   - Hyperparameters of the ANN might need further tuning.
   - Current architecture may not be optimal for this dataset.

2. **Overfitting:**
   - The ANN may suffer from overfitting, requiring regularization techniques or adjustments to the model architecture; confirming this would require comparing training and test error.

3. **Data Characteristics:**
   - Understanding the nature of the data is crucial; the dataset may not contain patterns complex enough to benefit from an ANN.
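
For the overfitting concern listed above, one possible set of regularization measures is sketched below: an L2 weight penalty, dropout, and early stopping on a validation split. None of these were part of the reported experiment, and every value shown (penalty strength, dropout rate, patience, validation fraction) is an illustrative assumption. The function names are hypothetical.

```python
from tensorflow import keras


def build_regularized_ann(n_features: int, l2_strength: float = 1e-4,
                          dropout_rate: float = 0.2) -> keras.Model:
    """Same 64-unit architecture, with an L2 penalty and dropout added."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(
            64,
            activation="relu",
            kernel_regularizer=keras.regularizers.l2(l2_strength),
        ),
        keras.layers.Dropout(dropout_rate),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def fit_with_early_stopping(model: keras.Model, X_train, y_train, epochs: int = 200):
    """Stop training once the validation loss stops improving."""
    stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    )
    return model.fit(
        X_train, y_train,
        validation_split=0.2,  # hold out part of the training set for monitoring
        epochs=epochs,
        batch_size=32,
        callbacks=[stop],
        verbose=0,
    )
```

The returned history contains both `loss` and `val_loss`, so a widening gap between the two curves would be direct evidence of overfitting.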

## Conclusion:

- While the ANN has potential for capturing intricate patterns, the provided metrics indicate that, in this case, the simplicity of Linear Regression resulted in better predictive performance.
- Further refinement of the ANN's architecture, regularization, and a deeper understanding of the data may lead to improved results.
- The choice between models depends on considerations such as interpretability, computational efficiency, and the nature of the data.
