# California Housing Dataset (CSV Version) - Data Analysis Exercise

Dataset file: `california_housing_train.csv`

### Instructions:
- Load the dataset using pandas
- Use matplotlib for visualization
- Provide an interpretation for each plot


### 1. Load the dataset from the CSV file and display the first 5 rows.

### 2. Display dataset shape, column names, and data types.

### 3. Check for missing values in each column.

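One possible way to approach tasks 1 to 3 is sketched here. The rows in `SAMPLE_CSV` are made-up stand-ins (an assumption, not real data) so the snippet runs without the file; the column order is also assumed. For the exercise itself, read `california_housing_train.csv` directly.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs without the real file.
# For the exercise itself: df = pd.read_csv("california_housing_train.csv")
SAMPLE_CSV = """\
longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
-122.0,37.5,30.0,2000.0,400.0,1000.0,350.0,5.0,300000.0
-118.0,34.0,20.0,3000.0,,1500.0,500.0,3.0,200000.0
-121.0,38.5,10.0,1500.0,300.0,800.0,280.0,4.0,150000.0
-117.0,32.7,40.0,2500.0,550.0,1200.0,450.0,2.5,180000.0
-119.5,36.5,25.0,1000.0,220.0,600.0,200.0,2.0,90000.0
-122.4,37.8,52.0,1800.0,450.0,900.0,400.0,6.0,450000.0
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Task 1: first 5 rows
print(df.head())

# Task 2: shape, column names, data types
print(df.shape)
print(df.columns.tolist())
print(df.dtypes)

# Task 3: number of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

The stand-in data deliberately leaves one `total_bedrooms` cell empty so that the missing-value check has something to report.
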
### 4. Generate descriptive statistics for all numerical columns.

### 5. Which feature has the highest mean value? Which has the lowest?

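Tasks 4 and 5 might look like the following sketch. The DataFrame is synthetic (random values in assumed ranges) and only stands in for the real data, so the printed answer is not the answer for the actual dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; value ranges are assumptions for illustration only.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "median_income": rng.uniform(0.5, 15.0, n),
    "median_house_value": rng.uniform(15_000, 500_000, n),
})

# Task 4: count, mean, std, min, quartiles, max for each numerical column
stats = df.describe()
print(stats)

# Task 5: column with the highest and the lowest mean
means = df.mean(numeric_only=True)
highest, lowest = means.idxmax(), means.idxmin()
print(f"highest mean: {highest}, lowest mean: {lowest}")
```

Because the columns are measured in different units, this comparison mostly reflects scale, which is worth mentioning in the interpretation.
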
### 6. Compute the correlation matrix of the dataset.

### 7. Identify the top 3 features most correlated with median_house_value.

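A minimal sketch for tasks 6 and 7, again on synthetic stand-in data. The target here is constructed from income plus noise purely so the demo has a visible signal; that relationship is an assumption of the sketch, not a finding.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed ranges, not the real dataset).
rng = np.random.default_rng(1)
n = 300
income = rng.uniform(0.5, 15.0, n)
df = pd.DataFrame({
    "median_income": income,
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_rooms": rng.uniform(100, 10_000, n),
    "population": rng.uniform(100, 5_000, n),
    # Built from income plus noise only to give the demo a signal.
    "median_house_value": 40_000 * income + rng.normal(0, 50_000, n),
})

# Task 6: pairwise Pearson correlations between numerical columns
corr = df.corr(numeric_only=True)
print(corr)

# Task 7: correlations with the target, excluding the target itself
target_corr = corr["median_house_value"].drop("median_house_value")
# Rank by absolute value so strong negative correlations are not missed.
top3 = target_corr.abs().sort_values(ascending=False).head(3)
print(top3)
```

Ranking by absolute value is a design choice; if the exercise intends only positive correlations, sort `target_corr` directly instead.
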
### 8. Plot histogram of median_house_value and describe its distribution.

### 9. Plot histograms for all numerical features.

### 10. Create a scatter plot of median_income vs median_house_value. Interpret the relationship.

### 11. Create a scatter plot of housing_median_age vs median_house_value.

### 12. Create a boxplot of median_house_value and discuss possible outliers.

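For task 12, a boxplot can be paired with a count of the points beyond the whiskers. The sketch below uses the common 1.5 × IQR rule, which is also matplotlib's default whisker length. The values are synthetic (a right-skewed random sample), not the real column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for median_house_value.
rng = np.random.default_rng(2)
values = pd.Series(rng.lognormal(mean=12.0, sigma=0.5, size=500),
                   name="median_house_value")

fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_ylabel("median_house_value")
ax.set_title("Boxplot of median_house_value")

# Flag points outside 1.5 * IQR from the quartiles.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(f"{len(outliers)} possible outliers outside [{lower:.0f}, {upper:.0f}]")
```

In an interactive session, call `plt.show()` after building the figure.
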
### 13. Create a geographic scatter plot using longitude (x-axis) and latitude (y-axis).

### 14. Enhance the geographic scatter plot by coloring points based on median_house_value (use colormap).

### 15. Create a geographic scatter plot where:
- x-axis = longitude
- y-axis = latitude
- color = median_house_value
- size = median_income

Interpret spatial price patterns.

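The geographic plots in tasks 13 to 15 could be built along these lines. The coordinates and values are random stand-ins within assumed ranges, and the marker scale factor of 10 is an arbitrary choice for readability.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumed ranges, not the real dataset).
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, n),
    "latitude": rng.uniform(32.5, 42.0, n),
    "median_income": rng.uniform(0.5, 15.0, n),
    "median_house_value": rng.uniform(15_000, 500_000, n),
})

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(
    df["longitude"], df["latitude"],
    c=df["median_house_value"], cmap="viridis",  # color encodes house value
    s=df["median_income"] * 10,                  # size encodes income (arbitrary scale)
    alpha=0.5,
)
fig.colorbar(points, ax=ax, label="median_house_value")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("House value (color) and income (size) by location")
```

Dropping the `s=` argument gives the plot for task 14, and dropping both `c=` and `s=` gives the plain plot for task 13.
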
### 16. Create another geographic scatter plot where color represents housing_median_age.

### 17. Group data into 4 income quartiles using median_income and compute average median_house_value per group.

### 18. Create a bar chart of average house value per income quartile.

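Tasks 17 and 18 can be sketched with `pd.qcut`, which cuts a column into equal-sized bins by rank. The data and the quartile labels below are assumptions for illustration; the target is built from income plus noise so the bars differ visibly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(4)
n = 400
income = rng.uniform(0.5, 15.0, n)
df = pd.DataFrame({
    "median_income": income,
    "median_house_value": 40_000 * income + rng.normal(0, 50_000, n),
})

# Task 17: four equal-sized income groups and the mean house value in each
quartile_labels = ["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"]
df["income_quartile"] = pd.qcut(df["median_income"], q=4, labels=quartile_labels)
avg_value = df.groupby("income_quartile", observed=True)["median_house_value"].mean()
print(avg_value)

# Task 18: bar chart of the group means
fig, ax = plt.subplots()
ax.bar(list(avg_value.index.astype(str)), avg_value.to_numpy())
ax.set_xlabel("income quartile")
ax.set_ylabel("average median_house_value")
```
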
### 19. Compare the average median_house_value for rows whose housing_median_age is above 25 with those at or below 25.

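A possible sketch for task 19 follows. Which group receives an age of exactly 25 is a choice; here it joins the lower group. The data is synthetic and the group labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "median_house_value": rng.uniform(15_000, 500_000, n),
})

# Label each row by age group, then compare group means and sizes.
df["age_group"] = np.where(df["housing_median_age"] > 25, "above 25", "25 or below")
comparison = df.groupby("age_group")["median_house_value"].agg(["mean", "count"])
print(comparison)
```

Reporting the group sizes next to the means helps judge whether a difference rests on enough rows to be meaningful.
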
### 20. Write a short analytical conclusion summarizing:
   - Income effect
   - Geographic patterns
   - Age effect
   - Any detected outliers

## Preprocessing & Feature Engineering Section

### 21. Check for Duplicate Records
- Determine whether the dataset contains duplicate rows.
- If duplicates exist, remove them and report how many were removed.

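The duplicate check in task 21 might look like this. The small DataFrame is made up and contains two repeated rows on purpose.

```python
import pandas as pd

# Made-up rows; the 3rd and 5th repeat earlier rows on purpose.
df = pd.DataFrame({
    "median_income": [5.0, 3.0, 5.0, 4.0, 3.0],
    "median_house_value": [300_000.0, 200_000.0, 300_000.0, 150_000.0, 200_000.0],
})

n_before = len(df)
# duplicated() marks every repeat of an earlier row (the first occurrence is kept).
n_dupes = int(df.duplicated().sum())

df = df.drop_duplicates().reset_index(drop=True)
removed = n_before - len(df)
print(f"duplicates found: {n_dupes}, rows removed: {removed}")
```
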
### 22. Feature Engineering – Rooms Per Household
- Create a new feature:
  RoomsPerHousehold = total_rooms / households
- Display the first 5 rows including the new feature.
- Explain why this feature might be useful.

### 23. Feature Engineering – Bedrooms Ratio
- Create a new feature:
  BedroomsRatio = total_bedrooms / total_rooms
- What information does this ratio capture?

### 24. Feature Engineering – Population Per Household
- Create a new feature:
  PopulationPerHousehold = population / households
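- Compare its distribution with AveOccup, if available. (AveOccup is a column of the scikit-learn version of this dataset and is typically absent from the CSV.)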

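The three ratio features from tasks 22 to 24 are plain column divisions. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "total_rooms": [2000.0, 3000.0, 1500.0],
    "total_bedrooms": [400.0, 750.0, 300.0],
    "population": [1000.0, 1500.0, 900.0],
    "households": [400.0, 500.0, 300.0],
})

df["RoomsPerHousehold"] = df["total_rooms"] / df["households"]
df["BedroomsRatio"] = df["total_bedrooms"] / df["total_rooms"]
df["PopulationPerHousehold"] = df["population"] / df["households"]

print(df.head())
```

A row with zero households or zero rooms would produce `inf` or `NaN` here, so it is worth checking the new columns with `df.describe()` before modeling.
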
### 25. Feature-Target Separation
- Separate features (X) and target variable (y).

### 26. Train-Test Split
- Split the dataset into training (80%) and testing (20%) sets.
- Use a fixed random_state for reproducibility.
- Display the shapes of X_train, X_test, y_train, y_test.

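Tasks 25 and 26 could be sketched with scikit-learn's `train_test_split`. The data is synthetic, and `random_state=42` is just one arbitrary fixed seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset).
rng = np.random.default_rng(6)
n = 100
df = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_rooms": rng.uniform(100, 10_000, n),
    "median_house_value": rng.uniform(15_000, 500_000, n),
})

# Task 25: everything except the target is a feature
X = df.drop(columns="median_house_value")
y = df["median_house_value"]

# Task 26: 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```
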
### 27. Standard Scaling
- Apply StandardScaler to numerical features.
- Fit only on training data and transform both train and test sets.
- Verify the mean and standard deviation of scaled training features.

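The scaling step in task 27 can be sketched as below, on synthetic data. The key point is that the scaler learns its mean and standard deviation from the training rows only and then reuses them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and target (not the real dataset).
rng = np.random.default_rng(7)
n = 100
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 53, n).astype(float),
    "total_rooms": rng.uniform(100, 10_000, n),
})
y = pd.Series(rng.uniform(15_000, 500_000, n), name="median_house_value")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

# Verification: per-column mean ~ 0 and std ~ 1 on the training set
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The scaled test set will generally not have mean exactly 0 and standard deviation exactly 1, because it is transformed with the training statistics.
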
## 28. Deep Learning Regression Model

Using the preprocessed dataset:

### Task:
Create and train a Deep Learning regression model to predict `median_house_value`.

### Instructions:

1. Use the training and testing sets created earlier.
2. Build a neural network using PyTorch.
3. The model should:
   - Have at least 3 hidden layers
   - Use ReLU activation for hidden layers
4. Set up training (PyTorch has no separate compile step) using:
   - Loss function: Mean Squared Error (MSE)
   - Optimizer: Adam or SGD
5. Train the model for at least 50 epochs.
6. Plot training loss curves.

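The requirements above could be met with a small `nn.Sequential` model, as in this sketch. Several things here are assumptions for illustration: the tensors are synthetic stand-ins for the scaled features and target, the layer widths (64, 32, 16), the learning rate, and the feature count of 8 are arbitrary, and training uses the full batch at each step rather than mini-batches.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch import nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-ins for the scaled training features and the target.
n_samples, n_features = 256, 8
X_train = torch.tensor(rng.normal(size=(n_samples, n_features)), dtype=torch.float32)
true_w = torch.tensor(rng.normal(size=(n_features, 1)), dtype=torch.float32)
y_train = X_train @ true_w + 0.1 * torch.randn(n_samples, 1)

# Three hidden layers with ReLU, one linear output unit for regression.
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

train_losses = []
for epoch in range(50):
    model.train()
    optimizer.zero_grad()                    # clear gradients from the last step
    loss = loss_fn(model(X_train), y_train)  # full-batch forward pass
    loss.backward()                          # backpropagate
    optimizer.step()                         # update weights
    train_losses.append(loss.item())

# Training loss curve
fig, ax = plt.subplots()
ax.plot(range(1, len(train_losses) + 1), train_losses)
ax.set_xlabel("epoch")
ax.set_ylabel("MSE loss")
ax.set_title("Training loss")
```

With the real data, the target must be a float tensor of shape `(n, 1)` to match the model output. Since `median_house_value` is on a much larger scale than the scaled features, rescaling the target may also help training, though that goes beyond what the task requires.
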