# Notebook 2: Data Analysis

## Objectives
- Load the data collected in the previous notebook.
- Perform exploratory data analysis (EDA) to find features correlated with `SalePrice`.
- Visualize the relationships between key features and `SalePrice`.

## Inputs
- The Ames Housing dataset from `01_data_collection.ipynb` (re-loaded for consistency).

## Outputs
- Correlation analysis results.
- Visualizations of key features.
- Insights for feature selection in the next stage.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

### Load the dataset

In [None]:
url = 'https://raw.githubusercontent.com/INRIA/scikit-learn-mooc/main/datasets/ames_housing_no_missing.csv'
df = pd.read_csv(url)

### Data Cleaning

As noted in the previous notebook, this version of the dataset has no missing values, so no imputation is needed. We can proceed directly to analysis.
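
The no-missing-values claim is cheap to verify before relying on it. The following is a minimal sketch; the helper name `missing_value_report` and the toy frame are hypothetical and only illustrate the shape of the output.

```python
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame with made-up values (not taken from the Ames dataset).
toy = pd.DataFrame(
    {
        "GrLivArea": [1500, None, 2100],
        "SalePrice": [200000, 150000, 310000],
    }
)
report = missing_value_report(toy)
print(report)
```

In this notebook, `missing_value_report(df)` should return an empty Series if the dataset really has no missing values.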

### Correlation Analysis

Let's find the features that are most correlated with `SalePrice`.

In [None]:
# Drop SalePrice's own entry (always 1.0) so it does not take a top-10 slot.
correlations = df.corr(numeric_only=True)['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print("Top 10 features most correlated with SalePrice:")
print(correlations.head(10))

Based on the correlation results, `OverallQual`, `GrLivArea`, and `GarageCars` are strongly positively correlated with `SalePrice`. Let's visualize these relationships.
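
Sorting in descending order surfaces only positive correlations, so a strongly negative feature would sit at the bottom of the list. One possible way to catch both directions is to rank by absolute correlation. This is a sketch: the helper `rank_by_abs_correlation` and the toy columns are hypothetical, and the values are made up.

```python
import pandas as pd


def rank_by_abs_correlation(frame: pd.DataFrame, target: str, k: int = 10) -> pd.Series:
    """Return the k features with the largest |Pearson r| against the target."""
    # Exclude the target itself, whose self-correlation is always 1.0.
    corr = frame.corr(numeric_only=True)[target].drop(labels=target)
    # Order by magnitude but keep the signed values for interpretation.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(k)


# Toy frame with made-up values (not taken from the Ames dataset).
toy = pd.DataFrame(
    {
        "quality": [3, 5, 6, 8, 9],
        "age": [60, 40, 35, 10, 5],
        "noise": [1, -1, 1, -1, 1],
        "price": [90, 140, 160, 250, 300],
    }
)
ranked = rank_by_abs_correlation(toy, "price", k=2)
print(ranked)
```

Applied here, the call would be `rank_by_abs_correlation(df, 'SalePrice')`.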

### Visualization 1: `SalePrice` vs. `OverallQual`

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df)
plt.title('SalePrice vs. Overall Quality')
plt.xlabel('Overall Quality')
plt.ylabel('Sale Price')
plt.show()

As expected, there is a clear positive relationship: as the overall quality of the house increases, so does the sale price.
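
The trend read off the box plot can also be checked numerically by comparing the median price at each quality level. This is a minimal sketch; `median_by_group` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def median_by_group(frame: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Median of value_col for each level of group_col, ordered by level."""
    return frame.groupby(group_col)[value_col].median().sort_index()


# Toy frame with made-up values (not taken from the Ames dataset).
toy = pd.DataFrame(
    {
        "OverallQual": [5, 5, 7, 7, 7],
        "SalePrice": [120000, 140000, 200000, 220000, 260000],
    }
)
medians = median_by_group(toy, "OverallQual", "SalePrice")
print(medians)
# True when each quality level has a median at least as high as the one before.
print(medians.is_monotonic_increasing)
```

In this notebook, `median_by_group(df, 'OverallQual', 'SalePrice')` gives one median per quality level, which is the centre line of each box in the plot.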

### Visualization 2: `SalePrice` vs. `GrLivArea`

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df, alpha=0.6)
plt.title('SalePrice vs. Above Grade Living Area (GrLivArea)')
plt.xlabel('Above Grade Living Area (sq. ft.)')
plt.ylabel('Sale Price')
plt.show()

This scatter plot shows a positive, roughly linear relationship: larger living areas generally correspond to higher sale prices.
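
To put a number on how linear the relationship is, one option is to fit a straight line by least squares and report the coefficient of determination (R²). This is a sketch under that assumption; `linear_fit_summary` is a hypothetical helper and the demo data are synthetic.

```python
import numpy as np


def linear_fit_summary(x, y):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, r2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic, perfectly linear data (not taken from the Ames dataset).
area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
price = 100.0 * area + 20000.0
slope, intercept, r2 = linear_fit_summary(area, price)
print(slope, intercept, r2)
```

In this notebook the call would be `linear_fit_summary(df['GrLivArea'], df['SalePrice'])`. An R² well below 1 would suggest that the scatter around the line is substantial.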

### Visualization 3: Distribution of `SalePrice`

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['SalePrice'], kde=True)
plt.title('Distribution of SalePrice')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()

`SalePrice` is right-skewed, which is common for price data. More advanced models may benefit from a log transformation of the target, but for our current scope the untransformed `SalePrice` is a good starting point.
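
The effect of a log transformation on skewness can be measured directly. The sketch below compares sample skewness before and after `np.log1p`; the helper `skew_before_after_log` is hypothetical and the toy prices are made up.

```python
import numpy as np
import pandas as pd


def skew_before_after_log(values):
    """Return (skew of raw values, skew of log1p-transformed values)."""
    raw = pd.Series(values, dtype=float)
    logged = np.log1p(raw)  # log(1 + x), safe for zero values
    return raw.skew(), logged.skew()


# Toy prices with one large value in the right tail (made-up values).
toy_prices = [90_000, 110_000, 120_000, 135_000, 150_000, 180_000, 250_000, 600_000]
raw_skew, log_skew = skew_before_after_log(toy_prices)
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

In this notebook the call would be `skew_before_after_log(df['SalePrice'])`. A skew closer to zero after the transform would support using the log target in later modelling.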

## Conclusion

Our analysis confirms strong relationships between `SalePrice` and several key features, most notably `OverallQual` and `GrLivArea`. These features will be excellent candidates for our predictive model.

We have fulfilled the objectives of this notebook and are ready to proceed to model training.