# **Data Collection Notebook**

## Objectives

* Explore and analyze the Ames, Iowa housing dataset to identify how house attributes correlate with sale price.
* Create visualizations to highlight the strongest correlations.
* Prepare the data (cleaning, encoding, feature selection) for modeling.

## Inputs

* Raw data file: `house_prices_records.csv` from the Kaggle dataset `codeinstitute/housing-prices-data` (Ames, Iowa housing data).
* Python libraries: pandas, numpy, matplotlib, seaborn.

## Outputs

* Correlation plots between numerical/categorical features and `SalePrice`.
* Cleaned and preprocessed dataset saved as `outputs/datasets/collection/HousePrices.csv`.
* Summary of top predictive features for regression modeling.

## Additional Comments

* This notebook supports the business requirement of identifying key variables that influence house sale price.
* Results will be used in the model training phase and integrated into the dashboard later in the project.



---

# Change working directory

* The notebooks are assumed to live in a subfolder of the project, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
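
One caveat with the cells above: re-running the `os.chdir` cell moves up one more level each time. Below is a minimal sketch of a variant that is safe to re-run. It assumes the notebooks live in a subfolder named `jupyter_notebooks`; that folder name and the helper `project_root_from` are hypothetical and should be adapted to the actual project layout.

```python
import os

def project_root_from(path, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of `path` if `path` is the notebooks folder,
    otherwise return `path` unchanged (hypothetical helper)."""
    path = os.path.normpath(path)
    if os.path.basename(path) == notebooks_dirname:
        return os.path.dirname(path)
    return path

# Only the first run changes the directory; later runs are no-ops.
os.chdir(project_root_from(os.getcwd()))
print(os.getcwd())
```

Checking the folder name before moving up keeps the cell idempotent, which fits a notebook that is meant to run top-down but is often re-executed during development.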

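# Install the Kaggle package

Install the `kaggle` command-line client, pinned to version 1.5.12, so the dataset can be downloaded from Kaggle.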

In [None]:
%pip install kaggle==1.5.12

---

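# Fetch data from Kaggle

Make the Kaggle API token (`kaggle.json`) available: rename it if the browser saved it as `kaggle (1).json`, restrict its file permissions, and point `KAGGLE_CONFIG_DIR` at the current directory.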

In [None]:
import os

if os.path.exists("kaggle (1).json"):
    os.rename("kaggle (1).json", "kaggle.json")
!chmod 600 kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()


Download the dataset into `inputs/datasets/raw` and unzip it.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

import os
os.makedirs(DestinationFolder, exist_ok=True)

!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder} --unzip


---

In [None]:
import pandas as pd
# Load the raw records downloaded from Kaggle
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

In [None]:
df.info()

In [None]:
# Rows whose SalePrice repeats an earlier row (not necessarily duplicate records)
df[df.duplicated(subset=['SalePrice'])]
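
Two different houses can legitimately sell for the same price, so rows flagged on `SalePrice` alone are not necessarily duplicate records. The sketch below contrasts the subset check with a full-row check. The toy frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented rows: all three share a price, but only rows 0 and 2 are identical.
toy = pd.DataFrame({
    "GrLivArea": [1500, 1800, 1500],
    "SalePrice": [200000, 200000, 200000],
})

same_price = toy.duplicated(subset=["SalePrice"])  # compares SalePrice only
same_row = toy.duplicated()                        # compares every column

print("repeated price:", int(same_price.sum()))
print("repeated full row:", int(same_row.sum()))
```

On the real data, `df.duplicated().sum()` would be the corresponding check for fully duplicated records.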


In [None]:
df['SalePrice'] = pd.to_numeric(df['SalePrice'], errors='coerce')


In [None]:
df['SalePrice'].dtype
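
Because `errors='coerce'` silently turns every unparseable value into `NaN`, it can be worth counting how many values were lost in the conversion. A small sketch on invented strings (not values from the dataset):

```python
import pandas as pd

# Invented raw values: two clean numbers and two that cannot be parsed.
raw = pd.Series(["208500", "181500", "n/a", "223,500"])

converted = pd.to_numeric(raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards.
n_lost = int(converted.isna().sum() - raw.isna().sum())
print("values coerced to NaN:", n_lost)
```

If the count is non-zero on the real column, inspecting the offending rows before conversion may be preferable to dropping them unseen.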


In [None]:
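# Inspect the raw categories before encoding
print(df['BsmtExposure'].unique())
# Ordinal encoding by increasing exposure: No < Mn (minimum) < Av < Gd
df['BsmtExposure'] = df['BsmtExposure'].replace({"No": 0, "Mn": 1, "Av": 2, "Gd": 3})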


In [None]:
df['BsmtExposure'].dtype
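
For an ordinal encoding, the integer codes should follow the category order. The sketch below assumes the exposure levels rank as No < Mn < Av < Gd (in the Ames data dictionary, `Mn` denotes minimum exposure). It uses `Series.map` instead of `replace` so that any label missing from the mapping shows up as `NaN` rather than passing through unchanged. The toy values, including the literal string `"None"`, are invented for illustration.

```python
import pandas as pd

# Assumed ordinal order: No < Mn < Av < Gd
exposure_order = {"No": 0, "Mn": 1, "Av": 2, "Gd": 3}

toy = pd.Series(["Gd", "No", "Mn", "Av", None, "None"])
encoded = toy.map(exposure_order)

# Labels that were present but had no entry in the mapping.
unmapped = toy[encoded.isna() & toy.notna()].unique().tolist()
print("unmapped labels:", unmapped)
```

Listing the unmapped labels makes it easier to decide explicitly how to treat them, for example as a separate "no basement" level.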


In [None]:
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
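
Mean imputation is sensitive to outliers, so comparing it against the median before committing to one may be useful. A sketch on invented frontage values (not taken from the dataset), where one large value pulls the mean upward:

```python
import pandas as pd

# Invented values with one missing entry and one outlier.
frontage = pd.Series([60.0, 65.0, 70.0, None, 313.0])

mean_filled = frontage.fillna(frontage.mean())
median_filled = frontage.fillna(frontage.median())

print("filled with mean:", mean_filled.iloc[3])
print("filled with median:", median_filled.iloc[3])
```

Whether the mean or the median suits `LotFrontage` better depends on how skewed the real column is, which `df['LotFrontage'].describe()` can indicate.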


In [None]:
df = df.dropna(subset=['BsmtExposure', 'GarageFinish'])
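
Dropping rows reduces the sample, so it helps to know how many rows each column costs before calling `dropna`. A minimal sketch on an invented frame (the values are not from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtExposure": [0, None, 2, 1],
    "GarageFinish": ["Fin", "Unf", None, "RFn"],
    "SalePrice": [200000, 150000, 180000, 210000],
})

# Missing values per column that the drop will act on.
missing_per_column = toy[["BsmtExposure", "GarageFinish"]].isna().sum()

kept = toy.dropna(subset=["BsmtExposure", "GarageFinish"])
rows_dropped = len(toy) - len(kept)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

If a large share of rows would be lost on the real data, imputing or encoding the missing category could be considered instead.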



---

# Push files to Repo

* Save the cleaned dataset to `outputs/datasets/collection` so it can be committed to the repository.

In [None]:
import os

try:
    os.makedirs(name='outputs/datasets/collection', exist_ok=True)
except Exception as e:
    print(e)

df.to_csv("outputs/datasets/collection/HousePrices.csv", index=False)
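
After saving, a quick round-trip read can confirm that the file was written as expected. The sketch below uses a temporary directory and an invented two-row frame so that it runs on its own. In the notebook, the same comparison could be made between `df` and the saved `HousePrices.csv`.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the cleaned DataFrame.
toy = pd.DataFrame({"LotFrontage": [60.0, 65.5], "SalePrice": [200000, 150000]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "HousePrices.csv")
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("round trip OK:", same_shape and same_columns)
```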
