# **Data Collection Notebook**

## Objectives

- **Fetch data from Kaggle**: Fetch the dataset from Kaggle and save the data as raw data.
- **Inspect Data**: Load and inspect the data.
- **Save Data**: Save the data under outputs/datasets/collection.
- **Push files to Repo**: Push the files to the GitHub repository.

## Inputs

- **Kaggle JSON File**: A JSON authentication token necessary for accessing datasets on Kaggle.
- **Kaggle House Prices Dataset**: Access the dataset provided by the Code Institute, available on [Kaggle](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data).

## Outputs

- **Generated Dataset**: The generated dataset will be saved in outputs/datasets/collection.

## Additional Comments

- In this project, the dataset for housing prices is sourced from **Kaggle** and is publicly available. In a real-world scenario, however, data related to housing prices would typically be considered sensitive and may require enhanced privacy measures.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
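
The same step can be sketched with `pathlib` instead of `os.path`. This is a minimal illustration, not the notebook's own code, and the helper name `move_to_parent` is hypothetical. As with the cells above, every call moves up one more level, so it should be run only once per session.

```python
import os
from pathlib import Path


def move_to_parent() -> Path:
    """Make the parent of the current working directory the new working directory."""
    parent = Path.cwd().parent  # parent folder of the current directory
    os.chdir(parent)            # switch the process to that folder
    return parent
```

Calling `move_to_parent()` and then `Path.cwd()` should show the parent folder.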

---

# Python Packages

- The following command installs the necessary Python packages in the notebook.

In [None]:
! pip3 install -r requirements.txt

---

# Fetch data from Kaggle

- The following command installs the Kaggle library used to fetch the data.

In [None]:
! pip3 install kaggle==1.5.12

Next, we need to obtain a JSON file from Kaggle, which serves as an authentication token required to securely access and download the necessary datasets for this project.

**Steps to Obtain and Use the Kaggle JSON File:**

1. Create a Kaggle Account: <br>
    - If you do not already have an account, you can sign up at [Kaggle](https://www.kaggle.com/).

2. Generate API Token: <br>
    - Navigate to your account settings by clicking on your profile picture at the top right, and select "Account" from the menu.
    - Scroll down to the "API" section and click on "Create New API Token".
    - This will trigger the download of a kaggle.json file containing your API credentials.

3. Place the JSON File in the Root Directory: <br>
    - Locate the kaggle.json file in your default downloads folder or wherever your browser typically saves downloaded files.
    - Move this file to the root directory of your project or the designated directory required by your environment setup. <br>

    ![JSON](../media/kagglejson.png)

Next, we will execute the following code to ensure that the token is recognized during this session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
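
Before downloading, it can help to confirm that the token file is actually in place. The sketch below assumes the standard `kaggle.json` layout with a `username` and a `key` field; the helper name `check_kaggle_token` is hypothetical and is not part of the Kaggle library.

```python
import json
from pathlib import Path


def check_kaggle_token(config_dir: str) -> bool:
    """Return True if kaggle.json exists in config_dir and holds both credential fields."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False  # file exists but is not valid JSON
    # Assumption: a valid token is a JSON object with "username" and "key"
    return isinstance(token, dict) and {"username", "key"}.issubset(token)
```

For example, `check_kaggle_token(os.getcwd())` could be run right after setting `KAGGLE_CONFIG_DIR`.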

As mentioned in the "Inputs" section of this notebook, the project utilizes the following [Housing Prices Dataset](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data) available on Kaggle.

The code below accomplishes the following tasks:
- **Define the Kaggle Path**: This is the path that appears after www.kaggle.com/datasets/ in the URL, identifying the specific dataset.
- **Set the Destination Folder**: Specifies where the dataset will be downloaded on your local machine/cloud environment.
- **Download the Data**: Executes the download process to retrieve the dataset from Kaggle.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

After downloading the files, execute the following code cell to manage the files efficiently:
- **Unzip the Downloaded File**: This code will automatically extract the contents of the downloaded ZIP file to your specified directory.
- **Delete the ZIP File**: The code will remove the ZIP file from your directory to clear up space, as it is no longer needed after extraction.
- **Delete the kaggle.json File**: For security purposes, the code will also delete the kaggle.json file from your working directory to prevent unauthorized access to your Kaggle account.

By running this code cell, you ensure that only necessary files are retained and sensitive information is securely managed.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
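
Where the `unzip` and `rm` shell commands are unavailable (for example on Windows), the extraction step could be sketched in pure Python with the standard `zipfile` module. The helper name `extract_and_remove_zips` is hypothetical, and this sketch deliberately leaves `kaggle.json` alone, so the token would still need to be deleted separately.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list:
    """Extract every ZIP archive in `folder` into `folder`, then delete the archive."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)            # unpack next to the archive
            extracted.extend(zf.namelist())  # remember what was unpacked
        archive.unlink()  # delete the archive only after extraction succeeded
    return extracted
```

With the variables defined earlier, the call would be `extract_and_remove_zips(DestinationFolder)`.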

---

# Data Inspection

After downloading and preparing the datasets, we have two primary datasets available:

- **House Prices Records**: This dataset includes general market data for houses in Ames, Iowa.
- **Inherited Houses Records**: This dataset specifically details the inherited properties.

### Load and Inspect the House Prices Records

In [None]:
import pandas as pd

# Load the house prices dataset
house_prices_df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")

# Displays the first 5 rows of the dataset
house_prices_df.head()

### Load and Inspect the Inherited Houses Records

In [None]:
# Load the inherited houses dataset
inherited_df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")

# Displays the first 5 rows of the dataset
inherited_df.head()

### Abbreviations:
The displayed dataframes/tables above include abbreviations that summarize key property features. Below, a brief explanation of these abbreviations is provided in order of appearance for quick reference. For a more detailed description, please refer to the README file.

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

BsmtExposure: Refers to walkout or garden level walls
- Gd: Good Exposure
- Av: Average Exposure
- Mn: Minimum Exposure
- No: No Exposure
- None: No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType1: Rating of basement finished area
- GLQ: Good Living Quarters
- ALQ: Average Living Quarters
- BLQ: Below Average Living Quarters
- Rec: Average Rec Room
- LwQ: Low Quality
- Unf: Unfinished
- None: No Basement


BsmtUnfSF: Unfinished square feet of basement area

EnclosedPorch: Enclosed porch area in square feet

GarageArea: Size of garage in square feet

GarageFinish: Interior finish of the garage
- Fin: Finished
- RFn: Rough Finished
- Unf: Unfinished
- None: No Garage

GarageYrBlt: Year garage was built

GrLivArea: Above grade (ground) living area square feet

KitchenQual: Kitchen quality
- Ex: Excellent
- Gd: Good
- TA: Typical/Average
- Fa: Fair
- Po: Poor

LotArea: Lot size in square feet

LotFrontage: Linear feet of street connected to property

MasVnrArea: Masonry veneer area in square feet

OpenPorchSF: Open porch area in square feet

OverallCond: Rates the overall condition of the house
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor

OverallQual: Rates the overall material and finish of the house
- 10: Very Excellent
- 9: Excellent
- 8: Very Good
- 7: Good
- 6: Above Average
- 5: Average
- 4: Below Average
- 3: Fair
- 2: Poor
- 1: Very Poor

TotalBsmtSF: Total square feet of basement area

WoodDeckSF: Wood deck area in square feet

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodelling or additions)

SalePrice: Sale Price

### Dataframe Summary

In [None]:
# Display the information about the DataFrame "house_prices_df", which includes details about each column
house_prices_df.info()

In [None]:
# Display the information about the DataFrame "inherited_df", which includes details about each column
inherited_df.info()

### Check for missing values.

In [None]:
# Calculate and display the number of missing values in each column of the DataFrame "house_prices_df"
house_prices_df.isnull().sum()

In [None]:
# Calculate and display the number of missing values in each column of the DataFrame "inherited_df"
inherited_df.isnull().sum()
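
Raw counts are easier to interpret next to a percentage. The sketch below shows one possible way to build such a summary; the helper name `missing_summary` is hypothetical, and the toy values are made up for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only (not taken from the Kaggle dataset)
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, None],
    "LotArea": [8450, 9600, 11250, 9550],
})
print(missing_summary(toy))
```

In the notebook, the same helper could be called as `missing_summary(house_prices_df)`.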

### Check for duplicated entries.

In [None]:
# Identify and display any duplicate rows in "house_prices_df"
house_prices_df[house_prices_df.duplicated(subset=None)]

In [None]:
# Identify and display any duplicate rows in "inherited_df"
inherited_df[inherited_df.duplicated(subset=None)]

There are no duplicate entries in either dataset.
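
A single number is often easier to check than an empty table. As a minimal sketch, duplicates can be counted rather than displayed; the helper name `count_duplicate_rows` is hypothetical and the toy values are made up for illustration.

```python
import pandas as pd


def count_duplicate_rows(df: pd.DataFrame) -> int:
    """Count rows that exactly repeat an earlier row across all columns."""
    return int(df.duplicated().sum())


# Toy data for illustration only (not taken from the Kaggle dataset)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "SalePrice": [208500, 181500, 208500],
})
print(count_duplicate_rows(toy))
```

A result of `0` for `house_prices_df` and `inherited_df` would match the statement above.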

---

# Push files to Repo

The following code saves the loaded data under outputs/datasets/collection so that the files can then be committed and pushed to the repository.

In [None]:
import os

# Create the outputs/datasets/collection folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save "house_prices_df" to a CSV file without including row indices
house_prices_df.to_csv("outputs/datasets/collection/house_prices_records.csv", index=False)

# Save "inherited_df" to a CSV file without including row indices
inherited_df.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)

---

# Conclusions and Next Steps

## Conclusions

Several columns in the house prices dataset have missing values, namely:

**2ndFlrSF**, **BedroomAbvGr**, **BsmtFinType1**, **EnclosedPorch**, **GarageFinish**, **GarageYrBlt**, **LotFrontage**, **MasVnrArea**, and **WoodDeckSF**.

The presence of a large number of zeros in **EnclosedPorch** and **WoodDeckSF** suggests that these features might be absent in many properties.
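
The share of zeros can be quantified directly. The sketch below is one possible approach, with a hypothetical helper name `zero_share` and made-up toy values; note that missing values are not counted as zeros because `NaN == 0` is false.

```python
import pandas as pd


def zero_share(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return the percentage of rows equal to zero for each given column."""
    return (df[columns] == 0).mean().mul(100).round(2)


# Toy data for illustration only (not taken from the Kaggle dataset)
toy = pd.DataFrame({
    "EnclosedPorch": [0, 0, 0, 272],
    "WoodDeckSF": [0, 298, 0, 192],
})
print(zero_share(toy, ["EnclosedPorch", "WoodDeckSF"]))
```

In the notebook, the call would be `zero_share(house_prices_df, ["EnclosedPorch", "WoodDeckSF"])`.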

## Next Steps

1. **Sale Price Study**:
    - Conduct a comprehensive analysis of the **SalePrice** attribute to understand its distribution. This will help establish an understanding of the factors driving property values in the dataset.

2. **Correlation Analysis**:
    - Perform detailed correlation analysis to identify which attributes most strongly influence the sale price.
    - Utilize both Pearson correlation for linear relationships and Spearman correlation for monotonic relationships, which may be non-linear.

3. **Visualization of Key Relationships**:
    - Develop visualizations such as scatter plots, heat maps, and pair plots to visually represent the relationships between house attributes and the sale price.
    - Focus on visualizing the most impactful variables as identified from the correlation analysis.
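
To illustrate why both correlation methods are planned, the sketch below compares them on a toy monotonic but non-linear relationship. The data is synthetic and the column names `x` and `y` are placeholders. Spearman's coefficient is computed here as the Pearson coefficient of the ranks, which is its definition and avoids any extra dependency.

```python
import pandas as pd

# Synthetic data for illustration only: y grows monotonically but not linearly with x
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association (the default method of Series.corr)
pearson = toy["x"].corr(toy["y"])

# Spearman is Pearson applied to ranks, so it captures any monotonic relationship
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Here Spearman reaches 1 because the ordering of `x` and `y` is identical, while Pearson stays below 1 because the relationship is not a straight line.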