# **01 - Data Collection**

## Objectives

* Install the Kaggle API to enable dataset retrieval.
* Authenticate using the Kaggle API token stored in the `kaggle.json` file.
* Download the raw dataset from Kaggle.
* Inspect the dataset to ensure it is loaded correctly.
* Save the dataset in the `inputs/datasets/raw/` folder for further processing.

## Inputs

* Kaggle dataset: Housing Prices Data (`codeinstitute/housing-prices-data`).
* Kaggle API token stored in the `kaggle.json` file.

## Outputs

* Raw datasets saved as `inputs/datasets/raw/house_prices_records.csv` and `inputs/datasets/raw/inherited_houses.csv`.

## Additional Comments

* Ensure that the Kaggle API token (`kaggle.json`) is correctly configured and stored in the project root directory.
* The working directory will be adjusted to the project root to ensure consistent file paths.

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor the working directory must be changed to the project root.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [19]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/milestone-project-heritage-housing-issues'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [20]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [21]:
current_dir = os.getcwd()
current_dir

'/workspaces'
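
In the run recorded above, the notebook appears to have started in the project root already, so moving to the parent left the session in `/workspaces`, one level too high. Re-running the cell would move up again. A minimal sketch of a re-runnable alternative follows. It assumes the project root can be recognised by a marker entry; the helper name `find_project_root` and the marker names (`.git`, `README.md`) are illustrative assumptions, not part of the original notebook.

```python
import os
from pathlib import Path


def find_project_root(start, markers=(".git", "README.md")):
    """Return the nearest directory at or above `start` that contains a marker.

    The marker names are assumptions for illustration. Returns None if no
    directory on the way up contains one of them.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return None


project_root = find_project_root(os.getcwd())
if project_root is None:
    print("No project root found; working directory left unchanged")
else:
    os.chdir(project_root)
    print(f"Working directory: {project_root}")
```

Because the search starts at the current directory itself, running the cell a second time leaves the working directory where it is.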

# Section 1: Data Retrieval

To retrieve the raw project data, we first need to install the Kaggle API, which allows us to download datasets directly from Kaggle. This requires the installation of the Kaggle Python package. 

In [22]:
%pip install kaggle


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


To configure the Kaggle API for dataset retrieval, we first set the environment variable `KAGGLE_CONFIG_DIR` to the current directory, where the `kaggle.json` file is stored. This file contains the credentials required for authentication.
We then run `chmod 600` to restrict access to `kaggle.json`, so that it is readable and writable only by its owner.

In [23]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
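
The `chmod` error above indicates that `kaggle.json` was not present in the working directory when the cell ran. A minimal sketch that checks for the token before touching permissions is shown here. The helper name `configure_kaggle_token` is hypothetical, and `Path.chmod` with owner read/write bits stands in for the shell command on a POSIX system.

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(config_dir):
    """Point KAGGLE_CONFIG_DIR at `config_dir` and restrict kaggle.json to its owner."""
    config_dir = Path(config_dir).resolve()
    token_path = config_dir / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Equivalent of `chmod 600`: read and write for the owner only
    token_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token_path


try:
    configure_kaggle_token(os.getcwd())
    print("Kaggle token configured")
except FileNotFoundError as error:
    print(f"Setup skipped: {error}")
```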


Next, we download the zip file containing the dataset.

In [24]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/home/cistudent/.local/bin/kaggle", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/api/kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cistudent/.local/lib/python3.12/site-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
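
The `KeyError: 'username'` in the traceback is consistent with the client not finding usable credentials. A `kaggle.json` token is a small JSON file with a `username` and a `key` field, so it can be checked before the CLI is called. A minimal sketch, in which the helper name `check_kaggle_token` is hypothetical:

```python
import json
from pathlib import Path


def check_kaggle_token(token_path):
    """Return a list of problems found with the token file (empty if it looks usable)."""
    token_path = Path(token_path)
    if not token_path.is_file():
        return [f"{token_path} does not exist"]
    try:
        credentials = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"{token_path} is not valid JSON"]
    if not isinstance(credentials, dict):
        return [f"{token_path} does not contain a JSON object"]
    # Both fields must be present and non-empty
    return [
        f"missing field: {field}"
        for field in ("username", "key")
        if not credentials.get(field)
    ]


problems = check_kaggle_token("kaggle.json")
print(problems or "kaggle.json looks complete")
```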


Unzip the dataset, delete the zip file, and remove the `kaggle.json` file.

In [25]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
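
The same unzip-and-delete step can be written with the standard library, which avoids shell globbing and reports cleanly when no archive is present. A minimal sketch; the helper name `extract_and_remove_zips` is hypothetical, and removal of `kaggle.json` is deliberately left out so the token is not deleted by accident.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    extracted = []
    for archive_path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(folder)
        # Only reached if extraction did not raise
        archive_path.unlink()
        extracted.append(archive_path.name)
    return extracted


done = extract_and_remove_zips("inputs/datasets/raw")
print(done or "No zip files found")
```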


Moving the CSV files directly into `inputs/datasets/raw/` and removing the now-empty nested folders keeps the project structure simple and the file paths short.

In [26]:
!mv inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/*.csv inputs/datasets/raw/ \
  && rmdir inputs/datasets/raw/house-price-20211124T154130Z-001/house-price \
  && rmdir inputs/datasets/raw/house-price-20211124T154130Z-001

mv: cannot stat 'inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/*.csv': No such file or directory
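
The `mv` command above depends on the exact name of the extracted folder. A minimal sketch of a version that does not hard-code that name is given here; the helper name `flatten_csvs` is hypothetical, and the choice to skip files whose names already exist in the target folder is an assumption made to avoid overwriting data.

```python
import shutil
from pathlib import Path


def flatten_csvs(raw_dir):
    """Move CSV files from subfolders of `raw_dir` into `raw_dir` and drop empty subfolders."""
    raw_dir = Path(raw_dir)
    if not raw_dir.is_dir():
        return []
    moved = []
    # sorted() materialises the listing before any file is moved
    for csv_path in sorted(raw_dir.rglob("*.csv")):
        if csv_path.parent == raw_dir:
            continue  # already in place
        target = raw_dir / csv_path.name
        if target.exists():
            continue  # never overwrite an existing file
        shutil.move(str(csv_path), str(target))
        moved.append(target.name)
    # Remove subfolders that are now empty, deepest first
    subdirs = [path for path in raw_dir.rglob("*") if path.is_dir()]
    for subdir in sorted(subdirs, key=lambda path: len(path.parts), reverse=True):
        if not any(subdir.iterdir()):
            subdir.rmdir()
    return moved


moved_files = flatten_csvs("inputs/datasets/raw")
print(moved_files or "No nested CSV files found")
```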


---

# Section 2: Load and Inspect Kaggle Data

* Import the Pandas library to work with tabular data.
* Load the datasets (`house_prices_records.csv` and `inherited_houses.csv`) into two separate Pandas DataFrames.
* Display the first few rows of `df_prices`.

In [27]:
import pandas as pd

df_prices = pd.read_csv("inputs/datasets/raw/house_prices_records.csv")
df_inherited = pd.read_csv("inputs/datasets/raw/inherited_houses.csv")

df_prices.head()

FileNotFoundError: [Errno 2] No such file or directory: 'inputs/datasets/raw/house_prices_records.csv'

Inspecting DataFrame Structure.
We use `.info()` to inspect the structure of the DataFrame, including column names, data types, and the number of non-null values. This helps us identify missing values and understand the overall structure of the dataset.

In [None]:
print("Dataset info:")
df_prices.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  

Identify the number of missing values in each column of `df_prices`, so that we can plan how to handle them.

In [None]:
df_prices.isnull().sum()

1stFlrSF            0
2ndFlrSF           86
BedroomAbvGr       99
BsmtExposure       38
BsmtFinSF1          0
BsmtFinType1      145
BsmtUnfSF           0
EnclosedPorch    1324
GarageArea          0
GarageFinish      235
GarageYrBlt        81
GrLivArea           0
KitchenQual         0
LotArea             0
LotFrontage       259
MasVnrArea          8
OpenPorchSF         0
OverallCond         0
OverallQual         0
TotalBsmtSF         0
WoodDeckSF       1305
YearBuilt           0
YearRemodAdd        0
SalePrice           0
dtype: int64
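
Raw counts are easier to compare as a share of all rows. The sketch below recomputes the percentages from the counts printed above and the 1460 rows reported by `.info()`; the variable names are illustrative.

```python
TOTAL_ROWS = 1460  # from the RangeIndex reported by df_prices.info()

# Non-zero missing-value counts copied from the output above
missing_counts = {
    "2ndFlrSF": 86,
    "BedroomAbvGr": 99,
    "BsmtExposure": 38,
    "BsmtFinType1": 145,
    "EnclosedPorch": 1324,
    "GarageFinish": 235,
    "GarageYrBlt": 81,
    "LotFrontage": 259,
    "MasVnrArea": 8,
    "WoodDeckSF": 1305,
}

# Share of rows with a missing value, in percent, rounded to one decimal
missing_percent = {
    column: round(count / TOTAL_ROWS * 100, 1)
    for column, count in missing_counts.items()
}

# Print the columns from most to least affected
for column, percent in sorted(
    missing_percent.items(), key=lambda item: item[1], reverse=True
):
    print(f"{column:<15}{percent:>6.1f}%")
```

On the loaded DataFrame itself, `df_prices.isnull().mean() * 100` gives the same shares without copying the counts by hand.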

Statistical Summary of Numeric Columns.
We use `.describe()` to generate a statistical summary of the numeric columns in the dataset. This helps us understand the distribution of data, identify outliers, and plan for data cleaning.

In [None]:
print("Statistical Summary of df_prices:")
print(df_prices.describe())

Statistical Summary of df_prices:
          1stFlrSF     2ndFlrSF  BedroomAbvGr   BsmtFinSF1    BsmtUnfSF  \
count  1460.000000  1374.000000   1361.000000  1460.000000  1460.000000   
mean   1162.626712   348.524017      2.869214   443.639726   567.240411   
std     386.587738   438.865586      0.820115   456.098091   441.866955   
min     334.000000     0.000000      0.000000     0.000000     0.000000   
25%     882.000000     0.000000      2.000000     0.000000   223.000000   
50%    1087.000000     0.000000      3.000000   383.500000   477.500000   
75%    1391.250000   728.000000      3.000000   712.250000   808.000000   
max    4692.000000  2065.000000      8.000000  5644.000000  2336.000000   

       EnclosedPorch   GarageArea  GarageYrBlt    GrLivArea        LotArea  \
count     136.000000  1460.000000  1379.000000  1460.000000    1460.000000   
mean       25.330882   472.980137  1978.506164  1515.463699   10516.828082   
std        66.684115   213.804841    24.689725   525.480

---

## Conclusions and Next Steps
### Conclusions:
The dataset consists of 1460 rows and 24 columns, with 10 variables containing missing values. Key columns with missing data include `LotFrontage` (17.7%) and `GarageFinish` (16.1%), while `EnclosedPorch` (90.7%) and `WoodDeckSF` (89.4%) are missing in most rows. Categorical columns need to be converted from `object` to `category`.

The data also reveals that `LotFrontage` shows a wide range of values, indicating variability that may need further exploration. Additionally, `GarageYrBlt` contains missing values and represents years, which may require special handling to ensure consistency.
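
The two type-related points above could be handled along the following lines. This is a minimal sketch on a toy frame, not the cleaning step itself: the rows are invented for illustration, and using the nullable `Int64` dtype for `GarageYrBlt` is one possible choice, assumed here because it keeps years as integers while still allowing missing entries.

```python
import pandas as pd

# Toy rows for illustration only; column names follow df_prices
toy = pd.DataFrame(
    {
        "KitchenQual": ["Gd", "TA", "Gd"],
        "GarageFinish": ["RFn", None, "Unf"],
        "GarageYrBlt": [2003.0, None, 1976.0],
    }
)

# Text columns become categories; the year column becomes a nullable integer
toy = toy.astype(
    {
        "KitchenQual": "category",
        "GarageFinish": "category",
        "GarageYrBlt": "Int64",
    }
)

print(toy.dtypes)
```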

### Next Steps: Data Cleaning

