# **Data collection**

## Objectives

* Fetch data from Kaggle and save it as raw data under inputs/datasets/raw.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generated datasets:
  * outputs/datasets/collection/house_prices_records.csv
  * outputs/datasets/collection/inherited_houses.csv

## Additional Comments

* For this project, we are fetching data from Kaggle. The first dataset contains the data used to build the machine learning model. The second dataset contains information about the houses inherited by our client, for which we need to predict the prices.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\-MY STUDY-\\Coding\\projects\\project-5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\-MY STUDY-\\Coding\\projects\\project-5'
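Re-running the `os.chdir` cell above moves one level higher each time, which is easy to do by accident in a notebook. The following is a minimal sketch of a guarded version. The helper name `move_to_project_root` and the assumption that notebooks live in a folder called `jupyter_notebooks` (as the path printed above suggests) are illustrative, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent folder only while we are still inside the notebook folder.

    The folder name is an assumption taken from the path shown in this notebook.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Same effect as os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell can be re-run safely.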

# Fetch data from Kaggle

Install the Kaggle package to fetch the data.

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting certifi (from kaggle==1.5.12)
  Using cached certifi-2025.1.31-py3-none-any.whl.metadata (2.5 kB)
Collecting requests (from kaggle==1.5.12)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from kaggle==1.5.12)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting urllib3 (from kaggle==1.5.12)
  Using cached urllib3-2.3.0-py3-none-any.whl.metadata (6.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Collecting charset-normalizer<4,>=2 (from requests->kaggle==1.5.12)
  Using cached charset_normalizer-3.4.1-cp312-cp312-win_amd64.whl.metadata (36 kB)
Collecting idna<4,>=2.5 (from requests->kaggle==1.5.12)
  Using cached idna-3.10-py3-none-a

After dragging the kaggle.json file into the current working directory, run the cell below so that the token is recognized in the session.

In [5]:
import os
import stat

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Restrict the token file to owner read/write (on Windows this only clears the read-only flag)
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
os.chmod(kaggle_json_path, stat.S_IREAD | stat.S_IWRITE)
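Before calling the Kaggle CLI, it can help to confirm that the token file is actually in place and well formed. The sketch below assumes the usual layout of `kaggle.json`, a JSON object with `username` and `key` fields. The helper name `token_looks_valid` is hypothetical, and the check does not contact Kaggle, so it cannot tell whether the credentials are accepted.

```python
import json
from pathlib import Path


def token_looks_valid(token_path: Path) -> bool:
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    # Only the structure is checked here, not whether Kaggle accepts the credentials.
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))
```

For example, `token_looks_valid(Path.cwd() / "kaggle.json")` could be checked before the download cell is run.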

We are using the following [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

In [6]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading housing-prices-data.zip to inputs/datasets/raw




  0%|          | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 465kB/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 461kB/s]
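The `!` shell escape only works inside IPython or Jupyter. As a rough alternative for plain Python scripts, the same CLI call can be issued through `subprocess`. The function names below are illustrative, and the sketch assumes the `kaggle` executable is on the `PATH` and the token is configured as described above.

```python
import subprocess


def build_download_command(dataset: str, destination: str) -> list[str]:
    """Build the same CLI call as the notebook cell, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset: str, destination: str) -> None:
    """Run the download and raise CalledProcessError if the CLI exits with an error."""
    subprocess.run(build_download_command(dataset, destination), check=True)


# Usage (not executed here because it needs network access and a valid token):
# download_dataset("codeinstitute/housing-prices-data", "inputs/datasets/raw")
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.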


Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [7]:
import os
import zipfile

DestinationFolder = 'inputs/datasets/raw'  # Destination folder for the downloaded dataset

# Unzip all .zip files in the destination folder
for item in os.listdir(DestinationFolder):
    if item.endswith('.zip'):
        file_path = os.path.join(DestinationFolder, item)
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(DestinationFolder)
        os.remove(file_path)  # Remove the zip file after extraction

# Remove kaggle.json
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
if os.path.exists(kaggle_json_path):
    os.remove(kaggle_json_path)
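The unzip step above can also be wrapped in a small function that reports what was extracted, which makes it easier to confirm the archive contents. This is a sketch using `pathlib` in place of `os.listdir`; the name `extract_and_remove_zips` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: Path) -> list[str]:
    """Extract every .zip file in `folder` into it, delete the archive, and list the members."""
    extracted: list[str] = []
    for zip_path in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Delete the archive only after it has been closed
        zip_path.unlink()
    return extracted
```

A call such as `extract_and_remove_zips(Path("inputs/datasets/raw"))` would return the member names stored in the archive.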

---

# Load and Inspect Kaggle data

List the files in the raw data directory.

In [15]:
import os

for root, dirs, files in os.walk("inputs/datasets/raw"):
    for file in files:
        print(os.path.join(root, file))

inputs/datasets/raw\house-metadata.txt
inputs/datasets/raw\house-price-20211124T154130Z-001\house-price\house_prices_records.csv
inputs/datasets/raw\house-price-20211124T154130Z-001\house-price\inherited_houses.csv
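The CSV files sit inside a timestamped folder, so hard-coding the full path ties the notebook to this particular archive layout. One possible alternative, sketched below, is to look each file up by name under the raw folder. The helper `find_file` is an illustration and assumes each file name occurs exactly once.

```python
from pathlib import Path


def find_file(root: Path, filename: str) -> Path:
    """Return the single file called `filename` anywhere under `root`."""
    matches = sorted(root.rglob(filename))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one {filename!r} under {root}, found {len(matches)}"
        )
    return matches[0]
```

For instance, `find_file(Path("inputs/datasets/raw"), "house_prices_records.csv")` would return a path that can be passed straight to `pd.read_csv`.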


In [23]:
import pandas as pd

csv_file_path_1 = "inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv"
csv_file_path_2 = "inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv"

# Load the CSV files
df1 = pd.read_csv(csv_file_path_1)
df2 = pd.read_csv(csv_file_path_2)

# Inspect the first few rows of each DataFrame
print("First CSV file:")
print(df1.head(3))

print("\nSecond CSV file:")
print(df2.head(3))

First CSV file:
   1stFlrSF  2ndFlrSF  BedroomAbvGr BsmtExposure  BsmtFinSF1 BsmtFinType1  \
0       856     854.0           3.0           No         706          GLQ   
1      1262       0.0           3.0           Gd         978          ALQ   
2       920     866.0           3.0           Mn         486          GLQ   

   BsmtUnfSF  EnclosedPorch  GarageArea GarageFinish  ...  LotFrontage  \
0        150            0.0         548          RFn  ...         65.0   
1        284            NaN         460          RFn  ...         80.0   
2        434            0.0         608          RFn  ...         68.0   

   MasVnrArea OpenPorchSF  OverallCond  OverallQual  TotalBsmtSF  WoodDeckSF  \
0       196.0          61            5            7          856         0.0   
1         0.0           0            8            6         1262         NaN   
2       162.0          42            5            7          920         NaN   

   YearBuilt  YearRemodAdd  SalePrice  
0       2003     

DataFrame Summary

In [22]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       1460 non-null   int64  
 1   2ndFlrSF       1374 non-null   float64
 2   BedroomAbvGr   1361 non-null   float64
 3   BsmtExposure   1422 non-null   object 
 4   BsmtFinSF1     1460 non-null   int64  
 5   BsmtFinType1   1315 non-null   object 
 6   BsmtUnfSF      1460 non-null   int64  
 7   EnclosedPorch  136 non-null    float64
 8   GarageArea     1460 non-null   int64  
 9   GarageFinish   1225 non-null   object 
 10  GarageYrBlt    1379 non-null   float64
 11  GrLivArea      1460 non-null   int64  
 12  KitchenQual    1460 non-null   object 
 13  LotArea        1460 non-null   int64  
 14  LotFrontage    1201 non-null   float64
 15  MasVnrArea     1452 non-null   float64
 16  OpenPorchSF    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  OverallQ

In [21]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       4 non-null      int64  
 1   2ndFlrSF       4 non-null      int64  
 2   BedroomAbvGr   4 non-null      int64  
 3   BsmtExposure   4 non-null      object 
 4   BsmtFinSF1     4 non-null      float64
 5   BsmtFinType1   4 non-null      object 
 6   BsmtUnfSF      4 non-null      float64
 7   EnclosedPorch  4 non-null      int64  
 8   GarageArea     4 non-null      float64
 9   GarageFinish   4 non-null      object 
 10  GarageYrBlt    4 non-null      float64
 11  GrLivArea      4 non-null      int64  
 12  KitchenQual    4 non-null      object 
 13  LotArea        4 non-null      int64  
 14  LotFrontage    4 non-null      float64
 15  MasVnrArea     4 non-null      float64
 16  OpenPorchSF    4 non-null      int64  
 17  OverallCond    4 non-null      int64  
 18  OverallQual   

---

## Summary

* The first dataset contains information about 1460 houses from Ames, Iowa. It includes 24 columns of house attributes: 13 of type `int64`, 7 of type `float64`, and 4 of type `object`.

Several columns have missing values, as the non-null counts show: for example, `EnclosedPorch` has only 136 non-null entries out of 1460, and `LotFrontage` has 1201.

* The second dataset contains information about the four houses inherited by our client. It includes 23 columns of house attributes: 12 of type `int64`, 7 of type `float64`, and 4 of type `object`.

All columns are fully populated, so this dataset has no missing values. Its attributes are similar to those in the first dataset, which has one more column (24 versus 23).
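The counts in this summary were read off the `info()` output. As a cross-check, they could also be computed directly. The following is a minimal sketch: the function name `summarize` is hypothetical, and it relies only on the standard pandas calls `dtypes.value_counts()` and `isna().sum()`.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the shape, the number of columns per dtype, and the columns with missing values."""
    dtype_counts = {str(dtype): int(n) for dtype, n in df.dtypes.value_counts().items()}
    missing = df.isna().sum()
    # Keep only columns that actually have gaps, largest first
    missing = missing[missing > 0].sort_values(ascending=False)
    return {
        "shape": df.shape,
        "dtype_counts": dtype_counts,
        "missing": {str(col): int(n) for col, n in missing.items()},
    }
```

Calling `summarize(df1)` and `summarize(df2)` would give the figures for the two datasets in one place, and `set(df1.columns) - set(df2.columns)` would show which column the second dataset lacks.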

---

# Push files to Repo


In this notebook, we have collected data from Kaggle and inspected the columns in the dataset.

From the initial inspection, it is evident that the data needs to be cleaned before any analysis can be performed. We will now save the datasets under outputs/datasets/collection and push them to the repository.

In [24]:
import os

# Define the directory path
directory_path = 'outputs/datasets/collection'

try:
    # Create the directory if it doesn't exist
    os.makedirs(name=directory_path, exist_ok=True)
except Exception as e:
    print(e)

# Save df1 and df2 as CSV files in the created directory
df1.to_csv(os.path.join(directory_path, 'house_prices_records.csv'), index=False)
df2.to_csv(os.path.join(directory_path, 'inherited_houses.csv'), index=False)

# Add the datasets to the repository
os.system('git add outputs/datasets/collection/house_prices_records.csv')
os.system('git add outputs/datasets/collection/inherited_houses.csv')

# Commit the changes with a message
os.system('git commit -m "Add Kaggle datasets for house prices and inherited houses"')

# Push the changes to the remote repository
os.system('git push origin main')


0
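`os.system` returns an exit code for each command, and the notebook displays only the last one (the `0` above), so a failed `git add` or `git commit` could go unnoticed. A possible alternative is sketched below with `subprocess.run(check=True)`, which raises as soon as a step fails. The helper names `run_step` and `push_datasets` are illustrative, and the sketch assumes `git` is on the `PATH` and the remote is called `origin`, as in the cell above.

```python
import subprocess


def run_step(command: list[str]) -> str:
    """Run one command, raise CalledProcessError on a non-zero exit, and return its stdout."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


def push_datasets(paths: list[str], message: str, branch: str = "main") -> None:
    """Stage the given files, commit them, and push; stops at the first failing step."""
    run_step(["git", "add", *paths])
    run_step(["git", "commit", "-m", message])
    run_step(["git", "push", "origin", branch])


# Usage (not executed here because it modifies the repository):
# push_datasets(
#     ["outputs/datasets/collection/house_prices_records.csv",
#      "outputs/datasets/collection/inherited_houses.csv"],
#     "Add Kaggle datasets for house prices and inherited houses",
# )
```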