# **DATA COLLECTION NOTEBOOK**

## Objectives

* Fetch data from Kaggle and save it as raw data in inputs/datasets/raw
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/house_prices_after_inspection.csv

## Additional Comments

* The data originates from Kaggle and was provided by Code Institute (CI).


---

# Change working directory to the parent folder

Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
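
Re-running the `os.chdir` cell above moves the working directory up one more level each time, which is easy to do by accident in a notebook. A minimal sketch of a guard, assuming the project root can be recognised by a marker file; the helper `find_project_root` and the marker name `README.md` are hypothetical and not part of the original notebook:

```python
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Return the closest directory at or above `start` that contains `marker`.

    `marker` is an assumed name: use any file or folder that only exists
    at the top of the repository. Returns None if nothing matches.
    """
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


# Possible use in the notebook (left commented so the sketch has no side effects):
# root = find_project_root(os.getcwd())
# if root is not None:
#     os.chdir(root)  # gives the same result however often the cell is run
```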

# Fetch Data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

Point the Kaggle CLI at the token in the current directory and restrict the file's permissions to its owner

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
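
If the token file is missing or incomplete, the download step fails later with a less obvious error. The following is a minimal sketch of an early check, assuming the token is the usual Kaggle JSON file with a `username` and a `key` field; the helper `check_kaggle_token` is hypothetical and not part of the Kaggle package:

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    token_path = Path(path)
    if not token_path.is_file():
        return [f"{path} not found"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]
    # Report each expected field that is absent or empty.
    return [f"missing field: {field}" for field in ("username", "key")
            if not token.get(field)]


problems = check_kaggle_token()
print(problems or "Token file looks usable")
```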

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
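
The shell commands above depend on `unzip` and `rm` being available. On a system without them, the same extraction could be done with the standard library. This is a minimal sketch; the helper `extract_and_remove_zips` is hypothetical, and it deliberately leaves `kaggle.json` alone, so the token would still need to be deleted separately:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive.

    Returns the names of all extracted archive members.
    """
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # only reached if extraction did not raise
    return extracted


# Possible use in the notebook:
# extract_and_remove_zips(DestinationFolder)
```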

---

# Load and Inspect Kaggle data

### Read CSV files

In [None]:
import pandas as pd
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices.head()
# print(df_house_prices.shape)


In [None]:
df_inherited_houses = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited_houses.head()
# print(df_inherited_houses.shape)
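
The CSV paths above include a timestamped folder name, which would break if the archive were repackaged under a different folder. One option is to locate each file by name instead. This is a minimal sketch; the helper `find_data_file` is hypothetical, and it assumes each file name occurs exactly once below the raw data folder:

```python
from pathlib import Path


def find_data_file(root, filename):
    """Return the single file named `filename` anywhere below `root`."""
    matches = sorted(Path(root).rglob(filename))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one {filename} under {root}, found {len(matches)}"
        )
    return matches[0]


# Possible use in the notebook:
# pd.read_csv(find_data_file("inputs/datasets/raw", "house_prices_records.csv"))
```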

### Read TXT files

In [None]:
df_house_metadata = pd.read_csv("inputs/datasets/raw/house-metadata.txt", header=None)
df_house_metadata.head()
# print(df_house_metadata.shape)

### DataFrames Summary

In [None]:
df_house_prices.info()

In [None]:
df_inherited_houses.info()

### Visually check whether the numerical columns contain integers or floats
* Use each column's sum as an indicator of whether the column contains float values

In [None]:
# Names of the categorical (object dtype) columns; reused in the next section
categ_variables = df_house_prices.select_dtypes(include="object").columns.to_list()

# Print the sum of every non-categorical column
for col in df_house_prices.columns:
    if col not in categ_variables:
        print(f'{col}: {df_house_prices[col].sum()}')
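
The sum is only a coarse indicator: fractional parts can cancel out, and a column stored as float purely because of missing values still sums to a whole number. A more direct check looks at each value. This is a minimal sketch using only the standard library; the helper `holds_only_whole_numbers` is hypothetical and treats NaN as a missing value:

```python
import math


def holds_only_whole_numbers(values):
    """Return True if every non-missing value has no fractional part."""
    for value in values:
        number = float(value)
        if math.isnan(number):
            continue  # NaN marks a missing value, not a decimal
        if not number.is_integer():
            return False
    return True


print(holds_only_whole_numbers([1.0, 2.0, float("nan")]))  # True
print(holds_only_whole_numbers([0.5, 0.5]))  # False, although the sum is 1.0

# Possible use in the notebook:
# holds_only_whole_numbers(df_house_prices["GarageYrBlt"])
```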

### Check the unique values of the categorical variables

In [None]:
for col in categ_variables:
    print(f"{col} : {df_house_prices[col].unique()}")


### Confirm Target data type
* The target is already a numeric variable.

In [None]:
df_house_prices['SalePrice'].dtype

### Conclusions and Next actions
* The variables GarageYrBlt, YearBuilt and YearRemodAdd are numeric. 
* While they could be converted to datetime data type, their current numerical format facilitates their use in Pearson and Spearman correlation analyses and as direct inputs for the regression model.
* The categorical variables BsmtFinType1 and KitchenQual show inconsistent casing. For consistency, apply a case transformation to all categorical variables in future notebooks.
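
The case transformation mentioned in the last point could take the following shape. This is a minimal sketch, assuming lower case is the chosen target; the helper `to_lower_case` is hypothetical and the sample values are illustrative rather than taken from the dataset:

```python
def to_lower_case(values):
    """Lower-case every string in `values`; leave non-strings (e.g. NaN) as-is."""
    return [v.lower() if isinstance(v, str) else v for v in values]


print(to_lower_case(["Abc", "XYZ", None]))  # ['abc', 'xyz', None]

# On a DataFrame column, pandas offers the same transformation in one call:
# df_house_prices[col] = df_house_prices[col].str.lower()
```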

---

# Push files to Repo


### Create outputs directory

In [None]:
import os
# exist_ok=True makes the cell safe to re-run when the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)


### Save the data as CSV

In [None]:
df_house_prices.to_csv("outputs/datasets/collection/house_prices_after_inspection.csv", index=False)