# **Data Collection Notebook**

## Overview
---

### Objectives

Fetch data from Kaggle and save it as raw data.
Inspect the data and save it under outputs/datasets/collection.

### Inputs

Kaggle JSON file - the authentication token.

### Outputs

Generated datasets:
* outputs/datasets/collection/house_prices.csv
* outputs/datasets/collection/inherited_houses.csv

### Additional Comments
Standard practice would be not to push the collected data to a public repository, but as this is fictitious data and there are no privacy concerns, the data will be hosted publicly.

## Change Working Directory
---

We need to ensure that terminal commands run from inside the notebook are executed from the root directory of the project. Please don't run these cells more than once, as each run steps up one more level through the directories your project is housed in.

In [None]:
import os
current_dir = os.getcwd()
current_dir

We set the parent of the current directory as the new working directory.

* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the directory

In [None]:
current_dir = os.getcwd()
print(current_dir)
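
Because re-running the `os.chdir` cell keeps moving up one level, one way to make the step safe to repeat is to search upwards for a marker file instead of always taking the parent. The sketch below is not part of the original notebook: `find_project_root` is a hypothetical helper, and it assumes a `README.md` sits in the project root.

```python
import os
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption about the
    project layout, not something the notebook defines.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found at or above {start}")


# Usage (gives the same result however many times it is run):
# os.chdir(find_project_root(os.getcwd()))
```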

## Fetch House Price data from Kaggle
---

### Getting set up
We have already installed kaggle in our requirements in order to fetch the data.

After this, you will need to add the kaggle.json API key from your Kaggle account.

If you are unfamiliar with this, please follow the links below to create your Kaggle account and API key.

* [Create your Kaggle account](https://www.kaggle.com/getting-started/45113)
* [Create your API key](https://www.kaggle.com/docs/api)

### Using your key
Once you have downloaded your API key from the Kaggle website, please place the file in the root directory of your project.

A simple drag and drop will work.

Afterwards, please run the cell below to ensure that the correct permissions are assigned for handling the file.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
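
The `chmod 600` shell command is only available on Unix-like systems. As a rough Python equivalent, the sketch below sets owner-only read/write permissions with `os.chmod` and reads them back with `os.stat`. `restrict_to_owner` is a hypothetical helper name. On Windows, `os.chmod` only toggles the read-only flag, so the returned value would differ there.

```python
import os
import stat


def restrict_to_owner(path):
    """Set owner read/write only (0o600) and return the resulting permission bits.

    Hypothetical helper, shown as an alternative to the shell command.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Usage, assuming kaggle.json is in the current working directory:
# restrict_to_owner("kaggle.json")
```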

### Get the dataset path from the Kaggle URL

We are using the following Kaggle dataset: [House Prices](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

When you are looking at the dataset on Kaggle, copy the part of the URL that comes after "https://www.kaggle.com/datasets/". For this dataset that is codeinstitute/housing-prices-data.
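
The Kaggle CLI expects the `<owner>/<dataset>` part of the URL, without the leading `datasets/` segment. As an illustration, that part can be pulled out of the URL in code. `kaggle_dataset_path` is a hypothetical helper, and it assumes the URL has the form `/datasets/<owner>/<dataset>`.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the '<owner>/<dataset>' slug from a Kaggle dataset URL.

    Hypothetical helper; assumes URLs of the form /datasets/<owner>/<dataset>.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


slug = kaggle_dataset_path(
    "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
)
print(slug)  # codeinstitute/housing-prices-data
```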

We then define the dataset and its destination folder after download.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Then we unzip the downloaded file, delete the zip file and delete the kaggle.json file so that the API key is not pushed to the repo.


In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
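
Where the `unzip` command is not available, the extraction step can be approximated with Python's standard `zipfile` module. This is a sketch rather than the notebook's own method: `extract_and_remove_zips` is a hypothetical helper, and it leaves the removal of kaggle.json to a separate step.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive.

    Hypothetical helper; returns the names of the extracted members.
    """
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # The archive is closed at this point, so it is safe to delete
        zip_path.unlink()
    return extracted


# Usage: extract_and_remove_zips("inputs/datasets/raw")
```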

## Analyse Data
---

There are two files returned from the online dataset:
* house_prices_records.csv
* inherited_houses.csv

### house_prices_records dataset

We can read the dataset into a pandas dataframe to analyse it.

In [None]:
import pandas as pd
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices.head()

#### DataFrame Summary

You can read the dataframe summary by calling the .info() method on the dataframe object, but the snippet below reads this output into its own dataframe for readability.

In [None]:
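import io

# Capture the text that .info() would normally print
buf = io.StringIO()
df_house_prices.info(buf=buf)
s = buf.getvalue()
# Keep the column table only: skip the three header lines and the two footer lines
lines = [line.split() for line in s.splitlines()[3:-2]]
# Drop the leading "#" index column
df_house_prices_info = pd.DataFrame(lines).drop([0], axis=1)
# Use the first row as column names, then remove it and the "---" separator row
df_house_prices_info.columns = df_house_prices_info.iloc[0]
df_house_prices_info.drop([0,1], inplace=True)
df_house_prices_info.reset_index(drop=True, inplace=True)
df_house_prices_info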
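
Parsing the printed output of `.info()` depends on its exact text layout. As an alternative sketch (not the notebook's own method), a similar summary can be built directly from the dataframe. `summarise_columns` is a hypothetical helper and the example frame is toy data.

```python
import pandas as pd


def summarise_columns(df):
    """Build a per-column summary similar to DataFrame.info().

    Hypothetical helper: one row per column with its non-null count and dtype.
    """
    return pd.DataFrame({
        "Column": df.columns,
        "Non-Null Count": df.notna().sum().to_numpy(),
        "Dtype": df.dtypes.astype(str).to_numpy(),
    })


# Toy data for illustration only, not taken from the dataset
example = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
summary = summarise_columns(example)
summary
```

With the real data this would be called as `summarise_columns(df_house_prices)`.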

We can learn more about the dataset by checking the metadata that is supplied. It gives a brief description of each feature and what it represents.

From this we can see that a number of object-type features use strings to describe levels of finish or ratings of different aspects of the property. We need to change them to numbers so that the algorithms can work with them.

#### Apply Label Maps

We can check which columns have categorical or object datatypes.

In [None]:
df_house_prices.dtypes

The variables with type object are the ones we need to map.

In [None]:
df_house_prices.select_dtypes(include=['object']).head()

We can create the list of variables to map using the column names from this modified dataframe.

In [None]:
vars_to_map = df_house_prices.select_dtypes(include=['object']).columns.tolist()
vars_to_map

We create a dictionary that maps the categorical labels of each variable to numerical values.

In [None]:
label_map = {
    'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0},
    'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0},
    'KitchenQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
    'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0},
}

We then apply this to the dataframe.

In [None]:
df_house_prices[vars_to_map] = df_house_prices[vars_to_map].replace(label_map)
df_house_prices[vars_to_map].head()
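
`replace` leaves any value that is missing from the mapping untouched, so an unexpected label would remain a string. A check along the following lines, run before the replacement, could flag such labels. `find_unmapped_labels` is a hypothetical helper, and the toy frame (including the made-up label 'Xx') is for illustration only.

```python
import pandas as pd


def find_unmapped_labels(df, label_map):
    """Return {column: set of values} for labels that the map does not cover.

    Hypothetical helper; missing values (NaN/None) are ignored.
    """
    unmapped = {}
    for column, mapping in label_map.items():
        values = set(df[column].dropna().unique())
        missing = values - set(mapping)
        if missing:
            unmapped[column] = missing
    return unmapped


# Toy data for illustration only; 'Xx' is a made-up label
toy = pd.DataFrame({"KitchenQual": ["Ex", "Gd", "Xx", None]})
toy_map = {"KitchenQual": {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}}
unmapped = find_unmapped_labels(toy, toy_map)
```

An empty result means every label in the mapped columns is covered.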

If we check the dtype of the dataframe columns now, we can see they are numerical.

In [None]:
df_house_prices.dtypes

As we are dealing only with house features, there is nothing specific to an individual house, such as an address, so we don't need to check for duplicate values.

### inherited_houses

We can read the dataset into a pandas dataframe to analyse it.

In [None]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

This dataset is what we will apply the model to once training is finished. As such, we should apply the same modification that we applied to the house_prices dataset. Both are in the same format, so we can reuse the label map that we created earlier.

In [None]:
df_inherited[vars_to_map] = df_inherited[vars_to_map].replace(label_map)
df_inherited.dtypes
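
Since the same encoding is applied to both datasets, it could be wrapped in a small function so the two stay consistent. The sketch below uses `Series.map` rather than the `replace` call in the notebook, which behaves differently: labels absent from the mapping become NaN instead of being left as strings. `apply_label_map` is a hypothetical helper, and the toy frames use invented values.

```python
import pandas as pd


def apply_label_map(df, label_map):
    """Return a copy of `df` with each mapped column converted via Series.map.

    Hypothetical helper; the input dataframe is left unchanged.
    """
    result = df.copy()
    for column, mapping in label_map.items():
        # Series.map turns labels that are absent from `mapping` into NaN
        result[column] = result[column].map(mapping)
    return result


# Toy frames for illustration only, not taken from the datasets
toy_map = {"GarageFinish": {"Fin": 3, "RFn": 2, "Unf": 1, "None": 0}}
train = pd.DataFrame({"GarageFinish": ["Fin", "Unf"], "KitchenQual": ["Gd", "TA"]})
new = pd.DataFrame({"GarageFinish": ["RFn"], "KitchenQual": ["Ex"]})
train_encoded = apply_label_map(train, toy_map)
new_encoded = apply_label_map(new, toy_map)
```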

## Push Files to Repo
---

In [None]:

import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df_house_prices.to_csv("outputs/datasets/collection/house_prices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)
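
Before pushing, it may be worth confirming that a saved file reads back unchanged. The following is a minimal sketch of such a round-trip check. It uses a toy frame with invented values and a temporary folder in place of the real data and output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with invented values; stands in for the encoded dataframes
toy = pd.DataFrame({"KitchenQual": [4, 3], "GarageFinish": [2, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "collection" / "toy.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when values, column names and dtypes all survive the round trip
round_trip_ok = reloaded.equals(toy)
```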

We can clear the cell outputs now and push the files to the repo.