# **Data Collection Notebook**

## Overview
---

### Objectives

Fetch data from Kaggle and save it as raw data.
Inspect the data and save it under outputs/datasets/collection.

### Inputs

Kaggle JSON file - the authentication token.

### Outputs

Generate Datasets: outputs/datasets/collection/house_prices.csv and outputs/datasets/collection/inherited_houses.csv

### Additional Comments
Standard practice would be not to push the collected data to a public repository, but as this is fictitious data and there are no privacy concerns, the data will be hosted publicly.

## Change Working Directory
---

We need to ensure that terminal commands run from inside the notebook are executed from the root directory of the project. Please don't run these cells more than once, as each run steps up one more level through the directories that house your project.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor/jupyter-notbooks'

We set the parent of the current directory as the new working directory.

* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the directory

In [3]:
current_dir = os.getcwd()
print(current_dir)

/Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor
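The warning about re-running the cells above can be sidestepped with an idempotent variant. The sketch below is only an illustration: `find_project_root` is a hypothetical helper, and the `requirements.txt` marker is an assumption about what sits in the project root.

```python
import os
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Walk upwards from `start` until a directory containing `marker` is found.

    `marker` is an assumed file name; any file that only exists in the
    project root would work just as well.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")


# Usage (safe to run repeatedly, because the root resolves to itself):
# os.chdir(find_project_root(os.getcwd()))
```

Because the search starts at the current directory, running the cell a second time leaves the working directory unchanged.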


## Fetch House Price data from Kaggle
---

### Getting set up
We have already installed kaggle in our requirements in order to fetch the data.

After this, you will need to add the kaggle.json API key from your Kaggle account.

If you are unfamiliar with this, please follow the links below to create your kaggle account and API key.

* [Create your Kaggle account](https://www.kaggle.com/getting-started/45113)
* [Create your API key](https://www.kaggle.com/docs/api)

### Using your key
Once you have downloaded your API key from the Kaggle website, please place the file in the root directory of your project.

A simple drag and drop will work.

Afterwards, please run the cell below to ensure that the correct permissions are assigned for handling the file.

In [4]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: kaggle.json: No such file or directory
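The error above appears when kaggle.json has not yet been placed in the project root. A minimal sketch of the same step in pure Python, with an explicit check for the file, might look as follows; `secure_kaggle_token` is a hypothetical helper name, not part of the Kaggle package.

```python
import os
import stat
from pathlib import Path


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json to owner read/write, or explain why that failed."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"{token} not found. Place kaggle.json in the project root first."
        )
    # Owner read + write only, the same permissions as `chmod 600`.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    # Tell the Kaggle client where to look for the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(Path(config_dir))
    return token


# Usage: secure_kaggle_token(os.getcwd())
```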


### Get the dataset path from the Kaggle URL

We are using the following Kaggle dataset: [House Prices](https://www.kaggle.com/datasets/codeinstitute/housing-prices-data)

When you are looking at the dataset on Kaggle, copy the part of the URL that comes after "https://www.kaggle.com/datasets/".

We then define the dataset path and its destination folder for the download.

In [5]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor/.venv/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor/.venv/lib/python3.10/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor/.venv/lib/python3.10/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /Users/matthewmurnaghan/Data-Science-Projects/AI-Projects/House-Price-Predictor. Or use the environment method.
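The dataset path can also be derived from the URL rather than copied by hand. The following is a small sketch using only the standard library; `kaggle_dataset_path` is a hypothetical helper, and it assumes the URL follows the `/datasets/<owner>/<dataset>` layout of the link above.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


url = "https://www.kaggle.com/datasets/codeinstitute/housing-prices-data"
print(kaggle_dataset_path(url))  # codeinstitute/housing-prices-data
```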


Then we unzip the downloaded file, delete the zip file and delete the kaggle.json file


In [6]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

zsh:1: no matches found: inputs/datasets/raw/*.zip
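The shell command fails when the download did not produce a zip file. As a shell-independent alternative, the unzip-and-delete step could be sketched with the standard `zipfile` module; `extract_and_remove_zips` is a hypothetical helper, and unlike the cell above it leaves kaggle.json in place.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into `folder`, then delete the archive."""
    archives = sorted(Path(folder).glob("*.zip"))
    if not archives:
        print(f"No zip files found in {folder}; was the download successful?")
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # remove the archive once its contents are extracted
    return [archive.name for archive in archives]


# Usage: extract_and_remove_zips("inputs/datasets/raw")
```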


## Analyse Data
---

There are two files returned from the online dataset:
* house_prices_records.csv
* inherited_houses.csv

### house_prices_records dataset

We can read the dataset into a pandas dataframe to analyse it.

In [7]:
import pandas as pd
df_house_prices = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df_house_prices.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,5,7,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,8,6,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,5,7,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,5,7,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,5,8,1145,,2000,2000,250000


#### DataFrame Summary

You can read the dataframe summary by calling the .info() method on the dataframe object, but the snippet below reads this output into its own dataframe for readability.

In [8]:
import io
buf = io.StringIO()
df_house_prices.info(buf=buf)
s = buf.getvalue()
lines = [line.split() for line in s.splitlines()[3:-2]]
df_house_prices_info = pd.DataFrame(lines).drop([0], axis=1)
df_house_prices_info.columns = df_house_prices_info.iloc[0]
df_house_prices_info.drop([0,1], inplace=True)
df_house_prices_info.reset_index(drop=True, inplace=True)
df_house_prices_info

Unnamed: 0,Column,Non-Null,Count,Dtype
0,1stFlrSF,1460,non-null,int64
1,2ndFlrSF,1374,non-null,float64
2,BedroomAbvGr,1361,non-null,float64
3,BsmtExposure,1460,non-null,object
4,BsmtFinSF1,1460,non-null,int64
5,BsmtFinType1,1346,non-null,object
6,BsmtUnfSF,1460,non-null,int64
7,EnclosedPorch,136,non-null,float64
8,GarageArea,1460,non-null,int64
9,GarageFinish,1298,non-null,object


We can learn more about the dataset by checking the metadata that is supplied. It gives a brief description of each feature and what it represents.

From this we can see that a number of object columns use strings to describe levels of finish or ratings of different aspects of the property. We need to change them to numbers so that they work better with the algorithms.
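As a rough illustration of turning such string ratings into numbers, the sketch below maps one ordinal column with `Series.map`. The mapping is the KitchenQual entry from the commented-out dictionary later in this notebook; the rows are made up for the example.

```python
import pandas as pd

# Mapping taken from the commented-out label_map later in this notebook.
kitchen_qual_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

# Illustrative values only, not rows from the real dataset.
toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex"]})
toy["KitchenQual"] = toy["KitchenQual"].map(kitchen_qual_map)
print(toy["KitchenQual"].tolist())  # [4, 3, 5]
```

Note that `Series.map` turns any label missing from the dictionary into NaN, which makes unmapped categories easy to spot.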

#### Apply Label Maps

We can check to see which columns have categorical or object datatypes.

In [9]:
df_house_prices.dtypes

1stFlrSF           int64
2ndFlrSF         float64
BedroomAbvGr     float64
BsmtExposure      object
BsmtFinSF1         int64
BsmtFinType1      object
BsmtUnfSF          int64
EnclosedPorch    float64
GarageArea         int64
GarageFinish      object
GarageYrBlt      float64
GrLivArea          int64
KitchenQual       object
LotArea            int64
LotFrontage      float64
MasVnrArea       float64
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
TotalBsmtSF        int64
WoodDeckSF       float64
YearBuilt          int64
YearRemodAdd       int64
SalePrice          int64
dtype: object

The variables with type object are the ones we need to change to categorical.

In [10]:
obj_to_categorical = df_house_prices.select_dtypes(include=['object']).columns
df_house_prices[obj_to_categorical] = df_house_prices[obj_to_categorical].astype('category')

We can create the list of variables to map using the column names from this modified dataframe.

In [11]:
# vars_to_map = df_house_prices.select_dtypes(include=['object']).columns.tolist()
vars_to_map = ['OverallCond', 'OverallQual',]
vars_to_map

['OverallCond', 'OverallQual']

We create a dictionary that maps the label of each variable to a numerical value. Here the ratings from 1 to 10 are shifted to the range 0 to 9.

In [12]:
# label_map = {
#     'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0},
#     'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0},
#     'KitchenQual': {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1},
#     'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0},
# }
label_map = {
    1:0,
    2:1,
    3:2,
    4:3,
    5:4,
    6:5,
    7:6,
    8:7,
    9:8,
    10:9,
}

We then apply this to the dataframe.

In [13]:
df_house_prices[vars_to_map] = df_house_prices[vars_to_map].replace(label_map)
df_house_prices[vars_to_map].head()

Unnamed: 0,OverallCond,OverallQual
0,4,6
1,7,5
2,4,6
3,4,6
4,4,7
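Since the label map above moves each rating from the 1 to 10 scale down by one, the same result can be reached by plain subtraction. The sketch below checks this on made-up values; it uses `Series.map` per column rather than the `replace` call in the cell above.

```python
import pandas as pd

# Same mapping as the notebook's label_map: 1 -> 0, 2 -> 1, ..., 10 -> 9.
label_map = {rating: rating - 1 for rating in range(1, 11)}

# Illustrative values only, not rows from the real dataset.
toy = pd.DataFrame({"OverallCond": [5, 8, 5], "OverallQual": [7, 6, 10]})

mapped = toy.apply(lambda col: col.map(label_map))  # dictionary lookup per value
shifted = toy - 1                                   # arithmetic shortcut

print(mapped.values.tolist() == shifted.values.tolist())  # True
```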


If we check the dtypes of the dataframe columns now, we can see that the mapped columns (OverallCond and OverallQual) are numerical and the former object columns are now categories.

In [14]:
df_house_prices.dtypes

1stFlrSF            int64
2ndFlrSF          float64
BedroomAbvGr      float64
BsmtExposure     category
BsmtFinSF1          int64
BsmtFinType1     category
BsmtUnfSF           int64
EnclosedPorch     float64
GarageArea          int64
GarageFinish     category
GarageYrBlt       float64
GrLivArea           int64
KitchenQual      category
LotArea             int64
LotFrontage       float64
MasVnrArea        float64
OpenPorchSF         int64
OverallCond         int64
OverallQual         int64
TotalBsmtSF         int64
WoodDeckSF        float64
YearBuilt           int64
YearRemodAdd        int64
SalePrice           int64
dtype: object

We can then view the full dataframe with the mapped values applied.

In [15]:
df_house_prices

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
0,856,854.0,3.0,No,706,GLQ,150,0.0,548,RFn,...,65.0,196.0,61,4,6,856,0.0,2003,2003,208500
1,1262,0.0,3.0,Gd,978,ALQ,284,,460,RFn,...,80.0,0.0,0,7,5,1262,,1976,1976,181500
2,920,866.0,3.0,Mn,486,GLQ,434,0.0,608,RFn,...,68.0,162.0,42,4,6,920,,2001,2002,223500
3,961,,,No,216,ALQ,540,,642,Unf,...,60.0,0.0,35,4,6,756,,1915,1970,140000
4,1145,,4.0,Av,655,GLQ,490,0.0,836,RFn,...,84.0,350.0,84,4,7,1145,,2000,2000,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,953,694.0,3.0,No,0,Unf,953,,460,RFn,...,62.0,0.0,40,4,5,953,0.0,1999,2000,175000
1456,2073,0.0,,No,790,ALQ,589,,500,Unf,...,85.0,119.0,0,5,5,1542,,1978,1988,210000
1457,1188,1152.0,4.0,No,275,GLQ,877,,252,RFn,...,66.0,0.0,60,8,6,1152,,1941,2006,266500
1458,1078,0.0,2.0,Mn,49,,0,112.0,240,Unf,...,68.0,0.0,0,5,4,1078,,1950,1996,142125


As we are dealing only with house features, there is nothing specific to an individual house, such as an address, so we don't need to check for duplicate values.
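Should a check ever be wanted, for instance to confirm that no record was entered twice, a minimal sketch with `DataFrame.duplicated` could look like this. The rows are made up for the example.

```python
import pandas as pd

# Illustrative rows only; the second and third rows are identical.
toy = pd.DataFrame({"GarageArea": [548, 460, 460], "YearBuilt": [2003, 1976, 1976]})

# duplicated() flags every repeat of an earlier row.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)  # 1
```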

### inherited_houses

We can read the dataset into a pandas dataframe to analyse it.

In [16]:
import pandas as pd
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
df_inherited.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd
0,896,0,2,No,468.0,Rec,270.0,0,730.0,Unf,...,11622,80.0,0.0,0,6,5,882.0,140,1961,1961
1,1329,0,3,No,923.0,ALQ,406.0,0,312.0,Unf,...,14267,81.0,108.0,36,6,6,1329.0,393,1958,1958
2,928,701,3,No,791.0,GLQ,137.0,0,482.0,Fin,...,13830,74.0,0.0,34,5,5,928.0,212,1997,1998
3,926,678,3,No,602.0,GLQ,324.0,0,470.0,Fin,...,9978,78.0,20.0,36,6,6,926.0,360,1998,1998


This dataset is what we will apply the model to once training is finished. As such, we should apply the same modification that we applied to the house_prices dataset. Both are in the same format, so we can reuse the label map that we created earlier.

In [17]:
df_inherited[vars_to_map] = df_inherited[vars_to_map].replace(label_map)
df_inherited.dtypes

1stFlrSF           int64
2ndFlrSF           int64
BedroomAbvGr       int64
BsmtExposure      object
BsmtFinSF1       float64
BsmtFinType1      object
BsmtUnfSF        float64
EnclosedPorch      int64
GarageArea       float64
GarageFinish      object
GarageYrBlt      float64
GrLivArea          int64
KitchenQual       object
LotArea            int64
LotFrontage      float64
MasVnrArea       float64
OpenPorchSF        int64
OverallCond        int64
OverallQual        int64
TotalBsmtSF      float64
WoodDeckSF         int64
YearBuilt          int64
YearRemodAdd       int64
dtype: object
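Because the model will be applied to this dataframe, it can help to confirm that it carries every feature of the training data. The sketch below is one way to do that; `missing_features` is a hypothetical helper, the toy frames use made-up values, and SalePrice is treated as the target because it is the only column absent from the inherited houses.

```python
import pandas as pd


def missing_features(train_df, predict_df, target="SalePrice"):
    """List training features that are absent from the prediction dataframe."""
    expected = [col for col in train_df.columns if col != target]
    return [col for col in expected if col not in predict_df.columns]


# Illustrative frames only, not the real datasets.
train = pd.DataFrame({"LotArea": [8450], "OverallQual": [6], "SalePrice": [208500]})
inherited = pd.DataFrame({"LotArea": [11622], "OverallQual": [4]})
print(missing_features(train, inherited))  # []
```

In the notebook this would be called as `missing_features(df_house_prices, df_inherited)`.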

## Push Files to Repo
---

In [18]:

import os
try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df_house_prices.to_csv("outputs/datasets/collection/house_prices.csv", index=False)
df_inherited.to_csv("outputs/datasets/collection/inherited_houses.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
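The message above is printed whenever the folder already exists. One way to avoid it, sketched below, is the `exist_ok` parameter of `os.makedirs`; `ensure_dir` is a hypothetical wrapper added only for illustration.

```python
import os


def ensure_dir(path):
    """Create `path` (and any parents) if needed; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Usage: ensure_dir("outputs/datasets/collection")
```

With `exist_ok=True` the call succeeds silently on repeat runs, so no try/except block is needed for the "File exists" case.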


We can clear the cell outputs now and push the files to the repo.