# Data Collection

### Notebook objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.
* Push the files to GitHub.

### Inputs

* Kaggle JSON file - the authentication token.
* The appropriate dataset on Kaggle, as specified in the Code Institute lessons.

### Additional Comments

* This notebook follows the structure set out in the two Code Institute walkthrough projects on Predictive Analytics and is based on the template provided in the Assessment Handbook.

___

## Change working Directory

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir


We want to make the parent of the current directory the new current directory

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
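
Re-running the `os.chdir()` cell above moves up one more level each time. As a minimal sketch of one way to guard against that, the helper below only returns the parent when the given path ends in the notebooks folder. The folder name `jupyter_notebooks` and the function name `project_root` are assumptions for illustration, not names taken from this project.

```python
import os


def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only if current_dir is the notebooks folder.

    notebooks_folder is an assumed name; adjust it to match your repository.
    """
    if os.path.basename(current_dir) == notebooks_folder:
        # We are still inside the notebooks folder, so step up one level.
        return os.path.dirname(current_dir)
    # Already at (or outside) the project root, so leave the path unchanged.
    return current_dir
```

In the notebook this could be used as `os.chdir(project_root(os.getcwd()))`, which is safe to run more than once.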

## Collect The Data from Kaggle

1. First, we must install the Kaggle package in order to fetch the data.

In [None]:
! pip3 install kaggle==1.5.12

2. With the kaggle package installed, we must provide our Kaggle authentication token, which is stored in a kaggle.json file.
    * To obtain a token you must create an account with Kaggle; from there you can download a personal token that gives you access to Kaggle's datasets.
    * Once you have downloaded the token, drag and drop the kaggle.json file into the root directory.
    * Now we can set the Kaggle environment variable so the token is recognised by running the cell below.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
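
The cell above assumes kaggle.json is already in the current directory. As a hedged sketch, the same setup can be written in pure Python with an explicit check that the token exists; the function name `configure_kaggle` is hypothetical, and `os.chmod` with owner read/write is used as the equivalent of `chmod 600`.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle CLI at config_dir and restrict the token's permissions."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        # Fail early with a clear message instead of a later authentication error.
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, the same permissions as `chmod 600`.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In the notebook this would be called as `configure_kaggle(os.getcwd())`.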

3. With the environment set, we can download the required dataset from Kaggle into a folder within our directory by running the code below.

In [None]:
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

4. The dataset is downloaded as a .zip file, so we must unzip it before we can view and use the data. The cell below unzips it, then deletes the .zip file and the kaggle.json token.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
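
The shell cell above relies on `unzip` and `rm` being available. A minimal sketch of the same unzip-and-clean-up step using Python's standard `zipfile` module is shown below; the function name `extract_archives` is hypothetical, and unlike the cell above it does not delete kaggle.json.

```python
import glob
import os
import zipfile


def extract_archives(folder):
    """Extract every .zip file in folder into folder, then delete the archive.

    Returns the names of all extracted members.
    """
    extracted = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully.
        os.remove(zip_path)
    return extracted
```

In the notebook this would be called as `extract_archives(DestinationFolder)`.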

* Our required dataset is now accessible and ready to be inspected.
    * Note that we have downloaded 2 datasets in 2 .csv files. The first is our house_prices_records dataset, which will be used to train and test the model. The other is the inherited_houses dataset, which is the subject of our business case. As stated in the README, our aim is to accurately predict the sale price of the houses in this dataset.

___

## Load and Inspect our newly downloaded datasets

1. First we import Pandas and load our first dataset (house_prices_records.csv) into a dataframe so we can easily view and inspect the data.
    * We can then view the top 5 rows of the dataset using .head() on the dataframe.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df.head()

2. To see a summary of our data, we can print its shape and use .info() on our dataframe.

In [None]:
print(df.shape)
df.info()


* We can see here that our data contains 24 columns ranging from 1stFlrSF to SalePrice: 7 columns are float64, 13 are int64, and 4 are object.
* We can also see in this summary the non-null count for each column.
* In the DataAnalysis Notebook we will dive deeper into our data.

3. We will now repeat the above steps for our inherited_houses data as well.

In [None]:
df_inherited = pd.read_csv("inputs/datasets/raw/house-price-20211124T154130Z-001/house-price/inherited_houses.csv")
print(df_inherited.shape)
df_inherited.head()

In [None]:
df_inherited.info()

* Looking quickly over this data, we can see 4 rows, one for each of our 4 inherited houses, and 23 columns. The obvious difference from the previous dataset is that SalePrice is not present. This is because the sale price is unknown and will be the target of the model.
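
The notebook objectives mention saving the inspected data under outputs/datasets/collection. A minimal sketch of how that could be done with Pandas follows; the helper name `save_dataset` and any output file names passed to it are assumptions, not names taken from this project.

```python
import os

import pandas as pd


def save_dataset(df, folder, filename):
    """Write df to folder/filename as CSV, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    # index=False keeps the row index out of the saved file.
    df.to_csv(path, index=False)
    return path
```

For example, `save_dataset(df_inherited, "outputs/datasets/collection", "inherited_houses.csv")` would write the inherited houses data, with the file name here chosen only for illustration.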

___

## Conclusion

* With our data loaded and inspected, there are a couple of aspects to note.
    1. SalePrice is not present in the inherited_houses dataset. This is expected, as predicting the SalePrice of these houses is the problem defined in our business case.
    2. There are differences in data types between matching columns of the two datasets. For example, 2ndFlrSF is type float64 in house_prices_records, whereas it is type int64 in the inherited_houses dataset. This will be resolved in the DataCleaning Notebook.
* From this point we have our data loaded and can move on to Data Analysis.
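
The data type differences noted above can also be listed programmatically rather than by reading the two .info() outputs side by side. The sketch below is one possible way to do so; the helper name `dtype_mismatches` is hypothetical and the two small frames contain made-up values purely for illustration.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Return {column: (left dtype, right dtype)} for shared columns whose dtypes differ."""
    shared = [col for col in left.columns if col in right.columns]
    return {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    }


# Toy frames with made-up values, standing in for the two real datasets.
records = pd.DataFrame(
    {"1stFlrSF": [856, 1262], "2ndFlrSF": [854.0, 0.0], "SalePrice": [1, 2]}
)
inherited = pd.DataFrame({"1stFlrSF": [896, 1329], "2ndFlrSF": [0, 678]})

print(dtype_mismatches(records, inherited))
```

With the real data, calling `dtype_mismatches(df, df_inherited)` would report every matching column that needs its type aligned during cleaning.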

___