# **Data Collection**

## Objectives

* Upload dataset file into explorer panel.
* Inspect the data and save it under ../house-price-20211124T154130Z-001

## Inputs

* Extract file through terminal, using command 'unzip file_name.zip' 
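
The same extraction can be sketched from Python with the standard-library `zipfile` module, which can be handy when no terminal is available. This is only an illustrative sketch: the archive is built on the fly in a temporary folder, and its member name `example/records.csv` is a hypothetical stand-in, not the real dataset.

```python
import os
import tempfile
import zipfile

# Work in a throwaway folder so nothing in the project is touched.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "file_name.zip")

# Build a tiny stand-in archive (hypothetical content) so the sketch runs anywhere.
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("example/records.csv", "a,b\n1,2\n")

# Extract every member into work_dir, similar to running `unzip file_name.zip`.
with zipfile.ZipFile(archive_path) as archive:
    extracted_names = archive.namelist()
    archive.extractall(work_dir)

print(extracted_names)
```

For the real dataset, `archive_path` would point at the uploaded zip file instead of the stand-in archive.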

## Outputs

* Archive:  archive.zip
  inflating: house-metadata.txt      
  inflating: house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: house-price-20211124T154130Z-001/house-price/inherited_houses.csv  

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Import Packages

* Import packages using the 'import' statement followed by the name of the package, for example 'import pandas', which is commonly used for data manipulation and analysis. This can be followed by 'as' and an alias of your choice; 'pd' is the usual convention, although the name is arbitrary.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()
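
A minimal sketch of that directory change, assuming the notebook sits one level below the project root (the actual folder layout is not shown in this notebook):

```python
import os

# Look up the current working directory.
current_dir = os.getcwd()

# The parent folder is the directory name of the current path.
parent_dir = os.path.dirname(current_dir)

# Move the working directory up one level.
os.chdir(parent_dir)
print("Working directory is now:", os.getcwd())
```

Running this cell more than once keeps moving one level up each time, so it should be executed only once per session.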

In [1]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Load the dataset using the pandas function 'read_csv', then sample it to keep only 20% of the rows; random_state is used for reproducibility.
The shape attribute is printed to check the number of rows and columns in the sampled dataset.

In [2]:
# Load dataset
df = pd.read_csv("/workspace/housing-market-analysis/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df = df.sample(frac=0.2, random_state=101)
print(df.shape)
df.head(5)

(292, 24)


Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
1054,1091,898.0,3.0,Mn,932,GLQ,133,,586,Fin,...,90.0,210.0,60,5,8,1065,,2002,2002,255000
361,988,517.0,3.0,No,399,Rec,484,,240,,...,,0.0,0,5,5,883,,1940,1982,145000
1282,1040,0.0,3.0,Mn,532,LwQ,364,,484,Unf,...,61.0,0.0,0,7,5,1040,,1977,2008,150500
161,1572,1096.0,3.0,Av,1016,GLQ,556,,726,Fin,...,110.0,664.0,0,5,9,1572,,2003,2004,412500
515,2020,0.0,3.0,No,1436,GLQ,570,,900,Fin,...,94.0,305.0,54,5,10,2006,,2009,2009,402861
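
To illustrate how `frac` and `random_state` behave in the sampling step above, here is a small sketch on a toy DataFrame. The column name `value` and the ten rows are made up for illustration only.

```python
import pandas as pd

# Toy data: ten rows, so frac=0.2 keeps two of them.
toy = pd.DataFrame({"value": range(10)})

# Two samples drawn with the same seed select the same rows.
sample_a = toy.sample(frac=0.2, random_state=101)
sample_b = toy.sample(frac=0.2, random_state=101)

print(sample_a.shape)
print(sample_a.equals(sample_b))
```

Because the sampled rows keep their original index labels, the index of the sampled DataFrame is not sequential, which is why the output above starts at label 1054.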


* DataFrame information

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 292 entries, 1054 to 1018
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       292 non-null    int64  
 1   2ndFlrSF       276 non-null    float64
 2   BedroomAbvGr   274 non-null    float64
 3   BsmtExposure   282 non-null    object 
 4   BsmtFinSF1     292 non-null    int64  
 5   BsmtFinType1   259 non-null    object 
 6   BsmtUnfSF      292 non-null    int64  
 7   EnclosedPorch  22 non-null     float64
 8   GarageArea     292 non-null    int64  
 9   GarageFinish   246 non-null    object 
 10  GarageYrBlt    276 non-null    float64
 11  GrLivArea      292 non-null    int64  
 12  KitchenQual    292 non-null    object 
 13  LotArea        292 non-null    int64  
 14  LotFrontage    247 non-null    float64
 15  MasVnrArea     292 non-null    float64
 16  OpenPorchSF    292 non-null    int64  
 17  OverallCond    292 non-null    int64  
 18  OverallQual
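
The non-null counts reported by `df.info()` can also be turned into an explicit missing-value summary. The sketch below uses a toy DataFrame with a few column names borrowed from the dataset; the values and the `summary` layout are illustrative assumptions, not output from the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mixing numerical and categorical columns with some gaps.
toy = pd.DataFrame({
    "LotFrontage": [90.0, np.nan, 61.0, 110.0],
    "GarageFinish": ["Fin", None, "Unf", "Fin"],
    "GarageArea": [586, 240, 484, 726],
})

# Count missing entries per column and express them as a percentage of rows.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean().mul(100).round(1)

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_share})
print(summary[summary["missing"] > 0])
```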

# Section 1

* Use feature_engine to replace missing values in all numerical variables with each variable's median

In [4]:
from feature_engine.imputation import MeanMedianImputer
imputer = MeanMedianImputer(imputation_method='median')

We fit and transform the data to get the imputed DataFrame. The result is displayed but not assigned to a variable.

In [6]:
imputer.fit_transform(df)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
1054,1091,898.0,3.0,Mn,932,GLQ,133,0.0,586,Fin,...,90.0,210.0,60,5,8,1065,0.0,2002,2002,255000
361,988,517.0,3.0,No,399,Rec,484,0.0,240,,...,66.0,0.0,0,5,5,883,0.0,1940,1982,145000
1282,1040,0.0,3.0,Mn,532,LwQ,364,0.0,484,Unf,...,61.0,0.0,0,7,5,1040,0.0,1977,2008,150500
161,1572,1096.0,3.0,Av,1016,GLQ,556,0.0,726,Fin,...,110.0,664.0,0,5,9,1572,0.0,2003,2004,412500
515,2020,0.0,3.0,No,1436,GLQ,570,0.0,900,Fin,...,94.0,305.0,54,5,10,2006,0.0,2009,2009,402861
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23,1060,0.0,3.0,No,840,GLQ,200,0.0,572,,...,44.0,0.0,110,7,5,1040,100.0,1976,1976,129900
1190,1622,0.0,3.0,Av,1159,BLQ,90,0.0,1356,Fin,...,66.0,149.0,0,4,4,1249,0.0,1961,1975,168000
683,1668,0.0,3.0,Av,1059,GLQ,567,0.0,702,,...,90.0,215.0,45,5,9,1626,0.0,2002,2002,285000
189,1593,0.0,0.0,Av,1153,GLQ,440,0.0,682,Fin,...,41.0,0.0,120,5,8,1593,0.0,2001,2002,286000


---

In [39]:
imputer.imputer_dict_

{'1stFlrSF': 1063.0,
 '2ndFlrSF': 0.0,
 'BedroomAbvGr': 3.0,
 'BsmtFinSF1': 385.5,
 'BsmtUnfSF': 490.0,
 'EnclosedPorch': 0.0,
 'GarageArea': 481.0,
 'GarageYrBlt': 1982.5,
 'GrLivArea': 1481.5,
 'LotArea': 9110.0,
 'LotFrontage': 66.0,
 'MasVnrArea': 0.0,
 'OpenPorchSF': 32.0,
 'OverallCond': 5.0,
 'OverallQual': 6.0,
 'TotalBsmtSF': 972.0,
 'WoodDeckSF': 0.0,
 'YearBuilt': 1972.5,
 'YearRemodAdd': 1994.0,
 'SalePrice': 160000.0}
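
The dictionary above maps each numerical column to its median. A rough pandas-only sketch of the same idea is shown below; it is not the feature_engine implementation, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two numerical columns with gaps and one categorical column.
toy = pd.DataFrame({
    "LotFrontage": [90.0, np.nan, 61.0, 110.0, 94.0],
    "EnclosedPorch": [0.0, np.nan, np.nan, 0.0, 112.0],
    "GarageFinish": ["Fin", None, "Unf", "Fin", "Fin"],
})

# Learn one median per numerical column (missing values are skipped).
numeric_cols = toy.select_dtypes(include="number").columns
medians = toy[numeric_cols].median().to_dict()

# Fill only the numerical columns; the categorical column is left untouched.
imputed = toy.fillna(value=medians)

print(medians)
print(imputed)
```

As in the notebook output, categorical columns such as GarageFinish keep their missing values, so they would need a separate imputation step.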

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down (do not create a workflow in which, at some point, you need to go back to a previous cell to execute a task, such as refreshing a variable's content)

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder: keeps the try block valid until a folder is created above
except Exception as e:
  print(e)
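
A minimal sketch of creating an output folder with `os.makedirs`. The folder path `outputs/datasets/collection` is a hypothetical example, and a temporary base directory is used so the sketch does not write into the project.

```python
import os
import tempfile

# Temporary base directory so the sketch leaves the project untouched.
base_dir = tempfile.mkdtemp()

# Hypothetical output folder; replace with the folder the project needs.
output_dir = os.path.join(base_dir, "outputs", "datasets", "collection")

# exist_ok=True creates missing parent folders and does not fail on re-runs.
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

print(os.path.isdir(output_dir))
```

With `exist_ok=True` the cell can be re-run safely, which fits the top-down execution rule in the note above.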
