# **Data Collection**

## Objectives

* Upload dataset file into explorer panel.
* Inspect the data and save it under ../house-price-20211124T154130Z-001

## Inputs

* Extract file through terminal, using command 'unzip file_name.zip' 
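
For readers who prefer to stay inside the notebook, the same extraction can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical, and `archive.zip` is assumed from the archive name shown in the Outputs section; this is an illustration, not the step that was actually run.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# "archive.zip" is an assumed file name; adjust it to the uploaded archive.
if Path("archive.zip").exists():
    for name in extract_archive("archive.zip"):
        print("extracted:", name)
```

The guard on `Path.exists()` keeps the cell from failing when the archive has already been removed after extraction.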

## Outputs

* Archive:  archive.zip
  inflating: house-metadata.txt      
  inflating: house-price-20211124T154130Z-001/house-price/house_prices_records.csv  
  inflating: house-price-20211124T154130Z-001/house-price/inherited_houses.csv  

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


# Change working directory
We need to change the working directory from the current folder to its parent folder.

We get the current directory with os.getcwd().

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Import Packages

* Import packages using the 'import' statement followed by the name of the package, for example 'import pandas', a library commonly used for data manipulation and analysis. The package name can be followed by 'as' and an alias of your choice; 'pd' is the conventional alias for pandas, although any name works.

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Load the dataset using the pandas function 'read_csv', then sample it to keep only 20% of the rows, specifying a random_state so that the sample is reproducible. Then use the 'shape' attribute (it is not a method, so it takes no parentheses) to check the number of rows and columns in the sampled dataset.

In [None]:
# Load dataset
df = pd.read_csv("/workspace/housing-market-analysis/house-price-20211124T154130Z-001/house-price/house_prices_records.csv")
df = df.sample(frac=0.2, random_state=101)
print(df.shape)
df.head(5)
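
As a small illustration of why random_state matters, the sketch below samples a toy DataFrame twice with the same seed and checks that both samples are identical. The toy column names and values are made up for the example and are not taken from the housing data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"id": range(10), "value": [v * 1.5 for v in range(10)]})

# Same fraction and same seed, so both calls should select the same rows.
sample_a = toy.sample(frac=0.2, random_state=101)
sample_b = toy.sample(frac=0.2, random_state=101)

print(toy.shape)                  # shape is an attribute, not a method call
print(sample_a.shape)
print(sample_a.equals(sample_b))  # True when the two samples match exactly
```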

* DataFrame information

In [None]:
# A notebook cell only displays its last expression, so print both dtypes
print(df.dtypes['WoodDeckSF'])
print(df.dtypes['BsmtExposure'])

In [None]:
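# fillna returns a new Series, so assign it back to keep the change
df['WoodDeckSF'] = df['WoodDeckSF'].fillna(0.0)
# Count the remaining missing values per column
df.isnull().sum()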
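
When many columns are involved, the raw counts can be easier to read as a sorted table with percentages. The following is a minimal sketch on a toy DataFrame; the column names are hypothetical and the same pattern would apply to df.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (illustrative values, not the housing dataset).
toy = pd.DataFrame({
    "area": [1200.0, np.nan, 950.0, 1800.0],
    "exposure": ["Gd", None, None, "No"],
    "year": [1999, 2004, 1978, 2010],
})

missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100  # share of missing rows per column

# Keep only columns that have gaps, most affected first.
summary = (
    pd.DataFrame({"missing": missing_count, "percent": missing_pct})
    .query("missing > 0")
    .sort_values("missing", ascending=False)
)
print(summary)
```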

# Section 1

* Use feature_engine (MeanMedianImputer) to replace missing numerical data and sklearn.impute (SimpleImputer) to replace missing categorical data
* Use sklearn.preprocessing to scale the numerical variables

In [None]:
from sklearn.impute import SimpleImputer
from feature_engine.imputation import MeanMedianImputer
from sklearn.preprocessing import StandardScaler

# Identify missing values
missing_values = df.isnull().sum()

# Separate columns by data type ('number' covers every integer and float width)
numeric_cols = df.select_dtypes(include='number').columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Impute missing values for numeric columns using median
numeric_imputer = MeanMedianImputer(imputation_method='median')
df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])

# Impute missing values for categorical columns using most frequent
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
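
The cell above imports StandardScaler but does not apply it. A minimal sketch of the scaling step, assuming it is meant to standardize every numeric column to zero mean and unit variance, could look like this on a toy DataFrame (the toy names and values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with no missing values, as scaling would follow imputation.
toy = pd.DataFrame({
    "area": [1200.0, 1500.0, 950.0, 1800.0],
    "year": [1999.0, 2004.0, 1978.0, 2010.0],
    "exposure": ["Gd", "No", "No", "Av"],
})

toy_numeric_cols = toy.select_dtypes(include="number").columns

# Fit on the numeric columns and overwrite them with standardized values.
scaler = StandardScaler()
toy[toy_numeric_cols] = scaler.fit_transform(toy[toy_numeric_cols])

print(toy[toy_numeric_cols].mean().round(6))
print(toy[toy_numeric_cols].std(ddof=0).round(6))  # population std, as the scaler uses
```

In a modelling workflow the scaler would normally be fitted on the training split only and then reused on the test split, so that no information leaks from the test data.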




# Heatmap
I used a heatmap to visualize the pairwise correlations between all the numerical variables in the dataset and gain insight into which features correlate the most and which correlate the least.

In [None]:
plt.figure(figsize=(15, 8))
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix, cmap='YlGnBu', annot=True, fmt=".2f", annot_kws={"size": 8.5})
plt.show()
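
With many features the annotated heatmap gets crowded, so it can help to list the feature pairs ranked by absolute correlation as well. The sketch below does this on synthetic data; the columns "a", "b" and "c" are hypothetical, and the same steps could be applied to corr_matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```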


---

* All missing values have been filled in

In [None]:
df.head(20)

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down (you can't create a workflow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content)

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the cleaned sample without the index column
df.to_csv("outputs/datasets/collection/house-price-2021.csv", index=False)
