# Feature Engineering

#### Import the required libraries

In [3]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd
pd.set_option('display.max_columns', 100)

# Matplotlib for visualization
from matplotlib import pyplot as plt

# display plots in the notebook
%matplotlib inline 

# Seaborn for visualization
import seaborn as sns

#### Import the cleaned dataset

In [4]:
# Load cleaned dataset from the previous lecture
df = pd.read_csv('cleaned_df.csv')
df.head(2)

Unnamed: 0,price,year_sold,property_tax,insurance,beds,baths,sqft,year_built,lot_size,basement,property_type
0,295850,2013,234,81,1,1,584,2013,0,0.0,Condo
1,216500,2006,169,51,1,1,612,1965,0,1.0,Condo


## I. Domain Knowledge

#### A. Popular Properties

Properties with 2 bedrooms and 2 bathrooms are especially popular with investors. Let's create an indicator variable that flags exactly those properties.

In [9]:
# Build your code step by step:
# ((df.beds == 2) & (df.baths == 2))
# ((df.beds == 2) & (df.baths == 2)).astype(int)

has_2_beds = df.beds == 2
has_2_baths = df.baths == 2

# Count how many properties have exactly 2 beds
has_2_beds.value_counts()

beds
False    1617
True      265
Name: count, dtype: int64

In [10]:
# Create indicator variable for properties with 2 beds and 2 baths
df['popular'] = (has_2_beds & has_2_baths).astype(int)

In [11]:
# Check how many properties have 2 beds and 2 baths
df.popular.value_counts()

popular
0    1704
1     178
Name: count, dtype: int64
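The same pattern can be tried on a tiny, made-up DataFrame to see what each step produces. This is only a sketch: the `toy` frame and its values are invented for illustration and are not part of the housing dataset.

```python
import pandas as pd

# Toy data, invented for illustration (not the real dataset)
toy = pd.DataFrame({'beds': [2, 2, 3, 1], 'baths': [2, 1, 2, 1]})

# Each comparison returns a boolean Series
has_2_beds = toy.beds == 2
has_2_baths = toy.baths == 2

# & combines the two masks element-wise; astype(int) maps True/False to 1/0
toy['popular'] = (has_2_beds & has_2_baths).astype(int)

print(toy)
```

When the conditions are written inline, each comparison needs its own parentheses, because `&` binds more tightly than `==` in Python.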

#### B. Housing Market Recession

Because we are modeling housing prices in the United States, it's important to consider the housing market recession that began around 2008. According to data from Zillow, housing prices were at their lowest from 2010 through the end of 2013.

<br>
Create an indicator feature **recession**.

Here's how:
* Your first condition `year_sold >= 2010`
* Your second condition `year_sold <= 2013`
* Combine the two conditions with an `&` operator
* Convert the resulting data to `int` type.

In [12]:
# Create a new variable recession
df['recession'] = ((df.year_sold >= 2010) & (df.year_sold <= 2013)).astype(int)

In [13]:
# Check how many properties were sold during the recession period
df.recession.value_counts()

recession
0    1386
1     496
Name: count, dtype: int64

In [14]:
fd = df[['year_sold','recession']]
fd.head(8)

Unnamed: 0,year_sold,recession
0,2013,1
1,2006,0
2,2012,1
3,2005,0
4,2002,0
5,2004,0
6,2011,1
7,2005,0
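As an aside, pandas also offers `Series.between()`, which is inclusive on both ends by default and expresses the same range check more compactly. The sketch below compares the two forms on made-up years; the `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Toy data, invented for illustration (not the real dataset)
toy = pd.DataFrame({'year_sold': [2009, 2010, 2012, 2013, 2014]})

# Explicit form: two conditions combined with &
explicit = ((toy.year_sold >= 2010) & (toy.year_sold <= 2013)).astype(int)

# Compact form: between() includes both endpoints by default
compact = toy.year_sold.between(2010, 2013).astype(int)

print(explicit.tolist())
print(compact.tolist())
```

Both forms should flag the same rows, so the choice between them is mostly a matter of readability.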


## II. Interaction Features

In the first step, you engineered features from domain knowledge. Next, you'll create interaction features, which can be products, sums, or differences between two features.
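To make the three kinds of interaction concrete, here is a minimal sketch using the two rows shown by `df.head(2)` earlier. The new column names (`tax_and_insurance`, `beds_x_baths`, `years_between`) are hypothetical and chosen only for illustration; whether any of them would help a model is a separate question.

```python
import pandas as pd

# The two rows displayed earlier by df.head(2), restricted to a few columns
toy = pd.DataFrame({
    'year_sold': [2013, 2006],
    'year_built': [2013, 1965],
    'property_tax': [234, 169],
    'insurance': [81, 51],
    'beds': [1, 1],
    'baths': [1, 1],
})

# Sum of two features (hypothetical column name)
toy['tax_and_insurance'] = toy.property_tax + toy.insurance

# Product of two features (hypothetical column name)
toy['beds_x_baths'] = toy.beds * toy.baths

# Difference of two features (hypothetical column name)
toy['years_between'] = toy.year_sold - toy.year_built

print(toy[['tax_and_insurance', 'beds_x_baths', 'years_between']])
```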

#### A. Property Age

We have the features `year_sold` and `year_built`. Let's create a new feature, `property_age`, as the difference between them.

In [21]:
df['property_age'] = df['year_sold'] - df['year_built']

In [22]:
type(df.property_age)

pandas.core.series.Series


Do a quick sanity check on that feature. Run `df.describe()` and check the stats for the feature `property_age`

In [23]:
# Do you see any error?
df[['property_age']].describe()

Unnamed: 0,property_age
count,1882.0
mean,24.126461
std,21.153271
min,-8.0
25%,6.0
50%,20.0
75%,38.0
max,114.0


In [24]:
# Check number of observations with 'property_age' < 0
(df.property_age < 0).sum()

19

A negative age could be a data error, or it could mean that some buyers purchased a house before the construction company finished building it. For the purposes of this project, we will remove these observations.

We'll do a quick ad-hoc data cleaning step and drop them from our dataset.

#### Remove observations where `property_age` is less than 0.
* Keep only observations where `property_age` is 0 and above.

In [25]:
# Print df shape before
print(df.shape)

# Remove rows where property_age is less than 0
df = df[df.property_age >= 0]

# Print df shape after
print(df.shape)

(1882, 14)
(1863, 14)
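Before dropping rows, it can help to look at the offending observations. The sketch below does that on a small, made-up frame and then applies the same filter; the data is invented, and the `.copy()` call is an optional addition, not something the cell above does.

```python
import pandas as pd

# Toy data, invented for illustration (not the real dataset)
toy = pd.DataFrame({
    'year_sold': [2013, 2005, 2010],
    'year_built': [2013, 2007, 1990],
})
toy['property_age'] = toy.year_sold - toy.year_built

# Inspect the rows that were sold before they were built
negative_age = toy[toy.property_age < 0]
print(negative_age)

# Keep only non-negative ages; .copy() returns an independent DataFrame
cleaned = toy[toy.property_age >= 0].copy()
print(cleaned.shape)
```

Comparing the shape before and after, as the cell above does, is a cheap way to confirm that only the expected number of rows was removed.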


## III. Drop Redundant Features

Because we created the new feature `property_age` from `year_built` and `year_sold`, we can drop those two features.

**Remove features `year_built` and `year_sold`**
* Use the DataFrame's `.drop()` method.
* Pass the column names with `columns=[...]` (equivalent to passing them as labels with `axis=1`).
* Remember to set `inplace=True` so that `df` itself is modified.


In [29]:
# Drop 'year_built' and 'year_sold' from the dataset
df.drop(columns=['year_built', 'year_sold'], inplace=True)
df.head()
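A short sketch on made-up data shows that `columns=[...]` and `axis=1` are two spellings of the same operation. It also uses the `errors='ignore'` option of `.drop()`, which is an addition here and not part of the cell above: by default, re-running a drop cell after the columns are already gone raises a `KeyError`, and `errors='ignore'` suppresses that.

```python
import pandas as pd

# Toy data, invented for illustration (not the real dataset)
toy = pd.DataFrame({
    'year_sold': [2013, 2006],
    'year_built': [2013, 1965],
    'property_age': [0, 41],
})

# Two equivalent ways to drop columns (both return a new DataFrame)
by_columns = toy.drop(columns=['year_built', 'year_sold'])
by_axis = toy.drop(['year_built', 'year_sold'], axis=1)

# In-place drop; errors='ignore' skips labels that no longer exist,
# so running the same line twice does not raise
toy.drop(columns=['year_built', 'year_sold'], inplace=True, errors='ignore')
toy.drop(columns=['year_built', 'year_sold'], inplace=True, errors='ignore')

print(list(toy.columns))
```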

#### Save the final dataset

We will save this dataset and train our model on it.

In [None]:
# Save the data as 'final.csv' without writing the row index
df.to_csv('final.csv', index=False)
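To see why the row index is left out when saving, the sketch below writes a small made-up frame both ways and reads each file back. The file names and the temporary directory are hypothetical and used only so the example does not touch `final.csv`.

```python
import os
import tempfile

import pandas as pd

# Toy data, invented for illustration (not the real dataset)
toy = pd.DataFrame({'price': [295850, 216500], 'property_age': [0, 41]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names inside a throwaway directory
    no_index_path = os.path.join(tmp, 'toy_no_index.csv')
    with_index_path = os.path.join(tmp, 'toy_with_index.csv')

    # index=False writes only the data columns
    toy.to_csv(no_index_path, index=False)
    reloaded = pd.read_csv(no_index_path)

    # The default (index=True) also writes the row index as an unnamed column
    toy.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Reading the file back right after saving is a quick check that the next notebook will load the same columns you intended to keep.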