#### Feature Engineering

The following data engineering processes can improve the performance of the machine learning model.

- New features will be derived from existing features, or existing features will be modified

### Feature Engineering

This Jupyter Notebook is a selection of data engineering processes you can apply to your data before model training to maximise the performance of your machine learning model. For this demonstration we will engineer new or improved features for the property sales data you previously wrangled.

#### Feature Engineering Process
- Deriving new variables from existing ones
    - Encoding categorical features
    - Calculating new features from existing features
- Combining features/feature interactions
- Identifying the most relevant features for the model
- Transforming Features
  - [Dividing Data into categories](https://web.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html)
  - Mathematical transformations (for example, logarithmic transformations). Logarithmic transformations are often used on data that exhibit skewness or other irregularities, making the data easier to analyze, visualize, and interpret.
- Creating domain-specific features that incorporate knowledge from the specific domain to capture important characteristics of the data.
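
As a rough illustration of the logarithmic transformation listed above, the sketch below applies `numpy.log1p` to a small right-skewed series. The prices are toy values chosen for illustration and are not taken from the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Toy right-skewed prices (illustrative values, not from the notebook's dataset)
prices = pd.Series(
    [200_000, 250_000, 300_000, 350_000, 450_000, 600_000, 900_000, 1_500_000, 3_000_000],
    name="price",
)

# log1p computes log(1 + x), which stays defined when x is 0
log_prices = np.log1p(prices)

# Skewness closer to 0 indicates a more symmetric distribution
print(f"skew before: {prices.skew():.2f}")
print(f"skew after:  {log_prices.skew():.2f}")

# expm1 inverts log1p, so values can be mapped back to the original scale
restored = np.expm1(log_prices)
```

If a model is trained on a log-transformed target, its predictions would need the same inverse step (`np.expm1`) before being read as prices.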

#### Load the required dependencies

In [9]:
# Import frameworks
import pandas as pd

#### Storing data as a local variable

In [10]:
data_frame = pd.read_csv("wrangled_data_1.csv")

#### Deriving new variables from existing ones

##### Encoding categorical variables

Data Encoding converts textual data into numerical format, so that it can be used as input for algorithms to process. The reason for encoding is that most machine learning algorithms work with numbers and not with text or categorical variables.

I will be encoding the property type column, assigning the value -1 to an apartment and 1 to a house. I chose -1 over 0 because 0 is commonly used as a default or fill value, which could blur the distinction between "apartment" and "no value" and introduce bias.

In [11]:
# Map each property type to a numeric code; any other label becomes NaN
data_frame['type'] = data_frame['type'].map({'Apartment': -1, 'House': 1})
print(data_frame['type'].tail())

10013    1
10014    1
10015   -1
10016    1
10017    1
Name: type, dtype: int64
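
Because any label other than 'Apartment' or 'House' ends up as a missing value, it can be worth checking for unexpected labels before training. The sketch below uses a toy frame; the 'Townhouse' label is a made-up example and is not known to appear in the real data.

```python
import pandas as pd

# Toy frame; 'Townhouse' stands in for any label the mapping does not cover
toy = pd.DataFrame({"type": ["House", "Apartment", "Townhouse", "House"]})

type_codes = {"Apartment": -1, "House": 1}
toy["type_encoded"] = toy["type"].map(type_codes)

# Labels missing from the mapping become NaN, so list them before training
unmapped = toy.loc[toy["type_encoded"].isna(), "type"].unique()
print("Unmapped labels:", list(unmapped))
```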


#### Converting date sold into datetime

- The date a property was sold is a large factor in determining its sale price. The dates are given as strings and must be converted to datetime, and then to a number, so the machine learning model can use them effectively.


In [12]:

# Convert date_sold to datetime
data_frame['date_sold'] = pd.to_datetime(data_frame['date_sold'], format='%d/%m/%y')

# Convert datetime (nanoseconds since the Unix epoch) to seconds as a float
data_frame['ds_float'] = data_frame['date_sold'].astype('int64') / 10**9

# Print the result
print(data_frame[['date_sold', 'ds_float']])

       date_sold      ds_float
0     2016-01-13  1.452643e+09
1     2016-01-13  1.452643e+09
2     2016-01-13  1.452643e+09
3     2016-01-15  1.452816e+09
4     2016-01-15  1.452816e+09
...          ...           ...
10013 2021-12-31  1.640909e+09
10014 2021-12-31  1.640909e+09
10015 2021-12-31  1.640909e+09
10016 2022-01-01  1.640995e+09
10017 2022-01-01  1.640995e+09

[10018 rows x 2 columns]
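
A single timestamp float captures the overall trend over time but not calendar effects. As a possible complement (not something this notebook does), the sketch below derives year, month, and elapsed-days features with the pandas `.dt` accessor. The toy dates and the new column names are illustrative assumptions.

```python
import pandas as pd

# Toy dates in the same day/month/two-digit-year format as the notebook
toy = pd.DataFrame({"date_sold": ["13/01/16", "15/01/16", "31/12/21"]})
toy["date_sold"] = pd.to_datetime(toy["date_sold"], format="%d/%m/%y")

# Calendar components (hypothetical column names)
toy["year_sold"] = toy["date_sold"].dt.year
toy["month_sold"] = toy["date_sold"].dt.month

# Days elapsed since the earliest sale gives a small, readable numeric scale
first_sale = toy["date_sold"].min()
toy["days_since_first_sale"] = (toy["date_sold"] - first_sale).dt.days

print(toy)
```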


#### Combining features/feature interactions

While individual features can be powerful predictors, their interactions often carry even more information. Feature interaction engineering is the process of creating new features that represent the interaction between two or more features.

Domain knowledge is needed to choose meaningful interactions. In this case, feature interactions could include:

- Total number of rooms (bedrooms plus bathrooms)
- Property size x distance from CBD

In [13]:
# Calculate the total number of rooms
data_frame['tot_rooms'] = (data_frame['num_bath'] + data_frame['num_bed']).astype(int)

# Calculate property size x distance from CBD
data_frame['value_score'] = data_frame['property_size'] * data_frame['km_from_cbd']


# Print the result
print(data_frame[['tot_rooms', 'value_score']])

       tot_rooms  value_score
0              6     0.431332
1              6     0.269558
2              4     0.120461
3              4     0.272616
4              4     0.223928
...          ...          ...
10013          7     0.148172
10014         10     0.305451
10015          4     0.000009
10016          6     0.109405
10017          5     0.384657

[10018 rows x 2 columns]
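
One simple way to gauge whether an engineered feature is worth keeping is to compare its correlation with the target against that of the features it was built from. The sketch below does this on toy values. It reuses the `num_bed`, `num_bath`, `tot_rooms`, and `Target` column names from this notebook, but the numbers are invented, and correlation only captures linear relationships.

```python
import pandas as pd

# Toy rows reusing the notebook's column names (values are illustrative)
toy = pd.DataFrame({
    "num_bed": [2, 3, 4, 3, 5],
    "num_bath": [1, 2, 2, 1, 3],
    "Target": [500_000, 750_000, 900_000, 650_000, 1_200_000],
})
toy["tot_rooms"] = toy["num_bed"] + toy["num_bath"]

# Pearson correlation of every feature with the target, strongest first
correlations = toy.corr()["Target"].drop("Target").sort_values(ascending=False)
print(correlations)
```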


#### Transforming features

Filtering is like applying the WHERE clause in a database. It is widely used and helps when you need to work on a specific subset of your data.

From my domain knowledge, it is important to note that pricing for houses and apartments is very different and cannot be captured by a single model. Thus,

- Dataset A will only include data rows that are houses
- Dataset B will only include data rows that are apartments

This reflects the two separate models I will have, where one will predict apartment prices and the other house prices.

In [14]:
# Filter the data to -1 (Apartment)
data_frame = data_frame[data_frame['type'] == -1]

# Print the result
print(data_frame[['date_sold', 'type', 'Target']])

       date_sold  type   Target
23    2016-01-22    -1   272500
71    2016-02-22    -1   595000
104   2016-02-29    -1   755000
148   2016-03-24    -1   730000
158   2016-03-30    -1   900000
...          ...   ...      ...
9951  2021-12-22    -1  1075000
9959  2021-12-22    -1   670000
9986  2021-12-24    -1  3975000
10006 2021-12-28    -1  1510000
10015 2021-12-31    -1  1025000

[682 rows x 3 columns]
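
Note that the cell above overwrites `data_frame`, so the house rows are no longer available without reloading the CSV. A minimal alternative sketch, assuming the same -1/1 encoding, keeps both subsets under separate names. The names `houses` and `apartments` and the toy values are illustrative.

```python
import pandas as pd

# Toy frame using the notebook's encoding: 1 = house, -1 = apartment
toy = pd.DataFrame({
    "type": [1, -1, 1, -1, 1],
    "Target": [900_000, 595_000, 1_200_000, 730_000, 850_000],
})

# .copy() makes each subset independent of the original frame,
# which avoids SettingWithCopyWarning when adding columns later
houses = toy[toy["type"] == 1].copy()       # Dataset A
apartments = toy[toy["type"] == -1].copy()  # Dataset B

print(len(houses), "houses;", len(apartments), "apartments")
```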


#### Creating Domain-Specific Features

Domain knowledge is about understanding the domain or subject area of the data. In this case the domain is real estate.

Possible features could include:

1. Parking availability (binary yes or no)

2. Socioeconomic features such as median income of the suburb (implemented)

3. Flood risk (particularly pertinent as a lot of houses are built on flood plains)

4. The median house price of the surrounding suburb, gathered from an API using latitude and longitude

5. Amount of surrounding public infrastructure (can increase property value)

6. The cash/interest rate at the time of sale, which can affect property value

However, there are currently enough features, so implementing these additional features now would add little value. They will be reconsidered after further testing.
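
Should the parking feature be revisited, a binary flag is simple to derive. The sketch below assumes a hypothetical `num_parking` column holding the number of parking spaces; the notebook's dataset may not contain such a column.

```python
import pandas as pd

# 'num_parking' is a hypothetical column, not confirmed to exist in the dataset
toy = pd.DataFrame({"num_parking": [0, 1, 2, 0]})

# 1 if the property has at least one parking space, otherwise 0
toy["has_parking"] = (toy["num_parking"] > 0).astype(int)

print(toy)
```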

#### Save the wrangled and engineered data to CSV

In [15]:
data_frame.to_csv('../B_Model_Training/model_ready_data/B_model_ready_data_new_1.csv', index=False)
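
CSV files do not store column types, so `date_sold` is read back as plain text unless it is parsed again on load. The sketch below shows a round trip on a toy frame written to a temporary directory. The file name is illustrative and is not the path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a datetime column, mirroring the notebook's 'date_sold'
toy = pd.DataFrame({
    "date_sold": pd.to_datetime(["2016-01-13", "2021-12-31"]),
    "Target": [272_500, 1_025_000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "toy_model_ready.csv"  # illustrative file name
    toy.to_csv(csv_path, index=False)

    # parse_dates restores the datetime dtype that CSV cannot preserve
    reloaded = pd.read_csv(csv_path, parse_dates=["date_sold"])

print(reloaded.dtypes)
```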