# Feature Engineering: data generation from existing datasets
We derive Key Performance Indicators (KPIs) from the existing dataset to obtain more meaningful and actionable features. These KPIs help us understand the relationships between features and give the machine learning model additional information to train on. Derived features such as area per bedroom, bathrooms per bedroom, and area per story can improve the model's ability to predict the target variable. We also generate features that combine location-related indicators (preferred area, main road) and property characteristics (furnishing, air conditioning, parking) with the area of the property, with the aim of improving the model's accuracy and interpretability.


## Step 1: import dataset and libraries

In [3]:
import pandas as pd
import numpy as np

# Read the generated file containing the augmented data
file_path = '/home/mike/Escritorio/codes/projects/PropNet/PropNet-project/2_Data_Processing/correlation_analysis/generated_data.csv'
df = pd.read_csv(file_path)

# Show the first rows to make sure the data loaded correctly
df.head()

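|   | parking | area | prefarea | unfurnished | airconditioning | stories | mainroad | price | basement | semi-furnished | furnished | bathrooms | bedrooms | guestroom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 7420 | 1 | 0 | 1 | 3 | 1 | 13300000 | 0 | 0 | 1 | 2 | 4 | 0 |
| 1 | 3 | 8960 | 0 | 0 | 1 | 4 | 1 | 12250000 | 0 | 0 | 1 | 4 | 4 | 0 |
| 2 | 2 | 9960 | 1 | 0 | 0 | 2 | 1 | 12250000 | 1 | 1 | 0 | 2 | 3 | 0 |
| 3 | 3 | 7500 | 1 | 0 | 1 | 2 | 1 | 12215000 | 1 | 0 | 1 | 2 | 4 | 0 |
| 4 | 2 | 7420 | 0 | 0 | 1 | 2 | 1 | 11410000 | 1 | 0 | 1 | 1 | 4 | 1 |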
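
Several of the ratios built in Step 3 divide by columns such as `guestroom` and `semi-furnished`, which are 0 in most of the rows shown above. A quick check right after loading can reveal how many rows would hit a zero denominator. The following is a minimal sketch and not part of the original notebook: `sample` is a hypothetical stand-in that copies only the relevant columns from the five rows printed above, and `zero_counts` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical stand-in for the loaded file: the five rows printed above,
# restricted to the columns that Step 3 uses as denominators.
sample = pd.DataFrame({
    "area": [7420, 8960, 9960, 7500, 7420],
    "stories": [3, 4, 2, 2, 2],
    "bathrooms": [2, 4, 2, 2, 1],
    "bedrooms": [4, 4, 3, 4, 4],
    "semi-furnished": [0, 0, 1, 0, 0],
    "guestroom": [0, 0, 0, 0, 1],
})

def zero_counts(frame, columns):
    """Return, for each column, how many rows contain a zero."""
    return {col: int((frame[col] == 0).sum()) for col in columns}

denominators = ["area", "stories", "bathrooms", "bedrooms",
                "semi-furnished", "guestroom"]

# Columns expected by Step 3 that are absent from the frame
missing = [col for col in denominators if col not in sample.columns]
zeros = zero_counts(sample, denominators)

print("missing columns:", missing)
print("rows with a zero denominator:", zeros)
```

On the real data, the same two calls can be applied to `df` instead of `sample`.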


## Step 2: Ideate KPIs from existing data

1. `Area per Bedroom`
   This KPI helps to evaluate the space available per bedroom in a property. It is calculated by dividing the total area of the property by the number of bedrooms. A higher ratio suggests larger bedrooms or more efficient use of space.

2. `Bathrooms per Bedroom`
   This KPI measures the number of bathrooms available for each bedroom. It is calculated by dividing the number of bathrooms by the number of bedrooms. A higher ratio may suggest more comfort or luxury in the property.

3. `Area per Story`
   This KPI evaluates how much area is available on each story or floor of the property. It is calculated by dividing the total area by the number of stories. A higher value suggests that the property might be more spacious per floor.

4. `Air Conditioning Presence`
   This KPI indicates whether the property has air conditioning. It is represented as a binary value (1 for presence, 0 for absence). Having air conditioning can significantly increase the comfort of the property.

5. `Bathrooms per Area`
   This KPI evaluates how many bathrooms are available per unit area. It is calculated by dividing the number of bathrooms by the area. A higher value could indicate a property with a higher number of bathrooms relative to its size.

6. `Area per Furnishing Type`
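   This KPI relates the total area of the property to its furnishing status. It is calculated by dividing the area by the sum of the `furnished` and `unfurnished` indicators. When both indicators are 0 (for example, for a semi-furnished property), the denominator is 0 and pandas returns an infinite value; otherwise the result equals the area.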

7. `Parking Availability in Prefarea`
   This KPI captures parking capacity for properties in a preferred area. It is calculated by multiplying the `prefarea` indicator by the `parking` count, so it equals the number of parking spaces for a property in a preferred area and 0 for any other property.

8. `Area per Parking`
   This KPI measures how much property area there is per parking space. It is calculated by dividing the total area by the number of parking spaces plus 1 (the 1 is added to avoid division by zero). A higher value means more area per parking space, which typically corresponds to a large property with few or no parking spaces.

9. `Prefarea Presence`
   This KPI indicates whether the property is located in a preferred area (as defined by the dataset). It is represented as a binary value (1 for presence, 0 for absence). Properties in preferred areas often have higher demand.

10. `Parking in Prefarea`
    This KPI uses the same formula as KPI 7, multiplying the `prefarea` indicator by the `parking` count, and is stored in the same `KPI_Parking_in_Prefarea` column. It can help determine how desirable it is to have parking in a preferred area.
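
To make the definitions above concrete, here is a small worked example that applies several of the formulas to the first row of the `df.head()` output from Step 1. It is only an illustration in plain Python; the `row` and `kpis` dictionaries are hypothetical names and are not used elsewhere in the notebook.

```python
# First row of the df.head() output, restricted to the columns used below
row = {"area": 7420, "bedrooms": 4, "bathrooms": 2,
       "stories": 3, "parking": 2, "prefarea": 1}

kpis = {
    "Area per Bedroom": row["area"] / row["bedrooms"],
    "Bathrooms per Bedroom": row["bathrooms"] / row["bedrooms"],
    "Area per Story": row["area"] / row["stories"],
    "Bathrooms per Area": row["bathrooms"] / row["area"],
    # prefarea is a 0/1 flag, parking is a count of spaces
    "Parking in Prefarea": row["prefarea"] * row["parking"],
    # +1 in the denominator avoids division by zero when parking is 0
    "Area per Parking": row["area"] / (row["parking"] + 1),
}

for name, value in kpis.items():
    print(f"{name}: {value:.4f}")
```

For this row, the area per bedroom is 7420 / 4 = 1855 and the bathrooms-per-bedroom ratio is 2 / 4 = 0.5.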


## Step 3: Create the KPIs

In [4]:
import pandas as pd

# Assumes the dataset is already loaded into a DataFrame named df
# df['KPI_pricePerArea'] = df['price'] / df['area']

# 1. Area per Bedroom
df['KPI_Area_per_Bedroom'] = df['area'] / df['bedrooms']

# 2. Bathrooms per Bedroom
df['KPI_Bathrooms_per_Bedroom'] = df['bathrooms'] / df['bedrooms']

# 3. Area per Story
df['KPI_Area_per_Story'] = df['area'] / df['stories']

# 4. Air Conditioning Presence
df['KPI_AirConditioning_Presence'] = df['airconditioning']

# 5. Bathrooms per Area
df['KPI_Bathrooms_per_Area'] = df['bathrooms'] / df['area']

# 6. Area per Furnishing Type
# Note: the denominator is 0 when both flags are 0 (e.g. semi-furnished rows), which yields inf
df['KPI_Area_per_Furnishing'] = df['area'] / (df['furnished'] + df['unfurnished'])

# 7. Parking Availability in Prefarea
df['KPI_Parking_in_Prefarea'] = df['prefarea'] * df['parking']

# 8. Area per Parking
df['KPI_Area_per_Parking'] = df['area'] / (df['parking'] + 1)  # +1 to avoid division by zero

# 9. Prefarea Presence
df['KPI_Prefarea_Presence'] = df['prefarea']

# 10. Parking in Prefarea
# Same formula and column name as KPI 7: this overwrites the column with identical values
df['KPI_Parking_in_Prefarea'] = df['prefarea'] * df['parking']

# Additional room, story and area features
df['KPI_Area_per_Bathroom'] = df['area'] / df['bathrooms']
df['KPI_Bedrooms_per_Area'] = df['bedrooms'] / df['area']
df['KPI_Stories_per_Area'] = df['stories'] / df['area']
# Features computed with * are products despite the "Ratio" suffix: area if the flag is 1, else 0
df['KPI_AirConditioning_Area_Ratio'] = df['airconditioning'] * df['area']
df['KPI_Area_per_Bedroom_Bathroom'] = df['area'] / (df['bedrooms'] + df['bathrooms'])
df['KPI_Bedrooms_per_Bathroom'] = df['bedrooms'] / df['bathrooms']
df['KPI_Furnished_Area_Ratio'] = df['furnished'] * df['area']
df['KPI_SemiFurnished_Area_Ratio'] = df['semi-furnished'] * df['area']
df['KPI_Stories_per_Bathroom'] = df['stories'] / df['bathrooms']
df['KPI_Mainroad_Area_Ratio'] = df['mainroad'] * df['area']

# KPI_Area_per_Story was already computed as KPI 3, so it is not recomputed here
# Same values as KPI_Stories_per_Area
df['KPI_Stories_per_Land_Area'] = df['stories'] / df['area']
df['KPI_Bedrooms_AirConditioning_Ratio'] = df['airconditioning'] / df['bedrooms']
# Dividing by a binary flag yields inf for every row where the flag is 0
df['KPI_Area_per_SemiFurnished'] = df['area'] / df['semi-furnished']
# Same values as KPI_Bathrooms_per_Area (KPI 5)
df['KPI_Bathrooms_Area_Ratio'] = df['bathrooms'] / df['area']
# If the three furnishing flags are one-hot encoded, the denominator is 1 and this equals bathrooms
df['KPI_Bathrooms_per_Furnishing'] = df['bathrooms'] / (df['furnished'] + df['semi-furnished'] + df['unfurnished'])
# guestroom is a binary flag: both features below are inf wherever guestroom is 0
df['KPI_Area_per_Guestroom'] = df['area'] / df['guestroom']
df['KPI_Bedrooms_per_Guestroom'] = df['bedrooms'] / df['guestroom']
df['KPI_Parking_Area_Ratio'] = df['parking'] / df['area']
df['KPI_Basement_Area_Ratio'] = df['basement'] / df['area']
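
Some of the features above divide by binary columns such as `guestroom` or `semi-furnished`. In pandas, dividing a non-zero number by zero does not raise an error but returns `inf`, which many estimators reject. One possible way to handle this, shown here as a sketch rather than as the approach used in this notebook, is to turn zero denominators into `NaN` before dividing. The helper name `safe_ratio` and the two-row `demo` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical two-row frame: one property without a guestroom, one with
demo = pd.DataFrame({"area": [7420, 7420], "guestroom": [0, 1]})

plain = demo["area"] / demo["guestroom"]            # inf in the first row
safe = safe_ratio(demo["area"], demo["guestroom"])  # NaN in the first row

print(plain.tolist())
print(safe.tolist())
```

Whether `NaN`, a sentinel value, or dropping the feature is the better choice depends on the model trained later, so this is left as a design decision.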


## Export data to CSV

In [5]:
df.info()

df.to_csv('processed_data.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3045 entries, 0 to 3044
Data columns (total 42 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   parking                             3045 non-null   int64  
 1   area                                3045 non-null   int64  
 2   prefarea                            3045 non-null   int64  
 3   unfurnished                         3045 non-null   int64  
 4   airconditioning                     3045 non-null   int64  
 5   stories                             3045 non-null   int64  
 6   mainroad                            3045 non-null   int64  
 7   price                               3045 non-null   int64  
 8   basement                            3045 non-null   int64  
 9   semi-furnished                      3045 non-null   int64  
 10  furnished                           3045 non-null   int64  
 11  bathrooms                           3045 no