# Data Transformation and Cleaning Summary

## Summary of Data Transformations


1. **Column Renaming**:
   - The columns were renamed for better readability and consistency:
     - `MLS` was renamed to `mls`.
     - `sold_price` was renamed to `price`.
     - `HOA` was renamed to `hoa`.

2. **Handling Missing Values**:
   - **Visualization**: A heatmap was generated to visualize missing values in the dataset.
   - **Check for Column Removal**: Columns with more than 90% missing values are candidates for removal. In our case, no column exceeded this threshold, so none was removed.
   - **Imputation**:
     - Numeric columns were filled with the median value.
     - Categorical columns were filled with the mode (most frequent value).

3. **Feature Engineering**:
   - A new column `num_kit_features` was created, representing the number of kitchen features by splitting the `kitchen_features` column and counting the items.

4. **Data Type Conversions**:
   - Converted `fireplaces` and `hoa` columns from `object` to `int`, with missing or invalid values coerced and filled with zeros.
   - Converted `bathrooms` and `garage` columns from `float` to `int` to ensure consistency.

5. **Data Visualization**:
   - Histograms were plotted for numeric columns to visualize their distribution after data cleaning.
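
Steps 1 to 4 above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the code used for this report: the function name `clean_listings`, the `max_missing` parameter, and the comma separator used to split `kitchen_features` are assumptions, and the order of operations simply follows the list above.

```python
import pandas as pd


def clean_listings(df: pd.DataFrame, max_missing: float = 0.90) -> pd.DataFrame:
    """Hypothetical sketch of steps 1-4 (rename, missing values, feature, dtypes)."""
    # 1. Rename columns to a consistent lowercase style.
    df = df.rename(columns={"MLS": "mls", "sold_price": "price", "HOA": "hoa"})

    # 2a. Drop columns whose share of missing values exceeds the threshold.
    missing_share = df.isna().mean()
    df = df.loc[:, missing_share <= max_missing].copy()

    # 2b. Impute: median for numeric columns, mode for all other columns.
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].median())
            else:
                df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3. Count kitchen features (assumes a comma-separated string).
    df["num_kit_features"] = df["kitchen_features"].str.split(",").str.len()

    # 4a. Coerce text-encoded numbers; invalid entries become NaN, then 0.
    for col in ("fireplaces", "hoa"):
        df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0).astype(int)

    # 4b. Cast float columns to int (this truncates any fractional part).
    for col in ("bathrooms", "garage"):
        df[col] = df[col].astype(int)

    return df
```

Note that in this sketch a column still stored as `object` at step 2b is imputed with its mode, so the exact treatment of `hoa` depends on where the type conversion happens in the real pipeline.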
        

## Comparison of Dataset Before and After Changes


| Transformation | Before (Original Dataset) | After (Transformed Dataset) |
|----------------|---------------------------|-----------------------------|
| **Column Names** | `MLS`, `sold_price`, `HOA` | `mls`, `price`, `hoa` |
| **Missing Values Handling** | Several columns with missing values | All columns kept (none exceeded 90% missing values); missing values imputed |
| **New Feature** | No `num_kit_features` column | `num_kit_features` column added |
| **Data Types** | `fireplaces`, `hoa`, `bathrooms`, `garage` as mixed types | All above columns converted to `int` |
| **Number of features** | 16 | 17 |
| **Number of observations (rows)** | 5000 | 4605 |

## Detailed Explanation


1. **Column Renaming**:
   - Renaming columns to a consistent style is a best practice that improves code readability and maintainability. Renaming `MLS` to `mls`, `sold_price` to `price`, and `HOA` to `hoa` makes these columns easier to understand and reference in further analysis.

2. **Handling Missing Values**:
   - **Threshold-Based Removal**: No column had more than 90% missing values, so none was removed.
   - **Imputation**: Filling missing values with the median for numeric columns limits the influence of outliers on the imputed values, while using the mode for categorical columns preserves the most common category. Missing values per column:
     - 10 missing values in `lot_acres`: filled with the median
     - 6 missing values in `bathrooms`: filled with the median
     - 56 missing values in `sqrt_ft`: filled with the median
     - 7 missing values in `garage`: filled with the median
     - 33 missing values in `kitchen_features`: filled with the mode
     - 1 missing value in `floor_covering`: filled with the mode
     - 562 missing values in `hoa`: filled with the median

3. **Feature Engineering**:
   - The `num_kit_features` column was created from `kitchen_features` by counting the listed items. This new feature may be valuable in predicting the price of a house.

4. **Data Type Conversions**:
   - Ensuring that data types are consistent (e.g., converting `fireplaces`, `hoa`, `bathrooms`, and `garage` to integers) lets these columns be used directly in numeric operations and plots.

5. **Initial Data Visualization**:
   - Data distributions were visualized after cleanup to confirm that the data issues were resolved and to check for remaining problems such as outliers.

6. **Handling outliers**:
   - The distribution is right-skewed rather than symmetric, which indicates that the data is not normally distributed. We therefore use the IQR method to handle outliers instead of the z-score method, which is better suited to approximately normal data.
   - Final dataset shape: `4606 * 17`

7. **Final Data Visualization**:
   - A final check to confirm that the dataset is clean before handing it over to the Modeling team.
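
The IQR-based outlier handling in step 6 can be illustrated with a short sketch. The report does not state which columns were filtered or which multiplier was used, so the function name `drop_iqr_outliers`, the `price` column in the usage example, and the conventional factor `k = 1.5` are all assumptions.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # `between` is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Usage on made-up values: the single extreme price is removed.
sample = pd.DataFrame({"price": [100, 110, 120, 130, 140, 1000]})
filtered = drop_iqr_outliers(sample, "price")
```

Because the bounds are built from quartiles rather than the mean and standard deviation, a few extreme values in the long right tail do not shift the cut-offs much, which is the usual reason to prefer this rule for skewed data.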

**Younes ABAROUDI**, 31/08/2024