
In [5]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'fe-course-data:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F933090%2F1828856%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240916%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240916T005825Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D36363de68ffc16ccd087f307db26523172d51cb6d857979a80eacfbf7aac8b1b72a721976b903b0caa76920042618dbcddb16f790c8d504bb74afb52189170e026e254d81ef888d643689a9a72e119ee3efefbcfed9b673f1b03f20ef585fbe7c05f69427a0dcd15fc44e2931d94185cf042dc00fc6d1709b9c8779093ef6534f9744d761faa7801e7e73cf45f003254248412e3a428e3b83282d280327b9b1beff0f90b6b146c79221315d9ef13041d6814d136254b7a992bedd7701a5a29a46cb0b29a628d54484043382d951dcbd870250892178c5474a24e518326e622451a47655bb5f3cc78723ac69492546b2f4de530fba66e5837dcfd161462859fef'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the `tarfile` module is not shadowed
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


Downloading fe-course-data, 27023163 bytes compressed
Downloaded and uncompressed: fe-course-data
Data source import complete.


**This notebook is an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/creating-features).**

---


# Introduction #

In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

In [7]:


import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=XGBRegressor()):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    # The scorer returns negated MSLE; flip the sign and take the root
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")

-------------------------------------------------------------------------------

Let's start with a few mathematical combinations. We'll focus on features describing areas: because they share the same units (square feet), they are easy to combine in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums.

# 1) Create Mathematical Transforms

Create the following features:

- `LivLotRatio`: the ratio of `GrLivArea` to `LotArea`
- `Spaciousness`: the sum of `FirstFlrSF` and `SecondFlrSF` divided by `TotRmsAbvGrd`
- `TotalOutsideSF`: the sum of `WoodDeckSF`, `OpenPorchSF`, `EnclosedPorch`, `Threeseasonporch`, and `ScreenPorch`
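
As a warm-up, here is a minimal sketch of the two operations involved (a ratio of two columns and a row-wise sum of several columns) on a made-up frame. The column names mirror the exercise, but the values and the `toy`, `features`, and `OutsideSF` names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: names mirror the exercise, values are made up
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 900.0],
    "LotArea": [6000.0, 4500.0],
    "WoodDeckSF": [100.0, 0.0],
    "OpenPorchSF": [20.0, 30.0],
})

features = pd.DataFrame(index=toy.index)  # holds the new columns

# Ratio: element-wise division of two columns with the same units
features["LivLotRatio"] = toy["GrLivArea"] / toy["LotArea"]

# Sum: add the selected columns across each row (axis=1)
features["OutsideSF"] = toy[["WoodDeckSF", "OpenPorchSF"]].sum(axis=1)

print(features)
```

Selecting the columns as a list and calling `.sum(axis=1)` is one alternative to chaining `+`; both should give the same result when no values are missing.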

In [11]:
# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

X_1["LivLotRatio"] = X["GrLivArea"] / X["LotArea"]
X_1["Spaciousness"] = (X["FirstFlrSF"] + X["SecondFlrSF"]) / X["TotRmsAbvGrd"]
X_1["TotalOutsideSF"] = (
    X["WoodDeckSF"] + X["OpenPorchSF"] + X["EnclosedPorch"]
    + X["Threeseasonporch"] + X["ScreenPorch"]
)


# Check your answer

In [None]:
# Lines below will give you a hint or solution code
#q_1.hint()
#q_1.solution()

-------------------------------------------------------------------------------

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

```
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
```
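
The template above uses placeholder column names, so it will not run as written. A small runnable sketch of the same idea follows; the `Kind` and `Size` columns and their values are hypothetical, and `dtype=float` is passed only so that the indicator columns are numeric regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one continuous column
toy = pd.DataFrame({
    "Kind": ["A", "B", "A"],
    "Size": [10.0, 20.0, 30.0],
})

# One indicator column per category, named Kind_A, Kind_B
dummies = pd.get_dummies(toy["Kind"], prefix="Kind", dtype=float)

# Multiply each row of indicators by that row's Size (axis=0 aligns on the index)
interaction = dummies.mul(toy["Size"], axis=0)

print(interaction)
```

Each row ends up with its `Size` value in the column for its own category and zeros elsewhere, which lets a model treat the continuous feature separately within each category.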

# 2) Interaction with a Categorical

We discovered an interaction between `BldgType` and `GrLivArea` in Exercise 2. Now create their interaction features.

In [16]:
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(df["BldgType"], prefix="Bldg")
# Multiply
X_2 = X_2.mul(df["GrLivArea"], axis=0)

X_2

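| | Bldg_Duplex | Bldg_OneFam | Bldg_Twnhs | Bldg_TwnhsE | Bldg_TwoFmCon |
|---|---|---|---|---|---|
| 0 | 0.0 | 1656.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 896.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1329.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 2110.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1629.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 2925 | 0.0 | 1003.0 | 0.0 | 0.0 | 0.0 |
| 2926 | 0.0 | 902.0 | 0.0 | 0.0 | 0.0 |
| 2927 | 0.0 | 970.0 | 0.0 | 0.0 | 0.0 |
| 2928 | 0.0 | 1389.0 | 0.0 | 0.0 | 0.0 |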


In [None]:
# Lines below will give you a hint or solution code
#q_2.hint()
#q_2.solution()

# 3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature `PorchTypes` that counts how many of the following are greater than 0.0:

```
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
```
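
One way to build such a count is to turn the columns into a boolean mask and sum across each row, since `True` counts as 1. The sketch below uses three made-up columns; `toy`, `mask`, and `counts` are illustrative names, not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data: three area columns, values are made up
toy = pd.DataFrame({
    "WoodDeckSF": [210.0, 0.0, 140.0],
    "OpenPorchSF": [62.0, 0.0, 0.0],
    "ScreenPorch": [0.0, 0.0, 120.0],
})

# True where the area is strictly positive
mask = toy.gt(0)

# Summing booleans across columns counts the True values per row
counts = mask.sum(axis=1)

print(counts)
```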

In [19]:
df[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].head(5)

| | WoodDeckSF | OpenPorchSF | EnclosedPorch | Threeseasonporch | ScreenPorch |
|---|---|---|---|---|---|
| 0 | 210.0 | 62.0 | 0.0 | 0.0 | 0.0 |
| 1 | 140.0 | 0.0 | 0.0 | 0.0 | 120.0 |
| 2 | 393.0 | 36.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 212.0 | 34.0 | 0.0 | 0.0 | 0.0 |


In [20]:
X_3 = pd.DataFrame()

componentes = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]

# YOUR CODE HERE
X_3["PorchTypes"] = df[componentes].gt(0).sum(axis=1)

X_3.head()

| | PorchTypes |
|---|---|
| 0 | 2 |
| 1 | 2 |
| 2 | 2 |
| 3 | 0 |
| 4 | 2 |


In [None]:
# Lines below will give you a hint or solution code
#q_3.hint()
#q_3.solution()

# 4) Break Down a Categorical Feature

`MSSubClass` describes the type of a dwelling:

In [22]:
df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting `MSSubClass` at the first underscore `_`. (Hint: In the `split` method use an argument `n=1`.)
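
To see what `n=1` does, here is a short sketch on three of the labels listed above. With `n=1` the string is split only once, so everything after the first underscore stays together in the second column; `labels`, `parts`, and `first_word` are illustrative names.

```python
import pandas as pd

# Three labels taken from the MSSubClass values shown above
labels = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

# n=1 limits the split to the first underscore; expand=True returns a DataFrame
parts = labels.str.split("_", n=1, expand=True)

# Column 0 holds the first word, column 1 the remainder of the label
first_word = parts[0]

print(parts)
```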

In [37]:
X_4 = pd.DataFrame()

# YOUR CODE HERE

X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]


In [34]:
df.MSSubClass.str.split("_", n=1, expand=True)[0]

| | 0 |
|---|---|
| 0 | One |
| 1 | One |
| 2 | One |
| 3 | One |
| 4 | Two |
| ... | ... |
| 2925 | Split |
| 2926 | One |
| 2927 | Split |
| 2928 | One |


In [None]:
# Lines below will give you a hint or solution code
#q_4.hint()
#q_4.solution()

# 5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature `MedNhbdArea` that describes the *median* of `GrLivArea` grouped on `Neighborhood`.
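
A grouped transform computes one statistic per group and then broadcasts it back to every row of that group, so the result has the same length as the original frame. The sketch below illustrates this on made-up data; the neighborhood names, the values, and the extra `AreaVsMedian` column are hypothetical and not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data: two neighborhoods with made-up living areas
toy = pd.DataFrame({
    "Neighborhood": ["North", "North", "North", "South", "South"],
    "GrLivArea": [1000.0, 1200.0, 2000.0, 1500.0, 1700.0],
})

# Median per neighborhood, repeated on every row of that neighborhood
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")

# Optional follow-up (an assumption, not asked for here):
# how far each home is from its neighborhood's median
toy["AreaVsMedian"] = toy["GrLivArea"] - toy["MedNhbdArea"]

print(toy)
```

Unlike `.agg("median")`, which would return one row per neighborhood, `.transform("median")` keeps the original index, so the new column can be assigned directly.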

In [40]:
X_5 = pd.DataFrame()

# YOUR CODE HERE

X_5["MedNhbdArea"] = (
    df.groupby("Neighborhood")  # for each neighborhood
    ["GrLivArea"]               # select the living area
    .transform("median")        # and compute its median
)


In [41]:
X_5.head(5)

| | MedNhbdArea |
|---|---|
| 0 | 1200.0 |
| 1 | 1200.0 |
| 2 | 1200.0 |
| 3 | 1200.0 |
| 4 | 1560.0 |


In [None]:
# Lines below will give you a hint or solution code
#q_5.hint()
#q_5.solution()

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

In [42]:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

0.13954039591355258

# Keep Going #

[**Untangle spatial relationships**](https://www.kaggle.com/ryanholbrook/clustering-with-k-means) by adding cluster labels to your dataset.

---




*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/feature-engineering/discussion) to chat with other learners.*