# Introduction #

Run this cell to set everything up.

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex5 import *

The *Spotify* dataset contains features describing almost 33,000 popular songs. In this exercise, you'll predict a song's popularity from some features describing its acoustic qualities and genre.

In [None]:
import pandas as pd

df = pd.read_csv("../input/fe-course-data/spotify.csv")

FEATURES = [
    "playlist_genre",
    "playlist_subgenre",
    "danceability",
    "energy",
    "loudness",
    "speechiness",
    "acousticness",
    "instrumentalness",
    "liveness",
    "valence",
    "tempo",
    "duration_ms",
]

GENRES = [['pop', 'rap', 'rock', 'latin', 'r&b', 'edm']]  # predefined genres: one inner list per encoded column

X = df.dropna()
y = X.pop("track_popularity").to_frame()
X = X[FEATURES]
X.head()

# Step 1 - Split Data #

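Target encoding is a supervised transform: it uses the target `y` to compute the encoded values. It therefore needs to be fit on data that is separate from the data the model will be trained on. Otherwise, the encoded feature leaks target information into the training set and the model is likely to overfit.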

Split `X` and `y` so that 25% of the data is used to fit the target encoding and the remaining 75% is used to train the model.
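
To make the idea concrete, here is a minimal sketch of plain mean encoding on a hypothetical toy frame (the column names and values are made up for illustration). It is not the exact formula used by `category_encoders.TargetEncoder`, which also applies smoothing; it only shows that the category means are learned from one set of rows and applied to another.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a numeric target.
toy = pd.DataFrame({
    "subgenre": ["trap", "trap", "indie", "indie", "indie", "disco"],
    "popularity": [60, 80, 30, 50, 40, 90],
})

# Reserve some rows for learning the encoding; the rest would train the model.
encode_rows = toy.iloc[:4]  # used only to compute the category means
train_rows = toy.iloc[4:]   # never seen while computing the means

# Plain mean encoding: average target per category, from the encoding rows only.
category_means = encode_rows.groupby("subgenre")["popularity"].mean()
global_mean = encode_rows["popularity"].mean()

# Apply to the training rows; unseen categories fall back to the global mean.
encoded = train_rows["subgenre"].map(category_means).fillna(global_mean)
print(encoded.tolist())  # [40.0, 55.0]
```

The `indie` row receives the mean learned from the encoding rows (40.0), while `disco`, which never appears in the encoding rows, receives the global mean (55.0). Neither value depends on the training rows' own targets.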

In [None]:
from sklearn.model_selection import train_test_split


# YOUR CODE HERE: Create a split for encoding
X_encode, X_train, y_encode, y_train = train_test_split(X, y, train_size=0.25)


# Check your answer
q_1.check()
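
As a small sanity check on how `train_size` behaves, the sketch below splits hypothetical toy arrays (stand-ins for `X` and `y`) and inspects the resulting shapes. The `random_state` argument is an addition here, not part of the exercise; it only makes the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays standing in for X and y.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# train_size=0.25 sends 25% of the rows to the first output of each pair.
X_enc, X_rest, y_enc, y_rest = train_test_split(
    X_toy, y_toy, train_size=0.25, random_state=0
)

print(X_enc.shape, X_rest.shape)  # (5, 2) (15, 2)
```

The first and third outputs form the smaller 25% portion, which is why the exercise assigns them to `X_encode` and `y_encode`.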

# Step 2 - Define Encoder #

Define two transforms for this dataset:
1. a `OneHotEncoder` for the `playlist_genre` feature with argument `categories=GENRES`
2. a `TargetEncoder` for the `playlist_subgenre` feature

Use `make_column_transformer` to apply each transform to the appropriate feature.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.compose import make_column_transformer


# YOUR CODE HERE
one_hot_encoder = OneHotEncoder(categories=GENRES)
target_encoder = TargetEncoder()
encoder = make_column_transformer(
    (one_hot_encoder, ["playlist_genre"]),
    (target_encoder, ["playlist_subgenre"]),
    remainder="passthrough",
)


# Check your answer
q_2.check()
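
For a feel of what a column transformer does, here is a minimal sketch on hypothetical toy frames, using only scikit-learn's `OneHotEncoder` and a pass-through column. The frame and column names are illustrative assumptions, not part of the Spotify data.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; column names are illustrative only.
fit_df = pd.DataFrame({"genre": ["pop", "rock", "pop"], "tempo": [120.0, 95.0, 128.0]})
new_df = pd.DataFrame({"genre": ["rock", "pop"], "tempo": [100.0, 110.0]})

# Fixing the category list keeps the output columns in a known order,
# even if a category ("edm" here) is absent from the data used for fitting.
toy_encoder = make_column_transformer(
    (OneHotEncoder(categories=[["pop", "rock", "edm"]]), ["genre"]),
    remainder="passthrough",  # leave the numeric column untouched
)

toy_encoder.fit(fit_df)
encoded = toy_encoder.transform(new_df)
# Depending on density, the result may be a SciPy sparse matrix; densify if so.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded)
```

Each row comes out as three one-hot columns (in the order `pop`, `rock`, `edm`) followed by the untouched `tempo` value.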

# Step 3 - Fit Encoder #

Now fit the column transformer you just created on the encoding data (`X_encode`, `y_encode`). Then use the fitted transformer to encode `X_train`, which the encoder has not seen.

In [None]:
# YOUR CODE HERE: Fit the transformer to the encoding data
encoder.fit(X_encode, y_encode)

# YOUR CODE HERE: Encode the training data
X_train_enc = encoder.transform(X_train)


# Check your answer
q_3.check()

If you like, run the next cell to see the result of the encoding.

In [None]:
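pd.DataFrame(
    X_train_enc,
    # get_feature_names() was removed in scikit-learn 1.2;
    # on scikit-learn older than 1.0, use encoder.get_feature_names() instead.
    columns=encoder.get_feature_names_out()
)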

# Step 4 - Define Model #

Create an instance of `XGBRegressor` to use for prediction.

In [None]:
from xgboost import XGBRegressor


# YOUR CODE HERE: create the XGBoost instance
xgboost = XGBRegressor()


# Check your answer
q_4.check()

# Step 5 - Validate #

Finally, validate the model with 5-fold cross-validation, scoring by mean absolute error (MAE).

In [None]:
from sklearn.model_selection import cross_val_score


# YOUR CODE HERE
score = cross_val_score(
    xgboost, X_train_enc, y_train, cv=5, scoring='neg_mean_absolute_error'
)
score = -1 * score.mean()
print("Score: {:.4f}".format(score))


# Check your answer
q_5.check()
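
The sign flip in the scoring step can look odd at first. scikit-learn scorers follow a "higher is better" convention, so error metrics are reported negated. The sketch below shows this on hypothetical synthetic data, with `LinearRegression` standing in for `XGBRegressor` purely to keep the example light.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data; LinearRegression is a stand-in for XGBRegressor.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One score per fold, each a negated MAE (so every value is <= 0).
neg_mae = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring="neg_mean_absolute_error"
)

mae = -neg_mae.mean()  # flip the sign to recover an ordinary MAE
print(f"Mean MAE across {len(neg_mae)} folds: {mae:.4f}")
```

Averaging the fold scores before negating, as here, gives the same number as negating each fold and then averaging.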

```
# Hint: What should you use for the training data?
```

# Keep Going #