# Preprocessing data

**Encoding dummy variables**

In [None]:
"""

Call pd.get_dummies, passing the categorical column. As we only need nine of our ten binary features (the tenth is implied when all the others are zero), we can set the drop_first argument to True.
To bring these binary features back into our original DataFrame, we can use pd.concat, passing a list containing the music DataFrame and our dummies DataFrame,
and setting axis equal to one. Lastly, we can remove the original genre column using df.drop, passing the column and setting axis equal to one.



import pandas as pd
music_df = pd.read_csv('music.csv')
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())


music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)

"""

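As a minimal sketch of the steps described above, the toy DataFrame below (hypothetical data, not the music dataset) is encoded the same way. With three genres, `drop_first=True` leaves two binary columns; the dropped category, "Blues", is implied whenever both remaining columns are zero.

```python
import pandas as pd

# Hypothetical toy data standing in for music_df
toy_df = pd.DataFrame({
    "popularity": [41, 62, 59, 70],
    "genre": ["Rock", "Jazz", "Rock", "Blues"],
})

# Encode only the categorical column, dropping the first category ("Blues")
toy_dummies = pd.get_dummies(toy_df["genre"], drop_first=True)

# Join the binary columns to the original data, then drop the original column
toy_encoded = pd.concat([toy_df, toy_dummies], axis=1).drop("genre", axis=1)
print(toy_encoded.columns.tolist())
```

Because only the column was passed, the new columns carry the bare category names.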





"""

If the DataFrame has only one categorical feature, we can pass the entire DataFrame, thus skipping the step of combining variables.
If we don't specify a column, the new DataFrame's binary columns will have the original feature name prefixed,
so they will start with genre_. Notice the original genre column is automatically dropped. Once we have dummy variables, we can fit models as before.


music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns)

"""

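The prefixing behaviour can be checked on a small example. This is a sketch on hypothetical toy data; the column names are chosen only to mirror the music dataset.

```python
import pandas as pd

# Hypothetical toy data with a single categorical column
toy_df = pd.DataFrame({
    "popularity": [41, 62, 59, 70],
    "genre": ["Rock", "Jazz", "Rock", "Blues"],
})

# Passing the whole DataFrame encodes the categorical column in place
toy_dummies = pd.get_dummies(toy_df, drop_first=True)
print(toy_dummies.columns.tolist())
```

The numeric column is left untouched, and the original "genre" column no longer appears in the result.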
In [None]:
"""

Use a relevant function, passing the entire music_df DataFrame, to create music_dummies, dropping the first binary column.
Print the shape of music_dummies.

"""


# Create music_dummies
music_dummies = pd.get_dummies(music_df, drop_first = True)

# Print the new DataFrame's shape
print("Shape of music_dummies: {}".format(music_dummies.shape))

In [None]:
"""

Create X, containing all features in music_dummies, and y, consisting of the "popularity" column, respectively.
Instantiate a ridge regression model, setting alpha equal to 0.2.
Perform cross-validation on X and y using the ridge model, setting cv equal to kf, and using negative mean squared error as the scoring metric.
Print the RMSE values by converting negative scores to positive and taking the square root.

"""


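# Import the tools used below (kf is assumed to be an existing KFold object)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Create X and y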
X = music_dummies.drop("popularity", axis=1).values
y = music_dummies["popularity"].values

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate RMSE
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))
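
For readers without the course environment, the same workflow can be sketched on synthetic data. Everything suffixed `_demo` below is an assumption made for illustration, including the KFold settings; the exercise's own `kf` may be configured differently.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the music features and target
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Assumed cross-validation setup
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn returns negated MSE so that higher scores are always better
neg_mse = cross_val_score(Ridge(alpha=0.2), X_demo, y_demo,
                          cv=kf_demo, scoring="neg_mean_squared_error")

# Flip the sign, then take the square root to get RMSE per fold
rmse_demo = np.sqrt(-neg_mse)
print("Average RMSE:", rmse_demo.mean())
print("Standard deviation of the target:", np.std(y_demo))
```

An average RMSE well below the target's standard deviation suggests the model does better than always predicting the mean.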

# Handling missing data

**Dropping missing data**

In [None]:
"""

A common approach is to remove missing observations when they account for less than 5% of all data.
To do this, we use pandas' .dropna method, passing a list of columns with less than 5% missing values to the subset argument.
If a row has a missing value in any of the subset columns, the entire row is removed. Rechecking the DataFrame, we see fewer missing values.


print(music_df.isna().sum().sort_values())


music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
print(music_df.isna().sum().sort_values())

"""

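Rather than typing the column list by hand, the 5% rule can be applied programmatically. The following is a sketch on a hypothetical 100-row DataFrame; the threshold and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows with different amounts of missingness
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "tempo": rng.normal(120, 10, 100),
    "loudness": rng.normal(-8, 2, 100),
    "energy": rng.uniform(0, 1, 100),
})
toy_df.loc[[3, 17], "tempo"] = np.nan    # 2% missing
toy_df.loc[[5], "loudness"] = np.nan     # 1% missing
toy_df.loc[20:49, "energy"] = np.nan     # 30% missing

# Share of missing values per column
missing_share = toy_df.isna().mean()

# Columns with some, but less than 5%, missing values
cols_under_5pct = missing_share[(missing_share > 0) & (missing_share < 0.05)].index.tolist()

# Drop rows that are missing a value in any of those columns
cleaned = toy_df.dropna(subset=cols_under_5pct)
print(cols_under_5pct, cleaned.shape)
```

The heavily missing "energy" column is left alone here, since dropping its rows would discard a large share of the data; it is a candidate for imputation instead.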
**Imputing Values**

In [None]:
"""

Another option is to impute missing data. This means making an educated guess as to what the missing values could be.
We can impute the mean of all non-missing entries for a given feature. We can also use other values like the median.
For categorical values we commonly impute the most frequent value.


Note: we must split our data before imputing to avoid leaking test set information into our model, a problem known as data leakage.

"""

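To make the leakage point concrete, the sketch below fits a mean imputer on a tiny training array only and then reuses the learned mean on the test array. The arrays are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data, already split into train and test
toy_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
toy_test = np.array([[np.nan], [100.0]])

# Learn the mean from the training data only
toy_imputer = SimpleImputer(strategy="mean")
toy_train_imputed = toy_imputer.fit_transform(toy_train)

# Apply the training mean to the test data; the test value 100.0 plays no part
toy_test_imputed = toy_imputer.transform(toy_test)
print(toy_imputer.statistics_, toy_test_imputed.ravel())
```

Had the imputer been fitted on the combined data, the extreme test value would have shifted the fill value used for the training rows. For categorical features, `strategy="most_frequent"` is the corresponding option.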
In [None]:
### Dropping missing data

"""

Print the number of missing values for each column in the music_df dataset, sorted in ascending order.
Remove rows with missing values in any column that has 50 or fewer missing values.
Convert music_df["genre"] to values of 1 if the row contains "Rock", otherwise change the value to 0.

"""



# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Drop rows with missing values in columns where less than 5% are missing
music_df = music_df.dropna(subset=["genre" , "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))

In [None]:
### Pipeline for song genre prediction: I

"""

Import SimpleImputer and Pipeline.
Instantiate an imputer.
Instantiate a KNN classifier with three neighbors.
Create steps, a list of tuples containing the imputer variable you created, called "imputer", followed by the knn model you created, called "knn"

"""


# Import modules
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors = 3)

# Build steps for the pipeline
steps = [("imputer", imputer),
         ("knn", knn)]

In [None]:
### Pipeline for song genre prediction: II

"""

Create a pipeline using the steps you previously defined.
Fit the pipeline to the training data.
Make predictions on the test set.
Calculate and print the confusion matrix.

"""


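# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Build steps for the pipeline, reusing the imputer and knn objects created earlier
steps = [("imputer", imputer),
         ("knn", knn)]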

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training data
pipeline.fit(X_train , y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
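
A self-contained version of the two pipeline exercises might look like the following. It is a sketch on synthetic data with values knocked out at random; the `_demo` names, the 5% missingness rate, and the split settings are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic binary classification data with roughly 5% of entries set to NaN
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Split first, so the imputer only ever learns from the training portion
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Impute with the column mean (the SimpleImputer default), then classify
demo_pipeline = Pipeline([("imputer", SimpleImputer()),
                          ("knn", KNeighborsClassifier(n_neighbors=3))])
demo_pipeline.fit(X_tr, y_tr)

cm = confusion_matrix(y_te, demo_pipeline.predict(X_te))
print(cm)
```

Calling `fit` on the pipeline fits the imputer on the training data and passes the filled values to the classifier, so no separate imputation step is needed before prediction.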

# Centering and Scaling

**Why scale our data?**

In [None]:
"""

Many machine learning models rely on some form of distance between observations, so features on far larger scales can disproportionately influence the model.

For example, KNN uses distance explicitly when making predictions. For this reason, we actually want features to be on a similar scale.
To achieve this, we can normalize or standardize our data, often referred to as scaling and centering.

"""

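A small numeric illustration of the problem, using made-up songs described by a duration in milliseconds and an energy score between 0 and 1:

```python
import numpy as np

# Hypothetical songs: [duration_ms, energy]
song_a = np.array([200_000.0, 0.10])
song_b = np.array([200_500.0, 0.90])   # half a second longer, very different energy
song_c = np.array([201_000.0, 0.10])   # one second longer, identical energy

# Euclidean distances on the raw, unscaled features
dist_ab = np.linalg.norm(song_a - song_b)
dist_ac = np.linalg.norm(song_a - song_c)
print(dist_ab, dist_ac)
```

On the raw values, song B looks twice as close to song A as song C does, even though B differs in energy by almost the whole possible range. The duration feature dominates simply because its numbers are larger.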
**How to scale our data**

In [None]:
"""

There are several ways to scale our data: given any column, we can subtract the mean and divide by the standard deviation so that all features are centered around zero and have a variance of one.
This is called standardization.

We can also subtract the minimum and divide by the range of the data so the normalized dataset has minimum zero and maximum one.

Alternatively, we can rescale our data so that it ranges from -1 to 1 instead.

"""

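The three options can be compared on a single toy column. This sketch uses scikit-learn's `StandardScaler` and `MinMaxScaler`; using `MinMaxScaler(feature_range=(-1, 1))` for the third option is one possible choice, not the only one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature
x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Standardization: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()

# Normalization: subtract the minimum, divide by the range
normalized = MinMaxScaler().fit_transform(x)

# Rescaling to the interval [-1, 1]
centered = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(standardized.ravel())
print(normalized.ravel())
print(centered.ravel())
```

The manual calculation matches `StandardScaler` because both use the population standard deviation.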
**CV and scaling in a pipeline**

In [None]:
"""

We first build our pipeline. We then specify our hyperparameter space by creating a dictionary: the keys are the pipeline step name followed by a double underscore, followed by the hyperparameter name.
The corresponding value is a list or an array of the values to try for that particular hyperparameter.

In this case, we are tuning n_neighbors in the KNN model. Next we split our data into training and test sets.
We then perform a grid search over our parameters by instantiating the GridSearchCV object, passing our pipeline and setting the param_grid argument equal to parameters.
We then fit it to our training data. Lastly, we make predictions using our test set.



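import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler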

steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

pipeline = Pipeline(steps)
parameters = {"knn__n_neighbors": np.arange(1, 50)}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)


cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

"""

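The double-underscore naming can be verified by asking the pipeline which parameters it exposes. The sketch below does that and then runs a small grid search on synthetic data; the data, the narrower search range, and `cv=5` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([("scaler", StandardScaler()),
                          ("knn", KNeighborsClassifier())])

# Every tunable parameter of a step is exposed as "<step name>__<parameter>"
tunable = [name for name in demo_pipeline.get_params() if name.startswith("knn__")]
print(tunable)

# Synthetic data standing in for the music features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler is refitted inside each cross-validation fold
grid = GridSearchCV(demo_pipeline,
                    param_grid={"knn__n_neighbors": np.arange(1, 10)}, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Putting the scaler inside the pipeline means each fold is scaled using only its own training portion, which avoids leaking validation data into the scaling step.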
In [None]:
### Centering and scaling for regression

"""

Now that you have seen the benefits of scaling your data, you will use a pipeline to preprocess the music_df features and build a lasso regression model to predict a song's loudness.

X_train, X_test, y_train, and y_test have been created from the music_df dataset, where the target is "loudness" and the features are all other columns in the dataset.
Lasso and Pipeline have also been imported for you.

Note that "genre" has been converted to a binary feature where 1 indicates a rock song, and 0 represents other genres

"""



"""

Import StandardScaler.
Create the steps for the pipeline object, a StandardScaler object called "scaler", and a lasso model called "lasso" with alpha set to 0.5.
Instantiate a pipeline with steps to scale and build a lasso regression model.
Calculate the R-squared value on the test data.

"""

# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)

# Calculate and print R-squared
print(pipeline.score(X_test, y_test))

In [None]:
"""

Build the steps for the pipeline: a StandardScaler() object named "scaler", and a logistic regression model named "logreg".
Create the parameters, searching 20 equally spaced float values ranging from 0.001 to 1.0 for the logistic regression model's C hyperparameter within the pipeline.
Instantiate the grid search object.
Fit the grid search object to the training data.

"""


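# Import the tools used below (X and y are assumed to be defined already)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Build the steps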
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1, 20)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=21)

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)

# Evaluating multiple models

In [None]:
### Visualizing regression model performance

"""

Write a for loop using model as the iterator, and models.values() as the iterable.
Perform cross-validation on the training features and the training target array using the model, setting cv equal to the KFold object.
Append the model's cross-validation scores to the results list.
Create a box plot displaying the results, with the x-axis labels as the names of the models.

"""

models = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1), "Lasso": Lasso(alpha=0.1)}
results = []

# Loop through the models' values
for model in models.values():
  kf = KFold(n_splits=6, random_state=42, shuffle=True)

  # Perform cross-validation
  cv_scores = cross_val_score(model, X_train, y_train, cv=kf)

  # Append the results
  results.append(cv_scores)

# Create a box plot of the results
plt.boxplot(results, labels=models.keys())
plt.show()
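
Where a plot is not convenient, the same comparison can be summarized numerically. This is a sketch on synthetic data; the `_demo` names are hypothetical and the scores are the default R-squared returned by `cross_val_score` for regressors.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the training set
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_models = {"Linear Regression": LinearRegression(),
               "Ridge": Ridge(alpha=0.1),
               "Lasso": Lasso(alpha=0.1)}

# One KFold object shared by all models, so every model sees the same splits
kf_demo = KFold(n_splits=6, shuffle=True, random_state=42)

summary = {}
for name, model in demo_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The mean and standard deviation per model carry roughly the information a box plot shows at a glance: typical performance and how much it varies between folds.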

In [None]:
"""

Import mean_squared_error.
Fit the model to the scaled training features and the training labels.
Make predictions using the scaled test features.
Calculate RMSE by passing the test set labels and the predicted labels.

"""


# Import mean_squared_error
from sklearn.metrics import mean_squared_error

for name, model in models.items():

  # Fit the model to the training data
  model.fit(X_train_scaled, y_train)

  # Make predictions on the test set
  y_pred = model.predict(X_test_scaled)

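  # Calculate the test RMSE as the square root of the MSE
  # (the squared=False argument is not available in recent scikit-learn releases)
  test_rmse = mean_squared_error(y_test, y_pred) ** 0.5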
  print("{} Test Set RMSE: {}".format(name, test_rmse))
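
As a quick check of what RMSE measures, the value can be computed by hand and compared with the square root of scikit-learn's `mean_squared_error`. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

# RMSE by definition: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# The same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual, rmse_sklearn)
```

The squared errors here are 1, 0, 1 and 4, so the MSE is 1.5 and the RMSE is its square root, expressed in the same units as the target.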

In [None]:
"""

Create a dictionary of "Logistic Regression", "KNN", and "Decision Tree Classifier", setting the dictionary's values to a call of each model.
Loop through the values in models.
Instantiate a KFold object to perform 6 splits, setting shuffle to True and random_state to 12.
Perform cross-validation using the model, the scaled training features, the target training set, and setting cv equal to kf

"""


# Create models dictionary
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), "Decision Tree Classifier":  DecisionTreeClassifier()}
results = []

# Loop through the models' values
for model in models.values():

  # Instantiate a KFold object
  kf = KFold(n_splits=6, random_state=12, shuffle=True)

  # Perform cross-validation
  cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
  results.append(cv_results)
plt.boxplot(results, labels=models.keys())
plt.show()

In [None]:
### Pipeline for predicting song genre

"""

For the final exercise, you will build a pipeline to impute missing values, scale features, and perform hyperparameter tuning of a logistic regression model.
The aim is to find the best parameters and accuracy when predicting song genre!

All the models and objects required to build the pipeline have been preloaded for you

"""


"""

Create the steps for the pipeline by calling a simple imputer, a standard scaler, and a logistic regression model.
Create a pipeline object, and pass the steps variable.
Instantiate a grid search object to perform cross-validation using the pipeline and the parameters.
Print the best parameters and compute and print the test set accuracy score for the grid search object.

"""


# Create steps
steps = [("imp_mean", SimpleImputer()),
         ("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)
params = {"logreg__solver": ["newton-cg", "saga", "lbfgs"],
         "logreg__C": np.linspace(0.001, 1.0, 10)}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params)
tuning.fit(X_train, y_train)
y_pred = tuning.predict(X_test)

# Compute and print performance
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))