# Module 02 - Regression
## Feature Selection
Feature selection is a critical step in machine learning that involves identifying the most relevant features for a given model. By carefully selecting features, we can improve the model's performance, reduce overfitting, and enhance interpretability. Including irrelevant or redundant features can lead to increased computational complexity and poorer generalization to new data. Moreover, a well-curated feature set can provide valuable insights into the underlying data patterns and relationships, enabling more informed decisions.

There are several techniques for feature selection, each with its strengths and applications. Methods like correlation analysis and Variance Inflation Factor (VIF) can help identify and remove highly correlated features. Tree-based models such as Random Forests and Gradient Boosted Trees can provide feature importance scores, highlighting the most impactful features. Additionally, techniques like Recursive Feature Elimination (RFE) systematically eliminate less important features, while dimensionality reduction methods such as Principal Component Analysis (PCA) transform features into uncorrelated components. Combining these techniques can create a robust feature selection strategy, ultimately leading to better-performing and more efficient machine learning models.
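
As a rough illustration of how VIF flags redundant features, the sketch below builds a small synthetic dataset (the variables `x1` to `x4` are hypothetical, not taken from the exercise data) and computes each feature's VIF. It relies on the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, and it only needs NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)  # nearly a linear combination of x1 and x2
x4 = rng.normal(size=n)                        # independent of the other features
X = np.column_stack([x1, x2, x3, x4])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the others.
# This equals the j-th diagonal entry of the inverse of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3", "x4"], vif):
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity. Note that `x1` and `x2` are flagged along with `x3`, because each of them can be reconstructed from the other two.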

## Hyperparameter Tuning
Hyperparameter tuning is an essential process in machine learning that involves optimizing the hyperparameters of a model to achieve the best possible performance. Unlike model parameters, which are learned from the data during training, hyperparameters are set before the training process and control the behavior of the model. Properly tuned hyperparameters can significantly improve the model's accuracy, prevent overfitting, and ensure that the model generalizes well to new, unseen data. The process of hyperparameter tuning involves selecting the combination of hyperparameters that maximizes the model's performance on a specific evaluation metric.

There are several techniques for hyperparameter tuning, each with its own advantages. Grid Search is a popular method that involves an exhaustive search over a specified parameter grid, evaluating each combination to find the best one. Random Search, on the other hand, randomly samples from the hyperparameter space and can be more efficient when dealing with a large number of hyperparameters. Advanced techniques such as Bayesian Optimization and Hyperband offer more sophisticated approaches by modeling the performance of hyperparameters and dynamically allocating resources to promising configurations. Cross-validation is often used in conjunction with these techniques to ensure that the selected hyperparameters lead to robust and reliable models. By employing these strategies, machine learning practitioners can fine-tune their models to achieve optimal performance and make data-driven decisions with confidence.
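
The difference between grid search and random search can be sketched in a few lines of plain Python. The `validation_error` function below is a hypothetical stand-in for "train a model and measure its validation error", and the hyperparameter names and ranges are assumptions chosen only for illustration.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    # Hypothetical stand-in for training a model and returning its validation error
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2

grid = {"learning_rate": [0.01, 0.05, 0.1, 0.5], "depth": [2, 4, 6, 8]}

# Grid search: evaluate every combination of the listed values
grid_candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
grid_best = min(grid_candidates, key=lambda params: validation_error(**params))

# Random search: spend a fixed budget on points sampled from wider ranges
random.seed(0)
random_candidates = [
    {"learning_rate": 10 ** random.uniform(-2, 0), "depth": random.randint(2, 8)}
    for _ in range(8)
]
random_best = min(random_candidates, key=lambda params: validation_error(**params))

print(f"grid search:   {len(grid_candidates)} evaluations, best = {grid_best}")
print(f"random search: {len(random_candidates)} evaluations, best = {random_best}")
```

The grid's cost is the product of the list lengths, so it grows quickly as hyperparameters are added, whereas the random search budget is chosen directly.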

## Regression
Regression analysis is a fundamental statistical technique in machine learning that seeks to understand and predict the relationship between a dependent variable (the outcome being predicted) and one or more independent variables (the predictors). At its core, regression aims to model how changes in input features correspond to changes in the target variable, allowing practitioners to make quantitative predictions and understand the underlying patterns in data. Unlike classification, which predicts discrete categories, regression focuses on predicting continuous numerical values, making it crucial for tasks that involve forecasting, estimation, and understanding complex relationships across various fields like economics, finance, scientific research, and machine learning.

<p style="text-align: center"><img src="https://thislondonhouse.com/Jupyter/Images/regression-linefit.png"></p>

The fundamental goal of regression is to find the most appropriate mathematical function that best describes the relationship between variables, minimizing the difference between predicted and actual values. This process involves selecting an appropriate regression model based on the data's characteristics, such as linearity, complexity, and potential non-linear relationships. Regression techniques range from simple methods like linear regression, which assumes a straight-line relationship, to more complex approaches like neural network and gradient boosting regression that can capture intricate, non-linear patterns. By carefully choosing and tuning regression models, data scientists can develop predictive tools that not only forecast outcomes but also provide insights into the underlying mechanisms driving those predictions.

### Regression Algorithms
#### Linear Regression
Linear regression serves as the fundamental building block of regression analysis, establishing a direct linear relationship between independent variables (features) and a dependent variable (target). At its core, it aims to find a linear function that best describes this relationship by fitting a line (in 2D) or hyperplane (in higher dimensions) to the data points. The model determines the optimal coefficients (weights) for each feature by minimizing the sum of squared differences between predicted and actual values, known as the least squares method. While simple and interpretable, linear regression makes several key assumptions: linearity in parameters, independence of observations, homoscedasticity (constant variance in errors), and normally distributed errors. Despite these limitations, it remains a powerful tool for understanding relationships between variables and making predictions when these assumptions are reasonably met. The main advantage of linear regression is its simplicity and interpretability, making it excellent for baseline models and understanding feature importance, but its strict linearity assumption means it cannot capture complex non-linear relationships in data.
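
The least squares idea can be shown directly with NumPy. This is a minimal sketch on synthetic data (the true intercept and coefficients are made up for the example), not the scikit-learn implementation used later in the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
true_intercept, true_coefs = 3.0, np.array([2.0, -1.5])
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=n)

# Add a column of ones so the intercept is estimated alongside the slopes
X_design = np.column_stack([np.ones(n), X])

# Least squares: choose beta to minimize the sum of squared residuals ||y - X_design @ beta||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta
print("intercept:", beta[0])
print("coefficients:", beta[1:])
print("sum of squared residuals:", residuals @ residuals)
```

Because the data were generated from a linear model with little noise, the estimates should land close to the values used to create them.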
#### Ridge and Lasso Regression
Ridge regression, also known as L2 regularization or Tikhonov regularization, addresses some of the key limitations of standard linear regression, particularly when dealing with multicollinearity or overfitting. It modifies the linear regression cost function by adding a penalty term proportional to the sum of squared coefficient values, multiplied by a regularization parameter λ (lambda). This penalty term effectively shrinks the coefficients toward zero but never exactly to zero, creating a more stable model by reducing the impact of correlated features. Lasso regression (Least Absolute Shrinkage and Selection Operator) introduces a different approach to regularization by adding an L1 penalty term to the linear regression cost function. Unlike ridge regression, lasso uses the absolute values of coefficients in its penalty term, which can force some coefficients exactly to zero, effectively performing automatic feature selection. Ridge and lasso regression are particularly valuable when dealing with many features or when features are highly correlated, as they help prevent the extreme coefficient values that often occur in these situations.
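
To make the two penalties concrete, the sketch below computes ridge coefficients from the closed-form solution on two nearly identical synthetic features, and shows the soft-thresholding operator that underlies the lasso (it is the exact lasso solution only in simplified settings such as a single standardized feature). The data and the helper names `ridge_coefficients` and `soft_threshold` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

def ridge_coefficients(X, y, lam):
    # Closed-form ridge solution (X'X + lam * I)^-1 X'y, without an intercept for brevity
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def soft_threshold(values, lam):
    # Shrinks values toward zero and sets anything smaller than lam to exactly zero
    return np.sign(values) * np.maximum(np.abs(values) - lam, 0.0)

for lam in [0.0, 1.0, 100.0]:
    print(f"lambda={lam}: ridge coefficients = {ridge_coefficients(X, y, lam)}")

print("soft-thresholded:", soft_threshold(np.array([3.0, 0.4, -0.2]), 0.5))
```

With λ = 0 the fit is ordinary least squares, and the two correlated features can receive unstable, offsetting coefficients. Increasing λ pulls the coefficient vector toward zero without zeroing any entry, while soft-thresholding removes the small entries entirely.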
#### Support Vector Regression
Support Vector Regression (SVR) adapts the principles of Support Vector Machines to regression problems, offering a powerful approach for handling both linear and non-linear relationships. Instead of fitting a line to minimize squared errors, SVR attempts to find a function that deviates from the observed values by no more than a specified margin ε (epsilon) while remaining as flat as possible. It employs a flexible "tube" of width 2ε around the function, only penalizing predictions that fall outside this tube. One of SVR's key strengths is its ability to handle non-linear relationships through the kernel trick, which implicitly maps the input features to a higher-dimensional space where a linear relationship might better represent the data. Common kernel choices include polynomial, radial basis function (RBF), and sigmoid functions, making SVR highly adaptable to different types of relationships in the data. SVR's major advantage is its robustness to outliers and ability to handle non-linear relationships effectively, but its main disadvantage is the computational complexity that makes it less suitable for very large datasets.
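
The ε-tube can be illustrated with the ε-insensitive loss that SVR penalizes. The numbers below are made up for the example. The sketch compares this loss with squared error on the same predictions.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    # Errors inside the tube (|error| <= epsilon) cost nothing; outside it the cost grows linearly
    return np.maximum(np.abs(y_true - y_pred) - epsilon, 0.0)

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_pred = np.array([10.2, 9.5, 12.0, 20.0])

tube_loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
squared_loss = (y_true - y_pred) ** 2

print("epsilon-insensitive:", tube_loss)
print("squared error:      ", squared_loss)
```

The last prediction misses by 10 units. Under squared error it dominates the total, while the ε-insensitive loss grows only linearly with the size of the miss, which is one reason SVR is less sensitive to outliers.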
#### Gradient Boosting Regression
Gradient Boosting Regression represents an advanced ensemble learning technique that sequentially builds regression models to create a powerful predictive framework. Unlike traditional regression methods, gradient boosting constructs a series of weak learners (typically decision trees) where each subsequent model focuses on correcting the errors of the previous models, effectively "boosting" the overall predictive performance. The method works by iteratively adding new models that predict the residual errors of the existing ensemble, with each new model trained to minimize the remaining prediction error through gradient descent optimization. This approach allows gradient boosting to capture complex, non-linear relationships in the data while maintaining relatively good interpretability compared to neural networks. The core strength of gradient boosting lies in its ability to handle diverse data types, reduce overfitting through techniques like regularization and tree pruning, and often achieve state-of-the-art predictive performance across various domains. Despite the advantages, it can be computationally intensive and requires careful hyperparameter tuning to prevent overfitting, especially on smaller datasets.
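
The residual-fitting loop described above can be sketched with one-split decision trees ("stumps") in NumPy. This is a simplified illustration on synthetic one-dimensional data, with hypothetical helper names `fit_stump` and `predict_stump`. Library implementations such as `GradientBoostingRegressor` use deeper trees and more refinements.

```python
import numpy as np

def fit_stump(x, residuals):
    # Find the single split on x that best predicts the residuals with two constants
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean of the target
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residuals = y - prediction            # for squared error, the negative gradient is the residual
    stump = fit_stump(x, residuals)
    prediction = prediction + learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before boosting:  {mse_history[0]:.4f}")
print(f"MSE after 100 stumps: {mse_history[-1]:.4f}")
```

Each stump is a weak model on its own, but because every round targets what the ensemble still gets wrong, the training error falls steadily. The learning rate scales down each correction, which is one of the regularization controls mentioned above.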
#### Neural Network Regression
Neural network regression uses artificial neural networks to model complex, non-linear relationships between inputs and outputs. Unlike traditional regression methods, neural networks can automatically learn hierarchical representations of features through multiple layers of interconnected neurons, each applying non-linear activation functions to weighted combinations of inputs. This architecture allows neural networks to capture intricate patterns and interactions that simpler models might miss. The training process uses backpropagation to adjust the network's weights and biases, minimizing a loss function that measures the difference between predicted and actual values. Neural network regression can be customized through various architectural choices, including the number of layers, neurons per layer, activation functions, and regularization techniques like dropout. While potentially more powerful than simpler regression methods, neural networks typically require larger datasets for training, more computational resources, and careful tuning of hyperparameters to achieve optimal performance. The primary advantage of neural networks is their unparalleled ability to capture complex patterns in data, but this comes at the cost of reduced interpretability and the need for substantial computational resources and data.
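
For intuition about the forward pass and backpropagation, here is a minimal one-hidden-layer network written in NumPy and trained by gradient descent on a synthetic non-linear target. The layer size, learning rate, and number of epochs are arbitrary assumptions. The exercise itself uses Keras rather than hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(256, 1))
y = X ** 2                                     # a non-linear target that a straight line cannot fit

# One hidden layer with 8 ReLU units and a linear output unit
W1 = rng.normal(scale=0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
learning_rate = 0.01
losses = []

for epoch in range(2000):
    # Forward pass
    hidden_pre = X @ W1 + b1
    hidden = np.maximum(hidden_pre, 0.0)       # ReLU activation
    y_hat = hidden @ W2 + b2
    error = y_hat - y
    losses.append(np.mean(error ** 2))         # mean squared error loss

    # Backward pass: propagate the gradient of the loss back through each layer
    grad_y_hat = 2 * error / len(X)
    grad_W2 = hidden.T @ grad_y_hat
    grad_b2 = grad_y_hat.sum(axis=0)
    grad_hidden_pre = (grad_y_hat @ W2.T) * (hidden_pre > 0)
    grad_W1 = X.T @ grad_hidden_pre
    grad_b1 = grad_hidden_pre.sum(axis=0)

    # Gradient descent update
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(f"loss at start: {losses[0]:.4f}")
print(f"loss at end:   {losses[-1]:.4f}")
```

The same three steps (forward pass, loss, backward pass) are what `model.compile` and `fit` automate in the Keras model defined later.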


In [None]:
import string
import pybaseball as pb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.svm import SVR
from xgboost import XGBRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from scikeras.wrappers import KerasRegressor

## Regression Exercise 1
The following functions are helper functions for this exercise. The first simply removes punctuation from a string. We will use it to format players' names so that there are no special characters that may make name-matching more difficult. The second function builds our dataset using the [pybaseball](https://github.com/jldbc/pybaseball) library. This function downloads batter data from 2018 through 2024 and pairs that data with player profile data, which includes salary. The third function provides a formatted summary of regression results. We will use that to compare performance across algorithms.
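
The three metrics reported by the summary function can also be computed by hand, which helps when interpreting them. The values below are made up for the example.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_pred
mse = np.mean(errors ** 2)                       # penalizes large misses more heavily
mae = np.mean(np.abs(errors))                    # average miss, in the target's own units
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                         # share of the target's variance explained

# A model that always predicts the mean scores R2 = 0 on the same data
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_r2 = 1 - np.sum((y_true - baseline_pred) ** 2) / ss_tot

print(f"MSE={mse:.4f}  MAE={mae:.4f}  R2={r2:.4f}  baseline R2={baseline_r2:.4f}")
```

The baseline line is the reason a mean-predicting dummy model is a useful reference point: any model worth keeping should clearly beat it.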

In [None]:
# Helper functions
def remove_punctuation(name):
    # Strip spaces and punctuation and lowercase, so "J.D. Martinez" becomes "jdmartinez"
    return name.translate(str.maketrans('', '', ' ' + string.punctuation)).lower()

def create_batter_data():
    batter_data = pb.batting_stats(2018, 2024, qual=50)
    player_data = pb.bwar_bat()

    batter_data['key'] = batter_data['Name'].apply(remove_punctuation) + batter_data['Season'].astype(str)
    player_data['key'] = player_data['name_common'].apply(remove_punctuation) + player_data['year_ID'].astype(str)

    player_data = player_data.sort_values(by=['mlb_ID', 'year_ID'])

    # Add a 'prev_salary' column holding each player's salary from the previous year
    player_data['prev_salary'] = player_data.groupby('mlb_ID')['salary'].shift(1)
    player_data = player_data.dropna()

    batter_data = batter_data.merge(player_data[['key', 'lg_ID', 'prev_salary', 'salary']], on='key', how="inner")

    for column in batter_data.columns.tolist():
        if batter_data[column].isnull().sum() > 0:
            batter_data = batter_data.drop(column, axis=1)

    batter_data.to_csv("data/batter_data.csv", index=False)
    return batter_data

def regression_preformance(y_true, y_pred, importance_df=None):
    mse = metrics.mean_squared_error(y_true, y_pred)
    mae = metrics.mean_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)

    print(f'''
Mean Squared Error (MSE): {mse:.4f}
Mean Absolute Error (MAE): {mae:.4f}
R-squared (R2): {r2:.4f}
''')

    # Plotting true vs predicted values
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

    ax1.scatter(y_pred, y_true)
    ax1.plot(y_true, y_true, color='red')
    # ax1.ticklabel_format(style='plain', axis='y')
    # ax1.ticklabel_format(style='plain', axis='x')
    ax1.set_ylabel('True Values')
    ax1.set_xlabel('Predicted Values')
    ax1.set_title('True vs Predicted Values')

    if importance_df is not None:
        sns.barplot(x='Importance', y='Feature', data=importance_df.sort_values(by='Importance', ascending=False), ax=ax2)
        ax2.set_title('Feature Importance')
        ax2.set_xlabel('Importance')
        ax2.set_ylabel('Feature')
    plt.show()

### Business Problem
Baseball is a highly competitive sport among players and team owners. Owners compete by spending money to acquire the best players. Ownership rules in baseball are structured in such a way as to allow almost limitless access to the best players. Whereas other professional sports leagues may have hard salary caps to increase parity and competition, baseball has a luxury tax which penalizes owners who spend 'too much' on their team. Part of this tax is distributed to other members of Major League Baseball. So, there is a great need to understand the relationship between salary and on-field performance.

### Data Collection
This code tries to load the batter_data.csv file from the data folder. If the attempt fails (because the file is not found), it builds the file for us.


In [None]:
try:
    batter_df = pd.read_csv('data/batter_data.csv')
except FileNotFoundError:
    batter_df = create_batter_data()

Data are organized in tabular format, with each record representing an individual player in a given year. The target variable is 'salary', which represents the player's salary for a given year. The data contain over 200 variables.

The following line will load the saved file as a pandas dataframe named `salary_df`, which the rest of the exercise uses.

In [None]:
salary_df = pd.read_csv("data/batter_data.csv")

### Data Profiling
Once the data are loaded, we need to profile the data and prepare it for analysis. This typically involves several steps, such as handling missing data, exploring the data, and selecting features. The steps will vary depending on the dataset and the business problem, but profiling always precedes model building.

The following lines provide important insight into the nature of the data. As you can see there are dozens of variables that may be important when predicting salary.

In [None]:
print(salary_df.info(verbose=True, show_counts=True))

In [None]:
print(salary_df.head())

In [None]:
print(salary_df.describe())

In [None]:
salary_df.hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.show()

It is almost impossible to comprehend this much data. So, it is important that we start with a subset of the data and then iteratively build a model that helps us understand the relationship between pay and performance. Thus, the following lines are important as they provide an initial theory for how players are evaluated: **On-field individual performance determines pay**.

In [None]:
target_cols = ['salary']
categorical_cols = ['lg_ID']
numeric_cols = ['AVG','prev_salary']
count_cols = ['Age', 'PA', '1B', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SF', 'SH', 'SB', 'CS', 'Pitches', 'Balls', 'Strikes']
input_cols = [x for x in categorical_cols + numeric_cols + count_cols if x not in target_cols]
data_cols = input_cols + target_cols

In [None]:
df = salary_df[data_cols]

This is a more manageable subset of the data. We can see that we have 17 player performance metrics, a variable representing the league (National or American) and their previous salary (this allows us to 'control' for previous salary). 

In [None]:
print(df.info(verbose=True, show_counts=True))

In [None]:
print(df.describe())

The following illustrates that most of our data are count data and will need to be transformed.

In [None]:
df[input_cols].hist(figsize=(12, 10), bins=30, edgecolor="black")
plt.show()

The following chart illustrates the correlation among variables. Hot (i.e., white) values indicate high levels of correlation. Too much correlation among variables may cause the models to fail or to have poor prediction performance.

In [None]:
# Plot the correlation heatmap
sns.heatmap(df.select_dtypes('number').corr())
plt.show()

Here, you can see that salaries range from 100k to 40 million.

In [None]:
print(df[target_cols].describe())

Next, we will drop rows with missing data.

In [None]:
df = df.dropna()

Finally, we will create our testing and training samples. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[input_cols], df[target_cols], test_size=0.25, random_state=16)

### Model Specification
#### Preprocessing
As with the previous exercises, we will create some standard transformers to handle different kinds of data. Categorical variables must be transformed into numbers, count data must be log-transformed and scaled, and numeric data must be scaled. These pipelines ensure consistency. We may not always need all pipelines, but we will always want to handle these datatypes in a consistent way.
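
To see why count data are log-transformed before scaling, the sketch below generates hypothetical right-skewed counts (not the baseball data) and measures skewness before and after `np.log1p`. The final step mirrors what a standard scaler does to a single column.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical right-skewed counts: many small values and a long tail of large ones
counts = rng.poisson(lam=rng.gamma(shape=1.0, scale=20.0, size=1000))

def skewness(values):
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

logged = np.log1p(counts)                          # log(1 + x), so zero counts stay defined
scaled = (logged - logged.mean()) / logged.std()   # zero mean, unit variance

print(f"skewness before log1p: {skewness(counts):.2f}")
print(f"skewness after log1p:  {skewness(logged):.2f}")
print(f"scaled mean={scaled.mean():.2f}, std={scaled.std():.2f}")
```

Compressing the long right tail keeps a handful of very large counts from dominating the scaled feature.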

In [None]:
# Define the pipeline
count_transformer = Pipeline(steps=[
    ('log', FunctionTransformer(np.log1p)),
    ('scaler', StandardScaler())
])

In [None]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

In [None]:
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

#### Model Selection
We will use the same preprocessor for all of the following models. This may not always be the case as we will see in subsequent models. 

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('count', count_transformer, count_cols),
        ('cont', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
base_model = LinearRegression() # Linear Regression

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Linear Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out(), 'Importance': base_model.coef_})) # Assess model performance

In [None]:
base_model = DummyRegressor(strategy="mean") # Dummy Regressor

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Dummy Regression Performance Metrics:")
regression_preformance(y_test, y_predicted) # Assess model performance

In [None]:
base_model = Ridge(random_state=42) # Ridge Regression

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Ridge Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out(), 'Importance': base_model.coef_})) # Assess model performance

In [None]:
base_model = SVR(kernel='linear') # Support Vector Machines Regression

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("SVM Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out(), 'Importance': base_model.coef_[0]})) # Assess model performance

In [None]:
base_model = GradientBoostingRegressor(random_state=42) # Gradient Boosting Regressor

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Gradient Boosting Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out(), 'Importance': base_model.feature_importances_})) # Assess model performance

In [None]:
base_model = XGBRegressor(enable_categorical=True) # XGBoost Regressor

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("XGBoost Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out(), 'Importance': base_model.feature_importances_})) # Assess model performance

In [None]:
def create_model(dims): # Define the structure of the model
    model = Sequential()
    model.add(Input(shape=(dims,)))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

base_model = KerasRegressor(create_model(preprocessor.fit_transform(X_train).shape[1]), epochs=100, batch_size=10, verbose=1) # Neural Network Regressor

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Neural Network Regression Performance Metrics:")
regression_preformance(y_test, y_predicted) # Assess model performance

### Model Evaluation
Model evaluation involves more than asking whether the model performed well. It also asks whether this is the right model and whether we used the right hyperparameters when building it. As you can see, the SVM regression performed poorly. That may be because our data are bad, or because the algorithm was not tuned correctly.
#### Hyperparameter Tuning: Grid Search Cross-Validation
To check the latter, we can use a technique known as hyperparameter tuning. The approach used here is an exhaustive grid search: it fits the model with every combination of the candidate hyperparameter values and scores each combination with 5-fold cross-validation. This will help us identify the hyperparameters that best fit our data and the task.
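
It helps to estimate the cost of a grid search before running it. The sketch below copies the SVR grid used in this section and counts how many model fits a 5-fold search over it requires.

```python
import itertools

param_grid = {
    'model__C': [0.1, 1, 10, 100, 1000, 10000, 100000, 1000000],
    'model__epsilon': [0.01, 0.1, 1, 10],
    'model__kernel': ['linear', 'rbf'],
    'model__gamma': ['scale', 'auto'],
}
n_folds = 5

# Every combination of values is one candidate configuration
combinations = list(itertools.product(*param_grid.values()))
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} model fits")
```

The search also refits the best configuration once on the full training set. Because `gamma` has no effect when the kernel is linear, some of these fits are redundant. One way to avoid them is to pass a list of separate grids, one per kernel.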

In [None]:
base_model = SVR()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', base_model)
])

param_grid = {
    'model__C': [0.1, 1, 10, 100, 1000, 10000, 100000, 1000000],
    'model__epsilon': [0.01, 0.1, 1, 10],
    'model__kernel': ['linear', 'rbf'],
    'model__gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', verbose=5)
grid_search.fit(X_train, np.ravel(y_train))

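# Print the best parameters
print(f'Best parameters: {grid_search.best_params_}')

best_pipeline = grid_search.best_estimator_
y_predicted = best_pipeline.predict(X_test)

# coef_ only exists for the linear kernel, so skip the importance plot for other kernels
importance_df = None
if best_pipeline['model'].kernel == 'linear':
    importance_df = pd.DataFrame({
        'Feature': best_pipeline['preprocessor'].get_feature_names_out(),
        'Importance': best_pipeline['model'].coef_[0]
    })

print("Grid Search SVM Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, importance_df)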


As you can see, with the correct hyperparameters, the SVM regression performs in line with or better than the other regression algorithms. In this case, a C value of 10000, an epsilon of 10, a gamma of 'scale', and a linear kernel would recreate the best result.

#### Feature Selection: Recursive Feature Elimination
In addition to hyperparameter tuning, you may also face issues selecting the appropriate features. We started with a theory about how baseball players are evaluated--**individual performance determines pay**--but this may not be a correct theory of the relationship between pay and performance. While we could offer a competing theory, we could also mine the data for a theory. Data mining seeks to find relationships among variables, independent of the meaning of those variables.

Recursive feature elimination is a technique for mining data that iteratively removes features until a target number of features is reached. This technique is agnostic about the meaning of a variable and is only concerned with the question: **does this feature contribute to our ability to predict the target**?
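
The elimination loop itself is short. The following sketch is a simplified NumPy version, not scikit-learn's `RFE`. It uses synthetic data in which only two of six features matter, and the helper name `recursive_feature_elimination` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
X = rng.normal(size=(n, 6))
# Only features 0 and 3 drive the target; the other four are pure noise
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

def recursive_feature_elimination(X, y, n_features_to_select):
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit on the surviving features, then drop the one with the smallest |coefficient|
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y - y.mean(), rcond=None)
        weakest = int(np.argmin(np.abs(coefs)))
        remaining.pop(weakest)
    return remaining

selected = recursive_feature_elimination(X, y, n_features_to_select=2)
print("selected feature indices:", selected)
```

Ranking by coefficient size only makes sense when features share a common scale, which is why scaling comes before the feature selection step in a pipeline.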

We start by reselecting our features, but this time we will select all numeric features and iteratively remove non-performing features.

In [None]:
target_cols = ['salary']
categorical_cols = ['lg_ID']
numeric_cols = [x for x in salary_df.select_dtypes('number').columns.tolist() if x not in target_cols]
count_cols = []
input_cols = [x for x in categorical_cols + numeric_cols + count_cols if x not in target_cols]
data_cols = input_cols + target_cols

Once the features are selected, we will subset our data, drop missing values and create our testing/training samples.

In [None]:
df = salary_df[data_cols]
df = df.dropna()
X_train, X_test, y_train, y_test = train_test_split(df[input_cols], df[target_cols], test_size=0.25, random_state=16)

Next, we will define our preprocessor. In this case, we will not need a count transformer because we have defined all features as numeric values.

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cont', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

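Although the Gradient Boosting Regressor was our best-performing model above, the cell below uses linear regression for both the elimination step and the final model. RFE works with any estimator that exposes coefficients or feature importances, so another model could be substituted, and it is generally a good idea to start with a model that you believe will perform well.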

Note that we are adding a Recursive Feature Elimination step to our pipeline. So, in the following example, we will transform our data, select the best features (top 40), and then feed those features into our model.

In [None]:
base_model = LinearRegression() # Linear Regression
rfe = RFE(estimator=base_model, n_features_to_select=40, verbose=10) # Recursive Feature Elimination

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', rfe),
    ('model', base_model)
]) # Define the pipeline

pipeline.fit(X_train, np.ravel(y_train)) # Fit the pipeline

y_predicted = pipeline.predict(X_test) # Use model to predict test data

print("Linear Regression Performance Metrics:")
regression_preformance(y_test, y_predicted, pd.DataFrame({'Feature': preprocessor.get_feature_names_out()[rfe.support_], 'Importance': base_model.coef_})) # Assess model performance

### Conclusion
Overall, we were able to develop a reasonable model for predicting salaries of MLB batters. Whether building a model from theory or mining for the most important factors, we were able to explain approximately 78% of the variance in hitter salaries. Such a model could be useful in identifying undervalued players, predicting negotiations among rising stars or building an economical line-up.

One challenge highlighted in model specification is that a player's previous salary is highly predictive of current salary. Though this is clearly true in practice, it may have a biasing effect on the salaries of rising stars who typically have lower salaries. In such cases, our model would likely predict that their future salary would be lower (perhaps much lower) than the market would be willing to pay. Missing on the low side may lead to a lower-than-is-appropriate first offer in future salary negotiations, potentially insulting the player or leading the player to explore more options. The RFE model was able to achieve similar levels of prediction without relying heavily on previous salary information, and therefore may be a preferable model when valuing high-performing but low-paid players.

## Regression Exercise 2
### Business Problem

Explain the business problem

### Data Collection

Load the data as a pandas dataframe.

### Data Profiling
Profile the data and prepare it for analysis. 

Explore the data

Select the features

Subset features

Explore focal features

Drop rows with missing data.

Create our testing and training samples.

### Model Specification

Define preprocessing

Select model

Build pipeline

Fit model

Predict testing data

Report performance

### Model Evaluation

Consider feature importance/selection

Consider hyperparameter tuning

### Conclusion

Assess the model.