# Introduction to Regression

In [None]:
"""

Create X, an array of the values from the sales_df DataFrame's "radio" column.
Create y, an array of the values from the sales_df DataFrame's "sales" column.
Reshape X into a two-dimensional NumPy array.
Print the shapes of X and y.

"""

import numpy as np

# Create X from the radio column's values
X = sales_df["radio"].values

# Create y from the sales column's values
y = sales_df["sales"].values

# Reshape X
X = X.reshape(-1, 1)

# Check the shape of the features and targets
print(X.shape)
print(y.shape)
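
scikit-learn estimators expect the feature array to be two-dimensional, with one row per observation and one column per feature, which is why X is reshaped above. A minimal sketch of the effect, using a made-up `radio` array as a stand-in for `sales_df["radio"].values` (the values are assumptions, not taken from the dataset):

```python
import numpy as np

# Hypothetical radio spend values standing in for sales_df["radio"].values
radio = np.array([10.0, 20.0, 30.0, 40.0])

# A single column pulled from a DataFrame is one-dimensional
print(radio.shape)  # (4,)

# -1 asks NumPy to infer the number of rows; 1 fixes a single feature column
X_demo = radio.reshape(-1, 1)
print(X_demo.shape)  # (4, 1)
```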

In [None]:
"""

Import LinearRegression.
Instantiate a linear regression model.
Predict sales values using X, storing as predictions.

"""

# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X, y)

# Make predictions
predictions = reg.predict(X)

print(predictions[:5])

In [None]:
"""

Import matplotlib.pyplot as plt.
Create a scatter plot visualizing y against X, with observations in blue.
Draw a red line plot displaying the predictions against X.
Display the plot.

"""


# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create scatter plot
plt.scatter(X, y, color="blue")

# Create line plot
plt.plot(X, predictions, color="red")
plt.xlabel("Radio Expenditure ($)")
plt.ylabel("Sales ($)")

# Display the plot
plt.show()

# The basics of linear regression

In [None]:
"""

Create X, an array containing values of all features in sales_df, and y, containing all values from the "sales" column.
Instantiate a linear regression model.
Fit the model to the training data.
Create y_pred, making predictions for sales using the test features.

"""


# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values

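# Split into training and test sets (30% held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)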

# Instantiate the model
reg = LinearRegression()

# Fit the model to the training data
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

In [None]:
"""

Import mean_squared_error.
Calculate the model's R-squared score by passing the test feature values and the test target values to an appropriate method.
Calculate the model's root mean squared error using y_test and y_pred.
Print r_squared and rmse.

"""


# Import mean_squared_error
from sklearn.metrics import mean_squared_error

# Compute R-squared
r_squared = reg.score(X_test, y_test)

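# Compute RMSE as the square root of MSE
# (the squared=False argument is no longer available in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))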

# Print the metrics
print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))
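
To make the two metrics concrete, here is a small sketch that computes them by hand and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are invented for illustration and are not from the sales data. R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE is the square root of the mean squared residual.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual values and predictions (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

Because RMSE is expressed in the same units as the target, it is often easier to interpret than MSE.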

# Cross-validation

**Cross-validation motivation**

In [None]:
"""

What is R-squared?

--->  R-squared is the proportion of the variance in the actual values that our model's predictions explain. A higher R-squared means the model accounts for more of the variation in the data.


Problem with Single Train-Test Split
--------------------------------------------
When we split data into training and test sets, the test set is randomly chosen.
Some test sets might have unusual patterns (e.g., outliers or rare cases).

This can make the R-squared score unreliable, meaning it might not reflect how well the model works on new, unseen data.



Imagine we're predicting house prices based on size.

We randomly split the data:
Train Set: Normal houses (apartments, single-family homes).
Test Set: Mostly luxury villas (much more expensive).
Our model, trained on normal houses, fails on the luxury houses, leading to a misleadingly low R-squared score.


"""

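The dependence of R-squared on the particular split can be seen with a quick experiment. The sketch below uses synthetic data from `make_regression` rather than the sales data (the sample size, noise level, and seeds are arbitrary assumptions) and refits the same model on five different random splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noisy one-feature dataset (illustration only)
X_demo, y_demo = make_regression(n_samples=60, n_features=1, noise=30.0, random_state=0)

scores = []
for seed in range(5):
    # Each seed produces a different random train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# Same data and same model, but the test R-squared changes with the split
print(np.round(scores, 3))
print("Spread:", max(scores) - min(scores))
```

Cross-validation addresses this by averaging the score over several splits instead of relying on one.
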
In [None]:
"""

Import KFold and cross_val_score.
Create kf by calling KFold(), setting the number of splits to six, shuffle to True, and setting a seed of 5.
Perform cross-validation using reg on X and y, passing kf to cv.
Print cv_scores.

"""

# Import the necessary modules
from sklearn.model_selection import cross_val_score, KFold

# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)

reg = LinearRegression()

# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)

# Print scores
print(cv_scores)

In [None]:
"""

Calculate and print the mean of cv_scores.
Calculate and print the standard deviation of cv_scores.
Display the 95% confidence interval for your results using np.quantile().

"""

# Print the mean
print(np.mean(cv_scores))

# Print the standard deviation
print(np.std(cv_scores))

# Print the 95% confidence interval
print(np.quantile(cv_scores, [0.025, 0.975]))

# Regularized Regression

**Ridge Regression**

In [None]:
"""

With ridge, the loss function is the Ordinary Least Squares (OLS) loss plus the sum of the squared coefficients, multiplied by a constant, alpha.

So, when minimizing the loss function, models are penalized for coefficients with large positive or negative values.
When using ridge, we need to choose the alpha value in order to fit and predict. Essentially, we can select the alpha for which our model performs best.
Picking alpha for ridge is similar to picking k in KNN. Alpha in ridge is known as a hyperparameter, which is a variable used for selecting a model's parameters.

Alpha controls model complexity. When alpha equals zero, we are performing OLS, where large coefficients are not penalized and overfitting may occur.
A high alpha means that large coefficients are significantly penalized, which can lead to underfitting.

"""

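The shrinking effect of alpha can be checked directly. This is a minimal sketch on synthetic data from `make_regression` (the dataset and the three alpha values are assumptions chosen for illustration): as alpha grows, the overall size of the coefficient vector, measured here by its Euclidean norm, gets smaller.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic dataset with three features (illustration only)
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

norms = []
for alpha in [0.1, 10.0, 1000.0]:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Larger alpha penalizes large coefficients more heavily
    norms.append(np.linalg.norm(ridge_demo.coef_))
    print(alpha, np.round(ridge_demo.coef_, 2))

print(np.round(norms, 2))
```
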
**Lasso Regression**

In [None]:
"""

There is another type of regularized regression called lasso, where the loss function is the OLS loss plus the sum of the absolute values of the coefficients, multiplied by a constant, alpha.

"""

**Lasso regression for feature selection**

In [None]:
"""

Lasso regression can actually be used to assess feature importance. This is because it tends to shrink the coefficients of less important features to zero.
The features whose coefficients are not shrunk to zero are selected by the lasso algorithm.

"""

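A small sketch of this behavior, assuming a synthetic dataset in which only the first of three features influences the target (the data, the coefficient of 4.0, and the noise level are all made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))

# Only the first feature drives the target; the other two are pure noise
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.3).fit(X_demo, y_demo)

# Indices of the features whose coefficients were not shrunk to zero
selected = np.flatnonzero(lasso_demo.coef_)

print(lasso_demo.coef_)
print(selected)
```

The coefficient of the informative feature is also shrunk somewhat toward zero, so lasso coefficients are better read as a ranking of importance than as unbiased effect sizes.
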
In [None]:
### Regularized regression: Ridge

"""

Import Ridge.
Instantiate Ridge, setting alpha equal to alpha.
Fit the model to the training data.
Calculate the R^2 score for each iteration of ridge.

"""


# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:

    # Create a Ridge regression model
    ridge = Ridge(alpha=alpha)

    # Fit the data
    ridge.fit(X_train, y_train)

    # Obtain R-squared
    score = ridge.score(X_test, y_test)
    ridge_scores.append(score)
print(ridge_scores)

In [None]:
### Lasso regression for feature importance

"""

Import Lasso from sklearn.linear_model.
Instantiate a Lasso regressor with an alpha of 0.3.
Fit the model to the data.
Compute the model's coefficients, storing them as lasso_coef.

"""

# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regression model
lasso = Lasso(alpha=0.3)

# Fit the model to the data
lasso.fit(X, y)

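# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Feature names, in the same column order used to build X
sales_columns = sales_df.drop("sales", axis=1).columns

# Plot one bar per feature coefficient
plt.bar(sales_columns, lasso_coef)
plt.xticks(rotation=45)
plt.show()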


# Printed coefficients: [ 3.56256962 -0.00397035  0.00496385]