### Task — Predict next month’s sales using LASSO

You work as an analyst at a company that sells medical equipment. Your goal is to build a predictive model that forecasts next month’s sales using data from the previous months. You have already explored the data and selected a subset of candidate predictor variables — chosen either because they showed a (rough) linear relationship with the target or because they are meaningful from a domain/expert perspective.

You will apply LASSO (L1-regularized) linear regression to produce a simple, interpretable model with improved explanatory power and reduced overfitting. Review the provided code, correct any issues, and complete the remaining tasks described below.
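
For reference, the objective minimized by scikit-learn's `Lasso` can be written as below. Here $n$ is the number of training rows, $w$ is the coefficient vector, and $\alpha$ is the regularization strength, which this notebook calls λ. The intercept is fitted separately and is not penalized.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the penalty acts on the size of the coefficients, predictors measured on very different scales (for example `cows` versus `offers_our`) are penalized unevenly unless they are standardized first. A larger $\alpha$ pushes more coefficients to exactly zero.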

### Import of necessary Python libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

### Below is the dataset you will work with
Variables:
- sales → number of sales made by our company in a given month $t$; this is the variable you want to predict
- offers_our → number of special offers our company launched in the previous month $t-1$
- offers_comp → number of special offers our competitors launched in the previous month $t-1$
- complaints → number of complaints about our customer service in the previous month $t-1$
- cows → number of cows in Poland in the previous month $t-1$

In [None]:
# ----- Your dataset -----
np.random.seed(42)
n = 40
var = np.round(np.random.normal(1000, 100, n))
sales = np.round(10 * var + np.random.normal(0, 300, n))
offers_our = np.round(0.7 * (sales - np.mean(sales)) / np.std(sales) * 50 + np.mean(sales) * 0.1 + np.random.normal(0, 20, n))
offers_comp = np.round(0.6 * (-sales + np.mean(sales)) / np.std(sales) * 50 + np.mean(sales) * 0.05 + np.random.normal(0, 25, n))
complaints = 10*np.round(0.4 * (-sales + np.mean(sales)) / np.std(sales) * 20 + np.mean(sales) * 0.02 + np.random.normal(0, 15, n))
cows = var * 1000 + 1500000

### Creation of DataFrame to store all variables

In [None]:
df = pd.DataFrame({
    'sales': sales,
    'complaints': complaints,
    'offers_comp': offers_comp,
    'offers_our': offers_our,
    'cows': cows
})

### Quick summary of descriptive statistics of our dataset

In [None]:
df.describe()

### Additional task (optional)
Try to prepare some plots that help visualize the data and the relationships between the variables.

Hint: consider rescaling the variables before plotting, so you can directly observe how they change from month to month.

Hint: you can prepare additional variables, e.g. relative month-to-month changes.
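
The two hints above could be sketched as follows. This is only an illustration: the helper names `rescale_for_plot` and `monthly_change` and the `toy` frame are hypothetical and do not come from the notebook.

```python
import pandas as pd


def rescale_for_plot(frame: pd.DataFrame) -> pd.DataFrame:
    """Z-score every column so that all series can share one axis."""
    return (frame - frame.mean()) / frame.std()


def monthly_change(frame: pd.DataFrame) -> pd.DataFrame:
    """Relative month-to-month change; the first row has no predecessor and is dropped."""
    return (frame.diff() / frame.shift(1)).dropna()


# Hypothetical toy data standing in for the notebook's df
toy = pd.DataFrame({
    "sales": [100.0, 110.0, 99.0, 120.0],
    "cows": [2.50e6, 2.55e6, 2.45e6, 2.60e6],
})

scaled = rescale_for_plot(toy)
changes = monthly_change(toy)
print(scaled)
print(changes)
```

In the notebook itself, something like `rescale_for_plot(df).plot(figsize=(10, 6))` followed by `plt.show()` would draw all variables on a common scale.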

### Checking the linear correlations between sales and the other variables

In [None]:
# Print correlations to verify
print("Data correlations with sales:")
print(df.corr()['sales'])

### Task - Which variables seem most promising for predicting our sales?
Be prepared for your manager to ask which variables are in the model and why.

### Data preparation
You have already checked the data quality and data availability: the predictor values from the previous month must be available at the moment you need to deliver the prediction to your boss. Now let's prepare the data for model training and for selecting the optimal regularization parameter λ of our LASSO model.

In [None]:
### Are there any variables that we want to get rid of?
### We can drop them below when preparing X, e.g. df.drop(['sales', 'variable2'], axis=1)
X = df.drop('sales', axis=1)
y = df['sales']

### Divide the data into $train$ and $test$ sets to select the optimal λ
Hint: you can try splitting the dataset in different proportions by choosing test_size values between 0.1 and 0.3 (10%-30%). Does this affect the best-performing hyperparameter?

In [None]:
# FIRST split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit only on train, transform both train & test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
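
One way to explore the hint about `test_size` is to repeat the split for several proportions and record which λ wins each time. The sketch below assumes a hypothetical helper `best_alpha_for_split` and synthetic data (`X_demo`, `y_demo`); neither is part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def best_alpha_for_split(X, y, test_size, seed, alphas):
    """Return the alpha with the lowest test RMSE for one particular split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    scaler = StandardScaler().fit(X_tr)  # fit on the training part only
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    rmse = []
    for a in alphas:
        model = Lasso(alpha=a, max_iter=10000).fit(X_tr, y_tr)
        rmse.append(mean_squared_error(y_te, model.predict(X_te)) ** 0.5)
    return float(alphas[int(np.argmin(rmse))])


# Hypothetical synthetic data: 40 rows, 4 predictors, only the first one matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=40)
alphas_demo = np.arange(0.05, 2.0, 0.05)  # starts above 0, where Lasso is not advised

results = {
    ts: best_alpha_for_split(X_demo, y_demo, ts, seed=42, alphas=alphas_demo)
    for ts in (0.1, 0.2, 0.3)
}
print(results)
```

With only 40 rows, the selected value can move noticeably from one split to another, which is worth keeping in mind when interpreting a single "best" λ.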

### Perform selection of the optimal λ

In [None]:
# ----- Lasso Regularization -----
lambdas = np.arange(0, 100, 0.5)
train_rmse = []
test_rmse = []

for a in lambdas:
    model = Lasso(alpha=a, max_iter=10000)
    model.fit(X_train, y_train)
    train_rmse.append(mean_squared_error(y_train, model.predict(X_train))**0.5)
    test_rmse.append(mean_squared_error(y_test, model.predict(X_test))**0.5)

# Find best alpha
best_lambda = lambdas[np.argmin(test_rmse)]
best_test_rmse = min(test_rmse)

print(f"Best λ: {best_lambda:.1f}")
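
To answer the question of which variables are in the model, it helps to look at which coefficients are nonzero at a given λ. The sketch below uses a hypothetical helper `surviving_features` on toy data; the column names `signal` and `noise` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def surviving_features(X: pd.DataFrame, y, alpha: float) -> pd.Series:
    """Fit Lasso on standardized X and return only the nonzero coefficients."""
    X_scaled = StandardScaler().fit_transform(X)
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    coefs = pd.Series(model.coef_, index=X.columns)
    return coefs[coefs != 0]


# Hypothetical toy data: "signal" drives y, "noise" is unrelated to it
rng = np.random.default_rng(1)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=60),
    "noise": rng.normal(size=60),
})
toy_y = 3.0 * toy_X["signal"] + rng.normal(scale=0.5, size=60)

kept = surviving_features(toy_X, toy_y, alpha=0.5)
print(kept)
```

In the notebook, the equivalent check would be to pair `model.coef_` with `X.columns` after fitting at `best_lambda`.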

### You can visualize the hyperparameter selection with the plot below

In [None]:
# ----- Plot -----
plt.figure(figsize=(10, 6))
plt.plot(lambdas, train_rmse, label='Train RMSE')
plt.plot(lambdas, test_rmse, label='Test RMSE')
plt.axvline(best_lambda, linestyle='--', label=f'Best λ = {best_lambda:.1f}', linewidth=2)
plt.xlabel("Regularization Strength (λ)", fontsize=14)
plt.ylabel("Root Mean Squared Error", fontsize=14)
plt.title("Lasso: Bias–Variance Trade-off", fontsize=16)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.4)
plt.tight_layout()
plt.show()
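
The curve above comes from a single train/test split. As an alternative (not the approach this notebook uses), the same λ grid could be scored with k-fold cross-validation. The sketch below uses scikit-learn's `GridSearchCV` over a pipeline so that the scaler is refitted inside every fold; the data and the names `X_demo`, `y_demo` and `alphas_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two informative predictors out of four
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 4))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=40)

alphas_demo = np.arange(0.05, 2.0, 0.05)

# make_pipeline names each step after its lowercased class name ("lasso")
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": alphas_demo},
    cv=5,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X_demo, y_demo)

best_alpha_cv = search.best_params_["lasso__alpha"]
print(f"Best alpha by 5-fold CV: {best_alpha_cv:.2f}")
print(f"Mean CV RMSE: {-search.best_score_:.2f}")
```

Averaging over folds tends to give a more stable choice than one split, at the cost of fitting the model several times per candidate value.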


In [None]:
print(f"Our test RMSE is ~{best_test_rmse:.0f}, which means we expect the model's sales predictions to be off by roughly that amount. "
      f"Our mean number of sales is {y.mean():.0f}, so the error does not seem that large.")

### We have selected the best λ based on a train/test split of our data.
Now we want to deploy the model to the production environment and provide our boss with the first predictions. To do that, we train the final model on the whole available dataset.

In [None]:
# we fit the scaler on the whole dataset
scaler_fin = StandardScaler()
X_fin = scaler_fin.fit_transform(X)

# train on the scaled features, because best_lambda was selected on scaled data
model_final = Lasso(alpha=best_lambda, max_iter=10000)
model_final.fit(X_fin, y)

### We have data for the following 5 months of the final model working in production. Let's see what ex-post error we get!

In [None]:
### New data - simulation of model performance in production

np.random.seed(40)
n = 5
var1 = np.round(np.random.normal(1000, 100, n))
var2 = np.round(np.random.normal(1000, 100, n))
sales_new = np.round(10 * var1 + np.random.normal(0, 300, n))
offers_our_new = np.round(0.7 * (sales_new - np.mean(sales_new)) / np.std(sales_new) * 50 + np.mean(sales_new) * 0.1 + np.random.normal(0, 20, n))
offers_comp_new = np.round(0.6 * (-sales_new + np.mean(sales_new)) / np.std(sales_new) * 50 + np.mean(sales_new) * 0.05 + np.random.normal(0, 25, n))
complaints_new = 10*np.round(0.4 * (-sales_new + np.mean(sales_new)) / np.std(sales_new) * 20 + np.mean(sales_new) * 0.02 + np.random.normal(0, 15, n))
cows_new = var2 * 1000 + 1500000

In [None]:
### Data frame for the new data
df_new = pd.DataFrame({
    'sales': sales_new,
    'complaints': complaints_new,
    'offers_comp': offers_comp_new,
    'offers_our': offers_our_new,
    'cows': cows_new
})

### Data preparation - prepare the new data in the same way as the data for the production model

In [None]:
X_new = df_new.drop('sales', axis=1)
y_new = df_new['sales']

In [None]:
# Remember to scale the new data with the same scaler that was used for the deployed model!
X_new_scaled = scaler_fin.transform(X_new)

### Check the RMSE on the new data

In [None]:
# predict on the scaled new data, matching how model_final was trained
rmse_new_data = mean_squared_error(y_new, model_final.predict(X_new_scaled))**0.5
rmse_new_data

### Check your score with the message below

In [None]:
print(f"Your RMSE for the new dataset is {rmse_new_data:.0f}, while you expected it to be around {best_test_rmse:.0f}")

In [None]:
# compare the RMSE on the new data with the RMSE expected from the test set
ratio = rmse_new_data / best_test_rmse

if ratio < 1:
    print("Good job, your model works great!")
elif 1 <= ratio < 2:
    print("Your model is working ok.")
elif 2 <= ratio < 4:
    print("Something requires adjustment.")
else:  # ratio >= 4
    print("Your boss is furious, you may need to rethink your approach!")


### Hints if something is not working as intended

Hint: maybe rethink the use of some of the variables. See the correlations for the new data below. Maybe you can draw some conclusions from them.

Hint: maybe you can use Ridge regression instead of LASSO? Try modifying the line `model = Lasso(alpha=a, max_iter=10000)`, and remember to adjust `model_final` in the subsequent code as well.

Hint: maybe try extending the range of λ values considered in the line `lambdas = np.arange(0, 100, 0.5)`.
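
As an illustration of the Ridge hint, the sketch below fits both models on the same standardized toy data and prints their coefficients. The data and variable names are hypothetical. Note that scikit-learn's `Ridge` does not divide the squared-error term by the number of rows, so its `alpha` is on a different scale from the `Lasso` one and the λ grid would need to be re-tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: only the first of three predictors matters
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=40)
X_scaled = StandardScaler().fit_transform(X_demo)

lasso = Lasso(alpha=0.5, max_iter=10000).fit(X_scaled, y_demo)
ridge = Ridge(alpha=0.5).fit(X_scaled, y_demo)

# Lasso can set coefficients exactly to zero; Ridge only shrinks them
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```

Because Ridge keeps every predictor in the model, switching to it does not by itself remove a variable whose relationship with sales is not stable.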

In [None]:
# Correlations for the new data
print("Data correlations with sales:")
print(df_new.corr()['sales'].sort_values(ascending=False))
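
To draw conclusions from these correlations, it can help to place them next to the ones from the training data. The sketch below assumes a hypothetical helper `compare_correlations` and toy frames; the column names `stable` and `spurious` are illustrative only.

```python
import numpy as np
import pandas as pd


def compare_correlations(train_df, new_df, target):
    """Side-by-side correlations with the target in two data sets."""
    table = pd.DataFrame({
        "train": train_df.corr()[target],
        "new": new_df.corr()[target],
    }).drop(index=target)
    table["abs_diff"] = (table["train"] - table["new"]).abs()
    return table.sort_values("abs_diff", ascending=False)


# Hypothetical toy frames: "stable" keeps its link to the target, "spurious" loses it
rng = np.random.default_rng(4)
t_old, t_new = rng.normal(size=200), rng.normal(size=200)
old = pd.DataFrame({
    "target": t_old,
    "stable": t_old + rng.normal(scale=0.3, size=200),
    "spurious": t_old + rng.normal(scale=0.3, size=200),
})
new = pd.DataFrame({
    "target": t_new,
    "stable": t_new + rng.normal(scale=0.3, size=200),
    "spurious": rng.normal(size=200),
})

report = compare_correlations(old, new, "target")
print(report)
```

In the notebook this would be called as `compare_correlations(df, df_new, 'sales')`. With only 5 new months the correlations are very noisy, so a large difference is a prompt for domain reasoning rather than proof on its own.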