<img src="https://www.bestdesigns.co/uploads/inspiration_images/4350/990__1511457498_404_walmart.png" alt="WALMART LOGO" />

# Walmart : predict weekly sales

## Company's Description 📇

Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores from the United States, headquartered in Bentonville, Arkansas. The company was founded by Sam Walton in 1962.

## Project 🚧

Walmart's marketing department has asked you to build a machine learning model that estimates the weekly sales in their stores as accurately as possible. Such a model would help them better understand how sales are influenced by economic indicators, and could be used to plan future marketing campaigns.

## Goals 🎯

The project can be divided into three steps:

- Part 1 : perform an exploratory data analysis (EDA) and all the preprocessing needed to prepare the data for machine learning
- Part 2 : train a **linear regression model** (baseline)
- Part 3 : avoid overfitting by training a **regularized regression model**

## Scope of this project 🖼️

For this project, you'll work with a dataset that contains information about weekly sales achieved by different Walmart stores, and other variables such as the unemployment rate or the fuel price, that might be useful for predicting the amount of sales. The dataset has been taken from a Kaggle competition, but we made some changes compared to the original data. Please make sure that you're using **our** custom dataset (available on JULIE). 🤓

## Deliverable 📬

To complete this project, your team should: 

- Create some visualizations
- Train at least one **linear regression model** on the dataset, that predicts the amount of weekly sales as a function of the other variables
- Assess the performance of the model using a metric that is relevant for regression problems
- Interpret the coefficients of the model to identify what features are important for the prediction
- Train at least one model with **regularization (Lasso or Ridge)** to reduce overfitting


## Helpers 🦮

Here are a few tips to help you complete this project:

### Part 1 : EDA and data preprocessing

Start your project by exploring your dataset : create figures, compute some statistics, etc.

Then, you'll have to preprocess the dataset. You can follow the guidelines from the *preprocessing template*. Some transformations are specific to this dataset, for example on the *Date* column, which can't be included in the model as it is. Below are some hints that might help you 🤓

 ## IMPORT MODULES

In [1]:
print('Loading...')
#Manipulate
import pandas as pd # To load and manipulate data
import numpy as np # to calculate statistics (mean, standard deviation, etc.)

#Visualize
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Preprocessing 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

#Model
from sklearn.linear_model import LinearRegression

#Evaluate
from sklearn.metrics import r2_score, mean_squared_error

#Regularization
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV

#Ignoring Deprecation Warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

print('...Done')

Loading...
...Done


 ## IMPORTING THE DATASET

In [2]:
# Import dataset
df = pd.read_csv("Walmart_Store_sales.csv")
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.470
2,17.0,27-07-2012,,0.0,,,130.719581,5.936
3,11.0,,1244390.03,0.0,84.57,,214.556497,7.346
4,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092
...,...,...,...,...,...,...,...,...
145,14.0,18-06-2010,2248645.59,0.0,72.62,2.780,182.442420,8.899
146,7.0,,716388.81,,20.74,2.778,,
147,17.0,11-06-2010,845252.21,0.0,57.14,2.841,126.111903,
148,8.0,12-08-2011,856796.10,0.0,86.05,3.638,219.007525,


## EXPLORING DATASET

In [3]:
display(f"Number of rows: {df.shape[0]}  Number of columns: {df.shape[1]}")
# List Comprehension
display([col for col in df.columns])
summary_table = pd.DataFrame({
    # 'is_null_value': df.isnull().any(),
    'sum_null_value': round(df.isnull().sum()),
    'proporcion_null_value': (df.isnull().sum()/df.shape[0])*100,
    'count_values': df.count(),
    # 'unique': {print(f'{len(df.groupby(col).Name.nunique())}') for col in df},
    'type': df.dtypes,
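    # numeric_only=True skips the text column Date, which has no mean/max/min
    'mean': df.mean(numeric_only=True),
    'max': df.max(numeric_only=True),
    'min': df.min(numeric_only=True),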
})
summary_table

'Number of rows: 150  Number of columns: 8'

['Store',
 'Date',
 'Weekly_Sales',
 'Holiday_Flag',
 'Temperature',
 'Fuel_Price',
 'CPI',
 'Unemployment']


Unnamed: 0,sum_null_value,proporcion_null_value,count_values,type,mean,max,min
CPI,12,8.0,138,float64,179.8985,226.9688,126.111903
Date,18,12.0,132,object,,,
Fuel_Price,14,9.333333,136,float64,3.320853,4.193,2.514
Holiday_Flag,12,8.0,138,float64,0.07971014,1.0,0.0
Store,0,0.0,150,float64,9.866667,20.0,1.0
Temperature,18,12.0,132,float64,61.39811,91.65,18.79
Unemployment,15,10.0,135,float64,7.59843,14.313,5.143
Weekly_Sales,14,9.333333,136,float64,1249536.0,2771397.0,268929.03


In [4]:
# Number of distinct non-null values and dtype of each column
for col in df:
    print(f'{col} : {df[col].nunique()} : {df[col].dtype}')

Store : 20 : float64
Date : 85 : object
Weekly_Sales : 136 : float64
Holiday_Flag : 2 : float64
Temperature : 130 : float64
Fuel_Price : 120 : float64
CPI : 135 : float64
Unemployment : 104 : float64


 ## PREPROCESSING - PANDAS 🐼🐼



### DROP LINES

 **Drop lines where target values are missing :**
 - Here, the target variable (Y) corresponds to the column *Weekly_Sales*. One can see above that there are some missing values in this column.
 - We never use imputation techniques on the target : it might create some bias in the predictions !

 👉 We therefore drop the lines of the dataset for which the value in *Weekly_Sales* is missing. The code below also drops the lines with a missing *Date*, since no date features can be built for them.
 

In [5]:
df.shape

(150, 8)

In [6]:
df_clean = df.dropna(axis=0, how='any', subset=['Weekly_Sales', 'Date'])
# df_clean = df[df['Weekly_Sales'].notna()]
df_clean = df_clean.reset_index(drop=True)
df_clean


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,6.0,18-02-2011,1572117.54,,59.61,3.045,214.777523,6.858
1,13.0,25-03-2011,1807545.43,0.0,42.38,3.435,128.616064,7.470
2,6.0,28-05-2010,1644470.66,0.0,78.89,2.759,212.412888,7.092
3,4.0,28-05-2010,1857533.70,0.0,,2.756,126.160226,7.896
4,15.0,03-06-2011,695396.19,0.0,69.80,4.069,134.855161,7.658
...,...,...,...,...,...,...,...,...
113,3.0,19-10-2012,424513.08,0.0,73.44,3.594,226.968844,6.034
114,14.0,18-06-2010,2248645.59,0.0,72.62,2.780,182.442420,8.899
115,17.0,11-06-2010,845252.21,0.0,57.14,2.841,126.111903,
116,8.0,12-08-2011,856796.10,0.0,86.05,3.638,219.007525,


### CHANGE DATE

**Create usable features from the *Date* column :**
The *Date* column cannot be included in the model as it is. You can either drop this column, or create new columns that contain the following numeric features :
- *year*
- *month*
- *day*
- *day of week*

In [7]:
df_clean = df_clean.copy() #Deep copy

In [8]:
df_clean['Date'] = pd.to_datetime(df_clean['Date'], format='%d-%m-%Y')
df_clean['Date']

0     2011-02-18
1     2011-03-25
2     2010-05-28
3     2010-05-28
4     2011-06-03
         ...    
113   2012-10-19
114   2010-06-18
115   2010-06-11
116   2011-08-12
117   2012-04-20
Name: Date, Length: 118, dtype: datetime64[ns]

In [9]:
type(df_clean['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [10]:
display(df_clean['Date'][0].year)
display(df_clean['Date'][0].month)
display(df_clean['Date'][0].day)
display(df_clean['Date'][2].weekday()) #Monday is 0 and Sunday is 6, so 4 is Friday.

2011

2

18

4

In [11]:
# Extract year, month and day with the vectorized .dt accessor
df_clean['Year'] = df_clean['Date'].dt.year
df_clean['Month'] = df_clean['Date'].dt.month
df_clean['Day'] = df_clean['Date'].dt.day
# Show the values for the first row
display(int(df_clean['Year'][0]))
display(int(df_clean['Month'][0]))
display(int(df_clean['Day'][0]))
# The day of the week is not extracted: every date is a Friday,
# so it carries no information for the model.

2011

2

18

In [12]:
df_clean.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,2011-02-18,1572117.54,,59.61,3.045,214.777523,6.858,2011,2,18
1,13.0,2011-03-25,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011,3,25
2,6.0,2010-05-28,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010,5,28
3,4.0,2010-05-28,1857533.7,0.0,,2.756,126.160226,7.896,2010,5,28
4,15.0,2011-06-03,695396.19,0.0,69.8,4.069,134.855161,7.658,2011,6,3


In [13]:
px.scatter(data_frame=df_clean, x='Temperature', y='Weekly_Sales')

In [14]:
px.scatter(data_frame=df_clean, x='Fuel_Price', y='Weekly_Sales')

In [15]:
px.scatter(data_frame=df_clean, x='CPI', y='Weekly_Sales')

In [16]:
px.scatter(data_frame=df_clean, x='Unemployment', y='Weekly_Sales')

In [17]:
px.scatter(data_frame=df_clean, x='Holiday_Flag', y='Weekly_Sales')

**Drop lines containing invalid values or outliers :**
In this project, a value of a numeric feature is considered an outlier when it falls outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$. This concerns the columns *Temperature*, *Fuel_Price*, *CPI* and *Unemployment*.

## Measuring the center of a distribution

There are three ways to measure the centre of a distribution in statistics: mean, median and mode.

### Mean (or Average)

The mean is the sum of all the measurements in your sample divided by the total number of individuals. The formula is as follows:

$$
\bar{X} = \frac{\sum_{i=1}^{n}X_{i}}{n}
$$

In [18]:
cols = ['Temperature',
       'Fuel_Price', 'CPI', 'Unemployment']
for i in df[cols]:
    print(df[i].mean())

61.398106060606054
3.320852941176469
179.89850871739125
7.59842962962963


## Measure of Variation


### Standard Deviation

A widely used measure of variation is the standard deviation. It tells us how far, on average, the values in our sample deviate from the mean. Here is the formula:

$$
 \sigma = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}
$$
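
As a quick check of the formula, the sketch below computes the sample standard deviation by hand and compares it with pandas and NumPy. The values are made up for illustration. One detail worth knowing: `Series.std()` divides by $n-1$ by default, whereas `np.std` divides by $n$ unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Made-up sample, for illustration only
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(x)
mean = x.sum() / n
# Sample standard deviation: divide the sum of squared deviations by n - 1
manual_std = np.sqrt(((x - mean) ** 2).sum() / (n - 1))

pandas_std = x.std()                             # ddof=1 by default
numpy_sample_std = np.std(x.to_numpy(), ddof=1)  # same formula as above
numpy_population_std = np.std(x.to_numpy())      # ddof=0 by default: divides by n

print(manual_std, pandas_std, numpy_sample_std, numpy_population_std)
```

The first three values agree with each other; the last one is slightly smaller because it divides by $n$ instead of $n-1$.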

Standard deviations are sensitive to statistical outliers (i.e. exceptionally large or small values), although less so than the range, which depends only on the two most extreme values. The larger your dataset, the less your results will be affected by these extreme values. It is still recommended to look at all points that are abnormally far from typical values and to remove them from your sample when you know that they only bias your calculations.

Standard deviations can be calculated on any distribution. For data that follow a normal distribution, about 68% of the values lie within one standard deviation of the mean, about 95% within two standard deviations and about 99.7% within three. This is known as the empirical rule (or three-sigma rule). For other distributions these percentages are only approximate, but in practice they are often a reasonable guide.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M03-Python_programming_and_statistics/D01-Introduction_to_python_and_statistics/three_sigma_rule.png)

All rows with a value outside the range $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$ will be removed.

* 1 sigma = 68%
* 2 sigma = 95%
* 3 sigma = 99.7%
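
The percentages listed above can be checked numerically. The sketch below draws a large sample from a normal distribution (simulated data, not the Walmart dataset; the mean of 60 and standard deviation of 18 are arbitrary choices) and measures the share of values within one, two and three standard deviations of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated normally distributed sample (not the Walmart data)
sample = rng.normal(loc=60.0, scale=18.0, size=100_000)

mean = sample.mean()
std = sample.std(ddof=1)

coverage = {}
for k in (1, 2, 3):
    # Share of values whose distance to the mean is at most k standard deviations
    inside = np.abs(sample - mean) <= k * std
    coverage[k] = inside.mean()
    print(f"within {k} sigma: {coverage[k]:.3%}")
```

On skewed or heavy-tailed data the shares can differ noticeably, which is worth keeping in mind before applying a three-sigma cut blindly.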

In [19]:
cols = ['Temperature',
       'Fuel_Price', 'CPI', 'Unemployment']
for i in df[cols]:
    print('Mean:')
    display(df[i].mean())
    print('Standard deviation:')
    display(df[i].std())

Mean:

61.398106060606054

Standard deviation:

18.37890061969609

Mean:

3.320852941176469

Standard deviation:

0.47814903192626695

Mean:

179.89850871739125

Standard deviation:

40.27495629088349

Mean:

7.59842962962963

Standard deviation:

1.5771725009088247

Three-sigma range for each feature --> $[\bar{X} - 3\sigma, \bar{X} + 3\sigma]$

In [20]:
cols = ['Temperature',
       'Fuel_Price', 'CPI', 'Unemployment']

In [21]:
CI_Temp =  61.39 - 3*18.37, 61.39 + 3*18.37
CI_Fuel =  3.32 - 3*0.47, 3.32 + 3*0.47
CI_CPI =  179.89 - 3*40.27, 179.89 + 3*40.27
CI_Unem =  7.59 - 3*1.57, 7.59 + 3*1.57
display(CI_Temp)
display(CI_Fuel)
display(CI_CPI)
display(CI_Unem)

(6.280000000000001, 116.5)

(1.91, 4.7299999999999995)

(59.079999999999984, 300.7)

(2.88, 12.3)

In [22]:
df_clean.shape

(118, 11)

In [23]:
# Keep the rows inside the three-sigma ranges computed above.
# Comparisons with NaN evaluate to False, so rows with a missing value
# in any of these four columns are dropped as well.
df_mask_temp = df_clean[(df_clean['Temperature'] >= 6.28) & (df_clean['Temperature'] < 116.5)]
df_mask_Fuel = df_mask_temp[(df_mask_temp['Fuel_Price'] >= 1.91) & (df_mask_temp['Fuel_Price'] < 4.73)]
df_mask_CPI = df_mask_Fuel[(df_mask_Fuel['CPI'] >= 59.08) & (df_mask_Fuel['CPI'] < 300.7)]
df_mask_Unem = df_mask_CPI[(df_mask_CPI['Unemployment'] >= 2.88) & (df_mask_CPI['Unemployment'] < 12.3)]
display(df_mask_temp.shape)
display(df_mask_Fuel.shape)
display(df_mask_CPI.shape)
display(df_mask_Unem.shape)

(107, 11)

(96, 11)

(91, 11)

(80, 11)
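
The bounds in the filter above are typed in by hand, and the comparisons also discard rows with missing values. A possible alternative, sketched here on made-up toy data, computes the bounds from the data and keeps missing values so that they can be imputed later. The helper name `three_sigma_mask` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def three_sigma_mask(frame, columns, k=3.0):
    """Return a boolean mask that is True for the rows to keep.

    A value is kept when it lies within mean +/- k * std of its column.
    Missing values are kept so that they can be imputed later.
    """
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        mean, std = frame[col].mean(), frame[col].std()
        in_range = frame[col].between(mean - k * std, mean + k * std)
        keep &= in_range | frame[col].isna()
    return keep

# Toy data (made up): one extreme Unemployment value, one missing Temperature
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Temperature": rng.normal(60, 10, size=50),
    "Unemployment": rng.normal(7.5, 1.0, size=50),
})
toy.loc[0, "Unemployment"] = 30.0   # obvious outlier
toy.loc[1, "Temperature"] = np.nan  # missing value, should survive the filter

mask = three_sigma_mask(toy, ["Temperature", "Unemployment"])
toy_filtered = toy[mask]
print(toy.shape, "->", toy_filtered.shape)
```

Applied to the notebook's data, a call such as `df_clean[three_sigma_mask(df_clean, cols)]` would give different row counts from the ones shown above, because rows with missing values would no longer be removed.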

In [24]:
df_drop_date = df_mask_Unem.drop('Date', axis=1)

In [25]:
df_clean_final = df_drop_date

 ### DROP COLUMNS

🔴 Apart from *Date*, which was dropped above once *Year*, *Month* and *Day* had been extracted, we keep all the columns.


### FEATURES AND TARGET

In [26]:
type(df_clean_final)

pandas.core.frame.DataFrame

In [27]:
df_clean_final.columns

Index(['Store', 'Weekly_Sales', 'Holiday_Flag', 'Temperature', 'Fuel_Price',
       'CPI', 'Unemployment', 'Year', 'Month', 'Day'],
      dtype='object')

In [28]:
df_clean_final.head()

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,6.0,1572117.54,,59.61,3.045,214.777523,6.858,2011,2,18
1,13.0,1807545.43,0.0,42.38,3.435,128.616064,7.47,2011,3,25
2,6.0,1644470.66,0.0,78.89,2.759,212.412888,7.092,2010,5,28
4,15.0,695396.19,0.0,69.8,4.069,134.855161,7.658,2011,6,3
5,20.0,2203523.2,0.0,39.93,3.617,213.023622,6.961,2012,2,3


In [29]:
features = ['Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Day']
X = df_clean_final.loc[:, features]  # a list (not a tuple) selects several columns
y = df_clean_final.loc[:, 'Weekly_Sales']
print(X.shape)
print(y.shape)

(80, 9)
(80,)


 ## PREPROCESSING - SCIKIT-LEARN 🔬🔬


We need to identify which columns contain categorical variables and which columns contain numerical variables, as they will be treated differently.

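🖐 FEATURES (X)--> We have both categorical and numerical variables.

**👉 Categorical variables : 'Store', 'Holiday_Flag'.**

**👉 Numerical variables : 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment', 'Year', 'Month', 'Day'.** The day of the week was not created, because every date is a Friday. Note that the numeric pipeline below only selects the first four of these columns, so *Year*, *Month* and *Day* are not passed to the model.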

In this dataset, we have both types of variables. It will thus be necessary to create a numeric_transformer that will wrap together all preprocessing steps for numerical variables (it will call the StandardScaler class and replace missing values using the SimpleImputer class) and a categorical_transformer to wrap together all the preprocessing steps for categorical variables (it will call the OneHotEncoder class and replace missing values using the SimpleImputer class).

🖐 TARGET (Y)--> It's a numerical variable.

**👉 Numerical variables : 'Weekly_Sales'**



 ### TRAIN TEST SPLIT 🔬🔬

In [30]:
# No random_state is set, so the split (and every score below) changes from one run to the next
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
display(X_train.head(), y_train.head())
display(X_test.head(), y_test.head())

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
48,8.0,0.0,74.92,2.619,214.936279,6.315,2010,8,27
11,17.0,0.0,60.07,2.853,126.2346,6.885,2010,10,1
37,2.0,0.0,54.63,3.555,220.275944,7.057,2012,2,24
95,8.0,0.0,75.32,2.582,214.878556,6.315,2010,9,17
69,3.0,0.0,75.19,3.688,225.23515,6.664,2012,5,11


48     888816.78
11     829207.27
37    1861802.70
95     836707.85
69     431985.36
Name: Weekly_Sales, dtype: float64

Unnamed: 0,Store,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
67,9.0,0.0,49.96,2.771,215.437285,6.56,2010,11,19
105,4.0,0.0,82.84,3.627,129.150774,5.644,2011,7,22
57,3.0,0.0,45.71,2.572,214.424881,7.368,2010,2,5
78,3.0,0.0,83.52,2.637,214.785826,7.343,2010,6,18
41,16.0,0.0,48.29,3.75,197.413326,6.162,2012,3,30


67      519823.30
105    2036231.39
57      461622.22
78      364076.85
41      485095.41
Name: Weekly_Sales, dtype: float64
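
The split above is drawn at random on every run. If reproducible results are wanted, one option is to pass a fixed `random_state` to `train_test_split`. The sketch below uses small made-up arrays in place of `X` and `y`, and the seed value `0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (made up)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# Same seed -> same split on every run
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
print("test size:", split_a[1].shape[0])
```

With only 80 rows in the cleaned dataset, the scores can vary a lot between splits, so fixing the seed makes comparisons between models easier to interpret.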

### PIPELINE NUMERIC FEATURES

In [31]:
X_train.shape

(64, 9)

In [32]:
# Create pipeline for numeric features

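# Positions of the numeric columns used by the model in X_train/X_test:
# Temperature, Fuel_Price, CPI and Unemployment.
# Year, Month and Day (positions 6, 7, 8) are left out, so the ColumnTransformer
# below drops them (its default is remainder='drop').
# numeric_features = [2,3,4,5,6,7,8] # use this line to include Year, Month and Day
numeric_features = [2,3,4,5]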
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # missing values will be replaced by columns' median
    ('scaler', StandardScaler())
])

### PIPELINE CATEGORICAL FEATURES

In [33]:
# Create pipeline for categorical features
categorical_features = [0,1] # Positions of categorical columns in X_train/X_test
categorical_transformer = Pipeline(
    steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # missing values will be replaced by most frequent value
    ('encoder', OneHotEncoder(drop='first')) # the first category of each feature is dropped to avoid perfectly collinear dummy columns
    ])

### PIPELINE GLOBAL

In [34]:
# Reminder: call .fit_transform() on X_train and only .transform() on X_test, so that the test set gets exactly the transformations learned on the training set.
# Use ColumnTransformer to build a preprocessor object that describes all the treatments to apply
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocessings on train set
print("Performing preprocessings on train set...")
print(X_train)
X_train = preprocessor.fit_transform(X_train)
print('...Done.')
print(X_train)
print()

# Preprocessings on test set
print("Performing preprocessings on test set...")
print(X_test)
X_test = preprocessor.transform(X_test) # Don't fit again !!
print('...Done.')
print(X_test)
print()

Performing preprocessings on train set...
    Store  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment  \
48    8.0           0.0        74.92       2.619  214.936279         6.315   
11   17.0           0.0        60.07       2.853  126.234600         6.885   
37    2.0           0.0        54.63       3.555  220.275944         7.057   
95    8.0           0.0        75.32       2.582  214.878556         6.315   
69    3.0           0.0        75.19       3.688  225.235150         6.664   
..    ...           ...          ...         ...         ...           ...   
92    7.0           0.0        39.30       3.936  197.722738         8.090   
84    8.0           1.0        33.34       2.548  214.621419         6.299   
31    4.0           0.0        81.85       3.570  129.066300         5.946   
96    5.0           0.0        89.42       3.682  216.046436         6.529   
70    9.0           NaN        78.51       2.642  214.656430         6.442   

    Year  Month  Day 

### ENCODE TARGET VARIABLE Y

The target *Weekly_Sales* is numeric, so it does not need to be encoded.

## MODEL TRAINING 🏃

In [35]:
# Train model - Machine Learning
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

## PREDICTIONS 🔮

In [36]:
# Predictions on training set
print("Predictions on training set...")
y_train_pred = model.predict(X_train)
print("...Done.")
print(y_train_pred)
print()

Predictions on training set...
...Done.
[ 902811.84042866  864232.60588619 1904450.30805908  908127.72907499
  411898.10402366  884853.0041572  1940396.66973096 1757242.51003853
  446601.77139093 1607604.11554827 1305458.64042695 1569885.58185975
  657663.56492476  341168.74918562  321087.93608411 1071389.92642183
  295220.299167   2471090.72318482 1049799.2478141   630667.67952008
  556724.33007289 1520691.58030216 1800708.7849217  1430384.34249415
  430522.81615788 2324368.31690616 1515050.44409096 1604183.97004812
 1805999.79031797 1319074.15163586 1560658.80920596 2029957.69446471
 1357030.17990592 1989003.06127839  608343.85521317 1914898.84341392
 1831507.13669206  365887.94459378 1576778.38906279  554246.54817519
 1922820.95380494  539483.31782671  502980.30070076 1430126.96619187
 1953544.7601659  1921191.75993564 1032278.08136128  393779.49292174
  501209.68712638 1926774.17795646  571150.11079636 1976133.44135238
 2059866.74839395 1077533.92525276  467286.95834988  478500.408

In [37]:
# Predictions on test set
print("Predictions on test set...")
y_test_pred = model.predict(X_test)
print("...Done.")
print(y_test_pred)
print()

Predictions on test set...
...Done.
[ 490224.62383112 2299147.28424685  414188.74392308  378266.0227095
  573204.86294493 1429646.23246147 1996019.07158625 1763944.19618088
 1983259.7704913  2072675.59330677 2325068.88243948  787053.32630871
 2054351.0208494  1423586.3625096   492352.68655217  889075.24470223]



### PERFORMANCE 💯

In [38]:
# Print R^2 scores
print("R2 score on training set : ", r2_score(y_train, y_train_pred))
print("R2 score on test set : ", r2_score(y_test, y_test_pred))

R2 score on training set :  0.9823127395524044
R2 score on test set :  0.936999912387258
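
The R² score has no unit. To express the error in the same unit as *Weekly_Sales*, the root mean squared error (RMSE) can be reported alongside it, using the `mean_squared_error` function imported at the top of the notebook. The sketch below uses made-up values rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values, for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# Square root of the mean squared error: same unit as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R2:   {r2:.3f}")
```

In the notebook, the same two lines could be applied to `y_train`/`y_train_pred` and `y_test`/`y_test_pred`.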


In [39]:
model.coef_

array([  -14006.58664641,   -87078.32270008,   603468.7369591 ,
         -67738.87441177,   280899.53960601, -1276913.95807916,
        2034561.65627388, -1351364.81916576,   -11009.83711787,
        -613349.71769242,  -839952.30645776, -1236656.82330935,
        1638208.76579716,   157008.01157018,  1729697.820537  ,
        1034392.2904485 ,   468393.7719494 ,  -728674.67989459,
         579212.60236754,   866006.20856432,  1158910.52663011,
         541355.52273578,   -37292.68668042])

In [40]:
# coef_ is 1-D here, so no transpose is needed; rows follow the order of the preprocessor's output columns
coefs = pd.DataFrame(data = model.coef_, columns=["coefficients"])
coefs

Unnamed: 0,coefficients
0,-14006.59
1,-87078.32
2,603468.7
3,-67738.87
4,280899.5
5,-1276914.0
6,2034562.0
7,-1351365.0
8,-11009.84
9,-613349.7


In [41]:
coefs.abs().sort_values(by="coefficients", ascending=False)

Unnamed: 0,coefficients
6,2034562.0
14,1729698.0
12,1638209.0
7,1351365.0
5,1276914.0
11,1236657.0
20,1158911.0
15,1034392.0
19,866006.2
10,839952.3


### FIGHT OVERFITTING

### RIDGE
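
A minimal sketch of Ridge regression, assuming synthetic data from `make_regression` rather than the Walmart dataset. Ridge adds an L2 penalty on the coefficients, controlled by `alpha`, which shrinks them towards zero. The value `alpha=10.0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data (not the Walmart dataset)
X_syn, y_syn = make_regression(n_samples=80, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)

ols = LinearRegression().fit(X_syn, y_syn)
ridge = Ridge(alpha=10.0).fit(X_syn, y_syn)  # alpha is the regularization strength

# The L2 penalty shrinks the coefficient vector compared with ordinary least squares
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"L2 norm of coefficients, OLS:   {ols_norm:.1f}")
print(f"L2 norm of coefficients, Ridge: {ridge_norm:.1f}")
```

In the notebook, `Ridge` could be fitted on the preprocessed `X_train` in the same way as `LinearRegression`, then compared on the train and test R² scores.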

### LASSO
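
A minimal sketch of Lasso regression, again assuming synthetic data from `make_regression` rather than the Walmart dataset. Lasso uses an L1 penalty, which can set some coefficients exactly to zero and therefore acts as a form of feature selection. The value `alpha=5.0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, of which only 5 actually drive the target
X_syn, y_syn = make_regression(n_samples=80, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X_syn, y_syn)  # alpha is the regularization strength

# Count the coefficients that the L1 penalty has set exactly to zero
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```

In the notebook, the features with a zero coefficient after fitting on the preprocessed `X_train` would be the ones the model considers least useful at that value of `alpha`.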

**Bonus question**

In regularized regression models, there's a hyperparameter called *the regularization strength* that can be fine-tuned to get the best generalized predictions on a given dataset. This fine-tuning can be done thanks to scikit-learn's GridSearchCV class : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

Also, you'll find here some examples of how to use GridSearchCV together with Ridge or Lasso models : https://alfurka.github.io/2018-11-18-grid-search/
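
As a hedged illustration of the bonus question, the sketch below tunes `alpha` for Ridge and Lasso with `GridSearchCV`. It runs on synthetic data from `make_regression`, not on the Walmart dataset, and the grid of `alpha` values is an arbitrary assumption that would need adapting to the scale of the real target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (not the Walmart dataset)
X_syn, y_syn = make_regression(n_samples=120, n_features=20, n_informative=5,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Candidate regularization strengths (arbitrary grid, chosen for illustration)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

results = {}
for name, estimator in [("Ridge", Ridge()), ("Lasso", Lasso(max_iter=10_000))]:
    # 5-fold cross-validation on the training set only, scored with R2
    search = GridSearchCV(estimator, param_grid=param_grid, cv=5, scoring="r2")
    search.fit(X_tr, y_tr)
    results[name] = search
    print(name,
          "| best alpha:", search.best_params_["alpha"],
          "| CV R2:", round(search.best_score_, 3),
          "| test R2:", round(search.score(X_te, y_te), 3))
```

The test set is only used once, after the search, so the reported test score is not influenced by the choice of `alpha`.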