# **Codes to lecture 1.**

This notebook contains code examples on the following topics:
- importing basic packages in Python,
- basics of the NumPy package,
- importing datasets,
- using different regression models (linear, Ridge, LASSO and ElasticNet) in scikit-learn.


Table of contents



>[Codes to lecture 1.](#scrollTo=cBjf1ollxtsh)

>[Import packages](#scrollTo=QvbVS6nwyBzh)

>[Numpy basics](#scrollTo=0qPD1BZV-Mhi)

>>>[2.1 Arrays](#scrollTo=miP1UR8__H9q)

>>>[2.1 Methods np.arange() and np.reshape()](#scrollTo=JjFjWtJJC2rY)

>>>[2.2 Creating simple matrices.](#scrollTo=IaTzwnoxDutW)

>>>[2.3 Stacking columns.](#scrollTo=Wa66izMUEYoP)

>>>[Matrix transpose and inverse.](#scrollTo=bFRQXTQCEbf-)

>>>[2.4 Matrix multiplication](#scrollTo=EyhM8m4AKKgQ)

>[3. Import datasets](#scrollTo=33A5F5OlyCPP)

>>>[3.1 Import dataset from seaborn](#scrollTo=N7QjVGoZyCnf)

>>>[3.2 Import dataset from scikit-learn](#scrollTo=F46pjIVQyC7g)

>>>[3.3 Import datasets from R](#scrollTo=tRKZyHnC1fyo)

>[Linear Regression models and their regularization (Ridge, LASSO, ElasticNet).](#scrollTo=VTn1vhtFh-Sp)

>>>[4.1 Simple artificial example](#scrollTo=Fxldou6Aib8k)

>>>[4.2 More complicated artificial example](#scrollTo=McDMMdhsjiF5)

>>>[4.3 Real dataset example](#scrollTo=M8LJDa47ieu6)



# 1. Import packages

Packages are usually imported under short aliases, e.g. the common alias for the *numpy* package is `np`.

In [38]:
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

# 2. Numpy basics

This section provides some basic information on numpy arrays.
In particular, it contains commands that will be useful in preparing the solution to Assignment 1.

You can get the list of names (functions, classes and constants) available in numpy using `dir(np)`:

In [39]:
dir(np)

['ALLOW_THREADS',
 'AxisError',
 'BUFSIZE',
 'CLIP',
 'DataSource',
 'ERR_CALL',
 'ERR_DEFAULT',
 'ERR_IGNORE',
 'ERR_LOG',
 'ERR_PRINT',
 'ERR_RAISE',
 'ERR_WARN',
 'FLOATING_POINT_SUPPORT',
 'FPE_DIVIDEBYZERO',
 'FPE_INVALID',
 'FPE_OVERFLOW',
 'FPE_UNDERFLOW',
 'False_',
 'Inf',
 'Infinity',
 'MAXDIMS',
 'MAY_SHARE_BOUNDS',
 'MAY_SHARE_EXACT',
 'NAN',
 'NINF',
 'NZERO',
 'NaN',
 'PINF',
 'PZERO',
 'RAISE',
 'SHIFT_DIVIDEBYZERO',
 'SHIFT_INVALID',
 'SHIFT_OVERFLOW',
 'SHIFT_UNDERFLOW',
 'ScalarType',
 'Tester',
 'TooHardError',
 'True_',
 'UFUNC_BUFSIZE_DEFAULT',
 'UFUNC_PYVALS_NAME',
 'WRAP',
 '_CopyMode',
 '_NoValue',
 '_UFUNC_API',
 '__NUMPY_SETUP__',
 '__all__',
 '__builtins__',
 '__cached__',
 '__config__',
 '__deprecated_attrs__',
 '__dir__',
 '__doc__',
 '__expired_functions__',
 '__file__',
 '__former_attrs__',
 '__future_scalars__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_add_newdoc_ufunc',


### 2.1 Arrays

We start with creating some simple arrays:

In [40]:
a = np.array([0, 1, 2, 3])
a

array([0, 1, 2, 3])

In [41]:
b = np.array([[1,2],
             [3,4]]
             )
print(b)

[[1 2]
 [3 4]]


In [42]:
c = np.array([[[1.0,3,5],[7,9,11]],
              [[13,15,17],[19,21,23]],
              [[25,27,29],[31,33,35]]]
             )
print(c)

[[[ 1.  3.  5.]
  [ 7.  9. 11.]]

 [[13. 15. 17.]
  [19. 21. 23.]]

 [[25. 27. 29.]
  [31. 33. 35.]]]


with the following shapes

In [43]:
print("Shape of a is", a.shape)
print("Shape of b is", b.shape)
print("Shape of c is", c.shape)

Shape of a is (4,)
Shape of b is (2, 2)
Shape of c is (3, 2, 3)


You can also check the type of an object (e.g. `a`) and the data type of its elements:

In [44]:
type(a)

numpy.ndarray

In [45]:
a.dtype

dtype('int64')

In [46]:
c.dtype

dtype('float64')

and convert it to e.g. float:

In [47]:
a = a.astype(float)
a

array([0., 1., 2., 3.])

In [48]:
a.dtype

dtype('float64')

See also https://numpy.org/doc/stable/user/basics.types.html for more information about data types.
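Converting in the other direction, from float to integer, is worth a quick check because information is lost. The following minimal sketch uses a hypothetical array `x` (not part of the lecture data) to show that `astype(int)` discards the fractional part instead of rounding:

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
x = np.array([0.9, 1.5, -2.7])

# astype(int) truncates toward zero; it does not round to the nearest integer.
x_int = x.astype(int)
print(x_int)          # [ 0  1 -2]

# If rounding is what you want, round first and convert afterwards.
x_rounded = np.rint(x).astype(int)
print(x_rounded)
```

Note that `astype` returns a new array, so the original `x` keeps its float values.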

### 2.1 Methods np.arange() and np.reshape()

See also

https://numpy.org/doc/stable/reference/generated/numpy.arange.html

https://numpy.org/doc/stable/reference/generated/numpy.reshape.html


In [49]:
a1 = np.arange(4)
a1

array([0, 1, 2, 3])

In [50]:
b1 = np.arange(1,5,1).reshape((2,2))
b1

array([[1, 2],
       [3, 4]])

In [51]:
c1 = np.arange(1,48,2).reshape((4,3,2))
c1

array([[[ 1,  3],
        [ 5,  7],
        [ 9, 11]],

       [[13, 15],
        [17, 19],
        [21, 23]],

       [[25, 27],
        [29, 31],
        [33, 35]],

       [[37, 39],
        [41, 43],
        [45, 47]]])
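When only some of the dimensions are known, one dimension passed to `reshape` can be given as `-1`, and NumPy infers it from the number of elements. A small sketch with hypothetical arrays `v`, `m` and `col`:

```python
import numpy as np

v = np.arange(12)

# 12 elements and 3 rows, so NumPy infers 4 columns.
m = v.reshape((3, -1))
print(m.shape)    # (3, 4)

# A common idiom: turn a 1-D array into a single column.
col = v.reshape(-1, 1)
print(col.shape)  # (12, 1)
```

The column form is the layout scikit-learn expects for a feature matrix with one feature.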

### 2.2 Creating simple matrices.

You can easily obtain typical examples of matrices using built-in methods

*   np.ones
*   np.zeros
*   np.eye

in numpy.

See also

https://numpy.org/doc/stable/reference/generated/numpy.ones.html

https://numpy.org/doc/stable/reference/generated/numpy.zeros.html

https://numpy.org/doc/stable/reference/generated/numpy.eye.html

and references therein.


In [53]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [54]:
np.ones(6).reshape((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [55]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [57]:
np.zeros(9).reshape((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [58]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
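Instead of creating a flat array and reshaping it, `np.ones` and `np.zeros` also accept the target shape as a tuple. The sketch below (with illustrative variable names) checks that both routes give the same matrix:

```python
import numpy as np

# Pass the shape directly as a tuple (rows, columns).
ones_2x3 = np.ones((2, 3))
zeros_3x3 = np.zeros((3, 3))

# Same result as the reshape-based construction.
print(np.array_equal(ones_2x3, np.ones(6).reshape((2, 3))))   # True
print(np.array_equal(zeros_3x3, np.zeros(9).reshape((3, 3)))) # True
```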

### 2.3 Stacking columns.

See also https://numpy.org/doc/stable/reference/generated/numpy.hstack.html
and references therein.

In [59]:
A = np.ones(5)
A

array([1., 1., 1., 1., 1.])

In [60]:
B = np.arange(15).reshape((5,3))
B

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

We can add array A as the first column of array B as follows:

In [61]:
np.c_[A,B]

array([[ 1.,  0.,  1.,  2.],
       [ 1.,  3.,  4.,  5.],
       [ 1.,  6.,  7.,  8.],
       [ 1.,  9., 10., 11.],
       [ 1., 12., 13., 14.]])

In [62]:
np.column_stack((A,B))

array([[ 1.,  0.,  1.,  2.],
       [ 1.,  3.,  4.,  5.],
       [ 1.,  6.,  7.,  8.],
       [ 1.,  9., 10., 11.],
       [ 1., 12., 13., 14.]])

See also methods `np.vstack()` and `np.hstack()`: 

https://numpy.org/doc/stable/reference/generated/numpy.vstack.html 

https://numpy.org/doc/stable/reference/generated/numpy.hstack.html
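As a brief illustration of these two functions, the following sketch stacks small hypothetical arrays row-wise and column-wise. Unlike `np.c_` and `np.column_stack`, `np.hstack` does not turn a 1-D array into a column, so the arrays here are created as 2-D from the start:

```python
import numpy as np

top = np.arange(6).reshape((2, 3))   # shape (2, 3)
bottom = np.ones((1, 3))             # one extra row, shape (1, 3)
left = np.ones((2, 1))               # one extra column, shape (2, 1)

# vstack appends rows: the number of columns must match.
stacked_rows = np.vstack((top, bottom))
print(stacked_rows.shape)  # (3, 3)

# hstack appends columns: the number of rows must match.
stacked_cols = np.hstack((left, top))
print(stacked_cols.shape)  # (2, 4)
```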

### Matrix transpose and inverse.

You can use built-in methods

*   np.transpose
*   np.linalg.inv

to transpose and invert a matrix, respectively.

In [63]:
A = np.arange(1,5,1).reshape((2,2))
A

array([[1, 2],
       [3, 4]])

In [64]:
np.transpose(A)

array([[1, 3],
       [2, 4]])

In [65]:
A.T

array([[1, 3],
       [2, 4]])

In [66]:
np.linalg.inv(A)

array([[-2. ,  1. ],
       [ 1.5, -0.5]])

In [67]:
np.linalg.det(A)

-2.0000000000000004
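The determinant above is not exactly -2 because of floating-point rounding, so results of such computations are best compared with a tolerance. A minimal sketch that checks the inverse and, as an aside, solves a linear system without forming the inverse explicitly (the right-hand side `b` is a made-up example):

```python
import numpy as np

A = np.arange(1, 5, 1).reshape((2, 2))
A_inv = np.linalg.inv(A)

# A times its inverse should be the identity, up to rounding error.
print(np.allclose(A @ A_inv, np.eye(2)))  # True

# To solve A x = b, np.linalg.solve is usually preferable to inv(A) @ b.
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(x)  # approximately [1. 2.]
```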

### 2.4 Matrix multiplication

To multiply matrices you can use one of the following built-in methods:

*   np.matmul (or @ as shorthand),
*   np.dot.

See also: 

https://numpy.org/doc/stable/reference/generated/numpy.matmul.html 

https://numpy.org/doc/stable/reference/generated/numpy.dot.html



In [68]:
A = np.arange(1,7,1).reshape((2,3))
A

array([[1, 2, 3],
       [4, 5, 6]])

In [69]:
B = np.arange(12).reshape((3,4))
B

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [70]:
A @ B

array([[ 32,  38,  44,  50],
       [ 68,  83,  98, 113]])

In [71]:
np.matmul(A,B)

array([[ 32,  38,  44,  50],
       [ 68,  83,  98, 113]])

In [72]:
np.dot(A,B)

array([[ 32,  38,  44,  50],
       [ 68,  83,  98, 113]])
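A frequent source of mistakes is the `*` operator, which multiplies arrays element by element rather than as matrices. The sketch below contrasts the two on small hypothetical matrices `P` and `Q`:

```python
import numpy as np

P = np.array([[1, 2],
              [3, 4]])
Q = np.array([[0, 1],
              [1, 0]])

# Element-wise product: each entry of P times the matching entry of Q.
elementwise = P * Q
print(elementwise)   # [[0 2]
                     #  [3 0]]

# Matrix product: rows of P combined with columns of Q.
matrix_product = P @ Q
print(matrix_product)  # [[2 1]
                       #  [4 3]]
```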

# **3. Import datasets**

This part of the notebook shows how to get datasets from:

*   seaborn,
*   scikit-learn,
*   R.

### 3.1 Import dataset from seaborn

You can check the list of available datasets in seaborn as follows

In [73]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

Then, we can load the chosen dataset as follows

In [74]:
titanic = sns.load_dataset('titanic')

and take a look at the data:

In [75]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Basic information about the types of variables (columns) can be obtained using `.info()`:


In [76]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
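The non-null counts reported by `.info()` already hint at missing values (for example in `age` and `deck`). To count them directly, `isna().sum()` can be used. The sketch below runs on a tiny hypothetical frame `toy` so that it works without downloading anything; the same call applies to `titanic`:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with some missing entries.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": ["C", None, None],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = toy.isna().sum()
print(missing)
```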


Basic statistics of the (here: numerical) variables can be obtained using the `.describe` method:

In [77]:
titanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


To get information about the other (non-numerical) variables you can proceed as follows:

In [78]:
titanic.describe(include=['category'])

Unnamed: 0,class,deck
count,891,203
unique,3,7
top,Third,C
freq,491,59


In [79]:
titanic.describe(include=['object'])

Unnamed: 0,sex,embarked,who,embark_town,alive
count,891,889,891,889,891
unique,2,3,3,3,2
top,male,S,man,Southampton,no
freq,577,644,537,644,549


In [80]:
titanic.describe(include=['bool'])

Unnamed: 0,adult_male,alone
count,891,891
unique,2,2
top,True,True
freq,537,537


### 3.2 Import dataset from scikit-learn

For more info please see https://scikit-learn.org/stable/datasets.html.

Notice that scikit-learn returns sample data as numpy arrays rather than a pandas data frame.

Let us start with the list of available datasets:

In [81]:
import sklearn.datasets

dir(sklearn.datasets)

['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__getattr__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_arff_parser',
 '_base',
 '_california_housing',
 '_covtype',
 '_kddcup99',
 '_lfw',
 '_olivetti_faces',
 '_openml',
 '_rcv1',
 '_samples_generator',
 '_species_distributions',
 '_svmlight_format_fast',
 '_svmlight_format_io',
 '_twenty_newsgroups',
 'clear_data_home',
 'dump_svmlight_file',
 'fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_file',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions',
 'get_data_home',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard

and choose, e.g., the Iris dataset:



In [82]:
from sklearn.datasets import load_iris

iris = load_iris()

Now we can take a look at the Iris data:

In [83]:
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

As you can see, we obtain a dictionary-like object (a scikit-learn `Bunch`).

We can get the list of keys from this dictionary as follows:

In [84]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

We can read a short description (DESCR) of this dataset:

In [85]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

We can check names of variables (feature_names) associated with appropriate columns:

In [86]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

as well as names (target_names) of labels:

In [87]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

We can also check their (data and target) shapes:

In [88]:
iris.target.shape

(150,)

In [89]:
iris.data.shape

(150, 4)

You can also create a DataFrame from this data as follows:

In [90]:
iris_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


If you prefer to use the full name for the target variable, you can use the following code

In [91]:
# Replace the numeric labels 0, 1, 2 with the corresponding species names.
iris_df['target'] = iris_df['target'].map({0:"setosa", 1:"versicolor", 2:"virginica"})
iris_df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
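As an alternative to assembling the frame by hand, `load_iris` accepts `as_frame=True` and then exposes a ready-made DataFrame under the `frame` attribute. This is a sketch of that route, not the approach used in the lecture; the names `iris_bunch`, `iris_frame` and the added `species` column are illustrative:

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects.
iris_bunch = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer 'target' column.
iris_frame = iris_bunch.frame
print(iris_frame.shape)  # (150, 5)

# Look up the species name for each integer label.
labels = iris_frame["target"].to_numpy()
iris_frame["species"] = iris_bunch.target_names[labels]
print(iris_frame.head())
```

Here the target column keeps its integer dtype, whereas stacking features and labels with `np.c_` converts the labels to floats.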


### 3.3 Import datasets from R

You can access all R's sample data sets by copying the URLs from this R data set repository:

https://vincentarelbundock.github.io/Rdatasets/datasets.html


For instance, to get the Iris dataset we use

In [92]:
iris_r = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv")
iris_r

Unnamed: 0,rownames,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,1,5.1,3.5,1.4,0.2,setosa
1,2,4.9,3.0,1.4,0.2,setosa
2,3,4.7,3.2,1.3,0.2,setosa
3,4,4.6,3.1,1.5,0.2,setosa
4,5,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...,...
145,146,6.7,3.0,5.2,2.3,virginica
146,147,6.3,2.5,5.0,1.9,virginica
147,148,6.5,3.0,5.2,2.0,virginica
148,149,6.2,3.4,5.4,2.3,virginica


Now take a look at basic information about the features:

In [93]:
iris_r.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   rownames      150 non-null    int64  
 1   Sepal.Length  150 non-null    float64
 2   Sepal.Width   150 non-null    float64
 3   Petal.Length  150 non-null    float64
 4   Petal.Width   150 non-null    float64
 5   Species       150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


Moreover, we can calculate basic statistics for this dataset:

In [94]:
iris_r.describe()

Unnamed: 0,rownames,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.057333,3.758,1.199333
std,43.445368,0.828066,0.435866,1.765298,0.762238
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [95]:
iris_r.describe(include=['object'])

Unnamed: 0,Species
count,150
unique,3
top,setosa
freq,50


In [96]:
iris_r.Species.value_counts()

setosa        50
versicolor    50
virginica     50
Name: Species, dtype: int64

After removing the column rownames we get (almost) the same dataframe as `iris_df`

In [97]:
iris_r.drop('rownames', axis=1, inplace=True)
iris_r

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


# 4. Linear Regression models and their regularization (Ridge, LASSO, ElasticNet).

We show how to train several regression models and use them for prediction: classical linear regression, Ridge regression, LASSO regression and the ElasticNet model.

To do so, we will use 3 datasets.

### 4.1 Simple artificial example

We start with creating some artificial data:


In [98]:
X = np.array([[1,1],
              [1,2],
              [2,2],
              [2,3],
              [3,3]])

# y = 3 * x_0 + 4 * x_1 + 5
y = np.dot(X, np.array([3, 4])) + 5

Now, we have to import LinearRegression from sklearn as follows:

In [99]:
from sklearn.linear_model import LinearRegression

Then, we train (fit) the LinearRegression model (called lr) to the data X,y:

In [100]:
lr = LinearRegression().fit(X,y)

and check coefficients of the model

In [101]:
lr.coef_

array([3., 4.])

In [102]:
lr.intercept_

5.0

To predict some new value of y (e.g. for x=[5, 7]) we can proceed as follows:


In [103]:
lr.predict(np.array([[5, 7]]))

array([48.])
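The same coefficients can be recovered with the NumPy tools from Section 2 by solving the least-squares normal equations, $\hat\beta = (X^TX)^{-1}X^Ty$, where a column of ones is added to `X` to account for the intercept. This is a sketch for illustration only; it is not a description of how scikit-learn computes the fit internally, and `X_design` and `beta` are names introduced here:

```python
import numpy as np

X = np.array([[1, 1],
              [1, 2],
              [2, 2],
              [2, 3],
              [3, 3]])
y = X @ np.array([3, 4]) + 5

# Prepend a column of ones so that the first coefficient is the intercept.
X_design = np.c_[np.ones(X.shape[0]), X]

# Normal equations: beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
print(beta)  # approximately [5. 3. 4.]

# Prediction for x = [5, 7], with a leading 1 for the intercept.
print(np.array([1, 5, 7]) @ beta)  # approximately 48.0
```

For larger or ill-conditioned problems, `np.linalg.lstsq` is generally a more robust choice than forming the inverse explicitly.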

### 4.2 More complicated artificial example

In this example we generate random data using built-in generators in numpy.

In [104]:
rng = np.random.default_rng(seed=123)

In [None]:
X = rng.normal(loc=100, scale=10, size=22)
X = X.reshape(-1,1)
# -1 makes reshape infer that dimension automatically
X

array([[ 90.1087865 ],
       [ 96.32213349],
       [112.87925261],
       [101.93974419],
       [109.202309  ],
       [105.77103791],
       [ 93.63536354],
       [105.4195222 ],
       [ 96.83404549],
       [ 96.77610884],
       [100.97167319],
       [ 84.74069593],
       [111.92166104],
       [ 93.28910325],
       [110.0026942 ],
       [101.36321124],
       [115.3203308 ],
       [ 93.40030586],
       [ 96.88205144],
       [103.37769127],
       [ 77.92528902],
       [108.27921442]])

In [106]:
y = rng.normal(loc=0, scale=1, size=22)
y = y.reshape(-1,1)
y

array([[ 1.54163039],
       [ 1.12680679],
       [ 0.75476964],
       [-0.14597789],
       [ 1.28190223],
       [ 1.07403062],
       [ 0.39262084],
       [ 0.00511431],
       [-0.36176687],
       [-1.2302322 ],
       [ 1.22622929],
       [-2.17204389],
       [-0.37014735],
       [ 0.16438007],
       [ 0.85988118],
       [ 1.76166124],
       [ 0.99332378],
       [-0.29152143],
       [ 0.72812756],
       [-1.26160032],
       [ 1.42993853],
       [-0.15647532]])

In [107]:
y += 3 * X  # equivalent to y = y + 3 * X
y

array([[271.86798988],
       [290.09320725],
       [339.39252748],
       [305.67325468],
       [328.88882922],
       [318.38714436],
       [281.29871145],
       [316.26368093],
       [290.14036959],
       [289.09809432],
       [304.14124885],
       [252.05004392],
       [335.39483578],
       [280.03168981],
       [330.86796377],
       [305.85129495],
       [346.95431616],
       [279.90939616],
       [291.37428186],
       [308.87147348],
       [235.20580558],
       [324.68116792]])

Now we create and train a basic linear regression model.

In [108]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X,y)

print(lr_model.coef_)
print(lr_model.intercept_)

[[3.01400433]]
[-1.07036225]


### 4.3 Real dataset example

In this example we will use a dataset containing sales prices of houses in the City of Windsor.

In [109]:
housing = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Housing.csv")
housing

Unnamed: 0,rownames,price,lotsize,bedrooms,bathrms,stories,driveway,recroom,fullbase,gashw,airco,garagepl,prefarea
0,1,42000,5850,3,1,2,yes,no,yes,no,no,1,no
1,2,38500,4000,2,1,1,yes,no,no,no,no,0,no
2,3,49500,3060,3,1,1,yes,no,no,no,no,0,no
3,4,60500,6650,3,1,2,yes,yes,no,no,no,0,no
4,5,61000,6360,2,1,1,yes,no,no,no,no,0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...
541,542,91500,4800,3,2,4,yes,yes,no,no,yes,0,no
542,543,94000,6000,3,2,4,yes,no,no,no,yes,0,no
543,544,103000,6000,3,2,4,yes,yes,no,no,yes,1,no
544,545,105000,6000,3,2,2,yes,yes,no,no,yes,1,no


We will use only the first few columns (those with numerical variables) of this dataset.

In [110]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   rownames  546 non-null    int64 
 1   price     546 non-null    int64 
 2   lotsize   546 non-null    int64 
 3   bedrooms  546 non-null    int64 
 4   bathrms   546 non-null    int64 
 5   stories   546 non-null    int64 
 6   driveway  546 non-null    object
 7   recroom   546 non-null    object
 8   fullbase  546 non-null    object
 9   gashw     546 non-null    object
 10  airco     546 non-null    object
 11  garagepl  546 non-null    int64 
 12  prefarea  546 non-null    object
dtypes: int64(7), object(6)
memory usage: 55.6+ KB


In [111]:
housing = housing.iloc[:,1:6]
housing.head()

Unnamed: 0,price,lotsize,bedrooms,bathrms,stories
0,42000,5850,3,1,2
1,38500,4000,2,1,1
2,49500,3060,3,1,1
3,60500,6650,3,1,2
4,61000,6360,2,1,1


In [112]:
housing.describe()

Unnamed: 0,price,lotsize,bedrooms,bathrms,stories
count,546.0,546.0,546.0,546.0,546.0
mean,68121.59707,5150.265568,2.965201,1.285714,1.807692
std,26702.670926,2168.158725,0.737388,0.502158,0.868203
min,25000.0,1650.0,1.0,1.0,1.0
25%,49125.0,3600.0,2.0,1.0,1.0
50%,62000.0,4600.0,3.0,1.0,2.0
75%,82000.0,6360.0,3.0,2.0,2.0
max,190000.0,16200.0,6.0,4.0,4.0


The first column (price) is our target variable y, while the other columns are our set X.

In [113]:
X = housing.iloc[:,1:]
X = X.values
X.shape

(546, 4)

In [114]:
y = housing.iloc[:,0]
y = y.values
y.shape

(546,)

We split the data into a training (learning) set and a test set in the proportion 80:20.

In [115]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
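Because `train_test_split` shuffles the data randomly, each run of the cell above yields a different split and therefore slightly different results. Passing `random_state` makes the split reproducible. A minimal sketch on hypothetical arrays `X_demo` and `y_demo` (the seed value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small made-up data: 10 samples, 2 features.
X_demo = np.arange(20).reshape((10, 2))
y_demo = np.arange(10)

# The same random_state gives the same split every time.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))  # True

# 80:20 split of 10 samples: 8 for training, 2 for testing.
print(split_a[0].shape, split_a[1].shape)  # (8, 2) (2, 2)
```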

Now it is time to create (and train on training set) the appropriate Regression models.

In [116]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

lr_classical = LinearRegression()
lr_ridge = Ridge()
lr_lasso = Lasso()
lr_elasticnet = ElasticNet()

lr_classical.fit(X_train, y_train)
lr_ridge.fit(X_train, y_train)
lr_lasso.fit(X_train, y_train)
lr_elasticnet.fit(X_train, y_train)

ElasticNet(alpha=1.0, l1_ratio=0.5, fit_intercept=True, precompute=False, max_iter=1000, copy_X=True, tol=0.0001, warm_start=False, positive=False, random_state=None)
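All regularized models above are created with the default penalty strength `alpha=1.0`. To get a feeling for what this parameter does, the following sketch fits Lasso and Ridge with a weak and a strong penalty on synthetic data in which only the first of three features influences the target. The data, the seed and the two `alpha` values are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
# Only the first feature matters; the other two are pure noise features.
y_syn = 4.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=200)

coefs = {}
for alpha in (0.01, 1.0):
    lasso = Lasso(alpha=alpha).fit(X_syn, y_syn)
    ridge = Ridge(alpha=alpha).fit(X_syn, y_syn)
    coefs[alpha] = (lasso.coef_, ridge.coef_)
    print(f"alpha={alpha}: lasso={lasso.coef_}, ridge={ridge.coef_}")
```

With the stronger penalty, Lasso typically sets the coefficients of the noise features exactly to zero and shrinks the relevant one, while Ridge shrinks coefficients without zeroing them.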


We predict the values of y on data the models have not seen during training, i.e. on the test set.

In [117]:
y_pred_classical = lr_classical.predict(X_test)
y_pred_ridge = lr_ridge.predict(X_test)
y_pred_lasso = lr_lasso.predict(X_test)
y_pred_elasticnet = lr_elasticnet.predict(X_test)

We compare the obtained predictions with the true values of y using the RMSE (Root Mean Squared Error) metric given by

$$ RMSE = \sqrt{\frac1n\sum_{i=1}^n (y_i-\hat{y}_i)^2},$$

where
$$ y_{true} = [y_1, y_2, \ldots, y_n]$$
and
$$ y_{pred} = [\hat{y}_1,\hat{y}_2,\ldots,\hat{y}_n].$$
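The formula translates directly into NumPy. The sketch below evaluates it on three made-up values (`y_true` and `y_hat` are hypothetical and unrelated to the housing data):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2; their squares are 1, 0 and 4; the mean is 5/3.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(rmse)  # approximately 1.291
```

Since the RMSE is expressed in the same units as y, it can be read directly as a typical prediction error.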

In [118]:
import numpy as np
from sklearn.metrics import mean_squared_error

# Recent scikit-learn releases no longer accept the `squared` argument of
# mean_squared_error, so we take the square root of the MSE ourselves.
rmse_classical = np.sqrt(mean_squared_error(y_test, y_pred_classical))
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
rmse_elasticnet = np.sqrt(mean_squared_error(y_test, y_pred_elasticnet))

print(rmse_classical)
print(rmse_ridge)
print(rmse_lasso)
print(rmse_elasticnet)