If possible, update scikit-learn to version 1.3.2 so that results are comparable across installations.

In [None]:
!pip3 install scikit-learn==1.3.2

In [77]:
import numpy as np
from scipy.linalg import solve
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.3.2.


## Regression - Polynomial features

In [34]:
from sklearn.datasets import fetch_california_housing
import os, ssl

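# Work around SSL certificate errors during the dataset download:
# if PYTHONHTTPSVERIFY is unset, fall back to an unverified HTTPS context.
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context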
    
california = fetch_california_housing()
print(california.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Creating the data matrix

In [35]:
D = california.data
y = california.target
n,d = D.shape
print(n,d)

20640 8


Creating a design matrix with polynomial features

In [78]:
# First scale the data
scaler = StandardScaler()
D_scaled = scaler.fit_transform(D)

# Compute design matrix
aff = PolynomialFeatures(2, include_bias=True)
X = aff.fit_transform(D_scaled)
aff.get_feature_names_out(california.feature_names)

array(['1', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup', 'Latitude', 'Longitude', 'MedInc^2', 'MedInc HouseAge',
       'MedInc AveRooms', 'MedInc AveBedrms', 'MedInc Population',
       'MedInc AveOccup', 'MedInc Latitude', 'MedInc Longitude',
       'HouseAge^2', 'HouseAge AveRooms', 'HouseAge AveBedrms',
       'HouseAge Population', 'HouseAge AveOccup', 'HouseAge Latitude',
       'HouseAge Longitude', 'AveRooms^2', 'AveRooms AveBedrms',
       'AveRooms Population', 'AveRooms AveOccup', 'AveRooms Latitude',
       'AveRooms Longitude', 'AveBedrms^2', 'AveBedrms Population',
       'AveBedrms AveOccup', 'AveBedrms Latitude', 'AveBedrms Longitude',
       'Population^2', 'Population AveOccup', 'Population Latitude',
       'Population Longitude', 'AveOccup^2', 'AveOccup Latitude',
       'AveOccup Longitude', 'Latitude^2', 'Latitude Longitude',
       'Longitude^2'], dtype=object)
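
The 45 columns listed above follow from a counting argument: the monomials of total degree at most 2 in 8 variables number C(8 + 2, 2) = 45 (1 bias, 8 linear, 8 squared, 28 pairwise products). The sketch below checks this on synthetic stand-in data; `D_demo` and `X_demo` are illustrative names, not part of the exercise.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d, degree = 8, 2  # eight input features, as in the housing data

# Synthetic stand-in data; only the number of columns matters here
rng = np.random.default_rng(0)
D_demo = rng.normal(size=(5, d))

X_demo = PolynomialFeatures(degree, include_bias=True).fit_transform(D_demo)

# Monomials of total degree <= degree in d variables: C(d + degree, degree)
n_features = comb(d + degree, degree)
print(X_demo.shape[1], n_features)
```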

In [73]:
from sklearn.linear_model import LinearRegression

# Linear regression
model = LinearRegression()
model.fit(X, y)

# Get regression parameters
coefficients = model.coef_

β_MedInc = coefficients[1]
β_MedIncAveBedrms = coefficients[12]
β_HouseAgeAveBedrms = coefficients[19]

β_MedInc, β_MedIncAveBedrms, β_HouseAgeAveBedrms

(0.922436888432675, -0.16758435804320024, 0.06328854538476116)

In [94]:
# Solve the normal equations X^T X β = X^T y directly and compare with the LinearRegression coefficients above
β = solve(X.T@X, X.T@y)
β[1], β[12], β[19] 

(0.9224368884313412, -0.16758435803836222, 0.06328854538394722)
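
The agreement between the two results is expected whenever $X^TX$ is invertible. As a minimal sketch on synthetic data (all names and values below are assumptions for illustration), the normal equations can also be checked against NumPy's least-squares solver, which avoids forming $X^TX$ and is usually the more numerically robust choice:

```python
import numpy as np
from scipy.linalg import solve

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y_demo = X_demo @ beta_true + 0.01 * rng.normal(size=200)

# Normal equations; X^T X is symmetric positive definite for full column rank
beta_normal = solve(X_demo.T @ X_demo, X_demo.T @ y_demo, assume_a="pos")

# Least-squares solver working on X directly
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```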

## Exercise b
Derive the closed-form solution of ridge regression for the following objective:
$$\min_{\beta}\frac{1}{n}\|y-X\beta\|^2 + \lambda \|\beta\|^2$$

1. Compute the gradient of each term with respect to $\beta$
$$
\begin{aligned}
\nabla_\beta\left(\frac{1}{n}\|y-X\beta\|^2\right)&=-\frac{2}{n}X^T(y-X\beta)\\
\nabla_\beta\left(\lambda \|\beta\|^2\right)&=2\lambda\beta
\end{aligned}
$$


2. Set the gradient to 0 and solve for $\beta$
$$
\begin{aligned}
-\frac{2}{n}X^T(y-X\beta) + 2\lambda\beta&=0 \\
X^T(y-X\beta)-n\lambda\beta&=0 \\
X^T(y-X\beta)&=n\lambda\beta \\
X^Ty-X^TX\beta&=n\lambda\beta \\
n\lambda\beta + X^TX\beta &= X^Ty\\
(X^TX+n\lambda I)\beta&=X^Ty \\
\beta&=(X^TX+n\lambda I)^{-1}X^Ty
\end{aligned}
$$
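
One way to sanity-check the derivation is to plug the closed-form $\beta$ back into the gradient and confirm that it vanishes. The sketch below does this on synthetic data; the sizes, the seed, and the names `X_demo`, `y_demo`, and `lam` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(0)
n, p = 100, 5  # n samples, p features
X_demo = rng.normal(size=(n, p))
y_demo = rng.normal(size=n)
lam = 0.1

# Closed-form ridge solution: (X^T X + n*lam*I) beta = X^T y
beta = solve(X_demo.T @ X_demo + n * lam * np.eye(p), X_demo.T @ y_demo)

# Gradient of (1/n)||y - X beta||^2 + lam*||beta||^2 evaluated at beta
grad = -2 / n * X_demo.T @ (y_demo - X_demo @ beta) + 2 * lam * beta
print(np.abs(grad).max())
```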

Report the ridge regression parameters obtained with $\lambda=0.1$ for the features `MedInc`, `MedInc AveBedrms`, and `HouseAge AveBedrms` (indices 1, 12, and 19).

In [100]:
λ = 0.1

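# Calculate matrices
XTX = X.T@X
# Note: XTX.shape[0] is the number of columns of X (45), so the penalty here is 45*λ*I;
# the n in the derivation is the number of samples, X.shape[0]
n_lambda_I = XTX.shape[0] * λ * np.eye(XTX.shape[0])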

# Solve for beta
β_ridge = solve(XTX + n_lambda_I, X.T@y)
β_ridge[1], β_ridge[12], β_ridge[19] 

(0.9236035762672617, -0.16440080116264447, 0.06550918679687188)
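
In the derived formula, $n$ is the number of rows of $X$. As a hedged cross-check on synthetic data, the closed form with $n\lambda$ should coincide with scikit-learn's `Ridge`, whose objective is $\|y-Xw\|^2 + \alpha\|w\|^2$, when `alpha` is set to $n\lambda$ and no separate intercept is fitted. The data and the names `n_samples`, `n_features`, and `lam` are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = rng.normal(size=n_samples)
lam = 0.1

# Closed form from the derivation: n is the number of samples
A = X_demo.T @ X_demo + n_samples * lam * np.eye(n_features)
beta_closed = solve(A, X_demo.T @ y_demo)

# Dividing Ridge's objective by n gives the exercise objective with lam = alpha / n
ridge = Ridge(alpha=n_samples * lam, fit_intercept=False).fit(X_demo, y_demo)

print(np.allclose(beta_closed, ridge.coef_))
```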

## Naive Bayes
From the 20Newsgroups dataset we fetch the documents belonging to three categories, which we use as classes.

In [None]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'talk.politics.guns',
              'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

For example, the first document in the training data is the following one:

In [None]:
print(train.data[0])

The classes are indicated categorically with indices from zero to two by the target vector. The target names tell us which index belongs to which class.

In [None]:
y_train = train.target
y_train

In [None]:
train.target_names

We represent the documents in a binary bag-of-words format. That is, we create a data matrix ``D`` such that ``D[j,i]=1`` if the j-th document contains the i-th feature (word), and ``D[j,i]=0`` otherwise.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Raw string avoids invalid-escape warnings; the pattern matches runs of letters only
vectorizer = CountVectorizer(stop_words="english", min_df=5, token_pattern=r"[^\W\d_]+", binary=True)
D = vectorizer.fit_transform(train.data)
D_test = vectorizer.transform(test.data)

The following array contains the vocabulary; the position of a word in the array is its feature index.

In [None]:
vectorizer.get_feature_names_out()

For example, the word `naive` has the index 4044.

In [None]:
np.where(vectorizer.get_feature_names_out() == 'naive')[0]

## Decision Tree

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
D, y = iris.data, iris.target

In [None]:
print(iris.DESCR)

## SVM

In [None]:
# Standard scientific Python imports
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split

In [None]:
digits = datasets.load_digits()

_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Label %i" % label)

In [None]:
# flatten the images
n = len(digits.images)
D = digits.images.reshape((n, -1))
y = digits.target

# Split data into 70% train and 30% test subsets
D_train, D_test, y_train, y_test = train_test_split(
    D, y, test_size=0.3, shuffle=False
)
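
With `shuffle=False`, the split is deterministic: the first 70% of the rows become the training set and the last 30% the test set, in their original order. The sketch below illustrates the flattening and the split on ten hypothetical 2x2 "images"; the array contents are arbitrary stand-ins for the digits data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical 2x2 "images" with labels 0..9
images = np.arange(10 * 2 * 2).reshape(10, 2, 2)
labels = np.arange(10)

# Flatten each image into a row vector of length 4
flat = images.reshape((len(images), -1))

f_train, f_test, l_train, l_test = train_test_split(
    flat, labels, test_size=0.3, shuffle=False
)
print(flat.shape, l_train, l_test)
```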