# Session 2

Look at the following API references, as well as the code above, to solve the problems below:

- https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- https://scikit-learn.org/stable/modules/classes.html

---------------------------------

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import sklearn
%matplotlib inline

# Problems - California Housing dataset

In [2]:
from sklearn.datasets import fetch_california_housing

# Load the data object
data_object = fetch_california_housing()

# Print the data description
print(data_object.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [3]:
# Create a DataFrame with features
df = pd.DataFrame(data_object.data, columns=data_object.feature_names)

# Target values as a vector y
y = data_object.target

# We use a subset of the data - the first 100 rows
#df = df.head(100)
#y = y[:100]

df.head() # Print the first 5 rows

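|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |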


In [4]:
from sklearn.model_selection import train_test_split

# Split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.5, random_state=42)

## Problem 1: Replace the linear model with Lasso and Ridge

- change the regularization parameter `alpha` and see how coefficients become smaller using Ridge
- change the regularization parameter `alpha` and see how sparsity changes in the coefficients using Lasso

In [5]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Replace LinearRegression below with Lasso and Ridge in turn
# Then change the "alpha" hyperparameter
predictor = LinearRegression()
predictor.fit(X_train, y_train)

# Coefficients of the fitted model
pd.Series(predictor.coef_, index=df.columns, name="Coefficients").to_frame()

|   | Coefficients |
|---|--------------|
| MedInc | 0.4420325 |
| HouseAge | 0.009647486 |
| AveRooms | -0.1194035 |
| AveBedrms | 0.7720128 |
| Population | -3.566076e-07 |
| AveOccup | -0.003023989 |
| Latitude | -0.4233425 |
| Longitude | -0.4374872 |
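As a starting point for Problem 1, here is a minimal sketch of how `alpha` affects the two models. It uses a synthetic dataset from `make_regression` as a stand-in for `X_train` and `y_train` (an assumption, so that the block runs on its own), and the names `X_demo`, `y_demo`, `ridge_norms` and `lasso_zero_counts` are made up for this illustration. Ridge should shrink the coefficient vector as `alpha` grows, while Lasso should set more coefficients exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in for X_train / y_train (assumption: 8 features, 3 informative)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

alphas = [0.01, 1.0, 100.0]
ridge_norms = []        # Euclidean norm of the Ridge coefficients per alpha
lasso_zero_counts = []  # number of Lasso coefficients that are exactly zero

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_zero_counts.append(int(np.sum(lasso.coef_ == 0)))
    print(f"alpha={alpha:>6}: ridge norm={ridge_norms[-1]:.2f}, "
          f"lasso zeros={lasso_zero_counts[-1]}")
```

On the housing data the features have very different scales, so the useful range of `alpha` will likely differ from the one used here.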


In [6]:
# Try to recreate this plot, using our data:
# https://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_path.html
# PS: Just copy the code over and change it.
# You will need to change the limits of "alpha"
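One possible shape for such a plot is sketched below. It is not the code from the linked example: it fits `Ridge` for a grid of `alpha` values on a synthetic stand-in dataset (an assumption, as are the names `X_demo`, `y_demo` and `coefs`) and plots one line per coefficient. With the housing data you would pass `X_train` and `y_train` instead and adjust the `alpha` limits.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# Logarithmically spaced regularization strengths; the limits are a guess
alphas = np.logspace(-2, 6, 50)

# One row of coefficients per alpha
coefs = np.array([Ridge(alpha=a).fit(X_demo, y_demo).coef_ for a in alphas])

fig, ax = plt.subplots()
ax.plot(alphas, coefs)  # one line per feature
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("coefficient value")
ax.set_title("Ridge coefficients as a function of alpha")
```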

## Problem 2: Investigate ways to deal with missing data

Below I remove rows with missing data. Instead, try filling in the missing values with the column mean or median, or use an imputer:

Imputers: https://scikit-learn.org/stable/modules/impute.html#imputation-of-missing-values
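The mean and median variants can be done with pandas alone. The sketch below uses a tiny hand-made frame (the values and the names `X_demo`, `X_mean` and `X_median` are invented for illustration) and `DataFrame.fillna`, which accepts a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

# Tiny hand-made example with missing entries (values are invented)
X_demo = pd.DataFrame({
    "MedInc": [9.0, np.nan, 4.0, 2.0],
    "HouseAge": [40.0, 20.0, np.nan, np.nan],
})

# mean() and median() skip NaN by default and return one value per column
X_mean = X_demo.fillna(X_demo.mean())
X_median = X_demo.fillna(X_demo.median())

print(X_mean)
print(X_median)
```

When applying this to a train/test split, the fill values should be computed on the training data only and then reused for the test data.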

## Test performance when training data has no missing values

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Model performance with no missing data
predictor = LinearRegression()
predictor.fit(X_train, y_train)

# Predict and check the Root Mean Squared Error (RMSE) on the test data
y_pred = predictor.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 0.7286276383085546


## Randomly set training data as missing

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

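# Seed random number generation for reproducible results
np.random.seed(12)

# Dirty the data: set about 50 % of the entries as missing
missing_mask = np.random.uniform(size=X_train.shape) > 0.5
# mask() returns a new DataFrame with NaN wherever the mask is True,
# which avoids writing through .values (unreliable in newer pandas versions)
X_train = X_train.mask(missing_mask)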

# As seen below, some values are now missing (NaN)
X_train.head()

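|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 5967 | 3.8929 | NaN | 4.974293 | NaN | 1126.0 | NaN | NaN | -117.81 |
| 17744 | NaN | 15.0 | 6.932722 | NaN | NaN | NaN | 37.29 | NaN |
| 952 | NaN | 15.0 | NaN | 1.058763 | NaN | 2.560825 | 37.71 | -121.94 |
| 9361 | 8.3935 | NaN | 7.004792 | NaN | 1549.0 | NaN | NaN | -122.53 |
| 11024 | 4.8542 | NaN | NaN | NaN | NaN | NaN | 33.79 | -117.83 |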


## RMSE when rows with missing data are removed from training data

In [9]:
# Model performance when missing data is removed
predictor = LinearRegression()

has_null_mask = X_train.isnull().any(axis=1).values
predictor.fit(X_train[~has_null_mask], y_train[~has_null_mask])

# Predict and check the Root Mean Squared Error (RMSE) on the test data
y_pred = predictor.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 1.3096244391963165


## Test performance when using imputation

In [10]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# =======================================================
# CHANGE THE IMPUTER BELOW AND TRY DIFFERENT STRATEGIES
# HOW LOW CAN YOU GET THE RMSE?
# =======================================================
imputer = SimpleImputer(strategy='most_frequent')


# Fit the imputer on the training data
X_new = imputer.fit_transform(X_train)

# Train model on data
predictor = LinearRegression()
predictor.fit(X_new, y_train)

# Transform the test data with the imputer trained on the training data
X_test_new = imputer.transform(X_test)
y_pred = predictor.predict(X_test_new)

# This RMSE is higher than with the complete training data, due to the missing values and imputation
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

RMSE: 0.8212618989998314
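To compare several strategies in one go, a loop over imputers is one option. The sketch below is self-contained and therefore uses synthetic data with roughly 30 % of the training entries removed (both assumptions, as are the names `X_tr`, `X_te` and `rmse_by_imputer`); the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Remove roughly 30 % of the training entries at random
rng = np.random.default_rng(12)
X_tr = X_tr.copy()
X_tr[rng.uniform(size=X_tr.shape) > 0.7] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}

rmse_by_imputer = {}
for name, imputer in imputers.items():
    # Fit the imputer on the training data only, then reuse it on the test data
    X_tr_filled = imputer.fit_transform(X_tr)
    model = LinearRegression().fit(X_tr_filled, y_tr)
    y_hat = model.predict(imputer.transform(X_te))
    rmse_by_imputer[name] = float(np.sqrt(mean_squared_error(y_te, y_hat)))
    print(f"{name:>6}: RMSE = {rmse_by_imputer[name]:.3f}")
```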


## Problem 3: Use logistic regression to predict breast cancer (classification task)

Here you are mostly on your own.
Try to:
- Investigate the data.
- Create some plots.
- Split the data into test and train. Maybe split the training data into a validation set too.
- Try using logistic regression and feature engineering.

In [11]:
from sklearn.datasets import load_breast_cancer

# Load the data object
data_object = load_breast_cancer()

# Print the data description
print(data_object.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [12]:
# Create a data frame with features, and a target y
df = pd.DataFrame(data_object.data, columns=data_object.feature_names)
y = data_object.target

In [13]:
from sklearn.model_selection import train_test_split

X = df.to_numpy()
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [14]:
# Train a logistic regression model below
from sklearn.linear_model import LogisticRegression
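A minimal sketch of one way to fill in this cell follows. It reloads the data so that it runs on its own, and it scales the features with `StandardScaler` before the logistic regression; the scaling step, `max_iter=1000`, and the names `X_tr`, `X_te` and `clf` are choices made for this illustration, not part of the exercise text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)

# Scaling helps the solver converge, since the features have very different ranges
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

train_accuracy = clf.score(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```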

In [15]:
# Plot the ROC curve:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve
# Feel free to compare the training set ROC curve vs. the test set ROC curve

from sklearn.metrics import roc_curve
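The sketch below shows one way to draw the training and test ROC curves side by side. `roc_curve` takes the true labels and a score per sample, here the predicted probability of class 1 from `predict_proba`. The model is the same scaled logistic regression assumed earlier, refitted so the block is self-contained; all variable names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# Scores: predicted probability of the positive class (label 1)
fpr_train, tpr_train, _ = roc_curve(y_tr, clf.predict_proba(X_tr)[:, 1])
fpr_test, tpr_test, _ = roc_curve(y_te, clf.predict_proba(X_te)[:, 1])

fig, ax = plt.subplots()
ax.plot(fpr_train, tpr_train, label="train")
ax.plot(fpr_test, tpr_test, label="test")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")  # diagonal reference
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```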

In [16]:
# Compute the ROC AUC
from sklearn.metrics import roc_auc_score
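A short sketch of the AUC computation, again with the scaled logistic regression assumed for illustration and refitted here so the block runs alone. `roc_auc_score` expects scores rather than hard class predictions, so the probability of class 1 is passed in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

# Use probabilities, not clf.predict(), so the ranking of samples is preserved
auc_train = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"train AUC: {auc_train:.3f}, test AUC: {auc_test:.3f}")
```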

## Problem 4: Use a tree model to predict breast cancer

- Try different models, e.g. `sklearn.tree.DecisionTreeClassifier`, `sklearn.ensemble.GradientBoostingClassifier`

In [17]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data object
data_object = load_breast_cancer()

# Create a data frame with features, and a target y
df = pd.DataFrame(data_object.data, columns=data_object.feature_names)
y = data_object.target

X = df.to_numpy()
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)
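One way to use the validation split is to pick a hyperparameter on it and leave the test set untouched until the end. The sketch below tries a few values of `max_depth` for a decision tree and fits a gradient boosting model with default settings for comparison. The depth grid and the names `val_accuracy_by_depth` and `gb_val_accuracy` are assumptions for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same splits as in the cell above, repeated so this block runs on its own
X_all, y_all = load_breast_cancer(return_X_y=True)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Validation accuracy of a decision tree for a few depths (None = unlimited)
val_accuracy_by_depth = {}
for depth in (1, 3, 5, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    val_accuracy_by_depth[depth] = tree.score(X_validation, y_validation)
    print(f"max_depth={depth}: validation accuracy = {val_accuracy_by_depth[depth]:.3f}")

# Gradient boosting with default hyperparameters, for comparison
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
gb_val_accuracy = gb.score(X_validation, y_validation)
print(f"gradient boosting: validation accuracy = {gb_val_accuracy:.3f}")
```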

## Problem 5: Anything goes - breast cancer

Use whatever models, feature engineering, etc. to predict breast cancer.



In [18]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data object
data_object = load_breast_cancer()

# Create a data frame with features, and a target y
df = pd.DataFrame(data_object.data, columns=data_object.feature_names)
y = data_object.target

X = df.to_numpy()
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)
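As one possible direction, cross-validation can replace the single validation split when tuning. The sketch below runs `GridSearchCV` over the regularization strength `C` of a scaled logistic regression; the model, the grid of `C` values and the 5-fold setting are assumptions chosen for illustration, and any other estimator could be dropped into the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_breast_cancer(return_X_y=True)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)  # cross-validation happens inside the non-test data

print("best parameters:", search.best_params_)
print(f"cross-validated ROC AUC: {search.best_score_:.3f}")

# Evaluate once on the held-out test set with the refitted best model
test_accuracy = search.score(X_test, y_test)
```

With `scoring="roc_auc"`, `search.score` also reports ROC AUC, so `test_accuracy` here is an AUC value despite its name being generic.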

## Problem 6: Anything goes - your own data

Get some data from any source. Train a model.

Some data sources:
- Wikipedia tables
- `sklearn.datasets`
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php