# Welcome to Week 5 - Data splitting & cross-validation

This week we will talk about how to split datasets into training and test data for model building and testing. We will also implement some cross-validation techniques.

Before we get started, a little background on random number generators.

### Section 1: Random number generation

Python's built-in random module can generate random numbers in several ways:

1. "random.random()" for a single float number in [0,1),

2. "random.randint(lowerBound, upperBound)" for an integer within an interval (both bounds included),

3. "random.randrange(lowerBound, upperBound, step)" for further constraining the range (e.g. randrange(0,11,2) randomly generates an even number within [0,11)),

4. "random.uniform(a,b)" for a random float number between a and b. Here a doesn't have to be smaller than b, whereas methods 2 and 3 raise an error (typically a ValueError) if lowerBound is greater than upperBound.
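The four calls listed above can be tried side by side. This is a minimal sketch using only Python's standard library; the seed value and the variable names are arbitrary choices made for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only so that the run is repeatable

x = random.random()             # float in [0.0, 1.0)
n = random.randint(1, 6)        # integer in [1, 6], both ends included
e = random.randrange(0, 11, 2)  # one of 0, 2, 4, 6, 8, 10
u = random.uniform(5, 1)        # bounds may be given in either order

print(x, n, e, u)

# randint (and randrange with a positive step) reject an empty range.
try:
    random.randint(6, 1)
    empty_range_error = False
except ValueError as err:
    empty_range_error = True
    print("ValueError:", err)
```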

**DEMO 1.1**

Here we introduce the random generator from NumPy as follows, which generates an array of length 3 with numbers between 0 and 5 (not including 5):

In [102]:
import numpy as np

# Draws 3 values at random (with replacement) from np.arange(5), i.e. from 0, 1, 2, 3, 4.
np.random.choice(5,3)
# This is equivalent to np.random.randint(0,5,3)

array([4, 1, 4])

This generation is random. Another run will give you a different array.

In [103]:
np.random.choice(5,3)

array([4, 0, 0])

But what if we're testing a specific method and we want to make sure that we "randomly" generate the **same** numbers again?

We can achieve that by first setting a "seed". 

In [104]:
np.random.seed(3)

# We choose seed "3" here but you can choose any number.
# You can think of it as a passcode - the same seed will produce the same results when you rerun this code later.

np.random.choice(5, 3)

array([2, 0, 1])

In [105]:
np.random.seed(4)

np.random.choice(5,3)

array([2, 1, 0])

In [106]:
np.random.seed(3)
np.random.choice(5, 3)

array([2, 0, 1])

Notice how the two arrays generated with seed "3" are the same. Even if you close the notebook and re-run this code later, they will stay the same. But the arrays in the beginning without the seed will change every time. This is because the seed fixes the starting state of the generator, and every subsequent number is computed deterministically from that state. Nothing is stored: the same "passcode" simply reproduces the same stream of numbers. In other words, the random samples generated from a seed are reproducible.

Given that we still call these random numbers, this is quite counterintuitive. We want **random**, but we want **predictable random**. The latter is especially useful if we want to test a piece of code. If we want to find our mistakes or just test the rationale, we don't want to keep re-running the code in the hope that the one particular instance that caused an error occurs again.

In general, to make our sampling appear to be random, as we theorise it should be, we use a computer to generate a random number, or an array of random numbers, depending on the application. Computers do this by making use of **pseudo-random number generators (PRNGs)**. These generators start from a particular state, the (random) seed state, and repeatedly apply a deterministic algorithm to that state to obtain the next results in the sequence. Many such algorithms exist. The legacy np.random functions used above, for example, are based on the very popular Mersenne Twister PRNG (it is also used in Excel, R, MATLAB, and so on).

Now you see why, **when we set the seed, we obtain the same results. We are applying the same function to the same starting point, so all the subsequent results are identical. If we don't set the seed, the generator is initialised from an unpredictable source (such as entropy provided by the operating system) when it is first used, so every new session starts from a different state and produces a different sequence.**
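The effect of seeding can be checked directly. The sketch below first repeats the np.random.seed approach from the demo, and then shows the same idea with NumPy's newer Generator interface (np.random.default_rng). The Generator part is not used elsewhere in this notebook and is included only as an alternative; the variable names are illustrative.

```python
import numpy as np

# Legacy global generator: re-seeding restarts the same stream.
np.random.seed(3)
first = np.random.choice(5, 3)
np.random.seed(3)
second = np.random.choice(5, 3)
print(first, second)

# Newer Generator API: two generators built from the same seed
# start in the same state and therefore produce the same numbers.
rng_a = np.random.default_rng(3)
rng_b = np.random.default_rng(3)
a = rng_a.integers(0, 5, size=3)
b = rng_b.integers(0, 5, size=3)
print(a, b)
```

Note that the two interfaces use different algorithms, so the same seed is not expected to give the same numbers across them.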


### Section 2: Training and test data / Data splitting

Supervised learning relies on training a model on a set of labelled training data, and then testing the performance of the model on held-out test data whose labels are hidden from the model. In order to do that, we generally split our data randomly into two parts: training and test.

Remember our linear regression from week 3? Let's try to recreate that model but this time, we will train the model on a part of the data and then test it on the other part.

In [107]:
import sklearn.datasets as ds

df = ds.fetch_california_housing(as_frame=True)

print(df.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [108]:
# First, split your data into the independent variables (X) and the dependent variable (y)

X = df['data']
y = df['target']

We will now split the data using the train_test_split function from sklearn. Check the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and try implementing it below. 

Note the random_state parameter - this refers to the random seed which you can set so that your results become reproducible, just like we did in section 1.

The other important parameter is test_size (or, alternatively, train_size). With either of those you can decide the size of the split.

Common split options are 50/50, or 70 training/30 test.

In [109]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
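As a quick sanity check of how test_size and random_state behave, here is a minimal sketch on a small toy array (the X_toy and y_toy names and sizes are made up for illustration and are not part of the housing data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 columns, with the row index as the target.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# test_size=0.25 keeps a quarter of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state gives exactly the same split again.
_, X_te_again, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(np.array_equal(y_te, y_te_again))
```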

Now that we've split our data, we will train a linear regression on the TRAIN part of it.

Implement this below using the statsmodels OLS function, just like we did in week 3. Make sure to train on JUST the training part of the data this time. You can print a model summary if you like, using the summary() function that statsmodels provides on the fitted results.

In [110]:
from statsmodels.regression.linear_model import OLS

LR = OLS(y_train, X_train)

results = LR.fit()
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:            MedHouseVal   R-squared (uncentered):                   0.892
Model:                            OLS   Adj. R-squared (uncentered):              0.892
Method:                 Least Squares   F-statistic:                          1.066e+04
Date:                Wed, 18 Oct 2023   Prob (F-statistic):                        0.00
Time:                        13:00:33   Log-Likelihood:                         -12048.
No. Observations:               10320   AIC:                                  2.411e+04
Df Residuals:                   10312   BIC:                                  2.417e+04
Df Model:                           8                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

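Now we want to make out-of-sample predictions, that is, we want to predict y_test from X_test using a model fitted on the training data.

You can do that by calling the predict() function of your fitted model with X_test as the argument (for the statsmodels fit above, that is results.predict(X_test)). Note that the cell below instead refits the regression with scikit-learn's LinearRegression, which adds an intercept by default, so its numbers are not expected to match the statsmodels fit above, which was estimated without a constant.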

In [111]:
from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(X_train, y_train)
predictions = LR.predict(X_test)

Time to check our model performance!

When evaluating a model, we can look at two types of performance. In the previous lectures, we looked at the in-sample performance (on the training data). That means we measured how well the model was able to predict y given X, and how far away the predictions were. 

Now, we can also measure the out-of-sample performance (on the test data). This measures how well the fitted model handles unseen data.

A good model will do well on both, but the out-of-sample performance is especially important as it tells us how generalisable the model is to new data. Out-of-sample performance that is much worse than the in-sample performance is indicative of an overfit model.

Sklearn has some very useful functions for measuring the performance of your model under sklearn.metrics. You can find a list of them [here](https://scikit-learn.org/stable/modules/model_evaluation.html). 

The ones for regression include some that you will remember from the lecture:

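- RMSE / MSE: mean_squared_error (depending on your scikit-learn version, a squared parameter lets you turn the rooting on/off, or a separate root_mean_squared_error function is available)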
- MAE: mean_absolute_error

Calculate them below for your model. You will want to look at your predicted values of y compared to your y_test, as that's our out-of-sample performance.

In [112]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

print(mean_squared_error(y_test, predictions))
print(mean_absolute_error(y_test, predictions))

0.5308982353071016
0.531997848211222
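To see what these metrics actually compute, the sketch below works them out by hand with NumPy on four made-up values and compares the result with the sklearn functions. RMSE is taken here as the square root of the MSE, which avoids relying on any version-specific parameter.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up values, chosen so the arithmetic is easy to follow.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

errors = y_true - y_pred        # [-0.5, 0.0, 1.0, -1.0]
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # root of the MSE, in the units of y
mae = np.mean(np.abs(errors))   # mean of the absolute errors

print(mse, rmse, mae)  # 0.5625 0.75 0.625

# The sklearn functions give the same numbers.
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```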


Curious about your in-sample performance? Then you have to rerun the predict() function from earlier on X_train, to give us predictions for those values of X. If you have some extra time in the session today, why not make a comparison of the in-sample and out-of-sample performance of your model?

In [113]:
predictions = LR.predict(X_train)

print(mean_squared_error(y_train, predictions))
print(mean_absolute_error(y_train, predictions))

0.5204367375968284
0.5281781945051116


By default, the sklearn training/test splitting function shuffles and splits the data purely at random. It only **keeps the same proportion of each class** that is present in the original dataset if you pass the class labels through the stratify parameter. This will become important in the next computer lab when we talk about over/under sampling.
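The role of the stratify parameter can be illustrated with a small sketch on made-up labels (70 samples of class 0 and 30 of class 1; all names here are illustrative). With stratify, the 70/30 ratio is carried over into the test half; without it, the ratio depends on the random draw.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70% class 0, 30% class 1.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# Plain random split: class proportions are not controlled.
_, _, _, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)

# Stratified split: class proportions follow y_toy.
_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

print("share of class 1, plain split:     ", y_te_plain.mean())
print("share of class 1, stratified split:", y_te_strat.mean())
```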

### Section 3: Cross-validation

We will now have another look at the training/test splitting function, but this time we will look at a classification case. And in this context we will also talk about k-fold cross validation.

Let's generate some random data this time to show you some more neat functions of sklearn.

Besides offering us some real datasets which we've already used, such as the housing data above, sklearn also gives us the option to generate **random artificial data** specifically for model building and testing. 

You can read more about this functionality [here](https://scikit-learn.org/stable/datasets/sample_generators.html).

This time we will use the make_classification function to generate some random data for a logistic regression problem. Check the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) to read more about the parameters with which you can modify the dataset to suit your test case.

In [114]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=(0.7, 0.3),
    class_sep=0.99,
    random_state=14,
)

print(X.shape, y.shape)
print(X[0], y[0])

(1000, 10) (1000,)
[-0.98235375  0.98107319  1.52302769 -0.79409039 -1.77362528 -2.42985663
  0.85952872  0.46672469 -3.27137384 -2.18460667] 0


Let's now also create a simple logistic regression model for this data.

In [115]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')

We will now make use of the sklearn cross-validation function. There's some great documentation about how sklearn does its cross-validation [here](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) which I recommend you have a look at.

The bottom line is this:
- a part of the data is usually held back for final testing (for example with train_test_split); cross_val_score does not do this for you, it cross-validates whatever data you pass to it
- the remaining training data is then split into k parts (folds)
- in each round, a fresh copy of the model is trained on k-1 folds, with the remaining fold being used for interim evaluation of the model
- the process is repeated until each fold has been used once for this evaluation
- cross_val_score returns the k interim scores, which are usually summarised by their mean
- a model refitted on all of the training data can then be used for a final run on the held-back test data
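The k-fold part of this procedure can be written out as an explicit loop. The sketch below is an illustration on a small generated dataset (X_demo, y_demo and the other names are made up here), assuming stratified folds without shuffling, which is what sklearn uses for a classifier when cv is given as an integer. It is meant to show the mechanics rather than replace cross_val_score.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
model = LogisticRegression()

# Split the data into 5 folds that preserve the class proportions.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every round
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Accuracy on the one fold that was left out of training.
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library function runs the same loop internally.
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(manual_scores, manual_scores.mean())
print(library_scores, library_scores.mean())
```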

**TASK**

Implement below the cross_val_score function from sklearn.model_selection on the classifier model defined above. You can choose how many folds (k) you want to use, but higher numbers will be computationally more expensive so I suggest a value under 20.

In [116]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(classifier, X, y, cv=5)
print(scores)

[0.985 0.995 0.985 0.995 0.99 ]


If everything goes well, the function should return **as many scores as you have folds**, i.e., k scores. By default the estimator's own score method is used, which is accuracy for classifiers, but different metrics can be specified through the scoring parameter.

If **multiple metrics** are required, apply the function cross_validate(). It basically does the same as the cross_val_score() function, but gives back a whole dict of values.

You can read more about it in the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate).
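As a minimal sketch of the multiple-metric case, the example below asks cross_validate for both accuracy and F1 on a small generated dataset (the dataset and variable names are illustrative). Each requested metric appears in the returned dict under a key prefixed with "test_".

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

# Passing a list of scorer names returns one array of k scores per metric.
results = cross_validate(
    LogisticRegression(), X_demo, y_demo, cv=5, scoring=["accuracy", "f1"]
)

print(sorted(results.keys()))
print(results["test_accuracy"].mean(), results["test_f1"].mean())
```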

**TASK**

Implement the function below for the same model above. Note that the parameter return_train_score is set to False by default. If you want to report in-sample (training) evaluation metrics you can set it to True. Generally, we are not very interested in those but there might be situations in which you want to check how they compare to the test performance.

In [117]:
from sklearn.model_selection import cross_validate
from pprint import pprint

scores = cross_validate(classifier, X, y, cv=5, return_train_score=True)
pprint(scores)

{'fit_time': array([0.00099874, 0.00099921, 0.00159264, 0.00099993, 0.00100017]),
 'score_time': array([0.0010016, 0.       , 0.       , 0.       , 0.       ]),
 'test_score': array([0.985, 0.995, 0.985, 0.995, 0.99 ]),
 'train_score': array([0.99375, 0.99   , 0.99125, 0.98875, 0.9925 ])}


### Additional reading / DEMO: Pipelines

It is generally recommended that any preprocessing and data transformations are fitted on the training data only, which is then used to build the model.

Afterwards, the same fitted pre-processing steps should be applied to the test data, without refitting them. This ensures that there is no spillover from any of the transformations between the different data parts, which would mean that information from the test data might leak into the model training.
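Done by hand, this recommendation looks roughly as follows. The sketch uses StandardScaler on made-up data (the names and numbers are illustrative): the scaler learns its mean and standard deviation from the training part only, and the test part is transformed with those same values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with a non-zero mean and non-unit spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))

X_tr, X_te = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # fit on the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training mean and std

# The training part is exactly standardised; the test part only roughly so,
# because it was scaled with statistics it did not contribute to.
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled.mean(axis=0), X_te_scaled.std(axis=0))
```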

An easy way to implement this is through pipelines.

We use the function "make_pipeline" from sklearn to set up a pipeline of transforms with a final estimator. See the documentation [here](https://scikit-learn.org/stable/modules/compose.html#combining-estimators).

A pipeline sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to **assemble several steps** that can be cross-validated together **while** setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__'.
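A small sketch of this naming scheme: make_pipeline names each step after its class in lower case, and a parameter of a step is then addressed as step name, double underscore, parameter name. The particular parameter values below are arbitrary and chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Step names are generated automatically from the class names.
print(list(pipe.named_steps))  # ['standardscaler', 'logisticregression']

# Set parameters of individual steps via <step name>__<parameter name>.
pipe.set_params(logisticregression__C=0.1, standardscaler__with_mean=False)

print(pipe.named_steps["logisticregression"].C)
print(pipe.named_steps["standardscaler"].with_mean)
```

The same double-underscore names are what you would use as keys when tuning pipeline parameters inside a cross-validated search.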

In [118]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

metrics = ["accuracy"]

# Build a pipeline that standardises the features and then fits the classifier.
# Inside cross_validate, the scaler is refitted on the training folds only.
pipeline = make_pipeline(StandardScaler(), classifier)

outcomes = cross_validate(
    pipeline, X, y, scoring=metrics, cv=10, return_train_score=True
)
for metric, values in outcomes.items():
    print(f"{metric} value: {values}")

fit_time value: [0.00200033 0.0025568  0.00199938 0.00200057 0.00099993 0.00200081
 0.00200057 0.00200248 0.00199914 0.00199866]
score_time value: [0.         0.         0.         0.00100017 0.0010004  0.
 0.00100064 0.00100088 0.         0.00099874]
test_accuracy value: [1.   0.96 1.   0.99 0.98 0.99 1.   1.   0.98 1.  ]
train_accuracy value: [0.99222222 0.99555556 0.99111111 0.99222222 0.99222222 0.99222222
 0.99111111 0.99222222 0.99333333 0.99111111]
