ML regression is a technique used to predict continuous numerical values based on input data. 

It is widely used in fields such as finance, healthcare, and weather forecasting.

In this tutorial, we will cover the basics of regression, including the different types of regression algorithms and the steps involved in building a regression model. 

Types of regression algorithms 

Popular ones are...

1. Linear regression
Linear regression models the relationship between the input variables (X) and the output variable (Y) by fitting a linear equation to the data. It assumes a linear relationship between the variables and chooses the coefficients that minimize the error between the predicted and actual values.
y = ax + b

y is the dependent (output) variable
x is the independent (input) variable
a is the slope of the line of best fit
b is the intercept
a and b are known as the regression coefficients

2. Decision Tree regression
Decision tree regression builds a tree-like model of decision rules. It recursively splits the data based on conditions on the input features and predicts, for each leaf node, the average of the target values of the training samples that fall into that leaf.

3. Random Forest regression
Random forest regression is an ensemble learning method that combines multiple decision tree regressors. It builds a forest of decision trees and averages their predictions to obtain the final result.
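As a rough sketch of the three algorithms above, the snippet below fits each one on a tiny made-up dataset (the numbers are hypothetical and not taken from the cars data used later). It reads off the slope and intercept of the linear model, shows that a tree leaf predicts the mean of its training targets, and checks that a forest prediction is the average of its trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: two clearly separated groups of x values
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([5.0, 6.0, 7.0, 20.0, 22.0, 24.0])

# 1. Linear regression: learns the slope a and intercept b of y = ax + b
linear = LinearRegression().fit(X, y)
a, b = linear.coef_[0], linear.intercept_

# 2. Decision tree: max_depth=1 allows a single split, so there are two leaves,
#    and each prediction is the mean target of the leaf a sample falls into
tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
left_pred = tree.predict([[2.0]])[0]    # mean of 5, 6, 7
right_pred = tree.predict([[11.0]])[0]  # mean of 20, 22, 24

# 3. Random forest: the prediction is the average over the individual trees
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
per_tree = np.stack([t.predict(X) for t in forest.estimators_])
forest_pred = forest.predict(X)

print(a, b, left_pred, right_pred)
```

The depth limit on the tree is only there to keep the example readable; real trees are usually allowed to grow deeper.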

Steps to build a regression model
1. Data preparation
Gather and prepare the data for regression. Ensure that you have a target variable Y and independent (input) variables X. If necessary, clean the data, handle missing values, and encode categorical variables.

2. Splitting the data
Next, split the dataset into two parts: a training set and a test set. A common split is 70-80% for the training set and 20-30% for the test set.

3. Feature scaling (optional)
Depending on the regression algorithm you choose, you may need to perform feature scaling. Feature scaling ensures that all input variables are on a similar scale, preventing some variables from dominating the others. Common scaling techniques include standardization (subtracting the mean and dividing by the standard deviation) and normalization (scaling values to a range between 0 and 1).

4. Model training
Train the regression model using the training data. A linear regression model estimates the coefficients that minimize the difference between the predicted and actual values, while a decision tree regression model recursively splits the data based on conditions.

5. Model evaluation
Once the model is trained, it is important to evaluate its performance using the test data. Common evaluation metrics for regression include mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared (the coefficient of determination).

6. Model Fine-tuning (optional)
If the model's performance is not satisfactory, you can fine-tune it by adjusting hyperparameters. Hyperparameters are settings that control the learning process, such as the maximum depth of a decision tree or the number of trees in a random forest. Use techniques such as grid search or random search, combined with cross-validation, to find good hyperparameter values.

7. Model deployment 
Once you are satisfied with model performance, deploy it and make predictions on new, unseen data. 
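The two scaling techniques named in step 3 can be written out directly with NumPy. This is a minimal sketch on made-up values, meant only to show the formulas; in practice a library scaler fitted on the training set would normally be used.

```python
import numpy as np

# Hypothetical feature values
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean and divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# Normalization (min-max scaling): rescale to the range [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(normalized)
```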

In [1]:
import pandas as pd

df = pd.read_csv('cars.csv')
df

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,VHL12546,Abuja,Honda,Accord Coupe EX V-6,2011,Silver,2.20,Nigerian Used,
1,VHL18827,Ibadan,Hyundai,Sonata,2012,Silver,3.50,Nigerian Used,125000
2,VHL19499,Lagos,Lexus,RX 350,2010,Red,9.20,Foreign Used,110852
3,VHL17991,Abuja,Mercedes-Benz,GLE-Class,2017,Blue,22.80,Foreign Used,30000
4,VHL12170,Ibadan,Toyota,Highlander,2002,Red,2.60,Nigerian Used,125206
...,...,...,...,...,...,...,...,...,...
7200,VHL14329,Abuja,Honda,Civic,2018,Gray,5.70,Foreign Used,65000
7201,VHL10637,Abuja,BMW,X3,2007,White,4.00,Nigerian Used,200000
7202,VHL19734,Abuja,Toyota,RAV4 2.5 Limited 4x4,2010,Black,2.85,Nigerian Used,
7203,VHL15569,Lagos,Mercedes-Benz,GLK-Class 350,2012,Black,8.65,Foreign Used,85750


In [2]:
#cleaning the data
df['Year']=df.Year.str.replace(',','')
df['Distance']=df.Distance.str.replace(',','')

In [3]:
df.head()

Unnamed: 0,VehicleID,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,VHL12546,Abuja,Honda,Accord Coupe EX V-6,2011,Silver,2.2,Nigerian Used,
1,VHL18827,Ibadan,Hyundai,Sonata,2012,Silver,3.5,Nigerian Used,125000.0
2,VHL19499,Lagos,Lexus,RX 350,2010,Red,9.2,Foreign Used,110852.0
3,VHL17991,Abuja,Mercedes-Benz,GLE-Class,2017,Blue,22.8,Foreign Used,30000.0
4,VHL12170,Ibadan,Toyota,Highlander,2002,Red,2.6,Nigerian Used,125206.0


In [4]:
df.describe()

Unnamed: 0,Amount (Million Naira)
count,7188.0
mean,11.847999
std,25.318922
min,0.45
25%,3.5
50%,5.65
75%,11.6625
max,456.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   object 
 2   Maker                   7205 non-null   object 
 3   Model                   7205 non-null   object 
 4   Year                    7184 non-null   object 
 5   Colour                  7205 non-null   object 
 6   Amount (Million Naira)  7188 non-null   float64
 7   Type                    7008 non-null   object 
 8   Distance                4845 non-null   object 
dtypes: float64(1), object(8)
memory usage: 506.7+ KB


In [6]:
df.isna().sum()

VehicleID                    0
Location                     0
Maker                        0
Model                        0
Year                        21
Colour                       0
Amount (Million Naira)      17
Type                       197
Distance                  2360
dtype: int64

In [7]:
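#Year is still stored as strings here, so convert it to numbers before taking the median
df['Year'] = df['Year'].fillna(pd.to_numeric(df['Year']).median())
df['Amount (Million Naira)'] = df['Amount (Million Naira)'].fillna(df['Amount (Million Naira)'].median())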

In [8]:
df['Type'] = df['Type'].ffill() #forward fill: reuse the previous row's value

In [9]:
df['Distance'] = df['Distance'].fillna(pd.to_numeric(df['Distance']).median()) #Distance is still stored as strings here

In [10]:
df.isna().sum()

VehicleID                 0
Location                  0
Maker                     0
Model                     0
Year                      0
Colour                    0
Amount (Million Naira)    0
Type                      0
Distance                  0
dtype: int64

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   object 
 2   Maker                   7205 non-null   object 
 3   Model                   7205 non-null   object 
 4   Year                    7205 non-null   object 
 5   Colour                  7205 non-null   object 
 6   Amount (Million Naira)  7205 non-null   float64
 7   Type                    7205 non-null   object 
 8   Distance                7205 non-null   object 
dtypes: float64(1), object(8)
memory usage: 506.7+ KB


In [12]:
df.describe()

Unnamed: 0,Amount (Million Naira)
count,7205.0
mean,11.833375
std,25.290819
min,0.45
25%,3.5
50%,5.65
75%,11.5
max,456.0


In [13]:
df['Distance']=df.Distance.astype(float)
df['Year'] = df.Year.astype(int)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7205 entries, 0 to 7204
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   VehicleID               7205 non-null   object 
 1   Location                7205 non-null   object 
 2   Maker                   7205 non-null   object 
 3   Model                   7205 non-null   object 
 4   Year                    7205 non-null   int64  
 5   Colour                  7205 non-null   object 
 6   Amount (Million Naira)  7205 non-null   float64
 7   Type                    7205 non-null   object 
 8   Distance                7205 non-null   float64
dtypes: float64(2), int64(1), object(6)
memory usage: 506.7+ KB


In [16]:
df.drop(['VehicleID'],axis=1,inplace=True)

In [17]:
df.head()

Unnamed: 0,Location,Maker,Model,Year,Colour,Amount (Million Naira),Type,Distance
0,Abuja,Honda,Accord Coupe EX V-6,2011,Silver,2.2,Nigerian Used,80830.0
1,Ibadan,Hyundai,Sonata,2012,Silver,3.5,Nigerian Used,125000.0
2,Lagos,Lexus,RX 350,2010,Red,9.2,Foreign Used,110852.0
3,Abuja,Mercedes-Benz,GLE-Class,2017,Blue,22.8,Foreign Used,30000.0
4,Ibadan,Toyota,Highlander,2002,Red,2.6,Nigerian Used,125206.0


In [18]:
categ_col = df.select_dtypes(include=['object','category']).columns

In [19]:
categ_col

Index(['Location', 'Maker', 'Model', 'Colour', 'Type'], dtype='object')

In [20]:
df.Location.unique()

array(['Abuja', 'Ibadan', 'Lagos'], dtype=object)

In [21]:
df.Maker.unique()

array(['Honda', 'Hyundai', 'Lexus', 'Mercedes-Benz', 'Toyota', 'Acura',
       'Dodge', 'Nissan', 'Kia', 'BMW', 'Volvo', 'Ford', 'Land Rover',
       'Lincoln', 'Peugeot', 'Chevrolet', 'Audi', 'Jaguar', 'Infiniti',
       'Porsche', 'Fiat', 'Maserati', 'Volkswagen', 'Suzuki', 'Bentley',
       'GAC', 'Mazda', 'Scion', 'Renault', 'Mitsubishi', 'Mini',
       'Pontiac', 'Cadillac', 'Ferrari', 'Jeep', 'Buick', 'Rolls-Royce',
       'GMC', 'Chrysler', 'Lamborghini', 'Citroen', 'King', 'BAW',
       'Saturn', 'Tata', 'Opel', 'JAC', 'MG', 'Hummer', 'Subaru', 'Rover',
       'Saab', 'Skoda', 'IVM', 'Brabus'], dtype=object)

In [22]:
df.Model.unique()

array(['Accord Coupe EX V-6', 'Sonata', 'RX 350', ..., 'Almera 1.6 Lux',
       'X5 3.0i Sports Activity', '320i SV Premium'], dtype=object)

In [23]:
df.Colour.unique()

array(['Silver', 'Red', 'Blue', 'Black', 'Gold', 'White', 'Gray',
       'Burgandy', 'Green', 'Violet', 'Brown', 'Yellow', 'Orange', 'Pink',
       'Beige', 'Purple', 'Ivory', 'G', 'Teal', 'Mica', 'Pearl'],
      dtype=object)

In [24]:
df.Type.unique()

array(['Nigerian Used', 'Foreign Used', 'Brand New'], dtype=object)

Encoding
Label encoding and one-hot encoding are both popular methods for converting categorical variables into numerical ones.

1. Label Encoding
Each unique category in a categorical variable is assigned a numerical label, usually in ascending order starting from 0 or 1.
It is suitable when the categorical variable has an inherent order or ranking. For example, low, medium, and high can be encoded as 0, 1, and 2 respectively.
However, if the categories have no meaningful order, label encoding can mislead the model, which may interpret the numerical labels as an ordering.
Label encoding does not increase dimensionality, and it is easy to interpret because each label corresponds to a specific category. Choose it when there is an inherent order or ranking among the categories.

2. One Hot Encoding
Each unique category in a categorical variable is converted into a new binary column (dummy variable).
Each binary column represents a single category and has a value of 1 if the category is present and 0 otherwise.
It is suitable when there is no inherent order or ranking among the categories and each category is treated independently. It prevents the model from assuming any ordinal relationship between the categories.
One-hot encoding increases the dimensionality of the data because it introduces a new binary column for each category.
It is often preferred for algorithms that expect numerical inputs when there is no inherent order among the categories.
One-hot encoding gives a more explicit representation of each category, but it can produce complex and redundant representations, especially when there are many categories.
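To make the difference concrete, here is a small sketch of both encodings on a made-up column. The column name `size` and the mapping `order` are illustrative assumptions, not part of the cars dataset.

```python
import pandas as pd

# Hypothetical toy column with an inherent order
toy = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# Label (ordinal) encoding with an explicit, meaningful order
order = {"low": 0, "medium": 1, "high": 2}
toy["size_label"] = toy["size"].map(order)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(toy["size"], prefix="size", dtype=int)

print(toy)
print(one_hot)
```

The label-encoded version stays a single column, while the one-hot version adds one column per category, which is the dimensionality trade-off described above.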

In [26]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

#one hot encoding (dtype=int gives 0/1 columns instead of booleans)
new_df = pd.get_dummies(df, dtype=int)
new_df

Unnamed: 0,Year,Amount (Million Naira),Distance,Location_Abuja,Location_Ibadan,Location_Lagos,Maker_Acura,Maker_Audi,Maker_BAW,Maker_BMW,...,Colour_Purple,Colour_Red,Colour_Silver,Colour_Teal,Colour_Violet,Colour_White,Colour_Yellow,Type_Brand New,Type_Foreign Used,Type_Nigerian Used
0,2011,2.20,80830.0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1,2012,3.50,125000.0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
2,2010,9.20,110852.0,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,2017,22.80,30000.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,2002,2.60,125206.0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7200,2018,5.70,65000.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
7201,2007,4.00,200000.0,1,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,1
7202,2010,2.85,80830.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7203,2012,8.65,85750.0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [27]:
col = list(new_df.columns)
col

['Year',
 'Amount (Million Naira)',
 'Distance',
 'Location_Abuja',
 'Location_Ibadan',
 'Location_Lagos',
 'Maker_Acura',
 'Maker_Audi',
 'Maker_BAW',
 'Maker_BMW',
 'Maker_Bentley',
 'Maker_Brabus',
 'Maker_Buick',
 'Maker_Cadillac',
 'Maker_Chevrolet',
 'Maker_Chrysler',
 'Maker_Citroen',
 'Maker_Dodge',
 'Maker_Ferrari',
 'Maker_Fiat',
 'Maker_Ford',
 'Maker_GAC',
 'Maker_GMC',
 'Maker_Honda',
 'Maker_Hummer',
 'Maker_Hyundai',
 'Maker_IVM',
 'Maker_Infiniti',
 'Maker_JAC',
 'Maker_Jaguar',
 'Maker_Jeep',
 'Maker_Kia',
 'Maker_King',
 'Maker_Lamborghini',
 'Maker_Land Rover',
 'Maker_Lexus',
 'Maker_Lincoln',
 'Maker_MG',
 'Maker_Maserati',
 'Maker_Mazda',
 'Maker_Mercedes-Benz',
 'Maker_Mini',
 'Maker_Mitsubishi',
 'Maker_Nissan',
 'Maker_Opel',
 'Maker_Peugeot',
 'Maker_Pontiac',
 'Maker_Porsche',
 'Maker_Renault',
 'Maker_Rolls-Royce',
 'Maker_Rover',
 'Maker_Saab',
 'Maker_Saturn',
 'Maker_Scion',
 'Maker_Skoda',
 'Maker_Subaru',
 'Maker_Suzuki',
 'Maker_Tata',
 'Maker_Toyota',

In [28]:
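#select the features and the target
#note: every categorical column is dropped, so only Year and Distance are used as features (the one-hot encoded new_df is not used)
x = df.drop(['Amount (Million Naira)','Location','Model','Maker','Type','Colour'],axis=1)
y = df['Amount (Million Naira)']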

In [29]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [30]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((5764, 2), (1441, 2), (5764,), (1441,))

In [31]:
#Model training

In [32]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.tree import DecisionTreeRegressor

from sklearn.linear_model import LinearRegression

from sklearn.linear_model import Lasso

from sklearn.linear_model import Ridge

from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

In [33]:
#put models in a dictionary

models = {"DecisionTree Regressor":DecisionTreeRegressor(),"Random Forest":RandomForestRegressor(),"Ridge":Ridge(alpha=0.5),"Lasso":Lasso(alpha=0.5)}


In [34]:
import numpy as np

#set up a function to fit and score each model
def fit_and_score(models,x_train, x_test, y_train, y_test):
    np.random.seed(42) #set a random seed for reproducibility
    model_scores = {}
    for name,model in models.items():
        model.fit(x_train,y_train) #fit the model to the training data
        model_scores[name] = model.score(x_test,y_test) #store the R^2 score on the test data
    return model_scores

In [36]:
scores = fit_and_score(models,x_train, x_test, y_train, y_test)
scores

{'DecisionTree Regressor': 0.18237786103904485,
 'Random Forest': 0.29340746143136354,
 'Ridge': 0.18419356754759242,
 'Lasso': 0.18425530262972578}

In [37]:
model_compare = pd.DataFrame(scores,index=['R2_Score'])
model_compare.T.plot.bar()

In [None]:
#feature scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
#Min-max scaling rescales each feature to a fixed range, usually between 0 and 1.
#It preserves the relative ordering of the values but is sensitive to outliers, because the minimum and maximum define the range.

In [None]:
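x_train_scaled = scaler.fit_transform(x_train) #fit the scaler on the training data only
x_test_scaled = scaler.transform(x_test) #reuse the training minimum and maximum to avoid data leakage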

In [None]:
model2 = DecisionTreeRegressor()
model2.fit(x_train_scaled,y_train)
model2.score(x_test_scaled,y_test)
y_preds = model2.predict(x_test_scaled)
y_preds

Model Evaluation

Regression model evaluation metrics

R^2 (coefficient of determination) - compares your model's predictions to the mean of the targets.
Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0.
If your model perfectly predicts the targets, its R^2 value would be 1.

Mean absolute error (MAE) - the average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were on average, in the units of the target.

Mean squared error (MSE) - the average of the squared differences between predictions and actual values. Squaring makes every error positive and amplifies large errors, so samples with large errors (outliers) dominate the metric.
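The definitions above can be checked by computing each metric by hand and comparing it with scikit-learn. The arrays `y_true` and `y_pred` are made-up values used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred
mae_manual = np.mean(np.abs(errors))      # average absolute error
mse_manual = np.mean(errors ** 2)         # average squared error
rmse_manual = np.sqrt(mse_manual)         # back in the units of the target
# R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# A model that always predicts the mean of the targets scores R^2 = 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)

print(mae_manual, mse_manual, rmse_manual, r2_manual, r2_baseline)
```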

In [None]:
#model evaluation

mse = mean_squared_error(y_test,y_preds)
mse

In [None]:
np.sqrt(mse)

In [None]:
mae = mean_absolute_error(y_test,y_preds)
mae

In [None]:
df2 = pd.DataFrame(data={"actual values":y_test,"predictions":y_preds})
df2

The predictions can be improved by tuning the model's hyperparameters during training.

In [None]:
model2.feature_importances_

In [None]:
#creating a function for visualising the feature importances
import matplotlib.pyplot as plt
import seaborn as sns

def plot_importances(columns, importances, n=10):
    #build a table of features sorted from most to least important
    df3 = (pd.DataFrame({'features': columns, 'feature_importances': importances})
           .sort_values('feature_importances', ascending=False)
           .reset_index(drop=True))
    fig, ax = plt.subplots(figsize=(10,5))
    sns.barplot(x='feature_importances', y='features', data=df3[:n], orient='h', ax=ax) #plot the top n features
    ax.set_ylabel('features')
    ax.set_xlabel('feature_importances')

In [None]:
plot_importances(x_train.columns,model2.feature_importances_)

Penalization Methods

Regularization is a method used to simplify complex models by penalising large coefficients. Reducing the magnitude of the coefficients reduces the variance of the model and, in turn, overfitting.

Regularization shrinks the coefficients towards zero by adding a complexity term to the loss function, so that models with higher complexity incur a bigger loss.

Two common regularized regression techniques are Ridge regression and Lasso regression.

Ridge Regression
Also known as L2 regularisation, this technique uses a penalty term to shrink the magnitude of the coefficients towards zero without eliminating them.
The shrinkage prevents overfitting caused by model complexity or collinearity.
It adds the squared magnitude of the coefficients to the loss function as the penalty term.
If the error is defined as the sum of squared residuals, adding the L2 regularization term gives the following loss:
loss = sum_i (y_i - yhat_i)^2 + lambda * sum_j (w_j)^2
Here yhat_i is the prediction for sample i, w_j are the model coefficients, and lambda >= 0 controls the strength of the penalty.
As lambda increases, the penalty increases and the coefficients shrink further. If lambda is zero, the penalty vanishes and the loss reduces to the ordinary least-squares loss.
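The shrinkage effect can be illustrated with a short sketch on synthetic data (the data and the list of penalty strengths are assumptions chosen for the example). Note that scikit-learn calls the penalty strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients 4, -3 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.1, size=100)

# Fit ridge models with increasing penalty strength and record the
# overall size (Euclidean norm) of the coefficient vector
norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

print(norms)           # the norm gets smaller as alpha grows
print(model.coef_)     # small for the largest alpha, but not exactly zero
```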

In [None]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=0.5)
ridge_reg.fit(x_train,y_train)

Feature selection and LASSO regression
Lasso regression is also known as L1 regularisation.
Some datasets are high dimensional, with a very large number of features, some of which do not contribute to predicting the response variable.
As a result, training a model becomes more computationally expensive, and the irrelevant features can introduce noise that causes the model to perform poorly. The process of selecting the features that contribute most to a high-performing model is known as feature selection.

LASSO stands for Least Absolute Shrinkage and Selection Operator.

Lasso regression reduces overfitting by penalising the coefficients so that some of them shrink to exactly zero. It thereby performs feature selection indirectly, keeping only the subset of variables that is relevant for minimizing the prediction error.

With L1 regularisation, the sum of the absolute values of the coefficients is added to the loss function as the penalty term.

L1 regularisation results in simpler, sparse models that are easier to interpret.

Although lasso regression helps prevent overfitting, one major limitation is that it does not consider other factors when eliminating predictors. For example, it may arbitrarily eliminate one variable from a correlated pair, which might not be a good rationale from a human perspective.
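A minimal sketch of the sparsity effect, using synthetic data in which only the first two of five features influence the target. The data and the value `alpha=0.5` are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features contribute to the target
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# The coefficients of the three irrelevant features are set to zero,
# while the two relevant ones are kept (shrunk towards zero)
print(lasso.coef_)
```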



In [None]:
from sklearn.linear_model import Lasso, LinearRegression

lasso_reg = Lasso(alpha=0.001)
lasso_reg.fit(x_train,y_train)

#assumption: linear_model is a plain LinearRegression fitted on the same data (it is not defined earlier in the notebook)
linear_model = LinearRegression()
linear_model.fit(x_train,y_train)

#comparing the effects of regularisation
def get_weights_df(model,feat,col_name):
    #this function returns the weight of every feature
    weights = pd.Series(model.coef_,feat.columns).sort_values()
    weights_df = pd.DataFrame(weights).reset_index()
    weights_df.columns = ['Features',col_name]
    weights_df[col_name] = weights_df[col_name].round(3)
    return weights_df

linear_model_weights = get_weights_df(linear_model, x_train,'Linear_Model_Weight')
ridge_weights_df = get_weights_df(ridge_reg,x_train,'Ridge_Weight')
lasso_weights_df = get_weights_df(lasso_reg,x_train,'Lasso_Weight')
#merge the three tables so the weights can be compared side by side
final_weights = pd.merge(linear_model_weights,ridge_weights_df,on='Features')
final_weights = pd.merge(final_weights,lasso_weights_df,on='Features')


Elastic Net Regression

This is a combination of the L1 and L2 penalties from lasso and ridge regression. The method arose from the need to overcome the limitations of lasso regression. It regularizes and performs feature selection simultaneously by first finding the coefficients as in ridge regression and then applying a lasso-type shrinkage.
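In scikit-learn, the two penalties are combined in `ElasticNet`, where `alpha` sets the overall penalty strength and `l1_ratio` sets the mix (1.0 is pure L1, 0.0 is pure L2). The sketch below uses synthetic data, and the parameter values are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features contribute to the target
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Equal mix of L1 and L2 penalties
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(enet.coef_)           # one weight per feature
print(enet.score(X, y))     # R^2 on the training data
```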

Non-Linear Regression Methods and other Recommendations

Model tuning and choosing parameters

ML models are parameterized, so we have to search for the combination of settings that gives the best performance. The settings that define the model architecture and control the learning process are referred to as hyperparameters, and the process of exploring a range of values for them is called hyperparameter tuning.

It is important to note the distinction between model parameters and hyperparameters.
Model parameters are learnt from the data during the training phase, whereas hyperparameters are set before training and are not learnt from the data.

Ideally, hyperparameter tuning yields the best hyperparameter values for the model. Grid search and random search are two common strategies for tuning hyperparameters.

Grid Search
Grid search explores every combination in a grid of hyperparameter values: for each combination a model is built and evaluated, and the combination that gives the best result is selected. Although it is computationally expensive, a grid search is easy to set up.
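A grid search can be sketched with scikit-learn's `GridSearchCV`. The synthetic data and the grid values below are assumptions for illustration; with 3 values of `max_depth` and 2 of `min_samples_leaf`, 6 combinations are evaluated.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# Every combination of these values is tried (3 x 2 = 6 candidates)
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,           # each candidate is scored with 5-fold cross-validation
    scoring="r2",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)
```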

Random Search

As opposed to grid search, random search samples random combinations of hyperparameter values from the grid to build and evaluate models. It does not try every combination sequentially as grid search does; instead, it allows a quick exploration of the whole search space to reach good values.
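The same idea with scikit-learn's `RandomizedSearchCV`, again on synthetic data with illustrative values. The grid below contains 36 combinations, but only `n_iter=5` randomly chosen ones are evaluated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

# 3 x 4 x 3 = 36 possible combinations
param_distributions = {
    "n_estimators": [10, 20, 40],
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,         # only 5 random combinations are tried
    cv=3,
    random_state=0,   # makes the sampled combinations reproducible
)
random_search.fit(X, y)

print(random_search.best_params_)
```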

Data splitting
This involves setting aside a portion of the dataset (the hold-out or out-of-sample set) for testing, while the rest is used to fit the model. Evaluating the model on the held-out data gives an unbiased estimate of its performance.

The split proportion is a matter of choice and sometimes depends on the size of the dataset.

However, a common practice is to split the dataset into training, validation (or dev), and test sets, where the validation set is used to tune the hyperparameters and select the best values for the model.
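One way to obtain the three sets is to call `train_test_split` twice. The 60/20/20 proportions and the placeholder arrays below are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then take 25% of the remaining 80% as the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```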

Resampling involves repeatedly drawing samples from the original dataset and using them to obtain more information about the model. Each round produces one sample for training and another for evaluation. Cross-validation is a resampling method used to estimate how well a model generalises and to detect overfitting.
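As a sketch of k-fold cross-validation, the snippet below splits synthetic data into 5 folds; each fold is used once for evaluation while the other 4 are used for training. The data and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Shuffle the rows once, then split them into 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 score per fold; the spread shows how stable the model is
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(scores)
print(scores.mean())
```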