Module 2: Linear Regression
Welcome to module 2 of the introductory course to data for good, where we will explore linear regression, the first machine learning algorithm of this course!
By the end of this module, you should feel comfortable with the fundamentals of linear regression. The specific topics covered are:
- How to split the data between training and test data
- Using training data to train a linear regression model
- Analyzing the results of the model
- Checking the assumptions of linear regression
- Building a multivariate regressor
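The first three steps in the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic, and the helper names (`train_test_split`, `fit_line`, `r_squared`) are hypothetical and written here for the example. In practice a library would usually provide equivalent helpers.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as test data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_line(rows):
    """Ordinary least squares with one feature: returns (intercept, slope)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


def r_squared(rows, intercept, slope):
    """Share of the variance in Y explained by the fitted line."""
    ys = [y for _, y in rows]
    y_bar = mean(ys)
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Synthetic data (an assumption for this sketch): Y = 3 + 2 * X + noise
rng = random.Random(42)
data = [(x, 3.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(data)           # 1. split the data
intercept, slope = fit_line(train)             # 2. train on the training rows only
test_r2 = r_squared(test, intercept, slope)    # 3. analyze results on unseen rows

print(f"intercept={intercept:.2f}, slope={slope:.2f}, test R^2={test_r2:.3f}")
```

The model is scored on rows it never saw during training, which is the reason for splitting the data in the first place.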
Linear regression is a parametric model that predicts a continuous outcome feature (Y) from one or more explanatory features (X). With a single explanatory feature, the model is:
Y = beta_0 + beta_1 * X
beta_0 is called the intercept term, and represents the expected mean value of Y when all explanatory features equal 0.
beta_1 is called a beta coefficient, and represents the expected change in the value of Y that results from a one-unit increase in X.
This model fits a straight line to your data, where the value of the outcome feature can be calculated as a linear combination of the explanatory features. Sounds relatively simple? Afraid not: there are many nuances and conditions that need to be understood before using linear regression! We are going to delve into these assumptions and conditions and then demonstrate how to use this algorithm on the Kiva dataset.
Linear regression is one member of a family of linear parametric models. Some additional advanced topics we recommend looking up are described below.
Logistic regression
Logistic regression is very similar to linear regression but has a categorical outcome instead. Rather than modeling a continuous dependent variable, it models a binary outcome: yes or no, true or false, 1 or 0. It is still a linear model because it assumes a linear relationship between the independent variables and the log-odds of the outcome (the logit link function).
To learn more about logistic regression, try the following resources:
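The sketch below illustrates what "linear in the link function" means for a single feature. The coefficient values are hypothetical and not fitted to any data. The linear part beta_0 + beta_1 * X is passed through the sigmoid function to obtain a probability, and taking the log-odds of that probability recovers the linear part.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """The logit link: log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.5, 0.8

for x in [0.0, 1.0, 2.0, 3.0]:
    linear_part = beta_0 + beta_1 * x
    probability = sigmoid(linear_part)
    # A common convention (an assumption here) is to predict 1 at 0.5 or above
    predicted_class = 1 if probability >= 0.5 else 0
    print(
        f"x={x}: linear part={linear_part:.2f}, "
        f"P(Y=1)={probability:.3f}, "
        f"log-odds={log_odds(probability):.2f}, class={predicted_class}"
    )
```

The probability curve is S-shaped in X, but the log-odds column matches the linear part exactly, which is why logistic regression still counts as a linear model.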
- Beginners guide to Logistic Regression: A good overview of the theory and mathematics behind the algorithm
- Logistic Regression in Python: A thorough tutorial on a publicly available dataset in Python
Ridge and Lasso regression
Both linear and logistic regression have a tendency to overfit when there are a large number of features. It is therefore important to choose the features that have the most predictive power. But how do we choose these features? We can use our EDA to a certain extent, but that only goes so far.
This is where ridge and lasso regularization techniques come into play! Both techniques penalize large coefficients, which reduces overfitting. Lasso can shrink some coefficients to exactly zero, so it can be used to identify which features should be kept in the model, while ridge shrinks coefficients toward zero without removing any feature entirely.
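As a minimal sketch of how the two penalties behave, consider the simplified special case discussed in An Introduction to Statistical Learning, Section 6.2: the number of observations equals the number of features, the feature matrix is the identity, and there is no intercept. In that case each least squares coefficient is shrunk independently, and the formulas below apply. The coefficient values and the penalty strength `lam` are hypothetical, and real datasets need an iterative solver instead of these closed forms.

```python
def ridge_shrink(ols_coef, lam):
    """Ridge in the special case: scale the coefficient by 1 / (1 + lam)."""
    return ols_coef / (1.0 + lam)


def lasso_shrink(ols_coef, lam):
    """Lasso in the special case: soft-threshold the coefficient at lam / 2."""
    threshold = lam / 2.0
    if ols_coef > threshold:
        return ols_coef - threshold
    if ols_coef < -threshold:
        return ols_coef + threshold
    return 0.0


# Hypothetical least squares coefficients and penalty strength
ols_coefs = [3.0, -2.0, 0.4, -0.1]
lam = 1.0

ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]
lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]

print("ridge:", ridge_coefs)  # [1.5, -1.0, 0.2, -0.05]
print("lasso:", lasso_coefs)  # [2.5, -1.5, 0.0, 0.0]
```

Ridge shrinks every coefficient proportionally and keeps all of them nonzero, while lasso sets the two small coefficients to exactly zero, which is what makes it usable for feature selection.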
To learn more about ridge and lasso regression and general regularization techniques, we recommend the following resources:
- Complete tutorial on ridge and lasso regression in Python: A broad tutorial explaining why we use regularization techniques, touching on the mathematics behind the algorithms and giving a few examples in Python.
- An Introduction to Statistical Learning, Chapter 6.2: A comprehensive explanation of both Lasso and Ridge and their application in the context of statistical learning.