

Module 2: Linear Regression


Welcome to module 2 of the introductory course to data for good, where we will explore linear regression, the first machine learning algorithm of this course!


By the end of this module, you should feel comfortable with the fundamentals of linear regression. Specific topics included are:

  1. How to split the data between training and test data
  2. Using training data to train a linear regression model
  3. Analyzing the results of the model
  4. Checking the assumptions of linear regression
  5. Building a multivariate regressor
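The steps above can be sketched end to end. This is only an illustration under stated assumptions: the data is synthetic (it is not the Kiva dataset), the 80/20 split ratio is an arbitrary choice, and the fit uses NumPy's least-squares solver rather than whichever library the module notebooks use.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Kiva dataset):
# y depends linearly on two features plus a little noise.
rng = np.random.default_rng(seed=0)
n_samples = 200
X = rng.normal(size=(n_samples, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# 1. Shuffle the rows and hold out 20% of them as test data.
indices = rng.permutation(n_samples)
n_test = n_samples // 5
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]


def add_intercept(features):
    """Prepend a column of ones so the solver can estimate beta_0."""
    return np.column_stack([np.ones(len(features)), features])


# 2. Fit a multivariate regressor on the training data only.
betas, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# 3. Analyze the results on the held-out test data using R^2.
y_pred = add_intercept(X_test) @ betas
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# 4. A first look at the assumptions: residuals should be centered
#    near zero and show no obvious pattern.
residuals = y_test - y_pred

print("estimated betas (intercept first):", betas)
print("test R^2:", r_squared)
print("mean test residual:", residuals.mean())
```

Fitting on the training rows and scoring on the test rows is what lets the R^2 value say something about data the model has not seen.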

Topic overview

Linear Regression is a parametric model which predicts a continuous outcome feature (Y) from one or more explanatory features (X).

Y = beta_0 + beta_1 * X

beta_0 is called the intercept term, and represents the expected value of Y when all explanatory features equal 0.
beta_1 is called a beta coefficient, and represents the expected change in the value of Y that results from a one-unit increase in X.
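To make beta_0 and beta_1 concrete, here is a minimal sketch that estimates both with the standard least-squares formulas for a single feature. The toy data is an assumption chosen so the answer can be checked by hand: it was generated from Y = 2 + 3 * X with no noise.

```python
# Toy data generated from Y = 2 + 3 * X with no noise (an assumption
# chosen so the fitted values are easy to verify by hand).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# beta_1 = (sum of co-deviations of X and Y) / (sum of squared deviations of X)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
beta_1 = cov_xy / var_x

# beta_0 = mean(Y) - beta_1 * mean(X)
beta_0 = mean_y - beta_1 * mean_x

print(beta_0, beta_1)  # prints: 2.0 3.0
```

Here beta_0 = 2.0 is the value of Y at X = 0, and beta_1 = 3.0 means Y rises by 3 for every one-unit increase in X.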

This model fits a straight line to your data, where the value of the outcome feature can be calculated as a linear combination of the explanatory features. Sounds relatively simple? Afraid not: there are many nuances and conditions that need to be understood before using linear regression! We are going to delve into these assumptions and conditions and then demonstrate how to use this algorithm on the Kiva dataset.



Advanced topics

Linear regression is one member of a family of linear parametric models. Some additional advanced topics we recommend looking up are...

Logistic regression

Logistic regression is very similar to linear regression but has a categorical outcome instead. Rather than modeling a continuous dependent variable, it models a binary classification: yes or no, true or false, 1 or 0. It is still a linear model because it assumes a linear relationship between the independent variables and the log-odds of the outcome (the logit link function).
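The link between the two models can be sketched in a few lines: the same linear expression beta_0 + beta_1 * X is computed, but it is read as log-odds and passed through the sigmoid function to give a probability. The coefficients and the 0.5 decision threshold below are hypothetical values picked for illustration, not fitted to any data.

```python
import math


def predict_probability(x, beta_0, beta_1):
    """Probability of the positive class under a one-feature logistic model."""
    log_odds = beta_0 + beta_1 * x  # linear in x, just like linear regression
    return 1.0 / (1.0 + math.exp(-log_odds))  # sigmoid maps log-odds into (0, 1)


# Hypothetical coefficients chosen for illustration only.
beta_0, beta_1 = -1.0, 2.0

for x in [0.0, 0.5, 1.0]:
    p = predict_probability(x, beta_0, beta_1)
    label = 1 if p >= 0.5 else 0  # 0.5 is a common, but not mandatory, threshold
    print(f"x={x}: probability={p:.3f}, predicted class={label}")
```

Taking log(p / (1 - p)) of any predicted probability recovers the linear expression, which is why logistic regression still counts as a linear model.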

To learn more about logistic regression, try the following resources:

Ridge and Lasso regression

Both linear and logistic regression have a tendency to overfit when there are a large number of features. It is therefore important to choose the features with the most predictive power, but how do we choose them? We can use our EDA to a certain extent, but that only goes so far.

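This is where ridge and lasso regularization techniques come into play! Both add a penalty on large coefficients to reduce overfitting. Lasso can shrink the coefficients of weak features to exactly zero, which makes it useful for identifying which features should be kept in the model, while ridge shrinks coefficients toward zero but generally keeps every feature.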
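The effect of the two penalties can be seen in a small sketch. Everything here is an assumption for illustration: the data is synthetic with one deliberately useless feature, the penalty strengths are arbitrary, ridge is solved with its closed-form expression, and lasso is solved with a basic coordinate descent loop rather than a library routine.

```python
import numpy as np

# Synthetic data (an assumption): only the first two features affect y;
# the third is pure noise.
rng = np.random.default_rng(seed=0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)


def ridge_fit(X, y, alpha):
    """Closed-form ridge: minimizes ||y - Xb||^2 + alpha * ||b||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, stopping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_fit(X, y, alpha, n_iterations=100):
    """Coordinate descent for (1 / (2n)) * ||y - Xb||^2 + alpha * ||b||_1."""
    n_samples, n_features = X.shape
    betas = np.zeros(n_features)
    for _ in range(n_iterations):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ betas + X[:, j] * betas[j]
            rho = X[:, j] @ partial_residual / n_samples
            z = X[:, j] @ X[:, j] / n_samples
            betas[j] = soft_threshold(rho, alpha) / z
    return betas


ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 reduces to ordinary least squares
ridge = ridge_fit(X, y, alpha=50.0)  # every coefficient shrinks, none is dropped
lasso = lasso_fit(X, y, alpha=0.5)   # the noise feature's coefficient becomes 0

print("OLS:  ", ols)
print("ridge:", ridge)
print("lasso:", lasso)
```

In practice the penalty strength is usually tuned on held-out data, and features are typically standardized first so that the penalty treats them comparably.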

To learn more about ridge and lasso regression and general regularization techniques, we recommend the following resources: