Linear regression is often described as a building block of statistical learning methods. In a nutshell, it models the relationship between two or more variables by fitting a linear equation to the data at hand. For example, suppose you plot people's salaries on the y-axis and their years of education on the x-axis in a simple scatter plot. You are then trying to estimate a dependent variable (salary) from a predictor (years of education) by drawing a line of best fit through the data. Under ordinary least squares, the most common approach, the line of best fit is the line that minimizes the sum of the squared vertical distances between itself and each point in the data. The slope of the resulting line is the coefficient of the predictor (i.e., the effect that a one-unit change in years of education has on the predicted salary). For example, let's say that the line of best fit follows the equation below:

y = 30000 + 5000*x1, where y = salary, x1 = years of education, and 30000 is the constant y-intercept

This means that for one more year of education, a person's salary is estimated to increase by 5000, all else held constant. Thus, for someone with 12 years of education, their predicted salary is:

y = 30000 + 5000*(12)
y = 90000
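
The same numbers can be reproduced in a few lines of Python. This is a minimal sketch, not part of the workshop code: the five data points are made up so that they fall exactly on the line above, and fit_line is a hypothetical helper that applies the closed-form least-squares formulas for a single predictor.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up sample: salaries that lie exactly on y = 30000 + 5000*x1
years = [8, 10, 12, 14, 16]
salaries = [30000 + 5000 * x for x in years]

intercept, slope = fit_line(years, salaries)
predicted = intercept + slope * 12  # someone with 12 years of education
print(round(intercept), round(slope), round(predicted))
```

Because the sample has no noise, the fit recovers the intercept and slope of the equation; with real data the points scatter around the line and the estimates are only approximate.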

This is a rather simplistic example. In reality, many factors influence a person's salary. This is where multiple linear regression (often loosely called multivariate linear regression) comes into play. For example, suppose you now have information not only on the person's salary and years of education, but also on their parents' last combined income, their years of work experience, etc. You can estimate a model that considers each factor's effect, though visualizing the line of best fit gets harder as you keep adding dimensions! Not to worry, the math still works! Here's an example of a linear regression with several predictors:

salary = 20000 + 4000*years_of_education + 1.1*last_combined_parents_income + 1000*years_of_experience

Thus, a person with 12 years of education, 100000 as their parents' last combined income and 5 years of experience is estimated to earn:

y = 20000 + 4000*12 + 1.1*100000 + 1000*5
y = 183000
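
With several predictors, a prediction is still just the intercept plus the sum of each coefficient times its value. The sketch below assumes the coefficients from the equation above; predict_salary and the dictionary layout are illustrative choices, not something the workshop prescribes.

```python
# Coefficients taken from the equation above
INTERCEPT = 20000
COEFFICIENTS = {
    "years_of_education": 4000,
    "last_combined_parents_income": 1.1,
    "years_of_experience": 1000,
}


def predict_salary(person):
    """Intercept plus the sum of coefficient * value over all predictors."""
    return INTERCEPT + sum(coef * person[name] for name, coef in COEFFICIENTS.items())


person = {
    "years_of_education": 12,
    "last_combined_parents_income": 100000,
    "years_of_experience": 5,
}
salary = predict_salary(person)
print(round(salary))
```

Adding another predictor only means adding one more entry to COEFFICIENTS, which is why the math carries over even when the model can no longer be drawn.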

There is much more to linear regression than fitting a line. The topics below are only a handful of those that need to be considered:

How to estimate the coefficients
The tradeoff between bias and variance
Measuring the quality of fit and model accuracy
Omitted variable bias
Non-linear transformations of the predictors
Interaction and dummy/binary variables
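
Of these topics, measuring the quality of fit is the easiest to show in a few lines. The sketch below computes two common measures, R-squared and root mean squared error (RMSE). The salaries and predictions are made up for illustration, and r_squared and rmse are hypothetical helper names.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up observed salaries and the predictions of some fitted model
actual = [70000, 82000, 88000, 101000, 109000]
predicted = [70000, 80000, 90000, 100000, 110000]

print(f"R^2  = {r_squared(actual, predicted):.3f}")
print(f"RMSE = {rmse(actual, predicted):.1f}")
```

An R-squared close to 1 means the model explains most of the variation in the target, while RMSE tells you how far off a typical prediction is in the target's own units.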

For a far more detailed explanation of linear regression (and of statistical methods in general), there are plenty of resources on the internet, for example: https://www.statlearning.com/.

Electric Car Battery Example

Data Source

https://www.kaggle.com/gktuzgl/id-3-pro-max-ev-consumption-data

Explanation

Using the data provided by Göktuğ Özgül on Kaggle.com, we will build a simple linear regression model that predicts battery drainage: how much your electric car's battery will drain if you drive it in certain ways. For example, how much should you expect your battery to be drained if you drive 50 km at 50 km per hour, using heated seats?

We will cover:

How to read in the data and deal with special characters
How to explore the data, both for numerical variables and categorical variables
How to process the data and do some very light feature engineering (i.e., creating new variables from existing ones)
One of the many possible ways to perform feature selection
How to build high-quality data to be used as input for your model
How to follow a dimensional design process to do the above
How to build a simple linear regression model
How to extract the results and predictions from the model
How to build a simple Streamlit dashboard showcasing your model's predictive ability
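
As a taste of the first few steps, the sketch below reads a small CSV, strips special characters from the column names, and encodes a yes/no column as a 0/1 dummy variable. The sample data, the semicolon delimiter, and every column name are hypothetical; the real Kaggle file has its own layout, and clean_name and read_rows are illustrative helpers rather than the workshop's code.

```python
import csv
import io
import re
import unicodedata

# Hypothetical sample; the real Kaggle file has different columns and values
RAW_CSV = """Distance (km);Avg. speed (km/h);Ambient temp. (°C);Heated seats;Battery drain (%)
50;50;12;yes;14
30;80;20;no;11
"""


def clean_name(name):
    """Turn a header such as 'Ambient temp. (°C)' into 'ambient_temp_c'."""
    # Decompose accented characters, then drop anything outside ASCII
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    # Collapse every run of non-alphanumeric characters into one underscore
    return re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name).strip("_").lower()


def read_rows(text):
    """Parse the CSV text into dictionaries keyed by cleaned column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for raw in reader:
        row = {clean_name(key): value for key, value in raw.items()}
        # Light feature engineering: encode the yes/no flag as a 0/1 dummy variable
        row["heated_seats"] = 1 if row["heated_seats"].strip().lower() == "yes" else 0
        rows.append(row)
    return rows


rows = read_rows(RAW_CSV)
print(rows[0])
```

Cleaning the names once, right after reading, means every later step (SQL or Python) can refer to plain ASCII column names.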

Sources Referred To

Versatile Data Kit (VDK)

The Versatile Data Kit framework allows you to implement automated pull ingestion and batch data processing.

Create the Data Job Files

A Data Job directory can contain any files; however, some files are treated in a specific way:

SQL files (.sql) - called SQL steps - are executed directly as queries against your configured database;
Python files (.py) - called Python steps - are Python scripts that define a run function, which takes the job_input object as its argument;
config.ini is needed in order to configure the Job. This is the only file required to deploy a Data Job;
requirements.txt is an optional file, needed when your Python steps use external Python libraries.
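
A Python step can be as small as the sketch below. The run function and its job_input argument follow the description above; the record, the table name ev_trips, and the FakeJobInput stand-in are hypothetical and exist only so the example runs without VDK installed. send_object_for_ingestion is the ingestion call on VDK's job input interface, but check its exact parameters against the VDK documentation for your version.

```python
def run(job_input):
    """Entry point that VDK calls when it executes this step."""
    # Hypothetical record and table name, used only for illustration
    record = {"distance_km": 50, "avg_speed_km_h": 50, "heated_seats": 1}
    job_input.send_object_for_ingestion(payload=record, destination_table="ev_trips")


class FakeJobInput:
    """Stand-in for the real job_input so the step can be tried without VDK."""

    def __init__(self):
        self.sent = []

    def send_object_for_ingestion(self, payload, destination_table):
        # Record the call instead of sending anything to a database
        self.sent.append((destination_table, payload))


fake = FakeJobInput()
run(fake)
print(fake.sent)
```

In a real Data Job you would keep only the run function in the step file and let vdk run supply the job_input object.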

Delete all files you do not need and replace them with your own.

Data Job Code

VDK supports having many Python and/or SQL steps in a single Data Job. Steps are executed in ascending alphabetical order based on file names. Prefixing file names with numbers makes it easy to have meaningful file names while maintaining the steps' execution order.
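
One pitfall with numeric prefixes is worth seeing in code. Alphabetical order compares names character by character, so 10 sorts before 2 unless the prefixes are zero-padded. The sketch below uses Python's built-in sorted to illustrate this; the file names are hypothetical, and it assumes VDK orders steps by plain string comparison, which the description above suggests but this example does not verify.

```python
# Hypothetical step file names with unpadded numeric prefixes
steps = ["1_read_csv.py", "2_clean_data.sql", "10_build_model.py"]

# Character-by-character comparison puts "10_..." first, because "0" sorts before "_"
ordered = sorted(steps)
print(ordered)

# Zero-padding every prefix to the same width keeps the intended order
padded = ["01_read_csv.py", "02_clean_data.sql", "10_build_model.py"]
padded_ordered = sorted(padded)
print(padded_ordered)
```

Using two-digit prefixes from the start (10_, 20_, 30_) also leaves room to insert a step later without renaming the others.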

Run the Data Job from a Terminal:

Make sure you have vdk installed. See Platform documentation on how to install it.

vdk run <path to Data Job directory>

Deploy Data Job

When a Data Job is ready to be deployed to a Versatile Data Kit runtime (cloud), run the command below and follow its instructions (you can see its options with vdk --help):

vdk deploy

Exercises

Please open up MyBinder to get started on the exercises!

If you have any issue with above link try

For more information on MyBinder, please visit:

https://mybinder.readthedocs.io

Lessons Learned

Through this scenario, you created a data job, which:

Read in a local CSV file and stripped it of its special characters
Processed the data using dimensional modelling
Built and tested a linear regression model
Built an interactive Streamlit dashboard, which showcased your model's predictive ability
Was created, developed and deployed using Versatile Data Kit (VDK)

Congrats!

> Go back to main page of the Workshop.
