Pranoop Mutha edited this page Feb 3, 2018 · 18 revisions

Sub-Team Members (Class ID, Name):

  • 5-2 15: Naga Venkata Satya Pranoop Mutha
  • 5-2 23: Geovanni West


This ICP introduces the concepts of linear regression, supervised and unsupervised learning, and data clustering. We fit a linear regression model to the 3D Road Network data and then observe the training mean squared error and the test mean squared error.

Linear Regression:

Regression is a statistical process for estimating the relationships among variables. It predicts the value of a dependent variable for given values of the explanatory variables.

The parameters in the linear regression model are very easy to interpret.

$Y_i = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e$

where $b_0$ is the intercept, $b_p$ is the slope for the $p$th variable $X_p$, and $e$ is the error term. $Y$ is a quantitative variable.
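As a small illustration of the formula, the sketch below evaluates the deterministic part of the model (everything except the error term e) for one observation. The function name `predict` and the coefficient values are hypothetical, chosen only for this example.

```python
def predict(intercept, slopes, features):
    """Return b0 + b1*x1 + ... + bp*xp for one observation (error term omitted)."""
    if len(slopes) != len(features):
        raise ValueError("need exactly one slope per feature")
    return intercept + sum(b * x for b, x in zip(slopes, features))


# Hypothetical coefficients and feature values, for illustration only.
y_hat = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
```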

In linear regression, we first read the data and split it into training and test sets in a fixed ratio (70-30 in this ICP). We then build the model, evaluate the training and test error, and finally save and load the model.
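The 70-30 split can be sketched as follows. This is a minimal standard-library version, not the code used in the ICP; the helper name `train_test_split`, the seed, and the toy rows are assumptions made for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy (feature, label) rows standing in for the parsed road network data.
rows = [(float(i), 2.0 * i + 1.0) for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting avoids a biased split when the input file is ordered, for example by location.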

Input Data:

Then we load and parse the data

Then we build the model

The model is trained on small random batches of the data (stochastic training), in this case with stochastic gradient descent. Ideally every update would use all of the data, but that is expensive. Instead, each step uses a different random subset, which is much cheaper and still works well.
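To make the idea concrete, here is a minimal mini-batch gradient descent sketch for a linear model in plain Python. It is not the ICP's implementation; the function name, step size, epoch count, batch size, and the noise-free toy data (y = 3x + 2) are all assumptions chosen so the example converges quickly.

```python
import random


def sgd_linear_regression(xs, ys, step, epochs, batch_size, seed=0):
    """Fit intercept and weights by gradient descent on random mini-batches."""
    rng = random.Random(seed)
    n_features = len(xs[0])
    weights = [0.0] * n_features
    intercept = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a different batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                pred = intercept + sum(w * x for w, x in zip(weights, xs[i]))
                err = pred - ys[i]
                for j in range(n_features):
                    grad_w[j] += err * xs[i][j]
                grad_b += err
            # Average the gradient over the batch before stepping.
            scale = step / len(batch)
            weights = [w - scale * g for w, g in zip(weights, grad_w)]
            intercept -= scale * grad_b
    return intercept, weights


# Hypothetical noise-free data on the line y = 3x + 2.
xs = [[i / 10.0] for i in range(11)]
ys = [3.0 * x[0] + 2.0 for x in xs]
intercept, weights = sgd_linear_regression(xs, ys, step=0.5, epochs=2000, batch_size=4)
print(intercept, weights)
```

On real data with large feature values, the step size usually has to be much smaller or the features scaled first; otherwise the updates can diverge.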

Now we evaluate the training mean squared error and the test mean squared error.
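Mean squared error is the average of the squared differences between predictions and labels. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def mean_squared_error(predictions, labels):
    """Average of squared differences between predictions and true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Made-up predictions and labels: only the last pair differs, by 2.
example_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_mse)  # (0 + 0 + 4) / 3
```

The same function is applied once to the training set and once to the held-out test set; a test error far above the training error would suggest overfitting.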

Next, we save the model to a file and load it back.
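One simple way to persist a linear model is to write its intercept and weights to a file and read them back. The sketch below uses JSON and a temporary directory; the file name, the helper names, and the parameter values are hypothetical and are not the storage format used in the ICP.

```python
import json
import os
import tempfile


def save_model(path, intercept, weights):
    """Write the model parameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"intercept": intercept, "weights": list(weights)}, f)


def load_model(path):
    """Read the model parameters back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["intercept"], data["weights"]


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "linear_model.json")  # hypothetical file name
    save_model(model_path, 2.0, [3.0, -1.5])
    restored = load_model(model_path)

print(restored)
```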

Results

Training Mean Squared Error = 1.95062302120918E16

Test Mean Squared Error = 1.9596847563029088E16

K-Means Clustering

K-Means clustering is an unsupervised learning method that groups a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other clusters.

Clustering algorithms rely mainly on a distance metric between data points.

A good clustering minimizes inter-cluster similarity and maximizes intra-cluster similarity.

In K-Means clustering, 'K' stands for the number of clusters and is typically supplied by the user. The algorithm is iterative.

The basic idea is to partition the data by measuring similarity with Euclidean distance and assigning each data point to the nearest cluster center.

The main goal is to minimize the squared error, that is, the sum of squared distances from each point to its cluster center.
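The assignment and update steps, together with the within-set sum of squared errors, can be sketched in plain Python as below. This is a minimal version of the standard iterative procedure (Lloyd's algorithm), not the code used in the ICP; the function names, the random initialization, the iteration count, and the six toy 3D points are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest center and re-averaging."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors: squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Two well-separated toy groups of 3D points (hypothetical data).
points = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0),
          (10.0, 10.0, 10.0), (10.0, 11.0, 10.0), (11.0, 10.0, 10.0)]
centers = kmeans(points, k=2)
cost = wssse(points, centers)
print(centers, cost)
```

Running the same code with different values of K and comparing the resulting cost is one common way to judge how many clusters the data supports.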

So in this example, for the 3D Road Network Data, we apply both K = 3 and K = 4.

Case 1: K = 3

Source Code:

Outlier Point:

  • Within Set Sum of Squared Errors = 8.58246791862488E14

Case 2: K = 4

Source Code:

Outlier Point 1:

Outlier Point 2:

  • Within Set Sum of Squared Errors = 2.031178299721398E14