
Machine Learning (R)

Spring 2018 - Data Science Institute, Columbia University

Prof. John Paisley

Overview

  1. Course description
  2. Tech/framework
  3. Assignment 1
  4. Assignment 2
  5. Assignment 3
  6. Assignment 4
  7. Assignment 5

Course Description

ELEN 4903 was a one-semester class on machine learning theory. The course was almost fully theoretical, focusing on probability, optimization, and analysis. We covered most of the algorithms in use across the span of machine learning models, in a manner grounded in rigorous mathematics. The goal of the class was to understand what is happening under the hood when using these models. The assignments, however, did involve some coding questions. All of the coding in this repo is done in R; the models are coded from scratch, and pre-made ML packages (caret, etc.) are avoided. The topics covered in the course include:

  • Regression
  • Maximum likelihood and maximum a posteriori
  • Regularization and the bias-variance tradeoff
  • Classification
    • Naive Bayes
    • k-NN
    • Perceptron and logistic regression
    • Laplace approximation and Bayesian logistic regression
  • Feature expansions and kernels
    • Kernelized perceptron
    • Kernelized k-NN
    • Gaussian processes
  • SVMs
  • Trees, bagging, and random forests
  • Boosting
  • The Expectation Maximization (EM) algorithm
  • Clustering and Gaussian mixture models (GMMs)
  • Matrix factorization and recommender systems
  • Topic modelling and the LDA algorithm
  • Nonnegative matrix factorization
  • Principal components analysis (PCA)
  • Markov chains
  • Hidden Markov models (HMMs) and the Kalman filter

Tech/Framework

All of the coding is done in R. Some packages are used for data wrangling and visualization, but the actual algorithms are written from scratch. A function included in the header of each script installs and/or loads the required packages.
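
For reference, a minimal sketch of what such a header function might look like (the exact helper in each script may differ; load_packages is an illustrative name):

```r
# Illustrative helper: install any missing packages, then load them all.
load_packages <- function(pkgs) {
  for (p in pkgs) {
    if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
    library(p, character.only = TRUE)
  }
}

# Example with typical wrangling/visualization packages
load_packages(c("dplyr", "ggplot2"))
```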

Assignment 1

The first part of assignment 1 involves a theoretical answer which is not provided in this repo. The second part involves coding ridge regression (OLS with an L2 penalty) and comparing various hyperparameter values. The dataset used for training and testing contains features about various cars used to predict miles per gallon. The graph shows the variation of the weights with different values of df(λ). Higher values of lambda (and thus lower df(λ)) pull the weights (coefficients) towards 0 by increasing the penalty. One important thing to notice is that for some values not only the magnitudes but also the signs of the weights change.
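
A minimal sketch of the ridge solution and of df(λ), using simulated stand-in data rather than the actual cars dataset:

```r
# Ridge regression from scratch: w = (lambda*I + X'X)^{-1} X'y,
# with df(lambda) = trace( X (X'X + lambda*I)^{-1} X' ).
set.seed(1)
n <- 100; p <- 7                               # stand-in for the car-feature design
X <- scale(matrix(rnorm(n * p), n, p))         # standardized features
y <- as.vector(X %*% rnorm(p) + rnorm(n))      # synthetic mpg-like response

ridge_weights <- function(X, y, lambda) {
  solve(lambda * diag(ncol(X)) + t(X) %*% X, t(X) %*% y)
}

df_lambda <- function(X, lambda) {
  sum(diag(X %*% solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X)))
}

lambdas <- 10^seq(-2, 4, length.out = 50)
paths <- sapply(lambdas, function(l) ridge_weights(X, y, l))  # coefficient paths vs lambda
dfs   <- sapply(lambdas, df_lambda)                           # df shrinks from p toward 0 as lambda grows
```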




Next we turn to validating the model using RMSE on the testing set, along with trying out a polynomial model with L2 regularization.

The results above are fairly intuitive. The OLS model seems to perform best without any regularization; however, once we add more parameters to the model through polynomial terms, some regularization helps prediction on the testing set by reducing overfitting. The best model on the RMSE metric seems to be a second-order polynomial with a hyperparameter value of around 15.
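
A sketch of that evaluation, under the assumption that the polynomial expansion simply adds powers of each feature (the repo's exact expansion may differ); X_tr, y_tr, X_te, y_te are placeholders for the train/test split:

```r
# Expand features to a p-th order polynomial (powers only, no interactions),
# fit ridge on the training set, and score RMSE on the test set.
poly_expand <- function(X, order) do.call(cbind, lapply(1:order, function(k) X^k))

rmse <- function(pred, truth) sqrt(mean((pred - truth)^2))

test_rmse <- function(X_tr, y_tr, X_te, y_te, lambda, order) {
  Z_tr <- scale(poly_expand(X_tr, order))
  Z_te <- scale(poly_expand(X_te, order),
                center = attr(Z_tr, "scaled:center"),
                scale  = attr(Z_tr, "scaled:scale"))
  w <- solve(lambda * diag(ncol(Z_tr)) + t(Z_tr) %*% Z_tr,
             t(Z_tr) %*% (y_tr - mean(y_tr)))
  rmse(Z_te %*% w + mean(y_tr), y_te)
}

# e.g. test RMSE over lambda for a second-order polynomial:
# sapply(0:50, function(l) test_rmse(X_tr, y_tr, X_te, y_te, l, order = 2))
```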

Assignment 2

The second assignment shifts focus from regression to classification. The first problem involves a theoretical ML derivation of the naive Bayes classifier, and this part of the solution is not included in this repo. The second part of the assignment involves coding various classifiers on a dataset consisting of spam and non-spam emails with words as features.

The first classifier is the naive Bayes we derived in the first part of the assignment. This chart shows the class-conditional weights for the different features. A value of 1 indicates the email is labelled spam and 0 non-spam. The different weights represent the different conditional likelihoods of the classes for each feature. For example, features 16 and 52 represent the words "free" and "!" respectively. It makes sense that given a high presence of these words, the email is more likely to be spam (y = 1) than not.
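
As one illustration of the idea, here is a Bernoulli naive Bayes sketch on binary word-presence features; the classifier actually derived in the assignment may parameterize the class-conditional distributions differently:

```r
# Bernoulli naive Bayes with Laplace smoothing: estimate P(feature | class)
# for each class, then classify by comparing class log likelihoods + log prior.
nb_train <- function(X, y) {
  # X: n x d binary matrix of word presence, y: labels (1 = spam, 0 = not spam)
  list(prior  = mean(y == 1),
       theta1 = (colSums(X[y == 1, , drop = FALSE]) + 1) / (sum(y == 1) + 2),
       theta0 = (colSums(X[y == 0, , drop = FALSE]) + 1) / (sum(y == 0) + 2))
}

nb_predict <- function(m, X) {
  ll1 <- X %*% log(m$theta1) + (1 - X) %*% log(1 - m$theta1) + log(m$prior)
  ll0 <- X %*% log(m$theta0) + (1 - X) %*% log(1 - m$theta0) + log(1 - m$prior)
  as.integer(ll1 > ll0)   # 1 = spam, 0 = not spam
}
```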


Next I implement a non-probabilistic classifier, the k-nearest neighbors (k-NN) classifier. The performance of the model on the testing set for varying values of k is shown below.
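
A minimal sketch of a from-scratch k-NN classifier with Euclidean distance and a majority vote (X_train, y_train, X_test, y_test are placeholders for the spam data split):

```r
knn_predict <- function(X_train, y_train, X_test, k) {
  apply(X_test, 1, function(x) {
    d <- sqrt(colSums((t(X_train) - x)^2))            # distance to every training point
    as.integer(mean(y_train[order(d)[1:k]]) >= 0.5)   # majority vote over the k nearest
  })
}

# accuracy for k = 1..20:
# sapply(1:20, function(k) mean(knn_predict(X_train, y_train, X_test, k) == y_test))
```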

The last classifier I implement is one of the most commonly used binary classifiers, logistic regression. I implement it using two methods: the standard stochastic gradient descent (SGD) method and Newton's method. The log likelihood of each as a function of iterations is shown below.
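
A sketch of the Newton's-method fit (0/1 labels assumed here; the gradient-based variant optimizes the same log likelihood with per-step gradient updates instead):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

logistic_newton <- function(X, y, iters = 20) {
  # X: n x d design matrix (include an intercept column), y: 0/1 labels
  w <- rep(0, ncol(X))
  loglik <- numeric(iters)
  for (t in 1:iters) {
    p <- as.vector(sigmoid(X %*% w))
    loglik[t] <- sum(y * log(p) + (1 - y) * log(1 - p))   # tracked per iteration
    W <- diag(p * (1 - p))                                 # Hessian weights
    w <- w + solve(t(X) %*% W %*% X, t(X) %*% (y - p))     # Newton step
  }
  list(weights = w, loglik = loglik)
}
```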

Assignment 3

Assignment 3 involves two coding problems: the first part is a Gaussian process, and the second is the implementation of a boosting algorithm on a simple OLS model. In the first part I use the same dataset as in the first assignment, using various car characteristics as features to predict miles per gallon. The second question uses a room occupancy dataset found here.

The Gaussian process algorithm involves creating kernel matrices using the radial basis function and then using them to predict values based on the hyperparameters b and σ². Below I show a scatter plot with the predicted mean based on using just one of the features (car weight).
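
A sketch of the GP predictive mean with an RBF kernel of the form exp(-||x - x'||² / b); the exact kernel parameterization, the data-frame/column names, and the hyperparameter values below are assumptions:

```r
rbf_kernel <- function(A, B, b) {
  # A: n x d, B: m x d -> n x m matrix of exp(-||a_i - b_j||^2 / b)
  sq <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-sq / b)
}

gp_predict_mean <- function(X_train, y_train, X_test, b, sigma2) {
  K  <- rbf_kernel(X_train, X_train, b)
  Ks <- rbf_kernel(X_test,  X_train, b)
  as.vector(Ks %*% solve(K + sigma2 * diag(nrow(X_train)), y_train))   # predictive mean
}

# e.g. with car weight as the single feature (placeholder names, illustrative b and sigma2):
# mu <- gp_predict_mean(as.matrix(train$weight), train$mpg, as.matrix(test$weight), b = 5, sigma2 = 2)
```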

The second part of the assignment involved using the AdaBoost algorithm on a simple OLS model. I used 1500 iterations for the boosting. The distribution of the sampled observations can be seen on the right. It is clear that the sampling was highly uneven, with some of the observations having a large influence on weight optimization. Below I show the training error ε_t and the weight α_t as a function of t.
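
A sketch of AdaBoost with a least-squares classifier as the weak learner, tracking ε_t and α_t (this mirrors the description above but is not the repo's exact code, and it omits the final ensemble prediction):

```r
adaboost_ls <- function(X, y, rounds = 1500) {
  # X: n x d design matrix (with intercept column), y: labels in {-1, +1}
  n <- nrow(X)
  w <- rep(1 / n, n)                                               # distribution over observations
  eps <- alpha <- numeric(rounds)
  for (t in 1:rounds) {
    idx  <- sample(1:n, n, replace = TRUE, prob = w)               # bootstrap sample from w
    beta <- solve(t(X[idx, ]) %*% X[idx, ], t(X[idx, ]) %*% y[idx])  # least-squares fit
    pred <- as.vector(sign(X %*% beta))
    eps[t] <- sum(w * (pred != y))                                 # weighted training error
    if (eps[t] > 0.5) { pred <- -pred; eps[t] <- 1 - eps[t] }      # flip a worse-than-chance learner
    alpha[t] <- 0.5 * log((1 - eps[t]) / eps[t])
    w <- w * exp(-alpha[t] * y * pred)                             # upweight misclassified points
    w <- w / sum(w)
  }
  list(eps = eps, alpha = alpha)
}
```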

Assignment 4

In assignment 4 we switch gears from supervised learning to unsupervised learning. In the first part of the assignment I generate three Gaussians on R² with different means and mixing weights but the same covariance matrix. I then implement the k-means clustering algorithm for various values of k, which allows us to test how well the unsupervised algorithm works while knowing the true data-generating process.

Above, I show both the k-means objective function as a function of iterations (left) and the final cluster assignment of the data sample after 20 iterations for two different values of k (right). From the left we can see that increasing the number of clusters will monotonically decrease the loss function; however, this should not be used as the only metric, as we know the true k is 3, not 5, which can be seen from the image on the right as well as from our prior information on the data-generating process.
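
A sketch of the data generation and the k-means loop; the mixing weights, means, and other settings below are illustrative, not the exact values used:

```r
set.seed(1)
n <- 500
z <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))     # mixing weights (illustrative)
mu_true <- matrix(c(0, 0, 3, 0, 0, 3), ncol = 2, byrow = TRUE)   # component means (illustrative)
X <- mu_true[z, ] + matrix(rnorm(2 * n), n, 2)                   # shared identity covariance

kmeans_scratch <- function(X, k, iters = 20) {
  mu <- X[sample(nrow(X), k), , drop = FALSE]                    # random initialization
  obj <- numeric(iters)
  for (t in 1:iters) {
    d <- sapply(1:k, function(j)
      rowSums((X - matrix(mu[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    c_i <- max.col(-d)                                           # assign each point to nearest centroid
    obj[t] <- sum(d[cbind(1:nrow(X), c_i)])                      # k-means objective
    for (j in 1:k)
      if (any(c_i == j)) mu[j, ] <- colMeans(X[c_i == j, , drop = FALSE])
  }
  list(centroids = mu, assignments = c_i, objective = obj)
}

fit_k3 <- kmeans_scratch(X, k = 3)
fit_k5 <- kmeans_scratch(X, k = 5)
```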

In the second part of the assignment I implement a matrix factorization algorithm on a user-movie dataset. The log likelihood as a function of iterations for different initializations can be seen below.
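
A sketch of coordinate-ascent MAP updates for a probabilistic matrix factorization model of this kind; the rank d and the values of lambda and sigma² below are assumptions, not necessarily the settings used in the assignment:

```r
pmf <- function(M, d = 10, lambda = 1, sigma2 = 0.25, iters = 100) {
  # M: n_users x n_movies ratings matrix with NA where a rating is unobserved
  n1 <- nrow(M); n2 <- ncol(M)
  U <- matrix(rnorm(n1 * d, sd = sqrt(1 / lambda)), n1, d)
  V <- matrix(rnorm(n2 * d, sd = sqrt(1 / lambda)), n2, d)
  obs <- !is.na(M)
  for (t in 1:iters) {
    for (i in 1:n1) {                                            # update user factors
      j  <- which(obs[i, ])
      Vj <- V[j, , drop = FALSE]
      U[i, ] <- solve(lambda * sigma2 * diag(d) + t(Vj) %*% Vj, t(Vj) %*% M[i, j])
    }
    for (j in 1:n2) {                                            # update movie factors
      i  <- which(obs[, j])
      Ui <- U[i, , drop = FALSE]
      V[j, ] <- solve(lambda * sigma2 * diag(d) + t(Ui) %*% Ui, t(Ui) %*% M[i, j])
    }
  }
  list(U = U, V = V)
}

# "Closest movies" to a movie m: nearest rows of V in Euclidean distance, e.g.
# order(rowSums((V - matrix(V[m, ], nrow(V), ncol(V), byrow = TRUE))^2))[2:11]
```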

All the runs are fairly similar; however, I use the one with the highest log likelihood (trial 7) and then examine the movies "Star Wars", "My Fair Lady", and "Goodfellas". The tables with the closest movies (recommendations) to each can be seen below.

| Star Wars | My Fair Lady | GoodFellas |
| --- | --- | --- |
| Empire Strikes Back, The (1980) | Return of Martin Guerre, The (Retour de Martin Guerre, Le) (1982) | Casino (1995) |
| Raiders of the Lost Ark (1981) | Carrington (1995) | Full Metal Jacket (1987) |
| Return of the Jedi (1983) | Snow White and the Seven Dwarfs (1937) | Bonnie and Clyde (1967) |
| Usual Suspects, The (1995) | Victor/Victoria (1982) | Good, The Bad and The Ugly, The (1966) |
| Indiana Jones and the Last Crusade (1989) | Fantasia (1940) | Apocalypse Now (1979) |
| Day the Earth Stood Still, The (1951) | Charade (1963) | Carlito's Way (1993) |
| My Man Godfrey (1936) | Only You (1994) | Swingers (1996) |
| Prefontaine (1997) | Mary Poppins (1964) | Godfather, The (1972) |
| Blues Brothers, The (1980) | Singin' in the Rain (1952) | Godfather: Part II, The (1974) |
| Back to the Future (1985) | | People vs. Larry Flynt, The (1996) |

Based on a cursory familiarity with some of the movies above, it seems the algorithm was implemented with a fair degree of success using only collaborative filtering and no movie or user characteristics.

Assignment 5

In the final assignment I implement a Markov chain model to rank college football teams based on their performance vis-à-vis each other. The second part of the assignment involves using nonnegative matrix factorization to do topic modelling on 8447 articles from The New York Times.

In the first part of the assignment I implement a Markov chain model looking at the performance of 763 college football teams from the 2017 season and attempt to rank them. The chart on the right shows the difference between the weight at iteration t and the steady-state weight as a function of t. The table below shows the rankings produced by the Markov chain algorithm alongside the rankings from the AP poll. The outcomes are quite similar.
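
A sketch of the ranking step, assuming the game results have already been aggregated into an (unnormalized) team-to-team transition-weight matrix; how that matrix is built from the scores is omitted here:

```r
rank_teams <- function(M_counts, iters = 2500) {
  # M_counts: n_teams x n_teams matrix of unnormalized transition weights from game results
  M <- M_counts / rowSums(M_counts)                    # row-stochastic transition matrix
  w <- rep(1 / nrow(M), nrow(M))                       # uniform initial state distribution
  w_inf <- Re(eigen(t(M))$vectors[, 1])                # stationary distribution (eigenvalue-1 eigenvector)
  w_inf <- w_inf / sum(w_inf)
  gap <- numeric(iters)
  for (t in 1:iters) {
    w <- as.vector(w %*% M)                            # w_{t+1} = w_t M
    gap[t] <- sum(abs(w - w_inf))                      # ||w_t - w_inf||_1, as plotted above
  }
  list(ranking = order(w, decreasing = TRUE), weights = w, gap = gap)
}
```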

| Algo Rank | Team | Voting Rank |
| --- | --- | --- |
| 1 | Alabama | 1 |
| 2 | Georgia | 2 |
| 3 | OhioState | 5 |
| 4 | Clemson | 4 |
| 5 | Oklahoma | 3 |
| 6 | Wisconsin | 7 |
| 7 | CentralFlorida | 6 |
| 8 | Auburn | 10 |
| 9 | PennState | 8 |
| 10 | NotreDame | 11 |

In the second part of the assignment, I use non-negative matrix factorization (NMF) to do topic modelling on NYT articles. Below I show the objective function value as a function of iterations.
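
A sketch of NMF with the divergence objective and the standard multiplicative updates; the rank of 25 and the other settings are assumptions, and vocab is a placeholder for the vocabulary vector:

```r
nmf_divergence <- function(X, rank = 25, iters = 100) {
  # X: nonnegative word-by-document count matrix; rank = number of topics
  n <- nrow(X); m <- ncol(X); eps <- 1e-16
  W <- matrix(runif(n * rank, 1, 2), n, rank)
  H <- matrix(runif(rank * m, 1, 2), rank, m)
  obj <- numeric(iters)
  for (t in 1:iters) {
    WH <- W %*% H + eps
    H  <- H * (t(W) %*% (X / WH)) / colSums(W)                                  # update H
    WH <- W %*% H + eps
    W  <- W * ((X / WH) %*% t(H)) / matrix(rowSums(H), n, rank, byrow = TRUE)   # update W
    WH <- W %*% H + eps
    obj[t] <- sum(X * log(1 / WH) + WH)                                         # divergence objective
  }
  list(W = W, H = H, objective = obj)
}

# Top 10 words for topic k:
# vocab[order(W[, k], decreasing = TRUE)[1:10]]
```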

Next I show the top 10 words from 5 topics. As can be seen there is some coherence to the topics. The first one seems to be about logistics, the second is media, the third is business, the fourth is social/community, and the fifth is policy.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
| --- | --- | --- | --- | --- |
| agreement | television | company | father | money |
| plan | medium | business | mrs | state |
| meeting | advertising | industry | son | pay |
| agree | magazine | executive | graduate | budget |
| issue | network | product | daughter | program |
| negotiation | news | customer | mother | tax |
| decision | broadcast | service | marry | cost |
| proposal | video | president | receive | government |
| meet | newspaper | base | retire | bill |
| official | commercial | sell | degree | law |
