# Samuel Gartenstein QMSS GR5073 Final Exam

## Question 1

From the perspective of a social scientist, which models did we learn this semester that are useful for ruling out alternative explanations through control variables AND that allow us to observe substantively meaningful information from model coefficients?



### Answer

The models we learned this semester that are useful for ruling out alternative explanations through control variables, and that yield substantively meaningful coefficients, are linear, logistic, lasso, and ridge regression.

From a traditional social science perspective, adding controls to a model improves our ability to explain the target feature. Doing so may reduce the explanatory power of our primary predictor, which happens when the controls are correlated with the primary predictor or explain the target feature better than it does. While social scientists and statisticians rely on p-values (in a frequentist framework), machine learners normally analyze the magnitude of their coefficients. If there is multicollinearity or omitted variable bias, for example, a coefficient may misrepresent the underlying relationship. Adding controls can therefore reduce the magnitude of the coefficient(s) that do not contribute meaningfully to the model. This in turn allows machine learners to rule out alternative explanations and to focus on the predictors that actually affect the target feature.
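
To make the role of a control concrete, here is a minimal simulated sketch, assuming NumPy is available. The variable names (`confounder`, `predictor`, `outcome`), the helper `ols`, and the data-generating numbers are hypothetical and chosen only for illustration: the outcome depends on the confounder alone, yet the predictor looks influential until the confounder is added as a control.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: the outcome is driven only by the confounder,
# but the predictor is correlated with the confounder.
confounder = rng.normal(size=n)
predictor = 0.8 * confounder + 0.6 * rng.normal(size=n)
outcome = 2.0 * confounder + rng.normal(size=n)

def ols(features, target):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

# Coefficient on the predictor without and with the control variable.
coef_without_control = ols([predictor], outcome)[1]
coef_with_control = ols([predictor, confounder], outcome)[1]

print(f"predictor coefficient, no control:   {coef_without_control:.2f}")
print(f"predictor coefficient, with control: {coef_with_control:.2f}")
```

In this setup the predictor's coefficient should be large without the control and close to zero with it, which is the pattern that lets us rule out the predictor as an explanation.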

It is worth noting that adding control variables is not always effective. In fact, it can diminish our ability to generalize to new data: a model with too many controls may overfit the training data. In this situation, machine learning practitioners turn to lasso or ridge regression. These are regularization techniques that shrink the coefficients of predictors toward zero (exactly to zero in the case of lasso) when the predictors are multicollinear or have minimal explanatory power. This allows practitioners to minimize the risk of overfitting, thereby making it easier to identify their model's most important features.
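
The shrinkage behavior can be sketched directly in NumPy. This is an illustrative sketch under stated assumptions, not the scikit-learn estimators used in class: the data are simulated, the penalty values are arbitrary, `ridge` uses the closed-form solution on centered data, and `lasso` is a bare-bones coordinate descent for the objective written in its docstring.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical data: only the first of three predictors affects the target.
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + rng.normal(size=n)

# Center so that no intercept is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def lasso(X, y, alpha, n_sweeps=100):
    """Coordinate descent for (1 / 2n) * ||y - Xb||^2 + alpha * ||b||_1."""
    n_samples, n_features = X.shape
    coefs = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ coefs + X[:, j] * coefs[j]
            rho = X[:, j] @ partial_residual / n_samples
            scale = X[:, j] @ X[:, j] / n_samples
            # Soft-thresholding sets small coefficients exactly to zero.
            coefs[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / scale
    return coefs

ols_coefs = ridge(X, y, alpha=0.0)       # alpha = 0 is ordinary least squares
ridge_coefs = ridge(X, y, alpha=1000.0)  # shrinks every coefficient
lasso_coefs = lasso(X, y, alpha=0.5)     # zeroes the uninformative ones

print("OLS:  ", ols_coefs)
print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

Ridge pulls all coefficients toward zero without eliminating any, while lasso keeps the informative predictor and drops the two noise predictors entirely.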



## Question 2

Describe the main differences between supervised and unsupervised learning.

### Answer

In supervised machine learning, practitioners have pre-defined input and output variables. Although the choice of these features depends on the context, there is always a distinction between the independent variable(s) and a dependent variable. Supervised machine learning models include, but are not limited to, linear and logistic regression, SVMs, and tree models.

In unsupervised machine learning, on the other hand, there is no pre-defined set of independent or dependent variables. Rather, there are only features, which are normally unlabeled. As a result, machine learning practitioners use algorithms such as PCA (Principal Component Analysis) and clustering to derive meaning from the features.




## Question 3

Is supervised or unsupervised learning the primary approach that is used by machine learning practitioners?  For whatever approach you think is secondary, why would you use this approach (what's a good reason to use these kinds of models?)



### Answer

I believe that machine learning practitioners primarily use unsupervised machine learning, since data tends to be unlabeled and unstructured. During a lecture, Nina Lerner shared several real-world problems she solves at her job, one of which is extracting information from consumer receipts. In such a case, there are no pre-defined inputs or outputs; rather, data scientists must use unsupervised machine learning to derive meaningful information from their consumers.

I also believe that another primary goal of unsupervised machine learning is to make data suitable for supervised machine learning. I would not call supervised machine learning "secondary." In fact, I believe that machine learning practitioners would rather have an ordered dataset to which they can apply regression, SVM, or ensemble models to make predictions about a target feature. However, practitioners are bound by the structure of their data. Unsupervised techniques such as PCA serve to reduce dimensionality and extract important features for subsequent supervised machine learning models.




## Question 4

Which unsupervised learning modeling approaches did we cover this semester?  What are the major differences between these techniques?



### Answer

The unsupervised learning models we covered this semester were PCA and clustering (K-means and hierarchical).

In PCA, the goal is to reduce the dimensions of our data. This is done by projecting our data onto new axes, called principal components, that are ordered by the amount of variance they capture. As a result, we can identify the components that capture the most variance and determine which features are influential in our data.
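
The projection step can be sketched with a singular value decomposition. This is a minimal NumPy sketch on simulated data, not the scikit-learn `PCA` class. The dataset, in which the second feature is nearly a copy of the first, is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 observations of 3 features, where the second
# feature is nearly a copy of the first.
base = rng.normal(size=200)
data = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Center each feature, then take the singular value decomposition.
# The rows of `components` are the principal axes, ordered by variance.
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each principal component.
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Project onto the first two components to get a 2-dimensional dataset.
reduced = centered @ components[:2].T

print(explained_variance_ratio)
print(reduced.shape)
```

Because two of the three features carry almost the same information, the last component should capture very little variance, so dropping it loses little.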

In clustering, our goal is to group similar observations. K-means clustering achieves this by pre-defining a number of clusters and then minimizing the variation within those clusters. Hierarchical clustering builds a dendrogram, a tree-like structure, that partitions the data based on similarity; the number of clusters is chosen afterward. Both aim to group observations with similar patterns.
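
The K-means idea of alternating between assigning points and updating cluster centers can be sketched as follows. This is a simplified illustration that assumes NumPy, two simulated and well-separated blobs, and hand-picked starting centers. The function name `kmeans` and its parameters are hypothetical, and hierarchical clustering is not shown.

```python
import numpy as np

def kmeans(points, initial_centers, n_iterations=20):
    """Alternate between assigning points and recomputing cluster centers."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(n_iterations):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(
            points[:, None, :] - centers[None, :, :], axis=2
        )
        # Assign each point to its nearest center.
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(4)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Start with one center in each blob (an assumption made for reproducibility).
labels, centers = kmeans(points, initial_centers=[points[0], points[-1]])
print(centers)
```

Moving each center to the mean of its members is what reduces the within-cluster variation at every step.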

Overall, the primary difference is that PCA seeks to reduce the dimensionality of the dataset, whereas clustering seeks to group similar observations. However, both serve to derive meaning from unlabeled and/or unstructured data.


## Question 5

What are the main benefits of using Principal Components Analysis?

### Answer

The primary benefit of using Principal Components Analysis is that it reduces high-dimensional data (data with a large number of features). This reduction is achieved by projecting our data onto new axes, called principal components, that are ordered by variance. Afterward, we can retain the components that have high variance. The technique allows for dimensionality reduction while keeping a relatively high degree of explanatory power.

Several advantages follow from reducing the data while keeping the most important information. First, since PCA discards dimensions that provide little information, it decreases the risk of overfitting, which can improve the predictive performance of subsequent ML models. Second, by combining features with high multicollinearity, PCA removes redundancy that can otherwise hurt subsequent ML models. Third, since the unnecessary and/or highly correlated dimensions are eliminated, using PCA speeds up computation for subsequent ML models.
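
The multicollinearity point can be checked numerically. In this small sketch, which assumes NumPy and a simulated pair of nearly identical features (hypothetical data), the original features are highly correlated while their principal component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pair of highly collinear features.
first = rng.normal(size=500)
second = first + 0.2 * rng.normal(size=500)
features = np.column_stack([first, second])

# Express the centered data in terms of its principal components.
centered = features - features.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
scores = centered @ components.T

corr_before = np.corrcoef(features, rowvar=False)[0, 1]
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]

print(f"correlation between original features: {corr_before:.3f}")
print(f"correlation between component scores:  {corr_after:.3f}")
```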


## Question 6

Thinking about neural networks, what are three major differences between a deep multilayer perceptron network and a convolutional neural network model?  Be sure to define any key terms in your explanation.



### Answer

The first difference between the deep multilayer perceptron network (MPN) and the convolutional neural network (CNN) is the layer structure. The MPN has an input layer, one or more hidden layers, and an output layer. A CNN has additional layer types: a convolutional layer where the features are filtered, a pooling layer for shrinking the data, and a fully connected layer.

The second difference is that a CNN can filter the data. That is, we create a small matrix of weights, called the convolution kernel. We partition our primary data matrix into overlapping subsets and multiply each one element-wise by the kernel. We then take the sum of each element-wise product and put it into a new matrix. This allows us to extract patterns and information about our features. A deep MPN does not have this feature, meaning that we cannot filter the data.
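
The filtering step described here can be written out by hand. The sketch below assumes NumPy, a stride of 1, and no padding. The function name `convolve2d_valid` and the example matrices are hypothetical. As in Keras, the kernel is applied without flipping.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and a stride of 1."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Overlapping subset of the data matrix under the kernel.
            patch = image[i:i + kernel_h, j:j + kernel_w]
            # Sum of the element-wise product goes into the new matrix.
            output[i, j] = np.sum(patch * kernel)
    return output

image = np.array([
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 0, 1, 1],
    [1, 2, 0, 3],
])
kernel = np.array([
    [1, 0],
    [0, 1],
])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[2. 5. 2.]
#  [0. 2. 4.]
#  [4. 0. 4.]]
```

A 4 by 4 input filtered with a 2 by 2 kernel yields a 3 by 3 matrix because the kernel fits in three positions along each axis.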

The third difference is that a CNN allows dimension reduction. This is done in the pooling layer, which comes after we filter our data in the convolutional layer. In pooling, the filtered matrix is divided into non-overlapping subsets. We then reduce our dimensions based on the values in each submatrix, either by taking the maximum element or the average of all the elements, and store the output in a new, smaller matrix. As with filtering, a deep MPN has no such layer, meaning we cannot reduce the dimensionality of our data in this way.
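
Pooling over non-overlapping 2 by 2 submatrices can be sketched in a few lines. This assumes NumPy. The function name `pool_2x2`, its `reducer` parameter, and the example matrix are hypothetical.

```python
import numpy as np

def pool_2x2(matrix, reducer=np.max):
    """Reduce each non-overlapping 2 by 2 block to a single value."""
    rows, cols = matrix.shape
    # Drop a trailing row or column if the size is odd.
    rows, cols = rows - rows % 2, cols - cols % 2
    blocks = matrix[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2)
    return reducer(blocks, axis=(1, 3))

filtered = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])

max_pooled = pool_2x2(filtered)                   # maximum of each block
avg_pooled = pool_2x2(filtered, reducer=np.mean)  # average of each block

print(max_pooled)
# [[4 2]
#  [2 8]]
print(avg_pooled)
# [[2.5  1.  ]
#  [1.25 6.5 ]]
```

Either way, the 4 by 4 matrix shrinks to 2 by 2, which is the dimension reduction the pooling layer provides.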


## Question 7

Write the tf.keras code for a multilayer perceptron neural network with the following structure: Three hidden layers.  50 hidden units in the first hidden layer, 100 in the second, and 150 in the third.  Activate all hidden layers with relu.  The output layer should be built to classify to five categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model and your approach to compile the model.  You will not run it on real data.)




In [None]:
'''
For Questions 7 and 8, I set the input shape to (784,), meaning there are 784 features.
For Questions 7 through 10, I set the learning rate to 0.0001 and used accuracy as the
scoring metric. The multi-class models use categorical crossentropy as the loss.
'''
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import SGD

model = Sequential([
    #First hidden layer with 50 neurons
    Dense(50, input_shape=(784,)),
    Activation('relu'),

    #Second hidden layer with 100 neurons
    Dense(100),
    Activation('relu'),

    #Third hidden layer with 150 neurons
    Dense(150),
    Activation('relu'),

    #Output layer set to classify 5 categories
    Dense(5),
    Activation('softmax'),
])

#Stochastic gradient descent optimizer with a learning rate of 0.0001
sgd = SGD(learning_rate=0.0001)

#Compiling model with the SGD optimizer object defined above
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
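
As a rough check on the size of this architecture, the number of trainable parameters can be counted by hand. The helper `dense_parameter_count` below is hypothetical, and the count assumes the 784 input features stated in the cell above.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_inputs, n_outputs in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight per input-output pair, plus one bias per output unit.
        total += n_inputs * n_outputs + n_outputs
    return total

# 784 assumed input features, hidden layers of 50, 100, and 150 units, 5 outputs.
question_7_params = dense_parameter_count([784, 50, 100, 150, 5])
print(question_7_params)  # 60255
```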


## Question 8
Write the tf.keras code for a multilayer perceptron neural network with the following structure: Two hidden layers.  75 hidden units in the first hidden layer and 150 in the second.  Activate all hidden layers with relu.  The output layer should be built to classify a binary dependent variable.  Further, your optimization technique should be stochastic gradient descent. (This code should simply build the architecture of the model and your approach to compile the model.  You will not run it on real data.)



In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import SGD

model = Sequential([
    #First hidden layer with 75 neurons; 784 is the assumed number of features
    Dense(75, input_shape=(784,)),
    Activation('relu'),

    #Second hidden layer with 150 neurons
    Dense(150),
    Activation('relu'),

    #Output layer set to classify a binary dependent variable
    Dense(1),
    Activation('sigmoid'), #Activated with sigmoid function
])

#Stochastic gradient descent optimizer with a learning rate of 0.0001
sgd = SGD(learning_rate=0.0001)

#Compiling model; binary crossentropy is the loss that matches a single sigmoid output
model.compile(loss='binary_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])


## Question 9

Write the tf.keras code for a convolutional neural network with the following structure: Two convolutional layers.  16 filters in the first layer and 28 in the second.  Activate all convolutional layers with relu.  Use max pooling after each convolutional layer with a 2 by 2 filter.  The output layer should be built to classify to ten categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model and your approach to compile the model.  You will not run it on real data.)



In [None]:
'''
For Questions 9 and 10, I made the kernel/filter size 3 by 3 and the input shape 28 by 28 by 1.
Only the first convolutional layer needs the input shape.
'''
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import SGD

model = Sequential()

#First convolutional layer with 16 filters
model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2, 2))) #max pooling with a 2 by 2 filter

#Second convolutional layer with 28 filters
model.add(Conv2D(28, (3, 3), padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2))) #max pooling with a 2 by 2 filter

#Flatten the feature maps into a vector for the output layer
model.add(Flatten())

#Output layer set to classify 10 categories
model.add(Dense(10, activation='softmax'))

#Stochastic gradient descent optimizer with a learning rate of 0.0001
sgd = SGD(learning_rate=0.0001)

#Compiling model
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
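
To see how the convolution and pooling layers shrink the data, the spatial sizes can be traced by hand. This is a sketch under the assumptions stated in the cell above (28 by 28 by 1 input, 3 by 3 kernels, `'valid'` padding, a stride of 1, and 2 by 2 pooling). The helper names are hypothetical.

```python
def conv_output_size(size, kernel_size):
    """Spatial size after a 'valid' convolution with a stride of 1."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool_size

size = 28
size = conv_output_size(size, 3)  # first convolution:  28 -> 26
size = pool_output_size(size, 2)  # first max pooling:  26 -> 13
size = conv_output_size(size, 3)  # second convolution: 13 -> 11
size = pool_output_size(size, 2)  # second max pooling: 11 -> 5

# The second convolutional layer has 28 filters, so Flatten produces
# size * size * 28 values for the output layer.
flattened_units = size * size * 28
print(size, flattened_units)  # 5 700
```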


## Question 10

Write the keras code for a convolutional neural network with the following structure: Two convolutional layers.  32 filters in the first layer and 32 in the second.  Activate all convolutional layers with relu.  Use max pooling after each convolutional layer with a 2 by 2 filter.  Add two fully connected layers with 128 hidden units in each layer and relu activations.  The output layer should be built to classify to six categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model and your approach to compile the model.  You will not run it on real data.)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import SGD

model = Sequential()

#First convolutional layer with 32 filters
model.add(Conv2D(32, (3, 3), padding='valid', activation='relu', input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2, 2))) #max pooling with a 2 by 2 filter

#Second convolutional layer with 32 filters (input shape is inferred from the previous layer)
model.add(Conv2D(32, (3, 3), padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2))) #max pooling with a 2 by 2 filter

#Flatten the feature maps into a vector for the fully connected layers
model.add(Flatten())

model.add(Dense(128, activation='relu')) # First fully-connected layer of 128 neurons.
model.add(Dense(128, activation='relu')) # Second fully-connected layer of 128 neurons.

#Output layer set to classify 6 categories
model.add(Dense(6, activation='softmax'))

#Stochastic gradient descent optimizer with a learning rate of 0.0001
sgd = SGD(learning_rate=0.0001)

#Compiling model
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
