<a href="https://colab.research.google.com/github/SalNel97/qmss_python_final/blob/main/sne2114_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**QMSS S5073**
# ***Final***
**Salah El-Sadek (sne2114)**

---

# **Question 1**

*From the perspective of a social scientist, which models did we learn this semester that are useful for ruling out alternative explanations through control variables AND that allow us to observe substantively meaningful information from model coefficients?*

**In general, linear regression (for continuous outcome variables) and logistic regression (for binary outcome variables) are the models most useful to social scientists. Their coefficients can be interpreted directly, which supports realistic interpretations of the relationships between the variables in the model, and adding control variables lets us rule out alternative explanations. These models can be run unpenalized, or an L1 (Lasso) or L2 (Ridge) penalty can be added to the RSS objective to reduce variance for better prediction or to eliminate irrelevant variables from the model (only the Lasso penalty can shrink coefficients exactly to zero and so remove predictors from the final model). It is this ability to interpret coefficients that allows meaningful statistical hypothesis testing, which is generally the main focus of a social scientist.**
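
**As a rough illustration of the difference between the penalties, the sketch below fits unpenalized least squares, Ridge, and Lasso to simulated data using NumPy only. The data, the penalty strengths (`lam`, `alpha`), and the helper `lasso_cd` are assumptions made for this example, not course code; the Lasso is fit with a simple coordinate-descent loop and no intercept.**

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Hypothetical data: y depends on x0 and x1 only; x2 is irrelevant.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Unpenalized least squares: minimizes RSS.
ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge (L2): closed form (X'X + lam*I)^-1 X'y. Shrinks, but does not zero out.
lam = 50.0
ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

def lasso_cd(X, y, alpha, n_iter=200):
    """Lasso (L1) by coordinate descent on (1/2n)*RSS + alpha*||w||_1."""
    n_obs, n_feat = X.shape
    w = np.zeros(n_feat)
    for _ in range(n_iter):
        for j in range(n_feat):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n_obs
            scale = X[:, j] @ X[:, j] / n_obs
            # Soft-thresholding: small effects are set exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / scale
    return w

lasso = lasso_cd(X, y, alpha=0.1)

print("OLS:  ", ols.round(3))
print("Ridge:", ridge.round(3))
print("Lasso:", lasso.round(3))
```

**On data like this, Ridge shrinks the coefficients toward zero without removing any, while the Lasso sets the coefficient of the irrelevant predictor `x2` exactly to zero.**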

**One could also explore the data with decision-tree-based models (bagged trees, random forests, boosted trees) to identify interactions between variables and to see which variables are most important in predicting certain aspects of the data set.**

# **Question 2**

*Describe the main differences between supervised and unsupervised learning.*

**Supervised learning involves models that are trained and tested to predict an outcome Y from a set of X predictors, whereas unsupervised learning works only with the X predictors, without any outcome variable Y to predict.**

**Because of this difference, unsupervised learning is much more open-ended: there is no set goal or prediction score in mind, only the aim of discovering interesting characteristics of the X predictor variables using those variables alone. Supervised learning, on the other hand, is much more directed, as it is governed by the goal of predicting the Y outcome well and of learning which X predictors are involved in those predictions.**

# **Question 3**

*Is supervised or unsupervised learning the primary approach that is used by machine learning practitioners?  For whatever approach you think is secondary, why would you use this approach (what's a good reason to use these kinds of models?)*

**Supervised learning is the primary approach used by machine learning practitioners, as they are mainly concerned with how well a model predicts a given response variable.**

**Secondary to that approach are unsupervised learning methods, which use only the X predictor variables to discover interesting relationships among them. This could involve analysing the data for meaningful clusters or homogeneous groupings, or exploring the most important dimensions/features that explain non-random variation in the data.**

# **Question 4**

*Which unsupervised learning modeling approaches did we cover this semester?  What are the major differences between these techniques?*

**Some of these approaches include Principal Component Analysis (PCA), Manifold Learning (MDS, LLE, or IsoMap), and clustering techniques such as K-Means Clustering and Hierarchical/Agglomerative Clustering.**

**Clustering methods find homogeneous subgroups within the data, either by partitioning the data into a predefined number of 'K' clusters (K-Means), or by working step-wise from leaves to branches, gradually merging data points into groups and subgroups until all points belong to a cluster (hierarchical/agglomerative). The latter method is more flexible in that it does not require a predefined number of 'K' clusters (a parameter that would otherwise have to be tuned). In the end, these clustering techniques are concerned with how homogeneous the data are within each labeled subgroup, and with the similarities/differences between subgroups.**
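
**A minimal NumPy sketch of the K-Means idea (Lloyd's algorithm) on two simulated blobs. The data, the function name `kmeans`, and the farthest-first choice of starting centroids are assumptions made here to keep the example short and deterministic; library implementations typically use random or k-means++ initialization.**

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: two well-separated 2-D blobs of 50 points each.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[8.0, 8.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, n_iter=10):
    """Partition X into k clusters by alternating assignment and update."""
    # Farthest-first initialization: start at X[0], then repeatedly add
    # the point farthest from the centroids chosen so far.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))
print(centroids.round(2))
```

**Note that `k` has to be supplied up front, which is the tuning burden that hierarchical clustering avoids.**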

**PCA is concerned with reducing the number of dimensions of the data to the minimum required to explain the majority of the variance in the data points. Because it is concerned only with dimensionality reduction, PCA, in comparison to clustering techniques like K-Means, does not particularly require a priori knowledge to set its parameters, unlike choosing the number of 'K' clusters.**
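
**To make "the minimum number of dimensions that explains most of the variance" concrete, here is a small sketch of PCA computed from the singular value decomposition in NumPy. The simulated data (five features driven by two latent factors) and the 95% threshold are assumptions chosen for illustration.**

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations, 5 features, 2 underlying factors.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.05 * rng.normal(size=(200, 5))

# PCA: center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance explained by each component.
explained = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained)

# Smallest number of components reaching 95% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto the retained components.
scores = Xc @ Vt[:n_keep].T

print(explained.round(4))
print("components kept:", n_keep, "-> reduced shape:", scores.shape)
```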

**Manifold learning is a useful alternative to PCA in that it performs well at detecting nonlinear relationships in the data. However, some disadvantages leave PCA the more viable option: manifold learning techniques cannot handle missing data, typically need a pre-set number of neighbors for their distance-matrix calculations, and cannot filter out noise automatically the way PCA can.**

# **Question 5**

*What are the main benefits of using Principal Components Analysis?*

**Given PCA's fundamental ability to reduce data to fewer dimensions while still explaining a good portion of the variability in the data, it has multiple uses:**
* **Focusing on the most important feature variables in the data set for future supervised learning/hypothesis testing.**

* **Filtering noise out of the data as a preprocessing step: keep the largest components (the signal), discard the small-variance components (the noise), and reconstruct the data from the retained components by inverting the PCA transform.**

* **Visualizing data with a very high number of dimensions, such as image data, and effectively filtering noise out of such high-dimensional image data.**

**In general, PCA is very effective at focusing on the most important features/dimensions (for various reasons), but especially when we talk about high-dimensional data.**
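
**The noise-filtering use can be sketched in a few lines of NumPy. The example below is an illustration under assumed conditions: the simulated signal lies on a 2-dimensional plane inside a 20-dimensional space, and the number of retained components `k = 2` is taken as known.**

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical signal: 300 samples on a 2-D plane embedded in 20-D space.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 20))
clean = latent @ mixing
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# PCA via SVD on the centered noisy data.
mean = noisy.mean(axis=0)
U, S, Vt = np.linalg.svd(noisy - mean, full_matrices=False)

# Keep the k largest components and invert the transform.
k = 2
denoised = (U[:, :k] * S[:k]) @ Vt[:k] + mean

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print("MSE before filtering:", round(err_noisy, 4))
print("MSE after filtering: ", round(err_denoised, 4))
```

**Dropping the 18 low-variance components removes most of the added noise, because the noise is spread across all 20 directions while the signal occupies only two.**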

# **Question 6**

*Thinking about neural networks, what are three major differences between a deep multilayer perceptron network and a convolutional neural network model?  Be sure to define any key terms in your explanation.*

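* **One major difference between a deep multilayer perceptron network (DMPN) and a CNN lies in the kinds of layers they stack. Both are feedforward networks: the connections from the input layer through the hidden layers to the output layer do not form a loop, so information flows in one direction only, from the input through the hidden layers to the prediction at the final output layer (looping connections are a feature of recurrent networks, not of CNNs). A DMPN is built entirely from fully connected layers, whereas a CNN places convolutional layers, which slide small filters across the input, and pooling layers, which downsample the resulting feature maps, ahead of any fully connected layers.**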

* **Another difference is that a DMPN uses 'dense', fully connected layers, in which each node is connected to every node in the adjacent layer, while a CNN has sparsely connected layers, in which each node connects only to a small local region of the previous layer. This greatly reduces the total number of parameters in the network and the redundancies inherent in high-dimensional data. It makes a CNN more efficient: it can predict its outputs (images, probability distributions, etc.) with far fewer weights than a DMPN would need for the same input.**

* **A CNN can read its inputs as matrices (for example, a height by width by channels image) as well as vectors, while a DMPN can only take vector inputs, so an image must be flattened first. As a consequence, a DMPN has no built-in notion of spatial structure when identifying patterns: given its dense, fully connected layers, each weight is tied to one specific pixel position. With matrix inputs, a CNN instead 'shares' its feature detectors across all spatial positions of an image. A CNN is a "smarter" image prediction method in that it does not treat every input value independently, but exploits the fact that certain shape/edge patterns recur in different areas of the image. This makes CNNs more efficient and useful for very high-dimensional, complex image or audio data, for instance.**
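
**The idea of shared feature detectors can be shown with a hand-written convolution in NumPy. This is only a sketch: the helper `conv2d_valid`, the 2 by 2 edge-detecting kernel, and the toy 6 by 6 images are assumptions for illustration, not Keras internals.**

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a 2-D kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical-edge detector: fires where the right column is brighter.
kernel = np.array([[-1.0, 1.0],
                   [-1.0, 1.0]])

image = np.zeros((6, 6))
image[:, 2:] = 1.0        # dark-to-bright edge between columns 1 and 2
shifted = np.zeros((6, 6))
shifted[:, 4:] = 1.0      # the same edge, moved two columns to the right

resp = conv2d_valid(image, kernel)
resp_shifted = conv2d_valid(shifted, kernel)
print(resp.argmax(axis=1))          # column of strongest response per row
print(resp_shifted.argmax(axis=1))

# Weights needed to map the 6x6 input to a 5x5 output (biases ignored).
conv_weights = kernel.size               # shared filter
dense_weights = image.size * resp.size   # one weight per input-output pair
print(conv_weights, dense_weights)
```

**The single 4-weight filter finds the edge wherever it appears, whereas a dense layer producing the same 5 by 5 output would need a separate weight for every input-output pair.**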

In [None]:
%matplotlib inline

import warnings
warnings.simplefilter("ignore", UserWarning)

# Importing relevant libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import make_pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

from sklearn.pipeline import Pipeline


from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, BatchNormalization, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.layers import Conv2D, MaxPooling2D  # public Keras API rather than the private tensorflow.python path
import tensorflow as tf

# **Question 7**

*Write the keras code for a multilayer perceptron neural network with the following structure: Three hidden layers.  50 hidden units in the first hidden layer, 100 in the second, and 150 in the third.  Activate all hidden layers with relu.  The output layer should be built to classify to five categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model.  You will not run it on real data.)*

In [None]:
model1 = Sequential() 
model1.add(Dense(units = 50, activation = 'relu', input_dim = 7))
model1.add(Dense(units = 100, activation = 'relu'))
model1.add(Dense(units = 150, activation = 'relu'))
model1.add(Dense(units = 5, activation = 'softmax'))

# Compile
# Five-class softmax output, so the loss is categorical (not binary) cross-entropy
model1.compile(loss='categorical_crossentropy', optimizer = 'sgd', metrics=['accuracy']) #Could use AUC instead of accuracy
  
model1.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 50)                400       
_________________________________________________________________
dense_1 (Dense)              (None, 100)               5100      
_________________________________________________________________
dense_2 (Dense)              (None, 150)               15150     
_________________________________________________________________
dense_3 (Dense)              (None, 5)                 755       
Total params: 21,405
Trainable params: 21,405
Non-trainable params: 0
_________________________________________________________________
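
**The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces them; the helper name `dense_params` is introduced here for illustration, and the input size of 7 is the placeholder `input_dim` used in the cell above.**

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input size followed by the width of each Dense layer.
layer_sizes = [7, 50, 100, 150, 5]

counts = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [400, 5100, 15150, 755]
print(total)   # 21405
```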


# **Question 8**

*Write the keras code for a multilayer perceptron neural network with the following structure: Two hidden layers.  75 hidden units in the first hidden layer and 150 in the second.  Activate all hidden layers with relu.  The output layer should be built to classify a binary dependent variable.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model.  You will not run it on real data.)*

In [None]:
model2 = Sequential() 
model2.add(Dense(units = 75, activation = 'relu', input_dim = 7))
model2.add(Dense(units = 150, activation = 'relu'))
model2.add(Dense(units = 1, activation = 'sigmoid'))

# Compile
model2.compile(loss='binary_crossentropy', optimizer = 'sgd', metrics=['accuracy']) #Could use AUC instead of accuracy
  
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_4 (Dense)              (None, 75)                600       
_________________________________________________________________
dense_5 (Dense)              (None, 150)               11400     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 151       
Total params: 12,151
Trainable params: 12,151
Non-trainable params: 0
_________________________________________________________________
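
**For reference, the two kinds of output layer used in these models can be written out by hand. The NumPy sketch below is illustrative only (the helper names `sigmoid` and `softmax` and the example logits are assumptions): a single sigmoid unit pairs with binary cross-entropy, while a softmax over several classes pairs with categorical cross-entropy.**

```python
import numpy as np

def sigmoid(z):
    """Map one logit to a probability for the positive class."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Map a vector of logits to probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Binary output: one unit, one probability.
p = sigmoid(0.0)                    # an uninformative logit gives 0.5
y = 1                               # true label
bce = float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Five-class output: five units, one probability per class.
probs = softmax(np.zeros(5))        # equal logits give 0.2 each
onehot = np.eye(5)[2]               # true class is index 2
cce = float(-np.sum(onehot * np.log(probs)))

print(round(bce, 4))  # 0.6931, i.e. log(2)
print(round(cce, 4))  # 1.6094, i.e. log(5)
```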


# **Question 9**

*Write the keras code for a convolutional neural network with the following structure: Two convolutional layers.  16 filters in the first layer and 28 in the second.  Activate all convolutional layers with relu.  Use max pooling after each convolutional layer with a 2 by 2 filter.  The output layer should be built to classify to ten categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model.  You will not run it on real data.)*

In [None]:
model3 = Sequential()
model3.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=[140, 140, 3]))
model3.add(MaxPooling2D(pool_size = (2, 2)))

model3.add(Conv2D(filters=28, kernel_size=2, padding='same', activation='relu'))
model3.add(MaxPooling2D(pool_size = (2, 2)))

model3.add(Flatten())
model3.add(Dense(10, activation='softmax'))

# Compile 
model3.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

model3.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 140, 140, 16)      208       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 70, 70, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 70, 70, 28)        1820      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 35, 35, 28)        0         
_________________________________________________________________
flatten (Flatten)            (None, 34300)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                343010    
Total params: 345,038
Trainable params: 345,038
Non-trainable params: 0
________________________________________________
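
**The shapes and parameter counts in this summary follow from simple arithmetic, sketched below. The helper names are introduced here for illustration; the kernel size of 2, the `'same'` padding, and the 140 by 140 by 3 input are the values used in the cell above.**

```python
def conv2d_params(kernel_size, in_channels, filters):
    """One kernel_size x kernel_size x in_channels filter per output channel, plus biases."""
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

height = width = 140                      # input_shape=[140, 140, 3]

conv1 = conv2d_params(2, 3, 16)           # 'same' padding keeps 140 x 140
height, width = height // 2, width // 2   # 2 x 2 max pooling -> 70 x 70

conv2 = conv2d_params(2, 16, 28)          # still 70 x 70
height, width = height // 2, width // 2   # 2 x 2 max pooling -> 35 x 35

flat = height * width * 28                # Flatten: 35 * 35 * 28
dense = dense_params(flat, 10)
total = conv1 + conv2 + dense

print(conv1, conv2, flat, dense)  # 208 1820 34300 343010
print(total)                      # 345038
```

**Almost all of the parameters sit in the final dense layer, since the convolutional filters are small and shared across positions.**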

# **Question 10**

*Write the keras code for a convolutional neural network with the following structure: Two convolutional layers.  32 filters in the first layer and 32 in the second.  Activate all convolutional layers with relu.  Use max pooling after each convolutional layer with a 2 by 2 filter.  Add two fully connected layers with 128 hidden units in each layer and relu activations.  The output layer should be built to classify to six categories.  Further, your optimization technique should be stochastic gradient descent.  (This code should simply build the architecture of the model.  You will not run it on real data.)*

In [None]:
model4 = Sequential()
model4.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu', input_shape=[140, 140, 3]))
model4.add(MaxPooling2D(pool_size = (2, 2)))

model4.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
model4.add(MaxPooling2D(pool_size = (2, 2)))

model4.add(Flatten())
model4.add(Dense(128, activation='relu'))
model4.add(Dense(128, activation='relu')) 

model4.add(Dense(6, activation='softmax')) 

# Compile 
model4.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

model4.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_2 (Conv2D)            (None, 140, 140, 32)      416       
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 70, 70, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 70, 70, 32)        4128      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 35, 35, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 39200)             0         
_________________________________________________________________
dense_8 (Dense)              (None, 128)               5017728   
_________________________________________________________________
dense_9 (Dense)              (None, 128)              