# Assignment Overview

Links to the notes discussed in the video
* [Model Selection Overview](./ModelSelect.pdf)
* [Model Types](./ModelType.pdf)
* [Model Decision Factors](./ModelDecisionFactors.pdf)
* [Generalization Techniques](./Generalization.pdf)

The assignment consists of two parts requiring you to select appropriate models with associated code/text.

1. Determine the challenge and a relevant model for two distinct situations (fill out this notebook).
1. Address the data-handling code and the model needed for [car factors](./CarFactors/carfactors.ipynb), contained in the subdirectory CarFactors.

* ***Check the rubric in Canvas*** to make sure you understand the requirements and the associated grading weights.

# Part 1: Speed Dating Model Selection

You are to explore the speed dating data set and construct two models that provide some insight, such as groupings or predictions. The models must come from different model areas, such as the categories listed in the [ModelTypes](./ModelTypes.pdf) document. You must justify your answers considering the data and the prediction value.

The data is contained in [SpeedDatingData.csv](SpeedDatingData.csv). The values are detailed in [SpeedDatingKey.md](./SpeedDatingKey.md). The directory also contains the original key document, SpeedDatingDataKey.docx, but JupyterLab is unable to render it. You are free to render it outside of JupyterLab if something didn't translate clearly. The open source tool [pandoc](https://pandoc.org/installing.html) was used to perform the translation. It is useful for almost any document conversion and works on all major operating systems.

# Model 1

## Outline the challenge 

The challenge here is to take data from speed dating statistics and use it to predict compatibility between two partners. There are many ways to approach this problem, but in this case a decision process based on given factors is a good way to investigate how those factors relate to the decisions people make. The problem addressed in this first model is therefore based on decisions about personality types and career fields. People often reason along the lines of "I would only date someone if they worked in the service industry", which leads to thinking categorically about whether a specific partner would be receptive to them, and vice versa. This approach to the challenge follows a decision tree path to estimate the likelihood of a successful date.

Success in this model is a categorical yes or no match that leads to a successful speed dating experience. It would be measured against randomly assigned speed dating pairs. In other words, the match ratio achieved by using specific categories should be an improvement over the ratio achieved by assigning partners at random.
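The comparison against random matching described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, with an arbitrary class imbalance), and a shallow decision tree stands in as a placeholder for whatever model is eventually chosen. scikit-learn's `DummyClassifier` with `strategy="stratified"` plays the role of "random assignment".

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the speed dating table (NOT the real data);
# the class imbalance is an arbitrary choice for illustration.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# "Random matching": guess labels at random, respecting the observed class ratio.
baseline = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)

# Placeholder candidate model; any classifier could be swapped in here.
candidate = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
candidate_acc = candidate.score(X_test, y_test)
print(f"random baseline accuracy: {baseline_acc:.3f}")
print(f"candidate model accuracy: {candidate_acc:.3f}")
```

If the candidate does not clearly beat the baseline on held-out data, the features or the model would need to be revisited.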

### Select the features and their justification 

Here are some useful features for this challenge:

* Match -- the label used to train the model: 1 is a successful match and 0 is not. It is used for training and for evaluating predictions.
* Gender -- on average, there is some bias in which gender a person would choose to say yes to.
* Age -- people typically have a preferred age range when it comes to dating.
* Field of study and field of work -- personality types tend to cluster by field of study or work, which could help determine compatibility in a match. The two features serve similar purposes, which is why they are grouped together, although cases where the two do not align may also be informative.
* imprelig -- many religious people find it important to date those who share similar values, and a mismatch can be a deal breaker, so this should correlate well with the outcome.
* race and imprace -- these two go together: race combined with how important race is to the participant helps indicate how much it weighs on the decision.
* Interested activities -- shared interests matter for compatibility and give a potential match ideas and activities to share.

Unfortunately, the attribute section, while potentially useful, is broken up across a series of events that left a lot of nulls in the data. That could be partly repaired, but the gaps are large enough that the section cannot be used effectively. The features listed above are the ones I predict would be a good fit for the model.



### Note necessary feature processing such as getting rid of empty cells etc.

For this particular model, less needs to be done, but some processing is still required.

Reduce the data to the features that have been selected.

Check the selected features for null values, and either fill them with a placeholder such as the column average or remove the row entirely so that it does not skew the data.

Remove rows that are mostly empty so that they do not corrupt the data.

Investigate outliers, such as someone who entered a number well outside the range allowed by the question, or who made up a value that is not realistic and would overcomplicate the model.
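The steps above can be sketched with pandas. This is a minimal illustration, not the actual pipeline: the frame is a tiny synthetic example, the column names are assumptions modelled on the key file, and the valid ranges (ages 18 to 60, ratings 1 to 10) and the "more than half missing" threshold are assumed for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "match":    [1, 0, 0, 1, 0, 0],
    "gender":   [0, 1, 0, 1, 0, 1],
    "age":      [24, 27, np.nan, 30, 250, np.nan],   # 250 is an obvious outlier
    "imprelig": [3, np.nan, 5, 1, 7, np.nan],
    "imprace":  [2, 4, 6, np.nan, 1, np.nan],
    "notes":    ["a", "b", "c", "d", "e", "f"],      # not a selected feature
})

# 1. Reduce to the selected features.
selected = ["match", "gender", "age", "imprelig", "imprace"]
df = df[selected].copy()

# 2. Drop rows that are missing more than half of the selected features.
df = df[df.isna().mean(axis=1) <= 0.5].copy()

# 3. Treat out-of-range answers as missing (assumed valid ranges).
df.loc[~df["age"].between(18, 60), "age"] = np.nan
for col in ["imprelig", "imprace"]:
    df.loc[~df[col].between(1, 10), col] = np.nan

# 4. Fill the remaining gaps with each column's median.
df = df.fillna(df.median())
print(df)
```

The median is used here rather than the mean because it is less sensitive to any extreme values that remain; either choice is a judgment call.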

### Model Selection

Outline the rationale for selecting the model noting how its capabilities address your challenge

The model chosen for the first challenge is a random forest. It is a powerful ensemble of decision trees that incorporates pseudo-randomness (each tree sees a random sample of the rows and a random subset of the features) to reduce overfitting as far as it can. Because of that, it requires little tuning of either the model or the features, especially with the reduced feature set selected above.

Another benefit is that dating decisions are often fairly categorical, such as only dating people in a certain career field, with a certain outlook on life, or with other specific characteristics. That maps directly onto a decision tree, where each split sends a case one direction or another depending on the data. In addition, the random forest model is very easy to set up.
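To make the ensemble idea concrete, here is a small sketch that cross-validates a single fully grown decision tree and a random forest on the same data. The data is synthetic and the settings are arbitrary choices for illustration, so the numbers say nothing about the speed dating data itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (NOT the speed dating data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)                       # one unpruned tree
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # many randomized trees

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

print(f"single tree:   {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

Comparing held-out scores like this is one way to check whether the averaging over randomized trees actually helps on a given data set.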

In [1]:
# Enter python code of constructing your selected model  - CODE REQUIRED! (only the model creation)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)

# Model 2

## Outline the challenge

The challenge of selecting a model for speed dating is that there is typically a lot of subjective data, with differing attribute weights, priorities, goals, and so on, which makes predicting a match difficult. However, there do seem to be patterns and clusters of people whose subjective views are similar, and these can correlate with outcomes. Even among people with seemingly very different views, there may be some correlation that leads to a successful match.

A successful model here would give a percentage chance of a successful match, and would be tested against a randomly matched speed dating group. If the model cannot outperform random matching, there is no reason to implement it.
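For a classifier that outputs raw class scores (logits), one common way to obtain the "percentage chance" mentioned above is to apply a softmax over the class dimension. The sketch below uses a hypothetical tiny untrained network with made-up sizes (4 input features, 2 classes), so the probabilities themselves are meaningless; it only shows the mechanics.

```python
import torch

torch.manual_seed(0)

# Hypothetical network: 4 input features, 2 output classes (no match / match).
net = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.Sigmoid(),
    torch.nn.Linear(8, 2),
)
X = torch.rand(3, 4)          # three made-up feature rows

net.eval()                    # inference mode
with torch.no_grad():
    logits = net(X)                        # raw scores, shape (3, 2)
    probs = torch.softmax(logits, dim=1)   # each row now sums to 1

match_prob = probs[:, 1]      # estimated probability of class 1 ("match")
print(match_prob)
```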

### Select the features and their justification

* Match -- the label used to train the model: 1 is a successful match and 0 is not. It is used for training and for evaluating predictions.
* Gender -- on average, there is some bias in which gender a person would choose to say yes to.
* Age -- people typically have a preferred age range when it comes to dating.
* Field of study and field of work -- personality types tend to cluster by field of study or work, which could help determine compatibility in a match. The two features serve similar purposes, which is why they are grouped together, although cases where the two do not align may also be informative.
* imprelig -- many religious people find it important to date those who share similar values, and a mismatch can be a deal breaker, so this should correlate well with the outcome.
* race and imprace -- these two go together: race combined with how important race is to the participant helps indicate how much it weighs on the decision.
* Interested activities -- shared interests matter for compatibility and give a potential match ideas and activities to share.
* attribute section 5 -- this section records how a person thinks others view them, which could help reveal patterns in matches.
* go out -- how often a person goes out can indicate extroverted or introverted tendencies, which have a lot to do with compatibility.
* int_corr -- correlated interests are often a sign of compatibility and potentially a higher chance of a successful match.

Unfortunately, there is other good data that cannot be used. Some of it is rated on different scales, such as the attribute sections that differ between waves; including them would overcomplicate the model and overfit it to the specific groups of people in particular waves. Other data was collected after the speed date, by which point opinions and decisions have typically already been made.

### Note necessary feature processing such as getting rid of empty cells etc.

This model requires a few significant changes.

Numeric features need to be scaled and normalized so that no feature has a higher than normal impact on the prediction.

Categorical data needs to be turned into numbers by one-hot encoding.

Blanks will have to be either filled or deleted, since the network cannot train on missing values.

Many features will have to be removed, or there is a risk of overfitting or an excessive compute cost.
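As a rough sketch of the encoding and scaling steps, the example below one-hot encodes a categorical column and standardizes the numeric ones with pandas. The frame is synthetic and `field_cd` is an assumed column name used only for illustration; missing values are assumed to have been handled already.

```python
import pandas as pd

# Synthetic example; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age":      [22.0, 25.0, 28.0, 31.0],
    "int_corr": [0.10, -0.20, 0.50, 0.30],
    "field_cd": ["law", "math", "law", "art"],
})
numeric_cols = ["age", "int_corr"]
categorical_cols = ["field_cd"]

# One-hot encode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=float)

# Standardize numeric columns to zero mean and unit variance.
means = encoded[numeric_cols].mean()
stds = encoded[numeric_cols].std(ddof=0)
encoded[numeric_cols] = (encoded[numeric_cols] - means) / stds

print(encoded)
```

In a real workflow the means and standard deviations would be computed on the training split only and then reused for the test split, so that no information leaks from the test data.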

### Model Selection

Outline the rationale for selecting the model noting how its capabilities address your challenge

The model used here is a deep neural network. Neural networks are exceptionally good at picking up groupings and similarities that a person cannot recognize by looking at the data. Once the network has been trained on the selected features, it can separate successful matches from unsuccessful ones and (hopefully) give a prediction that outperforms matching people at random. The network will be deep because people make many seemingly subjective and random decisions in dating, which would be hard to capture with a small network or a conventional clustering method.

In [2]:
# Enter python code of constructing your selected model  - CODE REQUIRED! (only the model creation)
import numpy as np # used for shuffling mini-batches
import torch

class PyTorchDeepMLP(torch.nn.Module): # Deep MLP
    def __init__(self, n_hidden=10, n_layers=2, epochs=100, eta=0.001, minibatch_size=50, seed=0):
        super(PyTorchDeepMLP, self).__init__()
        self.random = np.random.RandomState(seed) # shuffle mini batches
        self.n_hidden = n_hidden # size of the hidden layer
        self.n_layers = n_layers # number of hidden layers
        self.epochs = epochs # number of iterations
        self.eta = eta # learning rate
        self.minibatch_size = minibatch_size # size of training batch - 1 would not work
        self.optimizer = None
        self.loss_func = torch.nn.CrossEntropyLoss()
        self.model = None

    def init_layers(self, _M:int, _K:int) -> None:
        # data structure
        layers = [torch.nn.Linear(_M, self.n_hidden), torch.nn.Sigmoid()]
        for _ in range(self.n_layers - 1):
            layers.extend([torch.nn.Linear(self.n_hidden, self.n_hidden), torch.nn.Sigmoid()])
        layers.append(torch.nn.Linear(self.n_hidden, _K))
        self.model = torch.nn.Sequential(*layers)
    
    def predict(self, _X):
        _X = torch.FloatTensor(_X)
        assert self.model is not None
        self.model.eval()
        with torch.no_grad():
            y_pred = torch.argmax(self.model(_X), dim=1) # index of the highest-scoring class
        self.model.train()
        return y_pred.numpy()

    def fit(self, _X_train, _y_train, info=False):
        import sys
        _X_train, _y_train = torch.FloatTensor(_X_train), torch.LongTensor(_y_train)
        n_features= _X_train.shape[1]
        n_output = torch.unique(_y_train).shape[0] # number of class labels
        
        self.init_layers(n_features, n_output)
        self.optimizer = torch.optim.Rprop(self.model.parameters(), lr=self.eta) # connect model to optimizer

        for i in range(self.epochs):
            indices = np.arange(_X_train.shape[0])
            self.random.shuffle(indices) # shuffle the data each epoch

            for start_idx in range(0, indices.shape[0] - self.minibatch_size + 1, self.minibatch_size):
                batch_idx = indices[start_idx:start_idx + self.minibatch_size]
                self.optimizer.zero_grad()
                
                net_out = self.model(_X_train[batch_idx])
                
                loss = self.loss_func(net_out, _y_train[batch_idx])
                loss.backward()
                self.optimizer.step()
                
                if info:
                    sys.stderr.write(f"\r{i+1:03d} Loss: {loss.item():6.5f}")
                    sys.stderr.flush()
        return self

Dropout could be added to reduce overfitting, should it occur.

In [3]:
class dropoutMLP(PyTorchDeepMLP): # Deep MLP with dropout after each hidden layer
    def __init__(self, n_hidden=10, n_layers=3, epochs=100, eta=0.001, minibatch_size=50, seed=0, dropout_rate=0.5):
        # Let the parent constructor set the shared hyperparameters, optimizer, and loss
        super().__init__(n_hidden=n_hidden, n_layers=n_layers, epochs=epochs, eta=eta,
                         minibatch_size=minibatch_size, seed=seed)
        self.dropout_rate = dropout_rate # probability of zeroing a hidden activation

    def init_layers(self, _M:int, _K:int) -> None:
        # data structure
        layers = [torch.nn.Linear(_M, self.n_hidden), torch.nn.Sigmoid(), torch.nn.Dropout(self.dropout_rate)]
        for _ in range(self.n_layers - 1):
            layers.extend([torch.nn.Linear(self.n_hidden, self.n_hidden), torch.nn.Sigmoid(), torch.nn.Dropout(self.dropout_rate)])
        layers.append(torch.nn.Linear(self.n_hidden, _K))
        
        self.model = torch.nn.Sequential(*layers)
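
As a brief sketch of why `predict` switches the model to evaluation mode, the example below shows that a `torch.nn.Dropout` layer randomly zeroes activations in training mode but passes them through unchanged in evaluation mode. The layer sizes and input are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

# One hidden block of the same shape as in the dropout network above (sizes are arbitrary).
block = torch.nn.Sequential(
    torch.nn.Linear(5, 5),
    torch.nn.Sigmoid(),
    torch.nn.Dropout(0.5),
)
x = torch.rand(4, 5)

block.train()
train_out = block(x)   # some activations zeroed, survivors scaled by 1 / (1 - p)

block.eval()
with torch.no_grad():
    eval_a = block(x)  # dropout is a no-op in eval mode
    eval_b = block(x)

print("zeros in training output:", int((train_out == 0).sum()))
print("eval outputs identical:", torch.equal(eval_a, eval_b))
```

Because dropout is only active during training, predictions made in evaluation mode are deterministic for a given input.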