<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<center>
    <h1><font color="red">Supervised Machine Learning & Parallelization</font></h1>
</center>

## <font color="blue"> Machine Learning Algorithms</font>

Algorithms can generally be placed into one of three categories:

##### Supervised learning 
    - f(X) = Y
    - e.g. regression, classification
    - Decision trees, Support Vector Machines, Neural Networks, ...
    
    
##### Unsupervised learning 
    - f(X) = X'
    - e.g. clustering, dimensionality reduction
    - K-Means, Expectation Maximization, Principal component analysis, ...
    
##### Reinforcement learning 
    - f(S, A, T, R) = π
    - e.g. optimal policy for Markov decision processes 
    - Q-Learning, SARSA, Temporal Differencing, ...

Some algorithms fit into more than one category, e.g. semi-supervised learning algorithms.

![fig_ML](https://www.pngitem.com/pimgs/m/346-3460573_types-of-machine-learning-machine-learning-types-of.png)
Image Source: www.pngitem.com

Here we will use **supervised learning**

In [None]:
from IPython.display import display
import os
import pandas as pd 
import numpy as np
pd.set_option('display.max_rows', 7)

## <font color="blue">Description of the Dataset</font>

*Open University Learning Analytics Data (OULAD):* https://archive.ics.uci.edu/ml/datasets/Open+University+Learning+Analytics+dataset

- Information about students and their interactions with a virtual learning environment.

In [None]:
assess_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/assessments.csv"
courses_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/courses.csv"
stdassess_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/studentAssessment.csv"
stdinfo_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/studentInfo.csv"
stdregi_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/studentRegistration.csv"
vle_fname = "https://raw.githubusercontent.com/aishwaryar7/Open-University-Learning-Analysis/master/anonymisedData/vle.csv"
stdvle_fname = "datasets/student_vle.csv"
list_files = [assess_fname, courses_fname, stdassess_fname, 
              stdinfo_fname, stdregi_fname, stdvle_fname, vle_fname]

In [None]:
for filename in list_files:
    reader = pd.read_csv(filename)  
    print('%s: %i samples, %i features' % (os.path.basename(filename), reader.shape[0], reader.shape[1]))
    print('\t','\n\t'.join(reader.columns),'\n')

There are:
- 32,953 student records (potentially with duplicate students)
- 22 courses
- 6,364 VLEs ("virtual learning environment" - web page, essentially)
- 10,655,280 VLE student interactions 
- 206 assessments 
- 173,912 assessment results from students

**Objective:** predict a student's final result (distinction, fail, pass, withdraw)

This means there are 32,953 samples we can learn from - one per student. 
So we have 11 features from studentInfo (excluding id_student) and 1 (date_registration) from studentRegistration. What about the others? It is unclear exactly how we might represent the remaining data to an algorithm.

There are also some fields that would be "cheating" to know a priori:
- date_unregistration: only filled for students who have withdrawn
- assessment scores: having all of these with the weights gives the final result itself
- clicks on a VLE? Depends on the point in the course at which you want the classifier to make its prediction
    
In practice, the fields you can use are determined by what the algorithm has available at test time. If you want to predict student grades based on their information at registration, only the 12 previously mentioned fields can be used.

Intuitively, it should be very difficult to predict a student's final result based solely on that information...but let's give it a shot.

In [None]:
info = pd.read_csv(stdinfo_fname)
display(info)

In [None]:
regr = pd.read_csv(stdregi_fname)
display(regr)

- According to the dataset's information, **date_registration** is the number of days relative to the start of the actual course - meaning it's already preprocessed for us!
- We now need to combine it into one set. 
- Note that the **id_student** fields are in the same order. Otherwise, we would need to match the rows on their key columns rather than simply placing the two tables side by side.
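
If the rows were not already aligned, a key-based merge would be one way to combine the tables safely. The sketch below uses two made-up miniature frames (`info_small` and `regr_small` are hypothetical, not the real files) and assumes that the three columns `code_module`, `code_presentation` and `id_student` together identify a row:

```python
import pandas as pd

# Hypothetical miniature versions of studentInfo and studentRegistration,
# deliberately stored in different row orders
info_small = pd.DataFrame({
    "code_module": ["AAA", "AAA", "BBB"],
    "code_presentation": ["2013J", "2013J", "2014B"],
    "id_student": [11, 22, 11],
    "gender": ["M", "F", "M"],
})
regr_small = pd.DataFrame({
    "code_module": ["BBB", "AAA", "AAA"],
    "code_presentation": ["2014B", "2013J", "2013J"],
    "id_student": [11, 22, 11],
    "date_registration": [-30, -5, -100],
})

# Match rows on the key columns; validate= raises if a key appears twice on either side
keys = ["code_module", "code_presentation", "id_student"]
merged = pd.merge(info_small, regr_small, on=keys, how="left", validate="one_to_one")
print(merged)
```

A left merge keeps the row order of the first frame, so each student ends up with the matching registration date regardless of how the second file was sorted.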

Verify that the student ids match in both datasets.

In [None]:
assert(info['id_student'].equals(regr['id_student']))

Add the date of registration to the student information dataframe to create a new one.

In [None]:
data = info.join(regr['date_registration'])
display(data)

- The data above uses text data for the majority of features. 
- For any algorithm to work, we need to convert these into numerical representations. 

So, best way to do that? 

- Simplest way is to just assign a number to each unique value in a feature. 
- However, that implies an ordering on the feature. 
- For some, e.g. **age_band**, that may make sense; but what would it mean to say **Scotland > North Western Region**, or that **Yorkshire Region** comes before the **South Region**? This is the difference between ordinal and nominal features.

For these types of features - those without a natural ordering - we use a one-hot encoding vector. For example, a feature set:

```python
[Red, Green, Red, Blue]
```
might be changed to
```python
[[1 0 0]
 [0 1 0]
 [1 0 0]
 [0 0 1]]
```
Each column therefore represents the presence or lack of a certain attribute.
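
The mapping above can be sketched in a few lines of plain Python. The helper `one_hot` is a hypothetical name used only for illustration; it keeps the columns in first-seen order so that the output matches the matrix shown, whereas `pd.get_dummies` usually orders its columns alphabetically.

```python
def one_hot(values):
    """One-hot encode a list of labels, with columns in first-seen order."""
    categories = list(dict.fromkeys(values))  # unique labels, order preserved
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

categories, encoded = one_hot(["Red", "Green", "Red", "Blue"])
print(categories)
for row in encoded:
    print(row)
```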

Let us try to convert a categorical variable into dummy/indicator variables using the `get_dummies` function:

In [None]:
code_module = pd.get_dummies(data['code_module'])
display(code_module)

For each column in the dataframe, print the number of unique values.

In [None]:
for col in data:
    print("{}: {}".format(col, data[col].nunique()))

In [None]:
print('\n'+'\n'.join(['%s: %i' % (col, data[col].nunique()) for col in data]))

+ While potentially subjective, we want to label **code_module**, **code_presentation**, **gender**, **region**, and **disability** as nominal values. 
+ This gives us (7 + 4 + 2 + 13 + 1 + 1 + 1 + 1 + 1 + 2 + 1 + 1) = 35 features. 
+ We then also need to relabel the ordinal values with a logical ordering.

In [None]:
ordinal = ['highest_education', 'imd_band', 'age_band', 'final_result']

for col in ordinal:
    print("{}: {}".format(col, data[col].unique()))      

- `?` doesn't have an ordering against the others. 
- One way to take care of it (other than classifying as nominal) is to drop those 1111 samples. 
- Another option is to label them as NaN if the chosen ML algorithm can handle these values

In [None]:
for col in data.columns:
    if '?' in set(data[col]):
        print('Column %s has %s unknown values' % (col, data[col].value_counts()['?']))        

**Create a new dataframe where new columns for nominal and ordinal values are added.**

In [None]:
def clean_data(data):
    numeric_keys = ['num_of_prev_attempts', 'studied_credits', 'date_registration']
    nominal_keys = ['code_module', 'code_presentation', 'region', 'disability', 'gender']
    ordinal_keys = {'highest_education': ['No Formal quals', 'Lower Than A Level', 'A Level or Equivalent', 
                                          'HE Qualification', 'Post Graduate Qualification'],
                   'imd_band': ['0-10%', '10-20', '20-30%', '30-40%', '40-50%', '50-60%',
                                '60-70%', '70-80%', '80-90%', '90-100%'],
                   'age_band': ['0-35', '35-55', '55<='],
    #                 'gender':['M', 'F'],
                   'final_result': ['Fail', 'Withdrawn', 'Pass', 'Distinction']}
    
    # Get rid of unknown values
    data = data.drop('id_student', axis=1).replace({'?':np.nan})

    all_keys = numeric_keys + nominal_keys + list(ordinal_keys.keys()) 
    remaining_keys = set(data.columns) - set(all_keys)
    numeric_keys += list(remaining_keys)
    
    def to_ord(x, order):
        """
           Create a categorical data type. 
           https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
        """
        y = pd.Categorical(x, categories=order)
        return y.codes

    # Numeric columns
    numeric = data[numeric_keys]

    # One-hot encoded nominal columns
    nominal = pd.get_dummies(data[nominal_keys])

    # Set the ordinal columns as a 'category' type and retrieve the resulting conversion
    # Can only do this with series, so have to perform column-by-column and then reform 
    # into a dataframe
    ordinal = np.array([to_ord(data[o], ordinal_keys[o]) for o in sorted(ordinal_keys)])
    ordinal = pd.DataFrame(ordinal.T, columns=sorted(ordinal_keys))

    # Join the dataframes together, drop nan values, convert to integers, and reset the index back to default
    final_data = numeric.join(nominal).join(ordinal).dropna().astype(np.int32).reset_index(drop=True)
    return final_data
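
One detail of `to_ord` may be worth verifying before relying on the `dropna()` call: `pd.Categorical(...).codes` encodes missing entries as `-1` rather than as NaN, so unknown ordinal values (such as the `?` entries) would appear to survive `dropna()` as the code `-1`. The sketch below demonstrates this on a made-up series, along with one possible way (an assumption, not part of the original function) to turn the `-1` codes back into NaN:

```python
import numpy as np
import pandas as pd

order = ["0-35", "35-55", "55<="]
values = pd.Series(["35-55", np.nan, "0-35", "55<="])   # hypothetical sample

codes = pd.Categorical(values, categories=order).codes
print(codes)  # the missing entry is encoded as -1, not NaN

# One possible way to keep missing entries visible to dropna(): map -1 back to NaN
as_float = pd.Series(codes, dtype="float64").replace(-1, np.nan)
print(as_float.isna().sum())
```

Whether to drop those rows or keep `-1` as its own "unknown" level is a modelling choice; tree-based models can often work with either.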

In [None]:
final_data = clean_data(data)
display(final_data)

## <font color="blue">Choosing an Algorithm</font>

*Algorithm choice without background knowledge is arbitrary.*

"No Free Lunch" theorem:

> "We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems." - Wolpert & Macready (1997)
    
Luckily, there is always background knowledge. We can *assume* a great deal even without knowing this dataset:
- Samples that have similar features are usually going to be similarly classified
- There will likely be outliers in the data
- There is going to be little or no noise in our current feature set
- There is not any timeseries component or ordering in time to these features
- Many other assumptions we make by the sheer fact of having this dataset. 

These remove choices like reinforcement learning algorithms, NLP algorithms, forecasting algorithms, etc. 

There is also the background knowledge of what you would like to receive from the algorithm: 
- interpretable model?
- feature importances?
- exact boundary conditions?
- exact function approximation?
- speed?

All of these contribute to the choice of algorithm. Here, we'll use **decision trees** - they essentially provide four of the five above conditions, and are fairly powerful for such a simple model.

### <font color="red">(binary) Decision Trees</font>

Quick summary:
- Tree is built by selecting a feature to 'split' on at each node; all samples with the feature value go to one branch, and all other samples go to the other
- Few different splitting metrics, but Gini and Information Gain are the most common. General idea is to determine which feature choice gives the greatest separation between target classes
- Adaptable to continuous data as well as regression settings
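
To make the splitting metrics concrete, here is a minimal sketch in plain Python that scores two made-up candidate splits with both Gini impurity and information gain (entropy). The labels and the helper names (`gini`, `entropy`, `weighted_impurity`) are illustrative assumptions; scikit-learn computes the same quantities internally but with its own implementation.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: chance that two labels drawn independently differ."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity after a split: branch impurities weighted by branch size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

parent = ["pass"] * 4 + ["fail"] * 4
# Hypothetical candidate splits of the same eight samples
split_a = (["pass", "pass", "pass", "fail"], ["pass", "fail", "fail", "fail"])
split_b = (["pass", "pass", "fail", "fail"], ["pass", "pass", "fail", "fail"])

for name, (left, right) in [("A", split_a), ("B", split_b)]:
    gini_drop = gini(parent) - weighted_impurity(left, right, gini)
    info_gain = entropy(parent) - weighted_impurity(left, right, entropy)
    print(name, round(gini_drop, 3), round(info_gain, 3))
```

Split A separates the classes somewhat and so reduces both measures, while split B leaves each branch as mixed as the parent and gains nothing; the tree builder would prefer A.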

Simple decision tree for passenger survival on the Titanic:

![fig_tree](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)
Image Source: wikipedia

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def split_data(X, Y, pct=0.2):
    ''' 
       Divide data into training / validation set;
       pct is the fraction of samples held out for validation
    '''
    # as_matrix() was removed from pandas; np.asarray works for frames, series and arrays
    X = np.asarray(X)
    Y = np.asarray(Y)
    samples = X.shape[0]
    n_train = int(samples * (1 - pct))
    indices = np.arange(samples)
    np.random.shuffle(indices)

    # Training set
    Xt = X[indices[:n_train]]
    Yt = Y[indices[:n_train]]

    # Validation set
    Xv = X[indices[n_train:]]
    Yv = Y[indices[n_train:]]
    return (Xt, Yt), (Xv, Yv)


def fit_data(X, Y, kwargs=None, Model=DecisionTreeClassifier):
    '''
    Fit model and test performance on a validation set
    
    kwargs (passed on to the model constructor), e.g.:
     - max_depth
     - max_leaf_nodes
     - min_impurity_decrease
     - min_samples_split
     - min_samples_leaf
     - criterion
     - class_weight
    '''
    training, (Xv, Yv) = split_data(X, Y)

    # Avoid a mutable default argument: treat None as "no extra options"
    model = Model(**(kwargs or {})).fit(*training)
    Y_hat = model.predict(Xv)

    print('Accuracy:', accuracy_score(Yv, Y_hat))
    print('F1:', f1_score(Yv, Y_hat, average='weighted'))
    print('Confusion Matrix:\n', confusion_matrix(Yv, Y_hat))
    return model
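
The metrics printed by `fit_data` can disagree quite a bit on imbalanced data, which matters when judging the results that follow. The sketch below recomputes accuracy, per-class F1 and the support-weighted F1 by hand for a made-up set of labels; the helper names are hypothetical, and `sklearn.metrics.f1_score` remains the function actually used in this notebook.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} from true positives, false positives and false negatives."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[label] * support[label] for label in support) / len(y_true)

# Hypothetical imbalanced labels: 8 of class 0, 2 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, per_class_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Here the always-majority model reaches 80% accuracy while its F1 for class 1 is zero, which is why the per-class scores and the confusion matrix are worth reading alongside the headline number.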

In [None]:
Y = final_data['final_result']
X = final_data.drop('final_result', axis=1)

fit_data(X, Y)

Not so great. So, what do we do now?

A few options:
 - Change the objective
 - Improve model
 - Improve data


In [None]:
Y = final_data['final_result'].replace({1:0, 2:1, 3:1}) # Fail, Withdraw, Pass, Distinction -> Incomplete, Complete
X = final_data.drop('final_result', axis=1)

fit_data(X, Y)

Improvement, but still not much better than a coin flip. How about improving the model?

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import make_scorer
from time import time

# Create a timing decorator 
def timer(f):
    def wrapper(*args, **kwargs):
        start  = time()
        result = f(*args, **kwargs)
        print('Execution time: %.2f seconds' % (time() - start))
        return result
    return wrapper


# Split data using sklearn
Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.2)

@timer
def run_search_dt(n_jobs=1):
    search = GridSearchCV(estimator = DecisionTreeClassifier(),
                          param_grid= {'max_depth'      : np.arange(2, 11),
                                       'max_leaf_nodes' : np.linspace(8, 100, 10, dtype=np.int32),
                                       'criterion'      : ['gini', 'entropy'],
                                       'class_weight'   : ['balanced', None]},
                          scoring = make_scorer(f1_score),
                          n_jobs  = n_jobs,
                          cv      = 3) # 3 fold cross validation
    search.fit(Xt, Yt)
    display(pd.DataFrame(search.cv_results_))
    print('Best score: %.3f' % search.best_score_)
    return search.best_params_
    
    
params = run_search_dt()
print(params)

In [None]:
params = run_search_dt(4)
fit_data(X, Y, params)
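
To see why `n_jobs` makes a difference for the search above, it helps to count how many models the grid implies. The sketch below rebuilds a grid of the same shape in plain Python; the `max_leaf_nodes` entries are stand-ins, since only the number of values (10) matters for the count.

```python
from itertools import product

# Same grid shape as in run_search_dt; only the number of values matters here
param_grid = {
    "max_depth": list(range(2, 11)),        # 9 values
    "max_leaf_nodes": list(range(10)),      # stand-in for the 10 linspace values
    "criterion": ["gini", "entropy"],
    "class_weight": ["balanced", None],
}
cv_folds = 3

# Every combination of parameter values is one candidate model
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

Each of these fits is independent of the others, so they can be distributed across processes; by default `GridSearchCV` also performs one final refit on the whole training split using the best parameters.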

## <font color="blue"> Ensembles </font>

Ensemble learning is the idea that we can potentially combine multiple models into one bigger model, which on average outputs better estimates than any of its constituents. 

Number of different approaches:
 - Bayesian methods
 - Stacking
 - Bagging
 - Boosting
 - ...
 
Longer training times, but typically ensembles using even extremely simple models (e.g. single split decision tree "stumps") can outperform more specialized models. 


### Boosting

Given a weak learner, i.e. one that performs consistently better than random, are we able to make it a strong learner?

Yes: 

#### AdaBoost

Assume you've trained the weak learner and have its outputs on the training set. These outputs match well with some examples, and poorly with others. The poor output samples can be reweighted to have a higher importance in the training set, and then another training round for a weak learner is performed. 

Each weak learner has an associated weight based on its overall sample-weighted performance. These are used after all boosting rounds are completed to combine the weak learners into a single ensemble model. 
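
A single boosting round can be sketched in plain Python. This follows the classic two-class AdaBoost formulation with labels in {-1, +1}; the data and the weak learner's outputs are made up, and scikit-learn's `AdaBoostClassifier` may differ in its details.

```python
from math import exp, log

y_true = [+1, +1, -1, -1, +1]
y_weak = [+1, -1, -1, -1, +1]   # hypothetical weak learner output; sample 1 is wrong

n = len(y_true)
weights = [1.0 / n] * n         # start with uniform sample weights

# Weighted error of the weak learner, and its resulting vote in the ensemble
error = sum(w for w, t, p in zip(weights, y_true, y_weak) if t != p)
alpha = 0.5 * log((1.0 - error) / error)

# Up-weight misclassified samples (t * p == -1), down-weight correct ones, renormalise
updated = [w * exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_weak)]
total = sum(updated)
weights = [w / total for w in updated]
print(round(error, 3), round(alpha, 3), [round(w, 3) for w in weights])
```

After the update, the one misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.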


#### Gradient Boosting

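Continuing with that idea: rather than reweight the samples and refit them at each iteration, we can instead fit each new learner $h_t$ to the residual of the current ensemble $F_t$. For the squared error $L(y, F) = \tfrac{1}{2}(y - F)^2$, that residual is the negative gradient of the loss:

$h_t(x) \approx -\dfrac{\partial L(y, F_t(x))}{\partial F_t(x)} = y - F_t(x), \qquad F_{t+1}(x) = F_t(x) + h_t(x)$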
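
The residual-fitting loop can be sketched in plain Python on a toy one-dimensional regression problem. Everything here is an illustrative assumption: the data are made up, `fit_stump` is a hypothetical helper that fits a single-threshold regressor, and a learning rate of 0.5 shrinks each update. XGBoost adds regularisation and many other refinements on top of this basic idea.

```python
def fit_stump(x, residual):
    """Best single-threshold regressor for the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residual) if xi <= threshold]
        right = [r for xi, r in zip(x, residual) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]   # hypothetical toy data

learning_rate = 0.5
prediction = [sum(y) / len(y)] * len(y)       # initial model: a constant
errors = []
for _ in range(5):
    # Residual of the current ensemble = negative gradient of the squared error
    residual = [yi - fi for yi, fi in zip(y, prediction)]
    stump = fit_stump(x, residual)
    # Add the new learner's (shrunken) output to the running prediction
    prediction = [fi + learning_rate * stump(xi) for xi, fi in zip(x, prediction)]
    errors.append(sum((yi - fi) ** 2 for yi, fi in zip(y, prediction)))
print([round(e, 3) for e in errors])
```

The summed squared error shrinks with every round, since each stump removes part of whatever structure is left in the residuals.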

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from xgboost.sklearn  import XGBClassifier

timed_test = timer(fit_data)

for M in [AdaBoostClassifier, XGBClassifier]:
    print(M.__name__)
    timed_test(X, Y, Model=M)
    print()

### <font color="red"> Feature Importance</font>

- One benefit to using tree-based methods is that they are white-boxes: their models are easily explainable. 
- This leads to easy calculation of things like feature importance and decision paths.  

In [None]:
model = fit_data(X, Y, Model=XGBClassifier)
importance = sorted(list(zip(X.columns, model.feature_importances_)), key=lambda k:k[1], reverse=True)
print('Top three features:')
display(importance[:3])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

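# Build a tidy frame so the importances stay numeric (np.array would coerce them to strings)
imp = pd.DataFrame(importance, columns=['feature', 'importance'])

plt.figure(figsize=(10,10));
sns.barplot(data=imp, x='importance', y='feature', palette='RdBu').set_xlabel('Relative Importance');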

In [None]:
from xgboost.plotting import plot_tree
plt.figure(figsize=(14,13));
plot_tree(model, ax=plt.gca());

## <font color="blue"> Learning Curves </font>

Another thing we can look at is the learning curves, to judge a few different things. 

1. Determine if more data would improve the results
2. Check if our model is overfitting
3. Bias / variance examination


#### Bias-variance tradeoff 

Generally speaking:
 - Bias: systematic error from a model too simple to capture the data; high bias = underfitting
 - Variance: sensitivity to small differences between samples; high variance = overfitting
 
Supervised learning models always have an inherent bias-variance tradeoff in their construction. Usually this is tweakable via the hyperparameters of the model, such as with the maximum depth of a decision tree.  

In [None]:
from sklearn.model_selection import learning_curve, ShuffleSplit

def plot_learning_curve(X, Y, Model=XGBClassifier):
    cv = ShuffleSplit(n_splits=30, test_size=0.2)
    train_sizes = np.linspace(0.1,1,8)
    train_sizes, train_scores, test_scores = learning_curve(Model(), 
                                                            X, Y, cv=cv, 
                                                            train_sizes=train_sizes, n_jobs=4)
    train_mean = np.mean(train_scores, axis=1)
    train_std  = np.std(train_scores, axis=1)
    test_mean  = np.mean(test_scores, axis=1)
    test_std   = np.std(test_scores, axis=1)

    c1 = plt.plot(train_sizes, train_mean, 'o-', label="Training score")[0].get_color()
    c2 = plt.plot(train_sizes, test_mean, 'o-', label="Cross-validation score")[0].get_color()

    plt.fill_between(train_sizes, train_mean-train_std, train_mean+train_std, alpha=0.1, color=c1)
    plt.fill_between(train_sizes, test_mean-test_std, test_mean+test_std, alpha=0.1, color=c2)

    plt.legend(loc="best")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
plot_learning_curve(X, Y)


Other than that, feature engineering would be the next step:
 - group of students in the same code module?
 - student in multiple code modules?
 - student performance in prior modules?
 - etc. 
 
Let's take a look at the data again however - maybe changing our objective to a point-in-time estimate will get better results.
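
A point-in-time estimate means that the features for a given day may only use events recorded on or before that day. A minimal sketch with a made-up click log (the tuples and the helper `clicks_up_to` are hypothetical):

```python
from collections import defaultdict

# Hypothetical click log: (id_student, day relative to course start, clicks)
click_log = [
    (11, -5, 3), (11, 0, 10), (11, 12, 4),
    (22, 1, 7), (22, 30, 20),
]

def clicks_up_to(log, day):
    """Total clicks per student using only events dated on or before `day`."""
    totals = defaultdict(int)
    for student, date, clicks in log:
        if date <= day:
            totals[student] += clicks
    return dict(totals)

print(clicks_up_to(click_log, 0))
print(clicks_up_to(click_log, 12))
```

Students with no activity before the cutoff are simply absent from the result, so they need a default value (such as 0 clicks) when the feature is joined back onto the student table.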

In [None]:
vle = pd.read_csv(stdvle_fname)
display(vle)

In [None]:
vle['date'].hist();

In [None]:
s = vle.groupby(['code_module', 'code_presentation', 'id_student', 'date'])['sum_click'].sum().reset_index()
display(s)

In [None]:
s.groupby(['date'])['sum_click'].sum().reset_index().plot(x='date', y='sum_click');

In [None]:
def add_clicks(data, day):
    # Get all data points on this day or prior, and sum them over the three columns we care about
    clicks = s[s['date'] <= day].groupby(['code_module', 'code_presentation', 'id_student'])['sum_click'].sum().reset_index()
    
    # Merge this data with the appropriate rows in the full data set
    full   = pd.merge(data, clicks, on=['code_module', 'code_presentation', 'id_student'], how='outer')
    
    # Replace any missing values with 0
    full['sum_click'] = full['sum_click'].fillna(0)
    return full

**Using Multiprocessing**

In [None]:
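from multiprocessing import Pool, get_start_method

# Functions defined inside a notebook can generally only be sent to worker processes
# when those are forked; Windows has no 'fork' start method, so stop here instead of crashing
assert get_start_method() == 'fork', 'Kernel crash on windows'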

# Multiprocessing can significantly speed up execution, and isn't too difficult to implement
# Define a function you want mapped to each sample in the data, and then apply it
def test_day(day):
    wclicks = add_clicks(data, day)
    cleaned = clean_data(wclicks)
    
    Y = cleaned['final_result'].replace({1:0, 2:1, 3:1}) # Fail, Withdraw, Pass, Distinction -> Incomplete, Complete
    X = cleaned.drop('final_result', axis=1)
    Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.2)

    model = XGBClassifier().fit(Xt, Yt)
    Y_hat = model.predict(Xv)
    return day, f1_score(Yv, Y_hat)

# The context manager shuts the worker processes down once the map has finished
with Pool(4) as pool:
    xy_performance = pool.map(test_day, sorted(s['date'].unique()))

**If Multiprocessing is not Available**

In [None]:
from tqdm import tqdm 

# If multiprocessing isn't available, can still do the same work normally
xy_performance = []
for day in tqdm(sorted(s['date'].unique())):
    wclicks = add_clicks(data, day)
    cleaned = clean_data(wclicks)
    
    Y = cleaned['final_result'].replace({1:0, 2:1, 3:1}) # Fail, Withdraw, Pass, Distinction -> Incomplete, Complete
    X = cleaned.drop('final_result', axis=1)
    Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.2)

    model = XGBClassifier().fit(Xt, Yt)
    Y_hat = model.predict(Xv)
    xy_performance.append([day, f1_score(Yv, Y_hat)])

In [None]:
xy_performance = np.array(sorted(xy_performance, key=lambda k:k[0]))
plt.plot(*xy_performance.T);
plt.xlabel('Days Relative to Start of Course');
plt.ylabel('F1 Score');

So with just adding information about the number of clicks a student has made, we can improve the score significantly. Since we're looking at data going forward anyway, we can also add test scores.

In [None]:
student_scores = pd.read_csv(stdassess_fname)
display(student_scores)

In [None]:
test_info = pd.read_csv(assess_fname)
display(test_info)

In [None]:
combined_test = pd.merge(student_scores, test_info, on=['id_assessment'], how='outer')
combined_test = combined_test.replace({'?':np.nan}).dropna()
display(combined_test)

In [None]:
c = combined_test.drop(['assessment_type', 'is_banked', 'id_assessment'], axis=1)

# Create a few new features, combining other columns together
c['relative_turnin'] = c['date'].astype(float) - c['date_submitted'].astype(float)
c['weighted_score']  = c['score'].astype(float) * c['weight'] / 100.
c['raw_score'] = c['score'].astype(float) # Modify type
display(c)

In [None]:
def add_scores(data, day, key='date_submitted'):    
    # Same as before - get the relevant data and group by necessary columns
    assess = c[c[key] <= day].groupby(['code_module', 'code_presentation', 'id_student'])

    # Sum over the relevant features (a list is required to select several columns)
    score_keys = ['raw_score', 'relative_turnin', 'weighted_score']
    total  = assess[score_keys].sum().reset_index()

    # Merge with the rest of the data
    full   = pd.merge(data, total, on=['code_module', 'code_presentation', 'id_student'], how='outer')

    # Replace missing values
    full[score_keys] = full[score_keys].fillna(0)
    return full

In [None]:
xy_performance = []
for day in tqdm(sorted(s['date'].unique())):
    wclicks = add_clicks(data, day)
    wscores = add_scores(wclicks, day)
    cleaned = clean_data(wscores)
    
    Y = cleaned['final_result'].replace({1:0, 2:1, 3:1}) # Fail, Withdraw, Pass, Distinction -> Incomplete, Complete
    X = cleaned.drop('final_result', axis=1)
    Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.2)

    model = XGBClassifier().fit(Xt, Yt)
    Y_hat = model.predict(Xv)
    xy_performance.append([day, f1_score(Yv, Y_hat)])

In [None]:
xy_performance = np.array(sorted(xy_performance, key=lambda k:k[0]))
plt.plot(*xy_performance.T);
plt.xlabel('Days Relative to Start of Course');
plt.ylabel('F1 Score');

So, not bad: we can predict whether a student will complete the course with an F1 score of ~80% after only about a month and a half - 20% of the way through. <br><br>Even on the first day of the course, the score is nearly 70%. Let's take a look at how each of the original objectives fares over time...

In [None]:
from tqdm import tqdm 
from xgboost.sklearn import XGBClassifier
xy_performance = []
for day in tqdm(sorted(s['date'].unique())):
    wclicks = add_clicks(data, day)
    wscores = add_scores(wclicks, day)
    cleaned = clean_data(wscores)
    
    Y = cleaned['final_result']
    X = cleaned.drop('final_result', axis=1)
    Xt, Xv, Yt, Yv = train_test_split(X, Y, test_size=0.2)

    model = XGBClassifier().fit(Xt, Yt)
    Y_hat = model.predict(Xv)
    xy_performance.append([day,] + list(f1_score(Yv, Y_hat, average=None)))

In [None]:
XY = np.array(sorted(xy_performance, key=lambda k:k[0]))
XY = pd.DataFrame(XY, columns=['Days', 'Fail', 'Withdraw', 'Pass', 'Distinction'])
display(XY)

import seaborn as sns
%matplotlib inline
XY.plot(x='Days').set_ylabel('F1 Score');

Machine learning is an iterative process over two steps:
 - Improve the data
 - Improve the model
 
When possible, improving the objective can also help.

## <font color="blue"> Advanced Topics</font>

- **Translation**. Not necessarily just language to language, but in general data to data where the two sets have a (nearly) one-to-one mapping. Encoder-decoder architectures are popular for this right now, e.g. Seq2Seq
<br><br>
- **Deep NLP**. Any time you want to use text as the feature in a neural network, word vector embeddings are the go-to representation, e.g. GloVe. Long short-term memory units are the building blocks for most models, and recurrent networks in general tend to perform fairly well.
<br><br>
- **Temporal data**. Recurrent networks are the standard model when trying to extract temporal relationships within a data set. Recurrent networks can also be used for a number of other cool applications: https://distill.pub/2016/augmented-rnns/ . Aside from those, reservoir computing networks (Echo State Networks & Liquid State Machines) are some lesser known but powerful models (and also my favorite topics).
<br><br>
- **Small data set**. The general notion is that you need a _lot_ of data to perform deep learning. While it helps, it's not necessarily a requirement - Generative Adversarial Networks can help to create a simulated data set which is 'good enough' for getting a training process started. Utilizing these correctly can take some work however. 
<br><br>
- **Image to image**. Also GAN domain, though depending on the specific application convolutional networks are useful as well (and are actually the building blocks for GANs). 
<br><br>
- **Markov Processes**. Anything that can be imagined as taking actions in a state to reach a goal (or optimize the state) is typically the domain of reinforcement learning. The standard research environment (or most visible at least) is in game playing: there's an obvious state transition function, action mapping, and reward sequence to train a model. Plus, there's no shortage of environments.
<br><br>
- **Different data domains**. If you have, say, a _large_ set of simulated data or data which is particular to a specific domain, but you want to create a model which can learn a _related_ domain, transfer learning may help. There is a lot of research on it happening right now, where models trained on one domain can transfer knowledge of what they learned to be used in the separate (but somehow related) domain. 
<br><br>
- Tons of other models for nearly any problem you can think of. Any which I haven't mentioned?