**Project Scoping**: Tuning the hyperparameters of USPORF and benchmarking the performance of the tuned parameters across different datasets.

![Screen%20Shot%202019-09-25%20at%205.38.45%20AM.png](attachment:Screen%20Shot%202019-09-25%20at%205.38.45%20AM.png)

  **The definitions below are taken from urerf.py in the SPORF repository. Arrows (`-->`) mark the hyperparameters I plan to tune for this project.**
  
Parameters
    ----------
    projection_matrix : str, optional (default: "RerF")
        The random combination of features to use: either "RerF", "Base".
        "RerF" randomly combines features for each `mtry`. Base is our
        implementation of Random Forest. "S-RerF" is structured RerF,
        combining multiple features together in random patches.
        See Tomita et al. (2016) [#Tomita]_ for further details.  
        
    --> n_estimators : int, optional (default: 100)
        Number of trees in forest.  
        
    --> max_depth : int or None, optional (default=None)
        The maximum depth of the tree. If None, then nodes are expanded
        until all leaves are pure or until all leaves contain less than
        min_samples_split samples.  
        
    --> min_samples_split : int, optional (default: "auto")
        The minimum splittable node size.  A node size < ``min_samples_split``
        will be a leaf node.  Note: other implementations called `min.parent`
        or `minParent`
        - If "auto", then ``min_samples_split=sqrt(num_obs)``
        - If int, then consider ``min_samples_split`` at each split.  
        
    --> max_features : int, float, string, or None, optional (default="auto")
        The number of features or feature combinations to consider when
        looking for the best split.  Note: also called `mtry` or `d`.
        - If int, then consider ``max_features`` features or feature combinations at each split.
        - If float, then `max_features` is a fraction and ``int(max_features * n_features)`` features are considered at each split.
        - If "auto", then ``max_features=sqrt(n_features)``.
        - If "sqrt", then ``max_features=sqrt(n_features)`` (same as "auto").
        - If "log2", then ``max_features=log2(n_features)``.
        - If None, then ``max_features=n_features``.  
        
    feature_combinations : float, optional (default: "auto")
        Average number of features combined to form a new feature when
        using "RerF." Otherwise, ignored.
        - If int or float, then ``feature_combinations`` is average number of features to combine for each ``max_features`` to try.  
        - If "auto", then ``feature_combinations=n_features``.
        - If "sqrt", then ``feature_combinations=sqrt(n_features)`` (same as "auto").
        - If "log2", then ``feature_combinations=log2(n_features)``.
        - If None, then ``feature_combinations=n_features``.  
        
    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel for both `fit` and `predict`.
        ``None`` means 1. ``-1`` means use all processors.  
        
    random_state : int or None, optional (default=None)
        Random seed to use. If None, set seed to ``np.random.randint(1, 1000000)``.
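
As a rough sketch of what a search space over the four arrowed hyperparameters could look like, the snippet below uses scikit-learn's `ParameterGrid` and `ParameterSampler` to enumerate candidate settings. Only the parameter names come from the definitions above; the candidate values and ranges are illustrative assumptions, and nothing here calls USPORF itself.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Parameter names follow the definitions above; the candidate values
# are assumptions chosen only for illustration.
grid_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": ["auto", 5, 20],
    "max_features": ["auto", "log2", 0.5],
}

# Grid search tries every combination: 3 * 3 * 3 * 3 settings.
grid_candidates = list(ParameterGrid(grid_space))
print(len(grid_candidates))

# Random search draws a fixed number of settings from lists or distributions.
random_space = {
    "n_estimators": randint(50, 501),      # integers in [50, 500]
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": randint(2, 51),   # integers in [2, 50]
    "max_features": ["auto", "sqrt", "log2"],
}
random_candidates = list(
    ParameterSampler(random_space, n_iter=10, random_state=0)
)
print(random_candidates[0])
```

The grid grows multiplicatively with each added parameter, while the random sampler's cost is fixed by `n_iter`, which is the main trade-off between the two approaches.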


**Two popular hyperparameter tuning methods available in scikit-learn:**
1) Grid Search
2) Random Search

Cross-validation is then applied to see which method and which parameters work better.

Sources: https://blog.usejournal.com/a-comparison-of-grid-search-and-randomized-search-using-scikit-learn-29823179bc85, https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models

The second link has Python code that uses scikit-learn to build a simple logistic regression classifier and tunes its hyperparameters with grid search and random search.
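
For orientation, here is a minimal sketch of that kind of workflow. It is not the tutorial's code: the dataset (scikit-learn's built-in breast cancer data), the scaling step, and the candidate values for `C` are assumptions made so the example is self-contained.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed stand-in dataset; the tutorial uses its own data.
X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Grid search: evaluate every listed value of C with 5-fold cross-validation.
grid = GridSearchCV(
    model,
    {"logisticregression__C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
grid.fit(X, y)

# Random search: draw 5 values of C from a log-uniform distribution.
rand = RandomizedSearchCV(
    model,
    {"logisticregression__C": loguniform(1e-2, 1e2)},
    n_iter=5,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print("grid  :", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Both searches report a mean cross-validated accuracy in `best_score_`, so the two methods can be compared directly on the same folds.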

**Deliverable:** Recreate the tutorial code in a Jupyter notebook and write similar code for USPORF hyperparameter tuning on the same dataset used in the tutorial.