## Reduce Complexity of Model
As we've seen in creating Supervised and Unsupervised Machine Learning models, the number of predictors plays a large role in each model. The more predictors we have, the more complex the model and the higher the risk of overfitting the training data, that is, fitting noise that does not generalize to new data. So the major question is: how do we reduce the complexity of our model without indiscriminately throwing out data?

## Feature Selection or Feature Projection?

There are two general approaches to reducing the number of features in your model: selection and projection.

**Feature Selection** involves systematically choosing a smaller set of features to focus on and removing the rest. For example, consider our 13-feature wine dataset. One could systematically create Decision Tree models for every possible 4-feature combination *(715 combinations in total)* and compare the resulting five-fold cross-validation accuracy averages to determine the 4 features to keep. 
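The combination count quoted above can be checked directly. This is a minimal sketch that uses placeholder integer indices in place of the real wine features; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

n_features = 13  # size of the wine feature set

# Enumerate every 4-feature subset using placeholder indices 0..12
four_feature_subsets = list(combinations(range(n_features), 4))
print(len(four_feature_subsets))  # 715

# The closed-form binomial coefficient gives the same count
print(comb(n_features, 4))  # 715

# For comparison, the number of 2-feature subsets
print(comb(n_features, 2))  # 78
```

The count grows quickly with the size of the feature set, so exhaustive search of this kind is only practical for small datasets.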

**Feature Projection** involves mathematically combining multiple features into new, singular "Frankenstein" features. Principal Component Analysis (PCA) is one such method: each principal axis is a linear combination of the original features. Projection still loses some information, but in a "mathematically minimal" way. Explaining the mathematics behind these methods is beyond the scope of this survey course and is best revisited after taking a Linear Algebra course.
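To make the idea of projection concrete, here is a minimal sketch of PCA on the wine dataset. Standardizing the features with `StandardScaler` first is an assumption of this sketch (PCA is sensitive to feature scale), and the choice of 2 components is only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()

# Assumption: put every feature on a comparable scale before projecting
X_scaled = StandardScaler().fit_transform(wine.data)

# Project the 13 original features onto 2 principal axes
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X_scaled)

print(X_projected.shape)       # (178, 2): same rows, fewer columns
print(pca.components_.shape)   # (2, 13): each axis has one weight per original feature
print(pca.explained_variance_ratio_)  # share of variance kept by each axis
```

Each row of `pca.components_` holds the weights of one linear combination, which is why every new feature draws on all 13 original ones.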

## Feature Selection Example - Combinatorial Optimization
Let's write out the code to reduce our 13-feature wine dataset down to 2 features. 

In [21]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

from itertools import combinations
from sklearn.model_selection import cross_val_score

In [37]:
def create_all_feature_dicts(df, subset_n):
    # Build one record per subset_n-sized combination of the columns in df
    features = df.columns
    all_subsets = combinations(features, subset_n)

    list_of_feature_dicts = []
    for name_index, subset in enumerate(all_subsets):
        temp_dict = {
            'name': f'model_{name_index}',
            'features': list(subset),
            'reduced_df': df[list(subset)],
            # Note: relies on the global `wine` being loaded before this function is called
            'target': wine.target,
            'model': DecisionTreeClassifier()
        }
        list_of_feature_dicts.append(temp_dict)

    return list_of_feature_dicts

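def run_cross_val(feature_dict, cv_n):
    # Parameter renamed from `dict` to avoid shadowing the built-in type
    model = feature_dict['model']
    X = feature_dict['reduced_df']
    y = feature_dict['target']
    all_scores = cross_val_score(model, X, y, cv=cv_n)

    # Store the per-fold scores and their mean on the same record
    feature_dict['cross_val_scores'] = all_scores
    feature_dict['cross_val_avg'] = np.mean(all_scores)
    return feature_dict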

def return_selected_dict(list_of_dicts, name):
    selected_dict = list(filter(lambda model: model['name'] == name, list_of_dicts))[0]
    return selected_dict

In [51]:
wine = load_wine()
list_of_feature_dicts = create_all_feature_dicts(pd.DataFrame(wine.data), 2)

list_of_cross_val_avg = []
list_of_model_names = []

for feature_dict in list_of_feature_dicts:
    # run_cross_val updates the record in place and returns it
    feature_dict = run_cross_val(feature_dict, 5)
    list_of_cross_val_avg.append(feature_dict['cross_val_avg'])
    list_of_model_names.append(feature_dict['name'])

sorted_model_names = [name for _, name in sorted(zip(list_of_cross_val_avg, list_of_model_names), reverse=True)]

for index in range(0, 10):
    model_name = sorted_model_names[index]
    model_dict = return_selected_dict(list_of_feature_dicts, model_name)
    print(f'Rank: {index+1}')
    # Double quotes inside the f-string avoid a SyntaxError on Python versions before 3.12
    print(f'Features: {model_dict["features"]}')
    print(f'Cross-Validation Average: {int(model_dict["cross_val_avg"]*100)}%')
    print()

Rank: 1
Features: [6, 9]
Cross-Validation Average: 91%

Rank: 2
Features: [0, 6]
Cross-Validation Average: 89%

Rank: 3
Features: [9, 12]
Cross-Validation Average: 88%

Rank: 4
Features: [10, 12]
Cross-Validation Average: 87%

Rank: 5
Features: [9, 11]
Cross-Validation Average: 87%

Rank: 6
Features: [11, 12]
Cross-Validation Average: 85%

Rank: 7
Features: [5, 9]
Cross-Validation Average: 85%

Rank: 8
Features: [0, 11]
Cross-Validation Average: 84%

Rank: 9
Features: [4, 6]
Cross-Validation Average: 83%

Rank: 10
Features: [6, 12]
Cross-Validation Average: 82%
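The features in the output above appear as integers because the DataFrame was built from `wine.data` without column names. A minimal sketch of how those indices could be mapped back to readable names, using the `feature_names` attribute returned by `load_wine` (the variable names `named_df`, `top_pair`, and `top_pair_names` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Option 1: name the columns when building the DataFrame
named_df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Option 2: translate an existing pair of column indices after the fact
top_pair = [6, 9]  # the rank-1 pair from the output above
top_pair_names = [wine.feature_names[i] for i in top_pair]
print(top_pair_names)
```

Passing `named_df` to `create_all_feature_dicts` instead of the unnamed DataFrame would make the printed feature lists self-describing.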



## Feature Projection Example - Linear Discriminant Analysis

IBM hosts an *excellent* step-by-step description of Linear Discriminant Analysis (LDA): [https://developer.ibm.com/tutorials/awb-implementing-linear-discriminant-analysis-python/](https://developer.ibm.com/tutorials/awb-implementing-linear-discriminant-analysis-python/). Advanced readers who want to learn how to choose the number of components to project onto should read through this resource.
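As a rough starting point for that question, scikit-learn's LDA exposes an `explained_variance_ratio_` attribute after fitting. The sketch below fits on the full wine dataset purely for exploration, which is an assumption of this example and not a recommended evaluation setup.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

wine = load_wine()

# Leaving n_components unset keeps every available discriminant axis
lda_full = LinearDiscriminantAnalysis()
lda_full.fit(wine.data, wine.target)

# One entry per discriminant axis: the share of between-class variance it explains
ratios = lda_full.explained_variance_ratio_
print(ratios)
```

With three wine classes there are at most two discriminant axes, so `ratios` has two entries; if the first entry dominates, a single component may already separate the classes reasonably well.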

In [53]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data set into training and testing sets
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.2)

# Apply Linear Discriminant Analysis
# n_components cannot be more than min(n_features, n_classes - 1)
# Since n_features = 13 and n_classes = 3, n_components can be at most 2
lda = LinearDiscriminantAnalysis(n_components=2) 
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Display the accuracy
print(f'Accuracy: {int(accuracy*100)}%')

### STILL NEEDED - CREATE CROSS-VALIDATION

Accuracy: 97%
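The accuracy above comes from a single random train/test split, so it will vary from run to run. One possible way to cross-validate the same approach is sketched below. Wrapping LDA and the classifier in a `Pipeline` is a design choice of this sketch, made so that LDA is refit on each training fold only; the fixed `random_state=0` is likewise an assumption added for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

wine = load_wine()

# Chain projection and classification so each fold fits LDA on its own training data
pipeline = make_pipeline(
    LinearDiscriminantAnalysis(n_components=2),
    RandomForestClassifier(random_state=0),  # assumption: fixed seed for repeatability
)

# Five-fold cross-validation, matching the feature selection example
scores = cross_val_score(pipeline, wine.data, wine.target, cv=5)
print(f'Fold accuracies: {scores}')
print(f'Cross-Validation Average: {int(scores.mean() * 100)}%')
```

Fitting LDA inside the pipeline keeps the held-out fold from influencing the projection, which would otherwise make the scores look better than they should.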
