# Data Collection

Before collecting data, a few questions need answers:

1. Which input features should I include?
2. How do I obtain known values of my target (label) variable?
3. How much training data do I need?
4. How do I know if my training data is good enough?

## Which features should be included

Most problems offer a large number of potential features to choose from. Choosing among them can be difficult, which is why domain knowledge is important. Without domain knowledge, intuition may be the next best thing. Whichever features you choose, two conditions must hold:

1. The value of the feature must be known at the time predictions are needed (for example, at the beginning of the month in a SaaS churn example).
2. The feature must be numerical or categorical in nature; otherwise additional feature engineering is required.

Include a feature only if you believe it is valuable and needed. More features often make the problem harder, because uninformative features add noise (perturbations) that obscures the signal (the true relationship in the data). For example, a unique ID is unlikely to help predict SaaS churn, but monthly logins may.

All else being equal, fewer features are better.

The more uninformative features are present, the lower the signal-to-noise ratio, and thus the less accurate the ML model will be on average.

Excluding features can reduce accuracy as well, because the model never sees the excluded features, some of which may be predictive of the target.
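To make the signal-to-noise point concrete, here is a minimal sketch on synthetic data (not the churn data). It assumes one informative feature whose class means differ, plus a configurable number of pure-noise features, and uses a 1-nearest-neighbour classifier as a stand-in model. The function names and all numeric settings are illustrative assumptions.

```python
import random

def make_data(n, n_noise, rng):
    """Return (rows, labels): one informative feature plus n_noise noise features."""
    rows, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = rng.gauss(3.0 * label, 1.0)  # class means 0 and 3 (assumed)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # unrelated to label
        rows.append([informative] + noise)
        labels.append(label)
    return rows, labels

def one_nn_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a 1-nearest-neighbour classifier (squared Euclidean distance)."""
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = min(
            range(len(train_x)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
        )
        correct += (train_y[nearest] == y)
    return correct / len(test_x)

def accuracy_with_noise(n_noise, seed=0):
    """Train and test on fresh synthetic data with n_noise uninformative features."""
    rng = random.Random(seed)
    train_x, train_y = make_data(200, n_noise, rng)
    test_x, test_y = make_data(300, n_noise, rng)
    return one_nn_accuracy(train_x, train_y, test_x, test_y)

acc_clean = accuracy_with_noise(0)   # informative feature only
acc_noisy = accuracy_with_noise(30)  # same signal buried under 30 noise features
print(acc_clean, acc_noisy)
```

On this synthetic setup the second accuracy should come out lower than the first, because the noise features dominate the distance calculation. How strongly a real model degrades depends on the algorithm and the data.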

### A practical approach to the trade-off

1. Include all features that you suspect to be predictive of the target variable. Fit an ML model. If the accuracy of the model is sufficient, stop.
2. Otherwise, expand the feature set by including other features that are less obviously related to the target. Fit another model and assess the accuracy. If performance is sufficient, stop.
3. Otherwise, starting from the expanded feature set, run an ML feature selection algorithm to choose the best, most predictive subset of your expanded feature set.
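Step 3 does not name a particular algorithm. As one possible illustration, the sketch below uses greedy forward selection: start with no features and repeatedly add the feature that most improves holdout accuracy. The nearest-centroid classifier, the synthetic data (features 0 and 1 informative, the rest noise), and every function name are assumptions made for the example.

```python
import random

def centroid_accuracy(train, test, cols):
    """Holdout accuracy of a nearest-centroid classifier restricted to `cols`."""
    sums, counts = {}, {}
    for row, label in train:
        total = sums.setdefault(label, [0.0] * len(cols))
        for k, c in enumerate(cols):
            total[k] += row[c]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}
    correct = 0
    for row, label in test:
        pred = min(
            centroids,
            key=lambda lab: sum((row[c] - m) ** 2 for c, m in zip(cols, centroids[lab])),
        )
        correct += (pred == label)
    return correct / len(test)

def forward_select(train, test, n_features):
    """Greedily add the feature that most improves accuracy; stop when none helps."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        score, feat = max(
            (centroid_accuracy(train, test, selected + [f]), f) for f in remaining
        )
        if score <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = score
    return selected, best

def make_rows(n, rng):
    """Synthetic rows: features 0 and 1 depend on the label, features 2-5 are noise."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(4)]
        rows.append((row, label))
    return rows

rng = random.Random(1)
selected, best = forward_select(make_rows(300, rng), make_rows(150, rng), 6)
print(selected, round(best, 3))
```

Because the same holdout set drives both selection and scoring here, the reported accuracy is optimistic. A separate validation set or cross-validation would give a fairer estimate.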

### Achieving Ground Truth

Obtaining ground truth is a painful process, especially with 'big data'. Newer projects may require waiting until there is enough data to begin training, while others, such as tweet sentiment analysis, mean using groups of individuals or otherwise manually interpreting sentiment.

Most efforts to obtain high veracity are labor intensive. While the payoff may be well worth it, budget and time should be assessed carefully before jumping headlong into an ML project.

### How much training data is needed?

This is a complicated question, though with experience you will develop a feel for how much data you need. Some factors that determine training data needs include:

1. Complexity of the problem. Is there a simple pattern, or is it nonlinear and complex?
2. Accuracy requirements. A 60% success rate requires less data than a 95% success rate.
3. Dimensionality of the feature space. Less data is required for 2 features than for 2,000.

As training sets grow, models on average become more accurate (the 'unreasonable effectiveness of data').

Take the churn example. Assuming 3,333 instances with 19 features and an outcome of subscribed or unsubscribed, the following procedure is one way to see whether more data is needed:

1. Using the current training set, choose a grid of subsample sizes to try, for example 500, 1,000, 1,500, 2,500, and 3,000.
2. For each sample size, randomly draw that many instances (without replacement) from the training set.
3. With each subsample of training data, build an ML model and assess the accuracy of that model (using evaluation metrics).
4. Assess how the accuracy changes as a function of sample size. If it seems to level off at the higher sample sizes, the existing training set is probably sufficient. But if the accuracy continues to rise for the larger samples, the inclusion of more training instances would likely boost accuracy.
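The four steps above can be sketched as follows. This is a minimal illustration, not the churn analysis itself: the data is synthetic (3,333 training rows with 5 features instead of 19), accuracy is measured on a separate synthetic holdout set, and the nearest-centroid classifier and all function names are assumptions standing in for whatever model and evaluation metric you actually use.

```python
import random

def make_rows(n, rng):
    """Synthetic stand-in for labeled data: 5 numeric features, binary label."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        features = [rng.gauss(0.5 * label, 1.0) for _ in range(5)]
        rows.append((features, label))
    return rows

def nearest_centroid_accuracy(train, test):
    """Fit per-class feature means on train, then score accuracy on test."""
    sums, counts = {}, {}
    for features, label in train:
        total = sums.setdefault(label, [0.0] * len(features))
        for k, value in enumerate(features):
            total[k] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}
    correct = 0
    for features, label in test:
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, centroids[lab])),
        )
        correct += (pred == label)
    return correct / len(test)

def learning_curve(train, test, sizes, seed=0):
    """Steps 2-3: for each size, subsample without replacement, fit, and score."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        subsample = rng.sample(train, n)  # random.sample draws without replacement
        curve.append((n, nearest_centroid_accuracy(subsample, test)))
    return curve

rng = random.Random(42)
train = make_rows(3333, rng)
test = make_rows(1000, rng)
sizes = [500, 1000, 1500, 2500, 3000]  # step 1: grid of subsample sizes
curve = learning_curve(train, test, sizes)
for n, acc in curve:  # step 4: inspect accuracy as a function of sample size
    print(n, round(acc, 3))
```

A single random draw per size gives a noisy curve. Repeating each size several times and averaging the accuracies makes the leveling-off (or continued rise) easier to judge.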

### Is the training data representative enough?

In other words, how similar are the instances in the training set to the instances that will be collected in the future? In a supervised machine learning model the goal is to generate accurate predictions on new data. If the training set is not representative of that new data, we fall subject to 'sample-selection bias', or 'covariate shift'.

#### Reasons the training data may be nonrepresentative

1. It was possible to obtain ground truth only for a certain subsample. For example, credit card fraud may be confirmed only above a certain amount of money. If that threshold was 1,000, fraud below it would be difficult to detect in a supervised learning situation.
2. The properties of the instances have changed over time. For example, if your training set consists of historical data on medical insurance fraud, but new laws have substantially changed the ways in which medical insurers must conduct their business, then your predictions on the new data may not be appropriate.
3. The input feature set has changed over time. Say the set of location attributes that you collect on each customer has changed; you used to collect ZIP code and state but now you collect IP address. This change may require you to modify the feature set used for the model and potentially discard old data from the training set. 
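A rough first check for the kinds of change listed above is to compare each numeric feature's distribution in the training set against newly collected data. The sketch below is only one simple heuristic, not a complete test for covariate shift: it reports the difference in means in units of the training standard deviation. The `shift_report` helper, the 0.5 threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def shift_report(train_cols, new_cols, threshold=0.5):
    """Per feature: (standardized mean shift, whether it exceeds the threshold).

    train_cols and new_cols map a feature name to a list of numeric values.
    The threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    report = {}
    for name, train_values in train_cols.items():
        shift = (mean(new_cols[name]) - mean(train_values)) / stdev(train_values)
        report[name] = (round(shift, 2), abs(shift) > threshold)
    return report

# Hypothetical values: monthly logins drifted upward, account age did not.
train_cols = {
    "monthly_logins": [10, 12, 11, 13, 9, 12],
    "account_age_months": [3, 8, 14, 20, 26, 31],
}
new_cols = {
    "monthly_logins": [20, 22, 19, 21],
    "account_age_months": [5, 12, 18, 30],
}

report = shift_report(train_cols, new_cols)
for name, (shift, flagged) in report.items():
    print(name, shift, "shifted" if flagged else "ok")
```

A flagged feature does not prove the model will fail, and an unflagged one does not prove it is safe, since distributions can differ in shape while sharing a mean. It is a prompt to look more closely at how the new data was collected.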





Example: converting a categorical feature to numerical binary features.

```python
import numpy as np

def cat_to_num(data):
    """Return one 0/1 array per category, in sorted category order."""
    data = np.array(data)
    categories = np.unique(data)  # sorted unique category values
    features = []
    for cat in categories:
        # True where the instance belongs to this category, False elsewhere
        binary = (data == cat)
        features.append(binary.astype('int'))
    return features
```

```python
gender = ['male', 'female', 'male', 'female', 'male']
gender_cat = cat_to_num(gender)
print(gender_cat)  # first array marks 'female', second marks 'male'
```

Output:

```
[array([0, 1, 0, 1, 0]), array([1, 0, 1, 0, 1])]
```
