In [None]:
import numpy as np 
import matplotlib.pyplot as plt

# Common Problems when Training ML/AI Models (And how to solve them without losing your mind) 

A practical guide for practical Python users

## Poor Data Quality 

An ML/AI algorithm is only as good as the data you give it. 
If you give it data that contains no meaningful information, you shouldn't be surprised when the algorithm can't get anything out of it. 
However, we've got a whole lot of tips to get around the problems. 
You'll also build up an intuition for how to get around these with practice. 

### Missing Data

Not every piece of data is always recorded. 
If a question is optional on a survey, not everyone is going to answer. 
Unfortunately, models tend to get a little angry with a NaN or inf in a dataset. 
There are two main approaches to dealing with it. 

#### Replace

Sometimes you are working with a smaller dataset and every single row is important. 
In this case, just skipping a row that is incomplete can cause downstream problems. 
Then, you'll want to replace any NaN values. 
(This includes inf and -inf!)

It takes a little bit of work to understand what your data is doing and what the best approach is, so that you avoid introducing accidental bias or accidentally teaching your model to be a NaN detector. 
The most common things to replace missing values with are either 0 or the mean of the column. 
Using the mean generally does not shift the center of the distribution (especially if the number of missing values is small), though it does narrow the spread a little. 
When replacing a missing number with 0 though, it's important to understand why it was missing. 
It's easy to skew the dataset if you don't think about it, so make sure to look at the distributions of your data before and after you make changes. 
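
One way to make that before-and-after check concrete is to compare summary statistics alongside the histograms. The sketch below is only an illustration: the sample is synthetic, and `summarize` is a hypothetical helper, not part of any library. It fills the same missing entries with the mean and with 0 and prints how the mean and standard deviation move.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
values = rng.normal(10.0, 2.0, size=500)  # synthetic column, not real data
missing = rng.choice(values.size, size=50, replace=False)
values[missing] = np.nan  # knock out 10% of the entries


def summarize(x):
    """Hypothetical helper: return (mean, standard deviation), ignoring NaNs."""
    return float(np.nanmean(x)), float(np.nanstd(x))


observed = values[~np.isnan(values)]
filled_mean = np.where(np.isnan(values), np.nanmean(values), values)
filled_zero = np.where(np.isnan(values), 0.0, values)

for name, x in [("observed", observed),
                ("mean-filled", filled_mean),
                ("zero-filled", filled_zero)]:
    mean, std = summarize(x)
    print(f"{name:12s} mean={mean:6.2f} std={std:5.2f}")
```

In this setup the mean-filled column keeps the observed mean but has a smaller standard deviation, while the zero-filled column is pulled toward 0 because the real values sit far from it.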

It is also very important not to make guesses on your label/y field. 
If you are missing this field for a data point and you replace it, you're already making a decision about what you want the model to train for, so it's best to just discard these points. 
It's like seeing that you don't know if the car should have turned right or left on your trip, and just deciding that "Eh, it's probably to the right" and possibly ending up in Springfield, Manitoba instead of Springfield, Illinois. 

#### Remove

This is the more common solution.
If your dataset is large enough, you can just remove any rows (or columns) that are missing important data. 
It's important to cut down your dataset to just the features you are going to use in training before you do this step, though; otherwise you could end up losing data that would have been fine. 

Remove a single row when it's missing a field that is required for your training, or remove a whole column when it's either unimportant (run some correlation tests to see if this is the case!) or mostly empty. 
It is important to make sure you're not accidentally covering up a systematic bias when you do this, though. 
Is there a reason this field is empty for a lot of entries? 
Always make sure you understand what your data is actually saying before making modifications. 
Remember, modeling is a science, not just a series of checkboxes. 

In [None]:
# Missing data
size = 200
rng = np.random.default_rng()
data = rng.normal(0, 1, size=size)
# Pick 5% of the indices (without repeats) to treat as missing
remove_indices = rng.choice(size, size=int(size * 0.05), replace=False)

data[remove_indices] = np.nan

In [None]:
## Replace
# This is a normal distribution (by construction), but it's good to look at the distribution without NaNs before you pick a replacement value
# It could be better to use the mode or median! 

# data is a NumPy array, so fill with np.where; np.nanmean ignores the NaNs
data_replaced = np.where(np.isnan(data), np.nanmean(data), data)

plt.hist(data_replaced)
plt.title("Data, replacing the missing values with the mean")
plt.show()

# Let's see how we can bias the data here 
data_replaced_zeros = np.where(np.isnan(data), 0, data)

plt.hist(data_replaced_zeros)
plt.title("Data, replacing the missing values with 0")

plt.show()

In [None]:
## Remove
# This method is a little easier, and probably better with such a well-behaved problem
data_dropped = data[~np.isnan(data)]  # boolean mask keeps only the non-NaN entries

plt.hist(data_dropped)
plt.title("Data, NaNs removed")

plt.show()

### Noisy Data 

"Noise" is a very general term for information in your data that does not contribute to the signal you want to study. 
It can come from any number of sources, like instrumentation noise when the data was collected, incorrect data recording, and the like. 
A little noise isn't a bad thing, since it can prevent models from overfitting, but when the noise is so overwhelming that it dominates the data, you have a problem. 


Solutions: 
* Identify and remove overly-noisy data. 


### Data with outliers

Similarly to noisy data, if your data collection included possible errors, you can end up with points far outside the expected range, and these can skew your results. 
Most likely, you don't want to account for these sorts of outliers, or they're incredibly rare and statistically unimportant, so you can throw them out. 

It is important to consider these outliers before you make a blanket statement that they're not useful; they could hold a clue to a problem upstream in the data collection process, but this is not often the case. 
Sometimes there are just outliers. 

In this case, statistics can take care of us. 
You can use quantiles to identify the points far outside the distribution, and follow the same logic as if you were removing a NaN value. 
We'll use `np.quantile`, but pandas has equivalent functionality. 
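
One common way to turn quartiles into an outlier rule is the interquartile range (IQR) fence. The sketch below is an assumption on our part rather than the only option: `iqr_bounds` is a hypothetical helper, the multiplier `k=1.5` is just the conventional default, and the outliers are planted by hand in synthetic data.

```python
import numpy as np


def iqr_bounds(x, k=1.5):
    """Hypothetical helper: fences k * IQR beyond the first and third quartiles."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=200)
sample[:5] = [12.0, -15.0, 9.0, 20.0, -11.0]  # planted outliers

lower, upper = iqr_bounds(sample)
keep = (sample >= lower) & (sample <= upper)  # boolean mask of points inside the fences
cleaned = sample[keep]
print(f"kept {cleaned.size} of {sample.size} points")
```

A larger `k` keeps more of the tails; which value is right depends on how heavy-tailed you expect the clean data to be.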

In [None]:
size = 200
rng = np.random.default_rng()
data = rng.normal(0, 1, size=size)
n_replaced = int(size * 0.05)  # replace 5% of the data
replace_indices = rng.choice(size, size=n_replaced, replace=False)

data[replace_indices] = rng.normal(1, 5, size=n_replaced) # Replace it with something.... different 

plt.hist(data, bins=20)
plt.show() #Wow 

# Remove those outliers. 

# Cutoffs at the 1st and 99th percentiles: anything outside is a little beyond what we expect
lower, upper = np.quantile(data, [0.01, 0.99])

# Now let's use a boolean mask to remove them from the dataset 

non_outliers = data[(data >= lower) & (data <= upper)]
plt.hist(non_outliers, bins=20)
plt.show()


### Imbalanced Data 

Because not everything is equally likely, you often get datasets where the classes are not evenly represented. 
This can cause a problem in training ML algorithms, because the model will learn to cheat: if class A is 95% of the dataset, it can always select class A and still get 95% accuracy. 
This sort of problem is obvious in tasks like outlier detection (e.g., fraud prevention, rare event tagging), but it is present to a degree in most (if not all) classification tasks. 
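
The 95% trap is easy to reproduce. The sketch below uses made-up labels (950 of class 0, 50 of class 1) and a "model" that never predicts the rare class; its accuracy looks great while it catches none of the cases you actually care about.

```python
import numpy as np

# Made-up labels: 95% class 0, 5% class 1
labels = np.array([0] * 950 + [1] * 50)

# A "model" that always picks the majority class
always_majority = np.zeros_like(labels)

accuracy = float((always_majority == labels).mean())
# Fraction of the rare class that was actually found (recall on class 1)
recall_minority = float((always_majority[labels == 1] == 1).mean())

print(f"accuracy: {accuracy:.2f}, minority recall: {recall_minority:.2f}")
```

This is why it helps to look at a per-class metric, not just overall accuracy, whenever the classes are uneven.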

There are two main ways to solve this: 

#### Selective Sampling 
Also called undersampling (or downsampling), this means sampling a larger class down to the size of the smaller class. 
This approach works best when the data is imbalanced, but not so much that one of the classes is TINYYYY compared to the other. 
Remember, this does throw out data from the larger class, so make sure your sub-sample is still representative of the class as a whole. 
Generally a uniform sampling scheme will work well enough, but running some comparisons between your sub-sample and the whole class is good practice. 
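
As a rough sketch of sampling the larger class down uniformly at random: `undersample` is a hypothetical helper written for this example, and the data is synthetic. The last lines do a quick comparison between the sub-sample and the whole class.

```python
import numpy as np


def undersample(x, y, rng):
    """Hypothetical helper: randomly drop rows so every class matches the smallest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return x[keep], y[keep]


rng = np.random.default_rng(0)
y = np.array([0] * 300 + [1] * 100)
x = rng.normal(size=(400, 3))

x_bal, y_bal = undersample(x, y, rng)
print(np.bincount(y_bal))  # rows per class after undersampling

# Quick sanity check: per-feature mean of the sub-sample vs. the full larger class
gap = np.abs(x_bal[y_bal == 0].mean(axis=0) - x[y == 0].mean(axis=0))
print(gap)
```

If the gaps are large relative to the spread of the feature, the sub-sample may not represent the class well and it is worth resampling or keeping more rows.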

#### Label Weighting
This method is better for highly imbalanced problems.
When you weight labels, you give more importance to the smaller class in the loss function. 
So, if you have a 100/10 split in classes and weight them 1:10, the loss function will treat an error on the smaller class with 10x the gravity of an error on the dominant class. 
Basically, you just make your loss REALLY care about the smaller class, comparatively. 
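
To see the 100/10 example in numbers, here is a minimal sketch with a weighted binary cross-entropy written in plain NumPy. Both functions are hypothetical helpers for illustration, and the inverse-frequency formula is one common heuristic for choosing weights, not the only one.

```python
import numpy as np


def inverse_frequency_weights(y):
    """Hypothetical helper: weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = y.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


def weighted_log_loss(y_true, p_pred, class_weights):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)  # keep log() finite
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    w = np.array([class_weights[int(label)] for label in y_true])
    return float(np.sum(w * per_sample) / np.sum(w))


y = np.array([0] * 100 + [1] * 10)
weights = inverse_frequency_weights(y)  # the smaller class gets 10x the weight

# A lazy model that gives every sample a 5% chance of being class 1
lazy = np.full(y.size, 0.05)

unweighted = weighted_log_loss(y, lazy, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, lazy, weights)
print(weights)
print(unweighted, weighted)
```

The lazy model looks acceptable under the unweighted loss and much worse under the weighted one, which is exactly the pressure you want during training.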

In [None]:
def train_model(x_train, y_train, class_weights): 
    pass 

slightly_imbalanced_data = ""

very_unbalanced_data = ""

### Large Variance in Scale

When your variables correspond to real-world parameters, they are often in different units and on different scales. 
Imagine you are trying to train a model to predict the difference between a cookie recipe and a cake recipe. 
So you would have the amount of milk in a unit like liters, and flour in grams. 
This ends up asking for something like .15 liters of milk and 150 grams of flour. 
The model is going to understand these numbers very differently, so it may put a large amount of importance on a variable you don't really need it to. 
(To a model, the difference between .2 L and 1 L is still smaller than the difference between 150 g and 145 g!
It only sees the numbers and how they relate to the labels, not the real meaning of the number, no matter how much LLMs want you to think otherwise.) 

To fix this, we can convert the units (such as converting liters to milliliters), or rescale each feature (column) to take values between 0 and 1. 
Because not every value in the world has units, in AI we generally take the latter approach. 
We can also standardize (sometimes also called normalizing), where we shift and scale each feature using its mean and standard deviation; this works best when the data is roughly normally distributed. 
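
Both rescaling options above fit in a few lines of NumPy. The recipe numbers below are invented for illustration (milk in liters, flour in grams); note that the scaling is done per column, so each feature is treated on its own.

```python
import numpy as np

# Hypothetical recipe features: column 0 is milk in liters, column 1 is flour in grams
recipes = np.array([
    [0.15, 150.0],
    [0.20, 145.0],
    [1.00, 300.0],
    [0.50, 250.0],
])

# Min-max scaling: each column ends up between 0 and 1
col_min = recipes.min(axis=0)
col_max = recipes.max(axis=0)
min_max_scaled = (recipes - col_min) / (col_max - col_min)

# Mean/standard-deviation scaling: each column ends up with mean 0 and std 1
standardized = (recipes - recipes.mean(axis=0)) / recipes.std(axis=0)

print(min_max_scaled)
print(standardized)
```

After either transform, a change in milk and a change in flour are measured on comparable scales, so neither dominates just because of its units.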

### Out-Of-Domain Data 

Non-representative data (or out-of-domain data) is the problem where the data your model sees at inference time does not look like the data it was trained on. 
This is common when doing things like moving from simulation to non-simulation data, or when you gather new data to expand your training set. 
Unfortunately, this has no easy fix.
But it is a whole field of study. 
Check out the Wikipedia page for [Domain Adaptation](https://en.wikipedia.org/wiki/Domain_adaptation). 

You can see a near-identical problem from incorrectly scaling your input data. 

If your original work was done on a PCA-transformed dataset, you'll need to use the same fitted transform on your validation data and on any other data you perform inference on.
The same goes for scaling, normalization, you name it. 
If it's a transform applied to your training data, it has to be applied to further input data. 
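
A minimal sketch of that rule, assuming a simple mean/standard-deviation scaler: `StandardScalerSketch` is a hypothetical class written for this example (not a library API), and both datasets are synthetic. The point is that the statistics are learned once from the training data and then reused, never refit.

```python
import numpy as np


class StandardScalerSketch:
    """Hypothetical minimal scaler: learns mean/std once, reuses them afterwards."""

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_


rng = np.random.default_rng(0)
train_x = rng.normal(50.0, 10.0, size=(500, 2))
val_x = rng.normal(50.0, 10.0, size=(100, 2))

scaler = StandardScalerSketch().fit(train_x)  # fit on the training data only
train_scaled = scaler.transform(train_x)
val_scaled = scaler.transform(val_x)          # same statistics, no refit

# The mistake: refitting on the validation data gives it its own, different transform
val_refit = StandardScalerSketch().fit(val_x).transform(val_x)

print(val_scaled.mean(axis=0), val_refit.mean(axis=0))
```

The refit version always comes out centered at exactly 0, which hides any real shift between the two sets and feeds the model inputs on a slightly different scale than it was trained on.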

In [None]:
# Troubled Training 
train_x = ""
train_y = ""

val_x = ""
val_y = ""

train_x = "" # Do some transform to x

model = ""
model_performance = ""

model_validation_performance = ""

print(model_performance)
print(model_validation_performance)

In [None]:
# Correction 
train_x = ""
train_y = ""

val_x = ""
val_y = ""

# The transform is a single instance
fit_transform = ""

train_x = fit_transform(train_x)
val_x = fit_transform(val_x)

model = ""
model_performance = ""

model_validation_performance = ""

print(model_performance)
print(model_validation_performance)

## Training Problems

### Improper Loss Function 

### Too high/too low learning rate

### Underfit models

### Overfit models

## Coding problems

### Non-descriptive variables

### Overwriting variables

### Versioning

### Path Problems

### Sharing and Reproducibility