# FAQ

Answers to frequently asked questions about the [cleanlab](https://github.com/cleanlab/cleanlab) open-source package.

### What data can cleanlab detect issues in?

Any classification data!

### Why isn’t CleanLearning working for my data?

At this time, cleanlab only works with data stored as NumPy arrays or pandas DataFrames (`pd.DataFrame`).
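If your data lives in another container, converting it first may help. The following is a minimal sketch, assuming your features and labels are stored in plain Python lists; the variable names are illustrative and not part of cleanlab.

```python
import numpy as np

# Hypothetical raw inputs stored as plain Python lists (illustrative only)
features_list = [[0.1, 0.2], [0.3, 0.4], [10.1, 10.2], [10.3, 10.4]]
labels_list = [0, 0, 1, 1]

# Convert to NumPy arrays before handing the data to CleanLearning
X = np.asarray(features_list, dtype=float)  # 2D array: one row per data point
labels = np.asarray(labels_list)            # 1D array: one label per data point

print(type(X).__name__, X.shape, labels.shape)
```

The resulting `X` and `labels` can then be passed wherever the examples in this FAQ use `data` and `y`.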

### How do I format labels for Cleanlab?

Cleanlab only works with integer-encoded labels in the range `{0, 1, ..., K-1}` where `K = number_of_classes`. The `labels` array should contain only integer values from this range and be of shape `(N,)` where `N = total_number_of_data_points`.
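The format requirements above can be checked before calling cleanlab. This is a minimal sketch; `check_labels` is a hypothetical helper written for this FAQ, not a cleanlab function.

```python
import numpy as np

def check_labels(labels, num_classes):
    """Hypothetical helper (not part of cleanlab): verify that labels are
    integers in {0, ..., num_classes - 1} with shape (N,)."""
    labels = np.asarray(labels)
    if labels.ndim != 1:
        raise ValueError(f"labels must have shape (N,), got {labels.shape}")
    if not np.issubdtype(labels.dtype, np.integer):
        raise ValueError(f"labels must be integers, got dtype {labels.dtype}")
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        raise ValueError(f"labels must lie in 0..{num_classes - 1}")
    return labels

labels = check_labels([1, 1, 0, 2, 0], num_classes=3)
print(labels.shape)
```

Raising an error early tends to be easier to debug than a failure deep inside model training.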

**Text or string labels** should be mapped to integers, one for each possible value. For example, if your original data labels look like this: `["dog", "dog", "cat", "mouse", "cat"]`, you should feed them to Cleanlab like this: `labels = [1,1,0,2,0]` and keep track of which integer uniquely represents which class (classes were ordered alphabetically in this example).
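One way to build such a mapping is sketched below in plain Python, assuming alphabetical class ordering as in the example above. The names `class_to_int` and `int_to_class` are illustrative.

```python
raw_labels = ["dog", "dog", "cat", "mouse", "cat"]

# Sort the unique class names so the integer assignment is alphabetical
classes = sorted(set(raw_labels))  # ['cat', 'dog', 'mouse']
class_to_int = {name: i for i, name in enumerate(classes)}

# Integer-encoded labels
labels = [class_to_int[name] for name in raw_labels]

# Reverse mapping, to translate integer labels back to class names later
int_to_class = dict(enumerate(classes))

print(labels)  # [1, 1, 0, 2, 0]
```

Keeping `int_to_class` around lets you report any flagged label issues using the original class names.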

**One-hot encoded labels** should be integer-encoded by finding the argmax along the one-hot encoded axis. An example of what this might look like is shown below.

In [1]:
import numpy as np 

# This example arr has 4 labels (one per data point) where 
# each label can be one of 3 possible classes

arr = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]])
labels = np.argmax(arr, axis=1)  # How labels should be represented when passed into the model

### How can I use different models for cleaning and final training in CleanLearning?

Here's how to use one type of model for finding label issues and another type of model for the final training on the clean subset of data, with label issues removed:

In [4]:
from cleanlab.classification import CleanLearning
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Make data
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
y = np.array([0] * 100 + [1] * 100)
# Introduce label errors
true_errors = [97, 98, 100, 101, 102, 104]
for idx in true_errors:
    y[idx] = 1 - y[idx]  # Flip label

# Demonstrate CleanLearning with 2 different classifiers:

model_to_find_errors = LogisticRegression()  # this model will be trained many times via cross-validation
model_to_return = GradientBoostingClassifier()  # this model will be trained once on clean subset of data
cl0 = CleanLearning(model_to_find_errors)
issues = cl0.find_label_issues(data, y)
print(cl0.clf)  # will be LogisticRegression()

cl = CleanLearning(model_to_return).fit(data, y, label_issues=issues)
pred_probs = cl.predict_proba(data)  # predictions from GradientBoostingClassifier
print(cl.clf)  # will be GradientBoostingClassifier()

LogisticRegression()
Cannot utilize sample weights for final training. To utilize must either specify noise_matrix or have previously called self.find_label_issues() instead of filter.find_label_issues()
GradientBoostingClassifier()


### How do I hyperparameter tune only the final model (and not the one used by CleanLearning)?

In [None]:
import numpy as np
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Make data
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
y = np.array([0] * 100 + [1] * 100)

# Introduce label errors
true_errors = [97, 98, 100, 101, 102, 104]
for idx in true_errors:
    y[idx] = 1 - y[idx]  # Flip label

# Demonstrate CleanLearning with no hyperparameter-tuning to find label issues
# but hyperparameter-tuning for the final training of model on clean subset of the data:
model_to_find_errors = GradientBoostingClassifier()  # this model will be trained many times via cross-validation
model_to_return = RandomizedSearchCV(GradientBoostingClassifier(),
                    param_distributions = {
                        "learning_rate": [0.001, 0.05, 0.1, 0.2, 0.5],
                        "max_depth": [3, 5, 10],
                    }
                )   # this model will be trained once on clean subset of data
cl0 = CleanLearning(model_to_find_errors)
issues = cl0.find_label_issues(data, y)

cl = CleanLearning(model_to_return).fit(data, y, label_issues=issues)
pred_probs = cl.predict_proba(data)  # predictions from hyperparameter-tuned GBDT

### Can't find an answer to your question?

If your question is not addressed in these tutorials, please refer to the [Cleanlab GitHub issues](https://github.com/cleanlab/cleanlab/issues?q=is%3Aissue), the [Cleanlab Code Examples](https://github.com/cleanlab/examples), or our [Slack Community](https://join.slack.com/t/cleanlab-community/shared_invite/zt-17lszn4hv-gg2FhZPXYfljq_l01uo92g).

If your question is not addressed anywhere, please open a [new GitHub issue](https://github.com/cleanlab/cleanlab/issues/new/choose). Our developers can also provide personalized assistance in [Slack](https://join.slack.com/t/cleanlab-community/shared_invite/zt-17lszn4hv-gg2FhZPXYfljq_l01uo92g).