Random Forest predict() does not produce reproducible results. random_state=42 #28920

Open
aedavids opened this issue Apr 30, 2024 · 3 comments
Labels
Needs Investigation Issue requires investigation Needs Reproducible Code Issue requires reproducible code

Comments

@aedavids

Describe the bug

If I load my pre-trained model and a set of samples and call predict() multiple times, I get different predicted classes. Here are some sample results. I am using a Jupyter notebook. I have tried restarting the kernel multiple times and also just re-running the cell multiple times.

auc: {0: 0.476, 1: 0.524} pred: [0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1]
auc: {0: 0.613, 1: 0.387} pred: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1]
auc: {0: 0.762, 1: 0.238} pred: [1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
auc: {0: 0.589, 1: 0.411} pred: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

I have a random forest that I trained with the following parameters:

RandomForestClassifier(max_depth=7, max_features=1, max_samples=0.9,
                       n_estimators=50, random_state=42)

The model was saved using joblib. I load the model as follows:

model = joblib.load(modelPath)

I make predictions as follows:

predictions  = model.predict(XNP)

yProbability = model.predict_proba(XNP)

yNP:
[0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 1 1]

XNP = np.array([[ 16,   9,   0,   0,   5,   0, 104,   1,   1,   1],
           [ 19,   4,   0,   0,   4,   0,  96,   0,   2,   0],
           [ 14,   7,   0,   0,   5,   0,  72,   0,   2,   0],
           [ 29,   5,   0,   0,  11,   0, 108,   0,   1,   0],
           [ 16,   9,   0,   0,   6,   0,  80,   0,   1,   1],
           [ 49,  13,   0,   0,  20,   0, 198,   0,   5,   2],
           [ 45,   7,   0,   0,   7,   0, 163,   0,   1,   1],
           [ 47,  13,   0,   1,  10,   0, 229,   0,   4,   1],
           [ 17,  21,   0,   0,   2,   0,  61,   0,   5,   0],
           [ 56,  15,   0,   0,  12,   0, 362,   0,   4,   1],
           [ 14,   7,   0,   0,   8,   0, 113,   0,   1,   0],
           [  5,   3,   0,   0,   1,   0,  49,   0,   0,   0],
           [ 23,   7,   0,   0,   8,   0,  92,   0,   2,   0],
           [ 15,  12,   0,   0,   3,   0, 119,   0,   0,   1],
           [ 18,   4,   0,   0,   1,   0, 133,   0,   0,   0],
           [ 13,   3,   0,   0,   4,   0, 126,   0,   0,   0],
           [ 20,   3,   0,   0,   5,   0, 161,   0,   0,   0],
           [ 15,   6,   0,   0,   4,   0, 163,   0,   0,   0],
           [ 23,   4,   0,   0,   8,   0, 127,   0,   0,   2]])

I have tried calling random.seed().

Any suggestions would be greatly appreciated.

P.S.
When I trained, I saved the label encoder, and I load it as follows. (This was to ensure the class numbers match the class names.)

def encoder2Dict(encoder : LabelEncoder) -> dict  :
    '''
    key is class
    value is int
    '''
    values = encoder.transform(encoder.classes_)
    retDict = dict(zip(encoder.classes_, values))
    return retDict

def loadEncoder(path: str) -> LabelEncoder:
    '''
    arguments:
        path: file containing labelEncoder values saved as a dictionary
    '''
    encoder = LabelEncoder()
    encoderDict = loadDictionary(path)

    # Manually assign the sorted list of class labels to the classes_ attribute
    # The keys of the dictionary are sorted according to their corresponding values
    # dictionary.get(key) returns the value stored for that key
    encoder.classes_ = np.array(sorted(encoderDict, key=encoderDict.get))

    return encoder
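One way to confirm that a restored encoder maps classes to the same integers as the original is a round-trip check. The following is a minimal sketch, not the code used in this issue: the function names, the JSON serialization, and the labels "control" and "tumor" are all assumptions made for illustration, and it assumes string class labels.

```python
import json

import numpy as np
from sklearn.preprocessing import LabelEncoder


def encoder_to_dict(encoder):
    """Map each class label to its integer code, using plain Python types."""
    codes = encoder.transform(encoder.classes_)
    return {str(label): int(code) for label, code in zip(encoder.classes_, codes)}


def encoder_from_dict(mapping):
    """Rebuild an encoder whose classes_ are ordered by their saved integer code."""
    encoder = LabelEncoder()
    encoder.classes_ = np.array(sorted(mapping, key=mapping.get))
    return encoder


# Hypothetical labels, used only to exercise the round trip.
original = LabelEncoder().fit(["tumor", "control", "tumor"])

# Serialize to JSON text and back, standing in for saving to and loading from a file.
saved_text = json.dumps(encoder_to_dict(original))
restored = encoder_from_dict(json.loads(saved_text))

round_trip_ok = encoder_to_dict(restored) == encoder_to_dict(original)
print(round_trip_ok)
```

If the check fails for a real encoder, the class-to-integer mapping has changed between training and prediction, which would change how the predicted integers should be read.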

I can make my trained model available.

Steps/Code to Reproduce

predictions = model.predict(XNP)

yProbability = model.predict_proba(XNP)

Expected Results

predict(X) == predict(X)

Actual Results

auc: {0: 0.476, 1: 0.524} pred: [0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1]
auc: {0: 0.613, 1: 0.387} pred: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1]
auc: {0: 0.762, 1: 0.238} pred: [1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]
auc: {0: 0.589, 1: 0.411} pred: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Versions

System:
    python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]
executable: /private/home/aedavids/miniconda3/envs/POC/bin/python
   machine: Linux-5.15.0-89-generic-x86_64-with-glibc2.35

Python dependencies:
      sklearn: 1.4.0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.3
        scipy: 1.11.4
       Cython: None
       pandas: 2.2.0
   matplotlib: 3.7.1
       joblib: 1.4.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell
    num_threads: 128

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/libgomp.so.1.0.0
        version: None
    num_threads: 160
$ conda list scikit-learn
# packages in environment at /private/home/aedavids/miniconda3/envs/extraCellularRNA:
#
# Name                    Version                   Build  Channel
scikit-learn              1.4.0           py311hc009520_0    conda-forge


$ python --version
Python 3.11.4
@aedavids aedavids added Bug Needs Triage Issue requires triage labels Apr 30, 2024
@glemaitre
Member

We will need the data to understand the cause, but I suspect that the issue is linked to random tie breaking.
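To see whether any samples sit close enough to a tie for this to matter, one option is to flag rows of the predict_proba output whose two largest class probabilities are nearly equal. This is only a sketch: the helper name `near_ties`, the tolerance values, and the example probabilities are assumptions, not data from this issue.

```python
def near_ties(proba_rows, tol=1e-9):
    """Return indices of rows whose two largest probabilities differ by at most tol."""
    flagged = []
    for index, row in enumerate(proba_rows):
        top_two = sorted(row, reverse=True)[:2]
        if len(top_two) == 2 and top_two[0] - top_two[1] <= tol:
            flagged.append(index)
    return flagged


# Hypothetical probabilities, used only to show the helper's behaviour.
example = [
    [0.50, 0.50],
    [0.62, 0.38],
    [0.48, 0.52],
]

exact_ties = near_ties(example)
loose_ties = near_ties(example, tol=0.05)
print(exact_ties, loose_ties)
```

The helper accepts any sequence of rows, so it could be called as `near_ties(model.predict_proba(XNP))`. An empty result would suggest that ties are unlikely to explain the differing predictions.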

@glemaitre glemaitre added Needs Reproducible Code Issue requires reproducible code Needs Investigation Issue requires investigation and removed Bug Needs Triage Issue requires triage labels May 1, 2024
@betatim
Member

betatim commented May 2, 2024

Please also provide a short code snippet that we can copy and paste to reproduce the problem. From reading your original comment, it sounds like you are using more than just a RandomForestClassifier. Having a full snippet from start to finish makes sure we are all debugging the same thing.
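A start-to-finish snippet of this kind might look like the sketch below. It reuses the hyperparameters from the original report, but the training and prediction data are synthetic stand-ins (an assumption, since the real data has not been posted), so it shows only the shape of a reproducer and not the reported behaviour.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 10 non-negative integer features.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 200, size=(200, 10))
y_train = rng.integers(0, 2, size=200)
X_new = rng.integers(0, 200, size=(19, 10))

# Same hyperparameters as in the original report.
model = RandomForestClassifier(max_depth=7, max_features=1, max_samples=0.9,
                               n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Save and reload the model with joblib, as described in the report.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# Compare repeated calls on the loaded model and against the in-memory model.
first = loaded.predict(X_new)
second = loaded.predict(X_new)
repeat_calls_match = np.array_equal(first, second)
matches_original = np.array_equal(first, model.predict(X_new))
proba_calls_match = np.allclose(loaded.predict_proba(X_new),
                                loaded.predict_proba(X_new))

print(repeat_calls_match, matches_original, proba_calls_match)
```

Replacing the synthetic arrays with the real model file and XNP, and running the same script both from a terminal and inside the notebook, would show whether the difference comes from the estimator or from the surrounding code.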

@aedavids
Author

aedavids commented May 6, 2024

Hi all,

I am in the process of creating test code I can post. I have narrowed it down a bit. The problem happens in my Jupyter notebook. If I run the predict cell multiple times, I get the same results. If I restart the notebook, I get different results from the first run.

I wrote a small Python script. I cannot reproduce the error when I run it from the terminal.

I am going to try to figure out how to isolate the problem in my notebook. I will post the test notebook.

Hopefully I can upload a zip file with the test code and my trained model.
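One way to narrow this down is to fingerprint what goes into predict() in each environment and compare the digests between the notebook and the script, and across kernel restarts. The sketch below uses only the standard library. The function names are hypothetical, and the idea that the inputs differ between runs is an untested assumption, not a finding.

```python
import hashlib
import json


def fingerprint_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_rows(rows):
    """Return the SHA-256 hex digest of a nested list of numbers.

    Row order and column order both affect the digest, so a reordered
    feature matrix produces a different fingerprint.
    """
    payload = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two rows copied from XNP in the original report.
rows = [[16, 9, 0, 0, 5, 0, 104, 1, 1, 1],
        [19, 4, 0, 0, 4, 0, 96, 0, 2, 0]]

# Identical content gives an identical digest.
same_digest = fingerprint_rows(rows) == fingerprint_rows([list(r) for r in rows])

# Swapping the first two columns changes the digest.
swapped = [[r[1], r[0]] + r[2:] for r in rows]
digest_changes = fingerprint_rows(rows) != fingerprint_rows(swapped)

print(same_digest, digest_changes)
```

In the notebook, this could be called as `fingerprint_file(modelPath)` and `fingerprint_rows(XNP.tolist())` right before predict(). If either digest changes after a kernel restart, the model file or the feature matrix (for example its column order) differs between runs, and the difference is upstream of the estimator.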

Kind regards

Andy
