[Bug]: M1 Mac: Performance degrades severely after 1 generation #5488

Open
1 task done
twopiearr opened this issue Dec 6, 2022 · 37 comments
Labels
bug-report Report of a bug, yet to be confirmed platform:mac Issues that apply to Apple OS X, M1, M2, etc

Comments

@twopiearr

twopiearr commented Dec 6, 2022

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

Did a fresh install this morning to make sure I was running the latest and greatest. Generating a 768x768 image at 20 steps with Euler_a using the 2.0 model. Launched with the --medvram argument but got similar results without it. Batch count and batch size both at 1. The first batch generated in 1:56; the second batch, with identical settings, took 17:04. Cancelling and relaunching the Terminal command does seem to get it back to the initial performance, but only for one batch, thus necessitating quitting and relaunching every time.

Steps to reproduce the problem

  1. Launch web-ui on an M1 Mac.
  2. Load the 2.0 model and set X/Y to 768, sampler to euler_a, cfg to 7, steps to 20.
  3. With batch count and batch size both at 1, generate an image. Should take a reasonable amount of time.
  4. Generate a second image. Should take an order of magnitude more time or more.
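The steps above can be sketched as a small timing loop. This is only a sketch under stated assumptions: it presumes the web UI was launched with its API enabled (the `--api` flag) and is reachable at the default local address, and the endpoint path and payload field names below are assumptions to check against your install. The prompt is a placeholder and sampler selection is left at the server default. The timing helpers themselves do not depend on the web UI at all.

```python
import json
import time
import urllib.request
from typing import Callable, List

# Assumption: web UI started with its API enabled and listening on the default port.
API_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def generate_via_api(url: str = API_URL) -> None:
    """Request one 768x768, 20-step image (settings taken from the steps above)."""
    payload = {
        "prompt": "a lighthouse at dusk",  # placeholder prompt, not from the report
        "steps": 20,
        "width": 768,
        "height": 768,
        "cfg_scale": 7,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # discard the image data; only the elapsed time matters


def time_generations(generate: Callable[[], None], runs: int = 3) -> List[float]:
    """Call `generate` `runs` times in one session and return seconds per run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        durations.append(time.perf_counter() - start)
    return durations


def slowdown_factors(durations: List[float]) -> List[float]:
    """Express every run as a multiple of the first run's duration."""
    return [d / durations[0] for d in durations]


# Example (needs a running web UI, so it is left commented out):
# times = time_generations(generate_via_api)
# print(slowdown_factors(times))
```

With the timings reported above (1:56 and 17:04, i.e. 116 s and 1024 s), `slowdown_factors([116.0, 1024.0])` gives a second-run factor of roughly 8.8.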

What should have happened?

The generation time for the second batch should be more or less the same as the time for the first batch.

Commit where the problem happens

44c46f0

What platforms do you use to access UI ?

MacOS

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

--medvram

Additional information, context and logs

For the purposes of this bug report I launched with --medvram, mostly to see if it would improve performance as suggested in the Apple Installation article on the wiki. With this argument, the first generation happened in approx half the time, but similar results overall were observed in the second generation. (I don't actually know how long the second generation will take without the --medvram argument; the UI was estimating 30 mins, and I cancelled it after 20.)

@twopiearr twopiearr added the bug-report Report of a bug, yet to be confirmed label Dec 6, 2022
@twopiearr
Author

twopiearr commented Dec 6, 2022

Attempted a third generation. It is currently reporting step 5/20 with 3:46 elapsed and 10:50 estimated, so I expect similar results. EDIT: took 13:54.

@twopiearr
Author

twopiearr commented Dec 6, 2022

Additional information: This doesn't appear to be limited to the 2.0 model. Experimentally I tried cancelling the Terminal process, adding the 1.4 model via a symlink, and relaunching. With identical settings aside from selecting the 1.4 model, I'm getting identical performance: first batch took 2:34, second batch is currently at 5/20 after 3:45 with an ETA of 17:41.

@cooperdk

cooperdk commented Dec 8, 2022

Using this software on an ARM-based Mac is going to give you trouble, guaranteed. Not only are the newer Macs ARM-based (RISC, which is most often used in network-attached storage devices, cars and coffee makers), but these CPUs also use Apple-proprietary GPUs, which aren't really supported by this software (basically you need CUDA to make it work without issues).

So get yourself an older Intel Mac, put in a GeForce card, or wait until someone magically makes this work on a computer that looks best on a fancy desk. (Sorry, biased, but Apple decided to screw their fans over once more).

@twopiearr
Author

Using this software on an ARM-based Mac is going to give you trouble, guaranteed. Not only are the newer Macs ARM-based (RISC, which is most often used in network-attached storage devices, cars and coffee makers), but these CPUs also use Apple-proprietary GPUs, which aren't really supported by this software (basically you need CUDA to make it work without issues).

And yet...I didn't have this problem even with this software as recently as 11 days ago. Nor do I have this problem with literally any other implementation I've tried on this machine. So shit on Apple all you want, but I don't think this is an Apple problem, per se.

@cooperdk

cooperdk commented Dec 8, 2022 via email

@twopiearr
Author

then why didn't I have this issue with this software before it integrated support for SD 2.0?

why don't I have this problem with any other implementation (now up to 4 and counting) that I've tried on this machine?

Your fiery rhetoric neither fits the facts nor gets any closer to a solution to a problem that is unique to this implementation, so kindly shut up if you can't contribute something useful.

@holynuts

holynuts commented Dec 9, 2022

I have the same problem: the 2nd generation is 4 times slower, with the same settings in a different batch.
First generation: 20/20 [01:44<00:00, 5.21s/it]; after: 17/20 [05:46<00:58, 19.65s/it]
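As a worked check of the "4 times slower" figure, the per-iteration rates can be read straight out of the two progress lines. This is a minimal sketch; the regular expression is an assumption about the progress-bar format based only on the lines pasted in this thread, and the helper name is made up for illustration.

```python
import re

# Matches the rate at the end of a progress line, e.g. "5.21s/it]" or "1.05it/s]".
RATE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(s/it|it/s)\]")


def seconds_per_iteration(line: str) -> float:
    """Return the seconds-per-iteration rate reported in one progress line."""
    match = RATE_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no rate found in: {line!r}")
    value, unit = float(match.group(1)), match.group(2)
    # "it/s" is the inverse of "s/it", so convert it to a common unit.
    return value if unit == "s/it" else 1.0 / value


first = seconds_per_iteration("20/20 [01:44<00:00, 5.21s/it]")
second = seconds_per_iteration("17/20 [05:46<00:58, 19.65s/it]")
print(f"slowdown: {second / first:.2f}x")  # prints "slowdown: 3.77x"
```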

@cooperdk

cooperdk commented Dec 9, 2022

then why didn't I have this issue with this software before it integrated support for SD 2.0?

why don't I have this problem with any other implementation (now up to 4 and counting) that I've tried on this machine?

Your fiery rhetoric neither fits the facts nor gets any closer to a solution to a problem that is unique to this implementation, so kindly shut up if you can't contribute something useful.

Working on Mac requires a lot of memory, it seems. Perhaps more on the later release.

But torch is only built for CUDA and (it seems) AMD GPUs, so I guess you need the memory to run Python, the UI and its modules, AND then you need memory for generation.

When I run the UI on Windows with 32 GB RAM, I have maybe 35% left when it has started. Loading a model, hypernet, dream etc. will take more. It transfers to the GPU as needed, freeing the RAM on the PC.

Since torch is not really built for the Mac gpu, perhaps that's the issue. Slowing down due to high memory usage.
I agree that supporting SD2 might have been a mistake at this point since I doubt a lot of people will use it, due to their censoring.

Did you try to check ram and cpu usage while doing first and second generation?

@rworne

rworne commented Dec 16, 2022

Open Activity Monitor and look at your memory pressure. I have a 32GB RAM Studio here and it's pretty peppy (for a Mac running SD) but once the memory usage gets beyond the physical RAM, it will start swapping to the relatively fast onboard SSD, but you get a serious performance hit in the meantime. I can get up to 40-some odd GB in use before that happens.

Close all other applications and let it run. You can free up GPU RAM by turning off GPU acceleration in your browser.

The unified memory is great as in it will let you continue to run SD even when running out of memory, but you will get dinged for it performance wise.
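Swap use can also be checked from a script instead of Activity Monitor. The sketch below assumes macOS, where `sysctl vm.swapusage` prints a line such as `vm.swapusage: total = 2048.00M  used = 1093.25M  free = 954.75M  (encrypted)`; the exact output format and the megabyte suffix are assumptions, and the helper names are hypothetical.

```python
import re
import subprocess
from typing import Dict


def parse_swapusage(text: str) -> Dict[str, float]:
    """Parse the output of `sysctl vm.swapusage` into megabytes."""
    fields = dict(re.findall(r"(total|used|free) = (\d+(?:\.\d+)?)M", text))
    if set(fields) != {"total", "used", "free"}:
        raise ValueError(f"unexpected vm.swapusage output: {text!r}")
    return {name: float(value) for name, value in fields.items()}


def current_swap_usage() -> Dict[str, float]:
    """Query the running system (macOS only; raises elsewhere)."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return parse_swapusage(result.stdout)
```

Calling `current_swap_usage()` before and after a generation and comparing the `used` values gives a rough indication of whether that run pushed the machine into swap.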

@cooperdk

I have read that's how it is on Mac.
You need a lot of RAM. Probably twice is good. It loads everything in RAM and swaps nothing to the GPU.
On Windows it uses all available memory, but when loading new stuff it uses RAM, and 32 GB of system RAM is basically a minimum in my experience if you have 12 GB of GPU RAM, or you get out-of-memory errors. It might be due to it using everything you have, and "everything" on Mac might include virtual memory, which it won't use on Windows.

@twopiearr
Author

Open Activity Monitor and look at your memory pressure. I have a 32GB RAM Studio here and it's pretty peppy (for a Mac running SD) but once the memory usage gets beyond the physical RAM, it will start swapping to the relatively fast onboard SSD, but you get a serious performance hit in the meantime. I can get up to 40-some odd GB in use before that happens.

Close all other applications and let it run. You can free up GPU RAM by turning off GPU acceleration in your browser.

I have done all this. I've also done all this while using other implementations. Consistently, Auto1111 is the only implementation with this problem. I'm not disputing that the unified memory on the Mac causes issues; I'm stating that, from all available information, something Auto1111 specifically is doing is causing this lack of optimization. It doesn't happen in InvokeAI. It doesn't happen in DiffusionBee. It doesn't even happen in Draw Things, which is an iPad app sort of retrofitted to run on the computer. It's a problem that is unique to Auto1111.

@cooperdk

That may be true, but the way Mac uses resources is the same for all.
I read somewhere that you need to use cpu memory for everything with torch and then 32G is not too much considering you likely only have 24 or so available when you run the app.
You could install Linux in parallel; that could give you better memory usage, but still, the GPU isn't really supported.

@rworne

rworne commented Dec 17, 2022

I don't see this issue running any checkpoint with the default 512x512 size. I've had this run for 12+ hours generating images with no performance hits since a bit over two weeks ago, when the big M1 update in Automatic came out. Prior to that I had issues with it crapping out with a semaphore issue or some sort of fault.

Here is the log from what I have had it do this evening:

Weights loaded.
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:37<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:37<00:00,  1.06it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:37<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:37<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:37<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.03it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:37<00:00,  1.06it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:37<00:00,  1.06it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.05it/s]
100%|███████████████████████████████████████████| 40/40 [00:38<00:00,  1.04it/s]
Total progress: 100%|███████████████████████████| 40/40 [00:38<00:00,  1.04it/s]

Here it is with 768x1152, hiresfix:

100%|███████████████████████████████████████████| 20/20 [00:24<00:00,  1.22s/it]
100%|███████████████████████████████████████████| 20/20 [04:32<00:00, 13.62s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:16<00:00,  7.91s/it]
100%|███████████████████████████████████████████| 20/20 [00:24<00:00,  1.24s/it]
100%|███████████████████████████████████████████| 20/20 [04:11<00:00, 12.57s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:52<00:00,  7.31s/it]
100%|███████████████████████████████████████████| 20/20 [00:23<00:00,  1.20s/it]
100%|███████████████████████████████████████████| 20/20 [04:47<00:00, 14.36s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:24<00:00,  8.11s/it]
100%|███████████████████████████████████████████| 20/20 [00:24<00:00,  1.21s/it]
100%|███████████████████████████████████████████| 20/20 [04:45<00:00, 14.29s/it]
Total progress: 100%|███████████████████████████| 40/40 [05:23<00:00,  8.08s/it]

This is just big enough to get it to swap slightly, but no slowdown (last two are slightly longer due to my opening and using Safari). If I have this run all night, it may eventually give me an issue. I'll let it go for a few hours and see what it does. I'll update if I see anything.

EDIT:
Came back after a while, average time is ~9.5 sec/iteration.

@barryanders

barryanders commented Dec 17, 2022

I'm on an M1 Pro with 16 GB of RAM. Here are my results running images today: Euler A, CFG 7, highres fix, 768x1024, 20 steps. When the "fast" ones in the results below finish, there's no image. The image only shows up from the super slow ones. Now I realize it takes longer to make bigger images, but this seems ridiculous. Please correct me if I'm wrong. Granted, being able to produce amazing works of art that fast is wonderful, but it does make it take a long time to cherry pick from hundreds of results.

100%|████████████████████████████████████████████████████████████████| 20/20 [01:04<00:00,  3.23s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [43:02<00:00, 129.12s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:03<00:00,  3.17s/it]
100%|█████████████████████████████████████████████████████████████| 20/20 [1:27:08<00:00, 261.45s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00,  3.09s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:58<00:00, 119.93s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00,  3.04s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:30<00:00, 121.51s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00,  3.01s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:09<00:00, 120.49s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00,  3.05s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:35<00:00, 121.76s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00,  3.06s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:06<00:00, 120.32s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:00<00:00,  3.04s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:29<00:00, 118.47s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:02<00:00,  3.11s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:59<00:00, 119.99s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:01<00:00,  3.05s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:19<00:00, 120.97s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [01:02<00:00,  3.11s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:04<00:00, 120.22s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [00:59<00:00,  3.00s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [40:03<00:00, 120.16s/it]
100%|████████████████████████████████████████████████████████████████| 20/20 [00:59<00:00,  2.98s/it]
100%|███████████████████████████████████████████████████████████████| 20/20 [39:32<00:00, 118.60s/it]

@cooperdk

Those speeds (the slow ones) correspond with the time it takes to generate on cpu only on Windows - with a CPU that is about four generations old.

Knowing it's unlikely that the Apple GPU supports torch, I guess this means that Apple users should try it on Linux, or get capable hardware.

@rworne

rworne commented Dec 18, 2022

Here are my results for the same. I have 32GB of RAM:

To create a public link, set `share=True` in `launch()`.
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:20<00:00,  1.03s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [03:57<00:00, 11.86s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [04:27<00:00,  6.70s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00,  1.13s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:26<00:00, 13.33s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [04:59<00:00,  7.49s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00,  1.15s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:54<00:00, 14.74s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [05:26<00:00,  8.16s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:22<00:00,  1.10s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [04:45<00:00, 14.30s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [05:23<00:00,  8.08s/it]

What I noticed on my machine is that the Python process is using slightly more than 23GB of RAM with those settings.
I'd think yours with 16GB is probably swapping, which explains the crazy long 2nd stage image processing. The first pass goes quickly because if you have the firstpass sizes set to 0,0 it renders a 512x512 image then IIRC, scales it up to the desired resolution on the 2nd pass.

EDIT:
I did some more experimentation. Since my machine has double the RAM of yours, I ran the same thing again with double the image sizes. Here are the results:

100%|█████████████████████████████████████████████████████████████████████████| 20/20 [00:38<00:00,  1.93s/it]
100%|█████████████████████████████████████████████████████████████████████████| 20/20 [26:31<00:00, 79.56s/it]
Total progress: 100%|█████████████████████████████████████████████████████████| 40/40 [28:15<00:00, 42.38s/it]

[Image: Screenshot 2022-12-17 at 8 03 38 PM]

As you can see with these settings, the performance went into the toilet because it's swapping like mad. My swapfile fluctuated between 12 and 20GB with 100% RAM utilization while rendering this image. In fact, the 768x1024 size you mentioned earlier is very near the largest size I can do on this 32GB machine without it swapping.

cooperdk:
While CUDA is not supported, the MPS support in SD is definitely using the GPU, I have steady 80-90+% utilization there while the CPU is bouncing around from 33% to 90%.
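Whether a given PyTorch install can see the Apple GPU at all can be checked with a few lines. This is a hedged sketch: it assumes a PyTorch build recent enough to expose `torch.backends.mps` (older builds simply fall through to the other branches), the function name is made up for illustration, and it says nothing about whether the web UI actually selects that device.

```python
def describe_torch_device() -> str:
    """Report which accelerator backend PyTorch would be able to use."""
    try:
        import torch
    except ImportError:
        return "torch not installed"

    # torch.backends.mps only exists in builds with Metal (MPS) support.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(describe_torch_device())
```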

@barryanders

barryanders commented Dec 18, 2022

Thanks for running tests and sharing for comparison.

Edit: After running another batch at 640x960, instead of averaging in the 120s, I was somewhere more around the 20s. Slightly smaller, but significantly faster.
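A rough back-of-the-envelope comparison of the two sizes, as a sketch only. It assumes the usual Stable Diffusion layout where the latent is 1/8 of the image size per side, and that unoptimized self-attention at full latent resolution needs memory proportional to the square of the number of latent positions; neither figure was measured in this thread.

```python
from typing import Tuple

Size = Tuple[int, int]


def pixel_ratio(small: Size, large: Size) -> float:
    """Ratio of pixel counts between two image sizes."""
    return (small[0] * small[1]) / (large[0] * large[1])


def attention_ratio(small: Size, large: Size, downscale: int = 8) -> float:
    """Ratio of attention-matrix sizes, assuming cost grows with tokens squared."""
    def tokens(size: Size) -> int:
        return (size[0] // downscale) * (size[1] // downscale)

    return (tokens(small) / tokens(large)) ** 2


print(f"{pixel_ratio((640, 960), (768, 1024)):.2f}")      # prints 0.78
print(f"{attention_ratio((640, 960), (768, 1024)):.2f}")  # prints 0.61
```

A 22% drop in pixels would not by itself explain going from around 120 s/it to around 20 s/it; the more likely reading, in line with the swapping explanation earlier in the thread, is that the smaller size fits in physical RAM.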

@rworne

rworne commented Dec 18, 2022

I bought my Studio a month before SD came out. If I knew then what I know now, I'd have gotten a 64GB unit. I'm not majorly upset about it, as I usually make images as experimentation so 512x512 and 512x768 (and thereabouts) is perfectly fine for me. There's an upper limit bug in the M1s too, as I have tried and let it make larger images, but once you get above 1024x1024, you hit another bug that crashes the program - see #5278 for that one.

SD on M1 has come a long way since it was first released. I'm really hoping there are some good optimizations that can be wrung out of it over the next year.

@holynuts

I have the earliest model M1 (16 GB). I tried the v2.1 512 model and removed the --medvram argument, and it seems to be working 'normally' again (like before, when using 1.5 models).

@rworne

rworne commented Dec 18, 2022

I have the earliest model M1 (16 GB). I tried the v2.1 512 model and removed the --medvram argument, and it seems to be working 'normally' again (like before, when using 1.5 models).

My command line settings are:
--listen --no-half --use-cpu interrogate --skip-torch-cuda-test

It ignores the skip CUDA test, but does the rest of them.

@rworne

rworne commented Dec 18, 2022

I have the earliest model M1 (16 GB). I tried the v2.1 512 model and removed the --medvram argument, and it seems to be working 'normally' again (like before, when using 1.5 models).

You mentioned earlier you don't see the interim image. If you go to settings and set the value for "Show image creation progress every N sampling steps" to 1 (most frequent) or some other number, it will show the image after each N iteration steps. There is a performance penalty for this, but I find it minimal (maybe a few sec per image). I keep mine at 1 so when I generate that lovely supermodel and it turns out to be a horrid lobstrosity, I can interrupt the image generation and fix the prompt without waiting for it to finish first. This can be a huge time saver.

@twopiearr
Author

twopiearr commented Dec 18, 2022

Apple users should try it on Linux, or get capable hardware.

cooperdk, do you seriously have nothing better to do with your life than dunk on hardware in the specific thread created to troubleshoot for it? I'm truly sorry your soul is that empty, but please shut up if you have nothing useful to contribute.

@ptppan

ptppan commented Feb 23, 2023

Same issue here, several months after this thread was created. I didn't have any issues for weeks, 14 or so days, up until 2-3 days ago, when A1111 suddenly began crawling. My machine got hot for the first time, so I figured perhaps it was throttling, and I stopped generating for a while. This was 3 days ago, but ever since, I can no longer do more than 2-3 generations before everything grinds to a halt.

I've tried different browsers, no difference, same issue, fast at first and then grinds to a halt. I've tried restarting (of course), I've tried reinstalling, no change.

I don't know what it is but it is making SD unusable for me now, which is sad, because I really want to use it like I have for the past 14 days.

Here is an example of the times I get: https://i.imgur.com/34IZISm.png. The message close to the start is me activating ControlNet openpose, but I've tried without activating it, it doesn't matter. After a few generations my instance grinds to a halt and I can no longer generate properly.

I've also tried different instances and this only seems to happen on A1111. @twopiearr did you guys find a solution to this? It's been a few months now after all.

Thank you!

@twopiearr
Author

@ptppan see an incredibly helpful thread here: #5461 (reply in thread)

and no anti-apple bigots, either!

@ptppan

ptppan commented Feb 24, 2023

@ptppan see an incredibly helpful thread here: #5461 (reply in thread)

and no anti-apple bigots, either!

Haha awesome, thank you! Did you find anything in particular that helped you with the issue you described in this thread? Because I am having a similar issue. Looking through that thread, I am not sure what could actually solve this issue, and I'm not seeing any comments by you there? Did you do something recommended in that thread in particular that helped you with your issue?

I am actually unsure if I actually have an issue now or if it's just the usage of LoRAs that slow things down (naturally), or if it perhaps is ControlNet that runs slow on Mac, or whatever it could be. Really confused about this issue, as it seems so vague. But love to know if you followed a particular advice in that thread? Thank you again for providing it!

Edit: Nvm, stupid me, just saw your posts in the thread. Thank you so much, I'll proceed from there.

@alsomail

alsomail commented Mar 7, 2023

Same issue here. It used to generate a picture in 7 minutes, but since last Sunday morning it has suddenly become four times slower. There is one detail I noticed, though. A few days ago, when everything was normal, I exited the webui with Control + Z and closing the terminal just closed it. Since the slowdown, closing the terminal prompts me that Python is still running and asks whether I'm sure I want to close it.

@alsomail

alsomail commented Mar 8, 2023

[Screenshot: iShot_2023-03-08_09 20 13]

With the same parameters and ten batches, the speed is 15s/it for the first generation, 67s/it for the second generation, and 217s/it for the fourth generation.
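
To put those figures in perspective, the slowdown relative to the first generation can be worked out directly. This is a minimal sketch that uses only the three s/it values reported above; the variable names are illustrative.

```python
# Seconds per iteration as reported above, keyed by generation number.
seconds_per_it = {1: 15, 2: 67, 4: 217}

baseline = seconds_per_it[1]

# Slowdown factor of each generation relative to the first one.
slowdown = {gen: s / baseline for gen, s in seconds_per_it.items()}

for gen, factor in slowdown.items():
    print(f"generation {gen}: {seconds_per_it[gen]} s/it, {factor:.1f}x the first")
```

By the fourth generation each step takes roughly 14 times as long as it did in the first.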

@alsomail

alsomail commented Mar 8, 2023

After Control + Z, Python still occupies 30 GB of memory and has not released it.
[Screenshot: iShot_2023-03-08_09 29 45]

@alsomail

alsomail commented Mar 8, 2023

the Python version is Python 3.10.10 (main, Feb 8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)]

@cooperdk

cooperdk commented Mar 8, 2023

Python does not shut down if you close the terminal window. You have to break it from running.
I don't know if you do that with Ctrl+C on Mac.
You should be able to shut it down in your process list.

But the web ui was not made to run on Mac. You really should get proper gear to use it.

@twopiearr
Author

Seriously, @cooperdk - get a hobby or something.

@barryanders

Python does not shut down if you close the terminal window. You have to break it from running. I don't know if you do that with Ctrl+C on Mac. You should be able to shut it down in your process list.

But the web ui was not made to run on Mac. You really should get proper gear to use it.

Aye, on that note, Draw Things runs nicely on a Mac.

@bigbozo

bigbozo commented Mar 14, 2023

After Control + Z, Python still occupies 30 GB of memory and has not released it.

You didn't stop it (the screenshot says 'suspended'); the process still exists in the background. To stop it, press Ctrl-C.
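
The difference between suspending and interrupting can be seen with a small experiment. This is a minimal sketch for macOS or Linux, assuming a throwaway child process as a hypothetical stand-in for the web UI. It sends SIGSTOP in place of the SIGTSTP that Ctrl+Z sends from the terminal, since both leave the process suspended rather than terminated.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the web UI: a child that reports it is ready, then sleeps.
# It resets SIGINT to the default action in case the launching environment ignores it.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_DFL)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

child = subprocess.Popen([sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE)
child.stdout.readline()  # block until the child has started

# Suspend the child (the effect of Ctrl+Z): it stops executing but is not gone.
child.send_signal(signal.SIGSTOP)
alive_after_suspend = child.poll() is None

# Resume it, then interrupt it (the effect of Ctrl+C).
child.send_signal(signal.SIGCONT)
child.send_signal(signal.SIGINT)
child.wait(timeout=10)
alive_after_interrupt = child.poll() is None
child.stdout.close()

print("still alive after suspend:", alive_after_suspend)
print("still alive after interrupt:", alive_after_interrupt)
```

The suspended child still exists, and a real process in that state keeps whatever memory it had allocated, whereas the interrupt ends it.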

@akx akx added the platform:mac Issues that apply to Apple OS X, M1, M2, etc label Jun 13, 2023
@s4t0shi-n4k4m0t0

Experiencing the same on my M1 air 8GB.
I don't understand from what point this started to happen.
I'm sure it was not like this in the first few weeks after installing A1111

@cooperdk

cooperdk commented Jul 8, 2023

This is really not written to execute on a Mac, but for Nvidia hardware. You're bound to get issues.

@egdx

egdx commented Jul 8, 2023

Experiencing the same on my M1 air 8GB. I don't understand from what point this started to happen. I'm sure it was not like this in the first few weeks after installing A1111

Try this. It works great on a Mac mini M2 8GB, where I had similar problems with the default args.
Update webui-user.sh with these args, which I think override the defaults. (Note: the default args are not in webui-user.sh.)

export COMMANDLINE_ARGS="--skip-torch-cuda-test --upcast-sampling --opt-sub-quad-attention --use-cpu interrogate --disable-nan-check"

[Screenshot]
The 4 generated images are 512 wide x 768 high and average about 50 seconds each.

[Screenshot]

It's been working great for several days, many sessions, hundreds of images generated averaging about 50 seconds per image.

Let us know how it goes. Thanks.

@rworne

rworne commented Jul 8, 2023

Try this. Works great on Mac mini M2 8GB. Had similar problems with the default args.

I gave it a shot on my M1 Mac Studio. The "opt-sub-quad-attention" argument causes "modules.devices.NansException", which, with your other flag, generates blank (black) images.
This doesn't happen with "opt-split-attention".

No idea why this is happening.
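
For anyone who wants to try that substitution, the swap can be sketched as follows. This is only an illustration: `swap_flag` is a hypothetical helper, not part of the web UI, and keeping the remaining flags unchanged is an assumption rather than a tested configuration.

```python
import shlex

# Arguments suggested earlier in the thread for webui-user.sh.
suggested = (
    "--skip-torch-cuda-test --upcast-sampling --opt-sub-quad-attention "
    "--use-cpu interrogate --disable-nan-check"
)


def swap_flag(args: str, old: str, new: str) -> str:
    """Return args with every occurrence of the flag `old` replaced by `new`."""
    tokens = shlex.split(args)
    return shlex.join(new if tok == old else tok for tok in tokens)


# Replace the attention flag that raised NansException with the one that did not.
adjusted = swap_flag(suggested, "--opt-sub-quad-attention", "--opt-split-attention")
print(f'export COMMANDLINE_ARGS="{adjusted}"')
```

The printed line can be pasted into webui-user.sh in place of the earlier export.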
