Strange readings #11

Closed
Mr-HaleYa opened this issue May 26, 2020 · 113 comments

@Mr-HaleYa
Sponsor Contributor

I am using
esp32 ttgo sim 800l (https://rb.gy/ogywnq)
Dfrobot dht22 (https://www.dfrobot.com/product-1102.html)

my code

DHTNEW DHT(18);
DHT.setWaitForReading(true);

float DHTtempBuffer[7];
float DHThumidityBuffer[7];
int DHTReadings = 7;             // number of DHT readings before choosing a median

void getDHT(){
  for (int i = 0 ; i < DHTReadings; i++) {
    DHT.read();                                     // Gets a reading from the DHT sensor
    DHThumidityBuffer[i] = DHT.getHumidity();
    DHTtempBuffer[i] = DHT.getTemperature();
  }
  sortData(DHThumidityBuffer, DHTReadings);                     //sorts array low to high
  DHThumidity = DHThumidityBuffer[(DHTReadings - 1) / 2];       //gets median of the array
  sortData(DHTtempBuffer, DHTReadings);                         //sorts array low to high
  DHTtemperature = DHTtempBuffer[(DHTReadings - 1) / 2];        //gets median of the array
}

So my problem is that even tho I use a filtering algorithm to get rid of highs and lows by taking 7 readings then choosing a median I will occasionally get some massive dips in temperature (and spikes in humidity as a result). my temperature in my room stays about 24°C and the dips are -12 and 12 °C so half the temp and half * -1. I have no clue why it is doing this... It runs every ~2 mins and when the bad readings start it does it consistently for a few runs before correcting itself.

image

Nothing moves nothing is touched nothing changes... It just goes wack... Could it be a library error? I don't see how it could be but you're a smart man @RobTillaart so you may see something I'm missing. I have also thought of the potential that it's a dud sensor. I hope not bc I ordered 20 more before testing the one I'm using. I had the thought that maybe adding more readings (like 15 instead of 7) would give more data to filter, but it looks like all the readings it's taking in those fault zones are bad...

Would appreciate any suggestion - Regards Hale

@RobTillaart
Owner

The strange numbers seem to indicate that some bits are not read correctly. 12 equals 8+4 so 2 bits.
Most often this is caused by the (missing) pull up resistor. So three questions:
How long is the wire between ESP and DHT sensor?
What is the value of the pull up resistor?
What is the voltage level of the power supply?

@Mr-HaleYa
Sponsor Contributor Author

Thanks for the quick reply,

like I said initially I am using a DFrobot DHT22
image

  1. How long is the wire between ESP and DHT sensor?

I am using the provided 12in cable, the wiki says the sensor is good up to 20m cable (I have my doubts on this tho)

  2. What is the value of the pull-up resistor?

I found this comment about the product

The PCB has an I2C pull-up resistor and a decoupling cap

Additionally, all diagrams I have found of it wired show it without external pullups and all the docs I have read say it works with esp32. Also a direct reply from the company said that no external pull-up was needed due to the one on the PCB.

  3. What is the voltage level of the power supply?

I have it connected to the 5v rail of the esp as the doc says it is 5v powered. I do have it being powered off the front ports of my desktop tho... could this all be attributed to it being supplied limited current?

@RobTillaart
Owner

  1. 14 inch ~ 35 cm, normally I would add a pull up resistor 4K7
    Do you have a scope to see the squareness of the signals on the line?
    Very useful tool when doing hardware.

  2. Strange that the documentation mentions an I2C pull up as the DHT22 is not an I2C device.

The fact that all diagrams don't show a pull up can be caused by the fact that they use short wires?
Check this tutorial - https://learn.adafruit.com/dht/connecting-to-a-dhtxx-sensor
it definitely uses a pull up.
Also the equivalent datasheet - https://akizukidenshi.com/download/ds/aosong/AM2302.pdf
promotes a pull up.

  3. The 5V should work, although it might give 5V on the ESP pins and I do not know if these can stand it for longer periods. But you have too many good readings so this is not the prime suspect.

  4. The DHT.read() function returns an error code if it detects one.
    4a. You do not check on this, why not?
    4b. Can you create a log file to see the error code when the dips occur.
    4c. If errors are detected, adjust the code so it retries after e.g. 5 seconds.

  5. Can you try this project? it seems to be an ESP specific library.
    https://www.dfrobot.com/blog-910.html
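
A minimal sketch of points 4a to 4c, written as plain C++ so it can be tried off-target. The names `Sample`, `Median`, `medianOfValidReads` and the `readOnce` callback are hypothetical and not part of DHTNEW. On the ESP32 the callback would wrap DHT.read(), DHT.getHumidity() and DHT.getTemperature(), and a failed read would be logged and followed by a pause before the retry. It assumes that status 0 means OK.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// One sensor read: status code plus the decoded values.
struct Sample {
  int status;          // assumption: 0 means OK, anything else is an error
  float humidity;
  float temperature;
};

// Result of a filtered measurement.
struct Median {
  bool valid;          // false when no read succeeded
  float humidity;
  float temperature;
};

// Middle element of a non-empty vector (lower middle for even sizes,
// same index rule as the (n - 1) / 2 in the sketch that opened the issue).
static float middle(std::vector<float> v) {
  assert(!v.empty());
  std::sort(v.begin(), v.end());
  return v[(v.size() - 1) / 2];
}

// Collect `wanted` good readings, skipping failed reads,
// and give up after `maxAttempts` reads in total.
Median medianOfValidReads(const std::function<Sample()>& readOnce,
                          int wanted, int maxAttempts) {
  std::vector<float> hum, temp;
  for (int i = 0; i < maxAttempts && static_cast<int>(hum.size()) < wanted; i++) {
    Sample s = readOnce();
    if (s.status != 0) continue;   // on hardware: log s.status, wait, then retry
    hum.push_back(s.humidity);
    temp.push_back(s.temperature);
  }
  if (hum.empty()) return {false, 0.0f, 0.0f};
  return {true, middle(hum), middle(temp)};
}
```

The point of the design is that a failed read never reaches the buffers, so the median is only taken over readings the library itself considered good.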

@Mr-HaleYa
Sponsor Contributor Author

  1. Do you have a scope to see the squareness of the signals on the line?

sorry.. I do not have an o-scope since the university has closed

  2. Strange that the documentation mentions an I2C pull up as the DHT22 is not an I2C device.

this was not documentation, it was a reply a user left on a purchase.

The fact that all diagrams don't show a pull up can be caused by the fact that they use short wires?

The tutorials (plural) that are on their site all show them using the same cable that is shipped with it.

Check this tutorial - https://learn.adafruit.com/dht/connecting-to-a-dhtxx-sensor
it definitely uses a pull up.
Also the equivalent datasheet - https://akizukidenshi.com/download/ds/aosong/AM2302.pdf
promotes a pull up.

That DHT 22 is not on a PCB and is instead just the raw sensor and yes the raw sensor very much does require a pull up between VCC and data, but the DFrobot version is the raw sensor mounted to a PCB and the PCB has a 10k pull-up between VCC and data just as shown in that example.

  4. The DHT.read() function returns an error code if it detects one.

Isn't it also supposed to throw a crazy number at you (-999 if I remember) when it gets an error and we have none of those?

  5. Can you try this project? it seems to be an ESP specific library.

that library is just a fork of the Arduino-DHT library so I didn't think it would be any different, I also wanted to use yours lol. However, It does say

none of the DHT libraries I found were written to work without errors on the ESP32. For ESP32 (a multi core/ multi processing SOC) task switching must be disabled while reading data from the sensor.

he is specifically talking about the dht22 so this might actually be legit

@RobTillaart
Owner

Thanks for your replies

  1. pull up
    The fixed 10 K pull up on the PCB might be too high so an additional 4K7 might improve the signal quality.

  2. DHT.read() return value
    I used a crazy number so it would not be mistaken for a "valid temperature or humidity"
    -999 is crazy enough for that purpose. The return value shows whether there is e.g. a CRC error
    and that helps to solve the issue.

  3. interrupts
    The DHTNEW library has a flag to disable interrupts during the time critical data extraction
    it is pretty new and I have only tested it on an UNO. Please give it a try.

Add this line before DHT.read()

DHT.setDisableIRQ(true);

With my next purchase-round I'm going to add a number of ESP32 so I can do more testing.

@Mr-HaleYa
Sponsor Contributor Author

Mr-HaleYa commented May 26, 2020

so I powered up the project again to see if it would reoccur and sure enough, it did. However, I have found something interesting. When I press the reboot button my readings stay stable for quite some time before doing it ~8 passes. If I unplug the sensor and plug it back in I get 1-3 good passes then it rapidly drops to -12, after 2-5 passes it bumps to 12, and after another 2-5 it recovers. What could this mean???

@Mr-HaleYa
Sponsor Contributor Author

side note: I changed it to 15 readings instead of 7 to see if it would help but no it has not

red is unplugged and replugged and yellow are just power cycles with onboard button
image

so in my debug log when I get the dips it looks like this (humid, temp°C)

40.40	25.70   //it always gives 1 good reading as far I can see
100.00	12.90
100.00	-12.90
100.00	-12.80
100.00	12.80
100.00	12.90
100.00	-12.80
100.00	-12.90
100.00	12.90
100.00	-12.90
100.00	-12.90
100.00	-12.90
100.00	-12.90
100.00	-12.90
100.00	-12.90

on a good pass, the readings look more like

40.50	25.60
41.70	25.70
40.70	25.70
40.60	25.60
40.70	25.70
40.70	25.70
40.60	25.60
40.60	25.60
40.60	25.60
40.60	25.60
40.60	25.60
40.70	25.70
40.60	25.70
40.60	25.60
40.70	25.70

In the passes right before it fails it looks like

40.60	25.60
41.60	25.70
40.50	25.60
40.50	25.60
40.60	25.70
40.70	25.80
40.60	25.70
40.50	25.60
40.60	25.70
40.50	25.60
40.60	25.70
40.60	25.70
100.00	12.80
40.60	25.70
100.00	12.80

@RobTillaart
Owner

Further analysis
The fact that there are so many good readings means that the code flow is OK.
Hardware issues have been discussed and you need to test and verify these.

In the software to read the DHT sensors the timing is the most critical part. For the much faster ESP32 (compared to UNO) this timing should not be a problem however the "multitasking" behavior of the ESP has been mentioned as a possible cause in this and other timing issues. However differences in hardware including differences in quality of the sensors (including fake and cheap clones) have caused many problems in the past too.

In this analysis I will prove the cause is not in the software but in the hardware, and that a work around can possibly be made in software.

Humidity
Jumps to 100.0. In the code the value is constrained to the range 0..100, so the value 100 does not tell us what happened to the received bit pattern. I'll come back to this later.

Temperature
I see three temperature values in your log, so let's analyse the underlying bit pattern. In this analysis the MSB is most informative.

temperature = ((_bits[2] & 0x7F) * 256 + _bits[3]) * 0.1;
...
if (_bits[2] & 0x80)  // negative temperature
  {
    temperature = -temperature;

reversing these lines of code gives us

Value     Val * 10   bits[2] pattern   bits[3] pattern
 25.70      257      0000 0001         0000 0001
 12.80      128      0000 0000         1000 0000
 12.90      129      0000 0000         1000 0001
-12.90     -129      1000 0000         1000 0001

Notice: the most significant set bit has shifted 1 place to the right.

cause negative sign
This means that the leftmost bit of bits[2] (the sign bit) is in fact the LSB of the real humidity value.
As an LSB is relatively noisy this fluctuates between 0 and 1. That causes the negative sign to pop up randomly when this error occurs.

cause high humidity
As the right shift moves a bit into the MSB of bits[0], which is the MSB of the humidity, it adds 3276.8 (0x8000 * 0.1) to the humidity, putting it far above 100, and therefore it will be constrained to 100.

As far as I can see all wrong readings have a humidity of 100. So we might assume that when the shift happens, we always shift in a HIGH bit. Where does that come from and WHY?
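
The effect can be reproduced off-target with a small sketch. It assumes the usual DHT22 frame of five bytes (humidity high and low, temperature high and low, checksum) and the humidity formula (bits[0] * 256 + bits[1]) * 0.1; the temperature decoding follows the two library lines quoted above. The names `Frame`, `Reading`, `decodeDHT22` and `injectLeadingOne` are made up for this illustration. Inserting one spurious HIGH bit in front of a good frame pushes every real bit one position toward the LSB, which is the pattern in the table: the set bit of bits[2] ends up at the top of bits[3].

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

using Frame = std::array<uint8_t, 5>;   // humidity hi/lo, temperature hi/lo, checksum

struct Reading {
  float humidity;
  float temperature;
};

// Decode a DHT22 frame; humidity is constrained to 100 as described above.
Reading decodeDHT22(const Frame& b) {
  float humidity = (b[0] * 256 + b[1]) * 0.1f;
  float temperature = ((b[2] & 0x7F) * 256 + b[3]) * 0.1f;
  if (b[2] & 0x80) temperature = -temperature;   // sign bit
  return {std::fmin(humidity, 100.0f), temperature};
}

// Simulate the fault: one extra HIGH bit arrives first, so the 40 real bits
// all move one place down and the last checksum bit falls off the end.
Frame injectLeadingOne(const Frame& b) {
  Frame out{};
  uint8_t carry = 1;                             // the spurious HIGH bit
  for (std::size_t i = 0; i < b.size(); i++) {
    out[i] = static_cast<uint8_t>((carry << 7) | (b[i] >> 1));
    carry = b[i] & 1;                            // LSB moves into the next byte
  }
  return out;
}

// Usage: a frame for 40.6 %RH / 25.7 C turns into 100 %RH / 12.8 C.
void demoBitShift() {
  const Frame good = {0x01, 0x96, 0x01, 0x01, 0x99};
  const Reading bad = decodeDHT22(injectLeadingOne(good));
  assert(bad.humidity == 100.0f);
  assert(bad.temperature > 12.7f && bad.temperature < 12.9f);
}
```

With a humidity of 40.7 (raw 407, LSB set) the same shift sets the sign bit and the temperature comes out as -12.8, which matches the random sign flips in the log.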

Cause of the shift
In the code in _readSensor the bits are defined by the length of a HIGH period measured in micros. These lines (around 170-180)

  loopCnt = DHTLIB_TIMEOUT;
    while(digitalRead(_pin) == HIGH)
    {
      if (--loopCnt == 0) return DHTLIB_ERROR_TIMEOUT;
    }

    if ((micros() - t) > 40)
    {
      _bits[idx] |= mask;        // <<< sets HIGH bit
    }

As we get an extra HIGH bit we must conclude the time is larger than 40 us.

The number 40 is based upon the specs in the datasheet of the DHT sensors:

  • 26-28us ==> 0
  • 70 uS ==> 1

40 us is 43% larger than the 28us
40 us is 43% smaller than the 70us
So the margin is in both directions equal in size in relative terms (harmonic average). That is the reason I chose 40. As this core code worked for years with many clients it is a good choice.
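
A short sketch of that reasoning, under the assumption that only the HIGH time of each bit matters. `harmonicMean` and `classifyBit` are hypothetical helpers, not library functions, and the 45 us pulse in the usage note is an invented value for a "0" that takes too long to go LOW.

```cpp
#include <cassert>

// Harmonic mean of two pulse widths: the value that lies the same
// relative distance from both of them.
constexpr double harmonicMean(double a, double b) {
  return 2.0 * a * b / (a + b);
}

// Classify one data bit from its measured HIGH time in microseconds.
// The threshold has to stay between the 28 us "0" and the 70 us "1".
inline int classifyBit(unsigned highTimeUs, unsigned thresholdUs) {
  assert(thresholdUs > 28 && thresholdUs < 70);
  return highTimeUs > thresholdUs ? 1 : 0;
}

// 2 * 28 * 70 / (28 + 70) = 3920 / 98 = 40
static_assert(harmonicMean(28.0, 70.0) == 40.0, "threshold for 28 us and 70 us");
```

With a threshold of 40, `classifyBit(45, 40)` returns 1, so a stretched "0" pulse is read as a HIGH bit; with a threshold of 50, `classifyBit(45, 50)` returns 0.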

So for some reason the data pin stays HIGH too long (sometimes), or has difficulty going LOW fast enough. A detailed analysis of the hardware is mandatory to find the cause.

CONCLUSION
The root cause is not found in the code, it works too well too often. The above analysis shows that the data-pin does not go LOW fast enough according to the specs / datasheet.

A possible solution to work around this incorrect behavior is to adjust the timing threshold value in the software, in concreto the value in line 176 from 40 to 50.
This gives the signal 10 us extra time to get LOW and is still far enough from the 70 to be discriminating between 0 and 1. If the problem still occurs at 50, one could even try 60, but it must stay between 28 and 70.

Please test if this proposal solves the issue

@Mr-HaleYa
Sponsor Contributor Author

You are amazing... You literally broke it down and explained it so well. How do I buy you a coffee???

Anyways, I was on board with the idea that it was a hardware issue not explicitly software related. I modified the dhtnew.cpp file and changed it to 50 and so far not a single bad reading. Either you're a freaking wizard and solved it or it just hasn't reoccurred yet. I have restarted, unplugged, and power cycled like 20 times, and it's going strong.
all rows look like this

18.50	24.50
13.10	24.90
13.20	24.90
13.20	24.90
13.30	24.90
13.40	24.90
13.40	24.90
13.40	24.90
13.30	24.90
13.30	24.90
13.20	24.90
13.20	24.90
13.20	24.90
13.20	25.00
13.20	25.10

I will keep you informed on the status of the next 24 hours. Might I suggest that if it does not falter again meaning this minuscule increase in time was able to eliminate the problem, you should make this an update to prevent future people from suffering as I (briefly) did?

Thank you for all the time you have put into your responses - Hale

@RobTillaart
Owner

Thanks for the compliments. No wizard, otherwise I would focus on other things than software.
It is just (hard) analytical thinking and understanding the details.

I still think it is a hardware issue somehow as the timing is out of specification.
Fortunately it is a problem that can be solved in software in a way that has very little chance of affecting any other platform, as 50 is well under the 70 us.

I will prepare a PR with the adjusted threshold and wait for the results of your 24h test.

@Mr-HaleYa
Sponsor Contributor Author

Mr-HaleYa commented May 27, 2020

dang it...

100.00	-11.60
100.00	-11.60
100.00	-11.70
100.00	11.60
100.00	-11.60
100.00	11.70
100.00	11.70
100.00	-11.60
100.00	-11.60
100.00	-11.60
100.00	11.70
100.00	11.70
100.00	11.70
100.00	11.60
100.00	11.60

that's with 50... The lower numbers are bc its colder inside today.

100.00	11.70
100.00	11.80
100.00	11.80
100.00	11.70
100.00	11.80
100.00	11.80
100.00	11.80
100.00	-11.80
100.00	11.80
100.00	-11.80
100.00	11.80
100.00	11.80
100.00	11.80
100.00	11.80
100.00	11.80

and that's with 60....

100.00	-11.90
100.00	-421.40
100.00	-421.40
100.00	-421.40
100.00	-421.40
100.00	-421.40
100.00	421.40
100.00	421.40
100.00	421.40
100.00	-421.40
100.00	421.40
100.00	-421.40
100.00	-421.40
100.00	-421.40
100.00	-421.40

I tried 30 just for fun and got this ^^^

@Mr-HaleYa
Sponsor Contributor Author

It's so weird tho... after reverting back to 40, It gives those dips then recovers and never does it again...
Instead all of a sudden after running for like 8 hours the sensor shut off or something because it was a solid reading of -999 on my graph after that point. It never did recover after that until I unplugged the sensor and plugged it back in.

@RobTillaart
Owner

So 40 behaves pretty well in the long run, 50 is slightly better at start, 60 the same. 30 is definitely too short as it adds 1-bits (numbers are larger).

It seems that the processor in the sensor got locked somehow in the end. Resetting the sensor helps. This could be done under program control by switching the power line by means of a MOSFET.
Was the sensor hot?

@Mr-HaleYa
Sponsor Contributor Author

no, everything was the same temperature as the box ~23°C. so all I'm needing is an internal temp and humidity sensor so we can monitor the temp to see if it gets cooked in the sun and to know if water is getting in (humidity will spike). I'm concluding these are duds so if you have any hardware change suggestions I'm up for some modding. Or do you have any suggestions on what to buy?

@RobTillaart
Owner

I'll come back to it later.

@RobTillaart
Owner

RobTillaart commented May 28, 2020

ALTERNATIVE SENSOR
SHT31 from Sensirion would be my choice. It is an I2C sensor so you need to connect the hardware in another way. I have written a library https://github.com/RobTillaart/SHT31 which has a similar enough interface to get it up and running quick.

Accuracy is better than DHT22 although resolution is approx. same.

It has a built-in heater to evaporate any condensation (e.g. if humidity goes very high) but you need to use that with care. My library supports this heater function and has some hooks to check the timing of the heater.

There are three known sensors with slightly different accuracy etc. Check the datasheet
https://www.mouser.com/datasheet/2/682/Sensirion_Humidity_Sensors_SHT3x_Datasheet_digital-971521.pdf

Adafruit has a breakout version of the SHT31 - https://www.adafruit.com/product/2857

DHT22
I have had the DHT22 (no breadboard version) running for many days (up to a week) without serious glitches with an UNO. So to understand the cause of the permanent failing after XX hours the hardware needs to be investigated.

One option (again a workaround) is to power the sensor up and down from the ESP. This will cost you an extra pin and minimal code (pin HIGH, wait 2 seconds, read sensor, pin LOW). This will work quite well and in fact it is used in battery-powered systems to save energy.
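
The sequence can be sketched as follows. `SensorBoard` and `readWithPowerCycle` are hypothetical names used only here; on the ESP32 the three methods would map to digitalWrite() on the extra power pin, delay() and DHT.read(). Keeping the hardware behind a small interface is only done so the order of the steps can be checked without a board.

```cpp
#include <cassert>

// Hypothetical hardware interface (not part of DHTNEW).
struct SensorBoard {
  virtual void setSensorPower(bool on) = 0;   // drive the pin that feeds the sensor
  virtual void waitMs(unsigned long ms) = 0;  // blocking delay
  virtual int readSensor() = 0;               // returns the library status code
  virtual ~SensorBoard() = default;
};

// Power the sensor, let it settle, read once, power it down again.
int readWithPowerCycle(SensorBoard& board, unsigned long settleMs = 2000) {
  assert(settleMs > 0);
  board.setSensorPower(true);
  board.waitMs(settleMs);                     // the "wait 2 seconds" step
  const int status = board.readSensor();
  board.setSensorPower(false);
  return status;
}
```

Because the sensor starts from a cold state before every read, a sensor that has locked up gets reset as a side effect.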

@RobTillaart
Owner

This question is still not answered:
What value does DHT.read() return when the temperature and humidity == -999?

That gives more information where in the protocol the handshake fails.

@RobTillaart
Owner

The long run usage of DHT22 really depends on the individual sensor.
See: https://forum.arduino.cc/index.php?topic=432544.15

If price is no issue => give the SHT31 a try (try it anyway to get some experience / reference)
Or go for the switch on / off construction as that is much cheaper and will work.

@RobTillaart
Owner

RobTillaart commented May 28, 2020

Studied the datasheet again, and I see this phrase "Data-bus's free status is high voltage level." but it is not specified if the host or the sensor should keep the data line HIGH . A quick look in two alternative libraries shows none pulls the line up after the sensor is read.

To test if this causes the long term failure you can add 2 lines to DHT._read() at the start (around line 86)

int DHTNEW::_read()
{
  // READ VALUES
  if (_disableIRQ) noInterrupts();
  int rv = _readSensor();
  if (_disableIRQ) interrupts();
  // put databus in HIGH state between reads.   
  pinMode(_pin, OUTPUT);        // <<< add line
  digitalWrite(_pin, HIGH);     // <<< add line
   _lastRead = millis();

Can you give this a try with a long run?

@RobTillaart
Owner

FYI
Made an issue - #13 - to investigate enable/disable of the power supply in the library. Looks not difficult, but could have side effects too.

@Mr-HaleYa
Sponsor Contributor Author

I think I will try the sht31, I would like to buy from DFrobot as that's where I get all my parts for this project, I just hope it actually works... I would appreciate it if you visited their site and glanced over it to see if it looks compatible and decent.

Sht31

Also, that is quite a large increase in pins going from one to three where I have almost all of my pins populated. Are all of the pins necessary? Or is there any other sensor that is a 1 pin? Of course this is just me being nitpicky

As for the checking what the error is, what code would you like me to insert to get this error code? I would gladly do some error testing for you.

also with your comment saying the no breadboard version should I just desolder it from the breadboard and try it raw like you did and put my own pull up resistor and everything to see if that fixes it. Maybe it's just their combination of capacitor and resistor.

@RobTillaart
Owner

DFRobot makes quite good stuff normally and the price is a bit better than Adafruit's.
From the 6 pins you only need to connect VCC GND SDA and SCL, so in fact you need 1 pin more than for the DHT11.

Because I2C is a bus other devices may use the same I2C pins (e.g. LCD or OLED display, FRAM EEPROM etc)

@RobTillaart
Owner

As for the checking what the error is, what code would you like me to insert to get this error code? I would gladly do some error testing for you.

void getDHT(){
  for (int i = 0 ; i < DHTReadings; i++) {
    int status = DHT.read();                                     // Gets a reading from the DHT sensor
    Serial.println(status);

@RobTillaart
Owner

RobTillaart commented May 28, 2020

also with your comment saying the no breadboard version should I just desolder it from the breadboard and try it raw like you did and put my own pull up resistor and everything to see if that fixes it. Maybe it's just their combination of capacitor and resistor.

Then I would just order a plain sensor. Your breadboard version works well way too often I would say.

@Mr-HaleYa
Sponsor Contributor Author

Then I would just order a plain sensor. Your breadboard version works well way too often I would say.

I don't understand what you mean by this. I was asking if I should remove this
image

from this

image

and try this

image

to see if it works (doesn't give false readings like it currently is)

@RobTillaart
Owner

Sure you can, the idea to do that test is good, but I would buy a new sensor and then compare the results.

@RobTillaart
Owner

Started

17:57:59.309 -> 1. Type detection test, first run might take longer to determine type
17:57:59.309 -> CNT	I	TYPE	STAT	HUMI	TEMP
17:57:59.309 -> 1	0	0	0	55.20	25.20
17:57:59.309 -> 2	1	22	0	55.10	25.20
17:58:01.336 -> 3	2	22	0	55.10	25.10
17:58:03.338 -> 4	3	22	0	54.90	25.20
17:58:05.318 -> 5	4	22	0	54.80	25.10
17:58:07.335 -> 6	5	22	0	54.70	25.20

@Mr-HaleYa
Sponsor Contributor Author

So this makes the ESP32 the next "suspect" of the burst of failing reads.

This is what I meant by it appears that I am right. My suspicion was that the esp32 is causing it.

@RobTillaart
Owner

This is what I meant by it appears that I am right. My suspicion was that the esp32 is causing it.

I cannot prove it is the ESP32, as the DHT11 test you did worked for more than 7 days.
I am open for suggestions how to test this hypothesis.

@RobTillaart
Owner

@Mr-HaleYa

Disabling the interrupts did not solve the DHT22 reading failures

03:51:04.386 -> 17759	13	22	-4	-999.00	-999.00
...
03:57:42.800 -> 17958	2	22	-4	-999.00	-999.00

Burst duration: 6:38, same range
The ratio between SENSOR_NOT_READY and BIT_SHIFT is 4 to 1
This is different from other measurements (about 6 to 1). An improvement in theory, as bit shifts can partially be recovered from, but no help in solving the issue.

End	17958	03:57:42.800
Start	17759	03:51:04.386
=====================================
	199	00:06:38.414

	130 	DHTLIB_ERROR_SENSOR_NOT_READY
	34  	DHTLIB_ERROR_BIT_SHIFT
	33	DHTLIB_OK
	====
	199 

Time to bursts 17759 * 2000 = 35.518.000 millis
Same order of magnitude as the previous first bursts.
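
The read counter can be turned into elapsed time with a few lines, which makes it easier to compare with the "10 hr" failure mentioned elsewhere in this thread. The helper name `elapsedFromReads` and the `Elapsed` struct are invented for this example; the 2000 ms interval is the one used in the multiplication above.

```cpp
#include <cassert>
#include <cstdint>

struct Elapsed {
  uint32_t hours;
  uint32_t minutes;
  uint32_t seconds;
};

// Convert a read counter to elapsed time, given one read every `intervalMs`.
Elapsed elapsedFromReads(uint32_t reads, uint32_t intervalMs) {
  assert(intervalMs > 0);
  const uint64_t totalMs = static_cast<uint64_t>(reads) * intervalMs;  // no 32-bit overflow
  const uint32_t totalSec = static_cast<uint32_t>(totalMs / 1000);
  return {totalSec / 3600, (totalSec / 60) % 60, totalSec % 60};
}
```

For read 17759 at 2000 ms per read this gives 35,518 seconds, that is 9 hours, 51 minutes and 58 seconds after start.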

Conclusion
The problem is hard and no new insights are made so far.
Test will run to see if the next failure is around 03:55 + 12:25 = 16:20

@Mr-HaleYa
Sponsor Contributor Author

I cannot prove it is the ESP32, as the DHT11 test you did worked for more than 7 days.

I did not do a full 7 days (slight exaggeration...) I got through about 4 1/2 I think and then I rebooted it because I had to do some maintenance so if you were saying the error should have occurred on the 6th day, I never got there. I will have one up and running for 7 days though once I get all of my problems sorted out.

I am open for suggestions how to test this hypothesis.

it only seems to happen when using an esp32 with the dht22 so if we try more boards such as an 8266 an Uno a mega a nano and if none of them give problems but the 32 then there has to be a conclusion that the 32 is causing the error. right? (Even if we don't explicitly know why)

@RobTillaart
Owner

Analyzing the bug takes already a lot of my time and I will not test other boards as these will not give me information as far as I can see.

Off the top of my head I have never seen the error in combination with the MEGA and NANO, but that is no proof for this version of the lib etc.
I recall that a similar problem with the DHT22 existed with a forerunner of this lib - DHTstable - in combination with an 8266 a few years ago. I cannot recall the solution but if the bug/solution was in the library it would have migrated into DHTnew. I might need to check if I can find that.

You are not right, you mix up correlation and causation.
See - https://www.tylervigen.com/spurious-correlations

If we see the problem only occurring in the combination of DHT22 and ESP32 we cannot conclude which one causes it. We only see the correlation. (that is the point we are now)
If we don't see the problem occurring on any other platform, yes that makes the ESP32 more suspect, but it is no proof for causation.
For what it is worth it still can be the library as well.

Most important step to make is a way to trigger the error explicitly at any time.

@RobTillaart
Owner

RobTillaart commented Jun 14, 2020

A bit of searching and reading turned up the 8266 discussion

from that thread

Sorry Rob, my last comment was not so clear. To try to clarify, I don't use a new version of Adafruit library or a new version of WiFi library. By code version 0.2 on Github, I was referring to the last version of my own code which uses the ESP8266 Wifi in station mode only (STA). By default, the ESP8266 runs WiFi in AP+STA mode. My previous code version was using WiFi in default mode (so AP+STA). My current code version (0.2) uses WiFi in STA mode only. I assume this is the reason I don't get no more NAN. My try with your own library was also with the version 0.2 of my own code.

Adafruit DHT22

Checked latest Adafruit library - https://github.com/adafruit/DHT-sensor-library/blob/master/DHT.cpp
(line numbers may vary in future)

  1. line 244: yield() call before actual read to handle WIFI interrupts
    might relate to the WIFI 10 hr failure mentioned above
    -> can be included.
  2. line 261: wake up delay is 10% longer for all sensor types.
    -> can be included.
  3. Line 275: delayMicroseconds(pullTime); waits default 55 usec - DHTnew does 40.
    -> can be included.
  4. Line 281: InterruptLock lock; is another way to prevent interrupts. Important difference is that the Adafruit code always disables interrupts but for a shorter period.
    -> can be changed
  5. Adafruit time out loops differ slightly, after about 1 millisecond.
    -> Mine are longer so not interesting.
  6. Adafruit has a completely different way of bit reading.
    -> Error occurs before this part of protocol so not interesting.

I will add the timing/ interrupt related points [1..4] and do a 24 hr test run when time permits...

@RobTillaart
Owner

@Mr-HaleYa

Results of test with interrupts disabled. As predicted the failure burst came at 16:20

16:20:16.428 -> 40189	3	22	-4	-999.00	-999.00
...
16:26:46.817 -> 40384	3	22	-3	-999.00	-999.00

Burst duration: 6:30 lowest so far but same order of magnitude
The ratio between SENSOR_NOT_READY and BIT_SHIFT is 4.6 to 1
This is lower than other measurements ~6 to 1.

End	40384	16:26:46.817
Start	40189	16:20:16.428
=====================================
	195	00:06:30.389

	129 	DHTLIB_ERROR_SENSOR_NOT_READY
	28  	DHTLIB_ERROR_BIT_SHIFT
	38	DHTLIB_OK
	====
	195 

Interval since previous burst 40189 - 17958 = 22231 * 2000 = 44.462.000

Conclusions

  • burst pattern confirmed
  • ratio pattern (more or less) confirmed
  • Disable interrupt flag does NOT prevent crashes. This is new information. However it gives me no insight yet about the cause.

Time for next test with patches mentioned above

@Mr-HaleYa
Sponsor Contributor Author

So I'm using an esp32 with a cellular modem on it and I use this cellular modem to send my data to my web server but I do not use the Wi-Fi on it at all. I'm curious if I still need to disable Wi-Fi because as soon as the chip boots up it turns the Wi-Fi modem on, you just don't use it so it's a low-power mode. Same goes for the Bluetooth module, I wonder if that needs disabled as well.

@RobTillaart
Owner

RobTillaart commented Jun 15, 2020

@Mr-HaleYa

Test with extended timing + extra yield() call ran for 12 hours and ZERO failures so far.
Test is now halfway through its 24 hr run and this looks promising as it skipped the 10 hr read failure.
Question is which of the 4 changes made the difference?
Or is it the combination of some/all?

So it looks like disabling wifi / bluetooth is not needed.

@Mr-HaleYa
Sponsor Contributor Author

That's promising! What exactly did you change? Just added 1 and 4? Can you post a snippet.

@RobTillaart
Owner

@Mr-HaleYa
I pushed my test branch so you can see the deltas. They include 1, 2, 3 and 4 of the list above + some changes I made for version 0.3.0, e.g. to setReadDelay().

https://github.com/RobTillaart/DHTNew/tree/test_timing

I want to understand which timing changes did the trick, or was it the call to yield()?
Two additional test runs are needed to see when it fails again.

The other changes I will revert as I want to keep implementation in line with the datasheet as much as possible.

@Mr-HaleYa
Sponsor Contributor Author

@RobTillaart
It is saying your test branch was updated 6 days ago... can you confirm that you pushed it.

@RobTillaart
Owner

RobTillaart commented Jun 15, 2020

I need more coffee in the morning, sorry. Fixed. (now I only need to fix a coffee)

@RobTillaart
Owner

RobTillaart commented Jun 15, 2020

pushed 0.3.0 alpha in test_timing,
will be used for 12 hour test run to verify it does not fail after 10 hours.

Current test runs now 22:30 so that is 10:00 + 12:25 hrs so normally second burst would be about now. Good to see zero fails so far.

@RobTillaart
Owner

RobTillaart commented Jun 15, 2020

Zero failures in 23:30; so starting test now with 0.3.0 alpha.


update: 9300+ reads OK

@RobTillaart
Owner

0.3.0 alpha - ran for 13:00 hrs
23366 reads OK - so no 10 hr failure (capture missed first 3 reads)

17:38:45.476 -> 4	3	22	0	48.40	25.90   
...
06:39:22.886 -> 23370	14	22	0	56.30	24.70

next steps

  • review code delta with 0.2.2
  • verify & update example sketches
  • performance per read
  • push 0.3.0

@Mr-HaleYa
Sponsor Contributor Author

So you think you figured it out then? 😃

@RobTillaart
Owner

Not understood in all details, but I think I'm closing in. This is my current hypothesis.

Observations

  • The DHT22 has its own processor.
  • Its clock is not super accurate (few percent).
  • Timing of the protocol is implemented quite strictly
  • During the crashes the sensor did not wake up

Hypothesis
The DHT22 sensor does not recognize the wake up signal as the DHTNEW library is "on the edge" of the specification in its timing.

Solution 0.3.0

  • Extending the wake up signal by 10%
    This is a lot, optionally test lower percentage in the future.
  • Extending the wait time for the sensor to react
    A polling loop that gives more time if needed. It can also be faster in some cases.
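
A polling loop of the kind described in the second point might look like the sketch below. It is not the code from the 0.3.0 branch; `waitForLevel` and its parameters are hypothetical. On an Arduino-style core `readLine` would wrap digitalRead(_pin) and `nowUs` would wrap micros().

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Wait until the data line reaches `level`, or give up after `timeoutUs`.
// Returns true when the level was seen in time.
bool waitForLevel(const std::function<bool()>& readLine,
                  const std::function<uint32_t()>& nowUs,
                  bool level, uint32_t timeoutUs) {
  assert(timeoutUs > 0);
  const uint32_t start = nowUs();
  while (readLine() != level) {
    // Unsigned subtraction keeps working when the microsecond counter wraps.
    if (nowUs() - start > timeoutUs) return false;
  }
  return true;
}
```

Compared with a fixed delay, the loop returns as soon as the sensor reacts and only spends the full timeout when the sensor is slow or silent.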

@Mr-HaleYa
Sponsor Contributor Author

so basically it does what it wants when it wants lol. sometimes quick readings and replies, other times it's too slow and doesn't fully respond before the read req is over

@RobTillaart
Owner

Yes it is alive :)

@RobTillaart
Owner

Issue closed as issue is fixed.
The cause seemed to be related to a too strict timing during wake up of the sensor.
However it is still unclear where the 10:00 and 12:25 intervals came from.

Please reopen if these bursts of failed reads occur again.

@Mr-HaleYa
Sponsor Contributor Author

@RobTillaart Wow, amazing work! I really don't know what I will do with my mornings now that this is closed 😭 I really enjoyed reading your in-depth explanations lol 😄 Well I hope the issue never arises again as that was (at the least) inconvenient, It is good that it was fixed tho. I wish you the best of luck on all your other projects and thank you again for the amazing work you have done here.
-Regards Hale

@RobTillaart
Owner

@Mr-HaleYa
You're welcome,
I like this kind of analysis, for me it is (mostly) fun especially the moments of new insights, the moments of real learning. The only thing that took real patience were the test runs, they took serious time, but that was also time to think about what happened under the hood.

Are you going back to the DHT22 now in your project? Or do you keep the DHT11?

@Mr-HaleYa
Sponsor Contributor Author

Well if you say it works then I see no reason not to use the DHT22 🙂 I have 5 so I will switch out some of the DHT11's in my projects and see if I get any errors 😆

@RobTillaart
Owner

In my tests it looks stable so please give it a try,

But you know that the fact that you see no bug is no proof there is none 😊

Success!
