Michael Thomson edited this page May 22, 2020 · 57 revisions

Here I'll probably put info before I format it properly into their own pages so you guys have something to work with way before "it looks nice" :) ofc feel free to add stuff here too, formatting doesn't matter as long as it's clear what you mean!


Protocol overview

(C = client, S = server/Sonoff camera)

C => broadcast to local LAN on port 32108
S => replies to the client’s originating port, from the (new?) port it’s listening on, with a cookie
C => returns the cookie to the server’s listening port directly
S => replies with a slightly modified version of the cookie indicating success
C => sends a ‘client initialising’ packet - client command index reset to 0
S => sends acknowledgement
S => sends a ‘server initialising’ packet - server command index reset to 0
C => sends acknowledgment

From this point on if there’s a gap of > 1250 msec a “Keepalive” exchange is triggered:

S => sends “Marco”
C => replies with “Polo”
C => sends “Marco”
S => replies with “Polo”

This will keep the channel open until the client sends a command which the server will acknowledge as above, with the command index incrementing by one each time.

This works even if the camera is isolated from the WAN (it still makes DNS requests of the local DHCP-supplied resolver), and the RTSP streams can be opened while the camera motion is triggered as above.

I have not yet figured out whether it’s possible to enable the RTSP stream or set the password without using the app, but hopefully others might be interested and can help take this reverse engineering further.

Channel discovery:

The camera will respond to a UDP broadcast on the local LAN with a payload of 4 bytes (f1 30 00 00) sent to port 32108. The response will be sent as a unicast to the source port of the machine that sent the broadcast.

The source port in the initial response is to be used for further UDP commands. It looks like the initial response is a cookie that the client sends back, and the source port of that message is used as the reply port by the server.

E.g. if the camera were on 192.168.0.129 and your test machine on 192.168.0.50, then:

Source IP       Source Port    Dest. IP        Dest. Port     Payload          Means
192.168.0.50  : (aaaaa)     -> 192.168.0.255 : 32108 (fixed)  (f1 30 00 00)    What's your port?
192.168.0.129 : server_port -> 192.168.0.50  : (aaaaa)        (f1 41 00 ..)    Talk to me on "server_port", here's a cookie
...
192.168.0.50  : client_port -> 192.168.0.255 : server_port    (f1 41 00 ..)    Talk to me on "client_port", here's your cookie back
192.168.0.129 : server_port -> 192.168.0.50  : client_port    (f1 42 00 ..)    Channel established?

  • aaaaa: Random high port >1024 chosen at runtime
  • 32108: Fixed port for broadcast to the network, on which the camera is always listening
  • server_port: Random high port >1024 chosen at runtime by the camera for ongoing communication
  • client_port: Random high port >1024 chosen at runtime by the client; can be the same as aaaaa above
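
The discovery step could be sketched in Python roughly as follows. This is only an illustration of the exchange described above, not tested against every firmware: the names discover and is_cookie, the 2 second timeout and the default broadcast address are my own assumptions.

```python
import socket

DISCOVERY = bytes.fromhex("f1300000")   # 4-byte probe described above
DISCOVERY_PORT = 32108                   # fixed port the camera listens on


def is_cookie(payload: bytes) -> bool:
    """True if a reply starts with f1 41, like the cookie described above."""
    return len(payload) >= 4 and payload[0] == 0xF1 and payload[1] == 0x41


def discover(broadcast_ip: str = "192.168.0.255", timeout: float = 2.0):
    """Broadcast the probe; return (cookie, (camera_ip, server_port)) or None.

    broadcast_ip and timeout are assumptions; adjust them for your LAN.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.settimeout(timeout)
        sock.sendto(DISCOVERY, (broadcast_ip, DISCOVERY_PORT))
        try:
            while True:
                payload, addr = sock.recvfrom(2048)
                if is_cookie(payload):
                    # addr[1] is the camera's source port, i.e. server_port
                    return payload, addr
        except socket.timeout:
            return None
```

The source port of the reply (addr[1]) is what the table calls server_port, so the cookie would be sent back there next.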

Cookie format:

The payload of the cookie contains a 4-byte header then the encoded serial number of the camera as follows:

Byte offset   Value                     Purpose(?)
0 - 3         F1 41 00 14               Constant - present in all cookies
4 - 7         45 57 4c 4b               "EWLK" - first third of the serial number
8 - 15        xx xx xx xx xx xx xx xx   Numeric part of the serial number as a big-endian integer (057746 = 0xE192 in the example below)
16 - 20       xx xx xx xx xx            ASCII representation of the alphanumeric suffix of the serial number
21 - 23       00 00 00                  Constant

So my camera with serial number "EWLK-057746-HRWYT" is represented by this cookie: f1 41 00 14 45 57 4c 4b 00 00 00 00 00 00 e1 92 48 52 57 59 54 00 00 00
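
As a cross-check of the layout above, here is a small Python sketch that builds and parses a cookie. The function names are made up for illustration, and the six-digit zero padding when printing the serial is an assumption based on the single example here.

```python
import struct

COOKIE_HEADER = bytes.fromhex("f1410014")  # constant bytes 0-3


def build_cookie(serial: str) -> bytes:
    """Encode a serial like 'EWLK-057746-HRWYT' into the 24-byte cookie."""
    prefix, number, suffix = serial.split("-")
    return (
        COOKIE_HEADER
        + prefix.encode("ascii")            # bytes 4-7
        + struct.pack(">Q", int(number))    # bytes 8-15, big-endian
        + suffix.encode("ascii")            # bytes 16-20
        + b"\x00\x00\x00"                   # bytes 21-23
    )


def parse_cookie(cookie: bytes) -> str:
    """Recover the serial number string from a cookie."""
    prefix = cookie[4:8].decode("ascii")
    (number,) = struct.unpack(">Q", cookie[8:16])
    suffix = cookie[16:21].decode("ascii")
    # Assumption: the numeric part is always printed as six digits.
    return f"{prefix}-{number:06d}-{suffix}"
```

Feeding it "EWLK-057746-HRWYT" reproduces the cookie bytes quoted above.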

Command protocol:

Seems to follow this pattern, but some messages are repeated (and acknowledged as repeats in the acknowledgement).

  1. Command sent
  2. Acknowledgement of command received
  3. Response sent
  4. Acknowledgement of response received

Payload structure:

Messages are variable length, both individually and when concatenated into larger payloads.

Message header

Byte offset   Value   Purpose(?)
0             F1      Constant - present in all commands / acks / responses
1             D0      Message sent
2             xx      Message length MSB
3             xx      Message length LSB

Channel identification and message sequence number

Byte offset   Value     Purpose(?)
4             D1        Constant
5             00 / 02   Channel; 00 = commands, 02 = video frames (the audio dumps below show 02 from the camera and 01 from the client)
6             xx        Message index (MSB) N.B. Server and client both maintain separate counters
7             xx        Message index (LSB)
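
The first eight bytes described in the two tables above can be unpacked in one go. A minimal Python sketch, with Header and parse_header as illustrative names of my own:

```python
import struct
from typing import NamedTuple


class Header(NamedTuple):
    kind: int      # byte 1, e.g. 0xD0 for "message sent"
    length: int    # bytes 2-3, big-endian (MSB first)
    channel: int   # byte 5, 00 = commands, 02 = video frames
    index: int     # bytes 6-7, big-endian message index


def parse_header(msg: bytes) -> Header:
    """Unpack bytes 0-7 of a message into a Header."""
    magic, kind, length, d1, channel, index = struct.unpack_from(">BBHBBH", msg, 0)
    if magic != 0xF1 or d1 != 0xD1:
        raise ValueError("not an F1 .. D1 message")
    return Header(kind, length, channel, index)
```

In the complete dumps further down this page the length field equals the number of bytes that follow offset 3 (e.g. 00 34 on a 56-byte message), which is a handy sanity check when reading captures.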

Command header

Byte offset   Value   Purpose(?)
8             88      Constant
9             88      Constant
10            76      Constant
11            76      Constant

Command payload

Byte offset   Value   Purpose(?)
12            xx      Varies by command; 04 and 08 are the most common, but 0C, 0E, 10, 14, 18 and 48 also appear below
13            00      Constant so far
14            00      Constant so far
15            00      Constant so far
16            xx      Varies by command
17            xx      Varies by command
18-31         00      Constant so far
32            xx      Varies by command
33            xx      Varies by command
34            00      Constant so far
35            00      Constant so far
36            xx      Varies by command

Individual commands can be longer than this one. Subsequent commands can be appended in a single message, simply concatenating another command header 88 88 76 76 and then the next variable-length command payload. One, two and four commands have been seen in a single message.

Commands understood so far

(Byte offsets assume a single command per message)

Pan/Tilt                Byte  12  16  17  32  33
Tilt up                       08  01  10  02  08
Tilt down                     08  01  10  01  08
Pan left                      08  01  10  06  08
Pan right                     08  01  10  03  08

Video stream            Byte  12  16  32
Start SD 640x360              14  01  02
Start HD 1920x1080            14  01  01
Stop stream                   14  03  00

(The video resolution is also passed back in the message from the camera to announce the video stream, after the start request is received from the client)

Sound Monitor           Byte  12  16  17
On                            04  04  00
Off                           04  05  00

Microphone Gain         Byte  12  16  17  36
Low                           0E  1A  81  28
Medium                        0E  1A  81  4B
High                          0E  1A  81  55

Speaker Volume          Byte  12  16  17  36
Low                           08  1A  81  28
Medium                        08  1A  81  4B
High                          08  1A  81  55

Motion Detection        Byte  12  16  17  32  33
Low                           08  24  03  xx  19
Medium                        08  24  03  xx  32
High                          08  24  03  xx  4B
Alarm on                      08  24  03  00  xx
Alarm off                     08  24  03  01  xx

Microphone Sensitivity  Byte  12  16  17  36
Indoor (50Hz)                 08  60  03  01
Outdoor (60Hz)                08  60  03  00

Picture Orientation     Byte  12  16  17  36
Normal                        08  70  03  00
Rotated                       08  70  03  03
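
To try one of the commands above, a single-command message could be assembled like this. It is a rough Python sketch: build_command and TILT_UP are names I made up, and the total length is a guess. In the four complete dumps on this page (client setup, camera setup, stream announcement, stream stopped) byte 12 matches the number of bytes that follow offset 31, so the sketch assumes the same holds for every command, which is unconfirmed.

```python
import struct


def build_command(index: int, fields: dict, total_len: int = None) -> bytes:
    """Build a single-command message with the given byte offsets set.

    fields maps byte offset -> value, as in the tables above.
    total_len is an ASSUMPTION: by default 32 + byte 12, based only on
    the complete dumps on this page.
    """
    if total_len is None:
        total_len = 32 + fields.get(12, 0)
    msg = bytearray(total_len)                       # everything else stays 00
    msg[0:2] = b"\xf1\xd0"                           # message header
    struct.pack_into(">H", msg, 2, total_len - 4)    # length after byte 3
    msg[4:6] = b"\xd1\x00"                           # command channel
    struct.pack_into(">H", msg, 6, index)            # message index
    msg[8:12] = b"\x88\x88\x76\x76"                  # command header
    for offset, value in fields.items():
        msg[offset] = value
    return bytes(msg)


# Byte values taken from the Pan/Tilt table above.
TILT_UP = {12: 0x08, 16: 0x01, 17: 0x10, 32: 0x02, 33: 0x08}
```

For example, build_command(5, TILT_UP) gives a 40-byte message with message index 5; whether the camera accepts that exact length has not been checked here.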

Less well understood messages

Client setup

(Sent immediately after channel setup and before video feed begins) Bytes 6-7 (the message index) are set to 00 00 - resetting command counter?

0000  F1 D0 00 64 D1 00 00 00 88 88 76 76 48 00 00 00  ...d......vvH...
0010  10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0020  31 32 33 34 35 36 37 38 00 00 00 00 00 00 00 00  12345678........
0030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0040  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0060  00 00 00 00 00 00 00 00                          ........

Camera setup(?) sent by camera after client init above is acknowledged

0000  F1 D0 00 28 D1 00 00 00 88 88 76 76 0C 00 00 00  ...(......vv....
0010  10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0020  00 00 00 00 00 00 00 00 07 00 00 01              ............

Audio frames (Camera to client)

Get Audio               Byte  12  16
Start                         04  04
Stop                          04  05

Format is 8kHz, mono, A-law encoded, with some mystery data on the front...

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50...
|-----|-----|-----|-----|-----------|-----------------------------------------------------------------------------------|-----------------------------------
|-Hdr-|-Len-|-Ch.-|-Num-|-AudioData-|-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???-???---|----- 8kHz, mono, A-Law audio! ----
------------------------------------------------------------------------------------------------------------------------|-----------------------------------
 f1 d0 04 04 d1 02 00 ef f2 0f 86 19 00 05 00 00 1f e0 4f 00 00 00 00 00 00 00 00 00 00 00 00 00 52 00 00 00 00 00 00 00 52 5d 5c 5f 5e 59 58 5b 5a 5a 45 44
 f1 d0 04 04 d1 02 00 f2 f2 0f 86 19 00 05 00 00 bf e0 4f 00 00 00 00 00 00 00 00 00 00 00 00 00 52 00 00 00 00 00 00 00 5f 5f 59 58 5b 5a 5a 44 44 47 46 41
 f1 d0 04 04 d1 02 00 f6 f2 0f 86 19 00 05 00 00 5f e1 4f 00 00 00 00 00 00 00 00 00 00 00 00 00 52 00 00 00 00 00 00 00 5f 5e 59 58 5b 5a 45 44 47 47 46 46
 f1 d0 04 04 d1 02 00 fc f2 0f 86 19 00 05 00 00 9f e2 4f 00 00 00 00 00 00 00 00 00 00 00 00 00 52 00 00 00 00 00 00 00 dd c6 cf f5 f7 f6 f1 f1 f7 f4 f4 f4
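
For anyone wanting to listen to the audio, the A-law bytes can be expanded to 16-bit PCM with the standard G.711 decoding steps. A minimal Python sketch: the function names are mine, and the default offset of 40 is simply where the audio starts in the diagram above, not something confirmed for every frame.

```python
def alaw_to_pcm16(sample: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit PCM value."""
    sample ^= 0x55                         # undo the even-bit inversion
    magnitude = (sample & 0x0F) << 4       # 4-bit mantissa
    segment = (sample & 0x70) >> 4         # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    return magnitude if sample & 0x80 else -magnitude


def decode_audio(msg: bytes, audio_offset: int = 40) -> list:
    """Decode the A-law part of a camera-to-client audio message.

    audio_offset=40 is an assumption read off the diagram above.
    """
    return [alaw_to_pcm16(b) for b in msg[audio_offset:]]
```

The resulting samples are 8 kHz mono and can be written out with Python's wave module or similar.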

Audio frames (Client to camera)

Send Audio              Byte  12  16  36
Start                         10  06  01
Stop                          04  07  00

Audio data is encoded as 8kHz, mono, A-law

e.g.

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50...
|-----|-----|-----|-----|-----------|-----------------------------------------------------------------------------------------------------------------------
|-Hdr-|-Len-|-Ch.-|-Num-|-AudioData-|--???-???-???-???-???-???-???-???--|-----                         8kHz, mono, A-Law audio!                        -----
------------------------------------------------------------------------------------------------------------------------------------------------------------
 f1 d0 01 54 d1 01 00 12 f2 0f 86 19 40 01 00 00 00 00 00 00 00 00 00 00 db c5 c6 c0 c3 c3 c0 c7 c4 c4 c5 c7 c6 c1 c0 c0 c0 c2 cd c3 c0 c1 c7 c5 d8 d3 d4 56
 f1 d0 01 54 d1 01 00 13 f2 0f 86 19 40 01 00 00 00 00 00 00 00 00 00 00 d1 d0 d0 d3 d3 d3 d0 d2 d2 d3 d3 d0 d1 d6 d6 d7 d7 d4 d5 55 54 54 54 54 57 57 57 55
 f1 d0 01 54 d1 01 00 14 f2 0f 86 19 40 01 00 00 00 00 00 00 00 00 00 00 54 d5 d5 55 54 54 54 57 d5 54 d5 54 d4 54 55 d5 d4 54 55 d7 54 d4 d4 d5 d4 d5 55 d5
 f1 d0 01 54 d1 01 00 15 f2 0f 86 19 40 01 00 00 00 00 00 00 00 00 00 00 7c 17 62 60 6f c6 c4 4e f5 ef e7 e3 95 e7 e6 e1 f4 64 40 65 60 7e 7f 60 44 d8 48 da

Video stream announcement (from camera to client)

00000000:  f1 d0 00 34 d1 00 00 15  88 88 76 76 18 00 00 00  |...4......vv....|
00000010:  02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020:  03 00 00 00 80 02 00 00  68 01 00 00 52 00 00 00  |........h...R...|
00000030:  00 00 00 00 00 00 00 00                           |........|

Bytes 36 & 37 are the horizontal resolution (little endian)

  • 0x80 0x02 = (16 * 8 + 256 * 2) = 640 (SD stream)
  • 0x80 0x07 = (16 * 8 + 256 * 7) = 1920 (HD Stream)

Bytes 40 & 41 are the vertical resolution (little endian)

  • 0x68 0x01 = (16 * 6 + 8 + 256 * 1) = 360 (SD stream)
  • 0x38 0x04 = (16 * 3 + 8 + 256 * 4) = 1080 (HD stream)
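
The same arithmetic in Python, reading both values as little-endian 16-bit integers straight from the announcement dump above (stream_resolution and ANNOUNCE are illustrative names):

```python
import struct


def stream_resolution(msg: bytes):
    """Return (width, height) from a video stream announcement."""
    (width,) = struct.unpack_from("<H", msg, 36)    # bytes 36-37
    (height,) = struct.unpack_from("<H", msg, 40)   # bytes 40-41
    return width, height


# The 56-byte announcement from the hex dump above.
ANNOUNCE = bytes.fromhex(
    "f1d00034d1000015" "8888767618000000"
    "0200000000000000" "0000000000000000"
    "0300000080020000" "6801000052000000"
    "0000000000000000"
)

print(stream_resolution(ANNOUNCE))  # (640, 360)
```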

Video frames

Video is raw H.264 data, carried in variable-length chunks from byte offset 8 onwards in each message. Concatenating these payloads produces a file that is playable with ffmpeg, or with VLC using the command-line option --demux h264. Video frames are carried on channel D1 02 and have a message sequence counter separate from that of the D1 00 commands sent by the camera.

  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50...
|-----|-----|-----|-----|-----------|-----------------------------------------------------------------------------------------------------------------------
|-Hdr-|-Len-|-Ch.-|-Num-|-VideoData-|-----             h.264 encoded (AVC, Main@L3 format profile, 10.0 fps, YUV, 4:2:0, 8bit Progressive)             -----
|-----|-----|-----|-----|-----------|-----------------------------------------------------------------------------------------------------------------------
 f1 d0 04 04 d1 02 02 4c f1 0f 86 19 68 5a 00 00 a8 a3 0e 00 01 00 00 00 61 e3 0c 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 01 47 4d 40 1e 99 a0 28 0b
 f1 d0 04 04 d1 02 02 63 f1 0f 86 19 b4 09 00 00 0c a4 0e 00 02 00 00 00 62 e3 0c 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 01 41 e0 1d 5c 77 04 c0 85
 f1 d0 04 04 d1 02 02 6a f1 0f 86 19 2f 5c 00 00 38 a5 0e 00 01 00 00 00 65 e3 0c 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 01 47 4d 40 1e 99 a0 28 0b
 f1 d0 04 04 d1 02 02 82 f1 0f 86 19 f8 06 00 00 9c a5 0e 00 02 00 00 00 66 e3 0c 00 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 01 41 e0 1d 5c 77 05 44 50

Video stream stopped (from camera to client)

00000000:  f1 d0 00 20 d1 00 00 19  88 88 76 76 04 00 00 00  |... ......vv....|
00000010:  13 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020:  00 00 00 00                                       |....|

Message acknowledgments

Sent back to confirm receipt of commands, video/audio data or responses to commands. The minimum message is 18 bytes, padded with zeroes if it carries fewer than five acks, but it can be longer. The largest seen, acknowledging multiple video frames, was 84 bytes and carried 38 message acknowledgements.

Byte offset   Value   Purpose(?)
0             F1      Constant - present in all commands / acks / responses
1             D1      Message acknowledgment?
2             xx      Message length MSB
3             xx      Message length LSB, counting from after this byte
4             D1      Constant
5             xx      00 for command acknowledgement, 02 for video frames
6             xx      Number of acks in this message (MSB)
7             xx      Number of acks in this message (LSB)
8             xx      Ack #1 MSB
9             xx      Ack #1 LSB
10            xx      Ack #2 MSB (or 00)
11            xx      Ack #2 LSB (or 00)
12            xx      Ack #3 MSB (or 00)
13            xx      Ack #3 LSB (or 00)
14            xx      Ack #4 MSB (or 00)
15            xx      Ack #4 LSB (or 00)
16            xx      Ack #5 MSB (or 00)
17            xx      Ack #5 LSB (or 00)
18            xx      Ack #6 MSB (optional)
19            xx      Ack #6 LSB (optional)
20            xx      Ack #7 MSB (optional)
21            xx      Ack #7 LSB (optional)
..            xx      etc. to a maximum offset of 83?
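
Based on the table above, an acknowledgement could be built and read back like this. It is a Python sketch only: build_ack and parse_ack are made-up names, and padding to exactly five slots is inferred from the 18-byte minimum rather than confirmed.

```python
import struct


def build_ack(indexes, channel: int = 0x00) -> bytes:
    """Build an ack message for the given message indexes.

    channel is 0x00 for commands or 0x02 for video frames.
    """
    slots = max(len(indexes), 5)                       # pad to five ack slots
    body = struct.pack(">BBH", 0xD1, channel, len(indexes))
    body += b"".join(struct.pack(">H", i) for i in indexes)
    body += b"\x00\x00" * (slots - len(indexes))       # zero padding
    # The length field counts the bytes after offset 3.
    return struct.pack(">BBH", 0xF1, 0xD1, len(body)) + body


def parse_ack(msg: bytes):
    """Return (channel, [acked indexes]) from an ack message."""
    (count,) = struct.unpack_from(">H", msg, 6)
    return msg[5], list(struct.unpack_from(f">{count}H", msg, 8))
```

With a single index this produces the 18-byte minimum, and with 38 indexes it produces the 84-byte maximum mentioned above.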

Keepalives

Sent back and forth while no other traffic (e.g. a video stream) is flowing. The exchange seems to start after around 1-2 seconds of silence.

Either side (client or server) starts the keepalive exchange with a message containing a 4-byte payload: F1 E0 00 00. The other side responds with F1 E1 00 00.

Teardown

If the server doesn't want to talk to the client any more, it sends a 4-byte payload: F1 F0 00 00
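
A client loop would need to react to these 4-byte control payloads. One possible shape in Python, with all names hypothetical:

```python
KEEPALIVE_REQUEST = bytes.fromhex("f1e00000")   # "Marco"
KEEPALIVE_REPLY = bytes.fromhex("f1e10000")     # "Polo"
TEARDOWN = bytes.fromhex("f1f00000")            # server is hanging up


def handle_control(payload: bytes):
    """Return (reply_or_None, keep_open) for a received payload."""
    if payload == KEEPALIVE_REQUEST:
        return KEEPALIVE_REPLY, True      # answer "Marco" with "Polo"
    if payload == TEARDOWN:
        return None, False                # stop talking to the camera
    return None, True                     # anything else: no control reply
```

The caller would send the reply (if any) back to server_port and close the socket once keep_open is False.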