New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow collection of server statistical data #196

Merged
merged 12 commits into from Jul 28, 2018

Conversation

Projects
None yet
4 participants
@albertosottile
Member

albertosottile commented Jul 25, 2018

Because of the recent increase in the update rate of Syncplay, we would like to have some data on our currently user base.

This PR introduces a new server argument --stats-db-file [file] that creates a SQLite database in [file] and stores some anonymous information in it.

We only store non-sensitive and non-personal data that are already being transmitted by the clients in the current protocol. Specifically, for every user in the server at the time of stats collection, we store (EDIT):

  • server's Unix timestamp at the collection time
  • user's Syncplay client version

Data collection is scheduled to happen every hour. Nothing else is acquired nor stored in the snapshot.

Thanks to SQLite, full concurrency is supported and multiple servers can use the same DB file at the same time. For further safety, a fixed delay is introduced before the first collection, and its amount is based on the server's port number. (EDIT): twisted.enterprise.adbapi is employed to ensure asynchronous consistency.

@@ -445,7 +475,12 @@ def getName(self):
return self._name
def getVersion(self):
return self._connector.getVersion()
pattern = r'\A[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}\Z'

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

You don't need to do that since it's prepared statement anyway you can save anything pretty much (there could be length check at most, but rather that should be connector's job)

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

I agree this can be removed and also that a check, if any, should be done in _connector. I'll remove this.

if statsDbFile is not None:
self.statsDbHandle = self._connectToStatsDb(statsDbFile)
logDelay = 5*(int(self.port)%10 + 1)
reactor.callLater(logDelay, self._scheduleClientSnapshot, self.statsDbHandle, self.port)

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

what's the point of scheduling this instead of running the code on user join?

This comment has been minimized.

@Et0h

Et0h Jul 25, 2018

Contributor

The idea is for it to occur at set intervals from being started to provide a snapshot, and to have a delay to avoid multiple servers all writing at exactly the same time if they are started at the same time.

def _connectToStatsDb(self, dbPath):
conn = sqlite3.connect(dbPath)
c = conn.cursor()
c.execute('create table if not exists clients_snapshots (snapshot_time integer, port integer, version string, room_index integer, play_status integer)')

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

would be nice if all the DB stuff:

  • connection
  • schema creating
  • queries

were put into a separate class (service)

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

I'll try. Should I keep this new class in server.py or move it to a new file and module? I would advise for the former since all this is server-related only.

playStatus = room.isPlaying()
for watcher in room.getWatchers():
content = (snapshotTime, int(portNumber), watcher.getVersion(), idx, playStatus, )
c.execute("INSERT INTO clients_snapshots VALUES (?, ?, ?, ?, ?)", content)

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

What's the point of saving room state?
Room index is non-deterministic as well and can change within multiple executions on the same dictionary (unless python 3.7).
You'll log multiple users multiple times without any real way to distinguish between them.
Also loging the port makes very litte sense and you should have a separate file for every instance you run (https://www.sqlite.org/faq.html#q5 http://howfuckedismydatabase.com/sqlite/)

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

instead of room-idx just store the room-name hash.

This comment has been minimized.

@daniel-123

daniel-123 Jul 25, 2018

Contributor

You can write to single SQLite database with multiple processes - you just need to handle SQLITE_BUSY and wait for other inserts to finish.

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

By default, sqlite3 handles SQLITE_BUSY by waiting for 5 seconds and then retrying. That's why a port-based fixed delay was introduced. Having a single SQLite database makes handling and analyses much easier, in my opinion.

This comment has been minimized.

@Et0h

Et0h Jul 25, 2018

Contributor

Room index can be used to get a sense of the typical size of a session, which can help inform development/design decisions (e.g. default size of 'list of who is watching what' section of client)

This comment has been minimized.

@daniel-123

daniel-123 Jul 25, 2018

Contributor

If encountering SQLITE_BUSY has potential delay of 5 seconds, won't that risk stalling entire server process for 5 second during each iteration of this loop?

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

My understanding was that only sqlite3.commit locked the database, but probably I was wrong. In any case, I'll try to replace this loop of execute with a list and a single sqlite3.executescript at the end of the loop.

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

Upon further discussion, @Et0h and I decided to get rid of room_index and play_status for now. The only recorded stats will be the current timestamp, the client's version, and the server port, needed to understand server's load while keeping only one database file.

@@ -185,6 +194,17 @@ def setPlaylistIndex(self, watcher, index):
else:
watcher.setPlaylistIndex(room.getName(), room.getPlaylistIndex())
def _connectToStatsDb(self, dbPath):
conn = sqlite3.connect(dbPath)

This comment has been minimized.

@Uriziel

Uriziel Jul 25, 2018

Contributor

maybe you should implement non-blocking db connection:
https://howto.lintel.in/asynchronous-db-operations-twisted/
(haven't tested)

This comment has been minimized.

@albertosottile

albertosottile Jul 25, 2018

Member

We could do that but, quoting the FAQ you linked before, "SQLite allows multiple processes to have the database file open at once". So, I do not think we should rely on twisted for this.

@@ -185,6 +192,45 @@ def setPlaylistIndex(self, watcher, index):
else:
watcher.setPlaylistIndex(room.getName(), room.getPlaylistIndex())
class StatsRecorder(object):
def __init__(self):
self._dbPool = None

This comment has been minimized.

@Uriziel

Uriziel Jul 26, 2018

Contributor

Nitpicking: I'd name this _connection

if self._dbPool is not None:
self._dbPool.close()
def startRecorder(self, dbpath, roomManager, delay):

This comment has been minimized.

@Uriziel

Uriziel Jul 26, 2018

Contributor

Nitpicking I'd move dbpath, roomManager into __init__

print("--- Error in initializing the stats database. Server Stats not enabled. ---")
def _initDatabase(self):
dbpool = adbapi.ConnectionPool("sqlite3", self._dbPath, check_same_thread=False)

This comment has been minimized.

@Uriziel

Uriziel Jul 26, 2018

Contributor

My idea for separate class was to cover "Single Responsibility Principle" which moves all stuff related to sqlite to a separate class, thus making abstraction layer between business code (which is "log this and that data") and app logic (which is "use database to store data").
So something like this:

class Database(NameForChange):
    __init__(dbPath): pass
    connect(): pass
   addVersionLog(version, time): pass
    _createSchema(): pass

which later on can be extended by adding function like fetchUser(username) etc. that all relates to fetching data from data layer (i.e. sqlite)

@Uriziel

Keep up good work

@Uriziel

This comment has been minimized.

Contributor

Uriziel commented Jul 26, 2018

Reposting my last comment as it got squashed by one of the changes (my bad.)

My idea for separate class was to cover "Single Responsibility Principle" which moves all stuff related to sqlite to a separate class, thus making abstraction layer between business code (which is "log this and that data") and app logic (which is "use database to store data").
So something like this (all the names were chosen at random without a second thought given to them):

class Database(NameForChange):
    __init__(dbPath): pass
    connect(): pass
   addVersionLog(version, time): pass
    _createSchema(): pass

which later on can be extended by adding function like fetchUser(username) etc. that all relates to fetching data from data layer (i.e. sqlite)

@Et0h Et0h merged commit b3a7ed8 into Syncplay:master Jul 28, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment