Playing with the Z Score

Nathan Evans edited this page Mar 4, 2016 · 1 revision

When I was looking for a way to identify developers with high closeness and low betweenness (or vice versa), I had little idea at first of what counted as "high" or "low" for either value. After experimenting with different queries, I eventually came up with the query below:

select * from developer_snapshots where (has_sheriff_hrs=0 and betweenness < 0.01 and closeness > .5) order by closeness desc;

This query seemed to return interesting developers, but Dr. Meneely explained that the values I was using for "high" and "low" weren't accurate. I later modified the query to use more accurate numbers, but after speaking with Sam, it became clear that the bigger problem was how arbitrary these numbers were. Looking for something less arbitrary, I turned to the Z-Score equation from my statistics class, which expresses how many standard deviations a value lies from the mean and is commonly used to find outliers in a data set (the psql query wasn't pretty). According to the Z-Score formula, however, the data does not have any strong outliers. In theory that was fine, since we are only looking for developers with "highs" and "lows", not necessarily "strangely highs" and "strangely lows". In practice, it circled back to the same issue: arbitrary values would still be needed to decide what counts as a relative "high" or a relative "low" in the data. So for now, the Z-Score hasn't proven useful, and I am working with Kayla and Sam on answering some new research questions.
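
For reference, the Z-Score of a value x is (x - mean) / standard deviation. Below is a minimal Python sketch of that calculation, not the psql query actually used. The function names, the sample closeness values, and the cutoff of 1.0 are all made up for illustration, and the population standard deviation is an assumption (the sample version would work equally well).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return the Z-Score, (x - mean) / standard deviation, of each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Every value is identical, so nothing is above or below the mean.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def flag_high_low(values, threshold=1.0):
    """Label each value "high", "low", or "typical" by its Z-Score.

    The threshold is a hypothetical cutoff chosen for illustration only.
    """
    labels = []
    for z in z_scores(values):
        if z >= threshold:
            labels.append("high")
        elif z <= -threshold:
            labels.append("low")
        else:
            labels.append("typical")
    return labels


# Hypothetical closeness values, not taken from developer_snapshots.
closeness = [0.2, 0.4, 0.4, 0.6, 0.9]
for value, z, label in zip(closeness, z_scores(closeness), flag_high_low(closeness)):
    print(f"{value:.2f}  z={z:+.2f}  {label}")
```

Even here, the threshold argument is still a number someone has to pick, which is the same arbitrariness described above.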