# Data Collection for Add-ons Engagement Study

This notebook uses the longitudinal dataset to build a proof-of-concept design matrix for the add-ons engagement study, which seeks to answer the question: does having add-ons increase user engagement? *(definition of engagement TBD)*

Specs for this study can be found [here](https://metrics.mozilla.com/protected/dzeber/docs/addons-engagement-study.html).

I've included the **Covariates** section of the spec below. It gives a preliminary list of variables to include in our design matrix, and that list is subject to change.






### Covariates

The following measures will be included as covariates in the model, so that we can control for their effects when analyzing how installing the add-on affects the response.

The following system characteristics are intended to describe a user's computing environment/resources, as well as give an indication of their technical knowledge.

- OS
- OS version (Windows only)
- System architecture
- System memory (grouped by 2^n GB)
- Number of physical CPU cores
- (First) graphics adapter vendorID?
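
The "grouped by 2^n GB" bucketing for system memory could be sketched as below. This is only an illustration: the helper name `memory_bucket_gb` is hypothetical, the input is assumed to be reported in megabytes, and rounding to the *nearest* power of two (in log space) is an assumption, since the spec does not say whether to round up, down, or to nearest.

```python
import math


def memory_bucket_gb(memory_mb):
    """Map a memory size in MB to the nearest power-of-two size in GB.

    Hypothetical helper. Assumes the input is in megabytes and rounds
    to the nearest power of two in log space. Returns None for missing
    or non-positive values.
    """
    if memory_mb is None or memory_mb <= 0:
        return None
    gb = memory_mb / 1024.0
    exponent = round(math.log2(gb))  # nearest integer exponent n
    return 2.0 ** exponent           # bucket size 2^n, in GB
```

Rounding in log space keeps machines that report slightly less than their nominal memory (for example, about 3.9 GB on a 4 GB machine) in the expected bucket.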

There are some other features that we want to control for but are not explicitly interested in.

- Average number of main crashes per session hour
- Initial Firefox version
- Profile creation date (grouped by month)
- Did they have non-system, non-acceptable add-ons?
    + e.g. did they have foreign-installed, enabled add-ons?
- Did they enable telemetry by the end of the period?
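
Two of the controls above are simple derived quantities, and a rough sketch of how they might be computed follows. Both helper names are hypothetical. The sketch assumes that the profile creation date is available as a count of days since the Unix epoch and that session lengths are in seconds; neither is confirmed by the spec, so adjust it to whatever the dataset actually provides.

```python
from datetime import date, timedelta


def profile_creation_month(days_since_epoch):
    """Return the profile creation month as a 'YYYY-MM' string.

    Assumes the input is days since 1970-01-01 (an assumption).
    """
    created = date(1970, 1, 1) + timedelta(days=days_since_epoch)
    return created.strftime("%Y-%m")


def crashes_per_session_hour(crash_count, session_lengths_sec):
    """Return main crashes per hour of session time.

    Assumes session lengths are in seconds. Returns None when there
    is no recorded session time, to avoid dividing by zero.
    """
    total_hours = sum(session_lengths_sec) / 3600.0
    if total_hours <= 0:
        return None
    return crash_count / total_hours
```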

#### Possible Responses / additional explanatory variables
The following statistics are aggregated over a period. Together, they characterize a user's pattern of activity and engagement.

- Total number of active days in the period
- Longest run of consecutive active days in the period
- Was the profile active on weekends?
- Average number of sessions per active day
- Average uptime per active day (in minutes)
- Average usage intensity (percentage of uptime that was "active")
- Did they configure sync and link at least one device by the end of the period?
- Was Firefox set as the default browser at the end of the period?
- Did they change their search default at some point during the period?
    + Otherwise, the actual search default at the end of the period: Yahoo/Google/Other
- Did they make any bookmarks during the period?
- Average number of history pages per uptime hour
- Did they make any SAP searches during the period?
    + Otherwise, average search rate (searches per hour)
- Did they change their theme at some point during the period?
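
The first three statistics in the list above (total active days, longest run of consecutive active days, and weekend activity) can be sketched in plain Python as follows. The function name `activity_summary` and the returned dictionary keys are hypothetical, and the sketch assumes a profile's active days have already been extracted as `datetime.date` objects for the period of interest.

```python
from datetime import date, timedelta


def activity_summary(active_dates):
    """Summarize a profile's activity over a period.

    `active_dates` is an iterable of datetime.date objects (assumed
    already restricted to the study period); duplicates are ignored.
    """
    days = sorted(set(active_dates))
    if not days:
        return {"active_days": 0, "longest_run": 0, "active_on_weekend": False}

    longest = current = 1
    for prev, cur in zip(days, days[1:]):
        # Extend the current run if the days are adjacent, else restart it.
        current = current + 1 if cur - prev == timedelta(days=1) else 1
        longest = max(longest, current)

    return {
        "active_days": len(days),
        "longest_run": longest,
        # weekday() is 5 for Saturday and 6 for Sunday.
        "active_on_weekend": any(d.weekday() >= 5 for d in days),
    }
```

The per-active-day averages (sessions, uptime) would then divide the period totals by `active_days`, guarding against profiles with no active days.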

In [None]:
# Columns of interest, listed here for now.
columns_of_interest = [
    "client_id", "os", "normalized_channel", "submission_date", "geo_country",
    "profile_creation_date", "session_length", "settings", "active_addons", "build",
    "system", "system_device",
]

In [1]:
dataset = sqlContext.sql("SELECT * FROM longitudinal")

In [2]:
dataset.printSchema()

root
 |-- client_id: string (nullable = true)
 |-- os: string (nullable = true)
 |-- normalized_channel: string (nullable = true)
 |-- submission_date: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sample_id: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- geo_country: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- geo_city: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dnt_header: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- async_plugin_init: array (nullable = true)
 |    |-- element: boolean (containsNull = true)
 |-- flash_version: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- previous_build_id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- previous_session_id: array (nullable = true)
 |  