###ANOVA analysis by country

In [1]:
import graphlab as gl
graph = gl.SFrame.read_csv('data/user_graph.csv')
eng = gl.SFrame.read_csv('data/user_engagement.csv')
login = gl.SFrame.read_csv('data/user_login_info.csv')

login_eng = login.join(eng, on='user_id', how='inner')
login_eng_graph = login_eng.join(graph, on='user_id', how='inner')
login_eng_graph, drop = login_eng_graph.dropna_split()


PROGRESS: Finished parsing file /Users/mhardas/Google Drive/work/project/twitter-data-challenge/data/user_graph.csv
PROGRESS: Parsing completed. Parsed 100 lines in 1.50934 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 2641306 lines. Lines per second: 1.27888e+06
PROGRESS: Finished parsing file /Users/mhardas/Google Drive/work/project/twitter-data-challenge/data/user_graph.csv
PROGRESS: Parsing completed. Parsed 4000000 lines in 2.44885 secs.
PROGRESS: Finished parsing file /Users/mhardas/Google Drive/work/project/twitter-data-challenge/data/user_engagement.csv
PROGRESS: Parsing completed. Parsed 100 lines in 1.56959 secs.
------------------------------------------------------
In

###Hypothesis: Activity of a person is a function of the country_id

Activity is deliberately left loosely defined. Here it is taken to be the sum of the "engagement" metrics in the dataset.
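As a minimal sketch of this definition, the snippet below sums the seven engagement columns for one user held as a plain dictionary. The column names are the ones used in the notebook; the row values and the helper name activity are made up for illustration. Because the columns appear to be in different units (days, time, counts), the sum tends to be driven by whichever metric has the largest scale.

```python
# Column names taken from the notebook; the row values below are hypothetical.
ENGAGEMENT_COLUMNS = [
    'num_tweet_days_30d', 'time_in_app_30d', 'num_timeline_views_30d',
    'num_share_sent_days_30d', 'num_share_rcvd_30d',
    'num_favour_sent_30d', 'num_favour_rcvd_30d',
]


def activity(row):
    """Return the sum of the engagement columns for one user row."""
    return sum(row[col] for col in ENGAGEMENT_COLUMNS)


example_row = {
    'num_tweet_days_30d': 12, 'time_in_app_30d': 5400,
    'num_timeline_views_30d': 310, 'num_share_sent_days_30d': 3,
    'num_share_rcvd_30d': 7, 'num_favour_sent_30d': 20,
    'num_favour_rcvd_30d': 15,
}
example_activity = activity(example_row)

# Share of the total contributed by the single largest metric.
largest_share = max(example_row[c] for c in ENGAGEMENT_COLUMNS) / float(example_activity)
```

In this made-up row, one metric accounts for more than 90% of the total, which is worth keeping in mind when reading differences in the summed activity.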

In [14]:
# Activity = sum of the seven 30-day engagement metrics for each user.
engagement_cols = ['num_tweet_days_30d', 'time_in_app_30d', 'num_timeline_views_30d',
                   'num_share_sent_days_30d', 'num_share_rcvd_30d',
                   'num_favour_sent_30d', 'num_favour_rcvd_30d']
login_eng_graph['activity_metrics'] = login_eng_graph.apply(lambda x: sum(x[c] for c in engagement_cols))
print login_eng_graph

+---------+------------+----------------+-------------+----------+----------------+
| user_id | country_id | primary_device | days_active | activity | num_tweets_30d |
+---------+------------+----------------+-------------+----------+----------------+
|    1    |     h      |       B        |      26     |    t     |      186       |
|    2    |     d      |       B        |      27     |    t     |       5        |
|    3    |     h      |       A        |      21     |    t     |       0        |
|    4    |     h      |       A        |      29     |    t     |       4        |
|    5    |     h      |       E        |      30     |    f     |       1        |
|    6    |     h      |       B        |      26     |    t     |       51       |
|    7    |     g      |       B        |      30     |    t     |      4713      |
|    8    |     g      |       B        |      30     |    t     |       5        |
|    9    |     h      |       A        |      30     |    t     |      185 

Get the set of countries and the number of users in each country

In [15]:
import graphlab.aggregate as agg
countries = login_eng_graph.groupby('country_id', operations={'count': agg.COUNT()})
countries

country_id,count
a,49285
g,1134673
c,66611
h,1742210
f,104267
d,476008
b,166687
e,180196
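The countries differ in size by a factor of about 35, so the fixed 10,000-user sample drawn in the next step covers a very different share of each one. The sketch below works this out from the counts printed above; the variable names are illustrative.

```python
# Counts copied from the groupby output above.
country_counts = {
    'a': 49285, 'b': 166687, 'c': 66611, 'd': 476008,
    'e': 180196, 'f': 104267, 'g': 1134673, 'h': 1742210,
}
SAMPLE_SIZE = 10000  # per-country sample size used later in this notebook

total_users = sum(country_counts.values())

# Fraction of each country's users covered by a 10,000-user sample.
sampling_fraction = dict(
    (country, SAMPLE_SIZE / float(count))
    for country, count in country_counts.items()
)

smallest = min(country_counts, key=country_counts.get)
largest = max(country_counts, key=country_counts.get)
```

For the smallest country (a) the sample is about 20% of its users, while for the largest (h) it is under 1%.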


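Take 10,000 users from each of the eight countries. The sample size is arbitrary; there is no particular reason for choosing it.
Note that topk('user_id', k=10000) returns the 10,000 users with the largest user_id in each country rather than a random draw, so treating them as a random sample assumes that user_id is unrelated to activity.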
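For comparison, a uniformly random draw of n users per group could look like the plain-Python sketch below. It is not GraphLab code: the helper sample_per_group, the seed, and the toy rows are all hypothetical, and it only illustrates the idea of sampling each group independently with a fixed seed.

```python
import random


def sample_per_group(rows, group_key, n, seed=0):
    """Draw up to n rows uniformly at random from each group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    samples = {}
    for key in sorted(groups):
        members = groups[key]
        # Sample without replacement; small groups are returned whole.
        samples[key] = rng.sample(members, min(n, len(members)))
    return samples


# Hypothetical toy data: 3 countries with 50 users each.
toy_rows = [
    {'user_id': i, 'country_id': 'abc'[i % 3], 'activity_metrics': i * 10}
    for i in range(150)
]
toy_samples = sample_per_group(toy_rows, 'country_id', n=20)
```

Each entry of toy_samples holds 20 rows from a single country, and rerunning with the same seed gives the same rows.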

In [31]:
import string
import pandas

def select_country_sample(country):
    # The 10000 rows with the largest user_id for this country (not a random draw).
    return login_eng_graph.filter_by(country, 'country_id').topk('user_id', k=10000)['activity_metrics']

# Country IDs are the letters 'a' through 'h'.
country_list = list(string.ascii_lowercase)[:8]

# One column per country, in the order of country_list.
country_samples = pandas.DataFrame(map(select_country_sample, country_list)).transpose()
print country_samples.describe()

                   0              1              2              3  \
count   10000.000000   10000.000000   10000.000000   10000.000000   
mean     8474.950300   12167.428400    7257.241100   12087.786900   
std     17719.676942   24840.490078   19339.747958   19007.077375   
min         0.000000       0.000000       0.000000       0.000000   
25%       224.500000     281.000000     188.000000     914.000000   
50%      2384.500000    2851.500000    1823.000000    5142.500000   
75%      9438.000000   12928.500000    7151.000000   15600.750000   
max    644250.000000  613425.000000  645803.000000  270856.000000   

                   4              5              6              7  
count   10000.000000   10000.000000   10000.000000   10000.000000  
mean     6209.693200    4886.981400   20522.180600   15105.269400  
std     14755.428758   15617.336119   27848.723661   24123.591131  
min         0.000000       0.000000       0.000000       0.000000  
25%        77.000000      15.000000   

In [39]:
# One-way ANOVA across the eight country samples (F statistic and p-value)
from scipy import stats
f_val, p_val = stats.f_oneway(*[country_samples[i] for i in range(8)])
print f_val, p_val

624.846429832 0.0


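The p-value is extremely small (it prints as 0.0 because it is too small to represent as a floating-point number). If all eight countries had the same mean activity, differences this large between the sample means would be very unlikely to arise from random sampling alone. The hypothesis that the population means are all equal is therefore rejected: mean activity differs for at least one country. This conclusion rests on the usual ANOVA assumptions (independent random samples, roughly normal data, and equal variances across groups); the summary statistics above, with means far above the medians, show that activity is strongly right-skewed.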
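To make the F value concrete, the sketch below computes the one-way ANOVA statistic by hand as the ratio of the between-group mean square to the within-group mean square. The function name one_way_f and the toy groups are illustrative; the notebook itself relies on stats.f_oneway.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / float(n_total)
    group_means = [sum(g) / float(len(g)) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)    # n_total - k degrees of freedom
    return ms_between / ms_within


toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_toy = one_way_f(toy_groups)
```

For the toy groups the between-group mean square is 3 and the within-group mean square is 1, so the statistic is 3.0. A large F means the group means are spread out much more than the scatter within groups would explain.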

###Hypothesis: Activity of a person is a function of the device they use

Get the types of devices and their counts

In [40]:
import graphlab.aggregate as agg
devices = login_eng_graph.groupby('primary_device', operations={'count': agg.COUNT()})
print devices

+----------------+---------+
| primary_device |  count  |
+----------------+---------+
|       E        |  614191 |
|       A        | 1023371 |
|       D        |  116343 |
|       B        | 2041788 |
|       C        |  124244 |
+----------------+---------+
[5 rows x 2 columns]



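Take 10,000 users for each type of device. Note that head(n=10000) returns the first 10,000 rows for each device in stored order, not a random draw.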

In [50]:
import string
import pandas

def select_device_sample(device):
    # The first 10000 rows for this device, in stored order (not a random draw).
    return login_eng_graph.filter_by(device, 'primary_device').head(n=10000)['activity_metrics']

# Device types are the letters 'A' through 'E'.
device_list = list(string.ascii_uppercase)[:5]

# One column per device, in the order of device_list.
device_samples = pandas.DataFrame(map(select_device_sample, device_list)).transpose()
print device_samples.describe()

                  0             1              2              3              4
count   10000.00000   10000.00000   10000.000000   10000.000000   10000.000000
mean    12742.13890   18650.00330   10235.143200    7949.097300    8836.637700
std     21704.23117   25718.05257   24270.780397   18229.188918   22626.353081
min         0.00000       0.00000       0.000000       0.000000       0.000000
25%      1131.00000    2754.50000      26.000000       7.000000      31.000000
50%      5012.50000    9927.00000     302.500000     239.000000     413.000000
75%     15409.25000   25517.75000    7008.000000    5861.000000    6275.500000
max    330919.00000  624811.00000  385490.000000  245442.000000  375939.000000
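The summary above also allows a rough check of two assumptions behind one-way ANOVA: similar variances across groups and roughly symmetric data. The sketch below uses only the numbers printed in the table; the variable names are illustrative. An often-cited rule of thumb treats a largest-to-smallest standard deviation ratio below about 2 as acceptable.

```python
# Summary statistics copied from the describe() output above (columns 0-4).
device_mean = [12742.13890, 18650.00330, 10235.143200, 7949.097300, 8836.637700]
device_median = [5012.5, 9927.0, 302.5, 239.0, 413.0]
device_std = [21704.23117, 25718.05257, 24270.780397, 18229.188918, 22626.353081]

# Homoscedasticity: ratio of the largest to the smallest sample standard deviation.
std_ratio = max(device_std) / min(device_std)

# Skew: a mean far above the median suggests a long right tail.
mean_to_median = [m / med for m, med in zip(device_mean, device_median)]
```

The standard deviations are within a factor of about 1.4 of each other, but every mean is well above its median (by a factor of more than 20 for columns 2 to 4), so the samples are heavily right-skewed rather than normal.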


In [51]:
# One-way ANOVA across the five device samples (F statistic and p-value)
from scipy import stats
f_val, p_val = stats.f_oneway(*[device_samples[i] for i in range(5)])
print f_val, p_val

359.539389481 9.13086916336e-306


The p-value is very small (about 9.1e-306), so the differences between the sample means are unlikely to be due to random sampling alone. Assuming that activity is roughly normally distributed within each device group and that the groups have equal variances (homoscedasticity), this suggests that mean activity differs by device type.
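With 10,000 users per group, even modest differences produce tiny p-values, so it helps to look at an effect size as well. For a one-way ANOVA, eta squared (the share of variance explained by the grouping) can be recovered from the F statistic and the degrees of freedom. The sketch below applies that identity to the two F values printed in this notebook; the helper name eta_squared is illustrative.

```python
def eta_squared(f_value, n_groups, n_total):
    """Eta squared recovered from a one-way ANOVA F statistic."""
    df_between = n_groups - 1
    df_within = n_total - n_groups
    # SS_between / (SS_between + SS_within), rewritten in terms of F.
    return (f_value * df_between) / (f_value * df_between + df_within)


# F values copied from the two ANOVA outputs in this notebook;
# each group holds 10,000 users.
eta_country = eta_squared(624.846429832, n_groups=8, n_total=8 * 10000)
eta_device = eta_squared(359.539389481, n_groups=5, n_total=5 * 10000)
```

Under these assumptions, country accounts for roughly 5% and device for roughly 3% of the variance in the sampled activity, so the differences are statistically detectable but modest in size.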