###ANOVA analysis by country

In [1]:
import graphlab as gl
graph = gl.SFrame.read_csv('data/user_graph.csv')
eng = gl.SFrame.read_csv('data/user_engagement.csv')
login = gl.SFrame.read_csv('data/user_login_info.csv')

login_eng = login.join(eng, on='user_id', how='inner')
login_eng_graph = login_eng.join(graph, on='user_id', how='inner')
login_eng_graph, drop = login_eng_graph.dropna_split()


###Hypothesis: Activity of a person is a function of the country_id

Activity is deliberately defined loosely. Here it is taken to be the sum of the 30-day "engagement" metrics in the dataset.

In [56]:
login_eng_graph['activity_metrics'] = login_eng_graph.apply(lambda x: x['num_tweet_days_30d'] + x['time_in_app_30d'] + 
x['num_timeline_views_30d'] + x['num_share_sent_days_30d'] + x['num_share_rcvd_30d'] + x['num_favour_sent_30d'] + 
x['num_favour_rcvd_30d'])
print login_eng_graph

+---------+------------+----------------+-------------+----------+----------------+
| user_id | country_id | primary_device | days_active | activity | num_tweets_30d |
+---------+------------+----------------+-------------+----------+----------------+
|    1    |     h      |       B        |      26     |    t     |      186       |
|    2    |     d      |       B        |      27     |    t     |       5        |
|    3    |     h      |       A        |      21     |    t     |       0        |
|    4    |     h      |       A        |      29     |    t     |       4        |
|    5    |     h      |       E        |      30     |    f     |       1        |
|    6    |     h      |       B        |      26     |    t     |       51       |
|    7    |     g      |       B        |      30     |    t     |      4713      |
|    8    |     g      |       B        |      30     |    t     |       5        |
|    9    |     h      |       A        |      30     |    t     |      185 

Get the set of countries and the number of users in each country

In [15]:
import graphlab.aggregate as agg
countries = login_eng_graph.groupby('country_id', operations={'count': agg.COUNT()})
countries

country_id,count
a,49285
g,1134673
c,66611
h,1742210
f,104267
d,476008
b,166687
e,180196


Sample users from each of the eight countries. Each sample consists of n=10000 users; the number is arbitrary and open to interpretation.
Note that topk('user_id', k=10000) returns the 10000 users with the largest user_id in each country, not a random draw.
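Because `topk('user_id', k=10000)` keeps the rows with the largest `user_id` values, the selection is deterministic and could be biased if `user_id` is related to activity in some way. A minimal sketch of a seeded uniform random draw follows, using only the standard library. The list `activity_values` is a hypothetical stand-in for one country's `activity_metrics` column, and `draw_sample` is an illustrative helper, not part of GraphLab.

```python
import random

def draw_sample(values, n, seed=0):
    """Return n values drawn uniformly without replacement (all values if fewer than n)."""
    rng = random.Random(seed)      # a fixed seed keeps the draw reproducible
    values = list(values)
    if len(values) <= n:
        return values
    return rng.sample(values, n)

# Hypothetical stand-in for one country's activity_metrics column.
activity_values = list(range(50000))
sample = draw_sample(activity_values, 10000, seed=42)
```

Calling `draw_sample` again with the same seed returns the same users, so the downstream ANOVA stays reproducible.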

In [31]:
def select_country_sample(country):
    sample = login_eng_graph.filter_by(country, 'country_id').topk('user_id', k=10000)['activity_metrics']
    return sample
    
import string
country_list = list(string.ascii_lowercase)[:8]

import pandas
country_samples = pandas.DataFrame(map(select_country_sample, country_list)).transpose()
print country_samples.describe()

                   0              1              2              3  \
count   10000.000000   10000.000000   10000.000000   10000.000000   
mean     8474.950300   12167.428400    7257.241100   12087.786900   
std     17719.676942   24840.490078   19339.747958   19007.077375   
min         0.000000       0.000000       0.000000       0.000000   
25%       224.500000     281.000000     188.000000     914.000000   
50%      2384.500000    2851.500000    1823.000000    5142.500000   
75%      9438.000000   12928.500000    7151.000000   15600.750000   
max    644250.000000  613425.000000  645803.000000  270856.000000   

                   4              5              6              7  
count   10000.000000   10000.000000   10000.000000   10000.000000  
mean     6209.693200    4886.981400   20522.180600   15105.269400  
std     14755.428758   15617.336119   27848.723661   24123.591131  
min         0.000000       0.000000       0.000000       0.000000  
25%        77.000000      15.000000   

In [59]:
# one-way ANOVA P value
from scipy import stats
f_val, p_val = stats.f_oneway(*[country_samples[i] for i in range(8)])
print f_val, p_val

624.846429832 0.0


The p-value is extremely small. It is printed as 0.0 because the true value is too small to represent as a floating-point number, not because it is exactly zero. It is therefore very unlikely that the differences observed between the country samples are due to random sampling alone. The null hypothesis that all eight population means are equal is rejected: mean activity does differ by country.
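To make the F value concrete, the sketch below computes the one-way ANOVA F statistic by hand: the between-group mean square divided by the within-group mean square. This is the same statistic `scipy.stats.f_oneway` reports. The helper `one_way_f` and the toy data are illustrative assumptions, not taken from the dataset.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / float(n_total)
    group_means = [sum(g) / float(len(g)) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data, made up for illustration only.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_f(toy_groups)
```

A large F means the group means are spread out far more than the scatter inside each group would suggest. With eight samples of 10000 users each, the degrees of freedom are 7 and 79992.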

###Hypothesis: Activity of a person is the function of the device they use

Get the types of devices and their counts

In [40]:
import graphlab.aggregate as agg
devices = login_eng_graph.groupby('primary_device', operations={'count': agg.COUNT()})
print devices

+----------------+---------+
| primary_device |  count  |
+----------------+---------+
|       E        |  614191 |
|       A        | 1023371 |
|       D        |  116343 |
|       B        | 2041788 |
|       C        |  124244 |
+----------------+---------+
[5 rows x 2 columns]



Sample 10000 users of each type of device

In [50]:
def select_device_sample(device):
    sample = login_eng_graph.filter_by(device, 'primary_device').head(n=10000)['activity_metrics']
    return sample
    
import string
device_list = list(string.ascii_uppercase)[:5]

import pandas
device_samples = pandas.DataFrame(map(select_device_sample, device_list)).transpose()
print device_samples.describe()

                  0             1              2              3              4
count   10000.00000   10000.00000   10000.000000   10000.000000   10000.000000
mean    12742.13890   18650.00330   10235.143200    7949.097300    8836.637700
std     21704.23117   25718.05257   24270.780397   18229.188918   22626.353081
min         0.00000       0.00000       0.000000       0.000000       0.000000
25%      1131.00000    2754.50000      26.000000       7.000000      31.000000
50%      5012.50000    9927.00000     302.500000     239.000000     413.000000
75%     15409.25000   25517.75000    7008.000000    5861.000000    6275.500000
max    330919.00000  624811.00000  385490.000000  245442.000000  375939.000000


In [51]:
# one-way ANOVA P value
from scipy import stats
f_val, p_val = stats.f_oneway(*[device_samples[i] for i in range(5)])
print f_val, p_val

359.539389481 9.13086916336e-306


The p-value is very small (about 9.1e-306). The differences in the means are therefore unlikely to be the result of random sampling. This suggests that activity is associated with device type, provided the ANOVA assumptions hold: activity is roughly normally distributed within each group and the variances within the groups are similar.

###Hypothesis - Tweeting activity is a function of the number of followers a user has 

Let us test this hypothesis by segmenting users by the number of followers they have. Followers are not normally distributed. They clearly follow either a power-law-like distribution or some exponential distribution (see the other notebook). So let's try a one-way ANOVA and, if that fails, fall back to a non-parametric Kruskal-Wallis test.
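The Kruskal-Wallis test mentioned above works on ranks rather than raw values, so it does not assume normality. A minimal pure-Python sketch of the tie-corrected H statistic follows. The helper `kruskal_wallis_h` is illustrative and assumes the observations are not all identical. In practice `scipy.stats.kruskal` computes this statistic together with a p-value.

```python
def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    # Pool all observations, remembering which group each came from.
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n_total:
        # Find the run of equal values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # mean of the ranks i+1 .. j
        for _, gi in pooled[i:j]:
            rank_sums[gi] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)) - 3 * (n_total + 1)
    # Assumes at least two distinct values, otherwise the correction is zero.
    correction = 1.0 - tie_term / float(n_total ** 3 - n_total)
    return h / correction

# Toy data, made up for illustration only.
h_no_ties = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
h_ties = kruskal_wallis_h([[1, 1], [2, 2]])
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom.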

Segmenting users by the number of followers - The average number of followers is 424, with a standard deviation of 34250, a minimum of 1 and a maximum of 51580785. The maximum is obviously an outlier.
About 75% of users have 233 followers or fewer.
To construct the segmented data, consider all users with at most 500 followers and divide them into 5 groups of 1-100, 101-200, ..., 401-500 followers. The expectation is that the variance within these groups will be smaller and that the group means will differ by more than random sampling alone would explain.
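Python's `range(a, b)` stops at `b - 1`, so building the groups described above needs some care at the boundaries. The sketch below constructs inclusive bins as (low, high) pairs. The helpers `inclusive_bins` and `bin_index` are hypothetical names introduced for illustration.

```python
def inclusive_bins(start, stop, width):
    """Return (low, high) pairs covering start..stop inclusive in steps of width."""
    return [(low, min(low + width - 1, stop))
            for low in range(start, stop + 1, width)]

def bin_index(value, bins):
    """Return the index of the bin containing value, or None if it falls outside."""
    for i, (low, high) in enumerate(bins):
        if low <= value <= high:
            return i
    return None

follower_bins = inclusive_bins(1, 500, 100)
# follower_bins == [(1, 100), (101, 200), (201, 300), (301, 400), (401, 500)]
```

With these bins a user with exactly 100 followers lands in the first group and a user with 101 in the second, so no boundary value is dropped.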

In [53]:
def select_followers_sample(r):
    sample = login_eng_graph.filter_by(r, 'followers').head(n=10000)['num_tweets_30d']
    return sample
    
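# Note: range() excludes its end point, so these bins cover 0-99, 101-199, ..., 401-499
# and skip users with exactly 100, 200, 300, 400, or 500 followers.
ranges = [range(100), range(101,200), range(201,300), range(301,400), range(401,500)]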

import pandas
followers_samples = pandas.DataFrame(map(select_followers_sample, ranges)).transpose()
print followers_samples.describe()

                  0             1             2             3             4
count  10000.000000  10000.000000  10000.000000  10000.000000  10000.000000
mean      29.763100     86.057300    115.338400    140.883800    171.212300
std      125.390369    238.736329    323.300391    324.390108    374.777325
min        0.000000      0.000000      0.000000      0.000000      0.000000
25%        0.000000      5.000000     10.000000     16.000000     20.000000
50%        2.000000     23.000000     38.000000     52.000000     63.000000
75%       18.000000     75.000000    114.000000    146.000000    173.000000
max     7206.000000   8036.000000  20642.000000  10756.000000  13001.000000


In [54]:
# one-way ANOVA P value
from scipy import stats
f_val, p_val = stats.f_oneway(*[followers_samples[i] for i in range(5)])
print f_val, p_val

346.730695931 5.82110496055e-295


The p-value indicates that there is a significant difference between the means that is not explained by random sampling.
The null hypothesis is rejected, which means that tweeting may in fact be a function of the number of followers.
This is interesting because it is consistent with the idea that people who are followed more have a higher propensity to tweet.

A similar analysis can be performed for retweets, shares and favorites, testing whether they are functions of followers, followings, or mutual followers. For example, consider the following hypothesis.

### Hypothesis - Number of shares received in last 30 days is a function of mutual followers added in last 30 days.

In [61]:
login_eng_graph['num_mutual_follower_added_30d'].sketch_summary()


+--------------------+---------------+----------+
|        item        |     value     | is exact |
+--------------------+---------------+----------+
|       Length       |    3919937    |   Yes    |
|        Min         |      0.0      |   Yes    |
|        Max         |    41780.0    |   Yes    |
|        Mean        | 9.34133048567 |   Yes    |
|        Sum         |   36617427.0  |   Yes    |
|      Variance      | 10651.2167364 |   Yes    |
| Standard Deviation | 103.204732141 |   Yes    |
|  # Missing Values  |       0       |   Yes    |
|  # unique values   |      2620     |    No    |
+--------------------+---------------+----------+

Most frequent items:
+-------+---------+--------+--------+--------+--------+--------+--------+-------+
| value |    0    |   1    |   2    |   3    |   4    |   5    |   6    |   7   |
+-------+---------+--------+--------+--------+--------+--------+--------+-------+
| count | 1636701 | 537557 | 318215 | 218985 | 163561 | 127219 | 104036 | 84564 |

95% of the users are covered if we consider users who added 29 or fewer mutual followers.
The data set is formed by selecting 10000 users from each of the groups (1-5), (6-10), ..., (26-30). Each sample contains the
number of shares received in the last 30 days by those users. By performing an analysis of variance we check whether
the average number of shares received in the last 30 days differs significantly across the 6 groups.
If it does, we can then claim that the processes that create shares for users with different numbers of mutual followers added
differ significantly.
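A coverage figure like the one above can be checked from a value-count table. The sketch below uses only the eight counts shown in the "Most frequent items" output (values 0 through 7) and the reported column length. The helper `cumulative_coverage` is an illustrative name. Counts for values above 7 are not shown in the output, so the sketch stops there.

```python
# Counts for values 0..7, copied from the "Most frequent items" output.
counts = [1636701, 537557, 318215, 218985, 163561, 127219, 104036, 84564]
total_users = 3919937   # "Length" from the sketch summary

def cumulative_coverage(counts, total):
    """Return the running fraction of users covered after each value."""
    covered = 0
    fractions = []
    for c in counts:
        covered += c
        fractions.append(covered / float(total))
    return fractions

coverage = cumulative_coverage(counts, total_users)
```

On these numbers, users who added at most 7 mutual followers already account for roughly 81% of the data. Extending the same running sum over the full value counts would locate the 95% threshold.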

In [65]:
def select_mutual_sample(r):
    sample = login_eng_graph.filter_by(r, 'num_mutual_follower_added_30d').head(n=10000)['num_share_rcvd_30d']
    return sample
    
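# Note: range() excludes its end point, so these bins cover 0-4, 6-9, 11-14, 16-19, 21-24 and 26-29
# and skip users who added exactly 5, 10, 15, 20, 25, or 30 mutual followers.
ranges = [range(5), range(6,10), range(11,15), range(16,20), range(21,25), range(26,30)]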

import pandas
mutual_samples = pandas.DataFrame(map(select_mutual_sample, ranges)).transpose()
print mutual_samples.describe()

                 0             1             2             3            4  \
count  10000.00000  10000.000000  10000.000000  10000.000000  10000.00000   
mean       4.05100     22.337500     29.514300     50.730600     58.18480   
std       92.14426    250.584929    244.230817    829.782142    881.80225   
min        0.00000      0.000000      0.000000      0.000000      0.00000   
25%        0.00000      0.000000      0.000000      0.000000      0.00000   
50%        0.00000      2.000000      4.000000      5.000000      6.00000   
75%        0.00000     10.000000     18.000000     24.000000     30.00000   
max     6328.00000  13047.000000  14756.000000  63244.000000  82762.00000   

                  5  
count  10000.000000  
mean      82.733000  
std     1380.262665  
min        0.000000  
25%        0.000000  
50%        7.000000  
75%       35.000000  
max    92976.000000  


In [72]:
# one-way ANOVA P value
from scipy import stats
f_val, p_val = stats.f_oneway(*[mutual_samples[i] for i in range(6)])
print f_val, p_val

13.6258909112 2.55480983628e-13


Again the p-value indicates with high confidence that the means of the groups are significantly different.
The number of mutual followers added in the last 30 days appears to be significantly associated with the number of shares received in the last 30 days.