# Kaggle Project: TalkingData Mobile User Demographics

## Definition

### Project Overview

This project is based on a Kaggle competition: https://www.kaggle.com/c/talkingdata-mobile-user-demographics.

This project uses data collected by TalkingData, China's largest third-party mobile data platform. As the Kaggle website describes it: "In this competition, Kagglers are challenged to build a model predicting users’ demographic characteristics based on their app usage, geolocation, and mobile device properties. Doing so will help millions of developers and brand advertisers around the world pursue data-driven marketing efforts which are relevant to their users and catered to their preferences."



### Metric

Submissions are evaluated using the multi-class logarithmic loss. Each device has been labeled with one true class. For each device, you must submit a set of predicted probabilities (one for each class). The formula is then,

$$logloss = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$

where $N$ is the number of devices in the test set, $M$ is the number of class labels, $\log$ is the natural logarithm, $y_{ij}$ is 1 if device $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that device $i$ belongs to class $j$.
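To make the metric concrete, here is a minimal sketch of the multi-class log loss in NumPy. The function name `multiclass_log_loss`, the clipping constant `eps`, the row renormalisation step, and the toy inputs are all assumptions made for illustration; they are not taken from the competition's scoring code.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to each device's true class.

    y_true: integer class indices, shape (N,)
    probs:  predicted probabilities, shape (N, M)
    eps:    clipping constant (an assumed value) that avoids log(0)
    """
    y_true = np.asarray(y_true, dtype=int)
    probs = np.asarray(probs, dtype=float)
    # Clip away exact 0s and 1s, then renormalise each row to sum to 1.
    probs = np.clip(probs, eps, 1 - eps)
    probs = probs / probs.sum(axis=1, keepdims=True)
    n = probs.shape[0]
    # y_ij is 1 only for the true class, so the inner sum keeps one term per device.
    return -np.log(probs[np.arange(n), y_true]).mean()

# Toy example: 3 devices, 3 classes (illustrative values only).
y_true = [0, 2, 1]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3]]
loss = multiclass_log_loss(y_true, probs)

# A uniform prediction over M classes always scores log(M).
uniform_loss = multiclass_log_loss(y_true, np.full((3, 3), 1.0 / 3))
```

Because only the probability given to the true class enters the sum, a confident wrong prediction is penalised heavily, while a uniform guess over $M$ classes yields $\log(M)$, which makes a useful baseline when reading scores.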

## Analysis

### Data Exploration

In [1]:
# import packages
import numpy as np
import pandas as pd
from ggplot import *
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import random 
# Draw inline
%matplotlib inline

In [27]:
# Read data into memory
train = pd.read_csv("./data/gender_age_train.csv")
test = pd.read_csv("./data/gender_age_test.csv")
phone = pd.read_csv("./data/phone_brand_device_model.csv")
event = pd.read_csv("./data/events.csv")
app_event = pd.read_csv("./data/app_events.csv")
app_label = pd.read_csv("./data/app_labels.csv")
label_cat = pd.read_csv("./data/label_categories.csv")

Now let's take a look at the data and compute some basic statistics.

In [20]:
print(train.head())
train.describe()

             device_id gender  age   group
0 -8076087639492063270      M   35  M32-38
1 -2897161552818060146      M   35  M32-38
2 -8260683887967679142      M   35  M32-38
3 -4938849341048082022      M   30  M29-31
4   245133531816851882      M   30  M29-31


| | device_id | age |
|---|---|---|
| count | 74645.0 | 74645.0 |
| mean | -749135400000000.0 | 31.410342 |
| std | 5.32715e+18 | 9.868735 |
| min | -9.223067e+18 | 1.0 |
| 25% | -4.617367e+18 | 25.0 |
| 50% | -1.841362e+16 | 29.0 |
| 75% | 4.636656e+18 | 36.0 |
| max | 9.222849e+18 | 96.0 |


In [21]:
print(test.head())
test.describe()

             device_id
0  1002079943728939269
1 -1547860181818787117
2  7374582448058474277
3 -6220210354783429585
4 -5893464122623104785


| | device_id |
|---|---|
| count | 112071.0 |
| mean | -2.367461e+16 |
| std | 5.331855e+18 |
| min | -9.223322e+18 |
| 25% | -4.661036e+18 |
| 50% | -3.107321e+16 |
| 75% | 4.581985e+18 |
| max | 9.223069e+18 |


In [22]:
print(phone.head())
phone.describe()

             device_id phone_brand   device_model
0 -8890648629457979026          小米             红米
1  1277779817574759137          小米           MI 2
2  5137427614288105724          三星      Galaxy S4
3  3669464369358936369       SUGAR           时尚手机
4 -5019277647504317457          三星  Galaxy Note 2


| | device_id |
|---|---|
| count | 187245.0 |
| mean | -1.426513e+16 |
| std | 5.330527e+18 |
| min | -9.223322e+18 |
| 25% | -4.645265e+18 |
| 50% | -2.619149e+16 |
| 75% | 4.606568e+18 |
| max | 9.223069e+18 |


In [23]:
print(event.head())
event.describe()

   event_id            device_id            timestamp  longitude  latitude
0         1    29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2 -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
2         3 -4833982096941402721  2016-05-01 00:08:05     106.60     29.70
3         4 -6815121365017318426  2016-05-01 00:06:40     104.27     23.28
4         5 -5373797595892518570  2016-05-01 00:07:18     115.88     28.66


| | event_id | device_id | longitude | latitude |
|---|---|---|---|---|
| count | 3252950.0 | 3252950.0 | 3252950.0 | 3252950.0 |
| mean | 1626475.5 | -2.68514e+16 | 77.961922 | 21.629493 |
| std | 939045.923418 | 5.301236e+18 | 54.058011 | 15.696974 |
| min | 1.0 | -9.222957e+18 | -180.0 | -38.43 |
| 25% | 813238.25 | -4.616259e+18 | 0.0 | 0.0 |
| 50% | 1626475.5 | -1.729953e+16 | 112.95 | 28.02 |
| 75% | 2439712.75 | 4.54975e+18 | 117.21 | 34.07 |
| max | 3252950.0 | 9.22254e+18 | 174.76 | 59.94 |


In [24]:
print(app_event.head())
app_event.describe()

   event_id               app_id  is_installed  is_active
0         2  5927333115845830913             1          1
1         2 -5720078949152207372             1          0
2         2 -1633887856876571208             1          0
3         2  -653184325010919369             1          1
4         2  8693964245073640147             1          1


| | event_id | app_id | is_installed | is_active |
|---|---|---|---|---|
| count | 32473067.0 | 32473070.0 | 32473067 | 32473067.0 |
| mean | 1625563.562268 | 1.182779e+18 | 1 | 0.392109 |
| std | 938468.181497 | 5.360173e+18 | 0 | 0.488221 |
| min | 2.0 | -9.221157e+18 | 1 | 0.0 |
| 25% | 813472.0 | -3.474568e+18 | 1 | 0.0 |
| 50% | 1626907.0 | 1.387044e+18 | 1 | 0.0 |
| 75% | 2441106.0 | 6.043001e+18 | 1 | 1.0 |
| max | 3252948.0 | 9.222488e+18 | 1 | 1.0 |


In [25]:
print(app_label.head())
app_label.describe()

                app_id  label_id
0  7324884708820027918       251
1 -4494216993218550286       251
2  6058196446775239644       406
3  6058196446775239644       407
4  8694625920731541625       406


| | app_id | label_id |
|---|---|---|
| count | 459943.0 | 459943.0 |
| mean | 1.912461e+17 | 664.849749 |
| std | 5.269442e+18 | 192.797736 |
| min | -9.223281e+18 | 2.0 |
| 25% | -4.305882e+18 | 548.0 |
| 50% | 1.083204e+17 | 714.0 |
| 75% | 4.830475e+18 | 795.0 |
| max | 9.223318e+18 | 1021.0 |


In [26]:
print(label_cat.head())
label_cat.describe()

   label_id           category
0         1                NaN
1         2     game-game type
2         3   game-Game themes
3         4     game-Art Style
4         5  game-Leisure time


| | label_id |
|---|---|
| count | 930.0 |
| mean | 517.080645 |
| std | 297.85879 |
| min | 1.0 |
| 25% | 251.25 |
| 50% | 523.5 |
| 75% | 778.75 |
| max | 1021.0 |
