# Data Exploration

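The data comes from the Kaggle competition Event Recommendation Engine Challenge. The task is to predict whether a user will be interested in a given event, based on:
- events they’ve responded to in the past
- user demographic information
- what events they’ve seen and clicked on in our app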

Competition page:
https://www.kaggle.com/c/event-recommendation-engine-challenge/data

In [3]:
import pandas as pd
import numpy as np

## Training data first
train.csv is small, so it can be read into memory in one go.

In [4]:
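"""
train.csv has 6 columns:
user: user ID
event: event ID
invited: whether the user was invited to the event (0/1)
timestamp: ISO-8601 UTC time string for when the user saw the event
interested, not_interested: binary flags (0/1) for whether the user marked the event as interested or not interested

test.csv has the same columns as train.csv, except that it lacks interested and not_interested
"""

# Read the data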
train = pd.read_csv("train.csv")
train.head()

Unnamed: 0,user,event,invited,timestamp,interested,not_interested
0,3044012,1918771225,0,2012-10-02 15:53:05.754000+00:00,0,0
1,3044012,1502284248,0,2012-10-02 15:53:05.754000+00:00,0,0
2,3044012,2529072432,0,2012-10-02 15:53:05.754000+00:00,1,0
3,3044012,3072478280,0,2012-10-02 15:53:05.754000+00:00,0,0
4,3044012,1390707377,0,2012-10-02 15:53:05.754000+00:00,0,0


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15398 entries, 0 to 15397
Data columns (total 6 columns):
user              15398 non-null int64
event             15398 non-null int64
invited           15398 non-null int64
timestamp         15398 non-null object
interested        15398 non-null int64
not_interested    15398 non-null int64
dtypes: int64(5), object(1)
memory usage: 721.9+ KB


No missing values; about 15,000 records (15,398).

## Test data
test.csv is small, so it can be read into memory in one go.

In [6]:
"""
test.csv has 4 columns:
user: user ID
event: event ID
invited: whether the user was invited to the event (0/1)
timestamp: ISO-8601 UTC time string for when the user saw the event

Unlike train.csv, test.csv has no interested or not_interested columns
"""

# Read the data
test = pd.read_csv("test.csv")
test.head()

Unnamed: 0,user,event,invited,timestamp
0,1776192,2877501688,0,2012-11-30 11:39:01.230000+00:00
1,1776192,3025444328,0,2012-11-30 11:39:01.230000+00:00
2,1776192,4078218285,0,2012-11-30 11:39:01.230000+00:00
3,1776192,1024025121,0,2012-11-30 11:39:01.230000+00:00
4,1776192,2972428928,0,2012-11-30 11:39:21.985000+00:00


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10237 entries, 0 to 10236
Data columns (total 4 columns):
user         10237 non-null int64
event        10237 non-null int64
invited      10237 non-null int64
timestamp    10237 non-null object
dtypes: int64(3), object(1)
memory usage: 320.0+ KB


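About 10,000 records (10,237), again with no missing values.
The timestamps in the test set are later than those in the training set (this seems to be the convention for most competition data).
So when splitting the training data into a training set and a validation set, it is best to make the validation timestamps later than the training timestamps as well, to better simulate the test conditions.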
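
The time-ordered split suggested above can be sketched roughly as follows. The function name `time_based_split`, the `valid_frac` parameter, and the tiny sample frame are illustrative assumptions, not part of the competition data; the idea is simply to sort by `timestamp` and hold out the latest rows for validation.

```python
import pandas as pd

def time_based_split(df, time_col="timestamp", valid_frac=0.2):
    """Hold out the latest `valid_frac` of rows (by time) as a validation set."""
    # Parse the ISO-8601 strings into timezone-aware datetimes
    times = pd.to_datetime(df[time_col], utc=True)
    # Row labels ordered from earliest to latest (mergesort is stable)
    order = times.sort_values(kind="mergesort").index
    n_valid = int(len(df) * valid_frac)
    split = len(df) - n_valid
    return df.loc[order[:split]], df.loc[order[split:]]

# Tiny made-up sample in the same layout as train.csv (values are hypothetical)
sample = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "event": [10, 11, 10, 12, 13],
    "timestamp": [
        "2012-10-02 15:53:05.754000+00:00",
        "2012-10-05 08:00:00.000000+00:00",
        "2012-10-01 12:00:00.000000+00:00",
        "2012-10-20 09:30:00.000000+00:00",
        "2012-10-11 18:45:00.000000+00:00",
    ],
})
train_part, valid_part = time_based_split(sample, valid_frac=0.4)
```

On the real data this would be called as `time_based_split(train)`. Note that rows for the same user can end up on both sides of the split; whether that matters depends on how the features are built.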

## User data
users.csv is small, so it can be read into memory in one go.

In [8]:
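"""
User profile information is in users.csv, with 7 features:
user_id: user ID
locale: locale (region and language)
birthyear: year of birth
gender: gender
joinedAt: time the user joined the app, ISO-8601 UTC time
location: location
timezone: time zone (UTC offset in minutes)
"""

# Read the data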
users = pd.read_csv("users.csv")
users.head()

Unnamed: 0,user_id,locale,birthyear,gender,joinedAt,location,timezone
0,3197468391,id_ID,1993,male,2012-10-02T06:40:55.524Z,Medan Indonesia,480.0
1,3537982273,id_ID,1992,male,2012-09-29T18:03:12.111Z,Medan Indonesia,420.0
2,823183725,en_US,1975,male,2012-10-06T03:14:07.149Z,Stratford Ontario,-240.0
3,1872223848,en_US,1991,female,2012-11-04T08:59:43.783Z,Tehran Iran,210.0
4,3429017717,id_ID,1995,female,2012-09-10T16:06:53.132Z,,420.0


In [9]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38209 entries, 0 to 38208
Data columns (total 7 columns):
user_id      38209 non-null int64
locale       38209 non-null object
birthyear    38209 non-null object
gender       38100 non-null object
joinedAt     38152 non-null object
location     32745 non-null object
timezone     37773 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 2.0+ MB


About 38,000 records (38,209).

gender, joinedAt, location, and timezone have missing values,
so they need missing-value handling.
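
A minimal sketch of how the gaps could be inspected and filled. The helper names (`missing_summary`, `fill_user_gaps`) and the fill choices are assumptions made for illustration; the notes above do not prescribe a strategy. Note also that `birthyear` is reported as `object` with no nulls, which suggests it may contain non-numeric placeholder strings; the `"None"` entry in the sample below is a hypothetical stand-in for such a value.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing values per column, largest first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)

def fill_user_gaps(users):
    """One possible (assumed) treatment of the gaps in users.csv."""
    out = users.copy()
    # Non-numeric birth years become NaN instead of staying as strings
    out["birthyear"] = pd.to_numeric(out["birthyear"], errors="coerce")
    # Treat a missing gender as its own category
    out["gender"] = out["gender"].fillna("unknown")
    # An empty string marks an unknown location
    out["location"] = out["location"].fillna("")
    return out

# Made-up rows in the layout of users.csv (values are hypothetical)
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "birthyear": ["1993", "None", "1975"],
    "gender": ["male", np.nan, "female"],
    "location": ["Medan Indonesia", np.nan, np.nan],
})
summary = missing_summary(sample_users)
cleaned = fill_user_gaps(sample_users)
```

joinedAt and timezone are left untouched here; how to fill them depends on the features built later.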

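There are more users here than appear in the training and test sets.
To save space and time, we can keep only the users that appear in the training set or the test set.
(Presumably the same holds for events: events.csv is distributed gzip-compressed, so it should contain even more records.)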
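
Restricting the user table to users seen in train or test might look like this. The function name `keep_seen_users` and the small frames are illustrative assumptions; only the column names (`user` in train/test, `user_id` in users) come from the outputs above.

```python
import pandas as pd

def keep_seen_users(users, train, test):
    """Keep only the rows of `users` whose user_id occurs in train or test."""
    seen = set(train["user"]) | set(test["user"])
    return users[users["user_id"].isin(seen)].reset_index(drop=True)

# Made-up frames with the relevant columns only (values are hypothetical)
users_demo = pd.DataFrame({"user_id": [1, 2, 3, 4], "gender": ["male", "female", "male", "female"]})
train_demo = pd.DataFrame({"user": [1, 1, 3]})
test_demo = pd.DataFrame({"user": [4]})

small_users = keep_seen_users(users_demo, train_demo, test_demo)
```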

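## Event data
events.csv is large, so reading it all at once is slow.
For data exploration we read it in one go anyway, but for the later feature engineering and model training it should not be loaded in full with pandas.
Instead, either read it in parts with pandas,
or read it directly with plain file I/O functions (lighter than loading everything with pandas).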
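
Both options can be sketched as below, keeping only the events that are actually needed. The function names, the `wanted_event_ids` argument, and the in-memory sample CSV are assumptions for illustration; on the real data the source would be the path to events.csv and the wanted IDs would come from train and test.

```python
import csv
import io
import pandas as pd

def load_events_subset(source, wanted_event_ids, chunksize=100_000):
    """Read a CSV in chunks with pandas and keep only the wanted events."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        kept.append(chunk[chunk["event_id"].isin(wanted_event_ids)])
    return pd.concat(kept, ignore_index=True)

def iter_event_rows(lines, wanted_event_ids):
    """Stream rows with the csv module, yielding one dict per wanted event."""
    for row in csv.DictReader(lines):
        if int(row["event_id"]) in wanted_event_ids:
            yield row

# Small made-up CSV standing in for events.csv (only a few columns)
csv_text = "\n".join([
    "event_id,user_id,c_1,c_other",
    "1,100,2,9",
    "2,101,2,7",
    "3,102,0,12",
    "4,103,1,8",
    "5,104,1,9",
])
wanted = {2, 5}

subset = load_events_subset(io.StringIO(csv_text), wanted, chunksize=2)
streamed = list(iter_event_rows(io.StringIO(csv_text), wanted))
```

The pandas version returns a typed DataFrame, while the csv version yields dictionaries of strings and never holds more than one row at a time.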

In [10]:
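"""
Event descriptions are in events.csv, with 110 features.
The first 9 columns are event_id, user_id, start_time, city, state, zip, country, lat, and lng:
event_id: ID of the event
user_id: ID of the user who created the event
city, state, zip, country: more details about the location of the venue (if known)
lat, lng: floats (latitude and longitude coordinates of the venue)
start_time: ISO-8601 UTC time string for when the event starts

The last 101 columns are word counts: count_1, count_2, ..., count_100, count_other
(named c_1, ..., c_100, c_other in the file itself)
count_N: number of times the N-th most common word appears in the event description
count_other: number of occurrences of all words outside the 100 most common
"""

# Read the data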
events = pd.read_csv("events.csv")
events.head()

Unnamed: 0,event_id,user_id,start_time,city,state,zip,country,lat,lng,c_1,...,c_92,c_93,c_94,c_95,c_96,c_97,c_98,c_99,c_100,c_other
0,684921758,3647864012,2012-10-31T00:00:00.001Z,,,,,,,2,...,0,1,0,0,0,0,0,0,0,9
1,244999119,3476440521,2012-11-03T00:00:00.001Z,,,,,,,2,...,0,0,0,0,0,0,0,0,0,7
2,3928440935,517514445,2012-11-05T00:00:00.001Z,,,,,,,0,...,0,0,0,0,0,0,0,0,0,12
3,2582345152,781585781,2012-10-30T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,8
4,1051165850,1016098580,2012-09-27T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,9


In [11]:
events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3137972 entries, 0 to 3137971
Columns: 110 entries, event_id to c_other
dtypes: float64(2), int64(103), object(5)
memory usage: 2.6+ GB


The DataFrame takes up a lot of memory (2.6+ GB), and info() only prints a summary here, without the per-column statistics...
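
The per-column listing is most likely suppressed because the frame has more columns than pandas' default limit for the detailed view, not because of its size. A minimal sketch of forcing the full output, assuming a pandas version that supports the `show_counts` argument of `DataFrame.info` (older versions used a differently named argument); `full_info` and the wide demo frame are hypothetical names for illustration.

```python
import io
import pandas as pd

def full_info(df):
    """Return the verbose info() text, with per-column non-null counts."""
    buf = io.StringIO()
    df.info(verbose=True, show_counts=True, buf=buf)
    return buf.getvalue()

# Made-up frame with 110 columns, mimicking the width of events.csv
wide = pd.DataFrame(0, index=range(3), columns=[f"c_{i}" for i in range(1, 111)])
text = full_info(wide)
```

On the real data, `events.info(verbose=True, show_counts=True)` would print the same kind of listing directly; counting non-null values over roughly 3 million rows takes some time.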

## Event attendee data
event_attendees.csv

In [14]:
"""
event_attendees.csv has 5 columns:
event: event ID
yes, maybe, invited, no: space-separated lists of user IDs,
giving the users who will attend, may attend, were invited, and will not attend the event
"""

# Read the data
event_attendees = pd.read_csv("event_attendees.csv")
event_attendees.head()

Unnamed: 0,event,yes,maybe,invited,no
0,1159822043,1975964455 252302513 4226086795 3805886383 142...,2733420590 517546982 1350834692 532087573 5831...,1723091036 3795873583 4109144917 3560622906 31...,3575574655 1077296663
1,686467261,2394228942 2686116898 1056558062 3792942231 41...,1498184352 645689144 3770076778 331335845 4239...,1788073374 733302094 1830571649 676508092 7081...,
2,1186208412,,3320380166 3810793697,1379121209 440668682,1728988561 2950720854
3,2621578336,,,,
4,855842686,2406118796 3550897984 294255260 1125817077 109...,2671721559 1761448345 2356975806 2666669465 10...,1518670705 880919237 2326414227 2673818347 332...,3500235232


In [15]:
event_attendees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24144 entries, 0 to 24143
Data columns (total 5 columns):
event      24144 non-null int64
yes        22160 non-null object
maybe      20977 non-null object
invited    22322 non-null object
no         17485 non-null object
dtypes: int64(1), object(4)
memory usage: 943.2+ KB


Many missing values (a missing value means there are no users in that list).
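
Since each cell is a space-separated ID list and a missing cell means an empty list, the lists can be turned into counts as sketched below. The helper names are assumptions; the two sample rows are rows 2 and 3 of the head() output above.

```python
import numpy as np
import pandas as pd

def count_ids(cell):
    """Number of IDs in a space-separated cell; a missing cell counts as 0."""
    if pd.isna(cell):
        return 0
    return len(str(cell).split())

def attendee_counts(event_attendees):
    """Per-event counts of yes / maybe / invited / no responses."""
    counts = pd.DataFrame({"event": event_attendees["event"]})
    for col in ["yes", "maybe", "invited", "no"]:
        counts[f"n_{col}"] = event_attendees[col].map(count_ids)
    return counts

# Rows 2 and 3 from the head() output above
sample_attendees = pd.DataFrame({
    "event": [1186208412, 2621578336],
    "yes": [np.nan, np.nan],
    "maybe": ["3320380166 3810793697", np.nan],
    "invited": ["1379121209 440668682", np.nan],
    "no": ["1728988561 2950720854", np.nan],
})
counts = attendee_counts(sample_attendees)
```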

## User friends data
user_friends.csv

In [13]:
"""
user_friends.csv has 2 columns:
user: user ID
friends: space-separated list of the user's friend IDs
"""

# Read the data
user_friends = pd.read_csv("user_friends.csv")
user_friends.head()

Unnamed: 0,user,friends
0,3197468391,1346449342 3873244116 4226080662 1222907620 54...
1,3537982273,1491560444 395798035 2036380346 899375619 3534...
2,823183725,1484954627 1950387873 1652977611 4185960823 42...
3,1872223848,83361640 723814682 557944478 1724049724 253059...
4,3429017717,4253303705 2130310957 1838389374 3928735761 71...


In [16]:
user_friends.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38202 entries, 0 to 38201
Data columns (total 2 columns):
user       38202 non-null int64
friends    38063 non-null object
dtypes: int64(1), object(1)
memory usage: 597.0+ KB


There are missing values here too (presumably a missing value means the user has no friends?).
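
Under that assumption (a missing `friends` cell means no friends), the table can be converted into a lookup from user ID to a set of friend IDs. This is a sketch; `build_friend_sets` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def build_friend_sets(user_friends):
    """Map each user ID to a set of friend IDs; missing cells give empty sets."""
    friends = {}
    for user, cell in zip(user_friends["user"], user_friends["friends"]):
        if pd.isna(cell):
            friends[int(user)] = set()
        else:
            friends[int(user)] = {int(f) for f in str(cell).split()}
    return friends

# Made-up rows in the layout of user_friends.csv (values are hypothetical)
sample_friends = pd.DataFrame({
    "user": [1, 2, 3],
    "friends": ["2 3", "1", np.nan],
})
friend_sets = build_friend_sets(sample_friends)
```

Sets make later checks cheap, for example how many of a user's friends appear in an event's yes list.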