# Clustering users

The data comes from the Kaggle competition Event Recommendation Engine Challenge. The task is to predict whether a user will be interested in an event, based on events they've responded to in the past, user demographic information, and what events they've seen and clicked on in the app.  

Competition page: https://www.kaggle.com/c/event-recommendation-engine-challenge/data  

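Because there are many users (30,000+), the users can be clustered. User descriptions are in users.csv, which has 7 features:  
user_id  
locale: region and language  
birthyear: year of birth  
gender: gender  
joinedAt: time the user joined the app, ISO-8601 UTC time  
location: location  
timezone: time zone  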
 
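In events.csv, the last 101 columns are word counts: count_1, count_2, ..., count_100, count_other (they appear as c_1, ..., c_100, c_other in the file header)  
count_N: number of times the N-th most common word appears in the event description  
count_other: number of occurrences of all words outside the 100 most common  
 
Clustering is based on the event keywords (the count_1, count_2, ..., count_100, count_other attributes)  

Because the number of samples is large, this project uses MiniBatchKMeans.  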
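
As a rough sketch of the clustering step named above, the snippet below runs scikit-learn's `MiniBatchKMeans` on synthetic word counts. The data, `n_clusters=5`, and `batch_size=100` are illustrative assumptions, not values taken from this project.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the 101 word-count columns (NOT the real events data).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(500, 101))

# n_clusters and batch_size are arbitrary values chosen for illustration.
model = MiniBatchKMeans(n_clusters=5, batch_size=100, n_init=3, random_state=0)
labels = model.fit_predict(counts)

print(labels.shape)                  # one cluster label per row
print(model.cluster_centers_.shape)  # one centroid per cluster, one value per feature
```

MiniBatchKMeans updates the centroids from small random batches rather than the full dataset on every iteration, which is why it suits large sample counts.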

# Data preparation

In [17]:
# Import pandas
import pandas as pd

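## 1 Extract the target data from the training and test sets

First read the event column from train.csv and test.csv and combine the values, then create unique_event.csv to store the selected events. This file is used later to look up the matching rows in the large events.csv dataset.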

In [18]:
"""
Only the event column is needed from these files.

train.csv has 6 columns:
user: user ID
event: event ID
invited: whether the user was invited (0/1)
timestamp: ISO-8601 UTC time string for when the user saw the event
interested, and not_interested

test.csv has the same columns as train.csv, except interested and not_interested
"""

# Count the distinct users and events across the training and test sets (sets keep only unique values)
uniqueUsers = set()
uniqueEvents = set()

for filename in ["train.csv", "test.csv"]:
    with open(filename, 'r') as f:
        # Skip the first line (column names)
        f.readline()

        for line in f:    # for each record
            cols = line.strip().split(",")
            uniqueUsers.add(cols[0])    # first column is the user ID
            uniqueEvents.add(cols[1])   # second column is the event ID, the data we care about

n_uniqueUsers = len(uniqueUsers)
n_uniqueEvents = len(uniqueEvents)

print("number of uniqueUsers :%d" % n_uniqueUsers)
print("number of uniqueEvents :%d" % n_uniqueEvents)



number of uniqueUsers :3391
number of uniqueEvents :13418


The training and test sets together contain 13,418 distinct events.
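
The same counting can be sketched with the standard-library `csv` module, which selects columns by header name instead of position. The rows below are made up for illustration, and `collect_unique` is a hypothetical helper name, not part of the original notebook.

```python
import csv
import io

# Tiny in-memory stand-ins for train.csv and test.csv (made-up rows).
train_text = (
    "user,event,invited,timestamp,interested,not_interested\n"
    "1,100,0,2012-10-02T15:53:05Z,0,0\n"
    "1,200,0,2012-10-02T15:53:05Z,1,0\n"
    "2,100,1,2012-10-03T09:00:00Z,0,1\n"
)
test_text = (
    "user,event,invited,timestamp\n"
    "2,300,0,2012-10-04T10:00:00Z\n"
    "3,200,0,2012-10-05T11:00:00Z\n"
)

def collect_unique(sources, user_col="user", event_col="event"):
    """Return the sets of distinct user IDs and event IDs found in the sources."""
    users, events = set(), set()
    for src in sources:
        # DictReader consumes the header row and keys each record by column name.
        for row in csv.DictReader(src):
            users.add(row[user_col])
            events.add(row[event_col])
    return users, events

users, events = collect_unique([io.StringIO(train_text), io.StringIO(test_text)])
print(len(users), len(events))
```

With the real files, each `io.StringIO(...)` would be replaced by an open file handle such as `open("train.csv", newline="")`.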

## 2 Store the target data  

The event IDs are held in a set, so they are first converted to a list, then passed to a pandas DataFrame, which writes them out as a CSV file.

In [19]:
ListEvent = list(uniqueEvents)

df = pd.DataFrame(ListEvent)
df.columns=['EventsID']
df.index+=1
df.index.name = 'EventNo'
df.to_csv('unique_event.csv', header=True)

Read the saved data back to check that it is correct.

In [20]:
unique_data = pd.read_csv("unique_event.csv")
unique_data.head()

Unnamed: 0,EventNo,EventsID
0,1,438462512
1,2,1645436499
2,3,2129619079
3,4,3609940860
4,5,1672084151


In [21]:
unique_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13418 entries, 0 to 13417
Data columns (total 2 columns):
EventNo     13418 non-null int64
EventsID    13418 non-null int64
dtypes: int64(2)
memory usage: 209.7 KB


The output above shows that the events in the training and test sets were extracted and written to unique_event.csv; 13,418 events were extracted in total.
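
In the round trip above, the named index `EventNo` is written as an ordinary first column and comes back as a data column. A minimal sketch of restoring it as the index with `index_col`, using an in-memory buffer in place of the file and only the first three IDs from the output above:

```python
import io
import pandas as pd

df = pd.DataFrame({"EventsID": [438462512, 1645436499, 2129619079]})
df.index += 1                 # number the rows from 1 instead of 0
df.index.name = "EventNo"

buf = io.StringIO()           # stands in for unique_event.csv
df.to_csv(buf)                # the named index becomes the first CSV column
buf.seek(0)

# index_col turns the EventNo column back into the index on load.
restored = pd.read_csv(buf, index_col="EventNo")
print(restored)
```

Whether to restore the index is a matter of taste here; the notebook only uses the `EventsID` column afterwards, so either form works.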

## 3 Read data from the large events.csv dataset

In [22]:
"""
# Disabled: an earlier attempt that scans events.csv with plain file I/O.
wanted_ids = set(str(i) for i in unique_data.EventsID)   # compare IDs as strings
total_events = set()

num = 0

fevents = open("events.csv", 'r')

column = fevents.readline().strip().split(",")

for eline in fevents:    # for each record
    if num <= 10000:
        num += 1
        events_cols = eline.strip().split(",")
        if events_cols[0] in wanted_ids:           # first column is the event ID
            total_events.add(tuple(events_cols))   # lists are unhashable, so store a tuple
            print(events_cols[0])

fevents.close()
print(len(total_events))

"""

The block above reads events.csv with plain file I/O. For a dataset this large, plain file reads did take much less time than pandas in my runs.

In [23]:
import linecache
events_big_dataset = linecache.getlines("events.csv")

columns=events_big_dataset[0].strip().split(',')

data=[]

for n in events_big_dataset[1:200001]:
    n=n.strip().split(',')
    data.append(n)
    
events_data=pd.DataFrame(data=data,columns=columns)
events_data.head()
print(events_data.shape)

(200000, 110)


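Save the selected subset of events.csv to events_sample.csv. The original file has more than 3 million rows, which is too many: even though reading them is fast, the matching step below would be very slow, so only the first 200,000 rows are taken.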

In [24]:
#events_data.index+=1
#events_data.index.name = 'dataNo'
events_data.to_csv('events_sample.csv',header=True ,index=False)

In [25]:
event_samle_data = pd.read_csv("events_sample.csv")
event_samle_data.head()

Unnamed: 0,event_id,user_id,start_time,city,state,zip,country,lat,lng,c_1,...,c_92,c_93,c_94,c_95,c_96,c_97,c_98,c_99,c_100,c_other
0,684921758,3647864012,2012-10-31T00:00:00.001Z,,,,,,,2,...,0,1,0,0,0,0,0,0,0,9
1,244999119,3476440521,2012-11-03T00:00:00.001Z,,,,,,,2,...,0,0,0,0,0,0,0,0,0,7
2,3928440935,517514445,2012-11-05T00:00:00.001Z,,,,,,,0,...,0,0,0,0,0,0,0,0,0,12
3,2582345152,781585781,2012-10-30T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,8
4,1051165850,1016098580,2012-09-27T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,9


In [26]:
event_samle_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 110 entries, event_id to c_other
dtypes: float64(2), int64(103), object(5)
memory usage: 167.8+ MB


The extracted sample has the expected dimensions.
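
The first-N-rows sample in this section was built by hand with `linecache` and `str.split(',')`, which would mis-split any quoted field that contains a comma. A hedged alternative is `pandas.read_csv` with `nrows`, or `chunksize` when the whole file must be streamed. The CSV text below is made up and much narrower than the real events.csv.

```python
import io
import pandas as pd

# Made-up miniature of events.csv: 10 rows, 4 columns.
csv_text = "event_id,user_id,c_1,c_other\n" + "".join(
    f"{i},{i * 10},{i % 3},{i % 5}\n" for i in range(1, 11)
)

# nrows stops parsing after the requested number of data rows.
sample = pd.read_csv(io.StringIO(csv_text), nrows=4)
print(sample.shape)

# chunksize yields DataFrames piece by piece, so the full file never sits in memory at once.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    total += len(chunk)
print(total)
```

For the real file, `io.StringIO(csv_text)` would be replaced by the path `"events.csv"` and `nrows` by 200000.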

## 4 Find the events from the training and test sets in the large sample

In [27]:
sample_dataset = linecache.getlines("events_sample.csv")

sample_columns = sample_dataset[0].strip().split(',')

data1=[]

for m in sample_dataset[1:200001]:
    m = m.strip().split(',')
    for select_eventID in unique_data.EventsID:
        if int(select_eventID) == int(m[0]):
            data1.append(m)
    
target_data=pd.DataFrame(data=data1,columns=sample_columns)

print(target_data.shape)

print(type(select_eventID))
print(type(m[0]))
target_data.head()

(1532, 110)
<class 'int'>
<class 'str'>


Unnamed: 0,event_id,user_id,start_time,city,state,zip,country,lat,lng,c_1,...,c_92,c_93,c_94,c_95,c_96,c_97,c_98,c_99,c_100,c_other
0,684921758,3647864012,2012-10-31T00:00:00.001Z,,,,,,,2,...,0,1,0,0,0,0,0,0,0,9
1,244999119,3476440521,2012-11-03T00:00:00.001Z,,,,,,,2,...,0,0,0,0,0,0,0,0,0,7
2,3928440935,517514445,2012-11-05T00:00:00.001Z,,,,,,,0,...,0,0,0,0,0,0,0,0,0,12
3,2582345152,781585781,2012-10-30T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,8
4,1051165850,1016098580,2012-09-27T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,9


In [28]:
target_data.to_csv('target_event.csv', header=True,index=False)

This file is only an intermediate file for the later clustering, so the rows do not need a sequence number; hence index=False.

In [29]:
test_data = pd.read_csv("target_event.csv")
test_data.head()

Unnamed: 0,event_id,user_id,start_time,city,state,zip,country,lat,lng,c_1,...,c_92,c_93,c_94,c_95,c_96,c_97,c_98,c_99,c_100,c_other
0,684921758,3647864012,2012-10-31T00:00:00.001Z,,,,,,,2,...,0,1,0,0,0,0,0,0,0,9
1,244999119,3476440521,2012-11-03T00:00:00.001Z,,,,,,,2,...,0,0,0,0,0,0,0,0,0,7
2,3928440935,517514445,2012-11-05T00:00:00.001Z,,,,,,,0,...,0,0,0,0,0,0,0,0,0,12
3,2582345152,781585781,2012-10-30T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,8
4,1051165850,1016098580,2012-09-27T00:00:00.001Z,,,,,,,1,...,0,0,0,0,0,0,0,0,0,9


In [30]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1532 entries, 0 to 1531
Columns: 110 entries, event_id to c_other
dtypes: float64(2), int64(103), object(5)
memory usage: 1.3+ MB


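This gives the file we need. Of the first 200,000 rows of the large events.csv dataset, only 1,532 match the 13,418 events that appear in the training and test sets. In an earlier attempt that read only the first 1,000 rows of events.csv, 676 rows matched, which suggests that the matching rows are concentrated near the start of events.csv and are probably sparse further on.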
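
The matching in section 4 compares every sampled row against every event ID in a nested loop. A minimal sketch of the same filter with `Series.isin`, which does hashed membership tests instead, plus a running match count that shows where in the file the matches fall. The small frames below are made-up stand-ins for `unique_data` and the events sample.

```python
import pandas as pd

# Made-up stand-ins: IDs we want, and a few rows of the events sample.
wanted = pd.Series([684921758, 244999119, 999], name="EventsID")
events = pd.DataFrame({
    "event_id": [684921758, 111, 244999119, 222],
    "c_1": [2, 0, 2, 1],
})

# Boolean mask: True where the row's event_id is one of the wanted IDs.
mask = events["event_id"].isin(wanted)
matched = events[mask]
print(matched.shape)

# Running number of matches by row position; a curve that flattens early
# would support the observation that matches cluster near the start of the file.
hits = mask.cumsum()
print(hits.tolist())
```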