# Data Preprocessing
1. Topic Extraction
2. User Buckets
3. Data Split

***
## Dataset description
||Feature Name|Feature Type|Feature Description|
|:---|:---|:---|:---|
|Tweet Features|Text tokens<br>Hashtags<br>Tweet id<br>Present media<br>Present links<br>Present domains<br>Tweet type<br>Language<br>Timestamp|List[long]<br>List[string]<br>String<br>List[String]<br>List[string]<br>List[string]<br>String<br>String<br>Long|Ordered list of Bert ids corresponding to Bert tokenization of Tweet text<br>Tab separated list of hastags (identifiers) present in the tweet<br>Tweet identifier<br>Tab separated list of media types. Media type can be in (Photo, Video, Gif)<br>Tab separeted list of links (identifiers) included in the Tweet<br>Tab separated list of domains included in the Tweet (twitter.com, dogs.com)<br>Tweet type, can be either Retweet, Quote, Reply, or Toplevel<br>Identifier corresponding to the inferred language of the Tweet<br>Unix timestamp, in sec of the creation time of the Tweet|
|Engaged With User Features|User id<br>Follower count<br>Following count<br>Is verified?<br>Account creation time|String<br>Long<br>Long<br>Bool<br>Long|User identifier<br>Number of followers of the user<br>Number of accounts<br>the user is following<br>Is the account verified?<br>Unix timestamp, in seconds, of the creation time of the account
|Engaging User Features|User id<br>Follower count<br>Following count<br>Is verified?<br>Account creation time|String<br>Long<br>Long<br>Bool<br>Long|User identifier<br>Number of followers of the user<br>Number of accounts<br>the user is following<br>Is the account verified?<br>Unix timestamp, in seconds, of the creation time of the account
|Engagement Features|Engagee follows engager?<br>Reply engagement timestamp<br>Retweet engagement timestamp<br>Retweet<br>with comment engagement timestamp<br>Like engagement timestamp|Bool<br>Long<br>Long<br>Long<br>Long|Does the account of the engaged tweet author follow the account that has made the engagement?<br>If there is at least one, unix timestamp, in s, of one of the replies<br>If there is one, unix timestamp, in s, of the retweet of the tweet by the engaging user<br>If there is at least one, unix timestamp, in s, of one of the retweet with comment of the tweet by the engaging user<br>If there is one, Unix timestamp, in s, of the like

참고) https://recsys-twitter.com/


***
## 데이터 로드

In [314]:
import pandas as pd
COLS = ["text_tokens", "hashtags", "tweet_id", "present_media", 
                      "present_links", "present_domains", "tweet_type", "language", "tweet_timestamp", "engaged_with_user_id",
                     "engaged_with_user_follower_count", "engaged_with_user_following_count", 
                     "engaged_with_user_is_verified", "engaged_with_user_account_creation", "engaging_user_id",
                     "engaging_user_follower_count", "engaging_user_following_count", "engaging_user_is_verified",
                     "engaging_user_account_creation", "engagee_follows_engager", "reply_timestamp",
                     "retweet_timestamp", "retweet_with_comment_timestamp", "like_timestamp"]
df = pd.read_csv('../dataset/twitter/train100K.csv', names=COLS, skipinitialspace=True, skiprows=1)

In [315]:
df.head()

Unnamed: 0,text_tokens,hashtags,tweet_id,present_media,present_links,present_domains,tweet_type,language,tweet_timestamp,engaged_with_user_id,...,engaging_user_id,engaging_user_follower_count,engaging_user_following_count,engaging_user_is_verified,engaging_user_account_creation,engagee_follows_engager,reply_timestamp,retweet_timestamp,retweet_with_comment_timestamp,like_timestamp
0,101\t10117\t140\t119\t142\t119\t152\t119\t1010...,,373C0F43762B7CEC1D75728BE8A33891,,A2CE3A1941BA410A1C31496C355EFCD7,E14AF8A8D257BB47587843FE7D08382B,TopLevel,D3164C7FBCF2565DDF915B1B3AEFB1DC,1582126349,2A8B6AD2B9D55F535C2441AB673133D2,...,00000865A1538142CDA5936B07FE4311,65,166,False,1452599043,False,,,,
1,101\t10105\t10817\t10124\t59232\t18121\t15629\...,,773A92D9E4824D06105C02BD044BB20A,,,,Quote,D3164C7FBCF2565DDF915B1B3AEFB1DC,1581971193,950A95B81407F33C412E520BE55A1450,...,000009A057792FF118B9E3F2578B8407,1814,1314,False,1322868747,True,1581979000.0,,,
2,101\t48561\t10116\t67737\t18554\t36371\t10989\...,,218A6C27871801759F7380D7C41694A6,,5C683B5A29B308CADD0D7EFA7C9C32D3,6717B03E03DEE1D7ACAE37649ACA7BD6,TopLevel,9BF3403E0EB7EA8A256DA9019C0B0716,1582047119,ABB2F7F22C34057BC7B30D627B0C137A,...,00000DEF82BE9EB5CFD07FB7DB94317B,4,73,False,1573996260,False,,,,
3,101\t100055\t69940\t10414\t159\t11305\t11166\t...,,AB817EBA68064A0C8CBF4A6C059D92DC,Photo,E925556EE312213AD98C4D9F131D7A8D,D722330FEBEAAE68B4F4339CE8BD7C70,TopLevel,691890251F2B9FF922BE6D3699ABEFD2,1581554925,03F96C3B7CE2179B6347AA395880C963,...,0000109A57AFA64758EE4AAE2A01BFC7,15,124,False,1385502405,True,,,,
4,101\t62154\t32221\t71843\t10143\t10237\t15507\...,,349120C1E2801857530393F16D4653A5,,,,TopLevel,9BF3403E0EB7EA8A256DA9019C0B0716,1581568955,E035DCB47CB3DF98C5CD7CFEEC3BC704,...,000012366528B5FEE179A9606DBC9826,1226,655,False,1268639592,True,1581570000.0,,,


***

## 1. 유저 버켓 만들기
- 모델이 데이터셋에 몇번밖에 안나타나는 유저에대한 임베딩을 만들수가 없기 때문에, 모든 유저들에 대한 임베딩을 만들 수 없음
- 따라서, 유저 버켓을 만들어서 관련있는 유저들끼리 학습시킴
- 만약 유저가 threashold 이상 데이터셋에서 나타난다면, 한 유저는 한 버켓에 할당됨 (즉, 유저가 곧 버켓)
- threashold 보다 적게 나타난 유저들은 비슷한 유저들끼리 클러스터링함

### 많이 나타나는 유저들에 대한 버켓 생성
- threashold = 15
- threashold 이상 나타면 유저당 버켓 1개 생성

In [548]:
engaging_user_id_id = pd.DataFrame()
engaging_user_id_id['user_id'] = (df['engaging_user_id'].tolist())

engaged_with_user_id_id = pd.DataFrame()
engaged_with_user_id_id['user_id'] = (df['engaged_with_user_id'].tolist())

total_users = engaging_user_id_id.merge(engaged_with_user_id_id, how='outer', on='user_id')

data = pd.DataFrame()
data['count'] = total_users.groupby(['user_id'])['user_id'].apply(lambda x: len(list(x)))
# data = data.reset_index()

bucketizer_users = data[data['count'] >= 15]
bucketizer_users = bucketizer_users.sort_values(by='count', ascending=False)
bucketizer_users = bucketizer_users.reset_index()
bucketizer_users['bucket'] = bucketizer_users.index
bucketizer_users = bucketizer_users.drop(['count'], axis=1)

In [549]:
bucketizer_users

Unnamed: 0,user_id,count,bucket
0,C6758D692A850E4C67B2763B66D1CFA8,325,0
1,5FF622786FB4924A067BD44D4B717570,155,1
2,E5D1B83B0E02FAFF871EEEF276D18132,120,2
3,D6B8AAD592D596C1A2DBA7C036A7F404,100,3
4,7C03844E8B2E0C7B4346D41028AB14E2,94,4
...,...,...,...
271,23AD1EFF98AF6A3BE04A3D3EC6CBBD64,15,271
272,AFF39496CD7B7AEF4415FC3EDCC1C6AA,15,272
273,A7A0D9EE9222FD46BD568CA6B005FD80,15,273
274,D0A4D4F6F7741E1F2B4351B62CC404D2,15,274




### Kmeans_process
- threshold 보다 적은 유저들에 대해 60개의 클러스터 생성
- 60개의 클러스터에 유저들을 할당한다.
- 한 클러스터에 450개의 유저보다 적으면 하나의 클러스터로 합친다.
- 마지막으로 클러스터를 400개의 버켓에 할당한다.

In [760]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=60, algorithm='auto')

engaged_with_df = pd.DataFrame()
engaged_with_df['user_id'] = df['engaging_user_id'].tolist()
engaged_with_df['follower_count'] = df['engaged_with_user_follower_count'].tolist()
engaged_with_df['following_count'] = df['engaged_with_user_following_count'].tolist()
engaged_with_df['is_verified'] = df['engaged_with_user_is_verified'].tolist()
engaged_with_df['account_creation'] = df['engaged_with_user_account_creation'].tolist()
engaging_df['count'] = df['engaging_user_is_verified'].tolist()

engaging_df = pd.DataFrame()
engaging_df['user_id'] = df['engaging_user_id'].tolist()
engaging_df['follower_count'] = df['engaging_user_follower_count'].tolist()
engaging_df['following_count'] = df['engaging_user_following_count'].tolist()
engaging_df['is_verified'] = df['engaging_user_is_verified'].tolist()
engaging_df['account_creation'] = df['engaging_user_is_verified'].tolist()
engaging_df['count'] = df['engaging_user_is_verified'].tolist()

user_df = engaged_with_df.merge(engaging_df, how='outer')



In [791]:
follower_count = user_df.groupby(['user_id'])['follower_count'].apply(lambda x: max(list(x))).reset_index()
following_count = user_df.groupby(['user_id'])['following_count'].apply(lambda x: max(list(x))).reset_index()
is_verified = user_df.groupby(['user_id'])['is_verified'].apply(lambda x: max(list(x))).reset_index()
account_creation = user_df.groupby(['user_id'])['account_creation'].apply(lambda x: max(list(x))).reset_index()
count = user_df.groupby(['user_id'])['count'].apply(lambda x: len(list(x))).reset_index()


In [907]:
sample_data = user_df['user_id']
sample_data = sample_data.reset_index()
sample_data = sample_data.drop('index',axis=1)
sample_data = sample_data.drop_duplicates()

user_kmeans = pd.merge(sample_data,follower_count, how= 'left')
user_kmeans = pd.merge(user_kmeans,following_count, how= 'left')
user_kmeans = pd.merge(user_kmeans,is_verified, how= 'left')
user_kmeans = pd.merge(user_kmeans,account_creation, how= 'left')
user_kmeans = pd.merge(user_kmeans,count, how= 'left')

user_kmeans = user_kmeans[user_kmeans['count'] < 15] # threshold 보다 작은거만
user_kmeans = user_kmeans.drop(['count'], axis=1) # count column 지우기

kmeans = KMeans(n_clusters=60).fit(user_kmeans[['follower_count', 'following_count']].values) # k means clustering
user_kmeans['cluster'] = kmeans.labels_ # 클러스터 할당

In [902]:
user_kmeans

Unnamed: 0,user_id,follower_count,following_count,is_verified,account_creation,cluster
0,00000865A1538142CDA5936B07FE4311,8791989,367,True,1.208634e+09,22
1,000009A057792FF118B9E3F2578B8407,1814,1314,False,1.465095e+09,0
2,00000DEF82BE9EB5CFD07FB7DB94317B,15648678,73,True,1.249873e+09,44
3,0000109A57AFA64758EE4AAE2A01BFC7,490,586,False,1.496413e+09,0
4,000012366528B5FEE179A9606DBC9826,1226,2511,False,1.419231e+09,0
...,...,...,...,...,...,...
71861,0549F61591B5023624554E23C7EFAFB1,269137,5557,True,1.469767e+09,56
71862,0549F68148DBEB08F8C8E20C89E87CD8,3583,1901,False,1.549580e+09,0
71863,054A02B6549B1452A167A0A2995C2F22,15898,700,False,1.486411e+09,0
71864,054A071AEF89961579F44313CC6E7E5A,2005855,8821,True,1.216046e+09,48


### 클러스터를 유저 버켓에 매핑
- 한 클러스터에 450개의 유저보다 적으면 하나의 클러스터로 합친다.
- 마지막으로 클러스터를 400개의 버켓에 할당한다.

In [908]:
cluster = user_kmeans.groupby(['cluster'])['user_id'].apply(lambda x: len(list(x)))

In [909]:
# 450보다 작은 클러스터 합치기

cluster_transform = cluster[cluster<450]

key = 0
values = 0
for i, x in cluster_transform.items():
    key = i
    if values < 450:
        values += x
        cluster_transform[i] = 0
    else:
        cluster_transform[i] = values
        values = 0

cluster_transform[i] = values

cluster_sorted_list = cluster_transform.items()
cluster_sorted_list = sorted(cluster_sorted_list, reverse=True)
idx = cluster_sorted_list[0][0]
for i, x in cluster_sorted_list:
    if x > 0:
        idx = i
    cluster_transform[i] = idx
    
print(cluster_transform)

cluster
1      7
2      7
4      7
5      7
6      7
7      7
8     11
9     11
10    11
11    11
12    17
14    17
16    17
17    17
18    22
19    22
20    22
21    22
22    22
23    28
26    28
27    28
28    28
29    33
30    33
31    33
32    33
33    33
34    44
35    44
36    44
40    44
41    44
44    44
45    54
46    54
47    54
48    54
49    54
51    54
54    54
55    58
56    58
57    58
58    58
Name: user_id, dtype: int64


In [916]:
# 450 보다 큰 클러스터는 그대로 하기
cluster_origin = cluster[cluster>=450]
for i, x in cluster_origin.items():
    cluster_origin[i] = i


In [924]:
mapped_cluster = pd.concat([cluster_origin, cluster_transform])

In [927]:
user_kmeans["cluster"] = user_kmeans["cluster"].map(mapped_cluster) # 클러스터 매핑


In [928]:
user_kmeans

Unnamed: 0,user_id,follower_count,following_count,is_verified,account_creation,cluster
0,00000865A1538142CDA5936B07FE4311,8791989,367,True,1.208634e+09,33
1,000009A057792FF118B9E3F2578B8407,1814,1314,False,1.465095e+09,39
2,00000DEF82BE9EB5CFD07FB7DB94317B,15648678,73,True,1.249873e+09,54
3,0000109A57AFA64758EE4AAE2A01BFC7,490,586,False,1.496413e+09,39
4,000012366528B5FEE179A9606DBC9826,1226,2511,False,1.419231e+09,39
...,...,...,...,...,...,...
71861,0549F61591B5023624554E23C7EFAFB1,269137,5557,True,1.469767e+09,43
71862,0549F68148DBEB08F8C8E20C89E87CD8,3583,1901,False,1.549580e+09,39
71863,054A02B6549B1452A167A0A2995C2F22,15898,700,False,1.486411e+09,39
71864,054A071AEF89961579F44313CC6E7E5A,2005855,8821,True,1.216046e+09,37


In [940]:
max(bucketizer_users.bucket)

275

In [949]:
# 버켓 할당
bucketizer_cluster = pd.DataFrame()
bucketizer_cluster['user_id'] = user_kmeans['user_id']
bucketizer_cluster['bucket'] = user_kmeans['cluster'] + max(bucketizer_users.bucket)

buckets = pd.concat([bucketizer_users, bucketizer_cluster])

In [950]:
buckets

Unnamed: 0,user_id,bucket
0,C6758D692A850E4C67B2763B66D1CFA8,0
1,5FF622786FB4924A067BD44D4B717570,1
2,E5D1B83B0E02FAFF871EEEF276D18132,2
3,D6B8AAD592D596C1A2DBA7C036A7F404,3
4,7C03844E8B2E0C7B4346D41028AB14E2,4
...,...,...
71861,0549F61591B5023624554E23C7EFAFB1,318
71862,0549F68148DBEB08F8C8E20C89E87CD8,314
71863,054A02B6549B1452A167A0A2995C2F22,314
71864,054A071AEF89961579F44313CC6E7E5A,312


In [951]:
buckets.to_csv("buckets.csv")