CQU-CSE/DatasetCollection

DatasetCollection

Common datasets used in our research

Recommender systems

Social Recommendation

   
| Data Set | Users | Items | Ratings (Scale) | Density | Context: Users | Context: Links (Type) |
|---|---|---|---|---|---|---|
| Ciao [1] | 7,375 | 105,114 | 284,086 [1, 5] | 0.0365% | 7,375 | 111,781 Trust |
| Epinions [2] | 40,163 | 139,738 | 664,824 [1, 5] | 0.0118% | 49,289 | 487,183 Trust |
| Douban [3] | 2,848 | 39,586 | 894,887 [1, 5] | 0.794% | 2,848 | 35,770 Trust |
| LastFM [8] | 1,892 | 17,632 | 92,834 implicit | 0.27% | 1,892 | 25,434 Trust |
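The Density column is consistent with the usual definition for rating matrices: the number of ratings divided by users × items. The README does not state the formula, so treat it as an assumption. A minimal sketch that recomputes two rows (the function name `density` is illustrative):

```python
def density(ratings: int, users: int, items: int) -> float:
    """Fraction of the user-item matrix that holds a rating (assumed definition)."""
    return ratings / (users * items)


# Figures copied from the Ciao and Douban rows of the table.
ciao = density(284_086, 7_375, 105_114)
douban = density(894_887, 2_848, 39_586)

print(f"Ciao:   {ciao:.4%}")
print(f"Douban: {douban:.3%}")
```

The recomputed values can differ from the table in the last digit, depending on whether the published figure was rounded or truncated.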

Music Recommendation

   
| Data Set | Users | Tracks | Artists | Albums | Record | Context: Tag | Context: User Profile | Context: Artist Profile |
|---|---|---|---|---|---|---|---|---|
| NowPlaying [9] | 1,744 | 16,864 | 2,108 | N/A | 1,117,335 | N/A | N/A | N/A |
| Xiami [10] | 4,271 | 290,312 | 33,316 | 95,003 | 1,301,486 | Yes | N/A | N/A |
| Yahoo Music [source] | 1,800,000 | 136,000 | many | many | 717,000,000 | Yes | N/A | N/A |
| 30 Music [source][11] | 45,167 | 5,023,108 | 595,049 | 217,337 | many | Yes | Yes | N/A |

Paper Recommendation

 
| Data Set | Users | Papers | Feedback | Context: Tag | Context: Content |
|---|---|---|---|---|---|
| CiteULike [12] | 7,947 | 25,975 | 134,860 | 52,946 | full abstract |

Location Recommendation

 
| Data Set | Users | Locations | Feedback | Context: Relation | Context: Time |
|---|---|---|---|---|---|
| Gowalla | 18,737 | 32,510 | 1,278,274 | Yes | Yes |

Product Recommendation

 
| Data Set | Users | Items | Category | Context: Behavior Type | Context: Time |
|---|---|---|---|---|---|
| Taobao (Extraction code: xv8o) [24, 25] | 987,994 | 4,162,024 | 9,439 | 5 | Yes |

Spammer detection

Social Network

| Data Set | Non-spammer | Spammer | Introduction |
|---|---|---|---|
| Twitter [4] | 1,295 | 355 | The first column is the user class (1 for non-spammers, 2 for spammers); the subsequent columns, numbered 1 to 62, are the user characteristics. |
| YouTube [5] | 641 | 31 (promoter), 157 (spammer) | The first column is the user class (1 for promoters, 2 for spammers, 3 for legitimate users); the subsequent columns, numbered 1 to 60, are the user characteristics. |
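Both files put the class in the first column and the user characteristics in the remaining columns. The sketch below splits rows that follow this layout into labels and feature vectors. The delimiter is not documented here, so the code assumes comma- or whitespace-separated values; the function name and the sample rows are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple


def split_class_and_features(
    lines: Iterable[str],
) -> Tuple[List[int], List[List[float]]]:
    """Split rows of the form '<class> <f1> <f2> ...' into labels and features.

    Assumption: fields are separated by commas and/or whitespace.
    """
    labels: List[int] = []
    features: List[List[float]] = []
    for line in lines:
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        labels.append(int(float(fields[0])))
        features.append([float(value) for value in fields[1:]])
    return labels, features


# Made-up rows with three features each; the real files have 62 (Twitter)
# or 60 (YouTube) feature columns.
sample = ["1 0.5 12 3", "2,0.9,40,1"]
labels, features = split_class_and_features(sample)

# In the Twitter file, class 2 marks a spammer.
is_spammer = [label == 2 for label in labels]
```

For the YouTube file the class codes differ (1 promoter, 2 spammer, 3 legitimate), so the final comparison would need to be adapted.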

Shilling Detection

       
| Data Set | Non-spammer | Spammer | Introduction |
|---|---|---|---|
| Amazon [6] | 3,118 | 1,937 | Columns in profiles.txt follow this order: userid itemid rating. In labels.txt, 1 marks a spammer and 0 a non-spammer. |
| Yelp [7] | 52,815 | 80,466 | Columns in yelp.txt follow this order: user_id prod_id rating label date. For labels, -1 marks a spammer and 1 a non-spammer. |

We recommend filtering out users who have fewer than 5 ratings. More information can be found in Google Drive.
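The Yelp column order and the 5-rating threshold can be applied as in the sketch below. It assumes whitespace-separated fields and a date with no embedded spaces, neither of which is confirmed here. The names `Review`, `parse_yelp`, and `drop_sparse_users` are illustrative, and the sample rows are made up.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Review(NamedTuple):
    user_id: str
    prod_id: str
    rating: float
    is_spammer: bool  # True when the Yelp label is -1
    date: str


def parse_yelp(lines: Iterable[str]) -> List[Review]:
    """Parse rows in the order: user_id prod_id rating label date."""
    reviews: List[Review] = []
    for line in lines:
        parts = line.split()  # assumption: whitespace-delimited
        if len(parts) < 5:
            continue  # skip blank or malformed rows
        user_id, prod_id, rating, label, date = parts[:5]
        reviews.append(
            Review(user_id, prod_id, float(rating), int(label) == -1, date)
        )
    return reviews


def drop_sparse_users(reviews: List[Review], min_ratings: int = 5) -> List[Review]:
    """Keep only reviews by users with at least `min_ratings` ratings."""
    counts = Counter(review.user_id for review in reviews)
    return [r for r in reviews if counts[r.user_id] >= min_ratings]


# Made-up sample: u1 has five ratings, u2 has two.
sample = [f"u1 p{i} 4.0 1 2014-01-0{i}" for i in range(1, 6)]
sample += ["u2 p1 1.0 -1 2014-02-01", "u2 p2 5.0 -1 2014-02-02"]

reviews = parse_yelp(sample)
kept = drop_sparse_users(reviews)
```

The Amazon labels use a different convention (1 spammer, 0 non-spammer), so mapping both datasets to a boolean such as `is_spammer` avoids mixing the two encodings.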

Cyberbullying Detection

| Data Set | Year | Annotation Method | # Data | # Cyberbullying | Cyberbullying Ratio |
|---|---|---|---|---|---|
| Formspring [13] | 2010 | Crowdsourcing | 3,915 | 369 | 9.43% |
| MySpace [14] | 2011 | Expert Labeling | 2,088 | 434 | 20.79% |
| Ask.fm [15] | 2014 | | | | |
| Instagram [16] | 2014 | Crowdsourcing | 1,954 | 567 | 29% |
| Vine [17] | 2015 | Crowdsourcing | 971 | 304 | 31.34% |
| BullyingV3.0 [18] | 2015 | Label Algorithm | 7,321 | 2,102 | 28.71% |
| WOW [19] | 2016 | Expert Labeling | 16,975 | 137 | 0.81% |
| LOL [19] | 2016 | Expert Labeling | 17,354 | 207 | 1.19% |
| Twitter [20] | 2017 | Crowdsourcing | 1,303 | 58 | 4.45% |
| Wikipedia [21] | 2017 | Crowdsourcing | 37,611 | 338 | 0.9% |
| Harassment-Corpus [22] | 2018 | Expert Labeling | 24,189 | 3,119 | 12.89% |
| Hate and Abusive Speech [23] | 2018 | Crowdsourcing | 99,799 | 46,009 | 46.1% |
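The Cyberbullying Ratio column appears to be the number of cyberbullying instances divided by the total number of instances. That reading is an assumption, since the table does not define it. A short sketch that recomputes three rows (the function name is illustrative):

```python
def cyberbullying_ratio(positives: int, total: int) -> float:
    """Share of instances labeled as cyberbullying (assumed definition)."""
    return positives / total


# (total instances, cyberbullying instances) copied from the table.
rows = {
    "Formspring": (3_915, 369),
    "MySpace": (2_088, 434),
    "WOW": (16_975, 137),
}

ratios = {
    name: cyberbullying_ratio(positives, total)
    for name, (total, positives) in rows.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%}")
```

The spread between datasets, from under 1% to over 40% in the table, shows how different the class balance can be from one corpus to another.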

References

[1]. Tang, J., Gao, H., Liu, H.: mTrust: Discerning multi-faceted trust in a connected world. In: International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February. pp. 93–102 (2012)

[2]. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the 2007 ACM conference on Recommender systems. pp. 17–24. ACM (2007)

[3]. G. Zhao, X. Qian, and X. Xie, “User-service rating prediction by exploring social users’ rating behaviors,” IEEE Transactions on Multimedia, vol. 18, no. 3, pp. 496–506, 2016.

[4]. Benevenuto, F., Magno, G., Rodrigues, T., & Almeida, V.: Detecting spammers on twitter. In: Collaboration, electronic messaging, anti-abuse and spam conference (CEAS). Vol. 6, No. 2010, p. 12. 2010.

[5]. Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., & Gonçalves, M.: Detecting spammers and content promoters in online video social networks. In: Proceedings of the 32nd ACM SIGIR conference on Research and development in information retrieval. pp. 620-627. ACM (2009)

[6]. Xu, Chang, et al. "Uncovering collusive spammers in Chinese review websites." ACM International Conference on Information & Knowledge Management. ACM, 2013: 979-988.

[7]. Rayana, Shebuti, and L. Akoglu. "Collective Opinion Spam Detection: Bridging Review Networks and Metadata." ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ACM, 2015:985-994.

[8]. Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2011. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM conference on Recommender systems (RecSys 2011). ACM, New York, NY, USA

[9]. Eva Zangerle, Martin Pichl, Wolfgang Gassler, and Günther Specht. 2014. #nowplaying Music Dataset: Extracting Listening Behavior from Twitter. In Proceedings of the First International Workshop on Internet-Scale Multimedia Management (WISMM '14). ACM, New York, NY, USA, 21-26

[10]. Wang, Dongjing, et al. "Learning music embedding with metadata for context aware recommendation." Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016.

[11]. Turrin R, Quadrana M, Condorelli A, et al. 30Music Listening and Playlists Dataset[C]//RecSys Posters. 2015.

[12]. Hao Wang*, Wu-Jun Li, Relational collaborative topic regression for recommender systems. IEEE Transactions on Knowledge and Data Engineering (TKDE), 27(5): 1343-1355, 2015.

[13]. Reynolds K, Kontostathis A, Edwards L. Using machine learning to detect cyberbullying. Machine learning and applications and workshops (ICMLA), 2011 10th International Conference on. IEEE, 2011, 2: 241-244.

[14]. Bayzick J, Kontostathis A, Edwards L. Detecting the presence of cyberbullying using computer software. In 3rd Annual ACM Web Science Conference (WebSci '11). 2011: 1-2.

[15]. Hosseinmardi H, Ghasemianlangroodi A, Han R, et al. Towards understanding cyberbullying behavior in a semi-anonymous social network. Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on. IEEE, 2014: 244-252.

[16]. Hosseinmardi H, Mattson S A, Rafiq R I, et al. Analyzing labeled cyberbullying incidents on the Instagram social network. International Conference on Social Informatics. Springer, Cham, 2015: 49-66.

[17]. Rafiq R I, Hosseinmardi H, Han R, et al. Careful what you share in six seconds: Detecting cyberbullying instances in Vine. Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, 2015: 617-622.

[18]. Sui J. Understanding and fighting bullying with machine learning[D]. The University of Wisconsin-Madison, 2015.

[19]. Bretschneider U, Peters R. Detecting Cyberbullying in Online Communities. ECIS. 2016: ResearchPaper61.

[20]. Chatzakou D, Kourtellis N, Blackburn J, et al. Mean birds: Detecting aggression and bullying on twitter. Proceedings of the 2017 ACM on web science conference. ACM, 2017: 13-22.

[21]. Wulczyn E, Thain N, Dixon L. Ex machina: Personal attacks seen at scale. Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017: 1391-1399.

[22]. Rezvan M, Shekarpour S, Balasuriya L, et al. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research. Proceedings of the 10th ACM Conference on Web Science. ACM, 2018: 33-36.

[23]. Founta A-M, Djouvas C, Chatzakou D, et al. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. Proceedings of the 11th International Conference on Web and Social Media, ICWSM, 2018.

[24]. Han Z, Xiang L, Pengye Z, et al. Learning Tree-based Deep Model for Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[25]. Han Z, Daqing C, Ziru X, et al. Joint Optimization of Tree-based Index and Deep Model for Recommender Systems. arXiv:1902.07565.

