Repo for a data mining project carried out at ITU. It contains a pipeline of scripts for scraping and cleaning website metadata and for fetching Alexa website statistics. The cleaned data can readily be used with RapidMiner.
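To illustrate the kind of scraping and cleaning step such a pipeline performs, here is a minimal sketch using only the Python standard library. It is not taken from the repo's scripts: the names `MetaExtractor`, `extract_attributes`, and `to_csv`, the choice of attributes, and the CSV layout are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Hypothetical helper: collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Cleaning: lower-case the attribute name and trim whitespace in the value.
            self.attributes[attrs["name"].lower()] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.attributes["title"] = self.attributes.get("title", "") + data.strip()


def extract_attributes(html):
    """Return a dict of metadata attributes found in one HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.attributes


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text; missing attributes become empty cells,
    attributes not listed in fieldnames are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `extract_attributes` on each downloaded page and passing the resulting dicts to `to_csv` with a fixed column list yields one row per website, a tabular shape that CSV-based tools can import.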
The project is a small system that collects data from 2425 websites and extracts a total of 44 attributes from each site. We show that a substantial amount of general statistics can be derived and that meaningful clusters can be found in the data. Furthermore, we provide prediction results which suggest that the PageRank of a website is unlikely to be predictable from its intrinsic data alone. Being able to find patterns and statistics in the diverse landscape of the web is of interest to online businesses and to web statistics in general. With future work, the presented data mining process could lead to the display of unique web statistics or to new tools for business intelligence.
Related paper (pdf)