Data mining project

Repo for a data mining project carried out at ITU. Contains a pipeline of scripts for scraping and cleaning website metadata and for fetching Alexa website statistics. The cleaned data can readily be used with RapidMiner.
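
As a rough illustration of what a metadata extraction step in such a pipeline might look like, here is a minimal sketch using only the Python standard library. It is not taken from the repository's scripts: the class `MetaExtractor`, the function `extract_attributes`, and the `meta_` key prefix are hypothetical names chosen for this example, and the project's real scripts may extract different attributes in a different way.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Hypothetical helper: collects the <title> text and <meta name/content> pairs."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # Normalise the key so "Description" and "description" collapse together.
            key = "meta_" + attrs["name"].lower()
            self.attributes[key] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so append rather than overwrite.
            self.attributes["title"] = self.attributes.get("title", "") + data.strip()


def extract_attributes(html):
    """Return a flat dict of attributes found in one HTML document (illustrative only)."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.attributes
```

A flat dictionary per site is convenient here because a list of such dictionaries can be written out as one row per website, which is the tabular shape that tools like RapidMiner expect.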

Abstract

We present a small system that collects data from 2425 websites and extracts a total of 44 attributes from each site. We show how a substantial amount of general statistics can be derived, and that we are able to find meaningful clusters in the data. Furthermore, we provide prediction results which suggest that the PageRank of a website is unlikely to be predictable from its intrinsic data alone. Being able to find patterns and statistics in the diverse landscape of the web is of interest to online businesses and to web statistics in general. With future work, the presented data mining process could lead to the display of unique web statistics or the creation of new tools for business intelligence.



AndersHqst/datamining
