Kaggle Users Tutorial

This repository demonstrates scraping the user data found on a Kaggle user profile page.

Case

Get a list of usernames and build a profile link for each, e.g. https://www.kaggle.com/lsind18
Check that the user page exists
Parse the user data: username, user ID, bio, last access date, and all statistics about the user's datasets, notebooks, competitions, and discussions
Transform the data into a relational format
Insert the transformed data into the database
If data for this user was already parsed within the last 12 hours, log it and move on to the next user
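
The first two steps could be sketched as follows. This is a minimal illustration, not the code in main.py: the function names are hypothetical, it uses the standard library's urllib instead of requests so that it has no dependencies, and it assumes that an existing profile answers with HTTP 200.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.request import Request, urlopen

KAGGLE_BASE_URL = "https://www.kaggle.com"


def build_profile_url(username: str) -> str:
    """Turn a username into a profile link, e.g. https://www.kaggle.com/lsind18."""
    return f"{KAGGLE_BASE_URL}/{username.strip()}"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except OSError:
        # Covers URLError and timeouts: DNS failure, refused connection, etc.
        return 0


def user_page_exists(
    username: str, get_status: Callable[[str], int] = fetch_status
) -> bool:
    """Assumption: a profile page exists if and only if the status is 200."""
    return get_status(build_profile_url(username)) == 200
```

Passing get_status as a parameter keeps the check testable without network access, for example user_page_exists("lsind18", get_status=lambda url: 200).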

Independent webscraping

Prerequisites

Python >= 3.10
Install the requirements (requests, bs4, sqlalchemy):

pip install -r requirements.txt

Project Structure

 ..independent_parsing  
    ├── project_setup  
    |   ├── logs/                   # Logs will appear here
    |   ├── context_setup.py        # DB connection, SQL tables, mapping to Model  
    │   └── logger_setup.py         # Logger setup  
    |   
    ├── process_data                      
    │   ├── Model.py                # Classes Users, UserStats, Stats  
    │   ├── Repository.py           # Insert Model objects into database   
    │   └── Service.py              # Transform parsed data (dict), create Model objects  
    |
    ├── main.py                     # Get usernames, read data from pages --> dict  
    ├── requirements.txt  
    ├── kaggle_users.db             # Database for local testing 
    └── usernames.txt               # File with list of usernames

Database Structure

The database structure is defined with the SQLAlchemy ORM (context_setup.py), which creates the database or uses an existing one
The classes Users, UserStats, and Stats are defined in Model.py
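
To give an idea of the table layout, here is a minimal sketch that uses the standard library's sqlite3 module rather than SQLAlchemy. The column names are taken from the sample data in this README; the column types, constraints, and foreign keys are assumptions and may differ from what context_setup.py actually declares.

```python
import sqlite3

# Column names follow the sample data; types and keys are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS Users (
    id           INTEGER PRIMARY KEY,
    userId       INTEGER NOT NULL UNIQUE,   -- Kaggle user id
    userName     TEXT NOT NULL,
    userJoinDate TEXT                       -- UTC timestamp
);
CREATE TABLE IF NOT EXISTS UserStats (
    statsid         INTEGER PRIMARY KEY,
    userId          INTEGER NOT NULL REFERENCES Users(id),
    displayName     TEXT,
    country         TEXT,
    city            TEXT,
    region          TEXT,
    bio             TEXT,
    userLastActive  TEXT,
    performanceTier TEXT,
    followers       INTEGER,
    following       INTEGER,
    parsedate       TEXT NOT NULL           -- UTC timestamp of the parse
);
CREATE TABLE IF NOT EXISTS Stats (
    id                INTEGER PRIMARY KEY,
    userstatsId       INTEGER NOT NULL REFERENCES UserStats(statsid),
    statsType         TEXT NOT NULL,
    tier              TEXT,
    totalResults      INTEGER,
    rankPercentage    REAL,
    rankOutOf         INTEGER,
    rankCurrent       INTEGER,
    rankHighest       INTEGER,
    totalSilverMedals INTEGER,
    totalBronzeMedals INTEGER
);
"""


def create_schema(connection: sqlite3.Connection) -> None:
    """Create the three tables if they do not exist yet."""
    connection.executescript(SCHEMA)


# Example: build the schema in memory and insert the sample Users row.
connection = sqlite3.connect(":memory:")
create_schema(connection)
connection.execute(
    "INSERT INTO Users (id, userId, userName, userJoinDate) VALUES (?, ?, ?, ?)",
    (1, 2913628, "lsind18", "2019-03-08 18:01:26.933000"),
)
connection.commit()
```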

Sample data

Data that has already been parsed, transformed, and inserted into the database.

Table Users

id	userId	userName	userJoinDate
1	2913628	lsind18	2019-03-08 18:01:26.933000
..	.......	........	........

Table UserStats

statsid	userId	displayName	country	city	region	bio	userLastActive	performanceTier	followers	following	parsedate
1	1	Daria Chemkaeva	Russia	Saint Petersburg		Чемкаева Дарья...	2023-05-31 17:25:00.747000	EXPERT	53	19	2023-06-01 12:16:08.339119
..	.....		........	........

Table Stats

id	userstatsId	statsType	tier	totalResults	rankPercentage	rankOutOf	rankCurrent	rankHighest	totalSilverMedals	totalBronzeMedals
1	1	competitionsSummary	CONTRIBUTOR	2	0.9209785	3742
2	1	scriptsSummary	EXPERT	9	0.014632434	279106	4084	773	1	6
3	1	datasetsSummary	EXPERT	16	0.009669507	91318	883	95	1	10
4	1	discussionsSummary	EXPERT	83	0.008640102	345945	2989	2125		50
5	2	..	..	..
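
The step from one parsed dictionary to rows for these three tables might look like the sketch below. The function name and the shape of the input dictionary (flat user fields plus one nested dictionary per summary type) are assumptions made for illustration; Service.py may organize this differently.

```python
from datetime import datetime, timezone

USER_FIELDS = ("userId", "userName", "userJoinDate")
USER_STATS_FIELDS = (
    "displayName", "country", "city", "region", "bio",
    "userLastActive", "performanceTier", "followers", "following",
)
SUMMARY_TYPES = (
    "competitionsSummary", "scriptsSummary",
    "datasetsSummary", "discussionsSummary",
)
STATS_FIELDS = (
    "tier", "totalResults", "rankPercentage", "rankOutOf", "rankCurrent",
    "rankHighest", "totalSilverMedals", "totalBronzeMedals",
)


def transform(parsed: dict, parsed_at: datetime) -> tuple[dict, dict, list[dict]]:
    """Split one parsed user dict into Users, UserStats, and Stats rows."""
    user_row = {field: parsed.get(field) for field in USER_FIELDS}

    user_stats_row = {field: parsed.get(field) for field in USER_STATS_FIELDS}
    user_stats_row["parsedate"] = parsed_at

    stats_rows = []
    for stats_type in SUMMARY_TYPES:
        summary = parsed.get(stats_type)
        if summary is None:
            continue  # this user has no data for that summary type
        row = {"statsType": stats_type}
        # Missing values (e.g. no current rank) become None, i.e. NULL.
        row.update({field: summary.get(field) for field in STATS_FIELDS})
        stats_rows.append(row)

    return user_row, user_stats_row, stats_rows


# Example input built from the sample data above (hypothetical dict shape).
parsed_user = {
    "userId": 2913628,
    "userName": "lsind18",
    "performanceTier": "EXPERT",
    "competitionsSummary": {
        "tier": "CONTRIBUTOR",
        "totalResults": 2,
        "rankPercentage": 0.9209785,
        "rankOutOf": 3742,
    },
}
user, user_stats, stats = transform(parsed_user, datetime.now(timezone.utc))
```

Fields that the page does not provide are stored as None, which matches the empty cells in the sample Stats rows.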

Project Logic

Use

put the usernames you are interested in into the file usernames.txt
start main.py (the entry point)
user statistics are inserted into the database only if more than 12 hours have passed since the last parse
all data goes to a SQLite database. You can change the connection or use another RDBMS in project_setup/context_setup.py. Don't forget to install the matching database driver
datetime columns are in UTC
log files are created in the project_setup/logs folder
hand these scripts to any scheduler (cron, Task Scheduler, etc.)
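
The username file and the 12-hour rule can be sketched like this. It is a minimal illustration under two assumptions: usernames.txt holds one username per line, and timestamps without timezone information are already in UTC, as the datetime columns are. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

PARSE_INTERVAL = timedelta(hours=12)


def read_usernames(path: str) -> list[str]:
    """Read usernames from a text file, one per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def _as_utc(moment: datetime) -> datetime:
    """Treat naive datetimes as UTC so that they can be compared safely."""
    if moment.tzinfo is None:
        return moment.replace(tzinfo=timezone.utc)
    return moment


def should_parse(last_parsed: Optional[datetime],
                 now: Optional[datetime] = None) -> bool:
    """Return True if the user was never parsed or more than 12 hours passed."""
    if last_parsed is None:
        return True
    current = _as_utc(now) if now is not None else datetime.now(timezone.utc)
    return current - _as_utc(last_parsed) > PARSE_INTERVAL
```

For example, with the last parse at 2023-06-01 12:16 UTC, should_parse returns False at 2023-06-01 20:00 UTC and True at 2023-06-02 01:00 UTC.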

Use apache-airflow

This part of the project is still being updated.

Prerequisites

Python >= 3.10
create and activate a virtual environment

python3 -m venv virt_env
source virt_env/bin/activate

install requirements

pip install -r requirements.txt

set up airflow

export AIRFLOW_HOME="path to current directory"
airflow db init
airflow users create \
    --username <username> \
    --firstname <firstname> \
    --lastname <lastname> \
    --role Admin \
    --email <email>

if airflow database was not created properly, recreate it

airflow db reset

check dag

airflow dags list

run the scheduler and the webserver, each in its own terminal

airflow scheduler
airflow webserver

log in at the webserver's local URL with the user you created

Project Structure

 ..airflow_parsing  
    ├── dags  
    |   ├── 
    │   └── 
    |   
    ├── process_data                      
    │   ├── 
    │   ├── 
    │   └──   
    |
    ├── logs/                       # Logs will appear here  
    |
    ├── airflow.cfg                 # Config airflow, dags folder, database, etc  
    ├── webserver_config.cfg        # Webserver config if necessary 
    ├── airflow.db                  # Airflow Db created  
    ├── requirements.txt  
    ├──                             # Database for local testing 
    └── usernames.txt               # File with list of usernames
