Skip to content

LSIND/Kaggle-Users-Get-Info-Tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Kaggle Users Tutorial

This repository represents scraping user data found on page.

Case

  • Get list of username(s), create link(s), f.e. https://www.kaggle.com/lsind18
  • Check presence of user page
  • Parse user data, including username, userid, bio, last access date and all statistics about user datasets, notebooks, competitions, discussions
  • Transform data in relational format
  • Insert tranformed data into database
  • If data about this user was already parsed within last 12 hours, log it and parse next user

Independent webscraping

Prerequisites

  • >=Python3.10
  • install requirements (requests, bs4, sqlalchemy)
pip install -r requirements.txt

Project Structure

 ..independent_parsing  
    ├── project_setup  
    |   ├── logs/                   # Logs will appear here
    |   ├── context_setup.py        # DB connection, SQL tables, mapping to Model  
    │   └── logger_setup.py         # Logger setup  
    |   
    ├── process_data                      
    │   ├── Model.py                # Classes Users, UserStats, Stats  
    │   ├── Repository.py           # Insert Model objects into database   
    │   └── Service.py              # Tranform parsed data (dict), create Model objects  
    |
    ├── main.py                     # Get usernames, read data from pages --> dict  
    ├── requirements.txt  
    ├── kaggle_users.db             # Database for local testing 
    └── usernames.txt               # File with list of usernames

Database Structure

  • database structure with ORM SqlAlchemy (context_setup.py): create db or use existing
  • classes Users, UserStats, Stats (Model.py)

ER diagram

Sample data

Data which was already parsed, tranformed and inserted into dataabse.

  • Table Users
id userId userName userJoinDate
1 2913628 lsind18 2019-03-08 18:01:26.933000
.. ....... ........ ........
  • Table UserStats
statsid userId displayName country city region bio userLastActive performanceTier followers following parsedate
1 1 Daria Chemkaeva Russia Saint Petersburg Чемкаева Дарья... 2023-05-31 17:25:00.747000 EXPERT 53 19 2023-06-01 12:16:08.339119
.. ..... ........ ........
  • Table Stats
id userstatsId statsType tier totalResults rankPercentage rankOutOf rankCurrent rankHighest totalGoldMedals totalSilverMedals totalBronzeMedals
1 1 competitionsSummary CONTRIBUTOR 2 0.9209785 3742
2 1 scriptsSummary EXPERT 9 0.014632434 279106 4084 773 1 6
3 1 datasetsSummary EXPERT 16 0.009669507 91318 883 95 1 10
4 1 discussionsSummary EXPERT 83 0.008640102 345945 2989 2125 50
5 2 .. .. ..

Project Logic

Project Graph

Use

  • put usernames you are interested in into file usernames.txt
  • start main.py (entry point)
  • user statistics data will only be inserted into database if it have passed more than 12 hours since last parse
  • all data goes to sqlite database. You can change connection or use another RDBMS in project_setup/context_setup.py. Don't forget to install database provider
  • datetime columns are in UTC
  • log files created in project_setup/logs folder
  • give these scripts to any scheduler (cron, Task Scheduler, etc)

Use apache-airflow

PROJECT ON UPDATE

Prerequisites

  • >=Python3.10
  • virtual environment
python3 -m venv virt_env
source virt_env/bin/activate
  • install requirements
pip install -r requirements.txt
  • set up airflow
export AIRFLOW_HOME="path to current directory"
airflow db init
airflow users create
    --username
    --firstname
    --lastname
    --role Admin
    --email
  • if airflow database was not created properly, recreate it
airflow db reset
  • check dag
airflow dags list
  • run webserver, run scheduler
airflow scheduler
airflow webserver
  • log into webserver local url with created user

Project Structure

 ..airflow_parsing  
    ├── dags  
    |   ├── 
    │   └── 
    |   
    ├── process_data                      
    │   ├── 
    │   ├── 
    │   └──   
    |
    ├── logs/                       # Logs will appear here  
    |
    ├── airflow.cfg                 # Config airflow, dags folder, database, etc  
    ├── webserver_config.cfg        # Webserver config if necessary 
    ├── airflow.db                  # Airflow Db created  
    ├── requirements.txt  
    ├──                             # Database for local testing 
    └── usernames.txt               # File with list of usernames

Dag Structure

About

Scraping Kaggle Users pages, transform data and put into database

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages