# Kaggle Dataset Scraper

This project scrapes the currently hottest datasets on Kaggle. Any additional pre-processing is left to the user's discretion.

Note: The code downloads only small datasets, meaning any dataset whose size is listed in kB or is at most 100 MB. You can modify the condition in the collection loop to allow larger datasets.
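The size rule described in the note can be sketched as a small standalone helper. This is an illustration rather than the notebook's own code: the names `size_in_mb` and `is_small_enough` are hypothetical, and the unit labels other than kB and MB are assumptions about how Kaggle might label other sizes.

```python
import re

# Multipliers relative to one megabyte. Only "kB" and "MB" appear in the
# notebook output; "B", "GB" and "TB" are assumed labels.
_UNIT_TO_MB = {"B": 1e-6, "kB": 1e-3, "MB": 1.0, "GB": 1e3, "TB": 1e6}


def size_in_mb(card_text):
    """Return the size at the end of card_text in megabytes, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(B|kB|MB|GB|TB)\s*$", card_text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * _UNIT_TO_MB[unit]


def is_small_enough(card_text, limit_mb=100):
    """True when the card's size is known and does not exceed limit_mb."""
    size = size_in_mb(card_text)
    return size is not None and size <= limit_mb
```

Raising `limit_mb` (for example `is_small_enough(text, limit_mb=500)`) is then the only change needed to allow larger datasets.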

## URL to web scrape

In [1]:
url = "https://www.kaggle.com/datasets?topic=trendingDataset&sort=hottest"

## Importing necessary modules

In [2]:
import requests
from bs4 import BeautifulSoup
# other essential modules are imported wherever they are required.

### Installing Kaggle API

You can install the Kaggle API by running the command "pip install kaggle" in your terminal.

### Preparing Kaggle API

To get a Kaggle API token, go to the Accounts section of your profile and click on the "create new token" button. This generates and downloads a JSON file (kaggle.json) containing your unique Kaggle token.

In [3]:
import json
import os

# retrieve your token details from kaggle.json file
with open('kaggle.json', 'r') as file:
    kaggle_credentials = json.load(file)

# preparing environment variables for API's usage
os.environ['KAGGLE_USERNAME'] = kaggle_credentials["username"]
os.environ['KAGGLE_KEY'] = kaggle_credentials["key"]
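
If kaggle.json is missing a field, the cell above fails with a bare `KeyError`. A slightly more defensive variant might look like the sketch below. The helper name `load_kaggle_credentials` is hypothetical, and it assumes the token file contains exactly the "username" and "key" fields used above.

```python
import json
import os


def load_kaggle_credentials(path="kaggle.json"):
    """Read a Kaggle token file and export it as environment variables.

    Returns the username so the caller can confirm which account is in use.
    """
    with open(path, "r", encoding="utf-8") as file:
        credentials = json.load(file)

    # Fail early with a readable message if either field is missing or empty.
    missing = [name for name in ("username", "key") if not credentials.get(name)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")

    os.environ["KAGGLE_USERNAME"] = credentials["username"]
    os.environ["KAGGLE_KEY"] = credentials["key"]
    return credentials["username"]
```

Call it once before authenticating, for example `load_kaggle_credentials("kaggle.json")`.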

In [4]:
from kaggle.api.kaggle_api_extended import KaggleApi

# authenticating api
api = KaggleApi()
api.authenticate()

## Using Selenium for Browser Simulation

Kaggle's dataset listing page is rendered with JavaScript, so the dataset cards only exist once a browser has run the page's scripts. We therefore use Selenium to load the page in a real browser and read the rendered HTML.

### Listing what datasets we need

In [5]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize Chrome WebDriver
driver = webdriver.Chrome()

# Load the Kaggle page
driver.get(url)

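# Wait (up to 10 s, an arbitrary choice) for the JavaScript-rendered dataset cards to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.MuiListItem-root"))
)

# Parse the rendered HTML; naming the parser avoids BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(driver.page_source, "html.parser")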

# Close the browser window
driver.quit()

## Collecting the datasets

In [6]:
dataset_cards = soup.find_all('li', class_ = 'MuiListItem-root MuiListItem-gutters MuiListItem-divider sc-drMgrp dllDGS css-iicyhe')
top_list = list()
top_list_link = list()
print("Most Trending Kaggle Datasets:\n")
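for i in dataset_cards:
    # x: dataset title (the aria-label without its last two words)
    x = ' '.join(i['aria-label'].split(' ')[0:-2])
    # y: "owner/dataset" slug, obtained by dropping the 10-character "/datasets/" prefix
    y = i.a['href'][10:]
    # z: summary line, e.g. "Usability 10.0 · 1 File (CSV) · 71 kB"
    z = i.find_all('span', class_ = 'sc-fLseNd sc-crHHJw gArHzz bUdQDa')[-1].text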
    print(x)
    print()
    print(y)
    print()
    print(z)
    print()
    print('------------------------------------------------------------------------------------------')
    print()
    # Only small datasets are queued for download. Modify the condition below to change that.
    temp = z.split(' ')[-2:]
    # temp = [size, unit], e.g. ['71', 'kB'] or ['2', 'MB']
    # float() instead of int() so that a fractional size does not raise ValueError
    if temp[1] == 'kB' or (float(temp[0]) <= 100 and temp[1] == "MB"):
        top_list.append(x)
        top_list_link.append(y)

Most Trending Kaggle Datasets:

📊 Predict Liver Disease: 1700 Records Dataset

rabieelkharoua/predict-liver-disease-1700-records-dataset

Usability 10.0 · 1 File (CSV) · 71 kB

------------------------------------------------------------------------------------------

Iran Phone Ads

arianghasemi/iran-phone-ads

Usability 9.4 · 1 File (CSV) · 2 MB

------------------------------------------------------------------------------------------

Exploring E-commerce Trends⭐️⭐️⭐️

muhammadroshaanriaz/e-commerce-trends-a-guide-to-leveraging-dataset

Usability 10.0 · 1 File (CSV) · 22 kB

------------------------------------------------------------------------------------------

india economics data

shreyaskeote23/india-economics-data

Usability 10.0 · 10 Files (CSV) · 9 kB

------------------------------------------------------------------------------------------

India 2024 Election Dataset - All Candidates

rubenmukherjee/india-2024-election-dataset-all-candidates

Usability 7.6 · 3 Files (C
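
The loop above obtains each dataset slug with `href[10:]`, which silently produces a wrong value if a link ever has a different prefix. A more explicit alternative is sketched below. The helper name `slug_from_href` is hypothetical, and it assumes dataset links have the form `/datasets/<owner>/<dataset>`, as the output above suggests.

```python
from urllib.parse import urlparse


def slug_from_href(href):
    """Return 'owner/dataset' from a dataset link, or None if it does not match."""
    # Works for both relative links and full URLs; empty path segments are dropped.
    parts = [part for part in urlparse(href).path.split("/") if part]
    if len(parts) >= 3 and parts[0] == "datasets":
        return f"{parts[1]}/{parts[2]}"
    return None
```

Cards whose link yields `None` can then be skipped instead of being passed to the download step.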

## Downloading Datasets

In [7]:
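# Run the code below to download every dataset collected above
for i in top_list_link:
    # The second argument is the destination folder (not a file name); files are unzipped into it
    api.dataset_download_files(i, i.split('/')[1], unzip=True)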

Dataset URL: https://www.kaggle.com/datasets/rabieelkharoua/predict-liver-disease-1700-records-dataset
Dataset URL: https://www.kaggle.com/datasets/arianghasemi/iran-phone-ads
Dataset URL: https://www.kaggle.com/datasets/muhammadroshaanriaz/e-commerce-trends-a-guide-to-leveraging-dataset
Dataset URL: https://www.kaggle.com/datasets/shreyaskeote23/india-economics-data
Dataset URL: https://www.kaggle.com/datasets/rubenmukherjee/india-2024-election-dataset-all-candidates
Dataset URL: https://www.kaggle.com/datasets/shashankshekhar1205/wine-quality-dataset
Dataset URL: https://www.kaggle.com/datasets/mayankanand2701/tesla-stock-price-dataset
Dataset URL: https://www.kaggle.com/datasets/waleedejaz/predict-students-dropout-and-academic-success
Dataset URL: https://www.kaggle.com/datasets/monisamir/global-salary-analysis
Dataset URL: https://www.kaggle.com/datasets/damirdizdarevic/uefa-euro-2024-players
Dataset URL: https://www.kaggle.com/datasets/joshhaber/us-real-estate-incomepriceregion-ce
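
A single failed download (for example a dataset that was removed after scraping) stops the loop above. The sketch below keeps going and reports what failed. The function `download_all` and the `datasets` base folder are hypothetical; the download call is passed in as an argument so that the helper itself has no dependency on the Kaggle client.

```python
import os


def download_all(slugs, download, base_dir="datasets"):
    """Download each slug into base_dir/<dataset-name>; return the slugs that failed.

    `download` is any callable accepting (slug, path=..., unzip=...),
    such as api.dataset_download_files from the cells above.
    """
    failed = []
    for slug in slugs:
        target = os.path.join(base_dir, slug.split("/")[-1])
        os.makedirs(target, exist_ok=True)
        try:
            download(slug, path=target, unzip=True)
        except Exception as error:  # broad on purpose: one failure should not stop the rest
            print(f"Skipping {slug}: {error}")
            failed.append(slug)
    return failed
```

With the authenticated client from earlier, usage would be `failed = download_all(top_list_link, api.dataset_download_files)`.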