# Kaggle dataset

>[Kaggle](https://www.kaggle.com/) is a data science competition platform and online community of data scientists and machine learning practitioners under `Google LLC`. `Kaggle` enables users to find and publish datasets, explore and build models in a web-based data science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.


This notebook shows how to load [`Kaggle` datasets](https://www.kaggle.com/datasets) into LangChain.

## Setting up

Follow these steps to use this loader:
- [Register a Kaggle account and create an API token](https://www.kaggle.com/settings)
- Install the `kaggle` and `pandas` Python packages with `pip install kaggle pandas`
- Use `kaggle datasets list` to list all available datasets
- Use `kaggle datasets download <dataset_name>` to download the dataset
- Use `unzip <dataset_zipfile_name>` to extract all files in the dataset
- Open the dataset CSV file and choose the column name for page content
- Use the dataset CSV file name and the column name to
  initialize the `KaggleDatasetLoader`

Note: Other columns in the dataset CSV file will be treated as metadata.
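
The note above can be pictured with a small sketch. This is not the loader's actual implementation; it only illustrates, using Python's standard `csv` module and a hypothetical helper named `split_row`, how one column could become the page content while the remaining columns become metadata. The sample column names (`text`, `label`) and values are made up for the illustration.

```python
import csv
import io


def split_row(row: dict, page_content_column: str) -> tuple[str, dict]:
    """Hypothetical helper: separate page content from metadata for one CSV row."""
    page_content = row[page_content_column]
    # Every column other than the chosen one is kept as metadata.
    metadata = {k: v for k, v in row.items() if k != page_content_column}
    return page_content, metadata


# Made-up sample data standing in for a dataset CSV file.
sample_csv = io.StringIO("text,label\nhello world,0\nsecond essay,1\n")
rows = [split_row(row, "text") for row in csv.DictReader(sample_csv)]
```

Note that the `csv` module reads every value as a string, so a numeric column such as `label` would appear as `"0"` rather than `0` in this sketch.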


In [None]:
#!pip install kaggle pandas

In [2]:
!kaggle datasets list

ref                                                             title                                             size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
carlmcbrideellis/llm-7-prompt-training-dataset                  LLM: 7 prompt training dataset                    41MB  2023-11-15 07:32:56            926         82  1.0              
thedrcat/daigt-v2-train-dataset                                 DAIGT V2 Train Dataset                            29MB  2023-11-16 01:38:36            438         70  1.0              
thedrcat/daigt-proper-train-dataset                             DAIGT Proper Train Dataset                       119MB  2023-11-05 14:03:25           1000        112  1.0              
joebeachcapital/30000-spotify-songs                             30000 Spoti

## Example

Here we download one of the datasets used for prompt training.

In [3]:
!kaggle datasets download carlmcbrideellis/llm-7-prompt-training-dataset

Downloading llm-7-prompt-training-dataset.zip to /home/leo/PycharmProjects/GLD/langchain/docs/docs/integrations/document_loaders
100%|██████████████████████████████████████| 41.4M/41.4M [00:18<00:00, 2.43MB/s]
100%|██████████████████████████████████████| 41.4M/41.4M [00:18<00:00, 2.40MB/s]


In [4]:
!unzip llm-7-prompt-training-dataset.zip

Archive:  llm-7-prompt-training-dataset.zip
  inflating: train_essays_7_prompts.csv  
  inflating: train_essays_7_prompts_v2.csv  
  inflating: train_essays_RDizzl3_seven_v1.csv  
  inflating: train_essays_RDizzl3_seven_v2.csv  
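
If the `unzip` command is not available (for example on some Windows setups), the same extraction step can probably be done from Python with the standard `zipfile` module. The helper name `extract_dataset` below is hypothetical and exists only for this sketch.

```python
import zipfile


def extract_dataset(zip_path: str, target_dir: str = ".") -> list[str]:
    """Hypothetical helper: extract all files and return the archive member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

For the archive downloaded above, the call would be `extract_dataset("llm-7-prompt-training-dataset.zip")`.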


In [None]:
from langchain.document_loaders import KaggleDatasetLoader

`KaggleDatasetLoader` has these arguments:
- **dataset_path**: Path to the dataset CSV file.
- **page_content_column**: Column name of the page content.
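
Before choosing a value for `page_content_column`, it can help to look at the CSV header. The following is a minimal sketch using only the standard library; `list_columns` is a hypothetical helper, and the UTF-8 encoding is an assumption about the file.

```python
import csv


def list_columns(csv_path: str) -> list[str]:
    """Hypothetical helper: return the header row (column names) of a CSV file."""
    # Assumes the file is UTF-8 encoded and that its first row is the header.
    with open(csv_path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))
```

Calling `list_columns("train_essays_7_prompts.csv")` on the extracted file would list the available columns; this notebook uses the `text` column as page content.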

In [9]:
loader = KaggleDatasetLoader(
    dataset_path="train_essays_7_prompts.csv", page_content_column="text"
)
docs = loader.load()
len(docs)

14877

In [10]:
docs[0]

Document(page_content='Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then. But now, people are starting to question if limiting car usage would be a good thing. To me, limiting the use of cars might be a good thing to do.\n\nIn like matter of this, article, "In German Suburb, Life Goes On Without Cars," by Elizabeth Rosenthal states, how automobiles are the linchpin of suburbs, where middle class families from either Shanghai or Chicago tend to make their homes. Experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe. Passenger cars are responsible for 12 percent of greenhouse gas emissions in Europe...and up to 50 percent in some carintensive areas in the United States. Cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where th