# 1. PyDataset 📚
The first package we are going to look at is PyDataset. It’s easy to use and gives access to over 700 datasets. The package was inspired by the ease of accessing datasets in R and aims to bring that ease to Python. Let’s install it and check out the list of datasets:

In [1]:
!pip install Pydataset

Collecting Pydataset
  Downloading pydataset-0.2.0.tar.gz (15.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: Pydataset
  Building wheel for Pydataset (setup.py) ... done
  Created wheel for Pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939432 sha256=39fd06b6ca27711e168ec7444f3d6f487119b2c0fd285db0bb99773e28c310c7
  Stored in directory: /root/.cache/pip/wheels/72/5a/63/7d5a3ea2d921245960db8eb41a2f4272b5450a3c4c95c22d45
Successfully built Pydataset
Installing collected packages: Pydataset
Successfully installed Pydataset-0.2.0

In [2]:
# Import package
from pydataset import data
# Check out datasets
data()

initiated datasets repo at: /root/.pydataset/


,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001
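
With over 700 entries, the listing is easier to use if you can search it. Here is a minimal sketch, assuming the frame returned by `data()` has the `dataset_id` and `title` columns shown above. The helper name `search_catalog` is made up for illustration, and the small `catalog` frame is a stand-in built from rows in the output above so that the snippet runs without PyDataset; in practice you would use `catalog = data()` instead.

```python
import pandas as pd

# Stand-in for `catalog = data()`: a few rows copied from the listing above
catalog = pd.DataFrame(
    {
        "dataset_id": ["AirPassengers", "BJsales", "BOD", "cake", "cbpp"],
        "title": [
            "Monthly Airline Passenger Numbers 1949-1960",
            "Sales Data with Leading Indicator",
            "Biochemical Oxygen Demand",
            "Breakage Angle of Chocolate Cakes",
            "Contagious bovine pleuropneumonia",
        ],
    }
)


def search_catalog(catalog, keyword):
    """Return the rows whose title contains `keyword`, ignoring case."""
    mask = catalog["title"].str.contains(keyword, case=False, regex=False)
    return catalog[mask]


# Hypothetical helper: find datasets whose title mentions "sales"
matches = search_catalog(catalog, "sales")
print(matches)
```

The `dataset_id` of any match can then be passed straight to `data(...)` to load it.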


In [3]:
# Load as a dataframe
df = data('iris')
df

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5.0,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
10,4.9,3.1,1.5,0.1,setosa


# 2. Seaborn 🌊
Seaborn is another package that provides easy access to example datasets. To find the full list of datasets, you can browse the GitHub repository or list the names in Python (both listing and loading need an internet connection, since the data is fetched online):

In [4]:
# Import seaborn
import seaborn as sns
# Check out available datasets
print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']


In [5]:
# Load as a dataframe
df = sns.load_dataset('iris')
df

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


# 3. Scikit-learn 📓
Not only is scikit-learn awesome for feature engineering and building models, it also comes with toy datasets and provides easy access to download and load real-world datasets. The list of toy and real-world datasets, along with other details, is available in the dataset loading utilities section of the scikit-learn documentation. You can find out more about a dataset by scrolling through that page or by referring to the documentation of the individual loading functions. It’s worth mentioning that the datasets include some toy and real-world image datasets, such as the digits dataset and the Olivetti faces dataset.

Now, let’s look at how to load a real-world dataset with an example:

In [6]:
# Import package
from sklearn.datasets import fetch_california_housing
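
The cell above only imports the fetcher, so here is a minimal sketch of the loading step itself. It assumes a scikit-learn version recent enough to support `as_frame=True`, in which case the returned object carries a `frame` attribute holding the features and the target together. The helper name `load_as_frame` is made up for illustration. Because `fetch_california_housing` downloads the data on first use, the sketch runs on the bundled `load_diabetes` toy dataset and leaves the housing call commented out.

```python
from sklearn.datasets import fetch_california_housing, load_diabetes


def load_as_frame(loader):
    """Call a scikit-learn loader or fetcher and return features plus target as one dataframe."""
    bunch = loader(as_frame=True)
    return bunch.frame


# Bundled toy data: works offline
diabetes_df = load_as_frame(load_diabetes)
print(diabetes_df.shape)

# Real-world data: downloaded and cached on the first call, so it needs internet access
# housing_df = load_as_frame(fetch_california_housing)
```

The same helper works for any loader or fetcher that accepts `as_frame=True`, so the only difference between toy and real-world datasets is whether a download happens the first time.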

Here is how to load a toy dataset:

In [7]:
# Import package
from sklearn.datasets import load_iris
# Load data
iris = load_iris(as_frame=True)
# Create a dataframe
df = iris['data'].join(iris['target'])
# Map target codes to class names (only for categorical targets)
df['target'] = df['target'].map(dict(enumerate(iris['target_names'])))
df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
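
By now the same iris measurements have shown up with three different column naming styles: `Sepal.Length` (PyDataset), `sepal_length` (seaborn) and `sepal length (cm)` (scikit-learn). If you switch between sources, it can help to normalise the names. The following is only a sketch of one possible convention; the helper name `to_snake_case` is made up for illustration, and it deliberately drops units in parentheses.

```python
import re


def to_snake_case(name):
    """Lower-case a column name and turn runs of non-alphanumerics into underscores."""
    name = re.sub(r"\(.*?\)", "", name)          # drop units such as "(cm)"
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)   # dots and spaces become "_"
    return name.strip("_").lower()


names = ["Sepal.Length", "sepal_length", "sepal length (cm)"]
normalised = [to_snake_case(n) for n in names]
print(normalised)
```

It can be applied to a whole dataframe with `df = df.rename(columns=to_snake_case)`. The label column would still differ in name between sources (`Species`, `species`, `target`), so that one needs renaming by hand.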


# 4. Statsmodels 📔
Another package through which we can access data is statsmodels. The available built-in datasets are listed in the datasets section of the statsmodels documentation. Let’s pick ‘United States Macroeconomic data’ as an example and load it:

In [8]:
# Import package
import statsmodels.api as sm
# Load data as a dataframe
df = sm.datasets.macrodata.load_pandas()['data']
df

,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.980,139.7,2.82,5.8,177.146,0.00,0.00
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.150,141.7,3.08,5.1,177.830,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.260,1916.4,29.350,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.370,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.540,139.6,3.50,5.2,180.007,2.31,1.19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,2008.0,3.0,13324.600,9267.7,1990.693,991.551,9838.3,216.889,1474.7,1.17,6.0,305.270,-3.16,4.33
199,2008.0,4.0,13141.920,9195.3,1857.661,1007.273,9920.4,212.174,1576.5,0.12,6.9,305.952,-8.79,8.91
200,2009.0,1.0,12925.410,9209.2,1558.494,996.287,9926.4,212.671,1592.8,0.22,8.1,306.547,0.94,-0.71
201,2009.0,2.0,12901.504,9189.0,1456.678,1023.528,10077.5,214.469,1653.6,0.18,9.2,307.226,3.37,-3.19


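Statsmodels can also fetch datasets that ship with R packages, using `get_rdataset` with the dataset name and the R package it belongs to. This downloads the data, so it needs an internet connection:

In [9]: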
# Load data as a dataframe
df = sm.datasets.get_rdataset(dataname='iris', package='datasets')['data']
df

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


# 5. Natural Language Toolkit | NLTK 📜
This package is slightly different from the rest because it provides access only to text datasets. NLTK’s data index lists the text datasets available (psst, please note that some items in that list are models rather than datasets). Using its id, we can access the relevant text dataset from NLTK. Let’s take the Sentiment Polarity Dataset as an example. Its id is movie_reviews. Let’s first download it with the following script:

In [10]:
# Import package
import nltk
# Download the corpus (only need to do once)
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

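Once the corpus is downloaded, we can load it into a dataframe. Each file id starts with the label (`neg` or `pos`) followed by a slash and the file name, so splitting on `/` gives us the target:

In [11]: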
# Import packages
import pandas as pd
from nltk.corpus import movie_reviews
# Convert to dataframe
documents = []
for fileid in movie_reviews.fileids():
    tag, filename = fileid.split('/')
    documents.append((tag, movie_reviews.raw(fileid)))
df = pd.DataFrame(documents, columns=['target', 'document'])
df

,target,document
0,neg,"plot : two teen couples go to a church party , drink and then drive . \nthey get into ..."
1,neg,the happy bastard's quick movie review \ndamn that y2k bug . \nit's got a head start i...
2,neg,it is movies like these that make a jaded movie viewer thankful for the invention of t...
3,neg,""" quest for camelot "" is warner bros . ' first feature-length , fully-animated attemp..."
4,neg,synopsis : a mentally unstable man undergoing psychotherapy saves a boy from a potenti...
...,...,...
1995,pos,"wow ! what a movie . \nit's everything a movie can be : funny , dramatic , interesting..."
1996,pos,"richard gere can be a commanding actor , but he's not always in great films . \neveryt..."
1997,pos,"glory--starring matthew broderick , denzel washington , and morgan freeman--is the tru..."
1998,pos,steven spielberg's second epic film on world war ii is an unquestioned masterpiece of ...
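
The conversion loop above uses only two methods of the corpus reader, `fileids()` and `raw()`, so it can be wrapped in a small reusable function. The sketch below is one way to do that. Both `corpus_to_frame` and the `TinyCorpus` class are made up for illustration: `TinyCorpus` is a stand-in with two invented documents so that the snippet runs without downloading anything, and it is not part of NLTK.

```python
import pandas as pd


class TinyCorpus:
    """Hypothetical stand-in exposing the two reader methods used above."""

    _docs = {
        "neg/a.txt": "dull and slow .",
        "pos/b.txt": "a real delight .",
    }

    def fileids(self):
        return list(self._docs)

    def raw(self, fileid):
        return self._docs[fileid]


def corpus_to_frame(corpus):
    """Build a (target, document) dataframe from file ids shaped like 'label/filename'."""
    rows = []
    for fileid in corpus.fileids():
        label = fileid.split("/", 1)[0]  # text before the first slash is the label
        rows.append((label, corpus.raw(fileid)))
    return pd.DataFrame(rows, columns=["target", "document"])


frame = corpus_to_frame(TinyCorpus())
print(frame)
```

With the real corpus downloaded, `corpus_to_frame(movie_reviews)` should produce the same dataframe as the cell above.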
