# Data Sets in Python

In [3]:
# Import libraries
import numpy as np
import pandas as pd
from pydataset import data

# pydataset Library
- https://github.com/iamaziz/PyDataset/blob/master/examples/basic-usage.ipynb
- https://www.youngwonks.com/blog/pydataset-a-python-dataset-library

### Install
- Windows: `pip install pydataset`
- Mac: `pip3 install pydataset`

In [4]:
# Load the iris dataset into a pandas DataFrame
iris = data('iris')

In [5]:
# Show the first five rows
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa


In [7]:
# Show the dataset's documentation (uncomment to run)
# data('iris', show_doc=True)

In [8]:
# List the available datasets
data()

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001


In [9]:
# Show help for the data() function
help(data)

Help on function data in module pydataset:

data(item=None, show_doc=False)
    loads a datasaet (from in-modules datasets) in a dataframe data structure.
    
    Args:
        item (str)      : name of the dataset to load.
        show_doc (bool) : to show the dataset's documentation.
    
    Examples:
    
    >>> iris = data('iris')
    
    
    >>> data('titanic', show_doc=True)
        : returns the dataset's documentation.
    
    >>> data()
        : like help(), returns a dataframe [Item, Title]
        for a list of the available datasets.



In [10]:
# Load the mtcars dataset (only the first 10 rows are shown below)
data('mtcars')

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [11]:
# Print the documentation for the mtcars dataset
data('mtcars', show_doc=True)

mtcars

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Motor Trend Car Road Tests

### Description

The data was extracted from the 1974 _Motor Trend_ US magazine, and comprises
fuel consumption and 10 aspects of automobile design and performance for 32
automobiles (1973–74 models).

### Usage

    mtcars

### Format

A data frame with 32 observations on 11 variables.

| Index | Column | Description |
|-------|--------|-------------|
| [, 1] | mpg | Miles/(US) gallon |
| [, 2] | cyl | Number of cylinders |
| [, 3] | disp | Displacement (cu.in.) |
| [, 4] | hp | Gross horsepower |
| [, 5] | drat | Rear axle ratio |
| [, 6] | wt | Weight (lb/1000) |
| [, 7] | qsec | 1/4 mile time |
| [, 8] | vs | V/S |
| [, 9] | am | Transmission (0 = automatic, 1 = manual) |
| [,10] | gear | Number of forward gears |
| [,11] | carb | Number of carburetors |

### Source

Henderson and Velleman (1981), Building multiple regression models
interactively. _Biometrics_, **37**, 391–411.

### Examples

    require(graphics)
    pairs(mtcars, main = "mtcars data")
    coplot(mpg ~ disp 
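
The documentation above describes `am` as a coded column (0 = automatic, 1 = manual). As a minimal sketch, such codes can be mapped to readable labels with pandas. The small frame below is a hand-typed stand-in built from the first five `mtcars` rows shown earlier, and the `transmission` column name is an assumption chosen for illustration; it is not part of the dataset.

```python
import pandas as pd

# Hand-typed stand-in for the first five mtcars rows shown earlier
# (only the mpg and am columns are kept).
cars = pd.DataFrame(
    {"mpg": [21.0, 21.0, 22.8, 21.4, 18.7], "am": [1, 1, 1, 0, 0]},
    index=["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
           "Hornet 4 Drive", "Hornet Sportabout"],
)

# Per the dataset documentation: 0 = automatic, 1 = manual.
# 'transmission' is a hypothetical column name used only in this sketch.
cars["transmission"] = cars["am"].map({0: "automatic", 1: "manual"})

print(cars)
```

With the real data, the same `map` call should work on the frame returned by `data('mtcars')`.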

In [43]:
# Search for a dataset by name (call data() once and reuse the catalog)
datasets = data()
datasets[datasets.dataset_id == 'AirPassengers']

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960


In [44]:
# Load the AirPassengers dataset (only the first 10 rows are shown below)
data('AirPassengers')

Unnamed: 0,time,AirPassengers
1,1949.0,112
2,1949.083333,118
3,1949.166667,132
4,1949.25,129
5,1949.333333,121
6,1949.416667,135
7,1949.5,148
8,1949.583333,148
9,1949.666667,136
10,1949.75,119


In [45]:
# Keep the dataset in a variable and inspect its columns, dtypes, and size
AirP = data('AirPassengers')
AirP.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 1 to 144
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   time           144 non-null    float64
 1   AirPassengers  144 non-null    int64  
dtypes: float64(1), int64(1)
memory usage: 3.4 KB


In [58]:
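# Overwrite the fractional-year 'time' column with the integer row index (1 to 144)
AirP['time'] = AirP.index
# Confirm the new column types and the frame's shape
print(AirP.dtypes, AirP.shape)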

time             int64
AirPassengers    int64
dtype: object (144, 2)


In [61]:
# Show the first five rows
AirP.head()

Unnamed: 0,time,AirPassengers
1,2012-10-01,112
2,2012-10-02,118
3,2012-10-03,132
4,2012-10-04,129
5,2012-10-05,121


In [64]:
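# Overwrite 'time' with one daily timestamp per row, starting 2012-10-01
# (these are not the dataset's original 1949-1960 months)
AirP['time'] = pd.date_range('2012-10-01', periods=AirP.shape[0], freq='1D')
AirP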

Unnamed: 0,time,AirPassengers
1,2012-10-01,112
2,2012-10-02,118
3,2012-10-03,132
4,2012-10-04,129
5,2012-10-05,121
6,2012-10-06,135
7,2012-10-07,148
8,2012-10-08,148
9,2012-10-09,136
10,2012-10-10,119


In [65]:
# Show the last five rows
AirP.tail()

Unnamed: 0,time,AirPassengers
140,2013-02-17,606
141,2013-02-18,508
142,2013-02-19,461
143,2013-02-20,390
144,2013-02-21,432
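
The daily dates assigned above do not correspond to the original series: the catalog title describes it as monthly for 1949-1960, and `info()` reports 144 rows (12 years of 12 months). A minimal sketch of a month-start range that matches that description follows; the January 1949 starting point is an assumption based on the first `time` value (1949.0).

```python
import pandas as pd

n_rows = 144  # row count reported by AirP.info()

# 'MS' is the pandas month-start frequency alias; the January 1949
# start date is an assumption based on the first 'time' value (1949.0).
monthly = pd.date_range("1949-01-01", periods=n_rows, freq="MS")

print(monthly[0].date(), monthly[-1].date())
```

Assigning the result with `AirP['time'] = monthly` would then label each row with the first day of its month.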
