# Data Sets in Python

In [2]:
# libraries
import numpy as np
import pandas as pd
from pydataset import data

# pydataset Library
https://github.com/iamaziz/PyDataset/blob/master/examples/basic-usage.ipynb
https://www.youngwonks.com/blog/pydataset-a-python-dataset-library

### Install
Windows: `pip install pydataset`

Mac: `pip3 install pydataset`

In [3]:
# load the iris dataset into a pandas DataFrame
iris = data('iris')

In [5]:
# show data
iris.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
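
The column names carry over the dotted style used in R (`Sepal.Length` and so on), which does not work with pandas attribute access such as `iris.Sepal.Length`. One possible way to normalize them is sketched below. The frame is rebuilt by hand from the five rows shown above so the example runs without pydataset, and `iris_head` and `iris_clean` are illustrative names, not part of the library.

```python
import pandas as pd

# The five rows shown above, typed in by hand (assumption: with pydataset
# installed you would start from data('iris') instead).
iris_head = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "Sepal.Width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "Petal.Length": [1.4, 1.4, 1.3, 1.5, 1.4],
        "Petal.Width": [0.2, 0.2, 0.2, 0.2, 0.2],
        "Species": ["setosa"] * 5,
    },
    index=range(1, 6),  # pydataset frames are indexed from 1
)

# Replace the dots with underscores and lowercase every column name.
iris_clean = iris_head.rename(columns=lambda name: name.replace(".", "_").lower())

print(list(iris_clean.columns))
print(iris_clean.sepal_length.mean())  # attribute access now works
```

The original frame is left untouched; `rename` returns a new DataFrame.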


In [6]:
# show the dataset's documentation
data('iris', show_doc=True)

iris

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Edgar Anderson's Iris Data

### Description

This famous (Fisher's or Anderson's) iris data set gives the measurements in
centimeters of the variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3 species of iris. The
species are _Iris setosa_, _versicolor_, and _virginica_.

### Usage

    iris
    iris3

### Format

`iris` is a data frame with 150 cases (rows) and 5 variables (columns) named
`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.

`iris3` gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with names `Sepal
L.`, `Sepal W.`, `Petal L.`, and `Petal W.`, and the third the species.

### Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomi

In [7]:
# list the available datasets
data()

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
1,BJsales,Sales Data with Leading Indicator
2,BOD,Biochemical Oxygen Demand
3,Formaldehyde,Determination of Formaldehyde
4,HairEyeColor,Hair and Eye Color of Statistics Students
...,...,...
752,VerbAgg,Verbal Aggression item responses
753,cake,Breakage Angle of Chocolate Cakes
754,cbpp,Contagious bovine pleuropneumonia
755,grouseticks,Data on red grouse ticks from Elston et al. 2001


In [8]:
# help on the data() function
help(data)

Help on function data in module pydataset:

data(item=None, show_doc=False)
    loads a datasaet (from in-modules datasets) in a dataframe data structure.
    
    Args:
        item (str)      : name of the dataset to load.
        show_doc (bool) : to show the dataset's documentation.
    
    Examples:
    
    >>> iris = data('iris')
    
    
    >>> data('titanic', show_doc=True)
        : returns the dataset's documentation.
    
    >>> data()
        : like help(), returns a dataframe [Item, Title]
        for a list of the available datasets.



In [43]:
# search a dataset by name (load the catalog once, then filter it)
datasets = data()
datasets[datasets.dataset_id == 'AirPassengers']

Unnamed: 0,dataset_id,title
0,AirPassengers,Monthly Airline Passenger Numbers 1949-1960
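
The exact-match filter above only helps when the dataset id is already known. A keyword search over both ids and titles is sketched below. `search_catalog` is a hypothetical helper, not a pydataset function, and the small catalog is rebuilt from the first five rows of the `data()` output so the example runs on its own. With pydataset installed, the result of `data()` could be passed in instead, assuming it has the `dataset_id` and `title` columns shown above.

```python
import pandas as pd

# First five catalog rows from the data() output, typed in by hand.
catalog = pd.DataFrame(
    {
        "dataset_id": ["AirPassengers", "BJsales", "BOD", "Formaldehyde", "HairEyeColor"],
        "title": [
            "Monthly Airline Passenger Numbers 1949-1960",
            "Sales Data with Leading Indicator",
            "Biochemical Oxygen Demand",
            "Determination of Formaldehyde",
            "Hair and Eye Color of Statistics Students",
        ],
    }
)


def search_catalog(catalog: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return rows whose id or title contains keyword (case-insensitive)."""
    in_id = catalog["dataset_id"].str.contains(keyword, case=False, regex=False)
    in_title = catalog["title"].str.contains(keyword, case=False, regex=False)
    return catalog[in_id | in_title]


matches = search_catalog(catalog, "sales")
print(matches)
```

Passing `regex=False` treats the keyword as plain text, so characters such as `.` or `(` in a search term are matched literally.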


In [44]:
# show the first ten rows
data('AirPassengers').head(10)

Unnamed: 0,time,AirPassengers
1,1949.0,112
2,1949.083333,118
3,1949.166667,132
4,1949.25,129
5,1949.333333,121
6,1949.416667,135
7,1949.5,148
8,1949.583333,148
9,1949.666667,136
10,1949.75,119
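
The `time` column stores each month as a fractional year. A rough sketch of turning those values into timestamps follows, assuming the fraction is `(month - 1) / 12`, which matches the ten rows shown above (1949.083333 would then be February 1949). The values are typed in by hand so the example runs without pydataset.

```python
import numpy as np
import pandas as pd

# The ten time values shown above.
time = pd.Series(
    [1949.0, 1949.083333, 1949.166667, 1949.25, 1949.333333,
     1949.416667, 1949.5, 1949.583333, 1949.666667, 1949.75]
)

# Split each value into a whole year and a month number (1 to 12).
year = np.floor(time).astype(int)
month = ((time - year) * 12).round().astype(int) + 1

# Assemble timestamps for the first day of each month.
dates = pd.to_datetime(pd.DataFrame({"year": year, "month": month, "day": 1}))
print(dates)
```

Rounding before the integer cast matters here, because values such as 0.083333 * 12 fall slightly short of a whole number.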


In [45]:
AirP = data('AirPassengers')
AirP.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 1 to 144
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   time           144 non-null    float64
 1   AirPassengers  144 non-null    int64  
dtypes: float64(1), int64(1)
memory usage: 3.4 KB


In [58]:
# replace the fractional-year time column with the integer row index (1 to 144)
AirP['time'] = AirP.index
print(AirP.dtypes, AirP.shape)

time             int64
AirPassengers    int64
dtype: object (144, 2)


In [61]:
# time already holds the dates assigned by the date_range cell below (cells were run out of order)
AirP.head()

Unnamed: 0,time,AirPassengers
1,2012-10-01,112
2,2012-10-02,118
3,2012-10-03,132
4,2012-10-04,129
5,2012-10-05,121


In [64]:
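# overwrite time with a daily date range starting 2012-10-01
# (placeholder dates: the actual observations are monthly, 1949-1960)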
AirP['time'] = pd.date_range('2012-10-01', periods=AirP.shape[0], freq='1D')
AirP.head(10)

Unnamed: 0,time,AirPassengers
1,2012-10-01,112
2,2012-10-02,118
3,2012-10-03,132
4,2012-10-04,129
5,2012-10-05,121
6,2012-10-06,135
7,2012-10-07,148
8,2012-10-08,148
9,2012-10-09,136
10,2012-10-10,119


In [65]:
AirP.tail()

Unnamed: 0,time,AirPassengers
140,2013-02-17,606
141,2013-02-18,508
142,2013-02-19,461
143,2013-02-20,390
144,2013-02-21,432
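
The catalog title describes this dataset as monthly passenger numbers for 1949-1960, and `AirP.info()` reports 144 rows, which is twelve years of months. A date range that reflects that could be sketched as follows, assuming the first observation is January 1949 (consistent with the first `time` value of 1949.0). The five passenger counts are the ones shown in the `tail()` output above.

```python
import pandas as pd

n_rows = 144  # row count reported by AirP.info()

# "MS" is the month-start frequency: one timestamp on the first day of each month.
monthly = pd.date_range("1949-01-01", periods=n_rows, freq="MS")
print(monthly[0], monthly[-1])

# Attach the last five passenger counts from the tail() output to their months.
last_five = pd.Series([606, 508, 461, 390, 432], index=monthly[-5:], name="AirPassengers")
print(last_five)
```

With the real frame, the same range could be assigned with `AirP['time'] = monthly` in place of the daily range used above.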
