In [1]:
import numpy as np 
import pandas as pd

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

In [2]:
data = np.linspace(0,1,5)
data = pd.Series(data[1:])  # equivalently pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As the output shows, the Series wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

In [3]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
data.index

RangeIndex(start=0, stop=4, step=1)

Data can be accessed by the associated index via the familiar Python square-bracket notation.


In [5]:
data[1]

0.5

In [6]:
data[1:2]

1    0.5
dtype: float64
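A minimal sketch of the two access styles above, assuming the default integer index. The names `series`, `single`, and `window` are illustrative only: a single label returns a scalar, while a slice returns a new Series that keeps the original labels.

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0])

single = series[1]     # one label gives back a scalar value
window = series[1:3]   # a slice gives back a Series with labels 1 and 2

print(single)
print(window)
```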

### Series as a generalized NumPy array

A NumPy array has an **implicitly** defined integer index used to access the values.

A Pandas Series has an **explicitly** defined index associated with the values.

The explicit index definition gives the Series object additional capabilities.

In [7]:
# for example, we can use strings as index labels
data = pd.Series([.25, .5, .75, 1.0],
                index=list('abcd'))
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
data['b']


0.5
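The explicit index does not have to be contiguous or ordered. The sketch below uses an arbitrary set of integer labels chosen only for illustration; the lookup goes by label, not by position.

```python
import pandas as pd

# Integer labels that are neither sequential nor sorted
series = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = series[5]           # looks up the label 5, not position 5
by_label_loc = series.loc[5]   # the explicit, unambiguous spelling

print(by_label, by_label_loc)
```

Using `.loc` makes the intent clear when integer labels could be mistaken for positions.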

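### Series as a specialized dictionary
When a Series is built from a dictionary, the index is drawn from the dictionary keys. Current pandas versions keep the keys in insertion order, as the output below shows; older releases sorted them.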

In [9]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [10]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [11]:
population['California']

38332521

Unlike a dictionary, though, the Series also supports array-style operations such as slicing.

In [12]:
population['Texas':'Florida']

Texas       26448193
New York    19651127
Florida     19552860
dtype: int64
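Note that the label slice above includes its end point, unlike ordinary Python slicing. A small sketch contrasting label-based and position-based slices, reusing the population figures from this section (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

by_label = population.loc['Texas':'Florida']   # end label is included
by_position = population.iloc[1:3]             # end position is excluded

print(len(by_label), len(by_position))
```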

### Constructing Series objects
`pd.Series(data[, index=index])`

In [13]:
pd.Series([1,2,3])

0    1
1    2
2    3
dtype: int64

In [14]:
pd.Series([2,1,3],index=[100,200,300])

100    2
200    1
300    3
dtype: int64

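`data` can be a dictionary, in which case `index` defaults to the dictionary keys in insertion order (older pandas releases sorted the keys). Passing `index` explicitly selects and orders the entries: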

In [None]:
print(pd.Series({3:'c',1:'a', 2:'b'},index=[1,2,3]))
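A short sketch of a few constructor variants, assuming a recent pandas on Python 3.7 or later so that dictionary order is preserved. The variable names are illustrative: a dictionary without `index` keeps its key order, an explicit `index` picks out and orders only the listed keys, and a scalar is repeated to fill the index.

```python
import pandas as pd

# Keys keep insertion order when no index is given
unordered = pd.Series({3: 'c', 1: 'a', 2: 'b'})

# An explicit index selects and orders a subset of the keys
subset = pd.Series({3: 'c', 1: 'a', 2: 'b'}, index=[2, 3])

# A scalar is broadcast to every index label
filled = pd.Series(5, index=[100, 200, 300])

print(unordered.index.tolist())
print(subset.tolist())
print(filled.tolist())
```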


## The Pandas DataFrame Object
The Pandas DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

### DataFrame as a generalized NumPy array

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

A DataFrame can be thought of as a sequence of aligned `Series` objects. By 'aligned' we mean that they share the same index.

In [16]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
print(area)
print(population)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


In [17]:
states = pd.DataFrame({'population':population, 'area':area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
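Alignment matters most when the Series do not share exactly the same labels. The sketch below uses deliberately mismatched, shortened versions of the two Series (an assumption made for illustration): pandas takes the union of the labels and marks the gaps with `NaN`.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'Texas': 26448193, 'New York': 19651127})

# Rows are matched by label; missing combinations become NaN
states = pd.DataFrame({'population': population, 'area': area})
print(states)
```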


Like the Series object, the DataFrame has an `index` attribute that gives access to the index labels.

In [18]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the DataFrame has a `columns` attribute, which is an `Index` object holding the column labels.

In [19]:
states.columns

Index(['population', 'area'], dtype='object')


### DataFrame as a specialized dictionary
Where a dictionary maps a key to a value, a `DataFrame` maps a column name to a `Series` of column data.

For example: `{column_name: Series_data}`


In [20]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
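The dictionary analogy extends to a few other operations. This is a minimal sketch on a two-row version of the `states` table; the `density` column is a hypothetical addition used only to show key assignment.

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193],
                       'area': [423967, 695662]},
                      index=['California', 'Texas'])

column_names = list(states.keys())   # keys are the column labels
has_area = 'area' in states          # membership tests the columns
has_texas = 'Texas' in states        # row labels are not keys

# Assigning to a new key adds a column, as with a dictionary
states['density'] = states['population'] / states['area']
print(states)
```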

Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.

### Constructing DataFrame objects
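The difference can be seen side by side in a small sketch. The array contents and the column names `col0` and `col1` are made up for illustration; `iloc` is the DataFrame way to ask for a row by position.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4]])
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row_np = arr[0]           # NumPy: first row
first_col_df = frame['col0']    # DataFrame: first column
first_row_df = frame.iloc[0]    # DataFrame: first row, by position

print(first_row_np, first_col_df.tolist(), first_row_df.tolist())
```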

#### From a single Series object


In [25]:
print(population)
print(type(population))

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
<class 'pandas.core.series.Series'>


In [28]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a list of dicts


In [31]:
data = [{'a':i, 'b':2*i}
       for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [32]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys in the dictionaries are missing, Pandas will fill them in with `NaN` values.

In [34]:
pd.DataFrame([{'a':1,'b':2},{'a':3, 'c':2}])

   a    b    c
0  1  2.0  NaN
1  3  NaN  2.0
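One side effect visible in the output is that columns containing `NaN` are shown as floats. A brief sketch that checks this, using the same two dictionaries (the names `frame` and `missing_counts` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 2}])

# Count the gaps that were filled with NaN in each column
missing_counts = frame.isna().sum()

print(frame.dtypes)
print(missing_counts)
```

Column `a` is complete and stays an integer column, while `b` and `c` are converted to `float64` so that they can hold `NaN`.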


#### From a dictionary of Series objects

In [35]:
print(population)
print(area)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64


In [36]:
pd.DataFrame({'population':population, 'area':area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### From a two-dimensional NumPy array


In [37]:
data = np.random.rand(2,3)
data

array([[0.96431794, 0.56023609, 0.73271384],
       [0.76573354, 0.73288266, 0.3036956 ]])

In [43]:
pd.DataFrame(data,columns=['A','B','C'], index=['a','b'])

          A         B         C
a  0.964318  0.560236  0.732714
b  0.765734  0.732883  0.303696


#### From a NumPy structured array

In [45]:
A = np.zeros(3, dtype=[('A','i8'),('B','f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [47]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
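The same conversion with non-zero data may make the mapping easier to see. In this sketch the field names `id` and `score` and the values are invented for illustration: each field of the structured array becomes a column with the matching dtype.

```python
import numpy as np
import pandas as pd

records = np.array([(1, 0.5), (2, 1.5)],
                   dtype=[('id', 'i8'), ('score', 'f8')])

frame = pd.DataFrame(records)   # field names become column names
print(frame)
print(frame.dtypes)
```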


## The Pandas Index Object

The Index object can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values).

In [49]:
# create an Index object
ind = pd.Index([2,3,5,7,11])
type(ind)

pandas.core.indexes.numeric.Int64Index

### Index as immutable array

The Index operates like an array in many ways.

In [50]:
ind[1]

3

In [52]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays.

In [53]:
print(ind.size, ind.shape,ind.ndim,ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means.

In [54]:
ind[0] = 1

TypeError: Index does not support mutable operations
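Because an Index cannot be changed in place, the usual approach is to build a new one. A minimal sketch, assuming the same `ind` values as above; the names `mutated` and `new_ind` are illustrative, and `insert` and `delete` each return a new Index rather than modifying the original.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[0] = 1          # item assignment is rejected
    mutated = True
except TypeError:
    mutated = False

# Build a new Index: put 1 at position 0, then drop the old first value
new_ind = ind.insert(0, 1).delete(1)

print(mutated)
print(new_ind.tolist())
print(ind.tolist())     # the original is unchanged
```

Immutability is what makes it safe to share one Index between several Series and DataFrames.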

### Index as ordered set


In [57]:
indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])
print(indA)
print(indB)

Int64Index([1, 3, 5, 7, 9], dtype='int64')
Int64Index([2, 3, 5, 7, 11], dtype='int64')


In [61]:
intersection = indA & indB
print(set(intersection))  # intersection

{3, 5, 7}


In [62]:
indA |  indB  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [64]:
# (indA | indB) - (indA & indB) 
indA ^ indB  # symmetric difference


Int64Index([1, 2, 9, 11], dtype='int64')

In [69]:
temp = indA.difference(indB)
print(temp)

temp = indA.factorize()
print(temp)


Int64Index([1, 9], dtype='int64')
(array([0, 1, 2, 3, 4], dtype=int64), Int64Index([1, 3, 5, 7, 9], dtype='int64'))


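These operations can also be accessed through object methods, for example `indA.intersection(indB)`. The methods are the safer choice: pandas deprecated the `&`, `|`, and `^` operators as set operations on Index objects, and from pandas 2.0 they act as element-wise logical operations instead.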
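A sketch of the method-based forms of the set operations, using the same `indA` and `indB` values. The result names `common`, `either`, `exclusive`, and `only_a` are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # in both
either = indA.union(indB)                    # in at least one
exclusive = indA.symmetric_difference(indB)  # in exactly one
only_a = indA.difference(indB)               # in indA but not indB

print(common.tolist(), either.tolist())
print(exclusive.tolist(), only_a.tolist())
```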
 