# 1 Basic

- Pandas has many (abstract) methods and processing operations, which makes them hard to remember. A rough impression is that they closely resemble manipulations of CSV data. A good way to learn them is to become familiar with them quickly through real examples.
- Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with **labels** rather than simple integer indices
- Three fundamental Pandas data structures: the Series, DataFrame, and Index
- One difference between Index objects and NumPy arrays is that Index objects are immutable: they cannot be modified by ordinary means such as item assignment
- Keep in mind that NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types
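
The last two points can be checked directly. The following is a minimal sketch; the variable names (`ind`, `vals`, `mutated`) are illustrative and do not come from the notes above.

```python
import numpy as np
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# An Index supports array-style reads...
first = ind[0]
every_other = ind[::2]

# ...but item assignment is rejected, because Index objects are immutable.
try:
    ind[1] = 0
    mutated = True
except TypeError:
    mutated = False

# NaN is a float, so a NumPy array that contains one is stored as float64,
# even when every other element is an integer.
vals = np.array([1, np.nan, 3])
print(mutated, vals.dtype)
```

If the assignment were allowed, `mutated` would end up `True`; with pandas it should stay `False`.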

# Examples

In [2]:
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # display the value of every expression in a cell, not only the last one

In [7]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)
population = pd.Series(population_dict)
population
a = population.max()  # largest population value in the Series

In [16]:
pd.DataFrame(population, columns=['population'])

states = pd.DataFrame({'population': population,
                       'area': area})
states['population']
states.loc['California']
states.max().max()

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64

population    38332521
area            423967
Name: California, dtype: int64

38332521

In [25]:
states['density'] = states['population'] / states['area']
states

states.values[0]
states['area']

            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763


array([3.83325210e+07, 4.23967000e+05, 9.04139261e+01])

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [27]:
states.loc[:'Illinois', :'population']  # label-based slices include the end label

states.loc[states.density > 100, ['population', 'density']]

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


          population     density
New York    19651127  139.076746
Florida     19552860  114.806121
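
The endpoint behavior of `loc` differs from that of `iloc`, which is easy to mix up. A small sketch of the difference, using a reduced three-state copy of the table (the names `by_label` and `by_position` are illustrative):

```python
import pandas as pd

states = pd.DataFrame(
    {"population": [38332521, 26448193, 19651127],
     "area": [423967, 695662, 141297]},
    index=["California", "Texas", "New York"],
)

# Label-based slice: the end label 'Texas' is included.
by_label = states.loc[:"Texas"]

# Position-based slice: the end position 1 is excluded, as in plain Python.
by_position = states.iloc[:1]

print(list(by_label.index))
print(list(by_position.index))
```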


In [13]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.135759  0.340740
b  0.112939  0.071676
c  0.977186  0.962096


In [9]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)  # any list of dictionaries can be made into a DataFrame

   a  b
0  0  0
1  1  2
2  2  4


In [28]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
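
The NaN entries appear because a label present in only one operand has no partner after alignment. If a default is preferable to NaN, the method forms of the operators accept a `fill_value` argument. A minimal sketch, with two small illustrative Series `A` and `B` that are not part of the notes above:

```python
import pandas as pd

area = pd.Series({"Alaska": 1723337, "Texas": 695662, "California": 423967})
population = pd.Series({"California": 38332521, "Texas": 26448193,
                        "New York": 19651127})

# The operator form aligns on the union of the labels and yields NaN
# wherever a label is missing from one side.
density = population / area
shared = density.dropna()  # keep only labels present in both Series

# The method form can substitute a default for the missing operand.
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
total = A.add(B, fill_value=0)

print(shared)
print(total)
```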

In [31]:
rng = np.random.RandomState(42)
A = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

halfrow = df.iloc[0, ::2]
halfrow
df - halfrow  # indices are aligned automatically between the two operands

   Q  R  S  T
0  0  0  0  0
1  0  6 -5  2
2  1  1 -4  3


Q    6
S    7
Name: 0, dtype: int32

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  0.0 NaN -5.0 NaN
2  1.0 NaN -4.0 NaN
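
By default, a Series operand is matched against the DataFrame's column labels, so the operation runs row by row. To work down the columns instead, the method form takes an `axis` argument. A short sketch under the same setup (the names `row_wise` and `col_wise` are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
A = rng.randint(10, size=(3, 4))
df = pd.DataFrame(A, columns=list("QRST"))

# Default: the Series index is matched against the column labels,
# so the first row is subtracted from every row.
row_wise = df - df.iloc[0]

# axis=0: the Series index is matched against the row labels,
# so column 'R' is subtracted from every column.
col_wise = df.subtract(df["R"], axis=0)

print(row_wise)
print(col_wise)
```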


In [32]:
x = pd.Series(range(2), dtype=int)
x
x[0] = None
x

0    0
1    1
dtype: int32

0    NaN
1    1.0
dtype: float64
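
The upcast to `float64` happens because NumPy integers have no missing-value representation. Recent pandas versions (0.24 and later, as far as I know) also offer a nullable integer extension dtype, `"Int64"`, which keeps the integers and marks the gap as `<NA>`. A minimal sketch; the names `floats` and `nullable` are illustrative:

```python
import pandas as pd

# Default inference: None becomes NaN and the whole Series becomes float64.
floats = pd.Series([1, None, 3])

# Nullable integer dtype (note the capital "I"): values stay integers
# and the missing entry is shown as <NA>.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(floats.dtype, nullable.dtype)
print(nullable.isna().tolist())
```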

In [34]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

        data1     data2
a 1  0.305516  0.118453
  2  0.460742  0.627896
b 1  0.277792  0.242217
  2  0.015634  0.800491


California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64
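
A Series built from tuple keys gets a two-level `MultiIndex`, which can be queried one level at a time or pivoted into a table. A small sketch of three common operations on the same data (the names `pop`, `california`, `pop_2010` and `wide` are illustrative):

```python
import pandas as pd

data = {("California", 2000): 33871648, ("California", 2010): 37253956,
        ("Texas", 2000): 20851820, ("Texas", 2010): 25145561,
        ("New York", 2000): 18976457, ("New York", 2010): 19378102}
pop = pd.Series(data)

# Partial indexing on the outer level returns a Series keyed by year.
california = pop.loc["California"]

# xs() takes a cross-section on the inner level: one value per state.
pop_2010 = pop.xs(2010, level=1)

# unstack() pivots the inner level into columns, giving a DataFrame.
wide = pop.unstack()

print(california)
print(pop_2010)
print(wide)
```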