# Data Manipulation with pandas
* DataFrames are essentially multidimensional arrays with attached row and column labels, often holding heterogeneous types and/or missing data.
* They offer a convenient storage interface for labeled data.
* pandas also implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
* NumPy arrays alone are limited here: they carry no labels and handle mixed types and missing data poorly.

In [1]:
import pandas as pd
import numpy as np

pd.__version__

'1.5.2'

## Series object

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
print(data[1])
print(data[1:3])  # slicing also works

0.5
1    0.50
2    0.75
dtype: float64


A Series can serve as a generalised NumPy array.
The difference is the index: a NumPy array has an implicitly defined integer index, while a Series has an explicitly defined index attached to its values, and that index need not consist of integers.

In [6]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

A Series can also be used like a dictionary that maps index labels to values, while keeping array features such as slicing.

In [7]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [8]:
population['Texas':]

Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
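
The dictionary-like and array-like sides of a Series can be put side by side. The following is a minimal sketch that rebuilds the `population` Series from above; the names `texas`, `middle` and `large` are illustrative only, and the 20 million threshold is an arbitrary choice.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-style access: look up one value by its label.
texas = population['Texas']

# Array-style slicing by label: both endpoints are included.
middle = population['Texas':'Florida']

# Array-style computation: a boolean mask keeps states above 20 million.
large = population[population > 20_000_000]

print(texas)
print(middle)
print(large)
```

A plain Python dict supports only the first of these three operations; the slice and the mask come from the array side of the Series.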

## DataFrame
* a DataFrame generalises the Series to two dimensions: several Series that share one index, each forming a column

In [9]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

area = pd.Series(area_dict)
print(area)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64


In [10]:
states_data = pd.DataFrame({'population': population, 'area': area})
states_data

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [11]:
states_data.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
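
Besides the row index, a DataFrame also carries column labels, and each column can be pulled out as a Series. A small sketch, rebuilding `states_data` so the cell runs on its own (`column_labels` and `area_column` are illustrative names):

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states_data = pd.DataFrame({'population': population, 'area': area})

# Column labels are stored in a second Index object.
column_labels = list(states_data.columns)

# Dictionary-style column access returns a Series that shares the row index.
area_column = states_data['area']

print(column_labels)
print(area_column)
```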

There are a number of ways to construct a DataFrame, all starting with `pd.DataFrame`:
* from a single Series: `pd.DataFrame(population, columns=['population'])`
* from a dictionary of Series objects, as in `states_data` above
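
The construction routes listed above can be sketched as follows. The first two come from the notes; the list-of-dicts and NumPy-array routes are additional options that `pd.DataFrame` also accepts, shown with made-up toy data (`'a'`, `'b'`, `'c'`, `'foo'`, `'bar'` are placeholder labels).

```python
import numpy as np
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# 1. From a single (unnamed) Series; `columns` supplies the column label.
from_series = pd.DataFrame(population, columns=['population'])

# 2. From a dictionary of Series; the keys become column names.
from_dict = pd.DataFrame({'population': population, 'area': area})

# 3. From a list of dicts (toy data); keys missing in a row become NaN.
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# 4. From a two-dimensional NumPy array with explicit labels (toy data).
from_array = pd.DataFrame(np.zeros((3, 2)),
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])

print(from_series)
print(from_dict)
print(from_records)
print(from_array)
```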

## Data indexing and selection

In [13]:
# Series as dictionary
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data


a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [14]:
'a' in data

True
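
Membership testing is only one of the dictionary-style operations a Series supports. A short sketch of a few others, using the same `data` Series; the new label `'e'` and its value are arbitrary choices for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# keys() returns the index; items() yields (label, value) pairs.
keys = list(data.keys())
items = list(data.items())

# Assigning to a new label extends the Series, as with a dict.
data['e'] = 1.25

print(keys)
print(items)
print(data)
```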

In [17]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

Demonstration of `loc` and `iloc` in pandas
* `loc` - selects by label, i.e. the explicit index; slices include the end label
* `iloc` - selects by integer position, i.e. the implicit Python-style index; slices exclude the end position

In [22]:

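print(data.loc[1])      # label 1
print(data.loc[1:3])    # labels 1 through 3, end label included
print(data.iloc[1:3])   # positions 1 and 2, end position excluded
print(data.iloc[1])     # position 1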


a
1    a
3    b
dtype: object
3    b
5    c
dtype: object
b


In [23]:
# DataFrame
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [26]:
data.loc['California':'New York']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
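
On a DataFrame, `loc` and `iloc` take a row selector and, optionally, a column selector. The sketch below selects the same three rows of the `pop` column both ways, rebuilding `data` so it runs on its own (`by_label` and `by_position` are illustrative names).

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# Label-based: rows 'California' through 'New York' (inclusive), column 'pop'.
by_label = data.loc['California':'New York', 'pop']

# Position-based: rows 0, 1, 2 (end excluded), column at position 1.
by_position = data.iloc[0:3, 1]

print(by_label)
print(by_position)
```

Both selections return the same Series; the only difference is whether the endpoints are read as labels or as positions.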


## Operating on data in pandas