## Pandas series

In [1]:
import pandas as pd
pd.__version__

'0.23.4'

## Pandas Series object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])

In [3]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [4]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The index is an array-like object of type pd.Index, which we will discuss in more detail momentarily:

In [5]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [6]:
data[1]

0.5

In [7]:
data[1:3]

1    0.50
2    0.75
dtype: float64
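
As a minimal sketch, the pieces described above (values, index, item access and slicing) can be checked together. It assumes NumPy is installed alongside Pandas; the variable names `values`, `item` and `part` are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# The values are exposed as a plain NumPy array
values = data.values
print(type(values).__name__)

# The default index simply counts up from 0
print(list(data.index))

# Single-item access returns a scalar; a slice returns another Series
item = data[1]
part = data[1:3]
print(item, list(part.values))
```

With the default index, labels and positions coincide, which is why indexing and slicing agree here.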

## Pandas DataFrame

In [8]:
# Create a simple DataFrame from a dictionary of columns
df = pd.DataFrame({
    'name': ['bob', 'Jen', 'Tim'],
    'age': [20, 30, 40],
    'pet': ['cat', 'dog', 'bird'],
})
df

  name  age   pet
0  bob   20   cat
1  Jen   30   dog
2  Tim   40  bird


## Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [9]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [10]:
# Explicit index when indexing: 1 is matched against the index labels
data[1]

'a'

In [11]:
# Implicit index when slicing: the slice goes by position rather than by
# the index labels, and positions start at 0 in Python
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:

In [12]:
data.loc[1]  # same as data[1]

'a'

In [13]:
data.loc[1:3]

1    a
3    b
dtype: object

In [17]:
data.loc[0:1]

1    a
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [14]:
data.iloc[1]

'b'

In [15]:
data

1    a
3    b
5    c
dtype: object

In [16]:
data.iloc[1:3]

3    b
5    c
dtype: object
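
The contrast between the two indexers can be sketched side by side, using the same Series as above. One detail the outputs show but the text does not state: a loc slice includes its end label, while an iloc slice stops before its end position. The names `by_label` and `by_position` are illustrative only.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc works with labels: the slice 1:3 includes the label 3 itself
by_label = data.loc[1:3]
print(list(by_label.index))

# iloc works with positions: the slice 1:3 stops before position 3
by_position = data.iloc[1:3]
print(list(by_position.index))

# A single label that is missing from the index raises KeyError with loc
try:
    data.loc[2]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)
```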

In [19]:
df.sort_values('pet', inplace=True)

In [20]:
df

  name  age   pet
2  Tim   40  bird
0  bob   20   cat
1  Jen   30   dog


In [21]:
df.loc[0]  # extract the row whose index label is 0

name    bob
age      20
pet     cat
Name: 0, dtype: object

In [22]:
df.iloc[0]  # extract the row at the first position

name     Tim
age       40
pet     bird
Name: 2, dtype: object

In [23]:
df.iloc[:, 2]  # use iloc to select all rows of the third column

2    bird
0     cat
1     dog
Name: pet, dtype: object

In [24]:
df.iloc[-1, :]  # use iloc to select the last row

name    Jen
age      30
pet     dog
Name: 1, dtype: object
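
The same distinction can be sketched without modifying the original frame. This version assumes you would rather keep `df` untouched, so it stores the sorted result in a new, illustratively named variable `sorted_df` instead of passing inplace=True.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['bob', 'Jen', 'Tim'],
    'age': [20, 30, 40],
    'pet': ['cat', 'dog', 'bird'],
})

# sort_values returns a new frame; df itself keeps its original order
sorted_df = df.sort_values('pet')

# The row labels travel with the rows: bird (2), cat (0), dog (1)
print(list(sorted_df.index))

# loc looks up the label 0, iloc looks up the first position
name_at_label_0 = sorted_df.loc[0, 'name']
name_at_position_0 = sorted_df.iloc[0]['name']
print(name_at_label_0, name_at_position_0)
```

After sorting, label 0 and position 0 point at different rows, which is exactly the situation where choosing between loc and iloc matters.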

## Slicing DataFrames

In [25]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [26]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [27]:
data.area
# equivalently, we can use attribute-style access when the column name is a string

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
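
Attribute-style access has a catch worth a quick sketch: it only reaches a column when the column name does not collide with an existing DataFrame attribute or method. The `pop` column is such a case, because `pop` is also a DataFrame method. The two-row frame below is a cut-down version of the one above, built only for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'area': {'California': 423967, 'Texas': 695662},
    'pop': {'California': 38332521, 'Texas': 26448193},
})

# 'area' does not clash with any DataFrame attribute, so both styles agree
same_column = data.area.equals(data['area'])
print(same_column)

# 'pop' is also the name of a DataFrame method, so data.pop is that method
pop_is_method = callable(data.pop)
print(pop_is_method)

# Dictionary-style access always returns the column
texas_pop = data['pop']['Texas']
print(texas_pop)
```

For that reason, dictionary-style access such as data['pop'] is the safer habit, especially when assigning to a column.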

In [29]:
data['density'] = data['pop'] / data['area']

In [30]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


## Selecting multidimensional DataFrames

We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

In [31]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

Many familiar array-like operations also work on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [32]:
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [33]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


Passing a single index to the underlying array accesses a row:

In [34]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

Thus for array-style indexing, we need another convention. Here Pandas again uses the loc and iloc indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [37]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [38]:
data.loc[:, :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
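
The two cells above select by position and by label respectively. As a small sketch, the same block of the table can be reached either way; the label slice `:'New York'` is an illustrative choice, picked so that it covers the same rows as the positional slice `:3`.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# Positional selection: first three rows, first two columns
by_position = data.iloc[:3, :2]

# Label selection: rows up to and including 'New York',
# columns up to and including 'pop'
by_label = data.loc[:'New York', :'pop']

print(by_position.equals(by_label))
```

Note that the label slices include their end points, whereas the positional slices stop one short of the stop value.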


## Masking in a Pandas DataFrame

In [39]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [40]:
# Select the states where density > 100
# and output only the pop and density columns
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121
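
It may help to see the mask on its own before it is handed to loc. In this sketch the comparison is stored in an illustratively named variable `mask`; it is a boolean Series aligned with the rows of the frame, and loc keeps the rows where it is True.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# The comparison yields one True/False value per row
mask = data['density'] > 100
print(mask.tolist())

# loc combines the row mask with a list of column labels
dense = data.loc[mask, ['pop', 'density']]
print(list(dense.index))
```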


In [42]:
# Change the first value of density (California) to 90
data.iloc[0, 2] = 90

In [43]:
data

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
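
The same assignment can be written with labels instead of positions, which some readers find easier to audit. This is only a sketch of an alternative, not a correction of the cell above; the three-row frame is a cut-down version of the one in this section.

```python
import pandas as pd

data = pd.DataFrame({
    'area': {'California': 423967, 'Texas': 695662, 'New York': 141297},
    'pop': {'California': 38332521, 'Texas': 26448193, 'New York': 19651127},
})
data['density'] = data['pop'] / data['area']

# Label-based assignment: row 'California', column 'density'
data.loc['California', 'density'] = 90

# The position-based lookup sees the same cell
first_density = data.iloc[0, 2]
print(first_density)

# The other rows are left unchanged
texas_density = data.loc['Texas', 'density']
print(round(texas_density, 2))
```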


## Additional indexing conventions

In [44]:
data

              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [45]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


Such slices can also refer to rows by number rather than by index:

In [46]:
# 1 is the start position (the 2nd row) and is inclusive
# 3 is the stop position (the 4th row) and is exclusive
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


In [49]:
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
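
These shortcuts all act on rows, even though plain indexing such as data['area'] acts on columns. A hedged sketch of how each one lines up with its explicit loc or iloc spelling, assuming a Pandas version in which bracket slicing of a DataFrame selects rows, as the outputs above show:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# A label slice in brackets selects rows, like loc
label_rows = data['Florida':'Illinois']
print(label_rows.equals(data.loc['Florida':'Illinois']))

# An integer slice in brackets selects rows by position, like iloc
position_rows = data[1:3]
print(position_rows.equals(data.iloc[1:3]))

# A boolean mask in brackets filters rows, like loc with a mask
masked_rows = data[data['density'] > 100]
print(masked_rows.equals(data.loc[data['density'] > 100]))
```

Spelling the selection out with loc or iloc makes the intent explicit, at the cost of a few extra characters.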
