# Pandas

## Sections
1.  Introduction
2.  DataFrame
3.  Data Indexing and Selection
4.  Array-style indexing

<a id='intro'/>

## 1. Introduction

In [5]:
import pandas as pd

pd.__version__

'0.23.4'

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [6]:
# Create a Pandas Series object from a list
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the preceding output, the Series wraps both a sequence of values and a
sequence of indices, which we can access with the values and index attributes.

In [7]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [8]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [9]:
# As with a NumPy array, data can be accessed by the associated index via the familiar
# Python square-bracket notation
data[1]

0.5

In [10]:
# Slicing works as it does for a NumPy array
data[1:3]

1    0.50
2    0.75
dtype: float64

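So far the Series looks much like a one-dimensional NumPy array. The essential
difference is the presence of the index: while the NumPy array has an implicitly
defined integer index used to access the values, the Pandas Series has an explicitly
defined index associated with the values. That index does not have to be an integer;
it can consist of values of any desired type, such as strings.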

In [11]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data['b']

0.5

In [12]:
# We can even use noncontiguous or nonsequential indices
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data[5]

0.5

In [13]:
# A Series can also be seen as a specialized dictionary that maps index labels to values.
# We can make this analogy clearer by constructing a Series directly from a Python dictionary.

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [14]:
# Unlike a dictionary, the Series also supports array-style operations such as slicing
population['California': 'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
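One detail of the slice above is easy to miss: when slicing by label, both endpoints are included, whereas slicing by integer position follows the usual Python rule and excludes the stop position. The following is a small sketch of that difference; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Slicing by label: the end label ('Florida') is part of the result.
by_label = population['Texas':'Florida']

# Slicing by integer position: the stop position (3) is excluded.
by_position = population.iloc[1:3]

print(list(by_label.index))     # ['Texas', 'New York', 'Florida']
print(list(by_position.index))  # ['Texas', 'New York']
```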

<a id = 'df' />

## 2. DataFrame 
The next fundamental structure in Pandas is the DataFrame. Like the Series object
discussed in the previous section, the DataFrame can be thought of either as a generalization
of a NumPy array, or as a specialization of a Python dictionary. We’ll now
take a look at each of these perspectives.

In [15]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}

In [16]:
area = pd.Series(area_dict)

In [17]:
states = pd.DataFrame({'population': population,
                       'area': area})

In [18]:
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [19]:
# Like the Series object, the DataFrame has an index attribute that gives access to the
# index labels
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [20]:
# Additionally, the DataFrame has a columns attribute, which is an Index object holding
# the column labels

states.columns

Index(['population', 'area'], dtype='object')

### 2.1 DataFrame as a specialized dictionary

In [21]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array,
data[0] will return the first row. For a DataFrame, data['col0'] will return the first
column (assuming a column labeled 'col0'). Because of this, it is probably better to
think about DataFrames as generalized dictionaries rather than generalized arrays.
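The row-versus-column difference can be checked directly. This is a minimal sketch on a small made-up array; the names arr and frame and the labels 'col0' and 'col1' are illustrative and do not come from the dataset used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

# Indexing a 2D NumPy array with a single integer returns the first row.
first_row_of_array = arr[0]

# Indexing a DataFrame with a column label returns that column as a Series.
first_column_of_frame = frame['col0']

# To get the first row of a DataFrame, use positional indexing with iloc.
first_row_of_frame = frame.iloc[0]

print(first_row_of_array.tolist())     # [1, 2]
print(first_column_of_frame.tolist())  # [1, 3, 5]
print(first_row_of_frame.tolist())     # [1, 2]
```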

### 2.2 Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an explicit
index that lets you reference and modify data. This Index object is an interesting
structure in itself, and it can be thought of either as an immutable array or as an
ordered set (technically a multiset, as Index objects may contain repeated values).
Those views have some interesting consequences in the operations available on Index
objects. As a simple example, let’s construct an Index from a list of integers:

In [22]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [23]:
# Index is immutable
ind[1] = 0

TypeError: Index does not support mutable operations
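To illustrate the "immutable array" view, the sketch below reads from an Index with ordinary array syntax and then catches the TypeError raised by item assignment. The helper names (second, every_other, shape_info, assignment_failed) are only for illustration.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Reading works like a NumPy array: scalar indexing, slicing, shape attributes.
second = ind[1]
every_other = ind[::2]
shape_info = (ind.size, ind.shape, ind.ndim)

# Writing does not: item assignment raises TypeError.
try:
    ind[1] = 0
    assignment_failed = False
except TypeError as exc:
    assignment_failed = True
    print("TypeError:", exc)

print(second)             # 3
print(list(every_other))  # [2, 5, 11]
print(shape_info)         # (5, (5,), 1)
```

This immutability makes it safer to share one Index between several Series and DataFrames, since no object can modify it behind another's back.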

### 2.3 Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

In [24]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection



Int64Index([3, 5, 7], dtype='int64')

In [25]:
indA | indB # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [26]:
indA ^ indB # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
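The same set operations are also available as Index methods. The operator forms above match the pandas 0.23.4 used in this notebook; as far as I know, newer pandas releases no longer treat &, | and ^ on an Index as set operations, so the method forms sketched here are probably the more portable spelling.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method equivalents of the set operators used above.
common = indA.intersection(indB)
combined = indA.union(indB)
exclusive = indA.symmetric_difference(indB)

print(list(common))     # [3, 5, 7]
print(list(combined))   # [1, 2, 3, 5, 7, 9, 11]
print(list(exclusive))  # [1, 2, 9, 11]
```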

<a id ='index' />

## 3. Data Indexing and Selection

In [27]:
# Data selection in a DataFrame
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [28]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [29]:
# Membership operator
'area' in data

True

In [30]:
# Keys of the DataFrame, i.e. its column labels
data.keys()

Index(['area', 'pop'], dtype='object')

DataFrame objects can also be modified with a dictionary-like syntax. Just as you can
extend a dictionary by assigning to a new key, or extend a Series by assigning to a
new index value, you can add a new column to a DataFrame by assigning to a new
column label.
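The Series case mentioned above is not shown elsewhere in this notebook, so here is a minimal sketch of it. The Series mirrors the one from the introduction; the new label 'e' and its value are made up for illustration.

```python
import pandas as pd

series = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Assigning to a label that does not exist yet appends a new entry,
# just like assigning to a new key in a dictionary.
series['e'] = 1.25

print(list(series.index))  # ['a', 'b', 'c', 'd', 'e']
print(series['e'])         # 1.25
```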

In [31]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [32]:
# We can examine the raw underlying data array using the values attribute
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [33]:
# we can transpose the full DataFrame to swap rows and columns
data.T

           California         Texas      New York       Florida      Illinois
area     4.239670e+05  6.956620e+05  1.412970e+05  1.703120e+05  1.499950e+05
pop      3.833252e+07  2.644819e+07  1.965113e+07  1.955286e+07  1.288214e+07
density  9.041393e+01  3.801874e+01  1.390767e+02  1.148061e+02  8.588376e+01


<a id = 'indexing'/>

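## 4. Array-style indexing

Because dictionary-style indexing of a DataFrame selects columns, array-style indexing needs another convention. Here Pandas uses the loc and iloc indexers: iloc indexes by implicit integer position, while loc indexes by the explicit row and column labels.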

In [34]:
data.iloc[:3, :2]  # implicit integer positions; the stop position is excluded

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [35]:
data.loc[:'Illinois', :'pop']  # explicit labels; the end label is included

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
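The two indexers can select the same block of data, and loc also accepts a Boolean mask for the rows together with a list of column labels. The sketch below rebuilds the data frame from section 3 so that it runs on its own; the threshold of 100 is an arbitrary value chosen for illustration.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# Same selection two ways: positions (stop excluded) vs. labels (end included).
by_position = data.iloc[:3, :2]
by_label = data.loc[:'New York', :'pop']
print(by_position.equals(by_label))  # True

# Rows chosen by a Boolean mask, columns chosen by a list of labels.
dense = data.loc[data['density'] > 100, ['pop', 'density']]
print(list(dense.index))  # ['New York', 'Florida']
```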
