# Pandas

Pandas is a package built on top of NumPy that provides an efficient implementation of the DataFrame.

<font color=blue>DataFrames: </font> are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.

Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

In [2]:
import numpy as np
import pandas as pd

## Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data.

In [5]:
values = np.array([1, 2, 3, 4, 5])  # a Series can be created from a list or array
data = pd.Series(values)            # avoid naming a variable `list`: it shadows the built-in
data

0    1
1    2
2    3
3    4
4    5
dtype: int32

In [6]:
data.values  # the underlying NumPy array

array([1, 2, 3, 4, 5])

In [7]:
data.index

RangeIndex(start=0, stop=5, step=1)

#### Values can be accessed by the associated index

In [8]:
data[1]

2

In [11]:
data[0:4]

0    1
1    2
2    3
3    4
dtype: int32

A Pandas Series has an explicitly defined index associated with its values. This explicit index gives the Series object additional capabilities. For example, the index does not have to be an integer: we can use strings as index labels.

In [15]:
data2 = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
data2

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [16]:
data2['b']

2
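Label-based and position-based access can differ once the index is not the default integer range. The following is a minimal sketch of that difference using the `.loc` (label) and `.iloc` (position) accessors; the variable names are chosen here for illustration only.

```python
import pandas as pd

# Same data as data2 above, with a string index
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b']         # lookup by explicit index label
by_position = s.iloc[1]       # lookup by integer position, ignoring the labels

label_slice = s.loc['a':'c']  # label slices include the end label
position_slice = s.iloc[0:2]  # positional slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```

Note that the label slice returns three elements while the positional slice returns two, because only the label slice includes its endpoint.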

### Series as specialized dictionary
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it more efficient than a Python dictionary for certain operations.
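The dictionary analogy can be sketched as follows. The fruit names and prices are hypothetical data invented for this illustration; the point is only that a Series answers dictionary-style questions (membership, keys, item assignment) while also supporting array-style operations that a plain `dict` does not.

```python
import pandas as pd

# Hypothetical data: string keys mapped to float values
prices = pd.Series({'apple': 1.5, 'banana': 0.25, 'cherry': 4.0})

has_banana = 'banana' in prices   # membership checks the index, like dict keys
keys = list(prices.keys())        # keys() returns the index
prices['date'] = 3.0              # assigning to a new key extends the Series

# Unlike a dict, a Series also supports array-style operations
subset = prices.loc['apple':'cherry']  # label slice, end label included
doubled = prices * 2                   # vectorized arithmetic on all values

print(has_banana, keys)
print(subset.tolist())
print(prices.dtype)
```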

## Pandas DataFrame Object
### DataFrame as a generalized NumPy array
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
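To make the two-dimensional analogy concrete, here is a small sketch that wraps a plain NumPy array in a DataFrame. The row labels `r0`..`r2` and column names `x`, `y` are arbitrary names picked for this example.

```python
import numpy as np
import pandas as pd

# A plain 3x2 NumPy array: [[0, 1], [2, 3], [4, 5]]
grid = np.arange(6).reshape(3, 2)

# The same data with row labels and column names attached
df = pd.DataFrame(grid, index=['r0', 'r1', 'r2'], columns=['x', 'y'])

cell = df.loc['r1', 'y']      # address a cell by row label and column name
same_cell = grid[1, 1]        # the equivalent positional lookup in NumPy
back_to_array = df.to_numpy() # recover the underlying 2D array

print(df.shape)
print(cell, same_cell)
```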

In [26]:
area = {'California': 123123, 'Texas': 434365345, 'New York': 67876876, 'Florida': 43095843095, 'Illinois':12321313}
areaSerie = pd.Series(area)
areaSerie

California         123123
Texas           434365345
New York         67876876
Florida       43095843095
Illinois         12321313
dtype: int64

In [27]:
population_dict = {'California': 38332521,'Texas': 26448193,'New York': 19651127,'Florida': 19552860,'Illinois': 12882135} 
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [28]:
states = pd.DataFrame({'population': population, 'area': areaSerie})  # build from the two Series
states

            population         area
California    38332521       123123
Florida       19552860  43095843095
Illinois      12882135     12321313
New York      19651127     67876876
Texas         26448193    434365345


In [29]:
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [30]:
states.columns

Index(['population', 'area'], dtype='object')

In [31]:
states['area']

California         123123
Florida       43095843095
Illinois         12321313
New York         67876876
Texas           434365345
Name: area, dtype: int64
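Because each column is a Series sharing the same index, columns can be combined arithmetically and rows can be looked up by label. The sketch below reuses two of the states and the same figures as above; the `density` column name is an assumption introduced for this example.

```python
import pandas as pd

# Two of the states from the example above, with the same figures
population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 123123, 'Texas': 434365345})
states = pd.DataFrame({'population': population, 'area': area})

# Column arithmetic is aligned on the shared index
states['density'] = states['population'] / states['area']

texas_row = states.loc['Texas']                  # one row, returned as a Series
texas_pop = states.loc['Texas', 'population']    # one cell, by row and column label

print(states.columns.tolist())
print(texas_row)
```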