# DATA MANIPULATION IN PANDAS - PART 1
This work focuses on the mechanics of using:
1. Series,
2. DataFrame,
3. Index

and related structures effectively, using real datasets.

# THE PANDAS SERIES OBJECTS
A Pandas Series is a one-dimensional array of indexed data that can be created from a list or an array. It can be thought of as:
1. A generalized NumPy array, and
2. A specialized dictionary

In [1]:
#Creating a pandas series object
import pandas as pd
import numpy as np
data = pd.Series([0.25,0.5,0.75,1.0])
print(data)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64


##### The output of the code above shows that the series wraps both a sequence of values and a sequence of indices with which we can access the values

In [2]:
# Accessing the values and indices of a Pandas Series
print("The values of the Series are \n", data.values)
print("\nThe indices of the Series are \n",data.index)

The values of the Series are 
 [0.25 0.5  0.75 1.  ]

The indices of the Series are 
 RangeIndex(start=0, stop=4, step=1)


In [3]:
# Accessing data using the associated index
# Just like the numpy array
print("The value at index 1 is ", data[1])
print("\nThe values at indices 1 and 2 are \n", data[1:3])

The value at index 1 is  0.5

The values at indices 1 and 2 are 
 1    0.50
2    0.75
dtype: float64


##### This points to the fact that the Pandas Series is much more general and flexible than the NumPy array

## 1. Series as a Generalized Numpy Array
In NumPy arrays, the indices are implicitly defined, but in a Pandas Series the indices are explicitly defined and associated with the values. As a result, the index is not limited to integers.

In [4]:
# Using Strings as an Index in Pandas Series
data1 = pd.Series([0.25,0.5, 0.75,1.0], index=['a','b','c','d'])
print(data1)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


In [5]:
# Access the data values using the string indices
data1['a']

0.25

In [6]:
# The indices don't have to be contiguous or sequential
data1 = pd.Series([0.25,0.5,0.75,1.0], index=[2,5,3,7])
data1

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

## 2. Series as a Specialized Dictionary
A Pandas Series is like a specialized version of a Python dictionary

#### What is a dictionary?
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values

#### What then is a Series?
A Series is a structure that maps typed keys to a set of typed values. Typing is very important in a Pandas Series, as this is what makes it more efficient than a Python dictionary for certain operations

### !!! We can make a Pandas Series Object from a list or a Dictionary 

In [7]:
# Creating a Pandas Series from a Python dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


#### Note that the states are the indices, just as in a Series created from a list, but here we refer to the indices as keys. Typical dictionary-style access can therefore be performed

In [8]:
print(population['California'])
print(population['Texas'])

38332521
26448193
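
Unlike a plain dictionary, a Series also supports array-style operations on its keys, such as slicing by label. The following is a minimal sketch of that difference; it rebuilds the same `population` Series so that it runs on its own, and the name `subset` is introduced only for illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Slicing by label is not possible with a dict, but works on a Series.
# With label-based slices, both endpoints are included.
subset = population['California':'New York']
print(subset)
```

The slice follows the order of the index, so it returns California, Texas and New York.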


## Constructing Series Objects
The different ways of constructing Pandas Series objects all follow the syntax given below:
pd.Series(data,index=index)

In [9]:
# Ways of constructing a Pandas Series Object
#1. Data can be a list or Numpy Array
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64

In [10]:
#2. Data can be scalar which is repeated to fill a specified index
pd.Series(5, index=[100,200,300])

100    5
200    5
300    5
dtype: int64

In [11]:
#3. Data can be a dictionary, in which case the index defaults to the dictionary keys (in insertion order, as the output below shows)
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [12]:
# The index can be set explicitly; the Series is then populated only with the keys listed in it
pd.Series({2:'a',1:'b',3:'c'}, index=[3,2])

3    c
2    a
dtype: object
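
A related case, sketched here as an illustration rather than taken from the cells above: when the explicit index contains a label that is not a key of the dictionary, that entry is filled with NaN. The label `4` and the name `s` are made up for this example.

```python
import pandas as pd

# Label 4 is not a key of the dictionary, so its value is missing (NaN)
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 4])
print(s)
```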

# The Pandas DATAFRAME OBJECT


The DataFrame can be thought of either as:

1. A generalization of a NumPy array, or

2. A specialization of a Python dictionary

## 1. DataFrame as a Generalized Numpy Array
If a Series is an analog of a 1D array with flexible indices, a DataFrame is an analog of a 2D array with flexible row indices and flexible column names.

Just as a 2D array can be seen as an ordered sequence of aligned 1D columns, a DataFrame can be seen as a sequence of aligned Series objects, i.e. Series objects that share the same index

In [13]:
# Let's demo this analogy with a new Series listing the area of the 5 states above
area_dict = {'California': 423967,'Texas': 695662,'New York': 141297,'Florida': 170312,'Illinois': 149995}
area = pd.Series(area_dict)

# Now let's create a simple DataFrame which includes the population and area
states = pd.DataFrame({'population':population,'area':area})

#print the DataFrame
print(states)

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


### PARTS OF A DATAFRAME
To better understand a DataFrame, it is important to know that it consists of three components, stored as attributes:

1. .values: A two-dimensional NumPy array of values.
2. .columns: An index of columns: the column names.
3. .index: An index for the rows: either row numbers or row names.


In [14]:
# As you can see above, the state names form the index, which gives access to the row labels
# The DataFrame therefore has an index attribute and a columns attribute
print("\n The 2D Numpy Array of Values - \n", states.values)
print("\n Index -", states.index)
print("\n Columns - ",states.columns)



 The 2D Numpy Array of Values - 
 [[38332521   423967]
 [26448193   695662]
 [19651127   141297]
 [19552860   170312]
 [12882135   149995]]

 Index - Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

 Columns -  Index(['population', 'area'], dtype='object')


## 2. DataFrame as a Specialized Dictionary
A DataFrame can be thought of as a specialized dictionary that maps each column name to a Series of column data

In [15]:
# DataFrame as a Specialized Dictionary
print("States and their Area\n", states['area'])
print("\nStates and their Population\n",states['population'])

States and their Area
 California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

States and their Population
 California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64


## Constructing DataFrame Objects
The following are the ways of constructing DataFrame Objects 
1. From a single Series Object
2. From a list of dicts
3. From a dictionary of Series Objects
4. From a 2D Numpy Array
5. From a Numpy Structured Array

### 1. From a single Series Object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series

In [16]:
# Example
pd.DataFrame(population,columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


### 2. From a List of Dicts
Any list of dictionaries can be made into a DataFrame

In [17]:
# Example using a simple list comprehension
data3 = [{'a':i,'b':2*i} for i in range(3)]
pd.DataFrame(data3)

   a  b
0  0  0
1  1  2
2  2  4


In [18]:
# Example where you have missing values, pandas fills them with NaN
pd.DataFrame([{'a':1,'b':2},{'b':3,'c':4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


### 3. From a Dictionary of Series Objects
As we have shown earlier, a DataFrame can be constructed from a dictionary of Series Objects

In [19]:
# Example
pd.DataFrame({'population':population,'area':area})

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


### 4. From a 2D Numpy Array
Given a 2D array of data, we can create a DataFrame with specified column and index names. If either is omitted, an integer index will be used for it

In [20]:
# Example
pd.DataFrame(np.random.rand(3,2),columns=['foo','bar'],index=['a','b','c'])

        foo       bar
a  0.421479  0.371401
b  0.453058  0.438047
c  0.908293  0.493588


### 5. From a Numpy Structured Array
A Pandas DataFrame operates much like a structured array and can be created directly from one

In [21]:
A = np.zeros(3, dtype=[('A','i8'),('B','f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [22]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


# The Pandas INDEX OBJECT
So we have seen that both Series and DataFrame objects contain an explicit index that lets you reference and modify data. The Index object can be thought of either as:
1. An immutable array, or
2. An ordered set

### 1. Index as an Immutable Array

In [23]:
# Example
# Let's create an index from a list of integers
ind = pd.Index([2,3,5,7,11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

#### Operates as an Array using Python indexing notation

In [24]:
print(ind[1]) #print element at index 1
print(ind[::2]) #print every other element

3
Int64Index([2, 5, 11], dtype='int64')


#### Index Objects have many attributes similar to Numpy Array

In [25]:
print("\n",ind.size,"\n",ind.shape,"\n",ind.ndim,"\n",ind.dtype)


 5 
 (5,) 
 1 
 int64


#### A difference between Index objects and NumPy arrays is that indices are immutable, that is, they cannot be modified by normal means

In [26]:
# ind[1] = 0  # raises a TypeError because Index objects do not support item assignment

#### It is this immutability that makes it safer to share indices between multiple DataFrames and arrays, without side effects from inadvertent index modification
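
As a small sketch of what the commented-out assignment does when it is actually run, the cell below wraps it in a `try`/`except` so the error is caught and printed rather than stopping execution. The flag `modified` is a name introduced here for illustration.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0          # item assignment is not supported on an Index
    modified = True
except TypeError as err:
    modified = False
    print("TypeError:", err)

print(ind)              # the index is unchanged
```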

### 2. Index as Ordered Set
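Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences and other combinations can be computed in a familiar way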

In [27]:
indA = pd.Index([1,3,5,7,9])
indB = pd.Index([2,3,5,7,11])

In [28]:
# Intersection
pd.Index.intersection(indA,indB)

Int64Index([3, 5, 7], dtype='int64')

In [29]:
# Union
pd.Index.union(indA,indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [30]:
# Difference (elements of indA that are not in indB)
pd.Index.difference(indA,indB)

Int64Index([1, 9], dtype='int64')
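
Note that `Index.difference` returns only the elements of the first index that are missing from the second. A symmetric difference (elements that appear in exactly one of the two indices) is available through `Index.symmetric_difference`. The sketch below contrasts the two; the names `diff` and `sym` are introduced for illustration.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Elements of indA that are not in indB
diff = indA.difference(indB)

# Elements that appear in exactly one of the two indices
sym = indA.symmetric_difference(indB)

print(list(diff))   # [1, 9]
print(list(sym))    # [1, 2, 9, 11]
```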

# DATA INDEXING & SELECTION
Just as we performed indexing and slicing operations on Numpy Arrays, it is possible to do such with Pandas Series and DataFrames

### DATA SELECTION IN SERIES
It is obvious now that a Pandas Series acts like a 1D Numpy Array and a Python Dictionary. This will help us understand how to perform data indexing and selection in Pandas

### 1. Series as a Dictionary
Just as a Python dictionary maps keys to values, a Pandas Series maps a collection of keys to a collection of values

In [31]:
# Examples
data4 = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
print(data4)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


In [32]:
# Access the values using keys
data4['b':'d']

b    0.50
c    0.75
d    1.00
dtype: float64

#### Access the values using dictionary-like Python expressions and methods

In [33]:
# Example
'a' in data4

True

In [34]:
# Access the keys
data4.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [35]:
# Access the data items
list(data4.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [36]:
# A Series is mutable and can be modified like a dictionary
data4['e'] = 1.25
data4

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### This mutability of the Pandas Series object is a convenience for the user, who does not need to worry about memory layout and data copying

### 2. Series as a 1D Array
In addition to the dictionary-like interface, a Pandas Series provides array-style item selection through the same basic mechanisms as NumPy arrays, i.e. slicing, masking and fancy indexing

In [37]:
# Example - Slicing by explicit index
data4['a':'d']
# Here you will observe that the final index is included in the slice

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [38]:
# Example - Slicing by implicit integer index
data4[0:2]
# Here you will observe that the final index is excluded in the slice

a    0.25
b    0.50
dtype: float64

In [39]:
# Example - Masking
data4[(data4 > 0.3) & (data4 < 0.8)]

b    0.50
c    0.75
dtype: float64

In [40]:
# Example - Fancy Indexing
data4[['a','e','c']]

a    0.25
e    1.25
c    0.75
dtype: float64

### 3. Using Indexers: loc and iloc
The differences between explicit and implicit indexing shown above can cause a lot of confusion and introduce bugs in your code. That is why Pandas provides special indexer attributes that expose a particular slicing interface to the data in a Series: loc and iloc

#### 1. loc - The explicit indexing attribute

In [41]:
# Example - The loc attribute always references the explicit index; slices include both the first and the last label
datta = pd.Series([0.1,0.2,0.3,0.4,0.5,0.6])
print("\n Data at index 1\n",datta.loc[1])
print("\n Data at indices 1 to 3\n",datta.loc[1:3])
print("\n Data at indices 2 to 5\n",datta.loc[2:5])


 Data at index 1
 0.2

 Data at indices 1 to 3
 1    0.2
2    0.3
3    0.4
dtype: float64

 Data at indices 2 to 5
 2    0.3
3    0.4
4    0.5
5    0.6
dtype: float64


#### 2. iloc - The Implicit indexing attribute

In [42]:
# Example - The iloc attribute always references the implicit (positional) index; slices exclude the stop position
print("\n Data at index 1 \n", datta.iloc[1])
print("\n Data at indices 1 to 3\n",datta.iloc[1:3])


 Data at index 1 
 0.2

 Data at indices 1 to 3
 1    0.2
2    0.3
dtype: float64


#### Explicit is better than implicit. The explicit nature of loc and iloc makes them very useful for maintaining clean and readable code, especially in the case of integer indexes
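
The difference between the two indexers is easiest to see when the integer labels do not match the positions. The following is a minimal sketch using a made-up Series `s` labelled 1, 3, 5; all variable names are introduced for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = s.loc[1]         # label 1    -> 'a'
by_position = s.iloc[1]     # position 1 -> 'b'

label_slice = s.loc[1:3]    # labels 1 through 3, both included
position_slice = s.iloc[1:3]  # positions 1 and 2, stop excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Because the same expression `1:3` selects different rows under `loc` and `iloc`, stating the indexer explicitly removes the ambiguity.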

### DATA SELECTION IN DATAFRAME
Selecting data in a dataframe exploits the characteristic feature of a dataframe to behave like:
1. A 2D or structured array and like 
2. A Dictionary of Series which share the same index.

### 1. DataFrame as a Dictionary

In [43]:
censusdata = pd.DataFrame({'area':area,'population':population})
censusdata

              area  population
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127
Florida     170312    19552860
Illinois    149995    12882135


In [44]:
# Example - To access the individual Series that make up the dataframe, we use dictionary style indexing
censusdata['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [45]:
# Example - We can also use attribute style access using the column names that are strings
censusdata.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

#### You will observe that both styles give the same result, but there is a caveat: attribute-style access does not work in all cases, for example if the column names are not strings or if a column name conflicts with a method of the DataFrame. A column named 'pop' would clash with the DataFrame's pop() method, so data['pop'] is safer than data.pop. In general, prefer dictionary-style access, especially for assignment
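
The name clash can be sketched with a small made-up DataFrame that has a column called `pop`. The frame `df` and its values are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# Attribute access resolves to the DataFrame.pop method, not the column
print(callable(df.pop))    # True

# Dictionary-style access always returns the column
print(df['pop'])

# For a non-conflicting name, both styles return the same data
print(df.area.equals(df['area']))
```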

In [46]:
# Example - using dictionary style to create or modify new columns
censusdata['density'] = censusdata['population']/censusdata['area']
censusdata

              area  population     density
California  423967    38332521   90.413926
Texas       695662    26448193   38.018740
New York    141297    19651127  139.076746
Florida     170312    19552860  114.806121
Illinois    149995    12882135   85.883763


### DataFrame as a 2D Array
We can view the DataFrame as an enhanced 2D array by examining the underlying data array using values

In [47]:
# Example - Examining the raw data values of a dataframe
censusdata.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

#### But it is better to work with the column names and the row labels intact during operations

In [48]:
# Example - Transpose the dataframe
censusdata.T

            California       Texas    New York     Florida    Illinois
area          423967.0    695662.0    141297.0    170312.0    149995.0
population  38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density       90.41393    38.01874    139.0767    114.8061    85.88376


In [49]:
np.linalg.norm(censusdata)

55714044.648482256

### Accessing the 2D array is better done using Pandas Attribute Indexing - loc and iloc
Using these attributes, we can index the underlying array as if it were a simple NumPy array, while the DataFrame index and column labels are maintained in the result

In [50]:
censusdata.iloc[:3,:2] # Select the first three rows and the first two columns

              area  population
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127


In [51]:
censusdata.loc[:'Florida',:'area'] # Select rows up to and including Florida and columns up to and including area

              area
California  423967
Texas       695662
New York    141297
Florida     170312


#### Combined indexing, as in NumPy arrays, is possible using the loc indexer

In [52]:
# Example - Combining Masking and Fancy Indexing
censusdata.loc[censusdata.density > 100, ['population','density']]


          population     density
New York    19651127  139.076746
Florida     19552860  114.806121


### INSPECTING A DATAFRAME
When you obtain a new DataFrame, the first thing to do is inspect it, using pandas methods and attributes such as .head(), .info(), .shape and .describe().

Finding interesting bits of information can be done by changing the order of the rows. You can sort the rows by passing a column name to .sort_values(). You can sort on one column or on multiple columns.

By combining .sort_values() with .head(), you obtain the top rows according to a given criterion.

In [53]:
#Example
homelessness = pd.read_csv('../Data/homelessness.csv', index_col=0)

In [54]:
#print the head of the data
print(homelessness.head())

               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588


In [55]:
#print information about homelessness
print(homelessness.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None


In [56]:
#print the shape of homelessness
print(homelessness.shape)

(51, 5)


In [57]:
#print the description of homelessness
print(homelessness.describe())

         individuals  family_members     state_pop
count      51.000000       51.000000  5.100000e+01
mean     7225.784314     3504.882353  6.405637e+06
std     15991.025083     7805.411811  7.327258e+06
min       434.000000       75.000000  5.776010e+05
25%      1446.500000      592.000000  1.777414e+06
50%      3082.000000     1482.000000  4.461153e+06
75%      6781.500000     3196.000000  7.340946e+06
max    109008.000000    52070.000000  3.946159e+07
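

Sorting with `.sort_values()` and then taking `.head()` can be sketched as follows. Because the homelessness CSV file is not reproduced here, the example uses a tiny made-up DataFrame named `toy`; its column names mirror the real data, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({'state': ['A', 'B', 'C', 'D'],
                    'region': ['East', 'West', 'East', 'West'],
                    'individuals': [2570, 1434, 7259, 2280]})

# Top two rows by number of individuals (largest first)
top2 = toy.sort_values('individuals', ascending=False).head(2)
print(top2)

# Sort on several columns: region ascending, then individuals descending
by_region = toy.sort_values(['region', 'individuals'],
                            ascending=[True, False])
print(by_region)
```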


### Other Indexing Conventions
It is important to know that, with plain square brackets, indexing refers to columns while slicing refers to rows. Remember that loc and iloc remain the gold standard.

In [58]:
# Example
#select rows from Florida to Illinois
censusdata['Florida':'Illinois'] 

            area  population     density
Florida   170312    19552860  114.806121
Illinois  149995    12882135   85.883763


In [59]:
# Select the rows at positions 1 and 2; the stop position 3 is not included
censusdata[1:3] 

            area  population     density
Texas     695662    26448193   38.018740
New York  141297    19651127  139.076746


In [60]:
# Example - Direct masking operations are interpreted row-wise rather than column-wise
censusdata[censusdata.density > 100]

            area  population     density
New York  141297    19651127  139.076746
Florida   170312    19552860  114.806121


# OPERATING ON DATA IN PANDAS
Pandas offers a better alternative to NumPy arrays when operating on data because it preserves the context of the data (index and column labels) and aligns data coming from different sources.

These are two potentially error-prone tasks with raw NumPy arrays, and Pandas handles both automatically.

1. When working with ufuncs, Pandas will preserve the indices and column labels
2. For binary operations like addition and subtraction, Pandas automatically aligns the indices when passing the objects to the ufunc

### 1. Index Preservation with Universal Functions (ufuncs)
Because Pandas is designed to work with NumPy, any NumPy ufunc will work with Pandas Series and DataFrame objects

In [61]:
rng = np.random.RandomState(42)
ss = pd.Series(rng.randint(0,10,4))
ss

0    6
1    3
2    7
3    4
dtype: int64

In [62]:
df = pd.DataFrame(rng.randint(0,10,(3,4)), columns=['A','B','C','D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


#### Applying a ufunc to either of these objects gives another Pandas object with the indices preserved

In [63]:
# Example - Take the exponential of the Series
np.exp(ss)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [64]:
# Example - Take the sine of the DataFrame after scaling it by pi/4
np.sin(df*np.pi/4)

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16


### 2. Index Alignment with Universal Functions
For binary Operations on 2 series or dataframes, pandas will align the indices while performing operations. This is useful when working with incomplete data

In [65]:
# Example - Working with data from different sources
area_a = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
pop_p = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')

# calculate the population density
pop_p/area_a
# The state names are aligned automatically and used as the index of the result

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

**The resulting Series contains the union of the indices of the two input Series, following the same logic as Python's set arithmetic applied to the two indices. Any item for which one of the Series has no entry is marked with NaN, which is how Pandas marks missing data**
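
If NaN is not the desired behaviour, the method form of the operator accepts a `fill_value` that is substituted for entries missing from one of the inputs. The sketch below uses two small made-up Series, `s1` and `s2`, to compare the `+` operator with `Series.add`.

```python
import pandas as pd

s1 = pd.Series([2, 4, 6], index=[0, 1, 2])
s2 = pd.Series([1, 3, 5], index=[1, 2, 3])

# Operator form: labels present in only one Series give NaN
plain = s1 + s2
print(plain)

# Method form: a missing entry is treated as 0 before adding
filled = s1.add(s2, fill_value=0)
print(filled)
```

Labels 0 and 3 each appear in only one input, so they are NaN in `plain` but keep their single available value in `filled`.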

### 3. Index Alignment in DataFrame
When performing operations on dataframe, index alignment occurs on both rows and columns

In [66]:
# Example - Addition operation on two different dataframes
A = pd.DataFrame(rng.randint(0,20,(2,2)), columns=list('AB'))
B = pd.DataFrame(rng.randint(0,10,(3,3)), columns=list('BAC'))
A + B

      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN


**You will observe that elements with matching row and column labels are added together irrespective of their order in the two objects, and the results are placed in a new DataFrame. Labels present in only one of the inputs produce NaN. This is the alignment we are talking about**

### 4. Operations between DataFrames & Series with Universal Functions
In this situation, the index and column alignment are maintained as well.

Operations between a DataFrame and a Series are similar to operations between a 2D and a 1D NumPy array

In [67]:
# Examples - Difference between a dataframe and one of its rows

Np_array = rng.randint(10,size=(3,4))

print("\nDifference btw a 2D array and one of its rows\n", Np_array - Np_array[0])

print("\nDifference btw Dataframe and a row-Series \n", B - B.iloc[0])

#Both operate row-wise


Difference btw a 2D array and one of its rows
 [[ 0  0  0  0]
 [-1 -2  2  4]
 [ 3 -7  1  4]]

Difference btw Dataframe and a row-Series 
    B  A  C
0  0  0  0
1  1  8 -9
2  5  2 -3


#### For column-wise operation in a dataframe, the axis has to be specified

In [68]:
# Example - Column-wise operation between a dataframe and a series
B.subtract(B['A'], axis=0)

   B  A  C
0  4  0  9
1 -3  0 -8
2  7  0  4
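

DataFrame and Series operations also align labels automatically, just like the operations shown earlier. The following is a minimal sketch, assuming a small made-up DataFrame `df` and a Series `halfrow` that holds only every other column of its first row.

```python
import pandas as pd

df = pd.DataFrame([[3, 8, 2, 4],
                   [2, 6, 4, 8],
                   [6, 1, 3, 8]], columns=list('QRST'))

# Take only columns Q and S of the first row
halfrow = df.iloc[0, ::2]

# The Series index is aligned with the DataFrame columns;
# columns missing from halfrow (R and T) become NaN
result = df - halfrow
print(result)
```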


### HANDLING MISSING DATA
In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data.

### General Conventions for Indicating Missing Data
Two strategies are commonly used to indicate the presence of missing data, and both have trade-offs:

1. Using a mask that globally indicates missing values, e.g. an entirely separate boolean array. This adds overhead in both storage and computation

2. Choosing a sentinel value that indicates a missing entry, e.g. -9999. This reduces the range of valid values that can be represented and may require extra logic in the computation


### MISSING DATA IN PANDAS
Pandas uses two sentinels for missing data: the special floating-point NaN value and the Python None object.

1. None Python Object
Because None is a Python object, it can only be stored in arrays of dtype object. Operations on such arrays are done at the Python level, with much more overhead than the typically fast operations seen with NumPy arrays.

Aggregations such as sum() or min() on an array that contains None will generally raise an error.

2. NaN
This is a floating-point value and supports fast operations pushed into compiled code.
Aggregations over data containing NaN do not raise an error, but the result is NaN, which is not always useful.
NumPy provides some special aggregations that ignore the missing values:

np.nansum(), np.nanmin(), np.nanmax()
If NaN and None both appear in numeric data, Pandas converts the None to a floating-point NaN.
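
The behaviours described above can be checked with a short sketch. The arrays and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# An array containing None has dtype object; summing it fails
obj_arr = np.array([1, None, 3, 4])
try:
    obj_arr.sum()
    raised = False
except TypeError:
    raised = True
print(obj_arr.dtype, "- sum raised TypeError:", raised)

# An array containing NaN is a float array; the sum is NaN
nan_arr = np.array([1, np.nan, 3, 4])
total = nan_arr.sum()
safe_total = np.nansum(nan_arr)   # ignores the NaN
print(total, safe_total)

# In a Pandas Series of numbers, None is converted to NaN
s = pd.Series([1, np.nan, 2, None])
print(s)
```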

### OPERATING ON NULL VALUES
There are several methods for detecting, removing and replacing null values in Pandas data structures

#### Detecting Null Values
Pandas data structures have two methods for detecting null data:
1. isnull and 
2. notnull

Either one will return a boolean mask over the data


In [69]:
# Example 
dt = pd.Series([1, np.nan, 'hello', None])
dt.isnull()


0    False
1     True
2    False
3     True
dtype: bool

In [70]:
#The resulting boolean mask array can then be used as a Series or dataframe index
dt[dt.notnull()]

0        1
2    hello
dtype: object

#### Dropping Null Values

Pandas provides two convenience methods for handling NA values: dropna() removes them and fillna() fills them in.


In [71]:
# Example
dt.dropna()

0        1
2    hello
dtype: object

In [72]:
# For a DataFrame, there are more options
dm = pd.DataFrame([[1,np.nan,2],
                  [2,3,5],
                  [np.nan,4,6]])
print(dm)

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present.


In [73]:
# Example
dm.dropna()

     0    1  2
1  2.0  3.0  5


In [74]:
# You can drop NA values along the other axis; axis=1 (or axis='columns') drops every column that contains a null value
dm.dropna(axis='columns')

   2
0  2
1  5
2  6


You might instead be interested in dropping only the rows or columns in which all values are NA, or those with a majority of NA values. This can be specified with the 'how' and 'thresh' parameters.

In [75]:
#Example Using the 'how' parameter
dm[3] = np.nan #set the fourth column to NaN
dm

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [76]:
#Drop the fourth column using the how parameter set to all
dm.dropna(axis='columns', how='all')


     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


In [77]:
# Example using the 'thresh' parameter
dm.dropna(axis='rows', thresh=3)  # thresh is the minimum number of non-null values a row needs in order to be kept
# The first and last rows are dropped because they contain only two non-null values

     0    1  2   3
1  2.0  3.0  5 NaN


#### Filling Null Values
Rather than dropping null values, you may prefer to replace them with valid values. The replacement could be a single number such as zero, or some kind of imputation or interpolation from the good values. This can be done using the fillna method.

In [78]:
#Example
dt = pd.Series([1, np.nan, 2, None, 3], index= list('abcde'))
print(dt)

#Fill NA values with zero
dt.fillna(0)

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64


a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [79]:
#Example - Forward fill to propagate the previous value forward
dt.ffill()  # equivalent to the older dt.fillna(method='ffill'), which newer pandas versions deprecate

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [80]:
#Example - Backward fill to propagate the next value backward
dt.bfill()  # equivalent to the older dt.fillna(method='bfill'), which newer pandas versions deprecate

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
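
Imputation and interpolation, mentioned above as alternatives to a constant fill, can be sketched as follows. The example rebuilds the same Series so that it runs on its own, fills with the mean of the non-null values, and then uses `Series.interpolate()` with its default linear method. The names `mean_filled` and `interpolated` are introduced for illustration.

```python
import numpy as np
import pandas as pd

dt = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Imputation: replace nulls with the mean of the non-null values (2.0 here)
mean_filled = dt.fillna(dt.mean())
print(mean_filled)

# Interpolation: fill each gap linearly between its neighbours
interpolated = dt.interpolate()
print(interpolated)
```

Which strategy is appropriate depends on the data; a mean fill ignores ordering, while interpolation assumes that neighbouring values are related.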

In [81]:
#Example - For DataFrames the options are similar, but we can also specify the axis along which the fill takes place
print(dm)
dm.ffill(axis=1)  # same result as the older dm.fillna(method='ffill', axis=1)

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
