# Missing Values and Handling Them

- The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.

## Trade-Offs in Missing Data Conventions

- There are a number of schemes that have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.

- In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.

- In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.
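As a minimal sketch of the two strategies, the snippet below stores the same readings once with a separate Boolean mask and once with a sentinel. The variable names, the sample readings, and the -9999 sentinel are illustrative assumptions, not a convention required by any library.

```python
import numpy as np

# Hypothetical sensor readings; the second one was never recorded.
readings = np.array([12, 0, 7, 9])

# Masking approach: a separate Boolean array flags the missing entry.
is_missing = np.array([False, True, False, False])
mean_with_mask = readings[~is_missing].mean()

# Sentinel approach: a reserved value (-9999 here) marks the missing entry.
SENTINEL = -9999
readings_sentinel = np.array([12, SENTINEL, 7, 9])
mean_with_sentinel = readings_sentinel[readings_sentinel != SENTINEL].mean()

print(mean_with_mask, mean_with_sentinel)
```

The mask costs an extra array, while the sentinel removes one value from the range of valid data.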

# Missing Data in Pandas


- NumPy does have support for masked arrays, that is, arrays that have a separate Boolean mask array attached for marking data as "good" or "bad." Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.

- With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.

## None: Pythonic missing data

- The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):

In [1]:
import numpy as np
import pandas as pd

In [7]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

- This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types.
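The overhead can be made visible with a rough timing sketch. The array size and repeat count below are arbitrary assumptions, and absolute timings depend on the machine, so no expected numbers are shown.

```python
import timeit
import numpy as np

# Same values, stored once as Python objects and once as native integers.
obj_arr = np.arange(100_000, dtype=object)
int_arr = np.arange(100_000, dtype=np.int64)

# Time 20 sums of each array; the object array is summed at the Python level.
obj_seconds = timeit.timeit(obj_arr.sum, number=20)
int_seconds = timeit.timeit(int_arr.sum, number=20)

print(f"dtype=object: {obj_seconds:.4f} s")
print(f"dtype=int64:  {int_seconds:.4f} s")
```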

- The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:

In [8]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In [9]:
vals1.min()

TypeError: '<=' not supported between instances of 'int' and 'NoneType'

## NaN: Missing numerical data

- The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [10]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

- Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that NaN is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:

In [11]:
1+ vals2

array([ 2., nan,  4.,  5.])

In [12]:
1+ np.nan

nan

In [13]:
1* np.nan

nan

In [14]:
0*np.nan

nan

In [15]:
vals2.sum(),vals2.min(),vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [17]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2), np.nanmean(vals2)

(8.0, 1.0, 4.0, 2.6666666666666665)

- Keep in mind that NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
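A small sketch of what this means in practice: an integer NumPy array refuses a NaN, so the array has to be converted to a floating-point type first. The array contents are illustrative.

```python
import numpy as np

int_arr = np.array([1, 2, 3])          # native integer dtype
error_message = None
try:
    int_arr[0] = np.nan                # integers have no NaN representation
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", exc)

float_arr = int_arr.astype(float)      # convert explicitly before storing NaN
float_arr[0] = np.nan
print(float_arr)
```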

- np.nan, None, and NaT (for datetime64[ns] types) are the standard missing values in Pandas.

- Note: Pandas 1.0 introduced a new missing value marker, pd.NA (displayed as <NA>), which is used by the nullable extension dtypes such as Int64. Because np.nan is a float, using it in a column of integers upcasts the column to a floating-point dtype. <NA>, by contrast, can be used with the nullable integer dtypes without causing upcasting.
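The difference can be sketched with two short Series; the values are illustrative, and "Int64" (with a capital I) is the name of the pandas nullable integer dtype.

```python
import numpy as np
import pandas as pd

# Default behaviour: the missing entry forces an upcast to float64.
as_float = pd.Series([1, np.nan, 3])

# Nullable integer dtype: values stay integers and the gap is shown as <NA>.
as_nullable_int = pd.Series([1, None, 3], dtype="Int64")

print(as_float.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.isna().tolist())
```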

## NaN and None in Pandas

- NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [19]:
pd.Series([1, np.nan, 2, None])


0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [21]:
vals2 = np.array([1, np.nan, 3, 4,None]) 
vals2

array([1, nan, 3, 4, None], dtype=object)

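| Typeclass | Conversion when storing NAs | NA sentinel value |
|-----------|-----------------------------|-------------------|
| floating  | No change                   | np.nan            |
| object    | No change                   | None or np.nan    |
| integer   | Cast to float64             | np.nan            |
| boolean   | Cast to object              | None or np.nan    |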
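The integer and boolean rows of the table can be checked with a short sketch that builds each Series with and without a missing entry. It uses the default (non-nullable) dtypes, and the values are illustrative.

```python
import pandas as pd

ints = pd.Series([1, 2])                 # integer dtype
ints_with_na = pd.Series([1, None])      # None becomes NaN, dtype becomes float64
bools = pd.Series([True, False])         # bool dtype
bools_with_na = pd.Series([True, None])  # falls back to object dtype

for s in (ints, ints_with_na, bools, bools_with_na):
    print(s.dtype)
```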

# Operating on Null Values

- As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

- isnull(): Generate a Boolean mask indicating missing values
- notnull(): Opposite of isnull()
- dropna(): Return a filtered version of the data
- fillna(): Return a copy of the data with missing values filled or imputed

## Detecting null values

Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data; isna() and notna() are aliases that behave identically. For example:

In [22]:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [23]:
data.isna()

0    False
1     True
2    False
3     True
dtype: bool

In [25]:
data.isnull().sum()

2

In [24]:
data.notnull()

0     True
1    False
2     True
3    False
dtype: bool

In [26]:
data.notnull().sum()

2

## Dropping null values

- In addition to the masking used before, there are the convenience methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:

In [27]:
data.dropna()


0        1
2    hello
dtype: object

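By default, dropna() drops every row in which any null value is present. Passing axis=1 (or, equivalently, axis='columns') instead drops every column that contains a null value: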



In [36]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [37]:
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


In [38]:
df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


- But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.

- The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:

In [39]:
df[3]=np.nan

In [40]:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [41]:
df.dropna(how='all',axis='columns')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [49]:
df.dropna(thresh=2,axis='columns')

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [61]:
df.dropna(thresh=3,axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


In [59]:
df.dropna(thresh=1,axis='rows')

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


## Filling null values

- Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.

In [62]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [63]:
data.fillna(0)


a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
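The mask-based alternative mentioned above can be sketched as follows, working on a copy so the original Series stays untouched. The name filled_by_mask is an illustrative choice.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# Assign through the Boolean mask returned by isnull().
filled_by_mask = data.copy()
filled_by_mask[filled_by_mask.isnull()] = 0

# fillna() produces the same values without the explicit mask.
filled_by_method = data.fillna(0)

print(filled_by_mask.equals(filled_by_method))
```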

We can specify a forward-fill to propagate the previous value forward:

In [64]:
data.ffill()  # same result as fillna(method='ffill'), which newer pandas versions deprecate


a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:



In [65]:
data.bfill()  # same result as fillna(method='bfill'), which newer pandas versions deprecate


a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
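Beyond a constant or a neighbouring value, the imputation mentioned at the start of this section can be sketched with the mean of the observed values. Using the mean is an illustrative assumption; the right statistic depends on the data.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

# mean() skips NaN by default, so this is the mean of 1, 2 and 3.
observed_mean = data.mean()
filled_with_mean = data.fillna(observed_mean)

print(observed_mean)
print(filled_with_mean.tolist())
```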

 1. Write a Pandas program to detect missing values of a given DataFrame. Display True or False.

In [66]:
df = pd.DataFrame({
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,np.nan,70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,1983.43,2480.4,250.45, 75.29,3045.6],
'ord_date': ['2012-10-05','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,3003,3002,3001,3001],
'salesman_id':[5002,5003,5001,np.nan,5002,5001,5001,np.nan,5003,5002,5003,np.nan]})

In [67]:
df

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001.0,150.5,2012-10-05,3002,5002.0
1,,270.65,2012-09-10,3001,5003.0
2,70002.0,65.26,,3001,5001.0
3,70004.0,110.5,2012-08-17,3003,
4,,948.5,2012-09-10,3002,5002.0
5,70005.0,2400.6,2012-07-27,3001,5001.0
6,,5760.0,2012-09-10,3001,5001.0
7,70010.0,1983.43,2012-10-10,3004,
8,70003.0,2480.4,2012-10-10,3003,5003.0
9,70012.0,250.45,2012-06-27,3002,5002.0


In [68]:
df.isna()

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,False,False,False,False,False
1,True,False,False,False,False
2,False,False,True,False,False
3,False,False,False,False,True
4,True,False,False,False,False
5,False,False,False,False,False
6,True,False,False,False,False
7,False,False,False,False,True
8,False,False,False,False,False
9,False,False,False,False,False


2. Write a Pandas program to identify the column(s) of a given DataFrame which have at least one missing value.

In [74]:
df.isnull().any()

ord_no          True
purch_amt      False
ord_date        True
customer_id    False
salesman_id     True
dtype: bool

3. Write a Pandas program to count the number of missing values in each column of a given DataFrame.

In [75]:
df.isnull().sum()

ord_no         4
purch_amt      0
ord_date       1
customer_id    0
salesman_id    3
dtype: int64

4. Write a Pandas program to find and replace the missing values in a given DataFrame which do not have any valuable information.

In [76]:
df = pd.DataFrame({
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,"--",70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,"?",12.43,2480.4,250.45, 3045.6],
'ord_date': ['?','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,"--",3002,3001,3001],
'salesman_id':[5002,5003,"?",5001,np.nan,5002,5001,"?",5003,5002,5003,"--"]})

In [77]:
df

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,,270.65,2012-09-10,3001,5003
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,,948.5,2012-09-10,3002,
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


In [79]:
df.replace({"?": np.nan, "--": np.nan})

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001.0,150.5,,3002.0,5002.0
1,,270.65,2012-09-10,3001.0,5003.0
2,70002.0,65.26,,3001.0,
3,70004.0,110.5,2012-08-17,3003.0,5001.0
4,,948.5,2012-09-10,3002.0,
5,70005.0,2400.6,2012-07-27,3001.0,5002.0
6,,5760.0,2012-09-10,3001.0,5001.0
7,70010.0,,2012-10-10,3004.0,
8,70003.0,12.43,2012-10-10,,5003.0
9,70012.0,2480.4,2012-06-27,3002.0,5002.0
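If the placeholder strings are known before loading, they can also be declared at read time. The sketch below assumes a small inline CSV invented for illustration; na_values is a parameter of pandas.read_csv.

```python
import io
import pandas as pd

# Hypothetical raw file in which "?" and "--" stand for missing entries.
raw_csv = io.StringIO(
    "ord_no,purch_amt,salesman_id\n"
    "70001,150.5,5002\n"
    "--,?,5003\n"
    "70002,65.26,?\n"
)

# Treat both placeholders as NaN while parsing.
orders = pd.read_csv(raw_csv, na_values=["?", "--"])
print(orders.isnull().sum())
```

Because the placeholders never reach the DataFrame as strings, the columns are parsed as numeric types directly.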


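5. Write a Pandas program to drop the rows where at least one element is missing in a given DataFrame.

In [81]:
df.dropna()

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
3,70004,110.5,2012-08-17,3003,5001
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002
11,70013,3045.6,2012-04-25,3001,--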


6. Write a Pandas program to drop the rows where all elements are missing in a given DataFrame.

In [82]:
df.dropna(how='all')

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,,270.65,2012-09-10,3001,5003
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,,948.5,2012-09-10,3002,
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


7. Write a Pandas program to keep the rows with at least 2 non-NaN values in a given DataFrame.

In [86]:
df.dropna(thresh=2)

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,,270.65,2012-09-10,3001,5003
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,,948.5,2012-09-10,3002,
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


8. Write a Pandas program to drop those rows from a given DataFrame in which specific columns have missing values.

In [89]:
df.dropna(axis=0,subset=['ord_no'])

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002
11,70013,3045.6,2012-04-25,3001,--


9. Write a Pandas program to replace NaNs with a single constant value in specified columns in a DataFrame.



In [92]:
df.fillna(0)

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,0,270.65,2012-09-10,3001,5003
2,70002,65.26,0,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,0,948.5,2012-09-10,3002,0
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002
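Note that fillna(0) fills every column. To restrict the fill to specified columns, one option is to pass a dict that maps column names to fill values. The sketch below uses a small frame invented for illustration, and the choice of columns is an assumption.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'ord_no': [70001, np.nan, 70002],
    'purch_amt': [150.5, np.nan, 65.26],
    'salesman_id': [5002, 5003, np.nan],
})

# A dict maps column name -> fill value; columns not listed are left untouched.
filled = orders.fillna({'ord_no': 0, 'salesman_id': 0})
print(filled)
```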


10. Write a Pandas program to replace NaNs with the value from the previous row or the next row in a given DataFrame.

In [95]:
df = pd.DataFrame({
'ord_no':[70001,np.nan,70002,70004,np.nan,70005,"--",70010,70003,70012,np.nan,70013],
'purch_amt':[150.5,270.65,65.26,110.5,948.5,2400.6,5760,"?",12.43,2480.4,250.45, 3045.6],
'ord_date': ['?','2012-09-10',np.nan,'2012-08-17','2012-09-10','2012-07-27','2012-09-10','2012-10-10','2012-10-10','2012-06-27','2012-08-17','2012-04-25'],
'customer_id':[3002,3001,3001,3003,3002,3001,3001,3004,"--",3002,3001,3001],
'salesman_id':[5002,5003,"?",5001,np.nan,5002,5001,"?",5003,5002,5003,"--"]})

In [96]:
df

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,,270.65,2012-09-10,3001,5003
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,,948.5,2012-09-10,3002,
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


In [97]:
df.ffill(axis=0)
# previous row; use df.bfill(axis=0) to fill from the next row instead

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,70001,270.65,2012-09-10,3001,5003
2,70002,65.26,2012-09-10,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,70004,948.5,2012-09-10,3002,5001
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760,2012-09-10,3001,5001
7,70010,?,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


11. Write a Pandas program to interpolate the missing values using the Linear Interpolation method in a given DataFrame.

In [120]:
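# the keyword is limit_direction (there is no direction parameter); the column must be numeric
df['purch_amt'] = df['purch_amt'].interpolate(method='linear', limit_direction='forward')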
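What linear interpolation does is easiest to see on a short numeric Series. The values below are invented for illustration; each gap is filled with evenly spaced values between its neighbours.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([100.0, np.nan, np.nan, 400.0])

# Linear interpolation draws a straight line between 100 and 400.
interpolated = amounts.interpolate(method='linear', limit_direction='forward')
print(interpolated.tolist())
```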

12. Write a Pandas program to find the indexes of missing values in a given DataFrame.

In [122]:
df

Unnamed: 0,ord_no,purch_amt,ord_date,customer_id,salesman_id
0,70001,150.5,?,3002,5002
1,,270.65,2012-09-10,3001,5003
2,70002,65.26,,3001,?
3,70004,110.5,2012-08-17,3003,5001
4,,948.5,2012-09-10,3002,
5,70005,2400.6,2012-07-27,3001,5002
6,--,5760.0,2012-09-10,3001,5001
7,70010,0.0,2012-10-10,3004,?
8,70003,12.43,2012-10-10,--,5003
9,70012,2480.4,2012-06-27,3002,5002


In [127]:
df['ord_no'].isnull().to_numpy().nonzero()

(array([ 1,  4, 10], dtype=int64),)

A common use of nonzero() is to find the indices of an array where a condition is True.
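The same idea extends from one column to the whole DataFrame. The sketch below uses a small frame invented for illustration and returns both the index labels for a single column and the (row, column) position of every missing cell.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'ord_no': [70001, np.nan, 70002, np.nan],
    'purch_amt': [150.5, 270.65, np.nan, 110.5],
})

# Index labels of the rows where one column is missing.
missing_ord_no = orders.index[orders['ord_no'].isnull()].tolist()

# Row and column positions of every missing cell in the frame.
rows, cols = np.nonzero(orders.isnull().to_numpy())
missing_cells = [(int(r), orders.columns[c]) for r, c in zip(rows, cols)]

print(missing_ord_no)
print(missing_cells)
```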

# Finding the Percentage of Missing Values in Each Column of a Pandas DataFrame

In [128]:
import seaborn as sns

In [131]:
flight=pd.read_csv('https://raw.githubusercontent.com/roberthryniewicz/datasets/master/airline-dataset/flights/flights.csv')

In [133]:
flight.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2008,1,3,4,2003.0,1955,2211.0,2225,WN,335,...,4.0,8.0,0,,0,,,,,
1,2008,1,3,4,754.0,735,1002.0,1000,WN,3231,...,5.0,10.0,0,,0,,,,,
2,2008,1,3,4,628.0,620,804.0,750,WN,448,...,3.0,17.0,0,,0,,,,,
3,2008,1,3,4,926.0,930,1054.0,1100,WN,1746,...,3.0,7.0,0,,0,,,,,
4,2008,1,3,4,1829.0,1755,1959.0,1925,WN,3920,...,3.0,10.0,0,,0,2.0,0.0,0.0,0.0,32.0


In [135]:
flight.shape

(100000, 29)

In [136]:
flight.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 29 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Year               100000 non-null  int64  
 1   Month              100000 non-null  int64  
 2   DayofMonth         100000 non-null  int64  
 3   DayOfWeek          100000 non-null  int64  
 4   DepTime            98858 non-null   float64
 5   CRSDepTime         100000 non-null  int64  
 6   ArrTime            98698 non-null   float64
 7   CRSArrTime         100000 non-null  int64  
 8   UniqueCarrier      100000 non-null  object 
 9   FlightNum          100000 non-null  int64  
 10  TailNum            98858 non-null   object 
 11  ActualElapsedTime  98698 non-null   float64
 12  CRSElapsedTime     100000 non-null  int64  
 13  AirTime            98698 non-null   float64
 14  ArrDelay           98698 non-null   float64
 15  DepDelay           98858 non-null   float64
 16  Ori

In [137]:
flight.count()

Year                 100000
Month                100000
DayofMonth           100000
DayOfWeek            100000
DepTime               98858
CRSDepTime           100000
ArrTime               98698
CRSArrTime           100000
UniqueCarrier        100000
FlightNum            100000
TailNum               98858
ActualElapsedTime     98698
CRSElapsedTime       100000
AirTime               98698
ArrDelay              98698
DepDelay              98858
Origin               100000
Dest                 100000
Distance             100000
TaxiIn                98698
TaxiOut               98858
Cancelled            100000
CancellationCode       1142
Diverted             100000
CarrierDelay          19629
WeatherDelay          19629
NASDelay              19629
SecurityDelay         19629
LateAircraftDelay     19629
dtype: int64

In [150]:
missing = flight.isnull().sum()
missing_values_per = (missing / len(flight)) * 100

In [151]:
missing_values_per

Year                  0.000
Month                 0.000
DayofMonth            0.000
DayOfWeek             0.000
DepTime               1.142
CRSDepTime            0.000
ArrTime               1.302
CRSArrTime            0.000
UniqueCarrier         0.000
FlightNum             0.000
TailNum               1.142
ActualElapsedTime     1.302
CRSElapsedTime        0.000
AirTime               1.302
ArrDelay              1.302
DepDelay              1.142
Origin                0.000
Dest                  0.000
Distance              0.000
TaxiIn                1.302
TaxiOut               1.142
Cancelled             0.000
CancellationCode     98.858
Diverted              0.000
CarrierDelay         80.371
WeatherDelay         80.371
NASDelay             80.371
SecurityDelay        80.371
LateAircraftDelay    80.371
dtype: float64

In [155]:
flight.isnull().mean().round(4)*100

Year                  0.00
Month                 0.00
DayofMonth            0.00
DayOfWeek             0.00
DepTime               1.14
CRSDepTime            0.00
ArrTime               1.30
CRSArrTime            0.00
UniqueCarrier         0.00
FlightNum             0.00
TailNum               1.14
ActualElapsedTime     1.30
CRSElapsedTime        0.00
AirTime               1.30
ArrDelay              1.30
DepDelay              1.14
Origin                0.00
Dest                  0.00
Distance              0.00
TaxiIn                1.30
TaxiOut               1.14
Cancelled             0.00
CancellationCode     98.86
Diverted              0.00
CarrierDelay         80.37
WeatherDelay         80.37
NASDelay             80.37
SecurityDelay        80.37
LateAircraftDelay    80.37
dtype: float64
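Per-column percentages are often used to decide which columns to keep. The sketch below assumes a small frame invented for illustration and a hypothetical cutoff of 50% missing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, 3, 4],
    'c': [np.nan, np.nan, np.nan, 4],
})

# The mean of a Boolean mask is the fraction of True values per column.
pct_missing = frame.isnull().mean() * 100

# Drop columns above the (hypothetical) 50% cutoff.
mostly_missing = pct_missing[pct_missing > 50].index.tolist()
trimmed = frame.drop(columns=mostly_missing)

print(pct_missing.to_dict())
print(trimmed.columns.tolist())
```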

# Finding the Overall Percentage of Missing Values in a Pandas DataFrame

In [160]:
(flight.isnull().sum().sum() / np.prod(flight.shape)).round(4) * 100

17.65

In [161]:
np.prod(flight.shape)

2900000

In [162]:
len(flight)

100000

In [163]:
flight.shape

(100000, 29)
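Two equivalent shortcuts for the overall figure are sketched below on a small frame invented for illustration: DataFrame.size gives the total number of cells directly, and the mean of the flattened Boolean mask gives the missing fraction in one step.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, 3.0, 4.0],
    'y': [np.nan, 2.0, 3.0, 4.0],
})

# size == rows * columns, so no separate product of the shape is needed.
pct_via_size = frame.isnull().sum().sum() / frame.size * 100

# Mean over every cell of the mask: the fraction of missing cells.
pct_via_mean = frame.isnull().to_numpy().mean() * 100

print(pct_via_size, pct_via_mean)
```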