# Missing Values In Pandas 

In pandas, missing values are usually represented by NaN (Not a Number); None and, for datetime data, NaT are also treated as missing. Missing values occur in a DataFrame or Series when data is unavailable or incomplete. Handle them deliberately, because they can bias summary statistics and reduce the reliability of your analysis.

MCAR, MAR, and MNAR are terms used to describe the mechanisms by which data can be missing in a dataset. These terms are particularly relevant in the context of handling missing data in statistical analysis. Let's break down each term:

### MCAR (Missing Completely at Random):
* MCAR means that the probability of a value being missing is unrelated to any data, observed or unobserved. The missing values are scattered across the dataset with no pattern to their occurrence.
* For example, survey data is MCAR if some students simply forget to fill in a question now and then, and the skipped answers are spread randomly across all questions.

In [1]:

import pandas as pd

# A small DataFrame with one missing value in each column
df_mcar = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Note: the mechanism (MCAR, MAR or MNAR) describes why values are missing and
# cannot be read off a table this small; here we simply assume the gaps occurred at random
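
A four-row table cannot really show a missingness mechanism, so here is a minimal simulation sketch of MCAR. The survey columns q1 and q2, the 10% missing rate and the random seed are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Hypothetical survey: 1,000 respondents, two answers on a 1-5 scale
survey = pd.DataFrame({
    "q1": rng.integers(1, 6, size=1000).astype(float),
    "q2": rng.integers(1, 6, size=1000).astype(float),
})

# MCAR: every cell has the same 10% chance of being blanked out,
# independent of any value in the table
blank = rng.random(survey.shape) < 0.10
survey_mcar = survey.mask(blank)

# Share of missing values per column (should be close to 0.10 for both)
print(survey_mcar.isna().mean())
```

Because the mask ignores the data entirely, the rows that remain are still a random sample of the original ones.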

### MAR (Missing at Random):
* MAR means that the likelihood of a value being missing is related to the observed data but not to the unobserved data. In simpler terms, whether a value is missing depends on the values of other variables in your dataset, but not on the missing value itself.
* For instance, data is MAR if students from certain classes are more likely to skip certain questions on a test, and this behavior is explained by their class (which is recorded) rather than by the answers they would have given.


In [2]:
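# A DataFrame where 'B' is fully observed; under MAR, the gaps in 'A' and 'C'
# would depend on observed values such as those in 'B'
df_mar = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, 6, 7, 8], 'C': [9, None, 11, 12]})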
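
To make the MAR idea concrete, the sketch below simulates test scores whose chance of being missing depends only on an observed class label. The column names, the class labels "X" and "Y", and the 30% and 5% rates are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Hypothetical test data: the class label is always observed
students = pd.DataFrame({
    "class_name": rng.choice(["X", "Y"], size=2000),
    "score": rng.normal(70, 10, size=2000),
})

# MAR: the chance of a missing score depends only on the observed class
# (30% for class X, 5% for class Y), not on the score itself
p_missing = np.where(students["class_name"] == "X", 0.30, 0.05)
students["score"] = students["score"].mask(rng.random(len(students)) < p_missing)

# Missing rate per class
missing_rate = students["score"].isna().groupby(students["class_name"]).mean()
print(missing_rate)
```

Comparing missing rates across groups of an observed variable, as in the last step, is a simple way to look for evidence against MCAR.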


### MNAR (Missing Not at Random):
* MNAR means that the probability of a value being missing is related to the unobserved value itself. In other words, the fact that a value is missing depends on what that value would have been.
* For example, data is MNAR if students who scored low on a test are more likely to leave their score unreported, which links the missingness to the unobserved test scores.
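
The following sketch simulates MNAR under assumed numbers: scores below 60 are dropped with probability 0.5 and all other scores with probability 0.05. The cutoff, the probabilities and the score distribution are illustrative assumptions, not taken from real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)  # arbitrary seed

# Hypothetical true scores, which would not be fully visible in practice
true_scores = pd.Series(rng.normal(70, 10, size=2000), name="score")

# MNAR: the chance of being missing depends on the score itself
p_missing = np.where(true_scores < 60, 0.50, 0.05)
observed = true_scores.mask(rng.random(len(true_scores)) < p_missing)

# The mean of the observed scores overstates the true mean,
# because low scores are under-represented
print(true_scores.mean(), observed.mean())
```

This is why MNAR is the hardest case: the bias cannot be detected from the observed values alone.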

In [3]:
# A DataFrame for the MNAR case; under MNAR, the gap in 'A' would depend on
# the unobserved value of 'A' itself
df_mnar = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, 6, 7, 8]})

## How to find missing values:
Use the isnull() or isna() methods to detect missing values in a DataFrame or Series; isnull() is an alias of isna(). Both return an object of the same shape, with True where a value is missing and False otherwise.


### Example: finding null values

In [4]:
import pandas as pd

# Creating a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Detecting missing values
print(df.isnull())

       A      B
0  False  False
1  False   True
2   True  False
3  False  False


### Counting the null values in each column with sum()

In [5]:
print(df.isnull().sum())

A    1
B    1
dtype: int64
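
Beyond per-column counts, a few related summaries are often useful. This is a small sketch on the same example DataFrame; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Fraction of missing values per column (the mean of a boolean column)
missing_share = df.isna().mean()
print(missing_share)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isna().sum().sum())
print(total_missing)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```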


## Handling Missing Values:
* We can drop missing values with dropna().
* We can fill missing values with another value using fillna().

### Dropping rows with dropna()

In [6]:
df

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0


In [7]:
df.dropna()

     A    B
0  1.0  5.0
3  4.0  8.0


### Filling NaN values with fillna()

In [8]:
# Fill missing values with a specific value
df.fillna(0)


     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0


## Imputation:
Imputation involves replacing missing values with estimated values based on the available data. This can be done using statistical measures like mean, median, or mode.



In [9]:
# Fill missing values with the mean of each column
df.fillna(df.mean())

          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000
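
The mean is only one option. As a sketch, the median and the mode can be used the same way; the city Series below is a made-up categorical example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Median imputation: less sensitive to outliers than the mean
df_median = df.fillna(df.median())
print(df_median)

# Mode imputation suits categorical data. mode() can return several
# values when there are ties, so take the first one explicitly.
city = pd.Series(['Pune', 'Delhi', None, 'Pune'])  # hypothetical data
city_filled = city.fillna(city.mode().iloc[0])
print(city_filled)
```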


## Interpolation:

Interpolation is a method for estimating missing values based on the values of adjacent data points.

In [10]:
df.interpolate()

     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0
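
By default interpolate() draws a straight line between the neighboring known values. The sketch below shows this on a gap of two values and uses the limit parameter to cap how many consecutive gaps are filled; the Series is an invented example.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Linear interpolation (the default) spaces the estimates evenly
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]

# limit=1 fills at most one consecutive missing value, working forward
limited = s.interpolate(limit=1)
print(limited)
```

Interpolation assumes the row order is meaningful, as in a time series; for unordered rows a statistic such as the mean or median is usually the safer choice.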


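## Parameters of dropna()

A single call can combine several parameters. The line below is illustrative only: with axis=1, subset refers to row labels, so it assumes a DataFrame whose index contains 'a', 'b', 'c' and 'd'.

df.dropna(axis=1, thresh=round(0.41 * df.shape[0]), subset=['a', 'b', 'c', 'd'])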

### axis:

This parameter determines whether to drop rows (axis=0) or columns (axis=1) that contain missing values. The default is axis=0.
* If axis=0, it will drop rows containing missing values.
* If axis=1, it will drop columns containing missing values.

In [11]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

In [12]:
# Drop columns that contain missing values
df_dropped_columns = df.dropna(axis=1)
df_dropped_columns
# Both columns contain a missing value, so both are dropped and only the index remains

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]


In [13]:
# Drop rows with missing values
df_dropped_rows = df.dropna(axis=0)
df_dropped_rows 

     A    B
0  1.0  5.0
3  4.0  8.0


### thresh:
This parameter sets the minimum number of non-missing values that a row or column must contain to be kept. Rows or columns with fewer non-missing values than the threshold are dropped.

In [14]:
# Keep only rows with at least 2 non-missing values
df_dropped_thresh = df.dropna(thresh=2)
# Keep only columns in which at least about 41% of the values are non-missing
# (each column holds df.shape[0] values, so the threshold is a share of the row count)
df_dropped_thresh = df.dropna(axis=1, thresh=round(0.41 * df.shape[0]))
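
Since thresh counts the values that are present rather than the ones that are missing, a slightly wider table makes its effect easier to see. The DataFrame below is a made-up example.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    'A': [1.0, np.nan, np.nan],
    'B': [2.0, 5.0, np.nan],
    'C': [3.0, 6.0, 9.0],
})

# Rows 0, 1 and 2 hold 3, 2 and 1 non-missing values respectively,
# so thresh=2 keeps rows 0 and 1
kept_rows = wide.dropna(thresh=2)
print(kept_rows)

# Columns A, B and C hold 1, 2 and 3 non-missing values respectively,
# so thresh=2 keeps columns B and C
kept_cols = wide.dropna(axis=1, thresh=2)
print(kept_cols)
```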

### subset:

The subset parameter restricts the check for missing values to certain labels on the other axis. When dropping rows (axis=0), it takes a list of column names; when dropping columns (axis=1), it takes a list of row labels.

In [15]:
# Drop rows where column 'A' has a missing value
df_dropped_subset = df.dropna(subset=['A'])
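
As a short sketch of subset on the example DataFrame used throughout, the calls below check only one column at a time, and then check only one row when dropping columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Only column 'A' is checked, so the row whose 'B' is missing survives
only_a = df.dropna(subset=['A'])
print(only_a)

# Only column 'B' is checked
only_b = df.dropna(subset=['B'])
print(only_b)

# With axis=1, subset names row labels: drop columns that are missing in row 1
cols_ok_in_row1 = df.dropna(axis=1, subset=[1])
print(cols_ok_in_row1)
```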

### how:
This parameter specifies the condition for dropping. The options are:
* 'any': Drop if any missing value is present in the row or column (default behavior).
* 'all': Drop only if all values are missing in the row or column.

In [16]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

In [17]:
# Drop rows with any missing values
df_dropped_any = df.dropna(how='any')
df_dropped_any
# A row is dropped if it contains even one missing value (axis=0 is the default)

     A    B
0  1.0  5.0
3  4.0  8.0


In [18]:
# Drop rows in which all values are missing
df_dropped_all = df.dropna(how='all')
df_dropped_all
# A row is dropped only if every value in it is missing (axis=0 is the default);
# no row here qualifies, so nothing is dropped

     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0
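
The difference between 'any' and 'all' only shows when some row is entirely empty. The DataFrame in this sketch is invented to include such a row.

```python
import numpy as np
import pandas as pd

df_all = pd.DataFrame({
    'A': [1.0, np.nan, np.nan],
    'B': [4.0, np.nan, 6.0],
})

# how='any' drops rows 1 and 2, since each has at least one missing value
any_dropped = df_all.dropna(how='any')
print(any_dropped)

# how='all' drops only row 1, the row in which every value is missing
all_dropped = df_all.dropna(how='all')
print(all_dropped)
```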


These parameters provide flexibility in how you want to handle missing values in your dataset. Depending on your specific use case, you can adjust these parameters to achieve the desired result.

## Parameters of fillna()

### axis:

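The axis parameter specifies the axis along which to fill: axis=0 fills down each column and axis=1 fills across each row. It only changes the result for directional fills such as forward or backward fill; with a scalar fill value, every missing cell receives the same value regardless of axis.

In [32]:
# Fill every missing value with the constant 1
# (for a scalar fill value, axis=1 does not change the result)
df = df.fillna(1, axis=1)
df

     A    B
0  1.0  5.0
1  2.0  1.0
2  1.0  7.0
3  4.0  8.0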
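
The axis argument makes a visible difference once the fill is directional. This sketch uses the ffill() method on the example DataFrame to compare filling down the columns with filling across the rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0], 'B': [5.0, np.nan, 7.0, 8.0]})

# axis=0: each gap takes the value from the row above, in the same column
down = df.ffill(axis=0)
print(down)

# axis=1: each gap takes the value from the column to its left, in the same row.
# The gap in 'A' has nothing to its left, so it stays missing.
across = df.ffill(axis=1)
print(across)
```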


### inplace:

The inplace parameter, if set to True, modifies the DataFrame or Series in place and returns None. If set to False (default), it returns a new DataFrame or Series with missing values filled.

In [38]:
df1=pd.DataFrame({'A':[1,1,None,2,3],
    'B':[2,3,4,5,None]})

In [39]:
# Fill missing values in place
df1.fillna(value=0, inplace=True)
df1

     A    B
0  1.0  2.0
1  1.0  3.0
2  0.0  4.0
3  2.0  5.0
4  3.0  0.0


### value:

The value parameter allows you to specify the value that will be used to fill the missing entries. This could be a scalar value, a dictionary (to specify different values for different columns), or a Series.

In [40]:
df_mcar = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# Fill all missing values with a specific value (e.g., 0)
df_filled = df_mcar.fillna(value=0)

# Fill missing values in a single column with a value.
# Note: df is still the frame from the axis example above (its gaps were filled with 1),
# so only its column 'A' is replaced here
df['A'] = df_mcar['A'].fillna(value=10)

In [41]:
df_filled

     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0


In [42]:
df

      A    B
0   1.0  5.0
1   2.0  1.0
2  10.0  7.0
3   4.0  8.0
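
The dictionary and Series forms of value are not shown in the cells above, so here is a minimal sketch of both. The fill values 0 and 99 are arbitrary.

```python
import pandas as pd

df_mcar = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

# A dictionary maps each column name to its own fill value
per_column = df_mcar.fillna(value={'A': 0, 'B': 99})
print(per_column)

# A Series indexed by column name works the same way;
# here it holds each column's median
by_series = df_mcar.fillna(value=df_mcar.median())
print(by_series)
```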


### method:
The method parameter specifies a directional strategy for filling missing values. Recent pandas releases (2.1 and later) deprecate this parameter in favor of the dedicated ffill() and bfill() methods, which behave the same way. The common methods are:
* 'ffill' or 'pad': Forward fill - fills missing values with the previous non-missing value.
* 'bfill' or 'backfill': Backward fill - fills missing values with the next non-missing value.

In [43]:
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})

In [44]:
# Forward fill missing values in the DataFrame
# (equivalent to the older df.fillna(method='ffill'), which recent pandas versions deprecate)
df_ffill = df.ffill()
df_ffill

     A    B
0  1.0  5.0
1  2.0  5.0
2  2.0  7.0
3  4.0  8.0


In [45]:
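# Backward fill missing values in the DataFrame
# (equivalent to the older df.fillna(method='bfill'), which recent pandas versions deprecate)
df_bfill = df.bfill()
df_bfill

     A    B
0  1.0  5.0
1  2.0  7.0
2  4.0  7.0
3  4.0  8.0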
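
Directional fills can carry a value across a long run of gaps, which is not always wanted. As a sketch, the limit parameter of ffill() and bfill() caps how many consecutive missing values are filled; the Series is a made-up example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, the value 1.0 is carried across the whole gap
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 1.0, 5.0]

# limit=1 fills only the first missing value after each known value
forward_limited = s.ffill(limit=1)
print(forward_limited)

# bfill works backward from the next known value
backward_limited = s.bfill(limit=1)
print(backward_limited)
```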
