### pandas.DataFrame.dropna

**DataFrame.dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)**

Remove missing values. See the User Guide for more on which values are considered missing, and how to work with missing data.

**Parameters**:
- **axis** {0 or ‘index’, 1 or ‘columns’}, default 0  
  Determine if rows or columns which contain missing values are removed.  
  - 0, or ‘index’: Drop rows which contain missing values.  
  - 1, or ‘columns’: Drop columns which contain missing values.  
  Only a single axis is allowed.

- **how** {‘any’, ‘all’}, default ‘any’  
  Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.  
  - ‘any’: If any NA values are present, drop that row or column.  
  - ‘all’: If all values are NA, drop that row or column.

- **thresh** int, optional  
  Require that many non-NA values. Cannot be combined with how.

- **subset** column label or sequence of labels, optional  
  Labels along the other axis to consider, e.g., if you are dropping rows, these would be a list of columns to include.

- **inplace** bool, default False  
  Whether to modify the DataFrame rather than creating a new one.

- **ignore_index** bool, default False  
  If True, the resulting axis will be labeled 0, 1, …, n - 1. (Added in version 2.0.0)

**Returns**:  
DataFrame or None  
DataFrame with NA entries dropped from it or None if inplace=True.
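The `thresh` and `how='all'` options are easiest to compare side by side. The following is a minimal sketch on a small made-up frame (the `scores` data and variable names are illustrative assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has 3 non-NA values, row 1 has 2, row 2 has 1, row 3 has 0
scores = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

# thresh=2 keeps only rows with at least two non-NA values (rows 0 and 1)
at_least_two = scores.dropna(thresh=2)
print(at_least_two)

# how='all' drops a row only when every value in it is NA (row 3)
not_all_na = scores.dropna(how="all")
print(not_all_na)
```

`thresh` is the middle ground between the strict default (`how='any'`) and the lenient `how='all'`.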

In [20]:
import pandas as pd
import numpy as np

# Creating a DataFrame for employee data
data = {
    'EmployeeID': [101, 102, 103, 104, np.nan, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', np.nan, 'Grace', 'Hannah', 'Ivy', 'Jack'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'HR', 'Finance', 'IT', np.nan, 'HR', 'Finance'],
    'Salary': [70000, 80000, 60000, np.nan, 75000, 90000, 72000, 68000, np.nan, 85000],
    'JoiningDate': ['2020-01-15', '2019-03-22', '2021-07-30', '2018-12-12', '2020-06-01', 
                    '2022-02-15', '2021-05-24', np.nan, '2019-11-10', '2020-09-05']
}

df = pd.DataFrame(data)

print("Employee DataFrame:")
print(df)

Employee DataFrame:
   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [2]:
df.dropna()

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
6       107.0    Grace         IT  72000.0  2021-05-24
9       110.0     Jack    Finance  85000.0  2020-09-05


In [3]:
df.dropna(ignore_index=True)

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       107.0    Grace         IT  72000.0  2021-05-24
4       110.0     Jack    Finance  85000.0  2020-09-05


In [4]:
df.dropna(subset=['Name', 'Salary'])

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
4         NaN      Eve         HR  75000.0  2020-06-01
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
9       110.0     Jack    Finance  85000.0  2020-09-05


In [5]:
df.dropna(how='all')

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [6]:
df.dropna(axis=1)

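Every column in `df` contains at least one missing value, so all five columns are dropped and only the index remains:

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]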


### `pandas.DataFrame.fillna`

`DataFrame.fillna(value=None, *, method=None, axis=None, inplace=False, limit=None, downcast=<no_default>)`

Fill `NA`/`NaN` values using the specified method.

**Parameters**:

- **value**: scalar, dict, Series, or DataFrame  
  Value to use to fill holes (e.g., 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

- **method**: {‘backfill’, ‘bfill’, ‘ffill’, None}, default None  
  Method to use for filling holes in reindexed Series:
  - `ffill`: propagate last valid observation forward to next valid.
  - `backfill` / `bfill`: use next valid observation to fill gap.
  - Deprecated since version 2.1.0: Use the `DataFrame.ffill()` or `DataFrame.bfill()` methods instead.

- **axis**: {0 or ‘index’} for Series, {0 or ‘index’, 1 or ‘columns’} for DataFrame  
  Axis along which to fill missing values. For Series, this parameter is unused and defaults to 0.

- **inplace**: bool, default False  
  If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

- **limit**: int, default None  
  If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

- **downcast**: dict, default is None  
  A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g., float64 to int64 if possible).
  - Deprecated since version 2.2.0.

**Returns**:
- **Series/DataFrame or None**  
  Object with missing values filled or None if `inplace=True`.
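A dict passed as `value` fills each column with its own replacement, and the `ffill()` method is the non-deprecated way to write `fillna(method='ffill')`. The sketch below uses a small made-up frame (`staff` and its contents are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", np.nan],
    "Department": ["HR", np.nan, "IT"],
    "Salary": [70000.0, np.nan, 72000.0],
})

# Per-column fill values; 'Name' is not in the dict, so its NaN is left as-is
filled = staff.fillna({"Department": "Unknown", "Salary": 0})
print(filled)

# Forward fill without the deprecated `method` keyword
forward = staff.ffill()
print(forward)
```

Filling per column avoids the mixed-type columns that a single scalar such as `0` produces in text columns.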

In [7]:
df.fillna(0)


   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      0.0  2018-12-12
4         0.0      Eve         HR  75000.0  2020-06-01
5       106.0        0    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah          0  68000.0           0
8       109.0      Ivy         HR      0.0  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [8]:
df.fillna(method='ffill')

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT  60000.0  2018-12-12
4       104.0      Eve         HR  75000.0  2020-06-01
5       106.0      Eve    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah         IT  68000.0  2021-05-24
8       109.0      Ivy         HR  68000.0  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [10]:
df.fillna(method='ffill', limit=1)

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT  60000.0  2018-12-12
4       104.0      Eve         HR  75000.0  2020-06-01
5       106.0      Eve    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah         IT  68000.0  2021-05-24
8       109.0      Ivy         HR  68000.0  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05
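In the employee data no column has two missing values in a row, so `limit=1` produces the same result as an unrestricted forward fill. The effect of `limit` only shows up on longer gaps. A minimal sketch on a made-up Series (`readings` is an illustrative assumption), using the non-deprecated `ffill()` method:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three consecutive missing values
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Only the first NaN in the gap is filled; the other two remain missing
limited = readings.ffill(limit=1)
print(limited)

# Without a limit the whole gap is filled with the last valid value
unlimited = readings.ffill()
print(unlimited)
```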



### pandas.DataFrame.drop_duplicates

`DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)`

Return DataFrame with duplicate rows removed. Considering certain columns is optional. Indexes, including time indexes, are ignored.

#### Parameters:

- **subset**: column label or sequence of labels, optional  
  Only consider certain columns for identifying duplicates, by default use all of the columns.

- **keep**: {‘first’, ‘last’, False}, default ‘first’  
  Determines which duplicates (if any) to keep.
  - ‘first’ : Drop duplicates except for the first occurrence.
  - ‘last’ : Drop duplicates except for the last occurrence.
  - False : Drop all duplicates.

- **inplace**: bool, default False  
  Whether to modify the DataFrame rather than creating a new one.

- **ignore_index**: bool, default False  
  If True, the resulting axis will be labeled 0, 1, …, n - 1.

#### Returns:

- **DataFrame** or **None**  
  DataFrame with duplicates removed or None if inplace=True.
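The `keep=False` option is the least intuitive of the three, since it removes every member of a duplicated group rather than keeping one representative. A short sketch on made-up data (`teams` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: 'IT' appears twice, 'HR' and 'Finance' once each
teams = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Department": ["HR", "IT", "Finance", "IT"],
})

# keep=False drops both 'IT' rows, leaving only departments that occur exactly once
unique_only = teams.drop_duplicates(subset=["Department"], keep=False)
print(unique_only)

# keep='first' retains Bob as the representative of 'IT'
first_only = teams.drop_duplicates(subset=["Department"], keep="first")
print(first_only)
```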

In [11]:
df.drop_duplicates(subset=['Department'],keep='last')

   EmployeeID     Name Department   Salary JoiningDate
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [12]:
df.drop_duplicates(subset=['Department'],keep='first',ignore_index=True)

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       108.0   Hannah        NaN  68000.0         NaN


### pandas.DataFrame.replace

`DataFrame.replace(to_replace=None, value=<no_default>, *, inplace=False, limit=None, regex=False, method=<no_default>)`

Replace values given in `to_replace` with `value`. Values of the Series/DataFrame are replaced with other values dynamically. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.

#### Parameters:

- **to_replace**: str, regex, list, dict, Series, int, float, or None  
  How to find the values that will be replaced.
  - **numeric**: numeric values equal to `to_replace` will be replaced with `value`
  - **str**: string exactly matching `to_replace` will be replaced with `value`
  - **regex**: regexs matching `to_replace` will be replaced with `value`
  - **list of str, regex, or numeric**: 
    - First, if `to_replace` and `value` are both lists, they must be the same length.
    - Second, if `regex=True`, then all of the strings in both lists will be interpreted as regexs; otherwise, they will match directly. This doesn’t matter much for `value` since there are only a few possible substitution regexes you can use.
  - **dict**: 
    - Dicts can be used to specify different replacement values for different existing values. For example, `{'a': 'b', 'y': 'z'}` replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional `value` parameter should not be given.
    - For a DataFrame, a dict can specify that different values should be replaced in different columns. For example, `{'a': 1, 'b': 'z'}` looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in `value`. The `value` parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.
    - For a DataFrame, nested dictionaries, e.g., `{'a': {'b': np.nan}}`, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional `value` parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
  - **None**: This means that the `regex` argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If `value` is also None then this must be a nested dictionary or Series.

- **value**: scalar, dict, list, str, regex, default None  
  Value to replace any values matching `to_replace` with. For a DataFrame, a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings, and lists or dicts of such objects are also allowed.

- **inplace**: bool, default False  
  If True, performs operation inplace and returns None.

- **limit**: int, default None  
  Maximum size gap to forward or backward fill.  
  *Deprecated since version 2.1.0.*

- **regex**: bool or same types as `to_replace`, default False  
  Whether to interpret `to_replace` and/or `value` as regular expressions. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case `to_replace` must be None.

- **method**: {‘pad’, ‘ffill’, ‘bfill’}  
  The method to use for replacement, when `to_replace` is a scalar, list or tuple and `value` is None.  
  *Deprecated since version 2.1.0.*

#### Returns:
- **Series/DataFrame**  
  Object after replacement.

#### Raises:
- **AssertionError**  
  If `regex` is not a bool and `to_replace` is not None.

- **TypeError**  
  If `to_replace` is not a scalar, array-like, dict, or None.  
  If `to_replace` is a dict and `value` is not a list, dict, ndarray, or Series.  
  If `to_replace` is None and `regex` is not compilable into a regular expression or is a list, dict, ndarray, or Series.  
  When replacing multiple bool or datetime64 objects and the arguments to `to_replace` do not match the type of the value being replaced.

- **ValueError**  
  If a list or an ndarray is passed to `to_replace` and `value` but they are not the same length.
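The nested-dict and regex forms of `to_replace` described above can be sketched as follows. The `staff` frame and the replacement strings are made up for illustration:

```python
import pandas as pd

# Hypothetical data
staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Department": ["HR", "IT", "Finance"],
})

# Nested dict: look only in column 'Department' for 'HR' and replace it
renamed = staff.replace({"Department": {"HR": "Human Resources"}})
print(renamed)

# Regex: any string value starting with 'Fin' is replaced with 'FIN'
regexed = staff.replace(to_replace=r"^Fin.*", value="FIN", regex=True)
print(regexed)
```

Scoping the replacement to a column with a nested dict avoids accidentally changing an identical value in another column.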

In [13]:
df.replace('Alice',method='bfill')

   EmployeeID     Name Department   Salary JoiningDate
0       101.0      Bob         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [17]:
df.replace(to_replace=['Alice','Bob'],value='Alice')

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0    Alice         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [21]:
df.replace('Alice', 'Bob')

   EmployeeID     Name Department   Salary JoiningDate
0       101.0      Bob         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0  Charlie    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


In [22]:
df.replace({'Charlie':np.nan})

   EmployeeID     Name Department   Salary JoiningDate
0       101.0    Alice         HR  70000.0  2020-01-15
1       102.0      Bob         IT  80000.0  2019-03-22
2       103.0      NaN    Finance  60000.0  2021-07-30
3       104.0    David         IT      NaN  2018-12-12
4         NaN      Eve         HR  75000.0  2020-06-01
5       106.0      NaN    Finance  90000.0  2022-02-15
6       107.0    Grace         IT  72000.0  2021-05-24
7       108.0   Hannah        NaN  68000.0         NaN
8       109.0      Ivy         HR      NaN  2019-11-10
9       110.0     Jack    Finance  85000.0  2020-09-05


### pandas.DataFrame.astype

```python
DataFrame.astype(dtype, copy=None, errors='raise')
```

Cast a pandas object to a specified dtype.

#### Parameters:
- **dtype**: `str`, `data type`, `Series`, or `Mapping` of `column name` -> `data type`
  Use a `str`, `numpy.dtype`, `pandas.ExtensionDtype`, or Python type to cast the entire pandas object to the same type. Alternatively, use a mapping, e.g. ``{col: dtype, …}``, where `col` is a column label and `dtype` is a `numpy.dtype` or Python type to cast one or more of the DataFrame’s columns to column-specific types.

- **copy**: `bool`, default `True`
  Return a copy when `copy=True` (be very careful setting `copy=False` as changes to values then may propagate to other pandas objects).

  **Note:**
  The `copy` keyword will change behavior in pandas 3.0. Copy-on-Write will be enabled by default, which means that all methods with a `copy` keyword will use a lazy copy mechanism to defer the copy and ignore the `copy` keyword. The `copy` keyword will be removed in a future version of pandas.

  You can already get the future behavior and improvements through enabling copy on write:
  ```python
  pd.options.mode.copy_on_write = True
  ```

- **errors**: `{‘raise’, ‘ignore’}`, default ‘raise’
  Control raising of exceptions on invalid data for provided `dtype`.
  - `raise` : allow exceptions to be raised
  - `ignore` : suppress exceptions. On error return original object.

#### Returns:
- Same type as caller
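Passing a mapping as `dtype` casts individual columns to different types in one call. A minimal sketch, assuming a made-up frame (`raw`) in which IDs arrived as strings:

```python
import pandas as pd

# Hypothetical data: IDs stored as text, salaries as floats
raw = pd.DataFrame({
    "EmployeeID": ["101", "102", "103"],
    "Salary": [70000.0, 80000.0, 60000.0],
})

# Column-specific casts; columns not named in the mapping keep their dtype
typed = raw.astype({"EmployeeID": "int64", "Salary": "int32"})
print(typed.dtypes)
```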

In [29]:
df['Salary'].fillna(0).astype('int64')

0    70000
1    80000
2    60000
3        0
4    75000
5    90000
6    72000
7    68000
8        0
9    85000
Name: Salary, dtype: int64
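Filling with `0` before casting changes the data: a missing salary becomes a salary of zero. If the missing values should stay missing, one alternative is pandas' nullable integer dtype `'Int64'` (capital "I"). The sketch below uses a made-up Series rather than the employee data:

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with one missing value
salary = pd.Series([70000.0, np.nan, 60000.0], name="Salary")

# Casting NaN to NumPy's int64 fails, which is why the cell above fills first
try:
    salary.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable 'Int64' dtype stores integers and keeps the gap as <NA>
nullable = salary.astype("Int64")
print(nullable)
```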