# 18. Miscellaneous Useful Functions
- `df.memory_usage()`: Get memory usage of a DataFrame.
- `df.query()`: Perform SQL-like queries on DataFrame.
- `pd.qcut()`: Quantile-based discretization.
- `pd.cut()`: Discretize values into bins.
- `pd.factorize()`: Encode categorical values.
- `df.melt()`: Convert wide format to long format.
- `df.explode()`: Transform each element of a list-like to a row.

In [1]:
import pandas as pd
import numpy as np

# pandas.DataFrame.memory_usage

`DataFrame.memory_usage(index=True, deep=False)`
Return the memory usage of each column in bytes.

The memory usage can optionally include the contribution of the index and elements of object dtype.

This value is displayed in `DataFrame.info` by default. This can be suppressed by setting `pandas.options.display.memory_usage` to False.

## Parameters

- **`index`**: bool, default True
  Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If `index=True`, the memory usage of the index is the first item in the output.

- **`deep`**: bool, default False
  If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

## Returns

- **Series**
  A Series whose index is the original column names and whose values are the memory usage of each column in bytes.
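
To make the effect of `index` and `deep` concrete, here is a minimal sketch on a small made-up frame (`df_demo` and its columns are hypothetical, not part of the salary data below). It assumes an object-dtype string column, for which the shallow figure counts only the pointers while `deep=True` also counts the Python string objects.

```python
import pandas as pd

# Hypothetical toy frame: one int64 column and one object (string) column.
df_demo = pd.DataFrame({
    "n": pd.Series([1, 2, 3], dtype="int64"),
    "s": pd.Series(["a", "bb", "ccc"], dtype="object"),
})

shallow = df_demo.memory_usage()              # index first, then each column
deep = df_demo.memory_usage(deep=True)        # also measures the string objects
no_index = df_demo.memory_usage(index=False)  # columns only

print(shallow)
print(deep)
print("total bytes (deep):", deep.sum())
```

Summing the returned Series gives a single total, which is often the number you want when comparing dtypes.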


In [2]:

data = {
    'EmployeeID': range(1, 31),
    'Name': ['Employee ' + str(i) for i in range(1, 31)],
    'Department': np.random.choice(['HR', 'Finance', 'IT', 'Marketing', 'Sales'], 30),
    'Position': np.random.choice(['Manager', 'Analyst', 'Developer', 'Intern'], 30),
    'BaseSalary': np.random.randint(50000, 120000, 30),
    'Bonus': np.random.randint(2000, 10000, 30)
}
df_salary = pd.DataFrame(data)

print("Salary DataFrame:")
print(df_salary)


Salary DataFrame:
    EmployeeID         Name Department   Position  BaseSalary  Bonus
0            1   Employee 1         IT    Analyst      107207   5211
1            2   Employee 2      Sales    Analyst       98554   7851
2            3   Employee 3         HR    Analyst      105109   9399
3            4   Employee 4         HR  Developer       61331   6427
4            5   Employee 5  Marketing     Intern       76086   8319
5            6   Employee 6  Marketing    Manager      108205   7499
6            7   Employee 7         IT    Manager       50078   9834
7            8   Employee 8    Finance  Developer      102152   3519
8            9   Employee 9         IT    Analyst       78333   8099
9           10  Employee 10  Marketing  Developer       84009   5203
10          11  Employee 11      Sales  Developer       73429   2289
11          12  Employee 12    Finance    Manager       64823   8014
12          13  Employee 13         IT    Analyst       88920   7985
13          14  

In [3]:
df_salary.memory_usage(deep=True)

Index          132
EmployeeID     240
Name          1791
Department    1611
Position      1687
BaseSalary     120
Bonus          120
dtype: int64

# pandas.DataFrame.query

`DataFrame.query(expr, *, inplace=False, **kwargs)`
Query the columns of a DataFrame with a boolean expression.

## Parameters

- **`expr`**: str
  The query string to evaluate.
  - You can refer to variables in the environment by prefixing them with an `@` character like `@a + b`.
  - You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuation (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `` `Area (cm^2)` ``.) Column names that are Python keywords (like “for” or “import”) cannot be used.

  For example, if one of your columns is called `a a` and you want to sum it with `b`, your query should be `` `a a` + b ``.

- **`inplace`**: bool, default False
  Whether to modify the DataFrame rather than creating a new one.

- **`**kwargs`**:
  See the documentation for `eval()` for complete details on the keyword arguments accepted by `DataFrame.query()`.

## Returns

- **DataFrame or None**
  DataFrame resulting from the provided query expression or None if `inplace=True`.
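
The two referencing rules above (`@` for local variables, backticks for awkward column names) can be sketched as follows. The frame `df_q`, its column names, and the `min_bonus` variable are hypothetical and chosen only for illustration.

```python
import pandas as pd

df_q = pd.DataFrame({
    "Bonus": [500, 1500, 2500],
    "Base Salary": [40000, 50000, 60000],  # space in the name, so backticks are needed
})

min_bonus = 1000  # local variable, referenced in the query with @

result = df_q.query("Bonus > @min_bonus and `Base Salary` < 60000")
print(result)

# The same filter written as a boolean mask, for comparison
mask = (df_q["Bonus"] > min_bonus) & (df_q["Base Salary"] < 60000)
print(result.equals(df_q[mask]))
```

Only the middle row satisfies both conditions, and the mask version selects the same rows.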


In [4]:
df_salary.query("Bonus > 1000 and Name.str.endswith('1')")

    EmployeeID         Name Department   Position  BaseSalary  Bonus
0            1   Employee 1         IT    Analyst      107207   5211
10          11  Employee 11      Sales  Developer       73429   2289
20          21  Employee 21  Marketing     Intern       67090   5827


# pandas.qcut

`pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')`
Quantile-based discretization function.

Discretize a variable into equal-sized buckets based on rank or on sample quantiles. For example, 1000 values split into 10 quantiles would produce a Categorical object indicating quantile membership for each data point.

## Parameters

- **`x`**: 1d ndarray or Series
  The input data to be discretized.

- **`q`**: int or list-like of float
  Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

- **`labels`**: array or False, default None
  Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.

- **`retbins`**: bool, optional
  Whether to return the (bins, labels) or not. Can be useful if bins is given as a scalar.

- **`precision`**: int, optional
  The precision at which to store and display the bins labels.

- **`duplicates`**: {‘raise’, ‘drop’}, optional
  If bin edges are not unique, raise ValueError or drop non-uniques.

## Returns

- **`out`**: Categorical or Series or array of integers if labels is False
  The return type (Categorical or Series) depends on the input: a Series of type category if input is a Series else Categorical. Bins are represented as categories when categorical data is returned.

- **`bins`**: ndarray of floats
  Returned only if `retbins` is True.
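
As a rough illustration of `labels=False`, `retbins=True`, and `duplicates='drop'`, the sketch below uses two small made-up Series (`values` and `skewed` are hypothetical). It assumes the default linear interpolation when computing quantiles.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# labels=False returns integer bin codes; retbins=True also returns the edges
codes, edges = pd.qcut(values, q=4, labels=False, retbins=True)
print(codes.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
print(edges)           # quartile edges: 1, 2.75, 4.5, 6.25, 8

# Heavily repeated values make several quantile edges coincide.
# duplicates='drop' merges them instead of raising ValueError.
skewed = pd.Series([0, 0, 0, 0, 1, 2])
binned = pd.qcut(skewed, q=4, duplicates="drop")
print(binned.cat.categories)
```

With `duplicates='drop'` the result has fewer bins than requested, so a fixed-length `labels` list would no longer match; that is why the second call uses the default interval labels.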


In [5]:
pd.qcut(df_salary['Bonus'], 4, labels=['less', 'average', 'enough', 'many'])

0     average
1      enough
2        many
3      enough
4        many
5      enough
6        many
7        less
8        many
9     average
10       less
11       many
12     enough
13    average
14       less
15       less
16     enough
17    average
18    average
19       less
20    average
21       less
22       many
23       many
24       many
25       less
26     enough
27    average
28       less
29     enough
Name: Bonus, dtype: category
Categories (4, object): ['less' < 'average' < 'enough' < 'many']

In [6]:
pd.qcut(df_salary['Bonus'], 4, labels=['less', 'average', 'enough', 'many'], retbins=True)

(0     average
 1      enough
 2        many
 3      enough
 4        many
 5      enough
 6        many
 7        less
 8        many
 9     average
 10       less
 11       many
 12     enough
 13    average
 14       less
 15       less
 16     enough
 17    average
 18    average
 19       less
 20    average
 21       less
 22       many
 23       many
 24       many
 25       less
 26     enough
 27    average
 28       less
 29     enough
 Name: Bonus, dtype: category
 Categories (4, object): ['less' < 'average' < 'enough' < 'many'],
 array([2066.  , 3925.75, 6020.5 , 8006.75, 9834.  ]))

In [17]:
pd.qcut(df_salary['BaseSalary'], 3, precision=5)

0            (99184.0, 111210.0]
1         (75200.33333, 99184.0]
2            (99184.0, 111210.0]
3     (50077.99999, 75200.33333]
4         (75200.33333, 99184.0]
5            (99184.0, 111210.0]
6     (50077.99999, 75200.33333]
7            (99184.0, 111210.0]
8         (75200.33333, 99184.0]
9         (75200.33333, 99184.0]
10    (50077.99999, 75200.33333]
11    (50077.99999, 75200.33333]
12        (75200.33333, 99184.0]
13        (75200.33333, 99184.0]
14           (99184.0, 111210.0]
15        (75200.33333, 99184.0]
16           (99184.0, 111210.0]
17           (99184.0, 111210.0]
18    (50077.99999, 75200.33333]
19    (50077.99999, 75200.33333]
20    (50077.99999, 75200.33333]
21           (99184.0, 111210.0]
22    (50077.99999, 75200.33333]
23        (75200.33333, 99184.0]
24           (99184.0, 111210.0]
25    (50077.99999, 75200.33333]
26    (50077.99999, 75200.33333]
27           (99184.0, 111210.0]
28        (75200.33333, 99184.0]
29        (75200.33333, 99184.0]
Name: Base

# pandas.cut

`pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)`
Bin values into discrete intervals.

Use `cut` when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, `cut` could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

## Parameters

- **`x`**: array-like
  The input array to be binned. Must be 1-dimensional.

- **`bins`**: int, sequence of scalars, or IntervalIndex
  The criteria to bin by.
  - `int`: Defines the number of equal-width bins in the range of `x`. The range of `x` is extended by 0.1% on each side to include the minimum and maximum values of `x`.
  - `sequence of scalars`: Defines the bin edges allowing for non-uniform width. No extension of the range of `x` is done.
  - `IntervalIndex`: Defines the exact bins to be used. Note that `IntervalIndex` for bins must be non-overlapping.

- **`right`**: bool, default True
  Indicates whether bins include the rightmost edge or not. If `right == True` (the default), then the bins `[1, 2, 3, 4]` indicate `(1,2]`, `(2,3]`, `(3,4]`. This argument is ignored when `bins` is an `IntervalIndex`.

- **`labels`**: array or False, default None
  Specifies the labels for the returned bins. Must be the same length as the resulting bins. If `False`, returns only integer indicators of the bins. This affects the type of the output container. This argument is ignored when `bins` is an `IntervalIndex`. If `True`, raises an error. When `ordered=False`, `labels` must be provided.

- **`retbins`**: bool, default False
  Whether to return the bins or not. Useful when `bins` is provided as a scalar.

- **`precision`**: int, default 3
  The precision at which to store and display the bins labels.

- **`include_lowest`**: bool, default False
  Whether the first interval should be left-inclusive or not.

- **`duplicates`**: {‘raise’, ‘drop’}, optional
  If bin edges are not unique, raise ValueError or drop non-uniques.

- **`ordered`**: bool, default True
  Whether the labels are ordered or not. Applies to returned types `Categorical` and `Series` (with `Categorical` dtype). If `True`, the resulting categorical will be ordered. If `False`, the resulting categorical will be unordered (labels must be provided).

## Returns

- **`out`**: Categorical, Series, or ndarray
  An array-like object representing the respective bin for each value of `x`. The type depends on the value of `labels`.
  - `None` (default): returns a `Series` for `Series x` or a `Categorical` for all other inputs. The values stored within are `Interval` dtype.
  - `sequence of scalars`: returns a `Series` for `Series x` or a `Categorical` for all other inputs. The values stored within are whatever the type in the sequence is.
  - `False`: returns an `ndarray` of integers.

- **`bins`**: numpy.ndarray or IntervalIndex
  The computed or specified bins. Only returned when `retbins=True`. For scalar or sequence bins, this is an `ndarray` with the computed bins. If `duplicates='drop'` is set, non-unique bin edges are dropped. For an `IntervalIndex` passed as `bins`, this is equal to `bins`.
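
The difference between explicit edges and an integer bin count, and the effect of `right=False`, can be sketched on a small made-up Series. The `ages` data and the group labels are hypothetical, loosely following the age-range example mentioned above.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 40, 65, 90])

# Explicit edges; right=False makes the bins left-closed: [0, 18), [18, 65), [65, 120)
groups = pd.cut(ages, bins=[0, 18, 65, 120], right=False,
                labels=["minor", "adult", "senior"])
print(groups.tolist())  # ['minor', 'minor', 'adult', 'adult', 'senior', 'senior']

# An integer asks for equal-width bins; retbins=True also returns the computed edges
binned, edges = pd.cut(ages, bins=3, retbins=True)
print(edges)
```

With an integer `bins`, the lowest edge is nudged slightly below the minimum, so every value lands in a bin without needing `include_lowest=True`.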


In [32]:
# Edges at min, mean, max. Bins are right-closed, so the row holding the minimum gets NaN.
stats = df_salary['BaseSalary'].describe()
bins = [stats['min'], stats['mean'], stats['max']]
pd.cut(df_salary['BaseSalary'], bins)

0     (85851.9, 111210.0]
1     (85851.9, 111210.0]
2     (85851.9, 111210.0]
3      (50078.0, 85851.9]
4      (50078.0, 85851.9]
5     (85851.9, 111210.0]
6                     NaN
7     (85851.9, 111210.0]
8      (50078.0, 85851.9]
9      (50078.0, 85851.9]
10     (50078.0, 85851.9]
11     (50078.0, 85851.9]
12    (85851.9, 111210.0]
13     (50078.0, 85851.9]
14    (85851.9, 111210.0]
15    (85851.9, 111210.0]
16    (85851.9, 111210.0]
17    (85851.9, 111210.0]
18     (50078.0, 85851.9]
19     (50078.0, 85851.9]
20     (50078.0, 85851.9]
21    (85851.9, 111210.0]
22     (50078.0, 85851.9]
23    (85851.9, 111210.0]
24    (85851.9, 111210.0]
25     (50078.0, 85851.9]
26     (50078.0, 85851.9]
27    (85851.9, 111210.0]
28     (50078.0, 85851.9]
29    (85851.9, 111210.0]
Name: BaseSalary, dtype: category
Categories (2, interval[float64, right]): [(50078.0, 85851.9] < (85851.9, 111210.0]]

In [33]:
pd.cut(df_salary['BaseSalary'], bins, labels=['min', 'max'])

0     max
1     max
2     max
3     min
4     min
5     max
6     NaN
7     max
8     min
9     min
10    min
11    min
12    max
13    min
14    max
15    max
16    max
17    max
18    min
19    min
20    min
21    max
22    min
23    max
24    max
25    min
26    min
27    max
28    min
29    max
Name: BaseSalary, dtype: category
Categories (2, object): ['min' < 'max']

In [34]:
pd.cut(df_salary['BaseSalary'], bins, labels=['min', 'max'], include_lowest=True)

0     max
1     max
2     max
3     min
4     min
5     max
6     min
7     max
8     min
9     min
10    min
11    min
12    max
13    min
14    max
15    max
16    max
17    max
18    min
19    min
20    min
21    max
22    min
23    max
24    max
25    min
26    min
27    max
28    min
29    max
Name: BaseSalary, dtype: category
Categories (2, object): ['min' < 'max']

# pandas.factorize

`pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)`
Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. `factorize` is available as both a top-level function `pandas.factorize()`, and as a method `Series.factorize()` and `Index.factorize()`.

## Parameters

- **`values`**: sequence
  A 1-D sequence. Sequences that aren’t pandas objects are coerced to `ndarrays` before factorization.

- **`sort`**: bool, default False
  Sort uniques and shuffle codes to maintain the relationship.

- **`use_na_sentinel`**: bool, default True
  If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not drop the NaN from the uniques of the values.
  - Added in version 1.5.0.

- **`size_hint`**: int, optional
  Hint to the hashtable sizer.

## Returns

- **`codes`**: ndarray
  An integer `ndarray` that’s an indexer into uniques. `uniques.take(codes)` will have the same values as `values`.

- **`uniques`**: ndarray, Index, or Categorical
  The unique valid values. When `values` is Categorical, `uniques` is a Categorical. When `values` is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
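
The relationship between `codes` and `uniques` described above can be checked directly. This is a minimal sketch on a made-up NumPy array (`values` is hypothetical); since the input is not a pandas object, `uniques` comes back as an ndarray.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "a", "b", "c", "a"], dtype=object)

# Codes follow order of first appearance: b=0, a=1, c=2
codes, uniques = pd.factorize(values)
print(codes)    # [0 1 0 2 1]
print(uniques)  # ['b' 'a' 'c']

# uniques.take(codes) reconstructs the original values
restored = uniques.take(codes)
print(list(restored) == list(values))  # True

# sort=True orders the uniques and remaps the codes to match
sorted_codes, sorted_uniques = pd.factorize(values, sort=True)
print(sorted_codes)    # [1 0 1 2 0]
print(sorted_uniques)  # ['a' 'b' 'c']
```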


In [45]:
df_salary.at[3, 'Department'] = np.nan

In [43]:

codes, uniques = pd.factorize(df_salary['Department'], sort=False)
print(codes)
print(uniques)

[ 0  1  2 -1  3  3  0  4  0  3  1  4  0  1  3  0  4  2  1  4  3  1  2  1
  2  0  2  4  1  2]
Index(['IT', 'Sales', 'HR', 'Marketing', 'Finance'], dtype='object')


In [44]:
df_salary

    EmployeeID         Name Department   Position  BaseSalary  Bonus
0            1   Employee 1         IT    Analyst      107207   5211
1            2   Employee 2      Sales    Analyst       98554   7851
2            3   Employee 3         HR    Analyst      105109   9399
3            4   Employee 4        NaN  Developer       61331   6427
4            5   Employee 5  Marketing     Intern       76086   8319
5            6   Employee 6  Marketing    Manager      108205   7499
6            7   Employee 7         IT    Manager       50078   9834
7            8   Employee 8    Finance  Developer      102152   3519
8            9   Employee 9         IT    Analyst       78333   8099
9           10  Employee 10  Marketing  Developer       84009   5203


In [47]:
pd.factorize(df_salary['Department'], sort=False, use_na_sentinel=False)

(array([0, 1, 2, 3, 4, 4, 0, 5, 0, 4, 1, 5, 0, 1, 4, 0, 5, 2, 1, 5, 4, 1,
        2, 1, 2, 0, 2, 5, 1, 2], dtype=int64),
 Index(['IT', 'Sales', 'HR', nan, 'Marketing', 'Finance'], dtype='object'))

# pandas.DataFrame.melt

`DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)`
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (`id_vars`), while all other columns, considered measured variables (`value_vars`), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

## Parameters

- **`id_vars`**: scalar, tuple, list, or ndarray, optional
  Column(s) to use as identifier variables.

- **`value_vars`**: scalar, tuple, list, or ndarray, optional
  Column(s) to unpivot. If not specified, uses all columns that are not set as `id_vars`.

- **`var_name`**: scalar, default None
  Name to use for the ‘variable’ column. If None it uses `frame.columns.name` or ‘variable’.

- **`value_name`**: scalar, default ‘value’
  Name to use for the ‘value’ column, can’t be an existing column label.

- **`col_level`**: scalar, optional
  If columns are a `MultiIndex` then use this level to melt.

- **`ignore_index`**: bool, default True
  If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

## Returns

- **DataFrame**
  Unpivoted DataFrame.
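
A minimal sketch of the wide-to-long reshaping described above, using a small made-up frame. The `wide` data and the `Component` / `Amount` column names are hypothetical and only serve to show how `id_vars`, `value_vars`, `var_name`, and `value_name` fit together.

```python
import pandas as pd

wide = pd.DataFrame({
    "EmployeeID": [1, 2],
    "BaseSalary": [60000, 75000],
    "Bonus": [3000, 5000],
})

# Keep EmployeeID as the identifier; stack the two pay columns into rows
long = wide.melt(
    id_vars="EmployeeID",
    value_vars=["BaseSalary", "Bonus"],
    var_name="Component",
    value_name="Amount",
)
print(long)
```

Each of the two value columns contributes one row per employee, so the 2x3 wide frame becomes a 4x3 long frame, with all `BaseSalary` rows first and all `Bonus` rows after.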


# pandas.DataFrame.explode

`DataFrame.explode(column, ignore_index=False)`
Transform each element of a list-like to a row, replicating index values.

## Parameters

- **`column`**: IndexLabel
  Column(s) to explode. For multiple columns, specify a non-empty list with each element being `str` or `tuple`, and all specified columns' list-like data on the same row of the frame must have matching length.
  - **Added in version 1.3.0**: Multi-column explode.

- **`ignore_index`**: bool, default False
  If True, the resulting index will be labeled 0, 1, …, n - 1.

## Returns

- **DataFrame**
  Exploded lists to rows of the subset columns; index will be duplicated for these rows.

## Raises

- **ValueError**
  - If the columns of the frame are not unique.
  - If the list of columns to explode is empty.
  - If the columns to explode do not have matching element counts row-wise in the frame.



In [48]:
data = {'ID': [1, 2], 'Names': [['Alice', 'Bob'], ['Charlie', 'David']], 'Scores': [[85, 90], [75, 80]]}
df = pd.DataFrame(data)
df

   ID             Names    Scores
0   1      [Alice, Bob]  [85, 90]
1   2  [Charlie, David]  [75, 80]


In [52]:
df.explode(['Names', 'Scores'])

   ID    Names Scores
0   1    Alice     85
0   1      Bob     90
1   2  Charlie     75
1   2    David     80
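
The `ignore_index` parameter and the mismatched-length `ValueError` listed above can be sketched as follows. The frames `df_tags` and `bad` are hypothetical and separate from the example above.

```python
import pandas as pd

df_tags = pd.DataFrame({"ID": [1, 2], "Tags": [["x", "y"], ["z"]]})

kept = df_tags.explode("Tags")                      # original labels repeat: 0, 0, 1
reset = df_tags.explode("Tags", ignore_index=True)  # fresh labels: 0, 1, 2
print(kept.index.tolist())
print(reset.index.tolist())

# Multi-column explode needs the same number of elements per row in each column
bad = pd.DataFrame({"A": [[1, 2]], "B": [[1]]})
try:
    bad.explode(["A", "B"])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Keeping the duplicated index is handy for joining back to the original rows; resetting it is usually more convenient when the exploded frame is used on its own.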
