### __Pandas__

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.


If you have Python and PIP already installed on a system, then installation of Pandas is very easy.

Install it using one of these commands (pip, or pipenv if the project is managed with Pipenv):

C:\Users\Your Name>pip install pandas

C:\Users\Your Name>pipenv install pandas


##### __Import Pandas__

Once Pandas is installed, import it in your applications with the import keyword:

In [1]:

import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


##### _Pandas as pd_

Pandas is usually imported under the pd alias.

In Python, an alias is an alternate name for referring to the same thing.

Create an alias with the "as" keyword while importing:

In [2]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2


##### __Pandas Series__

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

In [3]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

0    1
1    7
2    2
dtype: int64


##### _Labels_

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [4]:
print(myvar[0])

1


With the "index" argument, you can name your own labels.

In [6]:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)

print()

print(myvar["y"])

x    1
y    7
z    2
dtype: int64

7
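
As a small illustrative sketch (not part of the original notebook), the same value can be reached either by its label or by its position. The `.loc` and `.iloc` accessors make explicit which of the two is meant.

```python
import pandas as pd

# Same data as above, with custom string labels.
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # label-based lookup
by_position = myvar.iloc[1]    # position-based lookup (second element)

print(by_label, by_position)   # both refer to the same value
```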


##### _Key/Value Objects as Series_

You can also use a key/value object, like a dictionary, when creating a Series.

In [7]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.

In [11]:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(f"{myvar} \n")
print(type(myvar))

day1    420
day2    380
dtype: int64 

<class 'pandas.core.series.Series'>
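
A short sketch of what happens when the index argument names a label that is not a key in the dictionary. The label "day4" is a hypothetical addition used only for illustration. The missing entry is filled with NaN, and the values become floating-point numbers.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
partial = pd.Series(calories, index=["day1", "day4"])

print(partial)
print(partial.isna())  # True where no value was found
```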


##### __Pandas DataFrame__

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

DataFrame(_data, columns_)

##### _DataFrame Attributes_

In the context of pandas, attributes are properties of a DataFrame object that provide information about the data it contains.

Attributes in pandas allow you to access general information about a dataset without performing any data manipulation. They provide a convenient way to get an overview of the DataFrame's structure and contents.

Pandas Data types

| Data | Python | Pandas |
|---|---|---|
| string | str | object |
| integer number | int | int64 |
| floating-point number | float | float64 |
| logical data | bool | bool |
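
The mapping can be checked on a small DataFrame. This is a minimal sketch with made-up sample data. The dtype reported for the string column may depend on the pandas version (object, or a dedicated string dtype in newer releases).

```python
import pandas as pd

# Hypothetical sample data: one column per Python type listed above.
sample = pd.DataFrame({
    "name": ["Ann", "Bob"],      # str
    "age": [31, 45],             # int
    "height": [1.62, 1.80],      # float
    "member": [True, False],     # bool
})

print(sample.dtypes)
```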

In [31]:
import pandas as pd
import os

print(f"{os.getcwd()} \n")

df = pd.read_csv('DataSets/music_log.csv')

print(f"dtypes: \n {df.dtypes} \n") # Returns the data types of each column.
print(f"index: \n {df.index} \n") # Returns the row labels as an Index object.
print(f"columns: \n {df.columns} \n") # Returns the column labels as an Index object.
print(f"shape: \n {df.shape} \n") # Returns the dimensions of the DataFrame as a tuple (rows, columns)
print(f"size: \n {df.size} \n") # Returns the total number of elements (rows × columns).
print(f"ndim: \n {df.ndim} \n") # Returns the number of dimensions (always 2 for a DataFrame).
print(f"empty: \n {df.empty} \n") # Returns True if the DataFrame is empty.
print(f"values: \n {df.values} \n") # Returns the underlying NumPy array representation.
print(f"axes: \n {df.axes} \n") # Returns a list of row and column index labels.
print(f"T: \n {df.T} \n") # Returns the transpose of the DataFrame.

c:\Users\luisp\OneDrive\Documentos\GitHub\Python-VENV\Python tutorial 

dtypes: 
   user_id      object
total play    float64
Artist         object
genre          object
track          object
dtype: object 

index: 
 RangeIndex(start=0, stop=67963, step=1) 

columns: 
 Index(['  user_id', 'total play', 'Artist', 'genre', 'track'], dtype='object') 

shape: 
 (67963, 5) 

size: 
 339815 

ndim: 
 2 

empty: 
 False 

values: 
 [['BF6EA5AF' 92.85138808302445 'Marina Rei' 'pop' 'Musica']
 ['FB1E568E' 282.981 'Stive Morgan' 'ambient' 'Love Planet']
 ['FB1E568E' 282.981 'Stive Morgan' 'ambient' 'Love Planet']
 ...
 ['26B7058C' 292.455 'Red God' 'metal' 'Действуй!']
 ['DB0038A8' 11.529112451445515 'Less Chapell' 'pop' 'Home']
 ['FE8684F6' 0.1 nan nan nan]] 

axes: 
 [RangeIndex(start=0, stop=67963, step=1), Index(['  user_id', 'total play', 'Artist', 'genre', 'track'], dtype='object')] 

T: 
                  0             1             2                    3      \
  user_id     BF6EA5AF    

##### _DataFrame from a List_

In [33]:
import pandas as pd

atlas = [
    ['France', 'Paris'],
    ['Russia', 'Moscow'],
    ['China', 'Beijing'],
    ['Mexico', 'Mexico City'],
    ['Egypt', 'Cairo'],
]
geography = ['country', 'capital']

world_map = pd.DataFrame(data=atlas, columns=geography)

print(world_map)

  country      capital
0  France        Paris
1  Russia       Moscow
2   China      Beijing
3  Mexico  Mexico City
4   Egypt        Cairo


##### _DataFrame from a Dictionary_

In [35]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45],
  "type": ["carbs", "glucose", "sugar"]
}

df = pd.DataFrame(data)

print(df) 

   calories  duration     type
0       420        50    carbs
1       380        40  glucose
2       390        45    sugar


##### _DataFrame from a File_

When working with files in Pandas (e.g., using pd.read_csv()), you can specify the file path as either an absolute path or a relative path. Here's the difference between the two:

_Absolute Path_

df = pd.read_csv(r'C:\Users\luisp\OneDrive\Documentos\GitHub\Python-VENV\DataSets\music_log.csv')

An absolute path specifies the complete path to a file or directory, starting from the root of the file system. It is independent of the current working directory of your script.

Key Points:

Starts from the root directory (e.g., C:\ on Windows or / on Linux/Mac).

Always points to the same file, regardless of where the script is executed.

Useful when the file location is fixed or when working with files outside the script's directory.


_Relative Path_

df = pd.read_csv('DataSets/music_log.csv')

A relative path specifies the location of a file relative to the current working directory (CWD) of your script. The CWD is the directory from which your script is executed.

Key Points:

Starts from the current working directory.

Shorter and more portable than absolute paths.

Depends on the current working directory. If the script is executed from a different directory, the relative path may not work.
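
The relationship between the two kinds of path can be sketched with the standard pathlib module. This is only an illustration: it builds the absolute path that the relative path from the text would resolve to under the current working directory, without opening the file.

```python
from pathlib import Path

relative = Path("DataSets/music_log.csv")      # relative path from the text above
absolute = (Path.cwd() / relative).resolve()   # same location, anchored at the CWD

print(relative.is_absolute())
print(absolute.is_absolute())
print(absolute)  # changes when the script is started from another directory
```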


In [None]:
import pandas as pd

df = pd.read_csv('DataSets/mouse_growth_rate.csv') # Reads a .csv file. Excel files (.xlsx, .xls, .xlsm, .xlsb) are read with pd.read_excel() instead.

print(df) 

print()

print(pd.options.display.max_rows) # Maximum number of rows pandas prints before truncating the output. Default is 60.

   age  mouse1  mouse2
0    1      24      18
1    2      56      36
2    3      64      50
3    4      82      68
4    5      92      72
5    6      94      72
6    7      88      74

60


##### _DataFrame Methods()_

_.info()_

The info() method prints information about the DataFrame.

The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).

In [45]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

df.info() # Displays a summary of the DataFrame, including data types and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67963 entries, 0 to 67962
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0     user_id   67963 non-null  object 
 1   total play  67399 non-null  float64
 2   Artist      59646 non-null  object 
 3   genre       64661 non-null  object 
 4   track       64804 non-null  object 
dtypes: float64(1), object(4)
memory usage: 2.6+ MB


_.head()_

The head() method returns a specified number of rows, starting from the top.

The head() method returns the first 5 rows if a number is not specified.

In [46]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

df.head() # Returns the first n rows (default is 5).

Unnamed: 0,user_id,total play,Artist,genre,track
0,BF6EA5AF,92.851388,Marina Rei,pop,Musica
1,FB1E568E,282.981,Stive Morgan,ambient,Love Planet
2,FB1E568E,282.981,Stive Morgan,ambient,Love Planet
3,EF15C7BA,8.966,,dance,Loving Every Minute
4,82F52E69,193.776327,Rixton,pop,Me And My Broken Heart


_.tail()_

The tail() method returns a specified number of rows, counted from the bottom.

The tail() method returns the last 5 rows if a number is not specified.

In [47]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

df.tail(10) #  Returns the last n rows.

Unnamed: 0,user_id,total play,Artist,genre,track
67953,A06381D8,2.502,Flip Grater,folk,My Old Shoes
67954,6E8E430E,139.627717,Alt & J,trance,Emotion
67955,D83CBA77,185.0,TKN,rock,Не отступай
67956,816FBC10,2.0,89ers,dance,Go Go Go
67957,18510741,109.0,Steel Pulse,reggae,Chant A Psalm
67958,2E27DF51,220.551837,Nadine Coyle,pop,Girls On Fire
67959,4F29D4D5,26.127,Digital Hero,dance,The Model
67960,26B7058C,292.455,Red God,metal,Действуй!
67961,DB0038A8,11.529112,Less Chapell,pop,Home
67962,FE8684F6,0.1,,,
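
To see head() and tail() with an explicit row count without loading the music log, here is a minimal sketch on a hypothetical ten-row DataFrame.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for illustration.
df_small = pd.DataFrame({"n": range(10)})

first_three = df_small.head(3)   # rows labeled 0, 1, 2
last_two = df_small.tail(2)      # rows labeled 8, 9

print(first_three)
print(last_two)
```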


_.memory_usage()_

The memory_usage() method returns the memory usage of each column in bytes.

In [48]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

# df.memory_usage() # Returns the memory usage of each column in bytes.
df.memory_usage(index=True, deep=True) # index=True includes the index; deep=True also measures the contents of object (string) columns.

Index             132
  user_id     3937800
total play     543704
Artist        4103453
genre         3656054
track         4503010
dtype: int64

_.describe()_

The describe() method returns a description of the data in the DataFrame.

If the DataFrame contains numerical data, the description contains this information for each column:

count - The number of not-empty values.

mean - The average (mean) value.

std - The standard deviation.

min - The minimum value.

25% - The 25th percentile*.

50% - The 50th percentile* (the median).

75% - The 75th percentile*.

max - The maximum value.

*Percentile meaning: the value below which the given percentage of the values fall.

In [49]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

df.describe() # Returns a summary of statistics for numerical columns. 

Unnamed: 0,total play
count,67399.0
mean,98.899155
std,144.460713
min,0.0
25%,2.019
50%,20.135056
75%,194.335
max,8638.736
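
The percentile rows can be cross-checked with quantile(). This is a small sketch on made-up values, not on the music log data.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])  # hypothetical sample values

summary = values.describe()
print(summary)

# The "25%", "50%" and "75%" rows correspond to these quantiles.
print(values.quantile([0.25, 0.5, 0.75]))
```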


_.add()_

The add() method adds each value in the DataFrame with a specified value.

The specified value must be an object that can be added to the values of the DataFrame. It can be a constant number like the one in the example, a list-like object such as a list [15, 20], a dictionary such as {"points": 380, "total": 22}, a Pandas Series, or another DataFrame that fits with the original DataFrame.

_dataframe.add(other, axis, level, fill_value)_

Parameter	Description

_other_	Required. A number, list of numbers, or another object with a data structure that fits with the original DataFrame.

_axis_	Optional. Decides whether to compare by index or by columns. The default is 'columns'.

_0_ or 'index' means compare by index.

_1_ or 'columns' means compare by columns.

_level_	Optional. A number or label that indicates which level of a MultiIndex to match on.

_fill_value_	Optional. A number, or None. Specifies the value that replaces NaN values before the addition.


In [7]:
import pandas as pd

data = {
  "points": [100, 120, 114],
  "total": [350, 340, 402]
}

df = pd.DataFrame(data)

print(df.add(15))

   points  total
0     115    365
1     135    355
2     129    417
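
The other accepted inputs can be sketched as follows. The second DataFrame (`other`) is a hypothetical example introduced only to show the effect of fill_value.

```python
import pandas as pd

df = pd.DataFrame({"points": [100, 120, 114], "total": [350, 340, 402]})

# A list with one number per column: 15 is added to "points", 20 to "total".
per_column = df.add([15, 20])

# A second, hypothetical DataFrame with a missing value in "points".
other = pd.DataFrame({"points": [1, None, 3], "total": [10, 20, 30]})

# fill_value=0 treats the missing value as 0 instead of producing NaN.
filled = df.add(other, fill_value=0)

print(per_column)
print(filled)
```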


_.aggregate()_

The aggregate() method allows you to apply a function or a list of function names along one of the axes of the DataFrame. The default is 0, which is the index (row) axis.

_dataframe.aggregate(func, axis, args, kwargs)_

_func_	 	Required. A function, function name, or a list of function names to apply to the DataFrame.

_axis_	0, 1, 'index', 'columns' Optional. Which axis to apply the function to. Default 0.

_args_	 	Optional. Arguments to send into the function.

_kwargs_	 	Optional. Keyword arguments to send into the function.

In [8]:
import pandas as pd

data = {
  "x": [50, 40, 30],
  "y": [300, 1112, 42]
}

df = pd.DataFrame(data)

x = df.aggregate(["sum"])

print(x)

       x     y
sum  120  1454
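
A brief sketch of two other forms of the func argument: a list with several function names, and a dictionary that assigns one function to each column.

```python
import pandas as pd

df = pd.DataFrame({"x": [50, 40, 30], "y": [300, 1112, 42]})

# Several function names at once: one result row per function.
stats = df.aggregate(["sum", "min", "max"])

# A dictionary applies a different function to each column.
per_column = df.aggregate({"x": "sum", "y": "max"})

print(stats)
print(per_column)
```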


_pd.concat()_

The concat() function in Pandas is used to concatenate (combine) multiple DataFrames or Series along a particular axis. Unlike the methods above, it is called on the pandas module itself (pd.concat()), not on a DataFrame. It has several arguments that allow you to customize how the concatenation is performed.

_pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)_

_objs_ A list or dictionary of DataFrames or Series to concatenate. Required

_axis_ Specifies whether to concatenate along rows (axis=0) or columns (axis=1). axis=0: Combines rows (default behavior). axis=1: Combines columns side by side. Default(0)

_ignore_index_ If True, resets the index in the resulting DataFrame to a continuous range starting from 0. If False, retains the original indices from the input DataFrames. Default(False)

_join_ Specifies how to handle columns that are not present in all DataFrames: 'outer': Includes all columns (union of columns). 'inner': Includes only common columns (intersection of columns). Default(outer)

_keys_ Adds a hierarchical index (MultiIndex) to the resulting DataFrame, with labels for each DataFrame being concatenated. Default(none)

_levels_ Used with keys to specify custom levels for the hierarchical index. Default(none)

_names_ Assigns names to the levels of the hierarchical index created by keys. Default(none)

_verify_integrity_ If True, checks for duplicate indices in the resulting DataFrame and raises an error if duplicates are found. Default(False)

_sort_ If True, sorts the columns in the resulting DataFrame. Default(False)

_copy_ If True, creates a copy of the data. If False, avoids unnecessary data copying (may improve performance). Default(True)

In [12]:
import pandas as pd

data1 = {
  "age": [16, 14, 10],
  "qualified": [True, True, True]
}
df1 = pd.DataFrame(data1)

data2 = {
  "age": [55, 40],
  "qualified": [True, False]
}
df2 = pd.DataFrame(data2)

newdf = pd.concat([df1, df2], ignore_index=True)

print(newdf)

   age  qualified
0   16       True
1   14       True
2   10       True
3   55       True
4   40      False
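
The join, keys, and axis arguments can be illustrated with a minimal sketch. The "city" column is a hypothetical addition so that the two DataFrames do not share all columns.

```python
import pandas as pd

df1 = pd.DataFrame({"age": [16, 14], "qualified": [True, True]})
df2 = pd.DataFrame({"age": [55, 40], "city": ["Oslo", "Lima"]})  # hypothetical extra column

# join="inner" keeps only the columns shared by both DataFrames.
common = pd.concat([df1, df2], join="inner", ignore_index=True)

# keys labels each source DataFrame in an outer index level.
labeled = pd.concat([df1, df2], keys=["first", "second"])

# axis=1 places the DataFrames side by side, aligned on the row index.
side_by_side = pd.concat([df1, df2], axis=1)

print(common)
print(labeled)
print(side_by_side)
```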


_.count()_

The count() method counts the number of non-empty values for each column, or for each row if you specify the axis parameter as axis='columns', and returns a Series object with the result for each column (or row).

_dataframe.count(axis, level, numeric_only)_

_axis_	0 1 'index' 'columns'	Optional. Which axis to count along, default 0.

_level_	Number level name	Optional. Specifies which level (in a hierarchical multi index) to count along. This parameter was removed in pandas 2.0.

_numeric_only_	True False	Optional, default False. Set to True if the count method should only count numeric values.

In [13]:
import pandas as pd

data = {
  "Duration": [50, 40, None, None, 90, 20],
  "Pulse": [109, 140, 110, 125, 138, 170]
}

df = pd.DataFrame(data)

print(df.count())

Duration    4
Pulse       6
dtype: int64
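
For comparison, a short sketch of the same data counted per row by passing axis='columns'.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [50, 40, None, None, 90, 20],
    "Pulse": [109, 140, 110, 125, 138, 170],
})

per_row = df.count(axis="columns")  # one count per row instead of per column
print(per_row)
```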


_.drop_duplicates()_

The drop_duplicates() method removes duplicate rows.

Use the subset parameter if only some specified columns should be considered when looking for duplicates.

_dataframe.drop_duplicates(subset, keep, inplace, ignore_index)_

_subset_	column label(s)	Optional. A string, or a list, containing the columns to use when looking for duplicates. If not specified, all columns are used.

_keep_	'first' 'last' False	Optional, default 'first'. Specifies which duplicate to keep. If False, drop ALL duplicates.

_inplace_	True False	Optional, default False. If True: the removing is done on the current DataFrame. If False: returns a copy where the removing is done.

_ignore_index_	True False	Optional, default False. Specifies whether to relabel the remaining rows 0, 1, 2, etc., or keep their original labels.

In [16]:
import pandas as pd

data = {
  "name": ["Sally", "Mary", "John", "Mary"],
  "age": [50, 40, 30, 40],
  "qualified": [True, False, False, False]
}

df = pd.DataFrame(data)

newdf = df.drop_duplicates()

print(newdf)

    name  age  qualified
0  Sally   50       True
1   Mary   40      False
2   John   30      False
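
The subset and keep parameters can be sketched as follows. The ages are changed slightly (a hypothetical variation of the data above) so that the two "Mary" rows are duplicates only in the "name" column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Sally", "Mary", "John", "Mary"],
    "age": [50, 40, 30, 41],  # hypothetical: the two "Mary" rows differ in age
})

# Only "name" is compared; the last occurrence of each name is kept.
by_name = df.drop_duplicates(subset="name", keep="last", ignore_index=True)

# keep=False drops every row whose name occurs more than once.
unique_names = df.drop_duplicates(subset="name", keep=False)

print(by_name)
print(unique_names)
```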


_.dropna()_

The dropna() method removes the rows that contain NULL values.

The dropna() method returns a new DataFrame object unless the inplace parameter is set to True. In that case, the dropna() method removes the rows from the original DataFrame instead.

_dataframe.dropna(axis, how, thresh, subset, inplace)_

_axis_	0 1 'index' 'columns'	Optional, default 0. 0 and 'index' remove ROWS that contain NULL values. 1 and 'columns' remove COLUMNS that contain NULL values.

_how_	'all' 'any'	Optional, default 'any'. Specifies whether to remove the row or column when ALL values are NULL, or if ANY value is NULL.

_thresh_	Number	Optional. Specifies the number of NOT NULL values required to keep the row or column.

_subset_	List	Optional. Specifies where to look for NULL values.

_inplace_	True False	Optional, default False. If True: the removing is done on the current DataFrame. If False: returns a copy where the removing is done.

In [19]:
import pandas as pd

df = pd.read_csv('DataSets/orders_for_anomalies_detection_visitors.csv')

newdf = df.dropna()
print(newdf)

          date group  visitors
0   01/04/2019     A     455.0
2   03/04/2019     A    1313.0
3   04/04/2019     A     555.0
4   05/04/2019     A     564.0
5   06/04/2019     A     467.0
6   07/04/2019     A     513.0
7   08/04/2019     A     559.0
8   09/04/2019     A     575.0
9   10/04/2019     A    1322.0
11  12/04/2019     A     804.0
12  13/04/2019     A     626.0
13  14/04/2019     A     679.0
14  15/04/2019     A     557.0
15  16/04/2019     A     365.0
16  17/04/2019     A     509.0
17  18/04/2019     A     788.0
18  19/04/2019     A     724.0
20  21/04/2019     A     733.0
21  22/04/2019     A     707.0
22  23/04/2019     A     882.0
23  01/04/2019     B     464.0
24  02/04/2019     B     513.0
25  03/04/2019     B    1313.0
26  04/04/2019     B     578.0
28  06/04/2019     B     470.0
29  07/04/2019     B     505.0
30  08/04/2019     B     573.0
31  09/04/2019     B     564.0
32  10/04/2019     B    1334.0
33  11/04/2019     B     570.0
34  12/04/2019     B     819.0
36  14/0
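
The how, subset, and thresh parameters can be compared on a small example. The DataFrame below is hypothetical data that only mimics the column names of the file above.

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame({
    "date": ["01/04/2019", "02/04/2019", "03/04/2019", None],
    "group": ["A", "A", None, None],
    "visitors": [455.0, None, 1313.0, None],
})

all_missing = df.dropna(how="all")            # drop rows where every value is missing
in_visitors = df.dropna(subset=["visitors"])  # only look at the "visitors" column
complete = df.dropna(thresh=3)                # keep rows with at least 3 non-missing values

print(all_missing)
print(in_visitors)
print(complete)
```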

##### __DataFrame Indexing__

Pandas uses the loc attribute to return one or more specified rows, selected by label.

_loc[]: df.loc[row, column]_

In [1]:
import pandas as pd

df = pd.read_csv('DataSets/music_log.csv')

# Returns the first row of the DataFrame.
print(f"1 - First Row of Dataframe: \n{df.loc[0]}") 

print()

# Returns the rows from the given index label to the last one.
print(f"2 - Rows from the index given to the last one: \n{df.loc[67960:]}")
print()


# Returns the value of the "track" column in the first row (a single cell). loc[]: df.loc[row, column]
print(f"3 - Cell or Value from (row, column): \n{df.loc[0, 'track']}") 

print()

# Returns the rows labeled 0 and 5 (the first and sixth rows), passed as a list.
print(f"4 - First and Sixth rows within a list: \n{df.loc[[0,5]]}") 

print()

# Returns a column of the DataFrame and its abbreviated notation.

print(f'5 - Column of DataFrame: \n{df.loc[:, "track"]}')
print()
print(f"Column of DataFrame: \n{df['track']}")
print()

# Returns multiple columns and their abbreviated notation.

print(f"6 - Multiple columns of DataFrame: \n{df.loc[:, ['track', 'Artist']]}")
print()
print(f"Multiple columns of DataFrame: \n{df[['track', 'Artist']]}\n")
print()

# Returns multiple consecutive columns.

print(f"7 - Multiple consecutive columns of DataFrame: \n{df.loc[:, 'total play':'track']}")
print()



1 - First Row of Dataframe: 
  user_id       BF6EA5AF
total play     92.851388
Artist        Marina Rei
genre                pop
track             Musica
Name: 0, dtype: object

2 - Rows from the index given to the last one: 
        user_id  total play        Artist  genre      track
67960  26B7058C  292.455000       Red God  metal  Действуй!
67961  DB0038A8   11.529112  Less Chapell    pop       Home
67962  FE8684F6    0.100000           NaN    NaN        NaN

3 - Cell or Value from (row, column): 
Musica

4 - First and Sixth rows within a list: 
    user_id  total play                                  Artist genre   track
0  BF6EA5AF   92.851388                              Marina Rei   pop  Musica
5  4166D680    3.007000  Henry Hall & His Gleneagles Hotel Band  jazz    Home

5 - Column of DataFrame: 
0                        Musica
1                   Love Planet
2                   Love Planet
3           Loving Every Minute
4        Me And My Broken Heart
                  ...    
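
One detail worth sketching: slices with loc are label-based and include the end label, while slices with iloc are position-based and exclude the end position. The DataFrame and its string labels below are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels.
df = pd.DataFrame(
    {"track": ["Musica", "Love Planet", "Home"], "genre": ["pop", "ambient", "pop"]},
    index=["a", "b", "c"],
)

inclusive_rows = df.loc["a":"b"]   # label slice: includes the end label "b"
exclusive_rows = df.iloc[0:1]      # position slice: excludes position 1

print(inclusive_rows)
print(exclusive_rows)
```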

##### __DataFrame Boolean Indexing__

In Pandas, Boolean indexing is used to filter rows or columns of a DataFrame or Series based on conditional statements. It helps extract specific data that meets the defined condition by creating boolean masks, _which are arrays of True and False values_. __The True values indicate that the respective data should be selected, while False values indicate that it should not be selected__.

_Instead of manually iterating through data to find values that meet a condition, Boolean indexing simplifies the process by applying logical expressions._

In [3]:
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)
print()

# Create Boolean Index
result = df > 2

print('Boolean Index:\n', result)

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6

Boolean Index:
        A      B
0  False  False
1   True   True
2   True   True


Once a boolean index is created, you can use it to filter rows or columns in the DataFrame. _This is done by using .loc[] for label-based indexing and .iloc[] for position-based indexing._

.loc

_df.loc[row_condition, column_selection]_.

In [5]:
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)
print()

# Create Boolean Index
s = (df['A'] > 2)

# Filter DataFrame using the Boolean Index with .loc
print('Output Filtered DataFrame:\n',df.loc[s, 'B'])

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6

Output Filtered DataFrame:
 1    4
2    6
Name: B, dtype: int64


.iloc

In [6]:
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])

# Display the DataFrame
print("Input DataFrame:\n", df)

# Create Boolean Index
s = (df['A'] > 2)

# Filter data using .iloc and the Boolean Index
print('Output Filtered Data:\n',df.iloc[s.values, 1])

Input DataFrame:
    A  B
0  1  2
1  3  4
2  5  6
Output Filtered Data:
 1    4
2    6
Name: B, dtype: int64


##### __Advanced Boolean Indexing with multiple Conditions__

Pandas supports more complex boolean indexing by combining multiple conditions with operators such as & (and), | (or), and ~ (not). Each condition must be wrapped in parentheses. You can also apply these conditions across different columns to create highly specific filters.

In [8]:
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 3, 5, 7],'B': [5, 2, 8, 4],'C': ['x', 'y', 'x', 'z']})

# Display the DataFrame
print("Input DataFrame:\n", df)
print()

# Apply multiple conditions using boolean indexing
result = df.loc[(df['A'] > 2) & (df['B'] < 5), 'A':'C']

print('Output Filtered DataFrame:\n',result)

Input DataFrame:
    A  B  C
0  1  5  x
1  3  2  y
2  5  8  x
3  7  4  z

Output Filtered DataFrame:
    A  B  C
1  3  2  y
3  7  4  z
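
The | and ~ operators work the same way. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 7], "B": [5, 2, 8, 4], "C": ["x", "y", "x", "z"]})

# | (or): rows where A is greater than 5 or C equals "x".
either = df[(df["A"] > 5) | (df["C"] == "x")]

# ~ (not): rows where C is not "x".
not_x = df[~(df["C"] == "x")]

print(either)
print(not_x)
```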


##### __Pandas Boolean Masking__

A boolean mask is an array of boolean values (True or False) used to filter data. It is created by applying conditional expressions to the dataset, which evaluates each element and returns True for matching conditions and False otherwise.

In [None]:
import pandas as pd

# Create a sample DataFrame
df= pd.DataFrame({'Col1': [1, 3, 5, 7, 9],
'Col2': ['A', 'B', 'A', 'C', 'A']})

# Display the Input DataFrame
print('Original DataFrame:\n', df)
print()

# Create a boolean mask
mask = (df['Col2'] == 'A') & (df['Col1'] > 4)

# Apply the mask to the DataFrame
filtered_data = df[mask]

print('Filtered Data:\n',filtered_data)

Original DataFrame:
    Col1 Col2
0     1    A
1     3    B
2     5    A
3     7    C
4     9    A

Filtered Data:
    Col1 Col2
2     5    A
4     9    A


##### _Masking Data Based on Index Value_

You can filter data based on the index values of the DataFrame by creating a mask for the index, so that rows are selected by their label.

Use the df.index.isin() method to create a boolean mask based on the index labels.

In [12]:
import pandas as pd

# Create a DataFrame with a custom index
df = pd.DataFrame({'A1': [10, 20, 30, 40, 50], 'A2':[9, 3, 5, 3, 2]
}, index=['a', 'b', 'c', 'd', 'e'])

# Display the Input DataFrame
print('Original DataFrame:\n', df)
print()

# Define a mask based on the index
mask = df.index.isin(['b', 'd'])

# Apply the mask
filtered_data = df[mask]

print('Filtered Data:\n',filtered_data)

Original DataFrame:
    A1  A2
a  10   9
b  20   3
c  30   5
d  40   3
e  50   2

Filtered Data:
    A1  A2
b  20   3
d  40   3


##### _Masking Data Based on Column Value_

In addition to filtering based on index values, you can also filter data based on specific column values using boolean masks. The isin() method of a column (for example df['A'].isin()) checks whether each value in the column matches one of the values in a list.

In [6]:
import pandas as pd

# Create a DataFrame
df= pd.DataFrame({'A': [1, 2, 3],'B': ['a', 'b', 'f']})

# Display the Input DataFrame
print('Original DataFrame:\n', df)
print()

# Define a mask for specific values in column 'A' and 'B'
mask = df['A'].isin([1, 3]) | df['B'].isin(['a'])

# Apply the mask using the boolean indexing
filtered_data = df[mask]

print('Filtered Data:\n', filtered_data)

Original DataFrame:
    A  B
0  1  a
1  2  b
2  3  f

Filtered Data:
    A  B
0  1  a
2  3  f
