isin and string methods

In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
df_countries = pd.DataFrame(data)
df_countries


Unnamed: 0,country,population,area,capital
0,Belgium,11.3,30510,Brussels
1,France,64.3,671308,Paris
2,Germany,81.3,357050,Berlin
3,Netherlands,16.9,41526,Amsterdam
4,United Kingdom,64.9,244820,London


In [4]:
s = df_countries['capital']
print(s)

0     Brussels
1        Paris
2       Berlin
3    Amsterdam
4       London
Name: capital, dtype: object


In [5]:
s.isin?

Signature: s.isin(values) -> 'Series'
Docstring:
Whether elements in Series are contained in `values`.

Return a boolean Series showing whether each element in the Series
matches an element in the passed sequence of `values` exactly.

Parameters
----------
values : set or list-like
    The sequence of values to test. Passing in a single string will
    raise a ``TypeError``. Instead, turn a single string into a
    list of one element.

Returns
-------
Series
    Series of booleans indicating if each element is in values.

Raises
------
TypeError
  * If `values` is a string

See Also
--------
DataFrame.isin : Equivalent method on DataFrame.

Examples
--------
>>> s = pd.Series(['llama', 'cow', 'llama', 'beetle', 'llama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'llama'])
0     True
1     True
2     True
3    False
4     True
5    False
Nam

The boolean Series returned by isin can be used to filter the DataFrame with boolean indexing:

In [6]:
s.isin(['Berlin', 'London'])

0    False
1    False
2     True
3    False
4     True
Name: capital, dtype: bool

In [7]:
df_countries[df_countries['capital'].isin(['Berlin', 'London'])]


Unnamed: 0,country,population,area,capital
2,Germany,81.3,357050,Berlin
4,United Kingdom,64.9,244820,London
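
A small sketch of two related points, assuming the same capitals as above: the mask can be inverted with ~ to keep the rows that are not in the list, and, as the docstring notes, a single value has to be wrapped in a list. The names mask, others and single are only illustrative.

```python
import pandas as pd

# Same capitals as in the df_countries example
capitals = pd.Series(['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
                     name='capital')

# Boolean mask: True where the capital is one of the listed values
mask = capitals.isin(['Berlin', 'London'])

# ~ inverts the mask, keeping everything NOT in the list
others = capitals[~mask]
print(others.tolist())  # ['Brussels', 'Paris', 'Amsterdam']

# A bare string is rejected with a TypeError; wrap it in a list instead
try:
    capitals.isin('Berlin')
except TypeError as exc:
    print('TypeError:', exc)

single = capitals.isin(['Berlin'])
```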


Let's say we want to select all rows for which the capital starts with a 'B'. For a single Python string, we could use the startswith method:

In [8]:
'Berlin'.startswith('B')

True

In [9]:
# The .str accessor applies the same check to every value in the column
df_countries['capital'].str.startswith('B')

0     True
1    False
2     True
3    False
4    False
Name: capital, dtype: bool
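
As with isin, this boolean Series can be passed back to the DataFrame to select rows. The sketch below rebuilds a reduced version of df_countries so that it runs on its own; the str.contains line is an extra illustration of another .str method and is not part of the original notes.

```python
import pandas as pd

df_countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London'],
})

# Boolean mask built with the vectorised string method
starts_with_b = df_countries['capital'].str.startswith('B')

# Boolean indexing keeps only the rows where the mask is True
b_capitals = df_countries[starts_with_b]
print(b_capitals)

# Other .str methods follow the same pattern, e.g. a plain substring test
has_on = df_countries[df_countries['capital'].str.contains('on', regex=False)]
print(has_on)
```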

Pitfall: chained indexing (and the 'SettingWithCopyWarning')

Chained indexing such as df[mask]['B'] = value assigns to a temporary copy, so the original DataFrame is left unchanged. Replacing it with a single .loc call modifies df directly and avoids the 'SettingWithCopyWarning', which leads to clearer, more reliable code when working with pandas DataFrames.

In [24]:
import pandas as pd

# Creating a sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
# Chained indexing: the assignment targets a temporary copy, so df is unchanged
df[df['A'] > 2]['B'] = 999


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['A'] > 2]['B'] = 999


In [23]:
df.loc[df['A'] > 2, 'B'] = 999
df

Unnamed: 0,A,B
0,1,10
1,2,20
2,3,999
3,4,999
4,5,999
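
To make the difference concrete, the following sketch compares the two forms on a fresh copy of the sample data. The name df_demo is only illustrative, and the warning is silenced here purely to keep the output short; the exact warning class depends on the pandas version.

```python
import warnings
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                        'B': [10, 20, 30, 40, 50]})

# Chained indexing: df_demo[mask] is a new object, so the assignment is lost
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df_demo[df_demo['A'] > 2]['B'] = 999
after_chained = df_demo['B'].tolist()
print(after_chained)  # [10, 20, 30, 40, 50]

# A single .loc call selects rows and column at once and writes to df_demo
df_demo.loc[df_demo['A'] > 2, 'B'] = 999
after_loc = df_demo['B'].tolist()
print(after_loc)  # [10, 20, 999, 999, 999]
```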


In [25]:
# Chained indexing with multiple operations
df[df['A'] > 2]['B'] = df[df['A'] > 2]['B'] * 2


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['A'] > 2]['B'] = df[df['A'] > 2]['B'] * 2


Use .loc (or .iloc for positional indexing) to avoid the warning.
The same operation rewritten with .loc:

In [27]:
df.loc[df['A'] > 2, 'B'] = df.loc[df['A'] > 2, 'B'] * 2
df


Unnamed: 0,A,B
0,1,10
1,2,20
2,3,120
3,4,160
4,5,200


In [None]:
# Adding a new column; rows where the condition is False get NaN
df.loc[df['A'] > 2, 'C'] = 'High'
# Applying a function to the selected values, again through .loc
df.loc[df['A'] > 2, 'B'] = df.loc[df['A'] > 2, 'B'].apply(lambda x: x * 2)


https://github.com/brandon-rhodes/pycon-pandas-tutorial/

Other features
1. Working with missing data (.dropna(), pd.isnull())
2. Merging and joining (concat, join)
3. Grouping: groupby functionality
4. Reshaping (stack, pivot)
5. Time series manipulation (resampling, timezones, ...)
6. Easy plotting
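
Grouping (item 3) can be sketched as follows. The df_sales data and its column names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data
df_sales = pd.DataFrame({'region': ['North', 'South', 'North', 'South', 'North'],
                         'units': [10, 20, 30, 40, 50]})

# Split the rows by region, then sum the units within each group
totals = df_sales.groupby('region')['units'].sum()
print(totals)

# Several aggregations at once give one column per function
summary = df_sales.groupby('region')['units'].agg(['mean', 'max'])
print(summary)
```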

In [None]:
# 1. Working with missing values
# Count missing values in each column
missing_count = pd.isnull(df).sum()
# Print the missing count
print(missing_count)
# Drop rows with any NaN values
df_clean_rows = df.dropna(axis=0)

# Drop columns with any NaN values
df_clean_cols = df.dropna(axis=1)


In [None]:
# Fill NaN values with a specific value
df_filled = df.fillna(value=0)  # Fill NaN with 0

# Fill NaN values with mean of the column
mean_A = df['A'].mean()
df_filled_mean = df.fillna(value={'A': mean_A})
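
The df used so far may not contain any NaN values, so these calls can appear to do nothing. A self-contained sketch with a made-up frame (df_nan and its values are assumptions for illustration) shows their effect:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in A and one in B
df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0, 4.0],
                       'B': [10.0, 20.0, np.nan, 40.0],
                       'C': ['x', 'y', 'z', 'w']})

# Number of missing values per column
missing_count = df_nan.isnull().sum()
print(missing_count)

# Rows 1 and 2 each contain a NaN, so only rows 0 and 3 survive
rows_kept = df_nan.dropna(axis=0)

# Columns A and B each contain a NaN, so only column C survives
cols_kept = df_nan.dropna(axis=1)

# A dict fills only the named columns; B keeps its NaN
filled = df_nan.fillna({'A': df_nan['A'].mean()})
print(filled)
```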

In [28]:
# 2. Merging and joining (merge, concat, join)
# Example DataFrames for merging
left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                        'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3']})

right_df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K4'],
                         'C': ['C0', 'C1', 'C2', 'C4'],
                         'D': ['D0', 'D1', 'D2', 'D4']})

# Inner join
inner_join = pd.merge(left_df, right_df, on='key', how='inner')
print("Inner join:")
print(inner_join)

# Left join
left_join = pd.merge(left_df, right_df, on='key', how='left')
print("\nLeft join:")
print(left_join)

# Right join
right_join = pd.merge(left_df, right_df, on='key', how='right')
print("\nRight join:")
print(right_join)

# Outer join
outer_join = pd.merge(left_df, right_df, on='key', how='outer')
print("\nOuter join:")
print(outer_join)


Inner join:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2

Left join:
  key   A   B    C    D
0  K0  A0  B0   C0   D0
1  K1  A1  B1   C1   D1
2  K2  A2  B2   C2   D2
3  K3  A3  B3  NaN  NaN

Right join:
  key    A    B   C   D
0  K0   A0   B0  C0  D0
1  K1   A1   B1  C1  D1
2  K2   A2   B2  C2  D2
3  K4  NaN  NaN  C4  D4

Outer join:
  key    A    B    C    D
0  K0   A0   B0   C0   D0
1  K1   A1   B1   C1   D1
2  K2   A2   B2   C2   D2
3  K3   A3   B3  NaN  NaN
4  K4  NaN  NaN   C4   D4
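
The examples above all use pd.merge. For completeness, here is a minimal sketch of the other two functions named in the heading, using smaller frames than above so the result is easy to check by eye: pd.concat stacks frames on top of each other, while DataFrame.join aligns on the index.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right_df = pd.DataFrame({'key': ['K0', 'K2'], 'C': ['C0', 'C2']})

# concat appends rows; columns missing from one frame are filled with NaN
stacked = pd.concat([left_df, right_df], ignore_index=True)
print(stacked)

# join matches on the index, so 'key' is moved into the index first
joined = left_df.set_index('key').join(right_df.set_index('key'), how='left')
print(joined)
```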


1. Stacking into long format (pd.melt()):

Transforms wide-format data into long-format data by melting multiple columns into key-value pairs.


2. Pivoting (pd.pivot_table()):

Transforms long-format data into wide-format data by pivoting rows into columns, summarizing and aggregating data based on specified criteria.
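
Both reshaping directions can be sketched as a round trip on a small frame. The column names variable and value are chosen for illustration, and aggfunc='mean' is just one possible aggregation; with a single value per cell it returns that value unchanged.

```python
import pandas as pd

wide = pd.DataFrame({'country': ['Belgium', 'France'],
                     'population': [11.3, 64.3],
                     'area': [30510, 671308]})

# Wide -> long: each (country, column) pair becomes its own row
long = pd.melt(wide, id_vars='country',
               var_name='variable', value_name='value')
print(long)

# Long -> wide: the distinct values of 'variable' become columns again
back = pd.pivot_table(long, index='country', columns='variable',
                      values='value', aggfunc='mean')
print(back)
```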