**Presented by: Reza Saadatyar (2024-2025)**<br/>
**E-mail: Reza.Saadatyar@outlook.com**

**1️⃣ Series & DataFrame:**<br/>
A `Series` is a 1-dimensional labeled array in Pandas. It can hold any data type (integers, strings, floats, etc.) and behaves like a single column of a DataFrame or like a dictionary that maps index labels to values.<br/>
A `DataFrame` is a 2-dimensional, size-mutable, tabular data structure in Pandas. Each column in a DataFrame is a Series.<br/>
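
As a minimal sketch of the relationship described above (the sample data and the names `df_demo` and `s_demo` are made up for illustration), selecting one column of a DataFrame yields a Series, and a Series can be built from a dictionary of label-value pairs:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df_demo = pd.DataFrame({"Age": [25, 30], "City": ["Paris", "Rome"]})

age = df_demo["Age"]           # Selecting one column returns a Series
print(type(age).__name__)      # Series
print(age.name)                # Age (the column label becomes the Series name)

# A Series maps index labels to values, much like a dictionary
s_demo = pd.Series({"a": 10, "b": 20})
print(s_demo["b"])             # 20
```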

**Properties:**<br/>
`s.name:` Returns the name of the Series (Series only; a DataFrame has no `name` attribute).<br/>
`s.T or df.T:` Returns the transpose (for a Series this is the Series itself).<br/>
`s.dtype or df.dtypes:` Returns the data type of the Series, or the data type of each column of the DataFrame.<br/>
`s.size or df.size:` Returns the total number of elements.<br/>
`s.index or df.index:` Returns the index (row labels).<br/>
`s.values or df.values:` Returns a NumPy array representation of the data.<br/>
`s.empty or df.empty:` Returns True if the object has no elements, otherwise False.<br/>
`s.ndim or df.ndim:` Returns the number of dimensions (1 for a Series, 2 for a DataFrame).<br/>
`s.hasnans:` Returns True if the Series contains any NaN values (Series only).<br/>
`s.is_unique:` Returns True if all values in the Series are unique (Series only).<br/>
`s.shape or df.shape:` Returns a tuple with the dimensions: `(n,)` for a Series, `(rows, columns)` for a DataFrame.<br/>
`s.memory_usage() or df.memory_usage():` Returns the memory usage in bytes (a single total for a Series, one value per column for a DataFrame).<br/>
`s.axes or df.axes:` Returns a list of the axes (the index for a Series; the index and columns for a DataFrame).<br/>
`s.info() or df.info():` Prints a summary including data types and non-null counts.<br/>
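
Several of these attributes differ between a Series and a DataFrame. The following sketch, which uses a small made-up frame named `df_demo`, shows the differences that most often cause confusion:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df_demo = pd.DataFrame({"Age": [25, 30, 35], "Name": ["Ann", "Bob", "Cid"]})
s_demo = df_demo["Age"]

print(s_demo.dtype)                   # One dtype for the whole Series
print(df_demo.dtypes)                 # One dtype per column of the DataFrame
print(s_demo.ndim, df_demo.ndim)      # 1 2
print(s_demo.shape, df_demo.shape)    # (3,) (3, 2)

# hasnans and is_unique exist on Series (and Index) objects, not on DataFrames
print(hasattr(df_demo, "hasnans"))    # False
print(hasattr(df_demo, "is_unique"))  # False

print(s_demo.memory_usage())          # Total bytes for the Series (index included)
print(df_demo.memory_usage())         # Bytes per column, plus an entry for the index
```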

**Index Properties:**<br/>
`index.shape:` Returns the shape of the index.<br/>
`index.dtype:` Returns the data type of the index.<br/>
`index.empty:` Returns True if the index is empty.<br/>
`index.size:` Returns the number of elements in the index.<br/>
`index.is_unique:` Returns True if the index has unique values.<br/>
`index.values:` Returns a NumPy array representation of the index.<br/>
`index.has_duplicates:` Returns True if the index contains duplicate values.<br/>
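`index.is_monotonic_increasing / index.is_monotonic_decreasing:` Return True if the index values are monotonically increasing or decreasing, respectively (`index.is_monotonic` was removed in pandas 2.0).<br/>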

**String Properties:**<br/>
`str.lower():` Converts strings to lowercase.<br/>
`str.upper():` Converts strings to uppercase.<br/>
`str.len():` Returns the length of each string.<br/>
`str.split():` Splits strings based on a delimiter.<br/>
`str.strip():` Removes leading and trailing whitespace.<br/>
`str.contains():` Checks if a substring is present in each string.<br/>
`str.find():` Returns the index of the first occurrence of a substring.<br/>
`str.endswith():` Checks if each string ends with a specified substring.<br/>
`str.replace():` Replaces occurrences of a substring with another substring.<br/>
`str.startswith():` Checks if each string starts with a specified substring.<br/>

**DateTime Properties:**<br/>
`dt.day:` Returns the day component of the datetime.<br/>
`dt.hour:` Returns the hour component of the datetime.<br/>
`dt.date:` Returns the date component of the datetime.<br/>
`dt.time:` Returns the time component of the datetime.<br/>
`dt.year:` Returns the year component of the datetime.<br/>
`dt.month:` Returns the month component of the datetime.<br/>
`dt.minute:` Returns the minute component of the datetime.<br/>
`dt.second:` Returns the second component of the datetime.<br/>

**2️⃣ Math Operations**<br/>

**3️⃣ Missing Data Handling**<br/>

<font color='#FF000e' size="4.8" face="Arial"><b>Import modules</b></font>

In [80]:
import numpy as np
import pandas as pd
from sklearn import impute, datasets

<font color=#fae903 size="4.8" face="Arial"><b>1️⃣ Properties</b></font>

In [133]:
# Creating a Series from a list
data = [10, 20, 30, 40]  # Define a list of data
pd.Series(data, name="Age")  # Convert the list into a Pandas Series with the name "Age"

0    10
1    20
2    30
3    40
Name: Age, dtype: int64

In [None]:
# Creating a Series with custom indices
data = ['apple', 'banana', 'cherry']  # Define a list of data
indices = ['a', 'b', 'c']  # Define custom indices for the Series
s = pd.Series(data, index=indices)  # Create a Pandas Series with custom indices

# Print the Series and its properties
print(f"s:\n{s}")  # Print the entire Series
print(f"\n{s.b = } → {s['b'] = }")     # Access the value at index 'b' using dot notation and indexing
print(f"\ns['a':'b']:\n{s['a':'b']}")  # Slice the Series from index 'a' to 'b'

# Print various properties of the Series
print(f"\n{s.size = }")        # Number of elements in the Series
print(f"{s.ndim = }")          # Number of dimensions (always 1 for Series)
print(f"{s.index = }")         # Index object of the Series
print(f"{s.shape = }")         # Shape of the Series (number of elements)
print(f"{s.dtype = }")         # Data type of the Series elements
print(f"{s.empty = }")         # Whether the Series is empty (True/False)
print(f"{s.keys() = }")        # Alias for the index (same as s.index)
print(f"{s.values = }")        # Numpy array of the Series values
print(f"{s.iloc[1] = }")       # Access the value at position 1 using integer location
print(f"{s.hasnans = }")       # Check if the Series contains any NaN values
print(f"{s.is_unique = }")     # Check if all values in the Series are unique
print(f"s.items:\n{s.items}")  # s.items is a method; without parentheses this prints the bound method (call s.items() to iterate over index-value pairs)

s:
a     apple
b    banana
c    cherry
dtype: object

s.b = 'banana' → s['b'] = 'banana'

s['a':'b']:
a     apple
b    banana
dtype: object

s.size = 3
s.ndim = 1
s.index = Index(['a', 'b', 'c'], dtype='object')
s.shape = (3,)
s.dtype = dtype('O')
s.empty = False
s.keys() = Index(['a', 'b', 'c'], dtype='object')
s.values = array(['apple', 'banana', 'cherry'], dtype=object)
s.iloc[1] = 'banana'
s.hasnans = False
s.is_unique = True
s.items:
<bound method Series.items of a     apple
b    banana
c    cherry
dtype: object>


<font color=#eff308 size="4.8" face="Arial"><b>1️⃣ Index Properties</b></font>

In [135]:
# Print properties of the Series index
print(f"{s.index.values = }")          # The underlying NumPy array of the index
print(f"{s.index.shape = }")           # Shape of the index (number of elements)
print(f"{s.index.size = }")            # Number of elements in the index
print(f"{s.index.ndim = }")            # Number of dimensions of the index (always 1 for a Series index)
print(f"{s.index.dtype = }")           # Data type of the index elements
print(f"{s.index.empty = }")           # Whether the index is empty (True/False)
print(f"{s.index.hasnans = }")         # Whether the index contains any NaN values
print(f"{s.index.is_unique = }")       # Whether all index labels are unique
print(f"{s.index.has_duplicates = }")  # Whether the index contains duplicate labels

s.index.values = array(['a', 'b', 'c'], dtype=object)
s.index.shape = (3,)
s.index.size = 3
s.index.ndim = 1
s.index.dtype = dtype('O')
s.index.empty = False
s.index.hasnans = False
s.index.is_unique = True
s.index.has_duplicates = False
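
The monotonicity and duplicate checks are easier to see on an index that is not already sorted and unique. This is a small sketch with two made-up indexes, `idx_up` and `idx_mixed`:

```python
import pandas as pd

idx_up = pd.Index([1, 2, 3])        # Sorted, no duplicates
idx_mixed = pd.Index([3, 1, 2, 1])  # Unsorted, with a repeated label

print(idx_up.is_monotonic_increasing)     # True
print(idx_up.is_monotonic_decreasing)     # False
print(idx_mixed.is_monotonic_increasing)  # False
print(idx_mixed.has_duplicates)           # True
print(idx_mixed.is_unique)                # False
```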


<font color=#f5f103 size="4.8" face="Arial"><b>1️⃣ String Properties</b></font>

In [None]:
# Create a Series with strings
series = pd.Series(['apple', 'banana', 'cherry'])      # Create a Pandas Series with string values

# String properties
print(f"\nLength:\n{series.str.len()}")                # Calculate the length of each string in the Series
print(f"Lowercase:\n{series.str.lower()}")             # Convert all strings in the Series to lowercase
print(f"\nSplit:\n{series.str.split('a')}")            # Split each string at the character 'a'
print(f"\nUppercase:\n{series.str.upper()}")           # Convert all strings in the Series to uppercase
print(f"\nContains 'a':\n{series.str.contains('a')}")  # Check if each string contains the character 'a'
print(f"\nReplace:\n{series.str.replace('apple', 'none')}")  # Replace the string 'apple' with 'none'


Length:
0    5
1    6
2    6
dtype: int64
Lowercase:
0     apple
1    banana
2    cherry
dtype: object

Split:
0       [, pple]
1    [b, n, n, ]
2       [cherry]
dtype: object

Uppercase:
0     APPLE
1    BANANA
2    CHERRY
dtype: object

Contains 'a':
0     True
1     True
2    False
dtype: bool

Replace:
0      none
1    banana
2    cherry
dtype: object
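
The cell above does not exercise `strip`, `find`, `startswith`, or `endswith`. A short sketch of those methods, using a made-up Series named `fruits` with deliberately untidy strings:

```python
import pandas as pd

# Hypothetical sample strings with extra whitespace
fruits = pd.Series(["  Apple ", "banana", "Cherry pie"])

clean = fruits.str.strip()               # Remove leading and trailing whitespace
print(clean.tolist())                    # ['Apple', 'banana', 'Cherry pie']
print(clean.str.startswith("b").tolist())  # [False, True, False]
print(clean.str.endswith("e").tolist())    # [True, False, True]
print(clean.str.find("an").tolist())       # [-1, 1, -1]  (-1 means "not found")
```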


<font color=#f5f103 size="4.8" face="Arial"><b>1️⃣ DateTime Properties</b></font>

In [137]:
# Create a datetime Series (Convert a list of date strings to a datetime Series)
dates = pd.Series(pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-10']))

# DateTime properties
print(f"Year:\n{dates.dt.year}")                    # Extract the year component from each date
print(f"\nMonth:\n{dates.dt.month}")                # Extract the month component from each date
print(f"\nDay:\n{dates.dt.day}")                    # Extract the day component from each date
print(f"\nDay of Week:\n{dates.dt.dayofweek}")      # Extract the day of the week (Monday=0, Sunday=6)
print(f"\nIs Leap Year?\n{dates.dt.is_leap_year}")  # Check if the year of each date is a leap year

Year:
0    2023
1    2023
2    2023
dtype: int32

Month:
0    1
1    2
2    3
dtype: int32

Day:
0     1
1    15
2    10
dtype: int32

Day of Week:
0    6
1    2
2    4
dtype: int32

Is Leap Year?
0    False
1    False
2    False
dtype: bool
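
The dates above carry no time of day, so the time-related accessors would all return zero. The sketch below uses made-up timestamps (the name `stamps` is illustrative) to show `dt.hour`, `dt.minute`, `dt.second`, `dt.date`, and `dt.time`:

```python
import pandas as pd

# Hypothetical timestamps that include a time of day
stamps = pd.Series(pd.to_datetime(["2023-01-01 08:30:15", "2023-02-15 17:45:00"]))

print(stamps.dt.hour.tolist())    # [8, 17]
print(stamps.dt.minute.tolist())  # [30, 45]
print(stamps.dt.second.tolist())  # [15, 0]
print(stamps.dt.date.tolist())    # Python datetime.date objects
print(stamps.dt.time.tolist())    # Python datetime.time objects
```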


In [None]:
# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],  # Column 'Name' with string values
    'Age': [25, 30, 35, 40],                       # Column 'Age' with integer values
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']  # Column 'City' with string values
}
df = pd.DataFrame(data)  # Convert the dictionary into a Pandas DataFrame

# Display the DataFrame
print(f"DataFrame:\n{df}")  # Print the entire DataFrame

DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston


In [None]:
# The summary statistics of the DataFrame using df.describe().
print(f"df.describe():\n{df.describe()}")

# The summary statistics of the DataFrame including all columns (numeric and non-numeric) using df.describe(include='all').
print(f"\ndf.describe(include='all'):\n{df.describe(include='all')}")

df.describe():
             Age
count   4.000000
mean   32.500000
std     6.454972
min    25.000000
25%    28.750000
50%    32.500000
75%    36.250000
max    40.000000

df.describe(include='all'):
         Name        Age      City
count       4   4.000000         4
unique      4        NaN         4
top     Alice        NaN  New York
freq        1        NaN         1
mean      NaN  32.500000       NaN
std       NaN   6.454972       NaN
min       NaN  25.000000       NaN
25%       NaN  28.750000       NaN
50%       NaN  32.500000       NaN
75%       NaN  36.250000       NaN
max       NaN  40.000000       NaN


In [147]:
# Display a summary of the DataFrame's structure, including data types, non-null counts, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes


In [11]:
df.head(3) # Display the first 3 rows of the DataFrame.

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [145]:
df.tail(2) # Display the last 2 rows of the DataFrame.

      Name  Age     City
2  Charlie   35  Chicago
3    David   40  Houston


In [13]:
# Access DataFrame properties
print("\nDataFrame Properties:")
print(f"{df.shape = }")    # Number of rows and columns
print(f"{df.columns = }")  # Column names
print(f"{df.index = }")    # Row labels
print(f"\ndf.dtypes:\n{df.dtypes}")  # Data types of each column
print(f"\ndf.values:\n{df.values}")  # NumPy array representation
print(f"\n{df.empty = }")  # Check if DataFrame is empty
print(f"{df.size = }")     # Total number of elements
print(f"\ndf.T:\n{df.T}")  # Transpose of the DataFrame
print(f"\ndf.value_counts:\n{df.value_counts()}")


DataFrame Properties:
df.shape = (4, 3)
df.columns = Index(['Name', 'Age', 'City'], dtype='object')
df.index = RangeIndex(start=0, stop=4, step=1)

df.dtypes:
Name    object
Age      int64
City    object
dtype: object

df.values:
[['Alice' 25 'New York']
 ['Bob' 30 'Los Angeles']
 ['Charlie' 35 'Chicago']
 ['David' 40 'Houston']]

df.empty = False
df.size = 12

df.T:
             0            1        2        3
Name     Alice          Bob  Charlie    David
Age         25           30       35       40
City  New York  Los Angeles  Chicago  Houston

df.value_counts:
Name     Age  City       
Alice    25   New York       1
Bob      30   Los Angeles    1
Charlie  35   Chicago        1
David    40   Houston        1
Name: count, dtype: int64


In [111]:
# Access Series properties
age_series = df['Age']
print("\nSeries Properties (Age column):")
print("Shape:", age_series.shape)  # Number of elements
print("Index:", age_series.index)  # Row labels
print("Data Type:", age_series.dtype)  # Data type
print("Values:", age_series.values)    # NumPy array representation
print("Is Empty?", age_series.empty)   # Check if Series is empty
print("Size:", age_series.size)        # Number of elements
print("Name:", age_series.name)        # Name of the Series


Series Properties (Age column):
Shape: (4,)
Index: RangeIndex(start=0, stop=4, step=1)
Data Type: int64
Values: [25 30 35 40]
Is Empty? False
Size: 4
Name: Age


In [121]:
# Access DateTime properties (if applicable)
df['Join_Date'] = pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-10', '2023-04-20'])
print("DateTime Properties (Join_Date column):")
print(f"Year:\n{df['Join_Date'].dt.year}")      # Extract year
print(f"\nMonth:\n{df['Join_Date'].dt.month}")  # Extract month
print(f"\nDay:\n{df['Join_Date'].dt.day}")      # Extract day
print(f"\nDay of Week:\n{df['Join_Date'].dt.dayofweek}")  # Day of the week (Monday=0, Sunday=6)

DateTime Properties (Join_Date column):
Year:
0    2023
1    2023
2    2023
3    2023
Name: Join_Date, dtype: int32

Month:
0    1
1    2
2    3
3    4
Name: Join_Date, dtype: int32

Day:
0     1
1    15
2    10
3    20
Name: Join_Date, dtype: int32

Day of Week:
0    6
1    2
2    4
3    3
Name: Join_Date, dtype: int32


<font color="#ff0051e2" size="4.5" face="Arial"><b>2️⃣ Math Operations</b></font>

In [90]:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
print(f"s1 + s2:\n{s1 + s2}")   # Addition
print(f"\ns1 - s2:\n{s1 - s2}") # Subtraction

s1 + s2:
0    5
1    7
2    9
dtype: int64

s1 - s2:
0   -3
1   -3
2   -3
dtype: int64
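
The two Series above share the same default index, so the arithmetic looks positional. Pandas actually aligns operands by index label, which matters when the labels differ. A minimal sketch with made-up Series `a` and `b`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Labels present in only one operand produce NaN
total = a + b
print(total)    # w: NaN, x: NaN, y: 12.0, z: 23.0

# add() with fill_value treats a missing label as 0 instead
filled_total = a.add(b, fill_value=0)
print(filled_total)    # w: 30.0, x: 1.0, y: 12.0, z: 23.0
```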


In [146]:
# Create a Pandas Series
s = pd.Series([10, 20, 30, 40])  # Create a Series with integer values

# Perform aggregation operations on the Series
print("s.sum():", s.sum())    # Calculate the sum of all values in the Series
print("s.mean():", s.mean())  # Calculate the mean (average) of all values in the Series
print("s.max():", s.max())    # Find the maximum value in the Series
print("s.min():", s.min())    # Find the minimum value in the Series

s.sum(): 100
s.mean(): 25.0
s.max(): 40
s.min(): 10
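
Several aggregations can also be requested in one call. The sketch below repeats the same four statistics with `Series.agg`; the variable names `values` and `stats` are illustrative:

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40])

# agg accepts a list of aggregation names and returns a Series keyed by name
stats = values.agg(["sum", "mean", "max", "min"])
print(stats)

# describe() gives a fixed set of summary statistics in one call
print(values.describe())
```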


<font color=#24f508 size="4.8" face="Arial"><b>3️⃣ Missing Data Handling</b></font>

In [159]:
# Create a DataFrame 'df' with 5 rows and 4 columns ('A', 'B', 'C', 'D').
# The data contains a mix of numeric values and NaN (Not a Number) values.
# NaN represents missing or undefined data in pandas.
df = pd.DataFrame([[np.nan, 2, np.nan, 0],      # Row 1: [NaN, 2, NaN, 0]
                   [3, 4, np.nan, 1],           # Row 2: [3, 4, NaN, 1]
                   [np.nan, np.nan, np.nan, 5], # Row 3: [NaN, NaN, NaN, 5]
                   [np.nan, 3, np.nan, 4],      # Row 4: [NaN, 3, NaN, 4]
                   [np.nan, np.nan, np.nan, np.nan]], # Row 5: [NaN, NaN, NaN, NaN]
                  columns=list('ABCD'))         # Assign column names as 'A', 'B', 'C', 'D'
df

     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  5.0
3  NaN  3.0 NaN  4.0
4  NaN  NaN NaN  NaN


In [160]:
df.drop('B', axis=1) # Drop the column 'B' from the DataFrame 'df' along the specified axis (axis=1 for columns).

     A   C    D
0  NaN NaN  0.0
1  3.0 NaN  1.0
2  NaN NaN  5.0
3  NaN NaN  4.0
4  NaN NaN  NaN


In [161]:
df.isnull().sum() # Count the missing values (NaN) in each column.

A    4
B    2
C    5
D    1
dtype: int64
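
Counts are often easier to interpret as proportions. Because `True` counts as 1, taking the mean of the boolean mask gives the share of missing values per column. The sketch rebuilds the same data under the illustrative name `df_nan`:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame([[np.nan, 2, np.nan, 0],
                       [3, 4, np.nan, 1],
                       [np.nan, np.nan, np.nan, 5],
                       [np.nan, 3, np.nan, 4],
                       [np.nan, np.nan, np.nan, np.nan]],
                      columns=list("ABCD"))

missing_share = df_nan.isna().mean()             # Fraction of NaN values per column
print(missing_share)                             # A 0.8, B 0.4, C 1.0, D 0.2

rows_with_gap = df_nan.isna().any(axis=1).sum()  # Rows containing at least one NaN
print(rows_with_gap)                             # 5
```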

In [162]:
df.isna() # Identify missing values (NaN) in the DataFrame.

       A      B     C      D
0   True  False  True  False
1  False  False  True  False
2   True   True  True  False
3   True  False  True  False
4   True   True  True   True


In [165]:
df.notna() # Identify non-missing values in the DataFrame.

       A      B      C      D
0  False   True  False   True
1   True   True  False   True
2  False  False  False   True
3  False   True  False   True
4  False  False  False  False


In [166]:
pd.isna(df['A']) # Identify missing values (NaN) in the column 'A' of the DataFrame.

0     True
1    False
2     True
3     True
4     True
Name: A, dtype: bool

In [167]:
df.fillna(0) # Replace all missing values (NaN) in the DataFrame with the value 0.

     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  5.0
3  0.0  3.0  0.0  4.0
4  0.0  0.0  0.0  0.0


In [168]:
# Replace all missing values (NaN) in the DataFrame with the mean value of their respective columns.
df.fillna(df.mean())

     A    B   C    D
0  3.0  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  3.0 NaN  5.0
3  3.0  3.0 NaN  4.0
4  3.0  3.0 NaN  2.5


In [163]:
# Replace missing values (NaN) in columns 'A' and 'B' with their respective mean values.
# Columns 'C' and 'D' remain unaffected by this operation.
df.fillna(df.mean()['A':'B'])

     A    B   C    D
0  3.0  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  3.0 NaN  5.0
3  3.0  3.0 NaN  4.0
4  3.0  3.0 NaN  NaN


In [173]:
df.ffill() # Forward-fill missing values (NaN) in the DataFrame.

     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  5.0
3  3.0  3.0 NaN  4.0
4  3.0  3.0 NaN  4.0


In [175]:
df.bfill() # Backward-fill missing values (NaN) in the DataFrame.

     A    B   C    D
0  3.0  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  3.0 NaN  5.0
3  NaN  3.0 NaN  4.0
4  NaN  NaN NaN  NaN
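
Forward fill leaves leading gaps and backward fill leaves trailing gaps, so the two are sometimes chained. The sketch below rebuilds the same data as `df_nan` (an illustrative name) and applies both; a column with no valid values at all, such as `C`, stays empty:

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame([[np.nan, 2, np.nan, 0],
                       [3, 4, np.nan, 1],
                       [np.nan, np.nan, np.nan, 5],
                       [np.nan, 3, np.nan, 4],
                       [np.nan, np.nan, np.nan, np.nan]],
                      columns=list("ABCD"))

# Forward fill first, then backward fill whatever is still missing at the top
filled = df_nan.ffill().bfill()
print(filled)
```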


In [169]:
# Replace missing values (NaN) in specific columns with specified values, but only up to a limit of 2 replacements per column.
df.fillna(value={'A': 0, 'B': 1, 'C': 2, 'D': 3}, limit=2)

     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  NaN  5.0
3  NaN  3.0  NaN  4.0
4  NaN  1.0  NaN  3.0


In [170]:
df.dropna(axis=0) # Remove rows that contain any missing values (NaN) from the DataFrame.

Empty DataFrame
Columns: [A, B, C, D]
Index: []


In [171]:
df['A'].dropna() # Remove missing values (NaN) from the column 'A' of the DataFrame.

1    3.0
Name: A, dtype: float64

In [172]:
print(f"df:\n{df}")
print(f"\ndf.dropna(thresh=2):\n{df.dropna(thresh=2)}")  # Keep only rows that have at least 2 non-NaN values

df:
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  5.0
3  NaN  3.0 NaN  4.0
4  NaN  NaN NaN  NaN

df.dropna(thresh=2):
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
3  NaN  3.0 NaN  4.0


In [None]:
print(f"df:\n{df}")
df.dropna(how='all')  # Only drop rows where all columns are NaN

df:
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  5.0
3  NaN  3.0 NaN  4.0
4  NaN  NaN NaN  NaN


     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  5.0
3  NaN  3.0 NaN  4.0
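
The same `how` and `thresh` options work on columns when `axis=1` is passed. A minimal sketch on the same data, rebuilt here as `df_nan` (an illustrative name):

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame([[np.nan, 2, np.nan, 0],
                       [3, 4, np.nan, 1],
                       [np.nan, np.nan, np.nan, 5],
                       [np.nan, 3, np.nan, 4],
                       [np.nan, np.nan, np.nan, np.nan]],
                      columns=list("ABCD"))

# Drop columns in which every value is NaN (only 'C' here)
no_empty_cols = df_nan.dropna(axis=1, how="all")
print(no_empty_cols.columns.tolist())   # ['A', 'B', 'D']

# Keep only columns with at least 3 non-NaN values
dense_cols = df_nan.dropna(axis=1, thresh=3)
print(dense_cols.columns.tolist())      # ['B', 'D']
```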


In [174]:
df.dropna(subset=['B'])  # Drop only the rows where column 'B' is NaN

     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
3  NaN  3.0 NaN  4.0


In [176]:
df = pd.DataFrame([[np.nan, 2, np.nan, 0], 
                   [3, 4, np.nan, 1], 
                   [np.nan, np.nan, np.nan, 5], 
                   [np.nan, 3, np.nan, 4],
                   [-1, 2, 8, np.nan]],
                    columns=list('ABCD'))
df

     A    B    C    D
0  NaN  2.0  NaN  0.0
1  3.0  4.0  NaN  1.0
2  NaN  NaN  NaN  5.0
3  NaN  3.0  NaN  4.0
4 -1.0  2.0  8.0  NaN


In [177]:
imputer = impute.SimpleImputer(strategy='mean')     # Create a SimpleImputer with strategy='mean'
imputed_data = imputer.fit_transform(df)            # Fit the imputer on the data and transform it
df_imputed = pd.DataFrame(imputed_data, columns=df.columns) # Create a new DataFrame with imputed values
df_imputed

     A     B    C    D
0  1.0  2.00  8.0  0.0
1  3.0  4.00  8.0  1.0
2  1.0  2.75  8.0  5.0
3  1.0  3.00  8.0  4.0
4 -1.0  2.00  8.0  2.5
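
For this particular frame, the same column-mean imputation can be reproduced with pandas alone. This is a sketch for comparison, not a replacement for `SimpleImputer`, which additionally remembers the fitted means so they can be applied to new data; the names `df_gap` and `df_filled` are illustrative:

```python
import numpy as np
import pandas as pd

df_gap = pd.DataFrame([[np.nan, 2, np.nan, 0],
                       [3, 4, np.nan, 1],
                       [np.nan, np.nan, np.nan, 5],
                       [np.nan, 3, np.nan, 4],
                       [-1, 2, 8, np.nan]],
                      columns=list("ABCD"))

# df_gap.mean() skips NaN by default, giving one mean per column
col_means = df_gap.mean()
df_filled = df_gap.fillna(col_means)
print(df_filled)
```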


In [178]:
breast_cancer = datasets.load_breast_cancer() # Load the breast cancer dataset from sklearn's datasets module.
breast_cancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [179]:
# Convert the breast cancer dataset into a pandas DataFrame.
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
df["target"] = breast_cancer.target # Add a 'target' column to the DataFrame containing the target labels.
df.head(5) # Display the first 5 rows of the DataFrame.

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [181]:
edt = pd.get_dummies(df["target"], dtype=int)  # One-hot encode the target labels into indicator columns named 0 and 1
edt

Unnamed: 0,0,1
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
564,1,0
565,1,0
566,1,0
567,1,0
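
The indicator columns above are named with the bare integers `0` and `1`, which can be awkward once they are joined to other columns. One option is the `prefix` argument of `pd.get_dummies`. The sketch uses a short made-up label Series named `target` rather than the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for the target labels
target = pd.Series([0, 1, 1, 0], name="target")

dummies = pd.get_dummies(target, prefix="target", dtype=int)
print(dummies.columns.tolist())   # ['target_0', 'target_1']
print(dummies)
```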


In [100]:
df = df.join(edt)
df.head(5)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,0,1
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,1,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,1,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,1,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,1,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,1,0


In [None]:
# Drop the column "mean texture" from the DataFrame df in place (df itself is modified).
df.drop(["mean texture"], axis=1, inplace=True)
df.head(5)

Unnamed: 0,mean radius,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,...,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target,0,1
0,17.99,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,...,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,1,0
1,20.57,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,...,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,1,0
2,19.69,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,...,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,1,0
3,11.42,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,...,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,1,0
4,20.29,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,...,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,1,0


In [189]:
# Create a DataFrame with two columns: 'Bird' and 'Speed'.
df = pd.DataFrame({
    'Bird': ['A', 'A', 'B', 'B', 'B'],  # Column 'Bird' contains categories.
    'Speed': [380, 370, 24, 26, np.nan]  # Column 'Speed' contains numeric values, with one missing value (NaN).
})
df

  Bird  Speed
0    A  380.0
1    A  370.0
2    B   24.0
3    B   26.0
4    B    NaN


In [188]:
# Group the DataFrame by the 'Bird' column and calculate the mean of all numeric columns for each group.
df.groupby(['Bird']).mean()

      Speed
Bird
A     375.0
B      25.0


In [187]:
# Overwrite the existing 'Speed' column of the DataFrame 'df'.
# The values are calculated by grouping the DataFrame by the 'Bird' column
# and filling missing values (NaN) in the 'Speed' column with the mean speed of the corresponding bird group.
df['Speed'] = df.groupby(['Bird'])['Speed'].transform(lambda x: x.fillna(x.mean()))
df

  Bird  Speed
0    A  380.0
1    A  370.0
2    B   24.0
3    B   26.0
4    B   25.0
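
An equivalent formulation avoids the lambda: `transform("mean")` broadcasts each group's mean back to the original rows, and `fillna` then accepts that aligned Series. A sketch on the same data, rebuilt under the illustrative name `birds`:

```python
import numpy as np
import pandas as pd

birds = pd.DataFrame({
    "Bird": ["A", "A", "B", "B", "B"],
    "Speed": [380, 370, 24, 26, np.nan],
})

# One group mean per row, aligned with the original index
group_mean = birds.groupby("Bird")["Speed"].transform("mean")
print(group_mean.tolist())   # [375.0, 375.0, 25.0, 25.0, 25.0]

# Fill each missing speed with the mean of its own group
birds["Speed"] = birds["Speed"].fillna(group_mean)
print(birds)
```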


In [195]:
df = pd.DataFrame({'x': [1, 5],  'b': [-1, 4], 'c': [5, 6]})
df

   x  b  c
0  1 -1  5
1  5  4  6


In [196]:
df.sort_index(axis=1) # Sort the columns of the DataFrame based on their column names (index along axis=1).

   b  c  x
0 -1  5  1
1  4  6  5


In [197]:
# Sort the columns of the DataFrame in descending order based on their column names (index along axis=1).
df.sort_index(axis=1, ascending=False)

   x  c  b
0  1  5 -1
1  5  6  4


In [198]:
# Sort the rows of the DataFrame based on the values in column 'x' in descending order.
df.sort_values(by='x', ascending=False)

   x  b  c
1  5  4  6
0  1 -1  5
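
`sort_values` also accepts several columns, each with its own direction. The following sketch uses a made-up frame named `scores` and sorts by `team` ascending, then by `points` descending within each team:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
scores = pd.DataFrame({"team": ["b", "a", "b", "a"],
                       "points": [3, 7, 9, 5]})

ordered = scores.sort_values(by=["team", "points"], ascending=[True, False])
print(ordered)

# The original row labels are kept; reset_index(drop=True) renumbers them
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```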
