3. Create a data frame with at least 3 columns and 50 rows to store numeric data generated using a random
function. Replace 10% of the values with null values whose index positions are generated using a random function.
Do the following:
a. Identify and count the missing values in the data frame.
b. Drop every column that has more than 5 null values.
c. Identify the row label with the maximum row sum (the sum of all values in the row) and drop that row.
d. Sort the data frame on the basis of the first column.
e. Remove all duplicates from the first column.
f. Find the correlation between the first and second columns and the covariance between the second and third
columns.
g. Discretize the second column into 5 bins.
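
For reference, a minimal sketch of the setup exactly as the task states it (50 rows, 3 columns, 10% of the cells nulled at random positions), followed by part a. The column names `A`, `B`, `C`, the seed, and the use of `numpy.random.default_rng` are assumptions made for illustration and are not taken from the task.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

n_rows, n_cols = 50, 3
values = rng.standard_normal((n_rows, n_cols))

# Pick 10% of the flat cell positions without repeats and null them.
n_null = values.size // 10  # 15 of 150 cells
null_positions = rng.choice(values.size, size=n_null, replace=False)
values.flat[null_positions] = np.nan

# Column names are hypothetical.
df = pd.DataFrame(values, columns=["A", "B", "C"])

# a. Count missing values per column and in total.
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing:", total_missing)
```

Sampling flat positions with `replace=False` guarantees that exactly `n_null` distinct cells are nulled, so the total is fixed even though the per-column counts vary from run to run.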

In [1]:
import pandas as pd
import numpy as np

# Step 1: Create a dataframe with random numeric data and introduce null values
np.random.seed(42)
data = np.random.randn(100, 5)

# Null 25% of the cells (the task statement specifies 10%) at random flat positions.
# The array is modified before the DataFrame is built, so the result does not
# depend on df.values returning a writable view.
null_indices = np.random.choice(data.size, int(0.25 * data.size), replace=False)
data.flat[null_indices] = np.nan

df = pd.DataFrame(data, columns=['Column1', 'Column2', 'Column3', 'Column4', 'Column5'])

# Display the dataframe
print("Original DataFrame:")
print(df.head())

# Step 2: Identify and count missing values
missing_values = df.isnull().sum()
print("\na. Missing Values:")
print(missing_values)

# Step 3: Drop columns with more than 5 null values.
# thresh is the minimum number of non-null values a column needs to be kept.
df = df.dropna(axis=1, thresh=len(df) - 5)
print("\nb. DataFrame after dropping columns with more than 5 null values:")
print(df.head())

# Step 4: Drop the row label with the maximum sum of values
if not df.empty:
    max_sum_row_label = df.sum(axis=1).idxmax()
    df = df.drop(index=max_sum_row_label)
    print("\nc. DataFrame after dropping the row with the maximum sum of values:")
    print(df.head())

# Step 5: Sort the dataframe based on the first column
if not df.empty:
    df = df.sort_values(by=df.columns[0])
    print("\nd. DataFrame after sorting based on the first column:")
    print(df.head())

# Step 6: Remove duplicates from the first column
if not df.empty:
    df = df.drop_duplicates(subset=df.columns[0])
    print("\ne. DataFrame after removing duplicates from the first column:")
    print(df.head())


Original DataFrame:
    Column1   Column2   Column3   Column4   Column5
0  0.496714 -0.138264       NaN  1.523030 -0.234153
1 -0.234137  1.579213  0.767435       NaN  0.542560
2 -0.463418       NaN  0.241962 -1.913280 -1.724918
3       NaN -1.012831  0.314247 -0.908024 -1.412304
4       NaN       NaN  0.067528 -1.424748 -0.544383

a. Missing Values:
Column1    24
Column2    27
Column3    26
Column4    28
Column5    20
dtype: int64

b. DataFrame after dropping columns with more than 5 null values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
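
In the run above every column has at least 20 nulls, while step b requires 95 non-null values per column, so all five columns are dropped. `df.empty` is then True, which is why steps c to e print nothing, and parts f and g are not attempted in the cell. The sketch below is one possible way to cover parts b to g on a frame built as the task describes (50 rows, 3 columns, 10% nulls). The column names, the seed, and the choice to compute f and g on the frame before any column is dropped are assumptions, made so that three columns are always available.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
values = rng.standard_normal((50, 3))
values.flat[rng.choice(values.size, size=values.size // 10, replace=False)] = np.nan
df = pd.DataFrame(values, columns=["A", "B", "C"])  # hypothetical column names

# f. Correlation (first vs second column) and covariance (second vs third).
# Both methods ignore rows where either value is NaN.
corr_ab = df["A"].corr(df["B"])
cov_bc = df["B"].cov(df["C"])
print("corr(A, B):", corr_ab)
print("cov(B, C):", cov_bc)

# g. Discretize the second column into 5 equal-width bins; NaN stays NaN.
b_binned = pd.cut(df["B"], bins=5)
print(b_binned.value_counts(sort=False))

# b. Keep only the columns that have at most 5 nulls.
null_counts = df.isna().sum()
kept = df.loc[:, null_counts <= 5]

# c-e. Run only if at least one column survived step b.
if kept.shape[1] > 0:
    # c. Row sums skip NaN by default; drop the row with the largest sum.
    max_label = kept.sum(axis=1).idxmax()
    kept = kept.drop(index=max_label)
    # d. Sort by the first remaining column (NaN rows go last).
    kept = kept.sort_values(by=kept.columns[0])
    # e. Drop duplicates in that column; NaNs count as duplicates of each
    #    other, so at most one NaN row is kept.
    kept = kept.drop_duplicates(subset=kept.columns[0])
    print(kept.head())
else:
    print("Step b removed every column; steps c-e skipped.")
```

`pd.cut` produces equal-width intervals; `pd.qcut` would be the alternative if bins with roughly equal counts were wanted instead.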
