## EXPT-01

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
# Step 1: Create a sample dataset with missing values and outliers
data = {
    'Age': [22, 25, np.nan, 23, 120, 24, 26, np.nan, 28, 30],   # 120 is outlier
    'Salary': [25000, 27000, 26000, np.nan, 30000, 28000, np.nan, 29000, 31000, 100000],  # 100000 is outlier
    'Gender': ['M', 'F', 'M', np.nan, 'F', 'M', np.nan, 'F', 'M', 'F']
}


In [3]:
df = pd.DataFrame(data)
print("🔹 Original Data:\n", df)

🔹 Original Data:
      Age    Salary Gender
0   22.0   25000.0      M
1   25.0   27000.0      F
2    NaN   26000.0      M
3   23.0       NaN    NaN
4  120.0   30000.0      F
5   24.0   28000.0      M
6   26.0       NaN    NaN
7    NaN   29000.0      F
8   28.0   31000.0      M
9   30.0  100000.0      F
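
Before imputing, it can help to count the gaps in each column. The following is a minimal sketch that rebuilds the same sample data so it runs on its own; the names `df_check` and `missing_per_column` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample values as the dataset above (df_check is a hypothetical name)
df_check = pd.DataFrame({
    'Age': [22, 25, np.nan, 23, 120, 24, 26, np.nan, 28, 30],
    'Salary': [25000, 27000, 26000, np.nan, 30000, 28000, np.nan, 29000, 31000, 100000],
    'Gender': ['M', 'F', 'M', np.nan, 'F', 'M', np.nan, 'F', 'M', 'F'],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = df_check.isna().sum()
print(missing_per_column)
```

Each of the three columns has two missing entries, which matches the NaN cells visible in the printed table.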


In [4]:
# Step 2: Handling Missing Values
# Assign the result back to the column instead of calling fillna(..., inplace=True)
# on df[col]; the chained in-place form raises a FutureWarning and stops working in pandas 3.0.

# Mean imputation (note: the mean is pulled upward by the outlier 120)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median imputation
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Mode imputation (for categorical data); 'M' and 'F' are tied here, and mode()[0] returns 'F'
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

print("\n🔹 After Handling Missing Values (Mean, Median, Mode):\n", df)


🔹 After Handling Missing Values (Mean, Median, Mode):
       Age    Salary Gender
0   22.00   25000.0      M
1   25.00   27000.0      F
2   37.25   26000.0      M
3   23.00   28500.0      F
4  120.00   30000.0      F
5   24.00   28000.0      M
6   26.00   28500.0      F
7   37.25   29000.0      F
8   28.00   31000.0      M
9   30.00  100000.0      F

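
The imputed Age of 37.25 is higher than every observed age except the outlier, because the mean is sensitive to the value 120. As a hedged illustration (not a change to the experiment), the sketch below compares the mean and median fill values for the same Age column; `age`, `mean_fill`, and `median_fill` are names introduced here for the example.

```python
import numpy as np
import pandas as pd

# Age column from the sample dataset, including the outlier 120
age = pd.Series([22, 25, np.nan, 23, 120, 24, 26, np.nan, 28, 30], name='Age')

mean_fill = age.mean()      # skips NaN; pulled upward by 120
median_fill = age.median()  # skips NaN; barely affected by 120

# Assigning the result of fillna avoids the chained in-place pattern
age_mean_imputed = age.fillna(mean_fill)
age_median_imputed = age.fillna(median_fill)

print("mean fill:", mean_fill)
print("median fill:", median_fill)
```

For this column the mean fill is 37.25 and the median fill is 25.5, so the median is usually the safer choice when outliers have not yet been handled.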

In [5]:
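# Step 3: Handling Outliers - Z-score method
# stats.zscore uses the population standard deviation (ddof=0), so with n = 10 rows
# |z| cannot exceed sqrt(n - 1) = 3. Here the largest values are about 2.95 (Age)
# and 2.99 (Salary), so the filter z < 3 keeps every row.
z = np.abs(stats.zscore(df[['Age', 'Salary']]))
df_z = df[(z < 3).all(axis=1)]
print("\n🔹 After Removing Outliers (Z-Score Method):\n", df_z)


🔹 After Removing Outliers (Z-Score Method):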
       Age    Salary Gender
0   22.00   25000.0      M
1   25.00   27000.0      F
2   37.25   26000.0      M
3   23.00   28500.0      F
4  120.00   30000.0      F
5   24.00   28000.0      M
6   26.00   28500.0      F
7   37.25   29000.0      F
8   28.00   31000.0      M
9   30.00  100000.0      F
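
The z-score output above still contains all ten rows, including Age 120 and Salary 100000. In a sample this small, a single extreme value inflates the standard deviation enough that its own z-score stays below 3. One commonly used alternative is the modified z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below is an assumption about how one might apply it here, not the method used in the experiment; the helper `modified_zscore`, the constant 0.6745, and the cutoff 3.5 follow the usual convention for this statistic.

```python
import pandas as pd

def modified_zscore(values: pd.Series) -> pd.Series:
    """Modified z-score based on the median and MAD (hypothetical helper)."""
    median = values.median()
    mad = (values - median).abs().median()
    # Caveat: if mad is 0 (more than half the values identical), this divides by zero
    return 0.6745 * (values - median) / mad

# Imputed columns as printed after Step 2
age = pd.Series([22, 25, 37.25, 23, 120, 24, 26, 37.25, 28, 30])
salary = pd.Series([25000, 27000, 26000, 28500, 30000,
                    28000, 28500, 29000, 31000, 100000])

# Keep a row only if both columns are within the conventional cutoff of 3.5
mask = (modified_zscore(age).abs() < 3.5) & (modified_zscore(salary).abs() < 3.5)
kept_rows = list(mask[mask].index)
print(kept_rows)
```

With this data the filter drops rows 4 and 9, the same two rows that the IQR method removes.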


In [6]:
# Step 4: Handling Outliers - IQR method
num = df[['Age', 'Salary']]
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# A row is an outlier if either column falls outside [lower, upper]
outlier_mask = ((num < lower) | (num > upper)).any(axis=1)
df_iqr = df[~outlier_mask]
print("\n🔹 After Removing Outliers (IQR Method):\n", df_iqr)


🔹 After Removing Outliers (IQR Method):
      Age   Salary Gender
0  22.00  25000.0      M
1  25.00  27000.0      F
2  37.25  26000.0      M
3  23.00  28500.0      F
5  24.00  28000.0      M
6  26.00  28500.0      F
7  37.25  29000.0      F
8  28.00  31000.0      M
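
Dropping rows costs two of the ten observations. Where losing rows is undesirable, the same IQR fences can be used to cap values instead of removing them. This is a sketch of that alternative under stated assumptions, not part of the original experiment; `iqr_bounds` and `age_capped` are hypothetical names, and the multiplier 1.5 is the same one used in Step 4.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) IQR fences for a numeric Series (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Imputed Age column as printed after Step 2
age = pd.Series([22, 25, 37.25, 23, 120, 24, 26, 37.25, 28, 30])

lower, upper = iqr_bounds(age)
# clip() replaces values outside the fences with the nearest fence
age_capped = age.clip(lower=lower, upper=upper)

print("fences:", lower, upper)
print(age_capped.tolist())
```

All ten rows are kept, and only the value 120 is changed: it is pulled down to the upper fence.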
