
In [1]:
import pandas as pd
import numpy as np




Create a Dataset

In [2]:

data = {
    # Row 2 is an exact duplicate of row 0 (Alice); it is handled in the redundancy step.
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Age': [25, np.nan, 25, 45, 120, 30, np.nan, 35],  # two missing values; 120 is likely noise/outlier
    'Salary': [50000, 60000, 50000, 80000, 55000, 58000, 62000, 2000000],  # 2M is noise
    'City': ['NY', 'LA', 'NY', 'Chicago', 'Houston', 'Phoenix', 'NY', 'Seattle']
}

df = pd.DataFrame(data)

print("--- ORIGINAL DATAFRAME ---")
print(df)
print("\n")

--- ORIGINAL DATAFRAME ---
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
1    Bob    NaN    60000       LA
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
6  Grace    NaN    62000       NY
7  Heidi   35.0  2000000  Seattle




Handling Missing Values

In [3]:
print(f"Missing values per column:\n{df.isnull().sum()}\n")


Missing values per column:
Name      0
Age       2
Salary    0
City      0
dtype: int64
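Before choosing between dropping and imputing, it can help to look at the share of missing values and at the affected rows themselves. The following is a minimal sketch on the same toy data; the names `missing_share` and `rows_with_nan` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as in the dataset cell above.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Age': [25, np.nan, 25, 45, 120, 30, np.nan, 35],
    'Salary': [50000, 60000, 50000, 80000, 55000, 58000, 62000, 2000000],
    'City': ['NY', 'LA', 'NY', 'Chicago', 'Houston', 'Phoenix', 'NY', 'Seattle'],
})

# Fraction of missing values per column (the mean of a boolean mask).
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one missing value.
rows_with_nan = df[df.isnull().any(axis=1)]
print(rows_with_nan)
```

Here a quarter of the Age values are missing, so dropping those rows would discard two of the eight records.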



In [5]:
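# Option 1: drop every row that contains at least one NaN.
# This discards Bob and Grace entirely, including their valid Salary and City values.
df_dropped = df.dropna()
print("1. Shape after dropping rows with NaNs:", df_dropped.shape)

print(df_dropped)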


1. Shape after dropping rows with NaNs: (6, 4)
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
7  Heidi   35.0  2000000  Seattle


In [6]:
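# Option 2: impute instead of dropping. Work on a copy so the original df stays untouched.
df_imputed = df.copy()
# median() skips NaN by default and is less sensitive than the mean to the Age of 120.
median_age = df_imputed['Age'].median()
df_imputed['Age'] = df_imputed['Age'].fillna(median_age)

print(f"2. Filled missing Age with median ({median_age}):")
print(df_imputed)
print("\n")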

2. Filled missing Age with median (32.5):
    Name    Age   Salary     City
0  Alice   25.0    50000       NY
1    Bob   32.5    60000       LA
2  Alice   25.0    50000       NY
3  David   45.0    80000  Chicago
4    Eve  120.0    55000  Houston
5  Frank   30.0    58000  Phoenix
6  Grace   32.5    62000       NY
7  Heidi   35.0  2000000  Seattle
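The choice of the median over the mean matters here because the Age column still contains the suspicious value 120 at imputation time. The sketch below compares both statistics on the same Age values, with and without that value; excluding 120 by hand is purely for illustration and is not a step in the original notebook.

```python
import numpy as np
import pandas as pd

# Age column from the toy dataset, including the two missing values.
age = pd.Series([25, np.nan, 25, 45, 120, 30, np.nan, 35], name='Age')

# Both mean() and median() skip NaN by default.
mean_age = age.mean()
median_age = age.median()
print(f"mean={mean_age:.2f}, median={median_age}")

# Illustration only: the same statistics without the suspected outlier.
age_no_outlier = age[age != 120]
mean_no_outlier = age_no_outlier.mean()
median_no_outlier = age_no_outlier.median()
print(f"without 120: mean={mean_no_outlier}, median={median_no_outlier}")
```

The mean shifts by more than 14 years when the single value 120 is excluded, while the median moves by only 2.5, which is why the median is the safer fill value before outliers have been removed.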




Noise Detection & Removal (Outliers)


In [7]:
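# IQR rule: values more than 1.5 * IQR below Q1 or above Q3 are treated as noise.
Q1 = df_imputed['Age'].quantile(0.25)
Q3 = df_imputed['Age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Age Bounds: {lower_bound} to {upper_bound}")

# Keep only rows whose Age lies inside the bounds.
# Note: only Age is filtered here; the 2,000,000 Salary is not checked by this step.
df_clean_noise = df_imputed[
    (df_imputed['Age'] >= lower_bound) &
    (df_imputed['Age'] <= upper_bound)
]

# Show the rows that were filtered out by comparing index labels.
print("Rows removed (Noise):")
print(df_imputed[~df_imputed.index.isin(df_clean_noise.index)])
print("\n")
print("Dataset after noise removal")
print(df_clean_noise)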

Age Bounds: 15.625 to 50.625
Rows removed (Noise):
  Name    Age  Salary     City
4  Eve  120.0   55000  Houston


Dataset after noise removal
    Name   Age   Salary     City
0  Alice  25.0    50000       NY
1    Bob  32.5    60000       LA
2  Alice  25.0    50000       NY
3  David  45.0    80000  Chicago
5  Frank  30.0    58000  Phoenix
6  Grace  32.5    62000       NY
7  Heidi  35.0  2000000  Seattle
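The output above still contains Heidi's Salary of 2,000,000, which the dataset cell marks as noise, because the filter was applied to Age only. A minimal sketch of applying the same 1.5 * IQR rule to Salary follows. The helper `iqr_bounds` and the name `df_salary_clean` are hypothetical and not part of the original notebook, and the input frame is rebuilt from the printed output so the block runs on its own.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Hypothetical helper: return the (lower, upper) fences of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Rebuilt from the "Dataset after noise removal" output above.
df_clean_noise = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob', 'Alice', 'David', 'Frank', 'Grace', 'Heidi'],
        'Age': [25.0, 32.5, 25.0, 45.0, 30.0, 32.5, 35.0],
        'Salary': [50000, 60000, 50000, 80000, 58000, 62000, 2000000],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'Phoenix', 'NY', 'Seattle'],
    },
    index=[0, 1, 2, 3, 5, 6, 7],
)

# Apply the same rule to Salary; between() is inclusive on both ends by default.
low, high = iqr_bounds(df_clean_noise['Salary'])
print(f"Salary Bounds: {low} to {high}")

df_salary_clean = df_clean_noise[df_clean_noise['Salary'].between(low, high)]
print(df_salary_clean)
```

With this data the Salary fences come out at 28500.0 and 96500.0, so only Heidi's row is dropped and David's 80000 is kept.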


Identifying & Eliminating Data Redundancy

In [9]:
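# keep=False marks every member of a duplicate group, so both Alice rows are shown.
duplicates = df_clean_noise[df_clean_noise.duplicated(keep=False)]
print("Duplicate Rows found:")
print(duplicates)


# keep='first' retains the first occurrence and drops the later copies.
df_final = df_clean_noise.drop_duplicates(keep='first')

print("\n--- FINAL CLEANED DATAFRAME ---")
print(df_final)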

Duplicate Rows found:
    Name   Age  Salary City
0  Alice  25.0   50000   NY
2  Alice  25.0   50000   NY

--- FINAL CLEANED DATAFRAME ---
    Name   Age   Salary     City
0  Alice  25.0    50000       NY
1    Bob  32.5    60000       LA
3  David  45.0    80000  Chicago
5  Frank  30.0    58000  Phoenix
6  Grace  32.5    62000       NY
7  Heidi  35.0  2000000  Seattle
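The final frame keeps its original index labels (0, 1, 3, 5, 6, 7), so it has gaps where rows were removed. If a contiguous index is preferred, one option is to count the redundant rows and reset the index after dropping them. This is a sketch under that assumption; the names `n_dupes` and `df_final_reset` are illustrative, and the input frame is rebuilt from the printed output of the noise removal step.

```python
import pandas as pd

# Rebuilt from the "Dataset after noise removal" output.
df_clean_noise = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob', 'Alice', 'David', 'Frank', 'Grace', 'Heidi'],
        'Age': [25.0, 32.5, 25.0, 45.0, 30.0, 32.5, 35.0],
        'Salary': [50000, 60000, 50000, 80000, 58000, 62000, 2000000],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'Phoenix', 'NY', 'Seattle'],
    },
    index=[0, 1, 2, 3, 5, 6, 7],
)

# duplicated() with the default keep='first' flags only the later copies,
# so the sum is the number of rows that drop_duplicates() will remove.
n_dupes = int(df_clean_noise.duplicated().sum())
print("Redundant rows:", n_dupes)

# Drop the copies, then renumber the rows 0..n-1 and discard the old labels.
df_final_reset = df_clean_noise.drop_duplicates(keep='first').reset_index(drop=True)
print(df_final_reset)
```

Passing `drop=True` prevents the old labels from being added back as a new column.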
