🔹 Data Cleaning
1. Handling Missing Values

Explanation:
Missing values are common in real-world datasets (for example, a customer did not enter their age). Many ML algorithms cannot handle NaN values, so we must either drop the affected rows or fill the gaps with sensible replacements. A common approach is to fill numeric columns with the mean or median (the median is less sensitive to outliers) and categorical columns with the mode.

In [42]:
import pandas as pd

df = pd.DataFrame({"id":[1,2,3,4], "age":[25, None, 30, None]})
print("Before:\n", df)

# Fill missing values with the column mean.
# Assign the result back instead of calling fillna(..., inplace=True) on df["age"]:
# the chained inplace form is deprecated and stops working in pandas 3.0.
df["age"] = df["age"].fillna(df["age"].mean())
print("After:\n", df)


Before:
    id   age
0   1  25.0
1   2   NaN
2   3  30.0
3   4   NaN
After:
    id   age
0   1  25.0
1   2  27.5
2   3  30.0
3   4  27.5
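
The mean is only one option. As a minimal sketch, assuming a hypothetical frame `df_demo` with one numeric and one categorical column, the median and mode replacements mentioned in the explanation could look like this:

```python
import pandas as pd

# Hypothetical demo data: one numeric and one categorical column with gaps
df_demo = pd.DataFrame({
    "age": [25, None, 30, None, 90],
    "city": ["Paris", None, "Paris", "Rome", None],
})

# The median is less affected by the extreme value 90 than the mean would be
age_median = df_demo["age"].median()

# mode() returns a Series (ties are possible), so take the first entry
city_mode = df_demo["city"].mode().iloc[0]

# Passing a dict fills each column with its own replacement value
df_filled = df_demo.fillna({"age": age_median, "city": city_mode})
print(df_filled)
```

Whether mean, median, or mode is appropriate depends on the column; this sketch only shows the mechanics.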


2. Removing Duplicates

Explanation:
Duplicates occur due to repeated data entry or system errors. If left unchecked, they distort analytics and models (e.g., one customer counted twice). Removing duplicates ensures the dataset is unique and reliable for analysis.

In [43]:
df = pd.DataFrame({"id":[1,2,2,3], "name":["A","B","B","C"]})
print("Before:\n", df)

df = df.drop_duplicates()
print("After:\n", df)


Before:
    id name
0   1    A
1   2    B
2   2    B
3   3    C
After:
    id name
0   1    A
1   2    B
3   3    C
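
Duplicates are not always identical across every column. The sketch below assumes a hypothetical case where `id` alone identifies a record and the most recent row should win; `subset` and `keep` are standard `drop_duplicates` parameters.

```python
import pandas as pd

# Hypothetical data: id 2 appears twice with slightly different names
df_demo = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["A", "B", "B2", "C"]})

# Compare rows on "id" only and keep the last occurrence of each id
deduped = df_demo.drop_duplicates(subset=["id"], keep="last").reset_index(drop=True)
print(deduped)
```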


3. Converting Data Types

Explanation:
Often data is loaded from CSV/Excel in string form, even for numeric fields like age. Wrong data types prevent proper calculations (e.g., "25" + "5" = "255"). Converting ensures correct arithmetic and efficient memory usage.

In [44]:
df = pd.DataFrame({"id":["1","2","3"], "age":["25","30","35"]})
print("Before:\n", df.dtypes)

df["id"] = df["id"].astype(int)
df["age"] = df["age"].astype(int)
print("After:\n", df.dtypes)

Before:
 id     object
age    object
dtype: object
After:
 id     int64
age    int64
dtype: object
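
`astype(int)` raises an error as soon as one value is not a clean integer string. As a hedged sketch for messier input (the sample values are made up), `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN so they can be handled like any other missing value:

```python
import pandas as pd

# Hypothetical raw column with padding and a non-numeric placeholder
raw = pd.Series(["25", "30", "n/a", " 41 "])

# Strip whitespace first, then coerce anything unparseable to NaN
ages = pd.to_numeric(raw.str.strip(), errors="coerce")
print(ages)
print("Unparseable rows:", int(ages.isna().sum()))
```

The result is a float column because NaN cannot be stored in a plain integer column.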


4. Handling Outliers

Explanation:
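Outliers are extreme values (e.g., a salary of 10 million when the average is 50k). They may be valid observations or errors, but either way they can distort models. One approach is to cap them at a chosen percentile so they do not dominate training. Note that with only four rows the 99th percentile is interpolated between the two largest values, so the cap computed here (971,200) is still close to the outlier; on real datasets the percentile is computed from many more points.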

In [45]:
import numpy as np
df = pd.DataFrame({"salary":[30000, 35000, 40000, 1000000]})
print("Before:\n", df)

# Cap salaries at 99th percentile
cap = np.percentile(df["salary"], 99)
df["salary"] = np.where(df["salary"] > cap, cap, df["salary"])
print("After:\n", df)

Before:
     salary
0    30000
1    35000
2    40000
3  1000000
After:
      salary
0   30000.0
1   35000.0
2   40000.0
3  971200.0
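
Percentile capping is one option among several. The following sketch swaps in a different, commonly used rule (the 1.5 × IQR fence) on the same salaries, using `Series.clip`; the multiplier 1.5 is a convention, not something the example above prescribes.

```python
import pandas as pd

salaries = pd.Series([30000, 35000, 40000, 1000000], name="salary")

# Interquartile range: distance between the 25th and 75th percentiles
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are pulled back to the fence
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
clipped = salaries.clip(lower=lower, upper=upper)
print(clipped)
```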


5. Standardizing Text Data

Explanation:
Text data may appear in different forms (e.g., "New York", "new york ", "NEW YORK"). If not cleaned, the system treats them as different categories. Standardizing by trimming spaces and converting to lowercase ensures consistency.




In [46]:
df = pd.DataFrame({"city":["New York","new york ","NEW YORK"]})
print("Before:\n", df)

df["city"] = df["city"].str.strip().str.lower()
print("After:\n", df)


Before:
         city
0   New York
1  new york 
2   NEW YORK
After:
        city
0  new york
1  new york
2  new york
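
Trimming and lowercasing do not fix repeated or unusual whitespace inside a value. A small sketch, assuming made-up variants with double spaces and a tab, that also collapses internal whitespace:

```python
import pandas as pd

# Hypothetical variants: double space, padding, and a tab character
cities = pd.Series(["New  York", " new york ", "NEW\tYORK"])

normalized = (
    cities.str.strip()                           # remove leading/trailing whitespace
          .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace to one space
          .str.lower()                           # unify letter case
)
print(normalized.unique())
```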


🔹 Data Processing
6. Normalizing Data

Explanation:
When features have very different ranges (e.g., age = 20–60, salary = 2000–10000), scale-sensitive models may give more weight to the larger values. Min-max normalization rescales each feature to the 0–1 range so that no feature dominates purely because of its units.


In [47]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

df = pd.DataFrame({"age":[18,25,40], "salary":[2000,4000,8000]})
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)

        age    salary
0  0.000000  0.000000
1  0.318182  0.333333
2  1.000000  1.000000
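
To make the arithmetic explicit, the same result can be reproduced by hand with (x - min) / (max - min) per column. This is only an illustration of the formula, not a replacement for `MinMaxScaler`:

```python
import pandas as pd

df_demo = pd.DataFrame({"age": [18, 25, 40], "salary": [2000, 4000, 8000]})

# Per-column minimum and range (max - min)
col_min = df_demo.min()
col_range = df_demo.max() - col_min

# Broadcasts column-wise: each value is shifted and divided by its column's range
manual_scaled = (df_demo - col_min) / col_range
print(manual_scaled)
```

For age 25 this gives (25 - 18) / (40 - 18) = 7 / 22 ≈ 0.318182, matching the scaler output.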


7. One-Hot Encoding Categorical Variables

Explanation:
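Most ML models cannot work directly with text labels like "Male"/"Female". One-hot encoding creates one binary column (1 or 0) per category, turning text into machine-readable numbers. scikit-learn orders the columns by sorted category, so in this example the first column is "Female" and the second is "Male".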



In [48]:
from sklearn.preprocessing import OneHotEncoder

X = [["Male"], ["Female"], ["Female"]]
encoder = OneHotEncoder(sparse_output=False)  # sklearn >= 1.2
print(encoder.fit_transform(X))

[[0. 1.]
 [1. 0.]
 [1. 0.]]
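
When the data already lives in a DataFrame, `pd.get_dummies` is a common alternative that keeps the category names as column labels. A minimal sketch on the same three values:

```python
import pandas as pd

df_demo = pd.DataFrame({"gender": ["Male", "Female", "Female"]})

# dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(df_demo["gender"], dtype=int)
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories, so it is less suited to encoding new data consistently later.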



8. Splitting Data into Train/Test

Explanation:
To evaluate a model properly, we split the data into a training set (for learning) and a test set (for evaluation). Without a held-out test set, overfitting goes unnoticed: the model can score well on the data it has memorized and still perform poorly on new data.


In [49]:
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.DataFrame({"age":[18,25,40,50,60], "income":[2000,4000,6000,8000,10000]})
X = df[["age"]]
y = df["income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train size:", len(X_train), "Test size:", len(X_test))

Train size: 4 Test size: 1
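
For classification targets, a plain random split can leave the test set with a skewed class mix. The sketch below uses made-up labels (six 0s, four 1s) and the `stratify` parameter of `train_test_split` to keep the class proportions the same in both halves:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 60% class 0 and 40% class 1
X_demo = [[i] for i in range(10)]
y_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# stratify=y_demo preserves the 60/40 class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=42
)
print("Train labels:", sorted(y_tr))
print("Test labels:", sorted(y_te))
```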


9. Feature Engineering (Creating New Column)

Explanation:
Raw data is not always directly useful. We often create derived features (e.g., final price = price – discount). Such engineered features can improve model performance by giving the model more meaningful inputs.


In [50]:
df = pd.DataFrame({"price":[100,200,300], "discount":[10,20,30]})
df["final_price"] = df["price"] - df["discount"]
print(df)

   price  discount  final_price
0    100        10           90
1    200        20          180
2    300        30          270
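
Ratios are another common kind of derived feature. As a sketch with made-up numbers, a hypothetical `discount_pct` column expresses the discount relative to the price, which is comparable across cheap and expensive items:

```python
import pandas as pd

# Hypothetical data with different relative discounts
df_demo = pd.DataFrame({"price": [100, 200, 300], "discount": [10, 50, 30]})

# Discount as a percentage of the original price
df_demo["discount_pct"] = 100 * df_demo["discount"] / df_demo["price"]
print(df_demo)
```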


10. Scaling Features with StandardScaler

Explanation:
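Unlike MinMaxScaler, StandardScaler rescales each feature to mean = 0 and standard deviation = 1. It does not change the shape of the distribution. Many ML algorithms (like SVM or regularized Logistic Regression) are sensitive to feature scale and tend to train better on standardized inputs.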

In [51]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

df = pd.DataFrame({"height":[160,170,180], "weight":[55,70,80]})
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_scaled)

     height    weight
0 -1.224745 -1.297771
1  0.000000  0.162221
2  1.224745  1.135550
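
The numbers above can be reproduced by hand as z = (x - mean) / std. One detail worth showing: `StandardScaler` uses the population standard deviation, so the pandas equivalent needs `ddof=0` (pandas defaults to `ddof=1`). A minimal sketch:

```python
import pandas as pd

df_demo = pd.DataFrame({"height": [160, 170, 180], "weight": [55, 70, 80]})

# ddof=0 selects the population standard deviation, as StandardScaler does
manual_z = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)
print(manual_z)
```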


🔹 Data Prediction
11. Linear Regression (Predicting Income from Age)

Explanation:
Linear regression models a linear relationship between variables. Example: assuming income rises roughly linearly with age, we can train the model on known (age, income) pairs and predict income for unseen ages.

In [52]:
from sklearn.linear_model import LinearRegression

X = [[18],[25],[30],[40],[50]]
y = [2000,3000,4000,6000,8000]

model = LinearRegression()
model.fit(X, y)

print("Prediction for age 35:", model.predict([[35]]))

Prediction for age 35: [5057.93450882]
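
To show where the prediction comes from, the fit for a single feature can be computed by hand with the ordinary least squares formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). This is a plain-Python sketch of that arithmetic on the same data, not how scikit-learn is implemented internally:

```python
ages = [18, 25, 30, 40, 50]
incomes = [2000, 3000, 4000, 6000, 8000]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(incomes) / n

# Sxy: co-variation of age and income; Sxx: variation of age
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, incomes))
sxx = sum((x - mean_x) ** 2 for x in ages)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

prediction = intercept + slope * 35
print("Slope:", slope, "Intercept:", intercept)
print("Prediction for age 35:", prediction)
```

The slope is about 190.8, meaning each additional year of age adds roughly 191 to the predicted income in this toy dataset.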


12. Logistic Regression (Binary Classification)

Explanation:
When the output is binary (e.g., "Low Income" vs "High Income"), logistic regression is a common choice. It estimates the probability that a sample belongs to each class and predicts the more likely one.

In [53]:
from sklearn.linear_model import LogisticRegression

X = [[18],[25],[30],[40],[50]]
y = [0,0,1,1,1]  # 0 = Low income, 1 = High income

model = LogisticRegression()
model.fit(X, y)
print("Prediction for age 28:", model.predict([[28]]))

Prediction for age 28: [1]
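
`predict` returns only the winning class. To see the estimated probabilities behind that decision, `predict_proba` can be used. A short sketch on the same data; the exact probabilities depend on the scikit-learn version and solver defaults, so none are quoted here:

```python
from sklearn.linear_model import LogisticRegression

X_demo = [[18], [25], [30], [40], [50]]
y_demo = [0, 0, 1, 1, 1]  # 0 = Low income, 1 = High income

clf = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class, in the order of clf.classes_
proba = clf.predict_proba([[28]])[0]
print("Classes:", clf.classes_)
print("P(low), P(high) for age 28:", proba)
```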


🔹 Data Security
13. Hashing Passwords (SHA-256)

Explanation:
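Passwords should never be stored in plain text. Hashing converts them into fixed-length digests that cannot be reversed directly. Note that a single unsalted SHA-256 pass, as in this demonstration, is fast to compute and therefore easy to brute-force for common passwords; production systems should use a salted, deliberately slow password-hashing function such as PBKDF2, scrypt, bcrypt, or Argon2.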

In [54]:
import hashlib

pwd = "secret123"
hashed = hashlib.sha256(pwd.encode()).hexdigest()
print("Hashed:", hashed)

# Verify
print("Verify:", hashlib.sha256("secret123".encode()).hexdigest() == hashed)

Hashed: fcf730b6d95236ecd3c9fc2d92d7b6b2bb061514961aec041d6c7a7192f592e4
Verify: True
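
As a hedged sketch of a more robust variant using only the standard library, the example below derives the hash with PBKDF2-HMAC-SHA256 and a random per-password salt, and compares digests in constant time. The helper names `hash_password` and `verify_password` are hypothetical, and the iteration count is an illustrative assumption that should be tuned to current guidance and hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative value, not a recommendation


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) for the given password."""
    salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


salt, stored = hash_password("secret123")
print("Correct password:", verify_password("secret123", salt, stored))
print("Wrong password:", verify_password("secret124", salt, stored))
```

Because of the salt, hashing the same password twice produces different digests, so both the salt and the digest must be stored.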


14. Masking Sensitive Data

Explanation:
Personally Identifiable Information (PII) like SSNs or credit card numbers should not be shown in full. Masking hides part of it while keeping it recognizable.

In [55]:
import pandas as pd
df = pd.DataFrame({"name":["Alice","Bob"], "ssn":["123-45-6789","987-65-4321"]})
# Anchor at the start (^) so only the leading "AAA-GG" part is masked
df["ssn_masked"] = df["ssn"].str.replace(r"^\d{3}-\d{2}", "***-**", regex=True)
print(df)

    name          ssn   ssn_masked
0  Alice  123-45-6789  ***-**-6789
1    Bob  987-65-4321  ***-**-4321
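
The regex above is specific to the SSN layout. A more general sketch, using a hypothetical helper `mask_keep_last`, masks every letter or digit except the last few and leaves separators untouched, so it also works for card numbers:

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask all alphanumeric characters except the last `visible` ones."""
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - visible, 0)  # how many characters to hide

    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)  # separators and the visible tail pass through
    return "".join(out)


print(mask_keep_last("123-45-6789"))
print(mask_keep_last("4111 1111 1111 1234"))
```

With pandas, it could be applied per value via `df["ssn"].map(mask_keep_last)`.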


15. Encrypting and Decrypting with Fernet

Explanation:
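Encryption ensures that even if data is intercepted, it cannot be read without the secret key. Fernet provides symmetric, authenticated encryption: the same key encrypts and decrypts, and tampered ciphertext is rejected on decryption. The ciphertext differs on every run because the key, timestamp, and IV are generated fresh.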

In [56]:
from cryptography.fernet import Fernet

# Generate key
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt
msg = b"confidential data"
encrypted = cipher.encrypt(msg)
print("Encrypted:", encrypted)

# Decrypt
print("Decrypted:", cipher.decrypt(encrypted))

Encrypted: b'gAAAAABoodgz5tX5OSoi6JcQQ7ilx7WUGICblcuzqw3czugl2b7Kwe5BIOZyIxOJYR4fQTgd9aDpt8cZC8RugcKzCeSD24OrEhg-Y-AEybP1xwYbKSKBzvk='
Decrypted: b'confidential data'
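
Decryption with the wrong key (or of a modified token) does not return garbage; it raises `cryptography.fernet.InvalidToken`. A short sketch with a hypothetical helper `try_decrypt` that turns this into a `None` result:

```python
from cryptography.fernet import Fernet, InvalidToken


def try_decrypt(token: bytes, key: bytes):
    """Return the plaintext, or None if the key is wrong or the token was altered."""
    try:
        return Fernet(key).decrypt(token)
    except InvalidToken:
        return None


right_key = Fernet.generate_key()
wrong_key = Fernet.generate_key()
token = Fernet(right_key).encrypt(b"confidential data")

print("Right key:", try_decrypt(token, right_key))
print("Wrong key:", try_decrypt(token, wrong_key))
```

In practice the key must be stored securely and separately from the encrypted data; losing it makes the data unrecoverable.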


🔹 Data Governance
16. Validating Schema

Explanation:
Data governance ensures data follows agreed rules. Schema validation checks whether a dataset matches the expected structure; the check here compares column names only, not data types. Catching mismatches early prevents errors in downstream processing.

In [57]:
import pandas as pd

expected = {"id","name","age"}
df = pd.DataFrame({"id":[1],"name":["A"],"age":[25]})
print("Schema valid:", set(df.columns) == expected)

Schema valid: True
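
A name-only check passes even if `age` arrives as text. The sketch below extends the idea to data types using the predicate functions in `pandas.api.types`; the helper `schema_problems` and the `EXPECTED_SCHEMA` mapping are hypothetical names introduced for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical schema: column name -> dtype predicate
EXPECTED_SCHEMA = {
    "id": ptypes.is_integer_dtype,
    "name": ptypes.is_string_dtype,
    "age": ptypes.is_integer_dtype,
}


def schema_problems(df, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty if valid)."""
    problems = []
    for column, dtype_ok in schema.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif not dtype_ok(df[column]):
            problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
    for column in df.columns:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


good = pd.DataFrame({"id": [1], "name": ["A"], "age": [25]})
bad = pd.DataFrame({"id": [1], "name": ["A"], "age": ["25"], "extra": [0]})

print("Good frame:", schema_problems(good))
print("Bad frame:", schema_problems(bad))
```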


17. Checking for Duplicates

Explanation:
Governance rules may forbid duplicate records (e.g., one transaction logged twice). This check enforces uniqueness before storing data.


In [58]:
df = pd.DataFrame({"id":[1,2,2,3]})
print("Has duplicates:", df.duplicated().any())

Has duplicates: True
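
Knowing that duplicates exist is usually followed by the question of which rows they are. A minimal sketch: `duplicated(keep=False)` flags every member of a duplicate group rather than only the later copies.

```python
import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2, 2, 3]})

# keep=False marks all rows that share their values with another row
dup_rows = df_demo[df_demo.duplicated(keep=False)]
print(dup_rows)
```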


18. Ensuring Primary Key Uniqueness

Explanation:
Primary keys (like id) must be unique. Violations can break databases and analytics. This check confirms that uniqueness is respected.

In [59]:
df = pd.DataFrame({"id":[1,2,2,3]})
print("Primary key valid:", df["id"].is_unique)

Primary key valid: False
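
A primary key must also be free of missing values, and it may span several columns. The sketch below assumes a hypothetical order-lines table keyed by (`order_id`, `line_no`) and a hypothetical helper `primary_key_ok`:

```python
import pandas as pd


def primary_key_ok(df, key_columns):
    """True if the key columns contain no nulls and no repeated combinations."""
    keys = df[key_columns]
    has_nulls = bool(keys.isna().any().any())
    has_repeats = bool(keys.duplicated().any())
    return not (has_nulls or has_repeats)


# Hypothetical tables: the second one repeats the combination (101, 1)
valid_lines = pd.DataFrame({"order_id": [101, 101, 102], "line_no": [1, 2, 1]})
broken_lines = pd.DataFrame({"order_id": [101, 101, 102], "line_no": [1, 1, 1]})

print("Valid table:", primary_key_ok(valid_lines, ["order_id", "line_no"]))
print("Broken table:", primary_key_ok(broken_lines, ["order_id", "line_no"]))
```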



19. Checking Allowed Categories

Explanation:
Sometimes only specific values are allowed (e.g., gender must be "Male" or "Female"). Governance ensures no invalid categories sneak in.

In [60]:
df = pd.DataFrame({"gender":["Male","Female","Unknown"]})
allowed = {"Male","Female"}
print("Valid categories:", set(df["gender"]).issubset(allowed))

Valid categories: False
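
To report which rows violate the rule, the same check can be expressed with `Series.isin`. A minimal sketch on the same data:

```python
import pandas as pd

df_demo = pd.DataFrame({"gender": ["Male", "Female", "Unknown"]})
allowed = ["Male", "Female"]

# ~isin(...) selects rows whose value is not in the allowed list
invalid_rows = df_demo[~df_demo["gender"].isin(allowed)]
print(invalid_rows)
```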


20. Ensuring Referential Integrity

Explanation:
In relational data, foreign keys (like cust_id in orders) must exist in the parent table (customers). Referential integrity ensures consistency between linked tables.



In [61]:
customers = pd.DataFrame({"cust_id":[1,2,3]})
orders = pd.DataFrame({"order_id":[101,102,103], "cust_id":[1,2,5]})
print("All foreign keys valid:", set(orders["cust_id"]).issubset(set(customers["cust_id"])))

All foreign keys valid: False
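
To list the offending orders rather than just a True/False flag, one option is a left merge with `indicator=True`, which adds a `_merge` column showing whether each order found a matching customer. A sketch on the same tables:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [101, 102, 103], "cust_id": [1, 2, 5]})

# "left_only" in _merge means the order's cust_id has no match in customers
checked = orders.merge(customers, on="cust_id", how="left", indicator=True)
orphans = checked.loc[checked["_merge"] == "left_only", ["order_id", "cust_id"]]
print(orphans)
```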
