## **Basics of Python Data Cleaning** (Using pandas, the industry-standard library)

Why Data Cleaning Matters:

Raw data often has issues such as:
- Missing values
- Duplicates
- Inconsistent formatting
- Outliers
- Wrong data types

##### **1.Inspecting Data:**

- `df.shape`     # Dimensions
- `df.info()`    # Data types & nulls
- `df.head()`    # Peek first five rows
- `df.describe()` # Statistical summary (numeric values)
- `df.columns`   # List column names
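
The inspection calls above can be tried on a small frame. This is a minimal sketch; the sample data and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Age": [25, 32, None, 41],
    "City": ["Pune", "Delhi", "Delhi", None],
})

print(df.shape)          # (number of rows, number of columns)
df.info()                # dtypes and non-null counts (prints directly)
print(df.head())         # first five rows by default
print(df.describe())     # summary statistics for numeric columns only
print(list(df.columns))  # column names as a plain list
```

Note that `Age` is stored as a float column here, because a missing value cannot live in a plain integer column.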

##### **2.Handling Missing Values:**

- `df.isna().sum()`                    # Count nulls
- `df.dropna()`                        # Remove rows with nulls
- `df.ffill()`                         # Forward fill (`fillna(method='ffill')` is deprecated in recent pandas)
- `df['Age'] = df['Age'].fillna(df['Age'].mean())` # Fill with the column mean
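
A short sketch of these options side by side, assuming a toy frame with hypothetical `Age` and `City` columns. Each call returns a new object, so the result must be assigned to take effect.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Age": [25.0, None, 40.0, None],
    "City": ["Pune", None, "Delhi", "Delhi"],
})

null_counts = df.isna().sum()   # nulls per column
dropped = df.dropna()           # keeps only rows with no nulls at all
forward_filled = df.ffill()     # copies the last valid value downward

# mean() skips NaN by default, so the fill value is (25 + 40) / 2 = 32.5
age_filled = df["Age"].fillna(df["Age"].mean())

print(null_counts)
print(age_filled.tolist())  # [25.0, 32.5, 40.0, 32.5]
```

Which strategy fits depends on the data: forward fill suits ordered series such as time series, while a mean fill hides how many values were originally missing.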

##### **3.Fixing Data Types:**

- `df['column1'] = pd.to_datetime(df['column1'])`      # Convert to datetime
- `df['column1'] = df['column1'].astype(int)`           # Convert to integer (raises an error if the column contains nulls)
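
Real columns often contain unparseable or missing entries, which make the plain conversions raise. One possible way to handle that, sketched on made-up `Visit_Date` and `Visits` columns, is to coerce bad values to missing and use the nullable `Int64` dtype:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Visit_Date": ["2024-01-15", "not a date", "2024-03-02"],
    "Visits": ["3", "7", None],
})

# errors="coerce" turns unparseable entries into NaT instead of raising
df["Visit_Date"] = pd.to_datetime(df["Visit_Date"], errors="coerce")

# Plain int cannot hold missing values; the nullable "Int64" dtype can
df["Visits"] = pd.to_numeric(df["Visits"], errors="coerce").astype("Int64")

print(df.dtypes)
```

Coercion silently discards bad input, so it is worth counting the resulting nulls with `df.isna().sum()` afterwards.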

##### **4.Removing Duplicates:**

- `df.duplicated().sum()`      # Count duplicates
- `df = df.drop_duplicates()`  # Remove them (keeps the first occurrence)
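
By default, a row counts as a duplicate only when every column matches. The sketch below, using a hypothetical `Patient_ID` key, contrasts that with deduplicating on a subset of columns:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Patient_ID": [1, 2, 2, 3, 3],
    "City": ["Pune", "Delhi", "Delhi", "Agra", "Pune"],
})

n_dupes = df.duplicated().sum()   # only (2, "Delhi") repeats in full
deduped = df.drop_duplicates()    # drops exact duplicate rows

# Treat rows as duplicates when Patient_ID alone repeats, keeping the first
by_id = df.drop_duplicates(subset=["Patient_ID"], keep="first")

print(n_dupes)      # 1
print(len(by_id))   # 3
```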

##### **5.Handling Inconsistent Values:**

- `df['column'] = df['column'].str.lower().str.strip()` # Normalize case and whitespace with built-in string methods
- `df['column'] = df['column'].replace({'current_value':'new_value'})` # Map specific values to a preferred form
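
As an illustration, the two steps can be chained so that differently typed spellings collapse into one label. The `City` values and the `'new delhi'` to `'delhi'` mapping are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"City": [" Delhi", "delhi ", "NEW DELHI", "Pune"]})

# Step 1: trim whitespace and lower-case every entry
df["City"] = df["City"].str.strip().str.lower()

# Step 2: map known variants onto one canonical spelling (assumed mapping)
df["City"] = df["City"].replace({"new delhi": "delhi"})

print(df["City"].tolist())  # ['delhi', 'delhi', 'delhi', 'pune']
```

Running `df['City'].value_counts()` before and after is a quick way to spot variants that still need a mapping.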

##### **6.Outlier Detection:**
- Using IQR:

    - `Q1 = df['BP'].quantile(0.25)`
    - `Q3 = df['BP'].quantile(0.75)`
    - `IQR = Q3 - Q1`
    - `df = df[(df['BP'] >= Q1 - 1.5*IQR) & (df['BP'] <= Q3 + 1.5*IQR)]`
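
The IQR steps above can be sketched end to end on a small made-up `BP` column. The 1.5 multiplier is the conventional choice, not a fixed rule, and inspecting the flagged rows before dropping them is usually worthwhile.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"BP": [110, 115, 120, 125, 130, 300]})

q1 = df["BP"].quantile(0.25)
q3 = df["BP"].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# between() is inclusive on both ends, matching the >= / <= filter above
in_range = df["BP"].between(lower, upper)
outliers = df[~in_range]   # rows to review
cleaned = df[in_range]     # rows kept

print(lower, upper)              # 97.5 147.5
print(outliers["BP"].tolist())   # [300]
```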


## **Feature Engineering:**

- Transforming raw data into meaningful inputs can improve a model’s accuracy.

**1.Creating New Feature:**
- `df['BMI'] = df['Weight_kg'] / (df['Height_m'] ** 2)`
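
A minimal sketch of the BMI feature on made-up values. The guard against non-positive heights is an addition for illustration, since a height of zero would otherwise produce an infinite BMI.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Weight_kg": [70.0, 90.0, 60.0],
    "Height_m": [1.75, 1.80, 0.0],   # 0.0 stands in for a bad entry
})

# Replace invalid (non-positive) heights with NaN before dividing
height = df["Height_m"].where(df["Height_m"] > 0)
df["BMI"] = df["Weight_kg"] / (height ** 2)

print(df["BMI"].round(2).tolist())  # [22.86, 27.78, nan]
```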

**2.Encoding Categorical Data:**

- `df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})` # Label Encoding
- `df = pd.get_dummies(df, columns=['City'], drop_first=True)` # One-hot Encoding (assign the result; it is not in-place)
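
Both encodings can be sketched on a small made-up frame. Note that `map` turns any value missing from the dictionary into `NaN`, so this sketch lower-cases `Gender` first; that normalization step is an assumption about how the data looks.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Gender": ["male", "female", "Female", "male"],
    "City": ["Pune", "Delhi", "Agra", "Pune"],
})

# Lower-case first so "Female" is not left unmapped (it would become NaN)
df["Gender"] = df["Gender"].str.lower().map({"male": 0, "female": 1})

# One indicator column per city; drop_first removes the first category
# ("Agra", alphabetically) to avoid a redundant column
df = pd.get_dummies(df, columns=["City"], drop_first=True)

print(list(df.columns))  # ['Gender', 'City_Delhi', 'City_Pune']
```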

**3.Date-Time Feature:**

- `df['Visit_Date'] = pd.to_datetime(df['Visit_Date'])`
- `df['Visit_Weekday'] = df['Visit_Date'].dt.day_name()`
- `df['Visit_Month'] = df['Visit_Date'].dt.month`
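
The date-time features above, sketched on two made-up dates. The `Is_Weekend` column is a hypothetical extra feature, included only to show another use of the `.dt` accessor.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"Visit_Date": ["2024-01-15", "2024-03-02"]})

df["Visit_Date"] = pd.to_datetime(df["Visit_Date"])
df["Visit_Weekday"] = df["Visit_Date"].dt.day_name()
df["Visit_Month"] = df["Visit_Date"].dt.month

# dayofweek runs from 0 (Monday) to 6 (Sunday); 5 and 6 are the weekend
df["Is_Weekend"] = df["Visit_Date"].dt.dayofweek >= 5

print(df["Visit_Weekday"].tolist())  # ['Monday', 'Saturday']
```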

**4.Binning:** Convert continuous values into categories:

- `df['AgeGroup'] = pd.cut(df['Age'], bins=[0,18,45,60,100], labels=['Child','Adult','Middle-Aged','Senior'])`
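
One detail worth seeing in action: `pd.cut` intervals are right-inclusive by default, so the bins above are (0, 18], (18, 45], (45, 60], (60, 100]. A sketch with made-up ages shows how the edge values land:

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"Age": [0, 5, 18, 19, 45, 61]})

df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, 45, 60, 100],
    labels=["Child", "Adult", "Middle-Aged", "Senior"],
)

# Age 0 falls outside (0, 18] and becomes NaN; 18 is still "Child"
print(df)
```

Passing `include_lowest=True` makes the first interval include its left edge, which would place age 0 in `Child`.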

**5.Aggregations:** Useful in grouped data
- `avg_bills = df.groupby('Hospital_ID')['Bill_Amount'].mean().reset_index()`
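
A sketch of the aggregation on made-up bills. The second step, which attaches each hospital's average back onto every row with `transform`, is one common way to turn an aggregate into a row-level feature; the `Hospital_Avg_Bill` name is hypothetical.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Hospital_ID": ["H1", "H1", "H2"],
    "Bill_Amount": [100.0, 300.0, 500.0],
})

# One row per hospital with its mean bill
avg_bills = df.groupby("Hospital_ID")["Bill_Amount"].mean().reset_index()

# transform returns a result aligned to the original rows
df["Hospital_Avg_Bill"] = (
    df.groupby("Hospital_ID")["Bill_Amount"].transform("mean")
)

print(avg_bills)
print(df["Hospital_Avg_Bill"].tolist())  # [200.0, 200.0, 500.0]
```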

**6.Interaction:**
- `df['HeartRisk'] = df['Age'] * df['Cholesterol_Level']`
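
A minimal sketch of the interaction term on made-up values. The product is only an engineered feature that lets a model see the two inputs jointly; it is not a clinical risk score.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "Age": [30, 60],
    "Cholesterol_Level": [180, 240],
})

# Element-wise product of the two columns
df["HeartRisk"] = df["Age"] * df["Cholesterol_Level"]

print(df["HeartRisk"].tolist())  # [5400, 14400]
```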