 

## **3: Data Cleaning**

**Definition:**
Data cleaning is the process of identifying and correcting (or removing) errors or inconsistencies in data to improve its quality and make it ready for analysis.

---

### **1️⃣ Handling Missing Values**

* Missing values usually appear as `NaN` (Not a Number); pandas also treats `None` and `NaT` (a missing datetime) as missing.
* Two main ways to handle them:

**a) Drop missing values**

```python
df.dropna()          # returns a copy without rows that contain any missing value
df.dropna(axis=1)    # returns a copy without columns that contain any missing value
```

**b) Fill missing values**

```python
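# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated and may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(0)                        # fill with 0
df['Score'] = df['Score'].fillna(df['Score'].mean())   # fill with the column mean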
```

✔ **Tip:** For numeric columns, filling with the mean or median is common (the median is less affected by outliers); for categorical columns, fill with the mode (the most frequent value).
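
To make the tip concrete, here is a minimal sketch on a small made-up DataFrame (the `Age`, `Score`, and `City` values are illustrative assumptions, not data from this lesson). It counts the missing values per column first, then fills numeric columns with the median or mean and a categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    'Age': [25, np.nan, 31, 40],
    'Score': [80.0, 90.0, np.nan, 70.0],
    'City': ['Pune', None, 'Pune', 'Delhi'],
})

# Count missing values in each column before deciding how to handle them
missing_counts = df.isna().sum()

# Numeric columns: fill with the median or the mean of the column
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Score'] = df['Score'].fillna(df['Score'].mean())

# Categorical column: fill with the most frequent value (the mode)
df['City'] = df['City'].fillna(df['City'].mode()[0])
```

Checking `missing_counts` first helps you decide between dropping and filling: a column that is mostly empty may be better dropped than filled.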

---

### **2️⃣ Detecting Duplicates**

* Duplicate rows can skew analysis.

```python
df.duplicated()            # True for each row that repeats an earlier row (first occurrence is False)
df.drop_duplicates(inplace=True)  # removes duplicate rows
```
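
Duplicates are not always identical in every column. The sketch below, on made-up data, assumes you want one row per `Name` and uses the `subset` and `keep` parameters of `drop_duplicates` to keep the last entry for each name.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Asha', 'Ben'],
    'Score': [85, 70, 85, 75],
})

# Rows that repeat an earlier row in every column
full_dupes = df.duplicated()
num_full_dupes = int(full_dupes.sum())

# Treat rows as duplicates whenever 'Name' repeats, and keep the last occurrence
latest = df.drop_duplicates(subset=['Name'], keep='last')
```

Counting duplicates with `duplicated().sum()` before removing them is a quick way to check how much data you are about to drop.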

---

### **3️⃣ Renaming Columns**

* Helps to make column names **clear and consistent**.

```python
df.rename(columns={'Age':'Age_Years', 'Score':'Exam_Score'}, inplace=True)
```
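
Column names often arrive with stray spaces or mixed case. As a rough sketch (the messy names below are made up for illustration), you can normalize all names at once through `df.columns.str` and then rename specific columns with a mapping.

```python
import pandas as pd

# Hypothetical data with inconsistent column names (made up for illustration)
df = pd.DataFrame({' Age ': [25, 31], 'SCORE': [80, 90]})

# Normalize every name: strip surrounding spaces, then lowercase
df.columns = df.columns.str.strip().str.lower()

# Rename specific columns with a mapping; rename() returns a new DataFrame
df = df.rename(columns={'age': 'age_years', 'score': 'exam_score'})
```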

---

### **4️⃣ Changing Data Types**

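* Sometimes a column is stored with the wrong type, for example numbers or dates read in as text (`object`) instead of `int`, `float`, or `datetime`.
* Use `astype` to convert between basic types and `pd.to_datetime` to parse dates: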

```python
df['Age'] = df['Age'].astype(float)
df['Date'] = pd.to_datetime(df['Date'])
```

✔ **Tip:** Correct data types are essential for calculations and visualizations.
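
`astype` raises an error if any value cannot be converted. When a column may contain bad entries, one option is `pd.to_numeric` and `pd.to_datetime` with `errors='coerce'`, which turn unparseable values into missing values. The sketch below uses made-up text data to show this.

```python
import pandas as pd

# Hypothetical raw data where everything was read in as text
df = pd.DataFrame({
    'Age': ['25', '31', 'unknown'],
    'Date': ['2024-01-15', '2024-02-20', 'not a date'],
})

# errors='coerce' turns values that cannot be parsed into NaN / NaT
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Inspect the resulting types
print(df.dtypes)
```

The coerced values can then be handled like any other missing values, with `dropna` or `fillna`.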

---

## **4: Data Transformation**

**Definition:**
Data transformation is the process of changing or modifying data into a suitable format or structure for analysis.

---

### **1️⃣ Creating New Columns**

* You can create new columns based on existing data:

```python
df['Total'] = df['Exam_Score'] + 10
df['Passed'] = df['Exam_Score'] > 50   # Boolean column
```
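
New columns can also combine several existing columns. A minimal, self-contained sketch, assuming hypothetical `Math` and `Science` columns that are not part of this lesson's dataset:

```python
import pandas as pd

# Hypothetical scores, made up for illustration
df = pd.DataFrame({'Math': [40, 75, 90], 'Science': [55, 65, 95]})

# Arithmetic between columns is applied row by row
df['Total'] = df['Math'] + df['Science']
df['Average'] = df['Total'] / 2

# A comparison produces a Boolean column
df['Passed'] = df['Average'] > 50
```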

---

### **2️⃣ Applying Functions**

* `apply()` runs a function on each value of a column (a Series), or on each column or row of a DataFrame (`axis=0` or `axis=1`).
* `lambda` defines a small anonymous function inline.

```python
df['Double'] = df['Exam_Score'].apply(lambda x: x*2)
df['Grade'] = df['Exam_Score'].apply(lambda x: 'A' if x>90 else 'B')
```

* `map()` works on a single column (a Series) and replaces each value using a function or a dictionary:

```python
# Note: this replaces any existing 'Passed' column with 'Yes'/'No' labels
df['Passed'] = df['Exam_Score'].map(lambda x: 'Yes' if x>50 else 'No')
```
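
The sketch below puts the three tools side by side on made-up data: a named function passed to `apply()`, a row-wise `apply()` with `axis=1`, and `map()` with a dictionary. The `Bonus` and `Level` columns, the helper `to_grade`, and its thresholds are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    'Exam_Score': [95, 72, 40],
    'Bonus': [0, 5, 15],
    'Level': ['H', 'M', 'L'],
})

def to_grade(score):
    """Return a letter grade for a numeric score (thresholds are assumptions)."""
    if score > 90:
        return 'A'
    if score > 50:
        return 'B'
    return 'C'

# apply() on a single column calls the function once per value
df['Grade'] = df['Exam_Score'].apply(to_grade)

# apply() with axis=1 passes each row, so the function can read several columns
df['Final'] = df.apply(lambda row: min(row['Exam_Score'] + row['Bonus'], 100), axis=1)

# map() also accepts a dictionary; values missing from it become NaN
df['Level_Name'] = df['Level'].map({'H': 'High', 'M': 'Medium', 'L': 'Low'})
```

A named function is usually easier to read and test than a long `lambda` once the logic has more than one condition.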

---

### **3️⃣ String Operations**

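* Use the `.str` accessor to apply string methods to every value in a text column. Each call returns a new Series, so assign it back to a column to keep the result.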

```python
df['Name'].str.upper()    # convert to uppercase
df['Name'].str.lower()    # convert to lowercase
df['Name'].str.len()      # length of each string
df['Name'].str.replace('a','@')  # replace every 'a' with '@'
```
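
`.str` methods can be chained, which is handy for cleaning text in one step. A minimal sketch on made-up names (the `Name_Clean` and `Name_Length` columns are hypothetical):

```python
import pandas as pd

# Hypothetical names with inconsistent spacing and case
df = pd.DataFrame({'Name': ['  alice SMITH', 'Bob jones  ', None]})

# Strip surrounding spaces, then capitalize each word; store the result in a new column
df['Name_Clean'] = df['Name'].str.strip().str.title()

# Missing values stay missing instead of raising an error
df['Name_Length'] = df['Name_Clean'].str.len()
```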

---

### **4️⃣ Sorting Data**

* Sorting arranges rows by the values in one or more columns, or by the index. Both methods below return a new, sorted DataFrame and leave `df` unchanged.

```python
df.sort_values('Exam_Score', ascending=False)   # sort by column
df.sort_index()                                 # sort by index
```

✔ **Tip:** Sorting is useful before reporting, plotting, or ranking data.
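
For ranking, you often need to sort by more than one column. The sketch below uses made-up data (the `Class` column is an assumption) to sort by class and then by score within each class, and resets the index afterwards.

```python
import pandas as pd

# Hypothetical data, made up for illustration
df = pd.DataFrame({
    'Name': ['Asha', 'Ben', 'Chen', 'Dia'],
    'Class': ['B', 'A', 'B', 'A'],
    'Exam_Score': [70, 85, 90, 60],
})

# Sort by Class (A to Z), then by Exam_Score (highest first) within each class
ranked = df.sort_values(['Class', 'Exam_Score'], ascending=[True, False])

# sort_values() keeps the old index labels; reset them for a clean 0..n-1 index
ranked = ranked.reset_index(drop=True)
```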

---

## **✨ Summary**

| Lesson                  | Key Concepts                                                                                                                                                                    |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Data Cleaning**       | Handle missing values (`dropna`, `fillna`), remove duplicates (`duplicated`, `drop_duplicates`), rename columns (`rename`), change data types (`astype`, `pd.to_datetime`)      |
| **Data Transformation** | Create new columns, apply functions (`apply`, `map`, `lambda`), string operations (`str.upper`, `str.lower`, `str.len`, `str.replace`), sort data (`sort_values`, `sort_index`) |

**Key Notes:**

* Clean data = reliable analysis.
* Transformation = data ready for insights.
* Combine these steps to prepare your dataset for visualization or modeling.
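
As a rough end-to-end sketch of combining these steps, the example below cleans and then transforms a small made-up dataset. The data and the order of the steps are assumptions for illustration, not a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data, made up for illustration
raw = pd.DataFrame({
    'Name': ['asha', 'ben', 'ben', 'chen'],
    'Score': ['85', '40', '40', None],
})

# --- Cleaning ---
df = raw.drop_duplicates()                          # remove repeated rows
df = df.rename(columns={'Score': 'Exam_Score'})     # clearer column name
df['Exam_Score'] = pd.to_numeric(df['Exam_Score'])  # text -> numbers
df['Exam_Score'] = df['Exam_Score'].fillna(df['Exam_Score'].mean())  # fill gaps with the mean

# --- Transformation ---
df['Name'] = df['Name'].str.title()                 # tidy strings
df['Passed'] = df['Exam_Score'] > 50                # new Boolean column
df = df.sort_values('Exam_Score', ascending=False).reset_index(drop=True)
```

Cleaning comes first here so that the transformations run on numeric, de-duplicated data.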

---
 
