# Replace And Data Type Change

### 🧭 OVERVIEW 

When we perform data cleaning, two very common tasks are:

- Replacing unwanted or incorrect values (e.g., missing data, inconsistent text, outliers)

- Changing data types to make columns ready for analysis or modeling (e.g., converting “123” from string to integer)

We’ll go through both systematically 👇

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("loan_data_set.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
# A column can hold mixed data, e.g. numbers stored alongside text labels

df.info()

# For example, the Dependents column looks numeric, but .info() reports its dtype as object


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [4]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [5]:
# Let's handle the null values in the Dependents column
from sklearn.impute import SimpleImputer

modeImputer = SimpleImputer(strategy = "most_frequent")

df[["Dependents"]] = modeImputer.fit_transform(df[["Dependents"]])

In [6]:
df["Dependents"].isnull().sum()

np.int64(0)
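
The same most-frequent-value fill can also be sketched with pandas alone, which may be handy when scikit-learn is not otherwise needed. The toy column below is hypothetical and only stands in for `df["Dependents"]`.

```python
import pandas as pd

# Hypothetical toy column standing in for df["Dependents"]
toy = pd.DataFrame({"Dependents": ["0", "1", None, "0", "3+", None]})

# mode() returns a Series (ties are possible), so take its first entry
most_frequent = toy["Dependents"].mode()[0]

# Fill the missing entries with that value and assign the result back
toy["Dependents"] = toy["Dependents"].fillna(most_frequent)

print(toy["Dependents"].tolist())  # ['0', '1', '0', '0', '3+', '0']
```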

#### 🌟 **`value_counts()`** returns a Series with the frequency of each distinct value, i.e. how many times each one appears

In [7]:
# value_counts() counts how many times each distinct value appears

df["Dependents"].value_counts()

Dependents
0     360
1     102
2     101
3+     51
Name: count, dtype: int64
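
Two optional arguments of `value_counts()` are often useful during cleaning: `dropna=False` to include missing values in the tally, and `normalize=True` to get proportions instead of counts. A minimal sketch on a hypothetical toy Series:

```python
import pandas as pd

# Hypothetical toy data with one missing entry
s = pd.Series(["0", "1", "0", None, "3+", "0"], name="Dependents")

counts = s.value_counts()                    # missing values are skipped by default
with_missing = s.value_counts(dropna=False)  # missing values get their own row
shares = s.value_counts(normalize=True)      # proportions of the non-missing values

print(counts["0"])          # 3
print(with_missing.sum())   # 6
print(shares["0"])          # 0.6
```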

# 🔹 1. **Using `replace()` Method**
## 📘 **Syntax**

```python
DataFrame.replace(to_replace, value, inplace=False, regex=False)
```
## ⚙️ Parameters

| Parameter    | Description                                                                           |
| ------------ | ------------------------------------------------------------------------------------- |
| `to_replace` | The value(s) to be replaced (can be single value, list, dictionary, or regex pattern) |
| `value`      | The new value(s) to use instead                                                       |
| `inplace`    | If `True`, modifies the DataFrame directly                                            |
| `regex`      | If `True`, treats the `to_replace` as a regex pattern                                 |

## ✅ Example 1: Replace a Single Value


##### Since we haven't changed the column's data type yet, the replacement value must be given as a string

In [8]:
# Assign the result back instead of calling replace(..., inplace=True) on df["Dependents"]:
# the chained in-place form raises a FutureWarning and will stop working in pandas 3.0
df["Dependents"] = df["Dependents"].replace("3+", "3")


##### 🌟 **`.astype()`** returns a Series converted to the data type given in the parentheses

In [9]:
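# Most ML algorithms need numeric input, so convert the strings to integers.
# astype() returns a new Series; assign it back to df["Dependents"] to keep the change.
df["Dependents"].astype("int64")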

0      0
1      1
2      0
3      0
4      0
      ..
609    0
610    3
611    1
612    2
613    0
Name: Dependents, Length: 614, dtype: int64
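
`astype("int64")` only succeeds once every entry looks like an integer. When a column may still contain stray text, one possible approach is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN` so the offending rows can be inspected. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy column where "3+" has not been cleaned yet
raw = pd.Series(["0", "1", "3+", "2"], name="Dependents")

# raw.astype("int64") would raise ValueError because "3+" is not a number.
# errors="coerce" converts unparseable entries to NaN instead of failing.
coerced = pd.to_numeric(raw, errors="coerce")

# Use the NaN positions to list the original values that need attention
bad_values = raw[coerced.isna()]

print(bad_values.tolist())  # ['3+']
```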

## ✅ Example 2: Replace Multiple Values

In [10]:
df1 = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Unknown', 'Male'],
    'Age': [25, 30, 22, 29]
})

In [11]:
df1["Gender"] = df1["Gender"].replace(["Male", 'Female'], ["M", "F"])
df1

Unnamed: 0,Gender,Age
0,M,25
1,F,30
2,Unknown,22
3,M,29


## ✅ Example 3: Replace using Dictionary

In [13]:
df1.replace({"Gender" : {"M" : "Male", "F" : "Female"}}, inplace = True)
df1

Unnamed: 0,Gender,Age
0,Male,25
1,Female,30
2,Unknown,22
3,Male,29


# 🔹 2. Using map() for Replacement (on Series)

#### **`map()`** is handy when replacing values in **a single column** based on a mapping.

#### 🧠 Explanation:

- If a value in the column isn’t in the dictionary, it becomes `NaN`.

- Hence, `map()` is most useful when every expected value has an entry in the mapping.

In [14]:
gender_map = {"Male" : "M", "Female" : "F"} 
df1["Gender"] = df1["Gender"].map(gender_map)
df1

Unnamed: 0,Gender,Age
0,M,25
1,F,30
2,,22
3,M,29
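
If unmapped values should be kept rather than turned into `NaN`, one common pattern is to fill the gaps from the original column after mapping. A minimal sketch, using a toy Series that mirrors the `Gender` column:

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Unknown", "Male"])
gender_map = {"Male": "M", "Female": "F"}

# Plain map(): "Unknown" has no entry in the dict, so it becomes NaN
mapped = gender.map(gender_map)

# Fall back to the original value wherever the mapping produced NaN
kept = gender.map(gender_map).fillna(gender)

print(kept.tolist())  # ['M', 'F', 'Unknown', 'M']
```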


# 🔹 3. **Using `apply()` for Conditional Replacement**

#### If replacement depends on logic:

#### 🧠 Explanation:

- Applies a custom function to each element.

- Great for conditional data cleaning or transformation.

In [15]:
df1["Age_Group"] = df1["Age"].apply(lambda x : "Adult" if x >= 18 else "Minor")
df1

Unnamed: 0,Gender,Age,Age_Group
0,M,25,Adult
1,F,30,Adult
2,,22,Adult
3,M,29,Adult
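
When the logic has more than two branches, a named function tends to read better than a lambda. For a simple two-way split, `np.where` is a vectorized alternative that avoids calling a Python function per element. In the sketch below, the `age_group` helper and the 65 cutoff are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 12, 70])

def age_group(age):
    """Hypothetical three-way grouping; the 65 cutoff is an assumption."""
    if age < 18:
        return "Minor"
    if age < 65:
        return "Adult"
    return "Senior"

# apply() calls age_group once per element
by_apply = ages.apply(age_group)

# np.where evaluates the condition for the whole column at once
by_where = pd.Series(np.where(ages >= 18, "Adult", "Minor"), index=ages.index)

print(by_apply.tolist())  # ['Adult', 'Adult', 'Minor', 'Senior']
print(by_where.tolist())  # ['Adult', 'Adult', 'Minor', 'Adult']
```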


# 🔹 4. **Using `.loc[]` for Conditional Replacement**

#### For complex logical conditions:

#### 🧠 Explanation:

- Uses **conditional indexing**.

- Replaces all values greater than 28 with 28.

### **Code Explanation:**

- **`df.loc[Condition, 'ColumnName'] = newValue`**

- **`.loc[]`** is used to **access specific rows and columns** in a DataFrame by label.

- The **first argument (condition)** **selects the rows**.

- The **second argument ('Age') selects the column**.

#### So, **`df.loc[df['Age'] > 28, 'Age']`** means: select all rows where the value in the `Age` column is greater than 28, and then focus on the `Age` column of those rows.

In [16]:
df1.loc[df1["Age"] > 28, "Age"] = 28
df1

Unnamed: 0,Gender,Age,Age_Group
0,M,25,Adult
1,F,28,Adult
2,,22,Adult
3,M,28,Adult
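
Conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. The sketch below uses a hypothetical toy frame to cap the age only for one group and to fill a missing label with `.loc[]`.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing Gender
people = pd.DataFrame({
    "Gender": ["M", "F", None, "M"],
    "Age": [25, 30, 22, 29],
})

# Both conditions must hold; parentheses are required around each comparison
mask = (people["Gender"] == "M") & (people["Age"] > 28)
people.loc[mask, "Age"] = 28

# .loc[] can also target missing values via isna()
people.loc[people["Gender"].isna(), "Gender"] = "Unknown"

print(people["Age"].tolist())     # [25, 30, 22, 28]
print(people["Gender"].tolist())  # ['M', 'F', 'Unknown', 'M']
```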


# 🧩 Summary of Replace Techniques

| Method      | Use Case                             | Works On           | Notes                             |
| ----------- | ------------------------------------ | ------------------ | --------------------------------- |
| `replace()` | Simple direct replacement            | DataFrame / Series | Can use regex                     |
| `map()`     | Mapping values using dict            | Series             | Unmapped values become NaN        |
| `apply()`   | Logical / function-based replacement | DataFrame / Series | Flexible                          |
| `loc[]`     | Conditional replacement              | DataFrame / Series | Assigns in place via boolean mask |
