# Handling Mixed-Type Data (Multitype)
- Mixed-type data (also called multitype data), where a single column holds values of several types, is a common preprocessing challenge in machine learning.
- The goal is to detect, convert, and replace inconsistent data types so that each column has one consistent format for modeling.

| Task                    | Function                         |
| ----------------------- | -------------------------------- |
| Replace specific values | `df.replace()`                   |
| Convert to numeric      | `pd.to_numeric(errors='coerce')` |
| Convert to string       | `astype(str)`                    |
| Convert to datetime     | `pd.to_datetime()`               |
| Detect mixed types      | `apply(type)`                    |
| Fill missing values     | `fillna()`, `dropna()`           |

| To convert  | Use                                          |
| ----------- | -------------------------------------------- |
| To numeric  | `pd.to_numeric(df['col'])`                   |
| To string   | `df['col'].astype(str)`                      |
| To category | `df['col'].astype('category')`               |
| To datetime | `pd.to_datetime(df['col'], errors='coerce')` |
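
As a minimal sketch of the datetime and category rows above, the following uses two small made-up Series (`dates` and `sizes` are illustrative names, not part of this notebook's dataset). With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
dates = pd.Series(['2024-01-15', 'not a date', None])
sizes = pd.Series(['small', 'large', 'small'])

# Unparseable or missing entries become NaT rather than raising an error
parsed_dates = pd.to_datetime(dates, errors='coerce')

# A category column stores each distinct label once and encodes rows as codes
size_cat = sizes.astype('category')

print(parsed_dates)
print(size_cat.cat.categories.tolist())
```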

- `astype(type)`: this function can convert the data to any data type, not only string but also integer and float.
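
A short sketch of `astype` on a made-up Series of numeric strings (the variable names are illustrative). It also shows one limitation worth knowing: unlike `pd.to_numeric(errors='coerce')`, `astype` raises an error when a value cannot be converted.

```python
import pandas as pd

# Hypothetical sample data: numbers stored as text
s = pd.Series(['1', '2', '3'], dtype=object)

as_int = s.astype('int64')      # text -> integer
as_float = s.astype('float64')  # text -> float
back_to_str = as_int.astype(str)  # integer -> text

# astype has no "coerce" option, so a non-numeric string raises ValueError
try:
    pd.Series(['1', 'unknown'], dtype=object).astype('int64')
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_int.tolist(), as_float.tolist(), back_to_str.tolist())
print(conversion_failed)
```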

### Step 1: Identify the Mixed Types in a Column

In [55]:
# Import Libraries
import pandas as pd

In [56]:
# Create a dataset (here, a single column that holds values of several data types)
df = pd.DataFrame({'age': [25, '30', 'unknown', 45, None, '40']})
df

Unnamed: 0,age
0,25
1,30
2,unknown
3,45
4,
5,40


In [57]:
# Check the data type of each value in the column
print(df['age'].apply(type))  # Show the data type per row in the column
# For a large dataset, df.info() gives a summary of the whole DataFrame and makes the work easier
# df.info()
df['age'].value_counts()   # Count how many times each value appears in the column

0         <class 'int'>
1         <class 'str'>
2         <class 'str'>
3         <class 'int'>
4    <class 'NoneType'>
5         <class 'str'>
Name: age, dtype: object


age
25         1
30         1
unknown    1
45         1
40         1
Name: count, dtype: int64
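
On a larger column, the per-row listing gets long. One possible shortcut, sketched here on the same `age` data, is to count how many values there are of each Python type. `pd.api.types.infer_dtype` gives a one-word summary of the column as well; its exact label is printed rather than assumed here.

```python
import pandas as pd

df = pd.DataFrame({'age': [25, '30', 'unknown', 45, None, '40']})

# Map every value to the name of its Python type, then count the names
type_counts = df['age'].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# More than one distinct type means the column is mixed
is_mixed = len(type_counts) > 1
print(is_mixed)

# One-word summary of what pandas infers for the column
print(pd.api.types.infer_dtype(df['age']))
```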

### Step 2: Replace Specific Text Values

In [58]:
# Replace specific text values with a missing value, then handle those missing values in Step 4
# Here, 'unknown' is the text value in the dataset that has to be replaced
df['age'] = df['age'].replace('unknown', pd.NA)  # None or pd.NA can be used to mark the value as null
df

Unnamed: 0,age
0,25.0
1,30.0
2,
3,45.0
4,
5,40.0
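
Real columns often contain more than one placeholder. As a sketch under that assumption, `replace` also accepts a list of values; the placeholders `'N/A'` and `''` below are made up for illustration and do not appear in the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with several different "missing" markers
s = pd.Series([25, '30', 'unknown', 'N/A', '', 45], dtype=object)

# Every value in the list is replaced with NaN in one call
placeholders = ['unknown', 'N/A', '']
cleaned = s.replace(placeholders, np.nan)

print(cleaned)
print(cleaned.isna().sum())
```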


### Step 3: Convert The Data Type

In [59]:
# Convert to numeric and coerce errors to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     4 non-null      float64
dtypes: float64(1)
memory usage: 180.0 bytes
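
The difference between the default `errors='raise'` and `errors='coerce'` can be sketched on the raw `age` values, before any replacement. The `newly_missing` mask is an illustrative addition: it marks values that were present in the raw data but became `NaN` during conversion, which helps to review what was lost.

```python
import pandas as pd

raw = pd.Series([25, '30', 'unknown', 45, None, '40'], dtype=object)

# Default behaviour: an unparseable string raises ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# With errors='coerce', unparseable values become NaN instead
coerced = pd.to_numeric(raw, errors='coerce')

# Values that existed before conversion but are NaN afterwards
newly_missing = coerced.isna() & raw.notna()

print(raised)
print(coerced)
print(raw[newly_missing].tolist())
```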


### Step 4: Replace or Impute Missing Values

In [60]:
# Check the missing values in the data using the isnull() function
df.isnull().sum()

age    2
dtype: int64

In [64]:
# Fill the missing values in the dataset (the age column has missing values, so we have to fill them)
# The mean works here because the column was converted to a numeric type in Step 3.
# If the missing values were handled before the conversion, the column would still be object-typed,
# mean and median could not be applied, and the mode would be the option to use.
df.fillna({'age': df['age'].mean()}, inplace=True)
df

Unnamed: 0,age
0,25.0
1,30.0
2,35.0
3,45.0
4,35.0
5,40.0
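
For this data the mean and the median are both 35, so the choice does not matter. The sketch below first fills with the median and converts back to integers, then uses a made-up column with an outlier (`with_outlier` is hypothetical) to show a case where the two statistics differ.

```python
import numpy as np
import pandas as pd

# The age column as it looks after Step 3
age = pd.Series([25.0, 30.0, np.nan, 45.0, np.nan, 40.0], name='age')

# Fill with the median, then convert to integers now that no NaN remains
age_filled = age.fillna(age.median())
age_int = age_filled.astype('int64')
print(age_int.tolist())

# Hypothetical column with one extreme value
with_outlier = pd.Series([25.0, 30.0, np.nan, 45.0, np.nan, 400.0])
mean_fill = with_outlier.mean()      # pulled upwards by the outlier
median_fill = with_outlier.median()  # stays close to the typical values
print(mean_fill, median_fill)
```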


In [None]:
# Example: Cleaning a Mixed-Type Dataset
df = pd.DataFrame({'salary': ['1000', '2000', '3k', None, 4000, 'unknown']})

# Replace 'unknown' with a missing value and '3k' with 3000 (custom logic)
df['salary'] = df['salary'].replace({'unknown': None, '3k': 3000})

# Convert to numeric
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

# Fill missing with median
df['salary'] = df['salary'].fillna(df['salary'].median())

# Print DataFrame
df


Unnamed: 0,salary
0,1000.0
1,2000.0
2,3000.0
3,2500.0
4,4000.0
5,2500.0
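
Listing every shorthand value such as `'3k'` by hand does not scale. A possible generalization, sketched here with a hypothetical helper `parse_salary` (not a pandas function), converts any string ending in `k` to thousands and leaves everything else for `pd.to_numeric` to handle.

```python
import pandas as pd

def parse_salary(value):
    """Hypothetical helper: turn strings like '3k' into 3000.0."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text.endswith('k'):
            try:
                return float(text[:-1]) * 1000
            except ValueError:
                return None  # e.g. 'abck' is not a number
    return value  # leave other values unchanged

df = pd.DataFrame({'salary': ['1000', '2000', '3k', None, 4000, 'unknown']})

# Expand the 'k' shorthand first, then coerce everything else to numeric
parsed = df['salary'].map(parse_salary)
df['salary'] = pd.to_numeric(parsed, errors='coerce')

# Fill the remaining missing values with the median
df['salary'] = df['salary'].fillna(df['salary'].median())
print(df)
```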

