# 🏷️ Fixing Data Types

This notebook provides hands-on practice for detecting and correcting data type issues in a dataset.

## Step 1: Load a dataset with data type issues

In [1]:
import pandas as pd
df = pd.DataFrame({
    'id': ['1', '2', '3', '4'],
    'age': ['25', '30', 'NaN', '40'],
    'price': ['100.0', '200.5', 'invalid', '300'],
    'signup_date': ['2023-01-15', '15-02-2023', 'March 1, 2023', 'not a date'],
    'subscribed': ['Yes', 'No', 'Yes', 'No'],
    'city': ['New York', 'Chicago', 'Chicago', 'Los Angeles']
})
df

  id  age    price    signup_date subscribed         city
0  1   25    100.0     2023-01-15        Yes     New York
1  2   30    200.5     15-02-2023         No      Chicago
2  3  NaN  invalid  March 1, 2023        Yes      Chicago
3  4   40      300     not a date         No  Los Angeles


## Step 2: Check current data types

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           4 non-null      object
 1   age          4 non-null      object
 2   price        4 non-null      object
 3   signup_date  4 non-null      object
 4   subscribed   4 non-null      object
 5   city         4 non-null      object
dtypes: object(6)
memory usage: 320.0+ bytes
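
Every column is reported as `object`, which says that something is wrong but not which values are responsible. As a minimal sketch (not part of the original notebook), the offending entries can be listed before anything is coerced. The helper name `unparseable` is hypothetical, and the data repeats two columns from Step 1.

```python
import pandas as pd

# Two of the raw string columns from Step 1.
raw = pd.DataFrame({
    'age': ['25', '30', 'NaN', '40'],
    'price': ['100.0', '200.5', 'invalid', '300'],
})

def unparseable(series):
    """Return the entries that would turn into NaN under numeric coercion.

    Hypothetical helper for illustration only.
    """
    converted = pd.to_numeric(series, errors='coerce')
    # A problem value is present in the raw data but missing after conversion.
    return series[converted.isna() & series.notna()]

bad_price = unparseable(raw['price'])  # the entry 'invalid' (row 2)
bad_age = unparseable(raw['age'])      # the literal string 'NaN' (row 2)

print(bad_price)
print(bad_age)
```

Listing these values first makes it easier to decide whether coercing them to missing is acceptable or whether the source data should be corrected instead.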


## Step 3: Fix the types

In [3]:
# Convert 'id' to integer (the default integer width is platform-dependent,
# so the resulting dtype may be int32 or int64)
df['id'] = df['id'].astype(int)

# Convert 'age' to float; the string 'NaN' becomes a missing value
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Convert 'price' to float, coercing invalid strings
df['price'] = pd.to_numeric(df['price'], errors='coerce')

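# Convert 'signup_date' to datetime; unparseable values become NaT
# (the column mixes several formats, and how they are parsed can depend on the pandas version)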
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')

# Convert 'subscribed' to boolean (any value other than 'Yes' or 'No' would become NaN)
df['subscribed'] = df['subscribed'].map({'Yes': True, 'No': False})

# Convert 'city' to category
df['city'] = df['city'].astype('category')

# Display cleaned DataFrame
df


   id   age  price signup_date  subscribed         city
0   1  25.0  100.0  2023-01-15        True     New York
1   2  30.0  200.5  2023-02-15       False      Chicago
2   3   NaN    NaN  2023-03-01        True      Chicago
3   4  40.0  300.0         NaT       False  Los Angeles
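
The `signup_date` column mixes three date formats, and a single `pd.to_datetime` call may treat such a column differently across pandas versions (newer releases tend to infer one format from the first value, which can turn the remaining rows into `NaT`). One alternative that does not rely on format inference is to try each expected format in turn and keep the first successful parse. This is only a sketch: the `formats` list is an assumption made for this example and would need to match your own data.

```python
import pandas as pd

dates = pd.Series(['2023-01-15', '15-02-2023', 'March 1, 2023', 'not a date'])

# Assumed formats for this example; extend the list to match your data.
formats = ['%Y-%m-%d', '%d-%m-%Y', '%B %d, %Y']

parsed = None
for fmt in formats:
    # Values that do not match this format come back as NaT.
    attempt = pd.to_datetime(dates, format=fmt, errors='coerce')
    # Keep earlier successes and fill remaining gaps with this attempt.
    parsed = attempt if parsed is None else parsed.fillna(attempt)

print(parsed)
```

Listing the formats explicitly also removes the ambiguity of strings such as `15-02-2023`, where day-first or month-first order would otherwise have to be guessed.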


## Step 4: Re-check data types and cleaned data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           4 non-null      int32         
 1   age          3 non-null      float64       
 2   price        3 non-null      float64       
 3   signup_date  3 non-null      datetime64[ns]
 4   subscribed   4 non-null      bool          
 5   city         4 non-null      category      
dtypes: bool(1), category(1), datetime64[ns](1), float64(2), int32(1)
memory usage: 380.0 bytes
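
Note that `age` ends up as `float64` only because it contains a missing value. If whole-number ages should stay integers, pandas' nullable dtypes are one option. The sketch below is an optional extension rather than part of the original notebook. It also shows one way to normalize yes/no text before mapping, using made-up sample values that are not in the dataset above.

```python
import pandas as pd

# 'age' as produced in Step 3: float64 with one missing value.
age = pd.to_numeric(pd.Series(['25', '30', 'NaN', '40']), errors='coerce')

# Nullable integer dtype: keeps whole numbers and marks the gap as <NA>.
age_int = age.astype('Int64')

# Hypothetical messy yes/no values (not from the dataset above).
subscribed = pd.Series(['Yes', 'No', 'yes ', 'unknown'])

# Normalize case and whitespace, then map; unmapped values become missing.
normalized = subscribed.str.strip().str.lower()
flags = normalized.map({'yes': True, 'no': False}).astype('boolean')

print(age_int.dtype)   # Int64
print(flags.dtype)     # boolean
```

With the nullable `boolean` dtype, an unexpected value such as `'unknown'` shows up as `<NA>` instead of silently turning the whole column back into `object`.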
