<img src="./images/banner.png" width="800">

# Examining Data Types and Structures

In the world of data analysis, understanding data types is crucial for effectively processing, analyzing, and interpreting information. Data types are categories that specify which kind of value a variable can hold. They play a fundamental role in how data is stored, manipulated, and analyzed in various programming languages and data analysis tools.


🔑 **Key Concept:** Data types define the nature of data and determine how it can be used in computations and analyses.


Understanding data types is essential for several reasons:

1. **Memory Allocation:** Different data types require different amounts of memory. Knowing your data types helps in efficient memory management.

2. **Operation Compatibility:** Certain operations are only valid for specific data types. For example, arithmetic such as division works on numeric types but not on strings, and `+` applied to two strings concatenates them instead of adding.

3. **Data Integrity:** Proper data typing helps maintain the integrity of your data, ensuring that values are stored and processed correctly.

4. **Performance Optimization:** Using the right data types can significantly improve the performance of your data processing and analysis tasks.
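
The operation-compatibility point can be seen in a few lines. This is a minimal sketch with a hypothetical two-value column (`as_text` is not part of any dataset used here): the same `+` operator concatenates when the values are stored as text and adds once they are converted to numbers.

```python
import pandas as pd

# Hypothetical column that was loaded as text instead of numbers
as_text = pd.Series(["10", "20"], dtype=object)
as_numbers = pd.to_numeric(as_text)

text_sum = (as_text + as_text).tolist()           # string concatenation
numeric_sum = (as_numbers + as_numbers).tolist()  # arithmetic addition

print(text_sum)
print(numeric_sum)
```

Neither call raises an error, which is why a wrongly typed column can go unnoticed until the results are inspected.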


In data analysis, we commonly encounter the following basic data types:

1. **Numeric Types:**
   - Integers (e.g., 1, 100, -5)
   - Floating-point numbers (e.g., 3.14, -0.001)

2. **String Type:**
   - Text data (e.g., "Hello, World!", "Data Analysis")

3. **Boolean Type:**
   - True/False values

4. **Date and Time Types:**
   - Representing dates, times, or both


When working with Pandas and NumPy, you'll encounter more specific data types:

In [1]:
import numpy as np
import pandas as pd

# NumPy data types
np_int = np.int64(42)
np_float = np.float64(3.14)

# Pandas data types
pd_series = pd.Series([1, 2, 3])
pd_series.dtype

dtype('int64')

💡 **Pro Tip:** Always check the data types of your columns when you load a dataset. This can be done easily in Pandas using the `dtypes` attribute:


```python
df = pd.read_csv('your_dataset.csv')
print(df.dtypes)
```


The choice of data type can significantly impact your analysis:

1. **Numerical Calculations:** Using the wrong numeric type (e.g., int instead of float) can lead to loss of precision or unexpected results.

2. **Memory Usage:** Large datasets with inefficient data types can consume excessive memory.

3. **Processing Speed:** Optimized data types can dramatically speed up data processing and analysis tasks.
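
As a small illustration of the first point, the sketch below casts a hypothetical `prices` Series from float to integer. The cast drops the fractional part, so any total computed afterwards is too low.

```python
import pandas as pd

# Hypothetical prices; the values are chosen only for illustration
prices = pd.Series([1.99, 2.50, 3.75])

# Casting float -> int discards everything after the decimal point
truncated = prices.astype("int64")

print(truncated.tolist())
print(prices.sum(), truncated.sum())
```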


Mismatched data types can lead to errors or unexpected behavior in your analysis. Always ensure your data types are appropriate for your intended operations.


Understanding data types is the foundation of effective data analysis. As we progress through this lecture, we'll delve deeper into how to examine, verify, and manipulate data types to ensure your data is in the best possible shape for analysis.


Properly managing data types from the outset of your analysis can save you significant time and effort later, preventing errors and ensuring the accuracy of your results.

**Table of contents**<a id='toc0_'></a>    
- [Common Data Types in Pandas and NumPy](#toc1_)    
  - [Numeric Data Types](#toc1_1_)    
  - [String Data Types](#toc1_2_)    
  - [Boolean Data Type](#toc1_3_)    
  - [DateTime Data Types](#toc1_4_)    
  - [Categorical Data Type](#toc1_5_)    
  - [Special Data Types](#toc1_6_)    
  - [Checking Data Types](#toc1_7_)    
- [Examining and Verifying Data Types](#toc2_)    
  - [Initial Data Type Inspection](#toc2_1_)    
  - [Detailed Data Type Information](#toc2_2_)    
  - [Verifying Numeric Columns](#toc2_3_)    
  - [Checking Unique Values in Categorical Columns](#toc2_4_)    
  - [Identifying Mixed Data Types](#toc2_5_)    
  - [Verifying Date Columns](#toc2_6_)    
  - [Using `select_dtypes` for Type-Based Selection](#toc2_7_)    
  - [Handling Boolean Data](#toc2_8_)    
  - [Examining Memory Usage](#toc2_9_)    
- [Identifying and Handling Mixed Data Types](#toc3_)    
  - [Identifying Mixed Data Types](#toc3_1_)    
  - [Detecting Mixed Types](#toc3_2_)    
  - [Handling Mixed Numeric and String Data](#toc3_3_)    
  - [Handling Mixed Categorical Data](#toc3_4_)    
  - [Handling Mixed Boolean Data](#toc3_5_)    
  - [Handling Mixed Date Types](#toc3_6_)    
  - [General Approach for Handling Mixed Types](#toc3_7_)    
- [Ensuring Data Integrity During Loading](#toc4_)    
  - [Specifying Data Types During Loading](#toc4_1_)    
  - [Handling Missing Values During Load](#toc4_2_)    
  - [Using Parse Functions for Complex Data](#toc4_3_)    
  - [Validating Data Range and Constraints](#toc4_4_)    
  - [Checking for Duplicate Records](#toc4_5_)    
  - [Verifying Categorical Data Integrity](#toc4_6_)    
  - [Implementing Custom Validation Functions](#toc4_7_)    
  - [Logging Data Loading Process](#toc4_8_)    
- [Summary](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Common Data Types in Pandas and NumPy](#toc0_)

Pandas and NumPy are fundamental libraries in Python for data analysis, each with its own set of data types. Understanding these types is crucial for efficient data manipulation and analysis.


### <a id='toc1_1_'></a>[Numeric Data Types](#toc0_)


Both Pandas and NumPy offer a range of numeric data types:

1. **Integers:**
   - `int8`, `int16`, `int32`, `int64`: Signed integers of varying sizes
   - `uint8`, `uint16`, `uint32`, `uint64`: Unsigned integers

2. **Floating-Point Numbers:**
   - `float16`, `float32`, `float64`: Floating-point numbers of different precisions
   - `float64` is also known as `double`
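
The number in each type name is its width in bits, which fixes the range or precision it can represent. A minimal sketch using NumPy's `np.iinfo` and `np.finfo` to inspect those limits (the `ranges` and `float_digits` names are just for this example):

```python
import numpy as np

# Collect the representable range of a few integer types
ranges = {}
for name in ("int8", "uint8", "int32", "int64"):
    info = np.iinfo(np.dtype(name))
    ranges[name] = (int(info.min), int(info.max))
    print(f"{name:>6}: {info.min} to {info.max}")

# Approximate number of reliable decimal digits for each float type
float_digits = {
    name: int(np.finfo(np.dtype(name)).precision)
    for name in ("float16", "float32", "float64")
}
print(float_digits)
```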


In [2]:
import numpy as np
import pandas as pd

In [3]:
# NumPy examples
np.array([1, 2, 3], dtype=np.int32)

array([1, 2, 3], dtype=int32)

In [4]:
np.array([1.0, 2.0, 3.0], dtype=np.float64)

array([1., 2., 3.])

In [5]:
# Pandas examples
pd.Series([1, 2, 3], dtype="int32")

0    1
1    2
2    3
dtype: int32

In [6]:
pd.Series([1.0, 2.0, 3.0], dtype="float64")

0    1.0
1    2.0
2    3.0
dtype: float64

### <a id='toc1_2_'></a>[String Data Types](#toc0_)


Strings are handled differently in NumPy and Pandas:

1. **NumPy:**
   - `dtype='U'`: Unicode string
   - `dtype='S'`: Byte string

2. **Pandas:**
   - `object`: Default type for string data
   - `string`: Pandas extension type for strings (introduced in newer versions)


In [7]:
# NumPy string array
np.array(["apple", "banana", "cherry"], dtype="U")

array(['apple', 'banana', 'cherry'], dtype='<U6')

In [8]:
# Pandas string series
pd.Series(["apple", "banana", "cherry"], dtype="string")

0     apple
1    banana
2    cherry
dtype: string

### <a id='toc1_3_'></a>[Boolean Data Type](#toc0_)


Both libraries support boolean data:

- `bool`: True or False values


In [9]:
np.array([True, False, True])

array([ True, False,  True])

In [10]:
pd.Series([True, False, True], dtype="bool")

0     True
1    False
2     True
dtype: bool

### <a id='toc1_4_'></a>[DateTime Data Types](#toc0_)


Handling dates and times is crucial in data analysis:

1. **NumPy:**
   - `datetime64`: Represents dates and times

2. **Pandas:**
   - `datetime64[ns]`: Nanosecond precision for datetime
   - `timedelta64[ns]`: For time differences


In [11]:
# NumPy datetime
np.array(["2023-01-01", "2023-01-02"], dtype="datetime64")

array(['2023-01-01', '2023-01-02'], dtype='datetime64[D]')

In [12]:
# Pandas datetime
pd.to_datetime(["2023-01-01", "2023-01-02"])

DatetimeIndex(['2023-01-01', '2023-01-02'], dtype='datetime64[ns]', freq=None)

### <a id='toc1_5_'></a>[Categorical Data Type](#toc0_)


Pandas offers a special type for categorical data:

- `category`: Efficient storage and processing for repeated string values


In [13]:
pd.Series(["A", "B", "A", "C", "B", "A"], dtype="category")

0    A
1    B
2    A
3    C
4    B
5    A
dtype: category
Categories (3, object): ['A', 'B', 'C']

💡 **Pro Tip:** Using the `category` type for columns with repeated string values can significantly reduce memory usage and improve performance.
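
To get a feel for the size of the saving, the sketch below compares the memory footprint of a hypothetical column of repeated color names stored as `object` and as `category`. The exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times
colors = pd.Series(["red", "green", "blue"] * 10_000, dtype=object)
as_category = colors.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving shrinks as the number of distinct values approaches the number of rows, so the conversion is mainly useful for low-cardinality columns.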


### <a id='toc1_6_'></a>[Special Data Types](#toc0_)


1. **NumPy:**
   - `complex`: For complex numbers

2. **Pandas:**
   - `Int64`, `Float64`: Nullable integer and float types
   - `Sparse` (`pd.SparseDtype`): For efficient storage of sparse data


In [14]:
# NumPy complex number
np.array([1 + 2j, 3 + 4j])

array([1.+2.j, 3.+4.j])

In [15]:
# Pandas nullable integer
pd.Series([1, 2, None], dtype="Int64")

0       1
1       2
2    <NA>
dtype: Int64
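
The reason nullable types exist becomes clearer when the same values are stored without one. In this minimal sketch (variable names are illustrative), the default integer column is silently promoted to `float64` to make room for the missing value, while `Int64` keeps the integers as integers.

```python
import pandas as pd

# Without an explicit dtype, the missing value forces a float column
default = pd.Series([1, 2, None])

# The nullable extension type keeps integers and marks the gap as <NA>
nullable = pd.Series([1, 2, None], dtype="Int64")

print(default.dtype)
print(nullable.dtype)
```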

### <a id='toc1_7_'></a>[Checking Data Types](#toc0_)


You can easily check the data types of your NumPy arrays or Pandas Series/DataFrames:


```python
# For NumPy
print(np_int.dtype)

# For Pandas
print(pd_series.dtype)
print(df.dtypes)  # For a DataFrame
```


🤔 **Why This Matters:** Choosing the right data type can significantly impact memory usage and computation speed, especially when dealing with large datasets.


❗️ **Important Note:** Be aware of the trade-offs between precision and memory usage when selecting numeric data types. For instance, `float32` uses less memory but has lower precision compared to `float64`.
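
The trade-off can be made concrete with a short sketch. The value `16_777_217` (2**24 + 1) is chosen because it is the first positive integer that `float32` cannot represent exactly; the array is a hypothetical stand-in for a numeric column.

```python
import numpy as np

big = 16_777_217  # 2**24 + 1

# float32 rounds to the nearest representable value; float64 keeps it exactly
as_float32 = int(np.float32(big))
as_float64 = int(np.float64(big))
print(as_float32, as_float64)

# The same 1,000 values take half the memory in float32
column64 = np.ones(1_000, dtype=np.float64)
column32 = column64.astype(np.float32)
print(column64.nbytes, column32.nbytes)
```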


Understanding these common data types in Pandas and NumPy is essential for effective data manipulation and analysis. In the next section, we'll explore how to examine and verify these data types in your datasets.

## <a id='toc2_'></a>[Examining and Verifying Data Types](#toc0_)

In this section, we'll explore practical methods for examining and verifying data types using real-world datasets. We'll primarily use the famous Titanic dataset, which contains various data types, and supplement it with a time series dataset to demonstrate datetime handling.


Regularly examining and verifying data types is crucial for ensuring data integrity and preventing unexpected behavior in your analyses.


Let's start by loading our datasets:


In [16]:
import pandas as pd
import seaborn as sns

In [17]:
# Load the Titanic dataset
titanic_df = sns.load_dataset("titanic")

In [18]:
# Load a time series dataset (taxis)
taxis = sns.load_dataset("taxis")

In [19]:
titanic_df.head()

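| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |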


In [20]:
taxis.head()

| | pickup | dropoff | passengers | distance | fare | tip | tolls | total | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 1 | 1.6 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Lenox Hill West | UN/Turtle Bay South | Manhattan | Manhattan |
| 1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 1 | 0.79 | 5.0 | 0.0 | 0.0 | 9.3 | yellow | cash | Upper West Side South | Upper West Side South | Manhattan | Manhattan |
| 2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 1 | 1.37 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Alphabet City | West Village | Manhattan | Manhattan |
| 3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 1 | 7.7 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Hudson Sq | Yorkville West | Manhattan | Manhattan |
| 4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 3 | 2.16 | 9.0 | 1.1 | 0.0 | 13.4 | yellow | credit card | Midtown East | Yorkville West | Manhattan | Manhattan |


### <a id='toc2_1_'></a>[Initial Data Type Inspection](#toc0_)


The first step in examining data types is to get an overview of the types in your dataset:


In [21]:
titanic_df.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [22]:
taxis.dtypes

pickup             datetime64[ns]
dropoff            datetime64[ns]
passengers                  int64
distance                  float64
fare                      float64
tip                       float64
tolls                     float64
total                     float64
color                      object
payment                    object
pickup_zone                object
dropoff_zone               object
pickup_borough             object
dropoff_borough            object
dtype: object

💡 **Pro Tip:** Always start your analysis by checking the data types. This quick overview can reveal potential issues or areas that need attention.


### <a id='toc2_2_'></a>[Detailed Data Type Information](#toc0_)


For a more comprehensive view of your data, including data types and non-null counts:


In [23]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [24]:
taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(1), object(6)

To dive deeper into specific columns:


In [25]:
titanic_df["age"].dtype

dtype('float64')

In [26]:
taxis["total"].dtype

dtype('float64')

### <a id='toc2_3_'></a>[Verifying Numeric Columns](#toc0_)


For numeric columns, it's useful to check basic statistics:


In [27]:
titanic_df["age"].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

In [28]:
taxis["total"].describe()

count    6433.000000
mean       18.517794
std        13.815570
min         1.300000
25%        10.800000
50%        14.160000
75%        20.300000
max       174.820000
Name: total, dtype: float64

Potential Issues with Numeric Data:
1. **Incorrect data type:** `'age'` is loaded as `object` instead of `float64`.
2. **Presence of non-numeric values:** Strings such as `"Unknown"` appear in a column that should be numeric.
3. **Missing values:** `NaN` entries lower the non-null count (714 of 891 for `'age'`) and are skipped by statistics such as the mean.
4. **Range of values:** `'age'` contains values outside a reasonable range.


Example fix:

In [29]:
# Convert to numeric, coercing errors to NaN
pd.to_numeric(titanic_df["age"], errors="coerce")

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64
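
Because the Titanic `age` column is already clean, the call above returns it unchanged. The sketch below applies the same conversion to a hypothetical dirty column (`raw_age` is invented for illustration) and then checks the range, covering the issues listed earlier in one pass. The 0 to 120 bounds are an assumption about what counts as a plausible age.

```python
import pandas as pd

# Hypothetical raw column: numeric text, a placeholder word, a gap, and outliers
raw_age = pd.Series(["22", "38.5", "unknown", None, "-3", "250"], dtype=object)

# Non-numeric entries become NaN instead of raising an error
age = pd.to_numeric(raw_age, errors="coerce")

# Flag values outside the assumed plausible range
out_of_range = age[(age < 0) | (age > 120)]

print(age.dtype)
print("missing after conversion:", int(age.isna().sum()))
print(out_of_range)
```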

### <a id='toc2_4_'></a>[Checking Unique Values in Categorical Columns](#toc0_)


For categorical columns, examining unique values is crucial:


In [30]:
titanic_df["sex"].unique()

array(['male', 'female'], dtype=object)

In [31]:
titanic_df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

Potential Issues with Categorical Data:
1. **Inconsistent categories:** Variations in spelling or capitalization.
2. **Too many unique values:** This suggests the column is not really categorical and may instead hold numeric data.

Example fix:

In [32]:
# Standardize categories
titanic_df["sex"].str.lower().str.strip()

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: sex, Length: 891, dtype: object
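
The Titanic `sex` column has no inconsistencies, so the effect is easier to see on a hypothetical column that does. In this sketch (`raw_sex` is invented for illustration), four spellings collapse to two categories after stripping whitespace and lowercasing.

```python
import pandas as pd

# Hypothetical column with stray whitespace and inconsistent capitalization
raw_sex = pd.Series([" Male", "male", "FEMALE", "female ", None], dtype=object)

# dropna=False keeps missing values visible in the tally
print(raw_sex.value_counts(dropna=False))

cleaned_sex = raw_sex.str.strip().str.lower()
print(cleaned_sex.value_counts(dropna=False))
```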

### <a id='toc2_5_'></a>[Identifying Mixed Data Types](#toc0_)


Sometimes, columns might contain mixed data types:


In [33]:
titanic_df.loc[0, "age"] = "Unknown"



In [34]:
titanic_df["age"].dtype

dtype('O')

In [35]:
titanic_df["age"].head()

0    Unknown
1       38.0
2       26.0
3       35.0
4       35.0
Name: age, dtype: object

Potential Issues with Mixed Data Types:
1. **Unexpected type conversion:** Numeric column becoming object due to string values.
2. **Calculation errors:** Unable to perform numeric operations on mixed types.


Example fix:

In [36]:
# Convert to numeric, replacing non-numeric values with NaN
pd.to_numeric(titanic_df["age"], errors="coerce")

0       NaN
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

### <a id='toc2_6_'></a>[Verifying Date Columns](#toc0_)


Let's examine the date column in our Taxis dataset:


In [37]:
taxis["pickup"].dtype

dtype('<M8[ns]')

In [38]:
taxis["pickup"].head()

0   2019-03-23 20:21:09
1   2019-03-04 16:11:55
2   2019-03-27 17:53:01
3   2019-03-10 01:23:59
4   2019-03-30 13:27:42
Name: pickup, dtype: datetime64[ns]

Potential Issues with Date Data:
1. **Incorrect date format:** Dates loaded as strings instead of datetime.
2. **Inconsistent date formats:** Mixed formats within the same column.
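
When dates do arrive as strings, `pd.to_datetime` can convert them. A minimal sketch, assuming a hypothetical `raw_pickup` column in the same layout as the taxis timestamps: an explicit `format` documents what is expected, and `errors="coerce"` turns anything that does not match into `NaT` so it can be counted.

```python
import pandas as pd

# Hypothetical string column with one unparseable entry
raw_pickup = pd.Series(["2019-03-23 20:21:09", "2019-03-04 16:11:55", "not a date"])

pickup = pd.to_datetime(raw_pickup, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(pickup.dtype)
print("unparseable entries:", int(pickup.isna().sum()))
```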


### <a id='toc2_7_'></a>[Using `select_dtypes` for Type-Based Selection](#toc0_)


To select columns of specific types:


In [39]:
titanic_df.select_dtypes(include=["int64", "float64"])

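| | survived | pclass | sibsp | parch | fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 0 | 0 | 13.0000 |
| 887 | 1 | 1 | 0 | 0 | 30.0000 |
| 888 | 0 | 3 | 1 | 2 | 23.4500 |
| 889 | 1 | 1 | 0 | 0 | 30.0000 |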


In [40]:
titanic_df.select_dtypes(include=["object"])

| | sex | age | embarked | who | embark_town | alive |
|---|---|---|---|---|---|---|
| 0 | male | Unknown | S | man | Southampton | no |
| 1 | female | 38.0 | C | woman | Cherbourg | yes |
| 2 | female | 26.0 | S | woman | Southampton | yes |
| 3 | female | 35.0 | S | woman | Southampton | yes |
| 4 | male | 35.0 | S | man | Southampton | no |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | male | 27.0 | S | man | Southampton | no |
| 887 | female | 19.0 | S | woman | Southampton | yes |
| 888 | female | NaN | S | woman | Southampton | no |
| 889 | male | 26.0 | C | man | Cherbourg | yes |
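
Listing exact dtype names misses columns stored as, say, `int32` or `float32`. As a sketch on a hypothetical four-column frame, the generic `"number"` selector covers every numeric width, and `exclude` returns the complement.

```python
import pandas as pd

# Hypothetical miniature frame with one column of each kind
df = pd.DataFrame(
    {
        "survived": pd.Series([0, 1], dtype="int32"),
        "fare": [7.25, 71.28],
        "sex": pd.Series(["male", "female"], dtype=object),
        "alone": [False, True],
    }
)

numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Boolean columns are not treated as numbers by this selector, so `alone` ends up in the second list.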


### <a id='toc2_8_'></a>[Handling Boolean Data](#toc0_)


Let's create a boolean column in the Titanic dataset:


In [41]:
titanic_df["age"] = pd.to_numeric(titanic_df["age"], errors="coerce")

In [42]:
titanic_df["is_adult"] = titanic_df["age"] > 18

In [43]:
titanic_df["is_adult"].dtype

dtype('bool')

In [44]:
titanic_df["is_adult"].head()

0    False
1     True
2     True
3     True
4     True
Name: is_adult, dtype: bool

Potential Issues with Boolean Data:
1. **Incorrect representation:** Stored as 0/1 instead of True/False.
2. **Mixed boolean and non-boolean values:** Leading to object dtype instead of bool.


Example fix:

In [45]:
titanic_df["is_adult"].astype("bool")

0      False
1       True
2       True
3       True
4       True
       ...  
886     True
887     True
888    False
889     True
890     True
Name: is_adult, Length: 891, dtype: bool
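
Two details are worth a sketch here. First, 0/1 flags convert cleanly with `astype("bool")`. Second, a plain comparison turns missing values into `False`, which is why passengers with unknown age show up as non-adults above. One option, shown below on hypothetical data, is pandas' nullable `Float64` type, whose comparisons keep the gap as `<NA>`.

```python
import numpy as np
import pandas as pd

# 0/1 flags stored as integers
flags = pd.Series([0, 1, 1, 0])
as_bool = flags.astype("bool")

# Plain float column: the comparison reports False for the missing age
age = pd.Series([22.0, np.nan, 10.0])
plain_comparison = age > 18

# Nullable float column: the comparison preserves the missing value
nullable_age = pd.Series([22.0, None, 10.0], dtype="Float64")
nullable_comparison = nullable_age > 18

print(as_bool.tolist())
print(plain_comparison.tolist())
print(nullable_comparison)
```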

### <a id='toc2_9_'></a>[Examining Memory Usage](#toc0_)


Understanding memory usage can be crucial for large datasets:


In [46]:
titanic_df.memory_usage()

Index           128
survived       7128
pclass         7128
sex            7128
age            7128
sibsp          7128
parch          7128
fare           7128
embarked       7128
class          1023
who            7128
adult_male      891
deck           1247
embark_town    7128
alive          7128
alone           891
is_adult        891
dtype: int64

In [47]:
taxis.memory_usage()

Index                128
pickup             51464
dropoff            51464
passengers         51464
distance           51464
fare               51464
tip                51464
tolls              51464
total              51464
color              51464
payment            51464
pickup_zone        51464
dropoff_zone       51464
pickup_borough     51464
dropoff_borough    51464
dtype: int64

Optimizing data types can significantly reduce memory usage, especially for large datasets.
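
Note that `memory_usage()` without arguments counts only 8 bytes per entry for `object` columns (the pointer), not the strings themselves. The sketch below, on a hypothetical frame, uses `deep=True` for the real footprint and then applies two common optimizations: downcasting integers and converting repeated strings to `category`.

```python
import pandas as pd

# Hypothetical frame standing in for a larger dataset
df = pd.DataFrame(
    {
        "passengers": [1, 2, 3, 1] * 250,
        "color": pd.Series(["yellow", "green"] * 500, dtype=object),
    }
)

shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()  # includes the string contents

optimized = df.assign(
    # downcast picks the smallest integer type that can hold the values
    passengers=pd.to_numeric(df["passengers"], downcast="integer"),
    color=df["color"].astype("category"),
)
optimized_bytes = optimized.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes, optimized_bytes)
print(optimized.dtypes)
```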


Examining and verifying data types is a critical step in the data analysis process. It helps ensure data integrity, prevents errors in calculations, and guides decisions on data preprocessing steps.


❗️ **Important Note:** Always verify data types after loading a dataset and after any major transformations. Catching type-related issues early can save significant time and prevent errors in your analysis.


💡 **Pro Tip:** Develop a systematic approach to data type verification as part of your data analysis workflow. This practice will help you catch and address issues consistently across different projects.


In the next section, we'll explore how to handle mixed data types and perform necessary conversions to prepare our data for analysis.

## <a id='toc3_'></a>[Identifying and Handling Mixed Data Types](#toc0_)

Mixed data types occur when a single column contains multiple types of data. This situation can arise due to data entry errors, inconsistent data sources, or improper data handling. Identifying and addressing mixed data types is crucial for ensuring data integrity and preventing errors in analysis.


Let's continue using our Titanic dataset and introduce some mixed data types for demonstration purposes.


### <a id='toc3_1_'></a>[Identifying Mixed Data Types](#toc0_)


First, let's create a situation with mixed data types:


In [48]:
import pandas as pd
import numpy as np
import seaborn as sns

In [49]:
# Load the Titanic dataset
titanic_df = sns.load_dataset("titanic")

In [50]:
# Introduce mixed types in the 'age' column
titanic_df.loc[0, "age"] = "Unknown"



In [51]:
titanic_df.loc[1, "age"] = np.nan

In [52]:
titanic_df.loc[2, "age"] = "30.5"

In [53]:
titanic_df["age"].head()

0    Unknown
1        NaN
2       30.5
3       35.0
4       35.0
Name: age, dtype: object

In [54]:
titanic_df["age"].dtype

dtype('O')

🤔 **Why This Matters:** Notice that the `age` column, which should be numeric, is now of type `object` due to the mixed data types.


### <a id='toc3_2_'></a>[Detecting Mixed Types](#toc0_)


To systematically detect columns with mixed types:


In [55]:
def detect_mixed_types(df):
    mixed_types = {}
    for column in df.columns:
        types = df[column].apply(lambda v: type(v).__name__).value_counts()
        if len(types) > 1:
            mixed_types[column] = types.to_dict()
    return mixed_types

In [56]:
mixed_columns = detect_mixed_types(titanic_df)
print("Columns with mixed types:", mixed_columns)

Columns with mixed types: {'age': {'float': 889, 'str': 2}, 'embarked': {'str': 889, 'float': 2}, 'embark_town': {'str': 889, 'float': 2}}
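
In the result above, `embarked` and `embark_town` are flagged only because their missing values are `NaN`, which is a Python `float`. If missing values should not count as a second type, one possible variant (a sketch with a hypothetical name, `detect_mixed_types_ignoring_na`) drops them before inspecting the types:

```python
import numpy as np
import pandas as pd


def detect_mixed_types_ignoring_na(df):
    """Report columns whose non-missing values have more than one Python type."""
    mixed = {}
    for column in df.columns:
        # dropna() removes NaN/None so missing values are not counted as 'float'
        types = df[column].dropna().map(lambda v: type(v).__name__).value_counts()
        if len(types) > 1:
            mixed[column] = types.to_dict()
    return mixed


# Hypothetical miniature of the modified Titanic columns
sample = pd.DataFrame(
    {
        "age": pd.Series([22.0, "Unknown", np.nan, "30.5"], dtype=object),
        "embarked": pd.Series(["S", np.nan, "C", "Q"], dtype=object),
    }
)
result = detect_mixed_types_ignoring_na(sample)
print(result)
```

Only `age` is reported for this sample, since `embarked` contains strings and missing values but nothing else.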


### <a id='toc3_3_'></a>[Handling Mixed Numeric and String Data](#toc0_)


For columns that should be numeric but contain string values:


In [57]:
def clean_numeric_column(series):
    # Convert to numeric, forcing errors to NaN
    numeric_series = pd.to_numeric(series, errors="coerce")

    # Check what percentage of data was lost
    percent_lost = (series.dropna().size - numeric_series.count()) / series.size * 100

    print(f"Percentage of data lost: {percent_lost:.2f}%")
    return numeric_series

In [58]:
clean_numeric_column(titanic_df["age"])

Percentage of data lost: 0.11%


0       NaN
1       NaN
2      30.5
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

💡 **Pro Tip:** Always check the percentage of data lost when converting mixed types. If it's significant, you may need to investigate further or use a different approach.
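
A percentage says how much was lost but not what. As a sketch of the follow-up investigation (the helper name `find_coercion_failures` and the sample values are hypothetical), the entries that were present before conversion but missing afterwards can be listed directly:

```python
import numpy as np
import pandas as pd


def find_coercion_failures(series):
    """Return the converted series and the original values that failed to convert."""
    numeric = pd.to_numeric(series, errors="coerce")
    # A failure is a value that existed before conversion but is NaN afterwards
    failed = series[numeric.isna() & series.notna()]
    return numeric, failed


raw_age = pd.Series([22.0, "Unknown", np.nan, "30.5", "n/a"], dtype=object)
numeric_age, failed = find_coercion_failures(raw_age)

print(failed.value_counts())
```

Seeing the actual failing values makes it easier to decide whether they are placeholders for missing data or fixable typos.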


### <a id='toc3_4_'></a>[Handling Mixed Categorical Data](#toc0_)


For categorical columns with inconsistent data:


In [59]:
titanic_df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [60]:
# Introduce inconsistent values in the 'embarked' column
titanic_df.loc[0, "embarked"] = "Southampton"
titanic_df.loc[1, "embarked"] = "Cherbourg"
titanic_df.loc[2, "embarked"] = np.nan

In [61]:
def clean_categorical_column(series):
    # Create a mapping for full names to abbreviations
    mapping = {"Southampton": "S", "Cherbourg": "C", "Queenstown": "Q"}

    # Apply mapping and convert to category
    cleaned_series = series.map(lambda x: mapping.get(x, x)).astype("category")

    print(f"Unique values: {cleaned_series.unique()}")
    return cleaned_series

In [62]:
clean_categorical_column(titanic_df["embarked"])

Unique values: ['S', 'C', NaN, 'Q']
Categories (3, object): ['C', 'Q', 'S']


0        S
1        C
2      NaN
3        S
4        S
      ... 
886      S
887      S
888      S
889      C
890      Q
Name: embarked, Length: 891, dtype: category
Categories (3, object): ['C', 'Q', 'S']

### <a id='toc3_5_'></a>[Handling Mixed Boolean Data](#toc0_)


For columns that should be boolean but contain mixed types:


In [63]:
# Create a mixed boolean column
titanic_df["is_first_class"] = titanic_df["pclass"] == 1
titanic_df.loc[0, "is_first_class"] = "Yes"
titanic_df.loc[1, "is_first_class"] = 1



In [64]:
def clean_boolean_column(series):
    # Define values to be considered as True (isin matches exactly, so list every spelling)
    true_values = [True, "True", "true", "YES", "Yes", "yes", "Y", "y", 1]

    cleaned_series = series.isin(true_values)

    print(f"Unique values: {cleaned_series.unique()}")
    return cleaned_series

In [65]:
clean_boolean_column(titanic_df["is_first_class"])

Unique values: [ True False]


0       True
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: is_first_class, Length: 891, dtype: bool
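
Matching with `isin` is exact, so every spelling has to be listed, and anything not listed becomes `False`. An alternative sketch (the function name `to_nullable_boolean` and the accepted spellings are assumptions) normalizes case and whitespace first and leaves unrecognized entries as `<NA>`, so they stay visible instead of being counted as `False`:

```python
import pandas as pd


def to_nullable_boolean(series):
    """Map common true/false spellings to a nullable boolean column."""
    mapping = {
        "true": True, "yes": True, "y": True, "1": True,
        "false": False, "no": False, "n": False, "0": False,
    }
    # Normalize to lowercase text; missing values are passed through untouched
    normalized = series.map(lambda v: str(v).strip().lower() if pd.notna(v) else v)
    # Values without a mapping become NaN, then <NA> in the 'boolean' dtype
    return normalized.map(mapping).astype("boolean")


raw_flags = pd.Series(["Yes", 1, False, True, "maybe", None], dtype=object)
clean_flags = to_nullable_boolean(raw_flags)
print(clean_flags)
```

Float encodings such as `1.0` are not covered by this mapping and would need their own entries.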

### <a id='toc3_6_'></a>[Handling Mixed Date Types](#toc0_)


Let's create a scenario with mixed date formats:


In [66]:
taxis = sns.load_dataset("taxis")

In [67]:
# Create a new column with mixed date formats
taxis["pickup"] = "1912-04-10"
taxis.loc[0, "pickup"] = "10/04/1912"
taxis.loc[1, "pickup"] = "April 10, 1912"
taxis.loc[2, "pickup"] = np.nan

In [68]:
%pip install dateparser

Note: you may need to restart the kernel to use updated packages.


In [69]:
import dateparser


def convert_to_iso(date_string):
    # Parse the date string
    parsed_date = dateparser.parse(date_string)

    if parsed_date is None:
        return "Invalid date format"

    # Convert to ISO format
    iso_date = parsed_date.isoformat()

    return iso_date

In [70]:
convert_to_iso("April 10, 1912")

'1912-04-10T00:00:00'

In [71]:
def clean_date_column(series):
    # Leave missing values untouched; str(np.nan) is "nan", which should not be parsed as a date
    cleaned_series = series.apply(
        lambda x: convert_to_iso(str(x)) if pd.notna(x) else x
    )

    return cleaned_series

In [72]:
clean_date_column(taxis["pickup"])

0       1912-10-04T00:00:00
1       1912-04-10T00:00:00
2                       NaN
3       1912-04-10T00:00:00
4       1912-04-10T00:00:00
               ...         
6428    1912-04-10T00:00:00
6429    1912-04-10T00:00:00
6430    1912-04-10T00:00:00
6431    1912-04-10T00:00:00
6432    1912-04-10T00:00:00
Name: pickup, Length: 6433, dtype: object

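❗️ **Important Note:** When dealing with mixed date formats, always be cautious about potential ambiguities (e.g., 01/02/2023 could be January 2nd or February 1st depending on the locale). In the output above, `10/04/1912` was read month-first as October 4, 1912, even though every other entry in the column is April 10, 1912.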
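
One way to remove the ambiguity, when the source convention is known, is to state the format explicitly. This sketch uses `pd.to_datetime` on hypothetical values and shows that the same strings yield different dates under the two conventions:

```python
import pandas as pd

# Hypothetical column of slash-separated dates with one missing entry
raw_dates = pd.Series(["10/04/1912", "11/04/1912", None], dtype=object)

# Interpret as day/month/year
day_first = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Interpret as month/day/year
month_first = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(day_first)
print(month_first)
```

Unlike the ISO strings produced earlier, the result here is a true datetime column, and missing entries stay `NaT`.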


### <a id='toc3_7_'></a>[General Approach for Handling Mixed Types](#toc0_)


To handle mixed data types systematically, follow this general approach:
1. **Identify** mixed type columns using the `detect_mixed_types` function.
2. **Investigate** the nature of the mixed types (e.g., legitimate variations, data entry errors).
3. **Clean** the data using appropriate methods based on the intended data type.
4. **Verify** the results to ensure data integrity is maintained.


Identifying and handling mixed data types is a critical step in data preprocessing. It ensures that your data is consistent and ready for analysis, preventing potential errors and unexpected behavior in downstream processes.


💡 **Pro Tip:** Develop a standardized approach for handling mixed data types in your projects. This can save time and ensure consistency across different datasets and analyses.


## <a id='toc4_'></a>[Ensuring Data Integrity During Loading](#toc0_)

Ensuring data integrity during the loading process is crucial for maintaining the quality and reliability of your data analysis. This step involves setting up proper safeguards and checks to prevent data corruption, misinterpretation, or loss during the initial data import.


🔑 **Key Concept:** Proactive measures during data loading can prevent many issues that would otherwise require time-consuming cleanup later in the analysis process.


Let's explore various techniques to ensure data integrity using our Titanic dataset and a few additional examples.


### <a id='toc4_1_'></a>[Specifying Data Types During Loading](#toc0_)


One of the most effective ways to ensure data integrity is to explicitly specify data types when loading data:


In [73]:
import pandas as pd
import numpy as np

# Define dtypes for the columns present in this CSV
dtypes = {
    "Survived": "int64",
    "Pclass": "int64",
    "Name": "object",
    "Sex": "category",
    "Age": "float64",
    "Siblings/Spouses Aboard": "int64",
    "Parents/Children Aboard": "int64",
    "Fare": "float64",
}

# Load the Titanic dataset with specified dtypes
url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
titanic_df = pd.read_csv(url, dtype=dtypes)

titanic_df.dtypes

Survived                      int64
Pclass                        int64
Name                         object
Sex                        category
Age                         float64
Siblings/Spouses Aboard       int64
Parents/Children Aboard       int64
Fare                        float64
dtype: object

💡 **Pro Tip:** Specifying dtypes not only ensures correct data types but can also significantly improve memory usage and loading speed for large datasets.
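
The effect of the `dtype` argument can be checked without a network connection. In this sketch, `io.StringIO` stands in for a file on disk and the CSV content is hypothetical; the same text is loaded once with inferred types and once with explicit, smaller ones.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "Survived,Pclass,Sex,Age\n0,3,male,22\n1,1,female,38\n1,3,female,26\n"

inferred = pd.read_csv(io.StringIO(csv_text))
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Survived": "int8", "Pclass": "int8", "Sex": "category", "Age": "float32"},
)

print(inferred.dtypes)
print(explicit.dtypes)
```

Narrow types such as `int8` are only safe when the full range of values is known in advance. A value that does not fit the declared type can raise an error or be stored incorrectly.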


### <a id='toc4_2_'></a>[Handling Missing Values During Load](#toc0_)


Properly handling missing values during data loading can prevent issues later. By default, `read_csv` interprets the following values as `NaN`: `""` (empty string), `"#N/A"`, `"#N/A N/A"`, `"#NA"`, `"-1.#IND"`, `"-1.#QNAN"`, `"-NaN"`, `"-nan"`, `"1.#IND"`, `"1.#QNAN"`, `"<NA>"`, `"N/A"`, `"NA"`, `"NULL"`, `"NaN"`, `"None"`, `"n/a"`, `"nan"`, `"null"`.


In [74]:
# Load data with custom NA values
titanic_df = pd.read_csv(
    url, dtype=dtypes, na_values=["NA", "Unknown", ""], keep_default_na=True
)

titanic_df.isnull().sum()

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

🤔 **Why This Matters:** Different datasets might represent missing values in various ways. Specifying these upfront ensures consistent handling of missing data.
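The Titanic file above has no missing values, so the effect of `na_values` is easier to see on a small example. The sketch below uses a hypothetical in-memory CSV in which missing ages are written as `Unknown`, `-`, or left empty; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text where missing ages appear as "Unknown", "-", or an empty field
csv_text = """Name,Age,Fare
Alice,29,7.25
Bob,Unknown,8.05
Carol,-,71.28
Dave,,13.00
"""

# With the defaults only the empty field becomes NaN, so Age is read as text
raw_df = pd.read_csv(io.StringIO(csv_text))

# Adding the custom markers lets pandas parse Age as a float column with NaN
clean_df = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["Unknown", "-"],
    keep_default_na=True,
)

print(raw_df["Age"].isnull().sum())
print(clean_df["Age"].isnull().sum())
print(clean_df["Age"].dtype)
```

Without the custom markers, a single stray `Unknown` is enough to turn a numeric column into text, which then breaks arithmetic later in the analysis.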


### <a id='toc4_3_'></a>[Using Parse Functions for Complex Data](#toc0_)


For columns with complex data, such as dates or currency, using parse functions can ensure correct interpretation:


In [75]:
sample_data = pd.DataFrame({
    "Transaction_Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Amount": ["$100,000", "$200,000", "$300,000"]
})

sample_data.to_csv("sample_data.csv", index=False)

In [76]:
# Converter that turns a currency string such as "$100,000" into a float
def parse_currency(currency_str):
    return float(currency_str.replace("$", "").replace(",", ""))


# Reload the sample file, parsing the date column and converting the amounts
sample_data = pd.read_csv(
    "sample_data.csv",
    parse_dates=["Transaction_Date"],
    converters={"Amount": parse_currency},
)

In [77]:
sample_data.dtypes

Transaction_Date    datetime64[ns]
Amount                     float64
dtype: object
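A converter such as `parse_currency` raises `ValueError` on the first value it cannot convert, which aborts the whole load. A more tolerant variant is sketched below; `parse_currency_safe` and the CSV contents are hypothetical names and data introduced only for this illustration.

```python
import io

import numpy as np
import pandas as pd


def parse_currency_safe(currency_str):
    """Convert strings like "$1,200.50" to float; return NaN if that fails."""
    cleaned = currency_str.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return np.nan


# Hypothetical CSV text with one malformed amount
csv_text = """Transaction_Date,Amount
2023-01-01,"$100,000"
2023-01-02,n/a
2023-01-03,"$300,000"
"""

amounts_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Transaction_Date"],
    converters={"Amount": parse_currency_safe},
)

print(amounts_df)
```

Returning `NaN` keeps the load running, but it also hides bad input, so it is worth counting the resulting missing values afterwards.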

### <a id='toc4_4_'></a>[Validating Data Range and Constraints](#toc0_)


Implement checks to ensure data falls within expected ranges or meets certain constraints:


In [78]:
def validate_age(age):
    return 0 <= age <= 120 if pd.notnull(age) else True

def validate_fare(fare):
    return fare >= 0 if pd.notnull(fare) else True

In [79]:
# Apply validations
age_mask = titanic_df['Age'].apply(validate_age)
fare_mask = titanic_df['Fare'].apply(validate_fare)

In [80]:
titanic_df[~age_mask]['Age']

Series([], Name: Age, dtype: float64)

In [81]:
titanic_df[~fare_mask]['Fare']

Series([], Name: Fare, dtype: float64)

❗️ **Important Note:** While these checks help identify issues, decide carefully how to handle violations. Options include setting to NaN, using a default value, or flagging for further investigation.
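Two of these options, flagging and setting to NaN, can be sketched with vectorized operations instead of `apply`. The frame below is illustrative data rather than the Titanic file, and the column names `Age_flagged` and `Age_clean` are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two implausible ages (not Titanic data)
passengers = pd.DataFrame({"Age": [22.0, -3.0, np.nan, 150.0, 38.0]})

# Vectorized equivalent of validate_age: missing values count as valid
age_ok = passengers["Age"].between(0, 120) | passengers["Age"].isna()

# Option 1: flag violations for later review without touching the values
passengers["Age_flagged"] = ~age_ok

# Option 2: replace violations with NaN in a separate column
passengers["Age_clean"] = passengers["Age"].where(age_ok)

print(passengers)
```

Writing the cleaned values to a new column keeps the original data available, which makes the decision easy to audit or reverse.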


### <a id='toc4_5_'></a>[Checking for Duplicate Records](#toc0_)


Identify and handle duplicate records during the loading process:


In [82]:
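# Overwrite the last row with a copy of the first to create a duplicate for this demo
titanic_df.iloc[-1] = titanic_df.iloc[0]

# Check for duplicates (by default only the later copy is marked True)
titanic_df.duplicated()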

0      False
1      False
2      False
3      False
4      False
       ...  
882    False
883    False
884    False
885    False
886     True
Length: 887, dtype: bool

In [83]:
# Keep one row per (Name, Age) pair, preferring the last occurrence; returns a new DataFrame
titanic_df.drop_duplicates(subset=["Name", "Age"], keep="last")

|  | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
| 5 | 0 | 3 | Mr. James Moran | male | 27.0 | 0 | 0 | 8.4583 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 |
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |
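Before dropping anything, it often helps to look at every member of a duplicate group and to count how many rows would be removed. The sketch below does this on a small illustrative frame; the names and values are made up.

```python
import pandas as pd

# Illustrative frame in which the last row repeats the first
records = pd.DataFrame(
    {
        "Name": ["Mr. A", "Mrs. B", "Miss. C", "Mr. A"],
        "Age": [22.0, 38.0, 26.0, 22.0],
        "Fare": [7.25, 71.28, 7.92, 7.25],
    }
)

# keep=False marks every member of a duplicate group, not only the later copies
all_copies = records[records.duplicated(keep=False)]
print(all_copies)

# Count duplicates, then drop them and rebuild a clean 0..n-1 index
n_duplicates = int(records.duplicated().sum())
deduplicated = records.drop_duplicates().reset_index(drop=True)
print(f"Removed {n_duplicates} duplicate row(s); {len(deduplicated)} remain")
```

`drop_duplicates` returns a new DataFrame, so the result has to be assigned for the change to persist.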


### <a id='toc4_6_'></a>[Verifying Categorical Data Integrity](#toc0_)


Ensure categorical columns contain only expected values:


In [84]:
expected_sex_categories = ['male', 'female']
expected_pclass_categories = [1, 2, 3]

In [85]:
titanic_df['Sex'].isin(expected_sex_categories)

0      True
1      True
2      True
3      True
4      True
       ... 
882    True
883    True
884    True
885    True
886    True
Name: Sex, Length: 887, dtype: bool

In [86]:
titanic_df['Pclass'].isin(expected_pclass_categories)

0      True
1      True
2      True
3      True
4      True
       ... 
882    True
883    True
884    True
885    True
886    True
Name: Pclass, Length: 887, dtype: bool
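Scanning a long boolean Series by eye is error-prone, so the result is usually reduced to a single answer plus a list of offenders. A minimal sketch on an illustrative column with deliberately inconsistent labels:

```python
import pandas as pd

# Illustrative column containing one inconsistent label and one typo
sex = pd.Series(["male", "female", "Female", "male", "fmale"])
expected_sex_categories = ["male", "female"]

is_expected = sex.isin(expected_sex_categories)

# A single boolean answers "is the whole column clean?"
print(is_expected.all())

# The offending labels and how often each occurs
unexpected_counts = sex[~is_expected].value_counts()
print(unexpected_counts)
```

Note that `isin` treats missing values as unexpected unless `NaN` is included in the list, so missing data may need to be handled separately.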

### <a id='toc4_7_'></a>[Implementing Custom Validation Functions](#toc0_)


Create custom functions to validate complex business rules or data relationships:


In [87]:
titanic_df

|  | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.0000 |
| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.0000 |
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.4500 |
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.0000 |


In [88]:
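import seaborn as sns

# Switch to seaborn's copy of the Titanic data, which uses lowercase
# column names such as 'age' and 'parch'
titanic_df = sns.load_dataset("titanic")

def validate_child_parent(row):
    # Rule: a passenger aged 18 or under should have at least one parent aboard (parch > 0)
    if pd.notnull(row['age']) and row['age'] <= 18:
        return row['parch'] > 0
    return True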

child_parent_validity = titanic_df.apply(validate_child_parent, axis=1)
print("Number of child-parent relationship violations:", (~child_parent_validity).sum())


Number of child-parent relationship violations: 50
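Row-wise `apply` is easy to read but slow on large frames. The same rule can be expressed with boolean masks, as sketched below on illustrative rows that reuse the seaborn-style column names `age` and `parch`; the values are made up.

```python
import numpy as np
import pandas as pd

# Illustrative rows using the seaborn-style column names 'age' and 'parch'
people = pd.DataFrame(
    {
        "age": [5.0, 17.0, 40.0, np.nan, 10.0],
        "parch": [1, 0, 0, 0, 0],
    }
)

# A row violates the rule when age is known, at most 18, and parch is not positive.
# Comparisons with NaN evaluate to False, so unknown ages are never flagged.
is_child = people["age"] <= 18
violations = is_child & (people["parch"] <= 0)

print("Number of child-parent relationship violations:", int(violations.sum()))
print(people[violations])
```

A flagged row is a candidate for review rather than proof of an error, since a minor can travel without a parent.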


### <a id='toc4_8_'></a>[Logging Data Loading Process](#toc0_)


Implement logging to keep track of the data loading process and any issues encountered:


In [89]:
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def load_data_with_logging(url, dtypes):
    logging.info(f"Starting to load data from {url}")
    try:
        df = pd.read_csv(url, dtype=dtypes)
        logging.info(f"Data loaded successfully. Shape: {df.shape}")
        return df
    except Exception as e:
        logging.error(f"Error loading data: {str(e)}")
        return None

In [90]:
titanic_df = load_data_with_logging(url, dtypes)

2024-09-29 00:50:24,986 - INFO - Starting to load data from https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
2024-09-29 00:50:25,430 - INFO - Data loaded successfully. Shape: (887, 8)


In [91]:
# using loguru
from loguru import logger

logger.add("data_loading.log")

def load_data_with_loguru(url, dtypes):
    logger.info(f"Starting to load data from {url}")
    try:
        df = pd.read_csv(url, dtype=dtypes)
        logger.info(f"Data loaded successfully. Shape: {df.shape}")
        return df
    except Exception as e:
        logger.error(f"Error loading data: {str(e)}")
        return None

In [92]:
titanic_df = load_data_with_loguru(url, dtypes)

loguru prints each record to stderr in its own default format and, because of `logger.add("data_loading.log")`, also appends it to `data_loading.log`.


💡 **Pro Tip:** Logging is invaluable for debugging and auditing your data loading process, especially when dealing with automated pipelines or large datasets.
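Beyond recording that a load succeeded, the log can also capture basic quality figures for the loaded frame. The helper below is a hypothetical sketch using the standard `logging` module; the name `log_quality_summary`, the logger name, and the demo data are assumptions made for this example.

```python
import logging

import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
quality_logger = logging.getLogger("data_loading")


def log_quality_summary(df):
    """Log shape, duplicate count, and per-column missing counts; return the counts."""
    missing = df.isnull().sum()
    quality_logger.info("Shape: %s", df.shape)
    quality_logger.info("Duplicate rows: %d", int(df.duplicated().sum()))
    for column, count in missing[missing > 0].items():
        # WARNING makes columns with gaps easy to find in a long log
        quality_logger.warning("Column %r has %d missing value(s)", column, int(count))
    return missing


# Illustrative data, not the Titanic file
demo_df = pd.DataFrame({"Age": [22.0, np.nan, 35.0], "Fare": [7.25, 8.05, 53.1]})
missing_counts = log_quality_summary(demo_df)
```

Calling a helper like this right after each load leaves a record of how data quality changes from one run to the next.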


Ensuring data integrity during loading is a critical step in the data analysis process. These checks act as a first line of defense: by catching data quality issues at the earliest stage, they save time, prevent errors in subsequent analysis steps, and ensure that your analysis rests on reliable, consistent data from the start.


## <a id='toc5_'></a>[Summary](#toc0_)

In this lecture, we've explored the critical process of examining data types and structures, a fundamental step in any data analysis project. Let's recap the key points and their significance in the data science workflow:

1. **Understanding Data Types**: We learned about various data types in Pandas and NumPy, including numeric (int, float), categorical, datetime, and boolean types. Each type has specific characteristics and use cases in data analysis.

2. **Importance of Data Type Verification**: Regularly checking data types helps prevent errors in calculations, ensures proper data handling, and optimizes memory usage.

3. **Handling Mixed Data Types**: We explored techniques to identify and resolve issues with columns containing mixed data types, which can lead to unexpected behavior in analyses.

4. **Data Integrity During Loading**: We discussed strategies to ensure data integrity right from the loading phase, including specifying dtypes, handling missing values, and implementing validation checks.

5. **Practical Techniques**: Throughout the lecture, we applied various methods such as `dtype` checks, `info()`, `describe()`, and custom functions for data examination and cleaning.


Understanding data types and structures is crucial because:

- It prevents errors in data manipulation and analysis.
- It ensures efficient memory usage, especially for large datasets.
- It guides the choice of appropriate analytical methods and visualizations.
- It helps in identifying and addressing data quality issues early in the analysis process.


Pro tips for data type examination:

1. Always start your analysis by examining data types and structures.
2. Develop a systematic approach to data type verification as part of your workflow.
3. Use appropriate data types to optimize memory usage and computation speed.
4. Be cautious with automatic type inference; explicitly specify types when possible.
5. Implement data validation checks during the loading process.
6. Document any data type changes or cleaning processes for reproducibility.


Understanding and properly managing data types and structures sets the foundation for all subsequent data analysis tasks. In the upcoming lectures, we'll build on this knowledge to perform more advanced data preprocessing, feature engineering, and analysis techniques.


❗️ **Important Note:** Remember, the quality of your analysis is only as good as the quality of your data. Proper examination and handling of data types is a critical first step in ensuring high-quality, reliable results.


🔑 **Key Concept:** Data type examination is not a one-time task but an ongoing process throughout your data analysis project. Always be vigilant about how your data types might change as you manipulate and analyze your data.


By mastering these concepts and techniques, you're well-equipped to handle diverse datasets and set a solid foundation for your data science projects. Keep practicing these skills, as they will serve you well in all your future data analysis endeavors.