<img src="./images/banner.png" width="800">

# Examining Data Types and Structures

In the world of data analysis, understanding data types is crucial for effectively processing, analyzing, and interpreting information. Data types are categories that specify which kind of value a variable can hold. They play a fundamental role in how data is stored, manipulated, and analyzed in various programming languages and data analysis tools.


🔑 **Key Concept:** Data types define the nature of data and determine how it can be used in computations and analyses.


Understanding data types is essential for several reasons:

1. **Memory Allocation:** Different data types require different amounts of memory. Knowing your data types helps in efficient memory management.

2. **Operation Compatibility:** Certain operations are only valid for specific data types. For example, arithmetic such as division works on numeric data types but raises an error on strings.

3. **Data Integrity:** Proper data typing helps maintain the integrity of your data, ensuring that values are stored and processed correctly.

4. **Performance Optimization:** Using the right data types can significantly improve the performance of your data processing and analysis tasks.
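
The first two points can be illustrated with a minimal sketch. The values and variable names below are made up for demonstration only:

```python
import numpy as np

# Memory: the same 1,000 values stored with two integer widths.
small = np.zeros(1000, dtype=np.int8)
large = np.zeros(1000, dtype=np.int64)
small_bytes = small.nbytes   # 1 byte per value
large_bytes = large.nbytes   # 8 bytes per value

# Operation compatibility: adding a number to a string is rejected.
try:
    "42" + 1
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(small_bytes, large_bytes, mixing_allowed)
```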


In data analysis, we commonly encounter the following basic data types:

1. **Numeric Types:**
   - Integers (e.g., 1, 100, -5)
   - Floating-point numbers (e.g., 3.14, -0.001)

2. **String Type:**
   - Text data (e.g., "Hello, World!", "Data Analysis")

3. **Boolean Type:**
   - True/False values

4. **Date and Time Types:**
   - Representing dates, times, or both


When working with Pandas and NumPy, you'll encounter more specific data types:

In [1]:
import numpy as np
import pandas as pd

# NumPy data types
np_int = np.int64(42)
np_float = np.float64(3.14)

# Pandas data types
pd_series = pd.Series([1, 2, 3])
pd_series.dtype

dtype('int64')

💡 **Pro Tip:** Always check the data types of your columns when you load a dataset. This can be done easily in Pandas using the `dtypes` attribute:


```python
df = pd.read_csv('your_dataset.csv')
print(df.dtypes)
```
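
Checking `dtypes` right after loading matters because Pandas infers types from the file contents. The sketch below assumes a small, hypothetical CSV (the column names are invented) in which a zip code loses its leading zero unless a type is requested through the `dtype` parameter of `read_csv`:

```python
import io
import pandas as pd

# Hypothetical CSV content; 'zip_code' has a leading zero that matters.
csv_text = "name,zip_code,score\nAda,02134,9.5\nBob,10001,7.0\n"

# Let Pandas infer the types: zip_code is read as an integer.
inferred = pd.read_csv(io.StringIO(csv_text))

# Ask for text explicitly: the leading zero is preserved.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred.dtypes)
print(explicit["zip_code"].tolist())
```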


The choice of data type can significantly impact your analysis:

1. **Numerical Calculations:** Using the wrong numeric type (e.g., int instead of float) can lead to loss of precision or unexpected results.

2. **Memory Usage:** Large datasets with inefficient data types can consume excessive memory.

3. **Processing Speed:** Optimized data types can dramatically speed up data processing and analysis tasks.
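
As a small illustration of the first point, the sketch below (with made-up values) stores the same number in an integer array and in a float array. NumPy keeps the array's existing type, so the integer array drops the fractional part without raising an error:

```python
import numpy as np

counts = np.array([1, 2, 3])                      # integer array
counts[0] = 2.7                                   # fractional part is dropped silently

values = np.array([1, 2, 3], dtype=np.float64)    # float array
values[0] = 2.7                                   # stored as a float

print(counts[0], values[0])
```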


Mismatched data types can lead to errors or unexpected behavior in your analysis. Always ensure your data types are appropriate for your intended operations.


Understanding data types is the foundation of effective data analysis. As we progress through this lecture, we'll delve deeper into how to examine, verify, and manipulate data types to ensure your data is in the best possible shape for analysis.


Properly managing data types from the outset of your analysis can save you significant time and effort later, preventing errors and ensuring the accuracy of your results.

**Table of contents**<a id='toc0_'></a>    
- [Common Data Types in Pandas and NumPy](#toc1_)    
  - [Numeric Data Types](#toc1_1_)    
  - [String Data Types](#toc1_2_)    
  - [Boolean Data Type](#toc1_3_)    
  - [DateTime Data Types](#toc1_4_)    
  - [Categorical Data Type](#toc1_5_)    
  - [Special Data Types](#toc1_6_)    
  - [Checking Data Types](#toc1_7_)    
- [Examining and Verifying Data Types](#toc2_)    
  - [Initial Data Type Inspection](#toc2_1_)    
  - [Detailed Data Type Information](#toc2_2_)    
  - [Examining Specific Columns](#toc2_3_)    
  - [Verifying Numeric Columns](#toc2_4_)    
  - [Checking Unique Values in Categorical Columns](#toc2_5_)    
  - [Identifying Mixed Data Types](#toc2_6_)    
  - [Verifying Date Columns](#toc2_7_)    
  - [Using `select_dtypes` for Type-Based Selection](#toc2_8_)    
  - [Conclusion](#toc2_9_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Common Data Types in Pandas and NumPy](#toc0_)

Pandas and NumPy are fundamental libraries in Python for data analysis, each with its own set of data types. Understanding these types is crucial for efficient data manipulation and analysis.


### <a id='toc1_1_'></a>[Numeric Data Types](#toc0_)


Both Pandas and NumPy offer a range of numeric data types:

1. **Integers:**
   - `int8`, `int16`, `int32`, `int64`: Signed integers of varying sizes
   - `uint8`, `uint16`, `uint32`, `uint64`: Unsigned integers

2. **Floating-Point Numbers:**
   - `float16`, `float32`, `float64`: Floating-point numbers of different precisions
   - `float64` is also known as `double`


In [6]:
import numpy as np
import pandas as pd

In [7]:
# NumPy examples
np.array([1, 2, 3], dtype=np.int32)

array([1, 2, 3], dtype=int32)

In [8]:
np.array([1.0, 2.0, 3.0], dtype=np.float64)

array([1., 2., 3.])

In [9]:
# Pandas examples
pd.Series([1, 2, 3], dtype='int32')

0    1
1    2
2    3
dtype: int32

In [10]:
pd.Series([1.0, 2.0, 3.0], dtype='float64')

0    1.0
1    2.0
2    3.0
dtype: float64
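
To see what "varying sizes" and "different precisions" mean in practice, one option is to ask NumPy directly with `np.iinfo` and `np.finfo`. This is only a quick inspection sketch, not an exhaustive list of types:

```python
import numpy as np

# Value range of several integer types.
for int_type in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(int_type)
    print(f"{info.dtype}: {info.min} to {info.max}")

# Approximate number of reliable decimal digits for each float type.
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{info.dtype}: about {info.precision} decimal digits")
```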

### <a id='toc1_2_'></a>[String Data Types](#toc0_)


Strings are handled differently in NumPy and Pandas:

1. **NumPy:**
   - `dtype='U'`: Unicode string
   - `dtype='S'`: Byte string

2. **Pandas:**
   - `object`: Default type for string data
   - `string`: Pandas extension type for strings (introduced in newer versions)


In [4]:
# NumPy string array
np.array(['apple', 'banana', 'cherry'], dtype='U')

array(['apple', 'banana', 'cherry'], dtype='<U6')

In [5]:
# Pandas string series
pd.Series(['apple', 'banana', 'cherry'], dtype='string')

0     apple
1    banana
2    cherry
dtype: string
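
The following sketch compares the inferred dtype with the `string` extension type on a small, made-up Series containing a missing value. Depending on the Pandas version, the inferred dtype is typically `object`; both versions support the `.str` accessor:

```python
import pandas as pd

fruit_inferred = pd.Series(["apple", "banana", None])                 # dtype chosen by Pandas
fruit_string = pd.Series(["apple", "banana", None], dtype="string")   # explicit extension type

print(fruit_inferred.dtype, fruit_string.dtype)

# String methods are available through the .str accessor.
upper = fruit_string.str.upper()
print(upper)

# Missing entries are reported by isna() in both cases.
print(fruit_string.isna().tolist())
```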

### <a id='toc1_3_'></a>[Boolean Data Type](#toc0_)


Both libraries support boolean data:

- `bool`: True or False values


In [11]:
np.array([True, False, True])

array([ True, False,  True])

In [12]:
pd.Series([True, False, True], dtype='bool')

0     True
1    False
2     True
dtype: bool

### <a id='toc1_4_'></a>[DateTime Data Types](#toc0_)


Handling dates and times is crucial in data analysis:

1. **NumPy:**
   - `datetime64`: Represents dates and times

2. **Pandas:**
   - `datetime64[ns]`: Nanosecond precision for datetime
   - `timedelta64[ns]`: For time differences


In [13]:
# NumPy datetime
np.array(['2023-01-01', '2023-01-02'], dtype='datetime64')

array(['2023-01-01', '2023-01-02'], dtype='datetime64[D]')

In [14]:
# Pandas datetime
pd.to_datetime(['2023-01-01', '2023-01-02'])

DatetimeIndex(['2023-01-01', '2023-01-02'], dtype='datetime64[ns]', freq=None)
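
Time differences arise naturally when one datetime Series is subtracted from another. A minimal sketch with invented dates and hypothetical variable names:

```python
import pandas as pd

boarded = pd.Series(pd.to_datetime(["2023-01-01", "2023-01-02"]))
arrived = pd.Series(pd.to_datetime(["2023-01-05", "2023-01-03"]))

# Subtracting datetimes yields a timedelta Series.
duration = arrived - boarded
print(duration.dtype)
print(duration.dt.days.tolist())
```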

### <a id='toc1_5_'></a>[Categorical Data Type](#toc0_)


Pandas offers a special type for categorical data:

- `category`: Efficient storage and processing for repeated string values


In [15]:
pd.Series(['A', 'B', 'A', 'C', 'B', 'A'], dtype='category')


0    A
1    B
2    A
3    C
4    B
5    A
dtype: category
Categories (3, object): ['A', 'B', 'C']

💡 **Pro Tip:** Using the `category` type for columns with repeated string values can significantly reduce memory usage and improve performance. The benefit is largest when the number of distinct values is small relative to the number of rows.
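
One way to check the memory claim is to compare `memory_usage(deep=True)` before and after conversion. The sketch below uses a synthetic Series with three distinct labels; the exact byte counts depend on the Pandas version, so none are quoted here:

```python
import pandas as pd

labels = pd.Series(["A", "B", "C"] * 10_000)   # 30,000 rows, 3 distinct values
as_category = labels.astype("category")

# deep=True includes the memory used by the string objects themselves.
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
```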


### <a id='toc1_6_'></a>[Special Data Types](#toc0_)


1. **NumPy:**
   - `complex`: For complex numbers

2. **Pandas:**
   - `Int64`, `Float64`: Nullable integer and float types
   - `Sparse`: For efficient storage of sparse data (for example, `Sparse[int]`)


In [16]:
# NumPy complex number
np.array([1+2j, 3+4j])

array([1.+2.j, 3.+4.j])

In [17]:
# Pandas nullable integer
pd.Series([1, 2, None], dtype='Int64')

0       1
1       2
2    <NA>
dtype: Int64
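
The reason nullable types exist becomes clearer when the same data is created without an explicit dtype. In this small sketch, the default integer type cannot hold a missing value, so Pandas falls back to `float64`:

```python
import pandas as pd

default_series = pd.Series([1, 2, None])                    # falls back to float64 with NaN
nullable_series = pd.Series([1, 2, None], dtype="Int64")    # stays integer with <NA>

print(default_series.dtype, nullable_series.dtype)
print(nullable_series.sum())                                # missing values are skipped
```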

### <a id='toc1_7_'></a>[Checking Data Types](#toc0_)


You can easily check the data types of your NumPy arrays or Pandas Series/DataFrames:


```python
# For NumPy
print(np_int.dtype)

# For Pandas
print(pd_series.dtype)
print(df.dtypes)  # For a DataFrame
```


🤔 **Why This Matters:** Choosing the right data type can significantly impact memory usage and computation speed, especially when dealing with large datasets.


❗️ **Important Note:** Be aware of the trade-offs between precision and memory usage when selecting numeric data types. For instance, `float32` uses 4 bytes per value and keeps roughly 7 significant decimal digits, while `float64` uses 8 bytes and keeps roughly 15 to 16.
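
A concrete case of this trade-off, sketched with a single illustrative value: the integer 2**24 + 1 cannot be represented exactly in `float32`, while `float64` stores it without loss.

```python
import numpy as np

big = 16_777_217                  # 2**24 + 1
as_float32 = np.float32(big)      # rounded to the nearest representable value
as_float64 = np.float64(big)      # stored exactly

print(int(as_float32), int(as_float64))
print(np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize)   # bytes per value
```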


Understanding these common data types in Pandas and NumPy is essential for effective data manipulation and analysis. In the next section, we'll explore how to examine and verify these data types in your datasets.

## <a id='toc2_'></a>[Examining and Verifying Data Types](#toc0_)


In this section, we'll explore practical methods for examining and verifying data types using a real-world dataset. We'll use the famous Titanic dataset, which contains various data types and is excellent for demonstrating different aspects of data type examination.


🔑 **Key Concept:** Regularly examining and verifying data types is crucial for ensuring data integrity and preventing unexpected behavior in your analyses.


Let's start by loading the Titanic dataset:


In [21]:
import pandas as pd
import numpy as np
import seaborn as sns

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

titanic_df

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


### <a id='toc2_1_'></a>[Initial Data Type Inspection](#toc0_)


The first step in examining data types is to get an overview of the types in your dataset:


In [22]:
titanic_df.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

💡 **Pro Tip:** Always start your analysis by checking the data types. This quick overview can reveal potential issues or areas that need attention.


### <a id='toc2_2_'></a>[Detailed Data Type Information](#toc0_)


For a more comprehensive view of your data, including data types and non-null counts:


In [23]:
titanic_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


### <a id='toc2_3_'></a>[Examining Specific Columns](#toc0_)


To dive deeper into specific columns:


In [25]:
titanic_df['age'].dtype

dtype('float64')

In [27]:
titanic_df['sex'].dtype

dtype('O')
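
Reading `dtype('O')` by eye works for one column, but the helpers in `pandas.api.types` allow the same check in code. The sketch below uses a small stand-in DataFrame (not the real Titanic data) with the same two column names:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small stand-in frame; the values are illustrative only.
sample_df = pd.DataFrame({
    "age": [22.0, 38.0, None],
    "sex": ["male", "female", "female"],
})

# Report each column's dtype and whether Pandas treats it as numeric.
for column in sample_df.columns:
    print(column, sample_df[column].dtype, is_numeric_dtype(sample_df[column]))
```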

### <a id='toc2_4_'></a>[Verifying Numeric Columns](#toc0_)


For numeric columns, it's useful to check basic statistics:


In [29]:
titanic_df['age'].describe()


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

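The output shows statistics such as count, mean, std, min, and max, which help verify whether the data type is appropriate and whether there are outliers. Here the count (714) is lower than the number of rows (891), which signals missing values in `age`.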


Potential Issues:

- Missing values represented as -1 or other placeholder values.
- Overflow for very large numbers.


In [31]:
# Check for placeholder values
print(titanic_df['sibsp'].value_counts())

sibsp
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: count, dtype: int64


In [34]:
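# Replace a -1 placeholder with a missing value if needed
# (the counts above show no -1 in sibsp, so this is a no-op here)
titanic_df['sibsp'] = titanic_df['sibsp'].replace(-1, pd.NA)

In [35]:
# Use Int64 (capital "I") instead of int64 so the column can hold missing values
titanic_df['sibsp'] = titanic_df['sibsp'].astype('Int64')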
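
Both potential issues can be reproduced on toy data. The sketch below is illustrative only: the `siblings` Series is hypothetical and, unlike the real `sibsp` column, does contain a -1 placeholder.

```python
import numpy as np
import pandas as pd

# Overflow: int8 tops out at 127, so adding 1 wraps around to -128.
wrapped = np.array([127], dtype=np.int8) + np.int8(1)

# Placeholder: a hypothetical column that uses -1 for "unknown".
siblings = pd.Series([0, 1, -1, 2], dtype="Int64")
cleaned = siblings.mask(siblings == -1)    # -1 becomes <NA>

# The placeholder distorts the mean until it is treated as missing.
print(wrapped, siblings.mean(), cleaned.mean())
```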

### <a id='toc2_5_'></a>[Checking Unique Values in Categorical Columns](#toc0_)


For categorical columns, examining unique values is crucial:


In [37]:
titanic_df['sex'].unique()

array(['male', 'female'], dtype=object)

In [38]:
titanic_df['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

🤔 **Why This Matters:** This check can reveal unexpected categories or potential data quality issues, such as the missing values (`nan`) in `embarked`.


### <a id='toc2_6_'></a>[Identifying Mixed Data Types](#toc0_)


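Sometimes, columns might contain mixed data types. Let's create an example by introducing a string into a copy of the `age` column:


```python
# Work on a copy so the original 'age' column stays numeric
mixed_df = titanic_df.copy()
mixed_df['age'] = mixed_df['age'].astype(object)  # allow mixed values
mixed_df.loc[0, 'age'] = 'Unknown'
print(mixed_df['age'].dtype)
print(mixed_df['age'].head())
```


Output:
```
object
0    Unknown
1       38.0
2       26.0
3       35.0
4       35.0
Name: age, dtype: object
```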


❗️ **Important Note:** Mixed data types can lead to unexpected behavior in analyses. Always check for and handle mixed types appropriately.
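
An `object` dtype alone does not say whether a column is mixed. One possible way to check, sketched here on a hypothetical Series rather than the Titanic data, is to count the Python type of each element:

```python
import pandas as pd

# Hypothetical column mixing text with numbers.
ages = pd.Series(["Unknown", 38.0, 26.0, 35.0], dtype=object)

# Count how many elements there are of each Python type.
type_counts = ages.map(lambda value: type(value).__name__).value_counts()
print(type_counts)

# More than one type name means the column is mixed.
is_mixed = len(type_counts) > 1
print(is_mixed)
```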


### <a id='toc2_7_'></a>[Verifying Date Columns](#toc0_)


The Titanic dataset doesn't include date columns, but let's add one for demonstration:


```python
titanic_df['Date'] = pd.date_range(start='1912-04-14', periods=len(titanic_df))
print(titanic_df['Date'].dtype)
print(titanic_df['Date'].head())
```


Output:
```
datetime64[ns]
0   1912-04-14
1   1912-04-15
2   1912-04-16
3   1912-04-17
4   1912-04-18
Name: Date, dtype: datetime64[ns]
```
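
In practice, date columns often arrive as text. A common way to verify them, sketched below on made-up strings, is to parse with `pd.to_datetime(..., errors="coerce")` and look for `NaT`, which marks entries that could not be parsed:

```python
import pandas as pd

raw_dates = pd.Series(["1912-04-14", "1912-04-15", "not a date"])

# Entries that cannot be parsed become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed.isna().tolist())
print(parsed.min(), parsed.max())   # a quick range check on the valid dates
```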


### <a id='toc2_8_'></a>[Using `select_dtypes` for Type-Based Selection](#toc0_)


To select columns of specific types:


```python
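# 'number' matches every integer and float dtype, including nullable ones
numeric_columns = titanic_df.select_dtypes(include='number')
# Text-like columns are stored as either object or category in this dataset
categorical_columns = titanic_df.select_dtypes(include=['object', 'category'])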

print("Numeric columns:", numeric_columns.columns)
print("Categorical columns:", categorical_columns.columns)
```
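
`select_dtypes` also accepts an `exclude` argument. The sketch below uses a small stand-in DataFrame (the values are illustrative, not the full Titanic data) to show that boolean and categorical columns are not counted as numeric:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "fare": [7.25, 71.28],
    "pclass": [3, 1],
    "alone": [False, True],
    "deck": pd.Categorical(["C", "B"]),
})

# Split the columns into numeric and everything else.
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, non_numeric_cols)
```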


### <a id='toc2_9_'></a>[Conclusion](#toc0_)


Examining and verifying data types is a critical step in the data analysis process. It helps ensure data integrity, prevents errors in calculations, and guides decisions on data preprocessing steps.


💡 **Pro Tip:** Make data type examination a regular part of your data analysis workflow. It's easier to catch and correct issues early in the process.


In the next section, we'll explore how to handle mixed data types and perform necessary conversions to prepare our data for analysis.