In [1]:
import pandas as pd

# Setting the Index Column When Reading a CSV into a DataFrame

In [2]:
brics = pd.read_csv(r"D:\Courses\DataCamp\Associate Data Scientist in Python\Python\brics.csv", index_col=0) # OR: index_col=["col0_name"]
brics

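|    | country | capital | area | population |
|----|---------|---------|------|------------|
| BR | Brazil | Brasilia | 8.516 | 200.4 |
| RU | Russia | Moscow | 17.1 | 143.5 |
| IN | India | New Delhi | 3.286 | 1252.0 |
| CH | China | Beijing | 9.597 | 1357.0 |
| SA | South Africa | Pretoria | 1.221 | 52.98 |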
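
The effect of `index_col` can be tried without the original file. The sketch below assumes a small in-memory CSV (three of the rows shown above, with a blank header for the first column) as a stand-in for `brics.csv`; the names `csv_text`, `brics_a`, and `brics_b` are made up for illustration. It also shows `set_index` as one possible alternative when the file has already been loaded.

```python
import io
import pandas as pd

# In-memory stand-in for brics.csv (an assumption; the real file is not reproduced here)
csv_text = """,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
"""

# Option 1: choose the index column while reading
brics_a = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read first, then promote a column to the index.
# pandas names a header-less column "Unnamed: 0".
brics_b = pd.read_csv(io.StringIO(csv_text)).set_index("Unnamed: 0")
brics_b.index.name = None  # drop the placeholder name

# Row labels can now be used for lookups
print(brics_a.loc["RU", "capital"])
print(brics_a.equals(brics_b))
```

Setting the index while reading avoids the extra `set_index` step and the leftover `Unnamed: 0` name.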


# The possible data types you can use for Series in pandas
In pandas, the `dtype` argument (accepted by constructors such as `pd.Series` and `pd.DataFrame`, and by `astype`) sets the data type of a variable (column). The main data types you can use are:

1. **Numeric Types**:
	- `int64`: 64-bit signed integer
	- `int32`: 32-bit signed integer
	- `int16`: 16-bit signed integer
	- `int8`: 8-bit signed integer
	- `uint64`: 64-bit unsigned integer
	- `uint32`: 32-bit unsigned integer
	- `uint16`: 16-bit unsigned integer
	- `uint8`: 8-bit unsigned integer
	- `float64`: 64-bit floating point
	- `float32`: 32-bit floating point

2. **Boolean Type**:
	- `bool`: Boolean values (`True` or `False`)

3. **Object Type**:
	- `object`: General-purpose type for arbitrary Python objects, often used for strings

4. **String Type**:
	- `string`: Dedicated type for text data (`pd.StringDtype`, pandas 1.0.0+), a more specific alternative to `object`. In pandas versions before 3.0, passing `str` as the dtype produces `object` instead.

5. **Datetime and Timedelta Types**:
	- `datetime64[ns]`: Nanosecond precision datetime
	- `datetime64[ns, tz]`: Nanosecond precision datetime with time zone
	- `timedelta64[ns]`: Nanosecond precision timedelta

6. **Categorical Type**:
	- `category`: For categorical data with a fixed number of possible values

7. **Complex Types**:
	- `complex128`: 128-bit complex number
	- `complex64`: 64-bit complex number

8. **Nullable Integer Types** (pandas 0.24.0+):
	- `Int64`: 64-bit signed integer with NA support
	- `Int32`: 32-bit signed integer with NA support
	- `Int16`: 16-bit signed integer with NA support
	- `Int8`: 8-bit signed integer with NA support
	- `UInt64`: 64-bit unsigned integer with NA support
	- `UInt32`: 32-bit unsigned integer with NA support
	- `UInt16`: 16-bit unsigned integer with NA support
	- `UInt8`: 8-bit unsigned integer with NA support

9. **Nullable Boolean Type** (pandas 1.0.0+):
	- `boolean`: Boolean values with NA support
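
As a minimal sketch of how these names are used in practice, the example below converts the columns of a small DataFrame with `astype`. The column names and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    "age": [25, 32, 47],
    "score": [88.5, 92.0, 79.25],
    "group": ["a", "b", "a"],
    "visits": [3, None, 7],  # the missing value makes this column float64 at first
})

# Convert several columns at once with a column -> dtype mapping
df = df.astype({
    "age": "int8",
    "score": "float32",
    "group": "category",
    "visits": "Int64",  # nullable integer keeps the missing value as <NA>
})
print(df.dtypes)

# dtype can also be set when a Series is created
flags = pd.Series([True, False, None], dtype="boolean")
print(flags.dtype)
```

Smaller numeric types such as `int8` reduce memory use, but they only fit when every value is within the type's range.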

### Some Clarifications
-   **Int64 vs. int64**
	-   ```python
		import numpy as np

		# Using int64 (no support for NA)
		try:
			df = pd.DataFrame({'a': [1, 2, 3, np.nan]}, dtype='int64')
		except ValueError as e:
			print(e)  # Recent pandas versions raise here: NaN cannot be cast to an integer
		# Using Int64 (with support for NA)
		df = pd.DataFrame({'a': [1, 2, 3, pd.NA]}, dtype='Int64')
		```

-   **timedelta64[ns]:** Nanosecond precision timedelta
	-   This data type represents differences in time (durations) with nanosecond precision.

	-   ```python
		df = pd.DataFrame({'start': pd.to_datetime(['2023-01-01', '2023-01-02']),
						'end': pd.to_datetime(['2023-01-03', '2023-01-05'])})
		df['duration'] = df['end'] - df['start']
		print(df['duration'].dtype)  # This will show 'timedelta64[ns]'
		```

-   **string vs. object**
	- **string**: This is a dedicated data type for text data (strings), available since pandas 1.0.0. It explicitly indicates that the column contains text data, allowing for **potentially** more optimized operations on text.
	
	- **object**: This is a more general-purpose data type that can hold any Python object, including strings, numbers, lists, dictionaries, etc. When a column is set to `object` type, pandas doesn't assume anything about the content, which can make operations less efficient compared to more specific types.
	
	-   **Example 1: Optimized Operations**
		```python
		df_object = pd.DataFrame({'text_column': ['apple', 'banana', 'cherry']}, dtype='object')
		df_string = pd.DataFrame({'text_column': ['apple', 'banana', 'cherry']}, dtype='string')

		print(df_object['text_column'].dtype)  # Outputs: object
		print(df_string['text_column'].dtype)  # Outputs: string

		# Using a string method on both columns
		df_object['text_upper'] = df_object['text_column'].str.upper()
		df_string['text_upper'] = df_string['text_column'].str.upper()

		# Both will work, but the `string` dtype might be more efficient
		print(df_object)
		print(df_string)
		```

	-   **Example 2: Consistency and Type Safety**
		```python
		# Creating a DataFrame with object dtype (mixed types)
		df_mixed = pd.DataFrame({'mixed_column': ['apple', 123, 'cherry']}, dtype='object')
		# Creating a DataFrame with string dtype (only text data)
		df_text = pd.DataFrame({'text_column': ['apple', 'banana', 'cherry']}, dtype='string')

		# Attempting to insert a non-string into the string column raises an error
		try:
			df_text.loc[1, 'text_column'] = 123
		except (TypeError, ValueError) as e:
			print(e)  # Outputs an error because 123 is not a string
		```

# Data Type, Data Range, Uniqueness, and Membership Constraints

It's important to check whether your data violates any constraints.

For example:

-	Data type constraints: numeric data should be `int` or `float`, NOT `object`

-	Data range constraints: a `user_signup` date shouldn't be later than today's date (it shouldn't be in the future)

-	Uniqueness constraints: there shouldn't be fully duplicated rows in your dataset (rows identical in all attributes)

-	Membership constraints: a `blood_type` column shouldn't contain the value 'Z+' (it should contain only ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-'])

---

# Miscellaneous 

In [None]:
pd.DataFrame().where
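
The cell above only references the `where` method. As a minimal sketch of what it does, assuming a small made-up DataFrame: `DataFrame.where` keeps the values where a condition is `True` and replaces the others.

```python
import pandas as pd

# Made-up numbers for illustration
df = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

# Keep values where the condition holds; replace the rest with `other`
non_negative = df.where(df >= 0, other=0)

# Without `other`, the replaced values become NaN
masked = df.where(df >= 0)

print(non_negative)
print(masked)
```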