
- Pandas integrates indexes into its data structures, leveraging the performance of NumPy arrays.
- This integration enhances flexibility and simplifies operations through internal references like labels.

- Key Functionalities:

    - Reindexing: Adjust or reorder the index of a Series or DataFrame.
    - Dropping: Remove specific indexes or labels from a structure.
    - Alignment: Automatically align data during operations based on indexes.

# Reindexing

What is Reindexing?

- Reindexing allows you to create a new Series or DataFrame with a modified order of labels or index.
- It provides flexibility to restructure data while maintaining consistency with new indexing rules.

In [None]:
import pandas as pd
ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])
print(ser)
# one      2
# two      5
# three    7
# four     4
# dtype: int64
new_ser = ser.reindex(['four', 'one', 'three', 'two'])
print(new_ser)
# four     4
# one      2
# three    7
# two      5
# dtype: int64

one      2
two      5
three    7
four     4
dtype: int64
four     4
one      2
three    7
two      5
dtype: int64


**Summary**

Reindexing with reindex()

**Purpose**:
- Rearranges the order of indexes in a Series or DataFrame.
- Allows adding or removing labels.

**How It Works:**
- Creates a new object based on a new sequence of labels.
- For missing labels, pandas assigns NaN as their value.

**Capabilities**:
- Change the order of existing indexes.
- Remove specific indexes.
- Add new indexes with NaN as the default value.
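
As a minimal sketch of the capabilities listed above, a single reindex() call can reorder, remove, and add labels. The Series `s` and the label `'five'` are made up for illustration; the `fill_value` argument replaces the default NaN with a chosen value.

```python
import pandas as pd

s = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# 'two' is left out (removed) and 'five' is new (added with NaN)
reordered = s.reindex(['four', 'one', 'three', 'five'])
print(reordered)

# Same labels, but the new label gets 0 instead of NaN
filled = s.reindex(['four', 'one', 'three', 'five'], fill_value=0)
print(filled)
```

Because NaN is a floating-point value, the first result is converted to float64, which is worth keeping in mind when integer data is reindexed.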

- Reindexing can help handle cases where your Series or DataFrame has missing index values. Let’s explore how this works with an example.
- Manually defining a complete list of labels for reindexing can be challenging, especially for large datasets.
- Instead, the method parameter of reindex() can fill the values for newly introduced labels automatically.


In [None]:
import pandas as pd

# Defining the Series
ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

print(ser3)

0    1
3    5
5    6
6    3
dtype: int64


**Filling Missing Indexes:**

To fill these missing indexes:
1. Use the reindex() function with a complete sequence of labels, such as range(6).
2. Set the method parameter to 'ffill' (forward fill), which assigns the value of the previous valid index to the missing ones.

In [None]:
# Reindexing with forward fill
new_ser3 = ser3.reindex(range(6), method='ffill')

print(new_ser3)

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64


**New Indexes Added:**

The missing indexes [1, 2, 4] were introduced. Index 6 is not in range(6), so it does not appear in the result.

**Forward Fill Applied:**

Missing values were filled with the nearest valid value from the preceding index:
- Index 1 and 2 get the value 1 (from index 0).
- Index 4 gets the value 5 (from index 3).

- Reindexing can use different methods to handle missing index values. After exploring forward fill (ffill), let’s understand backward fill (bfill) and how these concepts extend to DataFrames.

- In backward fill: Missing index values are filled with the nearest valid value from the following index.


In [None]:
# Backward fill example
new_ser3_bfill = ser3.reindex(range(6), method='bfill')

print(new_ser3_bfill)

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64


**Reindexing concepts for Series can be extended to DataFrames, allowing:**
- Rearranging rows (indexes).
- Rearranging columns.
- Adding new rows or columns, with NaN as placeholders for missing data.


In [None]:
import pandas as pd

# Create the original DataFrame
data = {'colors': ['blue', 'green', 'yellow', 'red', 'white'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7],
        'object': ['ballpand', 'pen', 'pencil', 'paper', 'mug']}

frame = pd.DataFrame(data)

# Reindex rows with forward fill (a no-op here, since rows 0-4 already exist)
# and insert a new column, which is filled with NaN
new_frame = frame.reindex(
    range(5), method='ffill'
).reindex(columns=['colors', 'price', 'new', 'object'])

print(new_frame)


   colors  price  new    object
0    blue    1.2  NaN  ballpand
1   green    1.0  NaN       pen
2  yellow    0.6  NaN    pencil
3     red    0.9  NaN     paper
4   white    1.7  NaN       mug


**Dropping**

- The drop() method is used to remove specific rows or columns from a Series or DataFrame. The ability to drop elements by their labels (index or column names) makes this operation simple and intuitive.

Example: Dropping an Item from a Series

In [None]:
import pandas as pd
import numpy as np

# Define a Series with four elements and distinct labels
ser = pd.Series(np.arange(4.0), index=['red', 'blue', 'yellow', 'white'])
print("Original Series:")
print(ser)

# Drop the item with the label 'yellow'
ser_dropped = ser.drop('yellow')
print("\nSeries after dropping 'yellow':")
print(ser_dropped)

Original Series:
red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

Series after dropping 'yellow':
red      0.0
blue     1.0
white    3.0
dtype: float64


Note: The original Series remains unchanged. By default, drop() does not modify the original object; it returns a new one.

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame with 4x4 structure
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])
print("Original DataFrame:")
print(frame)

# Drop rows with labels 'blue' and 'white'
frame_dropped_rows = frame.drop(['blue', 'white'])
print("\nDataFrame after dropping rows 'blue' and 'white':")
print(frame_dropped_rows)

# Drop columns with labels 'pen' and 'pencil'
frame_dropped_columns = frame.drop(['pen', 'pencil'], axis=1)
print("\nDataFrame after dropping columns 'pen' and 'pencil':")
print(frame_dropped_columns)


Original DataFrame:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

DataFrame after dropping rows 'blue' and 'white':
        ball  pen  pencil  paper
red        0    1       2      3
yellow     8    9      10     11

DataFrame after dropping columns 'pen' and 'pencil':
        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15


**Explanation**:
- Dropping Rows: To remove rows, pass the labels in a list to the drop() method. Rows are removed by default because axis defaults to 0.
- Dropping Columns: To remove columns, set axis=1 to indicate column-wise operation.

**Notes**:

- inplace=True can be used to modify the original DataFrame directly without creating a new one.
- The axis parameter can be set to 0 or 1 to indicate whether rows or columns should be dropped (default is axis=0 for rows).
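
The notes above can be checked with a short sketch. The variable names (`smaller`, `trimmed`, `work`) are illustrative; the `index=` and `columns=` keywords of drop() are an alternative to passing `axis`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# drop() returns a new object; the original keeps all four rows
smaller = frame.drop('blue')
print(frame.shape)    # (4, 4)
print(smaller.shape)  # (3, 4)

# index= and columns= remove rows and columns in one call
trimmed = frame.drop(index=['red'], columns=['paper'])
print(trimmed)

# inplace=True changes the object itself; a copy is used so `frame` stays intact
work = frame.copy()
work.drop('pen', axis=1, inplace=True)
print(list(work.columns))
```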

**Arithmetic and Data Alignment**


To demonstrate how pandas handles arithmetic operations with alignment, consider the following code examples:

Example of Series Alignment with Arithmetic Operations:

In [None]:
import pandas as pd

# Define two Series with different indexes
s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1], index=['white', 'yellow', 'black', 'blue', 'brown'])

print("Series 1:")
print(s1)

print("\nSeries 2:")
print(s2)

# Perform addition between the two Series
result = s1 + s2
print("\nResult of adding Series 1 and Series 2:")
print(result)

Series 1:
white     3
yellow    2
green     5
blue      1
dtype: int64

Series 2:
white     1
yellow    4
black     7
blue      2
brown     1
dtype: int64

Result of adding Series 1 and Series 2:
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64


- Alignment by Index: When performing an operation like addition between s1 and s2, pandas aligns the data based on the indexes. If an index is not present in both Series, the result will have NaN for that index.
- Index Matching: Only common indexes are used for the operation, while any non-matching indexes are represented with NaN in the result.
- NaN Values: The result contains NaN for indexes that are not present in both Series (e.g., green, black, and brown).
- Arithmetic Operations: The same alignment behavior applies to other arithmetic operations (e.g., subtraction, multiplication, division), ensuring that pandas handles mismatched indexes gracefully by filling with NaN.
- Handling NaN: You can use methods like .fillna() to replace NaN values with a specified value, if needed.
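
As a small sketch of the last point, there are two common ways to avoid NaN in the sum of the two Series above, and they give different answers. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1], index=['white', 'yellow', 'black', 'blue', 'brown'])

# Option 1: add first, then replace NaN in the result with 0
summed_then_filled = (s1 + s2).fillna(0)
print(summed_then_filled)

# Option 2: treat a label missing from one Series as 0 before adding
filled_then_summed = s1.add(s2, fill_value=0)
print(filled_then_summed)
```

With option 1 the label green ends up as 0, because the sum was already NaN; with option 2 it keeps the value 5 from s1.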

When working with DataFrames, pandas aligns data based on both the row and column indexes during arithmetic operations. This ensures that operations between DataFrames are performed in a way that matches up corresponding labels, even if they don't perfectly align.

**Example of DataFrame Alignment with Arithmetic Operations:**

In [None]:
import pandas as pd
import numpy as np

# Define two DataFrames with different indexes and columns
frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

print("DataFrame 1:")
print(frame1)

print("\nDataFrame 2:")
print(frame2)

# Perform addition between the two DataFrames
result = frame1 + frame2
print("\nResult of adding DataFrame 1 and DataFrame 2:")
print(result)

DataFrame 1:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

DataFrame 2:
        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11

Result of adding DataFrame 1 and DataFrame 2:
        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN


**Explanation**:
- Alignment Principle: When performing operations like addition (+), pandas aligns the data based on both the rows and columns. Only matching labels are used, and NaN is filled in where labels do not exist in both DataFrames.
- Rows and Columns: In this example:
    - Rows red and green are not present in both DataFrames, so the result for these rows contains NaN.
    - Columns mug, paper, and pencil are not shared between the DataFrames, so the result for these columns also contains NaN.
- Result Details:
    - Rows and columns that exist in both DataFrames are summed.
    - The result DataFrame only shows NaN for places where no corresponding index or column exists in both DataFrames.

## Operations between Data Structures

**Flexible Arithmetic Methods**

You’ve just seen how to use mathematical operators directly on the pandas data structures. The same operations can also be performed using appropriate methods, called Flexible arithmetic methods.
  - add()
  - sub()
  - div()
  - mul()

These methods are called on a data structure and take another data structure as an argument. For example:

In [None]:
import pandas as pd
import numpy as np

# Define two DataFrames with different indexes and columns
frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

print("DataFrame 1:")
print(frame1)

print("\nDataFrame 2:")
print(frame2)

result = frame1.add(frame2)
print(result)

DataFrame 1:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

DataFrame 2:
        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11
        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
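
One reason to prefer the flexible methods over the plain operators is their `fill_value` argument. The sketch below reuses the two DataFrames from above; a cell missing from only one of them is treated as 0, while a cell missing from both stays NaN.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# Missing cells on one side are replaced by 0 before the addition
result = frame1.add(frame2, fill_value=0)
print(result)
```

For example, the cell (red, mug) remains NaN because neither DataFrame contains it, whereas (green, mug) takes its value from frame2 alone.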


**Operations between DataFrame and Series**

When performing operations between a DataFrame and a Series, pandas aligns the Series index with the DataFrame columns. This alignment enables you to carry out operations between these two structures directly, applying the Series' values to each column of the DataFrame that shares the same index.

Example of Operations Between a DataFrame and a Series:

In [None]:
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])
print("DataFrame:")
print(frame)

ser = pd.Series(np.arange(4), index=['ball', 'pen', 'pencil', 'paper'])
print("\nSeries:")
print(ser)

result = frame - ser
print("\nResult of subtracting Series from DataFrame:")
print(result)


DataFrame:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Series:
ball      0
pen       1
pencil    2
paper     3
dtype: int64

Result of subtracting Series from DataFrame:
        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12


- Element-wise Operation: Each value in the Series is subtracted from the corresponding column in the DataFrame where the column name matches the Series index. For example, the value 0 in the Series is subtracted from all entries in the ball column, 1 from all entries in the pen column, and so on.
- Broadcasting: If the Series index matches the DataFrame columns, pandas broadcasts the Series values to all rows of the corresponding column.
- If the Series' index has labels that do not match any column in the DataFrame, the operation will result in NaN for those columns.
- This type of operation is useful for applying a transformation across all rows for specific columns of a DataFrame without needing to loop over the DataFrame explicitly.
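
The operator form always matches the Series index against the columns. To match against the row labels instead, the flexible methods accept an `axis` argument. A minimal sketch, using the same DataFrame and subtracting its own ball column from every column:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# The Series is indexed by row labels (red, blue, yellow, white)
ball = frame['ball']

# axis=0 aligns the Series with the row index and broadcasts across columns
result = frame.sub(ball, axis=0)
print(result)
```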

Example with Unmatched Index:

In [None]:
import pandas as pd
import numpy as np

# Define the DataFrame
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Define the Series
ser = pd.Series(np.arange(4), index=['ball', 'pen', 'pencil', 'paper'])

# Add a new item to the Series
ser['mug'] = 9

# Display the Series with the new item
print("Updated Series:")
print(ser)

# Perform the subtraction operation between the DataFrame and the Series
result = frame - ser

# Display the result
print("\nResult of subtracting Series from DataFrame:")
print(result)


Updated Series:
ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

Result of subtracting Series from DataFrame:
        ball  mug  paper  pen  pencil
red        0  NaN      0    0       0
blue       4  NaN      4    4       4
yellow     8  NaN      8    8       8
white     12  NaN     12   12      12


In [None]:
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])
print("DataFrame:")
print(frame)
ser = pd.Series(np.arange(4), index=['ball', 'pen', 'pencil', 'paper'])
print("\nSeries:")
print("ser")
print(ser)
ser_unmatched = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
result_unmatched = frame - ser_unmatched
print("\nResult with unmatched Series index:")
print(result_unmatched)


DataFrame:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Series:
ser
ball      0
pen       1
pencil    2
paper     3
dtype: int64

Result with unmatched Series index:
         a   b  ball   c   d  paper  pen  pencil
red    NaN NaN   NaN NaN NaN    NaN  NaN     NaN
blue   NaN NaN   NaN NaN NaN    NaN  NaN     NaN
yellow NaN NaN   NaN NaN NaN    NaN  NaN     NaN
white  NaN NaN   NaN NaN NaN    NaN  NaN     NaN


## Function Application and Mapping

Universal functions (ufuncs) are NumPy functions that operate element-wise on an array. pandas Series and DataFrame objects support them directly: the ufunc is applied to every element, and the result keeps the original index and column labels. This makes it easy to apply mathematical operations across entire datasets.

Example: Applying NumPy Functions Element-Wise

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

print("Original DataFrame:")
print(frame)

Original DataFrame:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15


To apply an element-wise operation, such as taking the square root of each value in the DataFrame, you can use np.sqrt():

In [None]:
# Apply the square root function
result = np.sqrt(frame)

print("\nSquare root of each element:")
print(result)


Square root of each element:
            ball       pen    pencil     paper
red     0.000000  1.000000  1.414214  1.732051
blue    2.000000  2.236068  2.449490  2.645751
yellow  2.828427  3.000000  3.162278  3.316625
white   3.464102  3.605551  3.741657  3.872983


Other Ufuncs

Many other NumPy functions can be used in a similar manner, such as:
- np.exp() for the exponential of each element.
- np.log() for the natural logarithm.
- np.abs() for the absolute value.
- np.sin(), np.cos(), np.tan(), etc., for trigonometric functions.

These functions enable powerful data analysis operations that operate element-wise, making pandas a robust tool for data manipulation and analysis.
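
A brief sketch of two of the ufuncs listed above, applied to the same DataFrame. The intermediate name `centered` is only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Shift the values so that some of them are negative (-8 to 7)
centered = frame - 8

# Element-wise absolute value; row and column labels are preserved
absolute = np.abs(centered)
print(absolute)

# Element-wise exponential
exponential = np.exp(frame)
print(exponential)
```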

## Functions by Row or Column

In pandas, you are not limited to universal functions (ufuncs); you can also apply user-defined functions. In the examples that follow, the function receives a one-dimensional array (one row or one column, passed as a Series) and returns a single value; the apply() method runs it across the rows or columns of a DataFrame.

Define the Custom Function

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame
frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Define the function to calculate the range
f = lambda x: x.max() - x.min()

# Or define the function using def
def f(x):
    return x.max() - x.min()

Apply the Function to Columns

In [None]:
# Apply the function column-wise
result_col = frame.apply(f)

print("\nRange for each column:")
print(result_col)


Range for each column:
ball      12
pen       12
pencil    12
paper     12
dtype: int64


Apply the Function to Rows

In [None]:
# Apply the function row-wise
result_row = frame.apply(f, axis=1)

print("\nRange for each row:")
print(result_row)


Range for each row:
red       3
blue      3
yellow    3
white     3
dtype: int64


- apply() method: This method applies a function along an axis of the DataFrame (either rows or columns).
- axis parameter: When axis=0, the function is applied column-wise. When axis=1, the function is applied row-wise.


You can apply functions to a pandas DataFrame using the apply() method, and these functions do not have to return a scalar value. They can also return a Series or a DataFrame, which is especially useful when you want to apply multiple functions at once or get multiple values for each feature.
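
As a sketch of a function that returns a Series rather than a scalar, the hypothetical helper `min_max` below reports both the minimum and the maximum, so apply() produces a DataFrame instead of a Series.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

def min_max(x):
    # Return two values for each row or column passed in
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# One result column per original column; rows are 'min' and 'max'
per_column = frame.apply(min_max)
print(per_column)

# With axis=1 there is one result row per original row
per_row = frame.apply(min_max, axis=1)
print(per_row)
```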

- Most of the statistical functions available for arrays also work on a DataFrame, so apply() is not needed for them.
- For example, sum() and mean() calculate the sum and the average, respectively, of the elements in each column of a DataFrame.

In [None]:
import pandas as pd
import numpy as np

# Create the DataFrame
frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['red', 'blue', 'yellow', 'white'],
    columns=['ball', 'pen', 'pencil', 'paper']
)

# Compute the sum of each column
column_sums = frame.sum()
print("Column Sums:")
print(column_sums)

# Compute the mean of each column
column_means = frame.mean()
print("\nColumn Means:")
print(column_means)


Column Sums:
ball      24
pen       28
pencil    32
paper     36
dtype: int64

Column Means:
ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64
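
These methods also accept an `axis` argument, and describe() computes several statistics at once. A minimal sketch on the same DataFrame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# axis=1 aggregates across the columns, giving one value per row
row_sums = frame.sum(axis=1)
row_means = frame.mean(axis=1)
print(row_sums)
print(row_means)

# describe() summarises each column (count, mean, std, min, quartiles, max)
summary = frame.describe()
print(summary)
```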


## I/O API Tools

The following table lists the **Readers** and **Writers** that pandas provides for data I/O operations:

| **Readers**          | **Writers**         |  
|-----------------------|---------------------|  
| `read_csv`           | `to_csv`           |  
| `read_excel`         | `to_excel`         |  
| `read_hdf`           | `to_hdf`           |  
| `read_sql`           | `to_sql`           |  
| `read_json`          | `to_json`          |  
| `read_html`          | `to_html`          |  
| `read_stata`         | `to_stata`         |  
| `read_clipboard`     | `to_clipboard`     |  
| `read_pickle`        | `to_pickle`        |  
| `read_msgpack`       | `to_msgpack` (removed in pandas 1.0) |  
| `read_gbq`           | `to_gbq` (deprecated since pandas 2.2; use the pandas-gbq package) |  

- Reading and writing data as text files is a long-established practice, and such data are usually laid out in tabular form.
- If the values in a row are separated by commas, the file is in CSV (comma-separated values) format, perhaps the best-known and most popular format.
- Tabular data separated by spaces or tabs are typically stored in text files of various types (generally with the extension .txt).
- Text files of this kind are the most common source of data and are easy to transcribe and interpret.
- pandas provides a set of functions specifically for this type of file:
    - read_csv
    - read_table
    - to_csv

## Reading Data in CSV or Text Files

- The first step in data analysis often involves reading data from a CSV or text file into a usable format like a pandas DataFrame.
- Pandas provides powerful tools to read and manipulate CSV files efficiently.

File Content (myCSV_01.csv):

In [None]:
import pandas as pd

# Create the data as a dictionary
data = {
    'white': [1, 2, 3, 2, 4],
    'red': [5, 7, 3, 2, 4],
    'blue': [2, 8, 6, 8, 2],
    'green': [3, 5, 7, 3, 1],
    'animal': ['cat', 'dog', 'horse', 'duck', 'mouse']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('myCSV_01.csv', index=False)

# Display the DataFrame
print(df)


   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse


In [None]:
import pandas as pd

# Reading the CSV file
csvframe = pd.read_csv('myCSV_01.csv')

# Display the DataFrame
print(csvframe)

   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse


- read_csv():
    - Reads a comma-delimited file and converts it into a DataFrame.
    - Handles parsing, type conversion, and DataFrame creation seamlessly.
- Data is now ready for analysis or further manipulation.


- CSV files are essentially text files with comma-separated values in tabular format.
- Although read_csv() is the primary method to read CSV files, you can use read_table() as an alternative by specifying the delimiter explicitly.


In [None]:
import pandas as pd

# Reading a CSV file using read_table with a comma as the separator
csvframe = pd.read_table('myCSV_01.csv', sep=',')

# Display the DataFrame
print(csvframe)

- In some cases, CSV files may not include column headers, and the data starts directly from the first row.
- By default, read_csv() assumes the first row contains column headers.
- You can handle this scenario by explicitly specifying that there are no headers using the header=None parameter.


In [None]:
import pandas as pd

# Create the data as a list of lists
data = [
    [1, 5, 2, 3, 'cat'],
    [2, 7, 8, 5, 'dog'],
    [3, 3, 6, 7, 'horse'],
    [2, 2, 8, 3, 'duck'],
    [4, 4, 2, 1, 'mouse']
]

# Create a DataFrame with column names
df = pd.DataFrame(data, columns=['white', 'red', 'blue', 'green', 'animal'])

# Save the DataFrame to a CSV file
df.to_csv('myCSV_02.csv', index=False, header=False)

# Display the DataFrame
print(df)


   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse


In [None]:
import pandas as pd

# Reading a CSV file without headers
csvframe_no_headers = pd.read_csv('myCSV_02.csv', header=None)

# Display the DataFrame
print(csvframe_no_headers)


   0  1  2  3      4
0  1  5  2  3    cat
1  2  7  8  5    dog
2  3  3  6  7  horse
3  2  2  8  3   duck
4  4  4  2  1  mouse


**Assigning Custom Column Names:**

You can provide custom column names using the names parameter:

In [None]:
# Specifying column names for the DataFrame
column_names = ['white', 'red', 'blue', 'green', 'animal']
csvframe_custom_headers = pd.read_csv('myCSV_02.csv', header=None, names=column_names)

# Display the DataFrame
print(csvframe_custom_headers)

- For more complex datasets, you can create a hierarchical structure in a DataFrame by using multiple columns as row indexes.
- The index_col parameter in read_csv() allows you to specify one or more columns to serve as the index.

In [None]:
import pandas as pd

# Create the data as a list of lists
data = [
    ['black', 'up', 3, 4, 6],
    ['black', 'down', 2, 6, 7],
    ['white', 'up', 5, 5, 5],
    ['white', 'down', 3, 3, 2],
    ['white', 'left', 1, 2, 1],
    ['red', 'up', 2, 2, 2],
    ['red', 'down', 1, 1, 4]
]

# Create a DataFrame with column names
df = pd.DataFrame(data, columns=['color', 'status', 'item1', 'item2', 'item3'])

# Save the DataFrame to a CSV file
df.to_csv('myCSV_03.csv', index=False)

# Display the DataFrame
print(df)


   color status  item1  item2  item3
0  black     up      3      4      6
1  black   down      2      6      7
2  white     up      5      5      5
3  white   down      3      3      2
4  white   left      1      2      1
5    red     up      2      2      2
6    red   down      1      1      4


In [None]:
import pandas as pd

# Reading the CSV file with a hierarchical index
hierarchical_df = pd.read_csv('myCSV_03.csv', index_col=['color', 'status'])

# Display the DataFrame
print(hierarchical_df)

              item1  item2  item3
color status                     
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4
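
Once the hierarchical index is in place, rows can be selected by one or both levels with .loc. The sketch below assumes the same data as myCSV_03.csv, read from an in-memory string so that it runs on its own; the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=['color', 'status'])

# All rows whose outer label is 'white'; the result is indexed by status
white_rows = df.loc['white']
print(white_rows)

# A single row, selected by both levels with a tuple
single = df.loc[('red', 'down')]
print(single)
```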


## Using RegExp for Parsing TXT Files

- When working with text files where delimiters like commas, semicolons, or tabs are not well-defined, regular expressions (RegExp) come in handy. A regular expression lets you specify a pattern for separating data based on certain criteria, such as spaces, tabs, or other characters.

- For example, if the values in your file are separated by spaces or tabs but inconsistently (i.e., with multiple spaces or tabs between values), you can define a pattern that accepts any run of spaces and tabs as a single delimiter.

The following metacharacters are commonly used in such patterns:

| **Wildcard** | **Description** | **Example** |
|--------------|-----------------|-------------|
| .            | Matches any single character (except newline) | a.c matches "abc", "adc", etc. |
| \d           | Matches any digit | \d+ matches "1", "23", "456" |
| \D           | Matches any non-digit character | \D+ matches "a", "abc", "!" |
| \s           | Matches any whitespace character | \s+ matches one or more spaces or tabs |
| \S           | Matches any non-whitespace character | \S+ matches "abc", "123", "!" |
| \n           | Matches a newline character | Matches the end of a line |
| \t           | Matches a tab character | Matches a tab |
| \uxxxx       | Matches a Unicode character specified by hexadecimal number xxxx | Matches a Unicode character, like \u0041 for "A" |
