# __Pandas__
## __1. What is Pandas?__
### __1.1. Overview__
Pandas is an __open-source__ Python library that provides powerful data structures for __data manipulation__ and __analysis__. It is built on top of __NumPy__, making it highly efficient for data processing tasks.

#### __Key Data Structures in Pandas__:
1. __Series__:
    - __1D array__ with labels.
    - Holds values of a single __dtype__; mixed values (integers, strings, floats, etc.) are stored under the generic `object` dtype.
    - Similar to a __list__ or a __column__ in a table, but with __labeled indices__.
    - Consists of two components:
        - __Data__: The actual values in the Series.
        - __Index__: The labels corresponding to the data points.

2. __DataFrame__:
    - __2D array__ with labels (like a table in SQL or Excel).
    - Holds __heterogeneous data__ across rows and columns (e.g., a column of strings, another of numbers).
    - Consists of three components:
        - __Data__: The actual values stored in the DataFrame.
        - __Columns__: Labels for the columns; each column behaves like a Series.
        - __Index__: Labels for the rows, used for indexing and slicing.

3. __Panel__ *(Removed)*:
    - __3D array__ with labels (like a dictionary of DataFrames).
    - Was used to represent 3-dimensional data (a "stack" of DataFrames).
    - __No longer available__: Panel was deprecated and then removed from pandas. Use a DataFrame with a __MultiIndex__ instead.
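
The structures listed above can be sketched in a few lines. This is a minimal illustration, and the names and values (`ages`, `people`, the sales figures) are made up for the example. The last part stacks two DataFrames under a MultiIndex, which is the approach the list suggests for 3-dimensional data.

```python
import pandas as pd

# Series: 1D values plus an index of labels
ages = pd.Series([25, 30, 28], index=["John", "Doe", "Jane"], name="age")

# DataFrame: 2D table with row labels (index) and column labels (columns)
people = pd.DataFrame(
    {"age": [25, 30, 28], "city": ["Pune", "Delhi", "Mumbai"]},
    index=["John", "Doe", "Jane"],
)

# Stand-in for 3D data: stack DataFrames under an outer index level
sales_2023 = pd.DataFrame({"units": [10, 20]}, index=["east", "west"])
sales_2024 = pd.DataFrame({"units": [15, 25]}, index=["east", "west"])
stacked = pd.concat({2023: sales_2023, 2024: sales_2024}, names=["year", "region"])

print(ages)
print(people)
print(stacked)
print(stacked.loc[2024])  # select one "layer" of the stack by its outer label
```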

#### __Pandas Key Features__:
- __High-performance DataFrame object__ for easy manipulation.
- __Data import/export__: Load and save data from various file formats (CSV, SQL, JSON, Excel, etc.).
- __Handling missing data__: Built-in tools to handle missing values.
- __Reshaping and pivoting__: Transform datasets for analysis.
- __Label-based indexing__: Slice and filter data with labels.
- __Grouping and aggregation__: Perform group operations like aggregation, transformations.
- __Time series functionality__: Built-in tools for handling date/time data.
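
A few of these features can be tried on a tiny table. The sketch below is only an illustration; the column names and numbers are invented, and it touches missing-data handling, grouping, pivoting, and label-based selection.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Pune", "Pune", "Delhi", "Delhi"],
        "year": [2023, 2024, 2023, 2024],
        "sales": [100.0, np.nan, 80.0, 120.0],
    }
)

# Handling missing data: replace the missing sales figure with 0
filled = df.fillna({"sales": 0.0})

# Grouping and aggregation: total sales per city
totals = filled.groupby("city")["sales"].sum()

# Reshaping and pivoting: one row per city, one column per year
wide = filled.pivot(index="city", columns="year", values="sales")

# Label-based indexing: pick a value by row label and column label
delhi_2024 = wide.loc["Delhi", 2024]

print(totals)
print(wide)
print(delhi_2024)
```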

---

### __1.2. Importance of Pandas in Data Science & Machine Learning__
- __Data Preparation__: Essential for transforming raw data into usable formats for analysis or modeling.
- __Data Exploration__: Quickly explore and understand data distributions, trends, and relationships using simple commands.
- __Integration with Libraries__: Easily integrates with __NumPy__, __Matplotlib__, __Seaborn__, and __Scikit-learn__, streamlining workflows for machine learning.
- __Efficiency__: Handles datasets that fit in memory efficiently through vectorized, NumPy-backed operations and compact dtypes (e.g., categoricals).
- __Advanced Operations__: Supports complex operations like grouping, merging, reshaping, which are critical for feature engineering and analysis.
- __Time Series Analysis__: Native support for working with time-indexed data (critical for financial modeling, etc.).

---

### __1.3. Comparison with Other Libraries__
Here’s a quick comparison between Pandas, NumPy, Excel, and SQL to highlight Pandas' strengths:

| Feature                 | __Pandas__                                                 | __NumPy__                                   | __Excel__                               | __SQL__                              |
|-------------------------|------------------------------------------------------------|---------------------------------------------|-----------------------------------------|--------------------------------------|
| __Data Structures__     | Series (1D), DataFrame (2D)                                | N-dimensional array (ndarray)               | Tables (rows & columns)                 | Tables (rows & columns)              |
| __Data Types__          | Heterogeneous (numerical, strings, dates, etc.)            | Numerical data (arrays/matrices)            | Primarily numerical & categorical       | Supports numbers, text, and dates    |
| __Missing Data__        | Supports missing values (NaN, None, NaT)                   | NaN for floats only; no general marker      | Blank cells, handled manually           | NULL, handled explicitly in queries  |
| __Data Operations__     | High-level data manipulation (aggregation, transformation) | Low-level numerical computations            | Limited operations                      | Complex data manipulations           |
| __Performance__         | Efficient for data analysis and manipulation               | Efficient for numerical operations          | Limited performance for data processing | Optimized for data retrieval         |
| __Use Cases__           | Data cleaning, exploration, analysis                       | Numerical computations and array operations | Basic analysis & visualization          | Data storage, retrieval, querying    |
| __Integration__         | Integrates well with NumPy, Scikit-learn, Matplotlib       | Integrates with Pandas, SciPy               | Limited integration                     | Integrates with various DBs          |
| __Time Series Support__ | Native support for time series data                        | Limited support                             | Limited support                         | Supports time series via SQL queries |
| __File I/O__            | Supports various formats (CSV, SQL, JSON, Excel)           | Limited I/O capabilities                    | Excel files only                        | Various file formats (CSV, SQL)      |

---

### __1.4. Installing Pandas__
To install Pandas, you can use Python’s package installer __pip__. Open a terminal or command prompt and run the following command:

```bash
pip install pandas
```
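
A quick way to confirm the installation is to import the library and print its version. This is a minimal check and assumes it runs in the same environment where `pip` installed the package.

```python
import pandas as pd

# Printing the version confirms that the import works
print(pd.__version__)
```
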
_Note_:
1. Homogeneous Data
    - __Definition__: Homogeneous data refers to datasets where all the elements have the same type or similar characteristics.
        - In this context, every element in a collection (such as a list, array, or column) is of the same data type (e.g., all integers, all floats, or all strings).
    - In Pandas:
        - A Series with homogeneous data means that all the elements (values) within the Series are of the same type.
        - Example: A Pandas Series of integers (e.g., [1, 2, 3, 4]).
        - Use cases: Typically seen in operations like numerical computations, where uniformity of data type (like all numerical values) is required for mathematical operations.

2. Heterogeneous Data
    - __Definition__: Heterogeneous data refers to datasets where the elements have different types or characteristics.
        - In this context, the elements in a collection (such as a list, array, or column) can be of different data types (e.g., integers, floats, strings).
    - In Pandas:
        - A DataFrame can handle heterogeneous data across its columns, meaning each column can have a different data type (e.g., one column with integers, another with strings, another with dates).
        - Example: A Pandas DataFrame where one column stores strings, another stores numbers, and another stores dates.
        - Use cases: Very common in real-world datasets where multiple features may have different types of data (e.g., age as integers, date of birth as datetime, and city as strings).


## 3. Importing and Exporting Data
## 4. Data Selection & Filtering
## 5. Data Cleaning & Preprocessing
## 6. Data Transformation
## 7. Merging and Combining Data
## 8. Working with Dates and Time Series
## 9. Data Visualization with Pandas
## 10. Advanced Pandas Operations
## 11. Pandas Best Practices and Common Mistakes

---



## __2. Pandas Data Structures__

### __2.1. Series (1D Data)__

A __Series__ is a one-dimensional, array-like object that holds a sequence of values along with an associated array of labels, known as the __index__. The simplest form of a Series is created from a data array, with a default integer index. You can also specify custom indices for better labeling.

#### __2.1.1. Creating a Series__
The general syntax for creating a Series is:

```python
pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=<no_default>)
```
Parameters:
- __data__: The data to be stored in the Series (e.g., lists, arrays, or dictionaries).
- __index__: The labels for the data points. If not provided, a default integer index is assigned. For list or array data, the length of the index must match the data; for dictionary data, the index selects and orders the keys, and labels missing from the dictionary get `NaN`.
- __dtype__: The data type of the Series elements (e.g., int, float, str). If not specified, it is inferred from the data.
- __name__: The name of the Series. Useful when working with DataFrames, where Series are columns. If not specified, it defaults to None.
- __copy__: If set to True, copies the input data. It only affects Series or 1D NumPy array input; other inputs, such as lists, are always copied.
- __fastpath__: An internal parameter that is not meant for user code.

***See***: [Official Pandas Series Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)


In [None]:
# Importing the pandas library
import pandas as pd

LINE_BREAK = "-" * 80

# Creating a Series
int_data = [1, 2, 3, 4, 5]
int_series = pd.Series(int_data)
print("Series with default index:")
print(int_series)
print(LINE_BREAK)

# Creating a Series from a List with Custom Index
custom_index = ['A', 'B', 'C', 'D', 'E']
custom_series = pd.Series(int_data, index=custom_index)
print("Series with custom index:")
print(custom_series)
print(LINE_BREAK)

# Creating a Series from a Dictionary
dict_data = {'John': 25, 'Doe': 30, 'Jane': 28}
dict_series = pd.Series(dict_data, name='Ages')
print("Series from dictionary:")
print(dict_series)
print(LINE_BREAK)

# Creating a Series from a dictionary, keeping only selected keys
student_marks = {'John': 85, 'Doe': 90, 'Jane': 88, 'alex': 'absent'}

# Passing index= selects only the listed keys, so 'alex' is left out
marks_series = pd.Series(student_marks, index=['John', 'Doe', 'Jane'])
print("Series from dictionary with custom index:")
print(marks_series)
print(LINE_BREAK)

# Index labels that are not keys of the dictionary get NaN
marks_series = pd.Series(student_marks, index=['John', 'Doe', 'Jane', 'Jasmine'])
print("Series from dictionary with custom index and missing values:")
print(marks_series)
print(LINE_BREAK)

# Creating a Series with a Specific Data Type
float_data = [1.1, 2.2, 3.3, 4.4, 5.5]
float_series = pd.Series(float_data, dtype='float32')
print("Series with float data type:")
print(float_series)
print(LINE_BREAK)

# Creating a Series with a Name
named_series = pd.Series(int_data, name='Numbers')
print("Series with a name:")
print(named_series)
print(LINE_BREAK)

# Creating a Series with a Custom Index and Name
custom_named_series = pd.Series(int_data, index=custom_index, name='Custom Series')
print("Series with custom index and name:")
print(custom_named_series)
print(LINE_BREAK)

# Creating a Series with copy=True
# copy only affects Series or NumPy array input; a Python list is always copied
import numpy as np

data = np.array([10, 20, 30, 40, 50])
series1 = pd.Series(data, copy=False)  # allowed to reuse the array's buffer
series2 = pd.Series(data, copy=True)  # owns an independent copy of the values
print("Series created with copy=False:")
print(series1)
print("Series created with copy=True:")
print(series2)
print(LINE_BREAK)

# Understanding copy=True
# id() differs for any two Series objects, so it cannot show whether the data was copied.
# np.shares_memory reports whether a Series still uses the original array's buffer.
# With copy=True the Series is independent: later changes to `data` do not affect it.
print("copy=False shares memory with data:", np.shares_memory(data, series1.to_numpy()))
print("copy=True shares memory with data:", np.shares_memory(data, series2.to_numpy()))
print(LINE_BREAK)


#### __2.1.2. Accessing and Modifying Elements__
In Pandas, elements in a Series can be accessed and modified in two ways:

1. __By Positional Index__: Using integer positions, similar to standard array indexing.

   - The first element is at position `0`, the second at `1`, and so on.
   - Negative positions count from the end (-1 for the last element, -2 for the second last, etc.) and should be used with `.iloc[]`. Plain `[]` with an integer looks up a label, so `series_name[-1]` raises a `KeyError` on a default integer index.
   - Slicing is supported using `:` to retrieve a subset of elements; the end position is excluded.

<div style="display: flex; justify-content: center; background-color: white;">
<img src="../assets/images/indexing.svg" style="width: auto; background-color: white;"/>
</div>
<br />

__Syntax:__
```python
series_name[position]  # Works with the default 0..n-1 index, where labels equal positions
series_name.iloc[position]  # Explicit positional indexing (preferred)

series_name[start:end]  # Slicing by positional index (end excluded)
series_name.iloc[start:end]  # Slicing with iloc (end excluded)
```

2. __By Label__: Using custom-defined index labels, which enhances readability and interpretability.

   - Labels are user-defined and can be non-numeric (e.g., strings).
   - Accessing elements using labels is more intuitive and improves data comprehension.
   - `.loc[]` is used explicitly for label-based indexing.

__Syntax:__
```python
series_name[label]  # Direct access using label
series_name.loc[label]  # Using loc for explicit label-based indexing

series_name[start:end]  # Slicing by label (end label included)
series_name.loc[start:end]  # Slicing with loc (end label included)
```
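
The two access styles differ most visibly when the labels are integers that do not match the positions. The following sketch uses made-up values to contrast `.iloc` and `.loc`, including how each treats the end of a slice.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions 0, 1, 2
s = pd.Series([100, 200, 300], index=[10, 20, 30])

first_by_position = s.iloc[0]   # position 0
last_by_position = s.iloc[-1]   # negative positions count from the end
value_by_label = s.loc[10]      # label 10, which sits at position 0

# Slices: .iloc excludes the end position, .loc includes the end label
by_position = s.iloc[0:2]
by_label = s.loc[10:20]

letters = pd.Series([1, 2, 3], index=["A", "B", "C"])
inclusive = letters.loc["A":"B"]

print(first_by_position, last_by_position, value_by_label)
print(by_position)
print(by_label)
print(inclusive)
```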

#### __Modifying Elements__
Pandas also allows modifying values in a Series by using either positional indices or labels:
```python
# Modifying by positional index (use iloc so integers are never mistaken for labels)
series_name.iloc[2] = 99  # Changes the third element
series_name.iloc[1] = 45  # Changes the second element

# Modifying by label
series_name['B'] = 60  # Changes the value at label 'B'
series_name.loc['C'] = 33  # Changes the value at label 'C'
```
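
Assignment also behaves differently for labels and positions that do not exist yet. The sketch below, with illustrative values, shows that `.loc` can append a new label, `.iloc` cannot grow the Series, and a boolean mask updates several elements at once.

```python
import pandas as pd

s = pd.Series([12, 53, 5], index=["A", "B", "C"])

# Assigning to an existing label overwrites the value
s.loc["B"] = 60

# Assigning to a label that does not exist yet appends a new entry
s.loc["D"] = 7

# Assigning to a position that does not exist raises IndexError
try:
    s.iloc[10] = 1
except IndexError as exc:
    error_message = str(exc)
    print("iloc could not assign:", error_message)

# A boolean mask updates every matching element at once
s[s > 50] = 0

print(s)
```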

__See:__
- [Pandas Series Indexing](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html)
- [Pandas Series iloc](https://pandas.pydata.org/docs/reference/api/pandas.Series.iloc.html)
- [Pandas Series loc](https://pandas.pydata.org/docs/reference/api/pandas.Series.loc.html)

In [None]:
# Accessing Elements in a Series by Positional Index
# With the default index, labels equal positions, so [] and iloc return the same elements
indexed_data = [12, 53, 5, 19, 3]
indexed_series = pd.Series(data=indexed_data)

print(f"4th element in the series (forward index): {indexed_series[3]}")
print(f"4th element in the series (backward index): {indexed_series.iloc[-2]}")

# Accessing multiple elements using positional indices
print("\nElements at positions [0, 2, 4]:")
print(indexed_series[[0, 2, 4]])

print("\nElements at positions [0, 2, 4] using range:")
print(indexed_series[range(0, 5, 2)])

print(LINE_BREAK)

# Slicing by positional index
print("\nSliced elements [1:4] using positional index:")
print(indexed_series[1:4])  # Retrieves elements from index 1 to 3
print("\nSliced elements [1:4] using iloc:")
print(indexed_series.iloc[1:4])  # Same result as above, using iloc

print(LINE_BREAK)

# Accessing Elements in a Series by Labeled Index
labeled_data = [12, 53, 5, 19, 3]
labeled_index = ['A', 'B', 'C', 'D', 'E']
labeled_series = pd.Series(data=labeled_data, index=labeled_index)

print(f"Element at label 'C': {labeled_series['C']}")
print(f"Element at label 'D' using loc: {labeled_series.loc['D']}")

# Accessing multiple elements using labels
print("\nElements at labels ['A', 'C', 'E']:")
print(labeled_series[['A', 'C', 'E']])
print("\nElements at labels ['A', 'C', 'E'] using loc:")
print(labeled_series.loc[['A', 'C', 'E']])

print(LINE_BREAK)

# Slicing by label
print("\nSliced elements ['A':'C'] using label indexing:")
print(labeled_series['A':'C'])  # Access from label 'A' to 'C'
print("\nSliced elements ['A':'C'] using loc:")
print(labeled_series.loc['A':'C'])  # Same result as above, using loc

print(LINE_BREAK)

# Using Boolean Indexing to Filter Elements
print("Elements greater than 10:")
print(labeled_series[labeled_series > 10])

print(LINE_BREAK)

# Modifying Elements by Positional Index
indexed_series[2] = 99  # Changes the third element
indexed_series.iloc[1] = 45  # Changes the second element explicitly

print("\nModified indexed_series (after modification by positional index):")
print(indexed_series)

print(LINE_BREAK)

# Modifying Elements by Label
labeled_series['B'] = 60  # Changes the value at label 'B'
labeled_series.loc['C'] = 33  # Changes the value at label 'C'

print("\nModified labeled_series (after modification by label):")
print(labeled_series)

print(LINE_BREAK)

# Modifying Multiple Elements by Position and Label
indexed_series[0:2] = [88, 77]  # Modifies elements at positions 0 and 1
labeled_series.loc['A':'C'] = [11, 22, 33]  # Modifies values from labels 'A' to 'C'

print("\nModified indexed_series (after slicing modification):")
print(indexed_series)

print(LINE_BREAK)

print("\nModified labeled_series (after slicing modification by label):")
print(labeled_series)

#### __2.1.3. Essential Attributes and Functions of a Pandas Series__

In Pandas, Series provide a variety of attributes and functions for inspecting, manipulating, and aggregating data. Below are the essential attributes and functions commonly used with Series.

__Attributes:__

1. **`index`**: Returns the index labels of the Series.
    - Useful for inspecting or modifying the labels associated with the data.
    ```python
    print(series_name.index)  # Access the index of the series
    ```

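2. **`values`**: Returns the data values in the Series as a ***NumPy array*** (or a NumPy-like extension array for extension dtypes such as categoricals).
    - Ideal for performing numerical operations or converting to other data structures; `to_numpy()` is the explicit way to request a NumPy array.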
    ```python
    print(series_name.values)  # Access the values of the series
    ```

3. **`dtype`**: Returns the data type of the Series elements.
    - Helps identify the type of data stored in the Series (e.g., int, float, object).
    ```python
    print(series_name.dtype)  # Access the data type of the series
    ```

4. **`name`**: Returns the name of the Series (if provided).
    - Useful when working with DataFrames, where Series are treated as columns.
    - If not specified, the name defaults to `None`.
    ```python
    print(series_name.name)  # Access the name of the series
    ```

5. **`size`**: Returns the number of elements in the Series.
    - Provides the length (or number of entries) of the Series.
    ```python
    print(series_name.size)  # Access the size of the series
    ```

6. **`shape`**: Returns the dimensions of the Series as a tuple.
    - For a Series, the shape is always a one-element tuple `(n,)`, where `n` is the number of elements.
    ```python
    print(series_name.shape)  # Access the shape of the series
    ```

7. **`empty`**: Returns a boolean indicating whether the Series is empty.
    - Returns `True` if the Series is empty, otherwise `False`.
    ```python
    print(series_name.empty)  # Check if the series is empty
    ```

8. **`ndim`**: Returns the number of dimensions of the Series.
    - A Series is always 1-dimensional; the attribute is useful in code that must handle both Series and DataFrames.
    ```python
    print(series_name.ndim)  # Get the number of dimensions of the series
    ```

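9. **`memory_usage()`**: A method (note the parentheses) that returns the memory usage of the Series in bytes.
    - Helps assess the memory consumption of the Series. The index is included by default; pass `index=False` to count the values only.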
    ```python
    print(series_name.memory_usage())  # Check memory usage
    ```

10. **`hasnans`**: Returns a boolean indicating whether the Series contains any NaN (Not a Number) values.
    - Returns `True` if the Series contains NaN values, otherwise `False`.
    ```python
    print(series_name.hasnans)  # Check for NaN values in the series
    ```
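
Several of these attributes are easier to read on a Series that contains a missing value. This is a small sketch with invented data; the name `score` and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan, 3.0], index=["A", "B", "C"], name="score")

print(s.shape)        # one-element tuple: (3,)
print(s.size)         # counts every entry, including the NaN
print(s.ndim)         # a Series has one dimension
print(s.hasnans)      # True because of the NaN at label 'B'
print(s.to_numpy())   # explicit conversion to a NumPy array

# Three float64 values take 8 bytes each when the index is left out
values_only_bytes = s.memory_usage(index=False)
print(values_only_bytes)
```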

__Functions:__

1. **`head(n)`**: Returns the first `n` elements of the Series (default is 5).
    - Useful for quickly previewing the beginning of the data.
    ```python
    print(series_name.head(3))  # Get the first 3 elements
    ```

2. **`tail(n)`**: Returns the last `n` elements of the Series (default is 5).
    - Ideal for viewing the end of the data.
    ```python
    print(series_name.tail(3))  # Get the last 3 elements
    ```

3. **`describe()`**: Generates descriptive statistics for the Series.
    - For numeric data, provides count, mean, std, min, max, and quartiles; for object data, it reports count, unique, top, and freq instead.
    ```python
    print(series_name.describe())  # Descriptive statistics of the series
    ```

4. **`unique()`**: Returns unique values in the Series.
    - Helpful for identifying the distinct values present in the Series.
    ```python
    print(series_name.unique())  # Get unique values in the series
    ```

5. **`nunique()`**: Returns the number of unique values in the Series, excluding NaN by default.
    - Useful for counting the distinct values in the Series; pass `dropna=False` to count NaN as a value.
    ```python
    print(series_name.nunique())  # Count unique values in the series
    ```

6. **`sum()`**: Returns the sum of all elements in the Series.
    - Ideal for numerical aggregations.
    ```python
    print(series_name.sum())  # Get the sum of the series elements
    ```

7. **`mean()`**: Returns the mean (average) of the Series values.
    - Used to calculate the average of numerical data.
    ```python
    print(series_name.mean())  # Get the mean of the series
    ```
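
The aggregation functions skip missing values unless told otherwise, which is easy to overlook. The sketch below uses made-up numbers to show the defaults of `sum()`, `mean()`, `nunique()`, and the `count` entry of `describe()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 4.0, 10.0])

total = s.sum()                      # NaN is skipped by default
average = s.mean()                   # mean of the three non-missing values
strict_total = s.sum(skipna=False)   # NaN propagates when skipping is disabled
distinct = s.nunique()               # NaN is excluded by default
distinct_with_nan = s.nunique(dropna=False)
non_missing = s.describe()["count"]  # number of non-missing entries

print(total, average, strict_total)
print(distinct, distinct_with_nan, non_missing)
```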

***See:***
- [Pandas Series Attributes](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)
- [Pandas Series Functions](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)
- [Pandas Series Memory Usage](https://pandas.pydata.org/docs/reference/api/pandas.Series.memory_usage.html)
- [Pandas Series Describe](https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html)
- [Pandas Series Unique](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)
- [Pandas Series Nunique](https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html)
- [Pandas Series Sum](https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html)
- [Pandas Series Mean](https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html)
- [Pandas Series Head](https://pandas.pydata.org/docs/reference/api/pandas.Series.head.html)
- [Pandas Series Tail](https://pandas.pydata.org/docs/reference/api/pandas.Series.tail.html)
- [Pandas Series Empty](https://pandas.pydata.org/docs/reference/api/pandas.Series.empty.html)
- [Pandas Series Ndim](https://pandas.pydata.org/docs/reference/api/pandas.Series.ndim.html)
- [Pandas Series Hasnans](https://pandas.pydata.org/docs/reference/api/pandas.Series.hasnans.html)
- [Pandas Series Shape](https://pandas.pydata.org/docs/reference/api/pandas.Series.shape.html)
- [Pandas Series Size](https://pandas.pydata.org/docs/reference/api/pandas.Series.size.html)
- [Pandas Series Values](https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html)
- [Pandas Series Index](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html)
- [Pandas Series Dtype](https://pandas.pydata.org/docs/reference/api/pandas.Series.dtype.html)
- [Pandas Series Name](https://pandas.pydata.org/docs/reference/api/pandas.Series.name.html)

In [None]:
# Creating a sample Series
data = [12, 53, 5, 19, 3]
labels = ['A', 'B', 'C', 'D', 'E']
series_name = pd.Series(data, index=labels)

LINE_BREAK = "-" * 80

# Accessing Essential Attributes of a Series
print("1. Series Index:")
print(series_name.index)  # Accessing the index of the series

print(LINE_BREAK)

print("2. Series Values:")
print(series_name.values)  # Accessing the values of the series

print(LINE_BREAK)

print("3. Series Dtype:")
print(series_name.dtype)  # Accessing the data type of the series

print(LINE_BREAK)

print("4. Series Name:")
print(series_name.name)  # Accessing the name of the series

print(LINE_BREAK)

print("5. Series Size:")
print(series_name.size)  # Accessing the size of the series

print(LINE_BREAK)

print("6. Series Shape:")
print(series_name.shape)  # Accessing the shape of the series

print(LINE_BREAK)

print("7. Is Series Empty?")
print(series_name.empty)  # Checking if the series is empty

print(LINE_BREAK)

print("8. Series Dimensions (ndim):")
print(series_name.ndim)  # Checking the number of dimensions

print(LINE_BREAK)

print("9. Memory Usage:")
print(series_name.memory_usage())  # Checking memory usage

print(LINE_BREAK)

print("10. Does Series Contain NaNs?")
print(series_name.hasnans)  # Checking if there are NaNs in the series

print(LINE_BREAK)

# Using Series Functions
print("11. First 3 Elements (head):")
print(series_name.head(3))  # Getting the first 3 elements

print(LINE_BREAK)

print("12. Last 3 Elements (tail):")
print(series_name.tail(3))  # Getting the last 3 elements

print(LINE_BREAK)

print("13. Descriptive Statistics:")
print(series_name.describe())  # Descriptive statistics of the series

print(LINE_BREAK)

print("14. Unique Values:")
print(series_name.unique())  # Getting unique values

print(LINE_BREAK)

print("15. Number of Unique Values:")
print(series_name.nunique())  # Getting the number of unique values

print(LINE_BREAK)

print("16. Sum of Elements:")
print(series_name.sum())  # Getting the sum of the series elements

print(LINE_BREAK)

print("17. Mean of Elements:")
print(series_name.mean())  # Getting the mean of the series
