# What is Pandas?

Pandas is a powerful open-source Python library used for data manipulation, analysis, and cleaning. It provides two primary data structures:

1. **Series**: One-dimensional labeled array.
2. **DataFrame**: Two-dimensional labeled data structure (like a table in a database or an Excel spreadsheet).

It is widely used in data science and analytics for tasks like:

1. Importing data from various file formats (CSV, Excel, SQL, etc.)
2. Cleaning and transforming data
3. Grouping and summarizing
4. Merging and joining datasets
5. Time series analysis
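
A few of the tasks listed above can be sketched in a handful of lines. This is a minimal illustration, not a recommended workflow: the CSV text, the column names (`city`, `product`, `sales`) and the `regions` lookup table are all made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no files.
csv_text = "city,product,sales\nPune,pen,10\nPune,book,\nDelhi,pen,7\n"
sales = pd.read_csv(io.StringIO(csv_text))

# Cleaning: the empty sales field is read as NaN; replace it with 0.
sales["sales"] = sales["sales"].fillna(0)

# Grouping and summarizing: total sales per city.
totals = sales.groupby("city")["sales"].sum()

# Merging: attach a region to each row using a lookup table.
regions = pd.DataFrame({"city": ["Pune", "Delhi"], "region": ["West", "North"]})
merged = sales.merge(regions, on="city", how="left")

print(totals)
print(merged)
```

Reading from `io.StringIO` stands in for a real file path; with an actual file, `pd.read_csv` would take the path instead.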

---

# What is NumPy?

NumPy (Numerical Python) is a foundational Python library for numerical computing. It provides:

1. **ndarray**: N-dimensional array object
2. Fast operations on arrays, including element-wise operations, linear algebra, statistical operations, etc.

Its array operations run in compiled code over contiguous, fixed-type memory, which makes it fast and memory-efficient, especially with large arrays and matrices.
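
The operations mentioned above can be illustrated with two small arrays. This is only a sketch; the array values are arbitrary.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Element-wise operation: each entry of a is added to the matching entry of b.
elementwise = a + b

# Linear algebra: the @ operator performs matrix multiplication.
matrix_product = a @ b

# Statistics: the mean of each column (axis=0 collapses the rows).
column_means = a.mean(axis=0)

print(elementwise)
print(matrix_product)
print(column_means)  # [2. 3.]
```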

---

## **Key Differences Between Pandas and NumPy**

| Feature                   | **NumPy**                                       | **Pandas**                                          |
|---------------------------|--------------------------------------------------|-----------------------------------------------------|
| **Main Data Structure**   | `ndarray` (multi-dimensional array)             | `Series` (1D), `DataFrame` (2D)                     |
| **Use Case**              | Numerical computations                          | Data analysis and manipulation                      |
| **Data Types Supported**  | Homogeneous (all elements must be same type)    | Heterogeneous (each column can be a different type) |
| **Indexing**              | Position-based (integers, slices, boolean masks) | Label-based (`loc`) and position-based (`iloc`)    |
| **Ease of Use**           | Low-level; requires more code for data handling | High-level; easier data wrangling and analysis      |
| **Speed**                 | Faster for numerical operations                 | Somewhat slower due to index and dtype overhead     |
| **Missing Data Support**  | Limited (`np.nan` in float arrays)               | Built-in support (e.g., `NaN` handling)             |
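
Two rows of the table, data types and missing data, can be checked directly. The snippet below is a small sketch with made-up values and column names (`qty`, `price`, `item`).

```python
import numpy as np
import pandas as pd

# NumPy: mixing numbers and a string forces one common dtype for the
# whole array (here a Unicode string dtype, kind "U").
arr = np.array([1, 2.5, "x"])
print(arr.dtype.kind)  # U

# pandas: every column of a DataFrame keeps its own dtype.
df = pd.DataFrame({"qty": [1, 2], "price": [2.5, 3.0], "item": ["x", "y"]})
print(df.dtypes)

# Missing data: None becomes NaN, and the integers are upcast to float64.
s = pd.Series([1, None, 3])
print(s.dtype, s.isna().sum())
```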

---

### ***When to Use What?***

- **Use NumPy for**: Fast numerical operations, matrix algebra, or when working with homogeneous numeric data.
- **Use Pandas for**: Real-world data analysis involving labeled, heterogeneous, or missing data.
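
The two libraries are often used together: keep the labeled data in pandas and hand the numeric part to NumPy when needed. A minimal sketch, assuming a made-up DataFrame with numeric columns `x` and `y`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Extract the underlying values as a plain 2D ndarray.
values = df.to_numpy()

# Run a NumPy computation: the Euclidean length of each row.
norms = np.linalg.norm(values, axis=1)

# Store the result back in the DataFrame as a new labeled column.
df["norm"] = norms
print(df)
```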

---

## **The Primary Data Structures in Pandas Are:**

### 👉 Series

- A one-dimensional labeled array.
- Can hold any data type (integers, strings, floats, Python objects, etc.).
- Has both values and an index.
    ```python
    import pandas as pd

    s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    print(s)
    ```
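
To make the points about data types and the index concrete, here is a small sketch comparing a numeric Series with a mixed one. The variable names are illustrative only.

```python
import pandas as pd

numbers = pd.Series([10, 20, 30], index=["a", "b", "c"])
mixed = pd.Series([1, "two", 3.0])

# A Series of integers gets an integer dtype.
print(numbers.dtype)

# Mixed Python objects fall back to the generic object dtype.
print(mixed.dtype)  # object

# The index and the values are stored separately.
print(list(numbers.index))
print(numbers.to_numpy())
```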

### 👉 DataFrame

- A two-dimensional labeled data structure (like a table with rows and columns).
- Can contain different data types in different columns.
- Has both row and column indexing.
    ```python
    import pandas as pd

    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]
    }
    df = pd.DataFrame(data)
    print(df)
    ```
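
Row and column indexing on a DataFrame can be sketched as follows. The row labels `r1`, `r2`, `r3` are hypothetical and added only to show label-based access.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

# Selecting one column returns a Series.
ages = df["Age"]

# loc selects by row label and column label.
bob_age = df.loc["r2", "Age"]

# iloc selects by integer position; a single row comes back as a Series.
first_row = df.iloc[0]

print(ages)
print(bob_age)  # 30
print(first_row)
```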

In [87]:
import numpy as np
import pandas as pd

In [88]:
# Create Series

sr = pd.Series([10, 20, 30, 40, 50, 60])
print("Series:", sr)
print("Type of Series:", sr.dtype)
print("Shape of Series:", sr.shape)
print("Values of Series:", sr.values)
print("Index of Series:", sr.index)
print("Name of Series:", sr.name)

# Assign a name to series
sr.name = "Calories"
print("Name of Series after assigning:", sr.name)

Series: 0    10
1    20
2    30
3    40
4    50
5    60
dtype: int64
Type of Series: int64
Shape of Series: (6,)
Values of Series: [10 20 30 40 50 60]
Index of Series: RangeIndex(start=0, stop=6, step=1)
Name of Series: None
Name of Series after assigning: Calories


# Indexing

In [89]:
print("First element in Series:", sr[0])  # default RangeIndex, so label 0 is also position 0
print("Last element in Series:", sr.iloc[-1])
print("3rd and 4th elements in Series:", sr[2:4].values)  # positions 2 and 3; the stop position (4) is excluded
# NOTE: Without `.values`, the slice prints both the index and the values.
print("Reverse the Series:", sr[::-1].values)

First element in Series: 10
Last element in Series: 60
3rd and 4th elements in Series: [30 40]
Reverse the Series: [60 50 40 30 20 10]
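
With the default index, labels and positions coincide, so `sr[0]` looks positional. When the index holds other integers the difference becomes visible. A small sketch with a made-up Series whose integer labels are in reverse order:

```python
import pandas as pd

# Integer labels that do not match the positions.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the row whose label is 0
by_position = s.iloc[0]  # the row at position 0
plain = s[0]             # with an integer index, [] looks up the label

print(by_label)     # 30
print(by_position)  # 10
print(plain)        # 30
```

Using `loc` and `iloc` explicitly avoids having to remember which rule plain `[]` follows.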


# Integer-location based indexing (iloc)

In [90]:
print("First element in Series:", sr.iloc[0])
print("Last element in Series:", sr.iloc[-1])
print("3rd and 4th elements in Series:", sr.iloc[2:4].values)
print("Reverse the Series:", sr.iloc[::-1])

# Access several positions at once
print("Get 2nd, 4th and 6th elements:", sr.iloc[[1, 3, 5]].values)

First element in Series: 10
Last element in Series: 60
3rd and 4th elements in Series: [30 40]
Reverse the Series: 5    60
4    50
3    40
2    30
1    20
0    10
Name: Calories, dtype: int64
Get 2nd, 4th and 6th elements: [20 40 60]
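
One detail worth knowing about `iloc` is how it treats positions past the end. The sketch below uses the same six values; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

# A slice that runs past the end is clipped, as with Python lists.
clipped = s.iloc[4:100]

# A single out-of-range position raises IndexError instead.
try:
    s.iloc[100]
    out_of_bounds_raises = False
except IndexError:
    out_of_bounds_raises = True

print(clipped.values)        # [50 60]
print(out_of_bounds_raises)  # True
```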


# Add Index

In [91]:
index = ["apple", "banana", "grapes", "orange", "strawberry", "watermelon"]

# Replace default index (0, 1, ....) with custom index values
sr.index = index
print("Series with index values:", sr)

Series with index values: apple         10
banana        20
grapes        30
orange        40
strawberry    50
watermelon    60
Name: Calories, dtype: int64
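
The labels can also be supplied when the Series is created, and a replacement index must have exactly one label per value. A minimal sketch with a shortened, made-up fruit list:

```python
import pandas as pd

labels = ["apple", "banana", "grapes"]

# Pass the labels at construction time instead of assigning them later.
s = pd.Series([10, 20, 30], index=labels, name="Calories")

# Assigning an index of the wrong length raises ValueError.
try:
    s.index = ["apple", "banana"]
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(s)
print(mismatch_error)
```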


In [92]:
# sr.iloc['grapes']   # raises TypeError: iloc accepts only integer positions, not labels
print("Access element with index:", sr.iloc[2])
print("\nAccess element with label:", sr.loc["grapes"])
print("\nAccess multiple elements using labels:", sr.loc[["apple", "banana"]])

Access element with index: 30

Access element with label: 30

Access multiple elements using labels: apple     10
banana    20
Name: Calories, dtype: int64
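
Asking `loc` for a label that does not exist raises `KeyError`. The sketch below shows that, along with two ways to guard against it; the label `"kiwi"` is deliberately absent from this made-up Series.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["apple", "banana", "grapes"])

# A missing label raises KeyError.
try:
    s.loc["kiwi"]
    missing_raises = False
except KeyError:
    missing_raises = True

# get() returns a default instead of raising.
fallback = s.get("kiwi", 0)

# The `in` operator checks the index, not the values.
has_kiwi = "kiwi" in s

print(missing_raises)  # True
print(fallback)        # 0
print(has_kiwi)        # False
```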


# Slicing after adding index

In [93]:
# Slicing with [] and integer bounds is positional, even though the index now holds string labels
print("First element:", sr[:1])
print("\nLast element:", sr.iloc[-1:].values[0])
print("\nElements from index 1 to 3:", sr[1:3].values)
print("\nReverse the elements:", sr[::-1])

First element: apple    10
Name: Calories, dtype: int64

Last element: 60

Elements from index 1 to 3: [20 30]

Reverse the elements: watermelon    60
strawberry    50
orange        40
grapes        30
banana        20
apple         10
Name: Calories, dtype: int64


In [94]:
# Slicing using iloc
print("First element:", sr.iloc[:1])
print("\nLast element:", sr.iloc[-1:])
print("\nElements from index 1 to 3:", sr.iloc[1:3].values)
print("\nReverse the elements:", sr.iloc[::-1])

First element: apple    10
Name: Calories, dtype: int64

Last element: watermelon    60
Name: Calories, dtype: int64

Elements from index 1 to 3: [20 30]

Reverse the elements: watermelon    60
strawberry    50
orange        40
grapes        30
banana        20
apple         10
Name: Calories, dtype: int64


In [95]:
# Selecting and slicing using loc (label-based)
print("Access apple", sr.loc["apple"])
print("\nAccess multiple elements:", sr.loc[["apple", "strawberry"]])

# With label-based slicing, both the start and the stop labels are included in the output.
print("\nFrom apple to orange:", sr.loc["apple":"orange"])

Access apple 10

Access multiple elements: apple         10
strawberry    50
Name: Calories, dtype: int64

From apple to orange: apple     10
banana    20
grapes    30
orange    40
Name: Calories, dtype: int64
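
The difference between the two slicing rules is easiest to see side by side. A short sketch that rebuilds the same fruit Series:

```python
import pandas as pd

s = pd.Series(
    [10, 20, 30, 40, 50, 60],
    index=["apple", "banana", "grapes", "orange", "strawberry", "watermelon"],
)

# iloc: the stop position is excluded, so positions 0, 1 and 2 are returned.
by_position = s.iloc[0:3]

# loc: the stop label is included, so "orange" is part of the result.
by_label = s.loc["apple":"orange"]

print(len(by_position))  # 3
print(len(by_label))     # 4
```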


# Create Series using Dict

In [96]:
fruit_protein = {
    "Avocado": 2.0,  # grams of protein
    "Guava": 2.6,
    "Blackberries": 2.0,
    "Oranges": 0.9,
    "Banana": 1.1,
    "Apples": 0.3,
    "Kiwi": 1.1,
    "Pomegranate": 1.7,
    "Mango": 0.8,
    "Cherries": 1.0,
}

ser = pd.Series(fruit_protein, name="protein")
ser

Avocado         2.0
Guava           2.6
Blackberries    2.0
Oranges         0.9
Banana          1.1
Apples          0.3
Kiwi            1.1
Pomegranate     1.7
Mango           0.8
Cherries        1.0
Name: protein, dtype: float64
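
When a dict is combined with an explicit `index`, pandas picks the values for those labels, in that order, and fills labels missing from the dict with `NaN`. A small sketch; `"Papaya"` is a hypothetical label that is intentionally not in the dict.

```python
import pandas as pd

protein = {"Guava": 2.6, "Banana": 1.1, "Apples": 0.3}

# Select and reorder by label; "Papaya" has no entry, so it becomes NaN.
picked = pd.Series(protein, index=["Apples", "Guava", "Papaya"], name="protein")

print(picked)
```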

# Conditional selection

In [97]:
# Get proteins that are greater than or equal to 1.1

# The comparison returns a boolean Series (a mask of True and False values)
print("Get masked series of proteins greater than or equal to 1.1:", ser >= 1.1)
print("\nGet the actual series of proteins greater than or equal to 1.1:", ser[ser >= 1.1])

Get masked series of proteins greater than or equal to 1.1: Avocado          True
Guava            True
Blackberries     True
Oranges         False
Banana           True
Apples          False
Kiwi             True
Pomegranate      True
Mango           False
Cherries        False
Name: protein, dtype: bool

Get the actual series of proteins greater than or equal to 1.1: Avocado         2.0
Guava           2.6
Blackberries    2.0
Banana          1.1
Kiwi            1.1
Pomegranate     1.7
Name: protein, dtype: float64
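
A boolean mask is useful beyond filtering. The sketch below counts matches and uses two other mask-producing methods, `between` and `isin`, on a shortened copy of the protein data.

```python
import pandas as pd

ser = pd.Series(
    {"Avocado": 2.0, "Guava": 2.6, "Oranges": 0.9, "Banana": 1.1, "Apples": 0.3},
    name="protein",
)

mask = ser >= 1.1

# True counts as 1, so summing the mask counts the matching rows.
count = int(mask.sum())

# between() includes both bounds by default.
mid_range = ser[ser.between(0.9, 2.0)]

# isin() on the index filters by label instead of by value.
chosen = ser[ser.index.isin(["Guava", "Apples"])]

print(count)  # 3
print(mid_range)
print(chosen)
```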


# Logical Operators

In [98]:
# AND ( & ) operator

# Get proteins that are greater than 0.5 and less than 2
condition = (ser > 0.5) & (ser < 2)
print("Masked series:", condition)
print("\nActual Series:", ser[condition])

Masked series: Avocado         False
Guava           False
Blackberries    False
Oranges          True
Banana           True
Apples          False
Kiwi             True
Pomegranate      True
Mango            True
Cherries         True
Name: protein, dtype: bool

Actual Series: Oranges        0.9
Banana         1.1
Kiwi           1.1
Pomegranate    1.7
Mango          0.8
Cherries       1.0
Name: protein, dtype: float64


In [99]:
# OR ( | ) operator

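# Get proteins that are greater than 0.5 or less than or equal to 2
# NOTE: every value in this Series satisfies at least one of the two conditions, so the mask is all True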
condition = (ser > 0.5) | (ser <= 2)
print("Masked series:", condition)
print("\nActual Series:", ser[condition])

Masked series: Avocado         True
Guava           True
Blackberries    True
Oranges         True
Banana          True
Apples          True
Kiwi            True
Pomegranate     True
Mango           True
Cherries        True
Name: protein, dtype: bool

Actual Series: Avocado         2.0
Guava           2.6
Blackberries    2.0
Oranges         0.9
Banana          1.1
Apples          0.3
Kiwi            1.1
Pomegranate     1.7
Mango           0.8
Cherries        1.0
Name: protein, dtype: float64
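
Because the condition above matches every row, the effect of `|` is easier to see with two conditions that do not overlap. The thresholds here are chosen only for illustration, on a shortened copy of the data.

```python
import pandas as pd

ser = pd.Series(
    {"Avocado": 2.0, "Guava": 2.6, "Oranges": 0.9, "Apples": 0.3},
    name="protein",
)

# Keep values that are either very low or very high.
extremes = ser[(ser < 0.5) | (ser > 2)]

print(extremes)
```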


In [100]:
# NOT ( ~ ) operator

# Get proteins that are not greater than 1
condition = ~(ser > 1)
print("Masked series:", condition)
print("\nActual Series:", ser[condition])

Masked series: Avocado         False
Guava           False
Blackberries    False
Oranges          True
Banana          False
Apples           True
Kiwi            False
Pomegranate     False
Mango            True
Cherries         True
Name: protein, dtype: bool

Actual Series: Oranges     0.9
Apples      0.3
Mango       0.8
Cherries    1.0
Name: protein, dtype: float64
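
The element-wise operators `&`, `|` and `~` are needed because Python's own `and`, `or` and `not` expect a single True or False, not a whole Series. The parentheses matter as well, since `&` and `|` bind more tightly than comparisons such as `>`. A small sketch on a shortened copy of the data:

```python
import pandas as pd

ser = pd.Series({"Oranges": 0.9, "Banana": 1.1, "Guava": 2.6}, name="protein")

# Python's `and` tries to reduce each Series to one bool and fails.
try:
    (ser > 0.5) and (ser < 2)
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__

# The element-wise form, with each comparison in parentheses, works.
selected = ser[(ser > 0.5) & (ser < 2)]

print(and_error)  # ValueError
print(selected)
```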


# Modifying Series

In [101]:
# Update protein of Mango to 2.8

ser["Mango"] = 2.8

ser

Avocado         2.0
Guava           2.6
Blackberries    2.0
Oranges         0.9
Banana          1.1
Apples          0.3
Kiwi            1.1
Pomegranate     1.7
Mango           2.8
Cherries        1.0
Name: protein, dtype: float64
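
Assignment also works for several labels at once, for labels that do not exist yet, and through a boolean mask. A minimal sketch with made-up replacement values; `"Papaya"` is a hypothetical new label.

```python
import pandas as pd

ser = pd.Series({"Mango": 0.8, "Kiwi": 1.1, "Apples": 0.3}, name="protein")

# Update several existing labels in one statement.
ser.loc[["Kiwi", "Apples"]] = [1.2, 0.4]

# Assigning to a new label appends it to the Series.
ser.loc["Papaya"] = 0.5

# Assign through a mask: raise every value below 0.5 up to 0.5.
ser[ser < 0.5] = 0.5

print(ser)
```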

In [103]:
# Bonus: count the non-missing values (`-np.nan` is still NaN, so two of the five entries are missing)

ser = pd.Series(["a", np.nan, 1, -np.nan, 2])
ser.notnull().sum()

np.int64(3)
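
The same count can be reached in other ways, and the missing entries can be inspected or dropped. A short sketch using the same five values:

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", np.nan, 1, -np.nan, 2])

# isna() marks the missing entries; summing the mask counts them.
missing_count = int(ser.isna().sum())

# count() returns the number of non-missing entries directly.
non_missing = int(ser.count())

# dropna() returns a new Series without the missing entries.
cleaned = ser.dropna()

print(missing_count)  # 2
print(non_missing)    # 3
print(cleaned.tolist())
```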