# Import libraries

In [None]:
from datetime import datetime

import numpy as np
import pandas as pd

# Pandas objects

At the most basic level, Pandas objects can be thought of as enhanced versions of `NumPy` structured arrays in which the rows and columns are identified with _labels_ rather than simple integer indices.

The three fundamental data structures in `pandas` are `Series`, `DataFrame`, and `Index`. We'll look at each one in more detail in the subsequent sections.

# `Series`:

A Pandas `Series` is a one-dimensional array of **indexed** data. It can be created from a `list`, a `dictionary`, or a `NumPy` array.

It can be thought of as a **one-dimensional** `NumPy` array of **homogeneous** values, accompanied by a _labeled_ axis and a _name_.
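As a minimal sketch of the three construction routes mentioned above, the cell below builds a `Series` from a list, a dictionary, and a `NumPy` array. The variable names and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns a default integer index (0, 1, 2, ...).
series_from_list = pd.Series([10, 20, 30], name="from_list")

# From a dictionary: the keys become the index labels.
series_from_dict = pd.Series({"a": 10, "b": 20, "c": 30}, name="from_dict")

# From a NumPy array: the array's dtype is carried over to the Series.
series_from_array = pd.Series(np.array([1.5, 2.5, 3.5]), name="from_array")

print(series_from_dict.index.tolist())  # ['a', 'b', 'c']
print(series_from_array.dtype)          # float64
print(series_from_list.name)            # from_list
```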

The following figure illustrates the difference between a `NumPy` array and a `pd.Series`:

<div>
    <img src="img/series-vs-np-array.png" alt='numpy-array-vs-pandas-series' width="800"/>
</div>

The essential difference is the presence of the **index**. While the `NumPy` array has an _implicitly_ defined integer index used to access the values, the Pandas `Series` has an _explicitly_ defined index associated with the values.

We'll talk about the advantages of the `Index` in a minute.

In [None]:
example_series = pd.Series(data=[10, 20, 30])

In [None]:
example_series

In [None]:
print(f"example_series.values: {example_series.values}")
print(f"example_series.index: {example_series.index}")
print(f"example_series.shape: {example_series.shape}")
print(f"example_series.dtype: {example_series.dtype}")

In [None]:
example_np_array = np.array([10, 20, 30])

In [None]:
print(f"example_np_array: {example_np_array}")
print(f"example_np_array.shape: {example_np_array.shape}")
print(f"example_np_array.dtype: {example_np_array.dtype}")

As we can see, the `Series` wraps both a sequence of `values` and a sequence of `indices`. In this example, the default integer index was used because no `index` was passed to the `Series` constructor.

Similar to `NumPy` arrays, we can access `Series` data with square-bracket indexing and slicing:

In [None]:
print(f"first element in example_series: {example_series[0]}")

In [None]:
print(f"first two elements in example_series: \n{example_series[:2]}")
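With the default integer index, labels and positions happen to coincide, which can hide the difference between the two. The sketch below makes the choice explicit with the `.iloc` (position-based) and `.loc` (label-based) accessors; the variable names are illustrative.

```python
import pandas as pd

example_series = pd.Series(data=[10, 20, 30])

# .iloc always selects by integer position.
first_by_position = example_series.iloc[0]

# .loc always selects by index label; with the default index the
# label 0 happens to sit at position 0.
first_by_label = example_series.loc[0]

# Positional slicing excludes the end position, as in NumPy.
first_two = example_series.iloc[:2]

print(first_by_position, first_by_label)  # 10 10
print(first_two.tolist())                 # [10, 20]
```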

### `Series` with string index

We'll create a `pd.Series` with a string `index` to understand how it works:

<div>
    <img src="img/series-string-index.png" alt='series-string-index' width="800"/>
</div>

In [None]:
student_grades_series = pd.Series(
    data=[90, 83, 73, 57, 91], index=["Sami", "Ahmed", "Qusai", "Saeed", "Yamen"]
)

In [None]:
student_grades_series

In [None]:
student_grades_series.index

We can use the index to access elements in the series:

In [None]:
# get the corresponding grade for Sami
student_grades_series["Sami"]

In [None]:
# using slicing, get all grades starting from Qusai
student_grades_series["Qusai":]
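One detail worth checking when slicing by label: unlike positional slicing, a label-based slice includes both endpoints. The following sketch compares the two on the same student data; the names `label_slice` and `position_slice` are illustrative.

```python
import pandas as pd

student_grades_series = pd.Series(
    data=[90, 83, 73, 57, 91], index=["Sami", "Ahmed", "Qusai", "Saeed", "Yamen"]
)

# Label-based slice: both "Ahmed" and "Saeed" are included.
label_slice = student_grades_series["Ahmed":"Saeed"]

# Position-based slice: position 3 ("Saeed") is excluded.
position_slice = student_grades_series.iloc[1:3]

print(label_slice.index.tolist())     # ['Ahmed', 'Qusai', 'Saeed']
print(position_slice.index.tolist())  # ['Ahmed', 'Qusai']
```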

### `Series` with datetime index

Similar to the previous example, here we'll create a `Series` with a datetime index.

Suppose we have the daily temperature for Damascus city for the first week of September 2020:

<div>
    <img src='img/series-datetime-index.png' alt='series-datetime-index'>
</div>

In [None]:
date_format = "%Y-%m-%d"

In [None]:
temperatures_list = [33, 36, 39, 41, 42, 40, 38]

In [None]:
dates_list = [
    datetime.strptime("2020-09-01", date_format),
    datetime.strptime("2020-09-02", date_format),
    datetime.strptime("2020-09-03", date_format),
    datetime.strptime("2020-09-04", date_format),
    datetime.strptime("2020-09-05", date_format),
    datetime.strptime("2020-09-06", date_format),
    datetime.strptime("2020-09-07", date_format),
]

In [None]:
daily_temperature_series = pd.Series(data=temperatures_list, index=dates_list)

In [None]:
daily_temperature_series

In [None]:
daily_temperature_series.index
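Parsing each date by hand works, but for a regular sequence of days `pd.date_range` can generate the same index in one call. This is an alternative sketch, not the approach used in the cells above; `dates_index` is an illustrative name.

```python
import pandas as pd

temperatures_list = [33, 36, 39, 41, 42, 40, 38]

# Seven consecutive days starting on 2020-09-01 (freq="D" means daily).
dates_index = pd.date_range(start="2020-09-01", periods=7, freq="D")

daily_temperature_series = pd.Series(data=temperatures_list, index=dates_index)

print(daily_temperature_series.index[0])   # 2020-09-01 00:00:00
print(daily_temperature_series.index[-1])  # 2020-09-07 00:00:00
```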

Now, we can access individual elements in the `Series` using the index:

What was the temperature on `2020-09-02`?

In [None]:
daily_temperature_series["2020-09-02"]

In [None]:
daily_temperature_series["2020-09-04":"2020-09-06"]

If we pass a non-existent index value, we get a `KeyError` exception:

In [None]:
daily_temperature_series["2020-09-09"]
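When a missing label is a normal situation rather than a bug, the lookup can be guarded instead of letting the `KeyError` propagate. The sketch below shows three common options; it rebuilds the temperature series with `pd.date_range` for brevity, and the names `missing_date`, `is_present`, `temperature_or_none`, and `lookup_failed` are illustrative.

```python
import pandas as pd

daily_temperature_series = pd.Series(
    data=[33, 36, 39, 41, 42, 40, 38],
    index=pd.date_range(start="2020-09-01", periods=7, freq="D"),
)
missing_date = pd.Timestamp("2020-09-09")

# Option 1: test membership against the index before looking up.
is_present = missing_date in daily_temperature_series.index

# Option 2: Series.get returns a default value instead of raising.
temperature_or_none = daily_temperature_series.get(missing_date, default=None)

# Option 3: catch the KeyError explicitly.
try:
    daily_temperature_series.loc[missing_date]
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False

print(is_present, temperature_or_none, lookup_failed)  # False None True
```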

We can use boolean indexing, similar to `NumPy`:

In [None]:
daily_temperature_series[daily_temperature_series > 40]
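Several conditions can be combined into one boolean mask with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>=` and `<=`. A minimal sketch on the same temperatures, with illustrative variable names:

```python
import pandas as pd

daily_temperature_series = pd.Series(
    data=[33, 36, 39, 41, 42, 40, 38],
    index=pd.date_range(start="2020-09-01", periods=7, freq="D"),
)

# Days with a temperature between 38 and 40, inclusive.
mask = (daily_temperature_series >= 38) & (daily_temperature_series <= 40)
warm_days = daily_temperature_series[mask]

print(mask.tolist())       # [False, False, True, False, False, True, True]
print(warm_days.tolist())  # [39, 40, 38]
```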

## `Series` methods:

### `value_counts`

The `value_counts` method is one of the most commonly used methods when working with `pd.Series`. For a `Series` of _categorical_ data, it returns a new `Series` whose **index** holds the unique values and whose **values** hold the counts of the corresponding elements.

The figure below illustrates how `value_counts` works for a series containing fruits data:

<div>
    <img src='img/series-value-counts.png'>
</div>

We can see that the values in the new `Series` are sorted in descending order according to their frequency.

Passing `normalize=True` returns relative frequencies (fractions that sum to 1) rather than raw counts.

In [None]:
fruits_series = pd.Series(
    data=[
        "Apple",
        "Orange",
        "Banana",
        "Apple",
        "Apple",
        "Strawberry",
        "Banana",
        "Orange",
    ]
)

In [None]:
fruits_series

In [None]:
fruits_series.value_counts()

In [None]:
fruits_series.value_counts(normalize=True)
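Since `normalize=True` yields fractions, multiplying by 100 gives percentages. The sketch below also checks that the fractions add up to 1; `counts`, `fractions`, and `percentages` are illustrative names.

```python
import pandas as pd

fruits_series = pd.Series(
    data=["Apple", "Orange", "Banana", "Apple", "Apple", "Strawberry", "Banana", "Orange"]
)

counts = fruits_series.value_counts()
fractions = fruits_series.value_counts(normalize=True)
percentages = fractions * 100

print(counts["Apple"])       # 3
print(percentages["Apple"])  # 37.5
print(fractions.sum())       # 1.0
```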

### Aggregation functions:

Similar to `NumPy` arrays, a pandas `Series` provides the usual aggregation functions:

In [None]:
series_of_integers = pd.Series(data=[1, 2, 3, 4, 5, 6])

In [None]:
print(f"Min: {series_of_integers.min()}")
print(f"Max: {series_of_integers.max()}")
print(f"Sum: {series_of_integers.sum()}")
print(f"Mean: {series_of_integers.mean()}")
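When several aggregates are needed at once, `Series.agg` accepts a list of function names and returns the results as a new `Series` labeled by those names. A short sketch, with `summary` as an illustrative name:

```python
import pandas as pd

series_of_integers = pd.Series(data=[1, 2, 3, 4, 5, 6])

# One call computes all four aggregates; the result is indexed by name.
summary = series_of_integers.agg(["min", "max", "sum", "mean"])

print(summary.index.tolist())  # ['min', 'max', 'sum', 'mean']
print(summary["mean"])         # 3.5
```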

# Series indexing

The `data` argument passed to `pd.Series` must be a **one-dimensional** array. Otherwise, an exception is raised:

In [None]:
arr = np.arange(4).reshape((2, 2))

In [None]:
pd.Series(arr)
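If the two-dimensional data really should become a single `Series`, one possible workaround is to flatten the array first. The sketch below catches the construction error and then uses `ndarray.ravel`; the exact exception type is assumed to be a `ValueError`, as in recent pandas releases, so the code catches the broader `Exception` to stay version-agnostic.

```python
import numpy as np
import pandas as pd

arr = np.arange(4).reshape((2, 2))

try:
    pd.Series(arr)
except Exception as error:  # a ValueError in recent pandas releases
    construction_failed = True
    print(f"Could not build the Series: {error}")
else:
    construction_failed = False

# Flatten the 2x2 array row by row into a one-dimensional array.
flattened_series = pd.Series(arr.ravel())
print(flattened_series.tolist())  # [0, 1, 2, 3]
```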

# `DataFrame`:

In [None]:
syria_governorate_population_dict = {
    "Aleppo Governorate": 4600166,
    "Raqqa Governorate": 919000,
    "As-Suwayda Governorate": 364000,
    "Damascus Governorate": 2211042,
    "Daraa Governorate": 998000,
    "Deir ez-Zor Governorate": 1200500,
    "Hama Governorate": 1593000,
    "Hasaka Governorate": 1272702,
    "Homs Governorate": 1762500,
    "Idlib Governorate": 1464000,
    "Latakia Governorate": 1278486,
    "Quneitra Governorate": 87000,
    "Rif Dimashq Governorate": 2831738,
    "Tartus Governorate": 785000,
}

syria_governorate_area_dict = {
    "Aleppo Governorate": 18482,
    "Raqqa Governorate": 19616,
    "As-Suwayda Governorate": 5550,
    "Damascus Governorate": 1599,
    "Daraa Governorate": 3730,
    "Deir ez-Zor Governorate": 33060,
    "Hama Governorate": 8883,
    "Hasaka Governorate": 23334,
    "Homs Governorate": 42223,
    "Idlib Governorate": 6097,
    "Latakia Governorate": 2297,
    "Quneitra Governorate": 1861,
    "Rif Dimashq Governorate": 18032,
    "Tartus Governorate": 1892,
}

In [None]:
pd.DataFrame([syria_governorate_population_dict, syria_governorate_area_dict])

In [None]:
syria_governorate_area_series = pd.Series(syria_governorate_area_dict, name="area")

In [None]:
syria_governorate_population_series = pd.Series(
    syria_governorate_population_dict, name="population"
)

In [None]:
pd.concat([syria_governorate_area_series, syria_governorate_population_series], axis=1)
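Passing a list of dictionaries to `pd.DataFrame` turns each dictionary into a row, whereas combining named `Series` turns each one into a column aligned on the shared index. The sketch below uses three of the governorates from the dictionaries above to keep it short and derives a new column from the two existing ones. The `density` column and the variable names are illustrative additions, and the area unit is whatever the source data uses.

```python
import pandas as pd

area_series = pd.Series(
    {"Damascus Governorate": 1599, "Quneitra Governorate": 1861, "Tartus Governorate": 1892},
    name="area",
)
population_series = pd.Series(
    {"Damascus Governorate": 2211042, "Quneitra Governorate": 87000, "Tartus Governorate": 785000},
    name="population",
)

# Each Series becomes a column; rows are aligned on the governorate names.
governorates_df = pd.concat([area_series, population_series], axis=1)

# An equivalent construction from a dictionary of Series.
same_df = pd.DataFrame({"area": area_series, "population": population_series})

# Column arithmetic is element-wise and aligned on the index.
governorates_df["density"] = governorates_df["population"] / governorates_df["area"]

print(governorates_df.columns.tolist())     # ['area', 'population', 'density']
print(governorates_df["density"].idxmax())  # Damascus Governorate
```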

# `pd.read_html` example:

In [None]:
syria_governorates_df = pd.read_html(
    "https://en.wikipedia.org/wiki/Governorates_of_Syria", match="Governorate name"
)[0]

# Working with missing values:

- `pd.Series.isnull`
- `pd.Series.notnull`
- `pd.DataFrame.isnull`
- `pd.DataFrame.notnull`
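A minimal sketch of how the four methods listed above behave. The sample data is made up for illustration: `isnull` marks missing entries (`NaN` or `None`) with `True`, and `notnull` is its element-wise opposite.

```python
import numpy as np
import pandas as pd

series_with_gap = pd.Series([1.0, np.nan, 3.0])

# Boolean Series of the same length: True where the value is missing.
missing_mask = series_with_gap.isnull()
present_count = series_with_gap.notnull().sum()

df_with_gaps = pd.DataFrame({"a": [1.0, np.nan], "b": [None, "x"]})

# On a DataFrame the result is a boolean DataFrame; summing it
# counts the missing values in each column.
missing_per_column = df_with_gaps.isnull().sum()

print(missing_mask.tolist())         # [False, True, False]
print(present_count)                 # 2
print(missing_per_column.to_dict())  # {'a': 1, 'b': 1}
```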