# Import libraries

Install libraries if not already installed by running:

```
pip install pandas
pip install numpy
```

In [None]:
from datetime import datetime

import numpy as np
import pandas as pd

# Pandas objects

At the most basic level, Pandas objects can be thought of as enhanced versions of `NumPy` structured arrays in which the rows and columns are identified with _labels_ rather than simple integer indices.

The two fundamental data structures in `pandas` are `Series` and `DataFrame`. We'll dive into each one in more detail in the subsequent sections.

# `Series`:

A Pandas `Series` is a one-dimensional array of **indexed** data. It can be created from a `list`, a `dictionary`, or a `NumPy` array.

It can be thought of as a **one-dimensional** `NumPy` array of **homogeneous** values, accompanied by a _labeled_ axis (the index) and an optional _name_.
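
As a minimal sketch of the three construction routes mentioned above, the snippet below builds a `Series` from a list, a dictionary, and a `NumPy` array. The variable names and the labels `"a"`, `"x"`, etc. are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a list: pandas assigns the default integer index 0, 1, 2
from_list = pd.Series(data=[10, 20, 30], name="from_list")

# From a dictionary: the keys become the index labels
from_dict = pd.Series(data={"a": 10, "b": 20, "c": 30}, name="from_dict")

# From a NumPy array, with an explicitly provided index
from_array = pd.Series(
    data=np.array([10, 20, 30]), index=["x", "y", "z"], name="from_array"
)

print(from_list.index.tolist())
print(from_dict.index.tolist())
print(from_array.name)
```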

The following figure illustrates the difference between a `NumPy` array and a `pd.Series`:

<div>
    <img src="img/series-vs-np-array.png" alt='numpy-array-vs-pandas-series' width="800"/>
</div>

The essential difference is the presence of the **index**. While the `NumPy` array has an _implicitly_ defined integer index used to access the values, the Pandas `Series` has an _explicitly_ defined index associated with the values.

We'll talk about the advantages of the `Index` in a minute.

## Creating `Series` object

In [None]:
example_series = pd.Series(data=[10, 20, 30])

In [None]:
example_series

In [None]:
print(f"example_series.values: {example_series.values}")
print(f"example_series.index: {example_series.index}")
print(f"example_series.shape: {example_series.shape}")
print(f"example_series.dtype: {example_series.dtype}")

In [None]:
example_np_array = np.array([10, 20, 30])

In [None]:
print(f"example_np_array: {example_np_array}")
print(f"example_np_array.shape: {example_np_array.shape}")
print(f"example_np_array.dtype: {example_np_array.dtype}")

As we can see, the `Series` wraps both a sequence of values (`values`) and a sequence of labels (`index`). Because no index was passed to the `Series` constructor in this example, the default integer index was used.

Similar to `NumPy` arrays, we can access `Series` data:

In [None]:
print(f"first element in example_series: {example_series[0]}")

In [None]:
print(f"first two elements in example_series: \n{example_series[:2]}")

### `Series` with string index

We'll create a `pd.Series` with a string `index` to understand how it works:

<div>
    <img src="img/series-string-index.png" alt='series-string-index' width="850"/>
</div>

In [None]:
student_grades_series = pd.Series(
    data=[90, 83, 73, 57, 91], index=["Sami", "Ahmed", "Qusai", "Saeed", "Yamen"]
)

In [None]:
student_grades_series

In [None]:
student_grades_series.index

We can use the index to access elements in the series:

In [None]:
# get the corresponding grade for Sami
student_grades_series["Sami"]

In [None]:
# using slicing, get all grades starting from Qusai
student_grades_series["Qusai":]

### `Series` with datetime index

Similar to the previous example, here we'll create a `Series` with a datetime index.

Suppose we have the daily temperature for Damascus city for the first week of September 2020:

<div>
    <img src='img/series-datetime-index.png' alt='series-datetime-index'>
</div>

In [None]:
date_format = "%Y-%m-%d"

In [None]:
temperatures_list = [33, 36, 39, 41, 42, 40, 38]

In [None]:
dates_list = [
    datetime.strptime("2020-09-01", date_format),
    datetime.strptime("2020-09-02", date_format),
    datetime.strptime("2020-09-03", date_format),
    datetime.strptime("2020-09-04", date_format),
    datetime.strptime("2020-09-05", date_format),
    datetime.strptime("2020-09-06", date_format),
    datetime.strptime("2020-09-07", date_format),
]

In [None]:
daily_temperature_series = pd.Series(data=temperatures_list, index=dates_list)

In [None]:
daily_temperature_series

In [None]:
daily_temperature_series.index

Now, we can access individual elements in the `Series` using the index:

What was the temperature on `2020-09-02`?

In [None]:
daily_temperature_series["2020-09-02"]

What was the temperature between `2020-09-04` and `2020-09-06`?

In [None]:
daily_temperature_series["2020-09-04":"2020-09-06"]

If we pass a non-existent index value, we get a `KeyError` exception:

In [None]:
daily_temperature_series["2020-09-09"]
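
One way to avoid the exception is to check for the label first or to use `Series.get`, which returns a default value instead of raising. The sketch below uses a small string-indexed series for brevity; the label `"Omar"` and the default `-1` are hypothetical.

```python
import pandas as pd

grades = pd.Series(data=[90, 83], index=["Sami", "Ahmed"])

# `in` tests membership against the index labels, not the values
has_omar = "Omar" in grades

# Series.get returns the default instead of raising KeyError
omar_grade = grades.get("Omar", default=-1)

# Catching the exception explicitly is also an option
try:
    grades["Omar"]
    raised = False
except KeyError:
    raised = True

print(has_omar, omar_grade, raised)
```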

We can use boolean indexing, similar to `NumPy`:

In [None]:
daily_temperature_series[daily_temperature_series > 40]

## `Series` most common methods:

### `value_counts`

The `value_counts` method is one of the most commonly used `pd.Series` methods. For a `Series` of _categorical_ data, it returns a new `Series` whose **index** holds the unique values and whose **values** hold the counts of the corresponding elements.

The figure below illustrates how `value_counts` works for a series containing fruits data:

<div>
    <img src='img/series-value-counts.png'>
</div>

We can see that the values in the new `Series` are sorted in descending order according to their frequency.

Passing `normalize=True` returns relative frequencies (proportions that sum to 1) rather than raw counts.

In [None]:
fruits_series = pd.Series(
    data=[
        "Apple",
        "Orange",
        "Banana",
        "Apple",
        "Apple",
        "Strawberry",
        "Banana",
        "Orange",
    ]
)

In [None]:
fruits_series

In [None]:
fruits_series.value_counts()

In [None]:
fruits_series.value_counts(normalize=True)

### `describe`

The `.describe` method calculates summary statistics for the series.

For _numeric_ series:
- `count`: number of elements.
- `mean`: mean value.
- `std`: standard deviation.
- `min`: minimum value.
- `25%`: first [quartile](https://en.wikipedia.org/wiki/Quartile).
- `50%`: second [quartile](https://en.wikipedia.org/wiki/Quartile) (the median).
- `75%`: third [quartile](https://en.wikipedia.org/wiki/Quartile).
- `max`: maximum value.

For _categorical_ series:
- `count`: number of elements.
- `unique`: number of unique elements.
- `top`: most common element.
- `freq`: frequency of most common element.

In [None]:
numeric_series = pd.Series(data=[1.2, 2.9, 3.05, 4, 5.6, 6.2])

In [None]:
categorical_series = pd.Series(data=["cat", "dog", "cat", "mouse", "lion"])

In [None]:
numeric_series.describe()

In [None]:
categorical_series.describe()
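
As a quick check of what the `50%` row means, the sketch below compares it with the median and the mean of the same numeric series used above. The variable name `summary` is an assumption for illustration.

```python
import pandas as pd

numeric_series = pd.Series(data=[1.2, 2.9, 3.05, 4, 5.6, 6.2])
summary = numeric_series.describe()

# The 50% entry is the median, which generally differs from the mean
print(f"50%:    {summary['50%']}")
print(f"median: {numeric_series.median()}")
print(f"mean:   {numeric_series.mean()}")
```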

## Aggregation functions:

Similar to `NumPy` arrays, pandas `Series` provides the usual aggregation functions:

In [None]:
series_of_integers = pd.Series(data=[1, 2, 3, 4, 5, 6])

In [None]:
print(f"Min: {series_of_integers.min()}")
print(f"Max: {series_of_integers.max()}")
print(f"Sum: {series_of_integers.sum()}")
print(f"Mean: {series_of_integers.mean()}")

## String-related functions:

Oftentimes, we have _textual_ data (words, sentences, paragraphs, etc.) that we want to manipulate.

In this section, we'll give examples of how to apply common string functions (`lower`, `upper`, `strip`, `contains`, `match`, etc.) to a series of data through the `.str` accessor.

### Series of words:

In [None]:
words_series = pd.Series(data=["HEllo", "hi  ", "HEY",])

In [None]:
lowercase_words = words_series.str.lower()

In [None]:
stripped_words = words_series.str.strip()

In [None]:
print(lowercase_words)

In [None]:
print(stripped_words)

In [None]:
normalized_words_series = words_series.str.lower().str.strip()

In [None]:
print(normalized_words_series)

### Series of sentences:
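
A minimal sketch of the same `.str` accessor applied to whole sentences. The sentences themselves are made up for illustration.

```python
import pandas as pd

sentences_series = pd.Series(
    data=["Pandas is fun", "NumPy is fast", "I really like pandas"]
)

# Case-insensitive substring test; returns a boolean Series
mentions_pandas = sentences_series.str.contains("pandas", case=False)

# Split each sentence on whitespace, then count the resulting words
word_counts = sentences_series.str.split().str.len()

# The boolean Series can be used as a mask to filter the sentences
pandas_sentences = sentences_series[mentions_pandas]

print(mentions_pandas)
print(word_counts)
print(pandas_sentences)
```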

# Series indexing

In [None]:
student_ids = [93, 11, 22, 51, 63]

In [None]:
student_grades_np_array = np.array([90, 83, 73, 57, 91])

In [None]:
student_grades_series = pd.Series(data=student_grades_np_array, index=student_ids)

In [None]:
student_grades_series

<div>
    <img src='img/series-indexing/1-student-grades-numpy-array-vs-pandas-series.png' width='700'>
</div>

## Single value

### Access value by its position (implicit index)

<div>
    <img src="img/series-indexing/2-student-grades-single-value-access.png" width='800'>
</div>

In [None]:
student_grades_np_array[1]

In [None]:
student_grades_series.iloc[1]

### Access value by its _label_ (explicit index)

<div>
    <img src='img/series-indexing/3-student-grades-single-value-access-by-label.png' width='700'>
</div>

In [None]:
student_grades_series.loc[51]

## Slicing

### Position-based

<div>
    <img src='img/series-indexing/4-student-grades-position-based-slicing.png' width='800'>
</div>

In [None]:
student_grades_np_array

In [None]:
student_grades_np_array[1:4]

In [None]:
student_grades_series.iloc[1:4]

### Label-based

<div>
    <img src='img/series-indexing/5-student-grades-label-based-slicing.png' width='700'>
</div>

In [None]:
student_grades_series.loc[11:51]
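
One difference worth noting between the two slicing styles: a position-based slice excludes its stop position, while a label-based slice includes its stop label. The sketch below rebuilds the same series to show this side by side.

```python
import pandas as pd

student_grades_series = pd.Series(
    data=[90, 83, 73, 57, 91], index=[93, 11, 22, 51, 63]
)

# Label 11 sits at position 1 and label 51 at position 3
by_position = student_grades_series.iloc[1:3]  # positions 1 and 2 (stop excluded)
by_label = student_grades_series.loc[11:51]    # labels 11, 22 and 51 (stop included)

print(by_position)
print(by_label)
```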

## Masking

<div>
    <img src='img/series-indexing/6-student-grades-boolean-masking.png' width='800'>
</div>

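A boolean mask is applied with plain `[]` or with `.loc`. The `.iloc` indexer is position-based and does not accept a boolean `Series` as a mask.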

In [None]:
student_grades_np_array[student_grades_np_array > 75]

In [None]:
student_grades_series[student_grades_series > 75]

In [None]:
student_grades_series.loc[student_grades_series > 75]
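
If position-based masking is needed anyway, one option is to convert the mask to a plain `NumPy` boolean array first, which `.iloc` does accept. A minimal sketch on the same data:

```python
import pandas as pd

student_grades_series = pd.Series(
    data=[90, 83, 73, 57, 91], index=[93, 11, 22, 51, 63]
)
mask = student_grades_series > 75

via_loc = student_grades_series.loc[mask]

# Dropping the index from the mask leaves a plain boolean array
via_iloc = student_grades_series.iloc[mask.to_numpy()]

print(via_iloc)
```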

## Fancy indexing:

### Position-based

<div>
    <img src='img/series-indexing/7-student-grades-position-based-fancy-indexing.png' width='800'>
</div>

In [None]:
student_grades_np_array[[0, 2, 4]]

In [None]:
student_grades_series.iloc[[0, 2, 4]]

### Label-based

<div>
    <img src='img/series-indexing/8-student-grades-label-based-fancy-indexing.png' width='800'>
</div>

In [None]:
student_grades_series.loc[[93, 11, 63]]

# Operations on multiple `Series`

## `Series` and a scalar

Similar to a `NumPy` array, we can:
- add a scalar to a `Series`
- subtract a scalar from a `Series`
- multiply a `Series` by a scalar
- divide a `Series` by a scalar

In [None]:
student_grades_series

Divide grades by `10`:

In [None]:
student_grades_series / 10

## Two `Series` numeric operations

In [None]:
dates_list = [
    datetime.strptime("2020-09-01", date_format),
    datetime.strptime("2020-09-02", date_format),
    datetime.strptime("2020-09-03", date_format),
    datetime.strptime("2020-09-04", date_format),
    datetime.strptime("2020-09-05", date_format),
    datetime.strptime("2020-09-06", date_format),
    datetime.strptime("2020-09-07", date_format),
    datetime.strptime("2020-09-08", date_format),
]

In [None]:
website_daily_visitors_series = pd.Series(
    data=[9, 3, 10, 2, 4, 0, 5, 12], index=dates_list
)

In [None]:
# session duration in minutes
website_daily_session_duration_series = pd.Series(
    data=[7, 20, 35, 7, 3, 0, 22, 40], index=dates_list
)

What is the (average) number of minutes a visitor spends every day?

In [None]:
website_session_duration_per_visitor_series = (
    website_daily_session_duration_series / website_daily_visitors_series
)

In [None]:
website_session_duration_per_visitor_series

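The previous example shows the power of pandas `Series`: the division is aligned on the index _labels_, so each day's total duration is divided by that same day's visitor count. Note that `2020-09-06` has zero visitors and zero minutes, so its result is `NaN` (`0 / 0`).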
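
Label alignment matters most when the two series do not share the same order or the same set of labels. The sketch below uses a subset of the values above with string labels (an assumption made for brevity) and deliberately shuffles one index.

```python
import pandas as pd

visitors = pd.Series(
    data=[9, 3, 10], index=["2020-09-01", "2020-09-02", "2020-09-03"]
)

# Same labels in a different order, plus one label that `visitors` lacks
duration = pd.Series(
    data=[35, 7, 20, 40],
    index=["2020-09-03", "2020-09-01", "2020-09-02", "2020-09-08"],
)

# Values are matched by label, not by position;
# labels present in only one operand produce NaN
per_visitor = duration / visitors

print(per_visitor)
```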

## `Series` concatenation

In certain situations, we have two `Series` objects that we want to combine into one.

We can concatenate two (or more) series into one with `pd.concat`.

In [None]:
september_dates_list = [
    datetime.strptime("2020-09-01", date_format),
    datetime.strptime("2020-09-02", date_format),
    datetime.strptime("2020-09-03", date_format),
    datetime.strptime("2020-09-04", date_format),
]

In [None]:
october_dates_list = [
    datetime.strptime("2020-10-05", date_format),
    datetime.strptime("2020-10-06", date_format),
    datetime.strptime("2020-10-07", date_format),
    datetime.strptime("2020-10-08", date_format),
]

In [None]:
september_data = pd.Series(data=[31, 29, 27, 35], index=september_dates_list)

In [None]:
october_data = pd.Series(data=[24, 26, 21, 27], index=october_dates_list)

In [None]:
september_data

In [None]:
october_data

In [None]:
concatenated_data = pd.concat([september_data, october_data])

In [None]:
concatenated_data
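
By default `pd.concat` keeps the original labels of each input. If the labels are not meaningful in the combined result, passing `ignore_index=True` replaces them with a fresh integer index. A minimal sketch on shortened versions of the two series (string labels are used here for brevity):

```python
import pandas as pd

september_data = pd.Series(data=[31, 29], index=["2020-09-01", "2020-09-02"])
october_data = pd.Series(data=[24, 26], index=["2020-10-05", "2020-10-06"])

# Default behaviour: the original labels are preserved
kept = pd.concat([september_data, october_data])

# ignore_index=True discards the labels and renumbers from 0
renumbered = pd.concat([september_data, october_data], ignore_index=True)

print(kept)
print(renumbered)
```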

# Series common errors:

The `data` argument passed to `pd.Series` must be a **one-dimensional** array. Otherwise, an exception is raised:

In [None]:
arr = np.arange(4).reshape((2, 2))

In [None]:
pd.Series(arr)
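
A possible workaround, sketched below, is to flatten the array to one dimension before passing it in. The sketch assumes a recent pandas version, where the exception raised is a `ValueError`.

```python
import numpy as np
import pandas as pd

arr = np.arange(4).reshape((2, 2))

# Passing the 2-D array directly fails
try:
    pd.Series(arr)
    error_message = None
except ValueError as error:
    error_message = str(error)

# Flattening to one dimension first works
flat_series = pd.Series(arr.ravel())

print(error_message)
print(flat_series)
```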

# `DataFrame`:

A `DataFrame` can be considered as a two-dimensional `np.array`, with _labeled_ axes (rows and columns).

Additionally, a `DataFrame` can hold **heterogeneous** data: the values within each column (`Series`) share one data type, but each column can have a different data type.

<div>
    <img src='img/weather-multiple-series.png' width='800'>
</div>

<div>
    <img src='img/weather-single-dataframe.png' width='800'>
</div>

In [None]:
week_days = [
    "Saturday",
    "Sunday",
    "Monday",
    "Tuesday",
    "Wednesday",
    "Thursday",
    "Friday",
]

In [None]:
weather_series = pd.Series(
    data=["Rainy", "Sunny", "Sunny", "Sunny", "Sunny", "Cloudy", "Rainy"],
    index=week_days,
    name="weather",
)

In [None]:
temperature_series = pd.Series(
    data=[11.07, 17.50, 12.79, 19.67, 17.51, 14.44, 10.51],
    index=week_days,
    name="temperature",
)

In [None]:
wind_speed_series = pd.Series(
    data=[27, 20, 13, 28, 16, 11, 26], index=week_days, name="wind_speed"
)

In [None]:
humidity_series = pd.Series(
    data=[62, 10, 30, 96, 20, 22, 79], index=week_days, name="humidity"
)

In [None]:
weather_data = pd.DataFrame(
    data={
        "weather": weather_series,
        "temperature": temperature_series,
        "wind_speed": wind_speed_series,
        "humidity": humidity_series,
    }
)

In [None]:
weather_data.columns

In [None]:
weather_data.index

In [None]:
weather_data.head()
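
To make the per-column typing concrete, the sketch below builds a three-day subset of the weather data and inspects the column types. The name `weather_sample` is an assumption for illustration.

```python
import pandas as pd

weather_sample = pd.DataFrame(
    data={
        "weather": ["Rainy", "Sunny", "Sunny"],
        "temperature": [11.07, 17.50, 12.79],
        "wind_speed": [27, 20, 13],
    },
    index=["Saturday", "Sunday", "Monday"],
)

# Each column has its own dtype
print(weather_sample.dtypes)

# Selecting a single column returns a Series that shares the row index
temperature_column = weather_sample["temperature"]
print(type(temperature_column))
```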

# Creating `DataFrame`

## Dictionary of Series

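Each `Series` becomes a column of the `DataFrame`, and their shared index becomes the row index. With population and area side by side, a derived quantity such as population density can be computed column-wise.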

In [None]:
syria_governorate_list = [
    "Aleppo",
    "Raqqa",
    "As-Suwayda",
    "Damascus",
    "Daraa",
    "Deir ez-Zor",
    "Hama",
    "Hasaka",
    "Homs",
    "Idlib",
    "Latakia",
    "Quneitra",
    "Rif Dimashq",
    "Tartus",
]

In [None]:
syria_governorate_population_series = pd.Series(
    data=[
        4600166,
        919000,
        364000,
        2211042,
        998000,
        1200500,
        1593000,
        1272702,
        1762500,
        1464000,
        1278486,
        87000,
        2831738,
        785000,
    ],
    index=syria_governorate_list,
)

In [None]:
syria_governorate_area_series = pd.Series(
    data=[
        18482,
        19616,
        5550,
        1599,
        3730,
        33060,
        8883,
        23334,
        42223,
        6097,
        2297,
        1861,
        18032,
        1892,
    ],
    index=syria_governorate_list,
)

In [None]:
syria_data = pd.DataFrame(
    data={
        "population": syria_governorate_population_series,
        "area": syria_governorate_area_series,
    }
)

In [None]:
syria_data.head()
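
A minimal sketch of the population density calculation on three of the governorates above. The column name `density` is an assumption, and the area values are assumed to be in square kilometres since the notebook does not state a unit.

```python
import pandas as pd

governorates = ["Aleppo", "Damascus", "Quneitra"]
population = pd.Series(data=[4600166, 2211042, 87000], index=governorates)
area = pd.Series(data=[18482, 1599, 1861], index=governorates)  # assumed km^2

sample = pd.DataFrame(data={"population": population, "area": area})

# Column-wise division is aligned row by row on the index labels
sample["density"] = sample["population"] / sample["area"]

# Label of the row with the highest density
densest = sample["density"].idxmax()

print(sample)
print(densest)
```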

In [None]:
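# pd.read_csv needs a file path or buffer as its first argument;
# calling it with no arguments raises a TypeError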

## From two-dimensional `NumPy` array:

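In [None]:
# these dicts are not defined earlier in the notebook; derive them from the Series above
syria_governorate_population_dict = syria_governorate_population_series.to_dict()
syria_governorate_area_dict = syria_governorate_area_series.to_dict()
# a list of dicts yields one row per dict, with the dict keys as columns
pd.DataFrame([syria_governorate_population_dict, syria_governorate_area_dict])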

In [None]:
syria_governorate_area_series = pd.Series(syria_governorate_area_dict, name="area")

In [None]:
syria_governorate_population_series = pd.Series(
    syria_governorate_population_dict, name="population"
)

In [None]:
pd.concat([syria_governorate_area_series, syria_governorate_population_series], axis=1)
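
For the case named in this section's heading, a `DataFrame` can also be built directly from a two-dimensional `NumPy` array by supplying the row and column labels separately. The sketch below uses three of the governorates; the names `values` and `from_array_df` are illustrative.

```python
import numpy as np
import pandas as pd

# One row per governorate, one column per measurement
values = np.array(
    [
        [4600166, 18482],
        [2211042, 1599],
        [87000, 1861],
    ]
)

from_array_df = pd.DataFrame(
    data=values,
    index=["Aleppo", "Damascus", "Quneitra"],
    columns=["population", "area"],
)

print(from_array_df)
```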

# Reading `DataFrame` from a file:

## `pd.read_csv`:
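
A minimal sketch of `pd.read_csv`. To keep it self-contained, the CSV content is held in an in-memory string wrapped in `io.StringIO` rather than read from a file on disk; the column name `governorate` is an assumption.

```python
from io import StringIO

import pandas as pd

csv_text = """governorate,population,area
Aleppo,4600166,18482
Damascus,2211042,1599
Quneitra,87000,1861
"""

# read_csv accepts a path or any file-like object;
# index_col selects the column to use as row labels
governorates_df = pd.read_csv(StringIO(csv_text), index_col="governorate")

print(governorates_df)
```

With a real file, the first argument would be the path to that file instead of the `StringIO` object.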

## `pd.read_html`:

In [None]:
syria_governorates_df = pd.read_html(
    "https://en.wikipedia.org/wiki/Governorates_of_Syria", match="Governorate name"
)[0]

# Working with missing values:

- `pd.Series.isnull`
- `pd.Series.notnull`
- `pd.DataFrame.isnull`
- `pd.DataFrame.notnull`
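
The methods listed above can be sketched as follows. The data is made up for illustration; both `np.nan` and `None` are treated as missing in a numeric series.

```python
import numpy as np
import pandas as pd

temperatures = pd.Series(data=[11.07, np.nan, 12.79, None])

# isnull marks missing entries with True
missing_mask = temperatures.isnull()
n_missing = int(missing_mask.sum())

# notnull is the complement, handy for filtering out missing values
present = temperatures[temperatures.notnull()]

# On a DataFrame, isnull works element-wise; summing counts per column
weather_df = pd.DataFrame({"temperature": [11.07, np.nan], "humidity": [62, 10]})
missing_per_column = weather_df.isnull().sum()

print(missing_mask)
print(present)
print(missing_per_column)
```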

# Resources:

- [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Comparison with SQL](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html)
- [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)