# Pandas Review

Pandas is an open-source Python library that offers robust data structures and tools for data analysis. It is designed to make data manipulation and analysis tasks simpler and more efficient. The name "Pandas" is derived from "Panel Data," which refers to multi-dimensional structured datasets commonly used in econometrics and finance [Pandas Developers, 2023].

## Data Structures:

1. **Series**: A Series is a one-dimensional labeled array, similar to a NumPy array but with an associated index. The index provides a meaningful label for each element, which enables data alignment and retrieval by label [Pandas Developers, 2023].

2. **DataFrame**: The DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, and each column can hold its own data type. DataFrames support operations such as filtering, joining, and grouping on structured data [Pandas Developers, 2023].


<font color='Blue'><b>Example - Series:</b></font> A Pandas Series object can be created with the [pd.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) constructor.

In [None]:
import pandas as pd

# Create a Pandas Series with custom index
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Print the Pandas Series
print("Pandas Series:")
print(data)
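
Because each element carries a label, values can be retrieved by label, and arithmetic keeps the index intact. A minimal sketch, reusing the Series from the example above (the variable names `value_b`, `subset`, and `doubled` are illustrative):

```python
import pandas as pd

# Same Series as in the example above
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Retrieve a single value by its label
value_b = data['B']

# Retrieve several values by passing a list of labels
subset = data[['A', 'D']]

# Arithmetic is applied element-wise and preserves the index
doubled = data * 2

print(value_b)
print(subset)
print(doubled)
```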

<font color='Blue'><b>Example - DataFrame:</b></font> A Pandas DataFrame object can be created with the [pd.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) constructor.

In [None]:
import pandas as pd

# Create a sample DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['Calgary', 'Edmonton', 'Red Deer']
        }

df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
display(df)

# Alice, Bob, and Charlie are fictitious names commonly used as placeholders, notably in discussions of cryptographic systems and protocols.

## Pandas Basics


| Command                    | Description                                                               |
|----------------------------|---------------------------------------------------------------------------|
| `pd.DataFrame(data)`       | Create a DataFrame from data like a dictionary, array, or list.           |
| `data.info()`              | Display basic information about the DataFrame, including data types and non-null counts. |
| `data.head(n)`             | Display the first n rows of the DataFrame (default is 5).                |
| `data.tail(n)`             | Display the last n rows of the DataFrame (default is 5).                 |
| `data.describe()`          | Display summary statistics of numerical columns (count, mean, std, min, max, quartiles). |
| `data.shape`               | Return the number of rows and columns in the DataFrame as a tuple.       |
| `data.columns`             | Access the column labels of the DataFrame.                                |


In [None]:
import pandas as pd
import numpy as np

# Create the DataFrame
data = pd.DataFrame({'A': np.arange(0, 100),
                     'B': np.arange(1000, 900, -1)})

# Display basic DataFrame information
print("Displaying DataFrame Information:")
display(data)


# Examples

# Display the first 5 rows (the default for head())
print("Displaying First 5 Rows:")
display(data.head())

# Display summary statistics
print("\nSummary Statistics:")
display(data.describe())

Displaying DataFrame Information:


Unnamed: 0,A,B
0,0,1000
1,1,999
2,2,998
3,3,997
4,4,996
...,...,...
95,95,905
96,96,904
97,97,903
98,98,902
99,99,901


Displaying First 5 Rows:


Unnamed: 0,A,B
0,0,1000
1,1,999
2,2,998
3,3,997
4,4,996



Summary Statistics:


Unnamed: 0,A,B
count,100.0,100.0
mean,49.5,950.5
std,29.011492,29.011492
min,0.0,901.0
25%,24.75,925.75
50%,49.5,950.5
75%,74.25,975.25
max,99.0,1000.0
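
The remaining commands from the table can be tried on the same DataFrame. A short sketch, rebuilding `data` so that the block runs on its own:

```python
import numpy as np
import pandas as pd

# Rebuild the DataFrame from the example above
data = pd.DataFrame({'A': np.arange(0, 100),
                     'B': np.arange(1000, 900, -1)})

# (rows, columns) as a tuple
print(data.shape)

# Column labels, converted to a plain list for printing
print(list(data.columns))

# Last three rows
print(data.tail(3))

# Column dtypes, non-null counts, and memory usage (info() prints directly)
data.info()
```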


## Index Alignment in Pandas

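When Pandas combines two Series or DataFrames, it matches their elements by index label rather than by position, and labels present in only one operand produce `NaN` in the result. This alignment is crucial for accurately combining, comparing, and performing arithmetic operations on data with different structures but related indices.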

<font color='Blue'><b>Example - Series Alignment:</b></font>

In [None]:
import pandas as pd

# Create two Pandas Series
data1 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
data2 = pd.Series([5, 15, 25], index=['B', 'C', 'D'])

# Perform element-wise addition on the Series
result = data1 + data2

# Display the result using the appropriate function for a Series
print(result)

A     NaN
B    25.0
C    45.0
D     NaN
dtype: float64


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/15439593107f883b8695777ab7f1b03007feb66a/Images/Index_Alignment_Fig1.png" alt="picture" width="700">
</center>
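
When `NaN` for unmatched labels is not the desired outcome, one option is the `add` method with its `fill_value` parameter, which substitutes a default for a label that is missing from one side before adding. A minimal sketch using the same two Series (treating a missing entry as 0 is an assumption that depends on the data):

```python
import pandas as pd

data1 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
data2 = pd.Series([5, 15, 25], index=['B', 'C', 'D'])

# Treat a label missing from one Series as 0 instead of producing NaN
result_filled = data1.add(data2, fill_value=0)
print(result_filled)
```

The result still covers the union of both indices, but labels `A` and `D` now keep the value from the one Series that contains them.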

## `loc` - Label-Based Indexing:

The `loc` indexer in Pandas selects DataFrame data by label or by Boolean array. It is particularly useful for selecting rows and columns by their index labels or column names, which often makes a selection easier to read than a positional one [Molin and Jee, 2021, Pandas Developers, 2023].

The syntax for using `loc` is:

```python
df.loc[row_indexer, column_indexer]
```

- `row_indexer`: Specifies the row labels to select, which can be a single label, a list of labels, a slice, or a boolean array.

- `column_indexer`: Specifies the column labels to select, with similar indexing options.

<font color='Blue'><b>Example</b></font>:

In [None]:
import pandas as pd

# Create a dictionary for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}

# Create a DataFrame with custom index
df = pd.DataFrame(data, index=['ID1', 'ID2', 'ID3'])

# Original DataFrame
print("Original DataFrame:")
display(df)

# Access rows with labels 'ID1' and 'ID3' and all columns
print("\nAccess rows with labels ID1 and ID3 and all columns:")
selected_rows = df.loc[['ID1', 'ID3'], :]
display(selected_rows)

# Access rows based on a condition and specific columns
print("\nAccess rows based on a condition and specific columns:")
conditioned_rows = df.loc[df['Age'] > 30, ['Name', 'Age']]
display(conditioned_rows)

Original DataFrame:


Unnamed: 0,Name,Age
ID1,Alice,25
ID2,Bob,30
ID3,Charlie,35



Access rows with labels ID1 and ID3 and all columns:


Unnamed: 0,Name,Age
ID1,Alice,25
ID3,Charlie,35



Access rows based on a condition and specific columns:


Unnamed: 0,Name,Age
ID3,Charlie,35


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/b9f00e72650472da101fa3a6f229ca21841fe064/Images/Pandas_Row_Selection_Fig1.png" alt="picture" width="750">
</center>
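
Two further `loc` behaviors are worth a quick sketch: label slices include both endpoints (unlike Python list slices), and `loc` can also assign values. The DataFrame is the one from the example above; the new age of 31 is an arbitrary illustrative value.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
                  index=['ID1', 'ID2', 'ID3'])

# Label slice: both 'ID1' and 'ID2' are included in the result
sliced = df.loc['ID1':'ID2', 'Name']
print(sliced)

# Assign a new value to the cell selected by row label and column label
df.loc['ID2', 'Age'] = 31
print(df)
```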




## `iloc` - Position-Based Indexing:

The `iloc` indexer accesses DataFrame data by integer position, similar to indexing elements in a Python list. It is valuable when you want to select rows and columns by where they sit in the DataFrame rather than by their labels [Molin and Jee, 2021, Pandas Developers, 2023].

The syntax for using `iloc` is:
```python
df.iloc[row_indexer, column_indexer]
```

- `row_indexer`: Specifies the integer positions of the rows to select.
- `column_indexer`: Specifies the integer positions of the columns to select.

<font color='Blue'><b>Example:</b></font>

In [None]:
import pandas as pd

# Create a dictionary for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}

# Create a DataFrame
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
display(df)

# Access the first two rows and all columns using iloc
print("\nAccess the first two rows and all columns:")
first_two_rows = df.iloc[:2, :]
display(first_two_rows)

# Access specific rows and columns by position using iloc
print("\nAccess specific rows and columns by position:")
selected_rows_columns = df.iloc[[0, 2], [0, 1]]
display(selected_rows_columns)

Original DataFrame:


Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35



Access the first two rows and all columns:


Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30



Access specific rows and columns by position:


Unnamed: 0,Name,Age
0,Alice,25
2,Charlie,35
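
The difference between positions and labels is easiest to see when the index itself holds integers that do not start at zero. In this sketch the index values 10, 20, and 30 are hypothetical, chosen only to make the contrast visible:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
                  index=[10, 20, 30])

# iloc uses positions: position 0 is the first row
first_by_position = df.iloc[0]

# loc uses labels: label 10 happens to be the first row here
first_by_label = df.loc[10]

# iloc slices exclude the end position, as Python lists do
first_two = df.iloc[0:2]

# Negative positions count from the end (last row, second column)
last_age = df.iloc[-1, 1]

print(first_by_position.equals(first_by_label))
print(first_two)
print(last_age)
```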


## `at` - Single Value Selection:

The `at` accessor efficiently reads or modifies a single scalar value in a DataFrame, addressed by one row label and one column label. It offers a direct alternative to `loc` for single-element selection [Molin and Jee, 2021, Pandas Developers, 2023].

The syntax for using `at` is:
```python
df.at[row_label, column_label]
```

- `row_label`: Specifies the label of the row where the desired element is located.
- `column_label`: Specifies the label of the column where the element is located.

<font color='Blue'><b>Example:</b></font>

In [None]:
import pandas as pd

# Create a dictionary for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}

# Create a DataFrame
df = pd.DataFrame(data)

# Original DataFrame
print("Original DataFrame:")
display(df)

# Access and modify the element at row label 1 and column label 'Name'
df.at[1, 'Name'] = 'Robert'

# Updated DataFrame
print("\nUpdated DataFrame:")
display(df)

# Access and print the element at row label 2 and column label 'Age'
age = df.at[2, 'Age']
print("\nAge:", age)

Original DataFrame:


Unnamed: 0,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35



Updated DataFrame:


Unnamed: 0,Name,Age
0,Alice,25
1,Robert,30
2,Charlie,35



Age: 35


The `at` method is particularly efficient for single value retrieval or modification.

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/3e1e307507f33f01a24297b470535b6f6094561c/Images/pd_Selection_Fig1.png" alt="picture" width="700">
</center>
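
`at` works with whatever labels the index holds, not only the default integers. A brief sketch with string row labels (the `ID` labels are illustrative); it also shows `iat`, the position-based counterpart of `at` in pandas, and confirms that `loc` returns the same scalar:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
                  index=['ID1', 'ID2', 'ID3'])

# Read one value by row label and column label
age_by_label = df.at['ID2', 'Age']

# The same cell, addressed by integer position (row 1, column 1)
age_by_position = df.iat[1, 1]

# loc returns the same scalar when given a single row and column label
age_by_loc = df.loc['ID2', 'Age']

print(age_by_label, age_by_position, age_by_loc)
```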

# Handling Missing Data in Pandas

## Identifying Missing Data

In Pandas, the `isna()` and `isnull()` methods are used interchangeably to check for missing values within a DataFrame or Series. These methods have no functional difference; they yield identical results. Both methods generate a Boolean mask, where `True` indicates a missing value, and `False` indicates a non-missing value [Molin and Jee, 2021, Pandas Developers, 2023].

<font color='Blue'><b>Example</b></font>:

In [None]:
import pandas as pd

data = pd.Series([1, None, 3, None, 5])

# Original Series
print("Original Series:")
print(data)

# Using isnull() to identify missing values
missing_values = data.isnull()

# displaying missing_values
print("\nIdentifying Missing Values:")
print(missing_values)

<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/33649ffb000850c06dec64870ba9c542f832d840/Images/pd_Missing_Data_Fig1.png" alt="picture" width="400">
</center>
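
A Boolean mask becomes more useful once it is aggregated or used as a filter. A minimal sketch on the same Series, counting the missing entries and keeping only the valid ones with `notna()`:

```python
import pandas as pd

data = pd.Series([1, None, 3, None, 5])

# isnull() is an alias of isna(), so the two masks are identical
same_mask = data.isna().equals(data.isnull())

# True counts as 1, so summing the mask counts the missing values
n_missing = data.isna().sum()

# notna() is the inverse mask; use it to keep only the valid values
valid = data[data.notna()]

print(same_mask, n_missing)
print(valid)
```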

## Filling Missing Data

### Constant Fill, Forward Fill, and Backward Fill

In Pandas, the `fillna()` method is a versatile tool for replacing missing or NaN (Not a Number) values within a DataFrame or Series. This method is particularly useful during data preprocessing or cleaning tasks, enabling effective handling of missing data [Molin and Jee, 2021, Pandas Developers, 2023].

For propagation-based filling, the available options are `'ffill'` (or `'pad'`) for forward filling (propagating the last valid value forward) and `'bfill'` (or `'backfill'`) for backward filling (propagating the next valid value backward). Since pandas 2.1, passing these through the `method` argument of `fillna()` is deprecated in favor of the dedicated `ffill()` and `bfill()` methods, which the example below uses.

You can see the full description of the method [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html).

<font color='Blue'><b>Example</b></font>:

In [None]:
import numpy as np
import pandas as pd

# Create a simple time series DataFrame with missing values
date_rng = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
            '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
            '2023-01-09', '2023-01-10']
data = {'Temperature': [25.0, 24.5, np.nan, 23.0, np.nan, 22.0, 21.5, np.nan, 20.0, 19.5]}
df = pd.DataFrame(data, index=date_rng)
print("Original Data:")
display(df)

# Forward fill to propagate the last valid observation forward
df_filled_ffill = df.ffill()
print('Forward fill to propagate the last valid observation forward:')
display(df_filled_ffill)

# Backward fill to propagate the next valid observation backward
df_filled_bfill = df.bfill()
print('Backward fill to propagate the next valid observation backward:')
display(df_filled_bfill)

Original Data:


Unnamed: 0,Temperature
2023-01-01,25.0
2023-01-02,24.5
2023-01-03,
2023-01-04,23.0
2023-01-05,
2023-01-06,22.0
2023-01-07,21.5
2023-01-08,
2023-01-09,20.0
2023-01-10,19.5


Forward fill to propagate the last valid observation forward:


Unnamed: 0,Temperature
2023-01-01,25.0
2023-01-02,24.5
2023-01-03,24.5
2023-01-04,23.0
2023-01-05,23.0
2023-01-06,22.0
2023-01-07,21.5
2023-01-08,21.5
2023-01-09,20.0
2023-01-10,19.5


Backward fill to propagate the next valid observation backward:


Unnamed: 0,Temperature
2023-01-01,25.0
2023-01-02,24.5
2023-01-03,23.0
2023-01-04,23.0
2023-01-05,22.0
2023-01-06,22.0
2023-01-07,21.5
2023-01-08,20.0
2023-01-09,20.0
2023-01-10,19.5


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/20a7679e4725dcca74441a97fc13b1bbd8a4febd/Images/pd_fill_04.png" alt="picture" width="800">
</center>
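
The example above covers forward and backward fill; a constant fill passes the replacement value directly to `fillna()`. A sketch on a shortened version of the temperature data, where the fill values `0.0` and `-1.0` are arbitrary placeholders rather than recommendations:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Temperature': [25.0, 24.5, np.nan, 23.0, np.nan]},
                  index=['2023-01-01', '2023-01-02', '2023-01-03',
                         '2023-01-04', '2023-01-05'])

# Replace every NaN with the same constant
df_filled_const = df.fillna(0.0)

# A dict maps column names to per-column fill values
df_filled_by_column = df.fillna({'Temperature': -1.0})

print(df_filled_const)
print(df_filled_by_column)
```

Both calls return a new DataFrame and leave `df` unchanged.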

Missing values can also be filled with a summary statistic such as the mean or median, or with a custom aggregation suited to the context of the data.

In [None]:
# Fill missing values with the column mean (rounded to two decimals)
print("Fill NaN values with the mean:")
_mean = df['Temperature'].mean().round(2)
df_filled_mean = df.fillna(_mean)
display(df_filled_mean)

Fill NaN values with the mean:


Unnamed: 0,Temperature
2023-01-01,25.0
2023-01-02,24.5
2023-01-03,22.21
2023-01-04,23.0
2023-01-05,22.21
2023-01-06,22.0
2023-01-07,21.5
2023-01-08,22.21
2023-01-09,20.0
2023-01-10,19.5


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/194a0014c2450ed9dc328006b16f7b9db9885aa5/Images/pd_fill_06.png" alt="picture" width="650">
</center>

## Reading and Writing Data:

Beyond in-memory manipulation, Pandas is widely used for reading and writing data in various formats. The following subsections show how [Pandas Developers, 2023].

### Reading Data

Pandas can read data from various sources such as CSV files, Excel files, and SQL databases. The most commonly used function is `pandas.read_csv()`, which reads a CSV file into a DataFrame.

In [None]:
import pandas as pd

# download the following CSV file
!wget -N https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv

# Read a CSV file into a DataFrame
data = pd.read_csv('Import_User_Sample_en.csv')
# Alternatively, read it directly from the URL
# data = pd.read_csv('https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv')

# Display the DataFrame
print("The DataFrame:")
display(data)

--2024-01-04 22:18:04--  https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv
Resolving download.microsoft.com (download.microsoft.com)... 72.246.252.244, 2600:1408:c400:183::317f, 2600:1408:c400:18c::317f, ...
Connecting to download.microsoft.com (download.microsoft.com)|72.246.252.244|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1049 (1.0K) [application/octet-stream]
Saving to: ‘Import_User_Sample_en.csv’


2024-01-04 22:18:04 (503 MB/s) - ‘Import_User_Sample_en.csv’ saved [1049/1049]

The DataFrame:


Unnamed: 0,User Name,First Name,Last Name,Display Name,Job Title,Department,Office Number,Office Phone,Mobile Phone,Fax,Address,City,State or Province,ZIP or Postal Code,Country or Region
0,chris@contoso.com,Chris,Green,Chris Green,IT Manager,Information Technology,123451,123-555-1211,123-555-6641,123-555-9821,1 Microsoft way,Redmond,Wa,98052,United States
1,ben@contoso.com,Ben,Andrews,Ben Andrews,IT Manager,Information Technology,123452,123-555-1212,123-555-6642,123-555-9822,1 Microsoft way,Redmond,Wa,98052,United States
2,david@contoso.com,David,Longmuir,David Longmuir,IT Manager,Information Technology,123453,123-555-1213,123-555-6643,123-555-9823,1 Microsoft way,Redmond,Wa,98052,United States
3,cynthia@contoso.com,Cynthia,Carey,Cynthia Carey,IT Manager,Information Technology,123454,123-555-1214,123-555-6644,123-555-9824,1 Microsoft way,Redmond,Wa,98052,United States
4,melissa@contoso.com,Melissa,MacBeth,Melissa MacBeth,IT Manager,Information Technology,123455,123-555-1215,123-555-6645,123-555-9825,1 Microsoft way,Redmond,Wa,98052,United States
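
`read_csv()` also accepts a file-like object, which makes it easy to try its options without downloading anything. The sketch below reads a small in-memory CSV (the contents are made up for illustration, loosely modeled on the file above) and uses the `usecols` and `dtype` parameters to select columns and keep the office number as text:

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory instead of a file
csv_text = """User Name,First Name,Office Number,City
chris@contoso.com,Chris,123451,Redmond
ben@contoso.com,Ben,123452,Redmond
"""

# Load only two columns and read 'Office Number' as strings, not integers
data = pd.read_csv(io.StringIO(csv_text),
                   usecols=['User Name', 'Office Number'],
                   dtype={'Office Number': str})

print(data)
print(data.dtypes)
```

Reading identifiers such as office or phone numbers as strings avoids accidental numeric handling, for example the loss of leading zeros.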


### Writing and Exporting Data

Pandas can export data to a variety of formats. A frequently used option is the `DataFrame.to_csv()` method, which writes a DataFrame to a CSV (Comma-Separated Values) file:

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['Calgary', 'Edmonton', 'Red Deer']
        }

df = pd.DataFrame(data)

# Write the DataFrame to a CSV file
csv_filename = 'data.csv'
df.to_csv(csv_filename, index=False)

# Print a message indicating that the data has been written
print(f"Data written to {csv_filename}")

Data written to data.csv
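
A quick way to check an export is to read it back. When `to_csv()` is called without a path, it returns the CSV content as a string, so this sketch performs the round trip in memory rather than on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22],
                   'City': ['Calgary', 'Edmonton', 'Red Deer']})

# With no path argument, to_csv() returns the CSV content as a string
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the text back and compare it with the original DataFrame
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(df))
```

Passing `index=False` matters here: without it the row index is written as an extra unnamed column, and the DataFrame read back would no longer match the original.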


## References

* S. Molin and K. Jee. Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization. Packt Publishing, 2021. ISBN 9781800565913. URL: https://books.google.ca/books?id=Eh4sEAAAQBAJ.
* Pandas Developers. Pandas documentation. https://pandas.pydata.org/docs/, 2023. [Online; accessed 01-August-2023].