# Pandas
A popular library for data manipulation, built on top of NumPy. It has two basic data structures:
1. Series (built on top of numpy.ndarray), from [pandas.Series docs](https://pandas.pydata.org/docs/reference/api/pandas.Series.html):
> One-dimensional ndarray with axis labels (including time series). Labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

2. DataFrame (think of it as a dictionary of Series objects), from the [pandas.DataFrame docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):
> Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

In [63]:
import pandas as pd
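
To make the "dict-like container for Series objects" description above concrete, here is a minimal sketch. The names and values are made up for illustration; it only shows that each dict key becomes a column and that a column comes back out as a Series.

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
ages = pd.Series([31, 70], index=["omar", "serag"])
cities = pd.Series(["Toronto", "Cairo"], index=["omar", "serag"])

# Build a DataFrame from a dict of Series: each key becomes a column label.
people = pd.DataFrame({"Age": ages, "City": cities})

# Selecting a column returns a Series again.
age_column = people["Age"]
print(type(age_column))

# Like a dict, the DataFrame exposes its column labels as keys.
print(list(people.keys()))
```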

# Series

In [15]:
# Series from list
data = [1, 2, 3, 4]
pd.Series(data)

0    1
1    2
2    3
3    4
dtype: int64

In [14]:
# Series from dict
data = {"a": 1, "b": 2, "c": 3}
pd.Series(data)

a    1
b    2
c    3
dtype: int64

Note how in the above example the dictionary keys (the letters `a`, `b`, `c`) became the index.
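
Because the keys become the index, values can be looked up by label as well as by position. A small sketch, reusing the same dictionary as above:

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})

by_label = s["b"]        # label-based lookup, similar to a dict
by_position = s.iloc[1]  # position-based lookup, similar to a list
labels = list(s.index)   # the index holds the original dict keys

print(by_label, by_position, labels)
```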

In [13]:
# Series with varying data types
data = {"a": 1, "b": 2, "c": "text"}
pd.Series(data)

a       1
b       2
c    text
dtype: object

Note how the `dtype` changed to `object` in the above example. The entries in the series are of different types, so pandas inferred `object` as the dtype, which can hold values of any type.
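
The following sketch contrasts the two cases side by side. It assumes nothing beyond the behavior described above: an all-integer Series gets an integer dtype, while a mixed one falls back to `object` and keeps each element's original Python type.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2, "text"])

print(ints.dtype)   # an integer dtype
print(mixed.dtype)  # object

# Elements of an object Series keep their original Python types.
element_types = [type(v).__name__ for v in mixed]
print(element_types)
```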

In [12]:
# Creating your own custom index
data = [1, 2, 3]
index = ['a', 'b', 'c']
pd.Series(data, index=index)

a    1
b    2
c    3
dtype: int64
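
The Series docs quoted at the top mention that statistical methods automatically exclude missing data. A minimal sketch of that behavior, using made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])

total = s.sum()               # NaN is skipped by default
average = s.mean()            # mean over the two non-missing values
strict = s.sum(skipna=False)  # NaN propagates when skipping is disabled

print(total, average, strict)
```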

# DataFrame
## Creating the DataFrame

In [17]:
# Create a dataframe from a dictionary of lists
data = {
    'Name': ["omar", "serag", "mohamed"],
    'Age': ["31", "70", "100"],
    'City': ["Toronto", "Cairo", "Algeria"]
}

df = pd.DataFrame(data)
print(df)

      Name  Age     City
0     omar   31  Toronto
1    serag   70    Cairo
2  mohamed  100  Algeria


In [20]:
# Convert df into np array
import numpy as np
np.array(df)

array([['omar', '31', 'Toronto'],
       ['serag', '70', 'Cairo'],
       ['mohamed', '100', 'Algeria']], dtype=object)
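
As an alternative to `np.array(df)`, pandas also provides `DataFrame.to_numpy()`, which needs no NumPy import. A small sketch on a reduced, illustrative frame:

```python
import pandas as pd

# Reduced stand-in for the frame above so the example is self-contained.
df = pd.DataFrame({"Name": ["omar", "serag"], "Age": ["31", "70"]})

arr = df.to_numpy()  # 2-D NumPy array, one row per DataFrame row
print(arr.shape)
print(arr[0, 0])
```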

In [26]:
# Dataframe from a list of dictionaries
data = [
    {'Name': 'Omar', 'Age': 31, 'City': 'Toronto'},
    {'Name': 'Serag', 'Age': 71, 'City': 'Cairo'},
    {'Name': 'Mohamed', 'Age': 101, 'City': 'Algeria'}
]
pd.DataFrame(data)

      Name  Age     City
0     Omar   31  Toronto
1    Serag   71    Cairo
2  Mohamed  101  Algeria


In [31]:
# Reading from a csv
df = pd.read_csv('test_data/test_data.csv')
print(df)

         Date          description  debit   credit  balance
0  02/19/2025    E-TRANSFER ***qP7    NaN  2000.00  4000.00
1  02/28/2025  MONTHLY ACCOUNT FEE  16.95      NaN  3983.05
2  02/28/2025      ACCT FEE REBATE    NaN    16.95  4000.00


In [33]:
# Notice how the above produces NaN where the CSV has empty fields; replace them with 0
df.fillna(0, inplace=True)
print(df)

         Date          description  debit   credit  balance
0  02/19/2025    E-TRANSFER ***qP7   0.00  2000.00  4000.00
1  02/28/2025  MONTHLY ACCOUNT FEE  16.95     0.00  3983.05
2  02/28/2025      ACCT FEE REBATE   0.00    16.95  4000.00


In [34]:
# Also, notice how the dataframe doesn't have the date set as the index.
# We can change that by setting the index as follows
df.set_index("Date", inplace=True)
print(df)

                    description  debit   credit  balance
Date                                                    
02/19/2025    E-TRANSFER ***qP7   0.00  2000.00  4000.00
02/28/2025  MONTHLY ACCOUNT FEE  16.95     0.00  3983.05
02/28/2025      ACCT FEE REBATE   0.00    16.95  4000.00
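
In the output above the `Date` index still holds plain strings. If real dates are wanted, one option is to parse the column with `pd.to_datetime` before setting the index. This is a sketch on illustrative data, assuming the `MM/DD/YYYY` format seen above:

```python
import pandas as pd

# Illustrative stand-in for the CSV above.
df = pd.DataFrame({
    "Date": ["02/19/2025", "02/28/2025", "02/28/2025"],
    "balance": [4000.00, 3983.05, 4000.00],
})

# Parse the strings into timestamps, then use them as the index.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = df.set_index("Date")

# A DatetimeIndex supports partial-string selection, e.g. a whole month.
february = df.loc["2025-02"]
days = df.index.day.tolist()
print(february)
```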


In [36]:
# Showing the first n rows of the dataframe
df.head(2)

                    description  debit  credit  balance
Date
02/19/2025    E-TRANSFER ***qP7    0.0  2000.0   4000.0
02/28/2025  MONTHLY ACCOUNT FEE  16.95     0.0  3983.05


## Accessing values in the DataFrame

In [43]:
# Let's first create a simple df
# Create a dataframe from a dictionary of lists
data = {
    'Name': ["omar", "serag", "mohamed"],
    'Age': ["31", "70", "100"],
    'City': ["Toronto", "Cairo", "Algeria"]
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print(df)

         Age     City
Name                 
omar      31  Toronto
serag     70    Cairo
mohamed  100  Algeria


### Column Access

In [45]:
# Accessing an entire column
age_column = df['Age']
print(age_column)
print(type(age_column))

Name
omar        31
serag       70
mohamed    100
Name: Age, dtype: object
<class 'pandas.core.series.Series'>


**Notice how the type of the column is a pandas `Series`**
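
Passing a list of column names instead of a single name returns a DataFrame rather than a Series, even when the list has one entry. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": ["31", "70", "100"], "City": ["Toronto", "Cairo", "Algeria"]},
    index=pd.Index(["omar", "serag", "mohamed"], name="Name"),
)

single = df["Age"]    # one label -> Series
subset = df[["Age"]]  # list of labels -> DataFrame with one column

print(type(single))
print(type(subset))
```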

### Row Access

In [49]:
# Accessing a row by using the index label
df.loc["omar"] # returns the row as a Series

Age          31
City    Toronto
Name: omar, dtype: object

**See many more examples of `df.loc` usage in the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)**

In [53]:
# Accessing a row by integer position
df.iloc[0]

Age          31
City    Toronto
Name: omar, dtype: object

**`df.iloc` works like `df.loc` but selects by integer position instead of by label, see [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)**
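
One difference worth knowing when slicing: a label slice with `df.loc` includes both endpoints, while a position slice with `df.iloc` excludes the end, like ordinary Python slicing. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": ["31", "70", "100"], "City": ["Toronto", "Cairo", "Algeria"]},
    index=pd.Index(["omar", "serag", "mohamed"], name="Name"),
)

rows_by_label = df.loc["omar":"serag"]  # both endpoints are included
rows_by_position = df.iloc[0:1]         # end position is excluded

print(rows_by_label)
print(rows_by_position)
```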

### Accessing a specified element

In [51]:
# Using df.loc 
df.loc["omar", "Age"]

'31'

In [52]:
# Alternatively, select the row as a Series first, then index into it
df.loc["omar"]["Age"]

'31'

In [56]:
df.iloc[0, 0] # Accessing the age as well

'31'

In [57]:
df.iloc[0]["Age"]

'31'

In [59]:
df.iloc[0].iloc[0]

'31'

In [60]:
# Using at
df.at["omar", "Age"]

'31'

In [62]:
# Using iat
df.iat[0, 0]

'31'
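
The accessors above can also be used to write a single value. The sketch below assigns through `df.at`. Chained forms such as `df.loc["omar"]["Age"] = ...` are best avoided for assignment, since they may operate on a temporary copy rather than on `df` itself.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": ["31", "70", "100"], "City": ["Toronto", "Cairo", "Algeria"]},
    index=pd.Index(["omar", "serag", "mohamed"], name="Name"),
)

# Set one cell by row label and column label.
df.at["omar", "Age"] = "32"

updated = df.loc["omar", "Age"]
print(updated)
```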

## Data Manipulation

In [80]:
# Let's first recreate the simple df from a dictionary of lists,
# this time with Age stored as integers so that arithmetic works
data = {
    'Name': ["omar", "serag", "mohamed"],
    'Age': [31, 70, 100],
    'City': ["Toronto", "Cairo", "Algeria"]
}

df = pd.DataFrame(data)
df.set_index('Name', inplace=True)
print(df)

         Age     City
Name                 
omar      31  Toronto
serag     70    Cairo
mohamed  100  Algeria


In [76]:
# Adding a column
df['Salary'] = [10000, 200000, 300000]
print(df)

         Age     City  Salary
Name                         
omar      31  Toronto   10000
serag     70    Cairo  200000
mohamed  100  Algeria  300000


In [77]:
# Remove a column
df.drop('Salary', axis=1, inplace=True)
print(df)

         Age     City
Name                 
omar      31  Toronto
serag     70    Cairo
mohamed  100  Algeria
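
The add and remove steps above both modify `df` in place. A non-mutating variant, sketched here with the same illustrative values, uses `assign` and `drop(columns=...)`, each of which returns a new DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [31, 70, 100], "City": ["Toronto", "Cairo", "Algeria"]},
    index=pd.Index(["omar", "serag", "mohamed"], name="Name"),
)

# assign returns a copy with the extra column; df itself is untouched.
with_salary = df.assign(Salary=[10000, 200000, 300000])

# drop(columns=...) is equivalent to drop(..., axis=1).
without_salary = with_salary.drop(columns="Salary")

print(with_salary)
print(without_salary)
```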


In [81]:
# Add to the age column
df['Age'] = df['Age'] + 1
print(df)

         Age     City
Name                 
omar      32  Toronto
serag     71    Cairo
mohamed  101  Algeria
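
The `+ 1` above is numeric addition because this version of the frame stores `Age` as integers. The earlier examples stored the ages as strings, which would need converting first. A minimal sketch of that conversion using `astype`:

```python
import pandas as pd

# Ages stored as text, as in the earlier "Accessing values" examples.
ages_as_text = pd.Series(["31", "70", "100"], index=["omar", "serag", "mohamed"])

# Convert to integers, then apply vectorized arithmetic.
ages_as_int = ages_as_text.astype(int)
incremented = ages_as_int + 1

print(incremented)
```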


## Inspecting the DataFrame

In [85]:
print("data types:\n", df.dtypes)
print()
print("Statistical summary:\n", df.describe())

data types:
 Age      int64
City    object
dtype: object

Statistical summary:
               Age
count    3.000000
mean    68.000000
std     34.597688
min     32.000000
25%     51.500000
50%     71.000000
75%     86.000000
max    101.000000
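
By default `describe()` summarizes only the numeric columns, which is why `City` is missing from the summary above. Passing `include="all"` adds the non-numeric columns too. A sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [32, 71, 101], "City": ["Toronto", "Cairo", "Algeria"]},
    index=pd.Index(["omar", "serag", "mohamed"], name="Name"),
)

# include="all" reports count/unique/top/freq for non-numeric columns;
# statistics that do not apply to a column are shown as NaN.
summary = df.describe(include="all")
print(summary)
```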
