## 📊 Data Structures

> Pandas provides two primary data structures for handling data efficiently:

- **Series** – A one-dimensional labeled array.
- **DataFrame** – A two-dimensional labeled table (rows and columns).


In [1]:
import pandas as pd
import numpy as np

### Series

- It is a one-dimensional labeled array
- Each value has an index (label)
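
As a small sketch of what "labeled" means, the example below builds a Series with string labels instead of the default integer positions. The name `temps`, the city labels, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: labels and values are illustrative only.
temps = pd.Series([21.5, 18.0, 25.3], index=["paris", "oslo", "cairo"])

# Look a value up by its label ...
by_label = temps["oslo"]

# ... or by its integer position.
by_position = temps.iloc[1]

print(by_label, by_position)
print(list(temps.index))
```

Both lookups reach the same element; the label is simply a second way of addressing it.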

In [2]:
# Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:", s)

Series: 0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [3]:
# Naming the series

s.name = "numbers"
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: numbers, dtype: float64

##### NumPy Array

- A NumPy array is the core data structure of the NumPy library in Python.
- It's like a supercharged list, built for fast numerical computing.

In [4]:
s.to_numpy()

array([ 1.,  3.,  5., nan,  6.,  8.])
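
To illustrate why the array form is useful for numerical work, the sketch below rebuilds the same values as `s` and applies element-wise arithmetic and two reductions. The names `doubled`, `plain_sum`, and `safe_sum` are introduced here only for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
arr = s.to_numpy()

# Arithmetic applies to every element at once, with no explicit loop.
doubled = arr * 2

# NaN propagates through ordinary reductions; nan-aware variants skip it.
plain_sum = arr.sum()      # NaN, because one element is missing
safe_sum = np.nansum(arr)  # sums only the non-missing values

print(type(arr).__name__, doubled, plain_sum, safe_sum)
```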

### DataFrame

- A **DataFrame** is like a table in Python: it has rows and columns. It is generally the most commonly used pandas object.
- DataFrame accepts many different kinds of input:

    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame
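
A few of the input kinds listed above can be sketched as follows. The column names `x` and `y` and all values are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column names.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# From a 2-D ndarray: column names are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# From a named Series: the Series name becomes the single column.
from_series = pd.DataFrame(pd.Series([1, 2, 3], name="x"))

# From another DataFrame: copy=True makes the new frame independent.
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_dict, from_array, from_series, from_frame):
    print(frame.shape, list(frame.columns))
```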


In [5]:
dataframe = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20230102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})
dataframe

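|   | A   | B          | C   | D | E     | F   |
|---|-----|------------|-----|---|-------|-----|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |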


In [6]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype        
---  ------  --------------  -----        
 0   A       4 non-null      float64      
 1   B       4 non-null      datetime64[s]
 2   C       4 non-null      float32      
 3   D       4 non-null      int32        
 4   E       4 non-null      category     
 5   F       4 non-null      object       
dtypes: category(1), datetime64[s](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes


## Assign

- `assign` adds new columns to a DataFrame. It returns a new DataFrame with the added columns and leaves the original unmodified.

In [7]:
# The new column D_multi holds the element-wise sum D + A
dataframe.assign(D_multi=dataframe["D"] + dataframe["A"])

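|   | A   | B          | C   | D | E     | F   | D_multi |
|---|-----|------------|-----|---|-------|-----|---------|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo | 4.0     |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0     |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo | 4.0     |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0     |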
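
To check that `assign` really leaves its input alone, a minimal sketch follows. It uses a small stand-in frame `df` rather than the `dataframe` built earlier, and the column names `total` and `total_squared` are hypothetical.

```python
import pandas as pd

# Small hypothetical frame with two of the columns used above.
df = pd.DataFrame({"A": [1.0, 1.0], "D": [3, 3]})

# assign returns a new DataFrame; a callable can refer to columns
# created earlier in the same call.
result = df.assign(
    total=df["D"] + df["A"],
    total_squared=lambda d: d["total"] ** 2,
)

print(list(df.columns))      # original columns only
print(list(result.columns))  # original columns plus the two new ones
```

To keep the new columns, bind the return value to a name, as `result` does here.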


## Indexing and Selection

| Operation                      | Syntax          | Result    |
|--------------------------------|-----------------|-----------|
| Select column                  | `df[col]`       | Series    |
| Select row by label            | `df.loc[label]` | Series    |
| Select row by integer location | `df.iloc[loc]`  | Series    |
| Slice rows                     | `df[5:10]`      | DataFrame |
| Select rows by boolean vector  | `df[bool_vec]`  | DataFrame |

In [8]:
# Seven rows of uniform random values in [0, 1); output differs on each run
df_index = pd.DataFrame({
    "A": np.random.random_sample(7),
    "B": np.random.random_sample(7),
    "C": np.random.random_sample(7),
    "D": np.random.random_sample(7),
    "E": np.random.random_sample(7),
})

df_index

|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.916876 | 0.480658 | 0.800431 | 0.34144  | 0.750877 |
| 1 | 0.785624 | 0.379212 | 0.309245 | 0.134703 | 0.274035 |
| 2 | 0.123311 | 0.100469 | 0.800373 | 0.938661 | 0.879924 |
| 3 | 0.069857 | 0.586498 | 0.999678 | 0.375113 | 0.719195 |
| 4 | 0.515022 | 0.831374 | 0.118607 | 0.231559 | 0.046052 |
| 5 | 0.030939 | 0.577144 | 0.530527 | 0.748415 | 0.954475 |
| 6 | 0.870225 | 0.417478 | 0.12698  | 0.274361 | 0.359165 |


In [9]:
# Series (single column)
df_index['A']

0    0.916876
1    0.785624
2    0.123311
3    0.069857
4    0.515022
5    0.030939
6    0.870225
Name: A, dtype: float64

In [10]:
# Series (single row)
df_index.iloc[0]

A    0.916876
B    0.480658
C    0.800431
D    0.341440
E    0.750877
Name: 0, dtype: float64

In [11]:
# DataFrame (multiple rows)
df_index.iloc[0:3]

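|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.916876 | 0.480658 | 0.800431 | 0.34144  | 0.750877 |
| 1 | 0.785624 | 0.379212 | 0.309245 | 0.134703 | 0.274035 |
| 2 | 0.123311 | 0.100469 | 0.800373 | 0.938661 | 0.879924 |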


In [12]:
# DataFrame (multiple rows): a plain slice also selects rows by position
df_index[0:3]

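|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.916876 | 0.480658 | 0.800431 | 0.34144  | 0.750877 |
| 1 | 0.785624 | 0.379212 | 0.309245 | 0.134703 | 0.274035 |
| 2 | 0.123311 | 0.100469 | 0.800373 | 0.938661 | 0.879924 |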


In [13]:
# DataFrame (filtered rows)
df_index[df_index['A'] > 0.9]

|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.916876 | 0.480658 | 0.800431 | 0.34144  | 0.750877 |
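
The cells above cover `[]`, `iloc`, and boolean filtering, but not `loc` with non-integer labels. The sketch below fills that gap with a hypothetical frame `scores`, whose row labels and values are invented for the example.

```python
import pandas as pd

# Hypothetical frame with string row labels.
scores = pd.DataFrame(
    {"math": [90, 72, 85], "art": [60, 95, 78]},
    index=["ana", "ben", "cy"],
)

row = scores.loc["ben"]            # single label -> Series
block = scores.loc["ana":"ben"]    # label slice includes both endpoints
by_pos = scores.iloc[0:2]          # position slice excludes the stop
mask = scores["math"] > 80         # boolean vector
high = scores.loc[mask, ["math"]]  # rows by mask, columns by label

print(row, block, by_pos, high, sep="\n\n")
```

Note the asymmetry: the label slice `"ana":"ben"` and the position slice `0:2` return the same two rows, because label slices include the stop while position slices do not.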
