## 📊 Data Structures

> Pandas provides two primary data structures for handling data efficiently:

- **Series** – A one-dimensional labeled array.
- **DataFrame** – A two-dimensional labeled table (rows and columns).


In [1]:
import pandas as pd
import numpy as np

### Series

- A Series is a one-dimensional labeled array.
- Each value has an index (label).

In [2]:
# Create a Series from a list; the np.nan entry makes pandas store the values as float64
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:", s)

Series: 0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [3]:
# Naming the series

s.name = "numbers"
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: numbers, dtype: float64
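
The Series above uses the default integer index (0 to 5). As a minimal sketch of what "each value has an index (label)" means, the example below builds a Series with string labels and reads values by label and by position. The labels, values, and the name `prices` are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration.
prices = pd.Series(
    [1.5, 2.0, 0.75],
    index=["apple", "banana", "cherry"],
    name="price",
)

print(prices["banana"])    # look up a value by its label
print(prices.iloc[0])      # look up a value by its integer position
print(list(prices.index))  # the labels themselves
```

Label-based access keeps working if the rows are reordered, which is the main reason to prefer it over positional access when the labels are meaningful.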

##### NumPy Array

- A NumPy array is the core data structure of the NumPy library in Python.
- It's like a supercharged list, built for fast numerical computing.
- `Series.to_numpy()` returns the values of a Series as a NumPy array, without the index.

In [4]:
s.to_numpy()

array([ 1.,  3.,  5., nan,  6.,  8.])
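
To make the "supercharged list" idea concrete, here is a small sketch of what can be done with the array returned by `to_numpy()`. It rebuilds the same Series so the block runs on its own; the variable name `arr` is an assumption for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
arr = s.to_numpy()

print(type(arr))        # a plain numpy.ndarray, without the pandas index
print(arr * 2)          # arithmetic is applied element-wise, no loop needed
print(np.mean(arr))     # nan, because one element is missing
print(np.nanmean(arr))  # mean of the non-missing elements only
```

Unlike pandas methods such as `s.mean()`, plain NumPy reductions do not skip `NaN` by default, so the NaN-aware variants (`np.nanmean`, `np.nansum`, and so on) are usually the safer choice after converting.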

### DataFrame

- A **DataFrame** is a two-dimensional labeled table with rows and columns. It is generally the most commonly used pandas object.
- A DataFrame accepts many different kinds of input:
    - Dict of 1-D ndarrays, lists, dicts, or Series
    - 2-D `numpy.ndarray`
    - Structured or record ndarray
    - A Series
    - Another DataFrame
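
The input kinds listed above can be sketched as follows. All column names and values here are illustrative assumptions; only the constructor `pd.DataFrame` itself is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Dict of lists: the keys become column names.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# 2-D ndarray: column names are passed explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Structured ndarray: the field names become column names.
records = np.array([(1, 4.0), (2, 5.0)], dtype=[("x", "i4"), ("y", "f8")])
from_records = pd.DataFrame(records)

# A named Series becomes a single-column DataFrame.
from_series = pd.DataFrame(pd.Series([1, 2, 3], name="x"))

# Another DataFrame: copy=True is used here so the result is independent.
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_dict, from_array, from_records, from_series, from_frame):
    print(frame.shape, list(frame.columns))
```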


In [5]:
dataframe = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20230102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})
dataframe

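|   | A   | B          | C   | D | E     | F   |
|---|-----|------------|-----|---|-------|-----|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |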


In [6]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype        
---  ------  --------------  -----        
 0   A       4 non-null      float64      
 1   B       4 non-null      datetime64[s]
 2   C       4 non-null      float32      
 3   D       4 non-null      int32        
 4   E       4 non-null      category     
 5   F       4 non-null      object       
dtypes: category(1), datetime64[s](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes


## Assign

- `assign` adds new columns to a DataFrame. It returns a new DataFrame with the added columns and leaves the original unchanged.

In [7]:
# Add a column D_multi holding the element-wise sum of D and A
dataframe.assign(D_multi=dataframe["D"] + dataframe["A"])

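|   | A   | B          | C   | D | E     | F   | D_multi |
|---|-----|------------|-----|---|-------|-----|---------|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo | 4.0     |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0     |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test  | foo | 4.0     |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0     |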
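
The claim that `assign` does not modify the original can be checked with a short sketch. The frame `df` and the column names `total` and `doubled` are hypothetical; the example also shows that `assign` accepts a callable, which receives the DataFrame as built so far.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 1.0], "D": [3, 3]})

# assign returns a new DataFrame; a callable argument receives the frame
# including the columns added earlier in the same call.
result = df.assign(
    total=df["D"] + df["A"],
    doubled=lambda frame: frame["total"] * 2,
)

print(list(df.columns))      # the original keeps only A and D
print(list(result.columns))  # the result carries the two new columns
```

To keep the new columns, bind the result to a name (or reassign it to `df`); calling `assign` without using its return value has no lasting effect.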


## Indexing and Selection

| Operation                      | Syntax          | Result    |
|--------------------------------|-----------------|-----------|
| Select column                  | `df[col]`       | Series    |
| Select row by label            | `df.loc[label]` | Series    |
| Select row by integer location | `df.iloc[loc]`  | Series    |
| Slice rows                     | `df[5:10]`      | DataFrame |
| Select rows by boolean vector  | `df[bool_vec]`  | DataFrame |

In [8]:
df_index = pd.DataFrame({
    "A": np.random.random_sample(7),
    "B": np.random.random_sample(7),
    "C": np.random.random_sample(7),
    "D": np.random.random_sample(7),
    "E": np.random.random_sample(7),
})

df_index

|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.343961 | 0.150962 | 0.219405 | 0.174419 | 0.643213 |
| 1 | 0.526462 | 0.012176 | 0.270882 | 0.466939 | 0.635735 |
| 2 | 0.815319 | 0.495526 | 0.595185 | 0.007326 | 0.15774  |
| 3 | 0.023758 | 0.599895 | 0.133167 | 0.445974 | 0.719086 |
| 4 | 0.425161 | 0.135112 | 0.323847 | 0.591478 | 0.443758 |
| 5 | 0.833287 | 0.083967 | 0.629248 | 0.004949 | 0.717577 |
| 6 | 0.858486 | 0.661285 | 0.578026 | 0.443783 | 0.021609 |


In [9]:
# Series (single column)
df_index['A']

0    0.343961
1    0.526462
2    0.815319
3    0.023758
4    0.425161
5    0.833287
6    0.858486
Name: A, dtype: float64

In [10]:
# Series (single row)
df_index.iloc[0]

A    0.343961
B    0.150962
C    0.219405
D    0.174419
E    0.643213
Name: 0, dtype: float64

In [11]:
# DataFrame (multiple rows)
df_index.iloc[0:3]

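|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.343961 | 0.150962 | 0.219405 | 0.174419 | 0.643213 |
| 1 | 0.526462 | 0.012176 | 0.270882 | 0.466939 | 0.635735 |
| 2 | 0.815319 | 0.495526 | 0.595185 | 0.007326 | 0.15774  |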


In [12]:
# DataFrame (row slice with plain [] syntax, same rows as iloc[0:3] here)
df_index[0:3]

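|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.343961 | 0.150962 | 0.219405 | 0.174419 | 0.643213 |
| 1 | 0.526462 | 0.012176 | 0.270882 | 0.466939 | 0.635735 |
| 2 | 0.815319 | 0.495526 | 0.595185 | 0.007326 | 0.15774  |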


In [13]:
# DataFrame (filtered rows); the result is empty here because no value in A exceeds 0.9
df_index[df_index['A'] > 0.9]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
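
Because the random data above happens to produce an empty result, here is a small sketch of boolean filtering on fixed, made-up values. It also shows how conditions could be combined with `&` and `|`; each condition needs its own parentheses because of Python operator precedence.

```python
import pandas as pd

# Fixed illustrative values so the outcome is reproducible.
df = pd.DataFrame({"A": [0.2, 0.6, 0.95], "B": [0.9, 0.1, 0.5]})

mask = df["A"] > 0.5  # boolean Series with one entry per row
print(mask.tolist())

both = df[(df["A"] > 0.5) & (df["B"] > 0.3)]    # rows meeting both conditions
either = df[(df["A"] > 0.9) | (df["B"] > 0.8)]  # rows meeting at least one
none = df[df["A"] > 1.0]                        # no match: empty DataFrame

print(both.index.tolist())
print(either.index.tolist())
print(none.shape, list(none.columns))  # columns survive even with zero rows
```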


In [14]:
# Create a copy and overwrite A and B in row 0 (a .loc label slice includes its end label)
df_index_copy = df_index.copy()
df_index_copy.loc[:0, ['A', 'B']] = 0.1
df_index_copy

|   | A        | B        | C        | D        | E        |
|---|----------|----------|----------|----------|----------|
| 0 | 0.1      | 0.1      | 0.219405 | 0.174419 | 0.643213 |
| 1 | 0.526462 | 0.012176 | 0.270882 | 0.466939 | 0.635735 |
| 2 | 0.815319 | 0.495526 | 0.595185 | 0.007326 | 0.15774  |
| 3 | 0.023758 | 0.599895 | 0.133167 | 0.445974 | 0.719086 |
| 4 | 0.425161 | 0.135112 | 0.323847 | 0.591478 | 0.443758 |
| 5 | 0.833287 | 0.083967 | 0.629248 | 0.004949 | 0.717577 |
| 6 | 0.858486 | 0.661285 | 0.578026 | 0.443783 | 0.021609 |
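
The `.loc[:0, ...]` assignment in the cell above changes row 0 because label slices include their end label, while positional slices do not. With a default integer index the two are easy to confuse, so this sketch uses a hypothetical string index to make the difference visible.

```python
import pandas as pd

# Illustrative frame with string labels instead of the default 0..n-1 index.
df = pd.DataFrame({"A": [10, 20, 30, 40]}, index=["w", "x", "y", "z"])

by_label = df.loc["w":"y"]   # label slice: the end label "y" is included
by_position = df.iloc[0:2]   # positional slice: position 2 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())

row = df.loc["x"]            # a single label returns the row as a Series
print(type(row).__name__, row["A"])
```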


## Console

In [15]:
# Limit the display to at most 10 columns; wider frames are truncated with "..."
pd.set_option("display.max_columns", 10)
data_console = pd.DataFrame(np.random.randn(3, 12))
data_console

|   | 0         | 1        | 2         | 3         | 4         | ... | 7         | 8         | 9        | 10        | 11        |
|---|-----------|----------|-----------|-----------|-----------|-----|-----------|-----------|----------|-----------|-----------|
| 0 | 0.518983  | 1.755889 | 0.784926  | 0.944446  | 1.006893  | ... | -1.580462 | 1.195347  | 1.7771   | -1.103896 | -0.038294 |
| 1 | 1.407854  | 0.70787  | -0.109564 | -2.214826 | -0.166679 | ... | -0.713977 | -0.318543 | 0.478405 | -1.053564 | -0.781285 |
| 2 | -2.512259 | 0.727504 | 0.486903  | -1.045777 | 0.984781  | ... | -0.127726 | 0.752647  | -0.52922 | -0.635748 | -0.806894 |


In [16]:
# Raise the limit to 12 so that all 12 columns are shown
pd.set_option("display.max_columns", 12)
data_console

|   | 0         | 1        | 2         | 3         | 4         | 5         | 6        | 7         | 8         | 9        | 10        | 11        |
|---|-----------|----------|-----------|-----------|-----------|-----------|----------|-----------|-----------|----------|-----------|-----------|
| 0 | 0.518983  | 1.755889 | 0.784926  | 0.944446  | 1.006893  | 0.072711  | 1.646737 | -1.580462 | 1.195347  | 1.7771   | -1.103896 | -0.038294 |
| 1 | 1.407854  | 0.70787  | -0.109564 | -2.214826 | -0.166679 | 0.157673  | 1.077261 | -0.713977 | -0.318543 | 0.478405 | -1.053564 | -0.781285 |
| 2 | -2.512259 | 0.727504 | 0.486903  | -1.045777 | 0.984781  | -0.143523 | 0.56028  | -0.127726 | 0.752647  | -0.52922 | -0.635748 | -0.806894 |
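
`pd.set_option` changes the setting for the rest of the session. When a different limit is only needed for one display, `pd.option_context` is one way to apply it temporarily. The sketch below assumes a hypothetical 12-column frame named `wide` and a limit of 6 picked for illustration.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 12)))

before = pd.get_option("display.max_columns")

# The option applies only inside the with-block and is restored afterwards.
with pd.option_context("display.max_columns", 6):
    inside = pd.get_option("display.max_columns")
    print(wide)  # printed with at most 6 columns, the rest elided

after = pd.get_option("display.max_columns")
print(before, inside, after)
```

To undo an earlier `pd.set_option` call instead, `pd.reset_option("display.max_columns")` returns the option to its default value.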
