## 📊 Data Structures

> Pandas provides two primary data structures for handling data efficiently:

- **Series** – A one-dimensional labeled array.
- **DataFrame** – A two-dimensional labeled table (rows and columns).


In [1]:
import pandas as pd
import numpy as np

### Series

- It is a one-dimensional labeled array
- Each value has an index (label)
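As a minimal sketch of what "labeled" means, the example below builds a Series with an explicit index and reads values both by label and by position. The variable name `prices` and the labels `"a"`, `"b"`, `"c"` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the labels "a", "b", "c" are chosen only for illustration.
prices = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(prices["b"])         # value stored under the label "b"
print(prices.iloc[0])      # value at the first position
print(list(prices.index))  # the labels themselves
```

When no index is passed, pandas falls back to the default integer labels 0, 1, 2, and so on.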

In [2]:
# Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:", s)

Series: 0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [3]:
# Naming the series

s.name = "numbers"
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
Name: numbers, dtype: float64

##### NumPy Array

- A NumPy array is the core data structure of the NumPy library in Python.
- It's like a supercharged list, built for fast numerical computing.

In [4]:
s.to_numpy()

array([ 1.,  3.,  5., nan,  6.,  8.])
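A short sketch of what can be done with the converted array, assuming the same values as the Series above. It shows that the result is a plain `numpy.ndarray`, that arithmetic is applied element-wise, and that `np.nanmean` skips the missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
arr = s.to_numpy()

print(type(arr))        # a plain numpy.ndarray; the index labels are dropped
print(arr * 2)          # arithmetic is applied element-wise
print(arr.mean())       # NaN, because one entry is missing
print(np.nanmean(arr))  # mean computed over the non-missing entries only
```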

### DataFrame

- A **DataFrame** is like a table in Python: it has rows and columns. It is generally the most commonly used pandas object.
- DataFrame accepts many different kinds of input:
    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2-D numpy.ndarray
    - Structured or record ndarray
    - A Series
    - Another DataFrame
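The input kinds listed above can be sketched as follows. All variable names and values here are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Dict of lists: keys become column names.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# 2-D ndarray: column names are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"])

# Structured ndarray: field names become column names.
records = np.array([(1, 4.0), (2, 5.0)], dtype=[("x", "i4"), ("y", "f8")])
from_records = pd.DataFrame(records)

# A named Series becomes a single-column DataFrame.
from_series = pd.DataFrame(pd.Series([1, 2, 3], name="x"))

# Another DataFrame: copy=True requests an independent copy.
from_frame = pd.DataFrame(from_dict, copy=True)

for frame in (from_dict, from_array, from_records, from_series, from_frame):
    print(frame.shape, list(frame.columns))
```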


In [5]:
dataframe = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20230102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})
dataframe

|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo |


In [6]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype        
---  ------  --------------  -----        
 0   A       4 non-null      float64      
 1   B       4 non-null      datetime64[s]
 2   C       4 non-null      float32      
 3   D       4 non-null      int32        
 4   E       4 non-null      category     
 5   F       4 non-null      object       
dtypes: category(1), datetime64[s](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes
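The constructor call above mixes scalars with array-like columns. A reduced sketch of that behaviour, assuming a smaller hypothetical frame `df`: scalars are repeated for every row, and `select_dtypes` can then pick columns by the dtypes that `info()` reports.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": 1.0,                               # scalar, repeated for every row
    "D": np.array([3] * 4, dtype="int32"),  # the array fixes the length at 4 rows
    "F": "foo",                             # scalar string, repeated as well
})

print(df.dtypes)

# Keep only the numeric columns (here A and D).
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))
```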


## Assign

- `assign` adds new columns to a DataFrame. It returns a new DataFrame with the added columns and leaves the original unmodified.

In [7]:
# Add a column D_multi that holds the element-wise sum of D and A
dataframe.assign(D_multi=dataframe["D"] + dataframe["A"])

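|   | A | B | C | D | E | F | D_multi |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2023-01-02 | 1.0 | 3 | test | foo | 4.0 |
| 1 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0 |
| 2 | 1.0 | 2023-01-02 | 1.0 | 3 | test | foo | 4.0 |
| 3 | 1.0 | 2023-01-02 | 1.0 | 3 | train | foo | 4.0 |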
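To check that `assign` leaves the original untouched, here is a small sketch on a hypothetical two-column frame. It also uses the callable form, where the function receives the DataFrame being assigned to; the column name `total` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "D": [3, 4]})

# A callable receives the DataFrame that assign() is operating on.
result = df.assign(total=lambda d: d["A"] + d["D"])

print(list(df.columns))      # the original frame is unchanged
print(list(result.columns))  # the returned frame has the extra column
```

To keep the new column, bind the result to a name (for example `df = df.assign(...)`), since nothing is changed in place.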


## Indexing and Selection

| Operation                        | Syntax           | Result     | 
|----------------------------------|------------------|------------|
| Select column                    | `df[col]`        | Series     |
| Select row by label             | `df.loc[label]`  | Series     |
| Select row by integer location | `df.iloc[loc]`   | Series     |
| Slice rows                       | `df[5:10]`       | DataFrame  |
| Select rows by boolean vector   | `df[bool_vec]`   | DataFrame  |
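The result types in the table can be verified with a small sketch. The frame below uses string row labels (`"w"` to `"z"`, chosen only for illustration) so that label-based and position-based selection are easy to tell apart.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    columns=["A", "B", "C"],
    index=["w", "x", "y", "z"],
)

col = df["A"]               # select column           -> Series
row_by_label = df.loc["x"]  # select row by label     -> Series
row_by_pos = df.iloc[1]     # select row by position  -> Series
row_slice = df[1:3]         # slice rows              -> DataFrame
filtered = df[df["A"] > 3]  # boolean vector          -> DataFrame

for name, obj in [("col", col), ("row_by_label", row_by_label),
                  ("row_by_pos", row_by_pos), ("row_slice", row_slice),
                  ("filtered", filtered)]:
    print(name, type(obj).__name__)
```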

In [8]:
df_index = pd.DataFrame({
    "A" : np.random.random_sample(7),
    "B" : np.random.random_sample(7),
    "C" : np.random.random_sample(7),
    "D" : np.random.random_sample(7),
    "E" : np.random.random_sample(7),
    })

df_index

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.939421 | 0.859362 | 0.818666 | 0.15226 | 0.619816 |
| 1 | 0.117915 | 0.900797 | 0.790772 | 0.328731 | 0.082007 |
| 2 | 0.311455 | 0.915702 | 0.041642 | 0.257354 | 0.345515 |
| 3 | 0.489604 | 0.089913 | 0.915684 | 0.999537 | 0.314076 |
| 4 | 0.558999 | 0.76278 | 0.464198 | 0.381002 | 0.741625 |
| 5 | 0.354595 | 0.352696 | 0.882769 | 0.980515 | 0.458547 |
| 6 | 0.271341 | 0.793454 | 0.808487 | 0.377263 | 0.09229 |


In [9]:
# Series (single column)
df_index['A']

0    0.939421
1    0.117915
2    0.311455
3    0.489604
4    0.558999
5    0.354595
6    0.271341
Name: A, dtype: float64

In [10]:
# Series (single row)
df_index.iloc[0]

A    0.939421
B    0.859362
C    0.818666
D    0.152260
E    0.619816
Name: 0, dtype: float64

In [11]:
# DataFrame (multiple rows)
df_index.iloc[0:3]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.939421 | 0.859362 | 0.818666 | 0.15226 | 0.619816 |
| 1 | 0.117915 | 0.900797 | 0.790772 | 0.328731 | 0.082007 |
| 2 | 0.311455 | 0.915702 | 0.041642 | 0.257354 | 0.345515 |


In [12]:
# Slicing with [] also selects rows by position
df_index[0:3]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.939421 | 0.859362 | 0.818666 | 0.15226 | 0.619816 |
| 1 | 0.117915 | 0.900797 | 0.790772 | 0.328731 | 0.082007 |
| 2 | 0.311455 | 0.915702 | 0.041642 | 0.257354 | 0.345515 |


In [13]:
# DataFrame (filtered rows)
df_index[df_index['A'] > 0.9]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.939421 | 0.859362 | 0.818666 | 0.15226 | 0.619816 |
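Boolean filters can also be combined. The sketch below, on a small hypothetical frame, uses `&` (and), `|` (or) and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.9, 0.1, 0.5], "B": [0.2, 0.8, 0.6]})

both = df[(df["A"] > 0.4) & (df["B"] > 0.5)]    # rows where both conditions hold
either = df[(df["A"] > 0.8) | (df["B"] > 0.7)]  # rows where at least one holds
negated = df[~(df["A"] > 0.4)]                  # rows where the condition fails

print(list(both.index), list(either.index), list(negated.index))
```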


In [14]:
# Work on a copy, then set columns A and B to 0.1 for every row up to and including label 0
df_index_copy = df_index.copy()
df_index_copy.loc[:0, ['A', 'B']] = 0.1
df_index_copy

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 0.1 | 0.1 | 0.818666 | 0.15226 | 0.619816 |
| 1 | 0.117915 | 0.900797 | 0.790772 | 0.328731 | 0.082007 |
| 2 | 0.311455 | 0.915702 | 0.041642 | 0.257354 | 0.345515 |
| 3 | 0.489604 | 0.089913 | 0.915684 | 0.999537 | 0.314076 |
| 4 | 0.558999 | 0.76278 | 0.464198 | 0.381002 | 0.741625 |
| 5 | 0.354595 | 0.352696 | 0.882769 | 0.980515 | 0.458547 |
| 6 | 0.271341 | 0.793454 | 0.808487 | 0.377263 | 0.09229 |
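Only row 0 changes in the output above because `.loc` slices by label and includes the end label, so `:0` covers just that row. A minimal sketch of the difference from `.iloc`, which slices by position and excludes the end, using a hypothetical one-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})

by_label = df.loc[:2]      # labels 0, 1 and 2 (end label included)
by_position = df.iloc[:2]  # positions 0 and 1 (end position excluded)

# Assign through .loc on a copy so the original stays intact.
updated = df.copy()
updated.loc[:1, "A"] = 0.0

print(len(by_label), len(by_position))
print(updated["A"].tolist())
print(df["A"].tolist())
```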


## Console

In [35]:
# Limit the display to at most 10 columns; wider frames are truncated with "..."
pd.set_option("display.max_columns", 10)
data_console = pd.DataFrame(np.random.randn(3, 12))
data_console

|   | 0 | 1 | 2 | 3 | 4 | ... | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.622883 | 0.970012 | -1.272123 | -0.451679 | -1.019953 | ... | -0.712983 | -0.091395 | 0.566931 | 0.24162 | -0.756996 |
| 1 | -0.246547 | -0.527671 | -0.936813 | 0.68951 | 2.067178 | ... | 2.069777 | -0.33158 | -0.659246 | 0.259157 | -0.368424 |
| 2 | 1.175306 | 1.01013 | -0.707288 | -0.293335 | 0.345242 | ... | 0.647592 | 0.373969 | 1.11938 | 0.421097 | 0.236516 |


In [36]:
# Raise the limit to 12 so that all 12 columns are shown
pd.set_option("display.max_columns", 12)
data_console

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.622883 | 0.970012 | -1.272123 | -0.451679 | -1.019953 | 0.308539 | 0.025021 | -0.712983 | -0.091395 | 0.566931 | 0.24162 | -0.756996 |
| 1 | -0.246547 | -0.527671 | -0.936813 | 0.68951 | 2.067178 | 0.622154 | -0.022478 | 2.069777 | -0.33158 | -0.659246 | 0.259157 | -0.368424 |
| 2 | 1.175306 | 1.01013 | -0.707288 | -0.293335 | 0.345242 | -0.753883 | -0.580522 | 0.647592 | 0.373969 | 1.11938 | 0.421097 | 0.236516 |
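`pd.set_option` changes the setting for the rest of the session. When a different limit is needed only for one display, `pd.option_context` is one alternative: it applies the option inside a `with` block and restores the previous value afterwards. The sketch below assumes a hypothetical frame `wide` and a limit of 4 columns.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((2, 12)))

before = pd.get_option("display.max_columns")

# The option applies only inside the with-block.
with pd.option_context("display.max_columns", 4):
    inside = pd.get_option("display.max_columns")
    print(wide)

after = pd.get_option("display.max_columns")
print(before, inside, after)

# Restore the library default if set_option was used earlier in the session.
pd.reset_option("display.max_columns")
```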
