# Intro to data structures

In [27]:
import numpy as np
import pandas as pd
import time
import math

## Series

In [6]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['a', 'b', 'c', 'd', 'e', 'f'])

In [8]:
s.dtype

dtype('float64')

In [9]:
s.array

<NumpyExtensionArray>
[np.float64(1.0), np.float64(3.0), np.float64(5.0), np.float64(nan),
 np.float64(6.0), np.float64(8.0)]
Length: 6, dtype: float64

In [10]:
s.to_numpy()

array([ 1.,  3.,  5., nan,  6.,  8.])

In [12]:
s.get("a", np.nan)

np.float64(1.0)

## Vectorizing

Arithmetic operators apply element-wise to a Series, and missing values propagate as NaN:

In [13]:
s+s

a     2.0
b     6.0
c    10.0
d     NaN
e    12.0
f    16.0
dtype: float64

In [14]:
s ** 3

a      1.0
b     27.0
c    125.0
d      NaN
e    216.0
f    512.0
dtype: float64

In [15]:
s = pd.Series(np.random.randn(5), name="something")
s.name

'something'

## DataFrames

In [16]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}


df = pd.DataFrame(d)

In [17]:
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))

   A  B
0  1  4
1  2  5
2  3  6


In [18]:
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]), orient='index')

   0  1  2
A  1  2  3
B  4  5  6


### Assign

In [19]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])

# Use .assign() to create new columns
df_new = (
    df
    .assign(
        new_column = df['a'] + df['b'],
        another_column = lambda x: x['c'] * 2
    )
)

print(df_new)

          a         b         c  new_column  another_column
0  0.815959  0.789534  0.083381    1.605493        0.166762
1  0.220591  0.913380  0.119860    1.133971        0.239719
2  0.503974  0.354909  0.194086    0.858883        0.388172
3  0.987659  0.438210  0.316668    1.425869        0.633336
4  0.325342  0.510591  0.311661    0.835934        0.623322
5  0.455779  0.039334  0.540863    0.495113        1.081726
6  0.301972  0.290723  0.206702    0.592695        0.413404
7  0.196111  0.275351  0.194808    0.471463        0.389616
8  0.201214  0.258102  0.538218    0.459316        1.076435
9  0.694826  0.360002  0.489759    1.054828        0.979518


### Advantages of using .assign() instead of direct assignment

The main advantage of using .assign() in pandas over directly assigning a new column with df['new_column'] = df['a'] + df['b'] is that .assign() allows for **clean and functional chaining of operations**, returning a new DataFrame **without modifying** the original one. Here are some benefits:

- **Immutability:** Using .assign() ensures that the original DataFrame remains unchanged, which helps avoid unintended side effects.

- **Chaining:** It makes it easier to build a pipeline of transformations, leading to more readable and organized code.

- **Multiple Columns:** You can create multiple columns in a single call, which can make your code more concise.

In [39]:
# Create a DataFrame with random values
df = pd.DataFrame(np.random.rand(10, 3), columns=['a', 'b', 'c'])

# Use .assign() to create new columns
df_new = (
    df
    .assign(
        new_column = df['a'] + df['b'],
        another_column = lambda x: x['c'] * 2
    )
)

print(df_new)

          a         b         c  new_column  another_column
0  0.647061  0.707687  0.609734    1.354748        1.219467
1  0.272986  0.270612  0.952313    0.543598        1.904626
2  0.909359  0.694084  0.546607    1.603442        1.093214
3  0.663067  0.056012  0.637765    0.719079        1.275531
4  0.591930  0.593451  0.749606    1.185382        1.499212
5  0.130057  0.809858  0.073296    0.939915        0.146592
6  0.976192  0.453902  0.636216    1.430094        1.272432
7  0.951737  0.021167  0.323400    0.972905        0.646801
8  0.281993  0.091625  0.375010    0.373619        0.750020
9  0.150728  0.080310  0.731202    0.231038        1.462404
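
As a complementary sketch of the points listed in this section, the example below checks that the original DataFrame is left untouched and that a later keyword in the same .assign() call can use a column created by an earlier one. The column names `a_plus_b` and `ratio`, the fixed seed, and the sort step are illustrative assumptions, not part of the cells above.

```python
import numpy as np
import pandas as pd

# Fixed seed so the illustrative values are reproducible (an assumption for this sketch)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=["a", "b", "c"])

df_new = (
    df
    .assign(
        a_plus_b=lambda x: x["a"] + x["b"],
        # A later keyword can refer to a column created earlier in the same call
        ratio=lambda x: x["a_plus_b"] / x["c"],
    )
    # Further steps can be chained because .assign() returns a new DataFrame
    .sort_values("ratio")
)

# The original DataFrame still has only its three columns
print(list(df.columns))      # ['a', 'b', 'c']
print(list(df_new.columns))  # ['a', 'b', 'c', 'a_plus_b', 'ratio']
```

With direct assignment (df['a_plus_b'] = ...), the first print would show the new column as well, because df itself would have been modified.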


### NumPy universal functions in pandas DataFrames

Using NumPy's *ufuncs* (universal functions) on pandas DataFrames offers several key advantages over using Python's standard-library math module or manually looping through data:

- **Performance:**

    - **Vectorization:** Ufuncs are vectorized, meaning they operate on entire arrays at once rather than element by element. This leverages highly optimized C code, resulting in significant speed improvements compared to using scalar operations (e.g., those from the math module) inside Python loops.

    - **Optimized Operations:** Since these functions are implemented in C and optimized for performance, they execute operations much faster than manually iterating over each element in Python.

    - **Broadcasting:** Ufuncs support broadcasting, which means they can automatically handle operations between arrays of different shapes, applying the operation element-wise without the need for explicit loops or reshaping.

- **Seamless Integration with Pandas:**

    - **Automatic Alignment:** When you apply NumPy ufuncs to pandas objects like Series or DataFrames, pandas automatically aligns the data based on the index labels. This ensures that operations are performed on matching labels, even if the original order differs.

    - **Element-wise Operations:** Ufuncs work directly on the underlying NumPy arrays, making it easy to perform element-wise computations across large datasets.
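
The alignment and broadcasting behaviour described above can be sketched on a few small objects. This is a minimal illustration rather than a complete account: the variable names are made up for the example, and label alignment for two-argument ufuncs assumes a reasonably recent pandas version.

```python
import numpy as np
import pandas as pd

# Two Series with the same labels in a different order
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "a"])

# A binary ufunc pairs values by label, not by position:
# a -> 1 + 30, b -> 2 + 20, c -> 3 + 10
aligned_sum = np.add(s1, s2)
print(aligned_sum)

# A unary ufunc returns a Series that keeps the original index
roots = np.sqrt(s1)
print(roots.index.tolist())

# Broadcasting: a length-2 array is applied to every row of a 3x2 DataFrame,
# so column x is multiplied by 10 and column y by 100
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
scaled = df * np.array([10.0, 100.0])
print(scaled)
```

Adding the raw arrays instead (s1.to_numpy() + s2.to_numpy()) would pair values by position and give 11, 22, 33, which is why keeping the pandas objects matters when the labels are in a different order.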

In [35]:
# Create a DataFrame with 1 million rows and one column 'x'
df = pd.DataFrame({'x': np.linspace(0, 10, 1000000)})

# 1. Using vectorized np.sin function
start = time.time()
df['sin_vectorized'] = np.sin(df['x'])
time_vectorized = time.time() - start
print("Time using vectorized np.sin:", time_vectorized, "seconds")

# 2. Using apply with math.sin (processing element by element)
start = time.time()
df['sin_loop'] = df['x'].apply(math.sin)
time_loop = time.time() - start
print("Time using apply with math.sin:", time_loop, "seconds")

Time using vectorized np.sin: 0.018065929412841797 seconds
Time using apply with math.sin: 0.22726941108703613 seconds


In [38]:
# Create a DataFrame with 1 million rows and two columns
df2 = pd.DataFrame({
    'a': np.random.rand(1000000),
    'b': np.random.rand(1000000)
})

# 1. Using vectorized addition with the + operator
start = time.time()
df2['sum_vectorized'] = df2['a'] + df2['b']
time_vectorized = time.time() - start
print("Time using vectorized addition (operator +):", time_vectorized, "seconds")

# 2. Using np.add (NumPy's ufunc)
start = time.time()
df2['sum_np_add'] = np.add(df2['a'], df2['b'])
time_np_add = time.time() - start
print("Time using np.add:", time_np_add, "seconds")

# 3. Using apply with a lambda function (row-by-row processing)
start = time.time()
df2['sum_loop'] = df2.apply(lambda row: row['a'] + row['b'], axis=1)
time_loop = time.time() - start
print("Time using apply with lambda:", time_loop, "seconds")


Time using vectorized addition (operator +): 0.0036292076110839844 seconds
Time using np.add: 0.004775285720825195 seconds
Time using apply with lambda: 11.774370193481445 seconds
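
A single wall-clock reading can be noisy, so the comparison can also be sketched with the standard-library timeit module, taking the best of several repeats. The row count, repeat counts, and variable names below are assumptions chosen to keep the sketch quick, and the timings will differ from machine to machine.

```python
import math
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1,000,000 rows used above so the sketch runs quickly (an assumption)
n_rows = 100_000
x = pd.Series(np.linspace(0, 10, n_rows))

calls = 5    # calls per timing run
repeats = 3  # independent timing runs; the minimum is reported

# Best-of-N is less sensitive to background load than one time.time() difference
t_vectorized = min(timeit.repeat(lambda: np.sin(x), number=calls, repeat=repeats)) / calls
t_apply = min(timeit.repeat(lambda: x.apply(math.sin), number=calls, repeat=repeats)) / calls

print(f"np.sin:          {t_vectorized:.6f} s per call")
print(f"apply(math.sin): {t_apply:.6f} s per call")

# Both approaches should agree numerically
assert np.allclose(np.sin(x), x.apply(math.sin))
```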
