<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<!--NAVIGATION-->
< [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) | [Contents](Index.ipynb) | [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) >

# The Basics of NumPy Arrays

**Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas ([Chapter 3](03.00-Introduction-to-Pandas.ipynb)) are built around the NumPy array.**
This section will present several examples of **using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays.**
While the types of operations shown here may seem a bit dry and pedantic, they comprise the building blocks of many other examples used throughout the book.
Get to know them well!

We'll cover a few categories of basic array manipulations here:

- *Attributes of arrays*: Determining the size, shape, memory consumption, and data types of arrays
- *Indexing of arrays*: Getting and setting the value of individual array elements
- *Slicing of arrays*: Getting and setting smaller subarrays within a larger array
- *Reshaping of arrays*: Changing the shape of a given array
- *Joining and splitting of arrays*: Combining multiple arrays into one, and splitting one array into many

## NumPy Array Attributes

First let's discuss some useful array attributes.
We'll start by defining three random arrays: one-dimensional, two-dimensional, and three-dimensional.
We'll use NumPy's random number generator, which we will *seed* with a set value in order to ensure that the same random arrays are generated each time this code is run:

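Setting a seed with ``np.random.seed(0)`` fixes the starting point of the random number generator, so the same sequence of random numbers is reproduced.

- Random numbers are normally unpredictable, but with a fixed seed every run gives the same result.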

In [5]:
import numpy as np
np.random.seed(0)  # seed for reproducibility; comment this line out to see the difference

x1 = np.random.randint(10, size=6)  # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array
print(x3)

[[[8 1 5 9 8]
  [9 4 3 0 3]
  [5 0 2 3 8]
  [1 3 3 3 7]]

 [[0 1 9 9 0]
  [4 7 3 2 7]
  [2 0 0 4 5]
  [5 6 8 4 1]]

 [[4 9 8 1 1]
  [7 9 9 3 6]
  [7 2 0 3 5]
  [9 4 4 6 4]]]


**Each array has attributes ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), and ``size`` (the total size of the array):**

In [4]:
print(x2[1,1])

6


In [5]:
print(x2[1][1])

6


In [6]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape) #⭐
print("x3 size: ", x3.size)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


Another useful attribute is the ``dtype``, the data type of the array (which we discussed previously in [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb)):

In [7]:
print("dtype:", x3.dtype)

dtype: int32


Other attributes include ``itemsize``, which lists the size (in bytes) of each array element, and ``nbytes``, which lists the total size (in bytes) of the array:

In [4]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 4 bytes
nbytes: 240 bytes


In general, we expect that ``nbytes`` is equal to ``itemsize`` times ``size``.
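
As a quick check of this relationship, here is a minimal sketch. The array ``a`` is hypothetical and its dtype is set explicitly, because the default integer type can differ between platforms and NumPy versions:

```python
import numpy as np

# Hypothetical array used only for this check
a = np.zeros((3, 4, 5), dtype=np.int32)

# nbytes should equal the per-element size times the element count
print(a.itemsize, a.size, a.nbytes)
print(a.nbytes == a.itemsize * a.size)
```

With 4-byte integers and 60 elements, the total comes to 240 bytes.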

## Array Indexing: Accessing Single Elements

If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar.
In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [16]:
x1

array([5, 0, 3, 3, 7, 9])

In [6]:
x1[0]

5

In [7]:
x1[4]

7

To index from the end of the array, you can use **negative indices:**

In [8]:
x1[-1]

9

In [9]:
x1[-2]

7

**In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:**

In [None]:
rows, cols = 3, 4
array = [[0 for _ in range(cols)] for _ in range(rows)]

# Set the value at the first row, first column
array[0][0] = 99

print(array)
# Output
print(array[0][0])  # result: 99
#print(array[1,2]) # a Python list of lists does not allow [1,2] -> NumPy does

[[99, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
99


In [14]:
print(x2) # NumPy array
print(type(x2))

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
<class 'numpy.ndarray'>


In [7]:
print(x2[0, 0]) # ❗indexing a multi-dimensional array

3


In [10]:
print(x2[2][0])

1


In [None]:
x2[1, -1] # -1 selects the last column

8
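
The difference between nested Python lists and NumPy arrays noted above can be sketched as follows. The names ``nested`` and ``arr`` are hypothetical and used only for illustration:

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]   # plain Python list of lists
arr = np.array(nested)             # the same data as a NumPy array

print(nested[1][2])   # chained indexing works for lists
print(arr[1, 2])      # tuple indexing works for arrays
print(arr[1][2])      # chained indexing also works for arrays

try:
    nested[1, 2]      # lists do not accept a tuple index
except TypeError as err:
    print("TypeError:", err)
```

For arrays, ``arr[1][2]`` first builds the intermediate row ``arr[1]`` and then indexes it, so the comma form is usually the more direct choice.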

Values can also be modified using any of the above index notation:

In [14]:
x2[0, 0] = 12
x2

array([[12,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])

Keep in mind that, **unlike Python lists, NumPy arrays have a fixed type.**
This means, for example, that if you attempt to insert a floating-point value to an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!

In [15]:
x1[0] = 3.14159  # the fractional part of this value will be truncated!
x1

array([3, 0, 3, 3, 7, 9])
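
If the fractional part matters, one option is to convert the array to a floating-point type before assigning. This is a minimal sketch; the names ``ints`` and ``floats`` are hypothetical:

```python
import numpy as np

ints = np.array([5, 0, 3])          # integer dtype inferred from the values
ints[0] = 3.14159                   # fractional part is dropped on assignment
print(ints)

floats = ints.astype(float)         # astype returns a new float array
floats[0] = 3.14159                 # now the value is stored in full
print(floats)
```

Note that ``astype`` creates a new array, so the original integer array is left unchanged.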

## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
**The NumPy slicing syntax follows that of the standard Python list**; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values **``start=0``, ``stop=``*``size of dimension``*, ``step=1``.**
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

### One-dimensional subarrays

In [12]:
x = np.arange(10) #⭐create an array of the integers 0 to 9
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [24]:
x[:5]  # first five elements

array([0, 1, 2, 3, 4])

In [25]:
x[5:]  # elements from index 5 onward

array([5, 6, 7, 8, 9])

In [19]:
x[4:7]  # middle subarray

array([4, 5, 6])

In [27]:
x[::3]  # every third element

array([0, 3, 6, 9])

In [None]:
x[1::2]  # every other element, starting at index 1

array([1, 3, 5, 7, 9])

A potentially confusing case is **when the ``step`` value is negative.
In this case, the defaults for ``start`` and ``stop`` are swapped.
This becomes a convenient way to reverse an array:**
``` python
x[end:start:step]
```

In [28]:
x[::-1]  # all elements, reversed

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [None]:
x[6::-2]  # every other element in reverse, starting at index 6: selects 6, 4, 2, 0

array([6, 4, 2, 0])

In [None]:
x[5::-2]  # every other element in reverse, starting at index 5: selects 5, 3, 1

array([5, 3, 1])
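
To see how the defaults change with the sign of the step, Python's built-in ``slice.indices`` method can resolve the omitted values for a given length. This is a small sketch rather than a description of NumPy internals:

```python
import numpy as np

x = np.arange(10)

# slice.indices(length) resolves omitted start/stop for a given length
print(slice(None, None, 1).indices(len(x)))    # forward defaults
print(slice(None, None, -1).indices(len(x)))   # reversed defaults

# x[::-1] is the same as indexing with an explicit slice object
print(np.array_equal(x[::-1], x[slice(None, None, -1)]))
```

For a negative step, the resolved stop of ``-1`` is a range bound meaning "stop before index 0". It is not the same as writing ``x[9:-1:-1]``, where ``-1`` would be read as the last index and the result would be empty.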

### Multi-dimensional subarrays

Multi-dimensional slices work in the same way, with **multiple slices separated by commas.**
For example:

In [15]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]], dtype=int32)

In [16]:
x2[:2, :3]  # two rows, three columns > the stop indices 2 and 3 are excluded

array([[3, 5, 2],
       [7, 6, 8]], dtype=int32)

In [17]:
x2[:3, ::2]  # all rows (0 to 2), every other column

array([[3, 2],
       [7, 8],
       [1, 7]], dtype=int32)

Finally, subarray dimensions can even be reversed together:

In [None]:
x2[::-1, ::-1] # rows and columns are both reversed

array([[7, 7, 6, 1],
       [8, 8, 6, 7],
       [4, 2, 5, 3]])

#### Accessing array rows and columns

One commonly needed routine is accessing single rows or columns of an array.
This can be done by **combining indexing and slicing, using an empty slice marked by a single colon (``:``):**

In [18]:
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


In [19]:
print(x2[:, 0])  # first column of x2

[3 7 1]


In [36]:
print(x2[0, :])  # first row of x2

[3 5 2 4]


In the case of row access, the empty slice can be omitted for a more compact syntax:

In [None]:
print(x2[0])  # same as x2[0, :] > when only one index is given for the two-dimensional x2

[3 5 2 4]


### Subarrays as no-copy views

**One important–and extremely useful–thing to know about array slices is that they return *views* rather than *copies* of the array data.**
**This is one area in which NumPy array slicing differs from Python list slicing**: in lists, slices will be copies.
Consider our two-dimensional array from before:

In [37]:
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


In [38]:
print(type(x2))

<class 'numpy.ndarray'>


In [21]:
l = [1,2,3,4,5,6,7,8,9]
print(l[::-1])

[9, 8, 7, 6, 5, 4, 3, 2, 1]


In [None]:
l2= l[::-2] #❗a list slice is a 'copy' in Python
l2[0]= 10
print(l2)
print(l)

[10, 7, 5, 3, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9]


Let's extract a $2 \times 2$ subarray from this:

In [23]:
print(x2)
print(type(x2))

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]
<class 'numpy.ndarray'>


In [None]:
x2_sub = x2[:2, :2] #❗a NumPy slice is a 'view', not a copy
print(x2_sub)

[[3 5]
 [7 6]]


Now if we **modify this subarray, we'll see that the original array is changed**. Observe:

In [28]:
x2_sub[0, 0] = 99
print(x2_sub)

[[99  5]
 [ 7  6]]


In [29]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


This default behavior is actually quite useful: it means that **when we work with large datasets (the source data we loaded), we can access and process pieces of these datasets without the need to copy the underlying data buffer.**

### Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the **``copy()`` method:**

In [None]:
x2_sub_copy = x2[:2, :2].copy() #❗create a NumPy copy
print(x2_sub_copy)

[[99  5]
 [ 7  6]]


If we now modify this subarray, the original array is not touched:

In [47]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)

[[42  5]
 [ 7  6]]


In [48]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]
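
One way to check whether two arrays are backed by the same data is ``np.shares_memory``. The sketch below uses a hypothetical array ``a`` and compares a slice with an explicit copy:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
view = a[:2, :2]            # basic slicing returns a view
copy = a[:2, :2].copy()     # copy() allocates independent data

print(np.shares_memory(a, view))   # True: same underlying buffer
print(np.shares_memory(a, copy))   # False: separate buffer

view[0, 0] = 99             # writes through to a
copy[0, 0] = 42             # leaves a untouched
print(a[0, 0])
```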


## Reshaping of Arrays

**Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.**
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [None]:
grid = np.arange(1, 10)
print(grid)
grid=grid.reshape((3, 3)) #❗change the dimensions
print(grid)

[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [None]:
print(grid)
one_grid= grid.reshape(-1)  #❗the -1 in reshape(-1) tells NumPy to "work out the appropriate size automatically"
print(one_grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
[1 2 3 4 5 6 7 8 9]


Note that for this to work, the size of the initial array must match the size of the reshaped array. 
Where possible, the ``reshape`` method will use a no-copy view of the initial array, but with non-contiguous memory buffers this is not always the case.
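
The view-versus-copy behavior of ``reshape`` can be illustrated with a transposed array, whose data is no longer laid out contiguously in row-major order. This is a sketch with a hypothetical array ``a``:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_view = a.reshape(-1)        # contiguous data: reshape can return a view
flat_copy = a.T.reshape(-1)      # transposed data is non-contiguous here

print(np.shares_memory(a, flat_view))
print(np.shares_memory(a, flat_copy))
```

In the second case NumPy has to copy the data in order to produce the flattened array, so later changes to ``flat_copy`` would not affect ``a``.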

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
This can be done with **the ``reshape`` method, or more easily done by making use of the ``newaxis`` keyword within a slice operation:**

In [26]:
x = np.array([1, 2, 3])
print(x)
# row vector via reshape
x.reshape((1, 3))

[1 2 3]


array([[1, 2, 3]])

In [None]:
# row vector via newaxis
# ❗np.newaxis is NumPy's special indexing tool for adding a dimension

x[np.newaxis, :]

array([[1, 2, 3]])

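What is ``np.newaxis``?

- ``np.newaxis`` means the same thing as ``None``

- It is used to add a new dimension to an array's **shape**

- It is used inside an indexing expression, and is in practice the same as writing ``None`` in a slice position.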



In [39]:
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [40]:
# column vector via newaxis
x[:, np.newaxis]

array([[1],
       [2],
       [3]])

In [None]:
x[:, None] # None adds a new length-1 axis without selecting any values (same as np.newaxis)

array([[1],
       [2],
       [3]])

``x[np.newaxis, :]`` and ``x[None, :]`` give the same result.
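
The equivalence of ``np.newaxis`` and ``None`` can be checked directly. A minimal sketch, using a hypothetical vector ``v``:

```python
import numpy as np

v = np.array([1, 2, 3])

print(np.newaxis is None)            # np.newaxis is an alias for None
print(v[np.newaxis, :].shape)        # row vector
print(v[:, np.newaxis].shape)        # column vector
print(np.array_equal(v[:, None], v.reshape(3, 1)))
```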


In [29]:
import numpy as np

x = np.array([10, 20, 30])
print(x.shape)  # (3,)


(3,)


In [33]:
x= np.arange(10)

In [35]:
# ❗Putting None in an index position creates a new axis
x1 = x[None, :]           # → shape: (1, 10) : adds a row axis
x2 = x[:, None]           # → shape: (10, 1) : adds a column axis
print(x1)
print(x2)

x3 = x[np.newaxis, :]           # → shape: (1, 10) : adds a row axis
x4 = x[:, np.newaxis]           # → shape: (10, 1) : adds a column axis
print(x3)
print(x4)

[[0 1 2 3 4 5 6 7 8 9]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
[[0 1 2 3 4 5 6 7 8 9]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]


We will see this type of transformation often throughout the remainder of the book.

## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines **``np.concatenate``, ``np.vstack``, and ``np.hstack``.**
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [4]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [None]:
z = [99, 99, 99]
print(np.concatenate([x, z, y]))

[ 1  2  3 99 99 99  3  2  1]


It can also be used for two-dimensional arrays:

In [7]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [8]:
# concatenate along the first axis
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

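❗Key concept: what is ``axis``?

- NumPy arrays can be multi-dimensional (e.g. 2D, 3D), so you need to specify the direction along which an operation is applied.

- axis=0 → along the rows; arrays are joined top to bottom

- axis=1 → along the columns; arrays are joined left to right

In ``np.concatenate([...], axis=0)``, ``axis=0`` is the option that specifies the axis (direction) along which the arrays are joined.

- axis=0 → join along the row direction

- The number of rows grows, while the number of columns stays the same

- axis=1: join left to right (the number of columns grows)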

In [9]:
np.concatenate([grid, grid],axis=0)

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

In [10]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1) # join column-wise

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
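
A sketch of how the chosen axis interacts with array shapes: the arrays may differ in size along the axis being joined, but every other dimension has to match. The arrays ``a`` and ``b`` are hypothetical:

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((4, 3))     # same number of columns, different number of rows

print(np.concatenate([a, b], axis=0).shape)   # rows add up: (6, 3)

try:
    np.concatenate([a, b], axis=1)            # row counts differ (2 vs 4)
except ValueError as err:
    print("ValueError:", err)
```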

For working with arrays of mixed dimensions, it can be clearer to **use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:**

In [11]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [12]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])

Similarly, ``np.dstack`` will stack arrays along the third axis.
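
As a brief illustration of stacking along the third axis, here is a sketch with two hypothetical $2 \times 2$ arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked = np.dstack([a, b])
print(stacked.shape)     # (2, 2, 2)
print(stacked[0, 0])     # [1 5]
```

Each position in the result holds the corresponding elements of ``a`` and ``b`` side by side along the new third axis.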

### Splitting of arrays

The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``.  For each of these, we can pass a list of indices giving the split points:

In [None]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3, x4 = np.split(x, [1, 3, 5])  # [1, 3, 5] are the split points
print(x1, x2, x3, x4)

[1] [2 3] [99 99] [3 2 1]


Notice that *N* split points lead to *N + 1* subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:

In [21]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [22]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [53]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]


Similarly, ``np.dsplit`` will split arrays along the third axis.
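
A short sketch of ``np.dsplit`` on a hypothetical three-dimensional array, together with the integer form of ``np.split``, which asks for equal-sized sections instead of explicit split points:

```python
import numpy as np

cube = np.arange(16).reshape(2, 2, 4)

front, back = np.dsplit(cube, [1])     # split the third axis at index 1
print(front.shape, back.shape)          # (2, 2, 1) (2, 2, 3)

halves = np.split(np.arange(8), 2)      # an integer asks for equal sections
print(halves)
```

With an integer argument, ``np.split`` raises an error if the array cannot be divided evenly.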

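The features used for model training are shaped (N, 1) and the target values (N,) so that the data structure matches what the model requires for its inputs and outputs.

- Keeping the features as a two-dimensional array and the targets as a one-dimensional array gives consistent data handling and convenient mathematical operations.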

In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate random data
N = 10
X = np.random.rand(N, 1)  # 2-D array of shape (N, 1)
y = 3 * X.ravel() + 2 + np.random.randn(N) * 0.1  # 1-D array of shape (N,)
# X.ravel() flattens (N, 1) into a 1-D array of shape (N,)
# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Predict
y_pred = model.predict(X)


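Linear regression formula: $\hat{y} = Xw + b$

- $X$: input features (N samples, D features per sample)

- $w$: weight vector

- $b$: intercept (bias)

- $\hat{y}$: predicted values

- $y$: true target values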

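Why is X shaped (N, 1)?
- X must be a two-dimensional array made up of N samples and **one feature**.

- With shape (N, 1), X is

 > an N × 1 matrix

 > in which each row is one sample and the column holds the feature values

→ Internally, the model treats w as having shape (1,) and computes the matrix product.

Why is y shaped (N,)?
- y holds one real-valued target per sample.

- Its shape is therefore (N,): a vector of N scalar values.

- It needs the same shape as the model's prediction y_pred so that the loss can be computed.

In summary:

- X must be (N, 1) so that the number of features is defined for the matrix operations.

- y must be (N,) so that it is a vector that can be compared with the predictions.

- This structure is required to match the linear regression computation (Xw + b).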
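
The shape bookkeeping described above can be traced with plain NumPy. This is only a sketch: the weight ``w`` and intercept ``b`` are hypothetical values chosen for illustration, not fitted by any model:

```python
import numpy as np

N = 5
X = np.arange(N, dtype=float).reshape(N, 1)   # features: shape (N, 1)
w = np.array([3.0])                            # hypothetical weight, shape (1,)
b = 2.0                                        # hypothetical intercept

y_hat = X @ w + b          # (N, 1) @ (1,) -> (N,)
y = 3.0 * X.ravel() + 2.0  # targets as a 1-D array, shape (N,)

print(y_hat.shape, y.shape)
print((y_hat - y).shape)                 # element-wise residuals: (N,)
print((y.reshape(N, 1) - y_hat).shape)   # broadcasting pitfall: (N, N)
```

The last line shows why mixing (N, 1) and (N,) targets is risky: broadcasting silently produces an N × N array instead of N residuals.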

In [2]:
X

array([[0.73613843],
       [0.50815634],
       [0.58765119],
       [0.61958882],
       [0.91672011],
       [0.71052193],
       [0.36547981],
       [0.47249113],
       [0.4986792 ],
       [0.26277717]])

In [3]:
y

array([4.1989358 , 3.56522979, 3.73608323, 3.99871055, 4.79906523,
       4.194128  , 3.12456489, 3.40448707, 3.44657535, 2.92236572])

In [45]:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4]])   # (4, 1)
y = np.array([3.1, 5.0, 6.8, 8.9])   # (4,)

model = LinearRegression()
model.fit(X, y)

# Prediction: y = w*x + b
print("slope (w):", model.coef_)
print("intercept (b):", model.intercept_)

# Predicted values
y_pred = model.predict(X)


slope (w): [1.92]
intercept (b): 1.1500000000000004


Fitted equation: y = 1.92x + 1.15
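
As a cross-check of these coefficients, the same ordinary least-squares line can be fitted with NumPy alone. Note that ``np.polyfit`` is swapped in here for ``LinearRegression``; it takes one-dimensional ``x`` and ``y`` arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.0, 6.8, 8.9])

# Degree-1 least-squares fit; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 2), round(intercept, 2))
```

Up to floating-point rounding, this gives a slope of 1.92 and an intercept of 1.15.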