

### **Announcements**

- Homework 1 due anytime today
- Homework 2 released. Due 02/13
- Last day to add or drop courses

# Agenda

- Built-ins and functions
- Synchronizing threads
- Scheduling threads
- Memory model
- Matrix multiply revisited
- Atomic functions



### **Functional Declarations**

- \_\_global\_\_
  - Must return void
- \_\_device\_\_
  - Inlined by default

See Appendix B.1 in the NVIDIA CUDA C Programming Guide for more details

# Functional Declarations

- What do these do?
  - \_\_global\_\_ \_host\_\_ void func()
  - \_\_device\_\_ \_host\_\_ void func()

# Functional Declarations What do these do? global\_\_\_host\_\_void func() device\_\_host\_\_void func() host\_\_device\_func() tif\_cuda ARCH\_ == 100

#elif \_\_CUDA\_ARCH\_\_ == 200

#endif

#elif !defined(\_\_CUDA\_ARCH\_\_)
 // Host code path

// Device code path for compute capability 1.0

// Device code path for compute capability 2.0

Code from http://developer.download.nvidia.com/compute/cuda/3\_2\_prod/toolkit/docs/CUDA\_C\_Programming\_Guide.pdf

# **Functional Declarations**

- Global and device functions
  - No recursion (except Fermi)
  - No static variables
  - No malloc()
  - Careful with function calls through pointers
- We'll see similar constraints in GLSL

# **Vector Types**

- char[1-4], uchar[1-4]
- short[1-4], ushort[1-4]
- int[1-4], uint[1-4]
- long[1-4], ulong[1-4]
- longlong[1-4], ulonglong[1-4]
- float[1-4]
- double1, double2

# **Vector Types**

- Available in host and device code
- Construct with make\_<type name>

```
int2 i2 = make_int2(1, 2);
float4 f4 = make_float4(
   1.0f, 2.0f, 3.0f, 4.0f);
```

# **Vector Types**

Access with .x, .y, .z, and .w

```
int2 i2 = make_int2(1, 2);
int x = i2.x;
int y = i2.y;
```

■ No.r, .g, .b, .a, etc. like GLSL

### **Math Functions**

- Double and float overloads
  - □ No vector overloads
- On the host, functions use the C runtime implementation if available

See Appendix C in the NVIDIA CUDA C Programming Guide for a complete list of math function

# 





```
Review: Thread Hierarchies

int threadID = blockIdx.x *
  blockDim.x + threadIdx.x;

float x = input[threadID];

float y = func(x);

output[threadID] = y;
```

# Review: Thread Hierarchies int threadID = blockIdx.x \* blockDim.x + threadIdx.x; float x = input[threadID]; float y = func(x); output[threadID] = y; Use grid and block position to compute a thread id

```
Review: Thread Hierarchies

int threadID = blockIdx.x *
blockDim.x + threadIdx.x;

float x = input[threadID];

float y = func(x);
output[threadID] = y;

Use thread id to read from input
```

```
Review: Thread Hierarchies

int threadID = blockIdx.x *
  blockDim.x + threadIdx.x;
float x = input[threadID];

float y = func(x);
output[threadID] = y;

Run function on input: data-paralle!
```

```
Review: Thread Hierarchies

int threadID = blockIdx.x *
  blockDim.x + threadIdx.x;
float x = input[threadID];
float y = func(x);
output[threadID] = y;
Use thread id to output result
```

# Thread Synchronization ■ Threads in a block can synchronize □ call syncthreads to create a barrier □ A thread waits at this call until all threads in the block reach it, then all threads continue Mds[i] = Md[j];syncthreads(); func(Mds[i], Mds[i + 1]);

















# Thread Synchronization Why is it important that execution time be similar among threads? Why does it only synchronize within a block?



# Thread Synchronization ■ Can \_\_syncthreads() cause a thread to hang?

```
Thread Synchronization

if (someFunc())
{
    __syncthreads();
}
// ...
```

```
Thread Synchronization

if (someFunc())
{
    __syncthreads();
}
else
{
    __syncthreads();
}
```

























# **Scheduling Threads**

What happens if branches in a warp diverge?

# **Scheduling Threads**

- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there?
- A SM on GT200 can host up to 1024 threads, how many warps is that?

# Scheduling Threads

32 threads per warp but 8 SPs per SM. What gives?

# **Scheduling Threads**

- 32 threads per warp but 8 SPs per SM. What gives?
- When an SM schedules a warp:
  - ☐ Its instruction is ready
  - □8 threads enter the SPs on the 1st cycle
  - □8 more on the 2<sup>nd</sup>, 3<sup>rd</sup>, and 4<sup>th</sup> cycles
  - ☐ Therefore, 4 cycles are required to dispatch a warp

# Scheduling Threads

- Question
  - □ A kernel has
    - 1 global memory read (200 cycles)
    - 4 non-dependent multiples/adds
  - ☐ How many warps are required to hide the memory latency?

# Scheduling Threads

- Solution
  - □ Each warp has 4 multiples/adds
    - ■16 cycles
  - □We need to cover 200 cycles
    - **200** / 16 = 12.5
    - **ceil**(12.5) = 13
  - □ 13 warps are required

























# Let's revisit matrix multiple

```
Matrix Multiply: CPU Implementation

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
  for (int i = 0; i < width; ++i)
    for (int j = 0; j < width; ++j)
    {
      float sum = 0;
      for (int k = 0; k < width; ++}
    {
      float a = M[i * width + k];
      float b = N[k * width + j];
      sum += a * b;
    }
    P[i * width + j] = sum;
    }
}

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt</pre>
```

```
Matrix Multiply: CUDA Kernel
           // Matrix multiplication kernel - thread specification
            _global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
             // 2D Thread ID
             int tx = threadIdx.x;
              int ty = threadIdx.y;
             // Pvalue stores the Pd element that is computed by the thread
                                                     Where did the two outer for loops
               for (int k = 0; k < Width; ++k
                                                     in the CPU implementation go?
                 float Mdelement = Md[ty * Md.width + k];
                 float Ndelement = Nd[k * Nd.width + tx];
                Pvalue += Mdelement * Ndelement:
              // Write the matrix to device memory each thread writes one element
              Pd[ty * Width + tx] = Pvalue;
                  Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
```

```
Matrix Multiply: CUDA Kernel
           // Matrix multiplication kernel - thread specification
             global void MatrixMulKernel(float* Md, float* Nd, float* Pd. int Width)
              // 2D Thread ID
              int tx = threadIdx.x;
              int ty = threadIdx.y;
              // Pvalue stores the Pd element that is computed by the thread
              float Pvalue = 0;
              for (int k = 0; k < Width; ++k)
                 float Mdelement = Md[ty * Md.width + k];
                 float Ndelement = Nd[k * Nd.width + tx];
                 Pvalue += Mdelement * Ndelement:
                                                     No locks or synchronization, why?
              // Write the matrix to device memory each thread writes one element
              Pd[ty * Width + tx] = Pvalue;
                   Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
```









```
Matrix Multiply

__global___ void MatrixMulKernel(
    float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html</pre>
```

```
Matrix Multiply

Calculate the row index of the Pd element and M

_global__ void MatrixMulKernel(
    float* Md, float* Nd, float* Pd, int Width)

{
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
    }

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
Calculate the column index of Pd and N

__global___ void MatrixMulKernel(
    float* Md, float* Nd, float* Pd, int Width)
{
    int Row = blockIdx.y * TILE WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
Matrix Multiply

Invoke kernel:

dim3 dimBlock(Width / TILE_WIDTH, Height / TILE_WIDTH);
dim3 dimGrid(TILE_WIDTH, TILE_WIDTH);

MatrixMulKernel<<<dimGrid, dimBlock>>>(
Md, Nd, Pd, TILE_WIDTH);
```

What about global memory access?

# Matrix Multiply Limited by global memory bandwidth G80 peak GFLOPS: 346.5 Require 1386 GB/s to achieve this G80 memory bandwidth: 86.4 GB/s Limits code to 21.6 GFLOPS In practice, code runs at 15 GFLOPS Must drastically reduce global memory access







```
__global__ void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * TILE_WIDTH + ty;
 int Col = bx * TILE_WIDTH + tx;
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
   Nds[tv][tx] = Nd[Col + (m*TILE WIDTH + tv)*Width];
   __syncthreads();
   for (int k = 0; k < TILE WIDTH; ++k)</pre>
     Pvalue += Mds[ty][k] * Nds[k][tx];
   __synchthreads();
 Pd[Row*Width+Col] = Pvalue;
                                  Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
global void MatrixMulKernel(
float* Md, float* Nd, float* Pd, int Width)
 shared float Mds[TILE WIDTH][TILE WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
                                                Width/TILE WIDTH
int bx = blockIdx.x; int by = blockIdx.y;

    Number of phases

int tx = threadIdx.x; int ty = threadIdx.y;
int Row = by * TILE WIDTH + ty;
                                                · Index for current phase
int Col = bx * TILE WIDTH + tx;
 float Pvalue = 0;
 or (int m = 0; m < Width/TILE_WIDTH; ++m) {
  Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
  Nds[ty][tx] = Nd[Col + (m*TILE WIDTH + ty)*Width];
  syncthreads();
  for (int k = 0; k < TILE WIDTH; ++k)</pre>
    Pvalue += Mds[ty][k] * Nds[k][tx];
  __synchthreads();
Pd[Row*Width+Coll = Pvalue;
                                 Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
global void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  _shared__ float Nds[TILE_WIDTH][TILE_WIDTH]
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * TILE_WIDTH + ty;
                                       Shared memory for a
 int Col = bx * TILE_WIDTH + tx;
                                       subset of Md and Nd
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
   Nds[tv][tx] = Nd[Col + (m*TILE WIDTH + tv)*Width];
   __syncthreads();
   for (int k = 0; k < TILE WIDTH; ++k)</pre>
    Pvalue += Mds[ty][k] * Nds[k][tx];
   __synchthreads();
 Pd[Row*Width+Col] = Pvalue;
                                  Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
global void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
                                                 Bring one element each
 int bx = blockIdx.x; int by = blockIdx.y;
                                                 from Md and Nd into
 int tx = threadIdx.x; int ty = threadIdx.y;
                                                 shared memory
 int Row = by * TILE WIDTH + ty;
 int Col = bx * TILE WIDTH + tx;
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
  Nds[ty][tx] = Nd[Col + (m*TILE WIDTH + ty)*Width]
   syncthreads();
   for (int k = 0; k < TILE WIDTH; ++k)</pre>
     Pvalue += Mds[ty][k] * Nds[k][tx];
   __synchthreads();
 Pd[Row*Width+Col] = Pvalue;
                                  Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
__global__ void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * TILE_WIDTH + ty;
 int Col = bx * TILE_WIDTH + tx;
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
   Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
                                                  Wait for every thread in
                                                 the block, i.e., wait for
   for (int k = 0; k < TILE WIDTH; ++k)</pre>
                                                 the tile to be in shared
     Pvalue += Mds[ty][k] * Nds[k][tx];
   __synchthreads();
                                                 memory
 Pd[Row*Width+Col] = Pvalue;
                                   Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
global void MatrixMulKernel(
float* Md, float* Nd, float* Pd, int Width)
 shared float Mds[TILE WIDTH][TILE WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
int Row = by * TILE WIDTH + ty;
int Col = bx * TILE WIDTH + tx;
float Pvalue = 0;
for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
  Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
  Nds[ty][tx] = Nd[Col + (m*TILE WIDTH + ty)*Width];
  syncthreads();
  for (int k = 0; k < TILE WIDTH; ++k)</pre>
    Pvalue += Mds[ty][k] * Nds[k][tx];
                                           Why?
Pd[Row*Width+Col] = Pvalue;
                                 Code from http://courses.engr.illinois.edu/ece498/al/Svllabus.html
```

```
global void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * TILE_WIDTH + ty;
 int Col = bx * TILE_WIDTH + tx;
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
   Nds[tv][tx] = Nd[Col + (m*TILE WIDTH + tv)*Width];
   __syncthreads();
                                                 Accumulate subset of
    for (int k = 0; k < TILE WIDTH; ++k)
   Pvalue += Mds[ty][k] * Nds[k][tx];
                                                dot product
   __synchthreads();
 Pd[Row*Width+Col] = Pvalue;
                                  Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

```
global void MatrixMulKernel(
 float* Md, float* Nd, float* Pd, int Width)
 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
 __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
 int bx = blockIdx.x; int by = blockIdx.y;
 int tx = threadIdx.x; int ty = threadIdx.y;
 int Row = by * TILE WIDTH + ty;
 int Col = bx * TILE WIDTH + tx;
 float Pvalue = 0;
 for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
   Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
   Nds[ty][tx] = Nd[Col + (m*TILE WIDTH + ty)*Width];
   syncthreads();
   for (int k = 0; k < TILE WIDTH; ++k)</pre>
     Pvalue += Mds[ty][k] * Nds[k][tx];
   __synchthreads();
                                        Write final answer
 Pd[Row*Width+Col] = Pvalue;
                                       to global memory
                                  Code from http://courses.engr.illinois.edu/ece498/al/Syllabus.html
```

# Matrix Multiply

- How do you pick TILE\_WIDTH?
  - □ How can it be too large?

# Matrix Multiply How do you pick TILE\_WIDTH? How can it be too large? By exceeding the maximum number of threads/block G80 and GT200 - 512 Fermi - 1024

# Matrix Multiply

- How do you pick TILE\_WIDTH?
  - □ How can it be too large?
    - By exceeding the maximum number of threads/block
      - □ G80 and GT200 512
      - □ Fermi 1024
    - By exceeding the shared memory limitations
      - □ G80: 16KB per SM and up to 8 blocks per SM
        - 2 KB per block
        - 1 KB for Nds and 1 KB for Mds (16 \* 16 \* 4)
        - TILE WIDTH = 16
        - A larger TILE\_WIDTH will result in less blocks

#### -Matrix Multiply

- Shared memory tiling benefits
  - □ Reduces global memory access by a factor of TILE\_WIDTH
    - 16x16 tiles reduces by a factor of 16
  - □G80
    - Now global memory supports 345.6 GFLOPS
    - Close to maximum of 346.5 GFLOPS

#### First-order Size Considerations in G80

- Each thread block should have many threads

  □ TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- There should be many thread blocks
  - □ A 1024\*1024 Pd gives 64\*64 = 4K Thread Blocks
- Each thread block perform 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8K mul/add operations.
  - □ Memory bandwidth no longer a limiting factor

Slide from http://courses.engr.illinois.edu/ece498/al/Syllabus.htm



#### **Atomic Functions**

■ What is the value of count if 8 threads execute ++count?

```
__device__ unsigned int count = 0;
// ...
++count;
```



#### **Atomic Functions**

- Read-modify-write atomic operation
  - □ Guaranteed no interference from other threads
  - □ No guarantee on order
- Shared or global memory
- Requires compute capability 1.1 (> G80)

See G.1 in the NVIDIA CUDA C Programming Guide for full compute capability requirements

### **Atomic Functions**

■ What is the value of count if 8 threads execute atomicAdd below?

```
__device__ unsigned int count = 0;
// ...
// atomic ++count
atomicAdd(&count, 1);
```

### **Atomic Functions**

■ How do you implement atomicAdd?

```
__device__ int atomicAdd(
  int *address, int val);
```

### **Atomic Functions**

■ How do you implement atomicAdd?

```
__device__ int atomicAdd(
  int *address, int val)
{ // Made up keyword:
  __lock (address) {
    *address += value;
  }
}
```

# **Atomic Functions**

How do you implement atomicAdd without locking?

# **Atomic Functions**

- How do you implement atomicAdd without locking?
- What if you were given an atomic compare and swap?

```
int atomicCAS(int *address, int
  compare, int val);
```

# **Atomic Functions**

■ atomicCAS pseudo implementation

```
int atomicCAS(int *address,
  int compare, int val)
{ // Made up keyword
  __lock(address) {
   int old = *address;
   *address = (old == compare) ? val : old;
   return old;
  }
}
```

# Atomic Functions atomicCAS pseudo implementation int atomicCAS(int \*address, int compare, int val) { // Made up keyword \_lock(address) { int old = \*address; \*address = (old == compare) ? val : old; return old; } }

# Atomic Functions

■ atomicCAS pseudo implementation

```
int atomicCAS(int *address,
  int compare, int val)
{ // Made up keyword
  __lock(address) {
   int old = *address;
   *address = (old == compare) ? val : old;
   return old;
  }
}
```

# **Atomic Functions**

■ Example:

```
*addr = 1;
atomicCAS(addr, 1, 2);
atomicCAS(addr, 1, 3);
atomicCAS(addr, 2, 3);
```

# Atomic Functions

Example:

\*addr = 1;

```
atomicCAS(addr, 1, 2);
atomicCAS(addr, 1, 3);
// returns 1
// *addr = 2
atomicCAS(addr, 2, 3);
```

# Atomic Functions Example: \*addr = 1; atomicCAS(addr, 1, 2); atomicCAS(addr, 1, 3); atomicCAS(addr, 2, 3); // returns 2 // \*addr = 2

# **Atomic Functions**

■ Example:

```
*addr = 1;

atomicCAS(addr, 1, 2);

atomicCAS(addr, 1, 3);

atomicCAS(addr, 2, 3);

// returns 2

// *addr = 3
```

## **Atomic Functions**

Again, how do you implement atomicAdd given atomicCAS?

```
__device__ int atomicAdd(
  int *address, int val);
```

```
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
  int old = *address, assumed;
  do {
    assumed = old;
    old = atomicCAS(address,
        assumed, val + assumed);
  } while (assumed != old);
  return old;
}
```

```
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do {
        assumed = old;
        old = atomicCAS(address,
            assumed, val + assumed);
    } while (assumed != old);
    return old;
}
```

```
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do {
        assumed = old;
        old = atomicCAS(address,
            assumed, val + assumed);
    } while (assumed != old);
    return old;
}
```

```
Atomic Functions

__device__ int atomicAdd(int *address, int val)
{
    int old = *address, assumed;
    do {
        assumed = old;
        old = atomicCAS(address,
            assumed, assumed + val);
    } while (assumed != old);
    return old;
}

The value of *address after this function returns is not necessarily the original value of *address + val, why?
```

#### **Atomic Functions** ■ Lots of atomics: // Arithmetic // Bitwise atomicAdd() atomicAnd() atomicSub() atomicOr() atomicExch() atomicXor() atomicMin() atomicMax() atomicAdd() atomicDec() atomicCAS() See B.10 in the NVIDIA CUDA C Programming Guide

# Atomic Functions

- How can threads from different blocks work together?
- Use atomics sparingly. Why?