# Views vs Copies
Understanding the difference between views and copies helps prevent bugs in Pandas that are hard to find and fix. Pandas, the main focus of this course, is built on top of Numpy. Therefore, we first explore the copy vs view behavior of Numpy.

#### Views in Numpy

In [1]:
import numpy as np

First, we create a one-dimensional Numpy array that will be used for the demonstration.

In [2]:
vector = np.array([1, 2, 3, 4, 5])
type(vector)

numpy.ndarray

As mentioned during the lecture, a Numpy array has an underlying data buffer. To see whether an array borrows its data buffer from another array, you can use the *base* attribute:

In [3]:
vector.base

The base attribute is None when the array owns its data buffer, that is, when it does not borrow the buffer from another array. However, the moment we create a view of the initial array, the base attribute of the view returns the array whose data buffer it uses:

In [4]:
view = vector.view()
view.base

array([1, 2, 3, 4, 5])
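Besides inspecting *base*, Numpy provides `np.shares_memory` to test whether two arrays use overlapping memory. The following is a minimal sketch on a freshly created array; the names `vector_demo` and `view_demo` are only used for this illustration.

```python
import numpy as np

# Illustrative arrays, separate from the notebook's own variables.
vector_demo = np.array([1, 2, 3, 4, 5])
view_demo = vector_demo.view()

# The base of the view is the very array it was created from.
print(view_demo.base is vector_demo)              # True
# Both arrays read from and write to the same data buffer.
print(np.shares_memory(vector_demo, view_demo))   # True
# An array that owns its data has no base.
print(vector_demo.base is None)                   # True
```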

Since this is a view, if we modify the data (buffer) of the view, this will also change the value of the *vector* array since they share the same buffer.

In [5]:
view[0] = 100
print(view)
print(vector)

[100   2   3   4   5]
[100   2   3   4   5]


Likewise, if we change the original vector, we also change the view:

vector[0] = 1
print(view)
print(vector)

#### Copies in Numpy
Alternatively, we can copy the data instead of creating a view. In that case, both the metadata and the data buffer are copied to a new object:

In [6]:
copy = vector.copy()
copy.base

As you can see, the copy does not have a base. This is because it is an entirely new object that owns its own data buffer! If we now change a value in the copy, this will not impact the vector:

In [7]:
copy[0] = 100
print(copy)
print(vector)

[100   2   3   4   5]
[1 2 3 4 5]
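The independence of a copy can also be checked directly. This is a small sketch with illustrative names (`original`, `duplicate`) that are not part of the notebook:

```python
import numpy as np

# Illustrative arrays, separate from the notebook's own variables.
original = np.array([1, 2, 3, 4, 5])
duplicate = original.copy()

# A copy owns its buffer, so it has no base and shares no memory.
print(duplicate.base is None)                   # True
print(np.shares_memory(original, duplicate))    # False

# Writing to the copy leaves the original untouched.
duplicate[0] = 100
print(original[0])                              # 1
```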


#### Why important?
When doing data manipulation, you will typically use Numpy functions. To save memory, these functions return views whenever possible; however, sometimes a view is not possible and the function returns a copy instead. Not realizing whether you are working on a copy of the data or on a view can introduce some nasty bugs!

In [8]:
time_series = np.array([100.10, 105.78, 98.54, 10098, 102.83, 105.98, 101.26])

We work with the time series above. In a first step, we are going to filter the data to include only values that are greater than 100:

In [9]:
greater_100 = time_series[time_series > 100]
greater_100

array([  100.1 ,   105.78, 10098.  ,   102.83,   105.98,   101.26])

By doing this, we realize that we have an outlying value that seems to be a mistake in data input. We quickly solve this:

In [10]:
time_series[3] = time_series[3]/100
time_series

array([100.1 , 105.78,  98.54, 100.98, 102.83, 105.98, 101.26])

Now, everything looks fine and we will continue with computing the average for all values greater than 100:

In [11]:
np.mean(greater_100)

1768.9916666666666

**What happened?** I assumed that filtering with a boolean mask returned a view, but it returned a copy! The fix applied to the time_series array therefore never reached the greater_100 array. We can confirm this by checking that greater_100 has no base:

In [12]:
greater_100.base
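The sketch below contrasts the two kinds of indexing and shows one possible way to avoid the problem: repair the data first and filter afterwards. The variable names (`series`, `masked`, `sliced`, `cleaned`) are illustrative and not part of the notebook.

```python
import numpy as np

# Same values as the time series above, under an illustrative name.
series = np.array([100.10, 105.78, 98.54, 10098, 102.83, 105.98, 101.26])

masked = series[series > 100]   # boolean mask: returns a copy
sliced = series[:5]             # basic slice: returns a view

print(np.shares_memory(series, masked))   # False
print(np.shares_memory(series, sliced))   # True

# Repair the data first, then filter, so the filtered array sees the fix.
series[3] = series[3] / 100
cleaned = series[series > 100]
print(cleaned.max() < 1000)               # True
```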

##### Example 2
Let's say we now want to take the average of the first 5 values of the array; however, these are represented in euros and we want to represent the average in cents. Therefore, we first slice the array by index:

In [13]:
first_5 = time_series[:5]
first_5

array([100.1 , 105.78,  98.54, 100.98, 102.83])

Now that we have obtained the sliced array, we multiply each value by 100:

In [14]:
first_5 *= 100
first_5

array([10010., 10578.,  9854., 10098., 10283.])

Next, we compute the mean:

In [15]:
np.mean(first_5)

10164.6

Now, we continue the project with the original array. But if we are not careful, then we do not notice that the original array now looks like this:

In [16]:
print(time_series)

[10010.   10578.    9854.   10098.   10283.     105.98   101.26]


**What happened?** The subset operation returned a view instead of a copy! The in-place multiplication by 100 then wrote through that view into the data buffer of the original array.

In [17]:
first_5.base

array([10010.  , 10578.  ,  9854.  , 10098.  , 10283.  ,   105.98,
         101.26])

**Note:** if we used *first_5 = first_5 \* 100* instead of *first_5 \*= 100*, then we would not have had this issue, because the multiplication creates a new array and the name first_5 is rebound to that copy!

In [18]:
time_series = np.array([100.10, 105.78, 98.54, 100.98, 102.83, 105.98, 101.26])
first_5 = time_series[:5]
first_5 = first_5 * 100 
print(first_5)
print(time_series)

[10010. 10578.  9854. 10098. 10283.]
[100.1  105.78  98.54 100.98 102.83 105.98 101.26]


In [19]:
first_5.base
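Another option, when in-place operations are preferred, is to request a copy explicitly at the moment of slicing. A minimal sketch, with illustrative names (`prices_eur`, `first_5_cents`) that do not appear in the notebook:

```python
import numpy as np

# Same values as the time series above, under an illustrative name.
prices_eur = np.array([100.10, 105.78, 98.54, 100.98, 102.83, 105.98, 101.26])

# .copy() detaches the slice from the original buffer.
first_5_cents = prices_eur[:5].copy()
first_5_cents *= 100            # in-place, but only on the copy

print(first_5_cents.base is None)   # True
print(prices_eur[0])                # 100.1
```

The explicit copy costs extra memory, which is the trade-off for making the in-place update safe.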

## Pandas
Now, let's move on to pandas.

In [20]:
import pandas as pd

We use the same movies dataset as in the other notebook:

In [21]:
movies = pd.read_parquet('https://kuleuven-mda.s3.eu-central-1.amazonaws.com/movies.parquet.gzip')
movies.release_date = pd.to_datetime(movies.release_date)
movies.id = pd.to_numeric(movies.id)
movies.original_language = pd.Categorical(movies.original_language)
movies.head()

| | id | original_title | release_date | original_language | popularity | revenue | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 862 | Toy Story | 1995-10-30 | en | 21.946943 | 373554033.0 | 7.7 | 5415.0 |
| 1 | 8844 | Jumanji | 1995-12-15 | en | 17.015539 | 262797249.0 | 6.9 | 2413.0 |
| 2 | 15602 | Grumpier Old Men | 1995-12-22 | en | 11.7129 | 0.0 | 6.5 | 92.0 |
| 3 | 31357 | Waiting to Exhale | 1995-12-22 | en | 3.859495 | 81452156.0 | 6.1 | 34.0 |
| 4 | 11862 | Father of the Bride Part II | 1995-02-10 | en | 8.387519 | 76578911.0 | 5.7 | 173.0 |


Let's start with subsetting the data. We first select a single column:

In [22]:
vote_counts = movies.loc[:, "vote_count"]
vote_counts._is_view

True

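This returns a view of the data. This makes sense because the selected column is stored in a single internal block, so the underlying Numpy operation can return a view without any issues. Note that *_is_view* and *_data* are private attributes, so their behavior may differ between Pandas versions. The internal blocks can be inspected as follows: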

In [23]:
movies._data.blocks

(ObjectBlock: slice(1, 2, 1), 1 x 45454, dtype: object,
 NumericBlock: slice(4, 8, 1), 4 x 45454, dtype: float64,
 ExtensionBlock: slice(3, 4, 1), 1 x 45454, dtype: category,
 DatetimeLikeBlock: slice(2, 3, 1), 1 x 45454, dtype: datetime64[ns],
 NumericBlock: slice(0, 1, 1), 1 x 45454, dtype: int64)

Now, let's filter both rows and columns. As in Numpy, a slice of rows returns a view, whereas a boolean mask returns a copy:

In [24]:
vote_counts = movies.loc[1:4, "vote_count"]
vote_counts._is_view

True

In [25]:
mask = (movies.id == 862) | (movies.id == 8844)
movies.loc[mask, "vote_count"]._is_view

False

And let's now select multiple columns:

In [26]:
movies.loc[:, ['vote_count', 'vote_average']]._is_view

False
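Because it is not always obvious whether a Pandas selection is a view or a copy, one defensive pattern is to call `.copy()` explicitly on any subset that will be modified. The sketch below uses a small hypothetical stand-in for the movies table (`movies_small`), built from the first three rows shown earlier, so that it runs without downloading the data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the movies DataFrame.
movies_small = pd.DataFrame({
    "id": [862, 8844, 15602],
    "vote_count": [5415.0, 2413.0, 92.0],
    "vote_average": [7.7, 6.9, 6.5],
})

# An explicit copy is independent of the original, whatever the selection was.
subset = movies_small.loc[:, ["vote_count", "vote_average"]].copy()
subset.loc[0, "vote_count"] = 0.0

print(movies_small.loc[0, "vote_count"])   # 5415.0
print(subset.loc[0, "vote_count"])         # 0.0
```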