# The use of .loc vs. chained assignment, copies and views

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(np.random.random(size=(3, 4)),
                  columns=list("ABCD"),
                  index=['first', 'second', 'third'])
df

| | A | B | C | D |
|---|---|---|---|---|
| first | 0.35446 | 0.729999 | 0.129636 | 0.215645 |
| second | 0.505258 | 0.186622 | 0.71605 | 0.251266 |
| third | 0.781954 | 0.132473 | 0.268704 | 0.229917 |


Reading data from a dataframe and assigning to it can each be done as one operation on a single subset of the dataframe, such as a column. Access and assignment use the same indexing syntax.

In [3]:
df['A'] # accessing the content of column A

first     0.354460
second    0.505258
third     0.781954
Name: A, dtype: float64

In [4]:
df['A'] = 1  # assign the value 1 to every element of column 'A'; this works and runs as one operation
df

| | A | B | C | D |
|---|---|---|---|---|
| first | 1 | 0.729999 | 0.129636 | 0.215645 |
| second | 1 | 0.186622 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 0.268704 | 0.229917 |


### Chained assignment, order of operations

The following cell is executed in two steps.
First df['B'] is evaluated, and its result is held in a **temporary variable**, an intermediate Series. Depending on the pandas version and on how the data is stored internally, this intermediate object may be a **view** of the original data or a **copy** with a separate location in memory.
This temporary variable is then used for the second step, indexing on ['second']. This is a **chain** of operations.

In [5]:
df['B']['second']  # accessing the element at column 'B', index 'second'

0.1866223517075658
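
To make the two steps explicit, the chain can be unrolled by hand. The sketch below is only illustrative: it builds a small stand-in dataframe with a fixed seed (so its values differ from the ones shown above), and col_b is a hypothetical name for the intermediate object.

```python
import numpy as np
import pandas as pd

# Stand-in for the df used above; the seed is an assumption made for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random(size=(3, 4)),
                  columns=list("ABCD"),
                  index=["first", "second", "third"])

# Step 1: df['B'] returns an intermediate Series.
col_b = df["B"]

# Step 2: the intermediate Series is indexed by label.
value_chained = col_b["second"]

# A single .loc call reaches the same element in one indexing operation.
value_loc = df.loc["second", "B"]

print(type(col_b).__name__)        # Series
print(value_chained == value_loc)  # True
```

For reading, both routes return the same value; the difference only matters once we assign.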

The same result can be obtained using .loc. This time there is only one indexing operation, which is generally faster.

In [6]:
df.loc['second','B']

0.1866223517075658

While we can use both notations above to read data from a dataframe, they are not interchangeable for assignment. **In the case of assignment, we need to use .loc**; otherwise we may be assigning to a copy of the original data, and the result is not guaranteed: the write may or may not reach the original dataframe.

In [7]:
df['B']['second'] = 100

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [8]:
df.loc['third','C']

0.2687044998026171

In [9]:
df.loc['third', 'C'] = 99999
df

| | A | B | C | D |
|---|---|---|---|---|
| first | 1 | 0.729999 | 0.129636 | 0.215645 |
| second | 1 | 100.0 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 99999.0 | 0.229917 |
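
The difference between assigning to a copy and assigning with .loc can be reproduced in a deterministic way. The following is a minimal sketch, not the exact mechanism pandas uses internally: it forces the "copy" case with an explicit .copy() call, on a hypothetical dataframe filled with zeros, so the outcome does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe of zeros, used only for this illustration.
df = pd.DataFrame(np.zeros((3, 4)),
                  columns=list("ABCD"),
                  index=["first", "second", "third"])

# Force the "copy" case: the intermediate column is an independent object.
col_copy = df["B"].copy()
col_copy["second"] = 100            # only the copy is modified

after_copy_write = df.loc["second", "B"]
print(after_copy_write)             # 0.0, the original is unchanged

# A single .loc assignment acts on the dataframe itself.
df.loc["second", "B"] = 100
print(df.loc["second", "B"])        # 100.0
```

When the intermediate object happens to be a copy, chained assignment behaves like the first half of this sketch, which is why .loc is the reliable choice.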


### Views vs. actual assignment

Be aware that an expression that transforms a dataframe only changes the original dataframe if its result is assigned back (or the operation is performed in place). If there is no assignment, we will still see the outcome of the expression, but it is a new, temporary object (a **copy**, in this case, rather than a view), and the original dataframe will not change.

In [10]:
df+100  # here we can see the result of adding 100 to all values in df

| | A | B | C | D |
|---|---|---|---|---|
| first | 101 | 100.729999 | 100.129636 | 100.215645 |
| second | 101 | 200.0 | 100.71605 | 100.251266 |
| third | 101 | 100.132473 | 100099.0 | 100.229917 |


In [11]:
df  # however, the original dataframe did not change

| | A | B | C | D |
|---|---|---|---|---|
| first | 1 | 0.729999 | 0.129636 | 0.215645 |
| second | 1 | 100.0 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 99999.0 | 0.229917 |
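
To keep the result of such an expression, it has to be assigned to a name. A minimal sketch on a hypothetical two-row dataframe (shifted is an illustrative variable name, not part of the notebook above):

```python
import pandas as pd

# Hypothetical small dataframe standing in for df.
df = pd.DataFrame({"A": [1, 1], "B": [0.5, 100.0]},
                  index=["first", "second"])

shifted = df + 100             # builds a new dataframe; df is untouched
print(shifted is df)           # False
print(df.loc["first", "A"])    # 1

df = df + 100                  # assigning the result back rebinds the name df
print(df.loc["first", "A"])    # 101
```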


In [12]:
df.drop(['first'])  # here we see the outcome of dropping a row

| | A | B | C | D |
|---|---|---|---|---|
| second | 1 | 100.0 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 99999.0 | 0.229917 |


In [13]:
df  # however, in the original dataframe that row was not dropped

| | A | B | C | D |
|---|---|---|---|---|
| first | 1 | 0.729999 | 0.129636 | 0.215645 |
| second | 1 | 100.0 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 99999.0 | 0.229917 |


In [14]:
df.drop(['first'], inplace=True)  # if we want to change the original dataframe we can use inplace=True

In [15]:
df

| | A | B | C | D |
|---|---|---|---|---|
| second | 1 | 100.0 | 0.71605 | 0.251266 |
| third | 1 | 0.132473 | 99999.0 | 0.229917 |
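
Assigning the result of drop to a name is an alternative to inplace=True. The sketch below compares the two on a hypothetical dataframe; trimmed and returned are illustrative names. Note that with inplace=True the call returns None, so its result should not be assigned back to df.

```python
import pandas as pd

# Hypothetical dataframe with the same row labels as above.
df = pd.DataFrame({"A": [1, 1, 1], "B": [0.7, 100.0, 0.1]},
                  index=["first", "second", "third"])

# Option 1: keep the returned dataframe; df itself is not modified here.
trimmed = df.drop(["first"])
print(list(df.index))        # ['first', 'second', 'third']

# Option 2: modify df in place; the call itself returns None.
returned = df.drop(["first"], inplace=True)
print(returned)              # None
print(list(df.index))        # ['second', 'third']

# Both routes end with the same content.
print(trimmed.equals(df))    # True
```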
