# SettingWithCopyWarning: How to Fix This Warning in Pandas

## Agenda

- Learn how to deal with the `SettingWithCopyWarning` that often comes up when working with a pandas DataFrame
- Why does `SettingWithCopyWarning` occur?
- The warning is raised when the answer to "are we modifying the original?" is unknown.

In [1]:
## create the clean environment
import gc
import matplotlib.pyplot as plt

def clear_all():
    # Clears all the variables from the workspace
    gl = globals().copy()
    for var in gl:
        if var in clean_env_var: continue
        del globals()[var]
    # Garbage collection:
    gc.collect()

def close_plots():
    # Close every open matplotlib figure
    for fig_num in plt.get_fignums():
        plt.close(fig_num)

clean_env_var = dir()
clean_env_var.append('clean_env_var')

In [2]:
clear_all()

### Hardware

In [3]:
%%bash
system_profiler SPHardwareDataType | grep -E \
"Model Identifier"\|"Processor Name"\|"Processor Speed"\
\|"Number of Processors"\|"Memory:"

      Model Identifier: MacBookPro13,1
      Processor Name: Dual-Core Intel Core i5
      Processor Speed: 2 GHz
      Number of Processors: 1
      Memory: 16 GB


### Python

In [4]:
!python -V

Python 3.7.4


### Import Libraries

In [5]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt

### pandas version

In [6]:
pd.__version__

'1.0.3'

### Read data

The data set contains the prices of Xboxes sold in 3-day auctions on eBay, taken from the book [Modeling Online Auctions](http://www.modelingonlineauctions.com/datasets).

- `auctionid`: A unique identifier of each auction.
- `bid`: The value of the bid.
- `bidtime`: The age of the auction, in days, at the time of the bid.
- `bidder`: eBay username of the bidder.
- `bidderrate`: The bidder’s eBay user rating.
- `openbid`: The opening bid set by the seller for the auction.
- `price`: The winning bid at the close of the auction.


In [7]:
path = '../data/Xbox 3-day auctions.csv'
data = pd.read_csv(path)
data.head()

| | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
|---|---|---|---|---|---|---|---|
| 0 | 8213034705 | 95.0 | 2.927373 | jake7870 | 0 | 95.0 | 117.5 |
| 1 | 8213034705 | 115.0 | 2.943484 | davidbresler2 | 1 | 95.0 | 117.5 |
| 2 | 8213034705 | 100.0 | 2.951285 | gladimacowgirl | 58 | 95.0 | 117.5 |
| 3 | 8213034705 | 117.5 | 2.998947 | daysrus | 10 | 95.0 | 117.5 |
| 4 | 8213060420 | 2.0 | 0.065266 | donnie4814 | 5 | 1.0 | 120.0 |


## 1. What is SettingWithCopyWarning?

- It is a warning, not an error
- An error indicates that something is broken, whereas a warning alerts you to something like a potential bug
- `SettingWithCopyWarning` warns that an operation may not be doing what the programmer intended
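
The difference between a warning and an error can be seen with Python's standard `warnings` module. The sketch below is a generic illustration and is not specific to pandas; the function `risky_operation` and its message are hypothetical names made up for this example.

```python
import warnings


def risky_operation():
    # Hypothetical function: warns about a potential bug, then carries on.
    warnings.warn("this may not do what you intended", UserWarning)
    return "finished"


# By default a warning is only reported; execution continues normally.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = risky_operation()

print(result)       # the function still returned a value
print(len(caught))  # number of warnings recorded

# Escalating warnings to errors makes the same call raise an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        risky_operation()
        escalated = False
    except UserWarning:
        escalated = True

print(escalated)
```

In the first block the call completes despite the warning; in the second, the same warning stops execution because it has been turned into an exception.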

### The difference between a view and a copy

<img src = "https://github.com/RyoNakagami/omorikaizuka/blob/master/IT101/pandas_view_copy.jpg?raw=true">

### Terminology

- Assignment: Operations that set the value of something, for example `data = pd.read_csv('xbox-3-day-auctions.csv')`. Often referred to as a set.
- Access: Operations that return the value of something, such as the below examples of indexing and chaining. Often referred to as a get.
- Indexing: Any assignment or access method that references a subset of the data; for example `data[1:5]`.
- Chaining: The use of more than one indexing operation back-to-back; for example `data[1:5][1:3]`.
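
The four terms above can be sketched on a small DataFrame. The data here is a hypothetical toy example, not the auction data set, and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the auction data set.
df = pd.DataFrame({"bid": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})

# Assignment (a "set"): sets the value of something.
df["price"] = 60.0

# Access (a "get"): returns the value of something.
first_bid = df.loc[0, "bid"]

# Indexing: references a subset of the data.
subset = df[1:5]        # rows at positions 1 to 4

# Chaining: two indexing operations back-to-back.
chained = df[1:5][1:3]  # positions 1 to 2 of that subset

print(first_bid)
print(list(subset.index))
print(list(chained.index))
```

Note that the second slice in the chained expression is applied to the result of the first one, not to `df` itself.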



## 2. Common issue #1: Chained assignment

- A `chained assignment` is the combination of chaining and assignment

Suppose we want to change the `bidderrate` of the bidder `'parakeet2004'`. First, check the rows before the change:

In [8]:
data[data.bidder == 'parakeet2004']

| | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
|---|---|---|---|---|---|---|---|
| 6 | 8213060420 | 3.0 | 0.186539 | parakeet2004 | 5 | 1.0 | 120.0 |
| 7 | 8213060420 | 10.0 | 0.18669 | parakeet2004 | 5 | 1.0 | 120.0 |
| 8 | 8213060420 | 24.99 | 0.187049 | parakeet2004 | 5 | 1.0 | 120.0 |


Now trigger the warning:

In [9]:
data[data.bidder == 'parakeet2004']['bidderrate'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Checking the result of the assignment shows that nothing has changed:

In [10]:
data[data.bidder == 'parakeet2004']

| | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
|---|---|---|---|---|---|---|---|
| 6 | 8213060420 | 3.0 | 0.186539 | parakeet2004 | 5 | 1.0 | 120.0 |
| 7 | 8213060420 | 10.0 | 0.18669 | parakeet2004 | 5 | 1.0 | 120.0 |
| 8 | 8213060420 | 24.99 | 0.187049 | parakeet2004 | 5 | 1.0 | 120.0 |


### Why did this warning occur?

- `data[data.bidder == 'parakeet2004']` is an access operation; it returns a new DataFrame object
- `['bidderrate'] = 100` is an assignment operation; it assigns to `bidderrate` on that new object, so the original DataFrame object is not changed


### Solution

In [11]:
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100

In [12]:
data[data.bidder == 'parakeet2004']['bidderrate']

6    100
7    100
8    100
Name: bidderrate, dtype: int64

## 3. Common issue #2: Hidden chaining

In [13]:
winners = data.loc[data.bid == data.price]
winners.head()

| | auctionid | bid | bidtime | bidder | bidderrate | openbid | price |
|---|---|---|---|---|---|---|---|
| 3 | 8213034705 | 117.5 | 2.998947 | daysrus | 10 | 95.0 | 117.5 |
| 25 | 8213060420 | 120.0 | 2.999722 | djnoeproductions | 17 | 1.0 | 120.0 |
| 44 | 8213067838 | 132.5 | 2.996632 | \*champaignbubbles\* | 202 | 29.99 | 132.5 |
| 45 | 8213067838 | 132.5 | 2.997789 | \*champaignbubbles\* | 202 | 29.99 | 132.5 |
| 66 | 8213073509 | 114.5 | 2.999236 | rr6kids | 4 | 1.0 | 114.5 |


Suppose we happen to notice that some values in the `bidder` column are NaN:

In [14]:
winners.loc[304, 'bidder']

nan

Suppose we want to replace the `nan` with `therealname`:

In [15]:
winners.loc[304, 'bidder'] = 'therealname'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [16]:
winners.loc[304, 'bidder']

'therealname'

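The assignment works, but a warning is raised as well. This happens because it is not known whether `data.loc[data.bid == data.price]` is a copy or not. Making the copy explicit should therefore resolve it: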

In [17]:
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(winners.loc[304, 'bidder'])
print(data.loc[304, 'bidder'])

therealname
nan


### Key points

- Even when a DataFrame is accessed through indexing, whether the result is a copy or a view depends on the case
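
Since the result of indexing can be ambiguous, stating the intent explicitly removes the doubt. The following is a minimal sketch on hypothetical toy data (the values and the index label `1` are made up): take an explicit `.copy()` when an independent subset is wanted, or assign through `.loc` on the original when the original should change.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "bid": [100.0, 120.0, 90.0],
    "price": [120.0, 120.0, 90.0],
    "bidder": ["alice", None, "carol"],
})

# Intent 1: work on an independent subset, so take an explicit copy.
winners = df.loc[df.bid == df.price].copy()
winners.loc[1, "bidder"] = "therealname"

# The original is untouched by the edit on the copy.
before = df.loc[1, "bidder"]
print(before)

# Intent 2: change the original, so assign on it in a single .loc call.
df.loc[1, "bidder"] = "therealname"
print(df.loc[1, "bidder"])
```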

## False negative

In [18]:
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]['bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]

| | bidderrate | bid |
|---|---|---|
| 6 | 100 | 3.0 |
| 7 | 100 | 10.0 |
| 8 | 100 | 24.99 |


No warning is raised, yet the assignment did not do what was intended.

In [19]:
data.loc[data.bidder == 'parakeet2004', 'bid'] = 5.0
data.loc[data.bidder == 'parakeet2004', ('bidderrate', 'bid')]

| | bidderrate | bid |
|---|---|---|
| 6 | 100 | 5.0 |
| 7 | 100 | 5.0 |
| 8 | 100 | 5.0 |
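
Because a false negative gives no signal, one possible safeguard is to check that the DataFrame actually changed after an assignment. The sketch below uses hypothetical toy data and compares against a snapshot with `DataFrame.equals`; this check is an illustration added here, not something the warning mechanism itself provides.

```python
import warnings

import pandas as pd

# Hypothetical toy data standing in for the auction data set.
df = pd.DataFrame({
    "bidder": ["alice", "bob", "alice"],
    "bidderrate": [5, 7, 5],
    "bid": [3.0, 10.0, 24.99],
})
mask = df.bidder == "alice"
before = df.copy()  # snapshot for comparison

# Chained form: the value is set on a temporary object.
with warnings.catch_warnings():
    # Some pandas versions may warn here; others may stay silent.
    warnings.simplefilter("ignore")
    df.loc[mask, ["bidderrate", "bid"]]["bid"] = 5.0
chained_changed = not df.equals(before)

# Single .loc call: the value is set on the original.
df.loc[mask, "bid"] = 5.0
loc_changed = not df.equals(before)

print(chained_changed, loc_changed)
```

Copying the whole DataFrame is only practical for small data or in tests; for larger data, comparing just the affected column would be a lighter variant of the same idea.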


## Hidden chaining

In [20]:
winners = data.loc[data.bid == data.price]
winners.loc[304, 'bidder'] = 'therealname'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [21]:
winners.loc[304, 'bidder']

'therealname'

### Copy solution

In [22]:
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'
print(data.loc[304, 'bidder']) # Original
print(winners.loc[304, 'bidder']) # Copy

nan
therealname


## References

- https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy