## Appending new rows to DataFrame

The `append()` instance methods on Series and DataFrame are a useful shortcut to `concat()`.
These methods actually predate `concat`. They concatenate along `axis=0`, that is, along the index.
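
To make "concatenating along `axis=0`" concrete, here is a minimal sketch using `pd.concat` directly. The two small frames are made up for illustration. `pd.concat` is used rather than `append` because `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0.

```python
import pandas as pd

# Two small, made-up frames standing in for real data
top = pd.DataFrame({'Name': ['Cornelia', 'Abbas'], 'Age': [70, 69]})
bottom = pd.DataFrame({'Name': ['Penelope'], 'Age': [4]})

# axis=0 stacks the rows; each frame keeps its own index labels
stacked = pd.concat([top, bottom], axis=0)
print(stacked)
print(list(stacked.index))
```

Note that the index labels of both inputs are carried over unchanged, so duplicate labels are possible.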

### StackOverflow references
- Add one row to pandas dataframe
- https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe

In [19]:
import pandas as pd
import numpy as np

### Add one row to dataframe using loc indexer

Read `names.csv` and assign it to a variable named `names`.

In [2]:
names = pd.read_csv('data/names.csv')
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2


Create a list that contains the new data and use the `.loc` indexer to set a single row.
- `.loc` refers to rows by their index label

In [3]:
new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1


The `.loc` indexer uses labels to refer to rows, so the new label can also be a string.

In [4]:
names.loc['five'] = ['Zach', 3]
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3


- In the examples above, the values were matched to columns implicitly, by position
- To associate values with columns explicitly, you can use a dictionary
- The new index label can be chosen dynamically as the length of the DataFrame

In [5]:
names.loc[len(names)] = {'Name': 'Zayd', 'Age': 2}
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3
6,Zayd,2
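
Using `len(names)` as the new label only adds a row when that label is not already in the index. The sketch below uses a hypothetical frame (not `names.csv`) whose integer index starts at 1 to show the overwrite, and one possible way to pick an unused label for an integer index.

```python
import pandas as pd

# Hypothetical frame whose integer index does not start at 0
df = pd.DataFrame({'Name': ['Abbas', 'Niko'], 'Age': [69, 2]}, index=[1, 2])

# len(df) is 2, and label 2 already exists, so this overwrites Niko's row
df.loc[len(df)] = ['Aria', 1]
rows_after_overwrite = len(df)

# For an all-integer index, one past the largest label is unused
df.loc[df.index.max() + 1] = ['Zayd', 2]
print(df)
```

The `index.max() + 1` approach assumes every label is an integer; it would not work for a mixed index such as the one containing `'five'` above.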


A Series can hold the new data as well and works exactly the same as a dictionary

In [6]:
names.loc[len(names)] = pd.Series({'Name': 'Dean', 'Age': 32})
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3
6,Zayd,2
7,Dean,32
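
When a new row is set from a Series, the values are matched to columns by the Series' index labels rather than by position. A minimal sketch, assuming an in-memory stand-in for `data/names.csv`:

```python
import pandas as pd

# In-memory stand-in for data/names.csv (an assumption for this sketch)
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope', 'Niko'],
                      'Age': [70, 69, 4, 2]})

# The keys are deliberately given in a different order than the columns;
# pandas aligns the Series index with the column names
names.loc[len(names)] = pd.Series({'Age': 32, 'Name': 'Dean'})
print(names)
```

This is why the explicit form is less error-prone than a plain list, where the order of the values must match the order of the columns.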


### Add rows to dataframe using append() method

- Above, the `.loc` indexing operator changed the `names` DataFrame in place (adding rows)
- Because the rows were added in place, no separate copy of the DataFrame was created
- The `append` method does not modify the DataFrame; it returns a new copy that includes the appended row

In [8]:
# See what happens when we attempt to use a dictionary with append
names = pd.read_csv('data/names.csv')
names.append({'Name': 'Aria', 'Age': 1})

TypeError: Can only append a Series if ignore_index=True or if the Series has a name

When `ignore_index` is set to `True`, the old index is removed completely and replaced with a `RangeIndex` from 0 to n-1.

In [9]:
names.append({'Name': 'Aria', 'Age': 1}, ignore_index=True)

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
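
The effect of `ignore_index=True` can also be seen with `pd.concat`, which accepts the same parameter. This is a sketch under the assumption of an in-memory stand-in for `names.csv`, for pandas versions where `DataFrame.append` is no longer available (it was removed in pandas 2.0).

```python
import pandas as pd

# In-memory stand-in for data/names.csv (an assumption for this sketch)
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope', 'Niko'],
                      'Age': [70, 69, 4, 2]})

# Wrap the dictionary in a list to build a one-row DataFrame
new_row = pd.DataFrame([{'Name': 'Aria', 'Age': 1}])

# Without ignore_index, the new row keeps its own label (0)
kept = pd.concat([names, new_row])

# With ignore_index=True, the result is relabeled 0 to n-1
renumbered = pd.concat([names, new_row], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

As with `append`, `names` itself is left unchanged; both calls return new objects.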


In [10]:
names.index = ['Canada', 'Canada', 'USA', 'USA']
names

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2


After changing the index of the existing DataFrame, calling `append` with `ignore_index=True` still discards the original index and replaces it with a `RangeIndex`.

In [11]:
names.append({'Name': 'Aria', 'Age': 1}, ignore_index=True)

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1

To keep the original index labels, append a Series whose `name` attribute supplies the label of the new row.


In [13]:
s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s

Age        3
Name    Zach
Name: 4, dtype: object

In [14]:
names.append(s)

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2
4,Zach,3


- The `append` method can append multiple rows at once
- To do so, pass it a list of Series

In [16]:
s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2
4,Zach,3
USA,Zayd,2
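
The same result can be sketched with `pd.concat`, which may be needed on pandas 2.0 and later where `DataFrame.append` no longer exists. The frame below is an in-memory stand-in for the re-indexed `names` (an assumption); building a DataFrame from a list of Series turns each Series' `name` into a row label.

```python
import pandas as pd

# In-memory stand-in for the re-indexed names DataFrame
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope', 'Niko'],
                      'Age': [70, 69, 4, 2]},
                     index=['Canada', 'Canada', 'USA', 'USA'])

s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')

# Each Series becomes one row; its name becomes the row label
new_rows = pd.DataFrame([s1, s2])

# A single concatenation keeps the labels of both inputs
result = pd.concat([names, new_rows])
print(result)
```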


- For small data with only two columns, like the `names` DataFrame, typing the column names and values by hand is good enough
- Next, consider what happens when the dataset gets larger

In [17]:
bball_16 = pd.read_csv('data/baseball16.csv')
bball_16.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,altuvjo01,2016,1,HOU,AL,161,640,108,216,42,...,96.0,30.0,10.0,60,70.0,11.0,7.0,3.0,7.0,15.0
1,bregmal01,2016,1,HOU,AL,49,201,31,53,13,...,34.0,2.0,0.0,15,52.0,0.0,0.0,0.0,1.0,1.0
2,castrja01,2016,1,HOU,AL,113,329,41,69,16,...,32.0,2.0,1.0,45,123.0,0.0,1.0,1.0,0.0,9.0
3,correca01,2016,1,HOU,AL,153,577,76,158,36,...,96.0,13.0,3.0,75,139.0,5.0,5.0,0.0,3.0,12.0
4,gattiev01,2016,1,HOU,AL,128,447,58,112,19,...,72.0,2.0,1.0,43,127.0,6.0,4.0,0.0,5.0,12.0


- Typing the names of 22 columns by hand makes mistyping or omitting one much more likely
- Instead, select a single row of the DataFrame as a Series and convert it to a dict to work from

In [22]:
# Select the first row with the iloc indexer and chain to_dict to convert it into a dict
data_dict = bball_16.iloc[0].to_dict()
print(data_dict)

{'playerID': 'altuvjo01', 'yearID': 2016, 'stint': 1, 'teamID': 'HOU', 'lgID': 'AL', 'G': 161, 'AB': 640, 'R': 108, 'H': 216, '2B': 42, '3B': 5, 'HR': 24, 'RBI': 96.0, 'SB': 30.0, 'CS': 10.0, 'BB': 60, 'SO': 70.0, 'IBB': 11.0, 'HBP': 7.0, 'SH': 3.0, 'SF': 7.0, 'GIDP': 15.0}


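Clear the old values from the extracted row's dict so that it can be used as a template: string values become empty strings and everything else becomes NaN.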

In [23]:
new_data_dict = {k: '' if isinstance(v, str) else np.nan for k, v in data_dict.items()}
print(new_data_dict)

{'playerID': '', 'yearID': nan, 'stint': nan, 'teamID': '', 'lgID': '', 'G': nan, 'AB': nan, 'R': nan, 'H': nan, '2B': nan, '3B': nan, 'HR': nan, 'RBI': nan, 'SB': nan, 'CS': nan, 'BB': nan, 'SO': nan, 'IBB': nan, 'HBP': nan, 'SH': nan, 'SF': nan, 'GIDP': nan}
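
One possible way to use such a template is sketched below. The shortened `data_dict`, the `fill_template` helper, and the player ID `'hypothetical01'` are all made up for illustration. The helper copies the template, fills in the known values, and rejects any key that is not a column name, which guards against the typos mentioned above.

```python
import numpy as np

# Shortened, made-up stand-in for the 22-column data_dict
data_dict = {'playerID': 'altuvjo01', 'yearID': 2016, 'teamID': 'HOU', 'RBI': 96.0}

# Same clearing rule as above: '' for strings, NaN for everything else
template = {k: '' if isinstance(v, str) else np.nan for k, v in data_dict.items()}

def fill_template(template, values):
    """Hypothetical helper: return a copy of template updated with values."""
    unknown = set(values) - set(template)
    if unknown:
        raise KeyError(f'Unknown column names: {sorted(unknown)}')
    row = dict(template)
    row.update(values)
    return row

new_row = fill_template(template, {'playerID': 'hypothetical01', 'yearID': 2016})
print(new_row)
```

Columns that were not filled in keep their placeholder values, so the resulting dict always has exactly the template's keys.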


### Notes
- Appending a single row to a DataFrame is a fairly expensive operation, because every call returns a new copy
- Writing a loop that appends single rows of data to a DataFrame is therefore inefficient
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html#pandas.DataFrame.append

- Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. 
- A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

In [27]:
# Create 1,000 rows of new data as a list of Series
random_data = []
for i in range(1000):
    d = dict()
    for k, v in data_dict.items():
        if isinstance(v, str):
            d[k] = np.random.choice(list('abcde'))
        else:
            d[k] = np.random.randint(10)
    random_data.append(pd.Series(d, name=i + len(bball_16)))

random_data[0].head()

2B    6
3B    3
AB    7
BB    2
CS    6
Name: 16, dtype: object

Check how long it takes to loop through each item making one append at a time

In [28]:
%%timeit
bball_16_copy = bball_16.copy()
for row in random_data:
    bball_16_copy = bball_16_copy.append(row)

1 loop, best of 3: 4.13 s per loop


- By passing in the list of Series, the time is reduced to under one-tenth of a second
- Internally, pandas converts the list of Series to a single DataFrame and then makes the append

In [29]:
%%timeit
bball_16_copy = bball_16.copy()
bball_16_copy = bball_16_copy.append(random_data)

10 loops, best of 3: 67.7 ms per loop
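
The collect-then-combine pattern can also be written with `pd.concat`, which is one option on pandas 2.0 and later where `DataFrame.append` has been removed. This is a minimal sketch, not a benchmark: the two-column `base` frame is a made-up stand-in for `bball_16`, and the rows are collected as plain dictionaries rather than Series.

```python
import numpy as np
import pandas as pd

# Made-up two-column stand-in for bball_16
base = pd.DataFrame({'playerID': ['altuvjo01', 'bregmal01'], 'G': [161, 49]})

rng = np.random.default_rng(0)

# Collect the new rows in an ordinary Python list first
rows = []
for _ in range(1000):
    rows.append({'playerID': str(rng.choice(list('abcde'))),
                 'G': int(rng.integers(10))})

# Build one DataFrame from the list, then concatenate a single time
new_rows = pd.DataFrame(rows)
combined = pd.concat([base, new_rows], ignore_index=True)
print(combined.shape)
```

Appending to a Python list is cheap, so the only expensive step, building and copying the DataFrame, happens once instead of 1,000 times.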
