## Combining DataFrames
- Combine two or more DataFrames into a single DataFrame.

![image.png](attachment:image.png)

## Loading the datasets
- stocks_2016.csv, stocks_2017.csv, stocks_2018.csv: stocks held in each year
- stocks_info.csv: stock information

In [4]:
import pandas as pd

# a = pd.read_csv('data/stocks_2016.csv')
# b = pd.read_csv('data/stocks_2017.csv')
# c = pd.read_csv('data/stocks_2018.csv')
# i = pd.read_csv('data/stocks_info.csv')

In [9]:
f_txt = ['2016', '2017', '2018', 'info']
s_2016, s_2017, s_2018, s_info = [pd.read_csv(f'data/stocks_{txt}.csv') for txt in f_txt]

print(s_2016.shape, s_2017.shape, s_2018.shape, s_info.shape)

# for t in f_txt:
#     pd.read_csv(f'data/stocks_{t}.csv')

(3, 4) (6, 4) (3, 4) (8, 2)


In [10]:
s_2016

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70


In [11]:
s_info

Unnamed: 0,Symbol,Name
0,AAPL,Apple Inc
1,TSLA,Tesla Inc
2,WMT,Walmart Inc
3,GE,General Electric
4,IBM,IBM(International Business Machines Co)
5,SLB,Schlumberger Limited.
6,TXN,Texas Instruments Incorporated
7,AMZN,"Amazon.com, Inc"


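## Using concat()
- Supports both vertical concatenation and horizontal concatenation (a join on the index).
- Vertical concatenation aligns on column names: columns with the same name are stacked.
- Horizontal concatenation (a join) supports full outer join and inner join.
    - Full outer join is the default.
    - Join key: rows with the same index label are combined (equi-join).
- `pd.concat(objs, axis=0, join='outer', keys=None)`
    - Parameters
        - objs: the DataFrames to combine, passed as a list
        - keys: a list of labels, one per DataFrame, that becomes the outer level of a MultiIndex so the origin of each row can be identified
        - axis
            - 0 or 'index': vertical concatenation
            - 1 or 'columns': horizontal concatenation
        - join
            - the join method
            - 'outer' (default) or 'inner'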
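
The parameters above can be tried on a small scale before running them on the stocks data. The following is a minimal sketch; `left` and `right` are hypothetical toy frames, not the notebook's datasets.

```python
import pandas as pd

# Hypothetical toy frames whose index labels only partially overlap.
left = pd.DataFrame({"Symbol": ["AAPL", "TSLA"], "Shares": [80, 50]}, index=[0, 1])
right = pd.DataFrame({"Symbol": ["AAPL", "GE", "IBM"], "Shares": [50, 100, 87]}, index=[0, 1, 2])

# Vertical concatenation: same-named columns are stacked;
# keys adds an outer index level that records where each row came from.
stacked = pd.concat([left, right], keys=["left", "right"])

# Horizontal concatenation: rows are matched on index labels.
outer = pd.concat([left, right], axis=1)                # full outer join (default)
inner = pd.concat([left, right], axis=1, join="inner")  # only index labels present in both

print(stacked.shape, outer.shape, inner.shape)
```

With `join="inner"` the row with index label 2 is dropped because it exists only in `right`; with the default outer join it is kept and the missing cells become NaN.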

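> ### Join
> - Gathers only the needed information from data spread across several DataFrames and combines it.
> - Combines two or more DataFrames horizontally by matching rows that have the same value in a given column.
> - Types: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join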
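
To make the four join types concrete, here is a small sketch that counts the rows each one returns. It uses `merge()` on a shared `Symbol` column; `holdings` and `names` are hypothetical frames invented for illustration.

```python
import pandas as pd

# Hypothetical frames: WMT exists only on the left, GE only on the right.
holdings = pd.DataFrame({"Symbol": ["AAPL", "TSLA", "WMT"], "Shares": [80, 50, 40]})
names = pd.DataFrame({"Symbol": ["AAPL", "TSLA", "GE"],
                      "Name": ["Apple Inc", "Tesla Inc", "General Electric"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = holdings.merge(names, on="Symbol", how=how)
    row_counts[how] = len(joined)

print(row_counts)
```

The inner join keeps only the two symbols found in both frames, the left and right joins each keep all three rows of their own side, and the full outer join keeps all four distinct symbols.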

In [23]:
df = pd.concat([s_2016, s_2017, s_2018])
print(df.shape)
df

(12, 4)


Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70
0,AAPL,50,120,140
1,GE,100,30,40
2,IBM,87,75,95
3,SLB,20,55,85
4,TXN,500,15,23
5,TSLA,100,100,300
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


In [26]:
df.loc[1]

Unnamed: 0,Symbol,Shares,Low,High
1,TSLA,50,80,130
1,GE,100,30,40
1,AMZN,8,900,1125


In [27]:
# Discard the original indexes when concatenating so that each row gets a unique label
df2 = pd.concat([s_2016, s_2017, s_2018], ignore_index=True)
df2

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70
3,AAPL,50,120,140
4,GE,100,30,40
5,IBM,87,75,95
6,SLB,20,55,85
7,TXN,500,15,23
8,TSLA,100,100,300
9,AAPL,40,135,170
10,AMZN,8,900,1125
11,TSLA,50,220,400


In [28]:
df2.loc[1]

Symbol    TSLA
Shares      50
Low         80
High       130
Name: 1, dtype: object

In [29]:
# keys: multi-index
df3 = pd.concat([s_2016, s_2017, s_2018], keys=[2016, 2017, 2018])
df3

Unnamed: 0,Unnamed: 1,Symbol,Shares,Low,High
2016,0,AAPL,80,95,110
2016,1,TSLA,50,80,130
2016,2,WMT,40,55,70
2017,0,AAPL,50,120,140
2017,1,GE,100,30,40
2017,2,IBM,87,75,95
2017,3,SLB,20,55,85
2017,4,TXN,500,15,23
2017,5,TSLA,100,100,300
2018,0,AAPL,40,135,170
2018,1,AMZN,8,900,1125
2018,2,TSLA,50,220,400


In [30]:
df3.loc[2016, 1]

Symbol    TSLA
Shares      50
Low         80
High       130
Name: (2016, 1), dtype: object

In [31]:
df3.loc[2018]

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


In [34]:
# Horizontal concatenation (axis=1): rows with the same index label are combined (full outer join by default).
df4 = pd.concat([s_2016, s_2018], axis=1)
print(df4.shape)
df4

(3, 8)


Unnamed: 0,Symbol,Shares,Low,High,Symbol.1,Shares.1,Low.1,High.1
0,AAPL,80,95,110,AAPL,40,135,170
1,TSLA,50,80,130,AMZN,8,900,1125
2,WMT,40,55,70,TSLA,50,220,400


In [32]:
s_2016.shape, s_2018.shape

((3, 4), (3, 4))

In [35]:
s_2016

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70


In [36]:
s_2018

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,40,135,170
1,AMZN,8,900,1125
2,TSLA,50,220,400


In [39]:
df5 = pd.concat([s_2016, s_2017], axis=1) # full outer join(default)
df5.shape

(6, 8)

In [40]:
df5

Unnamed: 0,Symbol,Shares,Low,High,Symbol.1,Shares.1,Low.1,High.1
0,AAPL,80.0,95.0,110.0,AAPL,50,120,140
1,TSLA,50.0,80.0,130.0,GE,100,30,40
2,WMT,40.0,55.0,70.0,IBM,87,75,95
3,,,,,SLB,20,55,85
4,,,,,TXN,500,15,23
5,,,,,TSLA,100,100,300


In [41]:
df6 = pd.concat([s_2016, s_2017], axis=1, join="inner")  #inner join
df6

Unnamed: 0,Symbol,Shares,Low,High,Symbol.1,Shares.1,Low.1,High.1
0,AAPL,80,95,110,AAPL,50,120,140
1,TSLA,50,80,130,GE,100,30,40
2,WMT,40,55,70,IBM,87,75,95


In [37]:
s_2016.shape, s_2017.shape

((3, 4), (6, 4))

## Combining DataFrames with joins
- join()
    - Use to join two or more DataFrames
- merge()
    - Joins exactly two DataFrames

### join()
- `DataFrame.join(other, how='left', lsuffix='', rsuffix='')`
- `df_a.join(df_b)`, `df_a.join([df_b, df_c, df_d])`
- Joins two or more DataFrames.
    - **Join key**: rows with the same index value are combined (equi-join).
    - **Default join method**: Left Outer Join
- Parameters
    - lsuffix, rsuffix
        - An error is raised if the DataFrames being joined share a column name and no suffix is given.
        - Suffixes appended to the overlapping column names of the left and right DataFrames
    - how: the join method. 'left', 'right', 'outer', or 'inner'; 'left' is the default
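
The suffix behavior described above can be checked with a short sketch. The frames `a` and `b` are hypothetical and are indexed by symbol so that `join()` matches on it.

```python
import pandas as pd

# Hypothetical frames that share the column name "Shares".
a = pd.DataFrame({"Shares": [80, 50]}, index=["AAPL", "TSLA"])
b = pd.DataFrame({"Shares": [50, 100]}, index=["AAPL", "GE"])

# Without suffixes, the overlapping column name makes join() raise a ValueError.
try:
    a.join(b)
    overlap_error = None
except ValueError as exc:
    overlap_error = exc

# With suffixes, the join succeeds; the default how='left' keeps only a's index labels.
joined = a.join(b, lsuffix="_a", rsuffix="_b")
print(joined)
```

Because the default is a left join, GE (present only in `b`) does not appear, and TSLA gets NaN in `Shares_b`.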
        

In [42]:
s_2017

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,50,120,140
1,GE,100,30,40
2,IBM,87,75,95
3,SLB,20,55,85
4,TXN,500,15,23
5,TSLA,100,100,300


In [43]:
s_info

Unnamed: 0,Symbol,Name
0,AAPL,Apple Inc
1,TSLA,Tesla Inc
2,WMT,Walmart Inc
3,GE,General Electric
4,IBM,IBM(International Business Machines Co)
5,SLB,Schlumberger Limited.
6,TXN,Texas Instruments Incorporated
7,AMZN,"Amazon.com, Inc"


In [46]:
# target (s_info), source (s_2017)
# s_info.join(s_2017)  # raises an error because both DataFrames have a column named Symbol
s_info.join(s_2017, lsuffix="_info", rsuffix='_2017')
# Rows are matched on index labels, so for s_info and s_2017 this pairs unrelated symbols.

Unnamed: 0,Symbol_info,Name,Symbol_2017,Shares,Low,High
0,AAPL,Apple Inc,AAPL,50.0,120.0,140.0
1,TSLA,Tesla Inc,GE,100.0,30.0,40.0
2,WMT,Walmart Inc,IBM,87.0,75.0,95.0
3,GE,General Electric,SLB,20.0,55.0,85.0
4,IBM,IBM(International Business Machines Co),TXN,500.0,15.0,23.0
5,SLB,Schlumberger Limited.,TSLA,100.0,100.0,300.0
6,TXN,Texas Instruments Incorporated,,,,
7,AMZN,"Amazon.com, Inc",,,,


In [48]:
# target (s_2017), source (s_info)
s_2017.join(s_info, lsuffix='_2017', rsuffix='_info')

Unnamed: 0,Symbol_2017,Shares,Low,High,Symbol_info,Name
0,AAPL,50,120,140,AAPL,Apple Inc
1,GE,100,30,40,TSLA,Tesla Inc
2,IBM,87,75,95,WMT,Walmart Inc
3,SLB,20,55,85,GE,General Electric
4,TXN,500,15,23,IBM,IBM(International Business Machines Co)
5,TSLA,100,100,300,SLB,Schlumberger Limited.


In [53]:
# join() matches rows on index labels.
# s_info and s_2017 should be matched on Symbol, so set that column as the index before joining.
s_info.set_index('Symbol').join(s_2017.set_index("Symbol"))

Unnamed: 0_level_0,Name,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,Apple Inc,50.0,120.0,140.0
TSLA,Tesla Inc,100.0,100.0,300.0
WMT,Walmart Inc,,,
GE,General Electric,100.0,30.0,40.0
IBM,IBM(International Business Machines Co),87.0,75.0,95.0
SLB,Schlumberger Limited.,20.0,55.0,85.0
TXN,Texas Instruments Incorporated,500.0,15.0,23.0
AMZN,"Amazon.com, Inc",,,


In [57]:
s_info.set_index('Symbol').join(s_2017.set_index("Symbol"), how='left')  # 'left' is the default; pass how='inner' for an inner join

Unnamed: 0_level_0,Name,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,Apple Inc,50.0,120.0,140.0
TSLA,Tesla Inc,100.0,100.0,300.0
WMT,Walmart Inc,,,
GE,General Electric,100.0,30.0,40.0
IBM,IBM(International Business Machines Co),87.0,75.0,95.0
SLB,Schlumberger Limited.,20.0,55.0,85.0
TXN,Texas Instruments Incorporated,500.0,15.0,23.0
AMZN,"Amazon.com, Inc",,,


In [69]:
# Join three DataFrames: s_info (target), s_2016, and s_2017
s_info_2 = s_info.copy()
s_2016_2 = s_2016.copy()
s_2017_2 = s_2017.copy()

In [65]:
# s_2016 and s_2017, the two DataFrames joined to s_info_2, share column names, so an exception is raised (lsuffix/rsuffix cannot resolve it).
#   Rename the columns before joining.
# s_info_2.set_index('Symbol').join([
#     s_2016_2.set_index('Symbol'), 
#     s_2017_2.set_index('Symbol')
# ], rsuffix=['_2016', '_2017'])

ValueError: Indexes have overlapping values: Index(['Shares', 'Low', 'High'], dtype='object')

In [68]:
# s_2016_2.columns = ['A', 'B' , 'C', 'D']
# s_2016_2

Unnamed: 0,A,B,C,D
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70


In [72]:
# df.add_suffix('text') / df.add_prefix('text') => appends/prepends the given string to every column name of df.
s_2016_2.add_suffix('_2016')
s_2016_2.add_prefix('2016_')

Unnamed: 0,2016_Symbol,2016_Shares,2016_Low,2016_High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70


In [80]:
others = [
    s_2016_2.set_index('Symbol').add_suffix('_2016'),
    s_2017_2.set_index('Symbol').add_suffix('_2017'),
    s_2018.set_index('Symbol').add_suffix('_2018')
]

s_info_2.set_index('Symbol').join(others)

Unnamed: 0_level_0,Name,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017,Shares_2018,Low_2018,High_2018
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
AAPL,Apple Inc,80.0,95.0,110.0,50.0,120.0,140.0,40.0,135.0,170.0
TSLA,Tesla Inc,50.0,80.0,130.0,100.0,100.0,300.0,50.0,220.0,400.0
WMT,Walmart Inc,40.0,55.0,70.0,,,,,,
GE,General Electric,,,,100.0,30.0,40.0,,,
IBM,IBM(International Business Machines Co),,,,87.0,75.0,95.0,,,
SLB,Schlumberger Limited.,,,,20.0,55.0,85.0,,,
TXN,Texas Instruments Incorporated,,,,500.0,15.0,23.0,,,
AMZN,"Amazon.com, Inc",,,,,,,8.0,900.0,1125.0


### merge()
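- `df_a.merge(df_b)`
- Joins exactly two DataFrames.
    - **Join key**: by default an equi-join on the columns that have the same name in both DataFrames. **The join key can be specified in several ways.**
    - **Default join method**: inner join
- `dataframe.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'))`
- Parameters
    - on: the column(s) to join on, chosen from the column names shared by both DataFrames
    - left_on, right_on: the column names to join on in the left and right DataFrames, used when the key columns have different names
    - left_index, right_index: set to True to use that DataFrame's index as the join key
    - how: the join method. 'left', 'right', 'outer', or 'inner'. Default: 'inner'
    - suffixes: suffixes appended to distinguish columns that have the same name in both DataFrames, given as a list or tuple
        - Defaults to `_x` and `_y`.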
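
As a sketch of how `left_on`/`right_on` and `how` work together, the example below joins two hypothetical frames whose key columns have different names (`Symbol` and `Ticker`). It also passes `indicator=True`, a `merge()` option not covered in this notebook, which adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

# Hypothetical frames: the key column is named differently on each side.
holdings = pd.DataFrame({"Symbol": ["AAPL", "TSLA", "WMT"], "Shares": [80, 50, 40]})
info = pd.DataFrame({"Ticker": ["AAPL", "TSLA", "GE"],
                     "Name": ["Apple Inc", "Tesla Inc", "General Electric"]})

# Left join: keep every holding and attach the company name where one exists.
merged = holdings.merge(info, left_on="Symbol", right_on="Ticker",
                        how="left", indicator=True)

# Rows marked 'left_only' had no match in info.
unmatched = merged.loc[merged["_merge"] == "left_only", "Symbol"].tolist()
print(merged)
print(unmatched)
```

Both key columns are kept in the result when `left_on` and `right_on` differ, so one of them is usually dropped afterwards.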

In [82]:
s_2016.merge(s_2017)  # Joins on all same-named columns => no rows match on all four columns, so the result is empty (default method: inner).

Unnamed: 0,Symbol,Shares,Low,High


In [85]:
s_2016.merge(s_2017, on='Symbol')  # Join rows whose Symbol values match in both DataFrames
# The join key column appears only once.
# All other columns are kept; columns with the same name get the _x and _y suffixes.

Unnamed: 0,Symbol,Shares_x,Low_x,High_x,Shares_y,Low_y,High_y
0,AAPL,80,95,110,50,120,140
1,TSLA,50,80,130,100,100,300


In [86]:
s_2016.merge(s_2017, on='Symbol', suffixes=['_2016', '_2017'])

Unnamed: 0,Symbol,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017
0,AAPL,80,95,110,50,120,140
1,TSLA,50,80,130,100,100,300


In [87]:
s_2016.merge(s_2017, on='Symbol', suffixes=['_2016', '_2017'], how='left')  # left join: every row of s_2016 is kept.

Unnamed: 0,Symbol,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017
0,AAPL,80,95,110,50.0,120.0,140.0
1,TSLA,50,80,130,100.0,100.0,300.0
2,WMT,40,55,70,,,


In [88]:
s_2016.merge(s_2017, on='Symbol', suffixes=['_2016', '_2017'], how='right')  # right join: every row of s_2017 is kept.

Unnamed: 0,Symbol,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017
0,AAPL,80.0,95.0,110.0,50,120,140
1,GE,,,,100,30,40
2,IBM,,,,87,75,95
3,SLB,,,,20,55,85
4,TXN,,,,500,15,23
5,TSLA,50.0,80.0,130.0,100,100,300


In [89]:
s_2016.merge(s_2017, on='Symbol', suffixes=['_2016', '_2017'], how='outer')  # full outer join: every row of both s_2016 and s_2017 is kept.

Unnamed: 0,Symbol,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017
0,AAPL,80.0,95.0,110.0,50.0,120.0,140.0
1,TSLA,50.0,80.0,130.0,100.0,100.0,300.0
2,WMT,40.0,55.0,70.0,,,
3,GE,,,,100.0,30.0,40.0
4,IBM,,,,87.0,75.0,95.0
5,SLB,,,,20.0,55.0,85.0
6,TXN,,,,500.0,15.0,23.0


In [93]:
s_info_2.set_index('Symbol', inplace=True)
s_info_2

Unnamed: 0_level_0,Name
Symbol,Unnamed: 1_level_1
AAPL,Apple Inc
TSLA,Tesla Inc
WMT,Walmart Inc
GE,General Electric
IBM,IBM(International Business Machines Co)
SLB,Schlumberger Limited.
TXN,Texas Instruments Incorporated
AMZN,"Amazon.com, Inc"


In [97]:
# Join on the Symbol column of s_2016 and the index of s_info_2
s_2016.merge(s_info_2,
             left_on="Symbol", # join key of the left DataFrame is a column: left_on
             right_index=True  # join key of the right DataFrame is its index: right_index
            ) # Joins rows where Symbol in s_2016 (left) equals the index label in s_info_2 (right). Method: inner (default)

Unnamed: 0,Symbol,Shares,Low,High,Name
0,AAPL,80,95,110,Apple Inc
1,TSLA,50,80,130,Tesla Inc
2,WMT,40,55,70,Walmart Inc


In [102]:
s_info_2.merge(s_2016, left_index=True, right_on='Symbol', how='left') #.set_index('Symbol')#.reset_index(drop=True)

Unnamed: 0,Name,Symbol,Shares,Low,High
0.0,Apple Inc,AAPL,80.0,95.0,110.0
1.0,Tesla Inc,TSLA,50.0,80.0,130.0
2.0,Walmart Inc,WMT,40.0,55.0,70.0
,General Electric,GE,,,
,IBM(International Business Machines Co),IBM,,,
,Schlumberger Limited.,SLB,,,
,Texas Instruments Incorporated,TXN,,,
,"Amazon.com, Inc",AMZN,,,


- To stack DataFrames vertically (union): use concat().
- To join two **or more** DataFrames at once: use join().
- To join two DataFrames: use **merge()**, which gives the most control over the join key.
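
Although `merge()` takes only two DataFrames per call, calls can be chained when more than two need to be combined on a column. The sketch below is one way to do that with `functools.reduce`; the frames and their column names are hypothetical, and this is an illustration rather than a recommendation over `join()`.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that all share a Symbol column.
info = pd.DataFrame({"Symbol": ["AAPL", "TSLA", "WMT"],
                     "Name": ["Apple Inc", "Tesla Inc", "Walmart Inc"]})
y2016 = pd.DataFrame({"Symbol": ["AAPL", "TSLA"], "Shares_2016": [80, 50]})
y2017 = pd.DataFrame({"Symbol": ["AAPL", "GE"], "Shares_2017": [50, 100]})

# Left-merge each frame onto the running result, always keeping info's rows.
combined = reduce(
    lambda left, right: left.merge(right, on="Symbol", how="left"),
    [info, y2016, y2017],
)
print(combined)
```

The non-key columns were given distinct names up front, so no suffixes are needed.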

# TODO

In [103]:
import pandas as pd

In [104]:
# TODO 1: Read data/customer.csv, data/order.csv, and data/qna.csv into DataFrames.
cust_df = pd.read_csv('data/customer.csv')
order_df = pd.read_csv('data/order.csv')
qna_df = pd.read_csv('data/qna.csv')

In [105]:
file_names = ['data/customer.csv', 'data/order.csv', 'data/qna.csv']
cust_df, order_df, qna_df = [pd.read_csv(file) for file in file_names]

In [107]:
# TODO 2: Inspect the three datasets read in TODO 1.
cust_df.shape, order_df.shape, qna_df.shape

((5, 3), (6, 3), (3, 3))

In [109]:
cust_df

Unnamed: 0,id,name,age
0,id-1,김영수,33
1,id-2,박선영,23
2,id-3,오정현,21
3,id-4,박명수,40
4,id-5,이철기,17


In [110]:
order_df.head()

Unnamed: 0,order_id,cust_id,total_price
0,1,id-1,100000
1,2,id-1,250000
2,3,id-2,300000
3,4,id-2,15000
4,5,id-2,51000


In [111]:
qna_df

Unnamed: 0,qna_no,cust_id,txt
0,1,id-4,물건있나요?
1,2,id-4,얼마에요
2,3,id-5,반품은 어떻게 해요?


In [114]:
# TODO 3: Join the customer and order DataFrames so that every customer appears (including customers with no orders).
# join()
cust_df.set_index('id').join(order_df.set_index('cust_id'))

Unnamed: 0,name,age,order_id,total_price
id-1,김영수,33,1.0,100000.0
id-1,김영수,33,2.0,250000.0
id-2,박선영,23,3.0,300000.0
id-2,박선영,23,4.0,15000.0
id-2,박선영,23,5.0,51000.0
id-3,오정현,21,,
id-4,박명수,40,6.0,32000.0
id-5,이철기,17,,


In [120]:
# merge() defaults to an inner join, so how='left' is needed to keep every customer
result = cust_df.merge(order_df, left_on='id', right_on='cust_id', how='left')
result.drop(columns='cust_id', inplace=True)
result

Unnamed: 0,id,name,age,order_id,total_price
0,id-1,김영수,33,1.0,100000.0
1,id-1,김영수,33,2.0,250000.0
2,id-2,박선영,23,3.0,300000.0
3,id-2,박선영,23,4.0,15000.0
4,id-2,박선영,23,5.0,51000.0
5,id-3,오정현,21,,
6,id-4,박명수,40,6.0,32000.0
7,id-5,이철기,17,,


In [124]:
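# TODO 4: Set the id column as the index of the customer DataFrame.
# Note: inplace=True modifies cust_df, so running this cell again raises a KeyError because id is no longer a column.
cust_df.set_index('id', inplace=True)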

KeyError: "None of ['id'] are in the columns"
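
One way to make a cell like this safe to re-run is to set the index only when the column is still present. The helper `ensure_index` below is a hypothetical name introduced for this sketch, and `cust` is a toy stand-in for `cust_df`.

```python
import pandas as pd

# Toy stand-in for the customer data.
cust = pd.DataFrame({"id": ["id-1", "id-2"], "name": ["김영수", "박선영"], "age": [33, 23]})


def ensure_index(df, column):
    """Return df indexed by `column`; return df unchanged if it is no longer a column."""
    if column in df.columns:
        return df.set_index(column)
    return df


cust = ensure_index(cust, "id")
cust = ensure_index(cust, "id")  # the second call is a no-op instead of a KeyError
print(cust)
```

Returning a new DataFrame instead of using `inplace=True` also keeps the original frame available if the cell is run out of order.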

In [127]:
cust_df

Unnamed: 0_level_0,name,age
id,Unnamed: 1_level_1,Unnamed: 2_level_1
id-1,김영수,33
id-2,박선영,23
id-3,오정현,21
id-4,박명수,40
id-5,이철기,17


In [132]:
# TODO 5: Inner join the customer and qna DataFrames.
# join() defaults to a left join, so pass how='inner'
# cust_df.join(qna_df.set_index('cust_id'), how='inner')  
qna_df.set_index('cust_id').join(cust_df, how='inner')

Unnamed: 0,qna_no,txt,name,age
id-4,1,물건있나요?,박명수,40
id-4,2,얼마에요,박명수,40
id-5,3,반품은 어떻게 해요?,이철기,17


In [134]:
# merge() - inner
cust_df.merge(qna_df, left_index=True, right_on='cust_id')  #, how='inner')

Unnamed: 0,name,age,qna_no,cust_id,txt
0,박명수,40,1,id-4,물건있나요?
1,박명수,40,2,id-4,얼마에요
2,이철기,17,3,id-5,반품은 어떻게 해요?


In [139]:
# TODO 6: Join all three DataFrames so that every customer appears. Target: customer
cust_df.join([order_df.set_index('cust_id'), qna_df.set_index('cust_id')], how='left')

Unnamed: 0,name,age,order_id,total_price,qna_no,txt
id-1,김영수,33,1.0,100000.0,,
id-1,김영수,33,2.0,250000.0,,
id-2,박선영,23,3.0,300000.0,,
id-2,박선영,23,4.0,15000.0,,
id-2,박선영,23,5.0,51000.0,,
id-3,오정현,21,,,,
id-4,박명수,40,6.0,32000.0,1.0,물건있나요?
id-4,박명수,40,6.0,32000.0,2.0,얼마에요
id-5,이철기,17,,,3.0,반품은 어떻게 해요?


In [142]:
# merge
tmp = cust_df.merge(order_df,  left_index=True, right_on='cust_id', how='left')
tmp

Unnamed: 0,name,age,order_id,cust_id,total_price
0.0,김영수,33,1.0,id-1,100000.0
1.0,김영수,33,2.0,id-1,250000.0
2.0,박선영,23,3.0,id-2,300000.0
3.0,박선영,23,4.0,id-2,15000.0
4.0,박선영,23,5.0,id-2,51000.0
,오정현,21,,id-3,
5.0,박명수,40,6.0,id-4,32000.0
,이철기,17,,id-5,


In [144]:
tmp.merge(qna_df, left_on='cust_id', right_on='cust_id', how='left')

Unnamed: 0,name,age,order_id,cust_id,total_price,qna_no,txt
0,김영수,33,1.0,id-1,100000.0,,
1,김영수,33,2.0,id-1,250000.0,,
2,박선영,23,3.0,id-2,300000.0,,
3,박선영,23,4.0,id-2,15000.0,,
4,박선영,23,5.0,id-2,51000.0,,
5,오정현,21,,id-3,,,
6,박명수,40,6.0,id-4,32000.0,1.0,물건있나요?
7,박명수,40,6.0,id-4,32000.0,2.0,얼마에요
8,이철기,17,,id-5,,3.0,반품은 어떻게 해요?


## Creating a DataFrame from database table data

- pd.read_sql('SELECT statement', connection)
    - Returns a DataFrame holding the result of the SELECT query
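
For trying `pd.read_sql` without a running MySQL server, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout and the `comm_pct` values are hypothetical and only loosely mirror the `emp` table used in this section.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a small hypothetical emp table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER, emp_name TEXT, comm_pct REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(100, "Steven", None), (145, "John", 0.4), (146, "Karen", 0.3)],
)
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame.
with_commission = pd.read_sql(
    "SELECT emp_id, emp_name FROM emp WHERE comm_pct IS NOT NULL ORDER BY emp_id",
    conn,
)
conn.close()
print(with_commission)
```

Column names in the returned DataFrame come from the SELECT list, so SQL aliases can be used to rename them.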

In [146]:
!pip install pymysql

Collecting pymysql
  Using cached PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2


In [147]:
import pymysql 
import pandas as pd

In [148]:
connection = pymysql.connect(host='127.0.0.1', port=3306, user='scott', password='tiger', database='hr_join')

In [149]:
emp = pd.read_sql('select * from emp', connection)



In [152]:
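import warnings
# Suppress warnings, such as the one pandas may emit when given a raw DBAPI connection instead of a SQLAlchemy connectable
warnings.filterwarnings(action='ignore')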

In [153]:
emp_test = pd.read_sql('select emp_id ID, emp_name 이름, hire_date 입사일 from emp where comm_pct is not null', connection)
emp_test

Unnamed: 0,ID,이름,입사일
0,145,John,2004-10-01
1,146,Karen,2004-10-01
2,147,Alberto,2005-03-10
3,148,Gerald,2007-10-15
4,149,Eleni,2007-10-15
5,150,Peter,2007-10-15
6,151,David,2005-03-24
7,152,Peter,2005-08-20
8,153,Christopher,2006-03-30
9,154,Nanette,2006-12-09


In [154]:
dept = pd.read_sql('select * from dept', connection)
job = pd.read_sql('select * from job', connection)

In [155]:
emp.shape, dept.shape, job.shape

((107, 8), (27, 3), (19, 4))

In [156]:
emp_dept = pd.read_sql('select  e.*, d.dept_name, d.loc from    emp e left join dept d on e.dept_id = d.dept_id', connection)

In [157]:
emp_dept

Unnamed: 0,emp_id,emp_name,job_id,mgr_id,hire_date,salary,comm_pct,dept_id,dept_name,loc
0,100,Steven,AD_PRES,,2003-06-17,24000.0,,90.0,Executive,Seattle
1,101,Neena,AD_VP,100.0,2005-09-21,17000.0,,90.0,Executive,Seattle
2,102,Lex,AD_VP,100.0,2001-01-13,17000.0,,90.0,Executive,Seattle
3,103,Alexander,IT_PROG,102.0,2006-01-03,9000.0,,60.0,IT,San Francisco
4,104,Bruce,IT_PROG,103.0,2007-05-21,6000.0,,60.0,IT,San Francisco
...,...,...,...,...,...,...,...,...,...,...
102,202,Pat,MK_REP,201.0,2005-08-17,6000.0,,20.0,Marketing,New York
103,203,Susan,HR_REP,101.0,2002-06-07,6500.0,,40.0,Human Resources,New York
104,204,Hermann,PR_REP,101.0,2002-06-07,10000.0,,70.0,Public Relations,New York
105,205,Shelley,AC_MGR,101.0,2002-06-07,12008.0,,110.0,Accounting,Seattle


In [158]:
emp.merge(dept, how='left')

Unnamed: 0,emp_id,emp_name,job_id,mgr_id,hire_date,salary,comm_pct,dept_id,dept_name,loc
0,100,Steven,AD_PRES,,2003-06-17,24000.0,,90.0,Executive,Seattle
1,101,Neena,AD_VP,100.0,2005-09-21,17000.0,,90.0,Executive,Seattle
2,102,Lex,AD_VP,100.0,2001-01-13,17000.0,,90.0,Executive,Seattle
3,103,Alexander,IT_PROG,102.0,2006-01-03,9000.0,,60.0,IT,San Francisco
4,104,Bruce,IT_PROG,103.0,2007-05-21,6000.0,,60.0,IT,San Francisco
...,...,...,...,...,...,...,...,...,...,...
102,202,Pat,MK_REP,201.0,2005-08-17,6000.0,,20.0,Marketing,New York
103,203,Susan,HR_REP,101.0,2002-06-07,6500.0,,40.0,Human Resources,New York
104,204,Hermann,PR_REP,101.0,2002-06-07,10000.0,,70.0,Public Relations,New York
105,205,Shelley,AC_MGR,101.0,2002-06-07,12008.0,,110.0,Accounting,Seattle
