# DataFrame

## 1. Creating a DataFrame
- A DataFrame looks much like a spreadsheet.
- It consists of several columns, and **each column can contain a different data type.**

- **How do you create a DataFrame?**
 - From a dict whose values are equal-length lists
 - From a NumPy array
 - With reader functions such as read_csv() and read_excel()
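
As a minimal sketch of the first two approaches (the frame names and values here are illustrative and not part of the notebook), a dict of equal-length lists and a 2-D NumPy array can both be passed straight to the constructor:

```python
import numpy as np
import pandas as pd

# From a dict of equal-length lists: the keys become the column names.
from_dict = pd.DataFrame({"city": ["Seoul", "Busan"], "temp": [21.5, 24.0]})

# From a 2-D NumPy array: column names are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])

print(from_dict)
print(from_array)
```
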

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt

---

### Creating DataFrame Through Dictionary

In [2]:
dic = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
       'year' :[2000,2001,2002,2001,2002,20003],
       'population':[1.5,1.7,3.6,2.4,2.9,3.2]
      }
# If 'state' has 6 values, every other list must also have 6: all columns need the same length.

In [3]:
dicDf = DataFrame(dic) # If columns is not given, the dict keys become the column names.
dicDf

,state,year,population
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,20003,3.2
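
The equal-length requirement noted above can be checked with a small sketch. The dict below is a hypothetical example with deliberately mismatched lists; pandas is expected to reject it with a `ValueError`:

```python
import pandas as pd

# 'state' has two values but 'year' has only one, so construction should fail.
bad = {"state": ["Ohio", "Nevada"], "year": [2000]}

try:
    pd.DataFrame(bad)
    failed = False
except ValueError as err:
    failed = True
    print("construction failed:", err)
```
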


In [4]:
print(dicDf.state) # Attribute access returns the column as a Series.
print(type(dicDf.state))

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object
<class 'pandas.core.series.Series'>


In [5]:
print(dicDf.year) #Series type.
print(type(dicDf.year))

0     2000
1     2001
2     2002
3     2001
4     2002
5    20003
Name: year, dtype: int64
<class 'pandas.core.series.Series'>


#### DataFrame CRUD

##### Delete

In [6]:
a = dicDf.pop('population') # pop() returns the column and removes it from the DataFrame
a

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: population, dtype: float64

In [7]:
dicDf  # the 'population' column is gone after pop()

,state,year
0,Ohio,2000
1,Ohio,2001
2,Ohio,2002
3,Nevada,2001
4,Nevada,2002
5,Nevada,20003
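
`pop()` changes the DataFrame in place. When the original should stay intact, `drop(columns=...)` is one alternative, since it returns a new frame by default. The sketch below uses a small illustrative frame rather than `dicDf`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "year": [2000, 2001], "population": [1.5, 2.4]}
)

# drop() returns a new DataFrame; 'frame' still has all three columns.
trimmed = frame.drop(columns=["population"])

# del removes a column in place, like pop() but without returning it.
del frame["year"]

print(list(trimmed.columns))
print(list(frame.columns))
```
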


##### Create

In [8]:
# Adding a column, method 1: assign a list to a new column name
dicDf['winner']=['Jim','Kerry','John','Casey','Jack','Luke'] 

In [9]:
dicDf

,state,year,winner
0,Ohio,2000,Jim
1,Ohio,2001,Kerry
2,Ohio,2002,John
3,Nevada,2001,Casey
4,Nevada,2002,Jack
5,Nevada,20003,Luke


In [10]:
# Adding a column, method 2: insert() at a chosen position
dicDf.insert(3,"Age",[21,33,34,28,39,50]) # insert a column named "Age" at column position 3 with the given values
dicDf

,state,year,winner,Age
0,Ohio,2000,Jim,21
1,Ohio,2001,Kerry,33
2,Ohio,2002,John,34
3,Nevada,2001,Casey,28
4,Nevada,2002,Jack,39
5,Nevada,20003,Luke,50
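
The first argument of `insert()` is a zero-based column position. A short sketch with an illustrative frame (column names such as `id` are assumptions for the example) shows inserting at the front and at the end:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "year": [2000, 2001]})

# Position 0 puts the new column first.
frame.insert(0, "id", [10, 11])

# A position equal to the current column count appends at the end.
frame.insert(len(frame.columns), "winner", ["Jim", "Casey"])

print(list(frame.columns))
```
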


##### Read

In [11]:
# Slicing with [] selects rows by position.
dicDf[3:5]

# Unlike a list, a single integer inside [] is looked up as a column label,
# so the following raises a KeyError:
# dicDf[-1]

,state,year,winner,Age
3,Nevada,2001,Casey,28
4,Nevada,2002,Jack,39
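
For reading single rows, `iloc` (by position) and `loc` (by label) are the usual tools. The following is a minimal sketch on a hypothetical frame with string row labels, so that position and label are easy to tell apart:

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2002]},
    index=["a", "b", "c"],
)

first_two = frame.iloc[0:2]   # rows at positions 0 and 1 (end excluded)
last_row = frame.iloc[-1]     # negative positions work with iloc
row_b = frame.loc["b"]        # lookup by index label

# A bare integer inside [] is treated as a column label, hence the KeyError.
try:
    frame[-1]
    raised = False
except KeyError:
    raised = True

print(last_row["state"], row_b["year"], raised)
```
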


### Interim Check!
- A DataFrame is a collection of Series.
 - Each Series (column) holds a single data type, but different Series can have different data types.
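
A small sketch can make this concrete. The frame below is illustrative; it shows that each column carries its own dtype, and that a single Series mixing ints and floats is upcast to one common dtype:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "population": [1.5, 1.7]})
print(frame.dtypes)  # one dtype per column

# Within one Series the values share a dtype: the int is upcast to float.
mixed = pd.Series([1, 2.5])
print(mixed.dtype)
```
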



---

### Creating DataFrame through numpy.array()!

In [12]:
data1 = {'name':['James','Peter','Thomas','Robert'],
       'address':['NY','TXS','LA','CA'],
       'age':[33,44,55,66],
      }
df1 = DataFrame(data1)
df1


,name,address,age
0,James,NY,33
1,Peter,TXS,44
2,Thomas,LA,55
3,Robert,CA,66


In [13]:
# Quick reminder:
# DataFrame takes three main keyword parameters -> data, index, columns
np.random.seed(100)
df2 = DataFrame(np.random.randint(10,100,16).reshape(4,4),
                index=list('abcd'),
                columns=list('abcd')
               ) # a DataFrame is a 2-D structure


In [14]:
df2

,a,b,c,d
a,18,34,77,97
b,89,58,20,62
c,63,76,24,44
d,34,25,70,68


#### Renaming Columns and the Index

In [15]:
df2.columns=['one','two','three','four'] # rename all columns; the list length must match the number of columns
df2.index = range(len(df2)) # replace the index; a plain Python range works
df2

,one,two,three,four
0,18,34,77,97
1,89,58,20,62
2,63,76,24,44
3,34,25,70,68


---

### Creating DataFrame through read_csv()!

In [16]:
df3=pd.read_csv('../data/tips.csv')

In [17]:
df3

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2.0
1,10.34,1.66,Male,No,Sun,Dinner,3.0
2,21.01,3.50,Male,No,Sun,Dinner,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2.0
4,24.59,3.61,Female,No,Sun,Dinner,4.0
...,...,...,...,...,...,...,...
240,27.18,2.00,Female,Yes,Sat,Dinner,2.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2.0
242,17.82,1.75,Male,No,Sat,Dinner,2.0
243,18.78,3.00,Female,No,Thur,Dinner,2.0
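
Since `../data/tips.csv` may not be at hand, the sketch below feeds `read_csv()` a few lines of inline CSV text through `io.StringIO`. The three rows are made up for illustration; the last one leaves fields empty to show that they are read as NaN:

```python
import io

import pandas as pd

# Inline stand-in for a CSV file; the last row has empty 'tip' and 'sex' fields.
csv_text = "total_bill,tip,sex\n16.99,1.01,Female\n10.34,1.66,Male\n25.34,,\n"

frame = pd.read_csv(io.StringIO(csv_text))

print(frame)
print(frame.isna().sum())  # number of missing values per column
```
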


#### NaN Handling 

In [18]:
# df3.dropna(axis=1) would remove every column that contains NaN
# df3.dropna(axis=0) would remove every row that contains NaN

df4 = DataFrame(np.random.rand(5,5),index=list(range(5)),columns=list(range(5)))
df4[df4<0.5] = np.nan        # mask every value below 0.5 as NaN
print(df4)
print("-"*50)
df4[df4.isnull()] = 1        # replace every NaN with 1 (same result as df4.fillna(1))
print(df4)

          0         1   2         3         4
0       NaN       NaN NaN  0.978624  0.811683
1       NaN  0.816225 NaN       NaN  0.940030
2  0.817649       NaN NaN       NaN       NaN
3       NaN  0.795663 NaN  0.598843  0.603805
4       NaN       NaN NaN  0.890412  0.980921
--------------------------------------------------
          0         1    2         3         4
0  1.000000  1.000000  1.0  0.978624  0.811683
1  1.000000  0.816225  1.0  1.000000  0.940030
2  0.817649  1.000000  1.0  1.000000  1.000000
3  1.000000  0.795663  1.0  0.598843  0.603805
4  1.000000  1.000000  1.0  0.890412  0.980921
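
The `dropna()` calls mentioned in the comments, together with `fillna()`, can be sketched on a tiny illustrative frame. All three return new DataFrames and leave the original untouched:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

rows_kept = frame.dropna(axis=0)  # drop rows that contain NaN
cols_kept = frame.dropna(axis=1)  # drop columns that contain NaN
filled = frame.fillna(1)          # replace NaN with 1

print(rows_kept)
print(cols_kept)
print(filled)
```
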


#### Transposition

In [19]:
print(df1)
print('-'*30)
print(df1.T) #Transposed result. 

     name address  age
0   James      NY   33
1   Peter     TXS   44
2  Thomas      LA   55
3  Robert      CA   66
------------------------------
             0      1       2       3
name     James  Peter  Thomas  Robert
address     NY    TXS      LA      CA
age         33     44      55      66
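
One side effect of transposing is worth checking: after `.T`, the old column names become the index and each new column holds values from several original columns. A minimal sketch with an illustrative two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["James", "Peter"], "age": [33, 44]})
flipped = frame.T

print(flipped)
print(flipped.dtypes)  # each transposed column now mixes names and ages
```
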


---

## 2. DataFrame Structure

**Attributes and methods for inspecting the structure**
- index
- columns
- values
- dtypes
- Inspection methods
    - info()
    - head() -> first 5 rows by default
    - tail() -> last 5 rows by default
    - describe()
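
The inspection methods listed above can be tried on a small stand-in frame. The six rows below are illustrative values and not the actual tips data:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
        "size": [2, 3, 3, 2, 4, 4],
    }
)

first = frame.head()       # first 5 rows by default
last_two = frame.tail(2)   # last 2 rows
summary = frame.describe() # count, mean, std, min, quartiles, max per numeric column

print(first)
print(last_two)
print(summary)
```
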

In [20]:
print(df3.index)
print(df3.columns)
print(df3.values)
print('-'*30)
print(df3[['sex','size']].dtypes) # dtypes, not dtype: a DataFrame can hold several columns, each with its own dtype
print('-'*30)
print(df3.dtypes) 


RangeIndex(start=0, stop=245, step=1)
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
[[16.99 1.01 'Female' ... 'Sun' 'Dinner' 2.0]
 [10.34 1.66 'Male' ... 'Sun' 'Dinner' 3.0]
 [21.01 3.5 'Male' ... 'Sun' 'Dinner' 3.0]
 ...
 [17.82 1.75 'Male' ... 'Sat' 'Dinner' 2.0]
 [18.78 3.0 'Female' ... 'Thur' 'Dinner' 2.0]
 [25.34 nan nan ... nan nan nan]]
------------------------------
sex      object
size    float64
dtype: object
------------------------------
total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size          float64
dtype: object
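
A column named `size`, as in the tips data, is a case where bracket access matters: attribute access reaches the built-in `DataFrame.size` property instead of the column. A minimal sketch on an illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "size": [1.5, 2.4]})

by_attr = frame.state    # works: 'state' does not clash with any attribute
by_key = frame["state"]  # bracket access always refers to the column

print(frame.size)        # number of cells (rows * columns), not the column
print(frame["size"])     # the actual 'size' column
```
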


In [21]:
# Column information
df3.info()  
# Shows the number of columns, the non-null count per column, and each Dtype.
# Useful for spotting NaN values before starting an analysis.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  245 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    float64
dtypes: float64(3), object(4)
memory usage: 13.5+ KB


## 3. DataFrame - Renaming and Adding Columns

#### Renaming Columns
- 1) DataFrame.columns = [...] (replaces every column name in place)
- 2) NewDataFrame = DataFrame.rename(columns={old:new}) (returns a new DataFrame)
- 3) DataFrame.rename(columns={old:new},inplace=True) (modifies the original)
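
The three options can be compared side by side in a short sketch. The frame and the new names below are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({"name": ["James"], "address": ["NY"]})

# 2) rename() without inplace returns a new frame; 'frame' keeps its names.
renamed = frame.rename(columns={"address": "add"})
before = list(frame.columns)

# 3) With inplace=True the original is modified and nothing is returned.
frame.rename(columns={"name": "full_name"}, inplace=True)
after_inplace = list(frame.columns)

# 1) Assigning to .columns replaces every name; the length must match.
frame.columns = ["n", "a"]

print(list(renamed.columns), before, after_inplace, list(frame.columns))
```
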

In [22]:
df2

,one,two,three,four
0,18,34,77,97
1,89,58,20,62
2,63,76,24,44
3,34,25,70,68


In [23]:
#1. Rename all columns at once ... -> .columns = [...]
df2.columns = ['A-class','B-class','C-class','D-class']
df2

,A-class,B-class,C-class,D-class
0,18,34,77,97
1,89,58,20,62
2,63,76,24,44
3,34,25,70,68


In [24]:
#2. Rename selected columns ...  -> .rename(columns={oldName:newName}) :: non-destructive by default
df1

,name,address,age
0,James,NY,33
1,Peter,TXS,44
2,Thomas,LA,55
3,Robert,CA,66


In [25]:
print(df1.rename(columns={'address':'add'})) 
print(df1) # the original is unchanged :: non-destructive

# To change the original, pass inplace=True.


df1.rename(columns={"address":'add'},inplace=True)
df1



     name  add  age
0   James   NY   33
1   Peter  TXS   44
2  Thomas   LA   55
3  Robert   CA   66
     name address  age
0   James      NY   33
1   Peter     TXS   44
2  Thomas      LA   55
3  Robert      CA   66


,name,add,age
0,James,NY,33
1,Peter,TXS,44
2,Thomas,LA,55
3,Robert,CA,66


#### Adding Columns


In [26]:
df1

,name,add,age
0,James,NY,33
1,Peter,TXS,44
2,Thomas,LA,55
3,Robert,CA,66


In [27]:
df1['phone'] = np.nan # fill with NumPy NaN -> float64 column. With None instead? -> object column

df1['sex']= None
df1['phone'].dtype

dtype('float64')
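
The difference between the two placeholder values can be inspected directly. This sketch uses an illustrative frame; both new columns count as missing, while the note above says that `None` typically yields an object column rather than float64:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"name": ["James", "Peter"]})

frame["phone"] = np.nan  # a float NaN is broadcast to every row
frame["sex"] = None      # None is broadcast to every row as well

print(frame.dtypes)
print(frame.isna().sum())  # both new columns are entirely missing
```
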