# Working With Missing Data
[From the dataquest.io site](https://www.dataquest.io/m/83/data-manipulation-with-pandas/5/normalizing-columns-in-a-data-set)
### <p style="color:Tomato">Learn to handle missing data using pandas, and a data set on Titanic survival.
<p/>
#### <p style="color:Gray">Clean and analyze data on passenger survival from the Titanic. Many of the columns, such as age and sex, have missing data.<p/>
Missing data can:
* cause errors
* make calculations fail: finding the mean of a column by hand does not work when a value is missing, because it is impossible to average a missing value.


<p style="color:Blue">**1. titanic_survival.csv**<p/>

In [1]:
import pandas as pd
import numpy as np

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
titanic_survival = pd.read_csv("titanic_survival.csv")

In [4]:
age = titanic_survival["age"]
print(age.loc[10:20])

10    47.0
11    18.0
12    24.0
13    26.0
14    80.0
15     NaN
16    24.0
17    50.0
18    32.0
19    36.0
20    37.0
Name: age, dtype: float64


In [5]:
age_is_null = pd.isnull(age)
age_null_true = age[age_is_null]
age_null_count = len(age_null_true)
print(age_null_count)

264


#### <p style="color:Gray">NaN<p/>
NaN stands for "not a number"; pandas uses it to indicate a missing value.
#### <p style="color:Gray">pandas.isnull()<p/>
Returns a Series of True and False values, where True marks a missing value.

In [6]:
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
print(sex_is_null)

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1280    False
1281    False
1282    False
1283    False
1284    False
1285    False
1286    False
1287    False
1288    False
1289    False
1290    False
1291    False
1292    False
1293    False
1294    False
1295    False
1296    False
1297    False
1298    False
1299    False
1300    False
1301    False
1302    False
1303    False
1304    False
1305    False
1306    False
1307    False
1308    False
1309     True
Name: sex, Length: 1310, dtype: bool


In [7]:
sex_null_true = sex[sex_is_null]
print(sex_null_true)

1309    NaN
Name: sex, dtype: object


Missing values have to be filtered out before a mean is computed by hand, because any calculation we do with a null value also results in a null value.

In [8]:
age_is_null = pd.isnull(titanic_survival["age"])
good_ages = titanic_survival["age"][~age_is_null]
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

29.8811345124


#### <p style="color:Gray">Series.mean()<p/>
Calculates the mean of a column;<br/>
missing values are not included in the calculation.

In [9]:
correct_mean_age = titanic_survival["age"].mean()
print(correct_mean_age)

29.8811345124283
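
The difference between the hand-rolled mean and `Series.mean()` can be seen on a small example. This is a minimal sketch on made-up values (the `ages` Series is hypothetical, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "age" column
ages = pd.Series([20.0, np.nan, 40.0])

# The built-in sum() propagates NaN, so the hand-rolled mean is NaN too
naive_mean = sum(ages) / len(ages)

# Filtering the nulls first gives the mean of the two known values
known = ages[~pd.isnull(ages)]
manual_mean = sum(known) / len(known)

# Series.mean() skips NaN by default (skipna=True)
builtin_mean = ages.mean()

print(naive_mean, manual_mean, builtin_mean)
```

Both the filtered version and `Series.mean()` average only the known values, which is why cells 8 and 9 agree.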


Assign the mean of the "fare" column to correct_mean_fare.

In [10]:
correct_mean_fare = titanic_survival["fare"].mean()
print(correct_mean_fare)

33.29547928134572


In [11]:
passenger_classes = [1,2,3]
fares_by_class = {}

In [12]:
for this_class in passenger_classes:
    # Select the rows that belong to this passenger class
    pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]
    pclass_fares = pclass_rows["fare"]
    # Series.mean() skips missing fares
    fare_for_class = pclass_fares.mean()
    print(fare_for_class)
    fares_by_class[this_class] = fare_for_class

87.50899164086687
21.1791963898917
13.302888700564957


In [13]:
print(fares_by_class[1])
print(fares_by_class[2])
print(fares_by_class[3])

87.50899164086687
21.1791963898917
13.302888700564957


#### <p style="color:Tomato"> Pivot tables<p/>
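You can build subsets from a single column and then apply a calculation such as a sum or a mean to each one.
<br>
> A pivot table first groups the data and then applies a calculation. Above, we built a pivot table by hand by grouping on the pclass column and then computing the mean of the fare column for each class.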

In [14]:
passenger_class_fares = titanic_survival.pivot_table(
    index="pclass", values="fare", aggfunc=np.mean)
print(passenger_class_fares)

             fare
pclass           
1.0     87.508992
2.0     21.179196
3.0     13.302889


#### <p style="color:Gray">First parameter<p/>
> index tells the method which column to group by.

#### <p style="color:Gray">Second parameter<p/>
> values is the column that we want to apply the calculation to.

#### <p style="color:Gray">Third parameter<p/>
> aggfunc specifies the calculation we want to perform. The default for the aggfunc parameter is actually the mean.
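
To see that the pivot table matches the manual loop, here is a small sketch on made-up data. The `toy` frame and its values are hypothetical; only the column names mirror the Titanic set.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic set
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "fare": [80.0, 100.0, 20.0, 30.0, 10.0],
})

# Manual version: filter the rows of each class, then average the fares
manual = {}
for this_class in [1, 2, 3]:
    manual[this_class] = toy[toy["pclass"] == this_class]["fare"].mean()

# Pivot table: group by "pclass", then average "fare" (mean is the default aggfunc)
pivot = toy.pivot_table(index="pclass", values="fare")

print(manual)
print(pivot)
```

The pivot table returns a DataFrame indexed by `pclass`, so `pivot.loc[1, "fare"]` holds the same number as `manual[1]`.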


In [15]:
passenger_survival = titanic_survival.pivot_table(
    index="pclass", values="survived")
print(passenger_survival)

        survived
pclass          
1.0     0.619195
2.0     0.429603
3.0     0.255289


[http://pbpython.com/pandas-pivot-table-explained.html](http://pbpython.com/pandas-pivot-table-explained.html)

In [16]:
passenger_age = titanic_survival.pivot_table(index="pclass", values="age")
print(passenger_age)

              age
pclass           
1.0     39.159918
2.0     29.506705
3.0     24.816367


A pivot table that calculates the total fares collected ("fare") and the total number of survivors ("survived") for each embarkation port ("embarked").

In [17]:
port_stats = titanic_survival.pivot_table(
    index="embarked", values=["fare", "survived"], 
    aggfunc=np.sum)
print(port_stats)

                fare  survived
embarked                      
C         16830.7922     150.0
Q          1526.3085      44.0
S         25033.3862     304.0


### Remove the missing values in a vector of data, and in a matrix

#### <p style="color:Tomato"> What is a matrix?<p/>
#### <p style="color:Tomato"> What is a vector?<p/>

#### <p style="color:Gray">DataFrame.dropna()<p/>
This method will drop any rows that contain missing values.

Reference: [Removing rows with missing values](http://rfriend.tistory.com/263)

```python
df.dropna(axis=0)
```
* Deletes every row that contains a NaN value. <br>
* All columns are kept; only the rows that contain NaN are removed.

```python
df.dropna(axis=1)
```
* Deletes every column that contains a NaN value. <br>
* All rows are kept; only the columns that contain NaN are removed.

In [18]:
drop_na_rows = titanic_survival.dropna(axis=0)
drop_na_col = titanic_survival.dropna(axis=1)
print(drop_na_rows)
print(drop_na_col)

Empty DataFrame
Columns: [pclass, survived, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked, boat, body, home.dest]
Index: []
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[1310 rows x 0 columns]


```python
df.dropna(how='any')  # drop a row if any value in it is NaN
df.dropna(how='all')  # drop a row only if all of its values are NaN
```
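
The options above can be compared side by side on a tiny frame. This is a sketch with a hypothetical `df` (two columns, three rows), not the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has one NaN, row 2 is entirely NaN
df = pd.DataFrame({
    "age": [29.0, np.nan, np.nan],
    "sex": ["female", "male", np.nan],
})

any_dropped = df.dropna(how="any")          # keeps only row 0
all_dropped = df.dropna(how="all")          # keeps rows 0 and 1
subset_dropped = df.dropna(subset=["sex"])  # only checks the "sex" column

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```

On the Titanic data every row has at least one missing value, which is why `dropna(axis=0)` returns an empty DataFrame there; restricting the check with `subset` avoids that.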

In [19]:
len(titanic_survival.index)

1310

In [20]:
drop_na_columns = titanic_survival.dropna(
    axis=1)
print(drop_na_columns)

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]

[1310 rows x 0 columns]


Drop all rows where the columns "age" or "sex" have missing values and assign the result to new_titanic_survival.

In [21]:
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])
print(new_titanic_survival)

      pclass  survived                                               name  \
0        1.0       1.0                      Allen, Miss. Elisabeth Walton   
1        1.0       1.0                     Allison, Master. Hudson Trevor   
2        1.0       0.0                       Allison, Miss. Helen Loraine   
3        1.0       0.0               Allison, Mr. Hudson Joshua Creighton   
4        1.0       0.0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
5        1.0       1.0                                Anderson, Mr. Harry   
6        1.0       1.0                  Andrews, Miss. Kornelia Theodosia   
7        1.0       0.0                             Andrews, Mr. Thomas Jr   
8        1.0       1.0      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
9        1.0       0.0                            Artagaveytia, Mr. Ramon   
10       1.0       0.0                             Astor, Col. John Jacob   
11       1.0       1.0  Astor, Mrs. John Jacob (Madeleine Talmadge Force)   

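#### <p style="color:Gray">DataFrame.loc[] and DataFrame.iloc[]<p/>
Rows have index labels. These work just like column labels, and can be values like numbers, characters, and strings. `.loc[]` selects rows by these labels, while `.iloc[]` selects rows by integer position.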



In [22]:
first_five_rows = new_titanic_survival.iloc[0:5]
print(first_five_rows)

   pclass  survived                                             name     sex  \
0     1.0       1.0                    Allen, Miss. Elisabeth Walton  female   
1     1.0       1.0                   Allison, Master. Hudson Trevor    male   
2     1.0       0.0                     Allison, Miss. Helen Loraine  female   
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton    male   
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   

       age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
0  29.0000    0.0    0.0   24160  211.3375       B5        S    2    NaN   
1   0.9167    1.0    2.0  113781  151.5500  C22 C26        S   11    NaN   
2   2.0000    1.0    2.0  113781  151.5500  C22 C26        S  NaN    NaN   
3  30.0000    1.0    2.0  113781  151.5500  C22 C26        S  NaN  135.0   
4  25.0000    1.0    2.0  113781  151.5500  C22 C26        S  NaN    NaN   

                         home.dest  
0                     St 

In [23]:
first_ten_rows = new_titanic_survival.iloc[0:10]
print(first_ten_rows)

   pclass  survived                                             name     sex  \
0     1.0       1.0                    Allen, Miss. Elisabeth Walton  female   
1     1.0       1.0                   Allison, Master. Hudson Trevor    male   
2     1.0       0.0                     Allison, Miss. Helen Loraine  female   
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton    male   
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   
5     1.0       1.0                              Anderson, Mr. Harry    male   
6     1.0       1.0                Andrews, Miss. Kornelia Theodosia  female   
7     1.0       0.0                           Andrews, Mr. Thomas Jr    male   
8     1.0       1.0    Appleton, Mrs. Edward Dale (Charlotte Lamson)  female   
9     1.0       0.0                          Artagaveytia, Mr. Ramon    male   

       age  sibsp  parch    ticket      fare    cabin embarked boat   body  \
0  29.0000    0.0    0.0     24160  211.3

In [24]:
row_position_fifth = new_titanic_survival.iloc[4]
print(row_position_fifth)

pclass                                                     1
survived                                                   0
name         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
sex                                                   female
age                                                       25
sibsp                                                      1
parch                                                      2
ticket                                                113781
fare                                                  151.55
cabin                                                C22 C26
embarked                                                   S
boat                                                     NaN
body                                                     NaN
home.dest                    Montreal, PQ / Chesterville, ON
Name: 4, dtype: object


Assign the row with index label 25 from new_titanic_survival to row_index_25.

In [25]:
row_index_25 = new_titanic_survival.loc[25]

In [26]:
first_row_first_column = new_titanic_survival.iloc[0,0]
print(first_row_first_column)

1.0


In [27]:
all_rows_first_three_columns = new_titanic_survival.iloc[:, 0:3]
print(all_rows_first_three_columns)

      pclass  survived                                               name
0        1.0       1.0                      Allen, Miss. Elisabeth Walton
1        1.0       1.0                     Allison, Master. Hudson Trevor
2        1.0       0.0                       Allison, Miss. Helen Loraine
3        1.0       0.0               Allison, Mr. Hudson Joshua Creighton
4        1.0       0.0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5        1.0       1.0                                Anderson, Mr. Harry
6        1.0       1.0                  Andrews, Miss. Kornelia Theodosia
7        1.0       0.0                             Andrews, Mr. Thomas Jr
8        1.0       1.0      Appleton, Mrs. Edward Dale (Charlotte Lamson)
9        1.0       0.0                            Artagaveytia, Mr. Ramon
10       1.0       0.0                             Astor, Col. John Jacob
11       1.0       1.0  Astor, Mrs. John Jacob (Madeleine Talmadge Force)
12       1.0       1.0                

In [28]:
row_index_83_age = new_titanic_survival.loc[83, "age"]
print(row_index_83_age)

64.0


In [29]:
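# .iloc selects by position, not by label: after dropna(), position 85 holds the row labelled 94
row_position_85_all_columns = new_titanic_survival.iloc[85, :]
print(row_position_85_all_columns)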

pclass                               1
survived                             1
name         Dodge, Master. Washington
sex                               male
age                                  4
sibsp                                0
parch                                2
ticket                           33638
fare                           81.8583
cabin                              A34
embarked                             S
boat                                 5
body                               NaN
home.dest            San Francisco, CA
Name: 94, dtype: object
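
The output above carries the label 94 even though the row was selected at position 85, because rows were dropped earlier. A minimal sketch of the same effect, using a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the row labelled 1 has a missing age
toy = pd.DataFrame({"age": [29.0, np.nan, 2.0, 30.0]})
cleaned = toy.dropna(subset=["age"])  # remaining index labels: 0, 2, 3

by_label = cleaned.loc[2, "age"]   # row whose index label is 2
by_position = cleaned.iloc[2, 0]   # third remaining row (label 3), first column

print(by_label, by_position)
```

Once labels and positions diverge, `.loc[2]` and `.iloc[2]` refer to different rows.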


In [30]:
row_index_766_pclass = new_titanic_survival.loc[766, "pclass"]
print(row_index_766_pclass)

3.0


In [31]:
row_index_1100_age = new_titanic_survival.loc[1100, "age"]
print(row_index_1100_age)

29.0


In [32]:
row_index_25_survived = new_titanic_survival.loc[25, "survived"]
print(row_index_25_survived)

0.0


In [33]:
five_rows_three_cols = new_titanic_survival.iloc[0:5, 0:3]
print(five_rows_three_cols)

   pclass  survived                                             name
0     1.0       1.0                    Allen, Miss. Elisabeth Walton
1     1.0       1.0                   Allison, Master. Hudson Trevor
2     1.0       0.0                     Allison, Miss. Helen Loraine
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)


#### <p style="color:Gray">Dataframe.reset_index()<p/>
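Once rows have been sorted (for example by age) or dropped, the index is no longer sequential,
because each row keeps the index label it had in titanic_survival.
The rows therefore need to be reindexed starting from 0. <br>
<br>
By default, the method keeps the old index by adding it to the DataFrame as an extra column; passing drop=True discards it instead.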

In [34]:
titanic_reindexed = new_titanic_survival.reset_index(drop=True)

In [35]:
print(titanic_reindexed.iloc[0:5, 0:3])

   pclass  survived                                             name
0     1.0       1.0                    Allen, Miss. Elisabeth Walton
1     1.0       1.0                   Allison, Master. Hudson Trevor
2     1.0       0.0                     Allison, Miss. Helen Loraine
3     1.0       0.0             Allison, Mr. Hudson Joshua Creighton
4     1.0       0.0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
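
The effect of `drop=True` is easier to see on a frame small enough to inspect by eye. A sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: dropping the NaN row leaves a gap in the index
toy = pd.DataFrame({"age": [29.0, np.nan, 2.0]})
cleaned = toy.dropna()  # index labels: 0, 2

kept_old = cleaned.reset_index()              # old labels move into an "index" column
dropped_old = cleaned.reset_index(drop=True)  # old labels are discarded

print(list(kept_old.columns))
print(list(dropped_old.index))
```

In both cases the new index runs from 0 upward; the only difference is whether the old labels survive as a column.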


A dead end!
How can I write a function that counts the number of null elements in a Series?

In [36]:
def hundredth_row(column):
    hundredth_item = column.iloc[99]
    return hundredth_item

In [37]:
hundredth_row_var = titanic_survival.apply(hundredth_row)
print(hundredth_row_var)

pclass                                                       1
survived                                                     1
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                         48
sibsp                                                        1
parch                                                        0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object


In [38]:
def find_null(value):
    if value is None:
        return "null values"
    elif value is not None:
        return "no nulls here"

In [39]:
print(find_null(titanic_reindexed))

no nulls here


In [40]:
null_count = 0
value_count = 0
for value in titanic_reindexed:
    if value is None:
        null_count += 1
    elif value is not None:
        value_count += 1
        

In [41]:
print(null_count)
print(value_count)

0
14
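
A likely explanation for the result of 0 nulls and 14 values: iterating over a DataFrame yields its column labels (there are 14 columns), and a missing value is stored as NaN, which is not the same object as None. The sketch below illustrates both points on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({"age": [29.0, np.nan], "sex": ["female", np.nan]})

# Iterating over a DataFrame yields its column labels, not its cells
labels = [item for item in toy]

# NaN is a float value, so an identity check against None does not catch it
nan_is_none = toy.loc[1, "age"] is None
nan_is_null = pd.isnull(toy.loc[1, "age"])

print(labels, nan_is_none, nan_is_null)
```

This is why the working version further down relies on `pd.isnull()` rather than on `is None`.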


In [42]:
print(titanic_reindexed)

      pclass  survived                                               name  \
0        1.0       1.0                      Allen, Miss. Elisabeth Walton   
1        1.0       1.0                     Allison, Master. Hudson Trevor   
2        1.0       0.0                       Allison, Miss. Helen Loraine   
3        1.0       0.0               Allison, Mr. Hudson Joshua Creighton   
4        1.0       0.0    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
5        1.0       1.0                                Anderson, Mr. Harry   
6        1.0       1.0                  Andrews, Miss. Kornelia Theodosia   
7        1.0       0.0                             Andrews, Mr. Thomas Jr   
8        1.0       1.0      Appleton, Mrs. Edward Dale (Charlotte Lamson)   
9        1.0       0.0                            Artagaveytia, Mr. Ramon   
10       1.0       0.0                             Astor, Col. John Jacob   
11       1.0       1.0  Astor, Mrs. John Jacob (Madeleine Talmadge Force)   

[1046 rows x 14 columns]


### <p style="color:orange">cheating</p>

In [43]:
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

In [44]:
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)

pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64


### <p style="color:green">Self-study</p>
### <p style="color:green">try1</p>

In [45]:
print(titanic_survival.loc[:, "age"])

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
5       48.0000
6       63.0000
7       39.0000
8       53.0000
9       71.0000
10      47.0000
11      18.0000
12      24.0000
13      26.0000
14      80.0000
15          NaN
16      24.0000
17      50.0000
18      32.0000
19      36.0000
20      37.0000
21      47.0000
22      26.0000
23      42.0000
24      29.0000
25      25.0000
26      25.0000
27      19.0000
28      35.0000
29      28.0000
         ...   
1280    22.0000
1281    22.0000
1282        NaN
1283        NaN
1284        NaN
1285    32.5000
1286    38.0000
1287    51.0000
1288    18.0000
1289    21.0000
1290    47.0000
1291        NaN
1292        NaN
1293        NaN
1294    28.5000
1295    21.0000
1296    27.0000
1297        NaN
1298    36.0000
1299    27.0000
1300    15.0000
1301    45.5000
1302        NaN
1303        NaN
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
1309        NaN
Name: age, Length: 1310,

In [46]:
null_count = 0
for value in titanic_survival.loc[:, "age"]:
    # pd.isnull() on a single value returns a single True or False
    if pd.isnull(value):
        null_count += 1
print(null_count)

264


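pd.isnull returns True and False values.<br>
I understand that isnull is used for null values, but I still don't fully understand it.
1. Find the null values
2. Find how many rows contain a null value
3. Find which columns contain null values, and how many null values each column has

Following the def step by step:
1. Check each column for null values (True/False)
2. Keep only the values that are null
3. Then check the length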
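
The three questions in the first list can also be answered directly from the boolean mask. This is a sketch on a hypothetical frame; it relies on the fact that True counts as 1 when a boolean Series or DataFrame is summed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns
toy = pd.DataFrame({
    "age": [29.0, np.nan, np.nan],
    "fare": [211.3, 151.5, np.nan],
})

null_mask = toy.isnull()                      # 1. True wherever a value is missing
rows_with_null = null_mask.any(axis=1).sum()  # 2. number of rows with at least one null
nulls_per_column = null_mask.sum()            # 3. null count for each column

print(rows_with_null)
print(nulls_per_column)
```

`nulls_per_column` is a Series indexed by column name, the same shape of result that `titanic_survival.apply(null_count)` produced above.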

### <p style="color:green">try2</p>

In [47]:
null = []
for value in titanic_survival.loc[:, "age"]:
    value_null = pd.isnull(value)
    null.append(value[value_null])
#     print(null)
len(null)

1310

### <p style = "color:green">try3</p>

In [48]:
age_titanic_survival = titanic_survival.loc[:, "age"]

In [49]:
# print(pd.isnull(age_titanic_survival))
age_null = pd.isnull(age_titanic_survival)

In [50]:
# print(age_null)
print(titanic_survival[age_null])

      pclass  survived                                               name  \
15       1.0       0.0                                Baumann, Mr. John D   
37       1.0       1.0      Bradley, Mr. George ("George Arthur Brayton")   
40       1.0       0.0                          Brewe, Dr. Arthur Jackson   
46       1.0       0.0                              Cairns, Mr. Alexander   
59       1.0       1.0  Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev...   
69       1.0       1.0             Chibnall, Mrs. (Edith Martha Bowerman)   
70       1.0       0.0              Chisholm, Mr. Roderick Robert Crispin   
74       1.0       0.0                        Clifford, Mr. George Quincy   
80       1.0       0.0                          Crafton, Mr. John Bertram   
106      1.0       0.0                                 Farthing, Mr. John   
107      1.0       1.0               Flegenheim, Mrs. Alfred (Antoinette)   
108      1.0       1.0                            Fleming, Miss. Margaret   

In [51]:
print(len(titanic_survival[age_null]))

264


In [52]:
fare_titanic_survival = titanic_survival.loc[:, 'fare']

In [53]:
fare_null = pd.isnull(fare_titanic_survival)
print(fare_null)  # returns only True and False values

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1280    False
1281    False
1282    False
1283    False
1284    False
1285    False
1286    False
1287    False
1288    False
1289    False
1290    False
1291    False
1292    False
1293    False
1294    False
1295    False
1296    False
1297    False
1298    False
1299    False
1300    False
1301    False
1302    False
1303    False
1304    False
1305    False
1306    False
1307    False
1308    False
1309     True
Name: fare, Length: 1310, dtype: bool


In [55]:
print(titanic_survival[fare_null])  # the fare value is null in rows 1225 and 1309, so the count below is 2
print(len(titanic_survival[fare_null]))

      pclass  survived                name   sex   age  sibsp  parch ticket  \
1225     3.0       0.0  Storey, Mr. Thomas  male  60.5    0.0    0.0   3701   
1309     NaN       NaN                 NaN   NaN   NaN    NaN    NaN    NaN   

      fare cabin embarked boat   body home.dest  
1225   NaN   NaN        S  NaN  261.0       NaN  
1309   NaN   NaN      NaN  NaN    NaN       NaN  
2


### To check the value in the 100th row

In [56]:
def hundredth_row(column):
    hundredth_item = column.iloc[99]
    return hundredth_item

In [57]:
hundredth_row_var = titanic_survival.apply(hundredth_row)
print(hundredth_row_var)

pclass                                                       1
survived                                                     1
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                         48
sibsp                                                        1
parch                                                        0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object


<p>By passing in the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.</p>
<p>We can use this to calculate some summary information about the ages of the passengers on the Titanic.</p>
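
As a small illustration of the two directions, the sketch below applies one function per column and another per row. The `toy` frame is hypothetical and the `family_size` calculation is only an example, not part of the original exercise:

```python
import pandas as pd

# Hypothetical toy data; column names borrowed from the Titanic set
toy = pd.DataFrame({"sibsp": [1, 0, 2], "parch": [0, 2, 1]})

# Default (axis=0): the function receives each column as a Series
column_totals = toy.apply(lambda column: column.sum())

# axis=1: the function receives each row as a Series
family_size = toy.apply(lambda row: row["sibsp"] + row["parch"], axis=1)

print(column_totals)
print(family_size)
```

With `axis=1` the result has one entry per row, so it lines up with the original index.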

### <p style="color:red">if/elif/else</p>

In [61]:
def which_class(row):
    pclass = row['pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First"
    elif pclass == 2:
        return "Second"
    else:
        return "Third"
classes = titanic_survival.apply(which_class, axis=1)
print(classes)

0         First
1         First
2         First
3         First
4         First
5         First
6         First
7         First
8         First
9         First
10        First
11        First
12        First
13        First
14        First
15        First
16        First
17        First
18        First
19        First
20        First
21        First
22        First
23        First
24        First
25        First
26        First
27        First
28        First
29        First
         ...   
1280      Third
1281      Third
1282      Third
1283      Third
1284      Third
1285      Third
1286      Third
1287      Third
1288      Third
1289      Third
1290      Third
1291      Third
1292      Third
1293      Third
1294      Third
1295      Third
1296      Third
1297      Third
1298      Third
1299      Third
1300      Third
1301      Third
1302      Third
1303      Third
1304      Third
1305      Third
1306      Third
1307      Third
1308      Third
1309    Unknown
Length: 1310, dtype: obj

<p style="color:green"> Create a function that returns the string "minor" if someone is under 18, "adult" if they are equal to or over 18, and "unknown" if their age is null </p>

In [64]:
def is_minor(row):
    if row["age"] < 18:
        return "minor"
    else:
        return "adult"
    
minors = titanic_survival.apply(is_minor, axis=1)
print(minors)

0       adult
1       minor
2       minor
3       adult
4       adult
5       adult
6       adult
7       adult
8       adult
9       adult
10      adult
11      adult
12      adult
13      adult
14      adult
15      adult
16      adult
17      adult
18      adult
19      adult
20      adult
21      adult
22      adult
23      adult
24      adult
25      adult
26      adult
27      adult
28      adult
29      adult
        ...  
1280    adult
1281    adult
1282    adult
1283    adult
1284    adult
1285    adult
1286    adult
1287    adult
1288    adult
1289    adult
1290    adult
1291    adult
1292    adult
1293    adult
1294    adult
1295    adult
1296    adult
1297    adult
1298    adult
1299    adult
1300    minor
1301    adult
1302    adult
1303    adult
1304    minor
1305    adult
1306    adult
1307    adult
1308    adult
1309    adult
Length: 1310, dtype: object
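
Note that row 1309 is labelled "adult" above even though its age is missing. This follows from how NaN behaves in comparisons, as the sketch below shows (the `is_minor_value` helper is a hypothetical single-value version of `is_minor`):

```python
import numpy as np

def is_minor_value(age):
    # Same branching as is_minor above, applied to a single value
    if age < 18:
        return "minor"
    else:
        return "adult"

print(np.nan < 18)             # prints False: comparisons with NaN evaluate to False
print(is_minor_value(np.nan))  # prints adult: a missing age falls through to the else branch
```

Checking `pd.isnull()` first, before any comparison, is what lets the function return a separate label for missing ages.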


In [77]:
def classify_age(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
age_labels = titanic_survival.apply(classify_age, axis=1)
print(age_labels)
# print(age_labels[adult])  # does not execute

0         adult
1         minor
2         minor
3         adult
4         adult
5         adult
6         adult
7         adult
8         adult
9         adult
10        adult
11        adult
12        adult
13        adult
14        adult
15      unknown
16        adult
17        adult
18        adult
19        adult
20        adult
21        adult
22        adult
23        adult
24        adult
25        adult
26        adult
27        adult
28        adult
29        adult
         ...   
1280      adult
1281      adult
1282    unknown
1283    unknown
1284    unknown
1285      adult
1286      adult
1287      adult
1288      adult
1289      adult
1290      adult
1291    unknown
1292    unknown
1293    unknown
1294      adult
1295      adult
1296      adult
1297    unknown
1298      adult
1299      adult
1300      minor
1301      adult
1302    unknown
1303    unknown
1304      minor
1305    unknown
1306      adult
1307      adult
1308      adult
1309    unknown
Length: 1310, dtype: obj

<p style="color:green">Let's make a pivot table to find the probability of survival for each age group.</p>

<p style="color:red">I don't know why I can't do this</p>

In [79]:
# age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
# print(age_group_survival)

KeyError: 'age_labels'
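
The KeyError most likely arises because `pivot_table` looks up `index="age_labels"` among the DataFrame's columns, while `age_labels` exists only as a separate Series. One possible fix is to store the labels as a column first. The sketch below shows this on hypothetical data (the `toy` frame and its values are made up; the column names mirror the Titanic set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names
toy = pd.DataFrame({
    "age": [29.0, 4.0, np.nan, 64.0, 15.0],
    "survived": [1.0, 1.0, 0.0, 0.0, 0.0],
})

def classify_age(row):
    age = row["age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

# Store the labels as a column so pivot_table can find them by name
toy["age_labels"] = toy.apply(classify_age, axis=1)

# The mean of the 0/1 "survived" column is the survival rate per group
age_group_survival = toy.pivot_table(index="age_labels", values="survived")
print(age_group_survival)
```

Applied to the notebook, the same idea would be `titanic_survival["age_labels"] = age_labels` followed by the commented-out `pivot_table` call.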