# Notes

This assignment is devoted to `pandas`. It covers indexing and filtering, and some `groupby` and `join` operations. The assignment roughly corresponds to Week 4 and the beginning of Week 5 of the course.

The main dataset you'll be using is [Titanic](https://www.kaggle.com/c/titanic).

In [1]:
%pylab inline
plt.style.use("bmh")

ModuleNotFoundError: No module named 'matplotlib'

In [0]:
plt.rcParams["figure.figsize"] = (6,6)

In [47]:
import numpy as np
import pandas as pd

In [48]:
titanic_train = pd.read_csv(r"data/train.csv", index_col="PassengerId")
titanic_test = pd.read_csv(r"data/test.csv", index_col="PassengerId")
titanic = pd.concat([titanic_train, titanic_test], sort=False)

In [4]:
STUDENT = "Matan Avitan"
ASSIGNMENT = 4
TEST = False

In [0]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 16

# Indexing and filtering

### 1. Fixing age (1 point).

There are several known mistakes in the Titanic dataset.

Namely, [Julia Florence Siegel](https://www.encyclopedia-titanica.org/titanic-survivor/julia-florence-cavendish.html) (Mrs. Tyrell William Cavendish) is mistakenly marked as being 76 years old (the age at which she actually died, many years after the Titanic sank).

You must replace her age value with her actual age at the time (25) and return the dataset. The input is indexed by `PassengerId` and is a concatenation of the train and test sets. You must return a copy of the dataframe and must not modify the original dataframe. The structure and indexing must be the same as in the input.
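A minimal sketch of the copy-then-assign pattern this task asks for, on a made-up two-row frame. The helper name `set_age` and the toy names are assumptions for illustration only and are not part of the assignment.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are invented for illustration).
toy = pd.DataFrame(
    {"Name": ["Doe, Mrs. John (Jane Roe)", "Smith, Mr. Adam"], "Age": [76.0, 30.0]},
    index=pd.Index([1, 2], name="PassengerId"),
)

def set_age(df, name_part, new_age):
    """Return a copy of df with Age replaced for rows whose Name contains name_part."""
    result = df.copy()
    # Literal substring match; regex=False avoids surprises with special characters.
    mask = result["Name"].str.contains(name_part, regex=False)
    # .loc writes only the Age cell of the matching rows.
    result.loc[mask, "Age"] = new_age
    return result

fixed = set_age(toy, "Jane Roe", 25)
```

Because the assignment happens on `result`, the frame passed in keeps its original value, and the index and column layout are unchanged.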

In [0]:
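def fix_age(df):
    """Fix age for Julia Florence Siegel."""
    # Boolean mask selecting her row by name (literal match, no regex).
    mask = df.Name.str.contains('Julia Florence Siegel', regex=False)
    df_copy = df.copy()
    # Overwrite only the Age cell; replace(76, 25) on the whole row
    # could also change other columns that happen to hold 76.
    df_copy.loc[mask, 'Age'] = 25
    return df_copy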
frames = [titanic_train,titanic_test]
fix_age(pd.concat(frames))


PassengerId,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
1,22.0,,S,7.2500,"Braund, Mr. Owen Harris",0,3,male,1,0.0,A/5 21171
2,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,1.0,PC 17599
3,26.0,,S,7.9250,"Heikkinen, Miss. Laina",0,3,female,0,1.0,STON/O2. 3101282
4,35.0,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,1.0,113803
5,35.0,,S,8.0500,"Allen, Mr. William Henry",0,3,male,0,0.0,373450
...,...,...,...,...,...,...,...,...,...,...,...
1305,,,S,8.0500,"Spector, Mr. Woolf",0,3,male,0,,A.5. 3236
1306,39.0,C105,C,108.9000,"Oliva y Ocana, Dona. Fermina",0,1,female,0,,PC 17758
1307,38.5,,S,7.2500,"Saether, Mr. Simon Sivertsen",0,3,male,0,,SOTON/O.Q. 3101262
1308,,,S,8.0500,"Ware, Mr. Frederick",0,3,male,0,,359309


In [0]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age)

### 2. Embarkment port distribution (1 point).

You must find the value counts of the embarkment port (`Embarked` column) for the passengers who travelled in 3rd class, were male, and were between 20 and 30 years old (both inclusive). There is no need to treat missing values separately.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. You must return a Series indexed by the values from `Embarked`, following the semantics of the `.value_counts()` method.
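As an illustration of combining boolean conditions before calling `.value_counts()`, here is a small sketch on invented data. The toy frame is an assumption; only the column names follow the Titanic dataset.

```python
import pandas as pd

# Invented rows; the last one has a missing age.
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 3],
    "Sex": ["male", "male", "female", "male", "male"],
    "Age": [20.0, 30.0, 25.0, 22.0, None],
    "Embarked": ["S", "S", "C", "Q", "S"],
})

# Conditions are combined element-wise with &; between() includes both bounds
# and evaluates to False for missing ages.
mask = (toy["Pclass"] == 3) & (toy["Sex"] == "male") & toy["Age"].between(20, 30)
counts = toy.loc[mask, "Embarked"].value_counts()
```

Only the first two rows pass all three conditions, so `counts` has a single entry for port `S`.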

In [0]:
def embarked_stats(df):
    """Calculate embarkment port statistics."""
    # 3rd-class males aged 20 to 30; between() is inclusive on both ends.
    mask = (df.Pclass == 3) & (df.Sex == 'male') & df.Age.between(20, 30)
    return df[mask].Embarked.value_counts()
frames = [titanic_train,titanic_test]
embarked_stats(pd.concat(frames))


S    132
C     21
Q      7
Name: Embarked, dtype: int64

In [0]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, embarked_stats)

In [50]:
titanic_train.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### 3. Fill missing age values (1 point).

Some age values are missing in the Titanic dataset. You need to calculate the average age over all passengers and use it to fill the missing values in the `Age` column.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a **new** dataframe with the same structure, but without missing values in the `Age` column.
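One way to do this without touching the input is `fillna` combined with `assign`, sketched here on a made-up three-row frame (the data is an assumption for illustration).

```python
import pandas as pd

toy = pd.DataFrame({"Age": [20.0, None, 40.0]})

# mean() skips missing values by default, so the average here is 30.0.
# assign() returns a new frame and leaves toy unchanged.
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].mean()))
```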

In [52]:
def fix_age(df):
    """Fix missing age values."""
    df_copy = df.copy()
    # Assign the filled column back instead of using chained indexing
    # (df_copy.Age[mask] = ...), which raises SettingWithCopyWarning.
    df_copy['Age'] = df_copy['Age'].fillna(df['Age'].mean())
    return df_copy
frames = [titanic_train,titanic_test]
fix_age(pd.concat(frames))


PassengerId,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
1,22.000000,,S,7.2500,"Braund, Mr. Owen Harris",0,3,male,1,0.0,A/5 21171
2,38.000000,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,1.0,PC 17599
3,26.000000,,S,7.9250,"Heikkinen, Miss. Laina",0,3,female,0,1.0,STON/O2. 3101282
4,35.000000,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,1.0,113803
5,35.000000,,S,8.0500,"Allen, Mr. William Henry",0,3,male,0,0.0,373450
...,...,...,...,...,...,...,...,...,...,...,...
1305,29.881138,,S,8.0500,"Spector, Mr. Woolf",0,3,male,0,,A.5. 3236
1306,39.000000,C105,C,108.9000,"Oliva y Ocana, Dona. Fermina",0,1,female,0,,PC 17758
1307,38.500000,,S,7.2500,"Saether, Mr. Simon Sivertsen",0,3,male,0,,SOTON/O.Q. 3101262
1308,29.881138,,S,8.0500,"Ware, Mr. Frederick",0,3,male,0,,359309


In [0]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age)

### 4. Child travelling alone (1 point).

You must find a child (`Age<10`) on board who was travelling without siblings or parents, and find the name of her nursemaid.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a tuple of two strings taken from the `Name` column: the child's name first and the nursemaid's name second. It is known that there is only one such child.

In [0]:
def get_nursemaid(df):
    """Find the child travelling alone and her nursemaid."""
    alone = (df.SibSp == 0) & (df.Parch == 0)
    # The task says "her nursemaid", so the child is assumed to be female.
    child = df[(df.Age < 10) & alone & (df.Sex == 'female')].iloc[0]
    # Heuristic: the nursemaid is an adult travelling alone with the same port, class and fare.
    nursemaid_filter = (df.Age >= 18) & alone & (df.Embarked == child['Embarked']) \
        & (df.Pclass == child['Pclass']) & (df.Fare == child['Fare'])
    return child['Name'], df[nursemaid_filter].Name.values[0]
frames = [titanic_train,titanic_test]
get_nursemaid(pd.concat(frames))


('Emanuel, Miss. Virginia Ethel', 'Dowdell, Miss. Elizabeth')

In [0]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_nursemaid)

### 5. Port with the most children embarked (1 point).

You must find which port had the largest percentage of children (`Age<10`) among its embarked passengers, i.e. the number of children divided by the total number of passengers embarked.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a single string with the port letter.
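If the percentage is read per port (children at a port divided by all passengers at that port), it can be computed as the mean of a boolean flag within each port. The sketch below uses invented data and assumes that passengers with a missing age count as non-children.

```python
import pandas as pd

# Invented rows: port S has more children in absolute terms, port C has a larger share.
toy = pd.DataFrame({
    "Age": [5.0, 3.0, 40.0, 30.0, 35.0, 60.0, 8.0, 50.0],
    "Embarked": ["S", "S", "S", "S", "S", "S", "C", "C"],
})

is_child = toy["Age"] < 10
# The mean of a boolean Series is the fraction of True values in each group.
share = is_child.groupby(toy["Embarked"]).mean()
best_port = share.idxmax()
```

Here `S` has two children out of six passengers and `C` has one out of two, so ranking by share and ranking by raw count give different ports.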

In [0]:
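def get_port(df):
    """Get port with the most children embarked."""
    children_filter = df.Age < 10
    # Caveat: the denominator is the total over all ports, so this ranks ports by
    # raw child count; a per-port share would divide by df.Embarked.value_counts().
    embarked_stats = df[children_filter].Embarked.value_counts() / df.Embarked.value_counts().sum()
    return embarked_stats.idxmax()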
frames = [titanic_train,titanic_test]
get_port(pd.concat(frames))


'S'

In [0]:
PROBLEM_ID = 5

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_port)

### 6. Passengers per ticket (2 points).

Calculate the average and maximum number of passengers per ticket.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a tuple of two values: the average first, then the maximum.
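A small sketch of counting rows per group, on invented tickets. It also shows why `size()` is a safer counter than `count()` on a column that may contain missing values; the toy data is an assumption.

```python
import pandas as pd

# Invented rows: ticket B2 has a missing fare.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "A1"],
    "Fare": [10.0, 10.0, None, 10.0],
})

# size() counts rows per group regardless of missing values.
per_ticket = toy.groupby("Ticket").size()
stats = (per_ticket.mean(), per_ticket.max())

# count() counts only non-missing values, so B2 would be reported as 0.
non_missing_fares = toy.groupby("Ticket")["Fare"].count()
```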

In [0]:
def get_ticket_stats(df):
    """Calculate passenger per ticket statistics."""
    # size() counts every row per ticket; Fare.count() would skip the
    # passenger whose Fare is missing.
    passengers_per_ticket = df.groupby('Ticket').size()
    return passengers_per_ticket.mean(), passengers_per_ticket.max()
frames = [titanic_train,titanic_test]
get_ticket_stats(pd.concat(frames))

(1.4090419806243273, 11)

In [0]:
PROBLEM_ID = 6

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_ticket_stats)

In [0]:
titanic_train[titanic_train.SibSp == 5]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S
72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,5,2,CA 2144,46.9,,S
387,0,3,"Goodwin, Master. Sidney Leonard",male,1.0,5,2,CA 2144,46.9,,S
481,0,3,"Goodwin, Master. Harold Victor",male,9.0,5,2,CA 2144,46.9,,S
684,0,3,"Goodwin, Mr. Charles Edward",male,14.0,5,2,CA 2144,46.9,,S


### 7. Fare per passenger (3 points).

For each individual ticket, you must calculate the fare per person for that ticket, and then calculate the averages for each class. Note that you will need to apply `groupby`, and you may then consider using `.first()` on the resulting `DataFrameGroupBy`.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a `Series` with three elements, indexed by class.
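The two-step grouping described above can be sketched on invented data as follows. This assumes that `Fare` holds the total price of the ticket and is repeated on every row of that ticket; the column name `fare_per_person` is introduced here for illustration.

```python
import pandas as pd

# Invented rows: ticket A1 is shared by two people, C3 by three.
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [60.0, 60.0, 40.0, 21.0, 21.0, 21.0],
})

grouped = toy.groupby("Ticket")
# One row per ticket: its class and its (assumed total) fare.
tickets = grouped[["Pclass", "Fare"]].first()
# Number of passengers travelling on each ticket.
tickets["passengers"] = grouped.size()
tickets["fare_per_person"] = tickets["Fare"] / tickets["passengers"]

# Average the per-person fare over the tickets of each class.
by_class = tickets.groupby("Pclass")["fare_per_person"].mean()
```

Selecting the single column before `.mean()` yields a `Series` indexed by `Pclass` rather than a one-column dataframe.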

In [0]:
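def get_fare_per_pass(df):
    """Calculate fare per passenger for different classes."""
    # Select columns with a list; the tuple form ['Pclass', 'Fare'] is an error in pandas 2.
    df_gb_ticket = df.groupby('Ticket')[['Pclass', 'Fare']]
    # Take the first row of each ticket, giving one (Pclass, Fare) pair per ticket.
    df_gb_ticket_unique_P_F = df_gb_ticket.first()
    # Caveat: Fare is not divided by the number of passengers on the ticket here,
    # so this is the average ticket fare per class, returned as a one-column DataFrame.
    return df_gb_ticket_unique_P_F.groupby('Pclass').mean()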
frames = [titanic_train,titanic_test]
get_fare_per_pass(pd.concat(frames))


Pclass,Fare
1,58.261126
2,16.462759
3,9.468468


In [0]:
PROBLEM_ID = 7

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, get_fare_per_pass)

In [0]:
titanic_test[titanic_test.Age.isnull()]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
902,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S
914,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
921,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C
925,3,"Johnston, Mrs. Andrew G (Elizabeth ""Lily"" Watson)",female,,1,2,W./C. 6607,23.4500,,S
928,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
1300,3,"Riordan, Miss. Johanna ""Hannah""",female,,0,0,334915,7.7208,,Q
1302,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


### 8. Fill missing age values (3 points).

In problem 3 you filled the missing age values with the average over all passengers. Now you need to fill them according to class and sex. For example, for a female passenger from 2nd class, the missing age value must be filled with the average age of females in 2nd class.

In this problem, you may need joins and `.apply()`, although there are several ways to get the same result.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a **new** dataframe with the same structure as the input, but without missing values in the `Age` column.
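One of the several possible ways is `groupby(...).transform("mean")`, which returns the group average aligned to the original rows so that it can be passed straight to `fillna`. The sketch below uses invented data and is not the only valid approach.

```python
import pandas as pd

# Invented rows with one missing age in each (Pclass, Sex) group.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2],
    "Sex": ["female", "female", "female", "male", "male"],
    "Age": [30.0, 40.0, None, 20.0, None],
})

# transform() keeps the original index, so every row gets its own group's mean.
group_mean = toy.groupby(["Pclass", "Sex"])["Age"].transform("mean")
# assign() returns a new frame; toy keeps its missing values.
filled = toy.assign(Age=toy["Age"].fillna(group_mean))
```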

In [64]:
def fix_age_groupped(df):
    """Fill missing age values with the mean age of the same class and sex."""
    df_copy = df.copy()
    # Mean age of each (Pclass, Sex) group, aligned to the original rows.
    group_mean = df.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
    # Assigning the filled column back avoids chained indexing and its SettingWithCopyWarning.
    df_copy['Age'] = df_copy['Age'].fillna(group_mean)
    return df_copy
frames = [titanic_train,titanic_test]
fix_age_groupped(pd.concat(frames))


PassengerId,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
1,22.000000,,S,7.2500,"Braund, Mr. Owen Harris",0,3,male,1,0.0,A/5 21171
2,38.000000,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,1.0,PC 17599
3,26.000000,,S,7.9250,"Heikkinen, Miss. Laina",0,3,female,0,1.0,STON/O2. 3101282
4,35.000000,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,1.0,113803
5,35.000000,,S,8.0500,"Allen, Mr. William Henry",0,3,male,0,0.0,373450
...,...,...,...,...,...,...,...,...,...,...,...
1305,25.962264,,S,8.0500,"Spector, Mr. Woolf",0,3,male,0,,A.5. 3236
1306,39.000000,C105,C,108.9000,"Oliva y Ocana, Dona. Fermina",0,1,female,0,,PC 17758
1307,38.500000,,S,7.2500,"Saether, Mr. Simon Sivertsen",0,3,male,0,,SOTON/O.Q. 3101262
1308,25.962264,,S,8.0500,"Ware, Mr. Frederick",0,3,male,0,,359309


In [0]:
PROBLEM_ID = 8

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, fix_age_groupped)

### 9. Finding couples (3 points).

Based on the code from Lecture 5, build a dataframe of couples. Filter it by survival status: select those couples in which only one spouse or neither spouse survived. Build survival statistics by class, i.e. the ratio of couples with partial survival to the total number of couples in the class.

The input is indexed by `PassengerId` and is a concatenation of the train and test sets. The output must be a `Series` with three elements, indexed by the values from the `Pclass` column.
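The Lecture 5 code for pairing spouses is not reproduced here. The sketch below covers only the final step and assumes a couples dataframe already exists, with one row per couple and the hypothetical columns `Survived_husband` and `Survived_wife`, both known for every couple. All values are invented.

```python
import pandas as pd

# Hypothetical couples frame; the column names and values are assumptions.
couples = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Survived_husband": [1, 0, 0, 0, 1],
    "Survived_wife": [1, 1, 0, 1, 1],
})

# True when at most one of the two spouses survived.
not_both = (couples["Survived_husband"] + couples["Survived_wife"]) < 2
# Fraction of such couples among all couples in each class.
ratio = not_both.groupby(couples["Pclass"]).mean()
```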

In [0]:
def find_couples(df):
    """Build survival statistics for couples by class."""
    pass

In [0]:
PROBLEM_ID = 9

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_couples)