# Not your usual solution, but an excellent solution nontheless

For the very start, I need to say that I am not going to use regression or any tree-based method here. I am going to __reverse engineer__ the kaggle titantic data set.

Why?

Have you noticed that many people managed to get a perfect score for this problem? 100% correct. I wouldn't expect any machine learning algorithm to perform this good, and you should think this way too. One way to obtain a perfect score is to check manually the survival status of each person in the test set. A perfectly viable solution, but not Pythonic at all!!!

A better solution, in my opinion, is to use web scraping to gather that info from the internet. Solving the problem this way will not garrentee a perfect score, but will substantially reduce the problem to the extent that can be solved by hand. For example, check a few dozen names on the internet instead of hundreds of names.

This solution has two parts. Part 1 is the web crawller which is straight forward and hosted [here](https://github.com/HVoltBb/misc/src/kaggle/titantic/crawler.py) on GitHub. Part 2 is the data processing part, and it is shown below.

There are numerous online sources that you can crawl, but the one I used is the [Encyclopedia TITANICA](https://www.encyclopedia-titanica.org/titanic-survivors/). You may also use the [Wikipedia page](https://en.wikipedia.org/wiki/Passengers_of_the_Titanic), which also has a well formated table.

WARNING: I know some of you will hate this solution. You don't have to like it. But by doing this exercise, you learn far more about this problem than simply running some packaged models.

Much of the effort is spent on transforming non-ascii characters into ascii characters. You will see more below.

WARNING: Also note that I wrote this notebook on my own laptop. Some of the statements require Python 3.8+, and Python on kaggle is 3.7, and you will get syntax errors if you execute this notebook on kaggle. You may convert those statements to be 3.7 compatible by typing some extra characters.

In [58]:
import pandas as pd
import re
import numpy as np

train = pd.read_csv('~/Downloads/train.csv')
test = pd.read_csv('~/Downloads/test.csv')

I have downloaded these datasets to my laptop.

In [59]:
print(f'{train.shape}_train, {test.shape}_test')

(891, 12)_train, (418, 11)_test


# Tables from Encyclopedia TITANTICA

Both `surv.csv` and `vict.csv` are generated by running this [script](https://github.com/HVoltBb/misc/src/kaggle/titantic/crawler.py).

For your convenience, they can also be downloaded [here](https://github.com/HVoltBb/misc/src/kaggle/titantic/surv.csv) and [here](https://github.com/HVoltBb/misc/src/kaggle/titantic/vict.csv)

In [60]:
suv = pd.read_csv('surv.csv')
vic = pd.read_csv('vict.csv')

In [61]:
print(f'{suv.shape}_surv, {vic.shape}_vic')

(712, 5)_surv, (1496, 5)_vic


In [62]:
suv['survived'] = 1
vic['survived'] = 0
ground_truth = pd.concat([suv, vic])
ground_truth['fsname'] = [re.search('^(.*?)( |$)', item).group(1) for item in ground_truth['given name']]
ground_truth.head()

Unnamed: 0,family name,prefix,given name,age,alt name,survived,fsname
0,ABBOTT,Mrs,Rhoda Mary 'Rosa',39.0,rhoda-mary-rosa-abbott,1,Rhoda
1,ABELSETH,Miss,Kalle (Karen) Marie Kristiane,16.0,karen-marie-abelseth,1,Kalle
2,ABELSETH,Mr,Olaus Jørgensen,25.0,olaus-jorgensen-abelseth,1,Olaus
3,ABELSON,Mrs,Anna,28.0,abelson,1,Anna
4,ABĪ SA'B,Mrs,Sha'nīnah,38.0,shawneene-george-joseph,1,Sha'nīnah


# Non-ascii names

155 out of all the TITANIC passengers (including ship crew) have a non-ascii last name.

70 out of all the passengers have a non-ascii first name.

In [63]:
tmp_f = [item.encode('ascii', 'ignore').decode('ascii') for item in ground_truth['family name']]
non_ascii = [True if x != y else False for x, y in zip(tmp_f, ground_truth['family name'])]
ground_truth['uni_f'] = non_ascii
print('Non-ascii family names')
pd.value_counts(non_ascii)

Non-ascii family names


False    2053
True      155
dtype: int64

In [64]:
tmp_fs = [item.encode('ascii', 'ignore').decode('ascii') for item in ground_truth['fsname']]
non_ascii_ = [True if x != y else False for x, y in zip(tmp_fs, ground_truth['fsname'])]
ground_truth['uni_g'] = non_ascii_
print('Non-ascii first names')
pd.value_counts(non_ascii_)

Non-ascii first names


False    2138
True       70
dtype: int64

# Use unidecode to transform non-ascii names

In [65]:
#!pip install unidecode
from unidecode import unidecode
ground_truth['family name'] = [unidecode(item) for item in ground_truth['family name']]
ground_truth['fsname'] = [unidecode(item) for item in ground_truth['fsname']]


# Or get the ascii names from the url

I noticed that the `unidecode` transformed non-ascii names do not match those names in the provide dataset AT ALL!!!

Apperantly, the conversion was done some other way.

Note that urls can not have non-ascii characters, and the urls for those passengers can be parsed to exact their family and last names. You can see in the following that this works. 

In [66]:
ground_truth.set_index(np.arange(0, ground_truth.shape[0]), inplace=True)

In [67]:
for i, item in ground_truth.iterrows():
    dash = re.search('-', item['alt name'])
    if item.uni_f | item.uni_g | bool(dash):
        ground_truth.at[i, 'family name'] = item['alt name'].split('-')[-1].upper()
        ground_truth.at[i, 'fsname'] = item['alt name'].split('-')[0].capitalize()        


In [68]:
train['fname'] = [re.search('^(.*?), ', item).group(1) for item in train.Name]
train['prefix'] = [re.search('^.*?, (.*?)\. ', item).group(1) for item in train.Name]
train['gname'] = [re.search('^.*?, .*?\. (.*)', item).group(1) for item in train.Name]


# Cleaning up the names

Even though the description of this problem says you don't need to do much data cleaning, it is not the case.

In [69]:
# cleaning
tmp = [re.search('^.*?, .*?\. ([^ ]*?)( |$)', item).group(1) for item in train.Name]
tmp2 = [re.search('\((.*?)( |\)|$)', item).group(1) if item.startswith('(') else item for item in tmp]

# more cleaning
tmp3 = [z.group(1) if y == 'Mrs' and (z:=re.search('^.*?\((.*?)( |\))', x)) is not None else w for x, y, w in zip(train.gname, train.prefix, tmp2)]
train['fsname'] = tmp3

# dashes
train['fname'] = [item.split('-')[-1] if bool(re.search('-', item)) else item for item in train['fname']]
# spaces
train['fname'] = [item.split(' ')[-1] if bool(re.search(' ', item)) else item for item in train['fname']]
# quotes
train['fname'] = [item.replace("'", '') if bool(re.search("'", item)) else item for item in train['fname']]



In [70]:
test['fname'] = [re.search('^(.*?), ', item).group(1) for item in test.Name]
test['prefix'] = [re.search('^.*?, (.*?)\. ', item).group(1) for item in test.Name]
test['gname'] = [re.search('^.*?, .*?\. (.*)', item).group(1) for item in test.Name]
# cleaning
tmp = [re.search('^.*?, .*?\. ([^ ]*?)( |$)', item).group(1) for item in test.Name]
tmp2 = [re.search('\((.*?)( |\)|$)', item).group(1) if item.startswith('(') else item for item in tmp]

# more cleaning
tmp3 = [z.group(1) if y == 'Mrs' and (z:=re.search('^.*?\((.*?)( |\))', x)) is not None else w for x, y, w in zip(test.gname, test.prefix, tmp2)]
test['fsname'] = tmp3

# dashes
test['fname'] = [item.split('-')[-1] if bool(re.search('-', item)) else item for item in test['fname']]
# spaces
test['fname'] = [item.split(' ')[-1] if bool(re.search(' ', item)) else item for item in test['fname']]
# quotes
test['fname'] = [item.replace("'", '') if bool(re.search("'", item)) else item for item in test['fname']]

# Checking names

Out of the 1309 records provided by kaggle, we only failed to identify 57 of them. I say this is pretty good.

I have checked those 57 records. The problem is misspelled names in the kaggle dataset. I see no point in manually checking these records, even though it is achievable in under 1 hour, assuming that you can identify 1 record in 1 min.

Another problem I see is that often times the Age field in the kaggle dataset is not correct. It is not a rounding issue. Sometimes the age is off by a few years. A few things are possible here:

1. kaggle staff intentially modified those values to defy a solution like this one
2. kaggle staff scraped a less reliable source than the one used here

I will be honest here. Before attempting this solution, I have tried a ML approach which only scored ~78%, and in that approach I found that Age is a very important predictor of survivalship. Given the provided Age is not the actual age of the passenger, now I feel that some of the significance of the Age field may have been engineered into this dataset by kaggle staff.

In [71]:
dataset = pd.concat([train, test])
print(dataset.shape)
dataset.head()

(1309, 16)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,fname,prefix,gname,fsname
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr,Owen Harris,Owen
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs,John Bradley (Florence Briggs Thayer),Florence
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss,Laina,Laina
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs,Jacques Heath (Lily May Peel),Lily
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr,William Henry,William


In [72]:
fails_count = 0
srved = ground_truth

for i, item in dataset.iterrows():
    if (not np.isnan(item.Survived)) and int(item.Survived) == 0:
        continue
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        True
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            True
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                True    
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

failed at given name Charles, indix 17, matched 2
failed at family name Masselmani, indix 19, matched 0
failed at family name ODriscoll, indix 47, matched 0
failed at family name Sheerlinck, indix 81, matched 0
failed at family name Peter, indix 128, matched 0
failed at given name August, indix 146, matched 0
failed at given name Margaret, indix 194, matched 0
failed at given name Katherine, indix 241, matched 0
failed at family name Bissette, indix 269, matched 0
failed at family name Castellana, indix 307, matched 0
failed at family name Bratthammer, indix 444, matched 0
failed at given name Eden, indix 489, matched 0
failed at given name George, indix 507, matched 0
failed at family name Peter, indix 533, matched 0
failed at given name John, indix 572, matched 2
failed at given name Ali, indix 692, matched 0
failed at family name Mullens, indix 697, matched 0
failed at family name Lesurer, indix 737, matched 0
failed at given name Susan, indix 742, matched 0
failed at family name Ba

# What have I learned from this?

The majority of the missed counts are due to typos in the training set. I am not sure if those typos are intentional planted there or not. The training set should not be taken as the facts, as I have encountered plenty of inconsistent age values. It is possible to get 100% correct on this, but I don't think it is worth the effort, so I am not trying to improve my score further.

155 family names and 70 surnames have non-ascii characters in them, and converting these chars accounts for most of my effort in this problem. There are many ways to convert accented chars to latin chars. For this particular dataset, scraping the url (ascii by the standard) link works better than using the uniencode package.

# Predictions

In [73]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,fname,prefix,gname,fsname
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,Kelly,Mr,James,James
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,Wilkes,Mrs,James (Ellen Needs),Ellen
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,Myles,Mr,Thomas Francis,Thomas
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,Wirz,Mr,Albert,Albert
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,Hirvonen,Mrs,Alexander (Helga E Lindqvist),Helga


In [74]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,fname,prefix,gname,fsname
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,Kelly,Mr,James,James
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,Wilkes,Mrs,James (Ellen Needs),Ellen
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,Myles,Mr,Thomas Francis,Thomas
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,Wirz,Mr,Albert,Albert
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,Hirvonen,Mrs,Alexander (Helga E Lindqvist),Helga


In [75]:
test['survived'] = None

srved = ground_truth[ground_truth.survived == 1]
fails_count = 0

for i, item in test.iterrows():
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        test.survived.at[i] = 1
    #    print('\u2713')
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            test.survived.at[i] = 1
    #        print('\u2713')
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                test.survived.at[i] = 1
    #            print('\u2713')
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

failed at family name Myles, indix 2, matched 0
failed at family name Wirz, indix 3, matched 0
failed at family name Ilieff, indix 10, matched 0
failed at family name Ilmakangas, indix 18, matched 0
failed at family name Khalil, indix 19, matched 0
failed at family name Robins, indix 25, matched 0
failed at family name Brady, indix 28, matched 0
failed at family name Samaan, indix 29, matched 0
failed at family name Jefferys, indix 31, matched 0
failed at family name Katavelas, indix 35, matched 0
failed at family name Cacic, indix 37, matched 0
failed at family name Franklin, indix 41, matched 0
failed at family name Corbett, indix 43, matched 0
failed at family name Peltomaki, indix 45, matched 0
failed at family name Shaughnessy, indix 47, matched 0
failed at family name Pulbaum, indix 51, matched 0
failed at family name Mangiavacchi, indix 54, matched 0
failed at family name Cor, indix 56, matched 0
failed at family name Dika, indix 60, matched 0
failed at family name McCrae, indix

In [76]:
srved = ground_truth[ground_truth.survived == 0]
fails_count = 0

for i, item in test.iterrows():
    mask_lastname = [item.fname.upper()==itemx for itemx in srved['family name']]
    how_many = sum(mask_lastname)
    if how_many == 1:
        test.survived.at[i] = 0
    #    print('\u2713')
    elif how_many > 1:
        mask_prefix = [item.prefix == itemx for itemx in srved['prefix']]
        mask_ = np.array(mask_lastname) & np.array(mask_prefix)
        how_many = sum(mask_)
        if how_many == 1:
            test.survived.at[i] = 0
    #        print('\u2713')
        elif how_many > 1:
            mask_fstname = [item.fsname == itemx for itemx in srved['fsname']]
            mask__ = np.array(mask_fstname) & np.array(mask_)
            how_many = sum(mask__)
            if how_many == 1:
                test.survived.at[i] = 0
    #            print('\u2713')
            else:
                fails_count += 1
                print(f'failed at given name {item.fsname}, indix {i}, matched {how_many}')
    else:
        fails_count += 1
        print(f'failed at family name {item.fname}, indix {i}, matched {how_many}')

print(f'{fails_count} failed')

failed at family name Wilkes, indix 1, matched 0
failed at family name Hirvonen, indix 4, matched 0
failed at family name Connolly, indix 6, matched 0
failed at family name Caldwell, indix 7, matched 0
failed at family name Abrahim, indix 8, matched 0
failed at family name Ilieff, indix 10, matched 0
failed at family name Snyder, indix 12, matched 0
failed at family name Assaf, indix 17, matched 0
failed at family name Flegenheim, indix 22, matched 0
failed at given name Richard, indix 23, matched 0
failed at family name Mock, indix 34, matched 0
failed at family name Katavelas, indix 35, matched 0
failed at family name Roth, indix 36, matched 0
failed at family name Sap, indix 38, matched 0
failed at family name Hee, indix 39, matched 0
failed at family name Karun, indix 40, matched 0
failed at family name Kimball, indix 44, matched 0
failed at family name Chevre, indix 46, matched 0
failed at family name Bucknell, indix 48, matched 0
failed at family name Coutts, indix 49, matched 0


In [77]:
test.survived.isna().sum()

33

# Filling in missing values

We failed to locate 33 passengers in the test set. We are not going to manually check those 33 names, although it is possible. We are going to fillin the most probable survival status for these 33 passengers, which is '0' based on the training set. 

At the very end, we used some statistics skills to fill in NA values with their most probable outcome. Now, I am feeling a bit better now. All those years studying statistics are not useless after all!

In [78]:
pd.value_counts(train.Survived)

0    549
1    342
Name: Survived, dtype: int64

In [79]:
test.survived.fillna(0, inplace=True)

In [80]:
result = pd.DataFrame([test.PassengerId, test.survived]).T
result.astype({'PassengerId': 'int32', 'survived': 'int32'})
result.to_csv('submit.csv', index=False)

# End

I hope you like this solution. All the scripts, including this notebook, and outputs with the exception of the final predictions can be found on my [GitHub repo](https://github.com/HVoltBb/misc/src/kaggle/titanic/).

Let me know if you learned something new!!!