Thanks for **UPVOTING** this kernel! Trying to become a Kernels Master. 🤘

Check out my other cool projects:
- [💲 Minimizing investment risk for high interest loans](https://www.kaggle.com/pavlofesenko/minimizing-investment-risk-for-high-interest-loans)
- [📊 Interactive Titanic dashboard using Bokeh](https://www.kaggle.com/pavlofesenko/interactive-titanic-dashboard-using-bokeh)
- [👪 Titanic extended dataset (Kaggle + Wikipedia)](https://www.kaggle.com/pavlofesenko/titanic-extended)

# 1. Introduction

This project was inspired by [this post of Salamati](https://www.kaggle.com/c/titanic/discussion/73133#441915) from the Kaggle forum:

> Hello, I did a project on the Titanic Data. I need to merge the titanic data with some other data sets. I was looking for other features like Demographic of the lifeboats, passengers' employment category, and passenger nationality, etc to merge it to the Titanic based on the passenger ID. I was not able to find any data to merge with the Titanic so far. Do you think if these data sets or other data sets are available that I can add it to the Titanic based on the passenger ID. If not, is there any other data sets on the Kaggle that I can merge with other data sets based on the specific column. I have to merge 2 or three data sets based on the specific column and then do my analyses. I would very appreciate for any help/suggestion on this matter. Thank you,

The current [Kaggle Titanic dataset](https://www.kaggle.com/c/titanic/data) is based on the dataset that was assembled by Thomas E. Cason, an undergraduate research assistant at the University of Virginia (more info [here](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)). It was created using the data from the [*Encyclopedia Titanica*](https://www.encyclopedia-titanica.org/titanic-passenger-list/) available as of 2 August 1999. Note that the original Titanic dataset contained 3 extra features: `boat` (Lifeboat), `body` (Body Identification Number), `home.dest` (Home/Destination). The first 2 features were obviously not included in the Kaggle dataset since they immediately reveal whether the person survived. The last feature was probably left out because of its many missing values. It does, however, contain some interesting information, such as the home country, that can be used to engineer new features, for example, nationality.

<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Wikipedia-logo-en-big.png" width="200" align="right"/>

Besides Kaggle, there is also the [Titanic passenger list on Wikipedia](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic). It is divided into 3 tables according to class, and each table contains the following columns: `Name`, `Age`, `Hometown`, `Boarded`, `Destination`, `Lifeboat`, `Body`. The survivors are highlighted in blue and the victims in white. This list was created using the data from the [*Encyclopedia Titanica*](https://www.encyclopedia-titanica.org/titanic-passenger-list/) that was retrieved in 2011 and from a couple of other sources. The big advantage of the Wikipedia list is that it has almost no missing values. Therefore, it would be a great source to complement the existing Kaggle Titanic dataset.

I also checked the [*Encyclopedia Titanica*](https://www.encyclopedia-titanica.org/titanic-passenger-list/), which has the most extensive database of the Titanic passengers to date. Its drawback, however, is that the Encyclopedia Titanica doesn't allow its materials to be copied for public use.

Therefore, I decided to merge the Kaggle and Wikipedia databases and use the Encyclopedia Titanica only to distinguish the passengers with difficult names from the previous two databases. In this way I won't be violating the copyright of the Encyclopedia Titanica.

Throughout this notebook I will be using only two Python libraries: `pandas` and `unidecode`. The latter is very convenient to convert special characters into ASCII equivalents (for example, it converts `é` to `e`).

In [None]:
import pandas as pd
from unidecode import unidecode
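
To give a feel for the kind of conversion described above without requiring the third-party package, here is a rough standard-library approximation. It is not `unidecode` itself: the helper `strip_accents` is a hypothetical name, and it only removes combining accents after Unicode decomposition. Characters with no decomposition (such as `ø`) pass through unchanged, whereas `unidecode` is designed to transliterate those as well.

```python
import unicodedata


def strip_accents(text):
    """Rough stand-in for unidecode: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Names taken from the example mentioned later in this notebook
print(strip_accents('Björnström-Steffanson'))
print(strip_accents('Håkan'))
```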

# 2. Importing Kaggle dataset

I start by importing the training and testing datasets from the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic/data).

In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')

kagg = pd.concat([train, test], sort=False)
kagg.head()

# 3. Importing Wikipedia dataset

To get the [Titanic passenger list on Wikipedia](https://en.wikipedia.org/wiki/Passengers_of_the_RMS_Titanic), I scrape it using the function `pd.read_html()`. It returns all tables from the web page as a list of DataFrames. There are 3 DataFrames that correspond to 3 passenger classes and are assigned to the variables `wiki1`, `wiki2`, `wiki3`. For reproducibility I used the permanent link to the [version of 18 February 2019](https://en.wikipedia.org/w/index.php?title=Passengers_of_the_RMS_Titanic&oldid=883859055) because Wikipedia constantly updates its content. The parameter `encoding='unicode'` is used here because some of the passenger names contain non-ASCII characters.

In [None]:
url = 'https://en.wikipedia.org/w/index.php?title=Passengers_of_the_RMS_Titanic&oldid=883859055'

# Downloading the page once and picking the tables for the 3 passenger classes
tables = pd.read_html(url, header=0, encoding='unicode')
wiki1, wiki2, wiki3 = tables[1], tables[2], tables[3]

Let's have a look at the table for the 1st class (see below).

Note that after some of the names there are rows starting with "and &lt;profession&gt;, &lt;Title&gt; &lt;Name&gt; &lt;Surname&gt;". These rows correspond to the servants who were travelling with some of the families from the 1st class.

Sometimes rows have a reference. For example, [59] says:

> Though their employers travelled in first class, this servant was given second class accommodations, as their services were not needed while their employers were on board.

Note also that `Body` contains some letters after the body number. These are references for the name of the ship that found the body. For example, "MB" means "Mackay-Bennett".

For the extended Titanic dataset I will leave everything as it is to preserve as much information as possible.

In [None]:
wiki1.head()

I also add the column `Class` to the DataFrame `wiki1` and assign it to 1. This is not always true (see the reference 59 above) but it can always be taken into account later.

In [None]:
wiki1['Class'] = 1
wiki1.head()

The table for the 2nd class (see below) has the same structure as the one for the 1st class. Note, however, that `Hometown` sometimes doesn't have a town and only the country is mentioned. Also, the countries are not always consistent. For example, in the table for the 1st class "England" is followed by "UK" (see row 3 above) but in the table for the 2nd class (see rows 3-4 below) only "England" is mentioned. The country is also sometimes missing in the column `Destination` (see, for example, row 4 below).

In [None]:
wiki2.head()

Just like with the previous DataFrame I add the column `Class` but assign it to 2.

In [None]:
wiki2['Class'] = 2
wiki2.head()

The table for the 3rd class (see below) has an additional column `Home country` apart from the column `Hometown`.

In [None]:
wiki3.head()

I join the columns `Hometown` and `Home country` so that all 3 tables have the same structure and can be later concatenated. Just like with the previous DataFrames I add the column `Class` but assign it to 3.

In [None]:
wiki3['Hometown'] = wiki3['Hometown'] + ', ' + wiki3['Home country']
wiki3 = wiki3.drop('Home country', axis=1, errors='ignore')

wiki3['Class'] = 3

wiki3.head()

Finally, I concatenate all 3 tables and for convenience reset the index using the parameter `ignore_index=True`.

In [None]:
wiki = pd.concat([wiki1, wiki2, wiki3], ignore_index=True)
wiki.head()

# 4. Preparing Kaggle dataset for merge

In order to merge the Kaggle and Wikipedia datasets, I need one or more matching columns, for example, `Name`. But if you compare the column `Name` in both datasets, you will see that the names are often spelled differently, for example, "Eli**z**abeth" on Wikipedia (see row 1 above) and "Eli**s**abeth" on Kaggle (see row 1 below). So `Name` is not a good matching column. However, I can use it to generate other matching columns, for example, the first 3 letters of a surname and the first 3 letters of a name. This should be precise enough for most of the cases and will hopefully avoid misspelling issues. In case of duplicates I can use the column `Age` to differentiate between them.

In [None]:
kagg.sort_values(['Pclass', 'Name']).reset_index(drop=True).head()

Before I start extracting surname and name codes (the first 3 letters), note that in the Kaggle dataset the title of "Mrs." is followed first by the name of the husband and only then in parentheses by the wife's actual name (see row 4 above). Therefore, I need to extract surname and name codes for passengers with the title "Mrs." separately. For this purpose I use regular expressions (see more in the [Python regex documentation](https://docs.python.org/3/howto/regex.html)). The new DataFrame will be called `kagg_corr` and for convenience I sort it by `Pclass` and `Name`.

The rows that contain "Mrs." in the column `Name` are matched using the first regular expression. Matched groups `(?P<Surname>.*)`, `(?P<Husband_name>.*)` and `(?P<Wife_name>.*)` are extracted as columns in the DataFrame `temp`. It is important not to leave any space between `(?P<Husband_name>.*)` and `\((?P<Wife_name>.*)\)` because sometimes the husband's name is missing and the matching won't work with the extra space between them.

Afterwards all rows are matched using the second regular expression. This time the named groups `(?P<Surname>.*)`, `(?P<Title>.*)` and `(?P<Name>.*)` are extracted as columns in the DataFrame `temp2`. In this case the rows that contain "Mrs." will be matched wrongly but that's fine because they have already been correctly matched using the previous regular expression.

Note that the named groups `(?P<Husband_name>.*)` and `(?P<Title>.*)` can be simply replaced by `.*` since they are not used later on. Nevertheless, I included them for clarity.

Due to inconsistencies in compound surnames, for example, "van Billiard" in the Kaggle dataset and "Van Billiard" in the Wikipedia dataset, I transform all surnames/names to begin with a capital letter using the built-in method `title()`.

In [None]:
kagg_corr = kagg.copy()

# Sorting by class and name
kagg_corr = kagg_corr.sort_values(['Pclass', 'Name']).reset_index(drop=True)

# Extracting surnames and names using regular expressions
temp = kagg_corr.Name.str.extract(r'(?P<Surname>.*), Mrs\. (?P<Husband_name>.*)\((?P<Wife_name>.*)\)')
temp2 = kagg_corr.Name.str.extract(r'(?P<Surname>.*), (?P<Title>.*)\. (?P<Name>.*)')

# Adding Kaggle surname codes
surname = temp.Surname
surname2 = temp2.Surname
surname = surname.fillna(surname2)
surname = surname.str.title()
surname_code = surname.str[0:3]
kagg_corr['Surname_code'] = surname_code

# Adding Kaggle name codes
name = temp.Wife_name
name2 = temp2.Name
name = name.fillna(name2)
name = name.str.title()
name_code = name.str[0:3]
kagg_corr['Name_code'] = name_code

kagg_corr.head()
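
As a small illustration of how the two regular expressions behave on individual strings, here is a sketch that applies the same patterns with the standard `re` module instead of `Series.str.extract`. The helper name `kaggle_codes` and the sample strings are only illustrative (the names follow the Kaggle format), and the fallback order mimics the `fillna()` logic in the cell above.

```python
import re

# Same patterns as in the cell above
MRS_PATTERN = re.compile(r'(?P<Surname>.*), Mrs\. (?P<Husband_name>.*)\((?P<Wife_name>.*)\)')
GENERAL_PATTERN = re.compile(r'(?P<Surname>.*), (?P<Title>.*)\. (?P<Name>.*)')


def kaggle_codes(full_name):
    """Return (surname_code, name_code) for one Kaggle-style name (hypothetical helper)."""
    match = MRS_PATTERN.search(full_name)
    if match:
        # Married women: use the wife's own name from the parentheses
        surname, name = match.group('Surname'), match.group('Wife_name')
    else:
        match = GENERAL_PATTERN.search(full_name)
        if match is None:
            return None, None
        surname, name = match.group('Surname'), match.group('Name')
    # title() evens out the capitalization, e.g. "van Billiard" -> "Van Billiard"
    return surname.title()[:3], name.title()[:3]


for sample in ['Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
               'Braund, Mr. Owen Harris',
               'van Billiard, Mr. Austin Blyler']:
    print(sample, '->', kaggle_codes(sample))
```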

# 5. Preparing Wikipedia dataset for merge

Just like with the Kaggle dataset, it is useful to have a column with a unique ID for each passenger. Let's name this column `WikiId` and insert it at the beginning of the dataset using the method `insert()`.

Then I correct a typo in row 339 that has "." after the surname instead of ",".

I also replace the missing values in the Wikipedia dataset, which are written as "–" (an en dash), with `NaN`.

Afterwards I use the same strategy as in the Kaggle dataset by matching the name strings with two regular expressions. This time the first regular expression is used to match the servants that have a different name structure starting with "and ...". Note that the group `(?P<Title>.*?)` now has `?` at the end, which indicates non-greedy matching. This is important for the cases when the title doesn't end with ".", for example, "Miss" (unlike the Kaggle dataset where all titles end with ".", even "Miss."). For the same reason `(?P<Title>.*?)` is followed by `\.*` where `*` indicates that "." might not be present at all.

Here I also convert special characters into their ASCII equivalents using the function `unidecode()`. Unlike the Kaggle dataset, the Wikipedia dataset has a lot of these characters, especially in Scandinavian names (for example, "Björnström-Steffanson, Mr. Mauritz Håkan").

Since I will use the column `Age` to help resolve duplicates, I need to change its type to `float` just like in the Kaggle dataset. This column, however, gives the age of babies in months (for example, "11 mo."), so the number of months is first extracted using the matched group `(?P<Months>\d*)` and divided by 12. The rows without a match (that is, ages already given in years) are then filled with the original values from `Age` and the type is changed to `float`. The result is rounded to 2 decimal digits so that the format is consistent with the Kaggle dataset.

Since both the Kaggle and Wikipedia datasets have the columns `Name`, `Age`, `Surname_code` and `Name_code`, let's add the suffix `_wiki` to the features from the Wikipedia dataset in order to distinguish them.

In [None]:
wiki_corr = wiki.copy()

# Adding WikiId
wiki_corr.insert(0, 'WikiId', wiki_corr.index + 1)

# Correcting a typo
wiki_corr.loc[339, 'Name'] = 'Beane, Mrs. Ethel (née Clarke)'

# Replacing – (en dash) with NaN
wiki_corr = wiki_corr.replace('–', float('nan'))

# Extracting surnames and names using regular expressions
temp = wiki_corr.Name.str.extract(r'and (?P<Profession>.*), (?P<Title>.*?)\.* (?P<Name>.*) (?P<Surname>.*)')
temp2 = wiki_corr.Name.str.extract(r'(?P<Surname>.*), (?P<Title>.*?)\.* (?P<Name>.*)')

# Adding Wikipedia surname codes
surname = temp.Surname
surname2 = temp2.Surname
surname = surname.fillna(surname2)
surname = surname.str.title()
surname = surname.apply(unidecode)
surname_code = surname.str[0:3]
wiki_corr['Surname_code'] = surname_code

# Adding Wikipedia name codes
name = temp.Name
name2 = temp2.Name
name = name.fillna(name2)
name = name.str.title()
name = name.apply(unidecode)
name_code = name.str[0:3]
wiki_corr['Name_code'] = name_code

# Converting age type to float
months = wiki_corr.Age.str.extract(r'(?P<Months>\d*) mo.', expand=False).astype('float64')
age = months / 12.0
age = age.fillna(wiki_corr.Age)
wiki_corr['Age'] = age.astype('float64').round(2)

# Adding Wikipedia suffixes
wiki_corr = wiki_corr.rename(columns={'Name': 'Name_wiki', 'Age': 'Age_wiki', 'Surname_code': 'Surname_code_wiki', 'Name_code': 'Name_code_wiki'})

wiki_corr.head()
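
To make the non-greedy title group and the month-to-year conversion more concrete, here is a sketch that applies the same patterns to single strings with the standard `re` module. The helpers `wiki_name_parts` and `age_in_years` are hypothetical names, and the servant string is only an illustrative example of the "and &lt;profession&gt;, &lt;Title&gt; &lt;Name&gt; &lt;Surname&gt;" structure described earlier.

```python
import re

# Same patterns as in the cell above
SERVANT_PATTERN = re.compile(r'and (?P<Profession>.*), (?P<Title>.*?)\.* (?P<Name>.*) (?P<Surname>.*)')
GENERAL_PATTERN = re.compile(r'(?P<Surname>.*), (?P<Title>.*?)\.* (?P<Name>.*)')
MONTHS_PATTERN = re.compile(r'(?P<Months>\d*) mo.')


def wiki_name_parts(full_name):
    """Return (title, name, surname), trying the servant pattern first."""
    match = SERVANT_PATTERN.search(full_name) or GENERAL_PATTERN.search(full_name)
    return match.group('Title'), match.group('Name'), match.group('Surname')


def age_in_years(age_text):
    """Convert an age such as '11 mo.' or '29' to years, rounded to 2 digits."""
    match = MONTHS_PATTERN.search(age_text)
    years = int(match.group('Months')) / 12.0 if match else float(age_text)
    return round(years, 2)


print(wiki_name_parts('and maid, Miss Rosalie Bidois'))             # title without "."
print(wiki_name_parts('Björnström-Steffanson, Mr. Mauritz Håkan'))  # title with "."
print(age_in_years('11 mo.'), age_in_years('29'))
```

Because the title group is non-greedy and followed by `\.*`, the trailing "." is never captured as part of the title, whether or not it is present.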

# 6. Merging Kaggle and Wikipedia datasets

Now the Kaggle and Wikipedia datasets are ready to be merged. I will do it in several stages in order to maximize the number of passengers that are matched automatically.

## 6.1. Merging datasets using surname codes, name codes and age

During the first stage, the datasets are merged using surname codes, name codes and age.

Then the function `merge_report()` counts all duplicates either in `PassengerId` or `WikiId` and prints the report that contains the number of matched passengers, the number of duplicates, the number of unmatched passengers in comparison with the original Kaggle dataset and the total number of passengers in the original Kaggle dataset.

After the first stage more than half of the passengers are matched automatically.

In [None]:
merg = pd.merge(kagg_corr, wiki_corr, left_on=['Surname_code', 'Name_code', 'Age'],
                right_on=['Surname_code_wiki', 'Name_code_wiki', 'Age_wiki'])

def merge_report(df, kagg):
    dupl = df.PassengerId.duplicated(keep=False)
    dupl2 = df.WikiId.duplicated(keep=False)
    dupl_num = (dupl | dupl2).sum()
    print(f'''Matched: {df.shape[0]} ({dupl_num} duplicates)
Unmatched: {kagg.shape[0] - df.shape[0]}
Total: {kagg.shape[0]}''')
    
merge_report(merg, kagg_corr)
merg.head()

To deal with the duplicates I define the function `dupl_drop()` that removes duplicates in both the `PassengerId` and `WikiId` columns. Running `merge_report()` again confirms that the duplicates are gone.

In [None]:
def dupl_drop(df):
    df_corr = df.drop_duplicates(subset=['PassengerId'], keep=False)
    df_corr = df_corr.drop_duplicates(subset=['WikiId'], keep=False)
    return df_corr

merg_corr = dupl_drop(merg)

merge_report(merg_corr, kagg_corr)
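
The effect of `keep=False` is easiest to see on a tiny example. The sketch below uses made-up toy data (all IDs and codes are hypothetical) in which two Kaggle rows match the same Wikipedia row. Since there is no way to tell which of the two is correct, both ambiguous rows are dropped and left for a later stage.

```python
import pandas as pd

# Toy data: PassengerId 1 and 2 share the same codes and age
kagg_toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Surname_code': ['Smi', 'Smi', 'Bro'],
    'Name_code': ['Joh', 'Joh', 'Ann'],
    'Age': [30.0, 30.0, 25.0],
})
wiki_toy = pd.DataFrame({
    'WikiId': [10, 11],
    'Surname_code_wiki': ['Smi', 'Bro'],
    'Name_code_wiki': ['Joh', 'Ann'],
    'Age_wiki': [30.0, 25.0],
})

merged = pd.merge(kagg_toy, wiki_toy,
                  left_on=['Surname_code', 'Name_code', 'Age'],
                  right_on=['Surname_code_wiki', 'Name_code_wiki', 'Age_wiki'])

# Rows that are ambiguous on either side (same logic as merge_report)
ambiguous = merged.PassengerId.duplicated(keep=False) | merged.WikiId.duplicated(keep=False)

# keep=False drops every member of a duplicated group, not just the repeats
clean = merged.drop_duplicates(subset=['PassengerId'], keep=False)
clean = clean.drop_duplicates(subset=['WikiId'], keep=False)

print(ambiguous.sum(), 'ambiguous rows;', len(clean), 'row kept')
```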

## 6.2. Merging the unmatched passengers using surname codes and name codes

During the next stage, I try to merge the unmatched passengers using only surname codes and name codes, which is a weaker criterion. Since more than half of the passengers are already matched, it is very likely that the remaining passengers will have significantly fewer duplicates and will be matched as well.

I first obtain the remaining Kaggle and Wikipedia passengers using the newly defined function `df_rest()` and then merge these two remaining datasets. The duplicates are removed using the previously defined function `dupl_drop()` and the function `merge_report()` summarizes the result.

This approach turned out to be very effective, leaving only 178 unmatched passengers.

In [None]:
def df_rest(df, kagg, wiki):
    kagg_rest = kagg[~kagg.PassengerId.isin(df.PassengerId)].copy()
    wiki_rest = wiki[~wiki.WikiId.isin(df.WikiId)].copy()
    return kagg_rest, wiki_rest

kagg_rest, wiki_rest = df_rest(merg_corr, kagg_corr, wiki_corr)

merg_rest = pd.merge(kagg_rest, wiki_rest, left_on=['Surname_code', 'Name_code'],
                     right_on=['Surname_code_wiki', 'Name_code_wiki'])
merg_rest = dupl_drop(merg_rest)

merge_report(merg_rest, kagg_rest)
merg_rest.head()

## 6.3. Merging the unmatched passengers using surname codes and age

During the next stage, I try to match the remaining passengers using the surname codes and age, which is an even weaker criterion than the previous one.

This approach reduces the number of unmatched passengers even further, down to 145.

In [None]:
kagg_rest2, wiki_rest2 = df_rest(merg_rest, kagg_rest, wiki_rest)

merg_rest2 = pd.merge(kagg_rest2, wiki_rest2, left_on=['Surname_code', 'Age'],
                      right_on=['Surname_code_wiki', 'Age_wiki'])
merg_rest2 = dupl_drop(merg_rest2)

merge_report(merg_rest2, kagg_rest2)
merg_rest2.head()

## 6.4. Merging the unmatched passengers using surname codes

During the next stage I try to merge the unmatched passengers using only the surname codes, which is the weakest criterion of all.

Ultimately I decreased the number of unmatched passengers down to 113.

In [None]:
kagg_rest3, wiki_rest3 = df_rest(merg_rest2, kagg_rest2, wiki_rest2)

merg_rest3 = pd.merge(kagg_rest3, wiki_rest3, left_on=['Surname_code'],
                      right_on=['Surname_code_wiki'])
merg_rest3 = dupl_drop(merg_rest3)

merge_report(merg_rest3, kagg_rest3)
merg_rest3.head()
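
The four automatic stages above follow one pattern: merge on a set of keys, drop the ambiguous rows, and pass whatever remains to the next, weaker set of keys. As a compact sketch of that idea (not a replacement for the cells above), the loop below runs the same cascade on made-up toy data. The function `staged_merge` is a hypothetical helper and all IDs and codes are invented for illustration.

```python
import pandas as pd


def dupl_drop(df):
    # Drop every row that is ambiguous on either side
    df = df.drop_duplicates(subset=['PassengerId'], keep=False)
    return df.drop_duplicates(subset=['WikiId'], keep=False)


def staged_merge(kagg, wiki, key_sets):
    """Merge on progressively weaker keys; return matches and both remainders."""
    matched = []
    for keys in key_sets:
        step = pd.merge(kagg, wiki, left_on=keys,
                        right_on=[f'{key}_wiki' for key in keys])
        step = dupl_drop(step)
        matched.append(step)
        # Only passengers that are still unmatched go to the next stage
        kagg = kagg[~kagg.PassengerId.isin(step.PassengerId)]
        wiki = wiki[~wiki.WikiId.isin(step.WikiId)]
    return pd.concat(matched, ignore_index=True), kagg, wiki


kagg_toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Surname_code': ['Smi', 'Bro', 'Lee', 'Kha', 'Xyz'],
    'Name_code': ['Joh', 'Ann', 'Kim', 'Ali', 'Bob'],
    'Age': [30.0, 25.0, 40.0, 22.0, 50.0],
})
wiki_toy = pd.DataFrame({
    'WikiId': [10, 11, 12, 13, 14],
    'Surname_code_wiki': ['Smi', 'Bro', 'Lee', 'Kha', 'Qqq'],
    'Name_code_wiki': ['Joh', 'Ann', 'Sam', 'Oma', 'Bob'],
    'Age_wiki': [30.0, 26.0, 40.0, 23.0, 50.0],
})

key_sets = [['Surname_code', 'Name_code', 'Age'],
            ['Surname_code', 'Name_code'],
            ['Surname_code', 'Age'],
            ['Surname_code']]

matched, kagg_left, wiki_left = staged_merge(kagg_toy, wiki_toy, key_sets)
print(matched[['PassengerId', 'WikiId']])
print('Unmatched PassengerId:', kagg_left.PassengerId.tolist())
```

In this toy run each of the first four passengers is picked up by a different stage, and the last one is left for manual matching.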

## 6.5. Merging the unmatched passengers manually

During the last stage I merge the remaining 113 unmatched passengers manually. To do so, I sort the datasets with unmatched passengers by name, save them as CSV files and match the passengers, for example, by inspecting their names in Excel. Most of the passenger names are easy to match. In some difficult cases, however, I had to type the ticket number from the Kaggle dataset into the Encyclopedia Titanica to find out the real name of a passenger.

In [None]:
kagg_rest4, wiki_rest4 = df_rest(merg_rest3, kagg_rest3, wiki_rest3)

kagg_rest4 = kagg_rest4.sort_values('Name')
wiki_rest4 = wiki_rest4.sort_values('Name_wiki')

kagg_rest4.to_csv('kagg_rest4.csv', index=False)
wiki_rest4.to_csv('wiki_rest4.csv', index=False)

The matched values of `WikiId` are presented in the list `wiki_id_match` and are added as an extra column to the DataFrame `kagg_rest4_corr` which is a copy of `kagg_rest4`.

In [None]:
wiki_id_match = [622,float('nan'),822,661,662,671,672,670,669,667,852,1188,960,311,
                 697,698,853,696,741,717,720,float('nan'),1002,402,789,791,float('nan'),
                 float('nan'),798,1205,603,722,804,319,802,859,981,434,879,961,1202,
                 1308,902,908,612,629,989,948,float('nan'),1021,999,1000,1053,1006,
                 1008,1027,float('nan'),1045,1046,1044,1311,1057,1055,1056,519,1072,
                 1071,1075,1082,1084,782,1085,229,668,1139,702,703,float('nan'),1137,
                 1138,552,555,553,312,float('nan'),183,1063,1181,1182,1190,1189,1191,
                 1199,1222,1223,270,275,1250,1248,1249,1059,1265,893,602,604,181,1310,
                 1291,1309,607,884,885,354]

kagg_rest4_corr = kagg_rest4.reset_index(drop=True).copy()
kagg_rest4_corr['WikiId'] = wiki_id_match
kagg_rest4_corr.head()
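
Because the list is assigned by position, it only works if it has exactly one entry per row of the sorted `kagg_rest4` and if no `WikiId` is used twice. A quick sanity check along these lines could catch a slip before the merge. This is a sketch: the helper `check_manual_ids` is a hypothetical name and the short list below is toy data, not the real one.

```python
import math


def check_manual_ids(ids, n_rows):
    """Return the number of missing matches; raise if the list is malformed."""
    if len(ids) != n_rows:
        raise ValueError(f'expected {n_rows} IDs, got {len(ids)}')
    present = [i for i in ids if not (isinstance(i, float) and math.isnan(i))]
    if len(present) != len(set(present)):
        raise ValueError('a WikiId is assigned to more than one passenger')
    return len(ids) - len(present)


toy_ids = [622, float('nan'), 822, 661]
missing = check_manual_ids(toy_ids, n_rows=4)
print(missing, 'passenger(s) without a match')
```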

Then I merge the two datasets based on `WikiId`. Unfortunately, 8 passengers couldn't be matched, but I will have a closer look at them in the next section.

In [None]:
merg_rest4 = pd.merge(kagg_rest4_corr, wiki_rest4, on=['WikiId'])

merge_report(merg_rest4, kagg_rest4)
merg_rest4.head()

For convenience I concatenate all matched passengers in one DataFrame `merg_all`.

In [None]:
merg_all = pd.concat([merg_corr, merg_rest, merg_rest2, merg_rest3, merg_rest4],
                     ignore_index=True)

merge_report(merg_all, kagg_corr)

## 6.6. Inspecting the unmatched passengers

Let's inspect the unmatched passengers from the Kaggle dataset (8 passengers) and from the Wikipedia dataset (13 passengers).

In [None]:
kagg_rest5, wiki_rest5 = df_rest(merg_all, kagg_corr, wiki_corr)

kagg_rest5

In [None]:
wiki_rest5

By manually inspecting each unmatched name from the Kaggle dataset and looking for it in the Wikipedia dataset, I discovered several matching mistakes. For example, the name "Peters, Miss. Katie" (see the last row in the above Kaggle dataset) is actually present in the Wikipedia dataset (see below) but wasn't matched for some reason.

In [None]:
wiki_corr[wiki_corr.Name_wiki.str.contains('Peters')]

To understand the issue, I looked for `WikiId=1128` in `merg_all`, the final DataFrame with all matched passengers. It turns out that these passengers were indeed incorrectly matched because their surname and name codes are the same: "Pet"/"Cat". Moreover, the incorrect match from the Kaggle dataset ("Peter, Mrs. Catherine (Catherine Rizk)") actually corresponds to one of the unmatched passengers from the Wikipedia dataset ("Butrus-Youssef, Mrs. Katarin (née Rizk)") but is just spelled differently.

In [None]:
merg_all[merg_all.WikiId == 1128]

After inspecting all the unmatched passengers from the Kaggle dataset, the following manipulations should be performed to correct the mistakes:

- Drop the rows with the following `PassengerId`: 1146, 569, 534, 919, 508
- Merge the following pairs `PassengerId-WikiId`: 147-1293, 1146-980, 6-785, 681-1128, 534-701, 569-750, 919-1203, 508-41

The result of these manipulations looks correct.

In [None]:
# Dropping the incorrectly matched passengers
wrong_ids = [1146, 569, 534, 919, 508]
merg_all_corr = merg_all[~merg_all.PassengerId.isin(wrong_ids)]

kagg_rest5_corr, wiki_rest5_corr = df_rest(merg_all_corr, kagg_corr, wiki_corr)

# Assigning the manually verified PassengerId -> WikiId pairs (NaN for the rest)
manual_pairs = {147: 1293, 1146: 980, 6: 785, 681: 1128,
                534: 701, 569: 750, 919: 1203, 508: 41}
kagg_rest5_corr['WikiId'] = kagg_rest5_corr.PassengerId.map(manual_pairs)

merg_rest5 = pd.merge(kagg_rest5_corr, wiki_rest5_corr, on=['WikiId'])
merg_rest5

Then the corrected dataset for all merged passengers `merg_all2` is obtained.

In [None]:
merg_all2 = pd.concat([merg_all_corr, merg_rest5], ignore_index=True)

merge_report(merg_all2, kagg_corr)

Below are the final lists of unmatched passengers from the Kaggle dataset `kagg_rest6` and from the Wikipedia dataset `wiki_rest6`. The 5 unmatched passengers from the Kaggle dataset can be found on the Encyclopedia Titanica but for some reason are absent from the Wikipedia dataset. This is a bit strange because the Kaggle dataset is based on the version of the Encyclopedia Titanica retrieved in 1999 while the Wikipedia one is based on the version from 2011.

Notice also that most of the unmatched passengers from the Wikipedia dataset have the reference [60] that says:

> "See the list of crew members on board RMS Titanic article for further information."

These passengers were crew members, which is probably why they aren't mentioned in the Kaggle dataset.

In [None]:
kagg_rest6, wiki_rest6 = df_rest(merg_all2, kagg_corr, wiki_corr)
kagg_rest6

In [None]:
wiki_rest6

Finally, the 5 unmatched passengers from `kagg_rest6` are added to the rest of the matched passengers in `merg_all2`. After concatenating both DataFrames, the report shows 5 duplicates. These are the 5 unmatched passengers, which all have the same value `NaN` in the column `WikiId`.

In [None]:
merg_all3 = pd.concat([merg_all2, kagg_rest6], ignore_index=True, sort=False)

merge_report(merg_all3, kagg_corr)

For the final dataset I drop the surname and name codes because they aren't needed anymore. I also sort the DataFrame according to the `PassengerId` and reset the index so that it looks like the original Kaggle dataset.

In [None]:
merg_all3_corr = merg_all3.drop(['Surname_code', 'Name_code', 'Surname_code_wiki', 'Name_code_wiki'], axis=1)
merg_all3_corr = merg_all3_corr.sort_values('PassengerId').reset_index(drop=True)
merg_all3_corr.head()

# 7. Splitting into training and testing datasets

For convenience I also split the full dataset into the training and testing parts.

In [None]:
full = merg_all3_corr.copy()

train = full[:891]

test = full[891:]
test = test.drop('Survived', axis=1)

All datasets were saved as CSV files and can be easily added to your kernel from the page [Titanic extended dataset (Kaggle + Wikipedia)](https://www.kaggle.com/pavlofesenko/titanic-extended).

In [None]:
full.to_csv('full.csv', index=False)
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)

# 8. Conclusion

In this kernel I have extended the original Kaggle dataset with the features available on Wikipedia as of 18 February 2019. Although most of the features from the Wikipedia dataset are similar to the ones from the Kaggle dataset, they are more up-to-date and have far fewer missing values. In the resulting dataset all features were left unprocessed to keep as much information as possible. Among the 1309 passengers from the Kaggle dataset, 1304 were successfully matched with passengers from the Wikipedia dataset. The new extended Titanic dataset can potentially be used to create better models (due to fewer missing values and additional features) or to perform an additional EDA for Titanic passengers (for example, using interactive visualizations).
