# **Data extraction and processing**

### Check what the site allows for web scraping
[Access rules for web scraping (titanicfacts.net, 2024 Dave Fowler)](https://titanicfacts.net/robots.txt)
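
The same check can be done programmatically. The following is a minimal sketch using the standard-library `urllib.robotparser`; the rules in `SAMPLE_RULES` and the `/private/` path are hypothetical placeholders, not the actual contents of the site's `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample rules, NOT the real contents of titanicfacts.net/robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file that is already in memory.
parser.parse(SAMPLE_RULES.splitlines())

PAGE_URL = "https://titanicfacts.net/titanic-passenger-list/"
allowed = parser.can_fetch("*", PAGE_URL)
blocked = parser.can_fetch("*", "https://titanicfacts.net/private/page")
print(allowed, blocked)
```

To check the live file instead, call `parser.set_url("https://titanicfacts.net/robots.txt")` followed by `parser.read()` in place of `parse()`.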

Install the libraries:

In [200]:
# pd.read_html also needs an HTML parser such as lxml.
# %pip install pandas numpy unidecode lxml

In [201]:
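import pandas as pd
import re
import numpy as np
from unidecode import unidecode

# Enable copy-on-write so that frames derived from another frame behave as copies.
pd.options.mode.copy_on_write = True

# read_html returns a list with one DataFrame per table found on the page.
URL = 'https://titanicfacts.net/titanic-passenger-list/'
dfs = pd.read_html(URL)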

In [202]:
len(dfs)

3

### Columns of interest and their descriptions

| **Column** | **Description**                                                            |
| ---------- | -------------------------------------------------------------------------- |
| Survived   | Whether the passenger survived the Titanic disaster (0 = "No", 1 = "Yes")  |
| Sex        | Gender (female, male)                                                      |
| Name       | Passenger name                                                             |
| Age        | Age in years                                                               |
| SibSp      | Number of siblings / spouses on board                                      |
| Parch      | Number of parents / children on board                                      |
| Pclass     | Ticket class (1 = "1st", 2 = "2nd", 3 = "3rd")                             |

In [203]:
dfs[0].head()

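|     | 0       | 1                          | 2   | 3           | 4                          |
| --- | ------- | -------------------------- | --- | ----------- | -------------------------- |
| 0   | Surname | First Names                | Age | Boarded     | Survivor (S) or Victim (†) |
| 1   | Allen   | Miss Elisabeth Walton      | 29  | Southampton | S                          |
| 2   | Allison | Mr Hudson Joshua Creighton | 30  | Southampton | †                          |
| 3   | Allison | Mrs Bessie Waldo           | 25  | Southampton | †                          |
| 4   | Allison | Miss Helen Loraine         | 2   | Southampton | †                          |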


### Data extraction from the website

In [204]:
FIRST_ROW_OF_DF = 0
TICKET_CLASS_COLUMN = "Pclass"
SELECTED_COLUMNS_DEL_SITE_WEB_COLUMNS = ["Surname", "First Names", "Age", "Pclass"]


def set_columns_of_row(df, row):
    """Use the values of the given row as the column names (in place)."""
    df.columns = df.iloc[row]


def delete_row(df, row_to_drop):
    """Drop the row with the given index label (in place)."""
    df.drop(row_to_drop, inplace=True)


def add_columns_class_ticket(df, ticket_class):
    """Add a column holding the ticket class for every row of the table."""
    df[TICKET_CLASS_COLUMN] = ticket_class


def get_dataframe(df, ticket_class):
    """Promote the header row, tag the ticket class, and keep the selected columns."""
    set_columns_of_row(df, FIRST_ROW_OF_DF)
    delete_row(df, FIRST_ROW_OF_DF)
    add_columns_class_ticket(df, ticket_class)
    return df[SELECTED_COLUMNS_DEL_SITE_WEB_COLUMNS]


In [205]:
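# The three tables are assumed to be ordered by ticket class (1st, 2nd, 3rd),
# so each table's position, counted from 1, is used as its class.
dfs = [get_dataframe(df, ticket_class) for ticket_class, df in enumerate(dfs, start=1)]
website_df = pd.concat(dfs)
website_df.reset_index(inplace=True, drop=True)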


In [206]:
website_df.head()

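|     | Surname | First Names                | Age | Pclass |
| --- | ------- | -------------------------- | --- | ------ |
| 0   | Allen   | Miss Elisabeth Walton      | 29  | 1      |
| 1   | Allison | Mr Hudson Joshua Creighton | 30  | 1      |
| 2   | Allison | Mrs Bessie Waldo           | 25  | 1      |
| 3   | Allison | Miss Helen Loraine         | 2   | 1      |
| 4   | Allison | Master Hudson Trevor       | 11m | 1      |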


In [207]:
website_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1317 entries, 0 to 1316
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Surname      1317 non-null   object
 1   First Names  1317 non-null   object
 2   Age          1317 non-null   object
 3   Pclass       1317 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 41.3+ KB


### Reading the CSV data

In [208]:
csv_df=pd.read_csv("titanic-proyecto-final.csv")

In [209]:
csv_df.head()

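|     | Survived | Name                                               | Sex    | Age  | SibSp | Parch | Embarked |
| --- | -------- | -------------------------------------------------- | ------ | ---- | ----- | ----- | -------- |
| 0   | 0        | Braund, Mr. Owen Harris                            | male   | 22.0 | 1     | 0     | S        |
| 1   | 1        | Cumings, Mrs. John Bradley (Florence Briggs Th...  | female | 38.0 | 1     | 0     | C        |
| 2   | 1        | Heikkinen, Miss. Laina                             | female | 26.0 | 0     | 0     | S        |
| 3   | 1        | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | female | 35.0 | 1     | 0     | S        |
| 4   | 0        | Allen, Mr. William Henry                           | male   | 35.0 | 0     | 0     | S        |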


In [210]:
csv_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Name      891 non-null    object 
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Embarked  889 non-null    object 
dtypes: float64(1), int64(3), object(3)
memory usage: 48.9+ KB


Matching techniques for linking the website data to the CSV data:
1. Match by surname and age, assuming that no two passengers share the same surname and age.
2. Match by full name using a set-based technique.
3. Match the remaining special cases manually.
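
The first two techniques can be illustrated with a minimal sketch. It uses hypothetical toy rows shaped like `website_df` and `csv_df`, not the real data. The key columns `surname_key` and `age_key`, the helpers `name_tokens` and `jaccard`, and the choice of Jaccard similarity as the "set technique" are all assumptions made for illustration, not necessarily the approach taken later in this notebook.

```python
import re

import pandas as pd

# Hypothetical toy rows shaped like website_df and csv_df (not the real data).
site_demo = pd.DataFrame({
    "Surname": ["Allen", "Allison", "Allison"],
    "First Names": ["Miss Elisabeth Walton", "Mr Hudson Joshua Creighton",
                    "Miss Helen Loraine"],
    "Age": ["29", "30", "2"],
    "Pclass": [1, 1, 1],
})
csv_demo = pd.DataFrame({
    "Name": ["Allen, Miss. Elisabeth Walton", "Allison, Miss. Helen Loraine",
             "Braund, Mr. Owen Harris"],
    "Age": [29.0, 2.0, 22.0],
})

# Technique 1: build comparable (surname, age) keys on both sides and merge.
site_keys = pd.DataFrame({
    "surname_key": site_demo["Surname"].str.strip().str.lower(),
    # Non-numeric ages become NaN under errors="coerce".
    "age_key": pd.to_numeric(site_demo["Age"], errors="coerce").astype("float64"),
    "Pclass": site_demo["Pclass"],
})
csv_keys = csv_demo.assign(
    # The CSV stores names as "Surname, Title. First names".
    surname_key=csv_demo["Name"].str.split(",").str[0].str.strip().str.lower(),
    age_key=csv_demo["Age"].astype("float64"),
)
by_surname_age = csv_keys.merge(site_keys, on=["surname_key", "age_key"], how="left")


# Technique 2: compare full names as sets of words, ignoring order and punctuation.
def name_tokens(name):
    """Return the lowercase ASCII words of a name as a frozenset."""
    return frozenset(re.findall(r"[a-z]+", name.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


site_full = site_demo["Surname"] + " " + site_demo["First Names"]
target = name_tokens(csv_demo.loc[0, "Name"])
scores = site_full.map(lambda n: jaccard(name_tokens(n), target))
best_match = site_full[scores.idxmax()]

print(by_surname_age[["Name", "Pclass"]])
print(best_match)
```

Rows left without a `Pclass` after the merge are the candidates for the set-based comparison, and whatever remains after that falls under the manual step. Ages given in months, such as `11m`, become `NaN` in `age_key` and would need separate handling. Because `name_tokens` keeps only ASCII letters, accented names would need to be transliterated first, for example with `unidecode`.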