## Download data

### By:
Jose R. Zapata - https://joserzapata.github.io/

### Date:
2024-02-27

### Description:

Download the dataset and select the columns that are going to be used in the project.

## 📚 Import libraries

In [1]:
# base libraries for data science
import pandas as pd

from pathlib import Path

## 💾 Load data

In [2]:
url_data = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
titanic_df = pd.read_csv(url_data,
                         low_memory=False)  # read in one pass to avoid mixed-type inference
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [3]:
titanic_df.sample(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
659,3,1,"Baclini, Miss. Marie Catherine",female,5,2,1,2666,19.2583,?,C,C,?,"Syria New York, NY"
351,2,0,"Brown, Mr. Thomas William Solomon",male,60,1,1,29750,39.0,?,S,?,?,"Cape Town, South Africa / Seattle, WA"
1075,3,0,"Odahl, Mr. Nils Martin",male,23,0,0,7267,9.225,?,S,?,?,?
489,2,1,"Louch, Mrs. Charles Alexander (Alice Adelaide ...",female,42,1,0,SC/AH 3085,26.0,?,S,?,?,"Weston-Super-Mare, Somerset"
39,1,0,"Brandeis, Mr. Emil",male,48,0,0,PC 17591,50.4958,B10,C,?,208,"Omaha, NE"
1247,3,1,"Thorneycroft, Mrs. Percival (Florence Kate White)",female,?,1,0,376564,16.1,?,S,10,?,?
101,1,0,"Dulles, Mr. William Crothers",male,39,0,0,PC 17580,29.7,A18,C,?,133,"Philadelphia, PA"
480,2,0,"Laroche, Mr. Joseph Philippe Lemercier",male,25,1,2,SC/Paris 2123,41.5792,?,C,?,?,Paris / Haiti
444,2,0,"Hickman, Mr. Stanley George",male,21,2,0,S.O.C. 14879,73.5,?,S,?,?,"West Hampstead, London / Neepawa, MB"
188,1,1,"Lines, Mrs. Ernest H (Elizabeth Lindsey James)",female,51,0,1,PC 17592,39.4,D28,S,9,?,"Paris, France"
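The sampled rows show missing values encoded as `?`, which is likely why `age` and `fare` load as `object` instead of a numeric dtype. The sketch below, assuming `?` is the only missing-value marker in this file, shows how the `na_values` parameter of `pd.read_csv` lets pandas parse those columns as numbers. The inline CSV is a small excerpt of the rows above (a subset of columns), not the full dataset.

```python
import io

import pandas as pd

# Excerpt of three sampled rows shown above, used instead of the remote file.
sample_csv = io.StringIO(
    "pclass,survived,sex,age,fare,cabin\n"
    "3,1,female,5,19.2583,?\n"
    "3,1,female,?,16.1,?\n"
    "1,0,male,48,50.4958,B10\n"
)

# Assumption: "?" is the only marker used for missing values.
sample_df = pd.read_csv(sample_csv, na_values="?")

print(sample_df.dtypes)        # age and fare are now numeric
print(sample_df.isna().sum())  # missing values per column
```

This notebook keeps the raw values untouched, so a conversion like this would belong in a later cleaning step.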


## 📊 Data info

This demo project uses the Titanic dataset, where the goal is to predict whether a passenger survived.

First iteration: which columns are going to be used in the project?

In [4]:
titanic_df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

These columns are directly tied to the survival variable:

- boat: The lifeboat number (if they survived)
- body: The body number (if they did not survive and the body was recovered)

These columns will not be used because they reveal the answer to the problem, which is a form of data leakage.
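To make the leakage concrete, the sketch below uses only the `survived` and `boat` values from the ten sampled rows shown earlier and checks how often "has a lifeboat number" matches the target. Ten rows are far too few to draw conclusions, and the variable names are illustrative.

```python
import pandas as pd

# survived and boat values copied from the ten sampled rows shown earlier.
sample = pd.DataFrame(
    {
        "survived": [1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        "boat": ["C", "?", "?", "?", "?", "10", "?", "?", "?", "9"],
    }
)

# Flag is 1 when a lifeboat is recorded, 0 when the value is the "?" marker.
has_boat = (sample["boat"] != "?").astype(int)

# Fraction of rows where the flag alone equals the target.
agreement = (has_boat == sample["survived"]).mean()

print(pd.crosstab(has_boat, sample["survived"]))
print(f"agreement: {agreement:.0%}")
```

In this small sample the flag matches the target in 9 of 10 rows, which is the kind of shortcut a model would learn if `boat` were kept as a feature.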

In [5]:
columns_to_use = [
    "pclass",
    "survived",
    "name",
    "sex",
    "age",
    "sibsp",
    "parch",
    "ticket",
    "fare",
    "cabin",
    "embarked",
    "home.dest",
]

Download the final data, keeping only the selected columns:

In [6]:
titanic_final_df = pd.read_csv(url_data,
                               usecols=columns_to_use,
                               low_memory=False)
titanic_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  home.dest  1309 non-null   object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB


## Save the dataset locally

In [7]:
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
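`parents[0]` is the immediate parent of a path and `parents[1]` is the directory one level above that, so `DATA_DIR` points to a `data` folder two levels above the notebook's working directory. A small sketch with a hypothetical working directory (the `/project/notebooks/1-data` path is an assumption, not the real project layout):

```python
from pathlib import PurePosixPath

# Hypothetical working directory of the notebook.
notebook_dir = PurePosixPath("/project/notebooks/1-data")

# parents[0] is /project/notebooks, parents[1] is /project.
project_root = notebook_dir.parents[1]
data_dir = project_root / "data"

print(data_dir)
```

Because the result depends on where the notebook is run from, the same cell started from a different folder would resolve to a different `data` directory.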

In [8]:
titanic_final_df.to_csv(DATA_DIR / "01_raw/titanic_raw.csv")
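Two details of `to_csv` matter for this step: it does not create missing directories, and by default it writes the DataFrame index as an unnamed first column. A minimal sketch of both points, using a temporary directory in place of the project's `data` folder (the temporary path and the tiny DataFrame are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for titanic_final_df.
demo_df = pd.DataFrame({"pclass": [3, 2], "survived": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for DATA_DIR / "01_raw".
    raw_dir = Path(tmp_dir) / "data" / "01_raw"
    # Create the folder (and its parents) if it does not exist yet.
    raw_dir.mkdir(parents=True, exist_ok=True)

    csv_path = raw_dir / "titanic_raw.csv"
    demo_df.to_csv(csv_path)  # the index is written as the first column

    # index_col=0 reads that first column back as the index.
    restored_df = pd.read_csv(csv_path, index_col=0)

print(restored_df)
```

Any later notebook that loads `titanic_raw.csv` would need `index_col=0` (or the file would need to be written with `index=False`) to avoid an extra `Unnamed: 0` column.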

## Results

The dataset is saved in CSV format, containing only the columns that are going to be used in the project.

Data leakage is avoided by removing the `boat` and `body` columns.

path: `data/01_raw/titanic_raw.csv`