## Download data

### By:
Jose R. Zapata - https://joserzapata.github.io/

### Date:
2024-02-27

### Description:

Download the dataset and select the columns that are going to be used in the project.

## 📚 Import libraries

In [1]:
# base libraries for data science
import pandas as pd

from pathlib import Path

## 💾 Load data

In [2]:
url_data = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
titanic_df = pd.read_csv(url_data,
                         low_memory=False) # process the file in one pass to avoid mixed-type inference
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  boat       1309 non-null   object
 12  body       1309 non-null   object
 13  home.dest  1309 non-null   object
dtypes: int64(4), object(10)
memory usage: 143.3+ KB


In [3]:
titanic_df.sample(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
54,1,1,"Carter, Master. William Thornton II",male,11,1,2,113760,120.0,B96 B98,S,4,?,"Bryn Mawr, PA"
404,2,0,"Enander, Mr. Ingvar",male,21,0,0,236854,13.0,?,S,?,?,"Goteborg, Sweden / Rockford, IL"
204,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)",female,?,1,0,PC 17604,82.1708,?,C,6,?,"New York, NY"
463,2,0,"Jefferys, Mr. Ernest Wilfred",male,22,2,0,C.A. 31029,31.5,?,S,?,?,"Guernsey / Elizabeth, NJ"
760,3,1,"de Mulder, Mr. Theodore",male,30,0,0,345774,9.5,?,S,11,?,"Belgium Detroit, MI"
1007,3,1,"McGowan, Miss. Anna 'Annie'",female,15,0,0,330923,8.0292,?,Q,?,?,?
1237,3,0,"Svensson, Mr. Olof",male,24,0,0,350035,7.7958,?,S,?,?,?
837,3,0,"Gustafsson, Mr. Anders Vilhelm",male,37,2,0,3101276,7.925,?,S,?,98,"Ruotsinphytaa, Finland New York, NY"
1090,3,0,"Oreskovic, Miss. Jelka",female,23,0,0,315085,8.6625,?,S,?,?,?
956,3,0,"Lefebre, Miss. Jeannie",female,?,3,1,4133,25.4667,?,S,?,?,?
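In the sample above, `?` appears to stand in for missing values, which would explain why `age` and `fare` show up as `object` in the `info()` output. The following is a minimal sketch, not part of the original notebook, that uses a few rows copied from the sample and the `na_values` parameter of `pd.read_csv` to show the effect. Treating `?` as the missing-value marker is an assumption based on the sample only.

```python
import io

import pandas as pd

# A few rows copied from the sample above (subset of columns, for illustration)
csv_text = """pclass,survived,age,fare,boat
1,1,11,120.0,4
1,1,?,82.1708,6
2,0,21,13.0,?
"""

# Default parsing: the "?" entries keep age as a non-numeric column
raw_df = pd.read_csv(io.StringIO(csv_text))

# Assumption: "?" marks a missing value, so it is parsed as NaN
clean_df = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw_df.dtypes)
print(clean_df.dtypes)
```

This notebook keeps the raw values unchanged, so a conversion like this would belong to a later cleaning step.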


## 📊 Data Analysis

This demo project uses the Titanic dataset, where the goal is to predict whether a passenger survived.

First iteration: which columns are going to be used in the project?

In [4]:
titanic_df.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

These columns are directly related to the survival variable:

- boat: The lifeboat number (if they survived)
- body: The body number (if they did not survive and the body was recovered)

These columns are not used because they would give away the answer to the problem, which is a form of data leakage.
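To make the leakage concrete, here is a small sketch built only from the `survived`, `boat` and `body` values of the 10-row sample shown earlier. It is an illustration on those ten rows, not a result for the full dataset, and it assumes that `?` marks a missing entry.

```python
import pandas as pd

# Values copied from the 10-row sample shown earlier; "?" is assumed to mean missing
sample = pd.DataFrame(
    {
        "survived": [1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
        "boat": ["4", "?", "6", "?", "11", "?", "?", "?", "?", "?"],
        "body": ["?", "?", "?", "?", "?", "?", "?", "98", "?", "?"],
    }
)

has_boat = sample["boat"] != "?"
has_body = sample["body"] != "?"

# Cross-tabulate the target against the presence of a lifeboat number
print(pd.crosstab(sample["survived"], has_boat))

# Share of survivors among passengers with a lifeboat number / a body number
boat_survival_rate = sample.loc[has_boat, "survived"].mean()
body_survival_rate = sample.loc[has_body, "survived"].mean()
print(boat_survival_rate, body_survival_rate)
```

In these rows, every passenger with a lifeboat number survived and the one passenger with a body number did not, so a model given these columns could largely read the target off them.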

In [5]:
columns_to_use = [
    "pclass",
    "survived",
    "name",
    "sex",
    "age",
    "sibsp",
    "parch",
    "ticket",
    "fare",
    "cabin",
    "embarked",
    "home.dest",
]
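As an alternative to typing the list by hand, the same selection could be derived by excluding the two leaky columns from the full column list. This is a sketch only; the names `all_columns`, `leaky_columns` and `derived_columns` are hypothetical and do not appear in the notebook.

```python
# Column names as printed by titanic_df.columns above
all_columns = [
    "pclass", "survived", "name", "sex", "age", "sibsp", "parch",
    "ticket", "fare", "cabin", "embarked", "boat", "body", "home.dest",
]

# Columns that reveal the outcome and must be excluded
leaky_columns = {"boat", "body"}

# Keep the original order and drop only the leaky columns
derived_columns = [col for col in all_columns if col not in leaky_columns]
print(derived_columns)
```

The explicit list used in the notebook has the advantage of documenting exactly which columns the project expects, even if the source file later gains new ones.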

Final data download, reading only the selected columns:

In [6]:
titanic_final_df = pd.read_csv(url_data,
                               usecols=columns_to_use,
                               low_memory=False)
titanic_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  home.dest  1309 non-null   object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB


## Save dataset in parquet format

In [7]:
DATA_DIR = Path.cwd().resolve().parents[1] / "data"
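The result of `Path.cwd().resolve().parents[1]` depends on where the notebook is run. The sketch below shows how `parents[1]` climbs two levels, using a made-up working directory (`/home/user/project/notebooks/1-data` is hypothetical, chosen only for illustration) and `PurePosixPath` so that it behaves the same on any operating system.

```python
from pathlib import PurePosixPath

# Hypothetical notebook working directory, for illustration only
notebook_dir = PurePosixPath("/home/user/project/notebooks/1-data")

# parents[0] is the immediate parent, parents[1] is two levels up
project_root = notebook_dir.parents[1]

# Same path construction as in the notebook
data_dir = project_root / "data"
output_path = data_dir / "01_raw/titanic_raw.parquet"
print(output_path)
```

Under this assumed layout the file lands in `data/01_raw/` at the project root. Note that `to_parquet` does not create missing directories, so the target folder has to exist before saving.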

In [8]:
titanic_final_df.to_parquet(DATA_DIR /
                            "01_raw/titanic_raw.parquet",
                            engine="pyarrow")

## Results

The dataset is saved in parquet format, containing only the columns that are going to be used in the project.

Data leakage is avoided by removing the columns `boat` and `body`.

path: `data/01_raw/titanic_raw.parquet`