## Data exploration and description

### By:
Jose R. Zapata

### Date:
2024-03-01

### Description:

Data overview and exploration to check the data types and fix any issues with them.

This is needed in order to do a correct analysis and visualization of the data.


## 📚 Import libraries

In [1]:
# base libraries for data science
import pandas as pd
import numpy as np
import pyarrow as pa

from pathlib import Path

## 💾 Load data

In [2]:
# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "01_raw/titanic_raw.parquet",
    engine="pyarrow",
)
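
The `DATA_DIR` expression relies on `parents[1]`, which climbs two levels up from the working directory. A minimal sketch of how that resolves, assuming a hypothetical project layout (the `/home/user/project/notebooks/eda` path is illustrative, not taken from this project):

```python
from pathlib import PurePosixPath

# Hypothetical working directory of the notebook (assumed layout)
notebook_dir = PurePosixPath("/home/user/project/notebooks/eda")

# parents[0] is the immediate parent, parents[1] is two levels up
project_root = notebook_dir.parents[1]
data_dir = project_root / "data"

print(project_root)  # /home/user/project
print(data_dir / "01_raw/titanic_raw.parquet")
```

If the notebook lives at a different depth, the index passed to `parents` would need to change accordingly.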

## 📊 Data description

In [3]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1309 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1309 non-null   object
 9   cabin      1309 non-null   object
 10  embarked   1309 non-null   object
 11  home.dest  1309 non-null   object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB


In [4]:
titanic_df.sample(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
242,1,1,"Rosenbaum, Miss. Edith Louise",female,33,0,0,PC 17613,27.7208,A11,C,"Paris, France"
323,2,0,"Abelson, Mr. Samuel",male,30,1,0,P/PP 3381,24.0,?,C,"Russia New York, NY"
466,2,0,"Kantor, Mr. Sinai",male,34,1,0,244367,26.0,?,S,"Moscow / Bronx, NY"
865,3,0,"Henry, Miss. Delia",female,?,0,0,382649,7.75,?,Q,?
910,3,0,"Kallio, Mr. Nikolai Erland",male,17,0,0,STON/O 2. 3101274,7.125,?,S,?
735,3,1,"Coutts, Mrs. William (Winnie 'Minnie' Treanor)",female,36,0,2,C.A. 37671,15.9,?,S,"England Brooklyn, NY"
1061,3,1,"Nilsson, Miss. Helmina Josefina",female,26,0,0,347470,7.8542,?,S,?
282,1,1,"Stephenson, Mrs. Walter Bertram (Martha Eustis)",female,52,1,0,36947,78.2667,D20,C,"Haverford, PA"
625,3,1,"Andersson, Miss. Erna Alexandra",female,17,4,2,3101281,7.925,?,S,"Ruotsinphyhtaa, Finland New York, NY"
944,3,0,"Laleff, Mr. Kristo",male,?,0,0,349217,7.8958,?,S,?


## Null values

In this dataset the null values are represented by the string `'?'`, so we need to replace them with `np.nan`.

In [5]:
titanic_df = titanic_df.replace("?", np.nan)

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pclass     1309 non-null   int64 
 1   survived   1309 non-null   int64 
 2   name       1309 non-null   object
 3   sex        1309 non-null   object
 4   age        1046 non-null   object
 5   sibsp      1309 non-null   int64 
 6   parch      1309 non-null   int64 
 7   ticket     1309 non-null   object
 8   fare       1308 non-null   object
 9   cabin      295 non-null    object
 10  embarked   1307 non-null   object
 11  home.dest  745 non-null    object
dtypes: int64(4), object(8)
memory usage: 122.8+ KB
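
One way to sanity-check the replacement is to count the `'?'` placeholders per column before replacing them and compare with the null counts afterwards. This is a sketch on a toy frame with illustrative values, not the real dataset; `toy_df` and the other names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative)
toy_df = pd.DataFrame({
    "age": ["33", "?", "17", "?"],
    "cabin": ["A11", "?", "?", "?"],
    "sex": ["female", "male", "male", "female"],
})

# Count placeholder strings per column before replacing them
placeholder_counts = (toy_df == "?").sum()

# Replace the placeholder and count the resulting nulls
cleaned_df = toy_df.replace("?", np.nan)
null_counts = cleaned_df.isna().sum()

# Both counts should match column by column
print(placeholder_counts.equals(null_counts))
```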


## Remove columns

- We will remove the columns that have too many null values and would need too much effort to fill with the correct values.
- The column `ticket` is a string that is unique for each passenger, but it is just an identifier, so we will remove it.

So we will remove the columns `cabin`, `ticket` and `home.dest`.

In [6]:
titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   pclass    1309 non-null   int64 
 1   survived  1309 non-null   int64 
 2   name      1309 non-null   object
 3   sex       1309 non-null   object
 4   age       1046 non-null   object
 5   sibsp     1309 non-null   int64 
 6   parch     1309 non-null   int64 
 7   fare      1308 non-null   object
 8   embarked  1307 non-null   object
dtypes: int64(4), object(5)
memory usage: 92.2+ KB
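
The decision about which columns have too many null values can be supported by the fraction of missing values per column. In the `info()` output above, `cabin` has 295 non-null values out of 1309 and `home.dest` has 745. The sketch below uses a toy frame, and the 40% cutoff is an assumption chosen for illustration, not a rule used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cleaned data (values are illustrative)
toy_df = pd.DataFrame({
    "age": [33.0, np.nan, 17.0, 52.0],
    "cabin": ["A11", np.nan, np.nan, np.nan],
    "embarked": ["C", "S", "S", "Q"],
})

# Fraction of missing values per column
missing_fraction = toy_df.isna().mean()

# Hypothetical cutoff: flag columns with more than 40% missing values
MISSING_THRESHOLD = 0.4
cols_to_review = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

print(missing_fraction)
print(cols_to_review)
```

The flagged columns are only candidates for removal; whether a column is worth imputing still depends on how much effort the correct values would take to find.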


### Categorical variables
#### Ordinal
- `pclass`: A proxy for socio-economic status (SES)
    - 1 = Upper
    - 2 = Middle
    - 3 = Lower


#### Nominal

- `sex`: Gender of the passenger
    - female
    - male
- `embarked`: Port of embarkation
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton

### Numerical variables
#### Discrete
- `sibsp`: Number of siblings and spouses aboard. The dataset defines family relations in this way:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
    - `sibsp` = 0, 1, 2, 3, 4, 5, 8
- `parch`: Number of parents and children aboard. The dataset defines family relations in this way:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore `parch` = 0 for them.
    - `parch` = 0, 1, 2, 3, 4, 5, 6 

#### Continuous

- `fare`: Passenger fare
- `age`: Age of the passenger. Some values are fractional and some are missing, so the column is stored as float.

### Boolean variables
- `survived`: 0 = No, 1 = Yes

### String variables

- `name`: Name of the passenger with the format `Last name, Title. First name`
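
To illustrate the name format, the three parts can be pulled apart with a regular expression. This is only a sketch: the pattern and the column names `last_name`, `title` and `first_name` are hypothetical, and names that deviate from the format would yield missing values.

```python
import pandas as pd

# Names copied from the sample rows shown earlier in this notebook
names = pd.Series([
    "Rosenbaum, Miss. Edith Louise",
    "Abelson, Mr. Samuel",
    "Coutts, Mrs. William (Winnie 'Minnie' Treanor)",
])

# Hypothetical pattern: text before the comma, text before the first period, the rest
NAME_PATTERN = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>.*)$"

# Named groups become the column names of the resulting DataFrame
name_parts = names.str.extract(NAME_PATTERN)
print(name_parts)
```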

## Convert data types

### Categorical variables

In [7]:
cols_categoric = ["pclass", "sex", "embarked"]

titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")

In [8]:
# categories are listed from lowest (3) to highest (1) class
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"],
    categories=[3, 2, 1],
    ordered=True,
)
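
The order of the `categories` list defines the ordering of the categorical, so with `[3, 2, 1]` the comparison is `3 < 2 < 1`. A small sketch on a toy series (the values are illustrative) shows the effect on `min`, `max`, comparisons and sorting:

```python
import pandas as pd

# Toy series with the same category order as the notebook
pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# min and max follow the category order, not the numeric value
lowest_class = pclass.min()
highest_class = pclass.max()

# Comparisons also follow the category order: classes above third class
above_third = (pclass > 3).tolist()

# Sorting places third class first and first class last
sorted_classes = pclass.sort_values().tolist()

print(lowest_class, highest_class)
print(above_third)
print(sorted_classes)
```

Listing the categories as `[1, 2, 3]` instead would reverse these results, so the choice depends on whether first class should count as the highest or the lowest level.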

### Numerical variables

In [9]:
cols_numeric_float = ["age", "fare"]

titanic_df[cols_numeric_float] = titanic_df[cols_numeric_float].astype("float")

In [10]:
cols_numeric_int = ["sibsp", "parch"]

titanic_df[cols_numeric_int] = titanic_df[cols_numeric_int].astype("int8")
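
Casting to `int8` is safe here because the listed values of `sibsp` (0 to 8) and `parch` (0 to 6) fit in the `int8` range. A possible guard before downcasting, sketched on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

# Toy frame with values in the ranges described for sibsp and parch
toy_df = pd.DataFrame({"sibsp": [0, 1, 8], "parch": [0, 2, 6]})

# int8 can hold values from -128 to 127
int8_info = np.iinfo(np.int8)

# Check that every column stays within the int8 range
fits_int8 = (
    (toy_df.min() >= int8_info.min) & (toy_df.max() <= int8_info.max)
).all()

# Downcast only when the check passes
if fits_int8:
    toy_df = toy_df.astype("int8")

print(toy_df.dtypes)
```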

### Boolean variables

In [11]:
cols_boolean = ["survived"]

titanic_df[cols_boolean] = titanic_df[cols_boolean].astype("bool")
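
The cast to `bool` works cleanly here because `survived` has no null values and only contains 0 and 1. Since `astype("bool")` treats any non-zero value as `True`, a guard like the following sketch (on a toy series with illustrative values) can catch unexpected input before the cast:

```python
import pandas as pd

# Toy series standing in for the survived column
survived = pd.Series([0, 1, 1, 0], name="survived")

# Guard: no nulls and only 0/1 values before casting to bool
assert survived.notna().all()
assert survived.isin([0, 1]).all()

survived_bool = survived.astype("bool")
print(survived_bool.tolist())  # [False, True, True, False]
```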

In [12]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1309 non-null   category
 1   survived  1309 non-null   bool    
 2   name      1309 non-null   object  
 3   sex       1309 non-null   category
 4   age       1046 non-null   float64 
 5   sibsp     1309 non-null   int8    
 6   parch     1309 non-null   int8    
 7   fare      1308 non-null   float64 
 8   embarked  1307 non-null   category
dtypes: bool(1), category(3), float64(2), int8(2), object(1)
memory usage: 38.9+ KB


In [None]:
schema = pa.Table.from_pandas(titanic_df).schema

### 💾 Save dataframe with data types

In [13]:
titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet",
    index=False,
    schema=schema
)

## 📊 Analysis of Results

Some columns have been removed, the data types have been fixed so that they map to the correct pyarrow data types, and the null values have been replaced with `np.nan`.

These changes are needed in order to do a correct analysis and visualization of the data.


## 💡 Proposals and Ideas

- Use other tools and compare which one is best suited to describe and explore the data and to do data analysis.

- Use pyarrow as dtype backend.
- Use `pd.NA` as the null value, but ydata-profiling is not working well with the pyarrow backend.


## 📖 References

- <https://pandas.pydata.org/docs/user_guide/pyarrow.html>