## Data exploration and description

### By:
Jose R. Zapata

### Date:
2024-03-01

### Description:

Data overview and exploration to check the data types and fix any issues with them.

This is needed in order to analyze and visualize the data correctly.


## 📚 Import libraries

In [1]:
# base libraries for data science
import pandas as pd

from pathlib import Path

In [2]:
pd.options.mode.string_storage = "pyarrow"

## 💾 Load data

In [3]:
# data directory path
DATA_DIR = Path.cwd().resolve().parents[1] / "data"

titanic_df = pd.read_parquet(
    DATA_DIR / "01_raw/titanic_raw.parquet",
    dtype_backend="pyarrow",
    engine="pyarrow",
)

## 📊 Data description

In [4]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype          
---  ------     --------------  -----          
 0   pclass     1309 non-null   int64[pyarrow] 
 1   survived   1309 non-null   int64[pyarrow] 
 2   name       1309 non-null   string[pyarrow]
 3   sex        1309 non-null   string[pyarrow]
 4   age        1309 non-null   string[pyarrow]
 5   sibsp      1309 non-null   int64[pyarrow] 
 6   parch      1309 non-null   int64[pyarrow] 
 7   ticket     1309 non-null   string[pyarrow]
 8   fare       1309 non-null   string[pyarrow]
 9   cabin      1309 non-null   string[pyarrow]
 10  embarked   1309 non-null   string[pyarrow]
 11  home.dest  1309 non-null   string[pyarrow]
dtypes: int64[pyarrow](4), string[pyarrow](8)
memory usage: 158.3 KB


In [5]:
titanic_df.sample(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,home.dest
711,3,0,"Carver, Mr. Alfred John",male,28,0,0,392095,7.25,?,S,"St Denys, Southampton, Hants"
971,3,0,"Linehan, Mr. Michael",male,?,0,0,330971,7.8792,?,Q,?
978,3,1,"Lulic, Mr. Nikola",male,27,0,0,315098,8.6625,?,S,?
319,1,1,"Wilson, Miss. Helen Alice",female,31,0,0,16966,134.5,E39 E41,C,?
357,2,0,"Byles, Rev. Thomas Roussel Davids",male,42,0,0,244310,13.0,?,S,London
492,2,1,"Mallet, Master. Andre",male,1,0,2,S.C./PARIS 2079,37.0042,?,C,"Paris / Montreal, PQ"
112,1,1,"Fortune, Miss. Ethel Flora",female,28,3,2,19950,263.0,C23 C25 C27,S,"Winnipeg, MB"
847,3,0,"Hanna, Mr. Mansour",male,23.5,0,0,2693,7.2292,?,C,?
580,2,1,"Ware, Mrs. John James (Florence Louise Long)",female,31,0,0,CA 31352,21.0,?,S,"Bristol, England / New Britain, CT"
440,2,1,"Herman, Mrs. Samuel (Jane Laver)",female,48,1,2,220845,65.0,?,S,"Somerset / Bernardsville, NJ"


## Null values

In this dataset, null values are represented by the string `"?"`, so we need to replace them with `pd.NA`.

In [6]:
titanic_df = titanic_df.replace("?", pd.NA)

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype          
---  ------     --------------  -----          
 0   pclass     1309 non-null   int64[pyarrow] 
 1   survived   1309 non-null   int64[pyarrow] 
 2   name       1309 non-null   string[pyarrow]
 3   sex        1309 non-null   string[pyarrow]
 4   age        1046 non-null   string[pyarrow]
 5   sibsp      1309 non-null   int64[pyarrow] 
 6   parch      1309 non-null   int64[pyarrow] 
 7   ticket     1309 non-null   string[pyarrow]
 8   fare       1308 non-null   string[pyarrow]
 9   cabin      295 non-null    string[pyarrow]
 10  embarked   1307 non-null   string[pyarrow]
 11  home.dest  745 non-null    string[pyarrow]
dtypes: int64[pyarrow](4), string[pyarrow](8)
memory usage: 157.3 KB
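
The replacement step can be sketched on a small toy frame. The values below are invented for illustration and the names `toy_df`, `toy_clean` and `missing_counts` are hypothetical; the point is only that `"?"` becomes a value that `isna()` recognises as missing.

```python
import pandas as pd

# Toy frame with "?" used as the missing-value placeholder (illustrative data).
toy_df = pd.DataFrame(
    {
        "age": ["28", "?", "23.5"],
        "cabin": ["?", "?", "E39 E41"],
    }
)

# Replace the placeholder with the pandas missing-value marker.
toy_clean = toy_df.replace("?", pd.NA)

# Count missing values per column after the replacement.
missing_counts = toy_clean.isna().sum()
print(missing_counts)
```

Before the replacement, `isna()` would report zero missing values for both columns, which is why the first `info()` output shows 1309 non-null entries everywhere.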


## Remove columns

- We will remove the columns that have too many null values and would need too much effort to fill in with the correct values.
- The column `ticket` is a string that acts only as an identifier, so we will remove it as well.

Therefore, we will remove the columns `cabin`, `ticket` and `home.dest`.

In [7]:
titanic_df = titanic_df.drop(columns=["cabin", "home.dest", "ticket"])

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   pclass    1309 non-null   int64[pyarrow] 
 1   survived  1309 non-null   int64[pyarrow] 
 2   name      1309 non-null   string[pyarrow]
 3   sex       1309 non-null   string[pyarrow]
 4   age       1046 non-null   string[pyarrow]
 5   sibsp     1309 non-null   int64[pyarrow] 
 6   parch     1309 non-null   int64[pyarrow] 
 7   fare      1308 non-null   string[pyarrow]
 8   embarked  1307 non-null   string[pyarrow]
dtypes: int64[pyarrow](4), string[pyarrow](5)
memory usage: 117.9 KB
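
As a rough way to quantify "too many null values", the share of missing entries per column can be computed from the non-null counts reported by `info()` after the `"?"` replacement. This is a sketch only: the 40% cut-off (`MISSING_THRESHOLD`) is a hypothetical value chosen for illustration, not a rule used by the notebook.

```python
import pandas as pd

# Non-null counts copied from the info() output after replacing "?" (1309 rows).
n_rows = 1309
non_null = pd.Series(
    {"age": 1046, "fare": 1308, "cabin": 295, "embarked": 1307, "home.dest": 745}
)

# Share of missing values per column, highest first.
missing_share = (1 - non_null / n_rows).sort_values(ascending=False)
print(missing_share.round(3))

# Hypothetical cut-off: flag columns with more than 40% missing values.
MISSING_THRESHOLD = 0.4
candidates = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()
print(candidates)
```

With this threshold, `cabin` and `home.dest` are flagged, while `age` (about 20% missing) is kept. `ticket` is dropped for a different reason, so it does not appear here.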


### Categorical variables
#### Ordinal
- `pclass`: A proxy for socio-economic status (SES)
    - 1 = Upper
    - 2 = Middle
    - 3 = Lower


#### Nominal

- `sex`: Gender of the passenger
    - female
    - male
- `embarked`: Port of embarkation
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton

### Numerical variables
#### Discrete
- `sibsp`: Number of siblings or spouses aboard. The dataset defines these family relations as follows:
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
    - `sibsp` = 0, 1, 2, 3, 4, 5, 8
- `parch`: Number of parents or children aboard. The dataset defines these family relations as follows:
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore `parch` = 0 for them.
    - `parch` = 0, 1, 2, 3, 4, 5, 6 

#### Continuous

- `fare`: Passenger fare
- `age`: Age of the passenger. Some values are fractional (for example 23.5), so the column is stored as a float.

### Boolean variables
- `survived`: 0 = No, 1 = Yes

### String variables

- `name`: Name of the passenger with the format `Last name, Title. First name`
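
To illustrate the `Last name, Title. First name` format, the three parts can be split with a regular expression. This is a minimal sketch, not a step of this notebook: the pattern and the group names (`last_name`, `title`, `first_name`) are assumptions, and the names are taken from the sample rows shown earlier.

```python
import pandas as pd

names = pd.Series(
    [
        "Carver, Mr. Alfred John",
        "Wilson, Miss. Helen Alice",
        "Mallet, Master. Andre",
    ]
)

# Everything before the first comma is the last name, the text up to the
# next period is the title, and the remainder is the first name.
pattern = r"^(?P<last_name>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first_name>.*)$"
parts = names.str.extract(pattern)
print(parts)
```

Names that deviate from the format would produce missing values in `parts`, so the result should be checked before relying on it.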

## Convert data types

### Categorical variables

In [8]:
cols_categoric = ["pclass", "sex", "embarked"]

titanic_df[cols_categoric] = titanic_df[cols_categoric].astype("category")

In [9]:
# pclass is ordinal: categories are listed from lowest (3) to highest (1) class
titanic_df["pclass"] = pd.Categorical(
    titanic_df["pclass"],
    categories=[3, 2, 1],
    ordered=True,
)
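
The effect of `ordered=True` can be seen on a small standalone series. This is a sketch with invented values; the variable names are hypothetical. Because the categories are listed as `[3, 2, 1]`, comparisons follow the order 3 < 2 < 1 rather than the numeric order.

```python
import pandas as pd

pclass = pd.Series(
    pd.Categorical([1, 3, 2, 3], categories=[3, 2, 1], ordered=True)
)

# "Greater than 3" means "a higher class than third", i.e. classes 2 and 1.
above_lower = pclass > 3
print(above_lower.tolist())

# min() is the lowest category in the declared order, max() the highest.
print(pclass.min(), pclass.max())
```

Without `ordered=True`, the comparison `pclass > 3` would raise a `TypeError`, since unordered categories have no defined ranking.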

### Numerical variables

In [10]:
cols_numeric_float = ["age", "fare"]

titanic_df[cols_numeric_float] = (titanic_df[cols_numeric_float]
                                  .astype(dtype="float32[pyarrow]"))
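
As a point of comparison, the same string-to-float conversion can be sketched without pyarrow by using `pd.to_numeric`. This is an alternative technique, not the one used in the cell above, and the sample values are invented.

```python
import pandas as pd

# Numbers stored as text, with one missing entry (illustrative data).
raw_values = pd.Series(["28", None, "23.5", "7.25"], name="age")

# Parse the text; errors="raise" fails loudly on anything that is not a number.
parsed = pd.to_numeric(raw_values, errors="raise")

# Downcast to 32-bit floats, as the notebook does for `age` and `fare`.
parsed32 = parsed.astype("float32")
print(parsed32)
```

The missing entry stays missing after the conversion, which matches the non-null counts staying at 1046 for `age` and 1308 for `fare`.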

In [11]:
cols_numeric_int = ['sibsp', 'parch']

titanic_df[cols_numeric_int] = (titanic_df[cols_numeric_int]
                                .astype(dtype="uint8[pyarrow]"))
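
Before downcasting to an 8-bit unsigned integer, it is worth confirming that every value fits in the range 0 to 255. The sketch below uses the `sibsp` values listed in the variable description and a NumPy-backed `uint8` so that it runs without pyarrow (an assumption made for portability); the range check is the same for `uint8[pyarrow]`.

```python
import numpy as np
import pandas as pd

# Distinct sibsp values listed in the variable description.
counts = pd.Series([0, 1, 2, 3, 4, 5, 8], name="sibsp")

# Valid range of an unsigned 8-bit integer.
limits = np.iinfo(np.uint8)
fits = bool(counts.between(limits.min, limits.max).all())

if not fits:
    raise ValueError("values do not fit in uint8")

counts_small = counts.astype("uint8")
print(counts_small.dtype)
```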

### Boolean variables

In [12]:
cols_boolean = ["survived"]

titanic_df[cols_boolean] = (titanic_df[cols_boolean]
                            .astype("bool[pyarrow]"))
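
A cast to boolean maps every non-zero value to `True`, so a quick guard that the column only contains 0 and 1 avoids silently hiding bad data. This is a minimal sketch with invented values and a NumPy-backed `bool` dtype (an assumption, to avoid requiring pyarrow).

```python
import pandas as pd

survived_raw = pd.Series([0, 1, 1, 0], name="survived")

# Guard: only cast when every value is exactly 0 or 1.
if not survived_raw.isin([0, 1]).all():
    raise ValueError("unexpected values in 'survived'")

survived_flag = survived_raw.astype("bool")

# The mean of a boolean column is the share of True values.
print(survived_flag.mean())
```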

In [13]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype          
---  ------    --------------  -----          
 0   pclass    1309 non-null   category       
 1   survived  1309 non-null   bool[pyarrow]  
 2   name      1309 non-null   string[pyarrow]
 3   sex       1309 non-null   category       
 4   age       1046 non-null   float[pyarrow] 
 5   sibsp     1309 non-null   uint8[pyarrow] 
 6   parch     1309 non-null   uint8[pyarrow] 
 7   fare      1308 non-null   float[pyarrow] 
 8   embarked  1307 non-null   category       
dtypes: bool[pyarrow](1), category(3), float[pyarrow](2), string[pyarrow](1), uint8[pyarrow](2)
memory usage: 57.9 KB


### 💾 Save dataframe with data types

In [15]:
titanic_df.to_parquet(
    DATA_DIR / "02_intermediate/titanic_type_fixed.parquet",
    index=False,
)

## 📊 Analysis of Results

The columns `cabin`, `ticket` and `home.dest` have been removed, the remaining columns have been converted to appropriate data types (pyarrow-backed or categorical), and the `"?"` placeholders have been replaced with `pd.NA`.

With these fixes, the data can be analyzed and visualized correctly.


## 💡 Proposals and Ideas

Try other tools for describing, exploring and analyzing the data, and compare them to see which one works best.


## 📖 References

- <https://pandas.pydata.org/docs/user_guide/pyarrow.html>