In [2]:
# Notebook title: 01 - Data Ingestion and Initial Inspection

# 1. Import pandas, the core library for this analysis
import pandas as pd

# 2. Load the CSV file.
# Make sure the file name matches the one you downloaded from Kaggle!
# The resulting DataFrame is stored in df
try:
    df = pd.read_csv('train.csv')

    # 3. Display the first 5 rows to verify that the file loaded
    print("Loading successful. First 5 rows:")
    print(df.head())

except FileNotFoundError:
    print("ERROR: File 'train.csv' not found. Make sure it is in the same folder as this notebook.")

Loading successful. First 5 rows:
   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  


In [5]:
# Structure and data type diagnostics

print("Dataset dimensions:", df.shape)
print("\nType and null's information:")
df.info()

Dataset dimensions: (418, 12)

Type and null's information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [4]:
# Statistical summary (Numerical and Categorical)

print("\nNumerical Summary:")
print(df.describe())

print("\nCategorical Summary:")
print(df.describe(include='object'))



Numerical Summary:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   418.000000  418.000000  418.000000  332.000000  418.000000   
mean   1100.500000    0.363636    2.265550   30.272590    0.447368   
std     120.810458    0.481622    0.841838   14.181209    0.896760   
min     892.000000    0.000000    1.000000    0.170000    0.000000   
25%     996.250000    0.000000    1.000000   21.000000    0.000000   
50%    1100.500000    0.000000    3.000000   27.000000    0.000000   
75%    1204.750000    1.000000    3.000000   39.000000    1.000000   
max    1309.000000    1.000000    3.000000   76.000000    8.000000   

            Parch        Fare  
count  418.000000  417.000000  
mean     0.392344   35.627188  
std      0.981429   55.907576  
min      0.000000    0.000000  
25%      0.000000    7.895800  
50%      0.000000   14.454200  
75%      0.000000   31.500000  
max      9.000000  512.329200  

Categorical Summary:
                            Name   Sex 

A. Dimensions and Data Types:

1. Dimensions (Dataset Size): The Titanic dataset (`train.csv`) contains 418 records (passengers) and 12 columns (variables/features).

2. Data Types (Dtypes): Three dtypes were identified (`int64`, `float64`, and `object`). They fall into two groups that will guide the cleaning process:

    * Numeric: `int64` and `float64` (Examples: `Age`, `Fare`, `Pclass`).
    * Categorical/Text: object (Examples: `Name`, `Sex`, `Cabin`, `Embarked`, `Ticket`).

3. Columns of Low Relevance: `PassengerId` is a unique identifier and `Ticket` is a high-cardinality variable (many unique values). It is recommended to remove both in the early stages of Feature Engineering to simplify the model.
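
Before dropping anything, the cardinality claim can be checked by comparing each column's unique-value count with the row count. The sketch below is a minimal illustration, not the notebook's own method: `cardinality_report` is a hypothetical helper name (it is not part of pandas), and `sample` is a small hand-built frame that mirrors a few columns of the first rows shown earlier so the example runs without the CSV file.

```python
import pandas as pd


def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the unique-value count and its ratio to the row count, per column."""
    n_unique = df.nunique(dropna=True)
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(df),
    })
    # A ratio close to 1.0 suggests an identifier-like column
    return report.sort_values("unique_ratio", ascending=False)


# Assumption: a tiny stand-in frame; on the real data, pass df instead
sample = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Ticket": ["330911", "363272", "240276", "315154"],
    "Sex": ["male", "female", "male", "male"],
    "Pclass": [3, 3, 2, 3],
})

report = cardinality_report(sample)
print(report)

# errors="ignore" lets the cell be re-run even if the columns are already gone
reduced = sample.drop(columns=["PassengerId", "Ticket"], errors="ignore")
print(reduced.columns.tolist())
```

On the full dataset, a reasonable next step would be to inspect `cardinality_report(df)` and decide per column, since a high ratio alone does not prove a column is useless.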

B. Missing Values: Detection and Treatment Plan

Based on the output of `df.info()`, the following columns have missing data (`Embarked`, which is complete in this file, is included for reference):

| Column | Total Rows | Non-Null Count | Missing Values | Null Percentage (Approx.) | Treatment Plan |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Cabin** | 418 | 91 | 327 | **~78.23%** | **Elimination:** The percentage is critical. The column will be deleted. |
| **Age** | 418 | 332 | 86 | **~20.57%** | **Numerical Imputation:** It will be filled with the **Median**, which is robust to *outliers*. |
| **Fare** | 418 | 417 | 1 | **~0.24%** | **Numerical Imputation:** With a single null, filling with the **Median** (as for `Age`) is a reasonable option. |
| **Embarked** | 418 | 418 | 0 | **0.00%** | **Categorical Imputation:** No nulls in this file. If nulls appear in other data, they will be filled with the **Mode**. |
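
The counts and percentages in the table can be recomputed directly from the DataFrame, and the treatment plan can be expressed as a small function. The following is a sketch under stated assumptions: `missing_summary` and `apply_treatment_plan` are hypothetical helper names, and `sample` is a made-up frame (including the cabin value) used only so the example runs without `train.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """List every column that contains nulls, with counts and percentages."""
    total = len(df)
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "non_null": total - missing,
        "missing": missing,
        "missing_pct": (missing / total * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing_pct", ascending=False)


def apply_treatment_plan(df, drop_cols=("Cabin",), median_cols=("Age",),
                         mode_cols=("Embarked",)):
    """Drop, median-fill, and mode-fill the given columns; returns a new frame."""
    # drop() returns a new DataFrame, so the input is left untouched
    out = df.drop(columns=list(drop_cols), errors="ignore")
    for col in median_cols:
        if col in out.columns:
            out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        if col in out.columns and out[col].isna().any():
            modes = out[col].mode()
            if not modes.empty:  # mode() is empty when the column is all nulls
                out[col] = out[col].fillna(modes.iloc[0])
    return out


# Assumption: hypothetical values chosen to exercise each branch of the plan
sample = pd.DataFrame({
    "Age": [34.5, 47.0, np.nan, 27.0, 22.0],
    "Cabin": [np.nan, np.nan, np.nan, np.nan, "B45"],
    "Embarked": ["Q", "S", None, "S", "S"],
})

summary = missing_summary(sample)
print(summary)

cleaned = apply_treatment_plan(sample)
print(cleaned)
```

On the real data this would be called as `apply_treatment_plan(df)`. Passing extra column names in `median_cols` or `mode_cols` extends the same treatment to any other column that `missing_summary(df)` reports.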
