
# 🧩 Titanic Dataset Exploration with pandas

🎯 **Goal:**  
In this notebook, you'll explore the Titanic dataset using **pandas**.  
You'll practice the most common pandas functions for data inspection, selection, filtering, cleaning, and analysis.

For each function in the list below:
1. Explain what it does (in your own words, in a Markdown cell).
2. Give at least **two examples** using the Titanic dataset.
3. Add a short comment about the output or why it's useful.


In [24]:

import pandas as pd

# Load Titanic dataset
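# (Make sure titanic_dataset.csv is in the same folder as this notebook)
df = pd.read_csv("titanic_dataset.csv")

# Display the DataFrame (only the last expression in a cell is shown;
# pandas abbreviates a long table to its first and last rows)
df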


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C



## 🧠 Step 1: Inspecting the Data

Functions to explore:
- df.head()
- df.tail()
- df.info()
- df.describe()
- df.shape
- df.columns


In [25]:

# Example: df.head()
df.head()

# Example 2
df.head(10)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


By default, head() shows the first 5 rows when no argument is passed.

In the second example the argument is 10, so head() shows rows 0 through 9.

In [26]:
df.tail()
df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


This function is similar to head(), but by default it shows the last 5 rows.
If it receives an argument n, it shows the last n rows.
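
The behaviour of head() and tail() described above can be checked on a tiny table. This is a minimal sketch: `sample` is a hand-built DataFrame (an assumption, using a few values copied from the output above) so that it runs without titanic_dataset.csv. Because a notebook cell only displays its last expression, the sketch uses `print()` to show every result.

```python
import pandas as pd

# Hand-built sample (six passengers copied from the output above),
# used only so the example runs without the CSV file.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

first_two = sample.head(2)       # first 2 rows (index 0 and 1)
last_two = sample.tail(2)        # last 2 rows (index 4 and 5)
all_but_last = sample.head(-1)   # negative n: every row except the last one

print(first_two)
print(last_two)
print(len(all_but_last))
```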

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


This method prints information about a DataFrame, including the dtype of the index and of each column, the non-null counts, and the memory usage.

In [28]:
df.info(verbose=False,memory_usage=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Columns: 12 entries, PassengerId to Embarked
dtypes: float64(2), int64(5), object(5)

We can choose which details are shown. In this case I ask it not to show the per-column information or the memory used.

Parameters:
    - verbose, max_cols, buf, memory_usage, show_counts.
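
Two of the parameters listed above, `buf` and `show_counts`, are not used in the cells so far. The sketch below shows one possible use: writing the report into a string instead of printing it. The `sample` DataFrame and the `report` variable are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Illustrative three-row table with missing values in both columns.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Cabin": [None, "C85", None],
})

# info() prints to stdout by default; buf redirects the text elsewhere.
buffer = io.StringIO()
sample.info(buf=buffer, show_counts=True, memory_usage=False)

report = buffer.getvalue()   # the same text info() would have printed
print(report)
```

Capturing the report this way can be handy when it should go to a log file rather than the notebook output.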

In [29]:
df.describe()


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


By default, it shows descriptive statistics for the numeric columns.

In [30]:
df.Age.describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [31]:
df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,G6,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
max,891.0,1.0,3.0,,,80.0,8.0,6.0,,512.3292,,


The two previous calls show that describe() can return statistics for a single column, or even include non-numeric columns.
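
As a further sketch of the same idea, describe() can be pointed at a text column only, or asked for custom percentiles. The `sample` table is a hand-built assumption (seven passengers copied from the outputs above), and the percentile values are arbitrary choices for illustration.

```python
import pandas as pd

# Hand-built sample: seven passengers copied from the outputs above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0],
})

# On a text column, describe() reports count, unique, top and freq.
text_summary = sample[["Sex"]].describe()

# On a numeric column, the percentiles argument selects which quantiles appear.
age_summary = sample["Age"].describe(percentiles=[0.1, 0.9])

print(text_summary)
print(age_summary)
```

Note that `count` for Age is 6, not 7, because missing values are ignored.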


In [None]:
df.shape

(891, 12)

It returns a tuple with the number of rows and the number of columns.

In [36]:
type(df.shape)

tuple

As we can see, the result is a tuple. shape is an attribute, not a method, which is why it is written df.shape and not df.shape().

In [41]:
filas = df.shape[0]            # number of rows
print(filas)
columnas = df.shape[1]         # number of columns
print(columnas)
print("Before:", df.shape)
borrado_datos = df.dropna()    # drop every row with at least one missing value
print("After:", borrado_datos.shape)


891
12
Before: (891, 12)
After: (183, 12)


The number of rows and the number of columns can be stored in separate variables so that we can operate on them.
shape is also useful for checking how many rows remain after dropping nulls or duplicates.
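
In the cell above, dropna() without arguments removes every row that has a missing value in any column, so only 183 of 891 rows survive (Cabin alone has just 204 non-null values). A gentler option is to restrict the check to chosen columns. The sketch below uses made-up toy values, not real Titanic rows, to show how shape can report the rows lost at each step.

```python
import pandas as pd

# Toy values (not real passengers): one missing Age, one duplicated row.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", "C123"],
})

rows_before = sample.shape[0]

with_age = sample.dropna(subset=["Age"])   # only rows missing Age are dropped
deduplicated = sample.drop_duplicates()    # repeated rows are dropped

print("Dropped for missing Age:", rows_before - with_age.shape[0])
print("Dropped as duplicates:", rows_before - deduplicated.shape[0])
```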

In [42]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

df.columns returns an Index object with the names of the DataFrame's columns.

In [44]:
print(df.columns[-1])

Embarked


One of its uses is to look up a specific column name, such as the name of the last column.

In [45]:
columns_lowe = [col.lower() for col in df.columns]  
print(columns_lowe)


['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked']


It is also useful in data cleaning, for example to rename all the columns at once (the cell above only builds the list of lowercase names).
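
To actually apply the new names, the list has to be assigned back to the DataFrame. A minimal sketch of two common ways, using a two-row `sample` table as an assumption so it runs on its own:

```python
import pandas as pd

# Tiny stand-in table with three of the Titanic column names.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
})

# Option 1: rename() returns a new DataFrame and leaves the original untouched.
lowered = sample.rename(columns=str.lower)

# Option 2: assigning to .columns renames the existing DataFrame in place.
sample.columns = [col.lower() for col in sample.columns]

print(list(lowered.columns))
print(list(sample.columns))
```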


## 🔍 Step 2: Selecting Data

Functions to explore:
- df["column"]
- df[["col1", "col2"]]
- df.loc[]
- df.iloc[]


In [None]:

# Example: Selecting columns
df["Age"].head()
df[["Sex", "Age", "Survived"]].head()


Unnamed: 0,Sex,Age,Survived
0,male,22.0,0
1,female,38.0,1
2,female,26.0,1
3,female,35.0,1
4,male,35.0,0


In [51]:
df_subset = df[['Age', 'Sex']]   # select two columns as a new DataFrame
print(df_subset)


      Age     Sex
0    22.0    male
1    38.0  female
2    26.0  female
3    35.0  female
4    35.0    male
..    ...     ...
886  27.0    male
887  19.0  female
888   NaN  female
889  26.0    male
890  32.0    male

[891 rows x 2 columns]

Useful for creating subsets and working with them.

In [58]:
df.loc[0:10,['Name','Age']]

Unnamed: 0,Name,Age
0,"Braund, Mr. Owen Harris",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
5,"Moran, Mr. James",
6,"McCarthy, Mr. Timothy J",54.0
7,"Palsson, Master. Gosta Leonard",2.0
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,"Nasser, Mrs. Nicholas (Adele Achem)",14.0
10,"Sandstrom, Miss. Marguerite Rut",4.0


loc is used to select by label, for example the Name and Age columns.

In [59]:
df.loc[df["Age"] > 30, ["Name", "Age", "Sex"]]


Unnamed: 0,Name,Age,Sex
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female
4,"Allen, Mr. William Henry",35.0,male
6,"McCarthy, Mr. Timothy J",54.0,male
11,"Bonnell, Miss. Elizabeth",58.0,female
...,...,...,...
873,"Vander Cruyssen, Mr. Victor",47.0,male
879,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",56.0,female
881,"Markun, Mr. Johann",33.0,male
885,"Rice, Mrs. William (Margaret Norton)",39.0,female


It can also filter rows by the values of a column, in this case passengers older than 30.

In [60]:
borrado_datos.loc[borrado_datos['Sex']=='male', 'Sex']='M'
borrado_datos.loc[0:,['Name','Sex','Age']]

Unnamed: 0,Name,Sex,Age
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
6,"McCarthy, Mr. Timothy J",M,54.0
10,"Sandstrom, Miss. Marguerite Rut",female,4.0
11,"Bonnell, Miss. Elizabeth",female,58.0
...,...,...,...
871,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0
872,"Carlsson, Mr. Frans Olof",M,33.0
879,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0
887,"Graham, Miss. Margaret Edith",female,19.0


loc can also modify values by direct assignment. In this example we replace male with M in the Sex column of borrado_datos.
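
The function list for this step also includes df.iloc[], which the cells above do not demonstrate. The sketch below contrasts the two under one assumption: a hand-built `sample` table whose index labels (0, 2, 4, 5) deliberately have gaps, as happens after rows are dropped. loc slices by label and includes the end label, while iloc slices by position and excludes the end position.

```python
import pandas as pd

# Hand-built sample; the index has gaps, as it would after dropping rows.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
            "Moran, Mr. James",
        ],
        "Age": [22.0, 26.0, 35.0, None],
    },
    index=[0, 2, 4, 5],
)

# loc: label-based, end label included, so labels 0, 2 and 4 are returned.
by_label = sample.loc[0:4, ["Name", "Age"]]

# iloc: position-based, end excluded, so only the first two rows are returned.
by_position = sample.iloc[0:2, [0, 1]]

print(by_label)
print(by_position)
```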


## 🔎 Step 3: Filtering Rows

Functions to explore:
- df[df["Age"] > 30]
- df.query("Sex == 'female' and Survived == 1")


In [None]:

# Example: Filtering data
df[df["Age"] > 50].head()
df.query("Sex == 'female' and Survived == 1").head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
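
Both filtering styles from the list above can express the same condition. The following sketch, on a hand-built `sample` table (five passengers copied from the outputs above), compares a boolean mask with query(). The `min_age` variable is a hypothetical name introduced only to show the `@` syntax.

```python
import pandas as pd

# Hand-built sample: five passengers copied from the outputs above.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Boolean mask: combine conditions with & (and) or | (or);
# each condition needs its own parentheses.
mask = (sample["Sex"] == "female") & (sample["Age"] > 30)
with_mask = sample[mask]

# query(): the same filter written as a string expression.
with_query = sample.query("Sex == 'female' and Age > 30")

# query() can read Python variables when they are prefixed with @.
min_age = 30
with_variable = sample.query("Age > @min_age")

print(with_mask)
print(with_variable)
```

The mask form works with any column name, while query() tends to read more naturally when several conditions are chained.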



## 🧹 Step 4: Handling Missing Data

Functions to explore:
- df.isna()
- df.isna().sum()
- df.dropna()
- df.fillna()


In [None]:

# Example: Check missing values
df.isna().sum()

# Fill missing ages with median
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Age"].head()


0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
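
The four functions in this step can be tried together on a small table. This is a sketch under stated assumptions: `sample` is hand-built from the first six passengers shown earlier, and the placeholder text "Unknown" for Cabin is an arbitrary choice, not something the notebook prescribes.

```python
import pandas as pd

# Hand-built sample: the first six passengers shown earlier.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "Cabin": [None, "C85", None, "C123", None, None],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = sample.isna().sum()

# dropna(subset=...) only drops rows that miss the listed columns.
rows_with_age = sample.dropna(subset=["Age"])

# fillna() accepts a dict to use a different fill value per column.
filled = sample.fillna({"Age": sample["Age"].median(), "Cabin": "Unknown"})

print(missing_per_column)
print(filled)
```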


## 📊 Step 5: Grouping and Aggregating

Functions to explore:
- df.groupby("Sex")["Survived"].mean()
- df["Fare"].mean()
- df["Age"].median()


In [None]:

# Example: Aggregation
df.groupby("Sex")["Survived"].mean()
df.groupby("Pclass")["Fare"].mean()


Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
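
A group can be summarised with several statistics at once through agg(). The sketch below assumes a hand-built `sample` table (the first six passengers shown earlier), so its numbers describe only those six rows and not the full dataset.

```python
import pandas as pd

# Hand-built sample: the first six passengers shown earlier.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_by_sex = sample.groupby("Sex")["Survived"].mean()

# agg() with a list returns one column per statistic.
fare_stats = sample.groupby("Pclass")["Fare"].agg(["count", "mean", "max"])

print(survival_by_sex)
print(fare_stats)
```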


## 📈 Step 6: Sorting and Counting

Functions to explore:
- df.sort_values("Age")
- df["Sex"].unique()
- df["Pclass"].value_counts()


In [None]:

# Example: Sorting and counting
df.sort_values("Age").head()
df["Pclass"].value_counts()


Pclass
3    491
1    216
2    184
Name: count, dtype: int64
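
A few variations on these three functions, sketched on a hand-built `sample` table (the first six passengers shown earlier, an assumption made so the code runs without the CSV file):

```python
import pandas as pd

# Hand-built sample: the first six passengers shown earlier.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# Descending sort; rows with a missing Age are placed last by default.
oldest_first = sample.sort_values("Age", ascending=False)

# unique() returns the distinct values in order of first appearance.
sexes = sample["Sex"].unique()

# normalize=True turns the counts into proportions that sum to 1.
class_share = sample["Pclass"].value_counts(normalize=True)

print(oldest_first)
print(sexes)
print(class_share)
```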


## ⚙️ Step 7: Creating or Modifying Columns

Functions to explore:
- df.assign()
- df.apply()
- df["new_col"] = ...
- pd.concat()
- pd.merge()


In [None]:

# Example: Create new column
df["Fare_per_Age"] = df["Fare"] / df["Age"]
df[["Age", "Fare", "Fare_per_Age"]].head()


Unnamed: 0,Age,Fare,Fare_per_Age
0,22.0,7.25,0.329545
1,38.0,71.2833,1.875876
2,26.0,7.925,0.304808
3,35.0,53.1,1.517143
4,35.0,8.05,0.23
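
The other functions listed for this step (assign, apply, pd.concat, pd.merge) could be exercised as in the sketch below. Everything new here is hypothetical and introduced only for illustration: the `FamilySize`, `AgeGroup` and `ClassName` columns, the age threshold of 18, and the `class_names` lookup table. The passenger values are copied from the outputs above.

```python
import pandas as pd

# Three passengers copied from the outputs above.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Age": [22.0, 38.0, 26.0],
})

# assign(): returns a new DataFrame with the extra column.
with_family = passengers.assign(
    FamilySize=lambda d: d["SibSp"] + d["Parch"] + 1
)

# apply(): run a Python function on every value of a column.
with_family["AgeGroup"] = with_family["Age"].apply(
    lambda age: "adult" if age >= 18 else "child"
)

# concat(): stack DataFrames that share the same columns.
more = pd.DataFrame({
    "PassengerId": [4], "Pclass": [1], "SibSp": [1], "Parch": [0], "Age": [35.0],
})
stacked = pd.concat([passengers, more], ignore_index=True)

# merge(): join on a key column, like a SQL join.
class_names = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "ClassName": ["First", "Second", "Third"],
})
labelled = stacked.merge(class_names, on="Pclass", how="left")

print(with_family)
print(labelled)
```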



## 💾 Step 8: Exporting Data

Function to explore:
- df.to_csv("output.csv", index=False)


In [None]:

# Example: Save cleaned data
df.to_csv("titanic_cleaned.csv", index=False)
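
To see what `index=False` changes, the CSV text can be inspected without writing a file: to_csv() returns a string when no path is given. A minimal sketch on a two-row `sample` table (values copied from the outputs above):

```python
import io

import pandas as pd

# Two passengers copied from the outputs above.
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Fare": [7.25, 71.2833],
})

csv_without_index = sample.to_csv(index=False)
csv_with_index = sample.to_csv()   # the index becomes an extra, unnamed first column

print(csv_without_index.splitlines()[0])
print(csv_with_index.splitlines()[0])

# Reading the text back gives a table with the same columns and shape.
restored = pd.read_csv(io.StringIO(csv_without_index))
print(restored.shape)
```

Leaving the index out avoids an extra unnamed column appearing when the file is loaded again.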



## 🧩 Step 9: Summary

Reflect on what you learned:
- Which functions were most useful?
- What insights did you gain from the Titanic dataset?
