<div>
<img src="https://coursereport-s3-production.global.ssl.fastly.net/uploads/school/logo/219/original/CT_LOGO_NEW.jpg" width=60>
</div>

# **Lesson 02. Pandas Basics: Getting Started**

When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. 

In pandas, a data table is called a DataFrame.

<img src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg" width="400" align="left">

<img src="https://pandas.pydata.org/docs/_images/02_io_readwrite.svg" width="700" align="left">

pandas offers built-in support for various file formats and data sources, including CSV, Excel, SQL, JSON, and more. Today, we’ll read in a CSV file and explore it with pandas ourselves!

---

# <span style="color:#D34B47">📌 Working with Pandas</span>

## Importing Pandas

The most common convention, and the one you should use, is to import pandas under the alias `pd`.

In [2]:
# Import here!
import pandas as pd

## Creating Our DataFrames

In [None]:
# Number entries



In [None]:
# Text entries



## Understanding DataFrame Indexing

We can either set an existing column as our index or specify an index when creating a DataFrame.

Let's begin by setting an existing column as the index.

Alternatively, we can specify the index when creating a DataFrame via the `index` argument.

You can also reset the index back to its default.
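The three index operations described above can be sketched on a small, made-up grade table. The column names `student_ID`, `math`, and `science` and all values are placeholders chosen for illustration, not the data used in class.

```python
import pandas as pd

# Hypothetical grade table; names and values are illustrative only
grades = pd.DataFrame({
    'student_ID': [101, 102, 103],
    'math': [88, 92, 79],
    'science': [91, 85, 94],
})

# 1. Promote an existing column to the index (returns a new DataFrame)
by_id = grades.set_index('student_ID')

# 2. Supply index labels at creation time via the 'index' argument
labelled = pd.DataFrame(
    {'math': [88, 92, 79], 'science': [91, 85, 94]},
    index=[101, 102, 103],
)

# 3. Go back to the default 0, 1, 2, ... index
restored = by_id.reset_index()             # old index returns as a column
discarded = by_id.reset_index(drop=True)   # old index is thrown away

print(list(restored.columns))    # ['student_ID', 'math', 'science']
print(list(discarded.columns))   # ['math', 'science']
```

None of these calls modifies `grades` or `by_id`; each returns a new DataFrame unless `inplace=True` is passed.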

In [None]:
# Reset index
# Try playing around with 'drop' and 'inplace' and see what they do



---

# <span style="color:#D34B47">📌 Modifying DataFrames</span>

## Renaming Columns 

In [None]:
# Suppose we want to change the names of the first two columns



In [None]:
# Very common method



## Dropping Columns and Rows

There are a few ways to drop columns or rows from a DataFrame. This example focuses only on the `drop` method.
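As a rough sketch of how `drop` behaves, the example below uses a made-up grade table. It assumes `student_ID` has been set as the index, which may differ from the table built in class; all names and values are placeholders.

```python
import pandas as pd

# Hypothetical grade table indexed by student ID; values are illustrative only
grades = pd.DataFrame(
    {'math': [88, 92, 79], 'science': [91, 85, 94]},
    index=pd.Index([101, 973, 205], name='student_ID'),
)

# Drop a column by name; the original DataFrame is left unchanged
no_math = grades.drop(columns='math')

# Drop a row by its index label
no_973 = grades.drop(index=973)

print(no_math.columns.tolist())   # ['science']
print(no_973.index.tolist())      # [101, 205]
```

Using the `columns=` and `index=` keywords avoids having to remember which `axis` number refers to rows and which to columns.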

In [None]:
# Drop the 'math' column



In [None]:
# Drop the row with student_ID 973
# We can make this more robust once we learn `loc` further below
 


## Adding Columns

In [None]:
# Create a new column for history and gym classes


In [None]:
# YOUR TURN - Create a column for english and add the grades 90, 94, and 80.


# ⏸️ PAUSE - The next lesson will continue in this notebook!

---


# **Lesson 03. Pandas Basics: Advanced Techniques**

---

# <span style="color:#D34B47">📌 Reading in a Dataset</span>

### Welcome aboard the Titanic dataset!

<img src="https://cdn.britannica.com/79/4679-050-BC127236/Titanic.jpg" width="200" align="center">

The following dataset has been sourced from Kaggle [here](https://www.kaggle.com/c/titanic). Using pandas, we can explore information about some of the passengers who were on board that fateful night. We can uncover insights like the total number of passengers in our dataset, the number embarking from each port, the average fare, survival rates by gender, and much more! We can let our own curiosity take over.

In [3]:
# Read data via 'pd.read_csv'
# Each file format has its own read function; for example, 'pd.read_excel' imports Excel files

df = pd.read_csv('data/titanic.csv')

---

# <span style="color:#D34B47">📌 Basic Pandas Functions</span>

In [4]:
# 'head' shows the first five rows of the DataFrame by default but you can specify the number of rows in the parentheses

df.head()

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# 'tail' shows the bottom five rows by default

df.tail()

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [6]:
# 'shape' is an attribute (no parentheses) that tells us how many rows and columns a DataFrame has

df.shape

(891, 12)

In [8]:
# Missing values

df.isnull().sum()

Passenger Id      0
Survived          0
Pclass            0
Name              0
Sex               0
Age             177
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Embarked          2
dtype: int64

---

# <span style="color:#D34B47">📌 Selecting a Series/Column in a DataFrame</span>

There are two ways you can select a column of a DataFrame.
1. `df.Name`
2. `df['Name']`

What is the difference between the two? Well, they both do the exact same thing. Why might we prefer one over the other?

Let's see it in action.

In [9]:
# Let's first try it out on the `Name` column

df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [10]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [14]:
# What about `Passenger Id`? The name contains a space, so only bracket notation works here
df['Passenger Id']

0        1
1        2
2        3
3        4
4        5
      ... 
886    887
887    888
888    889
889    890
890    891
Name: Passenger Id, Length: 891, dtype: int64
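One way to see why bracket notation is the safer default is sketched below on a two-row stand-in for the Titanic data. The `count` column is hypothetical and was added only to show a name clash with an existing DataFrame method.

```python
import pandas as pd

# Two rows taken from the Titanic output above; 'count' is a hypothetical
# column added only to demonstrate a clash with DataFrame.count
small = pd.DataFrame({
    'Passenger Id': [1, 3],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'count': [10, 20],
})

# Bracket notation works for any column label
ids = small['Passenger Id']
counts = small['count']

# Dot notation only works for labels that are valid Python identifiers
# and that do not collide with an existing attribute or method.
# (small.Passenger Id would be a syntax error.)
names = small.Name
clash = small.count   # the DataFrame.count method, not the column

print(callable(clash))   # True
```

Dot notation is convenient for quick exploration, but bracket notation behaves the same way for every column name.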

---

# <span style="color:#D34B47">📌 Index-Based Selection Using `iloc`</span>

We use `iloc` to select data based on their numerical position in the DataFrame.

`iloc` takes two arguments: the row position first, followed by the column position. Positions start at 0, so 0 is the first item, 1 is the second, 2 is the third, and so on.

In [15]:
# All rows and all columns
df.iloc[:,:]

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [16]:
# All rows, just the fifth column
# Since the starting index is 0, the fifth column corresponds to index number 4

df.iloc[:,4]

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

In [17]:
# What is this the same as?

df['Sex']

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

In [18]:
# First three rows and all columns

df.iloc[:3,:]

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [19]:
# What is this the same as?

df.head(3)

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


We can also go from the bottom of the DataFrame.

In [20]:
# Bottom five rows of the DataFrame

df.iloc[-5:,:]  

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [21]:
# What is this the same as?

df.tail(5)

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


---

# <span style="color:#D34B47">📌 Label-Based Selection Using `loc`</span>

With `loc` we select by label: we specify the row's index label and the actual name of the column.

In [22]:
# First row of the `Name` column
df.loc[0,'Name']


'Braund, Mr. Owen Harris'

Unlike `iloc`, when we select a range of values, `loc` includes both the start and the end of the range.

For example, with the default index, the first 5 rows are `df.iloc[:5]` under `iloc` but `df.loc[:4]` under `loc`.
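The difference in slice endpoints can be checked on a small frame. The sketch below uses made-up letter data, and the second frame uses an arbitrary non-default index to make the label-based behaviour easier to see.

```python
import pandas as pd

# Illustrative frame with the default integer index 0..5
letters = pd.DataFrame({'letter': list('abcdef')})

by_position = letters.iloc[:5]   # positions 0-4, end excluded
by_label = letters.loc[:4]       # labels 0-4, end included

print(len(by_position), len(by_label))   # 5 5

# Same data with a non-default index (labels chosen arbitrarily)
shifted = pd.DataFrame({'letter': list('abcdef')},
                       index=[10, 20, 30, 40, 50, 60])

print(shifted.loc[10:30, 'letter'].tolist())   # ['a', 'b', 'c']
print(shifted.iloc[0:2]['letter'].tolist())    # ['a', 'b']
```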

In [23]:
# First 5 rows of the `Name`, `Sex` and `Age` column
df.loc[:4,['Name', 'Sex', 'Age']]   


Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0


---

# <span style="color:#D34B47">📌 Using `loc` with Conditions</span>

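We can select rows that satisfy certain conditions. In this section, we will look at how that works. The examples pass a boolean condition inside `df[...]`; the same condition also works inside `df.loc[...]`.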
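A minimal sketch of what happens under the hood, using three rows taken from the `head()` output earlier in this lesson: a comparison produces a boolean Series (a "mask"), and that mask decides which rows are kept.

```python
import pandas as pd

# Three rows taken from the Titanic head() output shown earlier
small = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry'],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, 26.0, 35.0],
})

# A comparison produces a boolean Series, one True/False per row
is_male = small['Sex'] == 'male'
print(is_male.tolist())   # [True, False, True]

# Passing the mask in brackets keeps only the True rows
males = small[is_male]

# With loc, the same mask can be combined with a column selection
male_ages = small.loc[is_male, ['Name', 'Age']]
print(male_ages['Age'].tolist())   # [22.0, 35.0]
```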

In [24]:
# Rows with age 50

df[df['Age'] == 50] 

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
177,178,0,1,"Isham, Miss. Ann Elizabeth",female,50.0,0,0,PC 17595,28.7125,C49,C
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
434,435,0,1,"Silvey, Mr. William Baird",male,50.0,1,0,13507,55.9,E44,S
458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
482,483,0,3,"Rouse, Mr. Richard Henry",male,50.0,0,0,A/5 3594,8.05,,S
526,527,1,2,"Ridsdale, Miss. Lucy",female,50.0,0,0,W./C. 14258,10.5,,S
544,545,0,1,"Douglas, Mr. Walter Donald",male,50.0,1,0,PC 17761,106.425,C86,C
660,661,1,1,"Frauenthal, Dr. Henry William",male,50.0,2,0,PC 17611,133.65,,S
723,724,0,2,"Hodges, Mr. Henry Price",male,50.0,0,0,250643,13.0,,S


In [25]:
# Rows where age is 50 AND sex is female
# This is the result above with the male passengers filtered out

df[(df['Age'] == 50) & (df['Sex'] == 'female')]


Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
177,178,0,1,"Isham, Miss. Ann Elizabeth",female,50.0,0,0,PC 17595,28.7125,C49,C
259,260,1,2,"Parrish, Mrs. (Lutie Davis)",female,50.0,0,1,230433,26.0,,S
299,300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
458,459,1,2,"Toomey, Miss. Ellen",female,50.0,0,0,F.C.C. 13531,10.5,,S
526,527,1,2,"Ridsdale, Miss. Lucy",female,50.0,0,0,W./C. 14258,10.5,,S


In [26]:
# Rows with age 50 OR fare greater than or equal to 200
# Wrap each comparison in its own parentheses: `|` binds more tightly than `==`
df[(df['Age'] == 50) | (df['Fare'] >= 200)]


In [27]:
# All the rows with null cabin column

df[df['Cabin'].isnull()]    

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


The opposite of `isnull` is `notnull`, which returns a boolean Series that is `True` wherever a value is present. Using it as a filter keeps only the rows without null values.
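As a small sketch, the two functions can be compared on the `Cabin` values from the first three rows of the `head()` output above:

```python
import pandas as pd

# Cabin values from the first three rows of the head() output above
cabins = pd.Series([None, 'C85', None], name='Cabin')

print(cabins.isnull().tolist())    # [True, False, True]
print(cabins.notnull().tolist())   # [False, True, False]

# Using the mask as a filter keeps only the entries that are present
known = cabins[cabins.notnull()]
print(known.tolist())              # ['C85']
```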

In [28]:
# All rows with C or Q in Embarked column

df[df['Embarked'].isin(['C','Q'])]  

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [32]:
# This is the same as combining two conditions with the `|` (or) operator

df[(df['Embarked'] == 'C') | (df['Embarked'] == 'Q')]   

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [34]:
# YOUR TURN - Show me female passengers that were older than 60

df[(df['Sex'] == 'female') & (df['Age'] > 60)]

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
483,484,1,3,"Turkula, Mrs. (Hedwig)",female,63.0,0,0,4134,9.5875,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [35]:
# YOUR TURN - Show me male passengers older than 60 that survived

df[(df['Sex'] == 'male') & (df['Age'] > 60) & (df['Survived'] == 1)]

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
570,571,1,2,"Harris, Mr. George",male,62.0,0,0,S.W./PP 752,10.5,,S
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S


---

# <span style="color:#D34B47">📌 Using Summary Functions</span>

### `.info()` and `.describe()`

In [None]:
# Info


In [None]:
# Describe

# What is missing?

In [None]:
# Describe (again!)


### `.unique()`, `.nunique()`, and `.value_counts()`

In [36]:
# How many unique Embarked values are there?

df['Embarked'].nunique()    

3

In [37]:
# What are the unique Embarked values?

df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)
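Note that `nunique()` reported 3 while `unique()` lists four entries. This is because `nunique()` skips missing values by default. A minimal sketch on a made-up Series that mirrors the `Embarked` column:

```python
import pandas as pd

# Illustrative Series with one missing value, mirroring the Embarked column
ports = pd.Series(['S', 'C', 'Q', None, 'S'])

print(ports.nunique())                          # 3, missing values are skipped
print(ports.nunique(dropna=False))              # 4, the missing value is counted
print(ports.value_counts(dropna=False).sum())   # 5, every row is counted
```

`value_counts()` follows the same default, so passing `dropna=False` is one way to make missing entries visible in the counts.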

In [38]:
# What are the counts of those individual values?

df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [39]:
# Normalized?
df['Embarked'].value_counts(normalize=True)


Embarked
S    0.724409
C    0.188976
Q    0.086614
Name: proportion, dtype: float64

In [40]:
# YOUR TURN: What percentage of passengers were male?

df['Sex'].value_counts(normalize=True)

Sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64

In [41]:
# YOUR TURN: What was the most popular Pclass?

df['Pclass'].value_counts(normalize=True)

Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64

---

# <span style="color:#D34B47">📌 Descriptive Statistics</span>

Descriptive statistics provide a concise summary of the key characteristics and distributional properties of a dataset, offering insights into its central tendency, variability, and overall shape.

In [42]:
# What is the oldest age?

df['Age'].max() 

80.0

In [43]:
# Who is that passenger?
# Recall the `loc` indexer; `idxmax` returns the index label of the maximum value

df.loc[df['Age'].idxmax()]  

Passenger Id                                     631
Survived                                           1
Pclass                                             1
Name            Barkworth, Mr. Algernon Henry Wilson
Sex                                             male
Age                                             80.0
SibSp                                              0
Parch                                              0
Ticket                                         27042
Fare                                            30.0
Cabin                                            A23
Embarked                                           S
Name: 630, dtype: object

In [44]:
# What is the youngest age?

df['Age'].min() 

0.42

In [46]:
# Who is that passenger?
df.loc[df['Age'].idxmin()]


Passenger Id                                804
Survived                                      1
Pclass                                        3
Name            Thomas, Master. Assad Alexander
Sex                                        male
Age                                        0.42
SibSp                                         0
Parch                                         1
Ticket                                     2625
Fare                                     8.5167
Cabin                                       NaN
Embarked                                      C
Name: 803, dtype: object

In [47]:
# What is the average age?

df['Age'].mean()    

29.69911764705882

In [48]:
# LATER: What is the average age in each Pclass?

df.groupby('Pclass')['Age'].mean()  

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [49]:
# What is the median fare?

df['Fare'].median()

14.4542

In [50]:
# LATER: What is the median fare in each Pclass?

df.groupby('Pclass')['Fare'].median()   

Pclass
1    60.2875
2    14.2500
3     8.0500
Name: Fare, dtype: float64

In [51]:
# What is the most frequent port of embarkation?

df['Embarked'].mode()   


0    S
Name: Embarked, dtype: object

In [52]:
# LATER: What was the most expensive ticket?
df['Fare'].max()    


512.3292

In [53]:
# YOUR TURN: Who paid for it?

df.loc[df['Fare'].idxmax()] 

Passenger Id                 259
Survived                       1
Pclass                         1
Name            Ward, Miss. Anna
Sex                       female
Age                         35.0
SibSp                          0
Parch                          0
Ticket                  PC 17755
Fare                    512.3292
Cabin                        NaN
Embarked                       C
Name: 258, dtype: object
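Keep in mind that `idxmax` returns only the first label at which the maximum occurs. The sketch below uses four rows that appear in the `sort_values` output of this lesson to show one way of retrieving every passenger tied for the highest fare.

```python
import pandas as pd

# Four rows (index labels, names, fares) taken from this lesson's outputs
fares = pd.DataFrame(
    {
        'Name': ['Fortune, Miss. Mabel Helen', 'Ward, Miss. Anna',
                 'Cardeza, Mr. Thomas Drake Martinez', 'Lesurer, Mr. Gustave J'],
        'Fare': [263.0, 512.3292, 512.3292, 512.3292],
    },
    index=[88, 258, 679, 737],
)

# idxmax returns only the first index label at which the maximum occurs
print(fares['Fare'].idxmax())   # 258

# Comparing against the maximum keeps every tied row
top = fares[fares['Fare'] == fares['Fare'].max()]
print(top.index.tolist())       # [258, 679, 737]
```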

In [54]:
# LATER: Sort the DataFrame by fare price

df.sort_values('Fare', ascending=False)

Unnamed: 0,Passenger Id,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
...,...,...,...,...,...,...,...,...,...,...,...,...
633,634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S
413,414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0000,,S
822,823,0,1,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0000,,S
732,733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0000,,S


There are more functions for descriptive statistics than those shown here. If you are interested, have a look at [this page](https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm).

---

# <span style="color:#D34B47">📌 Applying Functions to Columns in Pandas</span>

How can I use a custom function to extract the titles from the passenger names?

In [55]:
import re

def extract_title(name):
    # Raw string (r'...') so that '\.' reaches the regex engine as a literal dot
    # without triggering an invalid-escape warning
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

df['Title'] = df['Name'].apply(extract_title)
df[['Name', 'Title']].head()


Unnamed: 0,Name,Title
0,"Braund, Mr. Owen Harris",Mr
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Mrs
2,"Heikkinen, Miss. Laina",Miss
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Mrs
4,"Allen, Mr. William Henry",Mr
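As a possible alternative to `apply`, the same pattern can be passed to pandas' vectorized string accessor. This is a sketch on three names taken from the output above, not a replacement for the approach used in the lesson.

```python
import pandas as pd

# Three names taken from the head() output above
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
], name='Name')

# Same pattern as extract_title, written as a raw string so that
# the backslash reaches the regex engine unchanged
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

print(titles.tolist())   # ['Mr', 'Miss', 'Mrs']
```

One difference to be aware of: where the pattern does not match, `str.extract` yields a missing value rather than the empty string returned by `extract_title`.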


# <span style="color:#4CCFA2">Key Takeaways</span>

- **Data Loading and Inspection**: We loaded data from a CSV file into a pandas DataFrame and inspected its structure to understand the dataset's composition and characteristics.

```python
import pandas as pd

# Load Titanic dataset into a DataFrame
df = pd.read_csv('data/titanic.csv')

# Inspect the structure of the DataFrame
df.head()
```

- **Data Manipulation**: Using pandas, we gained proficiency in manipulating the data, including selecting specific columns, filtering rows based on conditions, and creating new columns based on existing data.

```python
# Selecting specific columns
selected_columns = df[['Age', 'Sex', 'Survived']]

# Filtering rows based on conditions
females_survived = df.loc[(df['Sex'] == 'female') & (df['Survived'] == 1)]

# Creating new columns
df['Family_Size'] = df['SibSp'] + df['Parch']
```

- **Descriptive Analysis**: We developed the ability to conduct descriptive analysis, such as calculating summary statistics and exploring patterns in the data to uncover trends and relationships.

```python
df.describe()
df['Age'].max()
```

- **Feature Engineering with Custom Functions**: The apply() method allows you to apply a function along an axis of the DataFrame or to a Series, which is particularly useful for more complex data manipulations that go beyond simple arithmetic operations.

```python
df['New_Column'] = df['Column'].apply(function)
```
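To illustrate the difference between applying a function to a Series and applying it along an axis of a DataFrame, here is a minimal sketch. The `SibSp` and `Parch` values belong to three passengers shown in outputs earlier in this lesson; the lambda functions are illustrative.

```python
import pandas as pd

# SibSp and Parch values for three passengers shown earlier in this lesson
small = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 2]})

# Series.apply: the function receives one value at a time
has_siblings = small['SibSp'].apply(lambda n: n > 0)

# DataFrame.apply with axis=1: the function receives one row at a time
family_size = small.apply(lambda row: row['SibSp'] + row['Parch'], axis=1)

print(has_siblings.tolist())   # [True, False, True]
print(family_size.tolist())    # [1, 0, 5]
```

For simple arithmetic like this, adding the two columns directly is usually shorter and faster; row-wise `apply` is better kept for logic that cannot be expressed with column operations.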


These techniques and insights serve as a solid foundation for further data analysis and exploration using pandas. Next lesson, we'll dive even further! 🚀