**Table of contents**<a id='toc0_'></a>    
- [Short intro to NumPy](#toc1_)    
- [Introduction to Pandas Library](#toc2_)    
  - [Key Features of Pandas](#toc2_1_)    
  - [DataFrames and Series in Pandas](#toc2_2_)    
- [Dataframes in Pandas](#toc3_)    
  - [Creating DataFrames](#toc3_1_)    
    - [From Dictionaries](#toc3_1_1_)    
    - [From Lists of Lists](#toc3_1_2_)    
    - [From NumPy arrays](#toc3_1_3_)    
    - [From a file / URL](#toc3_1_4_)    
  - [DataFrame operations](#toc3_2_)    
    - [`head()` and `tail()`](#toc3_2_1_)    
    - [`index, columns and values`](#toc3_2_2_)    
  - [Data access in Dataframe](#toc3_3_)    
    - [Columns](#toc3_3_1_)    
    - [Rows](#toc3_3_2_)    
    - [Specific values](#toc3_3_3_)    
    - [⚠️ WARNING: Changing values in DataFrames ⚠️ (🔰)](#toc3_3_4_)    
  - [More DataFrame Methods](#toc3_4_)    
    - [`shape`](#toc3_4_1_)    
    - [`describe()`](#toc3_4_2_)    
    - [`info()`](#toc3_4_3_)    
    - [`nunique() and unique()`](#toc3_4_4_)    
    - [`dtypes`](#toc3_4_5_)    
    - [`select_dtypes()`](#toc3_4_6_)    
    - [Aggregation such as `max()`](#toc3_4_7_)    
  - [💡 Check for understanding](#toc3_5_)    
- [Series in Pandas](#toc4_)    
  - [Creating Series](#toc4_1_)    
    - [From a list](#toc4_1_1_)    
    - [From a list with index](#toc4_1_2_)    
    - [From a dictionary](#toc4_1_3_)    
    - [From a file](#toc4_1_4_)    
  - [Data access in Series](#toc4_2_)    
  - [Methods in Series](#toc4_3_)    
    - [`concat()`](#toc4_3_1_)    
    - [`sort_values() and sort_index()`](#toc4_3_2_)    
    - [`value_counts()`](#toc4_3_3_)    
  - [💡 Check for understanding](#toc4_4_)    
- [Summary](#toc5_)    
- [Extra: Creating Dataframes from a Dictionary](#toc6_)    
- [Extra: pickle](#toc7_)    
  - [Saving DataFrames with Pickle](#toc7_1_)    
  - [Loading DataFrames from Pickle](#toc7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [97]:
# But first... how do we import libraries?
# from library import function
# import library
# import library as liby

In [2]:
%pip install numpy




# <a id='toc1_'></a>[Short intro to NumPy](#toc0_)

Let's start by mentioning NumPy as Pandas builds on top of NumPy.

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with an extensive collection of mathematical functions to operate on these arrays efficiently.

In [4]:
lst_ = [1, 2, 3, 4, 5]
lst_ * 2

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

In [3]:
import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Multiply by 2
arr * 2

array([ 2,  4,  6,  8, 10])

# <a id='toc2_'></a>[Introduction to Pandas Library](#toc0_)

<iframe src="https://giphy.com/embed/QoCoLo2opwUW4" width="480" height="278" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/panda-playing-QoCoLo2opwUW4">via GIPHY</a></p>

`pandas` is a powerful library designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more. It revolves around two main data structures: `Series` and `DataFrame`.

## <a id='toc2_1_'></a>[Key Features of Pandas](#toc0_)

- Robust data import/export: `pandas` provides efficient tools to read and write data from various file formats like CSV, XLS, SQL, Parquet, SPSS, HDF5, and more.
- Data manipulation: With `pandas`, you can easily filter, add, or delete data, enabling seamless data processing.
- High-performance and versatility: It combines the performance of `numpy` arrays with the ability to handle tabulated data efficiently.

To import the `pandas` library, we use the following syntax:

In [4]:
import pandas as pd  # 'pd' is the common abbreviation for pandas

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/pandas.svg?raw=true)

## <a id='toc2_2_'></a>[DataFrames and Series in Pandas](#toc0_)

In `pandas`, a `DataFrame` is a two-dimensional data structure that stores data in tabular form, with labeled rows and columns. Each row represents an observation, and each column represents a variable. `DataFrame` can handle heterogeneous data with different types (numeric, string, boolean, etc.). It also includes variable names, types, and methods to access and modify the data.

```python
# Creating a DataFrame
df = pd.DataFrame(data, ...)
```

On the other hand, a `Series` is a one-dimensional array of data with associated labels called the *index*. If no index is specified, it generates an ordered sequence of integers.

```python
# Creating a Series
s = pd.Series(data, index=index)
```
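
As a concrete illustration of the two placeholders above, the sketch below builds one `Series` and one `DataFrame` from made-up values. The variable names (`scores`, `default_scores`, `people`) and the data are assumptions chosen only for this example.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
scores = pd.Series([98.5, 95.0, 88.0], index=["a", "b", "c"])

# Without an explicit index, pandas generates 0, 1, 2, ...
default_scores = pd.Series([98.5, 95.0, 88.0])

# A DataFrame: several columns that share the same index
people = pd.DataFrame(
    {"name": ["Paula", "Mark", "Ana"], "score": [98.5, 95.0, 88.0]},
    index=["a", "b", "c"],
)

print(scores["b"])                 # look up a value by its label
print(list(default_scores.index))  # the generated integer index
print(people.shape)                # (rows, columns)
```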

# <a id='toc3_'></a>[Dataframes in Pandas](#toc0_)

## <a id='toc3_1_'></a>[Creating DataFrames](#toc0_)

### <a id='toc3_1_1_'></a>[From Dictionaries](#toc0_)

A simple and common way to create a DataFrame is from a dictionary where keys are column names, and values are lists or arrays representing data for each column.

In [89]:
beatles_dict = {
    'name': ["John Lenon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"],
    'instrument': ["Vocals", "Bass", "Bass", "Drums", "Violin"],
    'tenure': [9, 12, 12, 8, 1],
    'num_fans': [9000, 2400, 2000, 1600, 6]
}
beatles_dict

{'name': ['John Lenon',
  'Paul McCartney',
  'George Harrison',
  'Ringo Starr',
  'Hanif Kantor'],
 'instrument': ['Vocals', 'Bass', 'Bass', 'Drums', 'Violin'],
 'tenure': [9, 12, 12, 8, 1],
 'num_fans': [9000, 2400, 2000, 1600, 6]}

In [75]:
beatles_df = pd.DataFrame(beatles_dict)
beatles_df

Unnamed: 0,name,instrument,tenure,num_fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6


In [15]:
# Select the row labeled 1; the resulting Series has dtype object because it mixes strings and integers
pd.DataFrame(beatles_dict).loc[1]

name          Paul McCartney
instrument              Bass
tenure                    12
num_fans                2400
Name: 1, dtype: object

In [11]:
beatles_df = pd.DataFrame(beatles_dict)

In [12]:
type(beatles_df)

pandas.core.frame.DataFrame

In [13]:
beatles_df.dtypes

name          object
instrument    object
tenure         int64
num_fans       int64
dtype: object

### <a id='toc3_1_2_'></a>[From Lists of Lists](#toc0_)

You can also create a DataFrame from a list of lists. There are two ways to arrange the data:

- record-oriented lists, where each inner list is one row (not very common):

In [25]:
data = [
    ["John Lenon", "Vocals", 9, 9000],
    ["Paul McCartney", "Bass", 12, 2400],
    ["George Harrison", "Bass", 12, 2000],
    ["Ringo Starr", "Drums", 8, 1600],
    ["Hanif Kantor", "Violin", 1, 6]
]

# Creating the DataFrame with specified column names
beatles_df = pd.DataFrame(data, columns=['Name', 'Instrument', 'Tenure (years)', 'Num Fans'])
beatles_df

Unnamed: 0,Name,Instrument,Tenure (years),Num Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6


- column-oriented lists, where each list holds one variable (much more common):

In [31]:
names = ["John Lenon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]
instruments = ["Vocals", "Bass", "Bass", "Drums", "Violin"]
tenure = [9, 12, 12, 8, 1]
num_fans = [9000, 2400, 2000, 1600, 6]

# Initial DF
beatles_df = pd.DataFrame([names, instruments, tenure, num_fans] )
beatles_df

Unnamed: 0,0,1,2,3,4
0,John Lenon,Paul McCartney,George Harrison,Ringo Starr,Hanif Kantor
1,Vocals,Bass,Bass,Drums,Violin
2,9,12,12,8,1
3,9000,2400,2000,1600,6


In [32]:
# Each list became a row instead of a column, so we need to spin it... or transpose it
beatles_df = beatles_df.T
beatles_df

Unnamed: 0,0,1,2,3
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6
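
An alternative to transposing, sketched here with the same four lists: `zip()` pairs the lists up row by row, so the DataFrame is built in the right orientation from the start and each column keeps its own dtype. The name `zipped_df` and the column labels are illustrative choices, not part of the original lesson.

```python
import pandas as pd

names = ["John Lenon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]
instruments = ["Vocals", "Bass", "Bass", "Drums", "Violin"]
tenure = [9, 12, 12, 8, 1]
num_fans = [9000, 2400, 2000, 1600, 6]

# zip() turns four column lists into five row tuples
rows = list(zip(names, instruments, tenure, num_fans))

zipped_df = pd.DataFrame(rows, columns=["Name", "Instrument", "Tenure", "Number of Fans"])

# Numeric columns stay numeric here; after .T on mixed data every column holds generic objects
print(zipped_df.dtypes)
```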


In [83]:
beatles_df = pd.DataFrame({
    'Name': names,
    'Instrument': instruments,
    'Tenure': tenure,
    'Number of Fans': num_fans
})
beatles_df

Unnamed: 0,Name,Instrument,Tenure,Number of Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6


### <a id='toc3_1_3_'></a>[From NumPy arrays](#toc0_)

Whilst it's possible to convert lists of lists into DataFrames, it's much more likely you'll be converting NumPy arrays to DataFrames instead so you can operate on them more easily:

In [29]:
data = np.array([
    ["John Lenon", "Vocals", 9, 9000],
    ["Paul McCartney", "Bass", 12, 2400],
    ["George Harrison", "Bass", 12, 2000],
    ["Ringo Starr", "Drums", 8, 1600],
    ["Hanif Kantor", "Violin", 1, 6]
])

# Creating the DataFrame with specified column names
beatles_df = pd.DataFrame(data, columns=['Name', 'Instrument', 'Tenure (years)', 'Num Fans'])
beatles_df

Unnamed: 0,Name,Instrument,Tenure (years),Num Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6
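
One caveat worth checking, sketched below with a shortened version of the array: a NumPy array holds a single dtype, so mixing strings and numbers makes NumPy store everything as text. The numeric columns then need an explicit conversion before any arithmetic; `astype(int)` is one way to do it.

```python
import numpy as np
import pandas as pd

data = np.array([
    ["John Lenon", "Vocals", 9, 9000],
    ["Paul McCartney", "Bass", 12, 2400],
])

# 'U' means Unicode string: the integers were converted to text
print(data.dtype.kind)

df = pd.DataFrame(data, columns=["Name", "Instrument", "Tenure (years)", "Num Fans"])

# Convert the text columns back to integers before doing arithmetic
df["Tenure (years)"] = df["Tenure (years)"].astype(int)
df["Num Fans"] = df["Num Fans"].astype(int)
print(df["Num Fans"].sum())
```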


### <a id='toc3_1_4_'></a>[From a file / URL](#toc0_)

We will use `read_csv` to read data from a CSV file and create a DataFrame. (We can use parameter `usecols` and the method `squeeze("columns")` to create a Series instead - see Extra).

In [5]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_df = pd.read_csv(url)

# Types of files
# excel (personally only recommend using Excel for client facing tables as it's heavier than .csv and can get messier)
# parquet (very large files, can be used in finance, retail, bioinformatics, etc.)
# JSON (typically used for website contents -> web scraping sessions)
# sql (when you have a connection with a local/cloud database)
# SPSS (from different scientific software, typically in survey settings)
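
The same `read_csv` call also accepts file-like objects, which makes it easy to try out without a network connection. The sketch below uses a three-row, made-up excerpt (the `csv_text` string and the variable names are assumptions for illustration) and shows the `usecols` plus `squeeze("columns")` combination mentioned above.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a real file or URL
csv_text = """PassengerId,Survived,Name
1,0,"Braund, Mr. Owen Harris"
2,1,"Cumings, Mrs. John Bradley"
3,1,"Heikkinen, Miss. Laina"
"""

# Read every column into a DataFrame
sample_df = pd.read_csv(io.StringIO(csv_text))

# Read a single column, then squeeze the one-column DataFrame into a Series
survived = pd.read_csv(io.StringIO(csv_text), usecols=["Survived"]).squeeze("columns")

print(sample_df.shape)
print(type(survived))
```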

## <a id='toc3_2_'></a>[DataFrame operations](#toc0_)

### <a id='toc3_2_1_'></a>[`head()` and `tail()`](#toc0_)

The `head()` method returns the first few rows of a DataFrame or Series, while the `tail()` method returns the last few rows.

In [30]:
# Default head and tail
display(titanic_df.head())  # the beginning of the file
titanic_df.tail()  # the end of the file
# NaN = Not a Number; pandas uses it to mark missing values

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [37]:
# Custom # of rows
display(titanic_df.head(3))
titanic_df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [38]:
# Print vs display for DataFrames
print(titanic_df.head(3))
display(titanic_df.head(3))

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


### <a id='toc3_2_2_'></a>[`index, columns and values`](#toc0_)

The `index` attribute in Pandas returns the index labels of a DataFrame or Series, and the `columns` attribute returns the column labels of a DataFrame.

In [39]:
titanic_df.index

RangeIndex(start=0, stop=891, step=1)

`RangeIndex(start=0, stop=891, step=1)` is a special type of index in pandas called a `RangeIndex`. It represents a range of integer values starting from 0, up to (but not including) 891, with a step size of 1.

In [40]:
# Convert to list
list(titanic_df.index)

[0, 1, 2, 3, 4, ...]  (output truncated)


In [41]:
# Review columns - the column labels are stored in an Index object too!
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [42]:
# Convert to list
list(titanic_df.columns)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In pandas, `df.values` is an attribute that returns the data of a DataFrame `df` as a two-dimensional NumPy array. When the columns have different types, as here, the array has dtype `object`. For new code, the pandas documentation recommends the equivalent method `df.to_numpy()`.

In [43]:
# .values is useful when you use/create functions that need to handle either lists or arrays  
titanic_df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
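
A small sketch of how the array dtype depends on the selected columns, using a made-up two-column frame (the names `mixed` and `numeric` are illustrative). It uses `to_numpy()`, which returns the same kind of array as `.values`.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Paula", "Mark"], "score": [98.5, 95.0]})

mixed = df.to_numpy()               # text + numbers -> one generic object array
numeric = df[["score"]].to_numpy()  # numeric-only columns -> float array

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```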


Additionally, `pandas` can create DataFrames from other sources, such as JSON files and URLs.


---
## <a id='toc3_3_'></a>[Data access in Dataframe](#toc0_)


### <a id='toc3_3_1_'></a>[Columns](#toc0_)

You can extract a column from a `DataFrame` using dictionary-like notation (`df["col"]`) or attribute notation (`df.col`), obtaining a `Series` object in both cases. Attribute notation only works when the column label is a valid Python identifier.

In [45]:
# dict way of accessing a column
titanic_df["Name"] 

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [46]:
# Check the data structure of the selected column
type(titanic_df["Sex"]) # Whenever we select a column, we get a Pandas Series

pandas.core.series.Series

In [48]:
# attribute access of the column
titanic_df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [52]:
# This only works if the column name doesn't have unusual characters, this means no spaces or special characters
titanic_df['Passenger Id'] = titanic_df['PassengerId']
titanic_df.Passenger Id

SyntaxError: invalid syntax (3136390257.py, line 3)

In [55]:
titanic_df['Passenger Id'] = titanic_df['PassengerId']
titanic_df['Passenger Id'] 

0        1
1        2
2        3
3        4
4        5
      ... 
886    887
887    888
888    889
889    890
890    891
Name: Passenger Id, Length: 891, dtype: int64

In [53]:
# When we use a list instead of a string we get a DataFrame instead of a Series -> useful when that's the data structure you need
display(titanic_df[["Sex"]])
type(titanic_df[['Sex']])

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male
...,...
886,male
887,female
888,female
889,male


pandas.core.frame.DataFrame

In [57]:
col_list = ["Name", "SibSp", "Parch"]
titanic_df[col_list]

Unnamed: 0,Name,SibSp,Parch
0,"Braund, Mr. Owen Harris",1,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0
2,"Heikkinen, Miss. Laina",0,0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0
4,"Allen, Mr. William Henry",0,0
...,...,...,...
886,"Montvila, Rev. Juozas",0,0
887,"Graham, Miss. Margaret Edith",0,0
888,"Johnston, Miss. Catherine Helen ""Carrie""",1,2
889,"Behr, Mr. Karl Howell",0,0


In [None]:
# Multiple columns
titanic_df[["Sex", "Fare"]]

### <a id='toc3_3_2_'></a>[Rows](#toc0_)

To access rows in a Pandas DataFrame, you can use the `iloc` attribute with integer-based indexing or the `loc` attribute with label-based indexing. For example, `df.iloc[0]` will access the first row, and `df.loc['row_label']` will access the row with the specified label.


<div class="alert alert-info">Note: Consult http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#different-choices-for-indexing to understand the differences between methods.</div>

Just for the sake of this example, we will create a DataFrame with a non-default index.

![Table](../../../img/pd-indexing.png)

In [70]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index

Unnamed: 0,name,score
123A,Paula,98.5
789B,Mark,95.0



**loc** is used for label-based indexing to access rows.

In [60]:
df_index.loc["789B"]

name     Mark
score    95.0
Name: 789B, dtype: object


**iloc** is used for integer-based indexing in Pandas to access rows

In [61]:
df_index.iloc[1]

name     Mark
score    95.0
Name: 789B, dtype: object

In [63]:
df_index.iloc[:1]
# all rows up to (but not including) position 1

Unnamed: 0,name,score
123A,Paula,98.5
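
One difference between the two indexers that is easy to miss: slices with `iloc` exclude the end position (like Python lists), while slices with `loc` include the end label. The sketch below rebuilds `df_index` with one hypothetical extra row (`"456C"`, `"Ana"`) so that the slices have something to cut.

```python
import pandas as pd

df_index = pd.DataFrame(
    {"name": ["Paula", "Mark", "Ana"], "score": [98.5, 95.0, 88.0]},
    index=["123A", "789B", "456C"],  # "456C" is an extra row added for illustration
)

by_position = df_index.iloc[0:2]          # positions 0 and 1 (end excluded)
by_label = df_index.loc["123A":"789B"]    # labels 123A through 789B (end included)
several = df_index.loc[["123A", "456C"]]  # a list of labels selects exactly those rows

print(by_position)
print(by_label)
print(several)
```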


### <a id='toc3_3_3_'></a>[Specific values](#toc0_)

Here's a summary of the different ways to access individual values in a Pandas DataFrame:

1. `df.loc[row_label][column_label]`: Chooses the row with the label and then the value in that row with the column label.

2. `df.iloc[row_position][column_label]`: Selects the row with the position and then the value in that row with the column label.

3. `df.loc[row_label, column_label]`: Directly accesses the value using both row and column labels.

4. `df.iloc[row_position, column_position]`: Directly accesses the value using both row and column positions.

Note: you might also see chained brackets without `loc`, written as `df[column_label][row_label]` (column first, then row). This is less recommended than using `loc[]`.

In [76]:
# Option 1 - names

# Step 1 - select the row
df_index.loc['123A']

# Step 2 - select the column
df_index.loc['123A']['name']

'Paula'

In [67]:
df_index.loc['name']['123A']
# Doesn't work: loc looks up row labels first, and 'name' is a column label

KeyError: 'name'

In [65]:
# Try Option 1 without loc
df_index['123A']['name']

KeyError: '123A'

In [99]:
# Option 2 - index, name

# Step 1 - select the row
df_index.iloc[0]

# Step 2 - select the column
# 2.a. Use column name
df_index.iloc[0]['name']

'Paula'

In [68]:
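# 2.b. Use column position
# (recent pandas versions warn that positional [] on a Series is deprecated; df_index.iloc[0, 0] avoids the warning)
df_index.iloc[0][0]
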

'Paula'

In [71]:
# Try Option 2 without iloc
df_index[0][0]

KeyError: 0
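
Options 3 and 4 from the list at the start of this section pass the row and the column to the indexer in a single step. A minimal sketch, rebuilding the same `df_index` so it runs on its own (the names `v1` to `v4` are illustrative); all four expressions return the same value:

```python
import pandas as pd

df_index = pd.DataFrame(
    {"name": ["Paula", "Mark"], "score": [98.5, 95.0]},
    index=["123A", "789B"],
)

v1 = df_index.loc["789B"]["score"]  # 1. row label, then column label
v2 = df_index.iloc[1]["score"]      # 2. row position, then column label
v3 = df_index.loc["789B", "score"]  # 3. row label and column label in one step
v4 = df_index.iloc[1, 1]            # 4. row position and column position in one step

print(v1, v2, v3, v4)
```

The single-step forms are usually the safer habit, since they avoid creating an intermediate Series.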

### <a id='toc3_3_4_'></a>[⚠️ WARNING: Changing values in DataFrames ⚠️ (🔰)](#toc0_)

In [104]:
# Get beatles names
new_beatles = beatles_df['Name']
new_beatles

0         John Lenon
1     Paul McCartney
2    George Harrison
3        Ringo Starr
4       Hanif Kantor
Name: Name, dtype: object

In [91]:
# Steal Hanif's place in the band
new_beatles[4] = 'sabina'
new_beatles

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_beatles[4] = 'sabina'


0         John Lenon
1     Paul McCartney
2    George Harrison
3        Ringo Starr
4             sabina
Name: name, dtype: object

In [92]:
beatles_df

Unnamed: 0,Name,Instrument,Tenure,Number of Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6


In [107]:
# Use .copy() instead
new_beatles_df = beatles_df.copy()
new_beatles_series = new_beatles_df['Name']
new_beatles_series[4] = 'sabina'
new_beatles_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_beatles_series[4] = 'sabina'


Unnamed: 0,Name,Instrument,Tenure,Number of Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,sabina,Violin,1,6


In [108]:
beatles_df


Unnamed: 0,Name,Instrument,Tenure,Number of Fans
0,John Lenon,Vocals,9,9000
1,Paul McCartney,Bass,12,2400
2,George Harrison,Bass,12,2000
3,Ringo Starr,Drums,8,1600
4,Hanif Kantor,Violin,1,6
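
Whether changing a Series selected from a DataFrame also changes the DataFrame has depended on the pandas version and on whether copy-on-write is enabled, which is why pandas emits the warning shown above. A pattern that behaves the same way regardless, sketched here with a shortened `beatles_df`, is to make an explicit copy and assign through `.loc` on the copy itself:

```python
import pandas as pd

beatles_df = pd.DataFrame({
    "Name": ["John Lenon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"],
    "Instrument": ["Vocals", "Bass", "Bass", "Drums", "Violin"],
})

# Work on an independent copy so the original cannot be affected
new_beatles_df = beatles_df.copy()

# Assign through .loc on the DataFrame: one indexing step, no intermediate Series
new_beatles_df.loc[4, "Name"] = "sabina"

print(new_beatles_df.loc[4, "Name"])  # the copy was changed
print(beatles_df.loc[4, "Name"])      # the original was not
```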



---
## <a id='toc3_4_'></a>[More DataFrame Methods](#toc0_)


Let's see some useful methods and attributes of the `DataFrame` class.

### <a id='toc3_4_1_'></a>[`shape`](#toc0_)

`shape` is a pandas attribute that returns a tuple with the dimensions of a DataFrame (number of rows, number of columns) or of a Series (number of elements). Because it is an attribute and not a method, it is written without parentheses: calling `shape()` raises a `TypeError`, since the returned tuple is not callable.

In [95]:
titanic_df.shape

(891, 13)

In [96]:
# No of cols
titanic_df.shape[1]

13

In [97]:
# No of rows
titanic_df.shape[0]

891

### <a id='toc3_4_3_'></a>[`info()`](#toc0_)

`info()` is a Pandas method that provides a concise summary of a DataFrame, including information about the data types, non-null values, and memory usage.

In [98]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   891 non-null    int64  
 1   Survived      891 non-null    int64  
 2   Pclass        891 non-null    int64  
 3   Name          891 non-null    object 
 4   Sex           891 non-null    object 
 5   Age           714 non-null    float64
 6   SibSp         891 non-null    int64  
 7   Parch         891 non-null    int64  
 8   Ticket        891 non-null    object 
 9   Fare          891 non-null    float64
 10  Cabin         204 non-null    object 
 11  Embarked      889 non-null    object 
 12  Passenger Id  891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB


### <a id='toc3_4_2_'></a>[`describe()`](#toc0_)

`describe()` is a Pandas method that generates descriptive statistics of a DataFrame, providing information on count, mean, standard deviation, minimum, maximum, and quartiles for each numerical column.

In [110]:
titanic_df.describe().round(2) # By default looks only at numerical data

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Passenger Id
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0
mean,446.0,0.38,2.31,29.7,0.52,0.38,32.2,446.0
std,257.35,0.49,0.84,14.53,1.1,0.81,49.69,257.35
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0,1.0
25%,223.5,0.0,2.0,20.12,0.0,0.0,7.91,223.5
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.45,446.0
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0,668.5
max,891.0,1.0,3.0,80.0,8.0,6.0,512.33,891.0


In [111]:
titanic_df.describe(include="object")

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [113]:
titanic_df.describe(include="all").round(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Passenger Id
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889,891.0
unique,,,,891,2,,,,681.0,,147,3,
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S,
freq,,,,1,577,,,,7.0,,4,644,
mean,446.0,0.38,2.31,,,29.7,0.52,0.38,,32.2,,,446.0
std,257.35,0.49,0.84,,,14.53,1.1,0.81,,49.69,,,257.35
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,,1.0
25%,223.5,0.0,2.0,,,20.12,0.0,0.0,,7.91,,,223.5
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.45,,,446.0
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,,668.5


### <a id='toc3_4_4_'></a>[`nunique() and unique()`](#toc0_)

`nunique()` is a pandas method that returns the number of unique elements: a single number for a Series, or one count per column for a DataFrame. `unique()` is a Series method that returns an array of the unique elements; to use it on a DataFrame, select a column first.

In [6]:
titanic_df.nunique() #number of unique values per column

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [11]:
titanic_df.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [13]:
titanic_df.iloc[:,3]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [12]:
titanic_df.loc[:,"Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [10]:
titanic_df["Name"]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [7]:
#see unique values of one column
titanic_df.Pclass.unique()

array([3, 1, 2], dtype=int64)
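
The two methods also treat missing values differently, which matters for columns such as `Cabin`. The sketch below uses a made-up `cabins` Series (an assumption for illustration, not the real column) to show the default behavior and the `dropna` parameter of `nunique()`.

```python
import pandas as pd

cabins = pd.Series(["C85", None, "C123", "C85", None])

print(cabins.nunique())              # counts distinct values, ignoring missing ones
print(cabins.nunique(dropna=False))  # counts the missing value as one more category
print(cabins.unique())               # the distinct values themselves, missing value included
```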

### <a id='toc3_4_5_'></a>[`dtypes`](#toc0_)

`dtypes` is a pandas attribute that returns the data type of each column in a DataFrame, while `dtype` is an attribute that returns the single data type of a Series.

In [None]:
titanic_df.dtypes
# a column that mixes ints with floats (or with NaN) is stored as float


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
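
A short sketch of the difference between the two attributes, and of why a column like `Age` shows up as `float64`: a missing value in an otherwise integer column makes pandas store the column as floats by default. The frame and its column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "whole": [1, 2, 3],
    "with_gap": [1, 2, None],  # the missing value forces a float column
    "text": ["a", "b", "c"],
})

print(df.dtypes)             # one dtype per column (DataFrame attribute)
print(df["whole"].dtype)     # a single dtype (Series attribute)
print(df["with_gap"].dtype)  # float64, because of the missing value
```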

### <a id='toc3_4_6_'></a>[`select_dtypes()`](#toc0_)

`select_dtypes()` is a DataFrame method used to filter columns based on their data types. It allows you to select numeric, object (string), boolean, datetime, or categorical columns.

Syntax:
```python
DataFrame.select_dtypes(include=None, exclude=None)
```

- `include`: A list of data types or strings representing data types to include in the selection. If specified, only columns with these data types will be included.
- `exclude`: A list of data types or strings representing data types to exclude from the selection. If specified, columns with these data types will be excluded.

Example:
```python
# Assuming df is a DataFrame
numeric_columns = df.select_dtypes(include='number')
object_columns = df.select_dtypes(include='object')
```



In [None]:
titanic_df.select_dtypes(include='number').head()
# select only numeric columns (int and float)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.25
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.925
3,4,1,1,35.0,1,0,53.1
4,5,0,3,35.0,0,0,8.05


### <a id='toc3_4_7_'></a>[Aggregation such as `max()`](#toc0_)

In [16]:
titanic_df.select_dtypes(include='number').max() #get the max for each numerical column in the df

PassengerId    891.0000
Survived         1.0000
Pclass           3.0000
Age             80.0000
SibSp            8.0000
Parch            6.0000
Fare           512.3292
dtype: float64

In [17]:
titanic_df.Fare.max() #get the max for one numerical column

512.3292

 Just like `max()`, there are many methods that can be applied to either the entire dataframe or its individual columns.
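
As a brief sketch of that idea, the example below applies a few other common aggregations (`min()`, `mean()`, `sum()`, `median()`) to a made-up DataFrame. The name `toy_df` and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.93],
})

# On the whole DataFrame: one result per column, returned as a Series
column_means = toy_df.mean()
column_mins = toy_df.min()

# On a single column: one scalar
total_fare = toy_df["fare"].sum()
median_age = toy_df["age"].median()

print(column_means)
print(total_fare, median_age)
```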

## <a id='toc3_5_'></a>[💡 Check for understanding](#toc0_)

- a. Use the original titanic_df
- b. Select the `Sex` and `Fare` columns.
- c. Indicate how many different types of `Sex` there are.
- d. Indicate how many passengers of each `Sex` there are.
- e. Show a statistical summary of all the variables.
- f. Write some conclusions

In [32]:
# Your code here
x = set(titanic_df.Sex)
y = set(titanic_df.Fare)
print("There are ", len(x), "sex values")
print("There are ", len(y), "fare values")
z = list(titanic_df.Sex)
print("There are ",z.count("male"), "males")
print("There are ",z.count("female"), "females")


There are  2 sex values
There are  248 fare values
There are  577 males
There are  314 females
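
The same kind of counts can also be obtained with pandas methods instead of Python sets and lists. This is one possible sketch, shown on a tiny made-up DataFrame (`toy_df` is an assumption) rather than on `titanic_df`:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

subset = toy_df[["Sex", "Fare"]]             # select two columns
n_sex = subset["Sex"].nunique()              # number of distinct values
sex_counts = subset["Sex"].value_counts()    # number of rows per distinct value

print(n_sex)
print(sex_counts)
```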


In [31]:
titanic_df.describe(include="all")

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891.0,891.0,204,889
unique,,,,891,2,,,,681.0,,147,3
top,,,,"Braund, Mr. Owen Harris",male,,,,347082.0,,B96 B98,S
freq,,,,1,577,,,,7.0,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,


### <a id='toc4_3_1_'></a>[`concat()`](#toc0_)

`concat()` is a pandas function that combines DataFrames along a specified axis, either vertically (rows, or axis 0):

In [33]:
missing_passenger_data = {
    'PassengerId': 4,
    'Survived': 0,
    'Pclass': 3,
    'Name': 'Mary Johnson',
    'Sex': 'female',
    'Age': None,          # Missing age
    'SibSp': 0,
    'Parch': 0,
    'Ticket': 'STON/O2. 3101283',
    'Fare': None,         # Missing fare
    'Cabin': None,        # Missing cabin
    'Embarked': 'S'
}

In [39]:
# Convert to DataFrame directly (fails: every value is a scalar, so pandas has no index to use)
missing_passenger = pd.DataFrame(missing_passenger_data)

ValueError: If using all scalar values, you must pass an index

In [35]:
# Convert to DataFrame via list
missing_passenger = pd.DataFrame([missing_passenger_data])
missing_passenger

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,4,0,3,Mary Johnson,female,,0,0,STON/O2. 3101283,,,S
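
The `ValueError` appears because a dictionary of scalar values does not tell pandas how many rows to build. Two common ways around it are sketched below on a shortened, made-up row (`row` is an assumption for illustration): wrapping the dictionary in a list, as done above, or passing an explicit `index`.

```python
import pandas as pd

# Hypothetical shortened row, used only for illustration
row = {"PassengerId": 4, "Name": "Mary Johnson", "Fare": None}

# Option 1: a list of dicts, where each dict becomes one row
from_list = pd.DataFrame([row])

# Option 2: keep the dict of scalars but state which row label to use
from_index = pd.DataFrame(row, index=[0])

print(from_list.shape, from_index.shape)
```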


In [None]:
# Check shape before and after appending
# axis=0 stacks the DataFrames along the rows
print(titanic_df.shape)
titanic_df = pd.concat([titanic_df, missing_passenger], axis=0)
print(titanic_df.shape)

(891, 12)
(892, 12)



or horizontally (columns, or axis 1):

In [43]:
# New column with the first word of each name (the surname in this dataset), as a Series with index aligned to the DataFrame
# This would usually come from a separate place but this time we are creating it
first_name_col = titanic_df['Name'].apply(lambda name: name.split(" ")[0].replace(",", ""))
first_name_col

0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
887       Graham
888     Johnston
889         Behr
890       Dooley
0           Mary
Name: Name, Length: 892, dtype: object

In [54]:
titanic_new_df = titanic_df

In [57]:
bogus_df = pd.DataFrame([1 for x in range(0,titanic_df.shape[0])])
bogus_df

Unnamed: 0,0
0,1
1,1
2,1
3,1
4,1
...,...
887,1
888,1
889,1
890,1


In [None]:

titanic_new_df = pd.concat([titanic_df.reset_index(), bogus_df], axis=1,ignore_index = False)
titanic_new_df

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name.1,0
0,0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Braund,1
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,1
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Heikkinen,1
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Futrelle,1
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Allen,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Graham,1
888,888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Johnston,1
889,889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Behr,1
890,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q,Dooley,1


In [44]:
# Check shape before and after concatenating
# axis=1 adds the Series as a new column
print(titanic_df.shape)
titanic_df = pd.concat([titanic_df, first_name_col], axis=1)
print(titanic_df.shape)

(892, 12)
(892, 13)


By default, `concat` keeps the original index labels of each object. The index is only renumbered from 0 if we specify `ignore_index=True`.

In [41]:
titanic_df.loc[0] # we see we have two elements for index 0

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
0,4,0,3,Mary Johnson,female,,0,0,STON/O2. 3101283,,,S


In [42]:
titanic_df_new_idx = pd.concat([titanic_df, missing_passenger],ignore_index=True)
titanic_df_new_idx


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q
891,4,0,3,Mary Johnson,female,,0,0,STON/O2. 3101283,,,S


### `set_index()`

In [66]:
titanic_df

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Braund
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Heikkinen
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Futrelle
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Allen
...,...,...,...,...,...,...,...,...,...,...,...,...
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Graham
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Johnston
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Behr
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q,Dooley


In [65]:
titanic_df.set_index('PassengerId')

KeyError: "None of ['PassengerId'] are in the columns"

In [72]:
# What column makes most sense as an index?
titanic_df.set_index('PassengerId',inplace=False)
titanic_df.head()

KeyError: "None of ['PassengerId'] are in the columns"
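
The `KeyError` above is what pandas raises when the requested column is no longer among the columns, which is the case once it has already been moved into the index (the display of `titanic_df` above shows `PassengerId` as the index). The sketch below reproduces this on a made-up DataFrame (`toy_df` is an assumption for illustration) and shows that `reset_index()` moves the index back into a regular column:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
})

# set_index returns a new DataFrame; toy_df itself is unchanged
indexed = toy_df.set_index("PassengerId")

# The column has moved to the index, so asking for it again fails
try:
    indexed.set_index("PassengerId")
    second_call_failed = False
except KeyError:
    second_call_failed = True

# reset_index moves the index back into a regular column
restored = indexed.reset_index()

print(list(indexed.columns), second_call_failed, list(restored.columns))
```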

### <a id='toc4_3_2_'></a>[`sort_values() and sort_index()`](#toc0_)

`sort_values()` is a pandas DataFrame method that sorts the DataFrame based on specified column(s), while `sort_index()` sorts the DataFrame based on its index labels.

In [68]:
# Quickly review the dataframe
titanic_df.head()

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen


In [71]:
# sort by age
titanic_df.sort_values(by='Age', ascending=False,inplace=False)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S,Barkworth
852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S,Svensson
494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,Artagaveytia
97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,Goldschmidt
117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q,Connors
...,...,...,...,...,...,...,...,...,...,...,...,...
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S,Sage
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S,van
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S,Laleff
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Johnston


In [74]:
# sort by fare
titanic_df.sort_values(by='Fare',ascending = False)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,Ward
680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,Cardeza
738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,Lesurer
439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0000,C23 C25 C27,S,Fortune
342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0000,C23 C25 C27,S,Fortune
...,...,...,...,...,...,...,...,...,...,...,...,...
482,0,2,"Frost, Mr. Anthony Wood ""Archie""",male,,0,0,239854,0.0000,,S,Frost
272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0000,,S,Tornquist
180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0000,,S,Leonard
634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S,Parr


In [75]:
# Sort descending by fare
titanic_df.sort_values(by="Fare", ascending=False)  # ascending=False puts the highest fares first

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C,Ward
680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C,Cardeza
738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,Lesurer
439,0,1,"Fortune, Mr. Mark",male,64.0,1,4,19950,263.0000,C23 C25 C27,S,Fortune
342,1,1,"Fortune, Miss. Alice Elizabeth",female,24.0,3,2,19950,263.0000,C23 C25 C27,S,Fortune
...,...,...,...,...,...,...,...,...,...,...,...,...
482,0,2,"Frost, Mr. Anthony Wood ""Archie""",male,,0,0,239854,0.0000,,S,Frost
272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0000,,S,Tornquist
180,0,3,"Leonard, Mr. Lionel",male,36.0,0,0,LINE,0.0000,,S,Leonard
634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S,Parr
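
`sort_values()` also accepts a list of columns, with one sort direction per column. A minimal sketch on made-up data (`toy_df` and its row labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy_df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 1],
        "Fare": [7.25, 71.28, 8.05, 53.10],
    },
    index=["a", "b", "c", "d"],
)

# Sort by class first (ascending), then by fare within each class (descending)
by_class_then_fare = toy_df.sort_values(
    by=["Pclass", "Fare"], ascending=[True, False]
)

print(by_class_then_fare)
```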


In [76]:
# Sort by index
titanic_df.sort_index(ascending=False)

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q,Dooley
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Behr
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Johnston
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Graham
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Montvila
...,...,...,...,...,...,...,...,...,...,...,...,...
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Futrelle
4,0,3,Mary Johnson,female,,0,0,STON/O2. 3101283,,,S,Mary
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Heikkinen
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings


`sort_index()` and `sort_values()` return a sorted copy and do not change the original DataFrame.
To keep the result, we can either assign it back to the variable or pass the parameter
**inplace** set to `True`.

In [None]:
# Check if we changed the df
titanic_df.head() 

In [78]:
# Change the df
# inplace=True modifies titanic_df itself instead of returning a sorted copy
titanic_df.sort_index(ascending=False, inplace=True)

# equivalent: titanic_df = titanic_df.sort_index(ascending=False)

In [79]:
titanic_df.head() # the DataFrame has changed because we used 'inplace=True'

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Name
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q,Dooley
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,Behr
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,Johnston
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S,Graham
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S,Montvila


---
# <a id='toc4_'></a>[Series in Pandas](#toc0_)

A `Series` is the basic building block of a DataFrame: a one-dimensional labeled array designed to store homogeneous data, i.e. data of the same type, be it int, string, float, or datetime. It can be created from lists, dictionaries, or directly read from a file, similarly to DataFrames.

## <a id='toc4_1_'></a>[Creating Series](#toc0_)


### <a id='toc4_1_1_'></a>[From a list](#toc0_)

Create a Series with default indexes from a list

In [80]:
beatles = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]

# Create pandas series
beatles_series = pd.Series(beatles)
beatles_series

0        John Lennon
1     Paul McCartney
2    George Harrison
3        Ringo Starr
4       Hanif Kantor
dtype: object

In [81]:
type(beatles_series) # careful: it's still a pandas Series

pandas.core.series.Series

In [82]:
beatles_series.dtype #this gives me the type of the elements inside the series
# this is an attribute (remember our lesson about classes)

dtype('O')

We can also have numbers in the series:

In [83]:
num_fans = [9000, 2400, 2000, "1600", 15]

# Create series
beatles_fans_series = pd.Series(num_fans)
beatles_fans_series

0    9000
1    2400
2    2000
3    1600
4      15
dtype: object

In [84]:
# Remember... Python names are case-sensitive, so pd.series is not the same as pd.Series:
beatles_fans_series = pd.series(num_fans)

AttributeError: module 'pandas' has no attribute 'series'

In [85]:
# Correct series
beatles_fans_series = pd.Series(num_fans)

In [86]:
# Series data type
type(beatles_series)

pandas.core.series.Series

In [87]:
# Type of the data inside the Series (object, because one element is the string "1600")
beatles_fans_series.dtype

dtype('O')

In [88]:
num_fans = [9000, 2400, 2000, 1600, 15]
beatles_fans_series = pd.Series(num_fans)
beatles_fans_series

0    9000
1    2400
2    2000
3    1600
4      15
dtype: int64

No need to worry about the number after `int`: it simply indicates how many bits are used to store each number, in this case 64 bits. If you're curious about how these bits work (i.e. go down a rabbit hole), you can <a href="https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-course/numpy-introduction-a">have a look at this video</a> from the FreeCodeCamp Data Analysis certification.
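
Earlier in this section, a single string (`"1600"`) in the list was enough to make the whole Series `object`. Besides fixing the list by hand, one option is `pd.to_numeric()`, which parses every element as a number. A minimal sketch (the variable names `mixed` and `numeric` are illustrative):

```python
import pandas as pd

mixed = pd.Series([9000, 2400, 2000, "1600", 15])  # one string makes the dtype object
numeric = pd.to_numeric(mixed)                     # parse every element as a number

print(mixed.dtype)
print(numeric.dtype)
```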

A `Series` has two main attributes: `values` and `index`. The first is a NumPy array that stores the data, and the second is an object that holds the index labels.

In [93]:
beatles_series.columns

AttributeError: 'Series' object has no attribute 'columns'

In [89]:
beatles_series.values

array(['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr',
       'Hanif Kantor'], dtype=object)

In [90]:
beatles_series.index

RangeIndex(start=0, stop=5, step=1)

In Pandas, `Series.items()` is a method used to iterate over the elements of a Pandas Series. It returns an iterator that yields the index-label and corresponding value pairs of the Series.

In [94]:
# items() yields (index, value) tuples: iterating with one variable gives
# the whole tuple on each iteration, while iterating with two variables
# unpacks the index and the value separately
for i in beatles_series.items():
    print(i)

(0, 'John Lennon')
(1, 'Paul McCartney')
(2, 'George Harrison')
(3, 'Ringo Starr')
(4, 'Hanif Kantor')


In [None]:
for i, v in beatles_series.items():
    print("index ", i)
    print("value ", v)

index  0
value  John Lennon
index  1
value  Paul McCartney
index  2
value  George Harrison
index  3
value  Ringo Starr
index  4
value  Hanif Kantor


### <a id='toc4_1_2_'></a>[From a list with index](#toc0_)

When creating a `Series`, you can define the index labels explicitly by passing a list or array through the `index` argument.


Creating series with defined indexes

In [96]:
beatles = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]
num_fans = [9000, 2400, 2000, 1600, 15]

# Create series
beatles_fans_series = pd.Series(num_fans, index=beatles)
beatles_fans_series

John Lennon        9000
Paul McCartney     2400
George Harrison    2000
Ringo Starr        1600
Hanif Kantor         15
dtype: int64

### <a id='toc4_1_3_'></a>[From a dictionary](#toc0_)

In [97]:
# I could do a Beatles fans dict manually
beatles_fans = {
    "John Lennon": 9000, 
    "Paul McCartney": 2400,
    "George Harrison": 2000,
    "Ringo Starr": 1600,
    "Hanif Kantor": 15,
    }

beatles_fans_series = pd.Series(beatles_fans)
beatles_fans_series

John Lennon        9000
Paul McCartney     2400
George Harrison    2000
Ringo Starr        1600
Hanif Kantor         15
dtype: int64

In [98]:
# Or I could use a dict comprehension; zip() pairs up the two lists element by element
beatles_fans = {beatle: fans for beatle, fans in zip(beatles, num_fans)}
beatles_fans

{'John Lennon': 9000,
 'Paul McCartney': 2400,
 'George Harrison': 2000,
 'Ringo Starr': 1600,
 'Hanif Kantor': 15}

### <a id='toc4_1_4_'></a>[From a file](#toc0_)

`read_csv()` is a Pandas function used to read data from a CSV file and create a DataFrame.

When a single column is passed to the `usecols` parameter and `squeeze("columns")` is then called on the result, we get a Series instead of a DataFrame.

In [None]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_series = pd.read_csv(url, usecols=["Name"]).squeeze("columns")
print(titanic_series)
# squeeze("columns") turns the one-column DataFrame into a Series

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object




---
## <a id='toc4_2_'></a>[Data access in Series](#toc0_)


Elements of a Series can be accessed either through their index label or through their integer position.

### <a id='toc4_2_2_'></a>[Using the index label:](#toc0_)

In [102]:
beatles_fans_series['Hanif Kantor']

15

In [None]:
titanic_series[0]  # with the default integer index, this looks like list indexing

'Braund, Mr. Owen Harris'

In [105]:
titanic_series[1:]

1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 890, dtype: object

In [106]:
titanic_series[1:20:2]

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
5                                      Moran, Mr. James
7                        Palsson, Master. Gosta Leonard
9                   Nasser, Mrs. Nicholas (Adele Achem)
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
17                         Williams, Mr. Charles Eugene
19                              Masselmani, Mrs. Fatima
Name: Name, dtype: object

In [None]:
titanic_series[::-1]  # reversed Series

890                                  Dooley, Mr. Patrick
889                                Behr, Mr. Karl Howell
888             Johnston, Miss. Catherine Helen "Carrie"
887                         Graham, Miss. Margaret Edith
886                                Montvila, Rev. Juozas
                             ...                        
4                               Allen, Mr. William Henry
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
2                                 Heikkinen, Miss. Laina
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
0                                Braund, Mr. Owen Harris
Name: Name, Length: 891, dtype: object

In [None]:
titanic_df["Ticket"][::-1]  # a DataFrame column is a Series

PassengerId
1             A/5 21171
2              PC 17599
3      STON/O2. 3101282
4      STON/O2. 3101283
4                113803
             ...       
887              211536
888              112053
889          W./C. 6607
890              111369
891              370376
Name: Ticket, Length: 892, dtype: object

In [110]:
titanic_df["Ticket"][:3]

PassengerId
891        370376
890        111369
889    W./C. 6607
Name: Ticket, dtype: object

### <a id='toc4_2_1_'></a>[Using Pandas internal index:](#toc0_)

In [111]:
# One element: .iloc selects by position
beatles_fans_series.iloc[0]

9000

In [112]:
# Multiple elements slicing
beatles_fans_series[1:4]

Paul McCartney     2400
George Harrison    2000
Ringo Starr        1600
dtype: int64

In [113]:
# [1:] - everything from position 1 onwards
beatles_fans_series[1:]

Paul McCartney     2400
George Harrison    2000
Ringo Starr        1600
Hanif Kantor         15
dtype: int64

In [114]:
# [1:5:2] - odd positions
beatles_fans_series[1:5:2]

Paul McCartney    2400
Ringo Starr       1600
dtype: int64

In [115]:
# [::-1] - reversed
beatles_fans_series[::-1]

Hanif Kantor         15
Ringo Starr        1600
George Harrison    2000
Paul McCartney     2400
John Lennon        9000
dtype: int64
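
To keep label-based and position-based access clearly apart, pandas offers the `.loc` and `.iloc` accessors. The sketch below rebuilds a shortened version of the fans Series (the variable `fans` is an assumption for illustration) and shows one difference worth remembering: label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

fans = pd.Series(
    [9000, 2400, 2000, 1600],
    index=["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr"],
)

by_label = fans.loc["Ringo Starr"]   # label-based access
by_position = fans.iloc[0]           # position-based access

# Label slices include the end label; position slices exclude the end position
label_slice = fans.loc["Paul McCartney":"Ringo Starr"]
position_slice = fans.iloc[1:3]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```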



---
## <a id='toc4_3_'></a>[Methods in Series](#toc0_)

In [116]:
titanic_series

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Many of the tools we used with DataFrames also work with Series:
- concat
- sort_values
- sort_index

However, unlike with DataFrames, there are no column names to specify, and the axis rarely needs to be mentioned:

### `concat()`

In [121]:
missing_passengers = pd.Series(
    [
        "James Bennett",
        "Eleanor Smith (née Roberts)",
        "Lillian Grey",
        "Henry Dawson",
        "Thomas Wills",
        "Ada Carter (née Foster)",
        "Margaret Lane",
        "Charles Baldwin",
        "Beatrice Hollins",
        "Samuel Kline",
    ]
)
titanic_new_df = pd.concat([titanic_series, missing_passengers])
titanic_new_df
# the new names are added at the end of the Series

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
                           ...                        
5                              Ada Carter (née Foster)
6                                        Margaret Lane
7                                      Charles Baldwin
8                                     Beatrice Hollins
9                                         Samuel Kline
Length: 901, dtype: object

In [132]:
titanic_new_df[-1:-15:-1]

9                                  Samuel Kline
8                              Beatrice Hollins
7                               Charles Baldwin
6                                 Margaret Lane
5                       Ada Carter (née Foster)
4                                  Thomas Wills
3                                  Henry Dawson
2                                  Lillian Grey
1                   Eleanor Smith (née Roberts)
0                                 James Bennett
890                         Dooley, Mr. Patrick
889                       Behr, Mr. Karl Howell
888    Johnston, Miss. Catherine Helen "Carrie"
887                Graham, Miss. Margaret Edith
dtype: object

In [None]:
titanic_new_df.reset_index()  # returns a DataFrame: the old index becomes an extra column

Unnamed: 0,index,0
0,0,"Braund, Mr. Owen Harris"
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,2,"Heikkinen, Miss. Laina"
3,3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,4,"Allen, Mr. William Henry"
...,...,...
896,5,Ada Carter (née Foster)
897,6,Margaret Lane
898,7,Charles Baldwin
899,8,Beatrice Hollins


In [127]:
type(titanic_new_df.reset_index())

pandas.core.frame.DataFrame

In [None]:
titanic_new_df.reset_index(drop=True)  # drop=True discards the old index, so the result is still a Series

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
896                              Ada Carter (née Foster)
897                                        Margaret Lane
898                                      Charles Baldwin
899                                     Beatrice Hollins
900                                         Samuel Kline
Length: 901, dtype: object

In [128]:
type(titanic_new_df.reset_index(drop = True))

pandas.core.series.Series

In [119]:
# Try using axis=1
titanic_new_df = pd.concat([titanic_series, missing_passengers], axis=1)
titanic_new_df

Unnamed: 0,Name,0
0,"Braund, Mr. Owen Harris",James Bennett
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Eleanor Smith (née Roberts)
2,"Heikkinen, Miss. Laina",Lillian Grey
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",Henry Dawson
4,"Allen, Mr. William Henry",Thomas Wills
...,...,...
886,"Montvila, Rev. Juozas",
887,"Graham, Miss. Margaret Edith",
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",
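
The empty cells in the second column above appear because `concat` with `axis=1` lines rows up by index label and fills in missing values where a label exists on only one side. A minimal sketch with two made-up Series (`left` and `right` are assumptions for illustration):

```python
import pandas as pd

left = pd.Series(["a", "b", "c"], index=[0, 1, 2], name="left")
right = pd.Series(["x", "y"], index=[1, 5], name="right")

# Rows are matched by index label; labels missing on one side become NaN
side_by_side = pd.concat([left, right], axis=1)

print(side_by_side)
```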


### `sort_values()`

In [129]:
titanic_series.sort_values()

845                      Abbing, Mr. Anthony
746              Abbott, Mr. Rossmore Edward
279         Abbott, Mrs. Stanton (Rosa Hunt)
308                      Abelson, Mr. Samuel
874    Abelson, Mrs. Samuel (Hannah Wizosky)
                       ...                  
286                  de Mulder, Mr. Theodore
282                de Pelsmaeker, Mr. Alfons
361                del Carlo, Mr. Sebastiano
153          van Billiard, Mr. Austin Blyler
868              van Melkebeke, Mr. Philemon
Name: Name, Length: 891, dtype: object

### `sort_index()`

In [130]:
titanic_series.sort_index()

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

### <a id='toc4_3_3_'></a>[`value_counts()`](#toc0_)

`value_counts()` is a pandas method that returns a Series containing the number of times each unique value appears in a Series (or each unique row in a DataFrame).

In [131]:
titanic_series.value_counts()

Name
Braund, Mr. Owen Harris                     1
Boulos, Mr. Hanna                           1
Frolicher-Stehli, Mr. Maxmillian            1
Gilinski, Mr. Eliezer                       1
Murdlin, Mr. Joseph                         1
                                           ..
Kelly, Miss. Anna Katherine "Annie Kate"    1
McCoy, Mr. Bernard                          1
Johnson, Mr. William Cahoone Jr             1
Keane, Miss. Nora A                         1
Dooley, Mr. Patrick                         1
Name: count, Length: 891, dtype: int64
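
Since every passenger name is unique, all the counts above are 1. The method is more informative on a Series with repeated values. The sketch below uses a short made-up Series (`ports` is an assumption for illustration) and also shows the `normalize=True` option, which returns fractions instead of counts:

```python
import pandas as pd

# Hypothetical values, used only for illustration
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

counts = ports.value_counts()                 # sorted from most to least frequent
shares = ports.value_counts(normalize=True)   # fractions instead of counts
most_common = counts.index[0]                 # label with the highest count

print(counts)
print(most_common)
```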

## <a id='toc4_4_'></a>[💡 Check for understanding](#toc0_)

In [None]:
titanic_series

1. Get the column "Embarked" from the Titanic csv as a Pandas Series
2. Print the first value
3. Print the last 5 values
4. Append "NA" to the Series
5. Get the number of each Embarked type (number of repeated values)
6. Order the Series *descending*, and print the Embarked type most repeated in the Series

In [None]:
# Your answer here

# <a id='toc5_'></a>[Summary](#toc0_)

- Pandas is a library built on top of NumPy and designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more.
- DataFrames and Series are the two main data structures in Pandas.
- Series is a one-dimensional array of data with associated labels called the index, while DataFrame is a two-dimensional tabular data structure with labeled rows and columns.
- Data in a Series or DataFrame can be accessed with integer-based indexing (`iloc`), label-based indexing (`loc`), or dictionary-like notation for column access.
- Series and DataFrames offer many methods and attributes, such as `sort_values()`, `sort_index()`, `value_counts()`, `describe()`, `info()`, `nunique()`, `unique()`, `dtypes`, and `select_dtypes()`.


# <a id='toc6_'></a>[Extra: Creating Dataframes from a Dictionary](#toc0_)

In [None]:
# Create a Dataframe from a dictionary with
# automatic indexes

d = {"state": ["Ohio", "Ohio", "California", "Nevada", "California"],
     "year": [2000, 2001, 2002, 2001, 2002],
     "avg": [1.5, 1.7, 3.6, 2.4, 1.9]
}

df = pd.DataFrame(d)
df


Create a DataFrame from a dictionary of lists with a custom index:

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index
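
When the dictionary is keyed by row label rather than by column name, `pd.DataFrame.from_dict()` with `orient="index"` can build the same kind of table. The following is a minimal sketch; `records` and `df_rows` are hypothetical names that reuse the values of `df_index` above.

```python
import pandas as pd

# Hypothetical records keyed by row label (same values as df_index)
records = {
    "123A": {"name": "Paula", "score": 98.5},
    "789B": {"name": "Mark", "score": 95},
}

# orient="index" treats the outer keys as row labels and the inner keys as columns
df_rows = pd.DataFrame.from_dict(records, orient="index")

print(df_rows)
print(df_rows.loc["789B", "score"])  # label-based access using the custom index
```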

# <a id='toc7_'></a>[Extra: pickle](#toc0_)

Pandas builds on Python's `pickle` module to provide a convenient way to save DataFrames and Series to disk and load them back. Pickling serializes a Python object into a binary format that preserves details such as the index and column dtypes. It's a great tool for saving and restoring your work, especially with large datasets that take a long time to process or recreate. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.

## <a id='toc7_1_'></a>[Saving DataFrames with Pickle](#toc0_)

You can use the `to_pickle()` method in pandas to save a DataFrame to a pickle file. This method takes the file path as an argument and creates a binary representation of the DataFrame, which is then saved to the specified file.

In [None]:
import pandas as pd

# Assuming df is your DataFrame
df.to_pickle('data.pkl')

## <a id='toc7_2_'></a>[Loading DataFrames from Pickle](#toc0_)

To load a DataFrame from a pickle file, you can use the `read_pickle()` function in pandas. This function reads the binary data from the pickle file and converts it back into a DataFrame.


In [73]:
import pandas as pd

# Load DataFrame from pickle file
df = pd.read_pickle('data.pkl')

In [None]:
df.head()
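
To check that nothing is lost in the round trip, the two steps can be combined. The following is a minimal sketch that assumes a small hypothetical DataFrame (`df_demo`) and writes to a temporary directory, so that no file such as `data.pkl` is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical DataFrame used only for this illustration
df_demo = pd.DataFrame(
    {"name": ["Paula", "Mark"], "score": [98.5, 95.0]},
    index=["123A", "789B"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    df_demo.to_pickle(path)           # serialize to disk
    restored = pd.read_pickle(path)   # load it back while the file still exists

# equals() compares values, index, columns and dtypes
print(restored.equals(df_demo))
print(restored.dtypes)
```

Unlike a CSV round trip, the custom index and the column dtypes come back without any extra arguments.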