**Table of contents**<a id='toc0_'></a>    
- [Short intro to NumPy](#toc1_)    
- [Introduction to Pandas Library](#toc2_)    
  - [Key Features of Pandas](#toc2_1_)    
  - [DataFrames and Series in Pandas](#toc2_2_)    
- [Dataframes in Pandas](#toc3_)    
  - [Creating DataFrames](#toc3_1_)    
    - [From Dictionaries](#toc3_1_1_)    
    - [From Lists of Lists](#toc3_1_2_)    
    - [From NumPy arrays](#toc3_1_3_)    
    - [From a file / URL](#toc3_1_4_)    
  - [DataFrame operations](#toc3_2_)    
    - [`head()` and `tail()`](#toc3_2_1_)    
    - [`index, columns and values`](#toc3_2_2_)    
  - [Data access in Dataframe](#toc3_3_)    
    - [Columns](#toc3_3_1_)    
    - [Rows](#toc3_3_2_)    
    - [Specific values](#toc3_3_3_)    
    - [⚠️ WARNING: Changing values in DataFrames ⚠️ (🔰)](#toc3_3_4_)    
  - [More DataFrame Methods](#toc3_4_)    
    - [`shape`](#toc3_4_1_)    
    - [`describe()`](#toc3_4_2_)    
    - [`info()`](#toc3_4_3_)    
    - [`nunique() and unique()`](#toc3_4_4_)    
    - [`dtypes`](#toc3_4_5_)    
    - [`select_dtypes()`](#toc3_4_6_)    
    - [Aggregation such as `max()`](#toc3_4_7_)    
  - [💡 Check for understanding](#toc3_5_)    
- [Series in Pandas](#toc4_)    
  - [Creating Series](#toc4_1_)    
    - [From a list](#toc4_1_1_)    
    - [From a list with index](#toc4_1_2_)    
    - [From a dictionary](#toc4_1_3_)    
    - [From a file](#toc4_1_4_)    
  - [Data access in Series](#toc4_2_)    
  - [Methods in Series](#toc4_3_)    
    - [`concat()`](#toc4_3_1_)    
    - [`sort_values() and sort_index()`](#toc4_3_2_)    
    - [`value_counts()`](#toc4_3_3_)    
  - [💡 Check for understanding](#toc4_4_)    
- [Summary](#toc5_)    
- [Extra: Creating Dataframes from a Dictionary](#toc6_)    
- [Extra: pickle](#toc7_)    
  - [Saving DataFrames with Pickle](#toc7_1_)    
  - [Loading DataFrames from Pickle](#toc7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [97]:
# But first... how do we import libraries?
# from library import function
# import library
# import library as liby

# <a id='toc1_'></a>[Short intro to NumPy](#toc0_)

Let's start by mentioning NumPy as Pandas builds on top of NumPy.

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with an extensive collection of mathematical functions to operate on these arrays efficiently.

In [None]:
lst_ = [1, 2, 3, 4, 5]
lst_ * 2

In [None]:
import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Multiply by 2
arr * 2
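
To make the difference between the two cells above explicit, here is a minimal side-by-side sketch. The variable names (`values`, `repeated`, `doubled`) are only illustrative and are not used elsewhere in the lesson.

```python
import numpy as np

values = [1, 2, 3, 4, 5]

# Multiplying a plain Python list repeats the list
repeated = values * 2

# Multiplying a NumPy array multiplies every element (vectorized operation)
doubled = np.array(values) * 2

print(repeated)
print(doubled)
```

This element-wise behaviour is what Pandas inherits from NumPy when you do arithmetic on columns.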

# <a id='toc2_'></a>[Introduction to Pandas Library](#toc0_)

<iframe src="https://giphy.com/embed/QoCoLo2opwUW4" width="480" height="278" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/panda-playing-QoCoLo2opwUW4">via GIPHY</a></p>

`pandas` is a powerful library designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more. It revolves around two main data structures: `Series` and `DataFrame`.

## <a id='toc2_1_'></a>[Key Features of Pandas](#toc0_)

- Robust data import/export: `pandas` provides efficient tools to read and write data from various file formats like CSV, XLS, SQL, Parquet, SPSS, HDF5, and more.
- Data manipulation: With `pandas`, you can easily filter, add, or delete data, enabling seamless data processing.
- High-performance and versatility: It combines the performance of `numpy` arrays with the ability to handle tabulated data efficiently.

To import the necessary modules from the `pandas` library, we use the following syntax:

In [100]:
import pandas as pd  # 'pd' is the common abbreviation for pandas

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/pandas.svg?raw=true)

## <a id='toc2_2_'></a>[DataFrames and Series in Pandas](#toc0_)

In `pandas`, a `DataFrame` is a two-dimensional data structure that stores data in tabular form, with labeled rows and columns. Each row represents an observation, and each column represents a variable. `DataFrame` can handle heterogeneous data with different types (numeric, string, boolean, etc.). It also includes variable names, types, and methods to access and modify the data.

```python
# Creating a DataFrame
df = pd.DataFrame(data, ...)
```

On the other hand, a `Series` is a one-dimensional array of data with associated labels called the *index*. If no index is specified, it generates an ordered sequence of integers.

```python
# Creating a Series
s = pd.Series(data, index=index)
```
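
The two snippets above are templates rather than runnable code. A minimal concrete sketch could look like this; the student names and scores are made-up example data, and `scores_df`, `scores_s` and `default_s` are hypothetical variable names.

```python
import pandas as pd

# Made-up example data, not taken from the lesson's datasets
scores_df = pd.DataFrame({
    "student": ["Ana", "Ben", "Cleo"],
    "score": [8.5, 7.0, 9.2],
})

# A Series with explicit index labels
scores_s = pd.Series([8.5, 7.0, 9.2], index=["Ana", "Ben", "Cleo"])

# A Series without an index gets the integers 0, 1, 2, ... as labels
default_s = pd.Series([8.5, 7.0, 9.2])

print(scores_df)
print(scores_s)
print(default_s.index)
```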

# <a id='toc3_'></a>[Dataframes in Pandas](#toc0_)

## <a id='toc3_1_'></a>[Creating DataFrames](#toc0_)

### <a id='toc3_1_1_'></a>[From Dictionaries](#toc0_)

A simple and common way to create a DataFrame is from a dictionary where keys are column names, and values are lists or arrays representing data for each column.

In [None]:
beatles_dict = {
    'name': ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"],
    'instrument': ["Vocals", "Bass", "Bass", "Drums", "Violin"],
    'tenure': [9, 12, 12, 8, 1],
    'num_fans': [9000, 2400, 2000, 1600, 6]
}
beatles_dict

In [None]:
beatles_df = pd.DataFrame(beatles_dict)
beatles_df

In [103]:
# Check dtypes
beatles_df.dtypes

### <a id='toc3_1_2_'></a>[From Lists of Lists](#toc0_)

You can also create a DataFrame from a list of lists. There are two ways to do this:

- record-oriented lists, where each inner list is one row (not very common):

In [None]:
data = [
    ["John Lennon", "Vocals", 9, 9000],
    ["Paul McCartney", "Bass", 12, 2400],
    ["George Harrison", "Bass", 12, 2000],
    ["Ringo Starr", "Drums", 8, 1600],
    ["Hanif Kantor", "Violin", 1, 6]
]

# Creating the DataFrame with specified column names
beatles_df = pd.DataFrame(data, columns=['Name', 'Instrument', 'Tenure (years)', 'Num Fans'])
beatles_df

- column-oriented lists, where each list holds the values of one column (much more common):

In [None]:
names = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]
instruments = ["Vocals", "Bass", "Bass", "Drums", "Violin"]
tenure = [9, 12, 12, 8, 1]
num_fans = [9000, 2400, 2000, 1600, 6]

# Initial DF
beatles_df = pd.DataFrame([names, instruments, tenure, num_fans])
beatles_df

In [None]:
# Each list became a row instead of a column, so we need to spin the table... or transpose it
beatles_df = beatles_df.T
beatles_df

### <a id='toc3_1_3_'></a>[From NumPy arrays](#toc0_)

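Whilst it's possible to convert lists of lists into DataFrames, it's much more likely you'll be converting NumPy arrays to DataFrames instead so you can operate on them more easily. Keep in mind that a NumPy array holds a single data type, so mixing strings and numbers (as below) stores every value as a string: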

In [None]:
data = np.array([
    ["John Lennon", "Vocals", 9, 9000],
    ["Paul McCartney", "Bass", 12, 2400],
    ["George Harrison", "Bass", 12, 2000],
    ["Ringo Starr", "Drums", 8, 1600],
    ["Hanif Kantor", "Violin", 1, 6]
])

# Creating the DataFrame with specified column names
beatles_df = pd.DataFrame(data, columns=['Name', 'Instrument', 'Tenure (years)', 'Num Fans'])
beatles_df
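
One thing worth checking after this kind of conversion is the column types. The sketch below, which uses a shortened version of the data and hypothetical names (`mixed_df`, `fixed_df`), suggests how the numeric columns could be converted back to integers with `astype()`.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so the numbers are stored as strings here
arr = np.array([
    ["Ringo Starr", "Drums", 8, 1600],
    ["Paul McCartney", "Bass", 12, 2400],
])
mixed_df = pd.DataFrame(arr, columns=["Name", "Instrument", "Tenure (years)", "Num Fans"])

# The tenure column is not numeric yet
tenure_is_numeric_before = pd.api.types.is_numeric_dtype(mixed_df["Tenure (years)"])

# Convert the two numeric columns explicitly
fixed_df = mixed_df.astype({"Tenure (years)": int, "Num Fans": int})

print(tenure_is_numeric_before)
print(fixed_df.dtypes)
```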

### <a id='toc3_1_4_'></a>[From a file / URL](#toc0_)

We will use `read_csv` to read data from a CSV file and create a DataFrame. (We can use the `usecols` parameter together with the `squeeze("columns")` method to create a Series instead; see the section on creating a Series from a file.)

In [108]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_df = pd.read_csv(url)

# Types of files
# excel (personally only recommend using Excel for client facing tables as it's heavier than .csv and can get messier)
# parquet (very large files, can be used in finance, retail, bioinformatics, etc.)
# JSON (typically used for website contents -> web scraping sessions)
# sql (when you have a connection with a local/cloud database)
# SPSS (from different scientific software, typically in survey settings)
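
Readers for other formats follow the same pattern as `read_csv`. As a small sketch that needs no files on disk, the example below reads CSV text from an in-memory buffer and round-trips it through JSON; the CSV content and the names `csv_text`, `from_csv` and `from_json` are invented for illustration.

```python
import io

import pandas as pd

# In-memory CSV text standing in for a real file
csv_text = "name,instrument,tenure\nJohn Lennon,Vocals,9\nRingo Starr,Drums,8\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Write the same table as JSON records and read it back
json_text = from_csv.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```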

## <a id='toc3_2_'></a>[DataFrame operations](#toc0_)

### <a id='toc3_2_1_'></a>[`head()` and `tail()`](#toc0_)

The `head()` method returns the first few rows of a DataFrame or Series, while the `tail()` method returns the last few rows.

In [None]:
# Default head and tail
display(titanic_df.head())
titanic_df.tail()

In [None]:
# Custom # of rows
display(titanic_df.head(3))
titanic_df.tail(3)

In [None]:
# Print vs display for DataFrames
print(titanic_df.head(3))
display(titanic_df.head(3))

### <a id='toc3_2_2_'></a>[`index, columns and values`](#toc0_)

The `index` attribute in Pandas returns the index labels of a DataFrame or Series, and the `columns` attribute returns the column labels of a DataFrame.

In [None]:
titanic_df.index

The result is a `RangeIndex`, a special type of index in Pandas. For example, `RangeIndex(start=0, stop=1460, step=1)` represents the integers starting from 0, up to (but not including) 1460, with a step size of 1.

In [None]:
# Convert to list
list(titanic_df.index)

In [None]:
# Review columns - data type is index!
titanic_df.columns

In [None]:
# Convert to list
list(titanic_df.columns)

In Pandas, `df.values` is an attribute that returns the underlying NumPy array of a DataFrame df. It represents the actual data stored in the DataFrame as a two-dimensional NumPy array.

In [None]:
# .values is useful when you use/create functions that need to handle either lists or arrays  
titanic_df.values
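
The pandas documentation also offers the `to_numpy()` method for the same purpose. A minimal sketch with a made-up two-column DataFrame (`small_df` is a hypothetical name):

```python
import numpy as np
import pandas as pd

small_df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

as_values = small_df.values      # attribute
as_numpy = small_df.to_numpy()   # method with the same result here

# Mixed int and float columns are upcast to a common float dtype
print(type(as_numpy), as_numpy.dtype)
print(as_numpy)
```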


Additionally, Pandas can create DataFrames from other sources, such as JSON or a URL, among others.


---
## <a id='toc3_3_'></a>[Data access in Dataframe](#toc0_)


### <a id='toc3_3_1_'></a>[Columns](#toc0_)

You can extract columns from a `DataFrame` using dictionary-like notation (`df["col"]`) or attribute access (`df.col`); both return a `Series`. Attribute access only works when the column label is a valid Python identifier.

In [None]:
# dict way of accessing a column
titanic_df["Sex"] 

In [None]:
# Check the data structure of the selected column
type(titanic_df["Sex"]) # Whenever we select a column, we get a Pandas Series

In [None]:
# attribute access of the column
titanic_df.Sex

In [None]:
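# Attribute access only works if the column name is a valid Python identifier (no spaces or special characters)
demo_df = titanic_df.rename(columns={'PassengerId': 'Passenger Id'})  # copy with a space in the column name
# demo_df.Passenger Id  # SyntaxError: a name containing a space cannot be used as an attribute
demo_df['Passenger Id']  # bracket notation works for any column name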

In [None]:
# When we use a list instead of a string we get a DataFrame instead of a Series -> useful when that's the data structure you need
display(titanic_df[["Sex"]])
type(titanic_df[['Sex']])

In [None]:
# Multiple objects
titanic_df[["Sex", "Fare"]]

### <a id='toc3_3_2_'></a>[Rows](#toc0_)

To access rows in a Pandas DataFrame, you can use the `iloc` attribute with integer-based indexing or the `loc` attribute with label-based indexing. For example, `df.iloc[0]` will access the first row, and `df.loc['row_label']` will access the row with the specified label.


<div class="alert alert-info">Note: Consult http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#different-choices-for-indexing to understand the differences between methods.</div>

Just for the sake of this example, we will create a DataFrame with a non-default index.

![Table](../../../img/pd-indexing.png)

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index


**loc** is used for label-based indexing to access rows.

In [None]:
df_index.loc["789B"]


**iloc** is used for integer-based indexing in Pandas to access rows

In [None]:
df_index.iloc[1]

### <a id='toc3_3_3_'></a>[Specific values](#toc0_)

Here's a summary of the different ways to access individual values in a Pandas DataFrame:

1. `df.loc[row_label][column_label]`: Chooses the row with the label and then the value in that row with the column label.

2. `df.iloc[row_position][column_label]`: Selects the row with the position and then the value in that row with the column label.

3. `df.loc[row_label, column_label]`: Directly accesses the value using both row and column labels.

4. `df.iloc[row_position, column_position]`: Directly accesses the value using both row and column positions.

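Note: you might also see chained column-then-row access such as `df[column_label][row_label]`, but it is less recommended than using `loc[]`. Plain `df[...]` looks up columns, so `df[row_label][column_label]` raises a `KeyError`.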
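
The direct forms (options 3 and 4) can be sketched as follows, using the same small table as in the cells of this section. The `at` and `iat` accessors are scalar-only counterparts of `loc` and `iloc`; the result variable names are only illustrative.

```python
import pandas as pd

df_index = pd.DataFrame(
    {"name": ["Paula", "Mark"], "score": [98.5, 95]},
    index=["123A", "789B"],
)

# Option 3: row label and column label in one step
by_label = df_index.loc["789B", "score"]

# Option 4: row position and column position in one step
by_position = df_index.iloc[1, 1]

# Scalar-only accessors for a single cell
fast_label = df_index.at["123A", "name"]
fast_position = df_index.iat[0, 0]

print(by_label, by_position, fast_label, fast_position)
```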

In [None]:
# Option 1 - names

# Step 1 - select the row
df_index.loc['123A']

# Step 2 - select the column
df_index.loc['123A']['name']

In [None]:
# Try Option 1 without loc: raises a KeyError, because df[...] looks for a column named '123A'
df_index['123A']['name']

In [None]:
# Option 2 - index, name

# Step 1 - select the row
df_index.iloc[0]

# Step 2 - select the column
# 2.a. Use column name
df_index.iloc[0]['name']

In [None]:
# 2.b. Use column position (use iloc again; plain [0] on a label-indexed Series is deprecated in recent pandas)
df_index.iloc[0].iloc[0]

In [None]:
# Try Option 2 without iloc: raises a KeyError, because there is no column labelled 0
df_index[0][0]

### <a id='toc3_3_4_'></a>[⚠️ WARNING: Changing values in DataFrames ⚠️ (🔰)](#toc0_)

In [None]:
# Get beatles names (the latest beatles_df uses the capitalised column label 'Name')
new_beatles = beatles_df['Name']
new_beatles

In [None]:
# Steal Hanif's place in the band
new_beatles[4] = 'sabina'
new_beatles

In [None]:
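# Depending on the pandas version, the change above may also show up here:
# older versions can modify the original through the selected column,
# while versions with Copy-on-Write enabled leave beatles_df untouched
beatles_df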

In [None]:
# Use .copy() instead
new_beatles_df = beatles_df.copy()
new_beatles_series = new_beatles_df['Name']
new_beatles_series[4] = 'Hanif Kantor'
new_beatles_df

In [None]:
beatles_df
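
The safe pattern is the same regardless of the pandas version: make an explicit copy before changing values. A self-contained sketch with made-up data (`band_df` and `band_copy` are hypothetical names):

```python
import pandas as pd

band_df = pd.DataFrame({"name": ["John", "Paul"], "tenure": [9, 12]})

# Work on an independent copy so the original is never touched
band_copy = band_df.copy()
band_copy.loc[1, "name"] = "Sabina"

print(band_df)    # still contains "Paul"
print(band_copy)  # contains "Sabina"
```

Using `loc[row, column]` for the assignment also avoids the chained-indexing pattern that triggers warnings in many pandas versions.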


---
## <a id='toc3_4_'></a>[More DataFrame Methods](#toc0_)


Let's see some useful methods of the Dataframe class.

### <a id='toc3_4_1_'></a>[`shape`](#toc0_)

`shape` is a Pandas attribute that returns a tuple with the dimensions of a DataFrame (number of rows, number of columns) or of a Series (a one-element tuple with the number of rows). Because it is an attribute and not a method, calling `shape()` raises a `TypeError` (a tuple is not callable).

In [None]:
titanic_df.shape

In [None]:
# No of cols
titanic_df.shape[1]

In [None]:
# No of rows
titanic_df.shape[0]

### <a id='toc3_4_3_'></a>[`info()`](#toc0_)

`info()` is a Pandas method that provides a concise summary of a DataFrame, including information about the data types, non-null values, and memory usage.

In [None]:
titanic_df.info()

### <a id='toc3_4_2_'></a>[`describe()`](#toc0_)

`describe()` is a Pandas method that generates descriptive statistics of a DataFrame, providing information on count, mean, standard deviation, minimum, maximum, and quartiles for each numerical column.

In [None]:
titanic_df.describe() # By default looks only at numerical data

In [None]:
titanic_df.describe(include="object")

In [None]:
titanic_df.describe(include="all")

### <a id='toc3_4_4_'></a>[`nunique() and unique()`](#toc0_)

`nunique()` is a Pandas method that returns the number of unique elements in a Series, or the number per column in a DataFrame, while `unique()` is a Series method that returns an array of the unique elements.

In [None]:
titanic_df.nunique() #number of unique values per column

In [None]:
#see unique values of one column
titanic_df.Pclass.unique()

### <a id='toc3_4_5_'></a>[`dtypes`](#toc0_)

`dtypes` is a Pandas attribute that returns the data type of each column in a DataFrame, while `dtype` is the attribute that returns the single data type of a Series (for example, one column).

In [None]:
titanic_df.dtypes
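
A short sketch of the difference between the two attributes, on a made-up DataFrame (`typed_df` is a hypothetical name):

```python
import pandas as pd

typed_df = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})

# dtypes: one entry per column, returned as a Series
all_types = typed_df.dtypes

# dtype: the single type of one column (a Series)
one_type = typed_df["fare"].dtype

print(all_types)
print(one_type)
```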

### <a id='toc3_4_6_'></a>[`select_dtypes()`](#toc0_)

`select_dtypes()` is a pandas function used to filter columns in a DataFrame based on their data types. It allows you to select numeric, object (string), boolean, datetime, or categorical columns.

Syntax:
```python
DataFrame.select_dtypes(include=None, exclude=None)
```

- `include`: A list of data types or strings representing data types to include in the selection. If specified, only columns with these data types will be included.
- `exclude`: A list of data types or strings representing data types to exclude from the selection. If specified, columns with these data types will be excluded.

Example:
```python
# Assuming df is a DataFrame
numeric_columns = df.select_dtypes(include='number')
object_columns = df.select_dtypes(include='object')
```



In [None]:
titanic_df.select_dtypes(include='number').head()

### <a id='toc3_4_7_'></a>[Aggregation such as `max()`](#toc0_)

In [None]:
titanic_df.select_dtypes(include='number').max() #get the max for each numerical column in the df

In [None]:
titanic_df.Fare.max() #get the max for one numerical column

Just like `max()`, many other aggregation methods (such as `min()`, `mean()` or `sum()`) can be applied to either the entire DataFrame or its individual columns.
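
A few of these aggregations, sketched on a small made-up DataFrame (`fares_df` and the values in it are invented for illustration):

```python
import pandas as pd

fares_df = pd.DataFrame({"Age": [20, 30, 40], "Fare": [10.0, 20.0, 60.0]})

# Whole DataFrame: one result per column
column_means = fares_df.mean()

# Single column: one scalar
fare_min = fares_df["Fare"].min()

# Several aggregations at once with agg()
summary = fares_df.agg(["min", "max"])

print(column_means)
print(fare_min)
print(summary)
```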

## <a id='toc3_5_'></a>[💡 Check for understanding](#toc0_)

- a. Use the original titanic_df
- b. Select the `Sex` and `Fare` columns.
- c. Indicate how many different types of `Sex` there are.
- d. Indicate how many passengers of each `Sex` there are.
- e. Show a statistical summary of all the variables.
- f. Write some conclusions

In [146]:
# Your code here

### <a id='toc4_3_1_'></a>[`concat()`](#toc0_)

`concat()` is a pandas function used to concatenate and combine DataFrames along a specified axis, either vertically (rows, or axis 0):

In [147]:
missing_passenger_data = {
    'PassengerId': 4,
    'Survived': 0,
    'Pclass': 3,
    'Name': 'Mary Johnson',
    'Sex': 'female',
    'Age': None,          # Missing age
    'SibSp': 0,
    'Parch': 0,
    'Ticket': 'STON/O2. 3101283',
    'Fare': None,         # Missing fare
    'Cabin': None,        # Missing cabin
    'Embarked': 'S'
}

In [None]:
# Convert to DataFrame directly: raises a ValueError, because all the values are scalars and no index is given
missing_passenger = pd.DataFrame(missing_passenger_data)

In [None]:
# Convert to DataFrame via list
missing_passenger = pd.DataFrame([missing_passenger_data])
missing_passenger

In [None]:
# Check shape before and after appending
print(titanic_df.shape)
titanic_df = pd.concat([titanic_df, missing_passenger], axis=0)
print(titanic_df.shape)

or horizontally (columns, or axis 1):

In [None]:
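# New column with the first word of each name, as a Series with its index aligned to the DataFrame
# (in the standard Titanic data, names are written "Surname, Title. Given names", so the first word is the surname)
# This would usually come from a separate place but this time we are creating it
# rename() gives the Series its own name, so the new column does not duplicate the 'Name' label
first_name_col = titanic_df['Name'].apply(lambda name: name.split(" ")[0].replace(",", "")).rename("FirstName")
first_name_col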

In [None]:
# Check shape before and after appending
print(titanic_df.shape)
titanic_df = pd.concat([titanic_df, first_name_col], axis=1)
print(titanic_df.shape)

By default, `concat` keeps the original index labels; it only renumbers the index if we pass `ignore_index=True`.

In [None]:
titanic_df.loc[0] # we see we have two elements for index 0

In [None]:
titanic_df_new_idx = pd.concat([titanic_df, missing_passenger],ignore_index=True)
titanic_df_new_idx
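
The effect of `ignore_index` is easier to see on a tiny example. The DataFrames `first` and `second` below are made up for this sketch.

```python
import pandas as pd

first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"x": [3]})

# Default: the original labels are kept, so label 0 appears twice
kept = pd.concat([first, second])

# ignore_index=True: the result gets a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```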

### `set_index()`

In [None]:
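# What column makes most sense as an index?
# set_index() returns a new DataFrame; titanic_df itself stays unchanged unless we reassign the result
titanic_df.set_index('PassengerId')
titanic_df.head()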
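
A minimal sketch of what `set_index()` returns, using a made-up two-row table (`people` and `returned` are hypothetical names):

```python
import pandas as pd

people = pd.DataFrame({"PassengerId": [1, 2], "Name": ["Ann", "Bob"]})

# The result is a new DataFrame; the original keeps its default index
returned = people.set_index("PassengerId")

print(people)
print(returned)
```

To keep the new index, assign the result back, for example `people = people.set_index("PassengerId")`.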

### <a id='toc4_3_2_'></a>[`sort_values() and sort_index()`](#toc0_)

`sort_values()` is a pandas DataFrame method that sorts the DataFrame based on specified column(s), while `sort_index()` sorts the DataFrame based on its index labels.

In [None]:
# Quickly review the dataframe
titanic_df.head()

In [None]:
# sort by age
titanic_df.sort_values(by='Age')

In [None]:
# sort by fare
titanic_df.sort_values(by='Fare')

In [None]:
# Sort descending by fare
titanic_df.sort_values(by="Fare", ascending=False) #to change the ordering type

In [None]:
# Sort by index
titanic_df.sort_index(ascending=False)

Neither `sort_index()` nor `sort_values()` changes the original DataFrame.
We can either assign the result back to the variable or use a parameter
called **inplace**.

In [None]:
# Check if we changed the df
titanic_df.head() 

In [None]:
# Change the df
titanic_df.sort_index(ascending=False, inplace=True)

# equivalent: titanic_df = titanic_df.sort_index(ascending=False)

In [None]:
titanic_df.head() # the DataFrame has changed because we used 'inplace=True'

---
# <a id='toc4_'></a>[Series in Pandas](#toc0_)

A `Series` is the basic building block of a DataFrame and is designed to store homogeneous one-dimensional data, i.e. data of the same type, be it int, string, float, or datetime. It can be created from lists, dictionaries, or directly read from a file, similarly to DataFrames.

## <a id='toc4_1_'></a>[Creating Series](#toc0_)


### <a id='toc4_1_1_'></a>[From a list](#toc0_)

Create a Series with default indexes from a list

In [None]:
beatles = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]

# Create pandas series
beatles_series = pd.Series(beatles)
beatles_series

In [None]:
type(beatles_series) # careful: it may look like a list, but it's a pandas Series

In [None]:
beatles_series.dtype #this gives me the type of the elements inside the series
# this is an attribute (remember our lesson about classes)

We can also have numbers in the series:

In [None]:
num_fans = [9000, 2400, 2000, "1600", 15]

# Create series
beatles_fans_series = pd.Series(num_fans)
beatles_fans_series

In [None]:
# Remember... Python names are case-sensitive: this raises an AttributeError
beatles_fans_series = pd.series(num_fans)

In [None]:
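# Correct series: capital S in pd.Series, and 1600 as a number instead of the string "1600"
num_fans = [9000, 2400, 2000, 1600, 15]
beatles_fans_series = pd.Series(num_fans)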

In [None]:
# Series data type
type(beatles_series)

In [None]:
# Type of data inside the series - with and without print
beatles_fans_series.dtype

No need to worry about the number after `int`: it is the number of bits used to store each value (64 bits in this case), which determines the range of integers the Series can hold. If you're curious about how these bits work (i.e. go down a rabbit hole), you can <a href="https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-course/numpy-introduction-a">have a look at this video</a> from the FreeCodeCamp Data Analysis certification.
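
For the curious, the range that each integer size can store can be inspected with NumPy's `iinfo`. The sketch below uses a made-up `fans` Series and converts it to a smaller integer type with `astype()`; all names are illustrative.

```python
import numpy as np
import pandas as pd

fans = pd.Series([9000, 2400, 2000, 1600, 15])

# Largest value that fits in 64-bit and 16-bit signed integers
int64_max = np.iinfo(np.int64).max
int16_max = np.iinfo(np.int16).max

# All values here are below the 16-bit maximum, so the conversion is lossless
small_fans = fans.astype("int16")

print(fans.dtype, int64_max)
print(small_fans.dtype, int16_max)
```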

A `Series` has two main attributes: `values` and `index`. The first is a NumPy array that stores the data, and the second is an object that contains the index labels.

In [None]:
beatles_series.values

In [None]:
beatles_series.index

In Pandas, `Series.items()` is a method used to iterate over the elements of a Pandas Series. It returns an iterator that yields the index-label and corresponding value pairs of the Series.

In [None]:
# items() yields (index, value) tuples: iterating with a single variable
# gives the whole tuple in each iteration, while iterating with two
# variables (next cell) unpacks the index and the value separately
for i in beatles_series.items() :
    print(i)

In [None]:
for i, v in beatles_series.items() :
    print("index ", i)
    print("value ", v)

### <a id='toc4_1_2_'></a>[From a list with index](#toc0_)

When creating a `Series`, you can explicitly define an `array` index and pass it as an argument.


Creating series with defined indexes

In [None]:
beatles = ["John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr", "Hanif Kantor"]
num_fans = [9000, 2400, 2000, 1600, 15]

# Create series
beatles_fans_series = pd.Series(num_fans, index=beatles)
beatles_fans_series

### <a id='toc4_1_3_'></a>[From a dictionary](#toc0_)

In [49]:
# I could do a Beatles fans dict manually
beatles_fans = {
    "John Lennon": 9000, 
    "Paul McCartney": 2400,
    "George Harrison": 2000,
    "Ringo Starr": 1600,
    "Hanif Kantor": 15,
    }

beatles_fans_series = pd.Series(beatles_fans)
beatles_fans_series

In [None]:
# Or I could use a dict comprehension; zip() pairs up the elements of the two lists position by position
beatles_fans = {beatle: fans for beatle, fans in zip(beatles, num_fans)}
beatles_fans

### <a id='toc4_1_4_'></a>[From a file](#toc0_)

`read_csv()` is a Pandas function used to read data from a CSV file and create a DataFrame.

When a single column is passed to the `usecols` parameter and the `squeeze("columns")` method is called on the result, we get a Series instead of a DataFrame.

In [None]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_series = pd.read_csv(url, usecols=["Name"]).squeeze("columns")
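
The same two steps can be tried without downloading anything. In this sketch the CSV text is invented and read from an in-memory buffer; `one_col_df` and `name_series` are hypothetical names.

```python
import io

import pandas as pd

csv_text = "Name,Age\nAnn,30\nBob,25\n"

# Step 1: usecols keeps only the requested column, still as a DataFrame
one_col_df = pd.read_csv(io.StringIO(csv_text), usecols=["Name"])

# Step 2: squeeze("columns") turns the one-column DataFrame into a Series
name_series = one_col_df.squeeze("columns")

print(type(one_col_df))
print(type(name_series))
```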



---
## <a id='toc4_2_'></a>[Data access in Series](#toc0_)


Data in a Series can be accessed either through the index labels or through the integer position of each element.

### <a id='toc4_2_2_'></a>[Using the index label:](#toc0_)

In [None]:
beatles_fans_series['Hanif Kantor']

### <a id='toc4_2_1_'></a>[Using Pandas internal index:](#toc0_)

In [None]:
# One element (use iloc for positions; plain [0] on a label-indexed Series is deprecated in recent pandas)
beatles_fans_series.iloc[0]

In [None]:
# Multiple elements slicing
beatles_fans_series[1:4]

In [None]:
# [1:] - everything from position 1 onwards
beatles_fans_series[1:]

In [None]:
# [1:5:2] - every second element from position 1 up to (not including) 5, i.e. the odd positions 1 and 3
beatles_fans_series[1:5:2]

In [None]:
# [::-1] - reversed
beatles_fans_series[::-1]
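
`loc` and `iloc` work on a Series as well and make it explicit whether you mean labels or positions. A small sketch with a made-up `fans` Series:

```python
import pandas as pd

fans = pd.Series({"John Lennon": 9000, "Paul McCartney": 2400, "Ringo Starr": 1600})

by_label = fans.loc["Ringo Starr"]   # label-based
by_position = fans.iloc[0]           # position-based

# Label slices include the end label; position slices exclude the end position
label_slice = fans.loc["John Lennon":"Paul McCartney"]
position_slice = fans.iloc[0:1]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```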



---
## <a id='toc4_3_'></a>[Methods in Series](#toc0_)

In [None]:
titanic_series

Series have many of the same methods as pandas DataFrames:
- concat
- sort_values
- sort_index

However, as opposed to DataFrames, they don't require column names or axes to be explicitly mentioned:

### `concat()`

In [None]:
missing_passengers = pd.Series(
    [
        "James Bennett",
        "Eleanor Smith (née Roberts)",
        "Lillian Grey",
        "Henry Dawson",
        "Thomas Wills",
        "Ada Carter (née Foster)",
        "Margaret Lane",
        "Charles Baldwin",
        "Beatrice Hollins",
        "Samuel Kline",
    ]
)
titanic_new_df = pd.concat([titanic_series, missing_passengers])
titanic_new_df

In [None]:
# Try using axis=1: the two Series become columns of a DataFrame, aligned on their index
titanic_new_df = pd.concat([titanic_series, missing_passengers], axis=1)

### `sort_values()`

In [None]:
titanic_series.sort_values()

### `sort_index()`

In [None]:
titanic_series.sort_index()

### <a id='toc4_3_3_'></a>[`value_counts()`](#toc0_)

`value_counts()` is a Pandas function that returns a Series containing the counts of unique values in a Series or DataFrame.

In [None]:
titanic_series.value_counts()
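
Since almost every name in `titanic_series` is unique, `value_counts()` is more informative on a column with repeated values. The sketch below uses a made-up `ports` Series and also shows the `normalize=True` option, which returns proportions instead of counts.

```python
import pandas as pd

ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Counts per value, sorted from most to least frequent
counts = ports.value_counts()

# Proportions instead of absolute counts
shares = ports.value_counts(normalize=True)

print(counts)
print(shares)
```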

## <a id='toc4_4_'></a>[💡 Check for understanding](#toc0_)

In [None]:
titanic_series

1. Get the column "Embarked" from the Titanic csv as a Pandas Series
2. Print the first value
3. Print the last 5 values
4. Append "NA" to the Series
5. Get the number of each Embarked type (number of repeated values)
6. Order the Series *descending*, and print the Embarked type most repeated in the Series

In [None]:
# Your answer here

# <a id='toc5_'></a>[Summary](#toc0_)

- Pandas is a library built on top of NumPy and designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more.
- DataFrames and Series are the two main data structures in Pandas.
- Series is a one-dimensional array of data with associated labels called the index, while DataFrame is a two-dimensional tabular data structure with labeled rows and columns.
- Data access in Series and DataFrame can be achieved using integer-based indexing (iloc), label-based indexing (loc), or dictionary-like notation for column access.
- Series and DataFrame have various methods, such as sort_values(), sort_index(), value_counts(), describe(), info(), nunique(), unique() and select_dtypes(), as well as attributes such as shape and dtypes.


# <a id='toc6_'></a>[Extra: Creating Dataframes from a Dictionary](#toc0_)

In [None]:
# Create a Dataframe from a dictionary with
# automatic indexes

d = {"state": ["Ohio", "Ohio", "California", "Nevada", "California"],
     "year": [2000, 2001, 2002, 2001, 2002],
     "avg": [1.5, 1.7, 3.6, 2.4, 1.9]
}

df = pd.DataFrame(d)
df


DataFrame from a dictionary of lists and indexes

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index

# <a id='toc7_'></a>[Extra: pickle](#toc0_)

Pandas provides a convenient way to save and load DataFrames to and from disk in Python's pickle format, through the `to_pickle()` method and the `read_pickle()` function. Pickling serializes Python objects into a binary format, making it easy to store large datasets or complex data structures. It's a great tool for saving and restoring your work, especially when dealing with large datasets that might take a long time to process or recreate. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.

## <a id='toc7_1_'></a>[Saving DataFrames with Pickle](#toc0_)

You can use the `to_pickle()` method in pandas to save a DataFrame to a pickle file. This method takes the file path as an argument and creates a binary representation of the DataFrame, which is then saved to the specified file.

In [None]:
import pandas as pd

# Assuming df is your DataFrame
df.to_pickle('data.pkl')

## <a id='toc7_2_'></a>[Loading DataFrames from Pickle](#toc0_)

To load a DataFrame from a pickle file, you can use the `read_pickle()` function in pandas. This function reads the binary data from the pickle file and converts it back into a DataFrame.


In [73]:
import pandas as pd

# Load DataFrame from pickle file
df = pd.read_pickle('data.pkl')

In [None]:
df.head()
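
To check that nothing is lost in the process, the save and load steps can be combined into a round trip. This sketch writes to a temporary directory so it leaves no files behind; `original` and `restored` are hypothetical names and the data is made up.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"state": ["Ohio", "Nevada"], "avg": [1.5, 2.4]})

# Save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    original.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored DataFrame has the same values, index and dtypes
print(restored.equals(original))
```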