# Short intro to NumPy

We start with a quick look at NumPy, because Pandas is built on top of it.

NumPy is a powerful library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with an extensive collection of mathematical functions to operate on these arrays efficiently.

In [1]:
# To use a library:
#
# 1-The library must be installed on our computer.
#
# 2-Import the library or some functions from the library.
#
# import library_name
# import library_name as alias
# from library import function, class,...
#
import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Performing operations on the array
result = arr * 2
print(result)  # Output: [2 4 6 8 10]


[ 2  4  6  8 10]
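As a small additional sketch, the same element-wise style also works between two arrays, and aggregations such as `sum()` reduce an array to a single value. The names `prices` and `quantities` and their values are made up for illustration only.

```python
import numpy as np

# Hypothetical data, used only for illustration
prices = np.array([1.5, 2.0, 3.25])
quantities = np.array([4, 10, 2])

# Element-wise multiplication: no explicit loop is needed
totals = prices * quantities

# An aggregation reduces the whole array to one value
grand_total = totals.sum()

print(totals)
print(grand_total)
```

Both arrays must have compatible shapes for the element-wise operation to work; here they have the same length.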


In [2]:
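# other ways to do it with a plain Python list

# loops (this modifies my_list in place: it becomes [2, 4, 6, 8, 10])
my_list = [1,2,3,4,5]
for i in range(len(my_list)):
    my_list[i] = my_list[i]*2

# map (returns a new list; my_list is already doubled, so the values double again)
list(map(lambda x: x*2, my_list))

# list comprehension (same result as map; only this last expression is displayed)
[x*2 for x in my_list]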

[4, 8, 12, 16, 20]

In [3]:
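# how to do this for a dict

my_dict = {"a":1, "b":2, "c":3}

# loop over the keys (modifies my_dict in place: values become 2, 4, 6)
for key in my_dict.keys():
    my_dict[key] = my_dict[key] * 2

# loop over (key, value) pairs (doubles in place again: 4, 8, 12)
for key, value in my_dict.items():
    my_dict[key] = value * 2

# dict comprehension (builds a new dict with the values doubled once more)
{key: value*2 for key, value in my_dict.items()}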

{'a': 8, 'b': 16, 'c': 24}

# Introduction to Pandas Library

`pandas` is a powerful library designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more. It revolves around two main data structures: `Series` and `DataFrame`.

## Key Features of Pandas

- Robust data import/export: `pandas` provides efficient tools to read and write data from various file formats like CSV, XLS, SQL, and HDF5.
- Data manipulation: With `pandas`, you can easily filter, add, or delete data, enabling seamless data processing.
- High-performance and versatility: It combines the performance of `numpy` arrays with the ability to handle tabulated data efficiently.

To import the necessary modules from the `pandas` library, we use the following syntax:

In [4]:
import pandas as pd  # 'pd' is an alias for pandas

![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/pandas.svg?raw=true)

## DataFrames and Series in Pandas

In `pandas`, a `Series` is a one-dimensional array of data with associated labels called the *index*. If no index is specified, it generates an ordered sequence of integers.

```python
# Creating a Series
s = pd.Series(data, index=index)
```
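The template above leaves `data` and `index` undefined. A minimal runnable sketch, assuming made-up values and labels, could look like this:

```python
import pandas as pd

# Hypothetical values and labels, chosen only for illustration
data = [10.5, 20.0, 30.25]
index = ["a", "b", "c"]

s = pd.Series(data, index=index)

# Without an explicit index, pandas falls back to 0, 1, 2, ...
s_default = pd.Series(data)

print(s)
print(list(s_default.index))
```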

On the other hand, a `DataFrame` is a two-dimensional data structure that stores data in tabular form, with labeled rows and columns. Each row represents an observation, and each column represents a variable. `DataFrame` can handle heterogeneous data with different types (numeric, string, boolean, etc.). It also includes variable names, types, and methods to access and modify the data.

```python
# Creating a DataFrame
df = pd.DataFrame(data, ...)
```
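As a hedged illustration of the heterogeneous columns described above, the sketch below builds a small DataFrame from a dictionary. The column names and records are invented for this example.

```python
import pandas as pd

# Hypothetical records: each key becomes a column, each list position a row
data = {
    "name": ["Ana", "Luis", "Mei"],
    "age": [31, 45, 27],
    "member": [True, False, True],
}

df_example = pd.DataFrame(data)

# Each column keeps its own data type
print(df_example.dtypes)
print(df_example.shape)
```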


---
# Series in Pandas

## Creating Series


### From a list

Create a Series with default indexes from a list

In [18]:
l = [1980, 2020, 2001, 1999]
series = pd.Series(l)
series #we can see dtype int which means the series has integers

0    1980
1    2020
2    2001
3    1999
dtype: int64

In [None]:
# To know the version of a library
print(f"The version of numpy is: {np.__version__}")
print(f"The version of pandas is: {pd.__version__}")

In [22]:
type(series) # but careful: the object itself is still a pandas Series

pandas.core.series.Series

In [24]:
series.dtype #this gives me the type of the elements inside the series
# this is an attribute (remember our lesson about classes)

dtype('int64')

A `Series` has two main attributes: `values` and `index`. The first is a NumPy array that stores the data, and the second is an object that contains the index labels.

In [25]:
series.values

array([1980, 2020, 2001, 1999])

In [40]:
series.index

RangeIndex(start=0, stop=4, step=1)

In Pandas, `Series.items()` is a method used to iterate over the elements of a Pandas Series. It returns an iterator that yields the index-label and corresponding value pairs of the Series.

In [35]:
for item in series.items():
    print(item)

(0, 1980)
(1, 2020)
(2, 2001)
(3, 1999)


In [43]:
# series.index holds only the labels, so this loop prints the labels
for i in series.index:
    print(i)

# items() yields (index, value) tuples, so to get the two parts
# separately we iterate with two variables:
#for key, value in series.items():
#    print(key, value)

0
1
2
3


In [44]:
for i, v in series.items() :
    print("index ", i)
    print("value ", v)

index  0
value  1980
index  1
value  2020
index  2
value  2001
index  3
value  1999


### From a list with index

When creating a `Series`, you can pass an explicit list (or array) of index labels through the `index` argument.


Creating series with defined indexes

In [45]:
# create a new series with same values as before, and as indexes, names
# values [1980, 2020, 2001, 1999]
# index "deb","d","martha", "martin"
new_series = pd.Series([1980, 2020, 2001, 1999],
                       index = ["deb","d","martha", "martin"])
new_series

deb       1980
d         2020
martha    2001
martin    1999
dtype: int64

### From a dictionary

In [46]:
# using dict comprehension, create a dictionary
# that has as key "square of {nr}" and as value the nr squared
# for values from 0 to 10
d = {"square of "+str(n): n*n for n in range(11)}
d

{'square of 0': 0,
 'square of 1': 1,
 'square of 2': 4,
 'square of 3': 9,
 'square of 4': 16,
 'square of 5': 25,
 'square of 6': 36,
 'square of 7': 49,
 'square of 8': 64,
 'square of 9': 81,
 'square of 10': 100}

In [53]:
pd.Series({str(n): n for n in range(11)})

0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
dtype: int64

In [51]:
pd.Series(d) # this creates a series from the dictionary 'd'

square of 0       0
square of 1       1
square of 2       4
square of 3       9
square of 4      16
square of 5      25
square of 6      36
square of 7      49
square of 8      64
square of 9      81
square of 10    100
dtype: int64

### From a file

`read_csv()` is a Pandas function used to read data from a CSV file and create a DataFrame.

If you pass a single column to the `usecols` parameter and then call the method `squeeze("columns")`, the result is a `Series` instead of a `DataFrame`.

In [54]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_series = pd.read_csv(url, usecols=["Name"]).squeeze("columns")
titanic_series

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
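To see the effect of `squeeze("columns")` without downloading anything, here is a minimal sketch that reads a tiny in-memory CSV. The CSV text and the variable names are assumptions made for this example.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for a real file (made-up rows)
csv_text = "Name,Age\nAlice,30\nBob,25\nCarla,41\n"

# Without squeeze: a DataFrame with a single column
one_col_df = pd.read_csv(io.StringIO(csv_text), usecols=["Name"])

# With squeeze("columns"): that single column becomes a Series
name_series = pd.read_csv(io.StringIO(csv_text), usecols=["Name"]).squeeze("columns")

print(type(one_col_df).__name__)
print(type(name_series).__name__)
```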



---
## Data access in Series


Values in a `Series` can be accessed either through the index labels (here, the names) or through their integer position (0, 1, 2, ...).

In [55]:
new_series

deb       1980
d         2020
martha    2001
martin    1999
dtype: int64

In [56]:
new_series.iloc[1] # accessing elements by integer position (plain new_series[1] is deprecated when the index holds labels)


2020

In [78]:
new_series["d"] #accessing elements by index label

2020

In [76]:
new_series[1:]  #we can do the same as with lists, and use slicing

d         2020
martha    2001
martin    1999
dtype: int64

In [59]:
new_series[1:3:2] # start at position 1, stop before position 3, step 2: only position 1 is selected

d    2020
dtype: int64

In [74]:
new_series[::-1]

martin    1999
martha    2001
d         2020
deb       1980
dtype: int64
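The explicit accessors `.loc` (labels) and `.iloc` (positions) make the intent of each access unambiguous. The sketch below rebuilds the same values and labels as `new_series` under the assumed name `years`.

```python
import pandas as pd

# Same values and labels as new_series above
years = pd.Series([1980, 2020, 2001, 1999],
                  index=["deb", "d", "martha", "martin"])

by_label = years.loc["d"]       # label-based access
by_position = years.iloc[1]     # position-based access

# Label slices include the end label; position slices exclude the end position
label_slice = years.loc["d":"martha"]
position_slice = years.iloc[1:2]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note the difference in the two slices: the label slice contains both `d` and `martha`, while the position slice contains only the element at position 1.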



---
## Methods in Series

### `concat()`

`concat()` is a pandas function used to concatenate `Series` or `DataFrame` objects along a specified axis, either vertically (rows) or horizontally (columns).

In [80]:
s3 = pd.concat([new_series,new_series])
s3

deb       1980
d         2020
martha    2001
martin    1999
deb       1980
d         2020
martha    2001
martin    1999
dtype: int64

By default, `concat` keeps the original indexes. To restart the index from 0, pass `ignore_index=True`.

In [81]:
s3["d"] # we see we have two elements for index "d"

d    2020
d    2020
dtype: int64

In [82]:
s3.iloc[1] # access by integer position (labels are repeated in s3, positions are not)


2020

In [83]:
s3 = pd.concat([new_series,new_series],ignore_index=True)
s3

0    1980
1    2020
2    2001
3    1999
4    1980
5    2020
6    2001
7    1999
dtype: int64

In [84]:
pd.DataFrame(pd.concat([new_series,new_series]))

Unnamed: 0,0
deb,1980
d,2020
martha,2001
martin,1999
deb,1980
d,2020
martha,2001
martin,1999


In [85]:
pd.DataFrame(pd.concat([new_series,new_series])).reset_index()

Unnamed: 0,index,0
0,deb,1980
1,d,2020
2,martha,2001
3,martin,1999
4,deb,1980
5,d,2020
6,martha,2001
7,martin,1999


In [87]:
pd.concat([new_series,new_series], axis=1)

Unnamed: 0,0,1
deb,1980,1980
d,2020,2020
martha,2001,2001
martin,1999,1999
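With `axis=1`, the Series are aligned on their index labels. The following sketch uses two made-up Series (`left` and `right`) that share only some labels, to show that labels missing on one side are filled with `NaN`.

```python
import pandas as pd

# Two hypothetical Series that share only some labels
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([20, 30, 40], index=["b", "c", "d"])

# axis=1 places the Series side by side, aligned on the index labels
side_by_side = pd.concat([left, right], axis=1)

print(side_by_side)
```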


### `sort_values()` and `sort_index()`

`sort_values()` sorts a `Series` by its values (for a `DataFrame`, by the column(s) you specify), while `sort_index()` sorts by the index labels.

In [88]:
#titanic_series
titanic_series.sort_values()

845                      Abbing, Mr. Anthony
746              Abbott, Mr. Rossmore Edward
279         Abbott, Mrs. Stanton (Rosa Hunt)
308                      Abelson, Mr. Samuel
874    Abelson, Mrs. Samuel (Hannah Wizosky)
                       ...                  
286                  de Mulder, Mr. Theodore
282                de Pelsmaeker, Mr. Alfons
361                del Carlo, Mr. Sebastiano
153          van Billiard, Mr. Austin Blyler
868              van Melkebeke, Mr. Philemon
Name: Name, Length: 891, dtype: object

In [None]:
titanic_series.sort_values(ascending=False) #to change the ordering type

`sort_index()` and `sort_values()` do not change the original Series; they return a new, sorted one.
We can either assign the result back to the variable or use a parameter
called **inplace**.

In [None]:
titanic_series # we see our last line of code didn't change our series

In [90]:
titanic_series.sort_index(ascending=False, inplace=True)

# equivalent: titanic_series = titanic_series.sort_index(ascending=False)

In [None]:
titanic_series # the series has changed because we used 'inplace=True'
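A short sketch with a made-up Series (`names`) illustrates two points: `sort_values()` returns a new object, and string sorting is case-sensitive, which is why names such as "de Mulder" and "van Billiard" appear at the end of the sorted Titanic output. The `key` argument used here is one possible way to ignore case.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
names = pd.Series(["Carla", "alice", "Bob"])

# sort_values() returns a new, sorted Series; 'names' keeps its order
sorted_names = names.sort_values()

# Uppercase letters sort before lowercase ones
print(sorted_names.tolist())

# A key function can make the comparison case-insensitive
sorted_ignoring_case = names.sort_values(key=lambda col: col.str.lower())
print(sorted_ignoring_case.tolist())
```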

### `value_counts()`

`value_counts()` is a Pandas method that returns a Series containing the count of each unique value in a Series or DataFrame, sorted from most to least frequent.

In [107]:
pd.read_csv(url)['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [104]:
pd.read_csv(url)['Sex'].unique()

array(['male', 'female'], dtype=object)

In [106]:
titanic_series.value_counts()

Name
Dooley, Mr. Patrick                 1
Levy, Mr. Rene Jacques              1
Keane, Miss. Nora A                 1
Johnson, Mr. William Cahoone Jr     1
McCoy, Mr. Bernard                  1
                                   ..
Rintamaki, Mr. Matti                1
Murdlin, Mr. Joseph                 1
Gilinski, Mr. Eliezer               1
Frolicher-Stehli, Mr. Maxmillian    1
Braund, Mr. Owen Harris             1
Name: count, Length: 891, dtype: int64
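Two optional parameters of `value_counts()` are often useful: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts missing values. The sketch below uses a made-up Series of port codes named `ports`.

```python
import pandas as pd

# Hypothetical port codes with one missing value
ports = pd.Series(["S", "C", "S", None, "Q", "S"])

counts = ports.value_counts()                    # missing values are skipped
shares = ports.value_counts(normalize=True)      # proportions instead of counts
with_missing = ports.value_counts(dropna=False)  # also counts missing values

print(counts)
print(shares)
print(with_missing)
```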

## 💡 Check for understanding

1. Get the column "Embarked" from the Titanic csv as a Pandas Series
2. Print the first value
3. Print the last 5 values
4. Append "NA" to the Series
5. Get the number of each Embarked type (number of repeated values)
6. Order the Series *descending*, and print the Embarked type most repeated in the Series

In [None]:
# Your answer here
# 1.
pd.read_csv(url)["Embarked"]

In [112]:
# 2.
pd.read_csv(url)["Embarked"][0]

'S'

In [None]:
# 3.
pd.read_csv(url)["Embarked"][-5:]

In [None]:
# 4
pd.Series(list(pd.read_csv(url)['Embarked'].values) + ["NA"])

In [115]:
# 5. (number of distinct Embarked types; missing values are not counted)
pd.read_csv(url)["Embarked"].nunique()

3

In [None]:
# 6.
pd.read_csv(url)["Embarked"].sort_values(ascending=False)

In [119]:
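# 5. and 6. count of each type; value_counts() already sorts descending, so the first row is the most repeated type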
pd.read_csv(url)["Embarked"].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [109]:
df = pd.read_csv(url)
df
# .loc[rows, columns] accessing by label
# .iloc[rows, columns] accessing by integer position
#df.loc[[2,3],["Survived", "Pclass"]]
# df.iloc[:,[1,2] ]
df.iloc[[2,3],[1,2]]

Unnamed: 0,Survived,Pclass
2,1,3
3,1,1


In [120]:
df = pd.read_csv(url)
df[(df['Embarked']=="S") | (df['Age']== 22)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
...,...,...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S



# DataFrames in Pandas

Unlike `Series`, a `DataFrame` is designed to store heterogeneous, multivariable data. It can also be created from dictionaries (see the Extra section for examples).

We will use `read_csv` to read data from a CSV file and create a DataFrame.

This time we use neither the `usecols` parameter nor the `squeeze("columns")` method, so we get the full table as a `DataFrame`.

In [121]:
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
titanic_df = pd.read_csv(url)

### `head()` and `tail()`

The `head()` method returns the first few rows of a DataFrame or Series, while the `tail()` method returns the last few rows.

In [122]:
titanic_df.head(7)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S


In [123]:
titanic_df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


### `index`, `columns` and `values`

The `index` attribute in Pandas returns the index labels of a DataFrame or Series, and the `columns` attribute returns the column labels of a DataFrame.

In [125]:
titanic_df.index

RangeIndex(start=0, stop=891, step=1)

`RangeIndex(start=0, stop=891, step=1)` is a special type of index in Pandas called a `RangeIndex`. It represents a range of integer values starting from 0, up to (but not including) 891, with a step size of 1.

In [126]:
list(titanic_df.index)

[0,
 1,
 2,
 ...
 888,
 889,
 890]


In [127]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [128]:
list(titanic_df.columns)

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In Pandas, `df.values` is an attribute that returns the underlying NumPy array of a DataFrame df. It represents the actual data stored in the DataFrame as a two-dimensional NumPy array.

In [None]:
titanic_df.head(5)

In [129]:
titanic_df.values

array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ...,
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
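The `dtype=object` in the output above appears because the columns have mixed types. The following sketch, built on a made-up DataFrame called `mixed`, contrasts that with a numeric-only selection. It uses `to_numpy()`, which the pandas documentation recommends over `.values` for new code.

```python
import pandas as pd

# Hypothetical mixed-type DataFrame
mixed = pd.DataFrame({
    "name": ["Ana", "Luis"],
    "age": [31, 45],
    "fare": [7.25, 71.5],
})

# Mixed columns force a generic 'object' array
all_values = mixed.to_numpy()

# Numeric-only columns share a common numeric type (int and float become float)
numeric_values = mixed[["age", "fare"]].to_numpy()

print(all_values.dtype)
print(numeric_values.dtype)
```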


Additionally, pandas can create DataFrames from other sources, such as JSON files or URLs, among others.


---
## Data access in Dataframe


### Columns

You can extract a column from a `DataFrame` using dictionary-like notation or attribute notation, obtaining a `Series` object in both cases. Attribute notation only works when the column label is a valid Python identifier.

In [130]:
titanic_df["Sex"] #dict way of accessing a column

0        male
1      female
2      female
3      female
4        male
        ...  
886      male
887    female
888    female
889      male
890      male
Name: Sex, Length: 891, dtype: object

In [None]:
titanic_df.Sex #attribute access of the column

#this only works if the column name doesn't have unusual characters, this means
#no spaces!!!
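A minimal sketch of that limitation, assuming a made-up DataFrame `tickets` with one column name that contains a space:

```python
import pandas as pd

# Hypothetical DataFrame with one awkward column name
tickets = pd.DataFrame({"Fare": [7.25, 71.28], "Ticket Class": [3, 1]})

fare_by_attribute = tickets.Fare      # works: 'Fare' is a valid identifier
fare_by_brackets = tickets["Fare"]    # bracket notation always works

# 'Ticket Class' contains a space, so dot syntax cannot be written for it
ticket_class = tickets["Ticket Class"]

print(fare_by_attribute.equals(fare_by_brackets))
print(ticket_class.tolist())
```

Bracket notation is also the safer choice when a column name collides with an existing DataFrame attribute or method, such as `count` or `shape`.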

In [131]:
titanic_df[["Sex"]] #by using double [] we get a Dataframe instead
# of a series

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male
...,...
886,male
887,female
888,female
889,male


In [134]:
titanic_df[["Sex","Fare"]]

Unnamed: 0,Sex,Fare
0,male,7.2500
1,female,71.2833
2,female,7.9250
3,female,53.1000
4,male,8.0500
...,...,...
886,male,13.0000
887,female,30.0000
888,female,23.4500
889,male,30.0000


### Rows

To access rows in a Pandas DataFrame, you can use the `iloc` attribute with integer-based indexing or the `loc` attribute with label-based indexing. For example, `df.iloc[0]` will access the first row, and `df.loc['row_label']` will access the row with the specified label.


<div class="alert alert-info">Note: Consult http://pandas.pydata.org/pandas-docs/version/0.18.1/indexing.html#different-choices-for-indexing to understand the differences between methods.</div>

Just for the sake of this example, we will create a DataFrame from a dictionary, with a non-default index.

In [None]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index


**loc** is used for label-based indexing to access rows.

In [135]:
df_index.loc["789B"]



**iloc** is used for integer-based indexing in Pandas to access rows

In [None]:
df_index.iloc[1]

### Specific values

Here's a summary of the different ways to access individual values in a Pandas DataFrame:

1. `df.loc[row_label][column_label]`: Chooses the row with the label and then the value in that row with the column label.

2. `df.iloc[row_position][column_label]`: Selects the row with the position and then the value in that row with the column label.

3. `df.loc[row_label, column_label]`: Directly accesses the value using both row and column labels.

4. `df.iloc[row_position, column_position]`: Directly accesses the value using both row and column positions.

Note: you might also see chained indexing such as `df[column_label][row_label]`, but it is less recommended than using `loc[]`.

In [None]:
# first I access the row with the index, then I access the value
# with the index of the resulting series (name of the column)
df_index.iloc[0]["score"]

In [None]:
# first I access the row with the label, then I access the value
# with the index of the resulting series (name of the column)
df_index.loc["789B"]["score"]

In [None]:
df_index.loc["789B","score"]
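The four access patterns listed above can be compared side by side. This sketch rebuilds the same small DataFrame as `df_index` under the assumed name `scores`, so it can run on its own.

```python
import pandas as pd

# Same small DataFrame as df_index above
scores = pd.DataFrame({"name": ["Paula", "Mark"], "score": [98.5, 95]},
                      index=["123A", "789B"])

chained_label = scores.loc["789B"]["score"]    # row by label, then column
chained_position = scores.iloc[1]["score"]     # row by position, then column
direct_label = scores.loc["789B", "score"]     # one step, using labels
direct_position = scores.iloc[1, 1]            # one step, using positions

print(chained_label, chained_position, direct_label, direct_position)
```

All four expressions return the same value. The one-step forms avoid building an intermediate row Series.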


---
## More DataFrame Methods


Let's see some useful methods and attributes of the `DataFrame` class.

### `shape`

`shape` is an attribute that returns a tuple with the dimensions of the object: `(number of rows, number of columns)` for a DataFrame and `(number of elements,)` for a Series. Because it is an attribute and not a method, it is written without parentheses; calling `shape()` raises a `TypeError`, since a tuple is not callable.

In [None]:
titanic_df.shape # gives a tuple (nr rows, nr columns)

In [None]:
titanic_df.shape[1] # tuples are accessed by position: index 1 is the nr of columns

In [None]:
titanic_df.shape[0] # nr of rows
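As a small sketch with a made-up DataFrame named `small`, the tuple returned by `shape` can be unpacked directly, and calling it like a function fails:

```python
import pandas as pd

# Hypothetical 3x2 DataFrame
small = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = small.shape       # tuple unpacking
column_shape = small["a"].shape    # a Series has a one-element shape tuple

print(n_rows, n_cols)
print(column_shape)

# shape is an attribute, so calling it like a function fails
try:
    small.shape()
except TypeError as error:
    print("TypeError:", error)
```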

### `describe()`

`describe()` is a Pandas method that generates descriptive statistics of a DataFrame, providing information on count, mean, standard deviation, minimum, maximum, and quartiles for each numerical column.

In [None]:
titanic_df.describe()

In [None]:
titanic_df.describe(include="object")

In [None]:
titanic_df.describe(include="all")

### `info()`

`info()` is a Pandas method that provides a concise summary of a DataFrame, including information about the data types, non-null values, and memory usage.

In [None]:
titanic_df.info()

### `nunique()` and `unique()`

`nunique()` is a method that returns the number of unique elements in a Series (or, for a DataFrame, in each column), while `unique()` returns an array with the unique elements of a Series. `unique()` is not available on a DataFrame, so select a column first.

In [None]:
titanic_df.nunique() #number of unique values per column

In [136]:
#see unique values of one column
titanic_df.Pclass.unique()

array([3, 1, 2])
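Missing values are treated differently by the two methods. The sketch below, using a made-up Series `ports`, shows that `nunique()` ignores them by default while `unique()` keeps them in the result.

```python
import pandas as pd

# Hypothetical port codes with one missing value
ports = pd.Series(["S", "C", None, "S", "Q"])

distinct_count = ports.nunique()                           # missing values ignored by default
distinct_count_with_missing = ports.nunique(dropna=False)  # missing value counted as one more
distinct_values = ports.unique()                           # includes the missing value

print(distinct_count, distinct_count_with_missing)
print(len(distinct_values))
```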

### `dtypes`

`dtypes` is an attribute that returns the data type of each column in a DataFrame, while `dtype` is the equivalent attribute of a Series and returns the single data type shared by its elements.

In [None]:
titanic_df.dtypes

### `select_dtypes()`

`select_dtypes()` is a pandas function used to filter columns in a DataFrame based on their data types. It allows you to select numeric, object (string), boolean, datetime, or categorical columns.

Syntax:
```python
DataFrame.select_dtypes(include=None, exclude=None)
```

- `include`: A list of data types or strings representing data types to include in the selection. If specified, only columns with these data types will be included.
- `exclude`: A list of data types or strings representing data types to exclude from the selection. If specified, columns with these data types will be excluded.

Example:
```python
# Assuming df is a DataFrame
numeric_columns = df.select_dtypes(include='number')
object_columns = df.select_dtypes(include='object')
```



In [None]:
titanic_df.select_dtypes(include=['int','float'])

In [137]:
titanic_df.select_dtypes(include='number').head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.25
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.925
3,4,1,1,35.0,1,0,53.1
4,5,0,3,35.0,0,0,8.05
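The `exclude` parameter works as the mirror image of `include`. A minimal sketch, assuming a made-up DataFrame `mixed`, splits the columns into numeric and non-numeric groups. Note that boolean columns are not part of the `'number'` selection.

```python
import pandas as pd

# Hypothetical DataFrame with text, numeric and boolean columns
mixed = pd.DataFrame({
    "name": ["Ana", "Luis"],
    "age": [31, 45],
    "fare": [7.25, 71.5],
    "member": [True, False],
})

numeric_only = mixed.select_dtypes(include="number")
non_numeric = mixed.select_dtypes(exclude="number")

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```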


### Aggregation such as `max()`

In [148]:
titanic_df.select_dtypes(include='number').max() #get the max for each numerical column in the df

PassengerId    891.0000
Survived         1.0000
Pclass           3.0000
Age             80.0000
SibSp            8.0000
Parch            6.0000
Fare           512.3292
dtype: float64

In [141]:
titanic_df["Fare"].max() #get the max for one numerical column

512.3292

 Just like `max()`, there are many methods that can be applied to either the entire dataframe or its individual columns.
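For instance, `mean()` and `min()` follow the same pattern, and `agg()` can compute several aggregations in one call. The DataFrame `fares` below is made up for this sketch and is not the Titanic data.

```python
import pandas as pd

# Hypothetical fares, used only for illustration
fares = pd.DataFrame({"Pclass": [3, 1, 3, 1], "Fare": [7.25, 71.5, 8.0, 53.0]})

column_means = fares.mean()                                 # one value per column
fare_summary = fares["Fare"].agg(["min", "max", "mean"])    # several aggregations at once

print(column_means)
print(fare_summary)
```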

## 💡 Check for understanding

- a. Use the original titanic_df
- b. Select the `Sex` and `Fare` columns.
- c. Indicate how many different types of `Sex` there are.
- d. Indicate how many passengers of each `Sex` there are.
- e. Show a statistical summary of all the variables.
- f. Write some conclusions

In [None]:
# Your code here
# b.
titanic_df[["Sex","Fare"]]

In [150]:
# c.
titanic_df['Sex'].nunique()

2

In [151]:
# d.
titanic_df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [None]:
# e.
titanic_df.describe(include='all')

# Summary

- Pandas is a library built on top of NumPy and designed for working with tabular, labeled data, making it ideal for handling spreadsheets, SQL tables, and more.
- DataFrames and Series are the two main data structures in Pandas.
- Series is a one-dimensional array of data with associated labels called the index, while DataFrame is a two-dimensional tabular data structure with labeled rows and columns.
- Data access in Series and DataFrame can be achieved using integer-based indexing (iloc), label-based indexing (loc), or dictionary-like notation for column access.
- Series and DataFrame offer various methods and attributes, such as sort_values(), sort_index(), value_counts(), describe(), info(), nunique(), unique(), dtypes, and select_dtypes().


# Extra: Creating Dataframes from a Dictionary

In [153]:
# Create a Dataframe from a dictionary with
# automatic indexes

d = {"state": ["Ohio", "Ohio", "California", "Nevada", "California"],
     "year": [2000, 2001, 2002, 2001, 2002],
     "avg": [1.5, 1.7, 3.6, 2.4, 1.9]
}

df = pd.DataFrame(d)
df

Unnamed: 0,state,year,avg
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,California,2002,3.6
3,Nevada,2001,2.4
4,California,2002,1.9



DataFrame from a dictionary of lists and indexes

In [157]:
d_index = {
    "name": ["Paula", "Mark"],
    "score": [98.5, 95]
}

df_index = pd.DataFrame(d_index, index=["123A", "789B"])
df_index

Unnamed: 0,name,score
123A,Paula,98.5
789B,Mark,95.0


# Extra: pickle

Pandas provides the `to_pickle()` method and the `read_pickle()` function as a convenient way to save DataFrames to disk and load them back. Pickling serializes a Python object into a binary format, making it easy to store large datasets or complex data structures. It's a great tool for saving and restoring your work, especially when dealing with large datasets that might take a long time to process or recreate.

## Saving DataFrames with Pickle

You can use the `to_pickle()` method in pandas to save a DataFrame to a pickle file. This method takes the file path as an argument and creates a binary representation of the DataFrame, which is then saved to the specified file.

In [None]:
import pandas as pd

# Assuming df is your DataFrame
df.to_pickle('data.pkl') # data.pkl / data.pickle

## Loading DataFrames from Pickle

To load a DataFrame from a pickle file, you can use the `read_pickle()` function in pandas. This function reads the binary data from the pickle file and converts it back into a DataFrame.


In [None]:
import pandas as pd

# Load DataFrame from pickle file
df = pd.read_pickle('data.pkl')
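To check that a round trip preserves the data, the following sketch saves a small made-up DataFrame to a temporary directory and loads it back. The temporary path and the variable names are assumptions for this example; in practice you would choose your own file location.

```python
import os
import tempfile

import pandas as pd

# Hypothetical DataFrame, used only for illustration
original = pd.DataFrame({"state": ["Ohio", "Nevada"], "avg": [1.5, 2.4]})

# Write to a temporary folder so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    original.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, index and column types
print(restored.equals(original))
```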