# My pandas notebook

Notebook for my pandas studies. I'm currently reading the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) and trying to learn this amazing Python library!

![Panda](images/Panda.gif)

## Importing dependencies, and creating dataframes

In [1]:
# Dependencies
import pandas as pd

In [2]:
# DataFrame for the "What kind of data does pandas handle?" section
standard_df = pd.DataFrame(
    {
        "Name": [
            "Gabriel",
            "Ricardo",
            "Giovanna"
        ],
        "Age": [
            19,
            43,
            27
        ],
        "Sex": [
            'm',
            'm',
            'f'
        ]
    }
)

In [12]:
# Titanic df
titanic_df = pd.read_csv("data/titanic.csv")

## What kind of data does pandas handle?

Docs [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented)

In [3]:
standard_df.head()

|  | Name | Age | Sex |
| --- | --- | --- | --- |
| 0 | Gabriel | 19 | m |
| 1 | Ricardo | 43 | m |
| 2 | Giovanna | 27 | f |


In [4]:
# Getting the ages from my df
standard_ages_df = standard_df["Age"]

# Creating a series of ages
standard_ages_series = pd.Series([19, 43, 27], name="Age")

In [5]:
standard_ages_df.head()

0    19
1    43
2    27
Name: Age, dtype: int64

In [6]:
standard_ages_series.head()

0    19
1    43
2    27
Name: Age, dtype: int64

In [7]:
# Max age of my df
standard_df["Age"].max()

43

As the max() method shows, you can do things with a DataFrame or a Series by calling methods directly on the object. Here, max() returns the maximum value of a pandas Series (a single column).
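
A small sketch of my own to go with this, rebuilding the toy table so the block runs by itself (the names `people`, `oldest`, `youngest` and `max_per_column` are just for illustration). It shows an aggregation on a single column and, as far as I understand it, the same method on the whole DataFrame restricted to numeric columns.

```python
import pandas as pd

# Same toy data as standard_df above, rebuilt so the example is self-contained
people = pd.DataFrame(
    {
        "Name": ["Gabriel", "Ricardo", "Giovanna"],
        "Age": [19, 43, 27],
        "Sex": ["m", "m", "f"],
    }
)

# On a Series (one column) an aggregation returns a single value
oldest = people["Age"].max()
youngest = people["Age"].min()

# On a DataFrame it returns a Series with one entry per column;
# numeric_only=True skips the text columns
max_per_column = people.max(numeric_only=True)

print(oldest, youngest)
print(max_per_column)
```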

In [8]:
# Statistics of my df
standard_df.describe()

|  | Age |
| --- | --- |
| count | 3.0 |
| mean | 29.666667 |
| std | 12.220202 |
| min | 19.0 |
| 25% | 23.0 |
| 50% | 27.0 |
| 75% | 35.0 |
| max | 43.0 |


The describe method provides a quick overview of the numerical data in a DataFrame. As the Name and Sex columns are textual data, these are by default not taken into account by the describe method.
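
If the text columns should be summarised too, describe() takes an `include` argument. The sketch below is my own illustration on the same toy data (variable names are made up for the example), not something taken from the tutorial.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Gabriel", "Ricardo", "Giovanna"],
        "Age": [19, 43, 27],
        "Sex": ["m", "m", "f"],
    }
)

# Default: only the numeric column (Age) is summarised
numeric_summary = people.describe()

# include="all" adds the text columns; statistics that do not apply
# to a column show up as NaN
full_summary = people.describe(include="all")

# A text column on its own reports count, unique, top (most frequent) and freq
sex_summary = people["Sex"].describe()

print(full_summary)
print(sex_summary)
```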

![Screenshot](images/Screenshot01.png)

## How do I read and write tabular data?

Docs [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/02_read_write.html)

In [15]:
titanic_df.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [16]:
titanic_df.tail()

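|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |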


In [17]:
# Check the column types
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

The object columns here hold text: each value in them is a Python string (str).
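
A quick way to convince myself of this, sketched on a tiny Series built from two names in the table (the variable names are hypothetical). I pass `dtype=object` explicitly as an assumption-free way to get the same dtype shown above, since newer pandas releases may infer a dedicated string dtype instead.

```python
import pandas as pd

# dtype=object is set explicitly so the result does not depend on the
# pandas version (newer releases may infer a dedicated string dtype)
names = pd.Series(
    ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], dtype=object
)

print(names.dtype)          # object
print(type(names.iloc[0]))  # each element is a regular Python str

# Because the elements are strings, the .str accessor works on the column
upper_names = names.str.upper()
print(upper_names.iloc[1])
```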

In [18]:
# General info about the df - really useful method
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Let's explain the output:

* It is indeed a DataFrame.
* There are 891 entries, i.e. 891 rows.
* Each row has a row label (aka the index) with values ranging from 0 to 890.
* The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns (Age, Cabin and Embarked) do have missing values and fewer than 891 non-null values.
* The columns Name, Sex, Ticket, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
* The approximate amount of RAM used to hold the DataFrame is provided as well.
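
The numbers that info() prints can also be computed directly. This is a minimal sketch on three rows copied from the table (passengers 1, 2 and 6), with missing values written as None; `sample` and the other names are only for this example.

```python
import pandas as pd

# Three rows in the shape of the Titanic table; None marks a missing value
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 6],
        "Age": [22.0, 38.0, None],
        "Cabin": [None, "C85", None],
    }
)

# Non-null count per column (the "Non-Null Count" column of info())
non_null = sample.notna().sum()

# Missing values per column
missing = sample.isna().sum()

# Memory used by the data in bytes; deep=True also measures the strings
memory_bytes = sample.memory_usage(deep=True).sum()

print(non_null)
print(missing)
print(memory_bytes)
```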

![Screenshot](images/Screenshot02.png)

## How do I select a subset of a DataFrame?

Docs [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/03_subset_data.html)

### How do I select specific columns from a DataFrame?

In [26]:
titanic_ages = titanic_df["Age"]
titanic_age_sex = titanic_df[["Age", "Sex"]]

In [27]:
# Shape of the Series: a one-element tuple with the number of rows
titanic_ages.shape

(891,)

In [28]:
titanic_df.shape

(891, 12)

In [31]:
# We can verify the type of a pandas object using Python's built-in type function
print(type(titanic_age_sex)) # DataFrame
print(type(titanic_ages)) # Series

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
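
The difference comes from the brackets: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one name. A small self-contained sketch (the `sample` table and variable names are made up for the example):

```python
import pandas as pd

sample = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]}
)

ages_series = sample["Age"]   # single name: one column as a Series
ages_frame = sample[["Age"]]  # list with one name: still a DataFrame

# A Series has a one-dimensional shape, a DataFrame a two-dimensional one
print(type(ages_series).__name__, ages_series.shape)
print(type(ages_frame).__name__, ages_frame.shape)
```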


In [34]:
print(titanic_ages.head(), end='\n\n')
print(titanic_age_sex.head())

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

    Age     Sex
0  22.0    male
1  38.0  female
2  26.0  female
3  35.0  female
4  35.0    male


### How do I filter data from a DataFrame?

In [41]:
# Using conditions
titanic_above_35 = titanic_df[titanic_df["Age"] > 35]
titanic_above_35.head(10)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NaN | S |
| 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0 | NaN | S |
| 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S |
| 30 | 31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40.0 | 0 | 0 | PC 17601 | 27.7208 | NaN | C |
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NaN | S |
| 35 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0 | NaN | S |
| 40 | 41 | 0 | 3 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | female | 40.0 | 1 | 0 | 7546 | 9.475 | NaN | S |


In [42]:
# Using the isin method
titanic_class_two_and_three = titanic_df[titanic_df["Pclass"].isin([2, 3])]
titanic_class_two_and_three.head(15)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.05 | NaN | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NaN | S |


Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use it inside the selection brackets [].
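
To see what actually goes inside the brackets, it helps to look at the mask by itself. A minimal sketch using four passengers from the table (IDs 1, 2, 3 and 10 with their Pclass values); `sample`, `mask` and `filtered` are names I picked for the example.

```python
import pandas as pd

sample = pd.DataFrame({"PassengerId": [1, 2, 3, 10], "Pclass": [3, 1, 3, 2]})

# isin() returns a boolean Series: True where Pclass is 2 or 3
mask = sample["Pclass"].isin([2, 3])

# Putting the mask inside [] keeps only the rows where it is True
filtered = sample[mask]

print(mask)
print(filtered)
```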

In [43]:
# Using logical operators
titanic_class_two_and_three = titanic_df[(titanic_df["Pclass"] == 2) | (titanic_df["Pclass"] == 3)]
titanic_class_two_and_three.head(15)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.05 | NaN | S |
| 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NaN | S |


When combining multiple conditions, each condition must be surrounded by parentheses (). Moreover, you cannot use Python's or/and keywords; you need the element-wise operators | (or) and & (and) instead. The keywords expect a single True/False value for each operand, while | and & compare the two boolean Series row by row.
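
Here is a small sketch of my own showing both operators, plus what happens with the `or` keyword. It uses four passengers from the table (IDs 1, 2, 3 and 10); all variable names are just for illustration. The parentheses matter because | and & bind more tightly than == and >=.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 10],
        "Pclass": [3, 1, 3, 2],
        "Age": [22.0, 38.0, 26.0, 14.0],
    }
)

# | and & combine the two boolean Series element by element
second_or_third = sample[(sample["Pclass"] == 2) | (sample["Pclass"] == 3)]
adults_in_third = sample[(sample["Pclass"] == 3) & (sample["Age"] >= 18)]

# Python's `or` asks for one truth value for a whole Series,
# which pandas refuses to guess
try:
    sample[(sample["Pclass"] == 2) or (sample["Pclass"] == 3)]
    or_failed = False
except ValueError as error:
    or_failed = True
    print("ValueError:", error)

print(second_or_third)
print(adults_in_third)
```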

In [45]:
# Using notna
titanic_known_ages = titanic_df[titanic_df["Age"].notna()]
titanic_known_ages.head(10)

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S |


In [46]:
# Getting the names of adult passengers
titanic_adult_passangers = titanic_df.loc[titanic_df["Age"] >= 18, "Name"]
titanic_adult_passangers.head(10)

0                               Braund, Mr. Owen Harris
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                Heikkinen, Miss. Laina
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                              Allen, Mr. William Henry
6                               McCarthy, Mr. Timothy J
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
11                             Bonnell, Miss. Elizabeth
12                       Saundercock, Mr. William Henry
13                          Andersson, Mr. Anders Johan
Name: Name, dtype: object

To select rows and columns at the same time, as in this filter, we need the loc (label-based location) and iloc (integer location) indexers, which are used with square brackets. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.
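
A few variations on the rows-before-comma, columns-after-comma pattern, sketched on three passengers from the table so it runs without the CSV file (the `sample` table and the variable names are assumptions for this example):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Palsson, Master. Gosta Leonard",
        ],
        "Sex": ["male", "female", "male"],
        "Age": [22.0, 26.0, 2.0],
    }
)

# Rows: a condition. Columns: one name, so the result is a Series
adult_names = sample.loc[sample["Age"] >= 18, "Name"]

# Columns: a list of names, so the result is a DataFrame
adult_name_sex = sample.loc[sample["Age"] >= 18, ["Name", "Sex"]]

# A bare colon means "everything" on that side of the comma
all_ages = sample.loc[:, "Age"]

# iloc uses positions: first two rows, first column
first_two_names = sample.iloc[0:2, 0:1]

print(adult_names)
print(adult_name_sex)
```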

In [47]:
type(titanic_adult_passangers)

pandas.core.series.Series

In [49]:
# Using iloc for getting data from specific rows/columns
titanic_df.iloc[9:25, 2:5] # Rows 10 to 25, Columns 3 to 5

|  | Pclass | Name | Sex |
| --- | --- | --- | --- |
| 9 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female |
| 10 | 3 | Sandstrom, Miss. Marguerite Rut | female |
| 11 | 1 | Bonnell, Miss. Elizabeth | female |
| 12 | 3 | Saundercock, Mr. William Henry | male |
| 13 | 3 | Andersson, Mr. Anders Johan | male |
| 14 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female |
| 15 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female |
| 16 | 3 | Rice, Master. Eugene | male |
| 17 | 2 | Williams, Mr. Charles Eugene | male |
| 18 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female |


In [51]:
# Changing the value of the Name column in the first three rows
titanic_df.iloc[0:3, 3] = "anonymous"
titanic_df.head()

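|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | anonymous | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | anonymous | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | anonymous | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |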


* loc: a label-based indexer, used with square brackets rather than called like a function. We pass the labels (names) of the rows or columns we want to select. It also accepts a boolean Series or array as a filter.
* iloc: an integer position-based indexer, also used with square brackets. We pass integer positions to select specific rows/columns. It accepts a boolean array as well, but not a boolean Series.

Source [here](https://www.geeksforgeeks.org/difference-between-loc-and-iloc-in-pandas-dataframe/)
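
The difference is easiest to see when row labels and row positions are not the same. This is my own sketch, using rows 9 to 11 of the Titanic table with their original labels kept as the index; the variable names are hypothetical, and the boolean-mask behaviour of iloc is as I understand it from the pandas versions I know.

```python
import pandas as pd

# Index labels 9, 10, 11 are kept from the Titanic table,
# so labels and positions differ
sample = pd.DataFrame(
    {"Pclass": [2, 3, 1], "Sex": ["female", "female", "female"]},
    index=[9, 10, 11],
)

by_label = sample.loc[9, "Pclass"]  # row labelled 9
by_position = sample.iloc[0, 0]     # first row, first column: the same cell

# Label slices include the end label, position slices exclude the end position
label_slice = sample.loc[9:10]      # rows labelled 9 and 10
position_slice = sample.iloc[0:1]   # only the first row

# Both accept a boolean mask; iloc wants a plain array rather than a Series
mask = sample["Pclass"] > 1
loc_masked = sample.loc[mask]
iloc_masked = sample.iloc[mask.to_numpy()]

print(label_slice)
print(position_slice)
```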

![Screenshot](images/Screenshot03.png)