<a id='start'></a>
# Collecting

This notebook introduces the main methods for collecting data and performing a first round of manipulation. <br>
The most widely used library for these operations is **Pandas**. <br>
<br>
The notebook is divided into the following sections:<br>
- [DataFrame and Series](#section1)<a href='#section1'></a>; <br>
- [Import data from external sources](#section2)<a href='#section2'></a>; <br>
- [Select data from datasets](#section3)<a href='#section3'></a>; <br>
    - [Index-based Selection](#section4)<a href='#section4'></a><br>
    - [Label-based Selection](#section5)<a href='#section5'></a> <br>
    - [Conditional Selection](#section6)<a href='#section6'></a>
    - [Variables and References](#section7)<a href='#section7'></a>

<a id='section1'></a>
## DataFrame and Series

We introduce the **Pandas** library, which is used to create and manage two kinds of objects: **Series** and **DataFrame**.<br>

The *Series* and *DataFrame* objects can be imported from files (csv, xls, html, ...) or created manually. <br>

We start by importing the Pandas library.

In [6]:
import pandas as pd

print("Setup Complete.")

Setup Complete.


A **DataFrame** is a table containing an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.  
A DataFrame can be built from a dictionary whose keys are the column names and whose values are the lists of entries for each column.

In [7]:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

Unnamed: 0,Yes,No
0,50,131
1,21,2
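
The dictionary view can be made concrete with a small sketch. It builds the same table twice: once from a dictionary of columns, as above, and once from a list of row dictionaries, which pandas also accepts. The variable names are illustrative only.

```python
import pandas as pd

# Column-oriented: each key is a column name, each list holds that column's values.
by_columns = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# Row-oriented: each dictionary is one record (row) of the table.
by_rows = pd.DataFrame([{'Yes': 50, 'No': 131}, {'Yes': 21, 'No': 2}])

# Compare the two constructions.
print(by_columns.equals(by_rows))
```

The column-oriented form is usually more convenient when the data already comes as lists; the row-oriented form fits data that arrives one record at a time.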


A DataFrame can also contain strings, not only numeric values.

In [8]:
pd.DataFrame({'Audi': ['Best', 'Worst'], 'Mercedes': ['Best', 'Worst']})

Unnamed: 0,Audi,Mercedes
0,Best,Best
1,Worst,Worst


The row labels of a DataFrame are called the **Index**. They can be set with the *index* parameter, as in the following code:

In [9]:
pd.DataFrame({'Audi': ['Best', 'Worst'], 'Mercedes':['Best', 'Worst']}, index = ['Subcompact', 'Sports'])

Unnamed: 0,Audi,Mercedes
Subcompact,Best,Best
Sports,Worst,Worst


A **Series** is a sequence of data values. If a DataFrame is a table, a Series is a kind of list (not to be confused with a Python list).

In [10]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

You can assign labels to the rows of a Series in the same way as before, using the *index* parameter. <br>
Unlike a DataFrame, a Series has no column names; it only has a single overall name.

In [11]:
pd.Series([300, 450, 400], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product X')

2015 Sales    300
2016 Sales    450
2017 Sales    400
Name: Product X, dtype: int64
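
To show how the two objects relate, here is a minimal sketch that places two named Series side by side to form a DataFrame. The second Series ("Product Y") and its numbers are invented for the example.

```python
import pandas as pd

years = ['2015 Sales', '2016 Sales', '2017 Sales']
product_x = pd.Series([300, 450, 400], index=years, name='Product X')
product_y = pd.Series([120, 90, 150], index=years, name='Product Y')  # hypothetical data

# Placing the Series side by side (axis=1) turns each name into a column name.
sales = pd.concat([product_x, product_y], axis=1)
print(sales)

# Selecting a single column gives back a Series.
print(type(sales['Product X']))
```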

<a id='section2'></a>
## Import data from external sources

In this section of the notebook we deal with how to import data from external resources (csv, excel and html) thanks to the Pandas library.

### Import data from csv

The Pandas function for importing data from a csv file is __[read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)__. <br>
The *read_csv* function has many parameters; the most important are:
- sep
- delimiter (an alias for sep)
- header
- index_col
- skiprows
- na_values <br>
...
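
As a rough illustration of how some of these parameters interact, the sketch below reads a small semicolon-separated text held in memory instead of a file. The content, the column names and the "unknown" marker are all made up for the example.

```python
import io
import pandas as pd

# Hypothetical file content: one comment line, then a header row, then the data.
raw = (
    "exported by a hypothetical tool\n"
    "id;name;age\n"
    "1;Anna;34\n"
    "2;Bruno;unknown\n"
    "3;Carla;29\n"
)

people = pd.read_csv(
    io.StringIO(raw),       # any file path or file-like object works here
    sep=";",                # the fields are separated by semicolons
    skiprows=1,             # skip the first line of the text
    header=0,               # the first remaining line holds the column names
    index_col=0,            # use the first column ("id") as the row index
    na_values=["unknown"],  # treat this marker as a missing value (NaN)
)
print(people)
```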

In [12]:
dataset = pd.read_csv("dataset.csv", index_col = 0)
dataset

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,,8.4583,Q,0
6,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,S,0
7,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,S,0
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1


In [13]:
# The .head() method allows you to see the first rows (by default the first 5) of a DataFrame or Series
dataset.head()

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,,8.4583,Q,0


In [14]:
# The .tail() method allows you to see the last rows (by default the last 5) of a DataFrame or Series
dataset.tail()

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
16,3,"Rice, Master. Eugene",male,2.0,29.125,Q,0
17,2,"Williams, Mr. Charles Eugene",male,,13.0,S,1
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,18.0,S,0
19,3,"Masselmani, Mrs. Fatima",female,,7.225,C,1
20,2,"Fynney, Mr. Joseph J",male,35.0,26.0,S,0


### Import data from excel

The Pandas function for importing data from an Excel file is __[read_excel](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html)__. <br>
When the workbook has several sheets, the *sheet_name* parameter indicates which sheet contains the data we want to import.

In [15]:
dataset = pd.read_excel("dataset_excel_workbook.xlsx", sheet_name='dataset', index_col=0)
dataset

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,,8.4583,Q,0
6,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,S,0
7,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,S,0
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1


### Import data from website

The Pandas function for importing tables from a web page is __[read_html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html#pandas.read_html)__. <br>


The most important parameters to consider when importing are the following: <br>
- **skiprows** = the number of rows to skip in the import; <br>
- **header** = the row to use for the column headers.

In [16]:
classifica_serie_a = pd.read_html(io="http://www.legaseriea.it/it/serie-a/classifica", skiprows=1, header=0)
classifica_serie_a

[             SQUADRE  PUNTI   G   V  N   P  G.1  V.1  N.1  P.1  G.2  V.2  N.2  \
 0        1  Juventus     63  26  20  3   3   13   12    1    0   13    8    2   
 1           2  Lazio     62  26  19  5   2   14   11    3    0   12    8    2   
 2           3  Inter     54  25  16  6   3   12    7    4    1   13    9    2   
 3        4  Atalanta     48  25  14  6   5   12    6    2    4   13    8    4   
 4            5  Roma     45  26  13  6   7   13    6    3    4   13    7    3   
 5          6  Napoli     39  26  11  6   9   13    5    2    6   13    6    4   
 6           7  Milan     36  26  10  6  10   13    4    5    4   13    6    1   
 7   8  Hellas Verona     35  25   9  8   8   12    6    3    3   13    3    5   
 8           9  Parma     35  25  10  5  10   13    6    1    6   12    4    4   
 9        10  Bologna     34  26   9  7  10   13    4    5    4   13    5    2   
 10      11  Sassuolo     32  25   9  5  11   13    6    1    6   12    3    4   
 11      12  Cag

The *read_html* function returns a *list of DataFrames*, one for each table found in the page. <br>
We therefore take element 0 of the list returned by *read_html* to get the DataFrame we are interested in.

In [17]:
serie_a = classifica_serie_a[0]
serie_a.head()

Unnamed: 0,SQUADRE,PUNTI,G,V,N,P,G.1,V.1,N.1,P.1,G.2,V.2,N.2,P.2,F,S
0,1 Juventus,63,26,20,3,3,13,12,1,0,13,8,2,3,50,24
1,2 Lazio,62,26,19,5,2,14,11,3,0,12,8,2,2,60,23
2,3 Inter,54,25,16,6,3,12,7,4,1,13,9,2,2,49,24
3,4 Atalanta,48,25,14,6,5,12,6,2,4,13,8,4,1,70,34
4,5 Roma,45,26,13,6,7,13,6,3,4,13,7,3,3,51,35


You can get the names of the columns of a DataFrame through the **.columns** attribute (a Series has no columns, so it does not have this attribute).

In [62]:
serie_a.columns

Index(['SQUADRE', 'PUNTI', 'G', 'V', 'N', 'P', 'G.1', 'V.1', 'N.1', 'P.1',
       'G.2', 'V.2', 'N.2', 'P.2', 'F', 'S'],
      dtype='object')

In [19]:
type(serie_a)

pandas.core.frame.DataFrame

In [20]:
serie_a

Unnamed: 0,SQUADRE,PUNTI,G,V,N,P,G.1,V.1,N.1,P.1,G.2,V.2,N.2,P.2,F,S
0,1 Juventus,63,26,20,3,3,13,12,1,0,13,8,2,3,50,24
1,2 Lazio,62,26,19,5,2,14,11,3,0,12,8,2,2,60,23
2,3 Inter,54,25,16,6,3,12,7,4,1,13,9,2,2,49,24
3,4 Atalanta,48,25,14,6,5,12,6,2,4,13,8,4,1,70,34
4,5 Roma,45,26,13,6,7,13,6,3,4,13,7,3,3,51,35
5,6 Napoli,39,26,11,6,9,13,5,2,6,13,6,4,3,41,36
6,7 Milan,36,26,10,6,10,13,4,5,4,13,6,1,6,28,34
7,8 Hellas Verona,35,25,9,8,8,12,6,3,3,13,3,5,5,29,26
8,9 Parma,35,25,10,5,10,13,6,1,6,12,4,4,4,32,31
9,10 Bologna,34,26,9,7,10,13,4,5,4,13,5,2,6,38,42


There are many other ways to import data, depending on the data source you need to draw from. <br>
The purpose of this lesson is not to describe every possible method, which would take us too far afield, but simply to provide a useful starting point.

For the following sources, only links and reference material are given.

### Import data from DB and use inside Jupyter

https://towardsdatascience.com/heres-how-to-run-sql-in-jupyter-notebooks-f26eb90f3259  
https://realpython.com/tutorials/databases/

### Scraping Data from the web

https://automatetheboringstuff.com/chapter11/  
https://automatetheboringstuff.com/2e/chapter12/  


### Import data from twitter

https://github.com/Jefferson-Henrique/GetOldTweets-python

### Import data from newspapers

https://github.com/codelucas/newspaper

### Import data using api

https://towardsdatascience.com/creating-a-dataset-using-an-api-with-python-dcc1607616d  
https://realpython.com/api-integration-in-python/

<a id='section3'></a>
## Select data from dataset

In this section we will learn the main methods for selecting the columns and rows of a dataset stored as a DataFrame or Series.

Before continuing, note that the **Pandas** library does not use the names *dimension* and *feature*, which are commonly used to identify the axes of a dataset. <br>
Instead, Pandas refers to the directions of a table with the term **axes**: the parameter **(axis = 0)** indicates the rows of a dataset and the parameter **(axis = 1)** indicates the columns.

<img src="axis_Pandas.jpg">
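
A minimal sketch of the axis convention on a small made-up table (the names and numbers are assumptions for illustration). With an aggregation such as *sum*, the axis is the direction along which the values are combined; with *drop*, it says whether the label refers to a row or to a column.

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [8, 6], 'physics': [7, 9]},  # hypothetical data
    index=['Anna', 'Bruno'],
)

# axis=0: combine the values along the rows, giving one result per column.
column_totals = scores.sum(axis=0)

# axis=1: combine the values along the columns, giving one result per row.
row_totals = scores.sum(axis=1)

# axis=1 with drop: the label 'physics' is looked up among the columns.
without_physics = scores.drop('physics', axis=1)

print(column_totals)
print(row_totals)
print(without_physics)
```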

In [21]:
# Give the dataset previously imported from the Excel file a new name (titanic refers to the same object, not a copy)
titanic = dataset
titanic.head()

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,,8.4583,Q,0


In [22]:
# Select the Name column of the titanic dataset

titanic['Name'] # attribute notation also works: titanic.Name

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                Heikkinen, Miss. Laina
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                              Allen, Mr. William Henry
5                                      Moran, Mr. James
6                               McCarthy, Mr. Timothy J
7                        Palsson, Master. Gosta Leonard
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                   Nasser, Mrs. Nicholas (Adele Achem)
10                      Sandstrom, Miss. Marguerite Rut
11                             Bonnell, Miss. Elizabeth
12                       Saundercock, Mr. William Henry
13                          Andersson, Mr. Anders Johan
14                 Vestrom, Miss. Hulda Amanda Adolfina
15                     Hewlett, Mrs. (Mary D Kingcome) 
16                                 Rice, Master. Eugene
17                         Williams, Mr. Charles Eugene
18    Vander Planke, Mrs. Julius (Emelia Maria V

In [63]:
titanic.Name

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                Heikkinen, Miss. Laina
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                              Allen, Mr. William Henry
5                                      Moran, Mr. James
6                               McCarthy, Mr. Timothy J
7                        Palsson, Master. Gosta Leonard
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                   Nasser, Mrs. Nicholas (Adele Achem)
10                      Sandstrom, Miss. Marguerite Rut
11                             Bonnell, Miss. Elizabeth
12                       Saundercock, Mr. William Henry
13                          Andersson, Mr. Anders Johan
14                 Vestrom, Miss. Hulda Amanda Adolfina
15                     Hewlett, Mrs. (Mary D Kingcome) 
16                                 Rice, Master. Eugene
17                         Williams, Mr. Charles Eugene
18    Vander Planke, Mrs. Julius (Emelia Maria V

In [23]:
type(titanic['Name'])

pandas.core.series.Series

In the previous cells we selected a column from a DataFrame and obtained it as a **Series**. <br>
We have shown two ways to extract a column from a DataFrame. Neither is better in general, but if the column name contains a space, for example "Age Female", only the square-bracket notation works.
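
The sketch below illustrates the point on an invented table. It also shows a second case where only square brackets work, namely a column whose name coincides with an existing DataFrame method (here "count").

```python
import pandas as pd

# Hypothetical data; the column names are chosen to show the corner cases.
people = pd.DataFrame({
    'Age': [38, 26],
    'Age Female': [38, 26],
    'count': [1, 2],
})

ages = people.Age                  # attribute notation works for simple names
ages_female = people['Age Female'] # a name with a space needs square brackets

# people.count is the DataFrame method, not the column...
print(callable(people.count))
# ...so the column has to be selected with square brackets.
print(people['count'].tolist())
```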

In [24]:
# Now select the value of the Age column at index label 1 (the first row)
titanic['Age'][1]

38.0

Pandas uses two paradigms to select data: <br>
- Index-based: based on the numerical position of the data; <br>
- Label-based: based on the value of a data index.

<a id='section4'></a>
### Index-Based Selection

Both **iloc** and **loc** (which we will see later) are *row-first, column-second*, i.e. they take the row as the first input and the column as the second. This is the opposite of plain square-bracket indexing on a DataFrame, which is *column-first, row-second*. <br>

In fact, a few lines above, before introducing *iloc* and *loc*, we used the code "titanic['Age'][1]", that is dataset[column][row].

The indexer used to make an *index-based selection* is **iloc**.<br>
For example, we select the first row of the titanic dataset:

In [25]:
titanic.iloc[0]

Pclass                                                      1
Name        Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                    female
Age                                                        38
Fare                                                  71.2833
Embarked                                                    C
Survived                                                    1
Name: 1, dtype: object

To obtain the desired column with iloc we must use the following syntax:

In [26]:
# Print ALL the rows of the Name column
titanic.iloc[:,1]

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                Heikkinen, Miss. Laina
3          Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                              Allen, Mr. William Henry
5                                      Moran, Mr. James
6                               McCarthy, Mr. Timothy J
7                        Palsson, Master. Gosta Leonard
8     Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                   Nasser, Mrs. Nicholas (Adele Achem)
10                      Sandstrom, Miss. Marguerite Rut
11                             Bonnell, Miss. Elizabeth
12                       Saundercock, Mr. William Henry
13                          Andersson, Mr. Anders Johan
14                 Vestrom, Miss. Hulda Amanda Adolfina
15                     Hewlett, Mrs. (Mary D Kingcome) 
16                                 Rice, Master. Eugene
17                         Williams, Mr. Charles Eugene
18    Vander Planke, Mrs. Julius (Emelia Maria V

In [27]:
# First 3 rows of the Name column
titanic.iloc[:3,1]

1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
Name: Name, dtype: object

In [28]:
# Selected rows of the Name column
# In this case a list indicates the positions of the rows of interest
titanic.iloc[[1,2,3,5,7],1]

2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
6                              McCarthy, Mr. Timothy J
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
Name: Name, dtype: object
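
The same position-based patterns can be tried on a small self-contained table. This is only a sketch with invented data; note that the row labels (10, 20, ...) play no role, because *iloc* looks only at positions.

```python
import pandas as pd

cars = pd.DataFrame(
    {'brand': ['Audi', 'Mercedes', 'Fiat', 'Volvo'],
     'doors': [5, 3, 5, 5],
     'price': [30, 45, 15, 35]},  # hypothetical values
    index=[10, 20, 30, 40],
)

first_row = cars.iloc[0]          # first row, returned as a Series
last_two_rows = cars.iloc[-2:]    # negative positions count from the end
block = cars.iloc[[0, 2], 0:2]    # rows at positions 0 and 2, first two columns
single_cell = cars.iloc[1, 2]     # row position 1, column position 2

print(block)
print(single_cell)
```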

<a id='section5'></a>
### Label-Based Selection

The indexer used to make a *label-based selection* is **loc**.<br>
To get the first record in the "Name" field of the titanic dataset we use the following syntax:

In [29]:
titanic.loc[1, 'Name']

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [30]:
titanic.loc[2, ['Name', 'Age', 'Pclass']]

Name      Heikkinen, Miss. Laina
Age                           26
Pclass                         3
Name: 2, dtype: object

When choosing between *loc* and *iloc*, or switching from one to the other, there is one important **difference** to keep in mind: **the two use slightly different indexing schemes**.

**iloc** uses the Python stdlib indexing scheme, where **the first item in the range is included and the last excluded.** So iloc[0:10] will select items 0,...,9. <br>
**loc**, meanwhile, **indexes inclusively.** So loc[1:10] will select items 1,...,10.

Suppose you have a DataFrame whose index is a simple numerical list, for example 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001! To get 1000 items using loc, we have to use df.loc[0:999].
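
The counts in this example can be checked directly. The sketch assumes a DataFrame with the default index 0,...,1000 and a single illustrative column named "value".

```python
import pandas as pd

# 1001 rows, labelled 0, 1, ..., 1000 by the default index.
df = pd.DataFrame({'value': range(1001)})

n_iloc = len(df.iloc[0:1000])      # positions 0..999: the end is excluded
n_loc = len(df.loc[0:1000])        # labels 0..1000: the end is included
n_loc_fixed = len(df.loc[0:999])   # labels 0..999

print(n_iloc, n_loc, n_loc_fixed)
```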


In [31]:
titanic.iloc[0,1]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [32]:
try:
    print(titanic.loc[0,'Name'])
except Exception as e:
    print(f'Error: {e}')

Error: 'the label [0] is not in the [index]'
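
The two cells above differ because the row labels of the titanic dataset start at 1, while positions always start at 0. A minimal sketch with an invented three-row table reproduces the situation:

```python
import pandas as pd

# Hypothetical data whose row labels start at 1, like the titanic dataset.
people = pd.DataFrame({'Name': ['Anna', 'Bruno', 'Carla']}, index=[1, 2, 3])

by_position = people.iloc[0, 0]    # first row by position
by_label = people.loc[1, 'Name']   # the same row, addressed by its label

# There is no row labelled 0, so loc raises a KeyError.
try:
    people.loc[0, 'Name']
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(by_position, by_label, label_zero_exists)
```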


<a id='section6'></a>
### Conditional Selection

During our analysis we may need to select parts of a dataset based on the values that the fields may assume. That is, we may want to place conditions on our selections.

Suppose we want to select only the women from the titanic dataset; we start by asking which rows have the value "female" in the "Sex" column:

In [33]:
titanic[titanic.Sex == "female"]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
11,1,"Bonnell, Miss. Elizabeth",female,58.0,26.55,S,1
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,7.8542,S,0
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,16.0,S,1
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,18.0,S,0


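The comparison *titanic.Sex == "female"* produces a True/False column, which we used above inside square brackets to keep only the matching rows. The same True/False column can be used with the *loc* operator to select the rows referring only to women: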

In [34]:
titanic.loc[titanic.Sex == "female"]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
11,1,"Bonnell, Miss. Elizabeth",female,58.0,26.55,S,1
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,7.8542,S,0
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,16.0,S,1
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,18.0,S,0
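
To make the True/False column visible, the following sketch builds it explicitly on a small invented table and then reuses it in several ways. The variable name *mask* is just a common convention.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Anna', 'Bruno', 'Carla'],   # hypothetical data
    'Sex': ['female', 'male', 'female'],
})

# The comparison is evaluated row by row and gives a boolean Series.
mask = people.Sex == 'female'
print(mask.tolist())

women = people[mask]                    # rows where the mask is True
women_names = people.loc[mask, 'Name']  # same rows, only the Name column
men = people[~mask]                     # ~ inverts the mask
```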


Let us now suppose we want to select women under 30:

In [35]:
# using the logic operator &
titanic.loc[(titanic.Sex == "female") & (titanic.Age < 30)]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,7.8542,S,0


Suppose you want to select the passengers who are female or under 30 years old (or both). In this case you will need the **or** operator: |.

In [36]:
titanic.loc[(titanic.Sex == "female") | (titanic.Age < 30)]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
7,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,S,0
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
11,1,"Bonnell, Miss. Elizabeth",female,58.0,26.55,S,1
12,3,"Saundercock, Mr. William Henry",male,20.0,8.05,S,0
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,7.8542,S,0
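
Two details of these combined conditions are easy to get wrong. Each condition needs its own parentheses, because & and | bind more tightly than comparisons such as == and <. The Python keywords *and* and *or* also cannot be used on a True/False column. The sketch below, on invented data, shows both operators and the error raised by *and*.

```python
import pandas as pd

people = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],  # hypothetical data
    'Age': [25, 20, 40, 50],
})

is_female = people.Sex == 'female'
is_young = people.Age < 30

both = people[is_female & is_young]     # female AND under 30
either = people[is_female | is_young]   # female OR under 30 (or both)

# The keyword "and" asks for a single True/False value for the whole
# column, which pandas refuses to guess.
try:
    people[is_female and is_young]
    keyword_and_works = True
except ValueError:
    keyword_and_works = False

print(both.index.tolist(), either.index.tolist(), keyword_and_works)
```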


Pandas has pre-built *conditional selectors* that can be useful when analyzing:<br>
- **isin**: allows you to select data whose value is in a list of values; <br>
- **isnull** (and its complement **notnull**): allows you to select the values that are or are not null (NaN). 

Suppose you want to select only people who belong to the second and third class (Pclass field).

In [37]:
titanic.loc[titanic.Pclass.isin([2,3])]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,,8.4583,Q,0
7,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,S,0
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
12,3,"Saundercock, Mr. William Henry",male,20.0,8.05,S,0
13,3,"Andersson, Mr. Anders Johan",male,39.0,31.275,S,0
14,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,7.8542,S,0


Suppose we want to select all passengers whose age is missing:

In [38]:
titanic.loc[titanic.Age.isnull()]

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
5,3,"Moran, Mr. James",male,,8.4583,Q,0
17,2,"Williams, Mr. Charles Eugene",male,,13.0,S,1
19,3,"Masselmani, Mrs. Fatima",female,,7.225,C,1
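
The complementary selections work in the same way. As a sketch on invented data, *notnull* keeps the rows with a known value, and the ~ operator inverts an *isin* condition:

```python
import pandas as pd

people = pd.DataFrame({
    'Pclass': [1, 2, 3, 3],            # hypothetical data
    'Age': [38.0, None, 26.0, None],   # None is stored as NaN
})

known_age = people.loc[people.Age.notnull()]            # rows with an age
not_first_class = people.loc[~people.Pclass.isin([1])]  # every class except 1
n_missing = people.Age.isnull().sum()                   # count the missing ages

print(known_age.index.tolist(), not_first_class.index.tolist(), n_missing)
```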


At this point we can also assign a value to a field once a selection has been made. <br>
We assign the age of 35 to all people with a missing age value.

In [39]:
titanic.loc[titanic.Age.isnull(), 'Age'] = 35
titanic

Unnamed: 0,Pclass,Name,Sex,Age,Fare,Embarked,Survived
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C,1
2,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S,1
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S,1
4,3,"Allen, Mr. William Henry",male,35.0,8.05,S,0
5,3,"Moran, Mr. James",male,35.0,8.4583,Q,0
6,1,"McCarthy, Mr. Timothy J",male,54.0,51.8625,S,0
7,3,"Palsson, Master. Gosta Leonard",male,2.0,21.075,S,0
8,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,11.1333,S,1
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,30.0708,C,1
10,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,16.7,S,1
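
As an aside, pandas also offers the *fillna* method for this kind of replacement. The sketch below is an alternative to the *loc* assignment above, not the approach used in this notebook; the data is invented, and using the median as the replacement value is just one possible choice.

```python
import pandas as pd

people = pd.DataFrame({'Age': [38.0, None, 26.0, None]})  # hypothetical data

# fillna returns a new Series and leaves the original column untouched.
filled_constant = people['Age'].fillna(35)

# The median ignores missing values, so it can serve as a replacement.
median_age = people['Age'].median()
filled_median = people['Age'].fillna(median_age)

print(filled_constant.tolist())
print(filled_median.tolist())
```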


<a id='section7'></a>
### Variables and References

You have to pay attention to the scope of variables (in all programming languages, actually), but what happens with a pandas DataFrame? Does it behave like an ordinary variable?

Let's first see what happens with standard Python variables, and especially with lists!

With global:

In [83]:
n = 0
text = 'try'
my_list = [1,2,3,4]

def test():
    global n
    global text
    global my_list
    n = 1
    text = 'ok maby something happend'
    my_list.pop()

#Let's try!
print(f'before = {n} and: {text} with list: {my_list}')
test()
print(f'after = {n} and: {text} with list: {my_list}')

before = 0 and: try with list: [1, 2, 3, 4]
after = 1 and: ok maby something happend with list: [1, 2, 3]


Without global:

In [84]:
x = 1
text = 'try'
my_list = [1,2,3,4]

def test():
    x = 2  # creates a new local variable; the global x is unchanged
    text = 'ok maby something happend'
    my_list.pop()
    
test()
print(x)
print(text)
print(my_list)

1
try
[1, 2, 3]
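
The list changed even without *global* because the function did not assign a new value to the name *my_list*; it modified the list object that the name refers to. A minimal sketch of the difference, with illustrative function names:

```python
def rebind(items):
    # Assignment makes the local name point to a new list;
    # the caller's list is not touched.
    items = [0]


def mutate(items):
    # append changes the list object itself, which the caller shares.
    items.append(0)


numbers = [1, 2, 3]

rebind(numbers)
after_rebind = list(numbers)

mutate(numbers)
after_mutate = list(numbers)

print(after_rebind)
print(after_mutate)
```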


**But what happens when a function modifies a DataFrame?**

In [44]:
temp = titanic.copy()

In [45]:
temp.iloc[0,1]

'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'

In [52]:
def modify_something(temp):
    temp.iloc[0,1] = 'test'

In [53]:
modify_something(temp)
temp.iloc[0,1]

'test'
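
The DataFrame behaves like the list: the function receives a reference to the same object, so the change is visible outside it. Calling *.copy()* first, as done above for *temp*, is what protects the original. The sketch below contrasts a plain assignment (a second name for the same object) with a copy, using invented data.

```python
import pandas as pd

original = pd.DataFrame({'Name': ['Anna', 'Bruno'], 'Age': [38, 26]})  # hypothetical data

alias = original               # a second name for the same DataFrame
independent = original.copy()  # a separate DataFrame with the same content


def modify_something(frame):
    frame.iloc[0, 0] = 'test'


# Modifying the copy leaves the original untouched.
modify_something(independent)
name_after_copy_change = original.iloc[0, 0]

# Modifying through the alias changes the original as well.
modify_something(alias)
name_after_alias_change = original.iloc[0, 0]

print(name_after_copy_change, name_after_alias_change)
```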



Here are some useful links: <br>
- <a href='https://pandas.pydata.org/pandas-docs/stable/indexing.html'>Pandas - Indexing and Selecting Data</a><br>
- <a href='https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html'>Pandas - Comparison with SQL</a><br>

[Click here to come back to the index](#start)