# Lab Session 1

## Introduction to Python and Pandas

In this notebook we will practice the following "data science" steps:
1. Loading a data set
2. Examining the data

We will not give a complete course in Python, but will expect you to do some self-study. To get you started, here are some links which could be useful, followed by some information about the kind of Python commands you will need to complete the exercises:

1. [Introduction Python programming](https://www.programiz.com/python-programming)
2. [Jupyter notebook tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)
3. [Quick introduction to Pandas](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)

Some Pandas functionality you should familiarise yourselves with:
1. pd.read_csv(...)
2. shape
3. head(...)
4. info(...)
5. describe(...)
6. value_counts(...)
7. selecting a single column (with a condition) from a dataframe
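
As a rough orientation, the functionality listed above can be tried on a tiny made-up data set. This is only a sketch: the three-row CSV text and the names `csv_text`, `sample` and `adult_names` are hypothetical and have nothing to do with the real `titanic.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample; NOT the real titanic.csv.
csv_text = """Name,Age,Sex
Alice,29.0,female
Bob,,male
Carol,4.0,female
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                  # (rows, columns); an attribute, not a method
print(sample.head(2))                # the first two rows
sample.info()                        # column names, non-null counts, dtypes
print(sample.describe())             # summary statistics of the numeric columns
print(sample["Sex"].value_counts())  # how often each value occurs

# Select a single column, keeping only the rows that satisfy a condition.
adult_names = sample.loc[sample["Age"] >= 18, "Name"]
print(adult_names.tolist())
```

Note that `shape` is written without parentheses, while the other entries are method calls.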

The idea is that you will fill in the answers in this notebook, save it, and send it back to me (<peter.berck@hh.se>) before the deadline.

By the way, we will be using Python 3. There is still a lot of material available for Python 2, but try to look for Python 3 examples as much as possible.


### The Titanic data set

Often, data is available in comma-separated values (CSV) format. The data consists of a number of rows, with the information in different columns. In this tutorial, we will use the "titanic" data set (see https://www.kaggle.com/c/titanic for some background information; Kaggle is a website providing different machine learning challenges).

The titanic dataset contains data about the passengers on board, and whether they survived the ill-fated journey. Do not download the data provided on the Kaggle website, but use the one provided together with this notebook.

It looks like this (a small part of the data). The full data set contains about 1300 rows of data, in 13 columns.

| Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived |
|-----|-------|----------|------|------|-------|-------------|--------|-----|-------|----------|
| 22.0 | | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 |
| 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley | 0 | 2 | 1 | female | 1 | 1.0 |
| 26.0 | | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 |
| 35.0 | C123 | S | 53.1 | Futrelle, Mrs. Jacques | 0 | 4 | 1 | female | 1 | 1.0 |

## Loading data

For this tutorial, we will use a library called "Pandas" (https://pandas.pydata.org/). Pandas is a library which contains a number of functions to easily load and process data. It does this by providing a data structure called a "DataFrame" to hold the data, together with a large number of functions to manipulate it.

Most Python programs start with loading a few libraries, and that is what we shall do here. Put the cursor in the next cell, and press ``shift-enter`` or ``ctrl-enter``.

In [7]:
import pandas as pd

Nothing happened, at least on the screen, but the library is now loaded. To check, we will print the version number of the library. Don't worry about the details of the commands yet. Just go to the next cell, and press ``shift-enter`` again.

In [8]:
pd.__version__

'0.23.0'

You should have some output, like "0.22.0". This means the library has been properly loaded.

Now we can continue to the next step, loading the data into a DataFrame. The data set is called "titanic.csv". Let's call the variable to hold the DataFrame **df**. In Python, a value (like ``28``) can be assigned to a variable (with a name like ``foo``) like this:

```
foo = 28
```

From now on, you will have to find the answers yourselves. The information in the beginning of this notebook should help you. 

Write the command to load the CSV data into a dataframe in the next cell, and run it.

In [9]:
df = pd.read_csv("titanic.csv")

OK, the data has been loaded. The first thing to do is to check whether it is what you expected. One way of doing that is to print the "shape" of the DataFrame, which gives its dimensions as a pair (number of rows, number of columns). In the next cell, print the dimensions of the data.

In [10]:
df.shape

(1309, 13)

Is this what you expected?

Next we will look at the first ten rows of the data. Write the relevant command in the next cell.

In [11]:
# answer here
df.head(n=10)

Unnamed: 0.1,Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450
5,5,,,Q,8.4583,"Moran, Mr. James",0,6,3,male,0,0.0,330877
6,6,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463
7,7,2.0,,S,21.075,"Palsson, Master. Gosta Leonard",1,8,3,male,3,0.0,349909
8,8,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,9,3,female,0,1.0,347742
9,9,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,10,2,female,1,1.0,237736


If done correctly, you will be shown a table like in the beginning of this notebook.

Print the info about all the columns. Put the command in the next cell.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
Unnamed: 0     1309 non-null int64
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(5), object(5)
memory usage: 133.0+ KB


As you can see, a number of columns do not contain data in all rows. This happens often, and it needs to be fixed. Luckily, Pandas has functions for that as well. We will come to those later.

In the next cell, write how you can tell there is missing data from the output of the previous command.

Answer: The RangeIndex line reports 1309 entries, but the columns Age (1046), Cabin (295), Embarked (1307), Fare (1308) and Survived (891) have fewer non-null values than that, so these columns contain missing data.
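
The same comparison can be made in code rather than by reading the `info()` output. The following is a minimal sketch on a hypothetical four-row frame (the name `sample` and its values are made up for illustration): it subtracts the per-column non-null count from the total number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; not the Titanic data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Name": ["A", "B", "C", "D"],
})

total_rows = len(sample)          # corresponds to the RangeIndex entry count
non_null = sample.count()         # non-null values per column, as shown by info()
missing = total_rows - non_null   # number of gaps per column

print(missing)
```

A column is complete exactly when its entry in `missing` is zero.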

Now, let's look closer at the age of the people in the data set. There are pandas functions to print all the values (plus their counts) contained in a column. You will need to find out the name of that command, and how to apply it to only the "Age" column in the dataframe.

In [13]:
df["Age"]

0       22.0
1       38.0
2       26.0
3       35.0
4       35.0
5        NaN
6       54.0
7        2.0
8       27.0
9       14.0
10       4.0
11      58.0
12      20.0
13      39.0
14      14.0
15      55.0
16       2.0
17       NaN
18      31.0
19       NaN
20      35.0
21      34.0
22      15.0
23      28.0
24       8.0
25      38.0
26       NaN
27      19.0
28       NaN
29       NaN
        ... 
1279    21.0
1280     6.0
1281    23.0
1282    51.0
1283    13.0
1284    47.0
1285    29.0
1286    18.0
1287    24.0
1288    48.0
1289    22.0
1290    31.0
1291    30.0
1292    38.0
1293    22.0
1294    17.0
1295    43.0
1296    20.0
1297    23.0
1298    50.0
1299     NaN
1300     3.0
1301     NaN
1302    37.0
1303    28.0
1304     NaN
1305    39.0
1306    38.5
1307     NaN
1308     NaN
Name: Age, Length: 1309, dtype: float64
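
The cell above prints the raw column, not the values together with their counts. One way to get the counts is `value_counts()`, sketched here on a short made-up Series (the name `ages` and its values are hypothetical, not taken from the data set).

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([22.0, 35.0, 35.0, np.nan, 2.0, 35.0, 22.0], name="Age")

# Distinct values with their frequencies, most frequent first.
# Missing values are left out unless dropna=False is passed.
counts = ages.value_counts()
counts_with_nan = ages.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```

On the notebook's dataframe the corresponding call would be applied to the "Age" column in the same way.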

That wasn't too useful maybe. Let's describe the "Age" column instead, and look at the statistics. Enter your code in the next cell.

In [14]:
# answer here
df.describe()

Unnamed: 0.1,Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,1309.0,1046.0,1308.0,1309.0,1309.0,1309.0,1309.0,891.0
mean,369.478992,29.881138,33.295479,0.385027,655.0,2.294882,0.498854,0.383838
std,248.767105,14.413493,51.758668,0.86556,378.020061,0.837836,1.041658,0.486592
min,0.0,0.17,0.0,0.0,1.0,1.0,0.0,0.0
25%,163.0,21.0,7.8958,0.0,328.0,2.0,0.0,0.0
50%,327.0,28.0,14.4542,0.0,655.0,3.0,0.0,0.0
75%,563.0,39.0,31.275,0.0,982.0,3.0,1.0,1.0
max,890.0,80.0,512.3292,9.0,1309.0,3.0,8.0,1.0
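
The output above describes every numeric column of the dataframe. To restrict the statistics to one column, `describe()` can be called on that column alone. A small sketch, assuming a hypothetical frame called `sample`:

```python
import pandas as pd

# Hypothetical values; the last age is missing.
sample = pd.DataFrame({
    "Age": [2.0, 22.0, 38.0, float("nan")],
    "Fare": [21.075, 7.25, 71.2833, 8.05],
})

# Selecting the column first yields statistics for "Age" only.
age_stats = sample["Age"].describe()

print(age_stats)
print(age_stats["min"], age_stats["max"])
```

The `count` entry only includes non-missing values, which is why it can be smaller than the number of rows.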


The minimum value of "Age" is a bit strange. Possibly it is some kind of encoding of the age of infants, but let's clean the data and remove these values. In pandas, a condition can be put inside square brackets "[]" when accessing a dataframe or a column. In the next cell, print the values that are less than 1.

In [15]:
df.query('Age<1')

Unnamed: 0.1,Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
78,78,0.83,,S,29.0,"Caldwell, Master. Alden Gates",2,79,2,male,0,1.0,248738
305,305,0.92,C22 C26,S,151.55,"Allison, Master. Hudson Trevor",2,306,1,male,1,1.0,113781
469,469,0.75,,C,19.2583,"Baclini, Miss. Helene Barbara",1,470,3,female,2,1.0,2666
644,644,0.75,,C,19.2583,"Baclini, Miss. Eugenie",1,645,3,female,2,1.0,2666
755,755,0.67,,S,14.5,"Hamalainen, Master. Viljo",1,756,2,male,1,1.0,250649
803,803,0.42,,C,8.5167,"Thomas, Master. Assad Alexander",1,804,3,male,0,1.0,2625
831,831,0.83,,S,18.75,"Richards, Master. George Sibley",1,832,2,male,1,1.0,29106
1092,201,0.33,,S,14.4,"Danbom, Master. Gilbert Sigvard Emanuel",2,1093,3,male,0,,347080
1141,250,0.92,,S,27.75,"West, Miss. Barbara J",2,1142,2,female,1,,C.A. 34651
1172,281,0.75,,S,13.775,"Peacock, Master. Alfred Edward",1,1173,3,male,1,,SOTON/O.Q. 3101315
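
The cell above uses `query()`. The bracket notation gives the same rows, as this sketch on a hypothetical four-row frame suggests (`sample`, `mask`, `infants` and `infant_ages` are illustrative names).

```python
import pandas as pd

# Hypothetical rows; not taken from the Titanic data.
sample = pd.DataFrame({
    "Name": ["Infant", "Child", "Adult", "Unknown"],
    "Age": [0.75, 4.0, 35.0, float("nan")],
})

mask = sample["Age"] < 1               # boolean Series: one True/False per row
infants = sample[mask]                 # rows where the condition holds
infant_ages = sample.loc[mask, "Age"]  # only the "Age" values of those rows

print(infants)
print(infant_ages.tolist())
```

The row with the missing age is not selected, because a comparison with NaN evaluates to False.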


Good. Now let's modify our dataframe (let's still call it **df**) by removing all the ages which are less than 1. As usual, enter the code in the next cell and run it. Describe the dataframe you created, to see if the output makes sense. Also note that you can write more than one command in a cell (just press ``enter`` after each command). All commands will be executed when you press ``ctrl-enter``. So in the next cell, write both the command to create the dataframe, and to print some information about it.

In [16]:
df=df.query('Age>=1')
df.describe()

Unnamed: 0.1,Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,1034.0,1034.0,1033.0,1034.0,1034.0,1034.0,1034.0,707.0
mean,370.766925,30.220019,36.776641,0.409091,653.54352,2.204062,0.499033,0.400283
std,251.013132,14.147138,55.903432,0.835997,377.4246,0.84291,0.913754,0.490302
min,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,161.25,21.0,8.05,0.0,324.5,1.0,0.0,0.0
50%,328.0,28.0,15.75,0.0,660.5,2.0,0.0,0.0
75%,569.75,39.0,36.75,1.0,971.75,3.0,1.0,1.0
max,890.0,80.0,512.3292,6.0,1307.0,3.0,8.0,1.0
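
Note that the count dropped from 1046 to 1034 ages, but from 1309 to 1034 rows: a filter such as `Age>=1` also discards every row whose age is missing. Whether that is wanted is a judgment call the exercise leaves open. The sketch below, on a hypothetical frame, contrasts the two options; keeping the rows with unknown age is an assumption made for illustration only.

```python
import pandas as pd

# Hypothetical rows; not taken from the Titanic data.
sample = pd.DataFrame({
    "Name": ["Infant", "Child", "Adult", "Unknown"],
    "Age": [0.75, 4.0, 35.0, float("nan")],
})

# NaN >= 1 is False, so the row with the missing age is dropped as well.
dropped_nan = sample.query("Age >= 1")

# Alternative: remove only the ages below 1 and keep rows with unknown age.
kept_nan = sample[(sample["Age"] >= 1) | sample["Age"].isna()]

print(len(dropped_nan), len(kept_nan))
```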


As you probably remember from the slides in the first lecture, another thing to check for is missing values, or "null values", in the columns. This is a bit more complicated. If you are stuck, remember the slides you saw in the first lecture. Do this in the next cell.

In [17]:
df.apply(lambda x: sum(x.isnull()),axis=0)

Unnamed: 0       0
Age              0
Cabin          763
Embarked         2
Fare             1
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       327
Ticket           0
dtype: int64
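
The `apply` call above counts the null values column by column. A shorter spelling that should give the same result is `isnull().sum()`. Here is a sketch comparing both on a hypothetical frame (`sample`, `nulls_apply` and `nulls_direct` are illustrative names):

```python
import pandas as pd

# Hypothetical sample with gaps; not the Titanic data.
sample = pd.DataFrame({
    "Cabin": [None, "C85", None],
    "Fare": [7.25, float("nan"), 7.925],
    "Name": ["A", "B", "C"],
})

# Column-wise apply, as in the cell above.
nulls_apply = sample.apply(lambda col: sum(col.isnull()), axis=0)

# isnull() marks each cell True/False; sum() adds up the True values per column.
nulls_direct = sample.isnull().sum()

print(nulls_direct)
```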

The data is obviously not complete. For example, for a lot of passengers it is not known in which cabin they were. The "Survived" column is a bit different. To get to the bottom of this, describe the "Survived" column in the next cell, and look at the values.

In [18]:
# answer here
df.Survived.describe()

count    707.000000
mean       0.400283
std        0.490302
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

Write your thoughts about the ``count`` of the values above, and the number of non-null values in the "Survived" column in the earlier exercise. Why is there a difference, and does it mean we have a lot of missing values?

Answer: The count for "Survived" above is 707, whereas the earlier output of the info command reported 891 non-null values. The difference arises because we changed the data set in between: the filter ``Age>=1`` removed the infants younger than 1 and also every row with a missing age, since a comparison with a missing value is never true.

This ends the exercises for the first lab session. Save your notebook, and send the saved file to <peter.berck@hh.se>.