# Week 3

**Table of Contents**
1. [Quick Recap from Python_Week_2](#quick-recap-from-python-week2)
2. [Functions](#functions)
3. [Numpy](#numpy)
4. [Pandas](#pandas)
5. [Dataset: loading, manipulation & visualization](#dataset)

<a id='Quick Recap from Python_Week_2'></a>
## Quick Recap from Python_Week_2


## Functions

Functions take an input, do something with it, and then output the result. Functions can be seen as mini-programs in their own right.

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Function_machine2.svg/220px-Function_machine2.svg.png)

In Python we have 3 types of functions:
- **Built-in functions**. We have seen several of these already, such as `len()`, `min()` and `print()`.
- **User-defined functions** (UDFs): functions that people create themselves.
- Anonymous or **lambda functions**, which are not declared with the standard `def` keyword.


Watch this [video about functions](https://www.youtube.com/watch?v=9Os0o3wzS_I).

The syntax for functions in Python has three parts:

- the `def` statement, with the function name and its parameter(s)
- the body of the function
- the function call, where an argument is passed in


Remember
- a **parameter** is a placeholder for the actual value
- an **argument** is the actual value that is passed in
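
As a minimal sketch of the ideas above: the function name `greet` and the value `"Ada"` are made up for illustration. The example shows a user-defined function, a built-in function, and a lambda function, and marks where the parameter and the argument appear.

```python
# A user-defined function: `name` is the parameter (a placeholder).
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

# Calling the function: "Ada" is the argument (the actual value passed in).
message = greet("Ada")
print(message)        # Hello, Ada!

# A built-in function: len() is provided by Python itself.
print(len(message))   # 11

# An anonymous (lambda) function, defined without the `def` keyword.
square = lambda x: x * x
print(square(4))      # 16
```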

![alt text](https://drive.google.com/uc?id=1hQV_c74lYUT9lSMD2dltLr_uca80gYK8)

What is the parameter here? And what is the argument?


### Installing & importing packages

In [6]:
!pip install pandas



In [0]:
# Step 2. Import a library. General pattern (a template, not runnable as written):
# from <module_name> import <name> as <preferred_name>

In [0]:
import foo                 # foo imported and bound locally
import foo.bar.baz         # foo.bar.baz imported, foo bound locally
import foo.bar.baz as fbb  # foo.bar.baz imported and bound as fbb
from foo.bar import baz    # foo.bar.baz imported and bound as baz
from foo import attr       # foo imported and foo.attr bound as attr

In [7]:
# example:
from math import pi
print(pi)

# The math library has ceil(), floor(), factorial() and trigonometric functions (sin, cos, tan), among others,
# as well as mathematical constants: pi, e (2.718281...) and tau. Note that round() is a built-in function, not part of math.

# What happens if we do:
# from math import pi as p
# print(p)

3.141592653589793


# Numpy



*   Linear algebra library for Python
*   Incredibly fast (as it has bindings to C libraries)
*   **Superpower**: arrays. Why would you use NumPy arrays instead of lists? Read [this awesome Stack Overflow post](https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists)!



In [0]:
import numpy as np

In [0]:
import numpy as np

# a = np.array(1,2,3,4)    # WRONG
a = np.array([1,2,3,4])  # RIGHT
a

array([1, 2, 3, 4])
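
One practical reason to prefer arrays over lists for numeric work is that arithmetic on an array applies to every element. The sketch below illustrates this with made-up variable names; it is not a benchmark and says nothing about speed.

```python
import numpy as np

numbers_list = [1, 2, 3, 4]
numbers_array = np.array(numbers_list)

# Multiplying a list repeats it; multiplying an array scales every element.
doubled_list = numbers_list * 2
doubled_array = numbers_array * 2

print(doubled_list)            # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_array.tolist())  # [2, 4, 6, 8]

# Element-wise addition needs a loop or comprehension with lists...
summed_list = [x + y for x, y in zip(numbers_list, numbers_list)]
# ...but is a single expression with arrays.
summed_array = numbers_array + numbers_array
```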

In [0]:
my_matrix = [[1,2,3],[4,5,6],[7,8,9]]
my_matrix

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

In [0]:
np.array(my_matrix)

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [0]:
# Return evenly spaced values within a given interval.
np.arange(0,20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [0]:
np.arange(0,20,2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [0]:
np.zeros(3)

array([0., 0., 0.])

In [0]:
np.zeros((5,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [0]:
np.ones(3)

array([1., 1., 1.])

In [0]:
3 * np.ones((3,3), dtype=int)

array([[3, 3, 3],
       [3, 3, 3],
       [3, 3, 3]])

### linspace
Return evenly spaced numbers over a specified interval.

In [0]:
# creating a numeric sequence of 4 evenly spaced numbers from 0 to 10 (both endpoints included)
np.linspace(0,10,4)

array([ 0.        ,  3.33333333,  6.66666667, 10.        ])

In [0]:
np.linspace(0,10,50)

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
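
It can help to compare `linspace` with `arange` side by side. The following sketch uses illustrative values: with `arange` you choose the step, with `linspace` you choose the number of points.

```python
import numpy as np

# arange: you choose the step; the stop value is excluded.
by_step = np.arange(0, 10, 2.5)

# linspace: you choose the number of points; the stop value is included by default.
by_count = np.linspace(0, 10, 5)

# retstep=True also returns the spacing that linspace computed.
points, spacing = np.linspace(0, 10, 5, retstep=True)

print(by_step.tolist())   # [0.0, 2.5, 5.0, 7.5]
print(by_count.tolist())  # [0.0, 2.5, 5.0, 7.5, 10.0]
print(spacing)            # 2.5
```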

### max,min,argmax,argmin

These methods find the maximum or minimum value of an array (`max`, `min`) or the index position of that value (`argmax`, `argmin`).

In [8]:
import numpy as np  # without this import the cell fails with "NameError: name 'np' is not defined"

simple_array = np.array([0.23, -1, 35, 11, -44, 90.21, 0.2, -300, 300.5])

print(simple_array.max())        # returns the maximum value in the array.
print(simple_array.min())        # returns the minimum value in the array.
print(simple_array.argmax())     # returns the index of the maximum value.
print(simple_array.argmin())     # returns the index of the minimum value.

300.5
-300.0
8
7

In [0]:
simple_array.argmax()
simple_array[8]

300.5

### Numpy arrays operations


*   Arithmetic operations between arrays
*   Operations between an array and a scalar



In [0]:
array = np.arange(0,10)
array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [0]:
# Addition of 2 arrays
array + array

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [0]:
# Multiplication of 2 arrays
array * array

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [0]:
# Subtraction between 2 arrays
array - array

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [0]:
# Division
# 0/0 triggers a RuntimeWarning, not an error;
# the result is simply nan
array/array


array([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [0]:
# 1/0 also triggers a warning rather than an error; the result is inf (infinity)
1/array


array([       inf, 1.        , 0.5       , 0.33333333, 0.25      ,
       0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111])
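
If the warnings get in the way, one option is to silence them locally and then handle the `nan` and `inf` entries explicitly. This is a sketch only: replacing them with `0.0` is an assumption made for illustration, and the right replacement depends on your data.

```python
import numpy as np

array = np.arange(0, 10)

# Temporarily silence the divide-by-zero and invalid-value warnings.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = array / array        # first element is 0/0 -> nan
    reciprocal = 1 / array       # first element is 1/0 -> inf

# Detect the problem entries and replace them with a value of your choice.
ratio_clean = np.where(np.isnan(ratio), 0.0, ratio)
reciprocal_clean = np.where(np.isinf(reciprocal), 0.0, reciprocal)

print(ratio_clean)
print(reciprocal_clean)
```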

In [0]:
# Raise to the power (element-wise)
array**3

# Square the array (element-wise)
np.square(array)

# Or use np.power. It also accepts an array of exponents, so different elements
# can be raised to different powers; for plain squaring it can be slower than np.square.
# Only the result of this last expression is displayed below.
np.power(array, 2)


array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [0]:
#Taking Square Roots
np.sqrt(array)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

In [0]:
# Calculating the exponential (e^x)
np.exp(array)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03])

In [0]:
#######################################################
################### Challenge 1 #######################
#######################################################

# Create an array of 10 zeros

In [0]:
#######################################################
################### Challenge 2  ######################
#######################################################

# Create an array of the integers from 10 to 50

In [0]:
#######################################################
################### Challenge 3  ######################
#######################################################

# Create a 3x3 matrix with values ranging from 0 to 8

In [0]:
#######################################################
################### Challenge 4  ######################
#######################################################

# Use NumPy to generate an array of 25 random numbers sampled from a standard normal distribution

In [0]:
#######################################################
################### Challenge 5  ######################
#######################################################

# Create an array of 20 linearly spaced points between 0 and 1:

In [0]:
#######################################################
################### Challenge 6  ######################
#######################################################

# Given the matrix below

matrix = np.arange(1,26).reshape(5,5)
matrix

# array([[ 1,  2,  3,  4,  5],
#        [ 6,  7,  8,  9, 10],
#        [11, 12, 13, 14, 15],
#        [16, 17, 18, 19, 20],
#        [21, 22, 23, 24, 25]])

# WRITE CODE  THAT REPRODUCES THE OUTPUT BELOW

# array([[12, 13, 14, 15],
#        [17, 18, 19, 20],
#        [22, 23, 24, 25]])

# Hint: look at the number of rows & columns and think about which operation has to be applied to the initial matrix.

In [0]:
#######################################################
################### Challenge 7  ######################
#######################################################

# Given the matrix below, print the sum of all the columns in matrix.

matrix = np.arange(1,26).reshape(5,5)
matrix

 

# Pandas

## Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features. You should go through the notebooks in this order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

### Series

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [0]:
import numpy as np
import pandas as pd

# Create series

# You can convert a list,numpy array, or dictionary to a Series:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

In [0]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [0]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [0]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64
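
The `arr` and `d` variables defined earlier can be converted in the same way. The sketch below repeats their definitions so that it runs on its own; the names `series_from_array`, `series_from_dict` and `series_of_strings` are illustrative.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
arr = np.array([10, 20, 30])
d = {'a': 10, 'b': 20, 'c': 30}

# From a NumPy array: works like a list, with an optional index.
series_from_array = pd.Series(arr, index=labels)

# From a dictionary: the keys become the index automatically.
series_from_dict = pd.Series(d)

# Look up a value by its label.
print(series_from_dict['b'])  # 20

# A Series can also hold non-numeric data, such as strings.
series_of_strings = pd.Series(data=labels)
print(series_of_strings)
```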

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [0]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan']) 
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [0]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
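
Operations between Series are aligned on the index labels. 'Italy' and 'USSR' each appear in only one of the two Series, so their sum is NaN, and the result becomes float because NaN is a floating-point value. A hedged sketch of label lookup and of one way to avoid the NaN values, using the `add` method with `fill_value`:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Look up a single value by its label.
print(ser1['USA'])    # 1

# Plain addition: labels missing from either Series give NaN.
total = ser1 + ser2

# add() with fill_value treats a missing label as 0 instead.
total_filled = ser1.add(ser2, fill_value=0)
print(total_filled)
```

Whether treating a missing label as 0 is appropriate depends on what the data means; it is shown here only to illustrate the mechanism.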

### DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

In [0]:
from numpy.random import randn
np.random.seed(101)

In [0]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [0]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
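
To make the "bunch of Series sharing the same index" idea concrete, here is a small sketch with made-up values (the names `w`, `x` and `small_df` are illustrative) that builds a DataFrame directly from two Series.

```python
import pandas as pd

index = ['A', 'B', 'C']
w = pd.Series([1.0, 2.0, 3.0], index=index)
x = pd.Series([10.0, 20.0, 30.0], index=index)

# Each Series becomes a column; the shared labels become the row index.
small_df = pd.DataFrame({'W': w, 'X': x})
print(small_df)

# Pulling a column back out returns a Series with the same index.
column = small_df['W']
print(type(column))
print(column.index.equals(small_df.index))  # True
```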


## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame

In [0]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [0]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


In [0]:
type(df['W'])

pandas.core.series.Series

In [0]:
df['new'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [0]:
# remove columns
df.drop('new',axis=1,inplace=True)

In [0]:
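# drop() returns a new DataFrame and leaves df unchanged unless inplace=True is specified.
# Because we passed inplace=True above, the 'new' column is gone from df: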
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [0]:
# drop rows (remove rows)
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


In [0]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [0]:
# selecting rows
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

In [0]:
# Or select by integer position instead of label.
# Note: df.iloc[:, 0] returns the first column; use df.iloc[0] for the first row.
df.iloc[:, 0]

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [0]:
# Selecting a single value by row label and column label
df.loc['B','Y']

-0.8480769834036315
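
Both `loc` and `iloc` also accept lists and slices, which is how a true subset of rows and columns is selected. The sketch below uses a small made-up frame, `df_small`, so that it runs on its own.

```python
import pandas as pd

# Small hypothetical frame with integer values for readability.
df_small = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['A', 'B', 'C'],
    columns=['W', 'X', 'Y'],
)

# Label-based: rows A and B, columns W and Y.
by_label = df_small.loc[['A', 'B'], ['W', 'Y']]

# Position-based: first two rows, first and third column.
by_position = df_small.iloc[0:2, [0, 2]]

# A single row by position (compare with iloc[:, 0], which is a column).
first_row = df_small.iloc[0]

print(by_label)
print(by_label.equals(by_position))  # True
```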

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [0]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [0]:
df>0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,True,False,False,True
C,False,True,True,False
D,True,False,False,True
E,True,True,True,True


In [0]:
df[df>0]

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,,,0.605965
C,,0.740122,0.528813,
D,0.188695,,,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [0]:
new_df = df[df['W']>0]
new_df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [0]:
df[df['W']>0]['Y']

A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64

In [0]:
df[df['W']>0][['Y','X']]

Unnamed: 0,Y,X
A,0.907969,0.628133
B,-0.848077,-0.319318
D,-0.933237,-0.758872
E,2.605967,1.978757


In [0]:
# For two conditions, use | (or) and & (and), with each condition wrapped in parentheses:
df[(df['W']>0) & (df['Y'] > 1)]

Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
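
The same ideas extend to "or" conditions, negation, and selecting columns in the same step. A sketch with a made-up frame, `df_small`, whose values are rounded for readability:

```python
import pandas as pd

df_small = pd.DataFrame(
    {'W': [2.7, 0.6, -2.0, 0.2], 'Y': [0.9, -0.8, 0.5, 2.6]},
    index=['A', 'B', 'C', 'D'],
)

# OR: rows where W is negative or Y is greater than 1.
either = df_small[(df_small['W'] < 0) | (df_small['Y'] > 1)]

# .loc takes the row mask and the column selection in a single step,
# an alternative to chaining df[mask]['Y'].
y_where_w_positive = df_small.loc[df_small['W'] > 0, 'Y']

# ~ negates a mask.
w_not_positive = df_small[~(df_small['W'] > 0)]

print(either)
print(y_where_w_positive)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.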


### Merging, Joining, and Concatenating

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. In this lecture we will discuss these 3 methods with examples.


In [0]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11'],
                        'E': [1,1,1,1]},
                        index=[8, 9, 10, 11])

In [0]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [0]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [0]:
df3

Unnamed: 0,A,B,C,D,E
8,A8,B8,C8,D8,1
9,A9,B9,C9,D9,1
10,A10,B10,C10,D10,1
11,A11,B11,C11,D11,1


In [0]:
# concatenation
pd.concat([df1,df2,df3])

#pd.concat([df1,df2,df3],axis=1)


Unnamed: 0,A,B,C,D,E
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,A4,B4,C4,D4,
5,A5,B5,C5,D5,
6,A6,B6,C6,D6,
7,A7,B7,C7,D7,
8,A8,B8,C8,D8,1.0
9,A9,B9,C9,D9,1.0
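
A few `pd.concat` options are worth knowing. The sketch below uses two small made-up frames, `top` and `bottom`, to show `sort`, `ignore_index` and `axis=1`.

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
bottom = pd.DataFrame({'A': ['A2', 'A3'], 'E': [1, 1]}, index=[2, 3])

# Stack rows; columns missing from one frame are filled with NaN.
# sort=False keeps the columns in order of appearance.
stacked = pd.concat([top, bottom], sort=False)

# ignore_index=True discards the original row labels and renumbers from 0.
renumbered = pd.concat([top, top], ignore_index=True)

# axis=1 places the frames side by side, aligning on the row index.
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked)
print(side_by_side.shape)  # (4, 4)
```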


### Merging

The **merge** function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

In [0]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})   

In [0]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [0]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


In [0]:
pd.merge(left, right, on=['key'])

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3
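
In the example above both frames contain exactly the same keys. When the keys differ, the `how` parameter decides which rows are kept, much like the join types in SQL. A sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K2', 'K3'], 'C': ['C0', 'C2', 'C3']})

# inner (the default): only keys present in both frames.
inner = pd.merge(left, right, on='key', how='inner')

# left: every key from the left frame; unmatched rows get NaN.
left_join = pd.merge(left, right, on='key', how='left')

# outer: every key from either frame.
outer = pd.merge(left, right, on='key', how='outer')

print(inner['key'].tolist())      # ['K0', 'K2']
print(left_join['key'].tolist())  # ['K0', 'K1', 'K2']
print(sorted(outer['key']))       # ['K0', 'K1', 'K2', 'K3']
```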


### Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [0]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [0]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [0]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [0]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


## Exercises

In [9]:
import pandas as pd

titanic_data = pd.read_csv('https://rotterdamai001.blob.core.windows.net/python/titanic/train.csv')

### Analysing the data

Let's first look at our dataset. 

In [10]:
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The next thing we'll do is look at an overview of our dataset using **df.info()**. This allows us to quickly see the type of every column, as well as where there's missing data.

In [11]:
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


We can clearly see that the 'Age', 'Cabin' and 'Embarked' columns have missing values, which need to be handled.

In [0]:
missing_age_count = 0
for value in titanic_data['Age']:
    if pd.isnull(value):
        missing_age_count += 1
print("Number of Age data missing: ", missing_age_count)

missing_cabin_count = 0
for value in titanic_data['Cabin']:
    if pd.isnull(value):
        missing_cabin_count += 1
print("Number of Cabin data missing: ", missing_cabin_count)

emb = 0
for value in titanic_data['Embarked']:
    if pd.isnull(value):
        emb += 1
print("Number of Embarked data missing: ", emb)

Number of Age data missing:  177
Number of Cabin data missing:  687
Number of Embarked data missing:  2
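
The loops above make the counting explicit. pandas can also do the same count in one expression with `isnull().sum()`. The sketch below uses a tiny made-up stand-in for `titanic_data`, called `sample`, so that it runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_data so the example runs offline.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', np.nan],
})

# isnull() marks missing cells as True; sum() counts the True values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# The same count for a single column.
print(sample['Age'].isnull().sum())  # 2
```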


**Handling missing values**: The original Titanic data has NaN values in the Age column, and most of the Cabin data is missing as well. Here we exclude the rows with missing values rather than filling them in, because using the mean age would give us biased results.


Filling in missing values generally introduces some bias, but is sometimes necessary if, for example, you have access to a limited amount of data.

Strategies for filling in missing values include:

1.   Filling in with the **average value** over the population;
2.   Using a **random sample** from the same distribution as your original data;
3.   Assuming the **most common value**.
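
The three strategies can be sketched as follows. The `ages` values are made up, and strategy 2 is approximated here by drawing at random from the observed values, which is an assumption about what "the same distribution" means in practice.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 35.0])  # hypothetical data

# 1. Average value over the population.
filled_mean = ages.fillna(ages.mean())

# 2. Random sample drawn from the observed (non-missing) values.
rng = np.random.default_rng(0)
observed = ages.dropna().to_numpy()
filled_random = ages.copy()
missing = filled_random.isnull()
filled_random[missing] = rng.choice(observed, size=missing.sum())

# 3. Most common value (mode() returns a Series, so take the first entry).
filled_mode = ages.fillna(ages.mode()[0])

print(filled_mean.tolist())
print(filled_mode.tolist())
```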



In [0]:
# Drop the rows with NaN values in the 'Age' and 'Embarked' columns.
clean_data = titanic_data.dropna(subset=['Age', 'Embarked'])

# We don't need the 'Cabin' column, so we remove it.
new_data = clean_data.drop('Cabin', axis=1)

# Select the rows of the passengers who survived.
survivors = new_data.loc[new_data['Survived'] == 1]

# Reset the index. The old index is kept as a new 'index' column.
new_data = new_data.reset_index()
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 712 entries, 0 to 711
Data columns (total 12 columns):
index          712 non-null int64
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Name           712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Ticket         712 non-null object
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(6), object(4)
memory usage: 66.8+ KB


The data has now been cleaned of missing values. 

We will also not be using the PassengerId, SibSp, Parch and Ticket columns in the analysis because no important conclusions can be drawn from them, so we can drop those columns.

In [0]:
new_data = new_data.drop(columns=['PassengerId', 'SibSp', 'Parch', 'Ticket'])
new_data.head()

Unnamed: 0,index,Survived,Pclass,Name,Sex,Age,Fare,Embarked
0,0,0,3,"Braund, Mr. Owen Harris",male,22.0,7.25,S
1,1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,71.2833,C
2,2,1,3,"Heikkinen, Miss. Laina",female,26.0,7.925,S
3,3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,53.1,S
4,4,0,3,"Allen, Mr. William Henry",male,35.0,8.05,S


We will now dissect the dataset and try to figure out **what factors made people more likely to survive**, so it's a good idea to look at the overall probability of survival first:


In [0]:
prob_survival = new_data['Survived'].mean() * 100

print('Survival rate: {0:.2f}%'.format(prob_survival))

Survival rate: 40.45%


### Challenge

Continue to dissect the dataset to try and answer a few questions:

1.   How many passengers embarked at S, C or Q?
2.   How many passengers were male and how many were female?
3.   What were the minimum and maximum ages of the passengers?
4.   What were the minimum, maximum and mean fares for traveling on the Titanic?
5.   Considering children as people under the age of 10, calculate the survival rate for children and compare it to the survival rate of adults (divided into male and female).

In [None]:
# ANSWER HERE