## Week 1 lab 1: Data tables and summary statistics

### General note 
Lab exercises in the first part of this unit ("Data Science") were designed for Python with the __pandas__, __numpy__, __matplotlib__ and __sklearn__ packages installed. Some commands may require the latest version of these packages/libraries on your machine. 
Finally, it is likely that you are new to some of the packages/commands/objects/data structures used for solving the lab exercises. In this case it is recommended to search for their documentation on the corresponding package's (e.g. numpy, pandas etc) website to familiarise yourself with these commands. A nice introduction to the basics of using the above Python packages (in the form of Jupyter notebooks) is also included in this free online book, which is highly recommended if you're new to them: https://github.com/jakevdp/PythonDataScienceHandbook.

The unit tutors will also be available to provide help in the lab and on the unit's online forum. 

__This lab__: During this exercise you will practise basic data import and analysis skills using the __pandas__ and __numpy__ packages. 


In [2]:
# Clean the workspace and screen
%clear
%reset
import matplotlib.pyplot as plt
plt.close('all')


Nothing done.


In [3]:
import pandas as pd
import numpy as np

## Loading table data

Load (read) the titanic.csv file into Python's workspace. Python can read csv tables with the help of pandas. For this, use pandas' ```read_csv``` command i.e. ```pd.read_csv('titanic.csv')```, to load the data into a pandas table object (here, the table is named T). 

In [4]:
T = pd.read_csv("titanic.csv")

Use pandas' ```head``` command to display the head of this dataset i.e. the first few rows of the table (you can specify the number of rows to see by passing a natural number as an argument). You should be able to see the variable each column represents alongside its values in each row (i.e., each sample/passenger item).

In [25]:
T.shape[0]

887
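As a minimal sketch of the two commands used so far, the example below applies ```head``` and ```shape``` to a small, hypothetical table (the name `toy` and its values are made up for illustration and are not taken from titanic.csv).

```python
import pandas as pd

# Hypothetical toy table standing in for the real titanic.csv data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

first_rows = toy.head(2)    # first 2 rows; head() with no argument returns up to 5
n_rows, n_cols = toy.shape  # (number of rows, number of columns)

print(first_rows)
print(n_rows, n_cols)
```

Here ```shape[0]``` is the number of rows, which for the Titanic table corresponds to the number of passengers.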

## Accessing the table columns/rows


Assume you want to access the table's column about the journey outcome, i.e. the variable ```Survived```, and put it in a new variable ```y```. Then using the column name you can run ```y = T['Survived']``` or ```y = T.Survived```. 

iloc: Allows you to select certain columns or rows by their integer position. For example, to output only the first column, use T.iloc[:,0]

Selecting several columns: You can output multiple columns by passing a list of positions. For example, columns 2 and 3 are given by T.iloc[:,[1,2]]

The output will be another table object because the columns could potentially have mixed variable types (numerical, strings etc). If all selected columns are numerical, you can turn the result into a matrix (a numpy array) using the ```to_numpy()``` command i.e. ```y = T.iloc[:,[0,1]].to_numpy()```. (More details on indexing and selecting data can be found at: https://pandas.pydata.org/docs/user_guide/indexing.html). Similarly, ```y = T.iloc[:,3:].to_numpy()``` would extract columns 4 onward till the end into a matrix y; note that the slice ```3:-1``` would instead stop before the last column. Row indexing is also possible. For instance, ```y = T.iloc[[0, 4, 5],:]``` would extract table rows 1, 5 and 6 into a new sub-table; note that ```iloc``` is indexed with square brackets, not called with parentheses.
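The positional indexing patterns described above can be sketched on a small, hypothetical all-numeric table (the name `toy` and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical toy table; every column is numeric so to_numpy() gives a float matrix
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 11.13, 51.86],
})

first_col = toy.iloc[:, 0]              # a single column (a pandas Series)
two_cols = toy.iloc[:, [1, 2]]          # columns 2 and 3, still a table
tail_cols = toy.iloc[:, 2:].to_numpy()  # columns 3 to the end, as a numpy matrix
some_rows = toy.iloc[[0, 4, 5], :]      # rows 1, 5 and 6 as a sub-table

print(tail_cols.shape)
print(some_rows)
```

Positions are zero-based, so position 0 is the first column or row.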

Now, use numpy's ```randint``` command to sub-select 5 passengers (rows of the table) at random and display their data. NB: randint might produce repetitions, and therefore an alternative is to use numpy's ```permutation``` command.

In [38]:


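# Draw 5 distinct row positions; replace=False avoids selecting the same passenger twice
a = np.random.choice(T.shape[0], 5, replace=False)

T.iloc[a,:]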

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 194 | 1 | 1 | Miss. Elise Lurette | female | 58.0 | 0 | 0 | 146.5208 |
| 273 | 1 | 1 | Miss. Kornelia Theodosia Andrews | female | 63.0 | 1 | 0 | 77.9583 |
| 192 | 1 | 2 | Master. Michel M Navratil | male | 3.0 | 1 | 1 | 26.0 |
| 517 | 1 | 1 | Miss. Anne Perreault | female | 30.0 | 0 | 0 | 93.5 |
| 511 | 0 | 3 | Mr. Satio Coleff | male | 24.0 | 0 | 0 | 7.4958 |
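To illustrate the difference between sampling with and without repetitions, here is a minimal sketch using numpy's ```randint``` and ```permutation``` on a hypothetical 20-row table (`toy` and the seed value are assumptions made only so the example is reproducible).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with 20 rows
toy = pd.DataFrame({"Age": np.arange(20, 40)})

np.random.seed(0)  # fixed seed, only so that repeated runs give the same draw

# randint draws each position independently, so the same row can appear twice
with_rep = np.random.randint(0, len(toy), size=5)

# permutation shuffles all positions; the first 5 are guaranteed to be distinct
without_rep = np.random.permutation(len(toy))[:5]

sample = toy.iloc[without_rep, :]
print(sample)
```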


Use pandas' logical commands (see: https://pandas.pydata.org/docs/user_guide/indexing.html) along with the table access operations above to create a binary vector named first (with size equal to the number of passengers) whose entries are True (1) for passengers that had a 1st class ticket and False (0) otherwise i.e. by running ```first = (T['Pclass']==1)```. This creates a form of indicator vector. Similarly, create binary indicator vectors 'second' and 'third' for passengers on 2nd and 3rd class tickets. Also create indicator vectors 'male' and 'female' e.g. ```male = (T['Sex']=="male")``` and ```female = ~male``` (the element-wise negation operator ```~``` is needed here, because Python's ```not``` does not work on a whole pandas vector). Similarly, create an indicator vector for the 'survived' passengers. Finally, create new indicator vectors 'child', 'young' and 'old' for passengers less than 18, between 18 and 40 (inclusive), and more than 40 years old, respectively.

In [48]:
df = pd.DataFrame(T)

first  = df["Pclass"] == 1
second = df["Pclass"] == 2
third  = df["Pclass"] == 3
male   = df["Sex"] == "male"
female = df["Sex"] == "female"
survived = df["Survived"] == 1
child  = df["Age"] < 18
young  = (df["Age"] >= 18) & (df["Age"] <= 40) #to have multiple conditions have them in brackets and use &
old    = df["Age"] >40


print(first)

0      False
1       True
2      False
3       True
4      False
       ...  
882    False
883     True
884    False
885     True
886    False
Name: Pclass, Length: 887, dtype: bool
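The indicator vectors can also be built with element-wise negation and a range test. The sketch below uses a hypothetical four-passenger table (`toy` and its values are illustrative assumptions); ```between``` is one possible way to express the 18 to 40 range and includes both end points by default.

```python
import pandas as pd

# Hypothetical toy table with the two columns needed here
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 4.0, 61.0],
})

male = toy["Sex"] == "male"
female = ~male  # element-wise NOT; plain `not male` raises a ValueError on a Series

child = toy["Age"] < 18
young = toy["Age"].between(18, 40)  # True where 18 <= Age <= 40
old = toy["Age"] > 40

print(female.tolist())
print(young.tolist())
```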


Use Python's boolean indexing on the indicator vectors built above to build sets of indices for the passengers that were on first, second and third class tickets i.e. ```first_idx = [index for index in range(len(T)) if first[index]]```. Repeat this for the 'male', 'female', 'survived', 'child', 'young' and 'old' passengers.

Each list holds the row numbers of the passengers for which the corresponding indicator is True.


In [88]:
first_idx    = [index for index in range(len(T)) if first[index]]
second_idx   = [index for index in range(len(T)) if second[index]]
third_idx    = [index for index in range(len(T)) if third[index]]
male_idx     = [index for index in range(len(T)) if male[index]]
female_idx   = [index for index in range(len(T)) if female[index]]
survived_idx = [index for index in range(len(T)) if survived[index]]
child_idx    = [index for index in range(len(T)) if child[index]]
young_idx    = [index for index in range(len(T)) if young[index]]
old_idx      = [index for index in range(len(T)) if old[index]]

print(survived_idx)
print(len(male_idx))
print(len(female_idx))

[1, 2, 3, 8, 9, 10, 11, 15, 17, 19, 21, 22, 23, 25, 28, 31, 32, 36, 39, 42, 43, 46, 51, 52, 54, 55, 57, 60, 64, 65, 67, 73, 77, 78, 80, 81, 83, 84, 87, 96, 97, 105, 106, 108, 122, 124, 126, 127, 132, 135, 140, 141, 145, 150, 155, 160, 164, 165, 171, 182, 183, 185, 186, 189, 191, 192, 193, 194, 197, 203, 206, 207, 208, 210, 214, 215, 217, 219, 223, 225, 229, 232, 236, 240, 246, 247, 254, 255, 256, 257, 258, 259, 265, 266, 267, 269, 270, 272, 273, 277, 281, 284, 286, 287, 288, 289, 296, 297, 298, 299, 301, 303, 304, 305, 307, 308, 309, 313, 314, 316, 317, 320, 321, 323, 325, 326, 327, 328, 332, 335, 336, 338, 339, 343, 344, 345, 346, 354, 356, 357, 364, 365, 366, 367, 368, 373, 374, 378, 379, 381, 385, 387, 388, 389, 391, 392, 397, 398, 405, 410, 412, 414, 415, 423, 424, 426, 427, 428, 429, 432, 434, 437, 440, 441, 442, 443, 444, 445, 446, 450, 452, 454, 455, 457, 466, 469, 470, 476, 480, 481, 483, 486, 493, 501, 503, 504, 506, 507, 509, 510, 513, 515, 517, 520, 523, 527, 530, 532, 534, 
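As a possible alternative to the list comprehension, numpy's ```flatnonzero``` returns the positions of the True entries directly. The sketch below compares both approaches on a short, hypothetical indicator vector (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical indicator vector for five passengers
first = pd.Series([False, True, False, True, False])

# Explicit loop over row numbers, as in the exercise
idx_loop = [i for i in range(len(first)) if first[i]]

# Vectorised equivalent: positions of the non-zero (True) entries
idx_np = np.flatnonzero(first.to_numpy()).tolist()

print(idx_loop, idx_np)
```

Both give the same row numbers; the numpy version avoids a Python-level loop, which may matter for larger tables.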

Use Python's set operation commands (e.g. intersection, union, difference, ...) to extract subsets of indices e.g. for passengers that were male and didn't survive. Use the extracted subsets to compute and print out the following risk factors: the percentage of victims among male and female passengers e.g. the probability that a passenger did not survive given that he was male. Also output the percentage (probability) of victims among the first, second and third class passengers, as well as the child, young, and old passengers.

![image.png](attachment:image.png)
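A minimal sketch of the set-based calculation, using short hypothetical index lists rather than the real ones: the risk for a group is the number of group members that did not survive divided by the size of the group.

```python
# Hypothetical index lists (row numbers), made up for illustration
male_idx = [0, 1, 3, 4, 6]
survived_idx = [1, 2, 5, 6]

# Set difference: males that are not among the survivors
male_victims = set(male_idx) - set(survived_idx)

# Estimated probability of not surviving given that the passenger was male
risk_male = len(male_victims) / len(male_idx)

# Equivalent form via the intersection with the survivors
risk_male_alt = 1 - len(set(male_idx) & set(survived_idx)) / len(male_idx)

print(f"Risk male = {risk_male:.3f}")
```

In this toy case 3 of the 5 male passengers are victims, so both expressions evaluate to 0.6.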



In [94]:
# write code to compute variables risk_male, risk_female, risk_first, risk_second, risk_third, risk_child, risk_young, risk_old,  

risk_male   = 1 - len(set(male_idx) & set(survived_idx)) / len(male_idx)
risk_female = 1 - len(set(female_idx) & set(survived_idx)) / len(female_idx)
risk_first  = 1 - len(set(first_idx) & set(survived_idx)) / len(first_idx)
risk_second = 1 - len(set(second_idx) & set(survived_idx)) / len(second_idx)
risk_third  = 1 - len(set(third_idx) & set(survived_idx)) / len(third_idx)
risk_child  = 1 - len(set(child_idx) & set(survived_idx)) / len(child_idx)
risk_young  = 1 - len(set(young_idx) & set(survived_idx)) / len(young_idx)
risk_old    = 1 - len(set(old_idx) & set(survived_idx)) / len(old_idx)

In [95]:
#display risk values below
print('Risk male = %.3f \n' % risk_male)
print('Risk female = %.3f \n' % risk_female)
print('Risk first = %.3f \n' % risk_first)
print('Risk second = %.3f \n' % risk_second)
print('Risk third = %.3f \n' % risk_third)
print('Risk child = %.3f \n' % risk_child)
print('Risk young = %.3f \n' % risk_young)
print('Risk old= %.3f \n' % risk_old)

Risk male = 0.810 

Risk female = 0.258 

Risk first = 0.370 

Risk second = 0.527 

Risk third = 0.756 

Risk child = 0.500 

Risk young = 0.634 

Risk old= 0.635 
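One way to cross-check risk values like these is to skip the index lists and average the 0/1 ```Survived``` column over a boolean mask. This is only a sketch on a hypothetical five-passenger table (`toy` and its values are assumptions), not the method the exercise asks for.

```python
import pandas as pd

# Hypothetical toy table; Survived is coded 1 for survivors and 0 for victims
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 0],
})

male = toy["Sex"] == "male"

# The mean of a 0/1 column is the fraction of survivors, so 1 - mean is the risk
risk_male = 1 - toy.loc[male, "Survived"].mean()
risk_female = 1 - toy.loc[~male, "Survived"].mean()

print(f"Risk male = {risk_male:.3f}")
print(f"Risk female = {risk_female:.3f}")
```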



## Summary Statistics

Now let's compute some summary statistics for the titanic data. For the numerical variables 'Fare' and 'Age' compute and print the min, max, mean, median, variance and standard deviations using the corresponding Python commands. 

In [97]:
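#compute and print summary stats of Age

import statistics

# Name of the column to summarise, e.g. Age or Fare
x = input("Select the variable you are interested in: ")

minimum = min(df[x])
maximum = max(df[x])
mean = statistics.mean(df[x])
median = statistics.median(df[x])
variance = statistics.variance(df[x])  # sample variance (divides by n - 1)
sd = statistics.stdev(df[x])  # sample standard deviation

print(f"Here are your results:\nMin: {minimum}\nMax: {maximum}\nMean: {mean}\nMedian: {median}\nVariance: {variance}\nStandard Deviation: {sd}")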
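The same statistics can be computed with numpy. The sketch below uses a short, hypothetical list of ages (not the real column). One point to watch: numpy's ```var``` and ```std``` divide by n by default, so ```ddof=1``` is passed here to get the sample versions that divide by n - 1.

```python
import numpy as np

# Hypothetical ages, made up for illustration
ages = np.array([22.0, 38.0, 26.0, 35.0])

stats = {
    "min": ages.min(),
    "max": ages.max(),
    "mean": ages.mean(),
    "median": np.median(ages),
    "variance": ages.var(ddof=1),  # sample variance, divides by n - 1
    "std": ages.std(ddof=1),       # sample standard deviation
}

for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

For these four values the mean is 30.25, the median 30.5, the sample variance 56.25 and the sample standard deviation 7.5.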

In [None]:
#compute and print summary stats of Fare

Use numpy's ```cov``` command to compute the covariance matrix of the three variables (features) Age, Fare and Ticket class. Be careful with the syntax used: by default ```cov``` treats each row, not each column, as a variable. We recommend you have a look at the documentation of the command: https://numpy.org/doc/stable/reference/generated/numpy.cov.html

In [None]:
#your code
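A minimal sketch of the ```cov``` syntax on a small, hypothetical table (the name `toy` and its values are assumptions). Because the variables are stored in columns here, ```rowvar=False``` is passed; transposing the matrix would be an equivalent choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the three numerical features
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})

X = toy[["Age", "Fare", "Pclass"]].to_numpy()  # shape (5 passengers, 3 features)

# rowvar=False: each column is a variable, each row is an observation
C = np.cov(X, rowvar=False)

print(C.shape)
print(C)
```

The diagonal of `C` holds the sample variances of the three features and the off-diagonal entries hold their pairwise covariances.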

Similarly, use numpy's ```corrcoef``` command to compute the Pearson correlation coefficient matrix of these variables. Look at the off-diagonal elements of this matrix. Which pairs of variables are most/least correlated with each other?

In [None]:
#your code
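The correlation matrix follows the same pattern. The sketch below, again on a hypothetical table (`toy` and its values are assumptions), also shows one possible way to locate the off-diagonal pair with the largest absolute correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with the three numerical features
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})
names = ["Age", "Fare", "Pclass"]
X = toy[names].to_numpy()

# Pearson correlation matrix; columns are the variables
R = np.corrcoef(X, rowvar=False)

# Ignore the diagonal (always 1) and find the strongest off-diagonal pair
abs_R = np.abs(R)
np.fill_diagonal(abs_R, 0.0)
i, j = np.unravel_index(abs_R.argmax(), abs_R.shape)

print(R)
print("Most correlated pair:", names[i], names[j])
```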

Now use pandas' ```hist``` command to plot the distribution of passenger ages (NB: you can specify the number of histogram bins explicitly within this command). Repeat this procedure for the variables Fare and Ticket class.

In [None]:
#histogram of Age

In [None]:
#histogram of Fare

In [None]:
#histogram of Ticket class
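A minimal sketch of the ```hist``` command with an explicit number of bins, on hypothetical ages (the values, the axis labels and the choice of 5 bins are assumptions). The ```Agg``` backend line only makes the sketch run without a display and can be omitted inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages, made up for illustration
toy = pd.DataFrame({"Age": [2.0, 22.0, 38.0, 26.0, 35.0, 54.0, 61.0, 19.0]})

# bins sets the number of equal-width intervals; hist returns the matplotlib Axes
ax = toy["Age"].hist(bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Number of passengers")

plt.savefig("age_hist.png")  # in a notebook the figure is displayed automatically
```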

Finally, note that pandas has the command ```describe()``` which can be used to give some summary statistics for an input table of data. Use this command for the titanic table.

In [None]:
T.describe()
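As a sketch of what ```describe()``` returns, the example below applies it to a hypothetical table with one numerical and one text column (`toy` and its values are assumptions). By default only the numerical columns are summarised.

```python
import pandas as pd

# Hypothetical toy table with one text column and one numerical column
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

summary = toy.describe()  # count, mean, std, min, quartiles and max per numeric column

print(summary)
print(summary.loc["mean", "Age"])
```

The result is itself a table, so individual statistics can be read out with ```loc``` as in the last line.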