<a href="https://colab.research.google.com/github/Trantracy/Titanic-/blob/master/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://i.imgur.com/0AUxkXt.png)

# Introduction to Pandas

![](https://akm-img-a-in.tosshub.com/indiatoday/titanic_647_041416113640.jpg?IWI8WJ3owRLPfIO2GUMAyyypPfwvvcRV)

__Pandas__ is a Python library for data analysis. It lets us read a dataset and present it in a table-like format, as well as manipulate, transform, and aggregate the data. _Series_ and _DataFrame_ are the two core data structures of __Pandas__.

- A _Series_ can be understood as a one-dimensional array with a flexible index. It is the __Pandas__ term for a _column_.
- _DataFrame_ is the __Pandas__ term for a _table_. A _DataFrame_ is a two-dimensional array with flexible row and column indices.
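
As a minimal sketch of the two structures described above, the snippet below builds a small _Series_ and a small _DataFrame_ by hand. The names `ages` and `people`, the index labels, and the values are made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# A Series: one-dimensional values with an index (labels here are hypothetical)
ages = pd.Series([22.0, 38.0, 26.0], index=['a', 'b', 'c'], name='Age')

# A DataFrame: a table whose columns share one row index
people = pd.DataFrame(
    {'Name': ['Owen', 'Florence', 'Laina'], 'Age': [22.0, 38.0, 26.0]},
    index=['a', 'b', 'c'],
)

print(ages['b'])            # look up a value by its index label
print(type(people['Age']))  # selecting one column returns a Series
```

Selecting a single column from `people` gives back a _Series_, which is why the two structures are usually introduced together.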


[Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)

In [0]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
# Import library
import pandas as pd
import numpy as np

With __Pandas__, we can load data from various sources such as .csv, .tsv, and Excel files, and even from a SQL table.
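
A small sketch of reading two of these formats, assuming in-memory text instead of real files so that it runs anywhere. The sample rows are made up; `pd.read_csv` handles both cases, with `sep='\t'` for tab-separated data. Excel files and SQL tables have their own readers (`pd.read_excel` and `pd.read_sql`), which are not shown here because they need extra dependencies or a database connection.

```python
import io
import pandas as pd

# Hypothetical sample data held in strings instead of files
csv_text = "Name,Age\nOwen,22\nLaina,26\n"
tsv_text = "Name\tAge\nOwen\t22\nLaina\t26\n"

csv_df = pd.read_csv(io.StringIO(csv_text))            # comma-separated
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep='\t')  # tab-separated

# Both readers produce the same table
print(csv_df.equals(tsv_df))
```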

In [0]:
# Read a csv file
df = pd.read_csv('/content/drive/My Drive/FTMLE - Tonga/Data/titanic.csv')

### DataFrame

In [0]:
# Show the DataFrame
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


### Series

In [0]:
# Select a Series from the DataFrame
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [0]:
a = np.array(['a','b','c'])
a[1]

'b'

### Data Selection

In [0]:
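# Indexing
# Label-based: loc
# Integer position-based: iloc

# Slicing (loc slices by label and includes the end label)
df['Name'].loc[1:5]

# Filtering with boolean masks is covered under Data Exploration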

1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
Name: Name, dtype: object
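
The slice above returns labels 1 through 5 inclusive. The sketch below contrasts that with `iloc`, whose end position is excluded, and adds a simple boolean filter. The `names` Series is a shortened stand-in built by hand, not the real `Name` column.

```python
import pandas as pd

# Hypothetical stand-in for a few surnames from the Name column
names = pd.Series(['Braund', 'Cumings', 'Heikkinen', 'Futrelle', 'Allen', 'Moran'])

by_label = names.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = names.iloc[1:3]  # positions 1 and 2 (end position excluded)

# Boolean filtering: keep only the entries where the mask is True
long_names = names[names.str.len() > 6]

print(len(by_label), len(by_position))
print(long_names.tolist())
```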

In [0]:
# Example: loc vs iloc
new_index = [i*2 for i in range(10)]
new_data = [i**2 for i in range(10)]

new_se = pd.Series(data = new_data, index = new_index)

print(new_se)

# label-based indexing
print(new_se.loc[4])

# integer-based indexing
print(new_se.iloc[4])

0      0
2      1
4      4
6      9
8     16
10    25
12    36
14    49
16    64
18    81
dtype: int64
4
16


In [0]:
# Get the index
df['Name'].index

RangeIndex(start=0, stop=891, step=1)

### Basic Operations in Pandas

In [0]:
# Print a summary of the data
df.info()

# Print summary statistics of the data
df.describe()

# Count the number of rows
len(df)

# Show 5 random rows
df.sample(5)

# Show first 10 rows
df.head(10)

# Show last 15 rows
df.tail(15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
877,878,0,3,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q


### Data Exploration

In [0]:
# What is the average age of passengers on the Titanic?
df['Age'].mean()

29.69911764705882

In [0]:
# How many males and females are there on the Titanic?
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [0]:
# Who is the oldest passenger who did not survive?
max_age = df[df['Survived'] == 0]['Age'].max()

# Boolean filter: keep rows where both conditions hold
df[(df['Age'] == max_age) & (df['Survived'] == 0)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S


In [0]:
# Find passengers whose name contains 'Rose'
df[df['Name'].str.contains('Rose')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
855,856,1,3,"Aks, Mrs. Sam (Leah Rosen)",female,18.0,0,1,392091,9.35,,S


In [0]:
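# What are the different titles of passengers on the Titanic? How many passengers hold each title?

name = 'Braund, Mr. Owen Harris'

def get_title(name):
  # Problem: taking the second whitespace-separated token only works for
  # one-word surnames; multi-word surnames yield tokens such as 'y' or 'der'
  return name.split()[1]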

titles = df['Name'].apply(get_title)

# Unique titles
titles.unique()

# Number of unique titles
titles.nunique()

# Frequency of each title
titles.value_counts()

Mr.             502
Miss.           179
Mrs.            121
Master.          40
Dr.               7
Rev.              6
y                 4
Planke,           3
Impe,             3
Mlle.             2
Major.            2
Col.              2
Gordon,           2
der               1
Jonkheer.         1
Capt.             1
Walle,            1
Pelsmaeker,       1
Velde,            1
the               1
Don.              1
Messemaeker,      1
Carlo,            1
Mulder,           1
Cruyssen,         1
Ms.               1
Billiard,         1
Mme.              1
Steen,            1
Melkebeke,        1
Shawah,           1
Name: Name, dtype: int64
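
Entries such as `y`, `der`, and `Planke,` in the counts above come from surnames that contain spaces. One possible fix, sketched below, splits on the comma and the period instead of on whitespace. It assumes every name follows the `Surname, Title. Given names` layout; the function name `get_title_by_comma` and the last sample name are hypothetical.

```python
def get_title_by_comma(name):
    # Assumes the layout 'Surname, Title. Given names'
    after_comma = name.split(',')[1]
    # The title is everything before the first period
    return after_comma.split('.')[0].strip()

samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Montvila, Rev. Juozas',
    'De Example, Mrs. Jane',  # made-up multi-word surname
]

print([get_title_by_comma(n) for n in samples])
```

Applied to the dataset, this would be `df['Name'].apply(get_title_by_comma)`. Note that the returned titles no longer carry the trailing period.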

In [0]:
# What is the average age of passengers in each passenger class?
# Select Age before aggregating so non-numeric columns are not averaged
df.groupby('Pclass')[['Age']].mean().round(2)

Pclass,Age
1,38.23
2,29.88
3,25.14


In [0]:
# Who are the three passengers who paid the highest ticket fares?
df.sort_values(by = 'Fare', ascending = False).head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
737,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
679,680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C


### Challenge

Predict whether a passenger will survive based on the given features in the Titanic Dataset.

__Twist: You can ONLY use basic Python and Pandas to complete this challenge__

__Instructions__
- Run the code from the beginning of the notebook.
- Your task is to write a function that tells whether __a person__ survived the Titanic incident based on the given information.
- You are not allowed to use the column `Survived` in the dataset.
- The function takes a single argument, which represents __a row__ of the input data, and __returns 1 if that person survives and 0 if that person does not__ (refer to the template below).
- You can test the accuracy of your "model" using the function `check_accuracy()` below. It takes the name of your function as its argument.
- __Do not reveal the accuracy of your "model" to your classmates!__
- Once you are confident with your model, change the name of the function to your name (e.g. minhanh()) and send me the code in a DM on Discord. My Discord name: Nguyễn Minh Anh#4144
- The person with the highest accuracy gets 2 points. Second and third place get 1 point each. In case of a tie, the earlier submission wins.




In [0]:
df.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
427,428,1,2,"Phillips, Miss. Kate Florence (""Mrs Kate Louis...",female,19.0,0,0,250655,26.0,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
264,265,0,3,"Henry, Miss. Delia",female,,0,0,382649,7.75,,Q
444,445,1,3,"Johannesen-Bratthammer, Mr. Bernt",male,,0,0,65306,8.1125,,S
202,203,0,3,"Johanson, Mr. Jakob Alfred",male,34.0,0,0,3101264,6.4958,,S
743,744,0,3,"McNamee, Mr. Neal",male,24.0,1,0,376566,16.1,,S
147,148,0,3,"Ford, Miss. Robina Maggie ""Ruby""",female,9.0,2,2,W./C. 6608,34.375,,S
774,775,1,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,54.0,1,3,29105,23.0,,S
329,330,1,1,"Hippach, Miss. Jean Gertrude",female,16.0,0,1,111361,57.9792,B18,C
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [0]:
# Template
# Example: Return 0 for male passengers in Pclass 3 and 1 for everyone else

def minhanh(passenger):
  # Your code here
  
  if passenger['Pclass'] == 3:
    if passenger['Sex'] == 'female':
      result = 1
    else:
      result = 0
  else:
    result = 1
  
  return result

In [0]:
# Check Accuracy (%) Function
def check_accuracy(func):
  
  # Empty prediction list
  prediction_list = []
  
  # Generate predictions from input func
  for i, row in df.iterrows():
    prediction_list.append(func(row))
  
  # Create accuracy list
  accuracy = prediction_list.copy()
  
  # Actual result list
  actual_results = df['Survived'].values.tolist()

  # Compare prediction to actual result
  for i, value in enumerate(actual_results):
    if prediction_list[i] == value:
      accuracy[i] = 1
    else:
      accuracy[i] = 0

  score = sum(accuracy) / len(accuracy) * 100

  return score

In [0]:
# Check the accuracy of template function
check_accuracy(minhanh)

66.77890011223344
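
The row-by-row loop in `check_accuracy()` can also be written with `DataFrame.apply`. The sketch below is an alternative under stated assumptions, not a replacement for the function above: `toy` is a hypothetical four-row stand-in for the Titanic data, and `predict_female` and `accuracy_with_apply` are illustrative names.

```python
import pandas as pd

# Hypothetical four-row stand-in for the Titanic DataFrame
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0],
})

def predict_female(passenger):
    # Same contract as the template: one row in, 1 or 0 out
    return 1 if passenger['Sex'] == 'female' else 0

def accuracy_with_apply(frame, func):
    # Apply func to each row, then compare predictions with the true labels
    predictions = frame.apply(func, axis=1)
    return (predictions == frame['Survived']).mean() * 100

print(accuracy_with_apply(toy, predict_female))
```

Two of the four toy predictions match the labels, so the score is 50 percent.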

In [0]:
# Nguyen
def nguyen(passenger):
    # Your code here
    if passenger['Age'] >= 15:
        if passenger['Embarked'] == 'C':
            result = 1
        else:
            result = 0
    else:
        result = 1
    return result
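
Rules that compare `Age`, like the one above, behave in a way that is easy to miss: `df.info()` reports only 714 non-null ages out of 891, and any comparison against a missing value (`NaN`) evaluates to `False`. The sketch below uses a made-up passenger row to show the effect and one way to test for a missing age explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger row with a missing age
passenger = pd.Series({'Sex': 'male', 'Age': np.nan})

# Both comparisons are False when the age is missing
print(passenger['Age'] >= 15)
print(passenger['Age'] < 15)

# Check for the missing value explicitly before comparing
print(pd.isna(passenger['Age']))
```

In the function above, a passenger with no recorded age therefore falls into the `else` branch and is predicted to survive.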

In [0]:
# Thi
def thinguyen(passenger):  
    if passenger['Embarked'] != "C":
      if passenger['Sex'] == 'female' :
        result = 1
      else:
        result = 0
    else:
      result = 1
  
    return result

In [0]:
# Thomas
def thomas(passenger):
  # Your code here
  
  
  if passenger['Sex'] == 'female':
    result = 1
  else:
    result = 0
  
  return result 

In [0]:
# KTA
def kta(passenger):
  # Your code here
  
  if passenger['Sex'] == 'female' or passenger['Age'] <= 6:
    return 1
  else:
    result = 0
  
  return result

In [0]:
# An
def an_dang(passenger):
  # Your code here

  if passenger['Age'] <= 12:
    result = 1
  elif passenger['Sex'] == 'female' and passenger['Parch'] <= 3:
    result = 1
  else:
    result = 0
  return result

In [0]:
# Tin
def tinngo(passenger):
  # Your code here
  result = 0
  # if passenger['Pclass'] == 1:
  if (passenger['Age'] > df['Age'].mean()):
    if passenger['Sex'] == 'male':
        result = 0
    else:
      if passenger['Parch'] > 1:
         result = 0
      else:
         result = 1
  else:
    if passenger['Sex'] == 'female':
        result = 1
    else:
        result = 0
 

  return result

In [0]:
# Pham Tuan Anh
def pham_tuan_anh(passenger):
  # Your code here
  
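  # Note: Pclass is stored as an integer (int64), so comparing it with the
  # string '3' is always True and the final else branch is never reached
  if passenger['Pclass'] != '3' :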
    if passenger['Sex'] == 'female':
        result = 1
    elif passenger['Embarked'] == 'Q':
        result = 0
    else:
        result = 0
  else:
    result = 1

  return result

In [0]:
# Tobi
def tobi(passenger):
  percentage_pclass = []
  for i in set(df['Pclass']):
    df_pclass = df[df['Pclass']==i]
    percentage_pclass.append(round(sum(df_pclass['Survived']==1)/sum(df['Survived']==1)+1, 2))

  percentage_sex = []
  for i in set(df['Sex']):
    df_sex = df[df['Sex']==i]
    percentage_sex.append(round(sum(df_sex['Survived']==1)/sum(df['Survived']==1)+1, 2))

  percentage_sib = []
  for i in set(df['SibSp']):
    df_sib = df[df['SibSp']==i]
    percentage_sib.append(round(sum(df_sib['Survived']==1)/sum(df['Survived']==1)+1, 2))
  percentage_sib.append(0)
  percentage_sib.append(0)

  percentage_parch = []
  for i in set(df['Parch']):
    df_parch = df[df['Parch']==i]
    percentage_parch.append(round(sum(df_parch['Survived']==1)/sum(df['Survived']==1)+1, 2))

  p = passenger['Pclass']-1
  if passenger['Sex'] == 'female':
    s = 0
  else:
    s = 1
  sib = passenger['SibSp']
  parch = passenger['Parch']

  try:
    if (percentage_pclass[p]*percentage_sex[s]*percentage_sib[sib]*percentage_parch[parch]-1)/4 > 1:
      return 1
    else:
      return 0
  except Exception as err:
    print(err)
    print(p)
    print(s)
    print(sib)
    print(parch)

In [0]:
# Huynh
def huynh(passenger):
  if passenger['Sex'] == 'female':
    result = 1
  else:
    result = 0
  if '34' in passenger['Ticket']:
    result = 0

  return result

In [0]:
# Chow
def chow(passenger):
  # Your code here

  if passenger['Pclass'] == 3 or passenger['Pclass'] == 1:
    if passenger['Sex'] == 'female':
      result = 1
    elif passenger['Fare'] == 13.0000 or passenger['Fare'] == 26.0000 or passenger['Fare'] == 7.7500:
      result = 1
    elif passenger['Cabin'] == 'B96 B98':
      result = 1
    elif passenger['Parch'] == 0 and passenger['Age'] <= 28.343689655172415:
      result = 1
    elif passenger['Embarked'] == 'S':
      result = 1
    else:
      result = 0
  else:
    result = 1
  return result

In [0]:
# Kha
def Kha(passenger):
  if passenger['Fare'] > 20000 or passenger['Sex'] == 'female'  : 
      result = 1
  else:
      result = 0
  
  return result

In [0]:
# Tran
def trantran(passenger):
  # Your code here

  if passenger['Pclass'] == 3:
    if passenger['Sex'] == 'female' and passenger["Embarked"] =="S":
      result = 1
    else:
      result = 0
  else:
    result = 1

  return result

In [0]:
# An vu
def vuquocan(passenger):
  # Your code here
  basechance = 549/891
  chance = basechance
  if passenger['Sex'] == 'female':
    chance += 0.25
  if passenger['Sex'] == 'male':
    chance -= 0.25

  if passenger["Pclass"] == 3:
    chance -= 0.14
  if passenger["Pclass"] == 2:
    chance -= 0.05
  if passenger["Pclass"] == 1:
    chance += 0.05
  if passenger["Age"] < 15:
    chance += 0.1

  if chance > 0.5:
    return 1
  else:
    return 0

In [0]:
# Phuc
def phuctran(passenger):
  # Your code here
  result = 0
  if passenger['Pclass'] == 3:
    if passenger['Sex'] == 'female':
      result = 1
    else:
      result = 0
  else:
    result = 1

  if passenger['Pclass'] == 1:
    if passenger['Sex'] == 'female':
      result = 1
    else:
      result = 0
  
  age = passenger['Age']
  if passenger['Parch'] >= 1 and age < 10:
    return 0

  return result

In [0]:
# Nguyen
def nguyen(passenger):
    if passenger['Pclass'] > 1:
        result = 0
    else:
        result = 1
    return result

In [0]:
# Dung
def dung(passenger):
  # Your code here
  result = 0
  if passenger['Sex'] == 'female':
    result = 1
  if passenger['Fare'] > 270:
    result = 1
  if passenger['Sex'] == 'male':
      if passenger['Ticket'] == 'STON/O 2. 3101286':
        result = 1
      if passenger['Ticket'] == 'PC 17572':
        result = 1
  return result

In [0]:
# Nguyet Do
def nguyetdo(passenger):
  # Default prediction in case Sex is missing or unexpected
  result = 0

  if passenger['Sex'] == 'female':
    if passenger['Pclass'] == 3 and passenger['Age'] > 40 and passenger['Age'] < 60:
      result = 0
    elif passenger['Pclass'] == 1 and passenger['Age'] < 10:
      result = 0
    else:
      result = 1

  if passenger['Sex'] == 'male':
    if passenger['Pclass'] == 2 and passenger['Age'] < 10:
      result = 1
    elif passenger['Pclass'] == 1 and passenger['Age'] < 40:
      result = 1
    else:
      result = 0

  # Return our prediction
  return result

In [0]:
#@title Evaluate the prediction accuracy of participants

# Initialize empty lists
name_list = []
score_list = []
message_list = []

# Initialize input value
raw_input = ''

# Keep running until inputting 'stop'
while raw_input != 'stop':
  
  print("+ Input name and function here:")
  
  # Input
  raw_input = input()
  
  try:
    
    # Split the name
    name = raw_input.split(',')[0]
    
    # Check if name existed
    if name in name_list:
      print("ERROR: Name already taken. Please Try again.")
      pass
    
    else:
      
      # Score the function first so a failed lookup does not reserve the name,
      # then add name, score & message to their respective lists
      input_data = eval("check_accuracy({})".format(raw_input.split(',')[1]))
      name_list.append(name)
      score_list.append(input_data)

      message = (name + " - Accuracy Score: " + str(input_data) + "%") 
      message_list.append(message)
    
  except Exception:
    # Skip invalid input (e.g. 'stop' or an unknown function name)
    pass
  
# Print final results with winner
print('--- RESULTS ---')
for i, value in enumerate(score_list):
  
  if value == max(score_list):
    print(message_list[i], "- WINNER!!")
    
  else:
    print(message_list[i])

+ Input name and function here:
thi,thinguyen
+ Input name and function here:
thomas,thomas
+ Input name and function here:
kta,kta
+ Input name and function here:
an_dang,an_dang
+ Input name and function here:
tin,tinngo
+ Input name and function here:
pta,pham_tuan_anh
+ Input name and function here:
tobi,tobi
+ Input name and function here:
Huynh,huynh
+ Input name and function here:
chow,chow
+ Input name and function here:
kha,kha
+ Input name and function here:
tran,trantran
+ Input name and function here:
Anvu,vuquocan
+ Input name and function here:
phuc,phuctran
+ Input name and function here:
nguyen,nguyen
+ Input name and function here:
dung,dung
+ Input name and function here:
nguyetdo,nguyetdo
+ Input name and function here:
stop
--- RESULTS ---
thi - Accuracy Score: 74.5230078563412%
thomas - Accuracy Score: 78.67564534231201%
kta - Accuracy Score: 79.57351290684625%
an_dang - Accuracy Score: 79.7979797979798% - WINNER!!
tin - Accuracy Score: 78.56341189674522%
pta - Acc

In [0]:
#@title Evaluate the prediction accuracy of participants

# Initialize empty lists
name_list = []
score_list = []
message_list = []

# Initialize input value
raw_input = ''

# Keep running until inputting 'stop'
while raw_input != 'stop':
  
  print("+ Input name and function here:")
  
  # Input
  raw_input = input()
  
  try:
    
    # Split the name
    name = raw_input.split(',')[0]
    
    # Check if name existed
    if name in name_list:
      print("ERROR: Name already taken. Please Try again.")
      pass
    
    else:
      
      # Score the function first so a failed lookup does not reserve the name,
      # then add name, score & message to their respective lists
      input_data = eval("check_accuracy({})".format(raw_input.split(',')[1]))
      name_list.append(name)
      score_list.append(input_data)

      message = (name + " - Accuracy Score: " + str(input_data) + "%") 
      message_list.append(message)
    
  except Exception:
    # Skip invalid input (e.g. 'stop' or an unknown function name)
    pass
  
# Print final results with winner
print('--- RESULTS ---')
for i, value in enumerate(score_list):
  
  if value == max(score_list):
    print(message_list[i], "- WINNER!!")
    
  else:
    print(message_list[i])

+ Input name and function here:
kha,Kha
+ Input name and function here:
stop
--- RESULTS ---
kha - Accuracy Score: 78.67564534231201% - WINNER!!
