# Celebrity Deaths in 2016

Source: [Wikipedia - Deaths in 2016](https://en.wikipedia.org/wiki/Deaths_in_2016)

#### Structure of dataset:
- File: "celebrity_deaths_2016.xlsx"
- Contains 2 sheets:
 - "celeb_death": contains records of deaths of famous humans and non-humans
 - "cause_of_death": contains the causes of the deaths (you'll need to merge it with the "celeb_death" sheet)

#### Other information about the dataset:
- The cause of death was not reported for all individuals
- The dataset might include deaths that took place in other years (you'll need to ignore these records)
- The dataset might contain duplicate records (you'll need to remove them)

#### The goals of the exercise:
- Load, merge, and clean the data
- Explore the data and answer some simple questions
- Run some basic analysis
- Visualize your results

In [None]:
"""
We're providing most of the import statements you need for the entire exercise
"""

import pandas as pd
import matplotlib.pyplot as plt 

%matplotlib inline

### Load, merge, and clean the data

In [None]:
"""
Load the "celebrity_deaths_2016.xlsx" data file and print the sheet names
"""

xl = pd.ExcelFile('celebrity_deaths_2016.xlsx')
print(xl.sheet_names)

In [None]:
"""
Read the "celeb_death" sheet into a dataframe named "df"
Take a look at the top 5 rows
"""

df = xl.parse("celeb_death")
df.head()

In [None]:
"""
Take a look at the data types stored in each column
"""

df.dtypes

In [None]:
"""
Drop the duplicates (based on all columns) from df
"""

df = df.drop_duplicates()
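
A minimal sketch, on invented toy data rather than the real sheet, of why the result of `drop_duplicates()` has to be kept: the method returns a new DataFrame and leaves the original unchanged. The names `toy` and `deduped` are hypothetical.

```python
import pandas as pd

# Toy data made up for illustration; the first two rows are identical
toy = pd.DataFrame({"name": ["A", "A", "B"], "age": [50, 50, 60]})

# drop_duplicates() returns a new DataFrame; `toy` itself is not modified
deduped = toy.drop_duplicates()

n_before = len(toy)
n_after = len(deduped)
print(n_before, n_after)
```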

In [None]:
"""
Look at just the names
"""

df['name']

In [None]:
"""
Look at the names and ages
"""

df[['name', 'age']]

In [None]:
"""
When did Yogi Berra die?
"""

df[df['name'] == 'Yogi Berra']

In [None]:
"""
Which celebrities died after Yogi Berra?
"""

# Use a strict comparison so deaths on the same day as Yogi Berra's are excluded
df[df['date of death'] > '2015-09-22']

In [None]:
"""
Read the "cause_of_death" sheet into a dataframe named "cause_of_death"
Take a look at the top 5 rows
"""

cause_of_death = xl.parse("cause_of_death")
cause_of_death.head()

In [None]:
"""
Drop the duplicates (based on the "cause_id" column) from the cause_of_death DataFrame

Use the "subset" argument to specify the "cause_id" column
"""

cause_of_death = cause_of_death.drop_duplicates(subset="cause_id")

In [None]:
"""
Merge the cause_of_death DataFrame with the df DataFrame

Note: There are records in df (left DataFrame) that do not have a matching record in cause_of_death (right DataFrame)
We want to see all records in df (left DataFrame) despite the missing matches in cause_of_death.
Thus, you want to use a "left join".
"""

df = pd.merge(left=df, right=cause_of_death, how='left', on='cause_id')
df.head(100)
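
The left join described above can be sketched on toy data. The frames `people` and `causes` and their values are invented for illustration; only the `cause_id` and `cause of death` column names come from the notebook. Passing `indicator=True` adds a `_merge` column that shows which rows found no match in the right DataFrame.

```python
import pandas as pd

# Hypothetical stand-ins for the two sheets
people = pd.DataFrame({"name": ["A", "B", "C"], "cause_id": [1, 2, 99]})
causes = pd.DataFrame({"cause_id": [1, 2],
                       "cause of death": ["cancer", "stroke"]})

# A left join keeps every row of `people`, matched or not
merged = pd.merge(people, causes, how="left", on="cause_id", indicator=True)

# Rows with no matching cause are flagged "left_only" and have NaN as the cause
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```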

### Answer some basic questions about the data

In [None]:
"""
We'll be doing some calculations with the age column, but it was loaded from the data file as dtype "object"
So first, we need to cast it to a numeric value

The errors='coerce' argument converts any value that cannot be parsed as a number to NaN instead of raising an error
"""

df['age'] = pd.to_numeric(df['age'], errors='coerce')
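
A small sketch of what `errors='coerce'` does, using made-up age values (the string "c. 90" is a hypothetical example of an unparseable entry). Unparseable and missing entries become NaN, and `mean()` skips them by default.

```python
import pandas as pd

# Invented values; "c. 90" cannot be parsed as a number
ages = pd.Series(["87", "c. 90", "101", None])

numeric = pd.to_numeric(ages, errors="coerce")

# Count the entries that ended up as NaN
n_missing = int(numeric.isna().sum())

# mean() ignores NaN, so only 87 and 101 contribute
mean_age = numeric.mean()
print(numeric.tolist(), n_missing, mean_age)
```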

In [None]:
"""
Look at the data types again
"""

df.dtypes

In [None]:
"""
Which celebrities died when they were older than 100?
"""

df[df['age'] > 100]

In [None]:
"""
What was the average age of death?
"""

df['age'].mean()

In [None]:
"""
Who died the youngest and what was the cause of death?
Hint: Get the min age and find the record that has that value
"""

minage = df['age'].min()
minage

In [None]:
record_minage = df['age'] == minage
df[record_minage]

In [None]:
"""
Who died the oldest and what was the cause of death?
Hint: Get the max age and find the record that has that value
"""

maxage = df['age'].max()
record_maxage = df['age'] == maxage
df[record_maxage]

In [None]:
"""
We'll be running some queries on the "bio" and "cause of death" columns, but they were loaded from the data file as dtype "object"
So first, we need to cast them to strings
"""

df['cause of death'] = df['cause of death'].astype(str)
df['bio'] = df['bio'].astype(str)

In [None]:
"""
What is the total number of deaths caused by cancer?
Hint: Check if the "cause of death" column contains the word "cancer" (any type of cancer)
"""

cancer = df["cause of death"].str.contains("cancer")
len(df[cancer])
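
Note that `str.contains` is case-sensitive by default, so "Cancer" would not be counted. The sketch below, on invented values, shows one possible variant that ignores case and treats missing values as non-matches. Whether the real data contains capitalised entries is an assumption that has not been checked here.

```python
import pandas as pd
import numpy as np

# Hypothetical causes, including a missing value and a capitalised entry
causes = pd.Series(["lung cancer", np.nan, "stroke", "Cancer"])

# case=False ignores capitalisation; na=False counts missing values as non-matches
has_cancer = causes.str.contains("cancer", case=False, na=False)

n_cancer = int(has_cancer.sum())
print(n_cancer)
```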

In [None]:
"""
How many American celebrities died?
Hint: Search the bio for "American"
"""

american = df["bio"].str.contains('American')
american.sum()

### Count the number of people who died in each month of 2016
1. Create new columns that show the month and year in which each person died
2. Group the entries by month of death

In [None]:
"""
Make a new column with the numeric month of death

This code maps a lambda function to pull out the numeric month from the date of death column
"""

df['month'] = df['date of death'].map(lambda x: x.month)
df.head()

In [None]:
"""
Make a new column with the year of death
This code maps a lambda function to pull out the year from the date of death column
"""

df['year']  = df['date of death'].map(lambda x: x.year)
df.head()
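
As an alternative to mapping a lambda, a datetime column exposes the same fields through the `.dt` accessor. A minimal sketch on invented dates (the Series `dates` is hypothetical):

```python
import pandas as pd

# Made-up dates standing in for the "date of death" column
dates = pd.Series(pd.to_datetime(["2015-09-22", "2016-01-10", "2016-12-25"]))

# .dt gives vectorised access to the datetime components
months = dates.dt.month
years = dates.dt.year
print(months.tolist(), years.tolist())
```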

In [None]:
"""
Only look at deaths that took place in 2016
"""

df_2016 = df[df['year'] == 2016]
df_2016.head()

In [None]:
"""
Using a pivot table, obtain a table that contains the number of people who died in each month
"""

df_per_month = pd.pivot_table(df_2016, index=['month'], values=['name'], aggfunc=[len])
df_per_month
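
The same monthly counts can also be obtained with `groupby` and `size()`. This is a sketch on toy data, not a replacement for the pivot table above; the frame `toy` is invented.

```python
import pandas as pd

# Hypothetical records: two deaths in month 1, one each in months 2 and 12
toy = pd.DataFrame({"name": ["A", "B", "C", "D"], "month": [1, 1, 2, 12]})

# size() counts the rows in each group
per_month = toy.groupby("month").size()
print(per_month)
```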

### Visualize the number of deaths per month as a bar chart

In [None]:
"""
Hint: The df_per_month DataFrame has a simple .plot() method you can use 
"""

df_per_month.plot(kind = 'bar', figsize=(12, 6), 
       fontsize=12, legend=False, title="Number of Deaths per Month")
plt.xlabel("Month")
plt.ylabel("Number of deaths")
plt.show()

### What was the mean age for each cause of death?

In [None]:
"""
Hint: group by 'cause of death', select the 'age' column, then get the mean
"""

# Selecting 'age' before aggregating avoids averaging non-numeric columns
df.groupby('cause of death')['age'].mean()

### What was one cause of death for celebrities who died at 50?

In [None]:
"""
Hint: import random and randomly select a cause of death for celebrities who died at 50
"""

#create a series for celebrities who died at 50
age50 = df["age"] == 50
age50

In [None]:
#create a series for celebrities where we know the cause of death
causeofdeathexists = df["cause of death"] != 'nan'
causeofdeathexists

In [None]:
#celebrities who died at 50 and where we know the cause of death
atage50 = df[age50 & causeofdeathexists]
atage50

In [None]:
import random

rand_int = random.randint(0, len(atage50) - 1)
rand_cause = atage50[rand_int:rand_int + 1]
rand_cause
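
pandas also has a built-in way to pick a random row, `DataFrame.sample`. A minimal sketch on invented data; `random_state` is set only to make the example repeatable.

```python
import pandas as pd

# Hypothetical subset standing in for `atage50`
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "cause of death": ["cancer", "stroke", "heart attack"],
})

# Draw one row at random
picked = toy.sample(n=1, random_state=0)
print(picked)
```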

### Make a histogram that plots the number of deaths per nationality
1. Create a new column that identifies the nationality of each celebrity, extracting the first word from the bio
2. Make a histogram that plots the number of deaths per nationality

In [None]:
"""
Get the nationality from the bio.
"""

nationality = df['bio'].str.split()
df['nationality'] = nationality.apply(lambda x: x[0])
df.head()
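
Taking the first word of the bio is only an approximation of nationality. The sketch below uses invented bios to show two edge cases: a multi-word nationality is truncated, and an empty bio yields a missing value when `.str[0]` is used instead of indexing inside a lambda.

```python
import pandas as pd

# Made-up bios, including a two-word nationality and an empty string
bios = pd.Series(["American singer", "New Zealand rugby player",
                  "", "British actor"])

# .str[0] takes the first word and returns NaN when there are no words
first_word = bios.str.split().str[0]

n_missing = int(first_word.isna().sum())
print(first_word.tolist())
```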

In [None]:
"""
Make a histogram that plots the number of deaths per nationality
Only include nationalities with more than 50 deaths 
"""

# Count only deaths from 2016 so the numbers match the chart title below
countries = df[df['year'] == 2016]['nationality'].value_counts()
countries

In [None]:
unlucky_countries = countries[countries > 50]
unlucky_countries

In [None]:
ax = unlucky_countries.plot(kind = "bar", figsize = (15, 5), 
                          title = "Nationality of celebrities who died in 2016", 
                          fontsize= 15)
ax.set_xlabel("Nationality")
ax.set_ylabel("Death Counts")
plt.show()