## Case Study 1: Suicide Rates
This case study uses the pandas package on a small dataset from Kaggle that covers country-level suicide rates from 1985 to 2016. You can [download it here](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016).

We read the CSV file and do some setup below.

In [27]:
import pandas as pd
import numpy as np
import os

data_path = 'master.csv'

# Read the CSV file and rename columns
df = pd.read_csv(filepath_or_buffer=data_path).rename(columns={
    'suicides/100k pop': 'suicides_per_100k',
    ' gdp_for_year ($) ': 'gdp_year',
    'gdp_per_capita ($)': 'gdp_capita',
    'country-year': 'country_year'
})

# Remove commas and convert 'gdp_year' column to int64
df = df.assign(gdp_year=lambda _df: _df['gdp_year'].str.replace(',', '').astype(np.int64))

# Convert the 'year' column to int32
df['year'] = df['year'].astype(np.int32)
display(df.dtypes)

country               object
year                   int32
sex                   object
age                   object
suicides_no            int64
population             int64
suicides_per_100k    float64
country_year          object
HDI for year         float64
gdp_year               int64
gdp_capita             int64
generation            object
dtype: object
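
The comma-stripping step above can be illustrated in isolation. This is a minimal sketch on a hypothetical toy Series (the name `raw` and its two values are illustrative stand-ins, not rows read from `master.csv`); it assumes the raw GDP column arrives as strings with thousands separators, as the setup code suggests.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values formatted like the raw GDP column (illustrative only)
raw = pd.Series(['2,156,624,900', '46,919,620'], name='gdp_year')

# Remove the thousands separators as a literal string, then cast to 64-bit integers
cleaned = raw.str.replace(',', '', regex=False).astype(np.int64)

print(cleaned.dtype)
print(cleaned.tolist())
```

Casting to `np.int64` rather than `np.int32` matters here: the first toy value is larger than the largest 32-bit signed integer (2,147,483,647), so it would not fit in an `int32` column.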

In [28]:
# We can see the column names below
df.columns

Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides_per_100k', 'country_year', 'HDI for year', 'gdp_year',
       'gdp_capita', 'generation'],
      dtype='object')

The `unique()`, `nunique()`, and `describe()` methods give a quick overview of the data.

Similarly, `head()` and `tail()` show the first and last rows, which gives a glimpse of the overall shape of the data.
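
As a quick illustration of `head()` and `tail()`, here is a minimal sketch on a hypothetical toy frame (the name `toy` and its values are placeholders, not taken from the Kaggle dataset). The same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'year': [1985, 1986, 1985, 1986, 1985, 1986, 1985, 1986],
})

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
print(toy.shape)          # (number of rows, number of columns)
```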

In [29]:
# unique() will return unique elements in a column
print("Distinct Gender", df['sex'].unique())
print("Distinct Generations:", df['generation'].unique())

# nunique() will count the number of unique elements
print('Number of Distinct Countries:', df['country'].nunique())

# describe() returns summary statistics for the numeric columns
df.describe()

Distinct Gender ['male' 'female']
Distinct Generations: ['Generation X' 'Silent' 'G.I. Generation' 'Boomers' 'Millenials'
 'Generation Z']
Number of Distinct Countries: 101


|  | year | suicides_no | population | suicides_per_100k | HDI for year | gdp_year | gdp_capita |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 27820.0 | 27820.0 | 27820.0 | 27820.0 | 8364.0 | 27820.0 | 27820.0 |
| mean | 2001.258375 | 242.574407 | 1844794.0 | 12.816097 | 0.776601 | 445581000000.0 | 16866.464414 |
| std | 8.469055 | 902.047917 | 3911779.0 | 18.961511 | 0.093367 | 1453610000000.0 | 18887.576472 |
| min | 1985.0 | 0.0 | 278.0 | 0.0 | 0.483 | 46919620.0 | 251.0 |
| 25% | 1995.0 | 3.0 | 97498.5 | 0.92 | 0.713 | 8985353000.0 | 3447.0 |
| 50% | 2002.0 | 25.0 | 430150.0 | 5.99 | 0.779 | 48114690000.0 | 9372.0 |
| 75% | 2008.0 | 131.0 | 1486143.0 | 16.62 | 0.855 | 260202400000.0 | 24874.0 |
| max | 2016.0 | 22338.0 | 43805210.0 | 224.97 | 0.944 | 18120710000000.0 | 126352.0 |
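
In the summary above, the `count` for `HDI for year` (8364) is well below the 27820 rows reported for the other columns, which suggests that this column has many missing values. A minimal sketch of how one might check this, using a hypothetical toy frame (the name `toy` and its values are illustrative, not real rows):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in one column (illustrative values only)
toy = pd.DataFrame({
    'suicides_no': [16, 14, 21, 9],
    'HDI for year': [np.nan, 0.713, np.nan, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# Number of non-missing values per column (what describe() reports as 'count')
non_null_counts = toy.count()

print(missing_per_column)
print(non_null_counts)
```

Running `df.isna().sum()` on the real data frame would show the same kind of per-column breakdown.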


Below are some examples of using indexing to select data from the data frame.

In [44]:
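# loc: label-based selection; selects rows labeled 2 through 5 (end label included)
df.loc[2:5]

# iloc: position-based selection
df.iloc[lambda x: x.index % 2 == 0]  # select rows whose index value is even
df.iloc[1:3, :6]  # select rows at positions 1 and 2 (end excluded) and the first 6 columns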

|  | country | year | sex | age | suicides_no | population |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 |
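
The difference between the two slicing conventions is easy to miss, so here is a minimal sketch on a hypothetical toy frame (the name `toy` and its `value` column are placeholders). It assumes a default integer index, as in `df`.

```python
import pandas as pd

# Hypothetical toy frame with a default integer index 0..5
toy = pd.DataFrame({'value': [10, 11, 12, 13, 14, 15]})

# loc slices by label and includes the end label
by_label = toy.loc[2:5]

# iloc slices by position and excludes the end position
by_position = toy.iloc[1:3]

# iloc also accepts a callable that returns a boolean mask
even_rows = toy.iloc[lambda x: x.index % 2 == 0]

print(list(by_label.index))
print(list(by_position.index))
print(list(even_rows.index))
```

With a default integer index, labels and positions coincide, so the only visible difference is whether the end of the slice is included. After filtering or sorting, labels and positions can diverge, and `loc` and `iloc` may then return different rows for the same numbers.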


Below are some examples of method chaining to analyze the dataset.