# Project: Assessing, Cleaning, and Tidying Medical Appointment No-Show Data

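## 1. Analysis Goal

The goal of this hands-on project is to practice assessing data for structural and content issues, and then to clean and tidy the medical appointment no-show data based on that assessment. The result is clean, tidy data that can be used in the next step to explore how missing an appointment relates to the other features.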

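## 2. Introduction

This dataset contains 110,527 medical appointments and 14 associated variables (features). It can be used to explore the relationship between whether a patient misses an appointment and the other features.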

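The meaning of each column is as follows:
- `PatientId`: patient identifier
- `AppointmentID`: appointment identifier
- `Gender`: patient gender
- `ScheduledDay`: the date on which the patient called or registered to make the appointment
- `AppointmentDay`: the date on which the patient is due at the clinic
- `Age`: patient age
- `Neighbourhood`: the patient's neighbourhood
- `Scholarship`: whether the patient has a scholarship (enrollment in the Bolsa Família welfare program)
- `Hipertension`: whether the patient has hypertension
- `Diabetes`: whether the patient has diabetes
- `Alcoholism`: whether the patient suffers from alcoholism
- `Handcap`: whether the patient has a disability
- `SMS_received`: whether the patient received one or more text messages from the clinic
- `No-show`: whether the patient missed the appointment (`Yes` means the patient did not show up)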

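## 3. Reading the Data

Import the libraries needed for the analysis, then use Pandas' `read_csv` function to parse the contents of the raw file `appointments.csv` into a DataFrame and assign it to the variable `original_data`.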

In [1]:
import pandas as pd
import numpy as np

In [2]:
original_data = pd.read_csv("appointments.csv")
original_data.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


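## 4. Assessing and Cleaning the Data

In this part we assess and clean the data held in the `original_data` DataFrame created in the previous part.

The assessment covers two aspects: structure and content, that is, tidiness and cleanliness.

Structural issues are violations of the three tidy-data criteria: each variable forms a column, each observation forms a row, and each type of observational unit forms a table. Content issues include missing data, duplicate data, invalid data, and so on.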
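To make the tidy-data criteria concrete, here is a minimal sketch of a structural problem and one way to fix it. The table and its column names are hypothetical and are not part of the appointments dataset.

```python
import pandas as pd

# Hypothetical untidy table: one column per year, so the "year" variable
# is spread across column headers instead of being stored in its own column.
untidy = pd.DataFrame({
    "clinic": ["A", "B"],
    "2015": [120, 80],
    "2016": [150, 95],
})

# melt turns the year columns into rows: one variable per column,
# one observation (clinic, year) per row.
tidy = untidy.melt(id_vars="clinic", var_name="year", value_name="appointments")
print(tidy)
```

The appointments data already has one row per appointment, so no reshaping of this kind is needed in this project.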

To keep the cleaned data separate from the raw data, we create a new variable, `cleaned_data`, as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.

In [3]:
cleaned_data = original_data.copy()

### 4.1 Tidiness

In [4]:
cleaned_data.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


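Judging from the first five rows, the data satisfy "each variable is a column, each observation is a row, each cell holds a single value". Specifically, each row is one appointment record, so there are no structural issues.

### 4.2 Cleanliness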

In [5]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [6]:
cleaned_data['No-show'].value_counts()

No-show
No     88208
Yes    22319
Name: count, dtype: int64

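The output shows that `cleaned_data` has 110,527 observations and no missing values.

`PatientId` and `AppointmentID` are identifiers and should be strings, so they are converted to object.
`ScheduledDay` and `AppointmentDay` should be datetimes. `Gender` should be a category, and `No-show`, which takes only the values "Yes" and "No", is also a categorical variable and is converted to category.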

In [7]:
cleaned_data['PatientId'] = cleaned_data['PatientId'].astype("str")
cleaned_data['AppointmentID'] = cleaned_data['AppointmentID'].astype("str")
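Because `PatientId` is read as float64, converting it directly to a string keeps the float formatting, which is why the later outputs show IDs ending in `.0`. The following is a minimal sketch of one way to avoid that, using toy values; it assumes every ID is a whole number, which should be verified on the real column before applying it.

```python
import pandas as pd

# Toy stand-in for the PatientId column (float64, as in the info() output).
ids = pd.Series([29872499824296.0, 558997776694438.0, 4262962299951.0])

# Plain astype(str) keeps the float formatting, including the trailing ".0".
as_str = ids.astype(str)

# Casting to int64 first drops the ".0"; this is only safe when every
# value is a whole number, so check that before converting.
assert (ids % 1 == 0).all()
as_clean_str = ids.astype("int64").astype(str)

print(as_str.tolist())
print(as_clean_str.tolist())
```

If some IDs turn out to have a fractional part, the integer cast would silently change them, so in that case the float-formatted strings are the safer choice.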

In [8]:
cleaned_data['ScheduledDay'] = pd.to_datetime(cleaned_data['ScheduledDay'])
cleaned_data['AppointmentDay'] = pd.to_datetime(cleaned_data['AppointmentDay'])

In [9]:
cleaned_data['Gender'] = cleaned_data['Gender'].astype("category")
cleaned_data['No-show'] = cleaned_data['No-show'].astype("category")

In [10]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   PatientId       110527 non-null  object             
 1   AppointmentID   110527 non-null  object             
 2   Gender          110527 non-null  category           
 3   ScheduledDay    110527 non-null  datetime64[ns, UTC]
 4   AppointmentDay  110527 non-null  datetime64[ns, UTC]
 5   Age             110527 non-null  int64              
 6   Neighbourhood   110527 non-null  object             
 7   Scholarship     110527 non-null  int64              
 8   Hipertension    110527 non-null  int64              
 9   Diabetes        110527 non-null  int64              
 10  Alcoholism      110527 non-null  int64              
 11  Handcap         110527 non-null  int64              
 12  SMS_received    110527 non-null  int64              
 13  No-show       

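#### 4.2.1 Handling Missing Data

There are no missing values, so we move on to the next step.

#### 4.2.2 Handling Duplicate Data

Given the meaning and content of the variables, `cleaned_data` should not contain observations whose values are identical across every variable, so we check for duplicate rows.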

In [11]:
cleaned_data.duplicated().sum()

0

The result is 0, so there are no duplicate rows.
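A full-row check only catches rows that match in every column. As a complementary sketch on toy data, `duplicated` can also be restricted to a key column; treating `AppointmentID` as a unique key is an assumption made here for illustration.

```python
import pandas as pd

# Toy data: rows 0 and 2 share an AppointmentID but differ in another column,
# so a full-row duplicate check does not flag them.
toy = pd.DataFrame({
    "PatientId": ["1", "2", "1"],
    "AppointmentID": ["100", "101", "100"],
    "Age": [30, 45, 31],
})

full_row_dupes = toy.duplicated().sum()
id_dupes = toy.duplicated(subset="AppointmentID").sum()
print(full_row_dupes, id_dupes)
```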

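#### 4.2.3 Handling Inconsistent Data

In `cleaned_data`, inconsistencies are most likely in the `Neighbourhood` variable, so we check whether several different values refer to the same neighbourhood.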

In [12]:
with pd.option_context('display.max_rows', None):
    print(cleaned_data['Neighbourhood'].value_counts())

Neighbourhood
JARDIM CAMBURI                 7717
MARIA ORTIZ                    5805
RESISTÊNCIA                    4431
JARDIM DA PENHA                3877
ITARARÉ                        3514
CENTRO                         3334
TABUAZEIRO                     3132
SANTA MARTHA                   3131
JESUS DE NAZARETH              2853
BONFIM                         2773
SANTO ANTÔNIO                  2746
SANTO ANDRÉ                    2571
CARATOÍRA                      2565
JABOUR                         2509
SÃO PEDRO                      2448
ILHA DO PRÍNCIPE               2266
NOVA PALESTINA                 2264
ANDORINHAS                     2262
DA PENHA                       2217
ROMÃO                          2215
GURIGICA                       2018
SÃO JOSÉ                       1977
BELA VISTA                     1907
MARUÍPE                        1902
FORTE SÃO JOÃO                 1889
ILHA DE SANTA MARIA            1885
SÃO CRISTÓVÃO                  1836
REDENÇÃO      

The output above shows no inconsistent data: distinct values refer to distinct neighbourhoods.
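Reading a long list by eye can miss variants that differ only in case, accents, or spacing. The sketch below shows one possible programmatic check. The helper `normalize_name` and the misspelled variants are hypothetical, added only for illustration.

```python
import unicodedata
import pandas as pd

def normalize_name(name: str) -> str:
    """Strip accents, collapse whitespace, and upper-case a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split()).upper()

# Toy values; the last two are hypothetical variants added for illustration.
names = pd.Series(["SÃO PEDRO", "SANTO ANDRÉ", "Sao  Pedro", "SANTO ANDRE "])

normalized = names.map(normalize_name)
# Count how many distinct raw spellings collapse into each normalized key.
variants = names.groupby(normalized).nunique()
print(variants[variants > 1])
```

Any key with more than one raw spelling is a candidate inconsistency worth inspecting by hand.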

#### 4.2.4 Handling Invalid or Erroneous Data

The DataFrame `describe` method gives a quick overview of the summary statistics for the numeric columns.

In [13]:
cleaned_data.describe()

Unnamed: 0,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,115.0,1.0,1.0,1.0,1.0,4.0,1.0


The output shows that the minimum of `Age` is -1, which is not realistic. We filter on that value to inspect the affected rows.

In [14]:
cleaned_data.query('Age == -1')

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
99832,465943158731293.0,5775010,F,2016-06-06 08:58:13+00:00,2016-06-06 00:00:00+00:00,-1,ROMÃO,0,0,0,0,0,0,No


This record has no real-world meaning and was probably entered incorrectly, so it should be removed.

In [15]:
cleaned_data = cleaned_data[cleaned_data['Age'] != -1]
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 110526 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   PatientId       110526 non-null  object             
 1   AppointmentID   110526 non-null  object             
 2   Gender          110526 non-null  category           
 3   ScheduledDay    110526 non-null  datetime64[ns, UTC]
 4   AppointmentDay  110526 non-null  datetime64[ns, UTC]
 5   Age             110526 non-null  int64              
 6   Neighbourhood   110526 non-null  object             
 7   Scholarship     110526 non-null  int64              
 8   Hipertension    110526 non-null  int64              
 9   Diabetes        110526 non-null  int64              
 10  Alcoholism      110526 non-null  int64              
 11  Handcap         110526 non-null  int64              
 12  SMS_received    110526 non-null  int64              
 13  No-show         110

Reset the index.

In [16]:
cleaned_data = cleaned_data.reset_index(drop=True)
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110526 entries, 0 to 110525
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   PatientId       110526 non-null  object             
 1   AppointmentID   110526 non-null  object             
 2   Gender          110526 non-null  category           
 3   ScheduledDay    110526 non-null  datetime64[ns, UTC]
 4   AppointmentDay  110526 non-null  datetime64[ns, UTC]
 5   Age             110526 non-null  int64              
 6   Neighbourhood   110526 non-null  object             
 7   Scholarship     110526 non-null  int64              
 8   Hipertension    110526 non-null  int64              
 9   Diabetes        110526 non-null  int64              
 10  Alcoholism      110526 non-null  int64              
 11  Handcap         110526 non-null  int64              
 12  SMS_received    110526 non-null  int64              
 13  No-show       
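The `describe` output also shows a maximum of 4 for `Handcap`, even though the column is described as a yes/no indicator. How to treat such values is a judgment call that this project does not address. The sketch below, on toy data, only shows one way to surface values outside {0, 1} in indicator columns.

```python
import pandas as pd

# Toy frame mimicking three of the indicator columns.
toy = pd.DataFrame({
    "Diabetes": [0, 1, 0, 0],
    "Handcap": [0, 2, 4, 1],
    "SMS_received": [1, 0, 0, 1],
})

# For each column, collect the values that fall outside {0, 1}.
unexpected = {
    col: sorted(toy.loc[~toy[col].isin([0, 1]), col].unique().tolist())
    for col in toy.columns
}
print(unexpected)
```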

## 5. Tidying the Data

In [17]:
cleaned_data

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872499824296.0,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997776694438.0,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962299951.0,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951213174.0,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186448183.0,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110521,2572134369293.0,5651768,F,2016-05-03 09:15:35+00:00,2016-06-07 00:00:00+00:00,56,MARIA ORTIZ,0,0,0,0,0,1,No
110522,3596266328735.0,5650093,F,2016-05-03 07:27:33+00:00,2016-06-07 00:00:00+00:00,51,MARIA ORTIZ,0,0,0,0,0,1,No
110523,15576631729893.0,5630692,F,2016-04-27 16:03:52+00:00,2016-06-07 00:00:00+00:00,21,MARIA ORTIZ,0,0,0,0,0,1,No
110524,92134931435557.0,5630323,F,2016-04-27 15:09:23+00:00,2016-06-07 00:00:00+00:00,38,MARIA ORTIZ,0,0,0,0,0,1,No


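How the data are tidied depends closely on the direction of the analysis. The goal of this analysis is to explore the relationship between whether a patient misses an appointment and the other features.

### 5.1 Binning Age

Split the continuous variable `Age` at its quartiles (25%, 50%, 75%, 100%) into four age bands, labelled 1, 2, 3, and 4.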

In [20]:
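# qcut: q stands for quantile. When q is an integer n (as here), the data are split into n
# equal-frequency quantile bins; q can also be a list of floats giving the quantile cut points.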
cleaned_data['AgeBand'] = pd.qcut(cleaned_data['Age'], 4, labels=[1,2,3,4])
cleaned_data.sample(5)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AgeBand
32393,37415276297299.0,5661968,M,2016-05-05 07:57:59+00:00,2016-05-05 00:00:00+00:00,2,CONSOLAÇÃO,0,0,0,0,0,0,No,1
15384,1216192481994.0,5683872,F,2016-05-11 07:57:35+00:00,2016-05-11 00:00:00+00:00,9,JESUS DE NAZARETH,0,0,0,0,0,0,No,1
72244,48397545322299.0,5716153,F,2016-05-18 16:00:17+00:00,2016-05-19 00:00:00+00:00,63,UNIVERSITÁRIO,0,1,0,0,0,0,Yes,4
107958,74243331784977.0,5594961,M,2016-04-18 11:39:16+00:00,2016-06-01 00:00:00+00:00,26,RESISTÊNCIA,0,0,0,0,0,1,No,2
109686,728733838911.0,5767224,M,2016-06-02 16:54:18+00:00,2016-06-07 00:00:00+00:00,57,JARDIM DA PENHA,0,0,0,0,0,1,No,4
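The labels 1 to 4 hide the actual age ranges behind each band. As a small sketch on toy ages, `retbins=True` makes `qcut` return the bin edges as well, which helps when interpreting the bands later.

```python
import pandas as pd

# Toy ages; the real column has far more rows.
ages = pd.Series([0, 3, 8, 18, 26, 37, 42, 55, 63, 80, 99, 115])

# retbins=True also returns the bin edges chosen by qcut.
bands, edges = pd.qcut(ages, 4, labels=[1, 2, 3, 4], retbins=True)

print(edges)
print(bands.value_counts().sort_index())
```

Because `qcut` bins by quantile, each band holds roughly the same number of rows, while the width of each age range can differ.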


### 5.2 Encoding the Categorical Variables `Gender` and `No-show`

Convert the `Gender` labels M/F to 1/2.

In [21]:
cleaned_data['Gender_num'] = cleaned_data['Gender'].replace(['M', 'F'], [1, 2])
cleaned_data.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AgeBand,Gender_num
0,29872499824296.0,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No,4,2
1,558997776694438.0,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No,4,1
2,4262962299951.0,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No,4,2
3,867951213174.0,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,1,2
4,8841186448183.0,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No,4,2


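Convert the `No-show` labels No/Yes to 1/0, so that 1 means the patient showed up and 0 means the patient missed the appointment. The result is stored in a new column, `No-Show` (capital S).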

In [22]:
cleaned_data['No-Show'] = cleaned_data['No-show'].replace(['No', 'Yes'], [1, 0])
cleaned_data.sample(5)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show,AgeBand,Gender_num,No-Show
37259,82526363865717.0,5674309,M,2016-05-09 10:48:50+00:00,2016-05-09 00:00:00+00:00,6,FORTE SÃO JOÃO,0,0,0,0,0,0,No,1,1,1
101536,4743686897353.0,5668962,F,2016-05-06 10:03:25+00:00,2016-06-06 00:00:00+00:00,1,SANTA TEREZA,0,0,0,0,0,1,No,1,2,1
81672,65337697381767.0,5617691,F,2016-04-25 16:58:05+00:00,2016-05-24 00:00:00+00:00,34,BONFIM,0,0,0,0,0,1,Yes,2,2,0
49385,3231754686884.0,5586399,F,2016-04-15 07:35:54+00:00,2016-05-03 00:00:00+00:00,42,PONTAL DE CAMBURI,1,0,0,0,0,0,No,3,2,1
56271,335785633723459.0,5698678,F,2016-05-16 07:46:21+00:00,2016-05-31 00:00:00+00:00,32,DA PENHA,1,0,0,0,0,1,No,2,2,1
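Calling `replace` on a category column keeps the category dtype, which is likely why `Gender_num` and `No-Show` do not appear in the `describe` output further down. If a plain numeric column is wanted, one alternative is an explicit `map` followed by a cast. This is a sketch on toy data; the name `showed_up` is a hypothetical choice that states the meaning of 1 directly.

```python
import pandas as pd

# Toy categorical column standing in for `No-show`.
no_show = pd.Series(["No", "Yes", "No", "No"], dtype="category")

# Same encoding as above (No -> 1, Yes -> 0), mapped explicitly and then cast
# so the result is a plain integer column rather than a category.
showed_up = no_show.map({"No": 1, "Yes": 0}).astype("int64")

print(showed_up.dtype, showed_up.mean())
```

With an integer column, the mean is directly the share of appointments the patient attended.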


In [23]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110526 entries, 0 to 110525
Data columns (total 17 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   PatientId       110526 non-null  object             
 1   AppointmentID   110526 non-null  object             
 2   Gender          110526 non-null  category           
 3   ScheduledDay    110526 non-null  datetime64[ns, UTC]
 4   AppointmentDay  110526 non-null  datetime64[ns, UTC]
 5   Age             110526 non-null  int64              
 6   Neighbourhood   110526 non-null  object             
 7   Scholarship     110526 non-null  int64              
 8   Hipertension    110526 non-null  int64              
 9   Diabetes        110526 non-null  int64              
 10  Alcoholism      110526 non-null  int64              
 11  Handcap         110526 non-null  int64              
 12  SMS_received    110526 non-null  int64              
 13  No-show       

In [24]:
cleaned_data.describe()

Unnamed: 0,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0,110526.0
mean,37.089219,0.098266,0.197248,0.071865,0.0304,0.022248,0.321029
std,23.110026,0.297676,0.397923,0.258266,0.171686,0.161543,0.466874
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,115.0,1.0,1.0,1.0,1.0,4.0,1.0


## 6. Saving the Cleaned Data

In [27]:
# index=False avoids writing the row index as an extra unnamed column.
cleaned_data.to_csv("appointments_cleaned_data.csv", index=False)
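CSV does not store dtypes, so the string, datetime, and category conversions made earlier are lost when the file is read back. The sketch below shows a round trip on toy data that restores them explicitly. It writes to an in-memory buffer rather than a file, and the three columns are a small stand-in for the full table.

```python
import io
import pandas as pd

# Toy frame with a string ID, a timezone-aware timestamp, and a category.
toy = pd.DataFrame({
    "AppointmentID": ["5642903", "5642503"],
    "ScheduledDay": pd.to_datetime(["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"]),
    "Gender": pd.Categorical(["F", "M"]),
})

# Write to an in-memory buffer instead of a file for this sketch.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the intended dtypes explicitly when reading back.
restored = pd.read_csv(
    buffer,
    dtype={"AppointmentID": str, "Gender": "category"},
    parse_dates=["ScheduledDay"],
)
print(restored.dtypes)
```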