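# Project: Assessing and Cleaning Airline Passenger Satisfaction Data

## Analysis Goal

This dataset contains the results of an airline passenger satisfaction survey. It can be used to analyze which factors are strongly associated with satisfied (or dissatisfied) passengers, and to build a machine learning model that predicts a passenger's satisfaction level (1. satisfied; 2. neutral or dissatisfied).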

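## Introduction

The meaning of each column is as follows:
- `Gender`: the passenger's gender (Female, Male)
- `Customer Type`: the type of customer (Loyal Customer, disloyal Customer)
- `Age`: the passenger's age
- `Type of Travel`: the purpose of the flight (Personal Travel, Business travel)
- `Class`: the travel class (Business: business class; Eco: economy class; Eco Plus: premium economy class)
- `Flight Distance`: the flight distance of the journey
- `Inflight wifi service`: satisfaction with the in-flight wifi service (0: not applicable; 1-5)
- `Departure/Arrival time convenient`: satisfaction with the convenience of departure/arrival times
- `Ease of Online booking`: satisfaction with online booking
- `Gate location`: satisfaction with the gate location
- `Food and drink`: satisfaction with food and drink
- `Online boarding`: satisfaction with online boarding
- `Seat comfort`: satisfaction with seat comfort
- `Inflight entertainment`: satisfaction with in-flight entertainment
- `On-board service`: satisfaction with on-board service
- `Leg room service`: satisfaction with leg room
- `Baggage handling`: satisfaction with baggage handling
- `Checkin service`: satisfaction with the check-in service
- `Inflight service`: satisfaction with in-flight service
- `Cleanliness`: satisfaction with cleanliness
- `Departure Delay in Minutes`: departure delay in minutes
- `Arrival Delay in Minutes`: arrival delay in minutes
- `satisfaction`: satisfaction with the airline (satisfied, neutral or dissatisfied)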

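## Reading the Data

Import the libraries needed for the analysis, then use Pandas' `read_csv` function to parse the contents of the raw data file "airline_survey_train.csv" into a DataFrame and assign it to the variable `original_data`.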

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
original_data = pd.read_csv("airline_survey_train.csv")
original_data.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied


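## Assessing the Data

In this section, we assess the data held in the `original_data` DataFrame created in the previous section.

The assessment covers two aspects: structure and content, i.e. tidiness and cleanliness.  
Structural problems are violations of the three standards "each column is a variable, each row is an observation, each cell is a value". Content problems include missing data, duplicate data, invalid data, and so on.

- ### Assessing data tidiness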

In [2]:
original_data.sample(10)

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
5998,5998,85158,Male,Loyal Customer,26,Business travel,Business,696,5,5,5,5,4,5,4,4,3,2,4,3,5,4,17,7.0,satisfied
50974,50974,65345,Female,Loyal Customer,49,Business travel,Business,631,1,1,4,1,4,5,5,4,4,4,4,3,4,3,82,77.0,satisfied
30305,30305,38104,Male,Loyal Customer,27,Business travel,Eco,1237,2,1,1,1,2,2,2,2,3,1,2,4,3,2,17,41.0,neutral or dissatisfied
89890,89890,42073,Male,Loyal Customer,21,Personal Travel,Eco,106,1,5,0,3,4,0,4,4,4,5,5,5,5,4,0,0.0,neutral or dissatisfied
85925,85925,4597,Female,Loyal Customer,40,Personal Travel,Eco,416,5,4,5,4,3,5,3,3,1,4,4,4,3,3,0,0.0,satisfied
47507,47507,67325,Male,Loyal Customer,50,Personal Travel,Eco Plus,929,2,5,2,4,5,2,5,5,3,3,4,5,4,5,0,0.0,neutral or dissatisfied
58241,58241,25011,Male,Loyal Customer,41,Business travel,Business,3098,1,2,2,2,1,4,3,1,1,1,1,1,1,3,0,0.0,neutral or dissatisfied
95445,95445,97153,Female,Loyal Customer,54,Business travel,Business,1076,2,2,2,2,4,4,4,2,2,2,2,2,2,1,0,0.0,neutral or dissatisfied
41312,41312,60047,Male,Loyal Customer,41,Business travel,Eco Plus,1065,4,1,1,1,4,4,4,4,1,4,2,1,2,4,0,0.0,neutral or dissatisfied
89491,89491,112677,Female,disloyal Customer,50,Business travel,Business,672,3,4,4,3,2,3,2,2,4,2,5,3,4,2,0,0.0,neutral or dissatisfied


In [7]:
original_data['satisfaction'].value_counts()

satisfaction
neutral or dissatisfied    58879
satisfied                  45025
Name: count, dtype: int64

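Judging from the sampled rows, the data satisfies "each column is a variable, each row is an observation, each cell is a value". Specifically, each row is one passenger's survey response and each column is one variable of that response, so there are no structural problems.

- ### Assessing data cleanliness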

In [4]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

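There are 103904 observations in total. Most columns have no missing values, but `Arrival Delay in Minutes` has only 103594 non-null values (see the `count` row of the `describe()` output further below), so 310 values are missing. The `Unnamed: 0` column should be the index column.  
  
In addition, the `id` column is never used in arithmetic and is effectively a string, so it should be converted to the object type. `Arrival Delay in Minutes` is of type float, so `Departure Delay in Minutes` should be converted to float as well to keep the two delay columns consistent.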
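
The points above can be sketched as follows. This is a minimal illustration, not the notebook's actual cleaning code: the inline CSV reuses a few columns of the first three rows shown by `head()`, and the names `csv_text`, `toy`, and `converted` are hypothetical.

```python
import io

import pandas as pd

# A few columns of the first three rows from the head() output, as inline CSV.
# The empty first header field mimics the unnamed index column in the file.
csv_text = """,id,Departure Delay in Minutes,Arrival Delay in Minutes
0,70172,25,18.0
1,5047,1,6.0
2,110028,0,0.0
"""

# index_col=0 uses the unnamed first column as the index
# instead of loading it as a data column.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

converted = toy.copy()
# Treat id as a label rather than a number.
converted["id"] = converted["id"].astype(str)
# Give both delay columns the same float dtype.
converted["Departure Delay in Minutes"] = (
    converted["Departure Delay in Minutes"].astype(float)
)

print(converted.dtypes)
```

Working on a copy keeps the original frame untouched, which matches the notebook's habit of keeping `original_data` separate from any cleaned version.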

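#### Assessing missing data

Only `Arrival Delay in Minutes` shows missing values: its `count` in the `describe()` output below is 103594 against 103904 rows, so 310 values are missing. No other column shows missing values.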
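
One way to confirm how many values are missing in each column is sketched below. The frame `toy` is a hypothetical stand-in for `original_data`, with one missing arrival delay inserted purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for original_data; the NaN is illustrative only.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [25, 1, 0, 11],
    "Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
# Rows that contain at least one missing value, for manual inspection.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```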

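#### Assessing duplicate data

Values in the five-point rating columns can legitimately repeat. It is worth checking the `id` column for duplicates: a repeated `id` would mean that one passenger submitted more than one response.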
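
A check for repeated `id` values could look like the sketch below. The frame `toy` is hypothetical and contains one deliberately repeated `id`; on the real data the same two calls would be applied to `original_data`.

```python
import pandas as pd

# Hypothetical stand-in for original_data with one repeated id (5047).
toy = pd.DataFrame({
    "id": [70172, 5047, 110028, 5047],
    "Seat comfort": [5, 1, 5, 2],
})

# Count ids that repeat an earlier id (the first occurrence is not counted).
duplicate_id_count = toy["id"].duplicated().sum()
# Show every row involved in a repeat, including the first occurrence.
repeated_rows = toy[toy["id"].duplicated(keep=False)]

print(duplicate_id_count)
print(repeated_rows)
```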

#### Assessing inconsistent data

Judging from the data, no column appears likely to contain inconsistent data.
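
This impression can be double-checked by listing the distinct labels in each text column, since inconsistent data usually shows up as several spellings of the same category. The sketch below uses a hypothetical frame `toy` whose labels are taken from the sampled rows above.

```python
import pandas as pd

# Hypothetical stand-in for original_data, using labels seen in the samples.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Customer Type": ["Loyal Customer", "disloyal Customer",
                      "Loyal Customer", "Loyal Customer"],
    "Class": ["Eco Plus", "Business", "Business", "Eco"],
})

# Distinct labels per column; variants of one category would stand out here.
unique_labels = {col: sorted(toy[col].unique()) for col in toy.columns}

for col, labels in unique_labels.items():
    print(col, labels)
```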

#### Assessing invalid or erroneous data

In [5]:
original_data.describe()

Unnamed: 0.1,Unnamed: 0,id,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
count,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103904.0,103594.0
mean,51951.5,64924.210502,39.379706,1189.448375,2.729683,3.060296,2.756901,2.976883,3.202129,3.250375,3.439396,3.358158,3.382363,3.351055,3.631833,3.30429,3.640428,3.286351,14.815618,15.178678
std,29994.645522,37463.812252,15.114964,997.147281,1.327829,1.525075,1.398929,1.277621,1.329533,1.349509,1.319088,1.332991,1.288354,1.315605,1.180903,1.265396,1.175663,1.312273,38.230901,38.698682
min,0.0,1.0,7.0,31.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,25975.75,32533.75,27.0,414.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,2.0,0.0,0.0
50%,51951.5,64856.5,40.0,843.0,3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0,0.0,0.0
75%,77927.25,97368.25,51.0,1743.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,5.0,4.0,5.0,4.0,12.0,13.0
max,103903.0,129880.0,85.0,4983.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1592.0,1584.0


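Judging from the output, there is no erroneous or invalid data: all ratings fall between 0 and 5, ages range from 7 to 85, and both delay columns are non-negative.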
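
The range check on the rating columns can also be made explicit rather than read off the `min` and `max` rows. The sketch below is an assumption about how one might do it: `toy` and `rating_columns` are hypothetical, and the value 7 is an invalid rating planted to show that it gets flagged.

```python
import pandas as pd

# Hypothetical stand-in for original_data; the 7 is a planted invalid rating.
toy = pd.DataFrame({
    "Inflight wifi service": [3, 3, 2, 0],
    "Seat comfort": [5, 1, 7, 4],
})
rating_columns = ["Inflight wifi service", "Seat comfort"]

# True wherever a rating lies outside the valid 0-5 range (bounds inclusive).
out_of_range = ~toy[rating_columns].apply(lambda s: s.between(0, 5))
# Number of invalid ratings per column.
invalid_counts = out_of_range.sum()

print(invalid_counts)
```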

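## Cleaning the Data

Based on the conclusions of the assessment above, the cleaning steps we need are:
- Drop the `Unnamed: 0` and `id` columns
- Convert some columns to the categorical data type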