# Project: North American Restaurant Data

## Dataset Overview

---
Topic:
    - Food and drink
    - Business and finance
    - Daily life

Field:
    - Data mining
    - Regression

Ext:
    - .csv

DatasetUsage:
    - 88092
---

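### **Background**
This dataset contains 1,500 records on North American restaurants, including each restaurant's location, the cuisines it offers, and its rating.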

![Image Name](https://cdn.kesci.com/upload/image/s9av02ycpj.png?imageView2/0/w/640/h/640)

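### **Data Dictionary**
| Field                   | Description                                                    |
|-------------------------|----------------------------------------------------------------|
| name                    | Name of the restaurant.                                        |
| city                    | City where the restaurant is located.                          |
| state                   | State or province where the restaurant is located.             |
| zipcode                 | Postal code of the restaurant.                                 |
| country                 | Country where the restaurant is located (US or CA).            |
| cuisines                | Cuisine types offered by the restaurant.                       |
| pickup_enabled          | Whether pickup is offered (True or False).                     |
| delivery_enabled        | Whether delivery is offered (True or False).                   |
| weighted_rating_value   | Average rating of the restaurant, on a scale from 0 to 5.      |
| aggregated_rating_count | Total number of ratings the restaurant has received.           |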

### **Data Source**
https://www.kaggle.com/datasets/saketk511/1500-north-american-resturants

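### **Questions to Explore**
- Distribution of the number of restaurants
- Distribution of restaurant ratings
- Comparison across cuisine types
- Analysis of restaurant service options
- Relationship between customer ratings and the number of ratings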

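## Reading the Data

Import the libraries needed for the analysis, then use Pandas' `read_csv` function to parse the contents of the raw data file "North America Restaurants.csv" into a DataFrame and assign it to the variable `original_data`.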

In [158]:
import pandas as pd

In [159]:
original_data = pd.read_csv("North America Restaurants.csv")
original_data.head()

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,"American, Burger, Burgers, Family Meals, Fast Food, Subs & Sandwiches",True,True,2.4,42
1,Petro-Canada,Airdrie,AB,T4A,CA,"Ben & Jerry's, Café/Thé, Coffee/Tea, Convenience, Cool Treats, Crème Glacée, Crèmerie, Dessert, Desserts, Drinks, Dépanneur, Everyday Essentials, Food & Drinks, Fringales Nocturnes, Grocery, Gâter...",True,True,4.1,1
2,Boba Bae,Ashwaubenon,WI,54304,US,"American, Asian Food, Bubble Tea, Coffee & Tea, Dessert",True,True,4.0,88
3,1001 Nights Shawarma,Kitchener,ON,N2C,CA,"Beau, Bon, Local, Chicken, Dessert, Desserts, Global Diversity Month, Lebanese, Local Goods, Mediterranean, Middle East, Middle Eastern, Mois Mondial De La Diversité, Moyen-Oriental, Must-Eat Rest...",True,True,4.6,1077
4,Chirpyhut Fried Chicken (JlgJ),Richmond,BC,V6X 2B8,CA,"Ailes, Allergy Friendly, American, Beau, Bon, Local, Burgers, Chicken, Fait-Maison, Fast Food, Group Friendly, Hamburgers, Homemade, Hot Deal Summer, Local Goods, Offers, Offres, Poulet, Promos D'...",True,True,4.6,30


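## Assessing the Data

In this section, I assess the data held in the `original_data` DataFrame created in the previous section.

The assessment covers two aspects, structure and content, that is, tidiness and cleanliness. Structural problems are violations of the three tidy-data rules: each column is a variable, each row is an observation, and each cell holds a single value. Content problems include missing, duplicated, and invalid data.

### Assessing Tidiness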

In [160]:
original_data.sample(10)

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
960,Starbucks,Kitchener,ON,N2P 2J5,CA,"Bakery, Breakfast &Amp; Brunch, Breakfast And Brunch, Cafe, Cold Coffees, Cold Drinks, Frappuccino Blended Beverages, Frappuccino® Blended Beverages, Hot Breakfast, Hot Coffees, Hot Drinks, Hot Te...",False,True,4.2,34
1037,Cluck U Chicken,Hoboken,NJ,7030,US,"American, Bowls, Chicken, Dinner, Hamburgers, Late Night, Lunch, Lunch Specials, Wings, Wraps",True,True,2.0,58
858,Endivine Grill & Bar,Bowmanville,ON,L1C 1N4,CA,"Allergy Friendly, Breakfast & Brunch, Burgers, Canadian, Canadien, Déjeuner Et Brunch, Dîner, European, Healthy, Lunch, Modern European, Salads, Sandwiches & Subs, Sandwichs Et Sous-Marins, Santé,...",True,True,4.7,16
719,Route 66 Pizza,Orting,WA,98360,US,"American, Dessert, Fast Food, Healthy Pizza, Lunch, Pizza, Thin Crust Pizza, Veggie Pizza",True,True,2.0,1
385,McDonald's,Conway,AR,72034,US,"American, Breakfast, Burger, Burgers, Cheeseburger, Chicken, Coffee, Desserts, Fast Food, Hamburger, Salad",False,True,3.8,20
486,Jalebi Junction🌀,Toronto,ON,M6J 1E3,CA,"Desserts, Indian, Vegetarian",True,True,2.2,19
686,Krystal,Pensacola,FL,32504,US,"American, Burgers, Fast Food",True,True,3.6,56
388,Pizza Ranch,Sun Prairie,WI,53590,US,"Chicken, Dessert, Fried Chicken, Pizza, Wings",True,True,2.9,11
773,Burger King,Eau Claire,WI,54703,US,"American, American Food, Burger, Burgers, Family Meals, Fast Food",True,True,2.6,6
1317,Ennio's Pasta House,Waterloo,ON,N2J 4G8,CA,"Beau, Bon, Local, Chicken, Comfort Food, Fait-Maison, Fruits De Mer, Group Friendly, Homemade, Italian, Italien, Local Goods, Local Legends, Légendes Locales, Noodle Shop, Nouilles, Pasta, Pizza, ...",True,True,4.7,134


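Judging from the 10 sampled rows, the data satisfies "each column is a variable, each row is an observation": each row describes one restaurant, and each column is one attribute of that restaurant. The `cuisines` column, however, packs several values into a single cell, which violates "each cell holds a single value", so it needs to be split.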
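
To put a number on how widespread the multi-valued cells are, one option is to count the comma-separated items per cell. The following is a minimal sketch on a hypothetical four-row sample (the `sample` frame and its values are made up for illustration; only the column name `cuisines` comes from the dataset):

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be original_data.
sample = pd.DataFrame({
    "cuisines": [
        "American, Burger, Fast Food",
        "Desserts, Indian, Vegetarian",
        "Pizza",
        None,  # a missing value stays missing after the split
    ]
})

# Split each cell on commas and count the resulting items.
values_per_cell = sample["cuisines"].str.split(",").str.len()

# Cells holding more than one value break the "one value per cell" rule.
multi_valued = int((values_per_cell > 1).sum())
print(multi_valued)
```

Missing cells yield `NaN` in `values_per_cell` and are therefore not counted as multi-valued.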

### Assessing Cleanliness

In [161]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   name                     1500 non-null   object 
 1   city                     1500 non-null   object 
 2   state                    1500 non-null   object 
 3   zipcode                  1500 non-null   object 
 4   country                  1500 non-null   object 
 5   cuisines                 1499 non-null   object 
 6   pickup_enabled           1500 non-null   bool   
 7   delivery_enabled         1500 non-null   bool   
 8   weighted_rating_value    1500 non-null   float64
 9   aggregated_rating_count  1500 non-null   int64  
dtypes: bool(2), float64(1), int64(1), object(6)
memory usage: 96.8+ KB


The output shows 1,500 observations in total, and `cuisines` (1,499 non-null) is the only column with a missing value.
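
The same information can be read off as a per-column count of missing values instead of being inferred from the non-null counts. A minimal sketch, assuming a hypothetical three-row `sample` frame in place of `original_data`:

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only.
sample = pd.DataFrame({
    "name": ["Burger King", "All You Need", "Boba Bae"],
    "cuisines": ["American, Burger", None, "Bubble Tea"],
    "weighted_rating_value": [2.4, 2.5, 4.0],
})

# isna() marks missing cells; summing per column counts them.
missing_per_column = sample.isna().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```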

### Assessing Missing Data

Knowing that `cuisines` has a missing value, we locate the affected observation and assess it.

In [162]:
original_data[original_data["cuisines"].isnull()]

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
633,All You Need,Sunny Isles Beach,FL,33160,US,,True,True,2.5,2


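Exactly one row has a missing value.

Because `cuisines` only describes the type of food offered, and the row's other fields are complete, this missing value is tolerable and the row can be kept.

### Assessing Duplicate Data

None of the columns is a unique identifier, so a row counts as a duplicate only when it matches another row in every column.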

In [163]:
original_data.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1495    False
1496    False
1497     True
1498    False
1499     True
Length: 1500, dtype: bool

In [164]:
original_data.duplicated().sum()

618

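The output shows a large number of duplicates: 618 rows. My guess is that the collection process added one row per review, so a restaurant appears once for every review it received.

Let's verify this.

The output of `duplicated()` above flags rows 1497 and 1499 as duplicates, so we can display one of them.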

In [165]:
original_data.iloc[1497]

name                                                                                                                                                                 Petro-Canada
city                                                                                                                                                                    Longueuil
state                                                                                                                                                                          QC
zipcode                                                                                                                                                                       J4V
country                                                                                                                                                                        CA
cuisines                   Convenience, Everyday Essentials, Grocery, Home & Personal Care, Home &Amp; Persona

Show every row whose `name` is `Petro-Canada` and whose `city` is `Longueuil`.

In [166]:
original_data[(original_data['name']=='Petro-Canada') & (original_data['city']=='Longueuil')]

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
490,Petro-Canada,Longueuil,QC,J4V,CA,"Convenience, Everyday Essentials, Grocery, Home & Personal Care, Home &Amp; Personal Care, Nestle 4 Pour 2, Nestle Buy 2 Get 2, Offers, Offres, Snacks",True,True,4.5,47
1249,Petro-Canada,Longueuil,QC,J4V,CA,"Convenience, Everyday Essentials, Grocery, Home & Personal Care, Home &Amp; Personal Care, Nestle 4 Pour 2, Nestle Buy 2 Get 2, Offers, Offres, Snacks",True,True,4.5,47
1309,Petro-Canada,Longueuil,QC,J4V,CA,"Convenience, Everyday Essentials, Grocery, Home & Personal Care, Home &Amp; Personal Care, Nestle 4 Pour 2, Nestle Buy 2 Get 2, Offers, Offres, Snacks",True,True,4.5,47
1497,Petro-Canada,Longueuil,QC,J4V,CA,"Convenience, Everyday Essentials, Grocery, Home & Personal Care, Home &Amp; Personal Care, Nestle 4 Pour 2, Nestle Buy 2 Get 2, Offers, Offres, Snacks",True,True,4.5,47


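My guess was clearly wrong: this restaurant has 47 ratings but appears only 4 times. Even so, the duplicated rows carry no extra information and should be removed.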
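
Checking one restaurant by hand does not show whether the guess fails everywhere. One way to test it across the whole table is to count how many times each distinct row occurs and compare that count with `aggregated_rating_count`. This is a minimal sketch on a hypothetical five-row sample (the `sample` frame and the `copies` and `matches_hypothesis` names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only.
sample = pd.DataFrame({
    "name": ["Petro-Canada"] * 4 + ["Boba Bae"],
    "city": ["Longueuil"] * 4 + ["Ashwaubenon"],
    "aggregated_rating_count": [47, 47, 47, 47, 88],
})

# Group on every column so that only fully identical rows fall together;
# dropna=False keeps rows whose key columns contain missing values.
copies = (
    sample.groupby(list(sample.columns), dropna=False)
    .size()
    .reset_index(name="copies")
)

# If each review had produced one row, copies would equal the rating count.
copies["matches_hypothesis"] = copies["copies"] == copies["aggregated_rating_count"]
print(copies)
```

If `matches_hypothesis` is `False` for most rows, the one-row-per-review explanation can be discarded for the whole dataset rather than for a single example.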

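### Assessing Inconsistent Data

Inconsistent data could occur in the `country` variable, so we check whether several different values refer to the same country.

For the `country` variable: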

In [167]:
original_data["country"].value_counts()

country
US    829
CA    671
Name: count, dtype: int64

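Only two values appear, `US` and `CA`, so there is no inconsistency.

### Assessing Invalid or Erroneous Data

The DataFrame's `describe` method gives a quick summary of the numeric columns.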

In [168]:
original_data.describe()

Unnamed: 0,weighted_rating_value,aggregated_rating_count
count,1500.0,1500.0
mean,3.724533,85.500667
std,0.989005,277.071136
min,1.0,1.0
25%,2.9,5.0
50%,4.1,25.0
75%,4.6,68.0
max,5.0,4211.0


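The table shows that all values fall within an acceptable range: ratings span 1.0 to 5.0, and rating counts span 1 to 4,211.

Check explicitly whether every `weighted_rating_value` lies between 0 and 5.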

In [169]:
original_data[(original_data["weighted_rating_value"] < 0) | (original_data["weighted_rating_value"] > 5)]

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count


The result is empty, so no rating falls outside this range.
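
An out-of-range filter has to join its two conditions with `|` (a value cannot be below 0 and above 5 at once), and each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` and `>`. A more compact alternative is `Series.between`, sketched here on a hypothetical `ratings` frame that deliberately includes two invalid values:

```python
import pandas as pd

# Hypothetical ratings; 5.3 and -0.2 are planted to show the filter working.
ratings = pd.DataFrame({"weighted_rating_value": [1.0, 4.6, 5.0, 5.3, -0.2]})

# between() is inclusive on both ends by default, so 0 and 5 count as valid.
in_range = ratings["weighted_rating_value"].between(0, 5)

# Negating the mask selects the rows outside the 0 to 5 range.
out_of_range = ratings[~in_range]
print(out_of_range)
```

Note that `between` returns `False` for missing values, so this filter would also surface rows with a missing rating.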

So the numeric columns contain no invalid or erroneous values.
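
One thing the numeric summary cannot show is the `zipcode` column, which is stored as text. The sample above lists Hoboken, NJ with zipcode `7030`, while US ZIP codes have five digits, which suggests a leading zero was dropped somewhere upstream. The following is a minimal sketch of how such rows might be flagged and padded; the `sample` frame is hypothetical, and whether padding is the right fix for the real file is an assumption that should be checked against the source:

```python
import pandas as pd

# Hypothetical sample mixing US and Canadian postal codes.
sample = pd.DataFrame({
    "city": ["Hoboken", "Manitowoc", "Kitchener"],
    "country": ["US", "US", "CA"],
    "zipcode": ["7030", "54220", "N2P 2J5"],
})

# Flag US rows whose code is shorter than five characters.
is_us = sample["country"] == "US"
short_us_zip = is_us & (sample["zipcode"].str.len() < 5)

# Left-pad only the US codes with zeros; Canadian codes are left untouched.
sample.loc[is_us, "zipcode"] = sample.loc[is_us, "zipcode"].str.zfill(5)
print(sample)
```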

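## Cleaning the Data

Based on the conclusions of the assessment, the cleaning steps are:
- Split `cuisines` into one value per row
- Remove duplicate rows

To keep the cleaned data separate from the raw data, we create a new variable, `cleaned_data`, as a copy of `original_data`. All subsequent cleaning steps are applied to `cleaned_data`.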

In [170]:
cleaned_data = original_data.copy()
cleaned_data.head()

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,"American, Burger, Burgers, Family Meals, Fast Food, Subs & Sandwiches",True,True,2.4,42
1,Petro-Canada,Airdrie,AB,T4A,CA,"Ben & Jerry's, Café/Thé, Coffee/Tea, Convenience, Cool Treats, Crème Glacée, Crèmerie, Dessert, Desserts, Drinks, Dépanneur, Everyday Essentials, Food & Drinks, Fringales Nocturnes, Grocery, Gâter...",True,True,4.1,1
2,Boba Bae,Ashwaubenon,WI,54304,US,"American, Asian Food, Bubble Tea, Coffee & Tea, Dessert",True,True,4.0,88
3,1001 Nights Shawarma,Kitchener,ON,N2C,CA,"Beau, Bon, Local, Chicken, Dessert, Desserts, Global Diversity Month, Lebanese, Local Goods, Mediterranean, Middle East, Middle Eastern, Mois Mondial De La Diversité, Moyen-Oriental, Must-Eat Rest...",True,True,4.6,1077
4,Chirpyhut Fried Chicken (JlgJ),Richmond,BC,V6X 2B8,CA,"Ailes, Allergy Friendly, American, Beau, Bon, Local, Burgers, Chicken, Fait-Maison, Fast Food, Group Friendly, Hamburgers, Homemade, Hot Deal Summer, Local Goods, Offers, Offres, Poulet, Promos D'...",True,True,4.6,30


Split `cuisines`:

In [171]:
cleaned_data['cuisines'] = cleaned_data['cuisines'].str.split(',')
cleaned_data = cleaned_data.explode('cuisines')
cleaned_data

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,American,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Burger,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Burgers,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Family Meals,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Fast Food,True,True,2.4,42
...,...,...,...,...,...,...,...,...,...,...
1499,Bean House,Richmond Hill,ON,L4B 3Z1,CA,Noodles,True,True,4.8,93
1499,Bean House,Richmond Hill,ON,L4B 3Z1,CA,Vegan,True,True,4.8,93
1499,Bean House,Richmond Hill,ON,L4B 3Z1,CA,Vegetarian,True,True,4.8,93
1499,Bean House,Richmond Hill,ON,L4B 3Z1,CA,Végétalien,True,True,4.8,93
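
Splitting on a bare comma keeps the space that follows each comma in the source strings, so every value after the first starts with a blank (`" Burger"` rather than `"Burger"`) unless it is stripped. That matters later when grouping by cuisine, because `"Burger"` and `" Burger"` would count as different categories. A minimal sketch of one way to handle this, on a hypothetical two-row `sample` rather than the real file:

```python
import pandas as pd

# Hypothetical sample; the second row mimics the one missing cuisines value.
sample = pd.DataFrame({
    "name": ["Burger King", "All You Need"],
    "cuisines": ["American, Burger, Fast Food", None],
})

# Split into lists, then give each list item its own row.
exploded = sample.assign(cuisines=sample["cuisines"].str.split(",")).explode("cuisines")

# Remove the whitespace left over from the ", " separators.
exploded["cuisines"] = exploded["cuisines"].str.strip()

# explode() repeats the original index label; reset it for a clean 0..n-1 index.
exploded = exploded.reset_index(drop=True)
print(exploded)
```

The row with a missing `cuisines` value survives the explode as a single row that is still missing.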


Remove duplicate rows:

In [172]:
cleaned_data = cleaned_data.drop_duplicates()
cleaned_data

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,American,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Burger,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Burgers,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Family Meals,True,True,2.4,42
0,Burger King,Manitowoc,WI,54220,US,Fast Food,True,True,2.4,42
...,...,...,...,...,...,...,...,...,...,...
1498,Denny's,Kingman,AZ,86401,US,Family Meals,True,True,3.4,22
1498,Denny's,Kingman,AZ,86401,US,Healthy,True,True,3.4,22
1498,Denny's,Kingman,AZ,86401,US,Salads,True,True,3.4,22
1498,Denny's,Kingman,AZ,86401,US,Sandwich,True,True,3.4,22


## Saving the Cleaned Data

With cleaning complete, save the clean, tidy data to a new file named `North America Restaurants_cleaned.csv`.

In [173]:
cleaned_data.to_csv("North America Restaurants_cleaned.csv", index=False)

In [174]:
pd.read_csv("North America Restaurants_cleaned.csv")

Unnamed: 0,name,city,state,zipcode,country,cuisines,pickup_enabled,delivery_enabled,weighted_rating_value,aggregated_rating_count
0,Burger King,Manitowoc,WI,54220,US,American,True,True,2.4,42
1,Burger King,Manitowoc,WI,54220,US,Burger,True,True,2.4,42
2,Burger King,Manitowoc,WI,54220,US,Burgers,True,True,2.4,42
3,Burger King,Manitowoc,WI,54220,US,Family Meals,True,True,2.4,42
4,Burger King,Manitowoc,WI,54220,US,Fast Food,True,True,2.4,42
...,...,...,...,...,...,...,...,...,...,...
9107,Denny's,Kingman,AZ,86401,US,Family Meals,True,True,3.4,22
9108,Denny's,Kingman,AZ,86401,US,Healthy,True,True,3.4,22
9109,Denny's,Kingman,AZ,86401,US,Salads,True,True,3.4,22
9110,Denny's,Kingman,AZ,86401,US,Sandwich,True,True,3.4,22
