# Introduction to Pandas

`pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way towards this goal.

Prerequisite:
- NumPy

Added note: the name pandas is derived from the terms "panel data" and "Python data analysis".

## Import pandas (and Numpy)

Note: it is conventional to refer to `pandas` as `pd`.   
When you add the `as pd` at the end of your import statement, your Jupyter Notebook understands that from this point on every time you type `pd`, you are actually referring to the pandas library.

In [2]:
import numpy as np
import pandas as pd

## Download the dataset

In [4]:
df = pd.read_csv('pandas_tutorial_read.csv')
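
If you do not have `pandas_tutorial_read.csv` at hand, the following sketch rebuilds the first five records shown later in this tutorial from an in-memory string. The quoted CSV layout is an assumption based on the file lines quoted further down, and `sample` is a name introduced here only for illustration.

```python
import io

import pandas as pd

# First five records of the dataset, written in the quoted CSV style that the
# original file appears to use (an assumption, not a copy of the real file).
csv_text = '''"ins","type_entity","entity","period","value"
"3000","Région","Wallonie","01/01/2019","215.0"
"20002","Province","Brabant Wallon","01/01/2019","367.9"
"25000","Arrondissement","Nivelles","01/01/2019","367.9"
"25005","Commune","Beauvechain","01/01/2019","187.8"
"25014","Commune","Braine-l'Alleud","01/01/2019","764.8"
'''

# read_csv accepts any file-like object, so StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.dtypes)
```

The first line of the text becomes the column names, and every following line becomes one row.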

# The Basics

Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function `head()` to see the first couple rows of the dataframe (or the function `tail()` to see the last few rows).

In [5]:
df.head()

,ins,type_entity,entity,period,value
0,3000,Région,Wallonie,01/01/2019,215.0
1,20002,Province,Brabant Wallon,01/01/2019,367.9
2,25000,Arrondissement,Nivelles,01/01/2019,367.9
3,25005,Commune,Beauvechain,01/01/2019,187.8
4,25014,Commune,Braine-l'Alleud,01/01/2019,764.8


In [6]:
df.tail()

,ins,type_entity,entity,period,value
286,93018,Commune,Doische,01/01/2019,35.6
287,93022,Commune,Florennes,01/01/2019,84.3
288,93056,Commune,Philippeville,01/01/2019,59.0
289,93088,Commune,Walcourt,01/01/2019,149.1
290,93090,Commune,Viroinval,01/01/2019,46.7


We can see the dimensions of the dataframe using the `shape` attribute, like in NumPy.

Note: `(291, 5)` means 291 records (rows), each with 5 attributes (columns). For example:

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
example_df = pd.DataFrame(data)  # separate name, so the tutorial's df is not overwritten

print(example_df.shape)  # Output: (3, 2)

In [8]:
df.shape

(291, 5)

We can also get all the column names as a list with the `columns` attribute, and get the row labels with the `index` attribute.

In [9]:
df.columns.tolist()

['ins', 'type_entity', 'entity', 'period', 'value']
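
The `index` attribute is mentioned above but not shown. Here is a minimal sketch on a two-row dataframe (the name `sample` and its reduced column set are assumptions made for illustration):

```python
import pandas as pd

# Tiny stand-in dataframe with two of the tutorial's columns.
sample = pd.DataFrame({"entity": ["Wallonie", "Nivelles"], "value": [215.0, 367.9]})

column_names = sample.columns.tolist()  # the column labels as a plain list
row_labels = sample.index.tolist()      # the row labels; 0, 1, ... by default
print(column_names)
print(row_labels)
```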

In order to get a better idea of the type of data that we are dealing with, we can call the `describe()` function to see statistics like the mean, min, etc. for each numeric column of the dataset.

In [10]:
df.describe()

,ins,value
count,291.0,291.0
mean,64080.941581,319.149828
std,19262.087619,428.431706
min,3000.0,24.5
25%,55019.5,79.4
50%,62096.0,186.0
75%,83020.5,323.7
max,93090.0,3518.6


Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. Called on the whole dataframe, the function `max()` will show you the maximum value of each column.

In [11]:
df.max()

ins                 93090
type_entity        Région
entity             Étalle
period         01/01/2019
value              3518.6
dtype: object

Then, if you'd like to specifically get the max value for a particular column, you pass in the name of the column using the bracket indexing operator.

Note: the result of `df['value'].max()` matches the `value` entry (3518.6) on the fifth line of the output above.

In [15]:
df['value'].max()

np.float64(3518.6)

If you'd like to find the mean of the column `'value'`, you can use the `mean()` function. 

In [16]:
df['value'].mean()

np.float64(319.1498281786941)

But what if that's not enough? Let's say we want to actually see the row where this max value is. We can call the `idxmax()` function to get the index label of that row, and then use that label to locate the whole row.

In [17]:
df['value'].idxmax()

144
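
One detail worth illustrating: `idxmax()` returns a row label, which only coincides with the row position when the dataframe uses the default 0, 1, 2, ... labels. The sketch below uses three records quoted in this tutorial with hypothetical labels 10, 20, 30 to make the difference visible; `argmax()` is the positional counterpart.

```python
import pandas as pd

sample = pd.DataFrame(
    {"entity": ["Liège", "Oupeye", "Saint-Nicolas"], "value": [2877.7, 701.1, 3518.6]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

label = sample["value"].idxmax()     # label of the row holding the maximum
position = sample["value"].argmax()  # integer position of that same row
print(label, position)

row_by_label = sample.loc[label]         # look up by label
row_by_position = sample.iloc[position]  # look up by position
print(row_by_label)
```

With the default labels used in this tutorial's dataframe, both numbers would be the same.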

One of the most useful functions that you can call on certain columns in a dataframe is the `value_counts()` function. It shows how many times each distinct value appears in the column, most frequent first.

In [18]:
df['value'].value_counts()

value
49.4     3
312.6    2
642.8    2
367.9    2
46.7     2
        ..
663.8    1
309.6    1
479.1    1
220.0    1
414.8    1
Name: count, Length: 280, dtype: int64

# Accessing Values

Then, in order to get a particular row of the dataframe, we can use the `iloc[]` indexer. `iloc` is definitely one of the more important tools in Pandas. The main idea is that you use it whenever you have the integer position of a certain row that you want to access. As per the Pandas documentation, `iloc` is "integer-location based indexing for selection by position." It lets you select rows and columns of a dataframe by their integer positions. Strictly speaking, `idxmax()` returns a row label, but because this dataframe uses the default labels 0, 1, 2, ..., the label and the position coincide.
(If you are wondering why there are double brackets, this is explained further below.)

In [20]:
df.iloc[[df['value'].idxmax()]]

,ins,type_entity,entity,period,value
144,62093,Commune,Saint-Nicolas,01/01/2019,3518.6


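Note: in `pandas_tutorial_read.csv` itself, the record `"62093","Commune","Saint-Nicolas","01/01/2019","3518.6"` is on line 146 of the file, not 144. The difference of 2 has two causes. First, line 1 of the file is the header (`"ins","type_entity","entity","period","value"`), which is not counted as a data row. Second, pandas numbers rows starting from 0 rather than 1.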

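`iloc` is short for "integer location". It is an indexer in the pandas library that selects data by integer position rather than by label, giving access to specific elements, rows, or columns of a DataFrame. Some common usages:
* Select a single element: `df.iloc[0, 1]` (the element in the 1st row, 2nd column)
* Select a single row: `df.iloc[0]` (all columns of the 1st row)
* Select a single column: `df.iloc[:, 1]` (all rows of the 2nd column)
* Select several rows and columns: `df.iloc[0:2, 1:3]` (the 1st to 2nd rows and the 2nd to 3rd columns)
* Select specific rows or columns with lists: `df.iloc[[0, 2], [1, 3]]` (the 1st and 3rd rows, the 2nd and 4th columns)

`iloc` is very useful for indexing and slicing a DataFrame by position, especially when labels are unavailable or inconvenient to use.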
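
The usages listed above can be tried on a small stand-in dataframe. This is a minimal sketch: `sample` holds the first five records shown by `df.head()`, and the variable names are chosen here for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "ins": [3000, 20002, 25000, 25005, 25014],
    "type_entity": ["Région", "Province", "Arrondissement", "Commune", "Commune"],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "period": ["01/01/2019"] * 5,
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

single = sample.iloc[0, 1]            # one element: 1st row, 2nd column
first_row = sample.iloc[0]            # one row, returned as a Series
second_col = sample.iloc[:, 1]        # one column, returned as a Series
block = sample.iloc[0:2, 1:3]         # rows 0-1 and columns 1-2, a DataFrame
picked = sample.iloc[[0, 2], [1, 3]]  # rows 0 and 2, columns 1 and 3

print(single)
print(block)
print(picked)
```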

Let's take this a step further. Let's say you want to know the value of the column `'ins'` when the `'value'` column has its max.

For reference, here are lines 144 to 147 of `pandas_tutorial_read.csv`, prefixed with their line numbers in the file (not dataframe indices). The maximum is on line 146 of the file:

144 "62063","Commune","Liège","01/01/2019","2877.7"
145 "62079","Commune","Oupeye","01/01/2019","701.1"
146 "62093","Commune","Saint-Nicolas","01/01/2019","3518.6"
147 "62096","Commune","Seraing","01/01/2019","1822.4"

In [21]:
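# (Breaking the task down into steps)
# 1. df['value'] selects the 'value' column;
#    idxmax() returns the row index of that column's maximum.
max_index = df['value'].idxmax()
print(max_index)

# 2. df.iloc[[max_index]] uses iloc to extract data by position. The double
#    brackets [[ ]] return a DataFrame containing the row at position max_index.
max_row = df.iloc[[max_index]]
print(max_row)

# 3. Select the 'ins' column from the extracted row.
#    max_row['ins'] selects the 'ins' column of the DataFrame; the result is a
#    Series that holds a single value.
max_ins = max_row['ins']
print(max_ins)
print(type(max_ins))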


144
       ins type_entity         entity      period   value
144  62093     Commune  Saint-Nicolas  01/01/2019  3518.6
144    62093
Name: ins, dtype: int64
<class 'pandas.core.series.Series'>


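The `<class 'pandas.core.series.Series'>` line in the output means that `max_row['ins']` is not a bare value but a pandas `Series` object.

A pandas `Series` is a one-dimensional labeled array, similar to a single column of data. Selecting one column from a DataFrame with single brackets returns a `Series`, even when the DataFrame has only one row. Likewise, `df.iloc[max_index]` with single brackets returns that row as a `Series` rather than as a DataFrame.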

In [22]:
df.iloc[[df['value'].idxmax()]]['ins']

144    62093
Name: ins, dtype: int64

When you see data displayed in the above format, you're dealing with a Pandas `Series` object, not a dataframe object.

In [23]:
type(df.iloc[[df['value'].idxmax()]]['ins'])

pandas.core.series.Series

In [9]:
type(df.iloc[[df['value'].idxmax()]])

pandas.core.frame.DataFrame

The other really important indexer in Pandas is `loc`. Contrary to `iloc`, which is integer-position based, `loc` is a "purely label-location based indexer for selection by label". Since the row labels of this dataframe are the default integers 0 to 290, `loc` and `iloc` are going to be pretty interchangeable for this dataset. The example below selects all rows from row label 0 to row label 3, with label 3 included: `loc` picks rows by their labels, not by their integer positions.

In [10]:
df.loc[:3]

,ins,type_entity,entity,period,value
0,3000,Région,Wallonie,01/01/2019,215.0
1,20002,Province,Brabant Wallon,01/01/2019,367.9
2,25000,Arrondissement,Nivelles,01/01/2019,367.9
3,25005,Commune,Beauvechain,01/01/2019,187.8


Notice the slight difference: **`iloc` is exclusive of the second number, while `loc` is inclusive**.

Slicing with `loc`: `df.loc[0:3]` selects all rows from row label 0 to row label 3, with label 3 included. On a small example dataframe with columns `A`, `B`, and `C`:

selected_data = df.loc[0:3]
print(selected_data)
   A   B  C
0  1  10  a
1  2  20  b
2  3  30  c
3  4  40  d

Slicing with `iloc`: `df.iloc[0:3]` selects the rows at positions 0 to 2 (position 3 is excluded):

selected_data = df.iloc[0:3]
print(selected_data)
   A   B  C
0  1  10  a
1  2  20  b
2  3  30  c
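
To make the inclusive versus exclusive behaviour reproducible, here is a small runnable sketch. The dataframe `df_abc` mirrors the `A`/`B`/`C` example, with a fifth row added as an assumption so that both slices leave something out; the string-labelled variant at the end is also only an illustration.

```python
import pandas as pd

df_abc = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 50],
    "C": ["a", "b", "c", "d", "e"],
})

by_label = df_abc.loc[0:3]      # labels 0 through 3, end label included
by_position = df_abc.iloc[0:3]  # positions 0 through 2, end position excluded
print(len(by_label), len(by_position))

# With non-integer labels only loc can slice by label, and it stays inclusive.
relabelled = df_abc.set_index("C")
by_string_label = relabelled.loc["b":"d"]
print(by_string_label)
```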


Below is an example of how you can use `loc` to achieve the same task as we did previously with `iloc`.

Note: the record in question is the one on line 146 of the CSV file: `"62093","Commune","Saint-Nicolas","01/01/2019","3518.6"`.

In [24]:
df.loc[df['value'].idxmax(), 'ins']

np.int64(62093)

A faster version uses the `at[]` accessor. `at[]` is really useful whenever you know the row label and the column label of the particular value that you want to get.

In [11]:
df.at[df['value'].idxmax(), 'ins']

np.int64(62093)
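
As a side-by-side sketch, the same scalar can be read with `loc`, with `at`, and with `iat`, the positional counterpart of `at`. The dataframe `sample` below contains only the first five records of the dataset, so the maximum here is Braine-l'Alleud rather than Saint-Nicolas.

```python
import pandas as pd

sample = pd.DataFrame({
    "ins": [3000, 20002, 25000, 25005, 25014],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

label = sample["value"].idxmax()    # row label of the largest value

via_loc = sample.loc[label, "ins"]  # general label-based lookup
via_at = sample.at[label, "ins"]    # scalar-only label-based lookup
via_iat = sample.iat[4, 0]          # scalar-only lookup by integer position
print(via_loc, via_at, via_iat)
```

`at` and `iat` only accept a single row and a single column, which is what allows them to skip some of the work that `loc` and `iloc` do.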

If you'd like to see more discussion on how `loc` and `iloc` are different, check out this great Stack Overflow post: http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation. Just remember that **`iloc` looks at position** and **`loc` looks at labels**. `loc` becomes very important when your row labels aren't integers.

Summary:
* `df.loc[:5]` includes the row labeled 5, so it returns 6 rows.
* `df.iloc[:5]` does not include the row at position 5, so it returns 5 rows.

# Sorting

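Let's say that we want to sort the dataframe in increasing order of `value`.

Summary:
* `sort_values(by='column_name')` sorts by the given column in ascending order.
* `sort_values(by=['column_1', 'column_2'], ascending=[True, False])` sorts by several columns, mixing ascending and descending order.
* Sorting returns a new DataFrame; the original DataFrame is left unchanged.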

In [25]:
df.sort_values('value').head()

,ins,type_entity,entity,period,value
228,84016,Commune,Daverdisse,01/01/2019,24.5
265,91143,Commune,Vresse-sur-Semois,01/01/2019,25.8
214,82038,Commune,Sainte-Ode,01/01/2019,26.1
229,84029,Commune,Herbeumont,01/01/2019,27.8
219,83031,Commune,La Roche-en-Ardenne,01/01/2019,28.6
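
Descending and multi-column sorts follow the same pattern. A minimal sketch, using the first five records of the dataset as a stand-in dataframe named `sample`:

```python
import pandas as pd

sample = pd.DataFrame({
    "type_entity": ["Région", "Province", "Arrondissement", "Commune", "Commune"],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

# Largest values first.
descending = sample.sort_values("value", ascending=False)

# type_entity in ascending order, then value in descending order within each type.
mixed = sample.sort_values(by=["type_entity", "value"], ascending=[True, False])

print(descending)
print(mixed)
print(sample["value"].tolist())  # the original order is untouched
```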


# Filtering Rows Conditionally

Now, let's say we want to find all of the rows that satisfy a particular condition. For example, we want to find all the rows where `'value'` is higher than 150. The idea behind this command is that you access the column `'value'` of the dataframe df (`df['value']`), find which entries are above 150 (`df['value'] > 150`), and then return only those specific rows in a dataframe format (`df[df['value'] > 150]`).

In [27]:
df[df['value'] > 150]

,ins,type_entity,entity,period,value
0,3000,Région,Wallonie,01/01/2019,215.0
1,20002,Province,Brabant Wallon,01/01/2019,367.9
2,25000,Arrondissement,Nivelles,01/01/2019,367.9
3,25005,Commune,Beauvechain,01/01/2019,187.8
4,25014,Commune,Braine-l'Alleud,01/01/2019,764.8
...,...,...,...,...,...
277,92114,Commune,Sombreffe,01/01/2019,235.1
278,92137,Commune,Sambreville,01/01/2019,825.7
280,92140,Commune,Jemeppe-sur-Sambre,01/01/2019,408.5
281,92141,Commune,La Bruyère,01/01/2019,174.9


This also works if you have multiple conditions. Let's say we want to find the rows whose values lie between two specific values, here 150 and 200. Each condition needs its own parentheses, and the conditions are combined with `&`.

In [28]:
df[(df['value'] > 150) & (df['value'] < 200)]

,ins,type_entity,entity,period,value
3,25005,Commune,Beauvechain,01/01/2019,187.8
8,25031,Commune,Genappe,01/01/2019,170.7
11,25044,Commune,Ittre,01/01/2019,196.7
12,25048,Commune,Jodoigne,01/01/2019,191.5
16,25084,Commune,Perwez,01/01/2019,181.8
25,25120,Commune,Orp-Jauche,01/01/2019,175.7
29,25124,Commune,Walhain,01/01/2019,188.8
31,51000,Arrondissement,Ath,01/01/2019,190.4
61,53044,Commune,Jurbise,01/01/2019,178.4
76,55085,Commune,Seneffe,01/01/2019,179.2
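
The same range filter can be written in a few equivalent ways, and conditions can also be combined with `|` (or) and `~` (not). The sketch below runs on a five-row stand-in dataframe named `sample`; the string form `inclusive="neither"` of `between()` assumes pandas 1.3 or later.

```python
import pandas as pd

sample = pd.DataFrame({
    "type_entity": ["Région", "Province", "Arrondissement", "Commune", "Commune"],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

# Three ways to keep the rows with 150 < value < 200.
in_range = sample[(sample["value"] > 150) & (sample["value"] < 200)]
same_range = sample[sample["value"].between(150, 200, inclusive="neither")]
via_query = sample.query("value > 150 and value < 200")

# "or" keeps rows that satisfy at least one condition; "~" negates a condition.
either = sample[(sample["type_entity"] == "Commune") | (sample["value"] > 300)]
not_commune = sample[~(sample["type_entity"] == "Commune")]

print(in_range)
print(either)
print(not_commune)
```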


# Grouping

Another important function in Pandas is `groupby`. This is a function that allows you to group entries by certain attributes (e.g. grouping entries by `ins` number) and then perform operations on them.

In [29]:
df.groupby('ins')['value'].mean().head()

ins
3000     215.0
20002    367.9
25000    367.9
25005    187.8
25014    764.8
Name: value, dtype: float64

This next command groups the rows by `value` and, within each group, counts how many times each entity appears.

In [30]:
df.groupby('value')['entity'].value_counts().head()

value  entity             
24.5   Daverdisse             1
25.8   Vresse-sur-Semois      1
26.1   Sainte-Ode             1
27.8   Herbeumont             1
28.6   La Roche-en-Ardenne    1
Name: count, dtype: int64
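
Because every `ins` number is unique in this dataset, each group above holds a single row. Grouping becomes more informative on a column with repeated values. As a sketch, here is a grouping by `type_entity` on a five-row stand-in dataframe named `sample`, with several statistics computed at once through `agg()`:

```python
import pandas as pd

sample = pd.DataFrame({
    "type_entity": ["Région", "Province", "Arrondissement", "Commune", "Commune"],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

# One output row per distinct type_entity, one column per statistic.
by_type = sample.groupby("type_entity")["value"].agg(["count", "mean", "max"])
print(by_type)
```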

Each dataframe has a `values` attribute, which is useful because it gives you the contents of your dataframe as a NumPy array.

In [31]:
df.values

array([[3000, 'Région', 'Wallonie', '01/01/2019', 215.0],
       [20002, 'Province', 'Brabant Wallon', '01/01/2019', 367.9],
       [25000, 'Arrondissement', 'Nivelles', '01/01/2019', 367.9],
       ...,
       [93056, 'Commune', 'Philippeville', '01/01/2019', 59.0],
       [93088, 'Commune', 'Walcourt', '01/01/2019', 149.1],
       [93090, 'Commune', 'Viroinval', '01/01/2019', 46.7]], dtype=object)

Now, you can simply just access elements like you would in an array. 

In [32]:
df.values[0][0]

3000
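
The pandas documentation recommends the `to_numpy()` method over `values` for this conversion. A minimal sketch on a three-row stand-in dataframe named `sample`, which also shows that selecting only numeric columns gives a numeric array instead of an `object` one:

```python
import pandas as pd

sample = pd.DataFrame({
    "ins": [3000, 20002, 25000],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles"],
    "value": [215.0, 367.9, 367.9],
})

mixed = sample.to_numpy()                      # text and numbers together
numeric = sample[["ins", "value"]].to_numpy()  # numeric columns only
first_cell = mixed[0, 0]                       # row 0, column 0

print(mixed.dtype, numeric.dtype)
print(first_cell)
```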

# Dataframe Iteration

In order to iterate through dataframes, we can use the `iterrows` function. Below is an example that prints the first two rows. Each row yielded by `df.iterrows()` is a `Series` object.

In [None]:
for index, row in df.iterrows():
    print(row)
    if index == 1:
        break
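
For larger dataframes, `itertuples()` is a commonly used alternative to `iterrows()`, and whole-column operations usually avoid the loop altogether. The sketch below builds the same labels both ways on a three-row stand-in dataframe; the names `sample`, `labels`, and `vectorized` are chosen here for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles"],
    "value": [215.0, 367.9, 367.9],
})

# itertuples yields one namedtuple per row; columns become attributes.
labels = []
for row in sample.itertuples(index=True):
    labels.append(f"{row.entity}: {row.value}")

# The same result without an explicit loop, working on whole columns.
vectorized = sample["entity"] + ": " + sample["value"].astype(str)

print(labels)
print(vectorized.tolist())
```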

# Extracting Rows and Columns

The bracket indexing operator is one way to extract certain columns from a dataframe.

In [33]:
df[['entity', 'value']].head()

,entity,value
0,Wallonie,215.0
1,Brabant Wallon,367.9
2,Nivelles,367.9
3,Beauvechain,187.8
4,Braine-l'Alleud,764.8


Notice that you can achieve the same result by using the `loc` function. `loc` is a veryyyy versatile function that can help you in a lot of accessing and extracting tasks. 

In [34]:
df.loc[:, ['entity', 'value']].head()

,entity,value
0,Wallonie,215.0
1,Brabant Wallon,367.9
2,Nivelles,367.9
3,Beauvechain,187.8
4,Braine-l'Alleud,764.8


Note the difference in the return types when you use single brackets and when you use double brackets.

In [35]:
type(df['entity'])

pandas.core.series.Series

In [36]:
type(df[['entity']])

pandas.core.frame.DataFrame

You've seen before that you can access columns through `df['col name']`. You can also access rows by using slicing operations. 

In [37]:
df[0:3]

,ins,type_entity,entity,period,value
0,3000,Région,Wallonie,01/01/2019,215.0
1,20002,Province,Brabant Wallon,01/01/2019,367.9
2,25000,Arrondissement,Nivelles,01/01/2019,367.9


Here's an equivalent using `iloc`:

In [38]:
df.iloc[0:3,:]

,ins,type_entity,entity,period,value
0,3000,Région,Wallonie,01/01/2019,215.0
1,20002,Province,Brabant Wallon,01/01/2019,367.9
2,25000,Arrondissement,Nivelles,01/01/2019,367.9


# Data Cleaning

A big part of doing well in [Kaggle competitions](https://www.kaggle.com/competitions) is data cleaning. A lot of the time, the CSV file you're given (the Titanic dataset is a good example) has a lot of missing values, which you have to identify. The following call uses `isnull()` to flag the missing values in the dataframe and `sum()` to total them up for each column. In this case, we have a pretty clean dataset.

In [39]:
df.isnull().sum()

ins            0
type_entity    0
entity         0
period         0
value          0
dtype: int64

If you do end up having missing values in your datasets, be sure to get familiar with these two functions. 
* `dropna()` - This function allows you to drop all (or some) of the rows that have missing values.
* `fillna()` - This function allows you to replace the missing values with the value that you pass in.
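
Since the tutorial dataset has no missing values, here is a sketch on a small dataframe where one value has been blanked out on purpose. The gap is artificial and the name `with_gaps` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

with_gaps = pd.DataFrame({
    "entity": ["Wallonie", "Nivelles", "Ath"],
    "value": [215.0, np.nan, 190.4],  # the missing value is introduced on purpose
})

missing_per_column = with_gaps.isnull().sum()  # count of missing values per column

dropped = with_gaps.dropna()  # removes every row that has a missing value

# Fills the gap in 'value' with the mean of the values that are present.
filled = with_gaps.fillna({"value": with_gaps["value"].mean()})

print(missing_per_column)
print(dropped)
print(filled)
```

Both calls return new dataframes, so `with_gaps` itself still contains the missing value afterwards.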

# Creating Kaggle Submission CSVs

This isn't directly Pandas related, but most people who use Pandas probably do a lot of Kaggle competitions as well. As you probably know, Kaggle competitions require you to create a CSV of your predictions. Here's some starter code that can help you create that csv file.

In [41]:
import numpy as np
import csv

results = [[0,10],[1,15],[2,20]]
results = np.array(results)
print(results)

[[ 0 10]
 [ 1 15]
 [ 2 20]]


In [44]:
firstRow = [['id', 'pred']]
with open("result.csv", "w", newline="") as f:  # newline="" is what the csv module expects
    writer = csv.writer(f)
    writer.writerows(firstRow)
    writer.writerows(results)

The approach I described above deals more with python lists and numpy. If you want a purely Pandas based approach, take a look at this video: https://www.youtube.com/watch?v=ylRlGCtAtiE&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22
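
As a rough idea of what a Pandas-only version could look like (a sketch, not taken from the video above), the same predictions can be put in a dataframe and written out with `to_csv()`. The column names `id` and `pred` are reused from the example above.

```python
import pandas as pd

results = [[0, 10], [1, 15], [2, 20]]

# One row per prediction, with the same header as the csv-module version.
submission = pd.DataFrame(results, columns=["id", "pred"])

# Without a path, to_csv returns the CSV text; index=False leaves out the row labels.
csv_text = submission.to_csv(index=False)
print(csv_text)

# To write the file instead:
# submission.to_csv("result.csv", index=False)
```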

# Other Useful Functions

* `drop()` - This function removes the column or row that you pass in (you also have to specify the axis).
* `agg()` - The aggregate function lets you compute summary statistics about each group.
* `apply()` - Lets you apply a specific function to any/all elements in a Dataframe or Series.
* `get_dummies()` - Helpful for turning categorical data into one-hot vectors.
* `drop_duplicates()` - Lets you remove identical rows.
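
Here is a brief sketch of each of these functions on a five-row stand-in dataframe. The name `sample` and the duplicated row are assumptions made for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "type_entity": ["Région", "Province", "Arrondissement", "Commune", "Commune"],
    "entity": ["Wallonie", "Brabant Wallon", "Nivelles", "Beauvechain", "Braine-l'Alleud"],
    "period": ["01/01/2019"] * 5,
    "value": [215.0, 367.9, 367.9, 187.8, 764.8],
})

without_period = sample.drop("period", axis=1)                        # drop a column
summary = sample.groupby("type_entity")["value"].agg(["min", "max"])  # stats per group
rounded = sample["value"].apply(round)                                # function applied per element
dummies = pd.get_dummies(sample["type_entity"])                       # one column per category

# Append a copy of the first row, then remove the duplicate again.
with_duplicate = pd.concat([sample, sample.iloc[[0]]])
deduplicated = with_duplicate.drop_duplicates()

print(without_period.columns.tolist())
print(summary)
print(rounded.tolist())
print(dummies.shape, len(with_duplicate), len(deduplicated))
```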

# Lots of Other Great Resources

Pandas has been around for a while and there are a lot of other good resources if you're still interested in getting the most out of this library.
* http://pandas.pydata.org/pandas-docs/stable/10min.html
* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
* https://www.dataquest.io/blog/pandas-python-tutorial/
* https://www.youtube.com/playlist?list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y