# Module 1. Pandas and NumPy
Pandas and NumPy are the two most commonly used packages for data analysis in Python. Pandas provides comprehensive tools for manipulating structured, table-shaped data, while NumPy is designed for operations on multi-dimensional arrays such as vectors and matrices.  
This module focuses on Pandas, because most data analysis work starts from structured data. We will use it to (1) explore the data, (2) merge the data, and (3) clean and transform the data.

In [1]:
# import the package
import pandas as pd
import numpy as np
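
To make the difference between the two packages concrete, here is a small sketch. The item names and prices are made up for illustration and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic is applied to the whole array at once, no explicit loop
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9

# NumPy also handles matrix operations; multiplying by the identity returns m
m = np.array([[1, 2], [3, 4]])
product = m @ np.eye(2, dtype=int)

# Pandas: the same numbers, but stored in a table with labelled columns
table = pd.DataFrame({"item": ["a", "b", "c"], "price": prices})
table["discounted"] = discounted
print(table)
```

NumPy works on bare numbers, while Pandas keeps the row and column labels attached to them.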

## 1. Use Pandas for Exploratory Data Analysis
You can think of Pandas as the Python version of Excel: it opens, manipulates, and saves structured data. If you have a CSV or Excel file, you can read it with Pandas.  
Pandas also provides many descriptive-statistics functions, so you can quickly get a sense of what your data looks like.  
This process is often called exploratory data analysis (EDA). EDA is a basic but important step in any data analysis.

![Dataframe_explain](https://github.com/JumpingSquid/py_tutorial/blob/master/image/pandas_dataframe.png?raw=true)

### 1.1 Read the data: read_csv() and read_excel()
You can use read_csv("xxx.csv") to read a .csv file and read_excel("xxx.xlsx") to read an Excel file.

In [None]:
# read the data
# the file is loaded directly from a URL, so the same line works locally and on Google Colab
df = pd.read_csv("https://raw.githubusercontent.com/JumpingSquid/py_tutorial/master/netflix_titles.csv")
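
If you want to try read_csv() without downloading anything, the sketch below reads a tiny CSV from a string. The three rows are invented for illustration. read_excel() is called in a similar way, but it needs an Excel engine such as openpyxl to be installed.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here instead of a real file
csv_text = """show_id,title,release_year
s1,Alpha,2019
s2,Beta,2020
s3,Gamma,2016
"""

# read_csv accepts a file path, a URL, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded; nrows limits how many rows
small = pd.read_csv(io.StringIO(csv_text),
                    usecols=["title", "release_year"], nrows=2)

print(toy.shape)
print(small)
```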

### 1.2 Take a look: head() and describe()

In [None]:
# have a look at the data
# use "head" to display the first n rows
df.head(n=10)

In [None]:
# "describe" shows summary statistics for the numeric columns
df.describe()
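
Here is a minimal sketch of what head() and describe() return, using a made-up four-row table rather than the tutorial data. By default describe() only summarises numeric columns; passing include="all" adds the text columns as well.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "release_year": [2016, 2018, 2019, 2019],
})

first_two = toy.head(2)                      # the first two rows
numeric_summary = toy.describe()             # count, mean, std, min, quartiles, max
full_summary = toy.describe(include="all")   # also summarises text columns

print(numeric_summary)
print(full_summary)
```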

### 1.3 Index and slice, part I: loc and iloc
When we only want to work on part of the data, we use loc or iloc to locate it.  
Append .loc[,] or .iloc[,] to a Pandas DataFrame object: the rows you want go to the left of the comma, the columns to the right.  
loc and iloc are the two main ways to get data out of a dataframe. loc selects by (1) index or column names or (2) a boolean mask, while iloc selects by integer position (e.g. the third row and fifth column).

Note: ":" means all rows or all columns.
<br>You can also set a start point or an end point. With iloc, positions start at 0 and the end point is excluded:
<br><b>\[2:\]</b> means from position 2 to the end,
<br><b>\[:3\]</b> means positions 0, 1, and 2 (not 3!), and
<br><b>\[2:4\]</b> means positions 2 and 3.
<br>You can also use negative numbers: <b>\[:-1\]</b> means everything except the last item.
<br>With loc, a slice of labels includes both end points.

In [None]:
# loc uses index labels and column names
# loc[row label, column name]
df.loc[:, "title"]

In [None]:
# extract the data by row index
df.loc[0, :]

In [None]:
# you can also extract several rows or columns by passing lists
df.loc[[0,1,2], ["title", "director", "cast"]]

In [None]:
# iloc uses integer positions: all rows of the column at position 3
df.iloc[:, 3]

In [None]:
# the row at position 1, all columns
df.iloc[1, :]

In [None]:
# again, you can pass lists of row positions and column positions
df.iloc[[1,2,3], [1,2,3]]

In [None]:
# attribute access also returns an entire column
# (this only works when the column name is a valid Python identifier)
df.country
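
The difference between label-based and position-based slicing is easy to miss, so here is a small sketch on a made-up table whose index uses letters instead of numbers.

```python
import pandas as pd

# Hypothetical toy table with letter labels as the index
toy = pd.DataFrame(
    {"title": ["Alpha", "Beta", "Gamma", "Delta"],
     "release_year": [2016, 2018, 2019, 2019]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["a":"c", "title"]         # labels a, b, c: the end label is included
by_position = toy.iloc[0:2, 0]               # positions 0 and 1: the end position is excluded
all_but_last = toy.iloc[:-1, :]              # every row except the last one
single_cell = toy.loc["b", "release_year"]   # one row label plus one column name

print(by_label.tolist())
print(by_position.tolist())
```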

### 1.4 Index and slice, part II: conditional select
As mentioned above, loc also accepts a boolean mask. A boolean mask is a sequence with the same length as the number of rows (or columns) that contains only True or False, marking whether each row or column is selected. This lets us find the data we want by stating conditions.  
When we look for specific rows or columns, we usually do not find them by id but by some condition (like SELECT and WHERE in SQL).<br>
loc\[\] lets you do that by specifying the conditions for the rows and the columns in the form:<br>
<b>loc\[condition for rows, condition for columns\]</b>


![boolmask_explain](https://github.com/JumpingSquid/py_tutorial/blob/master/image/pandas_bmask.png?raw=true)

In [None]:
df.loc[df.release_year > 2015, ["title", "director", "cast"]]

In [None]:
df.loc[:, df.columns == "duration"]
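
To see what a boolean mask actually looks like, the sketch below builds two masks on a made-up table and combines them. The "kind" column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "kind": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2016, 2018, 2019, 2014],
})

# A comparison returns one True/False value per row
is_movie = toy["kind"] == "Movie"
is_recent = toy["release_year"] > 2015
print(is_movie.tolist())

# & means "and", | means "or", ~ means "not"
recent_movies = toy.loc[is_movie & is_recent, ["title", "release_year"]]
either = toy.loc[is_movie | is_recent, "title"]
not_movies = toy.loc[~is_movie, "title"]
```

When you write the comparisons inline, wrap each one in parentheses, because & and | bind more tightly than > and ==.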

## Exercise One:
Can you extract the rows of the dataframe for movies from India released since 2018?
<br>Hint: You can use (condition 1) & (condition 2) to combine two conditions.

### 1.5 Other useful tools for EDA: value_counts(), groupby(), and pivot_table()

In [None]:
# count how many times each value appears
df.country.value_counts(normalize=False)

In [None]:
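# group by country and average the numeric columns
# numeric_only=True skips the text columns, which newer pandas versions refuse to average
df.groupby(by="country").mean(numeric_only=True)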

In [None]:
# pivot table: number of rows for each country and release year
df.pivot_table(index="country", columns="release_year", aggfunc="size")
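
The three tools are easier to compare on a table small enough to check by hand. The sketch below uses made-up rows; the "minutes" column is hypothetical and not a column of the tutorial data.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "country": ["India", "India", "Japan", "Japan", "Japan"],
    "release_year": [2018, 2019, 2018, 2018, 2019],
    "minutes": [120, 100, 90, 110, 95],
})

# value_counts: how often each value appears (normalize=True gives shares)
counts = toy["country"].value_counts()
shares = toy["country"].value_counts(normalize=True)

# groupby: split the rows by country, then average one column per group
mean_minutes = toy.groupby("country")["minutes"].mean()

# pivot_table: one row per country, one column per year, mean minutes in each cell
year_table = toy.pivot_table(index="country", columns="release_year",
                             values="minutes", aggfunc="mean")

print(counts)
print(mean_minutes)
print(year_table)
```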

## 2. Use Pandas to combine the data
In practice, it is rare to start with one complete, clean, already-merged dataset. A full analysis usually draws on several kinds of data, and combining them adds depth to the analysis.
You typically need to combine several related datasets into one.
Pandas provides many tools for this. Here we introduce two of them: merge() and concat().

In [None]:
# To practise, we split the data into three pieces
# You can skip the details of this block; it only prepares the example data
df_staff = df.loc[:, ["title", "director", "cast"]].sample(frac=1).reset_index(drop=True)
df_time = df.loc[:, ['show_id', 'title', "country", "date_added", "release_year"]].sample(frac=1).reset_index(drop=True)
df_description = df.loc[:, ['show_id', 'description']].sample(frac=1).reset_index(drop=True)

In [None]:
df_staff.head()

In [None]:
df_time.head()

In [None]:
df_description.head()

### 2.1 merge()
When you have several datasets that share a corresponding key (such as an ID number or a name), merge() combines them based on that key.  
Some keys may appear in only one of the datasets. With the how argument you can choose to keep all keys (the missing values become nan) or only the keys that exist in both datasets.

![merge_explain](https://github.com/JumpingSquid/py_tutorial/blob/master/image/pandas_merge.png?raw=true)

In [None]:
# df_staff and df_time share the "title" column, so it serves as the key
pd.merge(df_staff, df_time, on='title', how='outer', indicator=True)
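
To see how the how argument changes the result, here is a sketch with two made-up tables whose keys only partly overlap. The table names and values are hypothetical.

```python
import pandas as pd

# Hypothetical tables: s1 exists only on the left, s4 only on the right
left = pd.DataFrame({"show_id": ["s1", "s2", "s3"],
                     "title": ["Alpha", "Beta", "Gamma"]})
right = pd.DataFrame({"show_id": ["s2", "s3", "s4"],
                      "country": ["India", "Japan", "Brazil"]})

# inner: keep only the keys found in both tables
inner = pd.merge(left, right, on="show_id", how="inner")

# outer: keep every key; the _merge column added by indicator=True
# records where each row came from
outer = pd.merge(left, right, on="show_id", how="outer", indicator=True)

# left: keep every key of the left table; unmatched rows get nan
left_join = pd.merge(left, right, on="show_id", how="left")

print(outer)
```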

### 2.2 concat()
concat() handles a simpler kind of combination: it joins several datasets along the rows or the columns. A common use is appending a table of new records to the table of old ones.  
Besides the case where several datasets share one id, you will sometimes have many dataframes with the same structure that were collected at different times. To analyze them as a whole, you need to concatenate them.

![concat_explain](https://github.com/JumpingSquid/py_tutorial/blob/master/image/pandas_concat.png?raw=true)

In [None]:
df_old = df.iloc[:400, :]
df_new = df.iloc[400:, :]

In [None]:
df_old.head()

In [None]:
df_new.head()

In [None]:
pd.concat([df_old, df_new])
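
Two details of concat() are worth trying on a small example: what happens to the index, and how to join along the columns instead of the rows. The tables below are made up for illustration.

```python
import pandas as pd

# Hypothetical "old" and "new" tables with the same columns
old = pd.DataFrame({"title": ["Alpha", "Beta"], "release_year": [2016, 2018]})
new = pd.DataFrame({"title": ["Gamma"], "release_year": [2019]})

# By default each piece keeps its own index labels: 0, 1, 0
stacked = pd.concat([old, new])

# ignore_index=True renumbers the rows: 0, 1, 2
renumbered = pd.concat([old, new], ignore_index=True)

# axis=1 places the tables side by side, matching rows by index label
extra = pd.DataFrame({"country": ["India", "Japan"]})
side_by_side = pd.concat([old, extra], axis=1)

print(stacked.index.tolist())
print(side_by_side)
```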

## Exercise Two:
Please combine the three dataframes (<b>df_staff, df_time, df_description</b>) into one.

## 3. Use Pandas and Numpy to clean and transform the data
The data we have at hand usually comes with plenty of problems to clean up, and Pandas provides methods for that.  
Data is not always perfect. In fact, most of your time as a data analyst will be spent cleaning the data.

### 3.1 Remove nan: fillna() and dropna()

We can use <b>isnull()</b> to find the columns that contain nan values. A nan appears where the original data has no value. Finding the nan values is an important step in data analysis.

In [None]:
df.isnull().any()

Solution 1: <b>fillna()</b> fills every nan cell with a specified value.

In [None]:
df_nona = df.fillna(0)
df_nona.isnull().any()

Solution 2: <b>dropna()</b> drops the rows (by default) or the columns that contain nan values. It is quicker, but use it with caution because it discards data.

In [None]:
df_nona = df.dropna()
df_nona.isnull().any()
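
Filling or dropping everything at once is often too blunt. The sketch below shows a few more selective options on a made-up table; the "minutes" column and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with missing values in two columns
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "country": ["India", None, "Japan", None],
    "minutes": [120.0, np.nan, 90.0, 100.0],
})

# Count the nan values in each column
missing_per_column = toy.isnull().sum()

# Fill only one column, here with that column's mean
filled = toy.fillna({"minutes": toy["minutes"].mean()})

# Drop only the rows that lack a country
rows_dropped = toy.dropna(subset=["country"])

# axis=1 drops columns instead of rows
cols_dropped = toy.dropna(axis=1)

print(missing_per_column)
```

Both fillna() and dropna() return a new dataframe and leave the original unchanged, which is why the results are stored in new variables.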

### 3.2 Transform the column: using loc, iloc, and numpy
If we want to change the values of a column, we use loc or iloc to specify the column and assign the new values to it.

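In [None]:
df.release_year

In [None]:
# subtract 2013 from every value in the column
df.loc[:, "release_year"] = df.loc[:, "release_year"] - 2013
print(df.release_year)

In [None]:
# replace every value in the column with the column mean
# plain column assignment replaces the whole column, so its dtype can change from integer to float
df["release_year"] = np.mean(df.release_year)
print(df.release_year)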
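
Overwriting a column loses the original values, so it is often safer to store a transformation in a new column. Here is a minimal sketch on a made-up table; the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"title": ["Alpha", "Beta", "Gamma"],
                    "release_year": [2014, 2018, 2020]})

# New column from arithmetic on an existing one; release_year stays intact
toy["years_since_2013"] = toy["release_year"] - 2013

# np.where picks one of two values for each row, based on a condition
toy["era"] = np.where(toy["release_year"] >= 2018, "recent", "older")

# Centre a column by subtracting its mean
toy["year_centered"] = toy["release_year"] - np.mean(toy["release_year"])

print(toy)
```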

## Exercise Three:
Please fill the nan values in the <b>director</b> column with the word "unknown".