# Module 1. Pandas and NumPy
Pandas and NumPy are the two most commonly used packages for data analysis in Python. Pandas provides comprehensive tools for manipulating structured data, and NumPy is designed for vector and matrix operations.
In this module, we focus on Pandas and use it to: 1. explore the data, 2. merge the data, and 3. clean and transform the data.

In [None]:
# import the packages
import pandas as pd
import numpy as np

## 1. Use Pandas for Exploratory Data Analysis
Like Excel, Pandas is designed to handle structured data. You can use it to quickly produce summary statistics for a dataset.
This process is often called exploratory data analysis (EDA). EDA is a basic but important step in any data analysis.

### Read the data: read_csv() and read_excel()
You can use read_csv("xxx.csv") or read_excel("xxx.xlsx") to read a .csv or Excel file into a DataFrame.

In [None]:
# read the data
# if you are working on Google Colab, replace the path with:
# https://raw.githubusercontent.com/JumpingSquid/py_tutorial/master/titanic.csv
df = pd.read_csv("titanic.csv")
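If you do not have the file at hand, you can still try read_csv() on a small piece of text. The sketch below is only an illustration: the CSV content and the name `sample` are made up, and io.StringIO is used so that the text behaves like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """PassengerId,Name,Age
1,Alice,29
2,Bob,
3,Carol,41
"""

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # Age is read as float because one value is missing
```

The empty Age field for the second row is read as NaN, which is how pandas marks a missing value.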

### Take a look: head() and describe()

In [None]:
# have a look at the data
# use head() to display the first n rows
df.head(n=10)

In [None]:
# we can also use describe() to show summary statistics for the numeric columns
df.describe()

### Index and slice - part I: loc and iloc
loc and iloc are the two main ways to get data out of a DataFrame. loc takes labels (row index labels and column names) or a boolean mask as input; iloc takes integer positions counted from 0 (e.g. row 2 and column 4 are the third row and the fifth column).

Note: ":" means all rows or all columns.
<br>You can also set a start or an end position. With iloc, positions are counted from 0 and the end position is excluded, so
<br><b>\[2:\]</b> means from position 2 to the end,
<br><b>\[:3\]</b> means positions 0 to 2 (position 3 is not included!), and
<br><b>\[2:4\]</b> means positions 2 and 3.
<br>You can also use negative numbers: <b>\[:-1\]</b> means everything except the last item.
<br>Unlike iloc, a label slice in loc includes the end label.

In [None]:
# loc uses the row index label and the column name
# loc[row index, column name]
df.loc[:, "Age"]  # all rows of the Age column

In [None]:
# extract the data by row index
df.loc[0, :]

In [None]:
# you can also extract multiple rows or columns by passing lists
df.loc[[0,1,2], ["Name", "Sex", "Age"]]

In [None]:
# iloc uses integer positions: all rows of the column at position 3
df.iloc[:, 3]

In [None]:
# the row at position 1 (the second row), all columns
df.iloc[1, :]

In [None]:
# again, you can pass lists of row and column positions
df.iloc[[1,2,3], [1,2,3]]

In [None]:
# attribute access also returns an entire column
# (this only works when the column name is a valid Python identifier)
df.Age
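The slicing rules for loc and iloc are easy to mix up, so here is a small check you can run. The DataFrame `toy` and its labels are made up for illustration; string labels are used so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"x": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# iloc slices by position and excludes the end position
by_position = toy.iloc[2:4]   # rows at positions 2 and 3 (labels c and d)

# loc slices by label and includes the end label
by_label = toy.loc["c":"e"]   # labels c, d, and e

# a negative end position counts from the tail
all_but_last = toy.iloc[:-1]  # labels a, b, c, and d

print(by_position)
print(by_label)
print(all_but_last)
```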

### Index and slice - part II: conditional select
When we look for specific rows or columns, we usually select them by a condition rather than by id (like SELECT and WHERE in SQL).<br>
loc\[\] lets you do that by specifying a condition for the rows or the columns in the form:<br>
<b>loc\[condition for rows, condition for columns\]</b>


In [None]:
# passengers younger than 10, showing only three columns
df.loc[df.Age < 10, ["Name", "Sex", "Age"]]

In [None]:
# all rows, keeping only the columns whose name equals "Age"
df.loc[:, df.columns == "Age"]
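It may help to look at what a condition actually is before passing it to loc. The sketch below uses a made-up DataFrame `toy` (not the Titanic data) to show that a comparison produces a boolean mask, and that masks can be negated with ~ or combined with | ("or").

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Cid", "Dee"],
    "age": [8, 35, 62, 4],
})

# A comparison on a column returns a boolean Series, one value per row
is_child = toy["age"] < 10
print(is_child.tolist())

# loc keeps the rows where the mask is True
children = toy.loc[is_child, ["name", "age"]]

# ~ negates a mask
adults = toy.loc[~is_child, "name"]

# | means "or"; wrap each condition in parentheses
child_or_senior = toy.loc[(toy["age"] < 10) | (toy["age"] > 60), "name"]
```

The parentheses matter because | and & bind more tightly than < and > in Python.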

## Exercise One:
Can you extract the rows for female passengers who travelled in third class?
<br>Hint: You can use (condition 1) & (condition 2) to combine two conditions

### Other useful tools for EDA: value_counts(), groupby(), and pivot_table()

In [None]:
# count how many times each value occurs
df.Sex.value_counts(normalize=False)

In [None]:
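# group by Sex and average the numeric columns
# (numeric_only=True skips text columns such as Name; recent pandas versions raise an error without it)
df.groupby(by="Sex").mean(numeric_only=True)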

In [None]:
# pivot table: number of passengers for each combination of Sex and Pclass
df.pivot_table(index="Sex", columns="Pclass", aggfunc="size")
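To see what these three tools return, it is easier to use data small enough to check by hand. The DataFrame `toy` below and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "level": [1, 2, 1, 1, 2],
    "score": [10.0, 20.0, 30.0, 50.0, 40.0],
})

# value_counts: how often each value occurs; normalize=True gives shares instead
counts = toy["team"].value_counts()
shares = toy["team"].value_counts(normalize=True)

# groupby: split the rows by team, then aggregate one column per group
mean_score = toy.groupby("team")["score"].mean()

# pivot_table: one row per team, one column per level, cells hold the mean score
table = toy.pivot_table(index="team", columns="level", values="score", aggfunc="mean")

print(counts)
print(shares)
print(mean_score)
print(table)
```

Selecting the column before aggregating, as in `toy.groupby("team")["score"]`, is one way to avoid averaging columns that are not numeric.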

## 2. Use Pandas to combine the data
In practice, it is rare to receive data that is complete, clean, and already merged. You typically need to combine several related datasets into one. Pandas has many tools to help you with this. Let's try some of them.

In [None]:
# To practice, we split the data into three shuffled pieces
# You can skip the details of this block
df_personal = df.loc[:, ["Name", "Sex", "Age"]].sample(frac=1).reset_index(drop=True)
df_ticket = df.loc[:, ['PassengerId', 'Pclass', 'Name', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']].sample(frac=1).reset_index(drop=True)
df_survival = df.loc[:, ['PassengerId', 'Survived']].sample(frac=1).reset_index(drop=True)

In [None]:
df_personal.head()

In [None]:
df_ticket.head()

In [None]:
df_survival.head()

### merge() and concat()
When you have several datasets and want to combine them, you can use merge(). merge() joins two datasets on a shared "key".
The key is usually an ID or a name. With merge(), you can choose how the data is combined through the how argument. For instance, how="inner" keeps only the IDs that exist in both datasets, while how="outer" keeps all the IDs.

In [None]:
# outer join on PassengerId; indicator=True adds a _merge column showing where each row came from
pd.merge(df_ticket, df_survival, on='PassengerId', how='outer', indicator=True)
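In the Titanic data every PassengerId appears in both pieces, so the choice of how makes no visible difference. The sketch below uses two made-up tables (`left` and `right`, with a hypothetical `id` key) whose keys only partly overlap, so you can see what each option keeps.

```python
import pandas as pd

# Hypothetical tables: ids 2 and 3 appear in both, 1 only in left, 4 only in right
left = pd.DataFrame({"id": [1, 2, 3], "fare": [7.25, 71.28, 8.05]})
right = pd.DataFrame({"id": [2, 3, 4], "survived": [1, 0, 1]})

# inner: keep only the ids present in both tables
inner = pd.merge(left, right, on="id", how="inner")

# outer: keep every id; missing values become NaN
# indicator=True adds a _merge column: left_only, right_only, or both
outer = pd.merge(left, right, on="id", how="outer", indicator=True)

# left: keep every id from the left table
left_join = pd.merge(left, right, on="id", how="left")

print(inner)
print(outer)
print(left_join)
```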

Besides datasets that share an ID, you will sometimes have several DataFrames with the same structure that were collected at different times. To analyze them together, you need to concatenate them with concat().

In [None]:
# pretend the data arrived in two batches: the first 400 rows and the rest
df_old = df.iloc[:400, :]
df_new = df.iloc[400:, :]

In [None]:
df_old.head()

In [None]:
df_new.head()

In [None]:
# stack df_old and df_new back into one DataFrame
pd.concat([df_old, df_new])
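One detail to watch with concat() is the row index. df_old and df_new keep their original, non-overlapping index values, but separately collected batches often both start at 0. The sketch below, with two made-up batches, shows what happens then and how ignore_index=True renumbers the rows.

```python
import pandas as pd

# Hypothetical batches with the same columns, each indexed from 0
batch_1 = pd.DataFrame({"id": [1, 2], "fare": [7.25, 71.28]})
batch_2 = pd.DataFrame({"id": [3, 4], "fare": [8.05, 53.10]})

# By default, concat keeps the original index of each piece
stacked = pd.concat([batch_1, batch_2])
print(stacked.index.tolist())

# ignore_index=True builds a fresh index for the combined rows
renumbered = pd.concat([batch_1, batch_2], ignore_index=True)
print(renumbered.index.tolist())
```

Duplicate index values are allowed, but they make loc return more than one row for a single label, so renumbering is usually the safer choice.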

## Exercise Two:
Please combine the three DataFrames (<b>df_personal, df_survival, df_ticket</b>) into one.

## 3. Use Pandas and Numpy to clean and transform the data
Data is not always clean. In fact, most of your time as a data analyst will be spent cleaning data.

### Remove nan: fillna() and dropna()

We can use <b>isnull()</b> to find the columns that contain NaN values. A NaN appears where the original data has no value. Finding the NaNs is an important step in any data analysis.

In [None]:
# True for each column that contains at least one NaN
df.isnull().any()

Solution 1: <b>fillna()</b> fills every NaN cell with a specified value

In [None]:
df_nona = df.fillna(0)
df_nona.isnull().any()

Solution 2: <b>dropna()</b> drops the rows (the default) or the columns that contain NaN values. It is quicker to apply, but use it with caution because it can discard a lot of data.

In [None]:
df_nona = df.dropna()
df_nona.isnull().any()
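Filling everything with 0 or dropping every incomplete row is rarely what you want. Both methods accept arguments for finer control. The sketch below uses a made-up DataFrame `toy`, and the fill values are arbitrary choices for illustration, not recommendations for the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    "cabin": ["C85", None, "E46", None],
    "port": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# count the missing values in each column
missing_per_column = toy.isnull().sum()

# fillna with a dict: a different fill value for each column
filled = toy.fillna({"cabin": "unknown", "port": "S"})

# dropna(): drop every row that has at least one NaN
complete_rows = toy.dropna()

# dropna(subset=...): only look at the listed columns when deciding
port_known = toy.dropna(subset=["port"])

# dropna(axis=1): drop the columns that contain NaN instead of the rows
complete_columns = toy.dropna(axis=1)

print(missing_per_column)
print(len(toy), len(complete_rows), len(port_known))
```

Here the plain dropna() keeps only one of the four rows, which shows why it deserves caution.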

### Transform the column: using loc, iloc, and numpy
If we want to change the values of a column, we can use loc or iloc to select the column and assign new values to it.

In [None]:
df.Fare

In [None]:
# multiply every fare by 30
df.loc[:, "Fare"] = df.loc[:, "Fare"] * 30
print(df.Fare)

In [None]:
# overwrite every fare with the mean fare, so the whole column becomes one constant
df.loc[:, "Fare"] = np.mean(df.Fare)
print(df.Fare)
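NumPy functions work element-wise on a whole column, which makes them handy for transformations. The sketch below uses a made-up DataFrame `toy`; the new column names and the threshold of 50 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"fare": [0.0, 10.0, 100.0]})

# np.log1p computes log(1 + x) for every value, so a fare of 0 is safe
toy["log_fare"] = np.log1p(toy["fare"])

# np.where picks one of two values per row depending on a condition
toy["fare_band"] = np.where(toy["fare"] > 50, "high", "low")

# loc with a condition changes only the matching rows: cap fares at 50
toy.loc[toy["fare"] > 50, "fare"] = 50.0

print(toy)
```

Bracket assignment such as `toy["log_fare"] = ...` is used here to create new columns, while loc with a condition updates part of an existing one.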

## Exercise Three:
Please fill the NaN values in the <b>Age</b> column with the mean age of the other passengers.