**OUTLINE**

* What is Pandas and its Ecosystem?
* Load CSV/Excel/JSON/URL data files
* Access a DataFrame by columns, rows, and cells
* How to view a DataFrame at a high level
* Work with Text in Pandas
* Manipulate data
* Some simple visualizations
* Useful external links

###  What is Pandas and its Ecosystem?

`Pandas` is an open-source Python toolkit for data analysis built on top of `Numpy`. It is widely regarded as one of the most powerful practical tools for working with tabular data in 1 dimension (`pd.Series`) or 2 dimensions (`pd.DataFrame`). Older versions also offered a higher-dimensional structure (`pd.Panel`), which was removed in pandas 0.25. In this assignment, we focus on the first two kinds of Pandas data structure.
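
To make the two structures concrete, here is a minimal sketch. The names `prices` and `fruits` and all of the values are invented for illustration only.

```python
import pandas as pd

# A Series is a 1-dimensional labelled array.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# A DataFrame is a 2-dimensional table; each of its columns is a Series.
fruits = pd.DataFrame({
    "name": ["apple", "banana", "cherry"],
    "price": [1.5, 2.0, 3.25],
})

print(prices["banana"])        # look up a value by its label
print(fruits.shape)            # (number of rows, number of columns)
print(type(fruits["price"]))   # selecting one column gives back a Series
```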

Along with `Numpy`, `Pandas` is one of the most popular open-source packages in the `Python` community. Many other libraries are built on top of `Pandas` or integrate with it to serve data science needs such as data preparation, analysis, and visualization. Some of them are:

* [**Statsmodels**](http://www.statsmodels.org/): linear regression and time series models for statistics and econometrics.
* [**scikit-learn**](http://scikit-learn.org/): machine learning library with pipelines.
* [**Bokeh**](http://bokeh.pydata.org/): interactive visualization for the web, especially useful for large datasets.

More information can be found in the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/ecosystem.html). Nonetheless, `Pandas` alone can already handle a wide range of tasks, thanks to the support of its community. In the next part of this pair assignment, we will walk you through the basic steps for learning more about `Pandas`.

### Load csv/ excel/ json/ url data file

The first step is importing the libraries (most of the time we use `numpy` and `pandas` together).

In [None]:
import numpy as np
import pandas as pd

#### CSV File
```python
df = pd.read_csv(PATH_TO_CSV_FILE)
```

In [None]:
pd.read_csv?
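
As a quick way to try `read_csv` without a file on disk, the sketch below reads CSV text from an in-memory buffer. The column names and values are made up for illustration; with a real file you would pass its path instead of the buffer.

```python
import io

import pandas as pd

# Toy CSV content (invented); the second row has an empty score field.
csv_text = "name,score\nAn,8.5\nBinh,\nChi,7.0\n"

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
scores_csv = pd.read_csv(io.StringIO(csv_text))

print(scores_csv)
print(scores_csv["score"].isna().sum())  # empty fields are read as NaN
```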

#### Excel file


```python
df = pd.read_excel(PATH_TO_EXCEL_FILE)
```

In [None]:
PATH_TO_EXCEL_FILE = None  # Replace None with the path to the Excel file
df1 = pd.read_excel(PATH_TO_EXCEL_FILE, na_values="")  # Read the Excel file located at PATH_TO_EXCEL_FILE using the method above

#### Json file
```python
df = pd.read_json(PATH_TO_JSON_FILE)
```

In [None]:
PATH_TO_JSON_FILE = None #Replace None by the path to json file
df2 = pd.read_json(PATH_TO_JSON_FILE, orient="columns")  # Read the JSON file located at PATH_TO_JSON_FILE using the method above
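
To see what `orient="columns"` expects, the following sketch parses a small JSON string shaped as `{column -> {index -> value}}`. The JSON content is invented, and it is wrapped in `io.StringIO` so that it is treated as file content rather than as a path.

```python
import io

import pandas as pd

# Toy JSON (invented): outer keys are column names, inner keys are row labels.
json_text = '{"name": {"0": "An", "1": "Binh"}, "score": {"0": 8.5, "1": 7.0}}'

scores_json = pd.read_json(io.StringIO(json_text), orient="columns")

print(scores_json)
```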

#### CSV with URL
```python
df = pd.read_csv(PATH_TO_FILE)
```

In [None]:
pd.read_csv?

In [None]:
URL_PATH_TO_FILE = "https://pythonhow.com/data/income_data.csv" #No need to do anything. Only run this cell
df3 = pd.read_csv(URL_PATH_TO_FILE).head()

(**Bonus**): If you want to experiment on your `DataFrame` without changing the original, call its `copy()` method and work on the copy.

In [None]:
pd.DataFrame.copy?
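
A small sketch of why `copy()` matters: changes made to the copy do not reach the original. The frame and its values are invented for illustration.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

# copy() makes a deep copy by default, so the data is duplicated.
snapshot = original.copy()
snapshot.loc[0, "x"] = 100

print(original.loc[0, "x"])   # the original value is untouched
print(snapshot.loc[0, "x"])   # only the copy was modified
```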

### Accessing DataFrame by columns, rows, and cells


#### Extract a subset with `loc` and `iloc`
There are two strongly recommended ways to index and select data:

* Label-based index and column: **`loc`**
* Integer-based index and column: **`iloc`**

Both use the same kind of syntax. Note that `iloc` slices exclude the end position, while `loc` slices include the end label:
```python
df.iloc[start_row_position:end_row_position, start_col_position:end_col_position]

df.loc[start_row_label:end_row_label, start_col_label:end_col_label]
```
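
The difference between the two indexers is easiest to see on a toy frame. The sketch below uses invented data with string row labels so that positions and labels cannot be confused.

```python
import pandas as pd

toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# iloc: integer positions, end position excluded (rows 0-1, columns 0-1).
by_position = toy.iloc[0:2, 0:2]

# loc: labels, end label included (rows "w" to "y", columns "a" to "b").
by_label = toy.loc["w":"y", "a":"b"]

print(by_position)
print(by_label)
```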

More advanced techniques can be found in the [Pandas Cookbook](https://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook-selection).

Starting from this section, we will rely on the documentation and the reference link above to complete the rest of this assignment.

In [None]:
#Show a slice of DataFrame from "Tỉnh":"Xã" of df1

In [None]:
#Show a list of all columns names of df1

In [None]:
#Show us the first 10 rows of "Tên Tỉnh" and "Ghi Chú" of df1

In [None]:
# Show the last column of df1
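
As a hint for the cells above, here is a sketch of the same kinds of selection on an invented frame (`items` and its columns are hypothetical, not the columns of `df1`).

```python
import pandas as pd

items = pd.DataFrame({
    "id": range(1, 13),
    "name": [f"item{i}" for i in range(1, 13)],
    "note": ["ok"] * 12,
})

column_names = items.columns.tolist()       # list of all column names
first_ten = items[["id", "note"]].head(10)  # first 10 rows of two chosen columns
last_column = items.iloc[:, -1]             # last column, selected by position

print(column_names)
print(first_ten)
print(last_column.name)
```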

### How to view a DataFrame at a high level

#### `head`/`tail` methods

In [None]:
pd.DataFrame.head?

In [None]:
pd.DataFrame.tail?

In [None]:
#Apply method head for df1

In [None]:
#Apply method tail for df3
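
A minimal sketch of `head` and `tail` on an invented ten-row frame; both return 5 rows when no argument is given.

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)    # first 3 rows
last_two = numbers.tail(2)       # last 2 rows
default_head = numbers.head()    # no argument: first 5 rows

print(first_three)
print(last_two)
```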

#### `describe` method

In [None]:
pd.DataFrame.describe?

In [None]:
#Apply method describe for 'Tên Tỉnh','Tên QH','Tên Xã' in df1
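
`describe` reports different statistics for numeric and text columns. The sketch below, on invented data, shows both cases.

```python
import pandas as pd

sample = pd.DataFrame({
    "group": ["a", "b", "a", "a"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# On a mixed frame, only numeric columns are summarised by default.
numeric_summary = sample.describe()

# On text columns, describe reports count, unique, top, and freq.
text_summary = sample[["group"]].describe()

print(numeric_summary)
print(text_summary)
```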

#### `min`/`max` methods

In [None]:
pd.DataFrame.min?

In [None]:
pd.DataFrame.max?

In [None]:
# Apply the min and max methods to get the minimum and maximum values in df3
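
By default, `min` and `max` work column by column and return a Series. A sketch on an invented numeric frame:

```python
import pandas as pd

points = pd.DataFrame({"x": [3, 1, 2], "y": [0.5, 2.5, 1.5]})

col_min = points.min()              # one minimum per column
col_max = points.max()              # one maximum per column
overall_max = points.max().max()    # largest value in the whole frame

print(col_min)
print(col_max)
print(overall_max)
```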

#### Count values and mode

In [None]:
pd.DataFrame.count?

In [None]:
pd.DataFrame.mode?

In [None]:
df1[['Tên Tỉnh','Tên QH','Tên Xã']].groupby(['Tên Tỉnh','Tên QH']).count()
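
To see what `count` and `mode` return, here is a sketch on invented data with one missing value. The column names `province` and `district` are hypothetical stand-ins, not the columns of `df1`.

```python
import pandas as pd

places = pd.DataFrame({
    "province": ["A", "A", "B", "B", "B"],
    "district": ["a1", "a2", "b1", "b1", None],
})

non_null = places.count()                          # non-missing values per column
most_common = places["province"].mode().iloc[0]    # most frequent value
per_province = places.groupby("province").count()  # non-missing values per group

print(non_null)
print(most_common)
print(per_province)
```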

### Work with Text in Pandas

#### Splitting strings

In [None]:
# Use .str.split(',') to split text in `other` of df3
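
A sketch of `.str.split` on an invented Series: without `expand` each row becomes a list, and with `expand=True` the pieces are spread over separate columns.

```python
import pandas as pd

raw = pd.Series(["a,b,c", "d,e", None])

parts = raw.str.split(",")               # each row becomes a list of pieces
wide = raw.str.split(",", expand=True)   # pieces spread over columns

print(parts)
print(wide)
```

Missing values stay missing in both results, so no extra handling is needed before splitting.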

#### Checking a substring contained in a string

In [None]:
# Count how many rows contain "Thành phố" using str.contains("Thành phố")

In [None]:
# Count how many rows contain "Thành phố" using the check_city function
check_city = lambda x: "Thành phố" in x
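
Both approaches can be sketched on an invented Series that includes a missing value. The helper `is_city` is a hypothetical variant of `check_city` that also guards against non-string entries, since the `in` test raises a `TypeError` on a missing value.

```python
import pandas as pd

# Toy names (invented), including one missing entry.
names = pd.Series(["Thành phố A", "Huyện B", None, "Thành phố C"])

# Vectorised: na=False treats missing values as "does not contain".
mask = names.str.contains("Thành phố", na=False)
count_vectorised = int(mask.sum())


def is_city(value):
    """Return True only for strings that contain "Thành phố"."""
    return isinstance(value, str) and "Thành phố" in value


# Element-wise: apply the function to every entry, then count the True results.
count_applied = int(names.apply(is_city).sum())

print(count_vectorised, count_applied)
```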

### Manipulate data

#### Working with missing data

In [None]:
# Drop all rows that contain any 'NaN' values
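
A sketch of `dropna` on an invented frame. By default it drops a row if any value in it is missing; `how="all"` drops only rows that are missing everywhere.

```python
import numpy as np
import pandas as pd

messy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

missing_per_column = messy.isna().sum()   # how many values are missing
any_dropped = messy.dropna()              # default how="any"
all_dropped = messy.dropna(how="all")     # drop only fully empty rows

print(missing_per_column)
print(any_dropped)
```

`dropna` returns a new frame and leaves `messy` unchanged.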

#### Group by with `split-apply-combine`

In [None]:
grouped = df1[['Tên Tỉnh','Tên QH','Tên Xã']].groupby(['Tên Tỉnh','Tên QH']) #Run this cell

In [None]:
grouped.describe()  #Run this cell
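
The split-apply-combine idea can be sketched on invented data: `groupby` splits the rows by key, an aggregation is applied to each group, and the results are combined into one table. The column names are hypothetical stand-ins for those of `df1`.

```python
import pandas as pd

communes = pd.DataFrame({
    "province": ["A", "A", "B"],
    "district": ["a1", "a1", "b1"],
    "commune": ["c1", "c2", "c3"],
})

# Split: one group per (province, district) pair.
by_district = communes.groupby(["province", "district"])

# Apply and combine: one result row per group.
sizes = by_district.size()
summary = by_district["commune"].agg(["count", "nunique"])

print(sizes)
print(summary)
```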

### Simple visualization

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df3.loc[0:2,"2005":"2013"].T.plot(kind="line") # Run this cell

### Useful external links:
1. A simple step-by-step guide that takes you from the beginning to building a simple web app in 15 minutes: https://pythonhow.com/accessing-dataframe-columns-rows-and-cells/

2. An exhaustive list of up-to-date `Pandas` tutorials: https://pandas.pydata.org/pandas-docs/stable/text.html