# LESSON 5: PANDAS INTRODUCTION
<img src="../images/pd_logo.png" width="400px"/>

## 1. Overall introduction
**pandas** is a Python package providing **fast, flexible, and expressive data structures** designed to make working with **“relational” or “labeled” data both easy and intuitive**.

**pandas** is well suited for many different kinds of data:
- **Tabular data** with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) **time series data**.
- **Arbitrary matrix data** (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of **observational / statistical data** sets. The data need not be labeled at all to be placed into a pandas data structure.

More details about pandas can be found in the official [user guide](https://pandas.pydata.org/docs/user_guide/index.html).


## 2. Install and import pandas
Install pandas by running this command in a Jupyter notebook:
`!conda install -c anaconda pandas -y`

In [None]:
import pandas as pd

## 3. Data structures in pandas
There are two main data structures in pandas: **Series** and **DataFrame**. <br>
<img src="../images/pd_series_df.png" width="600px"/>

### 3.1. Series
**Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). <br>
Check out all arguments of `pandas.Series` [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html).

In [None]:
# Init a pandas series without name
number_list = [4, 5, 6, 3, 1]

series_1 = pd.Series(data=number_list)
print(series_1, '\n', type(series_1))

In [None]:
# Init a pandas series with name
number_list = [4, 5, 6, 3, 1]

series_1 = pd.Series(data=number_list, name='Mango')
print(series_1, '\n', type(series_1))

In [None]:
series_2 = pd.Series(data=[5, 4, 3, 0, 2], name='Apple')
series_3 = pd.Series(data=[2, 3, 5, 2, 7], name='Banana')

print(series_2, '\n', type(series_2))
print(series_3, '\n', type(series_3))
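
The Series above all use the default integer labels (0, 1, 2, ...). Since a Series is a *labeled* array, the labels can also be set explicitly. The following is a minimal sketch; the weekday labels and the name `fruit_stock` are made up for illustration.

```python
import pandas as pd

# Hypothetical labels: the index does not have to be 0, 1, 2, ...
fruit_stock = pd.Series(data=[4, 5, 6], index=['mon', 'tue', 'wed'], name='Mango')

print(fruit_stock)
print(fruit_stock['tue'])           # look up a value by its label
print(fruit_stock.index.tolist())   # the labels themselves
```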

### 3.2. DataFrame
#### 3.2.1. Create a DataFrame from Series, list or dict

In [None]:
# Create dataframe from ONE Series
test_df = pd.DataFrame(series_1)
print(test_df, '\n', type(test_df))

In [None]:
# Create dataframe from multiple Series
test_df = pd.concat([series_1, series_2, series_3], axis=1)
print(test_df, '\n', type(test_df))

In [None]:
# Create dataframe from dict
data_dict = {
    'Mango': [4, 5, 6, 3, 1],
    'Apple': [5, 4, 3, 0, 2],
    'Banana': [2, 3, 5, 2, 7]
}
test_df = pd.DataFrame(data_dict)
print(test_df, '\n', type(test_df))
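
The heading of this section also mentions lists. As a small sketch (the variable names are illustrative), a dataframe can be built from a list of rows by passing the column names separately, or from a list of dicts where each dict is one row.

```python
import pandas as pd

# Each inner list is one row; column names are given separately
rows = [[4, 5, 2], [5, 4, 3], [6, 3, 5]]
df_from_rows = pd.DataFrame(rows, columns=['Mango', 'Apple', 'Banana'])

# Each dict is one row; keys become column names
records = [
    {'Mango': 4, 'Apple': 5, 'Banana': 2},
    {'Mango': 5, 'Apple': 4, 'Banana': 3},
    {'Mango': 6, 'Apple': 3, 'Banana': 5},
]
df_from_records = pd.DataFrame(records)

print(df_from_rows)
print(df_from_rows.equals(df_from_records))  # both constructions give the same table
```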

#### 3.2.2. Create a DataFrame by reading a CSV or XLSX file
##### Read from CSV file

In [None]:
test_df = pd.read_csv('data/predict_future_sales/items.csv')
test_df

##### Read from XLSX file
To read Excel files with pandas, install the dependency (engine) that matches the file format:
- `xlrd` supports old-style Excel files (`.xls`).
- `openpyxl` supports newer Excel file formats (such as `.xlsx`).
- `odf` supports OpenDocument file formats (`.odf`, `.ods`, `.odt`).
- `pyxlsb` supports binary Excel files (`.xlsb`).

In [None]:
!conda install -c anaconda openpyxl -y

In [None]:
# sheet_name=`integer` loads the sheet at that position (0 is the first sheet)
test_df = pd.read_excel('data/predict_future_sales/data.xlsx', sheet_name=0, engine='openpyxl')

test_df

In [None]:
# sheet_name=`str` loads the sheet with that name
test_df = pd.read_excel('data/predict_future_sales/data.xlsx', sheet_name='item_categories', engine='openpyxl')

test_df

In [None]:
# sheet_name=`list` (list of `integer` or `str`) loads multiple sheets into a `dict`
# whose keys are the requested sheets and whose values are dataframes
sheets = pd.read_excel('data/predict_future_sales/data.xlsx', sheet_name=[2, 'items'], engine='openpyxl')

sheets

In [None]:
# sheet_name=`None` loads all sheets into a `dict` of dataframes keyed by sheet name
all_sheets = pd.read_excel('data/predict_future_sales/data.xlsx', sheet_name=None, engine='openpyxl')

all_sheets

## 4. Access elements in Series and DataFrame
### 4.1. In Series

In [None]:
print(series_1)

# Access one element
print(series_1[2])

In [None]:
# Access multiple elements
print(series_1[[1, 2, 3]])

In [None]:
# Access elements based on one condition
print(series_1[series_1 > 3])

In [None]:
# Access elements based on multiple conditions
print(series_1[(series_1 > 3) & (series_1 < 6)])
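
Note that `series_1[2]` looks a value up by *label*; it matches position 2 only because the default labels are 0, 1, 2, ... A minimal sketch with a made-up non-default index shows the difference between label-based access (`.loc`) and position-based access (`.iloc`):

```python
import pandas as pd

# Hypothetical index labels that differ from the positions
s = pd.Series([4, 5, 6, 3, 1], index=[10, 20, 30, 40, 50])

by_label = s.loc[30]      # the value whose label is 30
by_position = s.iloc[2]   # the value at position 2 (third element)
first_value = s.iloc[0]   # the value at position 0

print(by_label, by_position, first_value)
```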

### 4.2. In DataFrame

In [None]:
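# Reload the items table used in the examples below
test_df = pd.read_csv('data/predict_future_sales/items.csv')
test_df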

####  Access elements by columns

In [None]:
# Show all columns in dataframe
test_df.columns

In [None]:
# Access elements from one column in dataframe
test_df['item_name']

In [None]:
# Access elements from multiple columns in dataframe
test_df[['item_id', 'item_category_id']]

####  Access elements by rows

In [None]:
# Access a row in dataframe
test_df.loc[1]

In [None]:
# Access multiple rows in dataframe
test_df.loc[[1, 2, 22168]]
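
`.loc` can also select rows and columns at the same time, and `.iloc` does the same by position. The following sketch uses a small made-up dataframe (the values and index labels are illustrative, not taken from `items.csv`):

```python
import pandas as pd

# Small illustrative table with non-default row labels
df = pd.DataFrame(
    {'item_id': [100, 101, 102], 'item_category_id': [40, 12, 55]},
    index=[7, 8, 9],
)

one_cell = df.loc[8, 'item_category_id']   # row label 8, one column
first_row = df.iloc[0]                     # first row by position (label 7)
subset = df.loc[[7, 9], ['item_id']]       # two rows, one column, still a dataframe

print(one_cell)
print(first_row)
print(subset)
```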

####  Access elements by conditions

In [None]:
# Access elements by condition in dataframe
test_df[test_df.item_id > 1000]

In [None]:
# Access elements by multiple conditions in dataframe
test_df[(test_df.item_id > 1000) & (test_df.item_id < 2345)]

In [None]:
# Access elements by three conditions combined with `&` (each condition needs its own parentheses)
test_df[(test_df.item_id > 1000) & (test_df.item_id < 2345) & (test_df.item_category_id == 12)]
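
The conditions above are combined with `&` (and). The same pattern works with `|` (or) and `~` (not), and `isin()` checks membership in a list of values. A minimal sketch on a made-up table:

```python
import pandas as pd

# Illustrative data, not taken from items.csv
items = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5],
    'item_category_id': [40, 12, 40, 55, 12],
})

# OR: rows matching at least one condition
either = items[(items.item_id < 2) | (items.item_category_id == 55)]
# NOT: rows that do not match the condition
not_12 = items[~(items.item_category_id == 12)]
# Membership: rows whose category is one of the listed values
in_set = items[items.item_category_id.isin([12, 55])]

print(either, not_12, in_set, sep='\n\n')
```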

## 5. Functions with dataframe

### 5.1. Basic functions

#### Delete rows or columns with `drop()`

In [None]:
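# Copy rows 1000..1010 (`.loc` slicing includes both ends) so that
# later modifications do not affect test_df
short_test_df = test_df.loc[1000:1010].copy()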
short_test_df

In [None]:
# Delete one row by index label
new_df = short_test_df.drop(index=1000)
new_df

In [None]:
# Delete multiple rows by index label
new_df = short_test_df.drop(index=[1001, 1003, 1005, 1007, 1009])
new_df

In [None]:
# Delete one column by column name
new_df = short_test_df.drop(columns='item_name')
new_df

In [None]:
# Delete multiple columns by column name
new_df = short_test_df.drop(columns=['item_name', 'item_category_id'])
new_df

#### Drop duplicate rows based on columns with `drop_duplicates()`

In [None]:
test_df

In [None]:
test_df.drop_duplicates(subset='item_category_id')
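
By default `drop_duplicates()` keeps the first row of each group of duplicates; the `keep` parameter changes that. A small sketch on made-up data:

```python
import pandas as pd

# Illustrative table with two rows per category
df = pd.DataFrame({
    'item_id': [1, 2, 3, 4],
    'item_category_id': [40, 40, 12, 12],
})

keep_first = df.drop_duplicates(subset='item_category_id')               # keep='first' is the default
keep_last = df.drop_duplicates(subset='item_category_id', keep='last')

print(keep_first)
print(keep_last)
```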

#### Handle missing (NaN) values with `dropna()` and `fillna()`

In [None]:
import numpy as np

nan_df = pd.DataFrame(data={
    'name': ['Minh', np.nan, 'Hieu'],
    'age': [None, 30, None],
    'job': ['DS', 'DA', 'DE']
})
nan_df

In [None]:
# Drop every row that contains at least one nan value
new_df = nan_df.dropna(axis=0)
new_df

In [None]:
# Drop every column that contains at least one nan value
new_df = nan_df.dropna(axis=1)
new_df

In [None]:
# Fill all nan values with one specific value
new_df = nan_df.fillna(-1)
new_df

In [None]:
# Fill nan values according to columns
new_df = nan_df.fillna({
    'name': 'Thao Anh',
    'age': -99,
    'job': 'student'
})
new_df

In [None]:
# Set values by condition: put nan back into the rows whose name is 'Minh'
new_df.loc[new_df.name == 'Minh', 'name'] = np.nan
new_df
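
Before dropping or filling, it often helps to count how many values are missing. The sketch below rebuilds the same `nan_df` and uses `isna()`; it also shows `dropna(subset=...)`, which only looks at the listed columns.

```python
import numpy as np
import pandas as pd

nan_df = pd.DataFrame(data={
    'name': ['Minh', np.nan, 'Hieu'],
    'age': [None, 30, None],
    'job': ['DS', 'DA', 'DE']
})

# Number of missing values in each column
missing_per_column = nan_df.isna().sum()
print(missing_per_column)

# Drop only the rows where 'name' is missing; nan in other columns is ignored
named_rows = nan_df.dropna(subset=['name'])
print(named_rows)
```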

#### Modify elements in a dataframe with `apply()`

In [None]:
short_test_df

In [None]:
# Apply a built-in function to every value: convert each category id to float
short_test_df.item_category_id = short_test_df.item_category_id.apply(float)
short_test_df

In [None]:
# Apply an anonymous function to every value
short_test_df.item_category_id = short_test_df.item_category_id.apply(lambda x: (x + 1) / 2)
short_test_df

In [None]:
# Apply a named function to every value (same formula as the lambda above)
def plus_1_divide_2(x):
    return (x + 1) / 2

short_test_df.item_category_id = short_test_df.item_category_id.apply(plus_1_divide_2)
short_test_df
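
For simple arithmetic like the formula above, the same result can be obtained without `apply()` by operating on the whole column at once. A minimal sketch with made-up values:

```python
import pandas as pd

# Illustrative category ids
s = pd.Series([40.0, 12.0, 55.0])

via_apply = s.apply(lambda x: (x + 1) / 2)   # one Python call per element
vectorized = (s + 1) / 2                     # arithmetic on the whole Series

print(via_apply.tolist())
print(via_apply.equals(vectorized))
```

`apply()` remains useful when the per-element logic cannot be written as column arithmetic.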

#### Save a dataframe to a CSV file with `to_csv()`

In [None]:
short_test_df.to_csv('short_test_df.csv', index=False)
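
To see what `index=False` does, the sketch below writes a small made-up dataframe to an in-memory text buffer instead of a file and reads it back. Using `io.StringIO` here is only a convenience for the demonstration.

```python
import io

import pandas as pd

# Illustrative data
df = pd.DataFrame({'item_id': [1000, 1001], 'item_category_id': [20.5, 6.5]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False: do not write the row labels
header_line = buffer.getvalue().splitlines()[0]
print(header_line)

buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)
print(restored.equals(df))
```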

### 5.2. Advanced functions

In [None]:
test_df

#### Aggregate data by `groupby()` and `agg()`

In [None]:
new_df = test_df.groupby(by='item_category_id')
new_df

In [None]:
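# For each category: collect the item names into a list and sum the item ids
# ('sum' as a string selects the pandas groupby sum)
new_df = new_df.agg({'item_name': list, 'item_id': 'sum'})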
new_df

In [None]:
new_df = new_df.reset_index()
new_df
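
The three steps above (group, aggregate, reset the index) can be followed more easily on a tiny table. The sketch below uses made-up data and named aggregation, where each keyword becomes an output column; the output column names are illustrative.

```python
import pandas as pd

# Illustrative data, not taken from items.csv
df = pd.DataFrame({
    'item_id': [1, 2, 3, 4],
    'item_name': ['a', 'b', 'c', 'd'],
    'item_category_id': [40, 12, 40, 12],
})

summary = (
    df.groupby('item_category_id')
      .agg(
          item_names=('item_name', list),    # all names in the group
          item_count=('item_id', 'count'),   # number of rows in the group
          item_id_sum=('item_id', 'sum'),    # sum of ids in the group
      )
      .reset_index()                         # turn the group key back into a column
)
print(summary)
```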

#### Join two dataframes by using `pd.merge()`
<img src="../images/pd_merge.png" width="400px"/>

In [None]:
test_df

In [None]:
# Create dummy dataframe
# keep every 2nd row (index % 2 == 0 with the default index)
dummy_1_df = test_df.iloc[::2]
dummy_1_df

In [None]:
# Create dummy dataframe
# keep every 3rd row (index % 3 == 0 with the default index)
dummy_2_df = test_df.iloc[::3]
dummy_2_df

In [None]:
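# Inner join: keep only the item_id values found in both dataframes
merged_df = pd.merge(dummy_1_df, dummy_2_df, how='inner', on='item_id')
merged_df

In [None]:
# Left join: keep every row of dummy_1_df
merged_df = pd.merge(dummy_1_df, dummy_2_df, how='left', on='item_id')
merged_df

In [None]:
# Right join: keep every row of dummy_2_df
merged_df = pd.merge(dummy_1_df, dummy_2_df, how='right', on='item_id')
merged_df

In [None]:
# Outer join: keep the rows of both dataframes
merged_df = pd.merge(dummy_1_df, dummy_2_df, how='outer', on='item_id')
merged_df

In [None]:
# indicator=True adds a `_merge` column that tells which dataframe each row came from
merged_df = pd.merge(dummy_1_df, dummy_2_df, how='outer', on='item_id', indicator=True)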
merged_df
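
The effect of each `how` value is easier to check on two tiny tables. The sketch below uses made-up data in which only the `item_id` values 0 and 6 appear on both sides, and compares the number of rows each join produces.

```python
import pandas as pd

# Illustrative tables: item_id 0 and 6 are shared, the rest are not
left = pd.DataFrame({'item_id': [0, 2, 4, 6], 'left_value': ['a', 'b', 'c', 'd']})
right = pd.DataFrame({'item_id': [0, 3, 6, 9], 'right_value': ['w', 'x', 'y', 'z']})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, how=how, on='item_id')
    row_counts[how] = len(merged)
print(row_counts)

# The `_merge` column reports 'both', 'left_only' or 'right_only' for each row
outer = pd.merge(left, right, how='outer', on='item_id', indicator=True)
print(outer)
```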

## 6. Homework
### 6.1. Exercise 1:
Create a dummy dataset with 10 rows and 5 data fields: "first_name", "last_name", "age", "job", and "country". The dataset must contain one row whose "country" is "Vietnam" and one row whose "age" is 20.

### 6.2. Exercise 2:
Create one Series for each of the 5 fields, then create a DataFrame from the 5 Series and save that DataFrame to a CSV file.

### 6.3. Exercise 3:
Read the CSV file from Exercise 2, then add the string "my_first_name" to each value in the "first_name" column and the string "my_last_name" to each value in the "last_name" column.

### 6.4. Exercise 4:
Print all rows whose "country" is "Vietnam", all rows whose "country" is not "Vietnam", all rows whose "age" is greater than 20, and all rows whose "age" is less than or equal to 20.