# Pandas practice (part 1)

In [1]:
import pandas as pd
import numpy as np

## Contents:

1. `SettingWithCopyWarning` issues with Pandas
1. Create and inspect data frames
1. Subset columns
1. Subset rows
1. Add new columns
1. Modify columns
1. Delete columns
1. Concatenate DFs vertically

## `SettingWithCopyWarning` issues with Pandas

In-depth reading: https://www.dataquest.io/blog/settingwithcopywarning/

Key points:

- Always use `.copy()` when creating a new DF from the original DF.
- When accessing and setting values, use `.loc[]`.
- Avoid chained assignment and hidden chaining.

In [2]:
df = pd.read_csv("data/example_data.csv", sep=";")  # this file uses ";" as its delimiter

In [3]:
df.head()

Unnamed: 0,time,place,magType,mag,alert,tsunami
0,2018-10-13 11:10:23.560,"262km NW of Ozernovskiy, Russia",mww,6.7,green,1
1,2018-10-13 04:34:15.580,"25km E of Bitung, Indonesia",mww,5.2,green,0
2,2018-10-13 00:13:46.220,"42km WNW of Sola, Vanuatu",mww,5.7,green,0
3,2018-10-12 21:09:49.240,"13km E of Nueva Concepcion, Guatemala",mww,5.7,green,0
4,2018-10-12 02:52:03.620,"128km SE of Kimbe, Papua New Guinea",mww,5.6,green,1


In [4]:
df = pd.read_csv("data/example_data.csv", sep=";")
print(df.shape)
df.head()

(5, 6)


Unnamed: 0,time,place,magType,mag,alert,tsunami
0,2018-10-13 11:10:23.560,"262km NW of Ozernovskiy, Russia",mww,6.7,green,1
1,2018-10-13 04:34:15.580,"25km E of Bitung, Indonesia",mww,5.2,green,0
2,2018-10-13 00:13:46.220,"42km WNW of Sola, Vanuatu",mww,5.7,green,0
3,2018-10-12 21:09:49.240,"13km E of Nueva Concepcion, Guatemala",mww,5.7,green,0
4,2018-10-12 02:52:03.620,"128km SE of Kimbe, Papua New Guinea",mww,5.6,green,1


In [6]:
df2 = df.copy()

### Issue 1: chained assignment

Pandas generates the warning when it detects something called chained assignment. Here are some terms:

- `Assignment`: Operations that set the value of something. Example: `df = pd.read_csv("data/example_data.csv", sep=";")`. Often referred to as a `set` operation.
- `Access`: Operations that return the value of something, such as the examples of indexing and chaining below. Often referred to as a `get` operation.
- `Indexing`: Any assignment or access method that references a subset of the data; for example `data[1:5]`.
- `Chaining`: The use of more than one indexing operation back-to-back; for example `data[1:5][1:3]`.
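
To make the last two terms concrete, here is a minimal sketch on a hypothetical toy frame (the name `data` mirrors the examples above; the `value` column is made up for illustration). It only reads values, so it is safe to run; the trouble starts when the last step of a chain is an assignment.

```python
import pandas as pd

# Hypothetical toy frame with a default 0..9 index.
data = pd.DataFrame({"value": range(10)})

# Indexing: one operation that references a subset of the rows.
first = data[1:5]

# Chaining: a second indexing operation applied to the result of the first.
chained = data[1:5][1:3]

print(first.index.tolist())    # [1, 2, 3, 4]
print(chained.index.tolist())  # [2, 3]
```

The chained result equals `data.iloc[2:4]`, which expresses the same selection in a single indexing operation.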

#### Does not work, and produces a warning

In [7]:
df2.loc[df2["tsunami"] == 1]["alert"] = "red"

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Reason: we have chained two indexing operations together.

- First one: `.loc[df2["tsunami"] == 1]`
- Second one: `["alert"]`

- These two chained operations execute independently, one after another.
- The first is an access method (get operation) that returns a DataFrame containing all rows where `tsunami` equals `1`.
- The second is an assignment operation (set operation) that is called on this new DataFrame.
- We are not operating on the original DataFrame at all.
- Therefore, `alert` did not change in `df2`.
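
The two steps can be unrolled by hand to see where the assignment lands. This is a sketch on a hypothetical toy frame, not the notebook's data; the intermediate result is made an explicit copy (an assumption added here) so that the code behaves the same way and stays warning-free across pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the earthquake file.
toy = pd.DataFrame({
    "alert": ["green", "green", "green"],
    "tsunami": [1, 0, 1],
})

mask = toy["tsunami"] == 1

# Step 1 (get): the boolean-mask selection produces a separate DataFrame.
subset = toy.loc[mask].copy()

# Step 2 (set): the assignment only touches that separate object.
subset["alert"] = "red"

print(subset["alert"].tolist())  # the temporary object was modified
print(toy["alert"].tolist())     # the original frame is unchanged
```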

#### Works, no warning

In [235]:
df2.loc[df2["tsunami"] == 1, "alert"] = "red"
df2.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...


### Issue 2: Hidden chaining

In [238]:
# Create a copy
df2 = df.copy()
df2.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...


In [240]:
# Create tsunami_1 from df2 where tsunami == 1
# tsunami_1 may be a copy or a view of df2; we cannot tell
tsunami_1 = df2.loc[df2["tsunami"] == 1, :]
tsunami_1.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
36,,,1000hbsa,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.541,,51.0,",us1000hbsa,",5.0,mww,...,",us,",reviewed,1539459504090,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake,",geoserve,origin,phase-data,",420.0,1539461285040,https://earthquake.usgs.gov/earthquakes/eventp...
118,green,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...


In [243]:
df2.loc[df2["tsunami"] == 1, :].head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
36,,,1000hbsa,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.541,,51.0,",us1000hbsa,",5.0,mww,...,",us,",reviewed,1539459504090,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake,",geoserve,origin,phase-data,",420.0,1539461285040,https://earthquake.usgs.gov/earthquakes/eventp...
118,green,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...


In [244]:
# Set alert column of tsunami_1 to 'red'
# Will produce warning
tsunami_1["alert"] = 'red'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [247]:
# View
tsunami_1.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
36,red,,1000hbsa,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.541,,51.0,",us1000hbsa,",5.0,mww,...,",us,",reviewed,1539459504090,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake,",geoserve,origin,phase-data,",420.0,1539461285040,https://earthquake.usgs.gov/earthquakes/eventp...
118,red,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...


In [248]:
# View df2
df2.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...


In [249]:
# To prevent this, we EXPLICITLY make a copy when creating tsunami_1
# No warnings
tsunami_1 = df2.loc[df2["tsunami"] == 1, :].copy()
tsunami_1.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
36,,,1000hbsa,https://earthquake.usgs.gov/fdsnws/event/1/que...,1.541,,51.0,",us1000hbsa,",5.0,mww,...,",us,",reviewed,1539459504090,"M 5.0 - 165km NNW of Flying Fish Cove, Christm...",1,earthquake,",geoserve,origin,phase-data,",420.0,1539461285040,https://earthquake.usgs.gov/earthquakes/eventp...
118,green,,1000hbkz,https://earthquake.usgs.gov/fdsnws/event/1/que...,2.623,,25.0,",pt18286001,at00pgjb1a,us1000hbkz,",6.7,mww,...,",pt,at,us,",reviewed,1539429023560,"M 6.7 - 262km NW of Ozernovskiy, Russia",1,earthquake,",geoserve,ground-failure,impact-link,losspager...",600.0,1539455437040,https://earthquake.usgs.gov/earthquakes/eventp...


In [250]:
tsunami_1["alert"] = "red"

In [251]:
df2.head(2)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...


### Turn off warnings
- Not recommended.
- If you really need to: `pd.set_option('mode.chained_assignment', mode)`
    - `mode=None`: to switch off the warning entirely
    - `mode='warn'`: to generate a warning (default)
    - `mode='raise'`: to raise an exception instead of a warning (stop program)

## Create and inspect data frames

### Create a data frame from `data/earthquakes.csv`

In [254]:
df = pd.read_csv("data/earthquakes.csv")

### Show the first 3 rows and the last 3 rows

In [255]:
df.head(3)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
0,,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.008693,,85.0,",ci37389218,",1.35,ml,...,",ci,",automatic,1539475168010,"M 1.4 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475395144,https://earthquake.usgs.gov/earthquakes/eventp...
1,,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02003,,79.0,",ci37389202,",1.29,ml,...,",ci,",automatic,1539475129610,"M 1.3 - 9km NE of Aguanga, CA",0,earthquake,",geoserve,nearby-cities,origin,phase-data,",-480.0,1539475253925,https://earthquake.usgs.gov/earthquakes/eventp...
2,,4.4,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.02137,28.0,21.0,",ci37389194,",3.42,ml,...,",ci,",automatic,1539475062610,"M 3.4 - 8km NE of Aguanga, CA",0,earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",-480.0,1539536756176,https://earthquake.usgs.gov/earthquakes/eventp...


In [256]:
df.tail(3)

Unnamed: 0,alert,cdi,code,detail,dmin,felt,gap,ids,mag,magType,...,sources,status,time,title,tsunami,type,types,tz,updated,url
9329,,,2018261000,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.4526,,276.0,",pr2018261000,",2.4,md,...,",pr,",reviewed,1537229908180,"M 2.4 - 35km NNE of Hatillo, Puerto Rico",0,earthquake,",geoserve,origin,phase-data,",-240.0,1537243777410,https://earthquake.usgs.gov/earthquakes/eventp...
9330,,,38063959,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01865,,61.0,",ci38063959,",1.1,ml,...,",ci,",reviewed,1537229545350,"M 1.1 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537230211640,https://earthquake.usgs.gov/earthquakes/eventp...
9331,,,38063935,https://earthquake.usgs.gov/fdsnws/event/1/que...,0.01698,,39.0,",ci38063935,",0.66,ml,...,",ci,",reviewed,1537228864470,"M 0.7 - 9km NE of Aguanga, CA",0,earthquake,",focal-mechanism,geoserve,nearby-cities,origin...",-480.0,1537305830770,https://earthquake.usgs.gov/earthquakes/eventp...


### Print the number of rows and the number of columns

In [257]:
df.shape[0]

9332

In [258]:
df.shape[1]

26

### Print the data type of each column

In [259]:
df.dtypes

alert       object
cdi        float64
code        object
detail      object
dmin       float64
felt       float64
gap        float64
ids         object
mag        float64
magType     object
mmi        float64
net         object
nst        float64
place       object
rms        float64
sig          int64
sources     object
status      object
time         int64
title       object
tsunami      int64
type        object
types       object
tz         float64
updated      int64
url         object
dtype: object

### Get the list of column names

In [260]:
df.columns.tolist()

['alert',
 'cdi',
 'code',
 'detail',
 'dmin',
 'felt',
 'gap',
 'ids',
 'mag',
 'magType',
 'mmi',
 'net',
 'nst',
 'place',
 'rms',
 'sig',
 'sources',
 'status',
 'time',
 'title',
 'tsunami',
 'type',
 'types',
 'tz',
 'updated',
 'url']

### Use `.info()` to get more information about null values and the data type of each column

In [261]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9332 entries, 0 to 9331
Data columns (total 26 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alert    59 non-null     object 
 1   cdi      329 non-null    float64
 2   code     9332 non-null   object 
 3   detail   9332 non-null   object 
 4   dmin     6139 non-null   float64
 5   felt     329 non-null    float64
 6   gap      6164 non-null   float64
 7   ids      9332 non-null   object 
 8   mag      9331 non-null   float64
 9   magType  9331 non-null   object 
 10  mmi      93 non-null     float64
 11  net      9332 non-null   object 
 12  nst      5364 non-null   float64
 13  place    9332 non-null   object 
 14  rms      9332 non-null   float64
 15  sig      9332 non-null   int64  
 16  sources  9332 non-null   object 
 17  status   9332 non-null   object 
 18  time     9332 non-null   int64  
 19  title    9332 non-null   object 
 20  tsunami  9332 non-null   int64  
 21  type     9332 

### Select the numeric columns into `df_num` and show the first 3 rows

In [262]:
df_num = df.select_dtypes("number")

In [263]:
df_num.head(3)

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
0,,0.008693,,85.0,1.35,,26.0,0.19,28,1539475168010,0,-480.0,1539475395144
1,,0.02003,,79.0,1.29,,20.0,0.29,26,1539475129610,0,-480.0,1539475253925
2,4.4,0.02137,28.0,21.0,3.42,,111.0,0.22,192,1539475062610,0,-480.0,1539536756176


In [264]:
df_num.dtypes

cdi        float64
dmin       float64
felt       float64
gap        float64
mag        float64
mmi        float64
nst        float64
rms        float64
sig          int64
time         int64
tsunami      int64
tz         float64
updated      int64
dtype: object

### Select the object-typed columns into `df_cat` and show the first 3 rows

In [265]:
df_cat = df.select_dtypes("O")

In [266]:
df_cat.head(3)

Unnamed: 0,alert,code,detail,ids,magType,net,place,sources,status,title,type,types,url
0,,37389218,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ci37389218,",ml,ci,"9km NE of Aguanga, CA",",ci,",automatic,"M 1.4 - 9km NE of Aguanga, CA",earthquake,",geoserve,nearby-cities,origin,phase-data,",https://earthquake.usgs.gov/earthquakes/eventp...
1,,37389202,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ci37389202,",ml,ci,"9km NE of Aguanga, CA",",ci,",automatic,"M 1.3 - 9km NE of Aguanga, CA",earthquake,",geoserve,nearby-cities,origin,phase-data,",https://earthquake.usgs.gov/earthquakes/eventp...
2,,37389194,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ci37389194,",ml,ci,"8km NE of Aguanga, CA",",ci,",automatic,"M 3.4 - 8km NE of Aguanga, CA",earthquake,",dyfi,focal-mechanism,geoserve,nearby-cities,o...",https://earthquake.usgs.gov/earthquakes/eventp...


In [267]:
df_cat.dtypes

alert      object
code       object
detail     object
ids        object
magType    object
net        object
place      object
sources    object
status     object
title      object
type       object
types      object
url        object
dtype: object
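
`select_dtypes` also accepts an `exclude` argument and lists of dtypes. A minimal sketch on a hypothetical three-column frame (column names borrowed from the data set, values made up). Selecting by `exclude="number"` is used here as an assumption-free way to get "everything that is not numeric", since the dtype pandas assigns to text columns can differ between versions.

```python
import pandas as pd

# Hypothetical toy frame: one float, one integer, one text column.
toy = pd.DataFrame({
    "mag": [1.35, 1.29],
    "tsunami": [0, 1],
    "place": ["9km NE of Aguanga, CA", "25km E of Bitung, Indonesia"],
})

# Keep only numeric columns (integers and floats).
num_cols = toy.select_dtypes(include="number").columns.tolist()

# Keep everything that is NOT numeric.
non_num_cols = toy.select_dtypes(exclude="number").columns.tolist()

# A list narrows the selection to specific dtypes.
float_cols = toy.select_dtypes(include=["float64"]).columns.tolist()

print(num_cols, non_num_cols, float_cols)
```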

### Compute descriptive stats for `df_cat`

In [268]:
df_cat.describe()

Unnamed: 0,alert,code,detail,ids,magType,net,place,sources,status,title,type,types,url
count,59,9332,9332,9332,9331,9332,9332,9332,9332,9332,9332,9332,9332
unique,2,9332,9332,9332,10,14,5433,52,2,7807,5,42,9332
top,green,80313319,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ci37362946,",ml,ak,"10km NE of Aguanga, CA",",ak,",reviewed,"M 0.4 - 10km NE of Aguanga, CA",earthquake,",geoserve,origin,phase-data,",https://earthquake.usgs.gov/earthquakes/eventp...
freq,58,1,1,1,6803,3166,306,2981,7797,55,9081,5301,1


### Compute descriptive stats for `df_num`

In [269]:
df_num.describe()

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
count,329.0,6139.0,329.0,6164.0,9331.0,93.0,5364.0,9332.0,9332.0,9332.0,9332.0,9331.0,9332.0
mean,2.754711,0.544925,12.31003,121.506588,1.497345,3.651398,19.053878,0.362122,56.899914,1538284000000.0,0.006537,-451.99014,1538537000000.0
std,1.010637,2.214305,48.954944,72.962363,1.203347,1.790523,15.492315,0.317784,91.872163,608030600.0,0.080589,231.752571,656413500.0
min,0.0,0.000648,0.0,12.0,-1.26,0.0,0.0,0.0,0.0,1537229000000.0,0.0,-720.0,1537230000000.0
25%,2.0,0.020425,1.0,66.1425,0.72,2.68,8.0,0.119675,8.0,1537793000000.0,0.0,-540.0,1537996000000.0
50%,2.7,0.05905,2.0,105.0,1.3,3.72,15.0,0.21,26.0,1538245000000.0,0.0,-480.0,1538621000000.0
75%,3.3,0.17725,5.0,159.0,1.9,4.57,25.0,0.59,56.0,1538766000000.0,0.0,-480.0,1539110000000.0
max,8.4,53.737,580.0,355.91,7.5,9.12,172.0,1.91,2015.0,1539475000000.0,1.0,720.0,1539537000000.0


### Use the `include` and `exclude` parameters of `.describe()` to compute summary statistics

In [271]:
df.describe(include="number")

Unnamed: 0,cdi,dmin,felt,gap,mag,mmi,nst,rms,sig,time,tsunami,tz,updated
count,329.0,6139.0,329.0,6164.0,9331.0,93.0,5364.0,9332.0,9332.0,9332.0,9332.0,9331.0,9332.0
mean,2.754711,0.544925,12.31003,121.506588,1.497345,3.651398,19.053878,0.362122,56.899914,1538284000000.0,0.006537,-451.99014,1538537000000.0
std,1.010637,2.214305,48.954944,72.962363,1.203347,1.790523,15.492315,0.317784,91.872163,608030600.0,0.080589,231.752571,656413500.0
min,0.0,0.000648,0.0,12.0,-1.26,0.0,0.0,0.0,0.0,1537229000000.0,0.0,-720.0,1537230000000.0
25%,2.0,0.020425,1.0,66.1425,0.72,2.68,8.0,0.119675,8.0,1537793000000.0,0.0,-540.0,1537996000000.0
50%,2.7,0.05905,2.0,105.0,1.3,3.72,15.0,0.21,26.0,1538245000000.0,0.0,-480.0,1538621000000.0
75%,3.3,0.17725,5.0,159.0,1.9,4.57,25.0,0.59,56.0,1538766000000.0,0.0,-480.0,1539110000000.0
max,8.4,53.737,580.0,355.91,7.5,9.12,172.0,1.91,2015.0,1539475000000.0,1.0,720.0,1539537000000.0


In [272]:
df.describe(include="O")

Unnamed: 0,alert,code,detail,ids,magType,net,place,sources,status,title,type,types,url
count,59,9332,9332,9332,9331,9332,9332,9332,9332,9332,9332,9332,9332
unique,2,9332,9332,9332,10,14,5433,52,2,7807,5,42,9332
top,green,80313319,https://earthquake.usgs.gov/fdsnws/event/1/que...,",ci37362946,",ml,ak,"10km NE of Aguanga, CA",",ak,",reviewed,"M 0.4 - 10km NE of Aguanga, CA",earthquake,",geoserve,origin,phase-data,",https://earthquake.usgs.gov/earthquakes/eventp...
freq,58,1,1,1,6803,3166,306,2981,7797,55,9081,5301,1
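
The cells above only use `include`. As a sketch of the `exclude` side, on a hypothetical toy frame with made-up values: `exclude="number"` summarises the non-numeric columns, and `include="all"` puts both kinds of summary in one table.

```python
import pandas as pd

# Hypothetical toy frame: two numeric columns and one text column.
toy = pd.DataFrame({
    "mag": [1.35, 1.29, 3.42],
    "tsunami": [0, 0, 1],
    "magType": ["ml", "ml", "mww"],
})

numeric_summary = toy.describe(include="number")   # count, mean, std, quartiles...
other_summary = toy.describe(exclude="number")     # count, unique, top, freq
everything = toy.describe(include="all")           # both sets of rows, NaN where not applicable

print(other_summary)
```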


### Use `.unique()` and `.value_counts()` to inspect categorical columns

Example: the `alert` column

In [273]:
df["alert"].unique()

array([nan, 'green', 'red'], dtype=object)

In [274]:
df["alert"].value_counts(normalize=True)

green    0.983051
red      0.016949
Name: alert, dtype: float64
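
Note that `.value_counts()` skips missing values by default, so the shares above are computed over the 59 non-null `alert` values only, not over all 9332 rows. A small sketch on a hypothetical series shows how `dropna=False` brings the missing values back into the count.

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
alert = pd.Series(["green", np.nan, "green", "red", np.nan], name="alert")

counts = alert.value_counts()                  # missing values are skipped
counts_all = alert.value_counts(dropna=False)  # missing values get their own row
n_distinct = alert.nunique()                   # number of distinct non-missing values

print(counts_all)
```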

## Subset columns

### Extract the `type` column as a series `s`
- Print the first 5 elements of `s`
- Describe series `s`

In [275]:
s = df["type"]

In [276]:
s.head(5)

0    earthquake
1    earthquake
2    earthquake
3    earthquake
4    earthquake
Name: type, dtype: object

### Extract the `type` column as a data frame `df_type`
- `df_type` has only one column, `type`
- Describe `df_type`

In [277]:
df_type = df[["type"]]
df_type.head(5)

Unnamed: 0,type
0,earthquake
1,earthquake
2,earthquake
3,earthquake
4,earthquake


### Use `.filter()` to subset columns
- Select the columns whose names contain `mag`
- Select the columns whose names start with `ti`
- Select the columns whose names end with `s`

In [279]:
df.filter(regex="mag").head(3)

Unnamed: 0,mag,magType
0,1.35,ml
1,1.29,ml
2,3.42,ml


In [280]:
df.filter(regex="^ti").head(3)

Unnamed: 0,time,title
0,1539475168010,"M 1.4 - 9km NE of Aguanga, CA"
1,1539475129610,"M 1.3 - 9km NE of Aguanga, CA"
2,1539475062610,"M 3.4 - 8km NE of Aguanga, CA"


In [281]:
df.filter(regex="s$").head(3)

Unnamed: 0,ids,rms,sources,status,types
0,",ci37389218,",0.19,",ci,",automatic,",geoserve,nearby-cities,origin,phase-data,"
1,",ci37389202,",0.29,",ci,",automatic,",geoserve,nearby-cities,origin,phase-data,"
2,",ci37389194,",0.22,",ci,",automatic,",dyfi,focal-mechanism,geoserve,nearby-cities,o..."
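
The difference between "contains", "starts with" and "ends with" only shows up when a column name has the pattern in the middle. A minimal sketch, assuming a toy frame with a hypothetical extra column `max_mag` that does not exist in the earthquake data:

```python
import pandas as pd

# Hypothetical toy frame; `max_mag` is invented to show the difference.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0]],
    columns=["mag", "magType", "max_mag", "time", "title", "status"],
)

contains_mag = toy.filter(like="mag").columns.tolist()     # substring match
starts_mag = toy.filter(regex="^mag").columns.tolist()     # anchored at the start
starts_ti = toy.filter(regex="^ti").columns.tolist()
ends_s = toy.filter(regex="s$").columns.tolist()           # anchored at the end
exact = toy.filter(items=["mag", "time"]).columns.tolist() # exact names

print(contains_mag, starts_mag)
```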


## Subset rows

### Select the rows with `mag >= 0.7`
- How many rows are there?

In [283]:
df.loc[df["mag"] >= 0.7, :].shape[0]

7136

### Select the rows with `tsunami = 1` AND `alert = 'red'`

- How many rows are there?

In [284]:
cond = (df["tsunami"] == 1) & (df["alert"] == "red")
df.loc[cond, :].shape[0]

1

### Select the rows with `tsunami = 1` OR `alert = 'red'`

- How many rows are there?

In [285]:
cond = (df["tsunami"] == 1) | (df["alert"] == "red")
df.loc[cond, :].shape[0]

61
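
The parentheses around each comparison are required: `&` and `|` bind more tightly than `==`, so without them Python would try to evaluate `1 & df["alert"]` first. A sketch on a hypothetical four-row frame; it also shows two alternatives that are assumptions of this example rather than part of the exercise: counting matches with `mask.sum()` and writing the filter with `.query()`.

```python
import pandas as pd

# Hypothetical toy frame with made-up values.
toy = pd.DataFrame({
    "tsunami": [1, 0, 1, 0],
    "alert": ["red", "red", "green", "green"],
})

both = (toy["tsunami"] == 1) & (toy["alert"] == "red")    # AND
either = (toy["tsunami"] == 1) | (toy["alert"] == "red")  # OR

# A boolean mask sums to the number of True values.
n_both = int(both.sum())
n_either = int(either.sum())

# The same AND filter written as a query string.
n_both_query = len(toy.query("tsunami == 1 and alert == 'red'"))

print(n_both, n_either, n_both_query)
```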

### Create a series of the `mag` values for the rows with `tsunami = 1` OR `alert = 'red'`

- Compute the average `mag` of this series

In [286]:
cond = (df["tsunami"] == 1) | (df["alert"] == "red")
df.loc[cond, "mag"].mean()

5.3580327868852455

### Create a DF with the `mag` and `type` columns for the rows with `tsunami = 1` OR `alert = 'red'`
- Print the first 3 rows

In [287]:
cond = (df["tsunami"] == 1) | (df["alert"] == "red")
df.loc[cond, ["mag", "type"]].head(3)

Unnamed: 0,mag,type
36,5.0,earthquake
118,6.7,earthquake
501,5.6,earthquake


### Select the rows whose `type` is `earthquake`, `ice quake`, or `explosion`
- Use `.isin()`
- How many rows are there?

In [288]:
values = ["earthquake", "ice quake", "explosion"]
df.loc[df["type"].isin(values), :].shape

(9233, 26)
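
`.isin()` returns a boolean mask, so it can be inverted with `~` to get the complementary rows. A minimal sketch on a hypothetical series of event types (the two non-matching labels are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame; "quarry blast" and "other event" are illustrative labels.
toy = pd.DataFrame({
    "type": ["earthquake", "quarry blast", "ice quake", "explosion", "other event"],
})

values = ["earthquake", "ice quake", "explosion"]
mask = toy["type"].isin(values)

kept = toy.loc[mask, :]      # rows whose type is in the list
dropped = toy.loc[~mask, :]  # rows whose type is NOT in the list

print(len(kept), dropped["type"].tolist())
```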

## Add new columns

- Create `df2` as a copy of `df` with the following columns: `['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']`
- Perform the following operations on `df2`

In [289]:
cols = [
    'time', 'title', 'place', 
    'magType', 'mag', 'alert', 'tsunami'
]

df2 = df.loc[:, cols].copy()

### Add a column `ones` with the value `1`

In [290]:
df2["ones"] = 1

In [291]:
df2.head(2)

Unnamed: 0,time,title,place,magType,mag,alert,tsunami,ones
0,1539475168010,"M 1.4 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.35,,0,1
1,1539475129610,"M 1.3 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.29,,0,1


### Add a column `mag_sign` with the values `-1, 0, 1` representing the sign of `mag`

Approach 1

In [94]:
df2["mag_sign"] = np.nan  # start as missing so the column stays numeric
df2.loc[df2["mag"] > 0, "mag_sign"] = 1
df2.loc[df2["mag"] < 0, "mag_sign"] = -1
df2.loc[df2["mag"] == 0, "mag_sign"] = 0

In [294]:
df2["mag_sign"].value_counts()

 1.0    8792
-1.0     491
 0.0      48
Name: mag_sign, dtype: int64

Approach 2

In [295]:
df2["mag_sign"] = df2["mag"].apply(np.sign)

In [296]:
df2["mag_sign"].value_counts()

 1.0    8792
-1.0     491
 0.0      48
Name: mag_sign, dtype: int64
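
A possible third approach, sketched here as an assumption rather than as part of the original exercise: NumPy functions such as `np.sign` can be applied to the whole series at once, which gives the same result as `.apply(np.sign)` without calling the function once per element. Missing magnitudes stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical magnitudes, including a zero and a missing value.
mag = pd.Series([1.35, -0.26, 0.0, np.nan], name="mag")

# np.sign works element-wise on the whole series and returns a series.
mag_sign = np.sign(mag)

print(mag_sign.tolist())
```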

## Modify columns

- Create `df2` as a copy of `df` with the following columns: `['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']`
- Perform the following operations on `df2`

In [177]:
df2 = df.loc[:, ['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']].copy()

### Multiply `mag` by 100 and assign the result back to `mag`

In [297]:
df2["mag"] = df["mag"] * 100

In [298]:
df2.head(2)

Unnamed: 0,time,title,place,magType,mag,alert,tsunami,ones,mag_sign
0,1539475168010,"M 1.4 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,135.0,,0,1,1.0
1,1539475129610,"M 1.3 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,129.0,,0,1,1.0


### Convert `place` to uppercase and save the change

In [299]:
df2["place"] = df2["place"].str.upper()

In [300]:
df2.head(2)

Unnamed: 0,time,title,place,magType,mag,alert,tsunami,ones,mag_sign
0,1539475168010,"M 1.4 - 9km NE of Aguanga, CA","9KM NE OF AGUANGA, CA",ml,135.0,,0,1,1.0
1,1539475129610,"M 1.3 - 9km NE of Aguanga, CA","9KM NE OF AGUANGA, CA",ml,129.0,,0,1,1.0
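
In this data set `place` has no missing values, but `.str` methods are also safe on columns that do: missing entries are passed through instead of raising an error. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing entry.
place = pd.Series(["9km NE of Aguanga, CA", np.nan], name="place")

# .str.upper() transforms the strings and leaves the missing value missing.
upper = place.str.upper()

print(upper.iloc[0])
```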


## Delete columns

- Create `df2` as a copy of `df` with the following columns: `['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']`
- Perform the following operations on `df2`

In [301]:
df2 = df.loc[:, ['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']].copy()

In [302]:
df2.head(2)

Unnamed: 0,time,title,place,magType,mag,alert,tsunami
0,1539475168010,"M 1.4 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.35,,0
1,1539475129610,"M 1.3 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.29,,0


### Delete the `time` column using `del`

In [303]:
del df2["time"]

In [304]:
df2.head(2)

Unnamed: 0,title,place,magType,mag,alert,tsunami
0,"M 1.4 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.35,,0
1,"M 1.3 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,1.29,,0


### Pop the `mag` column: assign it to the variable `mag` and remove it from `df2`

In [305]:
mag = df2.pop("mag")

In [306]:
df2.head(2)

Unnamed: 0,title,place,magType,alert,tsunami
0,"M 1.4 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,,0
1,"M 1.3 - 9km NE of Aguanga, CA","9km NE of Aguanga, CA",ml,,0


### Delete several columns at once using `.drop()`
- Delete the `title`, `place`, and `alert` columns
- To keep the change, set `inplace=True`

In [307]:
# Approach 1: set axis=1
df2.drop(["title", "place", "alert"], axis=1, inplace=True)

In [309]:
df2.head(2)

Unnamed: 0,magType,tsunami
0,ml,0
1,ml,0


Recreate `df2`

In [310]:
df2 = df.loc[:, ['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']].copy()

In [311]:
# Approach 2: pass columns
df2.drop(columns=["title", "place", "alert"], inplace=True)
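
Without `inplace=True`, `.drop()` leaves the original untouched and returns a new DataFrame, so the other way to keep the change is to assign the result to a name. A sketch on a hypothetical one-row frame; `not_a_column` is a deliberately non-existent label used to show `errors="ignore"`.

```python
import pandas as pd

# Hypothetical toy frame with made-up values.
toy = pd.DataFrame({
    "title": ["M 1.4 - 9km NE of Aguanga, CA"],
    "place": ["9km NE of Aguanga, CA"],
    "alert": ["green"],
    "mag": [1.35],
})

# Returns a new frame; `toy` itself still has all four columns.
trimmed = toy.drop(columns=["title", "place", "alert"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
safe = toy.drop(columns=["title", "not_a_column"], errors="ignore")

print(trimmed.columns.tolist(), toy.shape[1], safe.columns.tolist())
```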

## Concatenate DFs vertically

Create two DFs, `ts_0` and `ts_1`, with `tsunami` equal to 0 and 1 respectively. Keep only the columns `'mag', 'alert', 'tsunami'`.

In [312]:
cols = ['mag', 'alert', 'tsunami']

# Explicit copies, following the key point about creating a new DF from the original
ts_0 = df.loc[df["tsunami"] == 0, cols].copy() # Data source 1
ts_1 = df.loc[df["tsunami"] == 1, cols].copy() # Data source 2

In [313]:
ts_0.head(2)

Unnamed: 0,mag,alert,tsunami
0,1.35,,0
1,1.29,,0


In [314]:
ts_1.head(1)

Unnamed: 0,mag,alert,tsunami
36,5.0,,1


### Stack `ts_0` and `ts_1` on top of each other using `pd.concat()`

- Note: set `ignore_index=True` to reset the index in the result

In [315]:
pd.concat([ts_0, ts_1], ignore_index=True)

Unnamed: 0,mag,alert,tsunami
0,1.35,,0
1,1.29,,0
2,3.42,,0
3,0.44,,0
4,2.16,,0
...,...,...,...
9327,5.40,,1
9328,5.10,,1
9329,5.10,green,1
9330,5.20,,1
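
To see what `ignore_index=True` changes, compare the index of the result with and without it. A minimal sketch on two hypothetical frames that reuse the names `ts_0` and `ts_1` with made-up rows:

```python
import pandas as pd

# Hypothetical stand-ins for the two filtered frames, keeping their original row labels.
ts_0 = pd.DataFrame({"mag": [1.35, 1.29], "tsunami": [0, 0]}, index=[0, 1])
ts_1 = pd.DataFrame({"mag": [5.0], "tsunami": [1]}, index=[36])

kept = pd.concat([ts_0, ts_1])                      # original labels are carried over
reset = pd.concat([ts_0, ts_1], ignore_index=True)  # labels are replaced by 0..n-1

print(kept.index.tolist())   # [0, 1, 36]
print(reset.index.tolist())  # [0, 1, 2]
```

Keeping the original labels can be useful when rows need to be traced back to `df`; resetting them avoids gaps and duplicate labels in the stacked result.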
