# Pandas Theory

## 1. Introduction to Pandas

- **pandas** is a powerful Python library for **processing and analyzing tabular data**, commonly used with data from CSV, Excel, SQL, JSON, and similar sources.

- Two main data structures

| Structure       | Description                                                                          |
| --------------- | ------------------------------------------------------------------------------------ |
| **`Series`**    | 1-dimensional labeled array (has an index)                                           |
| **`DataFrame`** | 2-dimensional table (like an Excel sheet), made of `Series` columns sharing one index |

In [2]:
from IPython.display import display
import pandas as pd

## 2. Creating data

- **`pd.Series(data, index=None, dtype=None, name=None)`**

    - **data**: list, array, dict, or scalar
    - **index**: list of labels, one per value
    - **dtype**: data type (int, float, str, ...)
    - **name**: name of the Series

In [3]:
sr = pd.Series([1, 2, 3, 4, 5])
print(sr, '\n')

sr_indexes = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'], name='MySeries')
print(sr_indexes)

0    1
1    2
2    3
3    4
4    5
dtype: int64 

a    1
b    2
c    3
d    4
e    5
Name: MySeries, dtype: int64
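
The parameter list also mentions dict and scalar inputs and `dtype`, which the cell above does not exercise. A minimal sketch, with illustrative variable names that are not part of the notebook:

```python
import pandas as pd

# From a dict: the keys become the index labels
sr_dict = pd.Series({'a': 1.5, 'b': 2.5, 'c': 3.5}, name='FromDict')

# From a scalar: the value is repeated once per index label
sr_scalar = pd.Series(7, index=['x', 'y', 'z'])

# dtype forces a conversion when the Series is built
sr_float = pd.Series([1, 2, 3], dtype='float64')

print(sr_dict['b'])        # label-based access
print(sr_scalar.tolist())
print(sr_float.dtype)
```

A scalar needs an explicit `index` to be repeated over; without one, the result has a single element.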


- **`pd.DataFrame(data, index=None, columns=None, dtype=None)`**

    - **data**: dict, list of lists/tuples, ndarray
    - **index**: row labels
    - **columns**: column labels
    - **dtype**: a single data type applied to the whole DataFrame

In [4]:
df = pd.DataFrame({'Yesterday': [1, 2, 3], 'Today': [4, 5, 6]})
df_indexes = pd.DataFrame({'Yesterday': [1, 2, 3], 'Today': [4, 5, 6]}, index=['a', 'b', 'c'])

display(df)
display(df_indexes)

Unnamed: 0,Yesterday,Today
0,1,4
1,2,5
2,3,6


Unnamed: 0,Yesterday,Today
a,1,4
b,2,5
c,3,6
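
The cell above builds the frame from a dict. As a sketch of the list-of-lists form, where each inner list is one row and `columns` supplies the names (the variable names here are illustrative):

```python
import pandas as pd

rows = [[1, 4], [2, 5], [3, 6]]  # one inner list per row

df_rows = pd.DataFrame(
    rows,
    columns=['Yesterday', 'Today'],
    index=['a', 'b', 'c'],
    dtype='float64',  # applied to every column
)

print(df_rows)
print(df_rows.dtypes)
```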


## 3. Reading/writing data

- **`pd.read_csv(filepath, sep=',', header='infer', names=None, index_col=None, usecols=None, dtype=None)`**

    - **filepath**: path to the CSV file
    - **sep**: delimiter character (default `,`)
    - **header**: row that holds the column names (`None` if the file has no header row)
    - **names**: list of column names, used when `header=None`
    - **index_col**: column to use as the index
    - **usecols**: columns to read
    - **dtype**: data type for the columns

In [5]:
df = pd.read_csv('melb_data.csv', usecols=['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'Price'])
df.head()

Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
0,2,1480000.0,2.5,202.0,
1,2,1035000.0,2.5,156.0,79.0
2,3,1465000.0,2.5,134.0,150.0
3,3,850000.0,2.5,94.0,
4,4,1600000.0,2.5,120.0,142.0


- **`df.to_csv(path, sep=',', index=True, header=True)`**

    - **path**: where to save the CSV file
    - **sep**: delimiter character (default `,`)
    - **index**: whether to write the index
    - **header**: whether to write the column names

In [6]:
df.to_csv('melb_data_sample.csv', index=False)
melb_sample = pd.read_csv('melb_data_sample.csv')
melb_sample.head()

Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
0,2,1480000.0,2.5,202.0,
1,2,1035000.0,2.5,156.0,79.0
2,3,1465000.0,2.5,134.0,150.0
3,3,850000.0,2.5,94.0,
4,4,1600000.0,2.5,120.0,142.0
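
To try `header=None`, `names`, `index_col`, and `sep` without a file on disk, the sketch below reads from an in-memory buffer. The data and the `Id` column are made up for illustration and do not come from `melb_data.csv`:

```python
import io
import pandas as pd

# Hypothetical headerless, semicolon-separated data
raw = "h1;2;1480000\nh2;3;1465000\nh3;4;1600000\n"

houses = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    header=None,                      # the data has no header row
    names=['Id', 'Rooms', 'Price'],   # so the names are supplied here
    index_col='Id',
    dtype={'Rooms': 'int64', 'Price': 'float64'},
)

# Write back to a buffer, keeping the index, then read it again
buffer = io.StringIO()
houses.to_csv(buffer, sep=';', index=True, header=True)
reread = pd.read_csv(io.StringIO(buffer.getvalue()), sep=';', index_col='Id')

print(houses)
print(buffer.getvalue())
```

Writing with `index=True` and reading back with `index_col='Id'` restores the same index. With `index=False` the labels would be lost.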


## 4. Inspecting data

| Function        | Purpose                                                               |
| --------------- | --------------------------------------------------------------------- |
| `df.head(n)`    | Show the first n rows                                                 |
| `df.tail(n)`    | Show the last n rows                                                  |
| `df.info()`     | Overview (row count, columns, dtypes, non-null counts)                |
| `df.describe()` | Descriptive statistics (mean, std, min, max, ...) for numeric columns |
| `df.shape`      | Number of rows and columns                                            |
| `df.columns`    | Column labels                                                         |
| `df.index`      | Row labels                                                            |

In [7]:
print("Head")
display(df.head(5))

print("Tail")
display(df.tail(5))

print("Shape", df.shape)

print("\nInfo")
df.info()

print("\nDescribe")
display(df.describe())

print("\nColumns", df.columns)

print("\nIndex", df.index)

Head


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
0,2,1480000.0,2.5,202.0,
1,2,1035000.0,2.5,156.0,79.0
2,3,1465000.0,2.5,134.0,150.0
3,3,850000.0,2.5,94.0,
4,4,1600000.0,2.5,120.0,142.0


Tail


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
13575,4,1245000.0,16.7,652.0,
13576,3,1031000.0,6.8,333.0,133.0
13577,3,1170000.0,6.8,436.0,
13578,4,2500000.0,6.8,866.0,157.0
13579,4,1285000.0,6.3,362.0,112.0


Shape (13580, 5)

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rooms         13580 non-null  int64  
 1   Price         13580 non-null  float64
 2   Distance      13580 non-null  float64
 3   Landsize      13580 non-null  float64
 4   BuildingArea  7130 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 530.6 KB

Describe


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
count,13580.0,13580.0,13580.0,13580.0,7130.0
mean,2.937997,1075684.0,10.137776,558.416127,151.96765
std,0.955748,639310.7,5.868725,3990.669241,541.014538
min,1.0,85000.0,0.0,0.0,0.0
25%,2.0,650000.0,6.1,177.0,93.0
50%,3.0,903000.0,9.2,440.0,126.0
75%,3.0,1330000.0,13.0,651.0,174.0
max,10.0,9000000.0,48.1,433014.0,44515.0



Columns Index(['Rooms', 'Price', 'Distance', 'Landsize', 'BuildingArea'], dtype='object')

Index RangeIndex(start=0, stop=13580, step=1)


## 5. Accessing data

### Selecting columns

In [8]:
# Series
display(df['Rooms'].head(10))
display(df.Rooms.head(10))

# DataFrame
df[['Rooms', 'Price']].head(10)

0    2
1    2
2    3
3    3
4    4
5    2
6    3
7    2
8    1
9    2
Name: Rooms, dtype: int64

0    2
1    2
2    3
3    3
4    4
5    2
6    3
7    2
8    1
9    2
Name: Rooms, dtype: int64

Unnamed: 0,Rooms,Price
0,2,1480000.0
1,2,1035000.0
2,3,1465000.0
3,3,850000.0
4,4,1600000.0
5,2,941000.0
6,3,1876000.0
7,2,1636000.0
8,1,300000.0
9,2,1097000.0


### Selecting rows

- Syntax:

    **`df.loc[row_selector, col_selector]`**

    - `row_selector`:
        - A label (`'row1'`)
        - A list of labels (`['row1', 'row2']`)
        - A label slice (`'row1':'row5'`), which includes `'row5'`
        - A boolean condition (`df['Age'] > 30`)
        - `:` to select everything
    - `col_selector`: the same options, applied to column labels

    **`df.iloc[row_indexer, col_indexer]`**

    - `row_indexer`:
        - An integer position (`0`, `1`, `2`)
        - A list of positions (`[0, 2, 4]`)
        - A position slice (`0:5`), which excludes `5`
        - `:` to select everything
    - `col_indexer`: the same options, applied to column positions

In [9]:
display(df.loc[:, ['Rooms', 'Price']].head(2))
display(df.loc[10:13, ['Rooms', 'Price']])

display(df.iloc[10:16, 0:3])
display(df.iloc[0])

Unnamed: 0,Rooms,Price
0,2,1480000.0
1,2,1035000.0


Unnamed: 0,Rooms,Price
10,2,700000.0
11,3,1350000.0
12,2,750000.0
13,2,1172500.0


Unnamed: 0,Rooms,Price,Distance
10,2,700000.0,2.5
11,3,1350000.0,2.5
12,2,750000.0,2.5
13,2,1172500.0,2.5
14,1,441000.0,2.5
15,2,1310000.0,2.5


Rooms                 2.0
Price           1480000.0
Distance              2.5
Landsize            202.0
BuildingArea          NaN
Name: 0, dtype: float64
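
With the default `RangeIndex`, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a small made-up frame whose labels start at 10 so the two can be told apart:

```python
import pandas as pd

prices = pd.DataFrame(
    {'Rooms': [2, 3, 2, 1], 'Price': [700000.0, 1350000.0, 750000.0, 441000.0]},
    index=[10, 11, 12, 13],  # labels differ from positions 0..3
)

by_label = prices.loc[10:12, 'Price']   # labels 10, 11, 12: the end label is included
by_position = prices.iloc[0:2, 1]       # positions 0 and 1: the end position is excluded

print(by_label)
print(by_position)
print(prices.loc[11, 'Rooms'], prices.iloc[-1, 0])
```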

### Filtering by condition

In [10]:
print("Price")
display(df[df['Price'] > 2000000].head(2))
    
print("Rooms")
display(df[df['Rooms'] >= 7].head(2))

rooms_price = df[df['Rooms'].isin([2, 3]) & (df['Price'] < 1000000)]

print("Rooms and Price")
display(rooms_price.head(2))

Price


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
80,3,2850000.0,3.3,211.0,198.0
85,4,2300000.0,3.3,153.0,180.0


Rooms


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
379,8,2950000.0,11.0,1472.0,618.0
589,7,1350000.0,9.2,942.0,


Rooms and Price


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea
3,3,850000.0,2.5,94.0,
5,2,941000.0,2.5,181.0,


## 6. Manipulating data

### Adding columns
- Syntax

    - Direct assignment

        ```python
        df['NewCol'] = values
        ```
    - Using `assign()`

        ```python
        df = df.assign(NewCol=values)
        ```
    
    - Using `insert()`

        ```python
        df.insert(loc=position, column='NewCol', value=values)
        ```
    - Using `loc` (`iloc` can only overwrite a column that already exists)

        ```python
        df.loc[:, 'NewCol'] = values
        ```
        ```python
        # Overwrites the column at position col_index; raises IndexError if that position does not exist
        df.iloc[:, col_index] = values
        ```

In [11]:
df['Salary'] = 50000
display(df)

df['Continent'] = 'Australia'
display(df)

df = df.assign(Vehicle='Car')
display(df)

df.insert(2, 'City', 'Melbourne')
display(df)

Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea,Salary
0,2,1480000.0,2.5,202.0,,50000
1,2,1035000.0,2.5,156.0,79.0,50000
2,3,1465000.0,2.5,134.0,150.0,50000
3,3,850000.0,2.5,94.0,,50000
4,4,1600000.0,2.5,120.0,142.0,50000
...,...,...,...,...,...,...
13575,4,1245000.0,16.7,652.0,,50000
13576,3,1031000.0,6.8,333.0,133.0,50000
13577,3,1170000.0,6.8,436.0,,50000
13578,4,2500000.0,6.8,866.0,157.0,50000


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea,Salary,Continent
0,2,1480000.0,2.5,202.0,,50000,Australia
1,2,1035000.0,2.5,156.0,79.0,50000,Australia
2,3,1465000.0,2.5,134.0,150.0,50000,Australia
3,3,850000.0,2.5,94.0,,50000,Australia
4,4,1600000.0,2.5,120.0,142.0,50000,Australia
...,...,...,...,...,...,...,...
13575,4,1245000.0,16.7,652.0,,50000,Australia
13576,3,1031000.0,6.8,333.0,133.0,50000,Australia
13577,3,1170000.0,6.8,436.0,,50000,Australia
13578,4,2500000.0,6.8,866.0,157.0,50000,Australia


Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea,Salary,Continent,Vehicle
0,2,1480000.0,2.5,202.0,,50000,Australia,Car
1,2,1035000.0,2.5,156.0,79.0,50000,Australia,Car
2,3,1465000.0,2.5,134.0,150.0,50000,Australia,Car
3,3,850000.0,2.5,94.0,,50000,Australia,Car
4,4,1600000.0,2.5,120.0,142.0,50000,Australia,Car
...,...,...,...,...,...,...,...,...
13575,4,1245000.0,16.7,652.0,,50000,Australia,Car
13576,3,1031000.0,6.8,333.0,133.0,50000,Australia,Car
13577,3,1170000.0,6.8,436.0,,50000,Australia,Car
13578,4,2500000.0,6.8,866.0,157.0,50000,Australia,Car


Unnamed: 0,Rooms,Price,City,Distance,Landsize,BuildingArea,Salary,Continent,Vehicle
0,2,1480000.0,Melbourne,2.5,202.0,,50000,Australia,Car
1,2,1035000.0,Melbourne,2.5,156.0,79.0,50000,Australia,Car
2,3,1465000.0,Melbourne,2.5,134.0,150.0,50000,Australia,Car
3,3,850000.0,Melbourne,2.5,94.0,,50000,Australia,Car
4,4,1600000.0,Melbourne,2.5,120.0,142.0,50000,Australia,Car
...,...,...,...,...,...,...,...,...,...
13575,4,1245000.0,Melbourne,16.7,652.0,,50000,Australia,Car
13576,3,1031000.0,Melbourne,6.8,333.0,133.0,50000,Australia,Car
13577,3,1170000.0,Melbourne,6.8,436.0,,50000,Australia,Car
13578,4,2500000.0,Melbourne,6.8,866.0,157.0,50000,Australia,Car
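
The cell above fills each new column with one constant. New columns are more often computed from existing ones. A sketch on a made-up two-row frame, where the `PricePerSqm` and `PriceMillions` columns are illustrative:

```python
import pandas as pd

homes = pd.DataFrame({'Price': [1480000.0, 1035000.0], 'Landsize': [202.0, 156.0]})

# Element-wise arithmetic between two columns
homes['PricePerSqm'] = homes['Price'] / homes['Landsize']

# assign returns a new DataFrame; the lambda receives the frame itself
homes = homes.assign(PriceMillions=lambda d: d['Price'] / 1_000_000)

# insert works in place and places the column at the given position
homes.insert(0, 'City', ['Melbourne', 'Melbourne'])

print(homes)
```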


### Dropping columns/rows
- Syntax

    - Dropping columns

        ```python
        df.drop(columns=['Col1', 'Col2'], inplace=True)

        # pop removes the column in place and returns it as a Series
        df.pop('ColName')
        ```

    - Dropping rows

        ```python
        df.drop(index=[0, 5], inplace=True)
        ```

In [12]:
df.drop(columns=['Salary'], inplace=True)
display(df)

col = df.pop('City')
display(col)

df.drop(index=[0, 5], inplace=True)
display(df)


Unnamed: 0,Rooms,Price,City,Distance,Landsize,BuildingArea,Continent,Vehicle
0,2,1480000.0,Melbourne,2.5,202.0,,Australia,Car
1,2,1035000.0,Melbourne,2.5,156.0,79.0,Australia,Car
2,3,1465000.0,Melbourne,2.5,134.0,150.0,Australia,Car
3,3,850000.0,Melbourne,2.5,94.0,,Australia,Car
4,4,1600000.0,Melbourne,2.5,120.0,142.0,Australia,Car
...,...,...,...,...,...,...,...,...
13575,4,1245000.0,Melbourne,16.7,652.0,,Australia,Car
13576,3,1031000.0,Melbourne,6.8,333.0,133.0,Australia,Car
13577,3,1170000.0,Melbourne,6.8,436.0,,Australia,Car
13578,4,2500000.0,Melbourne,6.8,866.0,157.0,Australia,Car


0        Melbourne
1        Melbourne
2        Melbourne
3        Melbourne
4        Melbourne
           ...    
13575    Melbourne
13576    Melbourne
13577    Melbourne
13578    Melbourne
13579    Melbourne
Name: City, Length: 13580, dtype: object

Unnamed: 0,Rooms,Price,Distance,Landsize,BuildingArea,Continent,Vehicle
1,2,1035000.0,2.5,156.0,79.0,Australia,Car
2,3,1465000.0,2.5,134.0,150.0,Australia,Car
3,3,850000.0,2.5,94.0,,Australia,Car
4,4,1600000.0,2.5,120.0,142.0,Australia,Car
6,3,1876000.0,2.5,245.0,210.0,Australia,Car
...,...,...,...,...,...,...,...
13575,4,1245000.0,16.7,652.0,,Australia,Car
13576,3,1031000.0,6.8,333.0,133.0,Australia,Car
13577,3,1170000.0,6.8,436.0,,Australia,Car
13578,4,2500000.0,6.8,866.0,157.0,Australia,Car
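
Without `inplace=True`, `drop` leaves the original frame alone and returns a modified copy. A sketch on made-up data showing that form, plus `errors='ignore'` and `reset_index`:

```python
import pandas as pd

homes = pd.DataFrame({'Rooms': [2, 3, 4], 'Price': [1.0, 2.0, 3.0], 'City': ['M', 'M', 'M']})

trimmed = homes.drop(columns=['City'])                      # homes still has City
fewer_rows = homes.drop(index=[0]).reset_index(drop=True)   # renumber rows from 0
ignored = homes.drop(columns=['Missing'], errors='ignore')  # no error for an unknown label

city = homes.pop('City')   # removes City from homes and returns it

print(trimmed.columns.tolist(), homes.columns.tolist())
print(fewer_rows)
```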


### Renaming columns
- Syntax

    ```python
    df.rename(columns={'OldName':'NewName'}, inplace=True)
    ```

In [13]:
df.rename(columns={'Price':'Cost'}, inplace=True)
display(df.head())

Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
1,2,1035000.0,2.5,156.0,79.0,Australia,Car
2,3,1465000.0,2.5,134.0,150.0,Australia,Car
3,3,850000.0,2.5,94.0,,Australia,Car
4,4,1600000.0,2.5,120.0,142.0,Australia,Car
6,3,1876000.0,2.5,245.0,210.0,Australia,Car


### Getting unique values
- Syntax

    ```python
    # unique() returns a NumPy array of the distinct values, not a Series
    values = df['Column'].unique()
    ```

In [14]:
continent = df['Continent'].unique()
display(continent)

array(['Australia'], dtype=object)

### Counting how often each value occurs
- Syntax

    ```python
    counts = df['Column'].value_counts()
    ```

In [15]:
counts = df['Continent'].value_counts()
display(counts)

Continent
Australia    13578
Name: count, dtype: int64
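
The `Continent` column holds a single value, so it does not show much. The sketch below uses a made-up Series with a missing value to contrast `unique`, `nunique`, and the `normalize` and `dropna` options of `value_counts`:

```python
import pandas as pd

rooms = pd.Series([2, 3, 3, 2, 3, None], name='Rooms')  # None becomes NaN

distinct = rooms.unique()                      # NumPy array; NaN is included
n_distinct = rooms.nunique()                   # NaN is excluded by default
counts = rooms.value_counts()                  # most frequent first; NaN dropped
shares = rooms.value_counts(normalize=True)    # fractions instead of counts
with_nan = rooms.value_counts(dropna=False)    # NaN gets its own row

print(distinct, n_distinct)
print(counts)
print(shares)
```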


### Sorting
- Syntax

    ```python
    df.sort_values(by='Column', ascending=False, inplace=True)
    # Without inplace=True, sort_index returns a new DataFrame, so keep the result
    df = df.sort_index(ascending=True)
    ```

In [16]:
df.sort_values(by='Landsize', ascending=True, inplace=True)
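# Note: sort_index returns a new DataFrame; the result is discarded here, so df stays sorted by Landsize
df.sort_index(ascending=True)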
display(df)

Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,500000.0,2.5,0.0,60.0,Australia,Car
2114,2,672000.0,1.6,0.0,60.0,Australia,Car
193,3,720000.0,11.1,0.0,130.0,Australia,Car
221,2,599000.0,6.3,0.0,76.0,Australia,Car
223,3,995000.0,6.3,0.0,100.0,Australia,Car
...,...,...,...,...,...,...,...
5194,3,572000.0,11.2,41400.0,,Australia,Car
13245,5,1355000.0,48.1,44500.0,44515.0,Australia,Car
687,3,2000000.0,9.2,75100.0,,Australia,Car
10504,3,1085000.0,34.6,76000.0,,Australia,Car
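
A sketch of sorting by several keys with a separate direction for each, and of restoring index order by keeping the value that `sort_index` returns. The frame is made up for illustration:

```python
import pandas as pd

homes = pd.DataFrame(
    {'Rooms': [3, 2, 3, 2], 'Landsize': [134.0, 156.0, 94.0, 202.0]},
    index=[2, 0, 3, 1],
)

# Rooms ascending, then Landsize descending within each Rooms value
by_two_keys = homes.sort_values(by=['Rooms', 'Landsize'], ascending=[True, False])

# sort_index returns a new frame; homes keeps its original order
by_index = homes.sort_index()

print(by_two_keys)
print(by_index)
```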


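### Handling missing values
- Syntax

    ```python
    # Each call returns a new object; assign the result to keep it
    df.fillna(0)                                 # replace every NaN with 0
    df.dropna()                                  # drop rows that contain any NaN
    df['Column'].fillna(df['Column'].mean())     # fill one column with its mean
    ```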

In [17]:
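# None of these calls modifies df: each returns a new object that is discarded here,
# so the three displays below are identical and still contain NaN.
df.fillna(0)
display(df)

df.dropna()
display(df)

df['Cost'].fillna(df['Cost'].mean())
display(df)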

Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,500000.0,2.5,0.0,60.0,Australia,Car
2114,2,672000.0,1.6,0.0,60.0,Australia,Car
193,3,720000.0,11.1,0.0,130.0,Australia,Car
221,2,599000.0,6.3,0.0,76.0,Australia,Car
223,3,995000.0,6.3,0.0,100.0,Australia,Car
...,...,...,...,...,...,...,...
5194,3,572000.0,11.2,41400.0,,Australia,Car
13245,5,1355000.0,48.1,44500.0,44515.0,Australia,Car
687,3,2000000.0,9.2,75100.0,,Australia,Car
10504,3,1085000.0,34.6,76000.0,,Australia,Car


Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,500000.0,2.5,0.0,60.0,Australia,Car
2114,2,672000.0,1.6,0.0,60.0,Australia,Car
193,3,720000.0,11.1,0.0,130.0,Australia,Car
221,2,599000.0,6.3,0.0,76.0,Australia,Car
223,3,995000.0,6.3,0.0,100.0,Australia,Car
...,...,...,...,...,...,...,...
5194,3,572000.0,11.2,41400.0,,Australia,Car
13245,5,1355000.0,48.1,44500.0,44515.0,Australia,Car
687,3,2000000.0,9.2,75100.0,,Australia,Car
10504,3,1085000.0,34.6,76000.0,,Australia,Car


Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,500000.0,2.5,0.0,60.0,Australia,Car
2114,2,672000.0,1.6,0.0,60.0,Australia,Car
193,3,720000.0,11.1,0.0,130.0,Australia,Car
221,2,599000.0,6.3,0.0,76.0,Australia,Car
223,3,995000.0,6.3,0.0,100.0,Australia,Car
...,...,...,...,...,...,...,...
5194,3,572000.0,11.2,41400.0,,Australia,Car
13245,5,1355000.0,48.1,44500.0,44515.0,Australia,Car
687,3,2000000.0,9.2,75100.0,,Australia,Car
10504,3,1085000.0,34.6,76000.0,,Australia,Car
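
The three displays above are identical because `fillna` and `dropna` return new objects and do not change `df`. A sketch on made-up data that keeps each result by assigning it:

```python
import pandas as pd

homes = pd.DataFrame({
    'Price': [850000.0, None, 1465000.0],
    'BuildingArea': [None, 79.0, 150.0],
})

zero_filled = homes.fillna(0)                       # every NaN replaced by 0
complete_rows = homes.dropna()                      # only rows without any NaN
area_known = homes.dropna(subset=['BuildingArea'])  # only check one column

# Assign back to make the fill stick for a single column
homes['Price'] = homes['Price'].fillna(homes['Price'].mean())

print(zero_filled)
print(complete_rows)
print(homes)
```

`mean()` skips NaN by default, so the fill value here is the average of the two known prices.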


## 7. Basic statistics

| Function            | Purpose                                                                     |
| ------------------- | --------------------------------------------------------------------------- |
| `df.mean()`         | Mean                                                                        |
| `df.median()`       | Median                                                                      |
| `df.min()/df.max()` | Minimum/maximum value                                                       |
| `df.std()/df.var()` | Standard deviation / variance                                               |
| `df.value_counts()` | Count how often each value occurs (each distinct row, for a DataFrame)      |

- **`describe()`** gives a quick summary of these basic statistics

In [18]:
print(f'Mean cost: {df["Cost"].mean()}')
print(f'Median cost: {df["Cost"].median()}')
print(f'Min cost: {df["Cost"].min()}')
print(f'Max cost: {df["Cost"].max()}')
print(f'Std cost: {df["Cost"].std()}')
print(f'Var cost: {df["Cost"].var()}')

Mean cost: 1075664.2214611871
Median cost: 903000.0
Min cost: 85000.0
Max cost: 9000000.0
Std cost: 639347.3491467353
Var cost: 408765032860.95746
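
The cell above computes statistics on one column. Calling them on a whole frame that also has text columns usually needs `numeric_only=True`. The sketch below, on made-up data, also shows that `std` and `var` default to the sample formulas (`ddof=1`):

```python
import pandas as pd

homes = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [1000000.0, 2000000.0, 3000000.0],
    'City': ['Melbourne', 'Melbourne', 'Melbourne'],
})

col_means = homes.mean(numeric_only=True)   # skips the text column
price_std = homes['Price'].std()            # sample standard deviation (ddof=1)
price_var = homes['Price'].var()            # sample variance, equal to std squared
pop_std = homes['Price'].std(ddof=0)        # population standard deviation

print(col_means)
print(price_std, price_var, pop_std)
```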


## 8. Mapping data

### Using map()

- Using **`map()`**: `Series.map` operates on a single Series (column); values with no match in the dict or Series become `NaN`

    ```python
    # Map with a dict
    df['Column'] = df['Column'].map(mapping_dict)

    # Map with a function
    mean = df['Column'].mean()
    df['Column'] = df['Column'].map(lambda x: x - mean)

    # Map with another Series
    df['col'].map(other_series)
    ```

In [19]:
df['Vehicle'] = df['Vehicle'].map({'Car': 'Automobile', 'Bike': 'Motorcycle'})
display(df)

mean = df['Cost'].mean()
df['Cost'] = df['Cost'].map(lambda x: x - mean)
display(df)

continent_map = pd.Series({
    'Australia': 'Oceania',
})
df['Continent'] = df['Continent'].map(continent_map)
display(df)

cean = df['Continent'].map(lambda x: 'cean' in x).sum()
print(f'Number of continents containing "cean": {cean}')

Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,500000.0,2.5,0.0,60.0,Australia,Automobile
2114,2,672000.0,1.6,0.0,60.0,Australia,Automobile
193,3,720000.0,11.1,0.0,130.0,Australia,Automobile
221,2,599000.0,6.3,0.0,76.0,Australia,Automobile
223,3,995000.0,6.3,0.0,100.0,Australia,Automobile
...,...,...,...,...,...,...,...
5194,3,572000.0,11.2,41400.0,,Australia,Automobile
13245,5,1355000.0,48.1,44500.0,44515.0,Australia,Automobile
687,3,2000000.0,9.2,75100.0,,Australia,Automobile
10504,3,1085000.0,34.6,76000.0,,Australia,Automobile


Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,-5.756642e+05,2.5,0.0,60.0,Australia,Automobile
2114,2,-4.036642e+05,1.6,0.0,60.0,Australia,Automobile
193,3,-3.556642e+05,11.1,0.0,130.0,Australia,Automobile
221,2,-4.766642e+05,6.3,0.0,76.0,Australia,Automobile
223,3,-8.066422e+04,6.3,0.0,100.0,Australia,Automobile
...,...,...,...,...,...,...,...
5194,3,-5.036642e+05,11.2,41400.0,,Australia,Automobile
13245,5,2.793358e+05,48.1,44500.0,44515.0,Australia,Automobile
687,3,9.243358e+05,9.2,75100.0,,Australia,Automobile
10504,3,9.335779e+03,34.6,76000.0,,Australia,Automobile


Unnamed: 0,Rooms,Cost,Distance,Landsize,BuildingArea,Continent,Vehicle
23,2,-5.756642e+05,2.5,0.0,60.0,Oceania,Automobile
2114,2,-4.036642e+05,1.6,0.0,60.0,Oceania,Automobile
193,3,-3.556642e+05,11.1,0.0,130.0,Oceania,Automobile
221,2,-4.766642e+05,6.3,0.0,76.0,Oceania,Automobile
223,3,-8.066422e+04,6.3,0.0,100.0,Oceania,Automobile
...,...,...,...,...,...,...,...
5194,3,-5.036642e+05,11.2,41400.0,,Oceania,Automobile
13245,5,2.793358e+05,48.1,44500.0,44515.0,Oceania,Automobile
687,3,9.243358e+05,9.2,75100.0,,Oceania,Automobile
10504,3,9.335779e+03,34.6,76000.0,,Oceania,Automobile


Number of continents containing "cean": 13578
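
In the cell above every value has a matching key, so nothing is lost. When a key is missing, a dict-based `map` produces `NaN`. The sketch below, on made-up data, shows that case and two possible ways to supply a fallback:

```python
import pandas as pd

vehicles = pd.Series(['Car', 'Bike', 'Bus'])
mapping = {'Car': 'Automobile', 'Bike': 'Motorcycle'}

strict = vehicles.map(mapping)                            # 'Bus' has no key, so it becomes NaN
keep_original = vehicles.map(mapping).fillna(vehicles)    # fall back to the old value
with_default = vehicles.map(lambda v: mapping.get(v, 'Other'))  # fall back to a constant

print(strict)
print(keep_original.tolist())
print(with_default.tolist())
```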


### Using apply()
- Using **`apply()`**: it works on both a DataFrame and a Series

    ```python
    # Lambda on a Series: called once per value
    df['Column'] = df['Column'].apply(lambda x: x * 2)

    # Lambda on a DataFrame with axis=1: called once per row
    df['Total'] = df.apply(lambda row: row['Col1'] + row['Col2'], axis=1)

    # Named function on a Series
    df['Column'] = df['Column'].apply(custom_function)
    
    # Function on a whole DataFrame (needs `import numpy as np` and numeric columns)
    df = df.apply(np.sqrt)
    ```

In [None]:
# With axis=1, apply passes each row to the function as a Series
def star(row):
    if row['Rooms'] == 2:
        return 1
    elif row['Rooms'] == 3:
        return 2
    else:
        return 3

star_ratings = df.apply(star, axis=1)
display(star_ratings.head())

23      1
2114    1
193     2
221     1
223     2
dtype: int64
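
Row-wise `apply` calls a Python function once per row, which can be slow on large frames. When the rule depends on a single column, a column-level `map` is one possible alternative. The sketch below, on made-up data, checks that both give the same ratings:

```python
import pandas as pd

homes = pd.DataFrame({'Rooms': [2, 3, 5, 2]})

def star(row):
    # Same rule as above: 2 rooms -> 1, 3 rooms -> 2, anything else -> 3
    if row['Rooms'] == 2:
        return 1
    elif row['Rooms'] == 3:
        return 2
    return 3

row_wise = homes.apply(star, axis=1)

# Column-level version: map the known cases, fill the rest with 3
column_wise = homes['Rooms'].map({2: 1, 3: 2}).fillna(3).astype('int64')

print(row_wise.tolist())
print(column_wise.tolist())
```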

## 9. Grouping and aggregating data

```python
df.groupby('Department')['Salary'].mean()
df.groupby(['Dept','Gender']).agg({'Salary':'mean','Age':'max'})
```

* `groupby(col)` → groups the rows by a column so that a statistic can be computed per group
    - mean(): mean
    - sum(): sum
    - count(): count of non-null values
    - min(): smallest value
    - max(): largest value
    - agg(): applies several statistics to several columns at once


In [None]:
rooms_price = df.groupby('Rooms').agg({'Cost':'mean', 'Landsize':'max'})
display(rooms_price)

rooms = df.groupby('Rooms').size()                 # rows per group, NaN included
rooms = df['Rooms'].value_counts()                 # same counts, sorted by frequency
rooms = df.groupby('Rooms')['Rooms'].count()       # counts non-NaN values; only this result is displayed
display(rooms)

rooms_price = df.groupby('Rooms')['Landsize'].mean()
display(rooms_price)

rooms_price = df.groupby('Rooms')['Landsize'].max()
display(rooms_price)

Unnamed: 0_level_0,Cost,Landsize
Rooms,Unnamed: 1_level_1,Unnamed: 2_level_1
1,-641839.770654,14500.0
2,-300821.879443,37000.0
3,416.403433,433014.0
4,369617.514402,40468.0
5,794596.194646,44500.0
6,773701.450181,4413.0
7,845035.778539,5022.0
8,527085.778539,1472.0
10,-175664.221461,313.0


Rooms
1      681
2     3646
3     5881
4     2688
5      596
6       67
7       10
8        8
10       1
Name: Rooms, dtype: int64

Rooms
1      384.681351
2      421.290730
3      597.700731
4      639.727679
5      798.505034
6      841.462687
7     1089.700000
8      843.875000
10     313.000000
Name: Landsize, dtype: float64

Rooms
1      14500.0
2      37000.0
3     433014.0
4      40468.0
5      44500.0
6       4413.0
7       5022.0
8       1472.0
10       313.0
Name: Landsize, dtype: float64
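
`size()` and `count()` differ only when a group contains missing values. The sketch below uses made-up data with one missing price to show that, and uses named aggregation so the result columns get explicit names (`mean_price` and `max_landsize` are illustrative):

```python
import pandas as pd

homes = pd.DataFrame({
    'Rooms': [2, 2, 3, 3, 3],
    'Price': [900000.0, 1100000.0, 1500000.0, None, 1300000.0],
    'Landsize': [150.0, 200.0, 400.0, 350.0, 500.0],
})

# Named aggregation: result_column=(source_column, function)
summary = homes.groupby('Rooms').agg(
    mean_price=('Price', 'mean'),
    max_landsize=('Landsize', 'max'),
)

sizes = homes.groupby('Rooms').size()             # rows per group, NaN included
counts = homes.groupby('Rooms')['Price'].count()  # non-NaN prices per group

print(summary)
print(sizes.tolist(), counts.tolist())
```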

## 10. Combining data

* `pd.concat([df1, df2], axis=0)` → concatenate along rows
* `pd.concat([df1, df2], axis=1)` → concatenate along columns
* `pd.merge(df1, df2, on='key', how='inner')` → merge, like a SQL join

  * **how**: 'inner', 'left', 'right', 'outer'
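
A sketch of these calls on two small made-up frames that share a `key` column, showing how the `how` option changes which rows survive:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'Price': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'Rooms': [2, 3, 4]})

stacked = pd.concat([left, left], axis=0, ignore_index=True)   # rows appended
side_by_side = pd.concat([left, right[['Rooms']]], axis=1)     # aligned on the index

inner = pd.merge(left, right, on='key', how='inner')      # keys present in both
left_join = pd.merge(left, right, on='key', how='left')   # all keys from left
outer = pd.merge(left, right, on='key', how='outer')      # keys from either side

print(inner)
print(left_join)
print(outer)
```

With `axis=1`, `concat` pairs rows by index label and not by `key`. That is why `merge` is the better fit when the frames should be matched on a column.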

## 11. Other common functions

| Function               | Purpose                                                                                   |
| ---------------------- | ----------------------------------------------------------------------------------------- |
| `df.apply(func)`       | Apply a function to each column or row                                                    |
| `df.applymap(func)`    | Apply a function to each element (deprecated since pandas 2.1 in favor of `df.map(func)`) |
| `df.drop_duplicates()` | Remove duplicate rows                                                                     |
| `df.sample(n)`         | Draw a random sample of n rows                                                            |
| `df.astype(type)`      | Convert column data types                                                                 |
| `df.corr()`            | Correlation matrix of the numeric columns                                                 |
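
A short sketch chaining several of these functions on made-up data. `Rooms` is stored as text on purpose so that `astype` has something to convert, and `random_state` is fixed only to make the sample repeatable:

```python
import pandas as pd

homes = pd.DataFrame({
    'Rooms': ['2', '3', '3', '4'],
    'Price': [1000000.0, 1500000.0, 1500000.0, 2000000.0],
})

deduped = homes.drop_duplicates()              # the second ('3', 1500000.0) row is removed
typed = deduped.astype({'Rooms': 'int64'})     # convert one column by name
picked = typed.sample(n=2, random_state=0)     # two random rows
corr = typed.corr()                            # pairwise correlation of numeric columns

print(typed.dtypes)
print(picked)
print(corr)
```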
