### 1.3 Pandas Basics

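Pandas is an open-source library for data manipulation and analysis that provides efficient, easy-to-use data structures and analysis tools. Its core data structures are Series and DataFrame, which are used to store and manipulate labeled and tabular data.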

#### 1.3.1 Core Data Structures in Pandas

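1. **Series**: a one-dimensional data structure that can be thought of as an array with labels.
2. **DataFrame**: a two-dimensional data structure that can be thought of as a table with row labels and column labels.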
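
To make the "labeled array" and "labeled table" descriptions concrete, here is a minimal sketch. The variable names and sample values (`ages`, `people`, and the lowercase labels) are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# A Series pairs each value with a label (its index).
ages = pd.Series([24, 27, 22], index=['alice', 'bob', 'charlie'])

# The same value can be reached by label or by integer position.
age_by_label = ages['bob']        # label-based lookup
age_by_position = ages.iloc[1]    # position-based lookup

# A DataFrame has row labels and column labels; each column is a Series
# that shares the DataFrame's row index.
people = pd.DataFrame(
    {'age': [24, 27, 22], 'city': ['NYC', 'SF', 'LA']},
    index=['alice', 'bob', 'charlie'],
)
age_column = people['age']

print(age_by_label, age_by_position)
print(type(age_column).__name__)
```

Selecting one column from `people` yields a Series, which is why most Series operations carry over to DataFrame columns.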

##### Creating a Series and a DataFrame

1. **Creating a Series**:

In [1]:
import pandas as pd

# Create a Series from a list (the index defaults to 0, 1, 2, ...)
series = pd.Series([1, 2, 3, 4, 5])

# Create a Series from a dict (the keys become the index labels)
series_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

print(series)
print(series_from_dict)

0    1
1    2
2    3
3    4
4    5
dtype: int64
a    1
b    2
c    3
dtype: int64


2. **Creating a DataFrame**:

In [3]:
# Create a DataFrame from a dict (each key becomes a column)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

print(df)

      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago
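
A dict of columns is only one possible input. As a small sketch, the block below builds a DataFrame from a list of row records and, optionally, supplies explicit row labels. The names `records`, `df_records`, `df_labeled` and the labels `'r1'`, `'r2'` are made up for illustration.

```python
import pandas as pd

# Each dict describes one row; the keys become column names.
records = [
    {'Name': 'Alice', 'Age': 24},
    {'Name': 'Bob', 'Age': 27},
]

# Default integer index: 0, 1
df_records = pd.DataFrame(records)

# Explicit row labels supplied through the index argument
df_labeled = pd.DataFrame(records, index=['r1', 'r2'])

print(df_records)
print(df_labeled)
```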


##### Basic DataFrame Operations

1. **Viewing data**:

In [4]:
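# Show the first rows (up to 5 by default)
print(df.head())

# Show the last rows (up to 5 by default)
print(df.tail())

# Show basic information about the data
# (df.info() prints its report directly and returns None, so print() adds a trailing "None" line)
print(df.info())

# Show descriptive statistics (numeric columns only by default)
print(df.describe())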

      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago
      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
None
             Age
count   4.000000
mean   26.250000
std     4.349329
min    22.000000
25%    23.500000
50%    25.500000
75%    28.250000
max    32.000000
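
Because the sample table has only four rows, `head()` and `tail()` above both return the whole table. The sketch below rebuilds the same data and passes an explicit row count, and also checks `shape` and the column dtype, which are quick alternatives to reading the full `info()` report.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago'],
})

first_two = df.head(2)   # first 2 rows only
last_one = df.tail(1)    # last row only

n_rows, n_cols = df.shape                               # (rows, columns)
age_is_integer = pd.api.types.is_integer_dtype(df['Age'])

print(first_two)
print(last_one)
print(n_rows, n_cols, age_is_integer)
```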


2. **Selecting data**:

In [5]:
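# Select a single column (returns a Series)
print(df['Name'])

# Select multiple columns (returns a DataFrame)
print(df[['Name', 'Age']])

# Select a row
print(df.iloc[1])  # by integer position
print(df.loc[1])   # by index label (same row here because the default index is 0, 1, 2, ...)

# Select rows that satisfy a condition
print(df[df['Age'] > 25])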

0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
      Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22
3    David   32
Name              Bob
Age                27
City    San Francisco
Name: 1, dtype: object
Name              Bob
Age                27
City    San Francisco
Name: 1, dtype: object
    Name  Age           City
1    Bob   27  San Francisco
3  David   32        Chicago
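
In the cell above, `iloc[1]` and `loc[1]` return the same row only because the default index labels happen to equal the positions. The following sketch uses a hypothetical non-default index (`10, 20, 30, 40`) to separate the two, and also shows how to combine several conditions. The name `df_people` is an assumption made for this illustration.

```python
import pandas as pd

df_people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [24, 27, 22, 32]},
    index=[10, 20, 30, 40],  # hypothetical row labels that differ from positions
)

second_row = df_people.iloc[1]     # position 1
row_label_20 = df_people.loc[20]   # label 20, which is the same row
# df_people.loc[1] would raise KeyError: no row carries the label 1.

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mid_twenties = df_people[(df_people['Age'] > 22) & (df_people['Age'] < 30)]

print(second_row['Name'], row_label_20['Name'])
print(mid_twenties)
```

The parentheses matter because `&` binds more tightly than comparison operators in Python.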


3. **Modifying data**:

In [6]:
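# Add a new column
df['Salary'] = [50000, 60000, 70000, 80000]

# Modify the values of a column
df['Age'] = df['Age'] + 1

# Drop a column (drop() returns a new DataFrame, so the result is reassigned)
df = df.drop(columns=['City'])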

print(df)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   28   60000
2  Charlie   23   70000
3    David   33   80000
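
The operations above change `df` step by step. An alternative style, sketched below, leaves the original untouched and builds new DataFrames with `assign()`, `rename()`, and `drop()`. The names `df_staff`, `with_salary`, `renamed`, `without_age` and the column name `AnnualSalary` are illustrative assumptions.

```python
import pandas as pd

df_staff = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# assign() returns a new DataFrame with the extra column; df_staff is unchanged.
with_salary = df_staff.assign(Salary=[50000, 60000])

# rename() and drop() also return new objects by default.
renamed = with_salary.rename(columns={'Salary': 'AnnualSalary'})
without_age = renamed.drop(columns=['Age'])

print(df_staff.columns.tolist())
print(without_age)
```

Keeping the original intact can make it easier to rerun notebook cells without accumulating changes such as repeated `Age + 1` increments.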


4. **Handling missing values**:

In [7]:
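# Create a DataFrame that contains missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, None, 32],
    'City': ['New York', None, 'Los Angeles', 'Chicago']
}
df_nan = pd.DataFrame(data_with_nan)

# Flag missing values (True marks a missing entry)
print(df_nan.isnull())

# Fill missing values: the column mean for 'Age', a placeholder string for 'City'
df_nan_filled = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})

# Drop every row that contains at least one missing value
df_nan_dropped = df_nan.dropna()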

print(df_nan)
print(df_nan_filled)
print(df_nan_dropped)

    Name    Age   City
0  False  False  False
1  False  False   True
2  False   True  False
3  False  False  False
      Name   Age         City
0    Alice  24.0     New York
1      Bob  27.0         None
2  Charlie   NaN  Los Angeles
3    David  32.0      Chicago
      Name        Age         City
0    Alice  24.000000     New York
1      Bob  27.000000      Unknown
2  Charlie  27.666667  Los Angeles
3    David  32.000000      Chicago
    Name   Age      City
0  Alice  24.0  New York
3  David  32.0   Chicago
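
Two small variations are often useful alongside the calls above: counting missing values per column, and dropping rows only when a specific column is missing. This is a sketch on the same sample data; the names `df_missing`, `missing_per_column`, and `age_known` are assumptions for illustration.

```python
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, None, 32],
    'City': ['New York', None, 'Los Angeles', 'Chicago'],
})

# Summing the boolean mask gives the number of missing values in each column.
missing_per_column = df_missing.isnull().sum()

# Drop a row only if 'Age' is missing; a missing 'City' is tolerated.
age_known = df_missing.dropna(subset=['Age'])

print(missing_per_column)
print(age_known)
```

Compared with a plain `dropna()`, the `subset` argument keeps Bob's row, whose only missing field is `City`.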


5. **Grouping and aggregating data**:

In [8]:
# Group by 'Name' and sum (every name is unique here, so each group holds a single row)
grouped = df.groupby('Name').sum()

print(grouped)

         Age  Salary
Name                
Alice     25   50000
Bob       28   60000
Charlie   23   70000
David     33   80000
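
Since every name in the sample data is unique, the grouped result above matches the input row for row. To show aggregation actually combining rows, the sketch below adds a hypothetical `Dept` column (not present in the original data) and computes two statistics per group.

```python
import pandas as pd

df_team = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Dept': ['Eng', 'Eng', 'Ops', 'Ops'],   # hypothetical grouping column
    'Salary': [50000, 60000, 70000, 80000],
})

# Group rows by department, then compute the sum and mean salary of each group.
salary_by_dept = df_team.groupby('Dept')['Salary'].agg(['sum', 'mean'])

print(salary_by_dept)
```

Selecting `['Salary']` before aggregating restricts the computation to that column, so the non-numeric `Name` column is not involved.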


#### Example Code

In [9]:
import pandas as pd

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Show the first rows (up to 5 by default)
print("\nFirst rows:")
print(df.head())

# Show basic information about the data
# (df.info() prints its report directly and returns None, so print() adds a trailing "None" line)
print("\nBasic information:")
print(df.info())

# Select a column
print("\nThe 'Name' column:")
print(df['Name'])

# Select rows that satisfy a condition
print("\nRows with Age greater than 25:")
print(df[df['Age'] > 25])

# Add a new column
df['Salary'] = [50000, 60000, 70000, 80000]
print("\nDataFrame after adding a new column:")
print(df)

# Modify the values of a column
df['Age'] = df['Age'] + 1
print("\nDataFrame after modifying the 'Age' column:")
print(df)

# Drop a column
df = df.drop(columns=['City'])
print("\nDataFrame after dropping the 'City' column:")
print(df)

# Create a DataFrame that contains missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, None, 32],
    'City': ['New York', None, 'Los Angeles', 'Chicago']
}
df_nan = pd.DataFrame(data_with_nan)
print("\nDataFrame with missing values:")
print(df_nan)

# Fill missing values
df_nan_filled = df_nan.fillna({'Age': df_nan['Age'].mean(), 'City': 'Unknown'})
print("\nDataFrame after filling missing values:")
print(df_nan_filled)

# Group by 'Name' and sum
grouped = df.groupby('Name').sum()
print("\nGrouped by 'Name' and summed:")
print(grouped)

Original DataFrame:
      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago

First rows:
      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago

Basic information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
None

The 'Name' column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

Rows with Age greater than 25:
    Name  Age           City
1    Bob   27  San Francisco
3  David   32        Chicago

DataFrame after adding a new column:
      Name  Age           City  Salary
0    Alice   24       New York   50000
1      Bob   27  San Francisco   60000
2  Charlie   22    Los Angeles   70000
3    David   32        Chicago   80000

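Pandas provides powerful tools for data manipulation and analysis, and mastering these basic operations lays a solid foundation for further data analysis work.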