# Pandas Series and DataFrame: one-dimensional and two-dimensional arrays

Series: a one-dimensional array with flexible indices,
like a typed dictionary but more efficient for certain computations.


**Create a Series from a dictionary**

In [2]:
import pandas as pd

students_data = dict(business = 25, AI = 30, JS = 30, JAVA = 27)
students_data

{'business': 25, 'AI': 30, 'JS': 30, 'JAVA': 27}

In [3]:
series_program = pd.Series(students_data)
series_program

business    25
AI          30
JS          30
JAVA        27
dtype: int64

In [4]:
print(series_program)

business    25
AI          30
JS          30
JAVA        27
dtype: int64


In [7]:
series_program.iloc[0], series_program.iloc[-1]

(np.int64(25), np.int64(27))

In [12]:
print(series_program.iloc[0])

25


In [8]:
series_program.keys()

Index(['business', 'AI', 'JS', 'JAVA'], dtype='object')

In [10]:
series_program["AI"], series_program.loc["AI"]

(np.int64(30), np.int64(30))

In [11]:
print(series_program["AI"])

30
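
The "typed dictionary" comparison can be made concrete with a short sketch. It rebuilds the same `series_program` and uses dictionary-style membership tests and `.get()`; the missing label `"NET"` and the default value `0` are assumptions chosen only for illustration.

```python
import pandas as pd

series_program = pd.Series({"business": 25, "AI": 30, "JS": 30, "JAVA": 27})

# Membership tests look at the index labels, as with dictionary keys.
has_ai = "AI" in series_program
has_net = "NET" in series_program

# .get() returns a default instead of raising KeyError for a missing label.
net_students = series_program.get("NET", 0)

# Unlike a plain dict, all values share a single dtype.
print(has_ai, has_net, net_students, series_program.dtype)
```
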


In [13]:
# arithmetic works on the selected value
series_program["AI"] + 50

np.int64(80)
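
Calculations are not limited to a single element. The sketch below applies arithmetic to the whole Series at once; `new_admissions` is a hypothetical second Series introduced only to show that pandas aligns on labels rather than on position.

```python
import pandas as pd

series_program = pd.Series({"business": 25, "AI": 30, "JS": 30, "JAVA": 27})

# Adding a scalar applies to every element and keeps the index labels.
expanded = series_program + 5

# Arithmetic between two Series aligns on labels, not on position.
# fill_value=0 treats labels missing from one side as 0 instead of NaN.
new_admissions = pd.Series({"JAVA": 3, "AI": 10})
total = series_program.add(new_admissions, fill_value=0)

print(expanded)
print(total)
```
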

**Another Series from a list**

`**` (two asterisks) is typically used to mark bold text.

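`rnd.seed(42)` sets the seed of the random number generator.  
In Python, the `random` module generates random numbers, and its `seed()` function sets the generator's starting state. With a fixed seed, every run of the code produces the same sequence of random numbers.  
`.seed(42)` sets the seed to 42. The seed is the generator's starting point, so the same seed always yields the same sequence. Any integer can be used as a seed; 42 is just a popular choice with no special meaning.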

Without a seed:  
the first run might output 45  
the second run might output 88  
the third run might output 13  
With a seed: all three runs output 81
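
The reproducibility claim can be checked directly. This is a minimal sketch, assuming a hypothetical helper `roll_dice` that creates its own `random.Random` generator, so each call starts from the given seed; the second seed value `7` is arbitrary.

```python
import random


def roll_dice(seed, n_rolls=5):
    """Roll a six-sided die n_rolls times with a freshly seeded generator."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.randint(1, 6) for _ in range(n_rolls)]


first_run = roll_dice(42)
second_run = roll_dice(42)
other_seed = roll_dice(7)

# Same seed, same sequence; a different seed usually gives a different one.
print(first_run == second_run)
print(first_run, other_seed)
```
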

In [18]:
import random as rnd
rnd.seed(42)

dice_list = [rnd.randint(1,6) for _ in range(5)]
dice_list


[6, 1, 1, 6, 3]

In [20]:
dice_series = pd.Series(dice_list)
dice_series

0    6
1    1
2    1
3    6
4    3
dtype: int64

In [23]:
dice_series.min(), dice_series.max(), dice_series.mean()

(np.int64(1), np.int64(6), np.float64(3.4))

## Dataframe

- Analogous to a 2D NumPy array, with flexible row indices and column names
- A DataFrame is built on one or more Series
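
A small sketch of what "built on Series" means in practice. The two Series here (`students_partial`, `language_partial`) are hypothetical and deliberately have different labels, to show that the columns are aligned on a shared row index and that gaps become NaN.

```python
import pandas as pd

students_partial = pd.Series({"AI": 25, "NET": 30})
language_partial = pd.Series({"NET": "C#", "JAVA": "Java"})  # different labels on purpose

df = pd.DataFrame({"Students": students_partial, "Language": language_partial})

# Each column is itself a Series that shares the DataFrame's row index.
print(type(df["Students"]).__name__)

# Labels missing from one of the Series are filled with NaN.
print(df)
```
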

In [24]:
series_program

business    25
AI          30
JS          30
JAVA        27
dtype: int64

In [27]:
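# ("Num students",) is a one-element tuple, so the DataFrame has a single column named "Num students".
# Even with one element, a tuple needs a trailing comma so that it is recognized as a tuple rather than a plain string.
df_programs = pd.DataFrame(series_program, columns=("Num students",))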
df_programs

          Num students
business            25
AI                  30
JS                  30
JAVA                27
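
The trailing-comma rule is plain Python and easy to verify. The sketch below also shows `Series.to_frame(name=...)` as one alternative way to get a single named column; that alternative is a suggestion, not what the cell above uses.

```python
import pandas as pd

with_comma = ("Num students",)    # tuple containing one string
without_comma = ("Num students")  # just a string inside parentheses

print(type(with_comma).__name__, type(without_comma).__name__)

series_program = pd.Series({"business": 25, "AI": 30, "JS": 30, "JAVA": 27})

# to_frame() names the single column directly, so no tuple is needed.
df_single = series_program.to_frame(name="Num students")
print(df_single.columns.tolist())
```
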


In [28]:
# create 2 series objects using dictionary
students = pd.Series(dict(AI = 25, NET = 30, APP = 30, JAVA = 27))
language = pd.Series(dict(AI = 'Python', NET = 'C#', APP = 'kotlin', JAVA = 'Java'))

students

AI      25
NET     30
APP     30
JAVA    27
dtype: int64

In [29]:
language

AI      Python
NET         C#
APP     kotlin
JAVA      Java
dtype: object

In [31]:
df_programs = pd.DataFrame({"Students": students, "Language":language})
df_programs

      Students Language
AI          25   Python
NET         30       C#
APP         30   kotlin
JAVA        27     Java


Use a NumPy array and a list to create a DataFrame

In [34]:
import numpy as np

pd.DataFrame(
    {
        "Student": np.array((25, 30, 30, 27)),
        "Language": ["Python", "C#", "kotlin", "Java"]
    },
    index= ['AI', 'NET', 'APP', 'JAVA'] 
)

      Student Language
AI         25   Python
NET        30       C#
APP        30   kotlin
JAVA       27     Java


In [35]:
df_programs.index

Index(['AI', 'NET', 'APP', 'JAVA'], dtype='object')

## Data selection - important

`iloc` selects by integer position.

In [36]:
df_programs['Students']

# the result is a Series, as before

AI      25
NET     30
APP     30
JAVA    27
Name: Students, dtype: int64

The outer `[]` indexes the DataFrame itself.  
The inner `[]` is a list of column names: it can hold a single name (e.g. `['Language']`) or several (e.g. `['Language', 'Students']`).

In [38]:
df_programs[['Language', 'Students']]

     Language  Students
AI     Python        25
NET        C#        30
APP    kotlin        30
JAVA     Java        27
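
The difference between single and double brackets shows up in the type of the result. A minimal sketch, rebuilding `df_programs` with the same data so that it runs on its own:

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "kotlin", "Java"]},
    index=["AI", "NET", "APP", "JAVA"],
)

as_series = df_programs["Language"]   # single brackets -> Series
as_frame = df_programs[["Language"]]  # list with one name -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```
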


Dot syntax can also return a column as a Series, but do not rely on dot indexing too much!

In [39]:
df_programs.Language

AI      Python
NET         C#
APP     kotlin
JAVA      Java
Name: Language, dtype: object
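
One likely reason to be careful with dot syntax is that column names can clash with existing DataFrame attributes or fail to be valid Python identifiers. The frame `df_clash` below is hypothetical and exists only to illustrate this.

```python
import pandas as pd

# Hypothetical frame: "count" clashes with a DataFrame method,
# and "Num students" contains a space.
df_clash = pd.DataFrame(
    {"count": [25, 30], "Num students": [25, 30]},
    index=["AI", "NET"],
)

# Dot access finds the DataFrame.count method, not the column.
print(callable(df_clash.count))

# Bracket access always reaches the column.
print(df_clash["count"].tolist())
print(df_clash["Num students"].tolist())
```
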

In [40]:
df_programs["Language"]["NET"]

'C#'

## Indexers

- Indexers give a slicing interface for the indices

- `loc` is used more often than `iloc`
- `loc` slices, indexes, and references by the explicit index, i.e. by row and column labels
- `iloc` slices, indexes, and references by integer position
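
The distinction matters most when the labels are themselves integers. The Series `rooms` below is a hypothetical example whose labels (1, 3, 5) do not match the positions (0, 1, 2).

```python
import pandas as pd

# Hypothetical Series with integer labels that differ from positions.
rooms = pd.Series(["A", "B", "C"], index=[1, 3, 5])

by_label = rooms.loc[1]      # the row labelled 1
by_position = rooms.iloc[1]  # the second row (position 1)

print(by_label, by_position)
```
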

In [42]:
df_programs.loc['JAVA']

Students      27
Language    Java
Name: JAVA, dtype: object

In [43]:
df_programs.loc['AI']

Students        25
Language    Python
Name: AI, dtype: object

In [44]:
df_programs.loc[['JAVA', 'AI']]

      Students Language
JAVA        27     Java
AI          25   Python


In [46]:
try:
    df_programs[['JAVA', 'AI']]
except KeyError as err:
    print(err)

"None of [Index(['JAVA', 'AI'], dtype='object')] are in the [columns]"


In [49]:
# slicing from AI to APP (both included); this works in pandas but not with a plain dictionary
df_programs.loc["AI": "APP"]

     Students Language
AI         25   Python
NET        30       C#
APP        30   kotlin


In [51]:
df_programs.iloc[1:3]

# [1:3] is a Python slice: it starts at position 1 and stops before position 3 (3 is excluded), so it selects the rows at positions 1 and 2. Positions in pandas start at 0.

     Students Language
NET        30       C#
APP        30   kotlin
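
The two slicing conventions can be compared side by side. This sketch rebuilds `df_programs` with the same data and also selects a single cell with both indexers, which the cells above do not show.

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "kotlin", "Java"]},
    index=["AI", "NET", "APP", "JAVA"],
)

label_slice = df_programs.loc["NET":"APP"]  # label slice: both endpoints included
position_slice = df_programs.iloc[1:3]      # position slice: stop position excluded

# Both select the NET and APP rows here.
print(label_slice.index.tolist(), position_slice.index.tolist())

# A second argument selects the column: by label for loc, by position for iloc.
print(df_programs.loc["NET", "Language"], df_programs.iloc[1, 1])
```
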


## Masking

Filter rows based on conditions.

In [52]:
df_programs["Students"] > 25

# returns a boolean Series: True or False for each row

AI      False
NET      True
APP      True
JAVA     True
Name: Students, dtype: bool

In [53]:
df_programs[df_programs["Students"] > 25]

# retrieve all the rows where this condition is True

      Students Language
NET         30       C#
APP         30   kotlin
JAVA        27     Java
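
Masks can also be combined. A minimal sketch using `&` (and), `|` (or) and `~` (not) on the same data; the particular conditions are chosen only for illustration.

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "kotlin", "Java"]},
    index=["AI", "NET", "APP", "JAVA"],
)

# Each comparison needs its own parentheses, because & and | bind
# more tightly than the comparison operators.
big_not_java = df_programs[(df_programs["Students"] > 25) & (df_programs["Language"] != "Java")]
small_or_java = df_programs[(df_programs["Students"] <= 25) | (df_programs["Language"] == "Java")]

# ~ inverts a mask.
not_big = df_programs[~(df_programs["Students"] > 25)]

print(big_not_java.index.tolist())
print(small_or_java.index.tolist())
print(not_big.index.tolist())
```
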


In [55]:
# or use query

df_programs.query("Students > 25")

      Students Language
NET         30       C#
APP         30   kotlin
JAVA        27     Java
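
`query` can also refer to Python variables by prefixing them with `@`. In this sketch `min_students` is a hypothetical threshold, and the result is compared with the equivalent boolean mask.

```python
import pandas as pd

df_programs = pd.DataFrame(
    {"Students": [25, 30, 30, 27], "Language": ["Python", "C#", "kotlin", "Java"]},
    index=["AI", "NET", "APP", "JAVA"],
)

min_students = 25  # hypothetical threshold

# @min_students refers to the Python variable defined above.
by_query = df_programs.query("Students > @min_students")
by_mask = df_programs[df_programs["Students"] > min_students]

print(by_query.index.tolist())
print(by_query.equals(by_mask))
```
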
