In [4]:
import pandas as pd

df = pd.read_csv(
    "../../DATA/linux-etc-passwd.txt",
    sep=":",
    comment="#",
    header=None,
    names=["username", "password", "userid", "groupid", "name", "homedir", "shell"],
    skip_blank_lines=True,
)

In [5]:
df.head()

Unnamed: 0,username,password,userid,groupid,name,homedir,shell
0,root,x,0,0,root,/root,/bin/bash
1,daemon,x,1,1,daemon,/usr/sbin,/usr/sbin/nologin
2,bin,x,2,2,bin,/bin,/usr/sbin/nologin
3,sys,x,3,3,sys,/dev,/usr/sbin/nologin
4,sync,x,4,65534,sync,/bin,/bin/sync


By default, pandas assumes that the values are separated by commas. Using a different character is no problem, but we need to specify it in the `sep` keyword argument.
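
To make the effect of `sep` concrete, here is a minimal sketch that reads the same text twice, once with the default separator and once with `sep=":"`. The two-line `sample` string is a hypothetical stand-in for the real passwd file, and `io.StringIO` is used only so the example runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical two-line sample in passwd format (colon-separated, no header row).
sample = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n"
)

# Default separator (comma): the lines contain no commas,
# so each whole line ends up in a single column.
one_col = pd.read_csv(io.StringIO(sample), header=None)

# Colon separator: each line is split into its seven fields.
seven_cols = pd.read_csv(io.StringIO(sample), sep=":", header=None)

print(one_col.shape)     # (2, 1)
print(seven_cols.shape)  # (2, 7)
```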

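`read_csv` handles files that contain comments gracefully: it lets us specify the string that marks the start of a comment. By passing `comment='#'`, we tell the parser to ignore such lines.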
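
A small sketch of the `comment` behavior, assuming a hypothetical `sample` string in which two lines start with `#`. Only the two data lines should survive.

```python
import io

import pandas as pd

# Hypothetical sample: two comment lines mixed in with two data lines.
sample = (
    "# system accounts\n"
    "root:x:0:0\n"
    "# service accounts\n"
    "daemon:x:1:1\n"
)

# Lines that start with '#' are dropped before parsing.
df_comments = pd.read_csv(io.StringIO(sample), sep=":", comment="#", header=None)

print(df_comments.shape)  # (2, 4)
```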

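By default, `read_csv` assumes that the first line of the file is a header containing the column names. It also uses that first line to determine how many fields each row will have. If the file contains a header that is not on the first line, we can set `header` to an integer indicating the line on which `read_csv` should look for it. The passwd file has no header line at all, which is why we pass `header=None`.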
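
The integer form of `header` can be sketched as follows. The `sample` string, with a title line above the real header, is invented for illustration and is not part of the passwd file.

```python
import io

import pandas as pd

# Hypothetical sample: a title line, then the header, then the data.
sample = (
    "exported by a hypothetical tool\n"
    "username:userid\n"
    "root:0\n"
    "daemon:1\n"
)

# header=1 means "the column names are on the line at index 1";
# everything before that line is discarded.
df_header = pd.read_csv(io.StringIO(sample), sep=":", header=1)

print(list(df_header.columns))
```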

`read_csv` ignores blank lines by default. If we want blank lines to be treated as `NaN` values instead, we can pass `skip_blank_lines=False` rather than accepting the default of `True`.
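
The difference between the two settings can be seen on a tiny input. This is a sketch using a made-up `sample` string with one blank line in the middle.

```python
import io

import pandas as pd

# Hypothetical sample with a blank line between the two data lines.
sample = "root:0\n\ndaemon:1\n"

# Default (skip_blank_lines=True): the blank line disappears.
skipped = pd.read_csv(io.StringIO(sample), sep=":", header=None)

# skip_blank_lines=False: the blank line becomes a row of NaN values.
kept = pd.read_csv(io.StringIO(sample), sep=":", header=None, skip_blank_lines=False)

print(len(skipped), len(kept))
print(kept)
```

Note that keeping the blank line forces the numeric column to a floating-point dtype, because `NaN` cannot be stored in an integer column.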

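If we do not provide any `names` (and pass `header=None`), the data frame's columns are labeled with integers starting at 0. There is nothing technically wrong with that, but it makes the data harder to work with. Fortunately, it is easy to pass the names we want to give the columns as a list of strings.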
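
As a quick illustration, the following sketch compares the column labels with and without `names`. The three-field `sample` string is a shortened, hypothetical version of passwd lines.

```python
import io

import pandas as pd

# Hypothetical sample with only the first three passwd fields.
sample = "root:x:0\ndaemon:x:1\n"

# Without names, the columns are labeled 0, 1, 2.
unnamed = pd.read_csv(io.StringIO(sample), sep=":", header=None)

# With names, we can refer to columns by meaningful labels.
named = pd.read_csv(
    io.StringIO(sample),
    sep=":",
    header=None,
    names=["username", "password", "userid"],
)

print(list(unnamed.columns))  # [0, 1, 2]
print(list(named.columns))    # ['username', 'password', 'userid']
```

With names in place, an expression such as `named["userid"]` reads far more clearly than `unnamed[2]`.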

![Turning the passwd file into a data frame](../../IMAGES/3-9.png)

Beyond

In [6]:
df = pd.read_csv(
    "../../DATA/linux-etc-passwd.txt",
    sep=":",
    comment="#",
    header=None,
    names=["username", "password", "userid", "groupid", "name", "homedir", "shell"],
    usecols=["username", "userid", "name", "homedir", "shell"],
    skip_blank_lines=True,
)
df.head()

Unnamed: 0,username,userid,name,homedir,shell
0,root,0,root,/root,/bin/bash
1,daemon,1,daemon,/usr/sbin,/usr/sbin/nologin
2,bin,2,bin,/bin,/usr/sbin/nologin
3,sys,3,sys,/dev,/usr/sbin/nologin
4,sync,4,sync,/bin,/bin/sync


In [7]:
df.loc[df.userid >= 1000]

Unnamed: 0,username,userid,name,homedir,shell
17,nobody,65534,nobody,/nonexistent,/usr/sbin/nologin
23,user,1000,"user,,,",/home/user,/bin/bash
24,reuven,1001,"Reuven M. Lerner,,,",/home/reuven,/bin/bash
33,genadi,1002,"Genadi Reznichenko,,,",/home/genadi,/bin/bash
34,shira,1003,"Shira Friedman,,,",/home/shira,/bin/bash
35,atara,1004,"Atara Lerner-Friedman,,,",/home/atara,/bin/bash
36,shikma,1005,"Shikma Lerner-Friedman,,,",/home/shikma,/bin/bash
37,amotz,1006,"Amotz Lerner-Friedman,,,",/home/amotz,/bin/bash
44,git,1007,"GitLab,,,",/home/git,/bin/bash
47,deploy,1008,"Deploy,,,",/home/deploy,/bin/bash


In [8]:
df.shell.drop_duplicates()

0             /bin/bash
1     /usr/sbin/nologin
4             /bin/sync
18           /bin/false
31              /bin/sh
42         /bin/nologin
Name: shell, dtype: object