In [1]:
import pandas as pd

## 8.1 Parsing Unix timestamps
The file we're using here is a popularity-contest file, the usage log written by the Debian/Ubuntu `popularity-contest` package.

In [4]:
# Fields are space-separated; [:-1] drops the file's last row
popcon = pd.read_csv('pandas-cookbook-master/data/popularity-contest', sep=' ')[:-1]
popcon

Unnamed: 0,POPULARITY-CONTEST-0,TIME:1387295813,ID:d9bdc557ae8941c19e95da1c5da786bd,ARCH:amd64,POPCONVER:1.53ubuntu1
0,1387295797,1367633260,perl-base,/usr/bin/perl,
1,1387295796,1354370480,login,/bin/su,
2,1387295743,1354341275,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,1387295743,1387224204,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,1387295742,1354341253,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,
...,...,...,...,...,...
2892,0,0,libreadline-dev,<NOFILES>,
2893,0,0,notify-osd-icons,<NOFILES>,
2894,0,0,python-apt-common,<NOFILES>,
2895,0,0,libindicator-messages-status-provider1,<NOFILES>,


In [5]:
popcon.columns = ['atime', 'ctime', 'package-name', 'mru-program', 'tag']

In [6]:
popcon

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,1387295797,1367633260,perl-base,/usr/bin/perl,
1,1387295796,1354370480,login,/bin/su,
2,1387295743,1354341275,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,1387295743,1387224204,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,1387295742,1354341253,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,
...,...,...,...,...,...
2892,0,0,libreadline-dev,<NOFILES>,
2893,0,0,notify-osd-icons,<NOFILES>,
2894,0,0,python-apt-common,<NOFILES>,
2895,0,0,libindicator-messages-status-provider1,<NOFILES>,


The columns are the access time, the creation time, the package name, the most recently used program, and a tag.

The magical part about parsing timestamps in pandas is that numpy datetimes are already stored as Unix timestamps. So all we need to do is tell pandas that these integers are actually datetimes; no real conversion is needed.
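
As a minimal sketch of that idea in plain numpy (the variable names are illustrative, and the two integers are the `atime` and `ctime` of the first row in the table above), an `int64` array can be reinterpreted as `datetime64[s]` without touching the stored values:

```python
import numpy as np

# atime and ctime of the first row (perl-base), as raw Unix seconds
seconds = np.array([1387295797, 1367633260], dtype='int64')

# view() reinterprets the same 8-byte integers as second-resolution datetimes
stamps = seconds.view('datetime64[s]')
print(stamps)

# Going back gives the original integers unchanged
back = stamps.view('int64')
print(back)
```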

First, we need to make sure the `atime` and `ctime` columns are ints:



In [7]:
popcon['atime'] = popcon['atime'].astype(int)
popcon['ctime'] = popcon['ctime'].astype(int)

Every numpy array and pandas Series has a dtype, usually `int64`, `float64`, or `object`. Some of the time types available are `datetime64[s]`, `datetime64[ms]`, and `datetime64[us]`. There are also corresponding `timedelta` types.
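
To make the `timedelta` types concrete, here is a small numpy sketch (names are illustrative) that subtracts the first row's `ctime` from its `atime`. Subtracting two `datetime64[s]` values gives a `timedelta64[s]`:

```python
import numpy as np

atime = np.datetime64('2013-12-17T15:56:37')  # second resolution
ctime = np.datetime64('2013-05-04T02:07:40')

# The difference of two datetimes is a timedelta with the same unit
age = atime - ctime
print(age, age.dtype)
```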

We can use the `pd.to_datetime` function to convert our integer timestamps into datetimes. This is cheap: the values stay 64-bit integers underneath, so we are mostly changing how pandas interprets the data rather than transforming it.

In [9]:
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')
popcon['ctime'] = pd.to_datetime(popcon['ctime'], unit='s')

If we look at the dtype now, it's `<M8[ns]`. `M8` is numpy's short type code for `datetime64` (an 8-byte datetime), and the `<` marks little-endian byte order.

In [11]:
popcon['atime'].dtype

dtype('<M8[ns]')
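
A quick way to check what `M8` stands for, sketched with numpy alone (variable names are illustrative):

```python
import numpy as np

short = np.dtype('M8[ns]')          # numpy's short type code
full = np.dtype('datetime64[ns]')   # the spelled-out name

print(short == full)                # same dtype under two spellings
print(full.kind, full.itemsize)     # kind 'M' (datetime), 8 bytes per value
```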

So now we can look at our `atime` and `ctime` as dates!

In [12]:
popcon[:5]

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,2013-12-17 15:56:37,2013-05-04 02:07:40,perl-base,/usr/bin/perl,
1,2013-12-17 15:56:36,2012-12-01 14:01:20,login,/bin/su,
2,2013-12-17 15:55:43,2012-12-01 05:54:35,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,2013-12-17 15:55:43,2013-12-16 20:03:24,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,2013-12-17 15:55:42,2012-12-01 05:54:13,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,


Now suppose we want to look at all the packages that aren't libraries.

First, I want to get rid of everything with a timestamp of 0. Notice how we can just use a string in this comparison, even though the column actually holds timestamps? That's because pandas is amazing: it parses the string into a timestamp for us.

In [14]:
popcon = popcon[popcon['atime'] > '1970-01-01']
popcon

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,2013-12-17 15:56:37,2013-05-04 02:07:40,perl-base,/usr/bin/perl,
1,2013-12-17 15:56:36,2012-12-01 14:01:20,login,/bin/su,
2,2013-12-17 15:55:43,2012-12-01 05:54:35,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,2013-12-17 15:55:43,2013-12-16 20:03:24,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,2013-12-17 15:55:42,2012-12-01 05:54:13,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,
...,...,...,...,...,...
2093,2010-10-15 16:41:50,2012-12-01 05:54:37,pptp-linux,/usr/sbin/pptp,<OLD>
2094,2010-06-08 10:06:29,2012-12-01 05:54:57,libfile-basedir-perl,/usr/share/perl5/File/BaseDir.pm,<OLD>
2095,2010-03-06 14:44:18,2012-12-01 05:54:37,laptop-detect,/usr/sbin/laptop-detect,<OLD>
2096,2010-02-22 14:59:21,2012-12-01 05:54:14,libfribidi0,/usr/bin/fribidi,<OLD>
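
The same string comparison on a tiny Series, as a sketch with illustrative names. A timestamp of 0 becomes exactly `1970-01-01 00:00:00`, so it is not strictly greater than `'1970-01-01'` and gets filtered out:

```python
import pandas as pd

# One "never accessed" timestamp (0) and one real one from the table above
stamps = pd.Series(pd.to_datetime([0, 1387295797], unit='s'))

# pandas parses the string on the right-hand side into a timestamp
recent = stamps > '1970-01-01'
print(recent)
print(stamps[recent])
```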


Now we can use pandas' magical string abilities to look only at the rows where the package name doesn't contain 'lib'.

In [17]:
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]

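The `~` in front of `popcon['package-name'].str.contains('lib')` negates that boolean mask, so `True` becomes `False` and vice versa. For comparison, here is the same selection without the negation: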

In [18]:
libraries = popcon[popcon['package-name'].str.contains('lib')]

In [19]:
nonlibraries

Unnamed: 0,atime,ctime,package-name,mru-program,tag
0,2013-12-17 15:56:37,2013-05-04 02:07:40,perl-base,/usr/bin/perl,
1,2013-12-17 15:56:36,2012-12-01 14:01:20,login,/bin/su,
17,2013-12-17 15:55:33,2013-11-25 16:25:38,fingerprint-gui,/lib/security/pam_fingerprint-gui.so,
18,2013-12-17 15:55:32,2012-12-01 05:53:57,dash,/bin/dash,
19,2013-12-17 15:55:29,2012-12-01 05:54:37,popularity-contest,/usr/sbin/popularity-contest,
...,...,...,...,...,...
2089,2011-04-30 20:36:36,2012-12-01 05:54:17,x11-xfs-utils,/usr/bin/xfsinfo,<OLD>
2090,2011-04-30 15:07:31,2012-12-01 05:54:15,dvd+rw-tools,/usr/bin/rpl8,<OLD>
2092,2010-12-12 02:51:56,2012-12-01 05:54:37,vbetool,/usr/sbin/vbetool,<OLD>
2093,2010-10-15 16:41:50,2012-12-01 05:54:37,pptp-linux,/usr/sbin/pptp,<OLD>


In [20]:
libraries

Unnamed: 0,atime,ctime,package-name,mru-program,tag
2,2013-12-17 15:55:43,2012-12-01 05:54:35,libtalloc2,/usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7,
3,2013-12-17 15:55:43,2013-12-16 20:03:24,libwbclient0,/usr/lib/x86_64-linux-gnu/libwbclient.so.0,<RECENT-CTIME>
4,2013-12-17 15:55:42,2012-12-01 05:54:13,libselinux1,/lib/x86_64-linux-gnu/libselinux.so.1,
5,2013-12-17 15:55:42,2012-12-01 05:54:35,libstdc++6,/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16,
6,2013-12-17 15:55:40,2013-12-16 20:03:22,libpam-winbind,/lib/x86_64-linux-gnu/security/pam_winbind.so,<RECENT-CTIME>
...,...,...,...,...,...
2084,2011-08-25 17:53:02,2012-12-01 05:54:22,guile-1.8-libs,/usr/lib/guile-1.8/bin/guile,<OLD>
2091,2011-02-15 17:01:09,2012-12-01 05:54:29,libqtgconf1,/usr/lib/qt4/imports/gconf/libQtGConfQml.so,<OLD>
2094,2010-06-08 10:06:29,2012-12-01 05:54:57,libfile-basedir-perl,/usr/share/perl5/File/BaseDir.pm,<OLD>
2096,2010-02-22 14:59:21,2012-12-01 05:54:14,libfribidi0,/usr/bin/fribidi,<OLD>
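
Note that `str.contains('lib')` matches 'lib' anywhere in the name, which is why `guile-1.8-libs` shows up among the libraries above. A small sketch of the mask, its negation, and a stricter prefix test (using `str.startswith` here is a suggested alternative, not something the original analysis does):

```python
import pandas as pd

names = pd.Series(['perl-base', 'libtalloc2', 'guile-1.8-libs', 'login'])

contains_lib = names.str.contains('lib')       # substring match anywhere
starts_with_lib = names.str.startswith('lib')  # prefix match only

# ~ flips every True/False in the mask
not_lib = names[~contains_lib]

print(contains_lib.tolist())
print(starts_with_lib.tolist())
print(not_lib.tolist())
```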


In [22]:
# Sort by creation time, newest first, and show the top 10
nonlibraries.sort_values('ctime', ascending=False)[:10]

Unnamed: 0,atime,ctime,package-name,mru-program,tag
57,2013-12-17 04:55:39,2013-12-17 04:55:42,ddd,/usr/bin/ddd,<RECENT-CTIME>
450,2013-12-16 20:03:20,2013-12-16 20:05:13,nodejs,/usr/bin/npm,<RECENT-CTIME>
454,2013-12-16 20:03:20,2013-12-16 20:05:04,switchboard-plug-keyboard,/usr/lib/plugs/pantheon/keyboard/options.txt,<RECENT-CTIME>
445,2013-12-16 20:03:20,2013-12-16 20:05:04,thunderbird-locale-en,/usr/lib/thunderbird-addons/extensions/langpac...,<RECENT-CTIME>
396,2013-12-16 20:08:27,2013-12-16 20:05:03,software-center,/usr/sbin/update-software-center,<RECENT-CTIME>
449,2013-12-16 20:03:20,2013-12-16 20:05:00,samba-common-bin,/usr/bin/net.samba3,<RECENT-CTIME>
397,2013-12-16 20:08:25,2013-12-16 20:04:59,postgresql-client-9.1,/usr/lib/postgresql/9.1/bin/psql,<RECENT-CTIME>
398,2013-12-16 20:08:23,2013-12-16 20:04:58,postgresql-9.1,/usr/lib/postgresql/9.1/bin/postmaster,<RECENT-CTIME>
452,2013-12-16 20:03:20,2013-12-16 20:04:55,php5-dev,/usr/include/php5/main/snprintf.h,<RECENT-CTIME>
440,2013-12-16 20:03:20,2013-12-16 20:04:54,php-pear,/usr/share/php/XML/Util.php,<RECENT-CTIME>
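
Because `ctime` is a real datetime column, sorting is chronological rather than alphabetical. A minimal sketch on three rows taken from the tables above (the frame name `sample` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'package-name': ['login', 'ddd', 'nodejs'],
    'ctime': pd.to_datetime([
        '2012-12-01 14:01:20',
        '2013-12-17 04:55:42',
        '2013-12-16 20:05:13',
    ]),
})

# Newest creation time first
newest = sample.sort_values('ctime', ascending=False)
print(newest)
```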


The whole message here is that if you have a timestamp in seconds, milliseconds, or nanoseconds, you can just "cast" it to a `datetime64[the-right-thing]` and pandas/numpy will take care of the rest.
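
As a short sketch of that point (variable names are illustrative), the same instant expressed in seconds, milliseconds, and nanoseconds parses to the same timestamp once `unit` matches the data:

```python
import pandas as pd

from_s = pd.to_datetime(1387295797, unit='s')
from_ms = pd.to_datetime(1387295797000, unit='ms')
from_ns = pd.to_datetime(1387295797000000000, unit='ns')

# All three describe the same moment
print(from_s, from_ms, from_ns)
```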