### Pandas Data Structures ###

The key building blocks of pandas are:
-  __Indexes:__ Sequences of labels that are immutable (like dictionary keys) and homogeneous in data type (like NumPy arrays)
-  __Series:__ 1D arrays with an Index (labels)
-  __DataFrames:__ 2D arrays with ___Series___ as columns
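
The three structures listed above can be seen together in a minimal sketch. The labels and values here are made up purely for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# An Index is immutable: assigning to one of its elements raises TypeError.
labels = pd.Index(['a', 'b', 'c'])
try:
    labels[0] = 'z'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

# A Series pairs a 1D array of values with an Index.
s = pd.Series([1.0, 2.0, 3.0], index=labels)

# Each column of a DataFrame is a Series that shares the row Index.
df = pd.DataFrame({'x': s, 'y': s * 10})
column = df['x']

print(index_is_mutable)
print(type(column).__name__)
print(column.index.equals(labels))
```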

In [11]:
import pandas as pd
import os

In [3]:
prices = [10.7, 10.86, 10.74, 10.71, 10.79]
shares = pd.Series(prices)

print(shares)

0    10.70
1    10.86
2    10.74
3    10.71
4    10.79
dtype: float64


In [5]:
# We can add an index to the previous Series:
days = ['Mon', 'Tue', 'Wed', 'Thur', "Fri"]
shares = pd.Series(prices, index=days)

print(shares)

Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64


In [8]:
# Notice that the index behaves like an array and can be sliced:
shares.index[:2]

Index(['Mon', 'Tue'], dtype='object')

In [10]:
# It also has a .name attribute that we can assign:
shares.index.name='WeekDays'
print(shares)

WeekDays
Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64
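
Once a Series has labels, values can be selected by label as well as by position. The sketch below rebuilds the `shares` Series from the cells above and also shows that, although single index entries cannot be modified, the whole index can be replaced. The replacement labels are made up for illustration.

```python
import pandas as pd

prices = [10.7, 10.86, 10.74, 10.71, 10.79]
days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
shares = pd.Series(prices, index=days)

# Select a single value by its label.
wed_price = shares['Wed']

# Label-based slices with .loc include both endpoints.
first_three = shares.loc['Mon':'Wed']

# Individual labels are immutable, but the index as a whole can be reassigned.
shares.index = ['M', 'T', 'W', 'Th', 'F']

print(wed_price)
print(first_three)
print(shares.index)
```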


In [12]:
col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'standard_dev', 'nobs', 'definite']
sunspots = pd.read_csv("data/SN_d_tot_V2.0.csv", 
                       header=None, 
                       sep=';', 
                       names=col_names,
                       na_values = {'sunspots':['  -1'], 'standard_dev':[' -1.0']},
                       parse_dates=[[0,1,2]])
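
Recent pandas releases deprecate passing a nested list to `parse_dates`, so the call above may warn or fail depending on the installed version. A possible alternative, sketched here, is to read the columns as they are and combine them afterwards with `pd.to_datetime`. The two rows of input are an assumption about the raw file layout, reconstructed from the output shown in this notebook, and the column name `date` is chosen only for this example.

```python
import io
import pandas as pd

# Assumed raw layout of the sunspot file (semicolon-separated, no header).
raw = io.StringIO(
    "1818;01;01;1818.001;  -1; -1.0;   0;1\n"
    "1818;01;02;1818.004;  -1; -1.0;   0;1\n"
)

col_names = ['year', 'month', 'day', 'dec_date', 'sunspots',
             'standard_dev', 'nobs', 'definite']
df = pd.read_csv(raw,
                 header=None,
                 sep=';',
                 names=col_names,
                 na_values={'sunspots': ['  -1'], 'standard_dev': [' -1.0']})

# to_datetime accepts a DataFrame with 'year', 'month' and 'day' columns.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df = df.drop(columns=['year', 'month', 'day'])

print(df[['date', 'dec_date']])
```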

In [13]:

sunspots.head()

,year_month_day,dec_date,sunspots,standard_dev,nobs,definite
0,1818-01-01,1818.001,,,0,1
1,1818-01-02,1818.004,,,0,1
2,1818-01-03,1818.007,,,0,1
3,1818-01-04,1818.01,,,0,1
4,1818-01-05,1818.012,,,0,1


In [14]:
# By default, a new DataFrame gets an index of consecutive, increasing integers (a RangeIndex).
sunspots.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72896 entries, 0 to 72895
Data columns (total 6 columns):
year_month_day    72896 non-null datetime64[ns]
dec_date          72896 non-null float64
sunspots          69649 non-null float64
standard_dev      69649 non-null float64
nobs              72896 non-null int64
definite          72896 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(2)
memory usage: 3.3 MB


In [16]:
# We can assign a different index in a variety of ways:
sunspots.index = sunspots['year_month_day']
sunspots.info() # Notice that now instead of RangeIndex we have a DatetimeIndex

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 72896 entries, 1818-01-01 to 2017-07-31
Data columns (total 6 columns):
year_month_day    72896 non-null datetime64[ns]
dec_date          72896 non-null float64
sunspots          69649 non-null float64
standard_dev      69649 non-null float64
nobs              72896 non-null int64
definite          72896 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(2)
memory usage: 3.9 MB


In [17]:
sunspots.head() # Notice that now, we have a redundant column of year_month_day. 

year_month_day,year_month_day,dec_date,sunspots,standard_dev,nobs,definite
1818-01-01,1818-01-01,1818.001,,,0,1
1818-01-02,1818-01-02,1818.004,,,0,1
1818-01-03,1818-01-03,1818.007,,,0,1
1818-01-04,1818-01-04,1818.01,,,0,1
1818-01-05,1818-01-05,1818.012,,,0,1


In [18]:
# We can delete the redundant column using del:
del sunspots['year_month_day']
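
Assigning the index and then deleting the column can also be done in one step with `set_index`, which by default removes the column it promotes to the index. A minimal sketch on a small made-up frame (the column names `date` and `value` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['1818-01-01', '1818-01-02', '1818-01-03']),
    'value': [1.0, 2.0, 3.0],  # made-up values
})

# set_index returns a new frame; the 'date' column becomes the index
# and is dropped from the columns (drop=True is the default).
indexed = df.set_index('date')

print(type(indexed.index).__name__)
print(list(indexed.columns))
```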

In [22]:
sunspots.head()

year_month_day,dec_date,sunspots,standard_dev,nobs,definite
1818-01-01,1818.001,,,0,1
1818-01-02,1818.004,,,0,1
1818-01-03,1818.007,,,0,1
1818-01-04,1818.01,,,0,1
1818-01-05,1818.012,,,0,1


### Hierarchical indexing ###
It is preferable to use an index that uniquely identifies the rows. Sometimes uniqueness only arises from a combination of columns. In these cases, a hierarchical index is necessary.
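
Whether a candidate index is unique can be checked with `Index.is_unique`. The sketch below uses a tiny made-up sales table (the column names `receipt`, `item` and `qty` are hypothetical) in which neither column is unique on its own, but the pair is.

```python
import pandas as pd

sales = pd.DataFrame({
    'receipt': ['r1', 'r1', 'r2'],
    'item': ['apple', 'bread', 'apple'],
    'qty': [2, 1, 5],
})

# A single column does not identify rows uniquely ...
by_receipt = sales.set_index('receipt')
# ... but the combination of both columns does.
by_both = sales.set_index(['receipt', 'item'])

print(by_receipt.index.is_unique)  # False
print(by_both.index.is_unique)     # True
```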


In [28]:
col_types = {
    'storecode': str,
    'barcode': str,
    'receipt_number': str,
    'Times': str
}
am = pd.read_csv("data/ALPHAMEGA_201705.txt", sep=';', dtype=col_types, parse_dates=['salesdate']).iloc[:20000, :]

In [32]:
am.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 15 columns):
barcode               20000 non-null object
salesdate             20000 non-null datetime64[ns]
salestime             20000 non-null object
loyaltyno             20000 non-null object
storecode             20000 non-null object
quantity              20000 non-null float64
salesvalue            20000 non-null float64
price                 20000 non-null float64
salesvalue_exclvat    20000 non-null float64
price_exclvat         20000 non-null float64
promotion             4503 non-null object
receipt_number        20000 non-null object
key_row               20000 non-null object
Flag                  219 non-null object
Times                 20000 non-null object
dtypes: datetime64[ns](1), float64(5), object(9)
memory usage: 2.3+ MB


In [33]:
am.head()

,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,key_row,Flag,Times
0,5202234640297,2017-05-04,00:00,MA043395,600011,1.0,2.98,0.0,2.83,0.0,,66540,600011-0001_011-66540-20170504,,1
1,5425014220537,2017-05-06,00:00,MA067142,600011,1.0,3.69,0.0,3.1,0.0,,33325,600011-0001_015-33325-20170506,,1
2,42268116,2017-05-06,00:00,MA005817,600012,2.0,2.1,0.0,2.0,0.0,,30120,600012-0002_105-30120-20170506,,2
3,2106692000002,2017-05-06,00:00,MA133178,600015,0.295,2.06,0.0,1.96,0.0,,43570,600015-0005_508-43570-20170506,,1
4,2001374000007,2017-05-02,00:00,MA169973,600017,1.0,1.99,0.0,1.89,0.0,,54635,600017-0007_343-54635-20170502,,1


In [38]:
# A unique index could be created by combining the barcode and key_row columns
am = am.set_index(['key_row', 'barcode'])
am.head()

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600011-0001_011-66540-20170504,5202234640297,2017-05-04,00:00,MA043395,600011,1.0,2.98,0.0,2.83,0.0,,66540,,1
600011-0001_015-33325-20170506,5425014220537,2017-05-06,00:00,MA067142,600011,1.0,3.69,0.0,3.1,0.0,,33325,,1
600012-0002_105-30120-20170506,42268116,2017-05-06,00:00,MA005817,600012,2.0,2.1,0.0,2.0,0.0,,30120,,2
600015-0005_508-43570-20170506,2106692000002,2017-05-06,00:00,MA133178,600015,0.295,2.06,0.0,1.96,0.0,,43570,,1
600017-0007_343-54635-20170502,2001374000007,2017-05-02,00:00,MA169973,600017,1.0,1.99,0.0,1.89,0.0,,54635,,1


In [40]:
print(am.index.name)
print(am.index.names)

None
['key_row', 'barcode']
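
A hierarchical index carries one name per level in `.names`, which is why the single `.name` attribute is empty above. The labels of one level can be pulled out with `get_level_values`. A small sketch with made-up labels (`receipt` and `item` are hypothetical level names):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('r1', 'apple'), ('r1', 'bread'), ('r2', 'apple')],
    names=['receipt', 'item'],
)

level_names = list(index.names)
# All labels of the inner level, one per row.
items = list(index.get_level_values('item'))

print(index.nlevels)
print(level_names)
print(items)
```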


In [49]:
am = am.sort_index()
am.head()

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600010-0010_452-4052-20170505,2804421000001,2017-05-05,00:00,MA152257,600010,0.695,6.94,0.0,6.61,0.0,,4052,,1
600010-0010_452-4064-20170506,410,2017-05-06,00:00,MA132084,600010,1.0,3.5,0.0,3.33,0.0,,4064,,1
600010-0010_452-4081-20170506,2800922000007,2017-05-06,00:00,MA201302,600010,0.302,7.14,0.0,6.8,0.0,,4081,,1
600010-0010_452-4084-20170506,5290135000852,2017-05-06,00:00,MA107804,600010,1.0,1.69,0.0,1.61,0.0,,4084,,1
600010-0010_452-4109-20170506,2804407000001,2017-05-06,00:00,MA179766,600010,0.425,4.24,0.0,4.04,0.0,,4109,,1
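
Sorting matters because label-based slicing on a hierarchical index generally expects the index to be sorted. The sketch below, on a made-up frame with hypothetical level names, checks the ordering before and after `sort_index` and then takes a label slice.

```python
import pandas as pd

df = pd.DataFrame(
    {'qty': [5, 2, 1]},  # made-up values
    index=pd.MultiIndex.from_tuples(
        [('r2', 'apple'), ('r1', 'bread'), ('r1', 'apple')],
        names=['receipt', 'item'],
    ),
)

sorted_before = df.index.is_monotonic_increasing
df = df.sort_index()
sorted_after = df.index.is_monotonic_increasing

# With a sorted index, a label slice on the outer level is safe.
subset = df.loc['r1':'r1']

print(sorted_before, sorted_after)
print(subset)
```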


In [42]:
# We can use accessors as before, but now we pass a tuple to select a specific row (indexing an individual row):
am.loc[('600010-0010_452-4109-20170506','2804407000001')]

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600010-0010_452-4109-20170506,2804407000001,2017-05-06,00:00,MA179766,600010,0.425,4.24,0.0,4.04,0.0,,4109,,1


In [43]:
# Using only an outer-index label returns all rows that share that label. (The basket chosen below
# has only 1 item, so the result is a single row.)
am.loc['600010-0010_452-4064-20170506']

barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
410,2017-05-06,00:00,MA132084,600010,1.0,3.5,0.0,3.33,0.0,,4064,,1


In [50]:
# We can extract a slice using a range of the outermost index:
am.loc['600010-0010_452-4064-20170506': '600010-0010_452-4109-20170506']

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600010-0010_452-4064-20170506,410,2017-05-06,00:00,MA132084,600010,1.0,3.5,0.0,3.33,0.0,,4064,,1
600010-0010_452-4081-20170506,2800922000007,2017-05-06,00:00,MA201302,600010,0.302,7.14,0.0,6.8,0.0,,4081,,1
600010-0010_452-4084-20170506,5290135000852,2017-05-06,00:00,MA107804,600010,1.0,1.69,0.0,1.61,0.0,,4084,,1
600010-0010_452-4109-20170506,2804407000001,2017-05-06,00:00,MA179766,600010,0.425,4.24,0.0,4.04,0.0,,4109,,1


In [52]:
# A list of outer-index labels selects several baskets at once:
am.loc[(['600010-0010_452-4081-20170506', '600010-0010_452-4109-20170506'])]

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600010-0010_452-4081-20170506,2800922000007,2017-05-06,00:00,MA201302,600010,0.302,7.14,0.0,6.8,0.0,,4081,,1
600010-0010_452-4109-20170506,2804407000001,2017-05-06,00:00,MA179766,600010,0.425,4.24,0.0,4.04,0.0,,4109,,1


In [59]:
# To select only on the inner index, we cannot use colon slicing inside the tuple for the outer level. Instead we use the built-in slice() function:
am.loc[(slice(None), ['2800922000007', '2804407000001']), :].head()

key_row,barcode,salesdate,salestime,loyaltyno,storecode,quantity,salesvalue,price,salesvalue_exclvat,price_exclvat,promotion,receipt_number,Flag,Times
600010-0010_452-4081-20170506,2800922000007,2017-05-06,00:00,MA201302,600010,0.302,7.14,0.0,6.8,0.0,,4081,,1
600010-0010_452-4109-20170506,2804407000001,2017-05-06,00:00,MA179766,600010,0.425,4.24,0.0,4.04,0.0,,4109,,1
600015-0005_504-55135-20170531,2804407000001,2017-05-31,00:00,MA114878,600015,0.54,5.39,0.0,5.13,0.0,,55135,,1
600022-0012_206-45511-20170529,2804407000001,2017-05-29,00:00,RZ999,600022,0.575,5.74,0.0,5.47,0.0,,45511,,1
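
Two other ways of selecting on the inner level may read more naturally than `slice(None)`: `pd.IndexSlice`, which allows colon syntax inside `.loc`, and `DataFrame.xs`, which takes a cross-section for one label of a named level. The sketch below compares them on a small made-up frame (level names `receipt` and `item` are hypothetical).

```python
import pandas as pd

df = pd.DataFrame(
    {'qty': [1, 2, 5]},  # made-up values
    index=pd.MultiIndex.from_tuples(
        [('r1', 'apple'), ('r1', 'bread'), ('r2', 'apple')],
        names=['receipt', 'item'],
    ),
).sort_index()

# Explicit slice(None) for the outer level, as in the cell above.
via_slice = df.loc[(slice(None), ['apple']), :]

# IndexSlice lets us write the same selection with a colon.
idx = pd.IndexSlice
via_indexslice = df.loc[idx[:, ['apple']], :]

# xs selects one label on a named level and drops that level from the result.
via_xs = df.xs('apple', level='item')

print(via_slice)
print(via_xs)
```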
