# D07: Useful things in Pandas


* String Slicing
* Binning & Categorical Ordering
* Duplicates
* Pivot
* Stack / Unstack
* Datetime format

In [None]:
import pandas as pd
import numpy as np

## Slicing in Pandas

String slicing in pandas is just as easy as in regular Python using the .str accessor:

In [None]:
df = pd.DataFrame({'uid': [1000,1001,1001,1002,1002,1003,1004,1005],
                    'data1':['Aaaaa1','Baaaa2','Baaaa3','Caaaa4','Daaaa5','Eaaaa6','Faaaa7','Gaaaa8']})

df['Capital'] = df['data1'].str[0:1]    # First 1
df['Trails'] = df['data1'].str[1:]      # Everything from 2nd
df
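The same slice syntax supports negative indices and steps, just as it does on ordinary Python strings. A minimal sketch, using a small made-up Series (`codes` is a hypothetical name, not part of the example above):

```python
import pandas as pd

# Hypothetical sample values, similar in shape to the 'data1' column
codes = pd.Series(['Aaaaa1', 'Baaaa2', 'Caaaa4'])

last_char = codes.str[-1]       # Last character of each string
middle = codes.str[1:-1]        # Everything between the first and last character
every_other = codes.str[::2]    # Every second character, starting from the first
```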

## Binning & Categorical Ordering

<a href = "https://en.wikipedia.org/wiki/Data_binning">Binning</a> is the process of separating continuous numeric data into representative categorical 'bins'. A good example of this is creating categorical decades based upon numeric year data.

There is a good example of this <a href = "http://chrisalbon.com/python/pandas_binning_data.html">here</a>.

In [None]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 
                         'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

In [None]:
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
categories = pd.cut(df['postTestScore'], bins, labels=group_names)  # Bins are right-inclusive: (0, 25], (25, 50], ...
df['categories'] = categories
df
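It is worth knowing which edge of each bin is inclusive. By default pd.cut closes bins on the right, so a score of exactly 25 lands in 'Low'; passing right=False closes them on the left instead. The sketch below uses a small hypothetical Series of boundary values (`scores`) to show the difference:

```python
import pandas as pd

# Hypothetical scores sitting exactly on the bin edges
scores = pd.Series([25, 50, 75, 100])
bins = [0, 25, 50, 75, 100]
labels = ['Low', 'Okay', 'Good', 'Great']

# Default: intervals are (0, 25], (25, 50], (50, 75], (75, 100]
right_closed = pd.cut(scores, bins, labels=labels)

# right=False: intervals are [0, 25), [25, 50), [50, 75), [75, 100)
# so 100 falls outside every bin and becomes NaN
left_closed = pd.cut(scores, bins, labels=labels, right=False)
```

With the default settings a value equal to the lowest edge (0 here) falls outside the first bin; include_lowest=True brings it in.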


## Categorical Variables

With many categorical variables, alphabetical order isn't necessarily the best order in which to present the data.

In such cases you can use the Categorical class to define a custom order as follows:

In [None]:
df = pd.DataFrame({'station': ['London','York','Newcastle','London','York','Newcastle','London','York','Newcastle','London','York','Newcastle'],
                   'value':[45000,18000,20000,36000,23000,22000,93000,45000,42000,96000,88000,54000]})

df['station'] = pd.Categorical(df['station'],['London','York','Newcastle'])  # Creating the Categorical Variable
gp1 = df.groupby(['station']).sum()                                          # Groups appear in the category order, not alphabetically
gp1
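The custom order is also useful outside of groupby. As a rough sketch (the `stations` frame is a made-up example), passing ordered=True makes the categories comparable, and sort_values then follows the order you defined:

```python
import pandas as pd

# Hypothetical data, deliberately listed out of the desired order
stations = pd.DataFrame({'station': ['York', 'Newcastle', 'London'],
                         'value': [18000, 20000, 45000]})

stations['station'] = pd.Categorical(stations['station'],
                                     categories=['London', 'York', 'Newcastle'],
                                     ordered=True)

sorted_stations = stations.sort_values('station')      # London, York, Newcastle
before_newcastle = stations['station'] < 'Newcastle'   # Comparison uses the category order
```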

## Duplicates

Duplicate records can cause havoc if not picked up! Pandas has a number of options for dealing with these, both for the overall dataframe and for individual columns: 

In [None]:
df1 = pd.DataFrame({'uid': [1000,1001,1001,1002,1002,1003,1004,1005],
                  'data1':['A','B','B','C','D','E','F','G']})
df1 

In [None]:
dup = df1.duplicated()              # Finds duplicates
df2 = df1.drop_duplicates()         # Drops duplicates
df2

Unless you provide it with an argument, the drop_duplicates() method compares all the columns in the dataframe and only drops records whose values are duplicated in every column (so rows 1 & 2 are treated as duplicates, but rows 3 & 4 are not). The .duplicated() method works in the same way.

However, when we specify a column, it compares only that column and drops every repeated value in it, regardless of what the rest of the row looks like:

In [None]:
df4 = df1.drop_duplicates(['uid'])    # Drops duplicates in the specified column - Defaults to keep the first duplicate
df4

Note that pandas will always default to keeping the first duplicate unless you specify otherwise:

In [None]:
df5 = df1.drop_duplicates(['uid'],keep='last')  # Keeping the last record
df5
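Before dropping anything it often helps to inspect the offending rows. A minimal sketch, assuming a small frame like the one above (`records` is a hypothetical name): duplicated() accepts the same subset and keep arguments, and keep=False flags every member of a duplicate group rather than all but one.

```python
import pandas as pd

records = pd.DataFrame({'uid': [1000, 1001, 1001, 1002, 1002],
                        'data1': ['A', 'B', 'B', 'C', 'D']})

# Rows that are identical across all columns (both copies are shown)
full_row_dupes = records[records.duplicated(keep=False)]

# Rows that share a uid, whatever the other columns contain
uid_dupes = records[records.duplicated(subset=['uid'], keep=False)]
```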

## Pivoting

Pivoting allows you to change the shape of your datasets. In the example below we can see that the data is in 'long' form. 

In [None]:
df = pd.DataFrame({'time':[1,1,1,2,2,2,3,3,3,4,4,4],
                   'category':['A','B','C','A','B','C','A','B','C','A','B','C'],
                   'data':np.random.randint(0,1000,12)*100})
df

Using the pivot function we can reshape this data into 'wide' form:

In [None]:
df2 = df.pivot(index='time', columns='category', values='data')
df2.index.name = None
df2
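Note that pivot needs each index/column pair to appear only once. If the long data has repeats, pivot raises a ValueError, and pivot_table can be used instead to aggregate them. The sketch below uses a small hypothetical frame (`long_df`) and sums the repeats; the choice of 'sum' is only an illustration:

```python
import pandas as pd

# Hypothetical long-form data where (time=1, category='A') appears twice
long_df = pd.DataFrame({'time': [1, 1, 1, 2],
                        'category': ['A', 'A', 'B', 'A'],
                        'data': [100, 300, 500, 700]})

try:
    long_df.pivot(index='time', columns='category', values='data')
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate values to keep
    pivot_failed = True

# pivot_table resolves the repeats with an aggregation function
wide = long_df.pivot_table(index='time', columns='category',
                           values='data', aggfunc='sum')
```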

## Stack & Unstack

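Stack and unstack are methods that allow you to reshape your dataframe: stack moves the column labels into the row index, and unstack moves an index level back out into columns.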

In [None]:
df = pd.DataFrame(np.arange(8).reshape((2, 4)),
                  index=pd.Index(['LA', 'SF'], name='city'),
                  columns=pd.Index(['col1', 'col2', 'col3','col4'], name='letter'))
df

In [None]:
df = df.stack()     # Stack 'stacks' the columns into a series
df

In [None]:
df = df.unstack() # Unstack 'unstacks' the series back into a dataframe
df
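By default unstack moves the innermost index level back to the columns, but you can name a different level. A short sketch on the same shape of data (the variable names are illustrative only):

```python
import pandas as pd
import numpy as np

wide = pd.DataFrame(np.arange(8).reshape((2, 4)),
                    index=pd.Index(['LA', 'SF'], name='city'),
                    columns=pd.Index(['col1', 'col2', 'col3', 'col4'], name='letter'))

stacked = wide.stack()                # Series indexed by (city, letter)
round_trip = stacked.unstack()        # Innermost level ('letter') goes back to the columns
by_letter = stacked.unstack('city')   # Move 'city' to the columns instead
```

Here by_letter has the letters as rows and the cities as columns, which is the transpose of the original frame.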

## Datetime Format

Python has its own built-in <a href = "https://docs.python.org/3.5/library/datetime.html">datetime module</a> for dealing with date and time data. You can find a good overview of this <a href = "http://effbot.org/librarybook/datetime.htm">here</a>; however, we'll now look at how you can deal with datetime formats in pandas.

First we'll import some data with some dates in it as follows:

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'date':['11/05/2016','18/05/2016','01/06/2016','08/06/2016','15/06/2016'],  # Importing some dates as a string
                   'data':np.random.randint(0,1000,5)*100})                                    # Some random data
df = df.sort_values(by=['date'])                                                               # Sorting the values by date
df

As we can see, pandas treats the values as plain strings, so the sort is alphabetical rather than chronological. However, we can change that by using the to_datetime function as follows:

In [None]:
def dttm(row):
    # Parse one row's date string; return NaT if it doesn't match the format
    try:
        return pd.to_datetime(row['date'], dayfirst=True, format="%d/%m/%Y")
    except ValueError:
        return pd.NaT

df['datetime'] = df.apply(dttm, axis=1)
df = df.sort_values(by=['datetime'])
df

In [None]:
type(df['datetime'].iloc[0])

The format= argument looks a little bit daunting at first, but the syntax is actually quite simple.

Each of the tokens relates to a specific part of the date as follows:

    %d Day of the month as a zero-padded decimal number (e.g. 30)
    %m Month as a zero-padded decimal number. (e.g. 09)
    %Y Year with century as a decimal number. (e.g. 2013)
    
The / in the argument represents the delimiter between the values.

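Also note that we pass the dayfirst = True argument. This tells pandas to read ambiguous dates day-first rather than month-first as per the US convention. When an explicit format is given the order is already fixed by the tokens, so dayfirst mainly matters when pandas has to work out the format itself.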
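To see the effect of these arguments in isolation, here is a small sketch on a single ambiguous date string (the variable names are illustrative only):

```python
import pandas as pd

# No format given: pandas reads an ambiguous date month-first by default
month_first = pd.to_datetime('11/05/2016')

# dayfirst=True flips that interpretation
day_first = pd.to_datetime('11/05/2016', dayfirst=True)

# An explicit format leaves no room for ambiguity
explicit = pd.to_datetime('11/05/2016', format='%d/%m/%Y')

# The same tokens work in the other direction with strftime
iso_text = explicit.strftime('%Y-%m-%d')
```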

Let's look at a slightly different example:

In [None]:
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'date':['11-May-16','18-May-16','01-Jun-16','08-Jun-16','15-Jun-16'], # Importing some dates as a string
                   'data':np.random.randint(0,1000,5)*100})                               # Some random data
df2 = df2.sort_values(by=['date'])                                                        # Sorting the values by date
df2

Here the delimiters and formats are different so we need to pass different tokens:

In [None]:
def dttm2(row):
    # Parse dates such as '11-May-16'; return NaT if the value doesn't match
    try:
        return pd.to_datetime(row['date'], dayfirst=True, format="%d-%b-%y")
    except ValueError:
        return pd.NaT

df2['datetime'] = df2.apply(dttm2, axis=1)
df2 = df2.sort_values(by=['datetime'])
df2

Here, the codes we're using are as follows:

    %d Day of the month as a zero-padded decimal number (e.g. 30)
    %b Month as locale’s abbreviated name. (e.g. Sep)
    %y Year without century as a zero-padded decimal number. (e.g. 16)

Lastly, pandas handles values that include a time of day in the same way, using the %H, %M and %S tokens for hours, minutes and seconds:

In [None]:
df3 = pd.DataFrame({'date':['11-May-16 06:01:23','18-May-16 12:22:01','01-Jun-16 18:51:34',
                            '08-Jun-16 23:19:16','15-Jun-16 00:01:04'],                      # Importing some datetimes
                   'data':np.random.randint(0,1000,5)*100})                                  # Some random data


def dttm3(row):
    # Parse date and time such as '11-May-16 06:01:23'; return NaT on a mismatch
    try:
        return pd.to_datetime(row['date'], dayfirst=True, format="%d-%b-%y %H:%M:%S")
    except ValueError:
        return pd.NaT

df3['datetime'] = df3.apply(dttm3, axis=1)
df3 = df3.sort_values(by=['datetime'])
df3 

You can find a full reference of the datetime tokens and what they mean <a href = "http://strftime.org/">here</a>.

You'll also notice that I've used try/except blocks instead of if/elif. This is because a lot of real-world date data has inconsistencies and issues with it, and try/except lets you specify how values that fail to parse should be handled.
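When the only handling you need is "mark the bad values and carry on", an alternative to the row-by-row functions is to pass the whole column to to_datetime with errors='coerce', which turns unparseable values into NaT. This is a sketch of that alternative rather than a replacement for the approach above, and the sample values are made up:

```python
import pandas as pd

# Hypothetical column with two valid dates and two problem values
dates = pd.Series(['11/05/2016', '18/05/2016', 'not a date', '31/02/2016'])

# Parse the whole column at once; anything that doesn't fit becomes NaT
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# Pull out the original strings that failed, ready for inspection
bad_rows = dates[parsed.isna()]
```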