# PANDAS - in depth (part2)

We will look at

    * Data transformation
    * Data aggregation

In [157]:
# Setting up working environment

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)
%matplotlib inline

## Dropping duplicates

Detecting duplicate rows in huge datasets can be problematic. Pandas provides tools for finding and removing
duplicate rows.

   * The duplicated() function applied to a DataFrame detects the rows that are duplicates of another row.
   * It returns a Series of Booleans with one element per row: True if the row repeats an earlier row (the first occurrence is not flagged), and False otherwise.

In [158]:
dframe = pd.DataFrame({'color': ['white','white','red','red','white'],'value': [2,1,3,3,2]})

In [159]:
dframe

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
3,red,3
4,white,2


In [160]:
# Detecting duplicates

dframe.duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

## Boolean returns and removing duplicates

In [161]:
# We can make use of the fact that the result of this operation is a boolean series to filter rows:
# To find the duplicate rows, just type:

dframe[dframe.duplicated()]

Unnamed: 0,color,value
3,red,3
4,white,2


In [162]:
dframe

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
3,red,3
4,white,2


In [163]:
# The drop_duplicates() function returns the DataFrame without duplicate rows.

dframe.drop_duplicates()

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
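
Both functions take optional arguments that change what counts as a duplicate. The following is a minimal sketch of the keep and subset parameters; it rebuilds the same dframe so that it runs on its own, and the result variable names are only illustrative.

```python
import pandas as pd

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

# keep='last' flags every occurrence except the last one
dup_keep_last = dframe.duplicated(keep='last')

# keep=False flags all rows that have a duplicate anywhere in the frame
dup_all = dframe.duplicated(keep=False)

# subset restricts the comparison to the listed columns
first_per_color = dframe.drop_duplicates(subset=['color'])

print(dup_keep_last.tolist())
print(dup_all.tolist())
print(first_per_color)
```

With subset=['color'], only the first row of each color survives, regardless of the value column.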


# Replace

Often the data structure that you have assembled contains values that do not meet your needs. For instance, some of the text
   * may be in a foreign language,
   * may contain unwanted synonyms, or
   * may not be in the desired form.
   
In such cases, we can use the replace() function.

In [164]:
frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                               'color':['white','rosso','verde','black','yellow'],
                               'price':[5.56,4.20,1.30,0.56,2.75]})

In [165]:
frame

Unnamed: 0,color,item,price
0,white,ball,5.56
1,rosso,mug,4.2
2,verde,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


## Creating a mapping

In [166]:
# First, we create a mapping as follows:

newcolors = {'rosso': 'red','verde': 'green'}

# Now we use replace using the mapping as an argument:

In [167]:
frame.replace(newcolors)

Unnamed: 0,color,item,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75
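
Passed a flat dict, replace() looks for the old values in every column. As a hedged sketch, a nested dict of the form {column: {old: new}} limits the replacement to one column; the name translated is only illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug', 'pen', 'pencil', 'ashtray'],
                      'color': ['white', 'rosso', 'verde', 'black', 'yellow'],
                      'price': [5.56, 4.20, 1.30, 0.56, 2.75]})
newcolors = {'rosso': 'red', 'verde': 'green'}

# A nested dict {column: {old: new}} limits the replacement to that column
translated = frame.replace({'color': newcolors})

print(translated['color'].tolist())

# replace() returns a new object; the original frame is unchanged
print(frame['color'].tolist())
```

To keep the result, assign it back to a variable, since the original frame is not modified.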


## Replacing instances of NaN

In [168]:
# For example with 0s:

ser = pd.Series([1,3,np.nan,4,6,np.nan,3])


In [169]:
ser   #original series

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

In [170]:
ser.replace(np.nan,0)  #modified series

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64
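
For missing values specifically, pandas also offers fillna(). The short sketch below, with illustrative variable names, checks that it gives the same result as the replace() call on the same series.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, np.nan, 4, 6, np.nan, 3])

filled = ser.fillna(0)              # dedicated method for missing values
replaced = ser.replace(np.nan, 0)   # the replace() approach

# Both approaches produce the same series
print(filled.equals(replaced))
```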

## Using mapping to add values into a column

In [171]:
# The mapping is always defined separately. First, define the DataFrame:

frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                               'color':['white','red','green','black','yellow']})

In [172]:
frame

Unnamed: 0,color,item
0,white,ball
1,red,mug
2,green,pen
3,black,pencil
4,yellow,ashtray


## The mapping

Let's suppose you want to add a column giving the price of each item in the DataFrame frame. Assume you have a price list available somewhere that gives the price for each type of item. Then define a dict object that maps each type of item to its price.

In [173]:
price = {'ball' : 5.56, 'mug' : 4.20, 'bottle' : 1.30, 'scissors' : 3.41, 'pen' : 1.30, 'pencil' : 0.56, 
         'ashtray' :2.75}

In [174]:
price   # a dictionary that maps each item to its price

{'ashtray': 2.75,
 'ball': 5.56,
 'bottle': 1.3,
 'mug': 4.2,
 'pen': 1.3,
 'pencil': 0.56,
 'scissors': 3.41}

## Applying the mapping

The map() function, applied to a Series or to a column of a DataFrame, accepts a function or a dict-like object that defines the mapping. So in this case you can apply the price mapping to the item column and store the result as a new price column in the DataFrame frame.

In [175]:
frame['price'] = frame['item'].map(price) 

In [176]:
frame   # the newly modified frame where we added the price column !!

Unnamed: 0,color,item,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75
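
Here every item has an entry in the price dict. When a value has no entry, map() produces NaN for it. The sketch below illustrates this with a hypothetical item, 'stapler', that is not in the price list; the names items, mapped and missing are illustrative too.

```python
import pandas as pd

price = {'ball': 5.56, 'mug': 4.20, 'pen': 1.30}

# 'stapler' is a hypothetical item that does not appear in the price dict
items = pd.Series(['ball', 'stapler', 'pen'])

mapped = items.map(price)
print(mapped)

# One possible follow-up: list the items that still need a price
missing = items[mapped.isna()]
print(missing.tolist())
```

Checking for NaN after a map() call is a cheap way to spot keys that the mapping does not cover.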


## Discretization and Binning

In [177]:
# Supposing we have readings of an experimental value between 0 and 100. 
# These data are collected in a list.

results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

You know that the experimental values range from 0 to 100; therefore you can divide this interval
uniformly, for example into four equal parts, i.e., bins. With the default settings of cut(), each bin excludes
its left edge and includes its right edge: the first bin holds values greater than 0 and up to 25, the second
greater than 25 and up to 50, the third greater than 50 and up to 75, and the last greater than 75 and up to 100.

In [178]:
# To do this binning with pandas, first you have to define an array containing the values for the
# separation of the bins:

bins = [0,25,50,75,100]

In [179]:
#Question: How would you create the above bins using numpy inbuilt function linspace?
# Write the code here

#the_bins = ??

In [180]:
# Then there is a special function called cut() which is applied to the array of results, passing the bins.

cat = pd.cut(results, bins)

In [181]:
# The object returned by the cut() function is a special object of Categorical type. You can consider it
# as an array of strings indicating the name of the bin. Internally it contains a categories array holding
# the names of the different bins and a codes array holding one integer for each element of results
# (i.e., the array subjected to binning). Older pandas versions called these levels and labels.
# Each integer is the position of the bin to which the corresponding element of results is assigned.

In [182]:
cat

[(0, 25], (25, 50], (50, 75], (50, 75], (25, 50], ..., (75, 100], (0, 25], (25, 50], (75, 100], (75, 100]]
Length: 17
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

In [183]:
# This means that the value 12 (the first element of results) belongs to the bin (0, 25],
# the value 34 belongs to the bin (25, 50], and so on.

In [184]:
# Older pandas versions exposed the bin numbers as cat.labels. That attribute was deprecated in favour of
# cat.codes and has since been removed, so use cat.codes instead.

In [185]:
cat.codes

array([0, 1, 2, 2, 1, 3, 3, 0, 0, 2, 2, 1, 3, 0, 1, 3, 3], dtype=int8)

In [186]:
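# Finally, to count the occurrences in each bin, that is, how many results fall into each category,
# use the value_counts() function.
# (Recent pandas releases deprecate the top-level pd.value_counts(); pd.Series(cat).value_counts() is the
# equivalent call there.)

pd.value_counts(cat)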

(75, 100]    5
(50, 75]     4
(25, 50]     4
(0, 25]      4
dtype: int64
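
The interval notation can be replaced by readable names through the labels parameter of cut(). The sketch below assumes four hypothetical bin names (they are not part of the original data) and counts the results per named bin.

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]
bins = [0, 25, 50, 75, 100]

# Hypothetical bin names, one per interval
bin_names = ['low', 'mid-low', 'mid-high', 'high']

named = pd.cut(results, bins, labels=bin_names)
counts = pd.Series(named).value_counts()

print(list(named)[:3])
print(counts)
```

The list passed to labels must contain exactly one name per bin, i.e., one fewer than the number of bin edges.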

# Detecting and filtering outliers

In [187]:
# We often wish to detect and remove outlying datapoints.
# By way of example, create a DataFrame with 1,000 rows and three columns of random values drawn from
# a standard normal distribution:

randframe = pd.DataFrame(np.random.randn(1000,3))

In [188]:
#randframe

In [189]:
randframe.shape

(1000, 3)

In [190]:
# With the describe() function you can see the statistics for each column.

randframe.describe()

Unnamed: 0,0,1,2
count,1000.0,1000.0,1000.0
mean,-0.003095,-0.018349,-0.06093
std,0.978118,1.032731,0.96874
min,-3.577447,-3.149915,-3.468253
25%,-0.640088,-0.765361,-0.674932
50%,0.010943,0.001035,-0.080416
75%,0.606777,0.688341,0.527926
max,3.538345,3.036227,2.894686


In [191]:
# For example, you might consider as outliers those values whose absolute value exceeds three times the standard deviation of their column.

# To have only the standard deviation of each column of the DataFrame, use the std() function:
randframe.std()

0    0.978118
1    1.032731
2    0.968740
dtype: float64

In [192]:
# Now we apply the filter to all the values of the DataFrame, applying the corresponding standard deviation
# for each column.

# The any() function then flags every row in which at least one column exceeds its threshold.

In [193]:
# Detecting outliers: select the rows that contain at least one outlying value

result = randframe[(np.abs(randframe) > (3*randframe.std())).any(axis=1)]    # rows with an outlier in any column

In [194]:
#result
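
The filter selects the rows that contain an outlier; negating the same Boolean mask gives the rows to keep. The following is a minimal sketch with a fixed seed so that it can be repeated. The generator rng and the names data, is_outlier, outliers and cleaned are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
data = pd.DataFrame(rng.standard_normal((1000, 3)))

# True for rows where any column exceeds three standard deviations in absolute value
is_outlier = (data.abs() > 3 * data.std()).any(axis=1)

outliers = data[is_outlier]
cleaned = data[~is_outlier]      # the rows that remain after removal

print(len(outliers), len(cleaned))
```

Like the filter it extends, this sketch compares values against the standard deviation without subtracting the column mean, which is reasonable here because the data are centred near zero.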

# Permutation

In [195]:
# Permutation (random reordering) of a Series or of the rows of a DataFrame is easy to do
# using the numpy.random.permutation() function.

In [196]:
nframe = pd.DataFrame(np.arange(25).reshape(5,5))

In [197]:
nframe

Unnamed: 0,0,1,2,3,4
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19
4,20,21,22,23,24


In [198]:
# Now create an array of the five integers from 0 to 4 arranged in random order with the permutation()
# function. This array gives the new order of the rows in the DataFrame.

new_order = np.random.permutation(5)

In [199]:
new_order

array([2, 1, 4, 3, 0])

In [200]:
# Now apply it to all of the rows of the DataFrame, using the take() function:

nframe.take(new_order)

# Now the indices follow the same order as indicated in the new_order array.

Unnamed: 0,0,1,2,3,4
2,10,11,12,13,14
1,5,6,7,8,9
4,20,21,22,23,24
3,15,16,17,18,19
0,0,1,2,3,4


In [201]:
# You can also reorder just a portion of the DataFrame by passing take() an array that covers only
# a limited range of row positions, in our case 2 to 4.

In [202]:
new_order = [3,4,2]

In [203]:
nframe.take(new_order)

Unnamed: 0,0,1,2,3,4
3,15,16,17,18,19
4,20,21,22,23,24
2,10,11,12,13,14


## Random sampling

Sometimes, when you have a huge DataFrame, you may need to sample it randomly. A quick way to do this is the np.random.randint() function, which draws row positions with replacement, so the same row can be selected more than once.

In [204]:
sample = np.random.randint(0, len(nframe), size=3)

In [205]:
sample


array([2, 0, 0])

In [206]:
# take random samples

nframe.take(sample)

Unnamed: 0,0,1,2,3,4
2,10,11,12,13,14
0,0,1,2,3,4
0,0,1,2,3,4
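
Row 0 appears twice above because np.random.randint() can return the same position more than once. As an alternative sketch, the DataFrame.sample() method draws rows without replacement by default; the random_state values and variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

nframe = pd.DataFrame(np.arange(25).reshape(5, 5))

# Three distinct rows; random_state makes the draw repeatable
subset = nframe.sample(n=3, random_state=0)

# frac=1 returns every row exactly once, in random order (a full shuffle)
shuffled = nframe.sample(frac=1, random_state=0)

print(subset)
print(shuffled.index.tolist())
```

The frac=1 form is also a compact way to permute all rows without building the index array by hand.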


# Data aggregation

The last stage of data manipulation is data aggregation.

  * By data aggregation we usually mean a transformation that produces a single value from an array.
  * We have already seen examples using sum, mean, count etc.
  * A major tool for aggregation in pandas is GroupBy.

## GroupBy

We can think of the GroupBy process as comprising three stages: splitting, applying, and combining.

  * Splitting: The initial splitting into groups is usually done on the basis of a common index or data value.
  * Applying: In the second phase, a function (more precisely, a calculation) is applied to each group, producing a single new value per group.
  * Combining: The last phase, that of combining, will collect all the results obtained from each group and combine them together to form a new object.
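
To make the three stages concrete, the sketch below carries them out by hand on a small illustrative DataFrame with a color and a price1 column, without calling groupby(). The names pieces, means and manual are hypothetical; in practice groupby() does all of this in one call.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

# Splitting: one sub-Series of prices per distinct color
pieces = {color: frame.loc[frame['color'] == color, 'price1']
          for color in frame['color'].unique()}

# Applying: reduce each piece to a single value
means = {color: piece.mean() for color, piece in pieces.items()}

# Combining: collect the per-group values into one object
manual = pd.Series(means).sort_index()

print(manual)
```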

In [207]:
# We define a DataFrame containing both numeric and string values as:
    
frame = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                      'object': ['pen','pencil','pencil','ashtray','pen'],
                      'price1' : [5.56,4.20,1.30,0.56,2.75],
                      'price2' : [4.75,4.12,1.60,0.75,3.15]})

In [208]:
frame

Unnamed: 0,color,object,price1,price2
0,white,pen,5.56,4.75
1,red,pencil,4.2,4.12
2,green,pencil,1.3,1.6
3,red,ashtray,0.56,0.75
4,green,pen,2.75,3.15


In [209]:
# Suppose you want to calculate the average of the price1 column for each group label in the column color.
# There are several ways to do this. You can, for example, access the price1 column and call the groupby()
# function with the column color.

In [210]:
group = frame['price1'].groupby(frame['color'])

In [211]:
group   # print the group object

<pandas.core.groupby.SeriesGroupBy object at 0x7f6028443a58>

In [212]:
# The object that we got is a GroupBy object.

# In the operation that we just did, there was not really any calculation; there was just a collection of all
# the information needed to go into the calculation.

# What we have done is in fact a process of grouping, in which all rows having the same value of color
# are grouped into a single item

In [213]:
# To analyse in detail how the division into groups of rows of the DataFrame was made, we call the
# attribute groups of the GroupBy object.

group.groups

{'green': Int64Index([2, 4], dtype='int64'),
 'red': Int64Index([1, 3], dtype='int64'),
 'white': Int64Index([0], dtype='int64')}

In [214]:
# Each group is listed explicitly specifying the rows of the data frame assigned to each of them.

## Now we can apply the operation to obtain the results for each individual group:

In [215]:
group.sum()

color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

In [216]:
group.mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64
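
An equivalent and commonly used spelling groups the whole DataFrame by column name and then selects the column to aggregate. The sketch below rebuilds frame so that it runs on its own and also uses agg() to compute the sum and the mean in one call; by_name and summary are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# Group the whole frame by column name, then select the column to aggregate
by_name = frame.groupby('color')['price1']

# agg() computes several statistics in one call, one column per statistic
summary = by_name.agg(['sum', 'mean'])
print(summary)
```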

# Hierarchical grouping

In [217]:
# The same thing can be extended to multiple columns, i.e., make a grouping of multiple keys:

ggroup = frame['price1'].groupby([frame['color'],frame['object']])

In [218]:
ggroup

<pandas.core.groupby.SeriesGroupBy object at 0x7f6028641390>

In [219]:
ggroup.groups 

{('green', 'pen'): Int64Index([4], dtype='int64'),
 ('green', 'pencil'): Int64Index([2], dtype='int64'),
 ('red', 'ashtray'): Int64Index([3], dtype='int64'),
 ('red', 'pencil'): Int64Index([1], dtype='int64'),
 ('white', 'pen'): Int64Index([0], dtype='int64')}

In [220]:
ggroup.sum()

color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64
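
The result of a grouping on two keys is a Series with a two-level index. One way to read it more easily, sketched here with illustrative names, is to move the inner level into the columns with unstack(); combinations that do not occur in the data show up as NaN.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

sums = frame.groupby(['color', 'object'])['price1'].sum()

# Move the inner index level (object) into the columns
table = sums.unstack()
print(table)
```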

In [221]:
# So far we have applied the grouping to a single column of data. 
# It can be extended to multiple columns or the entire data frame.

# Also if you do not need to reuse the object GroupBy several times, it is convenient to combine into a
# single pass all of the groupings and calculations to be done, without defining any intermediate
# variable.

In [222]:
# print the original DataFrame 'frame'

frame

Unnamed: 0,color,object,price1,price2
0,white,pen,5.56,4.75
1,red,pencil,4.2,4.12
2,green,pencil,1.3,1.6
3,red,ashtray,0.56,0.75
4,green,pen,2.75,3.15


In [223]:
# Apply grouping

frame[['price1','price2']].groupby(frame['color']).mean()

color,price1,price2
green,2.025,2.375
red,2.38,2.435
white,5.56,4.75
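
When different columns need different statistics, the grouping and the calculations can still be written as a single expression. The sketch below assumes a pandas version that supports named aggregation (0.25 or later); the output column names mean_price1 and max_price2 are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# Each keyword names an output column and gives (input column, function)
stats = frame.groupby('color').agg(
    mean_price1=('price1', 'mean'),
    max_price2=('price2', 'max'),
)
print(stats)
```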


## Group iteration

In [224]:
# The GroupBy object supports iteration, yielding a sequence of 2-tuples that each contain
# the name of the group together with the rows belonging to it.

for name, group in frame.groupby('color'):
    print(name)
    print(group)

green
   color  object  price1  price2
2  green  pencil    1.30    1.60
4  green     pen    2.75    3.15
red
  color   object  price1  price2
1   red   pencil    4.20    4.12
3   red  ashtray    0.56    0.75
white
   color object  price1  price2
0  white    pen    5.56    4.75


In [225]:
# In this example, we only applied the print function for illustration.
# In practice, you replace the printing operation of a variable with the function to be applied.
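
As a sketch of that idea, the loop below replaces print() with a small function and collects one result per group. The helper price_range and the dict ranges are hypothetical names introduced only for this example.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

def price_range(rows):
    """Hypothetical helper: spread between the highest and lowest price1."""
    return rows['price1'].max() - rows['price1'].min()

# Apply the function to each group and store the result under the group name
ranges = {}
for name, rows in frame.groupby('color'):
    ranges[name] = price_range(rows)

print(ranges)
```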