## Series Data Structure in Pandas


A Series is a data structure that is a cross between a list and a dictionary.

Items are stored in order, and each item has a label that can be used to retrieve it.
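
As a minimal sketch of that dual nature (the data and the variable names here are made up for illustration), the same Series can be read by position like a list or by label like a dictionary:

```python
import pandas as pd

# Hypothetical data: ordered values, each with its own label
capitals = pd.Series(['Thimphu', 'Edinburgh', 'Tokyo'],
                     index=['Bhutan', 'Scotland', 'Japan'])

first_by_position = capitals.iloc[0]     # list-like: access by position
tokyo_by_label = capitals.loc['Japan']   # dict-like: access by label
labels = list(capitals.index)            # labels keep their insertion order
```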


In [107]:
import pandas as pd
pd.Series?

In [106]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

In [108]:
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [7]:
# Note how pandas (which uses NumPy underneath) handles missing data
animals = ['Tiger', 'Bear', None]
pd.Series(animals) # Note that the dtype is object and the missing value is kept as None

0    Tiger
1     Bear
2     None
dtype: object

In [8]:
numbers = [1, 2, None]
pd.Series(numbers) # Note that here the dtype is float and the missing value is NaN (the numeric counterpart of None)

0    1.0
1    2.0
2    NaN
dtype: float64

In [10]:
# NaN is not None
import numpy as np
np.nan == None

False

In [11]:
# NaN does not compare equal to anything, not even to itself
np.nan == np.nan

False

In [12]:
# Use a dedicated function to test for the presence of NaN
np.isnan(np.nan)

True
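
For a whole Series, the same check can be done element by element. This is a small sketch that assumes `isna()` is the check you want; it flags both `None` and `NaN` as missing:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# isna() returns a boolean Series marking the missing entries
missing_mask = numbers.isna()
n_missing = int(missing_mask.sum())

# pd.isna also accepts scalars and treats None and NaN alike
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)
```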

In [13]:
# A Series can also be created from a dictionary
sports = { 'Archery': 'Bhutan',
         'Golf': 'Scotland',
         'Hockey': 'India'}
s = pd.Series(sports)
s # Note that the dtype of the elements is object

Archery      Bhutan
Golf       Scotland
Hockey        India
dtype: object

In [14]:
type(s)

pandas.core.series.Series

In [15]:
# get the index object
s.index

Index(['Archery', 'Golf', 'Hockey'], dtype='object')

In [16]:
# The index can be passed as a separate list when creating the Series
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object

In [17]:
# When a requested index label is missing from the dictionary, pandas fills in NaN
# (and dictionary keys that are not listed in the index are dropped)
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s

Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object

### Querying a Series

In [19]:
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [21]:
# Query by integer position with the .iloc attribute
s.iloc[3]

'South Korea'

In [22]:
s.loc['Golf'] # Query by label with the .loc attribute

'Scotland'

In [23]:
s.loc

<pandas.core.indexing._LocIndexer at 0x68494f0>

In [24]:
s[3] # With non-integer labels, this pandas version treats an integer key as a position (like .iloc); newer versions deprecate this, so prefer s.iloc[3]

'South Korea'

In [25]:
s['Golf'] # A string key is looked up as a label

'Scotland'

In [26]:
sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)
s[0] # With integer labels this is a label lookup, not s.iloc[0] as one might expect, so it raises an error

KeyError: 0
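
One way to avoid this ambiguity is to always state the intent explicitly. The sketch below reuses the integer-labelled data from the cell above; `label_zero_found` is a name introduced only for illustration:

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

by_position = s.iloc[0]   # first element, whatever its label is
by_label = s.loc[99]      # element whose label is 99

try:
    s.loc[0]              # there is no label 0, so this raises KeyError
    label_zero_found = True
except KeyError:
    label_zero_found = False
```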

In [27]:
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

In [28]:
# Summing the values with an explicit loop, the typical procedural approach
total = 0
for item in s:
    total += item
print(total)

324.0


Pandas and the underlying NumPy library support vectorization, and most NumPy functions work directly on a Series.

In [29]:
import numpy as np
total = np.sum(s)
print(total)

324.0
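
As a short illustration that this is not limited to `np.sum`, here are a few other NumPy functions applied to the same Series (the choice of functions is arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([100.00, 120.00, 101.00, 3.00])

total = np.sum(s)      # reduction over all values, same result as s.sum()
largest = np.max(s)    # same result as s.max()
roots = np.sqrt(s)     # element-wise; returns a Series with the same index
```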


In [31]:
# Which one is faster? We need a bigger Series to find out
s = pd.Series(np.random.randint(0, 1000, 10000))
s.head()

0    581
1     85
2     30
3    298
4    712
dtype: int32

In [32]:
len(s)

10000

In [33]:
%%timeit -n 100 # cell magic that times the execution of the cell; -n 100 sets the number of loops per run
summary = 0
for item in s:
    summary += item

100 loops, best of 3: 894 µs per loop


In [34]:
%%timeit -n 100
summary = np.sum(s)

100 loops, best of 3: 131 µs per loop


**Wow! Vectorization has some tremendous speed benefits.**
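
Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module. The helper `loop_sum`, the seed, and the repeat count are assumptions made for this example, and the timings will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, 10000))

def loop_sum(series):
    """Sum the values one at a time in a Python loop."""
    total = 0
    for item in series:
        total += item
    return total

# Time 20 runs of each approach
loop_seconds = timeit.timeit(lambda: loop_sum(s), number=20)
vector_seconds = timeit.timeit(lambda: s.sum(), number=20)
print(f"loop: {loop_seconds:.4f}s  vectorized: {vector_seconds:.4f}s")
```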

In [35]:
# Broadcasting: applying an operation to every element of the Series
s += 2 # Vectorized: the addition is applied to all elements at once
s.head()

0    583
1     87
2     32
3    300
4    714
dtype: int32

In [36]:
# Doing the same thing procedurally is tedious (and much slower)
for label, value in s.items():  # iteritems() was removed in pandas 2.0
    s.at[label] = value + 2     # set_value() was removed in pandas 1.0
s.head()

0    585
1     89
2     34
3    302
4    716
dtype: int32
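
Broadcasting does not have to modify the Series in place. A small sketch with made-up values: the plain operators return a new Series, while the augmented form updates the existing one.

```python
import pandas as pd

s = pd.Series([581, 85, 30])

shifted = s + 2    # new Series; s itself is unchanged at this point
doubled = s * 2    # any arithmetic operator broadcasts the same way
s += 2             # augmented assignment updates s
```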

In [37]:
# A Series can hold heterogeneous elements
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears' # Assigning through .loc adds a new item when the label does not exist yet

In [38]:
s

0             1
1             2
2             3
Animal    Bears
dtype: object

In [40]:
# What happens if the labels are not unique?
original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
# Series.append() was removed in pandas 2.0; pd.concat also returns a new Series (watch out for pandas functions returning new objects)
all_countries = pd.concat([original_sports, cricket_loving_countries])

In [41]:
original_sports

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [42]:
cricket_loving_countries

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

In [43]:
all_countries # Note the repeated index label

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

In [44]:
all_countries['Cricket'] # A repeated label returns a Series rather than a single value

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
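
Whether a lookup returns a scalar or a Series therefore depends on the labels. A minimal sketch with a smaller, made-up Series shows how to check for this up front with `Index.is_unique`:

```python
import pandas as pd

sports = pd.Series(['Bhutan', 'Australia', 'England'],
                   index=['Archery', 'Cricket', 'Cricket'])

labels_unique = sports.index.is_unique   # False: 'Cricket' is repeated
single = sports.loc['Archery']           # unique label -> scalar value
several = sports.loc['Cricket']          # repeated label -> Series
```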

## DataFrame Data Structure

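The DataFrame is the heart of the pandas library. Conceptually, it is a two-dimensional Series object: a table whose rows and columns are both labelled.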
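
To make the "two-dimensional Series" idea concrete, here is a small sketch (with made-up data) showing that both a column and a row of a DataFrame come back as a Series:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Deepak', 'Neil'], 'Cost': [2700, 2400]},
                  index=['Amazon', 'AliExpress'])

column = df['Cost']       # a column is a Series indexed by the row labels
row = df.loc['Amazon']    # a row is a Series indexed by the column names
```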

In [48]:
import pandas as pd
purchase_1 = pd.Series({'Name': 'Deepak',
                       'Item Purchased': 'Football Shoes',
                       'Cost': 2700})
purchase_2 = pd.Series({'Name': 'Neil',
                       'Item Purchased': 'Running Shoes',
                       'Cost': 2400})
purchase_3 = pd.Series({'Name': 'Samiksha',
                       'Item Purchased': 'Hookah',
                       'Cost': 1500})
purchase_4 = pd.Series({'Name': 'Harry',
                       'Item Purchased': 'Notebook',
                       'Cost': 200})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3, purchase_4], index=['Amazon', 'AliExpress', 'Flipkart', 'Amazon'])
df.head() # Notice how the notebook renders it as a nicely formatted table

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak |
| AliExpress | 2400 | Running Shoes | Neil |
| Flipkart | 1500 | Hookah | Samiksha |
| Amazon | 200 | Notebook | Harry |


In [49]:
# Look at the items purchased from Amazon
df.loc['Amazon']

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak |
| Amazon | 200 | Notebook | Harry |


In [55]:
# When .loc matches more than one row, the returned object is a DataFrame
type(df.loc['Amazon'])


pandas.core.frame.DataFrame

In [51]:
type(df.loc['Flipkart']) # Only one row matches here, so a Series is returned

pandas.core.series.Series

In [57]:
# To get only the cost of the items purchased from Amazon:
# .loc accepts a row label and a column label together
df.loc['Amazon', 'Cost']

Amazon    2700
Amazon     200
Name: Cost, dtype: int64

In [53]:
# To pull Name and Cost from all the stores, use slicing, which .loc supports
df.loc[:, ['Name', 'Cost']]

| | Name | Cost |
|---|---|---|
| Amazon | Deepak | 2700 |
| AliExpress | Neil | 2400 |
| Flipkart | Samiksha | 1500 |
| Amazon | Harry | 200 |


In [54]:
type(df.loc[:, ['Name', 'Cost']]) # Notice that this returns a dataframe 

pandas.core.frame.DataFrame

In [58]:
# What if we wanted the Cost column for all the stores?
# One approach is to transpose the DataFrame, which turns Cost into a row, and then select it with .loc
df.T

| | Amazon | AliExpress | Flipkart | Amazon |
|---|---|---|---|---|
| Cost | 2700 | 2400 | 1500 | 200 |
| Item Purchased | Football Shoes | Running Shoes | Hookah | Notebook |
| Name | Deepak | Neil | Samiksha | Harry |


In [60]:
df.T.loc['Cost']

Amazon        2700
AliExpress    2400
Flipkart      1500
Amazon         200
Name: Cost, dtype: object

In [62]:
# But this is ugly. DataFrames have column selection built in
df['Cost']

Amazon        2700
AliExpress    2400
Flipkart      1500
Amazon         200
Name: Cost, dtype: int64

In [63]:
# Index references can be chained as below, but this comes at a cost: chaining may return a copy
# rather than a view of the data, so try to avoid it.
# The line below is equivalent to df.loc['Amazon', 'Cost']
df.loc['Amazon']['Cost']

Amazon    2700
Amazon     200
Name: Cost, dtype: int64
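
The difference matters most when assigning. As a hedged sketch with made-up data, a single `.loc` call with both a row label and a column label reads and writes the original DataFrame directly, which is why it is usually preferred over chaining:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Deepak', 'Neil', 'Harry'],
                   'Cost': [2700, 2400, 200]},
                  index=['Amazon', 'AliExpress', 'Amazon'])

# Read: one .loc call selects the rows and the column together
amazon_costs = df.loc['Amazon', 'Cost'].tolist()

# Write: the same form assigns into the original DataFrame
df.loc['Amazon', 'Cost'] = 0
```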

### Dropping data in Pandas dataframes

The drop method returns a new DataFrame; it does not change the original in place unless told to do so.
It can drop columns as well as rows.
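
A compact sketch of these options, using made-up data and the `index=`, `columns=` and `errors=` keywords of `drop`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Deepak', 'Neil'], 'Cost': [2700, 2400]},
                  index=['Amazon', 'AliExpress'])

without_row = df.drop(index='Amazon')      # drop a row by label
without_col = df.drop(columns='Name')      # drop a column by name
# errors='ignore' skips labels that are not present instead of raising
unchanged = df.drop(index='Flipkart', errors='ignore')
```

In each case `df` itself keeps all of its rows and columns.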

In [64]:
df.drop('Flipkart') # By default, takes row index labels and drops those rows

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak |
| AliExpress | 2400 | Running Shoes | Neil |
| Amazon | 200 | Notebook | Harry |


In [65]:
df # The original DataFrame is not modified

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak |
| AliExpress | 2400 | Running Shoes | Neil |
| Flipkart | 1500 | Hookah | Samiksha |
| Amazon | 200 | Notebook | Harry |


In [66]:
copy_df = df.copy()
copy_df = copy_df.drop('Flipkart')
copy_df

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak |
| AliExpress | 2400 | Running Shoes | Neil |
| Amazon | 200 | Notebook | Harry |


In [67]:
help(df.drop) # Notice that inplace defaults to False; set it to True to modify the original DataFrame
# Also note axis=0; set it to 1 to drop columns instead
# See the help output below for the full parameter list

Help on method drop in module pandas.core.generic:

drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') method of pandas.core.frame.DataFrame instance
    Return new object with labels in requested axis removed.
    
    Parameters
    ----------
    labels : single label or list-like
        Index or column labels to drop.
    axis : int or axis name
        Whether to drop labels from the index (0 / 'index') or
        columns (1 / 'columns').
    index, columns : single label or list-like
        Alternative to specifying `axis` (``labels, axis=1`` is
        equivalent to ``columns=labels``).
    
        .. versionadded:: 0.21.0
    level : int or level name, default None
        For MultiIndex
    inplace : bool, default False
        If True, do operation inplace and return None.
    errors : {'ignore', 'raise'}, default 'raise'
        If 'ignore', suppress error and existing labels are dropped.
    
    Returns
    -------
    dropped

In [72]:
# Another way to drop a column: del modifies the DataFrame in place
del copy_df['Name']
copy_df

| | Cost | Item Purchased |
|---|---|---|
| Amazon | 2700 | Football Shoes |
| AliExpress | 2400 | Running Shoes |
| Amazon | 200 | Notebook |


In [73]:
# Adding a new column is straightforward: just assign to it
df['Location'] = 'Bangalore'
df

| | Cost | Item Purchased | Name | Location |
|---|---|---|---|---|
| Amazon | 2700 | Football Shoes | Deepak | Bangalore |
| AliExpress | 2400 | Running Shoes | Neil | Bangalore |
| Flipkart | 1500 | Hookah | Samiksha | Bangalore |
| Amazon | 200 | Notebook | Harry | Bangalore |


### DataFrame indexing and loading

In [74]:
costs = df['Cost']
costs

Amazon        2700
AliExpress    2400
Flipkart      1500
Amazon         200
Name: Cost, dtype: int64

In [75]:
costs += 2 # Use vectorization to increase all the elements in place
costs

Amazon        2702
AliExpress    2402
Flipkart      1502
Amazon         202
Name: Cost, dtype: int64

In [76]:
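# Important: in the pandas version used here, the in-place operation above also modified the original DataFrame,
# because costs shares its data with df. (With Copy-on-Write enabled, the default from pandas 3.0, df would be left unchanged.)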
df

| | Cost | Item Purchased | Name | Location |
|---|---|---|---|---|
| Amazon | 2702 | Football Shoes | Deepak | Bangalore |
| AliExpress | 2402 | Running Shoes | Neil | Bangalore |
| Flipkart | 1502 | Hookah | Samiksha | Bangalore |
| Amazon | 202 | Notebook | Harry | Bangalore |


In [77]:
# To make changes without affecting the original DataFrame, call copy() first
cp_costs = df['Cost'].copy()
cp_costs

Amazon        2702
AliExpress    2402
Flipkart      1502
Amazon         202
Name: Cost, dtype: int64

In [78]:
cp_costs *= 0.8 # 20% discount given
cp_costs

Amazon        2161.6
AliExpress    1921.6
Flipkart      1201.6
Amazon         161.6
Name: Cost, dtype: float64

In [79]:
# Notice that this has not modified the original DataFrame
df

| | Cost | Item Purchased | Name | Location |
|---|---|---|---|---|
| Amazon | 2702 | Football Shoes | Deepak | Bangalore |
| AliExpress | 2402 | Running Shoes | Neil | Bangalore |
| Flipkart | 1502 | Hookah | Samiksha | Bangalore |
| Amazon | 202 | Notebook | Harry | Bangalore |
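
The copy-first pattern can be condensed into a short sketch. The data is made up, and the 20% discount mirrors the cells above:

```python
import pandas as pd

df = pd.DataFrame({'Cost': [2700, 2400]}, index=['Amazon', 'AliExpress'])

# Work on an explicit copy so the DataFrame's own column is never touched
discounted = df['Cost'].copy() * 0.8
```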


In [82]:
# The ! prefix runs an operating-system shell command from IPython
!dir

 Volume in drive E is Dexwork
 Volume Serial Number is F0C7-464B

 Directory of E:\Documents\Python\Jupyter-notebooks

01/09/2018  03:00 PM    <DIR>          .
01/09/2018  03:00 PM    <DIR>          ..
01/08/2018  10:52 AM    <DIR>          .ipynb_checkpoints
01/09/2018  02:56 PM            55,466 Basic_data_processing_with_Pandas.ipynb
01/05/2018  02:18 PM            29,499 NYCrimeAnalysis.ipynb
12/03/2016  02:15 AM             8,419 olympics.csv
01/05/2018  07:43 AM           128,513 process.csv
01/08/2018  10:43 AM            26,251 Python_basics_NumPy_basics.ipynb
               5 File(s)        248,148 bytes
               3 Dir(s)  131,955,474,432 bytes free


In [83]:
# Read a CSV file into a DataFrame with the read_csv() function
df = pd.read_csv('olympics.csv')
df.head()

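| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 ! | 02 ! | 03 ! | Total | № Games | 01 ! | 02 ! | 03 ! | Combined total |
| 1 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| 2 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| 3 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| 4 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |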


In [84]:
# read_csv() can be told which column to use as the index (index_col) and how many leading rows to skip (skiprows)
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
df.head()

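| | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |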
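
The effect of `index_col` and `skiprows` can be tried without the file on disk. The sketch below feeds `read_csv` a tiny in-memory CSV through `io.StringIO`; the contents are a made-up stand-in that only imitates the two-line header of `olympics.csv`:

```python
import io

import pandas as pd

# Hypothetical miniature of the file: a numeric first line, then the real header
raw = io.StringIO(
    "0,1,2\n"
    ",Summer,Winter\n"
    "Afghanistan,13,0\n"
    "Algeria,12,3\n"
)

# skiprows=1 discards the numeric line, so the second line becomes the header;
# index_col=0 turns the first column into the row index
df = pd.read_csv(raw, index_col=0, skiprows=1)
```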


In [85]:
# Notice that some of the columns are poorly named, so it is time to clean them up.
# A DataFrame stores all of its column names in the columns attribute
df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

In [92]:
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True)
        
df.head()

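| | # Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |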
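
As an alternative sketch, the same renaming rules can be applied in a single pass by giving `rename` a function instead of calling it once per column. The helper `clean_column` is a name made up for this example, and the empty DataFrame only stands in for the real data:

```python
import pandas as pd

def clean_column(name):
    """Map '01 !', '02 !', '03 !' prefixes to medal names and '№' to '#'."""
    prefixes = {'01': 'Gold', '02': 'Silver', '03': 'Bronze'}
    if name[:2] in prefixes:
        return prefixes[name[:2]] + name[4:]   # keep suffixes such as '.1'
    if name[:1] == '№':
        return '#' + name[1:]
    return name

df = pd.DataFrame(columns=['№ Summer', '01 !', '02 !.1', '03 !.2', 'Total'])
df = df.rename(columns=clean_column)   # rename applies the function to every column name
```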


In [97]:
df.index # Note that the displayed index labels do not reveal the special characters embedded in them

Index(['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)',
       'Armenia (ARM)', 'Australasia (ANZ) [ANZ]', 'Australia (AUS) [AUS] [Z]',
       'Austria (AUT)', 'Azerbaijan (AZE)', 'Bahamas (BAH)', 'Bahrain (BRN)',
       ...
       'Uzbekistan (UZB)', 'Venezuela (VEN)', 'Vietnam (VIE)',
       'Virgin Islands (ISV)', 'Yugoslavia (YUG) [YUG]',
       'Independent Olympic Participants (IOP) [IOP]', 'Zambia (ZAM) [ZAM]',
       'Zimbabwe (ZIM) [ZIM]', 'Mixed team (ZZX) [ZZX]', 'Totals'],
      dtype='object', length=147)

In [104]:
df.iloc[0].name # The name of a row is its index label

'Afghanistan\xa0(AFG)'

In [105]:
df.loc['Afghanistan\xa0(AFG)'] # Note the non-breaking space (\xa0) embedded in the label

# Summer          13
Gold               0
Silver             0
Bronze             2
Total              2
# Winter           0
Gold.1             0
Silver.1           0
Bronze.1           0
Total.1            0
# Games           13
Gold.2             0
Silver.2           0
Bronze.2           2
Combined total     2
Name: Afghanistan (AFG), dtype: int64
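
One possible cleanup, sketched here on a made-up two-row DataFrame, is to replace the non-breaking spaces in the index with ordinary spaces so that labels can be typed normally:

```python
import pandas as pd

df = pd.DataFrame({'Gold': [0, 5]},
                  index=['Afghanistan\xa0(AFG)', 'Algeria\xa0(ALG)'])

# Swap every non-breaking space for a regular space (literal match, no regex)
df.index = df.index.str.replace('\xa0', ' ', regex=False)

gold = df.loc['Afghanistan (AFG)', 'Gold']   # lookup with an ordinary space now works
```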