# Essential Functionality

This section walks through the fundamentals of interacting with the data contained in `Series` and `DataFrame` objects.

___Resources___

https://bit.ly/2m3bEvi - pandas documentation - **Essential Basic Functionality**

In [1]:
## Base imports

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

### Arithmetic and Data Alignment in `DataFrames` and `Series`

Index objects are key to data alignment when performing arithmetic between objects. When two objects are added, pandas aligns them on their index labels. If the labels differ, the index of the result is the union of the two indexes.

In [2]:
# Example - construct two Series with unequal index.

s1 = pd.Series([1,2,3,np.nan], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10,20,30,40], index=['g', 'c', 'b', 'f'])

In [3]:
s1

a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64

In [4]:
s2

g    10
c    20
b    30
f    40
dtype: int64

In [5]:
# Adding the two series together produces a new Series with missing values in the label locations that don't overlap

s1 + s2

a     NaN
b    32.0
c    23.0
d     NaN
f     NaN
g     NaN
dtype: float64

### Fill values

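pandas therefore provides convenience methods that let you substitute a fill value when an axis label is found in one object but not the other. If the value is missing on both sides (as for label `d` below, which is `NaN` in `s1` and absent from `s2`), the result is still `NaN`.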

In [6]:
s1.add(s2, fill_value=1000)

a    1001.0
b      32.0
c      23.0
d       NaN
f    1040.0
g    1010.0
dtype: float64

These convenience methods exist for each of the arithmetic operators.

| Method | Description |
| --- | --- |
| add | method for addition (`+`) |
| sub | method for subtraction (`-`) |
| div | method for division (`/`) |
| mul | method for multiplication (`*`) |
| floordiv | method for floor division (`//`) |
| pow | method for exponentiation (`**`) |
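
As a minimal sketch of how the operator and method forms in the table differ, the example below uses two made-up `Series` (`left` and `right` are illustrative names, not part of the dataset). The operator leaves `NaN` where a label is missing on one side, while the method form substitutes `fill_value` first.

```python
import pandas as pd

# Two small Series with partially overlapping labels (illustrative data only)
left = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
right = pd.Series([2, 4, 8], index=['b', 'c', 'd'])

# Operator form: labels present in only one Series produce NaN
product = left * right

# Method form: fill_value stands in for the side that lacks the label
product_filled = left.mul(right, fill_value=1)
floor_filled = left.floordiv(right, fill_value=1)

print(product)
print(product_filled)
print(floor_filled)
```

Choosing `fill_value=1` keeps the unmatched values unchanged under multiplication. For addition or subtraction, `0` is usually the neutral choice.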

In [7]:
# Example with dataframes

population = pd.Series([66900000, 53000000, 60600000, 82670000, 46560000], index= ['France', 'England', 'Italy', 'Germany', 'Spain'])
world_cups = pd.Series([1, 1, 4, 4, 1], index= ['England', 'Spain', 'Italy', 'Germany', 'France'])

In [8]:
df1 = pd.DataFrame({'Population': population, 'World_Cups': world_cups})
df1

Unnamed: 0,Population,World_Cups
England,53000000,1
France,66900000,1
Germany,82670000,4
Italy,60600000,4
Spain,46560000,1


In [9]:
df2 = pd.DataFrame({ 'World_Cups': [0, 1, 0, 0]}, index= ['England', 'France', 'Belgium', 'Croatia'])
df2

Unnamed: 0,World_Cups
England,0
France,1
Belgium,0
Croatia,0


In [10]:
df3 = df1.add(df2, fill_value=0)
df3

Unnamed: 0,Population,World_Cups
Belgium,,0.0
Croatia,,0.0
England,53000000.0,1.0
France,66900000.0,2.0
Germany,82670000.0,4.0
Italy,60600000.0,4.0
Spain,46560000.0,1.0


### Exercise

__1)__ Take some time out to wrap your heads around arithmetic for `Series` & `DataFrames`.

Remember - a `Series` can be created in multiple ways, but the easiest is to pass a list to the `pd.Series` constructor, for example `pd.Series([1,2,3,4])`.

Try creating `Series/DataFrames` of different lengths.

What do arithmetic operations produce when an index contains duplicate entries?


## Calling `Series/DataFrame` Methods

Methods are the primary way to use the functionality that `Series` and `DataFrame` objects offer.

**Note** - methods are functions that belong to an object's class and are called with parentheses, whereas attributes are values stored on the object and are accessed without parentheses.
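
To make the distinction concrete, here is a small sketch on a hypothetical `Series` named `ranks` (made-up values, not the Billboard data). Attributes are read directly, while methods are called.

```python
import pandas as pd

ranks = pd.Series([3, 1, 2], name='Rank')

# Attributes: accessed without parentheses, they describe the object
print(ranks.shape)   # dimensions as a tuple
print(ranks.dtype)   # data type of the values
print(ranks.name)    # label attached to the Series

# Methods: called with parentheses, they compute and return something
largest = ranks.max()
ordered = ranks.sort_values()
print(largest)
print(ordered)
```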

In [11]:
# Work with a new dataset - 50 years of pop music and lyrics
# Data sourced from https://github.com/walkerkq/musiclyrics

music = pd.read_csv("./Data/50_years_billboard.csv", encoding = "ISO-8859-1")
music.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...
2,3,i cant get no satisfaction,the rolling stones,1965,
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...


The dataset contains 5100 observations with the features rank (1-100), song, artist, year, and lyrics. It covers the Billboard Year-End Hot 100 from 1965 to 2015 inclusive, which is 51 annual charts of 100 songs each.

| Rank | Song | Artist | Year | Lyrics |
| --- | --- | --- | --- | --- |
| __Year End Ranking__ - Calculated using an inverse point system <br> based on the weekly Billboard charts (100 points for a week <br> at number one, 1 point for a week at number 100, etc)| __Song Name__ | __Artist Name__ | __Year of Release__ | __Song Lyrics__ |

In [12]:
# the dir function can uncover all the attributes and methods that are associated with an object

num_series = len(set(dir(pd.Series)))
num_df = len(set(dir(pd.DataFrame)))
num_both = len((set(dir(pd.DataFrame)))&(set(dir(pd.Series))))

print(f"There are {num_series} Series methods/attributes and {num_df} DataFrame methods/attributes. {num_both} methods overlap for both Series and DataFrames")

There are 442 Series methods/attributes and 445 DataFrame methods/attributes. 376 methods overlap for both Series and DataFrames


### Head and Tail methods

To view a small sample of a Series or DataFrame object, use the `head` and `tail` methods. The default number of elements to display is five, but you may pass a custom number.

In [13]:
# First n rows from Series or DataFrame

music.head(7)

Unnamed: 0,Rank,Song,Artist,Year,Lyrics
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...
2,3,i cant get no satisfaction,the rolling stones,1965,
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...
5,6,downtown,petula clark,1965,when youre alone and life is making you lonel...
6,7,help,the beatles,1965,help i need somebody help not just anybody hel...


In [14]:
# Last n rows from Series or DataFrame

music.tail(4)

Unnamed: 0,Rank,Song,Artist,Year,Lyrics
5096,97,she knows,neyo featuring juicy j,2015,
5097,98,night changes,one direction,2015,going out tonight changes into something red ...
5098,99,back to back,drake,2015,oh man oh man oh man not againyeah i learned ...
5099,100,how deep is your love,calvin harris and disciples,2015,i want you to breathe me in let me be your ai...


### Method chaining

In pandas, many `Series` and `DataFrame` methods return another `Series` or `DataFrame`. This lends itself very well to a process called method chaining, where each attribute or method is invoked in sequence using dot notation.
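
A minimal sketch of a chain, assuming a small made-up `songs` frame (the column names mirror the dataset but the values are invented). Each step returns a new object that the next step acts on, and wrapping the chain in parentheses avoids backslash line continuations.

```python
import pandas as pd

songs = pd.DataFrame({
    'Artist': ['madonna', 'drake', 'madonna', None, 'drake', 'madonna'],
    'Rank': [5, 12, 40, 7, 3, 88],
})

top_artist_counts = (
    songs
    .dropna(subset=['Artist'])   # DataFrame -> DataFrame without missing artists
    .loc[:, 'Artist']            # DataFrame -> Series
    .value_counts()              # Series -> Series of counts, largest first
    .head(1)                     # keep only the most frequent artist
)
print(top_artist_counts)
```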

### Unique values, Value counts

`value_counts` is one of the most useful methods for a `Series` of object data type. It counts the occurrences of each unique value.

In [15]:
# Example of method chaining here too

music.Year.value_counts()\
            .head()

2015    100
1965    100
2005    100
2001    100
1997    100
Name: Year, dtype: int64

In [16]:
# The ? operator (IPython/Jupyter only) displays the built-in pandas documentation

pd.Series.value_counts?

In [17]:
# value_counts can also provide insight into numeric Series
# The bins parameter groups the values into half-open bins

music.Rank.value_counts(bins=5)

(80.2, 100.0]    1020
(60.4, 80.2]     1020
(40.6, 60.4]     1020
(20.8, 40.6]     1020
(0.9, 20.8]      1020
Name: Rank, dtype: int64

In [18]:
# Likewise, the unique method returns an array of the unique values in a Series, in the order they were first observed

music.Artist.unique()

array(['sam the sham and the pharaohs', 'four tops', 'the rolling stones',
       ..., 'nicky jam and enrique iglesias', 'neyo featuring juicy j',
       'calvin harris and disciples'], dtype=object)

### Counting the number of elements in a Series/DataFrame

There are multiple options here: the `size` and `shape` attributes, or the built-in `len` function.

In [19]:
print(f"size attribute => {music.size}")
print(f"shape attribute => {music.shape}")
print(f"len function => {len(music)}")

size attribute => 25500
shape attribute => (5100, 5)
len function => 5100


Additionally, there is the `count` method, which counts the number of __non-missing__ values.

In [20]:
music.count()

Rank      5100
Song      5100
Artist    5100
Year      5100
Lyrics    4913
dtype: int64

### Summary Statistics

Basic summary statistics can be produced with the `sum`, `min`, `max`, `mean`, `median` and `std` methods.

In [21]:
# By default sum operates along the index (axis=0), giving one total per column - the only option for a Series

music.sum(numeric_only=True)

Rank      257550
Year    10149000
dtype: int64

In [22]:
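# Setting axis=1 makes the method operate across the columns, giving one total per row
# numeric_only=True skips the string columns, which recent pandas versions would otherwise refuse to sum with integers

music.sum(axis=1, numeric_only=True).head()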

0    1966
1    1967
2    1968
3    1969
4    1970
dtype: int64

Alternatively we can use the `describe` method which returns both summary statistics and quantiles at once.

In [23]:
# By default it will only return information for numeric columns

music.describe()

Unnamed: 0,Rank,Year
count,5100.0,5100.0
mean,50.5,1990.0
std,28.8689,14.721045
min,1.0,1965.0
25%,25.75,1977.0
50%,50.5,1990.0
75%,75.25,2003.0
max,100.0,2015.0


In [24]:
# describe also works on an object (string) column, but returns a different set of statistics

In [25]:
music.Artist.describe()

count        5100
unique       2473
top       madonna
freq           35
Name: Artist, dtype: object
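
Both kinds of summary can be requested in one call. The sketch below uses a small made-up frame (`chart` is a hypothetical name) and passes `include='all'` to `describe`. Statistics that do not apply to a column are reported as `NaN`.

```python
import pandas as pd

chart = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Artist': ['madonna', 'drake', 'madonna', 'abba'],
})

# include='all' summarises numeric and non-numeric columns together
full_summary = chart.describe(include='all')
print(full_summary)
```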

### Handling Missing Values

In [26]:
# isnull method evaluates to True or False (True if missing, False if not missing)

music.isnull().sum()

Rank        0
Song        0
Artist      0
Year        0
Lyrics    187
dtype: int64

This calculation works because:

1. The **`sum`** method for a `DataFrame` operates on **`axis=0`** by default (and thus produces column sums).
2. In order to add boolean values, pandas converts **`True`** to **1** and **`False`** to **0**.
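
The two points above can be sketched on a tiny made-up frame (`lyrics` is an illustrative name). The same boolean mask also gives per-row counts with `axis=1`, and its mean gives the fraction of missing values.

```python
import numpy as np
import pandas as pd

lyrics = pd.DataFrame({
    'Song': ['a', 'b', 'c'],
    'Lyrics': ['la la', np.nan, np.nan],
})

mask = lyrics.isnull()                # DataFrame of True/False
missing_per_column = mask.sum()       # axis=0: True counts as 1
missing_per_row = mask.sum(axis=1)    # axis=1: one count per row
missing_share = mask.mean()           # mean of booleans = fraction missing

print(missing_per_column)
print(missing_per_row)
print(missing_share)
```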

**How to handle missing values** depends on the dataset as well as the nature of your analysis. Here are some options:

**[`dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)**

In [27]:
music.shape

(5100, 5)

In [28]:
music.dropna().shape

(4913, 5)

**[`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)**

In [29]:
music.fillna('No Lyrics').shape

(5100, 5)

[Working with missing data in pandas](http://pandas.pydata.org/pandas-docs/stable/missing_data.html)
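
Both methods can also be applied more selectively. As a rough sketch on made-up data (`charts` and the placeholder strings are assumptions for illustration), `dropna` can be restricted to particular columns with `subset`, and `fillna` accepts a dict mapping column names to fill values.

```python
import numpy as np
import pandas as pd

charts = pd.DataFrame({
    'Song': ['a', 'b', 'c', 'd'],
    'Artist': ['x', np.nan, 'z', 'w'],
    'Lyrics': ['la', 'da', np.nan, np.nan],
})

# Drop a row only when Lyrics is missing; a missing Artist is tolerated
has_lyrics = charts.dropna(subset=['Lyrics'])

# Fill each column with its own placeholder
filled = charts.fillna({'Artist': 'unknown', 'Lyrics': 'No Lyrics'})

print(has_lyrics.shape)
print(filled)
```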

### Reindexing / Setting / Resetting indexes

pandas provides three main methods for manipulating the index of a `DataFrame/Series`.

__set_index()__ - Provide a new column of the `DataFrame` or a separate `Series` to act as the index  
__reset_index()__ - Resets the index to a default 0 to N-1 length array but adds the existing index as a `DataFrame` column  
__reindex()__ - Creates a new index with the data _conformed_ to the new index
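
Before applying these to the music data, here is a minimal sketch of the three methods on a small frame (`teams` is a hypothetical name, with values taken from the World Cup example earlier). Note that `reindex` does not fail on unknown labels; it conforms the data and fills the gaps with `NaN` or with `fill_value`.

```python
import pandas as pd

teams = pd.DataFrame({
    'Country': ['France', 'Italy', 'Spain'],
    'World_Cups': [1, 4, 1],
})

# set_index: promote a column to the index
by_country = teams.set_index('Country')

# reindex: conform to the requested labels; 'Belgium' is not present
conformed = by_country.reindex(['Italy', 'Belgium'])
conformed_filled = by_country.reindex(['Italy', 'Belgium'], fill_value=0)

# reset_index: return to a 0..N-1 index, keeping the old index as a column
restored = by_country.reset_index()

print(conformed)
print(conformed_filled)
print(restored)
```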

In [30]:
# Let's take a look at our dataframe

music.head()

Unnamed: 0,Rank,Song,Artist,Year,Lyrics
0,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...
1,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...
2,3,i cant get no satisfaction,the rolling stones,1965,
3,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...
4,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...


In [31]:
# Standard index is ok but we actually want a new index which is a concatenation of Year and Rank

new_index = music.Year.astype(str) + ' Rank ' + music.Rank.astype(str)

music = music.set_index(new_index)
music.head()

# Alternatively music.index = new_index

Unnamed: 0,Rank,Song,Artist,Year,Lyrics
1965 Rank 1,1,wooly bully,sam the sham and the pharaohs,1965,sam the sham miscellaneous wooly bully wooly b...
1965 Rank 2,2,i cant help myself sugar pie honey bunch,four tops,1965,sugar pie honey bunch you know that i love yo...
1965 Rank 3,3,i cant get no satisfaction,the rolling stones,1965,
1965 Rank 4,4,you were on my mind,we five,1965,when i woke up this morning you were on my mi...
1965 Rank 5,5,youve lost that lovin feelin,the righteous brothers,1965,you never close your eyes anymore when i kiss...


In [32]:
# Actually, let's reindex our dataset to rearrange the index according to a new index

music = music.reindex(['1989 Rank 1', '1990 Rank 1', '1991 Rank 1'], columns = ['Song', 'Artist'])
music.head()

Unnamed: 0,Song,Artist
1989 Rank 1,look away,chicago
1990 Rank 1,hold on,wilson phillips
1991 Rank 1,everything i do i do it for you,bryan adams


In [33]:
# Finally, let's drop the index but retain it within the DataFrame

music = music.reset_index()
music.head()

Unnamed: 0,index,Song,Artist
0,1989 Rank 1,look away,chicago
1,1990 Rank 1,hold on,wilson phillips
2,1991 Rank 1,everything i do i do it for you,bryan adams


In [34]:
# Reset the DataFrame for future use

music = pd.read_csv("./Data/50_years_billboard.csv", encoding = "ISO-8859-1", index_col = ['Year', 'Rank'])
music.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song,Artist,Lyrics
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1965,1,wooly bully,sam the sham and the pharaohs,sam the sham miscellaneous wooly bully wooly b...
1965,2,i cant help myself sugar pie honey bunch,four tops,sugar pie honey bunch you know that i love yo...
1965,3,i cant get no satisfaction,the rolling stones,
1965,4,you were on my mind,we five,when i woke up this morning you were on my mi...
1965,5,youve lost that lovin feelin,the righteous brothers,you never close your eyes anymore when i kiss...


### Dropping Entries from an Axis

As we have seen from the above, it is possible to simply reindex a `DataFrame` or a `Series` to drop certain values. However it is often easier to use the convenience method [**`drop`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) provided by pandas.

In [35]:
# Example for rows

music.drop(2015, axis=0).tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song,Artist,Lyrics
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014,96,studio,schoolboy q featuring bj the chicago kid,im just sitting in the studio just trying to ...
2014,97,0 to 100 the catch up,drake,part i 0 to 100 verse 1 fuck bein on some chil...
2014,98,i dont dance,lee brice,ill never settle down thats what i always tho...
2014,99,somethin bad,miranda lambert and carrie underwood,stand on the bar stomp your feet get clapping...
2014,100,adore you,miley cyrus,ah hey ah ohbaby baby yeah are you listenin w...


In [36]:
# Example for columns

music.drop(['Lyrics', 'Artist'], axis = 1).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song
Year,Rank,Unnamed: 2_level_1
1965,1,wooly bully
1965,2,i cant help myself sugar pie honey bunch
1965,3,i cant get no satisfaction
1965,4,you were on my mind
1965,5,youve lost that lovin feelin


Many methods, `drop` included, accept an `inplace=True` argument that modifies the object **in place** instead of returning a new object. The calls above did not use it, so `music` is unchanged:

In [37]:
music.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song,Artist,Lyrics
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1965,1,wooly bully,sam the sham and the pharaohs,sam the sham miscellaneous wooly bully wooly b...
1965,2,i cant help myself sugar pie honey bunch,four tops,sugar pie honey bunch you know that i love yo...
1965,3,i cant get no satisfaction,the rolling stones,
1965,4,you were on my mind,we five,when i woke up this morning you were on my mi...
1965,5,youve lost that lovin feelin,the righteous brothers,you never close your eyes anymore when i kiss...
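
The difference between the two styles can be sketched on a throwaway frame (`df` and its values are made up). A plain `drop` returns a new object and leaves the original alone, while `inplace=True` changes the original and returns `None`.

```python
import pandas as pd

df = pd.DataFrame({'Song': ['a', 'b'], 'Artist': ['x', 'y'], 'Lyrics': ['la', 'da']})

# Default behaviour: a new DataFrame is returned, df keeps all three columns
trimmed = df.drop('Lyrics', axis=1)
columns_after_plain_drop = list(df.columns)

# inplace=True: df itself loses the column and the call returns None
result = df.drop('Artist', axis=1, inplace=True)

print(columns_after_plain_drop)
print(list(df.columns))
print(result)
```

Reassigning the returned object (`df = df.drop(...)`) achieves the same end result and works naturally with method chaining.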


### Sorting

Sorting a dataset by some criterion is another very common task.

In [38]:
# To sort by index we use the sort_index method. This returns a new sorted object.
# For DataFrames you can sort by the labels on either axis.
# Data is sorted in ascending order by default - but can be switched to descending order through argument options.

music.sort_index(axis=1).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Artist,Lyrics,Song
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1965,1,sam the sham and the pharaohs,sam the sham miscellaneous wooly bully wooly b...,wooly bully
1965,2,four tops,sugar pie honey bunch you know that i love yo...,i cant help myself sugar pie honey bunch
1965,3,the rolling stones,,i cant get no satisfaction
1965,4,we five,when i woke up this morning you were on my mi...,you were on my mind
1965,5,the righteous brothers,you never close your eyes anymore when i kiss...,youve lost that lovin feelin


In [39]:
# To sort by value we use the sort_values method. This returns a new sorted object.
# Any missing values are sorted to the end of the Series by default.

music.sort_values('Song').head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song,Artist,Lyrics
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2014,97,0 to 100 the catch up,drake,part i 0 to 100 verse 1 fuck bein on some chil...
2003,37,03 bonnie clyde,jayz featuring beyonce,
1996,40,1 2 3 4 sumpin new,coolio,this is some of the linguafringa of da funk bu...
1968,48,1 2 3 red light,1910 fruitgum company,1910 fruitgum company miscellaneous 1 2 3 red ...
2005,5,1 2 step,ciara and missy elliott,ladies and gentleman ladies and gentleman thi...


In [40]:
# Multiple sort criteria can also be passed in

music.sort_values(['Artist', 'Song'], ascending= [True, False]).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Song,Artist,Lyrics
Year,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1966,5,96 tears,the mysterians,the mysterians miscellaneous 96 tears too ma...
1970,84,somebodys been sleeping,100 proof aged in soul,
1994,40,because the night,10000 maniacs,take me now baby here as i am hold me close n...
1977,44,the things we do for love,10cc,too many broken hearts have fallen in the riv...
1975,42,im not in love,10cc,


In [41]:
# Why are the mysterians appearing first...?

music.loc[(1966, 5),'Artist' ]

'  the mysterians'
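
The leading spaces explain the ordering: a space character sorts before digits and letters. One possible fix, sketched here on a made-up `Series` rather than the real column, is to strip surrounding whitespace with the `.str.strip()` accessor method before sorting.

```python
import pandas as pd

artists = pd.Series(['madonna', '  the mysterians', '10cc', 'abba'])

# Raw sort: the entry with leading spaces comes first
raw_order = artists.sort_values().tolist()

# Strip leading/trailing whitespace, then sort
clean_order = artists.str.strip().sort_values().tolist()

print(raw_order)
print(clean_order)
```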

## Exercises
***

These exercises are all based on the original 50 years Billboard dataset. To start from a fresh instance, a new `DataFrame` called `music_new` is created.

In [42]:
music_new = pd.read_csv("./Data/50_years_billboard.csv", encoding = "ISO-8859-1", index_col = 'Year', dtype={'Song': str, 'Artist': str, 'Lyrics': str})
music_new.head()

Unnamed: 0_level_0,Rank,Song,Artist,Lyrics
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1965,1,wooly bully,sam the sham and the pharaohs,sam the sham miscellaneous wooly bully wooly b...
1965,2,i cant help myself sugar pie honey bunch,four tops,sugar pie honey bunch you know that i love yo...
1965,3,i cant get no satisfaction,the rolling stones,
1965,4,you were on my mind,we five,when i woke up this morning you were on my mi...
1965,5,youve lost that lovin feelin,the righteous brothers,you never close your eyes anymore when i kiss...


__1)__ Manipulate the index such that the new index is a multi-level index of both Year and Rank. Make sure that Year and Rank are only present in the index.

__2)__ Get a sense of each of the columns in the dataset. How many missing entries are there in each column?

__3)__ It looks like the only column with missing values is 'Lyrics', which also has a most frequent value of '  ' - a string with just a double space. 

Change the values of the single/double space string values in the Lyrics column to `np.nan`.

__4)__ The `DataFrame` is now relatively clean, so we can answer some basic questions.

Which artist had the most top 100 ranked songs in the nineties?

**Note** - Subset selection on multi-level indexing is somewhat harder than indexing an index with one level.

**Help:** 

[**`MultiIndex / Advanced Indexing`**](https://pandas.pydata.org/pandas-docs/stable/advanced.html)   
[**`query method`**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html)
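
As a starting point, here is a rough sketch of a few selection patterns on a toy two-level index (`toy` and its values are invented and deliberately unrelated to the exercise answer). It assumes the index is sorted, which label slicing on a `MultiIndex` requires.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2001, 2002, 2003], [1, 2]], names=['Year', 'Rank']
)
toy = pd.DataFrame({'Song': ['a', 'b', 'c', 'd', 'e', 'f']}, index=index)

# Select every row for one value of the outer level
one_year = toy.loc[2002]

# Select a single cell with a tuple of (outer, inner) labels
one_song = toy.loc[(2002, 1), 'Song']

# Slice a range of the outer level; both endpoints are included
year_range = toy.loc[2001:2002]

# query can refer to index level names as if they were columns
top_ranked = toy.query('Rank == 1')

print(one_year)
print(one_song)
print(year_range)
print(top_ranked)
```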


# Recap
***

1. The index is key to pandas arithmetic. If a label exists in only one of the two objects involved, the result for that label is `NaN` by default.


2. Methods are functions that act upon the object they are called on.


3. Methods can be chained together by using dot notation.


4. There are hundreds of methods for both `Series` and `DataFrames`. Common data analysis methods exist for finding unique values, handling missing values, setting index values and so on. If you can't figure out how to perform a particular data operation, search for it - chances are an optimised method or recipe exists.


5. `DataFrames` can have multi-level indexing. This allows n-dimensional data to be expressed in a table format.

