# Pandas Foundations

___Resources___

https://bit.ly/ZvcIe9 - pandas documentation

https://bit.ly/2tfdyet - pandas documentation - Intro to Data Structures



In [1]:
## Base imports - common alias for pandas is pd and np for numpy

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)  # full option name; the short form is ambiguous in newer pandas

## Pandas Data Types

Broadly speaking, data can be classed as continuous or categorical. 

__Continuous__ - numeric data that represents some form of measurement

__Categorical__ - discrete data drawn from a finite set of possible values

In pandas there are distinct data types, almost all of which are built directly on NumPy's. This makes it easier to combine pandas and NumPy operations.

| Data Type Name | NumPy/pandas object | Pandas string name | Notes |
| --- | --- | --- | --- |
| Boolean | `np.bool_` | `bool` | - Stored as a single byte (0/1) <br> - True or False values |
| Integer | `np.int64` | `int` | - Default is 64 bits |
| Float | `np.float64` | `float` | - Default is 64 bits |
| Complex | `np.complex128` | `complex` | - Rarely used |
| Object | `np.object_` | `O`, `object` | - Typically strings <br> - Catch-all for columns with multiple <br> different types or other Python objects |
| Datetime | `np.datetime64`, <br> `pd.Timestamp` | `datetime64[ns]` | - Specific moment in time <br> (nanosecond precision) |
| Timedelta | `np.timedelta64`, <br> `pd.Timedelta` | `timedelta64[ns]` | - Duration of time <br> (days to nanoseconds) |
| Categorical | `pd.Categorical` | `category` | - Useful for object columns with <br> relatively few unique values |

Pandas defaults its core numeric types - integers and floats - to 64 bits, regardless of the smallest size that would actually hold the data.
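As a quick check of the 64-bit default, the sketch below builds two tiny `Series` and inspects their dtypes. The variable names and values are made up for illustration.

```python
import pandas as pd

# Values this small would fit in 8 bits, yet pandas stores them in 64
small_ints = pd.Series([1, 2, 3])
small_floats = pd.Series([0.5, 1.5, 2.5])

print(small_ints.dtype)    # int64
print(small_floats.dtype)  # float64

# itemsize is the number of bytes used for each value
print(small_ints.dtype.itemsize)  # 8
```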

**Quick win** - If struggling with a large dataset, set individual columns to smaller or more specific data types (for example `int32` or `category`) to reduce memory use.
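A minimal sketch of that memory saving, using a made-up frame (`df_mem`, with hypothetical `year` and `country` columns). The exact byte counts depend on the pandas version, so only the before/after comparison matters here.

```python
import pandas as pd

# Hypothetical data: 100,000 rows with a small-range integer and a repeated label
n = 100_000
df_mem = pd.DataFrame({
    'year': [2018] * n,                           # stored as int64 by default
    'country': ['France', 'England'] * (n // 2),  # stored as strings
})

# deep=True also counts the memory held by the string values themselves
before = df_mem.memory_usage(deep=True).sum()

# 2018 fits easily in 16 bits; two distinct labels suit the category dtype
df_small = df_mem.astype({'year': 'int16', 'country': 'category'})
after = df_small.memory_usage(deep=True).sum()

print(before, after)
```

Downcasting is only safe when every value fits in the smaller type, so it is worth checking the column's minimum and maximum first.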

## Pandas Data Structures

### Series

A `Series` is a one-dimensional array of __indexed__ data capable of holding any data type. The axis labels are collectively referred to as the index. It is this index that differentiates a Series from a normal array.

In [2]:
# Simple series - created from a list

obj = pd.Series([3,6,9,12,15])
obj

0     3
1     6
2     9
3    12
4    15
dtype: int64

By default, each item receives an integer index label from 0 to N, where N is the length of the `Series` minus one. However, because the index is **explicit**, its labels need not be integers.

In [3]:
# Alternatively - we can specify an index

population = pd.Series([66900000, 53000000, 60600000, 82670000, 46560000], index= ['France', 'England', 'Italy', 'Germany', 'Spain'])
population

France     66900000
England    53000000
Italy      60600000
Germany    82670000
Spain      46560000
dtype: int64
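With an explicit index in place, values can be looked up by label as well as by position. A short sketch reusing the `population` data from above:

```python
import pandas as pd

population = pd.Series(
    [66900000, 53000000, 60600000, 82670000, 46560000],
    index=['France', 'England', 'Italy', 'Germany', 'Spain'],
)

by_label = population['Italy']      # lookup by explicit index label
by_position = population.iloc[2]    # lookup by integer position

# A list of labels returns a new Series in the requested order
subset = population[['Spain', 'France']]

print(by_label, by_position)
print(subset)
```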

### DataFrames

A `DataFrame` represents a rectangular table of data and contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.).

-  A `DataFrame` has both a row and a column index
-  It is akin to a spreadsheet
-  You can think of a `DataFrame` as a sequence of aligned `Series` objects (aligned because they share the same index)

In [4]:
# New series listing the biggest export of the 5 countries from above

In [5]:
exports = pd.Series(['Pharmaceuticals', 'Plastics', 'Refined Petroleum', 'Cars', 'Aircraft'], index= ['England', 'Spain', 'Italy', 'Germany', 'France'])
exports

England      Pharmaceuticals
Spain               Plastics
Italy      Refined Petroleum
Germany                 Cars
France              Aircraft
dtype: object

In [6]:
# Population Series defined earlier

population

France     66900000
England    53000000
Italy      60600000
Germany    82670000
Spain      46560000
dtype: int64

Now we have two `Series`: exports and the population `Series` from before. We can use a dictionary to construct a single two-dimensional object containing this information:

In [7]:
countries = pd.DataFrame({'population':population, 'exports':exports})
countries

| | exports | population |
| --- | --- | --- |
| England | Pharmaceuticals | 53000000 |
| France | Aircraft | 66900000 |
| Germany | Cars | 82670000 |
| Italy | Refined Petroleum | 60600000 |
| Spain | Plastics | 46560000 |


Note that the indexes were aligned automatically when the `DataFrame` was created, even though the two `Series` list the countries in different orders. This alignment is part of what makes pandas data structures so efficient for data operations.
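The same alignment applies to arithmetic between `Series`. In this small sketch (the labels and values are made up), values are matched by label rather than by position, and labels present in only one `Series` produce `NaN`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Addition pairs values by index label, not by position
total = a + b

# y -> 22.0, z -> 13.0; x and w have no partner, so they become NaN
print(total)
```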

In [8]:
# Simple DataFrame - dictionary of equal sized lists or numpy arrays

data = {'Country': ['France', 'England', 'Italy', 'Germany', 'Spain'],
       'Population': [66900000, 53000000, 60600000, 82670000, 46560000],
       'Year': [2018, 2018, 2018, 2018, 2018]}

df = pd.DataFrame(data)

In [9]:
df

| | Country | Population | Year |
| --- | --- | --- | --- |
| 0 | France | 66900000 | 2018 |
| 1 | England | 53000000 | 2018 |
| 2 | Italy | 60600000 | 2018 |
| 3 | Germany | 82670000 | 2018 |
| 4 | Spain | 46560000 | 2018 |


By default the index is assigned automatically (0 to N-1). The columns follow the key order of the dictionary in pandas 0.23 and later; earlier versions sorted them alphabetically.
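A small check of both defaults, assuming pandas 0.23 or later on Python 3.6+, where column order follows the dictionary's key order. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Keys are deliberately not in alphabetical order
frame = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})

column_order = list(frame.columns)
index_type = type(frame.index).__name__

print(column_order)  # ['b', 'a'] on pandas 0.23+
print(index_type)    # RangeIndex
```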

In [10]:
# Alternatively, we can be more explicit and pass index and columns to the DataFrame constructor
# A requested column that is missing from data ('World Cup Winner') is filled with NaN

df = pd.DataFrame(data, index=data['Country'], columns=['Year', 'Population', 'World Cup Winner'])
df

| | Year | Population | World Cup Winner |
| --- | --- | --- | --- |
| France | 2018 | 66900000 | NaN |
| England | 2018 | 53000000 | NaN |
| Italy | 2018 | 60600000 | NaN |
| Germany | 2018 | 82670000 | NaN |
| Spain | 2018 | 46560000 | NaN |


### Index Objects

Pandas `Index` objects hold the axis labels and other metadata (such as the axis name or names). They are implemented using hash tables, which allow for very fast selection and data alignment.

-  Index objects are immutable, so their elements can't be modified in place
-  Any array or sequence that is passed as an index when constructing a Series/DataFrame is internally converted to an Index object
-  Similar to Python sets (they support intersection and union) but, unlike sets, they can contain duplicate labels
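The set-like behaviour can be sketched with two small, made-up indexes. `Index.intersection`, `Index.union` and `Index.is_unique` are standard pandas members.

```python
import pandas as pd

left = pd.Index(['France', 'England', 'Italy'])
right = pd.Index(['Italy', 'Germany', 'France'])

common = left.intersection(right)   # labels found in both
combined = left.union(right)        # labels found in either

# Unlike a Python set, an Index may hold the same label more than once
dupes = pd.Index(['a', 'b', 'a'])

print(common)
print(combined)
print(dupes.is_unique)  # False
```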


In [11]:
# Example of index immutability - any function that alters an index will return a new index. 
# Immutable objects can't be changed after they have been created

df.index[0] = 'Portugal'

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple `DataFrames` and `Series`, without the potential for side effects from inadvertent index modification.
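To change labels, build a new index instead of editing the existing one. A minimal sketch, assuming a small stand-in frame named `small` (hypothetical), uses `DataFrame.rename`, which returns a new object and leaves the original untouched:

```python
import pandas as pd

small = pd.DataFrame(
    {'Population': [66900000, 53000000]},
    index=['France', 'England'],
)

# rename applies the function to every label and returns a new DataFrame
renamed = small.rename(index=str.upper)

print(list(renamed.index))  # ['FRANCE', 'ENGLAND']
print(list(small.index))    # ['France', 'England'] (unchanged)
```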

## The Anatomy of a DataFrame

<img src="./Images/Anatomy_DataFrame.png">

In [12]:
movies = pd.read_csv('./Data/imdbratings.csv')

In [13]:
movies.head()

| | star_rating | title | content_rating | genre | duration | actors_list |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 9.3 | The Shawshank Redemption | R | Crime | 142 | [u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt... |
| 1 | 9.2 | The Godfather | R | Crime | 175 | [u'Marlon Brando', u'Al Pacino', u'James Caan'] |
| 2 | 9.1 | The Godfather: Part II | R | Crime | 200 | [u'Al Pacino', u'Robert De Niro', u'Robert Duv... |
| 3 | 9.0 | The Dark Knight | PG-13 | Action | 152 | [u'Christian Bale', u'Heath Ledger', u'Aaron E... |
| 4 | 8.9 | Pulp Fiction | R | Crime | 154 | [u'John Travolta', u'Uma Thurman', u'Samuel L.... |


In [14]:
# dtypes attribute - displays each column and its data type

movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

### Accessing the main `DataFrame` components

In [15]:
columns = movies.columns
index = movies.index
data = movies.values

In [16]:
columns

Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')

In [17]:
index

RangeIndex(start=0, stop=979, step=1)

In [18]:
data

array([[9.3, 'The Shawshank Redemption', 'R', 'Crime', 142,
        "[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunton']"],
       [9.2, 'The Godfather', 'R', 'Crime', 175,
        "[u'Marlon Brando', u'Al Pacino', u'James Caan']"],
       [9.1, 'The Godfather: Part II', 'R', 'Crime', 200,
        "[u'Al Pacino', u'Robert De Niro', u'Robert Duvall']"],
       ...,
       [7.4, 'Master and Commander: The Far Side of the World', 'PG-13',
        'Action', 138,
        "[u'Russell Crowe', u'Paul Bettany', u'Billy Boyd']"],
       [7.4, 'Poltergeist', 'PG', 'Horror', 114,
        '[u\'JoBeth Williams\', u"Heather O\'Rourke", u\'Craig T. Nelson\']'],
       [7.4, 'Wall Street', 'R', 'Crime', 126,
        "[u'Charlie Sheen', u'Michael Douglas', u'Tamara Tunie']"]],
      dtype=object)
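The `dtype=object` in the array above arises because a NumPy array holds a single type, so a frame that mixes numbers and strings falls back to `object`. A small sketch using two rows copied from the output above (the frame name `mixed` is made up):

```python
import pandas as pd

mixed = pd.DataFrame({
    'star_rating': [9.3, 9.2],
    'title': ['The Shawshank Redemption', 'The Godfather'],
    'duration': [142, 175],
})

# Floats, strings and integers together share no common numeric type
mixed_values = mixed.to_numpy()

# Selecting only the numeric columns gives a numeric array (ints upcast to float)
numeric_values = mixed[['star_rating', 'duration']].to_numpy()

print(mixed_values.dtype)    # object
print(numeric_values.dtype)  # float64
```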

## The Anatomy of a Series

<img src="./Images/Anatomy_Series.png">

### Accessing the main `Series` components

In [20]:
# Select a single column of data as a Series - will go into more detail later

film_name = movies['title']
film_name.head()

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
3             The Dark Knight
4                Pulp Fiction
Name: title, dtype: object

In [21]:
# Check that film_name is in fact a Series

type(film_name)

pandas.core.series.Series

In [22]:
# A Series has no columns index; instead, the column name becomes the Series name attribute

index = film_name.index
data = film_name.values
name = film_name.name

In [23]:
index

RangeIndex(start=0, stop=979, step=1)

In [24]:
data

array(['The Shawshank Redemption', 'The Godfather',
       'The Godfather: Part II', 'The Dark Knight', 'Pulp Fiction',
       '12 Angry Men', 'The Good, the Bad and the Ugly',
       'The Lord of the Rings: The Return of the King',
       "Schindler's List", 'Fight Club',
       'The Lord of the Rings: The Fellowship of the Ring', 'Inception',
       'Star Wars: Episode V - The Empire Strikes Back', 'Forrest Gump',
       'The Lord of the Rings: The Two Towers', 'Interstellar',
       "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Goodfellas',
       'Star Wars', 'The Matrix', 'City of God', "It's a Wonderful Life",
       'The Usual Suspects', 'Se7en', 'Life Is Beautiful',
       'Once Upon a Time in the West', 'The Silence of the Lambs',
       'Leon: The Professional', 'City Lights', 'Spirited Away',
       'The Intouchables', 'Casablanca', 'Whiplash', 'American History X',
       'Modern Times', 'Saving Private Ryan', 'Raiders of the Lost Ark',
       'Rear Window', 'Psycho'

In [25]:
# Same as the original DataFrame column name

name

'title'
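The `name` attribute also works in the other direction: it becomes the column label when a `Series` is turned back into a `DataFrame`. A brief sketch with a two-row stand-in for the movies data (`movies_sample` is hypothetical):

```python
import pandas as pd

movies_sample = pd.DataFrame({
    'title': ['The Shawshank Redemption', 'The Godfather'],
    'duration': [142, 175],
})

film_name = movies_sample['title']

# Passing a scalar to rename returns a new Series with a different name
renamed = film_name.rename('film')

# to_frame uses the Series name as the column label
as_frame = renamed.to_frame()

print(film_name.name)          # title
print(list(as_frame.columns))  # ['film']
```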

# Recap
***

1. Pandas has distinct data types that determine how data is stored and how it can be manipulated. Unexpected results are often caused by incorrect data types.  


2. Large `DataFrame`? A quick fix is to set the data types of columns on import, which can reduce the strain on memory.  


3. **`Series`** - One-dimensional array of __indexed__ data. It can hold any data type, but all values are stored as the same type.  


4. **`DataFrame`** - Two-dimensional table of __indexed__ data. Can be thought of as an ordered collection of columns. Columns can have different data types from one another.  


5. The index is what differentiates a `Series` from a normal array and the reason why pandas data structures are so efficient for data operations.  


6. Attributes of a `Series` or `DataFrame` (such as `index`, `values`, `columns`, `dtypes` and `name`) expose key features of the underlying data.
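Points 1 and 2 can be illustrated together. The sketch below reads a tiny in-memory CSV (the `code` and `duration` columns and their values are made up) twice: once letting pandas infer the types, and once declaring them with the `dtype` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical data: 'code' is an identifier that happens to look numeric
csv_text = "code,duration\n007,142\n042,175\n"

# Inferred types: the codes are parsed as integers and lose their leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# Declared types: codes stay as text, and duration uses a smaller integer
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'code': str, 'duration': 'int32'},
)

print(list(inferred['code']))  # [7, 42]
print(list(typed['code']))     # ['007', '042']
print(typed['duration'].dtype) # int32
```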
