## Introduction

Pandas is a powerful open-source library built for data analysis and manipulation.
This notebook explores Pandas techniques to do the following:

<li>Reading various types of files into dataframes</li>
<li>Displaying different crucial aspects of the dataframes</li>
<li>Selecting and filtering columns</li>
<li>Important operations on dataframes</li>
<li>Advanced tips and tricks</li>

Although the last item covers advanced tips, there are useful nuggets dropped throughout the entire notebook.

## READING FILES

There are a variety of file formats you can read into dataframes. Some of them are:
1. Comma Separated Values (CSV).
2. Tab Separated Values (TSV).
3. Parquet.
4. XLSX (Excel).
5. Pickle.
6. JSON.

And so many more.

The beauty of dataframes is that they behave the same regardless of the format of the original file, which makes the wrangling and manipulation you do afterwards format-agnostic (a few caveats may apply).

Before working with the Pandas API, you need to import it into the environment. Any alias works, but the standard convention is to import it as `pd`.

In [1]:
import pandas as pd
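
To make the format-agnostic point above concrete, here is a minimal sketch that reads the same two records from a CSV payload and a JSON payload. The payloads are made-up, in-memory stand-ins (via `StringIO`) rather than files from this notebook's `data` folder.

```python
from io import StringIO

import pandas as pd

# Two tiny, made-up payloads holding the same records in different formats.
csv_text = "title,votes\nToy Story,10174\nInception,22186\n"
json_text = (
    '[{"title": "Toy Story", "votes": 10174},'
    ' {"title": "Inception", "votes": 22186}]'
)

from_csv = pd.read_csv(StringIO(csv_text))
from_json = pd.read_json(StringIO(json_text))

# Both readers return a DataFrame, so the same downstream code works on each.
for frame in (from_csv, from_json):
    print(type(frame).__name__, list(frame.columns), frame["votes"].sum())
```

Once the data is in a dataframe, the loop body does not need to know which reader produced it.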

### CSV FILE

Ideally, when reading a file into a Pandas dataframe, you assign the result to a variable so that you can keep working with it, rather than calling the reader without keeping what it returns. In this case, the variable is `csvExample`.

In [2]:
# Read the file into variable
csvExample = pd.read_csv(filepath_or_buffer='./data/tmdb.movies.csv')

# Preview the data
csvExample.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


### TSV FILE

Since TSV and CSV files are so similar, Pandas reads both with the same function, `read_csv`. The only difference is that the `delimiter` argument is set to `"\t"` to represent tabs. By default, the delimiter is a comma.

In [3]:
# Read the data
tsvExample = pd.read_csv(filepath_or_buffer='./data/rt.movie_info.tsv', delimiter='\t')

# Preview the data stored in the variable
tsvExample.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,
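
As a self-contained variant of the cell above, the sketch below reads a small made-up tab-separated string instead of the notebook's file. It also shows that `sep` and `delimiter` are interchangeable in `read_csv`, and what happens if the tab delimiter is left out.

```python
from io import StringIO

import pandas as pd

# Made-up tab-separated text standing in for a .tsv file.
tsv_text = "id\trating\truntime\n1\tR\t104 minutes\n3\tR\t108 minutes\n"

# `delimiter` is an alias for `sep`, so these two calls behave the same.
with_delimiter = pd.read_csv(StringIO(tsv_text), delimiter="\t")
with_sep = pd.read_csv(StringIO(tsv_text), sep="\t")

# With the default comma separator, each line lands in a single column.
default_comma = pd.read_csv(StringIO(tsv_text))

print(with_delimiter.shape)  # (2, 3)
print(default_comma.shape)   # (2, 1)
```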


### EXCEL FILES

In [4]:
# Load a single sheet from an excel file
attendance = pd.read_excel(io='./data/attendance.xlsx', sheet_name=0)
attendance.head()

Unnamed: 0,First name,Duration,Time joined,Time exited
0,Beatrice,1 hr 41 min,2:04 PM,3:45 PM
1,Fanice,1 hr 49 min,1:51 PM,3:40 PM
2,Cynthia,1 hr 35 min,2:02 PM,3:41 PM
3,Ida,1 hr 46 min,1:58 PM,3:45 PM
4,Collins,1 hr 45 min,2:03 PM,3:48 PM


In [5]:
# Investigate default behaviour when reading sheets
defaultSheet = pd.read_excel(io='./data/playground.xlsx')
defaultSheet.head()

Unnamed: 0,First name,Duration,Time joined,Time exited
0,Beatrice,1 hr 41 min,2:04 PM,3:45 PM
1,Fanice,1 hr 49 min,1:51 PM,3:40 PM
2,Cynthia,1 hr 35 min,2:02 PM,3:41 PM
3,Ida,1 hr 46 min,1:58 PM,3:45 PM
4,Collins,1 hr 45 min,2:03 PM,3:48 PM


In [6]:
# Specify which sheet to read
pd.read_excel(io='./data/playground.xlsx', sheet_name='Emails').head()

Unnamed: 0,First name,Email
0,Alice,alice@exampleemail.com
1,Beatrice,beatrice@exampleemail.com
2,Fanice,fanice@exampleemail.com
3,Cynthia,cynthia@exampleemail.com
4,Ida,ida@exampleemail.com


In [7]:
# The same result is achieved when passing the sheet's position instead
pd.read_excel(io='./data/playground.xlsx', sheet_name=1).head()

Unnamed: 0,First name,Email
0,Alice,alice@exampleemail.com
1,Beatrice,beatrice@exampleemail.com
2,Fanice,fanice@exampleemail.com
3,Cynthia,cynthia@exampleemail.com
4,Ida,ida@exampleemail.com


<i>Note that when using indexing, `0` represents the first sheet.</i>

In [8]:
# Read multiple sheets together
pd.read_excel(io='./data/playground.xlsx', sheet_name=['Attendees','Emails'])

{'Attendees':    First name     Duration Time joined Time exited
 0    Beatrice  1 hr 41 min     2:04 PM     3:45 PM
 1      Fanice  1 hr 49 min     1:51 PM     3:40 PM
 2     Cynthia  1 hr 35 min     2:02 PM     3:41 PM
 3         Ida  1 hr 46 min     1:58 PM     3:45 PM
 4     Collins  1 hr 45 min     2:03 PM     3:48 PM
 5     Kenneth  1 hr 45 min     1:56 PM     3:40 PM
 6       Brian  1 hr 46 min     2:02 PM     3:48 PM
 7     Cynthia  1 hr 46 min     2:00 PM     3:47 PM
 8     Cynthia       29 min     2:00 PM     2:32 PM
 9     Josphat  1 hr 51 min     1:52 PM     3:44 PM
 10    Vincent  1 hr 45 min     2:03 PM     3:48 PM
 11       Mike  1 hr 41 min     1:58 PM     3:40 PM
 12    Phyllis  1 hr 45 min     1:58 PM     3:42 PM
 13      Hanan  1 hr 42 min     2:01 PM     3:43 PM
 14     George  1 hr 37 min     2:08 PM     3:45 PM
 15      Jimmy  1 hr 36 min     2:04 PM     3:40 PM
 16     Daniel  1 hr 39 min     2:05 PM     3:43 PM
 17      Edwin  1 hr 19 min     2:08 PM     3:43 PM

Passing a list of sheet names returns a dictionary in which each key is a sheet name and each value is a dataframe holding that sheet's data. This format is rarely needed.
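
For the occasions when the dictionary form is useful, here is a minimal sketch of how it can be consumed. To keep it runnable without the Excel file, `sheets` is a hand-built stand-in for the return value, filled with a couple of rows copied from the outputs above.

```python
import pandas as pd

# Hand-built stand-in for the dictionary returned when `sheet_name` is a list;
# the rows are a small subset of the outputs shown earlier.
sheets = {
    "Attendees": pd.DataFrame(
        {"First name": ["Beatrice", "Fanice"],
         "Duration": ["1 hr 41 min", "1 hr 49 min"]}
    ),
    "Emails": pd.DataFrame(
        {"First name": ["Alice", "Beatrice"],
         "Email": ["alice@exampleemail.com", "beatrice@exampleemail.com"]}
    ),
}

# Pull out a single sheet by its name.
emails = sheets["Emails"]
print(emails.head())

# Or loop over every sheet that was read.
for sheet_name, frame in sheets.items():
    print(sheet_name, frame.shape)
```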

## Displaying DataFrame Information

There are several crucial aspects that a data scientist needs to look at when first working with a dataset. These can include:
1. The top and bottom N rows of the data.
2. The dimensions of the data.
3. The schema of the dataset.
4. A random sample of the dataset.
5. A list of the column names in the dataset.
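
As a quick orientation, the five items above map onto the calls sketched below. The `toy` dataframe is a made-up three-row example, not the notebook's dataset, and `random_state=0` is only there to make the sample repeatable.

```python
import pandas as pd

# Made-up toy dataframe used only to illustrate the calls.
toy = pd.DataFrame(
    {"title": ["Toy Story", "Inception", "Avatar"],
     "vote_average": [7.9, 8.3, 7.4]}
)

print(toy.head(2))                      # 1. top N rows
print(toy.tail(2))                      #    bottom N rows
print(toy.shape)                        # 2. dimensions as (rows, columns)
print(toy.dtypes)                       # 3. schema: the data type of each column
print(toy.sample(n=2, random_state=0))  # 4. a random sample of rows
print(toy.columns.tolist())             # 5. list of column names
```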

Let's use the `tmdb` dataset we looked at previously for this section. For easier reference, the variable used will be `df`, though it's best practice to use more descriptive variable names to minimize confusion.

In [9]:
# Read the dataset into the environment
df = pd.read_csv('./data/tmdb.movies.csv')

<i>Small tip:</i> Sometimes a dataset is saved together with its index, which leads to an unnecessary column when the file is read back. Scroll up to where we first read this dataset and look for a column named `Unnamed: 0`. To get around this, you can specify the position of the index column when reading the file. Pandas then uses that column as the dataframe's index instead of treating it as a regular column. For example:

In [10]:
df = pd.read_csv(filepath_or_buffer='./data/tmdb.movies.csv',
            index_col=0)
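
To see the effect in isolation, here is a small sketch using a made-up CSV string whose first column has an empty header, which is what a saved index typically looks like. The data is illustrative only.

```python
from io import StringIO

import pandas as pd

# Made-up CSV text whose first column is a saved index (note the empty header).
csv_text = ",title,vote_count\n0,Toy Story,10174\n1,Inception,22186\n"

# Default behaviour: the saved index shows up as a regular column.
with_extra_column = pd.read_csv(StringIO(csv_text))
print(with_extra_column.columns.tolist())  # ['Unnamed: 0', 'title', 'vote_count']

# With index_col=0, the first column becomes the index instead.
with_index = pd.read_csv(StringIO(csv_text), index_col=0)
print(with_index.columns.tolist())  # ['title', 'vote_count']
```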

#### Top/Bottom N Rows

Using the `head` and `tail` methods, you can easily preview the top or bottom rows of a dataset. By default, 5 rows are returned, but this is easy to override by passing a different number.

In [11]:
# Top 5 rows of the dataframe
df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [12]:
# Top 10 rows of the dataset
df.head(n=10)   #df.head(10) is also viable

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
5,"[12, 14, 10751]",32657,en,Percy Jackson & the Olympians: The Lightning T...,26.691,2010-02-11,Percy Jackson & the Olympians: The Lightning T...,6.1,4229
6,"[28, 12, 14, 878]",19995,en,Avatar,26.526,2009-12-18,Avatar,7.4,18676
7,"[16, 10751, 35]",10193,en,Toy Story 3,24.445,2010-06-17,Toy Story 3,7.7,8340
8,"[16, 10751, 35]",20352,en,Despicable Me,23.673,2010-07-09,Despicable Me,7.2,10057
9,"[16, 28, 35, 10751, 878]",38055,en,Megamind,22.855,2010-11-04,Megamind,6.8,3635


In [13]:
# Previewing the bottom 5 rows
df.tail()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1


In [14]:
# Previewing the bottom 10 rows
df.tail(10)

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
26507,[99],545555,ar,Dreamaway,0.6,2018-10-14,Dream Away,0.0,2
26508,[16],514492,en,Jaws,0.6,2018-05-29,Jaws,0.0,1
26509,[27],502255,en,Closing Time,0.6,2018-02-24,Closing Time,0.0,1
26510,[99],495045,en,Fail State,0.6,2018-10-19,Fail State,0.0,1
26511,[99],492837,en,Making Filmmakers,0.6,2018-04-07,Making Filmmakers,0.0,1
26512,"[27, 18]",488143,en,Laboratory Conditions,0.6,2018-10-13,Laboratory Conditions,0.0,1
26513,"[18, 53]",485975,en,_EXHIBIT_84xxx_,0.6,2018-05-01,_EXHIBIT_84xxx_,0.0,1
26514,"[14, 28, 12]",381231,en,The Last One,0.6,2018-10-01,The Last One,0.0,1
26515,"[10751, 12, 28]",366854,en,Trailer Made,0.6,2018-06-22,Trailer Made,0.0,1
26516,"[53, 27]",309885,en,The Church,0.6,2018-10-05,The Church,0.0,1
