# Learning Pandas

[Pandas](https://pandas.pydata.org/) is a Python library that makes data processing and analysis easier. These are explanatory notes on the subject and will largely draw on [this resource](https://media.readthedocs.org/pdf/pandasguide/latest/pandasguide.pdf). Before we embark on this journey I want to outline some motivation for learning it: first of all, I am intensely interested in data and how it can be used for the greater good.

Applications of the tools I am about to learn are:
- Processing and analysing data
- Collecting data with the intention of reducing cleaning time
- Visualisations of large datasets

So, let's get going, shall we?

First we will import pandas:

In [2]:
import pandas as pd

# Chapter 1 - Pandas Basics

Here I will discuss and apply foundational data structures of pandas, the Series and DataFrame.

## Chapter 1.2 Series and DataFrames

Pandas provides two very useful data structures for processing data: the Series and the DataFrame. They are discussed in turn below.

### **Series**

A Series is a one-dimensional labelled array that can store mixed data types. Any list, tuple or dictionary in Python can be converted to a Series by passing it to the `pd.Series()` constructor. The **row labels** in a Series are called the index.

In [3]:
# Converting a tuple to a series
t = ('AA', '2012-02-01', 100, 12.2)
ts = pd.Series(t)

print(ts)

0            AA
1    2012-02-01
2           100
3          12.2
dtype: object


In [4]:
# Converting a list to a series
l = ['FB','2001-08-02', 90, 3.2]
ls = pd.Series(l)

print(ls)

0            FB
1    2001-08-02
2            90
3           3.2
dtype: object


Note that when converting a tuple or a list to a Series, the index labels are automatically set to 0, 1, 2, ... We can change this by passing an `index` argument. When converting a dictionary, the keys become the index labels.

In [5]:
# Give the series a custom index
ls = pd.Series(l, index=['name','date','shares','price'])

print(ls)

name              FB
date      2001-08-02
shares            90
price            3.2
dtype: object


In [6]:
# Converting a dictionary to a series
d = {'name':'IBM', 'date':'2016-04-08','shares':100,'price':10.2}
ds = pd.Series(d)

print(ds)

name             IBM
date      2016-04-08
shares           100
price           10.2
dtype: object


#### Accessing Elements of a Series

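We can access elements of a Series either by index label, e.g. `ls['shares']`, or by integer position, e.g. `ds.iloc[1]`. (Plain `ds[1]` on a Series with string labels also works positionally in older pandas versions, but recent versions deprecate this in favour of `.iloc`.) We can also select several consecutive elements using standard slicing notation.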

In [7]:
print(ls['shares'],"\n")
print(ls[0:3])

90 

name              FB
date      2001-08-02
shares            90
dtype: object

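To make the difference between label-based and position-based access explicit, here is a minimal sketch using the `.loc` and `.iloc` indexers. The Series `s` is a rebuilt copy of `ls` so that the snippet runs on its own, and the variable names are my own.

```python
import pandas as pd

# Same data as the `ls` Series above, rebuilt so the snippet is self-contained
s = pd.Series(['FB', '2001-08-02', 90, 3.2],
              index=['name', 'date', 'shares', 'price'])

by_label = s.loc['shares']              # look up by index label
by_position = s.iloc[2]                 # look up by integer position
first_three = s.iloc[0:3]               # positional slice, end excluded
label_slice = s.loc['name':'shares']    # label slice, end included

print(by_label, by_position)
print(first_three)
```

Note that a label slice includes its end point while a positional slice does not, so both slices here select the same three elements.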

### **DataFrames**

A DataFrame is similar to a Series but holds two-dimensional data, with both row labels (the index) and column names. It is an important data structure in Pandas because tabular files such as CSV and spreadsheet files are read into DataFrames.

The most common way to create a DataFrame is to build a dictionary of equal-length lists, as shown below, and then pass it to the `pd.DataFrame()` constructor.

In [8]:
# Creating a dictionary
data = {
    'name':['AA','BB','CC','DD'],
    'date':['2001-12-01','2012-02-10','2010-04-09','2018-06-03'],
    'shares':[100,950,35,65],
    'price':[12.1,130,55.,16.]
}
# Converting it to a dataframe.
df = pd.DataFrame(data)

print(df)

  name        date  shares  price
0   AA  2001-12-01     100   12.1
1   BB  2012-02-10     950  130.0
2   CC  2010-04-09      35   55.0
3   DD  2018-06-03      65   16.0


#### Manipulating DataFrames

We can change a DataFrame in a variety of ways. For example, we can create a new column by assigning to a column name that does not exist yet; a single value is repeated for every row.

In [9]:
df['ligma'] = 'balls'

print(df)

  name        date  shares  price  ligma
0   AA  2001-12-01     100   12.1  balls
1   BB  2012-02-10     950  130.0  balls
2   CC  2010-04-09      35   55.0  balls
3   DD  2018-06-03      65   16.0  balls


By default the row labels are again 0, 1, 2, ... We can change this by assigning to the `df.index` attribute.

In [10]:
df.index = ['one', 'two', 'three', 'four']
print(df)

      name        date  shares  price  ligma
one     AA  2001-12-01     100   12.1  balls
two     BB  2012-02-10     950  130.0  balls
three   CC  2010-04-09      35   55.0  balls
four    DD  2018-06-03      65   16.0  balls

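A new column does not have to be a constant. As a small sketch, a column can also be computed from existing ones. The `currency` and `value` columns below are hypothetical and only serve as an illustration; the DataFrame is a trimmed copy of the one above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['AA', 'BB', 'CC', 'DD'],
    'shares': [100, 950, 35, 65],
    'price': [12.1, 130, 55., 16.],
})

# A single value is broadcast to every row
df['currency'] = 'USD'   # hypothetical column, for illustration only

# A column can also be computed element-wise from existing columns
df['value'] = df['shares'] * df['price']

# Replace the default 0, 1, 2, ... row labels
df.index = ['one', 'two', 'three', 'four']

print(df)
```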


# Chapter 2 - Functionality of Pandas

In this section we will explore the basic functionality of Pandas, including reading CSV files, selecting, filtering and sorting data, handling null values, and string operations.

Included in this section are two large CSV files, cast.csv and titles.csv, which I will work with.

## **2.1 - Reading CSV Files**

A core function in Pandas is `pd.read_csv()`, which reads a CSV file into a Python variable. I will start by reading the cast.csv file, passing the argument `index_col = None` (the default), which tells the function that no column of the CSV holds row labels: every column is data, and the rows get the default 0, 1, 2, ... index.

When the CSV file is read, the result is a DataFrame, so everything in the last chapter applies to it.

To take a brief look at the data, the methods `head()` and `tail()` show the first and last 5 rows respectively. You can also pass in an integer to set how many rows are displayed.

If we inspect the DataFrame by simply evaluating the variable, only a truncated view with the first and last few rows is displayed for large tables (the exact number depends on the pandas version and display settings). We can change this with the `pd.set_option()` function. Take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html) to read more about how to do this.

The built-in `len()` function tells you how many rows there are in the DataFrame.


In [11]:
# Read the cast csv file
cast = pd.read_csv('cast.csv', index_col = None)
print(cast.head())

# Read the titles csv file
title = pd.read_csv('titles.csv', index_col = None)
print(title.tail())


                  title  year      name   type                character     n
0        Closet Monster  2015  Buffy #1  actor                  Buffy 4  31.0
1       Suuri illusioni  1985    Homo $  actor                   Guests  22.0
2   Battle of the Sexes  2017   $hutter  actor          Bobby Riggs Fan  10.0
3  Secret in Their Eyes  2015   $hutter  actor          2002 Dodger Fan   NaN
4            Steve Jobs  2015   $hutter  actor  1988 Opera House Patron   NaN
                      title  year
49995                 Rebel  1970
49996               Suzanne  1996
49997                 Bomba  2013
49998  Aao Jao Ghar Tumhara  1984
49999            Mrs. Munck  1995

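The same calls can be tried without the CSV files on disk. The sketch below reads a small, made-up CSV from an in-memory string (the film titles are hypothetical placeholders, not rows from titles.csv) and shows `head()`, `tail()`, `len()` and `pd.set_option()` together.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = """title,year
Alpha,1999
Beta,2001
Gamma,2001
Delta,2010
Epsilon,2015
Zeta,2018
"""

films = pd.read_csv(io.StringIO(csv_text), index_col=None)

print(films.head(2))   # first 2 rows
print(films.tail(3))   # last 3 rows
print(len(films))      # number of rows

# Raise the limit on how many rows are shown when a DataFrame is displayed
pd.set_option('display.max_rows', 100)
print(pd.get_option('display.max_rows'))
```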

## **2.2 Data Operations**

We often aren't satisfied with the raw data that we have, or it doesn't suit our particular purpose right off the bat, so we may have to manipulate the data in various ways until we have a result that is useful. In this section we will explore some such methods for doing this.

### 2.2.1 Selecting Rows and Columns

Any column can be selected by passing its name in square brackets, e.g. `title['title']` returns a one-dimensional array (a Series) that contains all of the titles. We can further select items by slicing the result if we'd like to.

We can also select a row from the DataFrame by its index label using the `loc` indexer.

In [12]:
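# Select the first 10 elements of the title column
# (in a notebook only the last expression of a cell is displayed)
title['title'][0:10]
# Select the row whose index label is 3
title.loc[3]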

title    Country
year        2000
Name: 3, dtype: object

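As a rough summary of the selection options, here is a sketch on a tiny made-up DataFrame (the data and variable names are hypothetical). It contrasts selecting columns with square brackets, rows by label with `loc`, and rows by position with `iloc`.

```python
import pandas as pd

# Hypothetical data, small enough to check by eye
films = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'year': [1999, 2001, 2001, 2010],
})

titles = films['title']             # one column name -> Series
subset = films[['title', 'year']]   # list of column names -> DataFrame
row = films.loc[3]                  # row with index label 3 -> Series
cell = films.loc[3, 'title']        # single value by row label and column name
first_two = films.iloc[0:2]         # rows by position, end excluded

print(row)
print(cell)
```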
### 2.2.2 Filtering Data

We can also filter data by indexing the DataFrame with a boolean condition, which keeps only the rows where the condition is true. For example, we can get all of the movies released in the year 2001 and save them as a new DataFrame.

We could also combine several boolean conditions, for example to keep only the movies released between two given years.

In [13]:
# Keep only the movies that were made in 2001
mov01 = title[title['year'] == 2001]
mov01.head()

                      title  year
28    Tera Mera Saath Rahen  2001
34                  Nariman  2001
351            The Long Run  2001
353      Familjehemligheter  2001
398  Amorseko: Damong ligaw  2001

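Filtering on a range of years with several conditions could look like the sketch below. The DataFrame is made up for illustration. Each condition needs its own parentheses, and the conditions are combined with `&` (and) or `|` (or) rather than the Python keywords `and` / `or`.

```python
import pandas as pd

# Hypothetical data for illustration
films = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon'],
    'year': [1999, 2001, 2005, 2010, 2015],
})

# Combine two conditions; the parentheses around each one are required
mask = (films['year'] >= 2000) & (films['year'] <= 2010)
between = films[mask]

# Series.between() expresses the same inclusive range in one call
also_between = films[films['year'].between(2000, 2010)]

print(between)
```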

### 2.2.3 Sorting

There are two crucial methods for sorting data in pandas: `sort_index()` and `sort_values()`. Filtering keeps rows in their original order, which for our data is already the order of the index values, so applying `sort_index()` to the mov01 object would have no effect. To sort by the contents of a column instead, we use `sort_values()`.

Shown below is a filtered DataFrame in its original index order, followed by the same selection sorted by year.

In [15]:
mac = title[title['title'] == 'Macbeth']
mac.head()

         title  year
4226   Macbeth  1913
9322   Macbeth  2006
11722  Macbeth  2013
17166  Macbeth  1997
25847  Macbeth  1998


We can also sort by the values of a column with `sort_values()`. This is useful when we want to sort alphabetically or numerically, depending on the data type.

In [17]:
mac = title[title['title']=='Macbeth'].sort_values('year')
mac.head()

         title  year
4226   Macbeth  1913
17166  Macbeth  1997
25847  Macbeth  1998
9322   Macbeth  2006
11722  Macbeth  2013

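A few more sorting variations, sketched on made-up data: descending order, sorting by more than one column, and `sort_index()` to get back to index order. The rows below are hypothetical and not taken from titles.csv.

```python
import pandas as pd

# Hypothetical data for illustration
films = pd.DataFrame({
    'title': ['Macbeth', 'Hamlet', 'Macbeth', 'Othello'],
    'year': [2006, 1996, 1913, 1995],
})

# Newest first
newest_first = films.sort_values('year', ascending=False)

# Sort by title, then by year within each title
by_title_year = films.sort_values(['title', 'year'])

# sort_index() restores the original 0, 1, 2, ... order
restored = by_title_year.sort_index()

print(newest_first)
print(by_title_year)
```

Both methods return a new DataFrame and leave `films` itself unchanged.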

### 2.2.4  Null Values

Dealing with empty values in datasets is important. Often some fields will have no entry, and these are stored as NaN when pandas reads the CSV file. Looking at cast.csv, we notice that the rows with index 3 and 4 have no entry in the n column.

Two useful methods for finding empty values are `isnull` and `notnull`. They return boolean values indicating whether each cell is null (or not null). If we call one on the DataFrame, it returns a DataFrame of booleans with the same shape, one per cell.

Calling the method on a single column will of course return a Series. This can be extremely valuable for cleaning data sets.

In [24]:
# Calling isnull() and notnull() on the cast csv
print(cast.isnull().head())
print(cast.notnull().head())

# Calling the isnull() on a single column
print(cast['n'].isnull().head())

   title   year   name   type  character      n
0  False  False  False  False      False  False
1  False  False  False  False      False  False
2  False  False  False  False      False  False
3  False  False  False  False      False   True
4  False  False  False  False      False   True
   title  year  name  type  character      n
0   True  True  True  True       True   True
1   True  True  True  True       True   True
2   True  True  True  True       True   True
3   True  True  True  True       True  False
4   True  True  True  True       True  False
0    False
1    False
2    False
3     True
4     True
Name: n, dtype: bool


As in R, to show only the rows that meet a specific condition, the boolean Series must be passed back into the DataFrame as an index. We will then fill the empty values with something else; there are a few methods for doing this, such as `fillna`, `ffill` and `bfill`.

In [26]:
# Show the rows that have a null value in the n column
c = cast[cast['n'].isnull()]
print(c.head())

# Let's fill those null values with something else
c_fill = c.fillna('NA')
c_fill.head()

Unnamed: 0,title,year,name,type,character,n
3,Secret in Their Eyes,2015,$hutter,actor,2002 Dodger Fan,
4,Steve Jobs,2015,$hutter,actor,1988 Opera House Patron,
5,Straight Outta Compton,2015,$hutter,actor,Club Patron,
6,Straight Outta Compton,2015,$hutter,actor,Dopeman,
7,For Thy Love 2,2009,Bee Moe $lim,actor,Thug 1,

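To compare the filling methods side by side, here is a small sketch on a made-up numeric Series (the values are hypothetical). `fillna` replaces nulls with a given value, `ffill` copies the previous non-null value forward, `bfill` copies the next one backward, and `dropna` simply removes the nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps
n = pd.Series([31.0, np.nan, np.nan, 10.0, np.nan])

constant = n.fillna(0)   # replace every null with 0
forward = n.ffill()      # carry the last seen value forward
backward = n.bfill()     # pull the next value backward
dropped = n.dropna()     # remove the null entries instead

print(constant.tolist())
print(forward.tolist())
print(backward.tolist())
```

A trailing null has no later value to pull from, so `bfill` leaves it as NaN. Filling a numeric column with a string such as `'NA'` also turns the column into a generic object column, so a numeric fill value is often the safer choice when the column is used in calculations.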

### 2.2.5 String Operations

Much of our data is likely to be string values, so we want some way of performing operations on them. We will cover some very basic ones here; they are accessed through the `.str` accessor. You can read more [here](https://pandas.pydata.org/pandas-docs/stable/text.html) about working with text in Pandas, and this is a jump to a [method summary](https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary).

The methods that I will explore are `contains`, `startswith` and `endswith`, which are pretty self-explanatory. Note that cast.csv has one row per role, so the counts below are numbers of cast rows rather than distinct movies.


In [33]:
# Using the contains string method
c_contains = cast[cast['title'].str.contains('her')]
print("There are",len(c_contains),"movies that contain her")

# Using the startswith method
c_begins = cast[cast['title'].str.startswith("The")]
print("There are",len(c_begins),"movies that begin with The")

# Using the endswith method
c_ends = cast[cast['title'].str.endswith("er")]
print("There are",len(c_ends),"movies that end with er")

There are 1280 movies that contain her
There are 8806 movies that begin with The
There are 3083 movies that end with er

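Some details of these string methods are easy to miss, so here is a sketch on a made-up Series of titles (hypothetical values). By default `contains` is case-sensitive and treats the pattern as a regular expression; `case=False` and `regex=False` change that, and `na=False` treats missing titles as non-matches so the result can be used as a filter.

```python
import pandas as pd

# Hypothetical titles, including one missing value
titles = pd.Series(['The Other Side', 'Mother', 'Her', 'Panther (1995)', None])

# Case-sensitive substring match; missing values count as False
has_her = titles.str.contains('her', na=False)

# Ignore case, so 'Her' matches as well
any_case = titles.str.contains('her', case=False, na=False)

# Match the text literally instead of as a regular expression
literal = titles.str.contains('(1995)', regex=False, na=False)

starts_the = titles.str.startswith('The', na=False)

print(has_her.tolist())
print(any_case.tolist())
print(literal.tolist())
```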

### 2.2.6 Count Values


## 2.3 Groupby