# Pandas

Pandas is a library for data analysis and manipulation. It is built on top of NumPy, so everything we have learned so far carries over. Two of its most powerful tools are the Series and the DataFrame. To learn more about what's possible with pandas, check out their [site](https://pandas.pydata.org) and [documentation](https://pandas.pydata.org/pandas-docs/stable/).

## Installation: 

If you've installed your environment via downloading anaconda, you likely already have pandas installed. To check, enter the below command into your console: 

`conda list | grep pandas`

Otherwise, if you're using pip, install pandas with the below command via your command line: 

`pip install pandas`

Now let's import the libraries. To avoid typing `pd.<whatever>` every time, we'll also import `Series` and `DataFrame` directly from pandas:

In [None]:
import numpy as np

import pandas as pd
from pandas import DataFrame
from pandas import Series



## Series

Series are a very powerful tool within pandas. A series is ultimately a wrapper around a NumPy array, but instead of only addressing values by position as in NumPy, with a series we get to attach an index of labels to our data.

In [None]:
fibonacciNumbers = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

mySeries = Series(fibonacciNumbers)
print(mySeries)

What's happening above is that we pass a list into `Series` and get back an indexed list of Fibonacci numbers. For now, we'll access these numbers by index just like we would with an array. Let's grab the element at index 7 (because counting starts at 0, that is the 8th Fibonacci number):

In [None]:
mySeries[7]

It's not always the case that you'll want to have your series indexed by numbers though. Let's try being a bit more creative: 

In [None]:
pseudonyms = ["scary", "posh", "sporty", "ginger", "baby"]
names = ["Mel B", "Victoria Beckham", "Mel C", "Geri Halliwell", "Emma Bunton"]

spiceGirls = Series(names, index=pseudonyms)
spiceGirls

Unlike with arrays, we usually don't want to grab our data by numerical position anymore: the positions still exist, but we've added labels for easier access. Think of our index now as the keys of a dictionary (from the intro to Python notebook):

In [None]:
spiceGirls.iloc[0]  # positional lookup; plain spiceGirls[0] is deprecated when the index holds labels

In [None]:
spiceGirls['scary']
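
The two lookups above can be spelled out explicitly with `.iloc` (position) and `.loc` (label). The block below is a minimal sketch that rebuilds the same series under the assumed name `spice` so it runs on its own:

```python
import pandas as pd

# Same data as the spiceGirls series above, rebuilt so this block is self-contained.
spice = pd.Series(
    ["Mel B", "Victoria Beckham", "Mel C", "Geri Halliwell", "Emma Bunton"],
    index=["scary", "posh", "sporty", "ginger", "baby"],
)

first_by_position = spice.iloc[0]      # position 0, whatever its label is
scary_by_label = spice.loc["scary"]    # label lookup, like a dictionary key
several = spice.loc[["posh", "baby"]]  # a list of labels returns a smaller Series

print(first_by_position, scary_by_label)
print(several)
```

Being explicit about which kind of lookup you mean avoids surprises when an index happens to contain integers.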

To find out whether a specific label (a key, in dictionary terms) is in a series, use the `in` keyword. Note that it checks the index, not the values:

In [None]:
'posh' in spiceGirls

Given that our series is now acting entirely like a dictionary, it is actually possible to convert the series itself to a dictionary (granted, by doing so, you would be losing a lot of functionality): 

In [None]:
spiceGirlsDictionary = spiceGirls.to_dict()
spiceGirlsDictionary

Since we can go from a series to a dictionary, we can also go the other way: take any dictionary we have created somewhere else and make a series out of it:

In [None]:
beastieBoysDictionary = {
    "mikeD": "Michael Diamond", 
    "MCA": "Adam Yauch", 
    "Ad-Rock": "Adam Horovitz"
}

beastieBoySeries = Series(beastieBoysDictionary)
beastieBoySeries

Just as we built boolean masks in NumPy, we can build them from a series and use them to filter it. Let's try it first with some numerical data (let's shift to age):

In [None]:
nsyncAges = {
    "justinT": 38,
    "chrisK": 47,
    "joeyF": 42,
    "lanceB":  40,
    "jcC": 43
}

nsyncSeries = Series(nsyncAges)
nsyncSeries

In [None]:
olderThan40 = nsyncSeries > 40
olderThan40

In [None]:
nsyncSeries[olderThan40]
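
Masks like `olderThan40` can also be combined. The sketch below rebuilds the ages under the assumed name `ages` and shows the `&`, `|`, and `~` operators on boolean series:

```python
import pandas as pd

ages = pd.Series({"justinT": 38, "chrisK": 47, "joeyF": 42, "lanceB": 40, "jcC": 43})

# Each comparison yields a boolean Series; & means "and", | means "or".
# The parentheses are required because & and | bind tighter than > and <.
in_early_forties = ages[(ages > 40) & (ages < 45)]
youngest_or_oldest = ages[(ages == ages.min()) | (ages == ages.max())]
forty_or_younger = ages[~(ages > 40)]  # ~ flips every True/False

print(in_early_forties)
print(youngest_or_oldest)
print(forty_or_younger)
```

Python's own `and` / `or` do not work element by element on a series, which is why the symbol operators are used here.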

You can also add series together just like NumPy arrays. The difference is that pandas lines the values up by index label before adding, so you're always adding like to like:

In [None]:
nsyncSeries + nsyncSeries

Let's take a look at that again, but this time without Justin Timberlake: 

In [None]:
nsyncAgesWithoutJT = {
    "chrisK": 47,
    "joeyF": 42,
    "lanceB":  40,
    "jcC": 43
}
nsyncMinusJT = Series(nsyncAgesWithoutJT)
nsyncMinusJT

Now let's add them together again to see what happens: 

In [None]:
newNsyncSeries = nsyncMinusJT + nsyncSeries
newNsyncSeries

Because we're adding like with like, pandas looks for a `justinT` entry in both series. There is one in `nsyncSeries` (38) but none in `nsyncMinusJT`, so we're trying to add 38 to a number that doesn't exist, and we wind up with `NaN`.

You'll often want to find out which values are missing from a given series. To do that, use the pandas `pd.isna()` function, which returns a boolean series:

In [None]:
pd.isna(newNsyncSeries)

We can find the opposite with `pd.notna()`: 

In [None]:
pd.notna(newNsyncSeries)

Because we're getting a boolean array back, we can filter on that data: 

In [None]:
notMissingNsync = pd.notna(newNsyncSeries)

newNsyncSeries[notMissingNsync]
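
Two shortcuts are worth knowing here. The sketch below (with assumed names `full` and `without_jt` standing in for the two series above) shows `dropna()`, which does the same filtering in one call, and `add(..., fill_value=0)`, which avoids creating the `NaN` in the first place:

```python
import pandas as pd

full = pd.Series({"justinT": 38, "chrisK": 47, "joeyF": 42, "lanceB": 40, "jcC": 43})
without_jt = full.drop("justinT")

summed = without_jt + full                    # justinT has no partner, so it becomes NaN
cleaned = summed.dropna()                     # same rows as summed[pd.notna(summed)]
filled = without_jt.add(full, fill_value=0)   # treat the missing value as 0 instead

print(summed)
print(cleaned)
print(filled)
```

Whether a missing value should be dropped or treated as zero depends on what the data mean, so neither option is the default answer.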

You won't always want to filter a series on a number, though. You can filter on anything that returns a truth value per element! Let's go back to the Beastie Boys and filter on whether or not their names contain "Adam". For this, we'll use the `seriesname.str.contains()` method:

In [None]:
beastieBoySeries.str.contains('Adam')

In [None]:
namedAdam = beastieBoySeries.str.contains('Adam')

beastieBoySeries[namedAdam]
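
The `.str` accessor has more than `contains`. As a small sketch (the series is rebuilt under the assumed name `beasties`), here are a case-insensitive match, a prefix test, and a length test, each of which returns a boolean series usable as a filter:

```python
import pandas as pd

beasties = pd.Series({
    "mikeD": "Michael Diamond",
    "MCA": "Adam Yauch",
    "Ad-Rock": "Adam Horovitz",
})

any_case = beasties.str.contains("adam", case=False)  # ignore upper/lower case
starts_with_m = beasties.str.startswith("M")          # prefix test
long_names = beasties.str.len() > 12                  # compare string lengths

print(beasties[any_case])
print(beasties[starts_with_m])
print(beasties[long_names])
```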

Series are very powerful data structures; however, it's unlikely you'll come across a dataset that is just a single series. Most datasets out there are large, with many different features! Let's take a look at DataFrames:

## DataFrames

You can think of dataframes in multiple ways. I personally like to think of dataframes as a giant spreadsheet that's living in memory, and being powered by a series of series (think of how we had the numpy matrix). 

Let's start by importing `DataFrame` from pandas again, this time aliased as `df`:

In [None]:
from pandas import DataFrame as df

You can create your own dataframes in almost any number of ways. Let's take a look at creating our own from scratch, then we'll pull a dataframe from the internet: 

In [None]:
batmanDF = pd.DataFrame({
    'born': [pd.Timestamp('1952-10-13'),
             pd.Timestamp('1984-04-17'),
             None,
             pd.Timestamp('1943-02-04')],
    'name': ['Alfred',
             'Bruce Wayne',
             '????',
             'Victor Fries'],
    'pseudonym': [None,
                  'Batman',
                  'Joker',
                  'Mr. Freeze']})

batmanDF

In the above, we've created a very small dataframe from a dictionary, with the keys as column names and their lists as the row data (where index 0 refers to Alfred, index 1 to Bruce Wayne, and so on). We can access each individual column by its column name, like a dictionary key:

In [None]:
batmanDF['name']

What makes pandas' dataframes so powerful is that we can use what we have learned with series and numpy and apply it to much larger datasets. Let's take a look when we get a boolean array from `batmanDF['name']` and apply it to our whole dataframe: 

In [None]:
isBruceWayne = batmanDF['name'] == 'Bruce Wayne'
isBruceWayne

In [None]:
batmanDF[isBruceWayne]

What just happened above is that we got a boolean series that was `True` for a single row, and when we applied it to our dataframe, we got that specific row back! Equally, we can do the exact opposite with `!=` (and this time, we'll do everything inline):

In [None]:
batmanDF[ batmanDF['name'] != 'Bruce Wayne' ]
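
Masks built from different columns can be combined, and `.loc` lets you pick rows and columns in one step. The block below is a sketch on a rebuilt copy of the dataframe (assumed name `batman`); the 1980 cutoff is an arbitrary value chosen for illustration:

```python
import pandas as pd

batman = pd.DataFrame({
    "born": [pd.Timestamp("1952-10-13"), pd.Timestamp("1984-04-17"),
             None, pd.Timestamp("1943-02-04")],
    "name": ["Alfred", "Bruce Wayne", "????", "Victor Fries"],
    "pseudonym": [None, "Batman", "Joker", "Mr. Freeze"],
})

has_alias = batman["pseudonym"].notna()
born_before_1980 = batman["born"] < pd.Timestamp("1980-01-01")

# Both conditions must hold; a missing birth date compares as False.
older_with_alias = batman[has_alias & born_before_1980]

# .loc takes a row mask and a list of columns at the same time.
names_only = batman.loc[has_alias, ["name", "pseudonym"]]

print(older_with_alias)
print(names_only)
```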

Those are the very basics of DataFrames. Let's get a little more involved by calling an API and getting some Star Wars character data. To get this data, we'll have to import `requests`.

If you have anaconda installed, you should already have requests (otherwise, type `conda install requests`). If you're using pip, just enter into the command line:

`pip install requests` 

Then we import it! 

In [None]:
import requests

In [None]:
response = requests.get("https://swapi.co/api/people/")
response

Now that we have our response data, we need to get it into a usable format. Luckily, pandas is really good at reading lots of formats and can easily take in JSON. If you're interested in learning more about JSON, check out their [webpage](https://www.json.org).


In [None]:
jsonResponse = response.json()
jsonResponse

In [None]:
jsonBody = jsonResponse["results"]
jsonBody

In [None]:
starWarsDF = DataFrame(jsonBody)

If we had wanted to save the response into a file and read it back into the dataframe from the file system, we'd want to use `pd.read_json`.
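
As a rough sketch of what that looks like, the block below reads a JSON list of records with `pd.read_json`. The two records are made up, and an in-memory `StringIO` object stands in for a file on disk so the example needs no real path:

```python
import io
import pandas as pd

# Stand-in for a saved list of records; these two entries are invented.
saved_json = '[{"name": "Character A", "height": 172}, {"name": "Character B", "height": 96}]'

# read_json accepts a path or a file-like object; StringIO plays the file here.
people = pd.read_json(io.StringIO(saved_json))

print(people)
```

Note that this works because the JSON is already a flat list of records. The raw API response nests its records under `results`, so it would need that list pulled out first, as we did above.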

Let's start dipping into our data and see what we got. 

In [None]:
starWarsDF.head()

There are a lot of data in this dataframe, so let's see what we've got going for us by just looking at the columns: 

In [None]:
starWarsDF.columns

Let's see how many characters we've received from our API call:

In [None]:
len(starWarsDF)

10 seems a little low for a series of multiple movies. Let's take a look at our whole dataframe: 

In [None]:
starWarsDF

It's a little hard to sift through all of the above dataframe, so let's just see which specific characters we've received from the API call:

In [None]:
starWarsDF['name']

Clearly we didn't grab every single star wars character in the entire saga. Let's take a look at our response again just to make sure of what we're getting: 

In [None]:
jsonResponse

It's always good to double check what's in our data before we start working with it! We can see that there are 87 characters, but we only grabbed 10 in our dataframe. There's one key that we should absolutely take note of, and that's the `next` key. In order to keep API calls quick, the data have been paginated into pages of 10 characters, so we'll need to make 9 calls in total! First, let's make a call to see what the next page is:

In [None]:
nextPageURI = jsonResponse["next"]
nextPageURI

In [None]:
nextPage = requests.get(nextPageURI)
nextPage

In [None]:
nextPageJSON = nextPage.json()
nextPageJSON

So, now we've got a whole new page of new materials! Let's add it to a dataframe and combine the two! 

In [None]:
secondPageDataframe = DataFrame(nextPageJSON['results'])
secondPageDataframe.head()

To join two dataframes end to end, we put them in a list and pass that list to `pd.concat`. Let's take a look at both, then concatenate them to see what we get:

In [None]:
starWarsDF

In [None]:
secondPageDataframe

In [None]:
dataFrameList = [ starWarsDF, secondPageDataframe ]

In [None]:
superList = pd.concat(dataFrameList)

In [None]:
superList

We've now got two dataframes combined into one! This means we can do the same for every single page! Before we move on to the rest, though, look closely at the index of the new dataframe: each original dataframe's indices are repeated. It's not ideal to have multiple rows with the same index, so let's renumber them with the method `reset_index`:

In [None]:
superList.reset_index()

We've reset our index, however, notice that we've stored the old index in our dataframe now. It's not always the case that you'll want to overwrite the dataframe's indices, so you might want to keep them, but for our sake, we do want to overwrite, so let's give it a go with the parameter `drop` set to `True`: 

In [None]:
resetIndexDataFrame = superList.reset_index(drop=True)
resetIndexDataFrame
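
The concatenate-then-reset pair can also be done in one call with `pd.concat(..., ignore_index=True)`. A minimal sketch, using two tiny made-up frames in place of the real pages:

```python
import pandas as pd

first_page = pd.DataFrame({"name": ["A", "B"]})
second_page = pd.DataFrame({"name": ["C", "D"]})

# Without ignore_index, each frame keeps its own row labels.
stacked = pd.concat([first_page, second_page])

# With ignore_index=True, the result is renumbered from 0.
renumbered = pd.concat([first_page, second_page], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```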

Now the index runs from `0` to `n-1` again. We could keep doing this by hand for every API call, but that's not efficient, so let's automate the tedious part! Let's create a function that follows the `next` links and makes consecutive calls, collecting a list of dataframes that it then concatenates and reindexes for us:

In [None]:
def makeTheCallsAndReturnTheWholeList():
    nextURI = "https://swapi.co/api/people/"
    dataFrameList = list()

    # keep requesting pages until the API stops giving us a "next" link
    while nextURI:
        response = requests.get(nextURI)
        jsonResponse = response.json()

        nextURI = jsonResponse['next']
        jsonBody = jsonResponse["results"]

        dataFrame = DataFrame(jsonBody)
        dataFrameList.append(dataFrame)

    # stack every page into one dataframe and renumber the rows
    wholeDataFrame = pd.concat(dataFrameList)
    resetIndexDF = wholeDataFrame.reset_index(drop=True)
    return resetIndexDF

In [None]:
starWarsDataFrame = makeTheCallsAndReturnTheWholeList()

Let's confirm that we've actually got all of our characters. We know from the jsonResponse that we're supposed to have 87, so let's check with `len()`:

In [None]:
len(starWarsDataFrame)

Before we go any further, let's also double check that we've made the right calls and that we've got a unique list of star wars characters! 

In [None]:
uniqueNames = starWarsDataFrame['name'].unique()
len(uniqueNames)
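
Besides `unique()`, pandas has a few helpers aimed at exactly this kind of check. The sketch below uses a short made-up list of names with one deliberate repeat:

```python
import pandas as pd

names = pd.Series(["Luke Skywalker", "C-3PO", "R2-D2", "C-3PO"])  # one deliberate repeat

all_distinct = names.is_unique           # True only if no value repeats
distinct_count = names.nunique()         # number of distinct values
repeats = names[names.duplicated()]      # second and later occurrences only
deduped = names.drop_duplicates()        # keeps the first occurrence of each

print(all_distinct, distinct_count)
print(repeats)
print(deduped)
```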

Now, before we move on, let's take a moment to refactor some code. Our function above is great, but it's very likely we'll want to reuse parts of it later on, so let's refactor! Our function originally looked like:

```python
def makeTheCallsAndReturnTheWholeList():
    nextURI = "https://swapi.co/api/people/"
    dataFrameList = list()

    # keep requesting pages until the API stops giving us a "next" link
    while nextURI:
        response = requests.get(nextURI)
        jsonResponse = response.json()

        nextURI = jsonResponse['next']
        jsonBody = jsonResponse["results"]

        dataFrame = DataFrame(jsonBody)
        dataFrameList.append(dataFrame)

    # stack every page into one dataframe and renumber the rows
    wholeDataFrame = pd.concat(dataFrameList)
    resetIndexDF = wholeDataFrame.reset_index(drop=True)
    return resetIndexDF
```

There are two things our function is doing:

1. Making a call and getting a body of information back
2. Concatenating the list of dataframes together

Let's start with making the call and retrieving the JSON body and the next URI:

In [None]:
def getResponseBodyAndNext(URI):
    response = requests.get(URI)
    jsonResponse = response.json()

    nextURI = jsonResponse['next']
    jsonBody = jsonResponse["results"]
    
    return (jsonBody, nextURI)

Now let's try to see if this function works: 

In [None]:
(filmList, nextURI) = getResponseBodyAndNext('https://swapi.co/api/films')

In [None]:
filmList

Now that we know our first function works, let's redefine our function so it looks a little cleaner: 

In [None]:
def makeTheCallsAndReturnTheWholeList(nextURI):
    dataFrameList = list()

    while nextURI:
        (jsonBody, nextURI) = getResponseBodyAndNext(nextURI)
        dataFrame = DataFrame(jsonBody)
        dataFrameList.append(dataFrame)

    wholeDataFrame = pd.concat(dataFrameList)
    resetIndexDF = wholeDataFrame.reset_index(drop=True)
    return resetIndexDF

Let's double check to see if our function works: 

In [None]:
starWarsPeopleTest = makeTheCallsAndReturnTheWholeList("https://swapi.co/api/people/")
starWarsPeopleTest
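
One benefit of splitting the fetching from the looping is that the loop can be tried out without touching the network. The sketch below is a variation on the function above, not a replacement for it: `FAKE_PAGES`, `fake_fetch`, and `collect_pages` are hypothetical names, and the fetch function is passed in as a parameter so a fake one can stand in for the real API call.

```python
import pandas as pd

# Hypothetical stand-in for the API: each "URL" maps to one page of results.
FAKE_PAGES = {
    "page1": {"next": "page2", "results": [{"name": "A"}, {"name": "B"}]},
    "page2": {"next": None, "results": [{"name": "C"}]},
}

def fake_fetch(uri):
    """Return (body, next_uri), the same shape getResponseBodyAndNext returns."""
    page = FAKE_PAGES[uri]
    return page["results"], page["next"]

def collect_pages(first_uri, fetch):
    """Follow 'next' links until there are none, then stack the pages."""
    frames = []
    next_uri = first_uri
    while next_uri:
        body, next_uri = fetch(next_uri)
        frames.append(pd.DataFrame(body))
    return pd.concat(frames, ignore_index=True)

everyone = collect_pages("page1", fake_fetch)
print(everyone)
```

Passing `getResponseBodyAndNext` as the `fetch` argument would, under these assumptions, give the same behaviour as the refactored function.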

Now we've got our setup done, let's take a look at what's in our dataset: 

In [None]:
starWarsDataFrame.columns

You won't always want all of the data that come from an API (for our purposes, we don't even care about metadata such as `created` and `edited`), so let's drop a couple of columns. The `axis` parameter says which kind of label we're dropping: `axis=1` drops columns, while `axis=0` (the default) drops rows:

In [None]:
starWarsDF.drop(['created'], axis=1)

You might think that, since we've called `drop`, the column is now gone from the dataframe. It isn't. Pandas methods typically don't mutate in place; they return a new object and leave the original alone. You can work with this in multiple ways:

1. You can create a brand new dataframe (recommended)
2. You can use the inplace parameter. 
3. You can self assign the dataframe after the function call (but this is not recommended)

If you are going to mutate your dataframe, I'd recommend using the inplace modifier. 

Two things of note below: 
1. Notice that instead of one column, we're now sending in two? The column argument is a list and can take as many columns as you want to put in.
2. We're going to set the `inplace` parameter to `True`, so that once we call the function, our columns are dropped from `starWarsDF` itself.

In [None]:
starWarsDF.drop(['created', 'edited'], axis=1, inplace=True)

In [None]:
starWarsDF.head()
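
To see the "new dataframe" option side by side with the original, here is a small sketch on a made-up frame (the names and dates are invented). It also uses the `columns=` keyword, which is an equivalent way of writing `axis=1`:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["A", "B"],
    "created": ["2014-12-09", "2014-12-10"],
    "edited": ["2014-12-20", "2014-12-21"],
})

# drop returns a new dataframe; columns=[...] means the same as axis=1 with a list
trimmed = people.drop(columns=["created", "edited"])

print(list(people.columns))   # the original still has all three columns
print(list(trimmed.columns))  # only 'name' is left in the new frame
```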

Now, we've removed the metadata for the dataframe, let's take a look at some of our other data. Notice that some of our data are actually urls: 

In [None]:
starWarsDF[['films', 'homeworld', 'species', 'starships', 'vehicles']]

In order to work with this data, we'll need an example of each: 

In [None]:
filmUrl = starWarsDataFrame['films'][0][0]
filmUrl
getResponseBodyAndNext(filmUrl)

Looks like we've run into a potential problem. Not every response from the API will include a `next` key. In many languages, a missing key just comes back as null; Python, however, raises a `KeyError` if we index a dictionary with a key that isn't there. Let's iterate on our function and use a library called pydash (very similar to lodash for JavaScript).

In order to get pydash, we'll have to install it (it is not part of the conda install). To properly install it (if you have anaconda installed), write: 

`conda install -c conda-forge pydash`

If you have not installed anaconda, you can install pydash by typing: 

`pip install pydash`


Once finished installing pydash, you can import it like any other package. For our purposes, however, we'll only be using the `get` method: 

In [None]:
from pydash import get

pydash's `get` lets us safely retrieve data. If the data does not exist, it returns a default value that you specify instead of crashing your program:

In [None]:
def getResponseBodyAndNext(URI):
    response = requests.get(URI)
    jsonResponse = response.json()

    nextURI = get(jsonResponse, 'next', None) # safely get next
    jsonBody = jsonResponse["results"]
    
    return (jsonBody, nextURI)

Now let's try to safely run our data: 

In [None]:
filmUrl = starWarsDataFrame['films'][0][0]
print(filmUrl)
getResponseBodyAndNext(filmUrl)

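Now we're seeing that not every response wraps its data in a `results` list: a single film comes back as one plain dictionary. We can make our function a bit more robust by only pulling out `results` when it is there, and returning the whole response otherwise: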

In [None]:
def getResponseBodyAndNext(URI):
    response = requests.get(URI)
    jsonResponse = response.json()

    nextURI = get(jsonResponse, 'next', None) # safely get next
    # use the 'results' list when present, otherwise fall back to the whole response
    jsonBody = get(jsonResponse, 'results', jsonResponse)
    
    return (jsonBody, nextURI)

In [None]:
getResponseBodyAndNext(filmUrl)
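
The fallback logic can be checked without any network call. The sketch below is an assumption-labelled variant: `split_response` is a hypothetical helper that takes an already-parsed response, and it uses the built-in `dict.get` rather than pydash's `get`, which is enough for top-level keys like these. Both sample responses are made up.

```python
def split_response(json_response):
    """Return (body, next_uri) for paginated and single-item responses alike."""
    next_uri = json_response.get("next")                 # None when the key is absent
    body = json_response.get("results", json_response)   # fall back to the whole dict
    return body, next_uri

paginated = {"next": "page2", "results": [{"name": "A"}]}
single = {"title": "Some Film", "episode_id": 4}

print(split_response(paginated))
print(split_response(single))
```

pydash's `get` earns its keep with nested paths; for a single top-level key, either approach behaves the same way.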

Now let's try it with the rest of our data: 

In [None]:
homeworldUrl = starWarsDataFrame['homeworld'][0]
getResponseBodyAndNext(homeworldUrl)

In [None]:
speciesUrl = starWarsDataFrame['species'][0][0]
getResponseBodyAndNext(speciesUrl)

In [None]:
starshipsUrl =  starWarsDataFrame['starships'][0][0]
getResponseBodyAndNext(starshipsUrl)

In [None]:
vehiclesUrl = starWarsDataFrame['vehicles'][0][0]
getResponseBodyAndNext(vehiclesUrl)