# Pandas

In this section, we will learn about the Pandas Python library and its main data structure: the DataFrame. DataFrames are similar to Excel spreadsheets in that the data is easy to manipulate. However, a major advantage of DataFrames over Excel is that complex aggregations and plotting are easier. Doing the same things in Excel would typically require learning VBA, a less common language with fewer learning resources than Python.

Checklist:

    1) Series
    2) DataFrames
    3) DataFrame Indexing
    4) Selecting sub DataFrames based on conditions (.iloc and .loc)
    5) Missing Data / NaN values
    6) Group By
    7) Combining DataFrames (Concat, Merge, Join)
    8) Built-in Methods and Applying Functions
___

# 0) Pip Install Pandas

As when we worked with NumPy, we need to install the Pandas library if you don't already have it in your kernel/environment.

In [1]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.



# 1) Series

A Series is similar to a NumPy array (it's actually built on NumPy arrays), but instead of being just an array of numbers, it lets you associate each value with an index label.

In [3]:
# first we will import both numpy and pandas
import numpy as np
import pandas as pd

Let's create our first Series data structure

In [8]:
labels = ['a','b','c']
numbers = [10,20,30]

Using a list as the input (the default index is the integers 0, 1, 2, ...)

In [12]:
pd.Series(numbers)

0    10
1    20
2    30
dtype: int64

Adding labels to it

In [14]:
pd.Series(numbers, index=labels)

a    10
b    20
c    30
dtype: int64

Using a NumPy array as the input (the Series keeps the array's dtype, which for integers depends on the platform)

In [15]:
pd.Series(np.array(numbers), index=labels)

a    10
b    20
c    30
dtype: int32

Using a dictionary as the input

In [16]:
my_dict = {"a": 10, "b":20, "c":30}

pd.Series(my_dict)

a    10
b    20
c    30
dtype: int64
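As a small aside on the dictionary case: the keys become the index labels, and an explicit `index` can also be passed to pick and order the keys. The following is a minimal sketch; the label `"d"` is a hypothetical key that is not in `my_dict`, included only to show what happens to a label with no matching value.

```python
import pandas as pd

my_dict = {"a": 10, "b": 20, "c": 30}

# Keys become the index labels, values become the data.
from_dict = pd.Series(my_dict)

# An explicit index selects and orders the keys. "d" is a hypothetical
# label that is not in my_dict, so its value is missing (NaN).
subset = pd.Series(my_dict, index=["c", "a", "d"])

print(from_dict)
print(subset)
```

Because NaN is a floating-point value, `subset` is stored as floats rather than integers. Missing data is covered later in this section.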

The index identifies and labels your data so that it is easier to retrieve (similar to the keys of a dictionary).

In [18]:
series = pd.Series([27, 18, 28, 5], index=['USA', 'Germany', 'Italy', 'Japan'])
series

USA        27
Germany    18
Italy      28
Japan       5
dtype: int64

With a Series, we can easily get the value for a desired index label

In [20]:
series['Japan']

5
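Beyond the square-bracket lookup, a Series can be queried by label or by position. The sketch below rebuilds the same `series` so it runs on its own; the variable names and the `"France"` label are illustrative choices, not part of the original notebook.

```python
import pandas as pd

series = pd.Series([27, 18, 28, 5], index=["USA", "Germany", "Italy", "Japan"])

by_label = series.loc["Japan"]       # label-based lookup
by_position = series.iloc[3]         # position-based lookup (fourth element)
several = series[["USA", "Italy"]]   # a list of labels returns a new Series
has_france = "France" in series      # membership tests the index, like dict keys

print(by_label, by_position, has_france)
print(several)
```

Using `.loc` and `.iloc` makes it explicit whether a lookup is by label or by position, which helps when the index labels are themselves integers.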

We can also do math on the Series; the operation is applied to every value

In [21]:
series*2

USA        54
Germany    36
Italy      56
Japan      10
dtype: int64
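Arithmetic between two Series matches values by index label rather than by position. The following is a minimal sketch of that behavior; `other` and its values are hypothetical data made up for illustration.

```python
import pandas as pd

series = pd.Series([27, 18, 28, 5], index=["USA", "Germany", "Italy", "Japan"])

# Hypothetical second Series with a different order and a label ("France")
# that does not appear in `series`.
other = pd.Series([1, 2, 3], index=["Japan", "USA", "France"])

# Values are matched by label; labels present in only one Series give NaN.
total = series + other

# Series.add with fill_value=0 treats a label missing on one side as 0.
filled = series.add(other, fill_value=0)

print(total)
print(filled)
```

Labels that appear in only one of the two Series produce NaN in `total`, which is why the result is stored as floats.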

# 2) DataFrames

DataFrames are arguably the most important concept in the Pandas library. A DataFrame is made up of Series that share the same index. It's similar to how you would lay out tabular data, where the index might be the time step or the category of the data.

In [24]:
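# named `data` rather than `input` to avoid shadowing Python's built-in input()
data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]

# rows are labeled A, B, C and columns are labeled X, Y, Z
df = pd.DataFrame(data, index="A B C".split(), columns="X Y Z".split())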

df

   X  Y  Z
A  1  2  3
B  4  5  6
C  7  8  9
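To make the "Series that share the same index" idea concrete, the same table can be assembled from three Series, one per column. This is only a sketch of one way to build it; the names `x`, `y`, `z`, and `col` are illustrative.

```python
import pandas as pd

index = ["A", "B", "C"]

# One Series per column, all sharing the same index labels.
x = pd.Series([1, 4, 7], index=index)
y = pd.Series([2, 5, 8], index=index)
z = pd.Series([3, 6, 9], index=index)

# Dictionary keys become the column names.
df = pd.DataFrame({"X": x, "Y": y, "Z": z})

# Selecting a single column gives back a Series with the DataFrame's index.
col = df["X"]

print(df)
print(type(col))
print(col.index.equals(df.index))
```

Selecting a column returns a Series again, so everything shown earlier for Series (label lookup, arithmetic) applies to each column of a DataFrame.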
