# Introduction to Pandas

Pandas is one of the most widely used libraries for data analysis. It provides robust tools for all of the data manipulation tasks we will cover, including transforming, filtering, reshaping, and aggregating data.

In pandas, your primary canvas of work will be the **DataFrame**, a two-dimensional table with rows and columns. In a later section of this chapter we will explore the inner workings of the pandas DataFrame, but for now we will get started with some data so we can explore some crucial data analysis tools.


Let's import pandas and NumPy:

In [9]:
import numpy as np
import pandas as pd

## Basic data structures in pandas

Pandas provides two general classes for handling data:

1. `Series`: a **one-dimensional labeled array holding data of any type**, such as integers, strings, etc.
2. `DataFrame`: a **two-dimensional** data structure that holds data like a two-dimensional array or a table with rows and columns.
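To make the distinction between the two classes concrete, here is a minimal sketch. The variable names, labels, and values are made up for illustration; the point is only that a `Series` has one dimension, a `DataFrame` has two, and selecting a single column of a `DataFrame` gives back a `Series`.

```python
import pandas as pd

# A Series is one-dimensional: a sequence of values plus an index of labels.
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"], name="price")

# A DataFrame is two-dimensional: rows and named columns sharing one index.
table = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "qty": [10, 20, 30]},
    index=["a", "b", "c"],
)

print(prices.ndim, table.ndim)   # number of dimensions of each object
print(type(table["price"]))      # a single column is itself a Series
```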

## Object creation

We can create a `Series` by passing a list of values:

In [10]:
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Note that the `dtype` of the series is `float64` and pandas has automatically created a `RangeIndex` to index the data, starting at 0.
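The default index can be inspected directly, and it can be replaced by labels of our own. The sketch below assumes nothing beyond the imports above; the labels `"x"`, `"y"`, and `"z"` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Without an explicit index, pandas builds a RangeIndex starting at 0.
s_default = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s_default.index)

# Passing index= supplies custom labels instead (hypothetical labels).
s_labeled = pd.Series([1, 3, 5], index=["x", "y", "z"])
print(s_labeled["y"])  # look up a value by its label
```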

Contrast this construction with the following:

In [11]:
s = pd.Series([1,3,5,6,8])
s

0    1
1    3
2    5
3    6
4    8
dtype: int64

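In this case the dtype is `int64`, because every value is an integer and nothing is missing. The first Series was stored as `float64` because, with pandas' default NumPy-backed dtypes, an integer Series cannot hold `NaN`: `np.nan` is represented as a floating point number under the hood, so the whole Series is upcast to float.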
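The upcast to float applies to the default dtypes. As an aside, pandas also provides a nullable integer dtype, requested with the string `"Int64"` (capital `I`), that stores missing values as `pd.NA` and keeps the remaining values as integers. The following is a small sketch of the difference, not a recommendation for when to use it.

```python
import numpy as np
import pandas as pd

# Default behavior: a missing value forces the Series to float64.
as_float = pd.Series([1, 3, np.nan])

# Opting in to the nullable integer dtype keeps integers and marks the gap.
as_nullable = pd.Series([1, 3, None], dtype="Int64")

print(as_float.dtype)
print(as_nullable.dtype)
print(as_nullable.isna().sum())  # count of missing entries
```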

There are a number of ways to construct a pandas `DataFrame`. First, we can directly convert a two-dimensional NumPy array to a `DataFrame` object.

In [12]:
dat = pd.DataFrame(np.random.randn(6,4), columns = [f"col_{i}" for i in range(4)])
dat

|   | col_0 | col_1 | col_2 | col_3 |
|---|---|---|---|---|
| 0 | -1.009849 | 0.304862 | 1.432951 | 0.181906 |
| 1 | 1.792727 | 2.49668 | -0.840773 | -0.846036 |
| 2 | -0.553669 | 0.065429 | 0.360896 | 2.073973 |
| 3 | 0.172146 | -1.953964 | -0.760879 | 1.389976 |
| 4 | 0.578897 | -0.165676 | -0.215204 | -0.637173 |
| 5 | 0.124783 | -0.123609 | 0.827657 | -0.907289 |
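Because the values come from a random number generator, the numbers above will differ on every run. If reproducible output is wanted, one option is NumPy's `default_rng` with a fixed seed. This is a sketch of that variation rather than the construction used above, and the seed value `0` is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)       # arbitrary seed for repeatable draws
values = rng.standard_normal((6, 4))      # 6 rows, 4 columns of normal draws

# The array's rows and columns map directly onto the DataFrame's.
frame = pd.DataFrame(values, columns=[f"col_{i}" for i in range(4)])

print(frame.shape)
print(list(frame.columns))
```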


Let's do the same thing, but this time add our own date index:

In [13]:
dates = pd.date_range(start="2023-01-01", periods=6)
dat = pd.DataFrame(np.random.randn(6,4), index = dates, columns = [f"col_{i}" for i in range(4)])
dat

|   | col_0 | col_1 | col_2 | col_3 |
|---|---|---|---|---|
| 2023-01-01 | 0.86212 | -0.201652 | -0.405356 | -0.360163 |
| 2023-01-02 | -0.70664 | 1.758638 | 1.090322 | -0.629024 |
| 2023-01-03 | 0.789678 | -0.063033 | 0.194267 | 0.747253 |
| 2023-01-04 | -0.193173 | -0.403081 | 0.019604 | -0.4196 |
| 2023-01-05 | -0.160241 | -1.419203 | -1.02915 | 0.863046 |
| 2023-01-06 | -0.003435 | 0.362646 | 1.568392 | -1.129814 |


We have now assigned an index to the data. As we will see in later sections, the index lets us slice and subset the data quickly.

We can also create a `DataFrame` more explicitly by passing a dictionary of objects, where the keys are the column labels and the values are the column values:
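As a brief preview of what a date index makes possible, the sketch below selects rows by date using `.loc`. It uses a small frame of consecutive integers (an assumption made so the result is easy to check) instead of random values. Note that label-based slices include both endpoints.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2023-01-01", periods=6)
frame = pd.DataFrame(
    np.arange(24).reshape(6, 4),          # integers 0..23 instead of random data
    index=dates,
    columns=[f"col_{i}" for i in range(4)],
)

# A single date label returns that row as a Series.
one_day = frame.loc[pd.Timestamp("2023-01-03")]

# A label slice returns every row from the start date through the end date.
window = frame.loc["2023-01-02":"2023-01-04"]

print(one_day)
print(window)
```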

In [14]:
df = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1),
    "D": [1,2,3,4],
    "E": ["This", "Is", "A", "Dataframe"],
    "F": ["Foo"]
}, index=list(range(4)))
df

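|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 1 | This | Foo |
| 1 | 1.0 | 2013-01-02 | NaN | 2 | Is | Foo |
| 2 | 1.0 | 2013-01-02 | NaN | 3 | A | Foo |
| 3 | 1.0 | 2013-01-02 | NaN | 4 | Dataframe | Foo |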


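Because we passed an index of length 4, scalar values (and the single-element list in column `F`) are repeated to match the index length. The `Series` in column `C` is instead aligned on the index: it only has a value for label 0, so the remaining rows are `NaN`. The columns of the resulting `DataFrame` have different `dtypes`: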

In [15]:
df.dtypes

A           float64
B    datetime64[ns]
C           float64
D             int64
E            object
F            object
dtype: object
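The difference between a repeated scalar and an index-aligned `Series` is easy to miss, so here is a smaller sketch that isolates it. The column names `scalar` and `aligned` and the values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "scalar": 1.0,                                  # repeated for every row
        "aligned": pd.Series([10, 20], index=[0, 2]),   # matched on index labels
    },
    index=[0, 1, 2, 3],
)

# Rows 1 and 3 have no matching label in the Series, so they are missing,
# and the missing values force the column to a float dtype.
print(frame)
print(frame.dtypes)
```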

We can also create a `DataFrame` from a list of lists, where each inner list becomes one row:

In [16]:
l1 = [1,'foo',3,4]
l2 = [1,'bar',5,7]
l3 = [2,'fizz',6,8]

all_l = [l1,l2,l3]

pd.DataFrame.from_records(all_l)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | foo | 3 | 4 |
| 1 | 1 | bar | 5 | 7 |
| 2 | 2 | fizz | 6 | 8 |
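Since a list of lists carries no column labels, pandas falls back to the integers 0 through 3. A minimal sketch of supplying names through the `columns` argument follows. The names `group`, `label`, `x`, and `y` are hypothetical; the plain `pd.DataFrame` constructor accepts the same input.

```python
import pandas as pd

rows = [
    [1, "foo", 3, 4],
    [1, "bar", 5, 7],
    [2, "fizz", 6, 8],
]

# Hypothetical column names, chosen only for illustration.
names = ["group", "label", "x", "y"]

named = pd.DataFrame.from_records(rows, columns=names)
also_named = pd.DataFrame(rows, columns=names)  # constructor takes rows too

print(named)
```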
