# Pandas Introduction


## What is Pandas?
- Library for manipulating tables of data
- Primarily used for cleaning and restructuring data in preparation for plotting and modeling
- 2 primary data structures
    - Series - 1D, columns of data
    - DataFrames - 2D, tables of data
- Columnar
    - Most operations are designed to operate on columns of data, not individual elements or rows
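
As a minimal sketch of the two data structures and the columnar style, the following uses a small made-up table (the column names `length` and `width` are hypothetical, not taken from the IRIS file used below). The multiplication acts on whole columns at once, with no explicit loop over rows.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({"length": [1.0, 2.0, 3.0], "width": [0.5, 1.5, 2.5]})

col = df["length"]                  # one column of a DataFrame is a Series (1D)
area = df["length"] * df["width"]   # element-wise over whole columns

print(type(df).__name__, type(col).__name__)
print(area.tolist())
```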

In [None]:
import matplotlib.pyplot as plt
import sklearn.ensemble as mdl
import pandas as pd
import numpy as np
datapath = '/data/cs2300/examples/IRIS.csv'

## Caveats
- Pandas offers multiple ways to do things. Some ways are newer and have learned from the mistakes of the old ways. This can be confusing and frustrating.
- Pandas documentation is complex and not well organized
- It can be difficult to predict whether an operation returns a copy or a view of the data, which makes optimization challenging
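
One way to sidestep the copy-versus-view question, when you need an independent object, is to ask for a copy explicitly. The sketch below uses a made-up one-column DataFrame; it only illustrates that changes to an explicit copy do not reach the original, and says nothing about when Pandas makes copies on its own.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
original = pd.DataFrame({"a": [1, 2, 3]})

independent = original.copy()     # explicit copy of the data
independent.loc[0, "a"] = 100     # modify the copy only

print(original.loc[0, "a"])       # the original value is untouched
print(independent.loc[0, "a"])
```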

## Creating DataFrames
- Read from a csv file

In [None]:
df1 = pd.read_csv(datapath)

- Show the first 5 rows of the DataFrame

In [None]:
df1.head()
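
If the data file is not available, `read_csv` can be tried on an in-memory string instead. The rows below are made up for illustration and only borrow column names that appear later in this notebook; they are not the contents of the real IRIS.csv.

```python
import io
import pandas as pd

# Made-up rows, NOT the actual IRIS.csv contents
csv_text = """sepal_length,sepal_width,species
5.1,3.5,Iris-setosa
7.0,3.2,Iris-versicolor
6.3,3.3,Iris-virginica
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.head())
print(demo.dtypes)
```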

- From existing lists, NumPy arrays, or Series

In [None]:
df2 = pd.DataFrame( {"column1" : [0.0, 1.0, 2.0],
                    "column2" : np.random.randint(10,size = (3)),
                    "column3" : df1["species"][0:3] } )
df2.head()

## Investigating DataFrames
- There are multiple functions to investigate existing DataFrames

In [None]:
df1.head(10)

In [None]:
df1.dtypes

In [None]:
df1.shape

In [None]:
df1.info()

In [None]:
help(df1.info)

## Indexing / Selecting / Slicing Columns
- Pandas has multiple ways to index. Brackets with a column name select a column (a Series); a slice such as [0:2] then selects rows by position

In [None]:
df1.head(1)

In [None]:
df1["sepal_length"][0:2]

Or another way...

In [None]:
df1.sepal_length[0:2]

In [None]:
df1[["sepal_length","species"]][0:5]

## Indexing
- You can index by position (numerical index) with iloc. This follows the NumPy pattern of row, then column:

In [None]:
df1.iloc[5]
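
A short sketch of the row-then-column pattern, using a made-up DataFrame (the column names `x` and `y` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({"x": [10, 20, 30, 40], "y": ["a", "b", "c", "d"]})

row = demo.iloc[1]             # the row at position 1, returned as a Series
cell = demo.iloc[1, 0]         # row 1, column 0: a single value
block = demo.iloc[0:2, 0:1]    # first two rows, first column: still a DataFrame

print(row)
print(cell)
print(block)
```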

## Creating a New Column
- The simplest way to create a new column:

In [None]:
extra_col = np.random.randint(2,size=(150))

In [None]:
df1["Is_pretty"] = extra_col==1
df1.head()

- The assign() method is also common: it returns a new DataFrame, so it can be used with method chaining:

In [None]:
new_df = df1.assign(Smells_bad = np.ones(150)==1)
new_df.head()
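
To show what method chaining can look like, here is a minimal sketch on made-up data (the `length`, `width`, and `area` names are hypothetical). Each method returns a new DataFrame, so the calls can be strung together, and the starting DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({"length": [4.0, 6.0, 5.0], "width": [2.0, 3.0, 1.0]})

result = (
    demo
    .assign(area=lambda d: d["length"] * d["width"])   # add a derived column
    .sort_values(by="area", ascending=False)           # largest area first
    .head(2)                                           # keep the top two rows
)

print(result)
print(list(demo.columns))   # demo itself has no "area" column
```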

## Modifying a column
- Convert data types - may need to specify a function for parsing / conversion
- Cleaning data
- Extracting fields from complex types
    - e.g., hour, month, etc... from date times

1) Get the Series for the column of interest

In [None]:
column = new_df["Smells_bad"]

2) Use the map() method to apply a function to each element in the Series and return a new Series

In [None]:
converted = column.map(lambda s: (not s))
converted.head()

3) Then update the df, either by adding a new column or overwriting the original column

In [None]:
df1["Smells_bad"] = converted
df1.head()
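
The IRIS data has no strings to parse or date times, so the sketch below uses a made-up DataFrame (the `count` and `when` columns are hypothetical) to illustrate type conversion and extracting fields from date times. It uses `astype`, `pd.to_datetime`, and the `.dt` accessor rather than `map()`; this is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({
    "count": ["1", "2", "3"],
    "when": ["2021-01-15 08:30:00", "2021-06-01 17:45:00", "2021-12-31 23:59:00"],
})

demo["count"] = demo["count"].astype(int)     # strings -> integers
demo["when"] = pd.to_datetime(demo["when"])   # strings -> datetimes
demo["hour"] = demo["when"].dt.hour           # extract a field into a new column
demo["month"] = demo["when"].dt.month

print(demo)
```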

## Dropping a Column
- I prefer to use the drop() method because it returns a DataFrame object, so it works with chaining:

In [None]:
new_df = df1.drop(columns=["Smells_bad"])

- You might also see this format

In [None]:
df1.head()

In [None]:
del df1["Smells_bad"]
df1.head()

## Filtering
We can apply boolean indexing to filter our DataFrame.

In [None]:
df1_filtered = df1[df1['sepal_length'] > 5]
df1_filtered.head()
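
Conditions can also be combined. The sketch below uses made-up rows (the values and short species names are hypothetical) and the element-wise operators `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({
    "sepal_length": [4.9, 5.8, 6.4, 5.1],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

both = demo[(demo["sepal_length"] > 5) & (demo["species"] == "setosa")]
either = demo[(demo["sepal_length"] > 6) | (demo["species"] == "setosa")]
not_setosa = demo[~(demo["species"] == "setosa")]

print(both)
print(either)
print(not_setosa)
```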

We can also use string operations, available through the .str accessor, to filter rows based on string properties. The unique() method then lists the distinct values in a column.

In [None]:
df1_filtered2 = df1[df1['species'].str.len() > 11]
print(df1_filtered2.species.unique())
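
For counting rather than listing distinct values, `nunique()` and `value_counts()` are common companions to `unique()`. A minimal sketch on a made-up Series (the species names here are shortened, hypothetical values):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
species = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

print(species.unique())         # the distinct values
print(species.nunique())        # how many distinct values there are
print(species.value_counts())   # how often each value occurs

long_names = species[species.str.len() > 6]   # string-length filter
print(long_names.tolist())
```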

Notice in the preceding cell that the second line, with the unique() call, uses attribute access (df1_filtered2.species) rather than brackets to refer to the column. This works only when the column name is a valid Python identifier (for example, it has no spaces) and does not clash with an existing DataFrame attribute or method. This is a strong reason to avoid using spaces in your column names.

You can select multiple columns using double brackets or a single column with a single bracket. If you select a single column with a single bracket, the return type will be a Series (not a DataFrame).

In [None]:
df1[['sepal_length', 'sepal_width']]
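
The difference in return types can be checked directly. A small sketch with a made-up DataFrame (column names `a` and `b` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

single = demo["a"]      # single brackets: a Series
double = demo[["a"]]    # double brackets: a one-column DataFrame

print(type(single).__name__)
print(type(double).__name__)
```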

We can sort a DataFrame with a simple method call.  You should add a more complex sort with multiple columns where some are ascending and some are descending.  

In [None]:
df1_sorted = df1.sort_values(by = 'sepal_length')
df1_sorted.head(20)
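
As one possible starting point for a multi-column sort, `sort_values` accepts a list of columns and a matching list of ascending flags. The sketch below uses made-up rows (the single-letter species labels are hypothetical) and sorts species ascending, then sepal length descending within each species.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({
    "species": ["b", "a", "b", "a"],
    "sepal_length": [5.0, 6.0, 7.0, 4.0],
})

ordered = demo.sort_values(
    by=["species", "sepal_length"],
    ascending=[True, False],   # one flag per column in `by`
)
print(ordered)
```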

You can also call methods that will provide basic descriptive statistics on a dataframe using simple method calls.  Add a few in the following cell.  

In [None]:
df1.skew(numeric_only=True)  # skip non-numeric columns such as species
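
A few other descriptive statistics, sketched on a made-up DataFrame (the `x` and `label` columns are hypothetical). Passing `numeric_only=True` keeps text columns out of the calculation.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "label": ["a", "b", "c", "d"]})

summary = demo.describe()             # count, mean, std, min, quartiles, max
means = demo.mean(numeric_only=True)  # per-column mean, numeric columns only
median_x = demo["x"].median()         # statistic of a single column

print(summary)
print(means)
print(median_x)
```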

That's it! Except, there is still a lot more to learn about DataFrames. A good place to start is the official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html