# Pandas

- pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
- the most popular Python library for data analysis
- the name pandas is derived from "*panel data*", an econometrics term for multidimensional structured data sets
- [Getting Started with Pandas - Pandas Documentation Page](https://pandas.pydata.org/docs/getting_started/index.html)

In [None]:
# import the pandas library
import pandas as pd 

- There are two core objects in pandas: 
  - **DataFrame** 
  - **Series**

# Creating Data

### DataFrame Creation

- a DataFrame is a table 
  - it contains an array of individual entries 
  - each entry corresponds to a row and a column 


- use the `pd.DataFrame()` constructor to generate these DataFrame objects

- to declare a new DataFrame, pass a dictionary whose keys are the column names and whose values are lists of entries

- this is the standard way of constructing a new DataFrame, and the one you are most likely to encounter

In [None]:
# create a dictionary
new_dictionary = {'Day': [50, 21], 'Night': [131, 2]}

# create DataFrame from dictionary
pd.DataFrame(new_dictionary)

In [None]:
# create a dictionary
new_dictionary = {'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']}

# create DataFrame from dictionary
pd.DataFrame(new_dictionary)

- the dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels
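
- to use custom row labels instead of the default count, the constructor accepts an `index` parameter; a minimal sketch, where the labels `'Product A'` and `'Product B'` and the name `reviews` are made up for illustration:

```python
import pandas as pd

# same columns as the cell above; the row labels are hypothetical
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],  # row labels replace the default 0, 1
)

# the row labels are stored in the DataFrame's index
row_labels = list(reviews.index)
print(row_labels)
```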

### Series Creation

- a Series is a sequence of data values
  - if a DataFrame is a table, a Series is a list

- you can create one with nothing more than a list
  - a Series is, in essence, a single column of a DataFrame

In [None]:
# create a series from a python list
pd.Series([1, 2, 3, 4, 5])

- you can assign row labels to a Series with the `index` parameter, just as a DataFrame has row labels

- however, a Series does not have column names; it only has one overall `name`

In [None]:
# create a series from a python list 
# add row index labels and series name
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

- the Series and the DataFrame are intimately related

- it's helpful to think of a DataFrame as actually being just a bunch of Series "glued together"

![Dataframe and Series](https://i.imgur.com/MQCBcpLl.jpg)
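
- the "glued together" picture can be sketched in code; this is only an illustration, assuming two small Series named `day` and `night` (made-up names, values reused from the first DataFrame example) and using `pd.concat` as one possible way to join them:

```python
import pandas as pd

# two Series that share the same default row labels (0, 1)
day = pd.Series([50, 21], name='Day')
night = pd.Series([131, 2], name='Night')

# glue the Series side by side; each Series name becomes a column label
table = pd.concat([day, night], axis=1)

# going the other way, selecting one column gives back a Series
column = table['Night']
print(type(column).__name__, column.name)
```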

# Reading data files

- we are reading the `train.csv` file from the Kaggle housing price dataset

- find the data files [here](https://www.kaggle.com/c/home-data-for-ml-course/data)
  - find the "Download All" button and click it to download

In [None]:
# read the csv from drive (from google drive in this case)
data = pd.read_csv('/content/drive/My Drive/Datasets/home-data-for-ml-course/train.csv')
# add your own path above to read the train.csv file

In [None]:
# check data head (first five rows) of the DataFrame
data.head()

In [None]:
# check data tail (last five rows) of the DataFrame
data.tail()

In [None]:
# specify number of rows to show
data.head(10)

In [None]:
# specify number of rows to show
data.head(7)

In [None]:
# specify number of rows to show
data.tail(11)

In [None]:
# specify number of rows to show
data.tail(3)

# Native Accessors

In [None]:
# Access the DataFrame
data

In [None]:
# check the number of rows and columns in the DataFrame
data.shape

In [None]:
# column labels
data.columns

In [None]:
# check the "Neighborhood" column in 'data' DataFrame using attribute access
data.Neighborhood

In [None]:
# check the "Neighborhood" column in 'data' DataFrame using the indexing operator
data['Neighborhood']

In [None]:
# check the first entry in the "Neighborhood" column in 'data' DataFrame
data['Neighborhood'][0]

# Indexing in Pandas

- the indexing operator and attribute selection work just like they do in the rest of the Python ecosystem

- pandas also has its own accessor operators, `loc` and `iloc`, for more advanced operations

- pandas indexing works in one of two paradigms
  - `iloc`: index-based selection
  - `loc`: label-based selection

- both `loc` and `iloc` are row-first, column-second
  - this is the opposite of the indexing-operator pattern used above (`data['Neighborhood'][0]`), which is column-first, row-second
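
- the difference in ordering can be sketched on a tiny table; `toy` is a made-up DataFrame used only for illustration:

```python
import pandas as pd

toy = pd.DataFrame({'Day': [50, 21], 'Night': [131, 2]})

# indexing operator: column first, then row
chained = toy['Night'][0]

# iloc and loc: row first, then column
by_position = toy.iloc[0, 1]    # row position 0, column position 1
by_label = toy.loc[0, 'Night']  # row label 0, column label 'Night'

print(chained, by_position, by_label)
```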

### Index-based Selection: `iloc`

- the first paradigm is index-based selection 
  - it is selecting data based on its numerical position in the data 
  - `iloc` follows this paradigm

In [None]:
# select the first row of data in a DataFrame
data.iloc[0]

In [None]:
# get the 11th column with iloc
data.iloc[:,10]

##### `:` operator 

- comes from native Python, where it means "everything"
- when combined with other selectors, however, it indicates a range of values

In [None]:
# get the first three rows from the 11th column
data.iloc[:3, 10]

In [None]:
# get the 3rd and the 4th entries from the 11th column
data.iloc[2:4,10]

In [None]:
# pass a `list` to get 5th, 6th and 7th elements from 11th column
data.iloc[[4, 5, 6], 10]

In [None]:
# negative numbers can also be used in selection: last 5 rows of the DataFrame using -5 indexing
data.iloc[-5:]

### Label-based Selection: `loc`

- the second paradigm for attribute selection is label-based selection
  - it is the data index value, not its position, that matters
  - `loc` follows this paradigm


In [None]:
# get the entry in the "Id" column for the row labeled 0
data.loc[0,'Id']
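
- with the default index, row label 0 and row position 0 happen to coincide; a small sketch with a non-default index shows the difference (the `toy` table and its labels 10, 20, 30 are made up for illustration):

```python
import pandas as pd

# row labels are 10, 20, 30; row positions are still 0, 1, 2
toy = pd.DataFrame({'Sales': [30, 35, 40]}, index=[10, 20, 30])

first_by_label = toy.loc[10, 'Sales']     # label 10 is the first row
first_by_position = toy.iloc[0, 0]        # position 0 is the first row

# there is no row *labeled* 0, so loc raises a KeyError
try:
    toy.loc[0, 'Sales']
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_label, first_by_position, label_zero_exists)
```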

### `loc` vs `iloc`

- `iloc` is conceptually simpler than `loc` because it ignores the dataset's indices 

- when we use `iloc`, we treat the dataset like a big matrix (a list of lists), 
  - one that we have to index into by position. 
  
- `loc`, by contrast, uses the information in the indices to do its work 

- since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead

##### Choosing between `loc` and `iloc`

- when choosing or transitioning between `loc` and `iloc`, there is one "gotcha" worth keeping in mind: the two accessors use slightly different indexing schemes

- `iloc` uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded, so `0:10` selects entries 0,...,9
- `loc`, meanwhile, indexes inclusively, so `0:10` selects entries 0,...,10
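
- the slicing difference can be checked on a small table with the default index; `toy` is a made-up 20-row DataFrame used only for illustration:

```python
import pandas as pd

# 20 rows labeled 0..19 (default index)
toy = pd.DataFrame({'value': range(20)})

# iloc: end of the slice is excluded, positions 0..9
n_iloc = len(toy.iloc[0:10])

# loc: end of the slice is included, labels 0..10
n_loc = len(toy.loc[0:10])

print(n_iloc, n_loc)
```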


# Manipulating the Index

- Label-based selection derives its power from the labels in the index. 
- Critically, the index we use is not immutable. 
  - We can manipulate the index in any way we see fit.

In [None]:
# change the row index to the Id column
data.set_index('Id')
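
- note that `set_index` returns a new DataFrame by default and leaves the original untouched, so the result has to be assigned to a name to be kept; a minimal sketch on a made-up table (`toy` and `indexed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    'Id': [101, 102, 103],
    'Neighborhood': ['Edwards', 'NAmes', 'Edwards'],
})

# set_index returns a new DataFrame; keep it by assigning the result
indexed = toy.set_index('Id')

print(list(toy.index))      # original still has the default index
print(list(indexed.index))  # new DataFrame is labeled by Id

# label-based selection now uses the Id values
value = indexed.loc[102, 'Neighborhood']
```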

# Conditional Selection

- find all the unique values that Neighborhood can take using the built-in `.unique()` method

In [None]:
data.Neighborhood.unique()

- then create a boolean filter using `==` operator 

In [None]:
# create a simple boolean filter 
data.Neighborhood == 'Edwards'

- then extract entries from `data` DataFrame that match the filter condition

In [None]:
# use the boolean filter to extract data using `loc`
data.loc[data.Neighborhood == 'Edwards']

- the output is all 100 entries that match the filter `Neighborhood == 'Edwards'`

### `|` - the pipe operator: element-wise OR

- the pipe operator is used in creating a complex boolean filter 
  - multiple conditionals can be used to filter the data 
  - `|` follows OR logic

In [None]:
# use boolean complex filter to extract entries
data.loc[(data.Neighborhood == 'Edwards') | (data.LotFrontage >= 50.0)]

### `&` - the 'ampersand' operator: element-wise AND

- this is similar to the `and` logical operator, applied element by element
- use this to build complex queries in extracting specific entries 

In [None]:
# use boolean complex filter to extract entries
data.loc[(data.Neighborhood == 'Edwards') & (data.LotArea >= 5000)]
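
- to see how the two operators differ, the filters can be built as separate boolean Series first and then combined; a sketch on a made-up four-row table (the values in `toy` are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['Edwards', 'NAmes', 'Edwards', 'OldTown'],
    'LotArea': [4000, 9000, 7000, 3000],
})

# each comparison produces a boolean Series with one value per row
is_edwards = toy.Neighborhood == 'Edwards'
is_large = toy.LotArea >= 5000

# OR keeps rows where at least one condition holds
either = toy.loc[is_edwards | is_large]

# AND keeps rows where both conditions hold
both = toy.loc[is_edwards & is_large]

print(len(either), len(both))
```

- when the comparisons are written inline, as in the cells above, each one needs its own parentheses because `|` and `&` bind more tightly than `==` and `>=`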

# Assigning Data

- going the other way, assigning data to a DataFrame is easy. 

- you can assign either a constant value or an iterable of values

In [None]:
# assign constant value to a newly created column "Visitor"
data['Visitor'] = 'everyone'
data['Visitor']

In [None]:
# assign a descending range (len(data) down to 1) to a newly created column called "index_backwards"
data['index_backwards'] = range(len(data), 0, -1)
data['index_backwards']
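
- when assigning a list, it needs one value per row; a short sketch on a made-up three-row table (`toy`, `doubled` and `too_short` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({'Day': [50, 21, 7]})

# a list with one value per row is assigned row by row
toy['doubled'] = [100, 42, 14]

# a list of the wrong length is rejected with a ValueError
try:
    toy['too_short'] = [1, 2]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)
```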

In [None]:
# check the column headers for the new columns added to the DataFrame
data.columns