<a href="https://colab.research.google.com/github/CC-MNNIT/2018-19-Classes/blob/master/MachineLearning/2019_04_11_ML3_content/Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2019 MNNIT Computer Club.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Pandas (Current Release : 0.23)

**Learning Objectives:**

- What is Pandas, why do we need it ?
- Basics of Pandas

# What is Pandas ?

- It is a Python library/package that provides expressive **data structures** designed to make working with relational (tabular), labeled data easier.
- It is built on top of NumPy.

Think of pandas as MS Excel for python programmers.


## It offers two main data structures : 
1. **Series** : 1D labeled homogeneously-typed array
2. **DataFrame** : General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Think of pandas data structures as containers:
- A DataFrame is a container made of Series.
- A Series is a container made of scalars.
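
The container idea can be checked directly. The following is a minimal sketch; the `people` DataFrame and its values are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# A small illustrative DataFrame (hypothetical values)
people = pd.DataFrame({'Height': [180, 178], 'Name': ['DG', 'AR']})

# Selecting one column of a DataFrame gives back a Series
heights = people['Height']
print(type(heights).__name__)

# Selecting one element of that Series gives back a scalar
first_height = heights.loc[0]
print(first_height)
```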

# Why do we need pandas ?

Actually you can do without pandas, but as the saying goes

> "Don't reinvent the wheel"

1. In many problems the data arrives in a tabular format such as CSV/TSV, or is converted to one.
2. A NumPy array must be homogeneous as a whole. In pandas, each column must be homogeneous, but different columns of a DataFrame can have different types.
3. It has many built-in functions for common workflows that you would otherwise have to implement yourself every time.

# When and When not to use pandas ?

- Pandas is **good for large data (roughly <100GB)**, provided the working data can be made to fit in memory.
- **Not for BIG data.**
- Pandas is extremely efficient on small datasets (<1GB), where performance is rarely a concern.

For big data there can be performance issues: pandas was not designed for it and generally expects the working data to fit in memory, so operations may become very slow or fail.


# Motivation to use

An experience: a large dataset had to have its memory usage reduced before it could fit into local memory for analysis, and this had to happen while the data was being read, before a full dataframe existed.

- Pandas allows a csv file to be read in chunks (e.g. 1000 lines at a time).
- Each chunk can be reduced as it is read (for example by using smaller datatypes); the chunks can then be concatenated, and the resulting dataframe fits in memory.
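
A minimal sketch of the chunked workflow, assuming the reduction step is a numeric downcast (one possible choice, not necessarily what was done in the experience described). The in-memory `csv_text` is a hypothetical stand-in for a large file on disk.

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (hypothetical data)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))

# With chunksize set, read_csv returns an iterator of DataFrames
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Shrink each chunk before keeping it, here by downcasting numeric columns
    chunk['id'] = pd.to_numeric(chunk['id'], downcast='integer')
    chunk['value'] = pd.to_numeric(chunk['value'], downcast='float')
    chunks.append(chunk)

# Stitch the reduced chunks back together into one DataFrame
small_df = pd.concat(chunks, ignore_index=True)
print(small_df.shape)
print(small_df.dtypes)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path, and the chunk size would be tuned to the available memory.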


# Enough talk, show me the code

In [0]:
import pandas as pd
import numpy as np

# Pandas Series



```python
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
```

> data can be :
1. Python dictionary
2. Python List
3. Numpy array
4. Scalar

> ["fastpath" is an internal parameter, will probably be fixed in next release](https://github.com/pandas-dev/pandas/issues/6903)



## Creating a Series

In [0]:
# Using python list
pd.Series(['a','b','c'])

In [0]:
# a Series can also hold mixed Python objects; its dtype then becomes object
pd.Series(['a','b',4,1.414])

In [0]:
# Using numpy array
pd.Series(np.array([1,2,3,4]))

### Indexing a Series

- By default indices 0....len(data)-1 are assigned, as can be seen above
- To set your own indices, use the parameter **index**
- **Indices are also called labels**
- Labels must be hashable. They do not have to be unique, although unique labels keep lookups unambiguous

In [0]:
pd.Series([180,72,"DG"],index=['Height','Weight','Name'])

In [0]:
human = pd.Series({'Height' : 180, 'Weight' : 72, 'Name' : 'DG'})
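
Once a Series has custom labels, its values can be looked up by label or by position. A short sketch using the same `human` Series as above:

```python
import pandas as pd

human = pd.Series({'Height': 180, 'Weight': 72, 'Name': 'DG'})

# Label-based lookups
height = human['Height']
name = human.loc['Name']

# Position-based lookup (second element)
weight = human.iloc[1]

# The labels themselves live in the index
labels = list(human.index)
print(height, name, weight, labels)
```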

# Pandas DataFrame



```python
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```

> **data** can be :
- numpy ndarray (structured or homogeneous)
- dict (values can be pandas Series, arrays, constants, or list-like objects)
- another pandas DataFrame

> **index** : row labels

> **columns** : column labels

> A few points about DataFrame
- Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). 
- **Arithmetic operations align on both row and column labels.**
- Can be thought of as a dict-like container for Series objects.
- Most used pandas datastructure
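
Label alignment is easiest to see with two frames whose labels only partly overlap. The frames `a` and `b` below are made up for this sketch.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2], 'y': [10, 20]}, index=['r1', 'r2'])
b = pd.DataFrame({'y': [100, 200], 'z': [5, 6]}, index=['r2', 'r3'])

# Addition matches cells by (row label, column label), not by position.
# Only the cell (r2, y) exists in both frames; every other cell becomes NaN.
total = a + b
print(total)
```

The result covers the union of the row labels and the union of the column labels, which is why missing combinations show up as NaN instead of raising an error.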


## Creating pandas DataFrame

In [0]:
# from numpy array
pd.DataFrame(np.array([[1,2],[3,4]]))

In [0]:
# from list
pd.DataFrame([1,2,'hello','world'])

In [0]:
# from dictionary
df = pd.DataFrame({'Height':[180,178],'Weight':[72,70],'Name':['DG','AR']})
df

In [0]:
# pandas automatically handles datatypes, for each column
# unlike numpy which would have forced all columns to be string objects in this case
df.dtypes

In [0]:
# from a series
pd.DataFrame(human)

In [0]:
# creates a replica of dataframe df
duplicate_df = df.copy()

## Creating a dataframe from dataset

In [0]:
# downloading data
!rm -rf datasets
!mkdir -p datasets
!wget -O datasets/data.csv https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
# !gunzip datasets/data.tsv.gz 

In [0]:
# see documentation for full set of parameters
iris_data = pd.read_csv('datasets/data.csv',sep=',')

In [0]:
iris_data.head()

In [0]:
iris_data['species'].describe()

## Selecting and Indexing a DataFrame

Three main methods :

1. .loc : location (primarily label based)
2. .iloc : integer location
3. [ ] : column selection by label


### **General Selection Syntax**

```python
df.loc[row-selector,column-selector]
```

Selectors can be : 
1. A label 
2. List/Numpy-Array of labels
3. Label slice

[pandas documentation about indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

### .loc

In [0]:
# rows with labels 0,1,2.....10, and the species column
# 0:10 are row labels here, not integer positions

iris_data.loc[0:10,'species']

In [0]:
# rows with labels 10,90
# columns with labels sepal_length, species

iris_data.loc[np.array([10,90]),['sepal_length','species'] ]

In [0]:
# label slices
X_data = iris_data.loc[:,'sepal_length':'petal_width']
y_data = iris_data.loc[:,'species']


##### IMPORTANT #######
# While using .loc and label slices(start:stop), both the start and stop label are inclusive
# unlike regular python where stop is not included

In [0]:
X_data.head()

In [0]:
y_data.head()

### .iloc

In [0]:
iris_data.iloc[0:10]

In [0]:
iris_data.iloc[0:10,0:1]
# integer based selectors/slices.. .iloc won't take row,column label names 
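
The difference between label slices and integer slices can be seen side by side. This is a small sketch on a made-up frame called `demo`, not on the iris data.

```python
import pandas as pd

demo = pd.DataFrame({'value': [100, 101, 102, 103, 104, 105]})  # labels 0..5

by_label = demo.loc[0:3]       # labels 0, 1, 2, 3 (stop label included)
by_position = demo.iloc[0:3]   # positions 0, 1, 2 (stop position excluded)
print(len(by_label), len(by_position))

# After dropping rows, labels and positions no longer coincide
tail = demo.iloc[2:]                   # keeps labels 2, 3, 4, 5
print(tail.loc[2:3].index.tolist())    # rows labeled 2 and 3
print(tail.iloc[2:3].index.tolist())   # third row of tail, which has label 4
```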

### [ ] 

In [0]:
# [ ] selects columns
iris_data['species']

In [0]:
# multiple columns are selected by passing a list of column labels
my_columns = ['species','sepal_length']
iris_data[my_columns]

# the above is equivalent to the following:
# iris_data[ ['species','sepal_length'] ]

### Important point about selectors:

```python
iris_data[['species','sepal_length']]
```
is not the same as
```python
iris_data[['sepal_length','species']]
```

**ORDER MATTERS!!**

### TODO

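1. .at : access a single value by row and column label
2. .iat : access a single value by integer row and column position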

## Modifying a Dataframe

In [0]:
# a common way is to use the apply method with a lambda function
# note: apply returns a new Series; assign it back to a column to actually modify iris_data
iris_data['petal_width'].apply(lambda x : x*10)

In [0]:
# we can also use replace method for a consistent replacement mapping
mapping = {flower:idx for idx,flower in enumerate(iris_data.species.unique())}
iris_data['species'].replace(mapping)

In [0]:
# changing datatype of a column
iris_data['species'].astype('category')

In [0]:
iris_data
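
The calls above return new objects and leave the dataframe unchanged unless the result is assigned back. The sketch below shows the assignment step on a small made-up frame (`flowers` and its values are hypothetical); it uses `map` for the label-to-integer mapping, which is an alternative to `replace` chosen here for illustration.

```python
import pandas as pd

flowers = pd.DataFrame({'species': ['setosa', 'virginica', 'setosa'],
                        'petal_width': [0.2, 2.5, 0.4]})

# apply returns a new Series; store it in a column to keep the result
flowers['petal_width_x10'] = flowers['petal_width'].apply(lambda x: x * 10)

# Build a name -> integer mapping and store the mapped values in a new column
mapping = {name: idx for idx, name in enumerate(flowers['species'].unique())}
flowers['species_id'] = flowers['species'].map(mapping)

# Change the datatype of a column in place by assigning the converted Series
flowers['species'] = flowers['species'].astype('category')

print(flowers)
print(flowers.dtypes)
```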

**Pandas also has very powerful text manipulation tools under the *str* accessor of a Series (i.e. a single column of the dataframe). Do check it out for text data/columns.**

[Working with text Data in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)
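
A brief sketch of the `.str` accessor on a text column. The `names` Series is made up for illustration; the methods shown are only a small sample of what is available.

```python
import pandas as pd

names = pd.Series(['Iris-setosa', 'iris-VERSICOLOR', ' Iris-virginica '])

# String methods are applied element-wise and can be chained
cleaned = names.str.strip().str.lower()

# Remove a fixed prefix (regex=False treats the pattern as plain text)
species_only = cleaned.str.replace('iris-', '', regex=False)

# Boolean Series, usable as a row filter
has_v = cleaned.str.contains('v')

print(species_only.tolist())
print(has_v.tolist())
```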

# Data Exploration, Statistics tools

In [0]:
iris_data.head(25)

In [0]:
iris_data.tail(7)

In [0]:
iris_data.describe()

In [0]:
iris_data.info()

In [0]:
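# can also be applied on sub part of a dataframe
# numeric_only=True skips the text column 'species'; newer pandas versions raise an error without it
iris_data.mean(numeric_only=True)

# iris_data.median(numeric_only=True)
# iris_data.std(numeric_only=True)
# iris_data.mode()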

In [0]:
# gives mean of sepal_length of first 11 entries
iris_data.loc[0:10,'sepal_length'].mean()

In [0]:
iris_data.median(numeric_only=True)

# Exporting a pandas dataframe/series

In [0]:
# returns a numpy array
# pandas 0.24 adds iris_data.to_numpy(), which the documentation recommends over .values
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values

iris_data.values

In [0]:
# index = False ensures that indices are not written to the csv file
iris_data.to_csv('datasets/fromDataFrame.csv',sep=',',encoding='utf-8', index=False)

# Try this yourself (Home Assignment ;-) )

1. Concatenating two dataframes along axis=0, axis=1
2. Appending a row to dataframe

---

Authored By [Dipunj Gupta](https://github.com/dipunj) | Report errors/typos as github issues.

---