![panda](figures/panda.png)

 > **Pandas** is an open source Python library for data analysis.
- It lets Python work with numerical tables and time series, with fast loading, manipulation, alignment, and merging of data.
- The name is derived from 'panel data', an econometrics term for multidimensional structured datasets.

In [1]:
import pandas as pd
import numpy as np

# Series and DataFrame

Pandas introduces two new data types to Python: **Series** and **DataFrame**.

## Series

> A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its **index**.

In [None]:
s = pd.Series([4, 7, -5, 3])
s

In [None]:
s_test = pd.Series([4, 7, -5, 3, 'string'])
2 * s_test  # mixed types give an object Series: numbers are doubled, the string is repeated

- The string representation of a Series displayed interactively shows the index on the left and the values on the right.
- Since we did not specify an index for the data, pandas creates a default one consisting of the integers 0 through n-1 (where n is the length of the data).
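
The default index described above can be inspected directly. The following is a minimal sketch; the variable name `values` is only illustrative and is not used elsewhere in this notebook.

```python
import pandas as pd

# No index is given, so pandas builds a default integer index.
values = pd.Series([4, 7, -5, 3])

print(values.index)        # RangeIndex(start=0, stop=4, step=1)
print(list(values.index))  # [0, 1, 2, 3]
```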

In [None]:
s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
s

In [None]:
s.values

In [None]:
s.index

### Selecting a single value or a set of values by index

In [None]:
s['b']

In [None]:
s[['c', 'a', 'b']]

In [None]:
s.iloc[2]

In [None]:
s[1:3]

In [None]:
s.iloc[[1,3]]

### Filtering

In [None]:
s

In [None]:
s > 0

In [None]:
s[s > 0]

### Math operations

In [None]:
s**2

In [None]:
np.exp(s)

In [None]:
s.mean()

In [None]:
s.var()

Arithmetic operations between two Series align on index labels.

In [None]:
s

In [None]:
s2 = pd.Series([1, 2, 3, 4], index = ['a', 'c', 'd', 'e'])
s2

In [None]:
s + s2

**Note**: "NaN" (Not a Number) is how pandas marks missing values. Here it appears for every label that is present in only one of the two Series.
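
As a small sketch of how those missing values can be detected or avoided, the example below rebuilds the two Series under the illustrative names `left` and `right`. It uses `isna()` to flag the unmatched labels and `Series.add` with `fill_value=0`, which treats a label missing from one side as 0; whether 0 is a sensible substitute depends on the data.

```python
import pandas as pd

left = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
right = pd.Series([1, 2, 3, 4], index=['a', 'c', 'd', 'e'])

# Labels 'b' and 'e' exist in only one Series, so their sum is NaN.
total = left + right
print(total.isna())

# Treat a label missing from one side as 0 instead of producing NaN.
filled = left.add(right, fill_value=0)
print(filled)
```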

In [None]:
s.index = ['a', 'c', 'd', 'e']
s

In [None]:
s + s2

## More methods for Series

In [None]:
dir(s)

In [None]:
[attr for attr in dir(s) if not attr.startswith('_')]

In [None]:
help(s.all)

## DataFrame

> A DataFrame represents a rectangular table of data and contains an ordered collection
of columns.

* The DataFrame has both a row and column index.
* Each column of a DataFrame is essentially a Series named by its column label, so a DataFrame can be thought of as a dictionary of Series that all share the same row index.
<!-- * Each column (Series) has to be the same type, whereas, each row can contain mixed types. -->
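
The "dictionary of Series" view can be sketched as follows. The names `population`, `area`, and `frame`, and all the numbers, are made up for illustration. Because the two Series do not have the same labels, the resulting row index is the union of both, and unmatched cells are filled with NaN.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels.
population = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
area = pd.Series({'Ohio': 44.8, 'Utah': 84.9})

# Each dict key becomes a column; rows are aligned on the index labels.
frame = pd.DataFrame({'pop': population, 'area': area})
print(frame)

# Selecting a column gives back a Series on the shared row index.
print(type(frame['pop']))
print(frame['pop'].index.equals(frame.index))
```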

### Creating DataFrame

#### From a dict of equal-length lists

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
data

In [None]:
d = pd.DataFrame(data)
d

#### Start with an empty DataFrame

In [None]:
d1 = pd.DataFrame()
d1

In [None]:
d1['state'] = ['Ohio', 'Nevada']
d1

In [None]:
d1['year'] = [2001, 2001]
d1['pop'] = [1.7, 2.4]
d1

### Select columns

In [None]:
d

In [None]:
d['state']  # a single column label returns a Series

In [None]:
type(d['state'])

In [None]:
d[['state', 'pop']]  # a list of column labels returns a DataFrame

### Select rows

In [None]:
rows = np.arange(16).reshape((4, 4))
rows

In [None]:
d2 = pd.DataFrame(rows,
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])
d2

In [None]:
d2.loc['Ohio':'Utah']  # label slice: both endpoints are included

In [None]:
d2.iloc[1:3]  # position slice: the end position is excluded

### Change row index labels and column names

In [None]:
d2

In [None]:
d2.rename(index={'Colorado': 'Connecticut'}, columns={'one': 'five'})

In [None]:
d2  # rename returned a new DataFrame, so d2 itself is unchanged

In [None]:
# assign the result to a new variable to keep the renamed copy
d3 = d2.rename(index={'Colorado': 'Connecticut'}, columns={'one': 'five'})
d3

In [71]:
# Setting inplace=True modifies the original DataFrame and returns None.
d2.rename(index={'Colorado': 'Connecticut'}, columns={'one': 'five'}, inplace=True)

In [None]:
d2

### Basic attributes and methods

In [None]:
d2.index

In [None]:
d2.columns

In [None]:
d2.values

In [None]:
d2.shape

In [None]:
d2.mean()  # column-wise mean; more on aggregation later

### Alignment by index

In [None]:
df3 = pd.DataFrame({'A':[1,2,3]},index=[1,2,3])
df3

In [None]:
df4 = pd.DataFrame({'A':[1,2,3]},index=[3,1,2])
df4

In [None]:

df3 - df4  # rows are matched by index label, not by position
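
To make the difference between label alignment and positional arithmetic concrete, here is a minimal sketch that rebuilds the two frames under the illustrative names `left` and `right`. Converting the right-hand side with `to_numpy()` discards its labels, which is one way (an assumption about what you might want, not something the cells above do) to subtract by position instead.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({'A': [1, 2, 3]}, index=[3, 1, 2])

# Label-aligned: row 1 is 1 - 2, row 2 is 2 - 3, row 3 is 3 - 1.
aligned = left - right
print(aligned)

# Position-based: drop the labels on the right-hand side first.
positional = left - right.to_numpy()
print(positional)
```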

### Add and delete rows and columns

In [None]:
d2

In [None]:
# drop returns a new DataFrame; pass inplace=True to modify d2 itself
d2.drop(index='Connecticut', columns='five')

In [None]:
d2

In [None]:
del d2['five']  # del removes the column from d2 in place
d2

In [None]:
d2['one'] = [1, 2, 3, 4]  # assigning to a new label adds a column
d2

In [None]:
d2.pop('one')  # pop removes the column from d2 in place and returns it as a Series

In [None]:
d2

### Common methods

You can also load a dataset from a file.

#### csv file

In [88]:
df = pd.read_csv("./data/table.csv")

In [None]:
df

#### txt file

In [None]:
df_txt = pd.read_table("data/table.txt")
df_txt

In [None]:
help(pd.read_table)

#### xlsx file

In [None]:
# openpyxl is the engine pandas uses to read .xlsx files
%conda install openpyxl

In [None]:
df_excel = pd.read_excel('data/table.xlsx', sheet_name="Sheet1")
df_excel

#### Head and Tail

These two methods show the first and the last few records of a DataFrame. The default number of records is 5.

In [None]:
df.head()

In [None]:
df.iloc[:6]

In [None]:
df.tail()

In [None]:
df.head(3)

### unique and nunique

In [None]:
df['Physics']

In [None]:
df['Physics'].unique()  # array of the distinct values

In [None]:
df['Physics'].nunique()  # number of distinct values, excluding missing ones by default

### count and value_counts

In [None]:
df['School']

In [None]:
df['School'].count()  # number of non-missing values

In [None]:
df['School'].value_counts()  # how often each distinct value occurs

In [None]:
df['Physics'].value_counts()

### describe and info

In [None]:
df.info()  # non-null count and dtype of each column

In [None]:
df.describe()  # summary statistics for the numeric columns

In [None]:
df.describe(percentiles=[x / 10 for x in range(1, 10)])  # deciles 0.1 through 0.9

In [None]:
df['Physics'].describe()

### idxmax and nlargest

In [None]:
df['Math']

In [None]:
df['Math'].max() # return the largest value

In [None]:
df['Math'].idxmax()  # return the index label of the largest value

In [None]:
df['Math'].idxmin()  # return the index label of the smallest value

In [None]:
df['Math'].nlargest(3) # return the largest 3 values with their index (default is 5).

In [None]:
df['Math'].nlargest()

In [None]:
df['Math'].nsmallest(3) # return the smallest 3 values with their index (default is 5).

### apply

In [None]:
df[["Height", "Weight"]]

In [None]:
df[["Height", "Weight"]].apply(lambda x: x.max() + x.min())

In [None]:
df[["Height", "Weight"]].apply(lambda x: x.mean())

In [None]:
df[["Height", "Weight"]].mean()

In [None]:
df.apply(lambda x: x.count(), axis=0)  # axis=0: the function is applied to each column

In [None]:
df.apply(lambda x: x.count(), axis=1)  # axis=1: the function is applied to each row
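
Since the contents of `table.csv` are not shown here, the effect of `axis` may be easier to see on a tiny frame. The sketch below uses a hypothetical `scores` table with one missing value; the column names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing Math score.
scores = pd.DataFrame({'Math': [80.0, 65.0, np.nan],
                       'Physics': [70.0, 90.0, 85.0]})

# axis=0: the function receives each column as a Series.
per_column = scores.apply(lambda col: col.count(), axis=0)

# axis=1: the function receives each row as a Series.
per_row = scores.apply(lambda row: row.count(), axis=1)

print(per_column)  # one count per column label
print(per_row)     # one count per row label
```

With `axis=0` the result is indexed by the column labels, and with `axis=1` it is indexed by the row labels.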

### sort

In [None]:
df

In [None]:
df.sort_values(by='Class')

In [None]:
df.sort_values(by=['Address', 'Height'], ascending=True)  # sort by Address, then by Height within each Address

In [None]:
df.sort_values(by=['Address', 'Height'], ascending=[False, True])  # Address descending, Height ascending

In [None]:
df.sort_values(by=['Math','Height'], ascending=[False, True])