# Pandas
---

## Code examples for the most frequently used functions

Collected, Created and Edited by __Pawel Rosikiewicz__ www.SimpleAI.ch

In [3]:
import os
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Pandas Series & DataFrame
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

* __Series__;
    * a one-dimensional labeled array
    * a Series is like a DataFrame with a single column
* __DataFrame__;
    * a two-dimensional labeled array
    * columns can store different dtypes
    * __Dimensions__; the number of axes; axes are numbered starting at 0
    * __Rows__; axis 0; by convention, they represent the data points
    * __Columns__; axis 1; by convention, they represent the variables
    * __Column names__; the bold names at the top of the displayed table
    * __Index labels__; the bold numbers on the left side (eg: 0 to 9 in a ten-row table); they label rows only
    * __Data__; everything else inside the cells
    * __Cell__; the intersection of one row and one column
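
The terms above can be checked directly on a small table. This is a minimal sketch; the `name`/`age` data is made up for illustration and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical example data, used only to illustrate the vocabulary
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [31, 25, 47]})

print(df.ndim)           # number of axes (dimensions)
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
print(list(df.index))    # index labels (row labels)
print(df.dtypes)         # one dtype per column

col = df["age"]          # a single column is a Series
print(type(col).__name__, col.ndim)

cell = df.loc[1, "name"]  # one row label + one column name = one cell
print(cell)
```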



### pd.Series
---
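* most functions shown here work for DataFrames too, so I present them only once, here
* use s.dtype for a Series and df.dtypes for a DataFrame; both are attributes, so no brackets
* s.info() is available only in pandas 1.4 and later; df.info() works in all versions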
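
A short sketch of the dtype difference between the two objects, using made-up values. It assumes nothing beyond a standard pandas install.

```python
import pandas as pd

# Hypothetical example objects
s = pd.Series([1.5, 2.5, 3.5])
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

print(s.dtype)    # a Series has a single dtype
print(df.dtypes)  # a DataFrame returns a Series with one dtype per column

# Many methods behave the same way on both objects
print(s.max())
print(df["x"].max())
```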

#### Create new Series
* typically created from a list
* a Series holds a single dtype
* values of different types from one list can still be stored together; the Series then gets the "object" dtype
* __index__; the row labels shown on the left; by default they are integers starting at 0
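
The two points about the default index and the "object" dtype can be sketched as follows. The variable names `nums` and `mixed` are illustrative only.

```python
import pandas as pd

nums = pd.Series([10, 20, 30])      # no index given: labels default to 0, 1, 2
mixed = pd.Series([1, "two", 3.0])  # mixed Python types: stored as object dtype

print(list(nums.index))
print(mixed.dtype)
print([type(v).__name__ for v in mixed])  # each value keeps its own Python type
```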

In [51]:
s = pd.Series(["a", "b", "c"], index=["first", "second", "third"]) 
s2 = s.copy() # independent copy: a new object with its own id()

#### inspect pd.Series

In [46]:
# dimensions
s.shape # (3,)
s.size # 3
s.ndim # 1

# object 
type(s) # pandas.core.series.Series

# examples
s.head()
s.tail()

# summary
s.dtype  # attribute, no brackets; returns dtype('O'), the object dtype (letter O, not zero)
s.describe()

count     3
unique    3
top       a
freq      1
dtype: object

#### get values
* Caution! Using a label that does not exist in the index raises a KeyError

In [19]:
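# by position
s.iloc[0] # first element
# s[0] also returned the first element in older pandas, but positional
# lookup with [] on a labeled index is deprecated since pandas 2.1; prefer .iloc

# by index label
s['first'] # [] looks the key up among the index labels
s.loc['first'] # .loc accepts labels only

# multiple values can be selected
s[['first', 'second']] # returns a smaller Series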

first     a
second    b
dtype: object
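
The KeyError caution above can be handled in a few ways. A minimal sketch, reusing the same Series and a label, `'fourth'`, that is deliberately missing:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=["first", "second", "third"])

# A missing label raises KeyError
try:
    s.loc["fourth"]
except KeyError as err:
    print("KeyError:", err)

# .get() returns a default instead of raising
print(s.get("fourth"))             # None when no default is given
print(s.get("fourth", "missing"))  # the supplied default

# Check membership among the labels before the lookup
print("first" in s.index)
```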

### pd.DataFrame
---

* __important to know when working with df__
    * __df indexing for rows__; 
        * rows are indexed automatically, starting from 0
        * one of the columns can be set as the new index with the __df.set_index("column_name")__ method

    * __Copy vs inplace__
        * it is important to know whether a given method modifies the object it is called on or creates a new object
        * most pandas methods return a new object and leave the original unchanged
        * thus you must either use __inplace=True__ or reassign the result, __df = df.method()__, to keep the changes
        * inplace=True modifies the original object, but not every method supports it

    * __.copy( )__
        * creates a new, independent object, so changes to the copy do not affect the original

    * __na_values=["char"]__;
        * pandas automatically recognizes and parses common missing-data indicators, such as NA and empty fields, as np.nan (a special float value from numpy)
        * for other missing-data markers, list them in na_values
        * eg: __df = pd.read_csv('file_name.csv', na_values=['?'])__
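
The points above can be tried on a tiny in-memory file. This is a sketch under stated assumptions: the CSV text, the column names and the `'?'` marker are made up, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; '?' marks a missing value
csv_text = "name,age\nAnn,31\nBob,?\nCid,47\n"

# Without na_values the '?' is kept as text, so the column is not numeric
raw = pd.read_csv(io.StringIO(csv_text))
# With na_values the '?' becomes NaN and the column is parsed as float
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(raw["age"].dtype, clean["age"].dtype)

# set_index returns a new object; 'clean' itself is unchanged
indexed = clean.set_index("name")
print(list(clean.columns), list(indexed.columns))

# .copy() gives an independent object
backup = clean.copy()
backup.loc[0, "age"] = 99
print(clean.loc[0, "age"])  # the original value is untouched

# inplace=True modifies the object it is called on
work = clean.copy()
work.dropna(inplace=True)
print(len(clean), len(work))
```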


#### create new data frame

In [74]:
# empty df
''' you provide no data
    it has only col names, indexes and dtype, no data''';
pd.DataFrame( 
    index = range( 0 , 2 ), 
    columns = [ 'A' , 'B' ], 
    dtype = 'float'
)

|   | A | B |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |


In [76]:
# df filled with one value
''' the single value is broadcast to every cell;
    eg: np.nan is a float, so both columns get a float dtype. 
    pd.DataFrame(np.nan, index=[0,1], columns=['A', 'B'])'''
pd.DataFrame( 
    np.nan,
    index = range( 0 , 2 ), 
    columns = [ 'A' , 'B' ]
)

|   | A | B |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |


In [81]:
# with dictionary,
'''keys used as column names'''
df=pd.DataFrame({
    "col1":[1,2,3],
    "col2":['a', 'b', 'c'],
    "col3":['first', 'second', 'third']
    }, index=['row1', 'row2', 'row3']
)


# provided as rows
'''one row == one nested list'''
df=pd.DataFrame(
    [[1, 'a', 'first'], [2, 'b', 'second'], [3, 'c', 'third']],
    columns=['col1', 'col2', 'col3'],
    index=['row1', 'row2', 'row3']
)


# from a list of dictionaries, with column names as keys
'''often used in for-loops to collect results'''
lst=[]
for i in range(3):
    lst.append({'col1':i, 'col2':i+10})
df = pd.DataFrame(lst)


# from numpy array
df=pd.DataFrame(
    np.arange(9).reshape(3,3),
    columns=['col1', 'col2', 'col3'],
    index=['row1', 'row2', 'row3']
)
df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 0 | 1 | 2 |
| row2 | 3 | 4 | 5 |
| row3 | 6 | 7 | 8 |
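
As a quick check, the dictionary, row-list and list-of-dictionaries forms shown above describe the same table. A minimal sketch with a made-up two-row table:

```python
import pandas as pd

# Three ways to describe the same hypothetical two-row table
from_dict = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
from_rows = pd.DataFrame([[1, "a"], [2, "b"]], columns=["col1", "col2"])
from_records = pd.DataFrame(
    [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]
)

# .equals() compares values, labels and dtypes
print(from_dict.equals(from_rows))
print(from_dict.equals(from_records))
```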


#### Load data from file to pd.DataFrame