# Introduction to Pandas DataFrame

## Reading Data into DataFrame

Pandas provides several functions for reading data into a DataFrame object. The most common ones are ```pd.DataFrame()```, ```pd.read_csv()```, and ```pd.read_json()```. The examples below demonstrate each of them.

### ```pd.DataFrame``` 

We usually use ```pd.DataFrame``` to turn dictionaries, Series, arrays, and other list-like objects into a 2D DataFrame with labeled axes (rows and columns). A DataFrame can be thought of as a dict-like container for Series objects.

Key Arguments:

*index, columns, dtype, and copy*

Resources: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

Python Code:

``` Python
import numpy as np
import pandas as pd

# Create a dictionary object
d = {'col_1': [1, 3, 5, 7, 9], 'col_2': [2, 4, 6, 8, 10]}

# Read the dictionary into a Pandas DataFrame
df = pd.DataFrame(data=d)

# Create a numpy ndarray (5 x 2 matrix)
a = np.array([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])

# Read the numpy ndarray into a Pandas DataFrame
df = pd.DataFrame(data=a, columns=['col_1', 'col_2'])
```
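
The key arguments listed above can be combined in a single call. The following is a minimal sketch; the row labels and the name `df_labeled` are made up for illustration and are not part of the original example.

``` Python
import numpy as np
import pandas as pd

# Hypothetical row labels, chosen only for illustration
labels = ['a', 'b', 'c']

# A 3 x 2 integer array
a = np.array([[1, 2], [3, 4], [5, 6]])

# index sets the row labels, columns sets the column labels,
# and dtype forces the element type of the resulting DataFrame
df_labeled = pd.DataFrame(data=a, index=labels,
                          columns=['col_1', 'col_2'], dtype='float64')

print(df_labeled.shape)           # (3, 2)
print(df_labeled.dtypes)          # both columns are float64

# Selecting one column returns a Series, which is why a DataFrame
# can be seen as a dict-like container of Series
print(type(df_labeled['col_1']))
```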

### ```pd.read_csv```

The ```pd.read_csv``` function allows us to read a comma-separated values (CSV) file into a Pandas DataFrame.

Key Arguments:

*sep, delimiter, header, names, index_col, dtype, nrows*, etc ...

Resources: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Python Code:

``` Python
import pandas as pd

# Assume there is a path to the CSV file on the local machine
path = "./.../file_name.csv"

# Read the CSV file into a DataFrame object
df = pd.read_csv(path)
```
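
To see what some of the key arguments do without needing a file on disk, the sketch below reads from an in-memory text buffer instead of a path. The data, the column names, and the variable `df_csv` are hypothetical.

``` Python
import io
import pandas as pd

# Hypothetical semicolon-separated data without a header row
raw = "1;Norman;10.5\n2;Greg;20.0\n3;Erin;30.25\n"

df_csv = pd.read_csv(
    io.StringIO(raw),                  # any file-like object works in place of a path
    sep=';',                           # the field separator used in the data
    header=None,                       # the data has no header line
    names=['id', 'customer', 'cost'],  # so the column names are supplied here
    index_col='id',                    # use the "id" column as the row index
    nrows=2,                           # read only the first two data rows
)

print(df_csv)
```

Because ```nrows=2``` is passed, the third record is never read.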

### ```pd.read_json```

The ```pd.read_json``` function converts JSON text (read from a file path, URL, or file-like object) into a Pandas object, which is a DataFrame by default.

Key Arguments:

*orient, typ, dtype, encoding, lines, nrows*, etc ...

Resource: https://pandas.pydata.org/pandas-docs/version/1.1.3/reference/api/pandas.read_json.html

Python Code:

``` Python
import pandas as pd

# Assume there is a path to the JSON file on the local machine
path = "./.../file_name.json"

# Read the JSON file into a DataFrame object
df = pd.read_json(path)
```
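
As a rough illustration of how the layout of the JSON text affects the call, the sketch below reads two hypothetical in-memory documents: a list of records, and a JSON Lines document with one object per line. The data and variable names are assumptions made for this example.

``` Python
import io
import pandas as pd

# Hypothetical JSON text: a list of records (one object per row)
raw_records = '[{"customer": "Norman", "cost": 10}, {"customer": "Greg", "cost": 20}]'
df_json = pd.read_json(io.StringIO(raw_records), orient='records')

# Hypothetical JSON Lines text: one JSON object per line
raw_lines = '{"customer": "Erin", "cost": 30}\n{"customer": "Chester", "cost": 40}\n'
df_lines = pd.read_json(io.StringIO(raw_lines), lines=True)

print(df_json)
print(df_lines)
```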

## Pandas Objects Attributes and Methods

When it comes to Python, many people struggle to understand the difference between **attributes** and **methods**. Generally speaking, a *method* is a special kind of *attribute* of an **instance** or **class**: one that can be called.

An **attribute** is a value associated with an object (an instance or a class) that is referenced by name using a dotted expression, for example ```df.shape```.

A **method** is a function defined inside a class body. When it is called as an attribute of an instance of that class, the method receives the instance object as its first argument (usually named ```self```).

Pandas objects have many *attributes* and *methods*; the documentation below lists them.

Pandas DataFrame Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#
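
The distinction can be seen directly on a small DataFrame. This is a minimal sketch; the frame `df_demo` and the other variable names are hypothetical.

``` Python
import pandas as pd

df_demo = pd.DataFrame({'col_1': [1, 3, 5], 'col_2': [2, 4, 6]})

# Attributes are read without parentheses
shape = df_demo.shape                  # (3, 2)
column_names = list(df_demo.columns)   # ['col_1', 'col_2']

# Methods are called with parentheses and may take arguments
first_rows = df_demo.head(2)
col_sums = df_demo.sum()

# A method is itself an attribute: without parentheses we get
# the bound method object rather than its result
print(callable(df_demo.head))    # True
print(callable(df_demo.shape))   # False
```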

## Using ```.apply()```, ```.map()```, and ```.applymap()```

### ```apply``` Method

The ```apply``` method applies a function along an axis of a DataFrame. The function always receives a one-dimensional object (a Series) as input: one column at a time by default (```axis=0```), or one row at a time when ```axis=1``` is passed.

Python Code:

``` Python
# Assume a dataframe with numerical columns X_1, X_2, and X_3

# Define a function that calculates the range (max minus min)
def cal_range(x):
    diff = x.max() - x.min()
    return diff

# Apply the function to all needed columns of the dataframe
df[['X_1', 'X_2', 'X_3']].apply(cal_range)

# Apply the function to all needed rows of the dataframe
df[['X_1', 'X_2', 'X_3']].apply(cal_range, axis=1)
```
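
For a version that runs on its own, the sketch below builds a small hypothetical frame `df_num` with the same column names and applies the function along both axes.

``` Python
import pandas as pd

# Hypothetical numeric data using the column names from the example above
df_num = pd.DataFrame({'X_1': [1, 5, 9], 'X_2': [2, 4, 6], 'X_3': [10, 0, 5]})

def cal_range(x):
    # x is a Series: one column when axis=0, one row when axis=1
    return x.max() - x.min()

# Column-wise (default axis=0): one result per column
col_ranges = df_num.apply(cal_range)
print(col_ranges)   # X_1 -> 8, X_2 -> 4, X_3 -> 10

# Row-wise (axis=1): one result per row
row_ranges = df_num.apply(cal_range, axis=1)
print(row_ranges)   # 0 -> 9, 1 -> 5, 2 -> 4
```

In both cases the result is a Series, indexed by column name for ```axis=0``` and by row label for ```axis=1```.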

### ```applymap``` Method

The ```applymap``` method applies a function that accepts and returns a scalar to every element of a DataFrame (**the whole DataFrame**). In the previous example, we applied a function to each numeric column of the DataFrame. If we instead want to change each element individually, for instance by appending the suffix "\_SF" to each element in the DataFrame, we can use the ```applymap``` method.

Note: ```applymap``` is defined only for a Pandas DataFrame, not for a Pandas Series, and the function is applied to every element in the DataFrame. In pandas 2.1 and later, ```DataFrame.applymap``` is deprecated in favor of the equivalent ```DataFrame.map```.

Python Code:

``` Python
# Assume a dataframe with string columns X_4 and X_5

# Define a function that appends a suffix to the string form of a value
def sf_string(x):
    result = str(x) + "_SF"
    return result

# Apply the function to the whole dataframe, including X_4 and X_5
df.applymap(sf_string)
```
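
The sketch below runs the same idea on a small hypothetical frame `df_str`. Since element-wise mapping is exposed as ```DataFrame.map``` in newer pandas releases and as ```DataFrame.applymap``` in older ones, the code picks whichever the installed version provides; that fallback is an assumption about your environment, not part of the original example.

``` Python
import pandas as pd

# Hypothetical string data using the column names from the example above
df_str = pd.DataFrame({'X_4': ['a', 'b'], 'X_5': ['c', 'd']})

def sf_string(x):
    # Receives one scalar element at a time and returns a scalar
    return str(x) + "_SF"

# Use DataFrame.map when it exists, otherwise fall back to applymap
elementwise = df_str.map if hasattr(df_str, 'map') else df_str.applymap

result = elementwise(sf_string)
print(result)

# The original frame is left unchanged; a new DataFrame is returned
print(df_str)
```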

### ```map``` Method

The ```map``` method applies a function to every element of a single column (a Pandas Series), whereas ```applymap()``` applies the function to every element of a Pandas DataFrame.

Python Code:

``` Python
# Assume a dataframe with string columns X_4 and X_5

# Define a function that appends a suffix to the string form of a value
def sf_string(x):
    result = str(x) + "_SF"
    return result

# Apply the function to one specific column of the dataframe, only X_4
df['X_4'].map(sf_string)
```
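
A self-contained sketch of the same call is shown here. The frame `df_str` and the new column name `X_4_SF` are hypothetical. Note that ```map``` returns a new Series, so the result has to be assigned if you want to keep it.

``` Python
import pandas as pd

# Hypothetical string data using the column names from the example above
df_str = pd.DataFrame({'X_4': ['a', 'b'], 'X_5': ['c', 'd']})

def sf_string(x):
    return str(x) + "_SF"

# Series.map returns a new Series; the original column is not modified
mapped = df_str['X_4'].map(sf_string)
print(mapped)

# Store the result in a new (hypothetical) column to keep it
df_str['X_4_SF'] = mapped
print(df_str)
```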

In [2]:
# Import the dependencies
import pandas as pd
import numpy as np

In [3]:
# Create a customer dataframe object
df = pd.DataFrame({'Customer': 'Norman Greg Erin Chester Uohna Jeffery'.split(),
                   'Costs_1': np.arange(6) * 2,
                   'Costs_2': np.arange(6) * 3,
                   'Costs_3': np.arange(6) * 4})

df

   Customer  Costs_1  Costs_2  Costs_3
0    Norman        0        0        0
1      Greg        2        3        4
2      Erin        4        6        8
3   Chester        6        9       12
4     Uohna        8       12       16
5   Jeffery       10       15       20


In [5]:
type(df.notna())

pandas.core.frame.DataFrame

In [13]:
df.notna()

   Customer  Costs_1  Costs_2  Costs_3
0      True     True     True     True
1      True     True     True     True
2      True     True     True     True
3      True     True     True     True
4      True     True     True     True
5      True     True     True     True


In [17]:
df[df.notna()]

   Customer  Costs_1  Costs_2  Costs_3
0    Norman        0        0        0
1      Greg        2        3        4
2      Erin        4        6        8
3   Chester        6        9       12
4     Uohna        8       12       16
5   Jeffery       10       15       20


In [12]:
type(df['Customer'] == 'Norman')

pandas.core.series.Series

In [11]:
(df['Customer'] == 'Norman')

0     True
1    False
2    False
3    False
4    False
5    False
Name: Customer, dtype: bool

In [15]:
df['Costs_1'] > 5

0    False
1    False
2    False
3     True
4     True
5     True
Name: Costs_1, dtype: bool

In [16]:
df[df['Costs_1'] > 5]

   Customer  Costs_1  Costs_2  Costs_3
3   Chester        6        9       12
4     Uohna        8       12       16
5   Jeffery       10       15       20


In [None]:
# Try to apply a function to this dataframe using the "apply" method


In [None]:
# Try to apply a function to this dataframe using the "applymap" method


In [None]:
# Try to apply a function to this dataframe using the "map" method


## Lambda Functions / Anonymous Functions

### Lambda Function

First of all, a **lambda function** is not specific to Pandas objects; it can be used with any data structure in Python. In Python, we can create function values on the fly using *lambda* expressions, which evaluate to unnamed functions. A lambda expression evaluates to a function that has a single return expression as its body. Assignment and control statements are **not allowed**.

Generally speaking, lambda expressions are limited: they are only useful for simple, one-line functions that evaluate and return a single expression. In the cases where they apply, lambda expressions can be quite expressive.

Structure of a *lambda* Expression:

```  lambda                     x                    :              f(g(x))```

```  A function that         takes x            and return          f(g(x))```

The result of a lambda expression is called a lambda function. It has no intrinsic name, but otherwise behaves like any other function.

Python Code:

``` Python
# Define a function that squares a value
def s1(x):
    return x*x

# Define a lambda function that squares a value
s2 = lambda x : x*x

# We can use either one to calculate a squared value
s1(10) # return 100
s2(10) # return 100
```
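
The restriction to a single expression can be checked directly. In the sketch below (all names are hypothetical), a conditional *expression* is accepted in a lambda body, while an assignment *statement* is rejected by the parser; ```compile``` is used only so the syntax error can be caught at run time instead of stopping the script.

``` Python
# A conditional expression is allowed, because it is still one expression
sign = lambda x: 'negative' if x < 0 else 'non-negative'
print(sign(-3))   # negative
print(sign(4))    # non-negative

# Lambdas are often passed directly as arguments, e.g. as a sort key
words = ['pandas', 'np', 'python']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['np', 'pandas', 'python']

# An assignment statement inside a lambda body is a syntax error
try:
    compile("lambda x: y = x", "<example>", "eval")
    assignment_allowed = True
except SyntaxError:
    assignment_allowed = False
print(assignment_allowed)  # False
```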

### Using Lambda Functions with Map or Apply Methods

Often we don't want to write a separate block of code just to define a function for a single task on a DataFrame, for instance adding a suffix to a specific column. A *lambda* function gives us a simple and elegant (Pythonic) way to do that inline and keeps the code readable.

Here is an example:

``` Python
# Assume a dataframe with string columns X_4 and X_5
# Add a suffix to every element in column X_4
df['X_4'].map(lambda x: x + "_CA")

# Add a suffix to every element in columns X_4 and X_5
df[['X_4', 'X_5']].applymap(lambda x: x + "_CA")

# Find the maximum value of each column
df.apply(lambda x: x.max())
```
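
A runnable variant on a small hypothetical frame `df_demo` is sketched below. It also shows a row-wise lambda with ```axis=1```, where the lambda receives one row (a Series) and can read several columns at once.

``` Python
import pandas as pd

# Hypothetical data: two string columns and one numeric column
df_demo = pd.DataFrame({'X_4': ['a', 'b'], 'X_5': ['c', 'd'], 'X_1': [1, 7]})

# Element-wise on one column: the lambda receives one scalar at a time
suffixed = df_demo['X_4'].map(lambda x: x + "_CA")
print(list(suffixed))      # ['a_CA', 'b_CA']

# Column-wise: the lambda receives a whole column as a Series
col_max = df_demo[['X_1']].apply(lambda x: x.max())
print(col_max['X_1'])      # 7

# Row-wise: the lambda receives one row as a Series
row_joined = df_demo[['X_4', 'X_5']].apply(lambda row: row['X_4'] + row['X_5'], axis=1)
print(list(row_joined))    # ['ac', 'bd']
```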

Try running the following code and see whether you can find a pattern in how ```.apply()```, ```.applymap()```, and ```.map()``` behave. Some of the examples will raise an error, and you may want to investigate the reason for it.

In [None]:
# Example 1:
df["Customer"].applymap(lambda x: x + "_SF") 

In [None]:
# Example 2:
df[["Customer"]].applymap(lambda x: x + "_SF") 

In [None]:
# Example 3:
df.applymap(lambda x: str(x) + "_SF")

In [None]:
# Example 4:
df.map(lambda x: str(x) + "_SF")

In [None]:
# Example 5:
df.apply(lambda x: str(x) + "_SF")

In [None]:
# Example 6:
df.apply(lambda x: x.max())