
In [None]:
import pandas as pd
pd.__version__

# **DataFrame**
One of the main data structures in Pandas is the DataFrame, which is a two-dimensional table-like data structure consisting of rows and columns.

## *Create DataFrame Objects*

* When creating a DataFrame object, Pandas decides the index (row labels) and columns (column labels) first, then populates the content of the DataFrame.
* Two ways to create DataFrame objects:
  1. Column-wise: Combine list-like objects (list, dict, Pandas Series, NumPy ndarray) in a column-by-column way.
  2. Row-wise: Combine list-like objects in a row-by-row way.
* There are several Pandas functions/methods for creating DataFrames; the most commonly used one is the `pd.DataFrame()` constructor.


### **Column-wise Creation**
We can use a **dictionary** of list-like objects to create a DataFrame object using `pd.DataFrame()` or `pd.DataFrame.from_dict()`.

* Each list-like object is treated as a column of the DataFrame object.
* If `columns = None` (default) when calling `pd.DataFrame()`, the keys of the dictionary object will be the column labels.
* If `index = None` (default) when calling `pd.DataFrame()`, the index of the DataFrame object depends on the type of the list-like objects in the dictionary:
  * the **union** of indexes of Series objects.
  * the **union** of keys of dictionaries.
  * `range(n)` for list or NumPy ndarray objects, where `n` is the number of elements in each object.
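
As a small illustration of the index rules above, the sketch below builds a DataFrame from a dictionary of Series whose indexes only partly overlap. The names `s_age`, `s_city`, and `df_union` and the data are made up for this example.

```python
import pandas as pd

# Two Series with partly overlapping indexes (hypothetical data)
s_age = pd.Series([25, 30], index=['row1', 'row2'])
s_city = pd.Series(['Paris', 'London'], index=['row2', 'row3'])

df_union = pd.DataFrame({'Age': s_age, 'City': s_city})

# The row index is the union of the two Series indexes;
# positions that have no value in a Series are filled with NaN.
print(df_union)
print(df_union.index.tolist())
```

Each Series contributes only the rows it knows about, so `Age` is missing for `row3` and `City` is missing for `row1`.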




In [None]:
# Using Dictionary of lists
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Country': ['USA', 'France', 'UK', 'Australia']}

df = pd.DataFrame(data)

print(df, end = '\n\n')
print(df.index, end = '\n\n')
print(df.columns)

In [None]:
# Using a dictionary of dictionaries (inner keys become the row labels)
data = {'Name': {'row1': 'John', 'row2': 'Mary', 'row3': 'Peter', 'row4': 'Jane'},
        'Age': {'row1': 25, 'row2': 30, 'row3': 20, 'row4': 35},
        'City': {'row1': 'New York', 'row2': 'Paris', 'row3': 'London', 'row4': 'Sydney'},
        'Country': {'row1': 'USA', 'row2': 'France', 'row3': 'UK', 'row4': 'Australia'}}

df = pd.DataFrame(data)
# Or use from_dict method: df = pd.DataFrame.from_dict(data, orient='columns')

print(df)

### **Row-wise Creation**

In [4]:
# Use a list of lists: each inner list is one row.
data = [['Alice', 25, 'New York'],
        ['Bob', 30, 'Los Angeles'],
        ['Charlie', 22, 'Chicago']]

df = pd.DataFrame(data, columns= ["Name", "Age", "City"], index=None)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
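
Rows can also be supplied as a list of dictionaries, in which case the dictionary keys become the column labels. The following is a minimal sketch with made-up records; `rows` and `df_rows` are names chosen for this example.

```python
import pandas as pd

# Each dictionary is one row; keys become column labels
rows = [{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
        {'Name': 'Bob', 'Age': 30},  # 'City' is missing for this row
        {'Name': 'Charlie', 'Age': 22, 'City': 'Chicago'}]

df_rows = pd.DataFrame(rows)
print(df_rows)
```

A key that is absent from one of the dictionaries shows up as a missing value in that row.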


### *Set_index and Reset_index*
* The `set_index()` method is used to set a column or a combination of columns as the DataFrame's index. This method returns a new DataFrame with the specified column(s) set as the index.

* The `reset_index()` method is used to reset the index of a DataFrame. By default, the method adds a new integer index starting from 0 and makes the current index a new column.

Both methods accept an `inplace` parameter; with `inplace = True` they modify the original DataFrame instead of returning a new one.
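
To make the "returns a new DataFrame" point concrete, here is a short sketch on a two-row table. The names `df_demo` and `df_by_name` are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [25, 30]})

# set_index returns a new DataFrame; df_demo keeps its integer index
df_by_name = df_demo.set_index('Name')
print(df_demo.index.tolist())     # [0, 1]
print(df_by_name.index.tolist())  # ['John', 'Mary']

# reset_index(drop=True) discards the current index instead of
# turning it back into a column
print(df_by_name.reset_index(drop=True).columns.tolist())  # ['Age']
```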



In [None]:
# Set index
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Country': ['USA', 'France', 'UK', 'Australia']}

df = pd.DataFrame(data)
print(df, end = "\n\n")

# set 'Name' column as index
df_new = df.set_index('Name')
print(df_new)

Question:

What is the index of the following DataFrame?
```
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Country': ['USA', 'France', 'UK', 'Australia']}

df = pd.DataFrame(data)
df.set_index(['Name','Age'])
```

In [None]:
# Reset index
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Country': ['USA', 'France', 'UK', 'Australia']}

df = pd.DataFrame(data)

# set 'Name' column as index
df_new = df.set_index('Name')
print(df_new, end = "\n\n")

print(df_new.reset_index())
#print(df_new.reset_index(drop = True))

Question:
What is the index of the following dataframe?
```
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Country': ['USA', 'France', 'UK', 'Australia']}

df = pd.DataFrame(data)
df.set_index(['Name','Age']).reset_index(level = 0)
```

## *Quick Summary Info of a DataFrame*


In [None]:
import numpy as np
# create data
data = {'int_col': [1, 2, 3, 4, 5],
        'float_col': [1.1, 2.2, 3.3, np.nan, 5.5],
        'str_col': ['apple', 'banana', 'cherry', 'date', np.nan],
        'datetime_col': [pd.Timestamp('2022-01-01'), pd.Timestamp('2022-01-02'), pd.Timestamp('2022-01-03'), pd.NaT, pd.Timestamp('2022-01-05')]}

df = pd.DataFrame(data)

print(df.head(3))

print(df.shape)

In [None]:
# Show column types
print(df.dtypes)


In [None]:
df.info()  # prints the summary itself and returns None; try memory_usage = "deep"

In [None]:
# describe method
print(df.describe()) #include = "all"

## *Data Type Conversion*
In Pandas, you can use the `astype()` method to convert the data types of one or more columns in a DataFrame.

Note that the `astype()` method **by default (`copy = True`)** returns a new object with the specified data type conversions and leaves the original unchanged. If you want to modify the original DataFrame, you need to assign the result of the `astype()` method back to the original DataFrame.
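
As a hedged sketch of converting several columns in one call, `astype()` also accepts a dictionary that maps column labels to target types. The names `df_types` and `converted` are made up for this example.

```python
import pandas as pd

df_types = pd.DataFrame({'Age': [25, 30], 'Salary': ['10000', '20000']})

# A dict maps column labels to target dtypes
converted = df_types.astype({'Age': 'float64', 'Salary': 'int64'})

print(converted.dtypes)
print(df_types.dtypes)  # the original DataFrame is unchanged
```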


In [None]:
data = {'Name': ['John', 'Mary', 'Peter', 'Jane'],
        'Age': [25, 30, 20, 35],
        'City': ['New York', 'Paris', 'London', 'Sydney'],
        'Salary': ['10000', '20000', '30000', '40000']}

df = pd.DataFrame(data)
print("Age: ", df.Age.dtype)
print("Salary: ", df.Salary.dtype, end = "\n\n")


# convert 'Age' column to float
df['Age'] = df['Age'].astype(float)

# convert 'Salary' column to int
df['Salary'] = df['Salary'].astype(int)

# print DataFrame
print(df.dtypes)

Question:

We can also use `pd.to_numeric()` function to convert to float numbers. What is the output of the following statements?
```
data = {'Values': [10.5, '20.2', 'abc', 30.7, 'xyz']}

df_mixed = pd.DataFrame(data)

df_mixed['Values'] = pd.to_numeric(df_mixed['Values'])         
print(df_mixed.Values.dtype)
```

a. `float64`

b. `int64`

c. `object`

d. None of the above

In [None]:
data = {'Values': [10.5, '20.2', 'abc', 30.7, 'xyz']}

df_mixed = pd.DataFrame(data)

df_mixed['Values'] = pd.to_numeric(df_mixed['Values'], errors='coerce')  # values that cannot be parsed become NaN
print(df_mixed.Values.dtype)
#df_mixed

## *Add and Drop Rows/Columns*

### **Add Rows**
In Pandas, you can add rows to a DataFrame by concatenating two DataFrames with `pd.concat()`, which is the preferred method. The older `DataFrame.append()` method was deprecated in pandas 1.4 and removed in pandas 2.0.


In [None]:
import pandas as pd

# create initial dataframe
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df, end = '\n\n')

# concatenate dataframes
new_df = pd.DataFrame({'A': [5], 'B': [6]})
df = pd.concat([df, new_df], ignore_index=True)

print(df)

### **Add Columns**

To add one or more columns to a DataFrame in Pandas, you can simply assign a new column (or multiple columns) to the DataFrame using bracket notation.


In [None]:
import pandas as pd

# create initial dataframe
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# add a new column with a constant value
df['C'] = 5 # don't use df.C = 5: attribute assignment does not create a column

print(df, end = '\n\n')

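# add multiple columns at once; the right-hand DataFrame is aligned on the
# row index, so rows without a match (here row 1) get NaN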
df[['E', 'F']] = pd.DataFrame([[6, 7]])

# print updated dataframe
print(df)

### **Drop Rows/Columns**
To drop rows or columns from a DataFrame in Pandas, you can use the `drop()` method. By default, it returns a new DataFrame and leaves the original unchanged, but you can set `inplace = True` to change the original DataFrame.



In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.index = ["a", "b", "c"]

# drop a column by label
print(df.drop('C', axis=1), end = "\n\n")
# or use print(df.drop(columns = 'C'))

print(df)


In [None]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.index = ["a", "b", "c"]
# drop a row by label
print(df.drop("a", axis=0), end = '\n\n')

# drop multiple rows by label
print(df.drop(["a", "c"], axis=0))
# or use:  print(df.drop(index = ["a", "c"]))


## *Subset a DataFrame*

### **Indexing and Slicing**
There are several ways to index a Pandas DataFrame:

* By column name: You can index a DataFrame by specifying the column name using the square bracket notation, e.g., `df['column_name']`.

* By row/column label: You can index a DataFrame by specifying the row/column label using `loc`, e.g., `df.loc[row_label], df.loc[row_label, col_label]`.

* By integer position: You can index a DataFrame by specifying the integer position using `iloc`, e.g., `df.iloc[row_position]`.

* By Boolean indexing: You can index a DataFrame by specifying a Boolean condition using the square bracket notation, e.g., `df[df['column_name'] > 0]`.

#### *Select Columns using `[]`*


In [None]:
import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index = ["a", "b", "c", "d"])

print(df["name"], end = '\n\n')

print(df[["name", "city"]])

#### *Using `loc`*


In [None]:
import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index = ["a", "b", "c", "d"])

print(df.loc['a'], end = '\n\n') # return a Series
print(df.loc[['a']]) # return a dataframe


In [None]:
# Select multiple rows with list or slice
print(df.loc[['a', 'a', 'd']])
print(df.loc['a':'d'])

In [None]:
# Select rows and columns
print(df.loc[['a', 'd'], ["name", "age"]])

#### *Using `iloc`*


In [None]:
import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index = ["a", "b", "c", "d"])

print(df.iloc[0])
print(df.iloc[[0]])

In [None]:
# Select using a list of integers and a slice object
print(df.iloc[[0,2]])
print(df.iloc[0:3])


In [None]:
# Select rows and columns using positions
print(df.iloc[[1,2], [0,1]])

#### *Boolean Indexing*


In [None]:
# Create a simple DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index = ["a", "b", "c", "d"])

# Use []
print(df[df.age > 30])

In [None]:
# Combine boolean conditions using & and |; do not use `and` / `or`
# Use ~ instead of `not` to negate a boolean condition
print(df[(df.age > 30) & df.city.isin(["New York", "Chicago"])])

#print(df[(df.age > 30) | df.city.isin(["New York", "Chicago"])])

#print(df[~(df.age > 30)])

In [None]:
# Use .loc[] to get rows and columns
print(df.loc[df.age > 30, ["name", "age"] ])

Question:

What will happen if we run

```
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index = ["a", "b", "c", "d"])

print(df.iloc[df.age > 30, ["name", "age"] ])
```

### **Filtering with Filter Method**
The `filter()` method in Pandas DataFrame is used to subset the DataFrame based on the column names or row labels. It selects by label only and never looks at the data values.

**Syntax:** `df.filter(items=None, like=None, regex=None, axis=None)`

* `items:` a list-like object containing the labels to be selected.
* `like:` a string; labels that contain it as a substring are kept.
* `regex:` a string containing a regular expression; labels that match it are kept.
* `axis:` the axis to filter. By default, axis=columns.
* The `items, like, and regex` parameters are enforced to be mutually exclusive.



In [None]:
# Create a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [25, 30, 35],
        'Big City': ['New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
print(df)

In [27]:
# Filter columns by name
print(df.filter(items=['Name', 'Age']))

    Name  Age
0   John   25
1  Alice   30
2    Bob   35


In [None]:
# Filter columns by regular expression
print(df.filter(regex='e$')) # selects columns ending with 'e'

In [None]:
# Filter columns by substring
print(df.filter(like='City')) # selects columns containing 'City'


In [None]:
# Filter rows by index labels
print(df.filter(items=[0, 2], axis=0)) # selects rows with index labels 0 and 2

## *Sort a DataFrame*

### **Sort_values Method**
To sort a DataFrame (either rows or columns) in Pandas, you can use the `sort_values()` method. By default, it sorts rows.

Note that the `sort_values()` method returns a new DataFrame object with the specified sort order, and does not modify the original DataFrame in place. If you want to modify the original DataFrame, you can either assign the result of `sort_values()` back to the original DataFrame, or use the `inplace=True` parameter to modify the DataFrame in place.



In [None]:
import pandas as pd

# create initial dataframe
df = pd.DataFrame({'A': [2, 1, None, 2, 3, 2], 'B': [5, 4, 12, 6, 3, 9], 'C': [8, 7, 23, 9, 4, 12]})

print(df)


In [None]:
# sort by values in column A
print(df.sort_values(by='A'))

In [None]:
# sort by values in column A and B
print(df.sort_values(by=['A', 'B']))
print(df.sort_values(by=['A', 'B'], ignore_index = True))


In [None]:
# sort by column A in descending order, then column B in ascending order; missing values come first
print(df.sort_values(by=['A', 'B'], ascending= [False, True], na_position = "first"))

### **Sort_index Method**
To sort a DataFrame in Pandas by its index/columns, you can use the `sort_index()` method. Similarly, you can set `inplace = True` to modify the original DataFrame.



In [None]:
import pandas as pd

# create initial dataframe
df = pd.DataFrame({'A': [2, 1, 3], 'B': [5, 4, 6], 'C': [8, 7, 9]}, index=['c', 'a', 'b'])
print(df)


In [None]:
# sort by index in ascending order
print(df.sort_index())
print(df.sort_index(ignore_index = True))

# sort by index in descending order
#print(df.sort_index(ascending=False))


In [None]:
# sort by columns in descending order
print(df.sort_index(axis = 1, ascending=False))


## *Duplicated Rows in a DataFrame*

### **Find Duplicated Rows**
The `duplicated()` method returns a boolean Series indicating which rows are duplicates. Two questions help you decide how to identify duplicated rows:
* How to define duplicated rows? Use all the columns or just some of the columns?
* How to mark duplicated rows? Mark all the rows with the same value or leave one of them as non-duplicated and the others as duplicated?



In [None]:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Emily', 'Alice', 'Bob'],
        'age': [25, 30, 35, 25, 40, 25, 30],
        'height': [165, 175, 172, 165, 173, 165, 180]}
df = pd.DataFrame(data)

print(df)


In [None]:
# Check for duplicated rows
duplicated_rows = df.duplicated()

# Display the boolean series
print(duplicated_rows)

In [None]:
# Mark all duplicated rows
print(df, end = '\n\n')
print(df.duplicated(keep = False))

In [None]:
# Find duplicates rows based on a subset of columns
print(df, end = '\n\n')
print(df.duplicated(subset = ["name", 'age']))

### **Remove Duplicated Rows**
The `drop_duplicates()` method returns a DataFrame with duplicate rows removed.

* Parameter `subset` : column label or sequence of labels. Only consider certain columns for identifying duplicates, by default use all of the columns.
* The `inplace` parameter allows you to modify the DataFrame in place. By default, it is set to `False`
* Parameter `ignore_index` : bool, default `False`. If `True`, the resulting axis will be labeled `0, 1, …, n - 1.`
* Parameter `keep` : `{'first', 'last', False}`, default `'first'`. Determines which duplicates (if any) to keep.
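
The effect of `keep` is easiest to see from which row labels survive. A minimal sketch on a made-up table (`df_dup` is a name chosen for this example):

```python
import pandas as pd

df_dup = pd.DataFrame({'name': ['Alice', 'Bob', 'Alice', 'Alice'],
                       'age': [25, 30, 25, 25]})

# Rows 0, 2 and 3 are identical; row 1 is unique
print(df_dup.drop_duplicates(keep='first').index.tolist())  # [0, 1]
print(df_dup.drop_duplicates(keep='last').index.tolist())   # [1, 3]
print(df_dup.drop_duplicates(keep=False).index.tolist())    # [1]
```

With `keep=False`, every row that has a duplicate is removed, so only the unique row remains.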



In [None]:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Emily', 'Alice', 'Bob'],
        'age': [25, 30, 35, 25, 40, 25, 30],
        'height': [165, 175, 172, 165, 173, 165, 180]}
df = pd.DataFrame(data)

print(df)

In [None]:
# Drop all duplicated rows except the first occurrence.
print(df.drop_duplicates())
# print(df.drop_duplicates(keep = False))

In [None]:
# Regenerate the index
print(df.drop_duplicates(ignore_index = True))

In [None]:
# Use a subset of columns to drop duplicates
print(df.drop_duplicates(subset = ["name"]))

## *Missing Values in a DataFrame*

### **Find Missing Values**
`isna() or isnull()`: These methods return a DataFrame of the same shape as the original DataFrame, where each value is either `True` or `False`, indicating whether it is a missing value.


In [None]:
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, None, 7, 8],
        'C': [9, 10, 11, pd.NA]}
df = pd.DataFrame(data)

print(df, "\n\n")
# Check for missing values
# print(df['A'].hasnans)
print(df.isna())
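
Because `True` counts as 1 when summed, the boolean result of `isna()` can be reduced to counts. The sketch below uses a made-up table; `df_na` and `na_per_column` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1, 2, np.nan, 4],
                      'B': [5, None, 7, None]})

# Number of missing values in each column
na_per_column = df_na.isna().sum()
print(na_per_column)

# Rows that contain at least one missing value
rows_with_na = df_na.isna().any(axis=1)
print(rows_with_na)
```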

### **Drop Rows/Columns with Missing Values**
`dropna()` method is used to remove rows or columns with missing values.

*Important parameters:*
* `axis`: `{0 or 'index', 1 or 'columns'}`, default 0. Determine if rows or columns which contain missing values are removed.
* `how`: `{'any', 'all'}`, default `'any'`. Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
* `thresh` : `int`, optional. Require that many `non-NA` values. Cannot be combined with how.
* `subset` : column label or sequence of labels, optional. Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
* `inplace` : `bool`, default `False`. Whether to modify the DataFrame rather than creating a new one.
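
To compare the `how` and `subset` parameters side by side, the sketch below prints which row labels survive each call. The table `df_drop` is made up for this example.

```python
import numpy as np
import pandas as pd

df_drop = pd.DataFrame({'A': [1, np.nan, np.nan],
                        'B': [4, 5, np.nan],
                        'C': [7, 8, np.nan]})

# how='any' (default): drop a row if any value is missing
print(df_drop.dropna(how='any').index.tolist())     # [0]

# how='all': drop a row only if every value is missing
print(df_drop.dropna(how='all').index.tolist())     # [0, 1]

# subset: only look at column 'B' when deciding
print(df_drop.dropna(subset=['B']).index.tolist())  # [0, 1]
```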



In [None]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, 13]}
df = pd.DataFrame(data)

print(df)


In [None]:
# Drop rows with missing values
print(df.dropna()) # axis = 1


In [None]:
# Keep rows with at least "thresh" number of non-missing values
print(df, '\n\n')
print(df.dropna(thresh = 3))


In [None]:
# Drop missing rows based on a subset of columns
print(df, '\n\n')
print(df.dropna(subset = ["A", "C"]))

### **Fill in Missing Values**
We can use
* `fillna()` method to fill missing values with a specified value or method.
* `interpolate()` method to fill missing values with interpolated values based on neighboring values.

#### *Fillna Method*
Important parameters:

* `value:` scalar, dict, Series, or DataFrame, optional. The value to use to fill NaN values. If it is a dict, it is used to fill NaN values with the specified values for each column.

* `method: {'backfill', 'bfill', 'pad', 'ffill', None},` optional. The method to use for filling missing values. `'pad' or 'ffill'` propagates the last valid value forward, so each missing value is filled from the nearest valid value before it. `'backfill' or 'bfill'` fills each missing value from the next valid value after it. If `None`, it fills with the specified value. Note that this parameter is deprecated since pandas 2.1 in favor of the `ffill()` and `bfill()` methods.

* `axis: {0 or 'index', 1 or 'columns'}`, optional. The axis to fill the missing values along. By default, it is 0 or 'index'.

* `inplace:` bool, optional. If True, fill the DataFrame in place and return None. Otherwise, return a new DataFrame with the missing values filled.
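
The direction of forward and backward filling is easy to mix up, so here is a minimal sketch on a single made-up column. It uses the `ffill()` and `bfill()` methods; `df_fill` is a name chosen for this example.

```python
import numpy as np
import pandas as pd

df_fill = pd.DataFrame({'A': [1.0, np.nan, np.nan, 4.0]})

# Forward fill: each gap takes the last valid value above it
print(df_fill.ffill()['A'].tolist())  # [1.0, 1.0, 1.0, 4.0]

# Backward fill: each gap takes the next valid value below it
print(df_fill.bfill()['A'].tolist())  # [1.0, 4.0, 4.0, 4.0]
```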



In [None]:
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8],
        'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)
print(df)

In [None]:
# Fill missing values with a specified value
print(df.fillna(value=0))

In [None]:
# Fill missing values with a dictionary. Fill missing values with the specified values for each column
fill_values = {'A': 0, 'B': 10, 'C': 20}
print(df, '\n\n')
print(df.fillna(value=fill_values))

In [None]:
# Use forward fill method. top to bottom
print(df, end = '\n\n')
print(df.ffill())  # same result as df.fillna(method = "ffill"), which is deprecated since pandas 2.1


In [None]:
# Use forward fill method. left to right
print(df, end = '\n\n')
print(df.ffill(axis = 1))  # same result as df.fillna(method = "ffill", axis = 1)

#### *Interpolate Method*


In [None]:
data = {'A': [1, np.nan, 3, np.nan, 5],
        'B': [np.nan, 7, 8, np.nan, 10],
        'C': [5, 10, 20, 12, None]}
df = pd.DataFrame(data)
print(df)


In [None]:
# Fill missing values with interpolated values
df.interpolate(method='linear', axis=0, inplace=True)

# Display the modified DataFrame
print(df)
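
As a worked illustration of what linear interpolation computes, the sketch below uses two short made-up Series. With `method='linear'` the values are treated as equally spaced, so a gap is filled in equal steps between its neighbors. The names `s_gap` and `s_lead` are chosen for this example.

```python
import numpy as np
import pandas as pd

# Two missing values between 1 and 4 are filled in equal steps
s_gap = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s_gap.interpolate(method='linear').tolist())  # [1.0, 2.0, 3.0, 4.0]

# With the default settings, a leading missing value has no earlier
# neighbor to interpolate from and stays missing
s_lead = pd.Series([np.nan, 7.0, np.nan, 9.0])
print(s_lead.interpolate(method='linear'))
```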

## *Operations on a DataFrame*

### **Arithmetic Operations**
Like Series, you can do all the common arithmetic operations on a DataFrame.


In [None]:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}, index = list("abc"))

print(df)

In [None]:
# Sum of two columns
print(df, "\n")
print(df['col1'] + df['col2'])

In [None]:
# Use method instead of using operator
print(df, '\n')
print(df.sub(df['col2'], axis = "index"))

In [None]:
# Subtract different values from different columns
print(df, '\n')
print(df - [1,5])

In [None]:
# Subtract values from columns
print(df, '\n')
print(df.sub([1, 5], axis='columns'))

In [None]:
# Subtract values from rows.
print(df, '\n')
print(df.sub([10, 20, 30], axis='index'))

### **Comparison Operations**
You can compare two columns in a DataFrame to get a boolean Series.


In [None]:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 5, 6]})

print(df['col1'] != df['col2'])

### **Aggregation Method**
Like Series, Pandas provides many built-in aggregation methods that can be used to summarize data in a data frame. These methods can be applied to individual columns or rows.



In [None]:
# sum, mean, median, min, max, std, var, count
df = pd.DataFrame({'col1': [1, 2, None, 3, 10], 'col2': [1, 5, 6, 7, None]})

print(df, '\n')
print(df.sum(axis = 1), '\n')
print(df.std())
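
The built-in aggregation methods skip missing values by default. A minimal sketch of that behavior on a made-up table (`df_agg` is a name chosen for this example):

```python
import pandas as pd

df_agg = pd.DataFrame({'col1': [1, 2, None], 'col2': [4, None, 6]})

# Missing values are skipped by default (skipna=True)
print(df_agg.sum())

# With skipna=False, any missing value makes the result missing
print(df_agg.sum(skipna=False))

# count() reports the number of non-missing values per column
print(df_agg.count())
```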

#### *Agg() Method*
The `agg()` method in Pandas dataframe is used to perform aggregation operations on the specified columns/rows of a data frame.



In [None]:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

print(df)

In [None]:
# apply aggregation functions to columns
print(df.agg(['sum', 'mean', 'max']))

In [None]:
# apply an aggregation function to rows
print(df, '\n')
print(df.agg("sum", axis = 1))  # the string name avoids the warning newer pandas raises for the builtin sum

In [None]:
# using dictionary
print(df,'\n')
print(df.agg({"col1": ["mean", "std"], "col2": ["sum", "var"]}))

In [None]:
# using tuple to rename the rows
print(df,'\n')
print(df.agg(a = ("col2", "mean"), b = ("col1", "sum")))

### **Apply() Method**
The `apply()` method in Pandas dataframe is used to apply a function along a specific axis of the data frame. The function can be either a built-in function or a user-defined function



In [None]:
# create a sample data frame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

print(df)

In [None]:
# Column sum
print(df.apply(sum))

In [None]:
# Row sum
print(df, '\n')
print(df.apply(sum, axis = 1))

In [None]:
# Apply a user-defined function to rows
def fcn(ser):
  ser_sum = ser.sum()
  if ser_sum > 8:
    return [ser_sum, 100]
  else:
    return [ser_sum, 0]

print(df, '\n')
print(df.apply(fcn, axis = 1))

In [None]:
# Return a dataframe
print(df, '\n')
print(df.apply(fcn, axis = 1, result_type="expand")) # 'broadcast'

### **Applymap() Method**
The `applymap()` method in Pandas data frame is used to apply a function to each element in the data frame. Since pandas 2.1 it is deprecated in favor of `DataFrame.map()`, which does the same thing.
* The function can be either a built-in function or a user-defined function which returns a single value from a single value.
* The `applymap()` method returns a new data frame with the result of the function applied to each element in the original data frame.



In [None]:
# create a sample data frame
df = pd.DataFrame({'A': ["apple", "animal", None, "air"], 'B': ["blue", "bubble", "bunny", pd.NA]})

def fcn(el, power):
  return len(el) ** power

print(df, '\n')
print(df.applymap(fcn, power = 3, na_action='ignore'))
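
Newer pandas releases (2.1 and later) provide `DataFrame.map()` for the same element-wise operation. The sketch below picks whichever method the installed version offers; `df_words`, `elementwise`, and `lengths` are names made up for this example.

```python
import pandas as pd

df_words = pd.DataFrame({'A': ['apple', 'air'], 'B': ['blue', 'bunny']})

# Use DataFrame.map when it exists, otherwise fall back to applymap
elementwise = df_words.map if hasattr(df_words, 'map') else df_words.applymap

# Apply len to every element
lengths = elementwise(len)
print(lengths)
```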

## *Split-Apply-Combine Paradigm*
The Split-Apply-Combine paradigm is a powerful data analysis technique that is commonly used in Pandas. This paradigm involves three key steps:

1. Splitting the data into groups based on a given criterion.
2. Applying a function to each group independently.
3. Combining the results of each group into a single data structure.

### **Groupby Method**
In Pandas, the `groupby()` method is the key method to implement the Split-Apply-Combine paradigm.
* The `groupby()` method generates a `GroupBy` object which splits a data frame into groups based on a given criterion.
* Then we can apply a function to the object. The function will be applied to each group.
* Finally, the results of each group will be combined into a single data structure.
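
To make the three steps visible, the sketch below carries them out by hand on a made-up table and compares the result with the one-line `groupby` call. The names `df_pay` and `manual` are chosen for this example.

```python
import pandas as pd

df_pay = pd.DataFrame({'gender': ['F', 'M', 'M', 'F'],
                       'salary': [50000, 60000, 70000, 90000]})

# Split:   iterating over a GroupBy object yields (key, sub-DataFrame) pairs
# Apply:   compute the mean salary of each sub-DataFrame
# Combine: collect the per-group results into a single Series
manual = pd.Series({key: group['salary'].mean()
                    for key, group in df_pay.groupby('gender')})
print(manual)

# The same numbers in one call
print(df_pay.groupby('gender')['salary'].mean())
```

In practice the one-line form is preferred; the explicit loop is only there to show what happens at each step.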



In [77]:
# create a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'gender': ['F', 'M', 'M', 'M', 'F']
})

# group the data by gender
grouped = df.groupby('gender')


In [None]:
# calculate the mean salary for each group
print(grouped['salary'].mean())

In [None]:
# Apply multiple functions.
print(grouped['salary'].agg(["min", "max", "std"]))

In [None]:
# Apply one function to multiple columns.
print(grouped.mean(numeric_only = True))

#### *Groupby with Apply Method*
The `groupby` method in pandas allows you to group rows of a DataFrame based on one or more columns and apply a function to each group. The `apply` method can then be used to apply a custom function to each group of the grouped DataFrame.



In [None]:
df = pd.DataFrame({'group': ['A', 'B', 'B', 'A', 'A', 'B'],
                   'value': [1, 2, 3, 4, 5, 6]})
print(df)

In [None]:
# Group the DataFrame by the 'group' column
grouped = df.groupby('group')
# Define a custom function to apply to each group
def custom_function(group):
    return group['value'].sum()

# Apply the custom function to each group using the apply method
result = grouped.apply(custom_function)
print(result)

## *Combine Multiple DataFrames*

### **Concat Method**
The `pd.concat()` function in pandas is used to concatenate two or more pandas objects (typically DataFrames or Series) along a particular axis, either rows or columns.



In [None]:
# Create two sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [4, 5, 6, 7]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'D': [10, 11, 12]})

print(df1, '\n')
print(df2)

In [None]:
# Concatenate the two DataFrames horizontally
print(pd.concat([df1, df2], axis=1))

In [None]:
# Concatenate the two DataFrames vertically
print(pd.concat([df1, df2], axis=0))

In [None]:
# Reset index
print(pd.concat([df1, df2], axis=0, ignore_index = True))

In [None]:
# Use the union of the column labels (join = 'outer', the default)
print(pd.concat([df1, df2], axis=0, join= 'outer'))

In [None]:
# Use the intersection of the column labels (join = 'inner')
print(pd.concat([df1, df2], axis=0, join= 'inner'))

### **Merge Method**
The `merge` method in pandas is used to merge two DataFrames based on common columns or indexes. It's similar to the JOIN operation in `SQL`.



In [None]:
# Create two sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'A', 'D', 'F'], 'value': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'A'], 'value': [5, 6, 7, 8]})

print(df1, '\n')
print(df2)

In [None]:
# Merge the two DataFrames on the 'key' column
print(pd.merge(df1, df2, on='key')) # inner join
# or
#print(pd.merge(df1, df2, left_on='key', right_on = 'key'))

In [None]:
# inner, outer, left, right join
print(df1, '\n')
print(df2, '\n')
print(pd.merge(df1, df2, on='key', how = 'left')) # 'outer', 'left', 'right'
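
To compare the four join types at a glance, the sketch below merges two small made-up tables and prints which keys survive each join. The names `left`, `right`, and `outer` are chosen for this example; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'D'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'D', 'E'], 'right_val': [5, 6, 7]})

# Show the keys kept by each join type
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how)
    print(how, merged['key'].tolist())

# '_merge' is 'both', 'left_only' or 'right_only' for each row
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```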

**Merge on Index**


In [None]:
# Create two sample DataFrames with partially overlapping indexes
df1 = pd.DataFrame({'value': [1, 2, 3, 4], 'col1': ["B", "E", "A", "C"]}, index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame({'value': [5, 6, 7, 8]}, index=['B', 'D', 'E', 'F'])

print(df1, '\n')
print(df2, '\n')
# Merge the two DataFrames on the index
print(pd.merge(df1, df2, left_index=True, right_index=True))

In [None]:
# Merge on index and column
print(df1, '\n')
print(df2, '\n')
print(pd.merge(df1, df2, left_on="col1", right_index=True))