# DataFrame Manipulations

> Data Cleaning in Pandas requires the manipulation of `DataFrames` through adding, removing, and altering data. Techniques covered in this lesson include adding new columns and rows, dropping unnecessary ones, and applying elementwise modifications to the data, for example by applying a function to each element. Merging data from different sources and grouping for detailed analysis are also key aspects. These processes are fundamental to maintaining data integrity and usability in Pandas.

## Adding and Removing Data


### Adding a Column

#### 1. Direct Assignment:
The simplest way to add a new column is by assigning a list or array to a new column name. 



In [1]:
import pandas as pd
my_dict = {'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]}
base_df = pd.DataFrame(my_dict)
base_df.head()

Unnamed: 0,Animal,Age
0,Dog,2
1,Cat,4
2,Bird,1


In [2]:

base_df['New_Column'] = [1, 2, 3]  # Adds 'New_Column' with specified values
base_df.head()

Unnamed: 0,Animal,Age,New_Column
0,Dog,2,1
1,Cat,4,2
2,Bird,1,3
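
Direct assignment is not limited to lists. A single scalar is broadcast to every row, and a new column can be computed from an existing one. The sketch below uses a small stand-in `DataFrame`; the column names `Species_Known` and `Age_In_Months` are illustrative and not part of the lesson's dataset.

```python
import pandas as pd

pets = pd.DataFrame({'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]})

# A scalar is broadcast to every row
pets['Species_Known'] = True

# A new column can be derived from an existing column
pets['Age_In_Months'] = pets['Age'] * 12

print(pets)
```

Note that assigning a list whose length differs from the number of rows raises a `ValueError`.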




#### 2. Using `assign()`:
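The `assign()` method returns a new `DataFrame` containing the extra columns and leaves the original `DataFrame` unchanged. To keep the result you must assign it to a variable; in the cell below we overwrite `base_df` with it.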



In [3]:

base_df = base_df.assign(New_Column2=[4, 5, 6])
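
To see that `assign()` really does leave its input alone, the result can be stored under a different name and the two objects compared. This is a minimal sketch on a stand-in `DataFrame`; the names `original`, `with_months` and `Age_In_Months` are made up for illustration. It also shows that `assign()` accepts a function of the `DataFrame`, which is handy when the new column depends on existing ones.

```python
import pandas as pd

original = pd.DataFrame({'Animal': ['Dog', 'Cat', 'Bird'], 'Age': [2, 4, 1]})

# assign() returns a new DataFrame; the callable receives the DataFrame itself
with_months = original.assign(Age_In_Months=lambda df: df['Age'] * 12)

print(list(original.columns))     # the original still has two columns
print(list(with_months.columns))  # the copy has the extra column
```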




### Adding Rows



#### Using `pd.concat()`
> The recommended way to add new rows, or to append one `DataFrame` to another, is the `pd.concat()` function. Older versions of Pandas had a `.append()` method that allowed you to add a single row, but it was deprecated in v1.4 and removed in v2.0.


In [4]:


# Create a DataFrame holding the new row, using the same columns as base_df
additional_data = pd.DataFrame([['Horse', 3, 4, 7]], columns=base_df.columns)
# Append the new row to the original DataFrame and renumber the index
base_df = pd.concat([base_df, additional_data], ignore_index=True)
base_df.head()


Unnamed: 0,Animal,Age,New_Column,New_Column2
0,Dog,2,1,4
1,Cat,4,2,5
2,Bird,1,3,6
3,Horse,3,4,7



An important consideration here is the `ignore_index` parameter, which is used to control how the index is handled during concatenation.

- `ignore_index=True`: With this setting, Pandas resets the index to the default integer index. This means the resulting `DataFrame` will have a new index ranging from `0` to `n-1`, where `n` is the length of the `DataFrame`. Values in the second `DataFrame` are appended to the end of the first `DataFrame`, and the index is renumbered to reflect this continuous sequence.

- `ignore_index=False` (default behavior): When set to `False`, Pandas preserves the original indices of the concatenated `DataFrames`. In this case, the indices from each `DataFrame` are maintained, and the final `DataFrame` reflects these original indices. This might lead to duplicate index values if the original `DataFrames` have overlapping indices. The data are still combined as expected, with the contents of the second `DataFrame` following those of the first, but the original order and index values from each `DataFrame` are kept intact.

The choice between `ignore_index=True` and `ignore_index=False` depends on whether the original index carries meaningful information for your data and whether unique, non-overlapping indices are needed post-concatenation. 


In [5]:
# Creating two sample DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenating without ignore_index
result_without_ignore_index = pd.concat([df1, df2])
print("With ignore_index=False:\n", result_without_ignore_index)

# Concatenating with ignore_index
result_with_ignore_index = pd.concat([df1, df2], ignore_index=True)
print("\nWith ignore_index=True:\n", result_with_ignore_index)

With ignore_index=False:
    A  B
0  1  3
1  2  4
0  5  7
1  6  8

With ignore_index=True:
    A  B
0  1  3
1  2  4
2  5  7
3  6  8
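
Duplicate index labels matter because label-based lookups then return more than one row. The sketch below reuses the two small `DataFrames` from the cell above to show this, and also shows the `verify_integrity=True` option of `pd.concat()`, which raises an error instead of silently producing duplicates.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

stacked = pd.concat([df1, df2])  # index labels are 0, 1, 0, 1

# A lookup by label returns every row that carries that label
rows_labelled_zero = stacked.loc[0]
print(rows_labelled_zero)

# verify_integrity=True makes concat raise rather than duplicate labels
try:
    pd.concat([df1, df2], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("concat refused:", err)
```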




### Dropping Rows and Columns

- **Dropping Rows:**
  - Rows can be removed using the `drop()` method, either by passing the index labels together with `axis=0` or, as in the example, by passing them to the `index` keyword
  - Example:
    ```python
    df.drop(index=[0, 1], inplace=True)  # Drops rows with index 0 and 1
    ```

- **Dropping Columns:**
  - To drop columns, use the `drop()` method, either by passing the column names together with `axis=1` or, as in the example, by passing them to the `columns` keyword
  - Example:
    ```python
    df.drop(columns=['Column1', 'Column2'], inplace=True)
    ```

- **Using `dropna()`:**
  - The `dropna()` method is useful for removing rows or columns with missing values
  - Example:
    ```python
    df.dropna(axis=0, inplace=True)  # Drops rows with any NaN values
    df.dropna(axis=1, inplace=True)  # Drops columns with any NaN values
    ```

These tools are integral to shaping your dataset into the desired format for analysis, allowing for the efficient manipulation of data in Pandas.
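
The snippets above use `inplace=True`, which modifies `df` directly. Without it, `drop()` and `dropna()` return a new `DataFrame` and leave the original untouched. The following sketch, on a made-up `DataFrame`, shows that behaviour along with the `subset` and `how` parameters of `dropna()`, which give finer control than dropping every row that contains any `NaN`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cal', 'Dee'],
    'Score': [90.0, np.nan, 75.0, np.nan],
    'Notes': [np.nan, np.nan, 'late', 'absent'],
})

# Without inplace=True, drop() returns a new DataFrame
without_notes = df.drop(columns=['Notes'])
without_first = df.drop(index=[0])

# Drop rows only where 'Score' is missing, ignoring NaN elsewhere
scored_only = df.dropna(subset=['Score'])

# Drop rows only where both 'Score' and 'Notes' are missing
not_all_missing = df.dropna(how='all', subset=['Score', 'Notes'])

print(scored_only)
print(not_all_missing)
```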



## Merging DataFrames

>Merging is an important operation in Pandas that allows you to combine different sets of data. The `.merge()` function is used to bring together data from separate sources based on a shared key column. This technique is particularly useful when dealing with datasets from different systems or sources that need to be analysed together.


### Basic Merge

Merging two `DataFrame`s is often based on one or more common columns. The default merge type is an **inner** join, which combines only the rows with matching values in both `DataFrames`.

In the example below, we have a `DataFrame` of customer support issues, recording each issue raised, its feedback score, and its outcome. We also have a `DataFrame` recording which product each customer purchased. The `CustomerID` column is common to both tables. We can use the `pd.merge()` function with its default settings and with `CustomerID` as the merge **key** to produce a table with one row per support issue, with the customer's product added as an extra column.


In [26]:

import pandas as pd

# Customer data from the Sales Department
customer_data = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C004', 'C005', 'C006'],
    'Product': ['Laptop', 'Printer', 'Tablet', 'Monitor', 'Tablet', 'Laptop'],
})

# Extended Customer feedback and issues data from the Customer Service Department
service_data = pd.DataFrame({
    'CustomerID': ['C001', 'C001', 'C002', 'C003', 'C003', 'C004'],
    'Issue_ID': ['I001', 'I002', 'I003', 'I004', 'I005', 'I006'],
    'Feedback_Score': [4, 3, 3, 5, 4, 2],
    'Issue_Resolved': ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
})

# Merging the Sales and Customer Service data on 'CustomerID'
merged_customer_data = pd.merge(service_data, customer_data, on='CustomerID')
merged_customer_data.head()



Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4,Yes,Laptop
1,C001,I002,3,No,Laptop
2,C002,I003,3,Yes,Printer
3,C003,I004,5,No,Tablet
4,C003,I005,4,Yes,Tablet



### Types of Joins

- **Inner Join:** Keeps only the rows whose key values appear in both `DataFrames`
- **Left Join:** Keeps all rows from the left `DataFrame`, adding data from the right `DataFrame` where the key matches
- **Right Join:** Keeps all rows from the right `DataFrame`, adding data from the left `DataFrame` where the key matches
- **Full (Outer) Join:** Keeps all rows from both `DataFrames`, matching them up where the key appears in both

Note that all types except for `inner` might produce missing values. Run the code cells below to see how the join types differ.

In [22]:


# Left Join
left_join_df = pd.merge(service_data, customer_data, how='left', on='CustomerID')
left_join_df

Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4,Yes,Laptop
1,C001,I002,3,No,Laptop
2,C002,I003,3,Yes,Printer
3,C003,I004,5,No,Tablet
4,C003,I005,4,Yes,Tablet
5,C004,I006,2,Yes,Monitor


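The left join gives the same result as the `inner` join here, because all of the customer IDs in `service_data` are also present in `customer_data`. The inner join output above appears one row shorter only because `.head()` displays the first five rows.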

In [25]:

# Right Join
right_join_df = pd.merge(service_data, customer_data, how='right', on='CustomerID')
right_join_df

Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4.0,Yes,Laptop
1,C001,I002,3.0,No,Laptop
2,C002,I003,3.0,Yes,Printer
3,C003,I004,5.0,No,Tablet
4,C003,I005,4.0,Yes,Tablet
5,C004,I006,2.0,Yes,Monitor
6,C005,,,,Tablet
7,C006,,,,Laptop


The right join has `NaN` values in the merged table, because there are values of `CustomerID` in `customer_data` that are not present in `service_data`.

In [28]:

# Full Join
full_join_df = pd.merge(service_data, customer_data, how='outer', on='CustomerID')
full_join_df


Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4.0,Yes,Laptop
1,C001,I002,3.0,No,Laptop
2,C002,I003,3.0,Yes,Printer
3,C003,I004,5.0,No,Tablet
4,C003,I005,4.0,Yes,Tablet
5,C004,I006,2.0,Yes,Monitor
6,C005,,,,Tablet
7,C006,,,,Laptop


In this instance, the merged table for the full `outer` join looks the same as the `right` join table. They would look different if there were customers in the `service_data` table that weren't present in the `customer_data` table.
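
When it is not obvious which table each row of a merge came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. The sketch below uses cut-down versions of the two tables; the customer `C009` is a made-up ID, added so that the service table contains a customer with no sales record.

```python
import pandas as pd

customer_data = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C005'],
    'Product': ['Laptop', 'Printer', 'Tablet'],
})
service_data = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C009'],  # C009 is hypothetical
    'Issue_ID': ['I001', 'I003', 'I007'],
})

# indicator=True records the origin of each row in a '_merge' column
audit = pd.merge(service_data, customer_data, how='outer',
                 on='CustomerID', indicator=True)
print(audit[['CustomerID', '_merge']])
```

With this data the outer join differs from the right join: `C009` appears as `left_only`, while `C005` appears as `right_only`.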

### Merging on Multiple Columns

Merging on multiple columns is useful when a single key isn't enough to accurately join data. For instance, in a real-world scenario, you might have data from two different departments of a company, where each department uses a combination of `Employee ID` and `Department ID` to uniquely identify records. In such cases, merging on both these columns would ensure accurate alignment of data related to specific employees in specific departments.


In [8]:


# DataFrame with employee salaries and department IDs
employee_salaries = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],  # Neither ID column is unique on its own
    'Salary': [50000, 60000, 55000, 58000]
})

# DataFrame with employee names and department IDs
employee_names = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Merging on 'Employee ID' and 'Department ID'
merged_df = pd.merge(employee_salaries, employee_names, on=['Employee ID', 'Department ID'])
merged_df.head()




Unnamed: 0,Employee ID,Department ID,Salary,Employee Name
0,1,101,50000,Alice
1,2,101,60000,Bob
2,1,102,55000,Charlie
3,1,103,58000,Diana
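
To see why both keys are needed, it helps to compare against a merge on `Employee ID` alone. In the sketch below, which reuses the two tables from the cell above, every salary row for employee `1` is paired with every name row for employee `1`, so the result contains spurious combinations.

```python
import pandas as pd

employee_salaries = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Salary': [50000, 60000, 55000, 58000],
})
employee_names = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
})

# One key: 3 x 3 pairings for employee 1, plus 1 row for employee 2
single_key = pd.merge(employee_salaries, employee_names, on='Employee ID')

# Two keys: each row matches exactly one partner
both_keys = pd.merge(employee_salaries, employee_names,
                     on=['Employee ID', 'Department ID'])

print(len(single_key))  # 10
print(len(both_keys))   # 4
print(list(single_key.columns))
```

In the single-key result, the unmerged `Department ID` columns from each table are kept separately with the default `_x` and `_y` suffixes.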


## Grouping Data


> Grouping data in Pandas is a powerful technique for data analysis, especially when dealing with large datasets. It involves segmenting data into subsets for more detailed analysis. This is commonly achieved using the `.groupby()` method, which groups data on certain criteria and allows for aggregate operations on the grouped data.



### Basic Grouping



Grouping typically involves selecting a column to group by and an aggregate function to apply. The output is a new `DataFrame` with one row per unique value in the column or columns you are grouping by, and the values in each row are the results of the aggregation function you applied to each group. 


In [9]:
# Sample DataFrame
data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Alice', 'Kyle', 'Laura', 'Bob'],
    'Sales': [250, None, None, 300, None],
    'Performance Score': [3, 4, 5, 2, 3]
})

# Grouping by 'Department'
# numeric_only=True skips the text column 'Employee', which has no mean
grouped_df = data.groupby('Department').mean(numeric_only=True)
grouped_df.head()

Department,Sales,Performance Score
HR,,3.5
IT,,5.0
Sales,275.0,2.5



This example calculates the mean of the numerical columns for each department. For example, it shows the mean performance score for the members of each department. HR and IT have no sales figures, so their mean `Sales` value is `NaN`.
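
Often only one column is of interest, and more than one statistic is wanted. A possible approach, sketched below on the same sample data, is to select the column after grouping and pass a list of aggregation names to `.agg()`. The variable name `score_summary` is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Alice', 'Kyle', 'Laura', 'Bob'],
    'Sales': [250, None, None, 300, None],
    'Performance Score': [3, 4, 5, 2, 3]
})

# Select one column, then compute several aggregates per department
score_summary = (
    data.groupby('Department')['Performance Score']
        .agg(['mean', 'max', 'count'])
)
print(score_summary)
```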

### Grouping with Multiple Columns


You can also group by multiple columns. This is useful for more complex data scenarios.


In [10]:
# Grouping by multiple columns
# numeric_only=True leaves out the text column 'Employee'
multi_grouped_df = data.groupby(['Department', 'Performance Score']).sum(numeric_only=True)
multi_grouped_df.head()

Department,Performance Score,Sales
HR,3,0.0
HR,4,0.0
IT,5,0.0
Sales,2,300.0
Sales,3,250.0


This groups the data first by `Department`, then by `Performance Score`, and sums up the sales for each subgroup.
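
Two details of this output are worth a closer look. First, groups whose sales are all missing show `0.0`, because `sum()` skips `NaN` and the sum of nothing is zero. Second, the group keys end up in a two-level index rather than in ordinary columns. The sketch below shows one way to handle both, using the `min_count` parameter of `sum()` and `reset_index()`; the variable name `flat` is made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Alice', 'Kyle', 'Laura', 'Bob'],
    'Sales': [250, None, None, 300, None],
    'Performance Score': [3, 4, 5, 2, 3]
})

flat = (
    data.groupby(['Department', 'Performance Score'])['Sales']
        .sum(min_count=1)   # require at least one real value, else NaN
        .reset_index()      # turn the group keys back into columns
)
print(flat)
```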


## Modifying Data: The `apply` Method


> The `apply` method in Pandas is a versatile tool for applying a function along an axis of a `DataFrame` or to the values of a `Series`. It's particularly useful for more complex operations that aren't covered by built-in methods. On a `DataFrame`, the function can be applied to each column or each row, making it a powerful feature for both column-wise and row-wise transformations.


### Basic Usage of `apply`
The simplest use case of `apply` is to apply a function to each element of a `Series` or each column/row of a `DataFrame`.


In [11]:
# Sample DataFrame
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Function to increment each element by 1
def increment(x):
    return x + 1

# Applying the function to the DataFrame
incremented_data = data.apply(increment)
incremented_data.head()

Unnamed: 0,A,B
0,2,5
1,3,6
2,4,7



In this example, `apply` passes each column of the `DataFrame` to the `increment` function as a `Series`. Because `x + 1` works on a whole `Series` at once, every value in the column is increased by 1.
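
A quick way to check what `apply` hands to the function is to return the type name of the argument. The sketch below does this on the same sample data, and also compares the result of `apply` with plain arithmetic on the `DataFrame`, which is usually the simpler choice for an operation like this.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Each call receives a whole column as a Series, not a single number
received_types = data.apply(lambda col: type(col).__name__)
print(received_types)

# For simple arithmetic, the vectorised expression gives the same result
via_apply = data.apply(lambda col: col + 1)
via_arithmetic = data + 1
print(via_apply.equals(via_arithmetic))
```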


### Applying Functions Row-wise

You can also apply a function across each row. This is useful for operations that need to consider multiple columns.


In [12]:
# Function to calculate the sum of squares of two columns
def sum_of_squares(row):
    return row['A'] ** 2 + row['B'] ** 2

# Applying the function across each row
data['sum_of_squares'] = data.apply(sum_of_squares, axis=1)
data.head()

Unnamed: 0,A,B,sum_of_squares
0,1,4,17
1,2,5,29
2,3,6,45



Here, `sum_of_squares` calculates the sum of squares of columns `A` and `B` for each row. 
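
When a row-wise calculation only involves arithmetic on columns, the same result can usually be written directly with column expressions, which avoids calling a Python function once per row. The sketch below compares the two forms on the same sample data; `row_wise` and `vectorised` are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Row-wise: the function is called once for each row
row_wise = data.apply(lambda row: row['A'] ** 2 + row['B'] ** 2, axis=1)

# Vectorised: the arithmetic runs on whole columns at once
vectorised = data['A'] ** 2 + data['B'] ** 2

print(row_wise.tolist())
print(vectorised.tolist())
```

Row-wise `apply` remains useful when the logic involves branching or calls that cannot be expressed as column arithmetic.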


### Real-world Example: Data Normalisation


Consider a dataset where you need to normalise the values in each column. 

Normalisation is a common preprocessing step in data analysis, and is particularly important when dealing with features that vary in scale. For instance, in a dataset combining annual incomes and ages, the wide range of income values can dominate the much smaller age values. Normalisation puts the features on a common scale so that each contributes comparably to the analysis, which can improve the performance of some statistical tests and machine learning algorithms.


In [13]:
# Sample dataset
sample_data = pd.DataFrame({
    'Feature1': [10, 20, 30],
    'Feature2': [40, 50, 60]
})

# Normalising function
def normalise(column):
    return (column - column.mean()) / column.std()

# Applying normalisation to each column
normalised_data = sample_data.apply(normalise)
normalised_data.head()

Unnamed: 0,Feature1,Feature2
0,-1.0,-1.0
1,0.0,0.0
2,1.0,1.0



In this example, the `normalise` function is applied to each column, standardising the values in each column to have a mean of `0` and a standard deviation of `1`. Note that `std()` in Pandas computes the sample standard deviation by default.
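
Standardising to zero mean and unit standard deviation is one form of normalisation. Another common form, min-max scaling, maps each column onto the range `0` to `1`. The sketch below applies it with the same `apply` pattern; the function name `min_max_scale` is made up for illustration, and the function assumes each column contains at least two distinct values (a constant column would divide by zero).

```python
import pandas as pd

sample_data = pd.DataFrame({
    'Feature1': [10, 20, 30],
    'Feature2': [40, 50, 60]
})

def min_max_scale(column):
    # Shift so the minimum is 0, then divide by the range so the maximum is 1
    return (column - column.min()) / (column.max() - column.min())

scaled_data = sample_data.apply(min_max_scale)
print(scaled_data)
```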



## Modifying Data: The `map()` Method

>The `map()` method in Pandas is used primarily with `Series` objects to map values from one domain to another. It's an efficient way to perform element-wise transformations and is particularly useful for substituting each unique value in a `Series` with another value, or for mapping individual results to categories.



### Basic Usage of `map()`

Here's a simple example of using `map()` to replace values in a `Series` based on a mapping dictionary:



In [16]:
# Sample Series
s = pd.Series(['dog', 'dog', 'seagull', 'cod',  'dog', 'seagull'])

# Mapping dictionary
animal_map = {
    'dog': 'mammal',
    'seagull': 'bird',
    'cod' : 'fish',
}

# Using map to replace values
mapped_s = s.map(animal_map)
mapped_s


0    mammal
1    mammal
2      bird
3      fish
4    mammal
5      bird
dtype: object


In this case, `map()` takes each value from the `Series` `s` and replaces it with the corresponding value from the `animal_map` dictionary. 
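
One thing to watch for is that any value missing from the dictionary becomes `NaN` in the result. The sketch below shows this with a deliberately unmapped value (`'newt'`, which is not in the lesson's data), one way to fill the gaps afterwards, and the fact that `map()` also accepts a function instead of a dictionary.

```python
import pandas as pd

s = pd.Series(['dog', 'cod', 'newt'])  # 'newt' has no entry in the mapping
animal_map = {'dog': 'mammal', 'seagull': 'bird', 'cod': 'fish'}

# Values without a dictionary entry are mapped to NaN
mapped = s.map(animal_map)
print(mapped)

# Fill the gaps with a placeholder label
labelled = s.map(animal_map).fillna('unknown')
print(labelled)

# map() can also take a function, applied to each element
shouted = s.map(str.upper)
print(shouted)
```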

## Key Takeaways

- Useful methods for data cleaning in Pandas include adding, removing, and altering data, merging sources, and grouping 
- Add new columns to a `DataFrame` using direct assignment or the `assign()` method
- Use `pd.concat()` to append rows or `DataFrame`s in Pandas, and control index handling with the `ignore_index` parameter
- Use `drop()` to remove rows or columns from a Pandas `DataFrame`, and `dropna()` to eliminate those with missing values
- The `.merge()` function in Pandas combines data from different sources based on a shared key column
- Use `pd.merge()` with `how` and `on` parameters to perform different types of joins in Pandas
- Merging on multiple columns in Pandas ensures accurate data alignment when a single key isn't sufficient
- Use the `.groupby()` method in Pandas to perform segmented statistical analysis on `DataFrame` columns
- Pandas allows grouping by multiple columns for complex data scenarios, summing up values for each subgroup
- The `apply` method in Pandas passes each column (or each row, with `axis=1`) of a `DataFrame`, or each element of a `Series`, to a function
- Use `map()` with a dictionary to replace values in a Pandas `Series`