- Useful methods for data cleaning in Pandas include adding, removing, and altering data, merging sources, and grouping 
- Add new columns to a `DataFrame` using direct assignment or the `assign()` method
- Use `pd.concat()` to append rows or whole `DataFrame`s in Pandas, and control index handling with the `ignore_index` parameter
- Use `drop()` to remove rows or columns, and `dropna()` to eliminate those with missing values in a Pandas `DataFrame`
- The `.merge()` function in Pandas combines data from different sources based on a shared key column
- Use `pd.merge()` with `how` and `on` parameters to perform different types of joins in Pandas
- Merging on multiple columns in Pandas ensures accurate data alignment when a single key isn't sufficient
- Use the `.groupby()` method in Pandas to perform segmented statistical analysis on `DataFrame` columns
- Pandas allows grouping by multiple columns for complex data scenarios, summing up values for each subgroup
- The `apply()` method in Pandas passes each column of a `DataFrame` (or each row, with `axis=1`), or each element of a `Series`, to a function
- Use `map()` with a dictionary to replace values in a Pandas `Series`

In [3]:
import pandas as pd
my_dict={'Animal':["Dog","Cat", "Bird"], "Age":[2,4,1]}
base_df=pd.DataFrame(my_dict)
base_df


Unnamed: 0,Animal,Age
0,Dog,2
1,Cat,4
2,Bird,1


In [4]:
# Adding a new column by direct assignment
base_df['new_column']=[1,2,3]
base_df

Unnamed: 0,Animal,Age,new_column
0,Dog,2,1
1,Cat,4,2
2,Bird,1,3


#### 2. Using `assign()`:
The `assign()` method returns a new `DataFrame` with the added columns and leaves the original `DataFrame` unchanged. To keep the new columns, assign the result back to a variable.


In [5]:
base_df.assign(new_column2=[3,4,6])

Unnamed: 0,Animal,Age,new_column,new_column2
0,Dog,2,1,3
1,Cat,4,2,4
2,Bird,1,3,6


In [6]:
base_df

Unnamed: 0,Animal,Age,new_column
0,Dog,2,1
1,Cat,4,2
2,Bird,1,3


In [7]:
base_df=base_df.assign(new_column2=[3,4,6])
base_df

Unnamed: 0,Animal,Age,new_column,new_column2
0,Dog,2,1,3
1,Cat,4,2,4
2,Bird,1,3,6
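`assign()` also accepts a callable, which is handy when the new column is derived from existing ones. The following is a minimal sketch; the names `pets` and `age_months` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data, mirroring the structure of base_df
pets = pd.DataFrame({"Animal": ["Dog", "Cat", "Bird"], "Age": [2, 4, 1]})

# A callable passed to assign() receives the DataFrame and returns the new column
with_months = pets.assign(age_months=lambda d: d["Age"] * 12)

print(with_months)
print(list(pets.columns))  # the original DataFrame still has only Animal and Age
```

Because the result is a new `DataFrame`, several `assign()` calls can be chained without modifying the source data.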


### Adding rows
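- `pd.concat()` appends rows (or whole `DataFrame`s); passing `ignore_index=True` renumbers the resulting index from 0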

In [8]:
additional_data=pd.DataFrame([["horse", 3, 4, 7]], columns=base_df.columns)
base_df=pd.concat([base_df, additional_data], ignore_index=True)
base_df.head()

Unnamed: 0,Animal,Age,new_column,new_column2
0,Dog,2,1,3
1,Cat,4,2,4
2,Bird,1,3,6
3,horse,3,4,7


In [9]:
# Creating two sample DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenating without ignore_index
result_without_ignore_index = pd.concat([df1, df2])
print("With ignore_index=False:\n", result_without_ignore_index)

# Concatenating with ignore_index
result_with_ignore_index = pd.concat([df1, df2], ignore_index=True)
print("\nWith ignore_index=True:\n", result_with_ignore_index)

With ignore_index=False:
    A  B
0  1  3
1  2  4
0  5  7
1  6  8

With ignore_index=True:
    A  B
0  1  3
1  2  4
2  5  7
3  6  8
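One reason the index handling matters: duplicate labels change what label-based lookups return. A small sketch, reusing the two sample frames from the cell above (the variable names `kept` and `reset` are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

kept = pd.concat([df1, df2])                      # index is 0, 1, 0, 1
reset = pd.concat([df1, df2], ignore_index=True)  # index is 0, 1, 2, 3

# With duplicate labels, .loc[0] matches two rows and returns a DataFrame
print(kept.loc[0])

# With a fresh index, .loc[0] matches one row and returns a Series
print(reset.loc[0])
```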



### Dropping Rows and Columns

- **Dropping Rows:**
  - Rows can be removed using the `drop()` method, passing the index labels either with `axis=0` or through the `index=` keyword
  - Example:
    ```python
    df.drop(index=[0, 1], inplace=True)  # Drops rows with index 0 and 1
    ```

- **Dropping Columns:**
  - To drop columns, use the `drop()` method, passing the column names either with `axis=1` or through the `columns=` keyword
  - Example:
    ```python
    df.drop(columns=['Column1', 'Column2'], inplace=True)
    ```

- **Using `dropna()`:**
  - The `dropna()` method is useful for removing rows or columns with missing values
  - Example:
    ```python
    df.dropna(axis=0, inplace=True)  # Drops rows with any NaN values
    df.dropna(axis=1, inplace=True)  # Drops columns with any NaN values
    ```
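The snippets above assume an existing `df`. A self-contained sketch is given here, using a small made-up `DataFrame` (the data and variable names are hypothetical). It omits `inplace=True`, so each call returns a new object and `df` stays intact.

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "Animal": ["Dog", "Cat", "Bird", "Fish"],
    "Age": [2, np.nan, 1, np.nan],
    "Weight": [10.0, 4.0, np.nan, np.nan],
})

no_first_row = df.drop(index=[0])         # remove the row labelled 0
no_weight = df.drop(columns=["Weight"])   # remove one column
any_missing = df.dropna()                 # drop rows with at least one NaN
age_required = df.dropna(subset=["Age"])  # drop rows only where Age is NaN
# drop rows only where both Age and Weight are NaN
all_missing = df.dropna(how="all", subset=["Age", "Weight"])

print(any_missing)
print(age_required)
print(all_missing)
```

The `subset` and `how` parameters are useful when a missing value in one column is acceptable but a missing value in another is not.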

### Basic Merge
1. Inner join (the default for `pd.merge()`): combines only the rows with matching key values in both `DataFrame`s.

In [11]:

import pandas as pd

# Customer data from the Sales Department
customer_data = pd.DataFrame({
    'CustomerID': ['C001', 'C002', 'C003', 'C004', 'C005', 'C006'],
    'Product': ['Laptop', 'Printer', 'Tablet', 'Monitor', 'Tablet', 'Laptop'],
})

# Extended Customer feedback and issues data from the Customer Service Department
service_data = pd.DataFrame({
    'CustomerID': ['C001', 'C001', 'C002', 'C003', 'C003', 'C004'],
    'Issue_ID': ['I001', 'I002', 'I003', 'I004', 'I005', 'I006'],
    'Feedback_Score': [4, 3, 3, 5, 4, 2],
    'Issue_Resolved': ['Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
})

# Merging the Sales and Customer Service data on 'CustomerID'
merge_customer_data=pd.merge(customer_data, service_data, on="CustomerID")
merge_customer_data

Unnamed: 0,CustomerID,Product,Issue_ID,Feedback_Score,Issue_Resolved
0,C001,Laptop,I001,4,Yes
1,C001,Laptop,I002,3,No
2,C002,Printer,I003,3,Yes
3,C003,Tablet,I004,5,No
4,C003,Tablet,I005,4,Yes
5,C004,Monitor,I006,2,Yes


### Types of Joins

- **Inner Join:** Keeps only the rows whose key appears in both `DataFrame`s
- **Left Join:** Keeps all rows from the left `DataFrame` and adds the matching rows from the right `DataFrame`; unmatched left rows get `NaN` in the right-hand columns
- **Right Join:** Keeps all rows from the right `DataFrame` and adds the matching rows from the left `DataFrame`; unmatched right rows get `NaN` in the left-hand columns
- **Full (Outer) Join:** Keeps all rows from both `DataFrame`s, filling `NaN` wherever a key has no match on the other side

In [12]:
#Left Join
left_join_df=pd.merge(service_data,customer_data,how="left", on="CustomerID")
left_join_df

Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4,Yes,Laptop
1,C001,I002,3,No,Laptop
2,C002,I003,3,Yes,Printer
3,C003,I004,5,No,Tablet
4,C003,I005,4,Yes,Tablet
5,C004,I006,2,Yes,Monitor


In [13]:
# Right Join
right_join_df = pd.merge(service_data, customer_data, how='right', on='CustomerID')
right_join_df

Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4.0,Yes,Laptop
1,C001,I002,3.0,No,Laptop
2,C002,I003,3.0,Yes,Printer
3,C003,I004,5.0,No,Tablet
4,C003,I005,4.0,Yes,Tablet
5,C004,I006,2.0,Yes,Monitor
6,C005,,,,Tablet
7,C006,,,,Laptop


In [14]:
# Full Join
full_join_df = pd.merge(service_data, customer_data, how='outer', on='CustomerID')
full_join_df

Unnamed: 0,CustomerID,Issue_ID,Feedback_Score,Issue_Resolved,Product
0,C001,I001,4.0,Yes,Laptop
1,C001,I002,3.0,No,Laptop
2,C002,I003,3.0,Yes,Printer
3,C003,I004,5.0,No,Tablet
4,C003,I005,4.0,Yes,Tablet
5,C004,I006,2.0,Yes,Monitor
6,C005,,,,Tablet
7,C006,,,,Laptop
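To see which side each row of a join came from, `pd.merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses trimmed-down, partly hypothetical data (customer `C007` and issue `I009` do not appear in the tables above) so that all three outcomes show up.

```python
import pandas as pd

# Reduced, partly hypothetical versions of the sales and service tables
customers = pd.DataFrame({"CustomerID": ["C001", "C002", "C005"],
                          "Product": ["Laptop", "Printer", "Tablet"]})
issues = pd.DataFrame({"CustomerID": ["C001", "C002", "C007"],
                       "Issue_ID": ["I001", "I003", "I009"]})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
audit = pd.merge(issues, customers, how="outer", on="CustomerID", indicator=True)
print(audit[["CustomerID", "_merge"]])
```

Filtering on `_merge` is a quick way to find keys that exist in only one source before deciding which join type to use.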


### Merging on Multiple Columns

In [15]:


# DataFrame with employee salaries and department IDs
employee_salaries = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],  # Employee ID 1 appears in three departments, so it is not unique on its own
    'Department ID': [101, 101, 102, 103],
    'Salary': [50000, 60000, 55000, 58000]
})

# DataFrame with employee names and department IDs
employee_names = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Merging on 'Employee ID' and 'Department ID'
merged_df = pd.merge(employee_salaries, employee_names, on=['Employee ID', 'Department ID'])
merged_df.head()


Unnamed: 0,Employee ID,Department ID,Salary,Employee Name
0,1,101,50000,Alice
1,2,101,60000,Bob
2,1,102,55000,Charlie
3,1,103,58000,Diana
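To illustrate why a single key is not sufficient here, the sketch below merges the same two tables on `Employee ID` alone and compares the row counts. The variable names `single_key` and `both_keys` are only illustrative.

```python
import pandas as pd

employee_salaries = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Salary': [50000, 60000, 55000, 58000]
})
employee_names = pd.DataFrame({
    'Employee ID': [1, 2, 1, 1],
    'Department ID': [101, 101, 102, 103],
    'Employee Name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Employee ID 1 occurs three times on each side, so every pairing is produced
single_key = pd.merge(employee_salaries, employee_names, on='Employee ID')
both_keys = pd.merge(employee_salaries, employee_names,
                     on=['Employee ID', 'Department ID'])

print(len(single_key))  # 10 rows: 3 x 3 pairings for ID 1, plus 1 for ID 2
print(len(both_keys))   # 4 rows: one per (Employee ID, Department ID) pair
```

With the single key, the overlapping `Department ID` column is also split into `Department ID_x` and `Department ID_y`, which is another sign that it belongs in the join key.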


### Grouping Data

In [20]:
# Sample DataFrame
data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Alice', 'Kyle', 'Laura', 'Bob'],
    'Sales': [250, None, None, 300, None],
    'Performance Score': [3, 4, 5, 2, 3]
})
# Grouping by 'Department'
grouped_df = data.groupby('Department').mean(numeric_only=True)
grouped_df.head()


Department,Sales,Performance Score
HR,,3.5
IT,,5.0
Sales,275.0,2.5


In [21]:
# Grouping by multiple columns (groups whose 'Sales' values are all NaN sum to 0.0)
multi_grouped_df = data.groupby(['Department', 'Performance Score']).sum()
multi_grouped_df.head()

Department,Performance Score,Employee,Sales
HR,3,Bob,0.0
HR,4,Alice,0.0
IT,5,Kyle,0.0
Sales,2,Laura,300.0
Sales,3,John,250.0
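When different columns need different statistics, `agg()` with named aggregations can compute them in one pass. The sketch below reuses the sample data; the output column names (`total_sales`, `mean_score`, `headcount`) are hypothetical. It also shows `min_count=1`, which makes `sum()` return `NaN` rather than `0.0` for groups that contain no valid values.

```python
import pandas as pd

data = pd.DataFrame({
    'Department': ['Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Employee': ['John', 'Alice', 'Kyle', 'Laura', 'Bob'],
    'Sales': [250, None, None, 300, None],
    'Performance Score': [3, 4, 5, 2, 3]
})

# One row per department, one named output column per (column, function) pair
summary = data.groupby('Department').agg(
    total_sales=('Sales', 'sum'),
    mean_score=('Performance Score', 'mean'),
    headcount=('Employee', 'count'),
)
print(summary)

# Require at least one non-missing value; otherwise the group sum is NaN
sales_or_nan = data.groupby('Department')['Sales'].sum(min_count=1)
print(sales_or_nan)
```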


### Apply method

In [22]:
# Sample DataFrame
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Function to increment each element by 1
def increment(x):
    return x + 1

# Applying the function to the DataFrame
incremented_data = data.apply(increment)
incremented_data.head()

Unnamed: 0,A,B
0,2,5
1,3,6
2,4,7
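It may help to see what `apply` actually hands to the function. In this sketch, `DataFrame.apply` passes a whole column as a `Series`, whereas `Series.apply` calls the function once per element. The variable names are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Each call receives a whole column as a Series, not a single value
received = data.apply(lambda col: type(col).__name__)
print(received)

# Because the input is a Series, reductions return one value per column
ranges = data.apply(lambda col: col.max() - col.min())
print(ranges)

# Series.apply, by contrast, calls the function once per element
doubled = data['A'].apply(lambda value: value * 2)
print(doubled)
```

The `increment` function above works on a `DataFrame` because `x + 1` is itself a vectorised operation on each column.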


### Applying a Function Row-Wise

In [24]:
def sum_of_squares(row):
    return row['A']**2 + row['B']**2

# Applying the function across each row
data['Sum_of_squares'] = data.apply(sum_of_squares, axis=1)
data.head()

Unnamed: 0,A,B,Sum_of_squares
0,1,4,17
1,2,5,29
2,3,6,45
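For simple arithmetic like this, the same result can usually be obtained with column-wise operations, which avoid calling a Python function once per row. A short sketch comparing the two approaches:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Row-wise: the lambda runs once for every row
row_wise = data.apply(lambda row: row['A']**2 + row['B']**2, axis=1)

# Vectorised: operates on whole columns at once
vectorised = data['A']**2 + data['B']**2

print((row_wise == vectorised).all())
```

Row-wise `apply` remains useful when the logic cannot be expressed as column operations, for example when it branches on several fields.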


### Real-world Example: Data Normalisation

In [25]:
# Sample dataset
sample_data = pd.DataFrame({
    'Feature1': [10, 20, 30],
    'Feature2': [40, 50, 60]
})

# Normalising function
def normalise(column):
    return (column - column.mean()) / column.std()

# Applying normalisation to each column
normalised_data = sample_data.apply(normalise)
normalised_data.head()

Unnamed: 0,Feature1,Feature2
0,-1.0,-1.0
1,0.0,0.0
2,1.0,1.0
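Two details of this example are worth making explicit. Pandas `std()` uses the sample standard deviation (`ddof=1`) by default, and the same normalisation can be written directly on the `DataFrame` without `apply`. The sketch below shows both; `z_sample` and `z_population` are hypothetical names.

```python
import pandas as pd

sample_data = pd.DataFrame({
    'Feature1': [10, 20, 30],
    'Feature2': [40, 50, 60]
})

# Same result as apply(normalise), computed on the whole DataFrame at once
z_sample = (sample_data - sample_data.mean()) / sample_data.std()

# Population version: ddof=0 divides by n rather than n - 1
z_population = (sample_data - sample_data.mean()) / sample_data.std(ddof=0)

print(z_sample)
print(z_population)
```

Which divisor is appropriate depends on the downstream use; some libraries default to the population form, so results may differ slightly from the table above.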


### Map method

In [27]:
# Sample Series
s = pd.Series(['dog', 'dog', 'seagull', 'cod',  'dog', 'seagull'])

animal_map={
    'dog':'mammal',
    "seagull": "bird",
    "cod":"fish"
}
# Using map to replace values
mapped_s=s.map(animal_map)
mapped_s

0    mammal
1    mammal
2      bird
3      fish
4    mammal
5      bird
dtype: object
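One behaviour to be aware of: values that are not keys in the dictionary become `NaN` after `map()`. The sketch below uses a hypothetical value, `'lizard'`, that is missing from `animal_map`, and shows two possible ways to handle it.

```python
import pandas as pd

s = pd.Series(['dog', 'seagull', 'lizard'])
animal_map = {'dog': 'mammal', 'seagull': 'bird', 'cod': 'fish'}

# 'lizard' is not a key in animal_map, so it maps to NaN
mapped = s.map(animal_map)
print(mapped)

# Option 1: replace unmapped values with a fixed label
labelled = mapped.fillna('unknown')
print(labelled)

# Option 2: fall back to the original value where no mapping exists
kept = s.map(animal_map).fillna(s)
print(kept)
```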