#### **3. Data Manipulation**


# Data Cleaning
- Data cleaning prepares raw data for analysis by making it accurate, consistent, and usable. Its key tasks are handling missing data, removing duplicates, and converting columns to appropriate data types.

# Handling Missing Data
- Handling missing data is essential to maintain the integrity of your dataset. Missing values can arise due to various reasons, such as data entry errors or incomplete data collection. There are several strategies for handling missing data:

# Drop Missing Values:


- Removing rows or columns with missing values can be effective when the amount of missing data is small and does not significantly impact the dataset. This approach ensures that only complete cases are used in analysis.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', None],
    'Age': [25, None, 35, 28],
    'City': ['Bangalore', 'Chennai', None, 'Hyderabad']
}
df = pd.DataFrame(data)

# Drop rows with any missing values
df_cleaned = df.dropna()
print(df_cleaned)
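
The paragraph above also mentions dropping columns, which the example does not show. A minimal sketch of a few `dropna()` variants on the same illustrative data, using its `axis`, `subset`, and `thresh` parameters (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Monika', None],
    'Age': [25, None, 35, 28],
    'City': ['Bangalore', 'Chennai', None, 'Hyderabad']
})

# Drop columns (instead of rows) that contain any missing value.
# Every column here has one gap, so no columns survive.
df_no_missing_cols = df.dropna(axis=1)

# Drop rows only when 'Age' is missing; gaps in other columns are kept
df_age_known = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values
df_mostly_complete = df.dropna(thresh=2)

print(df_no_missing_cols.shape)
print(df_age_known)
print(df_mostly_complete)
```

Dropping by column is rarely appropriate when each column has only a few gaps, as here; `subset` and `thresh` are usually the gentler options.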


# Fill Missing Values:
- Instead of dropping incomplete records, you can replace missing values with a specific value: a column statistic such as the mean, median, or mode, or a placeholder like 'Unknown'. This keeps the dataset at its original size, at the cost of introducing estimated values.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi'],
    'Age': [25, None, 35, 28],
    'City': ['Bangalore', 'Chennai', None, 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Fill missing values with specific values
df_filled = df.fillna(value={'Age': df['Age'].mean(), 'City': 'Unknown'})
print(df_filled)
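
The text also lists the median and mode as fill values. A small sketch of that variant, on made-up data in which one city repeats so that the mode is well defined:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, None, 35, 28],
    'City': ['Bangalore', 'Chennai', None, 'Bangalore']
})

# The median is less sensitive to outliers than the mean
age_median = df['Age'].median()

# mode() returns a Series (there may be ties), so take the first entry
city_mode = df['City'].mode()[0]

df_filled = df.fillna({'Age': age_median, 'City': city_mode})
print(df_filled)
```

The median suits numeric columns and the mode suits categorical ones; which statistic is appropriate depends on the data.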


# Forward Fill:

- Forward fill (also known as propagation) replaces each missing value with the last known value before it. It suits time series or other sequential data, where the most recent observation is assumed to be the best estimate for a missing value.

- Use Case: Forward fill is commonly used in time series data where the value at a given point in time is assumed to be consistent until a new value is observed.

- How It Works: If a data point is missing, forward fill replaces it with the value of the previous non-missing data point.



In [None]:
import pandas as pd

data = {
    'Date': ['2024-01-01', '2024-01-02', None, '2024-01-04'],
    'Value': [10, None, 30, 40]
}
df = pd.DataFrame(data)

# Forward fill missing values
df_ffill = df.ffill()
print(df_ffill)
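
Carrying a value forward over a long gap may stretch the "most recent observation" assumption too far. One way to bound it, sketched here on hypothetical daily readings, is the `limit` parameter of `ffill()`:

```python
import pandas as pd

# Hypothetical daily readings with a three-day gap
readings = pd.Series(
    [10.0, None, None, None, 50.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D')
)

# Carry the last observation forward, but by at most one step
filled = readings.ffill(limit=1)
print(filled)
```

Only the first missing day receives the value 10.0; the remaining two days stay missing and can be handled by another strategy.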


# Backward Fill:

- Backward fill (`bfill`) replaces each missing value with the next known value after it. It is useful when the value observed immediately after a gap is assumed to be the best estimate for the missing entry.

- Use Case: Backward fill is often used in scenarios where future values are assumed to provide the best estimate for missing values.

- How It Works: If a data point is missing, backward fill replaces it with the value of the next non-missing data point.

In [None]:
import pandas as pd

data = {
    'Date': ['2024-01-01', None, '2024-01-03', '2024-01-04'],
    'Value': [10, 20, None, 40]
}
df = pd.DataFrame(data)

# Backward fill missing values
df_bfill = df.bfill()
print(df_bfill)
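
To make the difference between the two directions concrete, the following sketch fills the same made-up series both ways. Note what happens at the edges, where there is no neighbor to copy from:

```python
import pandas as pd

values = pd.Series([None, 20.0, None, 40.0, None])

comparison = pd.DataFrame({
    'original': values,
    'ffill': values.ffill(),  # leading gap stays NaN: nothing earlier to copy
    'bfill': values.bfill(),  # trailing gap stays NaN: nothing later to copy
})
print(comparison)
```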


# Removing Duplicates:

- Duplicate data can distort analysis and lead to inaccurate conclusions. Removing duplicates ensures that each entry is unique and contributes only once to the dataset:

## Remove Duplicate Rows:
- This involves deleting duplicate rows in the dataset, leaving only unique entries. This is useful for datasets where duplicate rows may have been introduced through errors or data merging.

In [None]:
# Remove Duplicate Rows
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Bhagath', 'Monika'],
    'Age': [25, 30, 25, 35],
    'City': ['Bangalore', 'Chennai', 'Bangalore', 'Hyderabad']
}
df = pd.DataFrame(data)

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
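
Before deleting anything, it can help to see which rows would be removed. A minimal sketch using `duplicated()`, which returns a boolean mask marking rows that repeat an earlier row:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Bhagath', 'Monika'],
    'Age': [25, 30, 25, 35],
    'City': ['Bangalore', 'Chennai', 'Bangalore', 'Hyderabad']
})

# True for every row that is an exact repeat of an earlier row
dup_mask = df.duplicated()

print(dup_mask.sum())  # number of duplicate rows
print(df[dup_mask])    # the rows drop_duplicates() would remove
```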



## Remove Duplicates Based on Specific Columns:
- In some cases, duplicates may need to be removed based on certain columns while keeping others. For example, if multiple records have the same name and age, but different cities, you might choose to keep only unique combinations of these columns.

In [None]:
# Remove Duplicates Based on Specific Columns
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Bhagath', 'Monika'],
    'Age': [25, 30, 25, 35],
    'City': ['Bangalore', 'Chennai', 'Chickkaballapur', 'Hyderabad']
}
df = pd.DataFrame(data)

# Remove duplicates based on Name and Age columns
df_no_duplicates = df.drop_duplicates(subset=['Name', 'Age'])
print(df_no_duplicates)



## Keep First or Last Occurrence:
- When duplicates are present, you can choose to keep either the first or last occurrence of the duplicate entries. This method helps in maintaining the most relevant or recent data while discarding redundant records.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Bhagath', 'Monika', 'Bhagath'],
    'Age': [25, 30, 25, 35, 25],
    'City': ['Bangalore', 'Chennai', 'Bangalore', 'Hyderabad', 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Keep the first occurrence of duplicates
df_first = df.drop_duplicates(keep='first')
print(df_first)

# Keep the last occurrence of duplicates
df_last = df.drop_duplicates(keep='last')
print(df_last)
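
Keeping the "most recent" record only works if row order reflects recency. A sketch of one common pattern, assuming a hypothetical `Updated` timestamp column: sort by the timestamp first, then keep the last occurrence per name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Bhagath'],
    'City': ['Bangalore', 'Chennai', 'Chickkaballapur'],
    # 'Updated' is an illustrative column, not part of the earlier examples
    'Updated': pd.to_datetime(['2024-03-01', '2024-01-15', '2024-01-10'])
})

# Sort oldest to newest, then keep the newest row for each Name
latest = (
    df.sort_values('Updated')
      .drop_duplicates(subset='Name', keep='last')
)
print(latest)
```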


# Data Type Conversion
- Ensuring data types are correct is crucial for data manipulation and analysis:

## Convert Data Types
- Data types need to be correctly set for accurate computations and analyses. For instance, numeric data should be in integer or float format, categorical data should be converted to a categorical type, and date information should be converted to datetime format.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika'],
    'Age': ['25', '30', '35']  # Age is in string format
}
df = pd.DataFrame(data)

# Convert Age column to float
df['Age'] = df['Age'].astype(float)
print(df)
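
`astype()` raises an error if any value cannot be converted. For messier input, a possible alternative is `pd.to_numeric()` with `errors='coerce'`, sketched here on made-up data containing one unparseable entry:

```python
import pandas as pd

df = pd.DataFrame({'Age': ['25', '30', 'unknown']})

# Values that cannot be parsed become NaN instead of raising an error
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
print(df)
print(df.dtypes)
```

The resulting missing values can then be handled with the strategies from the missing-data section.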


## Convert to Categorical Data

- Converting columns to categorical data types can optimize memory usage and improve performance when dealing with categorical variables.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Bhagath'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Convert City column to categorical data type
df['City'] = df['City'].astype('category')
print(df.dtypes)
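
The memory benefit shows up when a column has many rows but few distinct values. A rough sketch that compares the two representations with `memory_usage(deep=True)`; the data is synthetic and the exact byte counts depend on the pandas version:

```python
import pandas as pd

# Three distinct city names repeated many times
cities = pd.Series(['Bangalore', 'Chennai', 'Hyderabad'] * 1000)

as_strings = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"string column:   {as_strings} bytes")
print(f"category column: {as_category} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it tends to be smaller here.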


## Convert to DateTime

- Converting date-related columns to datetime types enables proper date manipulations, such as filtering by date ranges and calculating time intervals.




In [None]:
import pandas as pd

data = {
    'Date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'Value': [10, 20, 30]
}
df = pd.DataFrame(data)

# Convert Date column to datetime type
df['Date'] = pd.to_datetime(df['Date'])
print(df.dtypes)
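
Once a column is a datetime type, the operations mentioned above become straightforward. A short sketch of filtering by a date range and computing intervals between rows (the column name `Days Since Previous` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
    'Value': [10, 20, 30]
})

# Filter by date range (both endpoints are included by default)
in_range = df[df['Date'].between(pd.Timestamp('2024-01-02'),
                                 pd.Timestamp('2024-01-05'))]

# Time interval between consecutive rows, in days
df['Days Since Previous'] = df['Date'].diff().dt.days

print(in_range)
print(df)
```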

## Data Transformation

Data transformation involves modifying data to make it more suitable for analysis or to meet specific requirements. This can include tasks like renaming columns, applying functions, and performing aggregation and grouping operations. Here’s a deeper look into these transformations:

### Renaming Columns and Indexes

Renaming columns and indexes is a common preprocessing step in data cleaning. It makes the dataset easier to understand and aligns it with the conventions used in analysis or reporting.

- **Renaming Columns**

  Renaming columns allows for clearer labels that better describe the data or fit specific analytical needs. This is particularly useful when integrating data from multiple sources where column names might be inconsistent.

  **Theory:** The `rename()` method in Pandas can be used to change column names. It takes a dictionary where keys are the current names and values are the new names. Renaming columns can make data easier to understand and work with.

  **Example:**

  ```python
  import pandas as pd

  data = {
      'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi'],
      'Age': [25, 30, 35, 28],
      'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur']
  }
  df = pd.DataFrame(data)

  # Renaming columns
  df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years', 'City': 'City of Residence'})
  print(df_renamed)
  ```


### Renaming Indexes

- Renaming indexes can provide meaningful labels for rows, especially when data represents entities like individuals or locations.

- Theory: The rename() method can also be used for renaming row indexes. This helps in identifying rows easily, which is useful for data reporting and analysis.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi'],
    'Age': [25, 30, 35, 28],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Renaming indexes
df_renamed_index = df.rename(index={0: 'Person1', 1: 'Person2', 2: 'Person3', 3: 'Person4'})
print(df_renamed_index)
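
Besides dictionaries, `rename()` also accepts a function, which is applied to every label. A small sketch that lowercases the column names and generates the row labels instead of listing them one by one:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath'],
    'Age': [25, 30]
})

df_renamed = df.rename(
    columns=str.lower,                  # 'Name' -> 'name', 'Age' -> 'age'
    index=lambda i: f'Person{i + 1}'    # 0 -> 'Person1', 1 -> 'Person2'
)
print(df_renamed)
```

This can be convenient when many labels follow the same pattern.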


### Applying Functions
- Applying functions to data allows for transformations and calculations that help in deriving new insights or preparing data for analysis.

- Using apply()

  - apply() is a versatile method for applying a function along either axis of a DataFrame (column by column or row by row), or to each value of a Series.

  - Theory: The applied function can be user-defined or built-in, which makes apply() particularly useful for custom calculations or transformations.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Padhmavathi'],
    'Age': [25, 30, 35, 28]
}
df = pd.DataFrame(data)

# Applying a function to a column
df['Age in Months'] = df['Age'].apply(lambda x: x * 12)
print(df)
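
The example above applies a function to a single column. To illustrate the "either axis" behavior on a DataFrame, here is a sketch with a hypothetical `Experience` column: the default `axis=0` passes each column to the function, while `axis=1` passes each row.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath', 'Monika'],
    'Age': [25, 30, 35],
    'Experience': [2, 8, 10]  # illustrative column
})

# Column-wise (default axis=0): the function receives one column at a time
ranges = df[['Age', 'Experience']].apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives one row at a time
df['Label'] = df.apply(lambda row: f"{row['Name']} ({row['Age']})", axis=1)

print(ranges)
print(df)
```

Row-wise `apply()` is flexible but can be slow on large data; vectorized column operations are usually preferable when they exist.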


### Using map()

  - map() performs element-wise transformations on a Series. It accepts either a dictionary, which replaces each value with its mapped counterpart, or a function, which is applied to each element. This is useful for tasks like replacing categorical data with numerical codes.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad']
}
df = pd.DataFrame(data)

# Mapping city names to abbreviations
city_map = {
    'Bangalore': 'BLR',
    'Chennai': 'CHE',
    'Hyderabad': 'HYD'
}
df['City Abbreviation'] = df['City'].map(city_map)
print(df)
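
Two details of `map()` are worth seeing in action: values missing from the dictionary become NaN, and a function can be passed instead of a dictionary. A minimal sketch on made-up data:

```python
import pandas as pd

cities = pd.Series(['Bangalore', 'Chennai', 'Mysore'])
city_map = {'Bangalore': 'BLR', 'Chennai': 'CHE'}

# 'Mysore' has no entry in city_map, so it maps to NaN
abbrev = cities.map(city_map)

# One way to supply a fallback for unmapped values
abbrev_filled = abbrev.fillna('N/A')

# map() with a function: length of each city name
lengths = cities.map(len)

print(abbrev_filled)
print(lengths)
```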


### Using applymap()

  - applymap() applies a function to every element of the DataFrame, which makes it useful for cell-level operations such as formatting or type conversion.

  - Note: in pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which performs the same element-wise operation.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)

# Applying a function to every element
df_uppercase = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)
print(df_uppercase)
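
Newer pandas releases (2.1 and later) provide `DataFrame.map()` for the same element-wise operation and deprecate `applymap()`. The sketch below picks whichever method the installed version offers; the `hasattr` check and the helper name `shout` are just one illustrative way to write this:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath'],
    'Age': [25, 30]
})

def shout(x):
    """Uppercase strings; leave every other value unchanged."""
    return x.upper() if isinstance(x, str) else x

# Prefer DataFrame.map (pandas >= 2.1); fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
df_upper = elementwise(shout)
print(df_upper)
```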


## Aggregation and Grouping
- Aggregation and grouping are used to summarize and analyze data by dividing it into groups and performing calculations on these groups.

### Using groupby()

  - groupby() splits the data into groups based on one or more columns and then applies an aggregate function such as sum, mean, or count to each group. This is useful for summarizing data and analyzing subsets of the dataset.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Bhagath'],
    'Age': [25, 30, 35, 25],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Grouping by Name and calculating the mean Age
df_grouped = df.groupby('Name').agg({'Age': 'mean'})
print(df_grouped)
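
Several aggregates can be computed per group in one call. A sketch using named aggregation, where each keyword becomes an output column; the data and the output column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Bangalore', 'Chennai', 'Bangalore', 'Chennai'],
    'Age': [25, 30, 35, 28]
})

# Each keyword is an output column: (source column, aggregate function)
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_age=('Age', 'max'),
    records=('Age', 'count'),
)
print(summary)
```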


### Using pivot_table()

  - pivot_table() creates a pivot table: a summary in which values are aggregated across one or more dimensions (rows and, optionally, columns). It is useful for cross-tabulation and for summarizing large datasets.

In [None]:
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika', 'Bhagath'],
    'Age': [25, 30, 35, 25],
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chickkaballapur']
}
df = pd.DataFrame(data)

# Creating a pivot table
df_pivot = pd.pivot_table(df, values='Age', index='Name', aggfunc='mean')
print(df_pivot)
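
With only an `index`, a pivot table gives the same result as a `groupby()`. The multi-dimensional aspect appears once a `columns` argument is added. A sketch with a hypothetical `Department` column, using `fill_value` for combinations that have no data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Bangalore', 'Chennai', 'Bangalore', 'Chennai'],
    'Department': ['Sales', 'Sales', 'HR', 'Sales'],  # illustrative column
    'Age': [25, 30, 35, 28]
})

# Rows are cities, columns are departments, cells hold the mean age
pivot = pd.pivot_table(
    df,
    values='Age',
    index='City',
    columns='Department',
    aggfunc='mean',
    fill_value=0,  # Chennai has no HR records
)
print(pivot)
```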
