## Data Wrangling

Data wrangling is a critical step in the data analysis pipeline: it transforms and maps raw data into a format suitable for analysis. In pandas, this typically means merging and joining DataFrames, reshaping data, and handling categorical data. Here's a detailed look at each of these aspects:

### Merging and Joining DataFrames

Merging and joining DataFrames are operations used to combine data from different sources into a single DataFrame. This is crucial when data is split across multiple tables or datasets.

- `merge()`

  - The `merge()` function combines two DataFrames on one or more common columns or on their indices, much like a SQL JOIN. The `how` parameter selects the join type (the default is `'inner'`):

  - **Inner Join:** Returns only the rows that have matching values in both DataFrames.
  - **Outer Join:** Returns all rows from both DataFrames, with missing values filled in with NaNs where there is no match.
  - **Left Join:** Returns all rows from the left DataFrame and the matching rows from the right DataFrame. Rows in the left DataFrame with no match in the right DataFrame will have NaNs.
  - **Right Join:** Returns all rows from the right DataFrame and the matching rows from the left DataFrame. Rows in the right DataFrame with no match in the left DataFrame will have NaNs.

  **Example:**

  ```python
  import pandas as pd

  data1 = {
      'Name': ['Bhagath', 'Bharath', 'Monika'],
      'Age': [25, 30, 35]
  }
  data2 = {
      'Name': ['Bharath', 'Monika', 'Padhmavathi'],
      'City': ['Chennai', 'Hyderabad', 'Chickkaballapur']
  }
  df1 = pd.DataFrame(data1)
  df2 = pd.DataFrame(data2)

  # Merging DataFrames on 'Name' (inner join by default, so only
  # 'Bharath' and 'Monika', who appear in both frames, are kept)
  df_merged = pd.merge(df1, df2, on='Name')
  print(df_merged)
  ```
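
The four join types listed above can be compared side by side. The following is a minimal sketch that reuses the same sample data; the variable names `join_sizes` and `df_outer` are illustrative only. It relies on the `how` and `indicator` parameters of `pd.merge()`.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bhagath', 'Bharath', 'Monika'],
                    'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Bharath', 'Monika', 'Padhmavathi'],
                    'City': ['Chennai', 'Hyderabad', 'Chickkaballapur']})

# Number of rows produced by each join type
join_sizes = {how: len(pd.merge(df1, df2, on='Name', how=how))
              for how in ['inner', 'outer', 'left', 'right']}
print(join_sizes)

# indicator=True adds a '_merge' column that records whether each row
# came from the left frame only, the right frame only, or both
df_outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)
print(df_outer)
```

The `_merge` column is a quick way to check which keys failed to match before deciding on a join type.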


- `join()`

  - The `join()` method combines DataFrames based on their indices. It is useful when the DataFrames share the same index or when you want to align data along the index. By default, `join()` performs a left join.

  **Example:**

  ```python
  import pandas as pd

  data1 = {
      'Age': [25, 30, 35]
  }
  data2 = {
      'City': ['Bangalore', 'Chennai', 'Hyderabad']
  }
  df1 = pd.DataFrame(data1, index=['Bhagath', 'Bharath', 'Monika'])
  df2 = pd.DataFrame(data2, index=['Bhagath', 'Bharath', 'Monika'])

  # Joining DataFrames on index
  df_joined = df1.join(df2)
  print(df_joined)
  ```
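
When the indices only partly overlap, the default left join and an outer join give different results. A small sketch, assuming illustrative frames `df_age` and `df_city` that are not part of the example above:

```python
import pandas as pd

df_age = pd.DataFrame({'Age': [25, 30, 35]},
                      index=['Bhagath', 'Bharath', 'Monika'])
df_city = pd.DataFrame({'City': ['Chennai', 'Hyderabad', 'Chickkaballapur']},
                       index=['Bharath', 'Monika', 'Padhmavathi'])

# Default left join: keeps every index label of df_age
left_joined = df_age.join(df_city)
print(left_joined)

# Outer join: keeps the union of both indices
outer_joined = df_age.join(df_city, how='outer')
print(outer_joined)
```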


- `concat()`

  - The `concat()` function concatenates multiple DataFrames along an axis: vertically (`axis=0`, appending rows, the default) or horizontally (`axis=1`, adding columns).

  **Example:**

  ```python
  import pandas as pd

  data1 = {
      'Name': ['Bhagath', 'Bharath'],
      'Age': [25, 30]
  }
  data2 = {
      'Name': ['Monika', 'Padhmavathi'],
      'Age': [35, 28]
  }
  df1 = pd.DataFrame(data1)
  df2 = pd.DataFrame(data2)

  # Concatenating DataFrames along rows
  df_concat = pd.concat([df1, df2])
  print(df_concat)
  ```
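
Row-wise concatenation keeps each frame's original index, so labels can repeat. The sketch below shows `ignore_index=True` to renumber the rows and `axis=1` to place frames side by side; `df_city` is an illustrative frame added for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bhagath', 'Bharath'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Monika', 'Padhmavathi'], 'Age': [35, 28]})

# Original indices are preserved, so the labels 0 and 1 each appear twice
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())

# ignore_index=True builds a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)

# axis=1 aligns on the index and adds columns instead of rows
df_city = pd.DataFrame({'City': ['Bangalore', 'Chennai']})
side_by_side = pd.concat([df1, df_city], axis=1)
print(side_by_side)
```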


### Reshaping Data

Reshaping data involves transforming it between wide and long formats. This process is crucial for preparing data for analysis, especially when dealing with complex datasets.

- `melt()`

  - The `melt()` function transforms a DataFrame from wide format to long format by unpivoting columns into rows. Identifier columns (`id_vars`) are kept as they are, and every other column becomes a (variable, value) pair.

  **Example:**

  ```python
  import pandas as pd

  data = {
      'Name': ['Bhagath', 'Bharath'],
      'Age': [25, 30],
      'City': ['Bangalore', 'Chennai']
  }
  df = pd.DataFrame(data)

  # Melting DataFrame
  df_melted = pd.melt(df, id_vars=['Name'], var_name='Attribute', value_name='Value')
  print(df_melted)
  ```
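
By default `melt()` unpivots every column that is not listed in `id_vars`. A short sketch of the `value_vars` parameter, which restricts unpivoting to selected columns (the names `long_all` and `age_only` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bhagath', 'Bharath'],
    'Age': [25, 30],
    'City': ['Bangalore', 'Chennai'],
})

# Unpivot all non-identifier columns: 2 names x 2 attributes = 4 rows
long_all = pd.melt(df, id_vars=['Name'], var_name='Attribute', value_name='Value')

# Unpivot only 'Age'; 'City' is dropped from the result
age_only = pd.melt(df, id_vars=['Name'], value_vars=['Age'],
                   var_name='Attribute', value_name='Value')
print(age_only)
```

Restricting `value_vars` to columns of one type also keeps the `Value` column numeric instead of mixing integers and strings.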


- `pivot()`

  - The `pivot()` method reshapes long-format data into wide format. Unique values of one column become the new columns, and the values of another column fill the cells. Combinations that do not occur in the data are filled with NaN.

  **Example:**

  ```python
  import pandas as pd

  data = {
      'Name': ['Bhagath', 'Bharath', 'Monika'],
      'Attribute': ['Age', 'Age', 'City'],
      'Value': [25, 30, 'Hyderabad']
  }
  df = pd.DataFrame(data)

  # Pivoting DataFrame: 'Age' and 'City' become columns; missing pairs are NaN
  df_pivot = df.pivot(index='Name', columns='Attribute', values='Value')
  print(df_pivot)
  ```
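
`pivot()` requires each index/column pair to be unique and raises a `ValueError` otherwise. The sketch below uses hypothetical score data (the `Subject` and `Score` columns are assumptions made for illustration) and falls back to `pivot_table()`, which aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data: 'Bhagath' has two 'Math' scores
scores = pd.DataFrame({
    'Name': ['Bhagath', 'Bhagath', 'Bharath'],
    'Subject': ['Math', 'Math', 'Math'],
    'Score': [80, 90, 70],
})

# pivot() cannot decide which of the duplicate values to keep
try:
    scores.pivot(index='Name', columns='Subject', values='Score')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# pivot_table() resolves duplicates with an aggregation function
summary = scores.pivot_table(index='Name', columns='Subject',
                             values='Score', aggfunc='mean')
print(summary)
```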


- `stack()`

  - The `stack()` method moves the column labels into the innermost level of the row index, which transforms the data from wide format to long format. This is useful for handling multi-level columns and preparing data for complex analyses.

  **Example:**

  ```python
  import pandas as pd

  data = {
      'Name': ['Bhagath', 'Bharath'],
      'Age': [25, 30],
      'City': ['Bangalore', 'Chennai']
  }
  # Use 'Name' as the index so that only 'Age' and 'City' are stacked
  df = pd.DataFrame(data).set_index('Name')

  # Stacking DataFrame: each (Name, column) pair becomes one row
  df_stacked = df.stack()
  print(df_stacked)
  ```


- `unstack()`

  - The `unstack()` method pivots the innermost level of the row index into columns, which transforms the data from long format to wide format. It is useful for creating summary tables from multi-index DataFrames.

  **Example:**

  ```python
  import pandas as pd

  data = {
      'Name': ['Bhagath', 'Bharath'],
      'Attribute': ['Age', 'City'],
      'Value': [25, 'Chennai']
  }
  df = pd.DataFrame(data)
  df = df.set_index(['Name', 'Attribute'])

  # Unstacking DataFrame: the 'Attribute' level becomes columns
  df_unstacked = df.unstack()
  print(df_unstacked)
  ```
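
`stack()` and `unstack()` are inverses of each other when no values are missing. A minimal round-trip sketch, assuming a small all-numeric frame (the `Score` column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30], 'Score': [88, 92]},
                  index=pd.Index(['Bhagath', 'Bharath'], name='Name'))

# Wide -> long: a Series indexed by (Name, column label)
stacked = df.stack()
print(stacked)

# Long -> wide: the innermost index level becomes columns again
restored = stacked.unstack()
print(restored)
```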


### Handling Categorical Data

Handling categorical data means converting categorical variables into a form that statistical analyses or machine learning models can use, typically by encoding them as numbers.

Categorical data refers to variables that represent categories or groups. These can be nominal (e.g., names of cities) or ordinal (e.g., rankings). Common numerical representations are integer codes and one-hot encoding, both of which are easier for machine learning algorithms to work with.

**Example:**

```python
import pandas as pd

data = {
    'Name': ['Bhagath', 'Bharath', 'Monika'],
    'City': ['Bangalore', 'Chennai', 'Hyderabad']
}
df = pd.DataFrame(data)

# Converting City column to categorical data type
df['City'] = df['City'].astype('category')
print(df.dtypes)
```

```python
import pandas as pd

data = {
    'City': ['Bangalore', 'Chennai', 'Hyderabad']
}
df = pd.DataFrame(data)

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
```

#### **Types of Categorical Data**

1. **Nominal Data**

   - Nominal data represents categories without any intrinsic order. Examples include names, colors, or cities. Nominal data is purely qualitative and can be encoded as unique identifiers.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chennai']
   }
   df = pd.DataFrame(data)

   # Converting City column to categorical data type
   df['City'] = df['City'].astype('category')
   print(df['City'].cat.categories)
   ```


2. **Ordinal Data**

   - Ordinal data represents categories with a meaningful order but not necessarily a consistent difference between categories. Examples include ratings (e.g., low, medium, high) or education levels.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'Education': ['High School', 'Bachelor', 'Master', 'PhD']
   }
   df = pd.DataFrame(data)

   # Defining the order of categories
   categories = ['High School', 'Bachelor', 'Master', 'PhD']
   df['Education'] = pd.Categorical(df['Education'], categories=categories, ordered=True)
   print(df['Education'].cat.categories)
   print(df['Education'].cat.ordered)
   ```
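
Declaring the order pays off in comparisons and sorting. A brief sketch, assuming the same four education levels in a shuffled order (the variable names are illustrative):

```python
import pandas as pd

levels = ['High School', 'Bachelor', 'Master', 'PhD']
education = pd.Series(
    pd.Categorical(['Master', 'High School', 'PhD', 'Bachelor'],
                   categories=levels, ordered=True)
)

# Comparisons follow the declared order, not alphabetical order
at_least_bachelor = (education >= 'Bachelor').tolist()
print(at_least_bachelor)

# Sorting and max() also respect the declared order
sorted_levels = education.sort_values().tolist()
highest = education.max()
print(sorted_levels, highest)
```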


#### **Encoding Categorical Data**

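1. **Label Encoding**

   - Label encoding converts categorical values into numerical labels by assigning a unique integer to each category. Integer codes suit ordinal data only when they follow the category order; for nominal data they can imply an ordering that does not exist. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order of the labels and is intended for encoding target labels, so the codes in this example do not follow the education ranking.

   **Example:**

   ```python
   import pandas as pd
   from sklearn.preprocessing import LabelEncoder

   data = {
       'Education': ['High School', 'Bachelor', 'Master', 'PhD']
   }
   df = pd.DataFrame(data)
   le = LabelEncoder()

   # Classes are sorted alphabetically: Bachelor=0, High School=1, Master=2, PhD=3
   df['Education_encoded'] = le.fit_transform(df['Education'])
   print(df)
   ```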
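
If the integer codes need to follow the ranking, one option is an ordered `pd.Categorical`, whose `.codes` attribute reflects the declared category order. The sketch below is a pandas-only alternative, not part of the scikit-learn example; an unordered `pd.Categorical` sorts its categories lexically, so it yields the alphabetical codes for comparison.

```python
import pandas as pd

values = ['High School', 'Bachelor', 'Master', 'PhD']
levels = ['High School', 'Bachelor', 'Master', 'PhD']  # lowest to highest

# Default categories are sorted alphabetically
alphabetical_codes = pd.Categorical(values).codes.tolist()
print(alphabetical_codes)

# Explicit, ordered categories give codes that follow the ranking
ordered_codes = pd.Categorical(values, categories=levels, ordered=True).codes.tolist()
print(ordered_codes)
```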


2. **One-Hot Encoding**

   - One-hot encoding transforms a categorical variable into binary columns, one per category. This method is ideal for nominal data, as it does not assume any ordinal relationship between categories.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'City': ['Bangalore', 'Chennai', 'Hyderabad']
   }
   df = pd.DataFrame(data)

   # One-Hot Encoding
   df_encoded = pd.get_dummies(df, columns=['City'])
   print(df_encoded)
   ```
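
Two `get_dummies()` options are often useful here. `dtype=int` produces 0/1 columns (recent pandas versions return booleans by default), and `drop_first=True` drops one category to avoid a redundant column. A short sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Bangalore', 'Chennai', 'Hyderabad']})

# One 0/1 column per city
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
print(encoded)

# drop_first=True removes the first category ('Bangalore');
# a row of all zeros then stands for the dropped category
reduced = pd.get_dummies(df, columns=['City'], drop_first=True, dtype=int)
print(reduced)
```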


3. **Frequency Encoding**

   - Frequency encoding replaces each category with the number of times it occurs in the dataset. This method can be useful when the frequency of occurrence might be informative.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chennai']
   }
   df = pd.DataFrame(data)

   # Frequency Encoding
   frequency = df['City'].value_counts()
   df['City_encoded'] = df['City'].map(frequency)
   print(df)
   ```
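
Raw counts depend on the size of the dataset. A variant, sketched here with `value_counts(normalize=True)`, encodes each category by its relative frequency instead (the column name `City_freq` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chennai']})

# Share of rows belonging to each city (sums to 1.0)
relative_freq = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(relative_freq)
print(df)
```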


4. **Target Encoding**

   - Target encoding replaces each category with the mean of the target variable for that category. It is often used in predictive modeling to capture the relationship between the categorical variable and the target. In practice, the means should be computed on the training data only; computing them on the full dataset leaks target information into the feature.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chennai'],
       'Target': [1, 0, 1, 0]
   }
   df = pd.DataFrame(data)

   # Target Encoding
   target_mean = df.groupby('City')['Target'].mean()
   df['City_encoded'] = df['City'].map(target_mean)
   print(df)
   ```
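
A minimal sketch of applying target encoding to new data, assuming a hypothetical training frame `train_df` and new rows `new_df`. The per-city means are learned from the training rows only, and a category that was never seen (here the illustrative city `'Mysore'`) falls back to the overall training mean, which is one common choice among several.

```python
import pandas as pd

train_df = pd.DataFrame({
    'City': ['Bangalore', 'Chennai', 'Hyderabad', 'Chennai'],
    'Target': [1, 0, 1, 0],
})
new_df = pd.DataFrame({'City': ['Chennai', 'Mysore']})  # 'Mysore' is unseen

# Learn the encoding from the training data only
city_means = train_df.groupby('City')['Target'].mean()
global_mean = train_df['Target'].mean()

# Unseen categories map to NaN, so fill them with the global mean
new_df['City_encoded'] = new_df['City'].map(city_means).fillna(global_mean)
print(new_df)
```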


5. **Custom Encoding**

   - Custom encoding defines specific encoding rules based on domain knowledge or requirements. This approach allows for flexibility in how categories are represented numerically.

   **Example:**

   ```python
   import pandas as pd

   data = {
       'City': ['Bangalore', 'Chennai', 'Hyderabad']
   }
   df = pd.DataFrame(data)

   # Custom Encoding
   city_encoding = {'Bangalore': 1, 'Chennai': 2, 'Hyderabad': 3}
   df['City_encoded'] = df['City'].map(city_encoding)
   print(df)
   ```
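
With a hand-written mapping, any category missing from the dictionary becomes NaN after `map()`. The sketch below, which assumes an illustrative unmapped city `'Mysore'` and a sentinel code of `-1`, lists the unmapped values and then fills them explicitly.

```python
import pandas as pd

city_encoding = {'Bangalore': 1, 'Chennai': 2, 'Hyderabad': 3}
df = pd.DataFrame({'City': ['Bangalore', 'Chennai', 'Mysore']})

# Categories absent from the mapping come back as NaN
encoded = df['City'].map(city_encoding)
unmapped = df.loc[encoded.isna(), 'City'].tolist()
print(unmapped)

# Use -1 as an explicit "unknown" code (an arbitrary choice for this sketch)
df['City_encoded'] = encoded.fillna(-1).astype(int)
print(df)
```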
