# Data Cleaning

Data cleaning is a critical step in the data analysis pipeline. It prepares raw data for analysis by addressing issues such as missing values, duplicates, irrelevant data, and inconsistencies. Pandas, a popular Python library for data manipulation, provides methods for the most common data cleaning tasks.

## Key Steps in Data Cleaning Using Pandas
### 1. Handling Missing Data
Missing data is one of the most common problems in datasets. Pandas provides several ways to deal with missing values (NaN).

Identifying missing values: You can identify missing data using isna() or isnull(). The two methods are equivalent (isnull() is an alias of isna()) and return a Boolean DataFrame of the same shape, with True wherever a value is missing.

In [12]:
import pandas as pd

data = {
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', None],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Fire', 'Fire'],
    'Attack': [49, 62, 82, 52, None],
    'Defense': [49, 63, 83, 43, 58]
}

df = pd.DataFrame(data)

# Check for missing values
print(df.isna())

    Name  Type 1  Attack  Defense
0  False   False   False    False
1  False   False   False    False
2  False   False   False    False
3  False   False   False    False
4   True   False    True    False
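
On a larger dataset the full Boolean table is hard to read, so a summary is usually more practical. The following is a minimal sketch, reusing the same example data, that counts missing values per column and selects the rows that contain at least one. The variable names `missing_per_column` and `rows_with_missing` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', None],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Fire', 'Fire'],
    'Attack': [49, 62, 82, 52, None],
    'Defense': [49, 63, 83, 43, 58],
})

# True counts as 1 when summed, so this gives the number of
# missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# any(axis=1) is True for every row with at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```

Here only row 4 is selected, because it is missing both Name and Attack.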


#### Handling missing data:

- Drop missing values: You can remove rows or columns with missing data using dropna().

In [14]:
# Drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)

         Name Type 1  Attack  Defense
0   Bulbasaur  Grass    49.0       49
1     Ivysaur  Grass    62.0       63
2    Venusaur  Grass    82.0       83
3  Charmander   Fire    52.0       43
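
By default dropna() removes every row that contains any missing value. The sketch below, which assumes the same example data, shows three common variations: restricting the check to certain columns with subset, dropping columns instead of rows with axis=1, and keeping rows that have a minimum number of non-missing values with thresh. The result names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Charmander', None],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Fire', 'Fire'],
    'Attack': [49, 62, 82, 52, None],
    'Defense': [49, 63, 83, 43, 58],
})

# Drop rows only when 'Attack' is missing; other columns are ignored
df_attack_known = df.dropna(subset=['Attack'])

# Drop columns (axis=1) that contain any missing value
df_complete_columns = df.dropna(axis=1)

# Keep rows that have at least two non-missing values;
# row 4 has two ('Type 1' and 'Defense'), so it is kept
df_min_two = df.dropna(thresh=2)

print(df_attack_known)
print(df_complete_columns)
print(df_min_two)
```

Dropping columns removes Name and Attack entirely here, which illustrates why dropping rows or filling values is often preferable when only a few entries are missing.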


- Fill missing values: You can fill missing data using fillna() with a fixed value or a statistic such as the column mean, or propagate neighboring values with forward or backward filling (ffill() and bfill()).

In [15]:
# Fill missing values with the mean
df['Attack'] = df['Attack'].fillna(df['Attack'].mean())
print(df)

         Name Type 1  Attack  Defense
0   Bulbasaur  Grass   49.00       49
1     Ivysaur  Grass   62.00       63
2    Venusaur  Grass   82.00       83
3  Charmander   Fire   52.00       43
4        None   Fire   61.25       58
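
The other filling strategies mentioned above can be sketched in the same way. The example below uses a small hypothetical Series (not the DataFrame from the cell above) with gaps in the middle and at the end, so that the difference between forward and backward filling is visible.

```python
import pandas as pd

# Hypothetical data with two gaps, chosen for illustration
attack = pd.Series([49, None, 82, 52, None], name='Attack')

# Replace every missing value with a fixed constant
constant_filled = attack.fillna(0)

# Forward fill: copy the last valid value downwards
forward_filled = attack.ffill()

# Backward fill: copy the next valid value upwards; the final gap
# stays missing because no later value exists
backward_filled = attack.bfill()

print(pd.DataFrame({
    'original': attack,
    'constant': constant_filled,
    'ffill': forward_filled,
    'bfill': backward_filled,
}))
```

Forward and backward filling are mainly meaningful when row order carries information, for example in time series.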


### 2. Removing Duplicates
Duplicate rows in your dataset can distort analysis. You can remove them using the drop_duplicates() method, which by default compares all columns and keeps the first occurrence of each duplicated row.

In [17]:
data = {
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Grass'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49]
}

df = pd.DataFrame(data)

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

        Name Type 1  Attack  Defense
0  Bulbasaur  Grass      49       49
1    Ivysaur  Grass      62       63
2   Venusaur  Grass      82       83
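
It can help to inspect duplicates before removing them, or to define duplicates by a subset of columns. The following sketch, assuming the same example data, uses duplicated() to flag repeated rows and the subset and keep parameters of drop_duplicates() to keep the last row for each Name. The result names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Type 1': ['Grass', 'Grass', 'Grass', 'Grass'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49],
})

# True for every row that repeats an earlier row
duplicate_mask = df.duplicated()
print(duplicate_mask)

# Treat rows with the same 'Name' as duplicates and keep the last one
df_unique_names = df.drop_duplicates(subset=['Name'], keep='last')
print(df_unique_names)
```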


### 3. Scaling Data
Scaling or normalizing data ensures that all features are on the same scale, which is important when using algorithms that are sensitive to the magnitude of features, such as K-Means or neural networks.

- Min-Max Scaling: Rescales each feature linearly into a specified range, typically [0, 1], so that the column minimum maps to 0 and the column maximum maps to 1.

In [21]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Assume df['Attack'] and df['Defense'] need to be scaled
df[['Attack', 'Defense']] = scaler.fit_transform(df[['Attack', 'Defense']])
print(df)

        Name Type 1    Attack   Defense
0  Bulbasaur  Grass  0.000000  0.000000
1    Ivysaur  Grass  0.393939  0.411765
2   Venusaur  Grass  1.000000  1.000000
3  Bulbasaur  Grass  0.000000  0.000000
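
For the default [0, 1] range, min-max scaling computes (x - min) / (max - min) for each column. As a cross-check, the same values can be reproduced with plain Pandas arithmetic. This is a minimal sketch on the example data, not a replacement for MinMaxScaler, which also remembers the fitted minimum and maximum so that new data can be transformed consistently.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49],
})

cols = ['Attack', 'Defense']
col_min = df[cols].min()
col_max = df[cols].max()

# (x - min) / (max - min), applied column by column
scaled = (df[cols] - col_min) / (col_max - col_min)
print(scaled)
```

For Ivysaur, Attack becomes (62 - 49) / (82 - 49) = 13 / 33, which matches the 0.393939 shown in the output above.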


- Standardization (Z-score Normalization): Scales the data to have a mean of 0 and a standard deviation of 1.

In [22]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

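# df['Attack'] and df['Defense'] were min-max scaled in the previous cell.
# Z-scores are unchanged by that linear rescaling, so the result is the
# same as standardizing the raw values.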
df[['Attack', 'Defense']] = scaler.fit_transform(df[['Attack', 'Defense']])
print(df)

        Name Type 1    Attack   Defense
0  Bulbasaur  Grass -0.851852 -0.861550
1    Ivysaur  Grass  0.111111  0.143592
2   Venusaur  Grass  1.592593  1.579508
3  Bulbasaur  Grass -0.851852 -0.861550
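
Standardization computes (x - mean) / std for each column. The sketch below reproduces the values by hand on the raw example data. One detail is an easy source of mismatches: StandardScaler uses the population standard deviation, whereas the Pandas std() method defaults to the sample standard deviation, so ddof=0 is passed explicitly here.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49],
})

cols = ['Attack', 'Defense']

# ddof=0 selects the population standard deviation, which is what
# StandardScaler uses; the Pandas default is ddof=1
standardized = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
print(standardized)
```

For Bulbasaur, Attack becomes (49 - 60.5) / 13.5, which matches the -0.851852 shown in the output above.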


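### 4. Encoding Categorical Data
Many machine learning algorithms require numerical input, so categorical variables need to be converted into numerical format.

- Label Encoding: Converts each unique category value to an integer. LabelEncoder assigns the integers in sorted order of the category values, and uppercase letters sort before lowercase ones, so in the example below 'Grass' becomes 0, 'air' becomes 1, and 'water' becomes 2.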

In [25]:
from sklearn.preprocessing import LabelEncoder
data = {
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Type 1': ['Grass', 'water', 'air', 'Grass'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49]
}

df = pd.DataFrame(data)

encoder = LabelEncoder()

# Encoding 'Type 1' column
df['Type 1'] = encoder.fit_transform(df['Type 1'])
print(df)

        Name  Type 1  Attack  Defense
0  Bulbasaur       0      49       49
1    Ivysaur       2      62       63
2   Venusaur       1      82       83
3  Bulbasaur       0      49       49
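
The integer codes are only useful if they can be traced back to the original categories. The sketch below uses the Pandas category dtype as a stand-in for LabelEncoder (a swapped-in technique, chosen so the example depends only on Pandas). On this data it produces the same codes, because both approaches number the categories in sorted order. The names `code_to_label` and `decoded` are illustrative only.

```python
import pandas as pd

types = pd.Series(['Grass', 'water', 'air', 'Grass'], name='Type 1')

# Converting to the category dtype sorts the unique values and
# assigns each one an integer code
categorical = types.astype('category')
codes = categorical.cat.codes
print(codes.tolist())

# Mapping from integer code back to the original category
code_to_label = dict(enumerate(categorical.cat.categories))
print(code_to_label)

# Decode the integers back into category names
decoded = codes.map(code_to_label)
print(decoded.tolist())
```

Because the codes are arbitrary, a model may read them as an ordering (for example that 'water' is greater than 'air'); one-hot encoding avoids this for unordered categories.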


- One-Hot Encoding: Creates a new binary column for each category in the categorical column.

Each new column holds 1 (True) where the observation belongs to that category and 0 (False) otherwise. Unlike label encoding, this does not impose an order on the categories. Because Type 1 was label-encoded in the previous cell, the columns produced below are named after the integer codes (Type 1_0, Type 1_1, Type 1_2) rather than the original category names.
<img src="./images/onehot.png">

In [26]:
df_encoded = pd.get_dummies(df, columns=['Type 1'])
print(df_encoded)

        Name  Attack  Defense  Type 1_0  Type 1_1  Type 1_2
0  Bulbasaur      49       49      True     False     False
1    Ivysaur      62       63     False     False      True
2   Venusaur      82       83     False      True     False
3  Bulbasaur      49       49      True     False     False
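
In practice, one-hot encoding is usually applied directly to the original string column, which gives more readable column names. The following minimal sketch rebuilds the example data with the string categories and passes dtype=int to get_dummies() so that the new columns contain 0 and 1 instead of False and True.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur', 'Bulbasaur'],
    'Type 1': ['Grass', 'water', 'air', 'Grass'],
    'Attack': [49, 62, 82, 49],
    'Defense': [49, 63, 83, 49],
})

# One binary column per category, named '<column>_<category>'
df_encoded = pd.get_dummies(df, columns=['Type 1'], dtype=int)
print(df_encoded)
```

Each row has exactly one 1 across the three new Type 1 columns, marking the category that row belongs to.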
