In [1]:
import sklearn
import numpy as np
import pandas as pd
print("scikit-learn version:", sklearn.__version__)


scikit-learn version: 1.7.1


In [3]:
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, None, 35, 40, None],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
    "Salary": [70000, 80000, 120000, None, None],
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25.0,New York,70000.0
1,Bob,,Los Angeles,80000.0
2,Charlie,35.0,Chicago,120000.0
3,David,40.0,Houston,
4,Eve,,Phoenix,


In [4]:
# Check whether any column contains null values
df.isnull().sum() # Count of null values in each column

Name      0
Age       2
City      0
Salary    2
dtype: int64

In [5]:
df_drop = df.dropna() # Drop rows with any null values
df_drop

Unnamed: 0,Name,Age,City,Salary
0,Alice,25.0,New York,70000.0
2,Charlie,35.0,Chicago,120000.0


In [10]:
# df_fill = df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean()}) # Fill null values with mean
# df_fill
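
The commented-out mean fill above can be tried on a small frame like the one below. This is a minimal sketch, assuming mean imputation is acceptable for `Age` and `Salary`; the name `people` is hypothetical, and the values are copied from the DataFrame defined earlier.

```python
import pandas as pd

# Same values as the DataFrame built earlier; `people` is a hypothetical name
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, None, 35, 40, None],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
    "Salary": [70000, 80000, 120000, None, None],
})

# Replace each missing value with the mean of its own column.
# mean() skips NaN by default, so only the observed values are averaged.
df_fill = people.fillna({
    "Age": people["Age"].mean(),
    "Salary": people["Salary"].mean(),
})

print(df_fill)
print(df_fill.isnull().sum())  # every column should now report 0 nulls
```

Unlike `dropna()`, this keeps all five rows, at the cost of pulling the filled values toward the column average.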

In [9]:
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25.0,New York,70000.0
1,Bob,,Los Angeles,80000.0
2,Charlie,35.0,Chicago,120000.0
3,David,40.0,Houston,
4,Eve,,Phoenix,


In [13]:
df.isnull().mean() * 100 # Percentage of null values in each column

Name       0.0
Age       40.0
City       0.0
Salary    40.0
dtype: float64

## Encoding  🎆🌏 
(Object -> int) conversion

### What is Encoding? 
Encoding is the process of converting categorical data into a numerical format that can be used by machine learning algorithms. This is necessary because most machine learning models require numerical input.

### Why is Encoding Important?
- **Machine Learning Compatibility**: Many algorithms can only work with numerical data.
- **Improved Model Performance**: An encoding that matches the type of the variable (ordered or unordered) keeps the model from learning an order or distance between categories that does not exist.

### Types of Encoding
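1. **Label Encoding**: Converts each category into a unique integer.
   - Example: `['red', 'green', 'blue']` becomes `[2, 1, 0]` with scikit-learn's `LabelEncoder`, which numbers categories in sorted order (`blue` = 0, `green` = 1, `red` = 2).
   - Use when categories are ordinal (have a meaningful order), and check that the assigned integers follow that order. For nominal features the integers imply an order that does not exist.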
2. **One-Hot Encoding**: Creates binary columns for each category.
   - Example: `['red', 'green', 'blue']` becomes three columns:
    ```
    red | green | blue
    1   | 0     | 0
    0   | 1     | 0
    0   | 0     | 1
    ```
   - Use when categories are nominal (no meaningful order).
3. **Binary Encoding**: Converts categories into binary code.
   - Example: `['red', 'green', 'blue']` becomes:
    ```
    red   | 00
    green | 01
    blue  | 10
    ```
4. **Frequency Encoding**: Replaces categories with their frequency in the dataset.

5. **Target Encoding**: Replaces categories with the mean of the target variable for that category.
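
One-hot, frequency, and target encoding from the list above can be sketched in plain pandas. The `toy` DataFrame, its `color` column, and the `bought` target are invented for illustration only and are not part of the sample data used in this notebook.

```python
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "red", "green"],
    "bought": [1, 0, 1, 1, 0, 0],  # hypothetical binary target
})

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(toy["color"], prefix="color", dtype=int)

# Frequency encoding: share of rows in which each category appears
freq_map = toy["color"].value_counts(normalize=True)
toy["color_freq"] = toy["color"].map(freq_map)

# Target encoding: mean of the target within each category
target_map = toy.groupby("color")["bought"].mean()
toy["color_target"] = toy["color"].map(target_map)

print(one_hot.join(toy[["color_freq", "color_target"]]))
```

The target means here are computed on the full toy data for brevity. On a real problem they should be computed on the training split only, otherwise the encoded feature leaks the target into the model.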



In [19]:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

# Import Data
df = pd.read_csv("data/sample_data.csv")  

df.dtypes



Name      object
Gender    object
City      object
Age        int64
Passed      bool
dtype: object

In [25]:
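df_label = df.copy()  # Work on a copy so the original DataFrame stays unchanged

# Use one encoder per column: refitting a shared encoder overwrites its classes_,
# which breaks inverse_transform for the column that was fitted first
le_gender = LabelEncoder()
le_passed = LabelEncoder()
df_label["Gender_Encoded"] = le_gender.fit_transform(df_label["Gender"])  # Female -> 0, Male -> 1
df_label["Passed_Encoded"] = le_passed.fit_transform(df_label["Passed"])  # True -> 1

df_label.head()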

Unnamed: 0,Name,Gender,City,Age,Passed,Gender_Encoded,Passed_Encoded
0,MunnaThanos,Female,Bhilai,18,True,0,1
1,Goti_Badmas,Female,Bhilai,19,True,0,1
2,Aman,Male,Chennai,24,True,1,1
3,Deepak,Male,Pune,30,True,1,1
4,Raj,Male,Mumbai,27,True,1,1
