# **7-Day Data Transformation Course**

# **Course** : Machine Learning 

# **Day 1:** Introduction to Data Transformation

# **Student**: Muhammad Shafiq

-----------------------------------

## **What is Transformation**

Data transformation is the process of converting raw data into a format suitable for machine learning models.

It includes:
- Scaling numerical values
- Encoding categorical values
- Generating new features
- Fixing skewness or outliers
- Ensuring consistent format for model input
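
Of the steps listed above, outlier handling is perhaps the least self-explanatory. The following is a minimal sketch, assuming the conventional 1.5 × IQR rule (a common default, not something this course prescribes) and made-up fare values; the name `fares` is hypothetical.

```python
import pandas as pd

# Toy fares (made up for illustration); 500.0 is an obvious extreme value.
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 500.0])

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip values that fall outside the fences instead of dropping the rows.
fares_clipped = fares.clip(lower=lower, upper=upper)
print(fares_clipped.tolist())
```

Clipping keeps every row, which matters on a small dataset; dropping the extreme rows is the other common option.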

## **Importance of Transformation**

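Most ML models work only with numeric input and have no notion of units, so they cannot directly interpret:

- Categories like "male", "female"

- Units and scales, such as height in cm versus salary in millions

- Text like "delayed", "on-time"

👉 If we don't transform (or transform carelessly):

- Model accuracy drops

- Some algorithms break or degrade (`SVM` and `KNN` are sensitive to feature scale)

- Data leakage happens (if scaling or imputation statistics are computed on the full dataset, test rows included, instead of on the training split only)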
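
To make the leakage point concrete, here is a minimal sketch of leakage-safe standardization. The `train` and `test` frames and their values are invented for illustration. The key idea is that the mean and standard deviation are learned from the training split only and then reused, unchanged, on the test split.

```python
import pandas as pd

# Toy data (made up for illustration): ages in a train and a test split.
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0]})
test = pd.DataFrame({"Age": [25.0, 60.0]})

# Learn the scaling statistics from the training split only.
age_mean = train["Age"].mean()
age_std = train["Age"].std(ddof=0)  # population std, the convention StandardScaler follows

# Apply the same training statistics to both splits.
train_scaled = (train["Age"] - age_mean) / age_std
test_scaled = (test["Age"] - age_mean) / age_std

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```

Computing `age_mean` and `age_std` on `train` and `test` combined would let information about the test rows influence the features the model is trained on, which is the leak to avoid.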



### **Types of Data Transformation**

| Type                   | Example                              | When to Use                          |
| ---------------------- | ------------------------------------ | ------------------------------------ |
| **Scaling**            | Age: 18–90 → 0–1 or -1 to 1          | For distance-based models (SVM, KNN) |
| **Encoding**           | Gender: Male/Female → 0/1 or one-hot | For tree-based or linear models      |
| **Log Transform**      | Salary: 10k–1M → log scale           | Fix skewed distributions             |
| **Datetime Transform** | "2023-01-01" → weekday, month, hour  | For time-based features              |
| **Text Transform**     | "I love pizza" → vector/TF-IDF       | NLP tasks                            |
| **Outlier Treatment**  | Handle extreme values                | Prevent distortion in model          |
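
The first four rows of the table can be sketched with plain pandas and NumPy. This is an illustrative example on a made-up frame named `toy` (not the Titanic data); min-max scaling and base-10 logs are just one reasonable choice for each row.

```python
import numpy as np
import pandas as pd

# Toy records (made up for illustration; not from the Titanic data).
toy = pd.DataFrame({
    "age": [18, 54, 90],
    "gender": ["Male", "Female", "Female"],
    "salary": [10_000, 100_000, 1_000_000],
    "date": ["2023-01-01", "2023-06-15", "2023-12-31"],
})

# Scaling: min-max maps age from 18..90 onto 0..1.
age_range = toy["age"].max() - toy["age"].min()
toy["age_scaled"] = (toy["age"] - toy["age"].min()) / age_range

# Encoding: one-hot gives each category its own 0/1 column.
dummies = pd.get_dummies(toy["gender"], prefix="gender", dtype=int)
toy = pd.concat([toy, dummies], axis=1)

# Log transform: compresses the long right tail of salary.
toy["salary_log10"] = np.log10(toy["salary"])

# Datetime transform: parse the string, then pull out calendar parts.
parsed = pd.to_datetime(toy["date"])
toy["month"] = parsed.dt.month
toy["weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6

print(toy.drop(columns=["date"]))
```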


### **Load the dataset**

In [9]:
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

print(df.dtypes)
df.head()


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


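|     | PassengerId | Survived | Pclass | Name                                              | Sex    | Age  | SibSp | Parch | Ticket           | Fare    | Cabin | Embarked |
| --- | ----------- | -------- | ------ | ------------------------------------------------- | ------ | ---- | ----- | ----- | ---------------- | ------- | ----- | -------- |
| 0   | 1           | 0        | 3      | Braund, Mr. Owen Harris                           | male   | 22.0 | 1     | 0     | A/5 21171        | 7.25    | NaN   | S        |
| 1   | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C85   | C        |
| 2   | 3           | 1        | 3      | Heikkinen, Miss. Laina                            | female | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.925   | NaN   | S        |
| 3   | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female | 35.0 | 1     | 0     | 113803           | 53.1    | C123  | S        |
| 4   | 5           | 0        | 3      | Allen, Mr. William Henry                          | male   | 35.0 | 0     | 0     | 373450           | 8.05    | NaN   | S        |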


### **Understand Raw data**

In [2]:
# shape and column summary
print(df.shape)
df.info()

(891, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### **Column-by-Column Transformation Strategy**

| Column          | Type        | Action Needed                            |
| --------------- | ----------- | ---------------------------------------- |
| **Survived**    | Target      | No transformation needed (label)         |
| **Pclass**      | Ordinal     | Could keep as is (or one-hot)            |
| **Name**        | Text        | Drop or extract titles                   |
| **Sex**         | Categorical | Label or One-hot Encode                  |
| **Age**         | Numeric     | Impute missing, scale, maybe bin         |
| **SibSp/Parch** | Numeric     | Could scale or create "FamilySize"       |
| **Ticket**      | Text        | Drop or extract prefixes                 |
| **Fare**        | Numeric     | Scale, maybe log-transform               |
| **Cabin**       | Text        | Missing-heavy → drop or use "has\_cabin" |
| **Embarked**    | Category    | Fill missing, one-hot encode             |
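
Three of the derived features named in the table can be sketched as follows. This is an illustrative example on a small hand-built frame called `sample`, with rows adapted from the preview above. Counting the passenger in `FamilySize` (the `+ 1`) and the title regex are assumptions, not definitions given in the course.

```python
import numpy as np
import pandas as pd

# Three rows shaped like the Titanic data (adapted from the preview above).
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Cabin": [np.nan, "C85", np.nan],
})

# FamilySize: siblings/spouses + parents/children + the passenger (assumed definition).
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# has_cabin: 1 when a cabin is recorded, 0 when it is missing.
sample["has_cabin"] = sample["Cabin"].notna().astype(int)

# Title: the text between the comma and the first period in Name.
sample["Title"] = (
    sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(sample[["FamilySize", "has_cabin", "Title"]])
```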


### **Analyzing and Tagging Transformations**




In [None]:
# Checking column data types
print(df.dtypes)

# Checking missing values
print(df.isnull().sum())

# Check unique values in categorical columns
print(df['Sex'].unique())
print(df['Embarked'].unique())


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
['male' 'female']
['S' 'C' 'Q' nan]


## **Transformation Planning Map**

In [4]:
transformation_plan = {
    'Survived': 'Target - no transform',
    'Pclass': 'Ordinal - keep or one-hot',
    'Name': 'Drop or extract title',
    'Sex': 'One-hot encode',
    'Age': 'Impute + scale',
    'SibSp': 'Keep or use in FamilySize',
    'Parch': 'Keep or use in FamilySize',
    'Ticket': 'Drop or extract prefix',
    'Fare': 'Log transform + scale',
    'Cabin': 'Create feature: has_cabin',
    'Embarked': 'Fill missing + one-hot'
}
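
One possible way to turn part of this plan into code is scikit-learn's `ColumnTransformer`. The sketch below covers only `Age`, `Fare`, `Sex`, and `Embarked`, runs on a made-up frame `X`, and makes several assumptions that the plan leaves open: median imputation for `Age`, `log1p` rather than a plain log for `Fare`, and most-frequent filling for the categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Toy rows shaped like the Titanic columns (values are illustrative).
X = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# 'Age': impute + scale (median imputation is an assumption).
age_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# 'Fare': log transform + scale (log1p keeps zero fares finite).
fare_steps = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

# 'Sex' and 'Embarked': fill missing with the most frequent value, then one-hot.
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("age", age_steps, ["Age"]),
    ("fare", fare_steps, ["Fare"]),
    ("cat", cat_steps, ["Sex", "Embarked"]),
])

# In a real workflow, fit on the training split only and call transform() on the test split.
X_ready = preprocess.fit_transform(X)
print(X_ready.shape)
```

Bundling the steps this way keeps every learned statistic (median, mean, standard deviation, category list) inside one object that is fitted once on the training data.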


## **Mini Assignment**

Inspect the Titanic dataset and for each column, write:

- Current type

- % missing

- Recommended transformation (and why)

### **Load the dataset**

In [5]:
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Preview data
df.head()


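|     | PassengerId | Survived | Pclass | Name                                              | Sex    | Age  | SibSp | Parch | Ticket           | Fare    | Cabin | Embarked |
| --- | ----------- | -------- | ------ | ------------------------------------------------- | ------ | ---- | ----- | ----- | ---------------- | ------- | ----- | -------- |
| 0   | 1           | 0        | 3      | Braund, Mr. Owen Harris                           | male   | 22.0 | 1     | 0     | A/5 21171        | 7.25    | NaN   | S        |
| 1   | 2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C85   | C        |
| 2   | 3           | 1        | 3      | Heikkinen, Miss. Laina                            | female | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.925   | NaN   | S        |
| 3   | 4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female | 35.0 | 1     | 0     | 113803           | 53.1    | C123  | S        |
| 4   | 5           | 0        | 3      | Allen, Mr. William Henry                          | male   | 35.0 | 0     | 0     | 373450           | 8.05    | NaN   | S        |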


In [13]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [None]:
for col in df.columns:
    print(col)

PassengerId
Survived
Pclass
Name
Sex
Age
SibSp
Parch
Ticket
Fare
Cabin
Embarked
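
A possible next step for the assignment is a per-column summary of type and missing percentage. The sketch below uses a small stand-in frame `demo` with invented values so that it runs without a download; the same two expressions should work on the Titanic `df` loaded earlier.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (illustrative values); swap in the Titanic `df` in practice.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "female"],
})

# One row per column: current dtype and percentage of missing values.
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "pct_missing": (demo.isnull().mean() * 100).round(1),
})

print(summary)
```

On the real data, the missing counts reported earlier (177, 687, and 2 out of 891 rows) work out to roughly 19.9% for `Age`, 77.1% for `Cabin`, and 0.2% for `Embarked`.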
