## 🟦 Importing Libraries

**Purpose**  
Load all required libraries for data handling and preprocessing.

**Explanation**  
- NumPy → numerical operations  
- Pandas → data manipulation  
- sklearn modules → preprocessing tools  

**Key Idea**  
Every ML workflow starts by importing required tools.


In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder


## 📂 Loading the Dataset

The dataset is loaded into a Pandas DataFrame for analysis.

✅ This step ensures the data is available in memory.  
✅ Previewing the data helps verify correct loading and understand structure.

👉 Always take a quick look at your data before doing anything else.


In [2]:
df = pd.read_csv('dataset/covid_toy.csv')
df.head()

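|     | age | gender | fever | cough | city    | has_covid |
|-----|-----|--------|-------|-------|---------|-----------|
| 0   | 60  | Male   | 103.0 | Mild  | Kolkata | No        |
| 1   | 27  | Male   | 100.0 | Mild  | Delhi   | Yes       |
| 2   | 42  | Male   | 101.0 | Mild  | Delhi   | No        |
| 3   | 31  | Female | 98.0  | Mild  | Kolkata | No        |
| 4   | 65  | Female | 101.0 | Mild  | Mumbai  | No        |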
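
Beyond `head()`, a few other quick checks are often worth running right after loading. The sketch below is only illustrative: it builds a tiny hand-made DataFrame (`toy`, with made-up values that mimic the columns above) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [60, 27, 42],
    "gender": ["Male", "Male", "Female"],
    "fever": [103.0, np.nan, 101.0],
    "cough": ["Mild", "Mild", "Strong"],
    "city": ["Kolkata", "Delhi", "Delhi"],
    "has_covid": ["No", "Yes", "No"],
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # which columns are numeric and which hold text
toy.info()         # non-null counts per column in a single view
```

On the real data, the same three calls can be run on `df`.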


## 🔍 Checking Data Quality

Here we inspect missing values in each column.

✅ Helps identify incomplete data  
✅ Guides decisions for imputation  

👉 Missing data is common in real datasets and must be handled carefully.


In [3]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64
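
Raw counts are easier to judge as a share of the rows. With 10 of the 100 `fever` values missing, that share is 10%. A minimal sketch of the calculation, using a made-up four-row frame (`toy`) rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; one of the four fever values is missing.
toy = pd.DataFrame({
    "age": [60, 27, 42, 31],
    "fever": [103.0, np.nan, 101.0, 98.0],
})

# isnull() gives True/False per cell; the column mean is the fraction of True values.
missing_share = toy.isnull().mean()
print(missing_share)
```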

In [4]:
df['city'].value_counts()

city
Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: count, dtype: int64

In [5]:
df['cough'].value_counts()

cough
Mild      62
Strong    38
Name: count, dtype: int64

In [6]:
df['gender'].value_counts()

gender
Female    59
Male      41
Name: count, dtype: int64

## ✂️ Splitting the Dataset

The data is divided into:
- Training set (for learning)  
- Testing set (for evaluation)

✅ Reveals when the model has merely memorized the training data  
✅ Ensures fair performance measurement  

👉 Never train and test on the same data.


In [7]:
from sklearn.model_selection import train_test_split

# Features are every column except the target; 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['has_covid']), df['has_covid'], test_size=0.2
)

In [8]:
X_train

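|     | age | gender | fever | cough  | city      |
|-----|-----|--------|-------|--------|-----------|
| 34  | 74  | Male   | 102.0 | Mild   | Mumbai    |
| 44  | 20  | Male   | 102.0 | Strong | Delhi     |
| 33  | 26  | Female | 98.0  | Mild   | Kolkata   |
| 55  | 81  | Female | 101.0 | Mild   | Mumbai    |
| 4   | 65  | Female | 101.0 | Mild   | Mumbai    |
| ... | ... | ...    | ...   | ...    | ...       |
| 13  | 64  | Male   | 102.0 | Mild   | Bangalore |
| 77  | 8   | Female | 101.0 | Mild   | Kolkata   |
| 15  | 70  | Male   | 103.0 | Strong | Kolkata   |
| 22  | 71  | Female | 98.0  | Strong | Kolkata   |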
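
Because the split is random, the rows shown will differ from run to run. If a reproducible split is wanted, one option is to pass `random_state` and, for classification targets, `stratify`. The sketch below assumes a made-up 20-row frame and an arbitrary seed of 42; neither comes from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows with a perfectly balanced Yes/No target.
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "has_covid": ["Yes", "No"] * 10,
})

X = toy.drop(columns=["has_covid"])
y = toy["has_covid"]

# random_state fixes the shuffle (42 is an arbitrary choice);
# stratify=y keeps the Yes/No ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```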


## 🧠 Without ColumnTransformer

## 🌡️ Handling Missing Numerical Values

Missing numeric values are filled using statistical estimates like the mean or median (`SimpleImputer` uses the mean by default).

✅ Prevents errors during model training  
✅ Keeps dataset size intact  

👉 Most scikit-learn models cannot handle NaN values directly.


In [9]:
# Impute missing values in the fever column (default strategy: mean)
si = SimpleImputer()
X_train_fever = si.fit_transform(X_train[['fever']])

# Reuse the mean learned from the training data; do not refit on the test set
X_test_fever = si.transform(X_test[['fever']])
                                 
X_train_fever.shape

(80, 1)
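
To make the fit-on-train, transform-on-test pattern concrete, here is a minimal sketch with made-up fever values (the frames `train` and `test` are hypothetical). The imputer learns the mean from the training column only, and that same value fills the gap in the test column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values for illustration.
train = pd.DataFrame({"fever": [100.0, 102.0, np.nan, 104.0]})
test = pd.DataFrame({"fever": [np.nan, 99.0]})

imputer = SimpleImputer(strategy="mean")     # "mean" is the default strategy
train_filled = imputer.fit_transform(train)  # learns the training mean
test_filled = imputer.transform(test)        # reuses it, learns nothing new

print(imputer.statistics_)   # mean of 100, 102 and 104
print(test_filled.ravel())
```

Fitting only on the training column keeps information from the test set out of the preprocessing step.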

## 🔢 Ordinal Encoding

Ordered categories are converted into numbers.

Example:
- Mild → 0  
- Strong → 1  

✅ Preserves natural ranking  
✅ Useful when order matters  

👉 Only use this when categories have real order.


In [10]:
# Ordinal encoding -> cough (explicit order: Mild < Strong)
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# Apply the already-fitted encoder to the test data
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)
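
The mapping can be checked on a few made-up rows. In this sketch (the frames `train` and `test` are hypothetical), the explicit `categories` list fixes the order, so `Mild` becomes 0 and `Strong` becomes 1 regardless of which value appears first in the data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows for illustration.
train = pd.DataFrame({"cough": ["Strong", "Mild", "Mild"]})
test = pd.DataFrame({"cough": ["Mild", "Strong"]})

# The position in the list defines the code: Mild -> 0, Strong -> 1.
encoder = OrdinalEncoder(categories=[["Mild", "Strong"]])
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(train_codes.ravel())  # [1. 0. 0.]
print(test_codes.ravel())   # [0. 1.]
```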

## 🚻 One-Hot Encoding

Categorical values are converted into binary columns.

✅ Prevents fake numeric relationships  
✅ Makes categories model-friendly  

🚨 Dummy Variable Trap:
The binary columns of one feature always sum to 1, so one of them is redundant. Dropping one column per feature (`drop='first'`) removes that redundancy.

👉 Best for nominal (unordered) categories.


In [11]:
# One-hot encoding -> gender, city
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

# Apply the already-fitted encoder so the test columns match the training columns
X_test_gender_city = ohe.transform(X_test[['gender','city']])

X_train_gender_city.shape

(80, 4)

## 🎯 Keeping Useful Numerical Features

Some columns do not need encoding.

✅ These are kept as-is  
✅ Later combined with transformed features  

👉 Not every column needs preprocessing.


In [12]:
# Extract age, which needs no preprocessing
X_train_age = X_train[['age']].values

# Same for the test data
X_test_age = X_test[['age']].values

X_train_age.shape

(80, 1)

## 🔗 Combining Features

All processed parts are merged into one dataset.

✅ Produces final feature matrix  
❌ Manual merging can get messy and error-prone  

👉 This motivates automated pipelines.


In [13]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)
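
The 7 columns are 1 (age) + 1 (fever) + 4 (gender and city) + 1 (cough). A plain NumPy array carries no column names, so the order of the pieces has to be tracked by hand. A small sketch with made-up arrays shows one way to keep it visible; the column labels are an assumption based on the concatenation order and on alphabetical one-hot categories with the first one dropped.

```python
import numpy as np
import pandas as pd

# Made-up, already-transformed pieces for two rows.
age = np.array([[60], [27]])
fever = np.array([[103.0], [100.0]])
gender_city = np.array([[1.0, 0.0, 1.0, 0.0],
                        [1.0, 1.0, 0.0, 0.0]])
cough = np.array([[0.0], [0.0]])

# axis=1 places the pieces side by side, in the order given.
combined = np.concatenate((age, fever, gender_city, cough), axis=1)

# Hypothetical labels; they must follow the same order as the tuple above.
columns = ["age", "fever", "gender_Male",
           "city_Delhi", "city_Kolkata", "city_Mumbai", "cough"]
labeled = pd.DataFrame(combined, columns=columns)
print(labeled)
```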

## 🧠 ColumnTransformer Setup

Different preprocessing steps are assigned to different columns in a single transformer object. Columns that are not listed (here `age`) are kept unchanged through `remainder='passthrough'`.

✅ Cleaner workflow  
✅ Less manual work  
✅ Lower risk of mistakes  

👉 This is the professional approach.


In [14]:
from sklearn.compose import ColumnTransformer

In [15]:
transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
    ('tnf3',OneHotEncoder(sparse_output=False,drop='first'),['gender','city'])
],remainder='passthrough')

In [16]:
transformer.fit_transform(X_train).shape

(80, 7)

In [17]:
transformer.transform(X_test).shape

(20, 7)

## ✅ Applying the Pipeline

The transformer is:
- Fitted on training data  
- Applied to both train and test sets

✅ Prevents data leakage  
✅ Ensures consistent preprocessing  

🎉 Now the data is ready for modeling!

👉 Good preprocessing often matters more than the model itself.
