In [40]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [8]:
df = pd.read_csv("covid_toy.csv")
df.head()

,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [10]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns = ['has_covid']),df['has_covid'], test_size=0.2)

In [12]:
X_train

,age,gender,fever,cough,city
66,51,Male,104.0,Mild,Kolkata
80,14,Female,99.0,Mild,Mumbai
12,25,Female,99.0,Strong,Kolkata
42,27,Male,100.0,Mild,Delhi
71,75,Female,104.0,Strong,Delhi
...,...,...,...,...,...
13,64,Male,102.0,Mild,Bangalore
37,55,Male,100.0,Mild,Kolkata
67,65,Male,99.0,Mild,Bangalore
94,79,Male,,Strong,Kolkata


In [None]:
# The manual way: more work, but it shows every step in detail


In [None]:
`SimpleImputer` is a scikit-learn (sklearn) class for imputing missing values in a dataset. Imputation is the process of replacing missing data with substitute values; `SimpleImputer` does this with simple per-column strategies such as the column mean or a fixed constant.

### Key Features and Usage of `SimpleImputer`:

1. **Handling Missing Data:**
   - It replaces missing values (NaN) with a specified strategy. The default strategy is `'mean'`, which replaces NaN values with the mean along each column.

2. **Supports Different Strategies:**
   - Besides the mean, other strategies include `'median'`, `'most_frequent'` (replaces with the most frequent value), and `'constant'` (replaces with a specified constant value).

3. **Flexible Application:**
   - It can be applied to both numerical and categorical data. The `'mean'` and `'median'` strategies require numeric columns, while `'most_frequent'` and `'constant'` also work directly on string columns.

4. **Integration with scikit-learn Pipelines:**
   - `SimpleImputer` can be easily integrated into scikit-learn pipelines, which allows for seamless data preprocessing and model building workflows.
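
As a small illustration of the point about categorical data, here is a minimal sketch that imputes a string column with `'most_frequent'` and `'constant'`, without encoding it first. The toy array and the `"Unknown"` placeholder are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column of city names with one missing entry
cities = np.array([["Delhi"], ["Mumbai"], [np.nan], ["Delhi"]], dtype=object)

# 'most_frequent' fills the gap with the most common value in the column
mode_imputer = SimpleImputer(strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(cities)

# 'constant' fills the gap with an explicit placeholder (an assumption here)
const_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
filled_const = const_imputer.fit_transform(cities)

print(filled_mode.ravel())
print(filled_const.ravel())
```

The first result fills the gap with `"Delhi"` (the mode of the column), the second with `"Unknown"`.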

### Example Usage:

Let's demonstrate `SimpleImputer` with a simple example using a dataset with missing values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 10, 20, np.nan, 50],
    'C': [1, 2, 3, 4, 5]
}

df = pd.DataFrame(data)

# Initialize SimpleImputer with strategy 'mean' (default)
imputer = SimpleImputer(strategy='mean')

# Fit and transform the data
imputed_data = imputer.fit_transform(df)

# Convert the numpy array back to a DataFrame (optional)
df_imputed = pd.DataFrame(imputed_data, columns=df.columns)

print("Original DataFrame:")
print(df)
print("\nDataFrame after imputation:")
print(df_imputed)
```

Output:
```
Original DataFrame:
     A     B  C
0  1.0   NaN  1
1  2.0  10.0  2
2  NaN  20.0  3
3  4.0   NaN  4
4  5.0  50.0  5

DataFrame after imputation:
     A          B    C
0  1.0  26.666667  1.0
1  2.0  10.000000  2.0
2  3.0  20.000000  3.0
3  4.0  26.666667  4.0
4  5.0  50.000000  5.0
```

### Explanation:

- **Importing and Initializing:**
  - We import `SimpleImputer` from `sklearn.impute` and create an instance `imputer` with `strategy='mean'`.
  
- **Fit and Transform:**
  - We call `imputer.fit_transform(df)`, which computes the mean of every column from its non-missing values and then replaces each NaN with the mean of its own column. Column `C` has no missing values, so its values are unchanged apart from being converted to float.

- **Handling Missing Values:**
  - In this example, NaN values in columns `A` and `B` are replaced with the mean of the respective columns: 3.0 for `A` (the mean of 1, 2, 4, and 5) and about 26.67 for `B` (the mean of 10, 20, and 50).

- **Output:**
  - The transformed data is returned as a numpy array, which we convert back to a DataFrame (`df_imputed`) for better readability.

### Strategies Available:

- `'mean'`: Replaces missing values with the mean of the column.
- `'median'`: Replaces missing values with the median of the column.
- `'most_frequent'`: Replaces missing values with the most frequent value in the column.
- `'constant'`: Replaces missing values with a specified constant value (provided using `fill_value` parameter).
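
To make the difference between these strategies concrete, the sketch below fits each one on the same hypothetical column, which contains one outlier and one missing value, and reads the learned fill value from the `statistics_` attribute. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: the outlier (100) pulls the mean far from the median
col = np.array([[1.0], [2.0], [2.0], [100.0], [np.nan]])

# statistics_ holds the value each fitted imputer will use for its column
fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy).fit(col)
    fills[strategy] = imputer.statistics_[0]
    print(strategy, fills[strategy])

# 'constant' ignores the data and uses fill_value instead
const_filled = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(col)
print(const_filled.ravel())
```

Here the mean is 26.25 while the median and the most frequent value are both 2.0, which is why the median is often the safer choice for skewed columns.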

### Integration with Pipelines:

`SimpleImputer` can be used within scikit-learn pipelines along with other preprocessing and modeling steps, providing a streamlined approach to handling missing data and building machine learning models.

```python
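from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Example pipeline with SimpleImputer and LinearRegression
# (assumes X_train / X_test contain only numeric columns and y_train is a numeric target)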
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict with the pipeline
y_pred = pipeline.predict(X_test)
```
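
The snippet above relies on `X_train`, `y_train`, and `X_test` already existing and being numeric. For a version that runs on its own, here is a sketch on synthetic data; the data, the coefficients, and the names `X_tr`, `X_te`, `y_tr`, `y_te` are all made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, purely illustrative data: 100 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Blank out roughly 10% of the entries to simulate missing values
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# fit() learns the column means and scaling from the training rows only;
# predict() reuses those learned values on the test rows
pipeline.fit(X_tr, y_tr)
y_pred = pipeline.predict(X_te)
print(y_pred.shape)
```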

In summary, `SimpleImputer` is a fundamental tool in scikit-learn for handling missing data by replacing NaN values with appropriate substituted values based on specified strategies. Its simplicity and integration capabilities make it essential in data preprocessing workflows for machine learning.

In [14]:
si = SimpleImputer()

In [18]:
# Learn the mean from the training data only, then reuse it for the test data
X_train_fever = si.fit_transform(X_train[['fever']])
X_test_fever = si.transform(X_test[['fever']])
X_train_fever.shape

(80, 1)
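
A common convention is to fit an imputer on the training split and only call `transform` on the test split, so that no test information leaks into preprocessing. The sketch below shows this on two tiny hypothetical frames (`train` and `test` are invented here) and reads the learned mean from `statistics_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical splits, each with one missing fever reading
train = pd.DataFrame({"fever": [98.0, 100.0, np.nan, 102.0]})
test = pd.DataFrame({"fever": [np.nan, 101.0]})

imputer = SimpleImputer()              # default strategy is 'mean'
imputer.fit(train)                     # learns the mean of the training column
train_mean = imputer.statistics_[0]

# transform() fills the test gap with the *training* mean
test_filled = imputer.transform(test)

print(train_mean)
print(test_filled.ravel())
```

The missing test value is filled with 100.0, the mean of the three observed training values, not with anything computed from the test rows.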

In [19]:
# Ordinal encoding for cough: Mild -> 0, Strong -> 1
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# Reuse the encoder fitted on the training data
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)
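
The `categories` argument fixes the order of the codes, which matters because `Mild` should map to a lower number than `Strong`. A minimal sketch on a hypothetical column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cough column (object dtype, as in a typical CSV load)
cough = pd.DataFrame({"cough": ["Strong", "Mild", "Mild", "Strong"]}, dtype=object)

# The position in the list is the code: Mild -> 0.0, Strong -> 1.0
encoder = OrdinalEncoder(categories=[["Mild", "Strong"]])
codes = encoder.fit_transform(cough)

# inverse_transform maps the codes back to the original labels
labels = encoder.inverse_transform(codes)

print(codes.ravel())
print(labels.ravel())
```

Without `categories`, the encoder would sort the labels alphabetically, which happens to give the same order here but would not for labels such as `Low`, `Medium`, `High`.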

In [20]:
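# One-hot encoding for gender and city
# (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(drop='first',sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

# Reuse the encoder fitted on the training data so the columns line up
X_test_gender_city = ohe.transform(X_test[['gender','city']])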

X_train_gender_city.shape



(80, 4)

In [21]:
# age needs no transformation, so extract it as-is
X_train_age = X_train.drop(columns=['gender','fever','cough','city']).values

# same for the test data
X_test_age = X_test.drop(columns=['gender','fever','cough','city']).values

X_train_age.shape

(80, 1)

In [26]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis = 1)
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis = 1)


X_train_transformed.shape

(80, 7)

In [39]:
X_test_transformed.shape


(20, 7)

In [27]:
# To do all of the above in one concise step, use ColumnTransformer

In [28]:
from sklearn.compose import ColumnTransformer

In [36]:
# sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
    ('tnf3',OneHotEncoder(sparse_output=False,drop='first'),['gender','city'])
],remainder='passthrough')

In [37]:
transformer.fit_transform(X_train).shape




(80, 7)

In [38]:
transformer.transform(X_test).shape


(20, 7)

In [None]:
# This was much quicker