In [24]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

In [25]:
### Iterative Imputer

The Iterative Imputer is an advanced imputation method that models each feature with missing values as a function of the other features in a round-robin fashion. The technique is inspired by multiple imputation by chained equations (MICE); by default, scikit-learn's implementation returns a single imputed dataset rather than several.

### How Iterative Imputer Works

1. **Initial Imputation:** Start by filling all missing values using a simple strategy (e.g., mean imputation).
2. **Iterative Modeling:** Treat each feature with missing values as a target variable and regress it on other features, iteratively updating the missing values.
3. **Convergence:** Repeat the process until the imputations converge, or for a specified number of iterations.
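
The three steps above can be sketched by hand. This is a minimal illustration, not scikit-learn's implementation: the helper name `iterative_impute` and its `n_rounds` and `tol` parameters are hypothetical, and plain `LinearRegression` is swapped in for simplicity (scikit-learn's `IterativeImputer` uses `BayesianRidge` by default).

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def iterative_impute(X, n_rounds=10, tol=1e-3):
    """Hypothetical helper: round-robin imputation with linear regression."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: initial imputation with the mean of each column's observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_rounds):
        previous = filled.copy()
        # Step 2: visit each column that has missing entries, one at a time.
        for col in np.flatnonzero(missing.any(axis=0)):
            rows_missing = missing[:, col]
            others = np.delete(filled, col, axis=1)
            model = LinearRegression()
            # Fit on rows where this column was observed...
            model.fit(others[~rows_missing], filled[~rows_missing, col])
            # ...and overwrite only the originally missing entries.
            filled[rows_missing, col] = model.predict(others[rows_missing])
        # Step 3: stop early once no imputed value moves by more than tol.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


X = np.array([
    [25.0, 50000.0, 100.0],
    [30.0, np.nan, 200.0],
    [np.nan, 60000.0, np.nan],
    [40.0, 70000.0, 300.0],
    [35.0, 80000.0, 250.0],
])
out = iterative_impute(X)
print(out)
```

Observed values are never modified; only the cells that started as `NaN` are re-estimated in each round.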

### Usage in scikit-learn

The `IterativeImputer` in scikit-learn provides a straightforward way to perform this type of imputation. Here's an example to demonstrate how to use it.

#### 1. Import Necessary Libraries

```python
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
```

#### 2. Create Sample Data

```python
# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'Fare': [100.0, 200.0, np.nan, 300.0, 250.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
```

#### 3. Initialize and Apply Iterative Imputer

```python
# Initialize IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)

# Apply the imputer to the DataFrame
imputed_data = imputer.fit_transform(df)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nDataFrame after Iterative Imputation:")
print(imputed_df)
```

### Full Example Code

```python
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

# Sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 35],
    'Income': [50000, np.nan, 60000, 70000, 80000],
    'Fare': [100.0, 200.0, np.nan, 300.0, 250.0]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=0)

# Apply the imputer to the DataFrame
imputed_data = imputer.fit_transform(df)

# Convert the imputed data back to a DataFrame
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)

print("\nDataFrame after Iterative Imputation:")
print(imputed_df)
```

### Output

```
Original DataFrame:
    Age   Income   Fare
0  25.0  50000.0  100.0
1  30.0      NaN  200.0
2   NaN  60000.0    NaN
3  40.0  70000.0  300.0
4  35.0  80000.0  250.0

DataFrame after Iterative Imputation:
         Age        Income        Fare
0  25.000000  50000.000000  100.000000
1  30.000000  67268.658111  200.000000
2  34.674414  60000.000000  224.787345
3  40.000000  70000.000000  300.000000
4  35.000000  80000.000000  250.000000
```

### Explanation of the Output

- **Age:** The missing value in the `Age` column is imputed using a model that considers the relationship with the `Income` and `Fare` columns.
- **Income:** The missing value in the `Income` column is imputed considering the relationship with `Age` and `Fare`.
- **Fare:** The missing value in the `Fare` column is imputed considering the relationship with `Age` and `Income`.

### Considerations

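- **Computationally Intensive:** Each round fits one regression model per feature with missing values, so the cost grows with the number of features, the number of rows, and `max_iter`.
- **Convergence:** The process may not always converge, especially if the relationships between variables are complex. If `max_iter` is reached before the change between rounds falls below `tol`, scikit-learn emits a `ConvergenceWarning`.
- **Parameter Tuning:** You may need to experiment with the number of iterations (`max_iter`), the stopping tolerance (`tol`), the regression model (`estimator`), and the starting fill (`initial_strategy`) to get good results.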
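
As a rough sketch of these considerations, the snippet below sets a few `IterativeImputer` parameters explicitly and checks whether a `ConvergenceWarning` was raised. The specific values (`max_iter=25`, `tol=1e-3`, `initial_strategy="median"`, and `LinearRegression` as the estimator) are illustrative assumptions, not recommendations.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Same toy data as the example above (Age, Income, Fare)
X = np.array([
    [25.0, 50000.0, 100.0],
    [30.0, np.nan, 200.0],
    [np.nan, 60000.0, np.nan],
    [40.0, 70000.0, 300.0],
    [35.0, 80000.0, 250.0],
])

imputer = IterativeImputer(
    estimator=LinearRegression(),  # the default is BayesianRidge
    max_iter=25,                   # upper bound on imputation rounds
    tol=1e-3,                      # tolerance of the stopping condition
    initial_strategy="median",     # fill used before the first round
    random_state=0,
)

# Record warnings so we can tell whether the stopping criterion was met.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X_imputed = imputer.fit_transform(X)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("rounds run:", imputer.n_iter_)
print("converged:", converged)
```

If `converged` is `False`, raising `max_iter` or loosening `tol` are the usual first things to try.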

### Summary

The Iterative Imputer is a powerful method for handling missing data by leveraging the relationships between variables. It can provide more accurate imputations compared to simpler methods, especially when there are strong correlations between features. Using `IterativeImputer` from scikit-learn, you can implement this technique in your data preprocessing pipeline to effectively deal with missing values.


In [26]:
df = np.round(pd.read_csv('50_Startups.csv')[['R&D Spend','Administration','Marketing Spend','Profit']]/10000)
np.random.seed(9)
df = df.sample(5)
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
21,8.0,15.0,30.0,11.0
37,4.0,5.0,20.0,9.0
2,15.0,10.0,41.0,19.0
14,12.0,16.0,26.0,13.0
44,2.0,15.0,3.0,7.0


In [27]:
# Drop the last column (Profit); .copy() gives an independent frame,
# so the assignments below do not raise SettingWithCopyWarning
df = df.iloc[:,0:-1].copy()
df

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,4.0,5.0,20.0
2,15.0,10.0,41.0
14,12.0,16.0,26.0
44,2.0,15.0,3.0


In [28]:
# Knock out one value per column (np.NaN was removed in NumPy 2.0; use np.nan)
df.iloc[1,0] = np.nan
df.iloc[3,1] = np.nan
df.iloc[-1,-1] = np.nan


In [29]:
df.head()


Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,,26.0
44,2.0,15.0,


In [30]:
# Step 1 - Impute all missing values with mean of respective col

df0 = pd.DataFrame()

df0['R&D Spend'] = df['R&D Spend'].fillna(df['R&D Spend'].mean())
df0['Administration'] = df['Administration'].fillna(df['Administration'].mean())
df0['Marketing Spend'] = df['Marketing Spend'].fillna(df['Marketing Spend'].mean())

In [31]:
# 0th iteration

df0

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,9.25,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25
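
The column-by-column `fillna` calls above can also be written in one step. The sketch below rebuilds the five-row frame by hand (values copied from the output above, so the CSV is not needed) and shows two equivalent ways to get the 0th iteration; the name `df0_sklearn` is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Values copied from the notebook output above
df = pd.DataFrame(
    {
        "R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
        "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
        "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan],
    },
    index=[21, 37, 2, 14, 44],
)

# df.mean() skips NaN, and fillna matches the means to columns by name
df0 = df.fillna(df.mean())

# The same fill with scikit-learn's SimpleImputer
df0_sklearn = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(df0)
```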


In [32]:
# Remove the imputed value in the first column (R&D Spend) so it can be re-estimated
df1 = df0.copy()

df1.iloc[1,0] = np.nan

df1

Unnamed: 0,R&D Spend,Administration,Marketing Spend
21,8.0,15.0,30.0
37,,5.0,20.0
2,15.0,10.0,41.0
14,12.0,11.25,26.0
44,2.0,15.0,29.25


In [33]:
# Train on the four rows where R&D Spend is known (positions 0, 2, 3, 4);
# row 1 (index 37) is held out for prediction
X = df1.iloc[[0,2,3,4],1:3]
X

Unnamed: 0,Administration,Marketing Spend
21,15.0,30.0
2,10.0,41.0
14,11.25,26.0
44,15.0,29.25


In [34]:
y = df1.iloc[[0,2,3,4],0]
y

21     8.0
2     15.0
14    12.0
44     2.0
Name: R&D Spend, dtype: float64

In [35]:
lr = LinearRegression()
lr.fit(X,y)
# Predict the missing R&D Spend for row 1 from its Administration and
# Marketing Spend values; [[1]] keeps the row as a 2D DataFrame
lr.predict(df1.iloc[[1],1:])
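
The single-column step above extends to a full pass over all three columns. The sketch below is one possible way to finish iteration 1, assuming the same five rows (rebuilt by hand so it runs without the CSV) and assuming each column is re-estimated from the most recently updated values of the others. The difference `df1 - df0` shows how far each imputed cell moved; repeating the pass until those differences are close to zero is the convergence check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Values copied from the notebook output above
df = pd.DataFrame(
    {
        "R&D Spend": [8.0, np.nan, 15.0, 12.0, 2.0],
        "Administration": [15.0, 5.0, 10.0, np.nan, 15.0],
        "Marketing Spend": [30.0, 20.0, 41.0, 26.0, np.nan],
    },
    index=[21, 37, 2, 14, 44],
)

missing = df.isna()          # remember which cells were originally missing
df0 = df.fillna(df.mean())   # iteration 0: mean imputation

df1 = df0.copy()
for col in df.columns:
    rows = missing[col]                # rows to re-estimate in this column
    features = df1.drop(columns=col)   # the other columns act as predictors
    lr = LinearRegression()
    lr.fit(features[~rows], df1.loc[~rows, col])
    df1.loc[rows, col] = lr.predict(features[rows])

# Change per cell between iteration 0 and iteration 1 (0 for observed cells)
print(df1 - df0)
```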