# Removing Features or Removing Rows

If only a few rows, relative to the size of your dataset, are missing some values, it may be reasonable to simply drop those rows. What does this cost you in terms of performance? It removes potential training/testing data, but if it's only a few rows, it is unlikely to change performance.

Sometimes it is a good idea to remove a feature entirely if it has too many null values. However, you should carefully consider why it has so many null values; in certain situations, null could simply be treated as a separate category.
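As a rough sketch of both options, the snippet below uses a small hypothetical DataFrame (the column names `lot_area` and `pool_quality` and the 60% cutoff are made up for illustration, not taken from any real dataset). It measures the fraction of nulls per column, then either drops the sparse column or keeps it and treats null as its own category.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'pool_quality' is mostly null because most houses have no pool
houses = pd.DataFrame({
    'lot_area': [8450, 9600, 11250, 9550, 14260],
    'pool_quality': [np.nan, np.nan, 'Good', np.nan, np.nan],
})

# Fraction of null values in each column
null_fraction = houses.isnull().mean()

# Option A: drop every column whose null fraction exceeds an assumed cutoff
cutoff = 0.6  # illustrative value, not a rule
sparse_columns = null_fraction[null_fraction > cutoff].index
houses_dropped = houses.drop(columns=sparse_columns)

# Option B: keep the column and treat null as a separate category
houses_filled = houses.copy()
houses_filled['pool_quality'] = houses_filled['pool_quality'].fillna('None')

print(null_fraction)
print(houses_dropped.columns.tolist())
print(houses_filled)
```

Which option is appropriate depends on why the values are null, which is a domain question the code cannot answer.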

Take, for example, a feature column for the number of cars that can fit into a garage. Perhaps a house with no garage is recorded as null instead of zero. In that case it probably makes more sense to fill the null values with zero. Only you can decide, based on your domain expertise and knowledge of the dataset!
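The garage case might look like the following minimal sketch. The DataFrame and the column name `garage_cars` are hypothetical, and the fill is only valid under the assumption that null really means "no garage".

```python
import numpy as np
import pandas as pd

# Hypothetical data: null is assumed to mean "this house has no garage"
homes = pd.DataFrame({'garage_cars': [2.0, np.nan, 1.0, np.nan, 3.0]})

# Replace the null placeholder with the value it stands for
homes['garage_cars'] = homes['garage_cars'].fillna(0)

print(homes)
```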

## Working based on Rows Missing Data

## Filling in Data or Dropping Data?

Let's explore how to choose between removing and filling in missing data for rows that are missing some values. First, pick a threshold below which it is acceptable to drop a row instead of attempting to fill in the missing data point. We will use 1%: if fewer than 1% of the rows are missing a given feature, we will consider dropping those rows rather than dealing with the feature itself. There is no single right answer here, so use common sense and your domain knowledge of the dataset. You obviously don't want a very high threshold like 50%, because that would throw away half of the data. You should also explore how the feature correlates with the rest of the dataset; it may make more sense to drop the feature instead.
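One way to apply such a threshold is sketched below. The helper `percent_missing`, the sample DataFrame, and its column names are all hypothetical; only the 1% cutoff comes from the text above.

```python
import pandas as pd

def percent_missing(frame):
    """Return the percentage of missing values per column, for columns that have any."""
    pct = 100 * frame.isnull().mean()
    return pct[pct > 0].sort_values()

# Hypothetical data: 200 rows, 'electrical' is missing in 1 row (0.5%),
# 'fence' is missing in 150 rows (75%)
sample = pd.DataFrame({
    'price': list(range(200)),
    'electrical': ['SBrkr'] * 199 + [None],
    'fence': [None] * 150 + ['MnPrv'] * 50,
})

threshold = 1.0  # percent
missing = percent_missing(sample)

# Features missing in fewer than 1% of rows: drop the affected rows
row_drop_columns = missing[missing < threshold].index.tolist()
sample_reduced = sample.dropna(subset=row_drop_columns)

print(missing)
print(row_drop_columns, len(sample_reduced))
```

Columns above the threshold, such as `fence` here, are left untouched by this step and still need a fill or drop decision of their own.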

Based on the text description of the features, you will see that most of this missing data is actually NaN on purpose as a placeholder for 0 or "none".

In [None]:
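# Option A: drop every row that contains at least one missing value
df.dropna(axis=0, inplace=True)

# Option B (an alternative to A, not an extra step): drop every column that contains at least one missing value
df.dropna(axis=1, inplace=True)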

### 2-Mean/median imputation:
#### Replace missing values with the mean or median value of the feature. The mean works well when the feature is numerical and its distribution is roughly symmetric; the median is more robust to skew and outliers.
Example code in Python:

In [None]:
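# Imputing missing values with mean
# (assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas)
mean = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(mean)

# Alternatively, imputing missing values with median
median = df['column_name'].median()
df['column_name'] = df['column_name'].fillna(median)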

### 3-Mode imputation:
#### Replace missing values with the most frequent value of the feature. This strategy works well if the feature is categorical.
Example code in Python:

In [None]:
# Imputing missing values with mode
# (mode() returns a Series because there can be ties; [0] takes the first one)
mode = df['column_name'].mode()[0]
df['column_name'] = df['column_name'].fillna(mode)

### 4-Regression imputation:
#### Use regression models to predict the missing values based on other features in the dataset. This strategy works well when the feature is numerical and strongly correlated with other features.
Example code in Python:

In [None]:
from sklearn.linear_model import LinearRegression

# Splitting dataset into two parts - with and without missing values
# (this assumes every other column is numeric and has no missing values of its own)
known = df[df['column_name'].notna()]
unknown = df[df['column_name'].isna()]

# Training regression model on known values
X_train = known.drop('column_name', axis=1)
y_train = known['column_name']
reg = LinearRegression().fit(X_train, y_train)

# Predicting missing values using regression model
X_test = unknown.drop('column_name', axis=1)
y_pred = reg.predict(X_test)

# Filling in missing values with predicted values
df.loc[df['column_name'].isna(), 'column_name'] = y_pred
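To see the idea run end to end, here is a minimal self-contained sketch on made-up data. The DataFrame `people` and its columns are hypothetical, and the weights were constructed to follow `0.5 * height - 25` exactly so the prediction is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: weight = 0.5 * height - 25, with one weight missing
people = pd.DataFrame({
    'height': [160.0, 170.0, 180.0, 190.0, 175.0],
    'weight': [55.0, 60.0, 65.0, 70.0, np.nan],
})

missing_mask = people['weight'].isna()

# Fit on the rows where the target is known
model = LinearRegression().fit(
    people.loc[~missing_mask, ['height']],
    people.loc[~missing_mask, 'weight'],
)

# Predict the target for the rows where it is missing and write it back
people.loc[missing_mask, 'weight'] = model.predict(people.loc[missing_mask, ['height']])

print(people)
```

Real data will not be perfectly linear, so imputed values produced this way carry model error that the filled-in column does not show.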

### 5-Multiple imputation:
#### Use statistical models to generate multiple imputed datasets and combine the results. This strategy works well when there are complex patterns of missingness in the dataset.
Example code in Python:

In [None]:
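import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

# IterativeImputer models each feature from the other features.
# With default settings it returns a single imputed dataset and expects numeric columns only.
imp = IterativeImputer(random_state=0)
imputed = imp.fit_transform(df)

# Wrapping the imputed array back into a DataFrame
df = pd.DataFrame(imputed, columns=df.columns, index=df.index)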
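`IterativeImputer` produces one completed dataset per call. One way to approximate multiple imputation, sketched below on hypothetical data, is to set `sample_posterior=True` and repeat the imputation with different random seeds. Averaging the imputed values at the end is a simplification chosen for brevity; a fuller workflow would fit the downstream model on each imputed dataset and pool those estimates.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical numeric data with a few missing entries
measurements = pd.DataFrame({
    'age': [25, 32, np.nan, 28, 30, 41, 36, 29],
    'height': [160, np.nan, 175, 180, 165, 172, 168, 177],
    'weight': [55, 70, np.nan, 80, 65, 74, 68, 77],
})

n_imputations = 5  # illustrative choice
imputed_frames = []
for seed in range(n_imputations):
    # sample_posterior=True draws imputed values from the predictive
    # distribution, so each seed gives a different completed dataset
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    values = imputer.fit_transform(measurements)
    imputed_frames.append(pd.DataFrame(values, columns=measurements.columns))

# Simplified pooling: average the completed datasets cell by cell
pooled = sum(imputed_frames) / n_imputations

print(pooled)
```

The spread of the imputed values across `imputed_frames` gives a rough sense of how uncertain each filled-in cell is.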

### 6-Forward fill/backward fill:
#### Replace missing values with the previous or next valid value along the feature axis. This strategy works well when the missing values occur in time series or other sequential data.
Example code in Python:

In [None]:
# Forward filling missing values
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is the direct replacement)
df = df.ffill()

# Backward filling missing values
df = df.bfill()
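The difference between the two directions is easiest to see on a tiny series. The sketch below uses hypothetical daily readings; note that a forward fill cannot fill a gap at the very start, and a backward fill cannot fill one at the very end.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps at the start and in the middle
readings = pd.Series(
    [np.nan, 20.5, np.nan, np.nan, 22.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

forward = readings.ffill()            # the leading NaN has no earlier value, so it stays
backward = readings.bfill()           # every gap takes the next valid value
combined = readings.ffill().bfill()   # forward fill first, then backward fill what is left

print(pd.DataFrame({'raw': readings, 'ffill': forward, 'bfill': backward, 'both': combined}))
```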

### 1-Drop rows/columns with missing values:
#### This strategy involves removing rows or columns which have missing values. This can be done when the amount of missing data is small and doesn't significantly affect the dataset's overall quality. However, this strategy can lead to loss of valuable information and reduce the sample size.
Example in Python:

In [3]:
import pandas as pd

# Create a sample dataset with missing values
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'age': [25, 32, None, 28, 30],
        'gender': ['F', 'M', 'M', 'M', None]}
df = pd.DataFrame(data)

# Drop rows with missing values
df_dropped = df.dropna()
print(df_dropped)

    name   age gender
0  Alice  25.0      F
1    Bob  32.0      M
3  David  28.0      M
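`dropna` can also be restricted to particular columns or applied to columns instead of rows. The sketch below rebuilds the same sample data under a different variable name (`people`, chosen here only to avoid overwriting `df`) and shows both variants.

```python
import pandas as pd

# Same sample data as above
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 32, None, 28, 30],
    'gender': ['F', 'M', 'M', 'M', None],
})

# Drop a row only if 'age' is missing; a missing 'gender' is tolerated
age_known = people.dropna(subset=['age'])

# Drop every column that contains at least one missing value
complete_columns = people.dropna(axis=1)

print(age_known)
print(complete_columns)
```

Dropping columns is far more aggressive here: only `name` survives, which shows why it is usually reserved for features that are mostly empty.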


### 2-Imputation:
#### Imputation involves filling in the missing values with a reasonable estimate. This can be done using various methods, such as mean, median, mode, or using machine learning algorithms to predict the missing values.
Example in Python:

In [None]:
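# Imputing missing values with mean
# (assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas)
mean = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(mean)

# Alternatively, imputing missing values with median
median = df['column_name'].median()
df['column_name'] = df['column_name'].fillna(median)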

# **SimpleImputer**
## from sklearn.impute import SimpleImputer

In [8]:
from sklearn.impute import SimpleImputer

# Create the DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'age': [25, 32, None, 28, 30],
        'gender': ['F', 'M', 'M', 'M', None]}

df = pd.DataFrame(data)

# Initialize the SimpleImputer with strategy='mean'
imputer = SimpleImputer(strategy='mean')

# Apply the imputer to the 'age' column
df['age'] = imputer.fit_transform(df[['age']])

print(df)

      name    age gender
0    Alice  25.00      F
1      Bob  32.00      M
2  Charlie  28.75      M
3    David  28.00      M
4      Eva  30.00   None


In [3]:
import pandas as pd
import numpy as np

# Create the DataFrame
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 32, None, 28, 30],
    'tall': [160, None, 175, 180, 165],
    'weight': [55, 70, None, 80, 65]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
      name   age   tall  weight
0    Alice  25.0  160.0    55.0
1      Bob  32.0    NaN    70.0
2  Charlie   NaN  175.0     NaN
3    David  28.0  180.0    80.0
4      Eva  30.0  165.0    65.0


In [2]:
# Select numerical columns
numerical_columns = df.select_dtypes(include=[np.number]).columns
print("Numerical Columns:", numerical_columns)

Numerical Columns: Index(['age', 'tall', 'weight'], dtype='object')


In [6]:
from sklearn.impute import SimpleImputer

# Initialize the SimpleImputer with strategy='median'
imputer = SimpleImputer(strategy='median')

# Apply the imputer to the numerical columns
df[numerical_columns] = imputer.fit_transform(df[numerical_columns])

print("\nDataFrame After Imputation:")
print(df)


DataFrame After Imputation:
      name   age   tall  weight
0    Alice  25.0  160.0    55.0
1      Bob  32.0  170.0    70.0
2  Charlie  29.0  175.0    67.5
3    David  28.0  180.0    80.0
4      Eva  30.0  165.0    65.0


### 3-Categorical imputation:
#### This strategy involves filling in missing values for categorical variables using the most frequent category.
Example in Python:

In [20]:
# Create the DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'age': [25, 32, None, 28, 30],
        'gender': ['F', 'M', 'M', 'M', None]}

df = pd.DataFrame(data)

# Replace None with np.nan in the 'gender' column
df['gender'] = df['gender'].replace([None], np.nan)

# Initialize the SimpleImputer with strategy='most_frequent' for the 'gender' column
imputer_gender = SimpleImputer(strategy='most_frequent')

# Apply the imputer to the 'gender' column and flatten the result
df['gender'] = imputer_gender.fit_transform(df[['gender']]).flatten()

print(df)

      name   age gender
0    Alice  25.0      F
1      Bob  32.0      M
2  Charlie   NaN      M
3    David  28.0      M
4      Eva  30.0      M


### 4-Interpolation:
#### This strategy involves filling in missing values using a linear or polynomial function that estimates the values between two known points. This method is useful when the data is continuous, the row order is meaningful, and the amount of missing data is small.

In [12]:
import pandas as pd

# Create a sample dataset with missing values
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'age': [25, 32, None, 28, 30],
        'gender': ['F', 'M', 'M', 'M', None]}
df = pd.DataFrame(data)

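# Interpolate missing values using linear interpolation
# (only the numeric 'age' column is interpolated; recent pandas warns when interpolate() is called on text columns)
df_interpolated = df.copy()
df_interpolated['age'] = df['age'].interpolate()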
print(df_interpolated)

      name   age gender
0    Alice  25.0      F
1      Bob  32.0      M
2  Charlie  30.0      M
3    David  28.0      M
4      Eva  30.0   None
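Linear interpolation treats rows as equally spaced. When the index carries real timestamps with uneven gaps, pandas also offers `method='time'`, which weights by the actual time distance. The short sketch below, on hypothetical temperature readings, contrasts the two.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: the gap after the missing value is twice as long as the gap before it
temps = pd.Series(
    [10.0, np.nan, 40.0],
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04']),
)

# 'linear' ignores the index and places the value halfway between its neighbours
linear = temps.interpolate(method='linear')

# 'time' uses the timestamps: one day out of three has elapsed, so one third of the rise
time_weighted = temps.interpolate(method='time')

print(linear)
print(time_weighted)
```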
