## Data leakage
#### The training data contain information about the target, but similar data will not be available when the model is used to make predictions.

### Target leakage

#### The predictors include data that will not be available at the time predictions are made. A feature is safe to use only if its value is known before the prediction is made.
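A minimal sketch of removing a leaky feature, assuming a made-up toy table. The column names (`took_antibiotic`, `got_pneumonia`) and the decision that `took_antibiotic` is leaky are illustrative assumptions; in practice that judgment comes from knowing when each column is recorded.

```python
import pandas as pd

# Hypothetical toy data: 'took_antibiotic' is recorded AFTER the diagnosis,
# so it would not be known at the time a prediction has to be made.
patients = pd.DataFrame({
    'age': [34, 51, 27, 63, 45, 38],
    'took_antibiotic': [1, 0, 0, 1, 1, 0],
    'got_pneumonia': [1, 0, 0, 1, 1, 0],
})

# A feature that matches the target almost perfectly is a warning sign
agreement = (patients['took_antibiotic'] == patients['got_pneumonia']).mean()
print(agreement)

# Assumption: the leaky columns were identified from domain knowledge
leaky_columns = ['took_antibiotic']
y_toy = patients['got_pneumonia']
X_toy = patients.drop(['got_pneumonia'] + leaky_columns, axis=1)
```

Only the features that exist before the outcome is known remain in `X_toy`.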

### Train-Test contamination

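#### The validation data corrupt the process by affecting the preprocessing behavior, for example when a preprocessing step is fitted before the data are split into training and validation sets.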
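One way to avoid this kind of contamination is to split first and fit the preprocessing on the training rows only. The sketch below uses made-up toy data and hypothetical variable names (`X_tr`, `X_va`) so that it does not depend on the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with two missing feature values
toy = pd.DataFrame({
    'feature': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    'target': [1, 2, 3, 4, 5, 6, 7, 8],
})
X_toy = toy[['feature']]
y_toy = toy['target']

# Split BEFORE fitting any preprocessing step
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

imputer = SimpleImputer(strategy='mean')
X_tr_imputed = imputer.fit_transform(X_tr)  # mean is learned from training rows only
X_va_imputed = imputer.transform(X_va)      # validation rows reuse that mean
```

Calling `fit_transform` on the full data before splitting would let the validation rows influence the imputed values.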

## Missing data

#### find the columns with missing data and collect their names

In [None]:
cols_missing_data = [col for col in df.columns if df[col].isnull().any()]
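It can also help to see how many values are missing in each column before choosing a strategy. A small sketch on a made-up DataFrame (`toy_df` is a hypothetical stand-in for `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
toy_df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
    'c': [1, 2, 3],
})

# Count missing values per column and keep only columns that have any
missing_counts = toy_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```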

#### handle missing data only in the non-object columns

In [None]:
X_train_full = df.drop(['target_column'], axis=1)
X_train = X_train_full.select_dtypes(exclude='object')

#### 1) drop every column with missing data (fine when a column is mostly missing; otherwise a lot of useful data will be lost)

In [None]:
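# drop() returns a new DataFrame; with inplace=True it would return None.
# errors='ignore' skips names that are not in X_train (e.g. object columns already excluded).
X_train = X_train.drop(cols_missing_data, axis=1, errors='ignore')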

#### 2) simple imputer: replace the missing data with a statistic or a fixed value (strategy: mean, median, most_frequent, constant)

In [None]:
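import pandas as pd
from sklearn.impute import SimpleImputer

# strategy can be 'mean', 'median', 'most_frequent' or 'constant' ('mean' is used here as an example);
# missing_values defaults to np.nan
my_imputer = SimpleImputer(strategy='mean')
# Fit on the training data only, then reuse the fitted imputer on the validation data
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Add the column names back; otherwise, the column names will be 0, 1, 2, ...
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns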
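Rebuilding the DataFrame from the imputer output also resets the row index to 0, 1, 2, ..., which may matter after a shuffled split. A possible variant, sketched on made-up data (`toy_X` and `toy_imputer` are hypothetical names), passes both the columns and the index back in one step:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy features with a non-default index, as after a shuffled split
toy_X = pd.DataFrame(
    {'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]},
    index=[7, 2, 5],
)

toy_imputer = SimpleImputer(strategy='median')
toy_imputed = pd.DataFrame(
    toy_imputer.fit_transform(toy_X),
    columns=toy_X.columns,  # keep the column names
    index=toy_X.index,      # keep the original row labels
)
print(toy_imputed)
```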


#### 3) simple imputer with an extra column that records which values were originally missing

In [None]:
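X_train_copy = X_train.copy()
X_valid_copy = X_valid.copy()

# Flag the originally missing values (only for columns that are still in X_train)
for col in [c for c in cols_missing_data if c in X_train.columns]:
    X_train_copy[col + '_missing'] = X_train[col].isnull()
    X_valid_copy[col + '_missing'] = X_valid[col].isnull()

# Impute the copies (not X_train / X_valid) so the '_missing' columns are kept,
# and restore the column names of the copies at the same time
X_train_copy = pd.DataFrame(my_imputer.fit_transform(X_train_copy), columns=X_train_copy.columns)
X_valid_copy = pd.DataFrame(my_imputer.transform(X_valid_copy), columns=X_valid_copy.columns)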
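As an alternative to building the indicator columns by hand, `SimpleImputer` accepts `add_indicator=True`, which appends one indicator column for each feature that had missing values during `fit`. A minimal sketch on made-up data; the `'_missing'` suffix and the variable names are assumptions chosen to mirror the cell above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy features: only column 'a' has a missing value
toy_X = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})

indicator_imputer = SimpleImputer(strategy='mean', add_indicator=True)
values = indicator_imputer.fit_transform(toy_X)

# indicator_.features_ holds the positions of the features that had missing values
indicator_names = [toy_X.columns[i] + '_missing'
                   for i in indicator_imputer.indicator_.features_]
toy_imputed = pd.DataFrame(values, columns=list(toy_X.columns) + indicator_names)
print(toy_imputed)
```

The indicator columns come back as floats (0.0 or 1.0) because the imputer returns a single numeric array.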

## Machine learning

#### choose a target variable

In [None]:
y_target = df['target_column']

#### form a dataframe only with features

In [None]:
# Without inplace=True, drop() returns the new DataFrame (with inplace=True it returns None)
X_full = df.drop(['target_column'], axis=1)

#### check for missing values and categorical features in the data. Choose appropriate methods to deal with them from 'Missing data' and 'categorical variables'.

In [None]:
# X_full contains all features except the target. After the missing-data and categorical processing,
# 'X' will not necessarily be the same as 'X_full': some features may have been dropped along the way.

#### form a list containing feature names

In [None]:
feature_list = ['feature_1', 'feature_2', '.....']

#### split the data into a training set and a validation set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y_target, random_state=0)

### Decision Tree Model

In [None]:
from sklearn.tree import DecisionTreeRegressor

my_DTR_model = DecisionTreeRegressor(random_state=1)
my_DTR_model.fit(X_train, y_train)
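After fitting, the model can be checked against the validation set. The sketch below is self-contained and uses made-up data; mean absolute error is one possible metric here, chosen as an assumption rather than taken from the notebook.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy features and target
toy_X = pd.DataFrame({'x1': range(20), 'x2': [i % 3 for i in range(20)]})
toy_y = toy_X['x1'] * 2.0 + toy_X['x2']

X_tr, X_va, y_tr, y_va = train_test_split(toy_X, toy_y, random_state=0)

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(X_tr, y_tr)

# Predict on rows the model has not seen and measure the average error
preds = toy_model.predict(X_va)
mae = mean_absolute_error(y_va, preds)
print(mae)
```

With the notebook's own variables, the same two calls would be `my_DTR_model.predict(X_valid)` followed by `mean_absolute_error(y_valid, ...)`.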