## Prediction of missing values:

The earlier methods for handling missing values do not exploit the correlation between the variable containing the missing values and the other variables. Features that have no nulls can instead be used to predict the missing values.
A regression or classification model can be used for the prediction, depending on whether the feature with missing values is continuous or categorical.
```
Here the 'Age' column contains missing values, so for predicting the null values the data is split as follows:
y_train: rows from data["Age"] with non-null values
y_test: rows from data["Age"] with null values
X_train: all features except data["Age"], for rows with non-null Age
X_test: all features except data["Age"], for rows with null Age
```
```Python
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("train.csv")
data = data[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]

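# Encode Sex as a number: 1 for male, 0 otherwise
data["Sex"] = (data["Sex"] == "male").astype(int)

# Rows with a missing Age are the ones whose Age will be predicted
test_data = data[data["Age"].isnull()]
# Rows without nulls form the training set (reassign instead of inplace on a slice)
data = data.dropna()

y_train = data["Age"]
X_train = data.drop("Age", axis=1)
X_test = test_data.drop("Age", axis=1)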

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
```

- Pros:
  - Often gives better results than the earlier methods.
  - Takes into account the covariance between the column with missing values and the other columns.
- Cons:
  - The predicted values are only a proxy for the true values.
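
The approach above can be wrapped into a small helper that fits a model on the rows where the target is known and writes the predictions back into the missing cells. The sketch below is illustrative only: the function name `impute_with_model` and the toy data are assumptions, not part of the original example. It uses a regressor for a continuous column and a classifier for a categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier


def impute_with_model(df, target, estimator):
    """Return a copy of df where nulls in `target` are replaced by model predictions.

    Hypothetical helper: all other columns are used as features and are
    assumed to be numeric and free of nulls.
    """
    features = df.columns.drop(target)
    missing = df[target].isnull()
    if not missing.any():
        return df.copy()
    # Fit only on rows where the target is known
    estimator.fit(df.loc[~missing, features], df.loc[~missing, target])
    result = df.copy()
    # Fill only the cells that were missing
    result.loc[missing, target] = estimator.predict(df.loc[missing, features])
    return result


# Continuous target: use a regression model
numeric = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                        "y": [2.0, 4.0, np.nan, 8.0, np.nan]})
numeric_filled = impute_with_model(numeric, "y", LinearRegression())

# Categorical target: use a classification model
categorical = pd.DataFrame({"x": [0, 0, 1, 1, 0, 1],
                            "label": ["a", "a", "b", "b", None, None]})
categorical_filled = impute_with_model(
    categorical, "label", DecisionTreeClassifier(random_state=0))

print(numeric_filled)
print(categorical_filled)
```

The helper returns a copy, so the original frame keeps its nulls and the imputed values can be compared against other strategies.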


### Imputation using Deep Learning Library: Datawig
This method works with categorical, continuous, and non-numerical features. Datawig is a library that trains deep neural network models to impute missing values in a dataframe.

Install the datawig library:
```
pip3 install datawig
```

Datawig can take a data frame and fit an imputation model for each column with missing values, with all other columns as inputs.
Below is the code to impute missing values in the Age column:

```Python
import pandas as pd
import datawig  # install beforehand from a shell: pip3 install datawig

data = pd.read_csv("train.csv")

df_train, df_test = datawig.utils.random_split(data)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['Pclass','SibSp','Parch'], # column(s) containing information about the column we want to impute
    output_column= 'Age', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)
```

- Pros:
  - Quite accurate compared to other methods.
  - Supports both CPUs and GPUs.
- Cons:
  - Can be quite slow with large datasets.
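
The idea of fitting one model per column with missing values, using the other columns as inputs, can also be tried without a deep learning dependency. The sketch below is not Datawig: it swaps in scikit-learn's `IterativeImputer` (still marked experimental there) as a stand-in for the same general idea, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric frame with gaps in every column (illustrative data)
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [2.1, np.nan, 6.2, 8.1, 9.9, np.nan],
    "c": [0.5, 1.0, 1.5, np.nan, 2.5, 3.0],
})

# Each column with missing values is modelled from the other columns,
# cycling over the columns for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df),
                      columns=df.columns, index=df.index)

print(filled)
```

Unlike Datawig, this stand-in only handles numeric columns, so categorical features would need to be encoded first.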

### Conclusion:
Most real-world datasets have missing values that need to be handled intelligently to create a robust model. In this article, I have discussed 7 ways to handle missing values that together cover every type of column. There is no rule of thumb for handling missing values in a particular manner; the right method is the one that yields a robust model with the best performance. One can use different methods on different features depending on what the data is about. Domain knowledge about the dataset is important, since it gives insight into how to preprocess the data and handle missing values.

### Pandas does not allow changing index entries, but it can be done through values
```Python
df.column1.values[0] = 1
df.index.values[5] = 20
```
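
Writing through `.values` relies on the returned array being a writable view of the underlying data, which may not hold in every pandas version or configuration. A more conservative sketch, using only label-based assignment and `rename` (the toy frame is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"column1": [10, 20, 30]}, index=[0, 1, 2])

# Change a single cell by row label and column name
df.loc[0, "column1"] = 1

# Change one index label; rename returns a new frame
df = df.rename(index={2: 20})

print(df)
```

Assigning a full list to `df.index` is another option when every label needs to change.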

### Accessing columns inside apply
```Python
df['АиБ'] = df[['А', 'Б']].apply(lambda x: x['А'] + x['Б'], axis=1)
```
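
For simple arithmetic between columns, a vectorized expression usually gives the same result as a row-wise `apply` and tends to be faster on large frames. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"А": [1, 2, 3], "Б": [10, 20, 30]})

# Row-wise: the lambda receives each row as a Series
row_wise = df[["А", "Б"]].apply(lambda x: x["А"] + x["Б"], axis=1)

# Vectorized: operates on whole columns at once
vectorized = df["А"] + df["Б"]

print(row_wise.equals(vectorized))
```

Row-wise `apply` remains useful when the per-row logic cannot be expressed with column operations.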

# Hyperparameter tuning:

``` Python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# logistic regression
CV_model = GridSearchCV(estimator=LogisticRegression(),
                        param_grid={'C': [100, 10, 1, 0.1, 0.01, 0.001]},
                        cv=5,
                        scoring='roc_auc',
                        n_jobs=-1,
                        verbose=10)
CV_model.fit(X, y)
print('Best score', CV_model.best_score_, 'at C =', CV_model.best_params_['C'])


# catboost
from catboost import CatBoost

model = CatBoost()

grid = {'learning_rate': [0.03, 0.1, 0.2],
        'custom_metric': ['AUC:hints=skip_train~false'],
        'depth': [2, 4, 6],
        'l2_leaf_reg': [ 2, 3, 5],
        'n_estimators': [50, 100, 200]}

grid_search_result = model.grid_search(grid, 
                                       X=X, 
                                       y=y, 
                                       plot=False)
grid_search_result['params']

# another option

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.loc[:,feature_cols], data.loc[:,target_col], test_size=0.2, stratify = data.loc[:,target_col],random_state=42)

clf = GridSearchCV(estimator=CatBoostClassifier(), param_grid={'iterations':[8,16], 'learning_rate':[0.1, 0.5, 1], 'depth':[2,4,8,12]
                                                               , 'cat_features':[cat_cols]}, scoring = 'roc_auc', n_jobs=-1)

clf.fit(X_train, y_train)

print('Best parameters {}'.format(clf.best_params_))
print('Best score {}'.format(clf.best_score_))



# lgb
import lightgbm as lgb

grid = {'learning_rate': [0.03, 0.1, 0.2, 0.3, 0.5, 5],
        'num_leaves': [2, 4, 6, 10, 40, 60],
        'max_bin': [ 2, 300, 500],
        'max_depth': [3, 5, 10, 20],
        'n_estimators': [50, 100, 200, 500],
        'metric': ['auc'],
        'feature_fraction': [0.2, 0.5, 0.7]}

lgb_estimator = lgb.LGBMClassifier(verbose_eval=20, 
                                   early_stopping_rounds=10)

g_lgbm = GridSearchCV(estimator=lgb_estimator, param_grid=grid, n_jobs = -1, cv= 3)

lgb_model = g_lgbm.fit(X=train[features], y=train[target_col], eval_set = (valid[features], valid[target_col]))
g_lgbm.best_params_


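# check quality
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_auc_score, roc_curve

# best_model is not defined above; it is assumed here to be the best estimator of the LightGBM search
best_model = g_lgbm.best_estimator_
# ROC AUC needs scores rather than hard class labels, so use the positive-class probability
train_scores = best_model.predict_proba(train[features])[:, 1]

print('Score train ROC AUC =', roc_auc_score(train[target_col], train_scores))

fpr, tpr, thresholds = roc_curve(train[target_col], train_scores)
roc_auc = auc(fpr, tpr)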
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

```
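
The snippets above depend on `X`, `y`, and other variables defined elsewhere. As a runnable illustration of the same grid-search pattern, here is a minimal sketch on synthetic data; the dataset from `make_classification` and the reduced grid of `C` values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for X and y
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Cross-validated search over the regularization strength C
param_grid = {"C": [10, 1, 0.1, 0.01]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on held-out data using probabilities
best_model = search.best_estimator_
test_scores = best_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)

print("Best C:", search.best_params_["C"])
print("Test ROC AUC:", round(test_auc, 3))
```

Scoring the held-out split separately from the cross-validation score gives a check that the selected parameters generalize beyond the folds used for tuning.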

### Correct progress bar display in a notebook
```Python
from tqdm.auto import tqdm, trange
```
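
A short usage sketch of the two imported names, with made-up loop bodies: `trange(n)` is a shorthand for `tqdm(range(n))`, and `tqdm` wraps any iterable.

```python
from tqdm.auto import tqdm, trange

# trange(n) behaves like range(n) with a progress bar attached
total = 0
for i in trange(5, desc="summing"):
    total += i

# tqdm wraps any iterable, for example inside a comprehension
squares = [x * x for x in tqdm(range(4), desc="squares")]

print(total, squares)
```

The `tqdm.auto` module picks a notebook widget when one is available and falls back to the plain text bar otherwise.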

### Create a Pandas table from data that has gaps
```Python
pd.DataFrame.from_dict(d, orient='index').fillna(0).T
```
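
A small worked example may help here. It assumes `d` is a dictionary whose lists have different lengths (the dictionary below is made up for illustration): with `orient='index'` each key becomes a row, shorter rows are padded with NaN, `fillna(0)` replaces the padding, and `.T` turns the keys into columns.

```python
import pandas as pd

# Lists of unequal length: the gaps appear where a list is shorter
d = {"a": [1, 2, 3], "b": [4, 5], "c": [6]}

table = pd.DataFrame.from_dict(d, orient="index").fillna(0).T

print(table)
```

Zero is only one possible fill value; another placeholder can be passed to `fillna` if zero is a meaningful value in the data.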

### View an image
```Python
import cv2
import matplotlib.pyplot as plt
img = cv2.imread('f.jpg')
img_cvt=cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(12, 8))
plt.imshow(img_cvt)
plt.show()
```

### Show an image in a notebook
```Python

from IPython.display import Image

# from a local file
Image('img/picture.png')

# from a URL
Image(url="http://my_site.com/my_picture.jpg")

# from a local file with an explicit size
PATH = "/Users/reblochonMasque/Documents/Drawings/"
Image(filename=PATH + "My_picture.jpg", width=100, height=100)

# Alternatives for a Markdown cell (these are not Python code):
# ![title](img/picture.png)
# ![alt text](test.gif "Title")
# <img src="subdirectory/MyImage.png" width="60" height="60">
```

### Show an image inside a table in a Colab notebook
```Python

# create an image in the Colab working directory

from PIL import Image, ImageDraw
text = "Hello, PIL!!!"
color = (100, 5, 120)
img = Image.new('RGB', (100, 50), color)
imgDrawer = ImageDraw.Draw(img)
imgDrawer.text((10, 20), text)
img.save("pil-basic-example.png")
Image.open('pil-basic-example.png')


# show it in a table


from IPython.display import display, HTML
import base64
import pandas as pd

df = pd.DataFrame({"A":[1,2,3,4,5], "B":[10,20,30,40,50]})

with open('pil-basic-example.png', 'rb') as fd:
    b64 = base64.b64encode(fd.read()).decode('ascii')
    df.loc[:,'img'] = f'<img src="data:image;base64,{b64}" />'
display(HTML(df.to_html(escape=False)))
```

## YouTube in a notebook

```Python
from IPython.display import YouTubeVideo
YouTubeVideo('ewkSI2cuyoQ')
```