# Introduction to Machine Learning

Machine learning is a class of artificial intelligence methods whose characteristic feature is that they do not solve a problem directly but learn by applying solutions to many similar problems. These methods draw on mathematical statistics, numerical methods, mathematical analysis, optimization, probability theory, graph theory, and various techniques for working with digital data.

### Statement of the business problem

Understanding the problem correctly is half the battle. Let's examine the conditions and think about how to solve it.

Imagine you are working in an online real estate service. Without resorting to the services of agents, the owners place ads, and buyers respond to them. If the transaction is successful, your service takes a commission.

To solve the problem of estimating an apartment's price, you can work with experts and manually write down the rules that determine the cost.
For example:
- multiply the area of the apartment by the average cost per square meter in the city;
- reduce the cost by 20% if the renovation is more than three years old;
- increase the price by 30% if the metro is near the house.
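
The three example rules above can be sketched as a small function. This is only an illustration: the function name, its parameters, and the input values are hypothetical and do not come from a real pricing service.

```python
def estimate_price(area_m2, avg_price_per_m2, renovation_age_years, metro_nearby):
    """Rule-based price estimate following the three example rules (hypothetical helper)."""
    price = area_m2 * avg_price_per_m2   # rule 1: area times the average price per square meter
    if renovation_age_years > 3:         # rule 2: an old renovation lowers the price by 20%
        price *= 0.8
    if metro_nearby:                     # rule 3: a nearby metro raises the price by 30%
        price *= 1.3
    return price

# Made-up inputs: 50 m2, UAH 20,000 per m2, five-year-old renovation, metro nearby
example = estimate_price(area_m2=50, avg_price_per_m2=20_000,
                         renovation_age_years=5, metro_nearby=True)
print(round(example))
```

Every new rule means another hand-written condition, which is exactly what makes such an algorithm hard to maintain.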


There can be any number of rules.
But it's not that simple. An “expert algorithm” can become your competitive advantage or a useless feature that does not pay off the considerable time and financial costs. In addition, the rules are difficult to scale (enter other markets and regions), and over time they may lose relevance.

Machine learning will help to correct the shortcomings of expert rules and solve this problem in a different way.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

import warnings
# Silence all warnings (including pandas' SettingWithCopyWarning) to keep the output readable.
# The old import path pandas.core.common.SettingWithCopyWarning fails on recent pandas versions.
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('./data/train_data.csv')

display(df.shape)
display(df.head(5))

In data analysis, rows are called observations and columns are called variables.
In machine learning, rows hold objects and columns hold features.

The feature to be predicted is the target feature: in our problem, this is `last_price`.

## Supervised learning (learning with a teacher)

Machine learning tasks are different. We choose the one that will help develop the desired algorithm.

You have a training dataset and a target feature, the sale price of housing, that you need to predict from the rest of the features. Such tasks belong to the class of "supervised learning". The “teacher” poses questions (features) and indicates answers (target feature), but explains nothing.

Recall that all variables and features are of two types: categorical and quantitative. The target feature is no exception.

- If the target feature is categorical, then the classification problem is solved (for example, identify an animal in the image).
- When there are only two categories - for example, whether the client will return to the online store or not - we are talking about a binary classification.
- If the target feature is quantitative, this is a regression task: the relationship between variables is reconstructed from the data. This way you can predict the number of news reposts or the sales volume of an online store for the next month.

## Classification and regression

The sale price of an apartment is a quantitative target feature. It turns out that we are facing a regression task. Getting acquainted with machine learning through regression is inconvenient: the calculations are too cumbersome because there are so many possible answers (any number).

It is easier when there are only two options, as in binary classification. Therefore, we first split all prices into “high” and “low” and predict which class the housing belongs to. Then we return to regression.


How to determine high and low prices?
Let's find the median price: half of the apartments cost more than it, and half cost less.

In [None]:
median = df['last_price'].median()
print(median)

Prices above UAH 1,159,000 are high; prices at or below it are low.

In [None]:
df.loc[df['last_price'] > median, 'price_class'] = 1
df.loc[df['last_price'] <= median, 'price_class'] = 0

features = df.drop(['last_price', 'price_class'], axis=1)
target = df['price_class']

print(features.shape)
print(target.shape)

# Models and algorithms

To make predictions, you need to understand the relationship between features and answers. What does an analyst do? They make an assumption about how these relationships work and make predictions based on it.

If the predictions match reality, then the assumption is correct. This approach is called modeling, and the assumptions and prediction methods themselves are called machine learning models.

Consider one popular model, the decision tree. It can describe the process of making a decision in almost any situation. For example, will Oleksandr go on vacation to Rome:

<img src="./pict/1.jpg"  
  width="1000"
/>

In our problem, we will put forward the following assumption: the decision tree determines the price of the apartment.

Which of the whole set of possible trees? To find the right one, you need to train the model: choose the decision tree that best fits our training set. Learning algorithms are responsible for this, and their result is called a trained model. From data scientists, you may hear "model" instead of "trained model".

After training, the model is ready to predict: it receives new objects (features) as input and gives answers (target feature). The learning algorithm and the training dataset are no longer needed.

It is important to remember that the process of machine learning is divided into two stages - training the model and the operation of this model.

<img src="./pict/2.png"  
  width="640" />

# scikit-learn library

Learning algorithms are often more complex than the model itself. Therefore, imagine them as black boxes. The main thing is to understand what exactly to put in the box and how to work with what it gives out.

Many algorithms are already available in Python libraries. One of them is scikit-learn, imported as `sklearn` (the name can be read as "scientific kit for learning").
There are many data tools and models in sklearn, so they are divided into modules.

The `tree` module contains the decision tree.
Each model in sklearn has its own class. `DecisionTreeClassifier` is the class for decision tree classification. Import it from the library:

In [None]:
from sklearn.tree import DecisionTreeClassifier 

Then we create an instance of this class.

In [None]:
model = DecisionTreeClassifier()

The `model` variable will store the model. Our model cannot make predictions yet. For it to learn, you need to run the learning algorithm.

Let's start by training the model. We saved the training dataset in the features and target variables. To start training, call the fit() method and pass it data as a parameter.

In [None]:
model.fit(features, target) 

Now the `model` variable contains a full-fledged model. To predict the answers, call the `predict()` method and pass it a table with the features of new objects. Let's create new features:

In [None]:
new_features = pd.DataFrame(
    [[None, None, 2.8, 25, None, 25, 0, 0, 0, None, 0, 30706.0, 7877.0],
     [None, None, 2.75, 25, None, 25, 0, 0, 0, None, 0, 36421.0, 9176.0]],
    columns=features.columns)

new_features.loc[0, 'total_area'] = 900.0
new_features.loc[0, 'rooms'] = 12
new_features.loc[0, 'living_area'] = 409.7
new_features.loc[0, 'kitchen_area'] = 112.0

new_features.loc[1, 'total_area'] = 109.0
new_features.loc[1, 'rooms'] = 2
new_features.loc[1, 'living_area'] = 32.0
new_features.loc[1, 'kitchen_area'] = 40.5

In [None]:
new_features

In [None]:
answers = model.predict(new_features) 

In [None]:
print(answers.tolist())

## Show models

Did you guess that there are a couple of thousand lines of such text in the model?!

In Python it would look like this:
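
As a rough illustration (a toy tree on made-up data, not the model trained above), scikit-learn's `export_text` prints the conditions a tree has learned as nested rules. The feature values and the two column names used here are assumptions chosen for the sketch.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (made up for illustration): [total_area, rooms] -> price class
X_toy = [[30, 1], [45, 2], [60, 2], [80, 3], [120, 4], [200, 5]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = DecisionTreeClassifier(random_state=12345, max_depth=2)
toy_model.fit(X_toy, y_toy)

# export_text renders the learned conditions as indented if/else-style text
rules = export_text(toy_model, feature_names=['total_area', 'rooms'])
print(rules)
```

A tree trained on the full dataset without a depth limit produces the same kind of text, only far longer.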

What strange conditions! But the computer has no purpose to output beautiful numbers.

The ability to look inside the model and understand how a problem is solved can be critical when it comes to people. For example, in the diagnosis of diseases or the analysis of questionnaires to eliminate discrimination in employment. But for business tasks, it is not necessary to make the model “transparent”. The main thing is that it works.

# Randomness in learning algorithms

Each time you train a decision tree, you may get a slightly different model.

Randomness is added to many machine learning algorithms to help models notice patterns in data. Let's say you're learning Python.
You made 20 cards with new methods and review them regularly. A friend advised you to shuffle them every time so that the material is absorbed better. By shuffling the cards, you add randomness to the learning algorithm.

A computer does not create truly random numbers. It uses pseudo-random number generators, which produce sequences that look random: for example, the next number cannot be guessed.

Randomness has a price: the results are unpredictable. Today you trained an artificial intelligence that will flood humanity with spam, and tomorrow the same code produces one that cannot distinguish a cat from a dog.

Pseudo-random number generators can be configured to consistently produce the same results. The numbers are random, but the same every time, how come? In fact, they just look random.

It's the same with learning Python.

The cards were so successful that you decided to teach others with them. Free of charge. But before that, you prepared and wrote down a different order of cards for each day, and you lay them out in front of the students in exactly that sequence. So you know which card will be the eighth in a row on Friday, for example. This will not harm the educational process: for the student, the order will still look random.

<img src="./pict/3.jpg"  
  width="800" />

Fixing pseudo-randomness for the learning algorithm is very simple: when creating it, you need to specify the random_state parameter.

In [None]:
# set the random state (a number)
model = DecisionTreeClassifier(random_state=12345)

# train the model as before
model.fit(features, target)
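
To see what fixing `random_state` buys, here is a minimal sketch on synthetic data (the data and the variable names are made up for this illustration): two trees created with the same `random_state` and trained on the same data should give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, made up for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Two models created with the same random_state and trained on the same data...
model_a = DecisionTreeClassifier(random_state=12345).fit(X_demo, y_demo)
model_b = DecisionTreeClassifier(random_state=12345).fit(X_demo, y_demo)

# ...are compared on new objects
X_new = rng.normal(size=(50, 5))
same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print("Identical predictions:", same)
```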

# Test dataset

The model has been trained; now it's time for its exam. But how do we test its knowledge? We need a new dataset with known answers.

To be sure the model has not simply memorized the answers, let's take a new dataset: a test dataset, or test sample. It is stored in the file `test_data.csv`. Let's check how the model copes with it.

In [None]:
test_df = pd.read_csv('./data/test_data.csv')

In [None]:
test_df = test_df.loc[:10]

In [None]:
test_df.loc[test_df['last_price'] > 1159000, 'price_class'] = 1
test_df.loc[test_df['last_price'] <= 1159000, 'price_class'] = 0
test_features = test_df.drop(['last_price', 'price_class'], axis=1)
test_target = test_df['price_class']

In [None]:
model = DecisionTreeClassifier(random_state=12345)
 
model.fit(features, target)
test_predictions = model.predict(test_features)

print('Predictions:    ', test_predictions)
print("Correct answers:", test_target.values)

Let's write an error_count() function that:
- Accepts correct answers and model predictions as input.
- Compares them in a for loop.
- Returns the number of discrepancies between them.

In [None]:
def error_count(answers, predictions):
    i = 0
    for n,m in zip(answers, predictions):
        if n != m:
            i += 1
    return i

print("Errors:", error_count(test_target.values, test_predictions))

# Proportion of correct answers

The ratio of the number of correct answers to the size of the test sample is called "accuracy". The formula looks like this:

<font size="5">  
    accuracy = number of correct answers / number of questions
</font>  

In [None]:
def accuracy(answers, predictions):
    i = 0
    for n,m in zip(answers, predictions):
        if n == m:
            i += 1
    return round(i / len(answers),3) 

In [None]:
print("Accuracy:", accuracy(test_target, test_predictions))

# Quality metrics

Is it possible to distinguish a good model from a bad one? How to evaluate the quality of a model? What metric to choose?


Quality metrics evaluate the quality of work and are expressed in numerical form. You are already familiar with one quality metric - accuracy.

There are others, for example:
- `precision` shows what proportion of objects marked as expensive by the model are really expensive (answer 1).
- `recall` reveals what part of expensive objects the model has selected.
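
A small sketch on made-up labels shows how the two metrics differ. It uses `precision_score` and `recall_score` from `sklearn.metrics`; the label lists are invented for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 1 = expensive, 0 = cheap
answers     = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 0, 1, 1, 0, 1, 1, 0]

# precision: of the 5 objects predicted as expensive, 3 really are expensive
precision = precision_score(answers, predictions)
# recall: of the 4 truly expensive objects, the model found 3
recall = recall_score(answers, predictions)

print("precision:", precision)
print("recall:", recall)
```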

Quality metrics are closely related to the original classification problem.

Why did we choose accuracy when determining prices for apartments? Every wrong prediction is a wrong hint and a potential missed opportunity for the seller. And vice versa: the higher the classification accuracy, the more profit the product will bring.

Yes, the higher the quality of the model, the better. But its implementation must be justified.

Obvious limits: `accuracy` cannot be less than zero (all answers are wrong) or greater than one (all answers are correct).

Suppose the `accuracy` metric equals 0.4, for example. Is the quality of the model high or not?

<img src="./pict/4.jpg"  
  width="800" />

## Quality metrics in sklearn library

You no longer have to calculate accuracy using a formula. `sklearn` has many functions for calculating metrics.

In the `sklearn` library, the metrics are in the `sklearn.metrics` module. Accuracy is calculated by the `accuracy_score()` function.

In [None]:
from sklearn.metrics import accuracy_score 

The function takes two arguments as input:
- the correct answers;
- the model predictions.

It returns the accuracy value: `accuracy = accuracy_score(target, predictions)`

In [None]:
accuracy = accuracy_score(test_target, test_predictions) 
accuracy

In [None]:
train_predictions = model.predict(features)
test_predictions = model.predict(test_features)

In [None]:
print("Accuracy")
print("Training set:", accuracy_score(target, train_predictions))
print("Test set:", accuracy_score(test_target, test_predictions))

# Overfitting and underfitting

Did you find that the accuracy on the test sample of the model is lower than on the training one? This happens a lot in machine learning. Why?

Did the model explain the examples from the training data well, but got confused in the test set and could not answer correctly? You've run into a problem with `overfitting`.


The opposite effect is `underfitting`. It occurs when the quality on the training and test samples is approximately the same, and low. It is like a student who, before the exam, had no time not only to memorize the answers but even to finish reading the exam questions. Familiar story?

`It is not always possible to avoid overfitting or underfitting. When you get rid of the first, the risk of the second effect increases, and vice versa.`

Look at an example of setting up a learning algorithm. How does it affect the balance between overfitting and underfitting?

Tree depth (tree height) is the maximum number of conditions from the "top" to the final answer. It is counted as the number of hops between nodes (the figure below shows a tree of depth 3).

<img src="./pict/5.png"  
  width="500" />

The depth of the tree in `sklearn` is set by the `max_depth` parameter.

In [None]:
best_model = None
best_result = 0

for depth in range(1, 12):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) 
    model.fit(features, target)
    
    predictions = model.predict(features) 
    
    result = accuracy_score(target, predictions) 
    
    if result > best_result:
        best_model = model
        best_result = result
        
print("Accuracy of the best model:", best_result)
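
To see the balance between underfitting and overfitting, it helps to compare accuracy on the data the tree was trained on with accuracy on data it has not seen. The sketch below uses synthetic noisy data from `make_classification` (an assumption for illustration, not the apartment dataset) and a simple slice-based split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data: 20% of the labels are flipped at random
X_syn, y_syn = make_classification(n_samples=1000, n_features=10,
                                   flip_y=0.2, random_state=12345)
# The first 750 objects are used for training, the last 250 are held out
X_fit, X_held = X_syn[:750], X_syn[750:]
y_fit, y_held = y_syn[:750], y_syn[750:]

scores = {}
for depth in (1, 3, 6, None):   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_fit, y_fit)
    scores[depth] = (accuracy_score(y_fit, tree.predict(X_fit)),
                     accuracy_score(y_held, tree.predict(X_held)))
    print("max_depth =", depth, "train/held-out accuracy:", scores[depth])
```

An unlimited tree memorizes the training objects, noise included, so its training accuracy says little about how it behaves on new data.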

# Validation set

Imagine: you are preparing for an exam using tests from previous years. Don't rush to solve everything at once. Set aside part of the tasks so that later you can check how well you understand the topic.

So it is in machine learning. To make the quality assessment more reliable, you need to prepare a new sample: the validation (check) set.

1) The original dataset is available, but the test set is hidden. Then it is recommended to allocate 75% of the data for training and 25% for validation, a ratio of 3:1.

<img src="./pict/6.png"  
  width="700" />

2) There is no hidden test sample. This means that the data needs to be divided into three parts: training, validation and test. The sizes of the test and validation sets are usually equal. The initial data is divided in the ratio 3:1:1.

<img src="./pict/7.png"  
  width="700" />

## Splitting into two samples

The validation sample is 25% of the original data. How to extract it?

`sklearn` provides the `train_test_split` function for this. It splits any dataset into training and test sets.

But we will use this function to get the validation and training sets.
Import `train_test_split` from `sklearn.model_selection` module:

In [None]:
from sklearn.model_selection import train_test_split 

To split the dataset, you need to pass two arguments:
    
- the dataset to be split;
- the size of the validation set (`test_size`), expressed as a fraction from 0 to 1. In our example, `test_size=0.25`, since we set aside 25% of the original data.

The `train_test_split()` function returns two new datasets: the training set and the validation set.

In [None]:
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345) 

In [None]:
features_train = df_train.drop(['last_price', 'price_class'], axis=1)
target_train = df_train['price_class']

features_valid = df_valid.drop(['last_price', 'price_class'], axis=1)
target_valid = df_valid['price_class'] 

In [None]:
print(features_train.shape)
print(target_train.shape)

In [None]:
print(features_valid.shape)
print(target_valid.shape)
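
For the second case, where there is no hidden test sample, a 3:1:1 split can be sketched by calling `train_test_split` twice. The stand-in dataset and the variable names below are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in dataset of 100 rows (made up for illustration)
data = pd.DataFrame({'feature': range(100),
                     'target': [i % 2 for i in range(100)]})

# Step 1: set aside 20% of the data as the test set
data_rest, data_test = train_test_split(data, test_size=0.2, random_state=12345)
# Step 2: 25% of the remaining 80% equals 20% of the original data
data_train, data_valid = train_test_split(data_rest, test_size=0.25, random_state=12345)

print(len(data_train), len(data_valid), len(data_test))
```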

# Hyperparameters

`Hyperparameters` are settings for learning algorithms. Unlike parameters, they are set before the learning process.

In a decision tree, for example, this is the maximum depth or the choice of a criterion - Gini or entropy. Hyperparameters help improve the model. You can change them before the start of training.

Take another look at the already familiar code:

- `criterion='gini'` selects the Gini impurity, which measures how mixed the classes are within a node. While learning, at each node (each fork) the tree picks the best of the possible questions: the one for which the Gini criterion shows that the objects sent to the left branch are least similar to those sent to the right.
- `min_samples_split` forbids splitting a node that contains too few training objects.
- `min_samples_leaf`: leaves are the bottom nodes that hold the answers, and this hyperparameter forbids creating a leaf that contains too few training objects.
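
The hyperparameters listed above can be written out explicitly when the model is created. In this sketch the values are scikit-learn's defaults, so the model behaves the same as `DecisionTreeClassifier(random_state=12345)`; the variable name `explicit_model` is chosen only for this example.

```python
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameter values, spelled out explicitly
explicit_model = DecisionTreeClassifier(
    criterion='gini',        # split quality measure; 'entropy' is an alternative
    max_depth=None,          # no limit on the tree depth
    min_samples_split=2,     # a node needs at least 2 objects to be split
    min_samples_leaf=1,      # a leaf must contain at least 1 object
    random_state=12345,
)

# get_params() returns the hyperparameters the model was created with
print(explicit_model.get_params())
```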

## Changing hyperparameters

Let's tune the hyperparameters of our decision tree.

The most important decision tree hyperparameter is `max_depth`. It determines what we end up with: a stump with one question or a maple tree with a branched crown.

<img src="./pict/8.jpg"  
  width="500" />

In [None]:
accuracy_array = []
best_model = None
best_result = 0

for depth in range(1, 24):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)

    if result > best_result:
        best_model = model  # keep the best model too, not only its score
        best_result = result
        
    print('max_depth =', depth,': ', end='')
    print(result)

    accuracy_array.append([depth, result])
    
df_ac = pd.DataFrame(accuracy_array, columns=['depth', 'accuracy'])
print('')
print('best result :', best_result)

In [None]:
ax = sns.lineplot(data=df_ac, x="depth", y="accuracy")  # lineplot returns a matplotlib Axes
ax.grid()

# New models

## Random forest

You have changed the hyperparameters of the model. But the result is still not satisfactory. One tree is clearly not enough, you need a whole forest!

Let's try a new classification algorithm: random forest. The algorithm trains a large number of independent trees and then makes a decision by voting. Random forest helps to improve the prediction result and reduce overfitting.

Have you ever wondered why there are always several people on a jury? So that the contestant's final score is an average, which smooths out personal preferences and mistakes. Random forest works the same way.


How do we train it? In the `sklearn` library, the random forest algorithm `RandomForestClassifier` resides in the `sklearn.ensemble` module. Let's import it:

In [None]:
from sklearn.ensemble import RandomForestClassifier 

To control the number of trees in the forest, set the `n_estimators` hyperparameter. The more trees, the longer the model takes to train, but the result is usually better (and vice versa). Let's take `n_estimators` equal to 3 for now.

In [None]:
model = RandomForestClassifier(random_state=12345, n_estimators=3)

In [None]:
model.fit(features_train, target_train) 

In [None]:
result = model.score(features_valid, target_valid) 
print(result)

Let's search over the hyperparameters, trying from 1 to 17 trees and tree depths from 1 to 17.

In [None]:
from tqdm.notebook import tqdm

In [None]:
best_model = None
best_result = 0

for est in tqdm(range(1, 18)):
    for depth in range(1, 18): 
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth) 
        model.fit(features_train, target_train) 
        result = model.score(features_valid, target_valid) 
        if result > best_result:
            best_depth = depth
            best_est = est
            best_model = model
            best_result = result
print("Accuracy of the best model on the validation set:", best_result)

print('')
print('best_depth =', best_depth)
print('best_est =', best_est)

## Logistic regression

Algorithms are not limited to trees. There are other ways to classify.

If you make the `n_estimators` hyperparameter larger, the model grows and trains slowly. This is bad. With few trees, the results do not improve, which is also bad. How long can you depend on trees?

Let's try another algorithm - `logistic regression`.

Although the name suggests a regression problem, it is a classification algorithm.

To predict the housing class, logistic regression:
- First computes which class the object is closer to, for example with the formula given after this list.
- Then selects the class depending on the result: if it is positive, class 1 (high prices); if negative, class 0 (low prices).

<font size="5">
    proximity to class = 10 * area - distance to center
</font>

The area increases the cost, and the distance to the center reduces it. Moreover, each square meter of area is 10 times more important than one meter to the center.

We only considered area and distance. But in order to get proximity to the class, all features are placed in the black box.
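
The toy formula and the decision rule can be sketched in a few lines. The function names and the input values are hypothetical. The `sigmoid` helper goes beyond the text above: it is the logistic function that logistic regression applies to such a score to obtain a probability of class 1.

```python
import math

def proximity_to_class(area, distance_to_center):
    # The toy formula from the text: area counts 10 times more than distance
    return 10 * area - distance_to_center

def predict_price_class(area, distance_to_center):
    # Positive score -> class 1 (high prices), otherwise class 0 (low prices)
    return 1 if proximity_to_class(area, distance_to_center) > 0 else 0

def sigmoid(score):
    # Logistic function, written to avoid overflow for large negative scores
    if score >= 0:
        return 1 / (1 + math.exp(-score))
    z = math.exp(score)
    return z / (1 + z)

print(predict_price_class(area=50, distance_to_center=300))    # score is 200
print(predict_price_class(area=30, distance_to_center=5000))   # score is -4700
print(sigmoid(proximity_to_class(50, 300)))
```

A score of zero corresponds to a probability of 0.5, which is why the sign of the score decides the class.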

The LogisticRegression model lies in the `sklearn.linear_model` module of the `sklearn` library. Import it:

In [None]:
from sklearn.linear_model import LogisticRegression 

The `'lbfgs'` solver is one of the most common and suits most tasks. The `max_iter` hyperparameter sets the maximum number of training iterations; its default value is 100, but some cases need more.

Before changing the `solver` parameter, check which tasks each solver is suited to in the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn documentation</a>.

In [None]:
# solver options to try here: 'liblinear', 'lbfgs'
iterat = 1000
model = LogisticRegression(solver='liblinear', max_iter=iterat) 
model.fit(features_train, target_train) 

result = model.score(features_valid, target_valid)
    
print('result :', result)    

# Compare models

There is no need to work with all three models at once. Each has its own strengths and weaknesses. Let's compare the models in terms of quality (accuracy) and speed.

| Model | Quality | Speed |
|---|---|---|
| Decision tree | Medium | High |
| Random forest | High | Low |
| Logistic regression | Low | High |

The final choice often depends on business priorities: a slight improvement in model quality does not always outweigh a loss in speed.

# Regression

How is classification different from regression?

The target feature (response) can be categorical or quantitative. If it is categorical, we solve a classification problem; if it is quantitative, a regression problem.

The price of an apartment is a quantitative target feature. So we need to solve a regression problem.

## Mean squared error

What would count as a "correct answer" in our problem? A prediction that matches the apartment's price down to the last penny? Such exact matches are neither realistic nor necessary here, so the accuracy metric is not suitable.

The most common quality metric in regression problems is the mean squared error (MSE).

To get the MSE, the error of each object is first calculated:

<font size="5">  
    object error = model predictions - correct answer
</font> 

MSE is calculated according to the scheme:

<font size="5">  
    MSE = sum of squared object errors / number of objects
</font> 

What do calculations mean?

1) The object error shows how much the prediction differs from the correct answer. If the error is greater than zero, the model has overestimated the apartment; if it is less than zero, it has underestimated it.
2) Squaring removes the difference between overestimation and underestimation. Without this step, summing the errors makes no sense: positive errors would cancel out negative ones.
3) Averaging yields a single number that summarizes the error over all objects.

<img src="./pict/9.png"  
  width="700" />

Earlier we aimed for the highest possible accuracy. The value of MSE, on the contrary, should be as small as possible.

The MSE calculation function is also available in sklearn. Import `mean_squared_error`:

In [None]:
from sklearn.metrics import mean_squared_error 
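
A worked example on three made-up prices shows that the step-by-step scheme and `mean_squared_error` agree. The numbers are invented for illustration and are given in thousands of UAH.

```python
from sklearn.metrics import mean_squared_error

# Made-up prices in thousands of UAH
answers     = [100, 200, 300]
predictions = [110, 190, 320]

# Step by step: the errors are 10, -10, 20 and their squares are 100, 100, 400
errors = [p - a for p, a in zip(predictions, answers)]
mse_manual = sum(e ** 2 for e in errors) / len(errors)

# The same value computed by scikit-learn
mse_sklearn = mean_squared_error(answers, predictions)

print("manual MSE:", mse_manual)
print("sklearn MSE:", mse_sklearn)
```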

## MSE interpretation

In [None]:
df = pd.read_csv('./data/train_data.csv')

df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345) 

df_valid=df_valid.reset_index(drop=True)

features_train = df_train.drop(['last_price'], axis=1)
target_train = df_train['last_price']

features_valid = df_valid.drop(['last_price'], axis=1)
target_valid = df_valid['last_price'] 

In classification tasks, we check a model's adequacy by comparing it with a random one.

In regression, a similar simple baseline is to answer every object with the same number. So that this number is not too far from the truth, we take the average apartment price.

In [None]:
predictions = pd.Series(target_train.mean(), index=target_train.index)
mse = mean_squared_error(target_train, predictions)

print("MSE:", mse)

In [None]:
display(target_train.mean())

MSE is measured in "square hryvnias", which is hard to interpret. To make the metric show plain hryvnias, let's take the square root of MSE. This is the RMSE (root mean squared error) value.

In [None]:
rmse = mse ** 0.5
print("RMSE:", rmse)

## Decision tree in regression

The decision tree is suitable not only for classification problems, but also for regression.

The tree in the regression problem is trained in the same way, only it predicts not a class, but a number.

In [None]:
from sklearn.tree import DecisionTreeRegressor

<img src="./pict/10.jpg"  
  width="800" />

In [None]:
best_model = None
best_result = float('inf')  # RMSE of the best model so far, in UAH
for depth in range(1, 24):
    model = DecisionTreeRegressor(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    
    result = mean_squared_error(target_valid, predictions_valid)**0.5
    # Compare in the same units; convert to thousands only when printing
    if result < best_result:
        best_model = model
        best_result = result

print("RMSE of the best model on the validation set (thousand UAH):", best_result / 1000)

## Random forest in regression

Where there is one tree, there is a forest. Let's figure out how to train a random forest model in regression.

The random forest for regression does not change much. It trains many independent trees and then makes a decision by averaging their responses.
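
The averaging can be checked directly. This is a minimal sketch on synthetic data from `make_regression` (an assumption for illustration): the prediction of a `RandomForestRegressor` is compared with the hand-computed mean of the predictions of its individual trees, which are stored in the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, made up for this illustration
X_reg, y_reg = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=12345)

forest = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=12345)
forest.fit(X_reg, y_reg)

# Average the predictions of the individual trees by hand
tree_predictions = np.array([tree.predict(X_reg) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

matches = np.allclose(manual_average, forest.predict(X_reg))
print("Forest prediction equals the average over trees:", matches)
```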

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
best_model = None
best_result = float('inf')  # RMSE of the best model so far, in UAH

for est in tqdm(range(10, 51, 10)):
    for depth in range (1, 11):
        model = RandomForestRegressor(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        result = mean_squared_error(target_valid, predictions_valid)**0.5
        # Compare in the same units; convert to thousands only when printing
        if result < best_result:
            best_model = model
            best_result = result

print("RMSE of the best model on the validation set (thousand UAH):", best_result / 1000)

## Linear regression

What algorithm will replace logistic regression? Linear regression!

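Linear regression is similar to logistic regression: both combine the features in a weighted sum. It is called linear because the prediction is a linear function of the features.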

Due to the small number of parameters, linear regression is less prone to overfitting than, for example, decision trees.
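
The small number of parameters is easy to see on synthetic data. In this sketch the data are generated from a known linear rule (an assumption chosen for illustration), and the fitted model stores just one weight per feature plus an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule: y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(12345)
X_lin = rng.uniform(0, 100, size=(50, 2))
y_lin = 3 * X_lin[:, 0] - 2 * X_lin[:, 1] + 5

lin_model = LinearRegression().fit(X_lin, y_lin)

# All the model has learned: one weight per feature and an intercept
print("weights:", lin_model.coef_)
print("intercept:", lin_model.intercept_)
```

Because the data contain no noise, the fitted weights and intercept should match the generating rule up to rounding error.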

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
model.fit(features_train, target_train) 
predictions_valid = model.predict(features_valid) 

result = mean_squared_error(target_valid, predictions_valid)**0.5 / 1000
print("RMSE of the linear regression model on the validation set (thousand UAH):", result)

In [None]:
df_valid = target_valid.to_frame()  # Series -> DataFrame

# predictions_valid is a NumPy array; wrap it in a DataFrame
df_target = pd.DataFrame(predictions_valid, columns=['target'])

# add the predictions as a new column (the indices match because df_valid was reset earlier)
df_valid['predict'] = df_target['target']

In [None]:
df_valid = df_valid.sort_values('last_price').reset_index(drop=True).reset_index()

In [None]:
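def create_plot(df, start=0, end=999999999):
    # Work on a copy so that the rescaling below does not modify the original frame
    df_plot = df.loc[(df['index']>=start) & (df['index']<=end)].copy()
    # Convert prices to thousands of UAH for readable axes
    df_plot['last_price'] = df_plot['last_price']/1000
    df_plot['predict'] = df_plot['predict']/1000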
    
    ax = df_plot.plot(x = 'index',
                 y = 'last_price',
                 kind = 'scatter',                    
                 style = '-o',                          
                 alpha = 0.1,       
                 legend = True,          
                 label = 'target',                                  
                 figsize = (8, 4.5),                   
                 grid = True)
    (df_plot.plot(ax = ax,
                   y = 'predict',
                   style = '-r',
                   alpha = 0.6, 
                   legend = True,          
                   label = 'predict',    
                   grid = True              
                  ))

In [None]:
create_plot(df_valid)

In [None]:
create_plot(df_valid, 5000)

In [None]:
df_valid.loc[df_valid['predict']<0].head()

# Results

<img src="./pict/25.jpg"  
  width="800" />

<h2>Linear Regression</h2>
<div class='alert alert-success'>
<b>Good</b>
    
- Simple to implement and efficient to train
- Overfitting can be reduced by regularization
- Performs well when the relationship between the features and the target is close to linear
</div>

<div class='alert alert-danger'>
<b>Bad</b>
    
- Assumes that observations are independent, which is rare in real life
- Prone to noise and overfitting
- Sensitive to outliers
</div>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

<img src="./pict/line.png"  
  width="600" />

<h2>Logistic Regression</h2>
<div class='alert alert-success'>
<b>Good</b>
    
- Less prone to overfitting, but it can overfit on high-dimensional datasets
- Efficient when the classes are linearly separable in the feature space
- Easy to implement and efficient to train
</div>

<div class='alert alert-danger'>
<b>Bad</b>
    
- Should not be used when the number of observations is smaller than the number of features
- Assumes linearity, which is rare in practice
- Can only be used to predict discrete (categorical) targets
</div>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

<img src="./pict/logic.png"  
  width="600" />

<h2>Decision Tree</h2>
<div class='alert alert-success'>
<b>Good</b>
    
- Can solve non-linear problems
- Can work on high-dimensional data with excellent accuracy
- Easy to visualize and explain
</div>

<div class='alert alert-danger'>
<b>Bad</b>
    
- Prone to overfitting, which a random forest can mitigate
- A small change in the data can lead to a large change in the structure of the optimal decision tree
- Calculations can get very complex
</div>

https://scikit-learn.org/stable/modules/tree.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

<img src="./pict/tree.png"  
  width="800" />

<h2>Support Vector Machine</h2>
<div class='alert alert-success'>
<b>Good</b>
    
- Good at high-dimensional data
- Can work on small datasets
- Can solve non-linear problems
</div>

<div class='alert alert-danger'>
<b>Bad</b>
    
- Inefficient on large data
- Requires picking the right kernel
</div>

https://scikit-learn.org/stable/modules/svm.html#

<img src="./pict/svm.png"  
  width="400" />

<h2>Naive Bayes</h2>
<div class='alert alert-success'>
<b>Good</b>
    
- Training is fast
- Better suited for categorical inputs
- Easy to implement
</div>

<div class='alert alert-danger'>
<b>Bad</b>
    
- Assumes that all features are independent, which rarely happens in real life
- Zero-frequency problem: a category value never seen in training gets zero probability
- Probability estimates can be wrong in some cases
</div>

https://scikit-learn.org/stable/modules/naive_bayes.html

<img src="./pict/bayes.png"  
  width="400" />