# Predicting Covertype from the UCI dataset with basic Scikit-learn tools
### I will use the Covertype dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Covertype)

Let's download our dataset:

In [None]:
!curl -o dataset.data.gz https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
!gunzip dataset.data.gz
!rm -f dataset.data.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  12.6M      0 --:--:-- --:--:-- --:--:-- 12.6M
gzip: dataset.data already exists; do you wish to overwrite (y or n)? n
	not overwritten


In [None]:
import numpy as np
np.logspace(1.0, 4.0, num=6)

array([   10.        ,    39.81071706,   158.48931925,   630.95734448,
        2511.88643151, 10000.        ])

### Import the libraries needed to load and inspect the dataset:



In [None]:
import pandas as pd
import numpy as np

all_data = pd.read_csv('dataset.data', header = None)
all_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45,46,47,48,49,50,51,52,53,54
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,5
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,5
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,2
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,2
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,5


### Feature extraction

In [None]:
# the last column holds the cover type label
labels = all_data[all_data.columns[-1]].values
# all remaining columns are the features
feature_matrix = all_data[all_data.columns[:-1]].values

### Use `train_test_split` to divide the dataset into a training set and a test set:

In [None]:
from sklearn.model_selection import train_test_split

train_feature_matrix, test_feature_matrix, train_labels, test_labels = train_test_split(feature_matrix, labels, test_size=0.2, random_state=42)
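
To make the split more concrete, here is a minimal pure-Python sketch of the idea: shuffle the row indices once with a fixed seed, then hold out a share of them as the test set. The helper `simple_train_test_split` and the toy data are hypothetical and only illustrate the principle; this is not how scikit-learn implements `train_test_split`, and it will not reproduce the same shuffle.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=42):
    """Shuffle indices once, then use the first `test_size` share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # fixed seed makes the split reproducible
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # same return order as in the notebook: features first, then labels
    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# toy data (hypothetical): 10 rows, label is the parity of the first feature
toy_rows = [[i, i * 2] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(toy_rows, toy_labels)
print(len(X_train), len(X_test))  # 8 2
```

Features and labels are selected with the same indices, so every row keeps its own label after the shuffle.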

### Let's try the `LogisticRegression()` estimator


In [None]:
from sklearn.linear_model import LogisticRegression

# create the model with default parameters
model = LogisticRegression()
# fit the model on the training set
model.fit(train_feature_matrix, train_labels)
# predict labels for the test set
test_predictions = model.predict(test_feature_matrix)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Checking accuracy with `metrics.accuracy_score`:


In [None]:
from sklearn.metrics import accuracy_score

# the signature is accuracy_score(y_true, y_pred); accuracy is symmetric, so the value is the same
accuracy_score(test_labels, test_predictions)

0.622290302315775
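
As a reminder of what this number means, accuracy is simply the share of predictions that match the true labels. A minimal sketch with a hypothetical helper `accuracy` and made-up toy labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# toy labels (hypothetical), 3 of 5 predictions are correct
toy_true = [5, 5, 2, 2, 5]
toy_pred = [5, 2, 2, 2, 1]
print(accuracy(toy_true, toy_pred))  # 0.6
```

Swapping the two arguments gives the same value, because only equality between paired elements is counted.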

### As you can see, the model is not very accurate: only about 62% accuracy. Let's try to improve the result with `model_selection.GridSearchCV` by searching for the best hyperparameters:
I will tune the following parameters:

1.   `C` for the inverse of regularization strength (C=1/λ)
2.   `l1_ratio` for the Elastic-Net mixing between the L1 and L2 penalties (0 is pure L2, 1 is pure L1)

I will use the `saga` solver because it is the only `LogisticRegression` solver in scikit-learn that supports the Elastic-Net penalty, which combines L1 and L2 regularization.
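
To see how `l1_ratio` mixes the two norms, here is a small sketch of the Elastic-Net penalty term in the form given in the scikit-learn user guide, `l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2`. The function name `elastic_net_penalty` and the toy weights are assumptions made for illustration; in the full objective this term is balanced against the log-loss, which is multiplied by `C`.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Elastic-Net penalty: a blend of the L1 norm and half the squared L2 norm."""
    l1_norm = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return l1_ratio * l1_norm + (1.0 - l1_ratio) / 2.0 * l2_squared

# toy weight vector (hypothetical)
toy_weights = [0.5, -2.0, 0.0]

# the same three l1_ratio values that are used in the grid
for ratio in (0, 0.5, 1):
    print(ratio, elastic_net_penalty(toy_weights, ratio))
```

With `l1_ratio=0` only the L2 part remains, and with `l1_ratio=1` only the L1 part remains.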

### I also need to standardize the dataset so that every feature has zero mean and unit variance. I will use `sklearn.preprocessing.StandardScaler` for this purpose:

In [None]:
train_feature_matrix

array([[3289,   22,   19, ...,    0,    0,    0],
       [2963,   21,   18, ...,    0,    0,    0],
       [3037,  185,    9, ...,    0,    0,    0],
       ...,
       [3153,  287,   17, ...,    0,    0,    0],
       [3065,  348,   21, ...,    0,    0,    0],
       [3021,   26,   16, ...,    0,    0,    0]])

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_feature_matrix_transformed = scaler.fit_transform(train_feature_matrix)
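
What the scaler does to a single column can be sketched in a few lines of plain Python. The helper `fit_standardizer` and the toy values are hypothetical; the sketch assumes the population standard deviation (no Bessel correction) and falls back to a scale of 1 for constant columns, which is how I understand `StandardScaler` to behave.

```python
import statistics

def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    # a constant column would give std == 0; use 1.0 to avoid division by zero
    return mean, (std if std > 0 else 1.0)

# toy "elevation" column (hypothetical values)
train_elevation = [3289.0, 2963.0, 3037.0, 3153.0]
mean, std = fit_standardizer(train_elevation)

# statistics are learned on the training data only ...
scaled_train = [(v - mean) / std for v in train_elevation]
# ... and then reused unchanged for new data
scaled_test = [(v - mean) / std for v in [3000.0]]

print(statistics.fmean(scaled_train), statistics.pstdev(scaled_train))
```

The printed mean should be (numerically) zero and the standard deviation one.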

In [None]:
from sklearn.model_selection import GridSearchCV

# saga is required for the Elastic-Net penalty; max_iter is raised to help convergence
log_reg = LogisticRegression(solver='saga', penalty='elasticnet', max_iter=1500)

# 6 values of C times 3 values of l1_ratio = 18 candidates
my_params = {'C': np.logspace(1.0, 4.0, num=6),
             'l1_ratio': [0, 0.5, 1]}

my_grid = GridSearchCV(log_reg, my_params, cv=2, refit=True, scoring='accuracy', verbose=4)

The grid search is ready, so let's find the best parameters:

In [None]:
my_grid.fit(train_feature_matrix_transformed, train_labels)
print(my_grid.best_estimator_)
print(my_grid.best_params_)
print(my_grid.best_score_)

Fitting 2 folds for each of 18 candidates, totalling 36 fits
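
The numbers in this message follow directly from the grid: every combination of `C` and `l1_ratio` is one candidate, and each candidate is fitted once per fold. A small sketch that rebuilds the grid without NumPy (the variable names are illustrative):

```python
from itertools import product

# same values as np.logspace(1.0, 4.0, num=6): exponents evenly spaced from 1 to 4
c_values = [10 ** (1.0 + 3.0 * i / 5) for i in range(6)]
l1_ratios = [0, 0.5, 1]
n_folds = 2

# every (C, l1_ratio) pair is one candidate
candidates = list(product(c_values, l1_ratios))
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 18 36
```

With `refit=True` the best candidate is then fitted once more on the whole training set, which is not included in the 36 fits reported above.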


After about 12 hours of waiting (on a Kaggle notebook), I got the following:

`LogisticRegression(C=158.48931924611142, l1_ratio=0, max_iter=1500,
penalty='elasticnet', solver='saga')`

`{'C': 158.48931924611142, 'l1_ratio': 0}`

`0.7242136055286749`

<b>OK. As you can see, the accuracy is still not very high (about 72% in cross-validation), so let's try some other models.</b>

### K Nearest Neighbors (KNN)

The nearest neighbors method (k Nearest Neighbors, or kNN) is a very popular classification method that is also sometimes used for regression. It is one of the most intuitive approaches to classification: look at your neighbors, and whichever class predominates among them is your class too. Formally, the technique relies on the compactness hypothesis: if the distance metric between examples is chosen well, then similar examples fall in the same class much more often than in different ones.

<img src='https://hsto.org/web/68d/a45/6f0/68da456f00f8434e87628dbe7e3f54a7.png' width=600>
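
The idea fits in a few lines of plain Python. This is only a sketch under simple assumptions (Euclidean distance, unweighted majority vote, brute-force search); the function `knn_predict` and the toy points are hypothetical, and `KNeighborsClassifier` uses far more efficient neighbor search than this.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # distance from the query to every training point, paired with that point's label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # the most frequent label among the k nearest neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# two well-separated toy clusters (hypothetical data)
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_labels = [1, 1, 1, 2, 2, 2]

print(knn_predict(toy_points, toy_labels, (0.3, 0.3)))  # 1
print(knn_predict(toy_points, toy_labels, (4.8, 5.0)))  # 2
```

Because the prediction depends entirely on distances, features on very different scales would dominate the result, which is why scaling matters for kNN.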

Now I will use a `Pipeline` for scaling and evaluation, without a grid search:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

my_pipe = Pipeline([('scaling', StandardScaler()), 
                   ('model', KNeighborsClassifier())])

my_pipe.fit(train_feature_matrix, train_labels)
my_pipe.score(test_feature_matrix, test_labels)

0.9279708785487466

As we can see, we got about 93% accuracy with kNN and standard scaling.

Finally, let's try `RandomForestClassifier`:

### RandomForestClassifier
In random forest classification, multiple decision trees are built using different random subsets of the data and features. Each decision tree is like an expert giving its opinion on how to classify the data. Predictions are made by computing the prediction of each decision tree and then taking the most popular result. (For regression, predictions are averaged instead.)

<img src='https://res.cloudinary.com/dyd911kmh/image/upload/v1677239993/image5_c214968fd6.png' width=1000>
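
The voting step described above can be sketched as follows. The per-tree predictions are made up for illustration and `majority_vote` is a hypothetical helper; note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this shows the textbook idea, not the library's exact procedure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree label lists into one label per sample by majority vote."""
    n_samples = len(tree_predictions[0])
    return [
        Counter(tree[i] for tree in tree_predictions).most_common(1)[0][0]
        for i in range(n_samples)
    ]

# hypothetical predictions of 3 trees for 3 samples (one row per tree)
toy_tree_predictions = [
    [2, 5, 1],
    [2, 2, 1],
    [5, 2, 1],
]

print(majority_vote(toy_tree_predictions))  # [2, 2, 1]
```

Individual trees disagree on the first two samples, but the vote still settles on a single label for each.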


In [None]:
from sklearn.ensemble import RandomForestClassifier

my_pipe = Pipeline([('scaling', StandardScaler()), 
                   ('model', RandomForestClassifier())])

my_pipe.fit(train_feature_matrix, train_labels)
my_pipe.score(test_feature_matrix, test_labels)

0.9552679362839169

As we can see, we got about 96% accuracy with the random forest and standard scaling.
