# IML Assignment 1

## Name: Kamil Mirgasimov


## Mail: k.mirgasimov@innopolis.university


## Group: B22-SD-02

### Code style policy 

We expect you to follow https://peps.python.org/pep-0008/ Python standart style and will reduce your points if you don't. Also we ask you to comment your code when it's needed (logical blocks, function declaration, loops), however over-documentation is the evil.

Example of nice code style (no need to run this cells):

In [164]:
# This function returns the sum of parameters
# @param my_param1 - here I explain what this parameter means
# @param my_param2 - here I explain what this parameter means
# @return - result of func if it's not void
def my_func(my_param1: int, my_param2: int):
    return my_param1 + my_param2

There are few lines only, but they are represents important logical blocks, thus you should explain what their purpose:

In [165]:
# from my_training_package import my_regression, my_loader

# # Data loading
# x, y = my_loader.load("some.csv")

# # Training
# reg = my_regression()
# reg.train(x,y)

# # Evaluation on the same data set
# y_pred = reg.evaluation(y)

Example of too detailed and meaningless commenting that is not welcome:

In [166]:
# Import numpy package
import numpy as np
# This is variable x
x = 5
# This is variable y
y = 10
# Print x
print(x)

5


Ultimately, we belive in your programming common sense :) The purpose of clear code style is fast and smooth grading of your implementation and checking that you understand ML concepts.

## Task 1

### 3.1. Linear Regression
#### Data reading

In [185]:
import pandas as pd


# TODO Write your code here
df = pd.read_csv('train_1.csv')

#### Train\validation splitting

In [168]:
from sklearn.model_selection import train_test_split

# TODO Write your code here
x = df.drop(['y'], axis=1)
y = df.loc[:,"y"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#### Linear regression model fitting

In [169]:
from sklearn.linear_model import LinearRegression


# Declare and train a linear regression model
# TODO Write your code here
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

# Prediction by model on the validation set
# TODO Write your code here
y_pred_lr = linear_model.predict(x_test)

#### Linear regression model prediction & Evaluation


In [170]:
from sklearn import metrics

# Print MSE, RMSE, MAE and R2 score
def print_metrics(y_actual, y_pred):
    mse = metrics.mean_squared_error(y_actual, y_pred)
    rmse = np.sqrt(mse)
    mae = metrics.mean_absolute_error(y_actual, y_pred)
    r_2 = metrics.r2_score(y_actual, y_pred)
    print('Mean squared error:', mse)
    print('Root mean squared error:', rmse)
    print('Mean absolute error:', mae)
    print('R-2 score:', r_2)


print_metrics(y_test, y_pred_lr)

Mean squared error: 3826.726825199097
Root mean squared error: 61.86054336327072
Mean absolute error: 51.482778038978836
R-2 score: 0.877207103326709


### 3. 2 Polynomial Regression
#### Constructing the polynomial regression pipeline

In [171]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [172]:
# TODO Write your code here
polynomial_degree = PolynomialFeatures(degree=2)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynom", polynomial_degree),
                         ("linear_regression", linear_regression)])

#### Tuning the degree hyper-parameter using GridSearch

In [184]:
from sklearn.model_selection import GridSearchCV

# Declare a GridSearch instance 
# TODO Write your code here
search = GridSearchCV(estimator=pipeline, param_grid={'polynom__degree': range(2, 6)}, cv=8, scoring ='neg_mean_squared_error')

# Train the GridSearch
# TODO Write your code here
grid_search = search.fit(x_train, y_train)
y_pred = grid_search.best_estimator_.predict(x_test)

# Find the optimum degrees
# TODO Write your code here
print(f"Best parameter: {grid_search.best_params_}" )

# Print the GridSearchCV score
# TODO Write your code here
# TODO
print(f"search score: {cross_val_score(grid_search, x_train, y_train)}")

Best parameter: {'polynom__degree': 4}


search score: [ -8.52888376  -8.1584728  -25.57196777 -19.88560918 -27.7930192 ]


In [174]:
print_metrics(y_actual=y_test, y_pred=y_pred)

Mean squared error: 1.1744904325824161
Root mean squared error: 1.083739098022405
Mean absolute error: 0.7518532309274343
R-2 score: 0.9999623126789762


#### Save the model

In [180]:
import pickle 

# Save the GridSearch model for evaluation
filename = 'poly_optimized_model.sav'
pickle.dump(search, open(filename, 'wb'))

### 3.3 Determine the linear dependent features

Use the following code cell to determine a pair of features from the training dataset that are correlated to each other. Explain your choise in the markdown cell.

In [232]:
# TODO Write your code here
df_modified = df.drop(df.columns[[0, -1]], axis=1)
for i in range(len(df_modified.columns)):
    for j in range(i + 1, len(df_modified.columns)):
        row_i = df_modified.iloc[i]
        row_j = df_modified.iloc[j]
        correlation = row_i.corr(row_j)
        print(f'Correlation for {df_modified.columns[i]} and {df_modified.columns[j]}: {correlation}')



Correlation for X_1 and X_2: 0.3933888206271863
Correlation for X_1 and X_3: 0.6964506177005838
Correlation for X_1 and X_4: 0.7162306612508836
Correlation for X_2 and X_3: -0.3689174252604542
Correlation for X_2 and X_4: 0.9026708037739416
Correlation for X_3 and X_4: -0.0017733618103473502


## Task 2

### 4.1 Data processing
#### Loading the dataset

In [177]:
import pandas as pd

#### Exploring the dataset  and removing 2 redundant features

In [249]:
# TODO Write your code here
df = pd.read_csv('pokemon_modified.csv')
df.head()
df.loc[:,'classification']

0             Seed Pokémon
1             Seed Pokémon
2             Seed Pokémon
3           Lizard Pokémon
4            Flame Pokémon
              ...         
796         Launch Pokémon
797    Drawn Sword Pokémon
798      Junkivore Pokémon
799          Prism Pokémon
800     Artificial Pokémon
Name: classification, Length: 801, dtype: object

#### Splitting the data
Use random_state = 123, stratify, and set test_size = 0.2

In [179]:
from sklearn.model_selection import train_test_split

# TODO Write your code here
X_train, X_test, y_train, y_test =...

TypeError: cannot unpack non-iterable ellipsis object

Check if the dataset is balanced or not and comment on it

In [None]:
# TODO Write your code here
print(...)

#### Checking for missing values

In [None]:
# TODO Write your code here

#### Impute the missing values

In [None]:
from sklearn.impute import SimpleImputer

# Define a SimpleImputer instance
# TODO Write your code here
imputer = ...

# Apply the imputer
# TODO Write your code here
...

#### Double check that there are no missing values

In [None]:
# TODO Write your code here

#### Encode categorically

In [None]:
# TODO Write your code here

#### Scale the data

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Define a scaler instance from one of the above
# TODO Write your code here
scaler = ...

# Apply the scaler on both train and test features
# TODO Write your code here
x_train = ...
x_test = ...

#### <span style="color:red">Correlation matrix</span>

Are there highly co-related features in the dataset? Is it a problem? Explain in the markdown cell.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(30, 30))

# Plot the correlation matrix
# TODO Write your code here

### 4.2 Model fitting and Comparison

#### Tuning LR model

In [None]:
# Caclulate and print classification metrics: accuracy, precision, recall, and F1 score 
# TODO Write your code here
def print_clf_metrics(y_actual, y_pred ):
    pass

In [None]:
# Specify GridSearchCV as in intruction
# TODO Write your code here
parameters = ...

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Declare and train logistic regression inside GridSearchCV with the parameters above
# Set max_iter=1000 in LR constructor
# TODO Write your code here
lr_clf_gs = ...

In [None]:
print("Tuned Hyperparameters :",)
print("Accuracy :",)

#### Construct a LR with the best params and Evaluate the LR with the best params

In [None]:
# TODO Write your code here
lr_clf = ...
lr_y_pred = ...

In [None]:
print_clf_metrics(y_test, lr_y_pred)

#### Print the top 5 most influencing features and the top 5 ignored features

In [None]:
# TODO Write your code here

#### Tuning KNN model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Declare and train knn inside GridSearchCV
# TODO Write your code here
param_grid = ...
knn_clf_gs = ...


print("Tuned Hyperparameters :", )
print("Accuracy :",)

#### Construct a KNN model with the best params and Evaluate the KNN with the best params


In [None]:
# TODO Write your code here
knn_clf = ...
knn_y_pred = ...
print_clf_metrics(y_test, knn_y_pred)

#### Fitting GNB to the data and evaluating on the test dataset

In [None]:
from sklearn.naive_bayes import GaussianNB

# Declare and train GaussianNB. No hyperparameters tuning 
# TODO Write your code here
gauss_nb_clf = ...
gauss_y_pred = ...

print_clf_metrics(y_test, gauss_y_pred)

#### Which metric is most appropriate for this task and why?

#### Compare the 3 classifiers in terms of accuracy, precision, recall and F1-score.
What is the best model for this task? Explain

In [None]:
# TODO Write your code here

## 5. Bonus Task

#### Loading the Dataset

In [None]:
import pandas as pd

# TODO Write your code here
train_data = ...

test_data = ...

In [None]:
# Split the data
# TODO Write your code here
X_train, X_test, y_train, y_test = ...
print(X_train, y_train, X_test, y_test)

####  Plot the data using the pairplot in sns

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# TODO Write your code here

#### Fit LR to the training dataset using OVR and evaluate on the test dataset

In [None]:
# TODO Write your code here
ovr_lr = ...

#### Fit LR to the training dataset using multinomial and evaluate on the test dataset


In [None]:
# TODO Write your code here
multi_lr = ...

#### Using gridsearch to tune the C value and multi class

In [None]:
# TODO Write your code here
params = ...
grid_search_clf = ...

In [None]:
print("Tuned Hyperparameters :")
print("Accuracy :")

#### Comment on why one multi_class technique was better than the other

#### Create LR with the best params

In [None]:
# TODO Write your code here
multi_lr = ... 

#### Visualize the decision boundaries

In [None]:
from mlxtend.plotting import plot_decision_regions
# TODO Write your code here
multi_lr = ...

plot_decision_regions()

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression decision boundary)
plt.show()

#### Comment on the decision boundary, do you think this is a good model or not? and based on what?