# IML Assignment 1

## Name: Kamil Mirgasimov


## Mail: k.mirgasimov@innopolis.university


## Group: B22-SD-02

### Code style policy 

We expect you to follow https://peps.python.org/pep-0008/ Python standart style and will reduce your points if you don't. Also we ask you to comment your code when it's needed (logical blocks, function declaration, loops), however over-documentation is the evil.

Example of nice code style (no need to run this cells):

In [149]:
# This function returns the sum of parameters
# @param my_param1 - here I explain what this parameter means
# @param my_param2 - here I explain what this parameter means
# @return - result of func if it's not void
def my_func(my_param1: int, my_param2: int):
    return my_param1 + my_param2

There are few lines only, but they are represents important logical blocks, thus you should explain what their purpose:

In [150]:
# from my_training_package import my_regression, my_loader

# # Data loading
# x, y = my_loader.load("some.csv")

# # Training
# reg = my_regression()
# reg.train(x,y)

# # Evaluation on the same data set
# y_pred = reg.evaluation(y)

Example of too detailed and meaningless commenting that is not welcome:

In [151]:
# Import numpy package
import numpy as np

# This is variable x
x = 5
# This is variable y
y = 10
# Print x
print(x)

5


Ultimately, we belive in your programming common sense :) The purpose of clear code style is fast and smooth grading of your implementation and checking that you understand ML concepts.

## Task 1

### 3.1. Linear Regression
#### Data reading

In [152]:
import pandas as pd

# TODO Write your code here
df = pd.read_csv('train_1.csv')

#### Train\validation splitting

In [153]:
from sklearn.model_selection import train_test_split

# TODO Write your code here
x = df.drop(['y'], axis=1)
y = df.loc[:, "y"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#### Linear regression model fitting

In [154]:
from sklearn.linear_model import LinearRegression

# Declare and train a linear regression model
# TODO Write your code here
linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

# Prediction by model on the validation set
# TODO Write your code here
y_pred_lr = linear_model.predict(x_test)

#### Linear regression model prediction & Evaluation


In [155]:
from sklearn import metrics


# Print MSE, RMSE, MAE and R2 score
def print_metrics(y_actual, y_pred):
    mse = metrics.mean_squared_error(y_actual, y_pred)
    rmse = np.sqrt(mse)
    mae = metrics.mean_absolute_error(y_actual, y_pred)
    r_2 = metrics.r2_score(y_actual, y_pred)
    print('Mean squared error:', mse)
    print('Root mean squared error:', rmse)
    print('Mean absolute error:', mae)
    print('R-2 score:', r_2)


print_metrics(y_test, y_pred_lr)

Mean squared error: 4760.8919492439145
Root mean squared error: 68.99921701906418
Mean absolute error: 59.726963530513444
R-2 score: 0.8186425742114126


### 3. 2 Polynomial Regression
#### Constructing the polynomial regression pipeline

In [156]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [157]:
# TODO Write your code here
polynomial_degree = PolynomialFeatures(degree=2)
linear_regression = LinearRegression()
pipeline = Pipeline([("polynom", polynomial_degree),
                     ("linear_regression", linear_regression)])

#### Tuning the degree hyper-parameter using GridSearch

In [158]:
from sklearn.model_selection import GridSearchCV

# Declare a GridSearch instance 
# TODO Write your code here
search = GridSearchCV(estimator=pipeline, param_grid={'polynom__degree': range(2, 6)}, cv=8,
                      scoring='neg_mean_squared_error')

# Train the GridSearch
# TODO Write your code here
grid_search = search.fit(x_train, y_train)
y_pred = grid_search.best_estimator_.predict(x_test)

# Find the optimum degrees
# TODO Write your code here
print(f"Best parameter: {grid_search.best_params_}")

# Print the GridSearchCV score
# TODO Write your code here
# TODO
print(f"search score: {cross_val_score(grid_search, x_train, y_train)}")

Best parameter: {'polynom__degree': 4}
search score: [ -7.50927334  -9.42470387 -17.47431831 -10.3213842  -44.04059636]


In [159]:
print_metrics(y_actual=y_test, y_pred=y_pred)

Mean squared error: 0.7661304954487596
Root mean squared error: 0.8752888068796262
Mean absolute error: 0.6453834416592811
R-2 score: 0.9999708156673258


#### Save the model

In [160]:
import pickle

# Save the GridSearch model for evaluation
filename = 'poly_optimized_model.sav'
pickle.dump(search, open(filename, 'wb'))

### 3.3 Determine the linear dependent features

Use the following code cell to determine a pair of features from the training dataset that are correlated to each other. Explain your choise in the markdown cell.

In [161]:
# TODO Write your code here
df_modified = df.drop(df.columns[[0, -1]], axis=1)
for i in range(len(df_modified.columns)):
    for j in range(i + 1, len(df_modified.columns)):
        row_i = df_modified.iloc[i]
        row_j = df_modified.iloc[j]
        correlation = row_i.corr(row_j)
        print(f'Correlation for {df_modified.columns[i]} and {df_modified.columns[j]}: {correlation}')



Correlation for X_1 and X_2: 0.39338882062718633
Correlation for X_1 and X_3: 0.6964506177005841
Correlation for X_1 and X_4: 0.7162306612508837
Correlation for X_2 and X_3: -0.3689174252604542
Correlation for X_2 and X_4: 0.9026708037739416
Correlation for X_3 and X_4: -0.0017733618103473517


## Task 2

### 4.1 Data processing
#### Loading the dataset

In [162]:
import pandas as pd

#### Exploring the dataset  and removing 2 redundant features

In [171]:
# TODO Write your code here
df = pd.read_csv('pokemon_modified.csv')
df.head()
df.info()
df = df.drop(['name'], axis=1)
df = df.drop(['classification'], axis=1)
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   against_bug        801 non-null    float64
 1   against_dark       801 non-null    float64
 2   against_dragon     801 non-null    float64
 3   against_electric   801 non-null    float64
 4   against_fairy      801 non-null    float64
 5   against_fight      801 non-null    float64
 6   against_fire       801 non-null    float64
 7   against_flying     801 non-null    float64
 8   against_ghost      801 non-null    float64
 9   against_grass      801 non-null    float64
 10  against_ground     801 non-null    float64
 11  against_ice        801 non-null    float64
 12  against_normal     801 non-null    float64
 13  against_poison     801 non-null    float64
 14  against_psychic    801 non-null    float64
 15  against_rock       801 non-null    float64
 16  against_steel      801 non

Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,...,height_m,hp,percentage_male,type1,sp_attack,sp_defense,speed,weight_kg,generation,is_legendary
0,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,...,0.7,45,88.1,grass,65,65,45,6.9,1,0
1,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,...,1.0,60,88.1,grass,80,80,60,13.0,1,0
2,1.00,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,...,2.0,80,88.1,grass,122,120,80,100.0,1,0
3,0.50,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.50,...,0.6,39,88.1,fire,60,50,65,8.5,1,0
4,0.50,1.0,1.0,1.0,0.5,1.0,0.5,1.0,1.0,0.50,...,1.1,58,88.1,fire,80,65,80,19.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
796,0.25,1.0,0.5,2.0,0.5,1.0,2.0,0.5,1.0,0.25,...,9.2,97,,steel,107,101,61,999.9,7,1
797,1.00,1.0,0.5,0.5,0.5,2.0,4.0,1.0,1.0,0.25,...,0.3,59,,grass,59,31,109,0.1,7,1
798,2.00,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,0.50,...,5.5,223,,dark,97,53,43,888.0,7,1
799,2.00,2.0,1.0,1.0,1.0,0.5,1.0,1.0,2.0,1.00,...,2.4,97,,psychic,127,89,79,230.0,7,1


#### Splitting the data
Use random_state = 123, stratify, and set test_size = 0.2

In [178]:
from sklearn.model_selection import train_test_split

# TODO Write your code here
x = df.drop(['is_legendary'], axis=1)
y = df.loc[:, "is_legendary"]
X_train, X_test, y_train, y_test = train_test_split(x, y)

Check if the dataset is balanced or not and comment on it

In [None]:
# TODO Write your code here
print(...)

#### Checking for missing values

In [None]:
# TODO Write your code here

#### Impute the missing values

In [187]:
from sklearn.impute import SimpleImputer

# Define a SimpleImputer instance
# TODO Write your code here
# imputing missing values
imputer = SimpleImputer(strategy='most_frequent')

# Apply the imputer
# TODO Write your code here
imputer.fit(X_train)
X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
X_train

Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,...,experience_growth,height_m,hp,percentage_male,type1,sp_attack,sp_defense,speed,weight_kg,generation
0,2.0,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,0.5,...,1250000,1.4,72,50.0,dark,65,70,58,50.0,5
1,2.0,1.0,1.0,0.5,1.0,1.0,2.0,2.0,1.0,0.5,...,1059860,0.8,75,50.0,grass,105,85,30,8.5,2
2,1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,...,1059860,2.0,80,88.1,grass,122,120,80,100.0,1
3,0.5,1.0,1.0,0.0,1.0,0.5,1.0,1.0,1.0,1.0,...,1059860,1.1,65,50.0,ground,35,65,85,64.8,2
4,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,0.5,1.0,...,1059860,1.6,60,88.1,dark,120,60,105,81.1,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,0.5,0.5,0.0,1.0,1.0,0.5,1.0,1.0,1.0,1.0,...,1250000,3.0,126,50.0,fairy,131,98,99,215.0,6
596,0.5,0.5,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,...,1000000,1.3,120,100.0,fighting,30,85,45,55.5,5
597,0.5,1.0,0.5,1.0,0.5,4.0,1.0,0.25,1.0,1.0,...,1250000,0.9,60,50.0,steel,50,50,40,120.0,3
598,0.5,0.5,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,...,1000000,0.7,35,100.0,fighting,35,35,35,21.0,2


#### Double check that there are no missing values

In [None]:
# TODO Write your code here

#### Encode categorically

In [188]:
# TODO Write your code here
from sklearn.preprocessing import OneHotEncoder


def ohe_new_features(df, features_name, encoder):
    new_feats = encoder.transform(df[features_name])
    # create dataframe from encoded features with named columns
    new_cols = pd.DataFrame(new_feats, columns=encoder.get_feature_names_out(features_name))
    new_df = pd.concat([df, new_cols], axis=1)
    new_df.drop(features_name, axis=1, inplace=True)
    return new_df


encoder = OneHotEncoder(sparse_output=False, drop='first')
f_names = ['type1']
encoder.fit(X_train[f_names])
X_train = ohe_new_features(X_train, f_names, encoder)
X_test = ohe_new_features(X_test, f_names, encoder)

Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,...,type1_ghost,type1_grass,type1_ground,type1_ice,type1_normal,type1_poison,type1_psychic,type1_rock,type1_steel,type1_water
0,2.0,0.5,2.0,0.5,4.0,2.0,0.5,1.0,0.5,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2.0,1.0,1.0,0.5,1.0,1.0,2.0,2.0,1.0,0.5,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,1.0,0.5,0.5,0.5,2.0,2.0,1.0,0.25,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.5,1.0,1.0,0.0,1.0,0.5,1.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2.0,0.5,1.0,1.0,2.0,2.0,1.0,1.0,0.5,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,0.5,0.5,0.0,1.0,1.0,0.5,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
596,0.5,0.5,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
597,0.5,1.0,0.5,1.0,0.5,4.0,1.0,0.25,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
598,0.5,0.5,1.0,1.0,2.0,1.0,1.0,2.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Scale the data

In [190]:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Define a scaler instance from one of the above
# TODO Write your code here
scaler = MinMaxScaler()


# Apply the scaler on both train and test features
# TODO Write your code here
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
X_test

Unnamed: 0,against_bug,against_dark,against_dragon,against_electric,against_fairy,against_fight,against_fire,against_flying,against_ghost,against_grass,...,type1_ghost,type1_grass,type1_ground,type1_ice,type1_normal,type1_poison,type1_psychic,type1_rock,type1_steel,type1_water
0,0.066667,0.066667,0.50,0.250,0.466667,0.250,0.200000,0.466667,0.25,0.200000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.200000,0.200000,0.50,0.250,0.200000,0.500,0.200000,0.200000,0.00,0.200000,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.066667,0.200000,0.25,0.250,0.066667,0.500,0.466667,0.066667,0.25,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.466667,0.200000,0.50,0.125,0.200000,0.250,0.466667,0.466667,0.25,0.066667,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.466667,0.466667,0.50,0.500,0.200000,0.125,0.066667,0.200000,0.50,0.466667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196,0.200000,0.200000,0.50,0.500,0.200000,0.250,0.066667,0.200000,0.25,0.466667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
197,0.066667,0.200000,0.50,0.250,0.066667,0.250,0.066667,0.200000,0.25,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
198,0.066667,0.200000,0.50,0.500,0.066667,0.125,0.066667,0.200000,0.25,0.200000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
199,0.200000,0.200000,0.50,0.250,0.200000,0.250,0.200000,0.200000,0.25,0.200000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


#### <span style="color:red">Correlation matrix</span>

Are there highly co-related features in the dataset? Is it a problem? Explain in the markdown cell.

In [191]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(30, 30))

# Plot the correlation matrix
# TODO Write your code here

<Figure size 3000x3000 with 0 Axes>

<Figure size 3000x3000 with 0 Axes>

### 4.2 Model fitting and Comparison

#### Tuning LR model

In [192]:
# Caclulate and print classification metrics: accuracy, precision, recall, and F1 score 
# TODO Write your code here
def print_clf_metrics(y_actual, y_pred):
    mse = metrics.mean_squared_error(y_actual, y_pred)
    rmse = np.sqrt(mse)
    mae = metrics.mean_absolute_error(y_actual, y_pred)
    r_2 = metrics.r2_score(y_actual, y_pred)
    print('Mean squared error:', mse)
    print('Root mean squared error:', rmse)
    print('Mean absolute error:', mae)
    print('R-2 score:', r_2)

In [None]:
# Specify GridSearchCV as in instruction
# TODO Write your code here
parameters = ...

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Declare and train logistic regression inside GridSearchCV with the parameters above
# Set max_iter=1000 in LR constructor
# TODO Write your code here
lr_clf_gs = ...

In [None]:
print("Tuned Hyperparameters :", )
print("Accuracy :", )

#### Construct a LR with the best params and Evaluate the LR with the best params

In [None]:
# TODO Write your code here
lr_clf = ...
lr_y_pred = ...

In [None]:
print_clf_metrics(y_test, lr_y_pred)

#### Print the top 5 most influencing features and the top 5 ignored features

In [None]:
# TODO Write your code here

#### Tuning KNN model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Declare and train knn inside GridSearchCV
# TODO Write your code here
param_grid = ...
knn_clf_gs = ...

print("Tuned Hyperparameters :", )
print("Accuracy :", )

#### Construct a KNN model with the best params and Evaluate the KNN with the best params


In [None]:
# TODO Write your code here
knn_clf = ...
knn_y_pred = ...
print_clf_metrics(y_test, knn_y_pred)

#### Fitting GNB to the data and evaluating on the test dataset

In [None]:
from sklearn.naive_bayes import GaussianNB

# Declare and train GaussianNB. No hyperparameters tuning 
# TODO Write your code here
gauss_nb_clf = ...
gauss_y_pred = ...

print_clf_metrics(y_test, gauss_y_pred)

#### Which metric is most appropriate for this task and why?

#### Compare the 3 classifiers in terms of accuracy, precision, recall and F1-score.
What is the best model for this task? Explain

In [None]:
# TODO Write your code here

## 5. Bonus Task

#### Loading the Dataset

In [None]:
import pandas as pd

# TODO Write your code here
train_data = ...

test_data = ...

In [None]:
# Split the data
# TODO Write your code here
X_train, X_test, y_train, y_test = ...
print(X_train, y_train, X_test, y_test)

####  Plot the data using the pairplot in sns

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# TODO Write your code here

#### Fit LR to the training dataset using OVR and evaluate on the test dataset

In [None]:
# TODO Write your code here
ovr_lr = ...

#### Fit LR to the training dataset using multinomial and evaluate on the test dataset


In [None]:
# TODO Write your code here
multi_lr = ...

#### Using gridsearch to tune the C value and multi class

In [None]:
# TODO Write your code here
params = ...
grid_search_clf = ...

In [None]:
print("Tuned Hyperparameters :")
print("Accuracy :")

#### Comment on why one multi_class technique was better than the other

#### Create LR with the best params

In [None]:
# TODO Write your code here
multi_lr = ...

#### Visualize the decision boundaries

In [None]:
from mlxtend.plotting import plot_decision_regions

# TODO Write your code here
multi_lr = ...

plot_decision_regions()

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic Regression decision boundary)
plt.show()

#### Comment on the decision boundary, do you think this is a good model or not? and based on what?