# Practical machine learning and deep learning. Lab 1
## Introduction 

Labs will be conducted with use of [Kaggle](https://www.kaggle.com). 

The aim of today's lab is to find out how the next labs will be held and recap ML workflow.

You're asked to predict final grades of students by treir attendance, midterm and assignments scores. The data are real anonymized grades from one of the Innopolis course, but also contain some fictional 'students'.  

## [Сompetition](https://www.kaggle.com/t/6c8eb8f31b6b47d5ac647816b21b321a)
The competition costs 2 points max. To earn them, you have to beat a baseline score. However, if the trained model predicts some results under the baseline, you're guaranteed one point. This rule will work for the further labs. The baseline score can be found on Kaggle Leaderboard page.

Evaluation metric for this competition is R^2.

## Task
Today's task is to make a submission to a [competition](https://www.kaggle.com/t/6c8eb8f31b6b47d5ac647816b21b321a). 

To do so you will need: 
- Obtain data from competition 
- Create a Jupyter notebook which will produce a file for submission
- Submit it to the competition

### Data

Data contains `train` and `test` splits. Your goal is to train any appropriate ML model on `train` split and run inference on `test` split.

In [150]:
import pandas as pd
import numpy as np
import sklearn
import warnings
warnings.filterwarnings('ignore')

In [151]:
# train_data = pd.read_csv('/kaggle/input/pmldl-week-1-test-competition/train.csv',sep=';')
train_data = pd.read_csv('/home/rizo/inno/dl/train.csv',sep=';')

train_data.head()

Unnamed: 0,Course Grade (Real),Assignment: In-class participation,Assignment: Assignment 1,Assignment: Midterm
0,100,5,100,26
1,73,1.25,98,12
2,78,-,96,14
3,100,0,100,20
4,70,0,84,26


## Preprocessing

Please note that all the features have their own scales. Also some of them have missing values. Thus, you should apply Scaler and Imputer on the features and Scaler on labels.

In [152]:
# remove '-'
train_data = train_data.replace(['', '-', ' '], np.NaN)

In [153]:
train_data['Assignment: In-class participation'] = pd.to_numeric(train_data['Assignment: In-class participation'],errors = 'coerce')
train_data['Assignment: Assignment 1'] = pd.to_numeric(train_data['Assignment: Assignment 1'],errors = 'coerce')


In [154]:
label = train_data[['Course Grade (Real)']]

# Create another DataFrame for the rest of the columns
features = train_data.drop(columns=['Course Grade (Real)'])

In [155]:
from sklearn.impute import SimpleImputer

# Create a SimpleImputer instance with the desired strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the train_data
imputer.fit(features)

# Transform the train_data by replacing NaN values
features_imputed = imputer.transform(features)

# convert the resulting NumPy array back to a DataFrame
features_imputed = pd.DataFrame(features_imputed, columns=features.columns)


In [156]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assuming 'train_data' has features and one label column 'target'
features = features_imputed

# Create a scaler instance (choose either MinMaxScaler or StandardScaler)
scaler = MinMaxScaler()  # or StandardScaler()

# Fit and transform the features
features_normalized = scaler.fit_transform(features)

# convert the normalized features back to a DataFrame
features_normalized = pd.DataFrame(features_normalized, columns=features.columns)
features_normalized


Unnamed: 0,Assignment: In-class participation,Assignment: Assignment 1,Assignment: Midterm
0,1.000000,1.00,0.866667
1,0.250000,0.98,0.400000
2,0.390398,0.96,0.466667
3,0.000000,1.00,0.666667
4,0.000000,0.84,0.866667
...,...,...,...
239,0.600000,0.85,0.466667
240,0.000000,0.00,0.000000
241,0.084000,0.93,0.933333
242,0.600000,0.98,0.600000


### Model
Implement any appropriate regression  ML model you like. Consider the number of features and data points when chose a model.

In [159]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor


# Split your data
X = features_normalized
y = label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Initialize and train models
model = RandomForestRegressor()

model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = r2_score(y_test, predictions)
print(f"Mean Squared Error: {mse}")


Mean Squared Error: 0.8568805848780816


In [162]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define your model
model = RandomForestRegressor()

# Define the parameter grid to search
# Define the more comprehensive parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2'],
    'bootstrap': [True, False],
    'max_samples': [None, 0.5, 0.75],
    'criterion': ['squared_error', 'absolute_error']
}

# Setup GridSearch with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='r2')

# Fit the grid search
grid_search.fit(X, y)

# Get the best model and parameters
print(f"Best R^2 score: {grid_search.best_score_}")
print(f"Best hyperparameters: {grid_search.best_params_}")


Best R^2 score: 0.7591756368635945
Best hyperparameters: {'bootstrap': True, 'criterion': 'absolute_error', 'max_depth': None, 'max_features': 'log2', 'max_samples': 0.5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 50}


In [163]:
model = RandomForestRegressor(
    bootstrap=True,
    criterion='absolute_error',
    max_depth=None,
    max_features='log2',
    max_samples=0.5,
    min_samples_leaf=1,
    min_samples_split=10,
    n_estimators=50,
    random_state=15  # Optional: Set a seed for reproducibility
)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model (R^2 score)
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R^2 score: {r2}")

R^2 score: 0.8312623669172827


### Inference
Run you trained model on `test` split


In [164]:
# test_data = pd.read_csv('/kaggle/input/pmldl-week-1-test-competition/test.csv')
test_data = pd.read_csv('/home/rizo/inno/dl/test.csv',sep=';')
test_data.head()


Unnamed: 0,Assignment: In-class participation,Assignment: Assignment 1,Assignment: Midterm
0,3,100.0,14
1,-,100.0,18
2,1,100.0,16
3,1,100.0,16
4,-,61.0,20


In [165]:
# Write your code here - don't forget to apply the same transformation on test data

# remove '-'
test_data = test_data.replace(['', '-', ' '], np.NaN)

test_data['Assignment: In-class participation'] = pd.to_numeric(test_data['Assignment: In-class participation'],errors = 'coerce')
test_data['Assignment: Assignment 1'] = pd.to_numeric(test_data['Assignment: Assignment 1'],errors = 'coerce')

# Create a SimpleImputer instance with the desired strategy
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the test_data
imputer.fit(test_data)

# Transform the test_data by replacing NaN values
test_data_imputed = imputer.transform(test_data)

# convert the resulting NumPy array back to a DataFrame
test_data_imputed = pd.DataFrame(test_data_imputed, columns=test_data.columns)

# Create a scaler instance (choose either MinMaxScaler or StandardScaler)
scaler = MinMaxScaler()  # or StandardScaler()

# Fit and transform the features
features_normalized = scaler.fit_transform(test_data_imputed)

# convert the normalized features back to a DataFrame
features_normalized = pd.DataFrame(features_normalized, columns=features.columns)

preproc_test = features_normalized

predictions = model.predict(preproc_test.values)

### Save model predictions
Save model predictions to `submission.csv` and submit to competition

In [166]:
preds = pd.DataFrame(predictions, columns=['Course Grade (Real)'])

# Insert ID column for Kaggle
preds.insert(0, 'ID', range(0, len(preds)))

preds.head(3)

Unnamed: 0,ID,Course Grade (Real)
0,0,75.21
1,1,82.64
2,2,74.83


In [167]:
preds.to_csv('submission.csv', index=False)