# Practical machine learning and deep learning. Lab 1
## Introduction 

Labs are conducted on [Kaggle](https://www.kaggle.com).

The aim of today's lab is to show how the upcoming labs will be run and to recap the ML workflow.

You are asked to predict students' final grades from their attendance, midterm, and assignment scores. The data are real anonymized grades from one of the Innopolis courses, but they also contain some fictional 'students'.

## [Competition](https://www.kaggle.com/t/6c8eb8f31b6b47d5ac647816b21b321a)
The competition is worth up to 2 points. To earn both, you have to beat the baseline score. If your trained model produces predictions that score below the baseline, you are still guaranteed one point. The same rule applies to the later labs. The baseline score can be found on the Kaggle Leaderboard page.

The evaluation metric for this competition is R^2 (the coefficient of determination).
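
As a reminder of what the metric measures, here is a minimal sketch that computes R^2 by hand as `1 - SS_res / SS_tot` and compares it with `sklearn.metrics.r2_score`. The numbers are made up for illustration and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, invented for illustration only
y_true = np.array([100.0, 73.0, 78.0, 100.0, 70.0])
y_pred = np.array([95.0, 75.0, 80.0, 98.0, 72.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# scikit-learn computes the same quantity
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

A score of 1 means perfect predictions, while a model that always predicts the mean of the targets scores 0, so higher is better.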

## Task
Today's task is to make a submission to a [competition](https://www.kaggle.com/t/6c8eb8f31b6b47d5ac647816b21b321a). 

To do so, you will need to:
- Obtain the data from the competition
- Create a Jupyter notebook that produces a file for submission
- Submit it to the competition

### Data

The data contains `train` and `test` splits. Your goal is to train any appropriate ML model on the `train` split and run inference on the `test` split.

In [272]:
import pandas as pd
import numpy as np
import sklearn
import warnings
warnings.filterwarnings('ignore')

In [273]:
# train_data = pd.read_csv('/kaggle/input/pmldl-week-1-test-competition/train.csv',sep=';')
train_data = pd.read_csv('/home/rizo/inno/dl/train.csv',sep=';')

train_data.head()

|  | Course Grade (Real) | Assignment: In-class participation | Assignment: Assignment 1 | Assignment: Midterm |
|---|---|---|---|---|
| 0 | 100 | 5 | 100 | 26 |
| 1 | 73 | 1.25 | 98 | 12 |
| 2 | 78 | - | 96 | 14 |
| 3 | 100 | 0 | 100 | 20 |
| 4 | 70 | 0 | 84 | 26 |


### Preprocessing

Note that the features are on different scales and some of them have missing values. You should therefore apply an imputer and a scaler to the features, and a scaler to the labels.
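
One possible way to keep imputing and scaling together is a scikit-learn `Pipeline`. The sketch below is not the approach used in the cells of this notebook. It runs on a tiny made-up frame, and the column names `participation` and `midterm` are hypothetical. The point is that both steps are fitted on the training data only and then reused unchanged on new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy features with missing values; names and numbers are invented
toy_train = pd.DataFrame({"participation": [5.0, 1.25, np.nan, 0.0],
                          "midterm": [26.0, 12.0, 14.0, 20.0]})
toy_test = pd.DataFrame({"participation": [np.nan, 3.0],
                         "midterm": [18.0, 14.0]})

# Impute missing values with the column mean, then standardize
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Statistics (means, standard deviations) come from the training data only
train_scaled = preprocess.fit_transform(toy_train)
test_scaled = preprocess.transform(toy_test)
```

Calling `transform` (not `fit_transform`) on the test frame avoids leaking test statistics into the preprocessing.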

In [274]:
# remove '-'
train_data = train_data.replace(['', '-', ' '], np.nan)  # np.NaN was removed in NumPy 2.0

In [275]:
train_data['Assignment: In-class participation'] = pd.to_numeric(train_data['Assignment: In-class participation'],errors = 'coerce')
train_data['Assignment: Assignment 1'] = pd.to_numeric(train_data['Assignment: Assignment 1'],errors = 'coerce')


In [276]:
labels = train_data[['Course Grade (Real)']]

# Create another DataFrame for the rest of the columns
features = train_data.drop(columns=['Course Grade (Real)'])

### Imputing & Scaling

In [277]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Assuming X is your feature DataFrame and y is your label DataFrame (or array)
X = features  # Features DataFrame
y = labels    # Label DataFrame or Series


# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SimpleImputer and StandardScaler
imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()
label_scaler = StandardScaler()

# Fit the imputer on the training data and transform it
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Fit the scaler on the imputed training data and transform it
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Optionally, convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)


# Fit the label scaler on the training labels and only transform the held-out labels.
# The scaler needs 2D input; ravel() returns the 1D target that the regressor expects.
y_train = label_scaler.fit_transform(y_train.values.reshape(-1, 1)).ravel()
y_test = label_scaler.transform(y_test.values.reshape(-1, 1)).ravel()
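
As an alternative to scaling the labels by hand, scikit-learn provides `TransformedTargetRegressor`, which scales the target before fitting and inverts the scaling on prediction. The sketch below uses synthetic data and a plain `LinearRegression` purely for illustration. It is not the model or the data used in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 70.0 + 10.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=50)

# The wrapper standardizes y before fitting and un-standardizes predictions
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
)
wrapped.fit(X_toy, y_toy)

# Predictions come back on the original grade-like scale
preds_toy = wrapped.predict(X_toy)
```

With this wrapper there is no separate `inverse_transform` step to remember when building the submission.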

### Model

In [278]:
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor

# Initialize and train models
# model = RandomForestRegressor()

model = RandomForestRegressor(
    bootstrap=True,
    criterion='absolute_error',
    max_depth=None,
    max_features='log2',
    max_samples=0.5,
    min_samples_leaf=1,
    min_samples_split=10,
    n_estimators=50,
    random_state=15  # Optional: Set a seed for reproducibility
)
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
# r2_score returns the coefficient of determination, not the mean squared error
r2 = r2_score(y_test, predictions)
print(f"R^2 score: {r2}")


R^2 score: 0.7997082819852053


### Inference
Run your trained model on the `test` split.
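
Because the test split has to go through the same cleaning as the training split, one option is to factor the cleaning into a small function and call it on both frames. The sketch below assumes every column should be numeric. The helper name `clean_grades` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_grades(frame: pd.DataFrame) -> pd.DataFrame:
    """Replace placeholder strings with NaN and coerce every column to numeric."""
    cleaned = frame.replace(["", "-", " "], np.nan)
    # Anything that still cannot be parsed as a number becomes NaN
    return cleaned.apply(pd.to_numeric, errors="coerce")

# Toy frame with a '-' placeholder, invented for illustration
raw = pd.DataFrame({"participation": ["3", "-", "1"], "midterm": [14, 18, 16]})
cleaned = clean_grades(raw)
print(cleaned)
```

The same function can then be applied to both the training and the test frame before imputing and scaling.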


In [279]:
# test_data = pd.read_csv('/kaggle/input/pmldl-week-1-test-competition/test.csv',sep=';')
test_data = pd.read_csv('/home/rizo/inno/dl/test.csv',sep=';')
test_data.head()


|  | Assignment: In-class participation | Assignment: Assignment 1 | Assignment: Midterm |
|---|---|---|---|
| 0 | 3 | 100.0 | 14 |
| 1 | - | 100.0 | 18 |
| 2 | 1 | 100.0 | 16 |
| 3 | 1 | 100.0 | 16 |
| 4 | - | 61.0 | 20 |


In [280]:
# Write your code here - don't forget to apply the same transformation on test data

# remove '-'
test_data = test_data.replace(['', '-', ' '], np.nan)  # np.NaN was removed in NumPy 2.0

test_data['Assignment: In-class participation'] = pd.to_numeric(test_data['Assignment: In-class participation'],errors = 'coerce')
test_data['Assignment: Assignment 1'] = pd.to_numeric(test_data['Assignment: Assignment 1'],errors = 'coerce')


# Apply the imputer that was already fitted on the training data (do not refit)
test_data = imputer.transform(test_data)

# Apply the scaler that was already fitted on the training data (do not refit)
test_data = scaler.transform(test_data)

# Convert back to DataFrame for readability
test_data = pd.DataFrame(test_data, columns=X_test.columns)

preproc_test = test_data

# Pass the DataFrame itself so the feature names match those seen during fit
predictions = model.predict(preproc_test)

### Save model predictions
Save the model predictions to `submission.csv` and submit the file to the competition.

In [281]:
# Reshape predictions to be a 2D array with one column
predictions = predictions.reshape(-1, 1)

preds = pd.DataFrame(label_scaler.inverse_transform(predictions), columns=['Course Grade (Real)'])

# Insert ID column for Kaggle
preds.insert(0, 'ID', range(0, len(preds)))

preds.head(3)

|  | ID | Course Grade (Real) |
|---|---|---|
| 0 | 0 | 76.17 |
| 1 | 1 | 78.69 |
| 2 | 2 | 78.72 |


In [282]:
preds.to_csv('submission.csv', index=False)