### Competition Overview
AI Real AI Camp is an internal datathon hosted by the School of AI Algiers (SOAI) community at the Higher National School of Computer Science (ESI, ex-INI) in Algiers, held on 22-23/02/2024.
### Description
This is a regression problem: the goal is to predict a Codeforces user's rating from their performance in previously recorded competitive programming contests.  
[**Link to the competition**](https://www.kaggle.com/competitions/codeforces-user-rating-prediction/overview)

#### Loading training and target data

In [2]:
import pandas as pd

train_data = pd.read_csv('./codeforces-user-rating-prediction/X_train.csv')
target_data = pd.read_csv('./codeforces-user-rating-prediction/y_train.csv')

#### Sample visualization

In [3]:
train_data.head()

Unnamed: 0,id,contest1,contest2,contest3,contest4,contest5,contest6,contest7,contest8,contest9,contest10
0,5316558161,2186,2283,2121,2030.0,1937.0,1943.0,1805.0,1776.0,1635.0,1541.0
1,8188371053,2143,2254,2170,2185.0,2083.0,2105.0,2073.0,1982.0,2057.0,1966.0
2,6014349269,2487,2315,2376,2348.0,2356.0,2275.0,2205.0,2183.0,2122.0,2188.0
3,5969753904,2252,2221,2220,2290.0,2272.0,2265.0,2223.0,2329.0,2262.0,2269.0
4,9129361700,2137,2091,2107,2147.0,2047.0,2035.0,2118.0,2105.0,2133.0,2109.0


In [4]:
target_data.head()

Unnamed: 0,id,rating
0,5316558161,2283
1,8188371053,2254
2,6014349269,2487
3,5969753904,2345
4,9129361700,2386
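
Because the features and the target live in separate files, it can be worth confirming that both list the same ids in the same order before the id column is discarded. The sketch below is a hedged illustration on toy frames built from the first rows previewed above (only two contest columns are kept); `check_alignment` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy frames copied from the first three previewed rows (two contest columns only).
X = pd.DataFrame({
    "id": [5316558161, 8188371053, 6014349269],
    "contest1": [2186, 2143, 2487],
    "contest2": [2283, 2254, 2315],
})
y = pd.DataFrame({
    "id": [5316558161, 8188371053, 6014349269],
    "rating": [2283, 2254, 2487],
})


def check_alignment(features: pd.DataFrame, target: pd.DataFrame) -> bool:
    """Hypothetical helper: True when both frames list the same ids in the same order."""
    if len(features) != len(target):
        return False
    return bool((features["id"].values == target["id"].values).all())


ids_aligned = check_alignment(X, y)
missing_per_column = X.isna().sum()  # number of missing values in each column

print("ids aligned:", ids_aligned)
print(missing_per_column)
```

On the real data the same two checks could be run on `train_data` and `target_data`; the missing-value counts also indicate how much work the imputation step has to do.
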


#### Dropping the id column

In [20]:
y_train = target_data.drop(['id'],axis=1)
X_train = train_data.drop(['id'],axis=1)

#### Pipelines for data cleaning
* For numerical columns:
  - Replacing missing values (with the most frequent value of each column)
  - Normalizing data (min-max scaling)
* No categorical columns in this dataset

In [38]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# The dataset has no categorical columns, so one numerical transformer is enough
numerical_columns = X_train.select_dtypes(include=['number']).columns.tolist()

# Fill missing values with each column's most frequent value, then scale to [0, 1]
numerical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('normalizer', MinMaxScaler())
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
    ])
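
To make the effect of the two preprocessing steps concrete, here is a minimal sketch that applies the same imputer and scaler to a single toy column. The values are made up for illustration and are not taken from the competition data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# One toy "contest" column with a missing entry (illustrative values only).
toy = np.array([[1000.0], [2000.0], [np.nan], [2000.0]])

toy_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("normalizer", MinMaxScaler()),
    ]
)

# The NaN is replaced by 2000 (the most frequent value), then the column is
# rescaled so that its minimum maps to 0 and its maximum maps to 1.
toy_scaled = toy_transformer.fit_transform(toy)
print(toy_scaled.ravel())  # [0. 1. 1. 1.]
```

Tree-based models such as random forests are generally insensitive to this kind of monotonic rescaling, so the normalizer is likely to matter more if the same preprocessor is reused with a scale-sensitive model.
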


#### Model and Tuning
* Random forest performed well on this dataset (much better than XGBoost), with ```negative mean squared error``` as the scoring metric
* Used a grid search (```GridSearchCV```) to tune the model's hyperparameters (Optuna is also an option)
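
An exhaustive grid search fits every combination, which grows quickly as values are added. As a hedged alternative sketch (not the method used in this notebook), scikit-learn's `RandomizedSearchCV` samples a fixed number of combinations instead. The data below is synthetic and the grid is deliberately small; `X_demo`, `y_demo` and the chosen values are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 60 "users" with 4 contest ratings each (assumption).
rng = np.random.default_rng(0)
X_demo = rng.uniform(1500, 2500, size=(60, 4))
y_demo = X_demo.max(axis=1) + rng.normal(0, 10, size=60)

# 3 x 3 x 2 = 18 possible combinations; only n_iter of them are evaluated.
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,                          # number of sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,                    # makes the sampling reproducible
)
random_search.fit(X_demo, y_demo)

print("Sampled candidates:", len(random_search.cv_results_["params"]))
print("Best parameters:", random_search.best_params_)
```

With a pipeline, the parameter names would carry the step prefix (for example `regressor__n_estimators`), exactly as in the grid used in this notebook.
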

In [39]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

regressor = RandomForestRegressor()

random_forest_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', regressor)
])

# 5 x 9 x 2 = 90 candidate combinations; the 'regressor__' prefix targets the pipeline step
param_grid = {
    'regressor__n_estimators': [10,20,50,60,70],
    'regressor__max_depth': [None,5,10,15,20,25,30,35,40],
    'regressor__min_samples_split': [2,5]
}

grid_search = GridSearchCV(random_forest_pipeline, param_grid, cv=5, verbose=3, n_jobs=-1, scoring='neg_mean_squared_error')

# Flatten the single-column target frame to a 1-D array before fitting
grid_search.fit(X_train, y_train.values.ravel())


print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

Fitting 5 folds for each of 90 candidates, totalling 450 fits
Best Parameters: {'regressor__max_depth': 10, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 20}
Best Score: -3558.4571198431804
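
The best score is negative because scikit-learn maximizes scores, so the mean squared error is reported with its sign flipped. A small sketch of converting the printed value into a root mean squared error, which is expressed in rating points:

```python
import math

best_score = -3558.4571198431804  # "Best Score" printed by the grid search above

best_mse = -best_score            # flip the sign to recover the mean squared error
best_rmse = math.sqrt(best_mse)   # back to the unit of the target (rating points)

print(f"CV MSE:  {best_mse:.2f}")
print(f"CV RMSE: {best_rmse:.2f}")  # about 59.65
```

Note that this is the square root of the MSE averaged over the 5 folds, which is not necessarily identical to the average of the per-fold RMSE values.
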


#### Predict and save

In [40]:
model = grid_search.best_estimator_
test_data = pd.read_csv('./codeforces-user-rating-prediction/X_test.csv')
test_data_without_ids = test_data.drop(['id'],axis=1)
y_predict = model.predict(test_data_without_ids)
y_predict

array([2130.50727451, 2092.52963925, 2345.23953665, 2530.08041667,
       2197.03158037, 2593.93291667, 2353.34416209, 2326.72405079,
       2021.89190069, 2303.43341583, 2207.54241156, 2105.65563723,
       2187.82081349, 2425.8365518 , 2127.76266912, 2960.79333333,
       2121.25778519, 2051.32438606, 2212.52074634, 2526.51666667,
       2173.97309803, 2448.7       , 2173.67108942, 2110.25981511,
       2816.95388889, 2222.51094689, 2222.98322549, 2497.11375   ,
       2140.12139055, 2059.45694249, 2483.13333333, 2569.975     ,
       2430.43562692, 2089.63557539, 2459.86702815, 2475.76935574,
       2296.04580999, 2843.01458333, 2266.64812657, 2352.14013889,
       2278.5703869 , 2305.52096753, 2222.82102953, 2416.04202742,
       2873.54166667, 2052.60579823, 2591.52029609, 2083.48300314,
       2773.79875   , 2504.84708333, 2113.03604468, 2285.74118358,
       2087.68138702, 2532.61479167, 2277.39229167, 2083.2940594 ,
       2030.86362563, 2117.07666933, 2284.66330437, 2347.87058

In [37]:
df_submission = pd.DataFrame({
    'id': test_data['id'],
    'rating': y_predict
})
df_submission.to_csv('submission.csv', index=False)
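
Before uploading, it can help to read the file back and confirm its shape. The sketch below is a hedged illustration on made-up ids and predictions written to a temporary directory; the expected `id` and `rating` columns mirror the frame built above, and the exact format required by the competition is an assumption.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for test_data['id'] and y_predict.
demo_ids = pd.Series([101, 102, 103], name="id")
demo_predictions = np.array([2130.5, 2092.5, 2345.2])

demo_submission = pd.DataFrame({"id": demo_ids, "rating": demo_predictions})

# Write to a temporary location and read the file back, as a grader would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    demo_submission.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

checks = {
    "columns": list(reloaded.columns) == ["id", "rating"],
    "row_count": len(reloaded) == len(demo_ids),
    "no_missing": not reloaded["rating"].isna().any(),
    "unique_ids": bool(reloaded["id"].is_unique),
}
print(checks)
```

Passing `index=False` is what keeps the row index out of the file, so the reloaded frame has exactly the two expected columns.
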