# Rappi ML Engineer challenge
Hi! Thank you for applying for our ML Engineer role. We have a lot of exciting stuff to build and we are glad you chose to apply.
The goal of this challenge is to test your software engineering and machine learning skills.
> ⚠️ We are going to evaluate your **code**, **best practices** and **communication skills**, not the precision of the model. So show us your best coding practices.
## Instructions
We need you to build a Python package with a CLI to train and evaluate an ML model on the famous [Kaggle Titanic competition](https://www.kaggle.com/competitions/titanic/overview).
### Requirements
- Code should be written using _object oriented programming_.
- Package should have a _command-line interface_ (CLI) to interact with the code.
- Code should be tested and a coverage report should be included.
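As a purely illustrative sketch of how the first two requirements can fit together, the snippet below wraps an `argparse` CLI in a class. The class name `TitanicCLI`, the program name `titanic-ml` and the `train`/`evaluate` subcommands are hypothetical; any structure that meets the requirements is fine.

```python
import argparse
from typing import Optional, Sequence


class TitanicCLI:
    """Hypothetical CLI wrapper exposing `train` and `evaluate` subcommands."""

    def __init__(self) -> None:
        # "titanic-ml" is an assumed program name, used only for the help text.
        self.parser = argparse.ArgumentParser(prog="titanic-ml")
        subparsers = self.parser.add_subparsers(dest="command", required=True)

        train = subparsers.add_parser("train", help="fit a model on a CSV file")
        train.add_argument("--data", required=True, help="path to the training CSV")

        evaluate = subparsers.add_parser("evaluate", help="score a fitted model")
        evaluate.add_argument("--data", required=True, help="path to the evaluation CSV")

    def run(self, argv: Optional[Sequence[str]] = None) -> str:
        """Parse `argv` (or sys.argv when None) and describe the requested action."""
        args = self.parser.parse_args(argv)
        # A real package would dispatch to its training / evaluation classes here.
        return f"{args.command}: {args.data}"


# Passing argv explicitly keeps the example independent of the real command line.
message = TitanicCLI().run(["train", "--data", "data/input/train.csv"])
print(message)
```

Keeping the parsing logic inside a class with an explicit `argv` parameter makes the CLI easy to cover with unit tests, which also helps with the coverage requirement.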
### Expected outputs
- A GitHub repository with the package and a `README.md` explaining how to use it.
- A Jupyter notebook with an analysis of the results, including:
  - Your evaluation metrics and an explanation of them: _Is this a good model? A bad one? Why?_
  - The most important features for the ML model: _Are all the features equally important? Which are the most useful ones? Why?_
  - An explanation of how the model could be put into production (please mention the technologies you'd use and MLOps best practices): _What system architecture would you use? How can we automate things? Which cloud services would you use? Why?_
- A coverage report of the tests, with instructions on how to read it.
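For the evaluation-metrics item above, a minimal sketch of the kind of numbers the notebook could report is shown below. It computes accuracy, precision and recall from scratch on made-up labels; the function name and the toy labels are assumptions for illustration only, and a library such as scikit-learn could be used instead.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, not Titanic results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting these on data the model has not seen during training generally gives a more honest answer to "is this a good model?" than a score on the training set.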
### Advice
- Concentrate on the quality of the package more than on the precision of the model. The Titanic problem is really fam

## Part 1 - Data Preprocessing

### Importing libraries

In [1]:
from ml.load.datasets import DatasetLoader
from ml.functions import PandasProfiler, DropPdColumns, FeatureEngine, SetSplit
from ml.preprocess.feature import Encoder, Scalers
from ml.models.svm import SVM

### Importing datasets

In [2]:
train_loader = DatasetLoader('data/input/train.csv')

train_df = train_loader.load()
train_df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


#### The train set has 12 variables and 891 rows; 5 of the 12 columns are numerical and 7 are categorical. In total, 8% of the cells are missing.

#### Several columns have high cardinality: `Name` has 891 distinct values, `Ticket` has 681 and `Cabin` has 147. The `Sex` and `Survived` columns show a high correlation (greater than 0.5) with the rest of the variables.

#### In addition, `Age` and `Cabin` have 19.9% and 77% missing values, respectively.

In [None]:
pp_train = PandasProfiler(train_df, 'Pandas Profiling report of "Train" set')

pp_train.profiler()
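The distinct-value counts and missing-value percentages quoted above come from the profiling report. As a rough cross-check, a minimal sketch with plain pandas can produce the same kind of summary. `column_summary` and the toy frame are hypothetical and not part of the `ml` package.

```python
import pandas as pd


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column distinct-value count and percentage of missing cells."""
    return pd.DataFrame({
        "distinct": df.nunique(),  # NaN is not counted as a value
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


# Toy data standing in for the train set.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, None, 26.0, None],
})
summary = column_summary(toy)
print(summary)
```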

### Let's drop the high-cardinality columns from the train set, along with the ID column, which we don't need.

### Cardinality rank in the train set:
- `Name`: high cardinality, 891 distinct values
- `Ticket`: high cardinality, 681 distinct values
- `Cabin`: high cardinality, 147 distinct values
- `PassengerId`: not a useful column, very high cardinality

In [3]:
cols_to_drop_train = [
    "PassengerId", 
    "Name", 
    "Ticket", 
    "Cabin"
]

drop_train_cols = DropPdColumns(train_df, cols_to_drop_train)

drop_train_cols.drop()

train_df.head()

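|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |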
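The list of columns to drop was picked by hand from the profiling report. A possible alternative, sketched below with plain pandas, selects non-numeric columns whose distinct-value count exceeds a threshold. The helper name `high_cardinality_columns` and the threshold value are assumptions for illustration, not part of the `ml` package.

```python
import pandas as pd


def high_cardinality_columns(df: pd.DataFrame, max_distinct: int) -> list:
    """Names of non-numeric columns with more than `max_distinct` distinct values."""
    counts = df.select_dtypes(exclude="number").nunique()
    return counts[counts > max_distinct].index.tolist()


# Toy data: "Name" is unique per row, "Sex" has only two values.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})
to_drop = high_cardinality_columns(toy, max_distinct=3)
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

An identifier column stored as integers, such as `PassengerId`, would not be caught by this helper and would still need to be listed explicitly.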


### Importing datasets

In [4]:
test_loader = DatasetLoader('data/input/test.csv')

test_df = test_loader.load()
test_df.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |


### The test set has 11 variables and 418 rows; 5 of the 11 columns are numerical and 6 are categorical. In total, 9% of the cells are missing.

### Several columns have high cardinality: `Name` has 418 distinct values, `Ticket` has 363 and `Cabin` has 76.

### In addition, `Age` and `Cabin` have 20.6% and 78% missing values, respectively.

In [None]:
pp_test = PandasProfiler(test_df, 'Pandas Profiling report of "Test" set')

pp_test.profiler()

#### Let's drop the high-cardinality columns from the test set.

#### Cardinality rank in the test set:
- `Name`: high cardinality, 418 distinct values
- `Ticket`: high cardinality, 363 distinct values
- `Cabin`: high cardinality, 76 distinct values

In [5]:
cols_to_drop_test = [ 
    "Name", 
    "Ticket", 
    "Cabin"
]

drop_test_cols = DropPdColumns(test_df, cols_to_drop_test)

drop_test_cols.drop()

test_df.head()

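|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |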


#### Number of nulls in the train set by column:
- `Age` in `train_df`: 177 nulls
- `Embarked` in `train_df`: 2 nulls

#### Number of nulls in the test set by column:
- `Age` in `test_df`: 86 nulls
- `Fare` in `test_df`: 1 null

In [6]:
feat_eng = FeatureEngine()

feat_eng.check_nulls(train_df=train_df)
feat_eng.check_nulls(test_df=test_df)





#### First, we impute the `Age` column with random numbers generated from the mean and standard deviation of the ages, one for each null value. This is applied to both the train and test sets.

In [7]:
data = [train_df, test_df]

feat_eng.nan_inputer(data, "Age")
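One common way to do this kind of imputation is to draw random integers from the interval [mean - std, mean + std). The sketch below assumes that interpretation; it is not the actual `nan_inputer` implementation, and the helper name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_random_in_std_range(df, column, rng):
    """Fill nulls in `column` with random integers from [mean - std, mean + std)."""
    mean = df[column].mean()
    std = df[column].std()
    missing = df[column].isna()
    # One draw per missing value; cast to float to match the column dtype.
    fill = rng.integers(int(mean - std), int(mean + std), size=int(missing.sum()))
    out = df.copy()  # leave the caller's frame untouched
    out.loc[missing, column] = fill.astype(float)
    return out


# Toy ages with two gaps (mean 30.25, std 7.5).
toy = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0, None, 35.0]})
rng = np.random.default_rng(0)  # seeded so the result is reproducible
imputed = impute_random_in_std_range(toy, "Age", rng)
print(imputed)
```

Computing the statistics from each frame and writing the result back into that same frame keeps the train and test sets independent of each other.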

#### Next, we impute the `Embarked` column of the train set with its most common value, the string 'S', which appears 644 times. We use 'S' to fill the 2 missing values.

In [8]:
feat_eng.common_value_inputer(train_df, "Embarked", "S")
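Instead of hard-coding 'S', the most frequent value can be looked up from the data. The snippet below is a minimal pandas sketch of that idea; `impute_most_frequent` is a hypothetical helper, not the `common_value_inputer` method used above.

```python
import pandas as pd


def impute_most_frequent(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `df` with nulls in `column` replaced by its mode."""
    most_frequent = df[column].mode(dropna=True).iloc[0]
    return df.fillna({column: most_frequent})


# Toy data: 'S' is the most common port.
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})
filled = impute_most_frequent(toy, "Embarked")
print(filled["Embarked"].tolist())
```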

### Finally, we impute the missing `Fare` value in the test set with the mean `Fare` of that set.

In [9]:
test_df = feat_eng.mean_inputer(test_df, "Fare")
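Mean imputation follows the same pattern. The sketch below shows it with plain pandas on toy data; `impute_mean` is a hypothetical helper and only an approximation of what `mean_inputer` might do.

```python
import pandas as pd


def impute_mean(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `df` with nulls in `column` replaced by the column mean."""
    return df.fillna({column: df[column].mean()})


# Toy fares with one gap; the mean of the known values is 8.0.
toy = pd.DataFrame({"Fare": [7.0, None, 9.0, 8.0]})
filled = impute_mean(toy, "Fare")
print(filled["Fare"].tolist())
```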

In [10]:
feat_eng.check_nulls(train_df=train_df)
feat_eng.check_nulls(test_df=test_df)





In [11]:
train_df = Encoder(["Sex", "Embarked"]).fit_transform(train_df)
train_df

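|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22 | 1 | 0 | 7.2500 | 2 |
| 1 | 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 3 | 0 | 26 | 0 | 0 | 7.9250 | 2 |
| 3 | 1 | 1 | 0 | 35 | 1 | 0 | 53.1000 | 2 |
| 4 | 0 | 3 | 1 | 35 | 0 | 0 | 8.0500 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 1 | 27 | 0 | 0 | 13.0000 | 2 |
| 887 | 1 | 1 | 0 | 19 | 0 | 0 | 30.0000 | 2 |
| 888 | 0 | 3 | 0 | 29 | 1 | 2 | 23.4500 | 2 |
| 889 | 1 | 1 | 1 | 26 | 0 | 0 | 30.0000 | 0 |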


In [12]:
test_df = Encoder(["Sex", "Embarked"]).fit_transform(test_df)
test_df

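|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 1 | 22 | 0 | 0 | 7.8292 | 1 |
| 1 | 893 | 3 | 0 | 38 | 1 | 0 | 7.0000 | 2 |
| 2 | 894 | 2 | 1 | 26 | 0 | 0 | 9.6875 | 1 |
| 3 | 895 | 3 | 1 | 35 | 0 | 0 | 8.6625 | 2 |
| 4 | 896 | 3 | 0 | 35 | 1 | 1 | 12.2875 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | 1 | 35 | 0 | 0 | 8.0500 | 2 |
| 414 | 1306 | 1 | 0 | 44 | 0 | 0 | 108.9000 | 0 |
| 415 | 1307 | 3 | 1 | 23 | 0 | 0 | 7.2500 | 2 |
| 416 | 1308 | 3 | 1 | 34 | 0 | 0 | 8.0500 | 2 |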


In [13]:
split = SetSplit(train_df, test_df)
split.split(train_col="Survived", test_col="PassengerId")

In [14]:
sc = Scalers()
X_train, X_test = sc.fit_transform(split.X_train, split.X_test)

In [15]:
classifier = SVM()

Y_pred = classifier.fit_predict(X_train, X_test, split.Y_train)

In [16]:
classifier.score(X_train, split.Y_train)

84.06


In [None]:
# Get the feature names
feature_names = list(split.X_train.columns)

classifier.shap(X_train, X_test, feature_names)

In [None]:
import pandas as pd

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred,
})

submission.to_csv('data/output/predictions.csv', index=False)

submission.head()

In [None]:
pip freeze