<hr style="border:0.02in solid gray"> </hr>

<center> <font color= #847ACC> <h1> Atrato JR Data Scientist Challenge <center></h1> </font>
    
<center> <font color= #847ACC> <font size = 4>  Rubén Hernández Guevara <center> </font> 
<br>
<center> Repository: https://github.com/Ruhguevara/Atrato_JR_DS_Challenge
<br>
<br>
<hr style="border:0.02in solid gray"> </hr>


<font color= #847ACC> <h1> Problem: </h1> </font>

**Introduction:** 

In this challenge, you will tackle the task of predicting the probability that a student will pass a grade. As a data scientist, you must choose and apply the best algorithm to build a predictive model. 

**Context:** 

Imagine you are part of a data science team working for an educational institution. The team is tasked with developing a predictive model that can assist in identifying students who are likely to pass or fail the grade. Such a model can provide valuable insights into student performance and help in designing targeted interventions to support struggling students.

**Tasks:**

- Load and explore the dataset
- Visualize the relationships:
    - Bivariate analysis.
    - Correlation matrix.
    - Others
- Normalize or standardize features if necessary.
- Build a predictive model.
- Train the model.
- Assess the model's performance using metrics such as accuracy, confusion matrix, and classification report.
- Interpret the results of the model.
- Communicate conclusions regarding the relationships found.
- Provide actionable recommendations based on the

**Dataset:** 

https://archive.ics.uci.edu/dataset/320/student+performance

- Share a diagram that shows an end-to-end data science project pipeline, from experimentation to a production environment.
- You can use https://www.drawio.com/

**Extra Points:**
- How to integrate DVC in the pipeline.
- How to integrate MLFLOW in the pipeline.

Can you explain and give examples of the following concepts:
- Encapsulation
- Abstraction
- Inheritance
- Polymorphism

___

<h1> <font color= #847ACC> Content </h1>
<a id="IND"></a>

<b> I.  [Exploratory Data Analysis and Data Cleaning](#EDA) </b><br>
- [ ] [Exploration](#EXP)
- [ ] [Missing Values](#MISS)
- [ ] [Duplicates](#DUPLI)
- [ ] [Unique Values](#UNIQ)
- [ ] [Correlation](#CORR)

<b> II. [Statistics](#EST)</b><br>
- [ ] [Measures of Central Tendency & Variability](#MTCV)
- [ ] [Distributions](#DIST)
- [ ] [Class Balance](#BAL)
    
<b> III. [Pre-processing](#PREP)</b><br>
- [ ] [Scaling](#SCAL)
- [ ] [Sampling](#SAMP)

<b> IV. [Models](#MODELOS)</b><br>
- [ ] [Logistic Regression](#RL)
  - [ ] [Metrics](#MRL)
  - [ ] [Decile Analysis](#ADE)
- [ ] [XGBoost](#XGB)
    
<b> V. [Conclusions](#CONC)</b><br>

___

### How to run the notebook:

First, create and activate a virtual environment:

`python -m venv venv` -> `source venv/Scripts/activate` (Windows, Git Bash) or `source venv/bin/activate` (Linux/macOS)

After activating the virtual environment, install the requirements:

`pip install -r requirements.txt`

Then run the following command to start the notebook:

`jupyter notebook`

___

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import plotly.express as px
import plotly.graph_objects as go
from IPython.display import display_html


# Machine Learning
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, roc_curve, confusion_matrix, make_scorer, roc_auc_score


# Additional Libraries
from collections import Counter
from EDA import DataExplorer as de # I made this script
import time

# Configurations
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
matplotlib.style.use('seaborn')

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

In [None]:
start_time = time.time()

<h1> <font color= #847ACC> Exploratory Data Analysis and Cleaning </h1>
<a id="EDA"></a>

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡 Parquet:</font></h3>
    
Storing data as Parquet reduces file size and processing time, and with them the associated costs in money, time, and storage.
    
<center><img src="imgs/parquet.png" width="650" height="200"></center>
    
Source: <br>
<a href="https://parquet.apache.org/docs/"> Apache Parquet Documentation </a> <br>
<a href="https://www.databricks.com/glossary/what-is-parquet"> General Description </a> <br>    

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡</font></h3>

**``The 2.0 version of pandas introduces the Apache Arrow backend, which allows for a more efficient way of storing data in memory:``**

For example:
- int64 $\rightarrow$ int64[pyarrow]
- float64 $\rightarrow$ double[pyarrow]
- string $\rightarrow$ string[pyarrow]
    
For the first part of the project, I will work with the pyarrow backend. 

Other alternatives for faster data reading include `PySpark`, `Polars`, or the more recent `cuDF` from NVIDIA.

In [None]:
%%time
df = pd.read_parquet('data/student-por.parquet', engine = "pyarrow", dtype_backend = "pyarrow")

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡</font></h3>

Loading this data actually takes longer than reading a .csv with the traditional backend. 
This is because the dataset is small; on larger datasets, the benefit of using .parquet and pyarrow is huge!

In [None]:
# Binary target: 1 if the final grade G3 is above 12, otherwise 0
df["approved"] = (df["G3"] > 12).astype(int)

In [None]:
df.head()

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Exploration </h2>
<a id="EXP"></a>

## Qualitative Research

### PISA Study

PISA (Programme for International Student Assessment) is an international study of student performance in
reading, mathematics and science. The last study is from 2018 and there is a new study every three years.
A small fraction of all 15-year-old students is randomly chosen to participate.

- The test has a mean score of 487 for girls and 492 for boys, and Portugal is slightly above the average with 488 for girls and 497 for boys. This suggests that ``sex may be a factor affecting performance`` (based on PISA scores in 2018).

- Socio-economic status explained 17% of the variation in mathematics performance for the
Portuguese participants. This suggests that factors like parents' education and job, address, parental
status, school, family support, going out with friends and the use of alcohol may affect
student performance.

Sources:

https://data.oecd.org/pisa/mathematics-performance-pisa.htm

https://www.oecd.org/pisa/publications/PISA2018_CN_PRT.pdf

In [None]:
df.info()

In [None]:
display_html(df.head(2), df.sample(2), df.tail(2))

In [None]:
Counter(df["approved"])

In [None]:
df_shape_1 = df.shape
print(df_shape_1)

In [None]:
df[df["G2"] == 0]

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Missing Values </h2>
<a id="MISS"></a>

In [None]:
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', '%']).transpose()

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡 Insights:</font></h3>
    
There do not appear to be any missing values. If there were, several techniques are available for data imputation:

- Mean
- Mode
- Models
- Distributions
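
As a minimal sketch of the first two techniques listed above, the snippet below fills missing entries with the column mean (numeric data) or the column mode (categorical data). The example values and the helper names `impute_mean` and `impute_mode` are hypothetical and rely only on the standard library; with pandas the same idea is usually expressed through `fillna`.

```python
from statistics import mean, mode


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with gaps (illustrative only, not taken from the dataset)
absences = [4, None, 2, 6]
school = ["GP", "MS", None, "GP"]

absences_filled = impute_mean(absences)
school_filled = impute_mode(school)
print(absences_filled)
print(school_filled)
```

Mean imputation suits roughly symmetric numeric columns, while the mode is the usual choice for categorical ones; model-based or distribution-based imputation follows the same pattern with a different rule for computing the fill value.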

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="10"></div>](#IND)
<h2> <font color= #847ACC> Duplicates </h2>
<a id="DUPLI"></a>

In [None]:
print("Duplicates: ", df.duplicated().sum())

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡 Insights:</font></h3>
    
There are no duplicate rows.

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Unique Values </h2>
<a id="UNIQ"></a>

In [None]:
for i in df.columns:
    print(f"Unique value in {i}:")
    print(df[i].unique(),'\n')

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Correlation </h2>
<a id="CORR"></a>

In [None]:
categorical_columns = df.select_dtypes(include=['string']).columns
numerical_columns = df.select_dtypes(include=['number']).columns

In [None]:
fig = px.imshow(df[numerical_columns].corr().round(1), text_auto=True, aspect="auto", color_continuous_scale=px.colors.sequential.Blues)
fig.layout.height = 600
fig.layout.width = 1050
fig.update_coloraxes(showscale=False)
fig.update_layout(
    title_text="Correlation Map")
fig.show()

<h2> <font color= #847ACC> Statistics</h2>
<a id="EST"></a>
    
____

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h3> <font color= #847ACC> Measures of Central Tendency & Variability </h3>
<a id="MTCV"></a>

In [None]:
%%time
de.generate_data_description_table(df[numerical_columns])

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Distributions </h2>
<a id="DIST"></a>

In [None]:
sns.distplot(df["age"], bins=8)

In [None]:
# num_columns = len(df[numerical_columns].columns)
# num_plots_per_row = 3
# num_rows = (num_columns + num_plots_per_row - 1) // num_plots_per_row

# fig, axes = plt.subplots(num_rows, num_plots_per_row, figsize=(15, 5*num_rows))

# for i, col in enumerate(df[numerical_columns].columns):
#     ax = axes[i // num_plots_per_row, i % num_plots_per_row]
#     sns.distplot(df[numerical_columns][col], ax=ax)
#     ax.set_title(col)

# plt.tight_layout()
# plt.savefig("imgs/distributions.png")
# plt.show()

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Class Balance </h2>
<a id="BAL"></a>

In [None]:
class_count = df['approved'].value_counts().sort_index()
class_count

In [None]:
%%time
# Class Composition - Pie Plot
fig = go.Figure(data=[go.Pie(labels=['Not approved', 'Approved'], 
                                    values=class_count, 
                                    pull=[0.05, 0], 
                                    opacity=0.85)])
fig.update_layout(
    title_text="Class Composition")
fig.show()

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#847ACC'>💡 Insights:</font></h3>
    
Techniques for handling class imbalance:

- Subsampling
- Oversampling
- Class Weights
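
To make the class-weights option concrete, here is a minimal sketch of the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`; this is the same rule scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the example labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: six "not approved" (0) and two "approved" (1)
y_example = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(y_example)
print(weights)
```

The rarer class receives the larger weight, so its errors count for more in the loss. Subsampling and oversampling pursue the same goal by changing the data instead of the loss.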

[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h1> <font color= #847ACC> Pre-processing </h1>
<a id="PREP"></a>


In [None]:
Counter(df["sex"])

In [None]:
df[categorical_columns]

In [None]:
df_test = df.copy()

In [None]:
labelencoder = LabelEncoder()

for column in df.select_dtypes(include=['object', 'category', 'string']).columns:
    df_test[column] = labelencoder.fit_transform(df_test[column])


In [None]:
df_test

___

<h1> <font color= #847ACC> Models </h1>
<a id="MODELOS"></a>

_"All models are wrong, but some are useful"_ \
— George Box


[<div style="text-align:center"><img src="imgs/arrow.png" width="10" height="5"></div>](#IND)
<h2> <font color= #847ACC> Logistic Regression </h2>
<a id="RL"></a>

<div style="border-radius:14px; border:#847ACC solid; padding: 15px; background-color: #FFFFFF; font-size:100%; text-align:left">
<h3 align="left"><font color='#ECB431'>💡 Justification:</font></h3>
    
- Logistic Regression is a relatively simple algorithm; it transforms a linear input into a probability (in the range of 0-1) using the Sigmoid function:

$$S(x) = \frac{1}{1+e^{-X\beta}}$$
 
Where $X$ is the set of predictor features, and $\beta$ is the corresponding weight vector. Computing $S(x)$ yields a probability indicating whether an observation should be classified as `1` or `0`.
<br>

- It is highly interpretable due to its output (probabilities), making it easier to explain to a non-technical audience compared to other models.

- Computationally, it requires less computational power compared to more complex models such as Neural Networks or Decision Trees.
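
As a small illustration of the formula above, the sketch below computes $S(x)$ for a single observation and applies a 0.5 threshold. The feature values, the weights, and the helper names `sigmoid` and `predict_probability` are hypothetical; no intercept is used, to match the formula as written.

```python
import math


def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, beta):
    """Probability of class 1: sigmoid of the linear combination x . beta."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return sigmoid(z)


# Hypothetical observation and weight vector (illustrative values only)
x_example = [1.0, 2.0]
beta_example = [0.5, -0.25]

p = predict_probability(x_example, beta_example)
label = 1 if p >= 0.5 else 0  # default decision threshold of 0.5
print(p, label)
```

Large positive values of $X\beta$ push the probability toward 1 and large negative values toward 0, which is why the sign and size of each weight can be read directly as the direction and strength of a feature's effect.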
    

In [None]:
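# Features and target as arrays. G3 is dropped from the features because
# `approved` is derived directly from it (keeping it would leak the target).
X = df_test.drop(columns=["G3", "approved"]).values
y = df_test["approved"].values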

In [None]:
# Split the data into training and test sets (35% held out for testing)
x_train1, x_test1, y_train1, y_test1 = train_test_split(X, y, test_size = 0.35, random_state = 42)

In [None]:
# Logistic Regression with class weights (class 1 weighted twice as much as class 0)
LR = LogisticRegression(random_state = 0, C=10, penalty= 'l2', class_weight = {0: 1, 1:2})

In [None]:
LR.fit(x_train1, y_train1) #.predict(X).sum()

In [None]:
# `refit` tells GridSearchCV which metric to use when selecting the final model.
# A wider search space for the class-1 weight lets the model focus more on the positive class (approved students).

def grids(search_space: np.ndarray, opt_metric: str, cv: int):
    """Build a grid search over the class-1 weight, refit on `opt_metric`."""
    grid_ = GridSearchCV(
        estimator = LogisticRegression(max_iter = 500),
        param_grid = {'class_weight': [{0: 1, 1: v} for v in search_space]},
        scoring = {'precision': make_scorer(precision_score), 'recall': make_scorer(recall_score), 'f1': make_scorer(f1_score)},
        refit = opt_metric,
        return_train_score = True,
        cv = cv,
        n_jobs = -1
    )

    return grid_

In [None]:
grid_prec = grids(np.linspace(1, 20, 30), opt_metric = 'precision', cv = 10)

In [None]:
grid_rec = grids(np.linspace(1, 20, 30), opt_metric = 'recall', cv = 10)

In [None]:
grid_f1 = grids(np.linspace(1, 20, 30), opt_metric = 'f1', cv = 10)

In [None]:
%%time
grid_prec.fit(x_train1, y_train1)
grid_prec.best_params_['class_weight']

In [None]:
%%time
# Notice how a model optimized for recall (finding more approved students) assigns a lot of weight to class '1'.
grid_rec.fit(x_train1, y_train1)
grid_rec.best_params_['class_weight']

In [None]:
%%time
grid_f1.fit(x_train1, y_train1)
grid_f1.best_params_['class_weight']

In [None]:
# These are the results of the GridSearch
# Originally, it uses 'mean_test_score' for optimization, with the default score being Accuracy.
# In a highly imbalanced case, it's not a good choice to use it as a reference metric.
# It will be changed to precision, recall, or f1, as appropriate. They are added to the GridSearch.

df = pd.DataFrame(grid_prec.cv_results_)

In [None]:
df[['param_class_weight', 'params', 
    'mean_test_precision', 'mean_train_precision', 
    'mean_test_recall', 'mean_train_recall',
    'mean_test_f1', 'mean_train_f1']]

# These results come from the first fit, optimized for precision, which is why the weights {0: 1, 1: 1} are chosen.
# Notice how precision decreases as the weight of class 1 increases.

In [None]:
plt.figure(figsize = (12, 4))
for score in ['mean_test_recall', 'mean_test_precision', 'mean_test_f1']:
    plt.plot([_[1] for _ in df['param_class_weight']],
             df[score],
             label = score)
plt.legend();

- The class weights are on the 'x' axis, and the 3 scores are on the 'y' axis.
- If you are looking for balance, the crossing point is the optimal weight (f1_score).
- If you are looking to maximize one, its peak point is the goal.
    - For 'precision', it's with a 1:1 weight ratio.
    - For 'recall', it's a higher weight ratio, between 1:15 and 1:20.
    - For 'f1', it's with a 1:1 weight ratio.
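
The trade-off described above can be reproduced by hand. The sketch below computes precision, recall, and F1 for two hypothetical sets of predictions: a "cautious" model that rarely predicts class 1 (as with a low class-1 weight) and an "eager" one that predicts it often (as with a high class-1 weight). The labels, predictions, and the helper name `precision_recall_f1` are illustrative assumptions, not output from the grid search.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical ground truth and two sets of predictions
y_true_example = [1, 1, 1, 0, 0, 0]
cautious = [1, 0, 0, 0, 0, 0]  # few positives predicted
eager = [1, 1, 1, 1, 1, 0]     # many positives predicted

p_c, r_c, f_c = precision_recall_f1(y_true_example, cautious)
p_e, r_e, f_e = precision_recall_f1(y_true_example, eager)
print(p_c, r_c, f_c)
print(p_e, r_e, f_e)
```

The cautious predictions never raise a false alarm but miss most positives, while the eager ones catch every positive at the cost of false positives. F1 is the harmonic mean of precision and recall, so it rewards the setting where neither collapses.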

In [None]:
# Unlike .predict, .predict_proba returns the probabilities without the threshold criterion.
probs = grid_f1.predict_proba(x_test1)

In [None]:
# There is an array with the probabilities for class 0 and 1, where these are complementary.
# We will work with the probabilities of class 1.
print(probs.shape)
probs_1 = probs[:, 1]

In [None]:
fpr, tpr, thresholds = roc_curve(y_test1, probs_1)

In [None]:
# Plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')  # Plot the random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()
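
The curve plotted above is built by sweeping the decision threshold over the predicted probabilities and recording the false positive rate and true positive rate at each step. A minimal standard-library sketch of that procedure, with the area under the curve computed by the trapezoidal rule, is shown below; the labels, scores, and the helper names `roc_points` and `auc_trapezoid` are hypothetical, and `roc_curve` remains the function used in this notebook.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        preds = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and predicted probabilities for class 1
y_true_example = [0, 0, 1, 1]
scores_example = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true_example, scores_example)
area = auc_trapezoid(points)
print(points)
print(area)
```

An area of 0.5 corresponds to the dashed random-guess diagonal, and values closer to 1 mean the model ranks positives above negatives more consistently.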

In [None]:
def print_score(label, prediction, train=True):
    # Print the classification report and confusion matrix for one data split
    split = "Train" if train else "Test"
    clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
    print(f"{split} Result:\n==========================================================================")
    print("__________________________________________________________________________")
    print(f"Classification Report:\n{clf_report}")
    print("__________________________________________________________________________")
    print(f"Confusion Matrix: \n {confusion_matrix(label, prediction)}\n")

In [None]:
# Confusion Matrix Function
def CM(y_test, y_pred):
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    
    names = ['True Neg','False Pos','False Neg','True Pos']
    counts = [value for value in cm.flatten()]
    percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(names, counts, percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    
    sns.heatmap(cm, annot = labels, cmap = 'Blues', fmt ='')

<h3> <font color= #847ACC> Case 1- Optimizing Precision </h3>

In [None]:
prec_train_pred = grid_prec.predict(x_train1)
prec_test_pred = grid_prec.predict(x_test1)

In [None]:
print_score(y_train1, prec_train_pred, train=True)
print_score(y_test1, prec_test_pred, train=False)

In [None]:
CM(y_test1, prec_test_pred)

In [None]:
from xgboost import XGBClassifier
import xgboost as xgb

In [None]:
# eval_metric is set on the estimator; recent xgboost versions no longer accept it in fit()
xgb_clf = XGBClassifier(eval_metric='aucpr')
xgb_clf.fit(x_train1, y_train1)

In [None]:
y_train_pred = xgb_clf.predict(x_train1)
y_test_pred = xgb_clf.predict(x_test1)

In [None]:
probs_xgb = xgb_clf.predict_proba(x_test1)

In [None]:
print_score(y_train1, y_train_pred, train=True)
print_score(y_test1, y_test_pred, train=False)

In [None]:
CM(y_test1, y_test_pred)

In [None]:
end_time = time.time()
total_time = end_time - start_time
print(f"Total runtime: {total_time:.2f} seconds")