In [17]:
from sklearn.ensemble import IsolationForest
import pandas as pd

# Etape 1 - Chargement des données

df = pd.read_csv('./data/employee-earnings-report-2021.csv', encoding='ISO-8859-1')

# Etape 2 - On enlève les lignes vides

df = df.dropna(how='all')

# Etape 3 - Convertir en valeurs numériques Python

numeric_columns = ['REGULAR', 'RETRO', 'OTHER', 'OVERTIME', 'INJURED', 'DETAIL', 'QUINN_EDUCATION_INCENTIVE', 'TOTAL_GROSS']

def clean_numeric_column(column):
    return column.apply(lambda x: str(x).replace(',', '') if isinstance(x, str) else x).astype(float)

for column in numeric_columns:
    df[column] = clean_numeric_column(df[column])

# Etape 4 - Chercher des données aberrantes

model = IsolationForest(contamination=0.05)

df['outlier'] = model.fit_predict(df[numeric_columns])
df['anomaly_score'] = model.decision_function(df[numeric_columns])
top_6_outliers = df.nlargest(6, 'anomaly_score')

df_cleaned = df.drop(top_6_outliers.index).drop(['outlier', 'anomaly_score'], axis=1)

print(df_cleaned)

# Etape 5 -


                             NAME               DEPARTMENT_NAME  \
0                 Beckers,Richard      Boston Police Department   
1           McGowan,Jacqueline M.      Boston Police Department   
2                  Harris,Shawn N      Boston Police Department   
3               Washington,Walter      Boston Police Department   
4               Mosley Jr.,Curtis      Boston Police Department   
...                           ...                           ...   
22541  Bartholomew,Joseph William         BPS Special Education   
22542       Rabouin,Shante Evelin  BPS Substitute Teachers/Nurs   
22543           Francisco,Carla E           Tech Boston Academy   
22544              Ellis,Nicole L          Food & Nutrition Svc   
22545             Smith,Wayne Lee         Boston Public Library   

                                TITLE    REGULAR  RETRO       OTHER  OVERTIME  \
0                      Police Officer        NaN    NaN  1264843.63       NaN   
1                      Police Off

## Questions & Réponses

### Etape 2

> Combien de lignes et de colonnes a le Data Frame ?

`22552` lignes et `12` colonnes.

> Pour chaque colonne, combien y a-t-il de
données manquantes ?

- **NAME**: 6
- **DEPARTMENT_NAME**: 6
- **TITLE**: 6
- **REGULAR**: 644
- **RETRO**: 22,150
- **OTHER**: 8,423
- **OVERTIME**: 15,706
- **INJURED**: 21,096
- **DETAIL**: 20,493
- **QUINN_EDUCATION_INCENTIVE**: 21,166
- **TOTAL_GROSS**: 6
- **POSTAL**: 6

> Y a-t-il des lignes entièrement vides ?

Oui, il y en a 6.

### Etape 3