# AI PROJECT for HumanForYou

|Authors|
|---|
|G. DUBOYS DE LAVIGERIE|
|T. VILETTE|
|O. BOUSSARD|
|A. BRICON|

## Deliverable Objectives

This document is the first deliverable of the AI project for HumanForYou. It analyzes and processes the data supplied by the company in order to understand the factors that influence employee turnover. Its main objectives are:

1. **Data understanding:** Explore and understand the data sources provided by HumanForYou: human-resources data, manager evaluations, quality-of-work-life surveys, and working hours.

2. **Data preprocessing:** Clean the data by eliminating missing values, identifying and handling errors or inconsistencies, and preparing the data for analysis and modeling.

3. **Data exploration:** Run an exploratory analysis to identify the trends, relationships, and significant patterns that may influence employee turnover.

4. **Data visualization:** Use visualization techniques to plot the main characteristics of the data and make the results easier to understand.

5. **Data preparation for modeling:** Select the relevant variables, convert categorical variables to numeric ones, and split the data into training and test sets.

## Expected Outcomes

By the end of this deliverable, the data provided by HumanForYou should be ready for analysis and modeling. We provide a detailed analysis of data quality and of the trends and relationships identified, along with complete documentation of the preprocessing steps.
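
As a rough illustration of the last objective, the sketch below encodes a categorical feature and splits the rows into training and test sets. The toy DataFrame and its values are invented for the example; only the column names echo the HumanForYou data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: values are illustrative only, not taken from the real files.
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29, 44, 36],
    "Department": ["Sales", "R&D", "R&D", "HR", "Sales", "R&D", "HR", "Sales"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes"],
})

# Encode the target as 0/1 and one-hot encode the nominal feature.
y = (toy["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(toy.drop(columns="Attrition"), columns=["Department"])

# Hold out 25% of the rows; stratify keeps the Yes/No ratio similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying is a design choice rather than a requirement; it tends to matter when one class is much rarer than the other.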
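
One possible way to start the data-quality analysis is a per-column summary of types, missing values, and distinct values. The helper name `summarize_quality` and the toy data are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def summarize_quality(df):
    """Return one row per column: dtype, missing-value count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })

# Toy data, illustrative only.
toy = pd.DataFrame({
    "Age": [25, 32, 47, 32],
    "NumCompaniesWorked": [1.0, np.nan, 3.0, 1.0],
    "Gender": ["F", "M", "F", "M"],
})
summary = summarize_quality(toy)
print(summary)
```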

### 1. Environment setup

To make the libraries required by the code available, run `setup.ipynb`. That notebook configures the environment by importing all the essential libraries.

In [None]:
# Python 2 and Python 3 compatibility
from __future__ import division, print_function, unicode_literals

# imports
import numpy as np
import os

# keep the notebook reproducible from one run to the next
np.random.seed(42)

# render figures inline in the notebook
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "workflowDS"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID) # the folder must already exist

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# ignore unhelpful warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

### 2. Data import

In this section we automate data import with a function that performs the following steps:
1. Download the archive containing the files.
2. Extract the files from the archive.

The code below loads the following files:
- `employee_survey_data.csv`
- `general_data.csv`
- `in_time.csv`
- `out_time.csv`
- `manager_survey_data.csv`

We also define functions that use [`pandas`](https://pandas.pydata.org/) to load the data into memory as a [`Pandas DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame).
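
The five loaders differ only in the sub-folder and file name, so a single helper could cover them. The sketch below assumes a hypothetical `load_csv` function and demonstrates it on a temporary directory rather than on the real archive.

```python
import os
import tempfile
import pandas as pd

def load_csv(data_path, subdir, filename):
    """Read <data_path>/<subdir>/<filename> into a DataFrame."""
    # os.path.join picks the right separator on Windows, Linux and macOS.
    return pd.read_csv(os.path.join(data_path, subdir, filename))

# Demonstration on a throwaway directory with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "general"))
    pd.DataFrame({"EmployeeID": [1, 2], "Age": [51, 31]}).to_csv(
        os.path.join(tmp, "general", "general_data.csv"), index=False
    )
    demo = load_csv(tmp, "general", "general_data.csv")

print(demo.shape)
```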

In [None]:
import os
import urllib.request
import zipfile

import pandas as pd
from sklearn.impute import SimpleImputer

pd.set_option('display.max_columns', None)

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/"
REPO_PATH = "AnatholyB1/AI_A4/main/"
DATA_PATH = os.path.join("../datasets", "all")
DATA_URL = DOWNLOAD_ROOT + REPO_PATH + "data.zip"


def fetch_data(data_url=DATA_URL, data_path=DATA_PATH):
    # Download the archive into data_path and extract its contents there
    os.makedirs(data_path, exist_ok=True)
    zip_path = os.path.join(data_path, "data.zip")
    urllib.request.urlretrieve(data_url, zip_path)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(data_path)

fetch_data()

# Each loader builds its path with os.path.join so that it works on any OS
def load_employee_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, "employe", "employee_survey_data.csv")
    return pd.read_csv(csv_path)

def load_general_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, "general", "general_data.csv")
    return pd.read_csv(csv_path)

def load_in_time_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, "in_out_time", "in_time.csv")
    return pd.read_csv(csv_path)

def load_out_time_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, "in_out_time", "out_time.csv")
    return pd.read_csv(csv_path)

def load_manager_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, "manager", "manager_survey_data.csv")
    return pd.read_csv(csv_path)

employee = load_employee_data()
general = load_general_data()
in_time = load_in_time_data()
out_time = load_out_time_data()
manager = load_manager_data()

### 3. Analysis of the datasets

#### 3.1. employee

In [None]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Build a pipeline that fills missing values with the column median
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])


# Apply the pipeline to the whole DataFrame
employee_transformed = numeric_transformer.fit_transform(employee)

# Turn the pipeline output (a NumPy array) back into a DataFrame
employee_df = pd.DataFrame(employee_transformed, columns = employee.columns)
employee_df.info()
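
To make the effect of median imputation concrete, here is a small sketch on invented values. The column names mirror the survey file, but the numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy survey data with one missing answer.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4],
    "JobSatisfaction": [4.0, np.nan, 2.0, 3.0],  # median of 4, 2, 3 is 3
})

imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # medians learned for each column
print(filled)
```

Note that the identifier column goes through the imputer too; that is harmless as long as no identifier is missing, but the column comes back as floats.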

#### 3.2 General

| **Variable**           | **Data type**       | **Missing values**               | **Nature of the variable** | **Unique values**                |
|------------------------|---------------------|----------------------------------|----------------------------|----------------------------------|
| **Age**                | Quantitative        | 0                                | Ordinal                   | 35                               |
| **Attrition**          | Qualitative         | 0                                | Nominal (Boolean)         | 2                                |
| **BusinessTravel**     | Qualitative         | 0                                | Ordinal                   | 3                                |
| **Department**         | Qualitative         | 0                                | Nominal                   | 3                                |
| **DistanceFromHome**   | Quantitative        | 0                                | Ordinal                   | 16                               |
| **Education**          | Quantitative        | 0                                | Ordinal                   | 5                                |
| **EducationField**     | Qualitative         | 0                                | Nominal                   | 6                                |
| **EmployeeCount**      | Quantitative        | 0                                | Ordinal                   | 1                                |
| **EmployeeID**         | Quantitative        | 0                                | Ordinal                   | 4410                             |
| **Gender**             | Qualitative         | 0                                | Nominal (Boolean)         | 2                                |
| **JobLevel**           | Quantitative        | 0                                | Ordinal                   | 5                                |
| **JobRole**            | Qualitative         | 0                                | Nominal                   | 9                                |
| **MaritalStatus**      | Qualitative         | 0                                | Nominal                   | 3                                |
| **MonthlyIncome**      | Quantitative        | 0                                | Ordinal                   | 1349                             |
| **NumCompaniesWorked** | Quantitative        | 19                               | Ordinal                   | 10                               |
| **Over18**             | Qualitative         | 0                                | Nominal                   | 1                                |
| **PercentSalaryHike**  | Quantitative        | 0                                | Ordinal                   | 14                               |
| **StandardHours**      | Quantitative        | 0                                | Ordinal                   | 1                                |
| **StockOptionLevel**   | Quantitative        | 0                                | Ordinal                   | 4                                |
| **TotalWorkingYears**  | Quantitative        | 9                                | Ordinal                   | 29                               |
| **TrainingTimesLastYear** | Quantitative      | 0                                | Ordinal                   | 7                                |
| **YearsAtCompany**     | Quantitative        | 0                                | Ordinal                   | 29                               |
| **YearsSinceLastPromotion** | Quantitative    | 0                                | Ordinal                   | 15                               |
| **YearsWithCurrManager** | Quantitative      | 0                                | Ordinal                   | 16                               |

In [None]:
# Drop the columns that hold a single value and therefore carry no information
general_dr = general.drop(['Over18', 'EmployeeCount', 'StandardHours'], axis=1)

# Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

general_num = general_dr.select_dtypes(include=[np.number]) 
num_attribs = list(general_num)

general_cat_ordinal = ['BusinessTravel']
general_cat_nominal = ['Attrition','Department','EducationField','Gender','JobRole','MaritalStatus']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat_nom", OneHotEncoder(), general_cat_nominal),
    ("cat_ord", OrdinalEncoder(), general_cat_ordinal),
])

# Apply the pipeline to the cleaned DataFrame
general_tr = full_pipeline.fit_transform(general_dr)

# Build a DataFrame from the result
general_prepared = pd.DataFrame(general_tr, columns=full_pipeline.get_feature_names_out())
# Preview without the redundant 'No' column; drop returns a copy, so general_prepared is unchanged
general_prepared.drop(['cat_nom__Attrition_No'], axis=1)
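
One point worth checking with `OrdinalEncoder`: by default it orders categories alphabetically, which may not match the intended ranking. The sketch below compares the default with an explicit order. The three `BusinessTravel` labels are an assumption about the values in `general_data.csv`, not something stated in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed category labels, ordered from least to most travel.
travel_order = ["Non-Travel", "Travel_Rarely", "Travel_Frequently"]

toy = pd.DataFrame({"BusinessTravel": [
    "Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely",
]})

default_codes = OrdinalEncoder().fit_transform(toy).ravel()
ordered_codes = OrdinalEncoder(categories=[travel_order]).fit_transform(toy).ravel()

print(default_codes)  # codes follow the alphabetical order of the labels
print(ordered_codes)  # codes follow travel_order
```

With the default, "Travel_Frequently" receives a lower code than "Travel_Rarely", so passing `categories` explicitly may be preferable when the order matters to the model.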

#### 3.3 In_time

In [None]:

def convert_all_to_datetime(df):
    year_dict = {}
    month_dict = {}
    day_dict = {}

    for column_name in df.columns:
        try:
            df[column_name] = pd.to_datetime(df[column_name])
            year_dict[column_name + '-year'] = df[column_name].dt.year
            month_dict[column_name + '-month'] = df[column_name].dt.month
            day_dict[column_name + '-day'] = df[column_name].dt.day
        except ValueError:
            # Skip columns that cannot be converted to datetime
            pass

    year_df = pd.DataFrame(year_dict)
    month_df = pd.DataFrame(month_dict)
    day_df = pd.DataFrame(day_dict)

    df = pd.concat([df, year_df, month_df, day_df], axis=1)
    return df

in_time = convert_all_to_datetime(in_time)
out_time = convert_all_to_datetime(out_time)

Remove the columns that are not dates, then compute the time worked per day.
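
As a rough illustration of this step, the sketch below turns arrival and departure timestamps into hours worked and then into per-employee statistics. The two toy frames stand in for `in_time` and `out_time`; their dates and times are invented.

```python
import pandas as pd

# One row per employee, one column per day (toy values).
in_toy = pd.DataFrame({
    "2015-01-02": ["2015-01-02 09:00:00", "2015-01-02 10:00:00"],
    "2015-01-05": ["2015-01-05 09:30:00", None],
}).apply(pd.to_datetime)
out_toy = pd.DataFrame({
    "2015-01-02": ["2015-01-02 17:00:00", "2015-01-02 17:30:00"],
    "2015-01-05": ["2015-01-05 18:00:00", None],
}).apply(pd.to_datetime)

# Timedelta per employee and day, converted to hours; absences stay NaN.
hours = (out_toy - in_toy).apply(lambda col: col.dt.total_seconds() / 3600)

# Row-wise statistics skip NaN by default.
stats = pd.DataFrame({
    "mean": hours.mean(axis=1),
    "median": hours.median(axis=1),
    "min": hours.min(axis=1),
    "max": hours.max(axis=1),
})
print(stats)
```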

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler  # used in the join cell below
from sklearn.pipeline import Pipeline


def keep_date_columns(df):
    # Drop the derived -day/-month/-year columns and the first (identifier) column
    df = df.drop(df.filter(regex='-(day|month|year)$').columns, axis=1)
    df = df.drop(df.columns[0], axis=1)
    # Drop the columns where every value is NaT
    return df.dropna(axis=1, how='all')


def compute_work_durations(X, Y):
    # X holds the arrival times, Y the departure times
    X = keep_date_columns(X)
    Y = keep_date_columns(Y)

    # Median time worked on each day (one value per date column)
    median_time = (Y - X).median()

    for column in X.columns:
        # Find where the values are missing
        in_time_nan = X[column].isna()
        out_time_nan = Y[column].isna()

        # If 'in_time' is missing, subtract the median duration from 'out_time'
        X.loc[in_time_nan, column] = Y.loc[in_time_nan, column] - median_time[column]

        # If 'out_time' is missing, add the median duration to 'in_time'
        Y.loc[out_time_nan, column] = X.loc[out_time_nan, column] + median_time[column]

    return Y - X


class PreTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Nothing to learn: the statistics are computed row by row
        return self

    def transform(self, X, y=None):
        # Convert the timedeltas to a number of hours
        hours = X.apply(lambda col: col.dt.total_seconds() / 3600)
        # Summary statistics of the time worked by each employee
        return pd.DataFrame({
            'mean': hours.mean(axis=1),
            'median': hours.median(axis=1),
            'min': hours.min(axis=1),
            'max': hours.max(axis=1),
        })


# Define the pipeline
num_pipeline = Pipeline([
    ('pre_transformer', PreTransformer()),
])

# Apply the pipeline
hourly_time_prepared = num_pipeline.fit_transform(compute_work_durations(in_time, out_time))

# Add 'EmployeeID' so the DataFrames can be joined (rows are assumed to be in the same order as employee_df)
hourly_time_prepared['EmployeeID'] = employee_df['EmployeeID']



#### Join

In [None]:
# First join: employee survey with manager survey
result = pd.merge(employee_df, manager, on='EmployeeID')

# Second join: add general_prepared
result = pd.merge(result, general_prepared, left_on='EmployeeID', right_on='num__EmployeeID')

# Third join: add hourly_time_prepared
result = pd.merge(result, hourly_time_prepared, on='EmployeeID')

# The identifiers carry no predictive information
result = result.drop(['EmployeeID'], axis=1)
result = result.drop(['num__EmployeeID'], axis=1)

# Standardize every column. This includes the one-hot Attrition columns, so the
# target is no longer 0/1: positive values mean "Yes", negative values mean "No"
result_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

result_array = result_transformer.fit_transform(result)
result = pd.DataFrame(result_array, columns = result.columns)


from sklearn.model_selection import train_test_split

# The target to predict is 'cat_nom__Attrition_Yes'
y = result['cat_nom__Attrition_Yes']
X = result.drop(['cat_nom__Attrition_Yes', 'cat_nom__Attrition_No'], axis=1)

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training features and target side by side, for the correlation analysis
train_df = pd.concat([X_train, y_train], axis=1)


# Compute the correlation matrix
corr_matrix = train_df.corr()
# Keep only the strong correlations (|r| >= 0.4)
filtered_corr_matrix = corr_matrix.where((corr_matrix >= 0.4) | (corr_matrix <= -0.4))

# Sort the 'cat_nom__Attrition_Yes' column by absolute value
sorted_corr = corr_matrix['cat_nom__Attrition_Yes'].abs().sort_values(ascending=False)

# Show the features most correlated with 'cat_nom__Attrition_Yes'
sorted_corr

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score


# Create a DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)

def train_model(trainer):
    # Fit on the training set and report the error on the held-out test set
    trainer.fit(X_train, y_train)

    trainer_predictions = trainer.predict(X_test)
    trainer_mse = mean_squared_error(y_test, trainer_predictions)
    trainer_rmse = np.sqrt(trainer_mse)

    print("mse:", trainer_mse)
    print("rmse:", trainer_rmse)

    # 10-fold cross-validation on the training set, so the test set stays untouched
    scores = cross_val_score(trainer, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=10)
    trainer_rmse_scores = np.sqrt(-scores)
    
    print("Scores:", trainer_rmse_scores)
    print("Mean:", trainer_rmse_scores.mean())
    print("Standard deviation:", trainer_rmse_scores.std())


train_model(tree_reg)    








In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression(fit_intercept=False)
train_model(lin_reg)

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(max_depth=15, max_features='log2', n_estimators=26,
                      random_state=42)
train_model(forest_reg)

In [None]:
from sklearn.svm import SVR

svr_reg = SVR(C=9.318742350231167, gamma=0.09849250205191949)
train_model(svr_reg)

In [None]:
import numpy as np


shuffle_index = np.random.permutation(len(X_train))
X_train, y_train = X_train.iloc[shuffle_index], y_train.iloc[shuffle_index]



In [None]:
# Attrition was standardized together with the features, so positive values mean "Yes"
y_train_yes = (y_train > 0)
y_test_yes = (y_test > 0)

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, random_state=42)
sgd_clf.fit(X_train, y_train_yes)

sgd_clf.predict(X_test)


# manual cross-validation
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

X_train_np = X_train.values
y_train_yes_np = y_train_yes.values

for train_index, test_index in skfolds.split(X_train_np, y_train_yes_np):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train_np[train_index]
    y_train_folds = y_train_yes_np[train_index]
    X_test_fold = X_train_np[test_index]
    y_test_fold = y_train_yes_np[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))
# cross-validation with scikit-learn
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_yes,cv=3 ,scoring="accuracy" )    


In [None]:
from sklearn.base import BaseEstimator

class NeverAttritionClassifier(BaseEstimator):
    # Baseline that always predicts "no attrition"
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

In [None]:
never_attrition_clf = NeverAttritionClassifier()
predictions = never_attrition_clf.predict(X_train)
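
A constant baseline like this one shows why accuracy alone can mislead when one class is rare. The sketch below uses scikit-learn's `DummyClassifier` on invented labels; the 2-in-10 ratio is an assumption chosen for the arithmetic, not the real attrition rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 2 leavers out of 10 employees (assumed ratio, for illustration).
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_toy = np.zeros((10, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

accuracy = accuracy_score(y_toy, y_pred)  # share of correct predictions
recall = recall_score(y_toy, y_pred)      # share of leavers detected
print(accuracy, recall)
```

The baseline is right most of the time yet never detects a leaver, which is why the following cells look at precision and recall.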

In [None]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_yes, cv=3)

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix =  confusion_matrix(y_train_yes, y_train_pred)

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_yes, y_train_pred)


In [None]:
recall_score(y_train_yes, y_train_pred)

In [None]:
from sklearn.metrics import f1_score
f1_score(y_train_yes, y_train_pred)
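
For reference, precision, recall, and F1 can all be derived by hand from the four cells of a binary confusion matrix. The labels below are invented for the sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # correct "Yes" predictions among all "Yes" predictions
recall = tp / (tp + fn)     # detected leavers among all true leavers
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```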

In [None]:
y_scores = sgd_clf.decision_function(X_train)

In [None]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_yes, cv=3,
                             method="decision_function")

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_yes, y_scores)


def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b-", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
# Fit the x-axis to the range of decision scores actually observed
plt.xlim([thresholds.min(), thresholds.max()])
plt.show()

In [None]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "k-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.show()

In [None]:
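# Lowest threshold on the precision/recall curve that reaches 90% precision
# (falls back to the highest threshold if 90% is never reached)
idx_90 = min(np.argmax(precisions >= 0.90), len(thresholds) - 1)
threshold_90 = thresholds[idx_90]
y_train_pred_90 = (y_scores >= threshold_90)

precision_score(y_train_yes, y_train_pred_90), recall_score(y_train_yes, y_train_pred_90)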

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_yes, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()

In [None]:
from sklearn.metrics import roc_auc_score

# y_train_yes holds the true binary labels and y_scores the model's decision scores
auc_roc = roc_auc_score(y_train_yes, y_scores)

print("AUC-ROC: ", auc_roc)
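
One way to read the AUC: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on four invented scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Count the positive/negative pairs where the positive gets the higher score.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

auc_toy = roc_auc_score(y_true, scores)
print(pairwise, auc_toy)
```

Three of the four pairs are ranked correctly here, so both values come out at 0.75. The pairwise count as written ignores ties, which would need half credit.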

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

forest_clf = RandomForestClassifier(random_state=42, n_estimators=10)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_yes, cv=3,
                                    method="predict_proba")

In [None]:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Create a RandomForestClassifier instance
clf = RandomForestClassifier()

# Create a LogisticRegression instance
logisticRegr = LogisticRegression()

# Create a VotingClassifier instance
voting_clf = VotingClassifier(
    estimators=[('lr', logisticRegr), ('rf', clf)],
    voting='soft')

y_probas_forest = cross_val_predict(voting_clf, X_train, y_train_yes, cv=5,
                                    method="predict_proba")
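
With `voting='soft'`, the ensemble averages the class probabilities of its fitted members. The sketch below verifies this on synthetic data from `make_classification`; the dataset and the hyperparameters are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ],
    voting="soft",
).fit(X_toy, y_toy)

# Average the fitted members' probabilities by hand.
manual = np.mean([est.predict_proba(X_toy) for est in voting.estimators_], axis=0)
ensemble = voting.predict_proba(X_toy)
print(np.allclose(manual, ensemble))
```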
                             

In [None]:
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_yes,y_scores_forest)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
# y_probas_forest now holds the voting classifier's probabilities (see the previous cell)
plot_roc_curve(fpr_forest, tpr_forest, "Voting (LR + RF)")
plt.legend(loc="lower right", fontsize=16)
plt.show()

In [None]:
auc = roc_auc_score(y_train_yes, y_scores_forest)
print("AUC: ", auc)



In [None]:
y_pred_forest = (y_scores_forest > 0.5)
precision_score(y_train_yes, y_pred_forest)

In [None]:
recall_score(y_train_yes, y_pred_forest)

In [None]:
sgd_clf.fit(X_train, y_train_yes) 
sgd_clf.predict(X_test)

In [None]:
# Decision scores on the test set and the index of the employee with the highest score
test_scores = sgd_clf.decision_function(X_test)
highest_score_index = np.argmax(test_scores)

In [None]:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, random_state=42))
ovo_clf.fit(X_train, y_train_yes)
ovo_clf.predict(X_test)

In [None]:
forest_clf.fit(X_train, y_train_yes)
forest_clf.predict(X_test)

In [None]:
forest_clf.predict_proba(X_test)

In [None]:
cross_val_score(sgd_clf, X_train, y_train_yes, cv=3, scoring="accuracy")

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train_yes, cv=3, scoring="accuracy")

In [None]:
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train_yes, cv=3)
conf_mx = confusion_matrix(y_train_yes, y_train_pred)
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

def plot_confusion_matrix(matrix):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)

plot_confusion_matrix(conf_mx)

In [None]:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

plot_confusion_matrix(norm_conf_mx)
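
The normalization above divides each row by the number of true examples in that class, so the off-diagonal cells become error rates. A worked example on an invented 2x2 matrix:

```python
import numpy as np

# Rows are true classes, columns are predicted classes (toy counts).
conf = np.array([[90, 10],
                 [30, 20]])

row_sums = conf.sum(axis=1, keepdims=True)  # number of true examples per class
rates = conf / row_sums                     # each row now sums to 1

# Zero the diagonal to keep only the error rates.
errors = rates.copy()
np.fill_diagonal(errors, 0)
print(errors)
```

Here 10% of the first class and 60% of the second class are misclassified, a contrast that the raw counts alone make harder to see.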