<a href="https://colab.research.google.com/github/PosgradoMNA/actividades-del-projecto-equipo_98/blob/main/Actividad_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Activity 1 - Week 4**

# **Data cleaning**

In [1]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA



In [2]:
inPath = "https://raw.githubusercontent.com/PosgradoMNA/Actividades_Aprendizaje-/main/default%20of%20credit%20card%20clients.csv"

#**Dataset information:**

This research addressed the case of customers' default payments in Taiwan and **compares the predictive accuracy** of **the probability of default** among **six data mining methods**.

From the perspective of risk management, the predictive accuracy of **the estimated probability of default** **is more valuable** than the binary classification result: **credible or non-credible clients**. Because **the real probability of default** is *unknown*, this study introduced the novel "**Sorting Smoothing Method**" to **estimate** the real probability of default.

With **the real probability of default** as the response variable **(Y)** and **the predicted probability of default** as the independent variable **(X)**, the simple linear regression **(Y = A + BX)** shows that the forecasting model produced by the artificial neural network has **the highest coefficient of determination**; its regression intercept **(A)** is close to zero and its regression coefficient **(B)** is close to one. Therefore, **among the six data mining techniques**, the artificial neural network **is the only one that can accurately estimate** the **real probability of default**.
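The regression check described above can be illustrated with a minimal sketch. The data are synthetic and the names `predicted` and `real` are illustrative assumptions, not values from the study. For a well-calibrated model, the intercept should come out near zero, the slope near one, and the coefficient of determination should be high.

```python
import numpy as np

# Synthetic, illustrative data (not from the study):
# predicted probabilities (X) and noisy "real" probabilities (Y).
rng = np.random.default_rng(0)
predicted = rng.uniform(0.0, 1.0, size=200)
real = np.clip(predicted + rng.normal(0.0, 0.03, size=200), 0.0, 1.0)

# Fit Y = A + B*X by ordinary least squares.
B, A = np.polyfit(predicted, real, deg=1)

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
fitted = A + B * predicted
ss_res = np.sum((real - fitted) ** 2)
ss_tot = np.sum((real - real.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"A = {A:.3f}, B = {B:.3f}, R^2 = {r_squared:.3f}")
```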


##**Dataset description**

This research used a binary variable, default payment **(Yes = 1, No = 0)**, as the response variable. The study reviewed the literature and used the following **23 variables** as explanatory variables:

*   **X1**: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.

*   **X2**: Gender (1 = male; 2 = female).

*   **X3**: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).

*   **X4**: Marital status (1 = married; 2 = single; 3 = others).

*   **X5**: Age (years).

*   **X6 - X11**: History of past payments. We tracked the past monthly payment records (from April to September 2005) as follows:

    *   **X6** = the repayment status in September 2005;
    *   **X7** = the repayment status in August 2005;
    *   **X11** = the repayment status in April 2005.

    The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months or more.

*   **X12-X17**: Amount of bill statement (NT dollar).

    *   **X12** = amount of bill statement in September 2005;
    *   **X13** = amount of bill statement in August 2005;
    *   **X17** = amount of bill statement in April 2005.

*   **X18-X23**: Amount of previous payment (NT dollar).

    *   **X18** = amount paid in September 2005;
    *   **X19** = amount paid in August 2005;
    *   **X23** = amount paid in April 2005.
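The categorical codes listed above can be decoded into readable labels. The sketch below is a minimal illustration: the frame `sample` holds made-up rows, and the dictionary names are hypothetical. Only the code-to-label pairs come from the variable descriptions.

```python
import pandas as pd

# Tiny illustrative frame using the coding described above (not real rows).
sample = pd.DataFrame({"X2": [1, 2, 2], "X3": [1, 2, 4], "X4": [2, 1, 3]})

# Code-to-label dictionaries taken from the variable descriptions.
gender = {1: "male", 2: "female"}
education = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
marital = {1: "married", 2: "single", 3: "others"}

# Replace each numeric code with its label; codes missing from a dictionary become NaN.
decoded = sample.assign(
    X2=sample["X2"].map(gender),
    X3=sample["X3"].map(education),
    X4=sample["X4"].map(marital),
)
print(decoded)
```

Because `map` returns NaN for any code that is not in the dictionary, a decoded column with missing labels is a quick signal that the data contain undocumented codes.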


Dataset (DataFrame):

In [3]:
df = pd.read_csv(inPath, index_col = 0)
df.index.name = None
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,120000,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,90000,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,50000,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,50000,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29996,220000,1.0,3.0,1.0,39.0,0.0,0.0,0.0,0.0,0.0,...,88004.0,31237.0,15980.0,8500.0,20000.0,5003.0,3047.0,5000.0,1000.0,0.0
29997,150000,1.0,3.0,2.0,43.0,-1.0,-1.0,-1.0,-1.0,0.0,...,8979.0,5190.0,0.0,1837.0,3526.0,8998.0,129.0,0.0,0.0,0.0
29998,30000,1.0,2.0,2.0,37.0,4.0,3.0,2.0,-1.0,0.0,...,20878.0,20582.0,19357.0,0.0,0.0,22000.0,4200.0,2000.0,3100.0,1.0
29999,80000,1.0,3.0,1.0,41.0,1.0,-1.0,0.0,0.0,0.0,...,52774.0,11855.0,48944.0,85900.0,3409.0,1178.0,1926.0,52964.0,1804.0,1.0


In [4]:
df = df.dropna()
# Columns to keep: the 23 explanatory variables plus the response Y
target = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10", "X11", "X12", "X13", "X14",
"X15", "X16", "X17", "X18", "X19", "X20", "X21", "X22", "X23", "Y"]
target_df = df[target]
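Before dropping rows, it can be useful to see how many values are actually missing. The following is a minimal sketch on a toy frame (`toy` is an illustrative assumption, not the credit data) showing what `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative, not the credit data).
toy = pd.DataFrame({"X1": [20000, 120000, 90000], "X5": [24.0, np.nan, 34.0]})

# Count missing values per column before dropping anything.
print(toy.isna().sum())

# dropna() removes every row that contains at least one NaN.
clean = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(clean)}")
```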

#**Step 1:**
Determine the minimum number of principal components that account for most of the variation in the data:

In [5]:
df= df.head(10)
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,120000,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,90000,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,50000,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,50000,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0
6,50000,1.0,1.0,2.0,37.0,0.0,0.0,0.0,0.0,0.0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0.0
7,500000,1.0,1.0,2.0,29.0,0.0,0.0,0.0,0.0,0.0,...,542653.0,483003.0,473944.0,55000.0,40000.0,38000.0,20239.0,13750.0,13770.0,0.0
8,100000,2.0,2.0,2.0,23.0,0.0,-1.0,-1.0,0.0,0.0,...,221.0,-159.0,567.0,380.0,601.0,0.0,581.0,1687.0,1542.0,0.0
9,140000,2.0,3.0,1.0,28.0,0.0,0.0,2.0,0.0,0.0,...,12211.0,11793.0,3719.0,3329.0,0.0,432.0,1000.0,1000.0,1000.0,0.0
10,20000,1.0,3.0,2.0,35.0,-2.0,-2.0,-2.0,-2.0,-1.0,...,0.0,13007.0,13912.0,0.0,0.0,0.0,13007.0,1122.0,0.0,0.0


#Min-max normalization of each column:

In [6]:
df = df.copy()
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,120000,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,90000,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,50000,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,50000,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0


In [7]:
df.columns

Index(['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11',
       'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21',
       'X22', 'X23', 'Y'],
      dtype='object')

In [8]:
(
    (df-df.min())/
    (df.max()-df.min()) 
).head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,0.0,1.0,0.5,0.0,0.029412,1.0,1.0,0.25,0.5,0.0,...,0.0,0.000329,0.0,0.0,0.017225,0.0,0.0,0.0,0.0,1.0
2,0.208333,1.0,0.5,1.0,0.088235,0.25,1.0,0.5,1.0,1.0,...,0.00603,0.00748,0.006881,0.0,0.025,0.026316,0.04941,0.0,0.145243,1.0
3,0.145833,1.0,0.5,1.0,0.323529,0.5,0.5,0.5,1.0,1.0,...,0.026409,0.031267,0.032808,0.0276,0.0375,0.026316,0.04941,0.072727,0.363108,0.0
4,0.0625,1.0,0.5,0.0,0.411765,0.5,0.5,0.5,1.0,1.0,...,0.052177,0.060266,0.062343,0.036364,0.050475,0.031579,0.054351,0.077745,0.072622,0.0
5,0.0625,0.0,0.5,0.0,1.0,0.25,0.5,0.25,1.0,1.0,...,0.038588,0.039956,0.040366,0.036364,0.917025,0.263158,0.444686,0.050109,0.04931,0.0


#Another way to normalize, using sklearn

In [9]:
from sklearn import preprocessing
#from sklearn.preprocessing import MinMaxScaler

In [10]:
X = df.X5.to_frame()

scaler = preprocessing.MinMaxScaler().fit(X)
#scaler

In [11]:
# The scaler was already fitted above, so transform() is enough
scaler.transform(X)

array([[0.02941176],
       [0.08823529],
       [0.32352941],
       [0.41176471],
       [1.        ],
       [0.41176471],
       [0.17647059],
       [0.        ],
       [0.14705882],
       [0.35294118]])
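The two approaches above should give the same result, since both apply (x - min) / (max - min) column by column. The sketch below checks this on the ten X5 (age) values shown earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# X5 (age) values of the ten rows shown above.
ages = pd.DataFrame({"X5": [24.0, 26.0, 34.0, 37.0, 57.0, 37.0, 29.0, 23.0, 28.0, 35.0]})

# Manual min-max normalization: (x - min) / (max - min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

# True when both results agree up to floating-point tolerance.
print(np.allclose(manual.to_numpy(), scaled))
```

One difference to keep in mind: for a constant column the manual formula divides zero by zero and yields NaN, whereas `MinMaxScaler` maps the column to 0.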

#**PCA**

In [12]:
df= df.head(10)
df

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,120000,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,90000,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,50000,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,50000,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0
6,50000,1.0,1.0,2.0,37.0,0.0,0.0,0.0,0.0,0.0,...,19394.0,19619.0,20024.0,2500.0,1815.0,657.0,1000.0,1000.0,800.0,0.0
7,500000,1.0,1.0,2.0,29.0,0.0,0.0,0.0,0.0,0.0,...,542653.0,483003.0,473944.0,55000.0,40000.0,38000.0,20239.0,13750.0,13770.0,0.0
8,100000,2.0,2.0,2.0,23.0,0.0,-1.0,-1.0,0.0,0.0,...,221.0,-159.0,567.0,380.0,601.0,0.0,581.0,1687.0,1542.0,0.0
9,140000,2.0,3.0,1.0,28.0,0.0,0.0,2.0,0.0,0.0,...,12211.0,11793.0,3719.0,3329.0,0.0,432.0,1000.0,1000.0,1000.0,0.0
10,20000,1.0,3.0,2.0,35.0,-2.0,-2.0,-2.0,-2.0,-1.0,...,0.0,13007.0,13912.0,0.0,0.0,0.0,13007.0,1122.0,0.0,0.0


In [13]:
df = df.copy()
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
1,20000,2.0,2.0,1.0,24.0,2.0,2.0,-1.0,-1.0,-2.0,...,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1.0
2,120000,2.0,2.0,2.0,26.0,-1.0,2.0,0.0,0.0,0.0,...,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1.0
3,90000,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0.0
4,50000,2.0,2.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0.0
5,50000,1.0,2.0,1.0,57.0,-1.0,0.0,-1.0,0.0,0.0,...,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0.0


In [23]:
# Features: columns X1..X23; target: the default indicator Y (last column)
X = df.iloc[:, 0:23].values
Y = df.iloc[:, 23].values


#  Split the data into training and test sets

In [15]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
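As a quick sanity check of what `test_size=0.2` means on a ten-row sample, the sketch below splits toy arrays (`X_demo` and `y_demo` are illustrative, not the credit data) and prints the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with the same number of rows as the ten-row sample (illustrative values).
X_demo = np.arange(30).reshape(10, 3)
y_demo = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# 20% of 10 rows go to the test set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```

With only two test rows, any accuracy estimate would be very coarse, so the split is more meaningful on the full dataset than on the ten-row sample.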

#Standardize the scales

In [25]:
from sklearn.preprocessing import StandardScaler
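A common way to apply `StandardScaler` is to fit it on the training set only and reuse the learned mean and standard deviation on the test set. The sketch below assumes small illustrative arrays `train` and `test` rather than the notebook's own split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative train/test arrays (not the credit data).
train = np.array([[20000.0, 24.0], [120000.0, 26.0], [90000.0, 34.0], [50000.0, 37.0]])
test = np.array([[50000.0, 57.0]])

# Fit on the training set only, then reuse the same mean/std on the test set.
scaler = StandardScaler()
train_std = scaler.fit_transform(train)
test_std = scaler.transform(test)

# After standardization each training column has mean ~0 and std ~1.
print(train_std.mean(axis=0))
print(train_std.std(axis=0))
```

Fitting on the training data alone keeps information from the test rows out of the preprocessing step.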

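In [None]:
# Standardize the columns and fit PCA before calling transform()
escalada = StandardScaler().fit_transform(df)
pca = PCA().fit(escalada)
transformada = pca.transform(escalada)
print(df.shape)
print(transformada.shape)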
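To pick the minimum number of components asked for in Step 1, one option is to accumulate the explained variance ratios and take the first count that reaches a chosen threshold. The sketch below uses synthetic data built from two latent factors; the 0.95 threshold and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 6 observed columns driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.0, 0.2, -0.5],
                   [0.0, 0.3, -0.7, 1.0, 0.9, 0.6]])
data = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

# Standardize, then fit PCA keeping all components.
pca_demo = PCA().fit(StandardScaler().fit_transform(data))

# Smallest k whose cumulative explained variance reaches the threshold.
threshold = 0.95  # assumed cut-off, adjust as needed
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(3))
print("components needed:", n_components)
```

scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components for that share of variance.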

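In [None]:
# Cumulative explained variance versus number of components (Step 1)
explicada = PCA().fit(StandardScaler().fit_transform(df)).explained_variance_ratio_
plt.plot(range(1, len(explicada) + 1), np.cumsum(explicada), marker='o', label='Cumulative explained variance')
plt.title('PCA')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.grid()
plt.legend(loc='lower right')
plt.show()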