Apply CRISP-DM and prepare presentation material covering the process up to the model evaluation stage.

About the Dataset
Bank Marketing

Summary:
The data relate to direct marketing campaigns (phone calls) run by a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).

Dataset Information:
The data relate to direct marketing campaigns run by a Portuguese banking institution. The campaigns were based on phone calls. Often more than one contact with the same client was needed to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

Source:

	Dataset from: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#


# Imports

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from collections import Counter
from boruta import BorutaPy
from scipy import stats

# Data

In [44]:
# Load the available data
df_bank_full = pd.read_csv(r'C:\Users\kawda\OneDrive\Desktop\Data_Science\Semana_7\Tarefa\Produto_Bancario\bank-additional-full.csv', sep=';')
df_bank_full.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


# Identify and separate the target, then make the train/test split

In [45]:
# Inspect the data and set the target column aside
lista_spec = ['y']
abt_00 = df_bank_full.drop(columns=lista_spec)

# Split the data into train and test sets
abt_01, abt_test = train_test_split(abt_00, test_size=0.3, random_state=88)

abt_01.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
21740,31,technician,single,university.degree,no,no,yes,cellular,aug,tue,60,6,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1
1321,35,blue-collar,married,basic.9y,unknown,no,no,telephone,may,thu,170,1,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0
25125,31,blue-collar,married,basic.9y,no,yes,no,cellular,nov,tue,60,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195.8
3159,44,blue-collar,married,basic.4y,unknown,no,no,telephone,may,thu,139,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0
35424,41,admin.,single,high.school,no,no,no,cellular,may,mon,722,4,999,0,nonexistent,-1.8,92.893,-46.2,1.244,5099.1
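
The split above uses plain random sampling. If the `yes` class turns out to be rare, a stratified split keeps its proportion the same in both partitions. The following is a minimal sketch on toy data: the frame `toy` and its 10% positive rate are made up for illustration. In the notebook, the equivalent change would presumably be passing `stratify=df_bank_full['y']` to the existing `train_test_split` call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_bank_full: 1000 rows, 10% positives (made-up numbers)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "y": ["yes"] * 100 + ["no"] * 900,
})

features = toy.drop(columns=["y"])
target = toy["y"]

# stratify=target keeps the yes/no proportion the same in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=88, stratify=target
)

train_rate = (y_train == "yes").mean()
test_rate = (y_test == "yes").mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```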


## List of Variables and Their Descriptions

### Bank Client Data
- **age**: Client's age.  
- **job**: Client's type of job.  
- **marital**: Client's marital status.  
- **education**: Client's level of education.  
- **default**: Indicates whether the client has credit in default.  
- **housing**: Indicates whether the client has a housing loan.  
- **loan**: Indicates whether the client has a personal loan.  

### Related to the Last Contact of the Current Campaign
- **contact**: Type of communication used to contact the client.  
- **month**: Month of the last contact with the client.  
- **day_of_week**: Day of the week of the last contact with the client.  
- **duration**: Duration of the last contact, in seconds.  

### Other Attributes
- **campaign**: Number of contacts made with this client during the current campaign.  
- **pdays**: Number of days since the client was last contacted in a previous campaign (999 indicates no previous contact, per the UCI dataset description).  
- **previous**: Number of contacts made with this client before the current campaign.  
- **poutcome**: Outcome of the previous marketing campaign.  
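
In the UCI description of this dataset, `pdays = 999` marks clients who were not contacted in a previous campaign, which is consistent with the `999` / `nonexistent` rows in the previews above. The following is a minimal sketch, assuming that convention, of separating the sentinel from real day counts before any scaling. The column names `was_contacted_before` and `pdays_clean` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the pdays column (values are made up)
toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

# Assumption: 999 is a sentinel meaning "no previous contact"
SENTINEL = 999

# Binary flag that keeps the information carried by the sentinel
toy["was_contacted_before"] = (toy["pdays"] != SENTINEL).astype(int)

# Real day counts only; sentinel rows become NaN so they do not distort mean/std
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != SENTINEL, np.nan)

print(toy)
```

Whether to impute, bin, or drop `pdays_clean` afterwards is a modeling decision this sketch leaves open.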

### Social and Economic Context Attributes
- **emp.var.rate**: Employment variation rate (quarterly indicator).  
- **cons.price.idx**: Consumer price index (monthly indicator).  
- **cons.conf.idx**: Consumer confidence index (monthly indicator).  
- **euribor3m**: Euribor 3-month rate (daily indicator).  
- **nr.employed**: Number of employees (quarterly indicator).  

### Output Variable (Target)
- **y**: Indicates whether the client subscribed to a term deposit.  


# Data Preparation

## Standard initial treatment (high percentage of nulls, constant variables, missing values)

In [46]:
def get_metadata(dataframe):
	# Collect basic metadata
	metadata = pd.DataFrame({
		'nome_variavel': dataframe.columns,
		'tipo': dataframe.dtypes,
		'qt_nulos': dataframe.isnull().sum(),
		'percent_nulos': round((dataframe.isnull().sum() / len(dataframe)) * 100, 2),
		'cardinalidade': dataframe.nunique(),
	})

	# Helper that tests normality with the D'Agostino-Pearson test
	def test_normality(series, alpha=0.05):
		if series.dtype in ["float64", "int64", "int32"]:
			statistic, p_value = stats.normaltest(series.dropna())  # dropping NA values for the test
			return p_value > alpha
		else:
			return None  # Return None for non-numeric data types

	# Apply the normality test to every column
	metadata["fl_normal"] = dataframe.apply(test_normality)

	metadata = metadata.sort_values(by='percent_nulos', ascending=False)
	metadata = metadata.reset_index(drop=True)

	return metadata

# Apply the function to the dataframe
metadados = get_metadata(abt_01)
metadados

Unnamed: 0,nome_variavel,tipo,qt_nulos,percent_nulos,cardinalidade,fl_normal
0,age,int64,0,0.0,76,False
1,job,object,0,0.0,12,
2,euribor3m,float64,0,0.0,305,False
3,cons.conf.idx,float64,0,0.0,26,False
4,cons.price.idx,float64,0,0.0,26,False
5,emp.var.rate,float64,0,0.0,10,False
6,poutcome,object,0,0.0,3,
7,previous,int64,0,0.0,8,False
8,pdays,int64,0,0.0,26,False
9,campaign,int64,0,0.0,39,False


In [47]:
def preprocess_dataframe(df):
	# Drop columns with >80% missing values
	total_count = len(df)
	columns_to_drop = [col for col in df.columns if df[col].isnull().sum() / total_count > 0.8]
	df = df.drop(columns=columns_to_drop)
	
	# Replace missing values
	for col_name in df.columns:
		data_type = df[col_name].dtype
		
		if np.issubdtype(data_type, np.number):
			mean_value = df[col_name].mean()
			df[col_name] = df[col_name].fillna(mean_value)
		elif data_type == object:
			df[col_name] = df[col_name].fillna("Desconhecido")
	
	# Drop numeric columns with zero variance (constants)
	numeric_columns = df.select_dtypes(include=[np.number]).columns
	variances = df[numeric_columns].var()
	columns_to_drop = variances[variances == 0].index.tolist()
	df = df.drop(columns=columns_to_drop)
	
	return df

# Apply the function to the dataframe
abt_02 = preprocess_dataframe(abt_01)
abt_02.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
21740,31,technician,single,university.degree,no,no,yes,cellular,aug,tue,60,6,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1
1321,35,blue-collar,married,basic.9y,unknown,no,no,telephone,may,thu,170,1,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0
25125,31,blue-collar,married,basic.9y,no,yes,no,cellular,nov,tue,60,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195.8
3159,44,blue-collar,married,basic.4y,unknown,no,no,telephone,may,thu,139,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0
35424,41,admin.,single,high.school,no,no,no,cellular,may,mon,722,4,999,0,nonexistent,-1.8,92.893,-46.2,1.244,5099.1
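
`preprocess_dataframe` computes the fill values from whichever frame it receives, so calling it separately on `abt_test` would fill gaps with test-set means. The following is a minimal sketch, not the notebook's method, of learning the fill value once on the training partition with scikit-learn's `SimpleImputer` and reusing it. The toy frames and their values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frames standing in for the train and test partitions (values are made up)
train = pd.DataFrame({"campaign": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"campaign": [np.nan, 10.0]})

# Learn the fill value (the training mean) once...
imputer = SimpleImputer(strategy="mean").fit(train)

# ...and reuse it for both partitions
train_filled = pd.DataFrame(imputer.transform(train), columns=train.columns, index=train.index)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)
print(test_filled)
```

In this dataset the metadata table reports no nulls, so the point only matters if missing values appear in new data.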


# Variable processing

## Numeric variable treatment (standardization)

In [48]:
def normalize_dataframe(df):
	# Work on a copy so the caller's DataFrame (abt_02) is not modified in place
	df = df.copy()

	# Instantiate the scaler
	scaler = StandardScaler()

	# Select the numeric columns
	numeric_cols = df.select_dtypes(include=['float64', 'int64', 'int32']).columns

	# Apply standardization (zero mean, unit variance per column)
	df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

	return df

# Apply the function to the dataframe
abt_03 = normalize_dataframe(abt_02)

# Show the result
abt_03.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
21740,-0.866347,technician,single,university.degree,no,no,yes,cellular,aug,tue,-0.762692,1.265698,0.195464,-0.347517,nonexistent,0.835936,-0.233955,0.950295,0.770424,0.84293
1321,-0.483082,blue-collar,married,basic.9y,unknown,no,no,telephone,may,thu,-0.340152,-0.571768,0.195464,-0.347517,nonexistent,0.64468,0.719336,0.88534,0.708122,0.329971
25125,-0.866347,blue-collar,married,basic.9y,no,yes,no,cellular,nov,tue,-0.762692,-0.204275,0.195464,-0.347517,nonexistent,-0.120341,-0.656869,-0.327147,0.303159,0.396338
3159,0.379263,blue-collar,married,basic.4y,unknown,no,no,telephone,may,thu,-0.459232,-0.204275,0.195464,-0.347517,nonexistent,0.64468,0.719336,0.88534,0.711006,0.329971
35424,0.091815,admin.,single,high.school,no,no,no,cellular,may,mon,1.780227,0.530712,0.195464,-0.347517,nonexistent,-1.204122,-1.188979,-1.236513,-1.374957,-0.940675
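
`normalize_dataframe` fits the scaler inside the function and discards it, so the held-out `abt_test` cannot later be transformed with the training mean and standard deviation. The following is a minimal sketch of one way to keep the fitted scaler. `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of the notebook, and the toy frames are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def fit_scaler(train_df):
    """Fit a StandardScaler on the numeric columns of the training frame."""
    numeric_cols = train_df.select_dtypes(include="number").columns
    scaler = StandardScaler().fit(train_df[numeric_cols])
    return scaler, numeric_cols

def apply_scaler(df, scaler, numeric_cols):
    """Return a scaled copy of df using statistics learned on the training data."""
    out = df.copy()
    out[numeric_cols] = scaler.transform(out[numeric_cols])
    return out

# Toy frames standing in for abt_02 (train) and abt_test (values are made up)
train = pd.DataFrame({"age": [20.0, 30.0, 40.0], "job": ["a", "b", "a"]})
test = pd.DataFrame({"age": [30.0, 50.0], "job": ["b", "a"]})

scaler, cols = fit_scaler(train)
train_scaled = apply_scaler(train, scaler, cols)
test_scaled = apply_scaler(test, scaler, cols)
print(test_scaled)
```

The test rows are centered on the training mean rather than their own, which is what keeps the evaluation free of information from the test partition.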


## Categorical variable treatment

### Low cardinality (dummy variables)

In [49]:
def apply_onehot_encoding(df, metadata, card_cutoff=5):
	# Select the low-cardinality categorical variables
	df_categ_onehot = metadata[(metadata['cardinalidade'] <= card_cutoff) & (metadata['tipo'] == 'object')]
	lista_onehot = list(df_categ_onehot.nome_variavel.values)
	print('Variables for one-hot encoding: ', lista_onehot)

	# Instantiate the encoder (drop='first' avoids one redundant dummy per variable)
	encoder = OneHotEncoder(drop='first', sparse_output=False)

	# Apply the one-hot encoding
	encoded_data = encoder.fit_transform(df[lista_onehot])

	# Build a DataFrame with the encoded columns, keeping the original index
	encoded_cols = encoder.get_feature_names_out(lista_onehot)
	encoded_df = pd.DataFrame(encoded_data, columns=encoded_cols, index=df.index)

	# Replace the original categorical columns with the encoded ones
	df = pd.concat([df.drop(lista_onehot, axis=1), encoded_df], axis=1)

	return df

# Apply the function to the dataframe
abt_04 = apply_onehot_encoding(abt_03, metadados)

Variables for one-hot encoding:  ['poutcome', 'day_of_week', 'contact', 'loan', 'housing', 'default', 'marital']


### High cardinality (label encoding)

In [50]:
def apply_label_encoding(df, metadata, card_cutoff=5):
	# Work on a copy so the caller's DataFrame (abt_04) is not modified in place
	df = df.copy()

	# Select the high-cardinality categorical variables
	df_categ_labelenc = metadata[(metadata['cardinalidade'] > card_cutoff) & (metadata['tipo'] == 'object')]
	lista_lenc = list(df_categ_labelenc.nome_variavel.values)
	print('Variables for label encoding: ', lista_lenc)

	# Apply a separate LabelEncoder to each selected column
	for col in lista_lenc:
		encoder = LabelEncoder()
		df[col] = encoder.fit_transform(df[col])

	return df

# Apply the function to the dataframe
abt_05 = apply_label_encoding(abt_04, metadados)

Variables for label encoding:  ['job', 'month', 'education']
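
`LabelEncoder` assigns integer codes in sorted (alphabetical) order, so for a variable like `education` the codes do not follow the level of schooling. The following is a minimal sketch of an alternative using scikit-learn's `OrdinalEncoder` with an explicit category order. The ordering in `education_order`, and in particular where `unknown` is placed, is an assumption made for illustration, not something the notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed ordering from least to most schooling; the position of "unknown" is a judgment call
education_order = [
    "unknown", "illiterate", "basic.4y", "basic.6y", "basic.9y",
    "high.school", "professional.course", "university.degree",
]

# Toy rows (values are made up)
toy = pd.DataFrame({"education": ["high.school", "basic.4y", "university.degree"]})

# Codes are the positions in education_order, not alphabetical ranks
encoder = OrdinalEncoder(categories=[education_order])
toy["education_code"] = encoder.fit_transform(toy[["education"]]).ravel()
print(toy)
```

For nominal variables without a natural order, such as `job`, integer codes of either kind impose an arbitrary ranking, which tree-based models tolerate better than linear ones.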


### Rejoin the columns set aside earlier (target)

In [51]:
# Inner join on the index
abt_model = pd.merge(abt_05, df_bank_full[lista_spec], left_index=True, right_index=True, how='inner')
abt_model['y'] = abt_model['y'].map({'yes': 1, 'no': 0})
abt_model.head()

Unnamed: 0,age,job,education,month,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,...,loan_unknown,loan_yes,housing_unknown,housing_yes,default_unknown,default_yes,marital_married,marital_single,marital_unknown,y
21740,-0.866347,9,6,1,-0.762692,1.265698,0.195464,-0.347517,0.835936,-0.233955,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1321,-0.483082,1,2,6,-0.340152,-0.571768,0.195464,-0.347517,0.64468,0.719336,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0
25125,-0.866347,1,2,7,-0.762692,-0.204275,0.195464,-0.347517,-0.120341,-0.656869,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0
3159,0.379263,1,0,6,-0.459232,-0.204275,0.195464,-0.347517,0.64468,0.719336,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0
35424,0.091815,0,3,6,1.780227,0.530712,0.195464,-0.347517,-1.204122,-1.188979,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
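
Because the join matches rows by index label, and `map` silently produces NaN for any label outside its dictionary, a few quick checks after this step can be worthwhile. The following is a minimal sketch on toy data: the frames `full` and `train_features` are made-up stand-ins for `df_bank_full` and `abt_05`.

```python
import pandas as pd

# Toy stand-ins: full table with the target, and a shuffled training subset of features
full = pd.DataFrame({"y": ["no", "yes", "no", "no"]}, index=[0, 1, 2, 3])
train_features = pd.DataFrame({"age": [0.5, -1.2, 0.7]}, index=[3, 0, 1])

merged = pd.merge(train_features, full[["y"]], left_index=True, right_index=True, how="inner")
merged["y"] = merged["y"].map({"yes": 1, "no": 0})

# Every training row should survive the join
rows_kept = len(merged) == len(train_features)
same_labels = set(merged.index) == set(train_features.index)

# map() yields NaN for any label outside the dictionary
no_missing_target = bool(merged["y"].notna().all())

# Share of each class in the training table
class_share = merged["y"].value_counts(normalize=True)
print(rows_kept, same_labels, no_missing_target)
print(class_share)
```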


In [None]:
def apply_pca_and_plot(df, target_column, variance_threshold=0.90):
	# PCA, plt, np and pd come from the imports cell at the top of the notebook

	# Separate features and target
	X = df.drop(columns=[target_column])
	y = df[target_column]

	# Fit PCA with all components to inspect the variance spectrum
	pca = PCA()
	pca.fit(X)

	# Cumulative explained variance
	explained_variance = np.cumsum(pca.explained_variance_ratio_)

	# Plot against 1..n so the x-axis really is the number of components
	n_values = np.arange(1, len(explained_variance) + 1)
	plt.figure(figsize=(10, 6))
	plt.plot(n_values, explained_variance, marker='o', linestyle='--')
	plt.xlabel('Number of Components')
	plt.ylabel('Cumulative Explained Variance')
	plt.title('Explained Variance by Number of Principal Components')
	plt.grid(True)
	plt.axhline(y=variance_threshold, color='r', linestyle='-')
	plt.show()

	# Smallest number of components whose cumulative variance reaches the threshold
	num_components = int(np.argmax(explained_variance >= variance_threshold) + 1)
	print(f'Number of components needed to explain at least {variance_threshold:.0%} of the variance: {num_components}')

	# Refit keeping only the selected components and project the data
	pca = PCA(n_components=num_components)
	X_pca = pca.fit_transform(X)

	# New DataFrame with the selected components, keeping the original row index
	df_pca = pd.DataFrame(X_pca, columns=[f'PC{i+1}' for i in range(num_components)], index=df.index)
	df_pca[target_column] = y.values

	return df_pca

# Apply the function to the dataframe
df_pca = apply_pca_and_plot(abt_model, 'y')
df_pca.head()

(28831, 11)
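
scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`; with the full SVD solver it then keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The following is a minimal sketch on synthetic data (the matrix `X` and its column scales are made up) comparing that shortcut with the manual `cumsum` / `argmax` selection used in `apply_pca_and_plot`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 6 features with very different scales (made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

threshold = 0.90

# Manual selection, as in apply_pca_and_plot
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
manual_k = int(np.argmax(cumulative >= threshold) + 1)

# Let PCA choose the number of components for the same threshold
pca = PCA(n_components=threshold, svd_solver="full").fit(X)

print(manual_k, pca.n_components_)
```

The shortcut avoids fitting twice, while the manual route is still needed when the cumulative-variance plot itself is part of the presentation.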