In [1]:
# Import the libraries to use
import numpy as np
import pandas as pd
import seaborn as sns

# Step 1: Problem statement and data collection

The dataset contains information about a bank's marketing campaign, where each feature is:

- age. Age of customer (numeric)
- job. Type of job (categorical)
- marital. Marital status (categorical)
- education. Level of education (categorical)
- default. Whether the customer has credit in default (categorical)
- housing. Whether the customer has a housing loan (categorical)
- loan. Whether the customer has a personal loan (categorical)
- contact. Type of contact communication (categorical)
- month. Last month in which you have been contacted (categorical)
- day_of_week. Last day on which you have been contacted (categorical)
- duration. Duration of previous contact in seconds (numeric)
- campaign. Number of contacts made during this campaign to the customer (numeric)
- pdays. Number of days that elapsed since the last campaign until the customer was contacted (numeric)
- previous. Number of contacts made during the previous campaign to the customer (numeric)
- poutcome. Result of the previous marketing campaign (categorical)
- emp.var.rate. Employment variation rate. Quarterly indicator (numeric)
- cons.price.idx. Consumer price index. Monthly indicator (numeric)
- cons.conf.idx. Consumer confidence index. Monthly indicator (numeric)
- euribor3m. EURIBOR 3-month rate. Daily indicator (numeric)
- nr.employed. Number of employees. Quarterly indicator (numeric)
- y. TARGET. Whether the customer takes out a long-term deposit or not (categorical)

In [2]:
from utils import load_data 

file_path = '../data/raw/bank-marketing-campaign-data.csv'
url = 'https://raw.githubusercontent.com/4GeeksAcademy/logistic-regression-project-tutorial/main/bank-marketing-campaign-data.csv'

df = load_data(file_path=file_path, url=url)

File not found. Loading data from URL: https://raw.githubusercontent.com/4GeeksAcademy/logistic-regression-project-tutorial/main/bank-marketing-campaign-data.csv
Data saved to file: ../data/raw/bank-marketing-campaign-data.csv


## Problem to solve:

We want to create a logistic classifier using the collected data.

# Step 2: Exploration and data cleaning

## Eliminate duplicates
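As a minimal sketch of this step, the snippet below counts and removes fully duplicated rows. The `toy_df` DataFrame and its values are made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy data standing in for the campaign DataFrame (values are invented).
toy_df = pd.DataFrame({
    "age": [30, 45, 30, 52],
    "job": ["admin.", "technician", "admin.", "retired"],
    "y": ["no", "yes", "no", "no"],
})

# duplicated() marks every row that repeats an earlier row exactly.
n_duplicates = int(toy_df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated_df = toy_df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated_df.shape)
```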

## Eliminate irrelevant information

# Step 3: Analysis of univariate variables

**Univariate analysis** examines a single variable (the set of observations of one attribute) at a time. In other words, it is the column-by-column analysis of the DataFrame. To do this, we must distinguish whether a variable is categorical or numerical, because the analysis and the conclusions that can be drawn are different in each case.

## Analysis of categorical variables

A **categorical variable** is a type of variable that can take one of a limited number of categories or groups. These groups are often nominal (e.g., the color of a car: red, blue, black, etc., where none of these colors is inherently "greater" or "better" than the others) but can also be represented by a finite set of numbers.

**To represent these types of variables we will use histograms, that is, bar charts showing the number of observations in each category.**
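The numbers behind such a chart can be sketched without plotting. The example below assumes a made-up `marital` column and computes the per-category counts and shares that the bars would display.

```python
import pandas as pd

# Invented sample of a categorical column (not taken from the real data).
marital = pd.Series(
    ["married", "single", "married", "divorced", "married", "single"],
    name="marital",
)

# Absolute frequency of each category: the height of each bar.
category_counts = marital.value_counts()

# Relative frequency, useful for spotting very rare categories.
category_share = marital.value_counts(normalize=True)

print(category_counts)
print(category_share.round(2))
```

In a notebook, the same information could be drawn with a count plot, for example through seaborn, which is already imported above.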

## Analysis of numeric variables

A **numeric variable** is a type of variable that can take numeric values (integers, fractions, decimals, negatives, etc.) over a potentially unbounded range. A categorical variable that is encoded with numbers can also be treated as a numeric variable.

**They are usually represented using a histogram and a boxplot, displayed together.**
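A rough sketch of what those two plots summarize: the histogram is a count of values per bin, and the boxplot is built from the quartiles. The `age` values below are invented and the choice of 4 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Invented numeric sample; 95 is deliberately far from the rest.
age = pd.Series([25, 31, 35, 38, 41, 44, 52, 60, 95], name="age")

# Histogram: number of observations falling in each of 4 equal-width bins.
bin_counts, bin_edges = np.histogram(age, bins=4)

# Boxplot ingredients: first quartile, median and third quartile.
q1, median, q3 = age.quantile([0.25, 0.5, 0.75])

print(bin_counts, bin_edges)
print(q1, median, q3)
```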

# Step 4: Analysis of multivariate variables

After analyzing the characteristics one by one, it is time to analyze them in relation to the target and to each other, in order to draw clearer conclusions about their relationships and to be able to make decisions about their processing.

Thus, if we would like to eliminate a variable due to a high number of null values or certain outliers, we must first apply this process to ensure that the values being removed are not critical for predicting the target. For example, in the well-known Titanic dataset the variable Cabin has many null values, and we would have to ensure that there is no relationship between it and survival before eliminating it, since it could be very significant for the model and removing it could bias the prediction.

## Numerical-numerical analysis

When the two variables being compared have numerical data, the analysis is said to be numerical-numerical. 

**Scatterplots and correlation analysis are used to compare two numerical columns.**
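As a small illustration of the correlation part, the sketch below computes the Pearson correlation between two numeric columns. The column names come from the feature list, but the values are invented so that the relationship is nearly linear.

```python
import pandas as pd

# Invented values for two numeric columns.
toy_df = pd.DataFrame({
    "campaign": [1, 2, 3, 4, 5],
    "duration": [100, 210, 290, 405, 500],
})

# Pearson correlation between the two columns (pandas' default method).
pearson_r = toy_df["campaign"].corr(toy_df["duration"])

# Full pairwise correlation matrix, the usual input for a heatmap.
correlation_matrix = toy_df.corr()

print(round(pearson_r, 3))
print(correlation_matrix)
```

A value close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests none; a scatterplot of the same two columns helps confirm whether the relationship is really linear.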

## Categorical-categorical analysis

When the two variables being compared have categorical data, the analysis is said to be categorical-categorical. 

**Histograms and combinations are used to compare two categorical columns.**
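One way to tabulate such combinations is a contingency table. The sketch below uses `pd.crosstab` on invented values of `housing` and the target `y`; it is an illustration of the idea, not necessarily the approach used in the original notebook.

```python
import pandas as pd

# Invented values for one categorical predictor and the target.
toy_df = pd.DataFrame({
    "housing": ["yes", "yes", "no", "no", "yes", "no"],
    "y": ["no", "yes", "no", "no", "yes", "yes"],
})

# Number of observations for every (housing, y) combination.
contingency = pd.crosstab(toy_df["housing"], toy_df["y"])

# Same table as proportions within each housing category (rows sum to 1).
row_share = pd.crosstab(toy_df["housing"], toy_df["y"], normalize="index")

print(contingency)
print(row_share.round(2))
```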

### Combinations of class with various predictors

## Numerical-categorical analysis (complete)

# Step 5: Feature engineering

Feature engineering is a process that involves the creation of new features (or variables) from existing ones to improve model performance. This may involve a variety of techniques, such as normalization, data transformation, and so on. The goal is to improve the accuracy of the model and/or reduce the complexity of the model, thus making it easier to interpret.

Some feature engineering tasks could be carried out in this step, but they are usually done before analyzing the variables, so the process is split into an earlier part and the part we are going to see next.

## Outlier analysis

An outlier is a data point that deviates significantly from the others. It is a value that is noticeably different from what would be expected given the general trend of the data. These outliers may be caused by errors in data collection, natural variations in the data, or they may be indicative of something significant, such as an anomaly or extraordinary event.

Descriptive analysis is a powerful tool for characterizing the data set: the mean, standard deviation and quartiles provide useful information about each variable. The describe() function of a DataFrame computes all these values quickly.
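The sketch below shows how the quartiles from describe() can be used to flag candidate outliers. The 1.5 × IQR rule is a common rule of thumb assumed here for illustration, not something prescribed by this notebook, and the `duration` values are invented.

```python
import pandas as pd

# Invented sample with one value far above the rest.
duration = pd.Series(
    [90, 120, 150, 180, 200, 240, 260, 300, 4000], name="duration"
)

# describe() returns count, mean, std, min, quartiles and max.
summary = duration.describe()

# Interquartile range and the conventional 1.5 * IQR fences (assumed rule).
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection.
outliers = duration[(duration < lower_limit) | (duration > upper_limit)]

print(summary)
print(outliers.tolist())
```

Whether a flagged value should be removed, capped or kept depends on whether it is an error or a genuine extreme observation.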

## Missing value analysis

A **missing** value is a space that has no value assigned to it in the observation of a specific variable. These types of values are quite common and can arise for many reasons. For example, there could be an error in data collection, someone may have refused to answer a question in a survey, or it could simply be that certain information is not available or not applicable.
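A minimal sketch of how missing values can be counted and, if appropriate, filled in. The data is invented, and filling with the median or the most frequent value is just one possible strategy assumed for this example.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value in each column.
toy_df = pd.DataFrame({
    "age": [30, np.nan, 45, 52],
    "job": ["admin.", "technician", None, "retired"],
})

# Number of missing values per column.
missing_per_column = toy_df.isnull().sum()

# One possible treatment: median for numeric, most frequent for categorical.
filled_df = toy_df.copy()
filled_df["age"] = filled_df["age"].fillna(filled_df["age"].median())
filled_df["job"] = filled_df["job"].fillna(filled_df["job"].mode()[0])

print(missing_per_column)
print(filled_df)
```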

## Inference of new features

Another typical use of this engineering is to obtain new features by "merging" two or more existing ones.
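As a hypothetical illustration, the two loan columns from the feature list could be merged into a single indicator. The `has_any_loan` name and the merging rule are assumptions made for this sketch, and the values are invented.

```python
import pandas as pd

# Invented values for the two loan-related columns.
toy_df = pd.DataFrame({
    "housing": ["yes", "no", "no", "yes"],
    "loan": ["no", "no", "yes", "yes"],
})

# Hypothetical new feature: 1 if the customer has either kind of loan.
toy_df["has_any_loan"] = (
    (toy_df["housing"] == "yes") | (toy_df["loan"] == "yes")
).astype(int)

print(toy_df)
```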

## Divide the set into train and test
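A minimal sketch of the split using scikit-learn's `train_test_split`. The toy data, the 80/20 proportion, the `random_state` and the stratification by target are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data with a numeric-encoded binary target.
toy_df = pd.DataFrame({
    "age": list(range(20, 30)),
    "campaign": [1, 2, 1, 3, 2, 1, 4, 2, 1, 3],
    "y": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Separate predictors from the target.
X = toy_df.drop(columns="y")
y = toy_df["y"]

# Hold out 20% of the rows for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Splitting before scaling or selecting features keeps information from the test rows out of those later steps.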

## Feature scaling

**Feature scaling** is a crucial step in data preprocessing for many Machine Learning algorithms. It is a technique that changes the range of data values so that they can be compared to each other. A common approach is standardization (often loosely called normalization), which changes the values so that they have a mean of 0 and a standard deviation of 1. Another common technique is min-max scaling, which transforms the data so that all values are between 0 and 1.
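Both techniques can be sketched with scikit-learn's `StandardScaler` and `MinMaxScaler`. The arrays below are invented; the scalers are fitted on the training rows only and then reused on the test rows, which is the usual practice assumed here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Invented training and test rows with two numeric features.
X_train = np.array([[20.0, 100.0], [30.0, 200.0], [40.0, 300.0], [50.0, 400.0]])
X_test = np.array([[35.0, 250.0]])

# Mean 0 and standard deviation 1, using statistics learned from the training set.
standard_scaler = StandardScaler().fit(X_train)
X_train_std = standard_scaler.transform(X_train)
X_test_std = standard_scaler.transform(X_test)

# Min-max scaling: training values are mapped into the [0, 1] interval.
minmax_scaler = MinMaxScaler().fit(X_train)
X_train_minmax = minmax_scaler.transform(X_train)

print(X_train_std)
print(X_train_minmax)
```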

# Step 6: Feature selection

**Feature selection** is a process that involves selecting the most relevant features (variables) from our dataset to use in building a Machine Learning model, discarding the rest.

There are several reasons to include it in our exploratory analysis:

1. To simplify the model so that it is easier to understand and interpret.
2. To reduce the training time of the model.
3. To avoid overfitting by reducing the dimensionality of the model and minimizing noise and unnecessary correlations.
4. To improve model performance by removing irrelevant features.
 
In addition, there are several techniques for feature selection, many of which are based on trained supervised or clustering models.

The sklearn library contains many of the best alternatives to perform it. One of the most commonly used tools for fast feature selection is SelectKBest. It scores every feature against the target with a statistical test and keeps the k features with the highest scores. This statistical test is usually an ANOVA F-test (f_classif) or a Chi-Square test (chi2).
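A minimal sketch of SelectKBest with the ANOVA F-test (`f_classif`). The two columns are invented: `informative` separates the classes clearly while `noise` does not, so with `k=1` the selector should keep only the first. The column names and `k` are assumptions for the example.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Invented predictors and a binary target.
X = pd.DataFrame({
    "informative": [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    "noise": [3.0, 1.0, 4.0, 1.5, 2.0, 3.5, 1.2, 3.9],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Score each feature with the ANOVA F-test and keep the single best one.
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
selected_columns = X.columns[selector.get_support()].tolist()

print(selected_columns)
print(selector.scores_)
```

In practice the selector would be fitted on the training set only and then applied to the test set with `transform`.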