## Important imports

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# sklearn is essential, but its imports vary from task to task, so they are not listed here.
from import_data import import_data

## How to get the data

In [3]:
train_data, test_data, y_train, X_train, X_test = import_data()

## Tips

### Data analysis

__-Univariate__:

$\rightarrow$ Analyse each feature in the X_train and X_test dataframes, and the labels in the y_train dataframe, individually.

$\rightarrow$ Histograms, boxplots, bar charts and swarm plots are good starting points.

$\rightarrow$ Use df.describe() for summary statistics, df.shape for the number of rows and columns, df.dtypes for the feature types, and df.info() for an overview that includes the non-null count of each column (from which the number of NaNs follows).
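
The calls above can be tried on a small example. This is only a sketch: the dataframe below is a toy stand-in for X_train, and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; the column names are hypothetical.
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "sex": ["male", "female", "female", "female", "male"],
})

n_rows, n_cols = df.shape   # number of rows and columns
summary = df.describe()     # count, mean, std, min, quartiles, max (numeric columns only)
types = df.dtypes           # data type of each feature
df.info()                   # prints the non-null count and dtype of each column

# Histogram-style counts without plotting: split a numeric feature into 3 equal-width bins.
fare_bins = pd.cut(df["fare"], bins=3).value_counts().sort_index()
```

For the plots themselves, df["fare"].hist() or sns.boxplot(x=df["fare"]) would be natural next steps.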

__-Multivariate__:

$\rightarrow$ Analyse the relationships between features, or between features and labels.

$\rightarrow$ Common approaches are scatterplots, correlation heatmaps, line plots (when one variable evolves as a function of another), bar charts, density plots, hexbin plots and linear regression plots.

$\rightarrow$ Use df.corr(numeric_only=True) to get the pairwise correlations between all numerical features.

$\rightarrow$ It is very important to analyse more than just the relation between features and label! The relations among the features themselves are also essential for spotting redundant or ambiguous information, which can cause problems in the model's predictions.
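
One way to look for redundant features is to scan the correlation matrix for highly correlated pairs. The sketch below uses synthetic data; the column names and the 0.9 threshold are arbitrary assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is unrelated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "label": ["yes", "no"] * 100,  # non-numeric column, skipped by numeric_only=True
})

corr = df.corr(numeric_only=True)

# Collect each feature pair once if its absolute correlation exceeds the threshold.
threshold = 0.9  # arbitrary cut-off chosen for illustration
pairs = [
    (c1, c2, corr.loc[c1, c2])
    for i, c1 in enumerate(corr.columns)
    for c2 in corr.columns[i + 1:]
    if abs(corr.loc[c1, c2]) > threshold
]
```

Passing corr to sns.heatmap(corr, annot=True) gives the visual version of the same check.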

### Preprocessing

__- Missing values detection and treatment:__

$\rightarrow$ Detection: X_train.info() and X_test.info(), or df.isna().sum() for the number of NaNs in each column and df.isna().sum().sum() for the total number of NaNs (where df is any dataframe).

$\rightarrow$ Treatment: imputation (mode, mean or median), dropping the affected rows or columns or, if NaN effectively stands for a single missing category, replacing it with a fixed value (e.g. 0).
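
A minimal sketch of detection followed by the three treatments, on a toy dataframe whose column names are hypothetical. Which treatment suits which column is a judgment call; the choices below are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values; the column names are hypothetical.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
    "cabin": [np.nan, "C85", np.nan, np.nan, "E46"],
})

nan_per_column = df.isna().sum()         # NaNs in each column
nan_total = int(df.isna().sum().sum())   # NaNs in the whole dataframe

filled = df.copy()
# Numeric feature: impute with the median.
filled["age"] = filled["age"].fillna(filled["age"].median())
# Categorical feature: impute with the mode (most frequent value).
filled["embarked"] = filled["embarked"].fillna(filled["embarked"].mode()[0])
# Mostly-empty feature: drop the column.
filled = filled.drop(columns=["cabin"])
```

When imputing X_test, the statistics (median, mode) should come from X_train, so that no information from the test set leaks into preprocessing.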

__- Outliers detection and treatment:__

$\rightarrow$ Detection: through metrics (the standard deviation, for example), histograms, boxplots (any point above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR is classified as an outlier, where Q1 is the 1st quartile, Q3 the 3rd quartile and IQR = Q3 - Q1 the interquartile range), or through the Z-score (any point more than 3 standard deviations away from the mean is classified as an outlier).

$\rightarrow$ Treatment: log scaling and clipping are common solutions.
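
Both detection rules and both treatments fit in a few lines. The series below is made-up data with one obvious outlier, used only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Made-up values: twenty points between 10 and 13, plus one extreme value.
values = pd.Series([10, 11, 12, 13] * 5 + [95], dtype=float)

# Boxplot (IQR) rule.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: more than 3 standard deviations away from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[z.abs() > 3]

# Treatments: clip to the IQR fences, or compress the scale with log(1 + x).
clipped = values.clip(lower=lower, upper=upper)
logged = np.log1p(values)
```

Note that on very small samples the Z-score rule can miss an outlier, because the extreme point itself inflates the standard deviation.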

__-Imbalanced classes detection and treatment:__

$\rightarrow$ Detection: y_train.value_counts() (the same call also works on any categorical column of X_train or X_test). If one class has far more elements than the others, the dataset is said to be imbalanced.

$\rightarrow$ Treatment: a possible solution is downsampling and upweighting. An explanation is available at https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data.
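
Downsampling and upweighting could look roughly like this. The dataframe, the column names and the downsampling factor of 5 are all assumptions made for the sketch.

```python
import pandas as pd

# Toy imbalanced data: 10 positive examples against 90 negative ones.
df = pd.DataFrame({
    "feature": range(100),
    "label": [1] * 10 + [0] * 90,
})

counts = df["label"].value_counts()
majority, minority = counts.idxmax(), counts.idxmin()

# Downsample: keep only 1 / factor of the majority class.
factor = 5  # hypothetical downsampling factor
majority_rows = df[df["label"] == majority].sample(frac=1 / factor, random_state=0)
minority_rows = df[df["label"] == minority]

# Recombine and shuffle the rows.
balanced = pd.concat([majority_rows, minority_rows]).sample(frac=1, random_state=0)

# Upweight: each kept majority example stands for `factor` original examples.
balanced["weight"] = balanced["label"].map({majority: float(factor), minority: 1.0})
```

Many scikit-learn estimators accept such weights through the sample_weight argument of fit.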

__-Categorical data detection and conversion to numerical:__

$\rightarrow$ Detection: X_train.dtypes, X_test.dtypes, y_train.dtypes.

$\rightarrow$ Treatment: One-Hot Encoding (if classes don't have a specific order) or Ordinal Encoding (if classes have an order) are common solutions.
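
A pandas-only sketch of both encodings follows. The column names and the category order are hypothetical; scikit-learn's OneHotEncoder and OrdinalEncoder are alternatives when the encoding has to be fitted on X_train and reused on X_test.

```python
import pandas as pd

# Toy dataframe; the column names and categories are hypothetical.
df = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],                 # no natural order
    "size": ["small", "large", "medium", "small"],    # ordered categories
})

# One-hot encoding: one indicator column per class.
one_hot = pd.get_dummies(df["embarked"], prefix="embarked")

# Ordinal encoding: map each class to its position in an explicit order.
size_order = ["small", "medium", "large"]
df["size_encoded"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# Replace the original categorical columns with their encoded versions.
encoded = pd.concat([df.drop(columns=["embarked", "size"]), one_hot], axis=1)
```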

__-Creating new features__:

$\rightarrow$ This is almost an art. We need to consider whether a combination of features (a sum or a product, for example) would be a better predictor. We can also split a feature into bins to help the model learn the specific characteristics of each bin. We can extract information from categorical features without any numerical transformation (remember the Titanic title extraction from the 'Name' feature and the deck from 'Cabin'), or we can reduce the information in a feature (remember the Titanic feature 'Traveled_Alone', which was derived from the 'Family_Size' feature).

$\rightarrow$ In summary: it is practically an art!
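
The Titanic ideas mentioned above can be sketched as follows. The rows are a tiny hand-written sample in the style of that dataset, and the definition Family_Size = SibSp + Parch + 1, the regular expression and the age bins are assumptions chosen for illustration.

```python
import pandas as pd

# Tiny hand-written sample in the style of the Titanic dataset.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "Cabin": [None, "C85", None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Age": [22.0, 38.0, 26.0],
})

# Combination of features (assumed definition: siblings/spouses + parents/children + self).
df["Family_Size"] = df["SibSp"] + df["Parch"] + 1

# Reducing a feature to a simpler one.
df["Traveled_Alone"] = (df["Family_Size"] == 1).astype(int)

# Extracting information from text: the title sits between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# The deck is the first character of the cabin code; missing cabins get a placeholder.
df["Deck"] = df["Cabin"].str[0].fillna("Unknown")

# Binning a numeric feature (bin edges chosen arbitrarily).
df["Age_Bin"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 35, 60, 120],
    labels=["child", "teen", "young_adult", "adult", "senior"],
)
```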