# Basic Concepts

### First, let's import a dataset with pandas to train our model

This table has the fields **home, how_it_works, contact, and bought**, each holding a 0 or a 1 to indicate that something did not or did happen.

In [22]:
import pandas as pd

uri = "https://gist.githubusercontent.com/guilhermesilveira/2d2efa37d66b6c84a722ea627a897ced/raw/10968b997d885cbded1c92938c7a9912ba41c615/tracking.csv"
data = pd.read_csv(uri)

## `Training and Test Sets`

When building a Machine Learning model, it is common to **separate the data into training and testing sets**. 
The *training set is used to train the model*, while the *test set is used to evaluate the model's performance*. 
This checks that the model generalizes to new data instead of just memorizing the training data. 
A real-world example is splitting an image dataset in two, using the training set to 
train an image classification model and the test set to measure the model's accuracy. Here's an example of how this works:

In [27]:

x = data[["home", "how_it_works", "contact"]]
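y = data["bought"]  # a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning

# Separating the train set and test set:
# the first 75 rows train the model, the remaining rows test it

train_x = x[:75]
train_y = y[:75]
test_x = x[75:]
test_y = y[75:]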

In [31]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

model = LinearSVC()

# Training the model

model.fit(train_x, train_y)

# Predictions

predictions = model.predict(test_x)

# Accuracy

accuracy = accuracy_score(test_y, predictions) * 100

print("The accuracy was: %.2f%%" %accuracy)

The accuracy was: 96.00%
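
To make the metric concrete, here is a minimal sketch of what `accuracy_score` measures: the share of predictions that match the true labels. The label arrays below are made up for illustration and are not taken from `tracking.csv`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for this illustration
true_labels = np.array([1, 0, 0, 1, 0, 1, 0, 0])
predicted_labels = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy by hand: share of positions where prediction equals the true label
manual_accuracy = (true_labels == predicted_labels).mean()

# The same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(true_labels, predicted_labels)

print("Manual accuracy:  %.2f%%" % (manual_accuracy * 100))
print("sklearn accuracy: %.2f%%" % (sklearn_accuracy * 100))
```

Six of the eight predictions match, so both lines report 75.00%.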




## `Stratified Splits`

Stratified splitting **is a technique used in machine learning to divide a dataset into training and testing subsets** 
so that the distribution of classes is maintained in both subsets.

In other words, stratification ensures that the proportions of classes are (as nearly as possible) the same in both sets, 
which helps prevent the trained model from being biased towards any particular class.

#

To perform the split, you can use the function `train_test_split` from Python's `scikit-learn` library. This function 
splits the dataset into training and test subsets randomly; passing `stratify=y` makes it also keep the proportion of the classes in both subsets. 
For inputs `x` and `y`, it always returns the subsets in this order: (training X, test X, training Y, test Y).

Here is an example:


In [32]:
from sklearn.model_selection import train_test_split

# Setting a random seed to avoid different 
# results each time you run the program.

SEED = 20

# stratify = y keeps the class proportions of y in both subsets

train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = SEED, test_size = 0.25, stratify = y)

# Training now with the new training split

model.fit(train_x, train_y)

# Predicting

predictions = model.predict(test_x)

accuracy = accuracy_score(test_y, predictions) * 100

# Voilà, each time you execute the program
# it will return the same result, in this case 96%

print("The accuracy was: %.2f%%" %accuracy)

The accuracy was: 96.00%
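
To see what stratification changes, the following sketch compares class proportions with and without the `stratify` argument. It assumes a hypothetical imbalanced dataset (80 samples of class 0 and 20 of class 1) built just for this illustration; the names `features` and `labels` are not part of the notebook above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1
features = np.arange(100).reshape(-1, 1)
labels = np.array([0] * 80 + [1] * 20)

SEED = 20

# stratify=labels asks for the same class proportions in both subsets
_, _, strat_train_y, strat_test_y = train_test_split(
    features, labels, random_state=SEED, test_size=0.25, stratify=labels
)

# Without stratify the split is purely random, so the proportions may drift
_, _, plain_train_y, plain_test_y = train_test_split(
    features, labels, random_state=SEED, test_size=0.25
)

# The mean of a 0/1 label array is the share of class 1
print("Overall share of class 1:    %.2f" % labels.mean())
print("Stratified train / test:     %.2f / %.2f" % (strat_train_y.mean(), strat_test_y.mean()))
print("Non-stratified train / test: %.2f / %.2f" % (plain_train_y.mean(), plain_test_y.mean()))
```

With stratification, both subsets keep the overall 20% share of class 1; the non-stratified split gives no such guarantee.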




## `Overfitting And Underfitting`

This section covers a more theoretical and fundamental part of machine learning, so it focuses on 
the concepts rather than on code.

#

### `Overfitting`

Overfitting **occurs when the model is too complex relative to the training data**. In this case, the 
*model learns the details and noise from the training set, becoming too specific to the training data 
and thus unable to generalize to new data*. This can lead to poor model performance on test data. 
Overfitting can be identified by **a high performance on the training set, but a low performance on the test set**.

#

### `Underfitting`

Underfitting, on the other hand, **occurs when the model is too simple to capture the nuances and 
complexities of the training data set**. In this case, *the model cannot fit the training data well 
and also cannot generalize well to new data.* This can lead to poor model performance on both training 
and test sets. Underfitting can be identified by **poor performance on both the training set and the test set**.
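
Although this section is conceptual, the symptoms described above can be observed in a small experiment. The sketch below is an illustration only: it assumes synthetic, noisy data from `make_classification` and uses a `DecisionTreeClassifier`, neither of which appears elsewhere in this notebook, with tree depth standing in for model complexity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 20

# Synthetic data (an assumption for illustration): flip_y assigns a share
# of the labels at random, so there is noise for a model to memorize
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=SEED
)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, random_state=SEED, test_size=0.25, stratify=y
)

# max_depth=1 is a very simple model; max_depth=None lets the tree grow
# until it separates the training samples completely
scores = {}
for name, depth in [("shallow (depth 1)", 1), ("unrestricted depth", None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=SEED)
    tree.fit(train_x, train_y)
    scores[name] = (tree.score(train_x, train_y), tree.score(test_x, test_y))
    print("%-20s train: %.2f  test: %.2f" % (name, *scores[name]))
```

The unrestricted tree memorizes the training set, noise included, yet scores lower on the test set, which is the overfitting signature. The depth-1 tree cannot fit even the training data perfectly; whether that amounts to underfitting depends on how far its scores fall below what a better-sized model achieves.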

#

### `Resolution`

To solve these problems, it is necessary to **adjust the complexity of the model**. If the model is 
suffering from `overfitting`, it is necessary to **reduce the complexity of the model**, for example 
by **reducing the number of features, using regularization, or reducing the depth of a neural network**. 
If, instead, the model is suffering from `underfitting`, it is necessary to **increase the complexity of the model**, 
for example by **adding more features, increasing the depth of a neural network, or choosing hyperparameters that make the model more flexible.**


Problem | Resolution
------- | ----------
Overfitting | Reduce the complexity of the model
Underfitting | Increase the complexity of the model

Evaluating the performance of the model on a test dataset, and comparing it with the performance on the training set, 
is also important to identify overfitting and underfitting problems. This can be done using a variety of 
**evaluation metrics, such as accuracy, recall, and F1-score.**
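
As a rough illustration of how these metrics differ, the sketch below computes all three with scikit-learn on a small set of hypothetical labels (invented for this example, with 1 meaning "bought"). The names `y_true` and `y_pred` are placeholders, not variables from the notebook above.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 4 real buyers (1) and 6 non-buyers (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The predictions find 2 of the 4 buyers and raise 1 false alarm
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)  # correct predictions / all predictions
recall = recall_score(y_true, y_pred)      # buyers found / all real buyers
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall

print("Accuracy: %.2f" % accuracy)
print("Recall:   %.2f" % recall)
print("F1-score: %.2f" % f1)
```

Here accuracy is 0.70 while recall is only 0.50 and the F1-score about 0.57, which shows why accuracy alone can look better than the model's ability to find the positive class.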