# Task proposal

This task covers how to understand and manipulate real-world data in machine learning projects, and introduces several data science concepts and tools.

Course: [Machine Learning, Data Science and Generative AI with Python](https://www.udemy.com/course/data-science-and-machine-learning-with-python-hands-on/?couponCode=ST7MT41824)


Sections: More Data Mining and Machine Learning Techniques; Dealing with Real-World Data. (2 hours and 30 minutes)

# K-Nearest Neighbors (KNN)


KNN is a classifier that assigns a category to a new data point based on the categories of its nearest neighbors. The k closest neighbors, measured by some distance metric, vote on which category the new point should receive.


## Using KNN to predict a movie's rating


Using the ratings and genres already provided for each movie, a distance metric is defined so that KNN can be applied. The metric is based on the genres each film belongs to: two films are close when their genres are similar.


The data set provided contains user IDs, movie IDs and ratings.


Using this information, `groupby` can be used to group the ratings by movie ID. From all the ratings given to each film, the number of ratings and their average are then calculated.


This is done with the following `agg` call:

    # Per movie: number of ratings and mean rating (assumes numpy is imported as np)
    movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})

The raw number of ratings a film has received is not, by itself, a good measure of its popularity, because an absolute count can be interpreted differently in different contexts. For this reason, the number of ratings per film is normalized to a range from 0 to 1, where films close to 1 are the most popular, that is, the ones with the most ratings.


This kind of judgment is subjective and is left to the practitioner, which is worth keeping in mind for future cases.
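
The rescaling to the 0 to 1 range can be sketched as min-max normalization. The function name and the vote counts below are hypothetical, chosen only for illustration; the course notebook may compute this differently.

```python
def min_max_normalize(counts):
    """Rescale a list of vote counts to the range [0, 1]."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # All movies have the same number of votes; avoid division by zero.
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


# Hypothetical vote counts for four movies.
vote_counts = [10, 60, 110, 510]
popularity = min_max_normalize(vote_counts)
```

The movie with the fewest votes maps to 0 and the one with the most votes maps to 1.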


With each film's popularity and average rating in hand, the rating of a new film can be estimated by averaging the ratings of its k nearest neighbors (the films with the most similar genres).


KNN proves to be a good and simple alternative for predicting movie recommendations in the course context.
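
A minimal sketch of this idea follows. The catalogue, the genre flags and the choice of distance (cosine distance between genre vectors plus the absolute difference in popularity) are assumptions made for illustration, not necessarily the exact setup used in the course.

```python
import math

# Hypothetical catalogue: movie -> (genre flags, normalized popularity, mean rating).
movies = {
    "A": ([1, 0, 0], 0.9, 4.0),
    "B": ([1, 1, 0], 0.5, 3.0),
    "C": ([0, 0, 1], 0.2, 2.0),
    "D": ([1, 0, 0], 0.8, 5.0),
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def distance(m1, m2):
    """Genre distance plus popularity difference (an assumed metric)."""
    genres1, pop1, _ = m1
    genres2, pop2, _ = m2
    return cosine_distance(genres1, genres2) + abs(pop1 - pop2)


def predict_rating(target, catalogue, k):
    """Average the ratings of the k movies closest to the target."""
    neighbors = sorted(catalogue.values(), key=lambda m: distance(target, m))[:k]
    return sum(m[2] for m in neighbors) / k


# A new movie with no rating yet; its rating slot is left as None.
new_movie = ([1, 0, 0], 0.85, None)
predicted = predict_rating(new_movie, movies, k=2)
```

Here the two nearest neighbors are movies A and D, so the prediction is the mean of their ratings.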


# Curse of dimensionality


The curse of dimensionality refers to the challenge faced when dealing with high-dimensional datasets. As the number of features or dimensions increases, the consumption of computational resources and difficulty in finding meaningful patterns in the data also increases. To combat this curse, it is important to perform feature selection and dimensionality reduction to improve the performance of machine learning models.


## PCA


One of the proposed solutions to this problem is Principal Component Analysis (PCA), a linear algebra technique that finds the directions (hyperplanes) along which the data varies the most and projects the data onto them, reducing the number of columns.


The technique is demonstrated on the Iris dataset, where the fitted PCA object shows how much of the original variance is still present after the columns are reduced.


One of the reasons for applying PCA is to make it possible to plot data in two dimensions, by keeping only the first two principal components.
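
The projection and the share of variance retained can be sketched with NumPy alone. This is not the course's code: a small synthetic four-column array stands in for the Iris dataset, and the `pca` helper is a hypothetical name that computes the components through an SVD of the centered data.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    # Share of the total variance explained by each kept component.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return projected, explained_ratio[:n_components]


# Synthetic stand-in data: 4 columns built from only 2 independent signals.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], 2 * base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])

projected, ratio = pca(X, n_components=2)
```

Because the four columns are combinations of two signals, two components retain essentially all of the variance here. On real data the retained share is usually lower.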


# Data Warehousing


A data warehouse is a centralized database that contains information from different sources and therefore holds a large volume of data. One of its objectives is to enable data analysis in large corporations.


This scenario poses different challenges, such as missing data, data normalization, and how to join and maintain data from different sources. With a large volume of data, the data transformations themselves can also become a problem.


There are two ways to create and maintain a Data Warehouse:


## ETL (extract, transform and load)


ETL indicates an order of processes for Data Warehousing that is more conventional, extracting data from the real world, transforming it into the desired pattern, and then loading it into the warehouse.


This approach has a problem of scale. Since transformations are applied to all data in order to standardize it, when there is a large volume of data, this becomes a difficult task in itself.


## ELT (extract, load and transform)


Unlike the previous method, ELT extracts the data and loads it first, and the warehouse's own computational power is then used to apply the transformations. This transformation step is carried out with tools mentioned in the video, such as Hadoop and Hive, which make it possible to query the data.


# Reinforcement Learning


Reinforcement learning is a learning strategy in which an agent explores an environment and is rewarded for correct actions, which reinforces the desired behavior.


## Q-Learning


Q-Learning allows an agent to learn to make sequential decisions in an environment, optimizing its actions to maximize a cumulative reward over time. The video uses Pac-Man as an example: the possible decisions the character can make in each state are stored, each starting with a quality (Q) value of 0. The agent updates these values based on the rewards it observes, which lets it learn the best strategy to achieve its goals over time.


One of the problems with this approach is that exploring every possible state and action is inefficient.
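
The value update can be sketched with the standard Q-learning rule. The state and action names, the learning rate `ALPHA` and the discount factor `GAMMA` are hypothetical values chosen for illustration.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor for future rewards (assumed value)

# Q[state][action] holds the learned quality of each action; every entry starts at 0.
Q = defaultdict(lambda: defaultdict(float))


def update(state, action, reward, next_state, next_actions):
    """Apply one Q-learning update after observing a transition."""
    best_next = max((Q[next_state][a] for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    Q[state][action] += ALPHA * (target - Q[state][action])


# The agent moves right from s0 twice and receives a reward of 1 each time.
update("s0", "right", 1.0, "s1", ["left", "right"])
update("s0", "right", 1.0, "s1", ["left", "right"])
```

Each rewarded visit moves the stored value closer to the observed reward, so actions that pay off gradually stand out from the ones that do not.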


# Confusion matrix


The confusion matrix is a fundamental tool in the evaluation of classification models. It is a table that compares the predictions of a model with the actual values of the data. This matrix provides a clear view of the model's performance and makes it possible to calculate important metrics such as precision, recall, F1-score and accuracy, which are crucial for evaluating the quality of a classification algorithm.


## Precision


Precision measures the proportion of correctly predicted positive instances relative to the total instances predicted as positive. It is a metric that evaluates the model's ability to avoid misclassifying negatives as positives.


## Recall


Recall evaluates the model's ability to correctly find all positive instances. It measures the proportion of actual positive instances that were correctly identified by the model.


## F1-Score


The F1-Score is a metric that combines precision and recall into a single value. It is useful when it is necessary to balance the importance of avoiding false positives and false negatives. The higher the F1-Score, the better the balance between precision and recall.
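
These metrics follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts and a hypothetical function name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts precision is 0.8 while recall is about 0.67, and the F1-score of about 0.73 sits between the two.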


## ROC curve (Receiver Operating Characteristic)


The ROC curve is a graphical representation of the discriminative ability of a model at different classification thresholds. It shows the rate of true positives versus the rate of false positives.


## AUC (Area Under the Curve)


The AUC is the area under the ROC curve and provides a quantitative measure of the model's ability to differentiate between classes. The higher the AUC, the better the model performs in classifying instances correctly.
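
For intuition, the AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny hypothetical example. It is meant for illustration only, since comparing every pair is slow on large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. A value of 0.5 corresponds to random guessing and 1.0 to perfect separation.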


# Bias and Variance


Bias measures how far the predictions are, on average, from the correct values. Biased predictions are consistently skewed to one side.


Variance concerns how spread out the predictions are.


Any set of predictions therefore exhibits both, in different amounts.


## Error


Ultimately, the objective is to reduce error, which can be expressed in terms of bias and variance:


Error = bias^2 + variance


These terms describe model behavior in real cases: for example, a model that overfits tends to have low bias and high variance, while a model that underfits tends to have high bias and low variance.
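
The decomposition can be checked numerically. In the sketch below, a hypothetical set of predictions for a single true value is used, and irreducible noise is ignored, as in the formula above.

```python
from statistics import mean, pvariance

# Hypothetical predictions made by a model for a quantity whose true value is 10.
true_value = 10.0
predictions = [12.0, 11.0, 13.0, 12.0]

bias = mean(predictions) - true_value   # how far off the predictions are on average
variance = pvariance(predictions)       # how scattered the predictions are
mse = mean([(p - true_value) ** 2 for p in predictions])  # mean squared error
```

Here the bias is 2.0 and the variance is 0.5, so bias squared plus variance gives 4.5, which matches the mean squared error.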


# K-fold cross validation


Normally, models are trained on training data and then evaluated on test data. The problem is that a single train/test split can give a misleading evaluation, since the model may adapt to a particular split that does not fully represent the problem.


Instead, cross validation proposes separating the data into k pieces (folds) of equal size. The model is then trained and tested k times, each time holding out a different fold for testing and training on the rest, which gives a broader perspective on how it performs on the data.
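
The splitting step can be sketched as a generator of train/test index lists. The function name is hypothetical, and in practice the data would usually be shuffled before being split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

Every sample is used for testing exactly once, and the model's score is typically reported as the average over the k folds.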


# Cleaning input data


Cleaning the data directly impacts the rest of the pipeline, since the output of any model depends on the data it was trained on, so this is an important part of the process.


## Outliers


Outliers are values that stand out from the rest of a data set and can distort analyses and models.


## Missing data


The presence of missing data is common in real datasets. Handling it involves imputation (estimating missing values) or deleting observations, depending on the impact on the results.


## Malicious data


Malicious data is inserted to trick or corrupt analyses and models.


## Erroneous data


Erroneous data is inaccurate or incorrect information resulting from human or measurement error.


## Irrelevant data


Irrelevant data does not contribute to the objectives of the analysis or model.


## Inconsistent data


Inconsistencies occur when data values contradict each other or do not follow expected patterns.


## Formatting


Formatting involves standardizing data values and types, facilitating analysis and understanding, ensuring that dates, numbers and categories are correct.


# Normalizing data


Normalization ensures that information is in a uniform and comparable format. This is essential so that some algorithms and models can operate effectively, avoiding distortions arising from different scales, units or data representations.


The data can be denormalized at the end in order to interpret the results.


# Outliers


Outliers are discrepant, atypical values. They must be analyzed case by case to decide whether to discard them, as there are scenarios where they are meaningful and should be kept.
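
One simple way to flag candidates for that analysis is to look at values that lie far from the median. The sketch below uses a threshold of two standard deviations, which is an assumed, adjustable choice, and the income figures are hypothetical.

```python
from statistics import median, pstdev


def reject_outliers(data, n_std=2.0):
    """Keep only values within n_std standard deviations of the median."""
    center = median(data)
    spread = pstdev(data)
    return [x for x in data if center - n_std * spread < x < center + n_std * spread]


# Hypothetical incomes (in thousands) with one extreme value.
incomes = [27, 30, 31, 29, 28, 30, 1000]
filtered = reject_outliers(incomes)
```

The extreme value is dropped here, but whether that is appropriate depends on the question being asked of the data.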


# Feature Engineering


Feature engineering involves the creation, transformation and selection of variables (features) from raw data. Its importance lies in the fact that appropriate features can significantly improve the performance of models, making them more capable of capturing complex relationships in data. This involves identifying informative features, reducing dimensionality, and creating more meaningful representations of the data.


# Imputing missing data


One of the ways to handle missing data is to substitute missing values with the column mean. The mean is distorted when there are outliers in the data, which makes the median a better option in that case.
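
The difference between the two strategies can be seen on a small column with one outlier. In this sketch `None` marks a missing value, and the function name and figures are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy=median):
    """Replace None entries with a statistic computed from the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


# Hypothetical income column with one missing entry and one outlier.
incomes = [30, 32, None, 31, 1000]
filled_with_mean = impute(incomes, mean)
filled_with_median = impute(incomes, median)
```

The mean fills the gap with 273.25, a value unlike any typical row, while the median fills it with 31.5.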


One of the better approaches is to use machine learning models to predict the missing values. This works well for numerical data, where a distance between rows can be measured and used to find similar records.


For categorical missing data, where no such distance can be defined, deep learning models are presented as more suitable.


In the end, when there is missing data, the best option will be to collect more data.


# Unbalanced data


Imbalanced data occurs when classes in a data set are not equally represented. This can lead to problems as learning models tend to favor the majority class, resulting in biased predictions and poor performance on minority classes.


## Oversampling


Oversampling involves the creation of additional copies of observations from the minority class to equalize the number of samples in both classes. This helps avoid bias towards the majority class and improves model performance on minority classes.


## Undersampling


Some samples from the majority class are randomly removed to balance proportions between classes. This reduces the influence of the majority class and can improve the model's ability to identify the minority class.


This approach may not make much sense in most cases, as it proposes throwing away data. One possible scenario for discarding data is perhaps when there is not enough computing power to process a large volume.
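
Both resampling strategies can be sketched in a few lines. The function names and sample labels are hypothetical, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate random minority samples until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Hypothetical dataset: 8 legitimate transactions and 2 fraudulent ones.
majority = [f"ok_{i}" for i in range(8)]
minority = ["fraud_0", "fraud_1"]

over_major, over_minor = random_oversample(majority, minority)
under_major, under_minor = random_undersample(majority, minority)
```

Oversampling keeps all ten original rows and adds six duplicates, whereas undersampling keeps only four rows in total, which illustrates the data loss mentioned above.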


## SMOTE


SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that creates new synthetic samples for the minority class, generating intermediate points between neighboring instances.
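
The interpolation step can be sketched as follows. This is a simplified illustration of the idea rather than a full SMOTE implementation, and the function name, parameters and sample points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # The k nearest minority neighbors, excluding the chosen point itself.
        neighbors = sorted(
            (p for p in minority if p is not point),
            key=lambda p: math.dist(point, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(point, neighbor)))
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
new_points = smote(minority, n_new=4)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new data stays inside the region the minority class already occupies.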


# Other feature engineering techniques


## Binning


Binning is a technique that divides a set of data into intervals (bins) based on numerical values or categories. This can be useful for simplifying continuous data, creating discrete ranges, and making it easier to analyze or visualize patterns in the data.
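
As a small illustration, ages can be binned by decade. The function name and the ages are hypothetical.

```python
def bin_age(age):
    """Map an age in years to a decade bucket such as '20s'."""
    return f"{(age // 10) * 10}s"


# Hypothetical ages.
ages = [23, 37, 31, 58]
binned = [bin_age(a) for a in ages]
```

The ages 37 and 31 end up in the same bin, which trades precision for a simpler, categorical feature.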


## Transforming


Transforming refers to processes that modify the structure or format of data. This may include operations such as logarithm, exponentiation, or applying other mathematical functions to make data more suitable for statistical analysis or to meet the requirements of certain machine learning models.


## Encoding


Encoding is the process of converting categorical data into a numerical form, allowing machine learning algorithms to work with this data. There are several encoding techniques, such as one-hot encoding and label encoding, which are used to represent categories in an appropriate way for analysis or modeling.
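
The two techniques named above can be sketched in plain Python. The function names and the genre values are hypothetical, and categories are ordered alphabetically as an arbitrary choice.

```python
def label_encode(values):
    """Map each category to an integer index."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]


def one_hot(values):
    """Map each category to a vector with a single 1 in its own column."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical categorical column.
genres = ["drama", "comedy", "drama", "action"]
mapping, labels = label_encode(genres)
categories, vectors = one_hot(genres)
```

Label encoding is compact but implies an order between categories, while one-hot encoding avoids that at the cost of one column per category.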


## Scaling


Scaling is the process of adjusting the scale of data, usually by normalizing or standardizing values. This is important to ensure that different features or variables have the same influence on machine learning models, preventing one feature from dominating the others due to different scales.
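
Standardization, one common form of scaling, can be sketched as follows. The function name and the values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales.
ages = [20, 30, 40]
incomes = [20000, 30000, 40000]
scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
```

After scaling, both columns contain the same values even though the raw incomes were a thousand times larger, so neither feature dominates a distance-based model.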


## Shuffling


Shuffling involves the random reorganization of observations or samples in a data set. This is often used during data preparation or model training to ensure that samples do not follow a specific order that could introduce bias or unwanted patterns into the analysis or model.