# Machine Learning Assignment - 04

# 1. What are the key tasks involved in getting ready to work with machine learning modeling?

The key tasks involved in getting ready to work with machine learning modeling are:

Define the problem and set goals: Identify the business problem to be solved and set clear goals for the model.

Data collection and preparation: Gather the necessary data and clean, preprocess and transform it into a suitable format for modeling.

Exploratory data analysis (EDA): Analyze the data to identify patterns, relationships, and anomalies, which can inform feature selection and model building.

Feature engineering: Select the most relevant features to include in the model and create new features, if necessary.

Split the data into training and testing sets: Divide the data into two sets, one for training the model and the other for evaluating its performance.

Select an appropriate model: Choose a model that is appropriate for the problem and data, and train it on the training data.

Evaluate model performance: Evaluate the model on the test data to assess its accuracy and make any necessary improvements.

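Fine-tune and optimize the model: Refine the model based on the evaluation results and tune its hyperparameters to optimize performance, ideally using a validation set or cross-validation so that the test set remains an unbiased final check.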
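
The splitting step above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a definitive implementation: the function name `train_test_split`, the 80/20 ratio, the seed, and the toy dataset are all assumptions made for the example. In practice a library helper (for instance scikit-learn's function of the same name) would normally be used.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy dataset: (feature, label) pairs.
data = [(x, x % 2) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting avoids an accidental ordering in the data (for example, sorted by date or label) leaking into one of the two sets.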





# 2. What are the different forms of data used in machine learning? Give a specific example for each of them.

In machine learning, there are several forms of data used, including:

Numerical data: Data that consists of numbers, such as continuous or discrete variables. For example, the height, weight, or age of an individual.

Categorical data: Data that consists of categories or labels, such as the gender of an individual.

Ordinal data: Data that has an order or ranking, such as the level of education (high school, bachelor's, master's, etc.).

Binary data: Data that has only two categories, such as a yes/no answer.

Time-series data: Data that has a time component, such as stock prices over time.

Text data: Data that consists of words, such as product reviews or news articles.

Image data: Data that consists of pixels, such as photographs or drawings.

Audio data: Data that consists of sound, such as speech or music.

Video data: Data that consists of images with a temporal component, such as movies or TV shows.

# 3. Distinguish:

1. Numeric vs. categorical attributes

2. Feature selection vs. dimensionality reduction

# 1. Numeric vs. categorical attributes

Numeric and categorical attributes are two different types of data that can be used in machine learning.

Numeric attributes are variables whose values are numbers, such as height, weight, or age. They can be continuous (any value within a range, such as height) or discrete (countable values, such as the number of children). Numeric data can be used for regression or classification tasks, and most machine learning algorithms can consume it directly, although scaling is often helpful.

Categorical attributes, on the other hand, are variables whose values are categories or labels. They can be nominal (no inherent order) or ordinal (ordered). For example, a person's gender is a nominal attribute (for example, male or female), while a person's education level is an ordinal attribute with ordered categories (high school, bachelor's, master's, etc.). Most machine learning algorithms require categorical data to be converted into numbers first, using techniques such as one-hot encoding for nominal attributes or ordinal encoding for ordered ones.
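
The two encodings named above can be sketched as follows. This is an illustrative sketch in plain Python; the helper names `one_hot_encode` and `ordinal_encode` and the `colors` data are hypothetical, and the education levels are taken from the example in the text.

```python
def one_hot_encode(values):
    """Map each value to a 0/1 indicator vector with one slot per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def ordinal_encode(values, order):
    """Map each value to its position in an explicitly supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

# Nominal attribute (hypothetical data): no order, so use one-hot encoding.
colors = ["red", "green", "blue", "green"]
encoded, categories = one_hot_encode(colors)
# categories -> ['blue', 'green', 'red']
# encoded    -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]

# Ordinal attribute: the order carries meaning, so encode the rank.
education = ["bachelor's", "high school", "master's"]
levels = ordinal_encode(education, ["high school", "bachelor's", "master's"])
# levels -> [1, 0, 2]
```

One-hot encoding avoids implying an order that does not exist, while ordinal encoding deliberately preserves one.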

# 2. Feature selection vs. dimensionality reduction

Feature selection and dimensionality reduction are two related, but distinct, techniques used in the preprocessing of data for machine learning.

Feature selection involves choosing a subset of the original features in a dataset for use in the modeling process; the selected features themselves are left unchanged. This is done to improve the performance of the model, reduce overfitting, and increase the interpretability of the model. Feature selection can be done manually, based on domain knowledge, or automatically, using scores such as model-based feature importance or mutual information with the target.

Dimensionality reduction, on the other hand, involves transforming the features of a dataset into a lower-dimensional representation made up of new features, each typically a combination of the original ones. This is done to remove redundant information, improve computational efficiency, and reduce overfitting, at some cost to interpretability. Common techniques include principal component analysis (PCA) and singular value decomposition (SVD); t-distributed stochastic neighbor embedding (t-SNE) is used mainly for visualization.

Both feature selection and dimensionality reduction can be useful steps in preparing data for machine learning, but the choice of which to use, and when, depends on the specific problem and dataset.
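
To make the feature selection side concrete, here is a minimal sketch of one simple automatic filter: dropping features whose variance does not exceed a threshold. The variance criterion, the helper name `select_by_variance`, and the toy columns are assumptions chosen for illustration; they are not the only (or necessarily the best) way to select features.

```python
from statistics import pvariance

def select_by_variance(columns, threshold=0.0):
    """Keep only the named columns whose population variance exceeds threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

# Hypothetical dataset stored column by column.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "is_adult": [1, 1, 1, 1],        # constant column: carries no information
    "age": [21, 35, 48, 62],
}
selected = select_by_variance(features)
# list(selected) -> ['height_cm', 'age']
```

Note that the surviving columns are the original ones, untouched. A dimensionality reduction method would instead output new columns computed from all of the inputs.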








![Dimensionality reduction technique](attachment:dimensionality-reduction-technique%20%281%29.jpg)

# 4. Make quick notes on any two of the following:

1. The histogram

2. Use a scatter plot

3. PCA (Principal Component Analysis)

# 1. The histogram

A histogram is a graphical representation of the distribution of a set of numerical data. It is a type of bar graph that shows the frequency of the data points within a range of values called "bins". Each bin is represented by a bar, and the height of the bar represents the number of data points within that bin.

Histograms are used to visualize the distribution of the data and to identify patterns or anomalies. For example, a histogram can show if the data is normally distributed, positively skewed, or negatively skewed. It can also show if there are any outliers or gaps in the data.

Histograms are commonly used in exploratory data analysis (EDA) to gain a better understanding of the data before building a machine learning model. They can also be used to identify potential issues with the data, such as skewness or outliers, which may need to be addressed in the preprocessing stage.
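
The binning described above can be sketched without any plotting library. This is a minimal illustration that assumes equal-width bins; the function name `histogram_counts` and the `ages` data are hypothetical.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it
        # into the last bin (making that bin closed on the right).
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges

ages = [21, 22, 25, 31, 33, 34, 35, 47, 58, 61]
counts, edges = histogram_counts(ages, n_bins=4)
# counts -> [3, 4, 1, 2]
# edges  -> [21.0, 31.0, 41.0, 51.0, 61.0]

coarse_counts, _ = histogram_counts(ages, n_bins=2)
# coarse_counts -> [7, 3]
```

The same data looks quite different with 4 bins and with 2 bins, which is why the number of bins should be chosen with care.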





# 2. Use a scatter plot

A scatter plot is a graphical representation of the relationship between two numerical variables. It is used to visualize the relationship between two variables by plotting each data point as a point in a 2D plane. The position of each point on the x-axis represents the value of one variable, while the position of each point on the y-axis represents the value of the other variable.

Scatter plots are used to identify the relationship between two variables, such as whether they are positively or negatively correlated, or if there is no relationship at all. For example, a scatter plot can show if there is a linear relationship between two variables, or if there is a non-linear relationship.

Scatter plots are also useful in identifying outliers and other patterns in the data. For example, a scatter plot can show if there are any clusters of data points or if there are any data points that are significantly different from the rest of the data.

Scatter plots are commonly used in exploratory data analysis (EDA) to gain a better understanding of the relationship between two variables and to identify potential issues with the data. They can also be used to visualize the results of machine learning models, such as by plotting the predicted values against the actual values.









# 3. PCA (Principal Component Analysis)

PCA stands for Principal Component Analysis, which is a statistical technique used for dimensionality reduction and feature extraction in machine learning.

PCA works by transforming the original features of a dataset into a new set of features, called "principal components", which are linear combinations of the original features, are orthogonal to each other, and are ordered by how much of the variance in the data they capture. The first principal component captures the most variance, the second captures the most of what remains, and so on.

PCA is commonly used to reduce the number of features in a dataset and to remove redundant or irrelevant information. By using only the first few principal components, which capture most of the important information in the data, PCA can significantly reduce the complexity of the data and improve the performance of machine learning algorithms.

PCA can also be used for visualization purposes, such as by projecting high-dimensional data onto a 2D or 3D scatter plot to reveal patterns and relationships in the data.

Overall, PCA is a powerful tool for data preprocessing and feature extraction, and is widely used in many areas of machine learning, such as computer vision, speech recognition, and natural language processing.
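
The procedure described above can be sketched with NumPy. This is a minimal illustration, not a production implementation: it assumes NumPy is available, computes the components through a singular value decomposition of the mean-centered data, and uses synthetic data made up for the example. The function name `pca` is hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered features
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    return X_centered @ components.T, components, ratio

# Synthetic example: the second feature is almost a multiple of the first,
# so a single component should capture nearly all of the variance.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

projected, components, ratio = pca(X, n_components=1)
print(projected.shape)   # (200, 1)
print(ratio)             # share of total variance kept by the first component
```

The explained-variance ratio is a common guide for deciding how many components to keep.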







# 5. Why is it necessary to investigate data? Is there a discrepancy in how qualitative and quantitative data are explored?

Investigating data is necessary because it helps to gain a better understanding of the data, identify potential issues or problems with the data, and improve the performance of machine learning algorithms.

Exploratory data analysis (EDA) is the process of investigating data to understand its structure, distribution, relationships, and anomalies. EDA can be done through visualizations, such as histograms, scatter plots, and box plots, as well as through statistical measures, such as mean, median, and standard deviation.

There is a difference in the way that qualitative and quantitative data are explored. Qualitative (categorical) data, whether nominal or ordinal, is often explored through bar charts, pie charts, or frequency tables. These show the frequency of each category and help to identify patterns or anomalies, such as rare or misspelled categories.

Quantitative (numerical) data, whether discrete or continuous, is often explored through histograms, scatter plots, and box plots. These visualizations can show the distribution of the data and help to identify outliers or skewness in the data.

In general, it is important to perform EDA on both qualitative and quantitative data to gain a comprehensive understanding of the data and to identify any issues that need to be addressed before building a machine learning model.
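
As a rough sketch of the difference, the snippet below summarizes one qualitative column with a frequency table and one quantitative column with summary statistics. It uses only the standard library, and the `sports` and `ages` data are hypothetical.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical records: one qualitative and one quantitative column.
sports = ["cricket", "football", "cricket", "tennis", "cricket", "football"]
ages = [23, 31, 27, 45, 29, 38]

# Qualitative data: count how often each category occurs.
frequency_table = Counter(sports)
# frequency_table.most_common() -> [('cricket', 3), ('football', 2), ('tennis', 1)]

# Quantitative data: describe center and spread.
numeric_summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),
}
print(frequency_table.most_common())
print(numeric_summary)
```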





# 6. What are the various histogram shapes? What exactly are 'bins'?

The various histogram shapes depend on the distribution of the data. Common histogram shapes include:

Normal: A histogram that is bell-shaped and symmetrical, with the majority of the data points concentrated in the middle and tapering off towards the ends.

Skewed right: A histogram that has a long tail to the right, indicating that the data has a positive skew, with most of the data points concentrated on the left and a few data points on the right.

Skewed left: A histogram that has a long tail to the left, indicating that the data has a negative skew, with most of the data points concentrated on the right and a few data points on the left.

Multi-modal: A histogram that has multiple peaks, indicating that the data has multiple clusters or groups.

"Bins" in a histogram refer to the intervals or ranges into which the data is divided for the purpose of constructing the histogram. Each bin is represented by a bar in the histogram, and the height of the bar represents the number of data points that fall within that bin. The number of bins and the size of the bins can affect the appearance of the histogram, so it is important to choose appropriate bin sizes for the data.






# 7. How do we deal with data outliers?

There are several methods for dealing with data outliers in a dataset:

Removing outliers: Outliers can be removed from the dataset if they are identified as genuine errors or extreme values that do not represent the majority of the data. This method is appropriate if the outliers are not representative of the underlying pattern in the data.

Replacing outliers: Outliers can be replaced with more representative values, such as the mean, median, or a quantile of the data. This method is appropriate if the outliers are affecting the summary statistics of the data, such as the mean or standard deviation.

Clipping outliers: Outliers can be clipped or capped to a maximum or minimum value, such as a multiple of the standard deviation from the mean. This method is appropriate if the outliers are affecting the range of the data.

Winsorizing outliers: Values beyond chosen percentiles are replaced with the values at those percentiles; for example, everything below the 5th percentile is set to the 5th-percentile value and everything above the 95th percentile is set to the 95th-percentile value. This is a form of clipping in which the limits come from the data's own percentiles rather than from a fixed threshold.

The choice of method for dealing with outliers depends on the nature of the outliers and the purpose of the analysis. It is important to carefully consider the effect of outliers on the data and the consequences of removing or transforming the outliers.
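
A minimal sketch of three of these options follows. Several choices here are assumptions made for illustration: outliers are flagged with the common 1.5 × IQR rule (the text does not prescribe a detection rule), winsorizing is shown in its usual percentile-based form with 10% trimmed from each end, and the `incomes` data and helper names are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, lower, upper):
    """Drop every value outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def clip_outliers(values, lower, upper):
    """Cap every value to the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def winsorize(values, fraction=0.1):
    """Replace the lowest and highest `fraction` of values with the nearest kept value."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 250]
lower, upper = iqr_bounds(incomes)

removed = remove_outliers(incomes, lower, upper)   # 250 is dropped
clipped = clip_outliers(incomes, lower, upper)     # 250 is capped at the upper fence
winsorized = winsorize(incomes)
# winsorized -> [32, 32, 35, 36, 38, 40, 41, 43, 45, 45]
```

Removal changes the number of rows, while clipping and winsorizing keep every row and only alter the extreme values.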






# 8. What are the various central inclination measures? Why does mean vary too much from median in certain data sets?

Central inclination measures, more commonly called measures of central tendency, are summary statistics that describe the center or typical value of a dataset. The most common ones are:

Mean: The mean or average of a dataset is calculated by summing all the values and dividing by the number of values. The mean is a good measure of central tendency when the data is symmetrical and the outliers are not extreme.

Median: The median of a dataset is the middle value when the data is sorted (or the average of the two middle values when the number of values is even). The median is a good measure of central tendency when the data is skewed or has extreme outliers.

Mode: The mode of a dataset is the value that appears most frequently. It is the only one of the three measures that applies to categorical data, and a dataset can have more than one mode, in which case it is called multimodal.

The mean and median can differ substantially in certain datasets because the mean is sensitive to extreme values or outliers, while the median is not. When the data is skewed or has extreme outliers, the mean is pulled in the direction of the tail, while the median remains close to the middle of the data. For example, in a dataset with a small number of very high values, the mean will be higher than the median, while in a dataset with a small number of very low values, the mean will be lower than the median. In such cases, the median is a better indicator of the central tendency of the data than the mean.
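
A small worked example illustrates the effect. The salary figures are made up for illustration.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [1000]       # one extreme high value

print(mean(salaries), median(salaries))            # mean 35, median 35
print(mean(with_outlier), median(with_outlier))    # mean about 195.83, median 36.5
```

A single extreme value moves the mean from 35 to roughly 196, while the median only shifts from 35 to 36.5.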






# 9. Describe how a scatter plot can be used to investigate bivariate relationships. Is it possible to find outliers using a scatter plot?

A scatter plot is a graphical representation of the relationship between two variables. It is used to investigate the relationship between two variables and can help to identify patterns, trends, and outliers in the data.

In a scatter plot, each data point is represented by a dot and the x-axis and y-axis represent the two variables being plotted. If the two variables have a positive relationship, the data points will form a pattern that slopes upwards from left to right. If the two variables have a negative relationship, the data points will form a pattern that slopes downwards from left to right. If the two variables are uncorrelated, the data points will be scattered randomly around the plot.
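
The direction of the pattern seen in a scatter plot can be backed up with a number. The sketch below computes the Pearson correlation coefficient, which is close to +1 for an upward-sloping linear pattern, close to -1 for a downward-sloping one, and near 0 when there is no linear relationship. Using Pearson's coefficient as a numeric companion to the plot is an assumption of this example, and the data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either variable is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]    # rises with hours: points slope upwards
errors = [9, 8, 6, 5, 3, 2]          # falls with hours: points slope downwards

r_up = pearson_r(hours, scores)
r_down = pearson_r(hours, errors)
print(round(r_up, 3), round(r_down, 3))
```

Pearson's coefficient only measures linear association, so a strong curved relationship can still produce a value near 0; the plot itself remains the better first look.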

Outliers in a scatter plot are data points that fall outside of the overall pattern of the data. They can be identified as data points that are far away from the majority of the data points, or data points that are not aligned with the overall trend of the data. Outliers can be important because they can indicate errors in the data, or they can represent interesting or unusual observations that need further investigation.

In conclusion, a scatter plot is a useful tool for investigating bivariate relationships, and yes, it can be used to find outliers: they show up as points that sit apart from the main cloud of points or from the overall trend.





# 10. Describe how cross-tabs can be used to figure out how two variables are related.

Cross-tabs, also known as contingency tables or cross-tabulations, are a tool used to investigate the relationship between two categorical variables. A cross-tab is a table that summarizes the frequency or count of data points that fall into each combination of categories for two variables.

For example, if we have two variables, "gender" and "preferred sport," a cross-tab can be used to determine how many people in the data prefer each sport for each gender. The resulting cross-tab would show the count or frequency of each combination of sport and gender, and would provide a summary of the relationship between the two variables.

Cross-tabs are useful for identifying patterns in the data, such as whether the distribution of one variable changes across the categories of the other, which suggests that the two variables are associated. A chi-square test of independence can then be used to check whether the association is statistically significant. Cross-tabs can also reveal discrepancies or anomalies, such as unexpected or impossible category combinations that point to errors in the data.

In conclusion, cross-tabs are a useful tool for investigating the relationship between two categorical variables, and can be used to identify patterns, relationships, discrepancies, and anomalies in the data.
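
The "gender" and "preferred sport" example above can be sketched with a small counting helper. This is a plain-Python illustration; the function name `crosstab` and the six records are hypothetical. With pandas installed, `pandas.crosstab` produces the same kind of table.

```python
from collections import Counter

def crosstab(row_values, col_values):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    # Counter returns 0 for pairs that never occur, so empty cells are filled in.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

gender = ["female", "male", "female", "male", "male", "female"]
sport = ["tennis", "cricket", "tennis", "football", "cricket", "cricket"]

table = crosstab(gender, sport)
# table["female"] -> {'cricket': 1, 'football': 0, 'tennis': 2}
# table["male"]   -> {'cricket': 2, 'football': 1, 'tennis': 0}
```

Reading across a row shows how one group's preferences are distributed; comparing rows shows whether that distribution differs between groups.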