# Questions

1. What are the key tasks that machine learning entails? What does data pre-processing imply?
2. Describe quantitative and qualitative data in depth. Make a distinction between the two.
3. Create a basic data collection that includes some sample records. Have at least one attribute from each of the machine learning data types.

4. What are the various causes of machine learning data issues? What are the ramifications?

5. Demonstrate various approaches to categorical data exploration with appropriate examples.

6. How would the learning activity be affected if certain variables have missing values? Having said that, what can be done about it?

7. Describe the various methods for dealing with missing data values in depth.

8. What are the various data pre-processing techniques? Explain dimensionality reduction and feature selection in a few words.

9.

     i. What is the IQR? What criteria are used to assess it?

    ii. Describe the various components of a box plot in detail. When will the lower whisker surpass the upper whisker in length? How can box plots be used to identify outliers?

10. Make brief notes on any two of the following:

    1. Data collected at regular intervals

    2. The gap between the quartiles

    3. Use a cross-tab

11. Make a comparison between:

    1. Data with nominal and ordinal values

    2. Histogram and box plot

    3. The average and median

# Ans 1

The key tasks in machine learning include:

1. Data Collection: Gathering relevant data from various sources to build a dataset for analysis and modeling.

2. Data Preprocessing: Preparing and cleaning the data for analysis, which involves handling missing values, dealing with outliers, normalizing or scaling data, and encoding categorical variables.

3. Feature Engineering: Creating new features or selecting relevant features from the dataset to improve model performance.

4. Model Selection: Choosing an appropriate machine learning algorithm or model based on the problem type, data characteristics, and desired outcome.

5. Model Training: Using the prepared dataset to train the selected model, adjusting the model's parameters or weights to minimize errors or maximize performance.

6. Model Evaluation: Assessing the trained model's performance using evaluation metrics and validation techniques to determine its accuracy, precision, recall, or other relevant measures.

7. Model Deployment: Implementing the trained model into a production environment for making predictions or decisions on new, unseen data.

Data preprocessing refers to the steps taken to transform raw data into a format suitable for analysis and modeling. It involves data cleaning (handling missing values and outliers), data integration (combining data from multiple sources), data transformation (scaling and normalization), and data reduction (dimensionality reduction and feature selection). The goal is to improve data quality, remove inconsistencies, reduce noise, and extract relevant information.
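
As a rough illustration of the cleaning, transformation, and encoding steps described above, here is a minimal sketch using only the Python standard library. The `records` data and the attribute names `age` and `color` are made up for this example, and mean imputation and min-max scaling are just one possible choice for each step.

```python
from statistics import mean

# Hypothetical raw records: one numeric and one categorical attribute.
records = [
    {"age": 25.0, "color": "Red"},
    {"age": None, "color": "Blue"},
    {"age": 35.0, "color": "Green"},
    {"age": 45.0, "color": "Red"},
]

# 1. Cleaning: replace a missing age with the mean of the observed ages.
observed = [r["age"] for r in records if r["age"] is not None]
fill = mean(observed)
ages = [r["age"] if r["age"] is not None else fill for r in records]

# 2. Transformation: min-max scale the ages into the range [0, 1].
lo, hi = min(ages), max(ages)
scaled_ages = [(a - lo) / (hi - lo) for a in ages]

# 3. Encoding: one-hot encode the categorical attribute
#    (one 0/1 column per category, in sorted category order).
categories = sorted({r["color"] for r in records})
one_hot = [[int(r["color"] == c) for c in categories] for r in records]

print(scaled_ages)
print(categories, one_hot)
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the plain-Python version is only meant to make each transformation visible.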

# Ans 2

Quantitative and qualitative data are two types of data used in machine learning:

1. Quantitative data: Quantitative data represents numerical values that can be measured or counted. It is typically continuous or discrete in nature. Examples include age, height, temperature, sales revenue, or number of customers. Quantitative data allows for mathematical operations and can be further classified as interval (e.g., temperature in Celsius) or ratio (e.g., number of sales).

2. Qualitative data: Qualitative data represents non-numerical or categorical information that describes characteristics or attributes. Its values are labels rather than measurements, so arithmetic on them is not meaningful. Examples include gender, color, product category, or customer satisfaction level. Qualitative data can be further classified as nominal (no inherent order, e.g., eye color) or ordinal (ordered categories, e.g., a rating scale).

# Ans 3

Example of a basic data collection with different machine learning data types:

| Record | Numeric | Categorical | Textual | Image | Time Series |
|--------|---------|-------------|---------|-------|-------------|
| 1 | 25.7 | Red | "Hello" | Digital image of a cat | Temperature |
| 2 | 18.2 | Blue | "World" | Digital image of a dog | Stock prices |
| 3 | 33.1 | Green | "Machine" | Digital image of a landscape | Website traffic |
| 4 | 14.5 | Red | "Learning" | Digital image of a flower | Sales per month |

# Ans 4

Various causes of machine learning data issues include:

1. Missing Data: Data may have missing values due to various reasons such as data entry errors, data corruption, or incomplete data collection. Missing data can lead to biased or inaccurate results.

2. Outliers: Outliers are extreme values that deviate significantly from the overall pattern of the data. They can distort analysis, affect model performance, and introduce bias.

3. Imbalanced Data: Imbalanced data occurs when the classes or categories in the dataset are not represented equally. It can impact the model's ability to learn patterns from minority classes.

4. Noisy Data: Noisy data contains errors, inconsistencies, or irrelevant information that can mislead the learning process and affect model accuracy.

# Ans 5

Approaches to categorical data exploration:

1. Frequency Distribution: Analyze the frequency or count of each category to understand the distribution and identify dominant or rare categories.

2. Bar Plot: Visualize the frequency or proportion of each category using a bar chart to compare and rank the categories.

3. Cross-Tabulation: Create cross-tabulation or contingency tables to explore the relationship between two categorical variables and identify associations or dependencies.

4. Chi-Square Test: Perform a chi-square test to assess the independence or significance of association between categorical variables.
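
The frequency distribution, cross-tabulation, and chi-square statistic listed above can be sketched in plain Python as follows. The `rows` data (color and size pairs) is invented for illustration. The sketch computes only the chi-square statistic, not a p-value, and with expected counts this small the chi-square approximation would not be reliable on real data.

```python
from collections import Counter

# Hypothetical survey records: (color, size) pairs.
rows = [
    ("Red", "S"), ("Red", "L"), ("Blue", "S"), ("Red", "S"),
    ("Blue", "L"), ("Blue", "L"), ("Red", "S"), ("Blue", "L"),
]

# Frequency distribution of a single categorical variable.
color_counts = Counter(color for color, _ in rows)
size_counts = Counter(size for _, size in rows)

# Cross-tabulation: count of each (color, size) combination.
crosstab = Counter(rows)

# Chi-square statistic for independence: sum over all cells of
# (observed - expected)^2 / expected, where the expected count is
# row_total * column_total / n.
n = len(rows)
chi_square = 0.0
for color, color_total in color_counts.items():
    for size, size_total in size_counts.items():
        expected = color_total * size_total / n
        observed = crosstab[(color, size)]
        chi_square += (observed - expected) ** 2 / expected

print(color_counts)
print(crosstab)
print(chi_square)
```

To decide whether the association is significant, the statistic would be compared against a chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom, which is 1 for this 2 × 2 table.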

# Ans 6

Missing values affect the learning activity because they reduce the amount of data available for training the model. They can also introduce bias or distort the learned patterns, particularly when the values are not missing at random. It is therefore important to handle missing values appropriately.

To deal with missing values, several techniques can be employed:

1. Removal: If the proportion of missing values is small and randomly distributed, the records or variables with missing values can be removed from the dataset. However, this approach may result in loss of information.

2. Imputation: Missing values can be imputed or replaced with estimated values. Common imputation methods include mean imputation, median imputation, mode imputation, or regression imputation, depending on the nature of the data and the missing value patterns.

3. Advanced Techniques: Advanced imputation methods, such as multiple imputation or matrix completion algorithms, can be employed to handle complex missing value scenarios.

# Ans 7

Various methods for dealing with missing data values include:

1. Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the available data.

2. Forward/Backward Fill: Propagate the last observed value forward or backward to fill missing values in time series data.

3. Hot Deck Imputation: Replace missing values with similar values from other similar records or neighboring observations.

4. Regression Imputation: Predict missing values using regression models based on other variables in the dataset.

5. Multiple Imputation: Generate multiple imputed datasets by estimating missing values multiple times using statistical techniques and combining the results for analysis.
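
The two simplest methods in the list, mean/median/mode imputation and forward fill, might look like this in plain Python. The function names `impute` and `forward_fill` and the `readings` data are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Propagate the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical sensor readings with two gaps.
readings = [10.0, None, 14.0, None, 30.0]

print(impute(readings, "mean"))
print(impute(readings, "median"))
print(forward_fill(readings))
```

Note how the mean is pulled toward the large reading of 30.0 while the median is not, which is one reason median imputation is often preferred for skewed data.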


# Ans 8

Data preprocessing techniques include data cleaning, data integration, data transformation, and data reduction. Two common data reduction techniques are:

1. Dimensionality Reduction: It aims to reduce the number of features or variables in the dataset while preserving the most important information. Techniques like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) can be used to reduce dimensionality.

2. Feature Selection: It involves selecting a subset of relevant features from the dataset that have the most predictive power. Techniques like Recursive Feature Elimination (RFE), L1 Regularization (Lasso), or correlation analysis can be used for feature selection.
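
As one concrete instance of feature selection, the sketch below keeps only the features whose Pearson correlation with the target is strong enough. The helper names `pearson` and `select_features`, the 0.5 threshold, and the toy dataset are all assumptions made for this example. It is a simple filter method, not RFE or Lasso.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

# Hypothetical dataset: two candidate features and an exam-score target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 9, 8, 9, 7],
}
target = [52, 58, 61, 70, 74]

selected = select_features(features, target)
print(selected)
```

Correlation only captures linear relationships with the target, so a filter like this can discard features that are useful in combination or in a nonlinear way.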


# Ans 9

i. IQR (Interquartile Range): It is a measure of statistical dispersion defined as the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile) of a dataset: IQR = Q3 - Q1. The IQR measures the spread of the middle 50% of the data. It is used to assess the variability of a distribution and to detect outliers.

ii. Components of a box plot:

1. Median: The line inside the box represents the median value, which divides the data into two equal halves.

2. Box: The box spans from the first quartile to the third quartile, so its length is the interquartile range (IQR) and it shows the spread of the central 50% of the data.

3. Whiskers: The whiskers extend from the box to the smallest and largest data points that lie within 1.5 times the IQR of the lower and upper quartiles, respectively. Their lengths therefore depend on the data distribution and can differ from each other.

4. Outliers: Points beyond the whiskers are considered outliers and plotted individually. They are represented as individual points or asterisks.

The lower whisker is longer than the upper whisker when the data is negatively (left) skewed, that is, when the values below the first quartile are spread over a wider range than the values above the third quartile.

Box plots identify outliers as the points plotted beyond the whiskers. By the usual convention, a value is an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
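
The quantities behind a box plot can be computed directly. The sketch below uses `statistics.quantiles` with the `inclusive` method. Quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values. The function name `box_plot_summary` and the sample data are assumptions for illustration.

```python
from statistics import quantiles

def box_plot_summary(data, k=1.5):
    """Return the quartiles, IQR, whisker ends, and outliers of a dataset."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme points still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

summary = box_plot_summary([1, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

For this data the fences are -2 and 14, so 30 is flagged as an outlier. The lower whisker runs from 4 down to 1 and the upper whisker from 8 up to 9, which shows that the two whiskers need not have the same length.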


# Ans 10

a. Data collected at regular intervals: Time series data refers to data collected at regular intervals over time. It can include various measurements such as temperature recorded every hour, stock prices captured daily, or website traffic tracked hourly.

b. The gap between the quartiles: The gap between the quartiles, also known as the interquartile range (IQR), measures the spread of the central 50% of the data. It provides insights into the variability or dispersion of the data distribution.

c. Cross-tab: A cross-tabulation or contingency table is a tabular representation that displays the relationship between two or more categorical variables. It shows the count or frequency of observations falling into different combinations of categories for each variable, allowing for the exploration of associations and dependencies between variables.


# Ans 11

a. Data with nominal and ordinal values:

Nominal data represents categories or labels with no inherent order, such as colors or product categories.

Ordinal data represents categories with a predefined order or hierarchy, such as rating scales (e.g., low, medium, high) or educational levels (e.g., elementary, high school, college). The order is meaningful, but the distance between adjacent categories is not.

b. Histogram and box plot:

Histogram: A histogram is a graphical representation that displays the distribution of numerical data by dividing it into intervals or bins along the x-axis and showing the frequency or count of observations falling into each bin on the y-axis. It provides insights into the shape, central tendency, and variability of the data distribution, although its appearance depends on the choice of bins.

Box Plot: A box plot, also known as a box-and-whisker plot, visually displays the distribution of numerical data through quartiles, median, and outliers. It provides insights into the central tendency, spread, skewness, and presence of outliers in the data. It is more compact than a histogram and makes it easier to compare several groups side by side, but it does not show details of the shape such as multiple peaks.

c. Average and median:

Average (Mean): The average is a measure of central tendency equal to the sum of all values divided by the number of values. It is sensitive to extreme values, so outliers can pull it away from the bulk of the data.

Median: The median is the middle value in a sorted dataset (or the average of the two middle values when the number of values is even). It is less affected by extreme values or outliers and provides a measure of the central position in the data distribution.
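
A small numeric check makes the difference between the mean and the median concrete. The `salaries` values are made up for this example.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands), first without and then with one extreme value.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [500]

print(mean(salaries), median(salaries))          # both are 35
print(mean(with_outlier), median(with_outlier))  # mean 112.5, median 36.5
```

Adding a single extreme value moves the mean from 35 to 112.5, while the median only shifts from 35 to 36.5.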





