# Questions

1. What are the key tasks involved in getting ready to work with machine learning modeling?
2. What are the different forms of data used in machine learning? Give a specific example for each of them.

3. Distinguish:

    1. Numeric vs. categorical attributes

    2. Feature selection vs. dimensionality reduction

4. Make quick notes on any two of the following:

    1. The histogram

    2. The scatter plot

    3. PCA (Principal Component Analysis)

5. Why is it necessary to investigate data? Is there a difference in how qualitative and quantitative data are explored?

6. What are the various histogram shapes? What exactly are 'bins'?

7. How do we deal with data outliers?

8. What are the various measures of central tendency? Why does the mean differ so much from the median in certain data sets?

9. Describe how a scatter plot can be used to investigate bivariate relationships. Is it possible to find outliers using a scatter plot?

10. Describe how cross-tabs can be used to figure out how two variables are related.

# Ans 1

The key tasks involved in getting ready to work with machine learning modeling include:
1. Data Collection: Gathering relevant data from various sources that will be used for training and testing the model.

2. Data Cleaning: Removing or handling missing values, outliers, and inconsistencies in the data to ensure its quality and reliability.

3. Data Preprocessing: Transforming the data into a suitable format for analysis, which may include normalization, scaling, encoding categorical variables, and handling imbalanced data.

4. Exploratory Data Analysis (EDA): Analyzing and visualizing the data to gain insights, understand patterns, identify relationships, and detect anomalies.

5. Feature Engineering: Creating new features or selecting relevant features from the existing data to enhance the model's performance.

6. Train-Test Split: Dividing the data into training and testing sets to evaluate the model's performance on unseen data.

7. Model Selection: Choosing the appropriate machine learning algorithm or model based on the problem type, data characteristics, and desired outcome.

8. Model Training and Evaluation: Training the selected model on the training data and evaluating its performance using suitable metrics and validation techniques.

9. Model Tuning: Optimizing the model's hyperparameters to improve its performance and generalization capability.
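
Two of the steps above, the train-test split and scaling, can be sketched in plain Python. This is a minimal illustration, not a full pipeline: the `ages` data, the function names, and the 25% test fraction are all assumptions made for the example. The point it shows is that scaling bounds are learned from the training set only and then reused on the test set, so no information leaks from the test data.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_min_max(values):
    """Learn the scaling bounds from the given (training) values."""
    return min(values), max(values)

def apply_min_max(values, bounds):
    """Rescale values with previously learned bounds (training range maps to [0, 1])."""
    low, high = bounds
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

# Hypothetical data: customer ages
ages = [22, 35, 58, 41, 29, 63, 47, 33]

train, test = train_test_split(ages, test_fraction=0.25, seed=42)
bounds = fit_min_max(train)                # fitted on training data only
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)  # test values may fall outside [0, 1]
```

In practice a library such as scikit-learn would normally handle both steps; the hand-written version only makes the order of operations visible.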

# Ans 2

Different forms of data used in machine learning include:

1. Numerical Data: Numerical data consists of quantitative values that represent measurements or counts. Example: Temperature readings, stock prices, or customer age.

2. Categorical Data: Categorical data represents qualitative variables that belong to specific categories or groups. Example: Gender (Male/Female), product categories (Electronics/Clothing), or customer ratings (Good/Average/Poor).

3. Textual Data: Textual data comprises unstructured text documents or strings of characters. Example: Customer reviews, tweets, or articles.

4. Image Data: Image data consists of visual information in the form of pixels, often represented as arrays or matrices. Example: Digital images, medical scans, or satellite imagery.

5. Time Series Data: Time series data represents observations recorded over time, typically at regular intervals. Example: Stock prices over months, temperature recordings over hours, or website traffic over days.
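
Most models need numbers as input, so categorical values like the ones listed above are usually encoded first. The sketch below shows two common encodings under simple assumptions: one-hot encoding for unordered categories (product category) and integer ranks for ordered ones (customer rating). The helper names and sample values are made up for illustration.

```python
def one_hot(values):
    """One-hot encode a nominal column; categories are sorted for a stable column order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal_encode(values, order):
    """Map ordered categories to integer ranks that follow the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

# Hypothetical sample columns
product = ["Electronics", "Clothing", "Electronics"]
rating = ["Good", "Poor", "Average"]

categories, product_encoded = one_hot(product)
rating_encoded = ordinal_encode(rating, order=["Poor", "Average", "Good"])

print(categories)        # column order of the one-hot matrix
print(product_encoded)
print(rating_encoded)
```

One-hot encoding avoids implying an order between product categories, while the rank encoding keeps the Poor < Average < Good ordering of the ratings.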

# Ans 3

Distinguishing:

Numeric vs. Categorical Attributes:

Numeric attributes are quantitative variables that represent measurements or counts. They can be continuous (e.g., height, weight) or discrete (e.g., number of siblings). Numeric attributes allow for mathematical operations and can have a meaningful order or magnitude.


Categorical attributes are qualitative variables that represent categories or groups. They can be binary (e.g., gender), nominal (e.g., color), or ordinal (e.g., rating scale). Their values are labels rather than quantities: nominal categories have no inherent order, while ordinal categories can be ranked, but the gaps between ranks are not meaningful for arithmetic.

Feature Selection vs. Dimensionality Reduction:

Feature selection involves selecting a subset of relevant features from the original set of variables. The goal is to reduce complexity, improve model performance, and interpretability. Feature selection methods can be based on statistical measures, domain knowledge, or model-based evaluations.
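
As one example of a selection method based on a statistical measure, the sketch below drops features whose variance does not exceed a threshold, since a constant column cannot help a model tell observations apart. The variance filter is just one simple choice among many, and the column names and values are hypothetical.

```python
from statistics import pvariance

def select_by_variance(columns, threshold=0.0):
    """Keep the columns whose population variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }

# Hypothetical feature columns
columns = {
    "age": [22, 35, 58, 41],
    "country_code": [1, 1, 1, 1],   # constant, so it carries no information
    "income_k": [30, 52, 75, 61],
}

selected = select_by_variance(columns, threshold=0.0)
print(sorted(selected))
```

The surviving columns are untouched originals, which is what keeps feature selection easy to interpret.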

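Dimensionality reduction aims to reduce the number of features while preserving the essential information. It transforms the original high-dimensional feature space into a lower-dimensional space. Unlike feature selection, the resulting features are new combinations of the originals rather than a subset of them, which can make them harder to interpret. Techniques like Principal Component Analysis (PCA) and t-SNE are commonly used for dimensionality reduction.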
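
To make the idea of a transformed feature space concrete, here is a minimal PCA sketch for the special case of two features reduced to one. It assumes the closed-form eigen-decomposition of a symmetric 2x2 covariance matrix, which does not generalize to more features (a library such as NumPy or scikit-learn would be used there). The function name and the sample points are illustrative.

```python
import math

def pca_2d(points):
    """Return (first principal axis, explained variance ratio, 1-D projections)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix in closed form
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest = half_trace + radius

    # Matching eigenvector; the diagonal case is handled separately
    if b != 0:
        axis = (largest - c, b)
    else:
        axis = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(axis[0], axis[1])
    axis = (axis[0] / norm, axis[1] / norm)

    explained = largest / (a + c) if (a + c) > 0 else 0.0
    projected = [x * axis[0] + y * axis[1] for x, y in centered]
    return axis, explained, projected

# Hypothetical points lying close to the line y = 2x
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
axis, explained, projected = pca_2d(points)
print(axis, explained)
```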
# Ans 4

Quick notes on:

The Histogram:

4.1 A histogram is a graphical representation that displays the distribution of a continuous or discrete variable. It consists of a series of bins or intervals on the x-axis and the frequency or count of observations falling within each bin on the y-axis. Histograms provide insights into the data's central tendency, spread, skewness, and outliers.

Scatter Plot:

4.2 A scatter plot is a graphical representation that shows the relationship between two continuous variables. Each data point is represented as a dot on the plot, with one variable on the x-axis and the other on the y-axis. Scatter plots help visualize the correlation or pattern between variables and identify any outliers or clusters in the data.

PCA (Principal Component Analysis):

4.3 PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space. It identifies the principal components, which are mutually uncorrelated linear combinations of the original features, ordered so that the first captures the most variance in the data. PCA is commonly used for visualization, feature extraction, and reducing computational complexity in machine learning.

Investigating data (question 5):

Investigating data before modeling is necessary to understand its characteristics, to spot patterns, relationships, and outliers, and to make informed decisions during the modeling process. Qualitative and quantitative data are both explored, but with different methods. Qualitative (categorical) data is usually summarized with frequency counts, proportions, and the mode, and free-text data may additionally call for coding, theme identification, or sentiment analysis. Quantitative data is explored with statistical measures such as the mean, median, and spread, and with plots such as histograms, box plots, and scatter plots, to uncover patterns, trends, and associations.

# Ans 5

Histogram shapes can vary and provide insights into the data distribution. Some common histogram shapes are:

1. Normal Distribution: The data is symmetrically distributed around the mean, forming a bell-shaped curve.

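2. Skewed Distribution: The data has a longer tail on one side. In a positively skewed (right-skewed) histogram the tail extends to the right and most values sit on the left; in a negatively skewed (left-skewed) histogram the tail extends to the left and most values sit on the right.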

3. Bimodal Distribution: The data has two distinct peaks, indicating the presence of two subgroups or modes.

Bins in a histogram represent the intervals or ranges into which the data is divided. They help visualize the distribution of data and determine the frequency or count of observations falling within each interval.
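
The binning step can be sketched without any plotting library. The example below assumes equal-width bins spanning the minimum to the maximum value, which is a common default but not the only option. The `ages` sample and the choice of four bins are made up, and the bars are printed as text.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum value belongs to the last bin rather than a new one.
            index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical sample with a longer tail on the right
ages = [21, 22, 23, 25, 28, 31, 34, 35, 36, 38, 41, 47, 52, 60]

edges, counts = histogram(ages, n_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

Changing `n_bins` changes the apparent shape: too few bins hide structure, while too many leave most bins nearly empty.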

# Ans 6

Dealing with data outliers can involve various approaches:

1. Removing outliers: In some cases, outliers can be removed from the dataset if they are considered erroneous or significantly deviate from the overall pattern.
2. Transforming data: Applying transformations like logarithmic or power transformations can reduce the impact of outliers and make the distribution more symmetric.
3. Winsorizing: Winsorizing caps extreme values instead of removing them, for example by replacing everything below the 1st percentile with the 1st-percentile value and everything above the 99th percentile with the 99th-percentile value.
4. Robust statistics: Robust statistical measures like the median or trimmed mean can be used instead of the mean to reduce the influence of outliers.
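
The approaches above first need a way to flag outliers. The sketch below assumes the common 1.5 x IQR rule, which is a convention rather than something prescribed here, and computes quartiles with `statistics.quantiles` using its default method. The winsorizing helper uses a count-based variant (cap the `k` smallest and `k` largest values) because percentile bounds are not very meaningful on ten data points. All data is hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def winsorize(values, k):
    """Cap the k smallest and k largest values at their nearest remaining neighbours."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical delivery times in days, with one extreme value
delivery_days = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

flagged = iqr_outliers(delivery_days)
capped = winsorize(delivery_days, k=1)
print(flagged)
print(capped)
```

Whether a flagged value should be removed, capped, or kept still depends on whether it is an error or a genuine extreme observation.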

# Ans 7

Measures of central tendency include:

1. Mean: The arithmetic average of a set of values. It is calculated by summing all the values and dividing by the number of observations. The mean is sensitive to outliers.
2. Median: The middle value of the ordered data, separating the higher half from the lower half. It is less sensitive to outliers.
3. Mode: The value or values that occur most frequently in a dataset. It can be used for categorical as well as discrete numerical data.

The mean can differ significantly from the median when the data contains outliers or is skewed. Every value, however extreme, contributes to the mean and pulls it toward the tail, while the median depends only on the middle of the ordered data and is barely affected.
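
A small numeric illustration, using Python's `statistics` module and made-up salary figures (in thousands), shows how a single extreme value moves the mean far more than the median.

```python
from statistics import mean, median, multimode

# Hypothetical salaries in thousands
salaries_k = [32, 35, 35, 38, 40, 42, 45]
with_outlier = salaries_k + [400]   # one executive salary added

for label, data in [("without outlier", salaries_k), ("with outlier", with_outlier)]:
    print(label, "mean:", round(mean(data), 2), "median:", median(data), "mode:", multimode(data))

# The mode also works for categorical data
colors = ["red", "blue", "red", "green"]
print(multimode(colors))
```

Adding the single value 400 raises the mean from about 38.1 to 83.375, while the median only moves from 38 to 39 and the mode stays at 35.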

# Ans 8

A scatter plot can be used to investigate bivariate relationships by plotting the values of two continuous variables on the x and y axes. It helps visualize the correlation, trend, or pattern between the variables. Outliers can be identified as data points that deviate significantly from the overall pattern in the scatter plot.
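
What the eye reads off a scatter plot can be approximated numerically. The sketch below computes the Pearson correlation for the overall trend and flags points that sit far from a least-squares line. The 2-standard-deviation cutoff on residuals is an assumption chosen for this example, not a standard rule, and the data is invented with one deliberately misplaced point.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(xs, ys, n_sigmas=2.0):
    """Indices of points whose residual exceeds n_sigmas residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if abs(r) > n_sigmas * spread]

# Hypothetical data: y is roughly 2x, except for the point at x = 6
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 30.0, 14.1, 16.0, 17.8, 20.1]

r_all = pearson_r(xs, ys)
outlier_indices = residual_outliers(xs, ys)
keep = [i for i in range(len(xs)) if i not in outlier_indices]
r_clean = pearson_r([xs[i] for i in keep], [ys[i] for i in keep])
print(r_all, outlier_indices, r_clean)
```

The single outlying point noticeably weakens the measured correlation, which is why it helps to look at the plot before trusting a summary number.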

# Ans 9

Cross-tabs, also known as contingency tables, are used to analyze the relationship between two categorical variables. They display the frequency or count of occurrences for each combination of categories from the two variables. Cross-tabs help identify associations, dependencies, or patterns between the variables and assess their relationship strength.
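
A contingency table can be built with nothing more than a counter over value pairs. The sketch below uses hypothetical subscription data (`plan` and `churned` are invented variables) and adds row proportions, which make the groups comparable even when their sizes differ.

```python
from collections import Counter

def crosstab(row_values, col_values):
    """Count co-occurrences of two categorical variables."""
    counts = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

def row_proportions(table):
    """Convert each row of counts into fractions of that row's total."""
    return [[cell / sum(row) for cell in row] for row in table]

# Hypothetical customer records
plan = ["Basic", "Basic", "Premium", "Premium", "Basic", "Premium", "Basic", "Premium"]
churned = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"]

row_labels, col_labels, table = crosstab(plan, churned)
proportions = row_proportions(table)

print("plan", col_labels)
for label, counts_row, props in zip(row_labels, table, proportions):
    print(label, counts_row, props)
```

In this made-up sample, 75% of Basic customers churned against 25% of Premium customers, which is the kind of association a cross-tab is meant to surface.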





