# Machine Learning Q&A Study Guide

This notebook contains essential machine learning concepts with explanations and examples.

## 1. What is a parameter?

In [None]:
# A parameter is a variable in a mathematical or statistical model whose value is estimated from data.
# In machine learning models, parameters are the internal variables that the algorithm adjusts during training.
# For example, in linear regression, parameters are the coefficients (weights) and the intercept.
# These values are optimized using data to minimize errors between predictions and actual results.
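
As a minimal sketch of the linear regression case, the snippet below fits scikit-learn's `LinearRegression` to toy data and reads back the learned parameters. The data values are made up for illustration and follow y = 2x + 1 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (values chosen only for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# The learned parameters: one weight per feature plus the intercept
weight = model.coef_[0]
intercept = model.intercept_
print(f"weight={weight:.2f}, intercept={intercept:.2f}")
```

The fitted values should recover the slope and intercept used to generate the data.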

## 2. What is correlation?

In [None]:
# Correlation is a statistical measure that quantifies the degree to which two variables move in relation to each other.
# It ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship), with 0 indicating no linear relationship.
# Correlation helps to identify whether and how strongly pairs of variables are related, but it does not imply causation.
# Commonly used correlation coefficients include Pearson, Spearman, and Kendall.
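
To make the definition concrete, here is a small sketch that computes the Pearson coefficient from its formula and checks it against NumPy's built-in. The `x` and `y` values are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # made-up values, roughly 2 * x

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# NumPy's built-in gives the same number (off-diagonal entry of the 2x2 matrix)
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```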

## 3. What does negative correlation mean?

In [None]:
# Negative correlation means that as one variable increases, the other decreases.
# The correlation coefficient value will be less than 0 (e.g., -0.8).
# For example, time spent watching TV and physical activity level are often negatively correlated: as TV time increases, physical activity typically decreases.
# Negative correlation does not necessarily mean causation; it only indicates an inverse relationship.
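
A quick sketch of the TV example, using hypothetical weekly figures for six people (illustrative numbers, not real measurements):

```python
import numpy as np

# Hypothetical weekly figures for six people (illustrative, not real data)
tv_hours = np.array([2, 5, 8, 12, 15, 20])
active_hours = np.array([9, 8, 6, 5, 3, 1])

r = np.corrcoef(tv_hours, active_hours)[0, 1]
print(f"r = {r:.2f}")  # negative: the variables move in opposite directions
```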

## 4. Define Machine Learning. What are the main components in Machine Learning?

In [None]:
# Machine Learning (ML) is a subfield of artificial intelligence in which algorithms learn patterns from data to make predictions or decisions.
# Instead of following hand-written rules for the task, an ML model learns its behavior from examples, and its performance typically improves as it is trained on more representative data.
# Main Components:
# 1. Data: The input collected for training and evaluation.
# 2. Features: Measurable properties or variables of data (attributes used for modeling).
# 3. Model: The mathematical representation for learning patterns from data.
# 4. Loss Function: Measures how well a model's predictions match actual values.
# 5. Optimizer: The algorithm used to update model parameters to minimize the loss (e.g., Gradient Descent).
# 6. Training: The process of adjusting the model parameters using data and the optimizer.
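
The six components can be seen together in one small sketch. It assumes a one-feature linear model trained with plain gradient descent on synthetic data; the learning rate and step count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Data and 2. Features: one input feature x, target y = 3x + 2 plus noise
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

# 3. Model: a line with two parameters
w, b = 0.0, 0.0

def predict(x, w, b):
    return w * x + b

# 4. Loss function: mean squared error
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

initial_loss = mse(y, predict(x, w, b))

# 5. Optimizer and 6. Training: plain gradient descent for a fixed number of steps
learning_rate = 0.1
for _ in range(500):
    error = predict(x, w, b) - y
    grad_w = 2 * np.mean(error * x)   # d(loss)/dw
    grad_b = 2 * np.mean(error)       # d(loss)/db
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(y, predict(x, w, b))
print(f"w={w:.2f}, b={b:.2f}, loss {initial_loss:.3f} -> {final_loss:.3f}")
```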

## 5. How does loss value help in determining whether the model is good or not?

In [None]:
# The loss value in a machine learning model quantitatively measures the difference between the actual target values and the model's predicted values.
# Lower loss values indicate predictions that are closer to the targets; losses are only comparable when computed with the same loss function on the same data.
# A high loss signals the model still makes large errors and may require further training, more data, or a better algorithm.
# Monitoring the loss on both the training and validation sets shows whether the model is still improving or starting to overfit (training loss keeps falling while validation loss rises).
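
As an illustration, the sketch below compares two hypothetical models by mean squared error on the same four targets. The prediction arrays are invented for the example.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Predictions from two hypothetical models on the same four examples
pred_a = np.array([2.8, 5.1, 7.2, 8.9])
pred_b = np.array([1.0, 6.5, 4.0, 12.0])

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

loss_a = mse(y_true, pred_a)
loss_b = mse(y_true, pred_b)
print(f"model A loss={loss_a:.3f}, model B loss={loss_b:.3f}")
```

Model A has the lower loss, so by this measure it fits these examples better.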

## 6. What are continuous and categorical variables?

In [None]:
# Continuous variables are numeric variables that can take an infinite number of values within a range (e.g., height, weight, temperature).
# They are often measured on an interval or ratio scale.
# Categorical variables represent distinct groups or categories (e.g., color, gender, country).
# These are usually measured on a nominal or ordinal scale and may need to be encoded before being used in machine learning models.
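
One way to separate the two kinds of variable in practice is by dtype. The sketch below uses pandas on a small made-up table; the column names are hypothetical.

```python
import pandas as pd

# Small made-up table mixing both kinds of variable
df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],      # continuous
    "weight_kg": [68.0, 59.3, 82.4],         # continuous
    "color": ["red", "blue", "red"],         # categorical (nominal)
    "size": ["small", "large", "medium"],    # categorical (ordinal)
})

# Mark the ordinal column with an explicit category order
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)
df["color"] = df["color"].astype("category")

continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()
print(continuous_cols, categorical_cols)
```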

## 7. How do we handle categorical variables in Machine Learning? What are the common techniques?

In [None]:
# Most machine learning algorithms require input features to be numeric.
# To handle categorical variables, we transform them into numerical representations.
# Common techniques:
# 1. Label Encoding: Assigns a unique integer to each category. The integers imply an order, so this suits ordinal data or target labels (scikit-learn's LabelEncoder is intended for targets).
# 2. One-hot Encoding: Creates one binary column per unique category (suitable for nominal data).
# 3. Ordinal Encoding: Assigns integers that follow a specified category order (e.g., low < medium < high).
# The choice of technique depends on the nature of the categorical variable and the algorithms used.
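
A short sketch of two of these techniques with scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The color and size values are made up, and the small < medium < large order is an assumption supplied by hand.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])       # nominal
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])    # ordinal

# One-hot: one binary column per category (columns sorted alphabetically)
onehot = OneHotEncoder()
color_encoded = onehot.fit_transform(colors).toarray()

# Ordinal: pass the category order explicitly so that small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal.fit_transform(sizes)

print(onehot.categories_[0])
print(color_encoded)
print(size_encoded.ravel())
```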

## 8. What do you mean by training and testing a dataset?

In [None]:
# Training a model on a dataset means fitting it to a set of examples so that it learns patterns from them.
# The training set contains input data and the associated correct outputs (labels).
# Testing means evaluating the trained model on new, unseen data to check how well it generalizes.
# This process helps to estimate the true performance of the model when deployed in real-world scenarios.

## 9. What is sklearn.preprocessing?

In [None]:
# sklearn.preprocessing is a module within scikit-learn that offers tools and functions for data preprocessing.
# It includes functions for scaling numerical features, normalizing data, encoding categorical features (e.g., LabelEncoder, OneHotEncoder), and transforming features (e.g., PolynomialFeatures).
# Preprocessing ensures that data is properly formatted and scaled for use in machine learning models, which can improve performance and convergence.
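
As one example of a feature transformer from this module, the sketch below expands a single two-feature sample with `PolynomialFeatures`. The input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3

# degree=2 expands [a, b] into [1, a, b, a^2, a*b, b^2]
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly)
```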

## 10. What is a Test set?

In [None]:
# A test set is the subset of the entire dataset that is held back and not used during the training phase.
# After a model is trained, the test set evaluates its predictive performance.
# The results on the test set help to estimate how well the model will perform on truly unseen data (in production).

## 11. How do we split data for model fitting (training and testing) in Python?

In [None]:
# Use train_test_split from sklearn.model_selection to randomly split data into training and testing sets.
# Example usage:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Here, 20% of the data is reserved for testing, and 80% is used for training.
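# Holding out a test set gives a less biased estimate of performance on unseen data; to avoid data leakage, also fit any preprocessing (e.g., scalers) on the training split only.
# Pass random_state (e.g., random_state=42) to make the split reproducible.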
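
A runnable version of the split, sketched on made-up arrays. The `random_state` and `stratify` arguments are optional additions here, not part of the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 made-up samples with 2 features and balanced binary labels
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class ratio the same in both splits
)
print(X_train.shape, X_test.shape)
```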

## 12. How do you approach a Machine Learning problem?

In [None]:
# The typical steps to solve a machine learning problem are:
# 1. Problem Definition: Clearly define the objective and determine what to predict.
# 2. Data Collection: Gather the relevant and sufficient data.
# 3. Data Cleaning: Handle missing values, remove duplicates, and correct errors.
# 4. Exploratory Data Analysis (EDA): Investigate the data to understand distributions, relationships, and anomalies.
# 5. Feature Engineering: Select, create, or transform features that help improve model performance.
# 6. Model Selection: Choose suitable algorithms based on the problem type and data characteristics.
# 7. Training: Fit the model to the training data.
# 8. Evaluation: Use metrics to evaluate model performance on test or validation data.
# 9. Hyperparameter Tuning: Optimize algorithm settings to improve results.
# 10. Deployment: Integrate the model into practical applications, monitor, and retrain as necessary.
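
The middle steps of this workflow might look like the following compact sketch. It assumes the iris dataset bundled with scikit-learn and a logistic regression model; both are illustrative choices, and EDA, tuning, and deployment are left out.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: the task is to classify iris species; the data ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Steps 5-7: scaling and the model are combined so both are fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 8: evaluate on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```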

## 13. Why do we have to perform EDA before fitting a model to the data?

In [None]:
# Exploratory Data Analysis (EDA) is a crucial step that involves visualizing and summarizing data before modeling.
# EDA helps identify patterns, trends, outliers, missing values, and relationships among variables.
# It aids in uncovering errors or anomalies that could affect modeling results.
# By understanding the data thoroughly, we can make better choices about feature engineering, model selection, and parameter tuning.
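
A few typical first EDA steps, sketched with pandas on a made-up table that contains one missing value and one extreme value. The 1.5 * IQR rule used to flag outliers is one common convention, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value and one suspicious outlier
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [40_000, 52_000, 61_000, 58_000, 45_000, 1_000_000],
})

summary = df.describe()          # count, mean, std, min, quartiles, max
missing = df.isna().sum()        # missing values per column

# Simple outlier flag: values more than 1.5 * IQR above the third quartile
q1, q3 = df["income"].quantile([0.25, 0.75])
outliers = df[df["income"] > q3 + 1.5 * (q3 - q1)]

print(summary)
print(missing)
print(outliers)
```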

## 14. What is correlation?

In [None]:
# Correlation is a numerical measure that describes how two variables move in relation to one another.
# Positive correlation means they increase or decrease together, while negative correlation means one increases as the other decreases.
# Correlation is commonly used for initial data analysis, feature selection, and understanding variable relationships in a dataset.

## 15. What does negative correlation mean?

In [None]:
# Negative correlation means that as one variable increases, the other variable tends to decrease, and vice versa.
# It is indicated by a correlation coefficient between -1 and 0 (e.g., -0.5); the closer to -1, the stronger the inverse relationship.
# Example: As outdoor temperature falls, the demand for heating likely rises, showing a negative relationship.
# Negative correlation does not mean that one variable causes the decrease, only that their values tend to move in opposite directions.

## 16. How can you find correlation between variables in Python?

In [None]:
# Correlation in Python is commonly calculated using the pandas library with the .corr() method.
# Example:
# import pandas as pd
# df = pd.DataFrame({...}) # your data
# correlation_matrix = df.corr(numeric_only=True)  # numeric_only=True skips non-numeric columns (available in pandas 1.5+)
# This matrix shows the pairwise correlation coefficients for the columns in the DataFrame.
# For a single pair, use df['col1'].corr(df['col2']).
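
Here is the same idea as a runnable sketch. The DataFrame contents and column names are made up for illustration.

```python
import pandas as pd

# Made-up study data
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 7, 6, 4, 2],
})

correlation_matrix = df.corr()  # Pearson by default
pair = df["hours_studied"].corr(df["exam_score"])

print(correlation_matrix.round(2))
print(round(pair, 3))
```

The matrix is symmetric with ones on the diagonal, so each pair appears twice.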

## 17. What is causation? Explain difference between correlation and causation with an example.

In [None]:
# Causation means one variable directly affects or brings about a change in another variable.
# Correlation, on the other hand, simply implies that two variables have a relationship, but one does not necessarily cause the other.
# For example: Ice cream sales and drowning deaths are correlated (both rise in summer), but eating ice cream does not cause drowning. Hot weather is a confounding variable that drives both.
# Therefore: Correlation does not imply causation.
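
The confounding effect can be reproduced in a small simulation. This is a sketch on synthetic data: a hypothetical `temperature` variable drives both quantities, and neither influences the other. Controlling for temperature (here by correlating the residuals of straight-line fits) should remove most of the apparent relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated world: temperature drives both quantities; neither affects the other
temperature = rng.normal(size=n)
ice_cream_sales = temperature + rng.normal(scale=0.5, size=n)
drownings = temperature + rng.normal(scale=0.5, size=n)

raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Remove the part of y that a straight-line fit on x explains."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after controlling for temperature (a partial correlation)
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw r = {raw_r:.2f}, after controlling for temperature r = {partial_r:.2f}")
```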

## 18. What is an Optimizer? What are different types of optimizers? Explain each with an example.

In [None]:
# An optimizer is an algorithm or method used to adjust the parameters of a machine learning model during training to minimize the loss function.
# It iteratively updates weights based on gradients computed from the loss.
# Common optimizers:
# 1. Gradient Descent: Moves parameters in the direction of steepest loss reduction using the gradient.
#    Example: optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# 2. Adam (Adaptive Moment Estimation): Maintains adaptive learning rates by using moving averages of gradient and squared gradient.
#    Example: optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# 3. RMSProp: Adapts learning rates based on a decaying average of squared gradients, often used for recurrent neural networks.
#    Example: optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)
# Choice of optimizer can impact training speed and model performance.
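
To show what the update rules do, here is a NumPy sketch of plain gradient descent and Adam minimizing the toy loss f(w) = (w - 3)^2. It is a hand-written illustration, not the PyTorch implementation; the learning rates and step counts are arbitrary, and the Adam betas are the commonly used defaults.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)

# Plain gradient descent: step against the gradient with a fixed learning rate
w_gd = 0.0
for _ in range(100):
    w_gd -= 0.1 * grad(w_gd)

# Adam: step size scaled by running averages of the gradient and squared gradient
w_adam, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g          # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g      # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"gradient descent: {w_gd:.4f}, Adam: {w_adam:.4f}")
```

Both runs should end near w = 3; on a problem this simple the choice of optimizer matters little.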

## 19. What is sklearn.linear_model?

In [None]:
# sklearn.linear_model is a module in Python's scikit-learn library that contains linear models for regression and classification tasks.
# Examples include LinearRegression, LogisticRegression, Ridge, and Lasso.
# These models are based on linear equations that relate features to targets and are used for problems where a linear relationship is assumed.
# Each model is an estimator class with a common interface: fit() to train, predict() to make predictions, and score() to evaluate.
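
A brief sketch comparing two estimators from this module on the same synthetic data. The `alpha=10.0` penalty strength is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 4.0 * X.ravel() + rng.normal(scale=0.5, size=50)  # made-up linear data

linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the strength of the L2 penalty

# Both estimators expose the same interface: fit, predict, score, coef_, intercept_
print(f"LinearRegression coef: {linear.coef_[0]:.3f}, R^2: {linear.score(X, y):.3f}")
print(f"Ridge coef:            {ridge.coef_[0]:.3f}, R^2: {ridge.score(X, y):.3f}")
```

Ridge shrinks the coefficient toward zero relative to ordinary least squares, which is the effect of its penalty term.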

## 20. What does model.fit() do? What arguments must be given?

In [None]:
# model.fit() is a method used to train a machine learning model on input features (X) and the corresponding target outputs (y).
# It adjusts the model parameters to best map inputs to outputs using the specified learning algorithm.
# Arguments:
# - X: The training input data (features)
# - y: The known labels (targets)
# Example:
# model.fit(X_train, y_train)
# For unsupervised models (e.g., clustering), y is not required: model.fit(X)
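
The difference between the supervised and unsupervised calls can be sketched as follows. The data is made up, and `KMeans` is used only as one example of an unsupervised estimator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])  # y = 2x

# Supervised: fit needs both the features and the targets
regressor = LinearRegression()
regressor.fit(X, y)

# Unsupervised: fit needs only the features
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X)

print(regressor.coef_, clusterer.labels_)
```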

## 21. What does model.predict() do? What arguments must be given?

In [None]:
# model.predict() is used to make predictions on new, unseen data using a trained model.
# It takes input features (X), applies learned parameters, and returns predicted labels or values.
# Arguments:
# - X: The feature array or dataset for which predictions are required.
# Example:
# predictions = model.predict(X_test)
# The result can be used to evaluate model performance or for practical decision-making.
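
A minimal sketch of `predict` on a classifier, using made-up one-feature data. `predict_proba` is shown as an extra; it is available on classifiers such as `LogisticRegression` but not on every estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_train, y_train)

# predict takes only features, shaped (n_samples, n_features) like the training data
X_new = np.array([[0.5], [4.5]])
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)  # per-class probabilities

print(predictions)
print(probabilities.round(2))
```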

## 22. What are continuous and categorical variables?

In [None]:
# Continuous variables are numeric and can take on any value within a range, including decimals (e.g., height, weight, age).
# They are measured rather than counted; when the target variable is continuous, the task is a regression problem.
# Categorical variables consist of discrete categories, labels, or groups.
# Examples include eye color (blue, green, brown), city names, or binary outcomes (yes/no).
# Categorical variables often need to be encoded numerically for use in ML models.

## 23. What is feature scaling? How does it help in Machine Learning?

In [None]:
# Feature scaling is the process of normalizing or standardizing the range of independent variables or features of data.
# It ensures that each feature contributes equally to the model, preventing features with larger value ranges from dominating.
# Common methods include:
# - Min-Max Scaling: Scales features to a fixed range (usually 0 to 1).
# - Standardization (Z-score): Centers data around zero with a standard deviation of one.
# Scaling is critical for algorithms that calculate distances (KNN, SVM) or gradients (neural networks), as it speeds up convergence and may improve accuracy.
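
Both methods reduce to one-line formulas. The sketch below applies them by hand with NumPy to two made-up features on very different scales.

```python
import numpy as np

# Two made-up features on very different scales: age (years) and income (dollars)
X = np.array([
    [25.0, 40_000.0],
    [35.0, 60_000.0],
    [45.0, 80_000.0],
    [55.0, 120_000.0],
])

# Min-max scaling: (x - min) / (max - min), per column
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, per column
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax.round(2))
print(X_standard.round(2))
```

After either transform, age and income occupy comparable ranges, so neither dominates a distance calculation.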

## 24. How do we perform scaling in Python?

In [None]:
# In Python, scaling is typically performed using scikit-learn's preprocessing module.
# StandardScaler performs Z-score standardization; MinMaxScaler transforms data to a fixed range.
# Example using StandardScaler:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  # with a train/test split, call fit_transform on the training data and transform on the test data
# For Min-Max scaling:
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# X_scaled = scaler.fit_transform(X)
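
When the data is split into training and test sets, a common pattern is to fit the scaler on the training portion only and reuse its statistics on the test portion. A sketch on a synthetic feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # made-up feature matrix

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled test set will not have exactly zero mean and unit variance, which is expected because its statistics were not used for fitting.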

## 25. Explain data encoding?

In [None]:
# Data encoding is the process of converting categorical variables into a numerical format that can be provided to machine learning algorithms.
# Many models require inputs to be integers or floats; encoding makes data machine-readable.
# Common encoding types:
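# - Label Encoding: Converts each category to a unique integer. Useful for ordinal features when the integers follow the category order; on nominal features the integers imply a ranking that does not exist.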
# - One-hot Encoding: For each category, creates a separate binary column (0 or 1). Useful for nominal features.
# Encoding is necessary for algorithms such as SVM, linear regression, and neural networks, which cannot handle categorical text data directly.
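
Both encoding types can also be done directly in pandas. The sketch below uses made-up columns; the `education_order` mapping is a hypothetical order defined by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris", "Lima"],                    # nominal
    "education": ["high school", "master", "bachelor", "master"],   # ordinal
})

# One-hot encoding for the nominal column
city_onehot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Integer codes for the ordinal column, using an explicit order
education_order = {"high school": 0, "bachelor": 1, "master": 2}
education_encoded = df["education"].map(education_order)

encoded = pd.concat([city_onehot, education_encoded], axis=1)
print(encoded)
```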