Theoretical Questions & Answers
1. What is a parameter?
A parameter is a value that characterizes a statistical model or function. In Machine Learning, parameters are the internal variables that a model learns from training data, such as the weights in a neural network or the coefficients in a linear regression model. Unlike hyperparameters (for example, the learning rate), which are chosen before training, parameters are fitted automatically by the training procedure. Parameters matter because they determine what the trained model predicts for a given input.
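To make the distinction concrete, here is a minimal sketch using NumPy (an assumption; the answer above does not name a library). The data is invented. The slope and intercept are parameters learned from the data, while the polynomial degree is a hyperparameter chosen by hand.

```python
import numpy as np

# Invented data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

degree = 1  # hyperparameter: chosen before fitting

# Least-squares fit; the returned coefficients are the learned parameters
slope, intercept = np.polyfit(x, y, degree)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Changing the data changes the learned slope and intercept, while the degree stays whatever we set it to.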
2. What is correlation? What does negative correlation mean?
Correlation measures the strength and direction of the linear relationship between two variables. The most common measure, the Pearson correlation coefficient, takes a value between -1 and 1:
- Positive correlation: As one variable increases, the other also tends to increase (e.g., height and weight).
- Negative correlation: As one variable increases, the other tends to decrease (e.g., temperature and sweater sales).
- Zero correlation: No linear relationship between the variables (a nonlinear relationship may still exist).
A negative correlation indicates an inverse relationship: higher values of one variable tend to occur with lower values of the other. For example, as exercise duration increases, body fat percentage tends to decrease. Correlation alone does not show that one variable causes the other.
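As a rough illustration of a negative correlation, the sketch below computes the Pearson coefficient with NumPy for the temperature and sweater sales example. The numbers are invented for illustration, not real sales data.

```python
import numpy as np

# Invented illustrative data: daily temperature (deg C) and sweater sales
temperature = np.array([0, 5, 10, 15, 20, 25])
sweater_sales = np.array([95, 80, 70, 50, 35, 20])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(temperature, sweater_sales)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A value close to -1 means the points lie near a downward-sloping straight line.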
3. Define Machine Learning. What are the main components in Machine Learning?
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn from data and improve performance without being explicitly programmed. It involves building models that recognize patterns and make decisions.
The main components of ML include:
- Data: Raw input that fuels model training.
- Features: Relevant attributes extracted from data for better learning.
- Model: A mathematical framework that learns relationships in data.
- Loss Function: Measures how well the model predicts compared to actual outcomes.
- Optimization Algorithm: Adjusts parameters to minimize loss (e.g., Gradient Descent).
- Training & Testing: Model is trained on a dataset and then evaluated on unseen data.
4. How does loss value help in determining whether the model is good or not?
The loss value measures how far a model’s predictions deviate from the actual results. Lower loss on a given dataset means the predictions are closer to the targets, while higher loss means they are further off. Different loss functions suit different tasks, such as Mean Squared Error (MSE) for regression and Cross-Entropy Loss for classification, so loss values are only comparable when the same loss function is computed on the same data. Monitoring loss on both training and validation data helps detect overfitting, which shows up as training loss that keeps falling while validation loss stalls or rises.
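The two loss functions named above can be written out by hand. This is a minimal NumPy sketch with invented targets and predictions; the helper names are hypothetical, and libraries such as Scikit-Learn provide their own versions.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

mse = mean_squared_error([3, 5, 7], [2.5, 5.0, 8.0])

# Same labels, confident-and-right versus confident-and-wrong probabilities
bce_good = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
bce_bad = binary_cross_entropy([1, 0, 1], [0.2, 0.9, 0.3])

print(mse, bce_good, bce_bad)
```

The second set of probabilities points the wrong way for every sample, so its cross-entropy is higher than the first.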
5. What are continuous and categorical variables?
- Continuous variables can take an infinite number of values within a range (e.g., height, age, temperature).
- Categorical variables represent discrete categories or labels (e.g., gender, city, product type).
Handling them correctly is crucial in ML, as continuous data often undergoes scaling, while categorical data requires encoding.
6. How do we handle categorical variables in Machine Learning? What are common techniques?
Common techniques include:
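- Label Encoding (assigning an arbitrary integer to each category; in Scikit-Learn, LabelEncoder is intended for the target variable).
- One-Hot Encoding (creating one binary column per category).
- Ordinal Encoding (mapping categories to integers that reflect a meaningful order, e.g., low < medium < high).
- Binary Encoding (writing each category’s integer code in binary and using one column per bit, which needs fewer columns than one-hot encoding).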
These techniques ensure categorical features are numerically represented for ML models.
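Two of the techniques listed above can be sketched with Pandas alone. The DataFrame, column names, and category order below are invented for illustration.

```python
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "city": ["Paris", "Tokyo", "Paris", "Lima"],
})

# Ordinal encoding: the mapping reflects a meaningful order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 column per city
city_one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

print(df)
print(city_one_hot)
```

Ordinal encoding suits "size" because its categories have a natural order; "city" has none, so one-hot encoding avoids implying one.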
7. What do you mean by training and testing a dataset?
- The training dataset is used to fit the model by adjusting its parameters.
- The testing dataset is used to evaluate the model’s performance on data it has not seen.
A good model generalizes well, meaning its performance on the testing set is close to its performance on the training set. A large gap between the two usually signals overfitting.
8. What is sklearn.preprocessing?
sklearn.preprocessing is a module in Scikit-Learn used for data preprocessing. It provides transformers for scaling, normalization, encoding, and other feature transformations (for example StandardScaler, MinMaxScaler, and OneHotEncoder) that put data into a form models can use effectively.
9. What is a Test set?
A test set is a portion of data reserved for evaluating a trained model’s performance. Unlike training data, the model never sees the test set during training, ensuring realistic performance assessment.
10. How do we split data for model fitting (training and testing) in Python?
In Python, we use train_test_split() from sklearn.model_selection:
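from sklearn.model_selection import train_test_split
# X holds the features and y the labels; both are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This holds out 20% of the rows for testing and keeps 80% for training. Setting random_state makes the shuffle reproducible, so the same split is produced on every run.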
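For a version that runs on its own, the sketch below applies the same call to a small invented dataset, so the sizes of the resulting pieces can be checked.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 10 samples with 2 features each, and one label per sample
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
```

Every sample lands in exactly one of the two sets, so no test row is seen during training.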
11. Why do we have to perform EDA before fitting a model to the data?
Exploratory Data Analysis (EDA) helps us understand patterns in the data and detect missing values, outliers, correlations, and the shape of each distribution. These findings guide cleaning and feature preparation before training, which reduces the risk of fitting a model to flawed or unrepresentative data.
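A few of these checks can be sketched with Pandas. The DataFrame below is invented, and the 1.5 * IQR rule is only one common heuristic for flagging outliers, not a method prescribed by the text.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value and one extreme value
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [30, 32, 35, 31, 500],
})

missing_per_column = df.isna().sum()  # count of missing values per column
summary = df.describe()               # count, mean, std, min, quartiles, max

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print(summary)
print(outliers)
```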
12. What is correlation?
Covered in question 2: correlation measures the strength and direction of the linear relationship between variables. With positive correlation the variables tend to move in the same direction; with negative correlation one tends to decrease as the other increases. It is a useful input to feature selection in ML.
13. What does negative correlation mean?
Covered in question 2: a negative correlation means an inverse relationship, where one variable tends to decrease as the other increases (e.g., more exercise is associated with lower body fat). Correlation values range from -1 (perfect negative) to 1 (perfect positive).
14. How can you find correlation between variables in Python?
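Use the corr() method of a Pandas DataFrame:
import pandas as pd
# df is assumed to be an existing DataFrame; in pandas 2.0 and later, pass numeric_only=True if it has non-numeric columns
df.corr()  # Computes the Pearson correlation matrix
This helps identify highly correlated features, which may be redundant as model inputs.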
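Here is a self-contained sketch of the same call on a small invented DataFrame. The column names are hypothetical.

```python
import pandas as pd

# Invented data for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 7, 6, 4, 2],
})

corr_matrix = df.corr()  # Pearson by default
print(corr_matrix.round(2))

# Look up the coefficient for a single pair of columns
r = corr_matrix.loc["hours_studied", "hours_gaming"]
```

The matrix is symmetric with ones on the diagonal, so each pair of columns appears twice.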
15. What is causation? Explain the difference between correlation and causation with an example.
- Correlation shows an association between variables, but it does not imply causation.
- Causation means one event directly affects another.
Example: Ice cream sales and drowning rates are positively correlated, but ice cream does not cause drowning. Hot weather is a common cause that increases both.
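The ice cream example can be simulated to show how a common cause produces correlation without causation. All numbers below are simulated under assumed coefficients and noise levels; nothing here is real data. Once the linear effect of temperature is removed from both series, the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 2000

# Simulated (not real) data: temperature drives both quantities
temperature = rng.uniform(10, 35, size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, size=n)
drownings = 0.5 * temperature + rng.normal(0, 2, size=n)

r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation that remains once temperature is accounted for
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw r = {r_raw:.2f}, after removing temperature r = {r_adjusted:.2f}")
```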
16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
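An optimizer is the algorithm that updates a model’s parameters to reduce the loss. Common types include:
- Gradient Descent (moves the weights a small step against the gradient of the loss on each iteration; Stochastic Gradient Descent does this on mini-batches).
- Adam (combines momentum with per-parameter adaptive learning rates, and often converges quickly with little tuning).
- RMSprop (scales each parameter’s learning rate by a running average of its recent squared gradients).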
Example in TensorFlow:
import tensorflow as tf
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
The choice of optimizer and learning rate affects how quickly training converges and how good the final solution is.
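To show what an optimizer does at each step, here is a sketch of plain gradient descent in NumPy, fitting a single weight to invented data. It is the simplest of the optimizers listed above, not an implementation of Adam or RMSprop, and the learning rate and step count are arbitrary choices.

```python
import numpy as np

# Invented data that follows y = 3x, so the ideal weight is 3
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = 0.0               # parameter to learn
learning_rate = 0.05  # hyperparameter chosen by hand

for step in range(50):
    error = w * x - y
    gradient = 2.0 * np.mean(error * x)  # derivative of the MSE with respect to w
    w -= learning_rate * gradient        # step against the gradient

final_loss = float(np.mean((w * x - y) ** 2))
print(f"w = {w:.4f}, loss = {final_loss:.6f}")
```

A learning rate that is too large would make the updates overshoot and diverge, which is one reason adaptive optimizers are popular.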
17. What is sklearn.linear_model?
sklearn.linear_model is a Scikit-Learn module that provides linear regression, logistic regression, and other linear models such as Ridge and Lasso.
Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
LinearRegression() creates an unfitted model; calling fit() on training data then estimates its coefficients.
18. What does model.fit() do? What arguments must be given?
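model.fit(X_train, y_train) trains an ML model using features (X_train) and labels (y_train), adjusting the model’s parameters to minimize error. For supervised estimators both arguments are required: X_train is a 2D array of shape (n_samples, n_features) and y_train holds one label per sample. Unsupervised estimators such as KMeans need only X.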
19. What does model.predict() do? What arguments must be given?
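model.predict(X_test) returns the fitted model’s predictions for the samples in X_test. The only required argument is the feature array, which must have the same columns as the training features. Comparing the predictions with y_test shows how accurate the model is.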
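The fit and predict calls from the last two answers can be seen together in a short sketch. The data is invented and follows y = 2x + 1, so the fitted model should recover that line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data that follows y = 2x + 1
X_train = np.array([[1], [2], [3], [4]])  # 2D: (n_samples, n_features)
y_train = np.array([3, 5, 7, 9])          # 1D: one label per sample

model = LinearRegression()
model.fit(X_train, y_train)  # learns coef_ and intercept_

X_test = np.array([[5], [6]])
predictions = model.predict(X_test)  # one prediction per row of X_test
print(predictions)
```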
20. What are continuous and categorical variables?
Covered in question 5: continuous variables can take any value within a range, while categorical variables take values from a fixed set of categories.
21. What is feature scaling? How does it help in Machine Learning?
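Feature scaling puts numerical features on a comparable scale so that features with large numeric ranges do not dominate distance-based models (such as k-NN or SVM) or slow down gradient-based training. Two common techniques are Normalization (min-max scaling to a fixed range such as [0, 1]) and Standardization (rescaling to zero mean and unit variance).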
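Both techniques reduce to one-line formulas. The sketch below applies them by hand with NumPy to invented values, to show what the scaling does before any library class is involved.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # invented feature values

# Normalization (min-max): rescale to the range [0, 1]
x_normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation
x_standardized = (x - x.mean()) / x.std()

print(x_normalized)
print(x_standardized)
```

Normalization is sensitive to extreme values because it depends on the minimum and maximum, while standardization does not bound the result to a fixed range.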
22. How do we perform scaling in Python?
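Using sklearn.preprocessing:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training data only, then reuse the same statistics for the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
This standardizes each feature to zero mean and unit variance. Fitting the scaler on the training set alone avoids leaking information from the test set into training.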
23. What is sklearn.preprocessing?
Covered in question 8—a Scikit-Learn module for data transformation techniques.
24. How do we split data for model fitting (training and testing) in Python?
Covered in question 10—use train_test_split() function.
25. Explain data encoding.
Data encoding transforms categorical variables into numerical format for ML models. Common encoding techniques:
- One-Hot Encoding (creates binary columns).
- Label Encoding (assigns numeric labels).
- Ordinal Encoding (ranks categories).
Example in Python:
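from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# df is assumed to be an existing DataFrame with a column named 'category_column'
encoded_data = encoder.fit_transform(df[['category_column']])  # sparse matrix; call .toarray() for a dense array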