1. What is a parameter?

A parameter in machine learning is an internal variable of a model whose value is learned from data during training. Examples include the coefficients and intercept in linear regression and the weights and biases in a neural network. Parameters are fitted by an optimization algorithm such as gradient descent, which adjusts them to minimize prediction error on the training data. They differ from hyperparameters (for example, the learning rate), which are set before training rather than learned.
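
As a minimal sketch of what "learned from data" means, the snippet below fits a linear regression to made-up points that follow y = 2x + 1 and then reads back the fitted parameters. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# The learned parameters: one coefficient per feature, plus an intercept.
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

The fitted slope and intercept should come out very close to 2 and 1, the values used to generate the data.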

2. What is correlation?

Correlation is a statistical measure of the degree to which two variables move together. The Pearson coefficient ranges from -1 to 1. A positive value means the variables tend to increase together, a negative value means one tends to decrease as the other increases, and a value near 0 means there is no linear relationship (a nonlinear one may still exist). It is useful for feature selection in machine learning.
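
The three cases can be checked on a small example. The sketch below uses made-up values and NumPy's corrcoef; the last case shows that a coefficient of 0 only rules out a linear relationship, since y = x squared depends entirely on x.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]   # y rises with x
r_negative = np.corrcoef(x, -3 * x)[0, 1]      # y falls as x rises
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]     # y depends on x, but not linearly

print(r_positive, r_negative, r_nonlinear)
```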

3. What does negative correlation mean?

Negative correlation occurs when one variable increases while the other decreases. A coefficient close to -1 implies a strong inverse relationship. For example, more exercise may correlate with lower body fat percentage. Understanding negative correlation helps in selecting or eliminating features.

4. Define Machine Learning. What are the main components in Machine Learning?

Machine Learning is a field of AI that allows computers to learn patterns from data and make predictions. Key components include: (1) Data, (2) Features, (3) Algorithms, (4) Model, (5) Training Process, and (6) Evaluation. These components work together to build systems that improve performance based on experience.

5. How does loss value help in determining whether the model is good or not?

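The loss value measures the difference between predicted and actual outputs, and it guides the optimization algorithm in adjusting model parameters. On its own, a low training loss does not prove the model is good: it should be compared with the loss on validation data. High loss on both sets suggests underfitting or poor learning, a low training loss with a much higher validation loss suggests overfitting, and a low, stable loss on both indicates effective learning and better generalization.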
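
As a small illustration, the sketch below computes mean squared error, one common loss, for two sets of made-up predictions. The helper name and the numbers are assumptions for the example.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y_true = [3.0, 5.0, 7.0]
loss_close = mean_squared_error(y_true, [2.5, 5.0, 7.5])  # predictions near the targets
loss_far = mean_squared_error(y_true, [1.0, 1.0, 1.0])    # predictions far from the targets

print(loss_close, loss_far)
```

The predictions that sit closer to the targets produce the smaller loss, which is the signal an optimizer follows during training.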

6. What are continuous and categorical variables?

Continuous variables can take any numeric value within a range (e.g., height, weight), whereas categorical variables represent distinct categories or groups (e.g., gender, color). In ML, continuous variables are used directly or scaled, while categorical variables are encoded using techniques like one-hot or label encoding so that algorithms can process them.

7. How do we handle categorical variables in Machine Learning? What are the common techniques?

Categorical variables must be converted into numerical format. Common techniques include Label Encoding (assigns numeric labels) and One-Hot Encoding (creates binary columns for each category). More advanced techniques include Target Encoding and Frequency Encoding. Encoding ensures machine learning models can process and learn from categorical features.
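
A minimal sketch of the two basic techniques with scikit-learn follows. It uses OrdinalEncoder for the integer-label case, since scikit-learn's LabelEncoder is intended for target labels rather than feature columns; the color values are made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["blue"], ["green"], ["blue"]])

# Integer codes, one per category (categories are sorted alphabetically by default).
ordinal = OrdinalEncoder().fit_transform(colors)

# One binary column per category; the default output is sparse, so convert it.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())
print(one_hot)
```

Integer codes are compact but suggest an order between categories, so one-hot encoding is usually the safer choice for unordered categories.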

8. What do you mean by training and testing a dataset?

Training means fitting a model to data so that it learns patterns. Testing evaluates the trained model's performance on unseen data. The dataset is typically split into a training set (to fit the model) and a test set (to evaluate generalization). The split does not prevent overfitting by itself, but it makes overfitting visible: a model that scores well on the training set and poorly on the test set has memorized rather than generalized.

9. What is sklearn.preprocessing?

sklearn.preprocessing is a module in Scikit-learn used for feature scaling and transformation. It provides tools for standardization, normalization, encoding categorical features, and binarization. Imputation of missing values lives in the separate sklearn.impute module. Preprocessing puts data in an appropriate format for modeling and helps improve the efficiency and performance of ML algorithms.

10. What is a Test set?

A test set is a portion of the dataset not used during training but reserved for evaluating model performance. It helps measure how well the model generalizes to new, unseen data. The accuracy, precision, recall, and other metrics computed on the test set indicate real-world performance.
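
As a sketch of how such metrics are computed, the snippet below compares made-up true labels with made-up predictions for a test set, using functions from sklearn.metrics.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test-set labels and model predictions (1 = positive class).
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_test, y_pred)    # share of all predictions that are correct
precision = precision_score(y_test, y_pred)  # share of predicted positives that are correct
recall = recall_score(y_test, y_pred)        # share of actual positives that were found

print(accuracy, precision, recall)
```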

11. How do we split data for model fitting (training and testing) in Python?

In Python, train_test_split() from sklearn.model_selection is commonly used to split data. It randomly divides the dataset into training and testing subsets, typically with an 80-20 or 70-30 ratio. The function shuffles by default and can stratify on the labels so that both sets keep the same class proportions as the full dataset.
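
A minimal sketch of a stratified 80-20 split follows, assuming a made-up dataset of 20 rows with two equally sized classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 10 samples per class.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# stratify=y keeps the 50/50 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape, y_test)
```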

12. How do you approach a Machine Learning problem?

A typical approach is: (1) Define the problem, (2) Collect and clean data, (3) Perform Exploratory Data Analysis (EDA), (4) Select relevant features, (5) Split data, (6) Choose and train the model, (7) Evaluate using metrics, (8) Tune hyperparameters, and (9) Validate and deploy the model.
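
The modeling steps (split, train, tune, evaluate) can be sketched end to end. The example below is one possible arrangement, not a prescribed recipe: it assumes the Iris dataset bundled with scikit-learn, a logistic regression model, and a small grid of values for the regularization strength C.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small bundled dataset and hold out 20% for the final evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling and the model live in one pipeline so both are fitted on training data only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune one hyperparameter with 5-fold cross-validation on the training set.
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the tuned model once on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_accuracy)
```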

13. Why do we have to perform EDA before fitting a model to the data?

EDA helps understand the data’s structure, detect outliers, check for missing values, and assess feature relationships. It aids in feature selection and preprocessing decisions. Without EDA, model accuracy may suffer due to data issues. Visualization tools and summary statistics are key elements of effective EDA.
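
A few of these checks can be sketched with pandas. The DataFrame below is made up for illustration, and the 1.5 x IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age and one suspicious income value.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 41, 38, 29],
    "income": [40, 42, 39, 41, 400, 43],
})

summary = df.describe()              # count, mean, std, quartiles per column
missing_per_column = df.isna().sum() # number of missing values per column

# Flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(summary)
print(missing_per_column)
print(outliers)
```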

14. What is correlation?

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation means that as one variable increases, the other tends to increase as well. A negative correlation means that as one variable increases, the other tends to decrease. Correlation is measured using coefficients like Pearson's correlation (ranges from -1 to 1). A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates no linear correlation. For example, there is a strong positive correlation between height and weight: taller individuals generally weigh more.

15. What does negative correlation mean?

Negative correlation implies that as one variable increases, the other decreases. In statistical terms, a negative correlation coefficient ranges from -1 to 0. The closer it is to -1, the stronger the inverse relationship. For instance, consider the relationship between the number of hours spent watching TV and academic performance — generally, as screen time increases, grades may decline, indicating a negative correlation. It's important to remember that correlation does not imply causation; just because two variables are inversely related doesn't mean one causes the other to change.

16. How can you find correlation between variables in Python?

In Python, correlation between variables can be computed with the pandas library. First, load your data into a DataFrame. Then call the .corr() method to get pairwise correlation coefficients. By default it computes the Pearson correlation; pass method='spearman' or method='kendall' for rank-based alternatives. Here's a sample:

import pandas as pd

df = pd.read_csv('data.csv')  # replace with the path to your data
# numeric_only=True skips text columns, which raise an error in pandas 2.x
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)

To visualize correlations, use Seaborn's heatmap():

import seaborn as sns
import matplotlib.pyplot as plt

# Draw the matrix as a color-coded grid with the coefficient in each cell
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

This helps identify linear relationships between variables for further analysis or feature selection in machine learning.

17. What is causation? Explain difference between correlation and causation with an example.

Causation means one event causes another to happen, whereas correlation means two events occur together but do not necessarily have a cause-effect relationship. Correlation can exist without causation, but causation implies correlation. For example, there may be a strong correlation between ice cream sales and drowning incidents — both rise in summer — but buying ice cream doesn’t cause drowning. The real cause is the rise in temperature. On the other hand, causation can be illustrated with smoking and lung cancer; multiple studies show that smoking directly contributes to cancer development. Misinterpreting correlation as causation can lead to incorrect conclusions in research and data science.
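
The ice cream example can be sketched with simulated data. Everything below is an assumption made for illustration: temperature drives both quantities, and neither affects the other. Removing the part of each variable explained by temperature (a simple form of partial correlation) makes the apparent relationship disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: temperature is the common cause of both variables.
temperature = rng.normal(25.0, 5.0, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0.0, 5.0, n)
drownings = 0.5 * temperature + rng.normal(0.0, 2.0, n)

# The raw correlation is strong even though neither variable causes the other.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, confounder):
    # What is left of `values` after removing a straight-line fit on `confounder`.
    slope, intercept = np.polyfit(confounder, values, 1)
    return values - (slope * confounder + intercept)

# Correlate what remains once temperature has been accounted for.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_corr, partial_corr)
```

In this simulation the raw correlation is clearly positive while the partial correlation is close to zero, which is what a shared cause with no direct link looks like.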

18. What is an Optimizer? What are different types of optimizers? Explain each with an example.

An optimizer in machine learning adjusts model parameters to minimize the loss function and improve accuracy. Optimizers determine how the model learns by updating weights using gradients. Popular optimizers include:

SGD (Stochastic Gradient Descent): Updates weights using the gradient from one sample (or, in practice, a small mini-batch) at a time. Each step is cheap, but the updates are noisy and may oscillate.
Example: Used in simple neural networks.

Adam (Adaptive Moment Estimation): Combines momentum with per-parameter adaptive learning rates and often converges faster.
Example: Widely used in deep learning models like CNNs.

RMSProp: Scales the learning rate of each parameter by a running average of its recent squared gradients, which makes it effective for non-stationary objectives.
Example: Often used when training RNNs.
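
To make the update rules concrete, here is a minimal NumPy sketch that fits a single weight w in y = w * x with each optimizer, one sample per update. The data, learning rates, and the train helper are illustrative assumptions; the decay constants (0.9 for RMSProp, 0.9 and 0.999 for Adam) are commonly used defaults.

```python
import numpy as np

# Made-up, noiseless data generated with a true weight of 3.
x = np.linspace(0.5, 1.5, 20)
y = 3.0 * x

def gradient(w, xi, yi):
    # Derivative of the squared error (w * xi - yi) ** 2 with respect to w.
    return 2.0 * (w * xi - yi) * xi

def train(optimizer, lr, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w, m, v, t = 0.0, 0.0, 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):  # visit samples in random order
            g = gradient(w, x[i], y[i])
            t += 1
            if optimizer == "sgd":
                w -= lr * g
            elif optimizer == "rmsprop":
                v = 0.9 * v + 0.1 * g ** 2             # running average of squared gradients
                w -= lr * g / (np.sqrt(v) + 1e-8)
            elif optimizer == "adam":
                m = 0.9 * m + 0.1 * g                  # momentum (first moment)
                v = 0.999 * v + 0.001 * g ** 2         # second moment
                m_hat = m / (1 - 0.9 ** t)             # bias correction
                v_hat = v / (1 - 0.999 ** t)
                w -= lr * m_hat / (np.sqrt(v_hat) + 1e-8)
            else:
                raise ValueError(f"unknown optimizer: {optimizer}")
    return w

w_sgd = train("sgd", lr=0.1)
w_rmsprop = train("rmsprop", lr=0.01)
w_adam = train("adam", lr=0.01)
print(w_sgd, w_rmsprop, w_adam)
```

All three should end near the true weight of 3. In practice these optimizers come from a framework rather than being written by hand; the sketch only shows what each update does.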

19. What is sklearn.linear_model?

sklearn.linear_model is a module in the Scikit-learn library that provides tools for linear models in machine learning. It includes regression and classification models based on linear approaches. Key estimators include LinearRegression for ordinary least squares regression, LogisticRegression for binary and multiclass classification, Ridge and Lasso for regularized regression, and SGDRegressor for linear regression fitted with stochastic gradient descent. These models assume a linear relationship between input features and target outputs, making them efficient and interpretable. Example usage:

from sklearn.linear_model import LinearRegression
model = LinearRegression()

This module is foundational for understanding more complex models.

20. What does model.fit() do? What arguments must be given?

The model.fit() method trains a machine learning model by learning patterns from input data. It adjusts model parameters to minimize the error between predicted and actual outputs. For supervised models, the basic call requires two arguments:

X: Feature matrix (independent variables), typically a 2D array or DataFrame
y: Target vector (dependent variable)

model.fit(X_train, y_train)

Additional arguments may be accepted depending on the model and library, such as sample_weight in many scikit-learn estimators or epochs in Keras models. Unsupervised estimators such as clustering models need only X. This step is what lets the model generalize from the training data to new inputs.

21. What does model.predict() do? What arguments must be given?

The model.predict() function is used after training a model with .fit(). It takes new input data and returns predicted outputs based on the learned patterns. The only required argument is:

X: A 2D array or DataFrame containing new input features.

predictions = model.predict(X_test)

The output depends on the model type: for regression, it returns continuous values; for classification, it returns class labels. This method is used for evaluating performance or making real-world predictions once the model has been trained.
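
A minimal sketch of fit followed by predict for both cases, using made-up one-feature data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])

# Regression: targets are continuous, so predictions are continuous values.
regressor = LinearRegression().fit(X_train, np.array([2.0, 4.0, 6.0, 8.0]))
predicted_value = regressor.predict(np.array([[5.0]]))

# Classification: targets are class labels, so predictions are class labels.
classifier = LogisticRegression().fit(X_train, np.array([0, 0, 1, 1]))
predicted_labels = classifier.predict(np.array([[1.0], [4.0]]))

print(predicted_value, predicted_labels)
```

Note that predict expects a 2D input even for a single sample, which is why the new points are wrapped in double brackets.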

22. What are continuous and categorical variables?

Continuous variables are numerical values that can take any value within a range. They are measurable and often represented with floating-point numbers. Examples include height, temperature, and salary.
Categorical variables represent discrete categories or groups and have a limited number of distinct values. They can be nominal (no order, e.g., colors) or ordinal (with order, e.g., education level). These variables are typically represented as strings or integers but require encoding before being used in most machine learning models. Differentiating between them is crucial for preprocessing, selecting algorithms, and applying proper statistical techniques.
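
For ordinal variables, the order can be preserved when converting to integers. The sketch below assumes a made-up education column and uses pandas' ordered Categorical type so that the codes follow the stated order.

```python
import pandas as pd

education = pd.Series(["bachelor", "high school", "master", "bachelor"])

# State the order explicitly; codes are assigned by position in this list.
levels = ["high school", "bachelor", "master"]
ordered = pd.Categorical(education, categories=levels, ordered=True)
codes = ordered.codes

print(list(codes))
```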

23. What is feature scaling? How does it help in Machine Learning?

Feature scaling transforms numerical features onto a common scale so that no feature dominates merely because of its units or range. It matters because many algorithms (e.g., KNN, SVM, and models trained with gradient descent) are sensitive to feature magnitude. Without scaling, features with larger ranges dominate the others, leading to biased models.
Common methods include:

Standardization: Subtract the mean and divide by the standard deviation, giving each feature mean 0 and standard deviation 1

Normalization: Rescale to [0, 1] using min-max scaling, (x - min) / (max - min)

Scaling ensures fair treatment of all features and faster convergence in optimization algorithms.
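
Both formulas are short enough to write out directly. The following sketch applies them to a made-up feature with NumPy:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

# Min-max normalization: map the smallest value to 0 and the largest to 1.
min_max = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(min_max)
```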

24. How do we perform scaling in Python?

Scaling in Python is typically done using Scikit-learn's preprocessing module. Two common techniques are:

StandardScaler: Standardizes features (mean = 0, std = 1)

MinMaxScaler: Scales values to a fixed range, usually [0, 1]

Example:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Alternatively:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

The scaled data is then used for training machine learning models. When the data is split into training and test sets, fit the scaler on the training set only and reuse it to transform the test set, so that no information from the test set leaks into training.
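
In a train/test workflow, the scaler's statistics should come from the training data alone. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses them; the scaler is not refitted

print(scaler.mean_, X_test_scaled)
```

Calling fit_transform on the test set instead would rescale it with its own statistics, which both leaks information and makes the two sets inconsistent.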

25. What is sklearn.preprocessing?

sklearn.preprocessing is a Scikit-learn module used for transforming data before feeding it into machine learning models. It includes techniques like:

Scaling: StandardScaler, MinMaxScaler

Encoding: OneHotEncoder and OrdinalEncoder for features, LabelEncoder for target labels

Binarization and Normalization: Binarizer, Normalizer

Imputation of missing values is often used alongside these tools, but SimpleImputer lives in the separate sklearn.impute module.

These transformations ensure that all input features are in a suitable format and scale, which can improve model accuracy and convergence speed. Preprocessing is a critical part of any machine learning pipeline, especially when working with real-world, messy data.
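
Several of these transformers are often combined so that each column gets the treatment it needs. The sketch below is one way to do that, assuming a made-up DataFrame with one numeric and one categorical column. ColumnTransformer comes from sklearn.compose and SimpleImputer from sklearn.impute.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with a missing age and a text column.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "city": ["Paris", "Lima", "Oslo", "Lima"],
})

preprocess = ColumnTransformer(
    [
        # Numeric column: fill the gap with the mean, then standardize.
        ("numeric", make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), ["age"]),
        # Categorical column: one binary column per city.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)
```

The result has one scaled age column plus one column per distinct city, with no missing values left.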

26. How do we split data for model fitting (training and testing) in Python?

In Python, data splitting is commonly done using train_test_split from Scikit-learn's model_selection module. It separates the dataset into training and testing sets, ensuring that the model can be trained and then evaluated on unseen data.
Example:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

test_size defines the proportion of data for testing (e.g., 0.2 = 20%), and random_state ensures reproducibility. Holding out a test set does not prevent overfitting, but it helps detect it and gives a more reliable estimate of model performance.

27. Explain data encoding.

Data encoding transforms categorical variables into numerical form so they can be used in machine learning models. Most algorithms require numeric input, so textual categories must be converted. Common techniques include:

Label Encoding: Assigns each category a unique integer (e.g., Red=0, Blue=1). The integers imply an order, so this suits ordinal categories best.

One-Hot Encoding: Creates binary columns for each category to avoid implying ordinal relationships.

Example using pandas:

pd.get_dummies(df['color'])

Choosing an encoding that matches the variable type lets models interpret categories without introducing misleading hierarchies, which supports both accuracy and interpretability.