# THEORY QUESTION

1. What is a parameter?

- A parameter is basically a value that defines or configures a model. In machine learning, parameters are the internal variables that the model learns during training. Like in linear regression, the slope and intercept are parameters. The model adjusts these values to minimize error and make better predictions. Parameters are different from hyperparameters which we set manually before training.
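
To make the slope and intercept example concrete, here is a minimal sketch that learns both parameters from a few points using the closed-form least-squares formulas. The data and the `fit_line` helper are made up for illustration and are not from any library.

```python
# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]


def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor,
    # which cancels in the ratio).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# These two numbers are the learned parameters of the model.
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The two values are what "training" produces here. A hyperparameter would be something chosen before fitting, such as a regularization strength.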

2. What is correlation?

- Correlation measures how two variables are related to each other. It tells you whether, when one variable goes up, the other tends to go up, go down, or show no consistent pattern. Correlation values range from -1 to +1. Positive correlation means both variables move in the same direction, negative means they move in opposite directions, and zero means no linear relationship.

3. What does negative correlation mean?

- Negative correlation means when one variable increases, the other variable decreases. Like the relationship between outside temperature and heating bills - as temperature goes up, heating costs go down. The correlation coefficient will be negative, somewhere between -1 and 0. Perfect negative correlation is -1, meaning they're perfectly opposite.
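
As a rough illustration of the sign of the coefficient, the sketch below computes the Pearson correlation by hand for the temperature and heating bill example. The numbers are invented, and `pearson` is a helper written for this note, not a library function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up monthly figures, for illustration only.
temperature = [0, 5, 10, 15, 20]          # degrees Celsius
heating_bill = [200, 160, 130, 90, 60]    # falls as temperature rises
cooling_bill = [20, 45, 80, 110, 150]     # rises as temperature rises

negative_r = pearson(temperature, heating_bill)
positive_r = pearson(temperature, cooling_bill)
print(negative_r, positive_r)
```

The first coefficient comes out close to -1 and the second close to +1, matching the two directions described above.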

4. Define Machine Learning. What are the main components in Machine Learning?

- Machine Learning is basically teaching computers to learn patterns from data without explicitly programming every rule. The main components are: data (the information we feed the model), algorithms (the methods used to find patterns), features (the input variables), target variable (what we want to predict), training process (where model learns), and evaluation (checking how well it performs). You also need good feature engineering and proper validation techniques.

5. How does loss value help in determining whether the model is good or not?

- Loss value measures how wrong your model's predictions are compared to actual values. Lower loss means the predictions fit the data better. During training, we try to minimize loss through optimization. If training loss keeps decreasing, the model is learning. If training loss keeps falling while the loss on held-out (validation) data stops improving or starts increasing, the model is probably overfitting. Different types of problems use different loss functions, like mean squared error for regression or cross-entropy for classification.
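
The two loss functions named above can be sketched in a few lines of plain Python. This is a simplified version for intuition (binary cross-entropy only, with a small `eps` clip that is my own assumption to avoid `log(0)`), not how a library implements them.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Regression: errors of 1 and 2 give (1 + 4) / 2.
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])

# Classification: confident correct predictions give a lower loss
# than confident wrong ones.
good_loss = binary_cross_entropy([1, 0], [0.9, 0.1])
bad_loss = binary_cross_entropy([1, 0], [0.2, 0.8])
print(mse, good_loss, bad_loss)
```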

6. What are continuous and categorical variables?

- Continuous variables can take any numerical value within a range, like height, weight, temperature, or salary. You can have decimals and infinite possible values. Categorical variables represent categories or groups, like gender, color, city names, or product types. They're usually text or numbers that represent categories rather than actual quantities.

7. How do we handle categorical variables in Machine Learning? What are the common techniques?

- Most ML algorithms only work with numbers, so we need to convert categorical data. Common techniques include: One-hot encoding (creates binary columns for each category), Label encoding (assigns numbers to categories), Ordinal encoding (for ordered categories like small/medium/large), and Target encoding (uses target variable statistics). Choice depends on whether categories have natural order and how many unique values there are.
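
A minimal sketch of one-hot and ordinal encoding with scikit-learn follows. The colour and size columns are toy data invented for this example, and the category order passed to `OrdinalEncoder` is an assumption about what the "natural order" is.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy single-column inputs (encoders expect 2D arrays).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One-hot: one binary column per category, no order implied.
one_hot = OneHotEncoder()
color_matrix = one_hot.fit_transform(colors).toarray()
print(one_hot.categories_[0])  # column order chosen by the encoder
print(color_matrix)

# Ordinal: integers that follow the order we state explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)
print(size_codes.ravel())
```

One-hot suits nominal columns like colour, while the explicit category list keeps small < medium < large for the ordinal column.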

8. What do you mean by training and testing a dataset?

- Training means using part of your data to teach the model patterns and relationships. The model adjusts its parameters based on training data. Testing means evaluating the trained model on completely new data it hasn't seen before to check if it can generalize. This split helps us know if the model learned real patterns or just memorized the training data. Usually we use 70-80% for training and 20-30% for testing.

9. What is sklearn.preprocessing?

- sklearn.preprocessing is a module in scikit-learn that contains tools for preparing data before feeding it to machine learning models. It has classes for scaling features, encoding categorical variables, and transforming data distributions. Common ones include StandardScaler, MinMaxScaler, LabelEncoder, and OneHotEncoder. Note that train_test_split is not part of this module (it lives in sklearn.model_selection), and imputers for missing values such as SimpleImputer live in sklearn.impute.

10. What is a Test set?

- The test set is the portion of your dataset that you keep completely separate during model development. It's used only at the end to evaluate final model performance. The model never sees this data during training, so it gives you an honest assessment of how well your model will perform on new, unseen data. It's different from the validation set, which is used during development for hyperparameter tuning.

11. How do we split data for model fitting (training and testing) in Python?

- We use train_test_split from sklearn.model_selection. Basic syntax is: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). The test_size parameter controls what fraction goes to testing (0.2 means 20%). The random_state parameter ensures you get the same split every time you run the code. You can also add the stratify parameter to maintain class distribution in classification problems.
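
Here is a small runnable sketch of that call, with `train_test_split` imported from `sklearn.model_selection`. The feature matrix and labels are synthetic placeholders, and `stratify=y` is included to show how the class balance is kept.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, two balanced classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,    # 20% of the rows go to the test set
    random_state=42,  # same split on every run
    stratify=y,       # keep the 50/50 class ratio in both parts
)

print(X_train.shape, X_test.shape)
print(y_test)
```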

12. How do you approach a Machine Learning problem?

- First, understand the problem and define what you're trying to predict. Then collect and explore the data through EDA to understand patterns and issues. Clean the data by handling missing values and outliers. Do feature engineering to create better variables. Split data into train/test sets. Choose appropriate algorithm based on problem type. Train the model and evaluate performance. Tune hyperparameters to improve results. Finally, validate on test set and deploy if satisfactory.

13. Why do we have to perform EDA before fitting a model to the data?

- EDA helps you understand your data before building models. You can spot missing values, outliers, and data quality issues that need fixing. It reveals relationships between variables and helps with feature selection. You can identify the distribution of your target variable and see if data is balanced. EDA also helps choose appropriate algorithms and preprocessing steps. Without EDA, you might build models on messy data and get poor results.

14. What is correlation?

- Correlation shows the strength and direction of linear relationship between two variables. It ranges from -1 to +1. Values close to +1 mean strong positive relationship, close to -1 mean strong negative relationship, and close to 0 mean weak or no linear relationship. However, correlation doesn't imply causation - two variables can be correlated without one causing the other.

15. What does negative correlation mean?

- Negative correlation occurs when two variables move in opposite directions. As one increases, the other tends to decrease. For example, as study hours increase, failure rates typically decrease. The correlation coefficient is negative (between -1 and 0). The closer to -1, the stronger the negative relationship. It's important to remember that the Pearson coefficient only measures linear relationships.

16. How can you find correlation between variables in Python?

- You can use the pandas corr() method on a dataframe to get the correlation matrix between all numerical variables (recent pandas versions may need df.corr(numeric_only=True) if the dataframe also has text columns). For specific pairs, use df['col1'].corr(df['col2']). You can also use numpy.corrcoef(), or scipy.stats.pearsonr() which also returns a p-value. For visualization, a seaborn heatmap works great: sns.heatmap(df.corr(), annot=True). Different correlation methods include Pearson (linear), Spearman (rank-based), and Kendall (also rank-based).
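
A short sketch of the pandas calls mentioned above, on a made-up dataframe (the column names and values are placeholders for illustration):

```python
import pandas as pd

# Invented study data, all numeric columns.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6],
        "exam_score": [50, 55, 65, 70, 80, 85],
        "absences": [8, 7, 5, 4, 2, 1],
    }
)

# Full correlation matrix (Pearson is the default method).
corr_matrix = df.corr()
print(corr_matrix)

# Correlation for one specific pair of columns.
pair_corr = df["hours_studied"].corr(df["exam_score"])
print(pair_corr)

# Rank-based alternative, useful when the relationship is monotonic
# but not necessarily linear.
spearman_matrix = df.corr(method="spearman")
print(spearman_matrix)
```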

17. What is causation? Explain difference between correlation and causation with an example.

- Causation means one variable directly causes changes in another. Correlation just means variables move together but doesn't prove causation. Classic example: ice cream sales and drowning incidents are positively correlated, but ice cream doesn't cause drowning. Both are caused by hot weather (more people swim and buy ice cream). Correlation on its own is not sufficient to establish causation. To show causation, you need controlled experiments or strong causal inference methods.
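
The ice cream example can be simulated to see a confounder at work. This is only a sketch under invented assumptions: temperature drives both quantities, the coefficients and noise levels are arbitrary, and removing a fitted straight line is a crude stand-in for proper causal adjustment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 1000

# Hot weather is the common cause (all numbers are made up).
temperature = rng.uniform(15, 35, size=n_days)
ice_cream_sales = 10.0 * temperature + rng.normal(0, 20, size=n_days)
drownings = 0.5 * temperature + rng.normal(0, 2, size=n_days)

# The two effects are strongly correlated with each other...
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(values, driver):
    """Remove the straight-line effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)


# ...but once the temperature effect is removed, little correlation is left.
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_corr, adjusted_corr)
```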

18. What is an Optimizer? What are different types of optimizers? Explain each with an example.

- An optimizer is an algorithm that adjusts model parameters to minimize the loss function during training. SGD (Stochastic Gradient Descent) updates parameters using gradients from small batches - simple but can be slow. Adam combines momentum and adaptive learning rates - works well for most problems. RMSprop adapts the learning rate based on a moving average of recent squared gradients - good for RNNs. AdaGrad accumulates squared gradients, so its effective learning rate keeps shrinking and it can stop learning too early. Each has different strengths depending on the problem and data characteristics.
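
To compare the update rules, here is a simplified one-parameter sketch that minimizes the toy loss (w - 3)^2. It is an illustration under assumptions: the gradient is exact (no mini-batches, so the "SGD" here is plain gradient descent), and the learning rates and step counts are arbitrary choices, not recommended defaults.

```python
import math


def grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def run_sgd(w, lr=0.1, steps=500):
    # Plain gradient step: move against the gradient.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_momentum(w, lr=0.1, beta=0.9, steps=500):
    # Keep a velocity that accumulates past gradients.
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * grad(w)
        w += velocity
    return w


def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    # Moving averages of the gradient (m) and squared gradient (v),
    # with bias correction for the early steps.
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_sgd = run_sgd(0.0)
w_momentum = run_momentum(0.0)
w_adam = run_adam(0.0)
print(w_sgd, w_momentum, w_adam)
```

All three start from w = 0 and should end up near 3. They differ in the path they take to get there.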

19. What is sklearn.linear_model?

- sklearn.linear_model is a module containing linear algorithms for regression and classification. It includes LinearRegression for basic linear regression, LogisticRegression for classification, Ridge and Lasso for regularized regression, ElasticNet combining both penalties, and SGDRegressor for large datasets. These models assume linear relationships between features and target variable. They're interpretable and work well when linearity assumption holds.

20. What does model.fit() do? What arguments must be given?

- model.fit() trains the machine learning model on your data. It takes the training features (X) and target values (y) as required arguments. The model learns patterns by adjusting its internal parameters to minimize prediction error. Some models have additional optional parameters like sample_weight for giving different importance to samples. After fitting, the model can make predictions on new data.

21. What does model.predict() do? What arguments must be given?

- model.predict() uses the trained model to make predictions on new data. It requires one argument - the feature matrix (X) containing the input variables for which you want predictions. The model applies the patterns it learned during training to generate predictions. For classification, it returns predicted class labels. For regression, it returns predicted numerical values. The input must have same number and order of features as training data.
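
A minimal sketch of fit() followed by predict(), using LinearRegression from sklearn.linear_model. The training points are toy values generated from y = 2x + 1, chosen only so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: one feature, targets follow y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X_train, y_train)  # required arguments: features X and targets y

# Learned parameters are stored on the fitted model.
print(model.coef_, model.intercept_)

# predict() needs only the feature matrix, with the same columns as training.
X_new = np.array([[5.0], [6.0]])
predictions = model.predict(X_new)
print(predictions)
```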

22. What are continuous and categorical variables?

- Continuous variables represent measurable quantities that can take any value within a range, including decimals. Examples include age, income, temperature, and distance. Categorical variables represent discrete categories or labels, like gender, product type, or city. They're often text values or numbers representing groups rather than quantities. Understanding variable types is crucial for choosing appropriate preprocessing and modeling techniques.

23. What is feature scaling? How does it help in Machine Learning?

- Feature scaling brings all variables to similar ranges so no single feature dominates due to its scale. For example, if you have age (0-100) and salary (0-100000), salary will have much larger impact just due to numbers. Scaling techniques include standardization (mean=0, std=1) and normalization (0-1 range). Many algorithms like SVM, neural networks, and k-means are sensitive to feature scales and perform poorly without scaling.

24. How do we perform scaling in Python?

- Use the scalers in sklearn.preprocessing. StandardScaler for standardization: scaler = StandardScaler(); X_scaled = scaler.fit_transform(X). MinMaxScaler for 0-1 normalization: scaler = MinMaxScaler(); X_scaled = scaler.fit_transform(X). RobustScaler uses the median and IQR, so it is less sensitive to outliers. Always fit the scaler on training data only, then transform both train and test sets to avoid data leakage.
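
The "fit on train, transform both" rule can be sketched like this. The age and salary values are invented, and the train/test arrays are assumed to have been split already.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: columns are age and salary (very different scales).
X_train = np.array(
    [[25.0, 30000.0], [35.0, 50000.0], [45.0, 70000.0], [55.0, 90000.0]]
)
X_test = np.array([[40.0, 60000.0]])

# Standardization: statistics come from the training set only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # transform only, never fit on test data

# Min-max normalization, same pattern.
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std, X_test_mm)
```

Calling fit_transform on the test set instead would leak test statistics into preprocessing, which is the data leakage the answer warns about.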

25. What is sklearn.preprocessing?

- It's a comprehensive module in scikit-learn for data preprocessing tasks. It contains scalers (StandardScaler, MinMaxScaler), encoders (LabelEncoder, OneHotEncoder), PolynomialFeatures for feature engineering, and other data transformers. Imputers for missing values (such as SimpleImputer) are in the separate sklearn.impute module. Together these tools help prepare raw data for machine learning algorithms by handling common issues like different scales and categorical variables.

26. How do we split data for model fitting (training and testing) in Python?

- The standard approach uses train_test_split from sklearn.model_selection. Basic usage: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42). You can control split ratio with test_size, ensure reproducibility with random_state, and maintain class proportions with stratify parameter. For time series data, use temporal splits instead of random splits to avoid data leakage.
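
For the time series case, a temporal split can be as simple as cutting the ordered data at one point. The sketch below assumes the rows are already sorted oldest-first. `chronological_split` and the sample records are hypothetical helpers for this note, not part of scikit-learn.

```python
# Hypothetical daily observations, already sorted from oldest to newest.
observations = [{"day": day, "value": 100 + day} for day in range(1, 11)]


def chronological_split(rows, test_fraction=0.3):
    """Use the earliest rows for training and the most recent rows for testing."""
    n_test = int(round(len(rows) * test_fraction))
    cutoff = len(rows) - n_test
    return rows[:cutoff], rows[cutoff:]


train_rows, test_rows = chronological_split(observations, test_fraction=0.3)

# Every training day comes before every test day, so the model
# never trains on information from the "future".
print([row["day"] for row in train_rows])
print([row["day"] for row in test_rows])
```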

27. Explain data encoding?

- Data encoding converts categorical variables into numerical format that machine learning algorithms can process. One-hot encoding creates binary columns for each category (good for nominal data). Label encoding assigns integers to categories (risky unless ordinal). Ordinal encoding preserves order for ranked categories. Target encoding uses target statistics but can cause overfitting. Binary encoding converts to binary representation, useful for high cardinality. Choice depends on cardinality, ordinality, and algorithm requirements.
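
Target encoding is the least obvious of these, so here is a bare-bones sketch in plain Python. It is a simplified illustration: the city and purchase data are invented, `fit_target_encoding` is a helper written for this note, and it skips the smoothing or cross-fitting that is normally used to limit the overfitting mentioned above.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Map each category to the mean target value seen in the training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Invented training data: city of the customer and whether they bought (1/0).
train_cities = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"]
train_bought = [1, 0, 1, 0, 1, 0]

encoding = fit_target_encoding(train_cities, train_bought)
global_mean = sum(train_bought) / len(train_bought)

# Unseen categories fall back to the overall mean (a design choice made here).
new_cities = ["Delhi", "Pune", "Kolkata"]
encoded = [encoding.get(city, global_mean) for city in new_cities]
print(encoding)
print(encoded)
```

The mapping is learned from training rows only. Computing it on the full dataset would leak target information into the features.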

# by  Arghadeep Misra

+91 8250675419

Email - arghadeepmisra@gmail.com