**Assignment Questions**

1. What is a parameter?
*  A **parameter** in machine learning refers to a variable that the model **learns from the training data**. These values define how the model makes predictions. For example, in linear regression, the slope and intercept are parameters adjusted during training to minimize error. Parameters differ from **hyperparameters**, which are set before training and control the learning process (e.g., learning rate or tree depth). Accurate parameter values are essential for the model's performance, as they determine how well the model captures patterns in the data.
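
As a small illustration of learned parameters, the sketch below fits Scikit-learn's `LinearRegression` to made-up points that lie exactly on the line y = 2x + 1 (the data are invented for this example). The slope and intercept are read back from `coef_` and `intercept_` after training, whereas `fit_intercept` is a setting chosen before training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on y = 2x + 1 (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

# fit_intercept is chosen before training (a setting, not a learned value)
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# Parameters learned from the data
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```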

2. What is correlation?
What does negative correlation mean?
* **Correlation** is a statistical measure that describes the strength and direction of the linear relationship between two variables. It ranges from -1 to 1. A **positive correlation** means that as one variable increases, the other tends to increase as well. A **negative correlation** means that as one variable increases, the other tends to decrease, so they move in opposite directions. For example, if body weight tends to fall as time spent exercising rises, the two are negatively correlated. A correlation close to 0 indicates little or no linear relationship between the variables.
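
To make the definition concrete, here is a minimal from-scratch sketch of the Pearson correlation coefficient in plain Python. The helper name `pearson_r` and the exercise/weight numbers are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers: weekly exercise hours vs. body weight in kg
exercise_hours = [1, 2, 3, 4, 5]
body_weight = [82, 80, 79, 76, 74]

r = pearson_r(exercise_hours, body_weight)
print(round(r, 3))  # negative, because weight falls as exercise rises
```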

3. Define Machine Learning. What are the main components in Machine Learning?
* **Machine Learning** is a branch of artificial intelligence that enables systems to learn from data and improve their performance without being explicitly programmed. It focuses on building models that can recognize patterns, make predictions, or take actions based on input data.

The main components of Machine Learning are:

1. **Data** – The foundation used to train and evaluate models.
2. **Model** – An algorithm that learns patterns from the data.
3. **Features** – Input variables used to make predictions.
4. **Training** – The process of teaching the model using data.
5. **Evaluation** – Measuring model performance.

4. How does loss value help in determining whether the model is good or not?
* The **loss value** measures how far a machine learning model’s predictions are from the actual outcomes. It quantifies the error between predicted and true values. A **lower loss** indicates predictions that are closer to the true values, while a **higher loss** suggests poor performance. During training, the model adjusts its parameters to minimize the loss. Monitoring the loss over time shows whether the model is learning, and comparing training loss with validation loss helps reveal overfitting: training loss keeps falling while validation loss starts to rise. Loss is a key metric for judging model quality and guiding improvements.
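
As a rough illustration, the sketch below computes mean squared error, one common loss for regression, for two hypothetical models. The targets and predictions are made up; the point is only that the model whose predictions sit closer to the targets gets the lower loss.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions from two hypothetical models
y_true = [3.0, 5.0, 7.0, 9.0]
preds_model_a = [2.9, 5.2, 6.8, 9.1]   # close to the targets
preds_model_b = [1.0, 8.0, 4.0, 12.0]  # far from the targets

loss_a = mean_squared_error(y_true, preds_model_a)
loss_b = mean_squared_error(y_true, preds_model_b)
print(f"model A loss: {loss_a:.3f}, model B loss: {loss_b:.3f}")
```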


5. What are continuous and categorical variables?
* Continuous variables are numerical values that can take any value within a range. They are measurable and can be divided infinitely. Examples include height, temperature, and age — values like 5.6 or 72.3 are possible.

Categorical variables, on the other hand, represent distinct groups or categories. They are not measured numerically but represent types or labels. Examples include gender, color, or country. Categorical variables can be nominal (no order, like color) or ordinal (ordered, like education level).

Understanding both types is essential for proper data preprocessing and model selection.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
* Handling categorical variables in Machine Learning involves converting them into a numerical format that algorithms can process. Common techniques include:

Label Encoding – Assigns a unique integer to each category. Best for ordinal data, because the integers imply an order that models may treat as meaningful.

One-Hot Encoding – Creates binary columns for each category. Suitable for nominal (unordered) data.

Target Encoding – Replaces each category with the mean of the target variable for that category. The means should be computed from the training data only, to avoid leaking target information.

Frequency Encoding – Replaces each category with its frequency in the dataset.

Choosing the right method depends on the data and the model being used.
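
Frequency and target encoding have no dedicated one-liner in plain pandas, so here is one possible sketch using `value_counts` and `groupby`. The `city` and `bought` columns are made up, and "frequency" is taken here to mean the share of rows (a raw count would work the same way).

```python
import pandas as pd

# Made-up data: a nominal feature and a binary target
df = pd.DataFrame({
    "city": ["Paris", "Rome", "Paris", "Oslo", "Rome", "Paris"],
    "bought": [1, 0, 1, 0, 1, 0],
})

# Frequency encoding: share of rows belonging to each category
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Target encoding: mean of the target within each category
# (in practice, compute these means on the training split only)
target_mean = df.groupby("city")["bought"].mean()
df["city_target"] = df["city"].map(target_mean)

print(df)
```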

7. What do you mean by training and testing a dataset?
* Training and testing a dataset refers to splitting data into two parts to build and evaluate a machine learning model:

Training dataset: This portion is used to train the model. The algorithm learns patterns, relationships, and parameters from this data to make accurate predictions.

Testing dataset: This separate portion is used to evaluate how well the trained model performs on unseen data. It checks the model’s generalization ability and helps detect overfitting.

This split lets us check that the model has not merely memorized the training data and can also perform well on new, real-world data.

8. What is sklearn.preprocessing?
* sklearn.preprocessing is a module in Scikit-learn, a popular Python library for machine learning. It provides a set of tools for preprocessing data, which is a crucial step before training models. This module includes functions and classes to:

Scale features (e.g., StandardScaler, MinMaxScaler)

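Encode categorical variables (e.g., OneHotEncoder and OrdinalEncoder for features, LabelEncoder for target labels)

Normalize samples to unit norm (Normalizer)

Generate polynomial features (PolynomialFeatures)

Missing-value imputation is closely related but lives in a separate module: SimpleImputer is imported from sklearn.impute.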

Using sklearn.preprocessing ensures that your data is in the right format and scale for machine learning algorithms to perform effectively.
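
The short sketch below chains a few of these tools on a made-up matrix. Note that `SimpleImputer` is imported from `sklearn.impute`, while the scaler and polynomial expansion come from `sklearn.preprocessing`.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # imputation lives in sklearn.impute
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Made-up feature matrix with one missing entry
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Fill the missing value with the column mean: (10 + 30) / 2 = 20
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Rescale each column to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X_filled)

# Expand each row [a, b] into [1, a, b, a^2, a*b, b^2]
X_poly = PolynomialFeatures(degree=2).fit_transform(X_filled)

print(X_filled)
print(X_scaled)
print(X_poly.shape)
```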

9. What is a Test set?
* A test set is a portion of a dataset that is not used during model training but is reserved to evaluate the model's performance on unseen data. It helps determine how well the trained model generalizes to new, real-world inputs.

The test set is critical because:

It provides an unbiased assessment of a model’s accuracy.

It helps detect overfitting (when the model performs well on training data but poorly on new data).

It’s used for final performance reporting (e.g., accuracy, precision, recall).
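
To show how a test set exposes overfitting, the sketch below trains an unrestricted decision tree on synthetic, deliberately noisy data (generated with `make_classification`; the dataset and settings are assumptions chosen for illustration) and compares training accuracy with test accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns random labels to about 20% of samples
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers is the usual warning sign that the model has memorized rather than generalized.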

10. How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?
* In Python, you typically use Scikit-learn's train_test_split() function to divide your dataset:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```


X is the feature matrix

y is the target vector

test_size=0.2 means 20% of the data is used for testing

random_state ensures reproducibility

To approach a machine learning problem:

Understand the problem and data.

Perform exploratory data analysis (EDA).

Preprocess: handle missing values, encode categories, scale features.

Split data into training and testing sets.

Choose and train a model.

Evaluate with appropriate metrics.

Tune hyperparameters.

Deploy and monitor the model.
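
The core of this workflow (split, preprocess, train, evaluate) can be sketched in a few lines. The example below is only one possible arrangement: it assumes the Iris dataset bundled with Scikit-learn and a logistic regression model, and it skips EDA, tuning, and deployment.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Load data (the Iris dataset ships with scikit-learn)
X, y = load_iris(return_X_y=True)

# 2. Split before any fitting so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Preprocess and model in one pipeline; the scaler is fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```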

11. Why do we have to perform EDA before fitting a model to the data?
* Exploratory Data Analysis (EDA) is essential before fitting a model because it helps you understand the structure, quality, and patterns in your data. Key reasons to perform EDA include:

Identify missing values or incorrect data types

Detect outliers or unusual patterns

Understand distributions of features

Reveal relationships between variables

Guide feature selection and engineering

Spot potential data leakage and bias

Ensure the data is suitable for modeling

EDA ensures you're building your model on clean, well-understood data, leading to better performance and insights.
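
A first EDA pass often takes only a handful of pandas calls. The sketch below runs them on a small made-up DataFrame that contains one missing value and one implausible age; the column names and the age threshold of 120 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with a missing value and an implausible age
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 230],
    "income": [50000, 60000, 80000, 82000, 90000],
    "city": ["Paris", "Rome", "Paris", "Oslo", "Rome"],
})

print(df.dtypes)                   # data type of each column
missing_counts = df.isna().sum()   # missing values per column
print(missing_counts)
print(df.describe())               # summary statistics for numeric columns
print(df["city"].value_counts())   # category frequencies

# Simple range check to flag implausible ages
suspicious = df[df["age"] > 120]
print(suspicious)
```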

12. What is correlation?
* Correlation is a statistical measure that indicates the strength and direction of a linear relationship between two variables. It ranges from -1 to 1:

+1: Perfect positive correlation (as one increases, the other increases)

0: No linear correlation

-1: Perfect negative correlation (as one increases, the other decreases)

For example, height and weight often have a positive correlation, while hours of TV watched and test scores might show a negative correlation.

Correlation helps identify dependencies between variables during data analysis.

13. What does negative correlation mean?
Negative correlation means that as one variable increases, the other decreases — they move in opposite directions. The correlation coefficient for a negative correlation falls between 0 and -1.

A value close to -1 indicates a strong negative relationship.

A value close to 0 means a weak or no linear relationship.

Example: If exam scores tend to fall as the number of hours studied falls, the two variables move in the same direction, so this is a positive correlation.
If exam scores fall as the number of hours spent watching TV rises, the variables move in opposite directions, so this is a negative correlation.

14. How can you find correlation between variables in Python?
* Using .corr() in Pandas:

```python
import pandas as pd

# Example DataFrame
data = pd.DataFrame({
    'age': [25, 32, 47, 51, 62],
    'income': [50000, 60000, 80000, 82000, 90000]
})

# Compute correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)
```


To visualize:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
```


15. What is causation? Explain difference between correlation and causation with an example.
* Causation means that one variable directly affects another — a change in one variable causes a change in the other.

Difference Between Correlation and Causation:
Correlation: Two variables move together, but one does not necessarily cause the other to change.

Causation: One variable directly influences the other.

Example:
Correlation: Ice cream sales and drowning incidents both increase in summer. They’re correlated, but eating ice cream doesn't cause drowning.

Causation: Smoking causes lung damage. Here, smoking has a direct causal effect.
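
The ice cream example can be simulated. In the toy model below (all numbers are assumptions), temperature drives both ice cream sales and drownings, and neither affects the other. The raw correlation is strong, but it largely disappears once the effect of temperature is removed from both variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Assumed toy model: temperature drives both quantities; neither affects the other
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10.0 * temperature + rng.normal(0, 20, n)
drownings = 0.5 * temperature + rng.normal(0, 2, n)

raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, driver):
    """Remove the straight-line effect of `driver` from `values`."""
    slope, intercept = np.polyfit(driver, values, 1)
    return values - (slope * driver + intercept)

# Correlation once temperature has been accounted for
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, adjusted for temperature: {adjusted_corr:.2f}")
```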

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
* An optimizer is an algorithm used in machine learning to adjust model parameters (like weights) to minimize the loss function during training. It guides how the model learns from data.

Common types of optimizers:
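Gradient Descent – Updates weights using the gradient of the loss computed over the entire training set.
Example: Fitting a linear regression by repeatedly stepping in the direction that reduces the mean squared error.

Stochastic Gradient Descent (SGD) – Updates weights using one data point (or a small mini-batch) at a time.
Each update is much cheaper, but the updates are noisier.
Example: Scikit-learn's SGDClassifier and SGDRegressor are trained this way.

Adam (Adaptive Moment Estimation) – Combines momentum with per-parameter adaptive learning rates.
Example: Widely used as a default optimizer for training deep neural networks.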
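
To make the difference between the first two optimizers concrete, here is a minimal NumPy sketch that fits a line to made-up data, once with full-batch gradient descent and once with per-sample SGD. The function names, learning rates, and epoch counts are illustrative choices, not tuned values; Adam is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data from y = 2x + 1 plus a little noise
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, 200)

def batch_gradient_descent(x, y, lr=0.1, epochs=500):
    """Full-batch gradient descent on mean squared error for y = w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        error = w * x + b - y
        w -= lr * 2.0 * np.mean(error * x)  # gradient of the loss w.r.t. w
        b -= lr * 2.0 * np.mean(error)      # gradient of the loss w.r.t. b
    return w, b

def stochastic_gradient_descent(x, y, lr=0.05, epochs=20):
    """SGD: one update per sample, visiting samples in random order."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            error = w * x[i] + b - y[i]
            w -= lr * 2.0 * error * x[i]
            b -= lr * 2.0 * error
    return w, b

w_gd, b_gd = batch_gradient_descent(x, y)
w_sgd, b_sgd = stochastic_gradient_descent(x, y)
print(f"batch GD: w={w_gd:.2f}, b={b_gd:.2f}")
print(f"SGD:      w={w_sgd:.2f}, b={b_sgd:.2f}")
```

Both should land near w = 2 and b = 1, with the SGD estimate wobbling slightly more because each step sees only one sample.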

17. What is sklearn.linear_model ?
* sklearn.linear_model is a module in the Scikit-learn library that provides tools to build and train linear models for regression and classification tasks.

Common models in sklearn.linear_model:
LinearRegression – For predicting continuous values (e.g., house prices).

LogisticRegression – For binary or multi-class classification (e.g., spam detection).

Ridge and Lasso – Linear models with regularization to prevent overfitting.

SGDRegressor and SGDClassifier – Use stochastic gradient descent for large-scale learning.

These models are easy to use, fast, and effective for many ML problems.
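
As a quick comparison of three of these models, the sketch below fits `LinearRegression`, `Ridge`, and `Lasso` to made-up data in which only the first of two features matters. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Made-up data: the target depends on the first feature only
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)   # coefficients shrunk slightly toward zero
print("Lasso:", lasso.coef_)   # here the irrelevant feature's coefficient becomes zero
```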

18. What does model.fit() do? What arguments must be given?
* The model.fit() function in machine learning is used to train the model on given data. It adjusts the model's internal parameters (like weights) to learn the relationship between features and target values.

What it does:
Takes input features (X) and target values (y)

Learns from the data by minimizing the loss function

Prepares the model to make accurate predictions

Required arguments:

```python
model.fit(X, y)
```


X: Feature matrix (input data)

y: Target values (labels for supervised learning). Unsupervised estimators, such as scalers or clustering models, are fit with X alone.

19. What does model.predict() do? What arguments must be given?
* The model.predict() function is used to make predictions using a trained machine learning model. After training with model.fit(), you use model.predict() to apply the learned patterns to new or test data.

What it does:
Accepts input features (X)

Returns predicted output values (e.g., class labels or numerical predictions)

Required argument:

```python
model.predict(X_new)
```


X_new: A 2D array or DataFrame of new input data (same number of features as used in training)

It helps evaluate model performance or make real-world predictions.
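
Putting `fit()` and `predict()` together, here is a minimal sketch on made-up data that follows y = 2x + 1. It also shows the reshaping needed when new inputs start out as a 1D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature, so X has shape (n_samples, 1)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])  # y = 2x + 1

model = LinearRegression()
model.fit(X_train, y_train)  # learn parameters from (X, y)

# predict() expects a 2D array: one row per sample, one column per feature
X_new = np.array([5.0, 6.0]).reshape(-1, 1)
predictions = model.predict(X_new)
print(predictions)
```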

20. What are continuous and categorical variables?
* Continuous variables are numerical values that can take any value within a range and are measurable. They often include decimals and can be infinitely divided.
Examples: height, temperature, age, income.

Categorical variables represent distinct categories or groups and are usually non-numeric, or treated as labels. They can be:

Nominal (no natural order): color, gender, country

Ordinal (ordered categories): education level, customer rating

These variable types are handled differently in data preprocessing — continuous variables are often scaled, while categorical variables are encoded.
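
One way to apply different preprocessing to each variable type is Scikit-learn's `ColumnTransformer`. The sketch below scales two continuous columns and one-hot encodes a nominal one; the DataFrame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with two continuous columns and one nominal column
df = pd.DataFrame({
    "height_cm": [160.0, 172.5, 181.0, 168.2],
    "income": [42000.0, 55000.0, 61000.0, 48000.0],
    "color": ["red", "blue", "green", "blue"],
})

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["height_cm", "income"]),  # continuous -> scaled
        ("encode", OneHotEncoder(), ["color"]),                # categorical -> one-hot
    ],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(df)
print(X_ready.shape)  # 2 scaled columns + 3 one-hot columns
```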

21. What is feature scaling? How does it help in Machine Learning?
* Feature scaling is a preprocessing technique in machine learning where numerical features are rescaled to a common range (e.g., 0 to 1) or to zero mean and unit variance. This puts all features on a comparable scale.

Why it helps:
Prevents features with larger ranges from dominating the learning process

Speeds up convergence in optimization algorithms (e.g., gradient descent)

Improves performance of models sensitive to feature scales (like KNN, SVM, Logistic Regression)

Common methods:
Normalization (Min-Max Scaling)

Standardization (Z-score Scaling)
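
The formulas behind these two methods are short enough to write out directly. The NumPy sketch below applies them to a made-up feature; `x.std()` uses the population standard deviation, which is also what Scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Made-up feature values on a large scale
x = np.array([20.0, 30.0, 50.0, 100.0])

# Min-max normalization: (x - min) / (max - min), result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, result has mean 0 and std 1
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```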

22. How do we perform scaling in Python?
* 1. Standardization (Z-score scaling)
Rescales each feature to have mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```


2. Normalization (Min-Max scaling)
Rescales each feature to the 0 to 1 range.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```


Replace X with your feature data (usually a NumPy array or DataFrame).
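
One practical caveat: when the data is split into training and test sets, a common recommendation is to fit the scaler on the training set only and reuse it on the test set, so that no test-set statistics leak into training. A minimal sketch, using a made-up feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 100 samples, 3 features on different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 200.0], size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))      # near 0, but generally not exactly 0
```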

23. What is sklearn.preprocessing?
* **`sklearn.preprocessing`** is a module in the **Scikit-learn** library used to **prepare and transform data** before training machine learning models. It offers tools for **scaling features**, **encoding categorical variables**, and **normalizing data**, so that input data is in a suitable format and scale for algorithms to learn effectively. Common classes include `StandardScaler`, `MinMaxScaler`, and `OneHotEncoder`. Missing values are handled by `SimpleImputer`, which is imported from the separate `sklearn.impute` module. Proper preprocessing can improve model performance and convergence speed by resolving data inconsistencies and formatting issues.


24. How do we split data for model fitting (training and testing) in Python?
* To split data for model fitting in Python, you use the train_test_split() function from Scikit-learn. This separates your dataset into training and testing sets.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```


X: Feature variables

y: Target variable

test_size: Proportion of data used for testing (e.g., 0.2 = 20%)

random_state: Ensures reproducibility

This helps evaluate how well the model generalizes to new data.
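
For classification with imbalanced classes, `train_test_split` also accepts a `stratify` argument that keeps class proportions the same in both splits. The sketch below uses made-up labels with an 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class counts per split; both keep the 80/20 ratio
print(np.bincount(y_train), np.bincount(y_test))
```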

25. Explain data encoding.
* Data encoding is the process of converting categorical (non-numeric) data into a numerical format so that machine learning algorithms can interpret and use it effectively. Since most models work with numbers, encoding is essential for handling variables like gender, color, or product category.

Common Encoding Techniques:
Label Encoding: Assigns a unique integer to each category (e.g., Red = 0, Blue = 1).

One-Hot Encoding: Creates binary columns for each category (e.g., [1, 0, 0] for Red).

Proper encoding ensures the model correctly understands the data's structure.
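
Both techniques can be sketched in a few lines. The example below uses `OrdinalEncoder` for integer codes on a feature column (Scikit-learn's `LabelEncoder` is intended for target labels) and `pd.get_dummies` for one-hot columns. The `size` and `color` columns and the chosen category order are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal
    "color": ["Red", "Blue", "Green", "Red"],       # nominal
})

# Integer encoding with an explicit order: small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = encoder.fit_transform(df[["size"]]).ravel()

# One-hot encoding: one binary column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

print(df)
print(one_hot)
```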