## Theoretical Questions
**Q.1  What is a parameter?**
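 - A parameter is a variable listed in a function’s definition that receives an argument when the function is called. In machine learning, the term also refers to a value that a model learns from data, such as the weights of a linear regression.

```python
def greet(name):
    print("Hello, " + name)

greet("Alice")  # "Alice" is the argument passed to the parameter "name"
```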


**Q.2 What is correlation? What does negative correlation mean?**

  - Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to 1.

  A negative correlation means that as one variable increases, the other tends to decrease. It is represented by a correlation coefficient between -1 and 0.

**Q.3 Define Machine Learning. What are the main components in Machine Learning?**

  - Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed.

  Main components of Machine Learning:

1. Data – The foundation of ML, including training and testing datasets.

2. Features – Relevant attributes or characteristics extracted from data for learning.


3. Model – The mathematical or algorithmic framework used to learn from data.

4. Algorithm – The learning procedure used to fit the model to data (e.g., gradient descent for neural networks, greedy recursive splitting for decision trees).

5. Loss Function – A measure of how well the model's predictions match the actual values.

6. Training Process – The process of feeding data into the model to adjust parameters.

7. Evaluation Metrics – Used to assess the model’s performance (e.g., accuracy, precision).

8. Hyperparameters – Tunable settings that affect model performance (e.g., learning rate).

9. Prediction/Inference – Using the trained model to make predictions on new data.

**Q.4 How does loss value help in determining whether the model is good or not?**

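 - The loss value measures the difference between the model's predicted outputs and the actual target values, so it is a key indicator of how well the model is performing. A lower loss generally indicates a better fit, but it should also be checked on held-out data: a low training loss paired with a much higher validation loss is a sign of overfitting.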
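As a rough illustration of comparing models by loss, the sketch below computes the mean squared error for two sets of predictions. The target values, the two prediction arrays, and the helper name `mse` are made up for this example; MSE is only one of many possible loss functions.

```python
import numpy as np

# Hypothetical targets and predictions from two candidate models.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
pred_a = np.array([2.5, 5.5, 6.5, 9.5])   # close to the targets
pred_b = np.array([1.0, 8.0, 4.0, 12.0])  # far from the targets

def mse(y, y_hat):
    """Mean squared error: the average of the squared residuals."""
    return float(np.mean((y - y_hat) ** 2))

loss_a = mse(y_true, pred_a)
loss_b = mse(y_true, pred_b)
print(loss_a, loss_b)  # the model with the smaller value fits these targets better
```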

**Q.5 What are continuous and categorical variables?**

 - **Continuous variables** are numerical variables that can take infinitely many values within a range. They are measured rather than counted and can carry decimal precision (e.g., height, temperature).

 - **Categorical variables** represent distinct groups or categories. They are labels rather than measured quantities (e.g., color, country).

**Q.6 How do we handle categorical variables in Machine Learning? What are the common techniques?**

 - Categorical variables need to be converted into numerical values for ML models. Common techniques include:

1. Label Encoding – Assigns a unique integer to each category. Because the integers imply an order, it suits ordinal data or target labels better than nominal features.

2. One-Hot Encoding (OHE) – Creates binary columns for each category (best for nominal data).

3. Ordinal Encoding – Assigns ordered integers to categories with a meaningful rank.

4. Frequency Encoding – Replaces categories with their occurrence count.

5. Target Encoding – Replaces categories with the mean of the target variable (risk of data leakage).

6. Binary Encoding – Converts categories into binary form, reducing dimensionality.

7.  Hash Encoding – Uses a hash function to map categories to a fixed number of features.
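A few of the techniques above can be sketched with pandas alone. The toy DataFrame, its column names, and the `size_order` mapping are assumptions made for illustration; scikit-learn's encoders would be an alternative for use inside a pipeline.

```python
import pandas as pd

# Hypothetical toy data: "size" is ordinal, "color" is nominal.
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["red", "blue", "green", "red"],
})

# Ordinal encoding: map each level to an integer that respects its rank.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: replace each category with its occurrence count.
df["color_freq"] = df["color"].map(df["color"].value_counts())

print(df.join(one_hot))
```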

**Q.7 What do you mean by training and testing a dataset?**

  In machine learning, a dataset is typically divided into two main parts:

  - **Training dataset**
    - Used to train the model by letting it learn patterns from the data.
    - The model adjusts its parameters based on this dataset.
    - Usually comprises 70-80% of the total data.

  - **Testing dataset**
    - Used to evaluate the model's performance on unseen data.
    - Helps assess accuracy and generalization.
    - Typically 20-30% of the total data.

**Q.8 What is sklearn.preprocessing?**

 - `sklearn.preprocessing` is a scikit-learn module that provides functions and transformer classes for scaling, encoding, and otherwise transforming data before it is fed into a machine learning model (for example `StandardScaler`, `MinMaxScaler`, and `OneHotEncoder`).

**Q.9 What is a Test set?**
  
  - A test set is a portion of the dataset held back to evaluate the performance of a trained machine learning model.

  Key characteristics:

  - It is never used during training.
  - It helps assess how well the model generalizes to new, unseen data.
  - It typically makes up 20-30% of the total dataset.


**Q.10 How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?**

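 - In scikit-learn, we use the `train_test_split()` function from `sklearn.model_selection` to divide the dataset into training and testing sets. The `test_size` argument sets the fraction held out for testing, and `random_state` makes the split reproducible.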
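A minimal sketch of such a split is shown here. The arrays `X` and `y` are made-up placeholder data, and the 80/20 ratio and the use of `stratify` are choices made for this example rather than requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```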



### **Approach to a Machine Learning Problem**  

1️⃣ **Understand the Problem Statement**  
   - Define the objective (e.g., classification, regression).  
   - Identify key business or research questions.  

2️⃣ **Collect and Explore Data**  
   - Gather relevant datasets.  
   - Perform **exploratory data analysis (EDA)** to understand distributions, missing values, and correlations.  

3️⃣ **Preprocess Data**  
   - Handle **missing values** (imputation, removal).  
   - Convert **categorical variables** (label encoding, one-hot encoding).  
   - Normalize/scale numerical features if needed.  

4️⃣ **Split Data**  
   - Divide data into **training** and **testing** sets.  
   - Optionally, use a **validation set** for hyperparameter tuning.  

5️⃣ **Select and Train Model**  
   - Choose an appropriate **algorithm** (e.g., Decision Tree, SVM, Neural Networks).  
   - Train the model on the **training set**.  

6️⃣ **Evaluate Model Performance**  
   - Use the **test set** to measure accuracy, precision, recall, F1-score, RMSE, etc.  
   - Detect **overfitting** or **underfitting**.  

7️⃣ **Hyperparameter Tuning & Optimization**  
   - Improve performance using techniques like **GridSearchCV** or **RandomizedSearchCV**.  

8️⃣ **Deploy the Model**  
   - Save the trained model using `joblib` or `pickle`.  
   - Deploy in production using APIs, cloud, or embedded systems.  

9️⃣ **Monitor & Improve**  
   - Continuously **track model performance** and update with new data.  
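The core of this workflow (split, preprocess, train, evaluate) might look roughly like the sketch below. It assumes scikit-learn's bundled Iris dataset as a stand-in problem and picks logistic regression purely for illustration; tuning, deployment, and monitoring are left out.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a classification problem on a bundled example dataset.
X, y = load_iris(return_X_y=True)

# Step 4: hold out a test set before any fitting takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 3 and 5: scaling and model in one pipeline, fitted on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 6: evaluate on the unseen test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps the test data out of the preprocessing statistics, which is one common way to avoid leakage.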

**Q.11 Why do we have to perform EDA before fitting a model to the data?**


 EDA (Exploratory Data Analysis) helps ensure data quality and improves model performance by:  

1️⃣ **Understanding Data Distribution** – Identifies patterns, trends, and skewness.  
2️⃣ **Handling Missing Values** – Detects and fills or removes null values.  
3️⃣ **Detecting Outliers** – Prevents skewed model training.  
4️⃣ **Checking Feature Correlation** – Reduces redundancy and multicollinearity.  
5️⃣ **Identifying Data Imbalance** – Avoids biased predictions.  
6️⃣ **Guiding Feature Engineering** – Helps with encoding, scaling, and selection.  
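A few of these checks can be sketched with pandas. The DataFrame below, including its column names and values, is invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one obvious outlier.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [40_000, 52_000, 61_000, 58_000, 1_000_000],
    "churned": [0, 0, 1, 0, 0],
})

missing_counts = df.isna().sum()                             # missing values per column
class_balance = df["churned"].value_counts(normalize=True)   # share of each target class
correlations = df.corr()                                     # pairwise Pearson correlation

# Flag outliers in "income" with the 1.5 * IQR rule.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(df.describe())
print(missing_counts, class_balance, outliers, sep="\n\n")
```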

**Q.12 What is correlation?**

  - Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to 1.



**Q.13 What does negative correlation mean?**
  
  - A negative correlation means that as one variable increases, the other tends to decrease. It is represented by a correlation coefficient between -1 and 0.


**Q.14 How can you find correlation between variables in Python?**

   - We can find the correlation between variables in Python using:  

1. **Pandas** (`.corr()`) for Pearson, Spearman, or Kendall correlation:  
   ```python
   df.corr(method='pearson')  # Default is Pearson
   ```

2. **NumPy** (`np.corrcoef()`) for Pearson correlation:  
   ```python
   import numpy as np
   np.corrcoef(x, y)[0, 1]  # Pearson r between arrays x and y
   ```

3. **SciPy** (`pearsonr`, `spearmanr`, `kendalltau`) for correlation with p-values:  
   ```python
   from scipy.stats import pearsonr
   r, p_value = pearsonr(x, y)  # correlation coefficient and its p-value
   ```

4. **Seaborn** (for visualization with a heatmap):  
   ```python
   import seaborn as sns
   sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
   ```
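The snippets above assume that `df`, `x`, and `y` already exist. A self-contained sketch with made-up data (the column names and values are hypothetical) could look like this:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical data: scores rise with study hours while errors fall.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
    "errors": [9, 7, 6, 4, 2],
})

corr_matrix = df.corr(method="pearson")                  # full pairwise matrix
r_numpy = np.corrcoef(df["hours"], df["errors"])[0, 1]   # single coefficient
r_scipy, p_value = pearsonr(df["hours"], df["errors"])   # coefficient plus p-value

print(corr_matrix.round(3))
print(f"hours vs errors: r = {r_scipy:.3f}, p = {p_value:.4f}")
```

Here `hours` and `errors` move in opposite directions, so their coefficient is negative, while `hours` and `score` are positively correlated.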

**Q.15 What is causation? Explain difference between correlation and causation with an example**


  - Causation (also called **causality**) means that one event **directly causes** another. In other words, a change in one variable **results in** a change in another.  

🔹 **Example of Causation:**  
Heating water to its boiling point makes it boil. Here, the rise in temperature **directly causes** the boiling.  

---

### **Difference Between Correlation and Causation**  

| Feature | Correlation | Causation |
|---|---|---|
| **Definition** | A relationship where two variables move together (positively or negatively). | One variable directly causes a change in another. |
| **Cause-effect link** | Does not by itself establish a cause-effect relationship. | A direct cause-effect relationship exists. |
| **Example** | Ice cream sales and drowning rates are correlated (both increase in summer), yet neither causes the other. | Eating contaminated food **causes** food poisoning. |

---

### **Example to Illustrate Difference**  

📌 **Scenario:**  
A study finds that people who exercise more tend to be happier.  

- **Correlation:** Exercise and happiness are related, but the study alone does not show that exercise is the **cause**. Other factors such as social interaction or overall health could drive both, or happier people may simply exercise more.  
- **Causation:** If a controlled experiment shows that increasing exercise levels **directly** leads to increased happiness, then we can say exercise **causes** happiness.  

**Q.16 What is an Optimizer? What are different types of optimizers? Explain each with an example.**


An **optimizer** is an algorithm that adjusts a machine learning model’s weights to **minimize the loss function** and improve accuracy.  

---

**Types of Optimizers in Machine Learning**  

**1. Gradient Descent**  
Updates weights based on the gradient of the loss function.  
🔹 **Example (SGD in TensorFlow)**  
```python
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01)
```

 **2. Momentum-Based Optimizer (Momentum SGD)**  
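Adds a fraction of the previous update to the current one, which speeds up training and helps the optimizer move past shallow local minima.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01, momentum=0.9)
```

 **3. AdaGrad (Adaptive Gradient Algorithm)**  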
Adapts learning rates based on past gradients, useful for sparse data.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adagrad
optimizer = Adagrad(learning_rate=0.01)
```

**4. RMSprop (Root Mean Square Propagation)**  
Maintains a moving average of squared gradients for better stability.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import RMSprop
optimizer = RMSprop(learning_rate=0.01)
```

 **5. Adam (Adaptive Moment Estimation) [Most Common]**  
Combines Momentum and RMSprop for adaptive learning rates.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```

 **6. AdamW (Adam with Weight Decay)**  
Applies weight decay separately from the gradient update (decoupled weight decay), which helps regularize the model and reduce overfitting.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import AdamW
optimizer = AdamW(learning_rate=0.001, weight_decay=0.01)
```
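To make the update rules more concrete, here is a small framework-free sketch of plain gradient descent and gradient descent with momentum. The toy loss, starting points, and step counts are assumptions chosen for illustration; this is not how Keras implements its optimizers internally.

```python
# Minimize the toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

learning_rate = 0.1

# Plain gradient descent: step against the gradient.
w_gd = 0.0
for _ in range(100):
    w_gd -= learning_rate * grad(w_gd)

# Gradient descent with momentum: accumulate a velocity from past gradients.
w_mom, velocity, momentum = 0.0, 0.0, 0.9
for _ in range(300):
    velocity = momentum * velocity - learning_rate * grad(w_mom)
    w_mom += velocity

print(w_gd, w_mom)  # both should end up close to the minimum at w = 3
```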

**Q.17 What is sklearn.linear_model?**

 - `sklearn.linear_model` is a scikit-learn module that provides linear models for regression and classification tasks, such as `LinearRegression`, `Ridge`, `Lasso`, and `LogisticRegression`. These models assume a linear relationship between the input features and the target variable (or, for classifiers, a linear decision function).


**Q.18 What does model.fit() do? What arguments must be given?**

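 - `model.fit()` trains a scikit-learn estimator: it learns the model’s parameters (such as the coefficients in linear regression) from the given data. Supervised estimators require the feature matrix `X` (one row per sample, one column per feature) and the target values `y`, as in `model.fit(X_train, y_train)`; unsupervised estimators such as `KMeans` need only `X`.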


**Q.19 What does model.predict() do? What arguments must be given?**


 - `model.predict()` is used in **scikit-learn** to make predictions on new data after training a model.  

---

### **Arguments for `model.predict()`**
```python
model.predict(X_new)
```
- **`X_new`** → New input data (must have the same number of features as training data).  

---

### **Example (Regression Prediction)**  
```python
from sklearn.linear_model import LinearRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

model = LinearRegression()
model.fit(X_train, y_train)

X_new = [[6], [7]]  # New inputs
y_pred = model.predict(X_new)

print(y_pred)  # Output: [12. 14.]
```

---

### **Example (Classification Prediction)**  
```python
from sklearn.linear_model import LogisticRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

X_new = [[3.5]]
print(model.predict(X_new))  # Output: [1]
```

---

### **Key Points**
- Used for **making predictions** after training a model.  
- Requires **`X_new`** (new input data).  
- Returns **predicted values** for regression or **class labels** for classification.  
- Use `predict_proba()` for class probabilities.  

**Q.20 What are continuous and categorical variables?**

 - **Continuous variables** are numerical variables that can take infinitely many values within a range. They are measured rather than counted and can carry decimal precision (e.g., height, temperature).

 - **Categorical variables** represent distinct groups or categories. They are labels rather than measured quantities (e.g., color, country).

**Q.21 What is feature scaling? How does it help in Machine Learning?**

 - Feature scaling is the process of normalizing or standardizing numerical features so that they share a common scale, without distorting the relative differences within each feature. It often improves model performance and training stability.

Feature scaling helps in the following ways:

 - **Improves model performance** – Many algorithms (e.g., SVM, KNN, and models trained with gradient descent) perform better when features are on the same scale.
 - **Faster convergence** – Helps optimization algorithms such as gradient descent converge faster.
 - **Prevents bias** – Models like KNN and K-Means rely on distance calculations, so features with large ranges would otherwise dominate the results.


**Q.22 How do we perform scaling in Python?**

 - We can use **scikit-learn** to scale features with two common methods:  

 **1. Min-Max Scaling (Normalization)**  
Rescales each feature to the range **0 to 1** (the default `feature_range`).  
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)  # X is your feature matrix
```

 **2. Standardization (Z-score Scaling)**  
Rescales each feature to **mean = 0** and **standard deviation = 1**.  
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
```
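In practice the scaler is usually fitted on the training data only and then reused on the test data, so that no information from the test set leaks into preprocessing. The sketch below shows that pattern; the arrays `X_train` and `X_test` are made-up values for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: age (years) and income.
X_train = np.array([[25, 40_000], [35, 60_000], [45, 80_000]], dtype=float)
X_test = np.array([[30, 50_000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_test_scaled)
```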

**Q.23 What is sklearn.preprocessing?**

 - `sklearn.preprocessing` is a scikit-learn module that provides functions and transformer classes for scaling, encoding, and otherwise transforming data before it is fed into a machine learning model.

**Q.24 How do we split data for model fitting (training and testing) in Python?**

 - In scikit-learn, we use the `train_test_split()` function from `sklearn.model_selection` to divide the dataset into training and testing sets.


**Q.25


## Theoritical Questions
**Q.1  What is a parameter?**
 - A parameter is a variable listed in a function’s definition that receives an argument when the function is called.

 def greet(name):
   
    print("Hello, " + name)

greet("Alice")  # "Alice" is the argument passed to the parameter "name"


**Q.2 What is correlation?What does negative correlation mean?**

  - Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. It ranges from -1 to 1

  A negative correlation means that when one variable increases, the other decreases. It is represented by a correlation coefficient between -1 and 0.

**Q.3 Define Machine Learning. What are the main components in Machine Learning?**

  - Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed

  Main Components of Machine Learning

1. Data – The foundation of ML, including training and testing datasets.

2. Features – Relevant attributes or characteristics extracted from data for learning.


3. Model – The mathematical or algorithmic framework used to learn from data.

4. Algorithm – The method used to train the model (e.g., decision trees, neural networks).

5. Loss Function – A measure of how well the model's predictions match the actual values.

6. Training Process – The process of feeding data into the model to adjust parameters.

7. Evaluation Metrics – Used to assess the model’s performance (e.g., accuracy, precision).

8. Hyperparameters – Tunable settings that affect model performance (e.g., learning rate).

9. Prediction/Inference – Using the trained model to make predictions on new data.

**Q.4 How does loss value help in determining whether the model is good or not?**

 - The loss value is a key indicator of how well a machine learning model is performing. It measures the difference between the predicted output and the actual target values. A lower loss generally indicates a better model.

**Q.5 What are continuous and categorical variables?**

 - Continuous Variables

 These are numerical variables that can take an infinite number of values within a range.

 They can be measured and have decimal precision

 - Categorical Variables

 These variables represent distinct groups or categories.

 They cannot be measured numerically but can be labeled or classified.

**Q.7 How do we handle categorical variables in Machine Learning? What are the common technique?**

 - Categorical variables need to be converted into numerical values for ML models. Common techniques include:

1. Label Encoding – Assigns unique integers to categories (best for ordinal data).

2. One-Hot Encoding (OHE) – Creates binary columns for each category (best for nominal data).

3. Ordinal Encoding – Assigns ordered integers to categories with a meaningful rank.

4. Frequency Encoding – Replaces categories with their occurrence count.

5. Target Encoding – Replaces categories with the mean of the target variable (risk of data leakage).

6. Binary Encoding – Converts categories into binary form, reducing dimensionality.

7.  Hash Encoding – Uses a hash function to map categories to a fixed number of features.

**Q.7 What do you mean by training and testing a dataset?**

  In machine learning, a dataset is divided into two main parts:

  - Training Dataset

Used to train the model by allowing it to learn patterns from the data.
The model adjusts its parameters based on this dataset.
Usually comprises 70-80% of the total data.

 - Testing Dataset

Used to evaluate the model's performance on unseen data.
Helps assess accuracy and generalization.
Typically 20-30% of the total data.

**Q.8 What is sklearn.preprocessing?**

 - sklearn.preprocessing in Scikit-Learn

 sklearn.preprocessing is a module in Scikit-Learn that provides various functions to transform and scale data before feeding it into a machine learning model.

**Q.9 What is a Test set?**
  
  - A test set is a portion of the dataset used to evaluate the performance of a trained machine learning model.

  Key Characteristics:

   It is never used during training.

   It helps assess how well the model generalizes to new, unseen data.

  Typically makes up 20-30% of the total dataset.


**Q.10 How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?**

 -In Scikit-Learn, we use the train_test_split() function from sklearn.model_selection to divide the dataset into training and testing sets.



### **Approach to a Machine Learning Problem**  

1️⃣ **Understand the Problem Statement**  
   - Define the objective (e.g., classification, regression).  
   - Identify key business or research questions.  

2️⃣ **Collect and Explore Data**  
   - Gather relevant datasets.  
   - Perform **exploratory data analysis (EDA)** to understand distributions, missing values, and correlations.  

3️⃣ **Preprocess Data**  
   - Handle **missing values** (imputation, removal).  
   - Convert **categorical variables** (label encoding, one-hot encoding).  
   - Normalize/scale numerical features if needed.  

4️⃣ **Split Data**  
   - Divide data into **training** and **testing** sets.  
   - Optionally, use a **validation set** for hyperparameter tuning.  

5️⃣ **Select and Train Model**  
   - Choose an appropriate **algorithm** (e.g., Decision Tree, SVM, Neural Networks).  
   - Train the model on the **training set**.  

6️⃣ **Evaluate Model Performance**  
   - Use the **test set** to measure accuracy, precision, recall, F1-score, RMSE, etc.  
   - Detect **overfitting** or **underfitting**.  

7️⃣ **Hyperparameter Tuning & Optimization**  
   - Improve performance using techniques like **GridSearchCV** or **RandomizedSearchCV**.  

8️⃣ **Deploy the Model**  
   - Save the trained model using `joblib` or `pickle`.  
   - Deploy in production using APIs, cloud, or embedded systems.  

9️⃣ **Monitor & Improve**  
   - Continuously **track model performance** and update with new data.  

**Q.11 Why do we have to perform EDA before fitting a model to the data?**


 EDA (Exploratory Data Analysis) helps ensure data quality and improves model performance by:  

1️⃣ **Understanding Data Distribution** – Identifies patterns, trends, and skewness.  
2️⃣ **Handling Missing Values** – Detects and fills or removes null values.  
3️⃣ **Detecting Outliers** – Prevents skewed model training.  
4️⃣ **Checking Feature Correlation** – Reduces redundancy and multicollinearity.  
5️⃣ **Identifying Data Imbalance** – Avoids biased predictions.  
6️⃣ **Guiding Feature Engineering** – Helps with encoding, scaling, and selection.  

**Q.12 What is correlation?**

  -  Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. It ranges from -1 to 1



**Q.13 What does negative correlation mean?**
  
  -   A negative correlation means that when one variable increases, the other decreases. It is represented by a correlation coefficient between -1 and 0.


**Q.14 How can you find correlation between variables in Python?**

   - we can find the correlation between variables in Python using:  

1. **Pandas** (`.corr()`) for Pearson, Spearman, or Kendall correlation:  
   ```python
   df.corr(method='pearson')  # Default is Pearson
   ```

2. **NumPy** (`np.corrcoef()`) for Pearson correlation:  
   ```python
   np.corrcoef(x, y)[0, 1]
   ```

3. **SciPy** (`pearsonr`, `spearmanr`, `kendalltau`) for correlation with p-values:  
   ```python
   from scipy.stats import pearsonr
   pearsonr(x, y)
   ```

4. **Seaborn** (for visualization with a heatmap):  
   ```python
   sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
   ```

**Q.15 What is causation? Explain difference between correlation and causation with an example**


  - Causation (also called **causality**) means that one event **directly causes** another. In other words, a change in one variable **results in** a change in another.  

🔹 **Example of Causation:**  
If you increase the temperature of water, it starts boiling. Here, increasing temperature **directly causes** boiling.  

---

### **Difference Between Correlation and Causation**  

| Feature        | Correlation                         | Causation                           |
|---------------|-----------------------------------|-----------------------------------|
| **Definition** | A relationship where two variables move together (positively or negatively). | One variable directly causes a change in another. |
| **Direction**  | No direct cause-effect relationship. | A direct cause-effect relationship exists. |
| **Example**    | Ice cream sales and drowning rates are correlated (both increase in summer). | Eating contaminated food **causes** food poisoning. |

---

### **Example to Illustrate Difference**  

📌 **Scenario:**  
A study finds that people who exercise more tend to be happier.  

- **Correlation:** Exercise and happiness are related, but exercise may not be the **cause**. Other factors like social interaction, better health, or endorphins could be involved.  
- **Causation:** If a controlled experiment proves that increasing exercise levels **directly** leads to increased happiness, then we can say exercise **causes** happiness.  

**Q.16 What is an Optimizer? What are different types of optimizers? Explain each with an example.**


An **optimizer** is an algorithm that adjusts a machine learning model’s weights to **minimize the loss function** and improve accuracy.  

---

**Types of Optimizers in Machine Learning**  

**1. Gradient Descent**  
Updates weights based on the gradient of the loss function.  
🔹 **Example (SGD in TensorFlow)**  
```python
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01)
```

 **2. Momentum-Based Optimizer (Momentum SGD)**  
Uses momentum to speed up training and avoid local minima.  
🔹 **Example:**  
```python
optimizer = SGD(learning_rate=0.01, momentum=0.9)
```
 **3. AdaGrad (Adaptive Gradient Algorithm)**  
Adapts learning rates based on past gradients, useful for sparse data.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adagrad
optimizer = Adagrad(learning_rate=0.01)
```

**4. RMSprop (Root Mean Square Propagation)**  
Maintains a moving average of squared gradients for better stability.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import RMSprop
optimizer = RMSprop(learning_rate=0.01)
```

 **5. Adam (Adaptive Moment Estimation) [Most Common]**  
Combines Momentum and RMSprop for adaptive learning rates.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```

 **6. AdamW (Adam with Weight Decay)**  
Prevents overfitting by adding weight decay.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import AdamW
optimizer = AdamW(learning_rate=0.001, weight_decay=0.01)
```
**Q.17 What is sklearn.linear_model ?**

 - sklearn.linear_model is a module in scikit-learn that provides various linear models for regression and classification tasks. These models assume a linear relationship between input features and the target variable.


**Q.18 What does model.fit() do? What arguments must be given?**

 - model.fit() is a method used in scikit-learn to train a machine learning model by learning patterns from the input data. It adjusts the model’s parameters (like weights in linear regression) based on the given dataset.


**Q.19 What does model.predict() do? What arguments must be given?**


 - 'model.predict()` is used in **scikit-learn** to make predictions on new data after training a model.  

---

### **Arguments for `model.predict()`**
```python
model.predict(X_new)
```
- **`X_new`** → New input data (must have the same number of features as training data).  

---

### **Example (Regression Prediction)**  
```python
from sklearn.linear_model import LinearRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

model = LinearRegression()
model.fit(X_train, y_train)

X_new = [[6], [7]]  # New inputs
y_pred = model.predict(X_new)

print(y_pred)  # Output: [12. 14.]
```

---

### **Example (Classification Prediction)**  
```python
from sklearn.linear_model import LogisticRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

X_new = [[3.5]]
print(model.predict(X_new))  # Output: [1]
```

---

### **Key Points**
- Used for **making predictions** after training a model.  
- Requires **`X_new`** (new input data).  
- Returns **predicted values** for regression or **class labels** for classification.  
- Use `predict_proba()` for class probabilities.  

**Q.20 What are continuous and categorical variables?**

Continuous Variables

These are numerical variables that can take an infinite number of values within a range.

They can be measured and have decimal precision

Categorical Variables

These variables represent distinct groups or categories.

They cannot be measured numerically but can be labeled or classified.

**Q.21What is feature scaling? How does it help in Machine Learning?**

 - Feature scaling is the process of normalizing or standardizing numerical features to a common scale without distorting differences in the data. It is essential in machine learning to improve model performance and training stability.

it help in Machine learning by the following way
Improves Model Performance – Many algorithms (e.g., Gradient Descent, SVM, KNN) perform better when features are on the same scale.

Faster Convergence – Helps optimization algorithms (e.g., Gradient Descent) converge faster.

Prevents Bias – Models like KNN and K-Means use distance calculations, so unscaled features can dominate the results.


**Q22How do we perform scaling in Python?**


 - **Feature Scaling in Python**  

We use **scikit-learn** to scale features using two common methods:  

 **1. Min-Max Scaling (Normalization)**  
Scales data between **0 and 1**.  
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)  # X is your feature matrix
```
 **2. Standardization (Z-score Scaling)**  
Centers data around **mean = 0** and **std = 1**.  
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
```

**Q.23  What is sklearn.preprocessing?**

sklearn.preprocessing in Scikit-Learn

sklearn.preprocessing is a module in Scikit-Learn that provides various functions to transform and scale data before feeding it into a machine learning model.

**Q.24How do we split data for model fitting (training and testing) in Python?**

 - -In Scikit-Learn, we use the train_test_split() function from sklearn.model_selection to divide the dataset into training and testing sets.


**Q.25


## Theoritical Questions
**Q.1  What is a parameter?**
 - A parameter is a variable listed in a function’s definition that receives an argument when the function is called.

 def greet(name):
   
    print("Hello, " + name)

greet("Alice")  # "Alice" is the argument passed to the parameter "name"


**Q.2 What is correlation?What does negative correlation mean?**

  - Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. It ranges from -1 to 1

  A negative correlation means that when one variable increases, the other decreases. It is represented by a correlation coefficient between -1 and 0.

**Q.3 Define Machine Learning. What are the main components in Machine Learning?**

  - Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn patterns from data and make predictions or decisions without being explicitly programmed

  Main Components of Machine Learning

1. Data – The foundation of ML, including training and testing datasets.

2. Features – Relevant attributes or characteristics extracted from data for learning.


3. Model – The mathematical or algorithmic framework used to learn from data.

4. Algorithm – The method used to train the model (e.g., decision trees, neural networks).

5. Loss Function – A measure of how well the model's predictions match the actual values.

6. Training Process – The process of feeding data into the model to adjust parameters.

7. Evaluation Metrics – Used to assess the model’s performance (e.g., accuracy, precision).

8. Hyperparameters – Tunable settings that affect model performance (e.g., learning rate).

9. Prediction/Inference – Using the trained model to make predictions on new data.

**Q.4 How does loss value help in determining whether the model is good or not?**

 - The loss value is a key indicator of how well a machine learning model is performing. It measures the difference between the predicted output and the actual target values. A lower loss generally indicates a better model.

**Q.5 What are continuous and categorical variables?**

 - Continuous Variables

 These are numerical variables that can take any value within a range, so the number of possible values is infinite.

 They are measured rather than counted and can have decimal precision (e.g., height, temperature, price).

 - Categorical Variables

 These variables represent distinct groups or categories.

 They have no inherent numeric value but can be labeled or classified (e.g., color, city, blood type).

**Q.6 How do we handle categorical variables in Machine Learning? What are the common techniques?**

 - Categorical variables need to be converted into numerical values for ML models. Common techniques include:

1. Label Encoding – Assigns a unique integer to each category. The integers imply an order, so it suits ordinal data or target labels.

2. One-Hot Encoding (OHE) – Creates binary columns for each category (best for nominal data).

3. Ordinal Encoding – Assigns ordered integers to categories with a meaningful rank.

4. Frequency Encoding – Replaces categories with their occurrence count.

5. Target Encoding – Replaces categories with the mean of the target variable (risk of data leakage).

6. Binary Encoding – Converts categories into binary form, reducing dimensionality.

7. Hash Encoding – Uses a hash function to map categories to a fixed number of features.
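
Two of the techniques above, one-hot encoding and ordinal encoding, can be sketched as follows. The toy DataFrame and its column names are assumptions made for illustration; `pd.get_dummies` and `OrdinalEncoder` are just one way to apply these encodings.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "color" is nominal, "size" is ordinal.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size": ["S", "L", "M", "S"],
})

# One-hot encoding: one binary column per color category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: pass the category order explicitly so that S < M < L.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
size_encoded = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot.columns.tolist())  # ['color_Blue', 'color_Green', 'color_Red']
print(size_encoded)              # [0. 2. 1. 0.]
```

One-hot encoding avoids implying an order between colors, while the ordinal encoder keeps the order that sizes really have.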

**Q.7 What do you mean by training and testing a dataset?**

  In machine learning, a dataset is divided into two main parts:

  - Training Dataset
    - Used to train the model by allowing it to learn patterns from the data.
    - The model adjusts its parameters based on this dataset.
    - Usually comprises 70-80% of the total data.

  - Testing Dataset
    - Used to evaluate the model's performance on unseen data.
    - Helps assess accuracy and generalization.
    - Typically 20-30% of the total data.

**Q.8 What is sklearn.preprocessing?**

 - `sklearn.preprocessing` is a module in scikit-learn that provides classes and functions to transform, scale, and encode data before feeding it into a machine learning model (e.g., `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, `LabelEncoder`).

**Q.9 What is a Test set?**
  
  - A test set is a portion of the dataset used to evaluate the performance of a trained machine learning model.

  Key Characteristics:

    - It is never used during training.
    - It helps assess how well the model generalizes to new, unseen data.
    - It typically makes up 20-30% of the total dataset.


**Q.10 How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?**

 - In scikit-learn, we use the `train_test_split()` function from `sklearn.model_selection` to divide the dataset into training and testing sets.
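
A minimal sketch of such a split is shown below. The toy arrays are invented for the example, and the 80/20 ratio and `random_state=42` are arbitrary choices, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing. random_state makes the split
# repeatable, and stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```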



### **Approach to a Machine Learning Problem**  

1️⃣ **Understand the Problem Statement**  
   - Define the objective (e.g., classification, regression).  
   - Identify key business or research questions.  

2️⃣ **Collect and Explore Data**  
   - Gather relevant datasets.  
   - Perform **exploratory data analysis (EDA)** to understand distributions, missing values, and correlations.  

3️⃣ **Preprocess Data**  
   - Handle **missing values** (imputation, removal).  
   - Convert **categorical variables** (label encoding, one-hot encoding).  
   - Normalize/scale numerical features if needed.  

4️⃣ **Split Data**  
   - Divide data into **training** and **testing** sets.  
   - Optionally, use a **validation set** for hyperparameter tuning.  

5️⃣ **Select and Train Model**  
   - Choose an appropriate **algorithm** (e.g., Decision Tree, SVM, Neural Networks).  
   - Train the model on the **training set**.  

6️⃣ **Evaluate Model Performance**  
   - Use the **test set** to measure accuracy, precision, recall, F1-score, RMSE, etc.  
   - Detect **overfitting** or **underfitting**.  

7️⃣ **Hyperparameter Tuning & Optimization**  
   - Improve performance using techniques like **GridSearchCV** or **RandomizedSearchCV**.  

8️⃣ **Deploy the Model**  
   - Save the trained model using `joblib` or `pickle`.  
   - Deploy in production using APIs, cloud, or embedded systems.  

9️⃣ **Monitor & Improve**  
   - Continuously **track model performance** and update with new data.  
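
The middle steps of this workflow (preprocess, split, train, evaluate) might look roughly like the sketch below. It assumes the built-in Iris dataset and a logistic regression model purely for illustration; a real project would pick both to suit the problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a classification problem on a built-in dataset.
X, y = load_iris(return_X_y=True)

# Step 4: hold out a test set before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: scaling and the model live in one pipeline, so the
# scaler is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Step 6: evaluate on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in a single pipeline is a design choice that keeps test data from influencing preprocessing.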

**Q.11 Why do we have to perform EDA before fitting a model to the data?**


 EDA (Exploratory Data Analysis) helps ensure data quality and improves model performance by:  

1️⃣ **Understanding Data Distribution** – Identifies patterns, trends, and skewness.  
2️⃣ **Handling Missing Values** – Detects and fills or removes null values.  
3️⃣ **Detecting Outliers** – Prevents skewed model training.  
4️⃣ **Checking Feature Correlation** – Reduces redundancy and multicollinearity.  
5️⃣ **Identifying Data Imbalance** – Avoids biased predictions.  
6️⃣ **Guiding Feature Engineering** – Helps with encoding, scaling, and selection.  
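
A few of these checks can be sketched with pandas as follows. The DataFrame, its column names, and its values are hypothetical and only serve to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value and an imbalanced label.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [30000, 42000, 58000, 61000, 45000, 250000],
    "label": ["no", "no", "no", "no", "yes", "no"],
})

summary = df.describe()                     # distribution of numeric columns
missing = df.isna().sum()                   # missing values per column
class_counts = df["label"].value_counts()   # class balance
correlation = df[["age", "income"]].corr()  # pairwise correlation

print(summary, missing, class_counts, correlation, sep="\n\n")
```

In this toy data the summary exposes the extreme income value, the missing-value count flags the gap in `age`, and the class counts show the imbalance in `label`.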

**Q.12 What is correlation?**

  - Correlation is a statistical measure that indicates the strength and direction of the relationship between two variables. The correlation coefficient ranges from -1 to 1.

**Q.13 What does negative correlation mean?**

  - A negative correlation means that when one variable increases, the other tends to decrease. It is represented by a correlation coefficient between -1 and 0.


**Q.14 How can you find correlation between variables in Python?**

   - We can find the correlation between variables in Python using:  

1. **Pandas** (`.corr()`) for Pearson, Spearman, or Kendall correlation:  
   ```python
   df.corr(method='pearson')  # Default is Pearson
   ```

2. **NumPy** (`np.corrcoef()`) for Pearson correlation:  
   ```python
   import numpy as np
   np.corrcoef(x, y)[0, 1]  # x and y are 1-D sequences of equal length
   ```

3. **SciPy** (`pearsonr`, `spearmanr`, `kendalltau`) for correlation with p-values:  
   ```python
   from scipy.stats import pearsonr
   pearsonr(x, y)  # returns the coefficient and its p-value
   ```

4. **Seaborn** (for visualization with a heatmap):  
   ```python
   import seaborn as sns
   sns.heatmap(df.corr(), annot=True, cmap='coolwarm')  # df is a numeric DataFrame
   ```
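
To see the first three approaches agree on the same data, here is a small self-contained sketch. The `hours` and `score` values are invented for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical data: hours studied and exam scores for five students.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
})

r_pandas = df["hours"].corr(df["score"])               # Pearson by default
r_numpy = np.corrcoef(df["hours"], df["score"])[0, 1]  # off-diagonal entry
r_scipy, p_value = pearsonr(df["hours"], df["score"])  # also gives a p-value

print(round(r_pandas, 3), round(r_numpy, 3), round(r_scipy, 3))
```

All three values should match (about 0.99 for this data), which indicates a strong positive correlation.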

**Q.15 What is causation? Explain difference between correlation and causation with an example**


  - Causation (also called **causality**) means that one event **directly causes** another. In other words, a change in one variable **results in** a change in another.  

🔹 **Example of Causation:**  
Heating water to its boiling point makes it boil. Here, the rise in temperature **directly causes** boiling.  

---

### **Difference Between Correlation and Causation**  

| Feature        | Correlation                         | Causation                           |
|---------------|-----------------------------------|-----------------------------------|
| **Definition** | A relationship where two variables move together (positively or negatively). | One variable directly causes a change in another. |
| **Cause and effect** | Does not by itself imply a cause-effect relationship. | A direct cause-effect relationship exists. |
| **Example**    | Ice cream sales and drowning rates are correlated (both increase in summer). | Eating contaminated food **causes** food poisoning. |

---

### **Example to Illustrate Difference**  

📌 **Scenario:**  
A study finds that people who exercise more tend to be happier.  

- **Correlation:** Exercise and happiness are related, but exercise may not be the **cause**. Confounding factors such as social interaction or better health could drive both, or happier people may simply exercise more.  
- **Causation:** If a controlled experiment proves that increasing exercise levels **directly** leads to increased happiness, then we can say exercise **causes** happiness.  
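
The ice cream and drowning example from the table can be simulated to show how a hidden common cause produces correlation without causation. Everything in this sketch is an assumption made for illustration: the numbers are synthetic, and temperature is modeled as the only driver of both quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: temperature drives both quantities; neither affects the other.
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.5 * temperature + rng.normal(0, 2, size=500)

# The two outcomes are strongly correlated even though neither causes the other.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Remove the straight-line effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# After removing the effect of temperature, the correlation largely disappears.
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {raw_corr:.2f}, adjusted for temperature: {adjusted_corr:.2f}")
```

The raw correlation is high, but once temperature is accounted for it falls close to zero, which is what one would expect if temperature is the common cause.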

**Q.16 What is an Optimizer? What are different types of optimizers? Explain each with an example.**


An **optimizer** is an algorithm that adjusts a machine learning model’s weights to **minimize the loss function** and improve accuracy.  

---

**Types of Optimizers in Machine Learning**  

**1. Gradient Descent**  
Updates weights based on the gradient of the loss function.  
🔹 **Example (SGD in TensorFlow)**  
```python
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01)
```
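
To make the update rule itself concrete, the following sketch runs plain gradient descent by hand on a one-parameter toy loss. The loss function, starting point, and learning rate are assumptions chosen so the result is easy to check; it does not use TensorFlow.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0              # initial weight
learning_rate = 0.1  # step size (a hyperparameter)

for step in range(100):
    # Move the weight a small step against the gradient.
    w = w - learning_rate * gradient(w)

print(round(w, 4), round(loss(w), 6))  # 3.0 0.0
```

Each step shrinks the distance to the minimum by a constant factor, so the weight settles at 3, where the loss is zero.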

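**2. Momentum-Based Optimizer (Momentum SGD)**  
Adds a fraction of the previous update to the current one, which speeds up training and helps the optimizer move past shallow local minima and flat regions.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import SGD
optimizer = SGD(learning_rate=0.01, momentum=0.9)
```

**3. AdaGrad (Adaptive Gradient Algorithm)**  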
Adapts learning rates based on past gradients, useful for sparse data.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adagrad
optimizer = Adagrad(learning_rate=0.01)
```

**4. RMSprop (Root Mean Square Propagation)**  
Maintains a moving average of squared gradients for better stability.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import RMSprop
optimizer = RMSprop(learning_rate=0.01)
```

**5. Adam (Adaptive Moment Estimation) [Most Common]**  
Combines Momentum and RMSprop for adaptive learning rates.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```

**6. AdamW (Adam with Weight Decay)**  
Applies weight decay separately from the gradient-based update, which can help reduce overfitting.  
🔹 **Example:**  
```python
from tensorflow.keras.optimizers import AdamW
optimizer = AdamW(learning_rate=0.001, weight_decay=0.01)
```
**Q.17 What is sklearn.linear_model?**

 - `sklearn.linear_model` is a module in scikit-learn that provides linear models for regression and classification, such as `LinearRegression`, `Ridge`, `Lasso`, and `LogisticRegression`. These models predict the target (or, for classifiers, the decision score) as a weighted sum of the input features.


**Q.18 What does model.fit() do? What arguments must be given?**

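 - `model.fit()` is the method used in scikit-learn to train a machine learning model on data. It adjusts the model’s parameters (like weights in linear regression) based on the given dataset. Supervised models require two arguments: `X`, the feature matrix of shape `(n_samples, n_features)`, and `y`, the target values. Unsupervised estimators such as `KMeans` need only `X`.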
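
As a small illustration, the sketch below fits a linear regression on made-up data generated from y = 2x + 1 and then reads back the parameters that `fit()` learned. The data and variable names are assumptions for this example.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 2x + 1.
X = [[1], [2], [3], [4]]  # 2-D: one row per sample, one column per feature
y = [3, 5, 7, 9]          # 1-D: one target per sample

model = LinearRegression()
model.fit(X, y)           # learns the slope and intercept from the data

slope = model.coef_[0]
intercept = model.intercept_
print(round(slope, 3), round(intercept, 3))  # 2.0 1.0
```

Note that `X` must be two-dimensional even with a single feature, which is why each sample is wrapped in its own list.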


**Q.19 What does model.predict() do? What arguments must be given?**


 - `model.predict()` is used in **scikit-learn** to make predictions on new data after training a model.  

---

### **Arguments for `model.predict()`**
```python
model.predict(X_new)
```
- **`X_new`** → New input data (must have the same number of features as training data).  

---

### **Example (Regression Prediction)**  
```python
from sklearn.linear_model import LinearRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

model = LinearRegression()
model.fit(X_train, y_train)

X_new = [[6], [7]]  # New inputs
y_pred = model.predict(X_new)

print(y_pred)  # Output: [12. 14.]
```

---

### **Example (Classification Prediction)**  
```python
from sklearn.linear_model import LogisticRegression

X_train = [[1], [2], [3], [4], [5]]
y_train = [0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X_train, y_train)

X_new = [[3.5]]
print(model.predict(X_new))  # Output: [1]
```

---

### **Key Points**
- Used for **making predictions** after training a model.  
- Requires **`X_new`** (new input data).  
- Returns **predicted values** for regression or **class labels** for classification.  
- Use `predict_proba()` for class probabilities.  

**Q.20 What are continuous and categorical variables?**

 - Continuous Variables

 These are numerical variables that can take any value within a range, so the number of possible values is infinite.

 They are measured rather than counted and can have decimal precision (e.g., height, temperature, price).

 - Categorical Variables

 These variables represent distinct groups or categories.

 They have no inherent numeric value but can be labeled or classified (e.g., color, city, blood type).

**Q.21 What is feature scaling? How does it help in Machine Learning?**

 - Feature scaling is the process of normalizing or standardizing numerical features to a common scale without distorting differences in the data. It matters for many machine learning algorithms, where it improves model performance and training stability.

It helps in machine learning in the following ways:

- Improves Model Performance – Many algorithms (e.g., Gradient Descent, SVM, KNN) perform better when features are on the same scale.
- Faster Convergence – Helps optimization algorithms (e.g., Gradient Descent) converge faster.
- Prevents Bias – Models like KNN and K-Means use distance calculations, so unscaled features with large ranges can dominate the results.


**Q.22 How do we perform scaling in Python?**


 - **Feature Scaling in Python**  

We use **scikit-learn** to scale features using two common methods:  

**1. Min-Max Scaling (Normalization)**  
Rescales each feature to the range **0 to 1**.  
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)  # X is your feature matrix
```
**2. Standardization (Z-score Scaling)**  
Rescales each feature to have **mean = 0** and **standard deviation = 1**.  
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
```
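
The sketch below applies both scalers to a tiny made-up feature so the effect is visible. It also shows a common practice that is an addition here, not something stated above: fitting the scaler on the training data only and reusing it for the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data, already split into train and test.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Fit on the training data only, then reuse the same scaler for the test data.
minmax = MinMaxScaler().fit(X_train)
train_minmax = minmax.transform(X_train)  # [[0.0], [0.5], [1.0]]
test_minmax = minmax.transform(X_test)    # [[1.5]]: 4 exceeds the training maximum

standard = StandardScaler().fit(X_train)
train_standard = standard.transform(X_train)  # mean 0, standard deviation 1

print(train_minmax.ravel(), test_minmax.ravel())
print(train_standard.mean(), train_standard.std())
```

A test value can fall outside the 0 to 1 range because the scaler only knows the minimum and maximum of the training data.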

**Q.23 What is sklearn.preprocessing?**

 - `sklearn.preprocessing` is a module in scikit-learn that provides classes and functions to transform, scale, and encode data before feeding it into a machine learning model (e.g., `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, `LabelEncoder`).

**Q.24 How do we split data for model fitting (training and testing) in Python?**

 - In scikit-learn, we use the `train_test_split()` function from `sklearn.model_selection` to divide the dataset into training and testing sets.


**Q.25 Explain data encoding.**
  

  - Data encoding is the process of converting categorical data into a numerical format that machine learning models can understand. Since most ML models work with numbers, categorical data (like "Red", "Blue", "Green") must be encoded into numeric values.

