###  1. What is a parameter?

* **Definition:** In programming, a parameter is a variable used to pass information into a function. In Machine Learning, parameters are internal variables learned by the model (like weights in neural networks).
* **Example:** In `def add(a, b): return a+b`, `a` and `b` are parameters.
* **Follow-up note:** In ML, parameters are learned during training, unlike hyperparameters, which are set before training begins.
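
To make the parameter/hyperparameter distinction concrete, here is a minimal sketch using scikit-learn's `Ridge`. The toy data and the choice of `alpha` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (assumed for illustration) generated from y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

# alpha is a hyperparameter: we choose it before training
model = Ridge(alpha=1e-6)
model.fit(X, y)

# coef_ and intercept_ are parameters: the model learns them from the data
print("learned weight:", model.coef_[0])
print("learned bias:", model.intercept_)
```

With such a small `alpha`, the learned weight and bias should come out very close to 2 and 1.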


###  2. What is correlation?

* **Definition:** Correlation measures the strength and direction of the statistical relationship between two variables.
* **Range:** The correlation coefficient lies between `-1` and `+1`.
* **Example:** If study hours increase and marks increase, correlation is positive.

###  2[i]. What does negative correlation mean?

* **Definition:** When one variable increases while the other decreases, it’s called negative correlation.
* **Example:** More exercise → lower body weight.


###  3. Define Machine Learning. What are the main components in Machine Learning?

* **Definition:** Machine Learning is a subset of AI that enables systems to learn patterns from data and make predictions/decisions without being explicitly programmed.
* **Main Components:**

  1. **Data** – raw information for training/testing.
  2. **Model/Algorithm** – method used (e.g., linear regression, decision trees).
  3. **Loss Function** – measures error between prediction and actual.
  4. **Optimizer** – adjusts parameters to minimize loss.
  5. **Evaluation Metrics** – accuracy, precision, recall, etc.

###  4. How does loss value help in determining whether the model is good or not?

* **Answer:** The loss value quantifies the difference between predicted output and actual output.

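  * **Low loss** → model predictions are closer to actual values → good fit. Check the loss on validation/test data as well, because a low training loss alone can indicate overfitting.
  * **High loss** → predictions are far from actual values → poor performance.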
* **Example:** In regression, Mean Squared Error (MSE) shows how far predictions deviate.
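
As a small illustration, MSE can be computed by hand with NumPy. The arrays below are made-up values, and `mse` is a helper defined here, not a library function.

```python
import numpy as np

# Made-up targets and two sets of predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0])
good_pred = np.array([2.5, 5.0, 7.5])
bad_pred = np.array([0.0, 9.0, 2.0])

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

good_loss = mse(y_true, good_pred)
bad_loss = mse(y_true, bad_pred)

# The predictions that sit closer to the targets get the lower loss
print(good_loss, bad_loss)
```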

###  5. What are continuous and categorical variables?

* **Continuous Variable:** A measurable numerical quantity that can take any value within a range (infinitely many possible values).

  * Example: Height, temperature, salary.
* **Categorical Variable:** Discrete values representing groups or labels.

  * Example: Gender (Male/Female), Colors (Red/Blue/Green).

### 6. How do we handle categorical variables in Machine Learning? What are the common techniques?

* **Answer:** Categorical variables need to be converted into numerical form for ML models.
* **Common Techniques:**

  1. **Label Encoding** – assigns numbers to categories (e.g., Male=0, Female=1).
  2. **One-Hot Encoding** – creates binary columns for each category (used for nominal data).
  3. **Ordinal Encoding** – for categories with order (e.g., Low=1, Medium=2, High=3).
  4. **Target/Mean Encoding** – replaces category with average target value (useful in regression/classification).
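
The one-hot and ordinal techniques above can be sketched as follows. The DataFrame and its column names are hypothetical. Note that scikit-learn's `OrdinalEncoder` numbers categories from 0, so Low/Medium/High become 0/1/2 rather than 1/2/3.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "color" is nominal, "size" has a natural order
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one binary column per color
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: pass the category order explicitly
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df)
```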



### 7. What do you mean by training and testing a dataset?

* **Training dataset:** Used to teach the model patterns from data.
* **Testing dataset:** Used to evaluate model performance on unseen data.
* **Analogy:** Like studying from books (training) and then taking an exam (testing).




### 8. What is sklearn.preprocessing?

* **Answer:** `sklearn.preprocessing` is a module in Scikit-learn that provides classes and functions to transform data before it is fed into ML models.
* **Examples:**

  * `StandardScaler` (standardize features to zero mean and unit variance)
  * `LabelEncoder` (convert target labels to integers)
  * `OneHotEncoder` (create dummy variables)
  * `MinMaxScaler` (scale between 0 and 1)
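
A quick sketch of two of these transformers in use. The label and age values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# LabelEncoder sorts the classes, so bird=0, cat=1, dog=2
labels = ["cat", "dog", "cat", "bird"]
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# MinMaxScaler maps the smallest value to 0 and the largest to 1
ages = np.array([[20.0], [30.0], [40.0]])
scaled = MinMaxScaler().fit_transform(ages)

print(encoded)         # integer codes for the labels
print(scaled.ravel())  # ages rescaled into [0, 1]
```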



###  9. What is a Test set?

* **Definition:** The **test set** is a portion of the dataset kept aside to check how well the model generalizes to unseen data.
* **Note:** It should **never** be used in training.



###  10. How do we split data for model fitting (training and testing) in Python?

* **Answer:** We usually use `train_test_split` from scikit-learn.
* **Code:**

```python
from sklearn.model_selection import train_test_split

# X (features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

* Here, 80% of the data goes to training and 20% to testing.
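
For a self-contained check of the split sizes, here is a sketch on a tiny synthetic array (the data is an assumption for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With 10 samples and test_size=0.2, 8 rows go to training and 2 to testing
print(X_train.shape, X_test.shape)
```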



### 10[i]. How do you approach a Machine Learning problem?

**Step-by-step Approach:**

1. **Understand the Problem** – what needs to be predicted or classified.
2. **Collect & Explore Data** – gather datasets, check missing values, visualize.
3. **Preprocess Data** – handle missing values, categorical encoding, scaling.
4. **Feature Engineering** – create/select important features.
5. **Model Selection** – choose suitable algorithm (regression, classification, etc.).
6. **Training** – fit model on training data.
7. **Evaluation** – test with unseen data using metrics (accuracy, RMSE, F1-score).
8. **Optimization** – tune hyperparameters, improve accuracy.
9. **Deployment** – integrate into real-world system.
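
Steps 2 to 7 above can be sketched end to end on the Iris dataset bundled with scikit-learn. The choice of dataset, scaler, and model here is only an assumption for illustration, not a recommendation for every problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data
X, y = load_iris(return_X_y=True)

# Hold out unseen data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing + model selection, chained in one pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Training
pipeline.fit(X_train, y_train)

# Evaluation on the held-out test set
accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("test accuracy:", accuracy)
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training data only, which keeps test information out of training.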


###  11. Why do we have to perform EDA before fitting a model to the data?

* **Answer:** Exploratory Data Analysis (EDA) helps understand dataset structure, detect missing values, outliers, correlations, and distributions.
* **Benefit:** Helps prevent poor model performance by revealing data-quality problems and irrelevant features before training.
* **Example:** Checking if categorical variables need encoding or numerical variables need scaling.
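
A few typical first EDA checks in pandas might look like this. The DataFrame is hypothetical, with a missing value and an outlier planted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and one extreme salary
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "salary": [30000, 45000, 52000, 1_000_000, 38000],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai"],
})

# Missing values per column
missing_counts = df.isna().sum()

# Summary statistics for numeric columns (mean, min, max, quartiles)
summary = df.describe()

# Column types: object columns such as "city" will need encoding
column_types = df.dtypes

# A crude outlier flag: is the maximum far above the median?
has_salary_outlier = df["salary"].max() > 10 * df["salary"].median()

# Correlation between the numeric columns
numeric_corr = df[["age", "salary"]].corr()

print(missing_counts)
print(summary)
print(column_types)
print(has_salary_outlier)
print(numeric_corr)
```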



###  12. What is correlation?

* **Definition:** Correlation measures the linear relationship between two variables (range: –1 to +1).
* **Example:** Height and weight usually show positive correlation.



###  13. What does negative correlation mean?

* **Definition:** If one variable increases while the other decreases, the correlation is negative.
* **Example:** More hours spent on social media ↔ lower exam scores.
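
A negative coefficient can be seen directly with NumPy. The numbers below are invented purely to illustrate the sign.

```python
import numpy as np

# Invented values: as social media hours go up, exam scores go down
social_media_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_scores = np.array([90.0, 80.0, 75.0, 60.0, 50.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(social_media_hours, exam_scores)[0, 1]
print("Pearson r:", r)  # negative, close to -1 for this data
```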



###  14. How can you find correlation between variables in Python?

```python
import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5],
    "Marks": [20, 40, 60, 80, 100]
})

correlation_matrix = df.corr()
print(correlation_matrix)
```

* `.corr()` gives Pearson correlation by default.



###  15. What is causation? Explain difference between correlation and causation with an example.

* **Causation:** A change in one variable **directly produces** a change in another.
* **Correlation vs. Causation:**

  * Correlation: Ice cream sales and drowning cases both increase in summer. Neither causes the other; hot weather drives both.
  * Causation: Increasing study hours → higher marks (direct effect).
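
The ice cream example can be simulated to show how a shared cause produces correlation without causation. Everything below is synthetic: the coefficients, noise levels, and the `residuals` helper are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temperature drives both quantities; neither affects the other
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, size=500)
drownings = 0.5 * temperature + rng.normal(0, 2, size=500)

# The raw correlation is strongly positive
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, predictor):
    """Remove the linear effect of predictor from values."""
    slope, intercept = np.polyfit(predictor, values, deg=1)
    return values - (slope * predictor + intercept)

# After removing the effect of temperature, the correlation nearly vanishes
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("raw:", raw_corr, "after controlling for temperature:", partial_corr)
```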



###  16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

* **Definition:** An optimizer adjusts model parameters (like weights) to minimize the loss function.
* **Types:**

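  1. **Gradient Descent** – basic method; computes the gradient on the full dataset and moves the parameters a small step in the opposite direction.
  2. **Stochastic Gradient Descent (SGD)** – updates parameters using one data point (or a small mini-batch) at a time, so each update is cheap on large datasets.
  3. **Momentum** – adds a fraction of the previous update to the current one, which speeds up convergence and damps oscillations.
  4. **Adam (Adaptive Moment Estimation)** – combines momentum with per-parameter adaptive learning rates (a widely used default).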
* **Example (Adam in TensorFlow):**

```python
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
```
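
To show what the update rules look like, here is a minimal pure-Python sketch of plain gradient descent and the momentum variant on the one-dimensional function f(w) = (w - 3)^2. The function, learning rate, and `beta` value are assumptions chosen for illustration.

```python
def gradient(w):
    # Derivative of f(w) = (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=200):
    # Plain gradient descent: step against the gradient
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

def momentum_descent(w, learning_rate=0.1, beta=0.9, steps=200):
    # Momentum: keep a running "velocity" built from past gradients
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity - learning_rate * gradient(w)
        w = w + velocity
    return w

w_gd = gradient_descent(0.0)
w_momentum = momentum_descent(0.0)
print(w_gd, w_momentum)  # both end up close to 3
```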



###  17. What is sklearn.linear\_model?

* **Answer:** It is a Scikit-learn module that implements linear models such as Linear Regression, Logistic Regression, Ridge, and Lasso.
* **Example:**

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```



### 18. What does model.fit() do? What arguments must be given?

* **Answer:** Trains the model on the given data, that is, it learns the model's parameters.
* **Arguments:** For supervised estimators, `X_train` (features) and `y_train` (labels/targets).
* **Example:**

```python
model.fit(X_train, y_train)
```



###  19. What does model.predict() do? What arguments must be given?

* **Answer:** Uses the trained model to make predictions on new data.
* **Arguments:** `X_test` (features).
* **Example:**

```python
y_pred = model.predict(X_test)
```
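
Putting `fit()` and `predict()` together, here is a self-contained sketch on toy hours-studied/marks data (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data: marks = 20 * hours studied
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])  # features must be 2-D
y_train = np.array([20.0, 40.0, 60.0, 80.0])

# New, unseen input
X_test = np.array([[5.0]])

model = LinearRegression()
model.fit(X_train, y_train)     # learns the slope and intercept
y_pred = model.predict(X_test)  # applies them to new data

print(y_pred)
```

Because the toy data is perfectly linear, the prediction for 5 hours should be very close to 100.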



### 20. What are continuous and categorical variables?

* **Continuous Variable:** Numeric values within a range (e.g., height, salary).
* **Categorical Variable:** Discrete labels or groups (e.g., gender, country).



###  21. What is feature scaling? How does it help in Machine Learning?

* **Definition:** The process of transforming numerical features so that they share a comparable scale.
* **Why:** Prevents large-valued features from dominating small-valued ones (important for distance-based algorithms like KNN, SVM).
* **Example:** Scaling salary (in lakhs) and age (in years) to the same scale.



###  22. How do we perform scaling in Python?

```python
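from sklearn.preprocessing import StandardScaler, MinMaxScaler

# X is assumed to be a numeric feature matrix (NumPy array or DataFrame)
scaler = StandardScaler()  # or MinMaxScaler() to scale into [0, 1]
X_scaled = scaler.fit_transform(X)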
```

* `StandardScaler` → mean = 0, variance = 1
* `MinMaxScaler` → scales values between 0 and 1
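
When the data is split into training and test sets, a common practice is to fit the scaler on the training data only and reuse it on the test data. A minimal sketch, with made-up age and experience values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: [age, years of experience]
X_train = np.array([[25.0, 3.0], [35.0, 8.0], [45.0, 13.0]])
X_test = np.array([[30.0, 5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

# Each training column now has mean 0 and standard deviation 1
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```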



###  23. What is sklearn.preprocessing?

* **Answer:** A module in Scikit-learn for data preprocessing before modeling.
* **Includes:**

  * Scaling (`StandardScaler`, `MinMaxScaler`)
  * Encoding (`LabelEncoder`, `OneHotEncoder`)
  * Normalization (`Normalizer`), polynomial features (`PolynomialFeatures`)


###  24. How do we split data for model fitting (training and testing) in Python?

```python
from sklearn.model_selection import train_test_split

# X (features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

* `test_size=0.2` → 80% training, 20% testing.



###  25. Explain data encoding.

* **Definition:** Transforming categorical data into numeric format so ML models can process it.
* **Types:**

  1. **Label Encoding** – assigns numeric codes.
  2. **One-Hot Encoding** – creates binary columns for each category.
  3. **Ordinal Encoding** – preserves order of categories.
  4. **Target Encoding** – replaces category with target mean value.
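
Target encoding is the least obvious of these, so here is a minimal pandas sketch. The data and column names are hypothetical. In practice the category means would usually be computed on the training data only, to avoid leaking target information.

```python
import pandas as pd

# Hypothetical data: city is the categorical feature, price is the target
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
    "price": [100.0, 200.0, 140.0, 220.0, 300.0],
})

# Mean target value per category
city_means = df.groupby("city")["price"].mean()

# Replace each category with its mean target value
df["city_encoded"] = df["city"].map(city_means)

print(df)
```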
