# Feature Engineering Assignment

1. What is a parameter?

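A parameter is a value inside a model, such as a weight or bias, that is learned from data during training. It differs from a hyperparameter (for example, the learning rate), which is set before training and is not learned.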
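
As a rough illustration of learned parameters, the sketch below fits a linear model to synthetic data. The data values and the names `X_demo`, `y_demo`, `slope`, and `intercept` are assumptions made up for this example, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2x + 1 (illustrative values only)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

# fit_intercept is a setting we choose before training, not something learned
model = LinearRegression(fit_intercept=True)
model.fit(X_demo, y_demo)

# coef_ and intercept_ are the parameters estimated from the data
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

The fitted slope and intercept should come out close to the values used to generate the data.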

2. What is correlation? What does negative correlation mean?

Correlation measures the strength and direction of a linear relationship between two variables. It ranges from -1 to +1.

A negative correlation means that when one variable increases, the other tends to decrease.
Example: As exercise increases, body fat percentage tends to decrease.
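
A minimal sketch of the exercise example, using invented numbers (the arrays `exercise_hours` and `body_fat_pct` are hypothetical, not measured data):

```python
import numpy as np

# Hypothetical measurements, invented for illustration
exercise_hours = np.array([1, 2, 3, 4, 5, 6])
body_fat_pct = np.array([30, 28, 27, 24, 22, 20])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(exercise_hours, body_fat_pct)[0, 1]
print(round(r, 3))
```

Here `r` is close to -1, which indicates a strong negative linear relationship in this made-up sample.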

3. Define Machine Learning. What are the main components in Machine Learning?

Machine Learning (ML) is a branch of AI that enables systems to learn and improve from experience without being explicitly programmed.
Main components:

- Data
- Model
- Loss function
- Optimizer (Learning algorithm)
- Evaluation metrics

4. How does loss value help in determining whether the model is good or not?

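The loss value measures how far the model’s predictions are from the actual values.
- Low loss → predictions are close to the targets.
- High loss → predictions are far from the targets.

A low loss on the training data alone is not enough: the loss on held-out data should also be low, otherwise the model may be overfitting.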
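
To make this concrete, here is a small sketch that computes mean squared error, one common loss, for two sets of predictions. The helper `mean_squared_error` is written by hand for illustration, and the prediction values are invented.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]  # hypothetical model A: near the targets
far_preds = [0.0, 9.0, 3.0]    # hypothetical model B: far from the targets

loss_a = mean_squared_error(y_true, close_preds)
loss_b = mean_squared_error(y_true, far_preds)
print(loss_a, loss_b)
```

Model A gets the smaller loss because its predictions sit closer to the targets.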

5. What are continuous and categorical variables?

Continuous variables: Numeric values that can take any value within a range (e.g., height, weight).

Categorical variables: Represent discrete categories or labels (e.g., gender, color).

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

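Categorical values must be converted to numbers before most models can use them. Common techniques:
- Label Encoding: maps each category to an integer; in Scikit-learn, LabelEncoder is intended for target labels.
- One-Hot Encoding: creates one binary column per category; suited to nominal data with no order.
- Ordinal Encoding: maps categories to integers that reflect a meaningful order (e.g., low < medium < high).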
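
The sketch below shows ordinal and label encoding with Scikit-learn. The `size` column and the `spam`/`ham` labels are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal encoding: pass the order explicitly so that small < medium < large
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes[["size"]]).ravel()

# Label encoding: classes are sorted alphabetically, so ham -> 0, spam -> 1
labels = ["spam", "ham", "spam"]
label_codes = LabelEncoder().fit_transform(labels)

print(size_codes)
print(label_codes)
```

Without the explicit `categories` argument, OrdinalEncoder would sort the categories alphabetically, which would not match the intended order here.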

7. What do you mean by training and testing a dataset?

Training set: Used to train the model.

Testing set: Used to evaluate the model’s performance on unseen data.

8. What is sklearn.preprocessing?

It is a module in Scikit-learn that provides tools for data preprocessing, such as scaling, normalization, encoding, and feature transformation.

9. What is a Test set?

A Test set is a subset of the dataset used to assess how well a trained model generalizes to unseen data.

In [8]:
#10. How do we split data for model fitting (training and testing) in Python?

#Using Scikit-learn:
import pandas as pd
from sklearn.model_selection import train_test_split
data = {
    'Age': [22, 25, 30, 28, 35, 40, 45, 50, 55, 60],
    'BMI': [18.5, 22.0, 25.5, 26.0, 28.0, 30.0, 27.5, 29.5, 31.0, 32.5],
    'Exercise_Level': [3, 4, 2, 3, 1, 2, 4, 5, 2, 1],
    'Blood_Pressure': [110, 115, 120, 118, 130, 140, 135, 145, 138, 150],
    'Health_Status': [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]  # Target variable
}

df = pd.DataFrame(data)
X = df[['Age', 'BMI', 'Exercise_Level', 'Blood_Pressure']]  # independent variables
y = df['Health_Status']                                     # dependent variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training features:\n", X_train)
print("\nTesting features:\n", X_test)
print("\nTraining labels:\n", y_train)
print("\nTesting labels:\n", y_test)




Training features:
    Age   BMI  Exercise_Level  Blood_Pressure
0   22  18.5               3             110
7   50  29.5               5             145
2   30  25.5               2             120
9   60  32.5               1             150
4   35  28.0               1             130
3   28  26.0               3             118
6   45  27.5               4             135

Testing features:
    Age   BMI  Exercise_Level  Blood_Pressure
8   55  31.0               2             138
1   25  22.0               4             115
5   40  30.0               2             140

Training labels:
 0    1
7    1
2    0
9    0
4    0
3    1
6    1
Name: Health_Status, dtype: int64

Testing labels:
 8    0
1    1
5    0
Name: Health_Status, dtype: int64


10. How do you approach a Machine Learning problem?

Steps:
- Define the problem
- Collect and clean data
- Perform EDA (Exploratory Data Analysis)
- Select features
- Choose a model
- Train and evaluate
- Optimize performance
- Deploy and monitor
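
A compressed sketch of the middle steps (split, preprocess, train, evaluate) might look like the following. The Iris dataset bundled with Scikit-learn stands in for a real problem, and the model choice is an assumption made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a bundled toy dataset stands in for a real one
X_iris, y_iris = load_iris(return_X_y=True)

# Hold out a test set before any fitting
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Choose a model: scaling followed by logistic regression
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Train on the training split, then evaluate on the held-out split
pipeline.fit(X_fit, y_fit)
accuracy = accuracy_score(y_hold, pipeline.predict(X_hold))
print(accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test information.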

11. Why do we have to perform EDA before fitting a model to the data?

EDA (Exploratory Data Analysis) helps to:
- Understand data patterns
- Detect outliers or missing values
- Choose appropriate preprocessing methods
- Inform feature selection and model choice, which tends to improve accuracy
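
A few typical first checks can be sketched as follows. The frame `eda_df` and its values are invented, and the 1.5 × IQR rule is only one common rule of thumb for flagging outliers.

```python
import numpy as np
import pandas as pd

# Small invented frame with one missing value and one suspicious reading
eda_df = pd.DataFrame({
    "Age": [22, 25, 30, np.nan, 35],
    "Blood_Pressure": [110, 115, 120, 118, 400],
})

# Count missing values per column and get summary statistics
missing_per_column = eda_df.isna().sum()
summary = eda_df.describe()

# Flag values more than 1.5 * IQR outside the quartiles
q1, q3 = eda_df["Blood_Pressure"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (eda_df["Blood_Pressure"] < q1 - 1.5 * iqr) | (
    eda_df["Blood_Pressure"] > q3 + 1.5 * iqr
)
outliers = eda_df[is_outlier]

print(missing_per_column)
print(outliers)
```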

12. What is correlation?

Correlation measures how strongly, and in which direction, two variables move together.

13. What does negative correlation mean?

Negative correlation indicates that as one variable increases, the other tends to decrease.

In [9]:
#14. How can you find correlation between variables in Python?
import pandas as pd
corr_matrix = df.corr()
print(corr_matrix)


                     Age       BMI  Exercise_Level  Blood_Pressure  \
Age             1.000000  0.888384       -0.176791        0.941796   
BMI             0.888384  1.000000       -0.348203        0.923375   
Exercise_Level -0.176791 -0.348203        1.000000       -0.168282   
Blood_Pressure  0.941796  0.923375       -0.168282        1.000000   
Health_Status  -0.400577 -0.581667        0.866921       -0.423968   

                Health_Status  
Age                 -0.400577  
BMI                 -0.581667  
Exercise_Level       0.866921  
Blood_Pressure      -0.423968  
Health_Status        1.000000  


15. What is causation? Explain difference between correlation and causation with an example.

Causation means one event directly affects another.

Correlation only shows they move together but doesn’t prove cause-effect.

Example:
Ice cream sales and drowning incidents are correlated, but hot weather causes both.
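
The ice cream example can be simulated. In this sketch all numbers are synthetic assumptions: temperature drives both quantities, and neither affects the other. Removing the effect of temperature from both (a simple partial correlation) should make the association largely disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Temperature is the common cause; the coefficients are arbitrary
temperature = rng.normal(25, 5, size=n)
ice_cream = 10 * temperature + rng.normal(0, 10, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

# Raw correlation between the two effects
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing temperature from both variables
partial_r = np.corrcoef(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)[0, 1]

print(raw_r, partial_r)
```

The raw correlation is high, while the partial correlation is near zero, which is consistent with a shared cause rather than a direct effect.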

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

An Optimizer adjusts model parameters to minimize loss.

Common types:
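- Gradient Descent: Computes the gradient of the loss over the whole training set and moves the parameters a small step in the opposite direction.
- Stochastic Gradient Descent (SGD): Updates the weights after each sample (or small mini-batch) instead of the full dataset, which makes each update cheaper but noisier.
- Adam: Combines momentum with per-parameter adaptive learning rates.
- RMSprop: Scales the learning rate for each parameter by a moving average of recent squared gradients.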

In [7]:
from keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
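
To show what an optimizer does under the hood, here is a minimal sketch of plain gradient descent on a one-dimensional loss. The function `gradient_descent` and the loss L(w) = (w - 3)^2 are made up for illustration and are not how Keras implements its optimizers.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), start=0.0)
print(w_final)
```

Each step shrinks the distance to the minimum by a constant factor here, so `w_final` ends up very close to 3.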


17. What is sklearn.linear_model?

A Scikit-learn module containing linear models such as:
- Linear Regression
- Logistic Regression
- Ridge, Lasso, etc.

18. What does model.fit() do? What arguments must be given?

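model.fit(X_train, y_train)

It trains the model on the training data by adjusting its parameters to minimize the loss. For supervised estimators, two arguments are required: the feature matrix X (shape n_samples × n_features) and the target values y.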

19. What does model.predict() do? What arguments must be given?

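model.predict(X_test)

It uses the trained model to make predictions on new or unseen data. It requires one argument: a feature matrix X with the same columns, in the same order, as the data used in fit(). It returns one prediction per row.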
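
A short sketch of fit() followed by predict() on toy data. The arrays `X_toy`, `y_toy`, and `X_new` are invented, and LogisticRegression is just one possible estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented, clearly separable toy data: one feature, binary label
X_toy = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.0], [10.0]])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)             # learns coef_ and intercept_ from the data
predictions = clf.predict(X_new)  # one predicted label per row of X_new
print(predictions)
```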

20. What are continuous and categorical variables?

Continuous Variables:
These are numerical variables that can take an infinite number of values within a given range.
They are measured on a scale and can include fractions or decimals.
Examples: Height, Weight, Temperature, Age, Income.

Categorical Variables:
These represent categories or groups that describe qualitative characteristics.
They take on a limited number of distinct values (labels).
Examples: Gender (Male/Female), Blood Type (A, B, AB, O), Color (Red, Blue, Green).

21. What is feature scaling? How does it help in Machine Learning?

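Feature scaling puts the independent variables on a comparable scale, for example zero mean and unit variance (standardization) or a fixed range such as [0, 1] (min-max normalization).
It prevents features with large numeric ranges from dominating distance-based models such as k-NN and SVM, and it helps gradient-based optimizers converge faster.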

In [10]:
#22. How do we perform scaling in Python?
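from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on the training split only, then reuse its mean and std on the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)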
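
As a sanity check on what StandardScaler computes, the sketch below compares it with the z-score formula (x - mean) / std applied by hand. The `ages` array is a small invented sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[22.0], [30.0], [40.0], [50.0], [60.0]])  # sample values

# Library result
scaled = StandardScaler().fit_transform(ages)

# Same computation by hand; np.std defaults to the population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, up to floating-point error.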

23. What is sklearn.preprocessing?

It is the Scikit-learn module used for data transformations such as scaling, normalization, and encoding.

24. How do we split data for model fitting (training and testing) in Python?

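In Machine Learning, we split the dataset into training and testing sets to evaluate how well a model performs on unseen data. In Python, this is usually done with train_test_split from sklearn.model_selection, where test_size sets the share of rows held out and random_state makes the shuffle reproducible.

Training Set: Used to train the model (learn patterns).

Testing Set: Used to check how well the model generalizes.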
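
For classification, it is often useful to keep the class proportions the same in both splits. The sketch below uses the `stratify` argument of train_test_split for that. The toy `features` and `labels` are invented.

```python
from sklearn.model_selection import train_test_split

# Invented toy data: 10 rows, two balanced classes
features = [[i] for i in range(10)]
labels = [0] * 5 + [1] * 5

# stratify=labels keeps the 50/50 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```

With a 20% test size on 10 rows, the test split has two rows, one from each class.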

In [12]:
#25. Explain data encoding.
# Data encoding converts categorical data into numerical format for model training.
# Common types: Label Encoding & One-Hot Encoding
# Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({
    'Category': ['Fruits', 'Vegetables', 'Fruits', 'Dairy', 'Snacks']
})

print("Original DataFrame:")
print(df)
encoder = OneHotEncoder()

encoded = encoder.fit_transform(df[['Category']]).toarray()

encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Category']))

final_df = pd.concat([df, encoded_df], axis=1)

print("\nEncoded DataFrame:")
print(final_df)


Original DataFrame:
     Category
0      Fruits
1  Vegetables
2      Fruits
3       Dairy
4      Snacks

Encoded DataFrame:
     Category  Category_Dairy  Category_Fruits  Category_Snacks  \
0      Fruits             0.0              1.0              0.0   
1  Vegetables             0.0              0.0              0.0   
2      Fruits             0.0              1.0              0.0   
3       Dairy             1.0              0.0              0.0   
4      Snacks             0.0              0.0              1.0   

   Category_Vegetables  
0                  0.0  
1                  1.0  
2                  0.0  
3                  0.0  
4                  0.0  
