## Feature Engineering Assignment

1. What is a parameter?
In statistics, a parameter is a numerical characteristic of a population or dataset, such as its mean. In machine learning, a parameter is a value that the model learns from the training data and then uses to make predictions. For example, in a linear regression model, the coefficients (weights) of the features and the intercept are parameters.
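As a minimal sketch of what "learned from the training data" means, the snippet below fits scikit-learn's LinearRegression to toy data and reads back the learned parameters. The data values are made up for illustration and follow y = 3x + 2 exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3*x + 2 (values chosen only for illustration)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# The learned parameters: one coefficient per feature, plus the intercept
print(model.coef_)       # close to [3.]
print(model.intercept_)  # close to 2.0
```

The values in coef_ and intercept_ are not set by hand; they are estimated by fit(), which is what distinguishes parameters from settings chosen before training.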

2. What is correlation?
Correlation refers to a statistical relationship between two or more variables. It indicates the degree to which a change in one variable is associated with a change in another variable. Correlation is commonly measured using Pearson’s correlation coefficient, which captures linear relationships and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 meaning no linear correlation.

3. What does negative correlation mean?
Negative correlation means that as one variable increases, the other tends to decrease. For example, the time students spend on social media and their academic performance may be negatively correlated.
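The coefficient described in questions 2 and 3 can be computed with NumPy. This is a small sketch on hypothetical numbers (the variable names and values are invented for illustration), showing one positive and one negative relationship.

```python
import numpy as np

# Hypothetical data: more study hours go with higher scores
hours_studied = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([52.0, 58.0, 65.0, 70.0, 79.0])

# Hypothetical data: higher temperature goes with lower heating cost
temperature = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
heating_cost = np.array([100.0, 82.0, 61.0, 45.0, 20.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_positive = np.corrcoef(hours_studied, exam_score)[0, 1]
r_negative = np.corrcoef(temperature, heating_cost)[0, 1]

print(f"positive example: r = {r_positive:.3f}")
print(f"negative example: r = {r_negative:.3f}")
```

The sign of r gives the direction of the relationship, and its magnitude gives the strength of the linear association.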

4. Define Machine Learning. What are the main components in Machine Learning?
Machine Learning is a subset of artificial intelligence (AI) that involves the development of algorithms that allow computers to learn from and make predictions or decisions based on data. The main components in machine learning are:

Data: The raw input used to train and test models.

Model: The mathematical representation of the relationships in the data.

Learning Algorithm: The method that adjusts the model to fit the data.

Loss Function: A function that measures how far the model's predictions are from the actual values.

Optimizer: The mechanism used to minimize the loss function during training.

5. How does loss value help in determining whether the model is good or not?
The loss value measures how far the model's predictions are from the actual outcomes. A low loss value indicates that the predictions are close to the actual outcomes (good performance), while a high loss value suggests that the model is making poor predictions. The goal of training is to minimize the loss function; because a low loss on the training data alone can also come from overfitting, the loss on held-out data should be checked as well.
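To make this concrete, here is a small sketch that uses mean squared error as the loss, which is one common choice for regression and is assumed here only for illustration. The target and prediction values are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
good_predictions = [2.9, 5.1, 7.2]   # close to the targets
poor_predictions = [6.0, 1.0, 10.0]  # far from the targets

loss_good = mean_squared_error(y_true, good_predictions)
loss_poor = mean_squared_error(y_true, poor_predictions)
print(loss_good, loss_poor)
```

The predictions that sit closer to the targets produce the smaller loss, which is the signal an optimizer uses during training.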

6. What are continuous and categorical variables?
Continuous variables are numerical variables that can take any value within a range. Examples include height, weight, and temperature.

Categorical variables are variables that can take on one of a limited and fixed number of values. Examples include gender (male/female), country (USA, India, etc.), or product type (A, B, C).

7. How do we handle categorical variables in Machine Learning? What are the common techniques?
Categorical variables need to be converted into a format that machine learning algorithms can understand. Common techniques include:

One-Hot Encoding: Creating binary columns for each category.

Label Encoding: Converting each category to an integer. The integers can suggest an order that does not really exist, so this is mostly used for target labels.

Ordinal Encoding: Assigning numerical values to categories based on their order.
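The techniques above can be sketched with scikit-learn's OneHotEncoder and OrdinalEncoder. The feature values ("red", "small", and so on) are hypothetical, and the category order passed to OrdinalEncoder is an assumption made for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature (no natural order): one-hot encode it
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
one_hot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
color_matrix = one_hot.fit_transform(colors).toarray()
print(one_hot.categories_[0])  # the categories, sorted alphabetically
print(color_matrix)

# Ordinal feature (natural order): pass the order explicitly
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(sizes)
print(size_codes.ravel())
```

One-hot encoding avoids implying an order between colors, while the ordinal encoder keeps the small < medium < large ordering as 0, 1, 2.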

8. What do you mean by training and testing a dataset?
The training dataset is used to train the machine learning model by allowing it to learn patterns from the data.

The testing dataset is used to evaluate the performance of the trained model. Because the model has not seen this data during training, it provides an unbiased estimate of the model's generalization ability.

9. What is sklearn.preprocessing?
sklearn.preprocessing is a module in the scikit-learn library that provides classes and functions for scaling, encoding, and transforming data. It is used to standardize or normalize features and to encode categorical features before they are passed to a model.

10. What is a Test set?
A test set is a subset of the dataset that is used to evaluate the performance of a machine learning model after it has been trained. It helps in assessing how well the model generalizes to new, unseen data.

11. How do we split data for model fitting (training and testing) in Python?
In Python, we can use train_test_split from the scikit-learn library to split the data:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Here, X is the feature matrix and y is the target variable. The test_size argument sets the proportion of the data used for testing.

12. How do you approach a Machine Learning problem?
To approach a machine learning problem, follow these steps:

Define the problem: Understand the problem and determine the type of machine learning task (classification, regression, clustering).

Collect and preprocess data: Gather data and clean it (handling missing values, encoding categorical variables, etc.).

Explore the data: Perform exploratory data analysis (EDA) to understand patterns and relationships.

Select a model: Choose an appropriate machine learning model based on the problem.

Train the model: Fit the model to the training data.

Evaluate the model: Test the model using the test set and evaluate performance using relevant metrics.

Tune the model: Optimize the model by adjusting hyperparameters.

Deploy the model: Once satisfied with the model's performance, deploy it for real-world use.
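The middle of this workflow (split, train, evaluate) can be sketched in a few lines. The example assumes the Iris dataset bundled with scikit-learn and a logistic regression model; both are stand-ins chosen for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Define the problem and collect data: a small classification dataset
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Select and train a model; the pipeline scales the features first
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate on data the model has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into training.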

13. Why do we have to perform EDA before fitting a model to the data?
Exploratory Data Analysis (EDA) helps in understanding the dataset better. It helps identify:

Outliers and missing values.

Distribution of features.

Relationships between variables.

Data quality issues.

EDA helps ensure that the model is trained on a clean and well-understood dataset.
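A few of these checks can be sketched with pandas. The DataFrame below is entirely hypothetical, and the 1.5 × IQR rule used to flag outliers is one common rule of thumb rather than the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing age, a missing city, and one extreme income
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [40_000, 52_000, 88_000, 61_000, 950_000, 45_000],
    "city": ["Pune", "Delhi", "Delhi", "Pune", None, "Mumbai"],
})

missing_counts = df.isna().sum()              # missing values per column
summary = df.describe()                       # distribution of numeric features
correlations = df[["age", "income"]].corr()   # pairwise relationships

# Flag values outside 1.5 * IQR of the quartiles as potential outliers
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(outliers)
```

Findings like these guide the cleaning steps (imputing or dropping missing values, investigating extreme rows) that come before model fitting.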

14. What is causation? Explain the difference between correlation and causation with an example.
Causation means that one event or variable directly causes another to happen. Correlation, on the other hand, means that two variables are related but do not necessarily cause each other.

Example:

Correlation: There might be a correlation between ice cream sales and drowning incidents in summer, but it doesn’t mean ice cream sales cause drowning. The underlying cause is the warmer weather.

Causation: Smoking causes lung cancer. This is a causal relationship.
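The ice cream example can be simulated as a rough sketch. All numbers below are invented: temperature is generated first and then used to produce both ice cream sales and drowning counts, so neither outcome influences the other by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily data: temperature drives both quantities
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# The two outcomes are strongly correlated with each other
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, driver):
    """Remove the part of `values` explained by a straight-line fit on `driver`."""
    slope, intercept = np.polyfit(driver, values, deg=1)
    return values - (slope * driver + intercept)

# Once temperature is accounted for, the correlation nearly vanishes
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"after removing temperature: {adjusted_corr:.2f}")
```

The strong raw correlation comes entirely from the shared cause, which is why a correlation on its own cannot establish causation.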

15. What is an Optimizer? What are different types of optimizers? Explain each with an example.
An optimizer in machine learning adjusts the parameters (weights) of the model to minimize the loss function. Common optimizers include:

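Gradient Descent: Iteratively moves the parameters in the direction opposite to the gradient of the loss, computed over the whole training set. For example, it can be used to fit the weights of a linear regression model.

Stochastic Gradient Descent (SGD): A variant of gradient descent that estimates the gradient from a single random example or a small random subset of the data (a mini-batch) instead of the full dataset. Each update is much cheaper, which makes it the usual choice for large datasets and neural networks.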

Adam: An adaptive optimizer that combines the advantages of both AdaGrad and RMSProp. It adapts learning rates based on the first and second moments of the gradients.
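As a minimal sketch of the first optimizer in this list, the code below runs plain (full-batch) gradient descent on a one-feature linear model with a mean squared error loss. The data, learning rate, and iteration count are assumptions picked so that the example converges; they are not tuned values.

```python
import numpy as np

# Toy data generated from y = 2*x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

w, b = 0.0, 0.0          # parameters, starting from zero
learning_rate = 0.05

for _ in range(2000):
    error = (w * X + b) - y
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * X)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # close to 2.0 and 1.0
```

SGD would compute grad_w and grad_b from a random example or mini-batch on each iteration instead of the full arrays, and Adam would additionally rescale each step using running averages of the gradients.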

16. What is sklearn.linear_model?
sklearn.linear_model is a module in scikit-learn that contains linear models for regression and classification, such as Linear Regression, Logistic Regression, and Ridge Regression. These models assume that the target (or, for logistic regression, the log-odds of the class) is a linear function of the features.

17. What does model.fit() do? What arguments must be given?
The model.fit() method trains a machine learning model on a dataset. For supervised models it takes two required arguments (unsupervised models such as clustering need only the first):

X_train: The feature matrix (input data).

y_train: The target vector (labels or outcomes).

Example:

```python
# Learn the model's parameters from the training data
model.fit(X_train, y_train)
```

18. What does model.predict() do? What arguments must be given?
The model.predict() method makes predictions using the trained model. It takes in the feature matrix of new, unseen data as an argument:

```python
# Predict targets for rows the model has not seen
predictions = model.predict(X_test)
```

19. What are continuous and categorical variables?
(This was already answered in question 6)

20. What is feature scaling? How does it help in Machine Learning?
Feature scaling refers to the process of standardizing or normalizing the range of independent variables (features). It is important because many machine learning algorithms perform better when features are on a similar scale. For example, gradient descent converges faster if the features are scaled.
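The two most common forms of scaling can be written directly with NumPy. This is a sketch on made-up data with two columns on very different scales (the values are hypothetical).

```python
import numpy as np

# Two features on very different scales (for example, height in cm and income)
X = np.array([
    [150.0, 50_000.0],
    [160.0, 64_000.0],
    [170.0, 58_000.0],
    [180.0, 72_000.0],
])

# Standardization: zero mean and unit variance per column
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each column to the range [0, 1]
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_standardized)
print(X_minmax)
```

After either transformation both columns have comparable ranges, so neither dominates distance calculations or gradient steps simply because of its units.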

21. How do we perform scaling in Python?
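You can scale data using the sklearn.preprocessing module in Python. The scaler should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training (X_train and X_test come from the split in question 11):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the same statistics for the test data
X_test_scaled = scaler.transform(X_test)
```
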
22. What is sklearn.preprocessing?
sklearn.preprocessing is a module in scikit-learn that contains functions for scaling, normalizing, and encoding features in the dataset. It includes tools like StandardScaler, MinMaxScaler, and OneHotEncoder.

23. How do we split data for model fitting (training and testing) in Python?
(This was already answered in question 11)

24. Explain data encoding?
Data encoding is the process of converting categorical data into a numerical format so that machine learning algorithms can work with it. Common encoding techniques include:

One-Hot Encoding: Each category is represented by a binary vector.

Label Encoding: Assigns a unique integer to each category.
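Both techniques can be sketched on a small pandas column. The product_type values are hypothetical; pandas.get_dummies is used for the one-hot step, and scikit-learn's LabelEncoder (which is intended mainly for target labels) for the integer step.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"product_type": ["A", "C", "B", "A"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["product_type"], prefix="product_type")
print(one_hot)

# Label encoding: one integer per category (classes are sorted first)
encoder = LabelEncoder()
labels = encoder.fit_transform(df["product_type"])
print(list(encoder.classes_), labels)
```

The one-hot result has as many columns as there are categories, while label encoding keeps a single column of integers.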