# Q.1. What is a parameter?

ANSWER-->

  A parameter is an internal variable that a machine learning model learns during training. These parameters define how the model makes predictions.
  * For example, in a linear regression model, the coefficients (weights) and intercept are parameters.
  
  The model adjusts these values to minimize error during training. Parameters are not manually set by the user; they are optimized automatically through learning algorithms such as gradient descent.
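
As a minimal sketch of what "learning parameters" means, the snippet below fits a line y = w·x + b to a few points with plain gradient descent. The data, learning rate, and iteration count are illustrative assumptions; real libraries use more refined solvers.

```python
# Toy data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # parameters: learned from the data
learning_rate = 0.05     # hyperparameter: chosen by the user, not learned
n = len(xs)

for _ in range(5000):
    # Gradients of the mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned weight w = {w:.3f}, intercept b = {b:.3f}")
```

The weight and intercept are never typed in by hand; they move toward the values that generated the data because each update reduces the error.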

# Q.2. What is correlation? What does negative correlation mean?

ANSWER-->

a) **Correlation:**

   Correlation is a statistical measure that describes how strongly two variables are related to each other and in what direction.
It tells us whether an increase or decrease in one variable will likely correspond to an increase or decrease in another.

The correlation coefficient (denoted by r) ranges from –1 to +1:

* +1 → perfect positive correlation (both increase together)

* 0 → no linear relationship

* –1 → perfect negative correlation (one increases, the other decreases)

Example:
* If the temperature increases and ice-cream sales also increase, they have a positive correlation.


b) **Negative Correlation:**

A negative correlation means that as one variable increases, the other decreases — they move in opposite directions.

The correlation value lies between –1 and 0.

The closer the value is to –1, the stronger the negative relationship.

Examples:

* As the speed of a car increases, the time to reach the destination decreases.

* As study time increases, number of mistakes on a test may decrease.

In short, negative correlation shows an inverse relationship between two variables.
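
To make the sign of r concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The numbers are made up for illustration, and `pearson_r` is a helper defined here, not a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up numbers, for illustration only
temperature = [20, 24, 28, 32, 36]
ice_cream_sales = [110, 135, 160, 190, 220]
speed_kmh = [40, 50, 60, 80, 100]
travel_time_h = [5.0, 4.0, 3.3, 2.5, 2.0]

r_pos = pearson_r(temperature, ice_cream_sales)
r_neg = pearson_r(speed_kmh, travel_time_h)
print(f"temperature vs sales: r = {r_pos:+.2f}")
print(f"speed vs travel time: r = {r_neg:+.2f}")
```

The first pair gives a value close to +1 and the second a value close to –1, matching the positive and negative examples above.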




# Q.3. Define Machine Learning. What are the main components in Machine Learning?

ANSWER-->

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn patterns from data and make decisions or predictions without being explicitly programmed.
Instead of following fixed rules, ML systems improve automatically through experience and exposure to more data.

Example:

* Email systems use ML to classify messages as spam or not spam.

* Netflix uses ML to recommend movies based on your watch history.


 **Main Components of Machine Learning**

1. Data:
The foundation of ML — includes input features and target labels.
Example – Student marks, temperature readings, stock prices.

2. Model:
A mathematical representation that learns the relationship between input and output.
Example – Linear Regression model, Decision Tree, Neural Network.

3. Training:
The process of feeding data to the model so it can adjust parameters and learn patterns.
Example – The model “learns” which factors most affect house prices.

4. Evaluation:
Testing how well the trained model performs on unseen (test) data using metrics like accuracy, precision, or loss.

5. Prediction (Inference):
Using the trained model to make predictions on new data.
Example – Predicting tomorrow’s weather using a trained weather model.
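
The five components can be seen together in one short sketch. It assumes scikit-learn and NumPy are installed and uses synthetic data (y = 3x + 2 plus noise) rather than a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Data: synthetic feature/target pairs following y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Model
model = LinearRegression()

# 3. Training: the model learns its coefficient and intercept
model.fit(X_train, y_train)

# 4. Evaluation on data the model has not seen
test_mse = mean_squared_error(y_test, model.predict(X_test))
print("Test MSE:", round(test_mse, 3))

# 5. Prediction (inference) on a new input
new_value = model.predict(np.array([[60.0]]))[0]
print("Prediction for x = 60:", round(new_value, 2))
```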

# Q.4. How does loss value help in determining whether the model is good or not?

ANSWER-->

The loss value is the output of the loss function (also called the cost function): a numerical measure of how far the model’s predictions are from the actual target values.
It tells us how well or poorly a model is performing during training.

🔹 Key Points:

- A low loss value means the model’s predictions are close to the true values, indicating good performance.

- A high loss value means the model is making large errors and needs improvement.

- The model’s goal during training is to minimize this loss using optimization algorithms like Gradient Descent.

- Tracking loss over epochs helps detect problems: training loss that keeps falling while validation loss rises suggests overfitting, and loss that stays high on both suggests underfitting.

Example:

In Linear Regression, the loss function is often Mean Squared Error (MSE):

  $$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

Where:

- $y_i$ = actual value  
- $\hat{y}_i$ = predicted value  
- $n$ = number of samples



A smaller MSE means the model fits the data better.

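Example in Python:

    from sklearn.metrics import mean_squared_error

    # Illustrative values; replace with real targets and model predictions
    y_true = [3.0, 5.0, 7.0]
    y_pred = [2.5, 5.0, 8.0]
    loss = mean_squared_error(y_true, y_pred)
    print("Loss Value:", loss)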
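
To make the MSE formula concrete, the following sketch computes it by hand, without scikit-learn. The target and prediction values are illustrative assumptions.

```python
# Illustrative values, not taken from a real dataset
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

# (y_i - y_hat_i)^2 for every sample
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Average over the n samples
mse = sum(squared_errors) / len(y_true)

print("Squared errors:", squared_errors)  # [0.25, 0.0, 1.0]
print("MSE:", mse)
```

The single large miss (7 predicted as 8) contributes most of the loss, which shows how squaring penalizes bigger errors more heavily.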

# Q.5. What are continuous and categorical variables?

ANSWER-->

- **Continuous Variables:**
  Numeric values that can take any value within a range. Example – height, temperature, or salary.

- **Categorical Variables:**
  Represent categories or groups, not numbers. Example – gender (male/female), color (red/blue), or city name.

Continuous data can be used directly in arithmetic (means, differences), while categorical data usually requires encoding before use in ML models.

# Q.6. How do we handle categorical variables in Machine Learning? What are the common techniques?

ANSWER-->

Categorical variables are handled in machine learning by converting them into numerical representations, as most algorithms require numerical input. Common techniques include one-hot encoding, which creates new binary columns for each category, and label encoding, which assigns a unique integer to each category. Other methods like target encoding, frequency encoding, and binary encoding are also used depending on the data's characteristics.

**Common techniques:**

- Label Encoding: Assigns numeric values (e.g., Male = 0, Female = 1).

- One-Hot Encoding: Creates binary columns for each category (e.g., “City_Mumbai”, “City_Delhi”).

- Ordinal Encoding: Used when categories have order (e.g., Low = 1, Medium = 2, High = 3).

Example:

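    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    data = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Mumbai']})  # illustrative data
    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(data[['City']])  # sparse matrix, one column per city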

# Q.7. What do you mean by training and testing a dataset?

ANSWER-->
Training and testing a dataset are fundamental steps in machine learning, used to create and evaluate a predictive model. A single dataset is split into two or three subsets to teach an algorithm how to find patterns (training) and then assess its performance on new, unseen data (testing). This process is crucial for developing accurate, reliable models that can generalize to real-world scenarios.

**The process of training and testing**

- Splitting the data. The overall dataset is first divided into two or three parts:
  - Training dataset: The largest portion of the data, typically 70–80%, is used to train the machine learning model. This data is fed to the algorithm so it can learn the underlying relationships and patterns between the input features and the target output.
  -  Testing dataset: A smaller portion, typically 20–30%, is set aside and not used during the training phase. It provides an independent, unbiased final evaluation of the model's performance on new data.
  - Validation dataset (optional): Some workflows also include a third dataset to fine-tune the model's hyperparameters and prevent overfitting to the training data. This is often used during the development phase before the final test.
- Training the model. The model is given the training data to learn from. In supervised learning, this includes both the input data and the corresponding correct answers, or "labels."
  - For example, to train a spam filter, you would feed the model thousands of emails with the correct labels of "spam" or "not spam". The model adjusts its internal parameters (weights) to minimize the difference between its predictions and the actual labels.

- Testing the model. After training is complete, the model's performance is measured by having it make predictions on the testing dataset. Because the model has never seen this data before, the testing phase determines how well the model has learned to generalize its knowledge. The algorithm's predictions for the test data are then compared against the known, correct answers to calculate its accuracy and other performance metrics.
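
A three-way split can be sketched by calling `train_test_split` twice. The 60/20/20 proportions and the placeholder arrays below are assumptions for illustration; scikit-learn has no single built-in call for a three-way split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features (assumed for illustration)
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# First hold out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation set is used while tuning; the test set is touched only once, for the final evaluation.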




# Q.8. What is sklearn.preprocessing?

ANSWER-->

sklearn.preprocessing is a module within the scikit-learn library in Python that provides a comprehensive set of tools for data preprocessing. Data preprocessing is a crucial step in machine learning, involving the transformation of raw data into a format suitable for training machine learning models. This is often necessary because many algorithms perform better with clean, scaled, or transformed data.



# Q.9. What is a Test set?

The test set is a subset of the dataset reserved for evaluating the trained model’s performance. It helps determine whether the model can generalize well to unseen data.
Example: After training a model on 80% of the data, the remaining 20% is used as a test set to compute accuracy or loss.
It ensures that the model’s success isn’t due to memorization but to real learning.

# Q.10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

ANSWER-->
We use the train_test_split() function from sklearn.model_selection.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**How to approach a Machine Learning problem**

1. Understand the problem and the business goal.
2. Collect and clean the data by handling missing values and outliers.
3. Perform EDA (Exploratory Data Analysis).
4. Feature engineering: create or transform features.
5. Select an algorithm based on the problem type and the data.
6. Train the model and tune its hyperparameters.
7. Evaluate performance on test data.
8. Deploy and monitor the model.

Example:
- Predicting house prices involves these steps from gathering property data to evaluating accuracy.
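
The modelling part of this workflow (transform features, pick an algorithm, train, evaluate) might look like the sketch below. It uses the iris dataset bundled with scikit-learn and a logistic regression purely as stand-ins; problem definition, EDA, and deployment are outside what a few lines can show.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data (a small dataset bundled with scikit-learn)
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature transformation and algorithm combined in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluate on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy:", round(accuracy, 3))
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training data only, which avoids leaking test information into training.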


# Q.11. Why do we have to perform EDA before fitting a model to the data?

ANSWER-->

Exploratory Data Analysis (EDA) helps us understand the dataset’s structure and detect missing values, outliers, and relationships among variables.
It ensures the data is clean and meaningful before modeling.
- For example, EDA might reveal that some columns are highly correlated or irrelevant, allowing better feature selection. Without EDA, the model may produce inaccurate results or overfit the data.
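
A first EDA pass often takes only a few pandas calls. The sketch below assumes pandas is installed and uses a small hypothetical housing table with one missing value and one obvious outlier.

```python
import pandas as pd

# Small hypothetical dataset with a missing value and an obvious outlier
df = pd.DataFrame({
    "area_sqft": [850, 900, 1200, None, 1500, 1600],
    "bedrooms": [2, 2, 3, 3, 4, 4],
    "price": [100, 110, 150, 155, 190, 2000],
})

print(df.isna().sum())   # missing values per column
print(df.describe())     # summary statistics; an extreme max hints at an outlier

missing_area = int(df["area_sqft"].isna().sum())
max_price = df["price"].max()
median_price = df["price"].median()
print("missing area values:", missing_area)
print("max price vs median price:", max_price, median_price)
```

Here the maximum price is far above the median, which is the kind of signal that prompts a closer look before fitting a model.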



# Q.12. What is correlation?

ANSWER-->

 Correlation is a statistical measure that describes how strongly two variables are related to each other and in what direction. It tells us whether an increase or decrease in one variable will likely correspond to an increase or decrease in another.

The correlation coefficient (denoted by r) ranges from –1 to +1:

- +1 → perfect positive correlation (both increase together)

- 0 → no linear relationship

- –1 → perfect negative correlation (one increases, the other decreases)

# Q.13. What does negative correlation mean?

ANSWER-->

A negative correlation means that as one variable increases, the other decreases — they move in opposite directions.

The correlation value lies between –1 and 0.

The closer the value is to –1, the stronger the negative relationship.

Examples:

* As the speed of a car increases, the time to reach the destination decreases.

* As study time increases, number of mistakes on a test may decrease.

In short, negative correlation shows an inverse relationship between two variables.




# Q.14. How can you find correlation between variables in Python?

ANSWER-->
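Using the Pandas and Seaborn libraries:

    import pandas as pd
    import seaborn as sns

    # df is assumed to be an existing pandas DataFrame
    sns.heatmap(df.corr(numeric_only=True), annot=True)


df.corr() computes the pairwise correlation between columns (Pearson by default), numeric_only=True skips non-numeric columns, and the heatmap displays the resulting matrix visually.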
For instance, if two variables show r = 0.85, they are strongly positively correlated.
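
A self-contained sketch of the same idea, without the plot, is shown below. The DataFrame is hypothetical and built so that the relationships are exactly linear, which makes the expected values easy to reason about.

```python
import pandas as pd

# Hypothetical data: sales rise with temperature, heating cost falls
df = pd.DataFrame({
    "temperature": [10, 15, 20, 25, 30],
    "ice_cream_sales": [25, 35, 45, 55, 65],   # exactly 2 * temperature + 5
    "heating_cost": [90, 75, 60, 45, 30],      # exactly 120 - 3 * temperature
})

corr_matrix = df.corr()   # Pearson correlation by default
print(corr_matrix.round(2))

# Correlation between one specific pair of columns
r = df["temperature"].corr(df["heating_cost"])
print("temperature vs heating_cost:", round(r, 2))
```

Because the columns are exact linear functions of temperature, the coefficients come out at +1 and –1; real data would give values in between.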



# Q.15. What is causation? Explain difference between correlation and causation with an example.

Causation means one variable directly affects another.

Correlation means variables move together but one doesn’t necessarily cause the other.
- Example: Ice cream sales and drowning cases are correlated because both rise in summer.

However, ice cream doesn’t cause drowning — the underlying factor (temperature) causes both.
Hence, correlation ≠ causation.
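
The ice cream example can be simulated. In the sketch below, temperature drives both quantities and neither one appears in the other's formula, yet they still end up correlated. All coefficients and noise levels are invented for illustration.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(0)

# Temperature is the common cause
temperature = [random.uniform(10, 35) for _ in range(500)]

# Each variable depends only on temperature plus its own noise
ice_cream = [5 * t + random.gauss(0, 10) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 1) for t in temperature]

r = pearson_r(ice_cream, drownings)
print(f"ice cream vs drownings: r = {r:.2f}")
```

A strong positive r appears even though changing ice cream sales in this simulation would have no effect on drownings.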

# Q.16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

ANSWER-->

An optimizer adjusts model parameters to minimize loss during training.
Common types:

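- SGD (Stochastic Gradient Descent): Updates weights using the gradient from a single sample (or, in practice, a small mini-batch) at a time.

- Adam: Combines momentum and adaptive learning rates for faster convergence.

- RMSProp: Scales each parameter's learning rate by a moving average of recent squared gradients; works well for non-stationary objectives.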

Example:

    import tensorflow as tf

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)


The choice of optimizer and learning rate affects how quickly training converges and, in turn, the accuracy the model reaches.
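
The update rules behind these three optimizers can be sketched in plain Python on a toy loss f(w) = (w − 3)². This is a simplified illustration, not the TensorFlow implementation: the gradient is exact rather than estimated from mini-batches, and the hyperparameter values are assumptions.

```python
from math import sqrt

def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3
    return 2 * (w - 3)

def run_sgd(steps=200, lr=0.1):
    # Plain gradient step (no mini-batch noise in this toy setting)
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_rmsprop(steps=200, lr=0.1, decay=0.9, eps=1e-8):
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g   # moving average of g^2
        w -= lr * g / (sqrt(avg_sq) + eps)              # step scaled by recent gradient size
    return w

def run_adam(steps=200, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
        v = beta2 * v + (1 - beta2) * g * g    # adaptive scale (second moment)
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (sqrt(v_hat) + eps)
    return w

w_sgd, w_rmsprop, w_adam = run_sgd(), run_rmsprop(), run_adam()
print("SGD:    ", w_sgd)
print("RMSProp:", w_rmsprop)
print("Adam:   ", w_adam)
```

All three end up near w = 3. With a fixed learning rate, the adaptive methods may hover close to the minimum rather than land exactly on it, which is one reason learning-rate schedules are common in practice.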

# Q.17. What is sklearn.linear_model?

sklearn.linear_model is a module in Scikit-learn that implements linear models such as:

- LinearRegression() for continuous prediction

- LogisticRegression() for classification

- Ridge(), Lasso() for regularized regression

Example:

    from sklearn.linear_model import LinearRegression
    model = LinearRegression()


These models form the basis for many predictive ML tasks.

# Q.18. What does model.fit() do? What arguments must be given?

model.fit() trains a machine learning model on the provided data.
Syntax:

    model.fit(X_train, y_train)


Here, X_train is the input data and y_train is the target variable. The function adjusts model parameters (like weights) to minimize the loss function based on the training data.

# Q.19. What does model.predict() do? What arguments must be given?

model.predict() is used to generate predictions on new or test data using the trained model.
Example:

    y_pred = model.predict(X_test)


It requires only the feature inputs (X_test) and returns predicted outputs. The results can then be compared with actual labels to measure accuracy.
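
A minimal end-to-end sketch of `fit()` followed by `predict()` is shown below, using `LinearRegression` on illustrative data that follows y = 2x + 1. Note the shapes: features are a 2-D array of shape (n_samples, n_features), targets a 1-D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data following y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])   # shape (n_samples, n_features)
y_train = np.array([1.0, 3.0, 5.0, 7.0])           # shape (n_samples,)

model = LinearRegression()
model.fit(X_train, y_train)          # learns coef_ and intercept_

X_test = np.array([[5.0]])           # new inputs with the same number of features
y_pred = model.predict(X_test)       # only features are passed, no labels
print(y_pred)
```

Since the training points lie exactly on a line, the prediction for x = 5 comes out at about 11.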

# Q.20. What are continuous and categorical variables?

ANSWER-->

Variables (features) in a dataset come in different types. Knowing the type helps you choose the right model and preprocessing.

1. Continuous Variables

Take numerical values that can be measured on a continuous scale.

Often involve real numbers.

Example:

- Age (e.g., 23, 23.5)

- Temperature (e.g., 36.6°C)

- House price (e.g., 250000.75)

Can be used directly in most ML models.

2. Categorical Variables

Represent categories or groups, not numbers in a measurable sense.

Example:

- Color: red, blue, green

- Gender: male, female

- Car type: SUV, sedan, hatchback

Often need to be encoded into numbers for models (e.g., One-Hot Encoding, Label Encoding).

Summary Table:

| Feature Type | Example Values   | ML Notes                             |
| ------------ | ---------------- | ------------------------------------ |
| Continuous   | 23, 45.6, 100    | Use as-is; can calculate mean/std    |
| Categorical  | red, blue, green | Encode before feeding to most models |

# Q.21. What is feature scaling? How does it help in Machine Learning?

ANSWER-->

Feature Scaling standardizes or normalizes the range of independent variables so that no variable dominates due to its magnitude.
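It helps gradient-based algorithms (like logistic regression, SVM, neural networks) converge faster, and it keeps distance-based algorithms (like k-nearest neighbors) from being dominated by the largest-valued feature.
Example:
If one feature has values 1–10 and another 1,000–10,000, scaling puts both on a comparable range so the larger one does not dominate the model.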
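
The two most common scaling formulas can be written out by hand. The sketch below uses the example ranges from the text; `min_max_scale` and `standardize` are helper names chosen here, not library functions.

```python
from math import sqrt

def min_max_scale(values):
    """Rescale to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (v - mean) / std."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    return [(v - mean) / std for v in values]

small_feature = [1, 4, 7, 10]
large_feature = [1000, 4000, 7000, 10000]

small_scaled = min_max_scale(small_feature)
large_scaled = min_max_scale(large_feature)
print(small_scaled)
print(large_scaled)
print(standardize(large_feature))
```

After min-max scaling, both features occupy the same 0 to 1 range even though their raw values differ by a factor of a thousand.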




# Q.22. How do we perform scaling in Python?

ANSWER-->

Using StandardScaler or MinMaxScaler from Scikit-learn:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)  # X is the feature matrix


- StandardScaler: Transforms data to have mean = 0 and standard deviation = 1.

- MinMaxScaler: Rescales data to a fixed range (0–1 by default).

Scaling puts all features on a comparable scale so that none dominates because of its units.
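
One practical detail is worth sketching: the scaler is usually fitted on the training data only and then reused on the test data, so that no information from the test set leaks into training. The arrays below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X_train = np.array([[1.0, 1000.0], [5.0, 5000.0], [10.0, 10000.0]])
X_test = np.array([[3.0, 2000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0))  # approximately [0, 0]
print(X_train_scaled.std(axis=0))   # approximately [1, 1]
print(X_test_scaled)
```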

# Q.23. What is sklearn.preprocessing?

ANSWER-->


sklearn.preprocessing is a module within the scikit-learn library in Python that provides a comprehensive set of tools for data preprocessing. Data preprocessing is a crucial step in machine learning, involving the transformation of raw data into a format suitable for training machine learning models. This is often necessary because many algorithms perform better with clean, scaled, or transformed data.



# Q.24. How do we split data for model fitting (training and testing) in Python?

    # Import the splitting helper and NumPy
    from sklearn.model_selection import train_test_split
    import numpy as np

    # X is the feature matrix and y is the target variable (small example data)
    X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
    y = np.array([0, 1, 0, 1, 0])

    # Split data into 80% training and 20% testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,    # 20% of the data for testing
        random_state=42,  # ensures reproducibility
        shuffle=True      # shuffle data before splitting
    )

    print("Training data shape:", X_train.shape)
    print("Testing data shape:", X_test.shape)

Output:

    Training data shape: (4, 2)
    Testing data shape: (1, 2)


# Q.25. Explain data encoding?

ANSWER-->

Data encoding is the process of converting categorical (non-numeric) or textual data into numerical values that a machine learning model can understand.

Example:
Suppose you have a feature Color with values: Red, Blue, Green. A model cannot understand strings like "Red" directly. Encoding converts them into numbers like:

| Color | Encoded Value |
| ----- | ------------- |
| Red   | 0             |
| Blue  | 1             |
| Green | 2             |

**Types of Data Encoding**

a) Label Encoding

Converts each category into a unique integer.

Simple but may introduce unintended ordinal relationships (e.g., 0 < 1 < 2) even if categories are not ordered.

    from sklearn.preprocessing import LabelEncoder

    colors = ['Red', 'Blue', 'Green', 'Blue']
    encoder = LabelEncoder()
    encoded_colors = encoder.fit_transform(colors)

    print(encoded_colors)  # Output: [2 0 1 0] (classes are sorted alphabetically: Blue=0, Green=1, Red=2)



b) One-Hot Encoding

Converts each category into a binary vector (0 or 1).

Prevents the model from assuming any order among categories.

    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    colors = np.array(['Red', 'Blue', 'Green', 'Blue']).reshape(-1, 1)
    encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
    encoded_colors = encoder.fit_transform(colors)

    print(encoded_colors)


Output (columns follow the encoder's alphabetical category order: Blue, Green, Red):

| Blue | Green | Red |
| ---- | ----- | --- |
| 0    | 0     | 1   |
| 1    | 0     | 0   |
| 0    | 1     | 0   |
| 1    | 0     | 0   |


 When to use: For nominal categorical data (no order).

c) Ordinal Encoding

Similar to label encoding but explicitly respects the order.

Example: Size: Small < Medium < Large → [0, 1, 2].

    from sklearn.preprocessing import OrdinalEncoder

    sizes = [['Small'], ['Medium'], ['Large'], ['Medium']]
    encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
    encoded_sizes = encoder.fit_transform(sizes)

    print(encoded_sizes)  # Small -> 0, Medium -> 1, Large -> 2

d) Other Encodings

- Binary Encoding: Converts categories into binary numbers (useful for high-cardinality data).

- Frequency/Count Encoding: Uses the frequency of each category as its numerical value.

- Target Encoding: Uses the mean of the target variable for each category (mostly in supervised learning).
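
Frequency and target encoding have no dedicated class in the examples above, but both can be sketched with plain pandas operations. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"],
    "bought": [1, 0, 1, 0, 0, 1],   # hypothetical binary target
})

# Frequency encoding: share of rows in which each category appears
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Target encoding: mean of the target within each category
df["city_target"] = df.groupby("city")["bought"].transform("mean")

print(df)
```

In a real project the target means would be computed on the training split only and then applied to the test split; computing them on all rows, as in this sketch, leaks target information.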





