In [None]:
'''
1.What is a parameter?
    ->A parameter in ML is an internal variable that a model learns from the training data.
        These are the values that define how the model makes predictions.
        The learning algorithm adjusts parameters to minimize error (loss function).

        Examples of Parameters in ML

        Linear Regression:-

        y = wx + b

        w (weight/slope) and b (bias/intercept) are parameters that the algorithm learns during training to find the best fit.

        Logistic Regression:-

        The coefficients (w) determine how strongly each feature influences the probability prediction and are considered parameters.

        Neural Networks:-

        Each connection weight and bias in the network acts as a parameter. Deep learning models can contain millions or even billions of parameters.

        Decision Trees:-

        The split points and thresholds at each node are treated as parameters.
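
        As a minimal sketch of parameters being learned, the snippet below fits y = wx + b with NumPy's np.polyfit. The data points are made up for illustration (generated from w = 2, b = 1), so the fitted values should land close to those numbers.

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial; the result is [slope, intercept]
w, b = np.polyfit(x, y, deg=1)

print("learned weight w:", round(w, 3))
print("learned bias b:", round(b, 3))
```

        Here w and b are the parameters: they are not typed in by hand but estimated from the data.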
        
        
2.What is correlation?
What does negative correlation mean?
    ->In machine learning, correlation describes the statistical relationship between features (independent variables) or between a feature and the target variable.
    It measures how strongly two variables move together.
    A correlation coefficient (usually Pearson’s r, which ranges from -1 to +1 and captures linear relationships) tells us the strength (magnitude) and direction (positive or negative) of that relationship.

    Why it matters in ML:

        Feature selection: Highly correlated features may be redundant (multicollinearity).

        Understanding data: Correlation between features and the target helps identify useful predictors.

        Interpretability: Explains why a model might weigh features in a certain way.

     What does Negative Correlation Mean in ML?

        A negative correlation means that as one variable increases, the other tends to decrease.

        In ML, if a feature has a negative correlation with the target, it means higher values of that feature are associated with lower target values (and vice versa).

    Examples in ML:

        =>House Prices Dataset

            Feature: Distance from city center

            Target: House price

            Relationship: Negative correlation (farther from the city → lower price).

        =>Employee Attrition Prediction

            Feature: Job satisfaction

            Target: Probability of quitting

            Relationship: Negative correlation (higher satisfaction → lower chance of quitting).

        =>Medical Dataset

            Feature: Exercise per week

            Target: Risk of heart disease

            Relationship: Negative correlation (more exercise → lower risk).
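
        A small sketch of the house-price example, assuming made-up numbers for distance and price (they are not from a real dataset). It computes Pearson's r with np.corrcoef and the sign of the result shows the direction of the relationship.

```python
import numpy as np

# Hypothetical values: distance from city center (km) and price (in thousands)
distance_km = np.array([1, 3, 5, 8, 12, 20])
price = np.array([950, 870, 800, 690, 560, 400])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is Pearson's r
r = np.corrcoef(distance_km, price)[0, 1]

print("Pearson r:", round(r, 3))
```

        A value of r below zero indicates a negative correlation: price tends to fall as distance grows.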
            
            
3.Define Machine Learning. What are the main components in Machine Learning?
    ->Machine Learning is a branch of Artificial Intelligence (AI) that focuses on creating algorithms and models that can learn patterns from data and make predictions or decisions without being explicitly programmed.

        In simple words:
          Instead of writing rules manually, we feed data to a machine, and it learns rules/patterns automatically.

          Main Components in Machine Learning

        A typical ML system has 4 main components:

        1. Data

        The foundation of ML — raw information used for training and testing models.

        Types:

        Training data → used to teach the model.

        Validation data → used to tune hyperparameters.

        Test data → used to evaluate performance.

          Example: Customer purchase history, images, medical records.

        2. Model

        The mathematical structure/algorithm that makes predictions or decisions.

        It defines the relationship between input features and output target.

          Example:

        Linear Regression model: predicts house prices.

        Neural Network: recognizes handwritten digits.

        3. Learning Algorithm

        The method/process used to adjust the model’s parameters (weights, biases) based on data.

        Goal: Minimize error (loss function) and improve performance.

          Examples:

        Gradient Descent (optimizes weights).

        Backpropagation (used in neural networks).

        4. Prediction / Inference

        Once trained, the model is used to predict outputs for new, unseen data.

        This is the real-world application of the ML system.

          Example:

        Predicting whether an email is spam.

        Recommending movies on Netflix.

        (Optional but Important) Supporting Components:

        Features → The input variables (e.g., age, income, image pixels).

        Loss function → Measures error between predictions and actual values.

        Evaluation metrics → Accuracy, precision, recall, RMSE, etc.

4.How does loss value help in determining whether the model is good or not?
    ->The loss function, also known as the cost function, measures the difference between the model's predictions and the actual values (ground truth). 
      It indicates how poorly the model is performing—smaller loss means better predictions.

        For example, in regression, Mean Squared Error (MSE) is commonly used:  
        Loss = (1/n) Σ(y_true - y_pred)².  
        If the predictions are close to the actual values, the loss will be small.

        Loss plays a crucial role in determining model quality:  

        - Training Guidance: The optimizer, like gradient descent, uses the gradient of the loss to adjust the model's parameters (weights and biases). A falling training loss shows the model is fitting the training data; validation loss is needed to confirm that the learned patterns generalize.  
        - Model Comparison: Loss values can compare different models or hyperparameters. A consistently lower loss on validation or test data indicates a better model.  
        - Overfitting/Underfitting Detection:  
          - Training loss ↓, Validation loss ↓ → Good fit.  
          - Training loss ↓, Validation loss ↑ → Overfitting (memorizing training data).  
          - High training and validation loss → Underfitting (model too simple).  
        - Early Stopping & Monitoring: In deep learning, loss curves help determine when to stop training to avoid overfitting.
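
        The MSE formula above can be sketched in a few lines. The values of y_true, pred_a and pred_b are invented for illustration; the point is only that predictions closer to the truth give a smaller loss.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared differences
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
pred_a = [2.5, 5.0, 7.5]   # close to the actual values
pred_b = [1.0, 8.0, 4.0]   # far from the actual values

loss_a = mse(y_true, pred_a)
loss_b = mse(y_true, pred_b)

print("loss for pred_a:", loss_a)
print("loss for pred_b:", loss_b)
```

        Comparing loss_a and loss_b is the same idea as comparing two models on a validation set: the lower value wins.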


5.What are continuous and categorical variables?
    ->1. Continuous Variables

        A continuous variable can take any value within a range (including decimals/fractions).

        They are measured, not counted.

        Infinite possible values between two points.

        Examples:

            Height (e.g., 172.3 cm)

            Weight (e.g., 65.8 kg)

            Temperature (e.g., 36.6 °C)

            Time taken to finish a race (e.g., 12.54 seconds)

     In ML: Continuous variables are treated as numerical features in any kind of model. When the target itself is continuous, the task is a regression task.

     2. Categorical Variables:-

        A categorical variable represents distinct groups or categories.

        They are counted, not measured.

        Values are qualitative, not numerical (though sometimes encoded as numbers).

        Types of Categorical Variables:

        Nominal (no order):

            Example: Colors (red, blue, green), Gender (male, female).

        Ordinal (with order/ranking):

            Example: Education level (High school < Bachelor < Master < PhD).

            Example: Customer satisfaction (Poor < Fair < Good < Excellent).

        In ML: Categorical variables need encoding (e.g., one-hot encoding, label encoding) before being fed into models.


6.How do we handle categorical variables in Machine Learning? What are the common techniques?
    ->Models like Linear Regression, Logistic Regression, SVM, Neural Networks, etc., require numbers.

        Categorical features (like color = red/blue/green) must be encoded into numeric form without losing information.

        Common Techniques to Handle Categorical Variables
        1. Label Encoding

        Converts each category into an integer.

          Example:

        Color: Red → 0, Blue → 1, Green → 2


          Pros: Simple, memory-efficient.
          Cons: Implies ordinal relationship (0 < 1 < 2), which may mislead models.

        2. One-Hot Encoding

        Creates binary columns (0/1) for each category.

          Example:

        Color: Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]


          Pros: No false ordering, widely used.
          Cons: Can cause high dimensionality if many categories.

        3. Ordinal Encoding (for ordered categories)

        Assigns integers based on order/rank.

          Example:

        Size: Small → 1, Medium → 2, Large → 3


          Pros: Preserves natural order.
          Cons: Not suitable if categories have no ranking.

        4. Frequency / Count Encoding

        Replace each category with its frequency/count in the dataset.

          Example:

        City: Mumbai → 100, Delhi → 80, Bangalore → 60 (each city is replaced by the number of rows in which it appears)


          Pros: Keeps some information about distribution.
          Cons: Can still mislead models into thinking higher frequency means more importance.

        5. Target Encoding (Mean Encoding)

        Replace each category with the mean of target variable for that category.

           Example: Predicting loan approval → replace “Job Title” with average approval rate for that job.
           Pros: Useful for high-cardinality features.
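           Cons: Risk of data leakage, because the encoding uses the target. Compute the category means on training folds only (e.g., inside cross-validation).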

        6. Embedding Representations (Deep Learning)

        Learn a dense vector representation for categories (used in neural networks).

        Common in NLP (word embeddings like Word2Vec, GloVe).
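
        A minimal sketch of three of the techniques above using pandas. The DataFrame and the size_order mapping are hypothetical, chosen only to mirror the Color and Size examples.

```python
import pandas as pd

# Hypothetical toy data (illustration only)
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red"],
    "Size": ["Small", "Large", "Medium", "Small"],
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["Color"], dtype=int)

# Ordinal encoding: integers follow the natural order of the categories
size_order = {"Small": 1, "Medium": 2, "Large": 3}
df["Size_ordinal"] = df["Size"].map(size_order)

# Frequency encoding: replace each category with its count in the data
df["Color_count"] = df["Color"].map(df["Color"].value_counts())

print(one_hot)
print(df)
```

        scikit-learn's OneHotEncoder and OrdinalEncoder do the same job and fit more naturally into a Pipeline; pandas is used here only to keep the sketch short.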
        
        
7.What do you mean by training and testing a dataset?
    ->1. Training Dataset

        The portion of data used to teach the model.

        The model looks at the input features (X) and corresponding output labels (y) and learns patterns.

        During training, the algorithm adjusts its parameters (like weights, biases) to minimize the loss function.

        Example:

            Data: House size, number of rooms → House price

            Training: Model learns the relationship between features (size, rooms) and target (price).

     2. Testing Dataset

        The unseen portion of data used to evaluate the trained model.

        It checks whether the model has actually learned general patterns (not just memorized training data).

        Performance on the test set tells us if the model can generalize to new data.

        Example:

            After training on 80% of housing data, we test on the remaining 20% to see how well the model predicts prices of houses it hasn’t seen before.

        Why Split into Training and Testing?

            If we train and test on the same data, the model may appear to perform well (low loss) but fail on new data → this is called overfitting, and without a separate test set we would not notice it.

            Splitting ensures we evaluate true predictive ability.

        Typical Dataset Splits

            Training set: 70–80%

            Testing set: 20–30%

            Sometimes, we also use a Validation set (10–20%) to tune hyperparameters.

             Or we use cross-validation instead of a fixed split.

        Quick Analogy

            Training dataset = studying with practice questions.

            Testing dataset = final exam with new questions.
            
            
8.What is sklearn.preprocessing?
    ->In scikit-learn (sklearn),
        sklearn.preprocessing is a module that provides functions and classes for scaling, transforming, and encoding data before feeding it into a machine learning model.

        In ML, raw data often isn’t ready for models — we need to:

            Normalize or standardize numerical features.

            Encode categorical variables.

            Generate polynomial features.

            Handle missing values (the imputers for this live in a separate module, sklearn.impute).

            That’s where sklearn.preprocessing comes in.

        Common Tasks in sklearn.preprocessing
            1. Feature Scaling

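                Many ML algorithms (especially gradient-based and distance-based ones such as linear models, SVM, and k-NN) perform better if features are on a similar scale. Tree-based models are largely unaffected.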

                StandardScaler → scales data to mean = 0, std = 1.

                MinMaxScaler → scales data to a fixed range (usually [0, 1]).

                RobustScaler → uses median & IQR (robust to outliers).

            2. Encoding Categorical Features

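                LabelEncoder → converts class labels to integers (intended for the target y rather than input features).

                OneHotEncoder → creates binary columns for each category.

                OrdinalEncoder → converts categorical features to integers; an explicit order (e.g., small < medium < large) can be passed via its categories parameter.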

            3. Feature Transformation

                PolynomialFeatures → generates polynomial & interaction terms.

                Binarizer → converts numerical values to 0/1 based on a threshold.

                Normalizer → scales rows to have unit norm (for text mining, cosine similarity, etc.).

            4. Custom Transformation Pipelines

                FunctionTransformer → apply custom transformation functions.

                Works seamlessly with Pipelines (sklearn.pipeline.Pipeline) to chain preprocessing + model training steps.
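
        A short sketch of feature scaling with two of the scalers listed above. The feature matrix X (age and salary columns) is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical feature matrix: columns are age and salary
X = np.array([[25, 50000.0],
              [32, 60000.0],
              [47, 80000.0],
              [51, 90000.0]])

# StandardScaler: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

        In a real project the scaler is usually fitted on the training set only and then applied to the test set with transform(), so that no information from the test data leaks into training.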


9.What is a Test set?
    ->In Machine Learning, a test set is a portion of the dataset that is kept separate from the training data and used to evaluate the performance of a trained model on unseen data.

        The model never sees this data during training.

        It acts as a proxy for how the model will perform in the real world.

        Purpose of a Test Set

            Evaluate Generalization

            Checks if the model can predict accurately on new, unseen data.

            Detect Overfitting

            If the model performs well on training data but poorly on the test set → it is overfitting.

            Compare Models

            Helps in choosing the best model or algorithm based on its performance on unseen data.

        Typical Split

            Common practice:

                Training set: 70–80% (to train the model)

                Test set: 20–30% (to evaluate the model)

            Sometimes a validation set is also used (or cross-validation) to tune hyperparameters before final testing.

        Analogy

            Training set: practice questions you study to learn.

            Test set: final exam questions you’ve never seen before.

        Key Points

            Test set must be representative of the real-world data.

            Performance metrics on the test set (accuracy, RMSE, F1-score, etc.) are used to judge the model.
            
            
10.How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?
    ->In Machine Learning, the dataset is usually divided into training and testing sets to evaluate a model’s performance and its ability to generalize to unseen data.

        Training Set:

            A subset of the data used to train the model.

            The model learns patterns, relationships, and parameters (like weights in regression or neural networks) from this data.

        Test Set:

            A separate subset of the data not seen by the model during training.

            Used to assess the model’s performance on new, unseen data and check its generalization ability.

        Typical Split Ratios:

            Training: 70–80% of the data

            Testing: 20–30% of the data

        Notes:

            Sometimes, a validation set is also used to fine-tune hyperparameters.

            Splitting does not prevent overfitting by itself; it lets us detect overfitting by measuring performance on data the model has not seen.
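
        In Python, this split is commonly done with scikit-learn's train_test_split. The sketch below assumes a tiny made-up dataset of 10 samples; test_size=0.2 and random_state=42 are illustrative choices, not required values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

        For classification problems, passing stratify=y keeps the class proportions similar in both parts.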

     How to Approach a Machine Learning Problem

        A structured approach is necessary for solving ML problems effectively. The common steps are:

        Step 1: Problem Definition

            Understand the objective of the task: classification, regression, clustering, etc.

            Identify input features (independent variables) and the output/target (dependent variable).

        Step 2: Data Collection

            Gather data from relevant sources such as databases, files, sensors, or APIs.

            Ensure data quality, consistency, and sufficiency.

        Step 3: Data Exploration and Preprocessing

            Perform Exploratory Data Analysis (EDA) to understand data distribution, patterns, and relationships.

            Handle missing values, outliers, and inconsistencies.

            Encode categorical variables and scale/normalize numerical features as required.

        Step 4: Data Splitting

            Divide data into training and testing sets to evaluate the model’s performance.

            Optionally, create a validation set for hyperparameter tuning.

        Step 5: Model Selection

            Choose an appropriate algorithm based on the problem type.

            Regression → Linear Regression, Decision Trees, etc.

            Classification → Logistic Regression, SVM, Random Forest, etc.

            Clustering → K-Means, DBSCAN, etc.

        Step 6: Model Training

            Fit the selected model on the training set so it can learn the relationships between features and target.

        Step 7: Model Evaluation

            Use the test set to evaluate model performance using metrics appropriate for the problem:

            Regression → MSE, RMSE, R²

            Classification → Accuracy, Precision, Recall, F1-score

        Step 8: Hyperparameter Tuning

            Optimize the hyperparameters using techniques like Grid Search or Random Search, usually scored with cross-validation on the training data so that the test set stays untouched for the final evaluation.

        Step 9: Deployment and Monitoring

            Deploy the trained model to make predictions on real-world data.

            Monitor performance continuously and update the model as needed.
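
        Steps 3 to 7 can be sketched end to end as follows. This is only one possible arrangement: the Iris dataset (bundled with scikit-learn), the StandardScaler + LogisticRegression pipeline, and the 80/20 split are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a small classification dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Step 4: hold out a test set (stratify keeps class proportions similar)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model chained in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 6: train on the training set only
model.fit(X_train, y_train)

# Step 7: evaluate on the unseen test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", accuracy)
```

        Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into training.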


11.Why do we have to perform EDA before fitting a model to the data?
    ->Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, patterns, and relationships before building a machine learning model.

        Performing EDA is important because it helps in several key ways:

        => Understand Data Distribution

            EDA helps us see how features are distributed (normal, skewed, uniform, etc.).

            Knowing the distribution allows us to choose appropriate preprocessing steps (like scaling, normalization, or transformation).

            Example:

                A highly skewed feature might need log transformation before training a regression model.

        => Detect and Handle Missing Values

            Real-world datasets often have missing or null values.

            EDA helps identify missing data and decide how to handle it:

            Drop rows/columns

            Impute with mean, median, mode, or predictive methods

        => Identify Outliers

            Outliers can distort model training and reduce performance.

            EDA helps spot extreme values using visualizations like boxplots or scatter plots.

        => Discover Relationships Between Variables

            EDA helps understand correlations and interactions between features and target variable.

            This can guide feature selection and reduce multicollinearity.

            Example:

                A feature highly correlated with the target is likely useful for prediction.

                Features strongly correlated with each other may need dimensionality reduction.

        => Detect Data Quality Issues

            EDA reveals inconsistencies, duplicate records, or errors in data.

            Fixing these ensures the model learns true patterns rather than noise.

        => Guide Feature Engineering

            EDA provides insights for creating new features or transforming existing features to improve model performance.

        => Choose the Right Model

            Based on the patterns observed during EDA, you can decide whether to use:

            Linear vs non-linear models

            Simple vs complex models

            Regression vs classification
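
        A few of the EDA checks above can be sketched with pandas. The DataFrame is invented for illustration (one missing age and one extreme income), and the 1.5 * IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one extreme value
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 23, 40],
    "income": [10, 12, 11, 13, 12, 100],
})

print(df.describe())          # summary statistics per column

missing = df.isna().sum()     # number of missing values per column
print(missing)

# Flag outliers in "income" with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df.loc[mask, "income"].tolist()
print(outliers)
```

        Plots such as histograms and boxplots usually accompany these numeric checks.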


12.What is correlation?
    ->In machine learning, correlation describes the statistical relationship between features (independent variables) or between a feature and the target variable.
    It measures how strongly two variables move together.
    A correlation coefficient (usually Pearson’s r, which ranges from -1 to +1 and captures linear relationships) tells us the strength (magnitude) and direction (positive or negative) of that relationship.

    Why it matters in ML:

        Feature selection: Highly correlated features may be redundant (multicollinearity).

        Understanding data: Correlation between features and the target helps identify useful predictors.

        Interpretability: Explains why a model might weigh features in a certain way.

13. What does Negative Correlation Mean?
    ->A negative correlation means that as one variable increases, the other tends to decrease.

        In ML, if a feature has a negative correlation with the target, it means higher values of that feature are associated with lower target values (and vice versa).

    Examples in ML:

        =>House Prices Dataset

            Feature: Distance from city center

            Target: House price

            Relationship: Negative correlation (farther from the city → lower price).

        =>Employee Attrition Prediction

            Feature: Job satisfaction

            Target: Probability of quitting

            Relationship: Negative correlation (higher satisfaction → lower chance of quitting).

        =>Medical Dataset

            Feature: Exercise per week

            Target: Risk of heart disease

            Relationship: Negative correlation (more exercise → lower risk).
                      


'''

In [7]:
#14.How can you find correlation between variables in Python?
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    "Age": [25, 32, 47, 51, 23, 40],
    "Salary": [50000, 60000, 80000, 90000, 45000, 70000],
    "Experience": [2, 5, 20, 25, 1, 15]
})

# Correlation matrix
corr_matrix = data.corr()
print(corr_matrix)


                 Age    Salary  Experience
Age         1.000000  0.995940    0.990271
Salary      0.995940  1.000000    0.986608
Experience  0.990271  0.986608    1.000000


In [None]:
'''
15.What is causation? Explain difference between correlation and causation with an example.

    ->Causation (also called a causal relationship) occurs when a change in one variable directly leads to a change in another variable.

        It establishes a cause-and-effect relationship.

        Causation implies that manipulating one variable will produce a predictable change in the other.

        Determining causation usually requires controlled experiments, longitudinal studies, or strong statistical evidence.

        Example of Causation:

        Smoking → Lung cancer

        Scientific studies show that smoking cigarettes increases the risk of lung cancer.

        Here, smoking is the cause, and lung cancer is the effect.

        Exercise → Weight loss

        Regular exercise can lead to weight reduction.

        Exercise is the causal factor, weight loss is the outcome.

        Key Point: Causation implies a mechanism or rationale for why the change occurs.
    
    
    =>Correlation measures the statistical relationship between two variables — how they move together.

        It shows strength and direction (positive or negative).

        Correlation does not imply causation — it only shows that two variables are related.

        Example of Correlation:

        Ice cream sales ↑ and drowning incidents ↑ in summer

        These two variables are positively correlated.

        However, ice cream sales do not cause drowning incidents.

        The hidden factor is temperature/season — hot weather increases both swimming (leading to drownings) and ice cream consumption.

        Key Point: Correlation is an association, not proof of cause-and-effect.
        
        
| Aspect             | Correlation                                                                     | Causation                                                                   |
| ------------------ | ------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| Definition     | Measures the strength and direction of a relationship between two variables | One variable directly influences or causes a change in another variable |
| Nature         | Statistical association; can be positive, negative, or zero                     | Cause-and-effect relationship                                               |
| Directionality | No causality implied; could be bidirectional or influenced by a third variable  | Always directional: cause → effect                                          |
| Proof Required | No proof needed; observed from data                                             | Requires experiments, longitudinal studies, or strong evidence              |
| Example        | Ice cream sales and drowning deaths rise together in summer                     | Smoking causes lung cancer; exercise causes weight loss                     |
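
The ice cream example can be illustrated with a small simulation. All numbers below are simulated, not real data: temperature is assumed to drive both variables, and neither variable influences the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (not real) data: temperature drives both variables
temperature = rng.uniform(10, 35, size=500)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation is strong even though neither variable affects the other
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(y, x):
    # Remove the part of y that a straight line in x explains
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# After removing the effect of temperature, the correlation nearly vanishes
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print("raw correlation:", round(r_raw, 2))
print("after adjusting for temperature:", round(r_adjusted, 2))
```

Once the hidden factor is accounted for, the association largely disappears, which is what we expect when a correlation is not causal.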

'''


In [None]:
'''
16.What is an Optimizer? What are different types of optimizers? Explain each with an example.
    ->In Machine Learning (especially in Neural Networks / Deep Learning), an optimizer is an algorithm used to update the parameters (weights and biases) of the model to minimize the loss function.

        The goal of an optimizer is to find the set of parameters that results in the lowest error.

        Optimizers guide the model during training by adjusting weights based on the gradient of the loss function.

     Key Concept:

        Most optimizers use gradient descent or its variants.

        Gradient descent moves in the direction of steepest descent to minimize the loss.

    Types of Optimizers

        There are several types of optimizers. The common ones are:

        A) Gradient Descent (GD)

        The most basic optimizer.

        Updates weights by moving in the negative direction of the gradient of the loss function.

        Update Rule:
        𝑤 = 𝑤 − 𝜂 ⋅ ∇𝐿(𝑤)

        Where:

        𝑤 = weights  
        𝜂 = learning rate  
        ∇𝐿(𝑤) = gradient of the loss function  
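
        The update rule above can be sketched in plain Python. The loss L(w) = (w - 3)^2 is a toy function chosen for illustration, and the learning rate and step count are arbitrary.

```python
def gradient(w):
    # Derivative of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0              # initial weight
learning_rate = 0.1  # eta in the update rule

for _ in range(100):
    # w = w - eta * gradient of the loss
    w = w - learning_rate * gradient(w)

print(w)
```

        The weight moves toward 3, the value that minimizes this toy loss.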

    Variants of Gradient Descent:

    Batch Gradient Descent:

        Uses all training samples to compute the gradient.

        Accurate but slow for large datasets.

    Stochastic Gradient Descent (SGD):

        Uses one training sample at a time to update weights.

        Faster, introduces randomness → can escape local minima.

    Mini-Batch Gradient Descent:

        Uses a subset (batch) of data to compute gradients.

        Combines advantages of batch and stochastic GD.

    B) Momentum Optimizer

        Accelerates SGD by adding a momentum term, which helps smooth out oscillations in updates.

        Updates are influenced by previous gradients.

        Update Rule:
            𝑣ₜ = 𝛾𝑣ₜ₋₁ + 𝜂∇𝐿(𝑤)  
            𝑤 = 𝑤 − 𝑣ₜ  

            Here, 𝑣ₜ represents the velocity term, and 𝛾 is the momentum coefficient, typically set to 0.9.
         Example:

            Helps in scenarios where the loss surface has narrow valleys, like deep networks.
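
        A minimal sketch of the momentum update, again on the toy loss L(w) = (w - 3)^2 (an assumption for illustration). The variable velocity plays the role of the velocity term and momentum is the coefficient, set to 0.9 here.

```python
def gradient(w):
    # Derivative of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
velocity = 0.0
learning_rate = 0.1
momentum = 0.9

for _ in range(500):
    # The velocity accumulates a decaying sum of past gradients
    velocity = momentum * velocity + learning_rate * gradient(w)
    w = w - velocity

print(w)
```

        On this simple loss the weight overshoots and oscillates around 3 before settling; on long, narrow valleys the same accumulation is what speeds up progress.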

    C) AdaGrad (Adaptive Gradient)

        Adjusts the learning rate for each parameter individually based on past gradients.

        Parameters with frequent updates get smaller learning rates.

      Good For:

            Sparse data (like text or NLP tasks)

      Limitation:

            Learning rate can shrink too much over time.

    D) RMSProp (Root Mean Square Propagation)

        Improves on AdaGrad by using a decaying average of squared gradients to prevent learning rate from shrinking too much.

        Popular in RNNs and deep learning tasks.

    E) Adam (Adaptive Moment Estimation)

        Combines Momentum + RMSProp.

        Maintains moving averages of gradients and squared gradients.

        Automatically adapts learning rate for each parameter.

        Update Rule:

            𝑚ₜ = 𝛽₁𝑚ₜ₋₁ + (1 − 𝛽₁)∇𝐿(𝑤)  
            𝑣ₜ = 𝛽₂𝑣ₜ₋₁ + (1 − 𝛽₂)(∇𝐿(𝑤))²  
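            𝑤 = 𝑤 − 𝜂 * 𝑚ₜ / (√𝑣ₜ + 𝜖)  

            Note: the full algorithm first bias-corrects 𝑚ₜ and 𝑣ₜ by dividing them by (1 − 𝛽₁ᵗ) and (1 − 𝛽₂ᵗ) respectively, and uses the corrected values in the last line.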



        Pros:
            Works well out-of-the-box for most neural networks.
            A popular first choice among the optimizers available in TensorFlow and PyTorch.
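
        A minimal single-parameter sketch of Adam on the toy loss L(w) = (w - 3)^2 (an assumption for illustration). It includes the bias-correction step of the standard algorithm, where the moving averages are divided by (1 - beta1^t) and (1 - beta2^t). The hyperparameter values are commonly used ones, apart from the larger learning rate picked to keep the run short.

```python
import math

def gradient(w):
    # Derivative of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0
m, v = 0.0, 0.0  # moving averages of the gradient and squared gradient
learning_rate, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = gradient(w)
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum part)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (RMSProp part)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)

print(w)
```

        In practice the framework's built-in Adam implementation is used rather than a hand-written loop like this one.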

    F) Other Optimizers

        Adadelta → Like RMSProp, but no need to set initial learning rate.

        Nadam → Adam + Nesterov momentum.

'''


In [1]:
'''
17.What is sklearn.linear_model ?
    ->sklearn.linear_model is a module in scikit-learn (sklearn), a popular Python library for machine learning. This module provides classes and functions to implement linear models — models that assume a linear relationship between the input variables (features) and the output (target).

        In simpler terms, it’s used when you want to predict a value (or class) as a linear combination of features.

        Key Features of sklearn.linear_model:

        Regression Models – Predict continuous values:

            LinearRegression: Standard linear regression.

            Ridge: Linear regression with L2 regularization (helps prevent overfitting).

            Lasso: Linear regression with L1 regularization (can shrink some coefficients to zero).

            ElasticNet: Combines L1 and L2 regularization.

        Classification Models – Predict categorical values:

            LogisticRegression: For binary or multiclass classification.

        Robust Models – Resistant to outliers:

            RANSACRegressor: Fits a model robustly by ignoring outliers.

            HuberRegressor: Less sensitive to outliers in regression.

        Other Linear Models:

            SGDRegressor / SGDClassifier: Uses stochastic gradient descent for optimization.

            Perceptron: Simple linear classifier.

'''
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Create linear regression model
model = LinearRegression()

# Fit the model
model.fit(X, y)

# Predict
pred = model.predict([[6]])
print(pred)  # Output: [12.]


[12.]


In [2]:
'''
18.What does model.fit() do? What arguments must be given?
    ->In scikit-learn, model.fit() is the method used to train a machine learning model on your dataset. Essentially, it tells the model: “Here’s the data and the target values — learn the patterns from this.”

        What model.fit() does:

            Takes input features and target values (training data).

            Calculates the model parameters (like weights in linear regression or coefficients in logistic regression) that best map inputs to outputs.

            Stores the learned parameters inside the model object, so you can use model.predict() later.

        Arguments for model.fit():

            X – Input features (independent variables)

            Should be array-like of shape (n_samples, n_features).

            Example: 2D array or pandas DataFrame where each row is a sample and each column is a feature.

            y – Target values (dependent variable)

            Should be array-like of shape (n_samples,), or (n_samples, n_outputs) for estimators that support multiple outputs. Unsupervised estimators (e.g., clustering) do not need y.

            Example: 1D array or pandas Series for labels.

        Optional arguments (depends on the model):

        Some models accept extra arguments like sample_weight to give different importance to different samples.
'''

from sklearn.linear_model import LinearRegression
import numpy as np

# Features and target
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Create model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Model has learned the parameters
print(model.coef_)   # [2.]
print(model.intercept_)  # 0.0


[2.]
0.0


In [3]:
'''
19.What does model.predict() do? What arguments must be given?
    ->In scikit-learn, model.predict() is used to make predictions using a trained model. Essentially, after you’ve trained your model with model.fit(), you can feed it new data and it will output predicted values or classes based on the patterns it learned.

    What model.predict() does:

        Takes new input data (features) that the model hasn’t seen.

        Uses the learned parameters from model.fit() to calculate predictions.

        For regression, it predicts continuous values.

        For classification, it predicts class labels.

    Arguments for model.predict():

        X – Input features for which you want predictions.

            Should be array-like of shape (n_samples, n_features).

            Example: 2D array or pandas DataFrame where each row is a sample and each column is a feature.

    Important: The number of features (n_features) must match the training data used in .fit().
'''

from sklearn.linear_model import LinearRegression
import numpy as np

# Training data
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# New data for prediction
X_new = np.array([[6], [7]])

# Predict
predictions = model.predict(X_new)
print(predictions)  # Output: [12. 14.]


[12. 14.]
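For the classification case, a minimal sketch with made-up, well-separated data might look like this. It also shows what happens when the feature count does not match the training data. predict_proba is not discussed in the answer above; it is included only to contrast class labels with probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: small values are class 0, large values are class 1
X_train = np.array([[0], [1], [2], [8], [9], [10]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_train, y_train)

# predict() returns one class label per row of X_new
X_new = np.array([[0], [10]])
labels = clf.predict(X_new)
print(labels)

# predict_proba() returns one probability per class for each row
probabilities = clf.predict_proba(X_new)
print(probabilities.shape)

# Passing 2 features to a model trained on 1 feature raises ValueError
try:
    clf.predict(np.array([[1, 2]]))
    shape_error = None
except ValueError as err:
    shape_error = str(err)
print(shape_error)
```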


In [None]:
'''
20.What are continuous and categorical variables?
    ->1. Continuous Variables

        A continuous variable can take any value within a range (including decimals/fractions).

        They are measured, not counted.

        Infinite possible values between two points.

        Examples:

            Height (e.g., 172.3 cm)

            Weight (e.g., 65.8 kg)

            Temperature (e.g., 36.6 °C)

            Time taken to finish a race (e.g., 12.54 seconds)

        In ML: Continuous variables are treated as numerical features. When the target itself is continuous, the task is a regression task.

     2. Categorical Variables:-

        A categorical variable represents distinct groups or categories.

        Observations are counted per category rather than measured on a scale.

        Values are qualitative, not numerical (though sometimes encoded as numbers).

        Types of Categorical Variables:

        Nominal (no order):

            Example: Colors (red, blue, green), Gender (male, female).

        Ordinal (with order/ranking):

            Example: Education level (High school < Bachelor < Master < PhD).

            Example: Customer satisfaction (Poor < Fair < Good < Excellent).

        In ML: Categorical variables need encoding (e.g., one-hot encoding, label encoding) before being fed into models.


'''
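The distinction can be sketched in code. The values below are illustrative only; the sketch checks that a continuous variable is stored as floats and encodes the ordinal satisfaction example with an explicitly stated order, using OrdinalEncoder as one possible choice.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Continuous variable: measured values, stored as floats
heights_cm = np.array([172.3, 160.0, 181.5])
is_continuous = np.issubdtype(heights_cm.dtype, np.floating)
print(is_continuous)

# Ordinal categorical variable: the order must be stated explicitly,
# otherwise the encoder would sort the categories alphabetically
satisfaction = np.array([["Good"], ["Poor"], ["Excellent"]])
order = ["Poor", "Fair", "Good", "Excellent"]
encoder = OrdinalEncoder(categories=[order])
satisfaction_codes = encoder.fit_transform(satisfaction)
print(satisfaction_codes)
```

Each code is the position of the category in the supplied order, so comparisons between codes follow the intended ranking.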

In [4]:
'''
21.What is feature scaling? How does it help in Machine Learning?
    ->Feature scaling is a data preprocessing technique in machine learning that normalizes or standardizes the range of independent variables (features). It puts features on comparable scales so that no feature dominates the model purely because of its units or magnitude.

    Why Feature Scaling is Important:

        Equal Contribution of Features

            Many algorithms either compute distances between samples (KNN, K-Means) or update weights by gradient descent, and both are sensitive to feature magnitudes.

            If one feature ranges from 0–1 and another from 0–1000, the larger-scale feature will dominate, skewing the results.

        Faster Convergence

            Algorithms like gradient descent converge faster when features are on a similar scale.

            Helps reduce training time.

        Improved Accuracy in Distance-Based Models

            Models like K-Nearest Neighbors (KNN), K-Means, and SVM rely on distance metrics.

            Without scaling, features with larger ranges dominate the distance calculations.

        Helps Regularization

            For models like Ridge and Lasso regression, scaling ensures that regularization penalizes all features equally.

'''

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
X = np.array([[1, 100],
              [2, 200],
              [3, 300]])

# Standardization
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)


[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
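To see why unscaled features distort distance-based models, here is a small sketch with three made-up points. Point B differs from A across the whole range of the first feature, and C differs from A across the whole range of the second, yet the raw Euclidean distance is driven almost entirely by the second feature. MinMaxScaler is used here as one possible scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

points = np.array([[0.1, 500.0],   # A
                   [0.9, 510.0],   # B: far from A in feature 1
                   [0.1, 600.0]])  # C: far from A in feature 2

def distance(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

# Raw distances: feature 2 dominates because of its larger units
d_ab_raw = distance(points[0], points[1])
d_ac_raw = distance(points[0], points[2])
print(d_ab_raw, d_ac_raw)

# After scaling each feature to [0, 1], both differences count similarly
scaled = MinMaxScaler().fit_transform(points)
d_ab_scaled = distance(scaled[0], scaled[1])
d_ac_scaled = distance(scaled[0], scaled[2])
print(d_ab_scaled, d_ac_scaled)
```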


In [12]:
# 22. How do we perform scaling in Python?
# 1. Standardization (Z-score Scaling)
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
X = np.array([[1, 100],
              [2, 200],
              [3, 300]])

# Create scaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

print(X_scaled)
print("Centers the data around 0 with unit standard deviation.\n")

    
# 2. Min-Max Scaling (Normalization)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
print("Scales features to a fixed range [0,1].\n")

    
# 3. Max Abs Scaling
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
print("Scales data by the maximum absolute value, keeps sign.\n")
   
    
# 4. Robust Scaling
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
print("Less sensitive to outliers; uses median and IQR.")
    



[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
Centers the data around 0 with unit standard deviation.

[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
Scales features to a fixed range [0,1].

[[0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
Scales data by the maximum absolute value, keeps sign.

[[-1. -1.]
 [ 0.  0.]
 [ 1.  1.]]
Less sensitive to outliers; uses median and IQR.
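In practice the scaler is usually fitted on the training data only and then reused on new data, so that statistics from the test set do not leak into preprocessing. A minimal sketch, assuming a hypothetical single-row test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])  # hypothetical unseen sample

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics on the test data (no refitting)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)     # per-feature means learned from X_train
print(X_test_scaled)

# inverse_transform maps scaled values back to the original units
X_test_restored = scaler.inverse_transform(X_test_scaled)
print(X_test_restored)
```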


In [None]:
'''
23.What is sklearn.preprocessing?
    ->In scikit-learn (sklearn), sklearn.preprocessing is a module that provides functions and classes for scaling, transforming, and encoding data before feeding it into a machine learning model.

        In ML, raw data often isn’t ready for models — we need to:

            Normalize or standardize numerical features.

            Encode categorical variables.

            Generate polynomial features.

            Handle missing values (done with imputers, which live in sklearn.impute rather than sklearn.preprocessing).

        Apart from imputation, these tasks are what sklearn.preprocessing covers.

        Common Tasks in sklearn.preprocessing
            1. Feature Scaling

                Most ML algorithms perform better if features are on a similar scale.

                StandardScaler → scales data to mean = 0, std = 1.

                MinMaxScaler → scales data to a fixed range (usually [0, 1]).

                RobustScaler → uses median & IQR (robust to outliers).

            2. Encoding Categorical Features

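                LabelEncoder → converts target labels (y) to integers 0 to n_classes-1; it is intended for the target, not for input features.

                OneHotEncoder → creates binary columns for each category.

                OrdinalEncoder → converts categorical features to integer codes; pass categories=... to fix the order (e.g., small < medium < large).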

            3. Feature Transformation

                PolynomialFeatures → generates polynomial & interaction terms.

                Binarizer → converts numerical values to 0/1 based on a threshold.

                Normalizer → scales rows to have unit norm (for text mining, cosine similarity, etc.).

            4. Custom Transformation Pipelines

                FunctionTransformer → apply custom transformation functions.

                Works seamlessly with Pipelines (sklearn.pipeline.Pipeline) to chain preprocessing + model training steps.


'''
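Two of the tools listed above can be sketched briefly: PolynomialFeatures for feature transformation, and a Pipeline that chains scaling with a model. The step names "scale" and "model" are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# PolynomialFeatures: [a, b] -> [1, a, b, a^2, a*b, b^2] for degree 2
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)

# Pipeline: scaling and model training run as a single estimator
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

pipe = Pipeline([
    ("scale", StandardScaler()),    # fitted on X during pipe.fit
    ("model", LinearRegression()),  # trained on the scaled features
])
pipe.fit(X, y)

# The same scaling is applied automatically before predicting
prediction = pipe.predict(np.array([[6.0]]))
print(prediction)
```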

In [13]:
#24.How do we split data for model fitting (training and testing) in Python?
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Split data: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("X_train:", X_train)
print("X_test:", X_test)
print("y_train:", y_train)
print("y_test:", y_test)


X_train: [[ 1]
 [ 8]
 [ 3]
 [10]
 [ 5]
 [ 4]
 [ 7]]
X_test: [[9]
 [2]
 [6]]
y_train: [ 1  8  3 10  5  4  7]
y_test: [9 2 6]
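For classification data it is often useful to keep the class proportions the same in both splits. The sketch below uses the stratify argument of train_test_split on a made-up, balanced binary target; the 60/40 split is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with a balanced binary target
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# stratify=y keeps the 50/50 class ratio in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train class counts:", train_counts)
print("test class counts: ", test_counts)
```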


In [18]:
'''
25.Explain data encoding?
  ->Data encoding is a preprocessing step in machine learning where categorical or non-numeric data is converted into a numeric format so that models can understand and use it. Most machine learning algorithms require numeric input, so encoding is essential for features like categories, labels, or text.

    Why Data Encoding is Important:

        ML algorithms need numbers

            Algorithms like linear regression, logistic regression, SVM, and neural networks cannot directly process text or categories.

        Preserve information

            Encoding transforms categories into numbers while keeping the meaningful relationships between them (depending on the encoding type).

        Avoid model bias

            Encoding ensures that numeric representations don’t mislead the model (e.g., 0, 1, 2 should not imply ranking unless intended).

'''

'''Common Types of Data Encoding:
1. Label Encoding

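    Converts each category into a unique integer.

    LabelEncoder assigns the integers in sorted (alphabetical) order, so it does not preserve a meaningful ranking such as Low < Medium < High, and it is intended for target labels. For ordinal features, use OrdinalEncoder with an explicit category order.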
'''

from sklearn.preprocessing import LabelEncoder

categories = ['Red', 'Green', 'Blue', 'Green']
encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)
print(encoded)  # Output: [2 1 0 1]
print("****************************")

'''
2. One-Hot Encoding

    Converts each category into a binary vector (0 or 1).

    Good for nominal data (no order, e.g., color, city).
'''
from sklearn.preprocessing import OneHotEncoder
import numpy as np

categories = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
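# sparse_output replaces the old sparse argument (renamed in scikit-learn 1.2, old name removed in 1.4)
encoder = OneHotEncoder(sparse_output=False)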
encoded = encoder.fit_transform(categories)
print(encoded)
print("****************************")



# 3. Binary / Custom Encoding
#    Maps categories to binary representations or custom numeric codes.
#    Useful when you want compact encoding for high-cardinality features.



[2 1 0 1]
****************************
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
****************************
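Binary encoding has no dedicated class in sklearn.preprocessing, so the following is a hand-written sketch of the idea rather than a library API. The city names and the code_of mapping are hypothetical: each category gets an integer code, and the code is then written out as bits, so 4 categories need 2 columns instead of the 4 that one-hot encoding would use.

```python
import numpy as np

cities = ["Delhi", "Mumbai", "Pune", "Chennai", "Mumbai"]

# Assign each distinct category an integer code (alphabetical order)
categories = sorted(set(cities))
code_of = {name: index for index, name in enumerate(categories)}

# Number of bits needed to represent the largest code
n_bits = max(1, (len(categories) - 1).bit_length())

# Write each code as a row of bits, most significant bit first
binary_encoded = np.array([
    [(code_of[city] >> bit) & 1 for bit in reversed(range(n_bits))]
    for city in cities
])

print(code_of)
print(binary_encoded)
```

As with label encoding, the bit patterns carry no real ordering, so this trades some interpretability for compactness.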
