## Machine Learning Assignment Answers (Theory Section)
1. What is a parameter?
- In machine learning, a parameter is a configuration variable that is internal to the model and whose value can be estimated from data. These are the values that the learning algorithm learns during training, such as the weights and biases in a neural network. They define the model's predictive capability.
2. What is correlation?
- Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It indicates both the strength and direction of the linear relationship between the variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).
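As a minimal sketch of what the coefficient measures, Pearson's r can be computed by hand as the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson_r` and the sample values are assumptions made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical data: hours studied vs. exam score
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

r = pearson_r(hours_studied, exam_score)
print(f"r = {r:.3f}")
```

The result should agree with `np.corrcoef(hours_studied, exam_score)[0, 1]`, which is a convenient cross-check.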
3. What does negative correlation mean?
- Negative correlation means that as one variable increases, the other variable tends to decrease. For example, in many scenarios, as the temperature outside decreases, heating bill costs tend to increase. The correlation coefficient for a negative correlation will be between -1 and 0.
4. Define Machine Learning. What are the main components in Machine Learning?
- Machine Learning is a field of artificial intelligence where systems learn from data to identify patterns, make decisions, or predict outcomes without being explicitly programmed. The main components typically include data (for training and testing), a model (the algorithm that learns), features (the input variables), and an objective function/loss function (to evaluate model performance).
5. How does loss value help in determining whether the model is good or not?
- The loss value (or error) quantifies how poorly a model performs on a given dataset. A lower loss value generally indicates a better-performing model, as it means the model's predictions are closer to the actual values. During training, the goal is often to minimize this loss value.
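To make this concrete, here is a small sketch that scores two sets of predictions against the same targets using mean squared error, one common loss for regression. The function name `mse` and all numbers are illustrative assumptions, not results from a real model.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0, 9.0]

# Hypothetical predictions from two candidate models
preds_a = [2.5, 5.5, 7.0, 9.5]   # close to the targets
preds_b = [1.0, 8.0, 4.0, 12.0]  # far from the targets

loss_a = mse(y_true, preds_a)
loss_b = mse(y_true, preds_b)
print(f"Model A loss: {loss_a}")
print(f"Model B loss: {loss_b}")
print("Better model on this data:", "A" if loss_a < loss_b else "B")
```

A low loss on the training data alone is not enough; the comparison is most meaningful when both models are scored on the same held-out data.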
6. What are continuous and categorical variables?
- Continuous variables are numerical variables that can take any value within a given range, often involving decimals (e.g., temperature, height, price). Categorical variables, on the other hand, represent types or categories and can only take on a limited number of discrete values (e.g., gender, city, product type).
7. How do we handle categorical variables in Machine Learning? What are the common techniques?
- Categorical variables need to be converted into a numerical format before being fed into most machine learning models. Common techniques include One-Hot Encoding (creates new binary columns for each category) and Label Encoding (assigns a unique integer to each category).
8. What do you mean by training and testing a dataset?
- Training means fitting the machine learning model on one portion of the data so that it learns patterns and relationships. Testing means evaluating the trained model on a separate, unseen portion of the data to measure how well it generalizes to new data.
9. What is sklearn.preprocessing?
- sklearn.preprocessing is a module within the scikit-learn library that provides a collection of functions and classes for data preprocessing tasks. These tasks are crucial for transforming raw data into a format suitable for machine learning algorithms, such as scaling, normalization, and encoding categorical variables. (Imputation of missing values is handled by the separate sklearn.impute module.)
10. What is a Test set?
- A test set is a subset of your original dataset that is held out from the training process. Its purpose is to provide an unbiased evaluation of the final model's performance on unseen data. It helps in assessing how well the model generalizes and in detecting overfitting.
11. How do we split data for model fitting (training and testing) in Python?
- In Python, we typically use the train_test_split function from scikit-learn's model_selection module. This function shuffles the dataset and divides it into training and testing subsets; setting random_state makes the split reproducible.
12. How do you approach a Machine Learning problem?
- Approaching a machine learning problem usually involves several steps: understanding the problem and data, data collection, data preprocessing (cleaning, transformation), exploratory data analysis (EDA), feature engineering, model selection, model training, model evaluation, hyperparameter tuning, and finally, deployment.
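The core of these steps (split, preprocess, train, evaluate) can be sketched with a scikit-learn pipeline. This is only one possible arrangement; the Iris dataset, the choice of logistic regression, and the 70/30 split are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect the data
X, y = load_iris(return_X_y=True)

# 2. Hold out a test set before any fitting happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Preprocessing and model chained together, so the scaler
#    is fitted on the training data only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 4. Train
pipeline.fit(X_train, y_train)

# 5. Evaluate on unseen data
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps test data out of the preprocessing statistics.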
13. Why do we have to perform EDA before fitting a model to the data?
- Exploratory Data Analysis (EDA) is crucial before model fitting because it helps in understanding the dataset's characteristics, identifying patterns, detecting outliers, and uncovering relationships between variables. This insight guides preprocessing steps, feature engineering, and model selection, ultimately leading to a more robust and accurate model.
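A few typical first EDA checks can be sketched with pandas as follows. The DataFrame is made up for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a missing value and an extreme income
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [32000, 45000, 88000, 91000, 60000, 1000000],
    "segment": ["a", "b", "b", "a", "c", "b"],
})

print(df.dtypes)      # which columns are numeric vs. categorical
print(df.describe())  # summary statistics for numeric columns

# Missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Frequency of each category
segment_counts = df["segment"].value_counts()
print(segment_counts)

# Flag outliers in 'income' with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```

Findings like the missing age or the extreme income would then feed into the preprocessing decisions (imputation, outlier handling) made before fitting.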
14. What is causation? Explain difference between correlation and causation with an example.
- Causation means that one event is the direct result of another event. Correlation, as discussed, only indicates that two variables move together. The key difference is that correlation does not imply causation. For example, ice cream sales and shark attacks might be positively correlated (both increase in summer), but ice cream sales don't cause shark attacks; the causal factor is warm weather, which drives both.
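The ice cream example can be simulated as a rough sketch. All coefficients and noise levels below are invented; the only point is that a shared driver (temperature) produces a strong correlation between two variables that do not influence each other, and that the correlation largely disappears once the driver is accounted for.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Temperature is the common cause; the two outcomes never reference each other
temperature = rng.normal(25.0, 5.0, size=n)
ice_cream_sales = 10.0 * temperature + rng.normal(0.0, 20.0, size=n)
shark_attacks = 0.5 * temperature + rng.normal(0.0, 1.5, size=n)

raw_corr = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]

def residuals(target, driver):
    """What is left of target after removing a straight-line fit on driver."""
    slope, intercept = np.polyfit(driver, target, deg=1)
    return target - (slope * driver + intercept)

# Correlation after removing the effect of temperature from both variables
adjusted_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(shark_attacks, temperature),
)[0, 1]

print(f"Raw correlation:                  {raw_corr:.2f}")
print(f"Correlation given temperature:    {adjusted_corr:.2f}")
```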
15. What is an Optimizer? What are different types of optimizers? Explain each with an example.
- An optimizer is an algorithm or function that adjusts the parameters of a machine learning model (like weights and biases) during training to minimize the loss function. Its goal is to find the optimal set of parameters that results in the best model performance.
Different types of optimizers include:
  - Gradient Descent (GD): Updates parameters by taking steps proportional to the negative of the gradient of the loss function, computed over the entire training set. It can be slow on large datasets.
    Example: Imagine finding the lowest point in a valley by always taking a small step downhill.
  - Stochastic Gradient Descent (SGD): Similar to GD but updates parameters after evaluating the gradient for each training example. Each update is much cheaper, and the noise in the updates can help escape shallow local minima, but the path toward the minimum is noisy.
    Example: Taking a step downhill after looking at only one patch of ground.
  - Mini-Batch Gradient Descent: A compromise between GD and SGD, updating parameters after processing a small "batch" of training examples. This balances speed and stability.
    Example: Taking a step downhill after looking at a small group of patches of ground.
  - Adam (Adaptive Moment Estimation): A more advanced optimizer that uses adaptive learning rates for each parameter, combining momentum with per-parameter scaling of the step size. It's often a good default choice for many deep learning tasks.
    Example: Finding the lowest point in a valley, but smartly adjusting your step size for different terrains based on past observations.
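The four update rules above can be sketched in plain NumPy on a toy linear regression. This is an illustrative sketch, not how any particular library implements them: the data, learning rates, epoch counts, and the helper names `mse_gradient` and `train` are all assumptions. A single `batch_size` argument switches between GD (whole dataset), SGD (one example), and mini-batch.

```python
import numpy as np

# Synthetic data: y = 2x + 1 + small noise (values chosen for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0 + rng.normal(0.0, 0.1, size=200)

def mse_gradient(params, xb, yb):
    """Gradient of mean squared error for the line y = w*x + b."""
    w, b = params
    err = w * xb + b - yb
    return np.array([2.0 * np.mean(err * xb), 2.0 * np.mean(err)])

def train(batch_size, lr=0.1, epochs=300, adam=False, seed=0):
    """batch_size=len(X) gives GD, 1 gives SGD, values between give mini-batch."""
    order_rng = np.random.default_rng(seed)
    params = np.zeros(2)  # [w, b], both start at 0
    m = np.zeros(2)       # Adam: running mean of gradients
    v = np.zeros(2)       # Adam: running mean of squared gradients
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    step = 0
    for _ in range(epochs):
        idx = order_rng.permutation(len(X))  # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            g = mse_gradient(params, X[batch], y[batch])
            step += 1
            if adam:
                m = beta1 * m + (1 - beta1) * g
                v = beta2 * v + (1 - beta2) * g ** 2
                m_hat = m / (1 - beta1 ** step)  # bias correction
                v_hat = v / (1 - beta2 ** step)
                params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
            else:
                params = params - lr * g  # plain gradient step
    return params

gd = train(batch_size=len(X))
sgd = train(batch_size=1, lr=0.01, epochs=20)
mini = train(batch_size=32, epochs=50)
adam_fit = train(batch_size=32, lr=0.05, epochs=100, adam=True)

for name, p in [("GD", gd), ("SGD", sgd), ("Mini-batch", mini), ("Adam", adam_fit)]:
    print(f"{name:>10}: w = {p[0]:.3f}, b = {p[1]:.3f}")
```

All four runs should land near the true values w = 2 and b = 1; they differ in how many examples each update looks at and in how the step size is chosen.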
16. What is sklearn.linear_model?
- sklearn.linear_model is a module within the scikit-learn library that provides a wide range of algorithms for linear models. This includes models for regression (like Linear Regression, Ridge, Lasso) and classification (like Logistic Regression, Perceptron).
17. What does model.fit() do? What arguments must be given?
- model.fit() is a method used to train a machine learning model. It takes the training data and corresponding target values, allowing the model to learn the underlying patterns and relationships between features and the target. The primary arguments required are X (the training features) and y (the target variable).
18. What does model.predict() do? What arguments must be given?
- model.predict() is a method used to make predictions using a trained machine learning model. After the model has learned from the training data, predict() takes new, unseen input features and outputs the model's estimated target values. The main argument required is X_new (the features of the data for which you want predictions).
19. What are continuous and categorical variables?
- Continuous variables are numerical values that can be measured along a continuum and often have infinite possibilities between any two values (e.g., age, height, temperature). Categorical variables, in contrast, represent distinct groups or labels and can't be measured on a continuous scale (e.g., colors, types of fruit, yes/no answers).
20. What is feature scaling? How does it help in Machine Learning?
- Feature scaling is a data preprocessing technique used to standardize the range of independent variables or features of the data. It helps in machine learning by preventing features with larger numerical ranges from dominating features with smaller ranges during model training, especially for algorithms sensitive to feature magnitudes (e.g., K-Nearest Neighbors, Support Vector Machines, and models trained with gradient descent).
21. How do we perform scaling in Python?
- In Python, scaling is typically performed using transformers from scikit-learn's sklearn.preprocessing module. Common scalers include StandardScaler (which standardizes features by removing the mean and scaling to unit variance) and MinMaxScaler (which scales features to a fixed range, usually 0 to 1).
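To show what these two scalers compute, here is a hand-written NumPy sketch of both formulas. The arrays are made-up values; the statistics are taken from the training data only and then reused on the test data, which mirrors calling `fit` on the training set and `transform` on the test set.

```python
import numpy as np

# Hypothetical data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
X_test = np.array([[2.5, 400.0], [5.0, 900.0]])

# Standardization (z-score): (x - mean) / std, statistics from the training set
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Min-max scaling: (x - min) / (max - min), statistics from the training set
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
X_train_mm = (X_train - col_min) / col_range
X_test_mm = (X_test - col_min) / col_range

print("Standardized train:\n", X_train_std)
print("Standardized test:\n", X_test_std)
print("Min-max train:\n", X_train_mm)
print("Min-max test:\n", X_test_mm)
```

Test values outside the training range fall outside [0, 1] after min-max scaling, which is expected when the scaler is fitted on the training set only.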
22. What is sklearn.preprocessing?
- sklearn.preprocessing is a vital module in the scikit-learn library dedicated to transforming raw data into a suitable format for machine learning algorithms. It offers a diverse set of tools for tasks like standardizing, normalizing, and encoding categorical data, which are essential for model performance and stability. Imputing missing values is handled by the related sklearn.impute module.
23. How do we split data for model fitting (training and testing) in Python?
- In Python, the most common way to split data is by using train_test_split from sklearn.model_selection. This function randomly partitions the dataset into two subsets (training and testing). By default the split is purely random; passing the target to the stratify parameter keeps the class proportions the same in both subsets, which is especially important for classification with imbalanced classes.
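A short sketch of the stratify parameter on an imbalanced toy target follows. The 90/10 class balance and the 80/20 split are assumptions chosen so the proportions are easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both subsets keep the original 10% share of class 1
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"Class 1 share in train: {train_share:.2f}")
print(f"Class 1 share in test:  {test_share:.2f}")
```

Without `stratify`, a small test set could by chance contain very few or no examples of the minority class.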
24. Explain data encoding?
- Data encoding is the process of converting data from one format or representation to another, often from categorical to numerical. This is a critical step in machine learning because most algorithms require numerical input. Encoding schemes like One-Hot Encoding and Label Encoding translate non-numerical labels into numerical values that algorithms can process.
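As a small pandas-only sketch of the two ideas, the snippet below encodes an ordered category with integers that follow an explicitly stated order, and encodes the same column one-hot for comparison. The `size` column and its ordering are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"size": ["medium", "small", "large", "small", "medium"]})

# Integer encoding with an explicit order: small < medium < large
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# One-hot encoding: one binary column per category, no order implied
one_hot = pd.get_dummies(df["size"], prefix="size", dtype=int)

print(df)
print(one_hot)
```

Stating the order explicitly matters because an automatic integer assignment (for example, alphabetical) would not necessarily match the real ordering of the categories.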


## Machine Learning Assignment Answers (Coding Section)

In [1]:
#Question 11/23. How do we split data for model fitting (training and testing) in Python?

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load a sample dataset (e.g., Iris dataset)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

print("Original data shape (X):", X.shape)
print("Original data shape (y):", y.shape)

# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("\nShape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Original data shape (X): (150, 4)
Original data shape (y): (150,)

Shape of X_train: (105, 4)
Shape of X_test: (45, 4)
Shape of y_train: (105,)
Shape of y_test: (45,)


In [2]:
#Question 14. How can you find correlation between variables in Python?
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [2, 4, 6, 8, 10], # Positively correlated
    'Feature3': [50, 40, 30, 20, 10], # Negatively correlated
    'Feature4': [1, 5, 2, 8, 3] # Weakly correlated
}
df = pd.DataFrame(data)

print("Sample DataFrame:")
print(df)

# Calculate the correlation matrix for all numerical columns
correlation_matrix = df.corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

# To find correlation between two specific columns
corr_feature1_feature2 = df['Feature1'].corr(df['Feature2'])
print(f"\nCorrelation between Feature1 and Feature2: {corr_feature1_feature2:.2f}")

corr_feature1_feature3 = df['Feature1'].corr(df['Feature3'])
print(f"Correlation between Feature1 and Feature3: {corr_feature1_feature3:.2f}")

Sample DataFrame:
   Feature1  Feature2  Feature3  Feature4
0        10         2        50         1
1        20         4        40         5
2        30         6        30         2
3        40         8        20         8
4        50        10        10         3

Correlation Matrix:
          Feature1  Feature2  Feature3  Feature4
Feature1  1.000000  1.000000 -1.000000  0.398862
Feature2  1.000000  1.000000 -1.000000  0.398862
Feature3 -1.000000 -1.000000  1.000000 -0.398862
Feature4  0.398862  0.398862 -0.398862  1.000000

Correlation between Feature1 and Feature2: 1.00
Correlation between Feature1 and Feature3: -1.00


In [3]:
#Question 17. What does model.fit() do? What arguments must be given?
#Question 18. What does model.predict() do? What arguments must be given?


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Create a synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # 100 samples, 1 feature
y = 2 * X + 1 + np.random.randn(100, 1) * 2 # y = 2x + 1 + noise

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Linear Regression model instance
model = LinearRegression()

# model.fit(X_train, y_train) - Training the model
# X_train: Training features (independent variables)
# y_train: Target variable for training (dependent variable)
print("Training the model...")
model.fit(X_train, y_train)
print("Model training complete.")

print(f"\nModel coefficients: {model.coef_}")
print(f"Model intercept: {model.intercept_}")

# model.predict(X_test) - Making predictions
# X_test: New, unseen features for which predictions are desired
print("\nMaking predictions on the test set...")
y_pred = model.predict(X_test)
print("Predictions made.")

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"\nMean Squared Error on test set: {mse:.2f}")

# Example of predicting a single new data point
new_data_point = np.array([[7.5]])
predicted_value = model.predict(new_data_point)
print(f"Prediction for new data point {new_data_point[0][0]}: {predicted_value[0][0]:.2f}")



Training the model...
Model training complete.

Model coefficients: [[1.9066758]]
Model intercept: [1.23580112]

Making predictions on the test set...
Predictions made.

Mean Squared Error on test set: 2.52
Prediction for new data point 7.5: 15.54


In [4]:
#Question 21. How do we perform scaling in Python?


import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.datasets import load_diabetes

# Load a sample dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

print("Original data (first 5 rows and a few columns):")
print(X.head())
print("\nDescriptive statistics before scaling:")
print(X.describe().loc[['mean', 'std', 'min', 'max']])

# 1. Using StandardScaler (Z-score normalization)
# Scales features to have a mean of 0 and standard deviation of 1
scaler_standard = StandardScaler()
X_scaled_standard = scaler_standard.fit_transform(X)
X_scaled_standard_df = pd.DataFrame(X_scaled_standard, columns=X.columns)

print("\nData after StandardScaler (first 5 rows and a few columns):")
print(X_scaled_standard_df.head())
print("\nDescriptive statistics after StandardScaler:")
print(X_scaled_standard_df.describe().loc[['mean', 'std', 'min', 'max']])

# 2. Using MinMaxScaler
# Scales features to a specified range, typically 0 to 1
scaler_minmax = MinMaxScaler()
X_scaled_minmax = scaler_minmax.fit_transform(X)
X_scaled_minmax_df = pd.DataFrame(X_scaled_minmax, columns=X.columns)

print("\nData after MinMaxScaler (first 5 rows and a few columns):")
print(X_scaled_minmax_df.head())
print("\nDescriptive statistics after MinMaxScaler:")
print(X_scaled_minmax_df.describe().loc[['mean', 'std', 'min', 'max']])

Original data (first 5 rows and a few columns):
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641  

Descriptive statistics before scaling:
               age           sex           bmi            bp            s1  \
mean -2.511817e-19  1.230790e-17 -2.245564e-16 -4.797570e-17 -1.381499e-17   
std   4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02  4.761905e-02   
min  -1.072256e-01 -4.4

In [5]:
#Question 24. Explain data encoding? (Coding for One-Hot and Label Encoding)

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Create a sample DataFrame with categorical data
data = {
    'City': ['New York', 'Paris', 'London', 'New York', 'London', 'Paris'],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy', 'Cloudy'],
    'Temperature': [25, 18, 12, 27, 10, 19]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# --- 1. Label Encoding ---
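# LabelEncoder is intended for encoding the target variable (y).
# It assigns integers in sorted (alphabetical) order, so the codes carry no
# meaningful ranking; for ordinal input features, OrdinalEncoder with an
# explicit category order is the better fit. Used on 'City' here only to
# illustrate the integer mapping.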
le = LabelEncoder()
df['City_LabelEncoded'] = le.fit_transform(df['City'])
print("\nDataFrame after Label Encoding 'City':")
print(df[['City', 'City_LabelEncoded']])
print("Classes learned by LabelEncoder for 'City':", le.classes_)


# --- 2. One-Hot Encoding ---
# Best for nominal categorical data (where order does NOT matter)
# Creates new binary (0 or 1) columns for each category
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore') # sparse_output=False to get a dense array
city_encoded = ohe.fit_transform(df[['City']]) # Note the double brackets for DataFrame input
city_encoded_df = pd.DataFrame(city_encoded, columns=ohe.get_feature_names_out(['City']))

# Combine with original DataFrame
df_ohe = pd.concat([df, city_encoded_df], axis=1)

print("\nDataFrame after One-Hot Encoding 'City':")
print(df_ohe[['City'] + list(ohe.get_feature_names_out(['City']))])

# Using pandas get_dummies for simpler One-Hot Encoding (often preferred)
df_dummies = pd.get_dummies(df, columns=['Weather'], dtype=int)
print("\nDataFrame after One-Hot Encoding 'Weather' using pd.get_dummies:")
print(df_dummies)

Original DataFrame:
       City Weather  Temperature
0  New York   Sunny           25
1     Paris  Cloudy           18
2    London   Rainy           12
3  New York   Sunny           27
4    London   Rainy           10
5     Paris  Cloudy           19

DataFrame after Label Encoding 'City':
       City  City_LabelEncoded
0  New York                  1
1     Paris                  2
2    London                  0
3  New York                  1
4    London                  0
5     Paris                  2
Classes learned by LabelEncoder for 'City': ['London' 'New York' 'Paris']

DataFrame after One-Hot Encoding 'City':
       City  City_London  City_New York  City_Paris
0  New York          0.0            1.0         0.0
1     Paris          0.0            0.0         1.0
2    London          1.0            0.0         0.0
3  New York          0.0            1.0         0.0
4    London          1.0            0.0         0.0
5     Paris          0.0            0.0         1.0

DataFrame a