### Machine Learning Q&A

# 1. What is a parameter?
# A parameter is a variable in a machine learning model that is learned from the training data.
# It defines the relationship between input and output and is updated during model training.
# Examples include weights in a neural network and coefficients in linear regression.
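# A minimal sketch of learned parameters, assuming synthetic data generated from y = 3x + 2.
# The data and the names X_demo, y_demo, and param_model are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 3x + 2 (chosen only for illustration).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 3.0 * X_demo.ravel() + 2.0

param_model = LinearRegression()
param_model.fit(X_demo, y_demo)

# The parameters learned from the data: one coefficient (slope) and one intercept.
learned_slope = param_model.coef_[0]
learned_intercept = param_model.intercept_
print(learned_slope, learned_intercept)
```

# With noise-free data the fitted slope and intercept should come out very close to 3 and 2.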

# 2. What is correlation?
# Correlation is a statistical measure that expresses the strength and direction of the relationship between two variables.
# The correlation coefficient ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 means no linear correlation.
# Correlation does not imply causation, meaning one variable does not necessarily cause changes in the other.
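# A small sketch of the Pearson coefficient at its extremes, using made-up arrays (all names here are illustrative).
# It also shows that a coefficient of 0 only rules out a linear relationship, not every relationship.

```python
import numpy as np

hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score_up = 2.0 * hours + 1.0       # rises exactly linearly with hours
score_down = -3.0 * hours + 20.0   # falls exactly linearly with hours

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_positive = np.corrcoef(hours, score_up)[0, 1]
r_negative = np.corrcoef(hours, score_down)[0, 1]

# A symmetric, non-linear relationship: y depends entirely on x, yet r is about 0.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sym = x_sym ** 2
r_nonlinear = np.corrcoef(x_sym, y_sym)[0, 1]

print(r_positive, r_negative, r_nonlinear)
```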

# 3. What does negative correlation mean?
# Negative correlation means that as one variable increases, the other decreases.
# For example, in financial markets, bond prices and interest rates typically have a negative correlation.
# A correlation coefficient close to -1 indicates a strong negative correlation.

# 4. Define Machine Learning. What are the main components in Machine Learning?
# Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed.
# The main components of machine learning include data, a model, a loss function, an optimizer, and evaluation metrics.
# The model learns patterns from the data, the loss function measures its performance, and the optimizer updates the model parameters to improve predictions.

# 5. How does loss value help in determining whether the model is good or not?
# The loss value quantifies the difference between the actual and predicted values of the model.
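# A lower loss value indicates that the model's predictions are closer to the true values on the data where the loss is measured.
# A high loss means the model is making significant errors and may need more training, a different model, or better data.
# Training loss alone is not enough: a low training loss with a much higher validation loss points to overfitting.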
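# As an illustration, the sketch below uses mean squared error as the loss (one common choice, assumed here) on made-up values.
# Predictions close to the targets give a small loss; predictions far from them give a large one.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 7.0])
close_preds = np.array([2.5, 5.0, 7.5])   # small errors
far_preds = np.array([0.0, 9.0, 3.0])     # large errors

loss_close = mean_squared_error(y_true_demo, close_preds)
loss_far = mean_squared_error(y_true_demo, far_preds)
print(loss_close, loss_far)
```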

# 6. What are continuous and categorical variables?
# Continuous variables are numerical variables that can take any value within a range, such as height, weight, or temperature.
# Categorical variables, on the other hand, represent categories or labels, such as gender, color, or product type.
# Machine learning models handle these variable types differently, often requiring encoding for categorical data.

# 7. How do we handle categorical variables in Machine Learning? What are the common techniques?
# Since most machine learning models require numerical inputs, categorical variables need to be transformed.
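# Common techniques include label (integer) encoding (assigning an integer to each category), one-hot encoding (creating one binary column per category), and target encoding (replacing each category with a statistic of the target, such as its mean).
# Integer codes imply an order, so they suit ordinal categories or tree-based models; one-hot encoding avoids that implied order.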
# The choice of encoding technique depends on the model type and data characteristics.
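# A short sketch of two of these encodings on a made-up "color" column (the DataFrame and variable names are assumptions).
# pandas.get_dummies produces one-hot columns, and OrdinalEncoder maps each category to an integer code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(colors, columns=["color"])

# Integer encoding: categories are sorted, then mapped to 0, 1, 2, ...
ordinal = OrdinalEncoder()
color_codes = ordinal.fit_transform(colors[["color"]])

print(list(one_hot.columns))
print(color_codes.ravel())
```

# Here the categories sort as blue, green, red, so "red" receives the code 2 and "blue" the code 0.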

# 8. What do you mean by training and testing a dataset?
# The dataset is split into two parts: the training set, used to train the model, and the test set, used to evaluate its performance.
# The training set helps the model learn patterns, while the test set checks if the model generalizes well to unseen data.
# A held-out test set does not prevent overfitting by itself, but it reveals it: a model that scores well on the training set and poorly on the test set has overfit.

# 9. What is sklearn.preprocessing?
# The sklearn.preprocessing module in Scikit-learn provides functions for feature scaling, encoding categorical variables, and transforming data.
# It includes tools like StandardScaler for standardization (zero mean, unit variance), MinMaxScaler for normalization to a fixed range, OneHotEncoder for categorical encoding, and PolynomialFeatures for feature expansion.
# These preprocessing steps help improve model performance and compatibility with various algorithms.
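# To make feature expansion concrete, here is a sketch with PolynomialFeatures on a single made-up row (X_small is an illustrative name).
# With degree=2 and include_bias=False, the inputs (x1, x2) expand to (x1, x2, x1^2, x1*x2, x2^2).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_small)

# Columns: x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```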

# 10. What is a Test set?
# A test set is a portion of the dataset reserved for evaluating the trained model’s performance.
# It helps assess how well the model generalizes to new, unseen data.
# A test set that is never used for training or tuning gives an unbiased estimate of performance.

# 11. How do we split data for model fitting (training and testing) in Python?
from sklearn.model_selection import train_test_split
# Assumes X (feature matrix) and y (target vector) are already defined.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# This splits the dataset into 80% training data and 20% testing data, ensuring reproducibility with random_state.
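# A self-contained sketch of the same call, assuming a small synthetic dataset (X_all and y_all are illustrative names).
# It checks the 80/20 sizes and that repeating the call with the same random_state returns the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_all = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y_all = np.arange(10)                  # one target value per sample

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Same random_state, same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
same_split = bool((X_te == X_te2).all() and (y_te == y_te2).all())
print(same_split)
```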

# 12. How do you approach a Machine Learning problem?
# The typical approach involves defining the problem, collecting and cleaning data, performing exploratory data analysis (EDA), choosing and training a model, and evaluating its performance.
# Further tuning and optimization may be required to enhance model accuracy.
# Finally, the model is deployed, monitored, and maintained for real-world usage.

# 13. Why do we have to perform EDA before fitting a model to the data?
# Exploratory Data Analysis (EDA) helps understand the dataset’s structure, distribution, and relationships.
# It allows us to detect missing values, outliers, and patterns that can influence model performance.
# Proper EDA ensures better preprocessing, feature selection, and model selection.

# 14. What is correlation?
# Correlation measures the statistical relationship between two variables.
# It helps identify dependencies that can be useful in feature selection for machine learning models.
# Different correlation types include Pearson, Spearman, and Kendall correlation.
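# The three coefficients can differ on the same data. The sketch below assumes a made-up monotonic but non-linear relationship (y = x^3).
# Pearson measures linear association, while Spearman and Kendall are rank-based. Note that pandas may need SciPy installed for the rank-based methods.

```python
import pandas as pd

mono = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
mono["y"] = mono["x"] ** 3   # always increasing, but not a straight line

pearson_r = mono["x"].corr(mono["y"], method="pearson")
spearman_r = mono["x"].corr(mono["y"], method="spearman")
kendall_r = mono["x"].corr(mono["y"], method="kendall")

print(pearson_r, spearman_r, kendall_r)
```

# Because the ranks of x and y agree perfectly, the rank-based coefficients reach 1 while Pearson stays below 1.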

# 15. What does negative correlation mean?
# A negative correlation means that as one variable increases, the other decreases.
# It is represented by a correlation coefficient between -1 and 0.
# For example, more study time tends to go with lower failure rates.

# 16. How can you find correlation between variables in Python?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
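# Assumes df is an existing pandas DataFrame whose columns are numeric.
corr_matrix = df.corr()  # pairwise Pearson correlations by default
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()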
# The heatmap visually represents correlations among variables.

# 17. What is causation? Explain difference between correlation and causation with an example.
# Causation implies that one event directly affects another, while correlation only suggests a relationship without proving causality.
# Example: More ice cream sales and more drowning incidents are correlated, but neither causes the other; hot weather drives both.
# In contrast, smoking causes lung cancer, showing causation.

# 18. What is an Optimizer? What are different types of optimizers? Explain each with an example.
# Optimizers adjust model parameters to minimize loss.
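# Common types: Gradient Descent (steps against the gradient with a fixed learning rate; simple but can be slow), Adam (per-parameter adaptive learning rates based on running averages of the gradient and its square), RMSprop (scales each step by a running average of squared gradients; suited for non-stationary objectives).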
# Example: Adam optimizer is often used in deep learning models due to its efficiency.
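# A minimal sketch of plain gradient descent written with NumPy, assuming synthetic data from y = 2x + 1 and a mean squared error loss.
# The learning rate, step count, and variable names are illustrative choices, not recommendations.

```python
import numpy as np

# Synthetic, noise-free data following y = 2x + 1 (an assumption for illustration).
x_gd = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_gd = 2.0 * x_gd + 1.0

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.05

for step in range(2000):
    error = (w * x_gd + b) - y_gd
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x_gd)
    grad_b = 2.0 * np.mean(error)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = np.mean(((w * x_gd + b) - y_gd) ** 2)
print(w, b, final_loss)
```

# Adaptive optimizers such as Adam and RMSprop follow the same loop but rescale each step using statistics of past gradients.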

# 19. What is sklearn.linear_model?
# It is a Scikit-learn module that provides linear models for both regression and classification.
# It includes estimators such as LinearRegression, Ridge, and Lasso for regression, and LogisticRegression for classification.
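# A brief sketch of three estimators from this module on made-up data (the arrays and variable names are assumptions).
# LinearRegression and Ridge predict a continuous target, while LogisticRegression predicts class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge

X_lin = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_reg = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # continuous target (y = 2x)
y_cls = np.array([0, 0, 0, 1, 1, 1])                  # binary class labels

ols = LinearRegression().fit(X_lin, y_reg)
ridge = Ridge(alpha=1.0).fit(X_lin, y_reg)     # L2 penalty shrinks the coefficient
clf = LogisticRegression().fit(X_lin, y_cls)

print(ols.coef_, ridge.coef_)
class_preds = clf.predict(np.array([[1.0], [6.0]]))
print(class_preds)
```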

# 20. What does model.fit() do? What arguments must be given?
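# It trains the model, estimating its parameters from the data.
# Supervised estimators require the input features (X) and the target values (y); unsupervised estimators and transformers need only X.
# Example: model.fit(X_train, y_train)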

# 21. What does model.predict() do? What arguments must be given?
# It generates predictions from a fitted model. The only required argument is the input features (X), with the same columns used in training.
# Example: predictions = model.predict(X_test)
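# Putting fit() and predict() together: a self-contained sketch that assumes noise-free synthetic data from y = 4x - 1.
# The names X_fp, y_fp, and fp_model are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_fp = np.arange(10, dtype=float).reshape(-1, 1)   # 10 samples, 1 feature
y_fp = 4.0 * X_fp.ravel() - 1.0                    # noise-free line

X_fp_train, X_fp_test, y_fp_train, y_fp_test = train_test_split(
    X_fp, y_fp, test_size=0.2, random_state=42
)

fp_model = LinearRegression()
fp_model.fit(X_fp_train, y_fp_train)           # learn parameters from the training data
fp_predictions = fp_model.predict(X_fp_test)   # predict for unseen inputs

print(fp_predictions, y_fp_test)
```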

# 22. What is feature scaling? How does it help in Machine Learning?
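# Feature scaling puts numerical features on a comparable scale so that features with large values do not dominate.
# It mainly helps gradient-based and distance-based algorithms (such as linear models trained by gradient descent, k-nearest neighbors, and SVMs); tree-based models are largely unaffected.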
# Methods: Standardization (mean=0, variance=1), Normalization (range 0 to 1).
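# The two methods can be compared on a made-up column of values (the array and variable names are assumptions).
# StandardScaler standardizes to mean 0 and variance 1, and MinMaxScaler rescales to the range 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

standardized = StandardScaler().fit_transform(values)
normalized = MinMaxScaler().fit_transform(values)

print(standardized.mean(), standardized.std())
print(normalized.ravel())
```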

# 23. How do we perform scaling in Python?
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
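# Fit on the training data only, then reuse the same scaling on the test data to avoid leaking test information.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)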

# 24. What is data encoding?
# Data encoding converts categorical variables into numeric formats for ML models.
# Common techniques include One-Hot Encoding, Label Encoding, Ordinal Encoding, and Target Encoding.
# Encoding helps models interpret categorical data efficiently and improve performance.

# 25. How do we split data for model fitting (training and testing) in Python?
# We use train_test_split from sklearn.model_selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fixing random_state makes the split reproducible; for imbalanced classification targets, passing stratify=y keeps the class proportions the same in both sets.
