# Introduction to Machine Learning

In this notebook, we will cover the basics of Machine Learning, its types, why it is used, and the complete Machine Learning workflow. This comprehensive guide is designed to ensure you have a clear understanding of each concept with practical examples.


## Machine Learning Basics

### What is Machine Learning?
Machine Learning is a field of artificial intelligence that enables computers to learn from data and make decisions without being explicitly programmed.

### Types of Machine Learning

##### 1. Supervised Learning
Supervised learning involves training a model on labeled data. The model learns to predict the output from the input data based on provided examples.

- **Regression**: Predicts continuous values.
  - **Example**: Predicting house prices based on features like size and location.

- **Classification**: Predicts categorical outcomes.
  - **Example**: Classifying emails as spam or not spam.
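
As a minimal sketch of both tasks, the snippet below fits one regression model and one classification model with scikit-learn. The tiny datasets (house sizes and prices, link counts and spam labels) are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: hypothetical house sizes (m^2) and prices (in thousands)
sizes = np.array([[50], [80], [100], [120]])
prices = np.array([150, 240, 300, 360])  # exactly 3 * size in this toy data
reg = LinearRegression().fit(sizes, prices)
predicted_price = reg.predict(np.array([[90]]))[0]
print("Predicted price for 90 m^2:", round(predicted_price, 1))

# Classification: hypothetical "number of links in an email" vs. spam label
links = np.array([[0], [1], [2], [8], [9], [10]])
is_spam = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(links, is_spam)
predicted_labels = clf.predict(np.array([[1], [9]]))
print("Predicted labels:", predicted_labels)
```

The regression model outputs a continuous number, while the classifier outputs one of the known categories.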

##### 2. Unsupervised Learning
Unsupervised learning finds patterns in data without labeled responses.

- **Clustering**: Groups similar data points together.
  - **Example**: Customer segmentation in marketing.


- **Association**: Finds rules that describe large portions of data.
  - **Example**: Market basket analysis.
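
Clustering can be sketched with scikit-learn's `KMeans`. The customer values below are hypothetical, and the number of clusters (2) is an assumption chosen to match the toy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, visits per month]
customers = np.array([
    [100, 1], [120, 2], [110, 1],      # low-spend group
    [900, 10], [950, 12], [1000, 11],  # high-spend group
])

# No labels are given; KMeans groups the rows by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print("Cluster labels:", labels)
```

The numeric cluster ids are arbitrary; what matters is which rows end up sharing an id.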


##### 3. Semi-Supervised Learning
Uses a combination of labeled and unlabeled data for training.

- **Example**: Text classification with a small amount of labeled data and a large amount of unlabeled data.
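
One possible way to try this is scikit-learn's `SelfTrainingClassifier`, which treats the label `-1` as "unlabeled". The sketch below uses synthetic, well-separated 2-D points rather than text, and keeps only three labels per class; both choices are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic, well-separated two-class data (for illustration only)
points, true_labels = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                                 cluster_std=1.0, random_state=0)

# Keep labels for only three points per class; -1 marks "unlabeled"
partial_labels = np.full_like(true_labels, -1)
for cls in (0, 1):
    labeled_idx = np.where(true_labels == cls)[0][:3]
    partial_labels[labeled_idx] = cls

# The wrapper repeatedly pseudo-labels confident unlabeled points and refits
self_training = SelfTrainingClassifier(LogisticRegression())
self_training.fit(points, partial_labels)
semi_accuracy = (self_training.predict(points) == true_labels).mean()
print("Accuracy on all points:", semi_accuracy)
```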

##### 4. Reinforcement Learning
Trains models to make sequences of decisions by rewarding good actions.

- **Example**: Training a model to play a game by rewarding good moves.
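
The reward idea can be illustrated with a very small epsilon-greedy agent. This is a simplified sketch, not a full reinforcement learning setup: the "game" has three hypothetical moves with fixed rewards and no state.

```python
import random

# Hypothetical game with three possible moves, each paying a fixed reward
rewards = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]   # the agent's current estimate of each move's value
counts = [0, 0, 0]            # how often each move has been tried
epsilon = 0.1                 # probability of exploring a random move
rng = random.Random(0)

for _ in range(1000):
    if rng.random() < epsilon:
        action = rng.randrange(3)                 # explore
    else:
        action = estimates.index(max(estimates))  # exploit the best-known move
    reward = rewards[action]
    counts[action] += 1
    # Incremental average of the rewards seen for this move
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = estimates.index(max(estimates))
print("Learned best move:", best_action)
```

Occasional exploration lets the agent discover the highest-paying move; after that, exploitation keeps selecting it.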


## Why Machine Learning is Used

- **Automation of tasks**: Automating repetitive processes.
- **Handling large volumes of data**: Analyzing and making sense of vast amounts of data.
- **Improved accuracy and efficiency**: Enhancing performance through data-driven methods.
- **Personalization**: Tailoring experiences to individual needs (e.g., recommendations).
- **Decision making**: Supporting data-driven decisions for better outcomes.

### Practical Examples
- **Recommendation Systems**: Suggesting products based on past behavior.
- **Fraud Detection**: Identifying unusual patterns in transactions.
- **Predictive Maintenance**: Forecasting equipment failures based on usage patterns.


## Machine Learning Workflow

### 1. Data Collection
Data collection involves gathering data from various sources.

#### Sources of Data
- Databases
- APIs
- Web scraping

#### Data Formats
- CSV
- JSON
- XML

### 2. Data Preprocessing
Data preprocessing transforms raw data into a suitable format for analysis.

#### Data Cleaning
- **Handling missing values**: Filling or removing missing data.
- **Removing duplicates**: Eliminating duplicate entries.
- **Dealing with outliers**: Identifying and managing outliers.
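
Outlier handling is not shown in the cells of this notebook, so here is a minimal sketch using the common 1.5 × IQR rule. The values are hypothetical, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]
print("Outliers:", outliers.tolist())
print("Cleaned:", cleaned.tolist())
```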






#### Load the dataset and fill missing values with forward fill

In [7]:
import pandas as pd

# Load the dataset
data = pd.read_csv(r'C:\Users\nazil\Downloads\sample_data.csv')
print(data)

# Forward fill: replace each missing value with the last valid value above it.
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas.
data = data.ffill()


print("after filling:\n", data)

   Feature1  Feature2  Target
0       1.2       0.5       0
1       2.3       1.8       0
2       2.3       1.8       0
3       3.4       2.2       1
4       NaN       3.3       1
5       4.5       3.3       1
6       4.5       NaN       1
7       5.6       4.1       1
8       6.7       5.0       1
9       6.7       5.0       1
after filling:
    Feature1  Feature2  Target
0       1.2       0.5       0
1       2.3       1.8       0
2       2.3       1.8       0
3       3.4       2.2       1
4       3.4       3.3       1
5       4.5       3.3       1
6       4.5       3.3       1
7       5.6       4.1       1
8       6.7       5.0       1
9       6.7       5.0       1




#### Remove duplicates

In [8]:


data.drop_duplicates(inplace=True)

print("Data after removing duplicates:")
print(data)


Data after removing duplicates:
   Feature1  Feature2  Target
0       1.2       0.5       0
1       2.3       1.8       0
3       3.4       2.2       1
4       3.4       3.3       1
5       4.5       3.3       1
7       5.6       4.1       1
8       6.7       5.0       1


#### Data Transformation
- **Normalization**: Scaling data to a range (e.g., 0 to 1).
- **Standardization**: Scaling data to have a mean of 0 and standard deviation of 1.

### Without built-in methods

In [9]:
import numpy as np

# Min-max normalize an array using the global min and max of all its values
def normalize(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

# Standardize an array using the global mean and (population) standard deviation
def standardize(x):
    return (x - np.mean(x)) / np.std(x)

# Example usage
data = np.array([[1, 2, 3], [4, 8, 5]])
normalized_data = normalize(data)
standardized_data = standardize(data)

print("Normalized Data:", normalized_data)
print("Standardized Data:", standardized_data)


Normalized Data: [[0.         0.14285714 0.28571429]
 [0.42857143 1.         0.57142857]]
Standardized Data: [[-1.24986486 -0.80873608 -0.36760731]
 [ 0.07352146  1.83803656  0.51465024]]


### With built-in methods

In [10]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Initialize the scalers
scaler = MinMaxScaler()
standardizer = StandardScaler()

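# Normalize and standardize the data.
# The scalers work column by column, so reshape(-1, 1) flattens the array into a
# single column to reproduce the global scaling of the manual version above.
data_scaled = scaler.fit_transform(data.reshape(-1, 1))
data_standardized = standardizer.fit_transform(data.reshape(-1, 1))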

print("Normalized Data:", data_scaled)
print("Standardized Data:", data_standardized)


Normalized Data: [[0.        ]
 [0.14285714]
 [0.28571429]
 [0.42857143]
 [1.        ]
 [0.57142857]]
Standardized Data: [[-1.24986486]
 [-0.80873608]
 [-0.36760731]
 [ 0.07352146]
 [ 1.83803656]
 [ 0.51465024]]


#### Correlation and Covariance
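- **Correlation**: Measures the strength and direction of the linear relationship between two variables, on a scale from -1 to 1.
- **Covariance**: Indicates the direction of the linear relationship between two variables; its magnitude depends on the units of the variables, so it is harder to compare across feature pairs.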


#### Why Use Them?

- To understand the relationships between features.
- To select relevant features for model building.

In [11]:
import numpy as np

# Calculate correlation between two arrays
def correlation(x, y):
    return np.corrcoef(x, y)[0, 1]

# Calculate the population covariance between two arrays
# (divides by n; np.cov divides by n - 1 by default)
def covariance(x, y):
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Example usage
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
corr = correlation(x, y)
cov = covariance(x, y)

print("Correlation:", corr)
print("Covariance:", cov)


Correlation: -0.9999999999999999
Covariance: -2.0
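
To connect this to feature selection, pandas can compute all pairwise correlations at once with `DataFrame.corr()`. The sketch below retypes the sample dataset values printed earlier (after forward fill) so that it runs without the CSV file.

```python
import pandas as pd

# Values copied from the sample dataset output shown earlier (after forward fill)
df = pd.DataFrame({
    "Feature1": [1.2, 2.3, 2.3, 3.4, 3.4, 4.5, 4.5, 5.6, 6.7, 6.7],
    "Feature2": [0.5, 1.8, 1.8, 2.2, 3.3, 3.3, 3.3, 4.1, 5.0, 5.0],
    "Target":   [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
})

corr_matrix = df.corr()                             # pairwise Pearson correlations
target_corr = corr_matrix["Target"].drop("Target")  # each feature vs. the target
print(corr_matrix.round(2))
print(target_corr.round(2))
```

A high correlation between `Feature1` and `Feature2` themselves would suggest that the two features carry largely the same information.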


#### Data Reduction
- **Feature selection**: Selecting relevant features for the model.
- **Feature extraction**: Creating new features from existing ones.
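
As one example of feature extraction, Principal Component Analysis (PCA) combines correlated features into fewer new ones. PCA is not named in this notebook, so treat it as one illustrative choice; the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 1))
# Three hypothetical features that are all noisy copies of one underlying signal
features = np.hstack([signal, 2 * signal, -signal]) + 0.01 * rng.normal(size=(100, 3))

pca = PCA(n_components=1)
reduced = pca.fit_transform(features)
explained = pca.explained_variance_ratio_[0]
print("Original shape:", features.shape)
print("Reduced shape:", reduced.shape)
print("Variance explained by the new feature:", explained)
```
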

#### Data Splitting
- **Train-test split**: Dividing data into training and testing sets.
- **Cross-validation**: Training and evaluating the model on several different splits of the data to get a more reliable estimate of its performance.


In [12]:
import pandas as pd

# Load the dataset
data = pd.read_csv(r'C:\Users\nazil\Downloads\sample_data.csv')
data = data.ffill()  # Forward fill missing values (fillna(method="ffill") is deprecated)

from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = data[['Feature1', 'Feature2']]
y = data['Target']


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



### 3. Model Building

Model building involves selecting and training a machine learning model.

**Selecting the Right Algorithm**
- Linear Regression
- Decision Trees
- Support Vector Machines

**How to Select the Better Model**:

- **Performance Metrics**: Compare accuracy, precision, recall, F1 score.
- **Cross-Validation Scores**: Ensure the model generalizes well to unseen data.
- **Computational Efficiency**: Consider the time and resources required for training.
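
A minimal sketch of such a comparison, assuming a synthetic dataset from `make_classification` and two candidate classifiers chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Score every candidate with the same metric and the same 5-fold procedure
mean_scores = {}
for name, candidate in candidates.items():
    fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring="f1")
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: mean F1 = {mean_scores[name]:.3f}")
```

Using the same folds and metric for every candidate keeps the comparison fair.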

In [13]:
# Example: Training a Logistic Regression model (a classifier, since Target is 0 or 1)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

### 4. Model Evaluation
Model evaluation assesses the model's performance using various metrics.

**Performance Metrics**
- Accuracy
- Precision
- Recall
- F1 Score
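
To make the definitions concrete, the sketch below computes the four metrics by hand from true positives, true negatives, false positives, and false negatives. The labels and predictions are hypothetical.

```python
# Hypothetical true labels and predictions for a binary classifier
actual = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 1, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

acc = (tp + tn) / len(actual)          # share of all predictions that are correct
prec = tp / (tp + fp)                  # share of predicted positives that are correct
rec = tp / (tp + fn)                   # share of actual positives that were found
f1 = 2 * prec * rec / (prec + rec)     # harmonic mean of precision and recall

print(f"Accuracy={acc}, Precision={prec}, Recall={rec}, F1={f1}")
```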

In [18]:
# Example: Model evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 1.0
Confusion Matrix:
[[1 0]
 [0 1]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



#### Cross-Validation

In [15]:
# Example: Cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f'Cross-Validation Scores: {scores}')


Cross-Validation Scores: [1. 1. 1. 1. 1.]




### 5. Model Tuning
Model tuning optimizes the model's performance.

**Hyperparameter Tuning**
- Grid Search
- Random Search


In [25]:
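# Example: Grid Search
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Ridge is a regression model, so it treats the 0/1 Target as a continuous value here;
# this cell only illustrates the GridSearchCV API.
model = Ridge()

# Try each candidate alpha with 5-fold cross-validation on the training set
param_grid = {'alpha': [0.1, 0.5, 1.0]}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f'Best Parameters: {grid_search.best_params_}')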


Best Parameters: {'alpha': 0.1}
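
Random Search can be sketched in the same way with `RandomizedSearchCV`. Instead of trying every value in a grid, it samples a fixed number of candidates from a distribution. The synthetic data and the log-uniform range for `alpha` are assumptions made for this example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for a real dataset
X_reg, y_reg = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Sample alpha from a log-uniform range instead of a fixed grid
param_distributions = {"alpha": loguniform(1e-3, 1e2)}
random_search = RandomizedSearchCV(
    Ridge(), param_distributions, n_iter=10, cv=5, random_state=0
)
random_search.fit(X_reg, y_reg)
best_alpha = random_search.best_params_["alpha"]
print(f"Best alpha: {best_alpha:.4f}")
```
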




#### Optimization Techniques
Optimization techniques are used to minimize or maximize an objective function.

**Purpose**:

- Improve the model's performance.
- Reduce overfitting or underfitting.

**Common Techniques**:

- Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent

In [None]:
# Example: Gradient Descent
import numpy as np

# Define the objective function
def objective_function(x):
    return x**2

# Gradient Descent Algorithm
def gradient_descent(learning_rate, initial_value, iterations):
    x = initial_value
    for _ in range(iterations):
        gradient = 2 * x  # Derivative of x^2
        x = x - learning_rate * gradient
    return x

# Apply Gradient Descent
optimized_value = gradient_descent(learning_rate=0.1, initial_value=10, iterations=100)
print(f'Optimized Value: {optimized_value}')


Optimized Value: 2.0370359763344878e-09
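
The stochastic and mini-batch variants differ only in how much data is used for each update. The sketch below fits a line with mini-batch gradient descent on noiseless toy data; the learning rate, batch size, and epoch count are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 3 * x + 2                    # noiseless toy data: true slope 3, intercept 2

w, b = 0.0, 0.0
learning_rate, batch_size = 0.2, 10

for epoch in range(500):
    order = rng.permutation(len(x))           # reshuffle the data every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        # Gradients of the mean squared error on this mini-batch
        grad_w = 2 * np.mean(error * x[idx])
        grad_b = 2 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(f"Learned slope: {w:.3f}, intercept: {b:.3f}")
```

Setting `batch_size = 1` gives stochastic gradient descent, and `batch_size = len(x)` gives ordinary (full-batch) gradient descent.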


### 6. Model Deployment
Model deployment makes the trained model available for use.


## Step-by-Step Example: Predicting House Prices

This section explains the machine learning workflow in more detail by walking through every step of building a model that predicts house prices.

### 1. *Problem Definition*
   - *Goal*: Predict the price of a house based on various features (e.g., size, location, number of bedrooms).
   - *Type of Problem*: Supervised learning, specifically a regression problem, as the target variable (house price) is continuous.

### 2. *Data Collection*
   - *Source*: Obtain a dataset that includes features like house size, number of bedrooms, location, etc., along with their corresponding prices.
   - *Format*: Typically in CSV or another tabular format.
   - *Tools*: Use pandas to load the dataset.

### 3. *Data Exploration and Visualization*
   - *Summary Statistics*: Use `.describe()` to get a summary of the dataset.
   - *Data Types*: Check data types with `.info()` to ensure they're appropriate (e.g., numeric for numerical features).
   - *Missing Values*: Identify missing values using `.isnull().sum()`.
   - *Visualization*: Use plots (e.g., histograms, box plots, scatter plots) to understand the distribution and relationships in the data.
     - Use matplotlib or seaborn for visualization.
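
The exploration calls above can be sketched on a small DataFrame. The column names and values are hypothetical stand-ins for a real house price dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical house data; column names and values are made up
houses = pd.DataFrame({
    "size_sqft": [850, 1200, np.nan, 1500, 2000],
    "bedrooms": [2, 3, 3, 4, 4],
    "location": ["suburb", "city", "city", None, "suburb"],
    "price": [150000, 230000, 210000, 280000, 350000],
})

summary = houses.describe()              # count, mean, std, quartiles of numeric columns
houses.info()                            # column dtypes and non-null counts
missing_counts = houses.isnull().sum()   # number of missing values per column
print(summary)
print(missing_counts)
```

For a quick visual check, `houses.plot.scatter(x="size_sqft", y="price")` draws a scatter plot, provided matplotlib is installed.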

### 4. *Data Preprocessing*
   - *Handling Missing Data*: 
     - Impute missing values with mean/median for numerical features, or the mode for categorical features.
     - Drop rows/columns with excessive missing data if necessary.
   - *Encoding Categorical Variables*:
     - Convert categorical features into numerical ones using techniques like One-Hot Encoding or Label Encoding.
     - Use `pandas.get_dummies()` or scikit-learn's `OneHotEncoder` for features; scikit-learn's `LabelEncoder` is designed for encoding target labels.
   - *Feature Scaling*:
     - Standardize or normalize numerical features to bring them to a similar scale.
     - Use scikit-learn's `StandardScaler` or `MinMaxScaler`.

     
   - *Feature Engineering*:
     - Create new features if necessary, such as interaction terms or polynomial features.

Correlation and covariance help you understand the relationships between the features and the target variable, which guides both feature selection and model choice:
  - *Feature Selection*: Knowing which features have strong relationships with the target can guide you in selecting relevant features for your model.
  - *Multicollinearity Detection*: High correlations between features can indicate multicollinearity, which may require you to drop or combine features to avoid redundancy and improve model performance.
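
The imputation, encoding, and scaling steps of this section might look like the sketch below. The DataFrame is hypothetical, and median and mode imputation are just one of the options listed above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical house data with one missing value per column
houses = pd.DataFrame({
    "size_sqft": [850.0, 1200.0, np.nan, 1500.0, 2000.0],
    "location": ["suburb", "city", "city", None, "city"],
})

# Impute: median for the numeric column, mode for the categorical column
houses["size_sqft"] = houses["size_sqft"].fillna(houses["size_sqft"].median())
houses["location"] = houses["location"].fillna(houses["location"].mode()[0])

# One-hot encode the categorical column
encoded = pd.get_dummies(houses, columns=["location"])

# Standardize the numeric column to mean 0 and standard deviation 1
encoded["size_sqft"] = StandardScaler().fit_transform(encoded[["size_sqft"]]).ravel()
print(encoded)
```

In a real project the scaler would be fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.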


### 5. *Data Splitting*
   - *Train-Test Split*:
     - Split the data into training and testing sets (e.g., 80% training, 20% testing).
     - Use scikit-learn's `train_test_split()` function.
   - *Validation Set*:
     - Optionally split the training set further into a training and validation set to tune hyperparameters.
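
A train/validation/test split can be obtained by calling `train_test_split()` twice. The 60/20/20 proportions and the placeholder arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 3 features each
X_all = np.arange(300).reshape(100, 3)
y_all = np.arange(100)

# First hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42
)
# Then take 25% of the remaining 80% as the validation set (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```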

### 6. *Model Selection*

#### *Model Selection Based on Relationships*

   - *Choosing Algorithms*:
     - Consider various regression algorithms like Linear Regression, Decision Trees, Random Forests, Gradient Boosting, etc.
     - Scatter plots and histograms help visualize the relationships between the features and the target, confirming whether they are linear or non-linear, which influences which model to choose.
   - *Linear Relationship*:
     - *Model Choice*: If the relationship between the features and the target is primarily linear (e.g., as a feature increases, the target increases proportionally), models like **Linear Regression** are appropriate.
     - *Visualization*: Scatter plots showing a straight-line trend can indicate a linear relationship.
   - *Non-Linear Relationship*:
     - *Model Choice*: If the relationship is non-linear (e.g., quadratic, logarithmic), more complex models like **Polynomial Regression**, **Decision Trees**, or **Random Forests** may be needed.
     - *Visualization*: Scatter plots with curves, or patterns that can't be captured by a straight line, suggest a non-linear relationship.



   - *Baseline Model*:
     - Start with a simple model like Linear Regression to establish a baseline performance.
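
The effect of choosing a model that matches the shape of the relationship can be seen on synthetic data. In this sketch the target is purely quadratic (an assumption made for the demonstration), so a plain linear baseline fits poorly while a degree-2 polynomial regression fits almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a purely quadratic relationship
x_curve = np.linspace(-3, 3, 60).reshape(-1, 1)
y_curve = x_curve.ravel() ** 2

# Baseline: a straight line
linear = LinearRegression().fit(x_curve, y_curve)
# Polynomial regression: linear regression on [1, x, x^2]
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x_curve, y_curve)

# R-squared on the same data, only to compare how well each shape fits
linear_r2 = linear.score(x_curve, y_curve)
poly_r2 = poly.score(x_curve, y_curve)
print(f"Linear R^2: {linear_r2:.3f}, Polynomial R^2: {poly_r2:.3f}")
```
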

### 7. *Model Training*
   - *Fit the Model*:
     - Train the chosen model on the training data using `.fit()`.

   - *Hyperparameter Tuning*:
     - Use techniques like Grid Search or Random Search to find the best hyperparameters.
     - Use scikit-learn's `GridSearchCV` or `RandomizedSearchCV`.
     - Evaluate the model on the validation set during tuning to select the best model.

### 8. *Model Evaluation*
   - *Cross-Validation*:
     - Perform k-fold cross-validation to assess the model's generalization ability.
     - Use scikit-learn's `cross_val_score()`.

### 9. *Model Improvement*
   - *Feature Selection*:
     - Identify and select the most important features to reduce model complexity.
     - Use techniques like Recursive Feature Elimination (RFE) or Feature Importance.
   - *Model Ensemble*:
     - Consider combining multiple models to improve performance (e.g., using techniques like bagging or boosting).
   - *Hyperparameter Re-tuning*:
     - Revisit hyperparameters and tune them further if needed.
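
Recursive Feature Elimination can be sketched with scikit-learn's `RFE`. The data is synthetic, and by construction only features 0 and 2 affect the target, so those are the ones expected to survive.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(200, 5))
# Only features 0 and 2 influence the synthetic target
y_feat = 4 * X_feat[:, 0] - 3 * X_feat[:, 2] + 0.1 * rng.normal(size=200)

# Repeatedly fit the model and drop the feature with the smallest coefficient
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X_feat, y_feat)
selected = selector.support_.tolist()
print("Selected features:", selected)
```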

### 10. *Final Model Evaluation*
   - *Test Set Evaluation*:
     - Evaluate the final model on the test set to get an unbiased estimate of model performance.
     - This is the first and only time the test set should be used.
   - *Final Metrics*:
     - Report final evaluation metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
     - Use scikit-learn's `mean_absolute_error()`, `mean_squared_error()`, and `r2_score()` to calculate these metrics on the test set.
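
As a small worked example of these three metrics, with hypothetical true and predicted prices:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted house prices (in thousands)
true_prices = [200, 300, 400, 500]
predicted_prices = [210, 290, 420, 480]

mae = mean_absolute_error(true_prices, predicted_prices)  # mean of |error| = 15.0
mse = mean_squared_error(true_prices, predicted_prices)   # mean of error^2 = 250.0
r2 = r2_score(true_prices, predicted_prices)              # 1 - 1000 / 50000 = 0.98

print(f"MAE: {mae}, MSE: {mse}, R-squared: {r2}")
```

MAE is in the same units as the price, while MSE penalizes large errors more heavily because the errors are squared.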


### 11. *Model Deployment*
   - *Model Saving*:
     - Save the trained model using joblib or pickle.
   - *Integration*:
     - Integrate the model into a production environment (e.g., a web application, API).
   - *Monitoring*:
     - Set up monitoring to track model performance over time and retrain the model as necessary.
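
Saving and reloading a model with joblib might look like the sketch below. The placeholder model, the temporary directory, and the file name `house_price_model.joblib` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small placeholder model
X_small = np.array([[1.0], [2.0], [3.0], [4.0]])
y_small = np.array([2.0, 4.0, 6.0, 8.0])
trained = LinearRegression().fit(X_small, y_small)

# Save to disk, then load it back as a serving process would
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.joblib")  # hypothetical name
joblib.dump(trained, model_path)
restored = joblib.load(model_path)

same_predictions = bool(np.allclose(trained.predict(X_small), restored.predict(X_small)))
print("Predictions match after reload:", same_predictions)
```

Only load model files from sources you trust, since joblib and pickle files can execute arbitrary code when loaded.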

### 12. *Documentation and Reporting*
   - *Document Process*:
     - Document the entire process, including data exploration, preprocessing steps, model selection, and evaluation metrics.
   - *Reporting*:
     - Create a report or presentation summarizing the findings and model performance.
     - Share the results with stakeholders.

### 13. *Continuous Learning*
   - *Model Maintenance*:
     - Periodically retrain the model with new data to ensure it stays accurate.
   - *Feedback Loop*:
     - Incorporate feedback from users and update the model accordingly.

This detailed workflow ensures that you methodically approach the problem and build a robust model to predict house prices.

## Conclusion

In this notebook, we covered the basics of Machine Learning, its types, reasons for its usage, and the detailed workflow involved in building, evaluating, tuning, and deploying a machine learning model. By following these steps and understanding each component, you should have a solid foundation in Machine Learning.


## References

- Books: "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
- Online Tutorials: [Scikit-Learn Documentation](https://scikit-learn.org/stable/user_guide.html)
- Articles: [Towards Data Science](https://towardsdatascience.com/)
