# Machine Learning 101: How to start?

So let's get started with some hands-on machine learning!


First things first: let's install some packages. Select the next code block and press Shift+Enter to execute it.

We are going to use Python. Most of the code blocks are pre-filled, so no worries if you're not a Python expert. If you have any questions, don't hesitate to ask!


### Step 1: Installing and Importing Libraries

In [None]:
pip install pandas numpy matplotlib scikit-learn jupyter

Once you've installed these packages, you're ready to start working with machine learning in Jupyter Notebook.

Let's start by importing the packages we'll be using:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


### Step 2: Loading the dataset

Next, let's load a dataset. For this example, we'll use the California Housing dataset from scikit-learn. All of its features are numerical, so we don't need to encode any categorical data.

You can load the dataset using the following code:

In [None]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing(as_frame=True)
dataFrame = pd.DataFrame(california.data, columns=california.feature_names)
dataFrame['prices'] = california.target

### Step 3: Exploratory Data Analysis (EDA)

Before diving into machine learning, it's important to understand the data.
Let's take a look at the first few rows of the dataset:

In [None]:
dataFrame.head()

Some datasets come with a description. Let's get the description of this dataset:


In [None]:
print(california.DESCR)

Now we want some more information about the statistics on this dataframe.

By calling `.describe()`, you get a summary table that provides an overview of the distribution and statistical properties of your dataset's numerical columns. It is useful for quickly understanding the range, spread, and central tendency of the data, as well as identifying potential outliers or unusual patterns.

In [None]:
dataFrame.describe()

Explanation of the rows in the summary table:

*   count: The number of non-missing values in the column.
*   mean: The average value of the column.
*   std: The standard deviation of the column, which measures the spread or dispersion of the values.
*   min: The minimum value in the column.
*   25%, 50%, 75%: The quartiles of the column. The 25th percentile (1st quartile) represents the value below which 25% of the data falls, the 50th percentile (2nd quartile) represents the median, and the 75th percentile (3rd quartile) represents the value below which 75% of the data falls.
*   max: The maximum value in the column.
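To see how the summary table is laid out, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its column names are assumptions for illustration only, not part of the California Housing data.

```python
import pandas as pd

# Tiny made-up table (not the California Housing data), just to show the layout.
toy = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 8],
    "age": [10, 20, 30, 40, 50],
})

summary = toy.describe()
print(summary)

# Each statistic is a row label and each feature is a column,
# so a single value is looked up with .loc[statistic, column].
median_rooms = summary.loc["50%", "rooms"]
print("Median rooms:", median_rooms)
```

The same `.loc` lookup should work on the output of `dataFrame.describe()` with the real column names.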

### Step 4: Preparing the data
Before training a machine learning model, we need to prepare the data by separating it into input features (usually called `X`, here `housing`) and the target variable (usually called `y`, here `pricing`).

Let's separate the features from the target:

In [None]:
housing = dataFrame.drop('prices', axis=1)
pricing = dataFrame['prices']

### Step 5: Splitting the Data
To evaluate the performance of our model, we'll split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.


In [None]:
housing_train, housing_test, pricing_train, pricing_test = train_test_split(housing, pricing, test_size=0.2, random_state=42)

In this case, the dataset housing and pricing will be split into training and testing sets, where 20% of the data will be reserved for testing (test_size=0.2). The random_state is set to 42, ensuring that the data is shuffled and split in a consistent manner across different runs of the code.

Using a fixed random_state value helps in achieving reproducibility, as the same split will be generated each time the code is executed with the same random_state value. This is beneficial for sharing and comparing results, especially when fine-tuning models or conducting experiments.
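The effect of a fixed `random_state` can be checked directly. The sketch below uses ten made-up samples (the arrays `X` and `y` are assumptions, not the housing data) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with a single feature each.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Two calls with the same random_state give identical splits.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

same_split = np.array_equal(X_test_a, X_test_b)
print("Test rows (run A):", X_test_a.ravel())
print("Test rows (run B):", X_test_b.ravel())
print("Identical:", same_split)
print("Test set size:", len(X_test_a))
```

Changing one of the two seeds should, in general, produce a different set of test rows.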

We will also need to scale our data.

Check if the features in your dataset are on different scales. If so, consider applying feature scaling techniques such as standardization or normalization to bring all features to a similar scale. This can help some models converge faster and can improve their performance, especially models trained with gradient descent or with a regularization penalty. `StandardScaler()` transforms the data by scaling each feature to have zero mean and unit variance, ensuring all features are on a similar scale.

We use `fit_transform()` on the training data to learn the scaling parameters (each feature's mean and standard deviation) and apply them in one step. Note that it is the scaler, not the model, that learns from the training data here.

On the other hand, we use `transform()` on the testing data to apply the same learned transformations without relearning them, ensuring consistency between the training and testing datasets and enabling fair evaluation of the model's performance on unseen data. This way, the testing data undergoes the same preprocessing steps as the training data, maintaining the integrity of the feature scaling or other transformations.
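To make the difference between `fit_transform()` and `transform()` concrete, here is a minimal sketch on made-up numbers. The arrays `train` and `test` are assumptions for illustration, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers: one feature, three training rows and two test rows.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print("Learned mean:", scaler.mean_)
print("Learned std:", scaler.scale_)
print("Scaled train:", train_scaled.ravel())
print("Scaled test:", test_scaled.ravel())
```

The scaled training data has zero mean and unit variance by construction. The scaled test data generally does not, because it is shifted and scaled with the training statistics.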

In [None]:
scaler = StandardScaler()
housing_train_scaled = scaler.fit_transform(housing_train)
housing_test_scaled = scaler.transform(housing_test)

### Step 6: Training the model
Now we can create our linear regression model:

In [None]:
model = LinearRegression()
model.fit(housing_train_scaled, pricing_train)

### Step 7: Making Predictions
Once the model is trained, we can use it to make predictions on new data. Let's make predictions on the testing data.

In [None]:
pricing_pred = model.predict(housing_test_scaled)

### Step 8: Evaluating the Model
Finally, let's evaluate our model using the mean squared error:

In [None]:
mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

**The meaning of mean squared error**

The mean squared error (MSE) is a commonly used metric to evaluate the performance of regression models. It measures the average squared difference between the predicted values and the actual values. The formula for calculating MSE is:

MSE = (1/n) * Σ(yᵢ - ŷᵢ)²

where:

*   n is the number of samples in the dataset
*   yᵢ is the actual (true) value for the i-th sample
*   ŷᵢ is the predicted value for the i-th sample


MSE provides a measure of how close the predicted values are to the true values on average. It calculates the average squared deviation, which means that larger deviations from the true values are penalized more.

A lower value of MSE indicates better performance, as it means that the predicted values are closer to the true values. Conversely, a higher value of MSE indicates larger errors and greater discrepancy between the predicted and actual values.
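The calculation can be reproduced by hand. This sketch uses made-up actual and predicted values (the arrays `actual` and `predicted` are assumptions) and compares the manual result with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only.
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

# Square each error, then average.
squared_errors = (actual - predicted) ** 2
mse_manual = squared_errors.mean()
mse_sklearn = mean_squared_error(actual, predicted)

# The square root (RMSE) brings the error back to the target's own units.
rmse = np.sqrt(mse_manual)

print("Squared errors:", squared_errors)
print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mse_sklearn)
print("RMSE:         ", rmse)
```

The single error of 2 contributes 4 to the sum, as much as four errors of 1 would, which is how larger deviations get penalized more.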

It's important to note that whether an MSE value is "good" or "bad" depends on the context of the problem. MSE is expressed in the squared units of the target variable. For example, if the target variable represents house prices in thousands of dollars, an MSE of 10,000 is measured in (thousands of dollars)², so its square root, the root mean squared error (RMSE), is 100: a typical error of about $100,000, which might be considered high. In the California Housing dataset the target is the median house value in units of $100,000, so an MSE of 0.5 corresponds to an RMSE of about 0.71, or roughly $71,000. It's also crucial to compare the MSE with other models or with a baseline to determine the relative performance.

In summary, a lower MSE indicates better performance, but the interpretation of what constitutes a "good" MSE value depends on the specific context and should be considered relative to other models or benchmarks.
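One simple benchmark is a model that always predicts the mean of the training targets. The sketch below compares such a baseline, using scikit-learn's `DummyRegressor`, with linear regression on synthetic data. The data and coefficients are assumptions chosen for illustration, not the housing data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
model_mse = mean_squared_error(y_test, model.predict(X_test))

print(f"Baseline MSE: {baseline_mse:.2f}")
print(f"Model MSE:    {model_mse:.2f}")
```

A model whose MSE is not clearly below the baseline has learned little from the features, whatever the absolute number looks like.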

### Step 9: Visualize the predictions
Finally, let's visualize the predicted values compared to the actual values.

In [None]:
plt.scatter(pricing_test, pricing_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()

Congrats, the basic model is in place! You should now have a basic understanding of how to work with the California Housing dataset and build a machine learning model to predict housing prices.

## Improving the model
Now that we have a basic model in place, let's explore ways to improve its
performance.



### 1. Feature Engineering
Consider exploring the dataset and creating additional features based on domain knowledge or feature interactions. This can provide the model with more relevant input and potentially improve its predictive power.

⚠ This can also make your model much worse. Keep the following in mind:

1.   Cost-benefit tradeoff: Feature engineering may not offer substantial gains relative to the effort invested.
2.   Feature sufficiency: Existing features might already provide enough information for accurate predictions, reducing the need for further engineering.
3.   Overfitting risk: Care must be taken to avoid overfitting when introducing new features, as it can lead to poor generalization and performance degradation on unseen data.




In [None]:
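# Example: creating a ratio feature from two existing columns.
# AveRooms and AveOccup are per-household averages, so their ratio is rooms per person.
dataFrame['rooms_per_person'] = dataFrame['AveRooms'] / dataFrame['AveOccup']
# Re-run Steps 4 and 5 afterwards so the new column reaches the model.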


### 2. Hyperparameter Tuning
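Experiment with different hyperparameter values. Plain linear regression has no hyperparameters worth tuning, so this example switches to Ridge regression, a linear model with an L2 penalty whose strength is set by the hyperparameter `alpha`. Hyperparameters control the behavior of the model and can significantly impact its performance. Use techniques like grid search or random search to find the best combination of hyperparameter values.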



In [None]:
# Example: Grid Search with Cross-Validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.1, 1.0, 10.0]}
model = Ridge()
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(housing_train_scaled, pricing_train)

best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

### 3. Model Selection
Try different algorithms or models and compare their performance. Linear regression is just one algorithm for regression tasks. Explore other algorithms such as decision trees, random forests, support vector regression, or gradient boosting. Each algorithm has its own strengths and weaknesses, and it's worth experimenting with different models to find the best one for your problem.


In [None]:
# Example: Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(housing_train_scaled, pricing_train)
pricing_pred = model.predict(housing_test_scaled)
mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Example 2: Gradient Boosting Regression (GBR)
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(housing_train_scaled, pricing_train)
pricing_pred = model.predict(housing_test_scaled)
mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Try a different model, such as Support Vector Regression (SVR)
from sklearn.svm import SVR

### 4. Handling Outliers
Investigate if there are any outliers in your dataset and consider how to handle them. Outliers can have a significant impact on the model's performance. Depending on the situation, you might choose to remove outliers, apply transformations, or use robust models that are less sensitive to outliers.

In [None]:
# Example: Winsorization
from scipy.stats.mstats import winsorize

dataFrame['prices'] = winsorize(dataFrame['prices'], limits=[0.05, 0.05])
# Try to find out the correct limits for this dataFrame.

In this example, the limits are set at the 5th and 95th percentiles. This means that values below the 5th percentile are replaced with the value at the 5th percentile, and values above the 95th percentile are replaced with the value at the 95th percentile.
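The effect is easiest to see on a small array. This sketch winsorizes the made-up values 1 to 100 (the array `values` is an assumption, not the housing prices) and compares the extremes before and after.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up values 1..100, so the extremes are easy to spot.
values = np.arange(1, 101, dtype=float)

# Replace the lowest 5% and highest 5% of values with the nearest remaining value.
# winsorize returns a masked array, so convert it back to a plain ndarray.
clipped = np.asarray(winsorize(values, limits=[0.05, 0.05]))

print("Before: min =", values.min(), "max =", values.max())
print("After:  min =", clipped.min(), "max =", clipped.max())
print("Length unchanged:", len(clipped) == len(values))
```

Unlike dropping outliers, winsorization keeps every row and only pulls the extreme values inward.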

### 5. Cross-Validation
Instead of relying on a single train-test split, consider using cross-validation techniques to evaluate the model's performance. Cross-validation provides a more robust estimate of the model's performance by splitting the data into multiple train-test splits and evaluating the model on each split.


In [None]:
# Example: K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score

model = LinearRegression()
scores = cross_val_score(model, housing, pricing, cv=5, scoring='neg_mean_squared_error')
mse_scores = -scores

When cv is set to 5, it means that the dataset will be split into 5 equal-sized folds or subsets. The cross-validation process will then be performed 5 times, where each time, one of the folds will be used as the validation set, and the remaining 4 folds will be used as the training set. This allows for comprehensive evaluation of the model's performance by rotating the validation set across different portions of the data.
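The rotation of the validation fold can be made visible with scikit-learn's `KFold`, which is the splitter `cross_val_score` uses for a regressor when `cv` is an integer. The ten-sample array `X` below is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples, just to make the fold boundaries visible.
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)  # no shuffling, so folds are consecutive blocks
validation_folds = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    validation_folds.append(val_idx.tolist())
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Every sample lands in the validation set exactly once, and the model is never validated on data it was trained on in that round.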

### 6. Collect More Data
If feasible, collecting more data can often help improve the model's performance. More data provides the model with more examples to learn from, reduces overfitting, and helps capture the underlying patterns in the data more effectively.
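Before collecting more data, it can help to estimate whether more data would change much. One way is a learning curve, which trains the model on growing subsets and tracks the validation error. This is a rough sketch on synthetic data: the arrays `X` and `y` and the chosen sizes are assumptions, not the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic data (an assumption for this sketch): linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Train on 10%, 50% and 100% of the available training portion of each fold.
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    X,
    y,
    train_sizes=[0.1, 0.5, 1.0],
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE values, so flip the sign and average over the folds.
val_mse = -val_scores.mean(axis=1)
for size, mse in zip(train_sizes, val_mse):
    print(f"{size} training samples: validation MSE = {mse:.3f}")
```

If the validation error has already flattened out at the largest size, more data of the same kind is unlikely to help much, and a different model or better features may be the better investment.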


### 7. Regularization Techniques
Consider applying regularization techniques such as L1 or L2 regularization to the linear regression model. Regularization helps prevent overfitting by adding a penalty term to the model's loss function. It encourages the model to have simpler and smoother solutions, reducing the risk of overfitting the training data.


In [None]:
# Example: Lasso Regression
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(housing_train_scaled, pricing_train)

In Lasso regression, the `alpha` parameter controls the strength of the regularization penalty applied to the model. Whether `alpha=0.1` counts as weak or strong regularization depends on the scale of the features and the target, so treat it as a starting point rather than a recommended value.

Lasso regression includes a term in the loss function that adds the sum of the absolute values of the coefficients multiplied by alpha to the ordinary least squares loss. This penalty encourages the model to shrink the coefficients towards zero, promoting sparsity and feature selection.

Increasing `alpha` strengthens the regularization, potentially setting more coefficients to zero and producing a sparser model. Note that `alpha` is not capped at 1: any non-negative value is allowed. Conversely, reducing `alpha` towards 0 weakens the regularization, allowing the model to fit the data more closely. At `alpha=0` the penalty vanishes and the model reduces to ordinary least squares. The choice of `alpha` depends on the specific dataset and the trade-off between bias and variance in the model.
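The link between `alpha` and sparsity can be observed by fitting Lasso with several values and counting the zero coefficients. The sketch below uses synthetic data in which only two of five features matter. The data and the list of `alpha` values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for this sketch): only the first two of
# five features actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count how many coefficients the penalty has pushed to exactly zero.
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: coefficients={np.round(model.coef_, 2)}, "
          f"zeros={zero_counts[alpha]}")
```

With a large enough `alpha`, every coefficient is driven to zero and the model predicts a constant, which is why `alpha` is usually chosen with cross-validation rather than by hand.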

## Conclusion
Remember that improving the model's performance is an iterative process. It requires experimentation, analysis of results, and fine-tuning of various components. Iterate on these steps, try different approaches, and carefully evaluate the impact on the model's performance.

Congratulations on completing this Machine Learning 101 tutorial! You've learned the fundamental steps of building and evaluating a machine learning model using Python and various libraries. Continue to explore and expand your knowledge in the exciting field of machine learning.

## Bonus

Edit the above code to make the mean squared error smaller.