# Machine Learning 101: How to start?

So let's get started with some hands-on machine learning!


First things first: let's install some packages. Select the next code block and press Shift+Enter to execute it.

We are going to use Python. Most of the code blocks are pre-filled, so no worries if you're not a Python expert. If you have any questions, don't hesitate to ask!


### Step 1: Importing Libraries

In [None]:
pip install pandas numpy matplotlib scikit-learn jupyter

Once you've installed these packages, you're ready to start working with machine learning in Jupyter Notebook.

Let's start by importing the packages we'll be using:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


### Step 2: Loading the dataset

Next, let's load a dataset. We will start with tabular (table-based) data. For this example, we'll use the California Housing dataset from scikit-learn.

You can load the dataset using the following code:

In [None]:
from sklearn.datasets import fetch_california_housing

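# as_frame=True returns the features as a pandas DataFrame
california = fetch_california_housing(as_frame=True)
dataFrame = pd.DataFrame(california.data, columns=california.feature_names)
# The target is the median house value for California districts, in units of $100,000
dataFrame['prices'] = california.target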

### Step 3: Exploratory Data Analysis (EDA)

Before diving into machine learning, it's important to understand the data.
Let's take a look at the first few rows of the dataset:

In [None]:
dataFrame.head()

Some datasets come with a description. Let's get the description of this dataset:


In [None]:
print(california.DESCR)

Next, let's look at some summary statistics for this DataFrame.

By calling `.describe()`, you get a summary table that provides an overview of the distribution and statistical properties of your dataset's numerical columns. It is useful for quickly understanding the range, spread, and central tendency of the data, as well as identifying potential outliers or unusual patterns.

In [None]:
dataFrame.describe()

Explanation of the rows in the summary table:

*   count: The number of non-missing values in the column.
*   mean: The average value of the column.
*   std: The standard deviation of the column, which measures the spread or dispersion of the values.
*   min: The minimum value in the column.
*   25%, 50%, 75%: The quartiles of the column. The 25th percentile (1st quartile) represents the value below which 25% of the data falls, the 50th percentile (2nd quartile) represents the median, and the 75th percentile (3rd quartile) represents the value below which 75% of the data falls.
*   max: The maximum value in the column.
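
To make these statistics concrete, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` frame and its `rooms` column are hypothetical and are not part of the California Housing dataset; they are chosen only so that every value is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy column, chosen so the statistics are easy to check by hand
toy = pd.DataFrame({"rooms": [1.0, 2.0, 3.0, 4.0, 5.0]})

# describe() returns one row per statistic and one column per numeric column
summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label
median_rooms = summary.loc["50%", "rooms"]
# Interquartile range: the spread of the middle 50% of the data
iqr_rooms = summary.loc["75%", "rooms"] - summary.loc["25%", "rooms"]
print(median_rooms, iqr_rooms)
```

The same `.loc` lookups should work on `dataFrame.describe()` with any of the housing column names.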

### Step 4: Preparing the data
Before training a machine learning model, we need to prepare the data by separating it into input features (usually called `X`, here `housing`) and the target variable (usually called `y`, here `pricing`).

Let's separate the features from the target:

In [None]:
housing = dataFrame.drop('prices', axis=1)
pricing = dataFrame['prices']

### Step 5: Splitting the Data
To evaluate the performance of our model, we'll split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance on unseen data.


In [None]:
housing_train, housing_test, pricing_train, pricing_test = train_test_split(housing, pricing, test_size=0.2, random_state=42)

In this case, `housing` and `pricing` are split into training and testing sets, with 20% of the data reserved for testing (`test_size=0.2`). Setting `random_state=42` ensures that the data is shuffled and split in the same way on every run of the code.

Using a fixed random_state value helps in achieving reproducibility, as the same split will be generated each time the code is executed with the same random_state value. This is beneficial for sharing and comparing results, especially when fine-tuning models or conducting experiments.
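
The reproducibility claim can be checked with a small sketch. The arrays `X_demo` and `y_demo` below are hypothetical toy data, used here instead of the housing data so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with one feature each
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

# Two separate calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# test_size=0.2 reserves 20% of the 20 samples for testing
print(len(X_test_a))

# Both calls select exactly the same test samples
print(np.array_equal(X_test_a, X_test_b))
```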

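We also need to scale our data so that every feature has zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training.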

In [None]:
scaler = StandardScaler()
housing_train_scaled = scaler.fit_transform(housing_train)
housing_test_scaled = scaler.transform(housing_test)
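
As a rough illustration of what `StandardScaler` does, the sketch below scales a small made-up array and reproduces the result by hand. The `X_train_demo` and `X_test_demo` arrays are hypothetical stand-ins for the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_demo = np.array([[2.5, 250.0]])

demo_scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)
# transform reuses the training statistics; nothing is learned from the test data
X_test_demo_scaled = demo_scaler.transform(X_test_demo)

# The same computation by hand: subtract the training mean, divide by the training std
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test_scaled = (X_test_demo - train_mean) / train_std

print(X_train_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_demo_scaled.std(axis=0))   # approximately 1 for each column
print(np.allclose(X_test_demo_scaled, manual_test_scaled))
```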

### Step 6: Training the model
Now we can create our linear regression model:

In [None]:
model = LinearRegression()
model.fit(housing_train_scaled, pricing_train)
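
To get a feel for what fitting does, here is a small sketch on synthetic data where the true relationship is known. The names `x_demo`, `y_line`, and `demo_model` are hypothetical and separate from the housing model above. After fitting, a scikit-learn `LinearRegression` stores the learned weights in `coef_` and the offset in `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data that follows y = 2x + 1 exactly
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 2.0 * x_demo.ravel() + 1.0

demo_model = LinearRegression()
demo_model.fit(x_demo, y_line)

# The fitted slope and intercept should recover the values used to build the data
print(demo_model.coef_)
print(demo_model.intercept_)
```

The housing model can be inspected the same way: `model.coef_` holds one weight per feature, in the order of the columns of `housing`.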

### Step 7: Making Predictions
Once the model is trained, we can use it to make predictions on new data. Let's make predictions on the testing data.

In [None]:
pricing_pred = model.predict(housing_test_scaled)

### Step 8: Evaluating the Model 
Next, let's evaluate our model using the mean squared error:

In [None]:
mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")

**The meaning of mean squared error**  

The mean squared error (MSE) is a commonly used metric to evaluate the performance of regression models. It measures the average squared difference between the predicted values and the actual values. The formula for calculating MSE is:

MSE = (1/n) * Σ(yᵢ - ŷᵢ)²

where:

*   n is the number of samples in the dataset
*   yᵢ is the actual (true) value for the i-th sample
*   ŷᵢ is the predicted value for the i-th sample


MSE provides a measure of how close the predicted values are to the true values on average. It calculates the average squared deviation, which means that larger deviations from the true values are penalized more.
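
The calculation can be sketched by hand on a few made-up numbers and compared with scikit-learn's result. The arrays `y_true_demo` and `y_pred_demo` are hypothetical values, not taken from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values
y_true_demo = np.array([3.0, 2.5, 4.0])
y_pred_demo = np.array([2.5, 3.0, 4.5])

# Each prediction is off by 0.5, so each squared error is 0.25
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# The square root (RMSE) brings the error back to the units of the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```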

A lower value of MSE indicates better performance, as it means that the predicted values are closer to the true values. Conversely, a higher value of MSE indicates larger errors and greater discrepancy between the predicted and actual values.

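It's important to note that whether an MSE value is "good" or "bad" depends on the context of the problem. MSE is expressed in the squared units of the target variable, so it is often easier to interpret its square root, the root mean squared error (RMSE), which has the same units as the target. For example, if the target variable represents house prices in thousands of dollars, an MSE of 10,000 corresponds to an RMSE of 100, meaning a typical error of about $100,000, which might be considered high. In this dataset the target is the median house value in units of $100,000, so an MSE of 0.5, for instance, would correspond to an RMSE of about 0.71, or roughly $71,000. It's also crucial to compare the MSE with other models or a baseline to judge relative performance.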
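
One simple way to establish a baseline is to compare against a model that always predicts the mean of the training targets. The sketch below does this with scikit-learn's `DummyRegressor` on synthetic data; the data and the names `X_syn`, `y_syn`, `baseline`, and `linear` are assumptions made for illustration, not part of the housing example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear trend plus some noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn.ravel() + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Baseline: ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_mse = mean_squared_error(y_te, baseline.predict(X_te))
linear_mse = mean_squared_error(y_te, linear.predict(X_te))

print(f"Baseline MSE: {baseline_mse:.2f}")
print(f"Linear regression MSE: {linear_mse:.2f}")
```

A model is only worth keeping if its MSE is clearly lower than that of such a baseline.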

In summary, a lower MSE indicates better performance, but the interpretation of what constitutes a "good" MSE value depends on the specific context and should be considered relative to other models or benchmarks.

### Step 9: Visualize the predictions
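Finally, let's visualize the predicted values compared to the actual values. The closer the points lie to the diagonal (where the predicted value equals the actual value), the better the predictions.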

In [None]:
plt.scatter(pricing_test, pricing_pred)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()

And that's it! You should now have a basic understanding of how to work with the California Housing dataset and build a machine learning model to predict housing prices.