# Machine Learning 101: How to start?

So let's get started with some hands-on machine learning!


First things first: let's install some packages. Select the next code block and press Shift+Enter to execute it.

We are going to use Python. Most of the code blocks will be pre-filled, so no worries if you're not a Python expert. If you have any questions, don't hesitate to ask!


In [3]:
pip install pandas numpy matplotlib scikit-learn jupyter

Note: you may need to restart the kernel to use updated packages.


Once you've installed these packages, you're ready to start working with machine learning in Jupyter Notebook.

Let's start by importing the packages we'll be using:

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Next, let's load a dataset. We will start with some tabular, numeric data. For this example, we'll use the California Housing dataset from scikit-learn.

You can load the dataset using the following code:

In [5]:
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()
df = pd.DataFrame(california.data, columns=california.feature_names)
df['prices'] = california.target

This code will load the California Housing dataset into a pandas DataFrame called df. The target variable, the median house value of each district in units of $100,000, is stored in the prices column.

Let's take a look at the first few rows of the dataset:

In [6]:
df.head()


The output should look like this:

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  prices
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22   3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24   3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25   3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25   3.422
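
Besides `head()`, a few other pandas calls give a quick feel for a DataFrame. The sketch below is only an illustration: `sample` is a hypothetical mini-frame with three columns and three rows typed in by hand from the output above, so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical mini-frame: three rows and three columns copied by hand
# from the df.head() output shown above.
sample = pd.DataFrame(
    {
        "MedInc": [8.3252, 8.3014, 7.2574],
        "HouseAge": [41.0, 21.0, 52.0],
        "prices": [4.526, 3.585, 3.521],
    }
)

print(sample.shape)         # (rows, columns) -> (3, 3)
print(sample.dtypes)        # data type of each column
print(sample.describe())    # count, mean, std, min, quartiles, max per column
print(sample.isna().sum())  # number of missing values per column
```

The same four calls should work on the full `df` from the previous cell.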


Now let's split the data into training and test sets:

In [8]:
housing = df.drop('prices', axis=1)
pricing = df['prices']

housing_train, housing_test, pricing_train, pricing_test = train_test_split(housing, pricing, test_size=0.2, random_state=42)
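
To see what `train_test_split` does with `test_size` and `random_state`, here is a minimal sketch on toy data. The arrays `features` and `target` are made up for illustration and are not part of the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows with 2 features each, one target per row.
# Row i is [2*i, 2*i + 1] and its target is i, so pairs are easy to check.
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Features and targets are shuffled together, so each row keeps its target.
print(np.array_equal(X_test[:, 0] // 2, y_test))  # True

# Using the same random_state again reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(np.array_equal(X_test, X_test_again))  # True
```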


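We will also scale our data so that every feature has zero mean and unit variance. The scaler is fitted on the training set only and then applied unchanged to the test set: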

In [9]:
scaler = StandardScaler()
housing_train_scaled = scaler.fit_transform(housing_train)
housing_test_scaled = scaler.transform(housing_test)
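
As a rough sketch of what `StandardScaler` computes, the example below repeats the transformation by hand on toy numbers (hypothetical values, one feature column). Note that the test row is scaled with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): four training rows, one test row.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same computation by hand: (x - mean) / std, using the training values.
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test_scaled = (test - mean) / std

print(np.allclose(test_scaled, manual_test_scaled))  # True
print(np.allclose(train_scaled.mean(axis=0), 0.0))   # True
```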

Now we can create our linear regression model:

In [10]:
lr = LinearRegression()
lr.fit(housing_train_scaled, pricing_train)
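
A fitted `LinearRegression` stores what it learned in `coef_` (one weight per feature) and `intercept_`. The sketch below uses synthetic, noise-free data generated from a known formula (an assumption made purely for illustration) so you can see the model recover the weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 2*x0 - 3*x1 + 1, without noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1

model = LinearRegression().fit(X, y)

# The learned weights should be very close to 2 and -3, the intercept to 1.
print(model.coef_)
print(model.intercept_)
```

For the housing model, `lr.coef_` and `lr.intercept_` hold the same kind of information. Something like `pd.Series(lr.coef_, index=housing.columns)` pairs each weight with its feature name.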

Let's make predictions on the test set:

In [11]:
pricing_pred = lr.predict(housing_test_scaled)
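
The scale, fit, and predict steps can also be chained into one object. This is an optional alternative, not what the cells above do: a minimal sketch using scikit-learn's `make_pipeline` on synthetic data (all arrays below are hypothetical), checked against the manual approach.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): three features, linear target.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(40, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 3.0
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# Manual approach: scale, fit, then scale the test set and predict.
scaler = StandardScaler()
manual_model = LinearRegression().fit(scaler.fit_transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))

# Pipeline approach: the scaler is fitted on the training data inside fit()
# and reapplied automatically inside predict().
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

print(np.allclose(manual_pred, pipeline_pred))  # True
```

A pipeline makes it harder to forget the scaling step or to fit the scaler on the test set by accident.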

Finally, let's evaluate our model using the mean squared error:

In [12]:
mse = mean_squared_error(pricing_test, pricing_pred)
print(f"Mean Squared Error: {mse:.2f}")


Mean Squared Error: 0.56
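
The mean squared error is the average of the squared differences between true and predicted values. As a small sketch with made-up numbers (the arrays below are hypothetical, not model output), the hand calculation matches `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values, for illustration only.
y_true = np.array([3.0, 2.5, 4.0, 1.0])
y_pred = np.array([2.5, 2.5, 3.0, 2.0])

# By hand: square each error, then take the average.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# The square root (RMSE) is in the same units as the target itself.
rmse = np.sqrt(mse_sklearn)

print(f"{mse_manual:.4f} {mse_sklearn:.4f}")  # 0.5625 0.5625
print(f"{rmse:.2f}")                          # 0.75
```

Applied to our result: the target is measured in units of $100,000, so an MSE of 0.56 corresponds to an RMSE of about 0.75, or a typical error of roughly $75,000.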


And that's it! You should now have a basic understanding of how to work with the California Housing dataset and build a machine learning model to predict housing prices.