# **Intro to Machine Learning**

## i) What is machine learning?
 - A subset of AI that enables systems to learn patterns from data & make predictions or decisions without being explicitly programmed.

### Types of machine learning
 - Supervised learning - the model learns from labeled data, e.g. classification, regression.
 - Unsupervised learning - the model learns patterns from unlabeled data, e.g. clustering, dimensionality reduction.
 - Reinforcement learning - the model learns through rewards & penalties in an environment.

## ii) Real-world applications of ML
 - Image recognition (face ID, medical imaging).
 - Natural language processing (chatbots, sentiment analysis).
 - Recommendation systems (Netflix, Amazon, Spotify).
 - Fraud detection (credit card fraud, cybersecurity).


## Data preprocessing and exploration
### Data cleaning
- Handling missing values (mean/mode imputation, or dropping the affected rows or columns).
- Removing duplicates and inconsistencies.
- Converting categorical variables into a numerical format (one-hot encoding, label encoding).
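
The cleaning steps above can be sketched with pandas. This is a minimal illustration on a made-up table; the column names `rooms` and `area` and all values are assumptions for the example, not part of any dataset used in these notes.

```python
import pandas as pd

# Hypothetical toy data (made up for illustration).
df = pd.DataFrame({
    "rooms": [3.0, None, 5.0, 5.0, 4.0],
    "area": ["urban", "rural", "urban", "urban", None],
})

# Remove exact duplicate rows (the third and fourth rows are identical).
df = df.drop_duplicates().reset_index(drop=True)

# Mean imputation for a numerical column.
df["rooms"] = df["rooms"].fillna(df["rooms"].mean())

# Mode imputation for a categorical column.
df["area"] = df["area"].fillna(df["area"].mode()[0])

# Label encoding: each category becomes an integer code.
label_encoded = df["area"].astype("category").cat.codes

# One-hot encoding: each category becomes its own indicator column.
one_hot = pd.get_dummies(df, columns=["area"])
print(one_hot)
```

Label encoding implies an ordering between categories, so one-hot encoding is usually the safer choice for linear models when the categories have no natural order.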
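### Feature engineering
- Creating new features from existing data.
- Normalization vs. standardization: normalization rescales a feature to a fixed range (usually 0 to 1), while standardization rescales it to zero mean and unit variance.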
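
To make the difference between normalization and standardization concrete, here is a small sketch with NumPy on made-up values. scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same two formulas column by column.

```python
import numpy as np

# Made-up feature values for illustration.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max scaling): rescale to the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation.
x_std = (x - x.mean()) / x.std()

print("normalized:  ", x_norm)
print("standardized:", x_std)
```

Normalization is sensitive to outliers because the minimum and maximum set the scale; standardization is often preferred when a feature has extreme values.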


### Exploratory data analysis (EDA)
- Summary statistics (mean, median, variance).
- Visualization (box plot, scatter plot).
- Identifying correlations (Pearson's correlation).
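
A short sketch of these EDA steps with pandas, again on a made-up table (the `rooms` and `price` columns and their values are assumptions for the example):

```python
import pandas as pd

# Hypothetical toy data (made up for illustration).
df = pd.DataFrame({
    "rooms": [4, 5, 6, 7, 8],
    "price": [15.0, 18.0, 24.0, 30.0, 33.0],
})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.describe()

# Sample variance of each column.
variance = df.var()

# Pearson correlation matrix between all numerical columns.
corr = df.corr(method="pearson")

print(summary)
print(variance)
print(corr)
```

For the visual part, `df.boxplot()` and `df.plot.scatter(x="rooms", y="price")` draw the two plot types listed above; both need matplotlib installed.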


## Supervised Learning
- A type of ML where the algorithm is trained on labeled data: each training example has an input (features) & a known output (label/target).
- The model learns the relationship between input & output so it can make predictions on unseen data.

### Types of supervised learning
- Regression (continuous output): used when the target variable is numerical or continuous, e.g. predicting house prices.
#### Common algorithms for regression
* Linear regression
* Polynomial regression
* Decision tree regression
* Random forest regression
* Support vector regression
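
Each algorithm in this list has a counterpart in scikit-learn. The sketch below fits all five to the same synthetic, deliberately non-linear data (generated here for illustration only) and prints the training R^2 of each. Scores on training data flatter flexible models, so this is a way to see the API, not a fair comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y is roughly x squared plus a little noise (an assumption for the demo).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=100)

models = {
    "linear": LinearRegression(),
    # Polynomial regression = polynomial features followed by linear regression.
    "polynomial (degree 2)": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "support vector": SVR(kernel="rbf"),
}

# Fit every model and record its R^2 on the training data.
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: training R^2 = {s:.2f}")
```

A straight line cannot follow a curved relationship, which is why plain linear regression scores far below the degree-2 polynomial here.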

#### Model evaluation metrics
 - Mean squared error (MSE)
 - Root mean squared error (RMSE)
 - R^2 score
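
These three metrics are simple enough to compute by hand, which helps when interpreting them later. A minimal sketch with NumPy on made-up numbers:

```python
import numpy as np

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of the squared errors.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, back in the units of the target.
rmse = np.sqrt(mse)

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

R^2 compares the model against the simplest possible baseline, always predicting the mean of the target: 0 means no better than that baseline, 1 means perfect predictions.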

### Training process in supervised learning
1) Collect & preprocess data (handle missing values, encode categorical variables, normalize features).
2) Split data into training & testing sets (typically 70% to 80% for training, 20% to 30% for testing).
3) Choose a suitable model based on the problem type (classification or regression).
4) Train the model by feeding it input-output pairs & optimizing its weights.
5) Evaluate model performance using metrics (e.g. accuracy and precision for classification, RMSE for regression).
6) Tune hyperparameters to improve model performance.
7) Deploy the model & monitor its real-world performance.
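
Steps 1 to 5 can be sketched end to end with scikit-learn. The data below is synthetic (generated for the example), and the choice of `StandardScaler` plus `LinearRegression` is just one reasonable setup, not the only one. Tuning and deployment (steps 6 and 7) are left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: collect data (synthetic here: 200 rows, 3 features, linear target plus noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 2: split into 80% training and 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 3 and 4: choose a model and train it. The pipeline learns the scaling
# from the training data only, then fits the regression on the scaled features.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"Test RMSE: {rmse:.3f}")
```

Putting the scaler inside the pipeline keeps information from the test set out of the preprocessing step.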


# Regression
## Hands-on implementation in Python

#### Boston Housing dataset description
The Boston Housing dataset is a classic ML dataset used for regression tasks.
It contains 506 observations (rows) and 14 variables (columns): 13 features describing neighbourhoods in Boston, plus the target MEDV, the median house value in thousands of dollars.

#### Purpose of using this dataset
We will use linear regression to predict house prices (MEDV) based on these features.
The goal is to:
- Understand how different factors (e.g. crime rate, number of rooms, and tax rate) influence housing prices.
- Build a linear regression model to predict house prices given neighbourhood characteristics.
- Evaluate the model's accuracy using mean squared error and the R^2 score.
- Visualize actual vs. predicted house prices to assess model performance.

In [None]:
# importing libraries
import numpy as np
import pandas as pd


In [None]:
# importing data
from google.colab import files
uploaded= files.upload()

Saving boston.csv to boston.csv


In [None]:
# reading the dataset
boston=pd.read_csv('boston.csv')

In [None]:
# viewing the head
boston.head()

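| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |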


In [None]:
# column names, data types and non-null counts
boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [None]:
# summary statistics
boston.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


In [None]:
# checking for missing values
boston.isnull().sum()

Unnamed: 0,0
CRIM,0
ZN,0
INDUS,0
CHAS,0
NOX,0
RM,0
AGE,0
DIS,0
RAD,0
TAX,0


In [None]:
# define X (features) & y (target)
# features: every column except the target
X = boston.drop('MEDV', axis=1)
# the target variable
y = boston['MEDV']


## Splitting the dataset for training & testing

In [None]:
# importing train_test_split from scikit-learn
from sklearn.model_selection import train_test_split
# hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data: {X_train.shape}")
print(f"Testing data: {X_test.shape}")

Training data: (404, 13)
Testing data: (102, 13)


In [None]:
# importing the linear regression class
from sklearn.linear_model import LinearRegression

In [None]:
# initialize the model
model = LinearRegression()

# fit/train the model
model.fit(X_train, y_train)
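
For linear regression, fitting means estimating one coefficient per feature plus an intercept, and predicting means taking the weighted sum of the features with those coefficients. The sketch below shows this on made-up, noise-free data so the learned values are easy to check; on the housing data the same attributes (`coef_`, `intercept_`) are available on the fitted model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3*x1 - 2*x2 + 5 exactly (no noise).
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# Learned parameters: one coefficient per feature, plus the intercept.
print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)

# predict() is equivalent to this weighted sum.
manual_pred = X_demo @ demo_model.coef_ + demo_model.intercept_
```

The names `X_demo`, `y_demo` and `demo_model` are chosen here so the example does not overwrite the notebook's own `X`, `y` and `model`.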

### Making predictions
After training the model, we now make predictions on the test set.

In [None]:
# prediction on test set
y_pred=model.predict(X_test)

# comparing actual vs predicted
comparison=pd.DataFrame({"Actual":y_test, "Predicted":y_pred})
print(comparison.head())

     Actual  Predicted
173    23.6  28.996724
274    32.4  36.025565
491    13.6  14.816944
72     22.8  25.031979
452    16.1  18.769880


### Evaluating the model
To measure performance we use MSE, RMSE, and the R^2 score.

In [None]:
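from sklearn.metrics import mean_squared_error, r2_score

# MSE: average squared difference between actual and predicted values
mse = mean_squared_error(y_test, y_pred)
# RMSE: square root of MSE, in the same units as the target
rmse = mse ** 0.5
# R^2: share of the target's variance explained by the model
r2 = r2_score(y_test, y_pred)

# .6f prints six decimal places
print(f"Mean squared error(MSE):{mse:.6f}")
print(f"Root mean squared error(RMSE):{rmse:.6f}")
print(f"R^2 score:{r2:.6f}")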



Mean squared error(MSE):24.291119
Root mean squared error(RMSE):4.928602
R^2 score:0.668759


### Evaluating model performance
1. Mean squared error (MSE): 24.291119
- Measures the average squared difference between the actual & predicted prices; lower is better.
- Because the target (MEDV) is in thousands of dollars, MSE is in squared thousands of dollars, so 24.29 cannot be read directly as a dollar amount.

2. Root mean squared error (RMSE): 4.928602
- RMSE is the square root of MSE, which puts it back in the units of the target and makes it easier to interpret.
- The typical prediction error is around 4.93 thousand dollars, i.e. about $4,930.

3. R^2 score: 0.668759
- This indicates how much of the variance in house prices is explained by our model.
- R^2 = 0.67 means about 67% of the price variation in the test set is explained by the features.
- A perfect model would have R^2 = 1.0; for real-world data, 0.67 is decent but not great.
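
The interpretation above is plain arithmetic on the reported numbers, sketched below. The baseline figure is derived from the identity R^2 = 1 - MSE / (variance of the test targets), so it is an inference from the two reported metrics, not a value printed by the notebook.

```python
# Metrics reported above (MEDV is in thousands of dollars).
mse = 24.291119
r2 = 0.668759

# RMSE is in thousands of dollars; multiply by 1000 for dollars.
rmse = mse ** 0.5
rmse_dollars = rmse * 1000

# Share of the price variance the model does not explain.
unexplained = 1 - r2

# MSE of a baseline that always predicts the mean test price,
# derived from R^2 = 1 - mse / baseline_mse.
baseline_mse = mse / (1 - r2)
baseline_rmse = baseline_mse ** 0.5

print(f"RMSE: about ${rmse_dollars:,.0f}")
print(f"Unexplained variance: {unexplained:.0%}")
print(f"Baseline RMSE (always predict the mean): {baseline_rmse:.2f} thousand dollars")
```

Comparing the model's RMSE with the baseline RMSE gives a more tangible sense of what an R^2 of 0.67 buys than the percentage alone.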