## Choosing the right estimator/algorithm for your problem

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# Choosing Datasets 

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [4]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [5]:
housing['feature_names']

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [6]:
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [7]:
housing_df['target'] = housing["target"]
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [8]:
housing_df.shape

(20640, 9)

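- A Ridge model is a linear regression model that adds an L2 penalty on the size of its coefficients. The penalty strength is set by `alpha` (default `1.0`); it shrinks the coefficients toward zero, which helps reduce overfitting, particularly when features are correlated.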
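To make the effect of the penalty concrete, here is a minimal sketch on synthetic data (not the housing dataset). The data, the `alpha` values, and the variable names are assumptions chosen for illustration; it compares the size of the coefficient vector from plain `LinearRegression` with `Ridge` at increasing `alpha`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration): 3 features,
# where the third is almost a copy of the first (strong correlation).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + rng.normal(scale=0.01, size=50)
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

# Ordinary least squares: no penalty on the coefficients
ols = LinearRegression().fit(X, y)
ols_norm = np.linalg.norm(ols.coef_)

# Ridge: a larger alpha means a stronger penalty and smaller coefficients
ridge_norms = {}
for alpha in (0.1, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    ridge_norms[alpha] = np.linalg.norm(ridge.coef_)

print("OLS coefficient norm:", ols_norm)
for alpha, norm in ridge_norms.items():
    print(f"Ridge(alpha={alpha}) coefficient norm:", norm)
```

The coefficient norm should fall as `alpha` grows, which is the shrinkage the penalty is designed to produce.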

In [9]:
# Import algorithm/estimator
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Create the data
x = housing_df.drop('target', axis=1)
y = housing_df['target']

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(x_train, y_train)

# Check the score of the model (on the test set)
model.score(x_test, y_test)

0.5758549611440126

## Key Points

- **Next Step**: Try an **ensemble regressor**, since the Ridge score of about 0.58 leaves room for improvement.
- **Ensemble Methods**: Combine the predictions of multiple models into a single prediction.
- **Common Method**: **Random Forest**.
  - Known for fast training and prediction times.
  - Adaptable to different problems.
  - Combines multiple random decision trees.
  - Makes predictions by averaging results from the decision trees.
- **Model for Regression**: Use **Scikit-Learn's `RandomForestRegressor`**.
- **Workflow**: Follow the same workflow as before and change only the model.
- **Further Reading**: ["An Implementation and Explanation of the Random Forest in Python" by Will Koehrsen](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76).
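
The averaging step can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption, not the housing data) and compares the forest's prediction with the mean of the predictions from the individual trees stored in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# A small forest so the per-tree predictions are easy to inspect
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Predictions from each individual tree, shape (n_trees, n_samples)
per_tree = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Average across trees and compare with the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_preds = forest.predict(X[:5])

print(np.allclose(manual_average, forest_preds))
```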


In [10]:
# Import algorithm/estimator
from sklearn.ensemble import RandomForestRegressor

# Setup random seed (np.random.seed fixes the split; np.random.randint only draws a number)
np.random.seed(42)

# Split into features (x) and target (y)
x = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = RandomForestRegressor()
model.fit(x_train, y_train)

# Check the score of the model (on the test set); the exact value depends on the seed and library version
model.score(x_test, y_test)

0.793008226748187

## 2.2 Picking a machine learning model for a classification problem

In [11]:
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

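| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |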


In [12]:
len(heart_disease)

303

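Following the Scikit-Learn estimator cheat-sheet, we end up at `LinearSVC`, which stands for Linear Support Vector Classifier. The code in `In [13]` fits a `RandomForestClassifier`, an ensemble method, rather than `LinearSVC`.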
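
For reference, fitting `LinearSVC` follows the same split/fit/score workflow. This is a minimal sketch on synthetic data from `make_classification`, not on `heart-disease.csv`; the `StandardScaler` step and `max_iter=10000` are assumptions added because linear SVMs tend to converge more reliably on scaled features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data (an assumption for illustration):
# 300 samples and 13 features, roughly the shape of the heart disease table
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit the linear support vector classifier
svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
svc.fit(X_train, y_train)

# Mean accuracy on the test set
svc_score = svc.score(X_test, y_test)
print(svc_score)
```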

In [13]:
# Import the RandomForestClassifier model class from the ensemble module
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Split the data into X (features/data) and y (target/labels)
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
clf = RandomForestClassifier(n_estimators=1000) # 100 is the default, but you could try 1000 and see what happens
clf.fit(X_train, y_train)

# Check the score of the model (on the test set)
clf.score(X_test, y_test)

0.8688524590163934

## 3.1 Fitting a model to data
In Scikit-Learn, a machine learning model learns patterns from a dataset when you call its `fit()` method and pass it data, for example `fit(X, y)`.

Here `X` is a feature array and `y` is a target array.

Other names for X include:

- Data
- Feature variables
- Features

Other names for y include:

- Labels
- Target variable

For supervised learning there is usually an `X` and a `y`.

For unsupervised learning, there's no y (no labels).
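
The difference shows up in the `fit()` call itself. The sketch below uses two synthetic blobs of points (an assumption for illustration); `LogisticRegression` and `KMeans` are stand-ins for any supervised and unsupervised estimator.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated blobs of 2-D points (synthetic, for illustration)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, size=(20, 2)),
    rng.normal(5, 0.5, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# Supervised: the estimator learns from features X and labels y
supervised = LogisticRegression().fit(X, y)

# Unsupervised: the estimator sees only X and finds structure on its own
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(supervised.predict(X[:3]))   # predicted labels
print(unsupervised.labels_[:3])    # cluster assignments (arbitrary ids)
```

The cluster ids from `KMeans` are arbitrary, so they need not match the values in `y` even when the grouping is the same.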

Let's revisit the example of using patient data (X) to predict whether or not they have heart disease (y).

In [14]:
# Import the RandomForestClassifier model from the ensemble module
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Split the data into x (features) and y (labels)
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model
clf = RandomForestClassifier()

# Fit the model (on the training set)
clf.fit(x_train, y_train)

# Check the score of the model (on the test set)
clf.score(x_test, y_test)

0.8524590163934426

In [15]:
x.head()

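| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |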


In [16]:
y.tail()

298    0
299    0
300    0
301    0
302    0
Name: target, dtype: int64

## 3.2 Making predictions using a machine learning model

A trained model can make predictions in two main ways:

- `.predict()` returns the predicted label for each sample.
- `.predict_proba()` returns the estimated probability of each class for each sample.

**Training Phase:**

- The machine learning algorithm analyzes a dataset (e.g., `x_train` and `y_train`).
- It identifies patterns that relate the features to the labels.
- It adjusts its internal parameters to fit those patterns.

**Testing/Production Phase:**

- The algorithm applies the learned patterns to new, unseen data.
- It makes predictions based on its prior learning (e.g., using `clf.predict(x_test)`).

In [17]:
# The model predicts whether a patient has heart disease or not (labels 0 and 1).
# Compare the two results below: clf.predict(x_test) and the true labels in y_test.

In [18]:
clf.predict(x_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [19]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [20]:
# clf.score() makes the same comparison (predictions vs. y_test) and returns the mean accuracy
clf.score(x_test, y_test)

0.8524590163934426

In [21]:
clf.predict_proba(x_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])
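
Each row of the `predict_proba()` output holds one probability per class and sums to 1; `predict()` returns the class with the highest probability. The sketch below checks that relationship on synthetic data (an assumption, not the heart disease data); `clf_demo` is a hypothetical name used to avoid clashing with `clf` above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One row per sample, one column per class (in the order of clf_demo.classes_)
probs = clf_demo.predict_proba(X[:5])

# Pick the most probable class in each row and map it back to a label
labels_from_probs = clf_demo.classes_[np.argmax(probs, axis=1)]

print(probs.sum(axis=1))                                          # each row sums to 1
print(np.array_equal(labels_from_probs, clf_demo.predict(X[:5])))
```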

### Comparing prediction to truth labels to evaluate the model

It's standard practice to save these predictions to a variable named something like `y_preds` for later comparison to `y_test` or `y_true` (usually the same as `y_test`, just under another name).

In [22]:
# Save the predictions to y_preds, then compare them element-wise with y_test.
# The mean of the boolean matches is the fraction of correct predictions (accuracy).

y_preds = clf.predict(x_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [23]:
clf.score(x_test, y_test)

0.8524590163934426

In [24]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

In [25]:
clf.predict(x_test[:5])

array([0, 1, 1, 0, 1])

In [26]:
heart_disease['target'].value_counts()

target
1    165
0    138
Name: count, dtype: int64

### Making predictions with a regression model

In [27]:
# Import the RandomForestRegressor model from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Split the data into x (features) and y (target)
x = housing_df.drop('target', axis=1)
y = housing_df['target']

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Create a model instance
model = RandomForestRegressor()

# Fit the model (on the training set)
model.fit(x_train, y_train)

# Make predictions (on the test set)
y_preds = model.predict(x_test)

In [28]:
y_preds[:10]

array([0.49058  , 0.75989  , 4.9350165, 2.55864  , 2.33461  , 1.6580801,
       2.34237  , 1.66708  , 2.5609601, 4.8519781])

In [29]:
np.array(y_test[:10])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])

In [30]:
len(y_preds)

4128

In [31]:
len(x_test)

4128

In [32]:
# accuracy_score is a classification metric: it counts exact label matches.
# y_preds here holds continuous values, so the call below raises a ValueError.
# Regression predictions need a regression metric instead (e.g. R^2 or mean absolute error).

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

ValueError: continuous is not supported
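
Continuous predictions are usually compared to the true values with regression metrics. The sketch below is one possible approach, not the notebook's own next step: it applies `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` to the first five true and predicted values printed above. The names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five true values and predictions copied from the outputs above
y_true_demo = np.array([0.477, 0.458, 5.00001, 2.186, 2.78])
y_pred_demo = np.array([0.49058, 0.75989, 4.9350165, 2.55864, 2.33461])

# Average absolute difference between prediction and truth
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Average squared difference (penalises large errors more heavily)
mse = mean_squared_error(y_true_demo, y_pred_demo)

# Coefficient of determination: 1.0 is a perfect fit
r2 = r2_score(y_true_demo, y_pred_demo)

print("MAE:", mae)
print("MSE:", mse)
print("R^2:", r2)
```

For regressors, `model.score(x_test, y_test)` returns the same R^2 quantity that `r2_score` computes, which is why `score()` worked on the regression models earlier while `accuracy_score` does not.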