# Introduction to Scikit-Learn (sklearn)
This notebook demonstrates some of the most useful functions of the scikit-learn library.

0. An end-to-end scikit-learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fitting the model/algorithm and using it to make predictions on data
4. Evaluating the model
5. Improving a model
6. Saving and loading a trained model
7. Putting it all together!


# 0. An end-to-end scikit-learn workflow

In [2]:
# 1. Get the data ready
import numpy as np
import pandas as pd 
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# Create X (features matrix)
X = heart_disease.drop("target",axis=1)

# Create y (labels)
y = heart_disease["target"]

In [4]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

# we'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [5]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
clf.fit(X_train,y_train);

In [7]:
# Make a prediction (raises the ValueError below: predict needs a 2D array)
y_label = clf.predict(np.array([0,2,3,4]))




ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
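The error message above suggests `reshape(1, -1)` for a single sample. Here is a minimal sketch of that fix. It uses a synthetic 4-feature dataset from `make_classification` as a stand-in (an assumption for illustration, not the heart disease data), because the array must also have the same number of features the model was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with exactly 4 features (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

sample = np.array([0, 2, 3, 4])      # shape (4,): a 1D array, rejected by predict
sample_2d = sample.reshape(1, -1)    # shape (1, 4): one sample with four features

prediction = demo_clf.predict(sample_2d)
print(sample_2d.shape, prediction.shape)
```

For the heart disease model, the same idea applies, but the single row would need one value for every column in `X`.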

In [8]:
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1], dtype=int64)

In [9]:
y_test

29     1
90     1
81     1
147    1
209    0
      ..
6      1
226    0
19     1
114    1
12     1
Name: target, Length: 61, dtype: int64

In [10]:
# 4. Evaluate the model
clf.score(X_train,y_train)

1.0

In [11]:
clf.score(X_test,y_test)


0.8360655737704918

In [12]:
from sklearn.metrics import classification_report , confusion_matrix,accuracy_score
print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.94      0.64      0.76        25
           1       0.80      0.97      0.88        36

    accuracy                           0.84        61
   macro avg       0.87      0.81      0.82        61
weighted avg       0.86      0.84      0.83        61



In [13]:
confusion_matrix(y_test,y_preds)

array([[16,  9],
       [ 1, 35]], dtype=int64)

In [14]:
accuracy_score(y_test,y_preds)

0.8360655737704918
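To see where these numbers come from, here is a small worked example that recomputes accuracy, precision and recall by hand from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[16, 9],
               [1, 35]])

accuracy = np.trace(cm) / cm.sum()         # (16 + 35) / 61
precision_0 = cm[0, 0] / cm[:, 0].sum()    # 16 / (16 + 1)
recall_0 = cm[0, 0] / cm[0, :].sum()       # 16 / (16 + 9)
precision_1 = cm[1, 1] / cm[:, 1].sum()    # 35 / (9 + 35)
recall_1 = cm[1, 1] / cm[1, :].sum()       # 35 / (1 + 35)

print(f"accuracy: {accuracy:.4f}")
print(f"class 0 -> precision {precision_0:.2f}, recall {recall_0:.2f}")
print(f"class 1 -> precision {precision_1:.2f}, recall {recall_1:.2f}")
```

The values line up with the `classification_report` and `accuracy_score` outputs shown earlier.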

In [15]:
# 5. Improve a model
# Try different values of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set:{clf.score(X_test,y_test)*100:.2f}%")
    # :.2f is a format spec inside the f-string: show the number with two decimals.
    # It is similar to doing: round(some_number, 2)


Trying model with 10 estimators...
Model accuracy on test set:75.41%
Trying model with 20 estimators...
Model accuracy on test set:85.25%
Trying model with 30 estimators...
Model accuracy on test set:83.61%
Trying model with 40 estimators...
Model accuracy on test set:81.97%
Trying model with 50 estimators...
Model accuracy on test set:85.25%
Trying model with 60 estimators...
Model accuracy on test set:83.61%
Trying model with 70 estimators...
Model accuracy on test set:81.97%
Trying model with 80 estimators...
Model accuracy on test set:83.61%
Trying model with 90 estimators...
Model accuracy on test set:83.61%
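A quick sketch of the `:.2f` format spec used in the loop, next to `round()`. The two are similar but not identical: the f-string produces a string that always has two decimals, while `round()` returns a number.

```python
score = 0.8360655737704918  # test accuracy reported earlier in this notebook

formatted = f"{score * 100:.2f}%"   # a string: '83.61%'
rounded = round(score * 100, 2)     # a float: 83.61

print(formatted, type(formatted).__name__)
print(rounded, type(rounded).__name__)

# The f-string keeps trailing zeros, round() does not
padded = f"{2.5:.2f}"               # '2.50'
print(padded, round(2.5, 2))
```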


In [16]:
# 6. Save the model and load it
import pickle
pickle.dump(clf,open("random_forest_model_1.pk1","wb"))

In [17]:
loaded_model = pickle.load(open("random_forest_model_1.pk1","rb"))
loaded_model.score(X_test,y_test)

0.8360655737704918
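The cells above leave the file handles from `open()` unclosed. A common alternative is to wrap them in `with` blocks so the files are closed automatically. This is a sketch on a synthetic dataset, and the file name `demo_model.pkl` and the temporary directory are hypothetical choices for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # The with-block closes the file even if dump/load raises
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should behave exactly like the original
same_predictions = bool((restored.predict(X_demo) == demo_clf.predict(X_demo)).all())
print(same_predictions)
```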

In [18]:
import sklearn


In [19]:
sklearn.show_versions()



System:
    python: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\jicky\Desktop\sample_project_1\env\python.exe
   machine: Windows-10-10.0.22621-SP0

Python dependencies:
      sklearn: 1.2.2
          pip: 23.2.1
   setuptools: 68.0.0
        numpy: 1.25.0
        scipy: 1.10.1
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\jicky\Desktop\sample_project_1\env\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2023.1-Product
    num_threads: 8
threading_layer: intel

       filepath: C:\Users\jicky\Desktop\sample_project_1\env\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 12


# 1. Getting our data ready to be used with machine learning
Three main things we have to do:
1. Split the data into features and labels (usually `X` and `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)


In [20]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [21]:
X = heart_disease.drop("target", axis=1)

In [22]:
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [23]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
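`train_test_split` shuffles the rows, so each run gives a different split unless we pass `random_state`. A small sketch with placeholder arrays follows. The 303 rows are an assumption, chosen because it reproduces the 61-row test set seen earlier with `test_size=0.2`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 303 rows, 2 features (row count is an assumption)
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303)

# Same random_state -> same shuffle -> identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
same_split = np.array_equal(y_test_a, split_b[3])

print(X_train_a.shape, X_test_a.shape)
print(same_split)
```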

# 1.1 Making a fake dataset


In [25]:
car_sales = pd.read_csv("car-sales.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"


In [26]:
car_sales.Make.unique()

array(['Toyota', 'Honda', 'BMW', 'Nissan'], dtype=object)

In [27]:
car_sales.Make.value_counts()

Toyota    4
Honda     3
Nissan    2
BMW       1
Name: Make, dtype: int64

# Create fake "Make" data

In [28]:
# Create fake "Make" data

toyota = ["Toyota" for i in range(0, 393)]
len(toyota), toyota[:10]

(393,
 ['Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota',
  'Toyota'])

In [29]:
honda = ["Honda" for i in range(0, 304)]
len(honda), honda[:10]


(304,
 ['Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda',
  'Honda'])

In [30]:
nissan = ["Nissan" for i in range(0, 198)]
len(nissan), nissan[:10]



(198,
 ['Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan',
  'Nissan'])

In [31]:
bmw = ["BMW" for i in range(0, 105)]
len(bmw), bmw[:10]

(105, ['BMW', 'BMW', 'BMW', 'BMW', 'BMW', 'BMW', 'BMW', 'BMW', 'BMW', 'BMW'])

In [32]:
makes = bmw+nissan+toyota+honda
len(makes)

1000
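The four list comprehensions above can be written more compactly with list multiplication. This is just an alternative sketch using the same counts; the names `make_counts`, `makes_demo` and `makes_demo_shuffled` are made up for illustration.

```python
import random
from collections import Counter

# Same counts as the cells above
make_counts = {"BMW": 105, "Nissan": 198, "Toyota": 393, "Honda": 304}

makes_demo = []
for make, count in make_counts.items():
    makes_demo += [make] * count    # ["BMW"] * 105 repeats the string 105 times

# random.sample with k=len(...) returns a shuffled copy
random.seed(0)
makes_demo_shuffled = random.sample(makes_demo, len(makes_demo))

print(len(makes_demo))
print(Counter(makes_demo_shuffled))
```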

# Create fake "Colour" data

In [33]:
car_sales.Colour.unique()

array(['White', 'Red', 'Blue', 'Black', 'Green'], dtype=object)

In [34]:
car_sales.Colour.value_counts()


White    4
Blue     3
Red      1
Black    1
Green    1
Name: Colour, dtype: int64

In [35]:
white = ["White" for i in range(0, 407)]
len(white), white[:3]

(407, ['White', 'White', 'White'])

In [36]:
blue = ["Blue" for i in range(0, 321)]
len(blue), blue[:3]

(321, ['Blue', 'Blue', 'Blue'])

In [37]:
green = ["Green" for i in range(0, 79)]
len(green), green[:3]

(79, ['Green', 'Green', 'Green'])

In [38]:
black = ["Black" for i in range(0, 99)]
len(black), black[:3]


(99, ['Black', 'Black', 'Black'])

In [39]:
red = ["Red" for i in range(0, 94)]
len(red), red[:3]

(94, ['Red', 'Red', 'Red'])

In [40]:
colours = white+blue+green+black+red
len(colours)

1000

In [41]:
import random
colours_shuffled = random.sample(colours, len(colours))
len(colours_shuffled), colours_shuffled[:10]


(1000,
 ['White',
  'White',
  'Black',
  'Blue',
  'White',
  'Black',
  'Blue',
  'Black',
  'White',
  'White'])

# Create fake Odometer (KM) data

In [42]:
car_sales


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [43]:
odometer = [random.randint(9789, 250000) for i in range(0, 1000)]
len(odometer), odometer[:10]

(1000,
 [160732,
  19782,
  140177,
  205123,
  116256,
  198929,
  234820,
  87979,
  118444,
  176627])

# Create fake "Doors" data

In [44]:
five_doors = [5 for i in range(0, 79)]
three_doors = [3 for i in range(0, 65)]
four_doors = [4 for i in range(0, 856)]
doors = five_doors + three_doors + four_doors
doors_shuffled = random.sample(doors, len(doors))

In [45]:
doors_shuffled

[4,
 4,
 4,
 4,
 3,
 5,
 4,
 4,
 4,
 4,
 ...]


# Create fake "Price" data

In [46]:
makes_series = pd.Series(makes)
makes_series.value_counts()

Toyota    393
Honda     304
Nissan    198
BMW       105
dtype: int64

In [47]:
car_sales


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [48]:
car_sales[car_sales["Make"] == "Toyota"]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
5,Toyota,Green,99213,4,"$4,500.00"
8,Toyota,White,60000,4,"$6,250.00"


In [49]:
car_sales[car_sales["Make"] == "Honda"]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
1,Honda,Red,87899,4,"$5,000.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"


In [50]:
car_sales[car_sales["Make"] == "Nissan"]

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
4,Nissan,White,213095,4,"$3,500.00"
9,Nissan,White,31600,4,"$9,700.00"


In [51]:
prices = [random.randint(5000, 30000) for i in range(0, 1000)]
len(prices), prices[:30]

(1000,
 [13295,
  24215,
  13734,
  26490,
  27357,
  23113,
  5357,
  12621,
  25069,
  20185,
  17349,
  25498,
  22096,
  7406,
  11335,
  20448,
  9585,
  13992,
  9114,
  24716,
  6579,
  18406,
  18124,
  10427,
  12592,
  27307,
  15520,
  24153,
  13302,
  21796])

# Create base dataframe with manufactured data

In [52]:
fake_sales = pd.DataFrame(columns = ["Make", "Colour", "Odometer (KM)", "Doors", "Price"])
fake_sales


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price


In [53]:
fake_sales["Make"] = makes
fake_sales["Colour"] = colours_shuffled
fake_sales["Odometer (KM)"] = odometer
fake_sales["Doors"] = doors
fake_sales["Price"] = prices
fake_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,BMW,White,160732,5,13295
1,BMW,White,19782,5,24215
2,BMW,Black,140177,5,13734
3,BMW,Blue,205123,5,26490
4,BMW,White,116256,5,27357
...,...,...,...,...,...
995,Honda,Blue,110206,4,6449
996,Honda,White,227163,4,10878
997,Honda,White,26608,4,27453
998,Honda,Green,80979,4,26556


# Adjust the price column 

For the price column:

* Generate random prices within a fixed range
* If the odometer reading is above 100K, multiply price by 0.75
* If the odometer reading is above 150K, multiply price by 0.6
* If the odometer reading is above 200K, multiply price by 0.5
* If the make is BMW, multiply price by 1.5 and add 2500 (the code below adds a random amount between 3000 and 10000 instead)
* If the make is Toyota, multiply price by 1.2
* If the make is Nissan, multiply price by 1.1
* If the make is Honda, add 1000 to price

In [54]:
fake_sales["Price"].describe()

count     1000.000000
mean     17468.268000
std       7153.367872
min       5004.000000
25%      11778.000000
50%      17043.500000
75%      24100.500000
max      29970.000000
Name: Price, dtype: float64

In [55]:
def price_od(price, odometer):
    """
    Changes price according to Odometer values.
    """
    if 100000 <= odometer <= 150000:
        return round(price * 0.75)
    elif 150001 <= odometer <= 200000:
        return round(price * 0.6)
    elif 200001 <= odometer:
        return round(price * 0.5)
    else:
        return price

fake_sales["Price"] = fake_sales.apply(lambda x: price_od(x["Price"], 
                                                          x["Odometer (KM)"]), 
                                                          axis=1)

fake_sales["Price"].describe()

count     1000.000000
mean     13154.543000
std       6510.187191
min       2506.000000
25%       8044.000000
50%      12146.500000
75%      17133.750000
max      29913.000000
Name: Price, dtype: float64
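`DataFrame.apply` with `axis=1` calls the Python function once per row. For tiered rules like the odometer discount, a vectorized version with `np.select` is a common alternative. Here is a sketch on a tiny made-up table, using the same thresholds as `price_od`; the `Adjusted` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up table, one row per odometer tier
demo = pd.DataFrame({
    "Odometer (KM)": [50_000, 120_000, 160_000, 250_000],
    "Price": [10_000, 10_000, 10_000, 10_000],
})

od = demo["Odometer (KM)"]
conditions = [
    od.between(100_000, 150_000),   # between() is inclusive on both ends
    od.between(150_001, 200_000),
    od >= 200_001,
]
multipliers = [0.75, 0.6, 0.5]

# Rows that match no condition keep their price (factor 1.0)
factor = np.select(conditions, multipliers, default=1.0)
demo["Adjusted"] = (demo["Price"] * factor).round().astype(int)
print(demo)
```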

In [56]:
def price_make(price, make):
    """
    Manipulates the price based on the car's make.
    """
    if make == "BMW":
        return round((price * 1.5) + random.randint(3000, 10000))
    elif make == "Toyota":
        return round(price * 1.2)
    elif make == "Nissan":
        return round(price * 1.1)
    elif make == "Honda":
        return round(price + 1000)
    else:
        return price

fake_sales["Price"] = fake_sales.apply(lambda x: price_make(x["Price"], 
                                                            x["Make"]), 
                                                            axis=1)

fake_sales["Price"].describe()

count     1000.000000
mean     16138.085000
std       8379.168904
min       2867.000000
25%       9674.750000
50%      14638.000000
75%      21065.500000
max      51918.000000
Name: Price, dtype: float64

In [57]:
fake_sales = fake_sales.sample(frac=1)

In [58]:
fake_sales.reset_index(drop=True, inplace=True)
fake_sales.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,BMW,White,117883,5,24074
1,Toyota,White,33483,4,21752
2,Toyota,Green,85918,4,27071
3,Toyota,Red,29619,4,27109
4,Toyota,Blue,148415,4,6691
5,Toyota,Blue,198036,4,4636
6,Toyota,Blue,190063,4,15958
7,Nissan,White,151217,4,18304
8,Nissan,White,111418,4,8260
9,Toyota,Green,66474,4,29104


In [59]:
fake_sales.to_csv("./car-sales-extended.csv")

In [60]:
sales_ext = pd.read_csv("car-sales-extended.csv")


In [61]:
len(sales_ext)


1000

In [63]:
# Replicate the df (.copy() gives an independent DataFrame, not a second name for sales_ext)
sales_ext_dropped = sales_ext.copy()
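A note on replicating a DataFrame: plain assignment such as `b = a` only gives the same object a second name, so edits through `b` also show up in `a`. Calling `.copy()` makes an independent DataFrame. A small sketch with made-up data:

```python
import pandas as pd

original = pd.DataFrame({"Make": ["Toyota", "Honda"]})

alias = original               # same object, nothing is copied
independent = original.copy()  # separate data

# Editing the copy leaves the original untouched
independent.loc[0, "Make"] = ""
unchanged_after_copy_edit = original.loc[0, "Make"]

# Editing through the alias changes the original too
alias.loc[1, "Make"] = ""
changed_after_alias_edit = original.loc[1, "Make"]

print(alias is original, independent is original)
print(repr(unchanged_after_copy_edit), repr(changed_after_alias_edit))
```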



In [64]:
# Make column
np.random.seed(10)
make_idx = np.random.randint(0, 1000, 50)

In [65]:
make_idx

array([265, 125, 996, 527, 320, 369, 123, 156, 985, 733, 496, 925, 881,
         8,  73, 256, 490,  40, 502, 420, 371, 528, 356, 239, 395,  54,
       344, 363, 122, 574, 545, 200, 868, 974, 689, 691,  54,  77, 453,
        13, 755, 409, 382, 653, 860, 342, 798, 670,  89, 652])
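`np.random.randint` samples with replacement, so the same index can come up more than once (54 appears twice in the array above). If exactly 50 distinct rows are wanted, one option is `Generator.choice` with `replace=False`. This is a sketch of that alternative, not what the notebook does, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)   # arbitrary seed for this sketch

# 50 distinct row positions from 0..999
distinct_idx = rng.choice(1000, size=50, replace=False)

n_unique = len(np.unique(distinct_idx))
print(n_unique)
```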

In [66]:
for value in make_idx:
    sales_ext_dropped.loc[value, "Make"] = ""

In [67]:
sales_ext_dropped["Make"][266]

'Toyota'

In [68]:
# Colour column
np.random.seed(42)
colour_idx = np.random.randint(0, 1000, 50)
for value in colour_idx:
    sales_ext_dropped.loc[value, "Colour"] = ""

In [69]:
# Odometer (KM) column
np.random.seed(1)
odom_idx = np.random.randint(0, 1000, 50)
for value in odom_idx:
    sales_ext_dropped.loc[value, "Odometer (KM)"] = None

In [70]:
# Doors column
np.random.seed(2)
door_idx = np.random.randint(0, 1000, 50)
for value in door_idx:
    sales_ext_dropped.loc[value, "Doors"] = None

In [71]:
# Price column
np.random.seed(3)
price_idx = np.random.randint(0, 1000, 50)
for value in price_idx:
    sales_ext_dropped.loc[value, "Price"] = None

In [72]:
sales_ext_dropped.head(50)

Unnamed: 0.1,Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,0,BMW,White,117883.0,5.0,24074.0
1,1,Toyota,White,33483.0,4.0,21752.0
2,2,Toyota,Green,85918.0,4.0,27071.0
3,3,Toyota,Red,29619.0,4.0,27109.0
4,4,Toyota,Blue,148415.0,4.0,6691.0
5,5,Toyota,Blue,198036.0,4.0,4636.0
6,6,Toyota,Blue,190063.0,4.0,15958.0
7,7,Nissan,White,,4.0,18304.0
8,8,,White,111418.0,4.0,8260.0
9,9,Toyota,Green,66474.0,4.0,29104.0


In [73]:
# Check how many of our values are missing/NaN
sales_ext_dropped.isna().sum()

Unnamed: 0        0
Make              0
Colour            0
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [74]:
# Export dataframe with random missing values
sales_ext_dropped.to_csv("./car-sales-extended-missing-data.csv", index=False)
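Note that `Make` and `Colour` show 0 missing values in the `isna()` check above. They were blanked with empty strings, and pandas does not count `""` as missing. Once the frame is written to CSV and read back, empty fields are parsed as NaN by default. A minimal sketch of that round trip with made-up data, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "Make": ["Toyota", "", "BMW"],   # empty string, not NaN
    "Doors": [4.0, 4.0, None],       # None becomes NaN in a float column
})
before = demo.isna().sum()

# Write to CSV and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
after = pd.read_csv(buffer).isna().sum()

print(before)
print(after)
```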

# 1.1 Make sure it's all numerical

In [75]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales

Unnamed: 0.1,Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,0,BMW,White,117883,5,24074
1,1,Toyota,White,33483,4,21752
2,2,Toyota,Green,85918,4,27071
3,3,Toyota,Red,29619,4,27109
4,4,Toyota,Blue,148415,4,6691
...,...,...,...,...,...,...
995,995,Toyota,Blue,243456,4,12670
996,996,Toyota,White,175526,4,18428
997,997,Toyota,Black,226075,4,14858
998,998,Toyota,Blue,155283,4,14196


In [76]:
len(car_sales)

1000

In [77]:
car_sales.dtypes

Unnamed: 0        int64
Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [78]:
# Split into X/y
X = car_sales.drop("Price",axis=1)
y = car_sales["Price"]

# Split into training and testing
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)

In [79]:
X_train


Unnamed: 0.1,Unnamed: 0,Make,Colour,Odometer (KM),Doors
689,689,Toyota,Blue,92819,4
95,95,Honda,White,32264,4
312,312,Honda,Blue,204719,4
192,192,Honda,White,232947,4
555,555,Toyota,Green,59884,4
...,...,...,...,...,...
383,383,Toyota,Blue,10187,4
405,405,Toyota,White,72688,4
561,561,Honda,White,174403,4
862,862,Honda,Black,23124,4


In [80]:
# Build Machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test,y_test)


ValueError: could not convert string to float: 'Toyota'

In [81]:
# convert string to numbers

In [82]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

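# Note: "Odometer (KM)" is numeric. One-hot encoding it creates one column per distinct
# reading, which is why the output below has 1012 columns. It would normally be left out.
categorical_features = ["Make","Colour", "Doors","Odometer (KM)"]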
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X

 

<1000x1012 sparse matrix of type '<class 'numpy.float64'>'
	with 4999 stored elements in Compressed Sparse Row format>

In [83]:
pd.DataFrame(transformed_X)

Unnamed: 0,0
0,"(0, 0)\t1.0\n (0, 8)\t1.0\n (0, 11)\t1.0\n..."
1,"(0, 3)\t1.0\n (0, 8)\t1.0\n (0, 10)\t1.0\n..."
2,"(0, 3)\t1.0\n (0, 6)\t1.0\n (0, 10)\t1.0\n..."
3,"(0, 3)\t1.0\n (0, 7)\t1.0\n (0, 10)\t1.0\n..."
4,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 10)\t1.0\n..."
...,...
995,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 10)\t1.0\n..."
996,"(0, 3)\t1.0\n (0, 8)\t1.0\n (0, 10)\t1.0\n..."
997,"(0, 3)\t1.0\n (0, 4)\t1.0\n (0, 10)\t1.0\n..."
998,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 10)\t1.0\n..."


In [84]:
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,5,1,0,0,0,0,0,0,0,1
1,4,0,0,0,1,0,0,0,0,1
2,4,0,0,0,1,0,0,1,0,0
3,4,0,0,0,1,0,0,0,1,0
4,4,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,0,1,0,0,0
996,4,0,0,0,1,0,0,0,0,1
997,4,0,0,0,1,1,0,0,0,0
998,4,0,0,0,1,0,1,0,0,0


In [85]:
# Let's refit the model
np.random.seed(42)
X_train, X_test,y_train,y_test = train_test_split(transformed_X,
                                                 y,
                                                 test_size =0.2)
model.fit(X_train,y_train);

In [86]:
model.score(X_test,y_test)

0.11340769617275792

### 1.2 What if there were missing values?
1. Fill them with some value (also known as imputation).
2. Remove the samples with missing data altogether.

In [87]:
# Import car sales missing data
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0.1,Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,0,BMW,White,117883.0,5.0,24074.0
1,1,Toyota,White,33483.0,4.0,21752.0
2,2,Toyota,Green,85918.0,4.0,27071.0
3,3,Toyota,Red,29619.0,4.0,27109.0
4,4,Toyota,Blue,148415.0,4.0,6691.0


In [88]:
car_sales_missing.isna().sum()

Unnamed: 0        0
Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [89]:
# Create X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# let's convert our data into numbers


#### Option 1: Fill missing data with pandas

In [90]:
# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with its mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)


In [91]:
# Remove rows with missing Price values
car_sales_missing.dropna(inplace=True)

In [92]:
car_sales_missing.isna().sum()

Unnamed: 0       0
Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [93]:
len(car_sales_missing)

950

In [94]:
X = car_sales_missing.drop("Price", axis =1)
y = car_sales_missing["Price"]

In [95]:
# Let's convert our data into  numbers
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X


array([[1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.17883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 3.34830e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.00000e+00, 8.59180e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        9.97000e+02, 2.26075e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        9.98000e+02, 1.55283e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        9.99000e+02, 1.70922e+05]])

#### Option 2: Fill missing values with scikit-learn


In [96]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")

In [97]:
car_sales_missing.isna().sum()

Unnamed: 0        0
Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [98]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()


Unnamed: 0        0
Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [99]:
# Split into x and y
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [100]:
# Fill missing values with scikit learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")
# Define columns 
cat_features = ["Make", "Colour"]
door_feature =["Doors"]
num_features=["Odometer (KM)"]
# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer",cat_imputer,cat_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer",num_imputer, num_features)
])
#Transform the data
filled_X = imputer.fit_transform(X)
filled_X

array([['BMW', 'White', 5.0, 117883.0],
       ['Toyota', 'White', 4.0, 33483.0],
       ['Toyota', 'Green', 4.0, 85918.0],
       ...,
       ['Toyota', 'Black', 4.0, 226075.0],
       ['Toyota', 'Blue', 4.0, 155283.0],
       ['Toyota', 'White', 4.0, 170922.0]], dtype=object)

In [101]:
car_sales_filled = pd.DataFrame(filled_X,columns =["Make","Colour","Doors", "Odometer (KM)"])
car_sales_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,BMW,White,5.0,117883.0
1,Toyota,White,4.0,33483.0
2,Toyota,Green,4.0,85918.0
3,Toyota,Red,4.0,29619.0
4,Toyota,Blue,4.0,148415.0


In [102]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [103]:
# Let's try and convert our data to numbers
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [104]:
# now we've got our data as numbers and filled (no missing values)
# Let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                   y,
                                                   test_size=0.2
                                                   )
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.1505107246502434
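One thing to be aware of: above, the imputer and encoder were fitted on all of `X` before splitting, so statistics such as the odometer mean include test rows. A common way to avoid this is to split first and put the preprocessing steps in a `Pipeline`, so they are fitted on the training rows only. The sketch below uses synthetic data; the column values, the price formula and all `demo_` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic car data (an assumption for this sketch)
rng = np.random.default_rng(42)
n_rows = 200
demo_X = pd.DataFrame({
    "Make": rng.choice(["Toyota", "Honda", "BMW", "Nissan"], size=n_rows),
    "Colour": rng.choice(["White", "Blue", "Red"], size=n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_rows).astype(float),
})
demo_y = 30_000 - 0.08 * demo_X["Odometer (KM)"] + rng.normal(0, 1_000, size=n_rows)

# Knock out some feature values (labels stay complete)
demo_X.loc[rng.choice(n_rows, size=15, replace=False), "Make"] = np.nan
demo_X.loc[rng.choice(n_rows, size=15, replace=False), "Odometer (KM)"] = np.nan

# Categorical columns: fill with "missing", then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", categorical_steps, ["Make", "Colour"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
demo_model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Split first, then fit: the imputer and encoder only ever see training rows
dX_train, dX_test, dy_train, dy_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)
demo_model.fit(dX_train, dy_train)
demo_preds = demo_model.predict(dX_test)
demo_score = demo_model.score(dX_test, dy_test)
print(demo_score)
```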

## 2. Choosing the right estimator/algorithm for your problem
Some things to note:
* Sklearn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
* Sometimes you'll see `clf` (short for classifier) used as a classification estimator
* Regression problem - predicting a number (selling price of a car)

Refer to this map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
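The kind of estimator also decides what `score()` reports. For classifiers it is mean accuracy, and for regressors it is the coefficient of determination (R²). Here is a sketch that checks both against the matching metric functions, using synthetic datasets from `make_classification` and `make_regression` (stand-ins, not the notebook's data).

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Classification: score() is mean accuracy
Xc, yc = make_classification(n_samples=300, n_features=8, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(Xc_train, yc_train)
clf_score = demo_clf.score(Xc_test, yc_test)
clf_accuracy = accuracy_score(yc_test, demo_clf.predict(Xc_test))

# Regression: score() is R^2
Xr, yr = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, test_size=0.2, random_state=42)
demo_reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(Xr_train, yr_train)
reg_score = demo_reg.score(Xr_test, yr_test)
reg_r2 = r2_score(yr_test, demo_reg.predict(Xr_test))

print(clf_score, clf_accuracy)
print(reg_score, reg_r2)
```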

In [124]:
# Get the California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [125]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [150]:
housing_df["MedHouseVal"] = housing["target"]
housing_df.head()



Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [151]:
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [152]:

# housing_df = housing_df.drop("MedHouseVal",axis=1)


In [153]:
# Import the algorithm
from sklearn.linear_model import Ridge

# Set up random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"]  # median house price in $100,000s

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)

In [154]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed 
np.random.seed(42)

#create the data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"]

# Split into train and test sets
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.2)

# Create random forest model
model = RandomForestRegressor()
model.fit(X_train,y_train)

# Check the score of the model(on the test set)
model.score(X_test, y_test)


0.8065734772187598
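
For a regressor, `score()` returns the coefficient of determination (R^2), not an accuracy. The sketch below checks this against `r2_score()` on synthetic data from `make_regression`, which stands in for the housing data here (an assumption made so the example is self-contained; the names `X_demo`, `y_demo` and `reg` are illustrative).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit a small random forest regressor on the training split
reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(X_tr, y_tr)

# score() on a regressor computes R^2 on the data it is given
score_from_model = reg.score(X_te, y_te)

# r2_score() computes the same quantity from explicit predictions
score_from_metric = r2_score(y_te, reg.predict(X_te))

print(score_from_model, score_from_metric)
```

Both values should match, so the 0.80 or so reported above can be read as "about 80% of the variance in the test targets is explained by the model".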

## 2.2 Picking a machine learning model for a classification problem

In [113]:
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


* Consulting the map, the suggestion is to try `LinearSVC`.


In [114]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC 

# Setup random seed 
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2)

#Instantiate LinearSVC 
clf = LinearSVC(max_iter=1000000000)
clf.fit(X_train, y_train)

# Evaluate the LinearSVC 
clf.score(X_test, y_test)

0.8688524590163934
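
The very large `max_iter` above hints that `LinearSVC` can be slow to converge on unscaled features. One common option, sketched below, is to standardize the features first with a pipeline. This is only an illustration on synthetic data from `make_classification` (an assumption, not the heart disease data), and the names `scaled_svc` and `svc_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with 13 features, like the heart disease table (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scale the features, then fit LinearSVC with a modest iteration limit
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000, random_state=42))
scaled_svc.fit(X_tr, y_tr)

# The pipeline exposes the same score() and predict() methods as a single estimator
svc_accuracy = scaled_svc.score(X_te, y_te)
print(svc_accuracy)
```

Because the scaler is fitted inside the pipeline, it only ever sees the training split, which avoids leaking test set statistics into the model.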

In [115]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

In [116]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed 
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test, y_test)

0.8524590163934426

### 3.1 Fitting the model to the data

#### Different names for:
* X = features, feature variables, data
* y = labels, targets, target variables

In [117]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed 
np.random.seed(42)

# Make the data 
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()
# Fit the model to the data
clf.fit(X_train, y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test, y_test)

0.8524590163934426

In [118]:
y.tail()

298    0
299    0
300    0
301    0
302    0
Name: target, dtype: int64

### 3.2 Making predictions using a machine learning model
Two ways to make predictions:
1. `predict()`
2. `predict_proba()`

In [119]:
# Use a trained model to make predictions
clf.predict(np.array([1, 7, 8, 3, 4]))  # this doesn't work: predict() expects a 2D array




ValueError: Expected 2D array, got 1D array instead:
array=[1. 7. 8. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
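
The error message points at the fix: a single sample has to be passed as a 2D array with one row. A minimal sketch of this follows, using a small classifier trained on synthetic 5-feature data so that the array from the failing call has the right width (an assumption; the notebook's `clf` was trained on 13 feature columns, so a real input would need 13 values). The names `demo_clf` and `as_row` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 5-feature data so the example array below has a matching width (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

single_sample = np.array([1, 7, 8, 3, 4])  # 1D array, shape (5,)
as_row = single_sample.reshape(1, -1)      # 2D array, shape (1, 5): one sample, five features

# predict() now accepts the input and returns one label per row
single_pred = demo_clf.predict(as_row)
print(as_row.shape, single_pred.shape)
```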

In [None]:
clf.predict(X_test)

In [None]:
np.array(y_test)

In [None]:
# Compare predictions to truth labels to evaluate the model
y_preds= clf.predict(X_test)
np.mean(y_preds==y_test)

In [None]:
clf.score(X_test,y_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)
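
The three cells above compute the same quantity: the fraction of test samples whose predicted label matches the true label. The sketch below checks that on synthetic data from `make_classification` (an assumption made to keep the example self-contained; `demo_clf` and the `acc_*` names are illustrative).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
demo_preds = demo_clf.predict(X_te)

# 1. Manual: mean of the element-wise comparison
acc_manual = np.mean(demo_preds == y_te)

# 2. The estimator's own score() method (mean accuracy for classifiers)
acc_score = demo_clf.score(X_te, y_te)

# 3. The metrics function, which takes true labels first and predictions second
acc_metric = accuracy_score(y_te, demo_preds)

print(acc_manual, acc_score, acc_metric)
```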

### Making predictions with `predict_proba()`

In [None]:
# predict_proba() returns the probabilities of each classification label
clf.predict_proba(X_test[:5])

In [None]:
# Let's call predict() on the same data...
clf.predict(X_test[:5])
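
To make the link between the two methods concrete: `predict_proba()` returns one row per sample and one column per class (ordered as in `classes_`), and for a random forest classifier `predict()` returns the class with the highest probability in each row. A small sketch on synthetic data follows (an assumption; `demo_clf`, `proba` and `labels_from_proba` are illustrative names).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = demo_clf.predict_proba(X_te[:5])

# Hard labels for the same five samples
labels = demo_clf.predict(X_te[:5])

# Picking the most probable class per row reproduces predict()
labels_from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]

print(proba)
print(labels, labels_from_proba)
```

The probabilities are useful when a prediction's confidence matters, for example to flag samples where the two classes are close to 0.5 each.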

In [122]:
housing_df1.head()


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [123]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Create model instance
model= RandomForestRegressor()

# Fit the model to the data
model.fit(X_train,y_train)

# Make predictions
y_preds = model.predict(X_test)
y_preds


KeyError: "['MedHouseVal'] not found in axis"