# DA301 Week 2: Predicting outcomes using classification & clustering

## 2.1.3 Worked example I: Checking assumptions in binary logistic regression

Logistic regression is used to determine, predict, and understand the probability that an event occurs. It does this by fitting a logistic (logit) function that relates a dependent variable to one or more independent variables.
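
As a minimal sketch of this idea, the snippet below shows how a logistic function turns a linear combination of inputs into a probability, and how the logit (log odds) reverses it. The intercept and coefficient values are made up for illustration and are not taken from the customer data set.

```python
import math

def logistic(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Return the log odds of a probability p (the inverse of logistic)."""
    return math.log(p / (1.0 - p))

# Hypothetical intercept and coefficient for a single independent variable.
b0, b1 = -2.0, 0.5

for x in (0, 4, 8):
    z = b0 + b1 * x          # linear predictor (the log odds)
    p = logistic(z)          # predicted probability that Y = 1
    print(f"x={x}: log odds={z:.1f}, P(Y=1)={p:.3f}")
```

The log odds change by the same amount for every unit increase in x, while the probability follows an S-shaped curve.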

In this worked example, we will practise using binary logistic regression (BLR). You’ll apply a BLR model that’s based on binary data.

### Prepare the workstation

In [None]:
# Import all the necessary packages.
import pandas as pd
import numpy as np

# Read the provided CSV file/data set.
df = pd.read_csv('customer_data.csv') 

# Print the table.
df.head()  

### Determine the data types for each column

In [None]:
# Find the data types of columns.
df.dtypes

### Determine the shape of the data set

In [None]:
# Determine the shape of the data set.
df.shape  

# We should be able to perform BLR on a data set this size.
# Recall BLR requires large data sets.

### Check for missing values

In [None]:
# Determine missing values, column names, shape of data set, and data type:
df.info()

Assumptions 1, 2 & 6 are met for BLR

The first logistic regression assumption is that the outcome is binary. The last column of the data set is named Target and consists of binary data. Whether a client has a connection is already recorded as binary, so the desired outcome is P(Y=1) as indicated in the Target column. Assumptions 1 and 2 are therefore met, and Assumption 6 (data size) was met when we checked the shape of the data set.


## 2.1.4 Worked example II: Checking for meaningful variables

For Assumption 3 (meaningful variables) to be true, we need to investigate each of the columns and determine whether it is meaningful and should be included. In short, if a variable contributes to the binary outcome, it needs to be included.

### Determine the object containing counts of unique values
The education column (Edu) in the DataFrame consists of strings. Before we can create dummy variables, we need to convert the strings into single words that will be easier to analyse than strings of variable lengths. For this worked example, we will only update the details of the education column.

We know we can use the value_counts() function to determine the count per variable; Python will then return the name of each unique variable and the number of them in the specified column. This helps data analysts limit the number of spelling errors when specifying how to update the details (i.e. the variables) of the column. To determine the values within the Edu column:

In [None]:
# Specify the DataFrame column & add/determine the values.
df['Edu'].value_counts() 

Some of the categories contain a period (.) in their names. We need to change this.

### Update the categories in the Edu Column

In [None]:
# Create two lists: one with initial and one with new values.
initial_vals = ['illiterate', 'unknown', 'basic', 'high', 'university', 'professional']
new_vals = ['other', 'other', 'pre-school', 'high-school', 'uni', 'masters']

# Create a for loop to replace the values.
for old_val, new_val in zip(initial_vals, new_vals):
    df.loc[df['Edu'].str.contains(old_val), 'Edu'] = new_val

# Display all the unique values/check changes.
df['Edu'].unique()  

In [None]:
df['Edu'].value_counts() 

### Convert strings into numbers

We need to convert strings to a single value for ease of data analysis. The sklearn library has a class called LabelEncoder that can convert values within a column to a number. Therefore, the categorical values become understandable numbers for our machine learning (ML) model. For example, we can transform Poor, Good, Very Good, and Excellent to 0, 1, 2, and 3. It is important to remember that LabelEncoder numbers the values in alphabetical order by default, which may not match their natural order.

Another option is to create dummy variables with the pd.get_dummies() function. The dummies variable function creates a new column for each value, resulting in a bigger DataFrame. Keeping the DataFrame smaller has a positive impact on the performance of the model, especially when you work with big data.
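
The effect of alphabetical ordering can be sketched in plain Python. This is only an illustration of the idea, not the LabelEncoder implementation itself; the `ratings` list is hypothetical.

```python
# Hypothetical column of ordered categories.
ratings = ['Poor', 'Good', 'Very Good', 'Excellent', 'Good']

# Alphabetical encoding: numbers follow the sorted order of the labels.
alphabetical = {label: code for code, label in enumerate(sorted(set(ratings)))}

# Custom encoding: numbers follow the order we specify ourselves.
custom_order = ['Poor', 'Good', 'Very Good', 'Excellent']
custom = {label: code for code, label in enumerate(custom_order)}

print([alphabetical[r] for r in ratings])
print([custom[r] for r in ratings])
```

With alphabetical ordering, Excellent receives the lowest number, so the codes no longer reflect the ranking of the categories. Supplying the order explicitly avoids this.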

In [None]:
# Import the necessary modules, classes and packages.
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import column_or_1d

# Subclass LabelEncoder so that fit() keeps the order of the values
# we supply instead of sorting them alphabetically.
class MyLabelEncoder(LabelEncoder):
    
    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

# View the columns that contain strings.
df.select_dtypes(include='object')

In [None]:
# Next replace all unique values in the columns with numbers.
# Order lists of the values for each column containing strings.
Edu_order = ['other', 'pre-school', 'high-school', 'uni', 'masters']
House_order = ['no', 'unknown', 'yes']
Loan_order = ['no', 'unknown', 'yes']
Month_order = ['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct',
               'nov', 'dec']
DOW_order = ['mon', 'tue', 'wed', 'thu', 'fri']
Last_out_order = ['nonexistent', 'failure', 'success']

# List of values to transform into numbers even though the values are not ordered.
Occupation_list = ['unemployed', 'unknown', 'student', 'blue-collar',
                   'technician', 'housemaid', 'admin.','retired',
                   'self-employed', 'entrepreneur', 'management', 'services']
Status_list = ['unknown', 'single', 'divorced', 'married']
Comm_list = ['cellular', 'telephone']

# Create a list containing all of the list of values.
Encoding_list = [Occupation_list, Status_list, Edu_order, House_order,
                 Loan_order, Comm_list, Month_order, DOW_order, Last_out_order]

In [None]:
# Pick non-numerical columns.
object_cols = df.select_dtypes(include= 'object').columns

# Transform string values to numbers with our MyLabelEncoder class.
# This assumes the lists in Encoding_list are in the same order as object_cols.
for idx in range(len(object_cols)): 
    
    le = MyLabelEncoder()
    le.fit(Encoding_list[idx])
    df[object_cols[idx]] = le.transform(df[object_cols[idx]])
    
# View the DataFrame.
df.head()

### Balance the data

It’s important to determine whether the data is balanced in the Target column before we can create a BLR. An unbalanced data set is one in which the target variable has many more observations in one class than in the others.

A model trained on an unbalanced data set tends to return poor results, for example by rarely predicting the minority class or by misclassifying unseen observations. Unbalanced data also affects the estimate of the model intercept and can skew the predicted probabilities.
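
To see why imbalance matters, consider this minimal sketch with a hypothetical target column of 90 zeros and 10 ones. A "model" that always predicts the majority class looks accurate, yet it never finds a single positive case.

```python
# Hypothetical target column: 90 zeros and 10 ones.
y_true = [0] * 90 + [1] * 10

# A naive "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1: share of actual positives that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall for class 1: {recall:.2f}")
```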

In [None]:
# Determine if values in a column are balanced.
df['Target'].value_counts()  

In [None]:
# You can also create a visualisation
# Create a plot with Seaborn.
import seaborn as sns

sns.set_theme(style='darkgrid')
ax = sns.countplot(x='Target', data=df)
ax.set_title('Target Imbalance')

The data is not balanced as there are many more 0 values than 1. Before you can balance the data, you need to import a few libraries.

The libraries and tools you will need are:

- imblearn: handles unbalanced data and relies on the scikit-learn library
- scipy: for optimisation, linear algebra, integration
- scikit-learn: simple and efficient tools for predictive data analysis
- SMOTE: an oversampling technique (provided by imblearn) that creates new samples from existing data.

Next, you need to balance the Target variable with the SMOTE() function.
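
The core idea behind SMOTE can be sketched as follows. This is a simplified illustration, not the imblearn implementation: it assumes the neighbour has already been chosen, whereas SMOTE picks it from among the nearest minority-class neighbours. The two points are hypothetical.

```python
import random

def synthetic_sample(point, neighbour, rng):
    """Create a new point on the line segment between two minority samples."""
    gap = rng.random()  # random fraction between 0 and 1
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(0)

# Two hypothetical minority-class observations (two features each).
a = [1.0, 10.0]
b = [3.0, 20.0]

new_point = synthetic_sample(a, b, rng)
print(new_point)
```

Because new samples lie between existing minority observations, the minority class grows without simply duplicating rows.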

In [None]:
# Import all the necessary packages:
import statsmodels.api as sm   
import imblearn
from imblearn.over_sampling import SMOTE  
from sklearn.model_selection import train_test_split 

# Set the variables:
X = df.drop('Target', axis = 1)
y = df['Target']

# Apply SMOTE as the target variable is not balanced.
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Specify the new data sets.
os_data_X, os_data_y = os.fit_resample(X_train, y_train)  

# Create two DataFrames: one for X and one for y.
os_data_X = pd.DataFrame(data = os_data_X, columns = X.columns) 

os_data_y = pd.DataFrame(data = os_data_y, columns = ['Target'])

# View DataFrame.
print(os_data_X.head())
os_data_y.head()

In [None]:
# Check whether the balancing worked by counting the values
# in the Target column with the value_counts() function.
os_data_y['Target'].value_counts()


In [None]:
# Simple visualisation now.
sns.set_theme(style ='darkgrid')
ax = sns.countplot(x ='Target', data = os_data_y)
ax.set_title("New Balanced Target")

Notice how each data set contains exactly the same number of zeros and ones. From this, we can conclude the data sets for our analysis contain meaningful variables that are now balanced (Assumption 3). Remember that an unbalanced data set will return poor results, and we needed to use the SMOTE technique to balance the data set.

There are still two logistic regression assumptions we need to check. They are:

- Assumption 4: The independent variables (x) should be independent of each other to limit/eliminate multicollinearity.
- Assumption 5: The independent variables (x) are linearly related to the log odds.

Let’s investigate how to test these two assumptions and how to build and fit the model.

### Variance Inflation Factor - testing for collinearity

Apply the variance inflation factor (VIF) to check Assumption 4.

Note that a VIF < 10 indicates limited to no correlation between independent variables. The closer the VIF is to 1, the less correlation between independent variables. However, a VIF > 10 indicates a multicollinearity problem due to strong correlations. Some experts and researchers in the field prefer a more conservative threshold of 5 or 2.5.
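
To make the VIF less of a black box, here is a minimal NumPy sketch of the calculation: each variable is regressed on the others and the VIF is 1 / (1 - R²). The data is randomly generated for illustration, and this sketch includes an intercept in every auxiliary regression, which is an assumption of the sketch rather than something taken from the worked example.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column to the auxiliary regression.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                 # unrelated to x1
x3 = x1 + 0.05 * rng.normal(size=200)     # almost a copy of x1

X = np.column_stack([x1, x2, x3])
vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])
```

The two nearly identical columns produce very large VIF values, while the unrelated column stays close to 1.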

In [None]:
# Import the VIF package.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a VIF dataframe.
vif_data = pd.DataFrame()
vif_data['feature'] = df.columns
  
# Calculate VIF for each feature.
vif_data['VIF'] = [variance_inflation_factor(df.values, i)
                          for i in range(len(df.columns))]

# View the output.
vif_data

The VIF values of Duration, Campaign and Target are less than 10 (even below 5), indicating limited correlation between these variables and the others.

Since the columns Quarterly_emp and Price_idx have very high VIF values, we drop them to avoid multicollinearity between the columns and boost the performance of our logistic regression.

In [None]:
# Dropping the columns with VIF > 10 to avoid multicollinearity problems.
df = df.drop(['Price_idx', 'Quarterly_emp'], axis = 1)

# View the DataFrame.
print(df.shape)
df.head()

### Testing linearity with log odds

The Box-Tidwell test can be used to explore whether the independent variables (x) are linearly related to the log odds (Assumption 5). However, the Box-Tidwell test is only applicable to continuous variables. Since we only have one continuous column left, we can do a visual check of the linearity with log odds using this column. When there are multiple continuous columns, or when the linear relationship can’t be seen visually, it’s better to use the Box-Tidwell test.

Another way to test the linearity in question is to plot the continuous independent variables (in our case the Duration column) and to look for an S-shaped curve. This can be done with the following code using the Seaborn statistical plotting library for Python.
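
A complementary numeric check is to group the continuous variable into bins and compute the empirical log odds per bin. The counts below are hypothetical and only illustrate the calculation; they are not taken from the customer data set.

```python
import math

# Hypothetical counts per Duration bin: (bin midpoint, events, non-events).
bins = [(100, 5, 95), (300, 12, 88), (500, 27, 73), (700, 50, 50)]

for midpoint, events, non_events in bins:
    p = events / (events + non_events)
    log_odds = math.log(p / (1 - p))
    print(f"Duration ~{midpoint}: P(Y=1)={p:.2f}, log odds={log_odds:.2f}")

# If the relationship is linear, the log odds rise by a similar
# amount from one equally spaced bin to the next.
log_odds_values = [math.log(e / n) for _, e, n in bins]
steps = [b - a for a, b in zip(log_odds_values, log_odds_values[1:])]
print([round(s, 2) for s in steps])
```

Roughly constant steps between equally spaced bins suggest that the variable is linearly related to the log odds.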

In [None]:
dur = sns.regplot(x = 'Duration', y= 'Target', data= df,
                  logistic= True).set_title("Duration Log Odds Linear Plot")

### Select necessary columns for BLR

Now that we have tested all the assumptions (2.1.2 Logistic regression), the next step is to select the independent variables that we think have an effect on the dependent variable. These independent variables are represented in the columns. Therefore, we will select only the necessary columns for the BLR. Once we select the variables, we fit the logit model and summarise how well it performs by checking the p-values and the pseudo R-squared value.

In [None]:
# To select the necessary columns for BLR
# Name the new DataFrame and specify all the columns for BLR:
nec_cols = df.drop('Target', axis = 1).columns

# Set the independent variable.
X = os_data_X[nec_cols]  

# Set the dependent variable.
y = os_data_y['Target']  

# Create the Logit model object with y and X as parameters.
logit_model = sm.Logit(y, X)

# Fit the model and store the results.
result = logit_model.fit()

# Print the results.
result.summary()


### Determine the accuracy of the model

To check whether the BLR model is working and accurate, we first need to confirm that the LogisticRegression() model can be fitted to the training data.

In [None]:
# Import necessary packages:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Split X and y data sets into ‘train’ and ‘test’ in a 70:30 ratio:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Set LogisticRegression() to logreg.
logreg = LogisticRegression(max_iter=5000) 

# Fit the X_train and y_train data sets to logreg. 
logreg.fit(X_train, y_train) 

The output indicates that the LogisticRegression() model has been fitted. So, let’s proceed to determine the accuracy of the BLR.

In [None]:
# Determine BLR model’s accuracy:
y_pred = logreg.predict(X_test)

print('Accuracy of logistic regression classifier on test set: {:.2f}'\
      .format(logreg.score(X_test, y_test)))

To further test the model’s accuracy, we can also employ a confusion matrix to evaluate the accuracy of the classification.

### Create a confusion matrix

A confusion matrix is a tabular summary of prediction results. Think of a confusion matrix as a specific table layout that allows us to capture the performance of a classification algorithm in a simple visual manner. A confusion matrix is not a metric to evaluate a model. Instead, it provides insights into the predictions. It’s important to learn how to create a confusion matrix because it will help you to understand other classification metrics such as precision and recall.

It’s called a ‘confusion matrix’ because this cross-tabulation helps us assess if our chosen procedure is confusing two classes (i.e. frequently mislabelling one as another). Consider the following:

- each row of the matrix represents the actual number of observations in a given class
- each column represents the predicted number of observations in a given class (or vice versa)
- each cell in the table thus reports the number of observations by actual and predicted class.

A confusion matrix goes deeper than classification accuracy by showing the correct and incorrect (i.e. true or false) predictions of each class. In the case of a binary classification task, a confusion matrix is a 2 x 2 matrix. If there are three different classes, it would be a 3 x 3 matrix, and so on. 
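
The layout described above can be sketched in a few lines of plain Python. The labels below are hypothetical; the function follows the convention that rows are actual classes and columns are predicted classes.

```python
def build_confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical actual and predicted labels for eight observations.
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

for row in build_confusion_matrix(actual, predicted, labels=[0, 1]):
    print(row)
```

The diagonal cells count correct predictions, and the off-diagonal cells show which class was confused with which.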

Now that we have learned about these elements, let’s create a confusion matrix based on the actual and predicted dependent variable values (y_test and y_pred).

In [None]:
# Create the confusion matrix to test classification accuracy in BLR:
# Import the necessary package to create the confusion matrix.
from sklearn.metrics import confusion_matrix

# Create the confusion matrix. Store it under a different name so that
# the imported confusion_matrix() function is not overwritten.
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix.
sns.heatmap(cm, annot=True, fmt='g')

To conclude this worked example, let’s calculate the precision (positive predictive value), recall (sensitivity, or true positive rate), f1-score (the harmonic mean of precision and recall) and support of the model.

In [None]:
# Import the necessary package.
from sklearn.metrics import classification_report  

# Print a report on the model's accuracy.
print(classification_report(y_test, y_pred))  

The high precision scores indicate that most of the customers the model selects are selected correctly. This is an important metric in this case, as we’ll need to use the model to select the most appropriate customers for the new project. In other situations, accuracy might be the most important criterion, because it is a broader measure of how many classifications are correct. In some cases, recall might be important, because we might want to know how many of the customers who should have been selected were, in fact, identified by the model.
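
Because the f1-score combines precision and recall, it can be helpful to see how it behaves when the two differ. The sketch below uses a hypothetical helper name (`harmonic_f1`) and made-up precision and recall values.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: high precision but low recall.
print(round(harmonic_f1(0.90, 0.30), 2))

# Equal precision and recall give an f1-score equal to both.
print(round(harmonic_f1(0.60, 0.60), 2))
```

The harmonic mean is pulled towards the lower of the two values, so a model cannot achieve a high f1-score with strong precision alone.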

## 2.1.6 Practical activity: Building a BLR model

### Set up the environment

In [None]:
# Import all the necessary packages.
import numpy as np
import pandas as pd

In [None]:
# Import the data set.
df = pd.read_csv('breast_cancer_data.csv', 
                 index_col='id')

# View the DataFrame.
df.info()

In [None]:
# Determine if there are any null values.
df.isnull().sum()

In [None]:
# Determine the descriptive statistics.
df.describe()

In [None]:
# Drop the 'Unnamed: 32' column, which holds the null values.
df.drop(labels='Unnamed: 32', axis=1, inplace=True)

In [None]:
# Determine the proportion of each value.
df['diagnosis'].value_counts(normalize=True)

In [None]:
# Determine if the data set is balanced.
df['diagnosis'].value_counts()

### Create the BLR 

In [None]:
# Import necessary packages.
import imblearn
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report

# Specify the columns.
target_col = 'diagnosis'
feature_cols = [c for c in df.columns if c != target_col]

# Set the variables.
X =  df[feature_cols]
y = df[target_col]

# Create the train and test data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Specify and fit the model.
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)

### Calculate accuracy of the model

In [None]:
# Calculate the predicted labels and predicted probabilities on the test set.
# Predict test class.
y_pred = logreg_model.predict(X_test)

# Predict test probability.
y_pp = logreg_model.predict_proba(X_test)

In [None]:
# Create the confusion matrix for your classifier's performance on the test set.
con_mat = confusion_matrix(y_test, y_pred, labels=['M', 'B'])

In [None]:
# Label the confusion matrix: rows are the actual classes and
# columns are the predicted classes.
confusion = pd.DataFrame(con_mat, index=['is_cancer', 'is_healthy'],
                         columns=['predicted_cancer', 'predicted_healthy'])

# View the output.
confusion

In [None]:
# Calculate the accuracy score on the test set.
print(metrics.accuracy_score(y_test, y_pred))

In [None]:
# Create a confusion matrix. Store it under a different name so that
# the imported confusion_matrix() function is not overwritten.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

print(cm)

In [None]:
# Create an accuracy report.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

## 2.1.7 Worked example: Building an MLR model

Remember that there are three types of logistic regression: binary, multinomial, and ordinal. You’ve already built a BLR model consisting of binary data. Here, we’ll investigate multinomial logistic regression (MLR).

### Multinomial logistic regression

MLR is similar to the BLR model you worked on earlier, but MLR can predict the probabilities of the different possible outcomes of a categorical dependent variable, conditional on a set of independent variables. The independent variables can be real-valued, categorical-valued, or binary-valued. For example, while BLR can predict binary outcomes (e.g. yes or no, 1 or 0), MLR can predict one out of k possible outcomes, where k can be any integer greater than 1.
 
MLR explains the relationship between one nominal dependent variable and one or more independent variables. 
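
One common way to turn several linear scores into class probabilities is the softmax function, sketched below. The scores are hypothetical and stand in for the linear predictors of three classes; they are not the fitted values from the oysters model.

```python
import math

def softmax(scores):
    """Turn a list of linear scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for three classes.
scores = [0.2, 1.5, 0.9]
probs = softmax(scores)

print([round(p, 3) for p in probs])
print("Predicted class index:", probs.index(max(probs)))
```

The predicted class is the one with the highest probability, which is how a multinomial model chooses between more than two outcomes.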

#### Prepare the workstation

In [None]:
# Import all the necessary packages: Pandas, NumPy, SciPy, Sklearn, StatsModels.
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
import statsmodels.api as sm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn import metrics 
from sklearn.metrics import confusion_matrix

# Upload the CSV file.
oysters = pd.read_csv('oysters.csv')  

# Print the columns.
oysters.columns

In [None]:
# View the DataFrame.
oysters.info()

In [None]:
# Apply the value_counts() method, and 
# assign the results to a new Series.
oysters_sex=oysters['sex'].value_counts()

# Print the contents.
print(oysters_sex)  

The next steps will be to separate the dependent variable from the independent variables, build the model, create the equation, and test the model’s accuracy. 

#### Set the variables

In [None]:
# Set the independent and dependent variables:
# Set the independent variable.  
X = oysters.drop(['sex'], axis=1) 
# Set the dependent variable. 
y = oysters['sex']   

# Print to check sex column was dropped.
print(list(X.columns.values))  

# Specify the train and test data sets and 
# use 30% as the 'test_size' and a random_state of one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Print the shape of all the train and tests sets.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

The output is correct because 30% of 9,484 values is 2,845.2 (2,846 rounded up as you cannot get a partial oyster). 
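
The arithmetic can be checked quickly. This sketch assumes, as described above, that the size of the test set is rounded up to the next whole observation.

```python
import math

n_samples = 9484
test_size = 0.30

# Round the test set size up, then give the remainder to the train set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```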

#### Build the model

In [None]:
# Import the MinMaxScaler to normalise the data.
from sklearn.preprocessing import MinMaxScaler  

# Create the scaler object and set the target range.
scaler = MinMaxScaler(feature_range = (0,1))

# Fit the scaler on the X_train data set only.
scaler.fit(X_train)

# Scale the X_train data set.
X_train = scaler.transform(X_train)
# Scale the X_test data set with the same fitted scaler.
X_test = scaler.transform(X_test)
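
What the scaling step does can be sketched in plain Python for a single feature. The helper names and the measurements are hypothetical; the point is that the minimum and maximum are learned from the training data and then reused for the test data.

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from the training data only."""
    return min(train_column), max(train_column)

def transform_min_max(column, low, high):
    """Rescale values so that the training range maps onto 0 to 1."""
    return [(value - low) / (high - low) for value in column]

# Hypothetical measurements for one feature.
train = [2.0, 4.0, 6.0, 10.0]
test = [3.0, 12.0]

low, high = fit_min_max(train)
print(transform_min_max(train, low, high))
print(transform_min_max(test, low, high))
```

Test values outside the training range can fall outside 0 to 1, which is expected when the scaler is fitted on the training set alone.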

In [None]:
# Define the MLR model and fit it to the training data.
# Note: the string penalty='none' is accepted by older scikit-learn releases;
# newer releases expect penalty=None instead.
MLR = LogisticRegression(random_state=0, 
                         multi_class='multinomial', 
                         penalty='none', 
                         solver='newton-cg').fit(X_train, y_train)

# Predict the classes of the test data set.
preds = MLR.predict(X_test)

# Retrieve the parameters the model was created with.
params = MLR.get_params()

# Print the parameters.
print(params)

Next, we need to evaluate the MLR intercept and coefficients.

In [None]:
print("Intercept: \n", MLR.intercept_)
print("Coefficients: \n", MLR.coef_)

#### Create the linear equation from the logit model

In this example, we will use a different method to create a linear equation.

You’ll use the MNLogit(statsmodels) function, which is similar to the logistic regression (sklearn.linear_model) we employed earlier. However, it’s from another package. Python has different functions that can be employed to get similar results. It’s important to take note of these similarities as some team members or stakeholders might use a different approach than you. 

In [None]:
# Create the MNLogit model, adding a constant (intercept) to X_train.
logit_model = sm.MNLogit(y_train, sm.add_constant(X_train))

# Fit the model and store the results.
result = logit_model.fit()

# Print the summary report.
print("Summary for Sex:I/M :\n ", result.summary())

#### Determine the accuracy of the model

In [None]:
# Create the confusion matrix as an array:
# y_test as the first argument and the predictions as the second argument.
cmatrix = np.array(confusion_matrix(y_test, preds))

# Create the DataFrame from cmatrix array (rows: actual, columns: predicted).
pd.DataFrame(cmatrix, index=['female', 'infant', 'male'],
             columns=['predicted_female', 'predicted_infant', 'predicted_male'])

In [None]:
# Determine accuracy statistics:
print('Accuracy score:', metrics.accuracy_score(y_test, preds))  

# Create classification report:
class_report=classification_report(y_test, preds)

print(class_report)

The accuracy of the model is 55%, which is too low for it to be useful as a predictive model. It seems that there is only a 48% chance of correctly identifying females when size is used as a variable. Therefore, as breeding programmes are very expensive and time-consuming, this might not be the best way to proceed. While inaccuracy seems to be a negative indicator, we have in fact saved the oyster breeders a lot of wasted time and money.

#### Visualise the model

You might need to create a visualisation of the outputs (confusion matrix) from the MLR model you created, for reporting purposes.

In [None]:
# Import matplotlib to create a visualisation.
import matplotlib.pyplot as plt  

# Define confusion matrix.
cm = confusion_matrix(y_test, preds)  

# Create visualisation for the MLR:
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1, 2), ticklabels=('female', 'infant', 'male'))
ax.yaxis.set(ticks=(0, 1, 2), ticklabels=('female', 'infant', 'male'))

# ax.set_ylim(1.5, -0.5)
for i in range(3):
    for j in range(3):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white', size='xx-large')
        
# Sets the labels.
plt.xlabel('Predictions', fontsize=16)
plt.ylabel('Actuals', fontsize=16)
plt.title('Confusion Matrix', fontsize=15)

plt.show()

The oyster research station needed to accurately determine the sex of mature oysters based on size measurements. Using multinomial logistic regression (MLR), we can now see from the output (which is essentially a confusion matrix) that our model correctly classified 314 females as females, while it incorrectly classified 160 females as infants and 460 females as males. Similarly, 676 infants were accurately classified as infants, while 57 infants were incorrectly identified as females and 110 were incorrectly identified as males. Finally, the model incorrectly classified 304 males as females and 244 males as infants, but it accurately identified 521 males.

## 2.1.10 Worked example: Building a Support Vector Machine (SVM) Model

### Prepare the workstation

In [None]:
#  Import all the necessary packages.
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
import statsmodels.api as sm 

import warnings  
warnings.filterwarnings('ignore')

# Read the data file with Pandas.
df = pd.read_csv('customer_data.csv')  

# Sense-check the data.
df.info()  

### Prepare the data

As with the BLR we need to update the Edu column to eliminate any periods. The same code snippet and explanation applies to the SVM model as with the BLR model.

In [None]:
# Update all the details of the education column:
df.loc[df['Edu'].str.contains('basic'),'Edu' ] = 'pre-school'
df.loc[df['Edu'].str.contains('university'),'Edu' ] = 'uni'
df.loc[df['Edu'].str.contains('high'),'Edu' ] = 'high-school'
df.loc[df['Edu'].str.contains('professional') ,'Edu'] = 'masters'
df.loc[df['Edu'].str.contains('illiterate'),'Edu' ] = 'other'
df.loc[df['Edu'].str.contains('unknown'),'Edu' ] = 'other'

# Display all the unique values/check changes.
df['Edu'].unique()  

### Create dummy variables
The next step is to create the dummy variables (to account for the effects of one or more nominal-scale variables on the dependent variable). The process, explanations and code snippet are the same as with the BLR model.
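
What a dummy variable looks like can be sketched without pandas. The helper `make_dummies` and the `status` values are hypothetical; the output mirrors the idea of one 0/1 column per category with a shared prefix.

```python
def make_dummies(values, prefix):
    """Return one 0/1 column per distinct value, keyed by 'prefix_value'."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [int(v == category) for v in values]
        for category in categories
    }

# Hypothetical values for a categorical column.
status = ['single', 'married', 'single', 'divorced']

for name, column in make_dummies(status, 'Status').items():
    print(name, column)
```

Each row has a 1 in exactly one of the new columns, which is why the DataFrame grows wider as the number of categories increases.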


In [None]:
# Convert categorical variables to dummy variables:
cat_vars = ['Occupation', 'Status', 'Edu', 'House',
            'Loan', 'Comm', 'Month', 'DOW', 'Last_out']

# Create dummy columns for each categorical variable and join them to df.
for var in cat_vars:
    cat_list = pd.get_dummies(df[var], prefix=var)
    df = df.join(cat_list)

# List all column names, then keep those that are not
# the original categorical columns.
df_vars = df.columns.values.tolist()
to_keep = [i for i in df_vars if i not in cat_vars]

# Define a new DataFrame.
df_fin = df[to_keep]

# Print the column names.
df_fin.columns.values

### Balance the data

The SVM is a classifier rather than a regression model, but you still need to separate the features (X) from the target (y) and apply the SMOTE process to balance the data. Before you apply the SMOTE process, identify the necessary columns that should be part of the SVM model. It will be the same columns as with the BLR model.

The same code snippet and explanations apply as the previous practical activity, only the order differs. With the BLR model, we first applied the SMOTE process and then selected the necessary columns. 

In [None]:
# Create the DataFrame to use as df_fin and replace missing values with zero:
df_fin = df_fin.fillna(0)

# Specify only the necessary columns for the SVM model:
nec_cols = [ 'Status_divorced', 'Status_married',
            'Status_single', 'Status_unknown', 
            'Edu_high-school', 'Edu_masters', 
            'Edu_other', 'Edu_pre-school', 
            'Edu_uni', 'House_no', 'House_unknown',
            'House_yes', 'Loan_no', 'Loan_unknown', 
            'Loan_yes', 'DOW_fri', 'DOW_mon']

# Set the variables.
X = df_fin[nec_cols]  
y = df_fin.loc[:, df_fin.columns == 'Target']   

# Create a new DataFrame and 
# apply SMOTE as the target variable is not balanced.
os = SMOTE(random_state=0) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Specify column values.
columns = X_train.columns   

# Specify the new data sets.   
os_data_X,os_data_y=os.fit_resample(X_train, y_train) 

# Create two DataFrames: one for X and one for y.
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['Target'])

# Print the DataFrame.
print("length of oversampled data is ",len(os_data_X)) 

os_data_y

#### Checking to see if data is balanced.

In [None]:
# Determine if values in a column are balanced.
os_data_y['Target'].value_counts()  

The data is now balanced, as there are equal numbers of 0 and 1 values.

### Build and apply the SVM Model

Next, you will build and apply the SVM model. This code snippet is different from the BLR model as it is specific to an SVM model.
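
As a rough sketch of what a fitted linear-kernel SVM does at prediction time, the snippet below classifies a point by the side of a separating hyperplane it falls on. The weights and bias are hypothetical; a fitted model would learn them from the training data.

```python
def linear_svm_predict(x, weights, bias):
    """Classify by which side of the hyperplane w.x + b = 0 the point falls on."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

# Hypothetical weights and bias for two features.
weights = [1.0, -2.0]
bias = 0.5

print(linear_svm_predict([3.0, 1.0], weights, bias))   # score = 1.5
print(linear_svm_predict([0.0, 1.0], weights, bias))   # score = -1.5
```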

In [None]:
# Import the necessary packages. 
from sklearn import svm  
from sklearn.metrics import confusion_matrix  

# Create an SVM classifier using a linear kernel.
clf = svm.SVC(kernel='linear', gamma='scale')

# Train the model using the training sets.
# Pass the Target column as a 1D Series, which is the shape fit() expects for y.
clf.fit(os_data_X, os_data_y['Target'])

# Predict the response for the test data set.
y_pred = clf.predict(X_test)  

### Determine the accuracy of the model

Finally, the model was built and fitted to the data set. Let’s prepare the confusion matrix and accuracy report to determine how the SVM and BLR models compare. However, there is also another way to calculate the accuracy, precision, and recall of the model.

In [None]:
#  Import the scikit-learn metrics module for an accuracy calculation:
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Print the confusion matrix.
print(confusion_matrix(y_test, y_pred))  

# Specify model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Specify model precision: what percentage of 
# positive tuples are labelled as such?
print("Precision:",metrics.precision_score(y_test, y_pred))

# Specify model recall: how good is the model at 
# correctly predicting positive classes?
print("Recall:",metrics.recall_score(y_test, y_pred))

The confusion matrix indicates (from left to right and top to bottom) there are:

7,360 true negatives
3,621 false positives
789 false negatives
587 true positives.

The accuracy of the model is 64%, precision is 14% and recall is 43%.
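The three scores can be reproduced by hand from the four counts. The sketch below assumes scikit-learn's usual layout for binary labels 0 and 1, `[[TN, FP], [FN, TP]]`, and uses only the numbers quoted above; the all-zero baseline at the end is an added illustration, not part of the original analysis.

```python
# Counts from the confusion matrix above, in scikit-learn's layout
# [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp = 7360, 3621
fn, tp = 789, 587

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted 1s that are really 1
recall = tp / (tp + fn)        # share of actual 1s that the model finds

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")

# Baseline for context: always predicting 0 on this test set.
baseline_accuracy = (tn + fp) / total
print(f"All-zero baseline accuracy: {baseline_accuracy:.2f}")
```

Because the test set contains far more 0s than 1s, a model that always predicts 0 would score higher on accuracy than the SVM while finding no positive cases at all. This is one reason to read precision and recall alongside accuracy.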

The BLR model reported an accuracy of 85%, while the SVM reported only 64%. So, which one can we trust? Next, you'll evaluate the accuracy of your BLR model by building an SVM, which will allow you to compare the two models. 

## 2.1.11 Practical activity: Build and fit an SVM model

In this activity, you will evaluate the accuracy of the BLR model built in the previous activity of this week. You will do this by building an SVM and comparing the two models.

### Prepare the workstation

In [None]:
# Import all the necessary packages.
import numpy as np
import pandas as pd

### Get the data

In [None]:
# Import the data set.
df = pd.read_csv('breast_cancer_data.csv', 
                 index_col='id')

# View the DataFrame.
df.info()

In [None]:
# Determine the number of null values.
df.isnull().sum()

In [None]:
# Determine the descriptive statistics.
df.describe()

In [None]:
# Drop the 'Unnamed: 32' column, which holds the null values.
df.drop(labels='Unnamed: 32', axis=1, inplace=True)

In [None]:
# Count the values.
df['diagnosis'].value_counts(normalize=True)

In [None]:
# Determine if the data set is balanced.
df['diagnosis'].value_counts()

### Create an SVM Model

In [None]:
# Import necessary packages.
import imblearn
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

# Select columns.
target_col = 'diagnosis'
feature_cols = [c for c in df.columns if c != target_col]

# Set the variables.
X = df[feature_cols]
y = df[target_col]

# Create the train and test data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
from sklearn import svm

# Create an SVM Classifier.
clf = svm.SVC(kernel='linear', gamma='scale') 

# Train the model using the training sets only,
# so the test data stays unseen until prediction.
clf.fit(X_train, y_train)

# Predict the response for the test data set.
y_pred = clf.predict(X_test)

### Calculate the accuracy of the model

In [None]:
# Import necessary packages.
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Create a confusion matrix.
# (Use a new name so the imported function is not overwritten.)
cm = confusion_matrix(y_test, y_pred)

# View the confusion matrix.
print(cm)

In [None]:
# Import necessary package.
from sklearn.metrics import classification_report

# Print the output.
print(classification_report(y_test, y_pred))

In [None]:
# Print the accuracy score.
print(metrics.accuracy_score(y_test, y_pred))

## Demonstration: fitting a classification decision tree model

### Prepare workstation

In [None]:
# Import all necessary libraries.
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
from sklearn import metrics
# Provides classes and functions to estimate many different statistical methods.
import statsmodels.api as sm  

# Note: SMOTE oversamples the minority class to balance the data.
from imblearn.over_sampling import SMOTE  
# Note: Splits the data into training and test sets.
from sklearn.model_selection import train_test_split

# Note: Indicates situations that aren’t necessarily exceptions.
import warnings  
warnings.filterwarnings('ignore')  

# Read the provided CSV file/data set.
df = pd.read_csv('customer_data.csv')  

# Print a summary of the DataFrame to sense-check it.
df.info()  

### Update Variables in Edu Columns

As with the BLR and SVM models, we need to update the Edu column to eliminate any periods.

In [None]:
# [1] Update all the details of the education column:
df.loc[df['Edu'].str.contains('basic'),'Edu' ] = 'pre-school'
df.loc[df['Edu'].str.contains('university'),'Edu' ] = 'uni'
df.loc[df['Edu'].str.contains('high'),'Edu' ] = 'high-school'
df.loc[df['Edu'].str.contains('professional') ,'Edu'] = 'masters'
df.loc[df['Edu'].str.contains('illiterate'),'Edu' ] = 'other'
df.loc[df['Edu'].str.contains('unknown'),'Edu' ] = 'other'

# [2] Display all the unique values/check changes.
df['Edu'].unique() 

### Create dummy variables

The next step is to create the dummy variables (to account for the effects of one or more nominal-scale variables on the dependent variable). When fitting a decision tree, however, you might not need to create dummy variables, for example, if the values are already discrete.
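To make the idea concrete, here is a minimal plain-Python sketch of what dummy (one-hot) encoding does to a single column. The `one_hot` helper and the sample values are hypothetical; in the notebook itself, `pd.get_dummies` does this work.

```python
def one_hot(values, prefix):
    """Build one 0/1 indicator column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical sample of the Edu column after cleaning.
edu = ['uni', 'high-school', 'uni', 'other']

for name, column in one_hot(edu, prefix='Edu').items():
    print(name, column)
```

Each row has a 1 in exactly one of the new columns, which is how a nominal variable becomes usable as numeric input.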

In [None]:
# Convert the categorical variables to dummy variables:
cat_vars = ['Occupation','Status','Edu','House','Loan',
            'Comm','Month','DOW','Last_out']

# For each categorical column, build the dummy columns
# and join them to the DataFrame.
for var in cat_vars:  
    cat_list = pd.get_dummies(df[var], prefix=var)  
    df = df.join(cat_list) 

# List all the columns, then keep every column
# except the original categorical ones.
df_vars = df.columns.values.tolist()  
to_keep = [i for i in df_vars if i not in cat_vars] 

# Define the new DataFrame.
df_fin = df[to_keep]  

# View the column names.
df_fin.columns.values 

### Balance the data

The classification decision tree is not a regression model, so the BLR assumptions checked earlier do not apply to it. We still need to set the variables (X and y) to train the model. For consistency in comparing the final result with the BLR and SVM models, we also apply the SMOTE process to balance the data.
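The core idea behind SMOTE is to create a synthetic minority-class sample somewhere on the line between an existing minority sample and one of its neighbours. The sketch below shows only that interpolation step, under stated assumptions: the `synthetic_point` helper and the two sample points are hypothetical, and the nearest-neighbour search that imblearn performs is left out.

```python
import random

def synthetic_point(point, neighbour, rng):
    """Create a point on the line segment between two minority samples."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(0)

# Two hypothetical minority-class samples with two numeric features each.
a = [1.0, 4.0]
b = [3.0, 8.0]

new_point = synthetic_point(a, b, rng)
print(new_point)
```

Repeating this step until both classes have the same number of rows is what balances the training data.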

In [None]:
# Create a DataFrame to use as df_fin and replace missing values with zero.
df_fin = df_fin.fillna(0)  

# Select necessary columns: 
nec_cols = [ 'Status_divorced', 'Status_married',
            'Status_single', 'Status_unknown', 
            'Edu_high-school', 'Edu_masters', 
            'Edu_other', 'Edu_pre-school', 
            'Edu_uni', 'House_no', 'House_unknown',
            'House_yes', 'Loan_no', 'Loan_unknown', 
            'Loan_yes', 'DOW_fri', 'DOW_mon']

X = df_fin[nec_cols]
y = df_fin['Target']

# Create a new DataFrame and 
# apply SMOTE as the target variable is not balanced.
os = SMOTE(random_state=0)  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Specify column values.
columns = X_train.columns  
# Specify the new data sets.
os_data_X,os_data_y=os.fit_resample(X_train, y_train)  

# Create one DataFrame for X and one for y:
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['Target'])

# Print/check the DataFrame:
print("length of oversampled data is ",len(os_data_X))

os_data_y

In [None]:
# Determine if the values in the Target column are balanced.
os_data_y['Target'].value_counts()  

### Build and fit the decision tree model

Next, you will build and apply the classification decision tree model. This code snippet is different from the BLR and SVM models as it is specific to a classification decision tree model. 

In [None]:
# Import the DecisionTreeClassifier class from sklearn. 
from sklearn.tree import DecisionTreeClassifier  

# Create a classification decision tree classifier object as dtc: 
dtc = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)

# Train the decision tree classifier.
dtc = dtc.fit(os_data_X, os_data_y) 

# Predict the response for the test data set.
y_pred = dtc.predict(X_test)  

### Determine the accuracy of the model

Finally, let’s prepare the confusion matrix and accuracy report to determine how the decision tree compares with the SVM and BLR models. Recall that a confusion matrix is a tabular summary of the prediction results that provides insights into the predictions and helps analysts understand key classification metrics.

A confusion matrix can go deeper than classification accuracy by also showing correct and incorrect (i.e. true or false) predictions of each class.
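As an illustration of how the four cells are counted, here is a small plain-Python sketch. The `confusion_counts` helper and the labels are hypothetical; the layout follows scikit-learn's convention of actual classes in rows and predicted classes in columns.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row = actual class, column = predicted class.
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels for eight customers.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

print(confusion_counts(y_true, y_pred))  # [[3, 2], [1, 2]]
```

Reading the result: three true negatives, two false positives, one false negative, and two true positives.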

In [None]:
# Import scikit-learn metrics module for accuracy calculation:
from sklearn.metrics import confusion_matrix

# Use the print() function to display the confusion matrix results:
print(confusion_matrix(y_test, y_pred))

# Metrics for accuracy.
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) 

# Metrics for precision. 
print("Precision:",metrics.precision_score(y_test, y_pred)) 

# Metrics for recall.
print("Recall:",metrics.recall_score(y_test, y_pred)) 

In [None]:
# You can also use the following code to generate the classification report:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

We can note the following from these results:

The accuracy of the model is 72.7%, indicating the model is somewhat accurate at correctly identifying relevant customers versus irrelevant customers. (Hint: Remember that you have to use the weighted average.)
A precision score of 14.8% is very low, which indicates that many of the selected customers did not, in fact, fit the required profile (i.e. there were 2,420 false positives, to be precise). (Hint: We have specified the customers that fit the profile as 1.)
The recall score of 30.5% is relatively low, indicating that the model misses many of the customers who do fit the profile (false negatives). (Hint: We have specified the customers that fit the profile as 1.)
These scores are not necessarily a problem if the business would rather identify some people who don’t fit the customer profile than miss out on potential customers who do. 

How do these scores compare with the output of the BLR and SVM models? Recall the BLR model predicted an 85% fit, and the SVM predicted only a 64% fit. The results of all three tests can be tabulated as follows:


     MODEL              ACCURACY (FIT)           PRECISION           RECALL
     
     BLR                    85%                     85%               85%
     SVM                    64%                     14%               43%
     Decision Tree          72.7%                   14.8%             30.5%



Summary of the test results of the binary logistic regression (BLR), support vector machine (SVM), and decision tree models

So, which model is more accurate and, therefore, the one we should use?

### Visualise the Model

You might need to create a visualisation of the classification decision tree model you created. The main reason would be for reporting purposes.

In [None]:
# Import matplotlib to create a visualisation 
# and the tree package from sklearn:
import matplotlib.pyplot as plt 
from sklearn import tree

# Plot the decision tree to create the visualisation:
fig, ax = plt.subplots(figsize=(10, 10))
tree.plot_tree(dtc, fontsize=10)

# Print the plot with plt.show().
plt.show()  

You can change the number of levels of the decision tree by adjusting the value of max_depth. Play around with this and change it to 3, 6, 10 or any number you prefer. Note that max_depth limits how deep the tree is allowed to grow, so the model must be refitted and plotted again for the change to show. 

In [None]:
# Change the depth of the decision tree, then refit and re-plot it:
dtc = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)
dtc = dtc.fit(os_data_X, os_data_y)
tree.plot_tree(dtc, fontsize=10)

## Regression Decision Tree - Worked Example

Decision trees can solve both classification and regression problems, which is why decision trees are also known by the umbrella term ‘CART algorithm’ (classification and regression trees). You’ve just worked through an example of how to use and check the accuracy of a classification decision tree (which, recall, predicts fixed or categorical target variables). In this example, you’ll learn how to do the same with a regression decision tree model. 

Regression trees predict continuous variables and answer questions like, ‘What predictions can be made about output Y given new input X?’ A regression decision tree uses an algorithm to fit the target value (i.e. the value you want to predict, like a stock’s value) using each of the independent variables (variables that affect the stock’s price). To do this, the available data is split at several points for each independent variable. At each candidate split point, the sum of squared errors (SSE) between the predicted and actual values is calculated, and the split with the lowest SSE is chosen. Remember that the target variables should be numeric or continuous to fit a regression decision model. 
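The split search can be sketched in a few lines of plain Python for a single independent variable. This is a simplified illustration, not scikit-learn's implementation: the `sse` and `best_split` helpers and the data are hypothetical, and a real regression tree repeats the search over every variable and then recursively within each resulting group.

```python
def sse(values):
    """Sum of squared errors around the mean of the values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the single best split on xs."""
    best = None
    candidates = sorted(set(xs))
    # Try a threshold halfway between each pair of neighbouring x values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical data: the target jumps once x passes 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]

threshold, total = best_split(xs, ys)
print(threshold, round(total, 2))
```

The search settles on the threshold between 3 and 4, because splitting there leaves each side with values close to its own mean.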

### Business Problem

Adika Wati, a data analyst from Malaysia, is employed by an online electronic retailer, Electronics Online (EO). Mr Adam Wheeler, a manager at EO, is busy compiling the annual report to be presented at a board meeting and wants to indicate whether the company performed well and how accurate the forecasts are, based on historical data. 

The board will likely ask what predictions can be made about the stock price given new inputs, which requires a model that outputs continuous variables. Therefore, to assist, Adika is asked to build a regression decision tree model. After building this model, he will test its accuracy.

#### Prepare the workstation

In [None]:
# Import all necessary libraries:
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
# Note: Provides classes and functions to estimate many
# different statistical methods.
import statsmodels.api as sm  

from imblearn.over_sampling import SMOTE 
from sklearn.model_selection import train_test_split

# Note: Indicates situations that aren’t necessarily exceptions.
import warnings  
# Filter out any warning messages.
warnings.filterwarnings('ignore')  

# Read the provided CSV file/data set.
df = pd.read_csv('ecommerce.csv')

# Print a summary of the DataFrame to sense-check it.
df.info()

#### Build and fit the model

In [None]:
# Select every column except Median_s 
# to use as the independent variables.
cols = df.columns[df.columns != 'Median_s']  

# Specify ‘X’ as the independent variables 
# and ‘y’ as the dependent variable:
X = df[cols]
y = df['Median_s']

# Split the data into training (70%) and testing (30%) sets:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0)

# Import the ‘DecisionTreeRegressor’ class from sklearn.
from sklearn.tree import DecisionTreeRegressor  

# Create the ‘DecisionTreeRegressor’ object 
# (the class has many parameters; here only random_state=0 is set):
regressor = DecisionTreeRegressor(random_state=0)

# Fit the regressor object to the data set.
regressor.fit(X_train,y_train)  

#### Determine the accuracy of the model

Testing the accuracy of the regression decision tree model works slightly differently from the previous models you have tested. Because the regression decision tree predicts a continuous value rather than a class, its accuracy is measured with error metrics: the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE). 

The MAE is the average absolute error (i.e. calculate the absolute differences between each predicted value and its corresponding actual value and then take the average of these). MAE indicates, on average, how large an (absolute) error we can expect between predicted and actual output, which makes MAE a very popular method to determine the accuracy of a regression decision tree in industry forecasts.

The MSE is the average squared error (i.e. calculate the squared differences between each predicted value and its corresponding actual value, and then take the average of these); RMSE is its square root (RMSE = √MSE). Once calculated, you can compare the RMSE and MAE to determine whether the forecast contains large but infrequent errors. The larger the difference between them, the more inconsistent the error size.
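A small hand calculation shows how the gap between RMSE and MAE reacts to one large error. The sketch below uses only the standard library; the `regression_errors` helper and both sets of forecasts are hypothetical.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e ** 2 for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

actual = [10.0, 10.0, 10.0, 10.0]

# Hypothetical forecasts: four errors of equal size ...
steady = [12.0, 8.0, 12.0, 8.0]
# ... versus the same total absolute error packed into one large miss.
spiky = [10.0, 10.0, 10.0, 18.0]

for name, predicted in [('steady', steady), ('spiky', spiky)]:
    mae, mse, rmse = regression_errors(actual, predicted)
    print(f"{name}: MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f}")
```

Both forecasts have an MAE of 2, but the RMSE is 2 for the steady errors and 4 for the single large miss, so the gap widens only when the error sizes are inconsistent.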

In [None]:
# Import the necessary packages:
from sklearn import metrics
import math

# Predict the response for the data test.
y_predict = regressor.predict(X_test)  

# Specify to print the MAE and MSE (to evaluate the accuracy of the new model):
print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, y_predict))
print("Mean Squared Error: ", metrics.mean_squared_error(y_test, y_predict))
# Calculate and print the RMSE.
print("Root Mean Squared Error: ", 
     math.sqrt(metrics.mean_squared_error(y_test, y_predict)))  

Fantastic! The output indicates that the MAE is 3.26, the MSE is 26.4, and the RMSE is 5.14, rounded to two decimals. What does the output imply? Let’s compare RMSE and MAE. The difference between them is RMSE - MAE = 5.14 - 3.26 = 1.88. Here, 1.88 is considered a small difference since it is close to 0. Differences close to 0 indicate that the errors are of a consistent size, with no large, infrequent errors in the forecast. 

## 2.2.4 Creating a decision tree

### Prepare the workstation

In [None]:
# Import all the necessary packages.
import numpy as np
import pandas as pd

### Import & view data set

In [None]:
# Import the data set.
df = pd.read_csv('breast_cancer_data.csv', 
                 index_col='id')

# View the DataFrame.
df.info()

In [None]:
# Determine the number of null values.
df.isnull().sum()

In [None]:
# Determine the descriptive statistics.
df.describe()

In [None]:
# Drop the 'Unnamed: 32' column, which holds the null values.
df.drop(labels='Unnamed: 32', axis=1, inplace=True)

In [None]:
# Count the values.
df['diagnosis'].value_counts(normalize=True)

In [None]:
# Determine if data set is balanced.
df['diagnosis'].value_counts()

### Create the decision tree

In [None]:
# Import the necessary packages.
import imblearn
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

# Set the columns.
target_col = 'diagnosis'
feature_cols = [c for c in df.columns if c != target_col]

# Set the variables.
X = df[feature_cols]
y = df[target_col]

# Split the data set into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Import necessary package.
from sklearn.tree import DecisionTreeClassifier 

# Create a Decision Tree classifier object.
dtc = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=1)

# Train the Decision Tree Classifier on the training set only,
# so the test data stays unseen until prediction.
dtc = dtc.fit(X_train, y_train)

# Predict the response for the test data set.
y_pred = dtc.predict(X_test)

### Calculate the accuracy of the model

In [None]:
# Import necessary packages.
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Create a confusion matrix.
# (Use a new name so the imported function is not overwritten.)
cm = confusion_matrix(y_test, y_pred)

# View the confusion matrix.
print(cm)

In [None]:
# Same option as with previous models for comparison between models.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

### Plot the decision tree

In [None]:
# Import necessary packages.
import matplotlib.pyplot as plt
from sklearn import tree

# Plot the decision tree based on the Gini Index.
fig, ax = plt.subplots(figsize=(10, 10))
tree.plot_tree(dtc, fontsize=10)

plt.show()

## 2.2.6 Regression Random Forests

### Prepare the workstation

In [None]:
# Import all necessary libraries:
import pandas as pd 
import numpy as np 
import scipy as scp
import sklearn
# Provides classes and functions to estimate many different statistical methods.
import statsmodels.api as sm  

from imblearn.over_sampling import SMOTE 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import metrics 
from sklearn.metrics import confusion_matrix

# Indicates situations that aren’t necessarily exceptions.
import warnings  
warnings.filterwarnings('ignore')  

# Read the provided CSV file/data set.
df = pd.read_csv('ecommerce.csv')  

# Print a summary of the DataFrame to sense-check it.
df.info()  

### Build and fit the model

In [None]:
# Prepare the data for the Regression Random Forest (RRF): use the first 11 columns as X and the 12th column as y:
X = df.iloc[:, 0:11].values
y = df.iloc[:, 11].values

In [None]:
# Import the train_test_split package:
from sklearn.model_selection import train_test_split

# Split the data set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Import the random forest regressor class:
from sklearn.ensemble import RandomForestRegressor

# Create the regressor object:
regressor = RandomForestRegressor(n_estimators=5, 
                                  random_state=0, 
                                  n_jobs=2)

# Fit the regressor to the data set and predict the y variable:
regressor.fit(X_train, y_train)

# Set y_pred.
y_pred = regressor.predict(X_test)  

# No output means that no errors were found; model was built and fitted correctly.

### Check the accuracy of the model

In [None]:
# Import the metrics package.
from sklearn import metrics  

# Calculate and display the metrics:
print("Mean Absolute Error:", metrics.mean_absolute_error(y_test, y_pred)) 
print("Mean Squared Error:", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output indicates that the MAE is 2.65, the MSE is 20.42, and the RMSE is 4.52 (rounded to two decimals). To understand what this implies, let’s compare RMSE and MAE. The difference between them is RMSE - MAE = 4.52 - 2.65 = 1.87. 

Just like the 1.88 you determined for the regression decision tree, 1.87 is a relatively small number and indicates that there are no large errors in the prediction.

## 2.2.7 Creating a Random Forest

### Prepare the workstation and Data

In [None]:
# Import all the necessary packages.
import numpy as np
import pandas as pd

# Import data into Python.
df = pd.read_csv('breast_cancer_data.csv', 
                 index_col='id')

# View the DataFrame.
df.info()

In [None]:
# Determine the number of null values.
df.isnull().sum()

In [None]:
# Determine descriptive statistics.
df.describe()

In [None]:
# Drop the 'Unnamed: 32' column, which holds the null values.
df.drop(labels='Unnamed: 32', axis=1, inplace=True)

In [None]:
# Count the values.
df['diagnosis'].value_counts(normalize=True)

In [None]:
# Determine if the data set is balanced.
df['diagnosis'].value_counts()

### Create a random forest

In [None]:
# Import the necessary packages.
import imblearn
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

# Set columns.
target_col = 'diagnosis'
feature_cols = [c for c in df.columns if c != target_col]

# Set variables.
X = df[feature_cols]
y = df[target_col]

# Create test and train data sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Import necessary package.
from sklearn.ensemble import RandomForestClassifier

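# Create a model.
# ('sqrt' is what 'auto' used to mean for classifiers;
# recent scikit-learn releases no longer accept 'auto'.)
forest=RandomForestClassifier(n_estimators=200, criterion='gini', 
                              min_samples_split=2, min_samples_leaf=2, 
                              max_features='sqrt', bootstrap=True, n_jobs=-1, 
                              random_state=42)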

forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

### Calculate the accuracy of the model

In [None]:
# Import necessary package.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

### Plot the random forest

In [None]:
# Import necessary packages.
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.tree import export_graphviz

fig, axes = plt.subplots(nrows = 1, ncols = 1,figsize = (4,4), dpi=800)
tree.plot_tree(forest.estimators_[0],
               filled = True);
fig.savefig('rf_individualtree.png')

## 2.3.4 Worked Example: K-means clustering

### Prepare the workstation

In [None]:
# Import libraries.
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Load the data.
df = pd.read_csv('fruit.csv')

# View the DataFrame.
df.info()

### Prepare the Data

In [None]:
# Drop unnecessary columns
df_fruit = df.drop(columns=['tree_age', 'location', 'colour_blossom'])

# Display a summary of the numeric variables.
df_fruit.describe()

### Visualise the Data

Visualisations help us to understand what we are working with. For example, are there any visible clusters, correlations and outliers? Let's plot a scatterplot and a pairplot to understand the data set.

In [None]:
# Import Seaborn and Matplotlib.
from matplotlib import pyplot as plt
import seaborn as sns

# Create a scatterplot with Seaborn.
sns.scatterplot(x='sepal_length', y='sepal_width',
                data=df_fruit, hue='fruit_type')


# Create a pairplot with Seaborn.
x = df_fruit[['sepal_length', 'sepal_width']]

sns.pairplot(df_fruit, vars=x,
             hue='fruit_type', diag_kind= 'kde')

Although many data points overlap, three groups are visible. How can you improve the accuracy or visibility of the three clusters? Let's investigate the elbow and silhouette methods.
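The elbow method relies on the within-cluster sum of squares, which scikit-learn stores as `inertia_` on a fitted KMeans model. As a rough illustration of what that number measures, the plain-Python sketch below computes it by hand; the `inertia` helper, the points, and the centroids are all hypothetical.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for px, py in points:
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids)
    return total

# Hypothetical 2-D points forming two tight groups.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]

one_cluster = [(3.0, 3.5)]                # centroid of all four points
two_clusters = [(1.0, 1.5), (5.0, 5.5)]   # one centroid per group

print(inertia(points, one_cluster))   # 33.0
print(inertia(points, two_clusters))  # 1.0
```

The sharp drop from one cluster to two, followed by much smaller gains for further clusters, is the "elbow" that the chart is meant to reveal.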

### Improve the accuracy

In [None]:
# Import the KMeans class.
from sklearn.cluster import KMeans 

# Elbow chart for us to decide on the number of optimal clusters.
cs = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    cs.append(kmeans.inertia_)

plt.plot(range(1, 11), cs, marker='o')
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("CS")

plt.show()

In [None]:
# Now try silhouette method

# Import silhouette_score class from sklearn.
from sklearn.metrics import silhouette_score

# Find the range of clusters to be used using silhouette method.
sil = []
kmax = 10

for k in range(2, kmax+1):
    # Fix n_init and random_state so the scores are reproducible.
    kmeans_s = KMeans(n_clusters = k, n_init = 10, random_state = 0).fit(x)
    labels = kmeans_s.labels_
    sil.append(silhouette_score(x, labels, metric = 'euclidean'))

# Plot the silhouette method.
plt.plot(range(2, kmax+1), sil, marker='o')

plt.title("The Silhouette Method")
plt.xlabel("Number of clusters")
plt.ylabel("Sil")

plt.show()

According to the graph, the optimal number of clusters will be four. Let’s investigate whether the accuracy improves, by testing k=3 and k=4. 
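The silhouette chart is read by looking for the highest score, and the same reading can be done in code. This is a minimal sketch: the `best_k` helper is hypothetical, and the example scores are made up for illustration rather than taken from the notebook output.

```python
def best_k(scores, k_min=2):
    """Return the cluster count whose silhouette score is highest.

    scores[0] belongs to k_min, scores[1] to k_min + 1, and so on.
    """
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return k_min + best_index

# Hypothetical silhouette scores for k = 2..6 (not the notebook's output).
example_scores = [0.41, 0.44, 0.47, 0.39, 0.36]

print(best_k(example_scores))  # 4
```

In the notebook, the list `sil` built in the silhouette loop could be passed to such a helper, since its first entry corresponds to k = 2.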

In [None]:
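# Use 4 clusters (k = 4).
# Fit on the two feature columns only, so that a previously added
# 'K-Means Predicted' column is never used as an input.
features = ['sepal_length', 'sepal_width']
kmeans = KMeans(n_clusters = 4, max_iter = 15000, init='k-means++', random_state=0).fit(x[features])
clusters = kmeans.labels_
x['K-Means Predicted'] = clusters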

# Plot the predicted.
sns.pairplot(x, hue='K-Means Predicted', diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

In [None]:
# Visualise the clusters
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='sepal_length' , 
                y ='sepal_width',
                data=x , hue='K-Means Predicted',
                palette=['red', 'green', 'blue', 'black'])

In [None]:
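# Use 3 clusters (k = 3): evaluate and fit the model.
# Fit on the two feature columns only; x already contains the
# 'K-Means Predicted' column from the k = 4 run, which must not be an input.
features = ['sepal_length', 'sepal_width']
kmeans = KMeans(n_clusters = 3, max_iter = 15000, init='k-means++', random_state=0).fit(x[features])
clusters = kmeans.labels_
x['K-Means Predicted'] = clusters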

# Plot the predicted.
sns.pairplot(x, hue='K-Means Predicted', diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='sepal_length' , 
                y ='sepal_width',
                data=x , hue='K-Means Predicted',
                palette=['red', 'green', 'blue'])

## 2.3.5 Practical Activity: K means clustering

### Prepare Workstation

In [None]:
# Import all the necessary packages.
import numpy as np
import pandas as pd

import warnings  
warnings.filterwarnings('ignore')

In [None]:
# Import the data into Python.
df_ais = pd.read_csv('ais.csv')

# View the output.
df_ais.info()

In [None]:
# Determine the number of null values.
df_ais.isnull().sum()

### Evaluate the variables

In [None]:
# Determine descriptive statistics.
df_ais.describe()

In [None]:
# List column names.
df_ais.columns

### Visualise Data

In [None]:
# Import Seaborn and Matplotlib.
from matplotlib import pyplot as plt
import seaborn as sns

# Create a scatterplot with Seaborn.
sns.scatterplot(x='lbm', y='bmi',
                data=df_ais, hue='sex')


# Create a pairplot with Seaborn.
x = df_ais[['lbm', 'bmi']]

sns.pairplot(df_ais, vars=x,
             hue='sex', diag_kind= 'kde')

### Improve Accuracy

#### Elbow Method

In [None]:
# Import the KMeans class.
from sklearn.cluster import KMeans 

# Elbow chart for us to decide on the number of optimal clusters.
cs = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', 
                    max_iter = 500, n_init = 10, random_state = 0)
    kmeans.fit(x)
    cs.append(kmeans.inertia_)

plt.plot(range(1, 11), cs, marker='o')
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("CS")

plt.show()

#### Silhouette method

In [None]:
# Import silhouette_score class from sklearn.
from sklearn.metrics import silhouette_score

# Find the range of clusters to be used using silhouette method.
sil = []
kmax = 10

for k in range(2, kmax+1):
    # Fix n_init and random_state so the scores are reproducible.
    kmeans_s = KMeans(n_clusters = k, n_init = 10, random_state = 0).fit(x)
    labels = kmeans_s.labels_
    sil.append(silhouette_score(x, labels, metric = 'euclidean'))

# Plot the silhouette method.
plt.plot(range(2, kmax+1), sil, marker='o')

plt.title("The Silhouette Method")
plt.xlabel("Number of clusters")
plt.ylabel("Sil")

plt.show()

#### Evaluate and fit the model

In [None]:
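# Use 5 clusters.
# Fit on the two feature columns only, so the cell can be re-run safely
# after the 'K-Means Predicted' column has been added.
kmeans = KMeans(n_clusters = 5, max_iter = 15000, init='k-means++', random_state=0).fit(x[['lbm', 'bmi']])
clusters = kmeans.labels_
x['K-Means Predicted'] = clusters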

# Plot the predicted.
sns.pairplot(x, hue='K-Means Predicted', diag_kind= 'kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

### Visualise the clusters

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc = {'figure.figsize':(12, 8)})

sns.scatterplot(x='bmi' , 
                y ='lbm',
                data=x , hue='K-Means Predicted',
                palette=['red', 'green', 'blue', 'black', 'orange'])