# The Machine Learning Process



The Machine Learning process is made up of several steps, but at a basic level it can be grouped into the following three stages:


### 1. Data Preprocessing
### 2. Data Modeling
### 3. Model Evaluation



## Example of applying the Machine Learning Process

### Problem Statement

A bank calls on you, a data scientist, to help them analyze some of their customer data. Their problem is that they would like to give loans only to people they are sure will not default on them.

They would like you to use the power of Machine Learning to create a model that can predict if a customer will default on a loan or not.

To do this, you will predict a "Spending score" for each customer. The spending score is a metric that tells the bank whether a customer is eligible for a loan: a customer whose score is above a threshold set by the bank is eligible, and a customer whose score is not above it is not eligible. Once the spending score predicted by your machine learning model is above that threshold, the bank goes ahead and gives the loan to that customer.

The spending score is the output of your model, but what about the input? The input is all the data the bank has about each customer: gender, occupation sector, income, whether the customer is a businessman or an employee, age, and so on. These are used to predict the spending score.
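The threshold rule described above can be sketched in a few lines. The scores and the threshold value below are made up for illustration; in practice the scores would come from the model and the threshold would be set by the bank.

```python
import numpy as np

# Hypothetical predicted spending scores for five customers (illustrative values only).
predicted_scores = np.array([12.5, 48.0, 63.2, 79.9, 95.1])

# Hypothetical threshold; the real value would be chosen by the bank.
threshold = 60

# A customer is eligible when the predicted score is above the threshold.
eligible = predicted_scores > threshold
print(eligible)
```

Each entry of `eligible` is True for a customer who would be offered the loan under this rule and False otherwise.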

### Initial thoughts

The first step in solving this problem is to determine what type of problem it is.

What are we predicting? A continuous variable? A categorical variable?

What Machine Learning type is best to make the prediction? Classification, Regression or Clustering?

What are the possible Machine Learning algorithms to use to perform the classification, regression or clustering?



### Answers to Initial thoughts

- We are predicting a spending score. The spending score can be any value from 1 to 100. These are numeric continuous values. 

- Check out these two links to understand the main types of variables we can have: Continuous, Discrete, Categorical.
(link 1: https://byjus.com/maths/continuous-variable/#:~:text=There%20are%20two%20types%20of%20continuous%20variables%20namely%20interval%20and%20ratio%20variables.)

(link 2: https://statistics.laerd.com/statistical-guides/types-of-variable.php)

- Because we are trying to predict a continuous number, the appropriate Machine Learning type is Regression.

- The Machine Learning Algorithms we will use for this regression are:

1. Linear Regression
2. Support Vector Machines
3. Decision Trees
4. Random Forests

- There are other regression algorithms, which will be discussed later.


Now that we have answered the initial questions, we can proceed to apply the Machine Learning Process described above.

## Data Preprocessing

### Import the data

In [1]:
import pandas as pd
import numpy as np

customer_data = pd.read_csv("Customer_Data.csv")
customer_data.head()


**Note:** In the dataset, the *Gender* column is labelled "Genre", so the two names are used interchangeably in this notebook. How to rename this column is shown later in the notebook.

### Data Inspection

Before we focus on the task at hand, let us discuss some important data inspection tools in pandas.

#### Some common data inspection tools

There are some very useful pieces of code that are usually utilized when performing data inspection. Some of these are given below:

In [None]:
""" .head() or .tail()

    DESCRIPTION
 These methods are used to view the first 5 or last 5 rows in a dataset.
 We can view a different number of rows by passing the number of rows we want as an argument to the method.
"""

customer_data.head()

#customer_data.tail()

#customer_data.head(15)

#customer_data.tail(10)

In [None]:
""".shape

    DESCRIPTION
.shape is an attribute rather than a method, so it is written without brackets. It gives the number of rows and number of columns of the dataset as a tuple in the format:
(Number of rows, Number of columns)
"""

customer_data.shape

In [None]:
""".columns

    DESCRIPTION
This attribute holds the names (labels) of the columns present in a dataset. The number of columns can be obtained with len(customer_data.columns).
"""

customer_data.columns
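Since the inspection tools above depend on the customer file being available, here is a minimal sketch on a small hand-made dataframe. The name `toy` and its values are assumptions for illustration and are not part of the customer dataset.

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Genre": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 52, 41],
})

first_two = toy.head(2)            # first 2 rows instead of the default 5
n_rows, n_cols = toy.shape         # .shape is a tuple: (rows, columns)
column_names = list(toy.columns)   # .columns holds the column labels

print(first_two)
print(n_rows, n_cols)
print(column_names)
```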

In [None]:
""".set_index()

    DESCRIPTION
This is used to set the index of the dataset to the values of a particular column

"""

customer_data.set_index("CustomerID", inplace = True)



"""inplace = True simply tells pandas to apply this change to the customer_data dataframe itself. Without inplace = True, pandas
just returns (and displays) the result of the set_index() method but does not actually change the dataframe.

Instead of writing inplace = True, the following code snippet is equally viable. It is commented out here because the index
has already been set above, and running it a second time would raise a KeyError since "CustomerID" is no longer a column:"""


#customer_data = customer_data.set_index("CustomerID")


#Checking to see if the Customer ID column is now the index of the dataframe.
customer_data.head()

In [None]:
""".reset_index()

    DESCRIPTION
This method resets the index to the default pandas numbering (0, 1, 2, ...). The old index is moved back into the dataframe as a regular column.
"""

customer_data.reset_index(inplace = True)

customer_data
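The difference between reassigning the result and leaving the original untouched can be seen in a minimal sketch. The dataframe `toy` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({"CustomerID": [101, 102, 103], "Age": [19, 35, 52]})

# set_index() returns a new dataframe; toy itself is left unchanged.
indexed = toy.set_index("CustomerID")

# reset_index() moves the index back into a regular column
# and restores the default 0, 1, 2, ... numbering.
restored = indexed.reset_index()

print(indexed)
print(restored)
```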


In [None]:
""".loc() and .iloc

    DESCRIPTION
.loc[] selects by the index labels of the dataframe while .iloc[] selects by integer position (0, 1, 2, ...), whatever the labels are.
.loc[] allows us to slice columns using the names of the columns while .iloc[] allows us to slice columns using their integer positions.
.loc[] is also used to find where certain conditions are true.
Both are written with square brackets, not round brackets.

dataframe.loc[specified rows, specified columns]
"""


"""In the first example, we have .loc[28,:]. This specifies the exact row we want, row 28 and then, the specified column is
all the available columns which is denoted with the ':' symbol


In the second example, we search for the exact rows we are looking for. In this case, they are rows 28, 36, 167. Again,
the number of columns specified is all available columns. Notice that to specify more than one row, we use a list, that is,

[28, 36, 167]


The 3rd example specifies a range of rows. The rows range from row number 0 to row number 50 denoted by 0:50. Unlike ordinary
Python slicing, .loc includes the end label, so row 50 is part of the result."""

#customer_data.loc[28, :]

#customer_data.loc[[28,36,167], :]

customer_data.loc[0:50,:]
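One detail worth checking on a small example is how the end of a slice is treated: `.loc` includes the end label while `.iloc` excludes the end position. The dataframe `toy` below is a hypothetical stand-in with made-up values.

```python
import pandas as pd

# Hypothetical data with the default 0, 1, 2, ... index.
toy = pd.DataFrame({"Age": [19, 21, 20, 23, 31]})

by_label = toy.loc[0:2, :]       # labels 0, 1 and 2 -> three rows
by_position = toy.iloc[0:2, :]   # positions 0 and 1 -> two rows

print(len(by_label), len(by_position))
```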

In [None]:
#.loc

#customer_data.loc[0:, :]

#customer_data

                  
"""In this example, search for all rows from row 28 to row 157 and instead of getting all available columns, we can
specify which columns we want exactly. This could be by either naming them in a list like in example 2 or putting them in
a range like in example 3"""


#customer_data.loc[28:157, "Genre": "Annual Income (k$)"]

customer_data.loc[28:157, ["Genre", "Annual Income (k$)"]]



In [None]:
#.loc()
"""
.loc[] can also be used to select only the rows where a condition is true. For example, if we wanted to find the
rows of our data that contain just men, we start by writing the condition below
"""
customer_data["Genre"] == "Male"


In [None]:
"""As we can see, the result of the above code gives us the rows where pandas found the Gender (or Genre) column to be Male.
We can therefore use .loc[] to 'locate' the rows where the result is True. Because we pass one True/False value per row and ask
for all columns, the result will be in tabular form like a normal pandas dataframe.
"""

In [None]:
"""Completing the code to see the result, we have:"""

customer_data.loc[customer_data["Genre"]== "Male" , :]

In [None]:
"""We can also do the same thing with the Age column. Here we specify that we want where Ages are greater than 50"""
customer_data.loc[customer_data["Age"] > 50, :]
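Two conditions like the ones above can also be combined into a single mask. The sketch below uses a hypothetical dataframe `toy`; note that each condition needs its own parentheses and that `&` (and) or `|` (or) is used rather than the Python keywords.

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male", "Female"],
    "Age": [55, 62, 30, 45],
})

# True only for rows where both conditions hold.
mask = (toy["Genre"] == "Male") & (toy["Age"] > 50)

older_men = toy.loc[mask, :]
print(older_men)
```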

In [None]:
""".iloc

    DESCRIPTION
This is used to slice a dataframe by integer position.
You specify the rows you want using their positions, and you also specify the columns using their positions.
The end position of a slice is excluded, so 28:33 returns positions 28 to 32.
Unlike .loc, you never use the names of the columns when using .iloc[]

"""
customer_data.iloc[28:33, 0:3]

In [None]:
""".sort_values()

    DESCRIPTION
This is a method used to arrange a dataframe by the column (or list of columns) specified in the bracket. A column must always
be specified; to arrange the dataframe by its index instead, use the .sort_index() method.

The first example sorts the data by Gender (or Genre). The .sort_values() method sorts in ascending order by default.
This is why in the first example, all the female values: Fem and Female appear first before the male values: Male and Mole since
"F" comes before "M" in the English Alphabet.

In the second example, we show how to make the sorting arrangement in descending order. We do this by specifying an argument
called 'ascending'. We set this argument called ascending to False, so that pandas knows that it should sort in descending 
order.

The third example shows us how we can sort by multiple columns. Here, we sort by "Genre" and "Age". Since Genre comes first,
pandas sorts first by Genre. Age comes next, so after sorting by Genre initially, an added layer of sorting is included when
pandas sorts by Age. The way this works is that after sorting by Genre, we would have all the Females on top of all the Males
if we sorted in ascending order. But it is possible that we have a female with an age of 45 and another female with age of 30 
below her. By placing this Age column as the second thing to sort by, pandas will rectify the arrangement of these females by
placing the female with age of 30 above the female with the age of 45. This is because pandas is trying to sort the ages in 
ascending order as well.
"""



customer_data.sort_values(["Genre"])
#customer_data.sort_values(["Genre"], ascending= False)
#customer_data.sort_values(["Genre", "Age"], ascending= False)
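The two-level sort described above can be checked on a small example. The dataframe `toy` is hypothetical; it contains a 45-year-old female listed before a 30-year-old female, as in the explanation.

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Female", "Male"],
    "Age": [40, 45, 30, 25],
})

# Sort by Genre first, then by Age within each Genre (both ascending by default).
sorted_toy = toy.sort_values(["Genre", "Age"])
print(sorted_toy)
```

The original row labels are kept, which makes it easy to see how the rows were rearranged.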

In [None]:
""".isnull() or .isna()

    DESCRIPTION
These two methods do the same thing. They are used to check for empty (null) records in the dataframe.
They return True for every cell that is empty and False for every cell that is not.

The first example shows us where in each column there is a null value. If a cell holds a null value, it would have a "True"
value and a "False" value otherwise.

In mathematics, True = 1, and False = 0. This is also true for python, therefore, if we place another method .sum() after
the .isnull() method as in the second example, it simply sums up all the True values in each column, thereby, giving us the sum 
of all the null values we have in each column.

"""



#customer_data.isnull()
customer_data.isnull().sum()
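Counting nulls per column can be tried on a small dataframe with known gaps. The name `toy` and its values are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age and one missing Spending Score.
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male"],
    "Age": [19, np.nan, 52],
    "Spending Score (1-100)": [39, 81, np.nan],
})

# Each True counts as 1, so the sum is the number of nulls per column.
null_counts = toy.isnull().sum()
print(null_counts)
```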

In [None]:
"""We can also apply these .isnull() and isnull().sum() methods to an individual column as well

Notice that it gives the rows where the Age is null. False values show us that age for that row is not null and True show us
that age for that row is null.

In the example below, only row 186 shows us that there is a null value in the Age column."""



customer_data["Age"].isnull()

In [None]:
"""Because the above code, gives us True and False values for each row of the dataframe, we can again use the .loc() method
to show us the rows that have null values in the dataframe."""



customer_data.loc[customer_data["Age"].isnull(), :]

In [None]:
""".any()
 
    DESCRIPTION
This is a method that is used to supplement the .isna() or .isnull() methods. While .isna() or .isnull() ask pandas a question
whether or not there are null values in the dataframe, to which pandas responds 'True' or 'False', .any() simply helps us to
ask pandas if there are any null values at all. So instead of giving us a list or array of True or False values like
.isna() or .isnull() methods give, .any() allows us to ascertain if there are any null values at all

In the code below, we have used .any() after .isna() and we can see that Age has a null value along with Spending Score.
We do not know how many null values there are, .any() just shows us if there are any null values at all.
"""


customer_data.isna().any()
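As a small sketch of the same idea, `.any()` can be applied once to get one answer per column, or twice to get a single answer for the whole dataframe. The dataframe `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a single missing Age (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male"],
    "Age": [19, np.nan, 52],
})

has_null_per_column = toy.isna().any()             # one True/False per column
has_null_anywhere = bool(toy.isna().any().any())   # one True/False overall

print(has_null_per_column)
print(has_null_anywhere)
```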

In [None]:
""".duplicated()

    DESCRIPTION
This is a method that is used to determine if a dataframe contains any duplicated rows

Just like previous methods that ask pandas a question to which pandas responds True or False, .duplicated() asks pandas if there
are any duplicated rows in the dataset. Pandas responds by giving True or False responses for each row.

In similar fashion to previous examples, we can use .loc[] to find the actual rows in the dataset that are
marked as duplicates of an earlier row.

In the example below, there are no duplicated rows which is why the dataframe is empty.

"""


customer_data.loc[customer_data.duplicated()]
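Because the customer data has no duplicated rows, the sketch below uses a hypothetical dataframe `toy` that does. It also shows `.drop_duplicates()`, a related pandas method not covered above, as one possible way to remove the repeated rows.

```python
import pandas as pd

# Hypothetical data in which the last row repeats the first one.
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male"],
    "Age": [19, 35, 19],
})

# By default the first occurrence is not flagged; only later repeats are.
dup_mask = toy.duplicated()

duplicated_rows = toy.loc[dup_mask]
deduplicated = toy.drop_duplicates()

print(duplicated_rows)
print(deduplicated)
```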

In [None]:
""".unique()

    DESCRIPTION
    
This method is used to find the unique values in a particular column. In the Gender column, the unique values are Male,
Female, 26, Fem, Mole, and 78.

Clearly, this Genre column needs to be cleaned appropriately"""





customer_data["Genre"].unique()

In [None]:
""".replace()

    DESCRIPTION
This method is used for replacing values in a particular record. It takes in two arguments: The value you want to replace and
the value to replace it by.

Note: This method replaces all instances of the object you want to replace. So be careful when using it.

The first example shows us how to replace a single value in the dataframe. We can also specify the column we want to make a 
replacement in. This is shown in example 2.

Next, we can replace a bunch of items by using two lists. One for the items you want to replace, the other for the items you
want to replace them with. In the 3rd example below, Mole is replaced by Male while Fem is replaced by Female. Also, note that
the item to be replaced is always written first while the item we are replacing it with comes second.

Note, .replace() returns a new object and leaves the original unchanged. To make changes permanent, we assign the result back
to the same column or dataframe, or we use the inplace = True statement in the bracket
"""




customer_data.replace("Mole", "Male")
customer_data["Genre"].replace("Mole", "Male")
#customer_data.replace(["Mole", "Fem"], ["Male", "Female"])
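One way to clean a single column and keep the change is to pass a dictionary of old and new values and assign the result back to that column. This is a sketch on a hypothetical dataframe `toy`, not the notebook's own cleaning step.

```python
import pandas as pd

# Hypothetical column containing the misspellings mentioned above.
toy = pd.DataFrame({"Genre": ["Male", "Mole", "Fem", "Female"]})

# Keys are the values to replace, values are their replacements.
toy["Genre"] = toy["Genre"].replace({"Mole": "Male", "Fem": "Female"})

print(toy["Genre"].unique())
```

Restricting the replacement to one column avoids accidentally changing matching values elsewhere in the dataframe.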

In [None]:
""".rename()

    DESCRIPTION
This is a method that is used to replace the names of the columns in a dataset.
It uses a dictionary for its renaming process. The original name is in the "key" position of the dictionary, while the
new name is in the "value" position of the dictionary

In the example below, "Genre" is placed in the keys position of the dictionary and its corresponding value is "Gender". This 
helps us to change or rename the Genre column to Gender column. Next, we place the "Age" in the key position as well while 
we put its replacement (How_old_are_you) in the value position of the dictionary. We can do this for as many column names as we
see fit.
"""



customer_data = customer_data.rename(columns = {"Genre": "Gender", "Age": "How_old_are_you"})

customer_data

In [None]:
""".drop()

    DESCRIPTION
This is a method used to drop rows or columns from a dataset. All we need to do is specify the row(s) or the column(s).
We also need to specify the axis argument. If axis = 0 (the default), we are dropping rows and if axis = 1, we are
dropping columns
"""

#customer_data.drop("How_old_are_you", axis = 1)
customer_data.drop([0,2,3])
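The effect of the axis argument can be seen in a minimal sketch on a hypothetical dataframe `toy` with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male"],
    "Age": [19, 35, 52],
})

without_rows = toy.drop([0, 2])          # axis=0 is the default: drops rows 0 and 2
without_age = toy.drop("Age", axis=1)    # axis=1: drops the Age column

print(without_rows)
print(without_age)
```

Both calls return new dataframes, so `toy` still has all of its rows and columns afterwards.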

In [None]:
""".dropna()

    DESCRIPTION
This is used for dropping rows or columns that have null values (i.e. NaN). Again, to make its effect permanent, use
inplace = True.

Notice that after using the method below, row 186 no longer exists on the dataframe. 

Again, if we use the axis = 1 statement in the brackets, we remove the Age column along with the Spending Score column since 
both of them contain NaN values
"""



customer_data.dropna()
#customer_data.dropna(axis = 1)
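A small sketch of both directions of `.dropna()`, using a hypothetical dataframe `toy` with a single missing Age:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing Age (illustrative values only).
toy = pd.DataFrame({
    "Genre": ["Male", "Female", "Male"],
    "Age": [19, np.nan, 52],
})

rows_dropped = toy.dropna()          # removes every row that contains a NaN
cols_dropped = toy.dropna(axis=1)    # removes every column that contains a NaN

print(rows_dropped)
print(cols_dropped)
```

Dropping by column discards the whole Age column because of one missing value, which is why dropping rows or filling values is often preferred.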

In [None]:
""".fillna()

    DESCRIPTION
This is a method used to fill NaN or null values that we may encounter in our dataset. It is used especially when we do not want
to lose information through dropping rows or columns.

In the first example, we use the value argument to fill the NaN values. This means that wherever we see NaN values in
the How_old_are_you column, we replace it with 90 and anywhere in the Spending score column that we see NaN values, we replace
it with 44. The value argument can take either normal numbers or dictionaries as shown in the example 1.

In example 2, we fill from neighbouring values instead. Forward fill takes the previous value before the NaN and uses it to
fill the NaN. Backward fill is the opposite: it uses the next value after the NaN. Older versions of pandas wrote these as
fillna(method = "ffill") and fillna(method = "bfill"), but the method argument is deprecated in recent versions, so the
.ffill() and .bfill() methods are used here. They are suitable when the values in the column are fairly similar or within a
very small range of values, for example, between 80 - 83.
"""


#customer_data.fillna(value = {"How_old_are_you" : 90, "Spending Score (1-100)" : 44})

customer_data.ffill()

customer_data.bfill()
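Forward and backward fill are easiest to follow on a short series. The sketch below uses the `.ffill()` and `.bfill()` methods on hypothetical values in the 80 to 83 range mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one gap in the middle.
scores = pd.Series([80.0, np.nan, 83.0])

forward_filled = scores.ffill()    # gap takes the previous value (80.0)
backward_filled = scores.bfill()   # gap takes the next value (83.0)

print(forward_filled.tolist())
print(backward_filled.tolist())
```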



In [None]:
"""In the above example, we filled the NaN values without specifying the particular column we want which might be a careless
way of doing things. If we specify the column name as shown in the examples below, you do not need to specify a dictionary
for the value argument. This time, we can just use a number instead.

In the next example, we fill the NaN values using the mean of the entire column. You can follow similar steps with
the median or the mode. Which of these to use depends on the situation; for example, the median is less affected by extreme
values than the mean"""



#customer_data["How_old_are_you"].fillna(value = 95)

mean = customer_data["How_old_are_you"].mean()
customer_data["How_old_are_you"].fillna(value = mean)
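How much the choice of statistic matters can be seen on a column with one extreme value. The series `ages` below is hypothetical and chosen so the mean and median differ clearly.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one extreme value (100).
ages = pd.Series([20.0, 30.0, np.nan, 100.0])

# mean() and median() skip NaN values by default.
filled_with_mean = ages.fillna(ages.mean())
filled_with_median = ages.fillna(ages.median())

print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

The extreme value pulls the mean up to 50, while the median stays at 30, so the two fills give noticeably different results.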

In [None]:
""" Reordering columns

    DESCRIPTION
This changes the order of the columns.

This is done by just placing a list of how you want the existing columns in the dataframe to be arranged in the table.

Notice that the list of column headers passed into the customer_data[] object are the exact columns of the dataframe. If there 
is a slight difference, an error occurs.
"""



customer_data[["How_old_are_you", "Gender", "Spending Score (1-100)", "Annual Income (k$)"]]

In [None]:
""".at()/iat

    DESCRIPTION
This is used to read or change the value of a particular cell in pandas. For .at[], you specify the row label and column name while
for .iat[], you specify the row position and column position

"""
customer_data.at[180, "Annual Income (k$)"] = 50

#customer_data.iat[180,3] = 50
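A minimal sketch of both accessors on a hypothetical two-row dataframe `toy`, where the row labels happen to equal the row positions:

```python
import pandas as pd

# Hypothetical stand-in for the customer data (illustrative values only).
toy = pd.DataFrame({
    "Age": [19, 35],
    "Annual Income (k$)": [15, 16],
})

toy.at[1, "Annual Income (k$)"] = 50   # row label 1, column name
toy.iat[0, 0] = 20                     # row position 0, column position 0 (Age)

print(toy)
```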

In [None]:
"""np.where()

    DESCRIPTION
This returns one value wherever a condition is true and another value wherever it is false. It is mostly used to create
a new column in a dataset based on a condition.

Its format is:

x = np.where(condition, value_if_true, value_if_false)

For example, we want to create a new column such that it records "Old" if a person is above 50 years old and "Young" if a person
is 50 years old or below.

So the condition to be met is whether or not the customer is above 50 years old. If so, the person is labelled "Old" in the new
column and if not, the person is labelled "Young" in the new column.
"""


customer_data["New_column"] = np.where(customer_data["How_old_are_you"] > 50, "Old", "Young")
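The boundary case is worth checking on a small example: with the condition `> 50`, a customer who is exactly 50 is labelled "Young". The dataframe `toy` and the column name `Age_group` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages just below, at, and just above the boundary.
toy = pd.DataFrame({"Age": [30, 50, 51]})

# "Old" where Age > 50, "Young" everywhere else (including exactly 50).
toy["Age_group"] = np.where(toy["Age"] > 50, "Old", "Young")

print(toy)
```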

## Homework:


Use the datasets in the data_preprocessing_datasets folder and apply the tools covered above to inspect and clean the data.