# Regressor Model for Potential Ship Buyers (Predicting Crew Size)
The objective of this model is to recommend a crew size for potential ship buyers. The dataset used for the model can be found in the repository.
#### Checklist
* Read the file and display columns.
* Calculate basic statistics of the data (count, mean, std, etc.), examine the data and state observations.
* Select the columns that are likely to be important for predicting crew size.
* Create training and testing sets (use 60% of the data for training and the remainder for testing).
* Build a machine learning model to predict the crew size.
* Calculate the Pearson correlation coefficient for the training and testing data sets.
* Explain overfitting and how it can be avoided.
* What’s the difference between bias and variance?
* When will you use classification over regression?

#### Import dependencies

In [1]:
import pandas as pd
import numpy as np

#### Read the dataset

In [2]:
ship_data = pd.read_csv('ship_info.csv')

#### Display columns

In [3]:
ship_data.head()

| | Ship_name | Cruise_line | Age | Tonnage | passengers | length | cabins | passenger_density | crew |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Journey | Azamara | 6 | 30.277 | 6.94 | 5.94 | 3.55 | 42.64 | 3.55 |
| 1 | Quest | Azamara | 6 | 30.277 | 6.94 | 5.94 | 3.55 | 42.64 | 3.55 |
| 2 | Celebration | Carnival | 26 | 47.262 | 14.86 | 7.22 | 7.43 | 31.8 | 6.7 |
| 3 | Conquest | Carnival | 11 | 110.0 | 29.74 | 9.53 | 14.88 | 36.99 | 19.1 |
| 4 | Destiny | Carnival | 17 | 101.353 | 26.42 | 8.92 | 13.21 | 38.36 | 10.0 |


#### Analyzing the data

In [4]:
ship_data.describe()

| | Age | Tonnage | passengers | length | cabins | passenger_density | crew |
|---|---|---|---|---|---|---|---|
| count | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| mean | 15.689873 | 71.284671 | 18.457405 | 8.130633 | 8.83 | 39.900949 | 7.794177 |
| std | 7.615691 | 37.22954 | 9.677095 | 1.793474 | 4.471417 | 8.639217 | 3.503487 |
| min | 4.0 | 2.329 | 0.66 | 2.79 | 0.33 | 17.7 | 0.59 |
| 25% | 10.0 | 46.013 | 12.535 | 7.1 | 6.1325 | 34.57 | 5.48 |
| 50% | 14.0 | 71.899 | 19.5 | 8.555 | 9.57 | 39.085 | 8.15 |
| 75% | 20.0 | 90.7725 | 24.845 | 9.51 | 10.885 | 44.185 | 9.99 |
| max | 48.0 | 220.0 | 54.0 | 11.82 | 27.0 | 71.43 | 21.0 |


In [5]:
ship_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Ship_name          158 non-null    object 
 1   Cruise_line        158 non-null    object 
 2   Age                158 non-null    int64  
 3   Tonnage            158 non-null    float64
 4   passengers         158 non-null    float64
 5   length             158 non-null    float64
 6   cabins             158 non-null    float64
 7   passenger_density  158 non-null    float64
 8   crew               158 non-null    float64
dtypes: float64(6), int64(1), object(2)
memory usage: 11.2+ KB


In [6]:
ship_data.shape

(158, 9)

From analyzing the dataset, we can see that it has two object-dtype columns, one integer column and six float columns. There are no missing values, so no cleaning or handling of missing values is needed. The dataset has 158 rows and 9 columns, with the crew column as our target (dependent variable). Because the target is a continuous value, a regression model is the appropriate choice for fitting and predicting the data.

#### Preprocessing the data
Some minor but essential preprocessing is still needed before we build the machine learning model. The dataset has two non-numeric (object dtype) columns, and these need to be handled first by converting their values into numeric ones. We do this because many machine learning models (XGBoost, for example, and especially those implemented in scikit-learn) require the data to be in a strictly numeric format, and numeric data is also faster to compute on. We will do this using a technique called label encoding.

In [7]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over the columns and check each column's dtype
for col in ship_data.columns.values:
    # Encode only the object (non-numeric) columns
    if ship_data[col].dtype == 'object':
        # Use LabelEncoder to do the numeric transformation
        ship_data[col] = le.fit_transform(ship_data[col])
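
To make the effect of label encoding concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and the `Cruise_line_code` column are made up for illustration (the two cruise-line names are borrowed from the preview above, but the rows are not read from `ship_info.csv`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only (hypothetical rows, not read from ship_info.csv)
toy = pd.DataFrame({"Cruise_line": ["Carnival", "Azamara", "Carnival", "Azamara"]})

# Keep a dedicated encoder per column so its mapping can be inspected later
encoder = LabelEncoder()
toy["Cruise_line_code"] = encoder.fit_transform(toy["Cruise_line"])

# classes_ lists the original labels in sorted order; the position is the integer code
print(list(encoder.classes_))
print(toy["Cruise_line_code"].tolist())
```

The integer codes follow the sorted order of the labels, so they imply an ordering that has no real meaning for names. That is one more reason these two columns are poor candidates as model inputs.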

#### Splitting the data set into train and test sets and feature selection
I have successfully converted all the non-numeric values to numeric ones.

Now, I will split the data into a train set and a test set to prepare it for the two phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or to direct the training process of a machine learning model. Hence, I will first split the data and then apply the scaling.

Also, features like Ship_name and Cruise_line are not as important as the other features in the dataset for predicting crew size. I will drop them to design the machine learning model with the best set of features. In data science literature, this is often referred to as feature selection.

In [8]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features Ship_name and Cruise_line
ship_data = ship_data.drop(['Ship_name', 'Cruise_line'], axis=1)
#ship_data = ship_data.values

# Segregate features and target into separate variables
X = ship_data.drop('crew', axis=1)
y = ship_data['crew']

print(X.head())
y.head()

# convert the DataFrame to a NumPy array
X = X.values
y = y.values

# Split into 60% train set and 40% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

   Age  Tonnage  passengers  length  cabins  passenger_density
0    6   30.277        6.94    5.94    3.55              42.64
1    6   30.277        6.94    5.94    3.55              42.64
2   26   47.262       14.86    7.22    7.43              31.80
3   11  110.000       29.74    9.53   14.88              36.99
4   17  101.353       26.42    8.92   13.21              38.36
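
As a quick sanity check on the 60/40 split, the sketch below runs `train_test_split` on placeholder arrays that only mimic the shape of the data (158 rows, 6 feature columns). The names `X_demo`, `y_demo`, `X_tr` and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the feature matrix and target
X_demo = np.zeros((158, 6))
y_demo = np.zeros(158)

# Same arguments as the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42
)

# The test share is rounded up, so the split is close to, not exactly, 60/40
print(X_tr.shape, X_te.shape)
```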


#### Preprocessing the data ii
The data is now split into two separate sets: a train set and a test set. One final preprocessing step, scaling, remains before we can fit a machine learning model to the data. After considerable analysis of the dataset, I decided to use standardization as my rescaling method because, unlike min-max normalization, it does not confine the values to a fixed range.

In [9]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler and use it to rescale X_train and X_test
SS_scaler = StandardScaler()
rescaledX_train = SS_scaler.fit_transform(X_train)
rescaledX_test = SS_scaler.transform(X_test)
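
The sketch below shows what the scaler does under the hood: it learns the mean and standard deviation from the training data only and reuses them on the test data. The arrays are synthetic stand-ins, not the ship data, and the variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(20, 2))  # synthetic stand-in for X_train
test = rng.normal(loc=50.0, scale=10.0, size=(5, 2))    # synthetic stand-in for X_test

# Fit on the training data only
scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and population standard deviation
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, which is what StandardScaler uses
manual_test = (test - mu) / sigma

# The two results should agree up to floating point error
print(np.allclose(scaler.transform(test), manual_test))
```

Because the test rows are scaled with the training statistics, their scaled mean is generally not exactly zero, which is expected.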

#### Building my machine learning model

In [10]:
# Import LinearRegression, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit reg_all to the train set
reg_all.fit(rescaledX_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(rescaledX_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(rescaledX_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))

R^2: 0.9362030498069416
Root Mean Squared Error: 0.8977962325945485


From the values above, the model's R^2 on the test set is about 0.936. R^2 is the coefficient of determination, that is, the proportion of the variance in crew size that the model explains; it is not a classification accuracy. A score this close to 1 indicates a good fit. I also computed the root mean squared error (RMSE) on the same held-out test set to measure how far the predictions on new data fall from the true values. The RMSE of about 0.898 is small compared with the standard deviation of the crew column (about 3.50). On this evidence the model looks suitable for predicting crew size for ship buyers, keeping in mind that it was evaluated on a single split of 158 rows.
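
For reference, both metrics can be computed by hand. The sketch below uses four made-up values (`y_true` and `y_hat` are illustrative, not taken from the notebook) and compares the manual results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is expressed in the same units as the target, which is why it is natural to compare it with the spread of the crew column.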

#### Pearson correlation coefficient for the training and testing data sets.
Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, Age and Tonnage. Pearson's correlation coefficient (r) is a measure of the strength of the association between the two variables.
##### Pearson correlation formula
![image](https://media.geeksforgeeks.org/wp-content/uploads/20200311233526/formula6.png)

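In [11]:
# coef_ holds the fitted regression coefficients (one per standardized feature),
# not Pearson correlation coefficients
print('Regression coefficients:', reg_all.coef_)

Regression coefficients: [-0.00963213  0.54389059 -1.68304167  0.54554151  3.88951399  0.05898596]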
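
A Pearson correlation for each split can be computed directly with `np.corrcoef`. One reasonable reading of the checklist item is the correlation between actual and predicted crew size on the training and testing sets. The sketch below follows that reading on synthetic stand-in data; `pearson_r`, `X_syn`, `y_syn` and the chosen weights are hypothetical, and in the notebook you would pass `y_train`, `reg_all.predict(rescaledX_train)` and their test counterparts instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


# Synthetic stand-in data (hypothetical weights and noise level)
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(158, 6))
weights = np.array([0.0, 0.5, -1.5, 0.5, 4.0, 0.1])
y_syn = X_syn @ weights + rng.normal(scale=1.0, size=158)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.4, random_state=42)
model = LinearRegression().fit(X_a, y_a)

# Correlation between actual and predicted values on each split
r_train = pearson_r(y_a, model.predict(X_a))
r_test = pearson_r(y_b, model.predict(X_b))
print("Pearson r (train):", r_train)
print("Pearson r (test):", r_test)
```

For ordinary least squares with an intercept, the squared training correlation equals the training R^2, so a large gap between the training and testing values is a simple warning sign of overfitting.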


#### What is overfitting?
Overfitting occurs when your model learns too much from the training data, including its noise, and isn’t able to generalize the underlying information. When this happens, the model describes the training data very accurately but loses precision on data it has not been trained on. This is a problem because we want our model to perform reasonably well on data that it has never seen before.
#### Why does it happen?
In machine learning, simplicity is key. We want to generalize the information obtained from the training dataset, so we run a greater risk of overfitting when we use complex models.
Complex models are likely to over-learn from the training data and treat the random error that makes the training data drift from the underlying dynamics as something worth learning. That’s the point at which the model stops generalizing and starts overfitting.
Complexity is often measured by the number of parameters the model uses during its learning procedure. For example, the number of parameters in a linear regression, the number of neurons in a neural network, and so on.
So, the lower the number of parameters, the simpler the model and, reasonably, the lower the risk of overfitting.
#### A simple example of overfitting
![image](https://miro.medium.com/proxy/1*1z_Id7wNBoGVWnWl238wmg.png)

Now it’s clear what happens here. The polynomial fits training data perfectly but loses precision on the test set. It doesn’t even get close to test points.
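
The same effect can be reproduced in a few lines. The sketch below is not the code behind the figure; it assumes a hypothetical data-generating process (a noisy straight line) and compares a degree 1 fit with a degree 9 fit on only 12 training points.

```python
import numpy as np

rng = np.random.default_rng(0)


def noisy_line(n):
    # Hypothetical process: y = 2x plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)
    return x, y


def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on the points (x, y)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_tr, y_tr = noisy_line(12)    # small training set
x_te, y_te = noisy_line(200)   # larger held-out set

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    results[degree] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_te, y_te))
    print("degree", degree, "train/test MSE:", results[degree])
```

The higher-degree polynomial can only lower the training error, since it contains the straight line as a special case. The held-out error is what reveals whether that extra flexibility is fitting signal or noise.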
#### How to avoid overfitting
* Cross validation: Cross-validation is a powerful preventative measure against overfitting. The idea is clever: Use your initial training data to generate multiple mini train-test splits. Use these splits to tune your model. In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”). Cross-validation allows you to tune hyperparameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.
* Training with more data: It won’t work every time, but training with more data can help algorithms detect the signal better. You should ensure your data is clean and relevant.
* Remove Features: Some algorithms have built-in feature selection. For those that don’t, you can manually improve their generalizability by removing irrelevant input features. An interesting way to do so is to tell a story about how each feature fits into the model. This is like the data scientist's spin on software engineer’s rubber duck debugging technique, where they debug their code by explaining it, line-by-line, to a rubber duck. If anything doesn't make sense, or if it’s hard to justify certain features, this is a good way to identify them. In addition, there are several feature selection heuristics you can use for a good starting point.
* Regularization: Regularization refers to a broad range of techniques for artificially forcing your model to be simpler. The method will depend on the type of learner you’re using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression. Oftentimes, the regularization strength is a hyperparameter as well, which means it can be tuned through cross-validation.
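
Cross-validation and regularization combine naturally. The sketch below tunes the penalty strength of a ridge regression with 5-fold cross-validation. It is an illustration on synthetic data rather than part of the model built above; `X_syn`, `y_syn` and the candidate `alpha` values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set (hypothetical weights and noise level)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(94, 6))
weights = np.array([0.0, 0.5, -1.5, 0.5, 4.0, 0.1])
y_syn = X_syn @ weights + rng.normal(scale=1.0, size=94)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for alpha in (0.01, 1.0, 100.0):
    # The pipeline refits the scaler inside every fold, so the held-out fold
    # never influences the scaling of the training folds
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="r2")
    mean_scores[alpha] = scores.mean()

best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best alpha:", best_alpha)
```

Only the training data takes part in this search, so the test set stays untouched for the final evaluation.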

### Bias and Variance
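The prediction error for any machine learning algorithm can be broken down into three parts:
* Bias Error
* Variance Error
* Irreducible Error

The irreducible error comes from noise in the problem itself and cannot be reduced by any choice of algorithm, so the discussion below focuses on bias and variance.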

#### Bias Error
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm's bias.
  
  * Low Bias: Suggests fewer assumptions about the form of the target function.
  * High Bias: Suggests more assumptions about the form of the target function.

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines. Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
#### Variance Error
Variance is the amount by which the estimate of the target function would change if different training data were used. The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, the estimate should not change too much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping between the inputs and the output variables. Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. This means that the specifics of the training data influence the number and types of parameters used to characterize the mapping function.
  
  * Low Variance: Suggests small changes to the estimate of the target function with changes to the training dataset.
  * High Variance: Suggests large changes to the estimate of the target function with changes to the training dataset.

Generally, nonlinear machine learning algorithms that have a lot of flexibility have a high variance. For example, decision trees have a high variance, which is even higher if the trees are not pruned before use. Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression. Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.
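
Variance can be observed empirically by retraining the same algorithm on many different training sets and watching how much its prediction at one fixed point moves. The sketch below does this for a linear regression and an unpruned decision tree. The data-generating process, sample sizes and query point are hypothetical choices for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)


def sample_training_set(n=50):
    # Hypothetical process: a noisy straight line, y = 3x + noise
    x = rng.uniform(0.0, 10.0, size=(n, 1))
    y = 3.0 * x[:, 0] + rng.normal(scale=2.0, size=n)
    return x, y


x_query = np.array([[5.0]])  # fixed point at which predictions are compared

preds = {"linear": [], "tree": []}
for _ in range(200):
    x, y = sample_training_set()
    preds["linear"].append(LinearRegression().fit(x, y).predict(x_query)[0])
    tree = DecisionTreeRegressor(random_state=0).fit(x, y)  # grown without pruning
    preds["tree"].append(tree.predict(x_query)[0])

# Spread of the prediction across the 200 independent training sets
variances = {name: float(np.var(values)) for name, values in preds.items()}
print(variances)
```

Here the true relationship is linear, so the linear model's assumption happens to hold and its bias is small as well. On a strongly nonlinear target the picture would change: the tree would still vary more, but the linear model would pay for its stability with a larger bias.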

### When to use classification over regression
Classification is used when the output variable is a category, such as “red” or “blue”, or “spam” or “not spam”. It is used to draw a conclusion from observed values by assigning each observation to one of a set of discrete classes. Regression, by contrast, is used when the output variable is a real or continuous value, such as “age” or “salary”. That is why I used regression to build my model: the target, crew size, is a continuous value.