# If your data hasn’t been cleaned and preprocessed, your model does not work.

In [3]:
# Imports first!

# Numpy, Matplotlib, and Pandas are the most popular libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [4]:
# Now you can read in your dataset by typing

dataset = pd.read_csv('input/train.csv')

In [5]:
dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Now we have our dataset, but we need to create a matrix of independent variables (the features) and a vector of the dependent variable (the target). You can create the matrix of independent variables by typing:

In [51]:
dataset.drop(['Name'],axis=1)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,male,35.0,0,0,373450,8.0500,,S
5,6,0,3,male,,0,0,330877,8.4583,,Q
6,7,0,1,male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,male,2.0,3,1,349909,21.0750,,S
8,9,1,3,female,27.0,0,2,347742,11.1333,,S
9,10,1,2,female,14.0,1,0,237736,30.0708,,C


In [52]:
X = dataset.iloc[:, :-1].values  

That first colon (:) means that we want to grab all of the rows in our dataset. :-1 means that we want to grab all of the columns of data except the last column. The .values on the end means that we want the underlying values as a NumPy array rather than a DataFrame.
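To make the slicing concrete, here is a minimal sketch on a made-up three-row frame. The `toy` DataFrame and its values are hypothetical stand-ins, not the real train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.93],
})

# Every row, every column except the last one (Fare is left out)
all_but_last = toy.iloc[:, :-1].values

print(type(all_but_last).__name__)  # ndarray
print(all_but_last.shape)           # (3, 2)
```

The same pattern works on the full dataset: the row slice stays `:` and only the column slice changes.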

Now we want a vector of the dependent variable with only the data from the target column. In this dataset the target (Survived) sits at column index 1, so we can type

In [53]:
y = dataset.iloc[:, 1].values

In [54]:
y #our target is in column 1         ; column 0 is the PassengerId

#Remember when you’re looking at your dataset, the index starts at 0. If you’re trying to count the columns, start counting at 0, not 1

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,

# What happens if we have missing data?

We could just remove the rows where data are missing, but that’s really not the smartest idea. It throws away the rest of the information in those rows and could easily cause problems. We need to find a better idea! The most common solution is to fill in each missing data point with the mean of its column.

You can easily do this with the Imputer class from scikit-learn’s preprocessing module.

In [55]:
# To use the imputer, we would run something like this

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = np.nan, strategy = 'mean', axis = 0)



Mean is the default strategy, so you don’t actually need to specify that, but it’s here so you can get a sense of what information you want to include. The default value for missing_values is NaN. If the missing values in your dataset show up as “NaN,” you’ll stick with np.nan.
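One caveat if you are on a recent scikit-learn: the `Imputer` class was removed in version 0.22, and `SimpleImputer` from `sklearn.impute` fills the same role (it always works column by column, so there is no `axis` argument). The sketch below assumes a newer install and uses a small hypothetical array rather than the real dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns, each with one missing value
toy = np.array([[22.0, 7.25],
                [np.nan, 71.28],
                [26.0, np.nan]])

# Replace each NaN with the mean of its own column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

print(filled)
```

The missing entry in the first column becomes 24.0, the mean of 22.0 and 26.0.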

In [62]:
# Now to fit this imputer, we type

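imputer = imputer.fit(X[:, 1:3])

We only want to fit the imputer to the columns where data are missing. The first colon means that we want to include all of the rows, while 1:3 means that we’re taking column indexes 1 and 2 (the upper bound, 3, is excluded). Adjust the slice to match your own data: in the dataset shown above, the missing numeric values are in the Age column.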

In [69]:
# Now we want to use the method that will actually replace the missing data. You’ll set that up by typing

X[:, 1:3] = imputer.transform(X[:, 1:3])

Try this out with other strategies! You might find that it makes more sense for your project to fill in the missing values with the median of the column. Or the mode! Decisions like these seem small, but they actually hold a lot of importance.

Just because something is popular doesn’t necessarily make it the right choice. The average (mean) of your data points isn’t necessarily the best choice for your model.
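As a rough illustration of how much the choice can matter, the sketch below fills the same missing value three different ways. It assumes a recent scikit-learn (where the imputer is `SimpleImputer` in `sklearn.impute`), and the `ages` column is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one outlier (80) and one missing value at the end
ages = np.array([[2.0], [22.0], [22.0], [80.0], [np.nan]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that was filled in for the missing last row
    fills[strategy] = float(imputer.fit_transform(ages)[-1, 0])

print(fills)  # {'mean': 31.5, 'median': 22.0, 'most_frequent': 22.0}
```

The single outlier pulls the mean well above the typical value, while the median and mode stay at 22.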

# What if you have categorical data?

You can’t exactly take the mean of cat, dog, and moose. What can we do? We can encode the categorical values as numbers! You’ll want to grab the LabelEncoder class from sklearn.preprocessing.
Start with one column where you want to encode the data and call the label encoder. Then fit it onto your data.

In [58]:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 4] = labelencoder_X.fit_transform(X[:, 4]) 
#We want to work with all of the rows, and 4 means that we want to grab the column at index 4 (Sex).
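To see what the label encoder actually produces, here is a small sketch on a hypothetical list of animals (not a column from the Titanic data).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
animals = ["cat", "dog", "moose", "dog"]

encoder = LabelEncoder()
codes = encoder.fit_transform(animals)

# Classes are sorted alphabetically, then numbered from 0
print(encoder.classes_.tolist())  # ['cat', 'dog', 'moose']
print(codes.tolist())             # [0, 1, 2, 1]
```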

Do you see the potential problem?

That system of labeling implies a hierarchical value to the data that could affect your model. 3 has a higher value than 0, but cat is not (necessarily…) greater than moose.
We need to create dummy variables!

We can create one column for cat, one for moose, and so on. Then we’ll fill the columns in with 1s and 0s (think 1=yes and 0=no.) That means that if you had cat in your original column, now you’d have a 0 in the moose column, a 0 in the dog column, and a 1 in the cat column.

That sounds complicated. Enter One Hot Encoder!

In [71]:
# Import the encoder and then specify the index of the column

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [3])

# Now a little fit and transform

X = onehotencoder.fit_transform(X).toarray()



ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'
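The error above appears to come from handing the encoder the whole of X, which still contains raw strings such as passenger names. Also note that the `categorical_features` argument was removed in scikit-learn 0.22. On newer versions, one way to encode a single column is to wrap `OneHotEncoder` in a `ColumnTransformer`. The sketch below assumes a recent scikit-learn and uses a hypothetical two-column array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric
toy = np.array([["cat", 4],
                ["moose", 17],
                ["dog", 9]], dtype=object)

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],  # encode only column 0
    remainder="passthrough",                          # keep the other columns as they are
    sparse_threshold=0,                               # always return a dense array
)
encoded = ct.fit_transform(toy)

# Columns are now: cat, dog, moose, then the untouched numeric column
print(encoded)
```

Each row has exactly one 1 among the first three columns, which is the dummy-variable layout described above.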

In [37]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [None]:
# This will go ahead and fit and transform y into an encoded variable with 1 for yes and 0 for no.

# Train test split



At this point, you can go ahead and split your data into training and testing sets. Always separate your data into training and testing sets and never use your testing data for training! You need to avoid overfitting. (You can think of overfitting like memorizing super specific details before a test without understanding the information. When you memorize details, you’ll do a great job with your flashcards at home. You’ll fail any real test, though, where you’re presented with new information.)

Right now, we have a machine that needs to learn something. It needs to train on data and see how well it understands what it’s learned on separate data. Memorizing the training set is not the same thing as learning! The better your model learns on the training set, the better it will be at predicting the results for the testing set. You never want to overfit your model. You really want it to learn!

In [38]:
# First, we import

from sklearn.model_selection import train_test_split

# Now we can create X_train and X_test and y_train and y_test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

It’s very common to do an 80/20 split of your data, with 80% of your data going to training and 20% to testing. That’s why we specified a test_size of 0.2. You can split it however you need to. You don’t need to set a random state, but I like to do that so that we can exactly reproduce our results.
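Here is a quick sketch of what the split sizes and the random state look like in practice. The arrays `X_toy` and `y_toy` are hypothetical, with 10 samples so the 80/20 split is easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)  # (8, 2) (2, 2)

# Same random_state, same shuffle: the two splits are identical
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```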

# Now for feature scaling.

What is feature scaling? Why do we need it?

Well, imagine a dataset with one column of animal ages from 4–17 and another of animal worth ranging from $48,000–$83,000. Not only is the worth column made up of much higher numbers than the age column, it also covers a much wider range. That means that any Euclidean distance between two rows will be dominated by worth, and the age data will barely register.

What if Euclidean distance doesn’t play a part in your specific machine learning model? Scaling the features can still help many models train faster (gradient-based optimizers, for example, tend to converge more quickly on scaled features), so you might want to include this step when you’re preprocessing your data.

There are many ways to do feature scaling. They all mean that we’re putting all of our features into the same scale so that none are dominated by another.
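To see one feature dominating another, the sketch below compares Euclidean distances before and after standardization. The three `[age, worth]` rows are hypothetical values picked from the ranges mentioned above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, worth]
data = np.array([[4.0, 48000.0],
                 [17.0, 49000.0],
                 [5.0, 83000.0]])

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Unscaled: the distances are essentially just the worth differences,
# so the 13-year age gap between rows 0 and 1 is almost invisible
raw_01 = distance(data[0], data[1])  # about 1000
raw_02 = distance(data[0], data[2])  # about 35000

# Scaled: both columns now contribute on comparable terms
scaled = StandardScaler().fit_transform(data)
scaled_01 = distance(scaled[0], scaled[1])
scaled_02 = distance(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, row 0 looks 35 times closer to row 1 than to row 2. After scaling, the two distances are of similar size because the large age gap now counts too.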

In [72]:
#Start with the import (you must be getting used to that)

from sklearn.preprocessing import StandardScaler

#Then create the scaler object by calling StandardScaler

sc_X = StandardScaler()

#Now we directly fit and transform our dataset. Grab the object and apply the methods.

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

#We don’t need to fit it to our test set, we just need a transform.

sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1))  # StandardScaler expects a 2D array

ValueError: could not convert string to float: 'Boulos, Mrs. Joseph (Sultana)'
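That error shows the scaler hitting a raw string: the feature matrix still contains text columns such as passenger names, and `StandardScaler` only works on numbers. Here is a minimal sketch of the same fit/transform pattern on numeric-only features. The arrays are hypothetical `[Age, Fare]` values, not the real split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric-only features: [Age, Fare]
X_train_num = np.array([[22.0, 7.25],
                        [38.0, 71.28],
                        [26.0, 7.93],
                        [35.0, 53.10]])
X_test_num = np.array([[35.0, 8.05]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_num)  # learn mean and std from the training data
X_test_scaled = scaler.transform(X_test_num)        # reuse the training mean and std

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

The test row is scaled with the training statistics, so nothing about the test set leaks into the fit.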

What about the dummy variables? Do you need to scale them?

Well, some people say yes and some say no. It’s a question of how much you want to hang on to your interpretation. It is good to have all of our data at the same scale. But if we scale our data, we lose our ability to easily interpret which observations belong to which variable.

What about y? If you have a dependent variable like 0 and 1, you really don’t need to apply feature scaling. It’s a classification problem with a categorical dependent variable. But if your dependent variable takes a large range of values, as in a regression problem, then yes! You do want to apply the scaler!
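For the regression case, here is a minimal sketch of scaling a continuous target and mapping it back afterwards. The `y_reg` values are hypothetical, and the reshape is needed because `StandardScaler` expects a 2D array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical continuous target with a large range
y_reg = np.array([48000.0, 61000.0, 83000.0])

sc_y = StandardScaler()
# Reshape the 1D target into a single column, scale it, then flatten again
y_scaled = sc_y.fit_transform(y_reg.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values (or scaled predictions) back to original units
y_back = sc_y.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(np.allclose(y_back, y_reg))  # True
```

Keeping the fitted `sc_y` around matters: without it you can’t turn scaled predictions back into real-world values.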