### Getting Help

In [None]:
You saw the abs function in the previous tutorial, but what if you've forgotten what it does?

The help() function is possibly the most important Python function you can learn. If you can remember how to use help(),
you hold the key to understanding most other functions.

help(print)


### <font color="purple">Docstrings</font>

In [None]:
Python isn't smart enough to read my code and turn it into a nice English description (for functions I write myself). 
However, when I write a function, I can provide a description in what's called the docstring.

In [None]:
def least_difference(a, b, c):
    """Return the smallest difference between any two numbers
    among a, b and c.
    
    >>> least_difference(1, 5, -5)
    4
    """
    diff1 = abs(a - b)
    diff2 = abs(b - c)
    diff3 = abs(a - c)
    return min(diff1, diff2, diff3)

The docstring is a triple-quoted string (which may span multiple lines) that comes immediately after the header of a 
function. When we call help() on a function, it shows the docstring.

In [2]:
help(least_difference)

Help on function least_difference in module __main__:

least_difference(a, b, c)
    Return the smallest difference between any two numbers
    among a, b and c.
    
    >>> least_difference(1, 5, -5)
    4



### <font color="Green">Return </font>

In [None]:
return is another keyword uniquely associated with functions. When Python encounters a return statement, 
it exits the function immediately, and passes the value on the right hand side to the calling context.

In [None]:
What would happen if we didn't include the return keyword in our function?
Python allows us to define such functions. The result of calling them is the special value None. 
(This is similar to the concept of "null" in other languages.)

Without a return statement, least_difference is completely pointless, but a function with side effects may do something 
useful without returning anything. We've already seen two examples of this: print() and help() don't return anything.
We only call them for their side effects (putting some text on the screen). Other examples of useful side effects 
include writing to a file, or modifying an input.



In [3]:
mystery = print()
print(mystery)


None


### <font color="red">Functions</font>

In [12]:
def greet(who="Lord Ashish"):
    print("Greet",who)
greet()
greet("World")
greet(who="EveryBody")

Greet Lord Ashish
Greet World
Greet EveryBody


In [None]:
Functions Applied to Functions
Here's something that's powerful, though it can feel very abstract at first. You can supply functions as arguments to
other functions. An example may make this clearer:

def mult_by_five(x):
    return 5 * x

def call(fn, arg):
    """Call fn on arg"""
    return fn(arg)

def squared_call(fn, arg):
    """Call fn on the result of calling fn on arg"""
    return fn(fn(arg))

print(
    call(mult_by_five, 1),
    squared_call(mult_by_five, 1), 
    sep='\n', # '\n' is the newline character - it starts a new line
)

In [None]:
Functions that operate on other functions are called "higher-order functions." You probably won't write your own for a
little while. But there are higher-order functions built into Python that you might find useful to call.
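One built-in higher-order function is max(), which accepts an optional key function. The sketch below is only an illustration: mod_5 and the list of numbers are made up for this example and are not part of the notes above.

```python
def mod_5(x):
    """Return the remainder of x after dividing by 5."""
    return x % 5

numbers = [100, 51, 14]  # illustrative values

# Plain max() compares the numbers themselves.
largest = max(numbers)

# With key=mod_5, max() returns the element whose mod_5 value is largest.
largest_mod_5 = max(numbers, key=mod_5)

print(largest, largest_mod_5, sep='\n')
```

Here max() calls mod_5 on each element and returns the original element (14), not the remainder.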

In [None]:
As you've seen, ndigits=-1 rounds to the nearest 10, ndigits=-2 rounds to the nearest 100 and so on. Where might this be
useful? Suppose we're dealing with large numbers:

The area of Finland is 338,424 km²
The area of Greenland is 2,166,086 km²

We probably don't care whether it's really 338,424, or 338,425, or 338,177. All those digits of accuracy are just 
distracting. We can chop them off by calling round() with ndigits=-3:

The area of Finland is 338,000 km²
The area of Greenland is 2,166,000 km²
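A minimal sketch of how the rounded lines above could be produced, using the two area figures quoted in the text. The variable names are chosen here for illustration.

```python
# Areas in km², taken from the text above
areas = {"Finland": 338424, "Greenland": 2166086}

rounded_areas = {}
for name, area in areas.items():
    # ndigits=-3 rounds to the nearest 1,000
    rounded_areas[name] = round(area, ndigits=-3)
    # {:,} inserts thousands separators
    print("The area of {} is {:,} km²".format(name, rounded_areas[name]))
```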

### Booleans

In [None]:
Python has a type of variable called bool. It has two possible values: True and False.

In [None]:
Remember to use == instead of = when making comparisons. If you write n == 2 you are asking about the value of n. 
When you write n = 2 you are changing the value of n.
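The difference is easy to see in a short sketch (the variable names are illustrative only):

```python
n = 3

# Comparison: asks whether n equals 2; n itself is unchanged.
is_two = (n == 2)
print(is_two, n)

# Assignment: n now refers to a new value.
n = 2
print(n == 2, n)
```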

In [None]:
Boolean conversion
We've seen int(), which turns things into ints, and float(), which turns things into floats, so you might not be surprised
to hear that Python has a bool() function which turns things into bools.

print(bool(1)) # all numbers are treated as true, except 0
print(bool(0))
print(bool("asf")) # all strings are treated as true, except the empty string ""
print(bool(""))
# Generally empty sequences (strings, lists, and other types we've yet to see, like tuples)
# are "falsey" and the rest are "truthy"

In [14]:
bool()

False

In [17]:
total_candies=1
print("Splitting", total_candies, "candy" if total_candies == 1 else "candies")

Splitting 1 candy


In [None]:
def wants_plain_hotdog(ketchup, mustard, onion):
    """Return whether the customer wants a plain hot dog with no toppings.
    """
    return not ketchup and not mustard and not onion


In [None]:
def exactly_one_sauce(ketchup, mustard, onion):
    """Return whether the customer wants either ketchup or mustard, but not both.
    (You may be familiar with this operation under the name "exclusive or")
    """
    return (ketchup and not mustard) or (mustard and not ketchup)

In [None]:
def prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday):
    # Don't change this code. Our goal is just to find the bug, not fix it!
    return have_umbrella or (rain_level < 5 and have_hood) or not (rain_level > 0 and is_workday)

# Change the values of these inputs so they represent a case where prepared_for_weather
# returns the wrong answer.
have_umbrella = False
rain_level = 0.0
have_hood = False
is_workday = False

# Check what the function returns given the current values of the variables above
actual = prepared_for_weather(have_umbrella, rain_level, have_hood, is_workday)
print(actual)


In [None]:
def exactly_one_topping(ketchup, mustard, onion):
    """Return whether the customer wants exactly one of the three available toppings
    on their hot dog.
    """
    return (ketchup + mustard + onion)==1

### Objects

In [None]:
In object-oriented programming languages like Python, an object is an entity that contains data along with associated
metadata and/or functionality. In Python everything is an object, which means every entity has some metadata 
(called attributes) and associated functionality (called methods). These attributes and methods are accessed via the dot
syntax.

In [None]:
The things an object carries around can also include functions. A function attached to an object is called a method.
(Non-function things attached to an object, such as imag, are called attributes).

For example, numbers have a method called bit_length. Again, we access it using dot syntax:

In [6]:
x=18
x.bit_length()

5

In [9]:
planets=['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planets.index("Earth")

2

In [10]:
# Is earth in planets
"Earth" in planets

True

In [11]:
"Pluto" in planets

False

In [14]:
d = [1, 2, 3][1:]
d

[2, 3]

In [16]:
int(3.7)

3

### Loops

In [20]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
for planet in planets:
    print(planet, end=' ') # print all on same line

Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune 

In [None]:
In addition to lists, we can iterate over the elements of a tuple.
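A small sketch of a for loop over a tuple. The values are arbitrary and chosen only for illustration.

```python
multiplicands = (2, 2, 2, 3, 3, 5)  # a tuple, not a list

product = 1
for mult in multiplicands:
    product = product * mult

print(product)
```

The loop body is identical to what it would be for a list; only the container type differs.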

In [21]:
s = 'steganograpHy is the practicE of conceaLing a file, message, image, or video within another fiLe, message, image, Or video.'
msg = ''
# print all the uppercase letters in s, one at a time
for char in s:
    if char.isupper():
        print(char, end='')   

HELLO

In [22]:
loud_short_planets = [planet.upper() + '!' for planet in planets if len(planet) < 6]
loud_short_planets

['VENUS!', 'EARTH!', 'MARS!']

In [None]:
(Continuing the SQL analogy, you could think of these three lines as SELECT, FROM, and WHERE)
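The "three lines" are easier to see when the comprehension from the earlier cell is laid out with one clause per line. This is the same expression, only reformatted:

```python
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']

loud_short_planets = [
    planet.upper() + '!'     # SELECT: what to produce for each item
    for planet in planets    # FROM: where the items come from
    if len(planet) < 6       # WHERE: which items to keep
]
print(loud_short_planets)
```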

In [None]:
def count_negatives(nums):
    return len([num for num in nums if num < 0])

In [23]:
def count_negatives(nums):
    # Reminder: in the "booleans and conditionals" exercises, we learned about a quirk of 
    # Python where it calculates something like True + True + False + True to be equal to 3.
    return sum([num < 0 for num in nums])

In [40]:
def has_lucky_number(nums):
    return any([num % 7 == 0 for num in nums])

In [41]:
has_lucky_number([10,14])

True

In [42]:
2>4

False

In [None]:
def elementwise_greater_than(L, thresh):
    return [ele > thresh for ele in L]

In [45]:
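# `return` is only valid inside a function, so the check is wrapped in one
# (the name menu_is_boring is chosen here for illustration).
def menu_is_boring(meals):
    """Return True if the same meal appears on two consecutive days."""
    for i in range(len(meals) - 1):
        if meals[i] == meals[i + 1]:
            return True
    return False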

### <font color="red">Strings and Dictionaries</font>

In [None]:
Strings are sequences
1. Strings can be thought of as sequences of characters. Almost everything we've seen that we can do to a list,
we can also do to a string.
2. But a major way in which they differ from lists is that they are immutable. We can't modify them.
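Both points can be checked with a short sketch. The string 'Pluto' is just an example value.

```python
planet = 'Pluto'

# 1. List-like operations work on strings.
first = planet[0]                           # indexing
last_three = planet[-3:]                    # slicing
length = len(planet)                        # length
shouted = [char + '!' for char in planet]   # iteration

# 2. Strings are immutable: item assignment raises a TypeError.
try:
    planet[0] = 'B'
    modified = True
except TypeError as err:
    modified = False
    print("Cannot modify a string:", err)
```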

In [47]:
'Pluto's a planet!'

SyntaxError: invalid syntax (<ipython-input-47-a43631749f52>, line 1)

In [64]:
print('pluto\'s  a planet')

pluto's  a planet


In [63]:
print('What\'s up?',"That's \"cool\"","Look, a mountain: /\\")

What's up? That's "cool" Look, a mountain: /\


In [62]:
print("Look, a mountain: /\\",  "\n1\n2 3")

Look, a mountain: /\ 
1
2 3


In [58]:
print("1 \n 3")

1 
 3


In [70]:
# startswith and endswith
claim="Thousand Dollar $"
claim.startswith("Th"),claim.endswith("$")

(True, True)

In [73]:
a="Hello How are You"
a.split()

['Hello', 'How', 'are', 'You']

In [None]:
Occasionally you'll want to split on something other than whitespace:

In [75]:
datestr = '1956-01-31'
datestr.split('-')

['1956', '01', '31']

In [None]:
str.join() takes us in the other direction, sewing a list of strings up into one long string,
using the string it was called on as a separator.

In [77]:
' 👏 '.join([word.upper() for word in ['clap','clap','clap']])

'CLAP 👏 CLAP 👏 CLAP'

In [81]:
planet="Pluto"
position=9  # an int, so the remark below about not needing str() applies
last="Happy"
"{}, you'll always be the {}th planet to me.{}".format(planet, position,last)

"Pluto, you'll always be the 9th planet to me.Happy"

In [None]:
So much cleaner! We call .format() on a "format string", where the Python values we want to insert 
are represented with {} placeholders.

Notice how we didn't even have to call str() to convert position from an int. format() takes care of that for us.

If that was all that format() did, it would still be incredibly useful. But as it turns out, it can do a lot more.
Here's just a taste:

In [91]:
pluto_mass = 1.303 * 10**22
earth_mass = 5.9722 * 10**24
population = 52910390
#         2 significant digits   3 decimal places, format as percent     separate with commas
"{} weighs about {:.2} kilograms ({:.3%} of Earth's mass). It is home to {:,} Plutonians.".format(
    planet, pluto_mass, pluto_mass / earth_mass, population,
)

"Pluto weighs about 1.3e+22 kilograms (0.218% of Earth's mass). It is home to 52,910,390 Plutonians."

In [92]:
# Referring to format() arguments by index, starting from 0
s = """Pluto's a {0}.
No, it's a {1}.
{0}!
{1}!""".format('planet', 'dwarf planet')
print(s)

Pluto's a planet.
No, it's a dwarf planet.
planet!
dwarf planet!


In [94]:
" {0} shells on the {0} {1}".format("sea","shore") 

' sea shells on the sea shore'

In [95]:
#Python has dictionary comprehensions with a syntax similar to the list comprehensions we saw in the previous tutorial.
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planet_to_initial = {planet: planet[0] for planet in planets}
planet_to_initial

{'Mercury': 'M',
 'Venus': 'V',
 'Earth': 'E',
 'Mars': 'M',
 'Jupiter': 'J',
 'Saturn': 'S',
 'Uranus': 'U',
 'Neptune': 'N'}

In [97]:
# The in operator tells us whether something is a key in the dictionary
"Venus" in planet_to_initial,"Hello" in planet_to_initial

(True, False)

In [102]:
for i,j in planet_to_initial.items():
    print(i,j)

Mercury M
Venus V
Earth E
Mars M
Jupiter J
Saturn S
Uranus U
Neptune N


In [103]:
e = '\n'
len(e)

1

In [104]:
c = 'it\'s ok'
len(c)

7

In [105]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  

In [123]:
ex="10000e"
ex.isdigit()

False

In [None]:
def is_valid_zip(zip_str):
    return len(zip_str) == 5 and zip_str.isdigit()

In [169]:
doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
arr=[]
keyword="car"
def find(doc_list,keyword):
    arr=[]
    for i in range(len(doc_list)):
        l=doc_list[i].lower()
        l=l.split()
        if keyword.lower() in l:
            arr.append(i)
    return arr
find(doc_list,"Car")

[1]

In [176]:
arr=[]
for i in range(len(doc_list)):
    l=doc_list[i].lower()
    l=l.split()
    r=[token.strip('.,').lower() for token in l]
    print(l,r)


['the', 'learn', 'python', 'challenge', 'casino.'] ['the', 'learn', 'python', 'challenge', 'casino']
['they', 'bought', 'a', 'car'] ['they', 'bought', 'a', 'car']
['casinoville'] ['casinoville']


In [192]:
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword. 
    Returns list of the index values into the original list for all documents 
    containing the keyword.

    Example:
    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    indices = [] 
    # Iterate through the indices (i) and elements (doc) of documents
    for i, doc in enumerate(doc_list):
        # Split the string doc into a list of words (according to whitespace)
        tokens = doc.split()
        # Make a transformed list where we 'normalize' each word to facilitate matching.
        # Periods and commas are removed from the end of each word, and it's set to all lowercase.
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # Is there a match? If so, update the list of matching indices.
        if keyword.lower() in normalized:
            indices.append(i)
    return indices
        
            
            
    

In [193]:
a=[1,2,3]
b=[2,3,4]
dict(zip(a,b))

{1: 2, 2: 3, 3: 4}

In [194]:
def multi_word_search(documents, keywords):
    keyword_to_indices = {}
    for keyword in keywords:
        keyword_to_indices[keyword] = word_search(documents, keyword)
    return keyword_to_indices

###  Libraries

In [None]:
math is a module. A module is just a collection of variables (a namespace, if you like) defined by someone else.
We can see all the names in math using the built-in function dir().
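A minimal sketch of inspecting the module with dir(). It uses a plain `import math` rather than a star import so that the names stay inside the math namespace.

```python
import math

names = dir(math)          # a list of strings, one per name defined in the module
print(type(math))          # the module object itself
print('pi' in names, 'log' in names)

# Names are then accessed with dot syntax.
print(math.log(16, 2), math.pi)
```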

In [3]:
from math import *
log(16,2),pi

(4.0, 3.141592653589793)

In [None]:
Three tools for understanding strange objects
Calling a numpy function, as in the cell below, gives us an "array".
We've never seen anything like this before (not in this course, anyway). But don't panic: we have three familiar builtin functions to help us here.

1: type() (what is this thing?)

In [7]:
import numpy
# Note: high is exclusive, so this draws integers from 1 to 5
rolls = numpy.random.randint(low=1, high=6, size=10)
rolls

array([4, 4, 5, 2, 1, 1, 4, 2, 5, 2])

In [8]:
type(rolls)

numpy.ndarray

In [None]:
2: dir() (what can I do with it?)

In [10]:
dir(rolls)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__

In [13]:
rolls.mean(),rolls.tolist()

(3.0, [4, 4, 5, 2, 1, 1, 4, 2, 5, 2])

In [None]:
3: help() (tell me more)

In [18]:
help(rolls.ravel)

Help on built-in function ravel:

ravel(...) method of numpy.ndarray instance
    a.ravel([order])
    
    Return a flattened array.
    
    Refer to `numpy.ravel` for full documentation.
    
    See Also
    --------
    numpy.ravel : equivalent function
    
    ndarray.flat : a flat iterator on the array.



In [23]:
a=numpy.array([[1,2,3],[4,3,4],[6,7,6]])
a.ravel()

array([1, 2, 3, 4, 3, 4, 6, 7, 6])

In [24]:
rolls<=3

array([False, False, False,  True,  True,  True, False,  True, False,
        True])

In [25]:
xlist = [[1,2,3],[2,4,6],]
# Create a 2-dimensional array
x = numpy.asarray(xlist)
print("xlist = {}\nx =\n{}".format(xlist, x))


xlist = [[1, 2, 3], [2, 4, 6]]
x =
[[1 2 3]
 [2 4 6]]


### Operator Overloading

In [None]:
This turns out to be directly related to operator overloading.

When Python programmers want to define how operators behave on their types, they do so by implementing methods 
with special names beginning and ending with 2 underscores such as __lt__, __setattr__, or __contains__. Generally, names that follow this double-underscore format have a special meaning to Python.

So, for example, the expression x in [1, 2, 3] is actually calling the list method __contains__ behind-the-scenes. 
It's equivalent to (the much uglier) [1, 2, 3].__contains__(x).

If you're curious to learn more, you can check out Python's official documentation, which describes many,
many more of these special "underscores" methods.

We won't be defining our own types in these lessons (if only there was time!), but I hope you'll get to experience
the joys of defining your own wonderful, weird types later down the road.
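For the curious, here is a small preview of the idea. The first part checks the equivalence described above; the EvenNumbers class is a hypothetical type invented for this sketch, not something from the lessons.

```python
x = 2

# The in operator and the explicit method call give the same answer.
via_operator = x in [1, 2, 3]
via_method = [1, 2, 3].__contains__(x)
print(via_operator, via_method)

class EvenNumbers:
    """Hypothetical type that 'contains' every even integer."""

    def __contains__(self, item):
        # Called by Python whenever `item in some_even_numbers` is evaluated.
        return isinstance(item, int) and item % 2 == 0

evens = EvenNumbers()
print(4 in evens, 7 in evens)
```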

In [None]:
def blackjack_hand_greater_than(hand_1, hand_2):
    """
    Return True if hand_1 beats hand_2, and False otherwise.
    
    In order for hand_1 to beat hand_2 the following must be true:
    - The total of hand_1 must not exceed 21
    - The total of hand_1 must exceed the total of hand_2 OR hand_2's total must exceed 21
    
    Hands are represented as a list of cards. Each card is represented by a string.
    
    When adding up a hand's total, cards with numbers count for that many points. Face
    cards ('J', 'Q', and 'K') are worth 10 points. 'A' can count for 1 or 11.
    
    When determining a hand's total, you should try to count aces in the way that 
    maximizes the hand's total without going over 21. e.g. the total of ['A', 'A', '9'] is 21,
    the total of ['A', 'A', '9', '3'] is 14.
    
    Examples:
    >>> blackjack_hand_greater_than(['K'], ['3', '4'])
    True
    >>> blackjack_hand_greater_than(['K'], ['10'])
    False
    >>> blackjack_hand_greater_than(['K', 'K', '2'], ['3'])
    False
    """
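    # hand_total is the helper defined in the next cell; it handles face cards and aces
    total_1 = hand_total(hand_1)
    total_2 = hand_total(hand_2)
    return total_1 <= 21 and (total_1 > total_2 or total_2 > 21)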

In [None]:
def hand_total(hand):
    """Helper function to calculate the total points of a blackjack hand.
    """
    total = 0
    # Count the number of aces and deal with how to apply them at the end.
    aces = 0
    for card in hand:
        if card in ['J', 'Q', 'K']:
            total += 10
        elif card == 'A':
            aces += 1
        else:
            # Convert number cards (e.g. '7') to ints
            total += int(card)
    # At this point, total is the sum of this hand's cards *not counting aces*.

    # Add aces, counting them as 1 for now. This is the smallest total we can make from this hand
    total += aces
    # "Upgrade" aces from 1 to 11 as long as it helps us get closer to 21
    # without busting
    while total + 10 <= 21 and aces > 0:
        # Upgrade an ace from 1 to 11
        total += 10
        aces -= 1
    return total

In [None]:
We use data to decide how to break the houses into two groups, and then again to determine the predicted price in
each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the 
model is called the training data.

In [None]:
What is Model Validation
You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure
of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and 
compare those predictions to the target values in the training data. You'll see the problem with this approach 
and how to solve it in a moment, but let's think about how we'd do this first.

You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home
values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted
and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error 
(also called MAE). Let's break down this metric starting with the last word, error.

The prediction error for each house is:

error = actual − predicted

So, if a house cost $150,000 and you predicted it would cost $100,000, the error is $50,000.

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. 
We then take the average of those absolute errors. This is our measure of model quality. In plain English,
it can be said as

On average, our predictions are off by about X.
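The arithmetic behind that sentence can be sketched in plain Python. The first house uses the $150,000 / $100,000 example from the text; the other three prices are made up for illustration, and no model is involved here.

```python
# Actual and predicted prices for four houses (only the first pair comes from the text)
actual = [150000, 200000, 120000, 300000]
predicted = [100000, 210000, 125000, 285000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so over- and under-predictions don't cancel out
absolute_errors = [abs(e) for e in errors]

# MAE is the mean of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)
print("On average, our predictions are off by about", mae)
```

With these numbers the absolute errors are 50,000, 10,000, 5,000 and 15,000, so the MAE is 20,000.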

To calculate MAE, we first need a model.


In [None]:
The Problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building
the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price.

However, in the sample of data you used to build the model, all homes with green doors were very expensive. 
The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict
high prices for homes with green doors.

Since this pattern was derived from the training data, the model will appear accurate in the training data.

But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used 
to build the model. The most straightforward way to do this is to exclude some data from the model-building process,
and then use those to test the model's accuracy on data it hasn't seen before. This data is called validation data.
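The idea of holding data out can be sketched without any libraries. The helper split_train_valid and its parameters are hypothetical names for this illustration; in practice scikit-learn's train_test_split is the usual tool for this.

```python
import random

def split_train_valid(rows, valid_fraction=0.25, seed=0):
    """Shuffle rows and hold out a fraction of them as validation data."""
    shuffled = list(rows)                     # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)     # fixed seed for a reproducible split
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]

houses = list(range(20))   # stand-ins for 20 rows of housing data
train_rows, valid_rows = split_train_valid(houses)

# The model would be fit on train_rows only and scored on valid_rows.
print(len(train_rows), len(valid_rows))
```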

### Underfitting and Overfitting

### Handling Missing values

Approach 1

In [None]:
Score from Approach 1 (Drop Columns with Missing Values)
Since we are working with both training and validation sets, we are careful to drop the same columns in both DataFrames.

In [None]:
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

Approach 2

In [None]:
Score from Approach 2 (Imputation)
Next, we use SimpleImputer to replace missing values with the mean value along each column.

Although it's simple, filling in the mean value generally performs quite well (but this varies by dataset). 
While statisticians have experimented with more complex ways to determine imputed values (such as regression imputation,
for instance), the complex strategies typically give no additional benefit once you plug the results into sophisticated
machine learning models.

In [None]:
import pandas as pd  # needed for pd.DataFrame below, if not already imported
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

Approach 3

In [None]:
Score from Approach 3 (An Extension to Imputation)
Next, we impute the missing values, while also keeping track of which values were imputed.

In [None]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

In [None]:
Given that there are so few missing values in the dataset, we'd expect imputation to perform better than dropping columns 
entirely. However, we see that dropping columns performs slightly better! While this can probably partially be attributed
to noise in the dataset, another potential explanation is that the imputation method is not a great match to this dataset.
That is, maybe instead of filling in the mean value, it makes more sense to set every missing value to a value of 0,
to fill in the most frequently encountered value, or to use some other method. For instance, consider the GarageYrBlt 
column (which indicates the year that the garage was built). It's likely that in some cases, a missing value could 
indicate a house that does not have a garage. Does it make more sense to fill in the median value along each column in 
this case? Or could we get better results by filling in the minimum value along each column? It's not quite clear what's
best in this case, but perhaps we can rule out some options immediately - for instance, setting missing values in this 
column to 0 is likely to yield horrible results!

### Categorical Variables

In [None]:
Three Approaches
1) Drop Categorical Variables

2) Ordinal Encoding
Ordinal encoding assigns each unique value to a different integer.

This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).

This assumption makes sense in this example, because there is an indisputable ranking to the categories. 
Not all categorical variables have a clear ordering in the values, but we refer to those that do as ordinal variables. 
For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal 
variables.

3) One-Hot Encoding
One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data. 
To understand this, we'll work through an example.

In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". 
The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the 
original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was
"Yellow", we put a 1 in the "Yellow" column, and so on.

In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect
this approach to work particularly well if there is no clear ordering in the categorical data
(e.g., "Red" is neither more nor less than "Yellow"). We refer to categorical variables without an intrinsic ranking as
nominal variables.
One-hot encoding generally does not perform well if the categorical variable takes on a large number of values
(i.e., you generally won't use it for variables taking more than 15 different values).

In [None]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])

In [2]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

In [None]:
Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be
used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the 
training data, and setting sparse=False ensures that the encoded columns are returned as a numpy array
(instead of a sparse matrix). Note that scikit-learn 1.2 renamed this parameter to sparse_output, and later releases no longer accept sparse.
To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode 
the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names
with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

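# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names are strings (recent scikit-learn versions reject mixed int/str names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)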

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

In [None]:
The output above shows, for each column with categorical data, the number of unique values in the column. For instance,
the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a
paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable.
For instance, the 'Street' variable has cardinality 2.

In [None]:
Next, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset,
you'll only create a one-hot encoding for columns with cardinality less than 10.
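
A minimal sketch of how the low-cardinality columns might be picked out, assuming a toy DataFrame with made-up column values. The names low_cardinality_cols and high_cardinality_cols are illustrative, and non-numeric columns are treated as categorical here.

```python
import pandas as pd

# Toy training data (hypothetical values, for illustration only)
X_train = pd.DataFrame({
    "Street": ["Grvl", "Pave", "Pave", "Grvl"] * 3,
    "Neighborhood": [f"N{i}" for i in range(12)],  # 12 distinct values
    "LotArea": list(range(12)),
})

# Treat every non-numeric column as categorical
object_cols = [col for col in X_train.columns
               if not pd.api.types.is_numeric_dtype(X_train[col])]

# Cardinality = number of unique entries per categorical column
cardinality = X_train[object_cols].nunique()
print(cardinality)

# Columns that will be one-hot encoded, and columns that will be dropped
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
high_cardinality_cols = sorted(set(object_cols) - set(low_cardinality_cols))
print(low_cardinality_cols, high_cardinality_cols)
```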

In [None]:
# Apply one-hot encoder to each low-cardinality categorical column
# (on scikit-learn 1.2 and later, pass sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

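# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names are strings (recent scikit-learn versions reject mixed int/str names)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)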


### Pipelines

In [None]:
Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles 
preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually
keep track of your training and validation data at each step.

Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.

Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at
scale. We won't go into the many related concerns here, but pipelines can help.


In [None]:
We construct the full pipeline in three steps.

Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to 
bundle together different preprocessing steps. The code below:

imputes missing values in numerical data, and
imputes missing values and applies a one-hot encoding to categorical data.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
# Run this cell before the preprocessing cell above, which uses both lists.
# Select categorical columns with relatively low cardinality
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]
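
Only the first of the three steps is spelled out here. A plausible completion, assuming the remaining steps are defining a model and then bundling it with the preprocessor, is sketched below on a tiny made-up dataset. The column names, the values, and the choice of RandomForestRegressor are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values); NaN marks a missing entry
X_train = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0, 3.0, 2.0],
    "Type": pd.Series(["h", "u", "h", np.nan, "u", "h"], dtype="object"),
})
y_train = pd.Series([300.0, 250.0, 320.0, 400.0, 260.0, 310.0])
X_valid = pd.DataFrame({
    "Rooms": [3.0, np.nan],
    "Type": pd.Series(["t", "h"], dtype="object"),  # "t" is unseen in training
})
y_valid = pd.Series([270.0, 305.0])

numerical_cols = ["Rooms"]
categorical_cols = ["Type"]

# Step 1: preprocessing, mirroring the ColumnTransformer defined earlier
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 2 (assumed): define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Step 3 (assumed): bundle preprocessing and model, then fit and evaluate
my_pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])
my_pipeline.fit(X_train, y_train)      # preprocessing is fitted on training data only
preds = my_pipeline.predict(X_valid)   # the same preprocessing is applied automatically
mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)
```

Because the raw X_valid goes straight into predict(), there is no separate imputed or encoded copy of the validation data to keep track of.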

In [None]:
Gradient Boosting
Gradient boosting is a method that goes through cycles to iteratively add models into an ensemble.

It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)

Then, we start the cycle:

First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, 
we add the predictions from all models in the ensemble.
These predictions are used to calculate a loss function (like mean squared error, for instance).
Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model 
parameters so that adding this new model to the ensemble will reduce the loss. (Side note: The "gradient" in "gradient boosting" refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.)
Finally, we add the new model to the ensemble, and ...
... repeat!
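
The cycle above can be sketched by hand for the squared-error loss, where the negative gradient of the loss with respect to the current prediction is simply the residual. This is an illustrative sketch on synthetic data, not how XGBoost is implemented. The shallow DecisionTreeRegressor as the component model and the ten rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Initialise the ensemble with a naive model: always predict the mean
baseline = y.mean()
ensemble = []

def predict(X):
    """Add up the predictions of the baseline and every model in the ensemble."""
    return baseline + sum((tree.predict(X) for tree in ensemble), np.zeros(len(X)))

losses = []
for _ in range(10):
    current = predict(X)                        # 1. predict with the current ensemble
    losses.append(np.mean((y - current) ** 2))  # 2. compute the loss (MSE)
    residuals = y - current                     # 3. negative gradient of 0.5 * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    ensemble.append(tree)                       # 4. add the new model and repeat
losses.append(np.mean((y - predict(X)) ** 2))

print([round(loss, 4) for loss in losses])
```

Each new tree is fitted to what the current ensemble still gets wrong, so the training loss does not increase from one round to the next.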


In [None]:
Parameter Tuning
XGBoost has a few parameters that can dramatically affect accuracy and training speed. The first parameters you should
understand are:

n_estimators
n_estimators specifies how many times to go through the modeling cycle described above. It is equal to the number of 
models that we include in the ensemble.

Too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data.
Too high a value causes overfitting, which causes accurate predictions on training data, but inaccurate predictions on 
test data (which is what we care about).
Typical values range from 100-1000, though this depends a lot on the learning_rate parameter discussed below.

Here is the code to set the number of models in the ensemble:
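
A minimal sketch of setting the ensemble size, using scikit-learn's GradientBoostingRegressor as a stand-in so the snippet runs without XGBoost installed (both estimators accept an n_estimators argument). The synthetic data are an assumption. With XGBoost the corresponding constructor call is XGBRegressor(n_estimators=500), the form used in the cells that follow.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic training data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 4))
y_train = 2.0 * X_train[:, 0] + np.sin(X_train[:, 1]) + rng.normal(scale=0.2, size=300)

# n_estimators = number of boosting rounds = number of models in the ensemble
my_model = GradientBoostingRegressor(n_estimators=500, random_state=0)
my_model.fit(X_train, y_train)

print(my_model.n_estimators_)  # number of models actually fitted
```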

In [None]:
early_stopping_rounds
early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model
to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators. It's 
smart to set a high value for n_estimators and then use early_stopping_rounds to find the optimal time to stop iterating.

Since random chance sometimes causes a single round where validation scores don't improve, you need to specify a number 
for how many rounds of straight deterioration to allow before stopping. Setting early_stopping_rounds=5 is a reasonable
choice. In this case, we stop after 5 straight rounds of deteriorating validation scores.

When using early_stopping_rounds, you also need to set aside some data for calculating the validation scores -
this is done by setting the eval_set parameter.

We can modify the example above to include early stopping:

In [None]:
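from xgboost import XGBRegressor

# Note: newer XGBoost releases expect early_stopping_rounds in the XGBRegressor(...)
# constructor rather than in fit(); the call below follows the older API.
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)],
             verbose=False)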
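
To show the stopping rule itself, the sketch below replays the rounds of an already-fitted scikit-learn GradientBoostingRegressor (a stand-in for XGBRegressor) and applies a patience of 5 to the validation scores. The synthetic data and variable names are assumptions. Real early stopping halts training itself, whereas this version only simulates the decision after the fact.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=400)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Set a high ceiling for the number of rounds, as suggested above
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

patience = 5  # plays the role of early_stopping_rounds
best_score, best_round, rounds_without_improvement = np.inf, 0, 0

# staged_predict yields the validation predictions after 1, 2, 3, ... rounds
for round_number, preds in enumerate(model.staged_predict(X_valid), start=1):
    score = mean_absolute_error(y_valid, preds)
    if score < best_score:
        best_score, best_round, rounds_without_improvement = score, round_number, 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement >= patience:
            break  # no improvement over the best score for `patience` rounds

print("Best round:", best_round, "validation MAE:", best_score)
```

The value of best_round is the kind of number you would reuse as n_estimators when refitting on all of the data.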

In [None]:
If you later want to fit a model with all of your data, set n_estimators to whatever value you found to be optimal when
run with early stopping.

learning_rate
Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the 
predictions from each model by a small number (known as the learning rate) before adding them in.

This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without 
overfitting. If we use early stopping, the appropriate number of trees will be determined automatically.

In general, a small learning rate and large number of estimators will yield more accurate XGBoost models, though it will 
also take the model longer to train since it does more iterations through the cycle. By default, older releases of 
XGBRegressor set learning_rate=0.1; current releases fall back to the XGBoost library default of 0.3.

Modifying the example above to change the learning rate yields the following code:



In [None]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)
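
The trade-off between learning rate and number of estimators can be checked on synthetic data. The sketch below again uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBRegressor. The data and the helper train_mse are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.2, size=300)

def train_mse(learning_rate, n_estimators):
    """Fit a boosted ensemble and return its mean squared error on the training data."""
    model = GradientBoostingRegressor(
        n_estimators=n_estimators, learning_rate=learning_rate, random_state=0
    )
    model.fit(X, y)
    return mean_squared_error(y, model.predict(X))

mse_large_step = train_mse(learning_rate=1.0, n_estimators=20)
mse_small_step = train_mse(learning_rate=0.05, n_estimators=20)
mse_small_step_more_trees = train_mse(learning_rate=0.05, n_estimators=400)

print(mse_large_step, mse_small_step, mse_small_step_more_trees)
```

With the same 20 trees, the smaller learning rate leaves a larger training error because each tree contributes less. Raising the number of trees closes that gap, which is why a small learning rate is usually paired with a large n_estimators.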

In [None]:
n_jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster.
It's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets, this won't help.


The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction.
But, it's useful in large datasets where you would otherwise spend a long time waiting during the fit command.

In [None]:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)

In [None]:
The verbose parameter of fit() is indeed confusing and poorly documented. With False (i.e. 0), it does not print anything.
With a positive integer n, it prints the evaluation score every n iterations. So for verbose=100 it will tell you the
score every 100 iterations.

Setting verbose=True is the same as setting it to 1, so it prints the score at every iteration. That is a lot of output!

### Data Leakage

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Since there is no preprocessing, we don't need a pipeline (used anyway as best practice!)
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())